
What $\lambda$ does: Lasso, Ridge, and the Elastic Net

Constantin Lisson

Ridge, Lasso, and the Elastic Net all fight overfitting the same way: they add a penalty on the size of the regression coefficients. The knob that sets how hard they push is $\lambda$. It decides how "expensive" a large model is, where large means coefficients that are numerous, large in magnitude, or both. This page is a way to see what turning that knob does.

A regularized fit can always be re-read as a constrained one: minimize the least-squares loss, but keep the coefficients inside a budget. The shape of that budget is decided by the penalty, and its size by $\lambda$. Where the budget meets the loss is the fitted model.
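In symbols, the constrained reading is (with $P$ the penalty defined below and $t$ the budget, which shrinks as $\lambda$ grows; the penalized and constrained forms give the same estimates for matching $\lambda$ and $t$):

$$\hat\beta \;=\; \arg\min_{\beta}\; \|y - X\beta\|_2^2 \quad \text{subject to} \quad P(\beta) \le t$$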

The picture

Two coefficients, $\theta_1$ and $\theta_2$, so that we can visualize this in two dimensions. The blue ellipses are level curves of the least-squares loss; the unconstrained best fit sits at their center (drag the green dot to move it). The shaded shape is the constraint region the penalty allows; $\lambda$ shrinks it, the mixing parameter $\alpha$ changes its shape. The fitted estimate is wherever the smallest loss ellipse first touches that region.

(Interactive figure: loss level curves, constraint region, least-squares optimum, and regularized estimate.)

The same estimate, traced as a path: each coefficient as a function of $\lambda$, with the dashed line marking where $\lambda$ sits now.

(Interactive chart: the paths $\hat{\theta}_1(\lambda)$ and $\hat{\theta}_2(\lambda)$. Controls: sliders for $\lambda$, the penalty strength, and $\alpha$, the mixing parameter between ridge $\ell_2$ and lasso $\ell_1$; readouts for the current estimate $\hat{\theta}_1$, $\hat{\theta}_2$ and the budget $t$.)

Slide $\lambda$ from zero upward and follow each coefficient. At $\lambda = 0$ there is no penalty and the estimate is just the least-squares fit; as $\lambda$ grows the budget tightens and the coefficients are pulled toward zero. With ridge ($\alpha = 0$) the paths bend smoothly toward zero but never quite arrive: every variable is kept, just shrunk. With lasso ($\alpha = 1$) the paths are piecewise-linear and a coefficient can hit exactly zero at a finite $\lambda$: the variable is dropped from the model. That is the difference between shrinkage and selection.
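The same experiment can be run numerically. A minimal sketch with scikit-learn and made-up data (scikit-learn calls the penalty strength `alpha`, and its objectives rescale the squared loss, so the values are not numerically identical to the $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Two predictors, the first much more informative than the second (made-up data).
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-2, 1, 10)

# Refit at each penalty strength and record the coefficient vector.
ridge_path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
lasso_path = np.array([Lasso(alpha=lam).fit(X, y).coef_ for lam in lambdas])

for lam, r, l in zip(lambdas, ridge_path, lasso_path):
    print(f"lambda={lam:7.3f}  ridge={np.round(r, 3)}  lasso={np.round(l, 3)}")

# Ridge entries shrink smoothly but stay non-zero; lasso entries hit exactly zero
# once lambda is large enough, the weaker predictor being dropped first.
```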

The penalties

All three estimators minimize the least-squares loss plus a penalty on the coefficient vector $\beta$:

$$\hat\beta \;=\; \arg\min_{\beta}\; \|y - X\beta\|_2^2 \;+\; \lambda\,P(\beta)$$

The penalty $P$ is what sets the shape of the constraint region:

$$P_{\text{ridge}}(\beta) = \|\beta\|_2^2 = \sum_j \beta_j^2$$ $$P_{\text{lasso}}(\beta) = \|\beta\|_1 = \sum_j |\beta_j|$$

The ridge penalty is a circle (in 2-D); its smooth boundary meets the loss ellipse at a generic point, so coefficients are pulled in but stay non-zero. The lasso penalty is a diamond: its boundary has corners on the axes, and the loss ellipse very often touches it there, setting a coefficient to exactly zero. That corner is where variable selection comes from.
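The corner argument has a one-dimensional counterpart. In the special case of an orthonormal design the coefficients decouple, and each one has a closed-form update from its least-squares value $z$; the sketch below assumes that setting:

```python
import numpy as np

def ridge_update(z, lam):
    """Minimizer of (b - z)**2 + lam * b**2: shrink toward zero, never exactly zero."""
    return z / (1.0 + lam)

def lasso_update(z, lam):
    """Minimizer of (b - z)**2 + lam * abs(b): soft-thresholding, exact zeros."""
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0)

z = 1.2  # the unpenalized least-squares value for this coefficient
for lam in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"lambda={lam:3.1f}  ridge={ridge_update(z, lam):6.3f}  lasso={lasso_update(z, lam):6.3f}")

# Ridge divides z by (1 + lambda): arbitrarily small, but never exactly zero.
# Lasso subtracts lambda/2 from |z| and clips at zero: for lambda >= 2*|z| the
# coefficient is exactly 0 -- the algebraic version of the diamond's corner.
```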

The elastic net mixes the two with a parameter $\alpha \in [0,1]$:

$$P_{\text{enet}}(\beta) = \alpha\,\|\beta\|_1 + (1-\alpha)\,\|\beta\|_2^2$$

At $\alpha = 1$ it is the lasso, at $\alpha = 0$ it is ridge, and in between the constraint region is a rounded diamond, with corners soft enough to share weight across correlated predictors yet sharp enough to still select. In every case, $\lambda$ scales the whole penalty: larger $\lambda$, smaller budget, more shrinkage.
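A quick way to see the mixing at work is scikit-learn's ElasticNet, whose `l1_ratio` plays the role of $\alpha$ here (a sketch with made-up, strongly correlated predictors; scikit-learn's `alpha` is its own rescaled version of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)

# Two strongly correlated predictors with equal true effects (made-up data).
base = rng.normal(size=200)
X = np.column_stack([base + 0.05 * rng.normal(size=200),
                     base + 0.05 * rng.normal(size=200)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio = 1.0 is the pure lasso; smaller values mix in more of the l2 penalty.
for l1_ratio in (0.2, 0.5, 1.0):
    fit = ElasticNet(alpha=0.5, l1_ratio=l1_ratio).fit(X, y)
    print(f"l1_ratio={l1_ratio:3.1f}  coef={np.round(fit.coef_, 3)}")

# With predictors this correlated the pure lasso tends to load one of the pair and
# drop (or nearly drop) the other, while mixing in some l2 spreads the weight more evenly.
```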

Further reading

For ridge and the lasso, and how $\lambda$ is chosen in practice by cross-validation, see An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani (chapter 6), freely available at statlearning.com. That book does not cover the elastic net; for it, turn to the more advanced companion volume, The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (section 3.4), also free at hastie.su.domains/ElemStatLearn.

The inspiration for this interactive explanation was setosa.io's Principal Components Explained Visually.