
What $\lambda$ does: Lasso, Ridge, and the Elastic Net

Constantin Lisson

Ridge, Lasso, and the Elastic Net all fight overfitting the same way: they add a penalty on the size of the regression coefficients. The knob that sets how hard they push is $\lambda$. It decides how "expensive" a large model is, where large means coefficients that are numerous, large in magnitude, or both. This page is a way to see what turning that knob does.

A regularized fit can always be re-read as a constrained one: minimize the least-squares loss, but keep the coefficients inside a budget. The shape of that budget is decided by the penalty, and its size by $\lambda$. Where the budget meets the loss is the fitted model.
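In symbols, the constrained reading is (with $P$ the penalty defined below and $t$ the budget, which shrinks as $\lambda$ grows; the penalized and constrained forms give the same estimates for matching $\lambda$ and $t$):

$$\hat\beta \;=\; \arg\min_{\beta}\; \|y - X\beta\|_2^2 \quad \text{subject to} \quad P(\beta) \le t$$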

The picture

Two coefficients, $\theta_1$ and $\theta_2$, so that we can visualize this in two dimensions. The blue ellipses are level curves of the least-squares loss; the unconstrained best fit sits at their center (drag the green dot to move it). The shaded shape is the constraint region the penalty allows; $\lambda$ shrinks it, the mixing parameter $\alpha$ changes its shape. The fitted estimate is wherever the smallest loss ellipse first touches that region.

(Interactive figure: loss level curves, constraint region, least-squares optimum, and regularized estimate.)

The same estimate, traced as a path: each coefficient as a function of $\lambda$, with the dashed line marking where $\lambda$ sits now.

(Interactive chart: the paths $\hat{\theta}_1(\lambda)$ and $\hat{\theta}_2(\lambda)$. Controls: sliders for $\lambda$, the penalty strength, and $\alpha$, the mixing parameter between ridge $\ell_2$ and lasso $\ell_1$; readouts for the current estimate $\hat{\theta}_1$, $\hat{\theta}_2$ and the budget $t$.)

Slide $\lambda$ from zero upward and follow each coefficient. At $\lambda = 0$ there is no penalty and the estimate is just the least-squares fit; as $\lambda$ grows the budget tightens and the coefficients are pulled toward zero. With ridge ($\alpha = 0$) the paths bend smoothly toward zero but never quite arrive: every variable is kept, just shrunk. With lasso ($\alpha = 1$) the paths are piecewise-linear and a coefficient can hit exactly zero at a finite $\lambda$: the variable is dropped from the model. That is the difference between shrinkage and selection.
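The same experiment can be run numerically. A minimal sketch with scikit-learn and made-up data (scikit-learn calls the penalty strength `alpha`, and its objectives rescale the squared loss, so the values are not numerically identical to the $\lambda$ above):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)

# Two predictors, the first much more informative than the second (made-up data).
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=100)

lambdas = np.logspace(-2, 1, 10)

# Refit at each penalty strength and record the coefficient vector.
ridge_path = np.array([Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas])
lasso_path = np.array([Lasso(alpha=lam).fit(X, y).coef_ for lam in lambdas])

for lam, r, l in zip(lambdas, ridge_path, lasso_path):
    print(f"lambda={lam:7.3f}  ridge={np.round(r, 3)}  lasso={np.round(l, 3)}")

# Ridge entries shrink smoothly but stay non-zero; lasso entries hit exactly zero
# once lambda is large enough, the weaker predictor being dropped first.
```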

The penalties

All three estimators minimize the least-squares loss plus a penalty on the coefficient vector $\beta$:

$$\hat\beta \;=\; \arg\min_{\beta}\; \|y - X\beta\|_2^2 \;+\; \lambda\,P(\beta)$$

The penalty $P$ is what sets the shape of the constraint region:

$$P_{\text{ridge}}(\beta) = \|\beta\|_2^2 = \sum_j \beta_j^2$$ $$P_{\text{lasso}}(\beta) = \|\beta\|_1 = \sum_j |\beta_j|$$

The ridge penalty is a circle (in 2-D); its smooth boundary meets the loss ellipse at a generic point, so coefficients are pulled in but stay non-zero. The lasso penalty is a diamond: its boundary has corners on the axes, and the loss ellipse very often touches it there, setting a coefficient to exactly zero. That corner is where variable selection comes from.
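The corner argument has a one-dimensional counterpart. In the special case of an orthonormal design the coefficients decouple, and each one has a closed-form update from its least-squares value $z$; the sketch below assumes that setting:

```python
import numpy as np

def ridge_update(z, lam):
    """Minimizer of (b - z)**2 + lam * b**2: shrink toward zero, never exactly zero."""
    return z / (1.0 + lam)

def lasso_update(z, lam):
    """Minimizer of (b - z)**2 + lam * abs(b): soft-thresholding, exact zeros."""
    return np.sign(z) * max(abs(z) - lam / 2.0, 0.0)

z = 1.2  # the unpenalized least-squares value for this coefficient
for lam in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"lambda={lam:3.1f}  ridge={ridge_update(z, lam):6.3f}  lasso={lasso_update(z, lam):6.3f}")

# Ridge divides z by (1 + lambda): arbitrarily small, but never exactly zero.
# Lasso subtracts lambda/2 from |z| and clips at zero: for lambda >= 2*|z| the
# coefficient is exactly 0 -- the algebraic version of the diamond's corner.
```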

The elastic net mixes the two with a parameter $\alpha \in [0,1]$:

$$P_{\text{enet}}(\beta) = \alpha\,\|\beta\|_1 + (1-\alpha)\,\|\beta\|_2^2$$

At $\alpha = 1$ it is the lasso, at $\alpha = 0$ it is ridge, and in between the constraint region is a rounded diamond, with corners soft enough to share weight across correlated predictors yet sharp enough to still select. In every case, $\lambda$ scales the whole penalty: larger $\lambda$, smaller budget, more shrinkage.
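A quick way to see the mixing at work is scikit-learn's ElasticNet, whose `l1_ratio` plays the role of $\alpha$ here (a sketch with made-up, strongly correlated predictors; scikit-learn's `alpha` is its own rescaled version of $\lambda$):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(1)

# Two strongly correlated predictors with equal true effects (made-up data).
base = rng.normal(size=200)
X = np.column_stack([base + 0.05 * rng.normal(size=200),
                     base + 0.05 * rng.normal(size=200)])
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=200)

# l1_ratio = 1.0 is the pure lasso; smaller values mix in more of the l2 penalty.
for l1_ratio in (0.2, 0.5, 1.0):
    fit = ElasticNet(alpha=0.5, l1_ratio=l1_ratio).fit(X, y)
    print(f"l1_ratio={l1_ratio:3.1f}  coef={np.round(fit.coef_, 3)}")

# With predictors this correlated the pure lasso tends to load one of the pair and
# drop (or nearly drop) the other, while mixing in some l2 spreads the weight more evenly.
```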

Further reading

For ridge and the lasso, and how $\lambda$ is chosen in practice by cross-validation, see An Introduction to Statistical Learning by James, Witten, Hastie, and Tibshirani (chapter 6), freely available at statlearning.com. That book does not cover the elastic net; for it, turn to the more advanced companion volume, The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman (section 3.4), also free at hastie.su.domains/ElemStatLearn.

The inspiration for this interactive explanation was setosa.io's Principal Components Explained Visually.