Interesting interpretations of Ridge and Lasso

Unveiling the beauty of simple penalizations in linear regression

Machine Learning is all about balance. You can have a strong learner that overfits your training data but struggles with out-of-sample prediction. In that case, you may use the right amount and kind of regularization to improve the overall performance. Two well-known regularization methods are L1 and L2 penalties.

Although they are used in many regression and classification algorithms, I feel that their linear regression counterparts are usually not studied in depth by many data science practitioners. Understanding the effects of those penalties in simple scenarios can guide us when navigating more complex ones, such as a neural network layer.

This post will introduce interesting interpretations of ridge and lasso regressions that are described in books and other blog posts (see the references at the end of this article). I intend to add some visual explanations and to compile the different interpretations that unveil the beauty of how simple penalizations yield completely different outcomes.

Note: Unfortunately, I didn't have enough time to dive into the details of the maths and statistics included in this post, or to organize the sections as well as I would have liked. The sections may be dense in terms of content. Anyway, I believe this post may be useful to you. If you have any questions or suggestions, feel free to reach out to me on LinkedIn or by email.

The Ordinary Linear Regression

Linear regression is the simplest, but not the least powerful, regression method we can think of. If you feel comfortable with the equations of linear regression, feel free to jump to the next section. If not, let's do a quick recap.

Suppose we have $n$ observations of a dependent variable $y$ and some explanatory variables $x_1, \dots, x_p$. We can use a linear model $y = \sum\limits_{i=1}^p \beta_i x_i + \beta_0$ to fit the data. In matrix form, we may rewrite it as $y = X\beta$, where $X \in \mathbb{R}^{n\times (p+1)}$ is the matrix containing the input variables and a column of 1s representing the intercept, and $\beta \in \mathbb{R}^{p+1}$ is the vector of coefficients. We can estimate our $\beta$s in a way that minimizes the mean squared error between the predicted values $\hat{y}$ and the true values $y$:

$$\hat{\beta}^{ols} = \arg\min\limits_{\beta}\; (y - X\beta)^2 = \arg\min\limits_{\beta} \left(y - \beta_0 - \sum\limits_{i=1}^p \beta_i x_i \right)^2$$

The problem has a closed-form solution and we can write:

$$\hat{\beta}^{ols} = (X^TX)^{-1}X^Ty$$
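To make this concrete, here is a minimal sketch of the closed-form fit in NumPy; the toy numbers below are made up purely for illustration:

```python
import numpy as np

# Toy data: 5 observations, 2 explanatory variables (arbitrary values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([5.0, 4.5, 7.5, 11.0, 12.5])

# Prepend a column of 1s for the intercept beta_0
X1 = np.column_stack([np.ones(len(X)), X])

# Closed-form OLS: beta = (X^T X)^{-1} X^T y (solved as a linear system for stability)
beta_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_ols)  # [intercept, beta_1, beta_2]
```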

Note that if the columns of $X$ are not linearly independent, or are strongly correlated, the matrix $X^TX$ is ill-conditioned and its inverse $(X^TX)^{-1}$ becomes unstable. In real-world problems, we know that happens all the time. Fortunately, a solution has been developed, and it's strongly related to L2 penalization. We call it the ridge regression.

Ridge and multicollinearity

The ridge regression was actually developed to solve the issues of linear regression in the case of non-orthogonal inputs, also known as multicollinearity. In the ordinary linear regression case, correlated inputs create unstable $\beta$s and increase their variance. Suppose we had the following matrix $X$, with $x_1 \approx 5x_2$ and no intercept:

$$X = \begin{bmatrix} 0.5 & 0.1 \\ 1.5 & 0.302 \\ 1.0 & 0.21 \\ 3.5 & 0.703 \end{bmatrix}$$

With outcomes $y$:

$$y = \begin{bmatrix} 0.600 \\ 1.802 \\ 1.210 \\ 4.203 \end{bmatrix}$$

we can easily find that $\hat{\beta}^{ols} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$. Now, assume that some random additive noise had perturbed our observed outcomes:

$$y^{noisy} = y + \begin{bmatrix} 0.007 \\ 0.005 \\ -0.0012 \\ 0.005 \end{bmatrix}$$

Our estimates for $\beta$ would be $\hat{\beta}^{ols\;noisy} = \begin{bmatrix} 1.07 & 0.64 \end{bmatrix}^T$! Small perturbations in the observed outcome $y$ can significantly change the coefficients estimated with linear regression.
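You can reproduce this instability with a few lines of NumPy; the sketch below refits OLS on the clean and on the perturbed outcomes from the example above:

```python
import numpy as np

# The correlated design matrix and outcomes from the example (x_1 is roughly 5 * x_2)
X = np.array([[0.5, 0.1],
              [1.5, 0.302],
              [1.0, 0.21],
              [3.5, 0.703]])
y = np.array([0.600, 1.802, 1.210, 4.203])
noise = np.array([0.007, 0.005, -0.0012, 0.005])

# OLS via least squares, without an intercept, as in the example
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_noisy, *_ = np.linalg.lstsq(X, y + noise, rcond=None)

print(beta_clean)  # approximately [1, 1]
print(beta_noisy)  # noticeably different from [1, 1] despite the tiny perturbation
```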

Another look at the effects of multicollinearity on the regression coefficients comes from the interpretation of linear regression as an orthogonal projection. You can understand linear regression as the projection of the observations onto a plane defined by $\hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = y$. This plane is the closest to the training points in terms of average distance. However, when the correlation between $x_1$ and $x_2$ gets close to 1, the true subspace that defines the relation between $y$ and the inputs is the line $\beta_1 x_1 = y$ (or, equivalently, $\beta_2 x_2 = y$). The regression algorithm tries to find a plane, but there are infinitely many planes that contain that line. Any outlier in the training data will give a (bad) hint that there's a plane that fits the data.

The ridge

By adding a special penalization on the magnitude of the regression coefficients, however, the problem has a unique solution. The ridge regression can be stated as:

$$\hat{\beta}^{ridge} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 + \lambda\sum\limits_{i=1}^p\beta_i^2$$

or

$$\hat{\beta}^{ridge} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 \quad\text{subject to}\quad \sum\limits_{i=1}^p \beta_i^2 \leq t$$

There's a one-to-one correspondence between $\lambda$ and $t$. Note that $\beta_0$ is not included in the penalization.

After centering the input vectors $x_i$, $i \in \{1, \dots, p\}$, and the target $y$, the solution has a closed form and, disregarding the intercept, can be written as:

$$\hat{\beta}^{ridge} = (X^TX + \lambda \mathbf{I})^{-1}X^Ty$$

where the input matrix $X \in \mathbb{R}^{n \times p}$ has $p$ columns, $\hat{\beta}^{ridge} \in \mathbb{R}^p$, and $\mathbf{I}$ is the identity matrix.
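As a sketch, the closed-form ridge estimate can be computed directly with NumPy; here it is applied to the noisy correlated example from earlier, with an arbitrary $\lambda = 0.1$:

```python
import numpy as np

X = np.array([[0.5, 0.1],
              [1.5, 0.302],
              [1.0, 0.21],
              [3.5, 0.703]])
y_noisy = np.array([0.607, 1.807, 1.2088, 4.208])  # perturbed outcomes from the example

lam = 0.1  # arbitrary penalization strength

# Center inputs and target so the intercept can be left out of the penalized problem
Xc = X - X.mean(axis=0)
yc = y_noisy - y_noisy.mean()

# Closed-form ridge: beta = (X^T X + lambda * I)^{-1} X^T y
p = Xc.shape[1]
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
print(beta_ridge)  # far less sensitive to the perturbation than the OLS estimate
```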

Now, compare this solution with the one from ordinary linear regression. If you're familiar with eigendecomposition, you have probably figured out why the $\lambda \mathbf{I}$ term makes the ridge more stable. Supposing we had correlated columns in our data, some columns of $X^TX$ would be "almost" linearly dependent, with at least one eigenvalue of $X^TX$ equal to or close to zero. However, the term $\lambda \mathbf{I}$ increases every eigenvalue of $X^TX$ by $\lambda$, making it possible to invert the matrix.

The beauty of the L2 penalization starts to appear. The inversion of an ill-conditioned matrix amplifies the noise of the observations, and we obtain unstable results with ordinary linear regression when we have correlated features in our data, as in the example above. The ridge, however, avoids those instabilities. In the next section, we'll dive deeper to understand how the ridge is related to the covariance matrix and why it is more robust to noisy observations. We'll do that with the help of the Singular Value Decomposition, a matrix factorization technique.

Singular Value Decomposition

The theorem states that any matrix $X \in \mathbb{R}^{m\times n}$ can be factorized into a singular value decomposition:

$$X = UDV^T$$

where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices, and $D \in \mathbb{R}^{m\times n}$ is a diagonal matrix with non-negative values.

In other words, the SVD provides a way to decompose the matrix into three sequential linear transformations: a rotation, followed by a scaling, followed by another rotation. The amount of scaling is defined by the diagonal values of $D$, also known as singular values. This factorization is related to eigendecomposition: if the original matrix has some eigenvalue equal to zero, the scaling step will shrink those directions to zero. As a consequence, the number of zeros on the diagonal of $D$ is equal to the dimension of the null space of $X$.

When $X$ is the matrix containing our data, the sample covariance matrix $\Sigma = \frac{X^TX}{N-1}$ is strongly related to the SVD factorization. If we replace $X$ by its factorized form, we obtain:

$$\Sigma = \frac{X^TX}{N-1} = V\frac{D^2}{N-1}V^T$$

which is the eigendecomposition of the covariance matrix. The columns $v_j$ of $V$ are the eigenvectors of $\Sigma$ (also called the principal component directions of $X$), and the eigenvalues are the squared singular values divided by $N-1$. You can feel how the SVD is related to Principal Component Analysis. But I find its relation to the ridge regression even more intriguing.
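A quick numerical check of this relation, sketched with NumPy on centered random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)  # center the columns

# SVD of the data matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Sample covariance matrix and its eigendecomposition
cov = X.T @ X / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvalues of the covariance are the squared singular values divided by N - 1
print(np.allclose(np.sort(eigvals), np.sort(d**2 / (N - 1))))  # True
```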

After replacing $X$ in the closed-form solution of the ridge regression by its SVD factorization, we obtain the following:

$$\begin{aligned} \hat{\beta}^{ridge} &= (X^TX + \lambda \mathbf{I})^{-1}X^Ty \\ &= (VD^2V^T + \lambda \mathbf{I})^{-1}VDU^Ty \\ &= (VD^2V^T + V\lambda \mathbf{I}V^T)^{-1}VDU^Ty \\ &= [V(D^2 + \lambda\mathbf{I})V^T]^{-1}VDU^Ty \\ &= [V(D^2 + \lambda\mathbf{I})^{-1}V^T]VDU^Ty \\ &= VD(D^2 + \lambda\mathbf{I})^{-1}U^Ty \end{aligned}$$

while, for the ordinary least squares, we obtain

$$\begin{aligned} \hat{\beta}^{ols} &= (X^TX)^{-1}X^Ty \\ &= VD(D^2)^{-1}U^Ty \\ &= VD^{-1}U^Ty \end{aligned}$$

Considering that $y = X\beta + \epsilon$, where $\epsilon$ is an irreducible error:

$$\hat{\beta}^{ols} = \beta^{ols} + VD^{-1}U^T\epsilon$$

$$\hat{\beta}^{ridge} = \beta^{ridge} + VD(D^2 + \lambda\mathbf{I})^{-1}U^T\epsilon$$

Note that when we have collinearity between two features, one of the singular values $d_j$ will shrink towards zero. That would increase the effect of the errors on the coefficient estimates in the OLS case, due to the near-zero denominators $d_j$ in the diagonal of $D^{-1}$. The ridge term, however, avoids those near-zero denominators, replacing $\frac{1}{d_j}$ with $\frac{d_j}{d_j^2 + \lambda}$. The penalization is strongest in the smallest principal components of the input data, i.e. the directions that minimize the sample variance of the data.
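The sketch below compares the two factors for a few hypothetical singular values (the numbers are made up), showing how the ridge tames the directions where $d_j$ is close to zero:

```python
import numpy as np

d = np.array([10.0, 2.0, 0.5, 0.01])  # hypothetical singular values, one near zero
lam = 1.0                             # arbitrary penalization strength

ols_factor = 1.0 / d               # blows up for the smallest singular value
ridge_factor = d / (d**2 + lam)    # stays bounded, shrinking the weak directions the most

for dj, f_ols, f_ridge in zip(d, ols_factor, ridge_factor):
    print(f"d = {dj:6.2f} | OLS factor = {f_ols:8.2f} | ridge factor = {f_ridge:6.4f}")
```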

The Lasso

The ridge is a stable technique to use in regressions, but it doesn't set any coefficients to zero and may fail to provide interpretable models. The lasso tries to attack this problem by replacing $\beta^2$ with $|\beta|$ in the penalization term:

$$\hat{\beta}^{lasso} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 + \lambda\sum\limits_{i=1}^p|\beta_i|$$

or

$$\hat{\beta}^{lasso} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 \quad\text{subject to}\quad \sum\limits_{i=1}^p |\beta_i| \leq t$$

Just like the ridge, we can separate this optimization problem into two parts: first, centering $x_j$ and $y$ with respect to their means; then, solving the estimation problem without the intercept $\beta_0$. The lasso is a quadratic programming problem and doesn't have a closed-form solution in the general setting, but it can be solved with the same computational cost as the ridge regression.
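In practice the lasso is typically fit with iterative solvers such as coordinate descent. A minimal sketch using scikit-learn's `Lasso` on synthetic data (the dimensions, coefficients, and penalty strength below are arbitrary) could look like this:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.normal(size=(n, p))

# Only the first three features truly matter in this synthetic example
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# alpha plays the role of the lagrangian lambda (up to scikit-learn's scaling convention)
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # several coefficients are exactly zero
```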

Let's see it in an example.

The prostate dataset by Stamey et al., 1987 contains measures of Prostate Specific Antigen and several clinical measures for 97 men. We would like to discover which features may help us identify prostate cancer in men. Applying the lasso to the dataset and increasing the lagrangian $\lambda$, we obtain the coefficient paths shown in the figure below.

Lasso coefficient paths. The x-axis represents the lagrangian, and the y-axis the coefficients' values. Each line is the coefficient of a feature.

Those paths are really different from those we obtain with the ridge regression. The following figure illustrates the paths we obtain when we increase the lagrangian of the ridge:

Ridge coefficient paths. The x-axis represents the lagrangian, and the y-axis the coefficients' values. Each line is the coefficient of a feature.

Although the ridge is a powerful tool to avoid instabilities arising from correlated inputs in a regression, it isn't a tool for feature selection, as we can see from the figures above. The lasso, on the other hand, has characteristics that are useful for understanding the linear relations between the dependent and explanatory variables, and it can be used as a feature selection tool.

The two-dimensional case

We can also understand the ridge and lasso geometrically in a simplified setting with only two explanatory variables. This example uses a dataset with one dependent variable $y$ and two explanatory variables $x_1$ and $x_2$.

As we described in the previous sections, the ridge constrains the coefficients to the ball defined by $\sum\limits_{i=1}^p \beta_i^2 \leq t$, and the lasso constrains them to the diamond $\sum\limits_{i=1}^p |\beta_i| \leq t$. The point $(\beta_1, \beta_2)$ with the lowest squared error inside the filled area will be our solution:

Lasso regression contour / Ridge regression contour
Contours of Lasso and Ridge regressions. The axes are the coefficients of the two explanatory variables in the regression. The point "b" represents the coefficient estimated by ordinary linear regression. The blue area may expand or contract depending on the penalization magnitude.

The corners of the diamond favour the presence of null coefficients. This is clearer in the following figures, which show the contours of the residual sum of squares surface:

The residual sum of squares surface. The diamond represents the region where the lasso regression coefficients must be for a given lagrangian.
The residual sum of squares surface. The circle represents the region where the ridge regression coefficients must be for a given lagrangian.

Note how the upper corner is in an advantageous position in comparison to the other parts of the lasso contour. This helps us understand why the change in the penalization term affects the outcomes so drastically. In the ridge figure, the intersection between the contours and the constrained region clearly happens at a place where $\beta_1 \neq 0$ and $\beta_2 \neq 0$.
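If you want to reproduce a figure along these lines, a rough matplotlib sketch is shown below; the synthetic data, the constraint size $t$, and the contour levels are all made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, 0.5]) + rng.normal(scale=0.5, size=n)

# Residual sum of squares over a grid of (beta_1, beta_2) values
b1, b2 = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-2, 2, 200))
rss = ((y[:, None, None]
        - X[:, 0, None, None] * b1
        - X[:, 1, None, None] * b2) ** 2).sum(axis=0)

fig, ax = plt.subplots()
ax.contour(b1, b2, rss, levels=20)

# Constraint regions for an arbitrary t = 1: the L2 ball and the L1 diamond
t = 1.0
theta = np.linspace(0, 2 * np.pi, 400)
ax.plot(np.sqrt(t) * np.cos(theta), np.sqrt(t) * np.sin(theta),
        label=r"ridge: $\beta_1^2 + \beta_2^2 \leq t$")
ax.plot([t, 0, -t, 0, t], [0, t, 0, -t, 0],
        label=r"lasso: $|\beta_1| + |\beta_2| \leq t$")

ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.legend()
plt.show()
```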

Lasso and Ridge from a Bayesian point of view

So far, we have seen how the L2 penalization works and why it avoids instabilities in the case of correlated features, and we have also seen why the lasso and the ridge behave differently from a geometric point of view. This section will show the relation between Bayesian inference and the ridge and lasso regressions.

Brief introduction to Bayesian linear regression

In Bayesian inference, we are not only interested in the values of our coefficients. We suppose they are random variables, and we would like to know their probability distribution. To do that, we use Bayes' rule: we update our prior knowledge with the data to obtain our estimates.

The approach consists in creating a model that provides the joint probability distribution $p(\beta, y)$ and choosing a prior distribution $p(\beta)$ for our parameters. Then, with the data, we can estimate the posterior distribution $p(\beta | y)$, which represents the distribution of our coefficients after updating our priors with the data. From Bayes' rule, we have:

$$p(\beta | y) = \frac{p(\beta, y)}{p(y)} = \frac{p(\beta)p(y|\beta)}{p(y)}$$

With $y$ fixed, we can consider the term $p(y)$, which does not depend on $\beta$, to be constant, and rewrite the above equation in an equivalent form:

$$p(\beta | y) \propto p(\beta)p(y|\beta)$$

Note that the data $y$ affects the posterior inference through $p(y|\beta)$. This term, when we look at it as a function of $\beta$ for a fixed $y$, is called the likelihood function. Suppose our model is $y_i = \beta_0 + \sum\limits_{j=1}^p\beta_j x_{ij} + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. We have the following conditional probability:

$$y_i | X, \beta \sim \mathcal{N}\left(\beta_0 + \sum\limits_{j=1}^p\beta_j x_{ij},\; \sigma^2\right)$$

and our likelihood function is

$$p(y|X, \beta) = \prod\limits_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{\epsilon_i^2}{2\sigma^2} \right) = \left(\frac{1}{\sigma \sqrt{2\pi}} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum\limits_{i=1}^{n}\epsilon_i^2 \right)$$

With the likelihood and our priors $p(\beta)$, we can estimate the posterior distributions of the regression coefficients:

$$p(\beta|X, y) \propto p(y | \beta, X)\, p(\beta | X) = p(y | \beta, X)\, p(\beta)$$

Lasso and ridge as a consequence of our priors

The link between Bayesian inference and regularized linear regression lies in our priors $p(\beta)$. Two distributions play an important role: the normal and the Laplace distributions:

Laplace and normal distribution plots. If we use one of them as our prior, we can find the same solution we would find in a linear regression with an L1 or L2 penalty.

If we use a normal prior with mean zero for our coefficients, $\beta \sim \mathcal{N}(0, \frac{\sigma^2}{\lambda})$, we find that the ridge solution $\beta^{ridge}$ is the mean of our posterior distribution $p(\beta | X, y)$. The larger the lagrangian, the more we believe that our coefficients are close to zero.

Similarly, if we use a Laplace prior with mean zero for our coefficients, $\beta \sim \text{Laplace}(0, \frac{2\sigma^2}{\lambda})$, the mode of $p(\beta | X, y)$ is the lasso solution $\beta^{lasso}$. Indeed, the Laplace prior expresses a stronger conviction that our coefficients are equal to zero.
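A small numerical check of the ridge claim is sketched below with NumPy: assuming a known noise variance $\sigma^2$ and the Gaussian prior $\beta \sim \mathcal{N}(0, \frac{\sigma^2}{\lambda}\mathbf{I})$ on centered data, the posterior mean obtained from the standard Gaussian conjugacy formulas coincides with the closed-form ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.8, size=n)
y = y - y.mean()

sigma2 = 0.8 ** 2  # noise variance, assumed known for this check
lam = 2.0          # lagrangian of the ridge

# Gaussian posterior under the prior beta ~ N(0, (sigma2 / lam) * I)
post_cov = np.linalg.inv(X.T @ X / sigma2 + (lam / sigma2) * np.eye(p))
post_mean = post_cov @ (X.T @ y / sigma2)

# Closed-form ridge solution with the same lambda
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(post_mean, beta_ridge))  # True: the posterior mean is the ridge estimate
```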

Conclusion

When to use one over the other

The ridge and the lasso have unique properties that suit different use cases. Below, I'll summarise some rules of thumb that can help you decide which one to use.

  • If you want to avoid instabilities in the case of correlated predictors, use the ridge regression. When there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group and does not care which one is selected (Zou & Hastie, 2005).
  • If you have a lot of data and a prediction problem, use the ridge regression. It has been empirically observed that the prediction performance of the lasso is dominated by the ridge (Tibshirani, 1996).
  • If you want to select features, use the lasso. It has the ability to set coefficients to zero, which is not a characteristic of the ridge regression. However, if you have $n < p$, i.e. fewer observations than variables, the lasso will select at most $n$ variables.

Final remarks

Albeit simple, the ridge and the lasso regressions have useful interpretations. When you apply L1 and L2 regularization terms in other statistical learning problems, you may recall the linear regression case to understand why the coefficient estimates behave in a certain way. There are still many other links that I didn't mention here, such as the Least Angle Regression coefficient paths and their relation to the lasso paths.

The adaptive lasso is another topic that I suggest for further reading. Zou, 2006 showed that the lasso shrinkage produces biased estimates for large coefficients. The adaptive lasso has oracle properties, i.e. it performs as well as if the true underlying model were given in advance.

References

  1. Stamey, T. A., Yang, N., Hay, A. R., McNeal, J. E., Freiha, F. S., & Redwine, E. (1987). Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate. New England Journal of Medicine, 317(15), 909–916.
  2. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
  3. Friedman, J. H. (2017). The elements of statistical learning: Data mining, inference, and prediction. Springer.
  4. Gundersen, G. (2018). Singular Value Decomposition as Simply as Possible. In Gregory Gundersen’s Blog. https://gregorygundersen.com/blog/2018/12/10/svd/
  5. Martin, C. D., & Porter, M. A. (2012). The extraordinary SVD. The American Mathematical Monthly, 119(10), 838–851.
  6. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
  7. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. Chapman.
  8. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.