Interesting interpretations of Ridge and Lasso

Unveiling the beauty of simple penalizations in linear regression

Machine Learning is all about balance. You can have a strong learner that overfits your training data but struggles with out-of-sample prediction. In that case, you may use the right amount and kind of regularization to improve the overall performance. Two well-known regularization methods are L1 and L2 penalties.

Although they are used in many regression and classification algorithms, I feel that their linear regression counterparts are usually not studied in depth by many data science practitioners. Understanding the effects of those penalties in simple scenarios can guide us when navigating more complex ones, such as a neural network layer.

This post will introduce interesting interpretations of ridge and lasso regressions that are described in books and other blog posts (see the references at the end of this article). I intend to add some visual explanations and to compile the different interpretations that unveil the beauty of how simple penalizations yield completely different outcomes.

Note: Unfortunately, I didn't have enough time to dive into the details of the maths and statistics included in this post, or to organize the sections as well as I would have liked. The sections may be dense in terms of content. Anyway, I believe this post may be useful to you. If you have any questions or suggestions, feel free to reach out to me on LinkedIn or by email.

The Ordinary Linear Regression

Linear regression is the simplest, but not the least powerful, regression method we can think of. If you feel comfortable with the equations of linear regression, feel free to jump to the next section. If not, let's do a quick recap.

Suppose we have $n$ observations of a dependent variable $y$ and some explanatory variables $x_1, \dots, x_p$. We can use a linear model $y = \sum\limits_{i=1}^p \beta_i x_i + \beta_0$ to fit the data. In matrix form, we may rewrite it as $y = X\beta$, where $X \in \mathbb{R}^{n\times (p+1)}$ is the matrix containing the input variables and a column of 1s representing the intercept, and $\beta \in \mathbb{R}^{p+1}$ is the vector of coefficients. We can estimate our $\beta$s in a way that minimizes the mean squared error between the predicted values $\hat{y}$ and the true values $y$:

$$\hat{\beta}^{ols} = \arg\min\limits_{\beta}\; (y - X\beta)^2 = \arg\min\limits_{\beta} \left(y - \beta_0 - \sum\limits_{i=1}^p \beta_i x_i \right)^2$$

The problem has a closed-form solution and we can write:

$$\hat{\beta}^{ols} = (X^TX)^{-1}X^Ty$$
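To make this concrete, here is a minimal sketch of the closed-form fit in NumPy; the toy numbers below are made up purely for illustration:

```python
import numpy as np

# Toy data: 5 observations, 2 explanatory variables (arbitrary values)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.5],
              [4.0, 3.0],
              [5.0, 2.5]])
y = np.array([5.0, 4.5, 7.5, 11.0, 12.5])

# Prepend a column of 1s for the intercept beta_0
X1 = np.column_stack([np.ones(len(X)), X])

# Closed-form OLS: beta = (X^T X)^{-1} X^T y (solved as a linear system for stability)
beta_ols = np.linalg.solve(X1.T @ X1, X1.T @ y)
print(beta_ols)  # [intercept, beta_1, beta_2]
```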

Note that if the columns of $X$ are not linearly independent, or are strongly correlated, the matrix $X^TX$ is ill-conditioned and its inverse $(X^TX)^{-1}$ becomes unstable. In real-world problems, we know that happens all the time. Fortunately, a solution has been developed, and it's strongly related to L2 penalization. We call it the ridge regression.

Ridge and multicollinearity

The ridge regression was actually developed to solve the issues of linear regression in the case of non-orthogonal inputs, also known as multicollinearity. In the ordinary linear regression case, correlated inputs create unstable $\beta$s and increase their variance. Suppose we had the following matrix $X$, with $x_1 \approx 5x_2$ and no intercept:

$$X = \begin{bmatrix} 0.5 & 0.1 \\ 1.5 & 0.302 \\ 1.0 & 0.21 \\ 3.5 & 0.703 \end{bmatrix}$$

With outcomes $y$:

$$y = \begin{bmatrix} 0.600 \\ 1.802 \\ 1.210 \\ 4.203 \end{bmatrix}$$

we can easily find that $\hat{\beta}^{ols} = \begin{bmatrix} 1 & 1 \end{bmatrix}^T$. Now, assume that some random additive noise had perturbed our observed outcomes:

$$y^{noisy} = y + \begin{bmatrix} 0.007 \\ 0.005 \\ -0.0012 \\ 0.005 \end{bmatrix}$$

Our estimates for $\beta$ would be $\hat{\beta}^{ols\;noisy} = \begin{bmatrix} 1.07 & 0.64 \end{bmatrix}^T$! Small perturbations in the observed outcome $y$ can significantly change the coefficients estimated with linear regression.
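You can reproduce this instability with a few lines of NumPy; the sketch below refits OLS on the clean and on the perturbed outcomes from the example above:

```python
import numpy as np

# The correlated design matrix and outcomes from the example (x_1 is roughly 5 * x_2)
X = np.array([[0.5, 0.1],
              [1.5, 0.302],
              [1.0, 0.21],
              [3.5, 0.703]])
y = np.array([0.600, 1.802, 1.210, 4.203])
noise = np.array([0.007, 0.005, -0.0012, 0.005])

# OLS via least squares, without an intercept, as in the example
beta_clean, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_noisy, *_ = np.linalg.lstsq(X, y + noise, rcond=None)

print(beta_clean)  # approximately [1, 1]
print(beta_noisy)  # noticeably different from [1, 1] despite the tiny perturbation
```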

Another look at the effects of multicollinearity on the regression coefficients comes from the interpretation of linear regression as an orthogonal projection. You can understand linear regression as the projection of the observations onto a plane defined by $\hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = y$. This plane is the closest to the training points in terms of average distance. However, when the correlation between $x_1$ and $x_2$ gets close to 1, the true subspace that defines the relation between $y$ and the inputs is the line $\beta_1 x_1 = y$ (or, equivalently, $\beta_2 x_2 = y$). The regression algorithm tries to find a plane, but there are infinitely many planes that contain that line. Any outlier in the training data will give a (bad) hint that there's a plane that fits the data.

The ridge

By adding a special penalization on the magnitude of the regression coefficients, however, the problem has a unique solution. The ridge regression can be stated as:

$$\hat{\beta}^{ridge} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 + \lambda\sum\limits_{i=1}^p\beta_i^2$$

or

$$\hat{\beta}^{ridge} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 \quad\text{subject to}\quad \sum\limits_{i=1}^p \beta_i^2 \leq t$$

There's a one-to-one correspondence between $\lambda$ and $t$. Note that $\beta_0$ is not included in the penalization.

After centering the input vectors $x_i$, $i \in \{1, \dots, p\}$, and the target $y$, the solution has a closed form and, disregarding the intercept, can be written as:

$$\hat{\beta}^{ridge} = (X^TX + \lambda \mathbf{I})^{-1}X^Ty$$

where the input matrix $X \in \mathbb{R}^{n \times p}$ has $p$ columns, $\hat{\beta}^{ridge} \in \mathbb{R}^p$, and $\mathbf{I}$ is the identity matrix.
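As a sketch, the closed-form ridge estimate can be computed directly with NumPy; here it is applied to the noisy correlated example from earlier, with an arbitrary $\lambda = 0.1$:

```python
import numpy as np

X = np.array([[0.5, 0.1],
              [1.5, 0.302],
              [1.0, 0.21],
              [3.5, 0.703]])
y_noisy = np.array([0.607, 1.807, 1.2088, 4.208])  # perturbed outcomes from the example

lam = 0.1  # arbitrary penalization strength

# Center inputs and target so the intercept can be left out of the penalized problem
Xc = X - X.mean(axis=0)
yc = y_noisy - y_noisy.mean()

# Closed-form ridge: beta = (X^T X + lambda * I)^{-1} X^T y
p = Xc.shape[1]
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
print(beta_ridge)  # far less sensitive to the perturbation than the OLS estimate
```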

Now, compare this solution with the one from ordinary linear regression. If you're familiar with eigendecomposition, you have probably figured out why the $\lambda \mathbf{I}$ term makes the ridge more stable. Supposing we had correlated columns in our data, some columns of $X^TX$ would be "almost" linearly dependent, with at least one eigenvalue of $X^TX$ equal to or close to zero. However, the term $\lambda \mathbf{I}$ increases every eigenvalue of $X^TX$ by $\lambda$, making it possible to invert the matrix.

The beauty of the L2 penalization starts to appear. The inversion of an ill-conditioned matrix amplifies the noise of the observations, and we obtain unstable results with ordinary linear regression when we have correlated features in our data, as in the example above. The ridge, however, avoids those instabilities. In the next section, we'll dive deeper to understand how the ridge is related to the covariance matrix and why it is more robust to noisy observations. We'll do that with the help of the Singular Value Decomposition, a matrix factorization technique.

Singular Value Decomposition

The theorem states that any matrix $X \in \mathbb{R}^{m\times n}$ can be factorized into a singular value decomposition:

$$X = UDV^T$$

where $U \in \mathbb{R}^{m\times m}$ and $V \in \mathbb{R}^{n\times n}$ are orthogonal matrices, and $D \in \mathbb{R}^{m\times n}$ is a diagonal matrix with non-negative values.

In other words, the SVD provides a way to decompose the matrix into three sequential linear transformations: a rotation, followed by a scaling, followed by another rotation. The amount of scaling is defined by the diagonal values of $D$, also known as singular values. This factorization is related to eigendecomposition: if the original matrix has some eigenvalue equal to zero, the scaling step will shrink those directions to zero. As a consequence, the number of zeros on the diagonal of $D$ is equal to the dimension of the null space of $X$.

When $X$ is the matrix containing our data, the sample covariance matrix $\Sigma = \frac{X^TX}{N-1}$ is strongly related to the SVD factorization. If we replace $X$ by its factorized form, we obtain:

$$\Sigma = \frac{X^TX}{N-1} = V\frac{D^2}{N-1}V^T$$

which is the eigendecomposition of the covariance matrix. The columns $v_j$ of $V$ are the eigenvectors of $\Sigma$ (also called the principal component directions of $X$), and the eigenvalues are the squared singular values divided by $N-1$. You can feel how the SVD is related to Principal Component Analysis. But I find its relation to the ridge regression even more intriguing.
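A quick numerical check of this relation, sketched with NumPy on centered random data:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 100, 3
X = rng.normal(size=(N, p))
X = X - X.mean(axis=0)  # center the columns

# SVD of the data matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Sample covariance matrix and its eigendecomposition
cov = X.T @ X / (N - 1)
eigvals, eigvecs = np.linalg.eigh(cov)

# The eigenvalues of the covariance are the squared singular values divided by N - 1
print(np.allclose(np.sort(eigvals), np.sort(d**2 / (N - 1))))  # True
```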

After replacing $X$ in the closed-form solution of the ridge regression by its SVD factorization, we obtain the following:

$$\begin{aligned} \hat{\beta}^{ridge} &= (X^TX + \lambda \mathbf{I})^{-1}X^Ty \\ &= (VD^2V^T + \lambda \mathbf{I})^{-1}VDU^Ty \\ &= (VD^2V^T + V\lambda \mathbf{I}V^T)^{-1}VDU^Ty \\ &= [V(D^2 + \lambda\mathbf{I})V^T]^{-1}VDU^Ty \\ &= [V(D^2 + \lambda\mathbf{I})^{-1}V^T]VDU^Ty \\ &= VD(D^2 + \lambda\mathbf{I})^{-1}U^Ty \end{aligned}$$

while, for the ordinary least squares, we obtain

$$\begin{aligned} \hat{\beta}^{ols} &= (X^TX)^{-1}X^Ty \\ &= VD(D^2)^{-1}U^Ty \\ &= VD^{-1}U^Ty \end{aligned}$$

Considering that $y = X\beta + \epsilon$, where $\epsilon$ is an irreducible error:

$$\hat{\beta}^{ols} = \beta^{ols} + VD^{-1}U^T\epsilon$$

$$\hat{\beta}^{ridge} = \beta^{ridge} + VD(D^2 + \lambda\mathbf{I})^{-1}U^T\epsilon$$

Note that when we have collinearity between two features, one of the singular values $d_j$ will shrink towards zero. That would increase the effect of the errors on the coefficient estimates in the OLS case, due to the near-zero denominators $d_j$ in the diagonal of $D^{-1}$. The ridge term, however, avoids those near-zero denominators, replacing $\frac{1}{d_j}$ with $\frac{d_j}{d_j^2 + \lambda}$. The penalization is strongest in the smallest principal components of the input data, i.e. the directions that minimize the sample variance of the data.
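The sketch below compares the two factors for a few hypothetical singular values (the numbers are made up), showing how the ridge tames the directions where $d_j$ is close to zero:

```python
import numpy as np

d = np.array([10.0, 2.0, 0.5, 0.01])  # hypothetical singular values, one near zero
lam = 1.0                             # arbitrary penalization strength

ols_factor = 1.0 / d               # blows up for the smallest singular value
ridge_factor = d / (d**2 + lam)    # stays bounded, shrinking the weak directions the most

for dj, f_ols, f_ridge in zip(d, ols_factor, ridge_factor):
    print(f"d = {dj:6.2f} | OLS factor = {f_ols:8.2f} | ridge factor = {f_ridge:6.4f}")
```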

The Lasso

The ridge is a stable technique to use in regressions, but it doesn't set any coefficients to zero and may fail to provide interpretable models. The lasso tries to attack this problem by replacing $\beta^2$ with $|\beta|$ in the penalization term:

$$\hat{\beta}^{lasso} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 + \lambda\sum\limits_{i=1}^p|\beta_i|$$

or

$$\hat{\beta}^{lasso} = \arg\min\limits_\beta\;\left(y-\beta_0 - \sum\limits_{i=1}^p x_i\beta_i\right)^2 \quad\text{subject to}\quad \sum\limits_{i=1}^p |\beta_i| \leq t$$

Just like the ridge, we can separate this optimization problem into two parts: first, centering $x_j$ and $y$ with respect to their means; then, solving the estimation problem without the intercept $\beta_0$. The lasso is a quadratic programming problem and doesn't have a closed-form solution in the general setting, but it can be solved with the same computational cost as the ridge regression.
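In practice the lasso is typically fit with iterative solvers such as coordinate descent. A minimal sketch using scikit-learn's `Lasso` on synthetic data (the dimensions, coefficients, and penalty strength below are arbitrary) could look like this:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
n, p = 200, 10
X = rng.normal(size=(n, p))

# Only the first three features truly matter in this synthetic example
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_beta + rng.normal(scale=0.5, size=n)

# alpha plays the role of the lagrangian lambda (up to scikit-learn's scaling convention)
lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)  # several coefficients are exactly zero
```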

Let's see it in an example.

The prostate dataset by Stamey et al., 1987 contains measures of Prostate Specific Antigen and several clinical measures for 97 men. We would like to discover which features may help us identify prostate cancer in men. Applying the lasso to the dataset and increasing the lagrangian $\lambda$, we obtain the coefficient paths shown in the figure below.

Lasso coefficient paths. The x-axis represents the lagrangian, and the y-axis the coefficients' values. Each line is the coefficient of a feature.

Those paths are really different from those we obtain with the ridge regression. The following figure illustrates the paths we obtain when we increase the lagrangian of the ridge:

Ridge coefficient paths. The x-axis represents the lagrangian, and the y-axis the coefficients' values. Each line is the coefficient of a feature.

Although the ridge is a powerful tool to avoid instabilities arising from correlated inputs in a regression, it isn't a tool for feature selection, as we can see from the figures above. The lasso, on the other hand, has characteristics that are useful for understanding the linear relations between the dependent and explanatory variables, and it can be used as a feature selection tool.

The two-dimensional case

We can also understand the ridge and lasso geometrically in a simplified setting with only two explanatory variables. This example uses a dataset with one dependent variable $y$ and two explanatory variables $x_1$ and $x_2$.

As we described in the previous sections, the ridge constrains the coefficients to the ball defined by $\sum\limits_{i=1}^p \beta_i^2 \leq t$, and the lasso constrains them to the diamond $\sum\limits_{i=1}^p |\beta_i| \leq t$. The point $(\beta_1, \beta_2)$ with the lowest squared error inside the filled area will be our solution:

Lasso regression contour / Ridge regression contour
Contours of Lasso and Ridge regressions. The axes are the coefficients of the two explanatory variables in the regression. The point "b" represents the coefficient estimated by ordinary linear regression. The blue area may expand or contract depending on the penalization magnitude.

The corners of the diamond favour the presence of null coefficients. This is clearer in the following figures, which show the contours of the residual sum of squares surface:

The residual sum of squares surface. The diamond represents the region where the lasso regression coefficients must be for a given lagrangian.
The residual sum of squares surface. The circle represents the region where the ridge regression coefficients must be for a given lagrangian.

Note how the upper corner is in an advantageous position in comparison to the other parts of the lasso contour. This helps us understand why the change in the penalization term affects the outcomes so drastically. In the ridge figure, the intersection between the contours and the constrained region clearly happens at a place where $\beta_1 \neq 0$ and $\beta_2 \neq 0$.
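If you want to reproduce a figure along these lines, a rough matplotlib sketch is shown below; the synthetic data, the constraint size $t$, and the contour levels are all made up for illustration:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 50
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, 0.5]) + rng.normal(scale=0.5, size=n)

# Residual sum of squares over a grid of (beta_1, beta_2) values
b1, b2 = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-2, 2, 200))
rss = ((y[:, None, None]
        - X[:, 0, None, None] * b1
        - X[:, 1, None, None] * b2) ** 2).sum(axis=0)

fig, ax = plt.subplots()
ax.contour(b1, b2, rss, levels=20)

# Constraint regions for an arbitrary t = 1: the L2 ball and the L1 diamond
t = 1.0
theta = np.linspace(0, 2 * np.pi, 400)
ax.plot(np.sqrt(t) * np.cos(theta), np.sqrt(t) * np.sin(theta),
        label=r"ridge: $\beta_1^2 + \beta_2^2 \leq t$")
ax.plot([t, 0, -t, 0, t], [0, t, 0, -t, 0],
        label=r"lasso: $|\beta_1| + |\beta_2| \leq t$")

ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.legend()
plt.show()
```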

Lasso and Ridge from a Bayesian point of view

So far, we have seen how the L2 penalization works and why it avoids instabilities in the case of correlated features, and we have also seen why the lasso and the ridge behave differently from a geometric point of view. This section will show the relation between Bayesian inference and the ridge and lasso regressions.

Brief introduction to Bayesian linear regression

In Bayesian inference, we are not only interested in the values of our coefficients. We suppose they are random variables, and we would like to know their probability distribution. To do that, we use Bayes' rule: we update our prior knowledge with the data to obtain our estimates.

The approach consists in creating a model that provides the joint probability distribution $p(\beta, y)$ and choosing a prior distribution $p(\beta)$ for our parameters. Then, with the data, we can estimate the posterior distribution $p(\beta | y)$, which represents the distribution of our coefficients after updating our priors with the data. From Bayes' rule, we have:

$$p(\beta | y) = \frac{p(\beta, y)}{p(y)} = \frac{p(\beta)p(y|\beta)}{p(y)}$$

With $y$ fixed, we can consider the term $p(y)$, which does not depend on $\beta$, to be constant, and rewrite the above equation in an equivalent form:

$$p(\beta | y) \propto p(\beta)p(y|\beta)$$

Note that the data $y$ affects the posterior inference through $p(y|\beta)$. This term, when we look at it as a function of $\beta$ for a fixed $y$, is called the likelihood function. Suppose our model is $y_i = \beta_0 + \sum\limits_{j=1}^p\beta_j x_{ij} + \epsilon_i$, with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. We have the following conditional probability:

$$y_i | X, \beta \sim \mathcal{N}\left(\beta_0 + \sum\limits_{j=1}^p\beta_j x_{ij},\; \sigma^2\right)$$

and our likelihood function is

$$p(y|X, \beta) = \prod\limits_{i=1}^{n} \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{\epsilon_i^2}{2\sigma^2} \right) = \left(\frac{1}{\sigma \sqrt{2\pi}} \right)^n \exp\left( -\frac{1}{2\sigma^2} \sum\limits_{i=1}^{n}\epsilon_i^2 \right)$$

With the likelihood and our priors $p(\beta)$, we can estimate the posterior distributions of the regression coefficients:

$$p(\beta|X, y) \propto p(y | \beta, X)\, p(\beta | X) = p(y | \beta, X)\, p(\beta)$$

Lasso and ridge as a consequence of our priors

The link between Bayesian inference and regularized linear regression lies in our priors $p(\beta)$. Two distributions play an important role: the normal and the Laplace distributions:

Laplace and normal distribution plots. If we use one of them as our prior, we can find the same solution we would find in a linear regression with an L1 or L2 penalty.

If we use a normal prior with mean zero for our coefficients, $\beta \sim \mathcal{N}(0, \frac{\sigma^2}{\lambda})$, we find that the ridge solution $\beta^{ridge}$ is the mean of our posterior distribution $p(\beta | X, y)$. The larger the lagrangian, the more we believe that our coefficients are close to zero.

Similarly, if we use a Laplace prior with mean zero for our coefficients, $\beta \sim \text{Laplace}(0, \frac{2\sigma^2}{\lambda})$, the mode of $p(\beta | X, y)$ is the lasso solution $\beta^{lasso}$. Indeed, the Laplace prior expresses a stronger conviction that our coefficients are equal to zero.
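A small numerical check of the ridge claim is sketched below with NumPy: assuming a known noise variance $\sigma^2$ and the Gaussian prior $\beta \sim \mathcal{N}(0, \frac{\sigma^2}{\lambda}\mathbf{I})$ on centered data, the posterior mean obtained from the standard Gaussian conjugacy formulas coincides with the closed-form ridge estimate.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100, 3
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)
y = X @ np.array([1.0, -0.5, 2.0]) + rng.normal(scale=0.8, size=n)
y = y - y.mean()

sigma2 = 0.8 ** 2  # noise variance, assumed known for this check
lam = 2.0          # lagrangian of the ridge

# Gaussian posterior under the prior beta ~ N(0, (sigma2 / lam) * I)
post_cov = np.linalg.inv(X.T @ X / sigma2 + (lam / sigma2) * np.eye(p))
post_mean = post_cov @ (X.T @ y / sigma2)

# Closed-form ridge solution with the same lambda
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print(np.allclose(post_mean, beta_ridge))  # True: the posterior mean is the ridge estimate
```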

Conclusion

When to use one over the other

The ridge and the lasso have unique properties that suit different use cases. Below, I'll summarise some rules of thumb that can help you decide which one to use.

  • If you want to avoid instabilities in the case of correlated predictors, use the ridge regression. When there is a group of variables among which the pairwise correlations are very high, the lasso tends to select only one variable from the group and does not care which one is selected (Zou & Hastie, 2005).
  • If you have a lot of data and a prediction problem, use the ridge regression. It has been empirically observed that the prediction performance of the lasso is dominated by the ridge (Tibshirani, 1996).
  • If you want to select features, use the lasso. It has the ability to set coefficients to zero, which is not a characteristic of the ridge regression. However, if you have $n < p$, i.e. fewer observations than variables, the lasso will select at most $n$ variables.

Final remarks

Albeit simple, the ridge and the lasso regressions have useful interpretations. When you apply L1 and L2 regularization terms in other statistical learning problems, you may recall the linear regression case to understand why the coefficient estimates behave in a certain way. There are still many other links that I didn't mention here, such as the Least Angle Regression coefficient paths and their relation to the lasso paths.

The adaptive lasso is another topic that I suggest for further reading. Zou, 2006 showed that the lasso shrinkage produces biased estimates for large coefficients. The adaptive lasso has oracle properties, i.e. it performs as well as if the true underlying model were given in advance.

References

  1. Stamey, T. A., Yang, N., Hay, A. R., McNeal, J. E., Freiha, F. S., & Redwine, E. (1987). Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate. New England Journal of Medicine, 317(15), 909–916.
  2. Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.
  3. Friedman, J. H. (2017). The elements of statistical learning: Data mining, inference, and prediction. Springer.
  4. Gundersen, G. (2018). Singular Value Decomposition as Simply as Possible. In Gregory Gundersen’s Blog. https://gregorygundersen.com/blog/2018/12/10/svd/
  5. Martin, C. D., & Porter, M. A. (2012). The extraordinary SVD. The American Mathematical Monthly, 119(10), 838–851.
  6. Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 1418–1429.
  7. Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (1995). Bayesian data analysis. Chapman.
  8. Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288.