Regularizing with Ridge Regression

In part one of this series, we talked about the basic LinearRegression module, which uses the least-squares estimation method of linear regression. In this article, we'll be talking about Ridge Regression, which is often mistaken for a separate type of regression but is in fact still least-squares estimation with a twist. We'll be covering the following concepts in ridge regression, and we won't be ignoring the maths (yes! That's right again, the maths):

Difference between Linear Regression and Ridge Regression
How to use Ridge Regression in python
How to select the best alpha parameter

Ordinary least squares regression is a great way to fit a linear model on a dataset, right up until that dataset has dependent features, i.e. when the features/columns of the dataset are linearly correlated. In that case, least squares estimation starts to perform terribly. In the previous article, we used a single-column dataset to simplify our approach; for this article we will use a dataset that has multiple columns, and these columns are linearly correlated. When this happens, the least-squares estimate can become highly sensitive to noise in the data. Unfortunately, in the real world, data cannot always be as noiseless as the data we used in the previous article and instead of making noise about it (get it? 😁), we can combat this issue using regularization, and this is what ridge regression is all about.

For ordinary least squares estimation, we are trying to find weights, $w$, that make the sum of squared residuals as small as possible (read more about the sum of squared residuals here or refer to the previous article). The equation is shown below: $$ \sum_{i=1}^{n}(y_i - x_i \cdot w)^2 $$ where each $x_i$ is a data observation and each $y_i$ is the corresponding label.
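To make the objective concrete, here is a minimal sketch that evaluates the sum of squared residuals with numpy. The toy data, the weights `true_w`, and the helper `sum_squared_residuals` are made up for illustration and are not part of the diamonds example used later.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 features, labels built from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w  # noiseless labels, so a perfect fit exists

def sum_squared_residuals(w, X, y):
    """Return sum_i (y_i - x_i . w)^2 for a weight vector w."""
    residuals = y - X @ w
    return float(residuals @ residuals)

# np.linalg.lstsq returns the weights that minimise the objective above.
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("objective at least-squares weights:", sum_squared_residuals(w_ols, X, y))
print("objective at all-zero weights:", sum_squared_residuals(np.zeros(2), X, y))
```

Because the labels here contain no noise, the least-squares weights recover `true_w` and the objective drops to (numerically) zero, while any other choice of weights gives a larger value.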

Meanwhile, when regularizing, we're not only trying to minimize the sum of squared residuals but also trying to keep the coefficients small (read up on coefficients and intercepts in my previous article). Smaller coefficients make the model less sensitive to random noise in the data.

Ridge regularization is one of the most commonly used forms of regularization in linear regression. With ridge regularization, for a chosen value of $\alpha$, we're now trying to find weights, $w$, that minimize this new equation: $$ \alpha ||w||^2 + \sum_{i=1}^{n}(y_i - x_i \cdot w)^2 $$ where $\alpha$ is a non-negative real-valued hyperparameter and $||w||$ represents the L2 norm of the weights. Ridge regularization is sometimes called L2 regularization because it uses the L2 norm.
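For a fixed $\alpha$ and a model without an intercept, this objective has a well-known closed-form minimiser, $w = (X^T X + \alpha I)^{-1} X^T y$. The sketch below checks that formula against scikit-learn's Ridge on synthetic data; the data, the value `alpha = 2.0`, and the choice `fit_intercept=False` are assumptions made only so the two results are directly comparable.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 50 observations, 3 features, a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)
alpha = 2.0

# Closed form: solve (X^T X + alpha * I) w = X^T y for w.
n_features = X.shape[1]
w_closed = np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Same problem with scikit-learn; the intercept is disabled to match the formula.
model = Ridge(alpha=alpha, fit_intercept=False).fit(X, y)

print("closed form:", w_closed)
print("sklearn    :", model.coef_)
```

The two weight vectors should agree to within numerical precision. With the default `fit_intercept=True`, scikit-learn fits an unpenalised intercept as well, so the raw formula above would no longer match exactly.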

The added term, $\alpha ||w||^2$, is referred to as the penalty term because it penalizes larger weight values. Larger values of $\alpha$ put more emphasis on the penalty term, forcing the model to have even smaller weight values.
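The shrinking effect of the penalty term can be seen directly by fitting Ridge with increasing values of alpha and printing the L2 norm of the learned weights. This is only a sketch on synthetic data; the alpha values and the data-generating weights are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical synthetic data: 100 observations, 4 features.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = X @ np.array([3.0, -2.0, 1.0, 0.5]) + rng.normal(scale=0.5, size=100)

alphas = [0.01, 1.0, 100.0, 10000.0]
norms = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X, y)
    # L2 norm of the fitted coefficients (the intercept is not penalised).
    norms.append(float(np.linalg.norm(model.coef_)))
    print(f"alpha={alpha:>8}: ||w|| = {norms[-1]:.4f}")
```

The printed norm gets smaller as alpha grows: a very large alpha pushes the weights towards zero, while a very small alpha leaves the fit close to ordinary least squares.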

The plot above shows an example of a least squares estimate and a ridge regression model, both fitted on the same dataset. The blue lines represent the regression lines for the two points marked with red crosses, (0.5, 0.5) and (1, 1), for each model. The grey lines represent the regression lines fitted on the noisy points (which are marked as grey points).

You can see that in the case of the least squares estimate, the regression lines are heavily influenced by the noise (grey points); the model is basically overfitting itself to the data (read more about overfitting here). In the case of ridge regression, the model adjusts well to the data, the lines show an apparent pattern, and the degree of variance is much lower.
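The same effect can be checked numerically without a plot. The sketch below repeatedly adds noise to the two points (0.5, 0.5) and (1, 1), refits both models, and compares how much the fitted slope varies. The noise scale of 0.1, the value `alpha=0.1`, and the 200 repetitions are assumptions chosen for illustration, not values taken from the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(3)
X_base = np.array([[0.5], [1.0]])  # the two points marked with red crosses
y_base = np.array([0.5, 1.0])

ols_slopes, ridge_slopes = [], []
for _ in range(200):
    # Perturb the x-values with a little Gaussian noise and refit both models.
    X_noisy = X_base + 0.1 * rng.normal(size=X_base.shape)
    ols_slopes.append(LinearRegression().fit(X_noisy, y_base).coef_[0])
    ridge_slopes.append(Ridge(alpha=0.1).fit(X_noisy, y_base).coef_[0])

ols_std = float(np.std(ols_slopes))
ridge_std = float(np.std(ridge_slopes))
print(f"spread of least-squares slopes: {ols_std:.3f}")
print(f"spread of ridge slopes:         {ridge_std:.3f}")
```

The least-squares slope swings widely from one noisy sample to the next, while the ridge slope stays within a narrow band. The price is some bias: the ridge slopes sit below the slope of 1 that passes exactly through the two original points.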

It's time to code; let's see how ridge regression performs in practice.

Please note that I searched really hard for a dataset that could show ridge regression performing better than traditional least-squares estimation, but I was unable to find one that gave a wide difference. So I'm just going to show you how to use ridge regression in code. Now, where were we? Yes... coding. The first step is to import the necessary libraries, which include numpy and pandas.

import pandas as pd
import numpy as np

The next thing is to import the dataset. We'll be using the "diamonds" dataset from the seaborn library (you can read more on seaborn and its datasets here). We'll also preview the dataset using the pandas head() method.

import seaborn as sns

df = sns.load_dataset('diamonds')
df.head()

The price column is going to be our dependent variable ($y$) and the other columns will be the features/independent variables ($x$). If you recall, in part 1 of this series I mentioned that the linear regression algorithm can only be used with numeric data and not categorical data. You might have noticed the categorical data in the columns ['cut', 'color', 'clarity']. We can turn this categorical data into a format suitable for our model by one-hot encoding it (read more on one-hot encoding here).

# One-hot encode the three categorical columns
df = pd.get_dummies(df, prefix=['cut', 'color', 'clarity'])

y = df['price']  # label
x = df.drop('price', axis=1)  # features

Our categorical variables will now look like this.
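As a small illustration of what one-hot encoding does, here is a sketch on a tiny hand-made frame. The rows and values are made up for the example and are not taken from the diamonds dataset.

```python
import pandas as pd

# Hypothetical three-row frame with one categorical column.
toy = pd.DataFrame({
    "carat": [0.23, 0.31, 0.29],
    "cut": ["Ideal", "Premium", "Ideal"],
    "price": [326, 335, 334],
})

# Each distinct value of "cut" becomes its own indicator column.
encoded = pd.get_dummies(toy, columns=["cut"])
print(encoded.columns.tolist())
print(encoded)
```

The `cut` column is replaced by `cut_Ideal` and `cut_Premium`, and each row has a true/1 entry in exactly one of them. The numeric columns are left untouched.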

Next, we split the dataset into training data and testing data.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

It's time to fit our model to the training data to get that nice juicy regression line. Recall the $\alpha$ constant that is used in our penalty term. It is a parameter of the Ridge class in scikit-learn, its default value is 1.0 and, as mentioned earlier, it can be any non-negative real number. The Ridge class can be initialized with the code below.

from sklearn.linear_model import Ridge
ridge = Ridge(alpha=0.5)

Safe to say, the value of alpha used is paramount to the performance of the model. This raises the question of which value of alpha to use, and the truth is that we can't know for sure without trial and error. Instead of doing this trial and error manually, scikit-learn already has a class that can do it for us using cross-validation: RidgeCV, which lives in the linear_model module just like Ridge.

from sklearn.linear_model import RidgeCV

# define model
ridge = RidgeCV(alphas=np.arange(0.01, 10, 0.01))

# fit model
ridge.fit(X_train, y_train)

# summarize chosen configuration
print('alpha chosen: %f' % ridge.alpha_)

In the code above, RidgeCV was initialized with a set of candidate alpha values from 0.01 up to (but not including) 10.0 in steps of 0.01, generated using numpy's arange() function, and it chooses among them by cross-validation. The model was fit on the training data and the alpha value chosen was 1.05.
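To show the idea behind this kind of search, the sketch below does the trial and error by hand: it scores Ridge with 5-fold cross-validation for each candidate alpha and keeps the best one. This is not how RidgeCV is implemented internally (by default it uses an efficient form of leave-one-out cross-validation); the synthetic data, the log-spaced grid, and the choice of 5 folds are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data with two strongly correlated columns.
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=120)
y = X @ np.array([2.0, 0.0, -1.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=120)

# Candidate alphas on a log scale, from 0.001 to 1000.
alphas = np.logspace(-3, 3, 13)

mean_scores = []
for alpha in alphas:
    # For regressors, cross_val_score uses the estimator's R^2 score by default.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    mean_scores.append(float(scores.mean()))

best_alpha = float(alphas[int(np.argmax(mean_scores))])
print("best alpha by 5-fold cross-validation:", best_alpha)
```

A log-spaced grid covers several orders of magnitude with few candidates, which can be a cheaper first pass before refining with a fine linear grid like the one used above.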

Let's evaluate how the model performed. We can check the $R^2$ value for the training and the test data to see if there was overfitting.

y_pred = ridge.predict(X_test)
print("Training score:", ridge.score(X_train, y_train))
print("Testing score:", ridge.score(X_test, y_test))

In the code above, we made predictions on the test set and then computed the score on the training and test data. The training score, 0.919132151370367, was very close to the testing score, 0.9223028509247448, which suggests that almost no overfitting is going on and that the model is not fitting the noise that may be in the data.
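A useful companion check is to fit plain LinearRegression on the same split and compare the two pairs of scores. The sketch below does this on synthetic data with two nearly identical columns; the data, `alpha=1.0`, and `random_state=0` are assumptions for illustration, and the numbers it prints say nothing about the diamonds results above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: columns 0 and 1 are almost identical.
rng = np.random.default_rng(5)
n = 300
base = rng.normal(size=n)
X = np.column_stack([
    base,
    base + 0.01 * rng.normal(size=n),
    rng.normal(size=n),
])
y = 3.0 * base + X[:, 2] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for name, model in [("ols", LinearRegression()), ("ridge", Ridge(alpha=1.0))]:
    model.fit(X_train, y_train)
    # score() returns R^2 for regressors.
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:>5}: train R^2 = {results[name][0]:.4f}, test R^2 = {results[name][1]:.4f}")
```

Least squares can never score worse than ridge on the training data, because it minimises the training residuals directly. What matters is how the two compare on the test data, and as noted earlier the gap is often small.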

To conclude, in most cases it's advisable to consider ridge regression over plain linear regression (least-squares estimate). Ridge regression is more complex than linear regression, and sometimes the improvement in performance is negligible and may not be needed for your dataset, but one rule of thumb for choosing between the two is to check for linear correlation between the columns of the dataset.
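One simple way to apply that rule of thumb is to look at the pairwise correlation matrix of the feature columns and list the pairs whose absolute correlation is high. The sketch below uses a small synthetic frame; the column names and the 0.8 threshold are arbitrary assumptions, not a standard cut-off.

```python
import numpy as np
import pandas as pd

# Hypothetical features: x2 is almost a multiple of x1, x3 is independent.
rng = np.random.default_rng(6)
x1 = rng.normal(size=200)
features = pd.DataFrame({
    "x1": x1,
    "x2": 2.0 * x1 + 0.1 * rng.normal(size=200),
    "x3": rng.normal(size=200),
})

# Absolute Pearson correlation between every pair of columns.
corr = features.corr().abs()
threshold = 0.8
cols = corr.columns

# Collect each pair once (upper triangle only).
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print("highly correlated pairs:", high_pairs)
```

If this list is non-empty for your own features, that is a hint that ridge regression may be worth trying. Pairwise correlation will not catch every linear dependence between three or more columns, so treat it as a quick first check.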