Exploring Least Squares Estimate

A few months ago I wrote a short article, in the form of a Kaggle notebook, titled Linear Regression from scratch (you can read it "here"). At the time I thought I knew all there was to know about Linear Regression in relation to Machine Learning/Data Science, but I recently discovered I had only scratched the surface. You are probably where I was those many months ago, oblivious to the wonderfully complicated yet simple world of Linear Regression in Data Science. If so, welcome to your enlightenment and take a seat; it's going to be a mathematical ride.

Before I go into what linear regression is, we should talk about categorical data and numerical data.

Categorical data is information that is divided into groups; as the name implies, each value belongs to one or more categories. Examples of categorical data include gender, ratings, colours and so on.
Numerical data is data in the form of numbers rather than descriptions. Examples include prices, time, age, height and so on.

Linear Regression simply refers to using a linear model to represent the relationship between a set of independent variables and a dependent variable. The dependent variable is in the form of numerical data and the relationship being represented is a linear one.

Let's take a look under the hood. You probably remember this algebra formula from high/secondary school:

$$ y = mx + c $$

The equation above is a simple linear model. To make it easier to understand, let's adjust the letters we use:

$$ y = a_1 x + a_2 $$

The independent variable is labelled x and the dependent variable is labelled y. $a_1$ is the coefficient and $a_2$ is the intercept. These help our model to fit; more on that later.

The basic form of linear regression used in machine learning is the least-squares estimation method. This strategy produces a regression model that is a linear combination of the independent variables, chosen to minimize the sum of squared residuals, that is, the squared differences between the model's predictions and the actual values of the dependent variable (you can read more on the sum of squared residuals here).
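To make the idea of minimizing the sum of squared residuals concrete, here is a minimal sketch in plain Python. The toy data and the helper names `sum_squared_residuals` and `least_squares_fit` are made up for illustration; the closed-form formulas are the standard ones for a single independent variable, not necessarily what any particular library does internally.

```python
# Toy data invented for illustration (not the salary dataset used later).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]


def sum_squared_residuals(a1, a2, x, y):
    """Sum of squared differences between each actual y and a1 * x + a2."""
    return sum((yi - (a1 * xi + a2)) ** 2 for xi, yi in zip(x, y))


def least_squares_fit(x, y):
    """Closed-form least-squares coefficient and intercept for one feature."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    covariance = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    variance = sum((xi - x_mean) ** 2 for xi in x)
    a1 = covariance / variance   # coefficient (slope)
    a2 = y_mean - a1 * x_mean    # intercept
    return a1, a2


a1, a2 = least_squares_fit(x, y)
best = sum_squared_residuals(a1, a2, x, y)
# Nudging the coefficient away from the fitted value makes the error larger.
nudged = sum_squared_residuals(a1 + 0.1, a2, x, y)

print(f"a1 = {a1:.2f}, a2 = {a2:.2f}")
print(f"SSR at the fit: {best:.3f}, SSR after nudging a1: {nudged:.3f}")
```

Any other choice of coefficient and intercept gives a larger sum of squared residuals than the fitted pair, which is exactly what "least squares" means.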

I mentioned that the coefficient and intercept help the linear model to fit. This means they let the model adapt itself to the data we currently have: every dataset is different, so there is no one-size-fits-all model, which is what makes model fitting important. To fit our model, the coefficient and intercept are adjusted until the sum of squared residuals (between predictions and actual values) is as small as it can be. Now it's time to move this into computer code. Least-squares linear regression is implemented by the LinearRegression class in the linear_model module of the scikit-learn library.

First, the supporting libraries like numpy and pandas need to be imported.

import pandas as pd
import numpy as np

We now load our dataset and review a section of it. A simple dataset was selected for the purpose of this article; you can find it here. As seen below, the years of experience will be the "x" of the linear equation while the salary paid is the "y". The coefficient $a_1$ and the intercept $a_2$ will be obtained once the model is done fitting.

location = "dataset/Salary_Data.csv"

df = pd.read_csv(location)
print(df.head())

   YearsExperience   Salary
0              1.1  39343.0
1              1.3  46205.0
2              1.5  37731.0
3              2.0  43525.0
4              2.2  39891.0

The data is preprocessed next, but I won't be covering that as it's outside the scope of this write-up; you can refer to the GitHub link to the complete code at the end of the page. Next, we import and initialize the linear regression model. We are going to use all the default parameters.

from sklearn.linear_model import LinearRegression

model = LinearRegression()
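The fitting step uses `X_train`, `X_test`, `y_train` and `y_test`, which come from the preprocessing that this article skips. As a rough sketch of one way such variables could be produced, the snippet below splits the data with scikit-learn's `train_test_split`. The names `X` and `y`, the 80/20 split and the `random_state` value are all assumptions for illustration; the actual preprocessing in the linked code may differ. The small DataFrame is just the five rows printed earlier, standing in for the full dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the loaded dataset: only the five rows shown by df.head().
df = pd.DataFrame({
    "YearsExperience": [1.1, 1.3, 1.5, 2.0, 2.2],
    "Salary": [39343.0, 46205.0, 37731.0, 43525.0, 39891.0],
})

# Double brackets keep X and y two-dimensional (one column each).
# A 2-D y is an assumption; it matches the coef_[0][0] indexing used later.
X = df[["YearsExperience"]].to_numpy()
y = df[["Salary"]].to_numpy()

# Hold out 20% of the rows for testing (assumed split, fixed seed).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Keeping a held-out test set means the model can later be checked on data it never saw during fitting.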

The next step is to fit the initialized model to our training data. The fit method of the LinearRegression class takes in the array of independent variables (x) and the dependent variable (y) and tries to find the line that best fits the data points on the graph. Don't understand? Let's fit the model to this data and visualize it to see what I'm talking about.

model.fit(X_train, y_train)

We get the coefficient and intercept using the code below.

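# This indexing assumes y_train is two-dimensional (shape (n, 1)),
# so coef_ has shape (1, 1) and intercept_ has shape (1,)
coefficient = model.coef_[0][0]
intercept = model.intercept_[0]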

print(f"The coefficient is {coefficient}")
print(f"The intercept is at {intercept}")

Now that we have our values, we can visualize them by plotting them using matplotlib.pyplot.

import matplotlib.pyplot as plt

predictions = model.predict(X_test)  # predicted salaries, used to draw the line

plt.scatter(X_test, y_test)  # scatter plot of the test data
plt.plot(X_test, predictions, color='red')  # regression line of the model
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

[Figure: Regression Line of the Trained Model]

The red line running diagonally is known as the regression line. When the model is finding the best fit, it is finding the position of this line that makes the sum of squared vertical distances between the data points and the line as small as possible. You can read more on regression line here. After all we've done, we can go ahead and see what our model has done by filling in our coefficient and intercept into our linear model (the linear equation):

$$ y = 9423.8x + 25321.6 $$

Where:

$$ a_1 = \text{coefficient} = 9423.8 $$
$$ a_2 = \text{intercept} = 25321.6 $$
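As a quick illustration of what this equation means in practice, the sketch below plugs the reported coefficient and intercept into the linear model by hand. The function name `predict_salary` and the choice of 5 years of experience are made up for the example.

```python
# Coefficient and intercept as reported for the fitted model above.
a1 = 9423.8
a2 = 25321.6


def predict_salary(years_experience):
    """Evaluate the fitted line y = a1 * x + a2 for a given x."""
    return a1 * years_experience + a2


# 5 years of experience is an arbitrary example input.
print(f"Predicted salary for 5 years: {predict_salary(5):.1f}")
```

This is the same arithmetic that `model.predict` performs for a single feature; the intercept is simply the predicted salary at zero years of experience.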

And that's it. To see how well the model fits, we can find the coefficient of determination, a.k.a. the r² value. It is available through the score method of LinearRegression and can be computed using the code below.

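# predict the test data
predictions = model.predict(X_test)

from sklearn.metrics import r2_score

# r² on the training data
score = model.score(X_train, y_train)
print(score)

# r² on the held-out test data, computed from the predictions above
print(r2_score(y_test, predictions))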

Normally, the r² value lies between 0 and 1 and shows how closely the linear model fits the data. In scikit-learn, the r² value can range from -∞ to 1, since a model can perform so badly (worse than always predicting the mean of y) that its score falls below 0. Our model's score, 0.9645401573418146, is very close to 1, which is quite good.
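To see where these numbers come from, here is a small sketch that computes r² by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the toy values are invented for illustration; the formula is the usual definition of the coefficient of determination.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values invented for illustration.
actual = [1.0, 2.0, 3.0, 4.0]
close_predictions = [1.1, 1.9, 3.2, 3.8]     # near the actual values
reversed_predictions = [4.0, 3.0, 2.0, 1.0]  # worse than predicting the mean

print(r_squared(actual, actual))                # perfect predictions
print(r_squared(actual, close_predictions))     # close to 1
print(r_squared(actual, reversed_predictions))  # negative
```

The last case shows how the score drops below 0: the predictions are further from the actual values than a flat line at the mean would be.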

If we want to use the linear regression model on a dataset that has multiple features/columns, each feature is represented as a new x variable in the equation:

$$ y = a_1 x_1 + a_2 x_2 + \dots + a_{i-1} x_{i-1} + a_i $$

Where $a_1$ to $a_{i-1}$ are the coefficients and $a_i$ is the intercept.
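A minimal sketch of the multiple-feature case is shown below. The data is synthetic and generated without noise from a known equation, so we can check that the fitted coefficients and intercept match it; the variable names and the numbers 2, 3 and 5 are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data with two features, invented for illustration.
X_multi = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 4.0],
    [4.0, 3.0],
    [5.0, 6.0],
])
# Noise-free target built from y = 2*x1 + 3*x2 + 5.
y_multi = 2 * X_multi[:, 0] + 3 * X_multi[:, 1] + 5

multi_model = LinearRegression().fit(X_multi, y_multi)

# One coefficient per feature, plus a single intercept.
print(multi_model.coef_)
print(multi_model.intercept_)
```

With a one-dimensional target, `coef_` holds one value per feature column and `intercept_` is a single number, mirroring the equation above.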

Remember that the LinearRegression class is based on the least-squares estimation method (read more here). It works best when the dataset's features (i.e. each x) are not strongly linearly correlated with one another. But what happens when many of the dataset's features are linearly correlated? That is where Ridge Regression comes in. I won't be covering ridge regression in this write-up; that will be in Part 2 of my Linear Regression in Machine Learning series.

Thank you for reading, you deserve a milkshake or whatever she's offering!

You can get the code used in this article here.