
ML Notes #1 – Linear Regression from Scratch

I don’t want to pursue ML. Yeah, I don’t. The reason: “I can’t explain the functionality clearly to another person. If the other person asks any questions, then I am done.” But I do want to go deep into ML again to understand it. I also don’t want to get lost in this AI wave.

In this blog, I try to explain Linear Regression, which I learned from Navneeth (Nunnari Labs).

Prerequisite

I would recommend going through the YouTube video below on Gradient Descent.

Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. Regression models a target prediction value based on independent variables.

It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables, and in the number of independent variables they use.

What is Linear Regression?

Simple linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable and the other is the response or dependent variable. It looks for a statistical relationship, not a deterministic relationship.

The relationship between two variables is said to be deterministic if one variable can be accurately expressed in terms of the other.

For example,

Using a temperature in degrees Celsius, it is possible to accurately compute the temperature in Fahrenheit. A statistical relationship, on the other hand, is not exact in determining the relationship between two variables; for example, the relationship between height and weight.

The core idea is to obtain the line that best fits the data. The best fit line is the one for which the total prediction error (over all data points) is as small as possible. The error is the distance between a point and the regression line.

1. Consider the dataset below:

height = [150, 160, 155, 173, 180, 169]
weight = [48, 63, 60, 67, 72, 66]

2. Now, the user gives a new height, 165, and we need to find the (approximate) weight of that person. As a first go, we can think of a general solution: take the mean of the weights, and for a person of any height, return that constant mean value. (Naive Idea)

import numpy as np

height = np.array([150, 160, 155, 173, 180, 169])
weight = np.array([48, 63, 60, 67, 72, 66])
mean_weight = weight.mean()
# Mean Weight = 62.666666666666664



3. We know that taking the mean won’t be a good solution. So we can think of a different idea: let’s take a closer look at the data.

height = [150, 160, 155, 173, 180, 169]
weight = [48, 63, 60, 67, 72, 66]

From the data we can see that the height divided by 3 gives an approximate value of the weight. Unlike Idea 1 (taking the average), this prediction at least changes with the height.

Before going to the next idea, we need a mechanism to measure how good our model is. For that, we can look at the residuals between the line and the points.

Residual = Actual Value – Predicted Value.
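As a quick sanity check, here is a minimal sketch of the residuals for the height/3 idea (same dataset as above):

import numpy as np

height = np.array([150, 160, 155, 173, 180, 169])
weight = np.array([48, 63, 60, 67, 72, 66])

# Idea 2: predict weight as height / 3
predicted = height / 3

# Residuals: actual minus predicted, one per data point
residuals = weight - predicted
print(residuals)  # [-2.0, 9.67, 8.33, 9.33, 12.0, 9.67] (rounded)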

4. We can validate using MSE (Mean Squared Error).

import numpy as np

def mse(y, y_hat):
  return np.mean((y - y_hat)**2)

So now, if we try to reduce this error, we can get the optimal solution to our problem.

5. Since we are trying to fit a line, the equation of the line will be:

Y = M * X + C

Y – Predicted Value
M – Slope of the Line
X – Input Value
C – Y-Intercept ( the point where the line crosses the Y-axis )

So here, in our scenario, Y is the Weight and X is the Height. We need to substitute some values for the slope (M) and the intercept (C) to get the equation of the best fit line.
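For instance, plugging in the hypothetical values M = 1/3 and C = 0 (roughly our “divide by 3” rule) for the new height of 165 from step 2:

# Hypothetical values just to illustrate the equation
m, c = 1/3, 0
height_new = 165
predicted_weight = m * height_new + c
print(predicted_weight)  # 55.0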

6. Now the problem is narrowed down to substituting values for both M and C to get the optimal solution. Initially, we can generate candidate values over a grid and calculate the error value for each pair.

x = np.array(height)

for temp_m in np.linspace(0, 1, num=100):
  for temp_c in np.linspace(0, 1, num=100):
    line = temp_m*x + temp_c
    print(temp_m, temp_c, "error", mse(weight, line))

7. After executing the code, you will see how the error changes with the values. We could restrict the search to some start and end range to find the best values, but this process is tiresome. Is there any way to resolve this issue?

…

Yeah, we can use the gradient descent algorithm (I assume you went through the YouTube video suggested at the beginning).

Here, we will see a small overview of gradient descent and the need for it.

Let us hold C at a constant value and try different (random) values for M. Then we can see how the error behaves.

c = 1
error_val = []
temp_m_vals = np.linspace(0, 1, num=100)

for temp_m in temp_m_vals:
    line = temp_m*x + c
    error_val.append(mse(weight, line))

If we plot temp_m_vals on the x-axis and error_val on the y-axis, we can see a U-shaped curve like this:
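A minimal matplotlib sketch to reproduce that plot (using the variables from the loop above):

import matplotlib.pyplot as plt

plt.plot(temp_m_vals, error_val)
plt.xlabel("m (slope)")
plt.ylabel("MSE")
plt.title("Error vs slope, with c fixed at 1")
plt.show()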

From the plot it is clear that the error is at its minimum when the model value is in the range 0.3 to 0.45. Our optimal solution is hidden in there. We could repeat the same steps above, just changing the linspace range to 0.3 to 0.45, but that won’t be a generic approach for all problems.
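That manual zoom-in could look like the sketch below; it is still hand-tuned, which is exactly the problem:

# Narrow the grid for m to the promising range and pick the best value
best_m = min(np.linspace(0.3, 0.45, num=100),
             key=lambda temp_m: mse(weight, temp_m * x + c))
print(best_m)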

At the same time, we want to avoid doing more calculations than necessary to get the optimal solution. What to do?

…

Let’s look at nature.

Our motive is to reach the lowest point of the error value.

Looking at the above two images, we can consider a person climbing down a mountain (a U-shaped mountain).

What will his approach be while climbing down?

  1. When the slope is steep, he will take a big step.
  2. When the slope is less steep, he will take smaller steps.

By doing this iteratively, he will reach the minimum point (in his view, the ground).

This process is known as the Gradient Descent Algorithm (taking big steps when we are away from the minimum point and small steps when we are near it).

In our scenario, we simply need to take larger steps where the slope is steeper and smaller steps where it is less steep.

So now our problem is narrowed down to finding the slope of the curve, moving to a new point, and continuing the process until the error is ideally 0 or close to 0 (which means low error).

To find the slope of the curve, we can use calculus (differentiation). The curve in the above image is formed by the MSE equation.

def mse(y, y_hat):
  return np.mean((y - y_hat)**2)

Let’s differentiate and solve it:

1. Initially, let m = 0 and c = 0. Let L be our learning rate (how fast we move). It controls how much the values of m and c change with each step. L could be a small value like 0.0001 for good accuracy.

2. Calculate the partial derivative of the loss function (MSE) with respect to m (to find the slope of the curve), and plug in the current values of x, y, m and c to obtain the derivative value Dm:

Dm = (-2/n) * Σ x * (y - y_hat)

3. Similarly, let’s find the partial derivative with respect to c, Dc:

Dc = (-2/n) * Σ (y - y_hat)

4. After finding the slope w.r.t. m and the slope w.r.t. c, we update both values by stepping against the slope:

m = m - L * Dm
c = c - L * Dc

5. We repeat this process until our loss is a very small value or ideally 0 (which means 0 error, i.e. 100% accuracy). The values of m and c we are left with will be the optimum values.

# Imports
import numpy as np

# Dataset
height = np.array([150, 160, 155, 173, 180, 169])
weight = np.array([48, 63, 60, 67, 72, 66])

# Define error function (MSE)
def error(y, y_hat):
  return np.mean(np.square(y - y_hat))

# Scale values into the range [0, 1]
def normalize(x):
  return (x - min(x)) / (max(x) - min(x))

x = normalize(height)
y = normalize(weight)

# Model Weights
m = 0.1
c = 0.1

# Hyperparameters
learning_rate = 0.001 # Try 0.1, 0.001, 0.0001
epochs = 30000 # The number of iterations to perform gradient descent

n = len(x)
error_vals = []

# Performing Gradient Descent
for i in range(epochs):
  # Forward pass: the current predicted value of y
  y_hat = m*x + c

  # Record the Mean Squared Error (just for plotting purposes)
  error_vals.append(error(y, y_hat))

  # Find the slopes
  de_dm = (-2/n) * np.sum(x * (y - y_hat)) # Derivative of error w.r.t. m
  de_dc = (-2/n) * np.sum(y - y_hat)       # Derivative of error w.r.t. c

  # Update the weights
  m = m - (de_dm * learning_rate)
  c = c - (de_dc * learning_rate)

print("Intercept", c)
print("Slope", m)

Note: In the model implementation, we normalize the values to avoid overflowing to infinity. On the raw data, the gradients involve large numbers; with this learning rate the updates diverge, and the squared values quickly overflow (+inf, -inf). Normalizing keeps everything in a stable range and avoids this issue.
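To see this for yourself, here is a small sketch (reusing the height and weight arrays from above) that runs the same loop on the raw, un-normalized data:

# Same gradient-descent loop, but on the raw data (no normalization)
m_raw, c_raw = 0.1, 0.1
x_raw = height.astype(float)
y_raw = weight.astype(float)

for i in range(200):
    y_hat = m_raw * x_raw + c_raw
    de_dm = (-2/len(x_raw)) * np.sum(x_raw * (y_raw - y_hat))
    de_dc = (-2/len(x_raw)) * np.sum(y_raw - y_hat)
    m_raw = m_raw - de_dm * 0.001
    c_raw = c_raw - de_dc * 0.001

# The updates overshoot further on every step, so the values blow up
# (overflow warnings, then inf/nan) instead of converging.
print(m_raw, c_raw)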

Comparison with the sklearn Module

scikit-learn (sklearn) is a Python package dedicated to machine learning. Now let us compare our implementation with the package’s implementation.

from sklearn.linear_model import LinearRegression
import pandas as pd

# sklearn expects a 2-D feature matrix, so wrap x in a DataFrame
train_df = pd.DataFrame({'x': x})
X = train_df[['x']]

reg = LinearRegression()
reg.fit(X, y)

# Our scratch implementation vs the sklearn implementation
print(f"From scratch M {m}, C {c}")
print(f"From sklearn module M {reg.coef_[0]} C {reg.intercept_}")

Try this in the shared Colab notebook, and please comment below if you have any queries about the implementation.

Notebook Link: https://colab.research.google.com/drive/1dprgWzcNYFhVRfdEmDlQbqFJGyaGimMb?usp=sharing
