Implementing a Simple Linear Regression Machine Learning Model

Introduction

Suppose we are given a dataset that lists house prices against the area of each house. We are asked to build a Machine Learning model that predicts the price of a house from the area we supply as input.

Area (sq ft)    Price (₹)
2300            500000
2600            550000
2750            530000
3000            565000
3100            600000
3200            610000
3600            680000
3750            690000
4000            725000

A simple way to deal with this problem is Linear Regression.

What is Linear Regression?

Linear regression is a type of predictive analysis used in statistics, where the value of one variable is predicted from the value of one or more other variables. The value we want to predict is called the dependent variable, while the values used to calculate it are called the independent variables.

Hence the equation for linear regression with one dependent variable and n independent variables looks like:

$$y = \mu + \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3 + \dots + \mu_n x_n$$ where y is the dependent variable, µ is the intercept, µ₁...µₙ are the coefficients and x₁...xₙ are the independent variables.

Since we have only one independent variable, the area of the house, we consider the equation $$y = \mu + \mu_1 x_1$$ In this case, y is the price of the house and x₁ is the area of the house. Our task now is to find the values of µ and µ₁.

We can also refer to µ as the y-intercept and to µ₁ as the slope of the linear equation.

A linear regression model that involves only one independent variable is known as Simple Linear Regression.

We now draw a scatter plot with price on the y axis and area on the x axis.

Our goal is to establish a linear relationship between x and y. To do so, we need a straight line that passes through the points on our graph. The line should be drawn such that the value of $$\displaystyle\sum\limits_{i=1}^{n}(\Delta_i)^2$$ is minimised, where n is the number of data points and $$\Delta_i$$ is the difference between the actual price and the predicted price (according to the line) for the i-th house.
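To make the quantity being minimised concrete, here is a minimal sketch that computes the sum of squared differences for two candidate lines over the table from the introduction. The function name and both candidate lines are hypothetical, chosen only for illustration.

```python
# Area (sq ft) and price pairs from the table in the introduction.
areas = [2300, 2600, 2750, 3000, 3100, 3200, 3600, 3750, 4000]
prices = [500000, 550000, 530000, 565000, 600000, 610000, 680000, 690000, 725000]


def sum_squared_error(intercept, slope):
    """Sum of squared differences between actual and predicted prices."""
    total = 0.0
    for area, price in zip(areas, prices):
        predicted = intercept + slope * area
        total += (price - predicted) ** 2
    return total


# Two hypothetical candidate lines: a flat one and a sloped one.
sse_flat = sum_squared_error(600000, 0)
sse_sloped = sum_squared_error(190000, 133)

# The line with the smaller value fits the data better.
print(sse_flat, sse_sloped)
```

The sloped line gives a much smaller sum than the flat one, which is why it is the better fit. Linear regression looks for the intercept and slope that make this sum as small as possible.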

You might wonder why we square the differences instead of simply adding them up.

There are two reasons for this: a) The differences can be negative or positive, so they could cancel each other out when added. Squaring makes every term non-negative. b) Squared differences make it easier to derive the regression line. To find that line we need the first derivative of the loss (error) function, and the square is smooth everywhere, whereas the absolute value is not differentiable at zero.
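As a sketch of that derivation for the one-variable case (writing x for x₁, assuming n data points and using x̄ and ȳ for the means of the areas and prices), the loss is

$$S(\mu, \mu_1) = \sum_{i=1}^{n} (y_i - \mu - \mu_1 x_i)^2$$

Setting both first derivatives to zero gives

$$\frac{\partial S}{\partial \mu} = -2 \sum_{i=1}^{n} (y_i - \mu - \mu_1 x_i) = 0$$

$$\frac{\partial S}{\partial \mu_1} = -2 \sum_{i=1}^{n} x_i (y_i - \mu - \mu_1 x_i) = 0$$

and solving these two equations yields

$$\mu_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \mu = \bar{y} - \mu_1 \bar{x}$$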

The line passing would look like this:

image.png

Now that we know what the line should look like, we will use Python libraries such as matplotlib, sklearn and pandas to implement this model and obtain the linear equation.

Implementing Linear Regression in Python

We begin by importing the important libraries:

import matplotlib.pyplot as plt
import pandas as pd
  • Matplotlib is used to plot graphs.
  • Pandas is used for data handling (like handling .csv files)

Importing the dataset:

dataset = pd.read_csv("dataset.csv")
x = dataset.iloc[:,:-1].values
y = dataset.iloc[:,1].values
  • Using pandas library we have imported the .csv dataset file.
  • We have split the dataset into two parts, x and y, where x is the area and y is the property price.
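The sketch below rebuilds the table from the introduction in memory, so the slicing can be tried without a dataset.csv file. The column names "Area" and "Price" are assumptions about the CSV header.

```python
import pandas as pd

# In-memory stand-in for dataset.csv; the column names are assumed.
dataset = pd.DataFrame({
    "Area": [2300, 2600, 2750, 3000, 3100, 3200, 3600, 3750, 4000],
    "Price": [500000, 550000, 530000, 565000, 600000,
              610000, 680000, 690000, 725000],
})

x = dataset.iloc[:, :-1].values  # every column except the last: a 2D array
y = dataset.iloc[:, 1].values    # the second column: a 1D array

print(x.shape)  # (9, 1)
print(y.shape)  # (9,)
```

Note that x keeps two dimensions (rows and one feature column) while y is flat. This is the layout sklearn expects for features and targets.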

Splitting the dataset into the Training set and Test set:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=0)
  • We need to split our dataset into two parts:
    • Training data on which our machine learning model is trained.
    • Testing data which is used to test the accuracy of our model.
  • Usually 80% of the data is used as training data, and the remaining 20% is used as testing data.
  • To divide the data we use Python's sklearn library.
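As a quick check of what the split produces, the sketch below applies the same call to the nine rows from the introduction (typed in directly as arrays here, which is an assumption about what dataset.csv contains).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# The nine rows from the introduction, with x shaped as one feature column.
x = np.array([2300, 2600, 2750, 3000, 3100, 3200, 3600, 3750, 4000]).reshape(-1, 1)
y = np.array([500000, 550000, 530000, 565000, 600000,
              610000, 680000, 690000, 725000])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# 20% of 9 rows is rounded up to 2 test rows, leaving 7 for training.
print(len(x_train), len(x_test))  # 7 2
```

Fixing random_state makes the shuffle reproducible, so the same rows land in the test set on every run.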

Training the Simple Linear Regression model on the Training set:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(x_train,y_train)
  • Thankfully we don't have to implement Linear Regression ourselves: sklearn provides a LinearRegression class that does it for us.
  • We train the model on our training dataset, i.e. (x_train, y_train).
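For intuition about what fitting computes when there is a single feature, here is a minimal pure-Python sketch of the standard least-squares formulas for slope and intercept. The function name is hypothetical, and this is not how sklearn is implemented internally. It is fitted on all nine rows rather than on the training subset, so its numbers will not match the regressor's exactly.

```python
# Area (sq ft) and price pairs from the table in the introduction.
areas = [2300, 2600, 2750, 3000, 3100, 3200, 3600, 3750, 4000]
prices = [500000, 550000, 530000, 565000, 600000, 610000, 680000, 690000, 725000]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sum of squared deviations of x from its mean.
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_simple_linear_regression(areas, prices)
print(intercept, slope)
```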

Predicting the Test set results:

y_predicted = regressor.predict(x_test)
  • Once we are done with training, we can test our model by using the test data that we prepared in the previous steps.
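One way to judge the predictions is to print them next to the actual test prices and summarise the gap with the mean absolute error. The sketch below is self-contained and assumes the nine rows from the introduction. The choice of mean absolute error as the metric is an illustration, not something the steps above prescribe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# The nine rows from the introduction.
x = np.array([2300, 2600, 2750, 3000, 3100, 3200, 3600, 3750, 4000]).reshape(-1, 1)
y = np.array([500000, 550000, 530000, 565000, 600000,
              610000, 680000, 690000, 725000])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

regressor = LinearRegression()
regressor.fit(x_train, y_train)
y_predicted = regressor.predict(x_test)

# Compare each predicted price with the actual price of the same house.
for area, actual, predicted in zip(x_test[:, 0], y_test, y_predicted):
    print(f"area={area}  actual={actual}  predicted={predicted:.2f}")

# Average absolute gap between actual and predicted prices, in price units.
mae = mean_absolute_error(y_test, y_predicted)
print(f"mean absolute error: {mae:.2f}")
```

With only two test rows the error estimate is rough. A larger dataset would give a more trustworthy figure.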

Visualising the Training set results:

plt.scatter(x_train,y_train, color='red')
plt.plot(x_train, regressor.predict(x_train), color='blue')
plt.title("Property Price")
plt.xlabel("Area")
plt.ylabel("Price")
plt.show()

Screenshot_20220623_202618.png

Visualising the Test set results:

plt.scatter(x_test,y_test, color='red')
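# The fitted line is the same whichever points are plotted, so it is drawn from the training data again
plt.plot(x_train, regressor.predict(x_train), color='blue')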
plt.title("Property Price")
plt.xlabel("Area")
plt.ylabel("Price")
plt.show()

Screenshot_20220623_202651.png

Making a single prediction (for example the price of a house with 3210 sq feet area)

print(regressor.predict([[3210]]))

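Therefore our model predicts that a house with 3210 sq feet of area costs approximately ₹ 617003.63, which is the value given by the coefficients reported below: 132.87373004 × 3210 + 190478.95500725694.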

Important Note: Notice that the value of the feature (3210 sq feet) was passed in a double pair of square brackets. That's because the "predict" method always expects a 2D array as its input, and putting 3210 into a double pair of square brackets makes the input exactly a 2D array. Simply put:

$$3210 \rightarrow \textrm{scalar}$$

$$[3210] \rightarrow \textrm{1D array}$$

$$[[3210]] \rightarrow \textrm{2D array}$$
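The difference between the three forms can be checked with NumPy, as in this small sketch. The second area value, 3500, is a made-up example used only to show reshaping several inputs at once.

```python
import numpy as np

scalar = np.array(3210)
one_d = np.array([3210])
two_d = np.array([[3210]])

# Number of dimensions of each form.
print(scalar.ndim, one_d.ndim, two_d.ndim)  # 0 1 2

# A 2D input has shape (number of samples, number of features).
print(two_d.shape)  # (1, 1)

# reshape(-1, 1) turns a flat list of areas into one feature column.
areas = np.array([3210, 3500]).reshape(-1, 1)
print(areas.shape)  # (2, 1)
```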

Getting the final linear regression equation with the values of the coefficients

print(regressor.coef_)
print(regressor.intercept_)

The output gives a coefficient of 132.87373004 and an intercept of 190478.95500725694.

Therefore the equation of our simple linear regression becomes:

$$\textrm{Price of Property} = 132.87 \times \textrm{Area} + 190478.96$$

Important Note: To get these values we read the "coef_" and "intercept_" attributes of our regressor object. Attributes in Python are different from methods: they are accessed without parentheses and simply hold a value or an array of values.

Using this equation we can find the price of a property with any area.
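As a small sketch, the equation can be wrapped in a function. The names below are hypothetical, and the constants are the rounded values from the equation above, so the result can differ slightly from what regressor.predict returns.

```python
# Rounded slope and intercept taken from the equation above.
SLOPE = 132.87
INTERCEPT = 190478.96


def predict_price(area_sq_ft):
    """Price of a property according to the fitted linear equation."""
    return SLOPE * area_sq_ft + INTERCEPT


print(round(predict_price(3210), 2))  # approximately 616991.66
```

Keep in mind that the line was fitted on areas between 2300 and 4000 sq feet, so predictions far outside that range are extrapolations and should be treated with caution.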