1. What is Linear Regression?¶
- Linear regression helps us predict one thing (like house price) based on another (like house size). It finds a straight line that best fits the data.
2. The Equation of Linear Regression¶
The equation looks like this:
$Y = mX + b$
Where:
- $Y$ is what you’re trying to predict (like price),
- $X$ is the input (like size),
- $m$ is the slope (how much $Y$ changes when $X$ changes),
- $b$ is the intercept (the value of $Y$ when $X$ is zero).
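As a quick illustration, here is a minimal Python sketch of the equation; the slope and intercept values are made up for illustration (they happen to match the ones found in the worked example later in this notebook):
# A minimal sketch of Y = m*X + b; the values of m and b are illustrative
m = 0.3   # slope: how much Y changes for each one-unit increase in X
b = 3.3   # intercept: the value of Y when X is zero
X = 10
Y = m * X + b
print(Y)  # 6.3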
3. Collect Data¶
- You need some data to work with. For example:
- X could be the size of houses (in square feet),
- Y could be the price of those houses.
4. Plot the Data¶
- Before doing anything, you can plot a graph of the data. This helps you see if there’s a trend (like bigger houses costing more).
5. Split Your Data¶
- Divide your data into two parts:
- Training data: Used to train the model.
- Test data: Used to check how well the model works on new data.
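In practice this split is often done with scikit-learn's train_test_split; a minimal sketch, where the toy data, the 80/20 split, and the random_state are all arbitrary choices for illustration:
from sklearn.model_selection import train_test_split
import numpy as np

# Toy data: house sizes (X) and prices (y); values made up for illustration
X = np.array([[500], [800], [1000], [1200], [1500]])
y = np.array([100000, 150000, 180000, 210000, 260000])

# Hold out 20% of the rows as test data; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)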
6. Train the Model (Fit the Line)¶
- Now, use the training data to teach the model. It will find the best line (with the best slope and intercept) that fits the data.
7. Check the Line’s Equation¶
After training, you’ll get values for:
- Slope (m): How steep the line is.
- Intercept (b): Where the line crosses the Y-axis.
For example, if the line’s equation is:
$\text{Price} = 200 \times \text{Size} + 30{,}000$
This means that for every 1 square foot increase in house size, the price increases by $200, and the predicted price at a size of zero is $30,000.
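Plugging a hypothetical size into this equation gives a concrete prediction (the 1,000 square feet below is made up purely for illustration):
# Predicted price for a hypothetical 1,000 sq ft house, using the equation above
size = 1000
price = 200 * size + 30000
print(price)  # 230000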
8. Make Predictions¶
- Now that the line is ready, you can use it to predict house prices for new house sizes. Just plug the new house size into the equation.
9. Test the Model¶
- Use the test data to see how well your model is doing. You compare the predicted prices with the actual prices to see how close they are.
10. Check for Errors¶
- You measure the difference between the actual prices and the predicted prices. A smaller difference means the model is good.
- Common ways to measure this are (a short sketch follows this list):
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values; smaller is better.
- R-squared (R²): The fraction of the variation in Y that the line explains (closer to 1 is better).
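A minimal sketch of both metrics using scikit-learn's metrics module; the two lists below are the actual and predicted values from the worked example later in this notebook:
from sklearn.metrics import mean_squared_error, r2_score

y_actual = [3.5, 4.0, 4.5, 4.0, 5.0]     # observed values
y_predicted = [3.6, 3.9, 4.2, 4.5, 4.8]  # values predicted by the fitted line

print(mean_squared_error(y_actual, y_predicted))  # average squared error; smaller is better
print(r2_score(y_actual, y_predicted))            # about 0.69; closer to 1 is better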
11. Analyze the Errors (Residuals)¶
- Residuals are the differences between actual and predicted values. You can check these to see if your model missed any patterns.
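With NumPy arrays of actual and predicted values, the residuals are just an element-wise difference; a minimal sketch (again using the values from the worked example below), where plotting them against x is a common check:
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y_actual = np.array([3.5, 4.0, 4.5, 4.0, 5.0])
y_predicted = np.array([3.6, 3.9, 4.2, 4.5, 4.8])

residuals = y_actual - y_predicted   # positive means the model under-predicted
plt.scatter(x, residuals)
plt.axhline(0, color='k')            # residuals should scatter randomly around zero
plt.show()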
12. Use the Model for Future Predictions¶
- Once you’ve trained and tested the model, you can use it to predict prices for new houses.
Let’s walk through an example step by step.¶
Import the Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Make a list or Read the Data
In [2]:
L = [[1,3.5],[2,4],[3,4.5],[4,4],[5,5]]
print(L)
[[1, 3.5], [2, 4], [3, 4.5], [4, 4], [5, 5]]
Convert list into DataFrame
In [3]:
df = pd.DataFrame(L,columns=['x','y'])
df
Out[3]:
|   | x | y   |
|---|---|-----|
| 0 | 1 | 3.5 |
| 1 | 2 | 4.0 |
| 2 | 3 | 4.5 |
| 3 | 4 | 4.0 |
| 4 | 5 | 5.0 |
In regression, the features column contains the independent variables or predictors that are used to model and predict the dependent variable (or outcome). These features provide the input data that the regression algorithm analyzes to understand and predict the relationships between variables.¶
Choose the column that contains variables you believe will influence and help predict the target data.
In [4]:
x = df['x'].values
x
Out[4]:
array([1, 2, 3, 4, 5], dtype=int64)
Select the column you want to predict, known as the target variable, which is the outcome you aim to model using the other features.
In [5]:
y = df['y'].values
y
Out[5]:
array([3.5, 4. , 4.5, 4. , 5. ])
Plot scatter on x and y
In [6]:
plt.scatter(x,y)
plt.show()
Formula to find slope & intercept:
$m = \dfrac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i}(x_i - \bar{x})^2}, \quad c = \bar{y} - m\,\bar{x}$
Calculate mean of x and y
In [7]:
mean_x = x.mean()
mean_x
Out[7]:
3.0
In [8]:
mean_y = y.mean()
mean_y
Out[8]:
4.2
Slope¶
In [9]:
(x[0] - mean_x) * (y[0]-mean_y)
Out[9]:
1.4000000000000004
In [10]:
(x[0] - mean_x)**2
Out[10]:
4.0
In [11]:
Num = 0
Den = 0
for i in range(len(x)):
    Num = Num + (x[i] - mean_x) * (y[i] - mean_y)
    Den = Den + (x[i] - mean_x)**2
m = Num / Den
print(m)
0.3
Intercept¶
In [12]:
c = mean_y - m*mean_x
c
Out[12]:
3.3000000000000003
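As a quick cross-check, NumPy's polyfit can fit the same degree-1 (straight-line) model and should return the same slope and intercept:
# Fit a straight line to the same data; returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # approximately 0.3 and 3.3, matching m and c above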
Check Prediction of y
In [13]:
y
Out[13]:
array([3.5, 4. , 4.5, 4. , 5. ])
In [14]:
x
Out[14]:
array([1, 2, 3, 4, 5], dtype=int64)
Formula for the prediction of y: yp = m*x + c
In [15]:
yp = m*x[0]+c
yp
Out[15]:
3.6
In [16]:
y_pred = []
for i in range(len(x)):
    yp = m * x[i] + c
    y_pred.append(yp)
print(y_pred)
[3.6, 3.9000000000000004, 4.2, 4.5, 4.800000000000001]
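Since x is a NumPy array, the same predictions can also be computed in a single vectorized step instead of a loop:
# Vectorized alternative to the loop above (broadcasting over the array x)
y_pred_vec = m * x + c
print(y_pred_vec)  # array([3.6, 3.9, 4.2, 4.5, 4.8])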
Plot Scatter x and y
Plot line graph using x and y prediction
In [17]:
plt.scatter(x,y)
plt.plot(x,y_pred,color='k')
plt.show()
Calculate one term of ESS (explained sum of squares)
In [18]:
(y_pred[0]-mean_y)**2
Out[18]:
0.3600000000000001
Calculate one term of TSS (total sum of squares)
In [19]:
(y[0]-mean_y)**2
Out[19]:
0.49000000000000027
Formula to find R²: ESS/TSS (the fraction of the total variation explained by the fitted line)
In [20]:
Num = 0
Den = 0
for i in range(len(y)):
    Num = Num + (y_pred[i] - mean_y)**2
    Den = Den + (y[i] - mean_y)**2
R2 = Num / Den
R2
Out[20]:
0.6923076923076928
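As a cross-check, scikit-learn's r2_score (computed as 1 − RSS/TSS, which equals ESS/TSS for a least-squares fit) should give the same value:
from sklearn.metrics import r2_score
r2_score(y, y_pred)  # approximately 0.6923, matching the value above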
Check Accuracy (R² expressed as a percentage)
In [21]:
Acc = R2*100
Acc
Out[21]:
69.23076923076928
Predict y for a new x value (x = 10)
In [22]:
yp = m*10+c
yp
Out[22]:
6.300000000000001
Using Scikit-Learn¶
Choose the column that contains variables you believe will influence and help predict the target data.
In [23]:
df.iloc[:,:1].values
Out[23]:
array([[1], [2], [3], [4], [5]], dtype=int64)
or¶
In [24]:
x = df.x.values.reshape(-1,1)
x
Out[24]:
array([[1], [2], [3], [4], [5]], dtype=int64)
Select the column you want to predict, known as the target variable, which is the outcome you aim to model using the other features.
In [25]:
y
Out[25]:
array([3.5, 4. , 4.5, 4. , 5. ])
Fit the Algorithm
In [26]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(x,y)
Out[26]:
LinearRegression()
Find coefficient (slope)
In [27]:
reg.coef_
Out[27]:
array([0.3])
Find intercept
In [28]:
reg.intercept_
Out[28]:
3.3000000000000003
Check accuracy (the score method returns R², shown here as a percentage)
In [29]:
reg.score(x,y)*100
Out[29]:
69.23076923076925
Predict y
In [30]:
y_pred = reg.predict(x)
y_pred
Out[30]:
array([3.6, 3.9, 4.2, 4.5, 4.8])
Plot Scatter using x and y
Plot line graph using x and y prediction
In [31]:
plt.scatter(x,y)
plt.plot(x,y_pred)
plt.show()
Predict future value
In [32]:
reg.predict([[10]])
Out[32]:
array([6.3])
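The same call accepts several new values at once, because predict expects a 2-D array with one row per observation:
# Predict for several new x values in one call
reg.predict([[6], [7], [10]])  # approximately array([5.1, 5.4, 6.3])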