Multiple Linear Regression Explained Simply¶
1. What is Multiple Linear Regression?¶
- It’s a method used to predict one thing (like house price) based on multiple other factors (like size, number of bedrooms, and location). It helps us understand how different factors influence the outcome.
2. The Equation¶
The equation looks like this:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$
Where:
- ( $Y$ ) is what you want to predict (like price).
- ( $X_1, X_2, \dots$ ) are the factors you’re using for prediction (like size, bedrooms).
- ( $\beta_0$ ) is the intercept, the starting point (the predicted price when all factors are zero).
- ( $\beta_1, \beta_2, \dots$ ) are the coefficients; they tell you how much Y changes when each factor changes by one unit.
- ( $\epsilon$ ) is the error term, just a fancy way to say “random errors” the factors don’t explain.
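For instance, with made-up coefficients (purely illustrative, not fitted from data), a 1,500 sq ft, 3-bedroom, 20-year-old house would be priced at:
$\hat{Y} = 50000 + 200 \times 1500 + 10000 \times 3 - 500 \times 20 = 370000$
Here $50000$ is the intercept, $200$ the per-square-foot coefficient, $10000$ the per-bedroom coefficient, and $-500$ the per-year-of-age coefficient.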
3. Gather Your Data¶
- Collect information about the houses you want to analyze. For example, get data on:
- Size of the house (in square feet).
- Number of bedrooms.
- Age of the house.
- Price (this is what you want to predict).
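As a minimal sketch, assuming the data sits in a hypothetical CSV file called houses.csv with those columns, you could load it with pandas:
import pandas as pd
# Load the collected house data from a CSV file (houses.csv is a made-up name)
df = pd.read_csv("houses.csv")
df.head()  # peek at the first few rows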
4. Look at Your Data (Optional)¶
- Create some graphs to see how each factor (like size or bedrooms) relates to the house price. This helps you spot trends.
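A minimal sketch, assuming the DataFrame df from the previous step and hypothetical column names Size and Price:
import matplotlib.pyplot as plt
# Does a bigger house generally cost more? Scatter size against price.
plt.scatter(df["Size"], df["Price"])
plt.xlabel("Size (sq ft)")
plt.ylabel("Price")
plt.show()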
5. Split Your Data¶
- Divide your data into two parts:
- Training set: This is the data used to train your model (like teaching it).
- Test set: This is the data used to see how well the model works on new information.
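A minimal sketch using scikit-learn’s train_test_split (X and Y here are placeholders for your feature matrix and target column):
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows as a test set; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)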
6. Train the Model¶
- Use a computer program to find the best line (or equation) that fits your training data. The program will figure out the coefficients (the values for $\beta$) for each factor.
# Example code to train the model (X_train and Y_train come from the split in step 5)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, Y_train)
7. Check the Coefficients¶
- After training, look at the coefficients. They tell you how much the price (Y) changes with each factor (X). For example:
- If ( $\beta_1$ ) (the size coefficient) is 200, then for every extra square foot, the price goes up by $200.
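From the model trained above, you can read these values off directly (the coefficients appear in the same order as your feature columns):
# Inspect the fitted intercept and coefficients
print(model.intercept_)  # beta_0, the starting point
print(model.coef_)       # beta_1, beta_2, ... one per factor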
8. Make Predictions¶
- Now, you can use your trained model to predict house prices for new houses based on their features.
Y_pred = model.predict(X_test)
9. Evaluate the Model¶
- Check how accurate your predictions are using some measures:
- Mean Squared Error (MSE): This tells you how far off your predictions are.
- R-squared (R²): This tells you how well your factors explain the house prices (higher is better).
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
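For example, you can print both numbers to see them at a glance:
print("MSE:", mse)
print("R²:", r2)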
10. Analyze Errors¶
- Look at the differences between what you predicted and what the actual prices were (these are called residuals). You want these to be random; if they show a pattern, your model might need adjustments.
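A minimal residual-plot sketch, assuming Y_test and Y_pred from the earlier steps; a random cloud around the zero line is what you want to see:
import matplotlib.pyplot as plt
# Residuals = actual minus predicted; look for random scatter around zero
residuals = Y_test - Y_pred
plt.scatter(Y_pred, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted price")
plt.ylabel("Residual")
plt.show()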
11. Use for Future Predictions¶
- Once you’re satisfied with your model, you can use it to predict prices for any new house by putting in its features.
new_data = [[size, bedrooms, age]] # New house features
predictions = model.predict(new_data)
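For example, with made-up numbers for a 1,500 sq ft, 3-bedroom, 20-year-old house (the feature order must match the order used during training):
new_data = [[1500, 3, 20]]  # hypothetical house: size, bedrooms, age
predictions = model.predict(new_data)
print(predictions[0])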
12. Summarize Your Findings¶
- Wrap up by discussing how well the model did and what the coefficients mean. For example, if the size coefficient is positive, it means bigger houses cost more.
Let’s review an example step by step.¶
Import Libraries
In [3]:
import pandas as pd
import matplotlib.pyplot as plt
Read Data
In [4]:
Stock_Market = {'Year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'Month': [12, 11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'Interest_Rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'Unemployment_Rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'Stock_Index_Price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
In [5]:
df = pd.DataFrame(Stock_Market)
df
Out[5]:
| | Year | Month | Interest_Rate | Unemployment_Rate | Stock_Index_Price |
|---|---|---|---|---|---|
| 0 | 2017 | 12 | 2.75 | 5.3 | 1464 |
| 1 | 2017 | 11 | 2.50 | 5.3 | 1394 |
| 2 | 2017 | 10 | 2.50 | 5.3 | 1357 |
| 3 | 2017 | 9 | 2.50 | 5.3 | 1293 |
| 4 | 2017 | 8 | 2.50 | 5.4 | 1256 |
| 5 | 2017 | 7 | 2.50 | 5.6 | 1254 |
| 6 | 2017 | 6 | 2.50 | 5.5 | 1234 |
| 7 | 2017 | 5 | 2.25 | 5.5 | 1195 |
| 8 | 2017 | 4 | 2.25 | 5.5 | 1159 |
| 9 | 2017 | 3 | 2.25 | 5.6 | 1167 |
| 10 | 2017 | 2 | 2.00 | 5.7 | 1130 |
| 11 | 2017 | 1 | 2.00 | 5.9 | 1075 |
| 12 | 2016 | 12 | 2.00 | 6.0 | 1047 |
| 13 | 2016 | 11 | 1.75 | 5.9 | 965 |
| 14 | 2016 | 10 | 1.75 | 5.8 | 943 |
| 15 | 2016 | 9 | 1.75 | 6.1 | 958 |
| 16 | 2016 | 8 | 1.75 | 6.2 | 971 |
| 17 | 2016 | 7 | 1.75 | 6.1 | 949 |
| 18 | 2016 | 6 | 1.75 | 6.1 | 884 |
| 19 | 2016 | 5 | 1.75 | 6.1 | 866 |
| 20 | 2016 | 4 | 1.75 | 5.9 | 876 |
| 21 | 2016 | 3 | 1.75 | 6.2 | 822 |
| 22 | 2016 | 2 | 1.75 | 6.2 | 704 |
| 23 | 2016 | 1 | 1.75 | 6.1 | 719 |
Plot the columns
In [6]:
plt.title("Stock Market")
plt.xlabel("Interest Rate")
plt.ylabel("Stock Index Price")
plt.scatter(df['Interest_Rate'],df['Stock_Index_Price'])
plt.show()
In [7]:
plt.title("Stock Market")
plt.xlabel("Unemployment Rate")
plt.ylabel("Stock Index Price")
plt.scatter(df['Unemployment_Rate'],df['Stock_Index_Price'])
plt.show()
Check correlation
In [8]:
df.corr()
Out[8]:
| | Year | Month | Interest_Rate | Unemployment_Rate | Stock_Index_Price |
|---|---|---|---|---|---|
| Year | 1.000000e+00 | 7.884865e-14 | 0.882851 | -0.877000 | 0.863232 |
| Month | 7.884865e-14 | 1.000000e+00 | 0.339526 | -0.351189 | 0.481287 |
| Interest_Rate | 8.828507e-01 | 3.395257e-01 | 1.000000 | -0.925814 | 0.935793 |
| Unemployment_Rate | -8.769997e-01 | -3.511891e-01 | -0.925814 | 1.000000 | -0.922338 |
| Stock_Index_Price | 8.632321e-01 | 4.812873e-01 | 0.935793 | -0.922338 | 1.000000 |
Choose the columns that contain the variables you believe will influence and help predict the target variable.
In [10]:
X = df.iloc[:,2:4].values
X
Out[10]:
array([[2.75, 5.3 ], [2.5 , 5.3 ], [2.5 , 5.3 ], [2.5 , 5.3 ], [2.5 , 5.4 ], [2.5 , 5.6 ], [2.5 , 5.5 ], [2.25, 5.5 ], [2.25, 5.5 ], [2.25, 5.6 ], [2. , 5.7 ], [2. , 5.9 ], [2. , 6. ], [1.75, 5.9 ], [1.75, 5.8 ], [1.75, 6.1 ], [1.75, 6.2 ], [1.75, 6.1 ], [1.75, 6.1 ], [1.75, 6.1 ], [1.75, 5.9 ], [1.75, 6.2 ], [1.75, 6.2 ], [1.75, 6.1 ]])
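An equivalent and arguably more readable option is to pick the same two feature columns by name rather than by position:
# Same two features, selected by column name instead of iloc positions
X = df[['Interest_Rate', 'Unemployment_Rate']].values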
Select the column you want to predict, known as the target variable, which is the outcome you aim to model using the other features.
In [11]:
y = df.iloc[:,4].values
y
Out[11]:
array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719], dtype=int64)
Fit the model
In [12]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X,y)
Out[12]:
LinearRegression()
Find intercept
In [13]:
reg.intercept_
Out[13]:
1798.4039776258544
Find the coefficients
In [14]:
reg.coef_
Out[14]:
array([ 345.54008701, -250.14657137])
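Putting the intercept and coefficients together gives the fitted equation. A small sketch that assembles it from the outputs above:
# Print the fitted regression equation using the values found above
print(f"Stock_Index_Price = {reg.intercept_:.2f} "
      f"+ {reg.coef_[0]:.2f} * Interest_Rate "
      f"+ ({reg.coef_[1]:.2f}) * Unemployment_Rate")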
Predict y
In [15]:
y
Out[15]:
array([1464, 1394, 1357, 1293, 1256, 1254, 1234, 1195, 1159, 1167, 1130, 1075, 1047, 965, 943, 958, 971, 949, 884, 866, 876, 822, 704, 719], dtype=int64)
In [16]:
y_pred = reg.predict(X)
y_pred
Out[16]:
array([1422.86238865, 1336.47736689, 1336.47736689, 1336.47736689, 1311.46270976, 1261.43339548, 1286.44805262, 1200.06303087, 1200.06303087, 1175.04837373, 1063.64869484, 1013.61938057, 988.60472343, 927.23435881, 952.24901595, 877.20504454, 852.1903874 , 877.20504454, 877.20504454, 877.20504454, 927.23435881, 852.1903874 , 852.1903874 , 877.20504454])
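To eyeball how close the fit is, one option (not part of the original notebook) is to line up actual and predicted values side by side:
# Compare actual vs. predicted stock index prices row by row
comparison = pd.DataFrame({'Actual': y, 'Predicted': y_pred})
print(comparison.head())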
Check accuracy (the R² score, as a percentage)
In [17]:
reg.score(X,y) * 100
Out[17]:
89.76335894170217
Predict a future value (the inputs are in the order [Interest_Rate, Unemployment_Rate])
In [18]:
reg.predict([[5.4,3.7]])
Out[18]:
array([2738.77813342])
In [19]:
round(reg.predict([[5.4,3.7]])[0])
Out[19]:
2739
In [21]:
a = float(input("Enter Interest_Rate:"))
b = float(input("Enter Unemployment_Rate:"))
val = reg.predict([[a,b]])
print("Stock Index Price:",round(val[0]))
Stock Index Price: 2814