1. What is Random Forest Regression?
- Random Forest Regression is an ensemble machine learning method for predicting continuous values (like house prices). It combines the predictions of multiple decision trees, which usually gives more accurate and more stable results than a single tree.
2. How Random Forest Works
- Instead of using just one decision tree (which can be prone to errors), Random Forest uses a collection of decision trees (called a “forest”).
- Each tree makes its own prediction, and the final output is the average of all the tree predictions. This helps to smooth out any errors.
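The averaging step can be checked directly. The sketch below assumes scikit-learn, whose fitted forest exposes its individual trees through the `estimators_` attribute, and reuses the eight data points from the worked example later in this notebook. The variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# The eight (x, y) points used in the worked example of this notebook
X = np.arange(1, 9).reshape(-1, 1)
y = np.array([45, 51, 60, 80, 110, 150, 200, 240])

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Prediction of each individual tree for the same inputs
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])

# Averaging over the trees reproduces the forest's own prediction
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
print(np.allclose(manual_average, forest_prediction))
```

Each row of `per_tree` holds one tree's predictions, so the spread within a column gives a rough sense of how much the trees disagree for that input.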
3. Collect Your Data
- Gather the dataset you want to work with. For example, if you’re predicting house prices:
- Features might include size, number of bedrooms, location, age, etc.
- The target variable is the price of the house.
4. Split the Data
- Divide your dataset into two parts:
- Training set: For training the Random Forest model.
- Test set: For evaluating how well the model performs on new data.
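One common way to do the split is scikit-learn's `train_test_split`. The sketch below is only an illustration: the housing values are made up, and the column names, the 20% test size, and `random_state=42` are assumptions rather than recommendations.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical housing data; every value here is made up for illustration
houses = pd.DataFrame({
    "size": [850, 1200, 1500, 1800, 2100, 2400, 950, 1650, 2000, 1300],
    "bedrooms": [2, 3, 3, 4, 4, 5, 2, 3, 4, 3],
    "age": [30, 15, 10, 8, 5, 2, 40, 12, 7, 20],
    "price": [120, 180, 220, 260, 310, 360, 125, 240, 300, 190],
})

X = houses[["size", "bedrooms", "age"]]  # features
Y = houses["price"]                      # target variable

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

The resulting `X_train`, `Y_train`, `X_test`, and `Y_test` use the same names as the model-building and evaluation steps in this outline.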
5. Build the Random Forest Model
- Use a machine learning library to create and train the Random Forest Regression model with your training data.
- Each decision tree is trained on a random sample of the training rows (drawn with replacement), and the trees can also be restricted to a random subset of the features when choosing splits.
from sklearn.ensemble import RandomForestRegressor
# Create the Random Forest model
model = RandomForestRegressor(n_estimators=100) # 100 trees in the forest
model.fit(X_train, Y_train) # Train the model
6. Make Predictions
- Use the trained model to predict values for your test set or new data.
Y_pred = model.predict(X_test)
7. Evaluate the Model
- Check how well your model performed by comparing the predicted values to the actual values using metrics like:
- Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. Lower is better, and 0 means a perfect fit.
- R-squared (R²): The proportion of the variability in the target variable that the model explains. A value of 1 means a perfect fit.
from sklearn.metrics import mean_squared_error, r2_score
mse = mean_squared_error(Y_test, Y_pred)
r2 = r2_score(Y_test, Y_pred)
print("Mean Squared Error:", mse)
print("R-squared:", r2)
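To make the two metrics concrete, the sketch below computes them by hand with NumPy and compares the results with scikit-learn's functions. The actual and predicted values are copied from the worked example in this notebook. The names `y_true` and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Actual and predicted values copied from the worked example in this notebook
y_true = np.array([45, 51, 60, 80, 110, 150, 200, 240])
y_hat = np.array([48.79, 51.37, 59.21, 76.8, 104.1, 141.9, 189.7, 222.4])

# MSE: mean of the squared differences
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both hand-computed values should agree with scikit-learn
print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

With these numbers the hand-computed R² comes out at about 0.9855, which matches the score of roughly 98.5 reported in the worked example.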
8. Feature Importance
- Random Forest can show you how important each feature is in making predictions. This helps in understanding which factors influence the target variable the most.
feature_importances = model.feature_importances_
print("Feature Importances:", feature_importances)
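The raw array is easier to read when each value is paired with its feature name. The sketch below assumes the features are held in a pandas DataFrame. The data is synthetic and made up so that "size" dominates the target by construction.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up): the target depends strongly on "size",
# weakly on "age", and not at all on "noise"
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "size": rng.uniform(50, 250, 200),
    "age": rng.uniform(0, 50, 200),
    "noise": rng.uniform(0, 1, 200),
})
y = 3.0 * X["size"] - 0.5 * X["age"] + rng.normal(0, 5, 200)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Pair each importance with its column name and sort from most to least important
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

In scikit-learn the importances sum to 1, so each value can be read as a share of the total.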
9. Analyze Errors
- Look at the errors (differences between predicted and actual values) to see if the model is doing well or if it needs adjustments.
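A simple starting point is to compute the residuals and see where they are largest. The sketch below uses the actual and predicted values from the worked example in this notebook. The variable names are illustrative.

```python
import numpy as np

# Actual and predicted values copied from the worked example in this notebook
y_true = np.array([45, 51, 60, 80, 110, 150, 200, 240])
y_hat = np.array([48.79, 51.37, 59.21, 76.8, 104.1, 141.9, 189.7, 222.4])

residuals = y_true - y_hat        # positive means the model under-predicted
abs_errors = np.abs(residuals)

print("Residuals:", residuals)
print("Mean absolute error:", abs_errors.mean())
print("Index of the largest error:", abs_errors.argmax())
```

For this data the residuals grow toward the larger targets, and the last point has the biggest error. A pattern like that suggests the model is systematically under-predicting at the high end rather than making random errors.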
10. Use the Model for Future Predictions
- After validating your model, you can use it to make predictions for new inputs.
new_data = [[size, bedrooms, age]] # Placeholder values; use the same features, in the same order, as in training
predictions = model.predict(new_data)
11. Conclusion
- Summarize how well the Random Forest Regression model performed and what you learned from the predictions. Discuss the significance of different features and how they contributed to the final predictions.
Let’s walk through an example step by step.
We use Random Forest because combining multiple decision trees usually improves prediction accuracy and reduces overfitting compared with a single tree.
Import the Libraries
In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Create a List or Read Data
In [2]:
l = [[1,45],[2,51],[3,60],[4,80],[5,110],[6,150],[7,200],[8,240]]
l
Out[2]:
[[1, 45], [2, 51], [3, 60], [4, 80], [5, 110], [6, 150], [7, 200], [8, 240]]
Convert List into DataFrame
In [3]:
df = pd.DataFrame(l,columns=['x','y'])
df
Out[3]:
| | x | y |
|---|---|---|
| 0 | 1 | 45 |
| 1 | 2 | 51 |
| 2 | 3 | 60 |
| 3 | 4 | 80 |
| 4 | 5 | 110 |
| 5 | 6 | 150 |
| 6 | 7 | 200 |
| 7 | 8 | 240 |
Extract x (feature) and y (target) as arrays
In [4]:
x = df.iloc[:,:1].values
x
Out[4]:
array([[1], [2], [3], [4], [5], [6], [7], [8]], dtype=int64)
In [5]:
y = df.iloc[:,1].values
y
Out[5]:
array([ 45, 51, 60, 80, 110, 150, 200, 240], dtype=int64)
Plot scatter x and y
In [6]:
plt.scatter(x,y)
plt.show()
Fit the Random Forest Model
In [7]:
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(n_estimators=100,random_state=0)
reg.fit(x,y)
Out[7]:
RandomForestRegressor(random_state=0)
Predict y
In [8]:
y_pred = reg.predict(x)
y_pred
Out[8]:
array([ 48.79, 51.37, 59.21, 76.8 , 104.1 , 141.9 , 189.7 , 222.4 ])
In [9]:
y
Out[9]:
array([ 45, 51, 60, 80, 110, 150, 200, 240], dtype=int64)
Plot scatter of x and y, and line of x and predicted y
In [10]:
plt.scatter(x,y)
plt.plot(x,y_pred)
plt.show()
Check the R² score (as a percentage, on the training data)
In [11]:
reg.score(x,y)*100
Out[11]:
98.54843999571207
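The value above is R², and it is measured on the same points the model was trained on, so it tends to be optimistic. One option for a less optimistic estimate on a dataset this small, assuming scikit-learn's `oob_score=True` option, is the out-of-bag score. The sketch below refits the same eight points and compares the two scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

x = np.arange(1, 9).reshape(-1, 1)
y = np.array([45, 51, 60, 80, 110, 150, 200, 240])

# oob_score=True asks scikit-learn to also score each sample using only
# the trees that did not see it during training
reg = RandomForestRegressor(n_estimators=100, random_state=0, oob_score=True)
reg.fit(x, y)

train_r2 = reg.score(x, y)               # R² on the data the model was fitted to
same_r2 = r2_score(y, reg.predict(x))    # the same value computed explicitly
oob_r2 = reg.oob_score_                  # R² estimated from out-of-bag predictions

print("Training R²:", train_r2)
print("Out-of-bag R²:", oob_r2)
```

With only eight points, neither number is a reliable measure of how the model would do on new data. A separate test set, as described in step 4, is the more usual check.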
Predict y for a given value of x
In [12]:
reg.predict([[6.0]])
Out[12]:
array([141.9])
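One caveat is worth checking before predicting for x values outside the training range. A random forest predicts by averaging training targets, so it cannot produce values beyond those it saw during training. The sketch below refits the same eight points and queries values above the largest training x. The query values 10 and 100 are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

x = np.arange(1, 9).reshape(-1, 1)
y = np.array([45, 51, 60, 80, 110, 150, 200, 240])
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(x, y)

inside = reg.predict([[8.0]])               # largest x seen in training
outside = reg.predict([[10.0], [100.0]])    # beyond the training range

# Every query above the last split threshold lands in the same leaves,
# so the predictions beyond the range repeat the prediction at x = 8
print(inside, outside)
```

If the goal is to forecast a trend that keeps rising, a model that can extrapolate (for example a linear or polynomial regression) may be a better fit than a forest.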
In [13]:
# Predict on a fine grid of x values to show the step-shaped fit.
# x.min() and x.max() return scalars; min(x) and max(x) return 1-element arrays.
X = np.arange(x.min(), x.max(), 0.01).reshape(-1,1)
Yp = reg.predict(X)
plt.scatter(x,y)
plt.plot(X,Yp)
plt.show()