What is Data Preprocessing?¶
Data preprocessing is the process of preparing raw data for analysis or modeling. In machine learning, raw data usually needs to be cleaned and transformed into a suitable format before it is fed into an algorithm. The main goal is to improve the quality of the data so that models perform better and produce more accurate results.
Steps in Data Preprocessing¶
1. Collecting the Data¶
- Gather the raw data from various sources like databases, CSV files, or sensors.
2. Handling Missing Data¶
- Sometimes data is incomplete (e.g., some values are missing). You can handle missing data by:
- Removing the rows or columns that contain missing values.
- Imputing missing values with the mean, median, or most frequent value (see the sketch after this list).
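For a quick illustration (assuming a small hypothetical DataFrame named df), both approaches look like this in pandas:

import pandas as pd

# hypothetical DataFrame with missing values
df = pd.DataFrame({'age': [25, None, 40], 'income': [50000, 60000, None]})

df_dropped = df.dropna()                             # remove rows with any missing value
df_imputed = df.fillna(df.mean(numeric_only=True))   # impute with the column means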
3. Handling Categorical Data¶
- Data can be in categories (like “Male” or “Female”), but machine learning models need numerical data. You convert categorical data into numbers using methods like:
- Label Encoding: Assign a unique number to each category (e.g., “Male” = 1, “Female” = 0).
- One-Hot Encoding: Create new columns for each category (e.g., “Male” becomes one column and “Female” becomes another).
4. Feature Scaling¶
- Features often live on very different scales (e.g., income in thousands, age in years). Scaling brings all features into a similar range:
- Normalization: Rescale features to values between 0 and 1.
- Standardization: Adjust values to have a mean of 0 and a standard deviation of 1 (see the sketch after this list).
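Scaling does not appear in the worked example below, so here is a minimal sketch using scikit-learn's MinMaxScaler (normalization) and StandardScaler (standardization); the ages array is a hypothetical feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])     # hypothetical feature column

normalized = MinMaxScaler().fit_transform(ages)       # rescaled to the range [0, 1]
standardized = StandardScaler().fit_transform(ages)   # mean 0, standard deviation 1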
5. Splitting the Dataset¶
- Divide the data into:
- Training set: The data used to train the machine learning model.
- Test set: The data used to evaluate the model’s performance on unseen data.
Why is Data Preprocessing Important?¶
- Improves Model Accuracy: Clean, consistent data leads to better predictions.
- Reduces Errors: Helps remove noise or incorrect data.
- Saves Time: Preprocessed data allows models to learn faster and more effectively.
Let’s walk through an example step by step.¶
Import the Libraries
In [1]:
import numpy as np
import pandas as pd
Read Data
In [2]:
data = pd.read_csv('preprocessing.csv')
data
Out[2]:
|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |
Clean Data
In [3]:
data.isnull().sum()
Out[3]:
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
In [4]:
# fillna returns a new Series; `data` itself is left unchanged
data['Age'].fillna(data.Age.mean())
Out[4]:
0    44.000000
1    27.000000
2    30.000000
3    38.000000
4    40.000000
5    35.000000
6    38.777778
7    48.000000
8    50.000000
9    37.000000
Name: Age, dtype: float64
In [5]:
# again, this only displays the imputed Series without modifying `data`
data['Salary'].fillna(data.Salary.mean())
Out[5]:
0    72000.000000
1    48000.000000
2    54000.000000
3    61000.000000
4    63777.777778
5    58000.000000
6    52000.000000
7    79000.000000
8    83000.000000
9    67000.000000
Name: Salary, dtype: float64
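Note that fillna as used above only returns a new Series; data itself is unchanged, which is why the NaN values reappear in the next cell. To persist the imputation in pandas you would assign the result back (not run here, so the SimpleImputer example below still has NaNs to work with):

data['Age'] = data['Age'].fillna(data.Age.mean())
data['Salary'] = data['Salary'].fillna(data.Salary.mean())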
Using Scikit-Learn – SimpleImputer¶
In [6]:
data
Out[6]:
|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |
Select the feature columns, i.e. the variables you believe will influence and help predict the target. Here that is the first three columns: Country, Age, and Salary.
In [7]:
X = data.iloc[:,:3].values  # Country, Age, Salary
X
Out[7]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
Select the column you want to predict, known as the target variable; this is the outcome you aim to model using the other features.
In [8]:
y = data.iloc[:,-1].values  # Purchased
y
Out[8]:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Clean the data using another method: scikit-learn's SimpleImputer
In [9]:
from sklearn.impute import SimpleImputer
# replace NaN in the numeric columns (Age, Salary) with the column mean
si = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:,1:] = si.fit_transform(X[:,1:])
X
Out[9]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
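strategy='mean' is only one option; SimpleImputer also accepts 'median', 'most_frequent', and 'constant', and the median is often more robust for skewed columns such as Salary. A variant for illustration only (left commented out so the mean-imputed values above are kept):

si_median = SimpleImputer(missing_values=np.nan, strategy='median')
# X[:,1:] = si_median.fit_transform(X[:,1:])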
Label Encoding¶
A label encoder converts categorical values into numerical values so that they can be used in machine learning algorithms.¶
Choose the column on which you want to use the label encoder:
In [10]:
X[:,0]
Out[10]:
array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'], dtype=object)
Import and apply the encoder
In [11]:
from sklearn.preprocessing import LabelEncoder
In [12]:
le_x = LabelEncoder()
In [13]:
X[:,0] = le_x.fit_transform(X[:,0])
In [14]:
X
Out[14]:
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
Check how each index is defined:
In [15]:
le_x.classes_
Out[15]:
array(['France', 'Germany', 'Spain'], dtype=object)
Apply the same method to y, because y also contains categorical values¶
In [16]:
le_y = LabelEncoder()
y = le_y.fit_transform(y)
In [17]:
y
Out[17]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
In [18]:
le_x.classes_
Out[18]:
array(['France', 'Germany', 'Spain'], dtype=object)
In [19]:
le_y.classes_
Out[19]:
array(['No', 'Yes'], dtype=object)
Check which category a particular index maps back to using inverse_transform:
In [20]:
le_x.inverse_transform([1])
Out[20]:
array(['Germany'], dtype=object)
In [21]:
le_x.inverse_transform([0])
Out[21]:
array(['France'], dtype=object)
In [22]:
le_y.inverse_transform([1])
Out[22]:
array(['Yes'], dtype=object)
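A side note: scikit-learn's documentation intends LabelEncoder for target labels such as y; for feature columns such as Country, OrdinalEncoder produces the same integer mapping and is the documented choice. A minimal sketch, assuming the un-encoded X from cell [7] (left commented out so the flow above is unchanged):

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# OrdinalEncoder expects 2-D input, hence the 0:1 slice
# X[:,0:1] = oe.fit_transform(X[:,0:1])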
One-Hot Encoding¶
A one-hot encoder converts categorical variables into a binary matrix, where each category is represented by a separate column with a 1 or 0 indicating the presence or absence of that category.¶
Import and fit the encoder
In [23]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
one_X = ohe.fit_transform(X[:,0:1])       # encode the Country column (as a 2-D slice)
In [24]:
one_X
Out[24]:
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
In [25]:
X
Out[25]:
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
Concatenate the one-hot columns with the remaining features
In [26]:
X = np.concatenate([one_X, X[:,1:]], axis=1)  # replace the label-encoded column with its one-hot columns
In [27]:
X
Out[27]:
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
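The encode-slice-concatenate steps above can also be done in one shot with scikit-learn's ColumnTransformer, which one-hot encodes the chosen column and passes the rest through unchanged. A minimal sketch, assuming the original string-valued X from cell [7]:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('country_ohe', OneHotEncoder(sparse_output=False), [0])],  # one-hot encode column 0
    remainder='passthrough'                                      # keep Age and Salary as-is
)
# X = ct.fit_transform(X)  # equivalent to the manual concatenation above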
In [28]:
y
Out[28]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
Split Data in Training and Testing¶
We split data into training and testing sets to evaluate how well the model performs on unseen data and to ensure it generalizes well to new, real-world situations.¶
Import the splitting utility
In [29]:
from sklearn.model_selection import train_test_split
In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # 70% train, 30% test
In [31]:
len(X)  # number of rows in X
Out[31]:
10
In [32]:
X_train
Out[32]:
array([[1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)
In [33]:
len(X_train)  # number of rows in X_train
Out[33]:
7
In [34]:
len(X_test)  # number of rows in X_test
Out[34]:
3
In [35]:
X_test
Out[35]:
array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778]], dtype=object)
In [36]:
y_train
Out[36]:
array([1, 1, 0, 1, 0, 0, 1])
In [37]:
y_test
Out[37]:
array([0, 0, 1])
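With only ten rows, the class balance can drift noticeably between the two splits. train_test_split accepts a stratify argument that preserves the proportion of 'Yes'/'No' labels in both sets; a variant for illustration:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y  # keep class proportions in train and test
)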