What is Data Preprocessing?¶
Data preprocessing is the process of preparing raw data for analysis or modeling. In machine learning, raw data usually needs to be cleaned and transformed into a suitable format before it is fed into an algorithm. The main goal is to improve the quality of the data so that models perform better and produce more accurate results.
Steps in Data Preprocessing¶
1. Collecting the Data¶
- Gather the raw data from various sources like databases, CSV files, or sensors.
2. Handling Missing Data¶
- Sometimes data is incomplete (e.g., some values are missing). You can handle missing data by:
- Removing the rows or columns that contain missing values.
- Imputing missing values with the mean, median, or most frequent value (see the sketch after this list).
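For a quick illustration (assuming a small hypothetical DataFrame named df), both approaches look like this in pandas:

import pandas as pd

# hypothetical DataFrame with missing values
df = pd.DataFrame({'age': [25, None, 40], 'income': [50000, 60000, None]})

df_dropped = df.dropna()                             # remove rows with any missing value
df_imputed = df.fillna(df.mean(numeric_only=True))   # impute with the column means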
3. Handling Categorical Data¶
- Data can be in categories (like “Male” or “Female”), but machine learning models need numerical data. You convert categorical data into numbers using methods like:
- Label Encoding: Assign a unique number to each category (e.g., “Male” = 1, “Female” = 0).
- One-Hot Encoding: Create new columns for each category (e.g., “Male” becomes one column and “Female” becomes another).
4. Feature Scaling¶
- Features often live on very different scales (e.g., income in thousands, age in years). Scaling brings all features into a similar range:
- Normalization: Rescale features to values between 0 and 1.
- Standardization: Adjust values to have a mean of 0 and a standard deviation of 1 (see the sketch after this list).
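Scaling does not appear in the worked example below, so here is a minimal sketch using scikit-learn's MinMaxScaler (normalization) and StandardScaler (standardization); the ages array is a hypothetical feature column:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

ages = np.array([[27.0], [35.0], [44.0], [50.0]])     # hypothetical feature column

normalized = MinMaxScaler().fit_transform(ages)       # rescaled to the range [0, 1]
standardized = StandardScaler().fit_transform(ages)   # mean 0, standard deviation 1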
5. Splitting the Dataset¶
- Divide the data into:
- Training set: The data used to train the machine learning model.
- Test set: The data used to evaluate the model’s performance on unseen data.
Why is Data Preprocessing Important?¶
- Improves Model Accuracy: Clean, consistent data leads to better predictions.
- Reduces Errors: Helps remove noise or incorrect data.
- Saves Time: Preprocessed data allows models to learn faster and more effectively.
Let’s walk through an example step by step.¶
Import the Libraries
In [1]:
import numpy as np
import pandas as pd
Read Data
In [2]:
data = pd.read_csv('preprocessing.csv')
data
Out[2]:
|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |
Clean Data
In [3]:
data.isnull().sum()
Out[3]:
Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
In [4]:
# fillna returns a new Series; `data` itself is left unchanged
data['Age'].fillna(data.Age.mean())
Out[4]:
0    44.000000
1    27.000000
2    30.000000
3    38.000000
4    40.000000
5    35.000000
6    38.777778
7    48.000000
8    50.000000
9    37.000000
Name: Age, dtype: float64
In [5]:
# again, this only displays the imputed Series without modifying `data`
data['Salary'].fillna(data.Salary.mean())
Out[5]:
0    72000.000000
1    48000.000000
2    54000.000000
3    61000.000000
4    63777.777778
5    58000.000000
6    52000.000000
7    79000.000000
8    83000.000000
9    67000.000000
Name: Salary, dtype: float64
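Note that fillna as used above only returns a new Series; data itself is unchanged, which is why the NaN values reappear in the next cell. To persist the imputation in pandas you would assign the result back (not run here, so the SimpleImputer example below still has NaNs to work with):

data['Age'] = data['Age'].fillna(data.Age.mean())
data['Salary'] = data['Salary'].fillna(data.Salary.mean())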
Using Scikit-Learn – SimpleImputer¶
In [6]:
data
Out[6]:
|   | Country | Age  | Salary  | Purchased |
|---|---------|------|---------|-----------|
| 0 | France  | 44.0 | 72000.0 | No        |
| 1 | Spain   | 27.0 | 48000.0 | Yes       |
| 2 | Germany | 30.0 | 54000.0 | No        |
| 3 | Spain   | 38.0 | 61000.0 | No        |
| 4 | Germany | 40.0 | NaN     | Yes       |
| 5 | France  | 35.0 | 58000.0 | Yes       |
| 6 | Spain   | NaN  | 52000.0 | No        |
| 7 | France  | 48.0 | 79000.0 | Yes       |
| 8 | Germany | 50.0 | 83000.0 | No        |
| 9 | France  | 37.0 | 67000.0 | Yes       |
Select the feature columns, i.e. the variables you believe will influence and help predict the target. Here that is the first three columns: Country, Age, and Salary.
In [7]:
X = data.iloc[:,:3].values  # Country, Age, Salary
X
Out[7]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
Select the column you want to predict, known as the target variable; this is the outcome you aim to model using the other features.
In [8]:
y = data.iloc[:,-1].values  # Purchased
y
Out[8]:
array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
Clean the data using another method: scikit-learn's SimpleImputer
In [9]:
from sklearn.impute import SimpleImputer
# replace NaN in the numeric columns (Age, Salary) with the column mean
si = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:,1:] = si.fit_transform(X[:,1:])
X
Out[9]:
array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
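strategy='mean' is only one option; SimpleImputer also accepts 'median', 'most_frequent', and 'constant', and the median is often more robust for skewed columns such as Salary. A variant for illustration only (left commented out so the mean-imputed values above are kept):

si_median = SimpleImputer(missing_values=np.nan, strategy='median')
# X[:,1:] = si_median.fit_transform(X[:,1:])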
Label Encoding¶
A label encoder converts categorical values into numerical values so that they can be used in machine learning algorithms.¶
Choose the column on which you want to use the label encoder:
In [10]:
X[:,0]
Out[10]:
array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France', 'Spain', 'France', 'Germany', 'France'], dtype=object)
Import and apply the encoder
In [11]:
from sklearn.preprocessing import LabelEncoder
In [12]:
le_x = LabelEncoder()
In [13]:
X[:,0] = le_x.fit_transform(X[:,0])
In [14]:
X
Out[14]:
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
Check how each index is defined:
In [15]:
le_x.classes_
Out[15]:
array(['France', 'Germany', 'Spain'], dtype=object)
Apply the same method to y, because y also contains categorical values¶
In [16]:
le_y = LabelEncoder()
y = le_y.fit_transform(y)
In [17]:
y
Out[17]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
In [18]:
le_x.classes_
Out[18]:
array(['France', 'Germany', 'Spain'], dtype=object)
In [19]:
le_y.classes_
Out[19]:
array(['No', 'Yes'], dtype=object)
Check which category a particular index maps back to using inverse_transform:
In [20]:
le_x.inverse_transform([1])
Out[20]:
array(['Germany'], dtype=object)
In [21]:
le_x.inverse_transform([0])
Out[21]:
array(['France'], dtype=object)
In [22]:
le_y.inverse_transform([1])
Out[22]:
array(['Yes'], dtype=object)
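A side note: scikit-learn's documentation intends LabelEncoder for target labels such as y; for feature columns such as Country, OrdinalEncoder produces the same integer mapping and is the documented choice. A minimal sketch, assuming the un-encoded X from cell [7] (left commented out so the flow above is unchanged):

from sklearn.preprocessing import OrdinalEncoder

oe = OrdinalEncoder()
# OrdinalEncoder expects 2-D input, hence the 0:1 slice
# X[:,0:1] = oe.fit_transform(X[:,0:1])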
One-Hot Encoding¶
A one-hot encoder converts categorical variables into a binary matrix, where each category is represented by a separate column with a 1 or 0 indicating the presence or absence of that category.¶
Import and fit the encoder
In [23]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # return a dense array instead of a sparse matrix
one_X = ohe.fit_transform(X[:,0:1])       # encode the Country column (as a 2-D slice)
In [24]:
one_X
Out[24]:
array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [1., 0., 0.]])
In [25]:
X
Out[25]:
array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)
Concatenate the one-hot columns with the remaining features
In [26]:
X = np.concatenate([one_X, X[:,1:]], axis=1)  # replace the label-encoded column with its one-hot columns
In [27]:
X
Out[27]:
array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
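The encode-slice-concatenate steps above can also be done in one shot with scikit-learn's ColumnTransformer, which one-hot encodes the chosen column and passes the rest through unchanged. A minimal sketch, assuming the original string-valued X from cell [7]:

from sklearn.compose import ColumnTransformer

ct = ColumnTransformer(
    [('country_ohe', OneHotEncoder(sparse_output=False), [0])],  # one-hot encode column 0
    remainder='passthrough'                                      # keep Age and Salary as-is
)
# X = ct.fit_transform(X)  # equivalent to the manual concatenation above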
In [28]:
y
Out[28]:
array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
Split Data in Training and Testing¶
We split data into training and testing sets to evaluate how well the model performs on unseen data and to ensure it generalizes well to new, real-world situations.¶
Import the splitting utility
In [29]:
from sklearn.model_selection import train_test_split
In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)  # 70% train, 30% test
In [31]:
len(X)  # number of rows in X
Out[31]:
10
In [32]:
X_train
Out[32]:
array([[1.0, 0.0, 0.0, 37.0, 67000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [1.0, 0.0, 0.0, 44.0, 72000.0],
       [1.0, 0.0, 0.0, 35.0, 58000.0]], dtype=object)
In [33]:
len(X_train)  # number of rows in X_train
Out[33]:
7
In [34]:
len(X_test)  # number of rows in X_test
Out[34]:
3
In [35]:
X_test
Out[35]:
array([[0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778]], dtype=object)
In [36]:
y_train
Out[36]:
array([1, 1, 0, 1, 0, 0, 1])
In [37]:
y_test
Out[37]:
array([0, 0, 1])
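With only ten rows, the class balance can drift noticeably between the two splits. train_test_split accepts a stratify argument that preserves the proportion of 'Yes'/'No' labels in both sets; a variant for illustration:

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y  # keep class proportions in train and test
)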