Saturday , December 21 2024

Introduction To Data Cleaning Using Python Pandas Library. Learn How To Do Data Preprocessing Before Data Analysis.

1 – Handle Missing Data

Import Library

In [2]:
import numpy as np
import pandas as pd

Read CSV File

In [3]:
data = pd.read_csv('weather_data.csv')
data
Out[3]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Check Rows and Columns

In [4]:
data.shape
Out[4]:
(7, 4)

Data has 7 Rows and Columns

Check Data Type of Each Column

In [5]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
day            7 non-null object
temperature    7 non-null int64
windspeed      7 non-null int64
event          7 non-null object
dtypes: int64(2), object(2)
memory usage: 352.0+ bytes

Replacing Single Value with NaN

In [6]:
data
Out[6]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Replace -99999 to NaN

In [7]:
data = data.replace(-99999,value=np.NaN)
data
Out[7]:
day temperature windspeed event
0 1/1/2017 32.0 6.0 Rain
1 1/2/2017 NaN 7.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 7.0 0
4 1/5/2017 32.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [8]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 4 columns):
day            7 non-null object
temperature    5 non-null float64
windspeed      5 non-null float64
event          7 non-null object
dtypes: float64(2), object(2)
memory usage: 352.0+ bytes

Now we have 2 missing values in temperature and windspeed column respectively

Replacing List of Values with Single Value

In [9]:
data
Out[9]:
day temperature windspeed event
0 1/1/2017 32.0 6.0 Rain
1 1/2/2017 NaN 7.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 7.0 0
4 1/5/2017 32.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [10]:
data = data.replace([32.0,7.0],value=99)
data
Out[10]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0

Replacing Per Column

In [11]:
data
Out[11]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0

Replace Temperature Column value 99, Windspeed Missing Value (NaN) and Event value 0 with value 100

In [12]:
data.replace({'temperature':99.0,'windspeed':np.nan,'event':'0'},100)
Out[12]:
day temperature windspeed event
0 1/1/2017 100.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 100.0 Snow
3 1/4/2017 NaN 99.0 100
4 1/5/2017 100.0 100.0 Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 100

Replacing by using Mapping

In [13]:
data
Out[13]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 NaN 99.0 Sunny
2 1/3/2017 28.0 NaN Snow
3 1/4/2017 NaN 99.0 0
4 1/5/2017 99.0 NaN Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 0
In [14]:
data = data.replace({np.nan:69,'0':'Sunny'})
data
Out[14]:
day temperature windspeed event
0 1/1/2017 99.0 6.0 Rain
1 1/2/2017 69.0 99.0 Sunny
2 1/3/2017 28.0 69.0 Snow
3 1/4/2017 69.0 99.0 Sunny
4 1/5/2017 99.0 69.0 Rain
5 1/6/2017 31.0 2.0 Sunny
6 1/6/2017 34.0 5.0 Sunny

Advance Data Handling Techniques using Regular Expressions

Read Weather Data

In [15]:
data = pd.read_csv('weather2.csv')
data
Out[15]:
day temperature windspeed event
0 1/1/2017 32 F 6 mph Rain
1 1/2/2017 -99999 F 7 mph Sunny
2 1/3/2017 28 F -99999 mph Snow
3 1/4/2017 -99999 F 7 mph 0
4 1/5/2017 32 F -99999 mph Rain
5 1/6/2017 31 F 2 mph Sunny
6 1/6/2017 34 F 5 mph 0

Remove mph from windspeed & F from Temperature

In [16]:
data = data.replace({'temperature':'[A-Za-z]','windspeed':'[A-Za-z]'},'',regex=True)
data
Out[16]:
day temperature windspeed event
0 1/1/2017 32 6 Rain
1 1/2/2017 -99999 7 Sunny
2 1/3/2017 28 -99999 Snow
3 1/4/2017 -99999 7 0
4 1/5/2017 32 -99999 Rain
5 1/6/2017 31 2 Sunny
6 1/6/2017 34 5 0

Replacing Column Values with Another List of Values

In [17]:
d = {'score':['exceptional','average','good','poor','average','exceptional'],
  'student':['Karan','Arpit','Varun','Robin','Akshay','Ankush']}
In [18]:
data = pd.DataFrame(d)
data
Out[18]:
score student
0 exceptional Karan
1 average Arpit
2 good Varun
3 poor Robin
4 average Akshay
5 exceptional Ankush

Replace the value of Score Column with values 1,2,3,4 Respectively

In [19]:
data = data.replace(['poor','average','good','exceptional'],[1,2,3,4])
data
Out[19]:
score student
0 4 Karan
1 2 Arpit
2 3 Varun
3 1 Robin
4 2 Akshay
5 4 Ankush

About Machine Learning

Check Also

Groupby in Pandas - Data Science Tutorials

Groupby in Pandas – Data Science Tutorials

14- Groupby Groupby in Pandas¶Pandas groupby is used for grouping the data according to the …

Leave a Reply

Your email address will not be published. Required fields are marked *