Data Transformation in Pandas – Data Science Tutorials

Machine Learning July 31, 2022 Data Science, Pandas Leave a comment 4,407 Views

11- Data Transformation

Data Transformation in Pandas¶

In [1]:

#import libraries
import pandas as pd
import numpy as np

Finding Duplicate Values¶

In [2]:

data=pd.DataFrame({"a":[1,5]*3,"b":[1,1,2,3,2,3]})
data

Out[2]:

	a	b
0	1	1
1	5	1
2	1	2
3	5	3
4	1	2
5	5	3

In [3]:

#To see whether the row is repeated or not, you can use the duplicated method.
data.duplicated()

Out[3]:

0    False
1    False
2    False
3    False
4     True
5     True
dtype: bool

In [4]:

#To remove duplicate rows, you can use the drop_duplicates method which returns a data frame.
data.drop_duplicates()

Out[4]:

	a	b
0	1	1
1	5	1
2	1	2
3	5	3

Mapping¶

In [5]:

import pandas as pd
data={'Name':['A','B','C','D'],'Age':[42,54,20,63]}
df=pd.DataFrame(data,index=['value1','value2','value3','value4'])
df

Out[5]:

	Name	Age
value1	A	42
value2	B	54
value3	C	20
value4	D	63

Now, lets add a new column to show score

In [6]:

score={'A':24,'B':30,'C':36,'D':42}
# Converting the first character of the name to uppercase so that the names are same.
n=df["Name"].str.capitalize()
n

Out[6]:

value1    A
value2    B
value3    C
value4    D
Name: Name, dtype: object

In [7]:

#Now, adding the score variable to df dataset using map() method.
df["Score"]=n.map(score)
df

Out[7]:

	Name	Age	Score
value1	A	42	24
value2	B	54	30
value3	C	20	36
value4	D	63	42

Replacing¶

Rplacing is a method use to replace the values in the dataframe.

In [8]:

s=pd.Series([14,55,87,966,78,44])
s

Out[8]:

0     14
1     55
2     87
3    966
4     78
5     44
dtype: int64

In [9]:

#replacing the value.
s.replace(87,np.nan)

Out[9]:

0     14.0
1     55.0
2      NaN
3    966.0
4     78.0
5     44.0
dtype: float64

In [10]:

#replacing multiple values.
s.replace([78,14],[np.nan,566])

Out[10]:

0    566.0
1     55.0
2     87.0
3    966.0
4      NaN
5     44.0
dtype: float64

Renaming¶

In [11]:

import pandas as pd
import numpy as np
df=pd.DataFrame(np.arange(12).reshape(3,4),index=[0,1,2],columns=["a","b","c","d"])
df

Out[11]:

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11

In [12]:

#Create a variable and pass it into the dataset
s=pd.Series(["one","two","three"])
df.index=df.index.map(s)
df

Out[12]:

	a	b	c	d
one	0	1	2	3
two	4	5	6	7
three	8	9	10	11

In [13]:

#capitalize the row and column names with the rename method
df.rename(index=str.title,columns=str.upper)

Out[13]:

	A	B	C	D
One	0	1	2	3
Two	4	5	6	7
Three	8	9	10	11

we can also change the row or column names using the dictionary structure with the rename method.

In [14]:

df.rename(index={"one":"10"},columns={"d":"f"},inplace=True)
df

Out[14]:

	a	b	c	f
10	0	1	2	3
two	4	5	6	7
three	8	9	10	11

Finding Specific value in a dataset¶

In [15]:

data=pd.DataFrame(np.random.randn(1000,4))
data

Out[15]:

	0	1	2	3
0	1.095845	-0.579053	0.021126	-0.615226
1	1.265209	1.606119	-0.167518	0.022493
2	0.260408	0.015183	0.986117	0.203856
3	-0.805561	0.224160	0.992170	-1.046847
4	-0.257810	0.598237	1.087677	1.733996
…	…	…	…	…
995	0.172482	0.239283	-0.460610	0.383162
996	-0.012413	1.782691	0.706961	-1.101175
997	-0.420072	-0.071259	-1.346796	0.795517
998	-0.151630	0.788102	1.759462	0.306519
999	-2.626494	1.541084	0.057281	0.153461

1000 rows × 4 columns

In [16]:

data.head()

Out[16]:

	0	1	2	3
0	1.095845	-0.579053	0.021126	-0.615226
1	1.265209	1.606119	-0.167518	0.022493
2	0.260408	0.015183	0.986117	0.203856
3	-0.805561	0.224160	0.992170	-1.046847
4	-0.257810	0.598237	1.087677	1.733996

In [17]:

#using the describe method to see summary statistics.
data.describe()

Out[17]:

	0	1	2	3
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.032380	-0.003029	0.074722	0.029196
std	0.991572	0.972940	0.937606	0.956663
min	-3.541948	-3.488546	-2.799998	-3.595273
25%	-0.663346	-0.640519	-0.556768	-0.658451
50%	-0.026216	-0.038468	0.085684	0.041386
75%	0.583050	0.647501	0.676943	0.691507
max	3.232359	2.951305	3.066394	3.040598

let’s find values whose absolute value exceeds 3 in a column with 1 index.

In [18]:

#assigning the column with the 1st index of the data to the col variable.
col=data[1]

In [19]:

col[np.abs(col)>3]

Out[19]:

446   -3.126541
938   -3.488546
Name: 1, dtype: float64

In [20]:

#To find rows with at least one absolute value exceeding 3 in the entire data set
data[(np.abs(data)>3).any(1)]

Out[20]:

	0	1	2	3
331	-3.541948	-1.260699	0.602065	-0.489064
396	-3.121986	0.180793	0.289388	-0.949250
446	1.407799	-3.126541	0.692714	0.296811
619	-0.239821	-1.725038	-0.766128	-3.595273
676	-0.037124	-0.861719	3.066394	1.267702
713	3.045511	-0.058880	-0.191928	0.909453
912	3.232359	0.292739	-0.357451	0.677699
938	-0.731734	-3.488546	0.187923	-1.593345
973	0.495800	-1.063279	-1.364449	3.040598

Machine Learning Tutorials, Courses and Certifications

Data Transformation in Pandas – Data Science Tutorials

Related Articles

Data Transformation in Pandas¶

Finding Duplicate Values¶

Mapping¶

Replacing¶

Renaming¶

Finding Specific value in a dataset¶

Related

About Machine Learning

Check Also

Groupby in Pandas – Data Science Tutorials

Leave a Reply Cancel reply

OpenCV Python Project for Bus Detection from an Image

Multiple Linear Regression:

Microsoft AI Classroom Series Assessment Answers

Polynomial Regression

Support Vector Regression

Security Operations Quiz Answers – NSE 3 Fortinet

Suicide Case Study – NumPy

FUNDAMENTALS OF DIGITAL MARKETING: MODULE 12 Quiz Answers

Data Visualization in Pandas – Data Science Tutorials

Operators

OpenCV Python Project for Bus Detection from an Image

OpenCV Python Project for Vehicle Detection From an Image

OpenCV Python Project for Vehicle Detection in a Video frame

Airline Quality Service

Airport Quality Service