Combining and Merging in Pandas – Data Science Tutorials

Machine Learning July 31, 2022 Data Science, Pandas Leave a comment 3,636 Views

13- Combining and Merging

Combining and Merging in Pandas¶

The datasets you want to analyze can come from different sources. Before starting data analyses, you may want to merge these datasets. To combine datasets, you can use the merge, join, and concat methods in Pandas. Pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational databases like SQL.

pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True)

In [1]:

import pandas as pd
import numpy as np

In [2]:

df1 = pd.DataFrame({
         'id':[1,2,3,4,5],
         'Name': ['Piyush', 'Gautam', 'Garry', 'Abhishek', 'Ankit'],
         'subject_id':['sub1','sub2','sub4','sub6','sub5']})
df1

Out[2]:

	id	Name	subject_id
0	1	Piyush	sub1
1	2	Gautam	sub2
2	3	Garry	sub4
3	4	Abhishek	sub6
4	5	Ankit	sub5

In [3]:

df2 = pd.DataFrame(
         {'id':[1,2,3,4,5],
         'Name': ['Saurabh', 'Akshay', 'Rajat', 'Ankush', 'Sam'],
         'subject_id':['sub2','sub4','sub3','sub6','sub5']})
df2

Out[3]:

	id	Name	subject_id
0	1	Saurabh	sub2
1	2	Akshay	sub4
2	3	Rajat	sub3
3	4	Ankush	sub6
4	5	Sam	sub5

Merging DataFrame¶

To combine rows using one or more keys, use the merge or join procedures. Let’s make two datasets to demonstrate this.

In [4]:

pd.merge(df1,df2)

Out[4]:

	id	Name	subject_id

Merge Two DataFrames on Column ID¶

In [5]:

pd.merge(df1,df2,on='id')

Out[5]:

	id	Name_x	subject_id_x	Name_y	subject_id_y
0	1	Piyush	sub1	Saurabh	sub2
1	2	Gautam	sub2	Akshay	sub4
2	3	Garry	sub4	Rajat	sub3
3	4	Abhishek	sub6	Ankush	sub6
4	5	Ankit	sub5	Sam	sub5

Merge Two DataFrames on Column Subject ID¶

In [6]:

pd.merge(df1,df2,on='subject_id')

Out[6]:

	id_x	Name_x	subject_id	id_y	Name_y
0	2	Gautam	sub2	1	Saurabh
1	3	Garry	sub4	2	Akshay
2	4	Abhishek	sub6	4	Ankush
3	5	Ankit	sub5	5	Sam

Merge Two DataFrames on Column ID and Subject ID Together¶

In [7]:

(pd.merge(df1,df2,on=['id','subject_id']))

Out[7]:

	id	Name_x	subject_id	Name_y
0	4	Abhishek	sub6	Ankush
1	5	Ankit	sub5	Sam

Inner Join (default)¶

In [8]:

pd.merge(df1,df2, on='subject_id')

Out[8]:

	id_x	Name_x	subject_id	id_y	Name_y
0	2	Gautam	sub2	1	Saurabh
1	3	Garry	sub4	2	Akshay
2	4	Abhishek	sub6	4	Ankush
3	5	Ankit	sub5	5	Sam

In [9]:

pd.merge(df1,df2, on='subject_id', how='inner')

Out[9]:

	id_x	Name_x	subject_id	id_y	Name_y
0	2	Gautam	sub2	1	Saurabh
1	3	Garry	sub4	2	Akshay
2	4	Abhishek	sub6	4	Ankush
3	5	Ankit	sub5	5	Sam

Left Join¶

In [10]:

pd.merge(df1,df2, on='subject_id', how='left')

Out[10]:

	id_x	Name_x	subject_id	id_y	Name_y
0	1	Piyush	sub1	NaN	NaN
1	2	Gautam	sub2	1.0	Saurabh
2	3	Garry	sub4	2.0	Akshay
3	4	Abhishek	sub6	4.0	Ankush
4	5	Ankit	sub5	5.0	Sam

Right Join¶

In [11]:

pd.merge(df1,df2, on='subject_id', how='right')

Out[11]:

	id_x	Name_x	subject_id	id_y	Name_y
0	2.0	Gautam	sub2	1	Saurabh
1	3.0	Garry	sub4	2	Akshay
2	NaN	NaN	sub3	3	Rajat
3	4.0	Abhishek	sub6	4	Ankush
4	5.0	Ankit	sub5	5	Sam

Outer Join¶

In [12]:

pd.merge(df1,df2, on='subject_id', how='outer')

Out[12]:

	id_x	Name_x	subject_id	id_y	Name_y
0	1.0	Piyush	sub1	NaN	NaN
1	2.0	Gautam	sub2	1.0	Saurabh
2	3.0	Garry	sub4	2.0	Akshay
3	4.0	Abhishek	sub6	4.0	Ankush
4	5.0	Ankit	sub5	5.0	Sam
5	NaN	NaN	sub3	3.0	Rajat

Another Example¶

One to One Join¶

Perhaps the simplest type of merge expression is the one-to-one join, which is in many ways very similar to the column-wise concatenation

In [13]:

df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df2 = pd.DataFrame({'employee': ['Akshay', 'Karan', 'Robin', 'Arpit'],'hire_date': [2018, 2017, 2015, 2014]})

In [14]:

df1

Out[14]:

	employee	group
0	Karan	IT
1	Robin	Sales
2	Akshay	Sales
3	Arpit	HR

In [15]:

df2

Out[15]:

	employee	hire_date
0	Akshay	2018
1	Karan	2017
2	Robin	2015
3	Arpit	2014

To combine this information into a single DataFrame, we can use the pd.merge() function:¶

In [16]:

df3 = pd.merge(df1, df2)
df3

Out[16]:

	employee	group	hire_date
0	Karan	IT	2017
1	Robin	Sales	2015
2	Akshay	Sales	2018
3	Arpit	HR	2014

Many to One Joins¶

Many-to-one joins are joins in which one of the two key columns contains duplicate entries. For the many-to-one case, the resulting DataFrame will preserve those duplicate entries as appropriate.

In [17]:

df4 = pd.DataFrame({'group': ['IT', 'Sales', 'Sales', 'HR'],'supervisor': ['Keisha', 'Priya', 'Chetna','Atharv']})
df4

Out[17]:

	group	supervisor
0	IT	Keisha
1	Sales	Priya
2	Sales	Chetna
3	HR	Atharv

In [18]:

pd.merge(df3, df4)

Out[18]:

	employee	group	hire_date	supervisor
0	Karan	IT	2017	Keisha
1	Robin	Sales	2015	Priya
2	Robin	Sales	2015	Chetna
3	Akshay	Sales	2018	Priya
4	Akshay	Sales	2018	Chetna
5	Arpit	HR	2014	Atharv

The resulting DataFrame has an additional column with the “supervisor” information, where the information is repeated in one or more locations as required by the inputs.

Many to Many Joins¶

Many-to-many joins are a bit confusing conceptually, but are nevertheless well defined. If the key column in both the left and right array contains duplicates, then the result is a many-to-many merge.

In [19]:

df5 = pd.DataFrame({'group': ['IT', 'Sales', 'Sales', 'HR','HR','IT'],'skills': ['Python', 'Digital Marketing', 'Spreadsheets', 'Email-Marketing','Tele-Calling', 'Data Science']})
df5

Out[19]:

	group	skills
0	IT	Python
1	Sales	Digital Marketing
2	Sales	Spreadsheets
3	HR	Email-Marketing
4	HR	Tele-Calling
5	IT	Data Science

In [20]:

pd.merge(df1, df5)

Out[20]:

	employee	group	skills
0	Karan	IT	Python
1	Karan	IT	Data Science
2	Robin	Sales	Digital Marketing
3	Robin	Sales	Spreadsheets
4	Akshay	Sales	Digital Marketing
5	Akshay	Sales	Spreadsheets
6	Arpit	HR	Email-Marketing
7	Arpit	HR	Tele-Calling

The left_on and right_on keywords¶

At times you may wish to merge two datasets with different column names; for exam‐ ple, we may have a dataset in which the employee name is labeled as “name” rather than “employee”. In this case, we can use the left_on and right_on keywords to specify the two column names:

In [21]:

df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df1

Out[21]:

	employee	group
0	Karan	IT
1	Robin	Sales
2	Akshay	Sales
3	Arpit	HR

In [22]:

df2 = pd.DataFrame({'name': ['Karan', 'Robin', 'Akshay', 'Arpit'],'salary': [70000, 80000, 120000, 90000]})
df2

Out[22]:

	name	salary
0	Karan	70000
1	Robin	80000
2	Akshay	120000
3	Arpit	90000

In [23]:

pd.merge(df1, df2, left_on='employee', right_on='name')

Out[23]:

	employee	group	name	salary
0	Karan	IT	Karan	70000
1	Robin	Sales	Robin	80000
2	Akshay	Sales	Akshay	120000
3	Arpit	HR	Arpit	90000

The left_index and right_index keywords¶

Sometimes, rather than merging on a column, you would instead like to merge on an index. For example, your data might look like this:

In [24]:

df1 = pd.DataFrame({'employee': ['Karan', 'Robin', 'Akshay', 'Arpit'],'group': ['IT', 'Sales', 'Sales', 'HR']})
df2 = pd.DataFrame({'employee': ['Akshay', 'Karan', 'Robin', 'Arpit'],'hire_date': [2018, 2017, 2015, 2014]})

In [25]:

df1

Out[25]:

	employee	group
0	Karan	IT
1	Robin	Sales
2	Akshay	Sales
3	Arpit	HR

In [26]:

df2

Out[26]:

	employee	hire_date
0	Akshay	2018
1	Karan	2017
2	Robin	2015
3	Arpit	2014

In [27]:

df1a = df1.set_index('employee')
df2a = df2.set_index('employee')

In [28]:

df1a

Out[28]:

	group
employee
Karan	IT
Robin	Sales
Akshay	Sales
Arpit	HR

In [29]:

df2a

Out[29]:

	hire_date
employee
Akshay	2018
Karan	2017
Robin	2015
Arpit	2014

In [30]:

pd.merge(df1a, df2a, left_index=True, right_index=True)

Out[30]:

	group	hire_date
employee
Karan	IT	2017
Robin	Sales	2015
Akshay	Sales	2018
Arpit	HR	2014

Concatenate¶

In [31]:

df1 = pd.DataFrame({
         'Name': ['Akash', 'Abhishek', 'Arpit', 'aayush', 'Ankit'],
         'subject':['sub1','sub2','sub4','sub6','sub5'],
         'Marks':[98,90,87,69,78]},
         index=[1,2,3,4,5])
df1

Out[31]:

	Name	subject	Marks
1	Akash	sub1	98
2	Abhishek	sub2	90
3	Arpit	sub4	87
4	aayush	sub6	69
5	Ankit	sub5	78

In [32]:

df2= pd.DataFrame({
         'Name': ['Bobby', 'Bhavnish', 'Bhavya', 'Bonny', 'Brett'],
         'subject':['sub2','sub4','sub3','sub6','sub5'],
         'Marks':[89,80,79,97,88]},
         index=[1,2,3,4,5])
df2

Out[32]:

	Name	subject	Marks
1	Bobby	sub2	89
2	Bhavnish	sub4	80
3	Bhavya	sub3	79
4	Bonny	sub6	97
5	Brett	sub5	88

In [34]:

pd.concat([df1,df2])

Out[34]:

	Name	subject	Marks
1	Akash	sub1	98
2	Abhishek	sub2	90
3	Arpit	sub4	87
4	aayush	sub6	69
5	Ankit	sub5	78
1	Bobby	sub2	89
2	Bhavnish	sub4	80
3	Bhavya	sub3	79
4	Bonny	sub6	97
5	Brett	sub5	88

Reset index¶

In [35]:

pd.concat([df1,df2],ignore_index=True)

Out[35]:

	Name	subject	Marks
0	Akash	sub1	98
1	Abhishek	sub2	90
2	Arpit	sub4	87
3	aayush	sub6	69
4	Ankit	sub5	78
5	Bobby	sub2	89
6	Bhavnish	sub4	80
7	Bhavya	sub3	79
8	Bonny	sub6	97
9	Brett	sub5	88

Along the Axis = 0¶

In [36]:

pd.concat([df1,df2],ignore_index=True,axis=0)

Out[36]:

	Name	subject	Marks
0	Akash	sub1	98
1	Abhishek	sub2	90
2	Arpit	sub4	87
3	aayush	sub6	69
4	Ankit	sub5	78
5	Bobby	sub2	89
6	Bhavnish	sub4	80
7	Bhavya	sub3	79
8	Bonny	sub6	97
9	Brett	sub5	88

Along the axis = 1¶

In [37]:

pd.concat([df1,df2],ignore_index=True,axis=1)

Out[37]:

	0	1	2	3	4	5
1	Akash	sub1	98	Bobby	sub2	89
2	Abhishek	sub2	90	Bhavnish	sub4	80
3	Arpit	sub4	87	Bhavya	sub3	79
4	aayush	sub6	69	Bonny	sub6	97
5	Ankit	sub5	78	Brett	sub5	88