join()
: Primarily joins DataFrames based on their indices.merge()
: More versatile, allows joining on specified columns, and supports various join types (inner, outer, left, right, cross).
Key takeaways:
The
merge()
method supports multiple merge types (left
,right
,outer
,inner
, andcross
), catering to various data combination needs.Users can specify the columns to merge on, allowing precise control over how datasets are combined.
The method closely resembles SQL JOIN operations, making it intuitive for those familiar with relational databases.
merge()
methodThe merge()
method in Python is used to combine two or more DataFrames based on their common columns. The merge()
method works similar to the JOIN
operations in SQL. Merging multiple DataFrames is useful when we want to visualize different chunks of information from different DataFrames all in one place. It enables us to bring information together for further processing and analysis.
Here is the generic syntax for merging two DataFrames:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)
The merge()
method takes the following parameters:
left
: This is a DataFrame.
right
: This is another DataFrame. The left
DataFrame is merged with the right
DataFrame and a new DataFrame is returned.
Note: Nothing happens to the
left
andright
DataFrames, i.e., they do not change.
how
: This parameter specifies the type of merge to be performed. The following are the types of merges available within the merge()
method: left
, right
, outer
, inner
and cross
.
left
type: This type of merge returns all the rows from the left
DataFrame. It also returns the rows from the right
DataFrame that are in common with those of the left
DataFrame. If there are no rows in the right
DataFrame that are in common (rows with similar values) with those of the left
DataFrame, NaN
is returned. The merge is performed on a common column or columns.
right
type: This type of merge returns all the rows from the right
DataFrame. It also returns the rows from the left
DataFrame that are in common with those of the right
DataFrame. If there are no rows in the left
DataFrame that are in common with those of the right
DataFrame, NaN
is returned. The merge is performed on a common column or columns.
outer
type: This type of merge returns all the rows from the left
and right
DataFrames. This also includes the rows from the left
and right
DataFrames that are in common with each other. If no common rows are found, NaN
is returned. The merge is performed on a common column or columns.
inner
type: This is the default merge type. It returns only the rows that are common between the left
and right
DataFrames. If there are no common rows, an empty DataFrame is returned. The merge is performed on a common column or columns.
cross
type: This type of merge returns the Cartesian product of the rows from the left
and right
DataFrames, i.e., it combines and returns each row from the left
DataFrame with each row from the right
DataFrame. We don’t need to specify any common column or columns when using this merge type.
on
: This parameter represents the column or columns to perform the merge on. These columns should be present in both the left
and right
DataFrames. Its default value is None
, i.e., no common column or columns are used to perform the merge()
operation.
left_on
(None
by default):
Column(s) to merge on from the left DataFrame.
Used when column names differ between DataFrames.
If None
, it merges on the index of the left DataFrame.
right_on
(None
by default):
Column(s) to merge on from the right DataFrame.
If None
, it merges on the index of the right DataFrame.
left_index
(False
by default):
If True
, merge uses the index of the left DataFrame as the key.
right_index
(False
by default):
If True
, merge uses the index of the right DataFrame as the key.
sort
(False
by default):
If True
, sorts the resulting DataFrame by the merge keys.
suffixes
(('_x', '_y')
by default):
A tuple of two strings that are added as suffixes to overlapping column names from the left and right DataFrames. This helps avoid column name duplication and allows clear identification of source columns. At least one value must not be None
.
copy
(True
by default):
If True
, creates a copy of the DataFrame.
If False
, modifies in place (not often necessary).
indicator
(False
by default):
If True
, adds a _merge
column to the output DataFrame, showing the source of each row. It indicates whether a row is from the left DataFrame only, right DataFrame only, or both. This column is helpful for understanding the merge outcome.
validate
(None
by default):
Ensures that the merge follows the specified relationship type (e.g., "one_to_one", "one_to_many", "many_to_one", "many_to_many"). It checks if the merge keys are unique as per the specified type, providing validation on the merging logic.
merge()
functionLet’s look at an example where we will create two DataFrames and then perform the merge()
operation on them.
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Right data frame')print(right_dataframe)print('')
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Right data frame')print(right_dataframe)print('')print('Left merge')print(pd.merge(left_dataframe, right_dataframe, how='left', on='course'))print('')
First, two data frames are created. The left_dataframe
contains information about students, such as their id
, name
, age
, and the course
they are enrolled in. Similarly, the right_dataframe
also contains student information, but it has different names, ages, and courses. The columns in both data frames that are common are id
, name
, age
, and course
.
The pd.merge()
function is then used to merge these two data frames based on a common column, which is the course
column. The merge operation is specified to be a 'left' merge, meaning all rows from the left data frame (left_dataframe
) will be included, and matching rows from the right data frame (right_dataframe
) will be added where possible. If there is no match for a particular course from the left data frame, the corresponding columns from the right data frame will contain NaN
.
The result of this merge operation is printed, which shows how the two data frames are combined by the course
column. In this case, the age
and name
columns from the right_dataframe
are added to the left_dataframe
, and any missing values (if no matching courses are found) are filled with NaN
.
pandas
DataFrameThis code demonstrates how to perform a right merge between two data frames using the pandas
library in Python.
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Right data frame')print(right_dataframe)print('')print('Right merge')print(pd.merge(left_dataframe, right_dataframe, how='right', on='course'))print('')
Just like the previous example, two data frames are created. The left_dataframe
contains details about students, including their id
, name
, age
, and the course
they are taking. The right_dataframe
contains similar information but with different names, ages, and courses.
In this case, the pd.merge()
function is used to merge the two data frames, with the how='right'
argument. This specifies that the merge should be a right merge, meaning all rows from the right data frame (right_dataframe
) will be included in the result. Matching rows from the left data frame (left_dataframe
) will be added where possible. If no match is found for a particular course from the right data frame, the corresponding columns from the left data frame will contain NaN
.
The result of the right merge operation is printed, which shows how the two data frames are combined by the course
column. Here, the age
and name
columns from the left_dataframe
are added to the right_dataframe
, and any missing values are filled with NaN
if no corresponding course is found in the left data frame.
This right merge ensures that all entries from the right data frame are retained, which is useful when you want to prioritize the data in the right frame but still combine matching rows from the left.
pandas
DataFrameThis code demonstrates how to perform an inner merge between two data frames using the pandas
library in Python.
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Right data frame')print(right_dataframe)print('')print('Inner merge')print(pd.merge(left_dataframe, right_dataframe, how='inner', on=['age', 'course']))print('')
First, two data frames are created: left_dataframe
and right_dataframe
. Each contains information about students, such as their id
, name
, age
, and the course
they are enrolled in. The courses and ages vary between the two data frames, but both data frames share the same structure.
The pd.merge()
function is used to merge the two data frames with the how='inner'
argument, specifying that only the rows with matching values in both the age
and course
columns will be included in the result. In an inner merge, only the rows where there is a match between the two data frames (based on the specified columns, which in this case are age
and course
) are retained. Rows that do not have a corresponding match in both data frames are excluded from the result.
The output shows how the two data frames are merged based on the common values in both the age
and course
columns. For instance, students with the same age
and course
values in both data frames will be included, while others that don't match in both columns will be omitted.
This type of merge is useful when you want to retain only the data that has exact matches across both data frames for the specified columns, ensuring that only common entries are included.
pandas
DataFrameThis code demonstrates two different types of merges between two data frames using the pandas
library in Python: an outer merge.
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Outer merge')print(pd.merge(left_dataframe, right_dataframe, how='outer', on='course'))print('')
First, two data frames, left_dataframe
and right_dataframe
, are created. Each contains student data with columns for id
, name
, age
, and course
. The data in these two data frames are similar but not identical, especially in the course
column, which has some common values and some unique ones between the two data frames.
The pd.merge()
function is first used with the how='outer'
argument, which specifies an outer merge. In an outer merge, all rows from both data frames are included in the result. For rows that do not have a match in one of the data frames based on the course
column, NaN
is used to fill the columns from the non-matching data frame. This ensures that no data is discarded, even if there is no match for a given course. The result is a combination of both data frames, where all the courses are included, and missing values are handled with NaN
where necessary.
pandas
DataFrameThis code demonstrates two different types of merges between two data frames using the pandas
library in Python: an outer merge.
import pandas as pdleft_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['Leo', 'Jacob', 'James', 'Mason'],'age': [18, 20, 23, 19],'course': ['english', 'persian', 'arts', 'chemistry']})print('')right_dataframe = pd.DataFrame({'id': [1, 2, 3, 4],'name': ['William', 'Lucas', 'Henry', 'Elio'],'age': [18, 21, 26, 25],'course': ['english', 'persian', 'Literature', 'Physics']})print('')print('Left data frame')print(left_dataframe)print('')print('Cross merge')print(pd.merge(left_dataframe, right_dataframe, how='cross'))print('')
First, two data frames, left_dataframe
and right_dataframe
, are created. Each contains student data with columns for id
, name
, age
, and course
. The data in these two data frames are similar but not identical, especially in the course
column, which has some common values and some unique ones between the two data frames.
The merge operation uses how='cross'
. A cross-merge combines every row from the left data frame with every row from the right data frame. Unlike the other types of merges, there are no conditions for matching columns; every row from the left is paired with every row from the right. This results in a large number of combinations, where the course
and age
columns from both data frames are combined across all possible pairings. In this case, the number of rows in the result will be the product of the number of rows in the left and right data frames (4 * 4 = 16 rows).
These two types of merges are useful in different scenarios: an outer merge is useful when you want to retain all data, even if some values don't have corresponding matches, while a cross merge is helpful when you want to generate all possible combinations between two data sets.
Imagine we work for a supermarket with two separate databases—one for customer information and another for orders. The customer database includes customer IDs, names, addresses, and other details, while the order database includes order IDs, customer IDs, product details, and order dates. Which method would we use to get a list of all the customers who have placed orders?
left
merge
right
merge
inner
merge
outer
merge
The pandas merge()
method simplifies the process of combining datasets, offering flexibility with different merge types such as left
, right
, outer
, inner
, and cross
. Its ability to handle diverse scenarios makes it a cornerstone function for data analysts and developers working with relational datasets.
Haven’t found what you were looking for? Contact Us
Free Resources