How to find and remove duplicate data in pandas

A common task in data science and analysis is identifying and removing duplicate data. Duplicate records are usually of little use, and they can skew an analysis, for example by inflating counts or weighting the same observation more than once. This is why it is important to know how to identify and remove them.

We will use pandas to identify and remove duplicate data from a data frame. Take a look at the code snippet below:

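import pandas as pd
# Column names for the file, which has no header row.
user_cols = ['user_id', 'age', 'gender',
             'occupation', 'zip_code']
# Load the pipe-separated file, using user_id as the index.
users = pd.read_table('http://bit.ly/movieusers',
                      sep='|', header=None,
                      names=user_cols, index_col='user_id')
# Flag zip codes that repeat an earlier value (last five shown).
print("\nDuplicate Zip Codes:")
print(users.zip_code.duplicated().tail())
# Count the flagged zip codes (True counts as 1).
print("\nNumber of Duplicate Zip Codes:")
print(users.zip_code.duplicated().sum())
# Flag rows that repeat an earlier row in every column.
print("\nDuplicate Rows:")
print(users.duplicated().tail())
# Count the fully duplicated rows.
print("\nNumber of Duplicate Rows:")
print(users.duplicated().sum())
# Drop the repeats, keeping the first occurrence of each row.
print("\nTotal number of Rows:", users.shape[0])
users = users.drop_duplicates()
print("\nTotal number of Unique Rows:", users.shape[0])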

Explanation:

  • In line 1, we import the required package.

  • In line 3, we create a list of column names that are present in our data.

  • In line 6, we read the data into a data frame with read_table(), passing the column names and setting user_id as the index. At this point, our data is loaded in the data frame users.

  • In line 11, we call duplicated() on the zip_code column. It returns a Boolean value for each entry: True if the same zip code has already appeared earlier in the column, and False otherwise, so the first occurrence of a value is never flagged. We print only the last five results by using tail(). Here, we can see that, of those last five entries, there is one zip_code that is a duplicate.

  • In line 14, we print the number of duplicate values in the zip_code column by using the sum() function. In the sum, True counts as 1 and False as 0, so the result is the number of entries that repeat an earlier zip code.

  • In line 17, we check for entire duplicate rows by calling duplicated() on the data frame itself rather than on a single column. A row is flagged only if its values in every column match an earlier row; because user_id is the index, it is not part of the comparison. We then print the results for the last five rows. In the output, we can see that none of the last five rows is a duplicate.

  • In line 20, we print the number of rows that are duplicates using the sum() function.

    Now that we have identified the duplicate data in our data frame, it is time to remove the duplicates.

  • In line 23, we use the function drop_duplicates() on the entire data frame. This removes every row that repeats an earlier row, keeps the first occurrence of each, and returns the result as a new data frame, which we assign back to users. We can verify this by comparing the number of rows before and after removing the duplicates.
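
To make the flagging behavior described in the steps above easier to see, here is a minimal sketch on a tiny in-memory data frame. The data and the names df, first_mask, last_mask, all_mask, and deduped are made up for illustration and are not part of the movie users file. The keep parameter of duplicated() decides which occurrence counts as the original.

```python
import pandas as pd

# Hypothetical toy data, not taken from the movie users file.
df = pd.DataFrame(
    {
        "age": [24, 53, 24, 33, 24],
        "zip_code": ["85711", "94043", "85711", "15213", "85711"],
    }
)

# Default keep="first": the first occurrence is False, later repeats are True.
first_mask = df.duplicated()
# keep="last": the last occurrence is the one left unflagged.
last_mask = df.duplicated(keep="last")
# keep=False: every row that has a twin anywhere is flagged.
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, False, True]
print(last_mask.tolist())   # [True, False, True, False, False]
print(all_mask.tolist())    # [True, False, True, False, True]

# drop_duplicates() keeps the rows that the default mask leaves as False.
deduped = df.drop_duplicates()
print(deduped.index.tolist())  # [0, 1, 3]
```

With the default setting, summing the mask gives 2 here, because two of the three identical rows repeat an earlier one.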

In this way, we can easily identify and remove the duplicate data from our data frame in pandas.
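
The snippet above finds duplicates in a single column but removes only whole-row duplicates. If a single column should decide what counts as a duplicate, drop_duplicates() accepts a subset argument. The following is a small sketch under that assumption; users_demo and its values are hypothetical and only mirror the layout of the real data.

```python
import pandas as pd

# Hypothetical toy data with user_id as the index, mirroring the layout above.
users_demo = pd.DataFrame(
    {
        "age": [24, 53, 23, 24],
        "occupation": ["technician", "other", "writer", "technician"],
        "zip_code": ["85711", "94043", "85711", "85711"],
    },
    index=pd.Index([1, 2, 3, 4], name="user_id"),
)

# Whole-row duplicates: only user 4 repeats an earlier row (user 1).
by_row = users_demo.drop_duplicates()

# Duplicates judged on zip_code alone: users 3 and 4 repeat user 1's zip code.
by_zip = users_demo.drop_duplicates(subset=["zip_code"])

print(by_row.index.tolist())  # [1, 2, 3]
print(by_zip.index.tolist())  # [1, 2]
```

Whether column-level removal is appropriate depends on the analysis: two different users can legitimately share a zip code, so dropping by zip_code alone would discard real records.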
