How to check for duplicated rows of a DataFrame in Pandas

Overview

In pandas, the duplicated() function returns a Boolean Series that marks which rows of a DataFrame are duplicates of another row.

Syntax

The syntax for the duplicated() function is as follows:

DataFrame.duplicated(subset=None, keep='first')

Parameters

The duplicated() function takes the following parameters:

  • subset (optional): A column label or sequence of labels specifying the columns to consider when identifying duplicates. By default, all columns are used.
  • keep (optional): This takes any of the following values:
    • "first": Marks every duplicate as True except for the first occurrence. This is the default.
    • "last": Marks every duplicate as True except for the last occurrence.
    • False: Marks all duplicates as True. Note that this is the Boolean value False, not the string "false".
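As a rough illustration of how these parameters change the result, the sketch below compares the three keep options and a subset on a small DataFrame. The column names name and score and the sample rows are made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical,
# row 1 differs from them only in the letter case of "name".
df = pd.DataFrame(
    [["THEO", 1], ["Theo", 1], ["THEO", 1]],
    columns=["name", "score"],
)

# keep="first" (the default): later occurrences are marked True
first = df.duplicated(keep="first").tolist()

# keep="last": earlier occurrences are marked True
last = df.duplicated(keep="last").tolist()

# keep=False (the Boolean, not a string): every occurrence is marked True
all_marked = df.duplicated(keep=False).tolist()

# subset: only the "score" column is compared
by_score = df.duplicated(subset=["score"]).tolist()

print(first)       # [False, False, True]
print(last)        # [True, False, False]
print(all_marked)  # [True, False, True]
print(by_score)    # [False, True, True]
```

Because the comparison is case-sensitive, "Theo" is never treated as a duplicate of "THEO" unless the name column is left out of subset.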

Return value

The duplicated() function returns a Boolean Series with one value per row: True if the row is marked as a duplicate, and False otherwise.

By default, the duplicated() function returns False for the first occurrence of a duplicated row and True for every later occurrence. With keep="last", the last occurrence is False and every earlier occurrence is True. With keep=False, every occurrence of a duplicated row is True.
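Since the return value is an ordinary Boolean Series, it can be used as a mask. The following is a minimal sketch of that idea, assuming a made-up DataFrame with city and visits columns; it counts the duplicates and selects the duplicated and non-duplicated rows.

```python
import pandas as pd

# Hypothetical sample data: rows 2 and 3 repeat row 0.
df = pd.DataFrame({
    "city": ["Lagos", "Accra", "Lagos", "Lagos"],
    "visits": [3, 5, 3, 3],
})

# Boolean Series: True for each row that repeats an earlier row
mask = df.duplicated()

# True counts as 1, so the sum is the number of duplicate rows
n_duplicates = int(mask.sum())

# Boolean indexing keeps only the rows where the mask is True
duplicate_rows = df[mask]

# Inverting the mask keeps the first occurrence of every row
unique_rows = df[~mask]

print(n_duplicates)    # 2
print(duplicate_rows)
print(unique_rows)
```

With the default keep="first", df[~mask] selects the same rows as df.drop_duplicates().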

Example

# Code to illustrate the duplicated() function

# importing the pandas library
import pandas as pd

# creating a dataframe
df = pd.DataFrame([["THEO",1,1,3,"A"],
                   ["Theo",1,1,3,"A"],
                   ["THEO",1,1,3,"A"]],
                  columns=list('ABCDE'))
# printing the dataframe
print(df)
print("\n")

# checking for duplicate rows (keep="first" by default)
print(df.duplicated())
print("\n")

# marking every occurrence except the last as True
print(df.duplicated(keep="last"))
print("\n")

# checking for duplicates using column A only
print(df.duplicated(subset=["A"]))

Explanation

  • Line 4: We import the pandas library.
  • Lines 7-10: We create a dataframe, df. Rows 0 and 2 are identical, while row 1 ("Theo") differs from them in letter case.
  • Line 12: We print the dataframe.
  • Line 16: We call the duplicated() function with its default arguments. Every repeated row except its first occurrence is marked True, so only row 2 is True.
  • Line 20: We pass "last" as the value of keep. Every occurrence of a duplicated row except the last is marked True, so only row 0 is True.
  • Line 24: We check for duplicates using only the values in column "A". Because the comparison is case-sensitive, only row 2 is True.
