Pandas duplicated()

The duplicated() method in Pandas flags duplicate rows in a DataFrame. Rows are compared on all columns by default, or on a chosen subset of columns.

Example

import pandas as pd

# sample DataFrame
data = {'A': [1, 2, 2],
        'B': [4, 5, 5]}

df = pd.DataFrame(data)

# identify duplicate rows
duplicates = df.duplicated()

print(duplicates)
'''
Output

0    False
1    False
2     True
dtype: bool
'''

duplicated() Syntax

The syntax of the duplicated() method in Pandas is:

df.duplicated(subset=None, keep='first')

duplicated() Arguments

The duplicated() method has the following arguments:

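  • subset (optional): column label or sequence of labels to consider when identifying duplicates; by default, all columns are used
  • keep (optional): determines which occurrences to mark as duplicates. 'first' (default) marks every occurrence except the first, 'last' marks every occurrence except the last, and False marks all occurrences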

duplicated() Return Value

The duplicated() method returns a boolean Series with the same index as the DataFrame, where True marks a row as a duplicate.
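
Because the return value is a boolean Series, it can be used as a mask. The following is a minimal sketch of that idea, not part of the original examples; the names mask, duplicate_count, duplicate_rows and unique_rows are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2],
                   'B': [4, 5, 5]})

# boolean Series: True for rows that repeat an earlier row
mask = df.duplicated()

# True counts as 1, so the sum is the number of flagged rows
duplicate_count = int(mask.sum())

# rows flagged as duplicates
duplicate_rows = df[mask]

# rows that remain after filtering out the flagged ones
unique_rows = df[~mask]

print(duplicate_count)
print(duplicate_rows)
print(unique_rows)
```

Here, duplicate_count is 1, duplicate_rows contains only the row at index 2, and unique_rows contains the rows at index 0 and 1.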


Example 1: Identifying Duplicates in a Specific Column

import pandas as pd

data = {'A': [1, 2, 2],
        'B': [4, 5, 6]}
df = pd.DataFrame(data)

# identify duplicates in column 'A'
duplicates_in_A = df.duplicated(subset='A')

print(duplicates_in_A)

Output

0    False
1    False
2     True
dtype: bool

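In this example, we identified duplicates based on column A only, using the subset='A' argument.

Here, the third row is marked True because its value in column A (2) repeats the value in the second row, even though the values in column B differ (5 and 6).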
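
The subset argument also accepts a sequence of labels. As a small sketch with made-up data (the extra column C is an assumption added for illustration), rows can be compared on columns A and B together while column C is ignored:

```python
import pandas as pd

data = {'A': [1, 2, 2, 2],
        'B': [4, 5, 5, 6],
        'C': [7, 8, 9, 9]}
df = pd.DataFrame(data)

# compare rows on columns 'A' and 'B' only; column 'C' is ignored
duplicates_in_A_B = df.duplicated(subset=['A', 'B'])

print(duplicates_in_A_B)
```

Only the third row is marked True, since its (A, B) pair (2, 5) repeats the second row. The fourth row has the pair (2, 6), which has not appeared before.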


Example 2: Keeping Last Occurrences

import pandas as pd

data = {'A': [1, 2, 2, 2],
        'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)

# mark duplicates, treating the last occurrence as the one to keep
last_occurrences = df.duplicated(keep='last')

print(last_occurrences)

Output

0    False
1     True
2     True
3    False
dtype: bool

In this example, we marked all duplicates as True except for the last occurrence using the keep='last' argument.

Here, there are three occurrences of the row values [2, 5]. The first two are marked True whereas the last one is marked False.
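
One possible use of this mask is to keep only the last occurrence of each row. The sketch below inverts the mask with ~ and compares the result with drop_duplicates(keep='last'), which should select the same rows; the variable names kept and dropped are illustrative.

```python
import pandas as pd

data = {'A': [1, 2, 2, 2],
        'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)

# invert the mask to keep only rows that are not marked as duplicates
kept = df[~df.duplicated(keep='last')]

# drop_duplicates() with the same keep value selects the same rows
dropped = df.drop_duplicates(keep='last')

print(kept)
print(kept.equals(dropped))
```

Here, kept holds the rows at index 0 and 3, and it is equal to the result of drop_duplicates(keep='last').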


Example 3: Marking All Duplicates

import pandas as pd

data = {'A': [1, 2, 2, 2],
        'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)

# mark all duplicates
all_duplicates = df.duplicated(keep=False)

print(all_duplicates)

Output

0    False
1     True
2     True
3     True
dtype: bool

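In this example, we marked every row that has a duplicate as True, including the first and last occurrences, using the keep=False argument. The row [1, 4] appears only once, so it is marked False.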
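
A mask built with keep=False is convenient for inspecting every repeated row at once. As a rough sketch (the groupby step is an addition for illustration, not part of the original example), the repeated rows can be selected and then counted per distinct combination of values:

```python
import pandas as pd

data = {'A': [1, 2, 2, 2],
        'B': [4, 5, 5, 5]}
df = pd.DataFrame(data)

# select every row that appears more than once
repeated = df[df.duplicated(keep=False)]

# count how many times each repeated combination occurs
counts = repeated.groupby(['A', 'B']).size()

print(repeated)
print(counts)
```

Here, repeated contains the rows at index 1, 2 and 3, and counts shows that the combination (2, 5) occurs three times.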