Filtering data is a common operation in data analysis. Pandas allows us to filter data based on different conditions.
We can filter the data in Pandas in two main ways:
- By column names (Labels)
- By the actual data inside (Values)
Filter Data By Labels
We can use the filter()
function to select columns by their names or labels. Let's look at an example.
import pandas as pd
# create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
# display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# use the filter() method to select columns based on a condition
filtered_df = df.filter(items=['Name', 'Salary'])
# display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
Output
Original DataFrame: Name Department Salary 0 Alice HR 50000 1 Bob Marketing 60000 2 Charlie Marketing 55000 3 David IT 70000 Filtered DataFrame: Name Salary 0 Alice 50000 1 Bob 60000 2 Charlie 55000 3 David 70000
In this example, we used filter()
to select the columns Name and Salary using their columns names.
To learn more, visit Pandas filter().
Filter Data By Values
We can also filter data by values. Some of the common ways to filter data by values are:
- Using logical operator
- The
isin()
method - The
str
Accessor - The
query()
method
Logical Operators
You can filter rows based on column values using logical operators. For example,
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
# display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# use logical operators to filter
filtered_df = df[df.Salary > 55000]
# display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
Output
Original DataFrame: Name Department Salary 0 Alice HR 50000 1 Bob Marketing 60000 2 Charlie Marketing 55000 3 David IT 70000 Filtered DataFrame: Name Department Salary 1 Bob Marketing 60000 3 David IT 70000
In the above example, we selected the rows based on the condition Salary > 55000
using logical operator >
.
The isin() Method
The isin()
method provides another way to filter data using column values. Let's look at an example.
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
# display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# use isin() method
departments = ['HR', 'IT']
filtered_df = df[df.Department.isin(departments)]
# display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
Output
Original DataFrame: Name Department Salary 0 Alice HR 50000 1 Bob Marketing 60000 2 Charlie Marketing 55000 3 David IT 70000 Filtered DataFrame: Name Department Salary 0 Alice HR 50000 3 David IT 70000
In this example, we selected the rows whose Department
values are present in the departments
list.
The str Accessor
We can effectively filter rows based on string values using the str
accessor. Let's look at an example.
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
# display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# use str accessor
filtered_df = df[df.Department.str.contains('Market')]
# display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
Output
Original DataFrame: Name Department Salary 0 Alice HR 50000 1 Bob Marketing 60000 2 Charlie Marketing 55000 3 David IT 70000 Filtered DataFrame: Name Department Salary 1 Bob Marketing 60000 2 Charlie Marketing 55000
Here, we filtered the rows based on a string value. We only selected the rows whose Department
values contained the string Market
.
The query() Method
This is the most flexible method for filtering a dataframe based on column values.
A query containing the filtering conditions can be passed as a string to the query()
method.
Let's look at an example.
import pandas as pd
# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Department': ['HR', 'Marketing', 'Marketing', 'IT'],
'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
# display the original DataFrame
print("Original DataFrame:")
print(df)
print("\n")
# use query method
filtered_df = df.query('Salary > 55000 and Department == "Marketing"')
# display the filtered DataFrame
print("Filtered DataFrame:")
print(filtered_df)
Output
Original DataFrame: Name Department Salary 0 Alice HR 50000 1 Bob Marketing 60000 2 Charlie Marketing 55000 3 David IT 70000 Filtered DataFrame: Name Department Salary 1 Bob Marketing 60000
In this example, we selected the rows with Salary > 55000
and Department == "Marketing"
using the query()
method.