In Pandas, missing values, often represented as NaN
(Not a Number), can cause problems during data processing and analysis. These gaps in data can lead to incorrect analysis and misleading conclusions.
Pandas provides a host of functions like dropna()
, fillna()
and combine_first()
to handle missing values.
Let's consider the following DataFrame to illustrate various techniques on handling missing data:
import pandas as pd
import numpy as np
# create dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
print(df)
Output
A B C D 0 1.0 NaN 1.0 1 1 2.0 2.0 2.0 2 2 NaN 3.0 3.0 3 3 4.0 4.0 NaN 4 4 5.0 5.0 5.0 5
Here, we have used the NumPy library to generate NaN
values in the DataFrame.
Remove Rows Containing Missing Values
One straightforward way to handle missing values is by removing them. Since the data sets we deal with are often large, eliminating a few rows typically has minimal impact on the final outcome.
We use the dropna()
function to remove rows containing at least one missing value. For example,
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# remove rows with missing values
df.dropna(inplace=True)
print(df)
Output
A B C D 1 2.0 2.0 2.0 2 4 5.0 5.0 5.0 5
In this example, we removed all the rows containing NaN
values using dropna()
. The dropna()
method detects the rows with NaN
values and removes them.
Here, inplace=True
specifies that changes are to be made in the original DataFrame itself.
Replace Missing Values
Instead of deleting the entire row containing missing values, we can replace the missing values with a specified value using fillna()
.
Let's look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# replace missing values with 0
df.fillna(value=0, inplace=True)
print(df)
Output
A B C D 0 1.0 0.0 1.0 1 1 2.0 2.0 2.0 2 2 0.0 3.0 3.0 3 3 4.0 4.0 0.0 4 4 5.0 5.0 5.0 5
In this example, we replaced the NaN
values with 0 using fillna()
.
Replace Missing Values With Mean, Median and Mode
A more refined approach is to replace missing values with the mean, median, or mode of the remaining values in the column. This can give a more accurate representation than just replacing it with a default value.
We can use the fillna()
function with aggregate functions to replace missing values with mean, median or mode.
Let's look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df = pd.DataFrame(data)
# replace missing values with mean
df['A'].fillna(value=df['A'].mean(), inplace=True)
# replace missing values with median
df['B'].fillna(value=df['B'].median(), inplace=True)
# replace missing values with mode
df['C'].fillna(value=df['C'].mode()[0], inplace=True)
print(df)
Output
A B C D 0 1.0 3.5 1.0 1 1 2.0 2.0 2.0 2 2 3.0 3.0 3.0 3 3 4.0 4.0 1.0 4 4 5.0 5.0 5.0 5
In this example, we replaced the missing values of A
, B
and C
columns with their mean, median and mode respectively.
Here, mode()[0]
returns the most frequent value. Since all the values have the same frequency, it returns the first value of the column.
Replace Values Using Another DataFrame
We can replace missing values in one DataFrame using another DataFrame using the fillna()
method.
Let's look at an example.
import pandas as pd
import numpy as np
# create a dataframe with missing values
data1 = {
'A': [1, 2, np.nan, 4, 5],
'B': [np.nan, 2, 3, 4, 5],
'C': [1, 2, 3, np.nan, 5],
'D': [1, 2, 3, 4, 5]
}
df1 = pd.DataFrame(data1)
# create datframe to fill the missing values with
data2 = {
'A': [10, 20, 30, 40, 50],
'B': [10, 20, 30, 40, 50],
'C': [10, 20, 30, 40, 50],
'D': [10, 20, 30, 40, 50]
}
df2 = pd.DataFrame(data2)
# replace missing values
df1.fillna(df2, inplace=True)
print(df1)
Output
A B C D 0 1.0 10.0 1.0 1 1 2.0 2.0 2.0 2 2 30.0 3.0 3.0 3 3 4.0 4.0 40.0 4 4 5.0 5.0 5.0 5
Here, we've two dataframes df1 and df2. The fillna()
replaces missing values in df1 with corresponding values from df2.
Frequently Asked Questions
We can remove the columns containing only NaN
values by selecting such columns first using isnull()
and all()
methods first and then dropping the columns.
Let's look at an example.
import pandas as pd
import numpy as np
# create a DataFrame
data = {
'A': [1, 2, 3, 4],
'B': [5, 6, np.nan, np.nan],
'C': [np.nan, np.nan, np.nan, np.nan],
'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
# check which columns contain only NaN values
columns_with_nan = df.columns[df.isnull().all()]
# drop the columns containing only NaN values
df = df.drop(columns=columns_with_nan)
print(df)
Output
A B D 0 1 5.0 9 1 2 6.0 10 2 3 NaN 11 3 4 NaN 12
Here,
df.columns[df.isnull().all()]
- returns a list of columns where all values are nulldf.drop()
- removes the specified columns
Since column C
has NaN
values only, it is removed.
import pandas as pd
import numpy as np
# create a DataFrame
data = {
'A': [1, 2, 3, np.nan],
'B': [5, 6, np.nan, np.nan],
'C': [np.nan, np.nan, np.nan, 7],
'D': [9, 10, 11, 12]
}
df = pd.DataFrame(data)
# set the threshold for the maximum number of NaN values allowed
threshold = 2
# calculate the number of NaN values in each column
nan_counts = df.isnull().sum()
# remove columns that have more NaN values than the threshold
columns_to_drop = nan_counts[nan_counts > threshold].index
df = df.drop(columns=columns_to_drop)
print(df)
Output
A B D 0 1.0 5.0 9 1 2.0 6.0 10 2 3.0 NaN 11 3 NaN NaN 12
Here,
df.isnull().sum()
- returns a list of integers containing counts ofNaN
values in each columnnan_counts[nan_counts>threshold].index
- returns a list of column indices whoseNaN
count exceeds the threshold valuedf.drop()
- removes the specified columns
The above code removes the columns containing more than two NaN
values.