The get_dummies() function in Pandas converts categorical variables into dummy variables.
Each category becomes a new column of binary values (1 or 0) indicating whether that category is present in the corresponding row of the original data.
Example
import pandas as pd
# create a Series
data = pd.Series(['A', 'B', 'A', 'C', 'B'])
# use get_dummies on the Series
dummies = pd.get_dummies(data)
print(dummies)
'''
Output
A B C
0 1 0 0
1 0 1 0
2 1 0 0
3 0 0 1
4 0 1 0
'''
get_dummies() Syntax
The syntax of the get_dummies() function in Pandas is:
get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, drop_first=False)
get_dummies() Arguments
The get_dummies() function takes the following arguments:
data - the input data to be transformed
prefix (optional) - string to prepend to the dummy column names
prefix_sep (optional) - separator placed between the prefix and the category name in each dummy column name
dummy_na (optional) - whether to add a column that indicates NaNs; if False, NaNs are ignored
drop_first (optional) - whether to drop the first category level
get_dummies() Return Value
The get_dummies() function returns a DataFrame in which each unique value in the input becomes a separate column filled with binary values (1s and 0s), indicating the presence or absence of that value in each row of the original data.
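As a quick check of the return value, the sketch below inspects the type, shape, and column labels of the result. One caveat: the 1/0 outputs on this page correspond to integer dummies, while pandas 2.0 and later produce boolean (True/False) dummy columns by default. The sketch therefore passes dtype=int, a parameter that is not listed in the shortened syntax above, to get 1s and 0s.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# dtype=int requests integer 1/0 dummies instead of the version-dependent default
dummies = pd.get_dummies(data, dtype=int)

print(type(dummies).__name__)    # the result is a DataFrame, even for Series input
print(dummies.shape)             # one row per input value, one column per category
print(list(dummies.columns))     # category labels in sorted order
print(dummies.values.tolist())   # the binary indicator rows
```

The result always has one row per input value, so it can be aligned with the original data by index.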
Example 1: Convert a Series Into Dummy Variables
import pandas as pd
# create a Series
data = pd.Series(['apple', 'orange', 'apple', 'banana'])
# use get_dummies() to convert the series into dummy variables
dummy_data = pd.get_dummies(data)
print(dummy_data)
Output
apple banana orange
0 1 0 0
1 0 0 1
2 1 0 0
3 0 1 0
In the above example, we have created the data Series with fruit names.
We then applied get_dummies(), which creates a new DataFrame where each unique fruit name becomes a column.
For each row in the data Series, the corresponding column in the new DataFrame holds 1 if that fruit name appears in the row, and 0 otherwise.
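The row-by-row description above implies two properties that are easy to verify: every row contains exactly one 1, and the column holding that 1 is the original category. The following is a minimal sketch of that check. It passes dtype=int, which is not part of the original example, so that the dummies are integers on any recent pandas version.

```python
import pandas as pd

data = pd.Series(['apple', 'orange', 'apple', 'banana'])
dummy_data = pd.get_dummies(data, dtype=int)

# each row holds exactly one fruit, so each row of dummies sums to 1
row_sums = dummy_data.sum(axis=1)
print(row_sums.tolist())

# the label of the column containing the 1 is the original category
recovered = dummy_data.idxmax(axis=1)
print(recovered.tolist())
```

Here idxmax(axis=1) returns, for each row, the label of the column with the largest value, which is the column containing the 1.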
Example 2: Apply get_dummies() With Prefix
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies with a specified prefix
dummies = pd.get_dummies(df['Color'], prefix='Color')
print(dummies)
Output
Color_Blue Color_Green Color_Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
Here, we have passed the prefix='Color' argument to get_dummies(), so the new dummy variable columns are prefixed with Color_.
Hence, the resulting DataFrame contains the columns Color_Blue, Color_Green, and Color_Red, representing the presence or absence of the respective color categories.
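Prefixed names are most useful once the dummy columns sit next to other columns. As an illustrative sketch, the code below attaches the dummies to the original DataFrame with join(), which aligns on the index. The use of join() and dtype=int here is an assumption made for illustration and is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# integer dummies with the 'Color' prefix
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)

# attach the dummy columns to the original DataFrame, matching rows by index
combined = df.join(dummies)
print(combined)
```

In the combined DataFrame, the prefix makes it clear which original column each dummy column was derived from.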
Example 3: Get Dummies With Specified Prefix and Prefix Separator
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies with a specified prefix and prefix separator
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='--')
print(dummies)
Output
Color--Blue Color--Green Color--Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
In this example, the prefix_sep='--' argument means that the prefix and the original category name are separated by --.
So, for a color like Blue, the resulting column name in the dummies DataFrame is Color--Blue, and so on.
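One possible reason to pick a distinctive separator is that it makes the column names easy to split back into prefix and category, even when a category name itself contains an underscore. The sketch below shows this with plain string splitting. It is an illustration rather than a built-in pandas feature, and dtype=int is added only to keep the dummies as integers.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='--', dtype=int)

# split each column name once at the separator into [prefix, category]
parts = [name.split('--', 1) for name in dummies.columns]
print(parts)
```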
Example 4: Use dummy_na to Manage Missing Data
import pandas as pd
# sample data with a missing value
data = {'Color': ['Red', 'Green', 'Blue', None, 'Red']}
# create a DataFrame
df = pd.DataFrame(data)
# get dummies without considering NaN
dummies_without_nan = pd.get_dummies(df['Color'])
# get dummies considering NaN
dummies_with_nan = pd.get_dummies(df['Color'], dummy_na=True)
print("Dummies without NaN handling:\n", dummies_without_nan)
print("\nDummies with NaN handling:\n", dummies_with_nan)
Output
Dummies without NaN handling:
Blue Green Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 0 0
4 0 0 1
Dummies with NaN handling:
Blue Green Red NaN
0 0 0 1 0
1 0 1 0 0
2 1 0 0 0
3 0 0 0 1
4 0 0 1 0
Here,
get_dummies(df['Color']) - generates columns for Blue, Green, and Red, but gives no indication of the NaN value.
get_dummies(df['Color'], dummy_na=True) - generates the same columns plus an additional column named NaN that marks where NaN values were present in the original data.
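The difference between the two calls can also be seen from the row sums: without dummy_na the row with the missing value is all zeros, while with dummy_na=True every row contains exactly one 1. The following is a minimal sketch of that comparison. It uses a Series directly and passes dtype=int, both of which are simplifications that differ from the original example.

```python
import pandas as pd

colors = pd.Series(['Red', 'Green', 'Blue', None, 'Red'])

without_nan = pd.get_dummies(colors, dtype=int)
with_nan = pd.get_dummies(colors, dummy_na=True, dtype=int)

# the missing value at index 3 yields an all-zero row when NaNs are ignored
print(without_nan.sum(axis=1).tolist())

# with dummy_na=True the extra NaN column makes every row sum to 1
print(with_nan.sum(axis=1).tolist())

# the NaN indicator is the last column; select it by position, not by label
print(with_nan.iloc[:, -1].tolist())
```

The NaN column is selected by position because its label is a missing value rather than a string, which makes label-based lookup awkward.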
Example 5: Drop the First Category With drop_first
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# creating a DataFrame
df = pd.DataFrame(data)
# getting dummies without dropping any columns
dummies_all = pd.get_dummies(df['Color'])
print("DataFrame with all dummy columns:")
print(dummies_all)
print("\n")
# getting dummies and dropping the first category column ('Blue' in this case)
dummies = pd.get_dummies(df['Color'], drop_first=True)
print("DataFrame after dropping 'Blue':")
print(dummies)
Output
DataFrame with all dummy columns:
Blue Green Red
0 0 0 1
1 0 1 0
2 1 0 0
3 0 1 0
4 0 0 1
DataFrame after dropping 'Blue':
Green Red
0 0 1
1 1 0
2 0 0
3 1 0
4 0 1
Here, the drop_first=True argument is passed to get_dummies() to indicate that the first category should be dropped.
Hence, the resulting DataFrame contains two columns, Green and Red. The category named Blue has no column of its own because it was dropped; a row with 0 in both Green and Red represents Blue.
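No information is lost when the first category is dropped, because the dropped category can still be identified as the rows where all remaining dummy columns are 0. The sketch below checks this. It passes dtype=int, which is not part of the original example, so that the row sums are plain integers.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})
dummies = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

# three categories produce only two dummy columns
print(list(dummies.columns))

# rows where every remaining dummy is 0 belong to the dropped category
is_dropped = dummies.sum(axis=1) == 0
print(df.loc[is_dropped, 'Color'].tolist())
```

Dropping one level is often done to avoid a redundant column, since the last indicator can always be derived from the others.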