A dummy variable is a numerical representation used to encode categorical data.
A dummy variable takes only two values: 0 or 1.
For some data, each item can only belong to one category. For example, a car can be red or blue, but not both at the same time.
However, some data can belong to more than one category, like a movie that's both action and comedy.
In both cases, the point of get_dummies() in Pandas is to change these categories into 0s and 1s. This makes it easier for computer programs to understand and work with the data.
In the context of a dummy variable:
- The value 1 signifies that the row belongs to a specific category.
- The value 0 signifies that the row does not belong to that category.
In Pandas, we use the get_dummies() function to transform categorical variables into binary values.
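For data where one item can belong to several categories at once, such as the movie example above, a minimal sketch could use the string accessor method Series.str.get_dummies(). The genre strings and the '|' separator below are made-up illustration data, not part of the original examples.

```python
import pandas as pd

# hypothetical data: each movie lists its genres separated by '|'
movies = pd.Series(['Action|Comedy', 'Drama', 'Comedy'])

# split on '|' and create one indicator column per genre;
# a row can contain more than one 1
genre_dummies = movies.str.get_dummies(sep='|')

print(genre_dummies)
```

The first movie gets a 1 in both the Action and Comedy columns, which is the difference from the single-category case covered in the rest of this article.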
Using get_dummies() on Pandas Series
In Pandas, to use get_dummies() on a Series, we pass the Series to the function. For example,
import pandas as pd
# create a Pandas Series
data = pd.Series(['A', 'B', 'A', 'C', 'B'])
# using get_dummies on the Series
# dtype=int keeps the output as 0/1 (pandas 2.0+ returns True/False by default)
dummies = pd.get_dummies(data, dtype=int)
print(dummies)
Output
   A  B  C
0  1  0  0
1  0  1  0
2  1  0  0
3  0  0  1
4  0  1  0
In the above example, the columns A, B, and C contain binary values (1 or 0) indicating the presence or absence of each category for each row in the data Series.
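As a quick sanity check on the encoding described above, the sketch below confirms that every row contains exactly one 1 and shows one possible way to map the dummy columns back to the original labels. The variable names row_totals and recovered are only illustrative.

```python
import pandas as pd

data = pd.Series(['A', 'B', 'A', 'C', 'B'])

# dtype=int gives 0/1 values regardless of the pandas version
dummies = pd.get_dummies(data, dtype=int)

# each row belongs to exactly one category, so every row sums to 1
row_totals = dummies.sum(axis=1)
print(row_totals.tolist())

# for each row, idxmax returns the label of the column holding the 1
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())
```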
Use get_dummies() on a DataFrame Column
We can also apply get_dummies() to a single column of a DataFrame and then join the resulting dummy columns back to the original DataFrame. For example,
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# creating a DataFrame
df = pd.DataFrame(data)
# using get_dummies to convert the categorical column
# dtype=int keeps the output as 0/1 (pandas 2.0+ returns True/False by default)
dummies = pd.get_dummies(df['Color'], dtype=int)
# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)
print(df)
Output
   Color  Blue  Green  Red
0    Red     0      0    1
1  Green     0      1    0
2   Blue     1      0    0
3  Green     0      1    0
4    Red     0      0    1
In this example, we have applied the get_dummies() function to the Color column of the df DataFrame.
This function converts the categorical values in the Color column into a set of binary indicator columns.
In this case, since there are three unique colors (Red, Green, and Blue), three new columns are created, one per color.
The value in each of these columns is 1 if the corresponding color is present for a row and 0 if not.
Note: In pd.concat(), axis=1 joins the objects side by side. The dummy columns are added as new columns next to the existing ones, with rows matched by index.
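As an alternative to encoding one column and concatenating the result, get_dummies() also accepts a whole DataFrame together with the columns parameter. The sketch below assumes a made-up Price column to show that the other columns are left unchanged.

```python
import pandas as pd

# sample data; 'Price' is a hypothetical numeric column added for illustration
df = pd.DataFrame({
    'Color': ['Red', 'Green', 'Blue'],
    'Price': [10, 20, 30],
})

# encode only the 'Color' column; it is replaced by its dummy columns
encoded = pd.get_dummies(df, columns=['Color'], dtype=int)

print(encoded)
```

With this form, the original Color column is removed from the result and the new columns are named after it (Color_Blue, Color_Green, Color_Red), so no separate pd.concat() step is needed.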
Use of drop_first Inside get_dummies()
In Pandas, we can use the get_dummies() function to create dummy variables for a categorical column in a DataFrame and then drop the first category using the drop_first parameter.
Let's look at an example.
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# creating a DataFrame
df = pd.DataFrame(data)
# getting dummies without dropping any columns
# dtype=int keeps the output as 0/1 (pandas 2.0+ returns True/False by default)
dummies_all = pd.get_dummies(df['Color'], dtype=int)
# concatenating the dummies DataFrame with the original DataFrame
df_all = pd.concat([df, dummies_all], axis=1)
print("DataFrame with all dummy columns:")
print(df_all)
print("\n")
# getting dummies and dropping the first category column ('Blue' in this case)
dummies = pd.get_dummies(df['Color'], drop_first=True, dtype=int)
# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)
print("DataFrame after dropping 'Blue':")
print(df)
Output
DataFrame with all dummy columns:
   Color  Blue  Green  Red
0    Red     0      0    1
1  Green     0      1    0
2   Blue     1      0    0
3  Green     0      1    0
4    Red     0      0    1
DataFrame after dropping 'Blue':
   Color  Green  Red
0    Red      0    1
1  Green      1    0
2   Blue      0    0
3  Green      1    0
4    Red      0    1
Here, the drop_first=True argument is passed to get_dummies() to indicate that the first category should be dropped. The categories are sorted, so the first one here is Blue.
Hence the resulting DataFrame contains only two dummy columns, Green and Red. The category named Blue no longer has its own column; a Blue row is the one where both Green and Red are 0.
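Dropping a column does not lose information, because the dropped category can still be identified from the remaining columns. The following sketch illustrates this with the same Color data; the names reduced and is_blue are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']})

# drop the first category ('Blue'); dtype=int gives 0/1 values
reduced = pd.get_dummies(df['Color'], drop_first=True, dtype=int)

# a row where every remaining dummy is 0 belongs to the dropped category
is_blue = reduced.sum(axis=1) == 0

print(df.loc[is_blue, 'Color'].tolist())
```

This redundancy is a common reason for using drop_first=True: with three categories, two indicator columns are enough to tell all three apart.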
Use of prefix Inside get_dummies()
We can use the prefix parameter inside the get_dummies() function to specify a prefix for the dummy variables created from a DataFrame column.
Let's look at an example.
import pandas as pd
# sample data
data = {'Color': ['Red', 'Green', 'Blue', 'Green', 'Red']}
# creating a DataFrame
df = pd.DataFrame(data)
# getting dummies with a specified prefix
# dtype=int keeps the output as 0/1 (pandas 2.0+ returns True/False by default)
dummies = pd.get_dummies(df['Color'], prefix='Color', dtype=int)
# concatenating the dummies DataFrame with the original DataFrame
df = pd.concat([df, dummies], axis=1)
print(df)
Output
   Color  Color_Blue  Color_Green  Color_Red
0    Red           0            0          1
1  Green           0            1          0
2   Blue           1            0          0
3  Green           0            1          0
4    Red           0            0          1
Here, we have passed the prefix='Color' argument to get_dummies(), so the new dummy variable columns are prefixed with Color_.
Hence, the resulting DataFrame contains the columns Color_Blue, Color_Green, and Color_Red, representing the presence or absence of the respective color categories.
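The underscore between the prefix and the category name comes from the prefix_sep parameter, which defaults to '_'. As a small sketch, the example below changes it to a dot; the choice of '.' is arbitrary and only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Green', 'Blue']})

# use '.' instead of the default '_' between the prefix and the category
dummies = pd.get_dummies(df['Color'], prefix='Color', prefix_sep='.', dtype=int)

print(list(dummies.columns))
```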
Note: To learn more, visit Pandas get_dummies().