The crosstab() function in Pandas creates contingency tables, also known as cross-tabulations.
A contingency table helps us understand the relationship between two or more categorical variables within a dataset.
Example
import pandas as pd
# sample DataFrame
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)
# create a cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'])
print(cross_tab)
'''
Output
Smoker  No  Yes
Gender
Female   2    0
Male     1    2
'''
crosstab() Syntax
The syntax of the crosstab() function in Pandas is:
pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)
crosstab() Arguments
The crosstab() function has the following arguments:
index: the column or array-like object (or a list of them) whose values will be used as rows
columns: the column or array-like object (or a list of them) whose values will be used as columns
values (optional): the array of values to aggregate at each intersection of index and columns; requires aggfunc
rownames (optional): the names to be used for the row index
colnames (optional): the names to be used for the column index
aggfunc (optional): the aggregation function to apply to values; requires values
margins (optional): whether to include row and column margins (subtotals)
margins_name (optional): the name to be used for the margin labels
dropna (optional): whether to exclude columns whose entries are all NaN
normalize (optional): whether to normalize the values to show proportions; accepts True (or 'all'), 'index', or 'columns'.
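As a rough illustration of the rownames and colnames arguments, the sketch below passes plain NumPy arrays instead of DataFrame columns. The arrays and their contents are made up for this example; because arrays carry no names of their own, the two arguments supply the axis labels.

```python
import numpy as np
import pandas as pd

# Illustrative data held in plain NumPy arrays rather than a DataFrame.
gender = np.array(['Male', 'Female', 'Male', 'Female', 'Male'])
smoker = np.array(['Yes', 'No', 'Yes', 'No', 'No'])

# Arrays carry no names, so rownames and colnames label the two axes.
table = pd.crosstab(gender, smoker, rownames=['Gender'], colnames=['Smoker'])
print(table)
```

Without rownames and colnames, pandas falls back to generic axis labels for unnamed arrays.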
crosstab() Return Value
The crosstab() function returns a DataFrame representing the cross-tabulation of the factors specified in index and columns.
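Because the result is an ordinary DataFrame, the usual DataFrame tools apply to it. The following is a minimal sketch using the same sample data as the introductory example; the variable names are only illustrative.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)

cross_tab = pd.crosstab(df['Gender'], df['Smoker'])

# The result is a regular DataFrame.
print(isinstance(cross_tab, pd.DataFrame))

# The axes take their names from the Series that were passed in.
print(cross_tab.index.name, cross_tab.columns.name)

# Individual counts can be read with label-based indexing.
male_smokers = cross_tab.loc['Male', 'Yes']
print(male_smokers)
```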
Example 1: Basic Cross-Tabulation
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No']}
df = pd.DataFrame(data)
# create a basic cross-tabulation of Gender and Employed
cross_tab = pd.crosstab(df['Gender'], df['Employed'])
print(cross_tab)
Output
Employed  No  Yes
Gender
Female     0    2
Male       1    2
In this example, we created a basic cross-tabulation of Gender and Employed to see how employed and unemployed people are distributed across genders.
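To make clear what the counts represent, the same table can be approximated by grouping on both columns and counting rows. This is only a sketch of an equivalent computation, not a description of how crosstab() is implemented; the names via_crosstab and via_groupby are illustrative.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Employed': ['Yes', 'Yes', 'Yes', 'Yes', 'No']}
df = pd.DataFrame(data)

via_crosstab = pd.crosstab(df['Gender'], df['Employed'])

# Count rows per (Gender, Employed) pair, then pivot Employed into columns.
# fill_value=0 covers combinations that never occur in the data.
via_groupby = df.groupby(['Gender', 'Employed']).size().unstack(fill_value=0)

print(via_crosstab)
print(via_groupby)
```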
Example 2: Margins in crosstab()
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)
# create a cross-tabulation with margins
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], margins=True, margins_name='Total')
print(cross_tab)
Output
Smoker  No  Yes  Total
Gender
Female   2    0      2
Male     1    2      3
Total    3    2      5
In this example, we included row and column margins in the cross-tabulation to show the totals for each row and column.
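As a quick sanity check, the margins can be compared against sums computed by hand. The sketch below reuses the sample data above; the helper variable names are illustrative.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)

cross_tab = pd.crosstab(df['Gender'], df['Smoker'],
                        margins=True, margins_name='Total')

# Strip the margins to recover the plain counts.
counts = cross_tab.drop(index='Total', columns='Total')

# Each row's 'Total' entry should equal the sum of that row's counts.
rows_match = (counts.sum(axis=1) == cross_tab['Total'].drop('Total')).all()

# The bottom-right cell should equal the number of rows in the DataFrame.
grand_total = cross_tab.loc['Total', 'Total']

print(rows_match, grand_total)
```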
Example 3: Normalized Cross-Tabulation
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)
# create a normalized cross-tabulation of Gender and Smoker
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], normalize=True)
print(cross_tab)
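Output
Smoker   No  Yes
Gender
Female  0.4  0.0
Male    0.2  0.4
In this example, we created a normalized cross-tabulation to show proportions instead of raw counts. With normalize=True, each count is divided by the total number of observations (5 here), so all cells together sum to 1.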
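Besides True, the normalize argument also accepts 'index' and 'columns', which normalize within each row or each column instead of over the whole table. A minimal sketch on the same data, with illustrative variable names:

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No']}
df = pd.DataFrame(data)

# Proportions within each gender: every row sums to 1.
by_row = pd.crosstab(df['Gender'], df['Smoker'], normalize='index')

# Proportions within each smoking status: every column sums to 1.
by_col = pd.crosstab(df['Gender'], df['Smoker'], normalize='columns')

print(by_row)
print(by_col)
```

Row-wise normalization answers "what share of each gender smokes?", while column-wise normalization answers "what share of smokers belongs to each gender?".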
Example 4: Aggregate Functions with crosstab()
import pandas as pd
data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)
# create a cross-tabulation of Gender and Smoker with average Age as the aggregation
cross_tab = pd.crosstab(df['Gender'], df['Smoker'], values=df['Age'], aggfunc='mean')
print(cross_tab)
Output
Smoker    No   Yes
Gender
Female  35.0   NaN
Male    45.0  30.0
In this example, we used aggfunc='mean' to calculate the mean age for smokers and non-smokers of each gender. The NaN entry appears because the data contains no female smokers, so there is no age to average.
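When all the factors live in one DataFrame, a similar table can be obtained with the pivot_table() method. The sketch below compares the two on the sample data; the variable names are illustrative, and the comparison is meant as a demonstration rather than a statement about which approach is preferable.

```python
import pandas as pd

data = {'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
        'Smoker': ['Yes', 'No', 'Yes', 'No', 'No'],
        'Age': [25, 30, 35, 40, 45]}
df = pd.DataFrame(data)

# Mean age per (Gender, Smoker) cell via crosstab().
via_crosstab = pd.crosstab(df['Gender'], df['Smoker'],
                           values=df['Age'], aggfunc='mean')

# The same aggregation expressed with pivot_table(), using column names.
via_pivot = df.pivot_table(index='Gender', columns='Smoker',
                           values='Age', aggfunc='mean')

print(via_crosstab)
print(via_pivot)
```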