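The corr() method in Pandas computes the pairwise correlation coefficients between the columns of a DataFrame.
A correlation coefficient is a statistical measure of how strongly two variables are related. It ranges from -1 (perfect negative relationship) to 1 (perfect positive relationship), with 0 indicating no relationship of the kind the chosen method measures.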
Example
import pandas as pd
# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}
df = pd.DataFrame(data)
# compute correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
'''
Output

          A        B         C
A  1.000000 -0.50000 -0.919953
B -0.500000  1.00000  0.120470
C -0.919953  0.12047  1.000000
'''
corr() Syntax
The syntax of the corr() method in Pandas is:
df.corr(method='pearson', min_periods=1, numeric_only=False)
corr() Arguments
The corr() method takes the following arguments:
method (optional): the correlation method to use: 'pearson' (default), 'kendall', 'spearman', or a callable
min_periods (optional): the minimum number of observations required per pair of columns to have a valid result
numeric_only (optional): whether to include only columns with numeric data types
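As an illustration of the method argument, the sketch below passes method='spearman', a value pandas accepts alongside 'pearson' and 'kendall'. It also checks the result against the Pearson correlation of the ranked data, which is one way to think about what Spearman correlation measures. The variable names are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

# Spearman correlation via the method argument
spearman = df.corr(method='spearman')

# Pearson correlation computed on the ranks of each column
pearson_of_ranks = df.rank().corr(method='pearson')

print(spearman)

# largest absolute difference between the two matrices
max_diff = (spearman - pearson_of_ranks).abs().max().max()
print(max_diff < 1e-12)
```

On this sample data the two matrices agree, since Spearman correlation is the Pearson correlation of the rank-transformed values.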
corr() Return Value
The corr() method returns a DataFrame containing correlation coefficients between columns.
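Because the return value is an ordinary DataFrame, it can be inspected and indexed like any other. The following minimal sketch, using sample data assumed for illustration, shows that the result is square, labeled by column name on both axes, and that a single coefficient can be read with .loc.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

result = df.corr()

# the result is a DataFrame with one row and one column per input column
print(type(result).__name__)
print(result.shape)
print(list(result.index), list(result.columns))

# read the coefficient for a single pair of columns
a_vs_b = result.loc['A', 'B']
b_vs_a = result.loc['B', 'A']
print(a_vs_b, b_vs_a)
```

The matrix is symmetric, so the coefficient for ('A', 'B') is the same as for ('B', 'A').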
Example 1: Default Pearson Correlation Coefficient
import pandas as pd
# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}
df = pd.DataFrame(data)
# compute correlation matrix
correlation_matrix = df.corr()
print(correlation_matrix)
Output
          A        B         C
A  1.000000 -0.50000 -0.919953
B -0.500000  1.00000  0.120470
C -0.919953  0.12047  1.000000
In this example, we used corr() with its default settings to calculate the Pearson correlation coefficient for each pair of columns.
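To make the default concrete, the sketch below recomputes the Pearson coefficient for columns A and B from its definition (the sum of products of deviations from the mean, divided by the square root of the product of the summed squared deviations) and compares it with the value from df.corr(). The helper name pearson_by_hand is hypothetical and not part of pandas.

```python
import math
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

def pearson_by_hand(x, y):
    """Pearson r computed directly from deviations around the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

manual = pearson_by_hand(df['A'].tolist(), df['B'].tolist())
from_pandas = df.corr().loc['A', 'B']

print(manual)
print(math.isclose(manual, from_pandas))
```

For A and B the deviations are (1, 0, -1) and (-1, 1, 0), so the numerator is -1 and the denominator is the square root of 2 × 2, giving -0.5, as in the output above.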
Example 2: Kendall Tau Correlation Coefficient
import pandas as pd
# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}
df = pd.DataFrame(data)
# compute correlation matrix
correlation_matrix = df.corr(method='kendall')
print(correlation_matrix)
Output
          A         B         C
A  1.000000 -0.333333 -1.000000
B -0.333333  1.000000  0.333333
C -1.000000  0.333333  1.000000
In this example, we calculated the Kendall Tau correlation coefficient for each pair of columns using method='kendall'.
To learn about correlation and different correlation methods in detail, please visit Pandas Correlation.
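The Kendall values above can be reproduced by counting concordant and discordant pairs of rows. The sketch below does this with a hypothetical helper, kendall_no_ties, which assumes the data contains no tied values (true for this sample) and is not a general replacement for method='kendall'. It also passes the helper to corr() as a callable, which pandas accepts as a method.

```python
from itertools import combinations
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

def kendall_no_ties(x, y):
    """Kendall tau without tie handling: (concordant - discordant) / pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1    # both columns move in the same direction
        elif direction < 0:
            discordant += 1    # the columns move in opposite directions
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs

tau_ab = kendall_no_ties(df['A'].tolist(), df['B'].tolist())
tau_ac = kendall_no_ties(df['A'].tolist(), df['C'].tolist())
tau_bc = kendall_no_ties(df['B'].tolist(), df['C'].tolist())
print(round(tau_ab, 6), round(tau_ac, 6), round(tau_bc, 6))

# the same helper used as a callable correlation method
matrix = df.corr(method=kendall_no_ties)
print(matrix)
```

For A and B, one of the three row pairs is concordant and two are discordant, so tau is (1 - 2) / 3, which matches the -0.333333 shown in the output above.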
Example 3: Specify Minimum Number of Observations
import pandas as pd
# sample DataFrame with numeric data
data = {'A': [1, 2, 3, None, None],
        'B': [4, 7, None, None, None],
        'C': [7, 9, 8, None, None]}
df = pd.DataFrame(data)
# specify minimum number of observations required to perform computation
correlation_matrix = df.corr(min_periods=3)
print(correlation_matrix)
Output
     A   B    C
A  1.0 NaN  0.5
B  NaN NaN  NaN
C  0.5 NaN  1.0
In this example, the DataFrame df contains None values representing missing data. By setting min_periods=3, we specified that at least three observations with non-null values in both columns are required to compute the correlation coefficient for a pair of columns.
Here, since the B column contains only two non-null values, the correlation coefficients involving B (including the correlation of B with itself) are not calculated and appear as NaN.
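One way to see which pairs meet the min_periods threshold is to count, for every pair of columns, the rows where both values are present. The sketch below builds that count matrix from a 0/1 presence indicator and compares it with the NaN pattern of the result. The names present and pair_counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, None, None],
                   'B': [4, 7, None, None, None],
                   'C': [7, 9, 8, None, None]})

# 1 where a value is present, 0 where it is missing
present = df.notna().astype(int)

# entry (X, Y) counts the rows where both X and Y are non-null
pair_counts = present.T @ present
print(pair_counts)

# pairs with at least three shared observations
enough = pair_counts >= 3

# pairs for which corr() produced a value
computed = df.corr(min_periods=3).notna()
print((enough.to_numpy() == computed.to_numpy()).all())
```

Every pair involving B shares only two rows of data, while A and C share three, which is why only the A and C entries are filled in.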
Example 4: Calculate Correlation for Numeric Data Only
import pandas as pd
# sample DataFrame
data = {'A': [3, 2, 'A', 1],
        'B': [4, 6, 5, 7],
        'C': [7, 18.5, 91, 55]}
df = pd.DataFrame(data)
# compute correlation matrix
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)
Output
         B        C
B  1.00000  0.24257
C  0.24257  1.00000
In this example, we used the numeric_only=True argument to skip the columns with non-numeric data. Column A contains the string 'A', so it is stored with the object dtype and is excluded from the computation.
This argument is useful to avoid a ValueError caused by non-numeric data in the DataFrame.
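A roughly equivalent approach, sketched here as an assumption rather than a description of how pandas implements the argument, is to select the numeric columns explicitly with select_dtypes() before calling corr(). This can be handy when you also want to inspect which columns were dropped.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 'A', 1],
                   'B': [4, 6, 5, 7],
                   'C': [7, 18.5, 91, 55]})

# column A holds mixed values, so it is not a numeric dtype
print(df.dtypes)

# keep only numeric columns, then correlate
numeric_part = df.select_dtypes(include='number')
manual = numeric_part.corr()

# built-in filtering for comparison
builtin = df.corr(numeric_only=True)

dropped = [col for col in df.columns if col not in numeric_part.columns]
print(dropped)
print(manual)
```

On this sample data both approaches keep columns B and C and produce the same coefficients.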