Pandas corr()

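The corr() method in Pandas computes the pairwise correlation coefficients between the columns of a DataFrame.

A correlation coefficient is a statistical measure of how strongly two variables are related to each other. Its value ranges from -1 to 1: values near 1 or -1 indicate a strong positive or negative relationship, and values near 0 indicate a weak one.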

Example

import pandas as pd

# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}

df = pd.DataFrame(data)

# compute correlation matrix
correlation_matrix = df.corr()

print(correlation_matrix)

'''
Output

          A        B         C
A  1.000000 -0.50000 -0.919953
B -0.500000  1.00000  0.120470
C -0.919953  0.12047  1.000000
'''

corr() Syntax

The syntax of the corr() method in Pandas is:

df.corr(method='pearson', min_periods=1, numeric_only=False)

corr() Arguments

The corr() method takes the following arguments:

  • method (optional): the correlation method to use: 'pearson' (default), 'kendall', or 'spearman'
  • min_periods (optional): minimum number of observations required per pair of columns to have a valid result (default is 1)
  • numeric_only (optional): if True, only columns with float, int, or boolean data are included (default is False)
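
As a minimal sketch of how these arguments fit together, the call below passes all three explicitly with the default values shown in the syntax above. Under that assumption it should give the same matrix as calling corr() with no arguments. The variable names default_result and explicit_result are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

# call with no arguments
default_result = df.corr()

# same call with every argument spelled out using its default value
explicit_result = df.corr(method='pearson', min_periods=1, numeric_only=False)

# check whether both calls produce the same correlation matrix
print(default_result.equals(explicit_result))
```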

corr() Return Value

The corr() method returns a square DataFrame of correlation coefficients, with the column names of the original DataFrame as both its row and column labels.
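
The shape of the returned object can be inspected directly. The following is a small sketch, using the same sample data as the other examples, that checks the type, the labels, and the symmetry of the result. NumPy is used here only for the symmetry check.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

result = df.corr()

# the result is itself a DataFrame
print(type(result))

# one row and one column per original column
print(result.shape)
print(list(result.index), list(result.columns))

# the matrix is symmetric: corr(A, B) equals corr(B, A)
values = result.to_numpy()
print(np.allclose(values, values.T))

# a single coefficient can be looked up by label
print(result.loc['A', 'B'])
```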


Example 1: Default Pearson Correlation Coefficient

import pandas as pd

# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}

df = pd.DataFrame(data)

# compute correlation matrix
correlation_matrix = df.corr()

print(correlation_matrix)

Output

          A        B         C
A  1.000000 -0.50000 -0.919953
B -0.500000  1.00000  0.120470
C -0.919953  0.12047  1.000000

In this example, we called corr() without any arguments, so it used the default method and calculated the Pearson correlation coefficient for each pair of columns.
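
To see where one of these numbers comes from, the sketch below recomputes the coefficient for columns A and C by hand using the standard Pearson formula (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with the value from corr(). The helper variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

a = df['A'].to_numpy(dtype=float)
c = df['C'].to_numpy(dtype=float)

# deviations of each value from its column mean
a_dev = a - a.mean()
c_dev = c - c.mean()

# Pearson formula: sum of products of deviations divided by
# the square root of the product of the summed squared deviations
r_manual = (a_dev * c_dev).sum() / np.sqrt((a_dev ** 2).sum() * (c_dev ** 2).sum())

# value computed by pandas for the same pair of columns
r_pandas = df.corr().loc['A', 'C']

print(r_manual)
print(r_pandas)
```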


Example 2: Kendall Tau Correlation Coefficient

import pandas as pd

# sample DataFrame with numeric data
data = {'A': [3, 2, 1],
        'B': [4, 6, 5],
        'C': [7, 18, 91]}

df = pd.DataFrame(data)

# compute correlation matrix
correlation_matrix = df.corr(method='kendall')

print(correlation_matrix)

Output

          A         B         C
A  1.000000 -0.333333 -1.000000
B -0.333333  1.000000  0.333333
C -1.000000  0.333333  1.000000

In this example, we calculated the Kendall Tau correlation coefficient for each pair of columns using method='kendall'.

To learn about correlation and different correlation methods in detail, please visit Pandas Correlation.
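
Besides 'pearson' and 'kendall', corr() also accepts method='spearman', which is not used in the examples on this page. The following is a minimal sketch on the same sample data. Spearman correlation is the Pearson correlation of the ranks of the values, so it only depends on the ordering within each column.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1],
                   'B': [4, 6, 5],
                   'C': [7, 18, 91]})

# compute the Spearman rank correlation matrix
spearman_matrix = df.corr(method='spearman')

print(spearman_matrix)
```

Here A decreases while C increases in every row, so their rank correlation is -1, the same value the Kendall Tau example reported for that pair.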


Example 3: Specify Minimum Number of Observations

import pandas as pd

# sample DataFrame with numeric data
data = {'A': [1, 2, 3, None, None],
        'B': [4, 7, None, None, None],
        'C': [7, 9, 8, None, None]}

df = pd.DataFrame(data)

# specify minimum number of observations required to perform computation
correlation_matrix = df.corr(min_periods=3)

print(correlation_matrix)

Output

     A   B    C
A  1.0 NaN  0.5
B  NaN NaN  NaN
C  0.5 NaN  1.0

In this example, the DataFrame df contains None values representing missing data. By setting min_periods=3, we specified that a correlation coefficient is computed for a pair of columns only if at least three rows have non-null values in both columns.

Here, the B column contains only two non-null values, so every pair involving B (including B with itself) has fewer than three observations. The corresponding coefficients are not calculated and appear as NaN in the output.
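
One way to check which pairs meet the min_periods threshold is to count, for every pair of columns, the rows where both values are present. The sketch below does this with a matrix product of the non-null indicator table. This is an illustration of the idea, not a description of how pandas implements it internally.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, None, None],
                   'B': [4, 7, None, None, None],
                   'C': [7, 9, 8, None, None]})

# 1 where a value is present, 0 where it is missing
present = df.notna().astype(int)

# entry (X, Y) counts the rows where both X and Y are non-null
pair_counts = present.T @ present
print(pair_counts)

# pairs with fewer than 3 shared observations are NaN in the result
correlation_matrix = df.corr(min_periods=3)
print(correlation_matrix.isna())
```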


Example 4: Calculate Correlation for Numeric Data Only

import pandas as pd

# sample DataFrame
data = {'A': [3, 2, 'A', 1],
        'B': [4, 6, 5, 7],
        'C': [7, 18.5, 91, 55]}

df = pd.DataFrame(data)

# compute correlation matrix
correlation_matrix = df.corr(numeric_only=True)

print(correlation_matrix)

Output

         B        C
B  1.00000  0.24257
C  0.24257  1.00000

In this example, column A contains the string 'A', so it is stored as non-numeric data. We used the numeric_only=True argument to skip columns with non-numeric data, and as a result column A is excluded from the computation.

Without this argument, corr() raises a ValueError in pandas 2.0 and later, because the non-numeric values cannot be converted to numbers.
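
The difference can be seen by calling corr() both ways on the same data. The sketch below assumes pandas 2.0 or later, where numeric_only defaults to False. Older versions may drop the non-numeric column silently instead of raising, so the error branch is treated as optional here.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 'A', 1],
                   'B': [4, 6, 5, 7],
                   'C': [7, 18.5, 91, 55]})

# default call: expected to fail on pandas 2.0+ because of the string in column A
try:
    df.corr()
    print('corr() succeeded without numeric_only')
except ValueError as err:
    print('ValueError:', err)

# restricting the computation to numeric columns avoids the error
numeric_result = df.corr(numeric_only=True)
print(list(numeric_result.columns))
```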