Calculating Spearman's Rank Correlation Coefficient in Python with Pandas

Introduction

This guide is an introduction to Spearman's rank correlation coefficient, its mathematical calculation, and its computation via Python's pandas library. We'll construct various examples to gain a basic understanding of this coefficient and demonstrate how to visualize the correlation matrix via heatmaps.

What Is the Spearman Rank Correlation Coefficient?

Spearman rank correlation is closely related to the Pearson correlation; both are bounded values, ranging from -1 to 1, that denote the correlation between two variables.

If you'd like to read more about the alternative correlation coefficient - read our Guide to the Pearson Correlation Coefficient in Python.

The Pearson correlation coefficient is computed using raw data values, whereas the Spearman correlation is calculated from the ranks of individual values. While the Pearson correlation coefficient is a measure of the linear relation between two variables, the Spearman rank correlation coefficient measures the monotonic relation between a pair of variables. To understand the Spearman correlation, we need a basic understanding of monotonic functions.
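To make the rank connection concrete, here's a quick sketch of our own (not part of the original example) showing that the Spearman coefficient is simply the Pearson coefficient computed on the ranks of the data:

import pandas as pd

x = pd.Series([10, 20, 35, 70, 100])
y = pd.Series([0.3, 0.2, 0.8, 0.9, 1.5])

# Spearman on the raw values...
print(x.corr(y, method="spearman"))               # 0.9
# ...equals Pearson on the rank-transformed values
print(x.rank().corr(y.rank(), method="pearson"))  # 0.9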

Monotonic Functions

There are monotonically increasing, monotonically decreasing, and non-monotonic functions.

For a monotonically increasing function, as X increases, Y also increases (and it doesn't have to be linear). For a monotonically decreasing function, as one variable increases, the other one decreases (also doesn't have to be linear). A non-monotonic function is where the increase in the value of one variable can sometimes lead to an increase and sometimes lead to a decrease in the value of the other variable.
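As a quick illustration (a sketch of our own), we can check a sampled function for monotonicity by looking at the signs of its successive differences with NumPy:

import numpy as np

x = np.linspace(0, 10, 100)
y_incr = np.exp(x)      # monotonically increasing
y_non = (x - 5)**2      # non-monotonic

# A sequence is monotonically increasing if every successive difference is non-negative
print(np.all(np.diff(y_incr) >= 0))  # True
print(np.all(np.diff(y_non) >= 0))   # False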

Spearman rank correlation coefficient measures the monotonic relation between two variables. Its values range from -1 to +1 and can be interpreted as:

  • +1: Perfectly monotonically increasing relationship
  • +0.8: Strong monotonically increasing relationship
  • +0.2: Weak monotonically increasing relationship
  • 0: No monotonic relationship
  • -0.2: Weak monotonically decreasing relationship
  • -0.8: Strong monotonically decreasing relationship
  • -1: Perfectly monotonically decreasing relationship
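To get a feel for the intermediate values, here's a small sketch of our own: the noisier a monotonic trend is, the further the coefficient drifts from +1 toward 0:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = pd.Series(rng.uniform(0, 10, 200))

strong = x + rng.normal(0, 1, 200)   # trend dominates the noise -> r close to +1
weak = x + rng.normal(0, 20, 200)    # noise dominates the trend -> r close to 0

print(x.corr(strong, method="spearman"))
print(x.corr(weak, method="spearman"))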

Mathematical Expression

Suppose we have \(n\) observations of two random variables, \(X\) and \(Y\). We first rank all values of both variables as \(X_r\) and \(Y_r\) respectively. The Spearman rank correlation coefficient is denoted by \(r_s\) and is calculated by:

$$
r_s = \rho_{X_r,Y_r} = \frac{\text{COV}(X_r,Y_r)}{\text{STD}(X_r)\text{STD}(Y_r)} = \frac{n\sum\limits_{x_r\in X_r, y_r \in Y_r} x_r y_r - \sum\limits_{x_r\in X_r}x_r\sum\limits_{y_r\in Y_r}y_r}{\sqrt{\Big(n\sum\limits_{x_r \in X_r} x_r^2 -(\sum\limits_{x_r\in X_r}x_r)^2\Big)}\sqrt{\Big(n\sum\limits_{y_r \in Y_r} y_r^2 - (\sum\limits_{y_r\in Y_r}y_r)^2 \Big)}}
$$

Here, COV() is the covariance, and STD() is the standard deviation. Before we see Python's functions for computing this coefficient, let's do an example computation by hand to understand the expression and get to appreciate it.

Example Computation

Suppose we are given some observations of the random variables \(X\) and \(Y\). The first step is to convert \(X\) and \(Y\) to \(X_r\) and \(Y_r\), which represent their corresponding ranks. A few intermediate values would also be needed, which are shown below:

$$
X = \begin{bmatrix} -2 & -1 & 0 & 1 & 2 \end{bmatrix}^T \quad
Y = \begin{bmatrix} 4 & 1 & 3 & 2 & 0 \end{bmatrix}^T
$$

$$
X_r = \begin{bmatrix} 1 & 2 & 3 & 4 & 5 \end{bmatrix}^T \quad
Y_r = \begin{bmatrix} 5 & 2 & 4 & 3 & 1 \end{bmatrix}^T
$$

$$
X_r^2 = \begin{bmatrix} 1 & 4 & 9 & 16 & 25 \end{bmatrix}^T \quad
Y_r^2 = \begin{bmatrix} 25 & 4 & 16 & 9 & 1 \end{bmatrix}^T
$$

$$
X_r Y_r = \begin{bmatrix} 5 & 4 & 12 & 12 & 5 \end{bmatrix}^T
$$

Let's use the formula from before to compute the Spearman correlation:

$$
r_s = \frac{5 \cdot 38 - (15)(15)}{\sqrt{5 \cdot 55 - 15^2}\ \sqrt{5 \cdot 55 - 15^2}} = \frac{190 - 225}{50} = -0.7
$$

Great! Calculating this manually is time-consuming, though, and the best use of computers is to, well, compute things for us. Computing the Spearman correlation is really easy and straightforward with built-in functions in Pandas.
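Before switching to Pandas, here's a minimal NumPy sketch of the formula above (the helper name spearman_from_ranks is our own), handy for double-checking the hand computation:

import numpy as np

def spearman_from_ranks(xr, yr):
    # Direct transcription of the rank-based formula from the previous section
    n = len(xr)
    num = n * np.sum(xr * yr) - np.sum(xr) * np.sum(yr)
    den = np.sqrt(n * np.sum(xr**2) - np.sum(xr)**2) * \
          np.sqrt(n * np.sum(yr**2) - np.sum(yr)**2)
    return num / den

xr = np.array([1, 2, 3, 4, 5])
yr = np.array([5, 2, 4, 3, 1])
print(spearman_from_ranks(xr, yr))  # -0.7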

Computing the Spearman Rank Correlation Coefficient Using Pandas

The various correlation coefficients, including Spearman, can be computed via the corr() method of the Pandas library.

As an input argument, the corr() method accepts the method to be used for computing correlation (spearman in our case). The method is called on a DataFrame, say of size m×n, where each column represents the values of a random variable and m represents the total samples of each variable.

For n random variables, it returns an n×n square matrix R. R(i,j) indicates the Spearman rank correlation coefficient between random variables i and j. As the correlation coefficient between a variable and itself is 1, all diagonal entries (i,i) are equal to unity. In short:

$$
R(i,j) =
\begin{cases}
r_{i,j} & \text{if } i \neq j \\
1 & \text{otherwise}
\end{cases}
$$

Note that the correlation matrix is symmetric, as correlation is symmetric, i.e., R(i,j)=R(j,i). Let's take our simple example from the previous section and see how to use Pandas' corr() function:

import numpy as np
import pandas as pd
import seaborn as sns # For pairplots and heatmaps
import matplotlib.pyplot as plt

We'll be using Pandas for the computation itself, Matplotlib with Seaborn for visualization, and NumPy for additional operations on the data.

The code below computes the Spearman correlation matrix on the DataFrame x_simple. Note the ones on the diagonal, indicating that the correlation coefficient of a variable with itself is, naturally, one:

x_simple = pd.DataFrame([(-2,4),(-1,1),(0,3),(1,2),(2,0)],
                        columns=["X","Y"])
my_r = x_simple.corr(method="spearman")
print(my_r)
     X    Y
X  1.0 -0.7
Y -0.7  1.0

Visualizing the Correlation Coefficient

Given the table-like structure of bounded intensities in [-1, 1], a natural and convenient way of visualizing the correlation coefficient is a heatmap.

If you'd like to read more about heatmaps in Seaborn, read our Ultimate Guide to Heatmaps in Seaborn with Python!

A heatmap is a grid of cells, where each cell is assigned a color according to its value, and this visual way of interpreting correlation matrices is much easier for us than parsing numbers. For small tables like the one output previously, reading the numbers directly is perfectly fine. But with a lot of variables, it's much harder to interpret what's actually going on.

Let's define a display_correlation() function that computes the correlation coefficient and displays it as a heatmap:

def display_correlation(df):
    # Compute the Spearman correlation matrix
    r = df.corr(method="spearman")
    plt.figure(figsize=(10,6))
    # Plot the Spearman matrix itself (a bare df.corr() would default to Pearson)
    heatmap = sns.heatmap(r, vmin=-1, vmax=1, annot=True)
    plt.title("Spearman Correlation")
    return r

Let's call display_correlation() on our x_simple DataFrame to visualize the Spearman correlation:

r_simple=display_correlation(x_simple)

Understanding Spearman's Correlation Coefficient on Synthetic Examples

To understand the Spearman correlation coefficient, let's generate a few synthetic examples that accentuate how the coefficient works - before we dive into more natural examples. These examples will help us understand for what types of relationships this coefficient is +1, -1, or close to zero.

Before generating the examples, we'll create a new helper function, plot_data_corr(), that calls display_correlation() and plots the data against the X variable:

def plot_data_corr(df,title,color="green"):    
    r = display_correlation(df)
    fig, ax = plt.subplots(nrows=1, ncols=len(df.columns)-1,figsize=(14,3))
    for i in range(1,len(df.columns)):
        ax[i-1].scatter(df["X"],df.values[:,i],color=color)
        ax[i-1].title.set_text(title[i] +'\n r = ' + 
                             "{:.2f}".format(r.values[0,i]))
        ax[i-1].set(xlabel=df.columns[0],ylabel=df.columns[i])
    fig.subplots_adjust(wspace=.7)    
    plt.show()

Monotonically Increasing Functions

Let's generate a few monotonically increasing functions, using NumPy, and take a peek at the DataFrame once filled with the synthetic data:

seed = 11
rand = np.random.RandomState(seed)
# Create a data frame using various monotonically increasing functions
x_incr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_incr["Line+"] = x_incr.X*2+1
x_incr["Sq+"] = x_incr.X**2
x_incr["Exp+"] = np.exp(x_incr.X)
x_incr["Cube+"] = (x_incr.X-5)**3

print(x_incr.head())
          X      Line+        Sq+         Exp+       Cube+
0  1.802697   4.605394   3.249716     6.065985  -32.685221
1  0.194752   1.389505   0.037929     1.215010 -110.955110
2  4.632185  10.264371  21.457140   102.738329   -0.049761
3  7.249339  15.498679  52.552920  1407.174809   11.380593
4  4.202036   9.404072  17.657107    66.822246   -0.508101

Now let's look at the Spearman correlation's heatmap and the plot of various functions against X:

plot_data_corr(x_incr,["X","2X+1","$X^2$","$e^X$","$(X-5)^3$"])

We can see that for all of these examples, there is a perfectly monotonically increasing relationship between the variables. The Spearman correlation is +1, regardless of whether the variables have a linear or a non-linear relationship.

Pearson would have produced markedly different results here, since it's computed based on the linear relationship between the variables.

As long as Y increases as X increases, without fail, the Spearman rank correlation coefficient will be 1.
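One quick way to convince yourself of this (a sketch of our own): apply any strictly increasing transform to the data. The ranks, and therefore the Spearman coefficient, stay the same, while the Pearson coefficient changes:

# A separate generator, so the reproducible stream used above isn't disturbed
rng = np.random.RandomState(42)
x = pd.Series(rng.uniform(0, 10, 100))
y = np.exp(x)  # a strictly increasing transform of X

print(x.corr(y, method="spearman"))  # exactly 1.0
print(x.corr(y, method="pearson"))   # noticeably below 1.0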

Monotonically Decreasing Functions

Let's repeat the same examples on monotonically decreasing functions. We'll again generate synthetic data and compute the Spearman rank correlation. First, let's look at the first 5 rows of the DataFrame:

# Create a data matrix
x_decr = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_decr["Line-"] = -x_decr.X*2+1
x_decr["Sq-"] = -x_decr.X**2
x_decr["Exp-"] = np.exp(-x_decr.X)
x_decr["Cube-"] = -(x_decr.X-5)**3
x_decr.head()
          X      Line-        Sq-      Exp-      Cube-
0  3.181872  -5.363744 -10.124309  0.041508   6.009985
1  2.180034  -3.360068  -4.752547  0.113038  22.424963
2  8.449385 -15.898771 -71.392112  0.000214 -41.041680
3  3.021647  -5.043294  -9.130350  0.048721   7.743039
4  4.382207  -7.764413 -19.203736  0.012498   0.235792

The correlation matrix's heatmap and the plots of the variables are given below:

plot_data_corr(x_decr,["X","-2X+1","$-X^2$","$e^{-X}$","$-(X-5)^3$"],"blue")

Non-monotonic Functions

The examples below are for various non-monotonic functions. The last column added to the DataFrame is that of an independent variable Rand, which has no association with X.

These examples should also clarify that Spearman correlation is a measure of monotonicity of a relationship between two variables. A zero coefficient does not necessarily indicate no relationship, but it does indicate that there is no monotonicity between them.
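A tiny sketch of our own makes the point: a parabola centered on the data is a perfect functional relationship, yet its Spearman coefficient is (close to) zero because the relationship isn't monotonic:

x = pd.Series(np.linspace(0, 10, 101))
y = (x - 5)**2  # fully determined by X, but not monotonic

print(x.corr(y, method="spearman"))  # ~0 by symmetry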

Before generating synthetic data, we'll define yet another helper function, display_corr_pairs(), that calls display_correlation() to display the heatmap of the correlation matrix and then plots all pairs of variables in the DataFrame against each other using the Seaborn library.

On the diagonal, we'll display the histogram of each variable in yellow using map_diag(). Below the diagonal, we'll make a scatter plot of all variable pairs. As the correlation matrix is symmetric, we don't need the plots above the diagonal.

Let's also display the Pearson correlation coefficient for comparison:

def display_corr_pairs(df,color="cyan"):
    s = set_title = np.vectorize(lambda ax,r,rho: ax.title.set_text("r = " + 
                                        "{:.2f}".format(r) + 
                                        '\n $\\rho$ = ' + 
                                        "{:.2f}".format(rho)) if ax!=None else None
                            )      

    r = display_correlation(df)
    rho = df.corr(method="pearson")
    g = sns.PairGrid(df,corner=True)
    g.map_diag(plt.hist,color="yellow")
    g.map_lower(sns.scatterplot,color="magenta")
    set_title(g.axes,r,rho)
    plt.subplots_adjust(hspace = 0.6)
    plt.show()    

We'll create a non-monotonic DataFrame, x_non, with these functions of X:

  • Parabola: \( (X-5)^2 \)

  • Sin: \( \sin (\frac{X}{10}2\pi) \)

  • Frac: \( \frac{X-5}{(X-5)^2+1} \)

  • Rand: Random numbers in the range [-1,1]

Below are the first 5 rows of x_non:

x_non = pd.DataFrame({"X":rand.uniform(0,10,100)})
x_non["Parabola"] = (x_non.X-5)**2
x_non["Sin"] = np.sin(x_non.X/10*2*np.pi)
x_non["Frac"] = (x_non.X-5)/((x_non.X-5)**2+1)
x_non["Rand"] = rand.uniform(-1,1,100)

print(x_non.head())
          X   Parabola       Sin      Frac      Rand
0  0.654466  18.883667  0.399722 -0.218548  0.072827
1  5.746559   0.557351 -0.452063  0.479378 -0.818150
2  6.879362   3.532003 -0.924925  0.414687 -0.868501
3  5.683058   0.466569 -0.416124  0.465753  0.337066
4  6.037265   1.075920 -0.606565  0.499666  0.583229

The Spearman correlation coefficient between different data pairs is illustrated below:

display_corr_pairs(x_non)

These examples show for what type of data the Spearman correlation is close to zero and where it takes intermediate values. Another thing to note is that the Spearman and Pearson correlation coefficients are not always in agreement with each other, so a lack of one doesn't imply a lack of the other.

They're used to test correlation for different facets of data, and can't be used interchangeably. While they will be in agreement in some cases, they won't always be.
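For instance (a toy sketch of our own), a single extreme pair can push the Pearson coefficient toward 1, while Spearman, which only sees the ranks, barely reacts:

s = pd.Series([1, 2, 3, 4, 100])
t = pd.Series([2, 1, 4, 3, 90])

print(s.corr(t, method="pearson"))   # ~1.0, dominated by the outlier pair
print(s.corr(t, method="spearman"))  # 0.8, based on the ranks alone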

Spearman Correlation Coefficient on Linnerud Dataset

Let's apply the Spearman correlation coefficient to an actual dataset. We have chosen the simple physical exercise dataset called linnerud from the sklearn.datasets package for demonstration:

from sklearn.datasets import load_linnerud

The code below loads the dataset and joins the target variables and attributes in one DataFrame. Let's look at the first 5 rows of the linnerud data:

d = load_linnerud()

dat = pd.DataFrame(d.data, columns=d.feature_names)
alldat = dat.join(pd.DataFrame(d.target, columns=d.target_names))
alldat.head()
   Chins  Situps  Jumps  Weight  Waist  Pulse
0    5.0   162.0   60.0   191.0   36.0   50.0
1    2.0   110.0   60.0   189.0   37.0   52.0
2   12.0   101.0  101.0   193.0   38.0   58.0
3   12.0   105.0   37.0   162.0   35.0   62.0
4   13.0   155.0   58.0   189.0   35.0   46.0

Now, let's display the correlation pairs using our display_corr_pairs() function:

display_corr_pairs(alldat)

Looking at the Spearman correlation values, we can make interesting conclusions such as:

  • Higher waist values imply higher weight values (from r = 0.81)
  • More situps are associated with lower waist values (from r = -0.72)
  • Chins, situps, and jumps don't seem to have a monotonic relationship with pulse, as the corresponding r values are close to zero.
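If you want the exact numbers behind these observations rather than reading them off the heatmap, you can index into the correlation matrix directly (a small sketch of ours):

r_all = alldat.corr(method="spearman")
print(r_all.loc["Waist", "Weight"])  # ~0.81
print(r_all.loc["Waist", "Situps"])  # ~-0.72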

Conclusions

In this guide, we discussed the Spearman rank correlation coefficient, its mathematical expression, and its computation via Python's pandas library.

We demonstrated this coefficient on various synthetic examples and also on the Linnerud dataset. The Spearman correlation coefficient is an ideal measure of the monotonicity of the relationship between two variables. However, a value close to zero does not necessarily indicate that the variables have no association between them.
