Statistics for Data Science (with Python): Correlation Coefficient

Hi and welcome

Have you ever wondered how to tell which features influence others in a given experiment or analysis? For instance, does an individual's height have any effect on his or her weight, or is there a linear relationship between age and salary? In this post I will explain one of the tools you can use for this purpose.

Correlation coefficient

This is a statistical measure of the strength and direction of the linear relationship between two variables. It indicates how well the variables can be related using the formula y = a + bx, where a and b are constants and x and y are the variables or parameters.

The correlation coefficient, denoted r or R, lies between -1 and 1 inclusive. There are different degrees of correlation, but they can be broadly classified into positive, negative and no correlation. If the coefficient is 0, there is no correlation between the variables. Note that no correlation does not mean the variables are unrelated; it only means that any relationship between them is not linear. Their relationship could be exponential, quadratic, etc.
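To make the last point concrete, here is a minimal sketch with made-up numbers (not data from this post). The values of y are completely determined by x through a quadratic rule, yet the linear correlation coefficient comes out as zero because the relationship is not a straight line. It uses numpy's corrcoef function, which returns the matrix of pairwise correlation coefficients.

```python
import numpy as np

# Hypothetical data: x is symmetric around zero and y depends on x exactly.
x = np.arange(-5, 6)   # -5, -4, ..., 5
y = x ** 2             # a perfect quadratic relationship

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r for (x, y).
r = np.corrcoef(x, y)[0, 1]
print(round(r, 6))     # essentially zero, even though y is a function of x
```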

Positive correlation is when the correlation coefficient is positive, i.e. 0 < r ≤ 1, while negative correlation is when the coefficient is negative, i.e. -1 ≤ r < 0.
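As a quick illustration of the two extremes, the sketch below uses made-up values that lie exactly on a line y = a + bx. With a positive slope b the coefficient is 1, and with a negative slope it is -1 (up to floating-point rounding). The constants 3 and 2 are arbitrary choices for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # made-up values

y_up = 3 + 2 * x     # y = a + bx with b > 0
y_down = 3 - 2 * x   # y = a + bx with b < 0

r_up = np.corrcoef(x, y_up)[0, 1]       # perfect positive correlation
r_down = np.corrcoef(x, y_down)[0, 1]   # perfect negative correlation
print(r_up, r_down)
```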

There are other degrees of correlation, as shown below.

c1.PNG

This means a value of 0.73 is a strong positive correlation: an increase in one parameter goes with a corresponding increase in the other, and likewise for a decrease. A value of -0.43 is a weak negative correlation, indicating that as one parameter increases, the other tends to decrease, and vice versa. The former can be seen as a direct variation and the latter as an inverse variation.

Determination of Correlation coefficient

There are different methods that can be used to determine the correlation coefficient between two parameters, namely Pearson's correlation, Spearman's rank correlation and Kendall's correlation. In this post we shall consider Pearson's correlation, as it is the most common measure.
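Although the rest of this post sticks to Pearson's correlation, a small sketch may help show why the alternatives exist. The data here is made up: Y increases with X along a curve, so Pearson's coefficient (which measures linear association) falls short of 1, while Spearman's rank coefficient (which only looks at the ordering of the values) is 1. The method argument of the pandas DataFrame.corr method selects which coefficient is computed.

```python
import pandas as pd

# Made-up data: Y grows with X, but along a curve rather than a straight line.
demo = pd.DataFrame({"X": range(1, 11)})
demo["Y"] = demo["X"] ** 3

pearson_r = demo.corr(method="pearson").loc["X", "Y"]
spearman_r = demo.corr(method="spearman").loc["X", "Y"]
print(pearson_r, spearman_r)
```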

The Pearson product-moment correlation (PPMC) coefficient can be calculated using the formula

c2.PNG

where x and y are the individual values of the parameters X and Y.
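For readers who cannot see the image, the computational form of the formula, as implemented step by step in the code later in this post, can be written as follows. Here n is the number of (x, y) pairs, and each sum runs over all pairs.

```latex
r = \frac{n\sum xy - \sum x \sum y}
         {\sqrt{\left[n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```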

Let's consider an example.

The data below is simulated and represents the ages and weights of 25 individuals from a survey. We shall be using numpy, pandas and matplotlib for this illustration. First, let's see how we can perform a semi-automated computation on the data. Below is the data set and how it was loaded.

import pandas as pd
df=pd.read_csv("survey_data.csv")
print(df)

c4.PNG

First we have to complete the table by computing the relevant quantities. This will be done with the help of pandas, as shown below.

df["XY"]=df.X*df.Y
df["X_squared"]=df.X**2
df["Y_squared"]=df.Y**2
print(df.head())

c6.PNG

Now, for the numerator of the formula, we have c5.PNG, which when implemented in pandas becomes

n=len(df) # gets the number of pairs
r_numerator=n*df.XY.sum()-df.X.sum()*df.Y.sum() # gets the numerator of the PPMC formula

For the denominator, we have
c7.PNG, and this can be implemented in pandas as

r_denominator=((n*df.X_squared.sum()-(df.X.sum())**2)*(n*df.Y_squared.sum()-(df.Y.sum())**2))**(0.5)

The final result, i.e. the correlation coefficient, is the ratio of these two values, i.e. r = r_numerator/r_denominator, as shown in the snippet below.

correlation=r_numerator/r_denominator
print(correlation)

c8.PNG

The result shows that there is a very strong positive correlation between X and Y.
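The steps above can also be collected into a single function. The sketch below is one possible way to do it, not the only one: the function name pearson_r and the small demo table are made up for illustration (they are not the survey data used in this post). As a sanity check, the result is compared with numpy's corrcoef function.

```python
import numpy as np
import pandas as pd

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson's r computed from raw sums, following the steps shown above."""
    n = len(x)  # number of (x, y) pairs
    numerator = n * (x * y).sum() - x.sum() * y.sum()
    denominator = ((n * (x ** 2).sum() - x.sum() ** 2)
                   * (n * (y ** 2).sum() - y.sum() ** 2)) ** 0.5
    return numerator / denominator

# Made-up ages (X) and weights (Y), NOT the survey data from this post.
demo = pd.DataFrame({"X": [21, 25, 30, 34, 40, 46],
                     "Y": [58, 63, 61, 70, 74, 79]})

r_manual = pearson_r(demo.X, demo.Y)
r_numpy = np.corrcoef(demo.X, demo.Y)[0, 1]  # reference value from numpy
print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point rounding, which is a handy way to check a hand-written implementation.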

Using the pandas .corr() method

Sorry to take you through the stress. Pandas actually does this computation easily using the .corr() method. This gives a matrix called the correlation matrix. It shows the correlation coefficient for every pairwise combination of the numerical columns of the DataFrame, as shown below.

df[["X","Y"]].corr() # Selects the "X" and "Y" columns using fancy indexing then obtains the correlation matrix

c9.PNG
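A few properties of the correlation matrix are worth knowing when reading it. The sketch below uses a small made-up table (not the survey data) to show that the diagonal entries are 1 (each column is perfectly correlated with itself), that the matrix is symmetric, and that a single coefficient can also be obtained directly with the Series.corr method.

```python
import pandas as pd

# Made-up ages (X) and weights (Y), NOT the survey data from this post.
demo = pd.DataFrame({"X": [21, 25, 30, 34, 40, 46],
                     "Y": [58, 63, 61, 70, 74, 79]})

corr_matrix = demo.corr()                # Pearson's r is the default method
r_xy = corr_matrix.loc["X", "Y"]         # off-diagonal entry: r between X and Y
r_yx = corr_matrix.loc["Y", "X"]         # same value, since the matrix is symmetric
r_direct = demo["X"].corr(demo["Y"])     # one coefficient without the full matrix
print(corr_matrix)
```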

The plot for the above data set can be obtained using the pandas library or the matplotlib library, as shown below.

df.plot.scatter("X","Y")

c10.PNG

The strong upward trend is a clear indication of the positive correlation between the parameters.
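For completeness, a roughly equivalent scatter plot can be drawn with matplotlib directly. This is only a sketch: the data is made up (not the survey data), and the output file name scatter_demo.png is an arbitrary choice for the example. In an interactive session you could call plt.show() instead of saving the figure.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up ages (X) and weights (Y), NOT the survey data from this post.
demo = pd.DataFrame({"X": [21, 25, 30, 34, 40, 46],
                     "Y": [58, 63, 61, 70, 74, 79]})

fig, ax = plt.subplots()
ax.scatter(demo["X"], demo["Y"])   # one point per (X, Y) pair
ax.set_xlabel("X")
ax.set_ylabel("Y")
fig.savefig("scatter_demo.png")    # hypothetical file name
```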

Conclusion

With the knowledge you have now, you have added yet another tool to your exploratory data analysis process. This helps with feature selection and other data exploration or decision-making processes. Have a lovely time and stay safe!

This post is part of a series of blog posts on probability and statistics, based on the Practical Machine Learning Course from The Port Harcourt School of AI (pmlcourse).