A correlation matrix shows the strength of the linear relationship between each pair of variables in a dataset. Each entry in the matrix is the correlation coefficient for one pair.
The correlation coefficient shows how strongly or weakly any two variables are related. Coefficients range from -1 to 1. A value of 1 indicates a perfect positive correlation, whereas -1 indicates a perfect negative correlation. Values closer to 0 indicate a weaker correlation.
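To make the coefficient concrete, here is a minimal sketch of the Pearson correlation coefficient (the measure pandas uses by default) written in plain Python. The function name `pearson` and the sample lists are illustrative assumptions, not part of the original example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant sequence has zero variance and makes this undefined
    return cov / sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]             # hypothetical sample data
score_up = [52, 55, 61, 64, 70]     # tends to rise with hours
score_down = [70, 64, 61, 55, 52]   # tends to fall with hours

print(pearson(hours, score_up))     # close to +1
print(pearson(hours, score_down))   # close to -1
```

The normalization by the two spreads is what keeps the result between -1 and 1, regardless of the units of the inputs.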
Correlation refers to the degree of relationship between variables. It can be causal or non-causal. We say that there is a positive correlation when an increase in one variable is accompanied by an increase in the other. We say that there is a negative correlation when an increase in one variable is accompanied by a decrease in the other.
The illustration below shows positive and negative correlations:
The table below summarizes correlation coefficients:
Coefficient | Meaning |
---|---|
1 | Perfect positive correlation. The two variables increase together along an exact straight line. |
-1 | Perfect negative correlation. One variable decreases along an exact straight line as the other increases. |
0 | No linear correlation. The variables have no linear relationship. |
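The three rows of the table can be checked numerically. The sketch below uses NumPy's `np.corrcoef` on made-up data (the arrays and variable names are assumptions for illustration). It also shows that a coefficient of 0 only rules out a linear relationship: the third pair is fully determined by a curve, yet its coefficient is 0.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient for the pair
r_pos = np.corrcoef(x, 3 * x + 2)[0, 1]     # exact increasing line
r_neg = np.corrcoef(x, -0.5 * x + 1)[0, 1]  # exact decreasing line
r_zero = np.corrcoef(x, x ** 2)[0, 1]       # symmetric curve, no linear trend

print(r_pos, r_neg, r_zero)
```

Note that the slope of the line does not matter: any exact straight-line relationship gives 1 or -1, depending only on its direction.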
A correlation matrix displays the correlation between all numerical variables present in the dataset. If a dataset has n numerical features, the correlation matrix has n × n values and is symmetric about its main diagonal. Therefore, it is sufficient to analyze only the upper or lower half of the matrix.
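As a small sketch of that symmetry, the snippet below builds the matrix for the same three-column dataset used in this article's Python example, confirms that it equals its own transpose, and keeps each pair of variables only once. The names `is_symmetric` and `unique_pairs` are illustrative assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [45, 37, 42, 35, 39],
    'B': [38, 31, 26, 28, 33],
    'C': [10, 15, 17, 21, 12],
})
corr = df.corr()

# The matrix equals its transpose, so corr.loc['A', 'B'] == corr.loc['B', 'A']
is_symmetric = np.allclose(corr.values, corr.values.T)

# Keep each unordered pair once (one half of the matrix, excluding the diagonal)
unique_pairs = {
    (a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)
}

print(is_symmetric)
for pair, r in unique_pairs.items():
    print(pair, round(r, 3))
```

For n features this leaves n(n - 1)/2 distinct coefficients to inspect, which is 3 for the three columns here.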
The illustration below shows a visual representation of a correlation matrix:
The diagonal always has a coefficient of 1.00, since each diagonal entry is the correlation of a variable with itself.
A gradient color scheme helps to improve understanding of the coefficient scores.
The code snippet below shows how we can create a correlation matrix in Python:
    import pandas as pd # for creating a dataframe
    import seaborn as sn # for drawing the heatmap
    import matplotlib.pyplot as plt # for creating visualizations

    # Data for matrix
    data = {'A': [45,37,42,35,39],
            'B': [38,31,26,28,33],
            'C': [10,15,17,21,12]
            }

    df = pd.DataFrame(data,columns=['A','B','C'])
    print("Original Matrix")
    print(df) # original matrix
    print("\n")

    corrMatrix = df.corr() # finding correlations
    print("Correlation Coefficients Matrix")
    print(corrMatrix) # printing correlations

    # Visual Representation of Correlation Matrix
    sn.heatmap(corrMatrix, annot = True, cmap = 'Blues')
    plt.show() # display the heatmap
Line 11 creates a dataframe. A dataframe is a two-dimensional table, so it can be treated as a matrix here.
Line 16 uses the corr function on our dataframe to calculate the correlation coefficients matrix.
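The corr function also accepts a few options. The sketch below assumes pandas 1.5 or later (where the `numeric_only` parameter is available) and adds a hypothetical text column named `label` to show how non-numeric columns can be left out. It also computes a rank-based (Spearman) matrix through the `method` parameter.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [45, 37, 42, 35, 39],
    'B': [38, 31, 26, 28, 33],
    'label': ['u', 'v', 'w', 'x', 'y'],  # hypothetical non-numeric column
})

# numeric_only=True restricts the calculation to the numeric columns
pearson_corr = df.corr(numeric_only=True)

# method='spearman' correlates the ranks of the values rather than the raw values
spearman_corr = df.corr(method='spearman', numeric_only=True)

print(pearson_corr)
print(spearman_corr)
```

Pearson is the default method and measures linear association. Spearman can be a reasonable alternative when the relationship is monotonic but not a straight line.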
The second code snippet is a continuation of the first code snippet.
It creates a visualization of the correlation matrix using Seaborn and Matplotlib. It takes in the correlation coefficients, annotates each cell with its value, and shades the cells using the Blues color map.
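As one possible variation on this heatmap, the sketch below hides the redundant upper half of the matrix and uses a diverging color map pinned to the full range from -1 to 1, so that positive and negative coefficients get visibly different colors. The choice of `coolwarm`, the off-screen `Agg` backend, and the variable names are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen; remove for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sn

df = pd.DataFrame({
    'A': [45, 37, 42, 35, 39],
    'B': [38, 31, 26, 28, 33],
    'C': [10, 15, 17, 21, 12],
})
corr = df.corr()

# True marks a hidden cell: everything strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

fig, ax = plt.subplots()
sn.heatmap(corr, mask=mask, annot=True, fmt=".2f",
           cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
# Call plt.show() or fig.savefig(...) here to view or save the figure
```

Fixing `vmin` and `vmax` keeps the colors comparable across different datasets, because the same shade always stands for the same coefficient.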