Seaborn is a Python data visualization library built on top of Matplotlib, offering a high-level interface for creating visually appealing statistical graphics. It seamlessly integrates with pandas DataFrames, making it easy to visualize datasets loaded into pandas. Seaborn excels at producing various statistical visualizations such as scatter plots, line plots, bar plots, histograms, kernel density estimates, violin plots, and box plots. These plots often include built-in statistical estimation and aggregation functions, allowing users to quickly analyze and visualize data distributions.
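Before we can import Seaborn, both Seaborn and Matplotlib must be installed in our environment. You can install the libraries using the following commands. The leading ! runs a command from inside a Jupyter notebook; omit it when typing the commands in a terminal.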
!pip install matplotlib
!pip install seaborn
After successful installation, we can import the libraries like so:
import matplotlib.pyplot as plt
import seaborn as sns
After importing the libraries, let's look at two ways to get data for visualization:
We can create our own dataset using NumPy and pandas. First, we create the data points as NumPy arrays, and then we combine those arrays into a DataFrame using pandas.
import numpy as np
import pandas as pd

col1 = np.random.rand(15)
data = pd.DataFrame({'x': col1, 'y': np.random.normal(0.1, 5, 15)})
print(data)
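The custom DataFrame above can be passed to any Seaborn plotting function. The following is a minimal sketch of that step, not part of the original walkthrough: it assumes a seeded generator (the seed 42 is arbitrary) so the random values are reproducible, and the plot title is made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# A seeded generator makes the random data reproducible (the seed is arbitrary).
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "x": rng.random(15),           # 15 uniform samples in [0, 1)
    "y": rng.normal(0.1, 5, 15),   # 15 normal samples with mean 0.1 and std 5
})

# Column names are passed as strings; Seaborn looks them up in the DataFrame.
ax = sns.scatterplot(x="x", y="y", data=data)
ax.set_title("Scatter Plot of a Custom Dataset")
plt.show()
```

Because the columns are referenced by name, the same call works unchanged for any DataFrame that has columns called x and y.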
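Seaborn also gives us access to a collection of well-known example datasets. They are downloaded from Seaborn's online data repository, so the exact list can change over time and an internet connection is needed. To see the available datasets, we can run the following code: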
import seaborn as sns

print(sns.get_dataset_names())
Let's choose the iris dataset. We can load the dataset by running the following code:
import seaborn as sns

data = sns.load_dataset("iris")
print(data.head())
Seaborn provides us with built-in functions to visualize our data using basic plots.
A scatter plot visually represents the relationship between two variables by plotting points on a Cartesian plane. It's useful for identifying patterns or correlations in data.
# Scatter Plot
plt.figure(figsize=(8, 6))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=data)
plt.title('Scatter Plot of Sepal Length vs. Sepal Width')
plt.show()
A line plot connects data points with straight lines, typically used to display data over time or sequential data points. It helps in visualizing trends or patterns in data.
# Line Plot
plt.figure(figsize=(8, 6))
sns.lineplot(x=data.index, y='sepal_length', data=data)
plt.title('Line Plot of Sepal Length')
plt.xlabel('Index')
plt.ylabel('Sepal Length (cm)')
plt.show()
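A bar plot displays data using rectangular bars with lengths proportional to the values they represent. In Seaborn, each bar shows an aggregate of the numeric variable for a category (the mean by default), with an error bar indicating the uncertainty of that estimate. It's effective for comparing categories of data.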
# Bar Plot
plt.figure(figsize=(8, 6))
sns.barplot(x='species', y='sepal_length', data=data)
plt.title('Bar Plot of Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
plt.show()
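To make the aggregation behind a bar plot concrete, the sketch below compares the bar heights with a pandas groupby mean. It is an illustration under stated assumptions: df is a small synthetic stand-in that reuses the iris column names (the values are randomly generated, not the real measurements), and the check relies on barplot's default estimator being the mean.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the iris columns (randomly generated, not real data).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "species": np.repeat(["setosa", "versicolor", "virginica"], 20),
    "sepal_length": np.concatenate([
        rng.normal(5.0, 0.35, 20),
        rng.normal(5.9, 0.50, 20),
        rng.normal(6.6, 0.60, 20),
    ]),
})

# The per-species mean is what barplot draws by default.
means = df.groupby("species")["sepal_length"].mean()
print(means)

# order= pins the bar order so that it matches the groupby result.
ax = sns.barplot(x="species", y="sepal_length", data=df, order=list(means.index))

# Read the drawn bar heights back from the axes and compare them to the means.
heights = [bar.get_height() for bar in ax.patches[:len(means)]]
print(np.allclose(heights, means.to_numpy()))
plt.show()
```

A different summary, such as the median, can be requested through barplot's estimator parameter.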
A histogram displays the distribution of numerical data by dividing it into bins and showing the frequency of each bin. It's useful for understanding the underlying distribution of a dataset.
# Histogram
plt.figure(figsize=(8, 6))
sns.histplot(data['sepal_length'], bins=20)
plt.title('Histogram of Sepal Length')
plt.xlabel('Sepal Length (cm)')
plt.ylabel('Frequency')
plt.show()
A box plot visualizes the distribution of numerical data through quartiles, providing insights into the central tendency, variability, and outliers of the dataset.
# Box Plot
plt.figure(figsize=(8, 6))
sns.boxplot(x='species', y='sepal_length', data=data)
plt.title('Box Plot of Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
plt.show()
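The numbers a box plot summarizes can be computed by hand with pandas. The sketch below uses a tiny toy series (made up for this example) and assumes the common 1.5 × IQR rule for flagging outliers, which is also the default whisker setting in Seaborn and Matplotlib.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy data with one obviously extreme value.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], name="value")

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                    # interquartile range: the length of the box
lower_fence = q1 - 1.5 * iqr     # whiskers extend to the most extreme points
upper_fence = q3 + 1.5 * iqr     # that still lie inside these two fences
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)
print(outliers.tolist())

# Points outside the fences are drawn as individual markers.
sns.boxplot(x=values)
plt.show()
```

The box spans q1 to q3, the line inside it marks the median, and anything beyond the fences appears as a separate point.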
A count plot represents the frequency of unique values in a dataset, often used for categorical data to show the distribution of different categories.
# Count Plot
plt.figure(figsize=(8, 6))
sns.countplot(x='sepal_length', data=data)
plt.title('Count Plot of Sepal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Count')
plt.show()
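A count plot is essentially a picture of pandas' value_counts. The following sketch illustrates this on a toy categorical column; the DataFrame is invented for the example and only borrows the iris species names for continuity.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy categorical column with deliberately uneven group sizes.
df = pd.DataFrame({
    "species": ["setosa"] * 5 + ["versicolor"] * 3 + ["virginica"] * 7
})

# value_counts returns the number of rows per distinct value.
counts = df["species"].value_counts()
print(counts)

# countplot draws one bar per distinct value, with the bar height equal to its count.
ax = sns.countplot(x="species", data=df)
ax.set_ylabel("Count")
plt.show()
```

Count plots are most readable for categorical columns. On a continuous column, every distinct value gets its own bar, which can produce a crowded axis.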
A point plot displays point estimates and confidence intervals to represent the relationship between two variables. It's helpful for comparing groups or conditions in an experiment or study.
# Point Plot
plt.figure(figsize=(8, 6))
sns.pointplot(x='species', y='sepal_length', data=data)
plt.title('Point Plot of Sepal Length by Species')
plt.xlabel('Species')
plt.ylabel('Sepal Length (cm)')
plt.show()
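By default, the interval drawn around each point is a bootstrap confidence interval for the mean. The sketch below shows the idea with plain NumPy. It is a simplified illustration, not Seaborn's internal implementation, and sample is synthetic data for a single hypothetical group.

```python
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(5.8, 0.8, 50)   # synthetic measurements for one group

# Resample with replacement many times and record the mean of each resample.
n_boot = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The point estimate is the sample mean; the 95% interval is taken from the
# 2.5th and 97.5th percentiles of the bootstrapped means.
point_estimate = sample.mean()
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(point_estimate, ci_low, ci_high)
```

A narrow interval means the group mean is estimated precisely; intervals that do not overlap between groups suggest that the groups really differ.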
Along with basic plots, we can also create advanced plots using Seaborn. These plots give deeper insight into the data by showing several variables, or several views of the same variables, at once.
A pair plot displays pairwise relationships between different variables in a dataset. It shows scatterplots for each pair of variables and histograms for each variable along the diagonal, making it easy to visualize correlations and distributions within the dataset.
# Pairplot
g = sns.pairplot(data)
# pairplot creates a grid of axes, so the title is set on the whole figure
g.figure.suptitle("Pairplot of Iris Dataset", y=1.02)
plt.show()
A joint plot displays the relationship between two variables along with their individual distributions. It typically includes a scatterplot with marginal histograms or kernel density estimates, providing insights into the correlation and the distribution of the variables.
# Jointplot
sns.jointplot(x='petal_length', y='petal_width', data=data, kind='scatter')
plt.show()
A violin plot displays the distribution of a numeric variable for different categories or groups. It combines a box plot with a kernel density estimate, showing the shape of the distribution along with summary statistics such as the median and quartiles. Unlike a box plot, it does not mark individual outliers.
# Violinplot
sns.violinplot(x='species', y='sepal_length', data=data)
plt.title("Violinplot of Sepal Length by Species")
plt.show()
A KDE plot visualizes the probability density function of a continuous variable. It provides a smoothed representation of the distribution of the data, making it easier to identify peaks, valleys, and the overall shape of the distribution.
# KDEplot
sns.kdeplot(data['sepal_length'], fill=True)  # fill replaces the deprecated shade parameter
plt.title("KDEplot of Sepal Length")
plt.show()
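The smoothing behind a KDE plot can be written out in a few lines of NumPy. This is a minimal sketch under stated assumptions: a Gaussian kernel, a bandwidth chosen with Scott's rule of thumb, and a synthetic sample. It illustrates the technique and is not a copy of Seaborn's implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
sample = rng.normal(5.8, 0.8, 150)   # synthetic stand-in for sepal lengths

n = sample.size
bandwidth = sample.std(ddof=1) * n ** (-1 / 5)   # Scott's rule of thumb

# Evaluate the estimate on a grid that extends a little beyond the data.
grid = np.linspace(sample.min() - 4 * bandwidth,
                   sample.max() + 4 * bandwidth, 400)

# Place one Gaussian bump on every observation and average the bumps.
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (simple Riemann sum as a sanity check).
area = density.sum() * (grid[1] - grid[0])
print(area)
```

A larger bandwidth gives a smoother curve that may hide real peaks, while a smaller one follows the data more closely but looks noisier.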
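A heatmap represents the values of a matrix as colors; here we use it to visualize the correlation matrix of a dataset. It is particularly useful for identifying patterns and relationships between variables. With the diverging coolwarm palette used below, red shades indicate positive correlations, blue shades indicate negative correlations, and more saturated colors indicate stronger correlations.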
# Heatmap
# Keep the original DataFrame intact; later examples still need the species column
numeric_data = data.drop(columns=['species'])
corr_matrix = numeric_data.corr()
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title("Heatmap of Correlation Matrix")
plt.show()
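Two common refinements of a correlation heatmap are fixing the color scale to the full range from -1 to 1 and hiding the redundant upper triangle. The sketch below shows both on a synthetic DataFrame; the column names and the relationships between the columns are invented for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: one positively and one weakly negatively related column.
rng = np.random.default_rng(5)
length = rng.normal(4.0, 1.5, 60)
df = pd.DataFrame({
    "petal_length": length,
    "petal_width": 0.4 * length + rng.normal(0, 0.2, 60),
    "sepal_width": 3.5 - 0.1 * length + rng.normal(0, 0.3, 60),
})

corr = df.corr()

# The matrix is symmetric, so the upper triangle only repeats the lower one.
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)

# vmin, vmax, and center pin the diverging palette: 0 is neutral,
# +1 is the warmest color, and -1 is the coolest.
ax = sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
                 cmap="coolwarm", vmin=-1, vmax=1, center=0)
ax.set_title("Correlation Heatmap (Lower Triangle)")
plt.show()
```

Pinning the scale keeps colors comparable between different heatmaps, since the same shade then always stands for the same correlation value.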
A FacetGrid divides a dataset into subsets based on one or more categorical variables and creates a separate plot for each subset. It allows for comparison between different groups or categories within the dataset.
# FacetGrid
g = sns.FacetGrid(data, col='species')
g.map(sns.scatterplot, 'petal_length', 'petal_width')
plt.show()
A regplot (regression plot) displays the relationship between two variables and fits a linear regression model to the data. It provides insights into the strength and direction of the relationship, along with the uncertainty associated with the regression line.
# Regplot
sns.regplot(x='sepal_length', y='sepal_width', data=data)
plt.title("Regplot of Sepal Length vs. Sepal Width")
plt.show()
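regplot draws the fitted line but does not report its coefficients. One way to obtain them is an ordinary least-squares fit with NumPy, as sketched below. The data is synthetic, generated with a known slope of 2 and an intercept of 1 so the result is easy to sanity-check.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: a straight line with slope 2 and intercept 1, plus noise.
rng = np.random.default_rng(6)
x = rng.uniform(4, 8, 80)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 80)

# A degree-1 polynomial fit is an ordinary least-squares line.
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson's r summarizes the strength and direction of the linear relationship.
r = np.corrcoef(x, y)[0, 1]
print(slope, intercept, r)

# regplot shows the same kind of fit, with a shaded confidence band around the line.
ax = sns.regplot(x=x, y=y)
plt.show()
```

The sign of the slope gives the direction of the relationship, and an r close to 1 or -1 indicates that the points lie close to the line.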
A categorical plot visualizes the distribution of a numeric variable across different categories or groups. It can take various forms, such as box plots, violin plots, or bar plots, providing insights into how the distribution varies between different categories.
# Categorical Plot
sns.catplot(x='species', y='petal_length', data=data, kind='box')
plt.show()