A boxplot is a valuable tool for visualizing the distribution of data. While the Polars library itself doesn’t offer direct support for creating boxplots, we can easily generate them from Polars data by integrating with the Matplotlib library.
The boxplot() function
The boxplot() function in Matplotlib is used to create boxplots, a common way to visualize a dataset’s distribution and summary statistics.
plt.boxplot(x, notch=None, sym=None, vert=None, whis=None, positions=None, widths=None, patch_artist=None)
Here are the main parameters of the boxplot() function and their explanations:
x: This is the data we want to plot. It can be a single array or a list of arrays (one array per box in the boxplot).
notch: If set to True, this creates a notched boxplot that displays a confidence interval around the median.
sym: This is the symbol used to mark outliers. If left as None, Matplotlib uses its default outlier style. We can pass any marker symbol, such as '+', or an empty string to hide the outliers.
vert: If set to True (default), this creates vertical boxplots. If set to False, it creates horizontal boxplots.
whis: This sets the whisker length as a multiple of the interquartile range (IQR). The default is 1.5, which is the standard convention. The lower whisker is drawn from the box to the smallest data point that is not below Q1 - 1.5 * IQR, and the upper whisker is drawn from the box to the largest data point that is not above Q3 + 1.5 * IQR. Any data points that fall outside this range are treated as outliers and are displayed as individual points beyond the ends of the whiskers.
positions: This specifies the positions of the boxes along the axis (the x-axis for vertical boxplots). This can be a list of scalars or an array-like object.
widths: This specifies the width of the boxes. We can provide a single scalar or an array-like object to customize box widths.
patch_artist: If set to True, the boxes are drawn as filled patches, which lets us set a fill color. If set to False, the boxes are drawn as plain lines.
These parameters allow us to customize various aspects of the boxplot to suit our visualization needs. Depending on the data and the specific insights we want to convey, we can adjust these parameters accordingly when calling plt.boxplot().
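To make the parameters above more concrete, here is a minimal sketch that passes several of them in one call. The sample values, the fill color, and the output filename are assumptions made for illustration, and the non-interactive Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so no display is needed
import matplotlib.pyplot as plt

# Illustrative data: one list of values per box
samples = [
    [5, 6, 7, 20],
    [4, 8, 9, 10],
    [11, 12, 14, 16],
]

fig, ax = plt.subplots(figsize=(8, 6))
result = ax.boxplot(
    samples,
    sym='+',              # draw outliers as plus signs
    whis=1.5,             # whiskers reach up to 1.5 * IQR beyond the box
    positions=[1, 2, 4],  # leave a gap before the third box
    widths=0.5,           # same width for every box
    patch_artist=True,    # draw boxes as patches so they can be filled
)

# Filling the boxes only works because patch_artist=True
for box in result['boxes']:
    box.set_facecolor('lightblue')

# Label the ticks at the chosen positions
ax.set_xticks([1, 2, 4])
ax.set_xticklabels(['X', 'Y', 'Z'])
ax.set_ylabel('Value')

fig.savefig('boxplot_params.png')  # hypothetical output file
```

The call returns a dictionary of the drawn artists (for example, 'boxes', 'whiskers', 'medians', and 'fliers'), which is what makes it possible to restyle individual elements after plotting.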
Here is an example that demonstrates how to create a boxplot using Matplotlib with data from a Polars DataFrame:
# Import required libraries
import polars as pl
import matplotlib.pyplot as plt

# Create a sample Polars DataFrame
data = pl.DataFrame({
    'Category': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Value': [5, 8, 12, 6, 9, 14, 7, 10, 16, 20, 4, 11]
})

# Extract the data we want to visualize
categories = data['Category'].to_list()
values = data['Value'].to_list()

# Create an empty list to hold the data for each category
category_data = []

# Extract and organize data by category (sorted so the box order is the same on every run)
for category in sorted(set(categories)):
    category_values = [values[i] for i in range(len(categories)) if categories[i] == category]
    category_data.append(category_values)

# Create a boxplot using Matplotlib (Matplotlib 3.9+ names the labels argument tick_labels)
plt.figure(figsize=(8, 6))
plt.boxplot(category_data, labels=sorted(set(categories)))
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Boxplot')
plt.show()
In the above code:
Lines 6–9: We create a Polars DataFrame called data with two columns: Category and Value.
Lines 12–13: We extract the Category and Value columns into Python lists using to_list() so that we can organize and plot the data using Matplotlib.
Line 16: We create an empty list called category_data to store data for each category.
Lines 19–21: We iterate through the unique categories in the Category column and extract the corresponding Value data for each category.
Lines 24–28: We use Matplotlib to create a boxplot, passing the category_data list and labels as arguments. We set the title and axis labels.
Line 29: Finally, we display the boxplot using plt.show().
The code generates a boxplot that visualizes the distribution of Value data for each unique Category in the sample dataset. The x-axis represents the categories ('X', 'Y', 'Z'), and the y-axis represents the values. Each box in the plot represents a category. The box itself spans the IQR (from the first to the third quartile), the horizontal line inside it marks the median, and the whiskers extend to the smallest and largest values that lie within 1.5 times the IQR of the box edges. Values beyond the whiskers are drawn as individual outlier points.