Polars is a data manipulation library for Python designed for efficient data processing and analysis. One useful feature of Polars is the ability to obtain the unique values from the arrays stored in an array column. This is particularly helpful when we need to identify and extract the distinct elements of an array in a Polars DataFrame. In this Answer, we will discuss the Expr.arr.unique() method, which serves this purpose.
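The Expr.arr.unique() method
The Expr.arr.unique() method retrieves the unique (distinct) values from each array in an array column of a Polars DataFrame. By invoking this method on a column expression, we obtain a new expression that evaluates to a column containing only the unique values of each array. Because the number of unique values can differ from row to row, the result is a list column rather than a fixed-width array column.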
Here’s the syntax of the Expr.arr.unique() method:
Expr.arr.unique(*, maintain_order: bool = False)
The * indicates that every argument after it must be passed as a keyword argument.
maintain_order is an optional boolean parameter. If it is set to True, the unique values keep the order in which they first appear in each array. Its default value is False, in which case the order of the unique values is not guaranteed.
Let’s walk through a practical example to understand the usage of the Expr.arr.unique() method:
import polars as pl

# Create a DataFrame with an array column
df = pl.DataFrame({
    "a": [[1, 2, 3, 2]],
}, schema_overrides={"a": pl.Array(width=4, inner=pl.Int64)})

# Use arr.unique() to get unique values from the array
unique_values_expr = df.select(pl.col("a").arr.unique())

# Display the result
print(unique_values_expr)
Lines 4–6: We create a DataFrame df with a single column. The column a contains a single row holding the array [1, 2, 3, 2]. The schema_overrides parameter explicitly sets the data type of the column. In this case, it specifies that a is an array of width 4 (i.e., each row contains exactly four elements), where each element is of the Int64 type.
Line 9: The select() method creates a new DataFrame (unique_values_expr) containing the unique values of the a column. The pl.col("a") function creates an expression that refers to column a of the DataFrame, and .arr.unique() then extracts the unique values from each array in that column.
Line 12: Finally, the resulting DataFrame (unique_values_expr) is printed to the console.
This displays the unique values present in the array column a of the original df DataFrame. Note that the result is a DataFrame with a single column containing the unique values of the array.
Next, let’s add more array columns to the DataFrame and find the unique values in each of them. Here’s how we can do it:
import polars as pl

# Create a DataFrame with multiple array columns
df = pl.DataFrame({
    "a": [[1, 2, 3, 2]],
    "b": [[3, 4, 3, 7]],
    "c": [[8, 12, 13, 12]],
}, schema_overrides={
    "a": pl.Array(width=4, inner=pl.Int64),
    "b": pl.Array(width=4, inner=pl.Int64),
    "c": pl.Array(width=4, inner=pl.Int64),
})

# Use arr.unique() to get unique values from the array columns
unique_values_expr = df.select(pl.col("a", "b", "c").arr.unique())

# Display the result
print(unique_values_expr)
Lines 4–12: We create a DataFrame df with three array columns named a, b, and c. The schema_overrides parameter explicitly sets the type of each column to an array of four Int64 elements.
Line 15: We use the select() method to create a new DataFrame (unique_values_expr) containing the unique values from the a, b, and c array columns. The pl.col("a", "b", "c") call creates an expression that refers to all three columns of df, and .arr.unique() is then applied to each of them to obtain the unique values from the arrays.
This displays the unique values present in each of the array columns.
The Expr.arr.unique() method in Polars provides a convenient and efficient way to extract unique values from array columns in DataFrames. By understanding its signature, its maintain_order parameter, and its usage through practical examples, we can apply it in data manipulation and analysis workflows in Polars.