Polars is an open-source data manipulation library for Python that combines the ease of use of pandas with the performance benefits of Apache Arrow. It is designed to handle large datasets efficiently by using the columnar memory format defined by the Apache Arrow specification.
In this answer, we'll learn how to use the Polars library in Python to inspect data, along with examples to help us understand the process.
To inspect data using Polars, we first need to load the data into a Polars DataFrame. Here's an example that creates a DataFrame from a Python dictionary:
import polars as pl
# Each dictionary key becomes a column; each list holds that column's values.
data = {
    "Id": [1, 2, 3, 4, 5, 6, 7],
    "Item": ['Apple', 'Cupboard', 'Peach', 'Banana', 'Table', 'Mango', 'Chair'],
    "Price": [50, 5000, 30, 30, 2000, 100, 1000],
    "Category": ['Fruit', 'Furniture', 'Fruit', 'Fruit', 'Furniture', 'Fruit', 'Furniture'],
}
df = pl.DataFrame(data)
print(df)
Polars provides several methods for basic data inspection. Here are a few commonly used methods:
head(n): Displays the first n rows of the DataFrame (five rows if n is omitted). The example below shows the first three rows.
print(df.head(3))
tail(n): Displays the last n rows of the DataFrame (five rows if n is omitted). The example below shows the last three rows.
print(df.tail(3))
shape: An attribute (not a method) that holds the number of rows and columns in the DataFrame as a tuple.
print(df.shape)
columns: An attribute that holds the list of column names in the DataFrame.
print(df.columns)
dtypes: An attribute that holds the data type of each column.
print(df.dtypes)
Polars provides various methods to calculate summary statistics of the data. Here are a few examples:
describe(): Computes summary statistics (such as count, mean, minimum, and maximum) for the columns of the DataFrame.
print(df.describe())
value_counts(): Called on a single column, counts how many times each unique value occurs in that column.
print(df['Price'].value_counts())
min(), max(), mean(), median(): Compute the minimum, maximum, mean, and median values of a column.
print(df['Price'].min())
print(df['Price'].max())
print(df['Price'].mean())
print(df['Price'].median())
Polars provides powerful methods to filter and select data based on specific conditions. Here are a few examples:
filter(predicate): Filters the DataFrame based on a boolean predicate.
filtered_df = df.filter(pl.col('Price') > 100)
print(filtered_df)
select(columns): Selects specific columns from the DataFrame.
selected_df = df.select(['Item', 'Price'])
print(selected_df)
In this answer, we've seen how to use the Polars library in Python to inspect data. Its DataFrame API provides a wide range of operations for data inspection, from previewing rows and checking column types to computing summary statistics and filtering. Polars also integrates with popular Python libraries such as NumPy, pandas, and PyArrow, which makes it a valuable tool for data professionals who need to process large datasets and extract insights from them efficiently.