Vaex is a Python library for working with large tabular datasets, with its performance-critical parts implemented in C++. It relies on memory mapping and lazy evaluation to handle data that would be slow or impractical to load fully into memory, which makes it a valuable tool for data scientists and analysts in big data scenarios. This Answer shows how it handles common DataFrame tasks, much like pandas and Polars. See the comparison between pandas and vaex for a detailed analysis.
vaex library
Use the following command to import the library:
import vaex as vx
The cornerstone of vaex is the vaex DataFrame, which can be created as follows:
import vaex as vx

data = {"column1": [1, 2, 3], "column2": ["foo", "bar", "baz"]}

df = vx.from_dict(data)
print(" === Vaex DataFrame === ")
print(df)
In the code above:
Line 1: Import the vaex library and name it as vx for convenience in the code.
Line 3: Define a Python dictionary named data, representing tabular data. It consists of two columns, "column1" with integer values and "column2" with string values.
Line 5: Use vaex’s from_dict method to create a DataFrame (i.e., df) from the provided dictionary (i.e., data). This DataFrame will have two columns, "column1" and "column2", with the corresponding data.
Lines 6–7: Print the vaex DataFrame (i.e., df) to the console.
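To make the column-oriented layout of the dictionary concrete, here is a minimal plain-Python sketch that needs no vaex at all. The helper to_rows is hypothetical (it is not part of vaex) and only illustrates the idea that each key names a column and that all columns of a table hold the same number of values.

```python
# Plain-Python view of the column-oriented dictionary used above (no vaex needed).
data = {"column1": [1, 2, 3], "column2": ["foo", "bar", "baz"]}


def to_rows(columns):
    """Turn a dict of equal-length columns into a list of row dicts.

    Hypothetical helper for illustration only; it is not a vaex function.
    """
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same number of values")
    names = list(columns)
    # zip(*columns.values()) walks the columns in parallel, one row at a time.
    return [dict(zip(names, row)) for row in zip(*columns.values())]


rows = to_rows(data)
for row in rows:
    print(row)
```

Each printed dictionary corresponds to one row of the DataFrame built from the same data.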
Some of the basic operations that we can perform using vaex in Python are as follows:
# Select column(s): Choose specific column(s) with ease
df_new = df[["column1"]]
print(" === Select === ")
print(df_new)
print("\n")

# Filter rows: Apply conditions to filter rows
df_filtered = df[df["column1"] > 1]
print(" === Filter === ")
print(df_filtered)
print("\n")

# Group and aggregate: Aggregating data based on specific column(s)
df_grouped = df.groupby("column2").agg({"column1": "sum"})
print(" === Group === ")
print(df_grouped)
print("\n")

# Sort the DataFrame effortlessly
df_sorted = df.sort("column1")
print(" === Sort === ")
print(df_sorted)
print("\n")

# Join DataFrames: Perform various join operations
df1 = vx.from_dict({"key": ["Alpha", "Beta", "Gamma"], "value": [10, 20, 30]})
df2 = vx.from_dict({"key": ["Beta", "Gamma", "Delta"], "value": [40, 50, 60]})

df_joined = df1.join(df2, on="key", rsuffix='_df2')

print(" === Join === ")
print(df_joined)
In the code above:
Lines 1–5: Select only "column1" from the existing DataFrame (i.e., df). This creates a new DataFrame (i.e., df_new) with only the specified column. The result is then printed.
Lines 7–11: Filter rows based on a condition. This creates a new DataFrame (i.e., df_filtered) containing only the rows where the value in "column1" is greater than 1. The result is then printed.
Lines 13–17: Group the DataFrame by "column2" and aggregate the values in "column1" using the sum. The result is stored in df_grouped and then printed.
Lines 19–23: Sort the DataFrame based on the values in "column1" in ascending order. The result is stored in df_sorted and then printed.
Lines 25–32: Create two additional DataFrames (i.e., df1 and df2), and then perform the join operation on the "key" column. The rsuffix='_df2' parameter passed to the join method specifies a suffix ('_df2') for the columns from the second DataFrame (i.e., df2), so that they do not clash with the identically named columns of df1. Because the how parameter is not specified, vaex performs its default left join, so the "Delta" row of df2 has no match in df1 and does not appear in the result. The result is stored in df_joined and then printed.
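To show what the group-and-sum and join steps compute, here is a conceptual sketch in plain Python. It is not how vaex implements these operations, and the helpers group_sum and left_join are hypothetical names introduced only for illustration. The sketch assumes a left join and applies the suffix only to right-hand columns whose names collide with left-hand ones; the exact column set and missing-value representation that vaex returns may differ.

```python
from collections import defaultdict


def group_sum(rows, by, target):
    """Sum `target` for each distinct value of `by` (hypothetical helper)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[by]] += row[target]
    return dict(totals)


def left_join(left, right, on, rsuffix):
    """Keep every left row and attach the matching right row's columns.

    Hypothetical helper: unmatched left rows get None for right-hand columns.
    """
    right_by_key = {row[on]: row for row in right}
    right_columns = [name for name in right[0] if name != on]
    joined = []
    for row in left:
        match = right_by_key.get(row[on])
        out = dict(row)
        for name in right_columns:
            # Rename colliding right-hand columns so they do not overwrite left ones.
            new_name = name + rsuffix if name in row else name
            out[new_name] = match[name] if match is not None else None
        joined.append(out)
    return joined


rows = [
    {"column1": 1, "column2": "foo"},
    {"column1": 2, "column2": "bar"},
    {"column1": 3, "column2": "baz"},
]
sums = group_sum(rows, by="column2", target="column1")
print(sums)

left = [
    {"key": "Alpha", "value": 10},
    {"key": "Beta", "value": 20},
    {"key": "Gamma", "value": 30},
]
right = [
    {"key": "Beta", "value": 40},
    {"key": "Gamma", "value": 50},
    {"key": "Delta", "value": 60},
]
joined = left_join(left, right, on="key", rsuffix="_df2")
for row in joined:
    print(row)
```

Since every value in "column2" is unique in this small dataset, each group contains a single row and the sums equal the original "column1" values. In the join, "Alpha" has no partner in the right-hand data, so its suffixed column is empty, while "Delta" is dropped because it exists only on the right.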
In conclusion, vaex is a swift and memory-efficient data analysis library in Python, making it a valuable addition to the data science toolkit. Keep on exploring its versatility for diverse tasks, from basic manipulations to complex analytics.