Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It offers high-performance compression and encoding schemes for handling large amounts of complex data.
The read_parquet method is used to load a parquet file into a data frame.
Note: Refer to What is pandas in Python to learn more about pandas.
Here’s the syntax for this:
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)
- path: The file path to the parquet file. The path can also point to a directory containing multiple files, or be a valid file URL. Valid URL schemes are http, ftp, s3, gs, and file.
- engine: The parquet library to use. Available options are auto, pyarrow, and fastparquet.
- columns: The columns to be read into the data frame.
- storage_options: Extra options for a particular storage connection, such as host, port, username, and password.
- use_nullable_dtypes: A boolean parameter. If True, the resulting data frame uses dtypes that use pd.NA as the missing value indicator.
Let's see an example of the read_parquet method in Python.
import pandas as pd

# Load the entire parquet file into a data frame using the pyarrow engine
df = pd.read_parquet('data.parquet', engine='pyarrow')
print(df)

# Load only the "Name" column into a second data frame
cols = ["Name"]
df1 = pd.read_parquet('data.parquet', columns=cols)
print(df1)
- The pandas library is imported.
- data.parquet is loaded into a pandas data frame, df, using the read_parquet method.
- df is printed.
- A list, cols, holds the columns to be read into the data frame.
- The data.parquet file is read into a pandas data frame called df1 using the read_parquet method, passing cols as the columns to be read and skipping the others.
- df1 is printed.
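The remaining parameters follow the same pattern. Here is a hedged sketch of columns, use_nullable_dtypes, and storage_options in use; the S3 bucket name and credentials are hypothetical placeholders, reading s3:// paths additionally requires the s3fs package, and use_nullable_dtypes applies to the pandas versions matching the syntax shown above (newer releases replace it with dtype_backend):

import pandas as pd

# Read only the "Name" column and back the result with pd.NA-based
# nullable dtypes.
df = pd.read_parquet(
    "data.parquet",
    columns=["Name"],
    use_nullable_dtypes=True,
)
print(df)

# storage_options is forwarded to the underlying filesystem layer
# (fsspec/s3fs here). The bucket name and keys are placeholders.
remote = pd.read_parquet(
    "s3://example-bucket/data.parquet",
    storage_options={"key": "ACCESS_KEY", "secret": "SECRET_KEY"},
)
print(remote)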