Apache Parquet is a column-oriented, open-source data file format for data storage and retrieval. It offers high-performance data compression and encoding schemes to handle large amounts of complex data.
We use the to_parquet() method in Python to write a DataFrame to a Parquet file.
Note: Refer to What is pandas in Python? to learn more about pandas.
DataFrame.to_parquet(path=None, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None, **kwargs)
path: This is the path to the Parquet file.
engine: This parameter indicates which Parquet library to use. The available options are auto, pyarrow, and fastparquet.
compression: This parameter indicates the type of compression to use. The available options are snappy, gzip, and brotli. The default compression is snappy.
index: This is a boolean parameter. If True, the DataFrame's indexes are written to the file. If False, the indexes are ignored.
partition_cols: These are the names of the columns that partition the DataFrame. The order in which the columns are given determines the order in which they are partitioned.
storage_options: These are the extra options for a certain storage connection, such as a host, port, username, password, and so on.

The following example writes a simple DataFrame to a Parquet file and then lists the contents of the current directory:

import pandas as pd
import os

data = [['dom', 10], ['abhi', 15], ['celeste', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

df.to_parquet("dataframe.parquet")

print("Listing the contents of the current directory:")
print(os.listdir('.'))
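Note that to_parquet() relies on a Parquet engine, so pyarrow or fastparquet must be installed. As a quick check that the write worked, the file can be read back with pandas' read_parquet() function. This is a minimal sketch, assuming the code above has already created dataframe.parquet (the variable name df_roundtrip is illustrative):

import pandas as pd

# Read the Parquet file written above back into a DataFrame and display it.
df_roundtrip = pd.read_parquet("dataframe.parquet")
print(df_roundtrip)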
In the code example that writes dataframe.parquet:

We import the pandas and os packages.
We define the list data for constructing the pandas DataFrame.
We convert data to a pandas DataFrame called df.
We write df to a Parquet file using the to_parquet() function. The resulting file is named dataframe.parquet.
We list the contents of the current directory using the os.listdir method and observe that the dataframe.parquet file has been created.
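The optional parameters can also be combined in a single to_parquet() call. The sketch below, which assumes pyarrow or fastparquet is installed and uses the illustrative output names dataframe_gzip.parquet and dataframe_partitioned, writes the same DataFrame with gzip compression and without the index, and then writes a partitioned copy:

import pandas as pd

data = [['dom', 10], ['abhi', 15], ['celeste', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# Write with gzip compression and drop the DataFrame's index.
df.to_parquet("dataframe_gzip.parquet", compression="gzip", index=False)

# Partition the output by the Age column. With partition_cols, the path is
# treated as a directory, and one subdirectory is created per Age value.
df.to_parquet("dataframe_partitioned", partition_cols=["Age"])

With the pyarrow engine, passing the directory path to read_parquet() reads the partitioned dataset back as a single DataFrame.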