ORC stands for Optimized Row Columnar. It is a highly efficient columnar data format used to read, write, and process data in Hive. ORC files are made of stripes of data, each of which comprises an index, row data, and a footer.
The pandas read_orc function loads an ORC file into a DataFrame.
Note: Refer to What is pandas in Python to learn more about pandas.
pandas.read_orc(path, columns=None, **kwargs)
path: This is the location/path of the ORC file. A directory with many files can be referenced by the file path. The path can also be a valid file URL. The accepted URL schemes are http, ftp, s3, gs, and file.
columns: These are the columns to be read into the DataFrame. If None (the default), all columns are read.
Let’s look at the code below:
import pandas as pd
import pyarrow.orc

# Creating an orc file
df = pd.DataFrame(data={"Name": ["John", "Kelly"], "Age": [3, 4]})
df.to_orc("./df.orc")

# Reading an orc file
df = pd.read_orc("df.orc")
print(df)

# Selecting a column from an orc file
cols = ["Name"]
df1 = pd.read_orc("df.orc", columns=cols)
print(df1)
The pandas and pyarrow packages are imported.
A DataFrame is created and written to the df.orc file using the to_orc method.
The df.orc file is read into a pandas DataFrame called df using the read_orc method.
df is printed.
The columns, cols, to be read into the DataFrame are defined.
The df.orc file is read into a pandas DataFrame called df1 using the read_orc method, passing cols as the columns to be read and skipping the other columns.
df1 is printed.