How to load an ORC file in pandas

What is ORC Format?

ORC stands for Optimized Row Columnar. It is an efficient columnar storage format, originally developed for Apache Hive, for reading, writing, and processing data. An ORC file is made up of stripes, each of which contains index data, row data, and a stripe footer.

The pandas read_orc function loads an ORC file into a DataFrame. It relies on the pyarrow package, which must be installed.

Note: Refer to What is pandas in Python to learn more about pandas.

Syntax

pandas.read_orc(path, columns=None, **kwargs)

Parameters

path: The location of the ORC file, given as a string, a path object, or a file-like object opened in binary mode. The string can also be a URL. Accepted URL schemes include http, ftp, s3, gs, and file.
columns: A list of the column names to read into the DataFrame. If None (the default), all columns are read.

Code example

Let’s look at the code below:

Code explanation

Lines 1-2: The pandas and pyarrow packages are imported.
Lines 4-5: A DataFrame is created and written to a file named df.orc.
Line 8: The df.orc file is read into a pandas DataFrame called df using the read_orc function.
Line 9: df is printed.
Line 12: We define cols, the list of columns to be read into the DataFrame.
Line 13: The df.orc file is read into a pandas DataFrame called df1 using the read_orc function, passing cols as the columns to read. All other columns are skipped.
Line 14: df1 is printed.

License: Creative Commons-Attribution NonCommercial-ShareAlike 4.0 (CC-BY-NC-SA 4.0)