Apache Spark is an open-source, scalable, ultra-fast, distributed processing framework for large-scale processing and machine learning. It is primarily designed for large data workloads. With Spark, fast, analytical queries can be performed against data of any size while leveraging in-memory caching and optimized query execution.
Apache Spark distributes massive computing tasks to multiple nodes, where they are decomposed into smaller tasks that can be executed simultaneously.
Apache Spark supports Java, Scala, Python, and R APIs. PySpark is a Python interface to Apache Spark. In a distributed processing environment, PySpark allows you to manipulate and analyze data using Python and SQL-like commands.
Let’s walk through the following example showing how to extract a DataFrame from a CSV file and save it to a JSON file.
The following example aims to familiarize you with the following concepts:
Read a CSV file using PySpark and store the data exported into a DataFrame.
Convert the PySpark DataFrame into a pandas DataFrame.
Save the pandas DataFrame into a JSON file.
from pyspark.sql import SparkSessionimport pandas as pdspark = SparkSession.builder \.appName('demo') \.master("local") \.getOrCreate()spark.sparkContext.setLogLevel("ERROR")df = spark.read.load("test.csv",format="csv", sep=",",inferSchema="true",header="true")df.printSchema()df.show()pandas_df = df.toPandas()pandas_df.to_json('/usercode/output/test.json')
Let’s go through the code widget above:
Lines 1–2: We import the required Python libraries: pyspark
and pandas
.
Lines 4–7: We create a Spark session and set the name of the application to 'demo'
and the master program, "local"
.
Line 9: We change the log level to ERROR
to disable DEBUG
or INFO
messages in the console.
Line 10: Using the spark.read.load()
function, we read the test.csv
file and load it into a DataFrame.
Line 11: We print out the DataFrame schema using the printSchema()
function.
Line 12: We invoke the show()
function to display the head of the DataFrame.
Line 14: We convert the PySpark DataFrame into a pandas DataFrame.
Line 15: We call the method to_json()
to convert the pandas DataFrame to a JSON file.
Free Resources