Null values often need to be removed from a DataFrame before further processing.
The dropna method of the PySpark API removes rows that contain null values from a DataFrame.
DataFrame.dropna(how='any', thresh=None, subset=None)
how: This parameter accepts two values, any and all. If any is specified, the method drops a row if it contains any null value. If all is specified, the method drops a row only if all of its values are null.
thresh: If specified, the method drops rows that have fewer than thresh non-null values. This overrides the how parameter.
subset: This is the list of column names to consider when checking for nulls.
All of the parameters above are optional.
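The way the three parameters interact can be modeled in plain Python. The sketch below is not PySpark code: would_drop is a hypothetical helper, and a row is represented as an ordinary dict, purely to illustrate the rules described above (including the assumption, taken from the PySpark documentation, that thresh takes precedence over how).

```python
from typing import Optional, Sequence


def would_drop(row: dict, how: str = "any", thresh: Optional[int] = None,
               subset: Optional[Sequence[str]] = None) -> bool:
    """Return True if the row would be dropped under the dropna rules.

    Hypothetical helper that mirrors the documented behavior;
    it is not part of PySpark.
    """
    # Only the columns in subset are inspected; the default is every column.
    cols = list(subset) if subset is not None else list(row)
    non_null = sum(1 for c in cols if row[c] is not None)

    # When thresh is given, it takes precedence over how.
    if thresh is not None:
        return non_null < thresh
    if how == "any":
        return non_null < len(cols)   # at least one null among the columns
    if how == "all":
        return non_null == 0          # every inspected column is null
    raise ValueError("how must be 'any' or 'all'")


row = {"emp name": "Michael", "company": "Google",
       "department": None, "country": "Asia"}

print(would_drop(row, how="any"))                        # True
print(would_drop(row, how="all"))                        # False
print(would_drop(row, thresh=3))                         # False
print(would_drop(row, subset=["emp name", "company"]))   # False
```

The row has three non-null values out of four, so it is dropped only under how="any" when all columns are considered.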
This method returns a new DataFrame with the matching rows removed. With how='any' and no subset, the result contains no null values.
how=any
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James","Educative","Engg","USA"),
    ("Michael","Google",None,"Asia"),
    ("Robert",None,"Marketing","Russia"),
    ("Maria","Netflix","Finance","Ukraine")]

columns = ["emp name","company","department","country"]
df = spark.createDataFrame(data = data, schema = columns)

df_any = df.dropna(how="any")
df_any.show(truncate=False)
pyspark and SparkSession are imported.
A SparkSession with the application name answers is created.
Sample data that contains None values is defined.
The DataFrame is created.
The null values are dropped by invoking the dropna() method on the DataFrame with the how parameter as any.
The DataFrame obtained after dropping the null values is printed. Only the rows for James and Maria remain, since the other two rows each contain a null.
how=all
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James","Educative","Engg","USA"),
    ("Michael","Google",None,"Asia"),
    ("Robert",None,"Marketing","Russia"),
    ("Maria","Netflix","Finance","Ukraine"),
    (None, None, None, None)]

columns = ["emp name","company","department","country"]
df = spark.createDataFrame(data = data, schema = columns)

df_all = df.dropna(how="all")
df_all.show(truncate=False)
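pyspark and SparkSession are imported.
A SparkSession with the application name answers is created.
Sample data that contains None values is defined, including one row in which every value is None.
The DataFrame is created.
The null values are dropped by invoking the dropna() method on the DataFrame with the how parameter as all.
The DataFrame obtained after dropping the null values is printed. Only the last row, in which every value is None, is removed. The rows for Michael and Robert are kept because each has at least one non-null value.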