How to drop null values in PySpark

Dropping null values from a DataFrame is a common cleaning step in most data pipelines, since downstream operations often cannot handle missing values.

The dropna() method of the PySpark DataFrame API removes rows that contain null values.

Syntax

DataFrame.dropna(how='any', thresh=None, subset=None)

Parameters

  • how: This parameter can be either any (the default) or all. With any, the method drops a row if it contains at least one null value. With all, the method drops a row only if all of its values are null.
  • thresh: If specified, this parameter drops rows that have fewer than thresh non-null values. It overrides the how parameter.
  • subset: This is an optional list of column names to consider when checking for nulls; columns outside the list are ignored.

All of the parameters above are optional.

Return value

This method returns a new DataFrame with the matching rows removed; the original DataFrame is left unchanged.

Dropping with how=any

Let’s look at the code below:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('answers').getOrCreate()
data = [("James", "Educative", "Engg", "USA"),
        ("Michael", "Google", None, "Asia"),
        ("Robert", None, "Marketing", "Russia"),
        ("Maria", "Netflix", "Finance", "Ukraine")
]
columns = ["emp name", "company", "department", "country"]
df = spark.createDataFrame(data=data, schema=columns)
df_any = df.dropna(how="any")
df_any.show(truncate=False)

Code explanation

  • Lines 1–2: The pyspark module and SparkSession are imported.
  • Line 3: A SparkSession with the application name answers is created.
  • Lines 4–9: The dummy data and the column names for the DataFrame are defined. The data contains None values.
  • Line 10: A Spark DataFrame is created.
  • Line 11: The dropna() method is invoked on the DataFrame with how set to any, dropping every row that contains a null value.
  • Line 12: The new DataFrame obtained after dropping the null values is printed.

Dropping with how=all

Let’s look at the code below:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('answers').getOrCreate()
data = [("James", "Educative", "Engg", "USA"),
        ("Michael", "Google", None, "Asia"),
        ("Robert", None, "Marketing", "Russia"),
        ("Maria", "Netflix", "Finance", "Ukraine"),
        (None, None, None, None)
]
columns = ["emp name", "company", "department", "country"]
df = spark.createDataFrame(data=data, schema=columns)
df_all = df.dropna(how="all")
df_all.show(truncate=False)

Code explanation

  • Lines 1–2: The pyspark module and SparkSession are imported.
  • Line 3: A SparkSession with the application name answers is created.
  • Lines 4–10: The dummy data and the column names for the DataFrame are defined. The data includes a row in which every value is None.
  • Line 11: A Spark DataFrame is created.
  • Line 12: The dropna() method is invoked on the DataFrame with how set to all, dropping only rows in which every value is null.
  • Line 13: The new DataFrame obtained after dropping the null values is printed.
