Null values often need to be removed from a DataFrame before further processing.
The dropna() method of the PySpark DataFrame API removes rows that contain null values from a DataFrame.
DataFrame.dropna(how='any', thresh=None, subset=None)
how: This parameter accepts two values, any and all. With any, the method drops a row if it contains any null value. With all, the method drops a row only if all of its values are null.
thresh: If specified, the method drops rows that have fewer than thresh non-null values. This overrides the how parameter.
subset: This is the list of column names to consider when checking for nulls.

All of the parameters above are optional.

This method returns a new DataFrame from which the rows matching the drop criteria have been removed. Nulls can remain in the result, for example with how='all' or when subset excludes a column.
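The parameter rules above can be sketched in plain Python. This is only a simplified model of the documented behavior, not Spark's implementation: the function name keeps_row and the dictionary rows are hypothetical, and only None is treated as null.

```python
from typing import Optional, Sequence


def keeps_row(row: dict, how: str = "any",
              thresh: Optional[int] = None,
              subset: Optional[Sequence[str]] = None) -> bool:
    """Hypothetical helper: True if the row would survive dropna."""
    # Only the columns in subset (or all columns) are inspected.
    cols = list(subset) if subset is not None else list(row)
    non_null = sum(row[c] is not None for c in cols)
    if thresh is not None:
        # thresh overrides how: keep rows with at least thresh non-nulls.
        return non_null >= thresh
    if how == "any":
        # Drop the row if any inspected value is null.
        return non_null == len(cols)
    if how == "all":
        # Drop the row only if every inspected value is null.
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


# Illustrative rows, modeled on the sample data used in this article.
rows = [
    {"emp name": "James", "company": "Educative", "department": "Engg", "country": "USA"},
    {"emp name": "Michael", "company": "Google", "department": None, "country": "Asia"},
    {"emp name": None, "company": None, "department": None, "country": None},
]

kept_any = [r["emp name"] for r in rows if keeps_row(r, how="any")]
kept_all = [r["emp name"] for r in rows if keeps_row(r, how="all")]
kept_thresh = [r["emp name"] for r in rows if keeps_row(r, thresh=3)]
kept_subset = [r["emp name"] for r in rows if keeps_row(r, subset=["company"])]
```

Under this model, how="any" keeps only the fully populated row, while how="all", thresh=3, and subset=["company"] each keep the first two rows.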
how=any

Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James","Educative","Engg","USA"),
    ("Michael","Google",None,"Asia"),
    ("Robert",None,"Marketing","Russia"),
    ("Maria","Netflix","Finance","Ukraine")]

columns = ["emp name","company","department","country"]
df = spark.createDataFrame(data = data, schema = columns)

df_any = df.dropna(how="any")
df_any.show(truncate=False)
pyspark and SparkSession are imported.
A SparkSession with the application name answers is created.
The sample data is defined, with some fields set to None values.
The DataFrame is created.
Rows containing null values are dropped by invoking the dropna() method on the DataFrame with the how parameter set to any.
The DataFrame obtained after dropping those rows is printed. Only the rows for James and Maria remain.

how=all

Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('answers').getOrCreate()

data = [("James","Educative","Engg","USA"),
    ("Michael","Google",None,"Asia"),
    ("Robert",None,"Marketing","Russia"),
    ("Maria","Netflix","Finance","Ukraine"),
    (None, None, None, None)]

columns = ["emp name","company","department","country"]
df = spark.createDataFrame(data = data, schema = columns)

df_all = df.dropna(how="all")
df_all.show(truncate=False)
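pyspark and SparkSession are imported.
A SparkSession with the application name answers is created.
The sample data is defined, with some fields set to None values and one row that is entirely None.
The DataFrame is created.
Rows in which every value is null are dropped by invoking the dropna() method on the DataFrame with the how parameter set to all.
The DataFrame obtained after dropping those rows is printed. Only the last row, which is entirely null, is removed.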