How to determine the number of unique words in a file in PySpark

Problem Statement

If we have a text file, how do we find the total number of unique words in it?

Algorithm

The assumption here is that a single space separates the words.

The steps involved are as follows:

  1. Read the text file into memory.
  2. Split the text file into individual tokens (or words).
  3. Find the unique tokens and the count of unique tokens.
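The steps above can be sketched in a few lines of plain Python. The sample sentence below is a hypothetical stand-in for the file contents, and we assume words are separated by single spaces:

```python
# Plain-Python sketch of the algorithm above.
# The sample text is a hypothetical stand-in for the contents of word_count.txt.
text = "to be or not to be"

words = text.split(' ')    # step 2: split into tokens on spaces
unique_words = set(words)  # step 3: keep only the unique tokens

print(sorted(unique_words))  # ['be', 'not', 'or', 'to']
print(len(unique_words))     # 4
```

The PySpark version below follows the same shape, but distributes the splitting and deduplication across the cluster.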

Code

main.py
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Educative_Answers').getOrCreate()

sc = spark.sparkContext

f_path = "word_count.txt"

f_rdd = sc.textFile(f_path)

words_rdd = f_rdd.flatMap(lambda line: line.split(' '))

distinct_words_rdd = words_rdd.distinct()

print("The unique words in the file are as follows:", distinct_words_rdd.collect())

count = distinct_words_rdd.count()

print("The count of unique words in the file is:", count)

Explanation

  • Lines 1–2: Import the pyspark module and the SparkSession class.
  • Line 4: We create a SparkSession with the application name Educative_Answers.
  • Line 6: The SparkContext object is assigned to the variable sc.
  • Line 8: The path to the text file is defined.
  • Line 10: The file is read into a Spark RDD (Resilient Distributed Dataset) using the textFile() method.
  • Line 12: We apply flatMap() on the RDD to split the file into tokens. Here, we pass a lambda function that takes a line as the input and splits the text into words with space as the delimiter using the split() method. The resulting RDD will contain the individual words of the text file.
  • Line 14: The unique words can be found by invoking the distinct() function on the RDD.
  • Line 16: The unique words in the text file are printed.
  • Line 18: The count of unique words is obtained by invoking the count() function.
  • Line 20: The count of unique words is printed.
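For intuition, the behavior of flatMap() and distinct() can be mimicked in plain Python; the sample lines below are hypothetical. A map() would produce one list of words per line, whereas flatMap() flattens those per-line lists into a single stream of words:

```python
from itertools import chain

# Hypothetical sample lines standing in for the contents of word_count.txt.
lines = ["spark makes big data simple", "big data needs spark"]

# map()-like result: one list of tokens per line
mapped = [line.split(' ') for line in lines]

# flatMap()-like result: all tokens flattened into one sequence
flat_mapped = list(chain.from_iterable(line.split(' ') for line in lines))

# distinct()-like result: deduplicate (note that distinct() does not guarantee order)
unique_words = set(flat_mapped)

print(len(mapped))       # 2 (one list per line)
print(len(flat_mapped))  # 9 (all words)
print(len(unique_words)) # 6 (unique words)
```

This also shows why flatMap() is the right choice here: with map(), distinct() would deduplicate whole lists of words rather than individual words.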

Copyright ©2025 Educative, Inc. All rights reserved