If we have a text file, how do we find the total number of unique words in it?
The assumption here is that a single space separates the words.
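This assumption matters because of how `str.split` behaves. The short plain-Python sketch below (illustrative only; the sample string is made up) shows that splitting on a literal space keeps empty strings when words are separated by more than one space, whereas calling `split()` with no argument collapses any run of whitespace:

```python
line = "Spark makes  word counting easy"  # note the double space

# Splitting on a literal ' ' keeps an empty string for the repeated space.
tokens = line.split(' ')
print(tokens)  # ['Spark', 'makes', '', 'word', 'counting', 'easy']

# Splitting with no argument collapses runs of whitespace instead.
print(line.split())  # ['Spark', 'makes', 'word', 'counting', 'easy']
```

If the input file may contain repeated spaces, splitting with no argument avoids counting the empty string as a "word".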
The complete code is given below, followed by a step-by-step explanation.
```python
import pyspark
from pyspark.sql import SparkSession

# Create a Spark session and get its SparkContext
spark = SparkSession.builder.appName('Educative_Answers').getOrCreate()
sc = spark.sparkContext

# Read the text file into an RDD of lines
f_path = "word_count.txt"
f_rdd = sc.textFile(f_path)

# Split each line into words, then keep only the distinct ones
words_rdd = f_rdd.flatMap(lambda line: line.split(' '))
distinct_words_rdd = words_rdd.distinct()

print("The unique words in the file are as follows:", distinct_words_rdd.collect())

count = distinct_words_rdd.count()
print("The count of unique words in the file is:", count)
```
The code works as follows:

1. Import pyspark and SparkSession.
2. Create a Spark session and obtain its SparkContext, sc.
3. Read the text file using the textFile() method.
4. Call flatMap() on the RDD to split the file into tokens. Here, we pass a lambda function that takes a line as the input and splits the text into words with space as the delimiter using the split() method. The resulting RDD contains the individual words of the text file.
5. Get the unique words by calling the distinct() function on the RDD.
6. Get the count of unique words using the count() function.
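The same pipeline can be sanity-checked without a Spark cluster. The sketch below mirrors flatMap(), distinct(), and count() with plain Python built-ins; the sample lines stand in for the contents of word_count.txt and are made up for illustration:

```python
# Hypothetical file contents (assumed sample data, not from the original answer)
lines = [
    "spark counts words",
    "spark counts unique words",
]

# flatMap() equivalent: split every line and flatten into one list of tokens
words = [word for line in lines for word in line.split(' ')]

# distinct() equivalent: a set keeps one copy of each word
distinct_words = set(words)
print("Unique words:", sorted(distinct_words))

# count() equivalent
print("Count of unique words:", len(distinct_words))  # 4
```

This mirrors the PySpark logic exactly: flatten, deduplicate, count.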