How to use TF-IDF Vectorizer on DataFrame

In general, vectorization means performing an operation on an entire data set at once rather than element by element; libraries such as NumPy in Python rely on it to improve computational efficiency, minimize the need for nested loops, and perform complex operations efficiently. In text processing, vectorization has a related meaning: converting a collection of documents into numerical vectors so that machine learning algorithms can operate on them.

Term frequency (TF) measures how often a term occurs within a single document, whereas inverse document frequency (IDF) measures how rare, and therefore how informative, the term is across the collection of documents.

The term frequency-inverse document frequency (TF-IDF) vectorizer converts text documents into a numerical representation that captures the importance of each word in a collection of documents. It assigns each word in a document a weight proportional to how often the word occurs in that document (TF) and inversely related to how many documents in the collection contain it (IDF).

How is it calculated?

Say we have a document with 1,000 words, and the word "focus" appears in it 15 times.

The term frequency is calculated as given below.

TF = 15 / 1,000 = 0.015

Now assume that we have 1 million documents, and the word "focus" appears in 100 of them. The inverse document frequency is calculated as follows (using the base-10 logarithm):

IDF = log(1,000,000 / 100) = log(10,000) = 4

The TF-IDF score is then the product of TF and IDF:

TF-IDF = 0.015 × 4 = 0.06
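This arithmetic can be checked with a short Python snippet. It uses the base-10 logarithm, matching the worked example above; note that some libraries (including scikit-learn) use the natural logarithm and a smoothed IDF instead, so their scores differ from this textbook formula:

```python
import math

# Term frequency: "focus" appears 15 times in a 1,000-word document
tf = 15 / 1000                      # 0.015

# Inverse document frequency: "focus" appears in 100 of 1,000,000 documents
idf = math.log10(1_000_000 / 100)   # log10(10,000) = 4.0

# TF-IDF is the product of the two
tf_idf = tf * idf

print(tf, idf, tf_idf)              # 0.015 4.0 0.06
```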

Code

The following code demonstrates how to use TF-IDF vectorizer on a DataFrame.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
# Data set
data_01 = "The dead trees waited to be ignited by the smallest spark and seek their revenge."
data_02 = "They throw cabbage that turns your brain into emotional baggage."
data_03 = "As he waited for the shower to warm, he noticed that he could hear water change temperature."
# Conversion to Data frame
df = pd.DataFrame({'Data_First' : [data_01], 'Data_Second' : [data_02], 'Data_Third' : [data_03]})
# Vectorization
vectorizer_init = TfidfVectorizer()
transformFit = vectorizer_init.fit_transform(df.iloc[0])
# Transformation and feature extraction
newDataFrame = pd.DataFrame(transformFit.toarray().transpose(), index=vectorizer_init.get_feature_names_out())
print(newDataFrame)

Explanation

  • Line 1: Here, we import pandas, which provides the DataFrame functionality used to hold and display the data.

  • Line 2: sklearn, one of the most popular machine learning libraries, provides the sklearn.feature_extraction.text module to extract useful features from a dataset. You can read more about it in the official documentation. TfidfVectorizer computes the TF, IDF, and TF-IDF values of a collection of documents (the data set) in a single operation.

  • Line 10: This line calls TfidfVectorizer() to create a vectorizer object that will generate a matrix of TF-IDF features.

  • Line 11: Here, we use the fit_transform() method, which learns the vocabulary and IDF weights (fit) and converts the documents into a TF-IDF matrix (transform). We pass it df.iloc[0], the first row of the DataFrame, whose three values are the three documents.

Note: df.iloc[0] selects the first row of the DataFrame and returns it as a pandas Series. When printed, it is displayed as a column of values, each preceded by its column name.
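A minimal sketch (with placeholder documents) shows why df.iloc[0] is a suitable input for fit_transform(): it is a Series of strings, one per column, which is exactly the iterable of documents the vectorizer expects:

```python
import pandas as pd

# One row, three columns, each column holding one document
df = pd.DataFrame({'Data_First': ["first document"],
                   'Data_Second': ["second document"],
                   'Data_Third': ["third document"]})

row = df.iloc[0]   # a Series: index = column names, values = the documents
print(row)
```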

  • Lines 13–14: Finally, the toarray() method converts transformFit into a dense array (a data structure in which each element occupies its own memory), and transpose() swaps rows and columns so that each word becomes a row. The get_feature_names_out() method returns the names of the output features (the vocabulary terms), which serve as the index of the new DataFrame, and print() displays the result.

  • The resulting output shows how strongly each term is associated with each document. Let's break it down.

    • The TF-IDF score is higher for words that appear frequently in a document but rarely across the rest of the collection.

    • The term "baggage" has a TF-IDF score of 0 in column 0 and column 2. It does not appear in the first and third documents, so its relevance score is 0.

    • In the second document (column 1), "baggage" has a relevance score of 0.323112. The word appears only in the second document, so it has a relatively high TF-IDF score there.

    • Overall, the TF-IDF score of the word "baggage" from the output tells us that it is a significant word in the second document but not relevant to the first and third documents. 


Copyright ©2025 Educative, Inc. All rights reserved