How to implement TF-IDF in Python

TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a widely used technique in natural language processing (NLP) and information retrieval. It is employed to quantify the importance of words in a document relative to a collection of documents (corpus). TF-IDF assigns a numerical value to each word based on its frequency in a specific document and its rarity across the entire corpus.

Why TF-IDF?

The primary goal of TF-IDF is to highlight words that are unique and important to a document while downplaying common words that occur frequently across many documents. This allows for the identification of key terms that distinguish one document from another, making it valuable for tasks such as information retrieval and text mining.

Mathematical representation of TF-IDF

The TF-IDF score for a term $t$ in a document $d$ within a corpus of documents $D$ is calculated using the following formula:

$$\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t)$$

Here:

$\text{TF}(t, d)$: Term frequency, the number of times the term $t$ appears in the document $d$ relative to the document's length, calculated as:

$$\text{TF}(t, d) = \frac{\text{count of } t \text{ in } d}{\text{total number of terms in } d}$$

$\text{IDF}(t)$: Inverse document frequency, which downweights terms that appear in many documents, calculated as:

$$\text{IDF}(t) = \log \frac{|D|}{|\{d \in D : t \in d\}|}$$

where $|D|$ is the total number of documents in the corpus and the denominator counts the documents containing $t$.
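To make these formulas concrete, here is a minimal from-scratch sketch that applies the textbook definitions above to a toy corpus. The helper names compute_tf and compute_idf are illustrative, not from any library:

import math

def compute_tf(term, document):
    # TF(t, d): occurrences of the term divided by the total number of terms in the document
    words = document.lower().split()
    return words.count(term) / len(words)

def compute_idf(term, corpus):
    # IDF(t): log of (total documents / documents containing the term)
    docs_with_term = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / docs_with_term)

corpus = ["this is the first answer",
          "this answer is the second answer",
          "and this is the third one"]

term = "answer"
idf = compute_idf(term, corpus)
for doc in corpus:
    print(f"TF-IDF of '{term}' in '{doc}': {compute_tf(term, doc) * idf:.4f}")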

Implementation of TF-IDF

Let's see the implementation of TF-IDF in Python using the scikit-learn library.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["This is the first answer.",
             "This answer is the second answer.",
             "And this is the third one.",
             "Is this the first answer?"]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame to display the TF-IDF matrix
df_tfidf = pd.DataFrame(data=tfidf_matrix.toarray(), columns=feature_names)

print(df_tfidf)

Code explanation

Lines 1–2: We import the pandas library for creating a DataFrame to display the TF-IDF matrix and the TfidfVectorizer class from scikit-learn for the TF-IDF implementation.

Lines 5–8: The variable documents contains a list of sample text documents for which we want to calculate TF-IDF scores.

Lines 11–14: We create an instance of TfidfVectorizer() and use the fit_transform method to both fit the vectorizer to the documents (learn the vocabulary and IDF values) and transform the documents into a TF-IDF matrix.
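Because fitting and transforming are distinct steps, a fitted vectorizer can also score unseen text using the vocabulary and IDF values it has already learned. A brief sketch, with a made-up query string, assuming the code above has run:

# Reuse the fitted vectorizer on new, unseen text (illustrative example)
new_docs = ["Is this a new answer?"]
new_matrix = vectorizer.transform(new_docs)  # transform only; no refitting
print(new_matrix.toarray())

Note that terms absent from the learned vocabulary (such as "new" here) are simply ignored.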

Line 17: We use the get_feature_names_out method to retrieve the feature names that correspond to the terms in the TF-IDF matrix.
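The learned IDF weights can also be inspected through the fitted vectorizer's idf_ attribute. Note that scikit-learn deviates slightly from the textbook formula: by default, TfidfVectorizer computes a smoothed IDF, ln((1 + N) / (1 + df(t))) + 1, and then L2-normalizes each row, so the scores differ slightly from a hand calculation. A small sketch, assuming the code above has run:

# Inspect the IDF weight learned for each term
for term, idf in zip(feature_names, vectorizer.idf_):
    print(f"{term}: {idf:.4f}")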

Line 20: We create a DataFrame with the TF-IDF matrix data and feature names using pd.DataFrame. This DataFrame displays the TF-IDF scores for each term in each document.

  • Terms like "first" and "second" have higher values in specific documents, suggesting that they are more specific to those documents. In contrast, common terms like "this" have lower TF-IDF scores. (The sketch after this list shows how to extract each document's highest-scoring term.)

  • The "inverse document frequency" aspect of TF-IDF helps in reducing the importance of terms that occur frequently across all documents. This is reflected in the lower TF-IDF scores for common terms like "this" and "the".

Applications of TF-IDF

The following are some practical applications of TF-IDF:

  • Information retrieval: TF-IDF is widely used in search engines to rank documents based on their relevance to a user query (see the sketch after this list).

  • Text summarization: It is utilized to identify significant terms in a document, aiding in the generation of concise summaries.

  • Document clustering: TF-IDF helps group similar documents together by capturing the distinctive terms that characterize each cluster.

  • Keyword extraction: It is employed to extract essential keywords from a document, facilitating content analysis.

  • Spam filtering: TF-IDF can be applied to distinguish between important terms and common words in emails, assisting in spam detection.
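As a minimal illustration of the information retrieval use case, the sketch below ranks the sample documents against a query by cosine similarity. The query string is made up for the example, and the code assumes vectorizer, tfidf_matrix, and documents from earlier:

from sklearn.metrics.pairwise import cosine_similarity

# Score each document against a query using the fitted vectorizer
query = "first answer"  # illustrative query
query_vector = vectorizer.transform([query])
similarities = cosine_similarity(query_vector, tfidf_matrix).flatten()

# Rank documents from most to least similar
for idx in similarities.argsort()[::-1]:
    print(f"{similarities[idx]:.3f}  {documents[idx]}")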
