TF-IDF, which stands for Term Frequency-Inverse Document Frequency, is a widely used technique in natural language processing (NLP) and information retrieval. It is employed to quantify the importance of words in a document relative to a collection of documents (corpus). TF-IDF assigns a numerical value to each word based on its frequency in a specific document and its rarity across the entire corpus.
The primary goal of TF-IDF is to highlight words that are unique and important to a document while downplaying common words that occur frequently across many documents. This allows for the identification of key terms that distinguish one document from another, making it valuable for tasks such as document retrieval, text mining, and information retrieval.
The TF-IDF score for a term t in a document d is the product of two components:

TF-IDF(t, d) = TF(t, d) × IDF(t)

Here:

TF(t, d), the term frequency, is the number of times t appears in d, typically divided by the total number of terms in d.
IDF(t), the inverse document frequency, is log(N / df(t)), where N is the total number of documents in the corpus and df(t) is the number of documents that contain t.
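As an illustration, the formula can be computed by hand. The tiny corpus below and the plain log(N / df) weighting are assumptions for this sketch; libraries such as scikit-learn apply a smoothed variant, so their exact numbers differ.

```python
import math

# Illustrative corpus of N = 4 short documents (an assumption for this sketch)
corpus = [
    "this is the first answer",
    "this answer is the second answer",
    "and this is the third one",
    "is this the first answer",
]
N = len(corpus)

def tf(term, doc):
    # Term frequency: count of the term divided by total terms in the document
    words = doc.split()
    return words.count(term) / len(words)

def idf(term):
    # Inverse document frequency: log of N over the number of documents containing the term
    df = sum(1 for doc in corpus if term in doc.split())
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "second" appears in only one document, so it scores high there;
# "this" appears in every document, so idf = log(4/4) = 0 and its score is 0
print(tf_idf("second", corpus[1]))
print(tf_idf("this", corpus[1]))
```

Note how a term that occurs in every document is zeroed out entirely under this unsmoothed weighting, which is exactly the downweighting effect described above.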
Let's see the implementation of TF-IDF in Python using the scikit-learn library.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample documents
documents = ["This is the first answer.",
             "This answer is the second answer.",
             "And this is the third one.",
             "Is this the first answer?"]

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer()

# Fit and transform the documents
tfidf_matrix = vectorizer.fit_transform(documents)

# Get the feature names (terms)
feature_names = vectorizer.get_feature_names_out()

# Create a DataFrame to display the TF-IDF matrix
df_tfidf = pd.DataFrame(data=tfidf_matrix.toarray(), columns=feature_names)
print(df_tfidf)
Lines 1–2: We import the pandas library for creating a DataFrame to display the TF-IDF matrix, and the TfidfVectorizer class from scikit-learn for the TF-IDF implementation.
Lines 5–8: The variable documents contains a list of sample text documents for which we want to calculate TF-IDF scores.
Lines 11–14: We created an instance of TfidfVectorizer() and used the fit_transform method to both fit the vectorizer to the documents (learn the vocabulary and IDF values) and transform the documents into a TF-IDF matrix.
Line 17: We used the get_feature_names_out method to retrieve the feature names that correspond to the terms in the TF-IDF matrix.
Line 20: We created a DataFrame with the TF-IDF matrix data and feature names using pd.DataFrame. This DataFrame displays the TF-IDF scores for each term in each document.
Terms like "first" and "second" have higher values in specific documents, suggesting that they are more specific to those documents. In contrast, common terms like "this" have lower TF-IDF scores.
The "inverse document frequency" aspect of TF-IDF helps in reducing the importance of terms that occur frequently across all documents. This is reflected in the lower TF-IDF scores for common terms like "this" and "the".
The following are some practical applications of TF-IDF:
Information retrieval: TF-IDF is widely used in search engines to rank documents based on their relevance to a user query.
Text summarization: It is utilized to identify significant terms in a document, aiding in the generation of concise summaries.
Document clustering: TF-IDF helps group similar documents together by capturing the distinctive terms that characterize each cluster.
Keyword extraction: It is employed to extract essential keywords from a document, facilitating content analysis.
Spam filtering: TF-IDF can be applied to distinguish between important terms and common words in emails, assisting in spam detection.
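As a sketch of the first application, document retrieval, we can rank the sample documents against a user query by cosine similarity over their TF-IDF vectors. The query string here is a hypothetical example for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "This is the first answer.",
    "This answer is the second answer.",
    "And this is the third one.",
    "Is this the first answer?",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

query = "second answer"  # hypothetical user query
# Transform the query with the already-fitted vectorizer so it
# shares the same vocabulary and IDF weights as the documents
query_vec = vectorizer.transform([query])

# Cosine similarity between the query and every document
scores = cosine_similarity(query_vec, tfidf_matrix).ravel()
ranking = scores.argsort()[::-1]
for i in ranking:
    print(f"{scores[i]:.3f}  {documents[i]}")
```

The document containing both query terms ranks first, while the document sharing no terms with the query scores zero.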