What is TF-IDF?

TF-IDF stands for “Term Frequency – Inverse Document Frequency.” It is a numerical score that reflects how important a word is to a document in a collection or corpus. This technique is often used in information retrieval and text mining as a weighting factor.

TF-IDF is composed of two components:

  • Term Frequency (TF):
    The number of times a word appears in a document divided by the total number of words in that document.
  • Inverse Document Frequency (IDF):
    The logarithm of the ratio of the total number of documents in the corpus to the number of documents that contain the term.
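
Written as formulas, the two definitions above can be sketched as follows. The symbols are notation assumed here for illustration: f_{t,d} is the number of times term t occurs in document d, N is the number of documents in the corpus D, and the final score is taken to be the product of the two parts. The base of the logarithm is left open, since it only rescales the scores.

```latex
\mathrm{tf}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\qquad
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|}
\qquad
\mathrm{tfidf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```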

So, essentially, the TF-IDF value increases as the word’s frequency in a document (TF) increases. However, this is offset by the number of documents in the corpus that contain the word (IDF): the more documents a word appears in, the lower its IDF.

IDF down-weights common words like “the” or “is” that would otherwise have a high term frequency but carry little information about any particular document.

Example

Let’s look at an example of how TF-IDF works.

Consider two sentences (or documents):

  1. “The cat is white”
  2. “The cat is black”

Notice that the two sentences differ only in the words “white” and “black”. These distinguishing words should get a high TF-IDF value, while words like “the” and “cat”, which appear in both documents, should get a low value.

Figure: TF-IDF value for the word "white"
Figure: TF-IDF value for the word "the"
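
The calculation for the two sentences can be reproduced with a minimal Python sketch. It assumes lowercase whitespace tokenization, a base-10 logarithm, and no smoothing; the helper names `tf`, `idf`, and `tf_idf` are hypothetical and chosen only for this illustration.

```python
import math

# The two example documents from the text.
docs = [
    "The cat is white",
    "The cat is black",
]

# Assumption: lowercase the text and split on whitespace.
tokenized = [d.lower().split() for d in docs]


def tf(term, tokens):
    """Occurrences of the term divided by the total number of words in the document."""
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term).

    Assumption: base-10 logarithm. The term must occur in at least one
    document, otherwise the division fails.
    """
    n_containing = sum(1 for tokens in corpus if term in tokens)
    return math.log10(len(corpus) / n_containing)


def tf_idf(term, tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, tokens) * idf(term, corpus)


# Scores for the first sentence, "The cat is white".
white_score = tf_idf("white", tokenized[0], tokenized)
the_score = tf_idf("the", tokenized[0], tokenized)

print(f"white: {white_score:.4f}")  # white: 0.0753
print(f"the:   {the_score:.4f}")    # the:   0.0000
```

Under these assumptions, “white” has TF = 1/4 and IDF = log10(2/1) ≈ 0.301, giving about 0.075. “the” has the same TF, but because it appears in both documents its IDF is log10(2/2) = 0, so its TF-IDF value is 0.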

Copyright ©2025 Educative, Inc. All rights reserved