What is "Term Frequency- Inverse Document Frequency" (TF-IDF)?
TF-IDF is used in machine learning and natural language processing for measuring the significance of terms for a given document. It consists of two parts that are multiplied together:
- Term Frequency: a measure of how many times a given word appears in a document
- Inverse Document Frequency: a measure of how rare the word is across the documents of the corpus (the fewer documents a word appears in, the higher its inverse document frequency)
If we were to merely consider word frequency, the most frequent words would be common function words like "the", "and", and "of". We could use a stopwords list to remove them, but even then the results may not capture the distinctive terms of a document, since what counts as distinctive depends on the larger body of documents being compared. In other words, the same term can be significant or insignificant depending on the context. Consider these examples:
- Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs.
- If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if we were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.
The TF-IDF calculation reveals the words that are frequent in a given document yet rare in other documents. The goal is to find out what is unique or remarkable about a document given its context (and changing the given context can change the results of the analysis).
Here is how the calculation is written mathematically:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t)$$
In plain English, this means: the value of TF-IDF for a term is the product of the term's frequency in a document and its inverse document frequency. Let's unpack these two parts one at a time.

TF-IDF Calculation in Plain English
- Term frequency (tf): the number of times a given term occurs in the document.
- Inverse document frequency (idf): the logarithm (base 10) of the total number of documents in the corpus divided by the number of documents that contain the term.

Multiplying the two parts together gives:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log_{10}\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

There are variations on the TF-IDF formula, but this is the most widely-used version.
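For example, take a hypothetical corpus of 10 documents (purely to show the arithmetic): if the word 'whale' occurs 5 times in one particular document, and 2 of the 10 documents contain 'whale' at all, then for that document:

$$\text{tf-idf} = 5 \times \log_{10}\left(\frac{10}{2}\right) = 5 \times 0.699 \approx 3.49$$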
An Example Calculation of TF-IDF
Let's take a look at an example to illustrate the fundamentals of TF-IDF. First, we need several texts to compare. Our texts will be very simple.
- text1 = 'The grass was green and spread out the distance like the sea.'
- text2 = 'Green eggs and ham were spread out like the book.'
- text3 = 'Green sailors were met like the sea met troubles.'
- text4 = 'The grass was green.'
First, we need to discover the unique words in each text.
text1 | text2 | text3 | text4 |
---|---|---|---|
the | green | green | the |
grass | eggs | sailors | grass |
was | and | were | was |
green | ham | met | green |
and | were | like | |
spread | spread | the | |
out | out | sea | |
distance | like | troubles | |
like | the | | |
sea | book | | |
Our four texts share some similar words. Next, we create a single list of unique words that occur across all four texts. (When we use the gensim library later, we will call this list a gensim dictionary; a short preview follows the table below.)
id | Unique Words |
---|---|
0 | and |
1 | book |
2 | distance |
3 | eggs |
4 | grass |
5 | green |
6 | ham |
7 | like |
8 | met |
9 | out |
10 | sailors |
11 | sea |
12 | spread |
13 | the |
14 | troubles |
15 | was |
16 | were |
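As a quick preview of gensim, here is a minimal sketch of building the same word list with gensim's Dictionary class. This assumes gensim is installed; note that the integer ids gensim assigns are not guaranteed to match the alphabetical ids in the table above.

```python
from gensim.corpora import Dictionary

# Each text, already lowercased and split into individual words
tokenized = [
    ['the', 'grass', 'was', 'green', 'and', 'spread', 'out', 'the',
     'distance', 'like', 'the', 'sea'],
    ['green', 'eggs', 'and', 'ham', 'were', 'spread', 'out', 'like',
     'the', 'book'],
    ['green', 'sailors', 'were', 'met', 'like', 'the', 'sea', 'met',
     'troubles'],
    ['the', 'grass', 'was', 'green'],
]

dictionary = Dictionary(tokenized)
print(len(dictionary))       # 17 unique words, matching the table above
print(dictionary.token2id)   # maps each word to its integer id
```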
Now let's count the occurrences of each unique word in each text:
id | word | text1 | text2 | text3 | text4 |
---|---|---|---|---|---|
0 | and | 1 | 1 | 0 | 0 |
1 | book | 0 | 1 | 0 | 0 |
2 | distance | 1 | 0 | 0 | 0 |
3 | eggs | 0 | 1 | 0 | 0 |
4 | grass | 1 | 0 | 0 | 1 |
5 | green | 1 | 1 | 1 | 1 |
6 | ham | 0 | 1 | 0 | 0 |
7 | like | 1 | 1 | 1 | 0 |
8 | met | 0 | 0 | 2 | 0 |
9 | out | 1 | 1 | 0 | 0 |
10 | sailors | 0 | 0 | 1 | 0 |
11 | sea | 1 | 0 | 1 | 0 |
12 | spread | 1 | 1 | 0 | 0 |
13 | the | 3 | 1 | 1 | 1 |
14 | troubles | 0 | 0 | 1 | 0 |
15 | was | 1 | 0 | 0 | 1 |
16 | were | 0 | 1 | 1 | 0 |
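These counts are easy to reproduce in Python. For instance, a small sketch using the standard library's Counter on the words of text1:

```python
from collections import Counter

# text1, lowercased and split into individual words
text1_tokens = ['the', 'grass', 'was', 'green', 'and', 'spread', 'out',
                'the', 'distance', 'like', 'the', 'sea']

counts = Counter(text1_tokens)
print(counts['the'])    # 3, matching the text1 column above
print(counts['green'])  # 1
```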
Computing TF-IDF (Example 1)
We have enough information now to compute TF-IDF for every word in our corpus. Recall the plain English formula: multiply the number of times a word occurs in a text by $\log_{10}(\text{total number of texts} \div \text{number of texts containing the word})$.
We can use the formula to compute TF-IDF for the most common word in our corpus: 'the'. In total, we will compute TF-IDF four times (once for each of our texts).
id | word | text1 | text2 | text3 | text4 |
---|---|---|---|---|---|
13 | the | 3 | 1 | 1 | 1 |
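Since 'the' occurs in all four texts, its inverse document frequency is $\log_{10}(4/4) = \log_{10}(1) = 0$, which makes every product 0:

- text1: $3 \times 0 = 0$
- text2: $1 \times 0 = 0$
- text3: $1 \times 0 = 0$
- text4: $1 \times 0 = 0$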

The results of our analysis show that 'the' has a weight of 0 in every document. The word 'the' occurs in all of our documents, and therefore it is not a significant term for differentiating one document from another.
Given that idf is

$$\text{idf}(t) = \log_{10}\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$

and

$$\log_{10}(1) = 0$$

we can see that TF-IDF will be 0 for any word that occurs in every document, since the fraction inside the logarithm is then exactly 1. That is, if a word occurs in every document, it is not a significant term for any given document.
Computing TF-IDF (Example 2)
Let's try a second example with the word 'out'. Recall the plain English formula.
We will compute TF-IDF four times, once for each of our texts.
id | word | text1 | text2 | text3 | text4 |
---|---|---|---|---|---|
9 | out | 1 | 1 | 0 | 0 |
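'out' occurs in two of the four texts, so its inverse document frequency is $\log_{10}(4/2) = .3010$:

- text1: $1 \times .3010 = .3010$
- text2: $1 \times .3010 = .3010$
- text3: $0 \times .3010 = 0$
- text4: $0 \times .3010 = 0$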
The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.
Computing TF-IDF (Example 3)
Let's try one last example with the word 'met'. Here's the TF-IDF formula again:

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log_{10}\left(\frac{\text{total number of documents}}{\text{number of documents containing } t}\right)$$
And here's how many times the word 'met' occurs in each text.
id | word | text1 | text2 | text3 | text4 |
---|---|---|---|---|---|
8 | met | 0 | 0 | 2 | 0 |
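'met' occurs in only one of the four texts, so its inverse document frequency is $\log_{10}(4/1) = .6021$:

- text1: $0 \times .6021 = 0$
- text2: $0 \times .6021 = 0$
- text3: $2 \times .6021 = 1.2042$
- text4: $0 \times .6021 = 0$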
As we should expect, the word 'met' is very significant in text3 but has no significance in any other text, since it occurs only in text3. Notice that it also earns the highest score in our corpus because it occurs twice in that one text.
The Full TF-IDF Example Table
Here are the original sentences for each text:
- text1 = 'The grass was green and spread out the distance like the sea.'
- text2 = 'Green eggs and ham were spread out like the book.'
- text3 = 'Green sailors were met like the sea met troubles.'
- text4 = 'The grass was green.'
And here are the corresponding TF-IDF scores for each word in each text:
id | word | text1 | text2 | text3 | text4 |
---|---|---|---|---|---|
0 | and | .3010 | .3010 | 0 | 0 |
1 | book | 0 | .6021 | 0 | 0 |
2 | distance | .6021 | 0 | 0 | 0 |
3 | eggs | 0 | .6021 | 0 | 0 |
4 | grass | .3010 | 0 | 0 | .3010 |
5 | green | 0 | 0 | 0 | 0 |
6 | ham | 0 | .6021 | 0 | 0 |
7 | like | .1249 | .1249 | .1249 | 0 |
8 | met | 0 | 0 | 1.2042 | 0 |
9 | out | .3010 | .3010 | 0 | 0 |
10 | sailors | 0 | 0 | .6021 | 0 |
11 | sea | .3010 | 0 | .3010 | 0 |
12 | spread | .3010 | .3010 | 0 | 0 |
13 | the | 0 | 0 | 0 | 0 |
14 | troubles | 0 | 0 | .6021 | 0 |
15 | was | .3010 | 0 | 0 | .3010 |
16 | were | 0 | .3010 | .3010 | 0 |
There are a few noteworthy things in this data.
- The TF-IDF score for any word that does not occur in a text is 0.
- The scores for text4 are low because it is a shorter version of text1: text1 contains all the same words, so nothing in text4 is unique. With only four words to consider, 'the' and 'green' occur in every text and score 0, leaving just 'was' and 'grass', which score .3010 because they are shared only with text1.
- The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.
Now that you have a basic understanding of how TF-IDF is computed at a small scale, you can try computing it in Python.
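As a starting point, here is a minimal sketch of the computation above using only Python's standard library. It assumes the same simple tokenization we used by hand (lowercase every word and strip the final period) and the raw-count times $\log_{10}$ version of the formula; the last decimal digit may differ slightly from the table depending on when you round.

```python
import math

texts = [
    'The grass was green and spread out the distance like the sea.',
    'Green eggs and ham were spread out like the book.',
    'Green sailors were met like the sea met troubles.',
    'The grass was green.',
]

# Tokenize: lowercase each word and strip any trailing period
docs = [[word.strip('.').lower() for word in text.split()] for text in texts]

# The sorted list of unique words across all texts (our dictionary)
unique_words = sorted({word for doc in docs for word in doc})

total_docs = len(docs)
for word in unique_words:
    # Document frequency: how many texts contain this word at least once
    doc_freq = sum(1 for doc in docs if word in doc)
    idf = math.log10(total_docs / doc_freq)
    # TF-IDF per text: raw count of the word in the text times idf
    scores = [round(doc.count(word) * idf, 4) for doc in docs]
    print(f'{word:<10}', scores)
```

Running this prints one row per unique word with its four TF-IDF scores, which you can compare against the full table above.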