What is "Term Frequency- Inverse Document Frequency" (TF-IDF)?

TF-IDF is used in machine learning and natural language processing for measuring the significance of terms for a given document. It consists of two parts that are multiplied together:

  1. Term Frequency: a measure of how many times a given word appears in a document
  2. Inverse Document Frequency: a measure of how rare a given word is across the documents of the corpus (the fewer documents that contain the word, the higher this value)

If we were to merely consider word frequency, the most frequent words would be common function words like "the," "and," and "of." We could use a stopwords list to remove them, but even that may not surface the terms that characterize a document, since the uniqueness of terms depends on the context of a larger body of documents. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:

  • Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs.
  • If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if we were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.

The TF-IDF calculation reveals the words that are frequent in a given document yet rare across the rest of the corpus. The goal is to find out what is unique or remarkable about a document given its context (and a different context can change the results of the analysis).

Here is how the calculation is mathematically written:

tf-idf(t, d) = tf(t, d) × idf(t)

In plain English, this means: the value of TF-IDF is the product of a given term's frequency and its inverse document frequency. Let's unpack these terms one at a time.

Term Frequency in a given document: the number of times a term (t) occurs in a given document (d).
Inverse Document Frequency for the corpus: the log of the total number of documents (N) divided by the number of documents that contain the term.

TF-IDF Calculation in Plain English!

There are variations on the TF-IDF formula, but this is the most widely used version.

In plain English, the formula is: word frequency in a given document times the log of the total number of documents over the number of documents containing the word.

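To make the formula concrete, here is a minimal sketch of it in Python (the function name is our own, and we use the base-10 logarithm, which matches the worked examples below):

    import math

    def tf_idf(term_count, total_docs, docs_with_term):
        # Term frequency times inverse document frequency (base-10 log)
        return term_count * math.log10(total_docs / docs_with_term)

    # A word occurring once in a corpus of 4 documents, 2 of which contain it:
    print(tf_idf(1, 4, 2))  # 0.3010 (rounded)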

An Example Calculation of TF-IDF

Let's take a look at an example to illustrate the fundamentals of TF-IDF. First, we need several texts to compare. Our texts will be very simple.

  • text1 = 'The grass was green and spread out the distance like the sea.'
  • text2 = 'Green eggs and ham were spread out like the book.'
  • text3 = 'Green sailors were met like the sea met troubles.'
  • text4 = 'The grass was green.'

The first step is to list the unique words in each text.

text1      text2    text3      text4
the        green    green      the
grass      eggs     sailors    grass
was        and      were       was
green      ham      met        green
and        were     like
spread     spread   the
out        out      sea
distance   like     troubles
like       the
sea        book
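
If you want to check these lists yourself, here is a minimal Python sketch (the lowercasing and period-stripping are our own simple tokenization choices):

    texts = {
        'text1': 'The grass was green and spread out the distance like the sea.',
        'text2': 'Green eggs and ham were spread out like the book.',
        'text3': 'Green sailors were met like the sea met troubles.',
        'text4': 'The grass was green.',
    }

    def tokenize(text):
        # Lowercase, drop the period, and split on whitespace
        return text.lower().replace('.', '').split()

    for name, text in texts.items():
        # dict.fromkeys de-duplicates while preserving first-appearance order
        print(name, list(dict.fromkeys(tokenize(text))))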

Our four texts share some similar words. Next, we create a single list of the unique words that occur across all four texts. (When we use the gensim library later, we will call this list a gensim dictionary.)

id Unique Words
0 and
1 book
2 distance
3 eggs
4 grass
5 green
6 ham
7 like
8 met
9 out
10 sailors
11 sea
12 spread
13 the
14 troubles
15 was
16 were
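
Continuing the sketch above, the combined list can be built with a set comprehension (we sort alphabetically, which happens to match the ids shown):

    # Gather every token from every text, de-duplicate, and alphabetize
    vocabulary = sorted({word for text in texts.values() for word in tokenize(text)})

    for word_id, word in enumerate(vocabulary):
        print(word_id, word)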

Now let's count the occurrences of each unique word in each text.

id word text1 text2 text3 text4
0 and 1 1 0 0
1 book 0 1 0 0
2 distance 1 0 0 0
3 eggs 0 1 0 0
4 grass 1 0 0 1
5 green 1 1 1 1
6 ham 0 1 0 0
7 like 1 1 1 0
8 met 0 0 2 0
9 out 1 1 0 0
10 sailors 0 0 1 0
11 sea 1 0 1 0
12 spread 1 1 0 0
13 the 3 1 1 1
14 troubles 0 0 1 0
15 was 1 0 0 1
16 were 0 1 1 0
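
These counts can be reproduced with collections.Counter, continuing the same sketch:

    from collections import Counter

    # One Counter per text, mapping each word to its number of occurrences
    counts = {name: Counter(tokenize(text)) for name, text in texts.items()}

    for word_id, word in enumerate(vocabulary):
        # A Counter returns 0 for words it has not seen
        print(word_id, word, [counts[name][word] for name in texts])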

Computing TF-IDF (Example 1)

We have enough information now to compute TF-IDF for every word in our corpus. Recall the plain English formula:

Word frequency in a given document times the log of the total number of documents over the number of documents containing the word.

We can use the formula to compute TF-IDF for the most common word in our corpus: 'the'. In total, we will compute TF-IDF four times (once for each of our texts).

id word text1 text2 text3 text4
13 the 3 1 1 1

tf-idf('the', text1) = 3 × log(4/4) = 3 × log(1) = 3 × 0 = 0
tf-idf('the', text2) = 1 × log(4/4) = 1 × 0 = 0
tf-idf('the', text3) = 1 × log(4/4) = 1 × 0 = 0
tf-idf('the', text4) = 1 × log(4/4) = 1 × 0 = 0

The value of TF-IDF for the word 'the' is 0 in all four texts.

The results of our analysis suggest "the" has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.

Given that idf is:

Inverse Document Frequency for the corpus: the log of the total number of documents (N) divided by the number of documents that contain the term.

and

Log of 1 equals 0.

we can see that TF-IDF will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any given document.
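
You can verify this with a one-liner: because all 4 of our documents contain 'the', the ratio inside the log is 4/4 = 1, and the log of 1 is 0, so the product is 0 no matter how large the term frequency is.

    import math

    # 3 occurrences of 'the' in text1, 4 documents total, all 4 contain 'the'
    print(3 * math.log10(4 / 4))  # 0.0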

Computing TF-IDF (Example 2)

Let's try a second example with the word 'out'. Recall the plain English formula.

Word frequency in a given document times the log of the total number of documents over the number of documents containing the word.

We will compute TF-IDF four times, once for each of our texts.

id word text1 text2 text3 text4
9 out 1 1 0 0

tf-idf('out', text1) = 1 × log(4/2) = 1 × 0.3010 = 0.3010
tf-idf('out', text2) = 1 × log(4/2) = 1 × 0.3010 = 0.3010
tf-idf('out', text3) = 0 × log(4/2) = 0
tf-idf('out', text4) = 0 × log(4/2) = 0

The TF-IDF score for the word 'out' is 0.3010 for text1 and text2; for text3 and text4, the score is 0.

The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.

Computing TF-IDF (Example 3)

Let's try one last example with the word 'met'. Here's the TF-IDF formula again:

Word frequency in a given document times the log of the total number of documents over the number of documents containing the word.

And here's how many times the word 'met' occurs in each text.

id word text1 text2 text3 text4
8 met 0 0 2 0

tf-idf('met', text1) = 0 × log(4/1) = 0
tf-idf('met', text2) = 0 × log(4/1) = 0
tf-idf('met', text3) = 2 × log(4/1) = 2 × 0.6021 = 1.2042
tf-idf('met', text4) = 0 × log(4/1) = 0

The TF-IDF score for the word 'met' is 0 in text1, text2, and text4. In text3, the score is 1.2042.

As we should expect, the word 'met' is very significant in text3 but not significant in any other text, since it does not occur anywhere else.

The Full TF-IDF Example Table

Here are the original sentences for each text:

  • text1 = 'The grass was green and spread out the distance like the sea.'
  • text2 = 'Green eggs and ham were spread out like the book.'
  • text3 = 'Green sailors were met like the sea met troubles.'
  • text4 = 'The grass was green.'

And here are the corresponding TF-IDF scores for each word in each text:

id word text1 text2 text3 text4
0 and .3010 .3010 0 0
1 book 0 .6021 0 0
2 distance .6021 0 0 0
3 eggs 0 .6021 0 0
4 grass .3010 0 0 .3010
5 green 0 0 0 0
6 ham 0 .6021 0 0
7 like .1249 .1249 .1249 0
8 met 0 0 1.2042 0
9 out .3010 .3010 0 0
10 sailors 0 0 .6021 0
11 sea .3010 0 .3010 0
12 spread .3010 .3010 0 0
13 the 0 0 0 0
14 troubles 0 0 .6021 0
15 was .3010 0 0 .3010
16 were 0 .3010 .3010 0

There are a few noteworthy things in this data.

  • The TF-IDF score for any word that does not occur in a text is 0.
  • The scores for almost every word in text4 are 0 since it is a shorter version of text1. There are no words unique to text4 because text1 contains all the same words. It is also a short text, which means there are only four words to consider. The words 'the' and 'green' occur in every text, leaving only 'was' and 'grass', which are also found in text1.
  • The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.

Now that you have a basic understanding of how TF-IDF is computed at a small scale, you can try computing it in Python.
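
As a starting point, here is a short standard-library script that reproduces the full table above (a sketch under the same assumptions as the earlier snippets: lowercased tokens and a base-10 log; tiny rounding differences are possible because the table rounds the idf before multiplying):

    import math
    from collections import Counter

    texts = {
        'text1': 'The grass was green and spread out the distance like the sea.',
        'text2': 'Green eggs and ham were spread out like the book.',
        'text3': 'Green sailors were met like the sea met troubles.',
        'text4': 'The grass was green.',
    }

    # Tokenize: lowercase, drop periods, split on whitespace
    tokens = {name: text.lower().replace('.', '').split() for name, text in texts.items()}
    counts = {name: Counter(words) for name, words in tokens.items()}
    vocabulary = sorted({word for words in tokens.values() for word in words})

    total_docs = len(texts)
    for word in vocabulary:
        # Number of documents that contain this word
        docs_with_term = sum(1 for c in counts.values() if word in c)
        idf = math.log10(total_docs / docs_with_term)
        scores = [round(counts[name][word] * idf, 4) for name in texts]
        print(word, scores)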