What is "Term Frequency-Inverse Document Frequency" (TF-IDF)?

TF-IDF is used in machine learning and natural language processing for measuring the significance of terms for a given document. It consists of two parts that are multiplied together:

1. Term Frequency: a measure of how many times a given word appears in a document
2. Inverse Document Frequency: a measure of how rare the word is across the corpus, which shrinks as the word appears in more of the documents

If we were to consider word frequency alone, the most frequent words would be common function words like "the", "and", and "of". We could use a stopwords list to remove them, but that still may not reveal the terms that are distinctive to the document, since the distinctiveness of a term depends on the larger body of documents it is compared against. In other words, the same term could be significant or insignificant depending on the context. Consider these examples:

• Given a set of scientific journal articles in biology, the term "lab" may not be significant since biologists often rely on and mention labs in their research. However, if the term "lab" were to occur frequently in a history or English article, then it is likely to be significant since humanities articles rarely discuss labs.
• If we were to look at thousands of articles in literary studies, then the term "postcolonial" may be significant for any given article. However, if we were to look at a few hundred articles on the topic of "the global south," then the term "postcolonial" may occur so frequently that it is not a significant way to differentiate between the articles.

The TF-IDF calculation reveals the words that are frequent in a given document yet rare in the other documents. The goal is to find out what is unique or remarkable about a document given the context (and the given context can change the results of the analysis).

Here is how the calculation is mathematically written:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t)$$

In plain English, this means: the value of TF-IDF for a term t in a document d is the product of the term's frequency in that document and its inverse document frequency. Let's unpack these terms one at a time.

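TF-IDF Calculation in Plain English

TF-IDF = (times the word occurs in the given document) × log10(total number of documents / number of documents containing the word)

There are variations on the TF-IDF formula, but this is the most widely used version.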


An Example Calculation of TF-IDF

Let's take a look at an example to illustrate the fundamentals of TF-IDF. First, we need several texts to compare. Our texts will be very simple.

• text1 = 'The grass was green and spread out the distance like the sea.'
• text2 = 'Green eggs and ham were spread out like the book.'
• text3 = 'Green sailors were met like the sea met troubles.'
• text4 = 'The grass was green.'

The first step is to list the unique words in each text.

• text1: the, grass, was, green, and, spread, out, distance, like, sea
• text2: green, eggs, and, ham, were, spread, out, like, the, book
• text3: green, sailors, were, met, like, the, sea, troubles
• text4: the, grass, was, green
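The per-text word lists can be reproduced with a few lines of Python. This is a minimal sketch, not the lesson's own code: the helper names `tokenize` and `unique_words` are illustrative, and the tokenizer assumes that lowercasing and keeping runs of letters is enough for these simple sentences.

```python
import re

texts = {
    "text1": "The grass was green and spread out the distance like the sea.",
    "text2": "Green eggs and ham were spread out like the book.",
    "text3": "Green sailors were met like the sea met troubles.",
    "text4": "The grass was green.",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


def unique_words(text):
    """Return each distinct word once, in order of first appearance."""
    # dict.fromkeys drops repeats while preserving insertion order.
    return list(dict.fromkeys(tokenize(text)))


unique_by_text = {name: unique_words(text) for name, text in texts.items()}

for name, words in unique_by_text.items():
    print(name, len(words), words)
```

Lowercasing matters here: without it, "The" and "the" in text1 would be counted as two different words.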

Our four texts share some words. Next, we create a single list of the unique words that occur across all four texts. (When we use the gensim library later, we will call this list a gensim dictionary.)

id Unique Words
0 and
1 book
2 distance
3 eggs
4 grass
5 green
6 ham
7 like
8 met
9 out
10 sailors
11 sea
12 spread
13 the
14 troubles
15 was
16 were
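The id list can be sketched in plain Python by sorting the combined vocabulary alphabetically and numbering it from 0. The names `vocabulary` and `word_ids` are illustrative, and the alphabetical ordering is an assumption taken from the list above. A gensim dictionary may assign ids differently.

```python
import re

texts = [
    "The grass was green and spread out the distance like the sea.",
    "Green eggs and ham were spread out like the book.",
    "Green sailors were met like the sea met troubles.",
    "The grass was green.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


# Collect every distinct word across all texts, then sort alphabetically.
vocabulary = sorted({word for text in texts for word in tokenize(text)})

# Map each word to its position in the sorted list.
word_ids = {word: i for i, word in enumerate(vocabulary)}

for word, i in word_ids.items():
    print(i, word)
```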

Now let's count the occurrences of each unique word in each text.

id word text1 text2 text3 text4
0 and 1 1 0 0
1 book 0 1 0 0
2 distance 1 0 0 0
3 eggs 0 1 0 0
4 grass 1 0 0 1
5 green 1 1 1 1
6 ham 0 1 0 0
7 like 1 1 1 0
8 met 0 0 2 0
9 out 1 1 0 0
10 sailors 0 0 1 0
11 sea 1 0 1 0
12 spread 1 1 0 0
13 the 3 1 1 1
14 troubles 0 0 1 0
15 was 1 0 0 1
16 were 0 1 1 0
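The count table can be rebuilt with `collections.Counter` from the standard library. This is a minimal sketch under the same tokenization assumption as before (lowercase, letters only); `count_table` is an illustrative name.

```python
import re
from collections import Counter

texts = [
    "The grass was green and spread out the distance like the sea.",
    "Green eggs and ham were spread out like the book.",
    "Green sailors were met like the sea met troubles.",
    "The grass was green.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


# One Counter per text: word -> number of occurrences in that text.
counts_per_text = [Counter(tokenize(text)) for text in texts]

# Alphabetical list of every word seen in any text.
vocabulary = sorted(set().union(*counts_per_text))

# A Counter returns 0 for a missing word, which fills in the empty cells.
count_table = {
    word: [counts[word] for counts in counts_per_text] for word in vocabulary
}

for i, (word, row) in enumerate(count_table.items()):
    print(i, word, *row)
```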

Computing TF-IDF (Example 1)

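We have enough information now to compute TF-IDF for every word in our corpus. Recall the plain English formula:

TF-IDF = (times the word occurs in the given document) × log10(total number of documents / number of documents containing the word)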

We can use the formula to compute TF-IDF for the most common word in our corpus: 'the'. In total, we will compute TF-IDF four times (once for each of our texts).

id word text1 text2 text3 text4
13 the 3 1 1 1
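Written out, the four calculations look like this. The base-10 logarithm is an assumption inferred from the scores reported in this lesson. The word 'the' occurs in 4 of the 4 texts, so the fraction inside the logarithm is 4/4 each time.

$$
\begin{aligned}
\text{text1: } & 3 \times \log_{10}(4/4) = 3 \times 0 = 0 \\
\text{text2: } & 1 \times \log_{10}(4/4) = 1 \times 0 = 0 \\
\text{text3: } & 1 \times \log_{10}(4/4) = 1 \times 0 = 0 \\
\text{text4: } & 1 \times \log_{10}(4/4) = 1 \times 0 = 0
\end{aligned}
$$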

The results of our analysis suggest "the" has a weight of 0 in every document. The word 'the' exists in all of our documents, and therefore it is not a significant term to differentiate one document from another.

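Given that idf is:

$$\mathrm{idf} = \log_{10}\left(\frac{\text{total number of documents}}{\text{number of documents containing the word}}\right)$$

and

$$\log_{10}(1) = 0$$

we can see that TF-IDF will be 0 for any word that occurs in every document. That is, if a word occurs in every document, then it is not a significant term for any given document.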
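The zero result can be checked numerically. This is a minimal sketch, assuming the base-10 logarithm that the scores in this lesson are consistent with; `idf` is an illustrative helper name, not a library function.

```python
import math


def idf(total_docs, docs_with_word):
    """Inverse document frequency using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_word)


# A word found in all 4 texts (such as 'the') gets an idf of exactly 0,
# so its TF-IDF is 0 no matter how often it occurs in a document.
print(idf(4, 4))

# Rarer words get larger idf values.
print(round(idf(4, 3), 4))  # word found in 3 of 4 texts
print(round(idf(4, 2), 4))  # word found in 2 of 4 texts
print(round(idf(4, 1), 4))  # word found in 1 of 4 texts
```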

Computing TF-IDF (Example 2)

Let's try a second example with the word 'out'. Recall the plain English formula.

We will compute TF-IDF four times, once for each of our texts.

id word text1 text2 text3 text4
9 out 1 1 0 0
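Written out, with the base-10 logarithm assumed from the scores reported in this lesson, the four calculations are as follows. The word 'out' occurs in 2 of the 4 texts.

$$
\begin{aligned}
\text{text1: } & 1 \times \log_{10}(4/2) \approx 1 \times 0.3010 = 0.3010 \\
\text{text2: } & 1 \times \log_{10}(4/2) \approx 1 \times 0.3010 = 0.3010 \\
\text{text3: } & 0 \times \log_{10}(4/2) = 0 \\
\text{text4: } & 0 \times \log_{10}(4/2) = 0
\end{aligned}
$$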

The results of our analysis suggest 'out' has some significance in text1 and text2, but no significance for text3 and text4 where the word does not occur.

Computing TF-IDF (Example 3)

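Let's try one last example with the word 'met'. Here's the TF-IDF formula again:

TF-IDF = (times the word occurs in the given document) × log10(total number of documents / number of documents containing the word)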

And here's how many times the word 'met' occurs in each text.

id word text1 text2 text3 text4
8 met 0 0 2 0
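Written out, with the base-10 logarithm assumed from the scores reported in this lesson, the four calculations are as follows. The word 'met' occurs in only 1 of the 4 texts, and the inverse document frequency is rounded to four decimal places before multiplying.

$$
\begin{aligned}
\text{text1: } & 0 \times \log_{10}(4/1) = 0 \\
\text{text2: } & 0 \times \log_{10}(4/1) = 0 \\
\text{text3: } & 2 \times \log_{10}(4/1) \approx 2 \times 0.6021 = 1.2042 \\
\text{text4: } & 0 \times \log_{10}(4/1) = 0
\end{aligned}
$$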

As expected, the word 'met' is very significant in text3 and not significant in any other text, since it occurs only in text3.

The Full TF-IDF Example Table

Here are the original sentences for each text:

• text1 = 'The grass was green and spread out the distance like the sea.'
• text2 = 'Green eggs and ham were spread out like the book.'
• text3 = 'Green sailors were met like the sea met troubles.'
• text4 = 'The grass was green.'

And here are the corresponding TF-IDF scores for each word in each text:

id word text1 text2 text3 text4
0 and .3010 .3010 0 0
1 book 0 .6021 0 0
2 distance .6021 0 0 0
3 eggs 0 .6021 0 0
4 grass .3010 0 0 .3010
5 green 0 0 0 0
6 ham 0 .6021 0 0
7 like .1249 .1249 .1249 0
8 met 0 0 1.2042 0
9 out .3010 .3010 0 0
10 sailors 0 0 .6021 0
11 sea .3010 0 .3010 0
12 spread .3010 .3010 0 0
13 the 0 0 0 0
14 troubles 0 0 .6021 0
15 was .3010 0 0 .3010
16 were 0 .3010 .3010 0

There are a few noteworthy things in this data.

• The TF-IDF score for any word that does not occur in a text is 0.
• Most of the scores for text4 are 0. It is a shortened version of text1 with only four words, so it contains no word that is unique to it. The words 'the' and 'green' occur in every text and score 0, leaving only 'was' and 'grass'. Because those two words also occur in text1, each scores .3010 rather than the .6021 of a word found in a single text.
• The words 'book', 'eggs', and 'ham' are significant in text2 since they only occur in that text.
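The whole table can be recomputed in a short script. This is a sketch under stated assumptions rather than the lesson's own implementation: it uses raw counts for term frequency, a base-10 logarithm for inverse document frequency, and rounds the idf to four decimal places before multiplying. That rounding order is what reproduces the 1.2042 shown for 'met'; multiplying at full precision and rounding afterwards would give 1.2041. The names `rounded_idf` and `tfidf_table` are illustrative.

```python
import math
import re
from collections import Counter

texts = [
    "The grass was green and spread out the distance like the sea.",
    "Green eggs and ham were spread out like the book.",
    "Green sailors were met like the sea met troubles.",
    "The grass was green.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic words in order."""
    return re.findall(r"[a-z]+", text.lower())


counts_per_text = [Counter(tokenize(text)) for text in texts]
vocabulary = sorted(set().union(*counts_per_text))
total_docs = len(texts)


def rounded_idf(word):
    """Base-10 idf of a word, rounded to four decimal places."""
    docs_with_word = sum(1 for counts in counts_per_text if word in counts)
    return round(math.log10(total_docs / docs_with_word), 4)


# TF-IDF = raw count in the text × rounded idf, one score per text.
tfidf_table = {
    word: [round(counts[word] * rounded_idf(word), 4) for counts in counts_per_text]
    for word in vocabulary
}

for i, (word, scores) in enumerate(tfidf_table.items()):
    print(i, word, *scores)
```

Library implementations often use a different logarithm base, smoothing, or normalization, so their scores generally will not match this table digit for digit.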

Now that you have a basic understanding of how TF-IDF is computed at a small scale, you can try computing it in Python.