We have added keyphrases to Constellate. Each keyphrase is up to three words contained in the document. Together they provide a concise summary of the document. Constellate uses them in search weighting, and they will also appear alongside search results as a “Word Bubble” visualization that gives a broad overview of the key concepts in a search result set.

What is a keyphrase?

For Constellate, each keyphrase is up to three words contained in the document. Together they provide a concise summary of the document.

Assigning keyphrases to a large set of documents, like the one in Constellate, is challenging. Not only does Constellate have a lot of content (approximately 32 million documents, with about 3.4 non-English language documents), but the content comes from multiple sources and includes not just academic articles but also historic newspapers and digitized special collections. As we don’t have a team of indexers to set to this challenge, we built an automated process.

How did we assign keyphrases?

Constellate is using a customized process built with the unique challenges of the content in mind.

First we set some guidelines:

  • A Constellate keyphrase can be up to three words.
  • The text of the phrase has to appear in the document; we do not use stemmed or lemmatized versions (e.g., we will use ‘adoption’, not stem it to ‘adopt’).
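
The two guidelines can be expressed as a small check. This is only a sketch: the function name `is_valid_keyphrase`, the case-insensitive comparison, and the whole-word matching are assumptions made for illustration, not details of the Constellate code.

```python
import re


def is_valid_keyphrase(phrase, document_text, max_words=3):
    """Return True if the phrase has at most max_words words and
    appears verbatim, as whole words, in the document text."""
    words = phrase.split()
    if not 0 < len(words) <= max_words:
        return False
    # Assumption: matching ignores case and must fall on word boundaries,
    # so the stem 'adopt' does not match inside 'adoption'.
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, document_text, flags=re.IGNORECASE) is not None


text = "The adoption process has changed over time."
print(is_valid_keyphrase("adoption", text))
print(is_valid_keyphrase("adopt", text))
```

With whole-word matching, ‘adoption’ passes for this text while the stem ‘adopt’ does not.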

With these guidelines in mind, we read blog posts and academic papers and reviewed open source code. See the Resources section below for the sources that we consulted and that greatly informed our solution.

After this background work, we created the following process that each document is passed through:

  • For each ngram (up to three words) in the document, calculate a weighted score. To generate this score, we sampled the Constellate corpus to create a term-to-document frequency. This allows us to calculate a TF-IDF-like score for a given phrase that measures its relative importance in the context of the entire Constellate corpus.
  • Remove stopwords. We use the Python NLTK library for language-specific stopwords. For non-English documents, we apply the NLTK stopword list for that language.
  • Remove phrases that frequently appear in the Constellate corpus but aren’t useful as keyphrases, for example “front matter”.
  • Bigrams and trigrams (two or three word phrases) are weighted more heavily than unigrams (single word phrases). This was informed by Chuang et al.’s research into what makes a good keyphrase, where they found that graduate students were more likely to apply bigrams as keyphrases to documents.
  • Once a ranked list of phrases is generated, we attempt to deduplicate the list by removing phrases that are very similar (using Levenshtein distance or comparing stemmed versions) or single word phrases (unigrams) that are also part of a multiple word phrase.
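
The steps above can be sketched in a few dozen lines of Python. This is a toy illustration under stated assumptions, not the Constellate implementation: the smoothed IDF formula, the `LENGTH_BOOST` weights, the edit-distance threshold, the rule of dropping any phrase that contains a stopword, and all function names are hypothetical. A small hard-coded stopword set stands in for the NLTK lists so the example runs without downloads, and the stemming comparison is left out.

```python
import math
import re
from collections import Counter

# Tiny stand-in list; NLTK's stopwords.words("english") could be swapped in.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "on", "with"}
# Corpus-specific phrases to drop; "front matter" is the post's own example.
BLOCKLIST = {"front matter"}
# Hypothetical boosts for longer phrases; the actual weights are not published.
LENGTH_BOOST = {1: 1.0, 2: 1.5, 3: 1.5}


def candidate_ngrams(text, max_n=3):
    """Yield 1- to max_n-word tuples that never span a punctuation mark."""
    for segment in re.split(r"[^\w\s]+", text.lower()):
        tokens = segment.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                yield tuple(tokens[i:i + n])


def keep(gram):
    """Assumed filter: drop phrases containing a stopword or a blocked phrase."""
    phrase = " ".join(gram)
    if any(f" {blocked} " in f" {phrase} " for blocked in BLOCKLIST):
        return False
    return not any(token in STOPWORDS for token in gram)


def document_frequencies(sample_docs):
    """Count, for each n-gram, how many sampled documents contain it."""
    df = Counter()
    for doc in sample_docs:
        df.update(set(candidate_ngrams(doc)))
    return df


def score_phrases(doc, df, n_docs):
    """TF-IDF-like score: in-document count times a smoothed IDF and a length boost."""
    tf = Counter(g for g in candidate_ngrams(doc) if keep(g))
    scores = {}
    for gram, count in tf.items():
        idf = math.log((1 + n_docs) / (1 + df[gram])) + 1
        scores[" ".join(gram)] = count * idf * LENGTH_BOOST[len(gram)]
    return scores


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def dedupe(ranked, max_distance=1):
    """Drop unigrams contained in a multi-word phrase and near-identical phrases."""
    multiword_tokens = {w for p in ranked if " " in p for w in p.split()}
    kept = []
    for phrase in ranked:
        if " " not in phrase and phrase in multiword_tokens:
            continue
        if any(levenshtein(phrase, k) <= max_distance for k in kept):
            continue
        kept.append(phrase)
    return kept


def extract_keyphrases(doc, df, n_docs, top_k=10):
    scores = score_phrases(doc, df, n_docs)
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    # Assumption: deduplicate only a shortlist of the best-scoring candidates.
    return dedupe(ranked[: top_k * 3])[:top_k]


sample = [
    "front matter and the history of adoption law",
    "adoption law in family courts",
    "front matter index of family courts",
]
doc = "front matter adoption law shapes adoption law practice in family courts"
keyphrases = extract_keyphrases(doc, document_frequencies(sample), len(sample), top_k=5)
print(keyphrases)
```

On a corpus this small the ranking is noisy. The sketch is meant to show how the scoring, filtering, boosting, and deduplication steps fit together, not to reproduce Constellate’s results.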

Evaluation

How well does this keyphrase extraction process work? Well, there isn’t a straightforward answer:

  • Assigning keyphrases to documents is subjective. Trained indexers may read a document and assign different keyphrases.
  • What is the “correct” set of keyphrases for a given document?
  • There is no controlled vocabulary so the number of possible assigned terms is large.

To address the second item, we identified 200 documents that are in both Constellate and PubMed and that have author-assigned keywords. We then extracted keyphrases from these documents using our process and compared the results against the author-assigned phrases.

As the plot below shows, the Constellate keyphrase extraction process (in blue) assigns on average about one (0.995) “correct” phrase when extracting five phrases per document and about 1.7 (1.675) when extracting 20 keyphrases. This is roughly similar to what Witten et al. found for their Kea algorithm.

A “correct” phrase is an exact match between an author-supplied keyphrase and a keyphrase assigned by our process. We use this metric for evaluation since, as Witten et al. describe, precision and recall aren’t great fits for this type of evaluation (each could be gamed) and “correctness” is easier to interpret.
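
As a minimal sketch of this metric, the functions below count exact matches among the top k extracted phrases and average that count over a set of documents. The function names and the case-insensitive comparison are assumptions for illustration, and the sample data is made up.

```python
def correct_count(extracted, author_keywords, k):
    """Number of the top-k extracted phrases that exactly match an author keyword."""
    # Assumption: matching ignores case and surrounding whitespace.
    gold = {kw.strip().lower() for kw in author_keywords}
    return sum(1 for phrase in extracted[:k] if phrase.strip().lower() in gold)


def average_correct(documents, k):
    """Mean correct count over (extracted_phrases, author_keywords) pairs."""
    return sum(correct_count(e, a, k) for e, a in documents) / len(documents)


# Made-up example data: (ranked extracted phrases, author-assigned keywords).
documents = [
    (["gene expression", "cancer", "mice"], ["Cancer", "gene expression"]),
    (["protein", "cell"], ["apoptosis"]),
]
for k in (1, 3):
    print(k, average_correct(documents, k))
```

Averaging raw counts, rather than a ratio, keeps the number directly readable as “how many author keywords did we hit per document when extracting k phrases”.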

For comparison, we also evaluated the YAKE algorithm. YAKE is a keyphrase extraction algorithm created by Campos et al. and made available as a Python library. It shares many of the goals we had for this project (lightweight, unsupervised, multilingual, no training required) but is also corpus agnostic. We were prepared to use YAKE for this project but found that a Constellate-specific approach yielded better results, as seen below in the plot and the table.

#phrases   Constellate   YAKE
5          0.995         0.695
10         1.355         1.095
15         1.53          1.38
20         1.675         1.52
  • Values are the average number of extracted phrases per document that match an author-assigned keyword.

Conclusion

The Constellate team has assigned up to 10 keyphrases to each document in the corpus. These keyphrases aid in discovering and evaluating content. We devised a customized process for extracting phrases and evaluated it against author-assigned keywords from PubMed. We aim to improve this process over time and add functionality to Constellate to leverage it.

Resources

Blog posts and papers
Python libraries