Metadata and n-gram dataset files

The dataset builder creates three kinds of files:

The textual data includes:

The metadata may include:

Column Name         Description
id                  a unique item ID (in JSTOR, this is a stable URL)
title               the title for the document
subTitle            the subtitle for the document
docType             the type of document (for example, article or book)
publicationYear     the year of publication
provider            the source or provider of the dataset
collection          collection information as identified by the source
doi                 the digital object identifier
datePublished       the publication date in yyyy-mm-dd format
url                 a URL for the item and/or the item's metadata
creator             the author or authors of the item
pageStart           the first page number of the print version
pageEnd             the last page number of the print version
pageCount           the number of print pages in the item
wordCount           the number of words in the item
pagination          the page sequence in the print version
language            the language or languages of the item (eng is the ISO 639 code for English)
publisher           the publisher for the item
placeOfPublication  the city of the publisher
abstract            the abstract description for the document
isPartOf            the larger work that holds this title (for example, a journal title)
hasPartTitle        the title of sub-items
identifier          the set of identifiers connected with the document (doi, issn, isbn, oclc, etc.)
tdmCategory         the inferred category of the content based on machine learning
sourceCategory      the category according to the provider
sequence            the article or chapter sequence
issueNumber         the issue number for a journal publication
volumeNumber        the volume number for a journal publication
outputFormat        what data is available (unigrams, bigrams, trigrams, and/or full-text)

For more detail, see the current version of the schema.

All of the textual data and metadata are available in the JSON Lines files, but we also offer the metadata CSV for two primary reasons:

  1. The JSON Lines data is a little more complex to parse since it is nested. It cannot be easily represented in a table form in something like Pandas or Excel.
  2. The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computational resources. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.

We are still refining the structure of the dataset file. We anticipate adding more “features” (such as named entity recognition) in the future. Please reach out to us if you have comments or suggestions.

Data Structure

CSV File

The CSV file is a comma-delimited, tabular structure that can easily be viewed in Excel or Pandas.
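As a minimal sketch, the metadata CSV can be read with Python's standard csv module (Pandas' read_csv would work just as well). The sample rows below are invented for illustration and use only a few of the column names from the metadata table; a real file would be opened with open() rather than io.StringIO.

```python
import csv
import io

# Invented sample standing in for a downloaded metadata CSV file.
SAMPLE_CSV = """id,title,docType,publicationYear,wordCount
doc-1,First Example Title,article,1996,8500
doc-2,Second Example Title,book,2001,92000
doc-3,Third Example Title,article,2010,6100
"""

def load_metadata(handle):
    """Return the CSV rows as a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

rows = load_metadata(io.StringIO(SAMPLE_CSV))

# Keep only journal articles and total their word counts.
# Every CSV value is read as a string, so wordCount is converted to int.
articles = [row for row in rows if row["docType"] == "article"]
total_words = sum(int(row["wordCount"]) for row in articles)
print(len(articles), total_words)
```

Because every value arrives as a string, numeric columns such as wordCount or publicationYear need an explicit conversion before arithmetic or sorting.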

JSON-L file

The JSON Lines file (file extension ".jsonl") is served in a compressed gzip format (.gz). The data for each document in the corpus is written on a single line. (If there are 1,245 documents in the corpus, the JSON Lines file will be 1,245 lines long.) Each line contains a list of key/value pairs that map a key concept to a matching value.
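Reading such a file one document at a time might look like the sketch below, which uses only Python's standard gzip and json modules. The helper name iter_documents, the file name sample.jsonl.gz, and the two sample records are assumptions made up for this example; the sketch writes its own tiny file so that it runs without a real dataset.

```python
import gzip
import json
import os
import tempfile

def iter_documents(path):
    """Yield one parsed document (a dict) per line of a gzipped JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # skip blank lines
                yield json.loads(line)

# Build a tiny invented dataset so the sketch runs on its own.
sample_docs = [
    {"id": "doc-1", "title": "First Example", "unigramCount": {"the": 3, "Tiger": 1}},
    {"id": "doc-2", "title": "Second Example", "unigramCount": {"tiger": 2}},
]
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    for doc in sample_docs:
        handle.write(json.dumps(doc) + "\n")

documents = list(iter_documents(sample_path))
print(len(documents), documents[0]["title"])
```

Because the function is a generator, a large dataset can be processed one line at a time instead of being loaded into memory all at once; list() is used here only because the sample is tiny.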

The basic structure looks like:

"Key": Value

Instead of attempting to decode the structure of a single large line, we can plug a single line into a JSON editor. The screenshot below was created using JSON Editor Online. The JSON editor reveals the file structure by breaking it down into a set of nested hierarchies, similar to XML. These can also be collapsed using arrows in a separate viewer pane within JSON Editor Online.

View of the top of a sample file

A single line from a JSON Lines dataset expressed as a nested hierarchy using JSON Editor Online

The editor makes it easier for human readers to discern a portion of the metadata for the text. In the data above, we can see:

  • The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
  • The author is "Theodore B. Leinwand" ("creators": ["Theodore B. Leinwand"])
  • The text is a journal article ("docType": "article")
  • The journal is Shakespeare Quarterly ("isPartOf": "Shakespeare Quarterly")
  • Identifiers such as ISSN, OCLC, and DOI
  • pageCount and wordCount

If you examine the rest of the file, you'll discover additional metadata such as the publication date, DOI, page numbers, ISSN, and more.

The most significant data for text analysis is usually the "unigramCount" section where the frequency of each word is recorded. In this context, the word "unigram" describes a single word construction like the word "chicken." There are also bigrams (e.g. "chicken stock"), trigrams ("homemade chicken stock"), and n-grams of any length. Depending on the licensing for the content, there may also be full-text available.
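To make the terminology concrete, here is a small sketch of how unigram, bigram, and trigram counts could be computed from a list of words. It assumes simple whitespace tokenization, which is not necessarily how the dataset builder tokenizes text, and the function name ngram_counts is invented for this example.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens, joined by a single space."""
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)

# Assumption: split on whitespace only; no punctuation or case handling.
tokens = "homemade chicken stock beats canned chicken stock".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(unigrams["chicken"])                 # "chicken" appears twice
print(bigrams["chicken stock"])            # "chicken stock" appears twice
print(trigrams["homemade chicken stock"])  # appears once
```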

Showing the unigram counts

On each line, a key on the left is matched to a value representing its frequency on the right

The texts have been minimally pre-processed, so casing will affect n-gram counts. Each word here is treated as a string. Since JavaScript and Python strings are case-sensitive, "Tiger" is considered a different word than "tiger". Counting all the occurrences of the word "tiger" would then require combining the counts of both strings. These methods are covered in the notebooks.
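One possible way to combine the counts, sketched here under the assumption that the unigram counts have already been loaded into a Python dict (the sample counts and the function name total_count are invented, and the notebooks may take a different approach):

```python
def total_count(unigram_counts, word):
    """Sum the counts of every key that matches word when case is ignored."""
    target = word.lower()
    return sum(
        count for key, count in unigram_counts.items() if key.lower() == target
    )

# Invented counts: three casings of the same word plus a different word.
unigram_counts = {"Tiger": 2, "tiger": 5, "TIGER": 1, "tigers": 4}

tiger_total = total_count(unigram_counts, "tiger")
print(tiger_total)
```

Note that "tigers" is left out of the total: ignoring case merges spelling variants of the same string, but it does not merge plurals or other word forms.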

Constellate data vs. JSTOR Data for Research (DfR)

While the content delivered to users is very similar, the format of the dataset differs.

The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivers datasets to end users in a ZIP file with the following structure:

  • metadata
  • ngram1
  • ngram2
  • ngram3

Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document, and the ngram1 directory contains one CSV file for each document, in which each row lists a word from the document in the first column and the number of times it occurred in the second column.

Beyond the new JSON format described above, this platform also differs in that it does no preemptive cleanup on the data it delivers, whereas DfR removes stopwords and lowercases all the words in the dataset. Our philosophy is that researchers know best how to clean up their own data. In addition, our focus is on learning and teaching, and most data in the world is wild, woolly, and dirty, so it is best if users learn to do their own cleanup.
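For readers who want results closer to what DfR delivers, the sketch below lowercases every word and drops stopwords from a dict of unigram counts. The stopword list is a tiny invented sample, not the list DfR uses, and the function name clean_counts is hypothetical.

```python
from collections import Counter

# A tiny illustrative stopword list; real projects usually use a longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}

def clean_counts(unigram_counts, stopwords=STOPWORDS):
    """Lowercase every key, merge the resulting duplicates, and drop stopwords."""
    cleaned = Counter()
    for word, count in unigram_counts.items():
        lowered = word.lower()
        if lowered not in stopwords:
            cleaned[lowered] += count
    return cleaned

# Invented raw counts with mixed casing and a few stopwords.
raw = {"The": 4, "the": 10, "Tiger": 1, "tiger": 2, "of": 6, "forest": 3}
cleaned = clean_counts(raw)
print(dict(cleaned))
```

Keeping the cleanup in a separate function leaves the raw counts untouched, so different stopword lists or casing rules can be tried on the same data.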

Constellate data vs. HathiTrust Extracted Features Format

They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level). We are excited by the possibility of including content from the HathiTrust Digital Library in the future.