Metadata and n-gram dataset files
The dataset builder creates three kinds of files:
- A CSV file containing only metadata
- A CSV file containing n-gram counts (unigrams, bigrams, or trigrams)
- A JSON Lines file containing metadata and the textual data
The textual data includes unigram, bigram, and trigram counts, plus full text when licensing allows.
The metadata may include:
| Field | Description |
|---|---|
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the document |
| subTitle | the subtitle for the document |
| docType | the type of document (for example, article or book) |
| publicationYear | the year of publication |
| provider | the source or provider of the dataset |
| collection | collection information as identified by the source |
| doi | the digital object identifier |
| datePublished | the publication date in yyyy-mm-dd format |
| url | a URL for the item and/or the item's metadata |
| creator | the author or authors of the item |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| pageCount | the number of print pages in the item |
| wordCount | the number of words in the item |
| pagination | the page sequence in the print version |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| publisher | the publisher for the item |
| placeOfPublication | the city of the publisher |
| abstract | the abstract description for the document |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| hasPartTitle | the title of sub-items |
| identifier | the set of identifiers connected with the document (DOI, ISSN, ISBN, OCLC, etc.) |
| tdmCategory | the inferred category of the content based on machine learning |
| sourceCategory | the category according to the provider |
| sequence | the article or chapter sequence |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
For more detail, see the current version of the schema.
All of the textual data and metadata is available inside of the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:
- The JSON Lines data is more complex to parse because it is nested. It cannot easily be represented in tabular form in a tool like Pandas or Excel.
- The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computational resources. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.
We are still refining the structure of the dataset file. We anticipate adding additional “features” (such as named entity recognition) in the future. Please reach out to us at firstname.lastname@example.org if you have comments or suggestions.
The CSV file is a comma-delimited, tabular structure that can easily be viewed in Excel or Pandas.
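As a quick illustration, the metadata CSV can be loaded straight into a Pandas DataFrame. (The rows below are a tiny, invented stand-in; in practice you would pass the path of the CSV you downloaded to `pd.read_csv`.)

```python
import io

import pandas as pd

# A miniature stand-in for a downloaded metadata CSV. The column names
# come from the metadata table above; the values are invented.
sample_csv = io.StringIO(
    "id,title,docType,publicationYear\n"
    "http://www.jstor.org/stable/example,An Example Article,article,1999\n"
)

df = pd.read_csv(sample_csv)
print(df[["title", "publicationYear"]])
```

With a real dataset, `pd.read_csv("your-dataset-metadata.csv")` gives you the same tabular view you would see in Excel.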
The JSON Lines file (file extension ".jsonl") is served in a compressed gzip format (.gz). The data for each document in the corpus is written on a single line. (If there are 1,245 documents in the corpus, the JSON Lines file will be 1,245 lines long.) Each line contains a set of key/value pairs that map a key concept to a matching value.
The basic structure looks like:
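A minimal, hand-written sketch of parsing one such line with Python's standard library follows. The field names come from the metadata table above; the values and n-gram counts are invented and abbreviated for illustration:

```python
import json

# One document = one line of the .jsonl file (invented, abbreviated example).
line = (
    '{"id": "http://www.jstor.org/stable/example", '
    '"title": "An Example Article", '
    '"docType": "article", '
    '"creator": ["A. Author"], '
    '"unigramCount": {"chicken": 5, "stock": 3}}'
)

document = json.loads(line)  # each line is a complete JSON object
print(document["title"])         # An Example Article
print(document["unigramCount"])  # {'chicken': 5, 'stock': 3}
```

With a real dataset file, you would iterate over the gzip-compressed file line by line (for example, with `gzip.open(path, "rt")`) and call `json.loads` on each line.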
Instead of attempting to decode the structure of a single large line, we can plug a single line into a JSON editor. The screenshot below was created using JSON Editor Online. The JSON editor reveals the file structure by breaking it down into a set of nested hierarchies, similar to XML. These can also be collapsed using arrows in a separate viewer pane within JSON Editor Online.
The editor makes it easier for human readers to discern a portion of the metadata for the text. In the data above, we can see:
- The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
- The author is "Theodore B. Leinwand" ("creator": ["Theodore B. Leinwand"])
- The text is a journal article ("docType": "article")
- The journal is Shakespeare Quarterly ("isPartOf": "Shakespeare Quarterly")
- Identifiers such as ISSN, OCLC, and DOI
- pageCount and wordCount
If you examine the rest of the file, you'll discover additional metadata such as the publication date, DOI, page numbers, ISSN, and more.
The most significant data for text analysis is usually the "unigramCount" section where the frequency of each word is recorded. In this context, the word "unigram" describes a single word construction like the word "chicken." There are also bigrams (e.g. "chicken stock"), trigrams ("homemade chicken stock"), and n-grams of any length. Depending on the licensing for the content, there may also be full-text available.
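Once a document's line has been parsed, the word frequencies can be ranked directly from the "unigramCount" mapping. A short sketch (the counts below are invented):

```python
from collections import Counter

# Invented unigram counts standing in for a parsed document's "unigramCount".
unigram_count = {"the": 120, "chicken": 7, "stock": 4, "homemade": 2}

# Counter accepts the mapping directly and ranks entries by count.
top_words = Counter(unigram_count).most_common(2)
print(top_words)  # [('the', 120), ('chicken', 7)]
```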
On each line, a key on the left is matched to a value representing its frequency on the right.
Constellate data vs. JSTOR Data for Research (DfR)
While the contents delivered to users are very similar, the format of the dataset differs.
The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivers datasets to end users in a ZIP file with the following structure:
Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document in the dataset, and the ngram1 directory contains one CSV file for each document in the dataset, where each row holds one of the words from the document in the first column and the number of times it occurred in the second column.
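A per-document n-gram CSV of that shape can be read into a word-to-count mapping with the standard library. (The rows here are invented stand-ins for one file from a DfR ngram directory.)

```python
import csv
import io

# Stand-in for one DfR-style n-gram CSV: word in the first column,
# occurrence count in the second. A real file would be opened from disk.
sample = io.StringIO("chicken,7\nstock,4\nhomemade,2\n")

counts = {word: int(n) for word, n in csv.reader(sample)}
print(counts)  # {'chicken': 7, 'stock': 4, 'homemade': 2}
```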
In addition to using the new JSON format described above, this platform does not do any preemptive cleanup on the data it delivers, whereas DfR removes stopwords and lowercases all the words in the dataset. Our philosophy is that researchers know best how to clean their own data. In addition, our focus is on learning and teaching, and most data in the world is wild, woolly, and dirty, so it is best if users learn to do their own cleanup.
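Because the data arrives uncleaned, a typical first step is your own lowercasing and stopword removal. A sketch of one approach (the stopword list and counts are illustrative only; real stopword lists are much longer):

```python
# Invented unigram counts as they might arrive: mixed case, stopwords included.
raw_counts = {"The": 12, "Chicken": 7, "chicken": 3, "of": 9, "stock": 4}

stopwords = {"the", "of", "and", "a"}  # toy list for illustration

cleaned = {}
for word, count in raw_counts.items():
    word = word.lower()        # normalize case
    if word in stopwords:      # drop stopwords
        continue
    cleaned[word] = cleaned.get(word, 0) + count  # merge case variants

print(cleaned)  # {'chicken': 10, 'stock': 4}
```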
Constellate data vs. HathiTrust Extracted Features Format
They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level). We are excited by the possibility of including content from the HathiTrust Digital Library in the future.