Constellate contains two kinds of content. Some content is 'open', and the full-text of open content is included in any JSONL datasets you download. Other content is 'rights-restricted', and Constellate cannot include its full-text in the JSONL datasets you download directly through the application. Read on to learn about all the ways you can get access to the content in Constellate, including the full-text of rights-restricted content.
Constellate provides a number of dataset options.
Default - When you build a dataset, Constellate builds two 1,500-item samples by default. Both are available in the Download window:
- sampled metadata (CSV) - a comma-separated value file with the bibliographic metadata for the 1,500-item sample
- sampled metadata, ngrams, (some) full-text (JSONL) - all 1,500 items in your sample in the Constellate JSONL format; items that are open will contain the full-text
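Because only open items carry full-text, it can help to check how many items in a downloaded JSONL sample actually include it. The following is a minimal sketch, not an official Constellate tool. It assumes each line of the file is one JSON object and that the full-text lives under a field named `fullText`; that field name is an assumption, so check it against a file you have downloaded.

```python
import json
from typing import Iterable, Tuple


def count_full_text(lines: Iterable[str], field: str = "fullText") -> Tuple[int, int]:
    """Return (total items, items with a non-empty full-text field).

    `field` is an assumed field name, not confirmed by this page.
    """
    total = 0
    with_text = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # one JSON object per line
        total += 1
        if record.get(field):  # missing or empty counts as no full-text
            with_text += 1
    return total, with_text


# Two made-up records: one open item with text, one without.
sample = [
    '{"id": "doc-1", "fullText": ["First page text."]}',
    '{"id": "doc-2"}',
]
print(count_full_text(sample))
```

To run it on a real download, pass an open file handle instead of the `sample` list, for example `count_full_text(open("my-sample.jsonl", encoding="utf-8"))`, where the file name is a placeholder.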
More Download Options
In addition to working with the 1,500-item samples, users may download several versions of their full dataset. Users who are not at a participating institution are limited to datasets of 25,000 items; users who are at a participating institution may work with up to 50,000 items. Everyone is welcome to download up to 10 datasets a day.
Please note: if your dataset is larger than your limit (25,000 or 50,000 items), we will down-sample it to the allowed size when you select any of the download options listed below.
These additional options are available through the More Download Options link in the Download window.
- metadata (CSV)
- unigrams (CSV) - a comma-separated value file that contains every unique word in each document and how many times it occurs in that document
- bigrams (CSV) - same as above, but for unique two-word phrases
- trigrams (CSV) - same as above, but for unique three-word phrases
- metadata, ngrams, and (some) full-text (JSONL) - all documents in the Constellate JSONL format; items that are open will contain the full-text
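The size limits in this section amount to a simple cap on the number of items in a full download. The sketch below only illustrates that arithmetic using the two limits stated above. The function and constant names are hypothetical, and how Constellate chooses which items to keep when down-sampling is not described here.

```python
# Limits taken from the text above; the names are illustrative only.
NON_PARTICIPATING_LIMIT = 25_000
PARTICIPATING_LIMIT = 50_000


def downloadable_size(dataset_size: int, participating: bool) -> int:
    """Return how many items a full download would contain after any down-sampling."""
    limit = PARTICIPATING_LIMIT if participating else NON_PARTICIPATING_LIMIT
    return min(dataset_size, limit)


# An 80,000-item dataset is capped at the limit for each kind of user.
print(downloadable_size(80_000, participating=False))
print(downloadable_size(80_000, participating=True))
# A dataset under the limit is not down-sampled.
print(downloadable_size(12_000, participating=False))
```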
Custom Request Options
We have two additional options that require mediation by Constellate staff:
- Subset of sentences from your dataset in CSV - Build a dataset in Constellate, then give the Constellate team the dataset ID and a term or regular expression to match. We build and deliver a dataset of the sentences within your dataset that contain that term or match that regular expression. We can include sentences from any document from any data source so long as the document is longer than 10 sentences and the sentences that match make up 10% or less of the total sentences in the document. (For example, you could build a dataset of performing arts content and we would provide all the sentences from that dataset that contain the word “experimental”.)
- Datasets in JSON of full-text for JSTOR rights-restricted content - Please fill out the Data for Research request form.
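The two conditions on the sentence subset option (a document longer than 10 sentences, and matches making up 10% or less of its sentences) can be sketched as a small check. This is an illustration of the stated rule only, not the Constellate team's implementation. It assumes each document has already been split into sentences and that matching is case-sensitive; how Constellate actually splits and matches is not specified here. The function name is hypothetical.

```python
import re
from typing import List


def eligible_matches(sentences: List[str], pattern: str) -> List[str]:
    """Return the matching sentences if the document meets both conditions, else []."""
    total = len(sentences)
    if total <= 10:
        return []  # the document must be longer than 10 sentences
    regex = re.compile(pattern)
    matches = [s for s in sentences if regex.search(s)]
    if len(matches) * 10 > total:
        return []  # matches must be 10% or less of all sentences
    return matches


# A made-up 20-sentence document with one sentence containing "experimental".
doc = ["An experimental piece opened the season."] + ["Filler sentence."] * 19
print(eligible_matches(doc, r"experimental"))
```

With 20 sentences, up to 2 matches (10%) would pass the check; a third match, or a document of 10 sentences or fewer, would return an empty list.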
If you have thoughts on other useful datasets we could provide or if you need access to the full-text of Portico content, please let us know.