The Constellate tutorial notebooks include examples that use a dataset identifier to retrieve data from Constellate's backend systems and make it available in the notebook environment for analysis. This is done through a client library that we develop and distribute; in a notebook it is loaded with import constellate. The client is intentionally kept very simple, since we want the focus to be on learning text analysis rather than on how to query our systems. The example notebooks should give you a clear idea of the client's methods, but here is a comprehensive list of options:

Client details

Common parameters

There are three common parameters used:

  • dataset_id: a string representing the identifier for the dataset. This can be found on the dataset dashboard.
  • fname: an optional string representing a filename to assign to a download.
  • force: an optional boolean indicating whether a file should be re-downloaded.

Methods

  • get_metadata(dataset_id, fname=None, force=False) - downloads the metadata for a dataset (sampled to 1500 documents) in CSV format.
  • get_dataset(dataset_id, fname=None, force=False) - downloads the dataset (sampled to 1500 documents) in the Constellate Document Format (jsonl).
  • download(dataset_id, download_type, fname=None, force=False) - downloads additional dataset formats. See the list of download formats below. The requested download must be created in the Constellate application before it can be downloaded with the client. An error is raised if the file does not exist.
  • dataset_reader(file_path) - a helper method that reads gzipped Constellate Document Format (jsonl) files and loads each document as a Python dictionary. file_path is a string containing the path to the dataset file you would like to read, e.g. ~/data/mydataset.jsonl.gz.
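
As a rough illustration of how these methods fit together, the sketch below downloads the sampled dataset and reads the first few documents. It rests on assumptions that are not stated above: that get_dataset returns the local path of the downloaded file, and that documents carry a "title" key. The helper name preview_titles and its client parameter are hypothetical and exist only for this example.

```python
import itertools


def preview_titles(dataset_id, limit=5, client=None):
    """Return the titles of the first `limit` documents in a dataset.

    `client` is a hypothetical hook for passing in a stand-in object;
    by default the real Constellate client module is imported.
    """
    if client is None:
        # Requires the constellate-client package to be installed.
        import constellate as client

    # Assumption: get_dataset returns the path of the downloaded file.
    dataset_file = client.get_dataset(dataset_id)

    titles = []
    # dataset_reader loads each document as a Python dictionary.
    for document in itertools.islice(client.dataset_reader(dataset_file), limit):
        # Assumption: documents have a "title" key; .get() avoids a
        # KeyError if a document does not.
        titles.append(document.get("title"))
    return titles
```

Calling preview_titles with the identifier shown on your dataset dashboard would then return a short list of titles, provided the assumptions above hold for your version of the client.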

Dataset download formats

This list is current as of June 2021, but more options are planned.

  • metadata - the non-sampled metadata.
  • jsonl - the non-sampled full data for the dataset.
  • unigrams - unigrams for the dataset. The ngram csv files will contain three columns: id, ngram, count. The count will be the number of times the ngram appears in the document.
  • bigrams - dataset bigrams.
  • trigrams - dataset trigrams.
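
To make the ngram file layout concrete, here is a minimal sketch that sums per-document counts into totals for the whole dataset. It assumes the CSV file starts with a header row naming the three columns id, ngram and count; the sample rows and the function name total_ngram_counts are made up for illustration.

```python
import csv
import io
from collections import Counter


def total_ngram_counts(csv_file):
    """Sum the per-document counts of each ngram across all documents."""
    totals = Counter()
    # Assumption: the first row is a header with id, ngram, count.
    for row in csv.DictReader(csv_file):
        totals[row["ngram"]] += int(row["count"])
    return totals


# Made-up sample data in the three-column layout described above.
sample = io.StringIO(
    "id,ngram,count\n"
    "doc-1,whale,3\n"
    "doc-1,sea,1\n"
    "doc-2,whale,2\n"
)
totals = total_ngram_counts(sample)
print(totals.most_common(1))  # [('whale', 5)]
```

For a real download, the same function could be given a file opened with open(path, newline=""), assuming the file is an uncompressed CSV.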

Using the client on a local machine or another environment

You may want to use the Constellate client outside of the hosted environment. The client is on PyPI, and you can install it locally using the pip package manager:

pip install constellate-client

Please keep in mind that the functionality may change over time. If you have built an example or application with the client, please email us at tdm@ithaka.org so that we can learn from what you have done and alert you to any future breaking changes.