Having your own data in the Constellate format can be very helpful, since it allows you to use the existing notebooks on your own data! Data cleanup and tailoring can be complex, but we offer a couple of notebooks that we hope are useful to researchers.
- Tokenizing Text Files (Produces output more compatible for combining with other Constellate datasets)
- Tokenize Text Files with NLTK (A more complex tokenization using the Natural Language Toolkit; see the sketch after this list)
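As a minimal sketch of the kind of tokenization the NLTK notebook performs, the snippet below reads one plain text file, lowercases it, and splits it into word tokens. This is an illustration under assumptions, not the notebook's exact code; the notebook itself also handles cleanup, n-grams, and metadata. The filename `example.txt` is a placeholder.

```python
# Minimal NLTK tokenization sketch: read a plain text file, lowercase it,
# and split it into word tokens.
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models; newer NLTK releases may also need "punkt_tab"

with open("example.txt", encoding="utf-8") as f:
    text = f.read()

tokens = [t.lower() for t in word_tokenize(text)]
print(tokens[:10])  # first ten tokens
```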
These notebooks may require some modifications to work with your data, so we offer them as a starting point for those interested. If you're not comfortable writing Python at an intermediate or advanced level, you may need some help implementing these notebooks.
You'll need:
- A set of plain text files
- A CSV file of relevant metadata (optional, as sketched below)
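As a rough sketch of how these inputs might be read, the snippet below loads a folder of plain text files and an optional metadata CSV. The folder name `data/`, the file `metadata.csv`, and the `id` column are illustrative assumptions, not requirements of the notebooks.

```python
# Load plain text files keyed by filename, plus optional metadata rows.
import csv
from pathlib import Path

text_dir = Path("data")  # assumed folder of .txt files
texts = {p.stem: p.read_text(encoding="utf-8") for p in text_dir.glob("*.txt")}

metadata = {}
metadata_path = Path("metadata.csv")  # assumed metadata file; it is optional
if metadata_path.exists():
    with metadata_path.open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            metadata[row["id"]] = row  # assumes an "id" column matching the filenames
```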
These notebooks will output:
- A JSON-L dataset containing the unigrams, bigrams, trigrams, full text, and metadata for every document in your dataset
- A gzip-compressed version of the JSON-L file for easier file transfer (sketched below)
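To make the output concrete, here is a sketch of what a single JSON-L record might contain and how the gzip copy can be produced. The field names below mirror the description above (unigrams, bigrams, trigrams, full text, metadata) but are illustrative assumptions, not a formal statement of the Constellate schema.

```python
# Build one example record with n-gram counts, write it as a JSON-L line,
# then gzip-compress the file for transfer.
import gzip
import json
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-grams as space-joined strings, e.g. 'quick brown'."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = ["the", "quick", "brown", "fox"]
record = {
    "id": "example",  # assumed identifier field
    "fullText": "the quick brown fox",
    "unigramCount": dict(ngram_counts(tokens, 1)),
    "bigramCount": dict(ngram_counts(tokens, 2)),
    "trigramCount": dict(ngram_counts(tokens, 3)),
    "metadata": {},  # merged from the optional CSV, if provided
}

with open("my_dataset.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")  # one JSON object per line

# Compress the JSON-L file with gzip for easier transfer.
with open("my_dataset.jsonl", "rb") as src, gzip.open("my_dataset.jsonl.gz", "wb") as dst:
    dst.writelines(src)
```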
We recommend that researchers run these notebooks on their local machines. If you prefer to do this analysis remotely, keep in mind that any data created in your Jupyter session will disappear if the notebook sits idle for a significant period of time, so download your results promptly.