This introduction explains the main kinds of text analysis for a humanities audience. What are they? Why would you use them? How long will it take to apply them? (The methods presented here are among the best known, but the list is certainly not exhaustive.) Afterward, you'll be better prepared to decide how useful text analysis may be for your research. As you read about these methods, keep in mind the current, intractable problems facing your field. Could you use one of these methods to address them?
There are five main questions that text analysis can help answer:

  1. What are these texts about?
  2. How are these texts connected?
  3. What emotions (or affects) are found within these texts?
  4. What names are used in these texts?
  5. Which of these texts are most similar?

Question 1: What are these texts about?

  • Word Frequency (Beginner)
    Counting how often a word occurs in a given text. This includes the Bag of Words approach, and raw counts are also the starting point for TF-IDF. Example: "Which of these texts focus on women?"
  • Collocation (Beginner)
    Examining where words occur close to one another. Example: "Where are women mentioned in relation to home ownership?"
  • Topic Analysis (or Topic Modeling) (Intermediate)
    Discovering the topics within a group of texts. Example: "What are the most frequent topics discussed in this newspaper?"
  • TF-IDF (Term Frequency-Inverse Document Frequency) (Intermediate)
    Finding the words that are distinctive of one text when compared with the rest of a collection. Example: "What language is most significant within 1970s political speech?"
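
To make the simpler methods above more concrete, here is a minimal sketch of word frequency, collocation, and TF-IDF using only the Python standard library. The function names, the tokenizer, and the three-sentence corpus are illustrative assumptions; a real project would more likely use an established library and a far larger collection.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency(text):
    """Count how often each word occurs in one text (a bag of words)."""
    return Counter(tokenize(text))


def collocates(text, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts


def tf_idf(texts):
    """Score each word in each text: term frequency times log inverse document frequency."""
    bags = [word_frequency(t) for t in texts]
    n_docs = len(bags)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in bag.items()
        })
    return scores


# Hypothetical three-document corpus, invented for illustration.
corpus = [
    "Women owned the house and women managed the farm.",
    "The farm was sold and the house stood empty.",
    "The council debated the new road.",
]

print(word_frequency(corpus[0]).most_common(3))
print(collocates(corpus[0], "women"))
print(sorted(tf_idf(corpus)[0].items(), key=lambda item: item[1], reverse=True)[:3])
```

In this sketch a word that appears in every document, such as "the", receives a TF-IDF score of zero, which is why TF-IDF surfaces distinctive vocabulary rather than merely frequent vocabulary.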

Question 2: How are these texts connected?

  • Concordance (Beginner)
    Where is this word or phrase used in these documents? Example: "Which journal articles mention Maya Angelou's phrase, 'If you're for the right thing, then you do it without thinking'?"
  • Network Analysis (Advanced)
    How are the authors of these texts connected? Example: "What local communities formed around civil rights in 1963?"
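
A concordance is simple enough to sketch in a few lines. The example below is one possible approach, assuming a case-insensitive exact-phrase search and a fixed number of context characters on each side; the function name and the sample sentence are invented for illustration.

```python
import re


def concordance(text, phrase, width=30):
    """Return each occurrence of `phrase` with `width` characters of context on both sides."""
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Bracket the hit so it stands out from its context.
        lines.append(f"{left}[{match.group(0)}]{right}")
    return lines


# Hypothetical sample text, invented for illustration.
sample = "The right thing is rarely easy. Still, do the right thing."
for line in concordance(sample, "right thing", width=10):
    print(line)
```

Running the same function over every document in a collection, and recording which document each hit came from, would answer a question like the journal-article example above.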

Question 3: What emotions (or affects) are found within these texts?

  • Sentiment Analysis (Intermediate)
    Does the author use positive or negative language? Example: "How do presidents describe gun control?"
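
One common entry point to sentiment analysis is a lexicon: a list of words tagged as positive or negative. The sketch below assumes that approach and uses two tiny, hypothetical word lists; it ignores negation, irony, and context, all of which matter in practice.

```python
import re

# Hypothetical, deliberately tiny word lists; real lexicons hold thousands of entries.
POSITIVE = {"good", "safe", "hope", "protect", "freedom"}
NEGATIVE = {"bad", "danger", "fear", "violence", "tragedy"}


def sentiment_score(text):
    """Return (positive - negative) / matched words, from -1.0 to 1.0; 0.0 if nothing matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    if pos + neg == 0:
        return 0.0
    return (pos - neg) / (pos + neg)


print(sentiment_score("We hope to protect our freedom."))
print(sentiment_score("Fear and violence followed the tragedy."))
```

A score near 1.0 suggests mostly positive vocabulary and a score near -1.0 mostly negative vocabulary, but only with respect to whichever word lists were chosen.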

Question 4: What names are used in these texts?

  • Named Entity Recognition (Intermediate)
    Listing every instance of a given kind of entity (such as people, places, or organizations) in these texts. Example: "What are all of the geographic locations mentioned by Tolstoy?"
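
Named entity recognition is usually done with a trained statistical model. As a much simpler stand-in, the sketch below uses a gazetteer, that is, a fixed list of known names to look up. The place list and the function name are hypothetical, and unlike a trained model this approach cannot find any name that is missing from the list.

```python
import re

# Hypothetical gazetteer; a real project would load a much larger list of place names.
PLACES = ["Moscow", "St. Petersburg", "Borodino", "Smolensk"]


def find_places(text, gazetteer=PLACES):
    """Return (place, character offset) for every gazetteer entry found in the text."""
    # Try longer names first so a long name wins over a shorter one it contains.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(n) for n in names) + r")(?!\w)")
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


# Hypothetical sentence, invented for illustration.
print(find_places("The army left Moscow for Borodino, then Moscow again."))
```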

Question 5: Which of these texts are most similar?

  • Authorship Attribution (Advanced)
    Find the author of an anonymous document. Example: "Who wrote The Federalist Papers?"
  • Clustering (Advanced)
    Which texts are the most similar? Example: "Is this play closer to comedy or tragedy?"
  • Supervised Machine Learning (Advanced)
    Are there other texts similar to this? Example: "Are there other Jim Crow laws like these we have already identified?"
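
All three methods above depend on some way of measuring how similar two texts are. One widely used measure is cosine similarity over word counts, sketched below with the standard library only. The function names and the sample sentences are illustrative assumptions, and real clustering or attribution work would build considerably more on top of this.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count the lowercase word tokens in a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Return the cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the candidate text with the highest cosine similarity to `query`."""
    return max(candidates, key=lambda text: cosine_similarity(query, text))


# Hypothetical one-line "plays", invented for illustration.
comedy = "a wedding a disguise and a happy reunion"
tragedy = "a murder a ghost and a bitter revenge"
unknown = "a disguise a quarrel and a wedding"

print(cosine_similarity(unknown, comedy))
print(cosine_similarity(unknown, tragedy))
print(most_similar(unknown, [comedy, tragedy]))
```

Because very common words such as "a" and "and" inflate every score, this kind of comparison is often run on TF-IDF weights rather than raw counts.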

Get started!