This introduction explains the main kinds of text analysis methods for a business and data science audience. What are they? Why would you use them? How long will it take to apply them? (The methods presented here are among the best known, but the list is certainly not exhaustive.) Afterward, you'll be better prepared to decide how much or how little text analysis may be useful for your work.
There are five main questions that text analysis can help answer:

  1. What are these texts about?
  2. How are these texts connected?
  3. What emotions (or affects) are found within these texts?
  4. What names are used in these texts?
  5. Which of these texts are most similar?

Question 1: What are these texts about?

  • Word Frequency (Beginner)
    Counting how often a word occurs in a given text. This includes Bag of Words and, once the counts are weighted, TF-IDF. Example: "What words are most common in customer support tickets?"
  • Collocation (Beginner)
    Examining where words occur close to one another. Example: "When people mention our premium product, what do they say about the packaging?"
  • Topic Analysis (or Topic Modeling) (Intermediate)
    Discovering the topics within a group of texts. Example: "What are the most frequent topics discussed in five years of email from our advertising department?"
  • TF-IDF (Intermediate)
    Finding the words that are most distinctive of a text compared with the rest of a collection. Example: "Given a decade of board reports, are there seasonal issues that crop up in summer vs. winter?"
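
To make the first and last methods in this list concrete, here is a minimal sketch that uses only the Python standard library. The function names and the sample tickets are invented for illustration, and the TF-IDF formula shown (term count divided by document length, times the log of the inverse document frequency) is only one common variant among several.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(text):
    """Count how often each word occurs in one text."""
    return Counter(tokenize(text))


def tf_idf(documents):
    """Return one {word: score} dict per document.

    tf  = occurrences of the word / number of tokens in the document
    idf = log(number of documents / documents containing the word)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical support tickets, invented for this example.
tickets = [
    "The app crashes when I open the app settings",
    "I cannot reset my password",
    "The password reset email never arrives",
]
frequencies = word_frequencies(tickets[0])
scores = tf_idf(tickets)
print(frequencies.most_common(2))
print(max(scores[0], key=scores[0].get))
```

In the first ticket, "the" and "app" both occur twice, but only "app" gets a high TF-IDF score, because "the" also appears in another ticket. That difference is why raw counts are usually filtered or weighted before they are interpreted.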

Question 2: How are these texts connected?

  • Concordance (Beginner)
    Where is this word or phrase used in these documents? Example: "Show me every email where someone mentions our least visible product."
  • Network Analysis (Advanced)
    How are the authors of these texts connected? Example: "Given email data, how often does marketing connect with engineering?"
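
Both methods can be sketched in a few lines of standard-library Python. The email records, team names, and product name below are hypothetical, and the "network" here is reduced to counting messages between pairs of teams, which is only the first step of a real network analysis.

```python
import re
from collections import Counter


def concordance(documents, phrase, width=30):
    """Return (document index, snippet) for every occurrence of phrase.

    Matching is case-insensitive and works on substrings, so a short
    phrase can also match inside longer words.
    """
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for index, text in enumerate(documents):
        for match in pattern.finditer(text):
            start = max(0, match.start() - width)
            end = min(len(text), match.end() + width)
            hits.append((index, text[start:end]))
    return hits


def connection_counts(messages):
    """Count messages for each (sender, recipient) pair."""
    return Counter((m["from"], m["to"]) for m in messages)


# Hypothetical email data, invented for this example.
emails = [
    {"from": "marketing", "to": "engineering",
     "body": "Any update on the Widget Mini launch?"},
    {"from": "engineering", "to": "marketing",
     "body": "Widget Mini ships next week."},
    {"from": "marketing", "to": "sales",
     "body": "Please review the spring campaign."},
    {"from": "marketing", "to": "engineering",
     "body": "Thanks, we will announce it Monday."},
]
hits = concordance([e["body"] for e in emails], "widget mini")
edges = connection_counts(emails)
for index, snippet in hits:
    print(index, snippet)
print(edges[("marketing", "engineering")])
```

The pair counts can later be loaded into a graph library to measure who is central or isolated; that step is where the "Advanced" rating comes from.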

Question 3: What emotions (or affects) are found within these texts?

  • Sentiment Analysis (Intermediate)
    Does the author use positive or negative language? Example: "How do our customers feel about our new product line?"
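
The simplest form of sentiment analysis counts words from a positive list and a negative list. The sketch below assumes a tiny, made-up lexicon purely for illustration; real projects rely on much larger curated lexicons or trained models.

```python
import re

# Toy lexicon, invented for this example. It is far too small for real use.
POSITIVE = {"love", "great", "excellent", "happy", "good"}
NEGATIVE = {"hate", "terrible", "poor", "broken", "bad"}


def sentiment_score(text):
    """Return (positive word count) minus (negative word count)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return positive - negative


def sentiment_label(text):
    """Map the score to 'positive', 'negative', or 'neutral'."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(sentiment_label("I love the new design, it is great"))
print(sentiment_label("The strap arrived broken and support was terrible"))
print(sentiment_label("The box arrived on Tuesday"))
```

Word counting like this ignores negation and sarcasm: "not good" still scores as positive. That limitation is a large part of why the method is rated Intermediate rather than Beginner.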

Question 4: What names are used in these texts?

  • Named Entity Recognition (Intermediate)
    List every example of a kind of entity from these texts. Example: "What are all of the geographic locations mentioned by our users?"
  • Removing Sensitive Information (Intermediate)
    Remove sensitive or personally identifiable information (PII) from data for archiving. Examples: "Since our founder is retiring, we want to preserve his business emails." "We want to save user data without linking it to their identities."
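
Some sensitive information follows a predictable pattern and can be masked with regular expressions. The sketch below is a simplified stand-in rather than named entity recognition: the two patterns are assumptions that cover only common email addresses and one phone number format, and they will miss many real-world variants.

```python
import re

# Simplified patterns, chosen for illustration only.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(text):
    """Replace email addresses and phone numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


# Hypothetical message, invented for this example.
sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
print(redact(sample))  # Contact Jane at [EMAIL] or [PHONE].
```

The name "Jane" survives redaction because names follow no fixed pattern. Finding them is the job of a named entity recognition model, which is why the two methods are grouped under the same question.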

Question 5: Which of these texts are most similar?

  • Clustering (Advanced)
    Which texts are the most similar? Example: "How does our help documentation compare with that of our competitors?"
  • Supervised Machine Learning (Advanced)
    Are there other texts similar to this? Examples: "Given these examples of accessible content, can we identify where our content is not accessible?" "Given user search data, can we predict user search terms?"
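
Both methods depend on a way to measure how similar two texts are. One common measure is cosine similarity between word-count vectors, sketched below with the standard library. This is a building block rather than a clustering or machine learning algorithm in itself, and the function names and sample help articles are invented for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Represent a text as a count of its lowercase word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query, candidates):
    """Return the candidate text with the highest similarity to query."""
    query_bag = bag_of_words(query)
    return max(
        candidates,
        key=lambda text: cosine_similarity(query_bag, bag_of_words(text)),
    )


# Hypothetical help articles, invented for this example.
articles = [
    "How to reset your password",
    "Reset a forgotten password by email",
    "Shipping times for international orders",
]
best = most_similar("I forgot my password and need to reset it", articles)
print(best)
```

A clustering algorithm groups texts so that members of a group score high against one another on a measure like this, while a supervised model learns from labeled examples which patterns in the word counts matter.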

Get started!