Topic models are easy to train, but do they generate useful topics? In this post, we discuss several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach: it is a statistical technique that re-expresses highly correlated multivariate data as uncorrelated components, each of which captures an independent piece of the information in the larger data set.

To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19. We used hashtag pooling to generate topics for the Twitter corpus. We use Python to calculate diagnostic measures from Mallet topic-term frequency output files.
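For reference, the sketch below shows one way to read Mallet's topic-term frequency output into a topic-by-vocabulary count matrix that the rest of the analysis can work from. It assumes the line format produced by Mallet's --word-topic-counts-file option (a word index, the word itself, and a series of topic:count pairs); the file name and function are our own illustrations.

```python
import numpy as np

def load_word_topic_counts(path, num_topics=50):
    """Parse a Mallet --word-topic-counts-file into a
    (num_topics x vocabulary) matrix of token counts.
    Each line: <word-id> <word> <topic>:<count> <topic>:<count> ..."""
    vocab, rows = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            vocab.append(parts[1])
            counts = np.zeros(num_topics)
            for pair in parts[2:]:
                topic, count = pair.split(":")
                counts[int(topic)] = float(count)
            rows.append(counts)
    # counts[k, w] = number of tokens of word w assigned to topic k
    return np.array(rows).T, vocab

counts, vocab = load_word_topic_counts("covid_word_topic_counts.txt")
```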

Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) term frequency, and 2) term specificity. Furthermore, on average, we found that topics with common, specific terms score significantly better on coherence than topics with uncommon, unspecific terms. However, we also found several cases of poor topics that scored relatively high on coherence. In other words, our results suggest topics that use common, specific terms should be easier to interpret, but interpretability doesn’t imply a topic is composed of terms that are specific or central to a corpus.

Evaluation Metrics

The table below provides a description of the metrics used in this analysis. The first 3 metrics are calculated using all terms assigned to a given topic. The last 3 metrics are calculated using the top-n terms of a topic.

1. Token Count: Measures the number of word tokens assigned to a topic. Comparing this number to the sum of the token counts for all topics gives the proportion of the corpus assigned to the topic. Interesting topics typically are not at the extremes of this range: a high token count indicates a topic that appears frequently in the corpus and may be too general, while a low token count may make a topic unreliable because there is too little data to estimate an effective word distribution.

2. Uniform Distance: Measures the distance from a topic’s distribution over words to a uniform distribution. This distance is often interpreted as the amount of information lost when a uniform distribution is used instead of the topic distribution. Larger values indicate more specificity.

3. Corpus Distance: Measures the distance from a topic’s distribution over words to the overall distribution of words in the corpus. Larger values indicate a topic is distinct; smaller values indicate a topic is similar to the corpus.

4. Effective Number of Words: Measures how concentrated a topic’s distribution over words is. We calculate it as the inverse of the sum of the squared probabilities of the words in the topic. Smaller values indicate the probability mass is concentrated on fewer words, i.e., more specificity.

5. Exclusivity: Measures the share of top words that are distinct to a given topic. We calculate exclusivity as the average, over each top word, of the probability of that word in the topic divided by the sum of the probabilities of that word across all topics. Smaller values indicate more general topics.

6. Coherence (Interpretability): Measures whether the words in a topic tend to co-occur in documents. Large negative values indicate a topic’s top-n terms do not co-occur often; values close to zero indicate they co-occur often.
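To make the definitions concrete, here is a minimal sketch of how the first five metrics can be computed from the count matrix loaded earlier. The smoothing constant beta is an assumption (0.01 is Mallet's default topic-word prior), the distance metrics are sketched as KL divergences, and coherence is omitted because it requires per-document co-occurrence counts that the topic-term file does not contain; Mallet's own implementation may differ in these details.

```python
import numpy as np

def diagnostics(counts, beta=0.01, top_n=20):
    """Per-topic diagnostics from a (topics x vocab) count matrix."""
    num_topics, vocab_size = counts.shape

    # Smoothed topic-word and corpus-word probability distributions.
    smoothed = counts + beta
    p = smoothed / smoothed.sum(axis=1, keepdims=True)
    corpus = counts.sum(axis=0) + beta
    q = corpus / corpus.sum()

    token_count = counts.sum(axis=1)
    uniform_dist = (p * np.log(p * vocab_size)).sum(axis=1)  # KL(p || uniform)
    corpus_dist = (p * np.log(p / q)).sum(axis=1)            # KL(p || corpus)
    eff_num_words = 1.0 / (p ** 2).sum(axis=1)               # inverse Simpson index

    # Exclusivity, averaged over each topic's top-n words.
    top = np.argsort(-p, axis=1)[:, :top_n]
    exclusivity = np.array([
        (p[k, top[k]] / p[:, top[k]].sum(axis=0)).mean()
        for k in range(num_topics)
    ])
    return token_count, uniform_dist, corpus_dist, eff_num_words, exclusivity
```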

Data Standardization

First, we compare the evaluation metrics associated with each of the data sets. The density plots below show clear differences between the diagnostic measures for Facebook and Twitter topics. For example, Facebook topics have better coherence scores, which implies the top terms of each topic co-occur more often in Facebook posts. Likewise, Facebook topics have higher token counts, which indicates the top terms of each topic occur more frequently in the Facebook corpus.

Prior to performing PCA, we must standardize the data so the metrics have a common scale: each measure is rescaled to a mean of zero and a standard deviation of one. The density plots below show the distributions of the standardized data.

[Density plots of the evaluation metrics: Raw Data and Standardized Data]
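In code, the standardization is a straightforward z-score transformation. The sketch below assumes the diagnostics have been collected into a hypothetical pandas DataFrame df with one row per topic and one column per metric, pooled across both corpora.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# df: one row per topic, one column per metric (hypothetical layout
# built from the diagnostics computed earlier).
z = pd.DataFrame(StandardScaler().fit_transform(df),
                 columns=df.columns, index=df.index)
print(z.mean().round(2), z.std(ddof=0).round(2))  # means ~0, std devs ~1
```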

Correlation Analysis

Correlation analysis of the evaluation metrics shows that we are dealing with highly correlated multivariate data.

  • Topics with common terms (i.e., high token counts) tend to be similar to the corpus (-0.776 correlation with corpus distance).

  • Topics with specific terms (i.e., high uniform distance) tend to concentrate their probability mass on fewer words (-0.726 correlation with effective number of words).

  • Topics with distinct terms (i.e., high corpus distance) tend to have more exclusive top terms (-0.737 correlation with exclusivity).

[Pairwise plots and correlation matrix of the evaluation metrics]
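Both views can be reproduced directly from the metrics; a brief sketch, reusing the standardized DataFrame z from the previous step:

```python
import seaborn as sns

corr = z.corr()    # Pearson correlation matrix of the six metrics
print(corr.round(3))
sns.pairplot(z)    # pairwise scatter plots
```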

Eigen Analysis

A scree plot of the eigenvalues of the correlation matrix suggests we should retain two principal components (PCs). The general rule of thumb is to keep PCs that are “one less than the elbow” of the scree plot or PCs with an eigenvalue of 1 or greater.
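One way to perform the eigen analysis and draw the scree plot, continuing from the correlation matrix computed above:

```python
import numpy as np
import matplotlib.pyplot as plt

# Eigendecomposition of the symmetric correlation matrix.
eigvals, eigvecs = np.linalg.eigh(corr.values)
order = np.argsort(eigvals)[::-1]                # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.axhline(1.0, linestyle="--")                 # eigenvalue-1 retention rule
plt.xlabel("Principal component")
plt.ylabel("Eigenvalue")
plt.show()
```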

Loading Analysis

The loading matrix below shows token count, corpus distance, and exclusivity are weighted heavily in the 1st PC, which explains 44.6% of the variance based on the eigenanalysis (2.679/6 = 44.6%). Uniform distance and the effective number of words are weighted heavily in the 2nd PC, which explains 31.5% of the variance. Coherence contributes most to the 3rd PC, which explains 11.4% of the variance.

Based on the loading plot below, the 1st PC appears to capture term frequency: token count is positioned on the far right, implying that a topic’s terms appear often in the corpus. The 2nd PC appears to capture term specificity: uniform distance is located at the top, implying that the topic’s word distribution captures more information than a uniform distribution.
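The loadings themselves can be recovered by scaling each eigenvector by the square root of its eigenvalue (one common convention, which makes each entry the correlation between a metric and a component):

```python
import numpy as np
import pandas as pd

loadings = pd.DataFrame(eigvecs * np.sqrt(eigvals),
                        index=corr.columns,
                        columns=[f"PC{i + 1}" for i in range(len(eigvals))])
print(loadings.round(3))                    # weight of each metric per PC
print((eigvals / eigvals.sum()).round(3))   # proportion of variance per PC
```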

Score Plots

Examining score plots of the PC values associated with each topic helps validate our interpretation of the PCs. The interactive plots below allow a more subjective evaluation of the data. Sizing the points in the score plot by token count supports our interpretation that the 1st PC captures term frequency. Likewise, sizing the points by the effective number of words supports our interpretation that the 2nd PC captures term specificity.
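The scores are simply the projections of the standardized metrics onto the retained components; the size mapping below (raw token count, using the hypothetical column names from the metrics DataFrame) is one example:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Project each topic onto the first two components.
scores = pd.DataFrame(z.values @ eigvecs[:, :2],
                      columns=["PC1", "PC2"], index=z.index)

# "token_count" is the hypothetical column name used earlier.
sizes = 200 * df["token_count"] / df["token_count"].max()
plt.scatter(scores["PC1"], scores["PC2"], s=sizes, alpha=0.6)
plt.xlabel("PC1 (term frequency)")
plt.ylabel("PC2 (term specificity)")
plt.show()
```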

The score plot with points sized by coherence scores is less clear. It looks like the more coherent topics (i.e., the points with a smaller diameter) are concentrated more heavily in the top right quadrant and the less coherent topics are concentrated in the bottom left. However, there are many relatively small points scattered throughout all four quadrants. Given that coherence scores are based on top terms co-occurring in documents, rather than on how frequent or specific those terms are across the corpus, this scattered pattern makes sense.

For example, the Twitter topic in the top left includes the following top terms: economy, people, urge, coronavirus, million, debt, package, needed, student, and stimulate. This topic is very coherent. Phrases like “stimulate economy” or “student debt” are common word pairings. Hence, it’s very plausible that posts about economic stimulus related to student debt could have generated this topic. Likewise, it’s reasonable to think this topic would be relatively less central to a COVID-19 discussion focused on health risks, disease prevention, and government restrictions.

In contrast, the Facebook topic in the bottom right includes the following top terms: people, government, crisis, world, pandemic, time, coronavirus, political, country, and public. This topic is not very specific, but the top terms seem like words that would co-occur frequently. This topic also seems like it would be central to the COVID-19 discussion.

[Score plots with points sized by each metric: Token Count, Corpus Distance, Exclusivity, Uniform Distance, Effective Number of Words, Coherence]

Box Plots by Quadrant

Plotting the normalized coherence scores by each quadrant of the score plot shows that coherence scores are significantly better for topics that use common, specific terms. However, there is also a large overlap between Twitter topics that use uncommon, unspecific terms and Twitter topics that use common, specific terms. Stated simply, coherence looks like a good measure of topic quality most of the time, but not always.
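A sketch of how these box plots can be produced, assuming the scores and z objects from the previous steps (the quadrant labels follow our interpretation of the PCs, and "coherence" is the hypothetical column name used earlier):

```python
import numpy as np
import matplotlib.pyplot as plt

# Label each topic by its quadrant in the score plot.
quadrant = np.where(scores["PC1"] >= 0,
                    np.where(scores["PC2"] >= 0,
                             "common, specific", "common, unspecific"),
                    np.where(scores["PC2"] >= 0,
                             "uncommon, specific", "uncommon, unspecific"))

z.assign(quadrant=quadrant).boxplot(column="coherence", by="quadrant")
plt.ylabel("Standardized coherence")
plt.show()
```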

Conclusion

So, what is the best metric to evaluate topic quality? It depends.

If the goal is to find topics that are most representative of a corpus of documents, we believe the combination of high token count and high uniform distance will identify relatively coherent topics. More importantly, we don’t think using coherence alone is prudent. Coherence doesn’t imply a topic is central to a discussion, and it doesn’t imply a topic has a specific focus.

In contrast, if the goal is to quickly surface unique insights that may not be readily apparent, even after reading many documents, then low token count and high uniform distance may be better suited. The downside is that these topics may require more contextual cues and effort to understand how the top terms are related.