Topic Modeling

Late in 2010, a client asked if we could help them with what turned out to be a topic modeling problem. They had already investigated, and rejected, a product that performed topic modeling because it did not give them the degree of control they felt they needed.

Latent Dirichlet Allocation (LDA), perhaps the most commonly employed topic model, is a statistical model for discovering the topics that occur in a collection of documents. For this application, it suffers from two perceived drawbacks: first, the algorithm requires a priori specification of how many topics to create; and second, there is no facility for humans to contribute to the topic model.

In order to quickly get something into users' hands to try out, our initial solution uses components of the GATE text engineering system. To begin, a topic map file is created in a format suitable for use by the ANNIE Gazetteer, which annotates a document based on what it finds in various lists. Specifically, we create a two-column .lst file containing the term to match and the associated topic. If a term matches multiple topics, it is entered on multiple rows. For example:

chemicals topic=chemical
buyout topic=acquisition
offered topic=acquisition
operating topic=reporting
operating topic=license
costs topic=reporting
costs topic=finance

If the term “operating” is found in a document, it is annotated with both the reporting and license topics. In addition, the ANNIE gazetteer is not limited to single word terms; we could just as easily include “leveraged buyout” as a term.
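The lookup behavior described above is easy to picture with a small Python sketch. This is a hypothetical stand-in for what ANNIE does with the .lst file, not the GATE API itself; the function names are mine, and the entries are the example rows above:

```python
# Hypothetical sketch of the gazetteer's term -> topic lookup (not the GATE API).
topic_map = {}  # term -> set of topics

def add_entry(term, topic):
    """Record one row of the two-column .lst file."""
    topic_map.setdefault(term.lower(), set()).add(topic)

# The example entries from the .lst file above.
for term, topic in [
    ("chemicals", "chemical"),
    ("buyout", "acquisition"),
    ("offered", "acquisition"),
    ("operating", "reporting"),
    ("operating", "license"),
    ("costs", "reporting"),
    ("costs", "finance"),
]:
    add_entry(term, topic)

def lookup(term):
    """Return every topic annotation a matched term would receive."""
    return sorted(topic_map.get(term.lower(), set()))
```

As in the prose: "operating" comes back with both the license and reporting topics, while "chemicals" maps only to "chemical".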

A suitable entry must also be made in the lists.def file that controls the Gazetteer:

topic_map.lst:topicmap

This entry tells the Gazetteer that any text matching an entry in the topic_map.lst file is to be annotated as a "topicmap", which helps us find those annotations later.

GATE processing resources, such as the ANNIE Gazetteer, are chained together into a processing pipeline. To complete the pipeline, we wrote a custom processing resource that examines each topicmap annotation and computes which topics to assign to the document.
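The real resource is written against the GATE API, but the scoring step it performs can be sketched in a few lines of Python. The `min_hits` threshold here is my illustrative assumption, not the actual assignment rule; annotations are assumed to arrive as (term, topic) pairs:

```python
from collections import Counter

def assign_topics(annotations, min_hits=2):
    """Count topicmap annotations per topic and keep the topics seen
    at least min_hits times. The threshold is illustrative only."""
    counts = Counter(topic for _term, topic in annotations)
    return sorted(t for t, n in counts.items() if n >= min_hits)
```

For example, a document yielding the annotations ("costs", "reporting"), ("operating", "reporting"), ("operating", "license") would be assigned only the "reporting" topic with this threshold.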

The implementation is straightforward. However, one significant concern was whether it would be practical for a customer to build appropriate term/topic maps. It would be a simple matter if we had access to a representative user, but, as is often the case in our line of work, we don't! So we have to look for clues and make some guesses.

Even from the small example above, it's obvious that some terms are more topic-specific than others. If we find the word "chemicals", for example, the only topic associated with that word is the "chemical" topic. Other words are not so specific. How frequent are these specific terms? Are they obvious and easy to find? Can we successfully model topics using a small number of terms? If so, building a topic model might be relatively simple; if not, we'll need to find a way to help users build the model. Let's look at the data in a little more detail for some clues.

Processing the Reuters corpus using MALLET (MAchine Learning for LanguagE Toolkit), and further processing the output with some custom VBA code in MS Excel, I first identified some relationships between terms and topics.

After setting up MALLET and unpacking the Reuters data, I used the command:

mallet train-topics --input topic.input.mallet --num-topics 90 --output-topic-keys topickeys.txt --num-top-words 25 --output-doc-topics doctopics.gz

--num-topics 90 causes MALLET to create 90 topics.

--output-topic-keys topickeys.txt --num-top-words 25 causes MALLET to output a list of the top 25 words associated with each topic.

--output-doc-topics doctopics.gz causes MALLET to output a list of every document and the affinity of that document to each topic.
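For readers who prefer to skip Excel entirely, topickeys.txt is simple enough to parse directly: each line holds the topic number, a weight, and then the top words. A Python sketch (the parsing code and sample data are mine, not part of the original workflow):

```python
def parse_topic_keys(text):
    """Parse MALLET --output-topic-keys output. Each line is
    '<topic#> <weight> <word1> <word2> ...', whitespace-separated."""
    topics = {}
    for line in text.strip().splitlines():
        parts = line.split()
        topic_id = int(parts[0])
        topics[topic_id] = parts[2:]  # skip the weight column
    return topics

# Illustrative sample in the topickeys.txt layout.
sample = "0\t0.5\toil prices crude\n1\t0.5\tbank rates loans"
topics = parse_topic_keys(sample)
```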

Loading topickeys.txt into Excel and splitting the data so that every term occupies its own cell (using the Data → Text to Columns feature), we can then build a matrix of terms to topics with the following VBA script:

Sub creatematrix()
    ' Build a term-by-topic matrix on Sheet1 from the topic-keys sheet.
    ' topic-keys layout: column 1 = topic number, column 2 = weight,
    ' columns 3+ = the top words for that topic.
    i = 1
    While Worksheets("topic-keys").Cells(i, 1) <> ""
        j = 3
        While Worksheets("topic-keys").Cells(i, j) <> ""
            ' Look for this word among the terms collected in column 1 of Sheet1.
            k = 1
            found = False
            While Worksheets("Sheet1").Cells(k, 1) <> ""
                If Worksheets("Sheet1").Cells(k, 1) = Worksheets("topic-keys").Cells(i, j) Then
                    found = True
                    ' Mark the topic column for the existing term (topic n -> column n + 2).
                    Worksheets("Sheet1").Cells(k, Worksheets("topic-keys").Cells(i, 1) + 2) = 1
                End If
                k = k + 1
            Wend
            If found = False Then
                ' Append the word as a new term and mark its topic column.
                Worksheets("Sheet1").Cells(k, 1) = Worksheets("topic-keys").Cells(i, j)
                Worksheets("Sheet1").Cells(k, Worksheets("topic-keys").Cells(i, 1) + 2) = 1
            End If
            j = j + 1
        Wend
        i = i + 1
    Wend
End Sub
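The same inversion, from MALLET's topic → words output to a term → topics matrix, can be expressed compactly in Python (a sketch with hypothetical names, not part of the original workflow):

```python
def build_term_topic_matrix(topics):
    """Invert a {topic_id: [words]} mapping into {term: set(topic_ids)},
    the same term-by-topic matrix the VBA script builds on Sheet1."""
    matrix = {}
    for topic_id, words in topics.items():
        for word in words:
            matrix.setdefault(word, set()).add(topic_id)
    return matrix
```

A word appearing in the top-25 lists of two topics ends up with both topic ids in its set, just as its Sheet1 row gets two marked columns.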

A second script is run to count, for each term, the number of topics it is associated with:

Sub calcstrength()
    ' For each term (row), count how many topic columns are marked
    ' and record the count in column 256.
    i = 1
    While Worksheets("Sheet1").Cells(i, 1) <> ""
        k = 0
        For j = 2 To 255
            If Worksheets("Sheet1").Cells(i, j) = 1 Then
                k = k + 1
            End If
        Next j
        Worksheets("Sheet1").Cells(i, 256) = k
        i = i + 1
    Wend
End Sub

I now have a table listing the number of terms associated with 1 topic, 2 topics, and so on. With num-topics = 90, 1173 unique terms were identified, and the term/topic relationship breaks down as follows:

# terms    # topics term is associated with    Approx. % of total
  784                    1                           .67
  179                    2                           .15
   85                    3                           .07
   41                    4                           .03
   27                    5                           .02
   17                    6                           .01

As can be seen, roughly 2/3 of the terms identified correspond with exactly one topic, with more than 90% of the terms associated with 4 or fewer topics.
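This tally is also easy to compute directly from a term → topics matrix; a Python sketch (function name mine) that reproduces the fractions above, e.g. 784/1173 ≈ .67 for single-topic terms:

```python
from collections import Counter

def strength_distribution(matrix):
    """Given {term: set(topic_ids)}, return {n: (count, fraction)}:
    how many terms map to exactly n topics, and what share of all
    terms that represents."""
    counts = Counter(len(topic_ids) for topic_ids in matrix.values())
    total = len(matrix)
    return {n: (c, c / total) for n, c in sorted(counts.items())}
```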

Increasing the topic count to 250 (--num-topics 250) produces similar results. 2587 unique terms were selected:

# terms    # topics term is associated with    Approx. % of total
 1664                    1                           .64
  362                    2                           .14
  167                    3                           .06
  100                    4                           .04

Using the following scripts, we can compute the coverage provided by these terms.

Sub buildtopiccountmap()
    ' For each strength bucket i (terms that map to exactly i topics),
    ' accumulate per-topic term counts in row 2590 + i.
    For i = 1 To 48
        j = 1
        While Worksheets("Sheet1").Cells(j, 256) <> ""
            If Worksheets("Sheet1").Cells(j, 256) = i Then
                For k = 2 To 255
                    If Worksheets("Sheet1").Cells(j, k) = 1 Then
                        Worksheets("Sheet1").Cells(2590 + i, k) = Worksheets("Sheet1").Cells(2590 + i, k) + 1
                    End If
                Next k
            End If
            j = j + 1
        Wend
    Next i
End Sub

Sub topicmapcounttotals()
    ' For each strength bucket row, tally in column 256 the topic columns
    ' whose count is still below 1, i.e. topics not reached by that bucket.
    For i = 1 To 48
        For j = 2 To 255
            If Worksheets("Sheet1").Cells(2590 + i, j) < 1 Then
                Worksheets("Sheet1").Cells(2590 + i, 256) = Worksheets("Sheet1").Cells(2590 + i, 256) + 1
            End If
        Next j
    Next i
End Sub
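The coverage question, how many distinct topics are reachable through terms of a given strength, can also be answered straight from the matrix. A Python sketch (names and structure are mine, not the spreadsheet's):

```python
def coverage_by_strength(matrix):
    """Given {term: set(topic_ids)}, group terms by how many topics
    they map to, and report how many distinct topics each group covers."""
    buckets = {}
    for term, topic_ids in matrix.items():
        buckets.setdefault(len(topic_ids), set()).update(topic_ids)
    return {n: len(covered) for n, covered in sorted(buckets.items())}
```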

We now have 9 sheets in our spreadsheet, with each sheet mapping a document to one or more topics based on the number of topics a particular term maps to. Looking at the results, we can compute the following:

Topics/term    Average # of topics/document
     1                    1.27
     2                    1.99
     3                    2.42

In other words, considering only those terms that map to a single topic, each document gets mapped to 1.27 topics on average. Considering only those terms that map to exactly two topics, each document gets mapped to 1.99 topics on average. Given that the documents themselves are relatively short, this is an encouraging discovery as I expected this number to be low. Furthermore, manual inspection of the document to topic mapping appears reasonable for a small random sampling of documents.
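The per-bucket average can be reproduced outside Excel with a short Python sketch. The data structures and function name here are hypothetical; documents are assumed to be bags of terms, and only terms that map to exactly n topics get a vote:

```python
def avg_topics_per_doc(docs, matrix, n):
    """Average number of topics a document receives when only terms
    mapping to exactly n topics are considered.
    docs: {doc_id: [terms]}; matrix: {term: set(topic_ids)}."""
    totals = []
    for terms in docs.values():
        topics = set()
        for term in terms:
            topic_ids = matrix.get(term, set())
            if len(topic_ids) == n:
                topics |= topic_ids
        totals.append(len(topics))
    return sum(totals) / len(totals)
```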

So far, the data seems to support the belief that topic maps can be built using a reasonably small number of terms: roughly six per topic on average. This is important because the client has stated that their subject-matter experts (SMEs) should produce the topic maps, and, since we would not have access to these SMEs before we had to deliver working code, we needed some assurance that the approach was even feasible. Six terms per topic seems feasible.

But which six terms? Not being an expert in the topics discussed in the Reuters corpus, I cannot fully assess how well an SME would have chosen the terms that best map to a single topic, but I am concerned. We won't know until it fails, and then we'll need a rapid response. So, what can we do if the SME-driven approach fails?

The first approach is the obvious one: use LDA to generate an initial working set, following a process similar to the one I used to isolate the terms for my analysis. We would end up with a term/topic map that the SMEs could then use as a starting point for building the "real" map. Difficulties include visualizing the map for the SMEs (perhaps as a graph) and actually manipulating it. This seems arduous, and it's not clear that it would even work. For example, how many topics do you tell LDA to create?

The second approach is to treat the problem incrementally, similar to a recommendation engine. What if users manually assigned documents to topics, and we used their input to develop the topic map in the background? The system would suffer from what is known as "cold start": initially it would not have enough data to assign any topics to a document, so ideally we would like to minimize the learning time, perhaps through some explicit interaction with users.

The first time a user assigns a document D to a topic T, all of D's terms are assigned to T. When a second document is assigned to T, the weight of every term/topic relation is updated; terms that appear in both documents get a higher weight than terms that appear in only one. In this structure, which is essentially a Bayesian network, the weight represents the probability that an instance of a specific term indicates a particular topic. This process continues as users assign documents to topics, until some point is reached where the system can begin recommending topic assignments. Theoretically, "some point" can be as few as two documents; that is, the system can begin recommending topic assignments as soon as two documents are assigned to a single topic, using the terms shared by both.
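A minimal sketch of this incremental scheme, under my own simplifying assumptions: documents are bags of terms, and the weight of a (term, topic) pair is the fraction of the topic's assigned documents that contain the term, a rough stand-in for the probability described above. The class and method names are hypothetical:

```python
from collections import defaultdict

class IncrementalTopicMap:
    """Each assignment of a document (a bag of terms) to a topic bumps
    that topic's count for every term in the document. weight() is the
    share of the topic's documents containing the term."""

    def __init__(self):
        self.term_counts = defaultdict(lambda: defaultdict(int))  # topic -> term -> docs containing it
        self.doc_counts = defaultdict(int)                        # topic -> docs assigned

    def assign(self, terms, topic):
        self.doc_counts[topic] += 1
        for term in set(terms):  # count each term once per document
            self.term_counts[topic][term] += 1

    def weight(self, term, topic):
        if self.doc_counts[topic] == 0:
            return 0.0
        return self.term_counts[topic][term] / self.doc_counts[topic]
```

After two documents are assigned to "acquisition", a term shared by both carries weight 1.0 and a term seen in only one carries 0.5, which is exactly the "terms shared by both documents" signal the recommendation step would use.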

This will be the topic of a future post.


About jeffmershon

Director of Program Management at SiriusXM.