In a previous post on topic modeling, I described a topic-modeling problem we were given that could not be solved by conventional means, such as Latent Dirichlet Allocation (LDA), due to an important constraint imposed by the customer: they wanted to manually build and maintain the topic map (that is, the mapping between terms and topics). The solution we provided uses the GATE ANNIE Gazetteer and a simple processing resource to compute the final scores and assign topics to a document.
Because I was concerned about the practicality of building and maintaining such lists, I conducted some fairly rudimentary analysis of the data and concluded that, at least for the test data set I had available, each topic had on average six unique terms, where "unique" means the term maps to no other topic. I also discovered that unique or near-unique terms dominated the topic map generated by LDA. That is, the distribution of terms to topics, when plotted, looked similar to a power-law distribution. Some equally crude experiments using Excel led me to believe that a small number of well-chosen terms could produce a high-quality topic map.
Still, I’m not convinced that the customer would be able to build and maintain effective topic maps. Perusal of the output from the generated LDA models revealed a larger-than-expected proportion of non-obvious terms. In other words, many of the terms that seemed to best define a particular topic were contextual: they were not part of the topic’s core vocabulary, but tended to be used only in that particular context nonetheless. This discovery is, at least to me, highly counter-intuitive, and I have to admit to being a little skeptical that my interpretation is correct. I *can* say that some of the topic-specific terms produced by LDA were unexpected, and this leads me to believe that it will be difficult for the customer to effectively build and manage these lists.
But the real problem is that validating a feature in our market takes months, and a poor showing could have long-lasting dire consequences. Trust is always difficult to come by and easily destroyed, but it’s worse when you get very few opportunities to get in front of a customer, and I could easily imagine the customer trying this new topic-modeling capability and failing, setting back our relationship by years. So, to quote Tim Harford: “In an uncertain world, we need more than just Plan A; and that means finding safe havens for Plans B, C, D, and beyond.”
I’m not sure what Plans C, D, and beyond are, but enter Plan B, which is an attempt to treat my particular topic-modeling problem using an active learning method. As per Wikipedia: “Active learning is a form of supervised machine learning in which the learning algorithm is able to interactively query the user (or some other information source) to obtain the desired outputs at new data points.” In practice, users would still exercise control over the creation of the topic map, but by assigning documents to topics, rather than assigning terms to topics.
Given a new document and a blank topic map, a user is asked to nominate topics to which this document belongs. Visualizing the problem as a graph, where terms and topics are nodes, the weight of a connection between a term and a topic is adjusted when a document is assigned to a topic. Based on last post’s theory that a small number of terms having a strong affinity to a topic can be used to create a topic map, this approach has the potential to quickly reach a point where the system can begin validating assumptions by suggesting topics for documents.
Is it possible to guide the system so that it validates its assumptions in the most efficient manner, i.e., in the fewest steps? If the weight of a term/topic association is normalized to the range 0 to 1 and set to the inverse of the number of topics (N) to which the term is associated (i.e., 1/N), then the biggest payoff comes when we make the first connection between a term and a single topic (the weight changes from 0 to 1), the second biggest payoff comes when we connect the term to its second topic (the weight changes from 1 to 0.5), and the weight change (payoff) decreases from there as we associate the term with additional topics. We gain very little (i.e., we change the weight of a term/topic association very little) when we assign a term to its 10th topic. Note that this holds true for any similar decreasing curve, not just 1/N.
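The 1/N weighting above is simple enough to sketch in a few lines. This is a minimal illustration, not production code; the `TopicMap` class and its method names are hypothetical:

```python
class TopicMap:
    """Bipartite term/topic graph where a term's weight for a topic is 1/N,
    N being the number of topics the term is associated with."""

    def __init__(self):
        # term -> set of topics the term has been associated with
        self.term_topics = {}

    def associate(self, term, topic):
        self.term_topics.setdefault(term, set()).add(topic)

    def weight(self, term, topic):
        topics = self.term_topics.get(term, set())
        if topic not in topics:
            return 0.0
        return 1.0 / len(topics)  # 1/N

tm = TopicMap()
tm.associate("gazetteer", "nlp")
print(tm.weight("gazetteer", "nlp"))  # first connection: weight jumps 0 -> 1.0
tm.associate("gazetteer", "search")
print(tm.weight("gazetteer", "nlp"))  # second connection: weight drops to 0.5
```

The diminishing payoff falls out of the formula directly: going from the ninth to the tenth topic only moves the weight from 1/9 to 1/10.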
In the learning phase, when a user associates a document with a topic, the system will associate all of the relevant terms, then choose one term that is solely associated with that topic (i.e., term/topic weight = 1), and present to the user another document that contains that term, asking the user to assign it to one or more topics, or perhaps just asking the user to confirm or refute that document’s association with the original topic the user had chosen. Because the new learning phase (Plan B) ultimately creates the same topic map that a user can create by hand, the user still retains the control they claimed they wanted to have.
That’s it. Assuming the customer has difficulty building a topic map by hand, I now have at least a viable plan to resolve the issue. But there isn’t any point in building it until the customer verifies (a) whether they can build a topic map as they had wished, and (b) if a suitable topic map can be built, whether the results are useful.