By Sarah Goldberg
While some teachers might have heard of topic modeling technology in the context of cutting-edge digital humanities research, few have considered how these exciting new tools might find a place in the history classroom. However, a topic-modeling project offers an ideal introduction to the dynamic world of digital historical methodology. Such a project is perhaps best suited to an undergraduate setting, where aspiring history majors can gain hands-on experience and useful technical skills in quantitative historical analysis. Yet while topic modeling has continued to make waves in historical research, most teachers are hesitant to include this technology in their curricula. For those more accustomed to conventional historical methodology, attempting topic modeling can be a daunting task. Even the most technologically literate teachers are left scrambling for the basics: What is topic modeling? How can these tools be used in meaningful ways? Are they teachable in already overburdened classrooms?
Topic modeling refers to a variety of programs that use complex algorithms to reduce an extremely large collection of text to a number of distinct topics. While diverse options exist for a more technologically advanced audience, the casual researcher and student should be most familiar with MALLET, an open-source, Java-based program that can be used without any coding knowledge through Topic Modeling Tools (TMT). There are plenty of good web-based resources for all skill levels that can help first-time users get started with MALLET and TMT (like this overview of topic modeling, or this introduction to TMT).
To understand the philosophy behind topic modeling, it’s easiest to start by identifying the human function this technology is trying to mimic on a grand scale. When an individual reads through a text, they use their own critical thinking skills to identify what topics are contained in the work. For example, a student might listen to a politician’s speech and be able to articulate distinct themes in the text – the economy, the law, civil rights, etc. While some of this ability to categorize comes from the reader’s prior knowledge, the distribution of certain words is also crucial to making these connections. For example, a reader might see the words “troop,” “victory,” and “casualty” in a paragraph and come to the rational conclusion that the speaker is referring to the military. Topic modeling aims to do the same by identifying distinct subjects within enormous quantities of documents that would be impossible to read page-by-page.
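The intuition that words such as “troop,” “victory,” and “casualty” belong together can be illustrated with a few lines of Python. The sketch below is only a toy: the paragraphs and the stop-word list are invented for illustration, and simple pair counting is a stand-in for, not a description of, the statistical algorithms that programs like MALLET actually use.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy paragraphs; not drawn from any real collection.
paragraphs = [
    "the troop won a victory with one casualty",
    "each troop reported a casualty after the victory",
    "the court upheld the law on civil rights",
    "a new law expanded civil rights",
]

# A tiny, hand-picked list of common words to ignore (an assumption).
STOP_WORDS = {"the", "a", "an", "with", "one", "each", "after", "on", "new"}


def cooccurrence_counts(texts):
    """Count how many texts contain each unordered pair of words."""
    counts = Counter()
    for text in texts:
        # Use a set so each word counts once per text, then sort so that
        # every pair is stored in a consistent alphabetical order.
        words = sorted(set(text.lower().split()) - STOP_WORDS)
        counts.update(combinations(words, 2))
    return counts


pair_counts = cooccurrence_counts(paragraphs)
for pair, n in pair_counts.most_common(6):
    print(pair, n)
```

Pairs drawn from the same theme, such as “casualty” and “troop,” turn up together in more than one paragraph, while a cross-theme pair such as “law” and “troop” never does. Scaled up to thousands of documents, regularities of this kind are what let a program propose word lists without knowing what any of the words mean.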
When a user inputs their database of text, topic-modeling technology generates lists of words that are likely to appear together in the same documents throughout the collection. These lists represent topics, or “a group of words that often co-occur with each other in the same documents,” explains University of Richmond Director of Digital Sciences Robert K. Nelson. Users are left to use their own knowledge of the source base to give meaning to these word lists and decide which categories represent legitimate themes. Once these topics are identified, the program can quantify the appearance of these categories within the body of work. For example, having identified topic A, the program could tell you that 20% of the collection was relevant to topic A, or that Document 1 was 70% related to topic A. From these computations, researchers and students can draw conclusions about the collection of sources as a whole.
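To make the percentages concrete, here is a minimal Python sketch of how such shares could be computed once word lists exist. Everything in it is hypothetical: the two topic vocabularies, the labels, and the toy documents are invented, and counting matched words is a deliberate simplification. Real topic-modeling programs estimate these proportions probabilistically rather than by direct lookup.

```python
from collections import Counter

# Hypothetical topic word lists, standing in for the kind of output a
# topic-modeling program produces. The labels are a human interpretation.
topics = {
    "military": {"troop", "victory", "casualty", "army"},
    "law": {"law", "court", "rights", "civil"},
}

# Hypothetical toy documents, already lowercased and stripped of punctuation.
documents = {
    "doc1": "troop victory casualty army troop law victory court army rights",
    "doc2": "law court rights civil law troop",
}


def topic_shares(text, topics):
    """Return each topic's fraction of the topic-matched words in text."""
    counts = Counter()
    for word in text.split():
        for label, vocabulary in topics.items():
            if word in vocabulary:
                counts[label] += 1
    total = sum(counts.values())
    # Guard against texts in which no word matches any topic.
    return {label: (counts[label] / total if total else 0.0) for label in topics}


# Share of each topic within each document.
doc_shares = {name: topic_shares(text, topics) for name, text in documents.items()}

# Share of each topic across the whole collection.
collection_shares = topic_shares(" ".join(documents.values()), topics)

print(doc_shares)
print(collection_shares)
```

In this toy collection, seven of the ten matched words in `doc1` fall under the hypothetical "military" list, so that document is 70% military, while the two topics split the collection as a whole evenly.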
Nelson uses this technology in his Mining the Dispatch, one of the most noteworthy examples of topic modeling research. Working with the Digital Scholarship Lab at the University of Richmond in partnership with Tufts University’s Perseus Project, Nelson analyzed the rich archives of the Richmond Daily Dispatch. Using 37 unique topics, Nelson and his research team were able to track “the dramatic and often traumatic changes as well as the sometimes surprising continuities in the social and political life of Civil War Richmond.” While it would be impossible for his research team to read through the 112,000 pieces containing nearly 24 million words, Nelson used MALLET to discover unexpected insights. Teachers looking for a good introduction to the value and limitations of topic modeling would do well to read through Nelson’s project as an example of topic modeling success. While Nelson’s introduction explains how topic modeling can both succeed and fail at the individual document level, he emphasizes that these tools are most suitable on a much larger scale. Using engaging graphics, Nelson shows the rise and fall of his topics between 1860 and 1865. Some patterns are easily explained. For example, Nelson’s discovery that fugitive slave advertisements were more likely to appear when the Union Army was near Richmond makes sense, as slaves were likely to run away in hopes of joining the northern forces. Others are less obviously correlated to outside events, such as Nelson’s discovery that slave rental advertisements declined sharply in 1862, suggesting a destabilized market without an apparent cause. Unexpected results such as these are just as potentially significant, argues Nelson: “Topic modeling and other distant reading methods are most valuable not when they allow us to see patterns that we can easily explain but when they reveal patterns that we can’t, patterns that surprise us and that prompt interesting and useful research questions.”
Mining the Dispatch should serve as an example of successful topic modeling that can demonstrate to undergraduates the utility of quantitative analysis in historical research.
Yet while Nelson sets high standards, teachers shouldn’t write off topic modeling as the domain of research professionals. My first encounter with topic modeling was as a college sophomore at the Dickinson College Digital Humanities Boot Camp. With minimal instruction, I learned quickly how to use the Stanford Topic Modeling Toolbox to create topics out of a large digitized collection of student publications from the Carlisle Indian Industrial School. My research partner and I chose to generate five distinct topics, which we then labeled based on our own prior knowledge of the works. For example, one list of words was “Baby Babies Indian Care Children Water Mother Child Food Proper,” which we decided was most relevant to Reservation Life. We could then identify which works were most relevant to this topic, and how relevant this topic was to the collection as a whole. While we ended up using the topics as a taxonomy system for an online exhibit of historical photographs, the exercise of using topic modeling proved useful in and of itself by allowing us to get a better grasp of a collection much too large to read closely. Moreover, I gained a broader understanding of data-driven historical methods and how to apply digital skills to meaningful analysis.
The bottom line is that topic modeling is worth your attention. With basic instruction, a few strong examples, and the right set of sources, topic modeling can provide an engaging and surprisingly simple lesson in the possibilities of digital humanities.