When using the technique, a researcher begins by feeding topic modeling software the texts that she would like to find topics for and by specifying the number of topics to be returned. There are multiple algorithms for identifying topics, but we focus on the most common: \emph{latent Dirichlet allocation} or LDA \citep{blei_latent_2003}. The nuts and bolts of how LDA works are complex and beyond the scope of this chapter, but the basic goal is fairly simple: LDA identifies sets of words that are likely to be used together and calls these sets `topics.' For example, a computer science paper is likely to use words like `algorithm,' `memory,' and `network.' While a communication article might also use `network,' it would be much less likely to use `algorithm' and more likely to use words like `media' and `influence.' The other key feature of LDA is that it does not treat documents as belonging to only one topic, but as consisting of a mixture of multiple topics with different degrees of emphasis. For example, an LDA analysis might characterize this chapter as a mixture of computer science and communication (among other topics).
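To make this workflow concrete, the following is a minimal sketch in Python, assuming the scikit-learn library; the chapter does not prescribe any particular implementation, and the corpus, topic count, and variable names below are purely illustrative:

\begin{verbatim}
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical four-document corpus; the contents are
# illustrative only.
documents = [
    "the algorithm stores the network in memory",
    "social media influence spreads through the network",
    "a memory efficient algorithm for network analysis",
    "media coverage shapes public opinion and influence",
]

# Step 1: feed the software the texts, converted to a
# document-by-word matrix of counts.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# Step 2: specify the number of topics to be returned
# (here, two) and fit the model.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # topic mixture per document
\end{verbatim}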
LDA identifies topics inductively from the observed distributions of words in documents. The LDA algorithm looks at all of the words that co-occur within a corpus of documents and assumes that words used in the same document are more likely to come from the same topic. The algorithm then looks across all of the documents and finds the set of topics and topic distributions that would be, in a statistical sense, most likely to produce the observed documents. LDA's output has two parts: the set of topics, given as ranked lists of words likely to be used in documents about each topic, and the distribution of topics in each document. \citet{dimaggio_exploiting_2013} argue that although the assumptions behind topic modeling are simplistic, many of them have parallels in sociological and communication theory. Perhaps more importantly, the topics created by LDA frequently correspond to human intuition about how documents should be grouped or classified.
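Continuing the illustrative sketch above, the snippet below reads both outputs off the fitted model: the ranked word lists come from the model's per-topic word weights, and the per-document mixtures from the matrix returned when the model was fit. Again, this assumes scikit-learn and is a sketch rather than a definitive implementation:

\begin{verbatim}
# Output (a): each topic as a ranked list of words, read off
# the fitted model's per-topic word weights.
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_words = [words[i] for i in weights.argsort()[::-1][:5]]
    print("topic", k, ":", top_words)

# Output (b): the distribution of topics in each document.
for doc, mixture in zip(documents, doc_topics):
    print(mixture.round(2), doc)
\end{verbatim}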