---
title: "Topic Modeling"
author: "Thomas Elliott"
date: "`r Sys.Date()`"
output: html_document
---
This particular topic modeling analysis is at the article level. I take the paragraphs for each article, and concatinate them.
```{r setup, tidy=TRUE }
library(stm)
library(dplyr)
library(tidyr)
knitr::opts_chunk$set(cache=TRUE)
load("~/Dropbox/Research/Dissertation/Data/analysis_articles.Rdata")
articles$text<-""
for(art in articles$id) {
df<-talk.info %>% filter(article_id==art)
text<-paste(df$quote,collapse=" ",sep=" ")
articles$text[which(articles$id==art)]<-text
}
articles$newspaper<-factor(articles$newspaper,levels=c(1,2,3),labels=c("NYT","LAT","WSJ"))
remove.words<-c("gay","lesbian","homosexual","homosexuality")
```
This analysis is on `r dim(articles)[1]` articles, which includes `r dim(talk.info)[1]` paragraphs of text.
Next, I pre-process the text. This involves removing stop words (common words as well as user specified stop words), stemming words (to remove suffixes), and other basic steps to make the text ready for processing.
```{r process}
processed<-textProcessor(articles$text,meta=articles,
customstopwords = remove.words)
num.docs<-length(processed$documents)
max.docs<-num.docs-10
```
Next I prep the documents. In this step, I remove words that appear in less than 5 documents, as well as words that appear in more than `r max.docs` articles. This helps remove some noise.
```{r prep}
# out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
# lower.thresh = 5,upper.thresh = max.docs)
out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
texts<-articles$text[-processed$docs.removed]
```
Next is the workhorse. This actually does the topic modeling. First, we can run stm with init.type="Spectral" and K=0 to have the algorithm calculate the appropriate number of topics itself. This does not mean the number of topics it finds is the true number of topics, but it is a good place to start.
If I were to not use "Spectral," it would be good practice to run multiple models of the same specification but with different starting values (seeds) and selecting the best of the results. This can be done automatically with `selectModel` but is unnecessary with "Spectral" as this initialization is determanistic and globally consistent, and so will return the same results despite the seed used. However, it is only useful for vocabularies below 10,000 words. A common problem in big data is multi-modality, in that when optimizing a solution there are numerous local maxima that may not be the global maximum but that the solution might converge onto. Thus, running multiple models with different starting values increases the chances of at least one of the solutions converging on the global maximum.
Spectral initialization uses a decomposition of the VxV word co-occurence matrix to identify "anchor" words - words that belong to only one topic and therefore identify that topic. The topic loadings of other words are then calculated based on these anchor words. This processes is deterministic, so that the same values are always arrived at with the same VxV matrix. The problem is that this process assumes that the VxV matrix is generated from the population of documents (or, put another way, assumes it is generated from an infinite number of documents). Thus, the process does not behave well with infrequent words. The solution to this is to remove infrequent words, though one should still be careful if you don't have a lot of documents.
My guess is that I should use Spectral, but make sure it is robust to a series of LDA models (meaning that Spectral produces results as good as or better than LDA models). I also suspect that I don't have enough documents to trust the number of models to use.
```{r stm-spectral-0,message=FALSE,cache=TRUE,dependson=c("setup","process","prep"),results="hide"}
stmfit<-stm(out$documents,out$vocab,K=0,
prevalence = ~newspaper + year, max.em.its = 500,
data=out$meta,seed=24601,
init.type = "Spectral")
```
The above analysis identifies `r stmfit$settings$dim$K` topics. Below are common words for each topics.
```{r}
labelTopics(stmfit)
```
Below graphs how common each topic is:
```{r,fig.height=8,fig.width=6}
plot.STM(stmfit,type="summary",xlim=c(0,0.1))
```
Semantic coherence - empirical cooccurence of terms with high probability under a given topic. Given term l appears, what is the probability that term m will also appear? For the topic, coherence is the sum of the log of these probabilities.
$C(t; V^{(t)}) = \sum_{m=2}^M \sum_{l=1}^{m-1} log{\frac{D(v_m^{(t)},v_l^{(t)})+1}{D(v_l^{(t)})}}$
Where $V^{(t)}$ is a list of the $M$ most probably words in topic t. D(v) is the document frequency of word v. D(v,v') is the document co-occurrence frequency of words v and v'.
Exclusivity - top words for a topic are unlikely to appear in top words of other topics
An word f's exclusivity in topic k is calculated by:
$\phi_{f,k} = \frac{\beta_{f,k}}{\sum_{j \in S}\beta_{f,j}}$
Essentially the rate of a word in a topic divided by the sum of the rate of the word in all other topics.
We then combine a word's frequency and exclusitivity into a single value:
$FE_{fk} = \left( \frac{w}{\text{ECDF}_{\phi,k}(\phi_{f,k})} + \frac{1-w}{\text{ECDF}_{\mu,k}(\mu_{f,k})} \right)$
Which is the harmonic mean of empirical cumulative distribution functions applied to the exclusivity and frequency scores.
```{r,eval=FALSE}
exclusivity <- function(mod.out, M=10, frexw=.7){
w <- frexw
if(length(mod.out$beta$logbeta)!=1) stop("Exclusivity calculation only designed for models without content covariates")
tbeta <- t(exp(mod.out$beta$logbeta[[1]]))
s <- rowSums(tbeta)
mat <- tbeta/s #normed by columns of beta now.
ex <- apply(mat,2,rank)/nrow(mat)
fr <- apply(tbeta,2,rank)/nrow(mat)
frex<- 1/(w/ex + (1-w)/fr)
index <- apply(tbeta, 2, order, decreasing=TRUE)[1:M,]
out <- vector(length=ncol(tbeta))
for(i in 1:ncol(frex)) {
out[i] <- sum(frex[index[,i],i])
}
out
}
```
```{r,fig.height=8,fig.width=6}
topicQuality(stmfit,documents=out$documents)
```
Let's try different Ks. The `searchK` function runs `selectModel` on different values of K, so that you can see which best fits the data.
```{r searchK,cache=TRUE,dependson=c("setup","process","prep"),results="hide"}
sk<-searchK(out$documents,out$vocab,K=c(10,20,30,40,50))
```
```{r,fig.height=8,fig.width=8,dependson="searchK"}
knitr::kable(sk$results)
plot(sk)
```
Based on the results above, it looks like 30 topics is a good number to start with. Let's rerun the searchK function honing in on 20 topics
```{r searchk20,dependson=c("steup","process","prep"),results="hide"}
sk30<-searchK(out$documents,out$vocab,K=c(26,27,28,29,30,31,32,33,34))
```
```{r,fig.height=8,fig.width=8,dependson="searchK20"}
knitr::kable(sk30$results)
plot(sk30)
```
Based on the results above, 18 looks like a good number.
```{r spectral-20,message=FALSE,dependson=c("setup","process","prep"),results="hide"}
stmfit2<-stm(out$documents,out$vocab,K=30,
prevalence = ~newspaper + year,
data=out$meta,seed=24601,
init.type = "Spectral")
```
Below are common words for each topics.
```{r}
labelTopics(stmfit2)
```
Below graphs how common each topic is:
```{r,fig.height=8,fig.width=6}
plot.STM(stmfit2,type="summary",xlim=c(0,0.15))
```
```{r,fig.height=8,fig.width=6,dependson="spectral-20",results="hide"}
topicQuality(stmfit2,documents=out$documents)
```
```{r}
save.image("~/Dropbox/Research/Dissertation/Data/topic.modeling.Rdata")
```