This topic modeling analysis is at the article level: I take the paragraphs for each article and concatenate them.

library(stm)
## stm v1.1.0 (2015-7-27) successfully loaded. See ?stm for help.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

knitr::opts_chunk$set(cache = TRUE)

load("~/Dropbox/Research/Dissertation/Data/analysis_articles.Rdata")

articles$text <- ""
for (art in articles$id) {
    df <- talk.info %>% filter(article_id == art)
    text <- paste(df$quote, collapse = " ", sep = " ")
    articles$text[which(articles$id == art)] <- text
}
articles$newspaper <- factor(articles$newspaper, levels = c(1, 2, 3), labels = c("NYT", 
    "LAT", "WSJ"))

remove.words <- c("gay", "lesbian", "homosexual", "homosexuality")
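The per-article concatenation above can also be written without an explicit loop. The following is a minimal sketch on toy data, not the code used for the analysis: `talk_toy` and `articles_toy` are hypothetical stand-ins for `talk.info` and `articles`, and only base R is assumed.

```r
# Toy stand-ins for talk.info and articles (hypothetical values).
talk_toy <- data.frame(
  article_id = c(1, 1, 2, 3, 3, 3),
  quote = c("a b", "c", "d e", "f", "g", "h"),
  stringsAsFactors = FALSE
)
articles_toy <- data.frame(id = c(1, 2, 3, 4))

# Split the paragraphs by article, then collapse each group into one string.
joined <- vapply(
  split(talk_toy$quote, talk_toy$article_id),
  paste, character(1), collapse = " "
)

# Look each article up by id; articles with no paragraphs get an empty string.
articles_toy$text <- unname(joined[as.character(articles_toy$id)])
articles_toy$text[is.na(articles_toy$text)] <- ""

print(articles_toy)
```

Article 4 has no paragraphs in the toy data, so it ends up with an empty string, which matches what the loop does when it starts from `articles$text <- ""`.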

This analysis covers 861 articles, which together include 2789 paragraphs of text.

Next, I pre-process the text. This involves removing stop words (common words as well as user-specified stop words), stemming words (to remove suffixes), and other basic steps to make the text ready for processing.

processed<-textProcessor(articles$text,meta=articles,
                         customstopwords = remove.words)
## Building corpus... 
## Converting to Lower Case... 
## Removing stopwords... 
## Remove Custom Stopwords...
## Removing numbers... 
## Removing punctuation... 
## Stemming... 
## Creating Output...
num.docs<-length(processed$documents)
max.docs<-num.docs-10
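To make the pre-processing steps concrete, here is a rough base-R sketch of lowercasing, stripping punctuation and numbers, and dropping stop words. It is an illustration only: `clean_text` and `toy_stop` are hypothetical names, the order of steps differs slightly from the textProcessor log above, and stemming is left out because it needs an external stemmer.

```r
# Simplified pre-processing: lowercase, strip punctuation and digits,
# tokenize on whitespace, and drop stop words. No stemming is done here.
clean_text <- function(x, stopwords) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  x <- gsub("[[:digit:]]+", " ", x)
  tokens <- strsplit(trimws(x), "[[:space:]]+")
  lapply(tokens, function(tok) tok[nzchar(tok) & !(tok %in% stopwords)])
}

# A tiny illustrative stop list: a few common words plus one custom word.
toy_stop <- c("the", "a", "of", "gay")

cleaned <- clean_text("The 2 rights of a gay couple, debated.", toy_stop)
print(cleaned[[1]])
```

With stemming added, tokens such as "rights" and "debated" would be reduced further (compare stems like "polic" and "counti" in the topic output).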

Next, I prep the documents. In this step, I remove words that appear in 5 or fewer documents (lower.thresh = 5), as well as words that appear in 835 or more documents (upper.thresh = the number of documents minus 10). This helps remove some noise.

out <- prepDocuments(processed$documents, processed$vocab, processed$meta,
                     lower.thresh = 5,upper.thresh = max.docs)
## Removing 8463 of 10575 terms (14192 of 61579 tokens) due to frequency 
## Your corpus now has 845 documents, 2112 terms and 47387 tokens.
texts<-articles$text[-processed$docs.removed]
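The frequency filter can be sketched in base R on a toy corpus. This is an illustration rather than the stm implementation. It assumes, as I read the prepDocuments documentation, that a term is dropped when its document frequency is at most the lower threshold or at least the upper threshold. The objects `docs`, `lower`, and `upper` are made up for the example.

```r
# Toy tokenized corpus (hypothetical).
docs <- list(
  c("court", "law", "rare"),
  c("court", "law"),
  c("court", "school"),
  c("court", "law", "school")
)

# Document frequency: the number of documents containing each term.
tab <- table(unlist(lapply(docs, unique)))
doc_freq <- setNames(as.integer(tab), names(tab))

# Keep terms whose document frequency lies strictly between the thresholds.
lower <- 1
upper <- 4
keep <- names(doc_freq)[doc_freq > lower & doc_freq < upper]

# Drop the removed terms, then drop any documents left empty.
filtered <- lapply(docs, function(d) d[d %in% keep])
filtered <- filtered[lengths(filtered) > 0]

print(doc_freq)
print(keep)
```

Here "rare" (one document) and "court" (every document) are removed, which is the kind of noise the thresholds are meant to cut.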

Next is the workhorse. This actually does the topic modeling. First, we can run stm with init.type=“Spectral” and K=0 to have the algorithm calculate the appropriate number of topics itself. This does not mean the number of topics it finds is the true number of topics, but it is a good place to start.

If I were not using “Spectral,” it would be good practice to run multiple models with the same specification but different starting values (seeds) and select the best of the results. The reason is multi-modality, a common problem here: the objective being optimized has numerous local maxima that are not the global maximum, and the solution may converge on any of them. Running multiple models with different starting values increases the chance that at least one solution converges on the global maximum. This can be done automatically with selectModel, but it is unnecessary with “Spectral” because that initialization is deterministic and globally consistent, and so returns the same results regardless of the seed used. However, it is only useful for vocabularies below 10,000 words.
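The multiple-starts idea can be shown on a toy problem. The sketch below has nothing to do with the topic model itself: `f` is a made-up one-dimensional function with several local maxima, and base R's `optim` is run from a grid of starting values, keeping the best run. It is meant only to illustrate why one would compare several starts, not how selectModel works internally.

```r
# A made-up objective with several local maxima (illustrative only).
f <- function(x) sin(3 * x) - 0.1 * (x - 2)^2

# Several starting values spread over the search range.
starts <- seq(-4, 8, by = 2)

# Maximize from each start; fnscale = -1 turns optim into a maximizer.
fits <- lapply(starts, function(s) {
  optim(s, f, method = "BFGS", control = list(fnscale = -1))
})

# Compare the objective value reached from each start and keep the best run.
values <- vapply(fits, function(fit) fit$value, numeric(1))
best <- fits[[which.max(values)]]

print(round(values, 3))
print(best$par)
```

Runs that end at different values have settled on different local maxima. Keeping the best of them is the same logic as choosing among models fit from different seeds.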

Spectral initialization uses a decomposition of the VxV word co-occurrence matrix to identify “anchor” words: words that belong to only one topic and therefore identify that topic. The topic loadings of other words are then calculated based on these anchor words. This process is deterministic, so the same VxV matrix always yields the same values. The problem is that the process assumes the VxV matrix is generated from the population of documents (or, put another way, from an infinite number of documents). Thus, it does not behave well with infrequent words. The solution is to remove infrequent words, though one should still be careful when there are not many documents.
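As a rough picture of the VxV matrix, the sketch below builds a word co-occurrence matrix from a toy document-term matrix. It only illustrates the input to the spectral step. It is not the anchor-word algorithm and not how stm constructs or normalizes its matrix, and `dtm` is a hypothetical three-document corpus.

```r
# Toy document-term counts: 3 documents by V = 3 words (hypothetical).
dtm <- matrix(
  c(2, 1, 0,
    0, 1, 3,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("court", "law", "school"))
)

# Presence/absence of each word in each document.
present <- (dtm > 0) * 1

# V x V co-occurrence: entry (i, j) counts documents containing both words.
cooc <- crossprod(present)

# Row-normalize so each row is a distribution over co-occurring words.
cooc_norm <- cooc / rowSums(cooc)

print(cooc)
print(round(cooc_norm, 2))
```

Every entry here rests on one or two documents, so a single extra document would change the rows noticeably. That is the sampling problem with infrequent words and small corpora described above.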

My guess is that I should use Spectral, but make sure it is robust to a series of LDA models (meaning that Spectral produces results as good as or better than the LDA models). I also suspect that I don’t have enough documents to trust the number of topics it selects.

stmfit<-stm(out$documents,out$vocab,K=0,
            prevalence = ~newspaper + year, max.em.its = 500,
            data=out$meta,seed=24601, 
            init.type = "Spectral")

The above analysis identifies 55 topics. Below are common words for each topic.

labelTopics(stmfit)
## Topic 1 Top Words:
##       Highest Prob: said, polic, angel, los, counti, offic, depart 
##       FREX: angel, los, polic, counti, abnorm, enforc, chief 
##       Lift: abnorm, sheriff, los, angel, enforc, polic, houston 
##       Score: abil, abl, abnorm, abort, absolut, abus, accept 
## Topic 2 Top Words:
##       Highest Prob: say, public, aid, sexual, one, studi, report 
##       FREX: dannemey, acquir, studi, alfr, stori, report, diseas 
##       Lift: acquir, alfr, dannemey, horribl, syndrom, journalist, somebodi 
##       Score: abl, accept, account, accus, achiev, acknowledg, acquir 
## Topic 3 Top Words:
##       Highest Prob: said, charg, crime, arrest, polic, hate, two 
##       FREX: arrest, crime, charg, sentenc, hate, arnold, avenu 
##       Lift: arnold, sentenc, arrest, avenu, crime, charg, sunday 
##       Score: abl, abort, abus, accept, accord, accus, acknowledg 
## Topic 4 Top Words:
##       Highest Prob: homosexu, said, law, public, bar, one, assembl 
##       FREX: assembl, governor, veto, crimin, drink, bar, special 
##       Lift: assembl, reform, veto, drink, governor, lord, crimin 
##       Score: abandon, abl, abort, absenc, absolut, abus, accept 
## Topic 5 Top Words:
##       Highest Prob: aid, patient, director, men, victim, young, firm 
##       FREX: patient, firm, besid, victim, aid, volunt, lawyer 
##       Lift: besid, volunt, client, firm, latter, patient, assign 
##       Score: abandon, abil, abl, abort, absolut, abus, accept 
## Topic 6 Top Words:
##       Highest Prob: polit, say, right, abort, guy, bob, homosexu 
##       FREX: bob, guy, polit, abort, elect, republican, bush 
##       Lift: bob, reagan, guy, sam, fox, stanc, ronald 
##       Score: abandon, abil, abort, abus, accept, accord, account 
## Topic 7 Top Words:
##       Highest Prob: centuri, even, good, life, may, code, london 
##       FREX: centuri, code, london, produc, lord, dialogu, tour 
##       Lift: centuri, london, tour, oscar, intim, wild, brilliant 
##       Score: abandon, abort, accept, accord, account, accus, action 
## Topic 8 Top Words:
##       Highest Prob: court, law, right, polici, case, suprem, rule 
##       FREX: court, suprem, sodomi, conduct, law, decis, rule 
##       Lift: claus, sodomi, consent, exclud, privaci, georgia, virginia 
##       Score: abil, abl, abort, absenc, absolut, abus, accept 
## Topic 9 Top Words:
##       Highest Prob: children, parent, said, studi, one, sexual, mother 
##       FREX: parent, children, consequ, studi, michael, mother, psychologist 
##       Lift: consequ, psychologist, choos, michael, biolog, parent, deviat 
##       Score: abl, abort, accept, accord, account, accus, achiev 
## Topic 10 Top Words:
##       Highest Prob: communiti, said, center, art, hollywood, peopl, one 
##       FREX: art, hollywood, center, communiti, west, core, los 
##       Lift: core, art, santa, hollywood, player, space, primarili 
##       Score: abort, accept, accord, accus, across, action, activ 
## Topic 11 Top Words:
##       Highest Prob: school, student, said, group, district, high, campus 
##       FREX: student, school, campus, district, allianc, curriculum, high 
##       Lift: curriculum, student, campus, antiabort, school, lawsuit, allianc 
##       Score: abl, abort, abus, accept, accord, account, accus 
## Topic 12 Top Words:
##       Highest Prob: judg, child, yearold, custodi, believ, said, parent 
##       FREX: child, judg, custodi, yearold, admit, parent, youth 
##       Lift: custodi, child, charl, scott, tabl, yearold, admit 
##       Score: abandon, abil, abl, abort, abus, accept, accord 
## Topic 13 Top Words:
##       Highest Prob: play, love, work, charact, young, two, novel 
##       FREX: play, novel, love, denial, pain, symbol, charact 
##       Lift: denial, loneli, sensat, discoveri, jack, fantasi, passion 
##       Score: abandon, abil, abort, absolut, accept, accord, account 
## Topic 14 Top Words:
##       Highest Prob: democrat, said, republican, right, campaign, parti, bush 
##       FREX: democrat, republican, bush, parti, candid, voter, campaign 
##       Lift: denounc, deleg, gop, bush, republican, log, cabin 
##       Score: abandon, abil, abl, abort, absenc, abus, accept 
## Topic 15 Top Words:
##       Highest Prob: said, univers, communiti, counti, orang, gay, peopl 
##       FREX: orang, design, univers, convent, counti, festiv, pride 
##       Lift: design, laguna, moor, orang, grove, festiv, univers 
##       Score: abil, abort, accept, accord, account, accus, across 
## Topic 16 Top Words:
##       Highest Prob: said, offic, aid, test, discharg, peopl, health 
##       FREX: test, donor, discharg, navi, offic, dannemey, health 
##       Lift: donor, anonym, procedur, test, tennesse, avoid, eight 
##       Score: abil, abl, abort, absolut, abus, accept, accord 
## Topic 17 Top Words:
##       Highest Prob: right, said, state, main, group, peopl, colorado 
##       FREX: main, colorado, draw, measur, voter, protect, right 
##       Lift: draw, colorado, oregon, main, statewid, referendum, discourag 
##       Score: abandon, abil, abort, absenc, abus, accept, accord 
## Topic 18 Top Words:
##       Highest Prob: east, year, presid, right, last, polit, charact 
##       FREX: east, obama, percent, poet, presid, charact, clinton 
##       Lift: east, poet, deliber, partial, consult, passion, brief 
##       Score: abil, abort, accept, accord, accus, achiev, acknowledg 
## Topic 19 Top Words:
##       Highest Prob: book, discuss, one, will, peopl, cult, age 
##       FREX: discuss, cult, erot, book, aspect, age, drawn 
##       Lift: erot, cult, juvenil, materi, drawn, vulner, masturb 
##       Score: abandon, abl, abort, abus, accept, accord, account 
## Topic 20 Top Words:
##       Highest Prob: coupl, marriag, sex, union, said, civil, law 
##       FREX: coupl, union, marriag, civil, jersey, spous, vermont 
##       Lift: evolv, jersey, vermont, spous, union, coupl, goldstein 
##       Score: abil, abl, abort, absolut, abus, accept, accord 
## Topic 21 Top Words:
##       Highest Prob: young, one, god, men, forbid, boy, anoth 
##       FREX: young, god, forbid, attack, fellow, boy, enough 
##       Lift: forbid, god, weapon, constant, young, permiss, habit 
##       Score: abandon, abil, abl, abort, absenc, absolut, abus 
## Topic 22 Top Words:
##       Highest Prob: said, offic, presid, will, white, remark, britain 
##       FREX: britain, remark, white, foreign, offic, presid, clinton 
##       Lift: foreign, britain, ceremoni, bigot, leagu, apolog, suspend 
##       Score: abort, abus, accept, accord, account, accus, acknowledg 
## Topic 23 Top Words:
##       Highest Prob: marri, said, famili, marriag, children, man, one 
##       FREX: marri, husband, famili, wife, children, forward, dear 
##       Lift: forward, dear, happili, sister, husband, mayb, physician 
##       Score: abandon, abil, abl, abort, accept, accord, account 
## Topic 24 Top Words:
##       Highest Prob: film, movi, play, name, music, one, man 
##       FREX: broadway, movi, film, brother, music, franc, actor 
##       Lift: franc, broadway, nomine, match, pregnant, french, actress 
##       Score: abandon, abl, abort, accept, accord, account, accus 
## Topic 25 Top Words:
##       Highest Prob: said, gay, militari, privat, ban, polici, obama 
##       FREX: privat, militari, obama, gate, ban, serv, polici 
##       Lift: gate, obama, testifi, soldier, pentagon, veteran, privat 
##       Score: abort, abus, accept, accord, account, accus, acknowledg 
## Topic 26 Top Words:
##       Highest Prob: use, gone, homosexu, said, park, three, site 
##       FREX: gone, use, site, park, manhattan, yesterday, photograph 
##       Lift: gone, squar, site, manhattan, park, photograph, materi 
##       Score: abandon, abil, abl, abort, absenc, absolut, abus 
## Topic 27 Top Words:
##       Highest Prob: said, peopl, communiti, gay, violenc, group, victim 
##       FREX: violenc, victim, attack, increas, communiti, conserv, grand 
##       Lift: grand, violenc, cook, ordain, bay, urban, victim 
##       Score: abil, abort, abus, accept, accord, accus, acknowledg 
## Topic 28 Top Words:
##       Highest Prob: church, cathol, said, priest, homosexu, digniti, bishop 
##       FREX: church, priest, cathol, digniti, bishop, roman, rev 
##       Lift: headlin, roman, priest, digniti, church, episcop, cathol 
##       Score: abort, absenc, absolut, abus, accept, accord, account 
## Topic 29 Top Words:
##       Highest Prob: militari, homosexu, servic, discharg, clinton, heard, said 
##       FREX: militari, heard, clinton, discharg, servic, administr, navi 
##       Lift: heard, militari, reserv, navi, pentagon, duti, discharg 
##       Score: abil, abl, abort, absenc, absolut, accept, accord 
## Topic 30 Top Words:
##       Highest Prob: right, bill, support, said, civil, group, democrat 
##       FREX: bill, introduc, democrat, civil, legisl, right, support 
##       Lift: introduc, leadership, bill, spoken, chose, tax, brown 
##       Score: abil, abl, abort, absenc, absolut, abus, accept 
## Topic 31 Top Words:
##       Highest Prob: san, francisco, milk, citi, right, polit, year 
##       FREX: milk, francisco, san, alan, land, harvey, california 
##       Lift: land, milk, castro, alan, harvey, chines, supervisor 
##       Score: abil, abl, abort, absolut, abus, accept, accord 
## Topic 32 Top Words:
##       Highest Prob: discrimin, citi, protect, orient, homosexu, sexual, job 
##       FREX: discrimin, orient, protect, employ, job, hire, prohibit 
##       Lift: lengthi, defeat, prejudic, affirm, discrimin, disabl, curb 
##       Score: abil, abl, abort, absenc, absolut, abus, accept 
## Topic 33 Top Words:
##       Highest Prob: center, health, say, men, diseas, use, servic 
##       FREX: center, diseas, sport, medicin, health, hospit, model 
##       Lift: medicin, sport, canada, inhibit, center, artifici, routin 
##       Score: abandon, abil, abl, abort, absolut, abus, accept 
## Topic 34 Top Words:
##       Highest Prob: school, homosexu, teacher, two, high, american, book 
##       FREX: teacher, seen, nazi, high, happi, evid, school 
##       Lift: nazi, experienc, arriv, persecut, happi, error, germani 
##       Score: abandon, abil, abort, absenc, abus, accept, accord 
## Topic 35 Top Words:
##       Highest Prob: one, know, onto, rape, victim, anoth, play 
##       FREX: onto, rape, song, bash, band, mrs, got 
##       Lift: onto, makeup, rape, song, band, lyric, babi 
##       Score: abandon, abl, abort, absolut, abus, accept, accord 
## Topic 36 Top Words:
##       Highest Prob: marriag, court, sex, state, coupl, said, right 
##       FREX: marriag, court, california, constitut, coupl, licens, francisco 
##       Lift: overturn, schwarzenegg, stewart, atti, licens, constitut, marriag 
##       Score: abil, abl, abort, abus, accord, accus, achiev 
## Topic 37 Top Words:
##       Highest Prob: men, show, said, one, televis, pornograph, art 
##       FREX: pornograph, televis, pledg, art, featur, less, close 
##       Lift: pornograph, pledg, poster, childhood, preced, documentari, mysteri 
##       Score: abil, abl, abort, abus, accept, accord, account 
## Topic 38 Top Words:
##       Highest Prob: one, novel, possess, also, play, movement, first 
##       FREX: possess, novel, movement, adam, written, hero, latent 
##       Lift: possess, latent, escap, adam, scale, modern, pleasur 
##       Score: abandon, abort, absolut, accept, accord, account, accus 
## Topic 39 Top Words:
##       Highest Prob: group, said, right, women, ralli, union, organ 
##       FREX: ralli, boycott, group, union, plan, newspap, parti 
##       Lift: ralli, boycott, socialist, irish, labor, patrick, briefli 
##       Score: abil, abl, abort, abus, accept, accord, account 
## Topic 40 Top Words:
##       Highest Prob: one, new, home, play, polic, make, restaur 
##       FREX: restaur, home, area, north, hair, daniel, street 
##       Lift: restaur, daniel, north, hair, asia, lake, eastern 
##       Score: abandon, abort, accept, accord, accus, achiev, across 
## Topic 41 Top Words:
##       Highest Prob: state, secur, homosexu, govern, alleg, depart, charg 
##       FREX: secur, alleg, govern, file, charg, depart, rid 
##       Lift: rid, alleg, secur, subvers, penal, suspend, secretari 
##       Score: abandon, abil, abl, abort, absenc, absolut, abus 
## Topic 42 Top Words:
##       Highest Prob: novel, time, will, one, kind, make, sexual 
##       FREX: novel, search, fiction, father, english, kind, manag 
##       Lift: search, pop, intellig, fiction, pathet, homo, episod 
##       Score: abandon, abil, abl, abort, absolut, accept, accord 
## Topic 43 Top Words:
##       Highest Prob: paul, bishop, one, church, christian, relationship, condemn 
##       FREX: bishop, paul, sentiment, bibl, condemn, christian, offens 
##       Lift: sentiment, bishop, holi, bibl, paul, scriptur, orthodox 
##       Score: abl, abort, abus, accept, accord, account, accus 
## Topic 44 Top Words:
##       Highest Prob: issu, say, appear, shoot, magazin, clear, point 
##       FREX: shoot, magazin, evangel, appear, humor, pastor, clear 
##       Lift: shoot, pastor, preacher, evangel, eloqu, mail, racism 
##       Score: abandon, abil, abl, abort, abus, accept, accord 
## Topic 45 Top Words:
##       Highest Prob: aid, peopl, alcohol, gay, new, mani, will 
##       FREX: alcohol, aid, hiv, drug, abus, sing, religi 
##       Lift: sing, hiv, alcohol, outreach, ministri, condom, abus 
##       Score: abandon, abil, abort, absolut, abus, accept, account 
## Topic 46 Top Words:
##       Highest Prob: homosexu, sexual, moral, right, peopl, subject, prohibit 
##       FREX: prohibit, sought, propos, committe, moral, congress, subject 
##       Lift: sought, congression, disapprov, minist, pervert, hatr, biblic 
##       Score: abil, abl, abort, absenc, absolut, abus, accept 
## Topic 47 Top Words:
##       Highest Prob: aid, men, drug, said, report, risk, health 
##       FREX: spread, infect, drug, risk, blood, user, aid 
##       Lift: spread, intraven, needl, user, infect, transmit, blood 
##       Score: abandon, abil, abl, abort, absolut, abus, accept 
## Topic 48 Top Words:
##       Highest Prob: said, frank, togeth, coupl, year, time, startl 
##       FREX: frank, togeth, startl, town, coupl, spous, passag 
##       Lift: startl, frank, town, togeth, plot, passag, intellig 
##       Score: abandon, abil, abl, abort, absenc, absolut, accept 
## Topic 49 Top Words:
##       Highest Prob: word, queer, use, said, film, group, charact 
##       FREX: queer, word, film, strain, media, screen, attend 
##       Lift: strain, queer, word, blue, feminin, floor, bag 
##       Score: abandon, abil, abort, absolut, abus, accept, accord 
## Topic 50 Top Words:
##       Highest Prob: like, man, men, film, straight, get, make 
##       FREX: straight, studio, closet, movi, hudson, theater, film 
##       Lift: studio, hudson, camera, documentari, satir, strang, percept 
##       Score: abort, absolut, accept, action, activist, actor, actress 
## Topic 51 Top Words:
##       Highest Prob: one, show, now, peopl, even, old, public 
##       FREX: feminist, old, progress, stuff, show, late, piec 
##       Lift: stuff, progress, feminist, smooth, sir, remind, bear 
##       Score: abil, abort, absolut, accept, accord, accus, achiev 
## Topic 52 Top Words:
##       Highest Prob: investig, state, senat, subcommitte, depart, inform, homosexu 
##       FREX: investig, subcommitte, senat, sen, mccarthi, inform, depart 
##       Lift: subcommitte, mccarthi, investig, briefli, sen, roy, probe 
##       Score: abandon, abil, abort, abus, accept, accord, account 
## Topic 53 Top Words:
##       Highest Prob: mother, one, love, play, daughter, man, made 
##       FREX: daughter, taught, mother, moment, discov, recal, happi 
##       Lift: taught, delight, larri, daughter, widow, comfort, compel 
##       Score: abandon, abil, abl, abort, abus, accept, accord 
## Topic 54 Top Words:
##       Highest Prob: compani, employe, said, partner, benefit, right, polici 
##       FREX: employe, compani, benefit, partner, extend, domest, thoma 
##       Lift: thoma, employe, compani, extend, coverag, qualifi, workplac 
##       Score: abil, abort, absolut, abus, accept, accord, account 
## Topic 55 Top Words:
##       Highest Prob: cultur, murder, anoth, man, kill, tough, involv 
##       FREX: murder, cultur, tough, kill, game, anoth, involv 
##       Lift: tough, game, kill, murder, cultur, quiet, construct 
##       Score: abandon, abil, abl, abort, absenc, absolut, abus

The plot below shows how common each topic is:

plot.STM(stmfit,type="summary",xlim=c(0,0.1))