This blog post contains spoilers for episode 3 of season 8 of RuPaul’s Drag Race.
Season 8 of RuPaul’s Drag Race premiered on March 7. As the season airs, Drag Race fans enjoy speculating about who will make it to the top three, rooting for their favorites. I’ve been diving into machine learning recently, and since one of its biggest uses is prediction, I thought it would be fun to apply a few machine learning algorithms to data about the 100 queens who have appeared on Drag Race to try to predict how season 8 might progress. This is inspired in no small part by Alex Hanna’s excellent survival analysis of season 5, and I use the data she collected, adding in seasons 6-8. If you haven’t read Alex’s posts about season 5 and are a fan of both Drag Race and statistical analysis, I recommend checking them out before reading on.
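To give a flavor of the setup, here is a minimal sketch using scikit-learn’s logistic regression. Everything below is illustrative: the features (age, number of challenge wins) and the toy numbers are hypothetical stand-ins, not the actual variables from the queen-level dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: one row per past queen, with hypothetical features
# (age at taping, number of main challenge wins).
X_train = np.array([
    [28, 4],
    [25, 1],
    [31, 3],
    [23, 0],
])
y_train = np.array([1, 0, 1, 0])  # 1 = reached the top three

model = LogisticRegression()
model.fit(X_train, y_train)

# Predicted probability that a current (hypothetical) queen reaches the top three.
prob = model.predict_proba([[27, 2]])[0, 1]
```

In practice the real dataset has many more queens and richer features, but the shape of the problem is the same: fit on past seasons, predict for the current one.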
Today Arizona Governor Jan Brewer vetoed a bill that would have explicitly allowed business owners to refuse service to LGBT persons (or anyone, really) if the business owners invoked a religious freedom objection. The bill was obviously and intentionally anti-gay. What was interesting was the way different media outlets framed the issue. For example, the Washington Post, on their website’s front page, wrote the headline as “Brewer vetoes bill denying service to gays”:
Yesterday I wrote about the co-occurrence of claims about homosexuality in national newspapers since 1950. As part of that analysis, I argued that the denser the co-occurrence networks (that is, the more relationships exist between claims), the more discourse work is being done. If a claim is made without justification or contradiction, the claim is taken at face value; it is not considered controversial. A controversial, or unpopular, claim requires justification and will likely be accompanied by opposing claims, which leads to denser networks. Yesterday I focused on how these claim networks changed over time. Today, I want to briefly explore how these networks differ across types of speakers.
For my dissertation, I coded a sample of 720 newspaper articles published since 1950 that mentioned homosexuality. In the course of coding these articles, I copied any paragraph that mentioned homosexuality; in total, 2,382 paragraphs were copied into my coding application. I then coded each paragraph for the presence of one of 12 claims about homosexuality and/or gay men and lesbians. Using these data, I constructed co-occurrence networks.

I calculated how many times each pair of claims appeared together in the same article, and then how many times we would expect those two claims to appear together by chance, given the total number of articles and the number of articles in which each claim appeared. A tie is present if two claims appeared together more than one standard deviation above this random expectation. Ties are colored blue if the two claims appeared together more than two standard deviations above what we would expect by random chance. The nodes represent claims: red nodes represent negative claims, green nodes represent positive claims, and the size of a node represents how many times that claim appeared in total. The Bad claim is an “other-negative” category for negative claims that did not fit in any of the other claims; similarly, the Good claim is an “other-positive” category for positive claims that did not fit elsewhere.
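The tie rule can be sketched in a few lines of Python. One plausible formalization of the random expectation described above is a hypergeometric (random-mixing) model, under which the expected co-occurrence count and its standard deviation have closed forms; the counts below are made up for illustration, not taken from the dissertation data.

```python
import math

# Hypothetical counts standing in for the coded data.
N = 720                 # total articles in the sample
n_a, n_b = 150, 90      # articles containing claim A and claim B
observed = 35           # articles containing both claims

# Under random mixing (hypergeometric model), the expected number of
# articles containing both claims, and its standard deviation:
expected = n_a * n_b / N
variance = n_a * n_b * (N - n_a) * (N - n_b) / (N**2 * (N - 1))
sd = math.sqrt(variance)

# A tie is drawn when the observed co-occurrence exceeds the expectation
# by more than one standard deviation; ties beyond two standard
# deviations would be colored blue.
tie = observed > expected + sd
strong_tie = observed > expected + 2 * sd
```

Running this over every pair of the 12 claims yields the edge list for the network.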
Who leads the LGBT movement? The answer to that question, predictably, says a lot about the movement overall. I calculated the most-covered LGBT social movement organization (SMO) for each year from 1960 (the first year an organization appears in the newspapers) through 2010. This includes coverage from the New York Times, Los Angeles Times, and Wall Street Journal. The results are in the table at the end of this post.
My dissertation investigates how and why newspaper discourse about homosexuality has changed over time. As I’ve coded my data, I keep seeing the same year marking a turning point in various trends: 1990. Here’s a selection of those trends:
This is part 2 of a series on storing and managing social science data in relational databases. If you haven’t already, read part 1 to get up to speed.
To help illustrate some of the concepts introduced in Part 1, I’m going to use the Dynamics of Collective Action (DOCA) dataset to design an efficient relational database. You can download the entire dataset at the DOCA website. For our purposes, I’m going to refer to a 15-case sample, which you can view here. The DOCA dataset consists of data coded from newspaper coverage of protest events; each case is a distinct protest event. Table 1 contains a description of each of the dataset’s more than 100 variables. For a more detailed description of the dataset, you can browse the DOCA website.
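As a sketch of where this design is headed, here is how a flat file like DOCA might be split into normalized tables using Python’s built-in sqlite3 module. The table names, column names, and sample rows are illustrative stand-ins, not the actual DOCA variable names or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One row per protest event, the unit of analysis in DOCA.
cur.execute("""
    CREATE TABLE events (
        event_id INTEGER PRIMARY KEY,
        event_date TEXT,
        city TEXT,
        state TEXT
    )
""")

# A single event can raise several claims, so claims go in their own
# table keyed back to events, rather than claim1, claim2, ... columns.
cur.execute("""
    CREATE TABLE event_claims (
        event_id INTEGER REFERENCES events(event_id),
        claim TEXT
    )
""")

# Hypothetical sample data.
cur.execute("INSERT INTO events VALUES (1, '1965-03-07', 'Selma', 'AL')")
cur.executemany(
    "INSERT INTO event_claims VALUES (?, ?)",
    [(1, "civil rights"), (1, "voting rights")],
)
conn.commit()

# A join reassembles the flat view: one row per event-claim pair.
rows = cur.execute("""
    SELECT e.city, c.claim
    FROM events e JOIN event_claims c ON e.event_id = c.event_id
""").fetchall()
```

The payoff of this structure is that adding a third claim to an event is a one-row insert, not a new column across the entire dataset.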
This is the first of a series of posts about storing social science data in relational databases. I’ve found that grad school has prepared me for many things: using statistical software to do complex analyses, writing up academic papers and submitting them for review to journals, collecting and coding data for research projects, and presenting and networking at conferences. But I never learned how to properly keep track of and store the data I’ve collected throughout my young academic career. Instead, I’ve cobbled together techniques, borrowing heavily from my skills as an amateur computer and web programmer, that seem to work for me. In an effort to get more social scientists to think about this, I’ve decided to share some of my techniques, beginning with storing data in databases.