Season 8 of RuPaul’s Drag Race premiered on March 7. As the season airs, Drag Race fans enjoy speculating about who will make it to the top three and rooting for their favorites. I’ve been diving into machine learning recently, and one of its biggest uses is prediction, so I thought it would be fun to apply a few machine learning algorithms to data about the 100 queens who have appeared on Drag Race and try to predict how season 8 might progress. This is inspired in no small part by Alex Hanna’s excellent survival analysis of season 5, and I use the data she collected, adding in seasons 6-8. If you haven’t read Alex’s posts about season 5 and are a fan of both Drag Race and statistical analysis, I recommend checking them out before reading on.
For those not familiar with RuPaul’s Drag Race, it is a reality competition show, similar to America’s Next Top Model or Project Runway, in which 9-14 drag queens (the number varies each season) must succeed at weekly challenges that test their Charisma, Uniqueness, Nerve, and Talent to become America’s Next Drag Superstar. In recent seasons, the title has come with a cash prize of $100,000, along with various other perks. The weekly challenges take various forms, but usually include sewing challenges, in which queens must make themed garments out of unusual materials, and acting challenges, in which queens act out humorous (and often irreverent) scenes. Every season since season 2 has also included the Snatch Game, a parody of the 1970s TV game show Match Game in which queens must dress up and perform celebrity impressions as panelists, with the week’s guest judges serving as contestants. Every episode, regardless of challenge, ends with a runway walk in a themed look, after which the judges critique the queens on their performance in the challenge as well as their runway look. The two worst performing queens for the week must then lip-sync for their life. Whoever impresses RuPaul the most with her lip-sync gets to stay, and the other queen must sashay away.
For those not familiar with machine learning, it is a family (or, really, several families) of algorithms for exploring and predicting data. There are two broad groups of families: supervised and unsupervised. Unsupervised learning algorithms are used when you have data you would like to group or classify without knowing the right answers beforehand. Principal components analysis and K-means clustering are examples of unsupervised learning algorithms. Supervised algorithms are used when you already know the answer for at least some of your data. You feed in a set of features (independent variables) along with the labels, or answers, and the algorithm works out how to get from the features to the labels. One of the biggest differences between standard statistical analysis and machine learning is that in standard statistical analysis the model is the most important part of the process: the goal is to understand how it gets from the independent variables to the predicted dependent variable, and how those variables relate to each other. In machine learning, the model itself is often treated as a black box, and the focus is instead on how well it predicts the labels.
Meet the Queens
Let’s begin the challenge by meeting our contestants. First up is Support Vector Machines, a classifier with a pretty intuitive algorithm. Imagine you plot points on a two-dimensional graph. A support vector machine (SVM) attempts to separate the groups defined by the labels using a line or curve that maximizes the distance between the dividing line and the closest points. If you have more than two features (as is often the case), the same thing happens, but in a higher-dimensional space.
The next to enter the work room is Gaussian Naive Bayes, an algorithm that is not as intuitive as SVM, but is faster and simpler to implement. Gaussian naive Bayes assumes that the data for each label is generated from a simple Gaussian (or normal) distribution. Using Bayes’ theorem, along with some simplifying assumptions (which are what make it naive), the algorithm uses the features and labels to estimate those Gaussian distributions, which it then uses to make its predictions.
Our third contestant is the Random Forest Classifier. Random forests are aggregations of decision trees (get it!?). Decision trees are classifying algorithms composed of a series of decision points, splitting the data at each decision point to try to properly classify it. Think of a game of Guess Who or Twenty Questions: you ask a series of yes/no questions to sort the possibilities into different bins. Decision trees work the same way, with any number of possible bins. The problem with decision trees is that they tend to overfit the data, meaning that they do a really good job of predicting the training data, but the decision points are so specific to the training data that they aren’t as good at predicting testing data. The solution is to split the training data itself into different subsets, create a decision tree for each subset, and then average those trees together to create a “forest” that typically does a much better job with testing data than a single tree.
The fourth and final contestant is the Random Forest Regressor, a drag sister of the random forest classifier. It works much the same way the classifier does, but rather than predicting unordered categories, it predicts continuous values.
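To make the idea of a maximum-margin dividing line concrete, here is a minimal sketch using scikit-learn's `SVC`. The six points are made up for illustration, and the linear kernel is my choice for the example, not a statement about the settings used for the predictions later in this post.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up for illustration): two features per point, two labels.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel separates the groups with a straight line;
# kernel="rbf" would allow a curved boundary instead.
svm = SVC(kernel="linear")
svm.fit(X, y)

# The support vectors are the training points closest to the boundary.
closest_points = svm.support_vectors_

# Classify two new points, one near each group.
predictions = svm.predict(np.array([[1.2, 1.5], [6.8, 6.2]]))
print(closest_points)
print(predictions)
```

Only the support vectors determine where the boundary sits, which is why moving a point far from the dividing line usually leaves the fitted model unchanged.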
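As a rough illustration of what "estimating a Gaussian per label" means, the sketch below fits scikit-learn's `GaussianNB` to synthetic data drawn from two normal distributions. The data, the seed, and the group centers (0 and 5) are all assumptions made for the example.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data (an assumption for this example): each label's points
# come from its own normal distribution, centered at 0 and at 5.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=1.0, size=(50, 2)),
               rng.normal(loc=5.0, scale=1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

gnb = GaussianNB()
gnb.fit(X, y)

# theta_ holds the estimated mean of each feature for each label,
# which should land close to the true centers used above.
estimated_means = gnb.theta_

# Bayes' theorem turns the fitted distributions into label probabilities.
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
predicted_labels = gnb.predict(new_points)
label_probabilities = gnb.predict_proba(new_points)
print(estimated_means)
print(predicted_labels)
```

The "naive" part is that each feature is modeled independently within a label, so the classifier only has to estimate a mean and a variance per feature rather than a full joint distribution.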
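Here is a small sketch of the overfitting problem and the forest remedy, using scikit-learn on a synthetic dataset from `make_classification`. The dataset, its noise level, and the number of trees are assumptions chosen for illustration. Note that scikit-learn's forest draws bootstrap samples of the training data and also considers a random subset of features at each split, which is one common way of building the "different subsets" described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy classification data (made up for this example).
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A single fully grown tree memorizes the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# A forest combines many trees, each fit to a different bootstrap sample.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

tree_train_acc = tree.score(X_train, y_train)
tree_test_acc = tree.score(X_test, y_test)
forest_test_acc = forest.score(X_test, y_test)
print(f"tree:   train={tree_train_acc:.2f} test={tree_test_acc:.2f}")
print(f"forest: test={forest_test_acc:.2f}")
```

The single tree scores perfectly on the data it was trained on, and the gap between its training and testing accuracy is the overfitting. The forest usually narrows that gap, though the exact numbers depend on the data and the seed.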
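The difference between the two sisters shows up in what they return. In this sketch, the single feature (number of challenge wins) and the placements are invented toy values, and both models come from scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy data (invented): one feature, number of challenge wins,
# and the final placement of each queen.
wins = np.array([[0], [0], [1], [1], [2], [3], [3], [4]])
placement = np.array([10, 9, 8, 7, 5, 3, 2, 1])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(wins, placement)
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(wins, placement)

new_queen = np.array([[2]])

# The classifier treats each placement as an unordered category, so it
# can only return a placement that appeared in the training data.
class_pred = clf.predict(new_queen)[0]

# The regressor averages the numeric outputs of its trees, so the result
# can fall between placements.
reg_pred = reg.predict(new_queen)[0]
print(class_pred, reg_pred)
```

Because the regressor treats placement as a number, it knows that 2nd place is closer to 3rd than to 10th, which the classifier has no way of knowing.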
The Mini Challenge
This week’s mini challenge will require each contestant to study the outcomes of seasons 1 through 6 and then predict who placed where in season 7. In machine learning parlance, seasons 1-6 are the training set, the data on which the algorithms learn their prediction models. Season 7 is the test set: data the algorithms never saw while learning, used to see how well they predict totally new data. I use the same variables as those used by Alex. Briefly, they are:
- Age of the queen
- Whether the queen is Puerto Rican
- Whether the queen is Plus Size
- The total number of main challenges a queen won during the season
- The total number of times a queen was among the top queens for the challenge, but did not win the challenge
- The total number of times a queen was among the worst queens for the challenge, but did not lip-sync
- The total number of times a queen had to lip-sync for her life (including the lip-sync that she sashayed away from)
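The split by season and the feature list above might look like the following in pandas. This is a sketch only: the column names are hypothetical (they are not taken from the actual data file), and the five rows are invented placeholders rather than real queens.

```python
import pandas as pd

# Invented placeholder rows; the column names are hypothetical.
queens = pd.DataFrame({
    "season":       [5, 6, 6, 7, 7],
    "age":          [26, 31, 24, 22, 35],
    "puerto_rican": [0, 1, 0, 0, 0],
    "plus_size":    [0, 0, 1, 0, 1],
    "wins":         [3, 0, 1, 3, 2],
    "highs":        [4, 1, 2, 2, 3],
    "lows":         [1, 2, 1, 0, 1],
    "lipsyncs":     [0, 2, 1, 0, 1],
    "place":        [1, 9, 6, 1, 2],
})

features = ["age", "puerto_rican", "plus_size",
            "wins", "highs", "lows", "lipsyncs"]

# Seasons 1-6 are the training set; season 7 is held out as the test set.
train = queens[queens["season"] <= 6]
test = queens[queens["season"] == 7]

X_train, y_train = train[features], train["place"]
X_test, y_test = test[features], test["place"]
print(X_train.shape, X_test.shape)
```

Splitting by season rather than at random mimics the real task: the algorithms have to predict a season they have never seen.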
For all four algorithms, I rank the predicted placements, as some algorithms did not predict any queen to place first. Ranking the predictions ensures that at least one queen will be predicted to come in first. It also leaves the ordering of the predictions unchanged, so it does not affect Kendall’s Tau, the statistic I use to assess how well each algorithm does overall. Kendall’s Tau ranges from -1 (the predicted placements are the exact opposite of the actual placements) to +1 (the predicted placements are exactly the same as the actual placements). The table below shows how well each algorithm did in this mini-challenge:
|Queen|Actual Placement|Support Vector Machines|Gaussian Naive Bayes|Random Forest Classifier|Random Forest Regressor|Average Predicted Score|
|Jaidynn Diore Fierce|8|12|6|6|6|7.5|
|Mrs. Kasha Davis|11|1|8|10|11|7.5|
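The ranking step and Kendall’s Tau described before the table can be sketched with SciPy. The placements here are hypothetical, and `method="min"` is just one reasonable way to break ties when ranking; I am not claiming it matches the script linked at the end of this post.

```python
from scipy.stats import kendalltau, rankdata

# Hypothetical placements for six queens (not real season data).
actual = [1, 2, 3, 4, 5, 6]
raw_pred = [2, 2, 5, 4, 6, 6]  # note that nobody is predicted to place first

# Ranking the predictions guarantees that someone is predicted first.
# Tied predictions share the lowest rank available to them.
ranked_pred = rankdata(raw_pred, method="min")

# Ranking preserves the ordering (and the ties), so tau does not change.
tau_raw, _ = kendalltau(actual, raw_pred)
tau_ranked, _ = kendalltau(actual, ranked_pred)

# Sanity checks on the two extremes of the scale.
tau_same, _ = kendalltau(actual, actual)
tau_reversed, _ = kendalltau(actual, actual[::-1])
print(list(ranked_pred), tau_raw, tau_ranked)
```

SciPy's `kendalltau` computes the tau-b variant by default, which adjusts for ties. That matters here, because several of the algorithms predict the same placement for more than one queen.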
The SVM gets Violet correct, but there is quite a bit of disagreement throughout. It predicted that Tempest would make it much further than she did, and that Pearl wouldn’t make it nearly as far as she did. SVM’s Kendall’s Tau is 0.221, which is not great.
Gaussian Naive Bayes doesn’t do much better than SVM. There are many ties in the predicted placements, likely because GNB doesn’t have enough data to form coherently separated Gaussian distributions for each placement. Its Kendall’s Tau is higher than that of the support vector machine, though this is likely due to the many ties present in the predicted placements.
Random Forest Classifier does a pretty good job of predicting season 7. Both Violet and Pearl were correctly predicted to be in the top 3. The bottom three were also correctly predicted (though not in the same order). The random forest did predict that Max would make it to the top, and it predicted Ginger would only make it to 6th place. Overall, the random forest classifier achieved a Kendall’s Tau of 0.64, which is decent.
Random Forest Regressor does even better than her sister. Violet and Ginger are both predicted to be top 3, and the bottom four are also correctly predicted. This algorithm has the highest Kendall’s Tau of all the algorithms, 0.73, which is pretty good.
So the winner of this week’s mini-challenge is… Random Forest Regressor
The Maxi Challenge
This week’s main challenge is to predict season 8. Obviously this will be an ongoing endeavor, as we don’t have much data on how the season 8 queens are performing yet. To start off the challenge, let’s see how the contestants do with the data from the two episodes that have aired. The last column averages the predicted placements, weighting each algorithm by its Kendall’s Tau score from season 7:
|Queen|Support Vector Machines|Gaussian Naive Bayes|Random Forest Classifier|Random Forest Regressor|Weighted Average Predicted Score|
|Chi Chi DeVayne|8|3|1|1|2.21|
|Bob the Drag Queen|8|10|8|9|8.82|
|Cynthia Lee Fontaine|8|1|12|12|9.03|
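One way to compute a tau-weighted score like the last column is a weighted arithmetic mean, sketched below. Two things here are assumptions: the weighted-mean formula itself, and the weight for Gaussian Naive Bayes, whose season 7 Tau is not quoted in this post (0.48 is a placeholder value). The other three weights are the Tau scores reported above.

```python
import numpy as np

# Season 7 Kendall's Tau per algorithm. The SVM, RFC, and RFR values are
# the ones quoted in the post; the GNB value is an assumed placeholder.
taus = {"svm": 0.221, "gnb": 0.48, "rfc": 0.64, "rfr": 0.73}

# Predicted placements copied from two rows of the table above.
predictions = {
    "Chi Chi DeVayne":    {"svm": 8, "gnb": 3,  "rfc": 1, "rfr": 1},
    "Bob the Drag Queen": {"svm": 8, "gnb": 10, "rfc": 8, "rfr": 9},
}

def weighted_score(preds, weights):
    """Average the predicted placements, weighting each algorithm by its tau."""
    names = list(weights)
    return float(np.average([preds[n] for n in names],
                            weights=[weights[n] for n in names]))

scores = {queen: weighted_score(p, taus) for queen, p in predictions.items()}
for queen, score in scores.items():
    print(f"{queen}: {score:.2f}")
```

Under these assumptions the two scores come out close to the table's values. The effect of the weighting is that the random forests, which did best on season 7, pull the combined score toward their predictions.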
Acid Betty and Robbie Turner are favorites among all of the algorithms, though Kim Chi and Chi Chi DeVayne come out on top in the best performing algorithms. Laila McQueen was predicted to go home early by nearly all of the algorithms, and she was eliminated in the latest episode. Naysha was predicted to go further by all of the algorithms, even though she was the first queen eliminated this season. (Given the double elimination in the last episode and a teaser that a queen will be coming back, Naysha may still have the opportunity to advance further if she does, in fact, rejoin the competition.) As the season airs, I’ll update the contestant algorithms’ predictions and publish a new post on this blog, so stay tuned.
Check out the data and Python script I use to generate the predictions.