It’s often hard to figure out ways to review a product. How do we determine if it’s worth our time and focus? How do we assess what other people think about that product? In the case of the gaming sector: is it good enough to warrant me playing it?
In 2015 I gave a talk at the University of Chicago Booth School of Business as part of the inaugural run of The Data Science Conference. That talk centered around how we go about applying the fancy tech of natural language processing to user reviews for games.
Games and Ratings
How good do you think this game is?
If we didn’t know any better, we would assume this was some kind of advert in a magazine, given how much promotional material is all over it. In fact, it’s a game’s box and, by the looks of it, a highly rated one if we trust their marketing department.
But how do we go about assessing a game’s health beyond just a static number? Do those review scores change over time? Can we gauge people’s reactions and emotions in accordance with changes to those games? Since games these days are becoming more of a software-as-a-service as opposed to standalone products, it makes sense for us to look at how these values can change over time.
For this analysis, I’m focusing on PC games, and in particular games released on the digital distribution platform Steam. So why PC? Well, going by 2015 data, Steam’s Windows platform had by far the most data available.
So with the largest number of games comes the largest pool of user reviews and the largest amount of data for us to build some interesting sentiment analysis off of.
Steam’s recommendation system has a lot of interesting data in it for us to explore. We can glean whether the game was recommended, how helpful other users found that recommendation, the date it was posted, and how many hours the user has spent with that game.
In this review for Counter-Strike: Global Offensive, we have a positive recommendation, a lot of hours put in, a high helpfulness percentage, and some text explaining their reasoning. This text presents a boon of info for us to delve into: are there any specific keywords that point to the health of the game over time? Likewise: is there a specific emotion or sentiment the user is trying to highlight with this game?
First I went through and scraped a bunch of reviews off the Steam store to build data sets for various games that look something like this:
Lots of good info that we can quantify later. Part of the trickiness was that Steam’s store doesn’t have an API for reviews, so I more or less had to build a crawler to get the data in the first place. In any case, we have dates and the review text, which are essential for text mining and for looking at how things change over time.
Our R script used to mine out the sentiment and polarity from the text runs like this:
    library(sentiment)
    library(plyr)  # loaded by the original script; not used in this snippet

    # read one review per line from the system clipboard (first line is a header)
    some_txt <- read.table("clipboard", quote="", sep="\n", header=TRUE)

    # classify emotion
    class_emo <- classify_emotion(some_txt, algorithm="bayes", prior=1.0)
    # get emotion best fit
    emotion <- class_emo[,7]
    # substitute NA's by "unknown"
    emotion[is.na(emotion)] <- "unknown"

    # classify polarity
    class_pol <- classify_polarity(some_txt, algorithm="bayes")
    # get polarity best fit
    polarity <- class_pol[,4]

    # data frame with results
    sent_df <- data.frame(text=some_txt, emotion=emotion,
                          polarity=polarity, stringsAsFactors=FALSE)

    # sort data frame: order emotion levels from most to least frequent
    sent_df <- within(sent_df,
      emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
The above code takes in some kind of text like "the ball is red" and calculates its emotional sentiment and its positive/negative polarity using a naive Bayes classifier that's been trained on a pre-built lexicon mapping words to emotions and polarities. The possible labels are:
- Emotions: anger, disgust, fear, joy, sadness, surprise, unknown
- Polarity: positive, neutral, negative
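To make the idea of a word-to-label mapping concrete, here is a minimal base-R sketch of lexicon-based polarity scoring. It is not the algorithm the `sentiment` package uses internally: the six-word `toy_lexicon` and the `classify_polarity_toy` function are hypothetical names made up for illustration, and the scoring is a simple majority vote rather than naive Bayes.

```r
# Hypothetical toy lexicon; a real lexicon holds thousands of entries.
toy_lexicon <- c(great = "positive", fun = "positive", love = "positive",
                 broken = "negative", boring = "negative", crash = "negative")

# Label a single review by counting positive vs. negative lexicon hits.
classify_polarity_toy <- function(text) {
  # Lowercase, replace anything that is not a letter or space, split on whitespace
  words <- strsplit(gsub("[^a-z ]", " ", tolower(text)), "\\s+")[[1]]
  hits <- toy_lexicon[words[words %in% names(toy_lexicon)]]
  pos <- sum(hits == "positive")
  neg <- sum(hits == "negative")
  if (pos > neg) "positive" else if (neg > pos) "negative" else "neutral"
}

reviews <- c("Great game, so much fun!", "Boring and broken.", "the ball is red")
labels <- vapply(reviews, classify_polarity_toy, character(1), USE.NAMES = FALSE)
print(labels)
# "positive" "negative" "neutral"
```

Note that "the ball is red" comes back neutral simply because none of its words are in the lexicon, which is the same reason so many reviews can end up without a clear label.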
Counter-Strike: Global Offensive
CSGO, as it’s more succinctly known, is a team-heavy tactical shooter game that’s had some explosive growth in the past couple of years. We want to see if the emotional sentiment has changed in response to some rather big updates the game has had.
First, the game recommendations. It’s a pretty solid line over time (orange) and only really starts to come down in mid-2015. Disregard the last couple of months of data (August and September 2015), since those are partial and the trendline there might not be accurate.
The emotional sentiment over time is visualized here. We see a big chunk of it is labeled as “joy” and the other big chunk is “unknown”. The unknown space might be better explored with a more in-depth classifier, or it may be that a lot of the text being fed in simply can’t be classified to begin with.
We can look at the union of emotional context and comment polarity as well. In this case we are subsetting the “joy” band from before into positive, neutral, and negative sentiment. Looks pretty consistent so far.
We can also just look at the positive and negative comment polarities themselves (no emotion, no neutral). This gives us an interesting curve to work with:
This curve comes from the same data as the previous plot, but it shows only the ratio of positive to negative comments over time. It starts out very positive, then levels out as time goes on.
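A by-month curve like this can be computed in a few lines of base R. The sketch below uses a tiny made-up `reviews` data frame (the dates and labels are invented, not scraped data), and it assumes the "ratio" means the share of positive reviews among positive and negative ones, with neutral reviews dropped.

```r
# Hypothetical sample: one row per review, with a date and a polarity label
reviews <- data.frame(
  date = as.Date(c("2015-01-03", "2015-01-20", "2015-01-25",
                   "2015-02-02", "2015-02-14", "2015-02-28")),
  polarity = c("positive", "positive", "negative",
               "positive", "neutral", "negative"),
  stringsAsFactors = FALSE
)

# Keep only positive and negative reviews, then bucket them by month
polar <- reviews[reviews$polarity %in% c("positive", "negative"), ]
polar$month <- format(polar$date, "%Y-%m")

# Share of positive reviews among the polarized reviews, per month
positive_share <- tapply(polar$polarity == "positive", polar$month, mean)
print(positive_share)
```

With this sample, January comes out at two positive reviews out of three and February at one out of two.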
We can use the timestamps in our data to look at before and after pictures as well. On 8/13/13, CSGO launched its in-app purchasing scheme. In the two weeks before, we had 96.5% recommendations and in the two weeks immediately after: 100%. We can look at emotions and polarity too:
Emotions in two weeks before and after Arms Deal update:
Polarity in two weeks before and after Arms Deal update:
It’s interesting that there was a 38% increase in fear after the update, and almost twice as much negative commentary. Yet the recommendation rate rose by about 3.5 percentage points (from 96.5% to 100%) during that time.
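The before-and-after comparison boils down to filtering reviews into two equal windows around an event date. Here is a rough sketch with invented data; `window_compare` is a hypothetical helper, and treating the event day itself as part of the "after" window is my assumption.

```r
# Hypothetical helper: recommendation rate in equal windows around an event date.
# The event day itself is counted in the "after" window (an assumption).
window_compare <- function(df, event_date, days = 14) {
  before <- df[df$date >= event_date - days & df$date < event_date, ]
  after  <- df[df$date >= event_date & df$date < event_date + days, ]
  c(before = mean(before$recommended), after = mean(after$recommended))
}

# Invented sample reviews around the 2013-08-13 update
reviews <- data.frame(
  date = as.Date("2013-08-13") + c(-10, -5, -3, -1, 0, 2, 6, 12),
  recommended = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
)

rates <- window_compare(reviews, as.Date("2013-08-13"))
print(rates)

# Change expressed in percentage points rather than percent
point_change <- unname((rates["after"] - rates["before"]) * 100)
print(point_change)
```

The same filter works for emotion or polarity counts: swap the `mean(...)` calls for `table(before$emotion)` and `table(after$emotion)`.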
Comparison with Other Games
In my talk I went in depth on Dota 2, Batman: Arkham Knight, Payday 2, and Planetside 2. I’ll skip over the emotional sentiment for those individual looks and just put the most important plots: the combined figures.
Plotting all the games’ recommendations over time shows most of them are pretty flat, with the exception of Payday 2 and Batman: Arkham Knight (BAK). For the emotional plots, it’s hard to pick which ones to compare against, since there are so many options to choose from. So I put all the polarities on the same plot instead:
The by-month polarities (ratio of positive to negative only) are sort of reasonable for Dota 2 and CSGO. Planetside 2’s polarity varies wildly. BAK is stable but negative, whereas even during Payday 2’s “golden age” the polarity of people’s reviews wasn’t exactly stellar. How do the regression fits compare?
This shows us the model fit for each game instead. What’s interesting here is that all of the models seem to want to converge at 50% polarity over a given time. An artifact of cumulative review sentiment scoring maybe?
Ohh A Sarcasm Detector…Oh That’s a Real Useful Invention
One interesting tidbit from the polarity of the user reviews is that it doesn’t really seem to correlate well with the end state of the review: thumbs up or thumbs down.
Here we have the three polarities (negative, neutral, and positive) and their total number of not-recommends and yes-recommends. On the right hand side, for positive polarity comments, we have a vast majority of recommended reviews. However, for neutral polarity there’s still a lot of recommended reviews. In the negative category, we see about the same distribution still. This could be due to users having a sarcastic attitude in the comments like “oh man this game totally STINKS *wink wink*” with 1000+ hours played and a thumbs up review.
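The comparison behind that chart is a plain cross-tabulation of polarity against the recommend flag. A minimal sketch with made-up rows (the `reviews` data frame here is hypothetical, not the CSGO data):

```r
# Invented sample: classifier polarity vs. the reviewer's actual thumbs up/down
reviews <- data.frame(
  polarity = c("positive", "positive", "neutral",
               "negative", "negative", "neutral"),
  recommended = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Counts of recommended / not recommended within each polarity
counts <- table(polarity = reviews$polarity, recommended = reviews$recommended)
print(counts)

# Row-wise proportions: of the reviews with a given polarity, what share recommend?
shares <- prop.table(counts, margin = 1)
print(shares)
```

If polarity tracked the thumbs up or down closely, the negative row would be dominated by the `FALSE` column. A negative row that still shows a high `TRUE` share is the sarcasm pattern described above.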
Myself Under the Microscope
I’m a big gamer and, of course, I have Steam games that I’ve reviewed. So I decided to apply this analysis to my own reviews to see what the result was.
Interestingly, for reviews that had a positive polarity to them, the split between recommended games and not-recommended games is much closer. Neutral seems about right given what we’ve seen from CSGO’s corpus, but the negative reviews seem to indicate that (if the sentiment polarity is right) a positive review of mine might still have some negative things to say about the game.
My emotional analysis over time suggests I use lots of joyful words:
This is also reflected in the total counts, shown here in bar chart form:
Conclusions and Takeaways
The sentiment analysis package I was using in R doesn’t really seem to work well for snarky user reviews. Likewise, it struggles with words that live almost exclusively on the internet, like “bantz”, which don’t have a meaning tied to them yet. There are lots of unknown emotions that need to be further hashed out. Maybe more emotions need to be added to the training data, but there’s certainly room to grow.
Polarity, on the other hand, might be a better metric for a game’s overall review corpus. It’s a lot more straightforward than emotional sentiment (what’s the emotional value of the word “red” anyway?), and it allows us to better understand the positive/negative nature of the reviews themselves.
The best metric to use, though, is still the thumbs up or thumbs down approach. Those take out all the guesswork and can be easily tracked over time.