*** For those just interested in the code and running it, you can find it on my GitHub here: https://github.com/ScottBurger/steam_augury
I’ve been interested in recommendation systems for a while, and I thought I’d try my hand at a new implementation in Python. My previous version, built in R, leveraged lots of data pulled from unreliable sources via web scraping. This time around I was inspired by Valve’s recent work on their Steam Labs initiative.
I thought this was interesting and definitely quelled some of the weekly Reddit outrage about Steam game discovery and searching for new things to play. I certainly found some interesting additions to my own wishlist in the process.
But while the Steam Labs implementation is good, it doesn’t really offer much in the way of granularity. What does ‘popular’ really mean in this context? Why can’t I filter based on review score?
This led me to try my hand at writing another recommendation system, this time in Python and to open-source the code so everyone could see what was going on under the hood.
The idea here is to recommend games for users based on games they’ve had a high level of engagement with.
- Starting with user data, we look at how many hours they’ve played each game and compare that to the user’s median playtime. Games well above the median approach an engagement value capped at 1, while less-played games score progressively lower.
- We then take that engagement value (0 to 1) and apply it proportionally (based on the number of votes per tag) to the tags of the games a user has played, grouping by tag name. This gives us a ‘fingerprint’ of engagement for the user.
- We then apply that fingerprint to each game we have Steam data for (excluding DLC and other non-game entities, for speed), giving us a score per game.
- However, because some games have only a single tag, which obfuscates the system, we can impose some additional requirements to bring out more signal. I.e., if a game has more than the median number of tags (decent game detail), more than the median number of total tag votes (decent signal for that detail), etc., then we flag the game as meeting the filter criteria.
- By sorting by the filter criteria first, then the fingerprint score, we can find a better list of games to recommend. Since we’re just flagging them instead of filtering, we can let the user explore the rest of the dataset to find games that they might find interesting that the filter would have otherwise left out.
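The fingerprint-building steps above can be sketched end to end in a few lines. Everything here is illustrative: the game names, hours, and tag vote counts are made up, not real Steam data.

```python
import math
import statistics
from collections import defaultdict

# Made-up numbers for illustration; real playtimes come from the Steam API
# and tag votes from the scraped store pages.
playtimes = {"Squad": 120.0, "APE OUT": 2.0, "Hades": 40.0}  # hours played
game_tags = {
    "Squad":   {"Military": 500, "FPS": 400, "Simulation": 100},  # tag votes
    "APE OUT": {"Action": 300, "Music": 200},
    "Hades":   {"Roguelike": 600, "Action": 400},
}

median_hours = statistics.median(playtimes.values())

def engagement(hours):
    # Sigmoid of the difference from the median playtime: heavily played
    # games approach 1, games below the median tail off toward 0.
    return 1 / (1 + math.exp(-(hours - median_hours)))

# Spread each game's engagement across its tags by vote share, then
# group by tag name to build the user's fingerprint.
fingerprint = defaultdict(float)
for game, hours in playtimes.items():
    total_votes = sum(game_tags[game].values())
    for tag, votes in game_tags[game].items():
        fingerprint[tag] += engagement(hours) * votes / total_votes
```

The same fingerprint dictionary is what later gets dotted against every game in the catalog.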
My previous system was more focused on seeing if I could predict which games on a user’s ranked Steam wishlist warranted more attention based on the user’s playtime. In this case, I’m more interested in seeing which games out of Steam’s complete library we should focus on, not just the wishlist. It’s probably also worth noting that users like me who keep a ranked or ordered wishlist are likely a minority on Steam, so the previous implementation wouldn’t be super useful for the general population.
This time I tried to leverage pure Steam data as much as I could. However, the Steam API proved frustrating, since I couldn’t get all of my data in one place. When querying the Steam API for a current list of its entire catalog, you get everything. Games, apps, DLC, soundtracks, movies, and more. Turns out there’s a lot on there! I’m only interested in the games aspect and running the rest of the system on non-game entities will only slow it down, so the first step was to get only game data. Another annoying aspect is that the Steam API doesn’t tell you what’s a game and what isn’t from its complete catalog:
So the API here will give us all sorts of junk data. The value here is that we have a complete list of everything on Steam. The problem is figuring out which of this is actual game data that we care about.
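For reference, the catalog comes from the ISteamApps/GetAppList/v2 endpoint, and the response really is just appids and names, with no type field to separate games from DLC. A minimal parse of that shape (sample data standing in for a live call; the soundtrack entry and its appid are made up):

```python
import json

# The live call is GET https://api.steampowered.com/ISteamApps/GetAppList/v2/
# This sample mimics its shape; note there is no field saying what's a game.
raw = json.loads("""
{"applist": {"apps": [
    {"appid": 632360, "name": "Risk of Rain 2"},
    {"appid": 999999, "name": "Some Game Soundtrack"}
]}}
""")

# appid -> name lookup for everything on Steam, games and junk alike
apps = {app["appid"]: app["name"] for app in raw["applist"]["apps"]}
```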
To solve that issue, I built two web scrapers in R: one to build out a table of only games from the master list and grab metadata like the description and release date, the other to scrape review data and the tag data that we want. These could likely have been combined into a single step, but I wasn’t totally sure what the percent breakdown between games and non-games was on Steam (lots of DLC!) and whether it would take longer to run a combined script versus running the review+tags script on just games.
These scripts can be found in the data folder of the code, but they aren’t the focus here, so I’ll move on. There are some “fun” things in there about how to beat the mature-gate page, annoying regex for parsing tag data, and XML node parsing for getting the review data. Again: if it were all in the API I’d be so much happier.
User Data and the Return of the Sigmoid
In the previous post, I decided on using a sigmoid function to balance out game engagement levels. Part of the reason behind this was to keep games that had a disproportionate amount of time spent in them from saturating the system, so I wouldn’t just be recommended anything that’s Planetside 2 adjacent (800+ hours, but I’m not going to be putting that much time in Planetside Arena, I’ll tell you that much).
As a recap: we take the playtime from the Steam API (in minutes), convert it to hours, take the difference between that and the median playtime across all my games with non-zero playtime (in my case, 6.23 hours), and take the sigmoid of that difference.
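In code, that recap looks something like this (the 6.23-hour median is my own number from above; yours will differ):

```python
import math

MEDIAN_HOURS = 6.23  # my median playtime over games with non-zero hours

def sigmoid_engagement(playtime_minutes):
    # Steam reports playtime in minutes; convert to hours, then take the
    # sigmoid of the difference from the median playtime.
    hours = playtime_minutes / 60
    return 1 / (1 + math.exp(-(hours - MEDIAN_HOURS)))
```

A game right at the median scores 0.5, something like my 800+ hours of Planetside 2 saturates at essentially 1, and 2 hours of APE OUT lands down near 0.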
Some pros and cons here. A pro is that I no longer have to rely on either SteamSpy or SteamDB scraped data for the population’s median playtime. Part of that is because that data is super unreliable (some games showing 500+ hours of median playtime depending on the day and which users are logged in). However, a con is that for a game like APE OUT, my total of 2 hours doesn’t seem like much, but most people haven’t played even that long. So while I’m technically more engaged than the typical user, comparing APE OUT against the rest of my game library doesn’t do it any favors, unfortunately.
For each game in this list, we’ll take the sigmoid value and multiply it across the tags, weighted by each tag’s share of the votes:
Let’s take a look at how the game Squad goes through this process:
By digging through the tag data that we have, we can see the tag name (top 7 listed above out of 20), the count of how many times it’s been applied, and what percent of the total tags it makes up. My personal sigmoid value is 1, so I’m just going to multiply 1 against all of those tag percent values. That final score is my engagement score for that tag for that game. We then sum up all the scores per tag name to get the final fingerprint:
Here are the top entries out of the 269 total tags I’ve engaged with.
Application to All Steam Games
Using basically the same methodology as above, we take the total user fingerprint, apply it to an individual game, and sum the result. An example for the recently released game Creature in the Well:
Do note that the tag counts and distributions here may change over time, but here it is in action. We have the tag names, their counts, the percent of total for those tags, the sigmoid values joined from the user fingerprint, and the final tag_score, which is the multiplication of the two. The final score for this game would be 2.270758 after summing up all the final column values.
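The per-game score is effectively a dot product between the game’s tag vote shares and the user fingerprint. A sketch with made-up fingerprint values and vote counts, not the actual Creature in the Well data:

```python
# Made-up fingerprint weights and tag votes, for illustration only.
fingerprint = {"Action": 1.8, "Dungeon Crawler": 0.4, "Pinball": 0.1}
tag_votes = {"Action": 120, "Dungeon Crawler": 90, "Pinball": 30}

total = sum(tag_votes.values())
score = sum(
    fingerprint.get(tag, 0.0) * votes / total  # tags the user never engaged with add 0
    for tag, votes in tag_votes.items()
)
```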
Once we have that final score for each game, we just sort by it, right? Well, one issue here: if a game has only one tag and it happens to be the tag you’re most engaged with, the final score is going to be just that tag’s value:
What we see in this case are the games sorted by the score descending. Notice they’re all one-tag games with basically no reviews. There may be some gems here, but more than likely this is just noise in the data.
Instead, we want some kind of meaningful signal. For that, I built a flag: if a game beats the median on each attribute (number of tags, total tag votes, positive reviews, total reviews), we flag it as 1. That way, if we sort by that flag and then by the actual game score, we get a lot more relevant data:
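A rough version of that flag-then-sort logic, using toy per-game stats rather than the real scraped data (the field names here are illustrative):

```python
import statistics

# Toy per-game stats; the real numbers come from the scraped Steam data.
games = [
    {"name": "A", "score": 0.9, "n_tags": 1,  "tag_votes": 20,   "total_reviews": 2},
    {"name": "B", "score": 0.7, "n_tags": 12, "tag_votes": 4000, "total_reviews": 900},
    {"name": "C", "score": 0.6, "n_tags": 8,  "tag_votes": 300,  "total_reviews": 50},
]

# Median of each attribute across all games
medians = {
    key: statistics.median(g[key] for g in games)
    for key in ("n_tags", "tag_votes", "total_reviews")
}

for g in games:
    # Flag = 1 only if the game clears the median on every attribute
    g["flag"] = int(all(g[k] > medians[k] for k in medians))

# Sort by the flag first, then by the fingerprint score; flagging instead
# of filtering keeps the low-signal games around for manual exploration.
ranked = sorted(games, key=lambda g: (g["flag"], g["score"]), reverse=True)
```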
There have definitely been some games at this level of sorting that have appealed to me a lot more than the previous sorting method.
I was excited to see what the top games were from this method and for the most part a lot of them aligned with what I like. A top down tank game? Sure. A claymation side scrolling shooter? Sounds interesting. Disney princesses? Uhh….. Quake?
For the most part the tags on Steam work pretty well. But for games like Disney Princess, whose principal tags are “Action”, “Female Protagonist”, and “Family Friendly”, there can be some bleed-over in the recommendations.
The console output here will tell you the top 10 games based on that filtered score, but it also dumps a tab-separated text document that contains all the data. I find that much more interesting to explore anyway.
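Writing that tab-separated dump needs nothing beyond the standard library; a sketch (the column names and rows here are illustrative, not the script’s actual output schema):

```python
import csv
import io

# Toy rows; the real script dumps every scored game.
rows = [
    {"name": "Game A", "score": 2.27, "flag": 1},
    {"name": "Game B", "score": 0.41, "flag": 0},
]

# Using an in-memory buffer here; swap in open("output.txt", "w", newline="")
# to write an actual file.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "score", "flag"], delimiter="\t")
writer.writeheader()
writer.writerows(rows)
```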
One interesting insight from all this was seeing that it takes a minimum of 20 tag votes for a tag to register as official on a Steam game’s product page. Most games have at least 6 different tags and at least 115 total tag votes. I also found it rather interesting that Steam’s site was pretty unreliable for scraping purposes (not that it’s meant for that) and that I’d have to restart the script every few hours to resume from 503 web errors. Steam not having a review or tags API is also a huge bummer; one would make things a lot simpler to analyze.
In my previous R implementation, I had the idea to compare the sigmoid of the user’s playtime versus that of the population. The problem there was that it took forever to get that data, it had to be scraped, and it was ‘ok’ in terms of reliability. I wanted to test and see if there was a noticeable difference between using the sigmoid on the user’s data versus that of the population’s.
I tested with 5 users and compared the rank of total playtime with the rank of the predicted app scores, building the system on 70% of each user’s played games and testing on the rest. The results gave an RMSE of 10.07 for the user-only data and 11.76 for the data leveraging the population. Clearly, five people is a pretty small sample for a big comparison, but given how much effort it takes to get the population playtime distributions, there isn’t an obvious benefit as of yet. More data needed, maybe?
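The comparison metric is just root-mean-square error between the two rank orderings; something along these lines (the rank lists below are toy values, not the actual test data):

```python
import math

def rmse(actual_ranks, predicted_ranks):
    # Root-mean-square error between two equal-length rank lists.
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual_ranks, predicted_ranks))
        / len(actual_ranks)
    )
```

A perfect prediction scores 0, and larger values mean the predicted ordering drifts further from the playtime ordering.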
This system was originally built in R and then re-written in Python. It seems like everything works just as well in Python, but for whatever reason the loop that executes the per-game score evaluation takes 3.6x as long as it does in R, which seems crazy, since R doesn’t have native support for the dictionary format that Steam’s tag distributions are stored in. Either the fromJSON() function in R is crazy optimized, or (more likely) I need to tune my Python code to be more performant.
One thing I’d like to explore further with this code would be different evaluation metrics. What if we use a tanh() function instead of a sigmoid in order to punish games below the median playtime? Another interesting idea to explore here would be to modify the code to take a list of users, build their individual recommendations, then filter to games that have the highest score that contain a multiplayer category. Maybe that would be less optimized than something like collaborative filtering, but an interesting pivot to the process nonetheless.
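For what it’s worth, the tanh variant would be a one-line change. Since tanh ranges over (-1, 1), games below the median playtime would contribute negative weight to their tags instead of merely small weight, as with the sigmoid:

```python
import math

def tanh_engagement(hours, median_hours):
    # tanh ranges over (-1, 1): games below the median playtime get a
    # negative weight, actively penalizing their tags rather than just
    # contributing less, as the sigmoid does.
    return math.tanh(hours - median_hours)
```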
Featured image via the movie Anaconda‘s poster. Risk of Rain 2 image courtesy of: https://steamcommunity.com/app/632360/images/