The Data Science of Board Games – Exploring the BoardGameGeek Database

Board games have changed. Much of the time when a friend suggests breaking out a board game to play, people will sigh. Monopoly, Life, dare I go so far as to say Checkers too? These are games that have been around forever (and by forever I mean the 1950s), take way too long to play, have no real flavor to them, and everyone ends up in a sour mood at the end of the night. Doubly so if you’re playing Monopoly the correct way! Too often have people scoffed and said “more like B-O-R-E-D games.”

But board games have changed since then. In fact, new board games have been produced at such a crazy rate, that since 66% of all board games on the market today have been made after 2001. There are games of galactic conquest, games that are hybrids of tetris and patch quilting, even games about cooperative firework shows. The landscape has certainly changed as we have entered a so-called golden age of board games.

The development of the site boardgamegeek.com led to the creation of a huge database of board games, with a wealth of information to dive into. The BGGdatabase contains over 250,000 board games worth of information in it. Each game having a litany of data like its publication date, number of supported players, playing time, rating, and so forth. This is the kind of data that begs to be analyzed.

If you’re coming here for analysis and don’t really care so much about coding, feel free to skip directly to the “Analysis Time” section for pretty charts.

Data Collection and Cleanup

Boardgamegeek.com has a rather nice API for getting information programmatically, but I wanted to basically build out my own database of information. The conclusion I came to was to just hold my nose and literally download every board game’s info from the site. Is this an efficient process? Probably not, but for curiosity’s sake, I let a web scraper run for a couple days to collect all the data. If you do wind up scraping BGG, just do the standard courtesy of one scrape every couple seconds or so to minimize traffic impact.

The other issue was storing data. I had gone through the xml api from BGG and pulled the data for each board game as an xml file. You can see an example for the game Patchwork here. There’s lots of good stuff to parse out and analyze, but there’s also 250k files! I wound up splitting these into 5 directories of 50k files each to keep hard drive read/write issues to a minimum.

Part of the issue with the size of the BGG database is the signal to noise ratio. There’s lots and lots of games that have no ratings or any information about them other than a name in some cases. In fact, most of the site’s XML database is either game expansions, or stuff that isn’t even a board game to begin with. I was quite surprised to see information about SNES games in there, so I had to write some checks and filters to exclude what seemed to be a partial database of digital games as well.

So out of maybe 250k total files, only about 100k are related to board games and their expansions. Moreover, only about 17k games and expansions have enough ratings to warrant a generated bayesian ranking (called the Geek Rank) on the site’s database.

In an effort to drastically simplify the dataset, I only looked at games that had at least 30 ratings. These are games that make the BGG “geek rank” and get an official ranking on the site’s database. There may be some overlooked specific gems in that huge pile of data, but since we’re more interested in trends, that stuff would more or less fall out of the analysis anyway.

Building the Dataset

I got a list of IDs that have a known geek rank in a rather inefficient way. I basically built the database first, then found the IDs that had more than 30 votes and saved those into an array. The code below just builds out the simplified database to start so we don’t have to worry about all the noise.


ids_bayes <- read.table("bgg_ids_bayes.txt", header=T, sep = "\t", as.is = T)

ids_list <- as.vector(ids_bayes$ID)

fulldata <- data.frame(matrix(data=0, ncol=21, nrow=1))
names(fulldata) <- c(
"matrix.data...0..nrow...1..ncol...1.", "name" , "ID" , "year" , "minplayers" , "maxplayers" , "numratings" , "rating" , "bayesaverage" , "playingtime" , "minplaytime" , "maxplaytime" , "numowners" , "recommendedage" , "weight" , "numawards" , "numexpansions" , "balance", "balance_sd", "total_recplayer_votes", "expansion_flag"
)

tempdf <- data.frame(matrix(data = 0, nrow = 1, ncol = 1))

The above code is just setting up a temporary data frame which we'll append to as we go through our huge collection of XML files.

for(i in ids_list){

pg % xml_text()) == 0){
next
}else if(length(xml_nodes(pg, xpath=”//name”) %>% xml_text())>1){
tempdf$name % xml_text()
}else{
tempdf$name % xml_text()
}

#data elements
tempdf$ID <- i
tempdf$year % xml_text()
tempdf$minplayers % xml_text()
tempdf$maxplayers % xml_text()
tempdf$numratings % xml_text()
tempdf$rating % xml_text()
tempdf$bayesaverage % xml_text()
tempdf$playingtime % xml_text()
tempdf$minplaytime % xml_text()
tempdf$maxplaytime % xml_text()
tempdf$numowners % xml_text()
tempdf$recommendedage % xml_text()
tempdf$weight % xml_text()
tempdf$numawards % xml_text())
tempdf$numexpansions % xml_text())

#a temporary check to see if the XML file is a board game, or an expansion
expansion_check % xml_text(), sep=””, collapse=””)

#if it’s an expansion, label it as such
if(length(grep(“Expansion”, expansion_check)) > 0){
tempdf$expansion_flag = 1
} else {
tempdf$expansion_flag = 0
}

The above code chunk only really has 3 parts to it:

  1. A rather complicated name check to make sure the game’s being displayed properly. Lots of games can have either no data, or multiple entries, so we go with the one tagged as the primary name for our analysis.
  2. Our main data table is being built, row by row, by the next chunk which just parses out the data we’re interested in from the XML file.
  3. Creating a flag to make sure we can filter out board game expansions later. Expansions are listed basically the same as normal board games in the database, which is a little irritating, but we can work around that.

Beasts of Balance

We’re not quite done building the dataset yet, but I want to briefly explain the next section. There was a user comment somewhere on the internet asking a very interesting question: “what games play better with more people?” There’s a couple ways to answer that question, but I sort of turned it on its head and took a stab at a similar, but different analysis of: “what games play the best with any number of supported players?” Basically figuring out a balance metric.

The BGG database has user-based ranks for the game’s ratings, but also for what recommended numbers of players a game plays best with. Of course, this is all biased by the people who rate that stuff, but for a sufficient sample size we can get a rough idea if a game plays well with 2-4 people, but really plays bad with 5, for example.

This next chunk of code allows us to delve into that, albeit in a more complicated way that I would have first thought. From the poll data having to do with player count recommendations, I’m first getting the range of supported players, then figuring out the standard deviation of the percent who recommend playing at that player count.

So for example, the game 7 wonders supports (you guessed it) 7 players. On the BGG site, we can see how certain player ranges are better than others:

7wonders.png

The BGG site breaks recommendations out between Best, Recommended, and Not Recommended. My first bold assumption is that it’s easier to make a split between a game that’s not recommended at a certain player count and at a player count that is recommended. So for example, it’s pretty obvious that 7 wonders is not recommended for 2 players (even though that’s supported by the rules), but recommended for 3 players. The game is “best” with 4-5 players, but is that really much easier to discern than “recommended”? I feel like the gulf between “no” and “yes” is much bigger than “yes” and “yes!”.

So my process is to basically take the percent recommended (or 1 – the “not recommended” percent) for each player count, then take the standard deviation.

Because the BGG database is subject to human error, we have to put in checks for player counts that have a maximum player count greater than the minimum, the game is not an expansion, and doesn’t have any NULL values. We also have to index between the supported players range and NOT the values listed in the picture above. For whatever reason, every one of those polls has a single player option, even for games that don’t support it.

There has to be a more compact way of writing the code below, but that’s basically what it’s finishing out the FOR() loop with: a simple standard deviation number. Whew!


xml_parse <- xmlParse(i)

best <- xpathSApply(
doc=xml_parse,
path="//poll[@name='suggested_numplayers']//results//result[@value='Best']"
,xmlAttrs)

rec <- xpathSApply(
doc=xml_parse,
path="//poll[@name='suggested_numplayers']//results//result[@value='Recommended']"
,xmlAttrs)

not_rec <- xpathSApply(
doc=xml_parse,
path="//poll[@name='suggested_numplayers']//results//result[@value='Not Recommended']"
,xmlAttrs)

min_players = as.numeric(tempdf$minplayers)
max_players = as.numeric(tempdf$maxplayers)
vec  0){
if(ncol(best) < max_players){
index <- ncol(best)
} else {
index  min_players){
if(tempdf$expansion_flag == 0){
if(!is.null(ncol(best))){

for(j in min_players:index){

reccomends = sum(
as.numeric(best[2,j])
,as.numeric(rec[2,j])
)

total = sum(
as.numeric(best[2,j])
,as.numeric(rec[2,j])
,as.numeric(not_rec[2,j])
)

x <- round(reccomends / total, digits=2)
vec <- c(vec, x)

}

tempdf$balance <- paste(vec,collapse="|")
tempdf$balance_sd <- sd(vec)

tempdf$total_recplayer_votes <- sum(
sum(as.numeric(best[2,]))
, sum(as.numeric(rec[2,]))
, sum(as.numeric(not_rec[2,]))
)

}

} else{

tempdf$balance = "NA"
tempdf$balance_sd = "NA"
tempdf$total_recplayer_votes = "NA"

}

} else {

tempdf$balance = "NA"
tempdf$balance_sd = "NA"
tempdf$total_recplayer_votes = "NA"

}
}else{
tempdf$balance = "NA"
tempdf$balance_sd = "NA"
tempdf$total_recplayer_votes = "NA"
}

fulldata <- rbind(fulldata, tempdf)

}

Analysis Time!

Finally we can start to do some deep diving in the data and not have to worry about code or people setting the maximum player range for their game to 99+ (I’m looking at you, Werewolf). If you’re interested in playing around with the dataset, feel free to download it in tab-separated form from here.

If you’ve skipped over the R code in order to get to the cool plots, I’ve bamboozled you. We still have a little bit of cleanup to do before trying to plot anything coherent. Because some games have their playtime listed as high as like 6000 minutes, we need to filter those edge cases out in order to have our plots be readable.


library(ggplot2)

bggdata <- read.table("C:\\Users\\scott\\Desktop\\gamestest\\new\\bgg_full_4_2_17.txt", quote="", header = T, sep = "\t", as.is = T, fill = T)

bgg_games 0)
bgg_games 0 & playingtime<240)
bgg_rated2 0)
bgg_rated2 <- subset(bgg_rated2, weight<8)

segment <- 60
bgg_rated2$gametime <- floor(bgg_rated2$playingtime/segment)
bgg_rated2$floorweight <- floor(bgg_rated2$weight)

What I'm doing here is filtering out any game that has a 0 minimum playtime (often garbage data anyway) and has a playtime less than 240 minutes. Games that have a user-rated weight, or complexity of anything higher than a 5 are rare indeed, so we are still capturing most of the data. Finally, I'm breaking up the playtime of each game into 60 minute chunks. This just makes the plots a little easier to read. By default you can set "segment = 1" and see how that changes things, but it's not pretty.

Ok, no FINALLY let's get to some plots!

Movers and Shakers

The BGG database is very long-tailed in terms of votes. That means there are tons and tons of games that have very low numbers of votes, and very few numbers of games that have a lot of them. In fact 10% of the entire geek rank games’ votes are concentrated in the top 25 games! We can see this in both table and graph form:

numratings_table

numratings_tail

It’s amazing to see almost 3 times as many votes for the games at the top of the list (ordered by number of rating votes) as those at rank 25. This list can be used as a proxy for the most popular games, since the correlation between the number of owners and the number of ratings a game has is very high at 0.94. The correlation between a game’s Geek Rank (its bayesian rating based on votes) and its ownership, though is less correlated at 0.40.

All of the games on that list are so different than the ones people tend to think of in terms of traditional board games. Very briefly:

  • (Settlers of) Catan: A game of rolling dice and buying resources on a randomized hexagonal tile board
  • Carcassonne: A city building game where you score points by placing workers to do stuff for you
  • Pandemic: A co-operative game of trying to fight a global outbreak of disease
  • Dominion: A deck building game with a dizzying amount of customization
  • 7 wonders: A card drafting and empire building game based on ancient times
  • Agricola: A complex game of farming and survival
  • Ticket to Ride: A simple, but highly competitive series of train route building games

These games are a stark difference to the rather dry games of yore like Scrabble, Boggle, or Clue. While some enthusiasts might claim there are other games that have more atmosphere and engaging themes to them than those on the most voted list, there’s still a stark difference between those and the more classic games.

I grew up playing Battleship, Connect Four, and Stratego. I thought those were fun, but not nearly as good as the video games I grew up on. These days, I’ve spent more time playing Race for the Galaxy than I have been playing Starcraft. A stark reversal from the late ’90s.

Guilty Pleasures

One aspect of this database I wanted to explore was what people’s guilty pleasures were. Lots of people own highly rated games, that should come as no surprise, but what are the games that have the highest ownership but the lowest ratings? What are the games most people begrudgingly own?

top_10_leq7

Above we see a table for games that have a ratings of less than, or equal to 7, sorted by number of owners. I seem to remember playing Munchkin a long time ago and thinking it was pretty fun, likewise with Betrayal. I kind of take issue with Chess being rated at a 6.9/10, but these are bayesian rankings and are biased more towards the collector end of the spectrum anyway.

top_10_leq6

Games with less than or equal to a 5 rating seem to be more along the lines of what we think of as “boring” bored games. Some of these are pretty fun in my opinion (Uno, Apples, and Stratego come to mind) but I don’t know of anyone who goes out of their way to play Yahtzee on a regular basis.

top_10_leq5

Here we are in the bottom of the barrel: games with less than or equal to a 4 rating. There are games with lower ratings than these, but they don’t have enough numbers of votes to really warrant being as relevant for discussion here. This is the unholy list of “bored” games people tend to think of. Are these games actually bad? Probably not, but the BGG community certainly seems to think they’re not worth their time. Also is Twister really a board game?

Best Balanced Games

So above we went through a lot of pain to calculate out a game’s balance. I have balanced listed in two ways: one by a percentage by player count, and the other is just a flat standard deviation of those numbers. So for example, Race for the Galaxy supports 2-4 players. It’s sum of “recommended” and “best” votes, divided by the total is 96%. It’s 96% also for 3 players and 94% for 4 players. The standard deviation of those numbers (0.96, 0.96, 0.94) is 0.01154.

Games that have a very low standard deviation of their recommended player ranges are games that play best with any of their supported players. One would think that you wouldn’t print a game if it didn’t play well with certain player ranges, but here we are.

Below is a mathematically formulated list of the most balanced board games listed on the BGG database with at least 500 votes:

top_balance

I had heard people talk up about how Race for the Galaxy was so good for a long time, but seeing it at the top of the balance rankings made me go out of my way to try and learn it. I might be biased because I wrote the balancing algorithm, but the game certainly seems very well balanced at all player counts. Likewise with Dominion and Takenoko.

By all means, feel free to download the raw data file and play around! Maybe your threshold for recommended player votes is lower than mine and you might find a game that’s hiding from my filters.

Rating vs Playtime

One question I wanted to answer was: how does a game’s rating change as its playtime gets longer? We can answer that below with ggplot and by drawing a curve that fits the data. I’m purposely fitting a quadratic curve to the data because I’m working under the suspicion that a game’s rating has to peak and come back down to earth eventually. I want to find where that peak is.


q <- ggplot(bgg_rated2, aes(x = gametime, y = bayesaverage))
q + geom_boxplot(aes(color=factor(gametime)), outlier.shape = NA)+ scale_y_continuous(limits = c(3, 9))+ geom_smooth(method='lm', formula = y~poly(x,2))
test <- lm(bgg_rated2$bayesaverage ~ poly(bgg_rated2$gametime,2, raw = T))
bayes <- function(x) {coef(test)[2]*x + coef(test)[3]*(x)^2 + coef(test)[1]}
optimize(bayes, interval = c(min(bgg_rated2$gametime), max(bgg_rated2$gametime)), maximum = T)

rating_vs_gametime

R tells us that our peak is at 1.98787, so just under 2 hours for peak rating. Games that take at least 1 hour are rated higher than those that can be finished in less than that, and games that take 2 hours are very close in rating still.

Rating vs Complexity

The R code for computing the BGG rating as a function of how complex the game is (or “weight”) is more straightforward:


r <- ggplot(bgg_rated2, aes(x = floorweight, y = bayesaverage))
r + geom_boxplot(aes(color=factor(floorweight))) + geom_smooth(method='lm', formula = y~poly(x,2))

test2 <- lm(bgg_rated2$bayesaverage ~ poly(bgg_rated2$floorweight, 2, raw=T))
bayes2 <- function(x) {coef(test2)[2]*x + coef(test2)[3]*(x)^2 + coef(test2)[1]}
optimize(bayes2, interval = c(min(bgg_rated2$floorweight), max(bgg_rated2$floorweight)), maximum = T)

rating_vs_complexity.png

R tells us the peak of the curve is at 3.823372. So the trend is that rating will go up until the 3.8 level then start to come back down again. What’s fun to look at here are the outliers beyond the whiskers of the boxplots. Games with weight 1 have a huge distribution between the min and max. The ones between 2-5 only have a lot of high ranking outliers at the top end and not the bottom end. That kind of illustrates the bias in the BGG community: very rarely will complex games get a poor rating. The trend does illustrate that very complex games like Virgin Queen are rated lower than middle of the road games, though.

Owners vs Weight


s <- ggplot(bgg_rated2, aes(x = floorweight, y= log(numowners)))
s + geom_boxplot(aes(color=factor(floorweight))) + geom_smooth(method='lm', formula = y~poly(x,2))

test3 <- lm(log(bgg_rated2$numowners) ~ poly(bgg_rated2$floorweight, 2, raw=T))
bayes3 <- function(x) {coef(test3)[2]*x + coef(test3)[3]*(x)^2 + coef(test3)[1]}
optimize(bayes3, interval = c(min(bgg_rated2$floorweight), max(bgg_rated2$floorweight)), maximum = T)

owners_vs_weight

Here R is telling us the peak game complexity that people will buy up to is 2.920287. So more complex than 2, but we start getting diminishing returns beyond 3, which seems to make intuitive sense. What’s interesting to note here is that the peak for ownership is less than the peak of rating: 2.9 vs 3.8. So while people may enjoy playing complex games, they certainly don’t seem to own many of them.

Owners vs Balance

</pre>
bgg_rated2$X &lt;- NULL
bgg_rated2$X.1 &lt;- NULL
bgg_rated3 &lt;- bgg_rated2[complete.cases(bgg_rated2), ]
bgg_rated4 0)

w &lt;- ggplot(bgg_rated4, aes(x = balance_sd, y= log(numowners)))
w + geom_point(alpha=1/10) + geom_smooth(method=&#039;lm&#039;, formula = y~poly(x,2))
<pre>

owners_vs_balance.png

In this case, we are looking at the log(num owners) versus a game’s balance. The trend appears to be decreasing in proportion to balance, which is an interesting insight. One would think that people would prefer a game that’s balanced across all player ranges, but maybe there are more confounding factors at play here than just balance in terms of people’s ownership.

Ratings vs Balance

x <- ggplot(bgg_rated4, aes(x = balance_sd, y= bayesaverage))
x + geom_point(alpha=1/10) + geom_smooth(method='lm', formula = y~poly(x,2))

ratings_vs_balance

Again a decreasing trend in terms of a game’s balance and its bayes rating. Maybe a game playing well with all of its supported player ranges isn’t actually that important to people? The thing about board games is that rules can be modified much easier than in digital games. In fact, a game like 7 wonders that has an abysmal balance rating for 2 players, has a fun fan-made rule set that mimics the 7 Wonders: Duel game’s rule set, but for the original game. 

 

image credit: Sampo Sikiö

One thought on “The Data Science of Board Games – Exploring the BoardGameGeek Database

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s