Analysis of the SkillCraft dataset
Introduction
In-game screenshot from StarCraft II
Ah, this brings me back!
To be honest, this dataset wasn’t selected at random; I stumbled upon it whilst browsing Kaggle. I’ve been enthusiastic about StarCraft II and its predecessor StarCraft for a long time, although I haven’t played or watched any games recently for lack of time and because other things took priority. So, naturally, when I saw the SkillCraft dataset on Kaggle (link) I had to check it out :)
StarCraft II
StarCraft II (or SC2) is a real-time strategy (RTS) game with a large community and several professional leagues. Before the start of each game you pick one of three races (or have one selected at random), and you start by constructing a base of research and construction facilities. The ultimate goal of each (regular) match is then to either
- Destroy all your opponent’s buildings or
- Make your opponent forfeit the game (i.e. create a situation in which it is clear that your opponent has no chance of winning anymore)
Players have to devise specialized strategies, frequently and quickly adjusting to their opponent’s moves and finding a delicate balance between developing their base, training an army and mining resources (minerals and gas).
Player statistics in SC2
During the game, spectators can assess several statistics for each player. Besides intuitive measures such as the number of workers or units built, buildings constructed and resources mined, other stats such as the APM (actions per minute) allow one to assess a player’s game activity or even skill.
The dataset gathers several such measures from across the SC2 scene, conveniently packaged for us to analyze, so let’s check it out!
Data exploration
The dataset contains 20 columns and 3338 rows. A detailed description of the dataset can be found on the UCI ML repository.
Print the column names:
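The original code chunk is not shown here (and the analysis itself appears to have been done in R), so as a rough stand-in, here is a minimal Python sketch of reading the header row. The inline CSV is an assumption: it contains only the seven columns discussed in this post, with two rows copied from the outlier table further down, and it stores the league as a numeric code.

```python
import csv
import io

# Tiny stand-in for the real file: only the seven columns used in this post
# (the full dataset has 20). Rows are copied from the outlier table further
# down; storing Diamond as 5 and Master as 6 is an assumption.
SAMPLE_CSV = """GameID,LeagueIndex,Age,TotalHours,APM,SelectByHotkeys,ActionLatency
5140,5,18,1000000,281.4246,0.0234282,36.1266
6518,6,20,25000,247.0164,0.0157938,37.1837
"""

reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
column_names = reader.fieldnames  # header row, in file order
print(column_names)
```

For the real file you would pass an opened file handle to `csv.DictReader` instead of the in-memory string.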
Variable selection
Alright. We’ve got a GameID, LeagueIndex (there are 7 leagues, see below, in contrast to the 8 leagues indicated in the description) and Age as some general player statistics (in the rest of the analysis I’ll assume that ‘GameID’ is actually ‘GamerID’, and that the collected stats are summaries over the gamer’s game history).
From the available game statistics, for now I’ll choose some easy ones, i.e. APM, TotalHours as well as SelectByHotkeys (you can hotkey certain unit groups and buildings). Let’s also pick ActionLatency; I’m not quite sure what it represents, but it should be something along the lines of how quickly a player performs actions in response to certain events.
Quick data quality check
But first, let’s get a feel for the data: we check for missing values and get summary stats per selected column (in the table below, values > 0 indicate missing values):
LeagueIndex | GameID | Age | TotalHours | APM | SelectByHotkeys | ActionLatency |
---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 |
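A count like the one in the table can be sketched as follows. This is a Python illustration under assumptions, not the original code: missing cells are represented as `None`, leagues are stored as codes 1-7, and the three rows are copied from the outlier table further down.

```python
from collections import defaultdict

# Rows copied from the outlier table in this post; the numeric league codes
# (Diamond = 5, Master = 6) are an assumption about the raw encoding.
rows = [
    {"LeagueIndex": 5, "GameID": 5140, "Age": 18, "TotalHours": 1000000},
    {"LeagueIndex": 6, "GameID": 6518, "Age": 20, "TotalHours": 25000},
    {"LeagueIndex": 5, "GameID": 2246, "Age": 22, "TotalHours": 20000},
]

def missing_by_league(rows, columns):
    """Count missing (None) cells per column, grouped by LeagueIndex."""
    counts = defaultdict(lambda: dict.fromkeys(columns, 0))
    for row in rows:
        league_counts = counts[row["LeagueIndex"]]  # creates the entry if new
        for col in columns:
            if row.get(col) is None:
                league_counts[col] += 1
    return dict(counts)

missing = missing_by_league(rows, ["GameID", "Age", "TotalHours"])
for league in sorted(missing):
    print(league, missing[league])
```

Every league that occurs in the data gets a row, even when all of its counts are zero, which mirrors the layout of the table.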
Game leagues
Alright, the quick check does not reveal any major problems with the data, so let’s go on. Within the game there are ‘leagues’, which are encoded in these data with numbers from 1-7 (7 being the highest league). We don’t like this representation very much, so let’s replace it with some more ‘speaking’ names.
NOTE: We keep the correct order of the leagues which is helpful for understanding later plots by using ordered factors
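To illustrate the idea of order-preserving labels outside of R, here is a small Python sketch. It assumes the codes 1-7 map to Bronze through GrandMaster in that order; the helper names `league_name` and `league_rank` are made up for this example.

```python
# League names in ascending order. The position in the list encodes the
# ranking, playing the role of the levels of an ordered factor.
LEAGUES = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "GrandMaster"]

def league_name(code):
    """Map a 1-based league code to its name."""
    if not 1 <= code <= len(LEAGUES):
        raise ValueError(f"unknown league code: {code}")
    return LEAGUES[code - 1]

def league_rank(name):
    """Map a league name back to its 1-based rank."""
    return LEAGUES.index(name) + 1

codes = [5, 6, 4, 3, 7]
names = [league_name(c) for c in codes]
ordered = sorted(names, key=league_rank)  # league order, not alphabetical
print(ordered)
```

Sorting by rank rather than by name is what keeps later plots and tables in league order instead of alphabetical order.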
Let’s have a look at the top of the new table:
General overview
Ok, now that we’ve got some nice names for our leagues, let’s create some basic overview plots for our selected variables over the different leagues.
Above we can see some interesting stuff already. The first thing we notice is that most players reside somewhere in the ‘medium’ leagues. It seems relatively easy to progress to Platinum/Diamond; however, advancing to the Master or even GrandMaster league requires some serious skill. The age distribution is about the same over all the leagues, no real surprises there, although in GrandMaster we seem to have a slightly younger population. High APM seems to be somewhat more common in the GrandMaster league than in the others.
Now, ActionLatency and SelectByHotkeys look a bit more interesting. Apparently, most gamers in the GrandMaster league have a low latency and hence react quickly to events (which would make sense). Similarly, Master and GrandMaster gamers seem to select their units using hotkeys more often than players in the other leagues. Overall, the mean ActionLatency seems to wander from somewhere around 100 down to maybe 30-40 as we go from Bronze to GrandMaster, and the mean of SelectByHotkeys shifts in a similar way, in the opposite direction. We can quickly check the means of the respective distributions:
Indeed, the higher the league the lower the ActionLatency and the higher the SelectByHotkeys statistic.
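Computing such per-league means is a plain group-and-average step. Here is a Python sketch of the mechanics, not the original R code. The six rows are copied from the outlier table further down, so they are far too few (and too unusual) to reproduce the trend itself.

```python
from collections import defaultdict
from statistics import mean

# (league, ActionLatency, SelectByHotkeys), copied from the outlier table
# in this post. Six rows only: this shows the mechanics, not the trend.
rows = [
    ("Diamond", 36.1266, 0.0234282),
    ("Master", 37.1837, 0.0157938),
    ("Diamond", 45.3760, 0.0237032),
    ("Platinum", 63.9811, 0.0119831),
    ("Gold", 84.6340, 0.0007798),
    ("GrandMaster", 41.7671, 0.0090397),
]

def mean_by_league(rows, value_index):
    """Average the column at value_index separately for each league."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row[value_index])
    return {league: mean(values) for league, values in groups.items()}

latency_means = mean_by_league(rows, 1)
hotkey_means = mean_by_league(rows, 2)
for league, value in latency_means.items():
    print(f"{league}: mean ActionLatency {value:.2f}")
```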
Now let’s check on the TotalHours:
Zonk! Something’s wrong here! It seems that a few players have an extraordinary number of played hours under their belt; we should look at this in more detail!
Total hours played
Since we can’t really see anything in the plot, we check the table for outliers. We sort the table by TotalHours in decreasing order and check the head:
GameID | LeagueIndex | Age | TotalHours | APM | SelectByHotkeys | ActionLatency |
---|---|---|---|---|---|---|
5140 | Diamond | 18 | 1000000 | 281.4246 | 0.0234282 | 36.1266 |
6518 | Master | 20 | 25000 | 247.0164 | 0.0157938 | 37.1837 |
2246 | Diamond | 22 | 20000 | 248.0490 | 0.0237032 | 45.3760 |
5610 | Platinum | 22 | 18000 | 152.2374 | 0.0119831 | 63.9811 |
6242 | Gold | 24 | 10260 | 76.5852 | 0.0007798 | 84.6340 |
72 | GrandMaster | 17 | 10000 | 212.6022 | 0.0090397 | 41.7671 |
Crazy! There is one player (age 18!) who has apparently a preposterous total of 1,000,000 played hours…
OK, let’s do the math, shall we? 1,000,000 hours would be 41,666 days, or about 114 years. Though we appreciate the effort, for the sake of our further analysis we should filter out unreasonable total hours in general.
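The conversion is simple enough to check in a few lines of Python. The only assumption is the year length of 365.25 days, which is what the 10-year cut-off of 87,660 hours implies.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365.25  # assumption: average year length including leap years

total_hours = 1_000_000
days = total_hours / HOURS_PER_DAY
years = days / DAYS_PER_YEAR

# int() truncates, matching the whole number of days quoted in the text
print(f"{int(days):,} days, about {round(years)} years")
```

Either way, an 18-year-old cannot have played for more than a century.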
NOTE: There’s a chance we interpreted the dataset wrong, since there is no real description available on Kaggle. Anyway, for now we just go with it.
We remove all players with TotalHours > 8.766 × 10^4 hours (10 years) and check out the plots again:
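As a sketch of this filtering step (in Python, not the original R code), the cut-off can be derived from the 10-year limit and applied to the six rows of the table above. Reducing each row to a `(GameID, TotalHours)` pair is a simplification made for this example.

```python
# 10 years expressed in hours, assuming 365.25 days per year
MAX_HOURS = 10 * 365.25 * 24

# (GameID, TotalHours), copied from the outlier table above
players = [
    (5140, 1000000),
    (6518, 25000),
    (2246, 20000),
    (5610, 18000),
    (6242, 10260),
    (72, 10000),
]

kept = [(game_id, hours) for game_id, hours in players if hours <= MAX_HOURS]
removed = [(game_id, hours) for game_id, hours in players if hours > MAX_HOURS]

print(f"cut-off: {MAX_HOURS:,.0f} hours")
print("removed GameIDs:", [game_id for game_id, _ in removed])
```

Of these six rows, only the player with 1,000,000 hours exceeds the cut-off; the others stay in the data.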
Ah, this looks much more like it! We can now see that only very few players have about 20,000 TotalHours (roughly 2.28 years of non-stop play), still crazy! Let’s do the by-league plot once more (in log-space), and let’s again look at the means:
This nicely matches our expectations! On average, players placed in a higher league seem to have played the game for longer than their fellow ‘lower-leaguers’!
LeagueIndex is indicative of APM, ActionLatency and TotalHours
Now we can check whether our favourite variables are indeed determined by the player’s current league placement.
First get an impression of whether a high amount of total hours played indicates a high APM or low ActionLatency. Again, we take the hours in log-space.
Ah, I like these plots, they allow us to extract some interesting information (and plainly look fancy!). We can see that APM is generally higher in higher leagues than in lower ones, and the same goes for the TotalHours played. Similarly, ActionLatency is lower in higher leagues, and it also correlates somewhat with the TotalHours played.
Now we shall look at this information a bit differently, doing violin plots instead of these points:
Here we can see quite clearly that, for each of the chosen variables, the LeagueIndex is indeed indicative of their values. We can assess this further using linear models. We now fit and report a linear model for each variable, using LeagueIndex as the independent variable and the respective variable as the dependent one.
As you can see, we again observe a linear dependence between all our variables and the LeagueIndex. We probably should explore this a bit more, but for now let’s go on with predicting league placement!
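For readers who want to see what such a fit boils down to, here is a minimal least-squares sketch in Python. It is not the original model code, and it makes two assumptions: the league enters as a numeric code 1-7 (an ordered factor in R may be handled differently), and the six data points are the outlier rows from the table above, so the fitted numbers are purely illustrative.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# League codes (assumed: Gold=3, Platinum=4, Diamond=5, Master=6,
# GrandMaster=7) and APM values copied from the outlier table above.
league = [5, 6, 5, 4, 3, 7]
apm = [281.4246, 247.0164, 248.0490, 152.2374, 76.5852, 212.6022]

slope, intercept = fit_line(league, apm)
print(f"APM = {intercept:.1f} + {slope:.1f} * LeagueIndex")
```

A positive slope means that APM increases with the league; on the full dataset one would also want to look at residuals and the fit quality before calling the relationship linear.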
Predicting league placement
Great! Some machine learning for predicting league placement, eh? (Don’t worry, this will be very high-level and we won’t go into much detail for now.) We build a random forest model (i.e. an ensemble of decision trees, wiki link) using the randomForest package, which is based on Breiman and Cutler’s original implementation and available on CRAN. In the model we use all our variables as inputs and try to predict the LeagueIndex, using default parameters except for the number of trees, which we set to 10,000.
NOTE: Again, this is not a very sophisticated approach as we do it here, we just wanna see how good we can get by blindly applying this model. We’ll probably return to this at a later point.
Performance
Wonderful, we got some results! Hmmm, but what do we see here? Unfortunately, we were not able to do a satisfying job: the out-of-bag (OOB) error rate is around 64%! Well, let’s move on for now.
Mock predictions
Looking at the confusion matrix, however, we seem to be fairly ‘close’ with most predictions (if we assume that e.g. Silver league is similar to Bronze and Gold, etc.). Since this analysis is just for fun, let’s define new ‘mock’ leagues, dividing players into ‘good’ and ‘bad’ ones. Maybe we can do better on these two classes?
As we can see, the OOB error rate is a lot better at about 22%. It’s something!
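Collapsing the seven leagues into two mock classes could look like the Python sketch below. Where exactly the line between ‘bad’ and ‘good’ was drawn is not stated here, so the `GOOD_FROM` cut-off is purely hypothetical, as is the helper name `mock_label`.

```python
LEAGUES = ["Bronze", "Silver", "Gold", "Platinum", "Diamond", "Master", "GrandMaster"]

# Hypothetical cut-off: the post does not say where the split was made.
GOOD_FROM = "Diamond"

def mock_label(league):
    """Return 'good' for leagues at or above GOOD_FROM, 'bad' otherwise."""
    is_good = LEAGUES.index(league) >= LEAGUES.index(GOOD_FROM)
    return "good" if is_good else "bad"

labels = {league: mock_label(league) for league in LEAGUES}
print(labels)
```

Moving the cut-off changes the class balance, and with it the error rate a classifier can reach, so the two-class result should be read with that in mind.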
Feature importance
Finally, let’s quickly check on the feature importance for the mock predictions.
The MeanDecreaseGini is highest for the ActionLatency, suggesting that this variable is the most important one for predicting ‘bad’ and ‘good’ players in our mock prediction experiment (the higher the index in general, the ‘more important’ the respective variable).
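To give a feel for what this index measures: each split in a tree reduces the Gini impurity of the class labels, and MeanDecreaseGini adds up those reductions for all splits on a variable, averaged over the trees. The Python sketch below computes the reduction for a single split. The ActionLatency values come from the outlier table above, while the good/bad labels and the threshold of 50 are hypothetical choices for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(labels, feature, threshold):
    """Impurity reduction from splitting on feature <= threshold."""
    left = [lab for lab, x in zip(labels, feature) if x <= threshold]
    right = [lab for lab, x in zip(labels, feature) if x > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# ActionLatency values copied from the outlier table above; the labels
# assume (hypothetically) that Diamond and higher count as 'good'.
latency = [36.1266, 37.1837, 45.3760, 63.9811, 84.6340, 41.7671]
labels = ["good", "good", "good", "bad", "bad", "good"]

decrease = gini_decrease(labels, latency, threshold=50.0)
print(f"Gini decrease for ActionLatency <= 50: {decrease:.3f}")
```

A split that separates the classes perfectly removes all of the impurity, which is why variables that split the data well accumulate a large index.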
Conclusion
We looked at some of the more intuitive variables accessible in the SkillCraft dataset and got a nice feel for the data using rather straightforward ggplot functionality.
Overall, we were able to see that the league placement already gives a hint on the average APM, ActionLatency and TotalHours played for each player.
We finished with some experimental machine learning on these data, simply applying random forests to the three variables to predict the LeagueIndex. Whereas this didn’t work well for the multi-class case, a mock prediction experiment in which we naively replaced the LeagueIndex with a two-class label gave ‘ok’ results.
This could be followed up by a more sophisticated approach, using more variables, doing some serious variable selection and normalization prior to applying any model and finally evaluating the results using e.g. the AUC or F1 measure.
But for now we are happy with what we have done and can go on to the next task!
Until then, farewell!
-or-
Khas il’adare - for anyone ‘speaking’ Khalani