A fun data-analysis project with my friends to figure out our soulmates based on MK8D favorite tracks.

I’ve been playing MK8D with friends for a while now and we decided to have a bit of fun in figuring out our most beloved and most hated tracks as a group. To do so, we voted for our favorite tracks and did some data analysis to determine who were the players we share tastes with. Our so called “Mario Kart Soulmates”.

# Voting Process

The first step was to ask everyone to vote in a tiered list. To have the 48 tracks organized in a neat way, we decided to have 8 tiers with 6 tracks each. My own voting ballot looks like this:

## Votes Compilation and Preprocessing

With this template, everyone created their tier list and voted in secret from the other participants (to avoid biases). Once the 14 players submitted their votes, I compiled each one into lists that look like this:

Where the position of the tuple represents the tier (from highest to lowest), and every element within that tuple receives the same amount of votes (it would probably have made more sense to use a set instead of tuple for this).

After repeating this for the whole set, and adding some verification to prevent repetitions or missing tracks, the whole dataset was collated into two dataframes: a table-shaped, and a row-wise one.

# Data Analysis

## Similarity Metric and Matrices

With these two dataframes in place, the first thing we did was to calculate the similarity between the vote-vectors of the participants. To do this, we take each one of the votes and use it as a vector in the shape:

…where each *xn* entry represents the number of votes allocated to the track *n* (sorted globally) from 1 to 8 (the number of tiers). For example, given the global sorting:

…the initial part of my voting vector would look like:

because these were the votes I assigned to “Mario Kart Stadium”, “Water Park”, “Sweet Sweet Canyon”, “GBA Ribbon Road”, “Super Bell Subway”, “Big Blue”; respectively. Repeating this for all the players gives us a set of vectors *V(player)* upon which we can calculate the distances using some similarity metric (such as euclidean distance or manhattan distance). In the case of this application, we tried both the euclidean and cosine distances, both of which returned similar results in clustering, so we used the cosine metric.

With these vectors and distance metric, calculating the distance matrix was straightforward (by iterating through the pair-wise distances of each players’ tuples). In addition to this, I created a row-normalized similarity matrix (scaled from 10 to 100) so that it was easier to see who were each person’s highest and lowest matches.

With these in hand, it’s somewhat apparent that there’s some clusters that could be calculated more formally.

## Dendrogram and Chord Plot

As previously mentioned, we were curious to know how we would cluster based on our track preferences. One good way to do this is to generate the dataset’s dendrogram, for which we used scipy’s linkage and dendrogram functions with the ward distance:

This is a pretty interesting result, as it neatly lays out what we suspected intuitively from our preferences (as we usually chat amongst ourselves when we’re playing, so we have a good idea of which tracks everyone tends to favor).

Another way to visualize these kind of “linkage” or “flow” dependencies is through a chord diagram:

Although, in this case, it’s a bit more difficult to see the relations between people. It is, nevertheless, possible to see to which persons everyone is more connected to (wider lines).

## Waffle Plots

After looking at those results, it was natural to try and make some inferences about the players’ archetypes. To do this, a frequency inspection was setup. I wanted to do something new in parallel to doing the traditional bar charts or treemaps, so I generated waffle plots for each one of the tracks.

This took a bit of fiddling around because I wanted the sizes of the plots to be constants, and the labels to be in the same position for the whole deck, so I created a “padding” number of votes with the maximum frequency (most voted track). These plots were pretty useful as every morning, a track would be revealed to all the participants through one of these waffle plots (from lowest to highest ranked). The full waffles panels looks like this:

## Interactive Plots

Finally, I went ahead and coded plotly versions of a barchart and treemap to make some interactive visualizations (click the images for the full-fledged version).

# Results

As mentioned before, it was interesting to see that our group of friends was neatly divided into two clusters. As the results were being made public to our group, day by day, the reasons for these clusters became clearer: one of the groups tends to like longer, more sinuous tracks (blue), whilst the other one prefers simpler, shorter tracks (magenta). One anomaly, did surface, though. Amaya’s votes didn’t seem to follow the patterns too well. Even though, her ballot clustered to the magenta clade, it was a strangely loose connection. After some investigation, we found out she voted for her favorite tracks but she didn’t know them all, so the rest she was not as careful in the ranking. This cleared up the mystery for us.

# Documentation and Code

**Repository:**Github repo**Dependencies:**scikit-learn, pandas, numpy, matplotlib, scipy, pyWaffle, plotly