Data analysis approaches behind MTGCardTech

Several years ago, I started a side project with several different aims (see the Projects section on the home page). My approach to this project has been one of discovery and praxis, and I wanted to document some of my journey.

The domain of this project is the trading card game Magic: The Gathering. If you are not familiar with Magic, a little background on the game: there are roughly 20k game pieces (cards), and players construct their own decks of cards to play against each other in tournaments. Decks have a minimum size of 60 cards (most often) plus 15 utility cards (called the “side board”), and tournament play is divided into different “formats” where certain cards are allowed to be played, and other cards are excluded. The game is similar in some ways to Spades, Hearts, or Bridge, where you are essentially trying to play cards that beat (or are better than) your opponents’ cards. Each card has roughly a dozen core attributes to it, and provides a statement of the rules that effect that particular card, other cards in play, or the play of the entire game.

One of the main goals of the project was to analyze the tournament performance of decks to try to find out which cards are best, and why. With a practically infinite number of possibilities in a game that is both skill-based and has some level of randomness, what combination of attributes make some cards more valuable than others?

After building data ingestion pipelines to gather all of the cards in the game, and decks used in competitive tournament play, I had to structure the data in a way that enabled analysis.

Several card attributes are numeric counting numbers (typically 0 through ~20).

Equally important is the rules text which the card may have. For example, “Lightning Bolt deals 3 damage to any target”, and “Flying. At the beginning of each combat, any opponent may sacrifice a creature. If a player does, tap Desecration Demon and put a +1/+1 counter on it.”

As a means to quickly get started, I transformed all of the card attributes into terms, including the numeric vectors. For some key attributes that had a finite number of possibilities, I also constructed terms that represented their converse or complement (for example, a card with a “color” attribute of only “white” is also “notblack”). I also replaced text that is self-referencing with a common term. Once I had a “document” for each card, I then used some simple off-the-shelf tools (initially Mahout on Hadoop, but then Solr, and currently Elasticsearch) to do things like find differences and similarities between cards using TF/IDF. The results of this are currently used on the website now in the “Cards similar to this card” feature (e.g.,

I then used this to find similarities between tournament decks of cards. I put each card document of a deck into a new deck document, and then looked for similar deck documents. What I found as I experimented in this was that TF/IDF did not actually create spaces that classified or grouped decks together very well. With a limited vector space, and so many valuable data in very frequent terms, inverse document frequency was not the right solution for segmentation and classification to reach my goal. I experimented with several different approaches, including my reliance on bi-grams and tri-grams for rules text grouping, and moving the numeric attributes out of term-equivalent “text” and into actual attribute vectors (thus, the “power” attributes wasn’t a term like “power3” any more, and instead was the value 3 on the “power” attribute vector). This, of course, yielded different (and better) results, but still did not provide valuable insight into deck segmentation or classification.

I next used a number of different packages to expand upon that basic TF/IDF analysis, specifically with Python Sci-kit Learn (and some other tools bundled in Anaconda), Elki, and Orange. With some of those packages, I used used the same algorithms and approaches, and was just learning the tool set. I feel like I had the most success with sklearn and Orange.

With sklearn, I performed clustering of both cards and decks using K-Means, Ward hierarchical clustering, and Agglomerative clustering.

None of these produced the exact level of “rightness” that I was looking for. Although there were expert human-perceptible organizations of the clusters, I felt that they did not match the current human consensus of deck organization. As I explored these clusters, I felt that not enough value was being added to the work that was already out there (one example of some known groupings is from Patrick Chapin and his Sixteen Archetypes). I played with changing what information was included in each document, using single words and n-grams of various length, and increasing the number of stop words. I measured aspects of my resulting data set to try to reduce the total number of dimensions and create a “tighter” space, ultimately eliminating game-inconsequential terms that had frequencies that were too far outside of the average frequency.

The next set of approaches I tried was with Orange, and started with a more supervised approach. I used it’s Tree model implementation, and trained it with a handful of deck documents that were in the categories that I wanted. I then played with it classifying the documents that I trained it with, documents from the same tournament format (i.e., decks that were played against each other), and then documents from different tournament formats.

With Orange, I also tried using Random Forest, hierarchical clustering, and k-means clustering.

Most of these experiments had a similar progression, which was to first try just a concatenation of the card documents into a deck document, followed by creating specific term vectors within the deck document that captured some computed vectors (for example, the number of each of the different card types).  These models did get closer to classifying in a manner that approximated human classification.

Also, somewhere in the middle of this, I attempted a Naive Bayes implementation, however quickly confirmed my expectations that the number of important dimensions of the data would require an inhuman amount of supervised learning.

Ultimately, the results from Orange were the most meaningful to me. In part, I think that it was Orange’s visual interface that allowed be to experiment and test quickly and visually. However, this became a drawback as I wanted to make this more of an continuous, automated process. Orange does provide a programmatic interface, but I found it’s learning curve to be more than I wanted to invest in at the time.

Next, I really wanted to getting back to comparing the actual cards (and not the decks). Clustering and classification were good tools to find similarities, but did not help me compare cards that were substantially different when it came to the question “what card should I include in my deck?” Given all of the dimensions that make a card valuable, I figured that expert human’s would be a faster path than the machine. So, I created a head-to-head “game” based off of Microsoft Research’s TrueSkill algorithm.

The web application shows a player two cards, and asks the player to choose the better of the two cards using whatever measure or test that human player thinks is most valuable – Each battle is recorded, and the wins and losses are calculated into a score using TrueSkill. The higher the score of a card, then more valuable the card.

This analysis allowed me to quickly gain human “supervision”, and as long as I was able to capture enough input from players of the these card battles, they did start to align with the frequency and popularity of those cards in decks. For example, this trending analysis shows the cards that are most frequent in the “Modern” format, with their Card Battle scores.

Most recently, I wanted to create new tools to help players discover cards that may work with the strategies or cards that they know that they want to play. Leveraging the TF/IDF analysis of cards, the inclusion of cards into competitively-played decks, and the card battle scores, I created a recommendation engine. (I played with a couple of “off the shelf” recommenders that used some of the same algorithms I had been experimenting with, but found that I already had too much invested in the analysis and pipelines I had already built to want to scrap my home-grown approach.) You can see an example of this in action at – try selecting “Modern” for the format and adding the two cards “Lightning Bolt” and “Island” as seed cards.

There are two sets of recommendations. The first set of recommendations is based off of the frequency of other cards in decks that include the “seed” cards. Thus, if you put in a card that is not played at all, you won’t get any recommendations. The second set of recommendations (the “Spicy” set) takes the card documents of the seed cards, as well as the first set of recommendations, and pulls together other cards that are similar. There is an inversion of the terms between the seed cards and first set of recommendations so that there is a higher diversity of the Spicy set.

I continue to tinker with these concepts, and my current curiosity is around mana bases. And, I welcome any thoughts, feedback, or opinions on any of this work or the underlying ideas.

Leave a comment

Your email address will not be published. Required fields are marked *