MTZ: Magic, The Zachering (pt 2)

This article is part 2 in a series on mtg. If this is your bag, check out

  • Part 1: In which our hero struggles to game a multi-thousand-dollar game about goblins
  • Part 2: In which our hero checks his resume's neo4j box

So what do we do with all that cool data we just scraped? Where do we put it? What certain thing does everyone at work seem to be talking about at the same time? What is the thing about which I have most often said “… yeah, I guess I should learn how to do that, clients keep asking about it…”

Part 2: Bloodcrazed Neo(4j)nate

Neo4j is, as it turns out, just really goddamn cool. It’s an increasingly popular graph database, which is a pretty good fit for our current situation. The cannonical example of datasets which are well suited for graph databases and network analysis are social networks (Anna knows Bob, Bob knows Crusty, etc.), but for our purposes I think another common example is a better metaphor: actors and movies. One of the excellent tutorials offered out of the box for a neo4j install discusses the movie actor Tom Hanks (best known for his work in this masterpiece) and looks at movies he is in, other actors who are also in that movie, actors who are in movies with Tom Hanks’ co-actors but have not yet been in a movie with him (here’s lookin at you, Tom Cruise (nsfw, I guess?)) – you get the picture.

Well, our actors are cards. Lightning Bolt is Samuel L. Jackson, in probably about 50% of all movies (decks). The analogy helps inform our questioning and analysis, too, in my opinion: perhaps I want to get the notoriously expensive Tarmogoyf to appear in my next deck, but I just don’t have the cash. What actors are like the ‘Goyf? What actors appear in similar successful movies? Or say I’m 20th Century Fox and money is not an object – are there any indications that two actors who haven’t yet appeared together will really have good chemsitry?

Installing

I don’t even really need to talk about this, honestly, because the folks in charge of neo4j have made it so flipping easy. Just go to their download page and follow the instructions. I installed on my work laptop (OSX) and my home desktop (debian) and had no problems whatsoever. Plus, the tutorial information was just really, really straight forward, fun, and helpful. Given that I am a person who very strongly judges a technology by its install, I have to say I am pretty excited about neo4j at this point.

Conceptual setup

Given that this is my first time using neo4j or (any graph technology) in any real way, I decided it would be best to be super pedantic and collect all of my thoughts pertaining to this MTG project as I went through their introductory tutorials.

  • Built-in tutorial, :play concepts
    • Graphs contain a few primary concepts items
      1. Nodes: a single data record (cards and decks)
      2. Relationships: connections between records (card is in deck)
      3. Properties: key-value pairs of information about data records (a card has certain casting cost, a deck finished a certain place, an is in card-to-deck relationship is for the mainboard or sideboard)
      4. Labels (a deck was a standard or modern deck (when played or by today’s definition))
  • Built-in tutorial, :play cypher
    • create nodes with the CREATE clause.
      • we have a node we will call lb of type card, with a name of lightning bolt and a cost of R), so our statement is:
        • CREATE (lb:card { name: "lightning bolt", cost: "R" })
      • we have a deck d and a card c that should be in that deck. We create the relationship via:
        • CREATE (c) - [:IS_IN {board: main}] - (d)
    • look up nodes or do advanced pattern matching with the MATCH clause.
      • we want to find all cards that cost only 1R to cast, and call them c:
        • MATCH (c:card) WHERE c.cost = "1R" RETURN c
      • what about all decks that have (lb) (lightning bolt) as a card in the sideboard?
        • MATCH (c:card) - [:IS_IN {board: side}] - (decks) WHERE c.name = "lightning bolt" RETURN c, decks
    • find recommendation based on complex MATCHes
      • find cards that often go with Lightning Bolt
        • MATCH (lb:Card) - [:IS_IN] - () - [:IS_IN] - (cards) WHERE lb.name = "lightning bolt" RETURN DISTINCT cards

Loading data

Cards

Because we have json input for the cards, loading into neo4j should be super painless. Unfortunately, however, the actual format of the json is simply not what we want – the top level is a dictionary of [set_code]: { ... set info ...}. Neo4j allows us to natively unwind lists in json, and refer to dictionary keys when they are known, but (to the best of my knowledge) it does not provide a way of unpacking a list of keys. Instead, I made a simple loop that iterated over each set and loaded the sets in their entirety. The entire process took hardly any time at all – another thing I learned messing around with neo4j here and at work is that it is fast as fugg.

The entire process is encapsulated in my github repo here. The final query is a two step insert (for each set, the iteration over set IDs and cards is done in pythotopia):

Decks

Loading deck data is a little tricker, but still imminently doable. Obtaining deck data, on the other hand, required some real engineering. As I mentioned in my previous post, I decided to go with the SCG deck database. There were a ton of decks there (tens of thousands), so downloading them all took a few hours. In order to speed up development, I built in a caching layer to avoid the round trips. Currently that is still enabled by default in that github repo, so if you decide to do it yourself be prepared to eat up about 5.1G of disk space. It’s for a worth cause, though, right?

In any case, that process is also on github here. After scraping data, the loading of the decks is actually a bit more straight-forward than the loading of the cards. Loading the decks and their metadata is about as simple as a cypher load statement gets:

The extra complication here is that decks themselves have different “boards”: a main board, which contains the default build of the deck, usually 60 cards; and a side board, a collection of no more than 15 cards which can be swapped into the starting (main) deck between matches. In order to match the decks and their metadata with the cards in their boards, we need a second query which does the card name lookups and creates the graphe edges:

Next time!

We now have a relatively easy-to-access database of mtg deck and card info, so I’m going to go about analyzing it. Rather than list out what I want to do, I’m going to ruminate on the topic for a while and come back with something concrete.

Written on February 28, 2016
Keywords: hack, mtg, magic, gathering, superfly, neo4j, network, debian, install, graph, relation, node