Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic"

Introduction
Since this is still going on, I think it would be beneficial to break this down statistically. I started this as a reply but it reached sufficient length that I decided it deserved its own thread. The link to the original thread is here.
Max in the Shops Mirror
The model that best approximates a 'coinflip' scenario in which there are two outcomes, determined by luck, with probability p, repeated n times, is a binomial distribution. Let us apply this to Max's experience with the mirror. We have the following parameters of n = 16 games and p = 50% or 0.5 (since it is a 'mirror'). Ignoring skill, Max should win
mu = n * p = 16 * 0.5 = 8 games
Let's stop and double check that the result makes sense. You flip 16 coins and on average there will be 8 heads and 8 tails. Moving on to variance from that mean, we use the equation of
Var (sigma^2) = n * p * (1 - p) = 16 * 0.5 * (1 - 0.5) = 4.0
The standard deviation is typically more useful than variance and determined by taking the square root of the variance. The standard deviation is
Std Dev (sigma) = sqrt(4) = 2 games
The final breakdown is:
Max's 15 wins in 16 matches is much higher than the 50-50 an average Shops player would obtain. Therefore, it would be pretty reasonable for a casual observer to say Max is an above average Shops pilot based on these results.
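The arithmetic in this section can be double-checked with a few lines of Python. This is only a sketch of the binomial formulas already given above; the function name binomial_summary is made up for illustration, and the 50% win probability is the same skill-free assumption used in the text.

```python
import math

def binomial_summary(n, p):
    """Return (mean, variance, standard deviation) of a binomial(n, p)."""
    mean = n * p
    variance = n * p * (1 - p)
    std_dev = math.sqrt(variance)
    return mean, variance, std_dev

# Shops mirror: 16 matches, assumed 50% win probability (skill ignored)
mean, variance, std_dev = binomial_summary(16, 0.5)
print(f"expected wins: {mean}, variance: {variance}, std dev: {std_dev}")
```

Changing n or p shows how the expected record and its spread move for other sample sizes or matchups.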
Max against the Field
There are two ways to establish the probability that can be used in these models. The first is theoretically derived, like we did in the first section. In the mirror the cards are assumed to be the same or close to the same, so if player skill is ignored, the theoretical probability of winning is equivalent to losing, or 50-50. However, if the cards are substantially different, i.e., players are playing different decks, it is much more tenuous to assume a 50-50 win-loss record. You can make an argument for it: the tournament structure is such that a loss for one player equals a win for another, so the overall record of the field must be 50%. If you do so and exclude Max's Shops matchups, you get
Max's actual number of wins, 66, and actual match win percentage, 78.6%, are much higher than what we would expect given a 50-50 win rate. It would be pretty reasonable to conclude that Max is an above average player with Shops against the field, too.
How can we 'science' up the above conclusion?
Science is conducted through the scientific method: you make a hypothesis, conduct an experiment, then reject or accept the hypothesis based on the results. "But Max didn't have a hypothesis." In many cases, data are collected before an actual hypothesis is made. The default position is that of the null hypothesis, that there exists no statistical difference between two groups of data. Put in the context of this experiment, we are essentially taking the position that there is no statistical difference between Max's results with Shops and the theoretical results of an average player with an average deck (average defined as 50% win rate). That is, Max's results happened purely through chance and neither skill nor deck selection played a role.
Rejecting the Null Hypothesis (Confidence Intervals)
Max has collected his data. Now we have to determine whether or not Max is good or lucky. And honestly, we cannot know for sure. If you flip a coin 10 times and it comes up heads 10 times, would you conclude that this was luck or something nefarious was in play? The odds of landing on heads 10 times are theoretically (0.5)^10 or 1/1024. Alternatively, the coin could be weighted so that it almost always comes up heads. Both are possible, right? There is that one-in-a-thousand chance and weighted coins exist. Granted, in this example the coin would be severely disfigured and it would be readily apparent that it was doctored... Still, if someone says to you "I just flipped 10 coins and had 10 heads" with no additional information, what should you believe?
This brings us to confidence intervals. We know that with any type of probabilities, there exists a range of theoretical outcomes and that certain outcomes are more likely than others. What we have to determine is our threshold for error or alternatively our confidence in the results. Luckily, we did most of the work already by calculating the standard deviations. There is a statistical rule called the 68-95-99.7 rule that states the likelihood of a certain result falling within one, two, or three standard deviations of the mean. Those ranges are given in the above charts. If the above games were played by an average Shops pilot, there is a 68% chance that the Shops pilot would win between 6-10 games, a 95% chance they would win between 4-12 games, and a 99.7% chance that a Shops pilot would win between 2-14 games. Max won 15 games, so the odds of an average Shops pilot producing a result this extreme are <0.3%.
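The 68-95-99.7 rule is a normal approximation, and with only 16 matches the exact binomial tail is easy to compute directly. The sketch below does that under the same assumptions as the text (independent matches, 50% win probability); the helper name prob_at_least is hypothetical. The result works out to 17/65536, or roughly 0.03%, which sits comfortably inside the "<0.3%" bound.

```python
from math import comb

def prob_at_least(k, n, p):
    """Exact binomial probability of k or more wins in n matches."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Chance that a 50% pilot wins 15 or 16 of 16 mirror matches
p_tail = prob_at_least(15, 16, 0.5)
print(f"P(15 or more wins in 16) = {p_tail:.6f}")
```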
Dividing the number of wins by the number of games played gives us a match win percentage that allows us to compare different sample sizes. Doing so shows how win rates can vary dramatically based on limited results (and why looking at small sample sizes is unreliable, like @Timewalking suggested). We would expect an average Shops pilot to win 37.5-62.5% of their matches 68% of the time, 25-75% of their matches 95% of the time, and 12.5-87.5% of their matches 99.7% of the time over 16 matches. Conversely, the confidence intervals for 84 matches (the number of matches Max played against the field) are much smaller: 44.5-55.5% for one standard deviation, 39.1-60.9% for two standard deviations, and 33.6-66.4% for three standard deviations. Statistically, more data is always better. Max won 78.6% of these matches, so again, it strongly suggests that Max is an above average Shops pilot and/or that Shops is an above average deck.
By convention, those in the medical field (and many other fields) tend to use the cutoff of 95% (2 standard deviations) as a statistically 'true' result. Max is well beyond that, so we can statistically conclude what most of us already concluded: that Max is not an average Shops pilot. We have a higher degree of certainty, of at least 99.7%, but it's much simpler mathematically to stop here for now.
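To see how the ranges tighten with sample size, here is a rough sketch that converts one, two, and three standard deviations into match win percentages for 16 and 84 matches. It assumes the same 50% baseline as the text; win_rate_ranges is an illustrative name, not anything from the original analysis.

```python
import math

def win_rate_ranges(n, p=0.5):
    """Return {k: (low, high)} win-rate ranges for k = 1, 2, 3 standard deviations."""
    sd = math.sqrt(p * (1 - p) / n)  # standard deviation of the win rate
    return {k: (p - k * sd, p + k * sd) for k in (1, 2, 3)}

for n in (16, 84):
    for k, (low, high) in win_rate_ranges(n).items():
        print(f"n={n}, {k} sd: {low:.1%} to {high:.1%}")
```

The standard deviation of the win rate shrinks with the square root of n, which is why going from 16 to 84 matches narrows the ranges by a bit more than half.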
What other meaning can we derive from the data?
There are really two other questions/observations that emerged from the thread concerning Max's article.
1. Does Max's higher win rate in Shops mirrors (94% vs. 79%) suggest that Shops is actually a weaker deck against the field?
2. Does Max's 81% win rate in total and 79% win rate against the field suggest that Shops is an above average (or good deck) in the metagame?
Let's start with the first as it is easier to address. The argument assumes that skill in one matchup is transferable to another, that the Shops mirror is inherently a 50-50 matchup, and that since Max won at a higher rate against Shops than against other decks, the skill-independent MWP of Shops is below 50% (making it a 'bad' deck). While considering assumptions is really important in interpreting data, it actually doesn't matter much statistically. The numbers are what they are: Max won 94% of matches against Shops in 16 matches and 79% of matches against non-Shops decks. The question is whether or not this discrepancy is real.
Is there a statistical difference between Max's results against Shops and Max's results against the field?
The second way of determining a probability (and by far the most common) is to do so experimentally. We don't know how many matches Max should win when we factor in his skill and his deck selection. How good is Max? How good is Shops? How good is Max with Shops? Again, we don't know for sure, but one thing we can do is have Max play a bunch of matches with Shops to give us an experimental value for his win probability. Well, Max already did that so let's use Max's win rate against other decks as a starting point. Max won 66 of 84 matches, for an experimental probability (P because I don't know how to add a circumflex to the letter p) of ~79%. What is our confidence interval for this value? Well, there are several ways to calculate confidence intervals of experimental means based on sample sizes. The easiest one to use is a normal approximation interval or Wald method where the range is:
P +/- z * sqrt(P * (1 - P) / n)
The constant z depends on the desired confidence level; for 95%, z is 1.96. Punching the numbers in, we get an experimental probability of 0.79 +/- 0.09 and a range of 70-88%. Max's win rate of 93.75% in the mirror is outside of this range, implying that there is statistical significance in the discrepancy between the Shops mirrors and matches against the rest of the field.
Does a statistically significant result actually tell us what we want to know?
Now it's time to look at our assumptions. We assumed that
1. Mirrors are inherently 50-50.
2. Skills with a deck are transferable between the mirror and other matchups.
3. Skill differences affect outcomes in other matchups to the same degree.
I can poke holes in each of these assumptions. The first assumption is that mirrors are inherently 50-50, but that ignores the fact that 'true' mirrors are relatively rare. Most decks are not 75 card copies of each other, and most classification schemes lump similar decks into the same archetypes. For Shops, this includes not only Ravager Shops but also Stax, Rod, and other variants. Ravager Shops tends to destroy these other versions, which is part of its dominance within the metagame. Foundry Inspector breaks the symmetry of Sphere effects and is unaffected by Null Rod, the threat base is wider and lower to the ground (i.e. many creatures that can be cast cheaply), and the mana denial is much more effective against other decks with higher mana curves. Max went at least 5-0 against these 'mirrors', which arguably should be considered separate decks. If one assumes that the remaining 10-1 record was against other Ravager decks, that gives a win probability of 0.91 +/- 0.17, or a lower limit of 74%. This result is no longer statistically significant.
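For anyone who wants to reproduce the interval arithmetic, here is a minimal sketch of the normal approximation (Wald) interval applied to the 66 of 84 record against the field and to the assumed 10 of 11 record against Ravager decks. The function name wald_interval is made up for illustration, and z = 1.96 corresponds to the 95% level used in the post.

```python
import math

def wald_interval(wins, matches, z=1.96):
    """Normal-approximation (Wald) interval: returns (win rate, margin)."""
    p_hat = wins / matches
    margin = z * math.sqrt(p_hat * (1 - p_hat) / matches)
    return p_hat, margin

records = [("vs. the field", 66, 84), ("assumed Ravager mirrors", 10, 11)]
for label, wins, matches in records:
    p_hat, margin = wald_interval(wins, matches)
    print(f"{label}: {p_hat:.2f} +/- {margin:.2f} (lower limit {p_hat - margin:.0%})")
```

Note that for the 11-match sample the upper end of the interval runs past 100%, which is a known weakness of the Wald method with small samples and extreme win rates; alternatives such as the Wilson interval exist but are not what the post used.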
For the second and third assumptions, Max and I both stated that we thought the mirror tested different skills and was very skill-intensive (i.e. that the skill discrepancy with a deck went a long way to predicting the winner). The Ravager mirror does have blowout potential but many games develop into complicated board stalls with key pieces such as Walking Ballistas, Arcbound Ravagers, Steel Overseers, and Hangarback Walkers shut down by Phyrexian Revokers. Oh, and Metamorphs, Wurmcoils, and Precursor Golems provide powerful threats to be navigated. Complex combat math is arguably the most valuable skill in the mirror, with sequencing less important. These types of scenarios are uncommon in other matchups and the combat math is much more simplistic as most opposing creatures provide few decision trees (most creatures are vanilla x/x's like tokens and creatures with abilities tend to be static like the lifelink of Griselbrand or triggered and predictable like Inferno Titan). Sequencing is more important for the Shops pilot who assumes the proactive role. Skill from the other side of the matchup is also minimally interactive: as Max said, either the opponent kills all your threats or deploys a massive trump like Blightsteel through Spheres and mana denial, or they don't and die. That is more draw and die-roll dependent than skill based.
I think that statistically significant results in this case point to a couple of possible conclusions. First, I think the most likely explanation is that the Shops mirror tends to be less variable than other Shops matchups. This doesn't require assumptions about the transferability of skills from one matchup to another. It actually assumes the opposite of assumption #3 in that it assumes matchups are influenced by skill to varying degrees. Max reached this conclusion as well. I think it is less likely that Shops is weaker than other decks in the field, because we have more premises that I find hard to logically accept to reach that conclusion.
Does this article indicate that Shops is an overpowered deck in the metagame?
Short answer is "No". That type of question is much better answered by our metagame breakdowns. Again, more data is better and you mitigate issues of player skill by having a much larger sample size. Applying the same statistical tests to this most recent Vintage Champs gives Shops a win rate of 59% (+/- 5%). In this sample size of 404 matches played by 72 players, it's pretty statistically clear that Shops is a good deck. Is it the best deck? Oath is the closest of the other archetypes with a win rate of 55% (+/- 6%). Those confidence intervals overlap, so you can't statistically claim that Shops is the best archetype. The answer of course is "more data". When you look at results from the Vintage Challenges and other tournaments (taken collectively), ideally it paints a consistent and accurate picture of reality. That's how science works...you do radiometric dating of a bunch of radioactive minerals and when many different labs reach a consensus of 4.5 billion years old, that's what they put in the textbooks. Would people be interested in a large scale analysis of all available metagame data (in essence, a meta-analysis or the strongest form of scientific evidence in medicine and other areas of science)? I am willing to do this, but I would like confirmation that players would be receptive to the data.
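The overlap argument can be made concrete with a small sketch. It works from the rounded win rates quoted above (59% over 404 matches for Shops, 55% for Oath); the Oath sample size of 277 matches is taken from the author's later reply in this thread, and the helper name wald_bounds is hypothetical.

```python
import math

def wald_bounds(rate, matches, z=1.96):
    """Lower and upper 95% Wald bounds for an observed win rate."""
    margin = z * math.sqrt(rate * (1 - rate) / matches)
    return rate - margin, rate + margin

shops = wald_bounds(0.59, 404)  # Shops at Champs
oath = wald_bounds(0.55, 277)   # Oath at Champs (match count from a later reply)

# Two intervals overlap when each one starts before the other ends
overlap = shops[0] <= oath[1] and oath[0] <= shops[1]

print(f"Shops: {shops[0]:.1%} to {shops[1]:.1%}")
print(f"Oath:  {oath[0]:.1%} to {oath[1]:.1%}")
print("intervals overlap:", overlap)
```

Under these inputs the Shops interval stays above 50% while the Oath interval dips just below it, which matches the reading that only Shops is clearly above average.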
Alright, back to one 100 match set played by one player. We can agree that Max's skill has skewed his results away from that of an average player. The question is what additional component arises from Max's deck selection. Again, we have to make various assumptions. We don't know Max's 'true' win probability with other decks, but he has stated that he has won roughly 70% of his matches in PTQ's. If we accept this figure as accurate and assume that this MWP is transferable to Vintage, and assume that PTQ's are comparable in level of competition to Vintage leagues, then we can use this 70% value as a theoretical probability. In this case,
The confidence interval has an upper limit of 79, which suggests with 95% certainty that Max's results are not just a product of variance. He won 81 games. If you exclude Shops decks, you are at the edge of statistical relevance (remember our confidence interval from that data set was 70-88). If you exclude true Ravager Shops mirrors and include Shops variants, you are back above statistical significance with a confidence interval of 72-88% MWP. Given the proximity to the limits and the assumptions required, I would not personally conclude from this that Shops is an above average deck in the metagame.
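The upper limit of 79 comes from treating 70% as a theoretical win probability over 100 matches and taking roughly two standard deviations around the expected 70 wins. A rough sketch, with expected_win_range as an illustrative name and every input resting on the assumptions listed above (the 70% PTQ figure carrying over to Vintage leagues):

```python
import math

def expected_win_range(n, p, z=1.96):
    """Range of win totals expected from a player whose true win rate is p."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return mean - z * sd, mean + z * sd

low, high = expected_win_range(100, 0.70)
print(f"a 70% player should win between {low:.0f} and {high:.0f} of 100 matches")
```

Any total above the upper bound would be unusual for a true 70% player at the 95% level, which is the comparison being made with Max's record.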
Hopefully this type of data analysis was informative and accurately conveys some of the challenges with regards to interpreting data. Questions and comments? Please, let me have them.

This is the most excellent thing I've read all day. I would certainly be interested in a large scale analysis.

Exactly the kind of post I would have liked to write if my English was not so poor. It is a pleasure to read a real mathematical analysis and I hope it will help to stop all the statistical nonsense that can be read in many posts.
In the field of industrial process control, if a sample is outside 3x Sigma it is said "there is a problem", if it is between 2x and 3x it is said "beware there could be a problem, check the trend and get more samples so we can conclude".

Max has collected his data. Now we have to determine whether or not Max is good or lucky. And honestly, we cannot know for sure.
@Koby would make the claim that I exist in a perpetual state of lucky. I’m inclined to agree.
@ChubbyRain, thank you for taking the time to make this. It was data-driven, interesting, enlightening, and possibly the most statistically-backed compliment I’ve ever received. I would also be super interested in seeing the combined data, but understand the massive time commitment that doing so entails.

Thank you so much for putting this together. It brought me back to my econometrics days in college when I read this. A somewhat random aside and I may be mistaken, but I think that Max was running Steel Overseers while many of the “Shops Mirrors” he played in may have been against Ravager decks that did not run Overseer. If that’s the case, Max had a huge leg up on his opponents from a deck design standpoint as Overseer is one of the best cards in the mirror. Similarly, Phyrexian Metamorph if played correctly can be one of the best cards in the mirror, preventing you from falling behind to your opponent’s best plays. There is a lot more I could say about the Workshop mirror, but the main takeaway I have, which was mentioned, is that a Ravager mirror is more similar to a game of limited than Vintage in the traditional sense. Combat math rules the roost and generally a PTQ player does more combat math than a Vintage-only player and thus should be better at that vital aspect of Ravager mirrors.

This was a fantastic read.

Great read Matt. I purchased Ravager Shops on MTGO about 10 minutes after reading Max's article. I figured I played enough Vintage that I would be able to figure it out. Additionally, I currently play a ton of Standard and Modern, and I think @Will is spot on with the statement that combat math favors more of a multiformat grinder than a Vintage-only player. After a rough start with some misclicks and going 3-2 my first league with the deck I have gone 4-1, 5-0, and 4-1. My only losses were to a Shops deck with MD Precursor Golem, Wurmcoil, and Null Rod (aka pre-boarded vs me essentially) and a PO deck with an above average draw.
I have a couple of observations of my own. After experiencing such dominance so quickly after picking up this deck for the first time I am inclined to believe people don't really know how to play against it or with it. I won mirrors as alluded to just by doing combat math and determining racing scenarios that I feel my opponents missed. Additionally, I had MULTIPLE OPPONENTS keep cards like Mental Misstep and Gush in against me, on the draw no less. I don't know how Blue decks expect to win when keeping in such detrimental cards vs Shops. Should something change for Shops? I am not here to propose that at this time. Should better precautions be taken by players for the match up (such as testing and SB measures)? I would probably guess there is a flaw in how players handle preparation for this match up, whether it's SB tactics or play lines. I think some of this comes from playing on MTGO where you will have newer players to the format try their hand at Vintage and not fully understand the match ups. I also think that sometimes older Vintage players get stuck in their ways of thinking and use SB cards that aren't necessarily as effective as they used to be against previous iterations of Shops. Just my two cents after my brief twenty games with the deck on MTGO.

@will said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
A somewhat random aside and I may be mistaken, but I think that Max was running Steel Overseers while many of the “Shops Mirrors” he played in may have been against Ravager decks that did not run Overseer.
I actually kept on switching a few cards between leagues. I rarely entered a league with the same 75 twice.
In Leagues 16-19, I played Car Shops, a la Nick DiJohn. I went 2-1 in Shops mirrors with Car Shops vs Overseer Shops, and 13-0 in Shops mirrors when I had Overseers. As you said, Overseer is insane in the mirror.
@womba said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
After a rough start with some misclicks and going 3-2 my first league with the deck I have gone 4-1, 5-0, and 4-1.
It looks like you're also at 80%. That's awesome!
I'm curious if anyone has been Game 1 Mulliganing against you, thinking you were on Dredge. I know I've done that against you once. I threw away a totally reasonable 7 because it couldn't beat Dredge, and you opened up with Blue spells! Learned my lesson.

@maxtortion said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
I'm curious if anyone has been Game 1 Mulliganing against you, thinking you were on Dredge. I know I've done that against you once. I threw away a totally reasonable 7 because it couldn't beat Dredge, and you opened up with Blue spells! Learned my lesson.
@Maxtortion this is a common occurrence, more so in paper but to an extent in MTGO haha.

@womba said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
@maxtortion said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
I'm curious if anyone has been Game 1 Mulliganing against you, thinking you were on Dredge. I know I've done that against you once. I threw away a totally reasonable 7 because it couldn't beat Dredge, and you opened up with Blue spells! Learned my lesson.
@Maxtortion this is a common occurrence, more so in paper but to an extent in MTGO haha.
One day someone will get blown out against me by doing this, but if the last 9 years are any indication, it won’t happen any time soon unless the DCI has something to say about it.

@womba It was the best of times, it was the worst of times, he drinks a Whiskey drink, he drinks a Vodka drink...
Edit: I'm pretty sure the discussion of player skill deserves its own section. Basically, I think players tend to have an unrealistic expectation of a format's skill level. It's not like GPs with hundreds of players don't have bad players that make common and at times mind-boggling mistakes. Even pros can make mistakes repeatedly on camera. Part of what makes Magic so great is its complexity but the tradeoff is that perfect play is a hypothetical aim to strive for but never attain. I don't think the presence of 'bad' players is unique to Vintage or that Vintage has a disproportionate number of them.
That said, Vintage leagues are definitely a step down on the competitive ladder. More popular formats like Modern and Standard have two different league types with different entry fees and prize structures. The friendly leagues are much better as entry points to a format and the competitive leagues are much better for established pilots. Without that delineation, you get a mixture of experience levels in Vintage leagues. It doesn't invalidate data (nothing besides outright fraud and shoddy methodology invalidates data, in my opinion), but it is a grain of salt. Again, I think that the best approach is a far-reaching one that takes data from as many sources as possible and breaks down their pros and cons.

Nice job! A few remarks :
 the second table is the same pic as the first
 you want to say "The constant z depends on the desired confidence".. LEVEL, not "interval".
 about sample size: the thing is, it's not exactly that small samples can't be used, it's that they can only be used for extreme stats, but (as I say in my article at https://timewalking.wordpress.com/2017/05/25/5magicstatsmythsdebunked/) for such imba matchups/decks, you don't need stats to know that imbalance; and if the matchup/deck winrate is tight, which are the cases you'd be interested in, then it's not realistic to expect to find a sample of a size and quality high enough to give us solid stats to rate such tight margins. I had an interesting chat with @Smmenen the other day and he pushed me to give a size of a sample that would be acceptable for me; I caved and gave an answer but really shouldn't have. It seems this forum has the means to assemble a sample of sufficient power to have good to great chances to detect a 60% winrate (as in Shops in http://themanadrain.com/assets/uploads/files/1501832171949uploaddcb54f6a76b7487588a2d5d5ebf6c161.png), and he seemed to insist that the skill and the decklists were stable enough, so the quality of the sample would presumably be quite good. So possibly for that kind of range you could use stats. Is it worth the effort, considering that using the good statistical tools can be time consuming and that there are still so many ways to use and interpret them incorrectly? To me it isn't: a 60/40% winrate I expect to be clear enough that if I play or observe the deck enough I'll get the idea, while learning things on the way. Seems like a better deal to me, since even a 60/40 matchup is quite skill dependent, generally.
 Therefore I don't think we should ask you to do a large-scale analysis.
 Please consider not relying on the concept of "statistical significance".

Can you elaborate on your advice not to rely on the concept of statistical significance?

@senor_bisquick There's a chapter about it in the link.

 Done. I had to copy and paste from my response in the other thread to a new thread. Obviously, I screwed up. Thanks for pointing it out.
 Yes, thank you and changed.
 Agreed. Ryan and I have been trying to include standard deviations based on the binomial model in our metagame breakdowns. If you look at the end of the piece, I mentioned that in the 400 match sample size, we had a confidence interval of +/- 5%. In the 277 matches with Oath, we have +/- 6%. 123 matches 'bought us' an additional ~2% of certainty. As for small sample sizes being usable, I think I made the point that "more data is better", not that small sample sizes are unusable. The null hypothesis for MTG matchups is that both decks are equally favored. It is much easier to reject the null hypothesis given a more skewed sample and the confidence intervals support that. Shops at Champs was to the casual observer clearly an above average deck given its dominance of the top 8. The win rate and confidence interval augment that conclusion: at 54-64%, that's outside of the definition of average. The next best deck, Oath, does not have the same statistical support. The range of 48.5-60.5% includes 50%, so the deck might be average based on the confidence intervals. Taking the entirety of Champs you reach a very limited conclusion about the field: Shops is above average, Paradoxical and Eldrazi are below average. Every other archetype is statistically average. And of course this is without other assumptions and arguments mucking the water.
 I am thinking it would be helpful. Again, I think the null hypothesis that matchup X vs Y is even is a valid and worthwhile result.
 I agree wholeheartedly and am familiar with the debate currently ongoing within the scientific community. The point of this was actually not to establish conclusions but to convey the complications of deriving conclusions from statistics.
Thank you for the feedback. I will look at your blog post later tonight and am looking forward to it :)

@chubbyrain I agree on the focus on specific matchups rather than on "vs the field". Much cleaner data there.

@timewalking I read your article and enjoyed it; I would recommend it to others as well. I also tracked down David Colquhoun's article and stashed that into my file of articles I'd like to keep readily available going forward. It reminds me of my lecturer, who quipped "I could devise a screening test that is 98% sensitive, 98% specific, and wrong 98% of the time."
I am curious why you feel specific matchups would be better than against the field?

@chubbyrain The math tools we use are made for experiments conducted in the same context. Ideally they would query the presumed universal laws of nature and nothing else. When you test deck A vs B, even if no changes to the lists are made in the sample, then you'd still need to know that the players don't have different ways of playing the matchup. And there would be a multitude of parameters that could change in the sample A and B like their health, fatigue, etc. But we omit the human factors and still use the statistician's tools. If you test deck A vs the field you make the testing environment much more unstable: your field will change more than B, might not be representative of the "real" meta (not sure there's such a thing really :) ), and anyway the decks we face in a tournament are quite random.
Also, omitting that I'm not in favor of relying on stats in mtg, I find matchup winrates much more interesting for me personally. If I'm confident my deck beats X by a considerable margin, and if I could have a relatively exact measure of that margin, I could use that as a tool to see how far I could go in weakening my deck a bit against X to help other matchups, for instance. Since I don't expect to have reliable stats "ever" for tight matchups anyway, that's what I could use.
When playing against the field, I can try to use metagame shares (which are much more stable than winrates, or at least that seems clear to me) to choose decks and sideboard, but those aren't winrates, and also I base my strategy on the principle that only the players of top skill matter: if I'm not one of them I'll lose, if I am I'll win against the weaker players, so I'm interested in winrates against top players by top players. That's what is of real strategic importance to me and by definition winrates against the field aren't of that nature. Sure, neither are matchup rates typically, but they could be devised to be so, theoretically. (another reason why for me stats are so overrated in mtg)

@womba said in Math and Max: A statistical analysis of "100 Matches with the Best Deck in Magic":
Additionally, I had MULTIPLE OPPONENTS keep cards like Mental Misstep and Gush in against me, on the draw no less.
Good points added. It's generally not right to keep Misstep in but I figured I would add that Oath pilots should generally keep 1 (possibly 2) in post sideboard due to the importance of stopping Grafdigger's Cage. If they weren't on Oath, they probably have no excuse aside from "I just had so many Pyroblasts and Flusterstorms I didn't have enough cards to take out" in which case, again, no excuse. :D