ExpG Comment Reply
Note: this is a detailed response to a comment left on another post (found here) about Expected Goals (ExpG).
The claim was that I “don’t have any idea what [I’m] doing.” That general conclusion was drawn from a specific complaint: that I had trashed someone else’s work, even though that work was apparently superior.
As was pointed out, having an Actual Goals/ExpG calculation closer to 1 is an indication that our model is doing a good job measuring the cumulative value of shots a team has taken. And yes, 1.116 is closer to 1 than 1.206 (this is the single statistic—two different measures of Swansea’s ExpG last season—from which the criticism above was derived).
But that’s one team. A better comparison would involve more than one, so I ran similar comparisons for multiple teams. I found four complete seasons’ worth of ExpG calculations using the model you referred to over at SB Nation (the 2013-14 seasons for Spain, Italy and Germany, plus the 2014-15 EPL season) and did the full-season calculations for three of those to compare against. I left out the Bundesliga. Why? Because I had just done something similar for a post below. Theoretically that would mean I had already done those calculations, making it less work. But I stupidly neglected to save my R workspace, and I don’t want to do the same work twice; it does take a little bit of time. Still, if you check out the charts in that Bundesliga post, visually, there’s a compelling case that we’re doing “better” (although I have no idea which model the original poster used for his calculations). Anyway, I’m pretty sure that three seasons will suffice for a comparison.
Again, just to be clear (and as you stated), closer to one is better. That holds whether we overshoot or undershoot. For example: Newcastle scored 56 goals in the 2011-2012 season. That’s not made up, that’s their actual total (and holy crap, I totally forgot they finished 5th that season, even ahead of Chelsea).
Suppose both ExpG calculations overestimate Newcastle. One is 58 and the other is 63. The former is “better” as 56/58 (or .966) is closer to one than 56/63 (or .889). Similarly, if we undershoot, we want to undershoot by less. So if the calculations are 51 and 47, the former is again better as 56/51 (1.098) is closer to one than 56/47 (1.191).
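The "closer to one" rule above can be written down as a quick sketch. The function names here (`ratio_error`, `closer_estimate`) are made up for illustration and are not part of either model; the inputs are the Newcastle numbers from the example.

```python
def ratio_error(actual_goals, expected_goals):
    """How far the Actual/ExpG ratio sits from a perfect 1.0."""
    return abs(actual_goals / expected_goals - 1.0)


def closer_estimate(actual_goals, expg_a, expg_b):
    """Return 'a', 'b', or 'tie' for whichever estimate's ratio is nearer 1."""
    err_a = ratio_error(actual_goals, expg_a)
    err_b = ratio_error(actual_goals, expg_b)
    if err_a < err_b:
        return "a"
    if err_b < err_a:
        return "b"
    return "tie"


# Newcastle's 56 goals, with the hypothetical estimates from the text.
overshoot = closer_estimate(56, 58, 63)   # both estimates too high
undershoot = closer_estimate(56, 51, 47)  # both estimates too low
print(overshoot, undershoot)
```

Taking the absolute distance from 1 is what lets the same rule cover both the overshoot and the undershoot case.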
With that in mind, here are the results for the three seasons specified:
For the 2014-15 EPL, I was more accurate on 15 of 20 teams (highlighted in green). For the 2013-14 La Liga season I was again more accurate on 15 of 20 teams. The 2013-14 Serie A campaign is a little trickier because some of the goal totals on the SB Nation chart are inaccurate. Specifically, those for the following teams (the ‘Posted’ values are those on the SB Nation page, the ‘Actual’ values are what actually happened over the course of the season).
I originally thought there might be incomplete data and that the calculations were for the games available. But in that instance none of the teams would have had a posted total greater than the actual (i.e. if there were missing games, there is no way Juve’s posted total (84) could exceed their actual total (80); you can’t total up goals you don’t have). My guess is that they were honest mistakes (it’s super easy to get lost when you are moving data around). So I simply substituted in the correct values and, after doing that, I was more accurate on 15 of the 19 Serie A teams.
Yes, there are 20 teams in Serie A. I had problems with my AS Roma calculations and, even though I tried to correct for them in a way that penalized me for missing data, it didn’t seem right to include the result (FWIW, my calculation was ultimately closer to 1 but, without any guarantee of uniformity across processes, it still seemed better to toss it out). That super conspicuous black line, that’s Roma.
Add all three seasons up and it’s 45 out of 59. That’s a .763 batting average.
If you count total goals to the good, I’m better by a cumulative 99.4 goals. And that’s net. Gross, I’m up 124.6 to 25.2.
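For anyone who wants to check the totals, here is a small sketch that only re-adds the numbers already reported above. The variable names are mine, and it does not recompute anything from the underlying team data.

```python
# (teams where my ratio was closer to 1, teams compared), as reported above.
seasons = {
    "2014-15 EPL": (15, 20),
    "2013-14 La Liga": (15, 20),
    "2013-14 Serie A": (15, 19),  # Roma left out
}

wins = sum(won for won, _ in seasons.values())
teams = sum(count for _, count in seasons.values())
hit_rate = wins / teams

# Gross goals to the good in each direction, as reported above.
goals_for, goals_against = 124.6, 25.2
net_goals = goals_for - goals_against

print(f"{wins} of {teams} teams, hit rate {hit_rate:.3f}")
print(f"net goals to the good: {net_goals:.1f}")
```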
Back to the original post and the relevant ExpG calculations. Swansea was one of my biggest misses this past season. They scored 46 on an ExpG of 38.1 (my number). So I was off by almost a full 8 goals (which, for a low total, is a sizable percentage). The starting point for the post at Statsbomb was that the Swans substantially over-performed their expectations. Swansea looked like an outlier. That’s what made them worth digging into.
By the SBN model, the Swans’ Actual/Exp was 1.1165. If you take the three seasons here as the dataset and do a simple mean and variance calculation, it turns out that Swansea were about .80 standard deviations above the mean. They weren’t even a full standard deviation above it. Assuming the ratios are roughly normally distributed, about 57% of the data are going to fall within +/- .80 standard deviations. I’m not sure what would constitute an outlier for ExpG, but I’m pretty sure that’s not it.
For comparison’s sake, SBN has a mean of 0.983 and a standard deviation of 0.165; my respective numbers are 1.004 and 0.1347. So my numbers have the Swans 1.50 standard deviations above the mean (so about 86% of the data are going to fall within +/- 1.50 standard deviations). Maybe not what you’d consider a true outlier either, but at least large enough to be worth another look. Plus, when we’re right most of the time, we can be pretty confident that, on a big miss, something is up.
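The standard-deviation figures can be reproduced with a short sketch. It assumes the Actual/ExpG ratios are roughly normally distributed, which is what the "57%" and "86%" shares rely on; the function names are illustrative only, and the inputs are the means, standard deviations and Swansea numbers quoted above.

```python
import math


def z_score(value, mean, sd):
    """Number of standard deviations that value sits above the mean."""
    return (value - mean) / sd


def share_within(z):
    """Share of a normal distribution within +/- z standard deviations."""
    return math.erf(abs(z) / math.sqrt(2))


# Swansea under each model: Actual/ExpG ratio, dataset mean, dataset sd.
sbn_z = z_score(1.1165, 0.983, 0.165)
my_z = z_score(46 / 38.1, 1.004, 0.1347)

for label, z in (("SBN", sbn_z), ("mine", my_z)):
    print(f"{label}: z = {z:.2f}, share within +/- z = {share_within(z):.1%}")
```

If the ratios are noticeably skewed, the shares would shift, so treat the percentages as a rough guide and not a formal outlier test.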
Is my model perfect? Not even close. It’s football fer chrissakes. It’s like non-linear dynamics as a game. Even if it were decipherable mathematically, I can think of four additional factors I want to add (so I’m not even complete by my own standards). They are just going to be ridiculously complex to add and the gains will be marginal. Moreover, as I’ve said elsewhere, I don’t even think these calculations are where the real value in having a good model lies. Still, if someone is going to tell me that I have no idea what I’m doing, then doing some math in my defense seems like an entirely reasonable response.
Even if it’s one I spent far too much time on for my own liking.