If You’re English Maybe Start Paying More Attention To Match Appointments
This started a couple of weeks back when there was a single PL game where three goals were scored from an offside position. One was extraordinarily close, so much so that it’s barely conclusive with a frozen frame of the moment the ball was played.
The other two were laughably bad. One was from a dead ball free kick. The player who eventually scored was standing goalside of the 18 yard line. It should have been obvious he was offside because he was the only player standing upfield of that line. There was literally a visual marker painted on the field—conveniently in a nice, visible white—separating the offending player. The linesman still blew the call. The other one was probably worse. The player was a full yard offside when the ball was played to him. The curious thing about the latter decision is that the linesman appeared to shift the hand he was carrying the flag in, as if he was going to raise it, then just didn’t.
If you haven’t figured it out yet (which, at this point, would mean you’re willfully ignoring the posted images), this all happened in the Man City v. Tottenham game back on September 26, 2015.
Here’s the latter event.
Yep, that's offside #THFC #MCFC pic.twitter.com/CMdXZ2SKyB
— SportsJOE.ie (@SportsJOEdotie) September 26, 2015
The ball is played to Walker prior to Spurs’ first goal (Walker himself didn’t score but the play should have been whistled dead). You can see the linesman at the bottom right of the pic. It would be very hard for him to be better placed.
And here’s Harry Kane when the ball is played before he scores (the initial shot is taken directly from the dead ball; Kane scores when the ball rebounds off the post).
This isn’t even a good angle. I actually chose it specifically because it illustrates how easy it is to see Kane was off just using the 18, regardless of where the linesman might have been standing (which, incidentally, was staring straight down the 18… and okay, it’s not just Kane who is off; Alderweireld is slightly over the line as well).
What was common to the two players in clear offside positions who weren’t flagged? Probably lots of things, but the important one for the present discussion is that they were both English. They still are. But those two non-calls were the catalyst for the question at hand: Is there statistical evidence that some Premier League officials show favoritism to English players1? The short answer is, yes, a few do. That’s also the longer answer, but the longer answer has numbers and stuff to explain.
Start with the total infractions for the 2013-14 and 2014-15 EPL seasons. For current purposes, an infraction is a foul, or a card not accompanied by a foul2. The latter would include things like taking your shirt off after scoring or dissent3. We don’t count cards that are issued along with a foul, mostly so we don’t double-count the same event4.
Overall, English players account for 32.66% of infractions called in the league. That’s small, right? Not really. English players account for only 35% of the minutes played 5. So that 2.34 percentage point difference represents a total of about 410 infractions. That’s huge, right? Again, not really. That’s over two seasons, so at a game level it’s just a bit more than half a whistle per match.
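For anyone who wants to check the back-of-envelope arithmetic, here is a minimal Python sketch (Python rather than the R used for the original analysis). The 380 matches per season is the standard Premier League schedule; the implied league-wide infraction total is my own inference from the 410 figure, not a number from the data set.

```python
# Back-of-envelope check of the "half a whistle per match" claim.
share_of_infractions = 0.3266   # English share of infractions (from the post)
share_of_minutes = 0.35         # English share of minutes played (from the post)
missing_infractions = 410       # approximate gap quoted in the post

matches = 2 * 380               # two 20-team seasons, 380 matches each

gap = share_of_minutes - share_of_infractions    # about 0.0234
per_match = missing_infractions / matches        # a bit over 0.5

# Inference (an assumption, not a figure from the post): the league-wide
# infraction total that would make a 2.34 point gap equal about 410 calls.
implied_total = missing_infractions / gap
implied_per_match = implied_total / matches

print(f"gap in share:          {gap:.4f}")
print(f"whistles per match:    {per_match:.2f}")
print(f"implied infractions:   {implied_total:.0f} total, {implied_per_match:.1f} per match")
```

The implied rate works out to somewhere in the low twenties of infractions per match, which at least looks like a normal game.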
Moreover, that difference in itself isn’t evidence of anything. In other words, it’s not necessarily the case that, because English players account for 35% of the minutes, they should also account for 35% of the fouls. And there are any number of perfectly good reasons for that. Maybe there are more English players on the wings, and a disproportionate amount of the play happens centrally such that they are just on or around the ball less during matches. Or perhaps proper English boys like, oh, Joey Barton are just fine, well-behaved citizens and they actually commit fewer fouls per minute than their dirty foreigner counterparts6.
The reason isn’t too important here; we can still use the 0.3266 as a baseline to evaluate individual officials. Some officials probably call a few more infractions on English players, others a few less. But if we’re finding officials that are calling significantly fewer (or more) infractions against English players, that would be indicative of, well, something.
Once we count up everything and divvy it up by official, this is what we find (yeah, I know, raw R output isn’t the prettiest; I should learn to use the pretty table output package that @11tegen11 uses):
So the first column is obviously the official7. The second is the overall ratio of fouls called on English players (that’s the same 32.66% from above, but here as a decimal). The third column is the ratio of fouls called on English players for that particular official (so, for example, 32.57% of Michael Oliver’s infractions are called against English players). Those two numbers are what we compare to determine to what extent an official might be ‘favoring’ English players. This last bit is the z-score. That’s the fourth column8.
This is how many standard deviations we are above or below the mean. For our purposes, if we have a z-score less than -1.64 we’d have evidence of pro-England bias 9. So from a statistical standpoint we have four officials (Neil Swarbrick, Phil Dowd, Roger East, and Chris Foy) whistling English players for too few infractions relative to what we should expect10. If we flip the analysis, we have Craig Pawson and Mike Jones showing statistical leniency toward non-English players11.
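To make the test concrete, here is a minimal Python sketch of the one-sided z-test described above, run on the Phil Dowd figures given in the footnotes (about 1225 infractions, 359 of them on English players). This is an illustration in Python, not the R code behind the table, and the function and variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

P0 = 0.3266  # league-wide share of infractions called on English players


def z_score(english_calls, total_calls, p0=P0):
    """One-sample proportion z-score: (phat - p0) / sqrt(p0 * (1 - p0) / n)."""
    p_hat = english_calls / total_calls
    standard_error = sqrt(p0 * (1 - p0) / total_calls)
    return (p_hat - p0) / standard_error


# One-sided critical values for the lower tail.
crit_95 = NormalDist().inv_cdf(0.05)  # about -1.645
crit_99 = NormalDist().inv_cdf(0.01)  # about -2.326

# Phil Dowd, using the approximate counts from the footnotes.
dowd_total, dowd_english = 1225, 359
z_dowd = z_score(dowd_english, dowd_total)

# Expected English count if Dowd were average, and the width of the band
# (in fouls) inside which we would not reject the null at the 95% level.
expected = P0 * dowd_total
margin = abs(crit_95) * sqrt(P0 * (1 - P0) * dowd_total)

print(f"z = {z_dowd:.2f}, expected = {expected:.0f}, margin = +/- {margin:.0f}")
print("reject at 95%:", z_dowd < crit_95)
print("reject at 99%:", z_dowd < crit_99)
```

With these inputs the z-score comes out at roughly -2.5, below both cutoffs, and the band is in the high twenties of fouls, in line with the "about +/- 30" in the footnote. Swapping in the hypothetical 380 English infractions gives a z-score above -1.645, so no bias would be flagged.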
So what do you do with this information12? Well if you’re a Proper Football Man™, probably nothing. But this is analytics 13 (sometimes known by its more colloquial name of ‘information’). Say I have two left backs that are nearly identically suited for the tactics I want to employ against my next opponent. One is English, one isn’t. If I know that the game is being officiated by Neil Swarbrick, then that would absolutely factor into whom I decide to play14. Presumably I’m trying to win; part of that is A) finishing with 11 players on the pitch and B) not conceding fouls in positions that lead to dangerous free kicks. Both become more likely with an English player, given the official.
And really, if it’s effective, we might not even know it (at least not directly). Say we played our English left back. Sure he gets whistled for a couple of fouls, but maybe he could have been whistled for three or even four, one of which was at the edge of the 18. But he wasn’t. And it did not lead to a free kick from which the opponent scored the goal that would have earned them a draw. This is statistics at work. We’re looking for small gains on the margins that we can possibly exploit to our advantage. It might not work every game (Swarbrick might still have sent off our English left back for a nasty late challenge) but over the course of 38 games those small gains should add up to points15, earning draws that might have been losses or wins that might have been draws16. And oftentimes when it’s working, it’s invisible to even informed commenters (‘Wenger starting Gibbs over Monreal today. He’s likely rotating his squad with the midweek Champions League match ahead.’).
There’s a postscript to this in that you can also do the same analysis for ethnicity. In other words, instead of English/Non-English, you can aggregate infractions by official according to whether they were called against White or Non-White players.
Fortunately—I think it qualifies as fortunate—it turns out that only one official shows any pro-White racial bias (nobody is pro Non-White). And I’m not going to name that official for a couple of reasons. First, this isn’t a “Ref A is a racist” piece. That’s not what we’re saying. And the easiest way to prevent that from becoming shorthand for the analysis is to not ID the official. There is a tremendous difference between unconscious acts of bias that affect our instantaneous decision making and overt acts of racism.
Second, even a binary White/Non-white partition doesn’t bifurcate cleanly. Seriously, if you have the least bit of humanity in you and want to feel truly awful, classify a bunch of strangers by race. It’s not fun. To both alleviate some guilt and get some consistency, we set up some rules. But just because we maintain consistency doesn’t necessarily mean we’ve come up with perfect separation between White and Non-white. So officials that are close on the margins—and there were a couple of them—might turn out differently simply depending on who we call “White”17. It’d be pretty irresponsible to imply an official might show pro-White bias when there is room for large disagreements over what may or may not qualify a player as White in the first place.
That said, the one official for whom we would say race is a factor, it wasn’t even close. He was more than three standard deviations below the overall mean18. Oof.
1 I know I probably think way more about officials than is healthy but, honestly, it never would have occurred to me to do any of this just from the poor officiating on the goals themselves. It wasn’t until the 71st minute, which was a full ten minutes and three replays after Kane’s goal, that the duo calling the game were still talking about it with the commentator saying, “The third one… possibly with the aid of enhanced technology. I’m sure people will see it many times over and over again after the game. Harry Kane might have just stole into an offside position. But we’ll have to wait and see that until it becomes clearer.” Anyway, with someone so blatantly trying to excuse an English player and rationalize the blown call, I started to wonder if something similar was happening with the officials. Not that they were consciously allowing English players to get away with being a yard offside (in the case of Walker), but that, in the split seconds during which these decisions are being made, some pro-English bias occasionally manifests itself in whether the flag goes up (or a card comes out, or a foul is called) or not. Also, lest you think my obsession with officials is leading me into weird places, it’s worth noting I’m not the first to think about this type of question.
2 Originally an infraction was a foul, a foul-less card, or an offsides call. In fact I did this entire analysis that way before realizing that, duh, linesmen flag for offsides, and I was misattributing the linesmen’s (or woman’s in the case of Sian Massey) decisions to the head official. So I had to take out offsides calls and redo pretty much everything. Yes, I can be that stupid. In case you’re wondering, wrongly attributing a couple of thousand offsides calls doesn’t much change who is or isn’t showing statistical favoritism (with just two exceptions but one of them is literally a single foul away from swinging the other way).
3 Some of these, like the shirt-off-after-goal, are flow-chart events. If it happens, then the player gets carded. Others, like dissent, are more at the official’s discretion and probably carry some weight in the analysis. For instance, Wayne Rooney can apparently call an official a ‘cunt’ or any other in a string of obscenities for a good solid five minutes after a call he disagrees with and not get booked. On the other hand, Demichelis got a yellow card after the halftime whistle in the match against Tottenham for a not irrational protestation that Walker was well offsides for the then-equalizing goal (the match went to half 1-1). Demichelis is not English.
4 It might absolutely be worth classifying both cards with and without fouls and counting those (because, yeah, I would like to also know if there is a significant difference between how often English and Non-English players are carded when they foul) but that’s a separate analysis for another time.
5 Two things. First: Goalkeepers excepted. No goalkeeper fouls or minutes are in the calculations. Second: The share of fouls called against English players almost perfectly matches the percentage of English players in the league over those two seasons (32.36%) (and here’s a nice visual of how English each club may or may not be).
6 I have no idea if this is the case, these are just plausible explanations for the discrepancy. If you want something more concrete, there was an academic paper that concluded that players coming from countries that had been in a civil war were more violent (or committed more fouls). It’s worth noting that by the criteria used to define ‘civil war’, English players qualified as being from a war-torn country. So adjust your reaction to the conclusion accordingly.
7 We’ve dropped Stuart Attwell, Graham Scott, Paul Tierney and Keith Stroud because they worked too few minutes to have meaningful numbers.
8 Z score = (phat – p0)/((p0 * (1 – p0))/n)^.5
9 There’s actually a fair amount to unpack here. First, this is just hypothesis testing, so our null hypothesis is H0: p = 0.3266 and the alternative is HA: p < 0.3266, tested at the .05 level. We’re testing to see if a given official’s rate of infractions on English players is lower than 95% of the values we would expect if the null were true. That’s somewhat arbitrary: 95% and 99% levels are common thresholds in stats and we could just as easily have chosen the latter (in that event we’d just have Phil Dowd and Neil Swarbrick). In any event, if we’re below the 95% threshold, we’d reject the null hypothesis. How you word that null hypothesis is kind of tricky but basically we’re hypothesizing: ‘Being English has nothing to do with how often an official whistles players for infractions.’ In testing, we’re either rejecting that (z-score less than -1.64) or failing to reject that.
10 Using percentages makes the math easy, but from a conceptual standpoint it might make it a little more difficult to get your head around what this looks like. So let’s put some numbers to Phil Dowd as an example. Dowd called about 1225 infractions over the two seasons used here. If he were average, he’d have called 400 infractions on English players. But there’s a range of about +/- 30 fouls in which we wouldn’t conclude he was any different than average. In other words, had Dowd only whistled English players for, oh, 380 infractions, we wouldn’t say statistically that he showed any pro-English bias. But Dowd whistled English players for only 359 infractions over the two seasons, well outside the allowable +/- range.
11 It might seem unusual to have so many ref-specific samples that far from the expected mean, but a Shapiro-Wilk test does not reject normality for the data.
12 Actually the first thing I’d do is check my math. Not that it’s sketchy, but normally when I have to write a couple hundred lines of code to create a usable set of data, the first thing I do is hand audit a handful of games to make sure I didn’t mess something up. I didn’t get around to doing that this time. So whereas all analysis you see on the Internet should come with some kind of disclaimer, I usually try to double-check my work. The giant caveat here is that I didn’t get around to it.
13 It mightn’t even rise to the ‘analytics’ level of sophistication. This is just simple hypothesis testing like you would do in a basic stats class. But yes, this is a stab at the monumentally stupid debate that broke out in the wake of Brendan Rodgers’ firing by Liverpool.
14 But obviously not the only factor. I feel like this should go without saying, but I will say it anyway: Nobody who is any good at analytics thinks numbers should be the sole (or maybe even the primary) factor driving decision making. This bit started out with the notion that the two players were almost identical for tactical purposes (so skill, speed, size, positional awareness, etc.). Now we’re adding the info that one of them might be significantly less likely to be whistled for a foul.
15 If you don’t think probabilities tend to work over a large number of events, then please, keep going to Las Vegas. I own a few shares of LVS.
16 Even if it’s just one point net over the course of the season, that could have enormous consequences. Twice in the last five years, a single point has been the difference both between the Champions League and the Europa League and between relegation and survival.
17 The simplified version of the rules is as follows: All Europeans are White unless they aren’t. For example if you’re Portuguese, you’re White. But Nani isn’t White because he and his parents were born in Cape Verde. The same is true of North Americans. Central and South Americans are Non-White. It’s this classification that’s probably the most problematic. So someone like Sergio Agüero is considered Non-White whereas Santi Cazorla is White. By some measure of skin tone, Agüero might be ‘whiter’ than Cazorla. I have no idea. Like I said, even with rules to alleviate some of the decision making, assigning White and Non-White status to players felt pretty awful.
18 Even the possibility that racial bias might be impacting a single official’s calls isn’t as objectionable to me as the fact that every Premier League official is a white man (they are all white, correct?). It’s 2015, certainly there has to be someone in all of England who is not white but has the desire and qualifications to be an official at this level. The FA, the PL and the PGMOL should probably look harder for that person.
Did you check what proportion of minutes English players had for each ref?
Ref appointments are not entirely random, and team Englishness is very non-random, so it’s possible that some refs spend significantly more or less time than average reffing English players.
Totally legit question. I pulled out three refs to spot check and all were pretty close to the overall ratio (enough so that I didn’t feel the need to adjust anything). I’ll probably redo this at the end of the season, and if I update the post I’ll include numbers for minutes by official for English/Non-English. If you read the footnotes, I make reference to doing the test at both the 95% and 99% levels. Even without correcting for small variations in minutes, I’d still feel extremely confident that the two officials on the pro-English side at 99% would stay there.
Interesting stuff, but I’m going to play devil’s advocate here.
a) I have a hard time thinking that, at normal game speed, refs would be able to note that a player was English (or not) and choose to raise his flag (or not) fast enough for it to not be immediately and blatantly obvious.
b) Your table has a ‘Mike Jones’ and a ‘Michael Jones’. I can’t find any record of a ‘Michael Jones’ working as a referee during the 2013/14 or 2014/15 seasons, although there is a Mike Jones, a Mike Dean, and a Michael Oliver. Also, I wonder if we should count Webb – he hasn’t been an active EPL ref since the 2013/14 season, I think. For continuity’s sake, shouldn’t we only be pulling data from refs that worked in both seasons?
c) Unless we know the number of minutes English vs non-English players played at the same time as each official, I think the results are next to meaningless. Going back to Dowd’s example – roughly 340 fouls out of 1225 called isn’t just outside the norm, it’s over 3 standard deviations below the mean, but if he called just one more foul every other game or so, he’s basically average? That doesn’t pass the smell test for me, especially given such a wide disparity in matches played – Roger East only called 22 matches; Swarbrick and Foy were in the 30s, while six or seven referees called over 50 matches. A few matches with a more (or less)-than-average number of English players could really skew things.
d) Back to Dowd, as near as I can make out over the past two years he’s called more yellow cards per match than any other ref in the EPL. May or may not be meaningful.
e) I wonder if knowing how more or less likely an official is to call fouls vs the league average is relevant.
Thanks for the thoughtful reply. The important questions addressed in order below.
a) This is the most common criticism I got from email and Twitter. In other places I’ve referenced the book ‘Scorecasting’. It has several instances of officials acting in ways that show information affecting their decisions. For example, home plate umpires in baseball err on the side of prolonging the at-bat. In other words, if there’s a borderline pitch on an 0-2 count, it’s more likely to be called a ball than a strike (the at-bat is still alive at 1-2, whereas a called third strike ends it). There are a few other examples in the book. But basically, I’m inclined to believe that there is an awful lot of information in an official’s brain and maybe way more of it than we might think is accessed in the nano-seconds over which these decisions are made.
b) Fantastic catch. Thank you. And I imagine that once I make “Michael Jones” back into “Mike Jones” we will be left with only Pawson showing a non-English bias. As I mentioned in the footnotes, I almost always hand audit some of the data to avoid these kinds of oversights. I’ll try to get around to updating the table over the weekend and repost.
c) I would take the other approach. On a per-match basis we probably would never have enough info to do these kinds of calculations in any meaningful way. Also, if a game has 20-25 infractions, one (or the lack thereof) is 4% or 5% of the total, and that’s not negligible. And with just a couple of seasons we get reasonably sized data sets. A foul or two here and there adds up to a noticeable number: 30, 40, 50 fouls. That’s a couplathree games’ worth. And again I’m not even doing anything with the type of infraction (this is just quantitative, not qualitative; not raising the flag on the English player who is a yard offside has potentially a far greater impact on the outcome of a match than whistling for a foul throw or something). Point being, I wouldn’t be so dismissive of a once-per-game event. Lastly, I’m not worried by the minutes or number of matches. We’re looking at the ratio of fouls for/against English players. You could officiate 2x the number of matches as the next official and call 1/2 as many fouls, but as long as we know how many fouls you called on English v. Non-English players we’re good. N (the number of total infractions) is sufficiently large for all the officials here (we dropped 3 or 4, check the footnotes).