Wednesday, November 16, 2016

Small Samples vs Bad Samples

I got into a Twitter discussion recently about Chelsea's current hot streak, now up to five straight wins in the Premier League. The argument I responded to was straightforward: a five-game sample is too small to draw any conclusions from, and City's five-game winning streak to start the season shows that such runs don't last. It is certainly true that five games is not a large enough sample to project the rest of the season with any certainty. However, not all small samples are created equal, and ignoring obvious factors such as strength of schedule leads to the erroneous conclusion that both samples are equally flawed. Chelsea have been objectively better over their five-game sample than City were over theirs, and by simply repeating the mantra of "small sample size," analysts are doing their audience a disservice.

Let's take an example away from football for a minute: political polling. If you polled 1,000 people in Florida on who they planned to vote for in the recent presidential election, that might give you a pretty good idea of the state of the race in Florida at the time. However, who is in that 1,000-person sample matters to the accuracy of the poll. If the poll included only Caucasians, it would not be very accurate in predicting the winner, given the diversity of the state. If the poll sampled all registered voters rather than likely voters, the respondents would be less likely to actually go to the polls, so there would be greater uncertainty in the result. Put simply, it is not only the size of the sample that determines how effective the poll is, but also how representative the sample is of the overall population. Size obviously plays a key part in that, but as noted above, other factors also play a role.

Football is very similar. There are plenty of ways a sample of a team's matches could fail to be representative of the season as a whole. Most obviously, if a team played weak opponents over the stretch, the sample is not very representative, as each team must play every other team (good and bad) twice over the course of the season. Similarly, if they had an abnormally good finishing run in the sample (either G-xG or G/SOT being very high), that would be unlikely to be replicated over the full season. If the sample contained games where the team or their opponents went down to ten men, that could also make it unrepresentative (I discussed this specific scenario with regard to Swansea last year). Games with an abnormally high number of own goals or penalties would also be worth a look. As a result, it is disingenuous to suggest that all five-game samples have equal predictive value (or lack thereof), as that does not take any of these factors into account.

So in the case of City's and Chelsea's streaks, how representative are the samples of the overall population, i.e. the games they will play this season? In City's case, not very representative at all. Their opponents over the stretch had an average points per game (PPG) of just 1.07, compared to a league average of 1.37. City were also finishing at higher-than-expected rates while their opponents were finishing at lower-than-expected rates: the average G/SOT (goals per shot on target, excluding penalties and own goals) so far this season is 30%, while City were at 45% and their opponents at 25% over the period. Chelsea, meanwhile, played close to an average schedule in their streak (opponents' PPG was 1.29), and though their finishing rates are definitely running hot (Chelsea were at 43% and their opponents at 0%), their SOT numbers are more than good enough to offset that.

To that point, I have compiled the data for each Premier League team's best five-game stretch so far this season in terms of points (when teams have multiple stretches with the same point total, I chose the one with the higher Opponents' PPG). The table is shown below:


Team            Opp PPG   SOT  Opp SOT   SOTD  TOP %    GF    GA     GD   G/SOT  Opp G/SOT  Points  Expected GD   xEGD  Actual - xEGD
Chelsea            1.29  7.40     1.60   5.80  52.60  3.20  0.00   3.20  43.24%      0.00%    3.00         2.15   1.74           1.46
Southampton        1.20  7.20     1.80   5.40  54.00  1.60  0.40   1.20  20.00%     12.50%    2.20         0.46   1.62          -0.42
Everton            1.15  7.40     2.20   5.20  56.40  2.00  0.60   1.40  25.00%     18.18%    2.60         1.06   1.50          -0.10
Liverpool          1.33  7.40     3.00   4.40  58.80  2.80  1.00   1.80  32.35%     33.33%    2.60         1.27   1.14           0.66
Man City           1.07  6.40     2.60   3.80  65.00  3.00  0.80   2.20  40.00%     25.00%    3.00         1.46   1.08           1.12
Tottenham          1.44  6.60     3.00   3.60  56.80  2.00  0.40   1.60  28.13%      7.14%    2.60         0.94   1.08           0.52
Man United         1.35  5.20     3.40   1.80  51.60  1.60  1.20   0.40  28.00%     31.25%    1.80         0.15   0.54          -0.14
Arsenal            1.40  5.20     3.20   2.00  58.00  2.60  0.60   2.00  47.83%      6.67%    3.00         1.46   0.48           1.52
Watford            1.20  4.80     3.20   1.60  44.60  2.00  1.40   0.60  39.13%     43.75%    2.00         0.46   0.42           0.18
Stoke              0.89  5.20     4.00   1.20  45.80  1.80  0.60   1.20  30.77%     10.00%    2.20         0.39   0.36           0.84
Crystal Palace     1.07  4.60     3.60   1.00  51.00  2.20  1.20   1.00  47.83%     33.33%    2.20         0.10   0.30           0.70
West Ham           1.05  2.80     3.60  -0.80  51.40  0.80  0.80   0.00  21.43%     23.53%    1.60        -0.36  -0.18           0.18
Middlesbrough      1.56  3.20     4.00  -0.80  43.60  0.80  0.60   0.20  25.00%     15.00%    1.20        -0.12  -0.24           0.44
Leicester          1.44  4.20     5.00  -0.80  47.40  1.40  1.40   0.00  25.00%     28.00%    1.40         0.05  -0.30           0.30
West Brom          1.15  3.40     4.60  -1.20  35.60  1.20  1.00   0.20  31.25%     18.18%    1.40        -0.25  -0.36           0.56
Bournemouth        1.45  4.20     5.20  -1.00  50.60  2.00  1.40   0.60  45.00%     26.92%    2.00        -0.36  -0.36           0.96
Swansea            1.35  4.00     5.20  -1.20  50.00  0.80  1.40  -0.60  15.79%     26.92%    0.80        -1.11  -0.42          -0.18
Hull               1.33  3.20     5.60  -2.40  46.20  1.20  1.40  -0.20  33.33%     23.08%    1.40        -0.50  -0.66           0.46
Sunderland         1.33  2.00     6.00  -4.00  39.80  0.80  1.80  -1.00  25.00%     30.00%    0.80        -1.27  -1.32           0.32
Burnley            1.53  2.60     8.20  -5.60  33.80  1.20  1.40  -0.20  41.67%     12.82%    1.40        -0.29  -1.62           1.42
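
For anyone who wants to reproduce the selection rule used for this table (best five-game stretch by points, with ties going to the stretch with the higher opponents' PPG), here is a minimal sketch in Python. The match records, the `best_window` name and the sample results are all made up for illustration; this is not the spreadsheet I actually used.

```python
from typing import List, Tuple

# Hypothetical per-match record: (points earned, opponent's season PPG).
Match = Tuple[int, float]

def best_window(matches: List[Match], size: int = 5) -> Tuple[int, int, float]:
    """Return (start index, total points, mean opponent PPG) of the best window.

    Windows are ranked by points first; ties go to the higher mean opponent PPG.
    If two windows tie on both, the earlier one is kept.
    """
    if len(matches) < size:
        raise ValueError("not enough matches for one window")
    best_key = None
    best_start = 0
    for start in range(len(matches) - size + 1):
        window = matches[start:start + size]
        points = sum(p for p, _ in window)
        opp_ppg = sum(o for _, o in window) / size
        key = (points, opp_ppg)
        if best_key is None or key > best_key:
            best_key, best_start = key, start
    return best_start, best_key[0], best_key[1]

# Made-up season: two perfect five-game runs, the second against a harder slate.
example = [(3, 0.9), (3, 1.0), (3, 1.1), (3, 0.8), (3, 1.0), (0, 1.6),
           (3, 1.5), (3, 1.4), (3, 1.3), (3, 1.2), (3, 1.6)]

start, points, opp_ppg = best_window(example)
print(start, points, round(opp_ppg, 2))
```

In this invented example both perfect runs are worth 15 points, so the tie-break picks the later one (starting at index 6) because its opponents average 1.4 PPG rather than 0.96.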


The Expected GD column is what the team's goal difference would be if they had finished their SOT over the sample at their own rate for the season and their opponents had finished their SOT at the rate the team has allowed for the season. The xEGD column is what the team's goal difference would be if they and their opponents had both finished their SOT at the league-average rate of 30%. Chelsea lead in both columns, suggesting that even after accounting for their finishing luck, their streak has been the most impressive so far this year. This is not to say they haven't been lucky with finishing (their Actual - xEGD is the second largest in the set), but the number of shots on target they are producing has been very good. Also, unlike Southampton and Everton (who also rank highly), their streak hasn't been boosted by an easy slate of opponents.
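
The arithmetic behind the two columns can be sketched as follows. This is only an illustration under stated assumptions: the function names are mine, the season-long finishing rates needed for Expected GD are not in the table and so are left as parameters, and the sketch does not model the penalty and own-goal exclusions, so other rows may not reproduce exactly from the rounded table values.

```python
LEAGUE_AVG_G_PER_SOT = 0.30  # league-average finishing rate quoted in the post

def expected_gd(sot_for, sot_against, rate_for, rate_against):
    """Goal difference per game if shots on target were finished at the given rates."""
    return sot_for * rate_for - sot_against * rate_against

def xegd(sot_for, sot_against, league_rate=LEAGUE_AVG_G_PER_SOT):
    """Goal difference per game if both sides finished at the league-average rate."""
    return expected_gd(sot_for, sot_against, league_rate, league_rate)

# Chelsea's row: 7.40 SOT for, 1.60 SOT against, actual GD of 3.20 per game.
chelsea_xegd = xegd(7.40, 1.60)
chelsea_over = 3.20 - chelsea_xegd  # the "Actual - xEGD" column

print(round(chelsea_xegd, 2), round(chelsea_over, 2))
```

For Chelsea's row this gives 1.74 and 1.46, matching the table.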

A closer look at this set also shows why hot streaks tend to happen. The average opponents' PPG for these streaks is 1.28, slightly below the league average, suggesting that such streaks tend to come against weaker opponents, as in Stoke's most recent five-game run. The Actual - xEGD column shows that, on average, teams gain about 0.5 in goal difference per game from finishing better than league average over these streaks. Minutes played against ten men shoot up here too: these streaks account for 147 of the 210 total minutes teams have played with a man advantage. Arsenal's five-game win streak is a good example of both, with 50+ minutes against 10-man Hull and an opponents' G/SOT of just 6.67%. The point is that for most of these streaks there are easily identifiable reasons the team might be over-performing, but Chelsea aren't showing any of them.
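
The two averages quoted above can be checked directly against the table. The sketch below simply copies the Opp PPG and Actual - xEGD columns (in table order, Chelsea first) and takes their means; the variable names are mine.

```python
from statistics import mean

# Opp PPG column, copied from the table in row order.
opp_ppg = [1.29, 1.20, 1.15, 1.33, 1.07, 1.44, 1.35, 1.40, 1.20, 0.89,
           1.07, 1.05, 1.56, 1.44, 1.15, 1.45, 1.35, 1.33, 1.33, 1.53]

# Actual - xEGD column, copied from the table in row order.
actual_minus_xegd = [1.46, -0.42, -0.10, 0.66, 1.12, 0.52, -0.14, 1.52, 0.18, 0.84,
                     0.70, 0.18, 0.44, 0.30, 0.56, 0.96, -0.18, 0.46, 0.32, 1.42]

avg_opp_ppg = mean(opp_ppg)
avg_finishing_edge = mean(actual_minus_xegd)

print(round(avg_opp_ppg, 2), round(avg_finishing_edge, 2))
```

The means come out at roughly 1.28 and 0.54, in line with the figures in the paragraph above.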

Again, this isn't to say Chelsea will definitely win the title, nor that five games is enough to draw sweeping conclusions. However, it's clear that Chelsea's recent streak came against a much more representative sample of opponents than City's did, and was more impressive even after accounting for finishing. Therefore, I think it is likely to be more predictive than City's early winning streak proved to be.
