Dan Altman showcases his models designed to support recruitment and the early identification of future stars.
Identifying future stars is an important goal of recruitment for football’s top teams. Many of them won’t look at a young player unless he already has some minutes in a top league under his belt. But quite a few of these rookies will still flame out before realising their apparent potential. The key is to find consistent indicators that pick the genuine stars whose performance levels can be sustained over multiple seasons.
I’ve developed two models that do this, and now it’s time to test them. Usually there are three stages in the cyclical process of developing a predictive model: creating the prototype, calibrating it with existing data, and then testing it on future data. Once you’ve tested the model, you can recalibrate and try again. But a model that needs constant recalibration can be problematic; it may just be fitting the data and not picking up any innate dynamics of the system. And if the system is changing so quickly that the model always needs to be modified, then the task of modelling may not be worthwhile.
My models are now ready for the third stage of their development: testing. If I want to identify 19-year-old players who’ll be stars when they’re 23, then I need five seasons of data to calibrate my model; I need to follow at least one cohort as they progress through those ages. Opta now offers five seasons of detailed match data for several leagues, so calibration is possible. Testing comes next.
I’m going to do my test publicly by publishing my predictions here. I’m using two models: one that measures how players contribute to shots, and another that measures how players advance the ball into zones closer to the opposing goal. The first model looks at the chain of events leading up to a shot and divides credit for that shot among all the players who participated, at least in the on-the-ball action. The second model identifies a series of zones increasingly near to the goal and assesses the likelihood of scoring conditional on entering each zone; players who move the ball to a better zone receive credit for the incremental rise in the chance of scoring.
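To make the two ideas concrete, here is a minimal Python sketch, not the models themselves. It assumes an equal split of credit among the players in a shot chain and uses made-up scoring probabilities for four zones; the function names, the equal-split rule, and every number are hypothetical.

```python
from collections import defaultdict

def split_shot_credit(chain, shot_value=1.0):
    """Divide one shot's value equally among the distinct players in its chain.

    The equal split is an assumption; the article does not say how credit is weighted.
    """
    players = list(dict.fromkeys(chain))  # drop repeats, keep order
    share = shot_value / len(players)
    return {player: share for player in players}

# Hypothetical P(goal | ball enters zone); zone 0 is farthest from goal.
ZONE_SCORING_PROB = [0.01, 0.02, 0.05, 0.12]

def zone_advance_credit(moves, zone_prob=ZONE_SCORING_PROB):
    """Credit each player with the rise in scoring probability from each move.

    moves: iterable of (player, from_zone, to_zone) tuples.
    Only moves into a better zone earn credit in this sketch.
    """
    credit = defaultdict(float)
    for player, from_zone, to_zone in moves:
        gain = zone_prob[to_zone] - zone_prob[from_zone]
        if gain > 0:
            credit[player] += gain
    return dict(credit)
```

For example, `split_shot_credit(["A", "B", "A", "C"])` gives each of the three players a third of the shot, and a player who carries the ball from zone 0 to zone 2 earns the difference between the two zone probabilities.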
Each model evaluates players’ attacking and defending in comparison to their counterparts at the same basic position: strikers, attacking midfielders (including wingers), central midfielders, full backs, central defenders, and goalkeepers. The players’ positions are decided by an algorithm using the locations and types of their touches and defensive opportunities. Their scores are computed on a per-action basis rather than a per-minute basis, since the credit and demerits they receive depend on their actions on or near the ball. Neither model tries to measure exceptional skill in shooting or shot-stopping.
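A per-action score compared within a position group could be sketched as follows. This is only an illustration: the dictionary keys, the percentile formula, and the handling of a position with a single player are all assumptions rather than details of the models.

```python
from collections import defaultdict

def per_action_scores(players):
    """Return credit per action for each player with at least one action.

    players: list of dicts with 'name', 'position', 'credit', 'actions' (hypothetical schema).
    """
    return {p["name"]: p["credit"] / p["actions"] for p in players if p["actions"] > 0}

def percentile_within_position(players):
    """Rank each player's per-action score against peers at the same position (0-100)."""
    scores = per_action_scores(players)
    by_position = defaultdict(list)
    for p in players:
        if p["name"] in scores:
            by_position[p["position"]].append(p["name"])

    result = {}
    for names in by_position.values():
        n = len(names)
        for name in names:
            below = sum(scores[other] < scores[name] for other in names)
            # With no peers to compare against, fall back to the midpoint.
            result[name] = 100.0 * below / (n - 1) if n > 1 else 50.0
    return result
```

Dividing by actions rather than minutes means a substitute with few touches is judged on what he did with them, not on how long he was on the pitch.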
To be marked as top prospects, players have to be 20 years old or younger on 1st August of a given season, play at least 360 minutes during the season, and achieve a certain score relative to their peers in the two models. My goal was to select players who could become Premier League regulars and full internationals for their countries.
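The age and minutes criteria translate directly into a filter. The sketch below is an assumption-laden illustration: the article does not give the score threshold, so `min_score` is a hypothetical parameter, and requiring both models to clear the same bar is also an assumption.

```python
from datetime import date

MIN_MINUTES = 360   # from the article
MAX_AGE = 20        # from the article: 20 or younger on 1st August

def age_on(birth_date, ref_date):
    """Age in completed years on ref_date."""
    had_birthday = (ref_date.month, ref_date.day) >= (birth_date.month, birth_date.day)
    return ref_date.year - birth_date.year - (0 if had_birthday else 1)

def is_top_prospect(birth_date, minutes, shot_score, zone_score,
                    season_start_year, min_score):
    """Apply the age, minutes, and score criteria for one player-season.

    min_score is hypothetical; the real cutoffs are not published in the article.
    """
    cutoff_date = date(season_start_year, 8, 1)
    return (
        age_on(birth_date, cutoff_date) <= MAX_AGE
        and minutes >= MIN_MINUTES
        and shot_score >= min_score
        and zone_score >= min_score
    )
```

A player born on 2nd August 1993 is still 20 on 1st August 2014 and so remains eligible for 2014-15, while one born a day earlier has just turned 21 and is not.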
Choosing the cutoff for minutes played was a tradeoff. Players with more minutes have more robust ratings in the models, and players with fewer minutes may be enjoying high-energy spurts as substitutes. But there aren’t many young players who are regular starters, even though some of them will go on to be stars.
Choosing the cutoff for ratings was the other big part of the calibration process. I wanted to cast a wide net in order to identify potential stars as soon as possible; I could have used narrower criteria to get a shorter list of surefire winners, but then the algorithm would have taken longer to spot them. Intuitively it made sense to vary the cutoff by age, with a lower bar for younger players. Apart from that, I experimented with different values until I reduced the “false positives” and “false negatives” – judged by both the models and my eye – as much as possible in my calibration period, which ran from 2010-11 through 2013-14. I’ll explain what I mean by these terms below.
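The trade-off can be illustrated with a small sketch of an age-dependent cutoff and the errors it produces. The cutoff values, the tuple layout, and the simple grid search are all hypothetical; the actual calibration also relied on judgement by eye, which code like this cannot capture.

```python
# Hypothetical rating cutoffs by age, with a lower bar for younger players.
CUTOFF_BY_AGE = {17: 60.0, 18: 65.0, 19: 70.0, 20: 75.0}

def flagged(age, score, cutoffs=CUTOFF_BY_AGE):
    """True if a player of this age clears the cutoff for his age."""
    return age in cutoffs and score >= cutoffs[age]

def error_counts(sample, cutoffs=CUTOFF_BY_AGE):
    """Count false positives and false negatives.

    sample: list of (age, score, became_star) tuples from a calibration period.
    """
    false_pos = sum(1 for age, score, star in sample
                    if flagged(age, score, cutoffs) and not star)
    false_neg = sum(1 for age, score, star in sample
                    if not flagged(age, score, cutoffs) and star)
    return false_pos, false_neg

def best_shift(sample, shifts=range(-10, 11, 5)):
    """Shift every cutoff by the same amount and keep the shift with the fewest errors."""
    def total_errors(shift):
        shifted = {age: cut + shift for age, cut in CUTOFF_BY_AGE.items()}
        return sum(error_counts(sample, shifted))
    return min(shifts, key=total_errors)
```

Lowering the bar widens the net and trims false negatives at the cost of more false positives; raising it does the reverse.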
I’ve also excluded goalkeepers, as I typically use different algorithms to assess them. Moreover, only three goalkeepers aged 20 or younger played in the Premier League in the past five seasons: Wojciech Szczesny, David de Gea, and Paulo Gazzaniga. This past season, there were none.
Premier League top prospects, 2010/11 - 2013/14
The largest group of players appears in 2010-11 because it’s the first season in my data; the models couldn’t have spotted them any earlier. But the models didn’t always pick up players in their first season of eligibility. Danny Welbeck, Carl Jenkinson, and Ben Davies all played plenty of minutes in the seasons prior to the ones that made them top prospects, but their scores in the models came up short.
In fact, there were some players that the models missed altogether, depending on your definition of a top prospect. For the first four seasons, these “false negatives” may have included Mario Balotelli and Ross Barkley, who both came close to selection. And then there were the “false positives” who may have turned out to be less than stellar, like Jazz Richards and Suso. To be sure, some of the false positives may not have had the good fortune to stay healthy or play on a top team. But even putting aside those excuses, the rates of false positives and false negatives look to be around 10% or lower. That’s the result of calibration.
On the lists above, players only appear the first time they were flagged by the models. Luke Shaw, for example, has been flagged in every season in which he was eligible. But that doesn’t mean top prospects who switched teams, like Shaw, always continued to perform at the same level. Jordan Henderson played more than 3,400 minutes in 2010-11 and was in the top 20 percent of central midfielders – easily good enough to be flagged as a 20-year-old top prospect. In his first season for Liverpool, though, he looked less promising, if still above average. He and the team may just have needed time to adjust; in every season since, he’s been back in the top 20 percent. The key here is that the models flagged Henderson even when he was at Sunderland, a club that finished tenth in 2010-11.
Premier League top prospects, 2014/15
If you follow the Premier League closely, you might not see many surprising names on these lists. And you might ask, if that’s the case, why the models are still useful. The answer is simple: If the models pick the same players you would in leagues you know, then you can trust them to pick players in leagues you don’t know. In fact, you can teach the models to mimic your preferences – whatever they may be – for any number of positions and then set them loose on leagues around the world.
Of course, when comparing players from many leagues, it’s important to evaluate the quality of those leagues as well. For leagues of lower quality than the Premier League, the selection criteria would be tighter; you might not be interested in the Ben Davies of Liga MX, but you’d probably be interested in its Raheem Sterling. Fortunately, players often switch leagues, so there are ample data to make those adjustments.
It’s also crucial to assess players’ mindsets, gauge their susceptibility to injury, and monitor their behaviour off the pitch – all essential functions of traditional scouting. Indeed, data analysis and traditional scouting work best when they work together. Hundreds of young players make their debuts every year in global football, far too many to research each one thoroughly. These models can provide a first cut for evaluating players in bulk, saving time for coaches, scouts, and video analysts.
Dan Altman is an economist and founder of North Yard Analytics, a sports data consulting firm (www.northyardanalytics.com).