In football analytics, knowing from statistics like Expected Goals that your opponent generates high-quality chances is only half the picture; understanding how they create those chances is just as important.
Descriptive statistics that try to capture the “how” can provide additional tactical insight and be used to influence performance. Knowing, for example, that an upcoming opponent generates its best chances through cut-backs is something that can be better prepared for than just knowing the number of chances they tend to create.
One bit of work in this area which has really resonated with me was Dustin Ward’s analysis at StatsBomb surrounding similarities between team playing styles. To summarise, Ward devised a series of metrics which he felt characterised a team’s style of play on offence, and performed a cluster analysis to group teams together by similar characteristics.
His work last spring left off with a cluster analysis of team attack and defence, so I thought it would be interesting to build on that by looking at how teams across Europe from this season group together from an attacking perspective, having introduced a few new wrinkles with the benefit of full access to Opta data.
Why group and characterise teams?
Both the exercise of devising statistics to explain team differences in style and arriving at the eventual cluster groupings can provide crucial information. Knowing how a team tries to generate chances and the shape of their play when in possession can help prepare ahead of playing them. Having a quantitative tool for measuring aspects of a team’s play can provide important context to the various aspects of pre- and post-match analysis. This same approach might further be applied to managers to better understand a particular playing style.
Grouping teams together may also provide insight about how a transfer target could adapt to a new or similar environment. Roberto Soldado’s transfer to Tottenham Hotspur provides, albeit with the benefit of hindsight, an example of how moving between disparate team environments can yield uncertain results.
Selecting the clusters
For the purpose of this article I focused on offence and used statistics to cover a variety of aspects of play, from different parts of the pitch to different attributes with natural interpretations such as the width of build-up play. I also checked each statistic to make sure it was relatively consistent within each team. After all, having a measure that indicates a team prefers to cross rather than pass into the box doesn’t do much good if the metric can’t be relied on to persist. I elaborate on the technical details of this process in the addendum below.
I settled on using the following:
Possession – The simple Opta definition of team share of total game passes
Field Tilt – Team share of final third passes (name coined by Ward, though he used two different versions). This correlated highly with possession, but it helped separate the clusters in the end, so I included it.
Shots:Possession Ratio – Exactly as its name describes. The end result isn’t a meaningful unit since possession is represented as a percentage, but I liked the way teams stacked up given that what I was trying to measure was effective use of possession or some kind of ‘directness’.
Cross Preference – Attacking crosses into the box divided by number of regular passes into the box.
Box Attacks – Total attacking crosses and passes into the box.
Box Attack Rate – Box Attacks divided by total attacking passes starting and ending in the final third. This was designed to measure patience or lack thereof in converting attacking possession into dangerous chances.
Absolute Attack Width – Taking each pass in the final third, measure the absolute value of its distance from the centre of the pitch and take the median value for each team.
Abs. Play-out Width – Same as above, but for passes ending in the defensive 2/3rds.
Play-out Length – Median pass length for passes starting in the defensive 2/3rds.
Final Third Entries – Total actions which result in the team entering the attacking 1/3rd.
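To make a few of these definitions concrete, here is a minimal sketch of how Cross Preference, Box Attacks, Box Attack Rate and Absolute Attack Width might be computed from a list of passes. The data layout is an assumption for illustration only: the keys `start_x`, `start_y`, `end_x`, `is_cross` and `into_box` are hypothetical, the pitch is assumed to be scaled 0 to 100 in both directions with the team attacking towards x = 100, and width is measured from each pass's starting position. The passes themselves are made up.

```python
from statistics import median

FINAL_THIRD_X = 200 / 3  # assumed: pitch length scaled 0-100, attacking towards x = 100
CENTRE_Y = 50.0          # assumed: pitch width scaled 0-100

def attack_metrics(passes):
    """Compute four of the listed metrics from a list of pass dicts (hypothetical keys)."""
    box_crosses = sum(1 for p in passes if p["into_box"] and p["is_cross"])
    box_passes = sum(1 for p in passes if p["into_box"] and not p["is_cross"])
    # Passes that both start and end in the final third.
    final_third = [
        p for p in passes
        if p["start_x"] >= FINAL_THIRD_X and p["end_x"] >= FINAL_THIRD_X
    ]
    box_attacks = box_crosses + box_passes
    return {
        "cross_preference": box_crosses / box_passes,
        "box_attacks": box_attacks,
        "box_attack_rate": box_attacks / len(final_third),
        # Median distance from the centre line, measured at the pass origin (an assumption).
        "abs_attack_width": median(abs(p["start_y"] - CENTRE_Y) for p in final_third),
    }

# Five made-up passes for one hypothetical team.
passes = [
    {"start_x": 70, "start_y": 10, "end_x": 90, "is_cross": True, "into_box": True},
    {"start_x": 75, "start_y": 40, "end_x": 88, "is_cross": False, "into_box": True},
    {"start_x": 80, "start_y": 55, "end_x": 85, "is_cross": False, "into_box": True},
    {"start_x": 70, "start_y": 90, "end_x": 72, "is_cross": False, "into_box": False},
    {"start_x": 40, "start_y": 50, "end_x": 60, "is_cross": False, "into_box": False},
]

metrics = attack_metrics(passes)
print(metrics)
```

A real implementation would also need to guard against division by zero for teams or samples with no regular passes into the box or no final third passes.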
What did I find from this analysis?
After grouping each team based on these statistics, some interesting results emerged. The full results of the analysis can be examined in the table below.
Newly crowned champions Leicester City being grouped with West Ham is interesting as both teams have overachieved this season relative to most pre-season expectations. They profile relatively similarly across most statistics, but what sticks out for me is that both have below average possession but convert it to shots at a high rate, as reflected in their shots:possession ratio.
Ingolstadt ended up in a group with several other teams, but they were often placed in a cluster by themselves in other analyses I ran. Their statistics make for interesting viewing. They play the ball long much more than Bundesliga teams on the whole. They’re also one of only three teams in the sample to play with above average width in attack and a below average preference for crossing into the box relative to passing. The other two? Champions League semi-finalists Manchester City and Atletico Madrid. Having been promoted this year, Ingolstadt sit ninth in the Bundesliga, with an identical record to Wolfsburg. However, this success may have little to do with their offensive game plan, given that they’ve scored fewer goals than all but bottom-dwelling Hannover and allowed fewer goals than all but three teams.
Liverpool, Napoli and Real Madrid make for an interesting trio, and the grouping perhaps suggests that Jurgen Klopp has Liverpool playing a style which may pay dividends if some more variables flip their way, even if their Premier League results underwhelm slightly.
Bayern Munich go in their own cluster (16), which pretty much speaks for itself.
There’s some evidence of enduring coaching legacies. Napoli and Empoli (both recently managed by Maurizio Sarri) didn’t end up grouped together, but they stick out like sore thumbs in Serie A. Both play short and narrow out of the back in a league which trends long and wide in build-up play. Both are also among only five Serie A teams to register below average cross preference.
In more evidence for a coaching imprint, Spurs and Southampton, both bearing Mauricio Pochettino’s mark, are grouped together in cluster 19.
Preparing for the FA Cup final
Comparing the styles of Manchester United and Crystal Palace provides an interesting angle for a preview of the FA Cup final. Both teams tend to play out and especially attack with width. They differ in how they convert this attack into chances. Palace rely heavily on crossing while United are slightly more predisposed to passing in the final third. The two are near opposites in terms of possession and directness, indicating that the match may see a lot of United in possession, with Palace hoping to steal a goal on the break by crossing the ball quickly and frequently. For a recipe for success, Palace might further examine the games United played away to Swansea City and Bournemouth and at home to Norwich City in the Premier League, all matches where United were defeated by a team relying heavily on crosses.
Clustering teams across Europe’s top 5 leagues
Each team is shown with its cluster membership as well as standardised Z-score values for each of the metrics used to determine the clusters. The standardisation allows for comparison across metrics which are measured in different units. A value of zero means the team is at the mean for that metric, and a value of 3.0 (or -3.0) marks a team as an equally extreme outlier whichever metric it appears in.
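As a small illustration of that standardisation, the sketch below converts two metrics in very different units to Z-scores. The team values are made up, and the use of the population standard deviation is an assumption; the sample standard deviation would be an equally plausible choice.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardise a list of team values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD (assumed); stdev() would give the sample SD
    return [(v - mu) / sigma for v in values]

# Made-up values for five hypothetical teams, in very different units.
possession_pct = [62.0, 55.0, 48.0, 45.0, 40.0]   # share of passes, in percent
play_out_length = [14.0, 16.5, 18.0, 21.0, 25.5]  # median pass length (unit assumed)

possession_z = z_scores(possession_pct)
length_z = z_scores(play_out_length)

# After standardising, both columns are on the same scale and can be compared.
for p, l in zip(possession_z, length_z):
    print(f"{p:+.2f}  {l:+.2f}")
```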
Additional technical information
Determining Model Inputs
My method here was to randomly split each team’s games into two half-seasons and test the correlation of the statistic between the two halves. I repeated this random resampling 500 times to make sure the correlations weren’t a quirk of the random shuffling. I also examined the correlation between different stats to make sure I wasn’t measuring the same thing twice under different names. The table below shows self-correlation between samples on the diagonal and ordinary correlation between the variables on the off-diagonals. There is some correlation between some of the variables I included, even after this selection process, but the preliminary PCA process described below mitigates that to some extent, and I felt that the resulting cluster model benefited from the inclusion of the more correlated statistics as well.
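The split-half check described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the exact procedure used: it assumes one value of the statistic per game, summarises each half-season by its mean, and reports the average correlation over the resamples. The function names and the synthetic data are hypothetical.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def split_half_correlation(team_games, n_resamples=500, seed=0):
    """Average correlation between random half-seasons.

    team_games maps a team name to a list of per-game values of one statistic.
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_resamples):
        first, second = [], []
        for values in team_games.values():
            shuffled = values[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            # Summarise each half by its mean (an assumption; a median would also work).
            first.append(mean(shuffled[:half]))
            second.append(mean(shuffled[half:]))
        correlations.append(pearson(first, second))
    return mean(correlations)

# Synthetic data: ten hypothetical teams with different underlying levels plus game-to-game noise.
data_rng = random.Random(1)
team_games = {
    f"Team {i}": [40 + 2 * i + data_rng.gauss(0, 3) for _ in range(38)]
    for i in range(10)
}

reliability = split_half_correlation(team_games)
print(round(reliability, 3))
```

A statistic whose split-half correlation stays high across resamples reflects something persistent about the team; one that hovers near zero is mostly noise and is a poor clustering input.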
Determining Model Parameters
Once I had decided the above statistics would be my criteria for identifying team clusters, I standardised each one by computing Z-scores for each team, so that the units were comparable. I then used Principal Component Analysis (PCA) to reduce the dimensionality of the data before running k-means clustering. PCA also helps with highly correlated inputs, such as possession and field tilt, by highlighting where they differ. The “k” in k-means is the number of clusters, which I chose by trying several values and analysing how well the resulting clusters captured distinct groupings, as measured by comparing the “distance” of each team to its own cluster relative to other clusters. How k-means computes distance also informed my decision to use PCA before clustering, because this distance measure becomes less informative as the number of variables used for clustering increases. I arrived at 20 clusters for my analysis (coincidentally the same number as Ward), though the evidence did not clearly single out 20 as the best choice.
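The pipeline above (standardise, reduce with PCA, cluster with k-means, compare values of k) can be sketched with NumPy alone. Several things here are assumptions: the article does not name the measure used to compare k, so the silhouette score is used as one common measure matching the description; the number of retained components (3) is arbitrary; the k-means is a bare Lloyd's algorithm with a single random start; and the data are synthetic. In practice a library such as scikit-learn would be the more usual choice.

```python
import numpy as np

def standardise(X):
    """Column-wise Z-scores."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def pca_reduce(Z, n_components):
    """Project centred data onto its first principal components via SVD."""
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:n_components].T

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm with initial centres drawn from the data."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centre if a cluster happens to become empty.
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels

def mean_silhouette(X, labels):
    """Mean of (b - a) / max(a, b): a = own-cluster distance, b = nearest other cluster."""
    clusters = np.unique(labels)
    if len(clusters) < 2:
        return 0.0
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False  # exclude the point itself
        if not own.any():
            scores.append(0.0)  # convention for single-member clusters
            continue
        a = dists[i, own].mean()
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Synthetic data: 30 hypothetical teams, 10 metrics, three underlying styles.
rng = np.random.default_rng(42)
true_labels = np.repeat(np.arange(3), 10)
style_centres = np.array(
    [[5.0 if d % 3 == j else 0.0 for d in range(10)] for j in range(3)]
)
X = style_centres[true_labels] + rng.normal(0.0, 1.0, size=(30, 10))

reduced = pca_reduce(standardise(X), n_components=3)
scores = {k: mean_silhouette(reduced, kmeans(reduced, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

As the article notes for its own choice of 20, the scores for neighbouring values of k are often close, so the highest score is a guide to a sensible range, not proof of one correct number of clusters.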