GUEST BLOG: Building an MLS roster

Article by James Appell

In 2015 OptaPro teamed up with Columbia University to launch a global sports analytics module for students enrolled in the Master's in Sports Management program.

Students were given the opportunity, using Opta data, to address relevant use cases for data and to solve some of the real-world questions posed to performance analysts, scouts and team front office staff.

The course culminated in a competition, in which students were challenged to build an MLS team roster of 20 players, from scratch and with salary cap limitations, using quantitative methods. Teams of five were given MLS data from the last three seasons, and with a deadline of six days were asked to construct a methodology, select their roster of 20, and present the results to a panel of judges that included Opta and Columbia staff, and performance analysts from two MLS teams.

James Appell, one of the participating students, explains how his team approached the case.


Years of watching the sport have taught us that, presented with a list of football players, no two fans ever agree on which ones in combination would make the best team. It’s why this Columbia University case competition offered such an intriguing challenge – an opportunity to develop an objective standard for assessing and selecting football players using performance data. (It’s also why none of us were particularly surprised by the howls of derision from some commenters when mlssoccer.com got hold of the results.)

To be fair to our critics, an MLS roster built solely off the back of six days of processing several million data points is bound to make some omissions and most likely wouldn’t win any trophies (though it might be fun to try). The question our team – Nikolai Eriksen, Andres Galicia Schwarz, James Kopanidis, Robert Moras and myself – looked to answer was more nuanced:

What would a 20-man MLS roster look like if you selected it on the basis of ranking each player by their performance data? And in addition, could that roster tell you anything useful about the MLS player pool that you wouldn’t necessarily have gleaned from traditional scouting methods?

Starting point: what makes a good team?

Initially presented with more than 120 different performance metrics, we needed to understand the role each team metric played in a team’s overall performance. We adopted as agnostic an approach as possible – any of us could guess that, say, shooting accuracy might be closely correlated with goals scored, but we wanted to see what the data told us.

One issue raised its head early on – attacking and defensive actions proved so incompatible with one another that we needed to split them. So, by plotting our attacking metrics against goals scored and our defensive metrics against goals conceded, we derived what we felt was a reasonable understanding of what, at the team level, constituted a successful team ‘philosophy’ in MLS – what we called Attacking and Defensive Identities.
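As a rough illustration of this step – correlating candidate attacking metrics with goals scored – the sketch below uses pandas on entirely invented team totals. The metric names and numbers are placeholders, not the actual Opta metrics or values from the case:

```python
import pandas as pd

# Hypothetical team-season totals; names and values are illustrative only.
teams = pd.DataFrame({
    "passes_final_third": [1200, 950, 1100, 800],
    "dribbles_completed": [310, 240, 290, 200],
    "crosses_attempted": [400, 520, 430, 560],
    "goals_scored": [52, 38, 47, 31],
})

# Correlate each candidate attacking metric with goals scored.
attacking_metrics = ["passes_final_third", "dribbles_completed", "crosses_attempted"]
correlations = teams[attacking_metrics].corrwith(teams["goals_scored"])
print(correlations.sort_values(ascending=False))
```

Running the same idea against goals conceded, with defensive metrics in the columns, gives the defensive half of the picture.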

[Table: Attacking Identity metrics]

[Table: Defensive Identity metrics]

The data suggested that accurate passing inside and into the final third of the field was the single most important factor in whether a team scored goals or not. Successful dribbling was also highly important, while, interestingly, we found the number of crosses a team attempted to be strongly inversely correlated to goals scored – perhaps teams that were quick to try and put crosses into the box, rather than seek higher quality attacking plays, were actually harming their chances.

In contrast we found that explaining defensive success with the dataset provided to us was going to be more of a challenge. That said, perhaps understandably, goalkeeping statistics such as save success inside and outside the box proved highly relevant indicators of a team’s ability to keep clean sheets. Aerial duel success and number of tackles made also showed a significant degree of relevance, and a picture began to emerge of successful defences maintaining discipline (particularly by not giving away fouls) but also seeking to disrupt the opposition through tackling and other consistent interference.

Adding in further metrics which we found had significant explanatory power, we settled on a list of ten metrics for an Attacking Identity and nine for a Defensive Identity (shown in the above tables), which, applied to teams’ performances in MLS in the last three years, were reasonably strong indicators of success.

From team metrics to player metrics

Aside from handily reducing the number of criteria for picking players to a more manageable size, our Attacking and Defensive Identities suggested something generally about the kind of players our 20-man roster might be. But to rank and then select specific players we needed to move from metrics at the team level down to the individual player level.

To do so, we effectively had to map each of the team metrics that made up our Attacking and Defensive Identities onto an appropriate player metric. This isn't a perfect method – for instance, we could say for sure how many times a defending team faced a shot on goal in a given match, but our dataset couldn't tell us which defender was closest to the ball when the opponent took his shot. With a bit of intuition and some statistical testing, we found that we could translate our team metrics into player metrics with reasonable accuracy.

Additionally, we wanted to select a squad capable of winning the most games, and therefore the most points – not one that was simply good at scoring or preventing goals. Thus we made points won (on a match-by-match basis) our dependent variable. We had some concern that this might bias our findings towards teams which win a lot of points, but as we’ll show later, this didn’t turn out to be the case. And we also knew, based on the multiple linear regression model we had constructed to test each metric against points won, the relative importance of each individual metric to a team’s overall performance – the coefficient of each metric told us how much of a role that metric played. These coefficients became very useful later on, as we’ll see.
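A minimal sketch of that regression step, using NumPy on synthetic data – the metric values, "true" coefficients and noise below are invented purely to show the mechanics of recovering per-metric weights via ordinary least squares:

```python
import numpy as np

# Synthetic stand-in for the team-match data: three standardised metrics
# per match, regressed against points won. All numbers are made up.
rng = np.random.default_rng(0)
n_matches = 200
X = rng.normal(size=(n_matches, 3))
true_coefs = np.array([0.8, 0.5, -0.3])   # assumed "importance" of each metric
points = X @ true_coefs + rng.normal(scale=0.5, size=n_matches)

# Fit by ordinary least squares; the fitted coefficients estimate how much
# each metric contributes to points won.
X_design = np.column_stack([np.ones(n_matches), X])  # add an intercept column
coefs, *_ = np.linalg.lstsq(X_design, points, rcond=None)
print(coefs[1:])  # estimated metric weights, close to [0.8, 0.5, -0.3]
```

The fitted coefficients play the role described in the text: a per-metric weight reused later when scoring individual players.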

Finding a standard for measuring player performance

The ultimate goal was to create a scoring system which ranked players according to their personal contribution to team success. We could therefore, in theory, select the 20 players with the highest score, wrap up the assignment and go and have a pint.

Each position on the field carries its own specific responsibilities, and we wanted to ensure players were measured primarily for their contribution to these.

We took each player’s data for the relevant metrics from the Opta database, on a per-game rather than an aggregate basis. Volume metrics were adjusted to per-90-minute figures to ensure we were giving adequate attention to substitutes, young players and other ‘impact’ players who play only intermittently. Here was where we brought the coefficients back in from our regression analysis – by adding up all of a player’s various statistical contributions, weighted by the coefficient for each metric, we could generate an overall score for each player which indicated his impact on team success.
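The per-90 adjustment and coefficient weighting can be sketched as follows – the players, raw totals and coefficients here are invented for illustration, not taken from the actual model:

```python
import pandas as pd

# Hypothetical player season totals; every number here is made up.
players = pd.DataFrame({
    "player": ["A", "B"],
    "minutes": [2700, 900],
    "final_third_passes": [300, 120],
    "tackles": [60, 35],
})

# Per-90 adjustment so substitutes and rotation players aren't penalised
# for playing fewer minutes.
for metric in ["final_third_passes", "tackles"]:
    players[metric + "_p90"] = players[metric] / players["minutes"] * 90

# Weight each per-90 metric by its regression coefficient and sum into
# one overall score per player. Coefficients are illustrative.
coefs = {"final_third_passes_p90": 0.04, "tackles_p90": 0.10}
players["score"] = sum(players[m] * w for m, w in coefs.items())
print(players[["player", "score"]])
```

Note that player B, despite far fewer minutes, can out-score player A once rates rather than totals are compared.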

While this score looked useful on first viewing, we found that it tended to give particularly high rankings to attacking players. So we added one final step in the process – subtracting the ‘replacement player’ score for each position, where ‘replacement’ indicates the player at the 25th percentile cut-off. This helped to even out disparities across positions, and is broadly similar to the technique used in baseball to generate WARP, where the 35th percentile is used (we have lower standards in football, clearly).
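The replacement adjustment described above – subtracting the 25th-percentile score within each position – can be sketched like this, with invented scores and only two positions for brevity:

```python
import pandas as pd

# Hypothetical raw scores by position; values are illustrative only.
df = pd.DataFrame({
    "player": ["F1", "F2", "F3", "F4", "D1", "D2", "D3", "D4"],
    "position": ["F"] * 4 + ["D"] * 4,
    "score": [4.0, 3.0, 2.0, 1.0, 2.0, 1.5, 1.0, 0.5],
})

# Replacement level = the 25th-percentile score within each position.
replacement = df.groupby("position")["score"].transform(lambda s: s.quantile(0.25))
df["PAR"] = df["score"] - replacement
print(df)
```

Because the subtraction happens within each position group, a forward and a defender are each measured against their own replacement level, which is what evens out the cross-position disparities.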

The outcome is a measure we called PAR (Performance Above Replacement), where the higher a player’s PAR score the greater his aggregate on-field contribution to points won. PAR scores can be used to derive further useful insights about players – for example, by plotting PAR against age over the last three seasons we showed how players in MLS develop over time and where we might expect them to peak. The end result was a ranking system for every player in MLS, based solely on performance data, which we could use as a basis for roster selections.

Exploring positional variations

That said, using only PAR scores to assemble a squad felt a little risky, in that we knew very little else about each player in the dataset, and worried that this might give us a pretty one-dimensional squad. To learn a bit more about the players who topped our PAR rankings, we decided to run a positional cluster analysis. The Opta dataset only told us that each of our outfield players was a defender, a midfielder or a forward. These are pretty broad categories which we felt could be broken down further into more specific player roles.

We took inspiration from the work of Will Gurpinar-Morgan, who ran a pretty complex cluster analysis to find and group players whose performance data showed them to be alike. By giving each player a percentile ranking for each individual metric, and by then grouping similar metrics together by averaging their percentile scores, we could plot a quadrant chart roughly showing each player’s positional type.
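The percentile-then-average step can be sketched as below. The metric names, the three players and the choice of which percentiles feed each axis are all assumptions for illustration, not the actual groupings used:

```python
import pandas as pd

# Hypothetical forward metrics; all values invented.
forwards = pd.DataFrame({
    "player": ["P1", "P2", "P3"],
    "avg_shot_distance": [22.0, 12.0, 9.0],
    "dribbles_p90": [3.0, 1.0, 0.5],
    "aerials_p90": [1.0, 4.0, 6.0],
})

# Percentile rank of each player on each metric (0..1 within the group).
pct = forwards[["avg_shot_distance", "dribbles_p90", "aerials_p90"]].rank(pct=True)

# x axis: shot range (low = close to goal). y axis: static vs dynamic,
# here averaging the "static" percentile with the inverted "dynamic" one.
forwards["x"] = pct["avg_shot_distance"]
forwards["y"] = (pct["aerials_p90"] + (1 - pct["dribbles_p90"])) / 2
print(forwards[["player", "x", "y"]])
```

Each (x, y) pair then lands the player in one of the four quadrants of the chart.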

The chart below was the one we generated for forwards. The x axis shows where each player likes to take his shots from – the lower the value, the closer to goal he tends to operate. The y axis shows how mobile we expected the player to be – the lower the value, the more dynamic actions (dribbles, recoveries, passes) he registered, while the higher the value, the more we expected him to behave as a static target man (measured by numbers of aerials and headed shots). Sebastian Giovinco, unsurprisingly, registered as highly long-range/dynamic. Kevin Doyle, on the other hand, was an archetypal close-range/static player.

[Chart: forward positional quadrants]
The red circles on the chart indicate the players ranked in the top 15% of PAR scores for their position, and our cluster analysis demonstrated that PAR wasn’t biasing our results towards players in one particular quadrant, or of one uniform type. This was even more pronounced when we looked at midfielders, below:

[Chart: midfielder positional quadrants]
We had players in all four quadrants, from deep-lying passers like Osvaldo Alonso, to attacking playmakers like Kwadwo Poku, to wingers like Sebastian Lletget. This gave us greater confidence in our PAR scores as a method of assembling a versatile roster.

The finishing touches

There were just a few items left to deal with. Firstly, we had to ensure our roster had adequate cover across all positions. We found that an average squad of 20 should contain two goalkeepers, six defenders, eight midfielders and four forwards (in the end we plumped for seven midfielders and five forwards for a bit of flash).

Secondly, we needed to stay on budget, at or below the maximum of $3,400,000 which the case prescribed. Given that we were required to fill all three Designated Player (DP) spots, we also allowed for players not currently classified as DP by salary to be bumped to DP level (set by the case at $436,250) if it maximised our team’s total PAR score.
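A toy version of this selection problem – maximise total PAR subject to a budget – can be brute-forced for a small pool, as below. The pool, PAR scores, salaries, cap and squad size are all invented, and the real problem also had to enforce positional quotas and the DP rules, which this sketch omits:

```python
from itertools import combinations

# Invented player pool: (name, PAR, salary).
pool = [
    ("A", 2.5, 400_000),
    ("B", 2.0, 150_000),
    ("C", 1.8, 100_000),
    ("D", 1.5, 900_000),
    ("E", 1.0, 80_000),
]
cap = 700_000
squad_size = 3

# Enumerate every affordable squad and keep the one with the highest
# total PAR (a knapsack-style search; fine at toy scale).
best = max(
    (c for c in combinations(pool, squad_size)
     if sum(p[2] for p in c) <= cap),
    key=lambda c: sum(p[1] for p in c),
)
print([p[0] for p in best])
```

At realistic pool sizes this enumeration blows up, and an integer-programming formulation would be the natural replacement, but the objective and constraint are the same.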

Thirdly, we ran one final sanity check to ensure there weren’t any glaring issues that the data was hiding from us. In fact, there were – Marcel de Jong, who had somehow sneaked into our final 20, had left the league and was thus ineligible, while Bill Hamid, the highest-ranked goalkeeper by PAR, had suffered an injury which we decided made him too high a risk. Both were excluded from our final roster.

The Roster

The final 20-man roster, optimised for PAR scores, and meeting the salary cap and positional requirements, is as follows:

[Table: final 20-man roster]

We invite everyone to share their interpretations of this list, but here are a few thoughts:

– The PAR score method needs some refinement, but it did select a fairly well-rounded 20-man roster without much human intervention. While there aren’t superstar names (paging the marketing department) there probably aren’t any obvious anomalies who shouldn’t be there either.

– No one team in MLS has a monopoly on talent. Our team contains no more than two players from the same team, which suggests an equitable share of quality players across the league.

– There are a handful of seemingly undervalued players. Fabian Espindola registered the highest overall PAR score in MLS, but is guaranteed just $175,000 a season (we ended up bumping him up to a DP salary, simply because not doing so resulted in our model selecting an inferior team at a far higher cost). Espindola may have had disciplinary and injury problems, but he looks a bargain at that price. David Accam, Wandrille Lefevre and Jesse Gonzalez are other players whose PAR score to salary ratio catches the eye.

– There are plenty of good players at the <$100,000 level. Kellyn Acosta, Kwadwo Poku and Jose Villarreal outperformed 90% of MLS players on PAR, despite their low salaries. Is MLS’s wage structure rewarding them fairly?

– Generation Adidas continues to produce. Graduates Kekuta Manneh and Kelyn Rowe both made our squad, while several others (Darlington Nagbe and Tony Tchani among them) recorded high PAR scores.
