The OptaPro Blog

Welcome to the OptaPro blog, featuring news and analysis from OptaPro's cutting-edge research team.

MCFC Analytics – A perspective on open data and amendments to the terms & conditions

The Manchester City Analytics project has been live for nearly three weeks now. Over 5000 people have received the data, and we have started seeing some early examples of data analysis being presented.

We wanted to give some context to our motivation in providing this data release and explain what Opta and Manchester City hope to see develop as the project matures.

The history of open-source sports data: Why the MCFC Analytics project is unique

Open-source sports data being made available to a mass audience is uncommon, although more so in some sports than others. In football, we would argue that the Opta data release that underpins the MCFC Analytics project is unique.

The most well-known example of advanced sports data being freely available is undoubtedly found in baseball. Although the baseball analytics community was highlighted by Michael Lewis' 2003 Moneyball book and subsequent film, a community of sabermetricians (named after the Society for American Baseball Research) have been analysing baseball data since the 1960s.

The accessibility of baseball data is due in part to the nature of the game. Its discrete events and stop-start rhythm help amateur statisticians trying to record what happens, whereas in more fluid sports such as football, rugby or ice hockey, the constant movement and rapid interchanges between players mean that accurately recording statistics from a game lasting over 80 minutes takes several hours. However, the other key driver behind the proliferation of baseball data was a growing analytics community that needed more information to improve its work, but didn't have the budget to pay for more advanced data from the original data collection companies such as STATS Inc and Elias.

As a result, one group of enthusiasts founded Retrosheet, with the aim of collecting all the data for every game as far back as possible, and making it freely available. As above, the nature of baseball meant that this was far easier than in other sports, but it was the first time a collaborative effort such as this had existed. Sam Green, Opta's Lead Statistician, is a Retrosheet advocate: "Retrosheet at its core is just a lot of individual facts about what happened in the match, but that's where the nature of the sport is so important - it's much easier to retrospectively collect data on a baseball game from the 60s than it would be a football game. I would imagine the vast bulk of modern analytics work in baseball has been done off the back of the Retrosheet database."

Then, in 1984, Bill James ("The Moses of Moneyball" according to some commentators) became frustrated with the limitations of the existing data. Although baseball fans now had a huge amount of freely available historical data from the Retrosheet collaboration, there was a desire for a greater 'depth' of data. He enlisted the help of subscribers to his 'Baseball Abstract' and created Project Scoresheet - an attempt to collate the work of hundreds of amateur sabermetricians and create a complete record of baseball data.

Thanks to the momentum built by these collaborative projects, baseball data became both standardised and more available than ever before.

This is in direct contrast to football. Football data has never been collected in the same way historically, and therefore there isn't a codified database that contains details of every game going back to 1900 (apart from full-time scores, and possibly scorers). Whereas there was a desire and clamour for this type of information in baseball, the same can't be said for football. 'Modern', deep football data - such as the type collected by Opta, as well as other companies such as Prozone and Amisco - is highly detailed and collected using proprietary systems that have needed large and sustained investment in order to develop them to the required standard. These systems also require the input of several analysts per game to accurately collect the data, as well as a team of checkers who ensure accuracy and consistency. In this regard, the baseball and football data collection (and commercial) models are clearly very different.

A much more accurate comparison can be found with American Football - little or no complete, advanced game data is made publicly available. Furthermore, even though a great deal of baseball data is distributed to the statistically inclined, other areas (as detailed in this Bloomberg article) remain the expensive preserve of the clubs and data companies. 

Additional Terms and Conditions clause

Both Gavin and Opta have been asked to clarify the intent behind the Terms and Conditions of the project.

To recap, Opta have outlined four conditions for the data, which can be seen here:

Some commentators and bloggers have perceived the terms and conditions as being stringent and perhaps even aggressive - the inference being that if amateur analysts and statisticians develop anything interesting, then Opta will simply take it from them.

However, we would like to reassure everyone that is not the case. The reason the terms are worded as such is in order to protect Opta from commercial competitors using our data to perfect products that would threaten our business. Our desire is to offer data access to skilled individuals who wouldn't otherwise be able to access it.

We have listened to any concerns and queries that have been raised following the initial release of the data, and attempted to address these. In answer to those who have queried the motivation behind the T&Cs, Opta have decided to add a clause to condition 4. By adding the phrase 'at fair value' we ensure that should Opta wish to license any work that results from the MCFC Analytics project and community, we would commit to guaranteeing that the creator receives 'fair value' for their work. It is also important to note that this condition only allows Opta to license the work, not own it. In reality, the likelihood would be that either or both of Opta and Manchester City would be keen to work with the creator of any especially impressive results.

MCFC Analytics - The Future

As explained in this post, OptaPro was set up to encourage the analytics community in sport to grow and assist the development of the field, and the MCFC project is a natural extension of this. By releasing this data with Manchester City, we hope to encourage better overall data usage by developing the skills of individuals and encouraging analysts who otherwise wouldn't have had access to data of this magnitude.

Although freely releasing all football data is nigh-on impossible for the reasons outlined above, it is the development of an engaged and motivated community that both Manchester City and Opta hope to encourage through the MCFC Analytics project. By encouraging like-minded collaborators to work together on producing relevant and incisive football analysis, we hope to accelerate the progress of analytics in football by identifying skilled individuals. By subsequently releasing more advanced data to those participants who demonstrate their analytical capabilities, we will create a meritocratic, informed community of the best and most advanced football analysts. It is anticipated that this community will eventually form the talent pool that will make up the next generation of club analysts, analytical authors and data scientists.

We will have more news on how we hope to help this community develop soon, so keep checking the OptaPro site as well as Gavin Fleig's Twitter feed (@MCFCGavinFleig). However, if you have any further questions, or require any assistance, please feel free to tweet @OptaPro.

Posted by Simon Farrant at 16:24

