STRENGTH IN NUMBERS
DATA AND SCIENCE IN FOOTBALL
– Written by John Newell, Ireland, Lorenzo Lolli and Kenny McMillan, Qatar
Twenty years ago, the 2002 FIFA World Cup Final was played on 30 June at the International Stadium in Yokohama (Japan). The Brazil National Team made history and sealed their fifth World Cup title in front of over 69,000 spectators with an estimated global television audience of 1.5 billion. At that time, the application of data science and analytics in football was in its infancy, and the volume of data collected and digital information used by practitioners in the field was quite limited. Fast forward 20 years and the 2022 World Cup Final will be contested on 18 December 2022 in the iconic Lusail Stadium in Doha, Qatar. Over these last two decades, the application of data science and analytics in football has grown steadily. Today most international (and domestic) football teams are adopting new data-related technologies and analytics. It is now common for the modern performance department of international and domestic football teams to include data scientists and engineers, whose expertise helps complement the work of the football coaches, sports scientists and performance analysts.
In practical terms, data collected routinely by football departments are generally used for three main purposes:
1. training, development and injury reduction of players,
2. performance analysis and
3. player recruitment.
With this in mind, we will discuss recent advances in data management and analytics applied to sports performance tracking and showcase practical examples of how data are used for the training, development and performance analysis of football players.
DATA MANAGEMENT AND ANALYSIS
a) Data Management
The ongoing incorporation of modern technologies into football (such as wearable and camera-tracking technology) has increased both the volume of data and the variety of its forms. This explosion of collected data and digital information in football has created the need for proper data storage and for data engineers who know how to do this efficiently. Having vast amounts of data is of little value unless the data are connected in a central database so they can be explored collectively. In the early 2000s it was not uncommon to have data collected and stored in many unconnected Microsoft Excel sheets, partly due to the limited download facilities of the devices used. The need for off-the-shelf data housing solutions in sport led to the emergence of Athlete Management Systems (AMS): central platforms for data collection (e.g. wellness questionnaires) and data integration (e.g. wearables), often with capabilities for data visualisation and simple analyses such as the generation of summary statistics (Figure 1). Smartabase (https://fusionsport.com/), Kinduct AMS (https://www.kinduct.com/) and SAP Sports One (https://www.sap.com/mena/products/sports-one.html) are examples of AMS used today. Adopting an AMS circumvents the need to build and maintain a traditional database management system (DBMS) such as Microsoft SQL Server or Oracle. Building and maintaining a DBMS requires domain expertise in computer science and data engineering, which may not be available to some football performance departments. However, there has been a growing trend in recent years for professional football teams to hire such professionals.
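As a minimal sketch of why connecting data sources matters, the Python example below loads two hypothetical sources (a wellness questionnaire and a GPS export) into one in-memory SQLite database and joins them on a shared player and date key. The table names, column names and values are invented for illustration and do not reflect the schema of any AMS named above.

```python
import sqlite3

# Hypothetical records; every value here is invented for illustration.
wellness_rows = [("P01", "2022-09-01", 7), ("P02", "2022-09-01", 4)]
gps_rows = [("P01", "2022-09-01", 5200.0), ("P02", "2022-09-01", 6100.0)]


def build_database():
    """Create an in-memory database holding both sources as separate tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wellness (player_id TEXT, session_date TEXT, sleep_score INTEGER)"
    )
    conn.execute(
        "CREATE TABLE gps (player_id TEXT, session_date TEXT, total_distance_m REAL)"
    )
    conn.executemany("INSERT INTO wellness VALUES (?, ?, ?)", wellness_rows)
    conn.executemany("INSERT INTO gps VALUES (?, ?, ?)", gps_rows)
    return conn


def combined_view(conn):
    """Join the two sources on the shared player and date keys."""
    query = """
        SELECT w.player_id, w.session_date, w.sleep_score, g.total_distance_m
        FROM wellness AS w
        JOIN gps AS g
          ON g.player_id = w.player_id AND g.session_date = w.session_date
        ORDER BY w.player_id
    """
    return conn.execute(query).fetchall()


conn = build_database()
rows = combined_view(conn)
for row in rows:
    print(row)
```

Once the sources share a key, questions that span them (for example, whether low sleep scores precede unusually high running loads) can be asked in a single query rather than by cross-referencing separate spreadsheets.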
Having a fully functioning GDPR-compliant AMS is the start rather than the end of the process of using data to enhance the performance and health of footballers. This ‘data assembly’ step facilitates the use of data to tackle questions of interest and to build helpful solutions (such as the Training Load Prediction Planner and Growth Tracking tools that we will discuss later in this article). An AMS itself is not the final solution, but it serves the purpose of providing the infrastructure to accommodate the collection and storage of data and the provision of basic reporting and visualisation tools. For practitioners to use the data to make decisions, domain expertise in statistical and data science is required and, as mentioned above, it is now common for modern data-driven football performance departments to hire or consult such experts for this very purpose.
b) Statistical Analysis
Making the best use (i.e. reliable inference) of data in elite football requires a well-posed question of interest, a representative sample and an appropriate analysis that is reproducible and transparent, with results communicated using simple and informative numerical and graphical summaries.
There is a growing perception that statistical design is less important when the volume of data available is ‘big’. Volume and noise are often synonymous, as they can lead to amplification of sampling bias and subsequent model bias. The problems that exist in small data do not disappear as the amount of data increases; they get worse. In the sports industry, as in other fields, adherence to the fundamental guidelines and principles of statistical science is paramount for the practically relevant deployment of impactful data analytics solutions (van Smeden et al., 2022). One example is the use of machine learning approaches to predict musculoskeletal injury occurrence using aggregated internal and external player metrics. Such approaches have had limited success (Bullock et al., 2022), as the underlying studies were biased and poorly designed, had inadequate assessment of model performance (e.g. false positives/negatives) and were poorly reported.
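To illustrate why false positives and false negatives should be reported alongside overall accuracy, the short Python sketch below summarises a hypothetical injury-prediction model from its confusion-matrix counts. The counts are invented and are not taken from Bullock et al. (2022); the function assumes every denominator is non-zero.

```python
def classification_summary(tp, fp, fn, tn):
    """Summarise a binary prediction model from its confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives. Denominators are assumed non-zero.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # share of injuries that were flagged
        "specificity": tn / (tn + fp),  # share of non-injuries left unflagged
        "ppv": tp / (tp + fp),          # share of flags that were real injuries
    }


# Hypothetical scenario: 1000 player-weeks containing 20 injuries.
# The model raises 50 flags, of which only 5 correspond to injuries.
summary = classification_summary(tp=5, fp=45, fn=15, tn=935)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Because injuries are rare in this invented scenario, accuracy looks high even though most injuries are missed and most flags are false alarms, which is why accuracy alone is an inadequate assessment of model performance.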
An example of how repeated measures data, common in all athlete monitoring programs, can be used for player welfare is the use of personalised reference ranges. The aim of athlete monitoring is to identify meaningful changes over time in an athlete’s biomarker, wellness, load and movement profiles. Such ‘early warning’ systems identify atypical measurements using clinical or reference ranges as a benchmark. Static reference ranges, such as those based on z-scores, are of limited value as they do not account for the information in an athlete’s historical profile (i.e. within-athlete variability), so that in effect each comparison is made at the population rather than the player level. Individualised adaptive reference ranges (Roshan et al., 2021), using Bayesian approaches, represent valuable alternatives, as the reference range against which the current value is compared adapts continually as more data are collected on the player (Figure 2).
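The idea of a range that adapts to the individual can be sketched with a deliberately simplified Bayesian model. The Python example below assumes a normal prior for the athlete’s true mean (centred on a population mean with a between-athlete standard deviation) and a known within-athlete standard deviation; it is not the specific method compared by Roshan et al. (2021), and all numerical values are hypothetical.

```python
import math


def adaptive_range(history, prior_mean, between_sd, within_sd, z=1.96):
    """Return a (low, high) reference range for the athlete's next measurement.

    Simplified normal-normal model: the athlete's true mean has prior
    N(prior_mean, between_sd**2) and each measurement is N(true mean, within_sd**2).
    With an empty history the result is the static population range.
    """
    n = len(history)
    prior_precision = 1.0 / between_sd**2
    data_precision = n / within_sd**2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + sum(history) / within_sd**2)
    # Predictive spread = uncertainty about the mean + within-athlete variation.
    pred_sd = math.sqrt(post_var + within_sd**2)
    return post_mean - z * pred_sd, post_mean + z * pred_sd


# Hypothetical biomarker values for one athlete (arbitrary units).
population_mean, between_sd, within_sd = 100.0, 15.0, 5.0
measurements = [112.0, 110.0, 114.0, 111.0, 128.0]

flags = []
for i, value in enumerate(measurements):
    # The range for each new value uses only the athlete's earlier values.
    low, high = adaptive_range(measurements[:i], population_mean, between_sd, within_sd)
    flags.append(not (low <= value <= high))
    print(f"obs {i + 1}: value={value:.1f}, range=({low:.1f}, {high:.1f}), flagged={flags[-1]}")
```

In this invented series the final value still lies inside the static population range, but it falls outside the range that has adapted to the athlete’s own history, which is the behaviour an individualised ‘early warning’ system relies on.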
In the next section, further examples are given of how statistical modelling can be used to plan training and track maturation in youth players and how in-game data have revolutionised performance analysis in football.
TRAINING AND DEVELOPMENT OF FOOTBALL PLAYERS
a) Training Load Planner
Appropriate monitoring of training loads can provide important information to football coaches. However, monitoring systems should be intuitive, provide efficient data analysis and interpretation, and enable efficient reporting of simple, yet scientifically valid, feedback (Halson, 2014). Football players participate in a wide range of training drills during a training session in order to induce the adaptations needed to succeed in competitive match-play. The monitoring of external workload using global positioning system (GPS) tracking has become increasingly popular in professional football teams. Output from these devices can be used to quantify the total workload of a training session and the workload of each individual drill, and ultimately to monitor short- and long-term loads on each player. There is a fine balance between optimising training load to maximise performance and minimising injury risk. A typical football training session consists of a combination of warm-up and cool-down drills, technical drills (such as passing combinations or shooting exercises), tactical drills, specific fitness components (e.g. intermittent running drills), and small- or large-sided games (e.g. 4 v 4, 3 v 3 etc.). Each drill will induce a specific physiological effect and workload, and these effects may differ greatly between drills. Between-player variability may also be high depending on the training drill used. For example, a 9 v 9 training drill on a full-size pitch may induce a greater running load on a midfield player than on a central defender. Conversely, a small-sided game (e.g. 4 v 4 on a 30 m x 20 m pitch) might impose a similar load on each player.
Accurate prediction of individual training loads for a planned training session is clearly beneficial. Training drills may be designed to overload a certain workload parameter. For example, an intermittent running drill may be designed to accumulate time above a certain speed threshold. Tactical drills may also stress players in a specific way. An example here would be a striker having to make repeated maximal accelerations when trying to score from a winger's cross into the penalty box. When designing a training session, the ability to predict the expected workload demands of the session is clearly of benefit as part of routine player load monitoring, as the coaching and performance staff can check if the load for the planned session is appropriate for that given day. Quantification of the expected load of a training session can also be used to prompt the football coach to alter the training session in order to attain the required load.
To build the Training Load Planner, annotated drills based on GPS data were collected over several seasons from players of a professional British football team. Linear mixed models (McMillan, Simpkin, Moore and Newell, 2020) were fitted to create a user-friendly “Training Load Predictor” application, deployed using the R Shiny package (https://shiny.rstudio.com/), that allows the football coach to instantly predict training load metrics from a prescribed session (Figure 3). Potential outliers (players who will find certain combinations of training drills difficult or easy) are highlighted to aid injury reduction and to prescribe individualised load adjustment, which is crucial in a dynamic high-performance football environment where training plans can change quickly. The Training Load Planner is an example of applying appropriate statistical techniques to historical data stored in an AMS to build an application that provides quick, actionable insights of benefit to football coaches and players.
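The structure of the fitted models is not detailed here, so the following Python sketch only illustrates how predictions from an already fitted random-intercept linear mixed model could be combined for a planned session: a fixed effect per drill type plus a player-specific offset. The drill names, coefficients, player labels and tolerance are all hypothetical and are not the values used in the Training Load Predictor.

```python
# Hypothetical fixed effects: metres covered per minute of each drill type.
FIXED_RATE_M_PER_MIN = {
    "warm_up": 60.0,
    "passing": 75.0,
    "ssg_4v4": 110.0,
    "game_9v9": 120.0,
}

# Hypothetical player-level random intercepts: metres per session relative
# to the squad average (positive = tends to cover more than average).
PLAYER_OFFSET_M = {"midfielder_A": 250.0, "defender_B": -200.0, "striker_C": 0.0}


def predict_session_distance(session, player):
    """Predicted total distance (m): fixed drill effects plus the player offset."""
    fixed_part = sum(FIXED_RATE_M_PER_MIN[drill] * minutes for drill, minutes in session)
    return fixed_part + PLAYER_OFFSET_M[player]


def flag_outliers(session, target_m, tolerance_m):
    """Mark players whose predicted load is further than tolerance_m from the target."""
    flags = {}
    for player in PLAYER_OFFSET_M:
        predicted = predict_session_distance(session, player)
        flags[player] = abs(predicted - target_m) > tolerance_m
    return flags


# A planned session as (drill type, minutes) pairs.
planned = [("warm_up", 15), ("passing", 20), ("ssg_4v4", 16), ("game_9v9", 20)]
outliers = flag_outliers(planned, target_m=6500.0, tolerance_m=225.0)
for player, is_outlier in outliers.items():
    print(player, round(predict_session_distance(planned, player)), is_outlier)
```

A real application would estimate these effects from historical drill data and would normally report the uncertainty of each prediction; the sketch shows only how a session plan maps to per-player predictions and flags.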
b) Tracking football player growth and maturation
Ongoing tracking of growth progression and changes in biological maturation is now an integral part of long-term elite football player development (Malina et al., 2015). In this context, insights gained from growth curve data analysis can support the process. The general idea of assessing a child’s body measurements to explore patterns of growth dates back to the 18th century (Cole, 2012). These insights were further elaborated in modern times by pioneers in endocrinology such as James Tanner, in a series of research papers and textbooks that shaped and advanced knowledge in modern paediatric science (Cameron, 2004). However, despite the advances in the estimation of reference centiles and growth chart development in different populations worldwide (Cole, 2019), clinical decision-making processes in the medical field have remained, in general, static and paper-based.
In football (and sports in general), athlete tracking has tended to rely on published regression equations derived from non-reference samples, generally embedded in rudimentary, custom-made Microsoft Excel spreadsheets. Given what we have discussed so far, such an approach has become inconsistent with the nature and flow of athlete data management processes in modern sports organisations and academies. Furthermore, growth curve analysis procedures have evolved meaningfully over recent years (Cole, 2019), with state-of-the-art methods available in free programming languages for statistical computing and graphics, such as R (https://www.r-project.org/), that are compatible with modern business intelligence solutions such as Microsoft Power BI (https://powerbi.microsoft.com/en-us/). With this in mind, recent methodological appraisals of different adult height prediction protocols based on automated image analysis in Arab athletes (Lolli et al., 2021) were translated into dynamic business solutions for tracking growth and development in elite youth Middle Eastern football players. Live tracking of annual growth rates (i.e., height velocity) is now possible and available in user-friendly reports that superimpose an individual player’s trajectory on the estimated population curve (Figure 4).
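Height velocity itself is a simple quantity: the change in height between two measurement dates, expressed per year. The Python sketch below computes it from dated height measurements; the player data are hypothetical, and the smoothing and population curve used in the reports described above are not reproduced.

```python
from datetime import date

DAYS_PER_YEAR = 365.25


def height_velocity(measurements):
    """Annualised growth rate (cm/year) between consecutive measurements.

    measurements: list of (date, height_cm) tuples in chronological order.
    Returns a list of (end_date, cm_per_year), one entry per interval.
    """
    velocities = []
    for (d1, h1), (d2, h2) in zip(measurements, measurements[1:]):
        days = (d2 - d1).days
        if days <= 0:
            raise ValueError("measurements must be in strictly increasing date order")
        velocities.append((d2, (h2 - h1) / (days / DAYS_PER_YEAR)))
    return velocities


# Hypothetical measurements for one youth player.
player_heights = [
    (date(2021, 1, 1), 150.0),
    (date(2022, 1, 1), 157.0),
    (date(2022, 7, 2), 161.5),
]
velocities = height_velocity(player_heights)
for end_date, cm_per_year in velocities:
    print(end_date.isoformat(), f"{cm_per_year:.2f} cm/year")
```

Short intervals between measurements magnify the effect of measurement error on the annualised rate, which is one reason an individual trajectory is usually interpreted against a smoothed population curve rather than from raw differences alone.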
Coaching and medical staff can be provided with up-to-date information for obtaining adult height estimates that also account for prediction errors (Figure 5). Similar to the actionable insights gained from the Training Load Predictor, these data can be used to help guide football coaches and sports scientists to design appropriate training programs for adolescent soccer players to maximise their development whilst reducing injury risk. Moving from paper-based processes to automated digital workflows helps to bridge the gap between ongoing data acquisition into the modern AMS and subsequent actionability of the insights.
PERFORMANCE ANALYSIS
Performance analysis is a sub-discipline of sport science that has received particular interest from many stakeholders at different levels in the industry (Gomez-Ruano et al., 2020). The development and implementation of new technologies to quantify individual or team performance (e.g., tracking systems such as local positioning systems, video tracking, or observational video analysis systems), with multiple practical applications, have intensified the focus on performance analysis in football (Hughes and Franks, 2007).
In the early 2000s, in-game and post-match insights from football matches were generated mainly by performance analysts who tagged in-game events such as passes, shots, corners, offsides and fouls committed. This was a time-consuming manual task involving a combination of hand notation and customised spreadsheets, supported by software such as SportsCode and GameBreaker. Some validated multi-camera computerised video systems, such as Prozone®, were also available and allowed tracking of football players during match-play (Di Salvo et al., 2006). Since then, the analysis of performance in sport has undergone a dramatic metamorphosis. Continual advances in data science techniques and technologies mean that large amounts of performance-related data can now be collected, stored and made readily available to football performance departments and coaches.
Recent developments in video analysis of sports and computer vision techniques have enabled the automation of a variety of football performance analyses (van der Kruk and Reijne, 2018). The development of accurate motion tracking technology in stadiums (Ellens et al., 2022), with the recent example of Second Spectrum (https://www.secondspectrum.com), has contributed to the rapid increase in the volume of in-game data recorded. The high granularity of spatio-temporal positional data from motion tracking has allowed the generation of analyses such as possession heatmap plots, touch maps and player movement profiles. Modern Artificial Intelligence (AI) techniques, coupled with increasing computing capacity and processing power, have enabled the automated analysis of the actions and movements of football players, ball tracking, detection of highlights, and on-demand 3D reconstruction (Naik et al., 2022). Industry providers such as SkillCorner (https://www.skillcorner.com/) use AI-powered video tracking technology to generate tracking data insights from football broadcast camera feeds only (Figure 6). By using a Recurrent Neural Network (RNN) based model, trained against official tracking data from multiple football leagues, player movement trajectories can be predicted while incorporating team-related structures and the behaviour of other players.
Recent advances in computer vision also allow detailed 3D motion tracking data to be captured using a single broadcast TV feed. This has enabled the generation of heatmaps for a player from games recorded before the availability of motion tracking technology. For example, Figure 7 displays, for the first time, a heatmap of the Brazilian striker Ronaldo Luís Nazário de Lima in the 2002 FIFA World Cup Final.
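At its core, a heatmap of this kind is a two-dimensional count of where a player’s tracked positions fall on the pitch. The Python sketch below bins positions into a coarse grid; the 105 m x 68 m pitch dimensions, the grid size and the sample coordinates are assumptions made for illustration, and the smoothing and rendering applied by commercial providers are not shown.

```python
def occupancy_grid(positions, pitch_length=105.0, pitch_width=68.0, n_cols=6, n_rows=4):
    """Count tracked positions per grid cell.

    positions: iterable of (x, y) in metres, with x along the pitch length
    and y across the pitch width, both measured from one corner.
    Returns a list of n_rows lists, each holding n_cols counts.
    """
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in positions:
        if not (0.0 <= x <= pitch_length and 0.0 <= y <= pitch_width):
            continue  # ignore samples recorded outside the pitch
        # min(...) keeps points lying exactly on the far boundary in the last cell.
        col = min(int(x / pitch_length * n_cols), n_cols - 1)
        row = min(int(y / pitch_width * n_rows), n_rows - 1)
        grid[row][col] += 1
    return grid


# Hypothetical tracked positions for a striker, mostly near the opposition goal.
samples = [(100.0, 34.0), (101.0, 35.0), (99.0, 30.0), (50.0, 10.0), (105.0, 68.0), (-1.0, 5.0)]
grid = occupancy_grid(samples)
for row in grid:
    print(row)
```

Dividing each count by the total number of samples gives the share of time spent in each zone, which is the quantity a heatmap colours.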
Modern automated motion tracking technologies generate over 2000 summaries from football match-play. Passages of play can be broken down by phase (defensive transition and attacking) and also as sequences (possession, turnovers), where consecutive events in sequence give insight into a team’s playing style and an individual player’s contribution. Position-specific metrics include the number of interceptions and ball recoveries made by midfielders and defenders. For example, Defensive Coverage, a polygon of the player’s defensive zone, measures the area of defensive responsibility implied by a player’s defensive actions during a match. Other metrics of interest include Expected Goals (xG), Number of Big Chances and Expected Assists (xA). The xG metric indicates the number of goals a team would be expected to score given the chances created in a match, based on a large variety of contextual factors. The Number of Big Chances metric counts situations where a player should reasonably be expected to score, usually in a one-on-one scenario with the goalkeeper or from very close range. The xA metric measures the probability that a particular pass will result in a goal assist while accounting for the type, length and position of the pass.
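Two of these metrics lend themselves to a brief sketch. In the Python example below, a match xG total is taken as the sum of per-shot scoring probabilities, and a defensive coverage area is approximated by the convex hull of a player’s defensive action locations. Both are simplifying assumptions: the shot probabilities and coordinates are invented, and providers may define the coverage polygon differently.

```python
def match_xg(shot_probabilities):
    """Expected goals for a match: the sum of each shot's scoring probability."""
    return sum(shot_probabilities)


def _cross(o, a, b):
    """Cross product of vectors OA and OB; positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull by the monotone chain method, vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon from its ordered vertices (shoelace formula)."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Hypothetical per-shot scoring probabilities for one team in one match.
xg_total = match_xg([0.05, 0.30, 0.12, 0.76])

# Hypothetical (x, y) locations in metres of one defender's defensive actions.
actions = [(10.0, 10.0), (30.0, 10.0), (30.0, 30.0), (10.0, 30.0), (20.0, 20.0)]
hull = convex_hull(actions)
coverage_m2 = polygon_area(hull)
print(f"xG = {xg_total:.2f}, defensive coverage = {coverage_m2:.0f} m^2")
```

The per-shot probabilities would in practice come from a model fitted to contextual factors such as shot location and type; here they are simply supplied as inputs.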
Given the availability of motion tracking data, where over 25 measures of each player’s coordinates per second can now be recorded, the emphasis is changing from simple reporting of physical load metrics and player actions to gaining more of an understanding of how and why these workloads and actions take place. Collectively, automated technologies have reduced the time needed to manually collect and store performance analysis-related data where the focus is now on the analysis rather than the assembly of the data.
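As a minimal sketch of how physical load metrics are derived from positional samples, the Python example below converts a sequence of (x, y) coordinates into speeds and accumulates the distance covered above a speed threshold. The 25 Hz sampling rate follows the figure quoted above; the 5.5 m/s threshold and the coordinates are assumptions for illustration, and the filtering that real tracking pipelines typically apply to noisy positions is omitted.

```python
import math


def speeds_m_per_s(positions, hz=25.0):
    """Speed between consecutive samples, given positions in metres at hz samples/s."""
    return [math.dist(p, q) * hz for p, q in zip(positions, positions[1:])]


def high_speed_distance(positions, hz=25.0, threshold_m_per_s=5.5):
    """Distance (m) covered in sampling intervals where speed exceeds the threshold."""
    total = 0.0
    for p, q in zip(positions, positions[1:]):
        step = math.dist(p, q)
        if step * hz > threshold_m_per_s:
            total += step
    return total


# Hypothetical track: five intervals at 0.1 m per sample (2.5 m/s),
# then five intervals at 0.3 m per sample (7.5 m/s), all along the x axis.
slow = [(0.1 * i, 0.0) for i in range(6)]
fast = [(0.5 + 0.3 * i, 0.0) for i in range(1, 6)]
track = slow + fast

speeds = speeds_m_per_s(track)
hsd = high_speed_distance(track)
print([round(s, 1) for s in speeds])
print(f"high-speed distance: {hsd:.2f} m")
```

Metrics of this kind describe what load occurred; combining them with event and tactical context is what allows the analysis to address how and why it occurred.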
Data without context are of little value. In elite football, the coaching staff, sports scientists and performance analysts, and now data engineers and statistical scientists, operate as a team within the team. It is this collaboration, this teamwork, a contemporary blending of coaching, sports science and data science expertise, that is key for sports organisations to remain competitive and reach success in the era of modern football. The strength in numbers rests on pursuing an interdisciplinary approach. This approach is essential to maximising the potential in all data collected to enhance football player performance and welfare, mitigate injury occurrence, increase team success and ultimately contribute to fan enjoyment.
John Newell Ph.D.
Professor of Biostatistics
School of Mathematical and Statistical Sciences, The National University of Ireland
Lorenzo Lolli Ph.D.
Football Performance & Science Department, Aspire Academy
Football Exchange, Liverpool John Moores University
Kenny McMillan Ph.D.
Head of Performance Support and Analytics
Sports Department, Aspire Academy
1. Bullock, G.S., Mylott, J., Hughes, T. et al. Just How Confident Can We Be in Predicting Sports Injuries? A Systematic Review of the Methodological Conduct and Performance of Existing Musculoskeletal Injury Prediction Models in Sport. Sports Med (2022).
2. Cameron N. 2004. Measuring maturity. In: Molinari L, Cameron N, Hauspie RC, editors. Methods in Human Growth Research. Cambridge: Cambridge University Press; p. 108-140.
3. Cole TJ. 2012. The development of growth references and growth charts. Ann Hum Biol. 39(5):382-394.
4. Cole TJ. 2019. Commentary: Methods for calculating growth trajectories and constructing growth centiles. Stat Med. 38(19):3571-3579.
5. Di Salvo V., Collins A., McNeill B., Cardinale M. Validation of Prozone®: A new video-based performance analysis system. Int J Perform Anal Sport. 2006;6:108–119.
6. Ellens S, Hodges D, McCullagh S, Malone JJ, Varley MC. Interchangeability of player movement variables from different athlete tracking systems in professional soccer. Sci Med Footb. 2022;6(1):1-6.
7. Gomez-Ruano MA et al. Editorial: Performance Analysis in Sport. Front Psychol. 2020;11:611634.
8. Halson, S.L. Monitoring Training Load to Understand Fatigue in Athletes. Sports Med 44, 139–147 (2014).
9. Hughes M., Franks I. (2007). The Essentials of Performance Analysis: An Introduction. London: Routledge.
10. Lolli L, Johnson A, Monaco M, Cardinale M, Di Salvo V, Gregson W. Tanner–Whitehouse and Modified Bayley–Pinneau adult height predictions in elite youth soccer players from the Middle East. Med Sci Sports Exerc. 2021;53(12):2683-90.
11. Malina RM, Rogol AD, Cumming SP, Coelho e Silva MJ, Figueiredo AJ. Biological maturation of youth athletes: assessment and implications. Br J Sports Med. 2015;49(13):852-9.
12. Naik BT, Hashmi MF, Bokde ND. A Comprehensive Review of Computer Vision in Sports: Open Issues, Future Trends and Research Directions. Applied Sciences. 2022; 12(9):4429.
13. Roshan D, Ferguson J, Pedlar CR, Simpkin A, Wyns W, Sullivan F, et al. (2021) A comparison of methods to generate adaptive reference ranges in longitudinal monitoring. PLoS ONE 16(2): e0247338.
14. van der Kruk E, Reijne MM. Accuracy of human motion capture systems for sport applications; state-of-the-art review. Eur J Sport Sci. 2018;18(6):806-19.
15. van Smeden M, Heinze G, Van Calster B et al. Critical appraisal of artificial intelligence-based prediction models for cardiovascular disease. Eur Heart J. 2022.
Header image by Edwin Lara (Cropped)