Group 3 analyzed a dataset from a single football (soccer) match to identify key themes and insights. The team began with summary statistics, then moved on to visualizations and correlations that could "bring the match to life". The key analyses conducted include:
- Spatial analysis: Identify how different players on each team positioned themselves during the game
- Event analysis: Determine the success rate of certain "events" and their impact throughout the match
- Temporal analysis: Understand how the ball position varied throughout the game
- Visual analysis: Build a view that used the raw data to showcase a re-enactment of the match
The team approached this analysis in the following manner:
- What data do we have?
- What can we do with this data?
- What insights do these analyses inform?
- What additional analyses could be performed if more data/time was available?
The following sections break down how the team approached each of these questions:
- What data do we have?
We had three files to work with (along with corresponding datapoints tracked):
- Raw Events data
- 1,746 rows x 14 columns
- Column Names
- Team
- Type
- Subtype
- Period
- Start Frame
- Start Time [s]
- End Frame
- End Time [s]
- From
- To
- Start X
- Start Y
- End X
- End Y
- Raw Tracking data for the Home Team
- 145,007 rows x 33 columns
- Column Names
- X-coordinate positions for 14 home team players and the ball
- Y-coordinate positions for 14 home team players and the ball
- Raw Tracking data for the Away Team
- 145,007 rows x 33 columns
- Column Names
- X-coordinate positions for 14 away team players and the ball
- Y-coordinate positions for 14 away team players and the ball
- What can we do with this data?
The team conducted several analyses on the dataset, grouped by category below:
- Spatial analysis
| Analyses Conducted | Data Manipulated |
| --- | --- |
| Average distance to ball per player | Created new series to calculate each player's average distance to the ball using X-Y coordinates provided |
| Average field position per player | Calculated average position per player on X and Y axis |
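The distance-to-ball calculation can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the team's actual code: the frame layout (a dict with `ball` and `players` keys), the player names, and the coordinate values are all assumptions made for the example.

```python
import math

def average_distance_to_ball(frames):
    """Return each player's mean Euclidean distance to the ball.

    `frames` is a hypothetical structure: a list of dicts with
    'ball': (x, y) and 'players': {name: (x, y) or None}.
    A position of None means the player was not tracked in that frame.
    """
    totals, counts = {}, {}
    for frame in frames:
        ball = frame["ball"]
        if ball is None:
            continue  # skip frames where the ball was not tracked
        for name, pos in frame["players"].items():
            if pos is None:
                continue  # player off the pitch or not tracked
            dist = math.hypot(pos[0] - ball[0], pos[1] - ball[1])
            totals[name] = totals.get(name, 0.0) + dist
            counts[name] = counts.get(name, 0) + 1
    return {name: totals[name] / counts[name] for name in totals}

# Made-up sample frames, for illustration only
frames = [
    {"ball": (0.5, 0.5), "players": {"Player1": (0.5, 0.1), "Player2": (0.2, 0.5)}},
    {"ball": (0.6, 0.5), "players": {"Player1": (0.6, 0.3), "Player2": None}},
]
avg = average_distance_to_ball(frames)
print(avg)
```

Averaging only over frames where both the player and the ball are tracked keeps substitutes from being penalized for time spent off the pitch.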
- Event analysis
| Analyses Conducted | Data Manipulated |
| --- | --- |
| Average distance run per team per half | Created new series to calculate distance "Z" from (x1,y1) to (x2,y2) |
| Average distance per shot (goal, on-target, off-target) | Filtered and grouped data to compare success rate per average distance of shot |
| Average time ball held per team per half (per event) | Created new series to calculate difference between End Time and Start Time to determine amount of possession |
| Average amount of time ball held by each player (per event) | Grouped and summed data to determine amount of time each player possessed the ball during game events |
| Number of passes per player | Counted number of pass events per player per team |
| Average success rate by player and team | Classified events and player contribution into success, fail, or neutral based on the outcome; this helped chart a post-game picture of individual player and team performance |
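A few of the event calculations above (event distance, time held, and pass counts) could look something like the sketch below. It uses the column names listed for the events file, but the sample rows, the player names, and the `"PASS"` type label are assumptions for illustration, not values taken from the dataset.

```python
import math
from collections import defaultdict

def summarize_events(events):
    """Aggregate time held and distance per team, and pass counts per player.

    `events` is a list of dicts keyed by the events-file column names.
    The "PASS" type label is an assumed value.
    """
    held = defaultdict(float)      # seconds of possession per team
    distance = defaultdict(float)  # summed event distance per team
    passes = defaultdict(int)      # number of pass events per player
    for e in events:
        team = e["Team"]
        # Time held: difference between End Time and Start Time
        held[team] += e["End Time [s]"] - e["Start Time [s]"]
        # Distance "Z" from (x1, y1) to (x2, y2)
        distance[team] += math.hypot(e["End X"] - e["Start X"],
                                     e["End Y"] - e["Start Y"])
        if e["Type"] == "PASS":
            passes[e["From"]] += 1
    return dict(held), dict(distance), dict(passes)

# Made-up sample events, for illustration only
events = [
    {"Team": "Home", "Type": "PASS", "From": "Player1",
     "Start Time [s]": 1.0, "End Time [s]": 3.0,
     "Start X": 0.0, "Start Y": 0.0, "End X": 0.3, "End Y": 0.4},
    {"Team": "Home", "Type": "PASS", "From": "Player2",
     "Start Time [s]": 3.0, "End Time [s]": 4.0,
     "Start X": 0.3, "Start Y": 0.4, "End X": 0.6, "End Y": 0.8},
    {"Team": "Away", "Type": "SHOT", "From": "Player15",
     "Start Time [s]": 10.0, "End Time [s]": 11.5,
     "Start X": 0.2, "Start Y": 0.5, "End X": 0.0, "End Y": 0.5},
]
held, distance, passes = summarize_events(events)
print(held, distance, passes)
```

Splitting the same aggregation by the Period column would give the per-half figures described in the table.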
- Temporal analysis
| Analyses Conducted | Data Manipulated |
| --- | --- |
| Team success rate trend over time | Plotted the average success rate of both teams over time as a measure of stamina over the course of the game |
| Ball position throughout match | Both teams had the ball in each other's half almost equally, though the home team spent more time in the opposing half |
| Player speed start vs. end game | Compared the performance of 4 players at the start of the match vs. the end of the match; there was no noticeable difference, as might be expected in a professional-level match |
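One way to approximate the ball-position comparison is to count the share of tracked frames in which the ball sits on each side of the halfway line, overall and per time window. The sketch below assumes x-coordinates normalized to the range 0 to 1 with the halfway line at 0.5, and it ignores the change of ends at half-time; both are simplifying assumptions, and the sample values are made up.

```python
def ball_half_share(ball_x, midline=0.5):
    """Fraction of tracked frames with the ball left / right of the midline.

    Frames where the ball position is missing (None) are skipped.
    """
    valid = [x for x in ball_x if x is not None]
    if not valid:
        return {"left": 0.0, "right": 0.0}
    left = sum(1 for x in valid if x < midline)
    return {"left": left / len(valid),
            "right": (len(valid) - left) / len(valid)}

def share_by_window(ball_x, window):
    """Apply ball_half_share to consecutive, non-overlapping frame windows."""
    return [ball_half_share(ball_x[i:i + window])
            for i in range(0, len(ball_x), window)]

# Made-up ball x-coordinates, one value per frame
ball_x = [0.1, 0.2, 0.7, None, 0.8, 0.9, 0.4, 0.6]
overall = ball_half_share(ball_x)
windows = share_by_window(ball_x, window=4)
print(overall, windows)
```

Mapping "left" and "right" to a team's own half requires knowing which direction each team attacks in each period, which this sketch leaves out.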
- Visual analysis
| Analyses Conducted | Data Manipulated |
| --- | --- |
| Plot passes and goals | Break down event types, filter on goals, find passes leading to goal |
| Summarize player performance | Calculate max speed, acceleration, and distance covered |
| Video of time series | Convert location coordinates to field and map to time frames |
| Pitch control and risky passes | Calculate duration for ball to arrive and duration for player to arrive -> probability each team will control ball when it arrives |
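The pitch control idea in the last row can be illustrated with a heavily simplified model: compare how long the ball and each team's nearest player need to reach a target point, then turn the time difference into a probability. This is not the team's implementation. The constant speeds, the logistic form, the `sigma` parameter, and the use of metres as units are all assumptions chosen for the sketch.

```python
import math

def arrival_time(start, target, speed):
    """Time in seconds to cover the straight-line distance at constant speed."""
    return math.hypot(target[0] - start[0], target[1] - start[1]) / speed

def control_probability(target, ball_pos, home_players, away_players,
                        ball_speed=15.0, player_speed=5.0, sigma=0.5):
    """Return (p_home, p_away) for controlling the ball at `target`.

    Simplifying assumptions: constant ball and player speeds (m/s),
    only each team's fastest-arriving player matters, and a logistic
    function of the arrival-time difference gives the probability.
    """
    ball_t = arrival_time(ball_pos, target, ball_speed)
    # A player who arrives early still has to wait for the ball
    home_t = max(ball_t, min(arrival_time(p, target, player_speed)
                             for p in home_players))
    away_t = max(ball_t, min(arrival_time(p, target, player_speed)
                             for p in away_players))
    p_home = 1.0 / (1.0 + math.exp(-(away_t - home_t) / sigma))
    return p_home, 1.0 - p_home

# Made-up positions in metres: the home player is much closer to the target
p_home, p_away = control_probability(
    target=(30.0, 0.0), ball_pos=(0.0, 0.0),
    home_players=[(30.0, 5.0)], away_players=[(30.0, 20.0)],
)
print(round(p_home, 3), round(p_away, 3))
```

Under a model like this, a pass could be flagged as risky when the passing team's control probability at the target falls below some chosen threshold.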
- What insights do these analyses inform?
- Observed a strong correlation between time held and distance run. This ran counter to our initial assumption that the team that ran the most was likely chasing the ball more often rather than holding it longer
- Home team maintained possession on the away team's side of the pitch on average, with more home players being closer to the ball than away players on average
- Successful non-goal events can be a good predictor of impending goals for real-time betting; data from more games could lead to more specific predictions
- Overall very few risky passes are made, and of those risky passes the majority are followed by a challenge from the other team
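The correlation between time held and distance run mentioned in the first insight can be checked with a Pearson coefficient. The sketch below is a generic implementation, and the sample lists are made-up numbers chosen only to show the extremes; they are not results from the match.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)

# Made-up per-player values: seconds of possession vs. distance run
time_held = [1.0, 2.0, 3.0]
distance_run = [2.0, 4.0, 6.0]
r = pearson(time_held, distance_run)
print(r)
```

A value near 1 indicates that players who held the ball longer also tended to run further, while a value near -1 would match the team's original expectation.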
- What additional analyses could be performed if more data/time was available?
- Calculate average position per player to determine starting and ending formation
- Quantify home-field advantage
- Diagram pitch control (i.e., how much territory can a player cover via various run tactics)
- Statsbomb open data: https://github.com/statsbomb/open-data
- Wyscout event data: https://figshare.com/collections/Soccer_match_event_dataset/4415000