Tag: Machine Learning

  • Data Science in the World Pt. 1: Data Science in Soccer

    Data Science in the World Pt. 1: Data Science in Soccer

    This post will be the first in a series of blog posts, called “Data Science in the World,” where I discuss the implementation of data science in different fields like sports, business, medicine, etc. To begin this series, I will be explaining how data science is used in soccer.

    There are 5 main areas of the soccer world where data science plays a critical role: Tactical & Match Analysis, Player Development & Performance, Recruitment & Scouting, Training & Recovery, and Set-Piece Engineering. I will break down how data science is used in each of these areas.

    Tactical & Match Analysis

    • Expected Goals (xG): Quantifies shot quality based on location, angle, and defensive pressure. xG can be used to judge a player’s finishing: if a player generates a high xG total over a season or a career, they should, in theory, eventually produce a correspondingly high number of goals (see the sketch after this list for a toy example of how such a model can be built).
    • Heatmaps & Passing Networks: Reveal spatial tendencies, player roles, and team structure. Heatmaps and passing networks can be used by coaches to point out the good and bad their team does in matches, helping them determine what to fix and what to focus on in matches.
    • Opponent Profiling: Teams dissect rivals’ patterns to exploit weaknesses and tailor game plans.
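
    To make the xG idea concrete, here is a minimal, purely illustrative sketch of how an xG-style model might be built: a logistic regression that maps shot features to a probability of scoring. The features and toy data below are hypothetical; real club models are trained on far richer event and tracking data.

```python
# Toy xG-style model: logistic regression over basic shot features.
# The feature set and data are made up for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical shots: [distance to goal (m), shot angle (deg), nearby defenders]
X = np.array([
    [6,  45, 0],   # close range, open angle, unmarked
    [11, 30, 1],
    [8,  40, 1],
    [18, 20, 2],
    [25, 12, 3],   # long range, tight angle, crowded
    [30, 10, 2],
])
y = np.array([1, 1, 1, 0, 0, 0])  # 1 = goal, 0 = no goal

model = LogisticRegression(max_iter=1000).fit(X, y)

# The predicted probability of scoring is the shot's xG value
new_shot = np.array([[10, 35, 1]])
print(f"xG for this shot: {model.predict_proba(new_shot)[0, 1]:.2f}")
```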

    Player Development & Performance

    • Event Data Tracking: Every pass, tackle, and movement is logged to assess decision-making and execution. Event data tracking helps coaches and players analyze match footage to refine first touch, scanning habits, and off-ball movement.
    • Wearable Tech: GPS and accelerometers monitor load, speed, and fatigue in real time. This helps tailor training intensity and reduce injury risk, especially in congested fixture periods.
    • Custom Metrics: Clubs build proprietary KPIs to evaluate players beyond traditional stats. Custom metrics allow for more nuanced evaluation than traditional stats like goals or tackles.

    Recruitment & Scouting

    • Market Inefficiencies: Data helps identify undervalued talent with specific skill sets. This is especially useful for teams that cannot afford players who are elite across multiple skills when they only need elite ability in one area.
    • Style Matching: Algorithms compare player profiles to team philosophy—think “find me the next Lionel Messi.” This ensures recruits aren’t just talented, but tactically compatible—saving time and money.
    • Injury Risk Modeling: Predictive analytics flag players with high susceptibility to injury. It informs transfer decisions and contract structuring.

    Training & Recovery Optimization

    • Load Management: Data guides intensity and volume to prevent overtraining. Especially vital for youth development and congested schedules.
    • Recovery Protocols: Biometrics and sleep data inform individualized recovery strategies. This improves performance consistency and long-term health.
    • Skill Targeting: Coaches use analytics to pinpoint technical weaknesses and design drills accordingly.

    Set-Piece Engineering

    • Spatial Analysis: Determines optimal corner kick types (in-swing vs. out-swing) and free kick setups. It turns set pieces into high-probability scoring opportunities.
    • Simulation Tools: VR and AR are emerging to rehearse scenarios with data-driven precision.

    Player Examples

    Now that we have discussed how data science is used, I will provide examples of teams and players that have utilized it in these ways.

    1. Liverpool FC – Recruitment & Tactical Modeling
      • Liverpool built one of the most advanced data science departments in soccer, led by Dr. Ian Graham. Using predictive models and custom metrics, they scouted and signed undervalued talent like Mohamed Salah and Sadio Mané on the basis of metrics such as expected threat.
      • Result: Salah scored 245 goals in just 9 seasons. Liverpool won their first Champions League title since 2005 and their first Premier League title, with Salah and Mané leading the line.
    2. Kevin De Bruyne – Contract Negotiation via Analytics FC
      • De Bruyne worked with Analytics FC to create a 40+ page data-driven report showcasing his value to Manchester City. It included proprietary metrics like Goal Difference Added (GDA), tactical simulations, and salary benchmarking.
      • Result: He negotiated his own contract extension without an agent, using data to prove his irreplaceable role in City’s system.
    3. Arsenal FC – Injury Risk & Youth Development
      • Arsenal integrated wearable tech and biomechanical data to monitor player load and injury risk. Young players like Myles Lewis-Skelly used performance analytics to support their rise from academy to first team.
      • Result: Lewis-Skelly’s data-backed contract renewal included insights into his match impact, fatigue management, and tactical fit—helping him secure a long-term deal amid interest from top European clubs.

  • Hyperparameter tuning with RandomizedSearchCV

    Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV?

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.

    Hyperparameter Tuning with RandomizedSearchCV

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    Parameter | Values Tried | Description
    n_estimators | [100, 200, 300] | Number of trees in the forest. More trees can improve performance but increase training time.
    min_samples_split | [2, 5, 10, 20] | Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting.
    min_samples_leaf | [1, 2, 4, 10] | Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance.
    max_features | ["sqrt", "log2", 1.0] | Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features.
    bootstrap | [True, False] | Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree.
    criterion | ["squared_error", "absolute_error"] | Function to measure the quality of a split. "squared_error" (default) is sensitive to outliers; "absolute_error" is more robust.
    ccp_alpha | [0.0, 0.01] | Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model.
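
    Below is a minimal sketch of the RandomizedSearchCV setup described above. The parameter lists mirror the table; the synthetic dataset, the n_iter value, the cv setting, and the scoring choice are placeholders I have assumed for illustration, not the exact configuration from my run.

```python
# Hedged sketch: RandomizedSearchCV over the parameter space from the table.
# The data here is a synthetic stand-in, not the dataset used in the post.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Placeholder data standing in for the real dataset
X, y = make_regression(n_samples=300, n_features=10, noise=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_distributions = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 10],
    "max_features": ["sqrt", "log2", 1.0],
    "bootstrap": [True, False],
    "criterion": ["squared_error", "absolute_error"],
    "ccp_alpha": [0.0, 0.01],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,                         # number of sampled combinations (assumed)
    scoring="neg_mean_squared_error",  # regression-appropriate scorer
    cv=5,
    random_state=42,
    n_jobs=-1,
)
random_search.fit(X_train, y_train)

print("Best parameters:", random_search.best_params_)
print("Best CV MSE:", -random_search.best_score_)
```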

    Interpretation

    Here is a table that compares the results in my previous post where I experimented with GridSearchCV with what I achieved while using RandomizedSearchCV.

    Metric | GridSearchCV | RandomizedSearchCV | Improvement
    Mean Squared Error (MSE) | 173.39 | 161.12 | ↓ 7.1%
    Root Mean Squared Error (RMSE) | 13.17 | 12.69 | ↓ 3.6%
    R² Score | 0.2716 | 0.3231 | ↑ 18.9%
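
    For reference, here is how the metrics in the table can be computed for the tuned model. This assumes the fitted random_search object and the X_test/y_test split from the sketch in the previous section.

```python
# Evaluate the best estimator found by the randomized search on held-out data
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = random_search.best_estimator_.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")
```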

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with greater computational efficiency. When I ran GridSearchCV with a comparable number of parameters to explore, it took a long time to finish; RandomizedSearchCV, by comparison, returned almost instantly. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It’s a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • Trying my hand at Hyperparameter tuning with GridSearchCV

    Trying my hand at Hyperparameter tuning with GridSearchCV

    In this post, I’ll try using scikit-learn’s GridSearchCV to optimize hyperparameters. GridSearchCV is a powerful tool that automates the process of hyperparameter tuning by exhaustively searching through a predefined grid of parameter combinations. It evaluates each configuration using cross-validation, allowing you to identify the settings that yield the best performance. It doesn’t guarantee the globally optimal solution, but GridSearchCV provides a reproducible way to improve model accuracy, reduce overfitting, and better understand how a model responds to different parameter choices.

    Hyperparameter Tuning with GridSearchCV

    First Attempt

    The images below show the initial parameters I used in my GridSearchCV experimentation and the results. Based on my reading, I decided to start with just a few parameters. Here they are, along with a brief description of why I felt each was a good starting point.

    Parameter | Description | Why It’s a Good Starting Point
    n_estimators | Number of trees in the forest | Controls model complexity and variance; 100–300 is a practical range for balancing performance and compute.
    bootstrap | Whether sampling is done with replacement | Tests the impact of bagging vs. full-dataset training, which can affect bias and variance. Bagging means each tree in the forest is trained on a random sample of the training data.
    criterion | Function used to measure the quality of a split | Offers diverse loss functions to explore how the model fits different error structures.
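
    As a rough sketch, a first-attempt setup along these lines might look like the following. The exact value lists are assumptions on my part (the actual grids are shown in the images), and scoring="f1" reproduces the mistake I discuss in the interpretation below.

```python
# Hedged sketch of the first-attempt grid search; value lists are assumed.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "bootstrap": [True, False],
    "criterion": ["squared_error", "absolute_error"],
}

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="f1",   # classification metric; the wrong choice for a regressor
    cv=5,
    n_jobs=-1,
)
# grid_search.fit(X_train, y_train)  # X_train/y_train: the dataset from the manual-tuning post
```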

    You may recall in my earlier post that I achieved these results during manual tuning:
    Mean squared error: 160.7100736652691
    RMSE: 12.677147694385717
    R² score: 0.3248694960846078

    Interpretation

    My Manual Configuration Wins on Performance

    • Lower MSE and RMSE: Indicates better predictive accuracy and smaller average errors.
    • Higher R²: Explains more variance in the target variable.

    Why Might GridSearchCV Underperform Here?

    • Scoring mismatch: I used "f1" as the scoring metric, which, as I discovered while reading, is actually a classification metric! So the grid search may have optimized incorrectly. Since I’m using a regressor, I should use "neg_mean_squared_error" or "r2".
    • Limited search space: My grid only varied n_estimators, bootstrap, and criterion. It didn’t explore other impactful parameters like min_samples_leaf, max_features, or max_depth.
    • Default values: GridSearchCV used default settings for parameters like min_samples_leaf=1, which could lead to overfitting or instability.

    Second Attempt

    In this attempt, I changed the scoring to neg_mean_squared_error. This scorer returns the negative of the mean squared error, so when GridSearchCV maximizes the score it is effectively minimizing the mean squared error (MSE). That in turn means GridSearchCV will choose parameters that minimize large deviations between predicted and actual values.
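
    A sketch of that change, reusing the (assumed) param_grid, X_train, and y_train from the first-attempt sketch:

```python
# Second attempt: same grid, but a regression-appropriate scorer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",  # GridSearchCV maximizes this, i.e. minimizes MSE
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)
print(-grid_search.best_score_)  # flip the sign to recover the cross-validated MSE
```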

    So how did that affect results? The images below show what happened.

    While the results aren’t much better, they are more valid because it was a mistake to use F1 scoring in the first place. Using F1 was wrong because:

    • The F1 score is defined for binary classification problems, and I am fitting a model with continuous outputs.
    • F1 needs discrete class labels, not continuous outputs.
    • When used in regression, scikit-learn would have forced predictions into binary labels, which distorts the optimization objective.
    • Instead of minimizing prediction error, it tried to maximize F1 on binarized outputs.

    Reflections

    • The "f1"-optimized model accidentally landed on a slightly better MSE, but this is not reliable or reproducible.
    • The "neg_mean_squared_error" model was explicitly optimized for MSE, so its performance is trustworthy and aligned with my regression goals.
    • The small difference could simply be due to random variation or hyperparameter overlap, not because "f1" is a viable scoring metric here.

    In summary, using "f1" in regression is methodologically invalid. Even if it produces a superficially better score, it’s optimizing the wrong objective and introduces unpredictable behavior.

    In my next post I will try some more parameters and also RandomizedSearchCV.

    – William