Tag: artificial intelligence

  • Data Science in the World Pt. 1: Data Science in Soccer

    Data Science in the World Pt. 1: Data Science in Soccer

    This post will be the first in a series of blog posts, called “Data Science in the World,” where I discuss the implementation of data science in different fields like sports, business, medicine, etc. To begin this series, I will be explaining how data science is used in soccer.

    There are 5 main areas of the soccer world where data science plays a critical role: Tactical & Match Analysis, Player Development & Performance, Recruitment & Scouting, Training & Recovery, and Set-Piece Engineering. I will break down how data science is used in each of these areas.

    Tactical & Match Analysis

    • Expected Goals (xG): Quantifies shot quality based on location, angle, and defensive pressure. xG can be used to gauge a player’s ability: a player who generates a high xG total over a season or a career should, in the long run, score a correspondingly high number of goals. A toy sketch of how such a model can be built appears after this list.
    • Heatmaps & Passing Networks: Reveal spatial tendencies, player roles, and team structure. Heatmaps and passing networks can be used by coaches to point out the good and bad their team does in matches, helping them determine what to fix and what to focus on in matches.
    • Opponent Profiling: Teams dissect rivals’ patterns to exploit weaknesses and tailor game plans.
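    To make the xG idea concrete, here is a toy sketch of how a shot-quality model might be fit. The features, data, and model choice below are illustrative assumptions; real xG models are trained on large event datasets with far richer inputs.

    ```python
    # Toy xG model: logistic regression on a few hypothetical shot features.
    # Real xG models use large event datasets (body part, assist type,
    # defensive pressure, etc.); this is only an illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical shots: [distance_to_goal_m, shot_angle_deg, defenders_in_path]
    X = np.array([
        [6.0, 45.0, 0],    # close range, wide angle, unmarked
        [25.0, 12.0, 2],   # long range, tight angle, crowded
        [11.0, 30.0, 1],
        [18.0, 20.0, 1],
        [8.0, 38.0, 2],
        [30.0, 8.0, 3],
    ])
    y = np.array([1, 0, 1, 0, 0, 0])  # 1 = goal, 0 = no goal

    model = LogisticRegression().fit(X, y)

    # The xG of a new shot is the predicted probability that it becomes a goal
    new_shot = np.array([[10.0, 35.0, 1]])
    print("xG:", model.predict_proba(new_shot)[0, 1])
    ```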

    Player Development & Performance

    • Event Data Tracking: Every pass, tackle, and movement is logged to assess decision-making and execution. Event data tracking helps coaches and players analyze match footage to refine first touch, scanning habits, and off-ball movement.
    • Wearable Tech: GPS and accelerometers monitor load, speed, and fatigue in real time. This helps tailor training intensity and reduce injury risk, especially in congested fixture periods.
    • Custom Metrics: Clubs build proprietary KPIs to evaluate players beyond traditional stats. Custom metrics allow for more nuanced evaluation than traditional stats like goals or tackles.

    Recruitment & Scouting

    • Market Inefficiencies: Data helps identify undervalued talent with specific skill sets. This is especially useful for clubs with smaller budgets: rather than paying a premium for a player who is elite at several things, they can target a cheaper player who excels at the one skill the team actually needs.
    • Style Matching: Algorithms compare player profiles to team philosophy—think “find me the next Lionel Messi.” This ensures recruits aren’t just talented, but tactically compatible—saving time and money.
    • Injury Risk Modeling: Predictive analytics flag players with high susceptibility to injury. It informs transfer decisions and contract structuring.

    Training & Recovery Optimization

    • Load Management: Data guides intensity and volume to prevent overtraining. Especially vital for youth development and congested schedules.
    • Recovery Protocols: Biometrics and sleep data inform individualized recovery strategies. This improves performance consistency and long-term health.
    • Skill Targeting: Coaches use analytics to pinpoint technical weaknesses and design drills accordingly.

    Set-Piece Engineering

    • Spatial Analysis: Determines optimal corner kick types (in-swing vs. out-swing) and free kick setups. It turns set pieces into high-probability scoring opportunities.
    • Simulation Tools: VR and AR are emerging to rehearse scenarios with data-driven precision.

    Player Examples

    Now that we’ve discussed how data science is used, I will provide examples of teams and players that have utilized it in these ways.

    1. Liverpool FC – Recruitment & Tactical Modeling
      • Liverpool built one of the most advanced data science departments in soccer, led by Dr. Ian Graham. Using predictive models and custom metrics, they scouted and signed undervalued talent like Mohamed Salah and Sadio Mane on the basis of metrics such as expected threat.
      • Result: Salah scored 245 goals in just 9 seasons. Liverpool won their first Champions League title since 2005 and their first-ever Premier League title, with Salah and Mane leading the attack.
    2. Kevin De Bruyne – Contract Negotiation via Analytics FC
      • De Bruyne worked with Analytics FC to create a 40+ page data-driven report showcasing his value to Manchester City. It included proprietary metrics like Goal Difference Added (GDA), tactical simulations, and salary benchmarking.
      • Result: He negotiated his own contract extension without an agent, using data to prove his irreplaceable role in City’s system.
    3. Arsenal FC – Injury Risk & Youth Development
      • Arsenal integrated wearable tech and biomechanical data to monitor player load and injury risk. Young players like Myles Lewis-Skelly used performance analytics to support their rise from academy to first team.
      • Result: Lewis-Skelly’s data-backed contract renewal included insights into his match impact, fatigue management, and tactical fit—helping him secure a long-term deal amid interest from top European clubs.


  • Hyperparameter tuning with RandomizedSearchCV

    Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.

    Hyperparameter Tuning with RandomizedSearchCV

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    • n_estimators [100, 200, 300]: Number of trees in the forest. More trees can improve performance but increase training time.
    • min_samples_split [2, 5, 10, 20]: Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting.
    • min_samples_leaf [1, 2, 4, 10]: Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance.
    • max_features ["sqrt", "log2", 1.0]: Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features.
    • bootstrap [True, False]: Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree.
    • criterion ["squared_error", "absolute_error"]: Function to measure the quality of a split. "squared_error" (default) is sensitive to outliers; "absolute_error" is more robust.
    • ccp_alpha [0.0, 0.01]: Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model.
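    For reference, here is a minimal sketch of how these distributions can be plugged into RandomizedSearchCV. The synthetic data and the n_iter, cv, and random_state values below are stand-in assumptions rather than the exact setup from my notebook.

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    # Stand-in data; my notebook uses its own dataset
    X_train, y_train = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=42)

    # The parameter space described above
    param_distributions = {
        "n_estimators": [100, 200, 300],
        "min_samples_split": [2, 5, 10, 20],
        "min_samples_leaf": [1, 2, 4, 10],
        "max_features": ["sqrt", "log2", 1.0],
        "bootstrap": [True, False],
        "criterion": ["squared_error", "absolute_error"],
        "ccp_alpha": [0.0, 0.01],
    }

    search = RandomizedSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        param_distributions=param_distributions,
        n_iter=25,                         # sample 25 random combinations instead of all of them
        scoring="neg_mean_squared_error",  # higher (less negative) score = lower MSE
        cv=5,
        random_state=42,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    print(search.best_params_)
    print(-search.best_score_)  # best cross-validated MSE
    ```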

    Interpretation

    Here is a table that compares the results from my previous post, where I experimented with GridSearchCV, with what I achieved using RandomizedSearchCV.

    Metric                           GridSearchCV   RandomizedSearchCV   Improvement
    Mean Squared Error (MSE)         173.39         161.12               ↓ 7.1%
    Root Mean Squared Error (RMSE)   13.17          12.69                ↓ 3.6%
    R² Score                         0.2716         0.3231               ↑ 18.9%

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with greater computational efficiency. When I ran GridSearchCV with a comparable number of parameters to explore, it ran for a long time; RandomizedSearchCV, by comparison, returned almost instantly. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It’s a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • Trying my hand at Hyperparameter tuning with GridSearchCV

    Trying my hand at Hyperparameter tuning with GridSearchCV

    In this post, I’ll try using scikit-learn’s GridSearchCV to optimize hyperparameters. GridSearchCV is a powerful tool in scikit-learn that automates the process of hyperparameter tuning by exhaustively searching through a predefined grid of parameter combinations. It evaluates each configuration using cross-validation, allowing you to identify the settings that yield the best performance. While it doesn’t guarantee the globally optimal solution, GridSearchCV provides a reproducible way to improve model accuracy, reduce overfitting, and better understand how a model responds to different parameter choices.

    Hyperparameter Tuning with GridSearchCV

    First Attempt

    The images below show the initial parameters I used in my GridSearchCV experimentation and the results. Based on my reading, I decided to try just a few parameters to start. Here are the parameters I chose to start with and a brief description of why I felt each was a good place to start.

    • n_estimators (number of trees in the forest): Controls model complexity and variance; 100–300 is a practical range for balancing performance and compute.
    • bootstrap (whether sampling is done with replacement): Tests the impact of bagging vs. full-dataset training, which can affect bias and variance. Bagging means each decision tree in the forest is trained on a random sample of the training data.
    • criterion (function used to measure the quality of a split): Offers diverse loss functions to explore how the model fits different error structures.
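    The original notebook screenshots aren’t reproduced here, but a minimal sketch of a grid over just these three parameters might look like the following. The stand-in data, the specific criterion values, and the cv setting are assumptions; scoring is omitted in the sketch because, as discussed below, my actual first attempt passed scoring="f1".

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Stand-in data; my notebook uses its own dataset
    X_train, y_train = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)

    # First-attempt grid: only the three parameters described above
    param_grid = {
        "n_estimators": [100, 200, 300],
        "bootstrap": [True, False],
        "criterion": ["squared_error", "absolute_error"],
    }

    # No scoring argument here, so GridSearchCV falls back to the regressor's
    # default R^2 score; my actual first attempt used scoring="f1" (see below).
    grid = GridSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        param_grid=param_grid,
        cv=5,
        n_jobs=-1,
    )
    grid.fit(X_train, y_train)
    print(grid.best_params_)
    ```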

    You may recall in my earlier post that I achieved these results during manual tuning:
    Mean squared error: 160.7100736652691
    RMSE: 12.677147694385717
    R2 score: 0.3248694960846078

    Interpretation

    My Manual Configuration Wins on Performance

    • Lower MSE and RMSE: Indicates better predictive accuracy and smaller average errors.
    • Higher R²: Explains more variance in the target variable.

    Why Might GridSearchCV Underperform Here?

    • Scoring mismatch: I used "f1" as the scoring metric, which, as I discovered while reading, is actually a classification metric! So the grid search may have optimized incorrectly. Since I’m using a regressor, I should use "neg_mean_squared_error" or "r2".
    • Limited search space: My grid only varied n_estimators, bootstrap, and criterion. It didn’t explore other impactful parameters like min_samples_leaf, max_features, or max_depth.
    • Default values: GridSearchCV used default settings for parameters like min_samples_leaf=1, which could lead to overfitting or instability.

    Second Attempt

    In this attempt, I changed the scoring to neg_mean_squared_error. That scorer returns the negative of the mean squared error, and because GridSearchCV always maximizes the scoring function, maximizing the negative MSE is the same as minimizing the MSE. In turn, GridSearchCV will choose parameters that minimize large deviations between predicted and actual values.
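    To make the sign convention concrete, here is a small illustration (on stand-in data) showing that the built-in "neg_mean_squared_error" scorer is literally the negative of mean_squared_error, so a higher score corresponds to a lower error:

    ```python
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import get_scorer, mean_squared_error

    # Stand-in data and model, just to demonstrate the scorer's sign convention
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X, y)

    mse = mean_squared_error(y, model.predict(X))
    neg_mse = get_scorer("neg_mean_squared_error")(model, X, y)

    print(mse, neg_mse)  # neg_mse == -mse, so maximizing it minimizes the MSE
    ```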

    So how did that affect results? The below images show what happened.

    While the results aren’t much better, they are more valid because it was a mistake to use F1 scoring in the first place. Using F1 was wrong because:

    • The F1 score is defined for binary classification problems, while I am fitting continuous outputs.
    • F1 needs discrete class labels, not continuous outputs.
    • When used in regression, scikit-learn would have forced predictions into binary labels, which distorts the optimization objective.
    • Instead of minimizing prediction error, it tried to maximize F1 on binarized outputs.

    Reflections

    • The "f1"-optimized model accidentally landed on a slightly better MSE, but this is not reliable or reproducible.
    • The "neg_mean_squared_error" model was explicitly optimized for MSE, so its performance is trustworthy and aligned with my regression goals.
    • The small difference could simply be due to random variation or hyperparameter overlap, not because "f1" is a viable scoring metric here.

    In summary, using "f1" in regression is methodologically invalid. Even if it produces a superficially better score, it’s optimizing the wrong objective and introduces unpredictable behavior.

    In my next post I will try some more parameters and also RandomizedSearchCV.

    – William

  • Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

    Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

    In today’s data-driven world, even seemingly straightforward questions can reveal surprising insights. In this post, I investigate whether students’ alcohol consumption habits bear any relationship to their final math grades. Using the Student Alcohol Consumption dataset from Kaggle, which contains survey responses on myriad aspects of students’ lives, ranging from study habits and social factors to gender and alcohol use, I set out to determine whether patterns exist that can predict academic performance.

    Dataset Overview

    The dataset originates from a survey of students enrolled in secondary school math and Portuguese courses. It includes rich social and academic information, such as:

    • Social and family background
    • Study habits and academic support
    • Alcohol consumption details during weekdays and weekends

    I focused on predicting the final math grade (denoted as G3 in the raw data) while probing how alcohol-related features, especially weekend consumption, might play a role in performance. The question wasn’t just whether students drank, but which drinking pattern, weekday or weekend, might be more telling of their academic results.

    Data Preprocessing: Laying the Groundwork

    Before diving into modeling, the data needed some cleanup. Here’s how I systematically prepared the dataset for analysis:

    1. Loading the Data: I imported the CSV into a Pandas DataFrame for easy manipulation.
    2. Renaming Columns: Clarity matters. I renamed ambiguous columns for better readability (e.g., renaming walc to weekend_alcohol and dalc to weekday_alcohol).
    3. Label Encoding: Categorical data were converted to numeric representations using scikit-learn’s LabelEncoder, ensuring all features could be numerically processed.
    4. Reusable Code: I encapsulated the training and testing phases within a reusable function, which made it straightforward to test different feature combinations.

    Here are some snippets:

    In those cells:

    • I rename columns to make them more readable.
    • I instantiate a LabelEncoder object and encode a list of columns that have string values.
    • I add an absence category to bucket the absence counts, which smooths out that highly variable column (a sketch of these steps follows below).
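    The original notebook cells aren’t shown here, so below is a minimal sketch of that preprocessing. The file name, the Walc/Dalc column names, and the absence bin edges are assumptions based on the standard Kaggle CSV rather than my exact notebook code.

    ```python
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    # Load the math-course portion of the Student Alcohol Consumption dataset
    df = pd.read_csv("student-mat.csv")  # file name assumed from the Kaggle download

    # Rename ambiguous columns for readability
    df = df.rename(columns={"Walc": "weekend_alcohol", "Dalc": "weekday_alcohol"})

    # Label-encode every string-valued column so all features are numeric
    encoder = LabelEncoder()
    for col in df.select_dtypes(include="object").columns:
        df[col] = encoder.fit_transform(df[col])

    # Bucket the highly variable absence counts into a coarser category
    # (bin edges are illustrative assumptions)
    df["absence_category"] = pd.cut(
        df["absences"], bins=[-1, 0, 5, 15, 100], labels=[0, 1, 2, 3]
    ).astype(int)
    ```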

    Experimenting With Gaussian Naive Bayes

    The heart of this exploration was to see how well a Gaussian Naive Bayes classifier could predict the final math grade based on different selections of features. Naive Bayes, while greatly valued for its simplicity and speed, operates under the assumption that features are independent—a condition that might not fully hold in educational data.

    Training and Evaluation Function

    To streamline the experiments, I wrote a function that:

    • Splits the data into training and testing sets.
    • Trains a GaussianNB model.
    • Evaluates accuracy on the test set.

    In that cell:

    • I create a function that:
      • Drops unwanted columns.
      • Runs 100 training cycles with the given data.
      • Captures the accuracy measured from each run and returns the average (a sketch of this helper appears below).
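    A minimal sketch of such a helper is shown below. The target column name (G3), the test split size, and the use of GaussianNB.score for accuracy are assumptions; my actual function dropped unwanted columns rather than taking the feature list directly, which amounts to the same thing.

    ```python
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    def average_accuracy(df, feature_cols, target_col="G3", n_runs=100):
        """Train GaussianNB n_runs times on the given feature columns and
        return the mean accuracy on held-out test sets."""
        X = df[feature_cols]
        y = df[target_col]
        scores = []
        for _ in range(n_runs):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
            model = GaussianNB().fit(X_train, y_train)
            scores.append(model.score(X_test, y_test))  # classification accuracy
        return np.mean(scores)
    ```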

    Single- and Two-Column Sampling

    In those cells:

    • I get a list of all columns.
    • I create loop(s) over the column list and create a list of features to test.
    • I call my function to measure the accuracy of those features at predicting student grades (a sketch is shown after this list).
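    Here is a rough sketch of those loops, building on the df and average_accuracy sketches above (the target column name and the use of itertools.combinations are assumptions):

    ```python
    from itertools import combinations

    # Every column except the target is a candidate predictor
    feature_columns = [c for c in df.columns if c != "G3"]

    # Single-feature runs
    single_results = {col: average_accuracy(df, [col]) for col in feature_columns}

    # Two-feature combinations
    pair_results = {
        pair: average_accuracy(df, list(pair))
        for pair in combinations(feature_columns, 2)
    }

    # Top 20 pairs by average accuracy
    top_pairs = sorted(pair_results.items(), key=lambda kv: kv[1], reverse=True)[:20]
    print(top_pairs)
    ```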

    Diving Into Feature Combinations

    I aimed to assess the predictive power by testing different combinations of features:

    1. All Columns: This gave the best accuracy of around 22%, yet it was clear that even the full spectrum of information struggled to make strong predictions.
    2. Handpicked Features: I manually selected features that I hypothesized might be influential. The resulting accuracy dipped below that of the full dataset.
    3. Individual Features: Evaluating each feature solo revealed that the column indicating whether students planned to pursue higher education yielded the highest individual accuracy—though still far lower than all features combined.
    4. Two-Feature Combinations: By testing all pairs, I noticed that combinations including weekend alcohol consumption appeared in the top 20 predictive pairs four times, including in both of the top two.
    5. Three-Feature Combinations: The trend became stronger—combinations featuring weekend alcohol consumption topped the list ten times and were present in each of the top three combinations!
    6. Four-Feature Combinations: Here, weekend alcohol consumption featured in the top 20 combination results even more robustly—15 times in total.

    These experiments showcased one noteworthy pattern: weekend alcohol consumption consistently emerged as a common denominator in the best-performing feature combinations, while weekday consumption rarely made an appearance.

    Analysis of the Findings

    Several key observations emerged from this series of experiments:

    • Predictive Accuracy: Even with the full set of features, the best accuracy reached was only around 22%. This underwhelming performance is indicative of the challenges posed by the dataset and the restrictive assumptions embedded within the Naive Bayes model.
    • Role of Alcohol Consumption: The repeated appearance of weekend alcohol consumption in high-ranking feature combinations suggests a potential association—it may capture lifestyle or social habits that indirectly correlate with academic performance. However, it is not a standalone predictor; rather, it seems to be relevant as part of a multifactorial interaction.
    • Model Limitations: The Gaussian Naive Bayes classifier assumes feature independence. The complexities inherent in student performance—where multiple social, educational, and psychological factors interact—likely violate this assumption, leading to lower predictive performance.

    Conclusion and Future Directions

    While the Gaussian Naive Bayes classifier provided some interesting insights, especially regarding the recurring presence of weekend alcohol consumption in influential feature combinations, its overall accuracy was modest. Predicting the final math grade, a multifaceted outcome influenced by numerous interdependent factors, appears too challenging for this simplistic probabilistic model.

    Next Steps:

    • Alternative Machine Learning Algorithms: Investigating other approaches like decision trees, random forests, support vector machines, or ensemble methods may yield better performance.
    • Enhanced Feature Engineering: Incorporating interaction terms or domain-specific features might help capture the complex relationships between social habits and academic outcomes.
    • Broader Data Explorations: Diving deeper into other factors—such as study habits, parental support, and extracurricular involvement—could provide additional clarity.

    Final Thoughts and Next Steps

    This journey reinforced the idea that while Naive Bayes is a great tool for its speed and interpretability, it might not be the best choice for every dataset. More sophisticated models and careful feature engineering are necessary when dealing with complex problems like predicting student academic performance.

    The new Jupyter notebook can be found here in my GitHub.

    – William