Tag: machine-learning

  • Hyperparameter tuning with RandomizedSearchCV

    Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.
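
    To make the difference concrete, here is a minimal sketch using scikit-learn’s ParameterGrid and ParameterSampler helpers. The three-parameter search space and the budget of 10 candidates are assumptions chosen only for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative search space (assumed for this sketch).
space = {
    "n_estimators": [100, 200, 300],
    "min_samples_split": [2, 5, 10, 20],
    "min_samples_leaf": [1, 2, 4, 10],
}

# Grid search would evaluate every combination: 3 * 4 * 4 = 48.
grid = list(ParameterGrid(space))

# Randomized search draws a fixed number of candidates instead.
sampled = list(ParameterSampler(space, n_iter=10, random_state=42))

print(len(grid), "grid candidates vs", len(sampled), "sampled candidates")
for params in sampled[:3]:
    print(params)
```

    Every sampled candidate is one the grid would also have visited; the random search simply stops after the budget is spent.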

    Hyperparameter Tuning with RandomizedSearchCV

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    Parameter | Values | Description
    n_estimators | [100, 200, 300] | Number of trees in the forest. More trees can improve performance but increase training time.
    min_samples_split | [2, 5, 10, 20] | Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting.
    min_samples_leaf | [1, 2, 4, 10] | Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance.
    max_features | ["sqrt", "log2", 1.0] | Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features.
    bootstrap | [True, False] | Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree.
    criterion | ["squared_error", "absolute_error"] | Function to measure the quality of a split. "squared_error" (default) is sensitive to outliers; "absolute_error" is more robust.
    ccp_alpha | [0.0, 0.01] | Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model.

    Interpretation

    The table below compares the GridSearchCV results from my previous post with what I achieved using RandomizedSearchCV.

    Metric | GridSearchCV | RandomizedSearchCV | Improvement
    Mean Squared Error (MSE) | 173.39 | 161.12 | ↓ 7.1%
    Root Mean Squared Error (RMSE) | 13.17 | 12.69 | ↓ 3.6%
    R² Score | 0.2716 | 0.3231 | ↑ 18.9%
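
    For anyone who wants to retrace the arithmetic, this small sketch recomputes the RMSE values and the relative changes from the rounded MSE and R² figures in the table. Because the inputs are rounded, the last digit of a percentage can differ slightly from the table. The helper name pct_change is my own.

```python
import math

# Values copied from the comparison table (already rounded).
grid_mse, rand_mse = 173.39, 161.12
grid_r2, rand_r2 = 0.2716, 0.3231


def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# RMSE is the square root of MSE.
grid_rmse = math.sqrt(grid_mse)
rand_rmse = math.sqrt(rand_mse)

print(f"RMSE: {grid_rmse:.2f} -> {rand_rmse:.2f}")
print(f"MSE change:  {pct_change(grid_mse, rand_mse):.1f}%")
print(f"RMSE change: {pct_change(grid_rmse, rand_rmse):.1f}%")
print(f"R2 change:   {pct_change(grid_r2, rand_r2):.1f}%")
```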

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with greater computational efficiency. When I ran GridSearchCV over this many parameters, it ran for a long time. RandomizedSearchCV, by comparison, returned almost instantly. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It’s a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • Trying my hand at Hyperparameter tuning with GridSearchCV

    Trying my hand at Hyperparameter tuning with GridSearchCV

    In this post, I’ll try using scikit-learn’s GridSearchCV to optimize hyperparameters. GridSearchCV is a powerful tool in scikit-learn that automates the process of hyperparameter tuning by exhaustively searching through a predefined grid of parameter combinations. It evaluates each configuration using cross-validation, allowing you to identify the settings that yield the best performance. It doesn’t guarantee the globally optimal solution, but GridSearchCV provides a reproducible way to improve model accuracy, reduce overfitting, and better understand how a model responds to different parameter choices.

    Hyperparameter Tuning with GridSearchCV

    First Attempt

    The images below show the initial parameters I used in my GridSearchCV experimentation and the results. Based on my reading, I decided to try just a few parameters to start. Here are the parameters I chose to start with and a brief description of why I felt each was a good place to start.

    Parameter | Description | Why It’s a Good Starting Point
    n_estimators | Number of trees in the forest | Controls model complexity and variance; 100–300 is a practical range for balancing performance and compute.
    bootstrap | Whether sampling is done with replacement | Tests the impact of bagging vs. full dataset training; can affect bias and variance. Bagging means each decision tree in the forest is trained on a random sample of the training data.
    criterion | Function used to measure the quality of a split | Offers diverse loss functions to explore how the model fits different error structures.

    You may recall from my earlier post that I achieved these results during manual tuning:
    Mean squared error: 160.7100736652691
    RMSE: 12.677147694385717
    R2 score: 0.3248694960846078

    Interpretation

    My Manual Configuration Wins on Performance

    • Lower MSE and RMSE: Indicates better predictive accuracy and smaller average errors.
    • Higher R²: Explains more variance in the target variable.

    Why Might GridSearchCV Underperform Here?

    • Scoring mismatch: I used "f1" as the scoring metric which, as I discovered while reading, is actually for classification! So, the grid search may have optimized incorrectly. Since I’m using a regressor, I should use "neg_mean_squared_error" or "r2".
    • Limited search space: My grid only varied n_estimators, bootstrap, and criterion. It didn’t explore other impactful parameters like min_samples_leaf, max_features, or max_depth.
    • Default values: GridSearchCV used default settings for parameters like min_samples_leaf=1, which could lead to overfitting or instability.

    Second Attempt

    In this attempt, I changed the scoring to neg_mean_squared_error. This scorer returns the negative of the mean squared error (MSE). GridSearchCV always picks the candidate with the highest score, so maximizing the negative MSE is the same as minimizing the MSE. That in turn means that GridSearchCV will choose parameters that minimize large deviations between predicted and actual values, since squaring penalizes big errors the most.
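
    Here is a minimal sketch of that sign convention. The synthetic data and the tiny two-parameter grid are assumptions made so the example runs quickly; they are not the grid from my notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption for this sketch).
X, y = make_regression(n_samples=150, n_features=6, noise=10.0, random_state=0)

# Deliberately tiny, illustrative grid.
param_grid = {"n_estimators": [50, 100], "bootstrap": [True, False]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)

# Scores are negated MSE values: all <= 0, and higher (closer to 0) is better.
mean_scores = search.cv_results_["mean_test_score"]
best_mse = -search.best_score_

print(mean_scores)
print("best params:", search.best_params_, "CV MSE:", best_mse)
```

    Flipping the sign of best_score_ recovers an ordinary, positive MSE that can be compared with the numbers reported elsewhere in this post.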

    So how did that affect results? The images below show what happened.

    While the results aren’t much better, they are more valid because it was a mistake to use F1 scoring in the first place. Using F1 was wrong because:

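    • The F1 score is defined for classification problems, and I am fitting continuous outputs.
    • F1 needs discrete class labels, not continuous outputs.
    • scikit-learn does not turn regression output into class labels for F1. As far as I can tell, the metric fails on continuous predictions, and recent versions of GridSearchCV record a failed score as NaN by default (with a warning) rather than stopping.
    • With no valid scores to compare, the search could not rank candidates by prediction error, so whichever configuration it reported as best was not optimized for anything meaningful.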
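
    To check how the F1 metric reacts to regression-style output, here is a minimal sketch with made-up numbers. In the scikit-learn versions I’m aware of, f1_score rejects a mix of class-like targets and continuous predictions by raising a ValueError.

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([55, 72, 90, 61])          # integer targets (made-up scores)
y_pred = np.array([57.3, 70.1, 88.6, 64.9])  # continuous regressor-style output

try:
    f1_score(y_true, y_pred)
    f1_failed = False
except ValueError as err:
    f1_failed = True
    print("f1_score rejected the inputs:", err)
```

    Either way, the metric has no sensible definition for continuous predictions, which is the core reason it doesn’t belong in a regression search.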

    Reflections

    • The "f1"-optimized model accidentally landed on a slightly better MSE, but this is not reliable or reproducible.
    • The "neg_mean_squared_error" model was explicitly optimized for MSE, so its performance is trustworthy and aligned with my regression goals.
    • The small difference could simply be due to random variation or hyperparameter overlap, not because "f1" is a viable scoring metric here.

    In summary, using "f1" in regression is methodologically invalid. Even if it produces a superficially better score, it’s optimizing the wrong objective and introduces unpredictable behavior.

    In my next post I will try some more parameters and also RandomizedSearchCV.

    – William

  • Playing with Hyperparameter Tuning and Winsorizing

    Playing with Hyperparameter Tuning and Winsorizing

    In this post, I’ll revisit my earlier model’s performance by experimenting with hyperparameter tuning, pushing beyond default configurations to extract deeper predictive power. I’ll also take a critical look at the data itself, exploring how winsorizing can rein in outliers without sacrificing the integrity of the data. The goal: refine, rebalance, and rethink accuracy.

    Hyperparameter Tuning

    The image below shows my initial experiment with the RandomForestRegressor. As you can see, I used the default value for n_estimators.

    The resulting MSE, RMSE and R² score are shown. In my earlier post I noted what those values mean. In summary:

    • An MSE of 172 is large for a 0–100 scale. Because MSE squares each error, a few big misses (possibly outliers) can inflate it.
    • An RMSE of 13 indicates a typical error of around 13 points on a 0–100 scale.
    • An R² of 0.275 means my model explains just 27.5% of the variance in the target variable.
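
    For reference, the three metrics can be computed as in the sketch below. The data here is synthetic (an assumption, since the original notebook cell is shown only as an image), so the numbers will not match mine.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=6, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestRegressor(random_state=42)  # n_estimators defaults to 100
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # same units as the target
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  R2: {r2:.3f}")
```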

    Experimentation

    My first attempt at manual tuning looked like the image below. There really is just a small improvement with these parameters. I increased n_estimators significantly because more trees should improve accuracy. I set max_depth to 50 to see how that compares with the default value of None (no depth limit). I increased min_samples_split to 20 and min_samples_leaf to 10 to see if that would help with any noise in the data. I didn’t really need to set max_features to 1.0, because that is currently the default value.

    The net result was slightly better results, but nothing too significant.

    Next, I tried what is shown in the image below. Interestingly, I got very similar results to the above. With these values, the model trains much faster while achieving the same results.

    Winsorizing

    Winsorization changes a dataset by replacing outlier values with less extreme ones. Unlike trimming (which removes outliers), winsorization preserves the dataset size by limiting values at the chosen threshold.
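
    As a rough illustration of the idea (not my actual notebook cell), here is a minimal sketch that caps values at chosen percentiles with NumPy. The helper name winsorize, the 5th/95th percentile limits, and the sample scores are all assumptions.

```python
import numpy as np


def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower and upper percentiles."""
    values = np.asarray(values, dtype=float)
    low, high = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, low, high)


# Made-up scores with an extreme value at each end.
scores = np.array([0, 8, 45, 52, 58, 61, 66, 70, 74, 79, 85, 100])
capped = winsorize(scores)

print(capped)
print(len(scores) == len(capped))  # the dataset size is preserved
```

    Only the extremes move toward the percentile limits; values in the middle are left untouched, and no rows are dropped.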

    Here is what my code looks like:

    In this cell, I’ve replaced the math score data with a winsorized version. I used the same hyperparameters as before. Here we can see a more significant improvement in MSE and RMSE, but a slightly lower R² score.

    Because the earlier model has a slightly higher R², it explains a bit more of the variance in its target variable. R² is measured relative to the total variance of the target, and winsorizing shrinks that variance, so R² can dip even when the errors get smaller.

    The winsorized model’s lower MSE and RMSE indicate better overall prediction accuracy. This is nice when minimizing absolute error matters the most.
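
    One way to see how MSE and R² can move in different directions is that R² equals 1 minus the MSE divided by the variance of the target. The sketch below checks that identity on made-up numbers; none of these values come from my data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
y_true = rng.normal(66, 15, size=200)          # made-up targets
y_pred = y_true + rng.normal(0, 12, size=200)  # made-up predictions

mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

# R^2 = 1 - MSE / Var(y): error is judged relative to the target's variance.
r2_manual = 1 - mse / np.var(y_true)

print(r2, r2_manual)
```

    If capping outliers lowers the MSE but lowers the target’s variance by a larger proportion, the ratio grows and R² falls.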

    Final Thoughts

    After experimenting with default settings, I systematically adjusted hyperparameters and applied winsorization to improve my RandomForestRegressor’s accuracy. Here’s a concise overview of the three main runs:

    • Deep, Wide Forest
      • Parameters
        • max_depth: 50
        • min_samples_split: 20
        • min_samples_leaf: 10
        • max_features: 1.0
        • random_state: 42
      • Insights
        • A large ensemble with controlled tree depth and higher split/leaf thresholds slightly reduced variance but yielded only marginal gains over defaults.
    • Standard Forest with Unlimited Depth
      • Parameters
        • max_depth: None
        • min_samples_split: 2
        • min_samples_leaf: 10
        • max_features: 1.0
        • random_state: 42
      • Insights
        • Reverting to fewer trees and no depth limit produced nearly identical performance, suggesting diminishing returns from deeper or wider forests in this setting.
    • Winsorized Data
      • Parameters
        • n_estimators: 100
        • max_depth: None
        • min_samples_split: 2
        • min_samples_leaf: 10
        • max_features: 1.0
        • random_state: 42
        • Applied winsorization to cap outliers
      • Insights
        • Winsorizing outliers drastically lowered absolute error (MSE/RMSE), highlighting its power for stabilizing predictions. The slight drop in R² reflects reduced target variance after capping extremes.

    – William

  • Deep Dive Into Random Forests

    Deep Dive Into Random Forests

    In today’s post, I’ll take an in-depth look at Random Forests, one of the most popular and effective algorithms in the data science toolkit. I’ll describe what I learned about how they work, their components and what makes them tick.

    What Are Random Forests?

    At its heart, a random forest is an ensemble of decision trees working together.

    • Decision Trees: Each tree is a model that makes decisions by splitting data based on certain features.
    • Ensemble Approach: Instead of relying on a single decision tree, a random forest builds many trees from bootstrapped samples of your data. The prediction from the forest is then derived by averaging (for regression) or taking a majority vote (for classification).

    This approach reduces the variance typical of individual trees and builds a robust model that handles complex feature interactions with ease.

    The Magic Behind the Method

    1. Bootstrap Sampling

    Each tree in the forest is trained on a different sample of the data, drawn with replacement. This is the bootstrap half of bagging (Bootstrap Aggregating), and it means roughly 37% of your rows are left out of any given tree’s sample, with a different 37% for each tree. This leftover data, the out-of-bag (OOB) set, can later be used to internally validate the model without needing a separate validation set.
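
    The 37% figure applies to each individual tree’s sample, and it can be checked with a quick simulation. This is a minimal sketch; the sample size and number of simulated trees are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples = 1000
n_trees = 200

oob_fractions = []
for _ in range(n_trees):
    # Bootstrap sample: draw n_samples row indices with replacement.
    drawn = rng.integers(0, n_samples, size=n_samples)
    n_unique = np.unique(drawn).size
    oob_fractions.append(1 - n_unique / n_samples)

mean_oob = float(np.mean(oob_fractions))
# Chance that a given row is never drawn; approaches 1/e (about 0.368).
expected = (1 - 1 / n_samples) ** n_samples

print(f"simulated OOB fraction per tree: {mean_oob:.3f}, theory: {expected:.3f}")
```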

    2. Random Feature Selection

    At every decision point within a tree, instead of considering every feature, the algorithm randomly selects a subset. This randomness:

    • De-correlates Trees: The trees become less alike, which helps keep the ensemble from overfitting or leaning too heavily on one feature.
    • Reduces Variance: Averaging predictions across diverse trees smooths out misclassifications or prediction errors.

    3. Aggregating Predictions

    For classification tasks, each tree casts a vote for a class, and the class with the highest number of votes becomes the model’s prediction.

    For regression tasks, predictions are averaged to produce a final value. This collective approach generally results in higher accuracy and more stable predictions.
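
    For the regression case, the averaging can be verified directly against scikit-learn. The sketch below uses synthetic data (an assumption) and compares the forest’s prediction with a manual average over its fitted trees, which are exposed as estimators_.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption for this sketch).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Predict with every individual tree, then average across trees.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))
```

    One detail worth knowing: for classification, scikit-learn’s forest averages the trees’ predicted class probabilities instead of counting hard votes, which often, but not always, picks the same class as a majority vote.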

    Out-of-Bag (OOB) Error

    An important feature of random forests is the OOB error estimate.

    • What It Is: Each tree is trained on a bootstrap sample, leaving out a set of data that can serve as a mini-test set.
    • Why It Counts: Aggregating predictions on these out-of-bag samples can offer an estimate of the model’s test error.

    This feature can be really handy, especially when you’re working with limited data and want to avoid setting aside a large chunk of it for validation.
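
    In scikit-learn, the OOB estimate is switched on with oob_score=True. Here is a minimal sketch on synthetic data (an assumption); for a regressor, the resulting oob_score_ is an R² computed from out-of-bag predictions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (assumption for this sketch).
X, y = make_regression(n_samples=300, n_features=6, noise=15.0, random_state=1)

# oob_score=True requires bootstrap=True, which is the default.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=1)
forest.fit(X, y)

# Each row is predicted only by the trees that did not see it during training.
oob_r2 = forest.oob_score_
print(f"OOB R2: {oob_r2:.3f}")
```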

    Feature Importance

    Random forests don’t just predict, they can also help you understand your data:

    • Mean Decrease in Impurity (MDI): This measure tallies how much each feature decreases impurity (based on measures like the Gini index for classification, or variance for regression) across all trees.
    • Permutation Importance: Shuffling the values of one feature and measuring the resulting drop in accuracy shows how much the model relies on that feature. This helps when you need to interpret the model and communicate which features are most influential.
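
    Both measures are available in scikit-learn. The following is a minimal sketch on synthetic data (an assumption) where only two of the five features carry signal, so both rankings should favor those two.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 5 features, only 2 of them informative (assumption).
X, y = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

# MDI: impurity-based importances, normalized to sum to 1.
mdi = forest.feature_importances_

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for i, (m, p) in enumerate(zip(mdi, perm.importances_mean)):
    print(f"feature {i}: MDI={m:.3f}  permutation={p:.3f}")
```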

    Pros and Cons

    Advantages:

    • Can handle Non-Linear Data: Naturally captures complex feature interactions.
    • Can handle Noise & Outliers: Ensemble averaging minimizes overfitting.
    • Doesn’t need a lot of Preprocessing: No need for extensive data scaling or transformation.

    Disadvantages:

    • Can be Memory Intensive: Storing hundreds of trees can be demanding.
    • Slower than a single Tree: Compared to a single decision tree, the ensemble approach requires more processing power.
    • Harder to Interpret: The combination of multiple trees makes the model harder to interpret than an individual tree.

    Summary

    Random Forests are a powerful next step in my journey. With their ability to reduce variance through ensemble learning and their built-in validation mechanisms like OOB error, they offer both performance and insight.

    In my next post, I’ll share how I apply the Random Forest technique to this data set: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams/data

    – William

  • Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

    Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

    In today’s data-driven world, even seemingly straightforward questions can reveal surprising insights. In this post, I investigate whether students’ alcohol consumption habits bear any relationship to their final math grades. Using the Student Alcohol Consumption dataset from Kaggle, which contains survey responses on a myriad aspects of students’ lives—ranging from study habits and social factors to gender and alcohol use—I set out to determine if patterns exist that can predict academic performance.

    Dataset Overview

    The dataset originates from a survey of students enrolled in secondary school math and Portuguese courses. It includes rich social and academic information, such as:

    • Social and family background
    • Study habits and academic support
    • Alcohol consumption details during weekdays and weekends

    I focused on predicting the final math grade (denoted as G3 in the raw data) while probing how alcohol-related features, especially weekend consumption, might play a role in performance. The question wasn’t just whether students drank, but which drinking pattern might be more telling of their academic results.

    Data Preprocessing: Laying the Groundwork

    Before diving into modeling, the data needed some cleanup. Here’s how I systematically prepared the dataset for analysis:

    1. Loading the Data: I imported the CSV into a Pandas DataFrame for easy manipulation.
    2. Renaming Columns: Clarity matters. I renamed ambiguous columns for better readability (e.g., renaming walc to weekend_alcohol and dalc to weekday_alcohol).
    3. Label Encoding: Categorical data were converted to numeric representations using scikit-learn’s LabelEncoder, ensuring all features could be numerically processed.
    4. Reusable Code: I encapsulated the training and testing phases within a reusable function, which made it straightforward to test different feature combinations.

    Here are some snippets:

    In those cells:

    • I rename columns to make them more readable.
    • I instantiate a LabelEncoder object and encode a list of columns that have string values.
    • I add an absence category to normalize absence count a little due to how variable that data is.
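
    The steps above can be sketched roughly as follows. This is not my notebook code: the tiny DataFrame, the choice of columns to encode, and the absence bin edges are all made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Tiny made-up frame; columns and values are assumptions for illustration.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "higher": ["yes", "yes", "no", "yes"],
    "walc": [1, 3, 2, 5],
    "dalc": [1, 1, 2, 4],
    "absences": [0, 4, 12, 30],
    "G3": [15, 11, 9, 6],
})

# Rename terse columns to more readable names.
df = df.rename(columns={"walc": "weekend_alcohol", "dalc": "weekday_alcohol"})

# Encode string-valued columns as integers.
for col in ["sex", "higher"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Bucket the highly variable absence counts (bin edges are hypothetical).
df["absence_category"] = pd.cut(
    df["absences"], bins=[-1, 0, 5, 15, 100], labels=[0, 1, 2, 3]
).astype(int)

print(df)
```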

    Experimenting With Gaussian Naive Bayes

    The heart of this exploration was to see how well a Gaussian Naive Bayes classifier could predict the final math grade based on different selections of features. Naive Bayes, while greatly valued for its simplicity and speed, operates under the assumption that features are independent—a condition that might not fully hold in educational data.

    Training and Evaluation Function

    To streamline the experiments, I wrote a function that:

    • Splits the data into training and testing sets.
    • Trains a GaussianNB model.
    • Evaluates accuracy on the test set.

    In that cell:

    • I create a function that:
      • Drops unwanted columns.
      • Runs 100 training cycles with the given data.
      • Captures the accuracy measured from each run and returns the average.
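
    A function along those lines might look like the sketch below. The name average_accuracy, its parameters, the 20% test split, and the random demo data are assumptions; only the overall shape (drop columns, repeat training, average the accuracy) comes from the description above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def average_accuracy(df, target, drop_columns=(), n_runs=100, test_size=0.2):
    """Train GaussianNB n_runs times on fresh splits; return mean test accuracy."""
    X = df.drop(columns=[target, *drop_columns])
    y = df[target]
    scores = []
    for seed in range(n_runs):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model = GaussianNB().fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return float(np.mean(scores))


# Made-up numeric data standing in for the encoded survey columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.integers(0, 5, size=(200, 4)), columns=["a", "b", "c", "d"])
demo["grade"] = rng.integers(0, 4, size=200)

acc = average_accuracy(demo, target="grade", drop_columns=["d"], n_runs=20)
print(f"mean accuracy over 20 runs: {acc:.3f}")
```

    Averaging over many random splits smooths out the luck of any single train/test partition, which matters when accuracies are as close together as they are here.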

    Single- and Two-Column Sampling

    In those cells:

    • I get a list of all columns.
    • I create loop(s) over the column list and create a list of features to test.
    • I call my function to measure the accuracy of the features at predicting student grades.
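
    The looping pattern can be sketched with itertools.combinations. In this illustration the data is random, the column names are placeholders, and cross_val_score stands in for my own evaluation function, so the ranking it prints is meaningless beyond showing the mechanics.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Made-up encoded data; column names are placeholders, not the real dataset.
rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.integers(0, 5, size=(200, 4)),
    columns=["weekend_alcohol", "weekday_alcohol", "studytime", "higher"],
)
df["G3"] = rng.integers(0, 4, size=200)

feature_columns = [c for c in df.columns if c != "G3"]

results = []
for size in (1, 2):
    for features in combinations(feature_columns, size):
        # cross_val_score stands in for the repeated train/test runs.
        score = cross_val_score(
            GaussianNB(), df[list(features)], df["G3"], cv=5
        ).mean()
        results.append((score, features))

# Rank feature sets from most to least accurate.
results.sort(reverse=True)
for score, features in results[:5]:
    print(f"{score:.3f}  {features}")
```

    The number of combinations grows quickly with the subset size, which is why I stopped at four features per combination.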

    Diving Into Feature Combinations

    I aimed to assess the predictive power by testing different combinations of features:

    1. All Columns: This gave the best accuracy of around 22%, yet it was clear that even the full spectrum of information struggled to make strong predictions.
    2. Handpicked Features: I manually selected features that I hypothesized might be influential. The resulting accuracy dipped below that of the full dataset.
    3. Individual Features: Evaluating each feature solo revealed that the column indicating whether students planned to pursue higher education yielded the highest individual accuracy—though still far lower than all features combined.
    4. Two-Feature Combinations: By testing all pairs, I noticed that combinations including weekend alcohol consumption appeared in the top 20 predictive pairs four times, including in both of the top two.
    5. Three-Feature Combinations: The trend became stronger: combinations featuring weekend alcohol consumption appeared in the top 20 ten times and were present in each of the top three combinations!
    6. Four-Feature Combinations: Here, weekend alcohol consumption featured in the top 20 combination results even more robustly—15 times in total.

    These experiments showcased one noteworthy pattern: weekend alcohol consumption consistently emerged as a common denominator in the best-performing feature combinations, while weekday consumption rarely made an appearance.

    Analysis of the Findings

    Several key observations emerged from this series of experiments:

    • Predictive Accuracy: Even with the full set of features, the best accuracy reached was only around 22%. This underwhelming performance is indicative of the challenges posed by the dataset and the restrictive assumptions embedded within the Naive Bayes model.
    • Role of Alcohol Consumption: The repeated appearance of weekend alcohol consumption in high-ranking feature combinations suggests a potential association—it may capture lifestyle or social habits that indirectly correlate with academic performance. However, it is not a standalone predictor; rather, it seems to be relevant as part of a multifactorial interaction.
    • Model Limitations: The Gaussian Naive Bayes classifier assumes feature independence. The complexities inherent in student performance—where multiple social, educational, and psychological factors interact—likely violate this assumption, leading to lower predictive performance.

    Conclusion and Future Directions

    While the Gaussian Naive Bayes classifier provided some interesting insights, especially regarding the recurring presence of weekend alcohol consumption in influential feature combinations, its overall accuracy was modest. Predicting the final math grade, a multifaceted outcome influenced by numerous interdependent factors, appears too challenging for this simplistic probabilistic model.

    Next Steps:

    • Alternative Machine Learning Algorithms: Investigating other approaches like decision trees, random forests, support vector machines, or ensemble methods may yield better performance.
    • Enhanced Feature Engineering: Incorporating interaction terms or domain-specific features might help capture the complex relationships between social habits and academic outcomes.
    • Broader Data Explorations: Diving deeper into other factors—such as study habits, parental support, and extracurricular involvement—could provide additional clarity.

    Final Thoughts and Next Steps

    This journey reinforced the idea that while Naive Bayes is a great tool for its speed and interpretability, it might not be the best choice for all datasets. More sophisticated models and careful feature engineering are necessary when dealing with complex datasets like this one on student academic performance.

    The new Jupyter notebook can be found here in my GitHub.

    – William