Category: Beginning

  • Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV?

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.
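
    To put the cost difference in perspective, here is a quick back-of-the-envelope comparison based on the parameter ranges broken down in the next section. The fold count and sampling budget below are illustrative assumptions, not the settings from my actual run.

    ```python
    # Rough cost comparison: GridSearchCV fits every combination,
    # while RandomizedSearchCV fits only a fixed number of sampled ones.
    from math import prod

    grid_sizes = [3, 4, 4, 3, 2, 2, 2]  # candidates per parameter (see the table below)
    n_folds = 5                         # assumed number of cross-validation folds
    n_iter = 50                         # assumed RandomizedSearchCV sampling budget

    print("GridSearchCV model fits:      ", prod(grid_sizes) * n_folds)  # 5760
    print("RandomizedSearchCV model fits:", n_iter * n_folds)            # 250
    ```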

    Defining the Parameter Distributions

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    | Parameter | Values | Description |
    | --- | --- | --- |
    | n_estimators | [100, 200, 300] | Number of trees in the forest. More trees can improve performance but increase training time. |
    | min_samples_split | [2, 5, 10, 20] | Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting. |
    | min_samples_leaf | [1, 2, 4, 10] | Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance. |
    | max_features | ["sqrt", "log2", 1.0] | Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features. |
    | bootstrap | [True, False] | Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree. |
    | criterion | ["squared_error", "absolute_error"] | Function to measure the quality of a split. "squared_error" (the default) is sensitive to outliers; "absolute_error" is more robust. |
    | ccp_alpha | [0.0, 0.01] | Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model. |
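
    For concreteness, here is a minimal, self-contained sketch of what this setup might look like in code. The parameter lists come from the table above; the synthetic dataset and the n_iter, cv, and scoring choices are illustrative assumptions rather than the exact settings from my run.

    ```python
    # Minimal sketch: RandomizedSearchCV over a RandomForestRegressor.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    # Synthetic stand-in for a real dataset (an assumption for this sketch).
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    param_distributions = {
        "n_estimators": [100, 200, 300],
        "min_samples_split": [2, 5, 10, 20],
        "min_samples_leaf": [1, 2, 4, 10],
        "max_features": ["sqrt", "log2", 1.0],
        "bootstrap": [True, False],
        "criterion": ["squared_error", "absolute_error"],
        "ccp_alpha": [0.0, 0.01],
    }

    search = RandomizedSearchCV(
        estimator=RandomForestRegressor(random_state=42),
        param_distributions=param_distributions,
        n_iter=50,                        # number of sampled combinations
        cv=5,                             # cross-validation folds
        scoring="neg_mean_squared_error",
        random_state=42,
        n_jobs=-1,
    )
    search.fit(X_train, y_train)
    print("Best parameters:", search.best_params_)
    ```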

    Results

    Here is a table comparing the results from my previous post, where I experimented with GridSearchCV, against what I achieved using RandomizedSearchCV.

    | Metric | GridSearchCV | RandomizedSearchCV | Improvement |
    | --- | --- | --- | --- |
    | Mean Squared Error (MSE) | 173.39 | 161.12 | ↓ 7.1% |
    | Root Mean Squared Error (RMSE) | 13.17 | 12.69 | ↓ 3.6% |
    | R² Score | 0.2716 | 0.3231 | ↑ 18.9% |
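
    For reference, this is how the metrics in the table are typically computed with scikit-learn. The snippet continues the sketch above; since my original dataset isn't shown here, it won't reproduce these exact numbers.

    ```python
    # Evaluate the tuned model on the held-out test split.
    import numpy as np
    from sklearn.metrics import mean_squared_error, r2_score

    y_pred = search.best_estimator_.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print(f"MSE: {mse:.2f}  RMSE: {rmse:.2f}  R²: {r2:.4f}")
    ```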

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with far greater computational efficiency. When I ran GridSearchCV over a parameter space of similar size, it took a long time to finish; by comparison, RandomizedSearchCV returned almost instantly. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It's a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • First Post!

    My Journey into the Fascinating World of Data Science

    Inspiration Behind Starting This Blog

    Hello everyone! I’m William, a junior in high school who’s passionate about data science. Ever since I discovered data science, I’ve been fascinated by its potential to solve a myriad of problems. It’s amazing how data science can be applied in so many ways, from improving business strategies to enhancing healthcare. What truly drives me is the possibility of making a difference, starting with education. I have always enjoyed helping my friends with their schoolwork, and I believe that data science can provide powerful insights to improve educational outcomes. Hence, this blog is my way of documenting my journey and sharing my learnings with you.

    Goals of My Data Science Journey

    My primary goal is to learn all about data science—the diverse applications, methodologies, and algorithms that power this field. I want to gain a comprehensive understanding and apply what I learn to the realm of education. By leveraging data science, I aim to uncover insights that can contribute to making education more effective and accessible.

    What Drew Me to Data Science

    My interest in data science was sparked by the movie ‘Moneyball’. I watched it on an airplane, and it opened my eyes to the power of data analytics in sports. This led me to explore the world of data science further, and I discovered its applications stretch far beyond sports. From education to medicine, the possibilities are endless, and I couldn’t wait to dive in.

    Initial Steps

    Starting this journey requires some essential tools and a plan. From my research, I found that a great starting point is the Naive Bayes classifier. It’s a simple yet powerful algorithm that’s often recommended for beginners. Here’s my plan for my first set of blog posts:

    1. Tools and Services: I’ll share the tools and services I’ve learned are essential for data science, from coding environments to data visualization tools.
    2. Setup Steps: I’ll walk you through the steps I used to set up each of these tools, making it easy for you to follow along.
    3. First Algorithm to Learn: I’ll begin with the Naive Bayes classifier, a powerful and simple algorithm that’s great for classification tasks. I’ll provide a writeup of my understanding of the Naive Bayes classifier, breaking down the theory behind it.
    4. Use an Example from Wikipedia: I’ll follow an example I found on Wikipedia to implement the Naive Bayes classifier. That way I can be sure my code works as expected.
    5. Research Available Datasets: Next, I'll research some education datasets, such as those on Kaggle.com, and pick one to continue my learning journey by showcasing a real-world application of the algorithm.
    6. Continue my Journey: Then I’ll decide the next algorithm to explore!

    Through this blog, I hope to share my learning experiences and provide valuable insights along the way. Whether you're a fellow student or simply curious about the field, join me as I explore the endless possibilities and applications of data science!

    Thank you for joining me on this adventure. Stay tuned as I delve deeper into the world of data science and share my experiences, discoveries, and insights with you.

    – William