Tag: Student Data Analysis

  • Short Post

    Short Post

    Just a quick ‘nothing’ post this time. I’ve been very busy with college applications and the holidays. I’ll post again soon once application season settles down a bit. I stumbled across an interesting article on genetic algorithms. It sounds so interesting that I think my next post may be on that topic!

    – William

  • First Experiment with SHAP Visualizations

    First Experiment with SHAP Visualizations

    In my previous post, I touched on model explainability. One approach for feature attribution is called SHAP, SHapley Additive exPlanations. In this post I will cover my first experiment with SHAP, building on one of my previous notebooks. My GitHub repo containing all of my Jupyter notebooks can be found here: GitHub – wcaubrey/learning-naive-bayes.

    What Is SHAP?

    SHAP (SHapley Additive exPlanations) is a powerful technique for interpreting machine learning models by assigning each feature a contribution value toward a specific prediction. It’s grounded in Shapley values from cooperative game theory, which ensures that the explanations are fair, consistent, and additive.

    What SHAP Does

    • It calculates how much each feature “adds” or “subtracts” from the model’s baseline prediction.
    • It works both locally (for individual predictions) and globally (across the dataset).
    • It produces visualizations like force plots, summary plots, and dependence plots.

    What SHAP Is Good For

    • Trust-building: Stakeholders can see why a model made a decision.
    • Debugging: Helps identify spurious correlations or data leakage.
    • Fairness auditing: Reveals if certain features disproportionately affect predictions for specific groups.
    • Feature attribution: Quantifies the impact of each input on the output.

    Ideal Use Cases

    • Tree-based models (e.g., XGBoost, LightGBM, Random Forest)
    • High-stakes domains like healthcare, education, finance, and policy
    • Any scenario where transparency and accountability are critical

    My notebook changes

    In this new cell, I use the results of the previous grid search to create a SHAP TreeExplainer from the shap package. With it I create three different types of plots: a summary beeswarm plot, a dependence plot, and a force plot.

    SHAP Visualizations

    Interpreting the summary beeswarm plot

    The x-axis shows the SHAP values. Positive values push the prediction higher, towards the positive class or higher score. Negative values push the prediction lower.

    The y-axis shows the features, ranked by overall importance. The most important features are at the top. The spread of SHAP values shows how much influence that feature can have. The wider the spread of dots along the x-axis, the more variability that feature contributes to predictions. Narrow spreads mean the feature has a consistent, smaller effect.

    Each dot represents a single observation for the feature. The color of the dots shows the feature value. Red for high values and blue for low.

    If high feature values (red dots) cluster on the right (positive SHAP values), then higher values of that feature increase the prediction. If high values cluster on the left, then higher values decrease the prediction. Blue dots (low feature values) show the opposite effect.

    Overlapping colors can suggest interactions. For example, if both high and low values of a feature appear on both sides, the feature’s effect may depend on other variables.

    Interpreting the force plot

    The base value is the model’s average prediction: the neutral starting point before any features are considered.

    Arrows or bars are the force each feature contributes positively or negatively to the prediction. Each feature either increases or decreases the prediction. The size of the arrow/bar shows the magnitude of its effect.

    • Red (or rightward forces): Push the prediction higher.
    • Blue (or leftward forces): Push the prediction lower.

    The final prediction is the baseline plus all feature contributions. The endpoint shows the model’s actual output for that instance.

    – William


  • Making Sense of the Black Box: A Guide to Model Explainability

    Making Sense of the Black Box: A Guide to Model Explainability

    In an age of AI-driven decisions, whether predicting student risk, approving loans, or diagnosing disease, understanding why a model makes a prediction is just as important as the prediction itself. This is exactly the purpose of model explainability.

    What Is Model Explainability?

    Model explainability refers to techniques that help us understand and interpret the decisions made by machine learning models. While simple models like linear regression are more easily interpretable, more powerful models, like random forests, gradient boosting, or neural networks, are often considered “black boxes”.

    Explainability tools aim to make it possible to understand that “box”, offering insights into how features influence predictions, both globally (across the dataset) and locally (for individual cases).

    Why It Matters: Trust, Transparency, and Actionability

    Explainability isn’t just a technical concern; it matters for data scientists and for society at large. Here’s why:

    Trust: Stakeholders are more likely to act on model outputs when they understand the reasoning behind them. A principal won’t intervene based on a risk score alone but will if they see that the score is driven by declining attendance and recent disciplinary actions.

    Accountability: Explainability supports ethical AI by surfacing potential biases and enabling audits. It helps answer: “Is this model fair across different student groups?”

    Debugging: Helps data scientists identify spurious correlations, data leakage, or overfitting.

    Compliance: Increasingly required by regulations like GDPR (right to explanation), FERPA (student data protections), and the EU AI Act.

    Key Explainability Techniques

    Let’s explore and compare the most widely used methods:

    | Method | Type | Strengths | Limitations | Best For |
    |---|---|---|---|---|
    | SHAP (SHapley Additive exPlanations) | Local + Global | Theoretically grounded, consistent, visual. | Computationally expensive for large models. | Tree-based models (e.g., XGBoost, RF). |
    | LIME (Local Interpretable Model-agnostic Explanations) | Local | Model-agnostic, intuitive. | Sensitive to perturbations, unstable explanations. | Any black-box model. |
    | PDP (Partial Dependence Plot) | Global | Shows marginal effect of features. | Assumes feature independence. | Interpreting average trends. |
    | ICE (Individual Conditional Expectation) | Local | Personalized insights. | Harder to interpret at scale. | Individual predictions. |
    | Permutation Importance | Global | Simple, model-agnostic. | Can be misleading with correlated features. | Quick feature ranking. |
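    As a feel for the simplest of these methods, here is a minimal permutation-importance sketch using scikit-learn’s built-in helper: shuffle each feature on held-out data and measure how much the model’s score drops.

```python
# Permutation importance: a quick, model-agnostic feature ranking.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
# Larger mean importance means a bigger score drop when that feature is shuffled.
top = result.importances_mean.argsort()[::-1][:5]
print(top, result.importances_mean[top])
```

    Note the caveat from the table: with strongly correlated features, shuffling one feature lets the model lean on its correlates, so importances can be misleadingly low.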

    SHAP vs. LIME: A Deeper Dive

    Both SHAP and LIME aim to answer the same question: “Why did the model make this prediction?” But they approach it from different angles, with distinct strengths, limitations, and implications for trust and usability.

    Theoretical Foundations

    | Aspect | SHAP | LIME |
    |---|---|---|
    | Core Idea | Based on Shapley values from cooperative game theory. | Builds a local surrogate model using disturbed samples. |
    | Mathematical Guarantee | Additive feature attributions that sum to the model output. | No guarantee of consistency or additivity. |
    | Model Assumptions | Assumes access to the model’s internal structure. | Treats the model as a black box. |
    • SHAP treats each feature as a “player” in a game contributing to the final prediction. It calculates the average contribution of each feature across all possible feature combinations.
    • LIME perturbs (disturbs) the input data around a specific instance and fits a simple interpretable model (usually linear) to approximate the local decision boundary.

    Output and Visualization

    | Feature | SHAP | LIME |
    |---|---|---|
    | Local Explanation | Force plots show how each feature pushes the prediction. | Bar charts show feature weights in the surrogate model. |
    | Global Explanation | Summary plots aggregate SHAP values across the dataset. | Not designed for global insights. |
    | Visual Intuition | Highly visual and intuitive. | Simpler but less expressive visuals. |
    • SHAP’s force plots and summary plots are really great for stakeholder presentations. They show not just which features mattered, but how they interacted.
    • LIME’s bar charts are easier to generate and interpret quickly, but they can vary significantly depending on how the data was disturbed.

    Practical Considerations

    | Factor | SHAP | LIME |
    |---|---|---|
    | Speed | Slower, especially for large models. | Faster, lightweight. |
    | Stability | High; same input yields same explanation. | Low; results can vary across runs. |
    | Model Support | Optimized for tree-based models. | Works with any model (including neural nets and ensembles). |
    | Implementation | Requires more setup and compute. | Easier to plug into existing workflows. |
    • SHAP is ideal for production-grade models where consistency and auditability matter.
    • LIME is great for quick prototyping, debugging, or when working with opaque models like deep neural networks.

    A Real-World Example: Explaining Student Risk Scores

    My nonprofit’s goal is to build a model to identify students at risk of socio-emotional disengagement. The model uses features like attendance, GPA trends, disciplinary records, and survey responses.

    Let’s say the model flags a student as “high risk”. Without explainability, this is a black-box label. But with SHAP, we can generate a force plot that shows:

    • Attendance rate: -0.25 (low attendance strongly contributes to risk)
    • GPA change over time: -0.15 (declining grades add to concern)
    • Recent disciplinary action: +0.30 (a major driver of the risk score)
    • Survey response: “I feel disconnected from school”: +0.20 (adds emotional context)

    This breakdown transforms a numeric score into a narrative. It allows educators to:

    • Validate the prediction: “Yes, this aligns with what we’ve seen.”
    • Take targeted action: “Let’s prioritize counseling and academic support.”
    • Communicate transparently: “Here’s why we’re reaching out to this student.”
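    Taking the signed values above as given, the final risk score is just arithmetic on top of the base value. The base value here (0.45) is hypothetical, chosen only to make the example concrete:

```python
# Worked arithmetic for the force-plot example above.
# The base value is an assumed (hypothetical) average model output;
# the four contributions are the ones listed in the post.
base_value = 0.45
contributions = {
    "attendance_rate": -0.25,
    "gpa_trend": -0.15,
    "recent_discipline": +0.30,
    "survey_disconnected": +0.20,
}
final_score = base_value + sum(contributions.values())
print(round(final_score, 2))  # 0.55: contributions net +0.10 above the base
```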

    Summary

    Model explainability isn’t just a technical add-on, it’s an ethical and operational imperative. As we build systems that influence real lives, we must ensure they are not only accurate but also understandable, fair, and trustworthy.

    – William

    References

    Technical Foundations of SHAP and LIME

  • Data Science in the World Pt. 5: Data Science in Higher Education

    Data Science in the World Pt. 5: Data Science in Higher Education

    This is the fifth post in my “Data Science in the World” series.

    How Data Science is Transforming Higher Education

    When most people think of college or university, they picture lecture halls, libraries, and late-night study sessions. But behind the scenes, a quiet revolution is underway, one powered by data science. Just as data has reshaped industries like healthcare, finance, and transportation, it is now transforming higher education. From improving student success to guiding institutional decisions, data science is becoming a cornerstone of how colleges and universities operate.

    This might sound abstract, but the reality is simple: data science is helping students learn more effectively, helping educators teach more efficiently, and helping institutions make smarter choices. Let’s explore three key areas where data science is making the biggest impact in higher education: student success and retention, personalized learning, and institutional decision-making.

    Student Success and Retention

    One of the most pressing challenges in higher education is ensuring that students not only enroll but also graduate. Dropout rates remain a concern, and every student who leaves represents both a personal setback and a loss for the institution. Data science is helping to address this challenge by identifying at-risk students early and providing targeted support.

    Colleges collect a wide range of data about students, like grades, attendance, course engagement, financial aid status, and even participation in extracurricular activities. By analyzing these data points, machine learning models can detect patterns that signal when a student might be struggling.

    For example, a sudden drop in class attendance combined with declining grades might indicate that a student is at risk of dropping out. Predictive analytics can flag this student, allowing advisors or faculty to intervene before it’s too late.

    • Georgia State University has become a leader in using predictive analytics to improve student success. By tracking over 800 risk factors, the university has significantly increased graduation rates, particularly among first-generation and low-income students.
    • Community colleges are also adopting similar systems, using data to provide proactive advising and support services tailored to individual student needs.

    For students, this means more personalized support and a greater chance of completing their degree. For institutions, it means improved retention rates, which not only enhance reputation but also ensure financial stability. For society, it means more graduates entering the workforce with the skills needed to succeed.

    Personalized Learning

    Every student learns differently. Some thrive in large lectures, while others need more hands-on support. Traditional education models often struggle to accommodate these differences. Data science is changing that by enabling personalized learning experiences tailored to each student’s strengths, weaknesses, and preferences.

    Learning management systems (LMS) and online platforms collect detailed data on how students interact with course materials: how long they spend on readings, which quiz questions they miss, and how often they participate in discussions. Data science tools analyze this information to create individualized learning pathways.

    For instance, if a student consistently struggles with a particular math concept, the system can recommend additional practice problems, videos, or tutoring resources. Conversely, if a student masters material quickly, the system can accelerate their progress to keep them engaged.

    • Adaptive learning platforms like ALEKS (for math) or Smart Sparrow (for science) use data-driven algorithms to adjust content in real time, ensuring that students receive the right level of challenge.
    • Massive Open Online Courses (MOOCs) such as Coursera and edX leverage data science to recommend courses and resources based on a learner’s past activity and performance.

    Personalized learning helps students stay motivated and engaged, reducing frustration and boredom. It also allows instructors to focus their attention where it’s needed most, rather than applying a one-size-fits-all approach. Over time, this could lead to more equitable outcomes, as students from diverse backgrounds receive the support they need to succeed.

    Institutional Decision-Making

    Running a college or university is a complex endeavor. Administrators must make decisions about everything from course offerings to campus facilities to budget allocations. Traditionally, these decisions were based on historical trends, intuition, or limited data. Today, data science is providing a more rigorous foundation for institutional decision-making.

    Universities generate enormous amounts of operational data: enrollment numbers, course demand, faculty workloads, financial aid distribution, and more. By applying data science techniques, administrators can uncover insights that guide strategic planning.

    • Course scheduling: Predictive models can forecast which classes will be in high demand, ensuring that enough sections are offered to meet student needs.
    • Resource allocation: Data can reveal which programs are growing and which are declining, helping institutions allocate funding more effectively.
    • Facilities management: Sensors and data analytics can optimize energy use, reduce costs, and create more sustainable campuses.

    Real-World Examples

    • Arizona State University uses data analytics to optimize course scheduling and advising, ensuring that students can access the classes they need to graduate on time.
    • The University of Michigan has applied data science to improve energy efficiency across campus, saving millions of dollars while reducing environmental impact.

    Smarter decision-making benefits everyone. Students get the classes and resources they need, faculty workloads are managed more effectively, and institutions operate more efficiently. In an era of rising tuition costs and financial pressures, data-driven management helps ensure that higher education remains sustainable and accessible.

    Spotlight: The Early Signal Project

    Another example of how data science can support student success is the Early Signal Project, a nonprofit initiative I founded to help educators detect socio-emotional risks in students before they escalate. By combining privacy-compliant surveys with carefully designed data pipelines, the project gives schools actionable insights while protecting student trust. Instead of waiting until problems become visible in grades or attendance, educators receive early, anonymized signals that a student may need support. This proactive approach mirrors the broader promise of data science in higher education: using information ethically and transparently to empower teachers, improve outcomes, and ensure that no student falls through the cracks.

    Conclusion

    Data science is no longer confined to tech companies or research labs. It’s becoming a central part of how higher education functions. By improving student success and retention, enabling personalized learning, and guiding institutional decision-making, data science is helping colleges and universities adapt to the challenges of the 21st century.

    Privacy concerns must be carefully managed, and institutions must ensure that data-driven decisions are fair and transparent. But the potential benefits are enormous. As data science continues to evolve, it promises to make higher education not only more efficient but also more inclusive, personalized, and effective.

    In the end, higher education has always been about unlocking human potential. With the help of data science, that mission is being reimagined for a new era—one where every student has the opportunity to succeed, every instructor has the tools to teach effectively, and every institution has the insights to thrive.



  • Data Science in the World Pt. 4: Data Science in the Auto Industry

    Data Science in the World Pt. 4: Data Science in the Auto Industry

    This is the fourth post in my “Data Science in the World” series.

    Driving the Future: How Data Science is Transforming the Auto Industry

    Cars have always been about more than just getting from point A to point B. They represent freedom, innovation, and progress. Today, they also represent something else: data. Modern vehicles are really computers on wheels, generating and processing vast amounts of information every second. From safety systems to navigation to entertainment, data science is quietly reshaping the way we design, build, and drive cars.

    Data science is already influencing your daily commute, the safety of your family, and even the future of how we think about car ownership. Let’s explore three key areas where data science is making the biggest impact: autonomous driving, predictive maintenance, and connected car services.

    Autonomous Driving: Teaching Cars to Think

    Perhaps the most exciting—and widely discussed—application of data science in the auto industry is autonomous driving. Self-driving cars rely on a combination of sensors, cameras, radar, and lidar to perceive their surroundings. But perception alone isn’t enough. The real magic happens when data science steps in to interpret all that information and make decisions in real time.

    • Data collection: A single autonomous vehicle can generate terabytes of data every day. This includes images from cameras, distance measurements from lidar, and speed or position data from GPS.
    • Machine learning models: These massive datasets are used to train algorithms that can recognize pedestrians, traffic lights, road signs, and other vehicles.
    • Decision-making: Once trained, the system can predict what’s likely to happen next—like whether a pedestrian will step into the crosswalk—and decide how the car should respond.

    Companies like Tesla, Waymo, and traditional automakers are investing billions into this technology. While fully autonomous cars aren’t yet mainstream, features like adaptive cruise control, lane-keeping assistance, and automatic emergency braking are already powered by data science. These are steppingstones toward a future where cars can safely drive themselves.

    Autonomous driving has the potential to reduce accidents caused by human error, which accounts for the vast majority of crashes. It could also make transportation more accessible for people who can’t drive, such as the elderly or disabled. For the average driver, it promises a future where commuting time could be spent reading, working, or simply relaxing.

    Predictive Maintenance: Fixing Problems Before They Happen

    Traditionally, maintenance has been reactive: you wait until something goes wrong, then fix it. Data science is changing that by enabling predictive maintenance, where problems are identified and addressed before they cause a breakdown.

    • Sensors everywhere: Modern cars are equipped with hundreds of sensors monitoring everything from engine temperature to tire pressure.
    • Data analysis: These sensors feed data into machine learning models that can detect patterns indicating wear and tear.
    • Predictive alerts: Instead of a vague “check engine” light, predictive systems can tell you exactly which component is likely to fail and when.

    Fleet operators, like delivery companies or ride-sharing services, are already using predictive maintenance to keep vehicles on the road longer and reduce downtime. For everyday drivers, some automakers now offer apps that notify you when your car needs attention, sometimes even scheduling service appointments automatically.

    Predictive maintenance saves money by preventing costly repairs and extends the life of vehicles. More importantly, it improves safety by reducing the risk of sudden failures on the road. For consumers, it means fewer surprises and more confidence in their cars.

    Connected Car Services: Turning Vehicles into Smart Devices

    Think of your car as a smartphone on wheels. Just as your phone connects you to apps, maps, and services, connected cars use data science to provide a seamless, personalized driving experience.

    • Telematics: Cars transmit data about location, speed, and performance to cloud platforms.
    • Personalization: Data science algorithms analyze your driving habits to suggest routes, adjust climate control, or recommend nearby services.
    • Integration: Connected cars can communicate with other vehicles and infrastructure, creating smarter traffic systems.

    Connected car services make driving more convenient, efficient, and enjoyable. On a larger scale, they also contribute to smarter cities by reducing congestion and emissions. For consumers, it’s about having a car that feels less like a machine and more like a personalized companion.

    Conclusion

    Data science is no longer a behind-the-scenes tool in the auto industry—it’s the driving force behind its most exciting innovations. Autonomous driving is teaching cars to think, predictive maintenance is making them more reliable, and connected services are turning them into smart devices that fit seamlessly into our digital lives.

    For the general public, the takeaway is simple: your car isn’t just powered by gasoline or electricity, it’s powered by data. And as data science continues to evolve, the way we drive, maintain, and experience cars will keep transforming in ways that once belonged only to science fiction.


    References

    • Intel. (2016). Data-Driven Intelligence: The Future of Autonomous Vehicles. Intel Corporation Whitepaper.
    • Waymo LLC. (2020). The Waymo Open Dataset: Autonomous Vehicle Perception Benchmark. Waymo Research Publications.
    • MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). (2019). Planning and Decision-Making in Autonomous Driving. MIT Technical Report.
    • General Motors (GM). (2020). OnStar and Connected Services: Telematics Platform Overview. GM Global Technology.
    • Tesla, Inc. (2021). Vehicle Diagnostics, Over-the-Air Updates, and Predictive Service. Tesla Engineering Documentation.
    • University of Michigan Transportation Research Institute (UMTRI). (2017). Ann Arbor Connected Vehicle Test Environment: Phase 2 Results. UMTRI Technical Report.
  • Data Science in the World Pt. 3: Data Science in Finance

    Data Science in the World Pt. 3: Data Science in Finance

    This is the third post in my “Data Science in the World” series.

    How Data Science is Transforming Finance: Fraud, Credit, and Investments

    When most people think about finance, they picture swiping a card, checking a bank balance, or maybe watching the stock market ticker scroll across a screen. What’s less visible is how much data science is working behind the scenes to make those everyday interactions safer, smarter, and more personalized.

    Financial institutions have always been data-driven, but the explosion of computing power and machine learning has changed the game. Today, banks, lenders, and investment firms can analyze billions of data points in real time, uncovering patterns that humans alone could never detect. The result? Faster fraud detection, fairer lending decisions, and more accessible investment opportunities.

    I will review three of the most important ways data science is reshaping the financial services industry: fraud detection and security, credit scoring and lending, and investment and wealth management.

    Fraud Detection and Security

    Fraud has always been a cat-and-mouse game. As soon as banks develop new defenses, fraudsters find new ways to exploit vulnerabilities. Data science has tilted the balance in favor of the defenders by enabling real-time, adaptive fraud detection.

    Every financial transaction, whether it’s a credit card swipe, an online transfer, or a mobile payment, creates a data trail. Machine learning models are trained on millions of these transactions to learn what “normal” behavior looks like. They consider dozens of factors simultaneously:

    • Location: Is the purchase happening in the same city as the customer’s last transaction?
    • Timing: Does the transaction fit the customer’s usual spending patterns?
    • Device: Is the payment being made from a familiar phone or computer?

    When something looks unusual, like a purchase in another country minutes after a local coffee shop visit, the system can flag it instantly.

    For consumers, this often shows up as a text message or app notification. Behind that alert is a sophisticated model that has already calculated the probability of fraud in milliseconds.

    Banks benefit too. According to industry reports, AI-driven fraud detection has saved billions of dollars annually by reducing false positives (legitimate transactions incorrectly flagged) and catching fraudulent activity earlier.

    Fraud detection powered by data science protects money and trust. In a world where digital payments are the norm, consumers need to feel confident that their financial institutions can keep them safe. Data science makes that possible.

    Credit Scoring and Lending

    For decades, credit decisions were based on a narrow set of factors: payment history, outstanding debt, and length of credit history. While effective to a point, this system left many people without traditional credit histories, locked out of the financial system. Data science is helping to change that.

    Modern credit models can incorporate a much wider range of information:

    • Alternative data: Rent payments, utility bills, and even subscription services can demonstrate reliability.
    • Behavioral data: Spending patterns, savings habits, and cash flow stability provide additional context.
    • Digital footprints: With proper privacy safeguards, online activity can sometimes serve as a proxy for financial responsibility.

    By analyzing these broader datasets, machine learning models can paint a more complete picture of an applicant’s creditworthiness. This benefits consumers and lenders in the following ways:

    • Fairer access: People with limited or no credit history can qualify for loans they might otherwise be denied.
    • Reduced bias: Properly designed models can minimize human subjectivity in lending decisions.
    • Better risk management: Lenders can more accurately predict defaults, reducing losses and keeping interest rates competitive.

    Investment and Wealth Management

    Investing used to be the domain of the wealthy, with personalized advice available only to those who could afford a financial advisor. Data science has democratized investing, making it more accessible, affordable, and tailored to individual needs.

    One example of this democratization is robo-advisors: digital platforms that use algorithms to build and manage investment portfolios. By asking a few simple questions about risk tolerance, time horizon, and goals, the system can recommend a diversified portfolio that automatically rebalances over time. The benefits to everyday investors are:

    • Lower costs: Automated systems reduce the need for expensive human advisors.
    • Accessibility: Minimum investment amounts are often much lower than traditional wealth management services.
    • Customization: Portfolios can be tailored to individual preferences, such as socially responsible investing.

    At the institutional level, hedge funds and asset managers use machine learning to detect subtle patterns in market data. Some even analyze unconventional sources like satellite imagery (e.g., counting cars in retail parking lots to predict sales) or social media sentiment to gain an edge.

    Data science also helps investors understand and manage risk. Predictive models can simulate how a portfolio might perform under different economic scenarios, giving both professionals and individuals a clearer picture of potential outcomes.

    At the same time, regulators are pushing for “explainable AI” in finance, ensuring that investment recommendations are transparent and understandable rather than black-box predictions.

    Conclusion

    Data science is no longer a buzzword in finance. It’s the backbone of how the industry operates. From protecting consumers against fraud, to opening up credit access, to democratizing investing, the impact is profound and personal.

    For the general public, the takeaway is simple: every time you swipe your card, apply for a loan, or check your investment app, data science is working behind the scenes. It’s making your financial life safer, smarter, and more tailored to your needs.

    As technology continues to evolve, expect these systems to become even more sophisticated. The future of finance isn’t just digital—it’s data-driven.



  • Data Science in the World Pt. 2: Data Science in Drug Research

    Data Science in the World Pt. 2: Data Science in Drug Research

    This is the second post in my “Data Science in the World” series.

    A New Era in Medicine

    Modern medicine is undergoing a quiet revolution, powered by data. Hospital visits, lab tests, and genetic sequences add to a growing collection of medical information. Data science is helping scientists, doctors, and researchers use this information to find new treatments faster, test them more safely, and tailor care to each individual patient.

    From Molecules to Medicines: Data-Driven Discovery

    Traditionally, discovering a new drug could take a decade or more, with researchers testing thousands of chemical compounds by hand. Today, machine learning models can predict how molecules will interact with the human body long before they’re ever created in a lab.

    For example, AI-powered algorithms analyze massive databases of molecular structures and biological data to identify promising compounds that might block a virus, reduce inflammation, or kill cancer cells. These models learn from patterns across millions of examples, helping researchers focus only on the most likely winners.

    This approach reduces the cost and time to develop new medicines. In some cases, machine learning has identified viable drug candidates in months rather than years — a leap forward in the fight against fast-moving diseases.
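    One classic building block of this kind of virtual screening is fingerprint similarity: represent each molecule as a set of structural features and rank a compound library by similarity to a known active. A toy sketch, with entirely made-up fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints,
    represented here as sets of 'on' bits."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def screen(known_active, library, top_k=2):
    """Rank library compounds by similarity to a known active molecule."""
    ranked = sorted(library.items(),
                    key=lambda item: tanimoto(known_active, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Made-up fingerprints: each integer stands for a structural feature.
known = {1, 4, 7, 9}
library = {
    "compound_A": {1, 4, 7, 8},
    "compound_B": {2, 3, 5},
    "compound_C": {1, 4, 9, 10},
}
print(screen(known, library))  # ['compound_A', 'compound_C']
```

    Real pipelines use learned models on millions of compounds, but the principle of prioritizing candidates that resemble known winners is the same.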

    Refining and Developing New Treatments

    Once potential compounds are identified, scientists must determine how they behave inside the body. Data science helps here by analyzing complex biological signals, genetic variations, and even past clinical data to predict outcomes.

    For instance, computer simulations (known as in silico trials) can model how a new medicine might interact with different cell types or organs. These models help researchers optimize dosages and anticipate side effects long before human testing begins.

    Pharmaceutical companies also use predictive analytics to identify which drug formulations are most stable and effective, cutting down on costly lab iterations.

    Smarter, Faster Clinical Trials

    Clinical trials are where potential medicines are tested in humans and where most drug candidates fail. Trials are slow, expensive, and difficult to manage. Data science is making them smarter and more efficient.

    Predictive analytics can help identify the right patients for each trial, ensuring diverse participation and faster recruitment. Algorithms analyze medical records and genomic data to match patients who are most likely to benefit, or least likely to experience harm, from an experimental drug.

    Real-time data monitoring during trials also improves safety. Machine learning systems can flag early warning signs or unexpected reactions faster than human observers. These insights can help scientists adjust or even redesign studies on the fly.
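    As a toy illustration of that kind of monitoring, a crude early-warning rule might flag trial sites whose adverse-event counts sit far above the cross-site average. The counts and threshold here are invented for the example:

```python
def flag_adverse_events(site_counts, threshold=1.5):
    """Flag trial sites whose adverse-event counts sit more than `threshold`
    standard deviations above the mean across all sites."""
    counts = list(site_counts.values())
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    std = variance ** 0.5
    if std == 0:
        return []
    return [site for site, c in site_counts.items()
            if (c - mean) / std > threshold]

# Invented counts: site_5 reports far more adverse events than its peers.
sites = {"site_1": 3, "site_2": 4, "site_3": 2, "site_4": 3, "site_5": 19}
print(flag_adverse_events(sites))  # ['site_5']
```

    Production systems use far more sophisticated statistics, but the idea of continuously comparing each site against the whole trial is the core of automated safety monitoring.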

    Personalized Medicine: Tailoring Treatments to Individuals

    Perhaps the most exciting result of this data revolution is personalized medicine, customizing treatment plans for individual patients. Instead of a one-size-fits-all approach, doctors can use data to predict which therapies will work best based on a person’s genes, environment, and medical history.

    For example, genomic data can reveal how patients metabolize certain drugs, allowing physicians to prescribe the safest and most effective options. AI systems can even help oncologists choose targeted cancer therapies by comparing tumor DNA against thousands of previous cases.

    This data-driven personalization not only improves outcomes but also reduces side effects, saving lives and healthcare costs alike.

    Challenges and Ethics: The Human Side of Data

    As powerful as data science is, it also raises critical ethical and technical challenges. Patient privacy must be protected. Medical data is among the most sensitive information that exists. Robust security measures and anonymization techniques are essential.

    Bias is another concern. If algorithms are trained on incomplete or unrepresentative datasets, their predictions could unfairly favor or disadvantage certain groups. Transparency, oversight, and diverse data collection are key to ensuring fairness.

    Finally, collaboration between scientists, clinicians, and data experts is crucial. Data alone doesn't save lives; people using it wisely can.

    Conclusion: The Future of Medicine Is Intelligent

    From discovering new molecules to designing smarter clinical trials and personalizing treatments, data science has become an essential partner in modern medicine. It allows scientists to see patterns invisible to the human eye, accelerate breakthroughs, and deliver safer, more effective therapies.

    As healthcare continues to generate more data than ever, the question is no longer whether data science will shape the future of medicine, but how far it can take us.



  • Data Science in the World Pt. 1: Data Science in Soccer

    Data Science in the World Pt. 1: Data Science in Soccer

    This post will be the first in a series of blog posts, called “Data Science in the World,” where I discuss the implementation of data science in different fields like sports, business, medicine, etc. To begin this series, I will be explaining how data science is used in soccer.

    There are 5 main areas of the soccer world where data science plays a critical role: Tactical & Match Analysis, Player Development & Performance, Recruitment & Scouting, Training & Recovery, and Set-Piece Engineering. I will break down how data science is used in each of these areas.

    Tactical & Match Analysis

    • Expected Goals (xG): Quantifies shot quality based on location, angle, and defensive pressure. xG can be used to gauge a player's finishing ability: if a player consistently generates high xG across a season or career, they should, in theory, eventually convert that into a high number of goals.
    • Heatmaps & Passing Networks: Reveal spatial tendencies, player roles, and team structure. Heatmaps and passing networks can be used by coaches to point out the good and bad their team does in matches, helping them determine what to fix and what to focus on in matches.
    • Opponent Profiling: Teams dissect rivals’ patterns to exploit weaknesses and tailor game plans.
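    To make the xG idea concrete: real models are fitted to huge shot datasets, but at their core they are probability models over shot features. A toy sketch with hand-picked, purely illustrative coefficients:

```python
import math

def expected_goals(distance_m, angle_deg, defenders_in_path):
    """Toy xG model: a logistic function of shot features. The coefficients
    are hand-picked for illustration, not fitted from real match data."""
    score = 1.2 - 0.10 * distance_m + 0.02 * angle_deg - 0.6 * defenders_in_path
    return 1 / (1 + math.exp(-score))

# A close-range, open shot vs. a long-range effort through traffic.
print(round(expected_goals(8, 45, 0), 2))   # 0.79
print(round(expected_goals(30, 15, 2), 2))  # 0.06
```

    Summing these per-shot probabilities over a match or a season is what produces the team and player xG totals quoted in analysis.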

    Player Development & Performance

    • Event Data Tracking: Every pass, tackle, and movement is logged to assess decision-making and execution. Event data tracking helps coaches and players analyze match footage to refine first touch, scanning habits, and off-ball movement.
    • Wearable Tech: GPS and accelerometers monitor load, speed, and fatigue in real time. This helps tailor training intensity and reduce injury risk, especially in congested fixture periods.
    • Custom Metrics: Clubs build proprietary KPIs to evaluate players beyond traditional stats. Custom metrics allow for more nuanced evaluation than traditional stats like goals or tackles.

    Recruitment & Scouting

    • Market Inefficiencies: Data helps identify undervalued talent with specific skill sets. This is especially useful for teams with smaller budgets: rather than paying a premium for players who are elite at multiple skills, they can target players who excel at the one skill the team actually needs.
    • Style Matching: Algorithms compare player profiles to team philosophy—think “find me the next Lionel Messi.” This ensures recruits aren’t just talented, but tactically compatible—saving time and money.
    • Injury Risk Modeling: Predictive analytics flag players with high susceptibility to injury. It informs transfer decisions and contract structuring.

    Training & Recovery Optimization

    • Load Management: Data guides intensity and volume to prevent overtraining. Especially vital for youth development and congested schedules.
    • Recovery Protocols: Biometrics and sleep data inform individualized recovery strategies. This improves performance consistency and long-term health.
    • Skill Targeting: Coaches use analytics to pinpoint technical weaknesses and design drills accordingly.

    Set-Piece Engineering

    • Spatial Analysis: Determines optimal corner kick types (in-swing vs. out-swing) and free kick setups. It turns set pieces into high-probability scoring opportunities.
    • Simulation Tools: VR and AR are emerging to rehearse scenarios with data-driven precision.

    Player Examples

    Now that we've discussed how data science is used, I will provide examples of teams and players that have utilized it in these ways.

    1. Liverpool FC – Recruitment & Tactical Modeling
      • Liverpool built one of the most advanced data science departments in soccer, led by Dr. Ian Graham. Using predictive models and custom metrics, they scouted and signed undervalued talent like Mohamed Salah and Sadio Mane on the basis of expected threat.
      • Result: Salah scored 245 goals in just 9 seasons. Liverpool won their first Champions League title since 2005 and their first-ever Premier League title, with Salah and Mane leading the lines.
    2. Kevin De Bruyne – Contract Negotiation via Analytics FC
      • De Bruyne worked with Analytics FC to create a 40+ page data-driven report showcasing his value to Manchester City. It included proprietary metrics like Goal Difference Added (GDA), tactical simulations, and salary benchmarking.
      • Result: He negotiated his own contract extension without an agent, using data to prove his irreplaceable role in City’s system.
    3. Arsenal FC – Injury Risk & Youth Development
      • Arsenal integrated wearable tech and biomechanical data to monitor player load and injury risk. Young players like Myles Lewis-Skelly used performance analytics to support their rise from academy to first team.
      • Result: Lewis-Skelly’s data-backed contract renewal included insights into his match impact, fatigue management, and tactical fit—helping him secure a long-term deal amid interest from top European clubs.


  • Hyperparameter tuning with RandomizedSearchCV

    Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.

    Hyperparameter Tuning with RandomizedSearchCV

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    • n_estimators [100, 200, 300]: Number of trees in the forest. More trees can improve performance but increase training time.
    • min_samples_split [2, 5, 10, 20]: Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting.
    • min_samples_leaf [1, 2, 4, 10]: Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance.
    • max_features ["sqrt", "log2", 1.0]: Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features.
    • bootstrap [True, False]: Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree.
    • criterion ["squared_error", "absolute_error"]: Function to measure the quality of a split. "squared_error" (the default) is sensitive to outliers; "absolute_error" is more robust.
    • ccp_alpha [0.0, 0.01]: Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model.

    Interpretation

    Here is a table that compares the results in my previous post where I experimented with GridSearchCV with what I achieved while using RandomizedSearchCV.

    • Mean Squared Error (MSE): 173.39 (GridSearchCV) vs. 161.12 (RandomizedSearchCV), a 7.1% reduction
    • Root Mean Squared Error (RMSE): 13.17 vs. 12.69, a 3.6% reduction
    • R² Score: 0.2716 vs. 0.3231, an 18.9% improvement

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with greater computational efficiency. When I ran GridSearchCV with this many parameters to explore, it ran for a long time. In contrast, RandomizedSearchCV returned almost instantaneously by comparison. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It's a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • Trying my hand at Hyperparameter tuning with GridSearchCV

    Trying my hand at Hyperparameter tuning with GridSearchCV

    In this post, I’ll try using scikit-learn’s GridSearchCV to optimize hyperparameters. GridSearchCV is a powerful tool in scikit-learn that automates the process of hyperparameter tuning by exhaustively searching through a predefined grid of parameter combinations. It evaluates each configuration using cross-validation, allowing you to identify the settings that yield the best performance. It doesn’t guarantee the globally optimal solution, but GridSearchCV provides a reproducible way to improve model accuracy, reduce overfitting, and better understand how a model responds to different parameter choices.

    Hyperparameter Tuning with GridSearchCV

    First Attempt

    The images below show the initial parameters I used in my GridSearchCV experimentation and the results. Based on my reading, I decided to try just a few parameters to start. Here are the parameters I chose to start with and a brief description of why I felt each was a good place to start.

    • n_estimators: Number of trees in the forest. Controls model complexity and variance; 100–300 is a practical range for balancing performance and compute.
    • bootstrap: Whether sampling is done with replacement. Tests the impact of bagging vs. full-dataset training, which can affect bias and variance. Bagging means each decision tree in the forest is trained on a random sample of the training data.
    • criterion: Function used to measure the quality of a split. Offers diverse loss functions to explore how the model fits different error structures.

    You may recall in my earlier post that I achieved these results during manual tuning:
    Mean squared error: 160.7100736652691
    RMSE: 12.677147694385717
    R2 score: 0.3248694960846078

    Interpretation

    My Manual Configuration Wins on Performance

    • Lower MSE and RMSE: Indicates better predictive accuracy and smaller average errors.
    • Higher R²: Explains more variance in the target variable.

    Why Might GridSearchCV Underperform Here?

    • Scoring mismatch: I used "f1" as the scoring metric, which, as I discovered while reading, is actually for classification! So, the grid search may have optimized incorrectly. Since I’m using a regressor, I should use "neg_mean_squared_error" or "r2".
    • Limited search space: My grid only varied n_estimators, bootstrap, and criterion. It didn’t explore other impactful parameters like min_samples_leaf, max_features, or max_depth.
    • Default values: GridSearchCV used default settings for parameters like min_samples_leaf=1, which could lead to overfitting or instability.

    Second Attempt

    In this attempt, I changed the scoring to neg_mean_squared_error. This metric returns the negative of the mean squared error, so GridSearchCV’s convention of maximizing the score translates into minimizing the mean squared error (MSE). That in turn means GridSearchCV will choose parameters that minimize large deviations between predicted and actual values.
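    The sign convention can be checked directly through scikit-learn's scorer interface. The tiny dataset here is just for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer

# A tiny noisy line, just to have a fitted model to score.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + np.array([0.5, -0.5] * 5)

model = LinearRegression().fit(X, y)
scorer = get_scorer("neg_mean_squared_error")
score = scorer(model, X, y)
print(score)  # negative of the MSE; values closer to 0 are better
```

    Because scikit-learn always maximizes scores, flipping the sign is what lets an error metric plug into the same machinery as accuracy or R².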

    So how did that affect results? The below images show what happened.

    While the results aren’t much better, they are more valid because it was a mistake to use F1 scoring in the first place. Using F1 was wrong because:

    • The F1 score is defined for binary classification problems, while I am fitting a regressor with continuous outputs.
    • F1 needs discrete class labels, not continuous outputs.
    • When used in regression, scikit-learn would have forced predictions into binary labels, which distorts the optimization objective.
    • Instead of minimizing prediction error, it tried to maximize F1 on binarized outputs.

    Reflections

    • The "f1"-optimized model accidentally landed on a slightly better MSE, but this is not reliable or reproducible.
    • The "neg_mean_squared_error" model was explicitly optimized for MSE, so its performance is trustworthy and aligned with my regression goals.
    • The small difference could simply be due to random variation or hyperparameter overlap, not because "f1" is a viable scoring metric here.

    In summary, using "f1" in regression is methodologically invalid. Even if it produces a superficially better score, it’s optimizing the wrong objective and introduces unpredictable behavior.

    In my next post I will try some more parameters and also RandomizedSearchCV.

    – William