Tag: Student Data Analysis

  • Short Post

    Short Post

    Just a quick ‘nothing’ post this time. I’ve been very busy with college applications and the holidays. I’ll post again soon once application season settles down a bit. I stumbled across an interesting article on genetic algorithms. It sounds so interesting that I think my next post may be on that topic!

    – William

  • First Experiment with SHAP Visualizations

    First Experiment with SHAP Visualizations

    In my previous post, I touched on model explainability. One approach for feature attribution is called SHAP, SHapley Additive exPlanations. In this post I will cover my first experiment with SHAP, building on one of my previous notebooks. My GitHub repo containing all of my Jupyter notebooks can be found here: GitHub – wcaubrey/learning-naive-bayes.

    What Is SHAP?

    SHAP (SHapley Additive exPlanations) is a powerful technique for interpreting machine learning models by assigning each feature a contribution value toward a specific prediction. It’s grounded in Shapley values from cooperative game theory, which ensures that the explanations are fair, consistent, and additive.

    What SHAP Does

    • It calculates how much each feature “adds” or “subtracts” from the model’s baseline prediction.
    • It works both locally (for individual predictions) and globally (across the dataset).
    • It produces visualizations like force plots, summary plots, and dependence plots.

    What SHAP Is Good For

    • Trust-building: Stakeholders can see why a model made a decision.
    • Debugging: Helps identify spurious correlations or data leakage.
    • Fairness auditing: Reveals if certain features disproportionately affect predictions for specific groups.
    • Feature attribution: Quantifies the impact of each input on the output.

    Ideal Use Cases

    • Tree-based models (e.g., XGBoost, LightGBM, Random Forest)
    • High-stakes domains like healthcare, education, finance, and policy
    • Any scenario where transparency and accountability are critical

    My notebook changes

    In this new cell, I use the results of the previous grid search to create a SHAP TreeExplainer from the shap package. With it I create three different types of plots: a summary beeswarm plot, a dependence plot, and a force plot.

    SHAP Visualizations

    Interpreting the summary beeswarm plot

    The x-axis shows the SHAP values. Positive values push the prediction higher, towards the positive class or higher score. Negative values push the prediction lower.

    The y-axis shows the features, ranked by overall importance. The most important features are at the top. The spread of SHAP values shows how much influence that feature can have. The wider the spread of dots along the x-axis, the more variability that feature contributes to predictions. Narrow spreads mean the feature has a consistent, smaller effect.

    Each dot represents a single observation for the feature. The color of the dots shows the feature value. Red for high values and blue for low.

    If high feature values (red dots) cluster on the right (positive SHAP values), then higher values of that feature increase the prediction. If high values cluster on the left, then higher values decrease the prediction. Blue dots (low feature values) show the opposite effect.

    Overlapping colors can suggest interactions. For example, if both high and low values of a feature appear on both sides, the feature’s effect may depend on other variables.

    Interpreting the force plot

    The base value is the model’s average prediction: the neutral starting point before any features are considered.

    Arrows or bars are the force each feature contributes positively or negatively to the prediction. Each feature either increases or decreases the prediction. The size of the arrow/bar shows the magnitude of its effect.

    • Red (or rightward forces): Push the prediction higher.
    • Blue (or leftward forces): Push the prediction lower.

    The final prediction is the baseline plus all feature contributions. The endpoint shows the model’s actual output for that instance.

    – William


  • Making Sense of the Black Box: A Guide to Model Explainability

    Making Sense of the Black Box: A Guide to Model Explainability

    In an age of AI-driven decisions, whether predicting student risk, approving loans, or diagnosing disease, understanding why a model makes a prediction is just as important as the prediction itself. This is exactly the purpose of model explainability.

    What Is Model Explainability?

    Model explainability refers to techniques that help us understand and interpret the decisions made by machine learning models. While simple models like linear regression are more easily interpretable, more powerful models, like random forests, gradient boosting, or neural networks, are often considered “black boxes”.

    Explainability tools aim to make it possible to understand that “box”, offering insights into how features influence predictions, both globally (across the dataset) and locally (for individual cases).

    Why It Matters: Trust, Transparency, and Actionability

    Explainability isn’t just a technical concern; it matters for data scientists and for society at large. Here’s why:

    Trust: Stakeholders are more likely to act on model outputs when they understand the reasoning behind them. A principal won’t intervene based on a risk score alone but will if they see that the score is driven by declining attendance and recent disciplinary actions.

    Accountability: Explainability supports ethical AI by surfacing potential biases and enabling audits. It helps answer: “Is this model fair across different student groups?”

    Debugging: Helps data scientists identify spurious correlations, data leakage, or overfitting.

    Compliance: Increasingly required by regulations like GDPR (right to explanation), FERPA (student data protections), and the EU AI Act.

    Key Explainability Techniques

    Let’s explore and compare the most widely used methods:

    | Method | Type | Strengths | Limitations | Best For |
    |---|---|---|---|---|
    | SHAP (SHapley Additive exPlanations) | Local + Global | Theoretically grounded, consistent, visual. | Computationally expensive for large models. | Tree-based models (e.g., XGBoost, RF). |
    | LIME (Local Interpretable Model-agnostic Explanations) | Local | Model-agnostic, intuitive. | Sensitive to perturbations, unstable explanations. | Any black-box model. |
    | PDP (Partial Dependence Plot) | Global | Shows marginal effect of features. | Assumes feature independence. | Interpreting average trends. |
    | ICE (Individual Conditional Expectation) | Local | Personalized insights. | Harder to interpret at scale. | Individual predictions. |
    | Permutation Importance | Global | Simple, model-agnostic. | Can be misleading with correlated features. | Quick feature ranking. |
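    As a feel for the simplest of these methods, here is a minimal permutation-importance sketch using scikit-learn’s built-in helper: shuffle each feature on held-out data and measure how much the model’s score drops.

```python
# Permutation importance: a quick, model-agnostic feature ranking.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
# Larger mean importance means a bigger score drop when that feature is shuffled.
top = result.importances_mean.argsort()[::-1][:5]
print(top, result.importances_mean[top])
```

    Note the caveat from the table: with strongly correlated features, shuffling one feature lets the model lean on its correlates, so importances can be misleadingly low.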

    SHAP vs. LIME: A Deeper Dive

    Both SHAP and LIME aim to answer the same question: “Why did the model make this prediction?” But they approach it from different angles, with distinct strengths, limitations, and implications for trust and usability.

    Theoretical Foundations

    | Aspect | SHAP | LIME |
    |---|---|---|
    | Core Idea | Based on Shapley values from cooperative game theory. | Builds a local surrogate model using disturbed samples. |
    | Mathematical Guarantee | Additive feature attributions that sum to the model output. | No guarantee of consistency or additivity. |
    | Model Assumptions | Assumes access to the model’s internal structure. | Treats the model as a black box. |
    • SHAP treats each feature as a “player” in a game contributing to the final prediction. It calculates the average contribution of each feature across all possible feature combinations.
    • LIME perturbs (disturbs) the input data around a specific instance and fits a simple interpretable model (usually linear) to approximate the local decision boundary.

    Output and Visualization

    | Feature | SHAP | LIME |
    |---|---|---|
    | Local Explanation | Force plots show how each feature pushes the prediction. | Bar charts show feature weights in the surrogate model. |
    | Global Explanation | Summary plots aggregate SHAP values across the dataset. | Not designed for global insights. |
    | Visual Intuition | Highly visual and intuitive. | Simpler but less expressive visuals. |
    • SHAP’s force plots and summary plots are really great for stakeholder presentations. They show not just which features mattered, but how they interacted.
    • LIME’s bar charts are easier to generate and interpret quickly, but they can vary significantly depending on how the data was disturbed.

    Practical Considerations

    | Factor | SHAP | LIME |
    |---|---|---|
    | Speed | Slower, especially for large models. | Faster, lightweight. |
    | Stability | High; same input yields same explanation. | Low; results can vary across runs. |
    | Model Support | Optimized for tree-based models. | Works with any model (including neural nets and ensembles). |
    | Implementation | Requires more setup and compute. | Easier to plug into existing workflows. |
    • SHAP is ideal for production-grade models where consistency and auditability matter.
    • LIME is great for quick prototyping, debugging, or when working with opaque models like deep neural networks.

    A Real-World Example: Explaining Student Risk Scores

    My nonprofit’s goal is to build a model to identify students at risk of socio-emotional disengagement. The model uses features like attendance, GPA trends, disciplinary records, and survey responses.

    Let’s say the model flags a student as “high risk”. Without explainability, this is a black-box label. But with SHAP, we can generate a force plot that shows:

    • Attendance rate: -0.25 (low attendance strongly contributes to risk)
    • GPA change over time: -0.15 (declining grades add to concern)
    • Recent disciplinary action: +0.30 (a major driver of the risk score)
    • Survey response: “I feel disconnected from school”: +0.20 (adds emotional context)

    This breakdown transforms a numeric score into a narrative. It allows educators to:

    • Validate the prediction: “Yes, this aligns with what we’ve seen.”
    • Take targeted action: “Let’s prioritize counseling and academic support.”
    • Communicate transparently: “Here’s why we’re reaching out to this student.”
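    Taking the signed values above as given, the final risk score is just arithmetic on top of the base value. The base value here (0.45) is hypothetical, chosen only to make the example concrete:

```python
# Worked arithmetic for the force-plot example above.
# The base value is an assumed (hypothetical) average model output;
# the four contributions are the ones listed in the post.
base_value = 0.45
contributions = {
    "attendance_rate": -0.25,
    "gpa_trend": -0.15,
    "recent_discipline": +0.30,
    "survey_disconnected": +0.20,
}
final_score = base_value + sum(contributions.values())
print(round(final_score, 2))  # 0.55: contributions net +0.10 above the base
```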

    Summary

    Model explainability isn’t just a technical add-on, it’s an ethical and operational imperative. As we build systems that influence real lives, we must ensure they are not only accurate but also understandable, fair, and trustworthy.

    – William

    References

    Technical Foundations of SHAP and LIME

  • Data Science in the World Pt. 5: Data Science in Higher Education

    Data Science in the World Pt. 5: Data Science in Higher Education

    This is the fifth post in my “Data Science in the World” series.

    How Data Science is Transforming Higher Education

    When most people think of college or university, they picture lecture halls, libraries, and late-night study sessions. But behind the scenes, a quiet revolution is underway, one powered by data science. Just as data has reshaped industries like healthcare, finance, and transportation, it is now transforming higher education. From improving student success to guiding institutional decisions, data science is becoming a cornerstone of how colleges and universities operate.

    This might sound abstract, but the reality is simple: data science is helping students learn more effectively, helping educators teach more efficiently, and helping institutions make smarter choices. Let’s explore three key areas where data science is making the biggest impact in higher education: student success and retention, personalized learning, and institutional decision-making.

    Student Success and Retention

    One of the most pressing challenges in higher education is ensuring that students not only enroll but also graduate. Dropout rates remain a concern, and every student who leaves represents both a personal setback and a loss for the institution. Data science is helping to address this challenge by identifying at-risk students early and providing targeted support.

    Colleges collect a wide range of data about students, like grades, attendance, course engagement, financial aid status, and even participation in extracurricular activities. By analyzing these data points, machine learning models can detect patterns that signal when a student might be struggling.

    For example, a sudden drop in class attendance combined with declining grades might indicate that a student is at risk of dropping out. Predictive analytics can flag this student, allowing advisors or faculty to intervene before it’s too late.

    • Georgia State University has become a leader in using predictive analytics to improve student success. By tracking over 800 risk factors, the university has significantly increased graduation rates, particularly among first-generation and low-income students.
    • Community colleges are also adopting similar systems, using data to provide proactive advising and support services tailored to individual student needs.

    For students, this means more personalized support and a greater chance of completing their degree. For institutions, it means improved retention rates, which not only enhance reputation but also ensure financial stability. For society, it means more graduates entering the workforce with the skills needed to succeed.

    Personalized Learning

    Every student learns differently. Some thrive in large lectures, while others need more hands-on support. Traditional education models often struggle to accommodate these differences. Data science is changing that by enabling personalized learning experiences tailored to each student’s strengths, weaknesses, and preferences.

    Learning management systems (LMS) and online platforms collect detailed data on how students interact with course materials: how long they spend on readings, which quiz questions they miss, and how often they participate in discussions. Data science tools analyze this information to create individualized learning pathways.

    For instance, if a student consistently struggles with a particular math concept, the system can recommend additional practice problems, videos, or tutoring resources. Conversely, if a student masters material quickly, the system can accelerate their progress to keep them engaged.

    • Adaptive learning platforms like ALEKS (for math) or Smart Sparrow (for science) use data-driven algorithms to adjust content in real time, ensuring that students receive the right level of challenge.
    • Massive Open Online Courses (MOOCs) such as Coursera and edX leverage data science to recommend courses and resources based on a learner’s past activity and performance.

    Personalized learning helps students stay motivated and engaged, reducing frustration and boredom. It also allows instructors to focus their attention where it’s needed most, rather than applying a one-size-fits-all approach. Over time, this could lead to more equitable outcomes, as students from diverse backgrounds receive the support they need to succeed.

    Institutional Decision-Making

    Running a college or university is a complex endeavor. Administrators must make decisions about everything from course offerings to campus facilities to budget allocations. Traditionally, these decisions were based on historical trends, intuition, or limited data. Today, data science is providing a more rigorous foundation for institutional decision-making.

    Universities generate enormous amounts of operational data: enrollment numbers, course demand, faculty workloads, financial aid distribution, and more. By applying data science techniques, administrators can uncover insights that guide strategic planning.

    • Course scheduling: Predictive models can forecast which classes will be in high demand, ensuring that enough sections are offered to meet student needs.
    • Resource allocation: Data can reveal which programs are growing and which are declining, helping institutions allocate funding more effectively.
    • Facilities management: Sensors and data analytics can optimize energy use, reduce costs, and create more sustainable campuses.

    Real-World Examples

    • Arizona State University uses data analytics to optimize course scheduling and advising, ensuring that students can access the classes they need to graduate on time.
    • The University of Michigan has applied data science to improve energy efficiency across campus, saving millions of dollars while reducing environmental impact.

    Smarter decision-making benefits everyone. Students get the classes and resources they need, faculty workloads are managed more effectively, and institutions operate more efficiently. In an era of rising tuition costs and financial pressures, data-driven management helps ensure that higher education remains sustainable and accessible.

    Spotlight: The Early Signal Project

    Another example of how data science can support student success is the Early Signal Project, a nonprofit initiative I founded to help educators detect socio-emotional risks in students before they escalate. By combining privacy-compliant surveys with carefully designed data pipelines, the project gives schools actionable insights while protecting student trust. Instead of waiting until problems become visible in grades or attendance, educators receive early, anonymized signals that a student may need support. This proactive approach mirrors the broader promise of data science in higher education: using information ethically and transparently to empower teachers, improve outcomes, and ensure that no student falls through the cracks.

    Conclusion

    Data science is no longer confined to tech companies or research labs. It’s becoming a central part of how higher education functions. By improving student success and retention, enabling personalized learning, and guiding institutional decision-making, data science is helping colleges and universities adapt to the challenges of the 21st century.

    Privacy concerns must be carefully managed, and institutions must ensure that data-driven decisions are fair and transparent. But the potential benefits are enormous. As data science continues to evolve, it promises to make higher education not only more efficient but also more inclusive, personalized, and effective.

    In the end, higher education has always been about unlocking human potential. With the help of data science, that mission is being reimagined for a new era—one where every student has the opportunity to succeed, every instructor has the tools to teach effectively, and every institution has the insights to thrive.



  • Data Science in the World Pt. 4: Data Science in the Auto Industry

    Data Science in the World Pt. 4: Data Science in the Auto Industry

    This is the fourth post in my “Data Science in the World” series.

    Driving the Future: How Data Science is Transforming the Auto Industry

    Cars have always been about more than just getting from point A to point B. They represent freedom, innovation, and progress. Today, they also represent something else: data. Modern vehicles are really computers on wheels, generating and processing vast amounts of information every second. From safety systems to navigation to entertainment, data science is quietly reshaping the way we design, build, and drive cars.

    Data science is already influencing your daily commute, the safety of your family, and even the future of how we think about car ownership. Let’s explore three key areas where data science is making the biggest impact: autonomous driving, predictive maintenance, and connected car services.

    Autonomous Driving: Teaching Cars to Think

    Perhaps the most exciting—and widely discussed—application of data science in the auto industry is autonomous driving. Self-driving cars rely on a combination of sensors, cameras, radar, and lidar to perceive their surroundings. But perception alone isn’t enough. The real magic happens when data science steps in to interpret all that information and make decisions in real time.

    • Data collection: A single autonomous vehicle can generate terabytes of data every day. This includes images from cameras, distance measurements from lidar, and speed or position data from GPS.
    • Machine learning models: These massive datasets are used to train algorithms that can recognize pedestrians, traffic lights, road signs, and other vehicles.
    • Decision-making: Once trained, the system can predict what’s likely to happen next—like whether a pedestrian will step into the crosswalk—and decide how the car should respond.

    Companies like Tesla, Waymo, and traditional automakers are investing billions into this technology. While fully autonomous cars aren’t yet mainstream, features like adaptive cruise control, lane-keeping assistance, and automatic emergency braking are already powered by data science. These are steppingstones toward a future where cars can safely drive themselves.

    Autonomous driving has the potential to reduce accidents caused by human error, which accounts for the vast majority of crashes. It could also make transportation more accessible for people who can’t drive, such as the elderly or disabled. For the average driver, it promises a future where commuting time could be spent reading, working, or simply relaxing.

    Predictive Maintenance: Fixing Problems Before They Happen

    Traditionally, maintenance has been reactive: you wait until something goes wrong, then fix it. Data science is changing that by enabling predictive maintenance, where problems are identified and addressed before they cause a breakdown.

    • Sensors everywhere: Modern cars are equipped with hundreds of sensors monitoring everything from engine temperature to tire pressure.
    • Data analysis: These sensors feed data into machine learning models that can detect patterns indicating wear and tear.
    • Predictive alerts: Instead of a vague “check engine” light, predictive systems can tell you exactly which component is likely to fail and when.

    Fleet operators, like delivery companies or ride-sharing services, are already using predictive maintenance to keep vehicles on the road longer and reduce downtime. For everyday drivers, some automakers now offer apps that notify you when your car needs attention, sometimes even scheduling service appointments automatically.

    Predictive maintenance saves money by preventing costly repairs and extends the life of vehicles. More importantly, it improves safety by reducing the risk of sudden failures on the road. For consumers, it means fewer surprises and more confidence in their cars.

    Connected Car Services: Turning Vehicles into Smart Devices

    Think of your car as a smartphone on wheels. Just as your phone connects you to apps, maps, and services, connected cars use data science to provide a seamless, personalized driving experience.

    • Telematics: Cars transmit data about location, speed, and performance to cloud platforms.
    • Personalization: Data science algorithms analyze your driving habits to suggest routes, adjust climate control, or recommend nearby services.
    • Integration: Connected cars can communicate with other vehicles and infrastructure, creating smarter traffic systems.

    Connected car services make driving more convenient, efficient, and enjoyable. On a larger scale, they also contribute to smarter cities by reducing congestion and emissions. For consumers, it’s about having a car that feels less like a machine and more like a personalized companion.

    Conclusion

    Data science is no longer a behind-the-scenes tool in the auto industry—it’s the driving force behind its most exciting innovations. Autonomous driving is teaching cars to think, predictive maintenance is making them more reliable, and connected services are turning them into smart devices that fit seamlessly into our digital lives.

    For the general public, the takeaway is simple: your car isn’t just powered by gasoline or electricity, it’s powered by data. And as data science continues to evolve, the way we drive, maintain, and experience cars will keep transforming in ways that once belonged only to science fiction.


    References

    • Intel. (2016). Data-Driven Intelligence: The Future of Autonomous Vehicles. Intel Corporation Whitepaper.
    • Waymo LLC. (2020). The Waymo Open Dataset: Autonomous Vehicle Perception Benchmark. Waymo Research Publications.
    • MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). (2019). Planning and Decision-Making in Autonomous Driving. MIT Technical Report.
    • General Motors (GM). (2020). OnStar and Connected Services: Telematics Platform Overview. GM Global Technology.
    • Tesla, Inc. (2021). Vehicle Diagnostics, Over-the-Air Updates, and Predictive Service. Tesla Engineering Documentation.
    • University of Michigan Transportation Research Institute (UMTRI). (2017). Ann Arbor Connected Vehicle Test Environment: Phase 2 Results. UMTRI Technical Report.
  • Data Science in the World Pt. 3: Data Science in Finance

    Data Science in the World Pt. 3: Data Science in Finance

    This is the third post in my “Data Science in the World” series.

    How Data Science is Transforming Finance: Fraud, Credit, and Investments

    When most people think about finance, they picture swiping a card, checking a bank balance, or maybe watching the stock market ticker scroll across a screen. What’s less visible is how much data science is working behind the scenes to make those everyday interactions safer, smarter, and more personalized.

    Financial institutions have always been data-driven, but the explosion of computing power and machine learning has changed the game. Today, banks, lenders, and investment firms can analyze billions of data points in real time, uncovering patterns that humans alone could never detect. The result? Faster fraud detection, fairer lending decisions, and more accessible investment opportunities.

    I will review three of the most important ways data science is reshaping the financial services industry: fraud detection and security, credit scoring and lending, and investment and wealth management.

    Fraud Detection and Security

    Fraud has always been a cat-and-mouse game. As soon as banks develop new defenses, fraudsters find new ways to exploit vulnerabilities. Data science has tilted the balance in favor of the defenders by enabling real-time, adaptive fraud detection.

    Every financial transaction, whether it’s a credit card swipe, an online transfer, or a mobile payment, creates a data trail. Machine learning models are trained on millions of these transactions to learn what “normal” behavior looks like. They consider dozens of factors simultaneously:

    • Location: Is the purchase happening in the same city as the customer’s last transaction?
    • Timing: Does the transaction fit the customer’s usual spending patterns?
    • Device: Is the payment being made from a familiar phone or computer?

    When something looks unusual, like a purchase in another country minutes after a local coffee shop visit, the system can flag it instantly.

    For consumers, this often shows up as a text message or app notification. Behind that alert is a sophisticated model that has already calculated the probability of fraud in milliseconds.

    Banks benefit too. According to industry reports, AI-driven fraud detection has saved billions of dollars annually by reducing false positives (legitimate transactions incorrectly flagged) and catching fraudulent activity earlier.

    Fraud detection powered by data science protects money and trust. In a world where digital payments are the norm, consumers need to feel confident that their financial institutions can keep them safe. Data science makes that possible.

    Credit Scoring and Lending

    For decades, credit decisions were based on a narrow set of factors: payment history, outstanding debt, and length of credit history. While effective to a point, this system left many people without traditional credit histories, locked out of the financial system. Data science is helping to change that.

    Modern credit models can incorporate a much wider range of information:

    • Alternative data: Rent payments, utility bills, and even subscription services can demonstrate reliability.
    • Behavioral data: Spending patterns, savings habits, and cash flow stability provide additional context.
    • Digital footprints: With proper privacy safeguards, online activity can sometimes serve as a proxy for financial responsibility.

    By analyzing these broader datasets, machine learning models can paint a more complete picture of an applicant’s creditworthiness. This benefits consumers and lenders in the following ways:

    • Fairer access: People with limited or no credit history can qualify for loans they might otherwise be denied.
    • Reduced bias: Properly designed models can minimize human subjectivity in lending decisions.
    • Better risk management: Lenders can more accurately predict defaults, reducing losses and keeping interest rates competitive.

    Investment and Wealth Management

    Investing used to be the domain of the wealthy, with personalized advice available only to those who could afford a financial advisor. Data science has democratized investing, making it more accessible, affordable, and tailored to individual needs.

    One example of this democratization is robo-advisors: digital platforms that use algorithms to build and manage investment portfolios. By asking a few simple questions about risk tolerance, time horizon, and goals, the system can recommend a diversified portfolio that automatically rebalances over time. The benefits to everyday investors are:

    • Lower costs: Automated systems reduce the need for expensive human advisors.
    • Accessibility: Minimum investment amounts are often much lower than traditional wealth management services.
    • Customization: Portfolios can be tailored to individual preferences, such as socially responsible investing.

    At the institutional level, hedge funds and asset managers use machine learning to detect subtle patterns in market data. Some even analyze unconventional sources like satellite imagery (e.g., counting cars in retail parking lots to predict sales) or social media sentiment to gain an edge.

    Data science also helps investors understand and manage risk. Predictive models can simulate how a portfolio might perform under different economic scenarios, giving both professionals and individuals a clearer picture of potential outcomes.

    At the same time, regulators are pushing for “explainable AI” in finance, ensuring that investment recommendations are transparent and understandable rather than black-box predictions.

    Conclusion

    Data science is no longer a buzzword in finance. It’s the backbone of how the industry operates. From protecting consumers against fraud, to opening up credit access, to democratizing investing, the impact is profound and personal.

    For the general public, the takeaway is simple: every time you swipe your card, apply for a loan, or check your investment app, data science is working behind the scenes. It’s making your financial life safer, smarter, and more tailored to your needs.

    As technology continues to evolve, expect these systems to become even more sophisticated. The future of finance isn’t just digital—it’s data-driven.



  • Data Science in the World Pt. 2: Data Science in Drug Research

    Data Science in the World Pt. 2: Data Science in Drug Research

    This is the second post in my “Data Science in the World” series.

    A New Era in Medicine

    Modern medicine is undergoing a quiet revolution, powered by data. Hospital visits, lab tests, and genetic sequences add to a growing collection of medical information. Data science is helping scientists, doctors, and researchers use this information to find new treatments faster, test them more safely, and tailor care to each individual patient.

    From Molecules to Medicines: Data-Driven Discovery

    Traditionally, discovering a new drug could take a decade or more, with researchers testing thousands of chemical compounds by hand. Today, machine learning models can predict how molecules will interact with the human body long before they’re ever created in a lab.

    For example, AI-powered algorithms analyze massive databases of molecular structures and biological data to identify promising compounds that might block a virus, reduce inflammation, or kill cancer cells. These models learn from patterns across millions of examples, helping researchers focus only on the most likely winners.

    This approach reduces the cost and time to develop new medicines. In some cases, machine learning has identified viable drug candidates in months rather than years — a leap forward in the fight against fast-moving diseases.
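    One classic building block of this kind of virtual screening is fingerprint similarity: represent each molecule as a set of structural features and rank a compound library by similarity to a known active. A toy sketch, with entirely made-up fingerprints:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two binary fingerprints,
    represented here as sets of 'on' bits."""
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def screen(known_active, library, top_k=2):
    """Rank library compounds by similarity to a known active molecule."""
    ranked = sorted(library.items(),
                    key=lambda item: tanimoto(known_active, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Made-up fingerprints: each integer stands for a structural feature.
known = {1, 4, 7, 9}
library = {
    "compound_A": {1, 4, 7, 8},
    "compound_B": {2, 3, 5},
    "compound_C": {1, 4, 9, 10},
}
print(screen(known, library))  # ['compound_A', 'compound_C']
```

    Real pipelines use learned models on millions of compounds, but the principle of prioritizing candidates that resemble known winners is the same.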

    Refining and Developing New Treatments

    Once potential compounds are identified, scientists must determine how they behave inside the body. Data science helps here by analyzing complex biological signals, genetic variations, and even past clinical data to predict outcomes.

    For instance, computer simulations (known as in silico trials) can model how a new medicine might interact with different cell types or organs. These models help researchers optimize dosages and anticipate side effects long before human testing begins.

    Pharmaceutical companies also use predictive analytics to identify which drug formulations are most stable and effective, cutting down on costly lab iterations.

    Smarter, Faster Clinical Trials

    Clinical trials are where potential medicines are tested in humans and where most drug candidates fail. Trials are slow, expensive, and difficult to manage. Data science is making them smarter and more efficient.

    Predictive analytics can help identify the right patients for each trial, ensuring diverse participation and faster recruitment. Algorithms analyze medical records and genomic data to match patients who are most likely to benefit, or least likely to experience harm, from an experimental drug.

    Real-time data monitoring during trials also improves safety. Machine learning systems can flag early warning signs or unexpected reactions faster than human observers. These insights can help scientists adjust or even redesign studies on the fly.
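    As a toy illustration of that kind of monitoring, a crude early-warning rule might flag trial sites whose adverse-event counts sit far above the cross-site average. The counts and threshold here are invented for the example:

```python
def flag_adverse_events(site_counts, threshold=1.5):
    """Flag trial sites whose adverse-event counts sit more than `threshold`
    standard deviations above the mean across all sites."""
    counts = list(site_counts.values())
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    std = variance ** 0.5
    if std == 0:
        return []
    return [site for site, c in site_counts.items()
            if (c - mean) / std > threshold]

# Invented counts: site_5 reports far more adverse events than its peers.
sites = {"site_1": 3, "site_2": 4, "site_3": 2, "site_4": 3, "site_5": 19}
print(flag_adverse_events(sites))  # ['site_5']
```

    Production systems use far more sophisticated statistics, but the idea of continuously comparing each site against the whole trial is the core of automated safety monitoring.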

    Personalized Medicine: Tailoring Treatments to Individuals

    Perhaps the most exciting result of this data revolution is personalized medicine, customizing treatment plans for individual patients. Instead of a one-size-fits-all approach, doctors can use data to predict which therapies will work best based on a person’s genes, environment, and medical history.

    For example, genomic data can reveal how patients metabolize certain drugs, allowing physicians to prescribe the safest and most effective options. AI systems can even help oncologists choose targeted cancer therapies by comparing tumor DNA against thousands of previous cases.

    This data-driven personalization not only improves outcomes but also reduces side effects, saving lives and healthcare costs alike.

    Challenges and Ethics: The Human Side of Data

    As powerful as data science is, it also raises critical ethical and technical challenges. Patient privacy must be protected. Medical data is among the most sensitive information that exists. Robust security measures and anonymization techniques are essential.

    Bias is another concern. If algorithms are trained on incomplete or unrepresentative datasets, their predictions could unfairly favor or disadvantage certain groups. Transparency, oversight, and diverse data collection are key to ensuring fairness.

    Finally, collaboration between scientists, clinicians, and data experts is crucial. Data alone doesn't save lives; people using it wisely can.

    Conclusion: The Future of Medicine Is Intelligent

    From discovering new molecules to designing smarter clinical trials and personalizing treatments, data science has become an essential partner in modern medicine. It allows scientists to see patterns invisible to the human eye, accelerate breakthroughs, and deliver safer, more effective therapies.

    As healthcare continues to generate more data than ever, the question is no longer whether data science will shape the future of medicine, but how far it can take us.



  • Data Science in the World Pt. 1: Data Science in Soccer

    Data Science in the World Pt. 1: Data Science in Soccer

    This post will be the first in a series of blog posts, called “Data Science in the World,” where I discuss the implementation of data science in different fields like sports, business, medicine, etc. To begin this series, I will be explaining how data science is used in soccer.

    There are 5 main areas of the soccer world where data science plays a critical role: Tactical & Match Analysis, Player Development & Performance, Recruitment & Scouting, Training & Recovery, and Set-Piece Engineering. I will break down how data science is used in each of these areas.

    Tactical & Match Analysis

    • Expected Goals (xG): Quantifies shot quality based on location, angle, and defensive pressure. xG can be used to gauge a player's finishing ability: if a player consistently generates high xG across a season or career, they should, in theory, eventually convert that into a high number of goals.
    • Heatmaps & Passing Networks: Reveal spatial tendencies, player roles, and team structure. Heatmaps and passing networks can be used by coaches to point out the good and bad their team does in matches, helping them determine what to fix and what to focus on in matches.
    • Opponent Profiling: Teams dissect rivals’ patterns to exploit weaknesses and tailor game plans.
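    To make the xG idea concrete: real models are fitted to huge shot datasets, but at their core they are probability models over shot features. A toy sketch with hand-picked, purely illustrative coefficients:

```python
import math

def expected_goals(distance_m, angle_deg, defenders_in_path):
    """Toy xG model: a logistic function of shot features. The coefficients
    are hand-picked for illustration, not fitted from real match data."""
    score = 1.2 - 0.10 * distance_m + 0.02 * angle_deg - 0.6 * defenders_in_path
    return 1 / (1 + math.exp(-score))

# A close-range, open shot vs. a long-range effort through traffic.
print(round(expected_goals(8, 45, 0), 2))   # 0.79
print(round(expected_goals(30, 15, 2), 2))  # 0.06
```

    Summing these per-shot probabilities over a match or a season is what produces the team and player xG totals quoted in analysis.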

    Player Development & Performance

    • Event Data Tracking: Every pass, tackle, and movement is logged to assess decision-making and execution. Event data tracking helps coaches and players analyze match footage to refine first touch, scanning habits, and off-ball movement.
    • Wearable Tech: GPS and accelerometers monitor load, speed, and fatigue in real time. This helps tailor training intensity and reduce injury risk, especially in congested fixture periods.
    • Custom Metrics: Clubs build proprietary KPIs to evaluate players beyond traditional stats. Custom metrics allow for more nuanced evaluation than traditional stats like goals or tackles.

    Recruitment & Scouting

    • Market Inefficiencies: Data helps identify undervalued talent with specific skill sets. This is especially useful for teams with smaller budgets: rather than paying a premium for players who are elite at multiple skills, they can target players who excel at the one skill the team actually needs.
    • Style Matching: Algorithms compare player profiles to team philosophy—think “find me the next Lionel Messi.” This ensures recruits aren’t just talented, but tactically compatible—saving time and money.
    • Injury Risk Modeling: Predictive analytics flag players with high susceptibility to injury. It informs transfer decisions and contract structuring.

    Training & Recovery Optimization

    • Load Management: Data guides intensity and volume to prevent overtraining. Especially vital for youth development and congested schedules.
    • Recovery Protocols: Biometrics and sleep data inform individualized recovery strategies. This improves performance consistency and long-term health.
    • Skill Targeting: Coaches use analytics to pinpoint technical weaknesses and design drills accordingly.

    Set-Piece Engineering

    • Spatial Analysis: Determines optimal corner kick types (in-swing vs. out-swing) and free kick setups. It turns set pieces into high-probability scoring opportunities.
    • Simulation Tools: VR and AR are emerging to rehearse scenarios with data-driven precision.

    Player Examples

    Now that we've discussed how data science is used, I will provide examples of teams and players that have utilized it in these ways.

    1. Liverpool FC – Recruitment & Tactical Modeling
      • Liverpool built one of the most advanced data science departments in soccer, led by Dr. Ian Graham. Using predictive models and custom metrics, they scouted and signed undervalued talent like Mohamed Salah and Sadio Mane on the basis of expected threat.
      • Result: Salah scored 245 goals in just 9 seasons. Liverpool won their first Champions League title since 2005 and their first-ever Premier League title, with Salah and Mane leading the lines.
    2. Kevin De Bruyne – Contract Negotiation via Analytics FC
      • De Bruyne worked with Analytics FC to create a 40+ page data-driven report showcasing his value to Manchester City. It included proprietary metrics like Goal Difference Added (GDA), tactical simulations, and salary benchmarking.
      • Result: He negotiated his own contract extension without an agent, using data to prove his irreplaceable role in City’s system.
    3. Arsenal FC – Injury Risk & Youth Development
      • Arsenal integrated wearable tech and biomechanical data to monitor player load and injury risk. Young players like Myles Lewis-Skelly used performance analytics to support their rise from academy to first team.
      • Result: Lewis-Skelly’s data-backed contract renewal included insights into his match impact, fatigue management, and tactical fit—helping him secure a long-term deal amid interest from top European clubs.


  • Hyperparameter tuning with RandomizedSearchCV

    Hyperparameter tuning with RandomizedSearchCV

    In my previous post, I explored how GridSearchCV can systematically search through hyperparameter combinations to optimize model performance. While powerful, grid search can quickly become computationally expensive, especially as the number of parameters and possible values grows. In this follow-up, I try a more scalable alternative: RandomizedSearchCV. By randomly sampling from the hyperparameter space, this method offers a faster, more flexible way to uncover high-performing configurations without the exhaustive overhead of grid search. Let’s dive into how RandomizedSearchCV works, when to use it, and how it compares in practice.

    What is RandomizedSearchCV

    Unlike GridSearchCV, which exhaustively tests every combination of hyperparameters, RandomizedSearchCV takes a more efficient approach by sampling a fixed number of random combinations from a defined parameter space. This makes it useful when the search space is large or when computational resources are limited. By trading exhaustive coverage for speed and flexibility, RandomizedSearchCV often finds competitive, or even superior, parameter sets with far fewer evaluations. It’s a smart way to explore hyperparameter tuning when you want faster insights without sacrificing rigor.

    Hyperparameter Tuning with RandomizedSearchCV

    Here’s a breakdown of each parameter in my param_distributions for RandomizedSearchCV when tuning a RandomForestRegressor:

    • n_estimators [100, 200, 300]: Number of trees in the forest. More trees can improve performance but increase training time.
    • min_samples_split [2, 5, 10, 20]: Minimum number of samples required to split an internal node. Higher values reduce model complexity and help prevent overfitting.
    • min_samples_leaf [1, 2, 4, 10]: Minimum number of samples required to be at a leaf node. Larger values smooth the model and reduce variance.
    • max_features ["sqrt", "log2", 1.0]: Number of features to consider when looking for the best split. "sqrt" and "log2" are common heuristics; 1.0 uses all features.
    • bootstrap [True, False]: Whether bootstrap samples are used when building trees. True enables bagging; False uses the entire dataset for each tree.
    • criterion ["squared_error", "absolute_error"]: Function to measure the quality of a split. "squared_error" (the default) is sensitive to outliers; "absolute_error" is more robust.
    • ccp_alpha [0.0, 0.01]: Complexity parameter for Minimal Cost-Complexity Pruning. Higher values prune more aggressively, simplifying the model.

    Interpretation

    Here is a table that compares the results in my previous post where I experimented with GridSearchCV with what I achieved while using RandomizedSearchCV.

    • Mean Squared Error (MSE): 173.39 (GridSearchCV) vs. 161.12 (RandomizedSearchCV), a 7.1% reduction
    • Root Mean Squared Error (RMSE): 13.17 vs. 12.69, a 3.6% reduction
    • R² Score: 0.2716 vs. 0.3231, an 18.9% improvement

    Interpretation & Insights

    Lower MSE and RMSE:
    RandomizedSearchCV yielded a model with noticeably lower error metrics. The RMSE dropped by nearly half a point, indicating better predictions. While the absolute reduction may seem modest, it’s meaningful in contexts where small improvements translate to better decision-making or cost savings.

    Higher R² Score:
    The R² score improved from 0.27 to 0.32, a relative gain of nearly 19%. This suggests that the model tuned via RandomizedSearchCV explains more variance in the target variable—an encouraging sign of better generalization.

    Efficiency vs Exhaustiveness:
    GridSearchCV exhaustively evaluated all parameter combinations, which can be computationally expensive and potentially redundant. In contrast, RandomizedSearchCV sampled a subset of combinations and still outperformed grid search. This underscores the value of strategic randomness in high-dimensional hyperparameter spaces.

    Model Robustness:
    The improved metrics hint that RandomizedSearchCV may have landed on a configuration that better balances bias and variance—possibly due to more diverse sampling across parameters like min_samples_leaf, criterion, and ccp_alpha.

    Takeaways

    RandomizedSearchCV not only delivered better predictive performance but did so with greater computational efficiency. When I ran GridSearchCV with this many parameters to explore, it ran for a long time. In contrast, RandomizedSearchCV returned almost instantaneously by comparison. For large or complex models like RandomForestRegressor, this approach offers a good balance between exploration and practicality. It's a great reminder that smarter search strategies can outperform brute-force methods, especially when paired with thoughtful parameter ranges.

    – William

  • Trying my hand at Hyperparameter tuning with GridSearchCV

    Trying my hand at Hyperparameter tuning with GridSearchCV

    In this post, I’ll try using scikit-learn’s GridSearchCV to optimize hyperparameters. GridSearchCV is a powerful tool in scikit-learn that automates the process of hyperparameter tuning by exhaustively searching through a predefined grid of parameter combinations. It evaluates each configuration using cross-validation, allowing you to identify the settings that yield the best performance. It doesn’t guarantee the globally optimal solution, but GridSearchCV provides a reproducible way to improve model accuracy, reduce overfitting, and better understand how a model responds to different parameter choices.

    Hyperparameter Tuning with GridSearchCV

    First Attempt

    The images below show the initial parameters I used in my GridSearchCV experimentation and the results. Based on my reading, I decided to try just a few parameters to start. Here are the parameters I chose to start with and a brief description of why I felt each was a good place to start.

    • n_estimators: Number of trees in the forest. Controls model complexity and variance; 100–300 is a practical range for balancing performance and compute.
    • bootstrap: Whether sampling is done with replacement. Tests the impact of bagging vs. full-dataset training, which can affect bias and variance. Bagging means each decision tree in the forest is trained on a random sample of the training data.
    • criterion: Function used to measure the quality of a split. Offers diverse loss functions to explore how the model fits different error structures.

    You may recall in my earlier post that I achieved these results during manual tuning:
    Mean squared error: 160.7100736652691
    RMSE: 12.677147694385717
    R2 score: 0.3248694960846078

    Interpretation

    My Manual Configuration Wins on Performance

    • Lower MSE and RMSE: Indicates better predictive accuracy and smaller average errors.
    • Higher R²: Explains more variance in the target variable.

    Why Might GridSearchCV Underperform Here?

    • Scoring mismatch: I used "f1" as the scoring metric, which, as I discovered while reading, is actually for classification! So, the grid search may have optimized incorrectly. Since I’m using a regressor, I should use "neg_mean_squared_error" or "r2".
    • Limited search space: My grid only varied n_estimators, bootstrap, and criterion. It didn’t explore other impactful parameters like min_samples_leaf, max_features, or max_depth.
    • Default values: GridSearchCV used default settings for parameters like min_samples_leaf=1, which could lead to overfitting or instability.

    Second Attempt

    In this attempt, I changed the scoring to neg_mean_squared_error. This metric returns the negative of the mean squared error, so GridSearchCV’s convention of maximizing the score translates into minimizing the mean squared error (MSE). That in turn means GridSearchCV will choose parameters that minimize large deviations between predicted and actual values.
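    The sign convention can be checked directly through scikit-learn's scorer interface. The tiny dataset here is just for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer

# A tiny noisy line, just to have a fitted model to score.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + np.array([0.5, -0.5] * 5)

model = LinearRegression().fit(X, y)
scorer = get_scorer("neg_mean_squared_error")
score = scorer(model, X, y)
print(score)  # negative of the MSE; values closer to 0 are better
```

    Because scikit-learn always maximizes scores, flipping the sign is what lets an error metric plug into the same machinery as accuracy or R².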

    So how did that affect results? The below images show what happened.

    While the results aren’t much better, they are more valid because it was a mistake to use F1 scoring in the first place. Using F1 was wrong because:

    • The F1 score is defined for binary classification problems, while I am fitting a regressor with continuous outputs.
    • F1 needs discrete class labels, not continuous outputs.
    • When used in regression, scikit-learn would have forced predictions into binary labels, which distorts the optimization objective.
    • Instead of minimizing prediction error, it tried to maximize F1 on binarized outputs.

    Reflections

    • The "f1"-optimized model accidentally landed on a slightly better MSE, but this is not reliable or reproducible.
    • The "neg_mean_squared_error" model was explicitly optimized for MSE, so its performance is trustworthy and aligned with my regression goals.
    • The small difference could simply be due to random variation or hyperparameter overlap, not because "f1" is a viable scoring metric here.

    In summary, using "f1" in regression is methodologically invalid. Even if it produces a superficially better score, it’s optimizing the wrong objective and introduces unpredictable behavior.

    In my next post I will try some more parameters and also RandomizedSearchCV.

    – William