Tag: artificial-intelligence

  • Using Random Forests to analyze student performance

    Using Random Forests to analyze student performance

    In this post, I’ll walk through my first notebook exploring random forests. I’m using a dataset I found on Kaggle: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams/data.

    This is a small, simulated dataset that contains data for gender, ethnicity, the level of education attained by the parents, the type of lunch (free or standard), whether the student took a test preparation course, and the scores for math, reading, and writing.

    I’ll try my hand at using random forests to understand how important various features are to student performance.

    Step 1: Clean up the data

    After reading the data into a DataFrame, we do a quick check on the quality of the data. I check for simple things like empty values and duplicates using Polars APIs.

    Below is the code from our notebook cells:

    In these cells:

    • Check for null data.
    • Check for duplicate rows and remove the duplicates.

    This makes sure we correct and/or remove any bad data before we start processing.
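
    A sketch of what these cells could look like with Polars (the file name and the variable name df are assumptions):

    ```python
    import polars as pl

    # Load the Kaggle CSV (path is illustrative).
    df = pl.read_csv("StudentsPerformance.csv")

    # Check for null values in every column.
    print(df.null_count())

    # Count duplicate rows, then drop them.
    print(f"duplicate rows: {df.is_duplicated().sum()}")
    df = df.unique()
    ```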

    Step 2: Inspect the data

    Now that the data is cleaned up, we can create some visualizations. The first ones I’ll create are histograms of the math, reading, and writing scores. Histograms are one of the most foundational, and surprisingly powerful, visual tools in a data scientist’s toolkit.

    Below is the code from three notebook cells to generate the histograms:

    Histograms allow us to:

    • See whether the data is symmetrical or not.
    • See if there are a lot of outliers that could impact model performance.
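
    A sketch of one of the histogram cells (the same pattern repeats for the reading and writing scores; df is the cleaned DataFrame and the column names are assumed to match the Kaggle file):

    ```python
    import matplotlib.pyplot as plt

    # Histogram of the math scores; repeat with "reading score" and "writing score".
    plt.hist(df["math score"].to_list(), bins=20, edgecolor="black")
    plt.title("Distribution of math scores")
    plt.xlabel("math score")
    plt.ylabel("number of students")
    plt.show()
    ```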

    Next we’ll look at some boxplots. Boxplots are good for summarizing the distribution of the data.

    Boxplots allow us to visualize:

    • The median value of our features. The median represents the central tendency.
    • The interquartile range (IQR), showing the middle 50% of data.
    • The min and max values (excluding outliers).
    • Data outliers. Outliers are represented by the circles outside of 1.5 * the IQR.
    • Skewness. We can see if the median is very close to the top or bottom of the box.
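
    A sketch of a boxplot cell under the same assumptions:

    ```python
    import matplotlib.pyplot as plt

    # Side-by-side boxplots of the three scores.
    score_cols = ["math score", "reading score", "writing score"]
    plt.boxplot([df[c].to_list() for c in score_cols])
    plt.xticks([1, 2, 3], score_cols)
    plt.ylabel("score")
    plt.title("Score distributions")
    plt.show()
    ```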

    Next, we’ll look at a heatmap. Heatmaps (or heatplots) are really powerful data visualizations: they let you see relationships between variables at a glance, especially when you’re dealing with large datasets or multiple features.

    Heatmaps allow us to visualize:

    • Correlations: Bright colors show strong positive or negative correlations while faded or neutral colors imply weak or no relationship.
    • Spotting Patterns: We can quickly identify where performance clusters, or drops, occur.
    • Identifying Anomalies: Visual blips can point to data quality problems.
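
    A sketch of a correlation heatmap over the score columns, using numpy and matplotlib (at this point only the three scores are numeric):

    ```python
    import matplotlib.pyplot as plt
    import numpy as np

    score_cols = ["math score", "reading score", "writing score"]
    corr = np.corrcoef(df.select(score_cols).to_numpy(), rowvar=False)

    plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    plt.colorbar(label="correlation")
    plt.xticks(range(len(score_cols)), score_cols, rotation=45)
    plt.yticks(range(len(score_cols)), score_cols)
    plt.title("Correlation between exam scores")
    plt.show()
    ```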

    Step 3: Encoding Categorical Variables

    The next step is to convert our categorical columns to a numeric format using scikit-learn’s LabelEncoder.

    Below is the code from our notebook cell:

    In that cell:

    • I instantiate a LabelEncoder object.
    • I get the names of the columns that need to be encoded by iterating over the columns in the DataFrame and filtering for those whose type is a string.
    • I create encoded data for each of those columns under a new name with “_num” appended.
    • Lastly, I create a new DataFrame that combines the new columns with the original DataFrame.
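
    A sketch of the encoding cell (variable names are illustrative):

    ```python
    import polars as pl
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()

    # The string-typed columns are the categorical ones.
    columns_to_encode = [name for name, dtype in df.schema.items() if dtype == pl.Utf8]

    # Build a "<name>_num" Series holding the encoded values for each categorical column.
    encoded = [
        pl.Series(f"{name}_num", encoder.fit_transform(df[name].to_list()))
        for name in columns_to_encode
    ]

    # Combine the new columns with the original DataFrame.
    df_encoded = df.with_columns(encoded)
    ```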

    Step 4: Remove the non-numeric columns

    This is a simple step where I select only the columns that are integers.

    Below is the code from our notebook cell:

    In that cell:

    • Iterate over the columns, filter for those whose type is integer, and use that list in the select function.
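
    A sketch of that selection (df_numeric is the frame the later steps train on):

    ```python
    import polars as pl

    # Keep only the integer columns: the original scores plus the new *_num columns.
    int_cols = [
        name for name, dtype in df_encoded.schema.items() if dtype in (pl.Int32, pl.Int64)
    ]
    df_numeric = df_encoded.select(int_cols)
    ```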

    Now we can create a heatmap that includes the encoded data too.

    Step 5: Train models for math, reading and writing

    Now it’s time to build, train, and evaluate our model. I repeat this step for each of the math, reading and writing scores. I’ll only show the math cell here as they do the same thing.

    In that cell:

    • Drop the score columns from the dataframe.
    • Choose “math score” as my category column.
    • Split the data and create a RandomForestRegressor model.
    • Train the model against the data.
    • Use the model to predict values and measure the accuracy.
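
    A sketch of the math-score cell (the reading and writing cells differ only in the target column):

    ```python
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    # Features are everything except the three score columns; the target is the math score.
    X = df_numeric.drop(["math score", "reading score", "writing score"])
    y = df_numeric["math score"]

    X_train, X_test, y_train, y_test = train_test_split(
        X.to_numpy(), y.to_numpy(), test_size=0.2, random_state=42
    )

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    # Predict on the held-out data and measure accuracy with R^2.
    y_pred = model.predict(X_test)
    print(f"R^2 on the test set: {r2_score(y_test, y_pred):.3f}")
    ```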

    The R² score gives a sense of how well the predictors capture the variation in your target. Put another way: how much better is my model at predicting Y than just guessing the average of Y every time?

    • R² = 1: indicates a perfect fit.
    • R² = 0: the model is no better than predicting the mean.
    • R² < 0: the model is worse than the mean.

    Step 6: Visualize feature importance to the math score

    Now we can create a bar chart to visualize the relative importance of our features to the math score.

    In that cell:

    • I grab all the feature columns.
    • Map the columns to the model’s feature_importances_ values.
    • Generate a plot.

    The higher the value in feature_importances_, the more important the feature.
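
    A sketch of that cell, reusing X and model from the training step:

    ```python
    import matplotlib.pyplot as plt

    # Pair each feature name with its importance and plot them as a bar chart.
    plt.bar(X.columns, model.feature_importances_)
    plt.xticks(rotation=45, ha="right")
    plt.ylabel("importance")
    plt.title("Feature importance for the math score model")
    plt.tight_layout()
    plt.show()
    ```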

      Final Thoughts and Next Steps

      In this first step into learning about Random Forests, we can see that they are a powerhouse in the world of data science. Random Forests are built on the idea of the “wisdom of the crowd”: by combining many decision trees trained on random subsets of data and features, they reduce overfitting and improve generalization.

      The new Jupyter notebook can be found here in my GitHub.

      – William

    • Deep Dive Into Random Forests

      Deep Dive Into Random Forests

      In today’s post, I’ll take an in-depth look at Random Forests, one of the most popular and effective algorithms in the data science toolkit. I’ll describe what I learned about how they work, their components and what makes them tick.

      What Are Random Forests?

      At its heart, a random forest is an ensemble of decision trees working together.

      • Decision Trees: Each tree is a model that makes decisions by splitting data based on certain features.
      • Ensemble Approach: Instead of relying on a single decision tree, a random forest builds many trees from bootstrapped samples of your data. The prediction from the forest is then derived by averaging (for regression) or taking a majority vote (for classification).

      This approach reduces the variance typical of individual trees and builds a robust model that handles complex feature interactions with ease.

      The Magic Behind the Method

      1. Bootstrap Sampling

      Each tree in the forest is trained on a different subset of the data, selected with replacement. This process, known as bagging (Bootstrap Aggregating), means that roughly 37% of the data isn’t used by any given tree. This leftover data, the out-of-bag (OOB) set, can later be used to internally validate the model without needing a separate validation set.

      2. Random Feature Selection

      At every decision point within a tree, instead of considering every feature, the algorithm randomly selects a subset. This randomness:

      • De-correlates Trees: The trees become less alike, ensuring that the ensemble doesn’t overfit or lean too heavily on one feature.
      • Reduces Variance: Averaging predictions across diverse trees smooths out misclassifications or prediction errors.

      3. Aggregating Predictions

      For classification tasks, each tree casts a vote for a class, and the class with the highest number of votes becomes the model’s prediction.

      For regression tasks, predictions are averaged to produce a final value. This collective approach generally results in higher accuracy and more stable predictions.

      Out-of-Bag (OOB) Error

      An important feature of random forests is the OOB error estimate.

      • What It Is: Each tree is trained on a bootstrap sample, leaving out a set of data that can serve as a mini-test set.
      • Why It Counts: Aggregating predictions on these out-of-bag samples can offer an estimate of the model’s test error.

      This feature can be really handy, especially when you’re working with limited data and want to avoid setting aside a large chunk of it for validation.
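
      As a small illustration with scikit-learn, a forest can report its OOB estimate directly (the data here is a synthetic stand-in, not the student dataset):

      ```python
      from sklearn.datasets import make_regression
      from sklearn.ensemble import RandomForestRegressor

      # Synthetic data, just to show the mechanics.
      X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

      # oob_score=True uses each tree's left-out samples as an internal validation set.
      forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
      forest.fit(X, y)

      print(f"OOB R^2 estimate: {forest.oob_score_:.3f}")
      ```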

      Feature Importance

      Random forests don’t just predict; they can also help you understand your data:

      • Mean Decrease in Impurity (MDI): This measure tallies how much each feature decreases impurity (based on measures like the Gini index) across all trees.
      • Permutation Importance: By shuffling a feature’s values and measuring the resulting drop in accuracy, we can estimate that feature’s importance. This helps when you need to interpret the model and communicate which features are most influential.
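
      A short sketch contrasting the two measures, reusing X, y, and RandomForestRegressor from the snippet above:

      ```python
      from sklearn.inspection import permutation_importance
      from sklearn.model_selection import train_test_split

      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
      forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

      # MDI importances come for free after training.
      print("MDI importances:", forest.feature_importances_)

      # Permutation importance shuffles each feature on held-out data
      # and records the resulting drop in score.
      result = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)
      print("Permutation importances:", result.importances_mean)
      ```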

      Pros and Cons

      Advantages:

      • Can handle Non-Linear Data: Naturally captures complex feature interactions.
      • Can handle Noise & Outliers: Ensemble averaging minimizes overfitting.
      • Doesn’t need a lot of Preprocessing: No need for extensive data scaling or transformation.

      Disadvantages:

      • Can be Memory Intensive: Storing hundreds of trees can be demanding.
      • Slower than a Single Tree: Compared to a single decision tree, the ensemble approach requires more processing power.
      • Harder to Interpret: The combination of multiple trees makes the model harder to interpret than an individual tree.

      Summary

      Random Forests are a powerful next step in my journey. With their ability to reduce variance through ensemble learning and their built-in validation mechanisms like OOB error, they offer both performance and insight.

      In my next post, I’ll share how I apply the Random Forest technique to this data set: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams/data

      – William

    • Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

      Exploring the Impact of Alcohol Consumption on Student Grades with Gaussian Naive Bayes

      In today’s data-driven world, even seemingly straightforward questions can reveal surprising insights. In this post, I investigate whether students’ alcohol consumption habits bear any relationship to their final math grades. Using the Student Alcohol Consumption dataset from Kaggle, which contains survey responses on myriad aspects of students’ lives—ranging from study habits and social factors to gender and alcohol use—I set out to determine if patterns exist that can predict academic performance.

      Dataset Overview

      The dataset originates from a survey of students enrolled in secondary school math and Portuguese courses. It includes rich social and academic information, such as:

      • Social and family background
      • Study habits and academic support
      • Alcohol consumption details during weekdays and weekends

      I focused on predicting the final math grade (denoted as G3 in the raw data) while probing how alcohol-related features, especially weekend consumption, might play a role in performance. The question wasn’t just whether students drank, but which drinking pattern might be more telling of their academic results.

      Data Preprocessing: Laying the Groundwork

      Before diving into modeling, the data needed some cleanup. Here’s how I systematically prepared the dataset for analysis:

      1. Loading the Data: I imported the CSV into a Pandas DataFrame for easy manipulation.
      2. Renaming Columns: Clarity matters. I renamed ambiguous columns for better readability (e.g., renaming walc to weekend_alcohol and dalc to weekday_alcohol).
      3. Label Encoding: Categorical data were converted to numeric representations using scikit-learn’s LabelEncoder, ensuring all features could be numerically processed.
      4. Reusable Code: I encapsulated the training and testing phases within a reusable function, which made it straightforward to test different feature combinations.

      Here are some snippets:

      In those cells:

      • I rename columns to make them more readable.
      • I instantiate a LabelEncoder object and encode a list of columns that have string values.
      • I add an absence category to normalize the absence counts a little, since that data is quite variable.
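
      A sketch of that preprocessing, assuming pandas as described above (the file name, source column names, and absence bins are assumptions):

      ```python
      import pandas as pd
      from sklearn.preprocessing import LabelEncoder

      # Load the survey data (file name is illustrative).
      df = pd.read_csv("student-mat.csv")

      # Rename the ambiguous alcohol columns.
      df = df.rename(columns={"Walc": "weekend_alcohol", "Dalc": "weekday_alcohol"})

      # Encode every string-valued column as integers.
      encoder = LabelEncoder()
      for col in df.select_dtypes(include="object").columns:
          df[col] = encoder.fit_transform(df[col])

      # Bucket the highly variable absence counts into a coarse category.
      df["absence_category"] = pd.cut(
          df["absences"], bins=[-1, 0, 5, 15, float("inf")], labels=[0, 1, 2, 3]
      ).astype(int)
      ```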

      Experimenting With Gaussian Naive Bayes

      The heart of this exploration was to see how well a Gaussian Naive Bayes classifier could predict the final math grade based on different selections of features. Naive Bayes, while greatly valued for its simplicity and speed, operates under the assumption that features are independent—a condition that might not fully hold in educational data.

      Training and Evaluation Function

      To streamline the experiments, I wrote a function that:

      • Splits the data into training and testing sets.
      • Trains a GaussianNB model.
      • Evaluates accuracy on the test set.

      In that cell:

      • I create a function that:
        • Drops unwanted columns.
        • Runs 100 training cycles with the given data.
        • Captures the accuracy measured from each run and returns the average.
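
      A sketch of that function (here it takes the feature columns to keep rather than the columns to drop, but the idea is the same; the target column name is an assumption):

      ```python
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB


      def average_accuracy(df, feature_cols, target_col="G3", runs=100):
          """Train GaussianNB `runs` times on the given features and return the mean accuracy."""
          X = df[feature_cols]
          y = df[target_col]

          scores = []
          for _ in range(runs):
              # A fresh random split each run, so the average smooths out split-to-split variance.
              X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
              model = GaussianNB().fit(X_train, y_train)
              scores.append(accuracy_score(y_test, model.predict(X_test)))

          return sum(scores) / len(scores)
      ```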

      Single and Two column sampling

      In those cells:

      • I get a list of all columns.
      • I create loop(s) over the column list and create a list of features to test.
      • I call my function to measure the accuracy of the features at predicting student grades.
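
      A sketch of the sampling loops, reusing the average_accuracy function from the previous sketch:

      ```python
      from itertools import combinations

      # Every column except the target is a candidate feature.
      candidate_cols = [c for c in df.columns if c != "G3"]

      # Score single features and all two-feature combinations.
      results = []
      for size in (1, 2):
          for combo in combinations(candidate_cols, size):
              results.append((average_accuracy(df, list(combo)), combo))

      # Show the 20 best-performing combinations.
      for acc, combo in sorted(results, reverse=True)[:20]:
          print(f"{acc:.3f}  {combo}")
      ```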

      Diving Into Feature Combinations

      I aimed to assess the predictive power by testing different combinations of features:

      1. All Columns: This gave the best accuracy of around 22%, yet it was clear that even the full spectrum of information struggled to make strong predictions.
      2. Handpicked Features: I manually selected features that I hypothesized might be influential. The resulting accuracy dipped below that of the full dataset.
      3. Individual Features: Evaluating each feature solo revealed that the column indicating whether students planned to pursue higher education yielded the highest individual accuracy—though still far lower than all features combined.
      4. Two-Feature Combinations: By testing all pairs, I noticed that combinations including weekend alcohol consumption appeared in the top 20 predictive pairs four times, including in both of the top two.
      5. Three-Feature Combinations: The trend became stronger—combinations featuring weekend alcohol consumption topped the list ten times and were present in each of the top three combinations!
      6. Four-Feature Combinations: Here, weekend alcohol consumption featured in the top 20 combination results even more robustly—15 times in total.

      These experiments showcased one noteworthy pattern: weekend alcohol consumption consistently emerged as a common denominator in the best-performing feature combinations, while weekday consumption rarely made an appearance.

      Analysis of the Findings

      Several key observations emerged from this series of experiments:

      • Predictive Accuracy: Even with the full set of features, the best accuracy reached was only around 22%. This underwhelming performance is indicative of the challenges posed by the dataset and the restrictive assumptions embedded within the Naive Bayes model.
      • Role of Alcohol Consumption: The repeated appearance of weekend alcohol consumption in high-ranking feature combinations suggests a potential association—it may capture lifestyle or social habits that indirectly correlate with academic performance. However, it is not a standalone predictor; rather, it seems to be relevant as part of a multifactorial interaction.
      • Model Limitations: The Gaussian Naive Bayes classifier assumes feature independence. The complexities inherent in student performance—where multiple social, educational, and psychological factors interact—likely violate this assumption, leading to lower predictive performance.

      Conclusion and Future Directions

      While the Gaussian Naive Bayes classifier provided some interesting insights, especially regarding the recurring presence of weekend alcohol consumption in influential feature combinations, its overall accuracy was modest. Predicting the final math grade, a multifaceted outcome influenced by numerous interdependent factors, appears too challenging for this simplistic probabilistic model.

      Next Steps:

      • Alternative Machine Learning Algorithms: Investigating other approaches like decision trees, random forests, support vector machines, or ensemble methods may yield better performance.
      • Enhanced Feature Engineering: Incorporating interaction terms or domain-specific features might help capture the complex relationships between social habits and academic outcomes.
      • Broader Data Explorations: Diving deeper into other factors—such as study habits, parental support, and extracurricular involvement—could provide additional clarity.

      Final Thoughts and Next Steps

      This journey reinforced the idea that while Naive Bayes is a great tool for its speed and interpretability, it might not be the best choice for all datasets. More sophisticated models and careful feature engineering are necessary when dealing with complex outcomes like student academic performance.

      The new Jupyter notebook can be found here in my GitHub.

      – William

    • Leveraging Scikit-Learn and Polars to Test a Naive Bayes Classifier

      Leveraging Scikit-Learn and Polars to Test a Naive Bayes Classifier

      In today’s post, I use scikit-learn with the same sample dataset I used in the previous post. I use the LabelEncoder to encode the strings as numeric values and the GaussianNB to train and test a Gaussian Naive Bayes classifier and to predict the class of an example record. While many tutorials use pandas, I use Polars for fast data manipulation alongside scikit-learn for model development.

      Understanding Our Data and Tools

      Remember that the dataset includes ‘features’ for height, weight, and foot size. It also has a categorical field for gender. Because classifiers like Gaussian Naive Bayes require numeric inputs, I need to transform the string gender values into a numeric format.

      In my new Jupyter notebook I use two libraries:

      Scikit-Learn for its machine learning utilities. Specifically, LabelEncoder for encoding and GaussianNB for classification.

      Polars for fast, efficient DataFrame manipulations.

      Step 1: Encoding Categorical Variables

      The first step is to convert our categorical column (gender) to a numeric format using scikit-learn’s LabelEncoder. This conversion is vital because machine learning models generally can’t work directly with string labels.

      Below is the code from our first notebook cell:

      In that cell:

      • I instantiate a LabelEncoder object.
      • For every feature in columns_to_encode (in this case, just "gender"), I create a new Polars Series with the suffix "_num", containing the encoded numeric values.
      • Finally, I add these series as new columns to our original DataFrame.

      This ensures that our categorical data is transformed into a machine-friendly format while also preserving the human-readable string values for future reference.
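
      A sketch of that cell (the sample values are illustrative; the previous post describes the actual dataset):

      ```python
      import polars as pl
      from sklearn.preprocessing import LabelEncoder

      # Small sample dataset with a categorical gender column (values are illustrative).
      df = pl.DataFrame(
          {
              "gender": ["male", "male", "male", "male", "female", "female", "female", "female"],
              "height": [6.00, 5.92, 5.58, 5.92, 5.00, 5.50, 5.42, 5.75],
              "weight": [180, 190, 170, 165, 100, 150, 130, 150],
              "foot_size": [12, 11, 12, 10, 6, 8, 7, 9],
          }
      )

      encoder = LabelEncoder()
      columns_to_encode = ["gender"]

      # Add a "<name>_num" column with the encoded values for each categorical column.
      df = df.with_columns(
          [
              pl.Series(f"{col}_num", encoder.fit_transform(df[col].to_list()))
              for col in columns_to_encode
          ]
      )
      ```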

      Step 2: Mapping Encoded Values to Original Labels

      Once we’ve encoded the data, it’s important to retain the mapping between the original string values and their corresponding numeric codes. This mapping is particularly useful when you want to interpret or display the model’s predictions.

      The following code block demonstrates how to generate and view this mapping:

      In that cell:

      • I save the original "gender" column and its encoded counterpart "gender_num".
      • By grouping on "gender" and aggregating with the first encountered numeric value, I create a mapping from string labels to numerical codes.
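
      A sketch of the mapping, continuing from the DataFrame above:

      ```python
      import polars as pl

      # Map each original gender label to its numeric code.
      gender_mapping = dict(
          df.group_by("gender").agg(pl.col("gender_num").first()).iter_rows()
      )
      print(gender_mapping)  # e.g. {"female": 0, "male": 1}
      ```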

      Step 3: Training and Testing the Gaussian Naive Bayes Classifier

      Now it’s time to build, train, and evaluate our model. I separate the features and target, split the data, and then initialize the classifier.

      In that cell:

      • Get the data to use in training: I drop the raw “gender” column and its encoded version from the DataFrame to form the features (X) and save the encoded classification as the target (y).
      • Data Splitting: train_test_split is used to randomly partition the data into training and testing sets.
      • Model Training: A GaussianNB classifier is instantiated and trained on the training data using the fit() method.
      • Prediction and Evaluation: The model’s predictions on the test set (y_pred) are generated and compared against the true labels using accuracy_score. This gives us a quantitative measure of the model’s performance.
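
      A sketch of the training and evaluation cell, continuing with the same DataFrame:

      ```python
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB

      # Features: everything except the raw and encoded gender columns; target: the encoded gender.
      X = df.drop(["gender", "gender_num"]).to_numpy()
      y = df["gender_num"].to_numpy()

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

      model = GaussianNB()
      model.fit(X_train, y_train)

      y_pred = model.predict(X_test)
      print(f"accuracy: {accuracy_score(y_test, y_pred):.2f}")
      ```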

      Step 4: Classifying a New Record

      Now I can test it on the sample observation. Consider the following code snippet:

      In that cell:

      • Create Example Data: I define a new sample record (with features like height, weight, and foot size) and create a Polars DataFrame to hold this record.
      • Prediction: The classifier is then used to predict the gender (encoded as a number) for this new record.
      • Decoding: Use the gender_mapping to display the human-readable gender label corresponding to the model’s prediction.
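
      A sketch of classifying a new record, reusing model and gender_mapping from above (the sample values are illustrative):

      ```python
      # A new observation to classify: height, weight, foot size (same column order as X).
      sample = pl.DataFrame({"height": [6.0], "weight": [130], "foot_size": [8]})

      predicted_code = model.predict(sample.to_numpy())[0]

      # Invert the mapping to recover the human-readable label.
      code_to_label = {code: label for label, code in gender_mapping.items()}
      print(f"predicted gender: {code_to_label[predicted_code]}")
      ```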

      Final Thoughts and Next Steps

      This step-by-step notebook shows how to preprocess data, map categorical values, train a Gaussian Naive Bayes classifier, and test new data with the combination of Polars and scikit-learn.

      The new Jupyter notebook can be found here in my GitHub. If you follow the instructions in my previous post you can run this notebook for yourself.

      – William

    • Essential Tools and Services for My Data Science Learning Journey

      Essential Tools and Services for My Data Science Learning Journey

      Hello again, everyone! As I start on this journey into the world of data science, I want to share the tools and services I’ll be using. Each of these tools has features that make them valuable for data science applications. Let’s dive in and explore why they’re so valuable:

      GitHub.com

      GitHub is a web-based platform that allows developers to collaborate on code, manage projects, and track changes in their codebase using version control (Git). For data science, GitHub is incredibly useful because:

      • Collaboration: It lets me collaborate with others on data science projects, share my work, and receive feedback from the community.
      • Version Control: I can keep track of changes in my code, experiment with different versions, and easily revert to previous states if needed.
      • Open-Source Projects: I can explore open-source data science projects, learn from others’ work, and contribute to the community.

      Kaggle.com

      Kaggle is a platform dedicated to data science and machine learning. It offers a wide range of datasets, competitions, and learning resources. Kaggle is a must-have for my journey because of its:

      • Datasets: Kaggle provides access to a vast collection of datasets across various domains, including education, which I plan to use for my projects.
      • Competitions: Participating in Kaggle competitions allows me to apply my skills to real-world problems, learn from others, and gain valuable experience.
      • Learning Resources: Kaggle offers tutorials, code notebooks, and forums where I can learn new techniques, ask questions, and improve my understanding of data science concepts.

      Python

      Python is a versatile and widely-used programming language in the data science community. Its popularity stems from several key features:

      • Readability: Python’s syntax is clean and easy to understand, making it an excellent choice for beginners.
      • Libraries: Python has many libraries and frameworks for data analysis, machine learning, and visualization, such as NumPy, pandas, and scikit-learn.
      • Community Support: The Python community is large and active, providing extensive documentation, tutorials, and forums to help me along my learning journey.

      JupyterLab

      JupyterLab is an interactive development environment that allows me to create and share documents containing live code, equations, visualizations, and narrative text. Its benefits for data science include:

      • Interactive Coding: I can write and execute code in small, manageable chunks, making it easier to test and debug my work.
      • Visualization: JupyterLab supports rich visualizations, enabling me to create and display graphs, charts, and plots within my notebooks.
      • Documentation: I can document my thought process, findings, and insights alongside my code, creating comprehensive and reproducible reports.

      Polars DataFrames

      Polars is a fast and efficient DataFrame library written in Rust, with a Python interface. It is designed to handle large datasets and perform complex data manipulations. Polars is a valuable addition to my toolkit because of its:

      • Performance: Polars is optimized for performance, making it great for handling large datasets and performing computationally intensive tasks.
      • Memory Efficiency: It uses less memory compared to traditional DataFrame libraries, which will help when working with large data.
      • Flexible API: Polars provides a flexible and intuitive API that allows me to perform various data manipulation tasks, such as filtering, grouping, and aggregating data.

      Black

      Black is a code formatter for Python that ensures my code sticks to styling and readability standards. Black is an essential tool for my data science projects because of its:

      • Consistency: Black will automatically format my code to follow best practices, making it more readable and maintainable.
      • Efficiency: With Black taking care of formatting, I can focus on writing code and solving problems.
      • Integration: Black can be easily integrated with JupyterLab, so that my code remains consistently formatted within my notebooks.

      By leveraging these tools and services, I will be well-equipped to dive into the world of data science and tackle exciting projects. Stay tuned as I share my experiences and discoveries along the way!

      – William