Tag: artificialintelligence

  • Using Random Forests to analyze student performance

    Using Random Forests to analyze student performance

    In this post, I’ll walk through my first notebook exploring random forests. I’m using a dataset from Kaggle, which can be found here: https://www.kaggle.com/datasets/whenamancodes/students-performance-in-exams/data.

    This is a small, simulated dataset containing each student’s gender, ethnicity, parents’ level of education, lunch type (free or standard), whether the student took a test preparation course, and the scores for math, reading and writing.

    I’ll try my hand at using random forests to understand the importance of various features on student performance.

    Step 1: Clean up the data

    After reading the data into a DataFrame, we do a quick check on the quality of the data. I check for simple things like empty values and duplicate rows using Polars APIs.

    Below is the code from our notebook cell:
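
    A minimal sketch of that cell, assuming the Kaggle CSV was saved locally (the file name here is a placeholder):

    ```python
    import polars as pl

    # Read the Kaggle CSV into a Polars DataFrame (file name assumed)
    df = pl.read_csv("exams.csv")

    # Check for null values in every column
    print(df.null_count())

    # Check for duplicate rows, then drop them
    print("Duplicate rows:", df.is_duplicated().sum())
    df = df.unique()
    ```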

    In these cells:

    • Check for null data.
    • Check for duplicate rows and remove the duplicates.

    This makes sure we correct or remove any bad data before we start processing.

    Step 2: Inspect the data

    Now that the data is cleaned up, we can create some visualizations. The first ones I’ll create are histograms of the math, reading and writing scores. Histograms are one of the most foundational, and surprisingly powerful, visual tools in a data scientist’s toolkit.

    Below is the code from three notebook cells to generate the histograms:
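
    A sketch of those cells, condensed into a single loop here and assuming matplotlib as the plotting library:

    ```python
    import matplotlib.pyplot as plt

    # One histogram per score column (column names from the Kaggle dataset)
    for col in ["math score", "reading score", "writing score"]:
        plt.figure()
        plt.hist(df[col].to_list(), bins=20, edgecolor="black")
        plt.title(f"Distribution of {col}")
        plt.xlabel(col)
        plt.ylabel("Number of students")
        plt.show()
    ```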

    Histograms allow us to:

    • See whether the data is symmetrical or not.
    • See if there are a lot of outliers that could impact model performance.

    Next we’ll look at some boxplots. Boxplots are good for summarizing the distribution of the data.

    Boxplots allow us to visualize:

    • The median value of our features. The median represents the central tendency.
    • The interquartile range (IQR), showing the middle 50% of data.
    • The min and max values (excluding outliers).
    • Data outliers. Outliers are drawn as circles that fall more than 1.5 * the IQR beyond the edges of the box.
    • Skewness. We can see if the median sits very close to the top or bottom of the box.
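
    A quick sketch of how such boxplots could be produced, again assuming matplotlib:

    ```python
    import matplotlib.pyplot as plt

    score_cols = ["math score", "reading score", "writing score"]

    # One box per score column, drawn side by side
    plt.boxplot([df[c].to_list() for c in score_cols])
    plt.xticks(range(1, len(score_cols) + 1), score_cols)
    plt.ylabel("Score")
    plt.title("Score distributions")
    plt.show()
    ```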

    Next, we’ll look at a heatmap. Heatmaps (or heatplots) are really powerful data visualizations: they let you see relationships between variables at a glance, especially when you’re dealing with large datasets or many features.

    Heatmaps allow us to visualize:

    • Correlations: Bright colors show strong positive or negative correlations while faded or neutral colors imply weak or no relationship.
    • Spotting Patterns: We can quickly identify where performance clusters, or drops, occur.
    • Identifying Anomalies: Visual blips can point to data quality problems.
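
    A sketch of a simple correlation heatmap over the three score columns, assuming numpy and matplotlib:

    ```python
    import matplotlib.pyplot as plt
    import numpy as np

    score_cols = ["math score", "reading score", "writing score"]
    corr = np.corrcoef(df.select(score_cols).to_numpy(), rowvar=False)

    plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
    plt.colorbar(label="Correlation")
    plt.xticks(range(len(score_cols)), score_cols, rotation=45)
    plt.yticks(range(len(score_cols)), score_cols)
    plt.title("Correlation between scores")
    plt.tight_layout()
    plt.show()
    ```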

    Step 3: Encoding Categorical Variables

    The next step is to convert our categorical columns to a numeric format using scikit-learn’s LabelEncoder.

    Below is the code from our notebook cell:
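
    A minimal sketch of that cell (variable names are assumptions; the steps match the list below):

    ```python
    import polars as pl
    from sklearn.preprocessing import LabelEncoder

    le = LabelEncoder()

    # Columns whose dtype is a string need encoding
    columns_to_encode = [col for col, dtype in zip(df.columns, df.dtypes) if dtype == pl.Utf8]

    # Encode each of those columns into a new "<name>_num" series
    encoded = [
        pl.Series(f"{col}_num", le.fit_transform(df[col].to_list()))
        for col in columns_to_encode
    ]

    # Combine the new columns with the original DataFrame
    df_encoded = df.with_columns(encoded)
    ```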

    In that cell:

    • I instantiate a LabelEncoder object.
    • I get the names of the columns that need to be encoded by iterating over the columns in the DataFrame and filtering where the type of the column is a string.
    • I create encoded data for each of those columns with a new name appended with “_num”.
    • Lastly, I create a new DataFrame that combines the new columns I created with the original DataFrame.

    Step 4: Remove the non-numeric columns

    This is a simple step, where I select only the columns that are integers.

    Below is the code from our notebook cell:
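
    A sketch of that selection, assuming the encoded frame from the previous step is df_encoded:

    ```python
    import polars as pl

    # Keep only the integer-typed columns: the three scores plus the encoded categoricals
    int_cols = [
        col for col, dtype in zip(df_encoded.columns, df_encoded.dtypes)
        if dtype in (pl.Int32, pl.Int64)
    ]
    df_numeric = df_encoded.select(int_cols)
    ```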

    In that cell:

    • Iterate over the columns, filter for those whose type is integer, and use that list in the select function.

    Now we can create a heatmap that includes the encoded data too.

    Step 5: Train models for math, reading and writing

    Now it’s time to build, train, and evaluate our model. I repeat this step for each of the math, reading and writing scores. I’ll only show the math cell here, since they all do the same thing (a sketch of it follows the list below).

    In that cell:

    • Drop the score columns from the dataframe.
    • Choose “math score” as my target column.
    • Split the data and create a RandomForestRegressor model.
    • Train the model against the data.
    • Use the model to predict values and measure the accuracy.
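
    Putting those steps together, a sketch of the math cell might look like this (the split size, number of trees, and random seed are assumptions):

    ```python
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import train_test_split

    score_cols = ["math score", "reading score", "writing score"]

    # Features: everything except the three score columns; target: the math score
    X_df = df_numeric.drop(score_cols)
    X = X_df.to_numpy()
    y = df_numeric["math score"].to_numpy()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)

    y_pred = model.predict(X_test)
    print("R² on the test set:", r2_score(y_test, y_pred))
    ```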

    The R² score gives a sense of how well the predictors capture the ups and downs in your target: R² = 1 - SS_res/SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the total squared variation of the target around its mean. Or: how much better is my model at predicting Y than just guessing the average of Y every time?

    • R² = 1: indicates a perfect fit.
    • R² = 0: the model is no better than predicting the mean.
    • R² < 0: the model is worse than the mean.

    Step 6: Visualize feature importance to the math score

    Now we can create a bar chart to visualize the relative importance of our features to the math score.

    In that cell:

    • I grab all the feature columns.
    • Map the columns to the model’s feature_importances_ values.
    • Generate a plot.

    The higher the value in feature_importances_, the more important the feature.
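
    A sketch of that plot with matplotlib, reusing X_df and model from the training step above:

    ```python
    import matplotlib.pyplot as plt

    # feature_importances_ is aligned with the columns the model was trained on
    plt.barh(X_df.columns, model.feature_importances_)
    plt.xlabel("Relative importance")
    plt.title("Feature importance for the math score")
    plt.tight_layout()
    plt.show()
    ```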

      Final Thoughts and Next Steps

      In this first step into learning about Random Forests, we can see they are a powerhouse in the world of data science. Random Forests are built on the idea of the “wisdom of the crowd”: by combining many decision trees trained on random subsets of data and features, they reduce overfitting and improve generalization.

      The new Jupyter notebook can be found here in my GitHub.

      – William

    • Leveraging Scikit-Learn and Polars to Test a Naive Bayes Classifier

      Leveraging Scikit-Learn and Polars to Test a Naive Bayes Classifier

      In today’s post, I use scikit-learn with the same sample dataset I used in the previous post. I use the LabelEncoder to encode the strings as numeric values, and then GaussianNB to train and test a Gaussian Naive Bayes classifier model and predict the class of an example record. While many tutorials use pandas, I use Polars for fast data manipulation alongside scikit-learn for model development.

      Understanding Our Data and Tools

      Remember that the dataset includes ‘features’ for height, weight, and foot size. It also has a categorical field for gender. Because classifiers like Gaussian Naive Bayes require numeric inputs, I need to transform the string gender values into a numeric format.

      In my new Jupyter notebook I use two libraries:

      Scikit-Learn for its machine learning utilities. Specifically, LabelEncoder for encoding and GaussianNB for classification.

      Polars for fast, efficient DataFrame manipulations.

      Step 1: Encoding Categorical Variables

      The first step is to convert our categorical column (gender) to a numeric format using scikit-learn’s LabelEncoder. This conversion is vital because machine learning models generally can’t work directly with string labels.

      Below is the code from our first notebook cell:
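
      A minimal sketch of that cell, assuming df is the Polars DataFrame holding the height, weight, foot size, and gender columns:

      ```python
      import polars as pl
      from sklearn.preprocessing import LabelEncoder

      le = LabelEncoder()
      columns_to_encode = ["gender"]

      # Encode each categorical column into a new "<name>_num" series
      encoded = [
          pl.Series(f"{col}_num", le.fit_transform(df[col].to_list()))
          for col in columns_to_encode
      ]

      df = df.with_columns(encoded)
      ```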

      In that cell:

      • I instantiate a LabelEncoder object.
      • For every feature in columns_to_encode (in this case, just "gender"), I create a new Polars Series with the suffix "_num", containing the encoded numeric values.
      • Finally, I add these series as new columns to our original DataFrame.

      This ensures that our categorical data is transformed into a machine-friendly format, and also preserves the human-readable string values for future reference.

      Step 2: Mapping Encoded Values to Original Labels

      Once we’ve encoded the data, it’s important to retain the mapping between the original string values and their corresponding numeric codes. This mapping is particularly useful when you want to interpret or display the model’s predictions.

      The following code block demonstrates how to generate and view this mapping:
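
      A sketch of that mapping step (variable names are assumptions):

      ```python
      import polars as pl

      # Pair each original gender label with the numeric code it was assigned
      mapping_df = (
          df.select(["gender", "gender_num"])
            .group_by("gender")
            .agg(pl.col("gender_num").first())
      )
      print(mapping_df)

      # Dictionary form: label -> numeric code
      gender_mapping = dict(zip(mapping_df["gender"].to_list(), mapping_df["gender_num"].to_list()))
      ```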

      In that cell:

      • I save the original "gender" column and its encoded counterpart "gender_num".
      • By grouping on "gender" and aggregating with the first encountered numeric value, I create a mapping from string labels to numerical codes.

      Step 3: Training and Testing the Gaussian Naive Bayes Classifier

      Now it’s time to build, train, and evaluate our model. I separate the features and target, split the data, and then initialize the classifier.

      In that cell:

      • Get the data to use in training: I drop the raw "gender" column and its encoded version from the DataFrame to form the features (X), and save the encoded classification as the target (y).
      • Data Splitting: train_test_split is used to randomly partition the data into training and testing sets.
      • Model Training: A GaussianNB classifier is instantiated and trained on the training data using the fit() method.
      • Prediction and Evaluation: The model’s predictions on the test set (y_pred) are generated and compared against the true labels using accuracy_score. This gives us a quantitative measure of the model’s performance.
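
      A sketch of that cell (the test split size and random seed are assumptions):

      ```python
      from sklearn.metrics import accuracy_score
      from sklearn.model_selection import train_test_split
      from sklearn.naive_bayes import GaussianNB

      # Features: everything except gender and its encoded copy; target: the encoded gender
      X = df.drop(["gender", "gender_num"]).to_numpy()
      y = df["gender_num"].to_numpy()

      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

      model = GaussianNB()
      model.fit(X_train, y_train)

      y_pred = model.predict(X_test)
      print("Accuracy:", accuracy_score(y_test, y_pred))
      ```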

      Step 4: Classifying a New Record

      Now I can test it on the sample observation. Consider the following code snippet:
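
      A sketch of that snippet; the feature names and values here are illustrative, not taken from the original notebook:

      ```python
      import polars as pl

      # A new observation to classify; column order must match the training features
      new_record = pl.DataFrame({"height": [6.0], "weight": [180.0], "foot size": [11.0]})

      predicted_code = model.predict(new_record.to_numpy())[0]

      # Invert the label -> code mapping to recover the human-readable label
      code_to_label = {code: label for label, code in gender_mapping.items()}
      print("Predicted gender:", code_to_label[predicted_code])
      ```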

      In that cell:

      • Create Example Data: I define a new sample record (with features like height, weight, and foot size) and create a Polars DataFrame to hold this record.
      • Prediction: The classifier is then used to predict the gender (encoded as a number) for this new record.
      • Decoding: Use the gender_mapping to display the human-readable gender label corresponding to the model’s prediction.

      Final Thoughts and Next Steps

      This step-by-step notebook shows how to preprocess data, map categorical values, train a Gaussian Naive Bayes classifier, and classify new data, all with the combination of Polars and scikit-learn.

      The new Jupyter notebook can be found here in my GitHub. If you follow the instructions in my previous post you can run this notebook for yourself.

      – William