Analyzing the Random Forest Results

In this post, I’ll go back and take a look at the results of my earlier post on Random Forests, interpret the performance metrics, try to diagnose problems and identify some techniques I can apply to improve the results.

Math Score Performance Metrics Summary

The table below is a summary of the results of the math score analysis from my previous post.

Metric | Value | Interpretation
Mean Squared Error (MSE) | 172 | The average of the squared differences between predicted and actual values. Since the differences are squared, large errors are penalized more heavily.
Root MSE (RMSE) | ≈ 13.11 | The square root of the MSE, in the same units as the target: my model's predictions are off by roughly 13.1 points on average, which is easier to reason about on a 0–100 scale.
R² Score | 0.275 | My model explains only about 27.5% of the variance in the target variable.
Target Range | 0 – 100 | The maximum possible variation is 100 points.
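For reference, here is a minimal sketch of how these metrics are computed with scikit-learn. The arrays below are illustrative stand-ins, not my actual data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true scores and predictions (not my actual data)
y_true = np.array([72.0, 55.0, 88.0, 64.0, 90.0])
y_pred = np.array([65.0, 60.0, 75.0, 70.0, 82.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # same units as the target (points)
r2 = r2_score(y_true, y_pred)  # fraction of target variance explained

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

Because RMSE is back in the target's own units, it is the easiest number to sanity-check against the 0–100 score scale.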

Interpreting the Numbers

My RMSE of around 13.1 points means my predictions are off by about 13 points out of 100, i.e. a 13% error. That seems pretty high, since that is more than a full grade level!

An R² of 0.275 says that my model captures only 27.5% of the variability in the target; the remaining 72.5% is unexplained. That means either the dataset is missing features that could help with predictions, there is a lot of noise in the data, or the model is still underfitting.

Diagnosing Underlying Issues

  • Feature Limitations
    Important variables could be missing, or existing ones may need to be transformed.
  • Data Quality
Outliers inflate MSE because errors are squared before averaging. How the data is sampled across the 0–100 target range can also impact performance.
  • Model Complexity
I used the default hyperparameters, which often underfit. The trees may be too shallow (max_depth too low) or too few (n_estimators too small) to capture complex patterns that may exist in the dataset.
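A quick way to tell underfitting from overfitting is to compare training and validation RMSE. This sketch uses synthetic data from make_regression as a stand-in for the exam-score dataset; a large gap between the two numbers suggests overfitting, while both being high suggests underfitting or missing signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)  # default hyperparameters
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print(f"train RMSE: {train_rmse:.2f}, validation RMSE: {val_rmse:.2f}")
```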

Strategies to Improve Accuracy

  • Revisit Hyperparameter Tuning
    • Optimize hyperparameters such as n_estimators and max_depth.
  • Feature Engineering
    • Explore encoding categorical features and transforming existing ones.
  • Data Augmentation & Cleaning
    • Look into removing or ‘winsorizing’ outliers.
    • Try to balance samples across target so the distribution isn’t lopsided.
  • Alternative Models & Ensembles
    • Investigate stacking multiple regressors (e.g., combine RF with SVR or k-NN).
    • Use bagging with different tree depths or feature subsets.
  • Robust Validation
    • Monitor training and validation RMSE/R² to detect under/overfitting.
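To make the first strategy concrete, here is a minimal hyperparameter-tuning sketch with GridSearchCV on synthetic data. The grid values are illustrative guesses; sensible ranges depend on the actual dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Illustrative grid; the right ranges depend on the data
param_grid = {
    "n_estimators": [50, 150],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.2f}")
```

GridSearchCV tries every combination; with larger grids, RandomizedSearchCV is usually a cheaper alternative.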
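Winsorizing can be done with plain NumPy by clipping values at chosen percentiles. The `winsorize` helper and the 5%/95% cutoffs below are my own illustrative choices, not a standard recipe:

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower/upper percentiles to tame outliers."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Illustrative scores with two extreme outliers at each end
scores = np.array([2.0, 55.0, 60.0, 62.0, 65.0, 68.0, 70.0, 72.0, 75.0, 100.0])
clipped = winsorize(scores)

print(clipped.min(), clipped.max())  # extremes pulled in; middle values untouched
```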
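Stacking is also straightforward with scikit-learn's StackingRegressor. This sketch combines RF with SVR and k-NN on synthetic data, with a Ridge meta-model blending their predictions; the base-model settings are arbitrary placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Base learners feed their predictions into a final Ridge meta-model
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=1)),
        ("svr", SVR(C=10.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)

r2 = r2_score(y_val, stack.predict(X_val))
print(f"stacked R^2: {r2:.3f}")
```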

Final Thoughts and Next Steps

My first step into learning Random Forests using default parameters didn't provide the desired accuracy. Researching the possible causes and techniques for improving accuracy has given me some direction. In my next post I'll show how I applied the techniques above and what impact they had on the accuracy of my models.

– William
