Analyzing the Random Forest Results

In this post, I’ll go back and take a look at the results of my earlier post on Random Forests, interpret the performance metrics, try to diagnose problems and identify some techniques I can apply to improve the results.

Math Score Performance Metrics Summary

The table below is a summary of the results of the math score analysis from my previous post.

Metric | Value | Interpretation
Mean Squared Error (MSE) | 172 | The average of the squared differences between predicted and actual values. Since the differences are squared, large errors are penalized more heavily.
Root MSE (RMSE) | ≈ 13.11 | The square root of the MSE, in the same units as the target: my model's predictions are off by roughly 13.1 points on average, which is easier to reason about on a 0–100 scale.
R² Score | 0.275 | My model explains only about 27.5% of the variance in the target variable.
Target Range | 0 – 100 | The maximum possible variation is 100 points.
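For reference, here is a minimal sketch of how these metrics are computed with scikit-learn. The arrays below are illustrative stand-ins, not my actual data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Illustrative true scores and predictions (not my actual data)
y_true = np.array([72.0, 55.0, 88.0, 64.0, 90.0])
y_pred = np.array([65.0, 60.0, 75.0, 70.0, 82.0])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)            # same units as the target (points)
r2 = r2_score(y_true, y_pred)  # fraction of target variance explained

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R^2:  {r2:.3f}")
```

Because RMSE is back in the target's own units, it is the easiest number to sanity-check against the 0–100 score scale.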

Interpreting the Numbers

My RMSE of around 13.1 points means my predictions are off by about 13 points out of 100, i.e. a 13% error. That seems pretty high, since that is more than a full grade level!

An R² of 0.275 says that my model captures only 27.5% of the variability in the target; the remaining 72.5% is unexplained. That means either the dataset is missing features that could help with predictions, there is a lot of noise in the data, or the model is still underfitting.

Diagnosing Underlying Issues

  • Feature Limitations
    Important variables could be missing, or existing ones may need to be transformed.
  • Data Quality
Outliers inflate MSE because errors are squared before averaging. How the data is sampled across the 0–100 target range can also impact performance.
  • Model Complexity
I used the default hyperparameters, which often underfit. The trees may be too shallow (max_depth too low) or too few (n_estimators too small) to capture complex patterns that may exist in the dataset.
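A quick way to tell underfitting from overfitting is to compare training and validation RMSE. This sketch uses synthetic data from make_regression as a stand-in for the exam-score dataset; a large gap between the two numbers suggests overfitting, while both being high suggests underfitting or missing signal:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real dataset
X, y = make_regression(n_samples=500, n_features=8, noise=15.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(random_state=42)  # default hyperparameters
model.fit(X_train, y_train)

train_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
val_rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))

print(f"train RMSE: {train_rmse:.2f}, validation RMSE: {val_rmse:.2f}")
```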

Strategies to Improve Accuracy

  • Revisit Hyperparameter Tuning
    • Optimize hyperparameters such as n_estimators and max_depth.
  • Feature Engineering
    • Explore encoding categorical features and transforming existing ones.
  • Data Augmentation & Cleaning
    • Look into removing or ‘winsorizing’ outliers.
    • Try to balance samples across target so the distribution isn’t lopsided.
  • Alternative Models & Ensembles
    • Investigate stacking multiple regressors (e.g., combine RF with SVR or k-NN).
    • Use bagging with different tree depths or feature subsets.
  • Robust Validation
    • Monitor training and validation RMSE/R² to detect under/overfitting.
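To make the first strategy concrete, here is a minimal hyperparameter-tuning sketch with GridSearchCV on synthetic data. The grid values are illustrative guesses; sensible ranges depend on the actual dataset:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Illustrative grid; the right ranges depend on the data
param_grid = {
    "n_estimators": [50, 150],
    "max_depth": [None, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV RMSE: {-search.best_score_:.2f}")
```

GridSearchCV tries every combination; with larger grids, RandomizedSearchCV is usually a cheaper alternative.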
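Winsorizing can be done with plain NumPy by clipping values at chosen percentiles. The `winsorize` helper and the 5%/95% cutoffs below are my own illustrative choices, not a standard recipe:

```python
import numpy as np

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip values to the given lower/upper percentiles to tame outliers."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

# Illustrative scores with two extreme outliers at each end
scores = np.array([2.0, 55.0, 60.0, 62.0, 65.0, 68.0, 70.0, 72.0, 75.0, 100.0])
clipped = winsorize(scores)

print(clipped.min(), clipped.max())  # extremes pulled in; middle values untouched
```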
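Stacking is also straightforward with scikit-learn's StackingRegressor. This sketch combines RF with SVR and k-NN on synthetic data, with a Ridge meta-model blending their predictions; the base-model settings are arbitrary placeholders:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

X, y = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)

# Base learners feed their predictions into a final Ridge meta-model
stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=100, random_state=1)),
        ("svr", SVR(C=10.0)),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ],
    final_estimator=Ridge(),
)
stack.fit(X_train, y_train)

r2 = r2_score(y_val, stack.predict(X_val))
print(f"stacked R^2: {r2:.3f}")
```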

Final Thoughts and Next Steps

My first step into learning Random Forests using default parameters didn't provide the desired accuracy. Researching the possible causes and techniques for improving accuracy has given me some direction. In my next post I'll show how I applied the techniques above and what impact they had on the accuracy of my models.

– William
