How do you evaluate regression models beyond R²?

Quality Thought is the best data science training institute in Hyderabad, offering specialized courses in data science along with a unique live internship program. Our comprehensive curriculum covers essential concepts such as machine learning, deep learning, data visualization, data wrangling, and statistical analysis, giving students the skills required to thrive in the rapidly growing field of data science.

Our live internship program gives students the opportunity to work on real-world projects, applying theoretical knowledge to practical challenges and gaining valuable industry experience. This hands-on approach not only enhances learning but also helps build a strong portfolio that can impress potential employers.

As a leading Data Science training institute in Hyderabad, Quality Thought focuses on personalized training with small batch sizes, allowing for greater interaction with instructors. Students gain in-depth knowledge of popular tools and technologies such as Python, R, SQL, Tableau, and more.

Join Quality Thought today and unlock the door to a rewarding career with the best Data Science training in Hyderabad through our live internship program!

How Do You Evaluate Regression Models Beyond R²?

When teaching regression to students, R² (the coefficient of determination) is often the first metric introduced: it tells us the proportion of variance in the target explained by the model. But R² has limitations. It can increase as more predictors are added even when they add little value, and in nonlinear models it may be misleading or mathematically invalid to interpret it as "explained variance." To assess regression models fully, we therefore look beyond R², using additional metrics and validation strategies.
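To make the first limitation concrete, here is a minimal sketch (assuming numpy and scikit-learn, with purely synthetic, illustrative data) showing that training-set R² cannot go down when an irrelevant predictor is added to a least-squares fit:

```python
# Minimal sketch (synthetic data): ordinary R² never decreases when an
# extra, purely random predictor is added to a least-squares fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=(n, 1))
y = 3 * x[:, 0] + rng.normal(scale=1.0, size=n)   # target depends only on x
noise = rng.normal(size=(n, 1))                   # predictor with no real signal

r2_one = LinearRegression().fit(x, y).score(x, y)
X_two = np.hstack([x, noise])
r2_two = LinearRegression().fit(X_two, y).score(X_two, y)

print(f"R² with the real predictor only: {r2_one:.4f}")
print(f"R² after adding pure noise:      {r2_two:.4f}")   # never lower
```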

Key Metrics Beyond R²

  1. Adjusted R²
    This adjusts R² for the number of predictors, penalizing model complexity. A variable is only worth keeping if it improves adjusted R², not just raw R² (the code sketch after this list shows how to compute it alongside the other metrics).

  2. Mean Squared Error (MSE) & Root MSE (RMSE)
    MSE = average of squared residuals; RMSE = square root of that, expressed in the same units as the target. These measure absolute prediction error. Lower values are better.

  3. Mean Absolute Error (MAE)
    MAE = the average absolute difference between predicted and actual values. Unlike MSE/RMSE, it weights all errors linearly, so large errors are penalized less heavily.

  4. Predicted R² / Cross-validated R²
    This is the R² computed on held-out data (e.g. via k-fold cross validation). It is a more honest estimate of generalization performance.

  5. Mean Absolute Percentage Error (MAPE)
    Useful when you want a relative error (percentage) metric: MAPE = (1/n) ∑ |(yᵢ – ŷᵢ) / yᵢ| × 100%. Be careful when actuals are near zero.

  6. Residual Analysis & Model Diagnostics
    Check residual plots for heteroscedasticity, non-linearity, patterns, outliers, or influential points. Even a high R² can hide model misbehavior (see Anscombe's quartet and the residual-plot sketch after this list).

  7. Other domain-specific measures (if applicable)
    For example, in hydrology, the Nash–Sutcliffe Efficiency (NSE) is used to assess predictive skill similarly to R² but in time series contexts.
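Here is a minimal sketch of the metrics listed above, assuming numpy and scikit-learn are available. The synthetic dataset and the plain LinearRegression model are placeholders for your own data and estimator.

```python
# Minimal sketch: computing RMSE, MAE, MAPE, adjusted R², and cross-validated R²
# on a synthetic regression problem (placeholders for your own data/model).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Absolute error metrics on held-out data (lower is better)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                              # same units as the target
mae = mean_absolute_error(y_test, y_pred)

# MAPE: only meaningful when actuals are well away from zero
mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100

# Adjusted R² on the training fit (penalizes extra predictors)
r2_train = model.score(X_train, y_train)
n, p = X_train.shape
adj_r2 = 1 - (1 - r2_train) * (n - 1) / (n - p - 1)

# Cross-validated ("predicted") R²: a more honest generalization estimate
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")

print(f"RMSE: {rmse:.2f}   MAE: {mae:.2f}   MAPE: {mape:.1f}%")
print(f"Test R²: {r2_score(y_test, y_pred):.3f}   Adjusted R² (train): {adj_r2:.3f}")
print(f"5-fold R²: mean {cv_r2.mean():.3f}, std {cv_r2.std():.3f}")
```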
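For the residual analysis in item 6, plotting residuals against fitted values is the standard first check. The sketch below assumes matplotlib is installed and reuses `model`, `X_test`, and `y_test` from the previous snippet.

```python
# Minimal residual-diagnostics sketch; reuses model, X_test, y_test from above
# and assumes matplotlib is available.
import matplotlib.pyplot as plt

fitted = model.predict(X_test)
residuals = y_test - fitted

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted: a funnel shape hints at heteroscedasticity,
# a curve hints at unmodelled non-linearity.
axes[0].scatter(fitted, residuals, alpha=0.5)
axes[0].axhline(0, color="red", linestyle="--")
axes[0].set(xlabel="Fitted values", ylabel="Residuals", title="Residuals vs fitted")

# Residual histogram: look for heavy skew or extreme outliers.
axes[1].hist(residuals, bins=30)
axes[1].set(xlabel="Residual", title="Residual distribution")

plt.tight_layout()
plt.show()
```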

Teaching Tip & Role of “Quality Thought”

For students in a data science course, it is valuable to teach multiple metrics side by side, compare them on the same model, and show cases where they diverge. At Quality Thought, our courses emphasize model evaluation holistically rather than just pushing for a high R². We guide students to interpret metrics, diagnose problems, and choose models that generalize well, skills that matter more than chasing a large R².

In lab assignments, we ask students: “Which model has lower RMSE? Which one has more stable predicted R² across folds? Are there outliers dragging your MAE upward?” This encourages critical thinking rather than blind reliance on one number.
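One rough way to frame those lab questions in code, reusing `X`, `y`, `model`, `X_test`, and `y_test` from the snippets above:

```python
# Rough sketch of the lab questions: per-fold R² stability and the
# observations whose large errors inflate MAE. Reuses objects from the
# earlier snippets.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

fold_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("Per-fold R²:", np.round(fold_r2, 3), "  spread (std):", round(fold_r2.std(), 3))

abs_err = np.abs(y_test - model.predict(X_test))
worst = np.argsort(abs_err)[::-1][:5]            # five largest absolute errors
print("Largest test-set errors:", np.round(abs_err[worst], 2))
print("MAE with vs without them:",
      round(abs_err.mean(), 2), "vs", round(np.delete(abs_err, worst).mean(), 2))
```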

Conclusion

When evaluating regression models in a data science course, R² is just the starting point. Adjusted R², RMSE, MAE, cross-validated R², residual diagnostics, and domain-specific metrics all help paint a fuller picture of model quality. For students, mastering the interpretation of these metrics is key. With Quality Thought's teaching philosophy and course support, we help students build robust, interpretable models rather than models that merely look good on paper. So, beyond R², what combination of metrics and diagnostics will you adopt in your next regression project?

Read More

Explain precision-recall tradeoff and when F1 score is more appropriate than accuracy.

What are best practices for handling categorical variables in ML models?

Visit QUALITY THOUGHT Training Institute in Hyderabad
