When a model approximates the patients in the training set too closely, instead of learning generalizable features, it is said to be "overfitted" to the training set. This means that, while the model may demonstrate high performance when making predictions on the patients it was trained on, its performance on new patients will be far poorer, because the model has not in fact extracted generalizable rules for prediction. Instead, it has memorized the characteristics of the training set patients. In this situation, the model demonstrates minimal bias (erroneous assumptions) and high variance (sensitivity to small fluctuations). Overfitting can be diagnosed by comparing training error with out-of-sample error (OSE): if training error is much lower than OSE, the model is said to overfit. Overfitting is usually caused by overly extensive model training, use of overly complex algorithms for relatively simple problems, or too few training examples, among other factors.
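This diagnostic gap between training error and OSE can be illustrated with a minimal sketch, entirely hypothetical and unrelated to the authors' data: a deliberately overfit "memorizer" (1-nearest-neighbour lookup) achieves zero training error on noisy toy data but a much larger out-of-sample error, whereas a simple least-squares line fit shows similar error on both sets.

```python
import random
import statistics

random.seed(0)

# Toy data (hypothetical): outcome = linear function of one feature plus noise.
def make_patients(n):
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2.0 * x + random.gauss(0, 2.0) for x in xs]
    return list(zip(xs, ys))

train = make_patients(30)
test = make_patients(30)

# Deliberately overfit model: memorizes the training patients (high variance).
def predict_memorizer(x):
    nearest = min(train, key=lambda p: abs(p[0] - x))
    return nearest[1]

# Generalizing model: least-squares line fit to the training data.
mean_x = statistics.fmean(x for x, _ in train)
mean_y = statistics.fmean(y for _, y in train)
slope = sum((x - mean_x) * (y - mean_y) for x, y in train) / sum(
    (x - mean_x) ** 2 for x, _ in train
)
intercept = mean_y - slope * mean_x

def predict_line(x):
    return intercept + slope * x

def mse(model, data):
    return statistics.fmean((model(x) - y) ** 2 for x, y in data)

# The memorizer: training error near zero, out-of-sample error much larger.
# The line fit: both errors of similar magnitude (no overfitting).
print(mse(predict_memorizer, train), mse(predict_memorizer, test))
print(mse(predict_line, train), mse(predict_line, test))
```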
OSE can be assessed using two major approaches: resampling, or a dedicated holdout (testing) set. In k-fold cross validation3, the training data are split into k equally sized folds. Subsequently, k models are trained, each on k-1 of the folds, and OSE is evaluated on the one remaining fold. Commonly used values for k are 5 and 10. Similarly, the bootstrap4 (a resampling technique derived from the jackknife5) randomly draws n patients with replacement from an n-sized training dataset, and model performance is evaluated. This process is repeated many times, usually with 25 to 1000 repetitions. As an adaptation of jackknife resampling, leave-one-out cross validation (LOOCV) is another powerful method for estimating OSE. In LOOCV, n models are trained, each on the respective n-1 patients, and evaluated on the 1 remaining patient.
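The splitting logic behind these three resampling schemes can be sketched in a few lines of Python; the 20-patient index list below is purely illustrative, not taken from the study under discussion.

```python
import random

random.seed(42)
patients = list(range(20))  # indices of a hypothetical 20-patient dataset

# k-fold cross validation: shuffle once, cut into k equal folds; each fold
# serves exactly once as the held-out evaluation set.
def k_fold_splits(data, k):
    shuffled = random.sample(data, len(data))
    fold_size = len(data) // k
    for i in range(k):
        held_out = shuffled[i * fold_size:(i + 1) * fold_size]
        training = [p for p in shuffled if p not in held_out]
        yield training, held_out

# Bootstrap: draw n patients with replacement; patients never drawn
# (the "out-of-bag" sample) can serve for evaluation. Repeated many times.
def bootstrap_sample(data):
    training = [random.choice(data) for _ in data]
    out_of_bag = [p for p in data if p not in training]
    return training, out_of_bag

# LOOCV is k-fold cross validation with k = n: each patient held out once.
def loocv_splits(data):
    for i, p in enumerate(data):
        yield data[:i] + data[i + 1:], [p]

for training, held_out in k_fold_splits(patients, k=5):
    print(len(training), len(held_out))  # 16 4, five times
```

Note that every patient appears in exactly one k-fold evaluation set, whereas a bootstrap sample may contain the same patient several times and omit others entirely.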
The abovementioned resampling techniques assess OSE while allowing training on the entire dataset, thus not losing any information for training. Still, it may sometimes be worthwhile to "sacrifice" part of the data for a dedicated holdout set, also called a "testing" set, as the authors have rightly done. A dedicated holdout set allows a less biased judgment of OSE, and is especially useful if the risk of "manual" overfitting is high. This is the case when many different algorithms are implemented and extensively tuned based on training performance. The gold standard in holdout testing is external validation, in which a separate dataset from one or multiple different institutions is used to test model performance.
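The key discipline with a holdout set is to separate it before any tuning takes place and touch it only once, for the final performance report. A minimal sketch, with a hypothetical 100-patient cohort and an assumed 20% holdout fraction:

```python
import random

random.seed(7)

# Hypothetical cohort of 100 patients; 20% is an assumed, common choice.
patients = list(range(100))
shuffled = random.sample(patients, len(patients))

holdout = shuffled[:20]      # touched only once, for the final report
development = shuffled[20:]  # used for training and resampling-based tuning
```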
Especially when prediction models are applied in clinical practice, a proper fit is crucial. Resampling methods such as those described above can diagnose overfitting early. In combination with external validation and assessment of calibration, resampling allows the development of prediction models that are safe to apply in clinical practice.
In conclusion, if a holdout set is used, it is crucial that both the resampled training performance and the holdout (testing) performance are reported side by side. In addition, it is useful to assess OSE during training using resampling. The bootstrap, k-fold cross validation, and LOOCV are accepted standards in ML for evaluating OSE during training, and their implementation is a basic requirement when developing ML algorithms in the 21st century. If many algorithms are tested, and if they are extensively tuned, a holdout set (testing set) may be necessary to rule out overfitting.
1. Farrokhi F, Buchlak QD, Sikora M, et al. Investigating Risk Factors and Predicting Complications in Deep Brain Stimulation Surgery with Machine Learning Algorithms. World Neurosurg. October 2019. doi:10.1016/j.wneu.2019.10.063
2. Staartjes VE, Schröder ML. Letter to the Editor. Class imbalance in machine learning for neurosurgical outcome prediction: are our models valid? J Neurosurg Spine. 2018;29(5):611-612. doi:10.3171/2018.5.SPINE18543
3. Larson SC. The shrinkage of the coefficient of multiple correlation. J Educ Psychol. 1931;22(1):45-55. doi:10.1037/h0072400
4. Efron B. Bootstrap Methods: Another Look at the Jackknife. Ann Stat. 1979;7(1):1-26. doi:10.1214/aos/1176344552
5. Quenouille MH. Notes on Bias in Estimation. Biometrika. 1956;43(3-4):353-360. doi:10.1093/biomet/43.3-4.353
From: Staartjes, Victor E., and Julius M. Kernbach. “Letter to the Editor Regarding ‘Investigating Risk Factors and Predicting Complications in Deep Brain Stimulation Surgery with Machine Learning Algorithms.’” World Neurosurgery 137 (May 1, 2020): 496. https://doi.org/10.1016/j.wneu.2020.01.189.