When evaluating machine learning models for diagnosis or prediction of binary outcomes, two dimensions of performance need to be considered. First, discrimination – a model's ability to make correct binary predictions – is commonly assessed using the area under the curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and the F1 score. Second, calibration – the degree to which a model's predicted probability (ranging from 0% to 100%) corresponds to the actually observed incidence of the binary endpoint – is commonly assessed using calibration curves, calibration slope and intercept, the Brier score, the expected/observed ratio, and the Hosmer-Lemeshow test.1
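To make the two dimensions concrete, the following sketch evaluates both on synthetic data; it assumes scikit-learn, and the model and data are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, brier_score_loss
from sklearn.calibration import calibration_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
p = model.predict_proba(X_te)[:, 1]

# Discrimination: does the model rank positives above negatives?
print("AUC:", roc_auc_score(y_te, p))

# Calibration: do predicted probabilities match observed incidence?
print("Brier score:", brier_score_loss(y_te, p))
obs, pred = calibration_curve(y_te, p, n_bins=10)
for o, q in zip(obs, pred):
    print(f"mean predicted {q:.2f} -> observed {o:.2f}")
```

Plotting the observed against the mean predicted probability per bin yields the calibration curve; ideally the points lie on the diagonal.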
While discrimination is practically always reported, many publications do not report calibration. While high discrimination and good calibration often coincide – as is likely the case in the abovementioned publication3 – excellent discrimination does not necessarily mean that calibration is adequate.2 Deep neural networks are especially prone to poor calibration, often massively skewing predicted probabilities towards "extreme" values such as 1% or 99%.
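The divergence between the two dimensions can be demonstrated directly. In the sketch below (assuming NumPy and scikit-learn, with simulated probabilities), well-calibrated predictions are pushed towards the extremes by sharpening the logits; because this transformation is monotone, the ranking of patients – and hence the AUC – is unchanged, while calibration, as measured by the Brier score, clearly deteriorates:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
# Perfectly calibrated predictions: outcomes drawn with probability p
p = rng.uniform(0.05, 0.95, size=5000)
y = rng.binomial(1, p)

# Overconfident variant: sharpen the logits by a factor of 4,
# pushing predictions towards 0% and 100% without changing their order
logit = np.log(p / (1 - p))
p_sharp = 1 / (1 + np.exp(-4 * logit))

print("AUC, calibrated:    ", roc_auc_score(y, p))
print("AUC, overconfident: ", roc_auc_score(y, p_sharp))   # identical
print("Brier, calibrated:  ", brier_score_loss(y, p))
print("Brier, overconfident:", brier_score_loss(y, p_sharp))  # worse
```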
Especially in clinical practice, calibration is crucial. For clinicians and patients alike, a predicted probability (e.g., "your risk is 7%") is much more valuable than a binary yes/no prediction. Good calibration can often be attained by applying machine learning models whose complexity is appropriate to the classification task at hand, such as logistic regression or generalized additive models. If poor calibration is observed and the pattern of miscalibration is consistent across resampling, recalibration techniques such as Platt scaling or isotonic regression can be applied.5 Lastly, models can also be trained primarily for measures of calibration, and intercepts can be adjusted.4
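Both recalibration techniques can be sketched in a few lines with scikit-learn. The example below uses simulated overconfident predictions for illustration; in practice, the recalibration map would be fitted on a held-out calibration set, not on the evaluation data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(1)
# Simulated overconfident predictions on a calibration set
p_true = rng.uniform(0.05, 0.95, size=4000)
y = rng.binomial(1, p_true)
logit = np.log(p_true / (1 - p_true))
p_overconf = 1 / (1 + np.exp(-3 * logit))   # what the model reports

# Platt scaling: fit a logistic regression on the raw score (here, the logit)
platt = LogisticRegression().fit(logit.reshape(-1, 1), y)
p_platt = platt.predict_proba(logit.reshape(-1, 1))[:, 1]

# Isotonic regression: monotone, non-parametric recalibration
iso = IsotonicRegression(out_of_bounds="clip").fit(p_overconf, y)
p_iso = iso.predict(p_overconf)

print("Brier before recalibration:", brier_score_loss(y, p_overconf))
print("Brier after Platt scaling: ", brier_score_loss(y, p_platt))
print("Brier after isotonic regr.:", brier_score_loss(y, p_iso))
```

Platt scaling assumes a sigmoidal miscalibration pattern and needs little data; isotonic regression is more flexible but requires larger calibration sets to avoid overfitting.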
In conclusion, it is critically important to assess the calibration of clinical prediction models. At a minimum, the calibration curve, slope, and intercept should be reported for every published model.
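The calibration slope and intercept can be estimated by regressing the observed outcomes on the logit of the predicted probabilities (logistic calibration). The sketch below, again on simulated overconfident predictions and assuming scikit-learn, estimates both jointly; strictly, the intercept (calibration-in-the-large) is often estimated with the slope fixed at 1, which is simplified here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
# Simulated overconfident predictions: true probabilities, sharpened logits
p_true = rng.uniform(0.05, 0.95, size=5000)
y = rng.binomial(1, p_true)
logit_true = np.log(p_true / (1 - p_true))
p_hat = 1 / (1 + np.exp(-2 * logit_true))   # what the model reports
logit_hat = np.log(p_hat / (1 - p_hat))

# Logistic calibration: regress the outcome on the logit of the prediction.
# Ideal: slope = 1, intercept = 0; slope < 1 indicates overconfidence.
fit = LogisticRegression().fit(logit_hat.reshape(-1, 1), y)
slope, intercept = fit.coef_[0][0], fit.intercept_[0]
print(f"calibration slope:     {slope:.2f}")
print(f"calibration intercept: {intercept:.2f}")
```

Because the simulated logits were sharpened by a factor of 2, the recovered slope should lie near 0.5, flagging the overconfidence.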
1. Debray TPA, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KGM: A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol 68:279–289, 2015
2. Guo C, Pleiss G, Sun Y, Weinberger KQ: On Calibration of Modern Neural Networks. arXiv:1706.04599, 2017. Available: http://arxiv.org/abs/1706.04599. Accessed 11 December 2019
3. Hopkins BS, Yamaguchi JT, Garcia R, Kesavabhotla K, Weiss H, Hsu WK, et al: Using machine learning to predict 30-day readmissions after posterior lumbar fusion: an NSQIP study involving 23,264 patients. J Neurosurg Spine:1–8, 2019
4. Janssen KJM, Moons KGM, Kalkman CJ, Grobbee DE, Vergouwe Y: Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 61:76–86, 2008
5. Niculescu-Mizil A, Caruana R: Predicting Good Probabilities with Supervised Learning, in Proceedings of the 22nd International Conference on Machine Learning (ICML '05). New York, NY, USA: ACM, 2005, pp 625–632. Available: http://doi.acm.org/10.1145/1102351.1102430
From: Staartjes, Victor E., and Julius M. Kernbach. “Letter to the Editor. Importance of Calibration Assessment in Machine Learning-Based Predictive Analytics.” Journal of Neurosurgery. Spine, February 21, 2020, 1–2. https://doi.org/10.3171/2019.12.SPINE191503.