The Brier score does not evaluate the clinical utility of diagnostic tests or prediction models

Diagnostic and Prognostic Research - Volume 1 - Pages 1-7 - 2017
Melissa Assel1, Daniel D. Sjoberg2, Andrew J. Vickers2
1Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, USA
2Department of Epidemiology and Biostatistics, Memorial Sloan-Kettering Cancer Center, New York, USA

Abstract

A variety of statistics have been proposed as tools to help investigators assess the value of diagnostic tests or prediction models. The Brier score has been recommended on the grounds that it is a proper scoring rule that is affected by both discrimination and calibration. However, the Brier score depends on prevalence in such a way that the rank ordering of tests or models may vary inappropriately with prevalence. We explored four common clinical scenarios: comparison of a highly accurate binary test with a continuous prediction model of moderate predictiveness; comparison of two binary tests where the importance of sensitivity versus specificity is inversely associated with prevalence; comparison of models and tests to the default strategies of assuming that all or no patients are positive; and comparison of two models miscalibrated in opposite directions. In each case, the Brier score gave an inappropriate rank ordering of the tests and models. Conversely, net benefit, a decision-analytic measure, always favored the preferable test or model. The Brier score does not evaluate the clinical value of diagnostic tests or prediction models. We advocate, as an alternative, the use of decision-analytic measures such as net benefit.
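To make the contrast concrete, the following is a minimal Python sketch of the third scenario (a binary test versus the default strategies of treating all or no patients). It is not the authors' code, and the data are hypothetical: 100 patients, 10% prevalence, a test with 90% sensitivity and 80% specificity, and a threshold probability of 5%. The sketch assumes the standard definitions: the Brier score as the mean squared difference between prediction and outcome, and net benefit as TP/n - (FP/n) * pt/(1 - pt) at threshold probability pt. The function names `brier_score` and `net_benefit` are illustrative only.

```python
def brier_score(preds, outcomes):
    """Mean squared difference between predictions and 0/1 outcomes (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(preds, outcomes)) / len(outcomes)


def net_benefit(preds, outcomes, threshold):
    """Net benefit at a threshold probability: TP/n - FP/n * pt / (1 - pt) (higher is better)."""
    n = len(outcomes)
    tp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(preds, outcomes) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


# Hypothetical cohort (assumption, not data from the paper):
# 10 diseased patients followed by 90 non-diseased patients.
outcomes = [1] * 10 + [0] * 90

# Binary test with 90% sensitivity and 80% specificity:
# 9 true positives, 1 false negative, 18 false positives, 72 true negatives.
test_pred = [1] * 9 + [0] * 1 + [1] * 18 + [0] * 72

# Default strategies: assume no patient is positive, or every patient is positive.
none_pred = [0] * 100
all_pred = [1] * 100

threshold = 0.05  # assumed threshold probability for intervention

for name, preds in [("test", test_pred), ("assume none", none_pred), ("assume all", all_pred)]:
    print(f"{name:12s} Brier = {brier_score(preds, outcomes):.3f}  "
          f"net benefit = {net_benefit(preds, outcomes, threshold):.4f}")
```

Under these assumed numbers, the Brier score ranks "assume none" (0.10) ahead of the test (0.19), even though that strategy misses every case, whereas net benefit ranks the test (about 0.081) above "assume all" (about 0.053) and "assume none" (0). The result depends on the chosen prevalence and threshold, so it illustrates the kind of disagreement described above rather than reproducing the paper's results.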
