Towards a more nuanced conceptualisation of differential examiner stringency in OSCEs

Matt Homer
School of Medicine, University of Leeds, Leeds, UK

Abstract

Quantitative measures of systematic differences in OSCE scoring across examiners (often termed examiner stringency) can threaten the validity of examination outcomes. Such effects are usually conceptualised and operationalised solely in terms of checklist/domain scores within a station; global grades are rarely used in this type of analysis. In this work, a large candidate-level exam dataset is analysed to develop a more sophisticated understanding of examiner stringency. Station scores are modelled as a function of global grades, with each candidate, station and examiner allowed to vary in the modelling (in ability, difficulty and stringency respectively). In addition, examiners are allowed to vary in how they discriminate across grades; to our knowledge, this is the first time this has been investigated. Results show that examiners contribute strongly to variance in scoring in two distinct ways: via the traditional conception of score stringency (34% of score variance), but also in how they discriminate in scoring across grades (7%). As one might expect, candidate and station account for only a small amount of score variance at the station level once candidate grades are accounted for (3% and 2% respectively), with the remainder being residual (54%). Investigation of the impact on station-level pass/fail decisions suggests that differential examiner stringency effects combine to give false positive (candidates passing in error) and false negative (candidates failing in error) rates of around 5% each at the station level, falling to 0.4% and 3.3% respectively at the exam level. This work adds to our understanding of examiner behaviour by demonstrating that examiners can vary in qualitatively different ways in their judgments. For institutions, it emphasises the key message that it is important to sample widely from the examiner pool, via a sufficient number of stations, to ensure that OSCE-level decisions are defensible. It also suggests that examiner training should include discussion of global grading, and of the combined effect of scoring and grading on candidate outcomes.
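The modelling described above (station scores regressed on global grades, with crossed random effects for candidate, station and examiner, and a random slope allowing examiners to differ in how they discriminate across grades) can be sketched as a linear mixed model in R with the lme4 package. This is a minimal illustration only: the data frame name, column names and numeric coding of grades are assumptions, and the published analysis may specify the model differently.

    library(lme4)

    # Illustrative only: 'osce' is a hypothetical long-format data frame with one
    # row per candidate-station encounter and the following (assumed) columns:
    #   score     numeric station score (checklist/domain total)
    #   grade     global grade awarded by the examiner, coded numerically
    #   candidate, station, examiner   identifiers, stored as factors
    #
    # Random intercepts capture candidate ability, station difficulty and examiner
    # score stringency; the random slope of grade within examiner lets examiners
    # differ in how sharply their scores discriminate across global grades.
    fit <- lmer(score ~ grade +
                  (1 | candidate) + (1 | station) + (grade | examiner),
                data = osce, REML = TRUE)

    summary(fit)                            # fixed effect of grade and variance components
    print(VarCorr(fit), comp = "Variance")  # variance of each random effect and the residual

The variance components returned by VarCorr() can then be expressed as percentages of the total score variance, analogous to the decomposition reported in the abstract (34% examiner stringency, 7% examiner grade discrimination, 3% candidate, 2% station, 54% residual).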
