Methods
Ethics statement
Since this study did not involve human subjects or human-derived materials, the requirement for informed consent was waived. The Institutional Review Board of Dong-A University approved this study protocol (IRB approval no., 2-1040709-AB-N-01-202206-HR-031-02).
Study design
This was an exploratory study modeling the implementation of GT; specifically, it was a psychometric study aimed at measuring the reliability of the OSCE. We analyzed clinical skill examination data from 439 fourth-year medical students in the Busan and Gyeongnam areas of South Korea, collected from July 12 to 15, 2021.
Setting
There are 5 medical schools in the Busan and Gyeongnam areas, located in the southeastern part of South Korea. These 5 medical schools form the Busan-Gyeongnam Clinical Skill Examination (BGCSE) consortium. Since 2014, the consortium has conducted joint clinical skill examinations annually as normative evaluations for third- and fourth-year medical students.
In the 2021 BGCSE, the format of the OSCE changed in response to changes in the Korean Medical Licensing Examination (KMLE) made by the Korea Health Personnel Licensing Examination Institute. In 2022, the number of OSCE stations in the KMLE was scheduled to be reduced from 12 to 10. However, the BGCSE consortium lacked the resources to operate all 10 stations, which required a further reduction to 8. As a result, the OSCE comprised 7 stations where students encountered standardized patients (SPs) and 1 station where students performed procedures on a manikin.
Table 1 shows the number of examinees, the case topics, and the number of items per case on each OSCE day. The average number of items per case on each OSCE day was 20. Students were given 12 minutes at each station.
Through 2020, the consortium had tracked the reliability of the OSCE using Cronbach’s α, which remained at an acceptable level (above 0.70). However, given the change in the 2021 OSCE format, it was necessary to identify not only the reliability of the test but also its error components. Consequently, the consortium decided to analyze reliability using GT.
Examiner training proceeded in the same way as in previous years. Physician examiners from 4 medical schools evaluated examinees’ performance at each station by completing the checklist and assigning a global rating. SP training also proceeded as usual: an experienced SP trainer trained the SPs on the scenarios for 2 hours, and they rehearsed for more than 2 hours. All SPs had more than 5 years of SP experience with the BGCSE consortium.
Participants
A total of 439 fourth-year medical students from 5 medical schools participated in the BGCSE at 4 medical school skill simulation centers for 4 days, from July 12 to 15, 2021.
Variables
In OSCEs, facets typically include students (p), cases (c), items (i), and raters (r), among others. GT estimates the variance associated with each facet and thereby provides information about the examination’s measurement characteristics. The student facet (p) refers to the variability in scores between examinees that reflects true differences in competency; a larger variance for students indicates that score differences are due to examinee competency rather than measurement error. The case facet (c) refers to variability in difficulty across the SP encounters in the OSCE; in this study, examinees were randomly assigned to 8 of 23 cases. The item facet (i) refers to variability in difficulty across the checklist items within each case. The rater facet (r) refers to variability among examiners; in this study, only 1 rater assessed each case, so no variability was attributable to different raters. There are also interactions between facets; for instance, the person-by-case (p×c) interaction reflects differences in student performance across cases. The proportion of the VC attributable to each facet provides valuable information about the examination, such as whether the test discriminates high-performing from low-performing students and whether the numbers of cases and items are sufficient for reliability.
In this study, we defined 3 facets—students (p), cases (c), and items (i)—and specified the design as p×(i:c), since items were nested within cases. Five types of VCs can be derived from this design: (1) p, (2) c, (3) i:c, (4) p×c, and (5) p×(i:c), the last of which is confounded with residual error.
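For reference, the linear model underlying this design can be written in conventional GT notation (this formulation is standard in the GT literature rather than taken from the consortium’s materials):

X_{pic} = \mu + \nu_p + \nu_c + \nu_{i:c} + \nu_{pc} + \nu_{pic,e}

where each effect \nu has a mean of 0, so that the observed-score variance decomposes into the 5 VCs listed above:

\sigma^2(X_{pic}) = \sigma^2_p + \sigma^2_c + \sigma^2_{i:c} + \sigma^2_{pc} + \sigma^2_{pic,e}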
Study outcomes
We set the primary outcomes as the reliability of each OSCE examination day, presented as G coefficients, and the analysis of its VCs (G-study). We set the acceptable reliability level of the G coefficient at 0.70 [5]. Since this examination was a normative evaluation, no criterion was set for the phi coefficient. We set the secondary outcome as the D-study: using the VC estimates from the G-study, post hoc projections of reliability were examined.
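For a p×(i:c) design, these projections use the standard form of the G coefficient, where n'_c and n'_i denote the numbers of cases and items per case assumed in a given D-study scenario (a standard GT result, not specific to this report):

E\rho^2 = \frac{\sigma^2_p}{\sigma^2_p + \sigma^2_{pc}/n'_c + \sigma^2_{pic,e}/(n'_c n'_i)}

Only the interaction components involving students contribute to this relative error term; the case (c) and item (i:c) main effects would enter only the absolute error of the phi coefficient, which was not used here.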
Data sources/measurement
The data analyzed in this study were from the BGCSE consortium. Examinees’ clinical performance scores were entered by the faculty examiners using a computer program, and the results were processed automatically. All data were recorded in an Excel spreadsheet (Microsoft Corp., Redmond, WA, USA) and are available in Dataset 1.
Bias
No sources of bias were identified in the study design.
Study size
A sample size was not calculated due to the nature of the study design.
Statistical methods
Descriptive statistics for OSCE scores were calculated, including the mean and standard deviation of each case. The G-study and D-study were performed using G String IV ver. 6.3.8 (2013; Papaworx, Hamilton, ON, Canada). G String IV is a user-centered Windows program that applies GT to the analysis of empirical datasets. It uses Brennan’s urGENOVA command-line program to perform the analysis-of-variance procedure needed to estimate VCs. It was designed and coded by Ralph Bloch at Papaworx as part of a project commissioned by the Medical Council of Canada. G String V was released in 2018, and G String can be downloaded free of charge from papaworx.com.
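For a balanced p×(i:c) random-effects design, the ANOVA-based estimators that such programs compute reduce to simple functions of the mean squares (MS); the standard textbook formulation (not taken from the G String documentation) is:

\hat{\sigma}^2_{pic,e} = MS_{res}
\hat{\sigma}^2_{pc} = (MS_{pc} - MS_{res}) / n_i
\hat{\sigma}^2_{i:c} = (MS_{i:c} - MS_{res}) / n_p
\hat{\sigma}^2_{c} = (MS_{c} - MS_{pc} - MS_{i:c} + MS_{res}) / (n_p n_i)
\hat{\sigma}^2_{p} = (MS_{p} - MS_{pc}) / (n_c n_i)

where n_p, n_c, and n_i are the numbers of students, cases, and items per case, respectively, and MS_{res} is the residual mean square.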
Discussion
Key results
In the 2021 BGCSE, when the number of cases changed from 12 to 8, the G coefficient remained at an acceptable level (above 0.70) on all but 1 of the 4 examination days. Most VCs were attributable to items nested in cases and to residual error. If the stakes of the OSCE change and its reliability needs to be increased, it would be reasonable to increase the number of items nested within each case rather than the number of cases.
Interpretation
According to a systematic review of real-world OSCE reliability, the overall reliability of medical school examinations, presented as α coefficients, was 0.66 (95% confidence interval, 0.62–0.70), below the generally accepted minimum [6]. However, reliability criteria depend on the purpose of the assessment. For high-stakes assessments, such as certification, professionals suggest a reliability of at least 0.90; for moderate-stakes assessments, such as summative examinations in medical school, reliability is expected to range from 0.80 to 0.89; and for lower-stakes assessments, such as formative assessments or those administered by local faculty, from 0.70 to 0.79 [5]. As a formative assessment, the BGCSE is considered a low- to moderate-stakes examination.
According to the D-study, there are 2 approaches to achieving G coefficients above 0.70. One is to increase the number of cases from 8 to 9 or 10; the other is to keep 8 cases and increase the number of items nested in each case to more than 20.
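To make the arithmetic behind these 2 approaches concrete, the sketch below projects G coefficients from the formula given under Study outcomes. The VC values are purely illustrative placeholders, not this study’s estimates; in practice, the G-study output from G String IV would be substituted.

    # Illustrative D-study projection for a p x (i:c) design.
    # The variance components below are placeholders, NOT the
    # estimates from this study; substitute the G-study output.
    var_p = 0.50      # students (p): true competency differences
    var_pc = 0.40     # person-by-case interaction (p x c)
    var_pic_e = 6.00  # items nested in cases plus residual error

    def g_coefficient(n_cases: int, n_items: int) -> float:
        """Projected G coefficient for n_cases cases with
        n_items checklist items per case."""
        rel_error = var_pc / n_cases + var_pic_e / (n_cases * n_items)
        return var_p / (var_p + rel_error)

    # Compare the 2 approaches discussed above.
    for n_c, n_i in [(8, 20), (9, 20), (10, 20), (8, 25), (8, 30)]:
        print(f"cases={n_c:2d}, items/case={n_i:2d} -> "
              f"G = {g_coefficient(n_c, n_i):.3f}")

Because the residual term is divided by the product n'_c × n'_i, adding items within cases and adding cases both shrink it; however, only adding cases also shrinks the p×c term, so which lever is more efficient depends on the relative sizes of the VCs.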
Each approach has advantages and disadvantages. Increasing the number of cases would increase reliability, but more resources would be needed. If the number of cases rose to 10, the consortium would have to prepare 2 more cases, meaning that an additional 32 physician examiners and 8 SPs would be needed. Additional staff to operate the OSCE and item developers for the new cases would also be required, as would more manikins and equipment for the added stations. In that situation, the consortium would have to consider the cost-effectiveness of the OSCE.
Increasing the number of items would also increase reliability. However, when developing cases, the number of items tends to depend on the case’s topic. For example, as shown in Table 1, the vaccination counseling case (a 32-year-old woman counseled about vaccination for her 9-month-old baby) included 22 items, since many key questions must be asked before vaccination, such as previous vaccination history, allergy history, and current medication history. In contrast, the intimate partner violence case (a 41-year-old woman with a swollen and bruised right eye) may have fewer key questions. Adding superfluous items of little assessment value would ultimately reduce the validity of the case. Thus, it will not always be possible to increase the number of items to secure reliability.
Comparison with previous studies
It is well known that the major threat to reliable measurement in evaluating performance is case specificity [7]. Case specificity can be defined as a phenomenon in which student performance varies depending on the scenario [8], because some students may have more prior knowledge of or experience with some scenarios than others. Previous studies have shown that case specificity is naturally a significant VC in multi-case examinations; therefore, many cases are needed for a reliable test [9,10]. However, recent studies have shown that the number of cases is not necessarily the main source of variance; instead, significant variance can be attributed to items nested in cases or other factors [11,12]. The findings of our study are consistent with these recent studies, as the proportion of VCs for cases was negligible, ranging from 0.00% to 2.03% (Table 3). Therefore, in this examination, increasing the number of items per case can increase reliability, since most of the VCs came from items nested in cases (i:c).
This study found that, with 8 cases, the G coefficient was above 0.70 when the average number of items per case was above 21. Because this threshold applies to the average, cases with more than 21 items can be combined with cases containing fewer. In this situation, combining cases with varying numbers of items may be important in blueprinting the OSCE, and the consortium should keep enough cases with diverse item counts in its case bank.
Limitations
This study has some limitations. First, it was conducted by a single consortium, although 5 medical schools participated; applying the same OSCE to a different student population might yield different findings. Second, items evaluating patient-physician interactions (PPIs) were excluded from the G-study, because the number and content of PPI items are fixed across all cases by the Korea Health Personnel Licensing Examination Institute and cannot be modified by the consortium. Third, the items in each case belong to categories such as history taking, physical examination, and patient education, and the composition ratio of these categories may vary by case. A sub-design using the p×(i:c) structure would have been possible for each case; however, we did not analyze whether the number of items in each category was appropriate, as this was beyond our research question. Future studies should address this topic.
Generalizability
Reliability analysis using GT, as demonstrated in this study, can be applied to other OSCEs to identify sources of measurement error and improve their reliability.
Suggestions
Since there was only 1 examiner per case in this study, the rater facet (r) was not included in the G-study design, and intra-rater reliability was not verified. Further research on this topic is needed.
Conclusion
In the 2021 BGCSE, although the number of cases decreased from 12 to 8, reliability remained acceptable. In the D-study, reliability was maintained at 0.70 or higher with more than 21 items per case for 8 cases, or more than 18 items per case for 9 cases. According to the G-study, increasing the number of items nested in cases, rather than the number of cases, could further improve reliability, because most VCs came from items nested in cases. To implement reliable blueprinting for the OSCE, the consortium needs to maintain a case bank containing cases with a diverse number of items.