Examinations required for medical qualification and certification of fitness to practice must be designed with careful attention to key issues, including blueprinting, validity, reliability, and standard setting, as well as clarity about their formative or summative function (12). Items used in assessment should discriminate adequately between minimally competent and high-achieving students and should be reasonably easy to construct. Additionally, an assessment should reflect key educational objectives in all components of the cognitive domain of Bloom’s taxonomy (21). High reliability is especially important for the final MBBS examination, given its function in licensing medical practitioners (22). Assessment processes should be continuously evaluated, and the feedback should be used to improve subsequent examinations. This study compared the reliability, discrimination index, and quality of EMQs and MCQs that were constructed by faculty members trained in item writing and standard-set using the modified Angoff method (17). The items formed part of the final MBBS examination, which was completed by students from campuses in four member countries of the same regional university, all following the same curriculum and learning objectives. These attributes make this study unique and, to our knowledge, the first such study to be reported in the medical education literature.
5.1. Scoring Pattern for EMQs and MCQs
In the current study, the overall mean score (Table 1) for the EMQs (69% ± 9.8%) was significantly higher than that for the MCQs (62.7% ± 7.4%). Significantly higher mean scores for the EMQs were seen in all four cohorts of students who attempted this examination. Similar findings have been reported in another comparative study of different assessment modalities used in the MBBS examination (23). The EMQ scores had a larger spread, with higher standard deviation and variance values, which would be advantageous and more discriminatory for feedback to students and teachers in formative assessment. An important finding of this study was that the MCQ score had a higher positive predictive value for overall failure in the written examination than the EMQ score. One criticism of EMQs in medical assessment has been that they are less capable of detecting poor performers than MCQs (13). Our study lends support to this criticism.
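To make the positive predictive value comparison concrete, the following minimal sketch shows how such a value could be computed for one component of the paper. The function name, the idea of "flagging" a student who falls below a component pass mark, and the toy data are all assumptions for illustration; they are not the study's data or its exact procedure.

```python
def positive_predictive_value(flagged, failed_overall):
    """PPV = true positives / (true positives + false positives).

    flagged        -- True where the student fell below the (hypothetical)
                      pass mark on one component, e.g. the MCQs
    failed_overall -- True where the student failed the written paper overall
    """
    true_pos = sum(1 for f, y in zip(flagged, failed_overall) if f and y)
    false_pos = sum(1 for f, y in zip(flagged, failed_overall) if f and not y)
    if true_pos + false_pos == 0:
        raise ValueError("no students were flagged by this component")
    return true_pos / (true_pos + false_pos)


# Toy data for five students (illustrative only).
flagged_by_component = [True, True, False, False, True]
failed_written_paper = [True, True, False, False, False]
ppv = positive_predictive_value(flagged_by_component, failed_written_paper)
```

Under this reading, a higher value for the MCQ component means that a student flagged by the MCQ score is more likely to be a true overall failure than a student flagged by the EMQ score.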
5.2. Discrimination Index (DI or r) for EMQs and MCQs
The mean DI (24) was higher for the EMQs than for the MCQs in all four cohorts, although the difference was not statistically significant (Table 2). The mean DIs for the EMQs (range: 0.33 ± 0.32 to 0.37 ± 0.25 among the four cohorts) and the MCQs (range: 0.23 ± 0.61 to 0.27 ± 0.47) were comparable to DIs for MCQs in previous studies (25, 26). Additionally, the proportion of questions with a DI > 0.20 was higher for the EMQs than for the MCQs in all four cohorts, although the difference was not statistically significant. As a general rule, items with DI values < 0.20 are considered poor, indicating that they should be eliminated or revised, and items with DI values > 0.20 are considered fair to good (27). In the present analysis, between 50% and 70% of both EMQs and MCQs had DI values > 0.20, proportions that are comparable to those reported for similar high-stakes examinations (28, 29). The high proportion of EMQs and MCQs with fair to good DIs in this exam analysis supports the validity of the written assessment tool in this examination (27).
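The DI thresholds discussed above can be illustrated with a short sketch. The text does not state which formulation of the DI was used, so the code below assumes one common variant: the difference in the proportion answering correctly between the top and bottom 27% of students ranked by total score. The function names, the 27% group fraction, the treatment of a DI of exactly 0.20 as fair to good, and the toy response matrix are all assumptions.

```python
def discrimination_index(responses, item, group_fraction=0.27):
    """Upper-lower group DI for one item.

    responses -- one list of 0/1 item scores per student
    item      -- index of the item being analysed
    """
    totals = [sum(r) for r in responses]
    # Rank students from highest to lowest total score.
    order = sorted(range(len(responses)), key=lambda s: totals[s], reverse=True)
    n_group = max(1, round(len(responses) * group_fraction))
    upper, lower = order[:n_group], order[-n_group:]
    p_upper = sum(responses[s][item] for s in upper) / n_group
    p_lower = sum(responses[s][item] for s in lower) / n_group
    return p_upper - p_lower


def classify_item(di, cutoff=0.20):
    """Apply the rule of thumb from the text (cutoff handling is an assumption)."""
    return "fair to good" if di >= cutoff else "poor: revise or eliminate"


# Toy matrix: 10 students x 4 items (illustrative only).
toy_responses = [
    [1, 1, 1, 1], [1, 1, 1, 0], [1, 1, 0, 1], [1, 0, 1, 0], [1, 1, 0, 0],
    [0, 1, 0, 1], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [1, 0, 0, 0],
]
di_values = [discrimination_index(toy_responses, i) for i in range(4)]
labels = [classify_item(d) for d in di_values]
```

A correlation-based index (for example, a point-biserial r between item score and total score) is an equally plausible reading of "DI or r" and would replace only the body of the first function.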
5.3. Reliability (Internal Consistency) for EMQs and MCQs
The KR-20 for the EMQs, ranging from 0.52 to 0.70, was lower than that for the MCQs, which ranged from 0.71 to 0.79 (Table 3). The KR-20 index ranges from 0 to 1 and is a measure of inter-item reliability. A higher value for an exam indicates a stronger relationship between items on the test. A low reliability coefficient may arise when a test covers multiple topics, and the coefficient also depends on the total number of test questions. Generally, for a high-stakes or licensing examination, a KR-20 value close to 0.80 is preferred. Of note, there were 80 EMQs and 200 MCQs in this examination. The lower KR-20 for the EMQs may partly be due to the smaller number of EMQs used in this examination. Also, this examination covered a number of specialty topics, for which a KR-20 value of 0.50 would be an acceptable lower limit (19).
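The dependence of KR-20 on test length can be sketched as follows. The first function implements the standard KR-20 formula for dichotomously scored items, using the population variance of total scores (some software uses the sample variance instead). The second applies the Spearman-Brown prophecy formula, which assumes that any added items are parallel to the existing ones. That assumption and the function names are ours and are not part of the study's analysis.

```python
def kr20(responses):
    """KR-20 = k/(k-1) * (1 - sum(p_i * q_i) / variance of total scores)."""
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(r) for r in responses]
    mean_total = sum(totals) / n_students
    # Population variance of the total scores (assumption: divisor is n).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    pq_sum = 0.0
    for i in range(k):
        p = sum(r[i] for r in responses) / n_students  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / var_total)


def spearman_brown(reliability, length_factor):
    """Projected reliability if the test were length_factor times as long."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Project the reported EMQ range from 80 items to 200 items.
factor = 200 / 80
projected_low = spearman_brown(0.52, factor)
projected_high = spearman_brown(0.70, factor)
```

Under the parallel-items assumption, the reported EMQ range of 0.52 to 0.70 at 80 items would project to roughly 0.73 to 0.85 at 200 items. This is consistent with the suggestion that the smaller number of EMQs partly explains their lower KR-20, but it is an idealized projection rather than a result of the study.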
5.5. Distractor Efficiency of EMQs and MCQs
Functional distractors for MCQs decrease correct guessing and cueing. In fact, one advantage of EMQs over MCQs is the larger number of distractors, which decreases the likelihood of guessing and cueing (33, 34). In the present study, a higher proportion of the MCQs than of the EMQs had two or more functional distractors in all four cohorts of students, although the difference was statistically significant for only one of the four cohorts. The larger number of distractors in EMQs makes item writing more difficult because more plausible distractors are required than for MCQs. A high proportion of functional distractors is especially important in EMQs; otherwise, EMQs increase testing time without delivering their intended advantages. Overall, the proportions of items with two or more functional distractors in both EMQs and MCQs were comparable to those reported in other studies (28, 31, 32). However, of concern was the finding that up to 30% of the EMQs and 19% of the MCQs had no functional distractors. This finding may reflect poor item construction by some examiners, as shown in other studies (35). Repeated use of questions from a bank in successive examinations may negatively affect the performance of distractors. Although fewer than 15% of items were repeated from recent final MBBS examinations, this proportion may have been higher if all past examinations had been taken into account. With a higher proportion of repeated questions, the distractors may become less effective, and this may have partly contributed to the high proportion of questions with no functional distractor in this study.
Regular revision and replenishing are required to sustain the viability of the question bank.
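As an illustration of how distractor efficiency might be screened when revising a question bank, the sketch below counts functional distractors for a single item. It assumes the commonly used convention that a distractor is functional if at least 5% of examinees select it. This section does not state the threshold used, and the function name and toy data are hypothetical.

```python
from collections import Counter


def functional_distractors(choices, correct, options, threshold=0.05):
    """Return the incorrect options chosen by at least `threshold` of examinees.

    choices -- the option selected by each examinee for one item
    correct -- the keyed (correct) option
    options -- all options offered for the item
    """
    counts = Counter(choices)
    n = len(choices)
    return [o for o in options
            if o != correct and counts[o] / n >= threshold]


# Toy item answered by 40 examinees: A is the key (illustrative only).
toy_choices = list("A" * 28 + "B" * 8 + "C" * 3 + "D" * 1)
working = functional_distractors(toy_choices, correct="A", options="ABCDE")
needs_review = len(working) == 0  # items with no functional distractor
```

In this toy item, options B and C are functional, while D and E attract too few examinees and would be candidates for rewriting. An item for which the returned list is empty corresponds to the "no functional distractors" category discussed above.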
The observed wider spread of scores and higher mean of EMQs compared to MCQs suggest that EMQs are more suitable for feedback in formative assessment. However, the MCQ scores were more predictive of overall exam failure on the written component, suggesting that MCQs are more suitable for high-stakes assessments such as the final MBBS examination.
5.6. Conclusions
Although there was no significant difference between the DIs of the EMQ and MCQ items, the MCQs demonstrated higher internal consistency. However, both EMQs and MCQs demonstrated similar levels of difficulty. Also, the EMQs displayed poorer distractor efficiency than the MCQs, a finding that reflects the inherent difficulty in EMQ item construction.