Experimental questions, test forms, exam statistics from NBME published articles

paul411
Here are some non-speculative snippets of information about how the NBME makes USMLE Step 1 tests: Article 1, Article 2

"Dozens of test forms are used, with examinees randomly assigned to forms. Test sessions are scheduled for eight hours… Sections and items within sections are presented in random order."

Indices used to quantify question difficulty and quality (a rough computational sketch follows the list):
  • item difficulty (P value) - calculated as the proportion of examinees who responded to the item correctly
  • logit transform of the item difficulty - log[p / (1 - p)]
  • index of item discrimination: the item-total (biserial) correlation - the correlation between the item (scored 0/1 for incorrect/correct) and the reported total score
  • r-to-z transformation of the biserial correlation - commonly used to correct for nonlinearities in the magnitude of correlation coefficients
  • mean response time in seconds
  • mean of the natural logs of response times
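
For anyone who wants to play with these, here's a minimal sketch of how those indices could be computed from a 0/1 response matrix, per-item response times, and reported total scores (my own illustration, not NBME's code; the array names are made up, and I'm approximating the biserial correlation with a point-biserial item-total correlation):

import numpy as np

def item_indices(responses, response_times, total_scores):
    # responses: (examinees x items) matrix of 0/1 incorrect/correct
    # response_times: (examinees x items) matrix of seconds per item
    # total_scores: (examinees,) vector of reported total scores

    # item difficulty (P value): proportion answering each item correctly
    p = responses.mean(axis=0)

    # logit transform of the item difficulty
    logit_p = np.log(p / (1.0 - p))

    # item discrimination: correlation between each item (0/1) and the
    # reported total score (point-biserial stand-in for the biserial)
    r = np.array([np.corrcoef(responses[:, j], total_scores)[0, 1]
                  for j in range(responses.shape[1])])

    # Fisher r-to-z transformation of the item-total correlation
    z = np.arctanh(r)

    # mean response time (seconds) and mean of the natural logs of response times
    mean_rt = response_times.mean(axis=0)
    mean_log_rt = np.log(response_times).mean(axis=0)

    return p, logit_p, r, z, mean_rt, mean_log_rt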

In article 2, the distribution of experimental (unscored) questions within a block was interesting: of the two experimental items, one was included at a random position in the section and the other was always the last question in the block.
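
If I'm reading that right, a block would be put together roughly like this (purely my own sketch of the placement rule described above; the item lists and function name are made up):

import random

def assemble_block(scored_items, experimental_items):
    # expects exactly two experimental (unscored) items per block
    assert len(experimental_items) == 2
    block = list(scored_items)
    # one experimental item goes in at a random position among the scored items
    block.insert(random.randint(0, len(block)), experimental_items[0])
    # the other experimental item is always the last question in the block
    block.append(experimental_items[1])
    return block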

Just thought I'd share some of the info I found while obsessing over Step 1.
 
So 14 experimental q's per form? That's interesting. The random distribution seems to make the most sense; I wonder what the justification is for always putting one last.
 
I don't think we can necessarily deduce that there are 14 experimental questions per form based on that article. It only discusses a very particular set of experimental questions. I'd speculate there are more than 14 experimental Q's per form.
 
My initial guess would have been >14, although if the exact experimental questions don't remain consistent from form to form, you could easily generate enough forms with just that many.
 
I thought it was interesting that the test takers who performed well spent less time on the easier questions and more time on the harder questions than the poor test takers did. It makes sense when you think about it, but it seems to suggest to me that the lower-scoring students are reading the harder questions and just giving up on them.
 
Or, alternatively, the fact that they spent so much time on the easier questions left them less time to spend on the hard ones. Or it could be both.
 
I thought it may be because when lower scorers see the harder questions, they think, "I guess I don't know this either, so I'll guess and move on," whereas higher performers know just about every topic, realize the question is a deduction type rather than pure recall, and appropriately spend more time on it. I think all of these reasons can play a role.
 