
What I've been trying to understand is why they would do this. It really doesn't make much sense; the test is no longer standardized because of it. Maybe they don't want to "waste" experimental questions at non-Prometric sites. Regardless, it's just dumb.
 
So yeah, it's insane that some people will just get a two-hour-shorter exam. But are they really telling me right now that 80 questions per Step exam are unscored filler?

Yeah that’s the real issue here.
 
What’s really going to be fun is what that’s gonna do to your psyche if you take the test like that. Before, you could write off a really hard question as “well, maybe that one was experimental, so I’m fine.”

Now, you’d 100% know you lost points because you know that question is scored. No thanks. I’ll take the longer test where I can miss stuff and feel okay about it when I walk out of the door.
 
There are other students who feel differently. I talked to a student who tested at Brown. He thought it was easier to stay sharp for 5 blocks of 40 questions than for 7 blocks of 40 questions. That could be an advantage.

Makes sense, depending on how you test.

I don't really get too much testing fatigue. I just know I'd be sitting there making tick marks on my whiteboard or whatever, counting every single question I wasn't sure about, and wondering how many questions I could miss and still score well. And then agonizing for weeks over every question I remembered well enough to look up and find out I'd missed.
 
How are you people not getting this? Severely modifying a standardized test destroys the entire validity of a STANDARDIZED test, both within a single year of applicants and beyond.

That is so crazy, because apparently the curve for the current year is set by the previous year's test takers.
 
That's nuts that 80 of the questions on Step 1 are unscored.

I think if I were an informed PD, I would want to ask applicants in what setting they took the exam, but I don't think tons of PDs will be well enough informed on this. Then again I'm not really sure what I would do with that information once I asked.

But over the years I've come to see that Step 1 is not as important as my peers and I thought it was in med school (and I am in one of those hyper-competitive specialties), especially as more and more applicants have great scores. I didn't buy that view at first, but now I understand that what really matters for us is whether you meet a certain benchmark, like 240 or whatever. 250 is golden. Beyond that it's not that big a deal.

That is so crazy, because apparently the curve for the current year is set by the previous year's test takers.
Maybe this is the right time to implement the p/f decision for next year.
 
Maybe this is the right time to implement the p/f decision for next year.

Whoa whoa whoa, lol. I would still 100% rather have it scored and be unfair than to have it P/F. They also said that they're definitely not moving up the timeline for it anyway.
 
Whoa whoa whoa, lol. I would still 100% rather have it scored and be unfair than to have it P/F. They also said that they're definitely not moving up the timeline for it anyway.
I'm not sure how the scoring works, but if what you said is true, and people do significantly better this year because they've had an extra 3 months to study or they take a neutered version of the exam in their school library, wouldn't it be more likely that scores would be scaled down for the following year? Meaning you could turn in a killer performance and get a score of 235. That would wreck your chances at a competitive specialty much more than a p/f.

P.S. In the back of my mind I'm thinking that there is something off about this—I feel like I've heard that the score is and always has been referenced directly to individual exam performance and not group metrics, like when people discussing score inflation say that a 250 today is the same as a 250 was 15 years ago. I know there are people on here who are very knowledgeable about this stuff.
 
I'm not sure how the scoring works, but if what you said is true, and people do significantly better this year because they've had an extra 3 months to study or they take a neutered version of the exam in their school library, wouldn't it be more likely that scores would be scaled down for the following year?

Yeah, there's actually a lot of debate over whether the extra time to study will help, hurt, or even have no effect. I imagine it'll be the same for the shorter version of the exam. But yeah, I'm still not completely sure how the curve is generated, so hopefully someone that knows more will chime in.
 
Are program directors diligent enough to take into account COVID and the reduction to 200 questions? Or will they just take the numbers at face value? Has anyone heard a PD's opinion?
 
Are program directors diligent enough to take into account COVID and the reduction to 200 questions? Or will they just take the numbers at face value? Has anyone heard a PD's opinion?

I haven't heard from a PD on this, but I can say with 100% confidence that they will not care
 
I'm not sure how the scoring works, but if what you said is true, and people do significantly better this year because they've had an extra 3 months to study or they take a neutered version of the exam in their school library, wouldn't it be more likely that scores would be scaled down for the following year? Meaning you could turn in a killer performance and get a score of 235. That would wreck your chances at a competitive specialty much more than a p/f.

P.S. In the back of my mind I'm thinking that there is something off about this—I feel like I've heard that the score is and always has been referenced directly to individual exam performance and not group metrics, like when people discussing score inflation say that a 250 today is the same as a 250 was 15 years ago. I know there are people on here who are very knowledgeable about this stuff.

I doubt that a 250 today is the same as a 250 15 years ago. The way they standardize the test results in a natural score creep.

I emailed them telling them this is ridiculous and to remove all experimental questions for the foreseeable future

They need the experimental questions; otherwise next year's test will be overinflated because of people leaking answers. That is why they aren't removing experimental questions for everyone. It's just incredibly unfair that students at certain medical schools are taking an easier test.
 
What do you expect PDs to do anyway, subtract 10 points from the scores of students who took the 5-hour test instead of the 7-hour test?

No idea what they should do. If I had to guess, they'll just act like nothing's happened and the scores will be what they are. Just another lesson reminding us that life is unfair and there's no obligation to accommodate us.
 
Ok, so I know people are mad at the lack of standardization, but I can't get over the fact that two entire blocks' worth of questions have never counted toward the actual score. That's almost 30% of the exam!
I wouldn't be surprised if the MCAT is like this too.
 
I doubt that a 250 today is the same as a 250 15 years ago. The way they standardize the test results in a natural score creep.
It's not the same percentile-wise, but I do believe it represents the same performance on the exam. This is my understanding:

Because Step 1 is a criterion-referenced test (designed to decide yes/no for licensure) and not a norm-referenced test like the MCAT, SAT, etc., the score is reflective of the person's performance with respect to the questions, not others taking the test. The score creep is more likely due to an actual improvement in performance over the years as Step 1 has grown in importance and taken on the role of a sort of residency aptitude test. People have been much more deliberate and organized in their studying, and study materials have become much more advanced, which has resulted in the percentile curve shifting to the right as the absolute performance scale stays the same.

I don't think that's compatible with scaling scores each year based on the previous years' results. But again, I'm fairly ignorant on the topic. I would love to hear from some experts.
 
Apparently the NBME's already reconsidering this. Y'all are so lucky your testing company listens to your feedback. NBME >>>>>> NBOME
 
It's not the same percentile-wise, but I do believe it represents the same performance on the exam. This is my understanding:

Because Step 1 is a criterion-referenced test (designed to decide yes/no for licensure) and not a norm-referenced test like the MCAT, SAT, etc., the score is reflective of the person's performance with respect to the questions, not others taking the test. The score creep is more likely due to an actual improvement in performance over the years as Step 1 has grown in importance and taken on the role of a sort of residency aptitude test. People have been much more deliberate and organized in their studying, and study materials have become much more advanced, which has resulted in the percentile curve shifting to the right as the absolute performance scale stays the same.

I don't think that's compatible with scaling scores each year based on the previous years' results. But again, I'm fairly ignorant on the topic. I would love to hear from some experts.

I don't think so, since Step 2CK scores have increased even faster than Step 1 scores, even though there is less emphasis placed on Step 2CK and there are fewer organized resources for it. Here's my theory:

The NBME has just admitted that the experimental questions are validated and scored using performance on old questions. So if someone taking a test in 2017 performs at a 230 level on old questions and gets 80% right on experimental questions, then 230 = 80% correct (averaged) for future examinees taking those questions. Then two things happen: (1) Some of the experimental questions get thrown out because they have no predictive value. The average score will tend to drift higher because the experimental questions have improved in quality compared to when they were taken by the old examinees. The NBME attempts to correct for this, but the correction is imperfect. (2) Some of the information from the experimental part gets passed on to First Aid, UWorld, etc. Future test takers who study these sources then get a boost. Thus, examinees will technically "know more," but what would have been critical-thinking questions become simple memorization questions if sources are telling you the answer. These changes might be small enough to compare applicants year by year, but not over 5+ years.
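To make the mechanism described above concrete, here is a minimal sketch of that anchoring idea in Python. It is purely illustrative: the proportional mapping, the function names, and the numbers are assumptions for the sake of the example, not the NBME's actual equating procedure (which isn't spelled out in this thread).

```python
# Toy illustration (NOT the NBME's real method) of anchoring experimental
# items to a known score, as the theory above describes.

def calibrate_item_set(anchor_pct_correct, anchor_score):
    """Record that examinees performing at `anchor_score` on validated items
    averaged `anchor_pct_correct` on the experimental items (hypothetical)."""
    return {"pct": anchor_pct_correct, "score": anchor_score}

def estimate_score(pct_correct, calibration):
    """Naive proportional estimate for a future examinee who sees those same
    items once they go live. A real scale would not be this simple."""
    return calibration["score"] * pct_correct / calibration["pct"]

calibration = calibrate_item_set(0.80, 230)      # 2017 cohort: 80% correct ~ 230
print(round(estimate_score(0.84, calibration)))  # later examinee: ~242
```

Under this toy model, any systematic improvement on the now-live items (leaked content, better prep resources) shows up directly as a higher estimated score, which is the score-creep mechanism the post is arguing for.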
 
I don't think so, since Step 2CK score averages have increased even faster than Step 1 averages, even though there is less emphasis placed on Step 2CK and there are fewer organized resources for it. Here's my theory:

The NBME has just admitted that the experimental questions are validated and scored using performance on old questions. So if someone taking a test in 2017 performs at a 230 level on old questions and gets 80% right on experimental questions, then 230 = 80% correct (averaged) for future examinees taking those questions. Then two things happen: (1) Some of the experimental questions get thrown out because they have no predictive value. The average score will tend to drift higher because the experimental questions have improved in quality compared to when they were taken by the old examinees. The NBME attempts to correct for this, but the correction is imperfect. (2) Some of the information from the experimental part gets passed on to First Aid, UWorld, etc. Future test takers who study these sources then get a boost. Thus, examinees will technically "know more," but what would have been critical-thinking questions become simple memorization questions if sources are telling you the answer. These changes might be small enough to compare applicants year by year, but not over 5+ years.
Good, thoughtful post. I can't refute your point about CK with any real evidence, but I would guess that CK scores have gone up because people are studying more for Step 1. Putting in that legwork for a more challenging exam really makes CK easier. Anecdotally, most of my friends and I did very well on Step 1, didn't study at all for CK per se, and scored quite well.

I follow you on the second paragraph except for the bolded parts. Presumably the experimental questions are turned over, and there are constantly new questions being added and validated questions being funneled into the main pool. If I'm interpreting your point correctly (I may have had one too many beers with dinner this evening), I don't see a reason to believe examinee performance would be better on the experimental questions in aggregate, aside from general, glacial advances in educational testing and question writing from one era to another.

Regarding the second bolded part, I think that partially restates my main point. Sources telling you the answer to what was once more arcane is part of performing better on questions over the same fund of knowledge due to superior preparation.

I think a relevant thought experiment is whether someone who scored 250 in 2005 would score 250 today (discounting new scientific knowledge). I don't see why he/she wouldn't.
 
So have students taken this shorter version? From reading the NBME announcement it doesn't sound like they have instituted it yet.
 
Man, does this mean we've all missed the boat on taking a shorter test? Would have been nice....
 
I don't think so, since Step 2CK scores have increased even faster than Step 1 scores, even though there is less emphasis placed on Step 2CK and there are fewer organized resources for it. Here's my theory:

The NBME has just admitted that the experimental questions are validated and scored using performance on old questions. So if someone taking a test in 2017 performs at a 230 level on old questions and gets 80% right on experimental questions, then 230 = 80% correct (averaged) for future examinees taking those questions. Then two things happen: (1) Some of the experimental questions get thrown out because they have no predictive value. The average score will tend to drift higher because the experimental questions have improved in quality compared to when they were taken by the old examinees. The NBME attempts to correct for this, but the correction is imperfect. (2) Some of the information from the experimental part gets passed on to First Aid, UWorld, etc. Future test takers who study these sources then get a boost. Thus, examinees will technically "know more," but what would have been critical-thinking questions become simple memorization questions if sources are telling you the answer. These changes might be small enough to compare applicants year by year, but not over 5+ years.

Experimental questions are not scored. Hence the more proper name for them: unscored pretest items (see the first post in this thread). Unfortunately I think this renders your theory moot.

The inclusion of unscored pretest items on USMLE exams is done to gather statistical data on their performance, most notably whatever discrimination index the NBME uses. These indices basically tell you whether or not a given test question separates those who know the content from those who don't know the content. Items with good discrimination indices get called up to the big leagues and are used "live" on a future exam. When a question gets used for real the NBME is still collecting data on it to ensure the discrimination index remains acceptable. The other important determination is how difficult it is (how many people get it correct). The reason it takes 3+ weeks to get your score back is because of all the statistics that have to be run to evaluate discrimination and relative test difficulty between different takers.

To my knowledge the year-over-year comparisons have more to do with estimating percentiles than anything else, which change very slowly over time.
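For anyone curious what a discrimination index looks like in practice, here is a minimal sketch of one common classroom version (the upper/lower-group index, D = p_upper - p_lower). This particular formula is an assumption for illustration; the post above only says the NBME uses some discrimination index, not this one.

```python
# Classic upper/lower-group discrimination index: compare how often the
# top-scoring and bottom-scoring examinees get a given item right.

def discrimination_index(item_correct, total_scores, group_frac=0.27):
    """item_correct: 0/1 responses to one item, one entry per examinee.
    total_scores: each examinee's overall test score, in the same order.
    Returns p_upper - p_lower over the top/bottom `group_frac` of examinees."""
    n = len(total_scores)
    k = max(1, int(n * group_frac))
    order = sorted(range(n), key=lambda i: total_scores[i])  # low to high
    lower, upper = order[:k], order[-k:]
    p_lower = sum(item_correct[i] for i in lower) / k
    p_upper = sum(item_correct[i] for i in upper) / k
    return p_upper - p_lower  # near 0: poor discriminator; near 1: strong

# Example: an item that high scorers usually get right and low scorers miss
responses = [0, 0, 1, 0, 1, 1, 1, 1]
totals    = [180, 190, 200, 215, 230, 240, 250, 260]
print(discrimination_index(responses, totals))  # 1.0 here, a "keeper" item
```

An item whose index hovers near zero (everyone gets it right, or responses look random) doesn't separate strong examinees from weak ones, which is exactly why it would never get promoted to the live pool.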
 
Experimental questions are not scored. Hence the more proper name for them: unscored pretest items (see the first post in this thread). Unfortunately I think this renders your theory moot.

The inclusion of unscored pretest items on USMLE exams is done to gather statistical data on their performance, most notably whatever discrimination index the NBME uses. These indices basically tell you whether or not a given test question separates those who know the content from those who don't know the content. Items with good discrimination indices get called up to the big leagues and are used "live" on a future exam. When a question gets used for real the NBME is still collecting data on it to ensure the discrimination index remains acceptable. The other important determination is how difficult it is (how many people get it correct). The reason it takes 3+ weeks to get your score back is because of all the statistics that have to be run to evaluate discrimination and relative test difficulty between different takers.

To my knowledge the year-over-year comparisons have more to do with estimating percentiles than anything else, which change very slowly over time.

This is actually consistent with my theory. Maybe I didn’t word it the best way, but what you are saying is basically the same thing I was saying. “Experimental questions” are scored for future examinees, not current ones.

Anyway, here's a possible example of how it could result in score creep. According to your link, a discriminatory index of 0 means a question cannot discriminate between high and low scorers, and the index ranges from -1 to +1. Let's say that out of 80 unscored questions, 20 have discriminatory indices less than 0.2, which is considered "low." Those questions get tossed out. But because they still have some discriminatory value, ranging from 0 to 0.2 (negative values are rare given how heavily the exam leans on factual knowledge), taking them away shifts scores for future examinees. Take, for example, someone who scored a 230 and got 75% of the unscored questions correct, or 60/80. Now say a future examinee takes an exam containing those questions as live items, minus the ones removed for poor discrimination; they might get 45 to 47 out of 60 correct, or 75-78%. It tends toward slight score increases because questions with positive discriminatory value are being taken away.
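A quick check of the arithmetic in that example, using the same hypothetical numbers (these are the post's illustrative figures, not real NBME data):

```python
# 80-item pool, 60 correct (75%). Drop 20 low-discrimination items and see
# what the percent correct looks like if those dropped items were answered
# correctly at the same or a somewhat lower rate than the rest.

total_items, total_correct, removed = 80, 60, 20
for correct_on_removed in (13, 14, 15):   # correct answers among dropped items
    kept_correct = total_correct - correct_on_removed
    kept_items = total_items - removed
    print(f"{kept_correct}/{kept_items} = {100 * kept_correct / kept_items:.1f}%")
# 47/60 = 78.3%, 46/60 = 76.7%, 45/60 = 75.0%
```

So the 75-78% range quoted above corresponds to the dropped items having been answered correctly at the same or a slightly lower rate than the kept ones; whether that is actually true of tossed items is the assumption doing the work in this theory.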
 
Hope they roll this back. I know I was thinking a little slower by the last couple of blocks on test day; it's mentally exhausting. Seems like a huge advantage to get two hours shaved off.

Not to mention that the head-scratcher questions that you have to read 5 times to understand what they're asking are more likely to be experimental than scored. Getting rid of all those stumbling blocks and taking only 200 high-quality Q's under the same time limit per block is yet another advantage.

Why can't they use 280 validated items and provide a much more accurate score under the original format? Does anyone know why they need it to be shorter?
 
Hope they roll this back. I know I was thinking a little slower by the last couple of blocks on test day; it's mentally exhausting. Seems like a huge advantage to get two hours shaved off.

Not to mention that the head-scratcher questions that you have to read 5 times to understand what they're asking are more likely to be experimental than scored. Getting rid of all those stumbling blocks and taking only 200 high-quality Q's under the same time limit per block is yet another advantage.

Why can't they use 280 validated items and provide a much more accurate score under the original format? Does anyone know why they need it to be shorter?
To get you out of the testing center faster?????
 
Yeah, F them for trying to reduce the chances of you catching or spreading SARS-CoV2.
I hate to break this to you, Goro, but if 2 hours of socially distanced sitting in a test booth is enough to give a bunch of med students SARS2, then they're all going to immediately get it when they return to the wards anyway. There is no way to stay healthy through 2 years of rotations in a hospital if a testing room is that dangerous.
 
I hate to break this to you, Goro, but if 2 hours of socially distanced sitting in a test booth is enough to give a bunch of med students SARS2, then they're all going to immediately get it when they return to the wards anyway. There is no way to stay healthy through 2 years of rotations in a hospital if a testing room is that dangerous.
It's all about reducing risk, in my view.
 
Yeah, F them for trying to reduce the chances of you catching or spreading SARS-CoV2.

That's really a straw man... Screw them for changing the exam for some test-takers but not all. This introduces factors that make it a non-standardized exam, particularly for a board exam with a very real fatigue component.
 
It's all about reducing risk, in my view.
You know what, I think I read that the LSAT or the GRE (or both?) are also being abridged right now, and those are actually being remotely proctored to people sitting at home.

I'm really lost on this one... why are all these tests shortening themselves? What do they gain from this?
 
Jeez, the hyperachievers need to chill.

What part of "we're getting rid of the tryout items that aren't part of the exam score" is so traumatic?

Just take the exam and do the best you can. That's all that's expected of you. PDs aren't going to care. They just want a screening tool.
 
Yeah, F them for trying to reduce the chances of you catching or spreading SARS-CoV2.
Then why are the people at Prometric centers taking the full-length, unabridged test? How is their exposure different? Heck, if I take it at Prometric, I might be exposing people who aren't in healthcare at all.

Sorry but that doesn’t make sense. And when it doesn’t make sense, it probably makes money.
 
Then why are the people at Prometric centers taking the full-length, unabridged test? How is their exposure different? Heck, if I take it at Prometric, I might be exposing people who aren't in healthcare at all.

Sorry but that doesn’t make sense. And when it doesn’t make sense, it probably makes money.
Didn't know this. Still waters clearly run deep
 
And when it doesn’t make sense, it probably makes money.
I was just thinking about this regarding COMLEX PE. It seems absurd at first for them to insist on keeping PE around when CS is cancelled... but not if PE could become the new way for IMGs to get ECFMG certified (which previously required CS).

Maybe they're insisting on their nonsense PE exam because they could pick up so much $$$ from the IMG/FMG crowd this year?
 
Everything I've heard from this organization over the last 6 months, in addition to my interactions with my own admins, has convinced me that everyone in Med Ed is brain dead. Who else could have possibly green-lit this decision?
We do have a few people on our side; lemme plug Carmody one more time. I think he's also popping off on Twitter about the abridged Step exam.

 