I fail to understand why step matters so much....

The biggest advantage of being an MD/PhD student is that you act and talk like an adult. You've mentored people at that point and you know exactly what sorts of behaviors look good vs. rub people the wrong way. You see just how unpalatable it is to train a know-it-all. You're the same age as the residents, and the attendings don't see you as a child. Patients also tend to give you a bit more respect. 26-year-old me would have whiffed on some of the patients I've built strong alliances with, and I'm sure I would have come across as a brat to some of the residents.

This isn't limited to MD/PhDs. This can also be people with previous careers.

 
Wait, what?????
Someone gave a very vivid description of a possum decaying in the sun on a hot day. Thankfully, we didn’t end up recruiting Dexter Morgan.
 
Someone gave a very vivid description of a possum decaying in the sun on a hot day. Thankfully, we didn’t end up recruiting Dexter Morgan.
Come on - you know Dexter would be too smart to put all that on paper ;)
 
Yes, it's the new normal. Step 2 is the new Step 1.
It was the predictable outcome of making Step 1 P/F.
Now there is only one bite at the apple and it comes too late to change course if there is an unfortunate score.

I don't think Step 2 could seriously have been considered a "second bite at the apple", especially at the height of Step 1 madness, especially for competitive specialties.

If there's going to be a high-stakes test for residency stratification purposes, I would much rather have that be something similar to Step 2 and 3 than Step 1; so, although I don't think this was the intention of the Step 1 change, I do agree with the result.

A better scenario would be to make all licensing tests pass-fail, then create a "new" test purely for residency stratification purposes that allows for retakes. I put "new" in quotes because, practically speaking, this would just be a Step 2/3 rip-off. Alternatively, we just keep using Step 2, allow for retakes, and let state boards figure out what to do in the rare case where someone gets a passing score on Step 2 followed by a retake where they fail. Perhaps this would be as easy as saying "your most recent Step 2 score must be in the passing range".
 
A better scenario would be to make all licensing tests pass-fail, then create a "new" test purely for residency stratification purposes that allows for retakes. I put "new" in quotes because, practically speaking, this would just be a Step 2/3 rip-off.
From a purely psychometric standpoint, it would be reasonable to create a new exam that is designed to stratify. The current step exams are designed to yield a yes/no result around a central question of minimal competency, which is a different goal.

Of course, this would require someone having to build and administer this new exam, which would be expensive, time-consuming, and add testing and financial burdens to medical students. The question of whether or not the resulting stratification is actually meaningful would likely persist, as well.
 
From a purely psychometric standpoint, it would be reasonable to create a new exam that is designed to stratify. The current step exams are designed to yield a yes/no result around a central question of minimal competency, which is a different goal.

They provide a score and a percentile, though. I know their original intent was to be criterion-based, but the scoring definitely allows for stratification at this time.

Of course, this would require someone having to build and administer this new exam, which would be expensive, time-consuming, and add testing and financial burdens to medical students. The question of whether or not the resulting stratification is actually meaningful would likely persist, as well.

I think the natural answer to this would be the NBME with its decades of experience in building these tests.

The cynic in me says that the NBME just makes a "new" test which is just a Frankenstein of Step 2+3 to rake in more money.

The hopeful in me says that the NBME keeps Step 2 scored, allows for retakes, then the individual state licensing boards just deal with the fact that some people will have multiple Step 2 scores.
 
State boards don't care about the number of Step scores as long as none are failures. Even then, most allow more than one fail per Step before they care.

Here's what my state has to say about it:
For the United States Medical Licensing Examination or the Comprehensive Osteopathic Medical
Licensing Examination, or the Medical Council of Canada Qualifying Examination, the applicant
shall pass all steps within ten years of passing the first taken step. The results of the first three takings
of each step examination must be considered by the board. The board may consider the results from a
fourth taking of any step; however, the applicant has the burden of presenting special and compelling
circumstances why a result from a fourth taking should be considered.

So basically if you haven't passed a Step exam by the 4th try you can't get licensed. But if you take it 6 times to try and get the highest score possible, the state doesn't care as long as you passed one of the first 4 times.
 

Right, but I think they haven't cared because the NBME makes you stop once you pass.

In my mind "Fail -> Pass" (or even "Fail -> Fail -> Pass") is much more straightforward to interpret from a licensing point of view than "Pass -> Fail" ( or "Pass -> Fail -> Pass"). With the former you can say "This person had some learning deficiencies, which they shored up and are now proficient to practice medicine"; with the latter you'd have to say something like "this person achieved proficiency, then... lost proficiency, but they're still good to go anyways."

Like I said, maybe I'm overthinking these rare edge cases.
 
They provide a score and a percentile, though. I know their original intent was to be criterion-based, but the scoring definitely allows for stratification at this time.
This statement sort of gets at the original point of this thread: the fact that scores and percentiles are generated as a byproduct of a pass/fail exam does not make them meaningful.

The passing threshold for the step exams is set using something called the Modified-Angoff method, which is a criterion-referenced approach.
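For anyone curious what that looks like mechanically, here is a minimal sketch of a Modified-Angoff style cut-score calculation. The ratings are invented for illustration; the NBME's actual process involves far more items, judges, and review rounds:

```python
# Minimal sketch of a Modified-Angoff style cut-score calculation.
# Ratings below are made up for illustration only.

# Each inner list: one judge's estimated probability that a *minimally
# competent* examinee answers each item correctly.
judge_ratings = [
    [0.70, 0.55, 0.90, 0.60, 0.80],   # judge 1
    [0.65, 0.50, 0.85, 0.70, 0.75],   # judge 2
    [0.75, 0.60, 0.95, 0.55, 0.85],   # judge 3
]

n_items = len(judge_ratings[0])

# Average the judges' estimates for each item, then sum across items:
# the result is the expected raw score of a borderline examinee,
# which becomes the passing threshold.
item_means = [
    sum(judge[i] for judge in judge_ratings) / len(judge_ratings)
    for i in range(n_items)
]
cut_score_raw = sum(item_means)                  # expected items correct
cut_score_pct = 100 * cut_score_raw / n_items    # as percent correct

print(f"Cut score: {cut_score_raw:.1f}/{n_items} items ({cut_score_pct:.0f}%)")
```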
 
the fact that scores and percentiles are generated as a byproduct of a pass/fail exam does not make them meaningful.

Is the issue that the questions are not good enough to accurately assess knowledge, or is there an issue with the scoring method itself that inherently increases the error in a score?

I admit I don't know enough about standardized testing methodology--I just assumed that the NBME practices were the best we could do given their long history.
 
Right, but I think they haven't cared because the NBME makes you stop once you pass.

In my mind "Fail -> Pass" (or even "Fail -> Fail -> Pass") is much more straightforward to interpret from a licensing point of view than "Pass -> Fail" ( or "Pass -> Fail -> Pass"). With the former you can say "This person had some learning deficiencies, which they shored up and are now proficient to practice medicine"; with the latter you'd have to say something like "this person achieved proficiency, then... lost proficiency, but they're still good to go anyways."

Like I said, maybe I'm overthinking these rare edge cases.
Is that a new thing, because that definitely wasn't the case when I was in med school?
 
Is that a new thing, because that definitely wasn't the case when I was in med school?

From the USMLE FAQs:

"If you pass a Step, you are not allowed to retake it, except to comply with certain state board requirements which have been previously approved by USMLE governance."

As to how recent it is--I'm not sure, but this was definitely the policy 10+ years ago when I took them.
 
Must have been right after me then, I'm 15 years out from Step 1 and we could take it again if we wanted (basically no one did though).
 
I don't think Step 2 could seriously have been considered a "second bite at the apple", especially at the height of Step 1 madness, especially for competitive specialties.

If there's going to be a high-stakes test for residency stratification purposes, I would much rather have that be something similar to Step 2 and 3 than Step 1; so, although I don't think this was the intention of the Step 1 change, I do agree with the result.

A better scenario would be to make all licensing tests pass-fail, then create a "new" test purely for residency stratification purposes that allows for retakes. I put "new" in quotes because, practically speaking, this would just be a Step 2/3 rip-off. Alternatively, we just keep using Step 2, allow for retakes, and let state boards figure out what to do in the rare case where someone gets a passing score on Step 2 followed by a retake where they fail. Perhaps this would be as easy as saying "your most recent Step 2 score must be in the passing range".
Whatever the problem is with the current USMLE system, the answer cannot possibly be "make another high stakes test."
 
Whatever the problem is with the current USMLE system, the answer cannot possibly be "make another high stakes test."

It's tough to imagine, but if step 1 remains P/F it might be worth it for those aiming for competitive specialties.

Imagine spending thousands of dollars setting up away rotations before taking step 2, and getting a very low score. People who don't match into competitive specialties often forfeit thousands of dollars taking research years to improve their applications.
 
Yeah but presumably you'd spend thousands of dollars taking this specialty-specific exam. I can't imagine the specialty-specific exam would be any sooner than current Step 2 timeframe, so it's not like you could make useful decisions based on that information. And there just aren't enough months before ERAS opens to allow for yet another dedicated study period, plus sub-I, plus away rotations.

For all of the above reasons I think it is a net negative that Step 1 is now P/F, but now that we are here I think applicants who choose to shoot for a competitive specialty just have to embrace a certain level of risk. If risk makes you uncomfortable, then pick a different specialty. Again--not that I am saying this is by any means FAIR, but I'm not sure there is a good alternative.
 

The timing of VSAS relative to Step 2 scoring is crap. Even though many schools are switching to a 1.5-year preclinical curriculum and students are taking Step 2 approximately 6-7 months before ERAS is due, that still isn't enough time to switch audition rotations, because those usually get locked in around that time anyway.
 
Must have been right after me then, I'm 15 years out from Step 1 and we could take it again if we wanted (basically no one did though).

I took Step 1 summer of 2008. At that time, you couldn't retake it if you passed. I also don't recall any kerfuffle about a policy change, and I feel like that's something that would have been a hot topic of conversation and much complaining from various students. So I suspect the policy has been around since at least 2006.

The MCAT you could take a million times.
 
Basically, you're old @VA Hopeful Dr :rofl:
 
Now now, I’m saying I’m equally as old. 😂 2008 puts me at 15 years out too! Not trying to throw shade at my fellow “seasoned” docs. 😂
Lol, I can't say much--I think mine was 2011 :rofl: But by then there was definitely no ambiguity, you only got one shot, and I actually was unaware that at one point you could have retaken the exam if you were a masochist.
 
I took Step 1 summer of 2008. At that time, you couldn't retake it if you passed. I also don't recall any kerfuffle about a policy change, and I feel like that's something that would have been a hot topic of conversation and much complaining from various students. So I suspect the policy has been around since at least 2006.

The MCAT you could take a million times.
I found a thread here from 2008 saying you couldn't retake it, so I'm guessing I heard wrong at the time.

Also, get off my lawn!
 
Now now, I’m saying I’m equally as old. 😂 2008 puts me at 15 years out too! Not trying to throw shade at my fellow “seasoned” docs. 😂
Yeah but you did like eleventy billion PGY years so in doctor years I'm older.
 
Is the issue that the questions are not good enough to accurately assess knowledge, or is there an issue with the scoring method itself that inherently increases the error in a score?

I admit I don't know enough about standardized testing methodology--I just assumed that the NBME practices were the best we could do given their long history.
Nor am I a psychometrician, so take this with a grain of salt. But if you're building an exam to assess a minimum level of knowledge, then the basic question for each exam item is "will a minimally competent test-taker get this right?" The exam only needs to be long enough to provide statistical heft to that analysis, with the passing threshold set to minimize false positives.

If, on the other hand, you want to build an exam that can statistically differentiate between two individuals with similar knowledge and test-taking abilities, that's a different situation. In the last year Step 1 was scored, the standard deviation was 19, which is pretty large. In order to reduce that you're probably going to have to make the exam much longer, essentially adding power until getting 90% of the items correct is statistically different from getting 88% of the items correct (a rough sketch of this is below).

And, as I said earlier, the question of whether or not this stratification is meaningful would persist. If you look at correlation studies between step scores and specialty board passage (more multiple choice exams), the curves flatten as you go up the score scale. Carmody examined this back in 2019. Ultimately it seems the exams are good at predicting future problems for low-scorers, but aren't very good at predicting anything useful for high-scorers.
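To put a rough number on that "adding power" point, here is a back-of-the-envelope sketch. It treats each item as an independent coin flip, which is cruder than the item-response models actually used for the real exams, and the numbers are purely illustrative:

```python
import math

# Back-of-the-envelope: how many items before an examinee answering 90%
# correctly is statistically distinguishable from one answering 88%?
# Treats each item as an independent Bernoulli trial -- illustration only,
# not how the NBME actually models items.

p1, p2 = 0.90, 0.88          # true proportions correct for the two examinees
z = 1.96                     # ~95% confidence
diff = p1 - p2

# Solve  diff >= z * sqrt(p1*(1-p1)/n + p2*(1-p2)/n)  for n
n_needed = math.ceil(z**2 * (p1 * (1 - p1) + p2 * (1 - p2)) / diff**2)
print(f"Items needed: ~{n_needed}")   # ~1879 items

# A one-day exam has on the order of a few hundred items, so differences
# this small are largely noise at current exam lengths.
```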
 
And therein lies the problem. Students have mastered the knowledge so well that the only options are to greatly expand the scope of what is tested or to make the exam much longer. Both of these are unreasonable considering it is already a 9-hour exam that commands hundreds (dedicated) if not thousands (including all of clinical year) of hours of studying to master the material. They've made their best effort at making the questions substantially more difficult (you can compare old NBME exams to present ones, and the old ones are laughably simple), but students have still "beaten" the test, so to speak. At some point we have to accept that you can't use this to differentiate among the top anymore. A 260 vs a 250 absolutely will have a strong effect on residency placement, and a PD can look at the percentiles and be wowed that there's a 26-percentile-point difference (80th vs 54th), but this could literally mean 85 vs 82% of questions correct. When you look at it like that, it becomes much less meaningful in terms of real-world effect. Just like how finding a landmark diabetes drug with a p-value of .000000001 for lowering A1c by .01% doesn't have any real-world meaning.

Obviously they can't just report raw scores, as forms are different, but maybe we should switch to reporting equated percent correct only, without percentiles. The way it's reported now encourages people to make false assumptions ("someone with a 260 must have performed way better than someone with a 250 to be in the 80th percentile vs the 54th percentile"). Why not let PDs make judgments themselves based on equated percent correct? I guarantee the number of people selecting a 260 over a 250 would go way down if it were reported as 85 vs 82% correct.

(Percentages are for example only; I do not know the true difference.)
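Just to illustrate how steep the percentile curve is near the middle of the distribution, here is a toy calculation. It assumes a roughly normal score distribution with the ~19-point standard deviation mentioned above and a made-up mean; none of these are official figures:

```python
import math

def normal_cdf(x, mean, sd):
    # Percentile of a score under a normal distribution.
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

# Assumed parameters for illustration only -- not official NBME figures.
MEAN, SD = 248, 19

for score in (250, 260):
    pct = 100 * normal_cdf(score, MEAN, SD)
    print(f"Score {score}: ~{pct:.0f}th percentile")

# With these assumptions, a 10-point scaled-score gap (about half a
# standard deviation) spans roughly 20 percentile points, even though it
# may correspond to only a handful of additional correct answers on a
# form of a few hundred items.
```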
 
Whether or not higher scores predict anything (other than how people are likely to do on future exams) is unclear.

The performance distribution on all tests like this is very steep around the mean. Small changes in absolute performance will herald large changes in percentiles. The same is true for the MCAT also. The difference with the MCAT is that people with scores at the mean or below often don't get into med school at all. So when you look at all the higher scorers, the absolute differences tend to be a bit bigger. MCAT doesn't release absolute percentages as far as I know but I expect you'd find the same thing.

Regarding psychometrics as described by @Med Ed, although in general that's true and often argued by the USMLE as a reason not to use scores, it's also not really applicable because the USMLE isn't a test designed to assess minimum knowledge. If you really want a minimum knowledge test, you create it such that most people will get 100% of the questions correct. A written driver's ed test is a good example of this. It's designed so that if you know the basic material, you get everything correct. And by inference, the passing cut off tends to be relatively high. Another example are the innumerable online HR modules I need to complete each year -- each has a test, I need to score 90% to pass, and getting 100% is usually very easy. (Pointless aside, I am really annoyed when these tests have a minimum pass of 90% but only have 5 questions)

That's not the USMLE exam design. The USMLE is designed as a general knowledge test with the mean in the middle. The minimum pass level is theoretically picked to define minimum necessary knowledge. But the score clearly represents the taker's knowledge as measured on a MCQ test.

Not to nit pick, but the standard deviation doesn't tell you whether a score of 250 is different from 260. That's approximated by the standard error of measurement, which is much smaller (about 9 I think). And even that doesn't say that scores within 9 points are "indistinguishable" -- unless you want to make that statement with 66+% certainty.
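To put a number on that: under simple assumptions (the ~9-point SEM, independent and normally distributed measurement error, and no prior information about the two examinees), the chance that the 260 reflects genuinely more knowledge than the 250 works out to roughly three in four. A quick sketch:

```python
import math

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Assumptions for illustration: SEM ~9 points (as above), independent,
# normally distributed measurement error, no prior on true scores.
SEM = 9
se_diff = SEM * math.sqrt(2)    # standard error of the difference of two scores

gap = 260 - 250
z = gap / se_diff
print(f"P(the 260's true score exceeds the 250's): ~{100 * normal_cdf(z):.0f}%")  # ~78%
```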
 
My point was not whether a 250 vs 260 is statistically different; I'm sure it is. My point was that "statistically different" might mean absolutely nothing if we were to find out the true value. If I were a PD, I would not care whether someone got 5 more questions right on a 300-question exam, regardless of its statistical significance, because the real-world meaning of that is close to 0. When I go to practice clinical medicine I would use that same principle and not choose a diabetes drug that reduces A1c by an additional .001% when that clearly has no real effect vs other features of the drug. When we report scores as we do now, with percentiles, we are encouraging PDs to draw conclusions about the score that might not be true. Why not report an equated percent correct without percentiles and let them draw their own conclusions? If a PD sees 85% vs 82% and doesn't really care about the difference, doesn't that mean it's silly to show 260 vs 250 and encourage them to conclude the 260 is a superior student because they are 26 percentile places higher?

We need to start showing how bunched up students are in terms of raw performance, because 26 percentile places might be very few questions. It's like they're hiding information that would make people take the test less seriously. I am all for standardized testing, but if everyone starts doing well, you can't game the system by making some people look bad for being in the 5th percentile even though they are barely doing worse, in terms of raw questions correct, than someone in the 25th percentile.

Reveal the raw data and let people draw conclusions for themselves.

A score should look like:
260 - 80th percentile - In 2022, students who scored a 260 got an average of 270/318 questions correct across all forms
250 - 54th percentile - In 2022, students who scored a 250 got an average of 255/318 questions correct across all forms

With this data you at least give program directors the chance to say, "hey, I don't really care about an X-question difference." Right now they have no choice but to blindly trust that higher is better without knowing the true difference.

(I made the raw numbers up as an example)
 
Even if what you are saying is correct and score differences are meaningless, how would you concretely suggest PDs stratify a bunch of applicants whose applications are otherwise also very similar? Because that’s what it always comes back to for me—complaining about the system doesn’t help if you don’t have a realistic suggestion for something better.
 
I also want to point out that the most competitive specialties (except for dermatology) get the lowest numbers of applicants. For example, the average plastic surgery program gets fewer than 100, and ENT/Uro programs both get 300-400. But there is a very strong self-selection bias there, since applicants with weak scores don't even bother applying.
 
Even if what you are saying is correct and score differences are meaningless, how would you concretely suggest PDs stratify a bunch of applicants whose applications are otherwise also very similar? Because that’s what it always comes back to for me—complaining about the system doesn’t help if you don’t have a realistic suggestion for something better.
My suggestion is to include the raw data and let PDs decide for themselves how to interpret it. There's nothing inherently wrong with the exam or the fact that it is used to stratify. I'm not suggesting making it P/F. My problem is with how scores are reported in a way that encourages conclusions that might not be true.

Is there any good reason not to reveal the raw data?
 
My apologies, I didn't really address your point.

I completely agree with you. For sure, the difference in raw performance between percentiles in the middle of the pack is going to be very small. Although we can assess whether they are "statistically" different, that doesn't mean they have any practical difference - outcomes like this are common when the n in the population is very large.

I disagree that it would change anyone's behavior. People like thinking in percentiles, and that's likely to persist no matter what you do. Unless they report only the raw score, with no percentiles at all -- and that's unlikely.

In any case, reporting raw scores isn't feasible because not everyone takes the same exam and some exam forms may be harder than others. The SoS explored this in depth: Breaking the magic: the USMLE three-digit score

This also explains why you don't get your score immediately. They need to assess the group performance before they can score your exam.
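For what it's worth, here is a toy illustration of the form-difficulty problem using the simplest possible approach, mean-sigma linear equating with made-up numbers. The real USMLE scoring is IRT-based and far more involved; this just shows why group performance has to be analyzed before an individual score can be reported:

```python
# Toy illustration of why raw percent correct can't be compared across
# forms, using simple mean-sigma linear equating. Numbers are made up,
# and this assumes randomly equivalent groups took each form -- the real
# USMLE process is IRT-based and far more involved.
from statistics import mean, pstdev

form_a = [0.85, 0.78, 0.90, 0.72, 0.81]   # percent correct, easier form
form_b = [0.80, 0.70, 0.86, 0.66, 0.79]   # percent correct, harder form

mu_a, sd_a = mean(form_a), pstdev(form_a)
mu_b, sd_b = mean(form_b), pstdev(form_b)

def equate_b_to_a(score_b):
    # Map a Form B score onto the Form A scale (mean-sigma equating).
    return mu_a + (score_b - mu_b) * (sd_a / sd_b)

print(f"Raw 80% on the harder form ~= {100 * equate_b_to_a(0.80):.0f}% on the easier form")
```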
 
They are able to create an "equated percent correct" for Step 1 now, which is a percent correct adjusted for exam difficulty. Imo this would be much better than the current 3-digit score.

If not that, they could release the average number of questions correct for each score from the prior year to give PDs an idea. Although individual results may differ, if you tell someone that students who got a 260 in 2022 averaged 85% of questions correct, I see no problem with that. That way, when a PD compares scores, they have a good idea of how many questions set those scores apart on average.

I agree they should never release individual performance data, as there are more difficult forms, as you mentioned.
 
Regarding psychometrics as described by @Med Ed, although in general that's true and often argued by the USMLE as a reason not to use scores, it's also not really applicable because the USMLE isn't a test designed to assess minimum knowledge.
The USMLE is designed to give medical licensing boards a binary yes/no answer regarding an individual's possession of a minimum level of medical knowledge. It is expressly made for that purpose. All other uses of the score are secondary.
 
Not to nit pick, but the standard deviation doesn't tell you whether a score of 250 is different from 260. That's approximated by the standard error of measurement, which is much smaller (about 9 I think). And even that doesn't say that scores within 9 points are "indistinguishable" -- unless you want to make that statement with 66+% certainty.
This is what I get for posting while sleep deprived!
 
The USMLE is designed to give medical licensing boards a binary yes/no answer regarding an individual's possession of a minimum level of medical knowledge. It is expressly made for that purpose. All other uses of the score are secondary.
This is a USMLE talking point. They say it over and over. It's simply not true. As I mentioned before, if they want to design a test that really tests minimal knowledge, they should do so by building one that has a minimum pass around 85% of the questions correct, and the most common score would be 100%.

Put another way: yes, the USMLE uses the test to determine minimum competence. Using it to assess general medical knowledge is a secondary use. But doing so is completely statistically valid. Whether it reflects anything other than ability to pick the right answer on an MCQ test is an open question. But the USMLE stating that programs shouldn't use it because it wasn't designed for that is silly. At least from a psychometric viewpoint.
 

I 100% agree with you!! There is a reason why they made Step 2 much more difficult when Step 1 became P/F.
 
I 100% agree with you!! There is a reason why they made Step 2 much more difficult when Step 1 became P/F.
I wasn't aware of that. Was there a sharp drop in Step 2 scores after Step 1 became pass/fail?
 
This is a USMLE talking point. They say it over and over. It's simply not true. As I mentioned before, if they want to design a test that really tests minimal knowledge, they should do so by building one that has a minimum pass around 85% of the questions correct, and the most common score would be 100%.

Put another way: yes, the USMLE uses the test to determine minimum competence. Using it to assess general medical knowledge is a secondary use. But doing so is completely statistically valid. Whether it reflects anything other than ability to pick the right answer on an MCQ test is an open question. But the USMLE stating that programs shouldn't use it because it wasn't designed for that is silly. At least from a psychometric viewpoint.
I think you can visibly see the evolution of this as well. Looking at the first practice forms for NBME or shelf exams reveals that they are laughably easy by today's standards. It's very easy to get 90%+. At that point in time it WAS designed as a minimum-competency test.

The more modern forms are much, much more difficult as they try to evolve the test to adequately stratify students, because they know residency programs are using it to select people. The problem is they haven't been able to keep up with students' increased performance, as evidenced by the constant score creep. I think expanding ethics to 15% of the test was a play at adding a CARS-like section that can't be studied for as effectively, so to speak...
 
I think expanding ethics to 15% of the test was a play at adding a CARS-like section that can't be studied for as effectively, so to speak...

Which is nuts because ethics is entirely too complex, nuanced and individualized to be tested.
 
This is a USMLE talking point. They say it over and over. It's simply not true. As I mentioned before, if they want to design a test that really tests minimal knowledge, they should do so by building one that has a minimum pass around 85% of the questions correct, and the most common score would be 100%.
What would be the unintended consequences of such an approach?

Put another way: yes, the USMLE uses the test to determine minimum competence. Using it to assess general medical knowledge is a secondary use. But doing so is completely statistically valid. Whether it reflects anything other than ability to pick the right answer on an MCQ test is an open question. But the USMLE stating that programs shouldn't use it because it wasn't designed for that is silly. At least from a psychmetric viewpoint.
What statistically valid secondary use is the NBME saying you should avoid?
 
My suggestion is to include the raw data and let PDs decide for themselves how to interpret it. There's nothing inherently wrong with the exam or the fact that it is used to stratify. I'm not suggesting making it P/F. My problem is with how scores are reported in a way that encourages conclusions that might not be true.

Is there any good reason not to reveal the raw data?
The issue is that there is something inherently wrong with the exam when used for stratification. The test-to-test variability is incredibly high, especially when compared to aptitude tests like the SATs or MCATs. You can take 10 predictive practice tests and still have a predicted score range of ~30 points (e.g., 250 +/- 15). It's meant to maximize accurate prediction right around a passing score, 209. So when most people are scoring in the 240 range on average, it's already nearing the ceiling of the exam.

Because of this high variability, PDs can really only meaningfully separate candidates into three groups with any degree of statistical certainty: roughly 210-230, 230-250, and 250+ (a rough sketch of why is below). PDs already know this, which is why Step cutoffs tend to be pretty low and they don't put as much stock in 255 vs. 265. Most actions by academic faculty in medicine absolutely baffle me and make me question whether they have any grasp of statistics whatsoever, but somehow they get this one right.

It's also just a bad exam. I'm convinced half of the "what is the best thing to say to the patient?" questions on USMLE/NBME exams are written by radiologists. If they could just hire whoever does quality control at UWorld they might have an exam with statistical relevance that people actually respect.
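Regarding the three-group point above, here is a rough sketch of why the usable bins end up that wide. It assumes the ~9-point SEM discussed earlier in the thread and is illustrative only:

```python
import math

# Assumes the ~9-point SEM discussed earlier in the thread; illustrative only.
SEM = 9
z95 = 1.96

# Smallest score gap at which two applicants differ with ~95% confidence
min_gap = z95 * SEM * math.sqrt(2)
print(f"Gap needed for ~95% confidence: ~{min_gap:.0f} points")   # ~25 points

# With bins that wide, the practical range of applicant scores only
# supports two or three statistically distinct tiers -- consistent with
# groupings like 210-230, 230-250, and 250+.
```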
I 100% agree with you!! There is a reason why they made Step 2 much more difficult when Step 1 became P/F.
Is there any proof of this, or is step 2 just harder for people who took step 1 P/F? As someone who took step 1 scored but is now going through rotations with students who took it P/F, I've noticed wildly different study habits in this group of students. I'm not saying that's a bad thing either, because step exams always emphasized the wrong thing (i.e., obscure details over concepts). However, students I work with now are way, way less focused on minutiae and generally operate at a lower level of content mastery. Again, not saying that's a bad thing. This profession has needed to shift away from knowledge and shift towards interpersonal skills, leadership, and business/management for at least 20 years.
 
I'm not certain I understand what you're asking.
What would be the unintended consequences of such an approach?
I assume you're asking: what would the unintended consequences be if they changed the USMLE to require a raw score of 85% to pass? It would be similar to reporting a pass/fail result only. Fail would remain a negative, as it is today. Pass would be uninterpretable other than knowing that you passed. Since most people would get 95-100% of the questions correct, there would be absolutely no discrimination at that level of performance. There would be a slight difference from straight P/F, as those scoring 85-95% would likely be considered differently than those scoring >=95%. Perhaps students wouldn't bother studying very much for the exam - similar to concerns raised about Step 1 being P/F. Is that what you're getting at?
What statistically valid secondary use is the NBME saying you should avoid?
Again, not sure what you're asking. I'm saying that a USMLE score of 250 shows that you "know more as assessed on an MCQ test" than people with a 240, and they in turn know more than those with a 230. The NBME seems to think that I should just treat anyone with a score higher than passing the same? This makes no sense to me at all. Again, I completely agree that a higher score on the USMLE doesn't necessarily predict that someone will be a better doctor/resident. But to state that it doesn't represent anything seems incorrect.
The test-to-test variability is incredibly high, especially when compared to aptitude tests like the SATs or MCATs.
Based on what? How certain are we that the SAT doesn't have ranges like this? And the MCAT has a smaller range because the overall score range is smaller. We could fix that with the USMLE if we wanted -- simply divide the score by 10 and report that. Round it to a whole number if you wish. Now scores will range from 16-28, pass will be a 20, and inter-test variability will be 1.5. Does that make it better?
You can take 10 predictive practice tests and still have a predicted score range of ~30 points (e.g., 250 +/- 15).
Who says that these predictive practice tests are actually reflective of the test? Honestly, I think this is the biggest scam of all. The NBME should not be in the business of selling practice exams for its own high-stakes exam. This is all sorts of wrong.
 
Do you agree that the group of students entering this year's match cycle who have high numeric Step 1 scores (such as those who may have delayed a year for research or other reasons) will have an advantage over students with Step 1 scores of "PASS"?
 