Why is STEP scored the way it is?


Womb Raider
Full Member · 10+ Year Member · 7+ Year Member
Joined: Aug 20, 2013 · Messages: 3,495 · Reaction score: 3,083
Does anyone know why STEP uses a scaled scoring system rather than percentiles? It seems to me like it would make a lot more sense to receive your score back as a percentile (i.e., 1-100, where a 50 means you scored in the 50th percentile, which is roughly a 230 today).

You might be thinking that scaled scores are required because they provide a way to compare the difficulty level of different tests across years, but I don't think that's really necessary, because the difficulty level isn't important (aside from passing/failing, which could be handled internally with a simple P or F attached to the score). What schools/residencies care about is percentile, i.e., how well you performed compared to your peers. The percentile/difficulty of each question could be kept on a rolling basis (with a pool large enough that you couldn't exploit it by taking the test in the off-season) that constantly evolves so it doesn't get outdated - probably similar to how they do it now.

Finally, there is one more aspect of the scaled score that makes absolutely no sense to me: it magnifies differences at the peaks of performance and gives the illusion of a large gap in performance where one doesn't exist. For example, going from a 260 (96th percentile) to a 270 (~100th percentile) is a gap of only 4 percentile points, whereas going from a 230 (48th) to a 240 (67th) is a gap of 19.
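To see the compression concretely, here's a rough back-of-the-envelope sketch. It assumes Step 1 scores are approximately normal with a mean of ~230 and an SD of ~20 (figures quoted elsewhere in this thread, not official NBME parameters), so treat it as illustration only:

```python
from math import erf, sqrt

# Assumed for illustration only: Step 1 scores ~ Normal(mean=230, sd=20).
MEAN, SD = 230, 20

def approx_percentile(score):
    """Approximate percentile rank of a scaled score under the normal model."""
    return 100 * 0.5 * (1 + erf((score - MEAN) / (SD * sqrt(2))))

for lo, hi in [(230, 240), (260, 270)]:
    gap = approx_percentile(hi) - approx_percentile(lo)
    print(f"{lo} -> {hi}: percentile {approx_percentile(lo):.0f} -> "
          f"{approx_percentile(hi):.0f} (gap of {gap:.0f} points)")

# Same 10-point jump on the scaled score, but ~19 percentile points in the
# middle of the curve versus only ~4 out in the tail.
```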

Anyway, the system makes no sense to me, and I don't understand why the USMLE doesn't use percentiles as the primary score. It seems like it would resolve a lot of headaches for students, schools, and residency programs. I was wondering if someone could fill me in on what I'm missing.

 
Not 100% sure, but I have heard there are historical reasons a passing score needs to be >70 due to certain state laws.
Yeah, I get the need for a certain passing score. I just don't understand why that requires reporting scores on a scaled system. Scores could be reported after the test with something like: "Status: PASS. Percentile: 75th."
 
This might help tangentially with the motivation for using a predetermined scale (based on previous groups): http://www.compassprep.com/how-to-score-the-sat-and-act-curving-scaling-and-equating/
I think part of the reason the scaled scores are set in advance is to avoid potential gaming of the system, as you mentioned, where performance could fluctuate with the preparedness of the test takers on a given day or for other reasons. Theoretically, everyone could walk into the test on a certain day and get a 270, but the scaled scores are set in a way that makes that extremely difficult for anyone to do. The percentile we get with our scores can fluctuate from year to year, which makes sense given the nature of the scoring. If percentiles were given right away, they would need to wait until year's end to see where each person stood for the year, or they could use historical percentiles, but we can see that isn't always a good approach. For example, 220 used to be average and now it's 230 - a 240 is still good, but its value has comparatively declined as more people have achieved that level of competency/preparedness for the exam.

Not a perfect example or explanation at all, but I think there are definitely some benefits to using a scaled score over just saying "here's your percentile."

The scaled score, in my best guess, indicates more of a level of knowledge/proficiency (at least). If 250 became "average," that's still a high proficiency/knowledge base, but the percentile would show you are at the 50th percentile (or thereabouts). There are two dimensions to the score for a reason, I think. The primary motivation should be to evaluate the "competency" and knowledge base of the examinee (scaled score), and only secondarily to see how they match up with their peers (percentile after the year). I think this sort of fits with how Step scores were less important back in the day but have become more important as competition has allegedly increased.

Thoughts?

Edit: The other issue with just using historical percentiles is that they take a long time to shift, so it would be important to know that a 240 used to be the 85th percentile (no clue if that's true) but became the 75th percentile within a year or two. The historical percentile would be misleading with regard to the student's relative standing, even though we know the level of competence/knowledge, as determined by the exam makers, is the same.
 
This might help tangentially with the motivation for using a predetermined scale (based on previous groups): http://www.compassprep.com/how-to-score-the-sat-and-act-curving-scaling-and-equating/
I think part of the reason the scaled scores are set in advance is to avoid potential gaming of the system, as you mentioned, where performance could fluctuate with the preparedness of the test takers on a given day or for other reasons. Theoretically, everyone could walk into the test on a certain day and get a 270, but the scaled scores are set in a way that makes that extremely difficult for anyone to do. The percentile we get with our scores can fluctuate from year to year, which makes sense given the nature of the scoring. If percentiles were given right away, they would need to wait until year's end to see where each person stood for the year, or they could use historical percentiles, but we can see that isn't always a good approach. For example, 220 used to be average and now it's 230 - a 240 is still good, but its value has comparatively declined as more people have achieved that level of competency/preparedness for the exam.

Not a perfect example or explanation at all, but I think there are definitely some benefits to using a scaled score over just saying "here's your percentile."

The scaled score, in my best guess, indicates more of a level of knowledge/proficiency (at least). If 250 became "average," that's still a high proficiency/knowledge base, but the percentile would show you are at the 50th percentile (or thereabouts). There are two dimensions to the score for a reason, I think. The primary motivation should be to evaluate the "competency" and knowledge base of the examinee (scaled score), and only secondarily to see how they match up with their peers (percentile after the year). I think this sort of fits with how Step scores were less important back in the day but have become more important as competition has allegedly increased.
Thoughts?
I understand what you're saying, but I still don't think it makes sense. If competency is the issue, then make the test pass/fail. You don't need scaled test scores for that. The test is used as a way to separate candidates who are otherwise equal in every way. The residency directors I've talked to have basically said, "Everyone has great class grades, LORs, research, CV, etc. STEP is a way to separate people." The ONLY way. That's what it's used for - a measuring stick to separate us by percentile. They don't look at the score and think about "competency" (also, I have to LOL at STEP 1 being equated with physician competency); they look at it and think about how the applicant performed compared to everyone else.

As for your gaming the system point - I don't understand that part. Who cares if everyone (on average) is getting smarter? I mean, it's a great factoid and great for medicine overall, but who really cares? Not students. Not residency directors looking for new residents. If the average STEP score drops 20 points next year, the competitive residency programs are going to be forced to accept applicants with lower scores, or not take any at all - and it certainly won't be the latter. Should the information still be made public? Sure. Absolutely. But it doesn't mean we need to use scaled scores.

Finally - you wouldn't need to wait until year's end to see percentiles. New questions are (from my understanding) tested first for quality/difficulty before being factored into the score. This implies that questions stick around for a while. So every question could compare your performance to everyone else who answered it over the past X years or so. Kind of like how UWorld does it, but with old data constantly being dropped.
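To make the rolling-tally idea concrete, here's a minimal sketch of one way it could work. It's purely hypothetical - the window size, class name, and pooling against raw scores are my own assumptions, not anything the NBME has described:

```python
from collections import deque

class RollingPercentile:
    """Toy rolling pool of recent raw scores; new results push out the oldest."""

    def __init__(self, window=100_000):
        self.pool = deque(maxlen=window)  # raw scores from the trailing window of examinees

    def record(self, raw_score):
        self.pool.append(raw_score)

    def percentile(self, raw_score):
        """Percent of the current pool scoring strictly below this raw score."""
        if not self.pool:
            return 0.0
        return 100 * sum(s < raw_score for s in self.pool) / len(self.pool)

# Example usage with made-up numbers:
pool = RollingPercentile(window=5)
for s in (180, 200, 210, 225, 240):
    pool.record(s)
print(pool.percentile(225))  # 60.0 -- three of the five pooled scores are below 225
```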
 
I understand what you're saying, but I still don't think it makes sense. If competency is the issue, then make the test pass/fail. You don't need scaled test scores for that. The test is used as a way to separate candidates who are otherwise equal in every way. The residency directors I've talked to have basically said, "Everyone has great class grades, LORs, research, CV, etc. STEP is a way to separate people." The ONLY way. That's what it's used for - a measuring stick to separate us by percentile. They don't look at the score and think about "competency" (also, I have to LOL at STEP 1 being equated with physician competency); they look at it and think about how the applicant performed compared to everyone else.

As for your gaming the system point - I don't understand that part. Who cares if everyone (on average) is getting smarter? I mean, it's a great factoid and great for medicine overall, but who really cares? Not students. Not residency directors looking for new residents. If the average STEP score drops 20 points next year, the competitive residency programs are going to be forced to accept applicants with lower scores, or not take any at all - and it certainly won't be the latter. Should the information still be made public? Sure. Absolutely. But it doesn't mean we need to use scaled scores.

Finally - you wouldn't need to wait until year's end to see percentiles. New questions are (from my understanding) tested first for quality/difficulty before being factored into the score. This implies that questions stick around for a while. So every question could compare your performance to everyone else who answered it over the past X years or so. Kind of like how UWorld does it, but with old data constantly being dropped.

I'm kinda confused. Given that those of us who are choosing candidates already know how to read scores, why exactly does this matter?

Also, not everyone who applies to residency took Step the same year or at the same time. Sure, most do, but not all, and enough that we need to be able to compare scores year to year.
 
I'm kinda confused. Given that those of us who are choosing candidates already know how to read scores, why exactly does this matter?

Also, not everyone who applies to residency took Step the same year or at the same time. Sure, most do, but not all, and enough that we need to be able to compare scores year to year.
As I said, it's simpler to look at a percentile than to look at some arbitrary number and compare it against a table that correlates that number with a percentile. You also wouldn't have to worry about finding the right year's conversion table. It's just simpler.

Your second point seems a little strange. I figured programs would prefer students in a higher percentile rather than someone who just has a higher raw score. The former seems to me to be a better predictor of intelligence, at least with regard to STEP 1.
 
I understand what you're saying, but I still don't think it makes sense. If competency is the issue, then make the test pass/fail. You don't need scaled test scores for that. The test is used as a way to separate candidates who are otherwise equal in every way. The residency directors I've talked to have basically said, "Everyone has great class grades, LORs, research, CV, etc. STEP is a way to separate people." The ONLY way. That's what it's used for - a measuring stick to separate us by percentile. They don't look at the score and think about "competency" (also, I have to LOL at STEP 1 being equated with physician competency); they look at it and think about how the applicant performed compared to everyone else.
I meant competency in terms of a student's knowledge base - not at all in the sense of physician or clinical competency. I agree with the pass/fail idea, but then people would ask for another dimension to separate those candidates (my guess is that would happen).

As for your gaming the system point - I don't understand that part.
People used to say you could get higher MCAT scores by testing in January when the testing pool was "less prepared," or some B.S. like that - trying to game the system. That's what I meant. If the score were determined by the specific testing pool rather than set in advance, people could try to game the system (though I think this would equilibrate and disappear anyway, if it ever existed, and I don't think it did).
Who cares if everyone (on average) is getting smarter? I mean, it's a great factoid and great for medicine overall, but who really cares? Not students. Not residency directors looking for new residents. If the average STEP score drops 20 points next year, the competitive residency programs are going to be forced to accept applicants with lower scores, or not take any at all - and it certainly won't be the latter. Should the information still be made public? Sure. Absolutely. But it doesn't mean we need to use scaled scores.
I agree with you, but we're not making decisions. I think the best answer we could get is by directly emailing and asking USMLE-- but they might not actually be too clear for "security" purposes.

Finally - you wouldn't need to wait until year's end to see percentiles. New questions are (from my understanding) tested first for quality/difficulty before being factored into the score. This implies that questions stick around for a while. So every question could compare your performance to everyone else who answered it over the past X years or so. Kind of like how UWorld does it, but with old data constantly being dropped.
Again, I agree with you, but the historical numbers come from larger groups, so they will take a lot longer to move, whereas the yearly percentiles would be a better indicator of current relative standing (which would better support your goal, from my point of view).
 
I meant competency in terms of a student's knowledge base - not at all in the sense of physician or clinical competency.
Haha, I knew what you meant (not trying to be a dic*). I just find it funny that some of the information we're required to know is on a national board exam.
 
As said, most people know how to interpret scores already, so it really doesn't matter. Why would one perceive it as a "headache" for a residency program? Also, Step scores are not a predictor of intelligence; they are a predictor of test-taking ability. Sure, there's probably a correlation, but I've seen enough high scorers who still have problems with critical thinking and problem solving. The residency wants you to do well because they want you to pass your board certification. That is far more important to them, because it is a reflection of their training ability to the ACGME.
 
As said, most people know how to interpret scores already, so it really doesn't matter. Why would one perceive it as a "headache" for a residency program? Also, Step scores are not a predictor of intelligence; they are a predictor of test-taking ability. Sure, there's probably a correlation, but I've seen enough high scorers who still have problems with critical thinking and problem solving. The residency wants you to do well because they want you to pass your board certification. That is far more important to them, because it is a reflection of their training ability to the ACGME.
My original question was basically asking why use scaled scores in the first place? To be able to compare scores from year to year? The primary utility of STEP is obviously seeing how we perform compared to other students, which has nothing to do with the raw/scaled score. If it were about competency, then we'd see P/F.
 
My original question was basically asking why use scaled scores in the first place? To be able to compare scores from year to year? The primary utility of STEP is obviously seeing how we perform compared to other students, which has nothing to do with the raw/scaled score. If it were about competency, then we'd see P/F.
The Step scores do give a relative value in proportion to one's peers. The absolute scaled score or percentile is meaningless when comparing small differences, so it basically doesn't matter. A 265 versus a 260 versus a 250 aren't terribly different. Most residencies and interviewers would say, "Wow, those are good scores," not "Well, this person is 5 percentile points higher." Step scores are important for categorizing incoming trainees into relative boxes, but they don't sell a trainee alone. Thus, the small differences you are alluding to, 255 versus 265, really don't make that much difference for matching.
 
The USMLE never intended Step 1 to be a quantitative assessment of one's abilities, merely a qualitative measure of whether the test-taker has the fundamental knowledge base to be a doctor or not.

Given the vast array of grading systems and the range of ability among graduates of the various medical schools, programs naturally look for some standard by which to measure applicants. Thus, USMLE Step 1 became the de facto scale for stratifying applicants.
 
My original question was basically asking why use scaled scores in the first place? To be able to compare scores from year to year? The primary utility of STEP is obviously seeing how we perform compared to other students, which has nothing to do with the raw/scaled score. If it were about competency, then we'd see P/F.

that.
 
The Step scores do give a relative value in proportion to one's peers. The absolute scaled score or percentile is meaningless when comparing small differences, so it basically doesn't matter. A 265 versus a 260 versus a 250 aren't terribly different. Most residencies and interviewers would say, "Wow, those are good scores," not "Well, this person is 5 percentile points higher." Step scores are important for categorizing incoming trainees into relative boxes, but they don't sell a trainee alone. Thus, the small differences you are alluding to, 255 versus 265, really don't make that much difference for matching.

True, but I would add that there is a threshold beyond which high scores become exceptionally impressive. Even though it's just 5 points, I would argue there is a major difference in psychological appeal between a 265 and a 270. Once you get into 270+ territory, I think PDs must look at you differently. You become a true rarity.
 
True, but I would add that there is a threshold beyond which high scores become exceptionally impressive. Even though it's just 5 points, I would argue there is a major difference in psychological appeal between a 265 and a 270. Once you get into 270+ territory, I think PDs must look at you differently. You become a true rarity.
Maybe some people do, but I've seen those scores, and I can tell you, when I see an applicant with a 260 and another with a 270, I say to myself, "Wow, those are great scores... what else have they got?"
 
True, but I would add that there is a threshold beyond which high scores become exceptionally impressive. Even though it's just 5 points, I would argue there is a major difference in psychological appeal between a 265 and a 270. Once you get into 270+ territory, I think PDs must look at you differently. You become a true rarity.


PDs have seen enough that they know how to interpret scores for what they are.
 
Maybe some people do, but I've seen those scores, and I can tell you, when I see an applicant with a 260 and another with a 270, I say to myself, "Wow, those are great scores... what else have they got?"

"Also, I wish this kid with the 270 would stop picking his nose during this interview"
 
I've heard that some PDs actually scrutinize applicants with 270+ more closely, under the assumption that you are more likely to be some sort of social weirdo if you can score that high.
 
PDs have seen enough that they know how to interpret scores for what they are.
Maybe, but I doubt it. The USMLE released a document last year showing that scores within 15 points of one another are not statistically significantly different; i.e., everything from 235 to 250 is pretty similar. I doubt any PD considers scores in that manner, especially given the nature of score cutoffs.
 
Maybe, but I doubt it. The USMLE released a document last year showing that scores within 15 points of one another are not statistically significantly different; i.e., everything from 235 to 250 is pretty similar. I doubt any PD considers scores in that manner, especially given the nature of score cutoffs.

Well that's great that a med student apparently knows more than those of us who've been on both sides of the admissions process.
 
Well that's great that a med student apparently knows more than those of us who've been on both sides of the admissions process.
Ah, I should rephrase: I wouldn't be surprised if some PDs don't consider scores in a thoroughly analytical way. There have been several PDs/APDs on SDN openly stating that they order their ROLs with "students who are most likely to rank us highly" at the top, along with other non-logic-based paradigms that leave some room to suspect this entire process is a crapshoot.

But again, you're right. Only a med student (just a skeptical one).
 
I've heard that some PDs actually scrutinize applicants with 270+ more closely, under the assumption that you are more likely to be some sort of social weirdo if you can score that high.
A score of 270+ would be exceptionally unlikely not to net you an interview (unless the applicant had a previous criminal charge or horrible letters). But it does go to show that just because you have an amazing board score doesn't mean someone wants you in their program. If you don't interview well, you won't match well either.
 
Maybe, but I doubt it. The USMLE released a document last year showing that scores within 15 points of one another are not statistically significantly different; i.e., everything from 235 to 250 is pretty similar. I doubt any PD considers scores in that manner, especially given the nature of score cutoffs.

I feel like some places must be screening with a strict cut-off at 240+, maybe even 250 at the top places. The difference between a 235 and a 250 Step score is greater than 1 standard deviation.
 

You have to compare z-scores, not subtract the scores. A 235 is ~58th percentile and a 250 is ~85th percentile. That's a difference of 27 percentile points, which is greater than the SD of ~20.

Source: USMLE Score Interpretation Guidelines:

https://www.google.com/url?sa=t&sou...6acHKdCGi-zRkWGig&sig2=sRI9t95VsS4SKms8U6NcNA
 
I feel like some places must be screening with a strict cut-off at 240+, maybe even 250 at the top places. The difference between a 235 and a 250 Step score is greater than 1 standard deviation.

NSurg seems to use ~230 at most places as the hard cut-off (that's the median and 75th percentile; the mean and 25th percentile are 220). I'm sure 240 or 250 could be possible at the top residencies, but the IQR for "always interview" is 240-250, so most programs can't be using minimums in that range.

I just randomly picked NSurg, but I glanced at Derm and the numbers are pretty close:

http://www.nrmp.org/wp-content/uploads/2016/09/NRMP-2016-Program-Director-Survey.pdf
 
For the past few years the SD has hovered around 20.
Correct...


You have to compare z-scores, not subtract the scores. A 235 is ~58th percentile and a 250 is ~85th percentile. That's a difference of 27 percentile points, which is greater than the SD of ~20.
Respectfully, you're very mistaken (you could compare z-scores, but if they're z-scores, the SD is 1... by definition, a z-score is a standardized variable with a mean of zero and an SD of 1). The standard deviation is in the units of measurement - in this case, points on the Step 1 examination. So 20 means 20 points on the Step 1 exam, not percentiles. You don't compare percentiles to standard deviations: percentiles are essentially rank orderings without units, whereas the SD is a quantity measuring the spread of individual observations on a variable, and it retains the units of measure of the underlying variable. A z-score is a standardized quantity without the original units of measure; it represents the number of standard deviations an observation lies from the mean. You compare apples to apples, and you're not doing that when you compare percentiles to standard deviations.
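A tiny worked example of the distinction, using the ~230 mean and ~20 SD quoted in this thread purely as assumed illustration figures:

```python
# Assumed for illustration: mean 230, SD 20 (both in Step 1 points).
MEAN, SD = 230, 20

for score in (220, 235, 240, 250):
    z = (score - MEAN) / SD  # z is a unitless count of SDs from the mean
    print(f"score {score}: z = {z:+.2f} SD")

# 235 and 250 differ by 15 points, i.e. 0.75 SD -- less than one standard
# deviation. The ~27-percentile-point gap quoted above is a rank difference
# and isn't comparable to the 20-point SD.
```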
 
I was operating under the assumption that the distribution of Step 1 scores is approximately a normal curve.
 
I think the whole reason all high-stakes exams have the scoring systems they do is simply to confuse you.

There is a lot of that going around.

Can you imagine if we just used plain language in medicine? Using the Latin-derived "crepitus" instead of the vernacular "popping noises" makes the same phenomenon harder to understand without the guidance of a pro. That is probably the point. If it were too straightforward and easy to understand, we wouldn't need quite as much of an industry surrounding examinations.
 
I was operating under the assumption that the distribution of Step 1 scores is approximately a normal curve.
Your assumption is totally irrelevant to the fact that your approach is mistaken. Irrespective of the distribution's shape, you don't compare percentiles to the standard deviation. The units of measure on the standard deviation are Step 1 points. A score of 220 (points) is 1 SD below a score of 240 (points) if the SD is 20 (points).
 
Your assumption is totally irrelevant to the fact that your approach is mistaken. Irrespective of the distribution's shape, you don't compare percentiles to the standard deviation. The units of measure on the standard deviation are Step 1 points. A score of 220 (points) is 1 SD below a score of 240 (points) if the SD is 20 (points).

You're right.
 
There is a lot of that going around.

Can you imagine if we just used plain language in medicine? Using the Latin-derived "crepitus" instead of the vernacular "popping noises" makes the same phenomenon harder to understand without the guidance of a pro. That is probably the point. If it were too straightforward and easy to understand, we wouldn't need quite as much of an industry surrounding examinations.
Imagine my surprise when I started teaching and my Microbiology colleague was using the term "rhinorrhea" when discussing the common cold!

Still, Latin has its uses. The name levator ani has a lot more class than, well, whatever its vernacular would be.
 
Imagine my surprise when I started teaching and my Microbiology colleague was using the term "rhinorrhea" when discussing the common cold!

Still, Latin has its uses. The name levator ani has a lot more class than, well, whatever its vernacular would be.

It is helpful to have consistent terms so that we can communicate clearly with one another. And you have a point about levator ani. But eructation? PRN and BID? So much could be said more plainly, except that then we wouldn't sound like we'd spent hundreds of thousands of dollars and years of our lives learning to talk over our patients' heads.

I'm not able to think of other plausible reasons that a simple percentile score wouldn't work just as well to convey relative performance on exams.
 
I think the whole reason all high-stakes exams have the scoring systems they do is simply to confuse you.

I'm kinda confused. Given that those of us who are choosing candidates already know how to read scores, why exactly does this matter?

Also, not everyone who applies to residency took Step the same year or at the same time. Sure, most do, but not all, and enough that we need to be able to compare scores year to year.
lmao
 
I think something that hasn't been pointed out is the large variation in score distributions between test-taking groups. Would you report every Step 1 score as a percentile relative to US MD first-time takers? Or to all takers? Or to the distribution of the group that individual taker belongs to? Using percentiles also requires reporting pass/fail as a separate metric, because the 5th percentile might pass or fail depending on the year and group, whereas we know a 205 always means the same thing no matter the year or group.

Year-to-year variation is also a problem: an MD/PhD could be 5 years out from their Step 1 date by the time they apply for residency, and that could mean a few points of difference in percentile for the same scaled score.
 
I think something that hasn't been pointed out is the large variation in score distributions between test-taking groups. Would you report every Step 1 score as a percentile relative to US MD first-time takers? Or to all takers? Or to the distribution of the group that individual taker belongs to? Using percentiles also requires reporting pass/fail as a separate metric, because the 5th percentile might pass or fail depending on the year and group, whereas we know a 205 always means the same thing no matter the year or group.

Year-to-year variation is also a problem: an MD/PhD could be 5 years out from their Step 1 date by the time they apply for residency, and that could mean a few points of difference in percentile for the same scaled score.
I've already addressed these points.

The score-distribution differences between test-taking groups wouldn't be an issue if they keep questions in circulation for a long time (pretty sure they do) and keep a running, constantly updating tally of performance on each question.

Whether it's your 1st, 2nd, or 3rd time taking the test doesn't matter. It's 280 questions, and there are probably tens of thousands of questions in the test bank. Plus, if you're retaking the test, you're probably not going to be the one to break the curve. For all we know, the USMLE doesn't even recycle questions for individual retakers.

They already report 'PASS' or 'FAIL' on your score report the way it's done now. Nothing would change.

Year-to-year variation is a problem no matter what. I don't get this obsession with scaled scores being superior for "year to year" comparison. Here's the thing:
- Scaled scores reflect equal difficulty across test takers and won't vary by year. However, if the average performance of the test takers changes as a whole (which is what we've seen - Step scores have been rising very quickly), the scaled scores become meaningless. Sure, a 220 from 10 years ago equals a 220 today, but ultimately that's worthless, because Step is used to see how you perform compared to your peers, not whether or not you are "competent."
- Percentiles have the reverse problem: they show how you perform in comparison to other test takers but don't show whether the overall group's performance is improving or declining. But, as I've been trying to say, that doesn't really matter at all, except for the fact that you pass, which can be handled internally (see the quick sketch below).
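As a toy illustration of that trade-off, using the "220 used to be average, now it's 230" figures mentioned earlier in this thread (assumed numbers, not official data):

```python
# Assumed cohort means, taken loosely from earlier in this thread.
cohorts = {"10 years ago": 220, "today": 230}
score = 240

for year, mean in cohorts.items():
    print(f"{year}: a {score} is {score - mean} points above that cohort's mean")

# The equated 240 reflects the same amount of knowledge both years, but the
# peer comparison shrinks from 20 points above average to 10 -- the drift a
# percentile report would capture and a fixed scaled score hides.
```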
 