USMLE Questions Weighted?


mloose27

Hey everyone. I just finished taking my Step 1's (a nightmare), and upon talking with a few of my friends, I realized that some people thought their test was impossible, while others thought it was doable. Since everyone I talked to is in the same part of the class academically, and everyone spent about the same time studying, often with each other, it leads me to wonder if the USMLE accounts for differences in the difficulty of questions from test to test. Does the USMLE weight certain questions over others? Do they try and account for the fact that someone may have randomly gotten a really hard test, while someone else might not have?

I'm interested to hear everyone's thoughts, or knowledge on this subject.
 
mloose27 said:
Hey everyone. I just finished taking my Step 1's (a nightmare), and upon talking with a few of my friends, I realized that some people thought their test was impossible, while others thought it was doable. Since everyone I talked to is in the same part of the class academically, and everyone spent about the same time studying, often with each other, it leads me to wonder if the USMLE accounts for differences in the difficulty of questions from test to test. Does the USMLE weight certain questions over others? Do they try and account for the fact that someone may have randomly gotten a really hard test, while someone else might not have?

I'm interested to hear everyone's thoughts, or knowledge on this subject.

Let's use common sense. OF COURSE THEY DO. WOULD IT BE JUST IF THEY DID NOT? A standardized exam has to be valid and reproducible. The implied theory is that you would perform the same compared to your peers regardless of the form that you receive.

To answer your question, they standardize you against everyone else who took the same form. Then they superimpose the Gaussian distributions of all the different forms to get a single distribution. Thus, if the top student on a hard form missed 60 questions and the top student on an easy form missed 30 questions, they are considered statistically equal. This is a way of converting a raw score to a scaled score. Superimposing the distributions cancels out the difference between a "hard" form and an "easy" form. Each form is adjusted for its statistical difficulty, based on the number of questions that the top students got right.

With that said, even the NBME has some doubts about the true reproducibility of each form, so they discourage using Step 1 scores as a way of screening for certain specialties. Still, the underlying point is that the NBME does not penalize you based on the form that you receive on test day; your performance is linked to your ranking among everyone else who took the same form as you did. Standardized exams are rigorously tested for validity and reliability.

http://www.nationmaster.com/encyclopedia/Standardized-testing
 
I was under the impression you were not compared with other tests but rather with people who took the same test, or at least did the same questions in previous years. It's not like the MCAT, where you are compared to different versions, where some are easier and some are harder. How can they rate an exam hard or easy? The only fair way is to compare you to people in the past who had those questions. So it's kind of luck of the draw: what may be hard for you could have been easy for the last 1000 people who took that exam, so you will not get a curve.
 
Ramoray said:
I was under the impression you were not compared with other tests but rather with people who took the same test, or at least did the same questions in previous years. It's not like the MCAT, where you are compared to different versions, where some are easier and some are harder. How can they rate an exam hard or easy? The only fair way is to compare you to people in the past who had those questions. So it's kind of luck of the draw: what may be hard for you could have been easy for the last 1000 people who took that exam, so you will not get a curve.

IF THERE IS NO CURVE, WHY DO THEY GIVE YOU THE MEAN AND SD when you receive your score?

The pool for each form is superimposed onto one giant distribution. This is why you see only ONE DISTRIBUTION for each step 1 exam. Think about it, Ramoray, why is there only one released mean for a given year, if there are more than one form of Step 1? That is because the mean that is released is of the giant distribution.

We all know that there is more than one form for Step 1 yet there is only ONE STEP 1 mean.

The way they rate an exam as hard or easy is based on the percentage of questions that the top students got right on your particular form. I don't know the exact formula, but let's say the top ten percent of the students on form A averaged 300 out of 350 questions right, while the top ten percent of the students on form B averaged 280 out of 350 questions right. They can use statistical analysis to treat the raw 280/350 average and the raw 300/350 average as equal; the result is known as a scaled score.

What matters most is where do you sit on the Gaussian Distribution compared to people that took the same form. Your place in line will stay the same on the giant distribution regardless of the number of questions you got right. Thus, in theory if you answered 280 questions right on form A but had a hard form because the top ten percent of the scorers averaged 280, you should be given a higher score than someone that answered 280 questions right on a form that the top students averaged 300 questions right.

Also, the underlying assumption is that each form has the same percentage of top students. This is based on inferential statistics (something you should know for Step 1 BTW). Since the sample size is still large for each form, and the students are chosen randomly for each testing form, using inferential statistics is valid.

The numbers given above are arbitrary, just keep in mind the principle.
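
The principle can be sketched in a few lines of Python. This is only an illustration of the per-form standardization described above, not the NBME's actual procedure: the function name, the tiny made-up raw scores, and the target scale (a mean of 215 and an SD of 20) are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def scale_scores(raw_by_form, target_mean=215.0, target_sd=20.0):
    """Standardize raw scores within each form, then map them to one scale.

    target_mean and target_sd are hypothetical values, not published ones.
    """
    scaled = {}
    for form, raws in raw_by_form.items():
        mu = mean(raws)
        sigma = pstdev(raws)
        if sigma == 0:
            raise ValueError(f"form {form!r} has no spread in raw scores")
        # z-score within the form, then re-expressed on the common scale
        scaled[form] = [target_mean + target_sd * (r - mu) / sigma for r in raws]
    return scaled

# Arbitrary numbers: form B is "harder", so every raw score is 20 lower.
forms = {
    "A": [300, 280, 260, 240, 220],
    "B": [280, 260, 240, 220, 200],
}
scaled = scale_scores(forms)
```

In this toy data the top scorer on each form lands on the same scaled score, and a raw 280 on the harder form B is worth more than a raw 280 on form A.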
 
While I don't doubt that it is somehow curved, I no longer think it is as simple as a curve per form. I got the same exact question twice, and others I have spoken to have as well. Not a similar question, but word for word the same question. There is no way they have a "form" for a whole test with the same question repeated, let alone multiple forms with the same question repeated; that would be idiotic. I think if anything the blocks may be set and randomly assigned, with a standard curve per block and a formula for integrating your relative score into a composite for the test. Or, if they are truly insane, they may generate random questions and track an individual correct-response rate for each question, but that would be truly crazy, so I'm leaning toward randomly assigned blocks that are pre-set.

My post here in no way furthers the thread, but I just thought I would throw in my 2 cents.

p53 said:
IF THERE IS NO CURVE, WHY DO THEY GIVE YOU THE MEAN AND SD when you receive your score?

The pool for each form is superimposed onto one giant distribution. This is why you see only ONE DISTRIBUTION for each step 1 exam. Think about it, Ramoray, why is there only one released mean for a given year, if there are more than one form of Step 1? That is because the mean that is released is of the giant distribution.

We all know that there is more than one form for Step 1 yet there is only ONE STEP 1 mean.

The way they rate an exam as hard or easy is based on the percentage of questions that the top students got right on your particular form. I don't know the exact formula, but let's say the top ten percent of the students on form A averaged 300 out of 350 questions right, while the top ten percent of the students on form B averaged 280 out of 350 questions right. They can use statistical analysis to treat the raw 280/350 average and the raw 300/350 average as equal; the result is known as a scaled score.

What matters most is where do you sit on the Gaussian Distribution compared to people that took the same form. Your place in line will stay the same on the giant distribution regardless of the number of questions you got right. Thus, in theory if you answered 280 questions right on form A but had a hard form because the top ten percent of the scorers averaged 280, you should be given a higher score than someone that answered 280 questions right on a form that the top students averaged 300 questions right.

Also, the underlying assumption is that each form has the same percentage of top students. This is based on inferential statistics (something you should know for Step 1 BTW). Since the sample size is still large for each form, and the students are chosen randomly for each testing form, using inferential statistics is valid.

The numbers given above are arbitrary, just keep in mind the principle.
 
dynx said:
While I don't doubt that it is somehow curved, I no longer think it is as simple as a curve per form. I got the same exact question twice, and others I have spoken to have as well. Not a similar question, but word for word the same question. There is no way they have a "form" for a whole test with the same question repeated, let alone multiple forms with the same question repeated; that would be idiotic. I think if anything the blocks may be set and randomly assigned, with a standard curve per block and a formula for integrating your relative score into a composite for the test. Or, if they are truly insane, they may generate random questions and track an individual correct-response rate for each question, but that would be truly crazy, so I'm leaning toward randomly assigned blocks that are pre-set.

My post here in no way furthers the thread, but I just thought I would throw in my 2 cents.

EVER heard of experimental questions? I had two pairs of similar questions too. The reason you were tested more than once is that the other ones were being tried out for future test administrations. The problem is you don't know which one was experimental.

The main reason new questions are added is that old questions are floating around via message boards, eBay, Goljan, etc. It would not be fair for someone to have "inside" knowledge about the exam. The NBME committee tries very hard to maintain the integrity of the exam. They take medical licensing very seriously (as they should). Sure, there are loopholes in the system, such as SDN, where people post questions from their exam. Still, this advantage within a calendar year is a drop in the bucket. What they are mainly concerned about is the integrity of the exam from year to year. They don't want anyone from next year's class to have a huge batch of remembered questions, so they make up new questions.

Regardless, as tommyk sez, they test the same concepts year by year. The questions are different, but the concepts are the same.

Let's use logic. Do you honestly believe the NBME would count the same question twice on something as high-stakes as a medical licensing exam? Sure, around 90% pass, but around 10% do not. The NBME is not going to waste two pairs of questions and risk someone scoring a 181 and flunking the exam.

It is like RAMORAY's theory that every question is based on previous examinees in previous years. That is horsecrap. I had two questions on my exam that pinpointed the date as post-2000. They make new questions; how else are they going to see if a question is valid? It is called field testing.

All standardized exams have experimental questions. If you were smart enough you would realize that my numbers such as 280/350 were arbitrary.

As a review

1. USMLE examinations have experimental questions, yet every form has the same number of questions that count toward the raw score. Arbitrary example: 320 scored questions and 30 experimental questions. This is one of the reasons people perceive certain sections as heavily weighted. The experimental batch that you receive might be heavy in immunopathology.

2. Standardized exams field test new questions the preceding year to use the following year. If the top 10% of the scorers on a particular form do not consistently get the question right, it is thrown out or modified. There is a steady state of questions retired to new questions added, so the size of the pool stays the same.
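
The screening rule in point 2 could look roughly like the sketch below. It is a guess at the mechanics, not a documented procedure: the function name, the data layout, and especially the 80% cutoff standing in for "consistently" are assumptions made for illustration.

```python
def flag_weak_items(total_scores, item_correct, top_fraction=0.10, min_rate=0.80):
    """Return ids of pilot questions the top scorers did not reliably get right.

    total_scores: examinee id -> overall score on the scored questions
    item_correct: pilot question id -> set of examinee ids who answered it correctly
    min_rate is a hypothetical cutoff for "consistently".
    """
    ranked = sorted(total_scores, key=total_scores.get, reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top = set(ranked[:n_top])
    flagged = []
    for question, correct_ids in item_correct.items():
        rate = len(top & correct_ids) / len(top)
        if rate < min_rate:
            flagged.append(question)
    return flagged

# Made-up data: ten examinees, two pilot questions.
scores = {f"e{i}": 300 - 10 * i for i in range(10)}
pilot = {
    "x1": {"e0", "e1", "e5"},  # both top scorers got it right
    "x2": {"e7", "e8"},        # only low scorers got it right
}
weak = flag_weak_items(scores, pilot, top_fraction=0.2)
```

Here only the second pilot question is flagged, because neither of the two top scorers answered it correctly.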

Lastly, if anyone has a better theory post here. I'm smart enough to know that I might not be 100% right. Still, I'm very convinced of my theory.
 
I don't think there are "forms" in the sense that a group of people will have all the same questions. I think what happens is that you get 350 questions randomly chosen from the test bank, and that each question, based on who got it right when it was an experimental question (i.e., did only the top scorers get it right, or did everyone get it right except the people who failed anyway, etc.), is given a weight based on its difficulty. I think the individual weighting of each question is what's used to standardize performances between test-takers.

I know for certain that's how it's done for other computer-based standardized tests that are used in a similar fashion, and why would NBME reinvent the wheel?
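
To make the idea concrete, here is a minimal sketch of difficulty-weighted scoring. It only illustrates the guess above and is not a documented NBME method: the question ids, the historical correct-response rates, and the choice of weight (1 minus the fraction of past examinees who answered correctly) are all assumptions.

```python
def weighted_score(responses, p_correct):
    """Return the share of the available difficulty weight an examinee earned.

    responses: question id -> True/False (answered correctly or not)
    p_correct: question id -> fraction of past examinees who got it right
    """
    # Assumed weighting: harder questions (lower p_correct) count for more.
    weights = {q: 1.0 - p_correct[q] for q in responses}
    total = sum(weights.values())
    if total == 0:
        raise ValueError("every question has zero weight")
    earned = sum(w for q, w in weights.items() if responses[q])
    return earned / total

history = {"q1": 0.9, "q2": 0.5, "q3": 0.3}       # hypothetical pilot data
alice = {"q1": True, "q2": True, "q3": False}     # missed the hard question
bob = {"q1": False, "q2": True, "q3": True}       # missed the easy question
```

Both examinees got two of three questions right, but under this weighting the one who missed only the easy question ends up with the higher score.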
 
Samoa said:
I don't think there are "forms" in the sense that a group of people will have all the same questions. I think what happens is that you get 350 questions randomly chosen from the test bank, and that each question, based on who got it right when it was an experimental question (i.e., did only the top scorers get it right, or did everyone get it right except the people who failed anyway, etc.), is given a weight. I think the individual weighting of each question is what's used to standardize performances between test-takers.

I know for certain that's how it's done for other computer-based standardized tests that are used in a similar fashion, and why would NBME reinvent the wheel?

Prove that it is done that way. Give us a link.

Secondly, GRE has an experimental block on their computer test. So using your logic, perhaps NBME also has an experimental block too.

Also, how do they determine one standard deviation and one mean if everyone has 350 randomly generated questions?
 
p53 said:
Prove that it is done that way. Give us a link.

Secondly, GRE has an experimental block on their computer test. So using your logic, perhaps NBME also has an experimental block too.

Also, how do they determine one standard deviation and one mean if everyone has 350 randomly generated questions?

At least you didn't choose law school. Your arguments are the least persuasive I have ever seen; you just make me laugh at how ******ed you sound. Thanks for the laughs!
 
Ramoray said:
At least you didn't choose law school. Your arguments are the least persuasive I have ever seen; you just make me laugh at how ******ed you sound. Thanks for the laughs!

You are an idiot. You can't come up with a theory if your life depended on it. All you do is post answers from reading books. BTW, you are weird for posting the stuff about looks and grades out of the blue. A confident person would not be fixated on physical appearance. You must be ugly.

Look dumbarse, if someone's entire argument is based on the fact that he knows it is like that for the other test, it is logical to ask for proof. A logical argument can be challenged by attacking its premise. A fact based argument has to be challenged for its veracity.
 
p53 said:
You are an idiot. You can't come up with a theory if your life depended on it. All you do is post answers from reading books. BTW, you are weird for posting the stuff about looks and grades out of the blue. A confident person would not be fixated on physical appearance. You must be ugly.

Look dumbarse, if someone's entire argument is based on the fact that he knows it is like that for the other test, it is logical to ask for proof. A logical argument can be challenged by attacking its premise. A fact based argument has to be challenged for its veracity.

Read your own post again and then see that your arguments are getting worse and worse. How much funnier can you get! Keep 'em coming. Peace.
 
Ramoray said:
Read your own post again and then see that your arguments are getting worse and worse. How much funnier can you get! Keep 'em coming. Peace.

Ramoray, step away from the computer. You are just wasting your own time. Your filler comments are crap. POST SOMETHING SUBSTANTIAL TO PROVE YOUR ARGUMENT. If you think my logic doesn't hold water post it on here. Just saying "read your post" is elementary.

You are a dumbarse that doesn't know how to debate nor back up any claims. Keep on posting garbage that makes you look like a simpleton.
 
p53 said:
Ramoray, step away from the computer. You are just wasting your own time. Your filler comments are crap. POST SOMETHING SUBSTANTIAL TO PROVE YOUR ARGUMENT. If you think my logic doesn't hold water post it on here. Just saying "read your post" is elementary.

You are a dumbarse that doesn't know how to debate nor back up any claims. Keep on posting garbage that makes you look like a simpleton.

:laugh: :laugh:
 
Ramoray said:

Nice comeback. Simple response from a simple mind. Later. Good Luck on your Step 1. I'm sure you will score above a 220. :laugh:
 
p53 said:
Prove that it is done that way. Give us a link.

Alright. Here's an article discussing the various methods for creating standardized tests assessing competency. It's in the context of pharmacy, but it discusses general methods that are used across disciplines. The method I described may not correspond exactly to any of the ones discussed, but it's in the ballpark.

article

Happy now?

p.s. never argue with a pharmacist. We rarely talk out of our @sses.
 
Samoa said:
I don't think there are "forms" in the sense that a group of people will have all the same questions. I think what happens is that you get 350 questions randomly chosen from the test bank, and that each question, based on who got it right when it was an experimental question (i.e., did only the top scorers get it right, or did everyone get it right except the people who failed anyway, etc.), is given a weight based on its difficulty. I think the individual weighting of each question is what's used to standardize performances between test-takers.

I know for certain that's how it's done for other computer-based standardized tests that are used in a similar fashion, and why would NBME reinvent the wheel?

This is the way we were told they do it by the faculty at our school who are involved in the question writing process for NBME.
 
i’m glad people are discussing this! it’s always a fun debate at my school. i happen to agree with the theory mentioned by samoa and tigershark. i’m no statistics buff, but we can imagine a given test that represents an average level of difficulty in comparison to all other randomly generated 350 question sets (or rather however many nonexperimental questions there are on a test). any particular question on any particular test would either move that specific exam away from the average level of difficulty or reinforce the exam’s particular place at exactly the average level of difficulty (based on previous correct response rates to these questions). after going through the entire test, it seems like some sort of unique factor could be generated to adjust for a test’s specific level of difficulty. thus, one mean and one standard deviation could be generated, by converting everyone’s scores to an equivalent for a test of average difficulty.
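
A rough sketch of that kind of adjustment, under stated assumptions: each question has a known historical correct-response rate, a test's difficulty is the sum of those rates over its questions, and the correction is a simple additive shift toward a test of average difficulty. The function names and the numbers are made up, and nothing here is claimed to be what the NBME actually does.

```python
def expected_correct(question_ids, p_correct):
    """Expected raw score of an average examinee on this set of questions."""
    return sum(p_correct[q] for q in question_ids)

def adjusted_score(raw_correct, question_ids, p_correct):
    """Shift a raw score to its equivalent on a test of average difficulty."""
    n = len(question_ids)
    bank_mean = sum(p_correct.values()) / len(p_correct)
    # Positive when this particular draw is harder than the bank average.
    difficulty_gap = n * bank_mean - expected_correct(question_ids, p_correct)
    return raw_correct + difficulty_gap

# Hypothetical four-question bank with historical correct-response rates.
bank = {"q1": 0.9, "q2": 0.8, "q3": 0.5, "q4": 0.4}
hard_draw = ["q3", "q4"]
easy_draw = ["q1", "q2"]
on_hard = adjusted_score(1, hard_draw, bank)
on_easy = adjusted_score(1, easy_draw, bank)
```

With these numbers, one correct answer on the hard draw is adjusted upward and one correct answer on the easy draw is adjusted downward, so a single mean and standard deviation could then be computed over the adjusted scores.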

p53, it seems that if they could standardize on the basis of one subdivision (forms, as you refer to them), then it would likely be feasible to standardize on the basis of a different type of subdivision. it’s the exact same principle, just a different number of questions. heck, maybe they standardize on the basis of random seven-question groupings, i have no real idea, but intuitively it seems like standardizing on the basis of individual questions would be the most rigorous way to do it and not particularly more difficult than on the basis of “forms,” perhaps requiring a few more seconds of processing on whatever central computers they have handling this stuff.

the only thing that bothers me is i wonder if there is an additional adjustment for having a harder test or an easier test, that is, 1) do people perform significantly differently if they have a test that’s loaded with hard/easy questions and 2) if so, is there any sort of adjustment for this? i think that’s the strongest argument for the use of an entire standardized form or an entire standardized block, as i imagine this would be an easy way to help correct for such factors. although i also feel that they might be crafty enough to figure out how to adjust for this stuff in a system of completely randomized questions.

fyi i plan on fighting you for the title of “danica patrick’s boy toy.” actually maybe we can work together to get rid of her fiance first...
 
p53 said:
Nice comeback. Simple response from a simple mind. Later. Good Luck on your Step 1. I'm sure you will score above a 220. :laugh:

Dude, I'd be thrilled with a 210. I'm not a smart person like the crazy smart people on here. I think SDN is a good learning tool to learn from the idios, bigfranks idq, who seemingly know more than all the texts put together, so it's a good learning place. I never claimed I'd get a high score or that I know much. Peace.
 
Eh, I'm not even sure you are responding to my post, since your reply does not address it at all. Let's try this: "Not a similar question but word for word the same question" is what I wrote. Then you argue about the possible significance of similar questions. See the disconnect?
As for the rest of your post, I agree. I'm just saying that, given a couple of people with repeated EXACT questions in different blocks, it might make more sense that the individual blocks are scored and compiled the way I believe you are suggesting the whole exam is.
Given idq1i's post, however, that may not be true. But I would be interested to know how you can tell you got the exact same test versus, say, 3 of the same blocks, so that you had enough overlap to make you think the entire test was the same, since you can't possibly memorize it all?


p53 said:
EVER heard of experimental questions? I had two pairs of similar questions too. The reason you were tested more than once is that the other ones were being tried out for future test administrations. The problem is you don't know which one was experimental.

The main reason new questions are added is that old questions are floating around via message boards, eBay, Goljan, etc. It would not be fair for someone to have "inside" knowledge about the exam. The NBME committee tries very hard to maintain the integrity of the exam. They take medical licensing very seriously (as they should). Sure, there are loopholes in the system, such as SDN, where people post questions from their exam. Still, this advantage within a calendar year is a drop in the bucket. What they are mainly concerned about is the integrity of the exam from year to year. They don't want anyone from next year's class to have a huge batch of remembered questions, so they make up new questions.

Regardless, as tommyk sez, they test the same concepts year by year. The questions are different, but the concepts are the same.

Let's use logic. Do you honestly believe the NBME would count the same question twice on something as high-stakes as a medical licensing exam? Sure, around 90% pass, but around 10% do not. The NBME is not going to waste two pairs of questions and risk someone scoring a 181 and flunking the exam.

It is like RAMORAY's theory that every question is based on previous examinees in previous years. That is horsecrap. I had two questions on my exam that pinpointed the date as post-2000. They make new questions; how else are they going to see if a question is valid? It is called field testing.

All standardized exams have experimental questions. If you were smart enough you would realize that my numbers such as 280/350 were arbitrary.

As a review

1. USMLE examinations have experimental questions, yet every form has the same number of questions that count toward the raw score. Arbitrary example: 320 scored questions and 30 experimental questions. This is one of the reasons people perceive certain sections as heavily weighted. The experimental batch that you receive might be heavy in immunopathology.

2. Standardized exams field test new questions the preceding year to use the following year. If the top 10% of the scorers on a particular form do not consistently get the question right, it is thrown out or modified. There is a steady state of questions retired to new questions added, so the size of the pool stays the same.

Lastly, if anyone has a better theory post here. I'm smart enough to know that I might not be 100% right. Still, I'm very convinced of my theory.
 
dynx said:
Given idq1i's post, however, that may not be true. But I would be interested to know how you can tell you got the exact same test versus, say, 3 of the same blocks, so that you had enough overlap to make you think the entire test was the same, since you can't possibly memorize it all?

OK, I can't sit here and claim that I remember every single question on my exam. However, I can tell you that every single question that I remembered was recognized by the other person. The order of questions was also the same.

(i'm not as certain about the latter part - the other person was hazy in his recollection)
 
idq1i said:
OK, I can't sit here and claim that I remember every single question on my exam. However, I can tell you that every single question that I remembered was recognized by the other person. The order of questions was also the same.

(i'm not as certain about the latter part - the other person was hazy in his recollection)

That would be an amazing coincidence if there were just set blocks, but still possible.
I think the fact is... we'll never know. Which, if you ask me, is a little shady. They do advise against using the test as anything other than a pass/fail system, but given the knowledge that residency programs do use it, I think it would be better to open their scoring system up to scrutiny.
 
Well, judging from the responses here, I think it's safe to assume that no one knows for sure whether or not these exams are weighted, or how they are curved, if at all. Here's what I have learned from talking to other people in my school.

#1) I am NOT the only one who feels as though they got screwed on the test. A few other people have felt that their test questions did not allow them to demonstrate how prepared they really were.

#2) NO ONE, even the people who think their tests were fair (~50% of the people I asked), feels that they did well on their test. One kid felt that way, but he's an ass anyway, and I wouldn't believe a word he says.

#3) NO ONE has any idea about whether the 50 experimental questions are in one block, or if they're spread out throughout the test.

#4) I've talked to many people, and it does not seem as if any two people had even remotely similar tests. I've only met a few people who had even one question in common with another.

And just for kicks, here's what I felt constituted the bulk of my test. TONS of EBHC (evidence-based health care, like statistics) questions. And not the easy ones like "What is the sensitivity of this test?" or "What is the positive predictive value?" It was hard **** like "What's the case-fatality ratio?!?!?!?" Also, TONS of murmurs and lung problems. I had barely any pharm (glad I wasted 5 days learning every single pharm card), and barely any kidney (ditto for learning all those damn glomerular diseases). I had not a single chromosomal question (guess memorizing all those was pointless. M3 15;17 anyone? 🙂). OK, enough of my bitching. It's over, and I hope to God that I don't have to take it again. Thanks to everyone for their input, though.
 