It is a little more complicated than that.

Let me try to explain how exams like USMLE are designed. What I am going to describe is based on general theory, and USMLE most likely does things differently.

Designing a good exam is a science by itself. A lot of statistical calculation goes into finding the perfect question, answer choices, and score distribution. Many statisticians work alongside field professionals to design professional exams.

USMLE uses statistical normalization to compute and calibrate the score for each question; yes, each and every question. That is also the reason why you never know the maximum possible score. How does it work? I don't know how USMLE does it, but here is an example of how it can be done:

Suppose I am about to start teaching a class at a college. To test my students I want to develop a question bank from which I can give the final exam every semester. One of my goals is to keep adding new questions to this bank and keep removing old ones. When adding questions, I want to make sure they are neither extremely easy nor extremely difficult. My other goal is to keep the grading standard consistent. For this I define my own statistical distribution, based either on some standard mathematical model or on one I make up. My plan is to put 300 questions on the final, and for this I pick a distribution with mean M = 215 and standard deviation SD = 20. I will call it the SKM95 distribution.

Given all the parameters, now let's work on the problem in the simplest possible way.

I will start by giving a practice test to all my students every week. Let's assume that each question on the test is worth 10 points. After every practice test I will compute the mean and standard deviation for each question. So, let's say that for Question 1 (Q1), M = 5.2 and SD = 1.5.

Based on the M and SD of every practice test question, I can decide which questions are too difficult and which are too easy, and remove those from my final question bank. For example, if on a particular practice test question all students get 10/10 or all get 0/10, then that question will not make it to the final; I will drop it from the question bank.
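Here is a tiny Python sketch of that filtering step. The question IDs and scores are made up for illustration:

```python
# Hypothetical sketch: compute per-question mean/SD from weekly practice
# tests and drop any question every student aced or every student failed.
from statistics import mean, stdev

# scores[question_id] = list of per-student scores (each out of 10);
# all numbers below are invented for this example.
scores = {
    "Q1": [5, 7, 3, 6, 5, 4, 7, 5],           # mixed results -> keep
    "Q2": [10, 10, 10, 10, 10, 10, 10, 10],   # everyone aced it -> drop
    "Q3": [0, 0, 0, 0, 0, 0, 0, 0],           # everyone failed -> drop
}

question_bank = {}
for qid, s in scores.items():
    m, sd = mean(s), stdev(s)
    # A question with zero spread cannot discriminate between students.
    if sd > 0:
        question_bank[qid] = (m, sd)

print(sorted(question_bank))  # ['Q1']
```

Only Q1 survives; the two zero-spread questions are exactly the "all 10/10 or all 0/10" cases described above.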

At the end of the semester I will have a full question bank from which I can randomly pick 300 questions and give the final to my students. Let's assume Q1 is on the final. Now pick 3 students, A, B, and C, at random. On the final, A gets 10/10, B gets 0/10, and C gets 7/10 on Q1. Now I will compute the z-value on Q1 for each of these 3 students.

Formula for z = (Score - M) / SD.

For A, z = (10 - 5.2) / 1.5 = 3.2

For B, z = ( 0 - 5.2) / 1.5 = -3.47

For C, z = ( 7 - 5.2) / 1.5 = 1.2

For a full 300-question final, the M and SD will be different for each question, so to compute the final z-score I have to add the z-values for all 300 questions and divide by 300.

For this example let's assume the final exam has only one question. The next step is to compute the final score by mapping this z-score onto my SKM95 distribution. That can be done using the same z-value formula, except in this case the unknown is the Final Score, so the equation becomes:

Final Score = M + (SD * z)

For A, Final = 215 + (20 * 3.2 ) = 279

For B, Final = 215 + (20 * -3.47) = 146

For C, Final = 215 + (20 * 1.2 ) = 239
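And a sketch of the mapping step, again with the made-up SKM95 parameters (on a real 300-question exam you would first average the z-values across all questions, as described above):

```python
# Sketch: map a z-value onto the invented SKM95 reporting scale
# (mean 215, SD 20) by inverting the z formula.
def scaled_score(z, scale_mean=215, scale_sd=20):
    """Final Score = M + (SD * z), rounded to the nearest point."""
    return round(scale_mean + scale_sd * z)

print(scaled_score(3.2))    # A: 279
print(scaled_score(-3.47))  # B: round(145.6) -> 146
print(scaled_score(1.2))    # C: 239
```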

You can see that even with 100% correct the maximum score was 279, not 300, and with 0% correct the minimum was 146, not zero.

Now you know why no one knows the maximum score on USMLE, and since every question has a different z-score, no one can answer the question, "how many questions do I need to answer correctly to get a 236/99?"

In this imaginary scenario I can also repeat some of the questions in next semester's practice tests. For these questions the M and SD will have to be re-computed based on the new sample size (previous semester's class size plus next semester's class size). This will skew each question's distribution curve slightly, so before the next semester's final I will have to re-adjust my SKM95 mean and standard deviation (e.g. I may have to move M from 215 to 214 and SD from 20 to 21, etc.) to keep the grading standard consistent.
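A quick sketch of that pooling step, with invented scores for a repeated question:

```python
# Sketch: re-computing a repeated question's M and SD when a new
# semester's scores are pooled with last semester's (made-up data).
from statistics import mean, pstdev

last_semester = [5, 7, 3, 6, 5, 4, 7, 5]
this_semester = [6, 8, 4, 5, 7, 6]

# Pool both cohorts into one sample and recompute the statistics.
pooled = last_semester + this_semester
print(round(mean(pooled), 2), round(pstdev(pooled), 2))
```

The pooled M and SD will generally differ a little from last semester's values, which is exactly the drift that forces the SKM95 parameters to be re-adjusted.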

Now you know why the USMLE SD and M shift over time.

In my make-believe world I can also add some 50 new practice test questions to the final, increasing it to 350 questions, and call these practice questions experimental questions. These 50 won't count towards the final score, and only I will know which ones they are (sound familiar?).

With this scoring system, all the students can theoretically get a perfect 270+ on the final exam. But the probability of that happening is next to zero. Why? Because if all students get 10/10 or 0/10 on a practice/experimental question, that question is thrown away. But on paper I can still claim that everyone can get a perfect score ;-)

Now, this was a very simple explanation of how a scoring system like USMLE's can be designed. In reality it is much more complex.

USMLE most likely repeats the same experimental question for 1 year before converting it to a question that counts, so the sample size is huge (10,000+).

Each experimental question then goes through item analysis. This is where some of the following functions are performed:

Calculate the p-value. This is the probability of getting the question right. On a five-response multiple choice question, the optimum difficulty level is 0.50 for maximum discrimination between high and low achievers.

Calculate the point-biserial correlation (PBC). This measures how students did on a question compared to their overall test score. A highly discriminating question is one where students with high test scores got it correct while students with low test scores got it incorrect. The goal for a USMLE-like exam is a PBC of 0.4 or more.

Calculate the reliability coefficient. Using the Kuder-Richardson formula, compute the degree to which the test measures a single cognitive construct. The goal is 0.9 or above for a USMLE-like exam.

Distractor analysis. On a multiple choice question, if A is the correct answer then B, C, D, and E are distractors (wrong answers). Distractors should appeal to low scorers who have not mastered the material, whereas high scorers should select them infrequently.

Distribution skew. In most professional exams the distribution is negatively skewed (long tail to the left, with scores bunched toward the high end), including the USMLE.
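To make the first three of those concrete, here is a hedged sketch of how such item-analysis metrics can be computed on a tiny made-up dataset (real exams use far larger samples and more sophisticated models):

```python
# Sketch of item-analysis metrics on an invented dataset.
# Rows = students, columns = items; 1 = correct, 0 = incorrect.
from statistics import mean, pstdev

responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
totals = [sum(row) for row in responses]  # each student's total score
k = len(responses[0])                     # number of items

def p_value(item):
    """Proportion of students answering this item correctly (difficulty)."""
    return mean(row[item] for row in responses)

def point_biserial(item):
    """Correlation between the 0/1 item score and the total test score."""
    got_right = [t for row, t in zip(responses, totals) if row[item] == 1]
    got_wrong = [t for row, t in zip(responses, totals) if row[item] == 0]
    p = len(got_right) / len(totals)
    sd = pstdev(totals)
    return (mean(got_right) - mean(got_wrong)) / sd * (p * (1 - p)) ** 0.5

def kr20():
    """Kuder-Richardson 20 reliability estimate for the whole test."""
    pq = sum(p_value(i) * (1 - p_value(i)) for i in range(k))
    var_total = pstdev(totals) ** 2
    return (k / (k - 1)) * (1 - pq / var_total)

print(p_value(0), round(point_biserial(0), 2), round(kr20(), 2))
```

With a sample this small the numbers mean little; the point is only to show what each metric is measuring.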

So, based on item analysis, USMLE can design each and every test question with very well defined boundary conditions. Ever wondered why the Kaplan or UW score estimators predict a wide-ranging score compared to NBME tests? That's because NBME questions are taken from actual USMLE exams, and all of them have gone through rigorous item analysis.

OK! that's enough statistics for the day! Back to studying for Step-1!!!!!