OK, let me try to explain this. Again, we psychometricians do not simply "curve" the observed raw scores. Rather, we use specialized statistical methods (Item Response Theory plus scaling/equating) to estimate your underlying, unobserved trait (in this case, capacity or mastery), using all the information from the items you answered on test day. Because your capacity/mastery is unobserved, we have to infer it. We don't just count how many items you got right or wrong. We also take into account the pattern of your responses and the nature of the items (their individual behavior, such as difficulty level, "guessing" probability, discriminating power, etc.). As you may know, every exam includes a few items that do not count toward your score. That is our trick: we use them only to collect data and estimate how each item behaves in the population, so that we get to know its "item characteristics." That is why an exam can contain some unusually hard or easy items. Until we actually collect the data and see how many people got an item right or wrong, we have no clue how it will behave. Once we know the individual item characteristics, we use that information to assemble a test that aims to match a pre-set "test characteristic." During assembly we try to balance the difficulty levels, so if a form has some very easy items, it will likely also have some harder items to keep the overall test balanced.
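
To make the "pattern matters, not just the count" point concrete, here is a minimal sketch in Python. It assumes the three-parameter logistic (3PL) model, which is one common IRT model; I am not claiming it is the exact model AAMC uses. The item parameters, the function names, and the crude grid search are all made up for illustration.

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: chance of a correct answer at ability theta.
    a = discrimination, b = difficulty, c = guessing floor."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, items, responses):
    """Log-likelihood of a whole response pattern (1 = right, 0 = wrong)."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def estimate_theta(items, responses):
    """Crude maximum-likelihood estimate: try every theta on a grid
    from -4.00 to 4.00 and keep the one with the highest likelihood."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

def expected_raw_score(theta, items):
    """Test characteristic curve: expected number correct at theta."""
    return sum(p_correct(theta, a, b, c) for a, b, c in items)

# Hypothetical item parameters (a, b, c), ordered from easy to hard.
ITEMS = [
    (1.0, -1.5, 0.2),
    (1.2, -0.5, 0.2),
    (1.5, 0.5, 0.2),
    (1.8, 1.5, 0.2),
]

easy_right = [1, 1, 0, 0]  # raw score 2: the two easy items correct
hard_right = [0, 0, 1, 1]  # raw score 2: the two hard items correct

theta_easy = estimate_theta(ITEMS, easy_right)
theta_hard = estimate_theta(ITEMS, hard_right)
print("easy items right:", theta_easy)
print("hard items right:", theta_hard)
```

Both patterns have a raw score of 2, but the estimates differ. Under these made-up parameters, missing the easy items while hitting the hard ones looks more like lucky guessing, so that pattern ends up with the lower estimate. The `expected_raw_score` helper is the kind of quantity you would compare against a target when assembling a form.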
The items released by AAMC are already retired from the active item banks. We already have a lot of data on those items and tests, so the scores can be computed right away by automated computation. For active items, however, it still takes time for real psychometricians to aggregate the data and run the models to estimate the final scores as accurately as possible, and along the way there are a lot of assumptions that need to be checked. Again, it is an estimation process using data and statistics. =) (Oh, also: one of the most basic assumptions we make in the IRT model is that the underlying trait is normally distributed. That is where the "curve" comes from.)
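
If you are curious how the normality assumption enters the computation, here is a rough Python sketch of one standard way to do it: an expected a posteriori (EAP) estimate, where a standard normal prior on the trait is combined with the likelihood of your responses. This is only an illustration under my own assumptions (3PL items, a simple evenly spaced grid, hypothetical parameters and function names), not the actual scoring procedure of any test maker.

```python
import math

def p_correct(theta, a, b, c):
    """3PL model: chance of a correct answer at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_theta(items, responses, n_points=161):
    """Posterior mean of theta under a standard normal prior,
    approximated on an evenly spaced grid from -4 to 4."""
    grid = [-4.0 + 8.0 * i / (n_points - 1) for i in range(n_points)]
    weights = []
    for t in grid:
        prior = math.exp(-0.5 * t * t)  # N(0, 1) density, up to a constant
        like = 1.0
        for (a, b, c), u in zip(items, responses):
            p = p_correct(t, a, b, c)
            like *= p if u == 1 else 1.0 - p
        weights.append(prior * like)
    total = sum(weights)
    return sum(t * w for t, w in zip(grid, weights)) / total

# Hypothetical item parameters (a, b, c).
ITEMS = [
    (1.0, -1.5, 0.2),
    (1.2, -0.5, 0.2),
    (1.5, 0.5, 0.2),
    (1.8, 1.5, 0.2),
]

print("all right:", eap_theta(ITEMS, [1, 1, 1, 1]))
print("all wrong:", eap_theta(ITEMS, [0, 0, 0, 0]))
print("no data  :", eap_theta([], []))  # falls back to the prior mean
```

With no responses at all, the estimate is just the prior mean (about 0). Every answered item then pulls the estimate up or down from there, and even an all-right or all-wrong pattern gets a finite estimate because the prior keeps it from running off to infinity.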
I know this is a totally different topic from our upcoming MCAT exam (mine is on 1/25!!), but if you want to read a little more, the IRT article on Wikipedia may be a good read. It is highly simplified and omits a lot of details, but I think it is a good place to start.
P.S. Although I use "we" in my response, that is just my usual writing style. I have absolutely no affiliation with the test makers. I just know their tricks and use the same tricks in my own work.