No worries, I'll explain to the best of my knowledge. Scaling inherently means you have predictive value for the score associated with each question. So a question likely doesn't officially count toward a score until after many, many tests in which it was included as an experimental question. Statistical analysis then gives you a confidence interval for how the typical student will perform on it. This is done for every question that goes into creating a test. Because each test form has a different combination of questions, the raw percent-correct differs between forms, but they are all balanced so that they produce a scaled score around the same center. That's why 500 is designed to be the middle point, with 125 for each subsection.
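To make that concrete, here's a rough Python sketch of how a linear equating step might work. This is my own illustration, not AAMC's actual procedure, and the p-values, 500 center, and spread of 10 are made-up numbers:

```python
# Illustrative linear equating: map a raw score to a scaled score so that
# every test form centers on the same point despite different item difficulty.
# p-values = proportion answering correctly, estimated from the experimental phase.

def scale_score(raw_score, item_p_values, scaled_mean=500.0, scaled_sd=10.0):
    """Linearly transform a raw score using the form's predicted difficulty."""
    expected_raw = sum(item_p_values)                      # predicted mean raw score
    # predicted raw-score SD, assuming independent items (a simplification)
    raw_sd = sum(p * (1 - p) for p in item_p_values) ** 0.5
    z = (raw_score - expected_raw) / raw_sd                # standardize against the form
    return scaled_mean + scaled_sd * z

# The same raw score maps higher on a harder form (lower p-values):
easy_form = [0.8] * 50   # expected raw score: 40
hard_form = [0.6] * 50   # expected raw score: 30
print(scale_score(35, easy_form))  # below the easy form's expectation -> under 500
print(scale_score(35, hard_form))  # above the hard form's expectation -> over 500
```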
The post-test modifications serve multiple purposes, most of which I'm sure are well beyond me. But they are not for curving. They probably serve to check validity (i.e., accuracy), test-retest reliability (i.e., consistency), and fairness. They probably also involve a series of human raters to check interrater reliability and confirm the scale that was used for the test.
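For the interrater piece, one common statistic (not necessarily the one any particular test maker uses) is Cohen's kappa, which corrects the raw agreement between two raters for agreement you'd expect by chance. A minimal sketch, with made-up ratings:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick each category independently.
    expected = sum(counts_a[c] * counts_b.get(c, 0) for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail calls from two raters on ten responses:
a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "pass"]
print(cohens_kappa(a, b))  # ~0.47 here; 1.0 = perfect agreement, 0 = chance level
```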
Just because you came up with a scale doesn't mean you can implement it without checking and confirming that it predicted what you said it would. Furthermore, a given test may include questions that just came out of the experimental phase and are going into their first "real" test. All of these things require post-test analysis. And that's a good thing, because it means they are not just pumping out algorithms without ever checking for the errors they made.
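That post-test check can be as simple as comparing each item's predicted difficulty from the experimental phase against how it actually performed, and flagging anything that drifted. A toy version with an invented tolerance threshold:

```python
def flag_drifting_items(items, tolerance=0.05):
    """Flag items whose observed difficulty strayed from the experimental
    prediction by more than the tolerance (threshold here is invented)."""
    flagged = []
    for name, predicted_p, observed_p in items:
        if abs(observed_p - predicted_p) > tolerance:
            flagged.append((name, predicted_p, observed_p))
    return flagged

items = [
    ("Q1", 0.72, 0.70),  # behaved as predicted
    ("Q2", 0.55, 0.41),  # much harder than expected -> review before scoring
]
print(flag_drifting_items(items))  # [('Q2', 0.55, 0.41)]
```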