Interesting. I thought it had a theoretical max of >300 in the past but was being given these days as 300, similar to how Step 1 used to allow more than 280 but is now proctored as 280. I didn't realize it was currently 318. What a weird number.
I do think the ultimate scaling is supposed to represent a value that roughly captures the share of questions you got correct. Look, for example, at how the NBME provided "equated percent correct" scaled scores on the shelves. The ~60-70% pass threshold is from one of their webpages, and I find it way too coincidental that it converts directly over (194/280 = 69% on Step 1, and 209/318 = 66% on Step 2).
It's really easy to account for experimental items. Ready, set, go: 150/200 valid items correct. Scaled: 210/280. Both are 75%; one is your real performance and the other is your performance scaled up to fit the number of Qs if they had all been valid. Then also adjust for form differences and there's your score.
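Here's that arithmetic as a quick Python sketch, just to show what I mean. The function name `scale_score` is made up, the 280 ceiling is the number from this thread, and the form-difference adjustment is left out because I have no idea how they actually do it. This is my guess at the idea, not the NBME's method.

```python
def scale_score(valid_correct, valid_total, scale_max):
    """Proportionally rescale a count of valid items correct onto a reporting scale.

    Hypothetical helper: assumes experimental items were already dropped
    and ignores any adjustment for form difficulty.
    """
    if valid_total <= 0:
        raise ValueError("valid_total must be positive")
    fraction = valid_correct / valid_total  # real performance on valid items
    return fraction, fraction * scale_max   # same fraction on the reporting scale


fraction, scaled = scale_score(150, 200, 280)
print(f"{fraction:.0%} correct -> {scaled:.0f}/280")  # 75% correct -> 210/280
```

The point is only that the fraction is the same on both sides; the reporting scale just changes the denominator.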
Since it's criterion-referenced, not norm-referenced, there's really no other basis for them to be building their scale. My money says the real meaning of a 250/280 on Step 1 is "based on their performance on this form, we believe this test taker knows the correct answer to 89% of our valid test item bank."
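If you want to check the percentages I'm quoting, here's the same thing going the other direction. Again, `implied_percent` is a name I made up, and the 280 and 318 ceilings are just the numbers from this thread, not anything official.

```python
def implied_percent(score, scale_max):
    """Read a reported score as a percent of the scale ceiling (speculative)."""
    return 100 * score / scale_max


# Scores and ceilings quoted in this thread; none of these are confirmed values.
examples = [
    ("Step 1 pass", 194, 280),
    ("Step 2 pass", 209, 318),
    ("Step 1 score of 250", 250, 280),
]

for label, score, scale_max in examples:
    print(f"{label}: {score}/{scale_max} = {implied_percent(score, scale_max):.0f}%")
```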
This is all speculation, though! I didn't hack their servers or anything.