From what I can interpret from the website info, the questions are weighted by difficulty, which has been said before. What isn't indicated is when the weighting takes place. People in other threads claim that the weighting is done before the test is administered, based on how difficult the writers feel a question is when they are writing it. Others think the weight of each question is assessed after a test is administered, based on how many people get it right or wrong.
If I were to speculate, I would say they probably do both. They most likely decide each question's difficulty level before the exam so they can assemble exams of roughly similar difficulty, and then adjust that prediction based on how many people actually got each question right compared to how many they expected to get it right. This would explain why it takes so long to grade the test (essays shouldn't take that long) and also how they manage to keep the percentile and scaled scores from varying much from test to test or year to year. Additionally, it would explain how people can still score close to their AAMC average when they feel they did much worse and had to guess in places where they normally didn't during practice exams.
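To make the guess concrete, here is a rough Python sketch of what "do both" could look like. This is purely my own illustration, not anything the AAMC has published: the function names, the `trust_observed` blending factor, and the idea of using "1 minus proportion correct" as the weight are all assumptions.

```python
def question_weights(predicted_correct, observed_correct, trust_observed=0.5):
    """Hypothetical two-stage weighting.

    predicted_correct: pre-test guesses of the fraction who will answer each question right.
    observed_correct:  fraction who actually answered each question right.
    trust_observed:    assumed blending factor between the two (0 = prediction only).
    """
    weights = []
    for pred, obs in zip(predicted_correct, observed_correct):
        # Adjust the pre-test prediction toward what actually happened.
        blended = (1 - trust_observed) * pred + trust_observed * obs
        # Assumption: harder questions (fewer people right) carry more weight.
        weights.append(1.0 - blended)
    return weights


def weighted_score(answers, weights):
    """Fraction of the total available weight earned by one test taker."""
    earned = sum(w for got_it, w in zip(answers, weights) if got_it)
    return earned / sum(weights)


# Question 2 turned out harder than predicted, so its weight goes up.
predicted = [0.8, 0.6, 0.4]
observed = [0.8, 0.3, 0.4]
weights = question_weights(predicted, observed)
score = weighted_score([True, False, True], weights)
```

Under a scheme like this, missing a question that most people also missed costs less relative to the rest of the pool than it would under the pre-test prediction alone, which is the kind of effect that could keep scaled scores steady on a test that felt harder.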
We shall see.