Having taken the first 2 Steps and read a lot of the NBME's pubs on test construction, I would argue that the exams are becoming much better at testing relevant clinical thinking. It's not just picking the correct dx out of a list, but picking it out of a list of 5-10 things that would be your actual ddx in real life. Even on Step 1 I remember questions where all the answer choices were highly plausible diagnoses and I had to really work through the history and the labs to figure out the right one (which is likely exactly what the questions were trying to test).
Learning to make a good ddx is definitely something you do more in third year and beyond, but my experience has been that the stronger students from the preclinical years do this much better than the rest. A strong preclinical student won't automatically be a star in third year, but it's highly likely unless they have some sort of social interaction issues. My first 2 clerkships were surgery and IM, and I definitely used my built-up knowledge base every day; I think it helped me stand out. My anecdotal observation has been that the top students are much better clinically, perhaps because they don't have to worry as much about the shelf and can focus on developing their clinical skills.
The college --> med school transition definitely results in half of a group of former top students ending up below average. The same thing doesn't happen from M2 --> M3 because it's still the same group of people. Sure, maybe a tiny few are unable to pass Step 1 and leave the class, but that would not significantly change where everyone else stands. The people I can think of who struggled with the transition to M3 were those who studied very inefficiently and couldn't cope with the drastic reduction in available study hours. Everyone feels like an idiot at least once every day, but that's just part of the experience.