Anki scientifically proven to predict Step 1 scores better than Firecracker


Elessar
Full Member | 7+ Year Member | Joined: May 24, 2016 | Messages: 89 | Reaction score: 174
Link to full article:

A study published Dec. 2015 at WashU found:

"The use of boards-style questions and Anki flashcards predicted performance on Step 1 in our multivariate model, while Firecracker use did not."

"Unique Firecracker flashcards seen did not predict Step 1 score. Each additional 445 boards-style practice questions or 1700 unique Anki flashcards was associated with one additional point on Step 1 when controlling for other academic and psychological factors."

"Students who complete more practice questions or flashcards may simply study more in general, though the lack of a benefit with Firecracker compared with Anki suggests against this confounder."

Tables: (attached as images in the original post)


https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4673073/

 
  • Like
Reactions: 1 users
Damn, that mean Step 1 score...
 
  • Like
Reactions: 2 users
Hard to make any improvement when you've got Step scores that high...
 
  • Like
Reactions: 1 users
Mean MCAT 38, mean Step 1 253...

Perhaps not a representative sample?
 
  • Like
Reactions: 9 users
This title reads like a fake Facebook news "article." I guess that's fitting since it's garbage. You should be ashamed.
 
  • Like
Reactions: 2 users
At first I thought your username was "Elsevier" lol
 
  • Like
Reactions: 1 user
Well, this proves it. Being cheap is the best way to get ahead.


 
Both are crap. Pick up First Aid and read it like a normal person.
 
  • Like
Reactions: 1 user
Please, please, please take a course in clinical epidemiology and methods if this is what you think this regression analysis proves. Would it have killed you to make your thread title "Anki, not Firecracker, associated with higher Step 1 scores in a single high-performing medical school class"?
 
Last edited:
  • Like
Reactions: 1 users
Well. Now I understand why I didn't get into WashU.


 
  • Like
Reactions: 1 users
Can you really have 8 covariates with a sample of 72? It's been a while since I took statistics, but that seems off.
 
  • Like
Reactions: 1 user
Well, I plan on doing 14k Anki cards and 5k questions, so I should at least get an 18 on Step 1.
 
  • Like
Reactions: 5 users
Can you really have 8 covariates with a sample of 72? It's been a while since I took statistics, but that seems off.
Absolutely you can. It might or might not be a problem depending on what they're going for.

Common rule of thumb is around 10 cases per predictor, so this analysis is a bit underpowered.

You can't really say it's underpowered in general, because several things were found to be significant (being underpowered is more the situation where you can't detect significance even when there is a true effect). Precision of the estimates is a different issue, though a related one. A sample of 72 leaves them 63 (72 - 8 - 1) degrees of freedom for estimating the standard deviation of the error term when fitting a model with the 8 independent variables and an intercept, which isn't terrible.
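If you're rusty on where that 63 comes from, here's the arithmetic in Python (the n and predictor count are the paper's; everything else is generic illustration):

    from scipy.stats import t

    # Residual degrees of freedom for OLS with an intercept: n - (predictors) - 1
    n_obs, n_predictors = 72, 8
    df_resid = n_obs - n_predictors - 1
    print(df_resid)                    # 63

    # The "10 cases per predictor" rule of thumb quoted above
    print(n_obs / n_predictors)        # 9.0, right around the cutoff

    # Precision angle: with 63 residual df the t critical value for a 95% CI
    # is already close to the large-sample 1.96, so the df loss isn't dramatic
    print(t.ppf(0.975, df_resid))      # ~2.00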

Overall, as others have said, the sample doesn't really seem representative of the average medical student. I skimmed it a bit and didn't notice whether they accounted for user-made vs. premade Anki cards, which could reasonably have an influence (did anyone notice if they did?).

The results section also seems to suggest that they ran the 8-variable model and just checked p-values to see which variables were significant and which were not, then concluded the 5 with significant p-values were "good to go." That isn't really a sound approach, for those who remember their intro stats or multidimensional mathematics. Each of those p-values is calculated under the assumption that all the other variables/terms remain in the model; it answers the question, "If I have all these other variables in the model, is this one significant?" If a variable isn't significant, you generally want to remove it and refit the model before testing further variables. Otherwise, your conclusion can only be that X is (or isn't) significant while accounting for all the other variables, rather than that variables A, B, and C are each not significant, because the test of A assumed B and C were still in the model.

Geometrically, this comes from fitting the first model in a 9-dimensional space (the y-axis plus 8 axes for the independent variables); the estimated equation lives in that space around the data. Once you drop a variable (for nonsignificance, say), the new fit lives in an 8-dimensional space (the y-axis and the 7 remaining independent variables). Believe it or not, this can often change the significance of variables from prior steps, which is why it's important to avoid testing things aimlessly. Just something to keep in mind when you read research, or maybe when you decide to do your own.

It's also hard to tell whether they actually fit an intercept; forcing the function through the origin wouldn't make much logical sense here, though they may well have included one and just not reported it. They also misused the term "multivariate" in place of "multivariable" (multiple dependent variables vs. multiple independent variables), but that's not a major issue here.
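To make the "each p-value assumes the other variables stay in the model" point concrete, here's a small simulated example in Python (nothing from the paper's actual data; the variables and numbers are invented) showing how dropping one collinear predictor and refitting can change whether another predictor looks significant:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 72
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.05, size=n)   # nearly a copy of x1 (think two overlapping study metrics)
    x3 = rng.normal(size=n)
    y = 2.0 * x1 + 1.0 * x3 + rng.normal(size=n)

    # Full model: each reported p-value asks "does this variable add anything
    # given ALL the others?" With x1 and x2 nearly collinear, both will typically
    # show large p-values even though x1 truly matters.
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2, x3]))).fit()
    print(full.pvalues)

    # Drop x2 and refit: x1's p-value typically becomes tiny in the smaller model.
    reduced = sm.OLS(y, sm.add_constant(np.column_stack([x1, x3]))).fit()
    print(reduced.pvalues)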
 
Last edited by a moderator:
  • Like
Reactions: 2 users
Please, please, please take a course in clinical epidemiology and methods if this is what you think this regression analysis proves. Would it have killed you to make your thread title "Anki, not Firecracker, associated with higher Step 1 scores in a single high-performing medical school class"?
Probably would've been less sensational, though... All in all, the responses seem pretty good at seeing "eh, okay... not applicable / not practically meaningful (thousands of cards for a point)," so at least everyone is taking it with a grain of salt. For one, I just think a strict schedule with repetition is going to benefit the average person, regardless of whether the actual product is Firecracker or Anki.
 
Last edited by a moderator:
Thanks for the detailed response, I appreciate it. The methods in the analysis just seemed a bit iffy to me.
 
Thanks for the detailed response, I appreciate it. The methods in the analysis just seemed a bit iffy to me.
No problem! I think it could be an okay study, but there are a lot of things that are unclear or not as they should be if you want to get close to talking about causality, or even about appropriate conclusions (who it applies to and what it applies to). What they've done, and how most studies should be viewed in my opinion, is add some evidence to a bucket for or against an idea. In this case, it's a bucket that supports a link between Anki use and higher Step 1 scores after accounting for all those other factors, though that conclusion might not hold if we added something like making your own cards vs. using a premade set. The difficulty lies in assessing the quality of that finding and what it actually means (maybe they have a supplemental document with those details).
 
  • Like
Reactions: 1 user
The results section also seems to suggest that they ran the 8-variable model and just checked p-values to see which variables were significant and which were not... It's also hard to tell whether they actually fit an intercept... They also misused the term "multivariate" in place of "multivariable"...

It's a judgment call when to perform variable selection, such as by stepwise regression. If there's significant collinearity between two variables, for instance, it would make sense to drop one of them. In other situations, I think you have to have some good a priori reason to do so (i.e., not just nonsignificance). You have to balance fit and simplicity. Here, the goal is not so much to find the most efficient model as to test the significance of the variables while controlling for the others. The upper bound is determined more by your sample size.

There's no supplement, but I can tell you an intercept was fitted but not reported. The source of Anki cards was not surveyed. Agree with others that the sample does not represent the national average med student, since the scores sit at the high end of the curve where it's tighter. Agree that this paper should be viewed as adding a piece to a bucket; the area of spaced-repetition programs in med ed has scant practical evidence.
I didn't realize the difference between multivariate and multivariable; thanks for dropping the knowledge there.
 
  • Like
Reactions: 1 user
It's a judgment call when to perform variable selection, such as by stepwise regression. If there's significant collinearity between two variables, for instance, it would make sense to drop one of them.
Ah, but stepwise regression isn't really a judgment call: you select the alpha (to enter, to remove, or both) and the stepwise procedure, and the computer does the rest. That's actually one of the benefits of stepwise methods; they provide a more objective way to sift through a cumbersome number of potential predictors. Stepwise can also typically handle collinearity issues for you, since the t-statistics are deflated when collinear predictors are in the model together (causing one of the two, in the simplest case, to drop out). Sure, you can add variables back in based on your judgment (which often makes sense when a dummy variable is kicked out by stepwise but at least one of the other dummies for that variable remains in), but stepwise is, overall, a relatively objective screening tool compared with manually picking t-tests to run or deciding what to include in subset F-tests, for example.
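For anyone who hasn't seen it, a bare-bones backward-elimination pass is only a few lines. This is a generic sketch in Python (not what the paper did; as far as I know statsmodels doesn't ship a canned stepwise routine, so it's hand-rolled, and alpha_out = 0.05 is just an example):

    import numpy as np
    import statsmodels.api as sm

    def backward_eliminate(y, X, alpha_out=0.05):
        """Repeatedly drop the predictor with the largest p-value, refitting the
        smaller model after each removal, until everything left is below alpha_out."""
        cols = list(range(X.shape[1]))
        while True:
            fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
            pvals = fit.pvalues[1:]              # skip the intercept's p-value
            worst = int(np.argmax(pvals))
            if pvals[worst] < alpha_out or len(cols) == 1:
                return cols, fit                 # surviving column indices and the final fit
            cols.pop(worst)                      # drop it, then refit and re-check

A forward or both-directions version just adds a mirror-image step with an alpha-to-enter.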

In other situations, I think you have to have some good a priori reason to do so (i.e., not just nonsignificance). You have to balance fit and simplicity.
Definitely agree with you on this (when you have that a priori knowledge, you can also manually include variables after stepwise runs, or add them back if you really want them). It's a shame that a lot of research isn't done this way; not that anything is wrong with truly exploratory work, but it's kind of blatant how much stuff is just indiscriminate.

Here, the goal is not so much to find the most efficient model as to test the significance of the variables while controlling for the others. The upper bound is determined more by your sample size.
Right, which is why I cautioned that they only found those 5 to each be individually significant while the other 7 regressors were accounted for, which is a different picture from actually testing one variable (or a subset), then refitting the model based on that test. You really might not end up with 5 significant variables when you do it that way. Without going back for a specific example, it's possible there was enough multicollinearity to make some regressors appear nonsignificant (as long as all the others were in the model), while removing one would reduce the multicollinearity and possibly push a previously nonsignificant variable to significant (which isn't uncommon at all). If by the upper bound you mean the number of predictors, then yes, you need enough degrees of freedom, which I might not have stated clearly. My point was that, yes, 8 predictors is allowed with this sample size (they were able to fit the model), but you also want to look at the degrees of freedom left to estimate the standard error of the estimate (which seemed reasonable in this case). Obviously, more degrees of freedom for estimation is better, but it didn't seem too bad here.
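If anyone wants to screen their own data for that kind of multicollinearity, variance inflation factors are the usual quick check. Here's a generic Python sketch (simulated data, nothing from the paper):

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    rng = np.random.default_rng(1)
    x1 = rng.normal(size=100)
    x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly a copy of x1
    x3 = rng.normal(size=100)

    X = sm.add_constant(np.column_stack([x1, x2, x3]))
    vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]  # skip the intercept
    print(vifs)   # x1 and x2 show huge VIFs; a common rule of thumb worries above ~5-10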

There's no supplement, but I can tell you an intercept was fitted but not reported. The source of Anki cards was not surveyed. Agree with others that the sample does not represent the national average med student, since the scores sit at the high end of the curve where it's tighter. Agree that this paper should be viewed as adding a piece to a bucket; the area of spaced-repetition programs in med ed has scant practical evidence.
Again, I'm not going to look back, and I kind of skimmed it the first time, but where did you see that they fit the intercept but didn't report it (I realize they might not have seen it as important, so they left it out)? Perhaps you are familiar with the researchers?

I didn't realize the difference between multivariate and multivariable; thanks for dropping the knowledge there.
A lot of people in public health and epidemiology confuse and conflate the two ideas (even textbooks in these fields), but biostatisticians and general statisticians (and mathematicians) don't really equate the two.
 
Last edited by a moderator:
I wanted to believe this headline since I'm such an Anki fan. But looking at the study, what stands out to me is that this is a survey. They asked people how many cards they did. I absolutely don't trust people to know that number across 2 years of medical school. So the premise is, in my mind, ridiculous.

I believe the result, but it's also obvious. They compared people doing the same number of Anki cards as Firecracker cards. It seems obvious to me that if you take two stacks of flash cards, and one stack is all information I don't know (i.e. user-generated cards I made for a reason) and the other is some stuff I already know (i.e. Firecracker pre-made deck), I'll learn more from the first stack.

The study is interesting and suggestive, but doesn't change anything.
 