MD & DO Beneficial to learn R or python before medical school?

This is actually why I prefer to have an actual statistician look at my data. No matter how much I know, it is far better to have someone who specializes in statistics working with my numbers, to ensure that no errors are made and that I'm not drinking my own Kool-Aid via statistical errors.
Agreed. If you literally cannot get a statistician, that's different from the typical "ahhh, they want a piece of the grant?! We have to pay for 'number crunching'?!" or "med student, go ask who can get us p-values" (med student says there's a fee) "we don't need that anyway." But I don't think many people are actually in a situation where a statistician is unattainable. Even at schools without statistics departments, there are usually independent professional statisticians for hire in the community. Rationing research through finite funds might have averted a ton of the published trash that fails to replicate (or, hell, can't even be reproduced on the same file). This rationing would improve the quality of research, in my opinion. No longer would a PI be able to have 4 med students all working on the same "database" (really not a database, but an Excel sheet...), asking 3-5 questions each and reporting only the small p-values.

People will come back with "research shouldn't be limited by funds," and my response is that it shouldn't be if it's good research. With limited funds, PIs would have to take more responsibility for selecting questions grounded in biologically plausible mechanisms, and only occasionally look into totally obscure and possibly spurious stuff. The failure to do this is what causes some of the "flip-flopping" every few years on whether a drug "works" or not.
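To put a rough number on the "4 med students asking 3-5 questions each" scenario, here is a minimal sketch of the chance of at least one false-positive p-value. It assumes every test is independent, every null hypothesis is true, and the threshold is 0.05; none of that comes from the posts above, and the function name is just illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n_tests independent true-null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

# 4 students x 3 questions = 12 tests; 4 students x 5 questions = 20 tests.
low = familywise_error_rate(12)
high = familywise_error_rate(20)
print(f"12 tests: {low:.2f}, 20 tests: {high:.2f}")
```

Under those assumptions the chance of at least one "significant" finding by luck alone is roughly 46% to 64%, which is why reporting only the small p-values is a problem.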
 
This is actually easily done by just permuting the Y values without doing anything with the IVs/explanatory variables—if I remember correctly, your model is overfit if you permute the dependent variable values randomly but still see some indicator of “signal” or good performance (because you should see nothing special when you know the y values are unrelated to the Xs due to the random permutation). I will read later once I am home, but the title makes me think it is related to that.

Frank Harrell discusses this, too, in his course notes and his book Regression Modeling Strategies.

I guess what I was saying is you don’t have to do anything with Xs AND Y; you can just permute the Y values, and that automatically uncouples the Xs from Y. What you described is exactly what can be seen in overfitting (too many terms in the model given the number of observations), and permuting the Y values is all you need to do before rerunning the model to see overfitting. I am not sure what you mean by “smaller” independent and dependent variables.

Will read both links later today to see what the original and Wasserstein say.

Just read the stackexchange thread and see what you mean. I was referring to an easy method to see if your model is overfit to the data. You should see a big drop in performance measures (when rerunning the model) if you permute Y values randomly and if the model is not overfit.
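For anyone who wants to try the permuted-Y idea, here is a minimal sketch in plain Python (standard library only). The simulated data, the sample sizes, the use of in-sample R^2 as the performance measure, and all function names are assumptions made for illustration; this is not Harrell's exact procedure. It shuffles y, leaves X alone, refits ordinary least squares, and averages the R^2 over 50 shuffles.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def r_squared(X, y):
    """In-sample R^2 of an ordinary least squares fit with an intercept."""
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def permuted_r2(X, y, n_perm, rng):
    """Average in-sample R^2 after shuffling y; X is left untouched."""
    total = 0.0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        total += r_squared(X, y_perm)
    return total / n_perm

rng = random.Random(0)

# Too many terms: 30 observations, 20 predictors, only the first one matters.
X_big = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(30)]
y_big = [2.0 * row[0] + rng.gauss(0, 1) for row in X_big]

# Parsimonious: 300 observations, 1 real predictor.
X_small = [[rng.gauss(0, 1)] for _ in range(300)]
y_small = [2.0 * row[0] + rng.gauss(0, 1) for row in X_small]

overfit_perm = permuted_r2(X_big, y_big, 50, rng)
simple_perm = permuted_r2(X_small, y_small, 50, rng)
print(f"overfit model, permuted y: mean R^2 = {overfit_perm:.2f}")
print(f"simple model, permuted y:  mean R^2 = {simple_perm:.3f}")
```

In this toy setup the 20-predictor model should still show a sizeable R^2 after y has been shuffled, even though shuffled y has nothing to do with X, while the one-predictor model should drop to about zero. Real analyses would more likely use an optimism-corrected or cross-validated measure, so treat this only as a demonstration of the idea.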

As for the thread you showed, it’s really striking that someone would think it’s somehow valid to sort the variables independently and believe they retain meaning as a “pair”. I see what you mean by “smaller” x and y values: literally the magnitude. All clear now after seeing what you posted from stackexchange. Pretty good on that OP for smelling the obvious nonsense 😎
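A quick simulation shows why independently sorting the two variables is nonsense. This is a minimal sketch with made-up data (two unrelated standard-normal samples of 200 values each, and a hand-written Pearson helper), not the data from the stackexchange thread.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [rng.gauss(0, 1) for _ in range(200)]  # generated independently of x

r_paired = pearson(x, y)                  # original pairing kept
r_sorted = pearson(sorted(x), sorted(y))  # each variable sorted on its own

print(f"paired: r = {r_paired:.2f}")
print(f"sorted independently: r = {r_sorted:.2f}")
```

Sorting each column separately matches the smallest x with the smallest y and so on, which manufactures a near-perfect correlation out of unrelated numbers.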

I'm just flagging these posts because I haven't had a chance to reread and check whether the exact permuted-Y method I described is accurate, so read up on it if you want to use it rather than taking my word for it. I may get back to this at some point. :chicken:
 
Hi, if you want to start learning R, start with the package swirl. It helped introduce me to the basics of the R language. If you've taken a basic stats course, I recommend reading up on multivariable statistics and regression (categorical, time series, etc.) and practicing the concepts in R. I can't comment on Python, but I've been told it's just as easy to pick up, if not easier. Other stats programs like SPSS, SAS, and Matlab might be used in your department instead, but R and Python are the most accessible. Also, database management (SQL/Perl) in my opinion makes you more valuable on a team. 90% of your time is spent cleaning a dataset.
 
Can't recall if I mentioned swirl in here, so thanks for bringing that up. Yes, swirl is a good option in R. A really good option, actually, for learning what R is doing and how it works. Gain familiarity with it and its modules, and then create questions for yourself to answer (find a data set online, or in a text that UCLA stats has done the annotated output for).
 