MD & DO Beneficial to learn R or python before medical school?


PickleRickelus

Full Member
2+ Year Member
Joined
Jul 27, 2018
Messages
427
Reaction score
371
Title says it all. Would knowing R or python come in handy when I take on research as a med student or would it not even be necessary? Please feel free to share your own experiences doing research in medical school and what kinds of skill sets that required.

Yes it would. It obviously depends on the specific research mentor you end up contacting/what they expect from you, but from my experience being able to run stats is a huge advantage
 
yes. If you think you will be doing research, learn how to use R and do stats now because you won’t have time to devote to it once school starts

I really wish I would have done this
 
No. This is a dumb idea.

Any academic institution will have statisticians and researchers who will crunch your data for you.

Why don't you spend your time traveling or enjoying life.

Of all the things you could/should be doing before medical school, learning a programming or stats language beforehand should be second to the bottom of your list just ahead of getting a DUI or other legal troubles.
 

Yeah in my experience the stats get dumped off to a statistician.
 
you guys are some real dummies

My research really took off after I finally learned how to do stats

If you have an idea and at least a basic knowledge of stats you can look for a signal to see if there is something worth investigating. That helps you develop a focused question to ask the stats people to help you answer

Stats departments are notoriously slow. If you can do the ground work it will help speed things along because you will know what question you want answered and you will get a much more focused analysis back from them

In one of my projects, I was able to show a signal for an analysis that everybody else thought wasn’t worth the time. So we sent it off to stats and they gave us back a report confirming what I found. Now I’m writing a high-impact paper with those results and we are developing a clinical trial to test it prospectively. That never would have happened if I hadn’t taken it upon myself to learn some basic **** about how to run stats. Just thinking of how useful it will be during residency, and how I won’t have to rely on a slow ass stats dept to do a preliminary look into things for me, makes me very thankful I took the time.

If you have a few hours to learn it, do it now. It doesn’t take much time at all to learn the basics but you won’t have much time to do it when you’re in the thick of Med school.
 
Yeah in my experience the stats get dumped off to a statistician.

Disagree with this premise. I don't think you should be publishing statistical analysis if you don't understand it.

OP, I am a lowly (accepted) pre-med who has done his fair share of research. Learning a programming language will do nothing but help you in your research. Can't recommend it enough.
 
you guys are some real dummies

My research really took off after I finally learned how to do stats

If you have an idea and at least a basic knowledge of stats you can look for a signal to see if there is something worth investigating. That helps you develop a focused question to ask the stats people to help you answer

Stats departments are notoriously slow. If you can do the ground work it will help speed things along because you will know what question you want answered and you will get a much more focused analysis back from them

In one of my projects, I was able to show a signal for an analysis that everybody else thought wasn’t worth the time. So we sent it off to stats and they gave us back a report confirming what I found. Now I’m writing a high-impact paper with those results and we are developing a clinical trial to test it prospectively. That never would have happened if I hadn’t taken it upon myself to learn some basic **** about how to run stats. Just thinking of how useful it will be during residency, and how I won’t have to rely on a slow ass stats dept to do a preliminary look into things for me, makes me very thankful I took the time.

If you have a few hours to learn it, do it now. It doesn’t take much time at all to learn the basics but you won’t have much time to do it when you’re in the thick of Med school.


Your perspective is off.

I suspect you are still a medical student? The statuses on profiles are not always updated. Your post sounds like someone who hasn't completed medical school or residency.

Maybe worthwhile if you are an MD/PhD or want to pursue a career in research? even still.

If you think you are going to be doing tons of research and stats during residency you are misguided.

This is one of those ideas that sounds great, but in reality is dumb.

Often times, you will barely have time in the day to finish your clinical duties and get something to eat. You really think you are going to be plugging numbers from "all the research you are doing during residency" into R at night?

It is just a poor use of your time.

Don't try to tell this kid he should use his last remaining free time before medical school learning to program. Please...

There are a lot of things we can do in life that would be beneficial in some fashion. It doesn't mean pursuing them would be worth your time.

Why don't you learn to fix your car in case it breaks down during residency.
 
speak for yourself

I am an MS4 going into rad onc, which is the most research-heavy specialty. Research is a constant for rad onc residents even when they are not on their dedicated research time (which at many programs is a full 12 months).

Obviously this does not apply to everybody and rad onc is not your typical residency so as you suggest, maybe my perspective is not as pertinent to others.

But agree with the premise above that if you are going to be publishing, you should have an understanding of how you are reaching the conclusions

I am hugely against pre studying for medical school but this is one thing I do advocate for people who expect to be highly involved in research during medical school (aka anybody interested in highly competitive residencies that emphasize research such as derm, ent, neurosurg etc). Learning stats doesn’t take much time, do a little now and you can greatly build on it as you progress in your research.
 
You don’t need to learn python or r just for stats.

Learning pandas (if Python) and how to work with dataframes and do basic parsing/querying of raw data goes a long way toward turning a raw dataset (often a dirty Excel sheet from chart review) into a clean, easy-to-work-with CSV. A project that would have taken 40 hours suddenly takes 4.
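
To make the dataframe point concrete, here is a minimal sketch of that kind of cleanup in pandas. The sheet contents, the column names, and the `clean_chart_review` helper are all made up for illustration; a real chart-review export will need its own rules.

```python
import io

import pandas as pd

# Hypothetical chart-review export (invented data): messy headers, stray
# whitespace, inconsistent capitalisation, a duplicated row, a missing age.
RAW = """Patient ID, Age ,Sex,Procedure,LOS (days)
101, 64 ,M,open,5
102,58,f,Lap,2
102,58,f,Lap,2
103,,F,open,7
104,71,m,lap,3
"""


def clean_chart_review(raw_csv: str) -> pd.DataFrame:
    """Return a tidy copy of the raw sheet (illustrative helper, not a standard API)."""
    # Read every cell as text first so nothing is silently mis-parsed.
    df = pd.read_csv(io.StringIO(raw_csv), dtype=str, skipinitialspace=True)

    # Normalise headers: "LOS (days)" -> "los_days", " Age " -> "age".
    df.columns = (
        df.columns.str.strip()
        .str.lower()
        .str.replace(r"[^a-z0-9]+", "_", regex=True)
        .str.strip("_")
    )

    # Strip stray whitespace from every cell.
    for col in df.columns:
        df[col] = df[col].str.strip()

    # Make the categorical columns consistent.
    df["sex"] = df["sex"].str.upper()
    df["procedure"] = df["procedure"].str.lower()

    # Convert numeric columns; anything unparseable becomes NaN.
    for col in ("age", "los_days"):
        df[col] = pd.to_numeric(df[col], errors="coerce")

    # Drop exact duplicate rows (e.g. a chart entered twice).
    return df.drop_duplicates().reset_index(drop=True)


clean = clean_chart_review(RAW)
print(clean)
```

From there, `clean.to_csv("clean.csv", index=False)` would write the tidy file, and the same function can be rerun whenever the sheet gets updated.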


Having said that, I’d recommend learning this stuff via application and practice. A lot of people try to “learn programming” and get nowhere bc of how they choose to learn.


If you can get your hands on any large dataset, try to use Python or R to:
- summarize the data - who or what was studied (demographics etc.)
- pose a few interesting questions
- try to answer those questions statistically
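
Those three steps can be sketched in a few lines of plain Python. The numbers below are invented, and the permutation test is just one reasonable choice for a small two-group comparison, not the only way to answer the question.

```python
import random
import statistics

# Step 1: summarize the data. Hypothetical length-of-stay values (days)
# for two procedures; these numbers are made up for illustration.
los_open = [5, 7, 6, 8, 5, 9, 7, 6]
los_lap = [2, 3, 4, 3, 2, 5, 3, 4]


def summarize(values):
    """Basic descriptive statistics for one group."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }


# Step 2: pose a question. Here: is mean length of stay different
# between the two procedures?

# Step 3: try to answer it statistically.
def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffle group labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[: len(a)]) - statistics.mean(pooled[len(a):])
        )
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


print("open:", summarize(los_open))
print("lap: ", summarize(los_lap))
print("p-value:", permutation_p_value(los_open, los_lap))
```

A small p-value here only says the difference is unlikely under random relabeling; it says nothing about confounding or how the patients were selected, which is where the study design questions come in.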
 
One anecdote about programming in general:

Two of my pre-med friends once spent a whole summer parsing through a database of outcomes by hand, trying to establish statistical significance of different procedures. They literally scrolled through a spreadsheet and moved the data around by copying and pasting.

This is something I could have done in less than an hour in python/R/MATLAB.
 

That's also a terrible way of doing statistics and won't lead to a great paper. You're very likely to have multiple tests run positive for statistical significance like that just by chance. Knowing statistical software can be very helpful, but I think it's also important to understand the tests and the assumptions behind each test, if interested in publishing good research. Otherwise, it's best to leave it to the statistician.
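
The point about tests coming up positive just by chance can be checked with a short simulation. This is a sketch under a simplifying assumption: the tests are independent and every null hypothesis is true, so each p-value is uniform on [0, 1]. The function names are mine, and Bonferroni is shown only as the simplest correction.

```python
import random


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across n independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def simulate_false_positive_rate(threshold, n_tests, n_trials=10_000, seed=0):
    """Fraction of simulated 'studies' with at least one p-value below the
    threshold. Under a true null a p-value is uniform on [0, 1], so
    random() stands in for running one test."""
    rng = random.Random(seed)
    studies_with_hit = 0
    for _ in range(n_trials):
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < threshold:
            studies_with_hit += 1
    return studies_with_hit / n_trials


alpha, n_tests = 0.05, 20
print("theoretical:", familywise_error_rate(alpha, n_tests))
print("simulated:  ", simulate_false_positive_rate(alpha, n_tests))
print(
    "with Bonferroni:",
    simulate_false_positive_rate(bonferroni_threshold(alpha, n_tests), n_tests),
)
```

With 20 unplanned comparisons at the usual 0.05 cutoff, roughly 64% of such studies would report at least one "significant" result even though nothing is going on, which is why the tests and their assumptions need to be decided before trawling the spreadsheet.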
 
I'm sorry, using code is a terrible way of doing statistics? You misunderstood me. I use code to implement statistical techniques that I understand. I don't think that researchers should blindly apply statistical methods without understanding them on a fundamental level. I'm also skeptical of the idea that you should pawn statistical analysis off on statisticians. If that's the status quo in medical research, then that is a poor status quo.

Was referring to this part of your quote:
"Two of my pre-med friends once spent a whole summer parsing through a database of outcomes by hand, trying to establish statistical significance of different procedures. They literally scrolled through a spreadsheet and moved the data around by copying and pasting."
 

My apologies! Agree.
 
As a scientist (MD/PhD), I would never trust my stats to a med student anyway. So from that perspective I think it would be wasted time. Learn it if and when you have to.
 

Well in cases where the med student is leading the project from the ground up to completion, having stats knowledge would be helpful in understanding the results and forming meaningful conclusions. But in any case, having a statistician formally carrying out statistical analyses would be a lot more helpful.
 

As a scientist, I would never let a med student lead a project from the ground up without having formal research training first - otherwise it will be a wild goose chase wasting everyone’s time.
 
Short and sweet, as many have said, yes this is an incredibly valuable skill and knowledge base. Once you start learning you will see that getting the "results" is the simplest part (clicking buttons or running a script in R). The hard work comes before that. Anyone on here telling you it's a waste is, and I would put money on it, overconfident in his or her abilities, by far, and probably doesn't even know the most basic concepts in statistics (yet they strongly believe they do). People saying "just kick it to the stats people" clearly have zero understanding of the very thing their whole discussion section hinges upon. The time to involve the statistics team isn't once you've collected the data-- it's before you've solidified the research question. Again, if you develop an understanding of statistics you'll see how many studies are terrible because the MD or MD/PhD thinks the stats is just number crunching after data collection- so they collect the data in an inefficient and sometimes unworkable manner.

Learn as much about statistics as possible before school and build on it during school as you come to new projects. If you're a physician in modern medicine, you're inadvertently (or purposefully) committing yourself to being a consumer or producer of research. If you don't want to be involved in production, consumption of research for patient care requires an important element of understanding as well.

Medscape: Medscape Access

"A mistake in the operating room can threaten the life of one patient; a mistake in statistical analysis or interpretation can lead to hundreds of early deaths. So it is perhaps odd that, while we allow a doctor to conduct surgery only after years of training, we give SPSS® (SPSS, Chicago, IL) to almost anyone. Moreover, whilst only a surgeon would comment on surgical technique, it seems that anybody, regardless of statistical training, feels confident about commenting on statistical data."

This couldn't be more true, and the stereotype is on display in a few posts in this thread...
 
As a scientist, I would never let a med student lead a project from the ground up without having formal research training first - otherwise it will be a wild goose chase wasting everyone’s time.
To be fair, there are many MDs and MD/PhD people who fall into the same category, yet they lead research every day. People think "research" background from basic science without statistics makes them qualified to act as a statistician on their own project. Even worse is the "I've published X articles, so I know research"...Some of these docs are very good, recognizing they don't know statistics just because they took a "methods" class or two in the PhD coursework or because they have published without a statistician. They're the good ones who know that good research involves specialization in each aspect of clinical and statistical knowledge. Fortunately, I've had the pleasure of working with people who want things done right rather than fluffing their ego with the "I can do it all" attitude.
 
As a scientist (MD/PhD), I would never trust my stats to a med student anyway. So from that perspective I think it would be wasted time. Learn it if and when you have to.
To say that a practicing physician doesn't need to know some undergraduate statistics is to say they don't need to know how to read and write. Trusting others to interpret research for you is a flawed approach to consuming research, and it can affect patient care. To say you won't have to use it also neglects how much most physicians use it in day-to-day practice simply by reading a few journal articles. Physicians are used to being the most knowledgeable and hardest working in many respects, but they don't seem to want to take on the challenge of properly arming themselves to better use or create the research that pushes the limits in the field...(Dunning-Kruger is incredibly strong in the MD, MD/PhD, MD/MPH cohort when it comes to applying statistical ideas to research).
 
To say that a practicing physician doesn't need to know some undergraduate statistics is to say they don't need to know how to read and write. Trusting others to interpret research for you is a flawed approach to consuming research, and it can affect patient care. To say you won't have to use it also neglects how much most physicians use it in day-to-day practice simply by reading a few journal articles. Physicians are used to being the most knowledgeable and hardest working in many respects, but they don't seem to want to take on the challenge of properly arming themselves to better use or create the research that pushes the limits in the field...(Dunning-Kruger is incredibly strong in the MD, MD/PhD, MD/MPH cohort when it comes to applying statistical ideas to research).


Quoted just one of your replies, but in reference to both I don’t know that I disagree with you, but I also don’t know if your replies are really replies to what I said as they address different things.

Bottom line, an MD is not a research degree and doctors aren’t trained to do research. (Yes some doctors do blah blah blah who cares - they are the exception to 99% of physicians). So if a med student approached me to say “hey I want to do X project all by myself, watch and see what happens”, and they have no evidence of research training that applies to what they are proposing (i.e., no clinical research background for a clinical project, no epi background for an epi project, no basic science for a lab project etc), I would have to be an idiot to reply “Ok! Use my grant funds and/or my time to do everything yourself because I think you’ll do just fine!” Nope. I tell med students exactly what the project is, exactly what their specific role is, and exactly what I expect them to produce and how. That’s how I guarantee I get what I need. Further to that, if a med student out of the blue was like “hey I learned R, I’ll do all your analyses for you.” I’d again be stupid to say “Awesome! Publish away!” Again, no chance. I’ll take the data and either analyze it myself or ask for an expert’s assistance depending on what the question is, because at the end of the day it’s my ass on the line for research integrity, not the med student’s.

So no, I do not think randomly learning how to use R or Python is helpful as a med student, and no I will not be asking a med student to run one of my projects on their own. This in no way takes away from the fact that a med student should understand basic stats and basic research principles - they should have an idea of what the stats are trying to do, and what general types of stats could apply to certain scenarios. But they absolutely don’t need to be an expert and don’t need to be the one doing the programming.
 
To be fair, there are many MDs and MD/PhD people who fall into the same category, yet they lead research every day. People think "research" background from basic science without statistics makes them qualified to act as a statistician on their own project. Even worse is the "I've published X articles, so I know research"...Some of these docs are very good, recognizing they don't know statistics just because they took a "methods" class or two in the PhD coursework or because they have published without a statistician. They're the good ones who know that good research involves specialization in each aspect of clinical and statistical knowledge. Fortunately, I've had the pleasure of working with people who want things done right rather than fluffing their ego with the "I can do it all" attitude.
How would you suggest I go about learning research-pertinent stats in 2 months? As an MS4 I've got a lot of time on my hands and wanted to equip myself with the necessary tools to be able to do efficient research during residency.
 
Quoted just one of your replies, but in reference to both I don’t know that I disagree with you, but I also don’t know if your replies are really replies to what I said as they address different things.
They are direct replies. You mention being an "MD/PhD scientist" as if that necessarily makes someone more proficient at statistics than a medical student, with the further implication that the "formal training" of an MD/PhD keeps people from going on a goose chase. Clinically that may be true, but once stats are involved it often isn't, because the "formally trained" person frequently doesn't know more statistics than a med student, and the statistics are a huge part of the study. The other reply was similar: you said you'd never trust your stats to a med student, then went on to say learn it if and when it's needed, which implies not everyone needs it. I firmly disagree with that; as I mentioned, every physician needs it.

Bottom line, an MD is not a research degree and doctors aren’t trained to do research. (Yes some doctors do blah blah blah who cares - they are the exception to 99% of physicians).
But let's make sure we don't conflate an MD PhD, MD MPH or something as adequately rounding out the research skills of that person. The PhD is almost always basic science and has nothing to do with stats (even in the case of a "methods" class or two, which when called "methods" is usually trash). The MPH isn't a stats program, either. So they may be expert in some of these things, but still likely inadequate on average at the stats. These are common misconceptions I see held, almost always by those with those degrees. Again, the stats part is important because the rest of the paper can be beautiful from a clinical or basic science perspective, but the stats often break a paper either with misapplication or misinterpretation and this squeezes by many top journals depending on whether an actual statistician reviewed the paper.

So if a med student approached me to say “hey I want to do X project all by myself, watch and see what happens”, and they have no evidence of research training that applies to what they are proposing (i.e., no clinical research background for a clinical project, no epi background for an epi project, no basic science for a lab project etc), I would have to be an idiot to reply “Ok! Use my grant funds and/or my time to do everything yourself because I think you’ll do just fine!” Nope.
Agreed, but let's recognize that clinical research and epi largely involve methodologies that are statistical, just applied in a clinical or epidemiological framework, so the key is learning stats, not these peripheral fields, which are easier to pick up properly after you learn statistics. Learning and practicing good statistics is like medicine: it takes a while, and then it reveals how much danger there is in what you don't know, so you need to know when to seek help (as you said).

I tell med students exactly what the project is, exactly what their specific role is, and exactly what I expect for them to produce and how. That’s how I guarantee I get what I need.
That's totally fair for your projects. I completely agree. If a student has an idea, though, I'll certainly listen and ask for a proposal, basically, and have the right experts on board if it ain't me (assuming it goes through what I am comfortable with).

Further to that, if a med student out of the blue was like “hey I learned R, I’ll do all your analyses for you.” I’d again be stupid to say “Awesome! Publish away!” Again, no chance.
I agree you shouldn't give them free rein, but why not find out what they know? What if they know at least as much statistics as you do?

I’ll take the data and either analyze it myself or ask for an expert’s assistance depending on what the question is, because at the end of the day it’s my ass on the line for research integrity, not the med student’s.
So you automatically assume the student knows less than you do in an area for which you have probably the same level of education (statistics)? How often do you get someone with an MS or PhD in statistics or biostatistics?

So no, I do not think randomly learning how to use R or Python is helpful as a med student, and no I will not be asking a med student to run one of my projects on their own.
Right, but these skills don't develop overnight or in a week, contrary to what many medical people think. So saving all the learning for clinical medicine or residency is foolish because your need and time horizon at that point will definitely keep you from learning it effectively. Also, good mentors know how to take good suggestions from students/residents and toss out the bad, so a student who learns can often positively contribute toward a project.

This in no way takes away from the fact that a med student should understand basic stats and basic research principles - they should have an idea of what the stats are trying to do, and what general types of stats could apply to certain scenarios. But they absolutely don’t need to be an expert and don’t need to be the one doing the programming.
The sad fact is, most med students, residents, and attendings don't know or understand basic statistics. And I agree they don't absolutely need to be an expert (read as PhD Statistics) nor doing programming but those things will improve their career opportunities (I know people who this has happened to) because most people in medicine aren't good at this stuff, at all.
 
How would you suggest I go about learning research-pertinent stats in 2 months? As an MS4 I've got a lot of time on my hands and wanted to equip myself with the necessary tools to be able to do efficient research during residency.
I think you can make a lot of headway there if you want to, especially given the breakneck pace that you've become accustomed to in med school. If you want, we can PM so I can ask more about your background and goals and all of that to give some more pointed advice. Shoot me a message if you're interested!
 

I'd trust you by your username alone.
 

I’m not going to respond to all of this, as it’s a difference of opinion, but when I say I’m a scientist, it means I have experience in the scientific process - stats included. You can read as much as you want about knee replacements, but no one’s going to trust you to do them alone until you’ve got the credentials and the experience under your belt. Same for doing research solo. Read about stats, by all means. But investing time in learning computer programming is not going to be high yield for the vast majority of med students, when 99% of them will go into clinical practice and never touch research again in their life, and the other 1% will make time to learn it themselves or work with collaborators who know how to do it.

As you said, doing everything yourself is inefficient.
 
I would vote that learning R is a big but smart investment. Many projects only require entry-level stats work that R can handle seamlessly. On many of my papers/pubs I was the “numbers guy” doing the basic stats. Knowing at least the basics becomes sort of a selling point for many projects.

I definitely think learning R is worth it, especially since R makes the prettiest graphs. Papers/PIs love pretty graphs. ggplot2 is my homeboy.
 
Does anyone have a good website/book/something that teaches R and the basics of stats that would be most helpful for the type of research for med school?
 
Thanks for all of the responses everyone! I didn't expect there to be such a polarizing reaction. I'm going to check out the free introduction to R and python on data camp and then purchase an additional month or two if I feel it's effective.

Learn Python for Data Science - Online Course
Introduction to R Online Course

Unless anyone has other suggestions on how to proceed?
 
There's lots of free stuff; you should be able to get what you need without paying. I'd probably do more epidemiology-type R coursework over straight data science, as it will likely be more directly applicable to what you might use it for. Data science spans everything from business analytics to bioinformatics, and a lot of the courses are geared towards the business analytics side. You can probably just start with R for now.

Coursera and edX have free options for both Python and R. Coursera has one by Johns Hopkins biostats profs, as well as a data science specialization if you really fall in love with it. You only pay if you want a certificate at the end.

Penn State gives free access to a lot of stats course stuff including intro to R.
Lesson 1: Getting Started: Basic R | STAT 484

Plus http://www.columbia.edu/~cjd11/charles_dimaggio/DIRE/resources/R/intro.pdf

Applied Epidemiology with R (course) | Tomás J. Aragón
 
Not trying to crap on anyone else, but the one thing you need to be careful of is learning coding from "data science" or "big data" people, because they often don't understand the stats as well as they know the coding. Below I give strong resources for both the programming and the stats. After you look around their websites a bit, you'll realize how much they have (UCLA and Frank Harrell at Vanderbilt). Along with data sets to download, they have output where they walk through the syntax and the meaning of what you're seeing, and it's pretty thorough and statistically accurate (UCLA especially, because it's done by actual stats people). So be wary of the statistical accuracy of a lot of the "learn R" material online.

UCLA stats consulting (they cover other software too, so you can learn about topics even if this site doesn't show how to do them in R directly)
R

I haven't used these, but they come highly recommended for some stats concepts/material (the first one has some YouTube videos for R)
harvardx

Online Statistics Education: A Free Resource for Introductory Statistics

This next one is from the founding chairman of biostats at Vandy, who is a statistical wizard and works with clinicians all the time. I would look at this first if nothing else, as he teaches R and effective statistical practice, although he assumes you know some basics. The first link is a condensed outline of his actual textbook and is worth looking at; the second link is part of his Vandy website, with more material, the R code, and tons of great resources.
http://hbiostat.org/doc/bbr.pdf
biostat.mc.vanderbilt.edu/wiki/Main/ClinStat

The UCLA link and the last two links I posted are legitimately some of the best, if not the best, materials you could ask for (free or paid). Online statbook has good basics and some cool simulations for understanding theory (e.g. what it means to be "95% confident"... hint, it's not that "there's a 95% probability...", and you're way off if you have that understanding).
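To make the "95% confident" point concrete, here is a minimal simulation sketch in Python (standard library only). Everything in it is an illustrative assumption: the true mean, the standard deviation, the sample size, and the helper name `ci_coverage` are made up, and sigma is treated as known so a plain 1.96 multiplier can be used. The idea it tries to show is that "95%" describes how often the interval procedure captures the true value across many repeated studies, not the probability that any one computed interval contains it.

```python
import random
import statistics


def ci_coverage(true_mean=10.0, sigma=2.0, n=25, n_studies=2000, seed=1):
    """Return the fraction of simulated studies whose 95% interval covers true_mean."""
    rng = random.Random(seed)
    # 1.96 is the normal critical value; sigma is assumed known to keep the sketch simple.
    half_width = 1.96 * sigma / n ** 0.5
    hits = 0
    for _ in range(n_studies):
        # One simulated "study": draw n observations and compute the sample mean.
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        mean = statistics.fmean(sample)
        if mean - half_width <= true_mean <= mean + half_width:
            hits += 1
    return hits / n_studies


coverage = ci_coverage()
print(f"Share of intervals covering the true mean: {coverage:.3f}")
```

Each individual interval either contains the true mean or it doesn't; the long-run share of intervals that do should land near 0.95.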

Anything Frank Harrell Jr (the Vandy guy) touches is relevant for you as a medical person, because his career has been biostats and consulting as a brilliant PhD statistician. His course materials above come from teaching medical people, and his Twitter frequently calls out top journals for stupid stats mistakes they let through, even in big trials (Frank Harrell (@f2harrell) | Twitter). He does post political stuff occasionally, but his stats stuff is pure gold and shows you how easily the top journal editors screw it up.
 
Personally, I would avoid data science/big data and epi/public health courses for stats knowledge (for coding they would be okay), because these aren't usually taught by PhD statisticians. The Coursera stuff with the Hopkins biostats people should be good since they're legit statisticians. I think Brian Caffo does some of their teaching, and he's had some important papers in theoretical and applied stats, if I recall.

Penn State is fantastic for free applied stats material since it's usually part of their master's in applied stats program, although I didn't know they taught R in it, because I used to only see Minitab (which is good for learning the material and what things mean, but not how to do it in a particular program of your choice). Good add for PSU.
 
Fair enough. We'll have to agree to disagree on the importance of statistical literacy for physicians. However, I agree that the benefit isn't in spending your time learning programming; it's in learning statistics correctly, which requires enough programming skill to practice applying the theory and applied knowledge you read about. If you're doing stats in SPSS or something else that's point and click, it's likely not very good, and you probably don't know as much as you think. This isn't my opinion but a really simple fact, because SPSS and other point-and-click programs use, for example, algorithms that are not recommended, or don't give you access to a full "work up" for a statistical question. They make it much harder to check and verify assumptions for a particular test or model, which is like taking tires away from someone trying to ride a bike. The programming interfaces make it much easier to get what you actually need once you have some experience with the syntax. So my emphasis is on learning stats with the programming to facilitate learning, and that will make you a big asset.

For me, I'd prefer to learn fully how students can contribute, but I'm also someone who thinks each department should have statisticians with at least a master's in stats/biostats on retainer for research projects, so everyone has access to a real expert regardless of the project. The onus is on the PI not to contribute trash to the literature, no matter how big or small the study is, because it has the potential to impact patient care. Students, residents, and attendings with knowledge of stats make it much easier to communicate the goals and needs of a project to the statistician while understanding what the statistician is telling them.

Again, to quote Andrew Vickers: "A mistake in the operating room can threaten the life of one patient; a mistake in statistical analysis or interpretation can lead to hundreds of early deaths. So it is perhaps odd that, while we allow a doctor to conduct surgery only after years of training, we give SPSS® (SPSS, Chicago, IL) to almost anyone. Moreover, whilst only a surgeon would comment on surgical technique, it seems that anybody, regardless of statistical training, feels confident about commenting on statistical data."

Being "capable" or "good" at statistics is far from clicking buttons in SPSS or having publications and having taken some "methods" classes at one time. People complain about statisticians always giving hedged or "it depends" answers, but this fits squarely with the scientific method of healthy skepticism about what you do: always asking "What did we do wrong? What disagrees with me?" rather than "Here is how we got it right."
 
Couldn't agree more with not relying on "big data" resources before understanding the fundamentals. That's how this sort of thing happens:

What happens if the explanatory and response variables are sorted independently before regression?
 
Agree with getting it from biostats preferably, though there are some Epi PhD people who have pretty good mastery. My point was mainly to avoid taking a business analytics type course.

Actually, most of the R courses I’ve found with a public health or Epi focus are taught by biostats PhDs or team taught by biostats/Epi.

Penn has great stats resources. They’ve got pretty good SAS content as well, which is how I discovered it.
 
This is actually easily done by just permuting the Y values without doing anything to the IVs/explanatory variables. If I remember correctly, your model is overfit if you permute the dependent variable values randomly but still see some indicator of “signal” or good performance (because you should see nothing special when you know the Y values are unrelated to the Xs due to the random permutation). I will read it later once I am home, but the title makes me think it is related to that.

Frank Harrell discusses this, too, in his course notes and his book Regression Modeling Strategies.

Edit: I mis-remembered the application here of permuting Y. It's not used to evaluate a model's performance. Rather, you can scramble a data set and build a model/select variables to show how overfitting can arise and make a nonsense model look good (more for studying overfitting rather than any kind of applied "validation").
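A rough Python sketch (standard library only) of the kind of exercise described in that edit: make every variable pure noise, scramble the outcome, then "select" the single best-looking predictor. The sample size, the number of candidate predictors, and the helper names `pearson_r` and `best_noise_predictor` are hypothetical choices for illustration, and picking the largest absolute correlation stands in for whatever variable selection procedure you would really use.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def best_noise_predictor(n_subjects=30, n_candidates=200, seed=7):
    """Largest |r| between a scrambled outcome and many unrelated noise predictors."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_subjects)]
    # Scramble the outcome. It is already noise here; with real data this
    # shuffle is the step that destroys any true link to the predictors.
    rng.shuffle(outcome)
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_subjects)]
        best = max(best, abs(pearson_r(candidate, outcome)))
    return best


best_abs_r = best_noise_predictor()
print(f"Best |r| among 200 pure-noise predictors: {best_abs_r:.2f}")
```

Nothing here is related to anything else, yet the "winning" predictor tends to show a sizeable correlation simply because it was chosen after looking at many candidates. That is the overfitting the exercise is meant to expose.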
 
SAS is also how I found the UCLA site, and I saw there is some on Penn. And yeah, the courses on Coursera have a lot of real biostatisticians on board!
 
It's actually simpler than that... Stupidly simple, in fact. The author's boss is suggesting they independently sort X and Y and then run the regression, which obviously uncouples the data and will give you something with little statistical relevance. However, the model appears to be extremely powerful and fits the data quite well because, well... the smaller values of the independent variable are now paired with the smaller values of the dependent variable.

Fun fact, under certain conditions (the function is nondecreasing), you can recover the regression with uncoupled, sorted X and Y, but the error converges very slowly:
Uncoupled isotonic regression via minimum Wasserstein deconvolution
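A quick way to see this for yourself is a small Python sketch (standard library only). The data are made up: `x` and `y` are generated independently, so the honest answer is "no relationship", and `pearson_r` is just a hand-rolled helper so nothing needs installing.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(42)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]  # generated independently of x

# Original pairing: each (x, y) pair belongs to the same "subject".
r_paired = pearson_r(x, y)

# Sorting each column on its own throws away the pairing entirely:
# the smallest x is now matched with the smallest y, and so on.
r_sorted = pearson_r(sorted(x), sorted(y))

print(f"r with original pairing:      {r_paired:.2f}")
print(f"r after independent sorting:  {r_sorted:.2f}")
```

The paired correlation should sit near zero while the sorted one comes out close to 1, even though the two variables have nothing to do with each other.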
 
Title says it all. Would knowing R or python come in handy when I take on research as a med student or would it not even be necessary? Please feel free to share your own experiences doing research in medical school and what kinds of skill sets that required.

Yes, this could definitely be helpful- and once you learn one it’s fairly easy to start the second. The easiest way to learn imo, especially if you have prior coding experience, is to get copies of others’ codes and start playing around with them to figure out what different functions do.

Source: am MS4, did research MS1 summer that basically consisted of me doing data analysis & graphical representation in R. Though it’s worth noting that I learned R on the job, I had prior experience with Matlab & Java.
 
See above.
Edited.
 
I think an even better way is to create problems for yourself to solve with a data set, so you’re forced to look up the syntax, write it, troubleshoot it (a lot at first, usually), and then redo it for another problem. Looking at a more experienced coder’s work can also help show efficiency and good structure for coding.
 
Just read the stackexchange thread and see what you mean. I was referring to an easy method to study how overfitting can make a nonsense model look good by scrambling data and then using some kind of variable selection/model building on totally unrelated variables.

As for the thread you showed, it’s really striking that someone would think it’s somehow valid to independently sort the variables and believe they retain meaning as a “pair”. I see what you mean by “smaller” x and y values: literally the magnitude. All clear now after seeing what you posted from stackexchange. Pretty good on that OP for smelling the obvious nonsense 😎

Edited.
 
Now that I look at this not on my phone, I realize this isn't really a "big data" or "data science" problem as much as it is just stupidity. The problems in BD/DS are when statisticians don't teach it because you have the lean six sigma types teaching what buttons to click for a logistic regression and misinterpreting all of the crap along the way.
The problem described in the SE thread is pretty straightforward (like you said) once I recognized they were just saying each variable is rank ordered, rather than randomly permuted (shuffled).

Having only really learned from statisticians, I was surprised that someone would think that was a good idea (like the OP's boss)...
 
Yes, it would be helpful. It would make you invaluable to research mentors because many don’t know how to run stats, and in my experience relying on a statistician is the rate-limiting step to publishing papers. I learned stats in fellowship, and as a result our division has cranked out papers and abstracts at a clip of 8-10 retrospective and 1-2 prospective studies per year. That being said, it is not completely necessary, and I think there may be more value in enjoying your life before med school instead of sitting in front of a computer screen.
 
You can be very valuable because statisticians are at a hub point with all consultees coming to them like spokes on a wheel.

I will caution that “running stats” isn’t the goal. The goal is to understand statistics and be able to think statistically about a research question from before the formal question is fully defined. That includes choosing the best statistic to help answer the question, knowing how the data are generated and the snafus common to that kind of generation and measurement, creating an adequate sampling plan and study design based on the question, having data collected in an appropriate manner, and doing the cleaning and investigatory work needed to understand how to handle missing data. All of that (and more) comes before “running the stats,” by which I assume you literally mean telling a program to give you the output you want. After that there’s quality control: kicking the tires and looking under the hood to make sure what you think you have is in fact what you have and need. The final part is interpreting it.

This is also a helpful document that everyone should read before thinking about collecting data:
https://amstat.tandfonline.com/doi/abs/10.1080/00031305.2017.1375989#.XHU86KROmEc

Read it in its entirety and you’ll likely see how poorly you or your colleagues (not specific to the poster I quoted, but in general) approach data collection, and how to improve your life. Any med students who work with me will read this because it makes things easier and improves data integrity while speeding the time from collection to analysis. Plain and simple, your spreadsheet will look boring and simple. The more you understand statistics as a discipline, the more you will see and appreciate data organization like this.
 
I would say that this is not necessary but certainly can't hurt. If you have the time and the interest, do it. If you are pressed for time or aren't really that interested, don't bother.
 
I think learning it makes it a lot more likely you will be better than average in your career and job overall in a well rounded sense (because I see numerical literacy and data literacy as a necessary component of the modern physician, and a lot of patients do, too). If you don’t do it, it’s a lot harder to not be average in that sense. It’s a skill that most physicians don’t have, even when they think they do, and it’s becoming even more necessary as medicine moves forward, in my opinion. The dangerous part is those who think they have it and aren’t cautious of what they don’t know (much like the rest of medical knowledge).
 
I found this lecture (his R lectures are good too).



Haven't watched the whole thing, but his other items I've seen are good and this is geared for medical students. He might be a little loose with the language at certain points, but he's far better than most stats for medical people presenters (and he has a stats degree, so not very surprising).
 
Short and sweet: as many have said, yes, this is an incredibly valuable skill and knowledge base. Once you start learning you will see that getting the "results" is the simplest part (clicking buttons or running a script in R). The hard work comes before that. Anyone on here telling you it's a waste is, and I would put money on it, overconfident in his or her abilities by far, and probably doesn't even know the most basic concepts in statistics (yet strongly believes they do). People saying "just kick it to the stats people" clearly have zero understanding of the very thing their whole discussion section hinges upon. The time to involve the statistics team isn't once you've collected the data; it's before you've solidified the research question. Again, if you develop an understanding of statistics you'll see how many studies are terrible because the MD or MD/PhD thinks the stats is just number crunching after data collection, so they collect the data in an inefficient and sometimes unworkable manner.

Learn as much about statistics as possible before school and build on it during school as you come to new projects. If you're a physician in modern medicine, you're inadvertently (or purposefully) committing yourself to being a consumer or producer of research. If you don't want to be involved in production, consumption of research for patient care requires an important element of understanding as well.

Medscape: Medscape Access

"A mistake in the operating room can threaten the life of one patient; a mistake in statistical analysis or interpretation can lead to hundreds of early deaths. So it is perhaps odd that, while we allow a doctor to conduct surgery only after years of training, we give SPSS® (SPSS, Chicago, IL) to almost anyone. Moreover, whilst only a surgeon would comment on surgical technique, it seems that anybody, regardless of statistical training, feels confident about commenting on statistical data."

This couldn't be more true, and this stereotype is evident by a few in this thread...
This is actually why I prefer to have an actual statistician look at my data. No matter how much I know, it is far better to have someone whose actual specialization in life is statistics working with my data, to ensure that no errors are made and I'm not drinking my own Kool-Aid via statistical errors.
 