Best resources for learning R?


ChrisMack390

I have no real commitments or obligations in the months of May and June. I have a good working knowledge of clinical research and biostatistics from past research experiences, but I do not know much about conducting my own data analysis beyond some basic Stata commands. I am interested in working on some clinical research in med school/residency, so I thought learning how to use R would be a good use of some of my time.

Can anyone recommend a good book or other resource for this purpose?

 
This is what I used, and I really liked it: Amazon product ASIN 1446200469. I had used the same text for SPSS and SAS for a class in undergrad, so I was familiar with the author's style and approach, which may make me a bit biased, but it's a very approachable source for teaching yourself. Plus, his website has a bunch of free data sets to mess around with.
 
If you want to learn from a lecture-style format, check out coursera.com. They offer free classes from several institutions on almost any subject.
 
I like using the R-bloggers website.
 
If you already know some basic Stata commands, it would probably be easier for you to learn more Stata commands in two months than to learn R from scratch.

As mentioned above, I would explore some textbooks and Coursera/edX resources to learn the basics, but beyond that, Google searches and the literature will help you with more advanced stuff.
 

I've heard that R is much more useful for large data set analysis for clinical research. Is that true?
 
Yeah, I've heard this too, and I've also heard that it's very possible to learn R in that time frame, so I'd suggest against just boning up on Stata.
 

I'm not sure that this is true.

Nearly all the clinical research papers I see published at the moment report doing the analysis in Stata or SAS so I don't think knowing R is critical. The only field I've heard of where you'd really be hosed if you couldn't use R is bioinformatics.
 
R has built-in modules that you can use. Coursera has an R course taught by a JHU prof that might be helpful. I ended up just doing the module with the tutorial built in, and it seemed pretty decent. The best way to learn this would be to have a problem in mind with a dataset you want to work on. That way, actually working on it and trying to problem-solve makes it stick better.
 

R is the standard in virtually everything except medicine. My explanation is that nobody in academics can justify spending the money on Stata/SAS. My brother is in applied mathematics with a statistics focus, and he was doing most of our statistics work in the last year. He was telling me that at the two schools he has trained at, nobody uses anything but R.

Personally, I wouldn't use anything besides R, but because of my brother, that is what I have been using for a while.
 
I feel like Python and R are making inroads in medicine as well, but we are the slowest to adopt in general.
 

I wouldn't argue that R isn't more flexible and future-proof, but I'm not aware of features that make it significantly more useful than other packages for clinical research at the moment. I'm learning R and I like it, but in all of my interactions with researchers and analysts at our hospital/research group I've never met one who uses R. That will change in the future with turnover of the current generation of people who learned in SAS/Stata, but for now I don't think that you need to be using R to be productive in clinical research.
 

I think that it is easier to get help with R than it is with any of the others because I've yet to run into anyone that uses anything else. And like you said, it is going to shift at some point. I don't think anyone training now should be focusing on anything besides R.
 

I agree that learning R is a good investment, I just don't think that it offers inherent advantages for doing clinical research vs. the others (outside of graphics probably). The statistics involved in medical student clinical research are going to be easy to do in any of the options out there.
 
I think one of the inherent advantages of using R is the cost: it is free. There is a larger group of people actively working on packages, and I feel like the documentation for the language is better as well.
 

I was speaking in terms of features--the price point can't be beat. For what most med students are publishing you don't need to be on the cutting edge of new statistical techniques, you're doing multiple regression and calling it a day. I haven't spent that much time investigating the documentation, but nothing could come close to being as bad as what you find for SAS.
 
The last institution I was at mostly used SAS, but SAS is absurdly expensive so I'm not going to even attempt to teach that to myself.
 
R is fast, efficient, easy to learn, and can be adapted to any problem (for free) through a variety of packages. On top of that, using R you can make the prettiest graphs.

R is all I've ever seen used in cancer genetics. Anything really that uses genome-wide or methylome-wide data, etc.: for all that stuff I've seen mostly R (although I did use MATLAB for methylation stuff in the past). HLA typing, immunology - we got R for that stuff too. Proteome? R. Your mom? We got R for that.

It is useful to learn R over any other statistical package because it is as powerful as you are (as a user) and will be there for you whenever you need it. Like Taco Bell, it was there for you when you were poor. Don't go using Stata or MATLAB once you get a bill or two in your pocket!

FYI here are some links to an Advanced Data Analysis class that is free from one of my old professors. He gives the pdf lecture, plus all the code so you can learn R along with him - all the while learning the ideas of statistics.

ADA1 | StatAcumen.com
ADA2 | StatAcumen.com

If you make it through the entire Advanced Data Analysis 1 course on his website, you will officially be competent. If you get through ADA 2, you will be a stat god.

 
Again, I'm not really disagreeing with any of what's being argued in this thread as far as the relative benefits of R. My reply was to question the idea that R has features that make it clearly better than other software for clinical research; while there are good points about it being free and making nice figures, I've not read anything in this thread that changes my initial response. Most med students are doing descriptive statistics and some basic modeling, not analyzing high volumes of epigenetics data in Bioconductor.
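To give a sense of how small "descriptive statistics and some basic modeling" can be, here is a minimal sketch in plain Python (used only as a neutral illustration; the same few steps exist in R, Stata, SAS, and SPSS). The patient values and the helper names `describe` and `simple_ols` are made up for this example and are not from anyone's actual study.

```python
import statistics

# Hypothetical data: age (years) and systolic blood pressure (mmHg) for 6 patients.
ages = [34, 45, 52, 61, 68, 75]
sbp = [118, 124, 130, 136, 142, 148]

def describe(values):
    """Return the mean, median, and sample standard deviation of a list."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
    }

def simple_ols(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mean_x = statistics.mean(x)
    mean_y = statistics.mean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

print(describe(sbp))
slope, intercept = simple_ols(ages, sbp)
print(f"SBP changes by about {slope:.2f} mmHg per year of age in this toy sample")
```

A real analysis would add confidence intervals, covariates, and model checks, which is where a dedicated statistics package earns its keep.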
 

But R still covers descriptive statistics and basic modeling right? Are the other statistical packages really worth the investment when R covers everything needed for free?
 
No, if you are starting from scratch and can do any of the three, it makes sense to just do R. Then you're not dependent on institutional access to SAS or Stata (though Stata does offer individual licenses at student discounts, it's still way more expensive than free).

You can do everything you need to do in pretty much any of these programs.
 
but.... prettier graphs...
 
Late to the party, but I have some experience in R, Stata, SPSS, and SAS as I'm graduating with my MPH in a month.

The medical world overwhelmingly uses SAS and SPSS, so if you're hoping to piggyback on others' work (which you will be!), then go for one of those. SAS University Edition is totally free for life and is still what I use, despite having access to "the real thing" on school computers. The Little SAS Book is a great resource to self-teach. SPSS is easy as heck to learn too: point-and-click toolbars. R is attractive because it's free and very multifunctional, but it hasn't gained much ground with clinicians and public health folks.
 

I'm curious why the medical world is hesitant to use R, especially since R is just as good as (if not better than) SAS/SPSS... and it's free. I just think the cost factor is a major unnecessary downside to choosing the alternatives, so I have no idea why the medical world prefers to use the expensive options. Is it convenience? Easier to learn?
 
Because it's always been that way.
 
It doesn't make sense. It looks like learning R could be a downside just because clinicians/researchers are somehow more comfortable with expensive options. And apparently, the non-medical world really likes using R.
The point is to be able to work on large datasets in an efficient manner. R does that. The expertise of some clinicians and researchers may end at Excel, but that doesn't mean you should limit yourself to Excel for data analysis.
 

Right I agree but if the medical world by and large uses anything but R, wouldn't learning R actually be a detriment?
 
I don't think people are actually writing code for analysis in groups. Usually someone gets tasked with doing the analysis, and they use the tool of their choice to do so. It would be a detriment if there was no community using R, but these days you can find help troubleshooting your code on the internet. I would still opt for R, since you can be the wizard of the database and people can ask you for collaboration on larger datasets that the other programs may be a pain to manage. But take my advice with a grain of salt, since I am going the Python/R route myself.
 
This is true. But I don't think you'll ever run into a sample size that SAS won't be able to handle! My own MPH thesis (using SAS) required sifting through millions of observations.

Since when do we expect the medical world to choose the most cost-effective option? Usually the institution pays for the software, in any case.
Just an aside, even though I already mentioned it, but since few people know about it, you can get SAS for free!

Yepp. My PI learned stats using SPSS, taught the post-doc stats using SPSS, who taught people like me stats using SPSS. In the end, all of these stats packages will work to answer 99% of research questions. There's no incentive to learn another whole software when you already have a grasp of one, so you just stay with the first one you learned.
 

Because efficiency. There's no need to waste resources buying a more expensive one when a cheaper one works just as well, if not better. And yeah the institution pays for it but that's still investing resources in something where there's no need.

And good to know SAS can be obtained for free. My only concern is the medical world will look down on R just because they are more familiar with other software.
 
Where do you see that R is the standard outside of medicine? I haven't looked for any breakdown on who uses what, but from my own experience, SAS is more commonly used than R at the moment in medicine and beyond. This could just be where I am from or the people I've worked with... Most people I've met who have legitimate stats degrees (PhD in statistics people) know SAS at least to some extent. Many graduate stats programs have courses specifically geared towards learning SAS. However, these programs have also started to include R courses, probably due to R getting more backing from qualified people. The big downside to R is that you need to be careful with the package you use for analysis. Look at the author and check out their credentials. There's often a difference between programs written by people with and without the proper stats and programming background. This was seen with SPSS earlier in its run, when it was primarily used and worked on by people in soft fields like psychology and sociology. There were discrepancies in output between SPSS and the other programs like SAS.

The general breakdown I've observed is that people without much of a stats background stick to the point and click interface of SPSS and almost completely neglect Minitab which is at least as good. The younger crowd is more experienced and familiar with R and Python with R currently predominating out of those two packages. Stata is also a somewhat popular program, more so than R and Python. I haven't met too many people who don't know how to use the SAS coding interface to some extent, though. I think the major draw to R is the price or lack thereof. I've personally run into some scenarios where SAS could handle what R couldn't, and the reverse situation has been far less common. Again, this is all just from what I've seen when working with statisticians, but if there were any polls or anything to show the use, I'd probably benefit from seeing that.
 

I agree that SAS is pretty unlikely to get stuck in comparison to other packages. SPSS is pretty wimpy in that regard or will take far longer for the same calculation or even an approximation (which are often subpar, like their method for estimating a Cox PH regression). The problem I have with SPSS or other point and click software is that the menu options aren't necessarily thorough or intuitive (or they don't offer it without some weird maneuvering). If you're coding, you add a few characters and get exactly what you want. I think the point and click software also gives people a false sense of security. I've often seen people thinking you need everything that's in the printout or they don't think they need anything extra because it wasn't in the printout. They don't remember that the program doesn't do the critical thinking for them. The other issue I've seen is that people attempt to learn too much from the program and believe that if it's not in the drop down menu that it doesn't exist or must be some highly technical thing (like using a generalized Fisher exact test on larger than 2x2 tables where the chi-square approximation might not be valid). Maybe it's more of an out-of-sight-out-of-mind thing, though.
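As a rough illustration of the point about larger-than-2x2 tables, the sketch below computes expected cell counts under independence and applies one common rule of thumb: be wary of the chi-square approximation when any expected count is below 1 or more than 20% of expected counts are below 5. The thresholds, the function names, and both tables are assumptions for illustration, not a substitute for judgment or for an exact test in your stats package.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_looks_shaky(table, min_expected=5, max_fraction=0.2):
    """Flag tables where the rule of thumb suggests considering an exact test."""
    cells = [e for row in expected_counts(table) for e in row]
    small = sum(1 for e in cells if e < min_expected)
    return any(e < 1 for e in cells) or small / len(cells) > max_fraction

# Hypothetical 3x3 tables (e.g. treatment arm by outcome grade).
sparse = [[8, 2, 1], [7, 3, 0], [9, 1, 2]]
dense = [[30, 25, 20], [28, 27, 22], [31, 24, 26]]

print(chi_square_looks_shaky(sparse))
print(chi_square_looks_shaky(dense))
```

The check only flags a potential problem; actually running an exact test on an r x c table is something to do in the package itself.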
 

I think you're missing some important things when you talk about efficiency in this context.

Switching from SAS to R is not as simple as just installing R on your computers. You're essentially throwing out the window decades of existing analyst software experience to save a relatively (from the standpoint of a university) small amount of money. You're going to take a significant hit in productivity while your analysts learn how to use R--it could be calculated if you knew all the inputs but my guess is it would take quite a while for the difference in software prices to cover the hours wasted as people learn how to use R. In addition to the up front financial penalty, your research is going to be held up while you wait for people to get up to speed. You could get scooped, miss deadlines, etcetera. Analysts will have a lot of stored bits of code they use and recycle to make things faster (for example, calculating comorbidity scores) so you're also throwing all of that in the trash and making them recreate those procedures in a new environment.

Further, from an institutional standpoint there's something to be said for using paid software which comes with a guarantee of technical support and data security. Also, as mentioned by @dempty when you're installing random packages for R you don't have something like the SAS Institute standing behind them to verify their accuracy and functionality.

Again I think this is likely to change over coming years as more of the data analysts currently in practice are replaced by younger graduates who have exposure to R (though even now I'd guess a majority of courses taught in public health schools are done in SAS), but I don't think it's necessarily true that institutions would benefit from mandating a switch to R for their existing employees.

As an aside on your last point, assuming you can do the stats accurately it's unlikely anyone will care what software you use to do them.
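The break-even estimate mentioned above ("it could be calculated if you knew all the inputs") is simple arithmetic once you pick numbers. Here is a minimal sketch; every input below is a made-up placeholder, not a real license price or salary figure, and it ignores the harder-to-price costs like missed deadlines and rewritten code.

```python
def months_to_break_even(analysts, hours_lost_per_analyst, hourly_cost, annual_license_savings):
    """Months until license savings cover the one-time retraining cost."""
    retraining_cost = analysts * hours_lost_per_analyst * hourly_cost
    monthly_savings = annual_license_savings / 12
    return retraining_cost / monthly_savings

# Hypothetical inputs: 10 analysts, 80 unproductive hours each while learning R,
# $50/hour loaded cost, and $20,000/year saved by dropping paid licenses.
months = months_to_break_even(10, 80, 50, 20_000)
print(f"Break-even after about {months:.0f} months")
```

Swapping in your own institution's numbers is the whole point; the answer is very sensitive to how many hours you assume are lost.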
 
Pretty spot-on post. SAS also functions as a database-manager type of product. It's not just a stats package, so a lot of institutions kill many birds with one stone when using it. The big argument I've also seen is, like you said, that SAS is a product with good market forces behind it. They can't afford to botch things badly. Open source software does have a reputation to protect, but the security and integrity of the code and software is at least perceived to be, if not actually, lower than that of SAS, SPSS, and other programs offered by businesses. The best and most sound advice I've gotten for using R is to make sure the program comes from people with a strong background in mathematics and statistics at a good school. They often have a team of people who have experience coding but also formal statistical theory and education (this is the stuff that makes sure the package works properly). The main issue with that still is that some of these people make packages for some methodology or approximation they came up with, and it might not be widely vetted beyond their paper on the topic. This may or may not be a good thing, but it's definitely better than blindly downloading the package that reddit or whatever told you to.
 

Alright thanks! Good stuff right here so appreciate it. Just wanted to make sure that learning R wouldn't be a drawback.
 
I don't think it's a drawback. Some people prefer to be really good with one software first, but it's up to you. I think it's best to learn a few. Some use more efficient methods for different techniques, but you also won't be too limited in the event that you collaborate with someone using a different software. I would still recommend checking out SAS University. As mentioned earlier, it's free, so why not. SAS is a savage for a reason.
 