"Landmark" cancer studies not really reproducible


scarbrtj
Only about 15-40% of the findings from cancer studies are reproducible (https://www.statnews.com/2017/01/18/replication-cancer-studies/). I noticed someone had just posted about "practice changing" RTOG 9601, wherein:

The median follow-up among the surviving patients was 13 years. The actuarial rate of overall survival at 12 years was 76.3% in the bicalutamide group, as compared with 71.3% in the placebo group (hazard ratio for death, 0.77; 95% confidence interval, 0.59 to 0.99; P=0.04).

Reproducibility is highly correlated with the p-value. Studies with p-values in the ~0.03-0.05 range are only reproducible about 50-60% of the time, at best. Studies with p-values <0.001 are much more highly reproducible (http://rsos.royalsocietypublishing.org/content/1/3/140216).

Just food for thought. I look mildly askance at all these "practice changing" studies with HRs that miss unity by the skin of their teeth or p-values which exactly equal 0.04 (0.045 would round to 0.05 and make you too skeptical!). Your mileage may vary.
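To make the replication numbers concrete, here is a back-of-the-envelope sketch of my own (not from either paper). It assumes the replication uses an identical design and that the true effect is exactly the one the original study happened to observe; real replication rates tend to be lower still, because marginal results are often inflated by selection.

```python
# Back-of-the-envelope check of the replication claim, assuming the true
# effect equals the one the original study observed and the replication
# uses an identical design (both strong simplifications).
from scipy.stats import norm

def replication_chance(p_two_sided, alpha=0.05):
    """Chance an identical replication reaches p < alpha, if the true effect
    equals the effect implied by the original two-sided p-value."""
    z_obs = norm.isf(p_two_sided / 2)   # z-score implied by the original p-value
    z_crit = norm.isf(alpha / 2)        # threshold the replication must clear
    return norm.sf(z_crit - z_obs)      # probability the replication's z exceeds it

for p in (0.04, 0.01, 0.001):
    print(f"original p = {p:<5} -> replication chance ~ {replication_chance(p):.0%}")
# ~54% for p = 0.04, ~73% for p = 0.01, ~91% for p = 0.001
```

Even under those generous assumptions, a p = 0.04 result replicates only about half the time, while p = 0.001 replicates about 90% of the time, which is roughly the pattern the Royal Society paper describes.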

 
I tend to look mainly at the effect size (I buy that there's probably an improvement with a HR of 0.77), because the p-value depends not just on the effect size but also on how well-powered the study is.
 
Except we've got several studies showing the benefit of combining radiation therapy with some form of anti-androgen therapy in the intermediate- and high-risk population, albeit in the intact setting (ignoring the question of dose escalation +/- ADT, which is being addressed by the RTOG now).

Too many good p-values from all of those studies to just be related to chance.
 
Reproducibility is highly correlated with the p-value. Studies with p-values in the ~0.03-0.05 range are only reproducible about 50-60% of the time, at best. Studies with p-values <0.001 are much more highly reproducible (http://rsos.royalsocietypublishing.org/content/1/3/140216).

Just food for thought. I look mildly askance at all these "practice changing" studies with HRs that miss unity by the skin of their teeth or p-values which exactly equal 0.04 (0.045 would round to 0.05 and make you too skeptical!). Your mileage may vary.
As I say below, a good amount of this is influenced by publication bias (if we saw more nonsignificant studies, we might be more tempered in our feelings about a particular finding that gets reported).

On another note, guidance from members of the statistical community is to report p-values to four decimal places unless the p-value is less than .0001 or greater than .9999 (at which point you can use the p < .0001 or p > .9999 notation, respectively). P-values shouldn't be rounded more coarsely than that, because doing so misrepresents the evidence (in your example, people may round p-values to change the appearance of their finding, e.g. making it look more significant by rounding 0.043 down to 0.04). In reality, there's nothing magical about barely missing or scraping by the selected alpha level. With an alpha of 0.05, a p-value of 0.054 shouldn't be viewed much differently than 0.046, but unfortunately this is the mindset of many researchers and publication gatekeepers (look at the CI estimates to get a better picture).

I think your advice above is a wise approach, so I felt like adding to your discussion.
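For what it's worth, a tiny helper like the sketch below (my own illustration; exact conventions are journal-dependent) captures the reporting rule described above.

```python
# Small formatting helper illustrating the reporting rule described above
# (four decimal places, with "< .0001" / "> .9999" at the extremes). The
# exact convention varies by journal; this is only an illustration.
def format_p(p: float) -> str:
    if p < 0.0001:
        return "p < .0001"
    if p > 0.9999:
        return "p > .9999"
    return f"p = {p:.4f}"

print(format_p(0.0434))    # "p = 0.0434", rather than "0.04", which hides precision
print(format_p(0.00003))   # "p < .0001"
```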


Too many good p-values from all of those studies to just be related to chance.
I'm not going to disagree with any of the evidence because I haven't looked at it, but you can't neglect the giant publication bias that exists and prevents nonsignificant findings from being published. A lot of people don't have much background in statistics and often take a nonsignificant p-value to indicate a bad study when "some other studies" got significant results; there's no reason for that to be true. P-values have their purpose, but they need to be understood better than they currently are, and people need to realize that the size of the p-value doesn't indicate how big or small the effect is, which leads people to confuse statistical and clinical significance: they think a significant p-value means real-world implications (CIs come into the frame to help with this).
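A toy example of the statistical-vs-clinical point: with made-up numbers and a large enough sample, a clinically trivial difference produces an impressively small p-value.

```python
# Toy illustration: with a huge sample, a clinically trivial difference
# yields a very small p-value, while the effect size (and its CI) tells
# the real story. All numbers are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# True difference between groups: 0.02 standard deviations (negligible).
for label, n in [("n = 200 per arm", 200), ("n = 200,000 per arm", 200_000)]:
    a = rng.normal(0.00, 1, n)
    b = rng.normal(0.02, 1, n)
    t, p = stats.ttest_ind(a, b)
    print(f"{label}: observed difference = {b.mean() - a.mean():+.3f} SD, p = {p:.2g}")
# The underlying effect is equally trivial in both cases; only the p-value
# shrinks as the sample grows.
```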
 
As I say below, a good amount of this is influenced by publication bias (if we saw more nonsignificant studies, we might be more tempered in our feelings about a particular finding that gets reported).

On another note, guidance from members of the statistical community is to report p-values to four decimal places unless the p-value is less than .0001 or greater than .9999 (at which point you can use the p < .0001 or p > .9999 notation, respectively). P-values shouldn't be rounded more coarsely than that, because doing so misrepresents the evidence (in your example, people may round p-values to change the appearance of their finding, e.g. making it look more significant by rounding 0.043 down to 0.04). In reality, there's nothing magical about barely missing or scraping by the selected alpha level. With an alpha of 0.05, a p-value of 0.054 shouldn't be viewed much differently than 0.046, but unfortunately this is the mindset of many researchers and publication gatekeepers (look at the CI estimates to get a better picture).

I think your advice above is a wise approach, so I felt like adding to your discussion.



I'm not going to disagree with any of the evidence because I haven't looked at it, but you can't neglect the giant publication bias that exists and prevents nonsignificant findings from being published. A lot of people don't have much background in statistics and often take a nonsignificant p-value to indicate a bad study when "some other studies" got significant results; there's no reason for that to be true. P-values have their purpose, but they need to be understood better than they currently are, and people need to realize that the size of the p-value doesn't indicate how big or small the effect is, which leads people to confuse statistical and clinical significance: they think a significant p-value means real-world implications (CIs come into the frame to help with this).
Given that the context is prostate cancer and these are large randomized studies supported by the NCI and analogous institutions in other countries, it is hard to imagine that there are lots of unpublished negative studies; every study with sufficient power and follow-up that has looked at radiation with or without androgen suppression has found an effect, usually on overall survival.
 
Only about 15-40% of the findings from cancer studies are reproducible (https://www.statnews.com/2017/01/18/replication-cancer-studies/). I noticed someone had just posted about "practice changing" RTOG 9601, wherein:

The median follow-up among the surviving patients was 13 years. The actuarial rate of overall survival at 12 years was 76.3% in the bicalutamide group, as compared with 71.3% in the placebo group (hazard ratio for death, 0.77; 95% confidence interval, 0.59 to 0.99; P=0.04).

Reproducibility is highly correlated with the p-value. Studies with p-values in the ~0.03-0.05 range are only reproducible about 50-60% of the time, at best. Studies with p-values <0.001 are much more highly reproducible (http://rsos.royalsocietypublishing.org/content/1/3/140216).

Just food for thought. I look mildly askance at all these "practice changing" studies with HRs that miss unity by the skin of their teeth or p-values which exactly equal 0.04 (0.045 would round to 0.05 and make you too skeptical!). Your mileage may vary.

I don't read into the p-value that much. The p-value is more a function of measurement precision and sample size than of the actual validity or accuracy of the results. For example, in large cardiology trials with thousands of patients, you can see ridiculous findings like a 10-point difference in platelets correlating with OS at p < 0.0001 in a subgroup analysis. In that case it's clinically meaningless and just a statistical calculation. I look at absolute values, absolute numbers of deaths, absolute increase in OS. Don't get me wrong, I agree that there are certainly trials which are not reproducible or suspect. However, in this case RTOG 9601 was a randomized, double-blinded, placebo-controlled study with a large number of patients, sufficient events, and long-term follow-up which showed a 5% absolute increase in overall survival. It also rationally makes sense that using hormones earlier in the course of disease (when there are fewer cells that could potentially become hormone refractory) could improve outcomes. I started recommending this to these patients as routine when the results were presented at ASTRO 2015.
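A quick arithmetic check of the "absolute numbers" framing, using the 12-year survival figures quoted above (the number-needed-to-treat calculation is my own addition, not something reported by the trial):

```python
# Quick "absolute numbers" check using the 12-year OS figures quoted above
# (76.3% vs 71.3%); the NNT framing is an addition for illustration, not
# something reported in the trial.
os_bicalutamide = 0.763
os_placebo = 0.713

absolute_benefit = os_bicalutamide - os_placebo   # 0.05, i.e. 5 percentage points
nnt = 1 / absolute_benefit                        # ~20 patients treated per death averted at 12 years
print(f"absolute OS benefit = {absolute_benefit:.1%}, NNT ~ {nnt:.0f}")
```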
 
Given that the context is prostate cancer and these are large randomized studies supported by the NCI and analogous institutions in other countries, it is hard to imagine that there are lots of unpublished negative studies; every study with sufficient power and follow-up that has looked at radiation with or without androgen suppression has found an effect, usually on overall survival.

All I'm saying is: remain skeptical. There's a 50% chance these assumptions may be untrue.

http://www.nature.com/news/half-of-us-clinical-trials-go-unpublished-1.14286
 
Given that the context is prostate cancer and these are large randomized studies supported by the NCI and analogous institutions in other countries, it is hard to imagine that there are lots of unpublished negative studies; every study with sufficient power and follow-up that has looked at radiation with or without androgen suppression has found an effect, usually on overall survival.
Yeah, and like I said, I'm not disagreeing with the evidence on any particular study because I haven't evaluated it. My main point was about a larger problem with research in general. People don't understand statistics and the findings as well as they think they do.

There are a good number of people who think a p-value of .02 means a 2% chance of making a Type I error, a 2% chance that there really is no effect (or a 98% chance the effect is real), a 2% chance of observing the exact results, or some other idea that inappropriately weights the findings or just misses the boat. Similar misunderstandings apply to confidence intervals as well; this is part of a larger issue in research, and it partly explains why people are flabbergasted when a significant finding isn't reproduced in a follow-up study.
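One way to see why "p = .02" is not "a 2% chance there is no effect" is a simple Bayes-style calculation; the prior plausibility (10%) and power (80%) below are arbitrary numbers chosen purely for illustration.

```python
# Why "p < .05" is not "a 5% chance the effect isn't real": the probability
# that the null is true after a significant result also depends on how
# plausible the hypothesis was to begin with and on the study's power.
# The prior (10%) and power (80%) are arbitrary, purely for illustration.
alpha, power, prior_true = 0.05, 0.80, 0.10

p_sig_and_true = prior_true * power           # real effect, and the study detects it
p_sig_and_false = (1 - prior_true) * alpha    # no real effect, but a false positive anyway
p_null_given_sig = p_sig_and_false / (p_sig_and_true + p_sig_and_false)

print(f"P(no real effect | significant result) = {p_null_given_sig:.0%}")
# ~36% under these assumptions, nowhere near the 5% (or 2%) people expect.
```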
 
I don't read into the p-value that much. The p-value is more a function of measurement precision and sample size than of the actual validity or accuracy of the results. For example, in large cardiology trials with thousands of patients, you can see ridiculous findings like a 10-point difference in platelets correlating with OS at p < 0.0001 in a subgroup analysis. In that case it's clinically meaningless and just a statistical calculation. I look at absolute values, absolute numbers of deaths, absolute increase in OS. Don't get me wrong, I agree that there are certainly trials which are not reproducible or suspect. However, in this case RTOG 9601 was a randomized, double-blinded, placebo-controlled study with a large number of patients, sufficient events, and long-term follow-up which showed a 5% absolute increase in overall survival. It also rationally makes sense that using hormones earlier in the course of disease (when there are fewer cells that could potentially become hormone refractory) could improve outcomes. I started recommending this to these patients as routine when the results were presented at ASTRO 2015.

Exactly!
 
All I'm saying is: remain skeptical. There's a 50% chance these assumptions may be untrue.

http://www.nature.com/news/half-of-us-clinical-trials-go-unpublished-1.14286

Reading that much into the p-value is the kind of statistical purism that Phantom's post is a meaningful response to. That p-value paper demands such pure statistical rigor that, if we were that skeptical about most of the trials, we might question why we ought to do them at all. We should be skeptical about that paper :)
 
There are certainly large single-institutional experiences with protons at multiple sites which remain unpublished because the results were underwhelming (IMRT was better). I know this to be a fact. I have wondered what else is out there that isn't published.
 
Reading that much into the p-value is the kind of statistical purism that Phantom's post is a meaningful response to. That p-value paper demands such pure statistical rigor that, if we were that skeptical about most of the trials, we might question why we ought to do them at all. We should be skeptical about that paper :)

I would disagree that it demands "such statistical rigor" and say that it's really calling for a better understanding on the part of non-statisticians (people without [bio]statistics degrees). People generally have a poor understanding of statistical significance and confidence intervals and therefore put too much weight on what they find, or don't adequately understand what it is that was "found." The researcher who thinks a p-value < .05 means there's less than a 5% chance of being wrong (or that the finding is false, or something along those lines) is more concerning than the researcher who knows that same p-value indicates a degree of incompatibility of the observations with the null hypothesis and makes no specific "truth statement" about the particular null and alternative hypotheses. This might be viewed as "purism," but it really boils down to getting the point or missing it. Someone prescribing a drug because they believe a study showed less than a 5% chance of being wrong is likely less informed about the data (and possibly a greater risk) than the person who's more tempered.

Being skeptical about research is at the heart of scientific inquiry. You constantly question things and look for data to suggest the status quo is not correct. Believing that any of the methods we have can actually prove something demonstrates a mischaracterization of what we can actually do in research. This questioning shouldn't bring anyone to the conclusion that research or experimentation is pointless, but rather bring one to recognize that we do research and question things because we're going to make decisions based on the conclusions. At the end of the day, those conclusions have an element of uncertainty, no matter how good everything else looks. Continual questioning and investigation, done properly, is possibly the only rational way we can approach these decisions.
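For anyone who wants to see the "degree of incompatibility with the null" idea in action, here is a minimal simulation of my own (made-up data): the p-value is just the frequency with which null-generated data produce a test statistic as extreme as the one observed.

```python
# Minimal simulation of what a p-value actually measures: how often data
# generated under the null hypothesis produce a test statistic at least as
# extreme as the one observed. The example data are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1, 50)
b = rng.normal(0.4, 1, 50)                 # a modest real difference
t_obs, p_analytic = stats.ttest_ind(a, b)

# Re-draw both groups from the same null distribution many times and count
# how often |t| is at least as large as the observed |t|.
null_ts = [stats.ttest_ind(rng.normal(0, 1, 50), rng.normal(0, 1, 50))[0]
           for _ in range(20_000)]
p_simulated = float(np.mean(np.abs(null_ts) >= abs(t_obs)))

print(f"analytic p = {p_analytic:.3f}, simulated under the null = {p_simulated:.3f}")
# The two roughly agree; note that neither is the probability that the null is true.
```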
 