Shrink the P Value for Significance, Raise the Bar for Research: A Renewed Call

Have you worked with population-level data before? Reporting every single analysis performed is infeasible and distracting. If I have no idea what causes increased obesity rates, I'm going to have to go looking for variables. I don't have any preconceived notion of a variable in my head. I know that this population has increased obesity but I have no idea why so I'm going to go into this in an unbiased manner and see what variables might be associated with it. In some aspects, this is better than going into this with a specific hypothesis because then you run the risk of having coincidental or confounded findings. If I look at all the variables in a population that are measured, I would probably find a few variables associated with obesity, e.g. hypertension, soda company revenue, etc. Hell, I might even report a p value for these comparisons. The key difference is that I know what kind of study this is - to discover a model that can be used to understand obesity. Others who don't have my training probably wouldn't understand and ding me for running regression on so many variables.

I think you would be hard-pressed to discover a more effective way of answering such research questions. In your world, how would you answer this same question? Remember, you have no idea what causes obesity here. Or, if you want a better example, what causes SIDS. There's no model from which you can derive testable hypotheses. How would you proceed?
[Image: significant.png, the "jelly beans cause acne" significance cartoon discussed below]
 
I mean the real answer is that people need to be able to read a study critically. What assumptions did they make? Does their statistical modeling make sense? Does the patient population match yours? Do their conclusions match their results?

They did a noninferiority study and got a p value of 0.06. Is that result really noninferior, or did they simply not recruit enough study participants to reach significance? [That recent surgery duty hours study.]
 
I mean the real answer is that people need to be able to read a study critically. What assumptions did they make? Does their statistical modeling make sense? Does the patient population match yours? Do their conclusions match their results?

They did a noninferiority study and got a p value of 0.06. Is that result really noninferior, or did they simply not recruit enough study participants to reach significance? [That recent surgery duty hours study.]

I agree.

I'd wager most MDs conducting clinical research don't know how to do a power analysis.

NEJM and others are starting to make visual abstracts, with important side effects to weigh against the positive results. It's a step in the right direction, but it needs a lot more work.
 

Lol, don't even get me started on the science communication end of this.

You all remember The Lancet stent study a few months ago (with only a 6-week follow-up :eyebrow:), that spurred headlines about "unnecessary surgery!!!"

‘Unbelievable’: Heart Stents Fail to Ease Chest Pain

www.vox.com/platform/amp/science-and-health/2017/11/3/16599072/stent-chest-pain-treatment-angina-not-effective

Placebo Effect of the Heart

The study:

http://www.thelancet.com/journals/lancet/article/PIIS0140-6736(17)32714-9/fulltext?elsca1=tlxpr
 
Great post! The irony is this cartoon has a common, incorrect interpretation of a 95% confidence interval (it’s incorrect to say any specific interval (or p value < .05) has a 95% chance of being correct or 5% chance of being wrong).
I mean the real answer is that people need to be able to read a study critically. What assumptions did they make? Does their statistical modeling make sense? Does the patient population match yours? Do their conclusions match their results?

They did a noninferiority study and got a p value of 0.06. Is that result really noninferior, or did they simply not recruit enough study participants to reach significance? [That recent surgery duty hours study.]
Excellent point, but this also comes back to improved education in statistics for medical people. You can’t make an appropriate assessment of methods and conclusions without a good knowledge base. This is why people treat that .06 p-value as black and white. Better education would put the focus on the p-value and the confidence interval, without a black-and-white dichotomization of a continuous measure.
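As a rough sketch of that point (entirely made-up numbers; scipy assumed), reporting the confidence interval alongside the p-value keeps the estimated effect and its uncertainty in view instead of collapsing everything into significant/not significant:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical outcome data for two arms (entirely made-up effect and spread)
control = rng.normal(loc=50.0, scale=10.0, size=40)
treated = rng.normal(loc=54.0, scale=10.0, size=40)

# The usual "is p < .05?" question
t_stat, p_value = stats.ttest_ind(treated, control)

# 95% CI for the difference in means (pooled variance, matching the t-test above)
n1, n2 = len(treated), len(control)
sp2 = ((n1 - 1) * treated.var(ddof=1) + (n2 - 1) * control.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
diff = treated.mean() - control.mean()
t_crit = stats.t.ppf(0.975, n1 + n2 - 2)
ci_low, ci_high = diff - t_crit * se, diff + t_crit * se

print(f"p = {p_value:.3f}")
print(f"difference = {diff:.1f}, 95% CI ({ci_low:.1f}, {ci_high:.1f})")
# Whether p comes out at .04 or .06, the interval shows the range of effect
# sizes compatible with the data -- information a yes/no cutoff throws away.
```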
 
Great post! The irony is this cartoon has a common, incorrect interpretation of a 95% confidence interval (it’s incorrect to say any specific interval (or p value < .05) has a 95% chance of being correct or 5% chance of being wrong).

If the null hypothesis is true (which it presumably is for X color jelly beans to acne), there is a 5% chance that, through random chance alone, a sample will have a value as extreme as would be needed to have a P value <0.05. That's the literal definition of a p value.
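Whatever wording we settle on for the definition, the cartoon's multiple-comparisons problem is easy to see in a quick simulation (made-up data; scipy assumed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n_colors, n_per_group = 2_000, 20, 30

runs_with_a_hit = 0
for _ in range(n_sims):
    p_values = []
    for _ in range(n_colors):
        # The null is true for every color: both groups come from the same distribution
        jelly_group = rng.normal(size=n_per_group)
        control_group = rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(jelly_group, control_group).pvalue)
    runs_with_a_hit += min(p_values) < 0.05

print(f"At least one 'significant' color in {runs_with_a_hit / n_sims:.0%} of experiments")
# Roughly 1 - 0.95**20, about 64%, even though each individual test has only
# a 5% false-positive rate when its own null is true.
```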
 
I agree.

I'd wager most MDs conducting clinical research don't know how to do a power analysis.

NEJM and others are starting to make visual abstracts, with important side effects to weigh against the positive results. It's a step in the right direction, but it needs a lot more work.
Even fewer medical people get that a power analysis isn’t what you do to determine sample size requirements; that’s a sample size calculation (put in a desired power and all other assumptions to get a minimum sample size), whereas a power analysis is the other way around (the end result is a value for power based on the inputs, and you see how power changes under differing assumptions). Compounding the issue, medical people think this is one number when really a proper power analysis or sample size calculation covers a range of inputs, giving a range of power values or sample sizes under different assumptions. The final embarrassing nail in the coffin is those who do a post hoc power analysis without realizing it isn’t useful for anything beyond planning a future study, since it’s just a mathematical transformation of the observed p value. It adds zero new information for interpreting the current results.
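To make the direction-of-calculation point concrete, here is a minimal sketch using statsmodels (the effect sizes and sample sizes are arbitrary assumptions): a power analysis sweeps the inputs and reports power, while a sample size calculation inverts that and solves for n.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power analysis: inputs are effect size, alpha, and n per group; output is power.
# Sweeping the assumed effect size shows how fragile a single "we have 80% power" claim is.
for effect_size in (0.2, 0.35, 0.5):          # standardized (Cohen's d), assumed values
    for n_per_group in (50, 100, 200):
        power = analysis.power(effect_size=effect_size, nobs1=n_per_group,
                               alpha=0.05, ratio=1.0)
        print(f"d={effect_size:.2f}, n/group={n_per_group}: power={power:.2f}")

# Sample size calculation is the inverse: specify desired power and solve for n.
n_needed = analysis.solve_power(effect_size=0.35, power=0.80, alpha=0.05, ratio=1.0)
print(f"n per group for 80% power at d=0.35: {np.ceil(n_needed):.0f}")
```

Run over a grid of assumptions like this, a single "we have 80% power" statement becomes "we have 80% power if the true effect is at least d = 0.35 and the other assumptions hold," which is the honest version.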
 
If the null hypothesis is true (which it presumably is for X color jelly beans to acne), there is a 5% chance that, through random chance alone, a sample will have a value as extreme as would be needed to have a P value <0.05. That's the literal definition of a p value.
If we’re going for the literal definition, we should avoid conflating alpha and the p value (two totally distinct concepts) and improve our precision in the definition. The p value is the probability of observing a result as extreme as or more extreme than the observed result, assuming the null hypothesis is true. That is a little more precise than what you’ve offered, and it follows directly from the mathematical calculations behind the p value.

I’m making the distinction that once a result is observed and a conclusion is made, you’re either wrong or not, and the probability is 0 or 1 (the jelly bean cartoon). That’s how Frequentist statistical theory works. You cannot make probability statements about hypotheses in Frequentist statistics; that’s where Bayesian inference comes into play. If the jelly bean hypothesis test were conducted and showed a p value of .03, that does not mean there is a 97% probability of an association, nor does it mean there is a 3% chance the finding is a fluke. This necessarily follows from the fact that the p value is not an error probability and does not put a probability on a hypothesis.

Edit: there is also an issue that it's incorrect to say "no link/association" because the test is nonsignificant -- another common misinterpretation. The correct interpretation is "...insufficient evidence to conclude X color and acne are related at the .05 alpha level." These may seem the same at first, but they are very different statements with different implications.
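A toy simulation of the "once observed, it's 0 or 1" point (normally distributed data assumed): roughly 95% of intervals built this way cover the true mean over repeated samples, but any single computed interval either does or does not.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, n, n_reps = 10.0, 25, 10_000

covered = 0
for _ in range(n_reps):
    sample = rng.normal(loc=true_mean, scale=4.0, size=n)
    half_width = stats.t.ppf(0.975, df=n - 1) * sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - half_width, sample.mean() + half_width
    covered += (lo <= true_mean <= hi)   # for each individual interval this is simply 0 or 1

print(f"Coverage over repeated samples: {covered / n_reps:.1%}")
# ~95% is a property of the *procedure*; the one interval you computed from
# your one dataset has no 95% probability attached to it.
```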
 
...Others who don't have my training probably wouldn't understand and ding me for running regression on so many variables.

I think you would be hard-pressed to discover a more effective way of answering such research questions. In your world, how would you answer this same question? Remember, you have no idea what causes obesity here. Or, if you want a better example, what causes SIDS. There's no model from which you can derive testable hypotheses. How would you proceed?

I think I've waited long enough...😀

1) Did you ever remember what "[your] training" was? I was wondering, since you seemed to be making an appeal to "authority."
2) Did you end up reading up on the LASSO or other methods that are designed for situations like you described?

I thought we had a nice back and forth, but then it went dead after I answered your question.
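For anyone following along, here's a minimal sketch of the kind of penalized variable selection mentioned above (hypothetical dataset and column names; scikit-learn's LassoCV): cross-validation chooses the penalty, and uninformative predictors are shrunk to exactly zero, which is a more honest screen than running dozens of separate regressions and keeping whatever hits p < .05.

```python
# Minimal sketch of penalized variable selection for the "hundreds of candidate
# variables, no model" situation. The CSV and column names are hypothetical.
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("county_health.csv")          # hypothetical population-level dataset
X = df.drop(columns=["obesity_rate"])          # every candidate predictor (assumed numeric)
y = df["obesity_rate"]

model = make_pipeline(StandardScaler(), LassoCV(cv=10, random_state=0))
model.fit(X, y)

coefs = pd.Series(model.named_steps["lassocv"].coef_, index=X.columns)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False))
# Variables shrunk exactly to zero drop out; what's left is a candidate set to
# carry into a prespecified confirmatory analysis, not a list of "findings".
```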
 
I think I've waited long enough...😀

1) Did you ever remember what "[your] training" was? I was wondering, since you seemed to be making an appeal to "authority."
2) Did you end up reading up on the LASSO or other methods that are designed for situations like you described?

I thought we had a nice back and forth, but then it went dead after I answered your question.

damn

you're worse than my wife
 
I'm an MD and I can agree that the majority of MDs are garbage at research. Most of the studies done are garbage too. How many studies are actually reproducible?
 
You obviously have no idea what you're talking about if you think one basic stats course at the med student level covers what you need to understand clinical research. Some examples of essential, basic epidemiological questions you may not know the answers to:

1) When is it appropriate to adjust for covariates? If a covariate is related to the outcome but not the variable of interest, should we put it in the model? What exactly qualifies as a confounder? If something is a confounder, should we stratify by it and make a subgroup, or control for it, and what's the difference? Do we include confounders in RCTs if the design by nature eliminates confounding? (See the simulation sketch below for the confounding piece.)

2) How about this real example?
To investigate the association between estrogen and cancer, Yale investigators considered the possibility of ascertainment bias, where estrogen leads to vaginal bleeding that accelerates the diagnosis of existing cancer (it is more likely to be detected if bleeding occurs). They therefore proposed looking only at cases with vaginal bleeding, whether or not the patients were taking estrogen; since these patients have all bled, they should have the same likelihood of having an existing cancer diagnosed. If estrogen is still associated with cancer among these patients, they argued, we can say it's causal. What was the serious flaw in this methodology? (Why do we find an association between estrogen and cancer even among women who bleed, if there is no real association?) How about in Belizan et al. 1997? (Hint: a similar concept to the prior one)
Belizán JM, Villar J, Bergel E, et al. Long-term effect of calcium supplementation during pregnancy on the blood pressure of offspring: follow up of a randomised controlled trial. BMJ. 1997;315(7103):281-285.

3) What assumptions are made with Cox regression? What if proportionality assumptions are not met (very common); how should this be interpreted? What should be used as the time scale for your Cox regression, and how does this affect the analysis? (age, follow-up time, etc)

I can provide countless other examples...even in this thread, the p-value and 95% CI are not the synonymous, interchangeable concepts the above poster alluded to. These are not obscure topics that will never pop up in real life. They are the bare essentials for any clinical researcher, guaranteed to come up (though many physicians ignore them unknowingly), and anyone who has taken real graduate-level introductory epidemiology classes would know the answers. This lack of fundamental understanding is one of the main reasons why so many observational studies fail to stand up to the rigor of clinical trials. With proper design, cohort studies can closely mirror the results of RCTs, and case-control studies should yield effect estimates as valid as cohort studies. However, that is rarely the case in reality.
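To sketch the confounding piece of question 1 (simulated numbers only; statsmodels assumed): when a covariate drives both the exposure and the outcome, the crude estimate is biased and adjusting for that covariate recovers the truth. A covariate related only to the outcome is not a confounder; in a linear model, leaving it out does not bias the exposure estimate, though including it can improve precision.

```python
# Toy confounding simulation: age drives both the exposure and the outcome,
# so the crude exposure effect is biased and adjustment recovers the (null) truth.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 5_000

age = rng.normal(60, 10, n)                                          # confounder
exposure = (age / 100 + rng.normal(0, 0.3, n) > 0.7).astype(float)   # older -> more exposed
outcome = 0.05 * age + rng.normal(0, 1, n)                           # depends on age, NOT exposure

crude = sm.OLS(outcome, sm.add_constant(exposure)).fit()
adjusted = sm.OLS(outcome, sm.add_constant(np.column_stack([exposure, age]))).fit()

print(f"crude exposure effect:    {crude.params[1]:.2f}")     # spuriously positive
print(f"adjusted exposure effect: {adjusted.params[1]:.2f}")  # approximately 0, the truth
```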

I would really hope that a good stats class would teach causal inference and most definitely cover Cox regression. The first question(s) can get more complex, but I still think a good stats class would at least cover the foundations for them.

In fact, most think the stats section is a quick calculation that you just "redo" if you don't like the results (or that you get more data to show what you want). They miss the point that their conclusions hinge on the appropriateness of the methodology employed and the correct interpretation of results.

I couldn't disagree more with this statement. I don't think I know any docs who would just "redo" the calculation if they didn't like the results. Wanting more data is a more common problem, but sometimes it's a valid point when you truly believe your population does not accurately represent the total population of what you're studying and you have tiny sample sizes (n<15).

Important point to emphasize. People claiming that an MD is all you need to perform and fully interpret good research don't know what they don't know.

I don't think I've ever heard anyone make that claim, and I think med schools in general do a poor job of teaching physicians how to be statistically literate when it comes to the studies they read (saying this as someone who got a master's before med school and whose father is a statistics professor).
 
I would really hope that a good stats class would teach causal inference and most definitely cover Cox regression. The first question(s) can get more complex, but I still think a good stats class would at least cover the foundations for them.



I couldn't disagree more with this statement. I don't think I know any docs who would just "redo" the calculation if they didn't like the results. Wanting more data is a more common problem, but sometimes it's a valid point when you truly believe your population does not accurately represent the total population of what you're studying and you have tiny sample sizes (n<15).



I don't think I've ever heard anyone make that claim, and I think med schools in general do a poor job of teaching physicians how to be statistically literate when it comes to the studies they read (saying this as someone who got a master's before med school and whose father is a statistics professor).

I agree with the first two parts of your post, but there's actually an entire episode of this one podcast where an MD talks about how you don't really need anything more than an MD to do good basic science research.
 
I agree with the first two parts of your post, but there's actually an entire episode of this one podcast where an MD talks about how you don't really need anything more than an MD to do good basic science research.

To simply perform the research, sure. To perform a full and proper analysis of the results and completely comprehend the implications of that analysis? Nah.

What was that podcast? I'd like to know which one to avoid.
 
To simply perform the research, sure. To perform a full and proper analysis of the results and completely comprehend the implications of that analysis? Nah.

What was that podcast? I'd like to know which one to avoid.

Since you're already a resident, you wouldn't really have any reason to listen to it anyway. It's called Specialty Stories. It's actually pretty decent for quick looks into different specialties in different settings.
 
Since you're already a resident, you wouldn't really have any reason to listen to it anyway. It's called Specialty Stories. It's actually pretty decent for quick looks into different specialties in different settings.

Interesting, would be curious about what they say about my field, but probably wouldn't take it too seriously.
 
Interesting, would be curious about what they say about my field, but probably wouldn't take it too seriously.

The psych episode (Episodes? There might have been more than one.) was pretty good. The format is that the host has a specialist from each specialty and setting (academia, community, mixed, etc.) come on and answer questions about that specialty.
 
I couldn't disagree more with this statement. I don't think I know any docs who would just "redo" the calculation if they didn't like the results. Wanting more data is a more common problem, but sometimes it's a valid point when you truly believe your population does not accurately represent the total population of what you're studying and you have tiny sample sizes (n<15).
This is a lot of what I see physician "researchers" doing, though. I've literally been told to discard a hypothesis test and do something different "that will be significant." So, in my experience, that has been the case, and it's supported by tons of circumstantial evidence: of the "highest tier" medical journals, fewer than half (if I recall) have a real statistician (PhD in stats or biostats, not some tangential field like epi or a DrPH with a "biostats concentration"). Of those journals, many times the statisticians don't even review the publications (i.e. they review a few here and there). There is a big problem with journals treating these MD/MPH or MD-only physicians as qualified statistical reviewers. It's also mind-boggling that physicians don't comprehend the idea that representative sampling is most easily achieved by random sampling and that this isn't necessarily tied to the sample size. I can pick the 20,000 richest people in the world and I won't have a representative sample of the financial status of the population at large, but a random sample of 300 would be more valuable. Physicians balk at the idea that a random sample of 200 patients can be far more informative than a convenience sample of 10,000. I don't think wanting more data is necessarily a bad thing (which was actually what RA Fisher summarized as the meaning of a "large" p-value), but I've also heard the phrase "yeah, let's get more N so we can get a significant p-value"...cringe-worthy and one way to p-hack.
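The richest-20,000 example is easy to demonstrate with simulated incomes (lognormal, numbers made up): the huge convenience sample is precisely wrong, while the small random sample is noisy but unbiased.

```python
# Toy illustration of biased vs. random sampling using simulated incomes.
import numpy as np

rng = np.random.default_rng(3)
population = rng.lognormal(mean=10.5, sigma=1.0, size=1_000_000)

richest_20k = np.sort(population)[-20_000:]                    # large but biased sample
random_300 = rng.choice(population, size=300, replace=False)   # small random sample

print(f"true population mean:   {population.mean():,.0f}")
print(f"mean of richest 20,000: {richest_20k.mean():,.0f}")    # wildly off, no matter the n
print(f"mean of random 300:     {random_300.mean():,.0f}")     # noisy but unbiased
```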



I don't think I've ever heard anyone make that claim, and I think med schools in general do a poor job of teaching physicians how to be statistically literate when it comes to the studies they read (saying this as someone who got a master's before med school and whose father is a statistics professor).
Again, I'm speaking from my experience. Tons of MDs think "stats is easy... why can Jimmy the MS3 give me all the stats in a few hours and you're saying you need several days or longer?" Million things wrong with that statement and I've heard more than one MD make it after being told data exploration, visualization, and cleaning can take a few days quite regularly, and that's not even getting to what many misinformed people consider "valuable" work. Many doctors I know of think that addressing missing data with something other than a case-wise deletion or LOCF is heresy and garbage. I agree that med schools do a terrible job at teaching any semblance of statistics to medical students; I often see schools using MD or MD MPH people to teach (if it's taught) and they go about teaching some pretty silly stuff.

I'm unclear about your master's degree, but I think that if your dad is a statistician, then he would appreciate a lot of what I've said (even more so if he's consulted a lot with MDs).
 
This is a lot of what I see physician "researchers" doing, though. I've literally been told to discard a hypothesis test and do something different "that will be significant." So, in my experience, that has been the case, and it's supported by tons of circumstantial evidence: of the "highest tier" medical journals, fewer than half (if I recall) have a real statistician (PhD in stats or biostats, not some tangential field like epi or a DrPH with a "biostats concentration"). Of those journals, many times the statisticians don't even review the publications (i.e. they review a few here and there). There is a big problem with journals treating these MD/MPH or MD-only physicians as qualified statistical reviewers. It's also mind-boggling that physicians don't comprehend the idea that representative sampling is most easily achieved by random sampling and that this isn't necessarily tied to the sample size. I can pick the 20,000 richest people in the world and I won't have a representative sample of the financial status of the population at large, but a random sample of 300 would be more valuable. Physicians balk at the idea that a random sample of 200 patients can be far more informative than a convenience sample of 10,000. I don't think wanting more data is necessarily a bad thing (which was actually what RA Fisher summarized as the meaning of a "large" p-value), but I've also heard the phrase "yeah, let's get more N so we can get a significant p-value"...cringe-worthy and one way to p-hack.



Again, I'm speaking from my experience. Tons of MDs think "stats is easy... why can Jimmy the MS3 give me all the stats in a few hours and you're saying you need several days or longer?" Million things wrong with that statement and I've heard more than one MD make it after being told data exploration, visualization, and cleaning can take a few days quite regularly, and that's not even getting to what many misinformed people consider "valuable" work. Many doctors I know of think that addressing missing data with something other than a case-wise deletion or LOCF is heresy and garbage. I agree that med schools do a terrible job at teaching any semblance of statistics to medical students; I often see schools using MD or MD MPH people to teach (if it's taught) and they go about teaching some pretty silly stuff.

I'm unclear about your master's degree, but I think that if your dad is a statistician, then he would appreciate a lot of what I've said (even more so if he's consulted a lot with MDs).

Like I said earlier, I think there are times when increasing n is valid, mostly in situations where you can identify why the sample is not representative (say we flipped your example and saw that the 300 only came from the richest people in the world while the 10k was from all economic classes). I also agree that increasing N with the sole purpose of increasing the power of a study or "p-hacking" is inappropriate. I can't speak towards medical journals and their staff, as I mostly stick to the content of articles I read (and searching for points to tear apart).

As for the second paragraph, I can only speak towards my experience (and to some extent my father's). Most of those I've actually talked to about studies in any academic sense beyond "this one study showed this" had at least a fair grasp of stats and didn't just rely on p-values and n to interpret the data. My dad also felt like the people he worked with had a decent grasp (though obviously not as good as his). I will say he retired from medical research (about 15 years ago) because his new boss was more concerned with research volume than quality, so there certainly may have been a trend towards what you've experienced, or it could just be random differences in the people encountered. There could also be biases, as I don't talk to every attending about studies regularly.
 
Like I said earlier, I think there are times when increasing n is valid, mostly in situations where you can identify why the sample is not representative (say we flipped your example and saw that the 300 only came from the richest people in the world while the 10k was from all economic classes). I also agree that increasing N with the sole purpose of increasing the power of a study or "p-hacking" is inappropriate. I can't speak towards medical journals and their staff, as I mostly stick to the content of articles I read (and searching for points to tear apart).
I agree getting more data can be a viable and legitimate solution if the sole purpose isn't to "get a significant result".

I just see that many people I've interacted with jump from small sample --> not representative --> get a larger sample, which is generally asinine because the sampling methodology is a more likely and more logical reason a sample is not representative of a target population (rather than the sample being "small", all else the same). Increasing the sample size for a biased sample won't alleviate that issue (it may mitigate it if you effectively dilute the nonrepresentative portion with far more representative sampling). Taking an equally small sample as the first, but with a sampling scheme likely to generate a representative sample, is far better than lumping a representative 200 observations with a nonrepresentative 100 observations; I see this very often (or worse, an additional 200 drawn with nonrepresentative sampling techniques lumped into the original 100).

As for the second paragraph, I can only speak towards my experience (and to some extent my father's). Most of those I've actually talked to about studies in any academic sense beyond "this one study showed this" had at least a fair grasp of stats and didn't just rely on p-values and n to interpret the data. My dad also felt like the people he worked with had a decent grasp (though obviously not as good as his). I will say he retired from medical research (about 15 years ago) because his new boss was more concerned with research volume than quality, so there certainly may have been a trend towards what you've experienced, or it could just be random differences in the people encountered. There could also be biases, as I don't talk to every attending about studies regularly.
This could definitely be down to where we are encountering people, but I also see papers in top journals, without a statistician involved, where the clinicians make some pretty poor decisions and then draw even poorer conclusions from the analysis; an incredibly common issue is articles that claim "no difference" or "equality" because a p-value was greater than alpha. This is literally an undergraduate topic in intro to hypothesis testing; a nonsignificant test does not mean equality or no difference -- yet you see many papers in JAMA, BMJ, etc. that let the authors make some claim of "no difference", "no effect", or "no relationship" because P > .05 (which shows the fundamental misunderstanding of this concept on the part of nonstatisticians).
 
A p value of “X” can never fully account for bad science.

You could make the p value 0.0001 and you’d still get junk produced by people that lead you down false narratives and dead ends.
Generally, I put very little stock in "findings" that people publish or talk about as most come from terrible studies or the conclusion does not follow from the employed methods.

You may be surprised to know that p-values are pretty much invalid in observational studies, yet a majority in medicine like to force p-values onto everything; in a similar light, you almost never see biomedical research (outside genetics) using an alpha other than .05, which again demonstrates a lack of understanding of the methods employed. The alpha .05 "standard" is an arbitrary threshold and ignores the proper utilization of the methodology (i.e. justifying an alpha and beta for each hypothesis test depending on the given scenario and risk trade-off).
 
As a wetlab bench research monkey that merely follows orders and cranks out data, all I know is that my PI is happy when p < .05.

My PI has hella funding though, so who am I to question the methods?

#academia
 
As a wetlab bench research monkey that merely follows orders and cranks out data, all I know is that my PI is happy when p < .05.

My PI has hella funding though, so who am I to question the methods?

#academia
It's unfortunate that nearly all of us are put in these situations.

Your PI is probably very smart and knowledgeable about the nonstatistics portion of the research and deserves a lot of respect (which I'm sure is given). However, people are nearly unable to say anything or ask questions about areas that are well outside the PI's wheelhouse for fear of being labeled insubordinate or disrespectful. But this situation you've described is not uncommon, which is sad.

I'd bet that even with all that funding the PI/head honchos haven't carved out part of the funding to hire a statistician to make sure their work isn't for naught (which is often the case without a statistician, including if they're not brought in before the experiments or data collection start). It's a shocker to me that grants are awarded without a good check into who will function as the statistician; a real good way to light a pile of money on fire is to have an excited researcher without a statistician.

Journals are complicit in this issue, too, as most top tiers don't have full-time statisticians (PhD in stats or biostats), and of those journals that do have a statistician, only a handful of the papers are reviewed by the statistician.
 