"approaching significance"


erg923

Regional Clinical Officer, Centene Corporation
What do people think about using this term?

I have a regression model that explains 14.5% of the variance. My ANOVA regression p-value is .058. Do I say it approached significance but ultimately was not significant? What's the best way to word this? Or do I not elaborate on the 'so close' aspect until the discussion?
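For context, a minimal sketch of where those two numbers come from: the model R² and the overall (ANOVA) F-test p value for a regression. All data and variable names below are made up for illustration.

```python
# Hypothetical illustration only: outcome and predictors are made-up names/data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 80
df = pd.DataFrame({"pred1": rng.normal(size=n), "pred2": rng.normal(size=n)})
df["outcome"] = 0.3 * df["pred1"] + 0.2 * df["pred2"] + rng.normal(size=n)

model = smf.ols("outcome ~ pred1 + pred2", data=df).fit()
print(model.rsquared)   # proportion of variance explained by the model
print(model.f_pvalue)   # p value of the overall (ANOVA) F test for the regression
```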

 

I wouldn't do this. The logic of significance testing is kind of compromised if you start saying an effect is "close to the level at which it could be safely treated as something besides random variation, but not quite there." You could report it as significant at a higher p value, but I know quant psych people hate this too.

What I have seen people do in discussion sections is look more closely at the reason for the near miss of significance. Since significance is essentially a function of average error, effect size, and sample size, it looks as though either error or sample size is likely to blame. This ends up being a more meaningful discussion for your readers than simply saying "close to significant" anyway, since it gives them an indication of how a future study might achieve significance.
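To make that concrete, a small made-up illustration (using statsmodels' power routines) of how the same effect size gives very different chances of clearing p < .05 at different sample sizes:

```python
# Hypothetical illustration: identical effect size (d = 0.4), two different per-group n.
from statsmodels.stats.power import TTestIndPower

power_calc = TTestIndPower()
for n in (30, 100):
    achieved_power = power_calc.power(effect_size=0.4, nobs1=n, alpha=0.05)
    # Same effect, very different probability of a significant two-sample t test
    print(n, round(achieved_power, 2))
```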

On a related note, have you looked for multivariate outliers? I've run across a similar situation a couple of times (good effect size, adequate power, no significance), and the presence of a few multivariate outliers has been to blame each time. Apportioning shared variance can result in strange multivariate results even when cases do not appear to be outliers when you look within each variable alone. If this is the case, you should be able to fix it by figuring out which variable has gone rogue and directing its entry into the regression, or by eliminating the outliers in a post-hoc analysis.
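A minimal sketch of that kind of check, continuing the made-up model from the earlier sketch: Mahalanobis distance on the predictors plus Cook's distance on the fitted regression. The cutoff used here is one common convention, not a prescription.

```python
# Continues the hypothetical sketch above: `df` and `model` are assumed to exist.
import numpy as np
from scipy.stats import chi2

X = df[["pred1", "pred2"]].to_numpy()

# Squared Mahalanobis distance of each case from the predictor centroid
X_centered = X - X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
md2 = np.einsum("ij,jk,ik->i", X_centered, cov_inv, X_centered)
flagged = md2 > chi2.ppf(0.999, df=X.shape[1])   # conventional chi-square cutoff

# Cook's distance: influence of each case on the fitted regression
cooks_d, _ = model.get_influence().cooks_distance

print(np.where(flagged)[0])          # candidate multivariate outliers
print(np.argsort(cooks_d)[-5:])      # five most influential cases
```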
 

Using "approaching significance" is horse-****ty (pardon the vernacular). Having said that, I used the term 4 times in a recent paper... :)

I like to use it in the context of a significant finding, e.g., "Mono-infected patients performed significantly worse on Executive functioning than co-infected patients (t = x, p < .01). Additionally, the group difference in Memory functioning approached significance, with mono-infected patients performing worse than co-infected patients (t = y, p = .058)."

Or something like this.
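Purely for illustration, a sketch of the independent-samples t-tests that generate the t and p values slotted into a sentence like that; the group scores here are fabricated.

```python
# Hypothetical data: domain scores for two made-up groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mono = rng.normal(45, 10, size=40)   # e.g., mono-infected group
co = rng.normal(50, 10, size=40)     # e.g., co-infected group

t, p = stats.ttest_ind(mono, co)
print(f"t = {t:.2f}, p = {p:.3f}")   # the statistics reported in the write-up
```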
 
I wouldn't do this. The logic of significance testing is kind of compromised if you start saying an effect is "close to the level at which it could be safely treated as something besides random variation, but not quite there." You could report it as significant at a higher p value, but I know quant psych people hate this too.

What I have seen people do in discussion sections is look more closely at the reason for the near miss of significance. Since significance is essentially a function of average error, effect size, and sample size, it looks as though either error or sample size is likely to blame. This ends up being a more meaningful discussion for your readers than simply saying "close to significant" anyway, since it gives them an indication of how a future study might achieve significance.

Well, I don't see reporting it as illogical; you HAVE to report it in the results before you can discuss it in the discussion. In fact, it does you more harm if you do not set the premise in the intro, report it in the results, and then discuss why you got what you did in the discussion.

On a related note, have you looked for multivariate outliers? I've run across a similar situation a couple of times (good effect size, adequate power, no significance), and the presence of a few multivariate outliers has been to blame each time. Apportioning shared variance can result in strange multivariate results even when cases do not appear to be outliers when you look within each variable alone. If this is the case, you should be able to fix it by figuring out which variable has gone rogue and directing its entry into the regression, or by eliminating the outliers in a post-hoc analysis.

Based on your logic above, removing outliers is just as "questionable" as reporting the state of the data. Statisticians would make the argument that you have no way of knowing that they are actually outliers rather than true states of the sample. The things you describe are all accepted practice, and I would say that if we are being perfect little statisticians, then we would do all of the above AND report all of the above (before and after manipulating the data).
 
I'm actually going to disagree (slightly) with psychgeek on this one. For the results section, I wouldn't go into detail on it; just report the p value. "Approaching significance" is perhaps a bit misleading, but that's just a personal preference... I prefer to refer to these as trend-level findings. I'd just pull up the journal you plan to submit to and see if there is anything consistent there for how it's described, since different fields often use slightly different language to describe it.

The broader point is that there is nothing special about .05, and I wish we'd get away from that notion. There is no reason a p value of .049 versus .054 should change the substantive message of a manuscript, though unfortunately it often does. Reviewers frequently under-think these issues, so to some degree it's just a matter of gaming the system to get your message across in a way that won't upset anyone. One thing I've never understood is why so many consider it absolute blasphemy to dichotomize a variable with, say, a median split, but are all too happy to apply equally arbitrary cutpoints to p values. I have yet to be convinced the same logic shouldn't apply to both cases.

I fully agree with what he said about triple-checking for outliers/distribution assumptions/etc. one more time in this case (though you should probably report it both ways). Discussing issues such as power can be meaningful, but do keep in mind that post-hoc power analysis isn't really scientifically valid, even if it's published all the time and reviewers sometimes even ask for it. I won't get into details on why, but a number of papers have been published on that if you want to check.

Basically, if it seems important to the message of the paper, I will just interpret it like any other finding but pull back on the language I use to describe it in the discussion section (e.g., "There was some suggestion that moderation by variable x may be important, and it will be critical for future studies to follow up on this work."). When it seems off-topic, I tend to just ignore it, which is somewhat ethically grey, but with page limits and the need to tell a story I'm not sure there is a good solution.
 

Good points too. It's unfortunate that even though some of the best statistical minds agree confidence intervals are more informative than p value statistics, the field hasn't caught up... maybe at some point we will.
 
Well, I don't see reporting it as illogical; you HAVE to report it in the results before you can discuss it in the discussion. In fact, it does you more harm if you do not set the premise in the intro, report it in the results, and then discuss why you got what you did in the discussion.

Based on your logic above, removing outliers is just as "questionable" as reporting the state of the data. Statisticians would make the argument that you have no way of knowing that they are actually outliers rather than true states of the sample. The things you describe are all accepted practice, and I would say that if we are being perfect little statisticians, then we would do all of the above AND report all of the above (before and after manipulating the data).

The logic of significance testing is that you establish, a priori, a level of acceptable Type I error risk that functions as the equivalent of the absence of Type I error for the purposes of your conclusion. The line is somewhat arbitrary, but allowing shades of significance post hoc basically lets you change your a priori threshold whenever it hasn't been met.

Reporting the results has to be done; I am objecting to characterizing those reported results as "close to significant." Once the results are reported, you may engage in post-hoc analysis to try to further understand the reasons why you obtained the results that you did. These post-hoc analyses do not replace the main analysis, and it is acceptable to manipulate the data to determine what the results would have been under alternative conditions, provided that the manipulation is made clear. This can then be discussed more fully in the discussion section.

Figuring out whether there is a problem with the entry method of the variables is ideal, but if that is too labor intensive, leaving the outliers in the dataset is not the only scientifically acceptable option. A number of statisticians would suggest re-running the analysis with the outliers excluded, if only to determine their overall effect upon the test statistic (a sketch of that kind of check appears after the list below).

There is debate about the best way to handle outliers, and your response represents one, but only one, of the positions. Other positions include ...

1. Outliers are low-probability events. The density of low-probability events within a finite space is subject to more random error than that of higher-probability events. Methods for defining outliers are sufficiently robust that eliminating them will not cost large amounts of data in all but the smallest samples. Thus, outliers should be eliminated entirely as a source of random error.

2. The disproportionate effect outliers exert in the calculation of statistics is a mathematical artifact. Leaving them in without adjustment distorts the distribution of the test statistic in a way that makes it discrepant from the population distribution of the variable. Thus, outliers should be adjusted to mitigate the effect they have on the test statistic.

3. Outliers deviate from the general model for a reason that cannot be captured by the variables within the model. Explanations of these deviations strengthen a model provided that the explanations leave the hypothesized relationships between the variables intact and attribute the deviation to an unusual supervening effect. They weaken a model if they provide cases in which the variables do not have the expected relationships with one another and no unusual supervening effect can be found. Thus, each outlier should be investigated to determine the most likely reason for deviation, and the results of this determination should inform the interpretation of the model.
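As a sketch of the sensitivity check mentioned before the list (re-running with flagged cases excluded), plus an adjustment in the spirit of position 2 (winsorizing one predictor's extremes): this continues the made-up `df` and `flagged` mask from the earlier sketches, and winsorizing is just one possible adjustment among several.

```python
# Continues the hypothetical sketches above: `df` and `flagged` are assumed to exist.
import numpy as np
import statsmodels.formula.api as smf
from scipy.stats.mstats import winsorize

full = smf.ols("outcome ~ pred1 + pred2", data=df).fit()

# Exclude flagged multivariate outliers and re-fit
trimmed = smf.ols("outcome ~ pred1 + pred2", data=df.loc[~flagged]).fit()

# Pull one predictor's extremes in to the 5th/95th percentiles and re-fit
df_w = df.copy()
df_w["pred1"] = np.asarray(winsorize(df_w["pred1"].to_numpy(), limits=(0.05, 0.05)))
wins = smf.ols("outcome ~ pred1 + pred2", data=df_w).fit()

# Report all versions so readers can see the outliers' overall effect
for label, res in [("all cases", full), ("outliers removed", trimmed), ("winsorized", wins)]:
    print(label, round(res.rsquared, 3), round(res.f_pvalue, 3))
```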
 
The broader point is that there is nothing special about .05, and I wish we'd get away from that notion.

I actually agree with this wholeheartedly, but I don't agree that it allows us to shift our alpha levels post hoc. The problem with saying that .053 isn't materially different from .05, so we should let it count, is that people would never make a decision in the other direction post hoc. Nobody would say at the end of their research that the alpha level should actually have been .047, so the .049 p value means the results are non-significant.

Allowing for "almost significant" or even "trending toward significance" replaces an arbitrary line with a second, fuzzier arbitrary line at some level above .05. Also, to get back to my earlier point, I don't like "almost significant" or "trending toward significant" because they don't tell you anything. Trending toward significant is not functionally any more useful than not significant: it can't be treated as a significant finding, and nothing about the term tells a researcher what he or she should do to resolve the ambiguity.
 

I think the only issue we are taking with what you are saying is that you are arguing as though we are calling our results significant, but that misreads us. Saying "trending" or "approaching" simply qualifies that the results ARE NOT significant but that there is still a noticeable relationship. If we suddenly started treating everything below .1 as significant (which I have seen, fortunately not often), THEN you could complain about us replacing one arbitrary line with another fuzzy arbitrary line. But in this case our line is STILL at .05; we are simply acknowledging the finding in our results, which allows us to discuss it further and essentially make excuses for why it is not significant... (that last part, justifying our negative results, kind of drives me nuts too)

Good points though, and I, like Ollie, agree on the necessity of inspecting your data before running analyses... something I think is not done very often.
 
Thanks everyone. I appreciate the help. I am reading through the arguments here. I wish I was passionate enough about stats to join in, but I'm not. ha

Ultimately, I'm at the whim of my adviser on the phrasing (or whether I mention it at all in the results). I'm finishing up my dissertation, so this isn't for a pub, at least not yet. I'd like to publish it, but it will def be a year or so before I get around to reformatting and submission. I want to settle into my job, pass the EPPP, etc., before returning to it. I just want it done. Starting the discussion section now, so only a few more weeks of work ahead of me. Not really looking forward to digging back into the mess of lit that is my diss topic, though. But screw it, I'm almost free! I just gotta push...
 
The broader point is that there is nothing special about .05, and I wish we'd get away from that notion.

Kazdin makes a really great point about this in Research Design in Clinical Psychology (4th ed.), starting around page 439 (the footnote is priceless). Having Kazdin to cite might help.

Mark
 
Not sure we are really disagreeing here (at least not strongly).

I'm certainly not arguing for a post-hoc, arbitrary "we call everything below .1 significant." I'm perhaps a bit of a rogue on this, but I'd like to see the whole concept of significance disappear. Needing to specify a priori what the significance level should be really just seems like a shortcut to producing papers that are easier to read... not necessarily papers that are more meaningful or valid.

I may be going a bit off the deep end here, but if we are going to keep p values around, I say treat them continuously. I realize this is somewhat impractical, since at some point one needs to make the decision "probably not important," and it would probably make papers more difficult to read, make interpretation somewhat more subjective, and require FAR more effort to review papers adequately. However, I do think it would eliminate these problems of arbitrary cutoffs, and I actually think it would stop a lot of the sketchy data manipulation that goes on to force things over that arbitrary line and renders many things uninterpretable (i.e., we eliminated x participants, transformed the transformed data, etc.).

Agree with Mark on Kazdin's work - he's made some good arguments for this from a more clinical perspective, though there are tons of statisticians arguing something similar.

I think a lot of it derives from our desire to be a "hard science," which is understandable, and certainly I'm not against a priori hypotheses, etc. (though even this I think often gets overstated - exploratory analysis is not inherently "bad" as long as you are up front about what you did... it is HARKing that is a major problem). It shouldn't come as a surprise to anyone here that I favor making psychology more scientific, but I think our analytic approaches are oftentimes far more about appearance than reality.
 
When I have had similar results in experiments, I've generally gone with something like "the results fell short of significance" or maybe even "just short." I think this conveys the fact that your results were close without implying that significance exists when it doesn't. You can then discuss it in more detail in the discussion section. But overall, I think that if you get something like p = .052, you have to at least address it in some fashion. It seems like it would be more misleading not to.
 
I wanted to add that reporting "approaching significance" or "trending," for me, goes both ways for significant and nonsignificant findings, and I don't look at the p value alone; I also tend to look at measures of power (and by extension effect size) as well as confidence intervals. So if something is barely significant, I will report it as such (and that hasn't given me trouble in the, albeit few, high-impact-factor journals I've been fortunate enough to publish in).
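A sketch of what that looks like for a simple two-group comparison: the p value reported alongside a 95% confidence interval for the mean difference and Cohen's d. The data are fabricated.

```python
# Hypothetical data: scores for two made-up groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(50, 10, size=60)
group_b = rng.normal(54, 10, size=60)

t, p = stats.ttest_ind(group_a, group_b)          # pooled-variance t test

n1, n2 = len(group_a), len(group_b)
dof = n1 + n2 - 2
pooled_sd = np.sqrt(((n1 - 1) * group_a.var(ddof=1) +
                     (n2 - 1) * group_b.var(ddof=1)) / dof)
diff = group_a.mean() - group_b.mean()
se = pooled_sd * np.sqrt(1 / n1 + 1 / n2)
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, dof) * se   # 95% CI for the difference
d = diff / pooled_sd                                           # Cohen's d

print(f"t = {t:.2f}, p = {p:.3f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}], d = {d:.2f}")
```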

Additionally, I will say something is trending or approaching significance EVEN IF it contradicts my initial hypotheses. There are certainly ways of writing up results to make them sound interesting, even if they're unexpected. I am very much opposed to HARKing (hypothesizing after results are known), and this approach lends itself to reporting both confirmatory and contradictory findings.

AJ
 
I like reporting all of the p values, myself. Of course you still have to give an alpha or else people complain.
 
Relatedly, do you all report effect sizes? Our stats professor is pretty anti-significance-testing, and he really emphasized the movement toward always reporting effect sizes along with p values (and also not using Cohen's benchmarks, except as a very last resort).
 
Always do if I'm the one doing the results section, though I think some of the other folks in my lab do not.

To me the only question is what statistic to report. We do a lot of fairly complex designs, which unfortunately don't lend themselves well to easy interpretation of effect sizes (what does the effect size for a 3-way interaction in a mixed between-within design mean? Beats me...). I usually end up reporting partial eta squared, just because that seems to be most common in our go-to journals and it's what SPSS spits out, though I usually have a twinge of guilt for not bothering to convert to eta squared, even if that isn't standard practice.
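For what it's worth, the conversion is just a matter of which sums of squares sit in the denominator; a tiny sketch with made-up ANOVA sums of squares:

```python
# Made-up sums of squares from an ANOVA table, for illustration only.
ss_effect = 40.0    # SS for the effect of interest
ss_error = 160.0    # SS for that effect's error term
ss_total = 400.0    # total SS across all effects and error terms in the design

partial_eta_sq = ss_effect / (ss_effect + ss_error)   # what SPSS reports
eta_sq = ss_effect / ss_total                         # classical eta squared

print(round(partial_eta_sq, 3), round(eta_sq, 3))     # 0.2 vs 0.1 for these numbers
```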

Agree with not liking Cohen's cutoffs, though I admit I still use them for power analysis (and indeed, am working on this for a grant right now). I'm generally not a big fan of power analysis, though - obviously studies need a decent number of participants, but I'm not convinced the calculations are any more accurate than the "number out of a hat" method. There are way too many "researcher degrees of freedom" (I love that phrase) unless you are doing an incredibly simple study, and so many of the decisions rely on guesswork that it seems almost worthless to me beyond giving you a general range that most people would probably pick anyway and making a grant application look better.
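And a minimal sketch of the a priori calculation being described, using statsmodels with Cohen's "medium" benchmark (d = 0.5) as the admittedly guesswork-laden input:

```python
# Hypothetical a priori power calculation for a two-sample t test.
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.80, alternative="two-sided")
print(round(n_per_group))   # roughly 64 per group under these assumptions
```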
 

Uh, yeah, isn't it annoying having to do a power analysis for a grant application... so counter-intuitive... Apparently, though, at least for NIH-based funding, this is changing quite quickly, so we won't have to do that in the near future.

P.S. I run into a similar problem with some of the analyses I do. It's not that the results aren't interpretable, because it's quite clear what the results in general are saying, but let's say I run a discriminant function analysis with rates of change (instead of fixed values). Ever seen the possible outputs of a discriminant function analysis when you are less concerned with the overall model than with the individual discriminations? Ewwww. I also have similar fun with mixed models/random-effects modelling (actually, let's just say with all generalized linear models, lol).
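For the curious, a sketch of that rates-of-change approach: estimate a per-subject slope for each measure first, then run the discriminant function analysis on the slopes. Everything here (data, measure names, group labels) is made up.

```python
# Made-up long-format data: 40 subjects x 3 visits, two measures, two groups.
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(3)
rows = []
for subj in range(40):
    group = "A" if subj < 20 else "B"
    decline = 0.0 if group == "A" else -1.5         # group B declines faster (fabricated)
    for time in range(3):
        rows.append({"subject": subj, "group": group, "time": time,
                     "memory": 50 + decline * time + rng.normal(0, 2),
                     "exec_fn": 50 + 0.5 * decline * time + rng.normal(0, 2)})
long = pd.DataFrame(rows)

# One least-squares slope (rate of change) per subject per measure
slopes = long.groupby("subject").apply(
    lambda g: pd.Series({"memory_slope": np.polyfit(g["time"], g["memory"], 1)[0],
                         "exec_slope": np.polyfit(g["time"], g["exec_fn"], 1)[0]}))
groups = long.groupby("subject")["group"].first()

# Discriminant function analysis on the slopes rather than fixed scores
lda = LinearDiscriminantAnalysis().fit(slopes, groups)
print(lda.coef_)   # weight of each rate-of-change variable in the discriminant function
```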

Good times
 