Biostats question for my summer research project...

Jumb0
Research n00b here. I am working on a protocol for my summer research project. It's a retrospective cohort study, and I'm trying to do a sample size / power analysis. We are shooting for a power of 0.8 and a 95% confidence level. Since our outcome of interest is rare (occurring 5% of the time or less), it's looking like we will need a pretty large sample size...

I am fiddling around with some sample size calculators, and they ask you to supply an "odds ratio" that you aim to detect. What is considered a decent odds ratio to aim to detect? I consulted with a statistician, and he said it really depends on what is "clinically meaningful" for the particular field, but I'm not quite sure what to make of that... So, I asked him to just give me a generally acceptable OR to try, and he suggested 2.0. We plugged that into the calculator and got an N of just over 1000, which I think is doable for our project. Now, am I understanding this correctly: this means that we would be able to detect no less than a twofold increase in risk? If I lower the OR to 1.5 in the calculator, our required N balloons to about 3500, which I'm afraid would be impossible given the number of cases we have access to (not to mention that, even if it were possible, it would require a huge amount of work for me, which would not be much fun)...
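As a rough illustration of where numbers like these can come from, here is a sketch using the standard two-proportion normal-approximation formula, assuming a 5% baseline event rate, a two-sided alpha of 0.05, and 80% power (the calculator itself may do something slightly different):

```python
import math
from statistics import NormalDist

def n_per_group(p_baseline, odds_ratio, alpha=0.05, power=0.80):
    """Per-group sample size for comparing two proportions (normal approximation)."""
    odds_exposed = odds_ratio * p_baseline / (1 - p_baseline)
    p_exposed = odds_exposed / (1 + odds_exposed)   # event probability implied by the target OR
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    p_bar = (p_baseline + p_exposed) / 2
    numerator = (z_a * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_b * math.sqrt(p_baseline * (1 - p_baseline)
                                   + p_exposed * (1 - p_exposed))) ** 2
    return math.ceil(numerator / (p_exposed - p_baseline) ** 2)

for target_or in (2.0, 1.5):
    n = n_per_group(0.05, target_or)
    print(f"OR = {target_or}: ~{n} per group, ~{2 * n} total")
# OR = 2.0: ~516 per group, ~1032 total   (roughly the "just over 1000")
# OR = 1.5: ~1689 per group, ~3378 total  (roughly the ~3500)
```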

I should clarify that we are trying to prove that a certain procedure does NOT carry a significantly increased risk of a bad outcome and should thus be a treatment option...So, if we go with the OR of 2.0, does that mean that, at the end of the day, the most we could say would be "Procedure X will not double your risk of the bad outcome" ? If so, idk..that doesn't seem super compelling to me. Am I misinterpreting all of this? Like I said, I don't have any experience with research, so for all I know an OR of 2.0 could be well and good.

Thoughts? Thank you :)

To answer your question: in essence, no. An odds ratio is pretty much exactly what it sounds like: the odds of the outcome of interest in those who are exposed divided by the odds in those who aren't. I'm not sure how you're entering your numbers, but, generally, an OR of 2 for the scenario you've given would indicate that the procedure doubles the odds of your outcome (roughly a doubling of risk, since your outcome is rare). The details you gave are a bit vague, but I'll take a shot at giving you an example that seems to be along the lines of what you're looking for.

Example: Procedure X is anecdotally linked to an increase in infections, so Procedure Y is the preferred treatment. Procedure X has some benefit, though, so we want to know whether it actually does have a higher incidence of post-operative infection.

What are the odds that a patient gets an infection after Procedure X compared to Procedure Y? For a scenario like this, you actually want an odds ratio at or below 1. An OR of 1.1 means the odds of infection are 10% higher if you do X instead of Y (roughly a 10% increase in risk when infection is uncommon). An OR below 1 would mean the odds of infection are higher with Y than with X. An OR of exactly 1 would mean equal odds.
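For concreteness, here is a minimal sketch of how an odds ratio falls out of a 2x2 table; the counts are invented purely for illustration:

```python
# Hypothetical 2x2 table: rows = procedure, columns = infection outcome.
x_infected, x_clean = 12, 188   # Procedure X: 12/200 infected
y_infected, y_clean = 10, 190   # Procedure Y: 10/200 infected

odds_x = x_infected / x_clean            # odds of infection after X
odds_y = y_infected / y_clean            # odds of infection after Y
odds_ratio = odds_x / odds_y             # OR comparing X to Y

risk_x = x_infected / (x_infected + x_clean)
risk_y = y_infected / (y_infected + y_clean)
risk_ratio = risk_x / risk_y             # RR, for comparison

print(f"OR = {odds_ratio:.2f}, RR = {risk_ratio:.2f}")
# With a rare outcome like this, the OR (~1.21) and RR (~1.20) are nearly the same.
```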

To critique your experiment: it seems you're trying to prove the null hypothesis. Also, since your outcome of interest is rare, it might be more worthwhile to look into a case-control study, where you pick your 'cases' by who has the outcome, pick an otherwise comparable control group who didn't have the outcome, and then look back at what their exposure was. That reduces the number of people you need to include.

Finally, to comment on what the statistician said about clinical significance (CS, because I'm getting lazy typing everything out). CS changes from field to field and from study to study, and it is not the same thing as statistical significance (p-values and confidence intervals). For example, a drug that reduces 1-year mortality by 30% is clinically insignificant for someone treating basal cell carcinoma, which already has an extremely low 1-year mortality rate. Assume 1-year mortality for BCC is 1%: a 30% relative reduction would lead to a new rate of 0.7%. Not really a big deal.

But for someone treating GBM, a 30% reduction in 1-year mortality is the new wonderbread. Let's assume GBM has a 70% 1-year mortality (not too far off from the truth, sadly). A 30% relative reduction would drop mortality to 49%. Same relative risk (0.7, if you're interested; note that the odds ratios at these two baseline rates are not the same), different context, different interpretation.
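A quick sketch of those two scenarios, using the numbers above, shows how the same relative risk maps to quite different odds ratios depending on the baseline rate:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

scenarios = {
    "BCC (rare outcome)":   (0.01, 0.007),  # 1% baseline mortality, 30% relative reduction
    "GBM (common outcome)": (0.70, 0.49),   # 70% baseline mortality, 30% relative reduction
}

for name, (p_control, p_treated) in scenarios.items():
    rr = p_treated / p_control
    odds_ratio = odds(p_treated) / odds(p_control)
    print(f"{name}: RR = {rr:.2f}, OR = {odds_ratio:.2f}")

# BCC: RR = 0.70, OR ~ 0.70  (the OR approximates the RR when the outcome is rare)
# GBM: RR = 0.70, OR ~ 0.41  (the OR overstates the relative effect for common outcomes)
```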
 

Thank you very much for your reply. That was very informative. I may have misinterpreted what the calculator was asking for in the field where we input "2": what it asked for was "Detectable/alternative," which I presumed to mean a target odds ratio. Interestingly, the statistician said the exact same thing about maybe doing a case-control study instead in order to reduce the sample size. I am still a bit confused about how the methodology of the case-control study would differ. Let me further explain what we intend to prove, as I wasn't clear enough in the OP:

There is a subtype of a certain disease for which there is an ongoing debate regarding treatment guidelines.
There are 2 treatment modalities for this disease, X and Y.
X treatment is more aggressive and typically reserved for the "high-risk" subtype of the disease.
Y treatment is milder and usually offered only in the subtype that is graded "low risk."
We are looking at the intermediate risk subtype, for which there is no consensus on what treatment to use.
The outcome of interest would be relapse of the disease.
So, we ultimately want to prove that it's OK to use treatment Y, the milder treatment, on this intermediate risk subtype because doing so does not cause a statistically significant increase in relapse.

So, my idea was that we find all the patients who were diagnosed with the intermediate subtype and see if there was a statistically significant difference in relapse rate between those in the group who received treatment X vs those who received treatment Y. I believe this is considered a retrospective cohort study.

Now, you and the statistician have both told me that a case-control design might help reduce the N, but I don't understand how the design would be constructed. The way I understand case-control studies is that you work backwards from the outcome, i.e., you take a group of people who developed an outcome of interest and a control group who didn't develop the outcome, and then you look back in their history to see if there was an exposure that significantly differed between them. How would that translate to my study? So, I would find all the people w/ the disease subtype of interest who had recurrence after 2 years of being treated vs. a group of people w/ the same disease subtype who didn't have recurrence after 2 years, and then see whether the treatment modality they received significantly differed?
 
The distinction between a retrospective cohort and a case-control study is subtle and confusing and I am not great at explaining it, but you pretty much have it. Case-control studies are great for rare outcomes since you start by selecting people with the outcome of interest. In your case, you would have your cases as intermediate-disease people with relapse after 2 years and your controls as intermediate-disease people without relapse. Then you work backwards: of the relapse group, how many received treatment X vs Y? What about in the no-relapse group? Just as you said.

You have the right of it with the retrospective cohort study too. Assume there is a time z between diagnosis and the end of treatment, after which the 2-year window for relapse starts. You would start by getting a (very) large group of people diagnosed z+2 years ago (or earlier, if you need to increase your n) and dividing them into their treatment groups, and only after that look at their relapse outcome. If you include those diagnosed more than z+2 years ago, you would stop looking at relapse outcomes 2 years after their treatment ended (so that you don't have some people with data for relapse after 3 years - does that make sense?). Since relapse is rare, do you see how you would need many more people for a retrospective cohort analysis? If you do a case-control study, a bigger group will just strengthen your association (or lack thereof); in a retrospective cohort, a large n is what ensures you have enough people with relapse to get a respectable power. The drawback is that case-control studies only prove association, so you'd need to do another study to show causation.

And now this gets picky: you're actually looking for a negative result. You're looking to disprove the hypothesis that "Treatment with Y is associated with a higher incidence of relapse among those with the intermediate form of disease compared to treatment with X." So you're looking to have an odds ratio with the null value in the confidence interval. I'm no statistician, so I would run this all by them (and also the "detectable/alternative" question, but my guess too is that it's an odds ratio), but that's how I would approach this. You could then turn around with the results and start a prospective cohort study.
 
To critique your experiment: it seems you're trying to prove the null hypothesis. Also, since your outcome of interest is rare, it might be more worthwhile to look into a case-control study, where you pick your 'cases' by who has the outcome, pick an otherwise comparable control group who didn't have the outcome, and then look back at what their exposure was. That reduces the number of people you need to include.
I'll clarify this since it doesn't look like you took the next step, but the important thing here is that the OP should note that you cannot prove the null hypothesis, and failing to reject the null hypothesis does not constitute evidence in favor of the null hypothesis. In other words, not getting a significant result does not in any way, shape, or form suggest, imply, or support the null hypothesis.

The distinction between a retrospective cohort and a case-control study is subtle and confusing and I am not great at explaining it, but you pretty much have it. Case-control studies are great for rare outcomes since you start by selecting people with the outcome of interest. In your case, you would have your cases as intermediate-disease people with relapse after 2 years and your controls as intermediate-disease people without relapse. Then you work backwards: of the relapse group, how many received treatment X vs Y? What about in the no-relapse group? Just as you said.
A retrospective cohort study looks in the chart and determines grouping by exposure first, without any knowledge of the outcome. The sampling starts with the exposures and is similar to the prospective cohort study, but it is using data that already exists. The case-control study begins by ascertaining the outcome status, cases (diseased/affected) and controls (not diseased). Then in each of these groups, the exposure status is determined. The method of sampling is the main difference in the case-control and retrospective cohort studies.
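Here is a rough sketch of how the two sampling schemes differ when starting from the same chart-review data (pandas assumed; the column names and counts are invented for illustration):

```python
import pandas as pd

# Hypothetical chart-review data.
records = pd.DataFrame({
    "patient_id": range(1, 9),
    "treatment":  ["X", "Y", "X", "Y", "Y", "X", "Y", "X"],   # exposure
    "relapse":    [0,   0,   1,   0,   0,   0,   1,   0],     # outcome within follow-up
})

# Retrospective cohort: group by exposure FIRST, then look at the outcome in each group.
cohort = records.groupby("treatment")["relapse"].agg(["sum", "count"])
print(cohort)   # relapses and group sizes per treatment arm

# Case-control: select by OUTCOME first (all cases plus a sample of controls),
# then look back at the exposure within each group.
cases = records[records["relapse"] == 1]
controls = records[records["relapse"] == 0].sample(n=len(cases), random_state=0)
case_control = pd.concat([cases, controls])
print(case_control.groupby("relapse")["treatment"].value_counts())
```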

The drawback is that case-control studies only prove association, so you'd need to do another study to show causation.
Another important distinction is that nothing is "proven". The idea is finding out to what degree the data disagree with the null hypothesis. Sufficiently strong disagreement of data with the null of no association can support an association, but doesn't prove an association.

And now this gets picky: you're actually looking for a negative result. You're looking to disprove the hypothesis that "Treatment with Y is associated with a higher incidence of relapse among those with the intermediate form of disease compared to treatment with X." So you're looking to have an odds ratio with the null value in the confidence interval.
Again, this comes back to the issue before, that in the traditional frequentist framework for hypothesis testing, you can't prove the null hypothesis. Failing to reject the null doesn't indicate support for the null. The same thing goes for a confidence interval: any value falling within the confidence interval isn't "proven" (which should be readily apparent given the connection between confidence intervals and hypothesis tests). Having the OR's null value inside the confidence interval would be no different from failing to reject a hypothesis test that used that value in the null hypothesis, which wouldn't solve any of the OP's problems. I'd look into equivalence testing (and noninferiority testing, to be complete). Here is a brief start: Understanding Equivalence and Noninferiority Testing
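As a very rough sketch of the confidence-interval version of a non-inferiority test (the margin and the counts are made up, and the real analysis should come from the statistician):

```python
import math
from statistics import NormalDist

# Hypothetical counts: relapse vs no relapse under each treatment.
y_relapse, y_ok = 20, 480   # treatment Y (milder)
x_relapse, x_ok = 18, 482   # treatment X (aggressive, reference)

log_or = math.log((y_relapse / y_ok) / (x_relapse / x_ok))
se = math.sqrt(1 / y_relapse + 1 / y_ok + 1 / x_relapse + 1 / x_ok)

z = NormalDist().inv_cdf(0.975)            # 1.96 for a 95% CI
upper = math.exp(log_or + z * se)          # upper bound of the CI for the OR

margin = 2.0   # pre-specified non-inferiority margin (assumed; must be justified clinically)
print(f"Upper 95% bound for the OR = {upper:.2f}")
if upper < margin:
    print("Upper bound is below the margin: evidence of non-inferiority at this margin.")
else:
    print("Cannot conclude non-inferiority at this margin.")
```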

I'd also recommend this seminar to make sure you have an adequate overview of how to do a power analysis (it has multiple parts and can get complex rather quickly, e.g., for multivariable regressions, but even a t-test requires a good understanding beyond just knowing how to get a number): Introduction to Power Analysis - IDRE Stats
It's very much a garbage-in, garbage-out process, and it's often far more complex than the couple of quick calculations the software does.
 
Thank you for the replies, y'all. This convo is really helpful.


Can't I just frame the null / alternative hypotheses differently? I feel like it's kind of arbitrary, in a sense. Allow me to explain. I know the null hypothesis is the hypothesis of "no change," but aren't the experimenters at liberty to decide what the null case is? It's just a placeholder of sorts, no? In other words, instead of saying that the null hypothesis is "Procedure Y is risky" and designing an experiment to reject that in favor of the alternative, you could just as easily frame it inversely, i.e. take as the null hypothesis that "Procedure Y is NOT risky" and then test the alternative hypothesis. Plus, like I mentioned in an earlier post, there currently is no consensus in the community regarding the question we are seeking to answer here, which further bolsters the notion of choosing either side of the coin as the null. I mean, the raw data will be the same in the end, regardless of what we hypothesize.
 
I'll ask you to clarify a few things:
1) does the statistician you consulted with have at least an MS in statistics or biostatistics? If so, they should be able to tell you in their sleep that it's not just as simple as stating that the null is the inverse of what you were investigating.
2) what do you mean by "there is no consensus in the community...which further bolsters choosing either side... as the null." Who is "the community" and what don't they agree on?
3) What is it that you want to do, exactly? I initially skipped the part in your first post about the OR of 2 indicating twice the odds in one group relative to the other. If ruling out a twofold increase is actually what you want to do, you could just conduct a one-tailed test with a null of Ho: OR = 2 (technically >= 2) and Ha: OR < 2. A significant result would indicate that the true OR is significantly less than 2 (or, equivalently, use a CI and check that 2 is larger than the upper bound of the interval, i.e., the CI lies entirely below 2). Keep in mind that this doesn't suggest that the OR is 1 or anything; it just says the data disagree with the null that the OR is 2 or greater -- you can only conclude the true OR is significantly less than 2.

You are also right that the way the null is taught as "no change" or "no relationship" isn't really teaching anyone stats; it's teaching them to answer a limited range of test questions. That's why I said you could set your null as "the true OR = 2" with the alternative hypothesis that the true OR is less than 2 (you can change the null to an extent, but it should have an "=" in it unless you're using one of the procedures in the paper; you can see this in its first figure and table). If you truly want "no increased risk," as you mentioned in your first post, you won't logically be able to conclude that from a hypothesis test that fails to reject with a null of OR = 1 (or, similarly, from a CI that contains 1). You would need to use the testing procedures I linked in my other post (probably noninferiority). The link gives a good overview showing that there's more to the procedure than just saying your null and alternative are swapped (which must be the case for the reasons I mentioned earlier: a value falling inside a CI is not "proven" or "supported," in the same way that a non-significant hypothesis test doesn't prove or support the null).

A quick and dirty explanation of why you can't use the same hypothesis test but flip the null and alternative: you can't say that the null is that the true OR is less than 2 and the alternative is that the true OR is equal to or greater than 2, because the calculation you're most likely familiar with is done assuming the null is that the OR equals 2. If you only flip your thinking, the calculation and procedure don't reflect that change. This is why you need a procedure like the one outlined in the paper. However, as I mentioned before, if you just want to say there's significant evidence to conclude the true OR is less than 2, you can use a classical one-tailed significance test. Rejecting the null that the true OR >= 2 will only allow you to conclude in favor of the alternative (suggest that the true OR < 2) -- but that's it. It wouldn't answer your question of "no increased risk" without getting into non-inferiority/equivalence testing.
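A minimal sketch of that one-tailed test on the log-odds-ratio scale (a Wald-type test with invented counts, not necessarily what your statistician would choose):

```python
import math
from statistics import NormalDist

# Hypothetical 2x2 counts: relapse vs no relapse for treatments Y and X.
y_relapse, y_ok = 22, 478
x_relapse, x_ok = 19, 481

log_or_hat = math.log((y_relapse / y_ok) / (x_relapse / x_ok))
se = math.sqrt(1 / y_relapse + 1 / y_ok + 1 / x_relapse + 1 / x_ok)

# One-tailed Wald test of Ho: OR >= 2 vs Ha: OR < 2, done on the log scale.
z = (log_or_hat - math.log(2)) / se
p_value = NormalDist().cdf(z)     # small p only when the estimate is well below 2

print(f"estimated OR = {math.exp(log_or_hat):.2f}, z = {z:.2f}, one-sided p = {p_value:.3f}")
# A small p-value lets you conclude the true OR is less than 2 -- and nothing more.
```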

I also noticed I skipped this earlier:

Now, am I understanding this correctly: this means that we would be able to detect no less than a twofold increase in risk?
The power level actually means that if the true OR equals the alternative value you set in the power calculation (some value less than 2), you'll have an 80% chance of rejecting the null that the true OR is equal to or greater than 2 (i.e., of detecting a true difference).

Thanks again for your help. To answer your clarifying questions.

1. It appears I was wrong about being able to change the null. I didn't get into that with the statistician (who has a PhD for the record), but I'm sure he would have said the same things as you.

2, 3. I outlined this above. Allow me to restate. This should make the background perfectly clear:


There are 2 main treatment modalities for the disease we are studying, Treatment X and Treatment Y.
X treatment is more aggressive and typically reserved for the "high-risk" subtype of this disease.
Y treatment is milder and usually offered only in the subtype that is graded "low risk."
Our study is specifically looking at the intermediate risk subtype, for which there currently is no consensus on which treatment to use. In fact, there is ongoing debate in the medical community of specialists who treat this disease, and historically major organizations have even published conflicting treatment guidelines, some saying it's OK to use Treatment Y and others saying, "No, the intermediate type is still too 'high risk' and therefore should be treated only with Treatment X."

The primary outcome of interest in our study would be relapse of the disease.
So, we ultimately want to show/suggest that it's OK to use treatment Y, the milder treatment, on this intermediate risk subtype disease because doing so does not cause a statistically significant increase in relapse. I should mention that the relapse rate for this disease is quite low (on average, regardless of treatment modality)...like less than 5%.

That last bit is the crux. I am open to any statistical design that accomplishes our end goal. I want to keep it as simple as possible. Now, having understood exactly what it is that we are trying to accomplish here, does that change your recommendations/advice at all?

Also, I would like to ask your opinion on the OR of 2.0 that we may be setting as our alternative (which is why I originally made this thread). For lack of a better way of phrasing this: is this a "good" OR to set as the alternative? What I mean is, is it within the bounds of what is considered common practice in clinical research biostats? If you read a paper that had set its alternative OR at 2, would you respect that? I just don't want to build my whole study around this only to find out later that we picked laughably weak parameters that no one will take seriously. I know this alternative OR is supposed to be based on what's "clinically meaningful," but knowing what you know now about the background, do you think the OR of 2 is appropriate? Remember, it's a highly treatable disease with an overall relapse rate of roughly < 5%... We tried fiddling with different ORs in the calculators, and if we set it at 1.5, the required sample size exploded to an unmanageable number. I guess I just want some reassurance that the OR of 2 is a respectable value to set.

Thank you!!!
 
Thanks again for your help. To answer your clarifying questions.

1. It appears I was wrong about being able to change the null. I didn't get into that with the statistician (who has a PhD for the record), but I'm sure he would have said the same things as you.
I would get as much face time and feedback from the statistician as possible then. A PhD in statistics is as good as it will get, so maybe after a few more ideas are hammered out you can go back to the statistician to get his final word.

2, 3. I outlined this above. Allow me to restate. This should make the background perfectly clear:


There are 2 main treatment modalities for the disease we are studying, Treatment X and Treatment Y.
X treatment is more aggressive and typically reserved for the "high-risk" subtype of this disease.
Y treatment is milder and usually offered only in the subtype that is graded "low risk."
Our study is specifically looking at the intermediate risk subtype, for which there currently is no consensus on which treatment to use. In fact, there is ongoing debate in the medical community of specialists who treat this disease, and historically major organizations have even published conflicting treatment guidelines, some saying it's OK to use Treatment Y and others saying, "No, the intermediate type is still too 'high risk' and therefore should be treated only with Treatment X."
Thanks for clarifying.

The primary outcome of interest in our study would be relapse of the disease.
So, we ultimately want to show/suggest that it's OK to use treatment Y, the milder treatment, on this intermediate risk subtype disease because doing so does not cause a statistically significant increase in relapse. I should mention that the relapse rate for this disease is quite low (on average, regardless of treatment modality)...like less than 5%.
So, based on your statement that treatment Y "does not cause a statistically significant increase in relapse," let me clarify: if you used a "traditional" hypothesis test, your hypotheses are
Ho: Odds of relapse for intermediate disease treated with Y is less than or equal to the odds of relapse for intermediate disease treated with X (an OR <= 1 comparing Y to X)
Ha: odds of relapse for intermediate disease treated with Y is greater than the odds of relapse for intermediate disease treated with X (an OR > 1 comparing Y to X)

Based on that statement, this is how I've interpreted your idea: if you fail to reject the null (Ho), then there is not a statistically significant increase in the odds of relapse comparing treatment Y to X. It seems that you want to take that result and say, "No statistically significant increase in odds comparing Y to X, therefore treatment Y is not 'riskier' than treatment X" (I say odds rather than risk to keep the OR/RR distinction, though if the prevalence is low enough the two become similar) -- in other words, deciding Ho is true.

If I've understood your train of thought, then here is why you can't take that approach. Failure to find statistical significance (failing to reject Ho) does not constitute evidence or support for the null (Ho). In other words, just because you did not observe a statistically significant increase in odds comparing Y to X does not mean that treatment Y is as "good" as or "better" than treatment X in terms of relapse (which seems to be your goal). "Absence" of evidence for the alternative does not prove or suggest that the null is true (really, it's about seeing how much "disagreement" the data have with the null, rather than any kind of support). One of the reasons for this is that there are many possible choices of null hypothesis that would also fail to produce a statistically significant result (meaning they are also "compatible" with the data as different variants of the null). These tests are also done assuming the null is true and looking for evidence that's incompatible with that assumption, so it would be illogical to then conclude the null is true using a procedure that assumes the null is true.

I think there are a couple of approaches, but it sounds like non-inferiority testing might be the best fit for you. Because the testing procedure is different, you could have a null that treatment Y is inferior to X in the intermediate subtype. Rejecting this null would mean there is sufficient evidence to conclude that treatment Y is non-inferior (could be superior or equal) to treatment X with respect to the z-year relapse risk (in the form of odds). (Or you could use equivalence testing if you only want to find evidence that they are equivalent within a clinically relevant range.) This seems like the best and most straightforward approach to me, assuming that all you want to do is be able to say that treatment Y isn't worse than treatment X, and therefore we should use treatment Y because it is less intense and shows evidence of at least as good outcomes.

Since you're not going to have the luxury of randomization, it probably makes sense to use a multiple logistic regression (but you may have to use a different estimation method if you only have a few events in one group, this can cause estimation to be less than ideal [a common problem with rare events-- data for events is "sparse"]...it's possible the statistician might think a Poisson regression is best in this case, but estimation methods like maximum likelihood vs penalized likelihood will still depend on how few events are in the smaller group, irrespective of using a Poisson vs Logistic regression...the Poisson will approximate the binomial in appropriate cases, but the logistic regression is based on the binomial distribution which is why I think a logistic model is fine). This will allow you to somewhat mitigate the effects of potential confounders and will allow for a more accurate estimate of the OR of interest, but it will somewhat depend on the group with the fewest events, which is why I'd recommend talking with the statistician about this. The point is that you want to account for variables that can reasonably impact the outcome, but you'll likely have limitations in the modeling process due to sparse data relative to the number of potential predictors and any transformations of the predictors (not a terrible thing, but you'll probably want some guidance from the statistician). Once you've done the modeling, you would just get a confidence interval for the (adjusted) OR (which just means from the model that accounts for the other independent variables) and use the non-inferiority or equivalence testing procedure (as far as I know, but again the statistician should be able to clarify this for you).
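As a sketch of what the adjusted-OR step could look like (statsmodels assumed; the covariates, coefficients, and data here are entirely made up, and the modeling decisions themselves should come from the statistician):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in for the chart-review dataset; all names and effects are hypothetical.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treatment_y": rng.integers(0, 2, n),      # 1 = treatment Y, 0 = treatment X
    "age":         rng.normal(60, 10, n),      # example covariates to adjust for
    "tumor_size":  rng.normal(2.0, 0.5, n),
})
logit_p = -3.2 + 0.1 * df["treatment_y"] + 0.02 * (df["age"] - 60)
df["relapse"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

# Multiple logistic regression: adjusted log-odds of relapse for treatment Y vs X.
model = smf.logit("relapse ~ treatment_y + age + tumor_size", data=df).fit(disp=0)

adjusted_or = np.exp(model.params["treatment_y"])
ci_low, ci_high = np.exp(model.conf_int().loc["treatment_y"])
print(f"adjusted OR (Y vs X) = {adjusted_or:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")
# The upper CI bound is what you'd compare against a pre-specified non-inferiority margin.
```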

That last bit is the crux. I am open to any statistical design that accomplishes our end goal. I want to keep it as simple as possible. Now, having understood exactly what it is that we are trying to accomplish here, does that change your recommendations/advice at all?
I'll stress that if you just calculate an odds ratio for the two groups, you're not likely to see a good estimate of the true OR that's purely due to the differences in the treatments. The modeling and accounting for independent variables will be really important to "soak up" effects that aren't really due to the treatments but might contribute to variability in relapse between the two treatment groups since there wasn't randomization.

Also, I would like to ask your opinion on the OR of 2.0 that we may be setting as our alternative (which is why I originally made this thread). For lack of a better way of phrasing this: is this a "good" OR to set as the alternative? What I mean is, is it within the bounds of what is considered common practice in clinical research biostats? If you read a paper that had set its alternative OR at 2, would you respect that? I just don't want to build my whole study around this only to find out later that we picked laughably weak parameters that no one will take seriously. I know this alternative OR is supposed to be based on what's "clinically meaningful," but knowing what you know now about the background, do you think the OR of 2 is appropriate? Remember, it's a highly treatable disease with an overall relapse rate of roughly < 5%... We tried fiddling with different ORs in the calculators, and if we set it at 1.5, the required sample size exploded to an unmanageable number. I guess I just want some reassurance that the OR of 2 is a respectable value to set.

Thank you!!!
As you were told before, what dictates a reasonable effect size is situational. Ultimately, you've got to work within your constraints, but the statistician should be able to help you generate a profile showing the power of different sample sizes under different assumptions about the OR (this will be more involved if you do end up using modeling). I don't think people will "laugh" at an OR of 2. It also wouldn't hurt to get some clinical perspective on "clinically meaningful" in terms of probabilities. For example, keeping the numbers simple, an OR of 2 could mean that the probability of relapse in group Y (I'm assuming Y is in the numerator) is .67 (with .33 being the probability of no relapse in Y) while the probability of relapse in X is .5 and the probability of no relapse in X is .5: [.67/.33]/[.5/.5] = 2. So I would recommend looking at aggregate data from somewhere to determine relapse probabilities for intermediate disease treated with X or with Y. Let's say the relapse rate is 25% in X and 75% have no relapse in X (do this for both groups). Take this to a clinician, explain that this is what it currently is in group X, and ask: what relapse rate would be clinically meaningful to you? Asking a few physicians might be useful. Once you have the probability they say is meaningful, you can convert it to odds and form a ratio of those "clinically meaningful odds" to the empirical odds of the reference group. That would then be the OR you think is clinically meaningful to detect. This might be easier to work with because people conflate odds and risk, and the two are quite different in most cases. I think using probabilities directly and then converting to odds will mitigate issues with intuition around odds and odds ratios (again, an OR of 2 isn't really twice the risk unless the OR is nearly equal to the RR; in the example above the risk is actually 67% in group Y vs 50% in group X).
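A tiny sketch of that conversion, using made-up numbers (an empirical 25% relapse rate under X and a hypothetical "clinically meaningful" 40% under Y):

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

p_reference  = 0.25   # empirical relapse probability under treatment X (assumed)
p_meaningful = 0.40   # relapse probability a clinician would call meaningful (assumed)

clinically_meaningful_or = odds(p_meaningful) / odds(p_reference)
print(f"OR to power the study to detect: {clinically_meaningful_or:.2f}")
# odds(0.40) = 0.667 and odds(0.25) = 0.333, so the OR is 2.0 even though
# the risk only went from 25% to 40% (a 1.6x increase in risk, not 2x).
```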

As I've said before, this will come back to limitations you have based on the study design. All else the same, you'll need a larger sample to discern smaller effect sizes, so it makes sense that decreasing the OR will dramatically increase the sample size.

Sorry it was a bit of a long post, but after reading it and thinking some of this over, you should at least have a good idea of what you'd want to mull over with the statistician. If you have access to them and they can help you plan and conduct the analysis, I'd highly recommend it, as there is much more to good modeling than just putting the data in and getting an output.

I hope some of this was helpful and feel free to PM me or post again if you have other questions.

 
@dempty , you are amazing!!! Thank you so much for taking the time to write up this detailed analysis. This is incredibly helpful! I have much to discuss with the statistician now :D
 
Glad it was helpful! As I said, if you have access to the statistician, it'll be hard for you to go wrong by sitting with them and asking questions and talking through how to do what you want to do. Take notes if you want, there's a lot to be learned (just the same as if you were asking a cardiologist about something related to cardio, for example, the experts are great resources if they have time).
 