Thanks again for your help. To answer your clarifying questions.

1. It appears I was wrong about being able to change the null. I didn't get into that with the statistician (who has a PhD for the record), but I'm sure he would have said the same things as you.

I would get as much face time and feedback from the statistician as possible then. A PhD in statistics is as good as it will get, so maybe after a few more ideas are hammered out you can go back to the statistician to get his final word.

2, 3. I outlined this above. Allow me to restate. This should make the background perfectly clear:

There are 2 main treatment modalities for the disease we are studying, Treatment X and Treatment Y.

X treatment is more aggressive and typically reserved for the "high-risk" subtype of this disease.

Y treatment is milder and usually offered only in the subtype that is graded "low risk."

Our study is specifically looking at the **intermediate risk subtype**, for which there is currently no consensus on which treatment to use. In fact, there is ongoing debate in the medical community of specialists who treat this disease, and historically major organizations have even published conflicting treatment guidelines: some say it's OK to use Treatment Y, while others say the intermediate subtype is still too "high risk" and should therefore be treated only with Treatment X.

Thanks for clarifying.

The primary outcome of interest in our study would be relapse of the disease.

So, we ultimately want to show/suggest that it's OK to use treatment Y, the milder treatment, on this intermediate risk subtype disease because doing so **does not cause a statistically significant increase in relapse**. I should mention that the relapse rate for this disease is quite low (on average, regardless of treatment modality)...like less than 5%.

So based on the bold, let me clarify: if you used a "traditional" hypothesis test, your null is

Ho: The odds of relapse for intermediate disease treated with Y are **less than or equal to** the odds of relapse for intermediate disease treated with X (an OR <= 1 comparing Y to X).

Ha: The odds of relapse for intermediate disease treated with Y are **greater than** the odds of relapse for intermediate disease treated with X (an OR > 1 comparing Y to X).
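To make that concrete, here is a minimal sketch of that one-sided test in Python. The 2x2 table is entirely made up, just to illustrate the mechanics:

```python
from scipy.stats import fisher_exact

# Hypothetical counts (made up for illustration):
# rows = treatment (Y, X), columns = (relapse, no relapse)
table = [[6, 94],   # treatment Y: 6 relapses out of 100
         [4, 96]]   # treatment X: 4 relapses out of 100

# One-sided Fisher's exact test of Ha: odds(relapse | Y) > odds(relapse | X)
odds_ratio, p_value = fisher_exact(table, alternative="greater")
print(f"sample OR = {odds_ratio:.2f}, one-sided p = {p_value:.3f}")
```

A large p-value here would mean "fail to reject Ho," which, as discussed below, is not the same as evidence that Ho is true.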

Based on your statement in bold, this is how I've interpreted your idea. Then, you said that if you fail to reject the null (Ho), there is not a statistically significant increase in the odds of relapse comparing treatment Y to X. It seems that you want to take this result and say, "No statistically significant increase in odds comparing Y to X, therefore treatment Y is not 'riskier' than treatment X" (deciding Ho is true). (I'm using "riskier" loosely to denote the difference between an OR and an RR, but if the prevalence is low enough, the two become similar.)

If I've understood your train of thought, then here is why you can't take that approach. Failure to find statistical significance (failing to reject Ho) does not constitute evidence or support for the null (Ho). In other words, just because you did not observe a statistically significant increase in odds comparing Y to X does not mean that treatment Y is as **"good"** as or **"better"** than treatment X **in terms of relapse** (which seems to be your goal). "Absence" of evidence for the alternative does not prove or suggest the null is true

**(really, it's about seeing how much "disagreement" the data have with the null, rather than any kind of support)**. One reason is that there are many other possible choices of null hypothesis that would also fail to produce a statistically significant increase (meaning they, too, are "compatible" with the data as different variants of the null). These tests are also done assuming the null is true and looking for evidence that is incompatible with that assumption, so it would be illogical to then conclude the null is true using a procedure that assumes it is.

I think there are a couple of approaches, but it sounds like non-inferiority testing might be the best fit for you. Because the testing procedure is different, you could have a null that treatment Y is inferior to X in the intermediate subtype. Rejecting this null would mean there is sufficient evidence to conclude that treatment Y is non-inferior (it could be superior or equivalent) to treatment X with respect to the z-year relapse risk (in the form of odds). (Or you could use equivalence testing if you only want to find evidence that they are equivalent within a clinically relevant range.) This seems like the best and most straightforward approach to me, assuming that all you want is to be able to say that treatment Y isn't worse than treatment X, and therefore we should use treatment Y because it is less intense and shows evidence of at least as good outcomes.
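As a rough sketch of how non-inferiority testing is often operationalized (comparing the upper bound of a confidence interval for the OR to a pre-specified margin), here is a minimal example. The counts, the 2.0 margin, and the 90% interval are all assumptions for illustration, not recommendations:

```python
import math

# Hypothetical 2x2 counts (made up): rows = treatment, cols = (relapse, no relapse)
a, b = 6, 194   # treatment Y: relapses, non-relapses
c, d = 5, 195   # treatment X: relapses, non-relapses

or_hat = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # Woolf standard error of log(OR)

# A 90% two-sided CI corresponds to a one-sided alpha = 0.05 non-inferiority test
z = 1.645
upper = math.exp(math.log(or_hat) + z * se_log_or)

margin = 2.0  # pre-specified non-inferiority margin for the OR (an assumption)
print(f"OR = {or_hat:.2f}, upper 90% CI bound = {upper:.2f}")
print("non-inferior" if upper < margin else "cannot conclude non-inferiority")
```

Note that with counts this sparse the interval is wide and the upper bound exceeds the margin, which is exactly the sample-size problem you're wrestling with.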

Since you're not going to have the luxury of randomization, it probably makes sense to use a multiple logistic regression. (You may have to use a different estimation method if you only have a few events in one group, since that can make estimation less than ideal [a common problem with rare events: the event data are "sparse"]. It's possible the statistician might think a Poisson regression is best in this case, but the choice of estimation method, e.g. maximum likelihood vs. penalized likelihood, will still depend on how few events are in the smaller group, irrespective of Poisson vs. logistic. The Poisson will approximate the binomial in appropriate cases, but logistic regression is based directly on the binomial distribution, which is why I think a logistic model is fine.) This will let you somewhat mitigate the effects of potential confounders and give a more accurate estimate of the OR of interest, though it will depend in part on the group with the fewest events, which is why I'd recommend talking with the statistician about this. The point is that you want to account for variables that can reasonably impact the outcome, but you'll likely have limitations in the modeling process due to sparse data relative to the number of potential predictors and any transformations of the predictors (not a terrible thing, but you'll probably want some guidance from the statistician). Once you've done the modeling, you would just get a confidence interval for the (adjusted) OR (meaning from the model that accounts for the other independent variables) and use the non-inferiority or equivalence testing procedure (as far as I know, but again the statistician should be able to clarify this for you).
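Here is a minimal sketch of what that adjusted-OR step could look like, using simulated data. The dataset, the confounder, and all coefficients are made up, and this uses plain maximum likelihood, so a penalized method like the statistician might suggest for sparse events would need a different tool:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Simulated observational data (all made up): age confounds treatment choice,
# i.e. older patients are more likely to receive treatment Y
age = rng.normal(60, 10, n)
treat_y = rng.binomial(1, 1 / (1 + np.exp(-(age - 60) / 10)))
logit_p = -3.5 + 0.03 * (age - 60) + 0.2 * treat_y   # rare outcome (~3-4%)
relapse = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

df = pd.DataFrame({"relapse": relapse, "treat_y": treat_y, "age": age})

# Multiple logistic regression: adjusted OR for treatment Y vs X, controlling for age
model = smf.logit("relapse ~ treat_y + age", data=df).fit(disp=0)
or_adj = np.exp(model.params["treat_y"])
ci_lo, ci_hi = np.exp(model.conf_int().loc["treat_y"]).to_numpy()
print(f"adjusted OR = {or_adj:.2f}, 95% CI = ({ci_lo:.2f}, {ci_hi:.2f})")
```

The exponentiated `treat_y` coefficient is the adjusted OR, and its confidence interval is what you would then compare against a non-inferiority margin.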

That last bit is the crux. I am open to any statistical design that accomplishes our end goal. I want to keep it as simple as possible. Now, having understood exactly what it is that we are trying to accomplish here, does that change your recommendations/advice at all?

I'll stress that if you just calculate a crude odds ratio for the two groups, you're not likely to get a good estimate of the true OR attributable purely to the difference in treatments. The modeling and accounting for independent variables will be really important to "soak up" effects that aren't really due to the treatments but might contribute to variability in relapse between the two treatment groups, since there wasn't randomization.

Also, I would like to ask your opinion on the OR of 2.0 that we may be setting as our alternative (which is why I originally made this thread). For lack of a better way of phrasing this: Is this a "good" OR to set as the alternative? What I mean is, is it within the bounds of what is considered common practice in clinical research biostats? If you read a paper that had set its alternative OR at 2, would you respect that? I just don't want to build my whole study around this only to find out later that we picked laughably weak parameters that no one will take seriously. I know this alternative OR is supposed to be based on what's "clinically meaningful," but knowing what you know now about the background, do you think an OR of 2 is appropriate? Remember, it's a highly treatable disease with an overall relapse rate of roughly < 5%. We tried fiddling with different ORs in the calculators, and when we set it at 1.5, the required sample size exploded to an unmanageable number. I guess I just want some reassurance that an OR of 2 is a respectable value to set.

Thank you!!!

As you were told before, what dictates a reasonable effect size is somewhat situational. Ultimately, you've got to go with your constraints, but the statistician should be able to help you generate a profile demonstrating the power achieved at different sample sizes under different assumptions about the OR (this will be more involved if you do end up using modeling). I don't think people will "laugh" at it. It also wouldn't hurt to get some clinical perspective on "clinically meaningful" in terms of probabilities. For example (keeping the numbers simple), an OR of 2 implies that the probability of relapse in group Y (assuming Y is in the numerator) is .67, with .33 being the probability of no relapse in Y, while the probability of relapse in X is .5 and the probability of no relapse in X is .5 ([.67/.33]/[.5/.5] = 2). So, I would recommend looking at aggregate data from somewhere to determine relapse probabilities for intermediate disease treated with X or Y. Let's say the relapse rate is 25% in

**X**, and 75% for no relapse in X (do this for both groups). Take this to a clinician and explain: this is what it currently is in group X; what relapse rate would be clinically meaningful to you? Asking a few physicians might be useful. Once you have the probability they say is meaningful, you can convert it to odds and form a ratio of these "clinically meaningful odds" to the empirical odds of the reference group. That ratio is then the OR you think is clinically meaningful to detect. This might be easier to work with because people conflate odds and risk, but the two are quite different in most cases. I think using probabilities directly and then converting to odds will mitigate issues with intuition around odds and odds ratios (again, an OR of 2 isn't really twice the risk unless the OR is nearly equal to the RR; it's like the example above where the risk is actually 67% in group Y and 50% in group X).
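The probability-to-odds conversion above can be written out directly. The 25% baseline and the 40% "clinically meaningful" value below are made-up numbers for illustration:

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)

p_x = 0.25           # empirical relapse probability in reference group X (made up)
p_meaningful = 0.40  # relapse probability a clinician flags as meaningful (made up)

or_meaningful = odds(p_meaningful) / odds(p_x)
print(f"odds(X) = {odds(p_x):.2f}, odds(meaningful) = {odds(p_meaningful):.2f}")
print(f"clinically meaningful OR = {or_meaningful:.2f}")  # prints 2.00
# Note the OR (2.0) is larger than the corresponding risk ratio (0.40 / 0.25 = 1.6)
```

This also illustrates the odds-vs-risk point: the OR of 2 here corresponds to only a 1.6-fold increase in the relapse probability itself.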

As I've said before, this will come back to limitations you have based on the study design. All else the same, you'll need a larger sample to discern smaller effect sizes, so it makes sense that decreasing the OR will dramatically increase the sample size.
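To illustrate how shrinking the target OR inflates the required sample size, here is a rough superiority-style power sketch (not a non-inferiority calculation, which the statistician would set up differently) with an assumed 4% baseline relapse probability; all numbers are assumptions:

```python
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_x = 0.04  # assumed relapse probability under treatment X (made up, consistent with <5%)

def p_from_or(p_ref, odds_ratio):
    """Relapse probability in the comparison group implied by a given OR."""
    o = odds_ratio * p_ref / (1 - p_ref)
    return o / (1 + o)

solver = NormalIndPower()
for target_or in (2.0, 1.5):
    p_y = p_from_or(p_x, target_or)
    es = proportion_effectsize(p_y, p_x)  # Cohen's h for two proportions
    n = solver.solve_power(effect_size=es, alpha=0.05, power=0.8,
                           alternative="larger")
    print(f"OR = {target_or}: implied p_Y = {p_y:.3f}, n per group ~ {n:.0f}")
```

With a rare outcome, moving the alternative OR from 2.0 down to 1.5 roughly triples the per-group sample size, which matches what you saw in the calculators.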

Sorry it was a bit of a long post, but I think after reading it and thinking things over you can at least get a good idea of what you'd want to mull over with the statistician. If you have access to them and they can help you plan and conduct the analysis, I'd highly recommend it, as there is much more to good modeling than just putting the data in and getting an output.

I hope some of this was helpful and feel free to PM me or post again if you have other questions.

Edited with bold text to clarify what I meant, since I originally said "safe" instead of efficacy, which is what it sounds like you want.