stats question!


I want to see whether any of A, B, C, D (or any combination thereof) affects any of W, X, Y, Z (or any combination). Each of the above variables is measured on a 1-10 scale.

What is the easiest and simplest statistical method to do so? Would I just run four separate multiple regressions of A, B, C, D on each of W, X, Y, Z? Is there a more efficient way (an ANOVA or whatever)? Thanks.
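The four-separate-regressions idea the post asks about can be sketched in a few lines. This is a minimal illustration with invented 1-10 ratings (the variable names, sample size, and the planted A-to-W effect are all made up, not from the thread):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# toy stand-ins for the real ratings: A, B, C, D and W, X, Y, Z on a 1-10 scale
preds = rng.integers(1, 11, size=(n, 4)).astype(float)      # A, B, C, D
outcomes = rng.integers(1, 11, size=(n, 4)).astype(float)   # W, X, Y, Z
outcomes[:, 0] = 0.5 * preds[:, 0] + rng.normal(5, 1, n)    # let W depend on A

def ols(X, y):
    """Ordinary least squares with an intercept; returns [b0, b_A, b_B, b_C, b_D]."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

# one regression of A-D on each of W, X, Y, Z
all_coefs = [ols(preds, outcomes[:, j]) for j in range(4)]
```

Each entry of `all_coefs` holds the intercept and the four slopes for one outcome; with real data you'd also want standard errors and p-values from a proper stats package rather than raw least squares.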
 
You can do that, or just look at a correlation table with all those variables in it to see what varies together (although that won't get the interactions you seem to be looking for).

But, really, don't do either of those things. That kind of atheoretical data mining isn't valuable. Do you have conceptual reasons to believe relationships (direct, mediated, moderated, etc.) exist? Those should be the relationships you look for.
 
You could do a MANOVA with everything, including all interactions and main effects.

I'd probably look at the distributions of A,B,C, and D, and see if I could break any of them into categories (low/medium/high; top two box vs. bottom two box, etc.). I just have a personal issue with using a scale as an independent variable, usually because it drives your n into the ground.
 
I'd probably look at the distributions of A,B,C, and D, and see if I could break any of them into categories (low/medium/high; top two box vs. bottom two box, etc.). I just have a personal issue with using a scale as an independent variable, usually because it drives your n into the ground.

Yuk. I totally disagree.

When you make a split like that (say, at the mean or median) you muck up your distributions. If you had normal data to start with, you now have about as much skew in both categories as you can get. You probably also made it heteroscedastic. There go the basic assumptions of the GLM. Aside from that, you lose valuable information about individual differences, and I'd say it's typically theoretically indefensible (say you split the scale at 5.5... you're effectively saying that people who score 1 and 5 are the same, but people who score 5 and 6 are different). Also, basing the split on your distribution rather than a theoretically grounded cutoff will bring hellfire from reviewers. Using three categories isn't any better. I say stick to regression analysis.
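The skew and information-loss points above are easy to check by simulation. This sketch uses invented normal data (nothing from the thread) to show that each half of a median-split variable is strongly skewed, and that the dichotomized predictor correlates less with an outcome than the original did:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 10_000)        # a roughly normal predictor
y = x + rng.normal(0, 1, 10_000)    # an outcome it genuinely relates to

def skewness(v):
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5

# after a median split, each "half" is half-normal-like and strongly skewed
high_group = x[x > np.median(x)]

# and the dichotomized predictor correlates less with the outcome
# (classically shrinking r by a factor of up to about sqrt(2/pi) = .80)
r_full = np.corrcoef(x, y)[0, 1]
r_split = np.corrcoef((x > np.median(x)).astype(float), y)[0, 1]
```

Printing `r_full` versus `r_split` makes the cost of the split concrete before anyone commits to it in a design.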
 
Agree with JN's advice, though disagree there isn't value to the MANOVA way. I mean, if you're just running random analyses hoping to find something with a p < .05, that's one thing, but I think it's reasonable to check multiple variables (I assume this is part of a larger dataset?) that you expect are related but aren't sure exactly how.

You just need to make sure when you report it you are framing it as exploratory - confirming a very specific hypothesis is obviously greatly preferred, but things can be learned from exploratory analyses as well and seeing what is there may generate future research ideas and hypotheses. If folks never did this, I imagine we'd be decades behind where we are now. My only issue is when people are dishonest in how they go about reporting it.

Then again, maybe this is just wishful thinking on my part, since I did something similar recently: running about 30 regression analyses, because we hypothesized people would vary in how they rated images based on one variable, but had them rating pictures across a variety of variables, so we had to look at all of them.

Oh, and definitely avoid median splits whenever possible. They're a statistical no-no these days, except in the few instances where there is justification (e.g., measures that were specifically designed to categorize folks as mild/severe, etc.).
 
Dumb question. Is it an instrument of some kind?
 
Agree with JN's advice, though disagree there isn't value to it. I mean, if you're just running random analyses hoping to find something with a p < .05, that's one thing, but I think it's reasonable to check multiple variables (I assume this is part of a larger dataset?) that you expect are related but aren't sure exactly how.

You just need to make sure when you report it you are framing it as exploratory - confirming a very specific hypothesis is obviously greatly preferred, but things can be learned from exploratory analyses as well and seeing what is there may generate future research ideas and hypotheses. If folks never did this, I imagine we'd be decades behind where we are now. My only issue is when people are dishonest in how they go about reporting it.

OK, I agree with you in principle, but in practice I don't think it's ever reported accurately.

But I think one additional issue with just jamming everything together into a big regression is that you're capitalizing on familywise error. With that many variables you're inflating FW error to, what... .20 or .80 depending on how you look at the data. I've read stuff like that that's been published (e.g. a stepwise regression pulls three predictors out of a multidimensional scale that aren't conceptually related to the outcome V) and it's nonsensical. Exploratory is fine, but there has to be some sort of reason behind it.
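The familywise inflation mentioned above is a one-line calculation for m independent tests at α = .05. The figures below roughly match the ".20" quoted (four tests) and show what 16 coefficient tests (4 predictors × 4 outcomes, a number assumed here for illustration) would do:

```python
# familywise error rate for m independent tests, each at alpha = .05
alpha = 0.05

def fwer(m):
    return 1 - (1 - alpha) ** m

four_tests = fwer(4)        # about .19 -- close to the .20 cited above
sixteen_tests = fwer(16)    # about .56 for 4 predictors x 4 outcomes
bonferroni = alpha / 16     # a blunt fix: test each at roughly .0031 instead
```

Bonferroni is conservative, but it makes the trade-off explicit: the more exploratory tests you run, the stricter each individual test has to be.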
 
OK, I agree with you in principle, but in practice I don't think it's ever reported accurately.

But I think one additional issue with just jamming everything together into a big regression is that you're capitalizing on familywise error. With that many variables you're inflating FW error to, what... .20 or .80 depending on how you look at the data. I've read stuff like that that's been published (e.g. a stepwise regression pulls three predictors out of a multidimensional scale that aren't conceptually related to the outcome V) and it's nonsensical. Exploratory is fine, but there has to be some sort of reason behind it.

Definitely. In this case, exploratory works, as long as you have some idea of what you are looking for.
 
Dumb question. Is it an instrument of some kind?

For clarification: A (1-10), B (1-10), C (1-10), and D (1-10) are the results of one measure, and W (1-10), X (1-10), Y (1-10), and Z (1-10) the results of another.
 
For clarification: A (1-10), B (1-10), C (1-10), and D (1-10) are the results of one measure, and W (1-10), X (1-10), Y (1-10), and Z (1-10) the results of another.

Then I would reiterate: exploratory analysis is fine, if you have some idea what you are looking for. It is probably also important to know whether these are validated measures that actually detect a known construct.
 
Misread your original post; I'm not entirely sure MANOVA is appropriate, since I thought you had categorical variables (rereading the original post, I have no idea why I thought that).

I'd look at a correlation matrix to get a vague idea of what was going on and then run multiple regressions. Regression is pretty much the preferred choice for all continuous data (or quasi-continuous, since I guess this isn't truly continuous).
 
Although it's not used frequently, how about a canonical correlation?
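For anyone curious what canonical correlation actually computes, here is a numpy-only sketch using the standard QR/SVD construction (the canonical correlations are the singular values of QxᵀQy after QR-decomposing each centered set). The data are invented, with one shared latent factor driving both variable sets, loosely mimicking the A-D / W-Z setup:

```python
import numpy as np

def canonical_corrs(X, Y):
    """Canonical correlations between two sets of variables (numpy-only sketch)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    qx, _ = np.linalg.qr(X)
    qy, _ = np.linalg.qr(Y)
    # singular values of qx.T @ qy are the canonical correlations, largest first
    return np.linalg.svd(qx.T @ qy, compute_uv=False)

rng = np.random.default_rng(2)
shared = rng.normal(0, 1, (300, 1))          # one latent factor behind both sets
X = shared + rng.normal(0, 1, (300, 4))      # set 1: A-D analogues
Y = shared + rng.normal(0, 1, (300, 4))      # set 2: W-Z analogues
cc = canonical_corrs(X, Y)
```

The first value of `cc` is the correlation between the best linear combination of one set and the best linear combination of the other; in real use you'd follow up with significance tests and interpret the canonical weights, which a sketch like this omits.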
 
Even though I saw how many posts there were in this thread, I thought I would stop by to test myself. I agree that the answer is regression, but that is about as deep as I go. Man, I have a lot to learn; thank goodness I still have a lot of stats classes to take!
 
What is the research question?

Also, why should the OP use multiple regression versus a MANOVA procedure?
 
Also, why should the OP use multiple regression versus a MANOVA procedure?

You'd need to have categories to run the MANOVA, so see my reply to Thrak. ANOVA still uses the GLM so there's no advantage.
 
Check your bivariates among the predictors first... make sure none of the variables are redundant. If they are, eliminate them from the regression.

Realize that entering variables into a regression equation changes the significance levels of all of the variables. Also, realize that a regression tests a model that includes all of the variables, so interpreting individual significant results must be done with some degree of caution. In other words, your research question matters. If you are just looking for significant relationships among the variables without a clear model/theory in mind, correlations may be best. You also have interaction terms among your variables of interest, so you're going to have to run more than four analyses to cover those. Remember, in regression, if you are interested in interaction terms, you have to manually compute them (at least in SPSS). Without a theoretical justification for just throwing everything in, you risk capitalizing on chance.

Can you (or anyone) elaborate on the bolded text?
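The "manually compute them" step for interaction terms is just centering and multiplying. A hedged numpy illustration with invented 1-10 ratings (centering first keeps the product term from being nearly collinear with its components, which is the usual advice):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.integers(1, 11, 200).astype(float)   # invented 1-10 ratings
B = rng.integers(1, 11, 200).astype(float)

Ac, Bc = A - A.mean(), B - B.mean()          # center each predictor first
AxB = Ac * Bc                                # the interaction term

# design matrix for y ~ A + B + A:B (intercept, main effects, interaction)
design = np.column_stack([np.ones(len(A)), Ac, Bc, AxB])

r_raw = np.corrcoef(A, A * B)[0, 1]          # raw product: correlates strongly with A
r_centered = np.corrcoef(Ac, AxB)[0, 1]      # centered product: nearly orthogonal to A
```

In SPSS the same thing is done with COMPUTE statements before the REGRESSION procedure; the centering logic is identical.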
 
Unless my thought process is totally out of whack (it's finals week, so it very well may be), the reason you'd want to use regression is because you pretty much always do when you are looking at continuous variables. I think (again, brain fried) if you ran an ANOVA-style test you'd have to artificially dichotomize (or polytomize) some of the variables.

If he wanted to compare scores on the above across several groups (or something like that) a MANOVA / repeated measures procedure would make sense. I don't think it would in his situation though.

Take what I just said with a grain of salt unless someone else confirms it though. This whole semester has been very relaxed, but this week has been a nightmare.

Edit: Apparently JN posted the same, and since he seems to be the resident stats guru, I'm now slightly more confident the above wasn't just the sleep deprivation talking.
 
Can you (or anyone) elaborate on the bolded text?

Take all of your variables and put them into a correlation matrix. If you're using SPSS, the procedure is analyze - correlate - bivariate. From there you can see which variables are related either positively or negatively and their respective correlations.
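Outside SPSS, the same correlation matrix is one call in numpy. The eight columns below are invented stand-ins for A-D and W-Z:

```python
import numpy as np

rng = np.random.default_rng(4)
# 50 invented respondents rating eight 1-10 items: A, B, C, D, W, X, Y, Z
data = rng.integers(1, 11, size=(50, 8)).astype(float)

R = np.corrcoef(data, rowvar=False)   # 8 x 8 matrix of pairwise correlations
```

Entry `R[i, j]` is the correlation between columns i and j; scanning the off-diagonal block between the A-D columns and the W-Z columns gives the quick overview the poster describes.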
 
.....:laugh: this is why I like research (but wouldn't want to do it for a living) and hate stats. It has never been my forte, and there always seem to be two or three ways to do the same thing, and no one can ever agree on the best way to do it. I showed my master's thesis to three different professors, and all of them had me running different tests for the same hypothesis, because each was sure their way was the best way to look at or elucidate what I was looking for....🙄
 
You'd need to have categories to run the MANOVA, so see my reply to Thrak. ANOVA still uses the GLM so there's no advantage.

Ok, I understand what you're saying. I browsed through my notes as well and found that, indeed, taking a mean/median split kills the variability and possibly violates the GLM's assumption of homoscedasticity.

I found a nice Java applet that demonstrates this very well:

http://www.bolderstats.com/jmsl/doc/medianSplit.html

The reason why I ask is that I'm using a quasi-experimental design on a project and my advisor wants to create high/low groups using a median split of continuous data. From there we were going to use MANCOVA to determine significance. I'll have to speak with him about this. Thanks for the information.
 
It's generally a bad idea to split a continuous IV into groups, due to losing variability. But if you do want to do a split, one way to do it is a tertile split (taking the top third vs. the bottom third). It really depends on the specific research question - can you give us more info?

I think cmuhooligan may be our winner with the canonical correlation suggestion, though - to my embarrassment, I'm not particularly familiar with it, even though it encompasses both regression and ANOVA under its umbrella. As far as I'm aware, it works by correlating a set of IVs to a set of DVs. Perhaps cmuhooligan or others can inform you further.
 
I think cmuhooligan may be our winner with the canonical correlation suggestion, though - to my embarrassment, I'm not particularly familiar with it, even though it encompasses both regression and ANOVA under its umbrella. As far as I'm aware, it works by correlating a set of IVs to a set of DVs. Perhaps cmuhooligan or others can inform you further.

If it makes you feel better, my knowledge of it consists of "I've heard those two words together before".

I'm actually really glad this thread was posted, since it's nice to actually think about statistics in a useful way. I hate the way it is taught most places, since most professors make it into a math class. I'm fine with hand calculations, proofs and the like - I DO think they help. My problem is that every single stats professor I've had (save for my univariate prof last semester) has focused on things like that at the expense of actually teaching when/why you use certain tests, why you shouldn't use other tests in those situations, etc.

After my undergrad stats I could take a matrix of data and do an ANOVA/MANOVA/ANCOVA with nothing but a calculator, but still had no idea what any of the numbers I got meant or why you would want to do one. Grad school is a bit better, but it still seems like a lot of stats classes focus on the wrong things. The nitty-gritty math is great, but too often they seem to sacrifice the bigger picture for more number-crunching.
 
It's generally a bad idea to split a continuous IV into groups, due to losing variability. But if you do want to do a split, one way to do it is a tertile split (taking the top third vs. the bottom third). It really depends on the specific research question - can you give us more info?

I think cmuhooligan may be our winner with the canonical correlation suggestion, though - to my embarrassment, I'm not particularly familiar with it, even though it encompasses both regression and ANOVA under its umbrella. As far as I'm aware, it works by correlating a set of IVs to a set of DVs. Perhaps cmuhooligan or others can inform you further.

I dislike the 1/3 splits because you lose so many participants without ANY justification. Again, just can't see an advantage over regression.

Canonical correlation could work, but it would depend on what that data looked like; I'm pretty sure canonical requires similar distributions of variables while regression doesn't. And canonical works a little like factor analysis; you're pulling out a latent variable from the combination of variables. That HAS to be theoretically justified.

Oh, and I just realized that since we don't know the n, all this could just be out the window entirely.
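The participant loss from an extreme-thirds split is easy to quantify. A quick simulation with invented normal data:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(0, 1, 9_000)                  # an invented continuous predictor

lo, hi = np.percentile(x, [100 / 3, 200 / 3])
kept = (x <= lo) | (x >= hi)                 # keep only the top and bottom thirds

fraction_kept = kept.mean()                  # about two thirds; a third of the sample is gone
```

Dropping a third of the sample shrinks power before the analysis even starts, which is the objection above in one number.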
 
I dislike the 1/3 splits because you lose so many participants without ANY justification. Again, just can't see an advantage over regression.

Agreed. Same reason for not making splits in the continuous variables... you may be cutting yourself short on useful info you can tease from the data. Using a matrix (I could be off on this, as I haven't played with SPSS in a bit) is good for seeing what variables hang together, but I'm not sure whether, if you separate the variables at this point, you will miss more complex interactions later on (I'll defer to others on this point). I'd use it as one way to initially look at the data, but then branch out from there and look at additional combinations.

As for the teaching of stats.... I was lucky that my undergrad was 70% theory and 30% practice: understanding the rationale behind doing things first, and then the last 30% was doing it and seeing data in action.

My grad stats training was pretty good, though I would have liked to have it spread out more, as I felt like I was always playing catch up. Having to use a few different texts didn't help, as no one seems to be able to have a great text for both learning and reference. If anyone has one, I'd love to hear about it....particularly if it ties in SPSS procedure in it.
 
I'm actually really glad this thread was posted, since it's nice to actually think about statistics in a useful way. I hate the way it is taught most places, since most professors make it into a math class. I'm fine with hand calculations, proofs and the like - I DO think they help. My problem is that every single stats professor I've had (save for my univariate prof last semester) has focused on things like that at the expense of actually teaching when/why you use certain tests, why you shouldn't use other tests in those situations, etc.

After my undergrad stats I could take a matrix of data and do an ANOVA/MANOVA/ANCOVA with nothing but a calculator, but still had no idea what any of the numbers I got meant or why you would want to do one. Grad school is a bit better, but it still seems like a lot of stats classes focus on the wrong things. The nitty-gritty math is great, but too often they seem to sacrifice the bigger picture for more number-crunching.

As an aside, one thing I've noticed is that a good way to learn stats, if you've been taught it in this manner, is to take a class on programming for a stats package (e.g., SAS, R). Often you go through one procedure after another (ANOVA, regression, logistic regression, chi-square, etc.) and learn to apply it to a set of data with that package. It will be easy for you because you already know the underlying procedures and how to interpret the coefficients, but it also gives you experience using one test after another and seeing "test 1 goes with this data set, test 2 goes with that data set, etc." I've found it to be a pragmatic way to learn stats once you know the procedures, perhaps closer to what a long-time professor has seen after using many data sets. I got a free book on programming in SPSS, and sometimes I go to it first for a brief refresher on a test if I've forgotten a bit about it - even though the book focuses on SPSS, it makes a quick tangential reference to the procedure that can jog my memory.
 
As an aside, one thing I've noticed is that a good way to learn stats, if you've been taught it in this manner, is to take a class on programming for a stats package (e.g., SAS, R). Often you go through one procedure after another (ANOVA, regression, logistic regression, chi-square, etc.) and learn to apply it to a set of data with that package. It will be easy for you because you already know the underlying procedures and how to interpret the coefficients, but it also gives you experience using one test after another and seeing "test 1 goes with this data set, test 2 goes with that data set, etc." I've found it to be a pragmatic way to learn stats once you know the procedures, perhaps closer to what a long-time professor has seen after using many data sets. I got a free book on programming in SPSS, and sometimes I go to it first for a brief refresher on a test if I've forgotten a bit about it - even though the book focuses on SPSS, it makes a quick tangential reference to the procedure that can jog my memory.

I may have to try this. I've generally avoided such classes since I have a programming background so syntax was always the easiest part of stats for me, but if they provide better background/theory on the actual utilization of statistics it might be worth it just for that.
 
I'll just jump back in for a second, and mention that my head was in a very different place when I made my suggestion for making the top/bottom boxes from the continuous variables... less psych research, more marketing research. Which is definitely not the right mindset for a question posted here... my bad!
 
(On my talking about splits.) Depends on whether there is some theoretical/practical rationale for why you choose the split - for example, a test of categorical membership of some sort. It is also important to look at the variance in the data. Say you have a continuous variable and a hypothesis that people scoring high on it are different from those scoring lower. You might run a regression analysis across the whole variable and find no relationship, even if there is one, if the variance shifts across the distribution.

I think that would count as the justification I was mentioning as the qualifier 😛 Yup, I agree.
 
The wrench is your search for causality. But without some sort of experimental control, causality is a logical argument regardless of statistical choice. You asked for the simplest solution. The simplest solution is correlation.


I guess this demonstrates a hole in my understanding, but when looking at a relationship between two variables, why would correlation be preferred over regression (or be simpler, for that matter)?
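One way to see why neither is "more" than the other in the two-variable case: the standardized regression slope equals the correlation coefficient, so they carry the same information. A quick numpy check on simulated data:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(0, 1, 500)
y = 2 * x + rng.normal(0, 1, 500)    # a genuine linear relationship plus noise

r = np.corrcoef(x, y)[0, 1]

standardize = lambda v: (v - v.mean()) / v.std()
# OLS slope after standardizing both variables equals r
slope_z = np.polyfit(standardize(x), standardize(y), 1)[0]
```

Correlation is "simpler" only in the sense that it skips the intercept and the raw-units slope; the inference behind it is the same.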
 
Canonical correlation could work, but it would depend on what that data looked like; I'm pretty sure canonical requires similar distributions of variables while regression doesn't. And canonical works a little like factor analysis; you're pulling out a latent variable from the combination of variables. That HAS to be theoretically justified.

There is no need to have the variables normally distributed, although it is better if they are.
 