The z, t, and F tests are parametric tests for quantities such as means, variances, and proportions. These kinds of tests assume, as you alluded to, that the populations the samples come from are normally (Gaussian) distributed. Nonparametric tests (again, as you alluded to) do not require that assumption: they are distribution free, which also makes them useful when the sample size is too small to judge the shape of the distribution.
Okay, so why is assessing normality a flawed way to decide between a parametric and a nonparametric test?
Answer: The t test remains quite accurate even when the normality (Gaussian) assumption is violated. Empirical evidence has shown that it does not inflate the Type I error rate, nor does it falsely produce Type II errors. For normally distributed data the t test has a little more power than the Mann-Whitney test, but for non-normal distributions the Mann-Whitney tends to be far better. The major limitation shared by both tests is that they are designed for continuous variables, and when the baseline score is added as a covariate in a linear regression, ANCOVA is a more powerful test than the t test.
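As a rough illustration of that last point, here is a minimal Python sketch (the variable names, effect size, and sample sizes are all invented) comparing a plain two-sample t test on post scores with an ANCOVA-style regression that adds the baseline score as a covariate:

```python
# Sketch: t test on post scores vs. ANCOVA (post ~ group + baseline).
# All data are simulated; the "true" treatment effect of 2.0 is assumed.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 100
baseline = rng.normal(50, 10, 2 * n)
group = np.repeat([0, 1], n)                              # 0 = control, 1 = treatment
post = baseline + 2.0 * group + rng.normal(0, 5, 2 * n)   # assumed true effect = 2
df = pd.DataFrame({"baseline": baseline, "post": post, "group": group})

# Plain t test on post scores ignores the baseline information.
t, p_t = stats.ttest_ind(df.post[df.group == 1], df.post[df.group == 0])

# ANCOVA: regress post on group while adjusting for baseline.
ancova = smf.ols("post ~ group + baseline", data=df).fit()

print(f"t test p      = {p_t:.4f}")
print(f"ANCOVA group p = {ancova.pvalues['group']:.4f}")
```

Because the baseline explains much of the outcome variance, the ANCOVA model typically gives a smaller p-value for the same simulated effect.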
Also, receiver operating characteristic (ROC) analysis, summarized by the area under the curve (AUC), is a preferred method for evaluating diagnostic tests whose results are on a continuous scale.
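For concreteness, a small sketch of ROC/AUC on a simulated continuous marker (the marker distributions and disease labels are made up; scikit-learn's roc_auc_score and roc_curve do the work):

```python
# Sketch: ROC/AUC for a continuous diagnostic marker on simulated subjects.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
healthy = rng.normal(1.0, 1.0, 200)     # marker in non-diseased subjects (assumed)
diseased = rng.normal(2.0, 1.0, 200)    # marker shifted upward in disease (assumed)
scores = np.concatenate([healthy, diseased])
labels = np.concatenate([np.zeros(200), np.ones(200)])

auc = roc_auc_score(labels, scores)              # area under the ROC curve
fpr, tpr, thresholds = roc_curve(labels, scores) # full curve if a plot is wanted
print(f"AUC = {auc:.3f}")
```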
Normality tests are, of course, very sensitive to sample size; histograms can instead be used to judge whether the normality assumption is workable, as in the case of blood pressure.
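A quick sketch of that sensitivity, using an invented, mildly skewed simulated variable: the same distribution that a formal test (Shapiro-Wilk here) tends to accept at small n is usually rejected once n is large, even though a histogram would look workably bell-shaped throughout.

```python
# Sketch: how a formal normality test reacts to sample size on the same
# mildly skewed distribution (gamma with shape 20, chosen arbitrarily).
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
for n in (30, 300, 3000):
    x = rng.gamma(shape=20.0, scale=1.0, size=n)   # close to normal, slightly skewed
    w, p = stats.shapiro(x)
    print(f"n = {n:5d}  Shapiro-Wilk p = {p:.4f}")
```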
The generalized linear model forms the foundation of the t test, ANOVA, and ANCOVA, along with many other multivariate methods. Model specification for this method is a challenge. The equation is y = b0 + b*x + e, where y is the set of outcomes, x is a set of preprogram variables or covariates, b0 is the set of intercepts, b is a set of coefficients, and e is the error term. If the specified equation is not an accurate summary of the data, then the coefficients (the b values) will most likely be biased, and we will have a curvilinearity problem.
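A minimal sketch of fitting that equation by ordinary least squares with statsmodels; the covariate, the outcome, and the "true" b0 and b are simulated stand-ins:

```python
# Sketch: fit y = b0 + b*x + e by OLS on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.5 + 0.8 * x + rng.normal(scale=0.5, size=200)   # assumed true b0 = 1.5, b = 0.8

X = sm.add_constant(x)      # adds the intercept column for b0
fit = sm.OLS(y, X).fit()
print(fit.params)           # estimates of b0 and b
```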
Further assumptions to consider, this time for Regression Discontinuity (RD) analysis, are as follows:
1.) The cutoff criterion
2.) The pre-post distribution
3.) Comparison group pre-test variance
4.) Continuous pretest distribution
5.) Program implementation.
If the model fits the true function, it is exactly specified. If the model is overspecified with too many terms, the estimate is inefficient; if some terms have been left out, the model is underspecified and the estimate is biased.
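Here is a toy sketch of that bias, with invented coefficients: the data are generated from a curvilinear model, and the underspecified fit (quadratic term left out) returns a distorted slope while the exactly specified fit recovers it.

```python
# Sketch: specification bias from omitting a quadratic term.
# The data-generating coefficients (1.0, 0.5, 0.4) are assumed for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 3, 500)
y = 1.0 + 0.5 * x + 0.4 * x**2 + rng.normal(scale=0.3, size=500)

under = sm.OLS(y, sm.add_constant(x)).fit()                            # quadratic term left out
exact = sm.OLS(y, sm.add_constant(np.column_stack([x, x**2]))).fit()   # correct functional form

print("underspecified slope:", under.params[1])    # absorbs the curvature, biased
print("exactly specified slope:", exact.params[1]) # close to the true 0.5
```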
The RD design, in its various forms, basically compares a program group with a comparison group using a pretest-posttest strategy, with assignment determined by the pretest cutoff.
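A toy RD sketch under assumed values (the cutoff, effect size, and variable names are all mine): assignment is determined entirely by the pretest cutoff, and the program effect is read off the coefficient on the assignment indicator after adjusting for the pretest.

```python
# Sketch: regression-discontinuity analysis on simulated data.
# Subjects scoring below the cutoff receive the program; assumed effect = 4.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
pre = rng.normal(50, 10, 1000)
cutoff = 50
assigned = (pre < cutoff).astype(float)                          # program given below cutoff
post = 5 + 1.0 * pre + 4.0 * assigned + rng.normal(0, 3, 1000)

# Regress posttest on the centered pretest and the assignment indicator.
X = sm.add_constant(np.column_stack([pre - cutoff, assigned]))
fit = sm.OLS(post, X).fit()
print("estimated program effect:", fit.params[2])
```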
Bootstrapping in some cases provides better estimates of sampling distributions than normal-theory approximations do.
Essentially, bootstrapping is a resampling method. The statistic being considered may be variable, but we do not know how variable it is. In calculus we measure the rate of change of a function; statistics rests on theorems from calculus, and even where the disciplines differ in outlook they can be combined, as in statistical thermodynamics, biostatistics, and so forth. Monte Carlo simulation may also be involved, as in the examination of bootstrap tests of phi-divergence statistics, where bootstrapping seems to be good at estimating the rejection probabilities. Monte Carlo simulation is used in quantum mechanics and other applications of random numbers. In this method we (a small worked example follows the list):
1.) List all possible outcomes of an experiment.
2.) Determine the probability of each outcome.
3.) Set up a correspondence between the outcomes of the experiment and the random numbers.
4.) Select random numbers from a table and conduct the experiment.
5.) Repeat the experiment and tally the outcomes.
6.) Compute any statistics and state the conclusions.
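As a small worked example of those six steps (the choice of experiment, a fair die, is purely illustrative), here is a Monte Carlo estimate of the probability of rolling at least one six in four throws:

```python
# Sketch: Monte Carlo estimate of P(at least one six in four throws of a fair die).
import numpy as np

rng = np.random.default_rng(6)

# Steps 1-3: the outcomes of one throw are 1..6, each with probability 1/6,
# and they are mapped onto the generator's random integers.
# Steps 4-5: conduct and repeat the experiment, tallying the outcome of interest.
n_trials = 100_000
throws = rng.integers(1, 7, size=(n_trials, 4))
hits = (throws == 6).any(axis=1)

# Step 6: compute the statistic and state the conclusion; the exact value
# for comparison is 1 - (5/6)**4.
print("simulated:", hits.mean())
print("exact:    ", 1 - (5 / 6) ** 4)
```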
Back to bootstrapping: we may look at P-values of some divergence statistic, as mentioned above, which brings in the probability density function, our statistical model, the parameter set and closed intervals, and a collection of partial derivatives and of functions that do and do not depend on the parameter theta.
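Stripped to its core, the resampling idea looks like this sketch, which bootstraps the sampling distribution of a median from simulated, clearly non-normal data and reads off a percentile interval:

```python
# Sketch: bootstrap approximation of the sampling distribution of a median.
# The exponential data are a simulated stand-in for a non-normal sample.
import numpy as np

rng = np.random.default_rng(7)
data = rng.exponential(scale=2.0, size=80)

# Resample the observed data with replacement many times and recompute the statistic.
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(5000)
])

ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(f"bootstrap 95% CI for the median: ({ci_low:.2f}, {ci_high:.2f})")
```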
In factor analysis there is exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA looks for the number of factors and the relationship between each variable and each factor. CFA validates the factor structure presumed going into the analysis and measures the relationships among the factors.
For EFA the assumptions, taken from the model Xi = mu + D Fi + epsi, are:
1.) Fi and epsi are independent
2.) E(F) = 0
3.) Cov(F) = I (the key assumption in EFA: uncorrelated factors)
4.) E(eps) = 0
5.) Cov(eps) = Psi, where Psi is a diagonal matrix.
EFA means coming to the analysis without a preconceived structure. CFA involves hypothesis testing, confidence intervals, and estimation.
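A small EFA sketch on simulated data, assuming the model written above with mu = 0 and two uncorrelated latent factors; the loading pattern and number of observed variables are invented, and scikit-learn's FactorAnalysis is used for the extraction:

```python
# Sketch: exploratory factor analysis on simulated data, X = F @ D.T + eps.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(8)
n = 500
F = rng.normal(size=(n, 2))                       # latent factors, Cov(F) = I
D = np.array([[0.9, 0.0],                         # assumed loading pattern:
              [0.8, 0.1],                         # first three variables load on factor 1,
              [0.7, 0.0],                         # last three on factor 2
              [0.0, 0.9],
              [0.1, 0.8],
              [0.0, 0.7]])
X = F @ D.T + rng.normal(scale=0.3, size=(n, 6))  # diagonal error covariance (Psi)

efa = FactorAnalysis(n_components=2).fit(X)
print(efa.components_.T)    # estimated loadings, one row per observed variable
```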
Varimax solutions tend toward an equal sum of squared loadings across all factors, whereas quartimax rotation tends to produce solutions with one dominating factor. However, varimax solutions do not always come out that way and may need a little tinkering; a term may have to be added to the varimax objective function in order to modify the varimax criterion.
Learn how to use MATLAB.
Rotation can be used to assist in the interpretation of extracted factors; Varimax, Quartimax, and Equamax are used to find orthogonal rotations. To find oblique (non-orthogonal) rotations, which allow correlation between factors for better interpretation, we use Promax, Procrustes, and Harris-Kaiser.
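A sketch comparing varimax and quartimax rotations of the same simulated loading structure, assuming a scikit-learn version whose FactorAnalysis accepts the rotation argument (the data-generating loadings are invented):

```python
# Sketch: varimax vs. quartimax rotation of a two-factor solution.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(9)
F = rng.normal(size=(500, 2))
D = np.array([[0.9, 0.0], [0.8, 0.1], [0.0, 0.9], [0.1, 0.8]])   # assumed loadings
X = F @ D.T + rng.normal(scale=0.3, size=(500, 4))

varimax = FactorAnalysis(n_components=2, rotation="varimax").fit(X)
quartimax = FactorAnalysis(n_components=2, rotation="quartimax").fit(X)

print("varimax loadings:\n", varimax.components_.T)
print("quartimax loadings:\n", quartimax.components_.T)
```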
An eigenvector of a transformation points in a direction that is preserved by the transformation, and the amount of stretch along it is the eigenvalue; these eigenvalues act as multipliers. Eigenvectors, eigenspaces, and eigenvalues are properties of a matrix. Generally speaking, a matrix acts upon a vector by changing both its magnitude and direction; however, for some vectors a matrix changes only the magnitude, leaving the direction unchanged or reversed. These are the so-called eigenvectors.
The factor by which the eigenvector's magnitude is multiplied is the eigenvalue. Eigenvectors and eigenvalues are used extensively in physics and in early advanced undergraduate courses such as physical chemistry I and II. Those courses are very statistical in nature and are based upon algebraic derivations and calculus, both single-variable and multivariable. Eigenvalues and eigenvectors appear in various differential equations and, like Monte Carlo methods, in quantum mechanics; physical chemistry is built upon the five postulates of quantum mechanics.
In brief: an eigenvector is a vector that keeps its direction after undergoing a linear transformation, and an eigenvalue is the scalar by which that eigenvector is multiplied during the transformation.
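A quick numerical check of that statement with NumPy, using an arbitrarily chosen symmetric 2x2 matrix: each eigenvector is only scaled by its eigenvalue under the transformation.

```python
# Sketch: verify A @ v == lambda * v for each eigenpair of a small matrix.
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of eigenvectors are the v's

for lam, v in zip(eigenvalues, eigenvectors.T):
    print("A @ v      =", A @ v)
    print("lambda * v =", lam * v)   # identical: direction preserved, length scaled
```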
Question: how might this affect a stat analysis when using a matrix? How might results become skewed if an inappropriate matrix is applied?