Written by: STATISTICA 3/1/2010 5:38 PM
We recently received a question from an Electronic Statistics Textbook visitor. We don't have time to answer every question we receive, but I was interested in the answer to this one, so I talked with a StatSoft statistician.
The visitor asked:
In your survival analysis section you say that one reason to use survival analysis instead of regression is that, "First, the dependent variable of interest (survival/failure time) is most likely not normally distributed -- a serious violation of an assumption for ordinary least squares multiple regression."
I have researched this issue a fair bit, and my understanding is that univariate normality is NOT an issue in regression--only multivariate normality, and that multivariate normality is not particularly related to univariate normality. Can you provide some reference or explanation as to what I am missing here?
And the answer is:
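Standard ordinary least squares (OLS) regression typically assumes that the errors (the deviations of each response Yi from its regression mean) are independent and identically distributed normal random variables with a mean of 0 and some common variance (the homogeneity of variance assumption). Another way of writing this is ei ~ iid N(0,var). Equivalently, the responses Yi are independent normal random variables whose means are given by the regression function and whose variance is common to all observations.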
If the response is distributed in this manner, then the OLS estimates are maximum likelihood estimates as well. The most critical assumption in OLS regression is that the observations are independent.
Next, the homogeneous variance assumption is important, followed lastly by the normal distribution assumption. The typical view of the predictors is that they are fixed constants, that is, there is no randomness associated with them. If you do treat the predictors as random, a typical approach is to assume that all variables (response and predictors) are jointly multivariate normal. If you assume multivariate normality, then when you condition on the predictors (that is, they are given as fixed), the conditional probability distribution of Y is normal with a mean given by a linear regression function and a constant variance.
Either way, the main assumption is that the conditional probability distribution of Y given/conditioned on a set of predictors is normal with a mean (expressed as linear function of the predictors) and a constant variance.
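For readers who like to see the multivariate normal case written out, here is the standard result in symbols. The notation (the mean vector and covariance blocks, and the coefficients beta) is introduced here for illustration and does not come from the textbook section being discussed.

```latex
% Joint normality of the response Y and the predictor vector X
\begin{pmatrix} Y \\ X \end{pmatrix}
\sim N\!\left(
  \begin{pmatrix} \mu_Y \\ \mu_X \end{pmatrix},
  \begin{pmatrix} \sigma_{YY} & \Sigma_{YX} \\ \Sigma_{XY} & \Sigma_{XX} \end{pmatrix}
\right)

% Conditioning on X = x gives a normal distribution whose mean is
% linear in x and whose variance does not depend on x
Y \mid X = x \;\sim\;
N\!\left(
  \mu_Y + \Sigma_{YX}\,\Sigma_{XX}^{-1}\,(x - \mu_X),\;
  \sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}
\right)

% Matching this to the regression form E(Y | X = x) = \beta_0 + x^\top \beta
\beta = \Sigma_{XX}^{-1}\,\Sigma_{XY},
\qquad
\beta_0 = \mu_Y - \mu_X^\top \beta
```

The conditional mean is a linear function of x and the conditional variance is the same for every x, which is exactly the form the regression model asks for.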
It just so happens that if we assume multivariate normality of the response and predictors, then we get the required conditional probability distribution of Y. That is, it’s a sufficient condition for the use of OLS regression, but not a necessary one.
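A small simulation can make the "sufficient but not necessary" point concrete. The sketch below is purely illustrative: the exponential predictor, the coefficients 2 and 3, the sample size, and the function names are all assumptions chosen for the example, not anything from the textbook. The predictor is deliberately skewed, so the variables are not jointly multivariate normal and Y on its own is not normal, yet the errors around the regression line are normal, which is what the model actually requires.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: the average cubed z-score (about 0 for normal data)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def fit_simple_ols(x, y):
    """Closed-form least squares fit of y = intercept + slope * x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def simulate(n=20000, seed=0):
    """Skewed predictor, normal errors: compare Y with the residuals."""
    rng = random.Random(seed)
    # Assumption for illustration: an exponential (right-skewed) predictor.
    x = [rng.expovariate(1.0) for _ in range(n)]
    # Conditional on x, Y is normal with mean 2 + 3x and variance 1.
    y = [2.0 + 3.0 * xi + rng.gauss(0.0, 1.0) for xi in x]
    intercept, slope = fit_simple_ols(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return {
        "intercept": intercept,
        "slope": slope,
        "skew_y": skewness(y),
        "skew_residuals": skewness(residuals),
    }


result = simulate()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In a run like this, the skewness of Y should come out clearly positive while the skewness of the residuals should be close to zero, and the fitted intercept and slope should land near the values used to generate the data. A histogram of Y alone would look non-normal, but that by itself says nothing against using OLS here.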
There are a variety of texts on the subject of linear models. My favorite is “Plane Answers to Complex Questions” by Christensen. Another text that does not delve too deeply into the statistical theory is “Applied Linear Statistical Models” by Neter, Kutner, Nachtsheim, and Wasserman.
/aw