All of these modules are extremely comprehensive and advanced implementations of the respective methods, and all of them share some general user interface solutions.
General Features Common to All Five Modules
Three alternative user-interfaces: (1) Quick-specs dialogs, (2) Wizard, and (3) Syntax. All modules offer three alternative user-interfaces for specifying research designs (e.g., ANOVA/ANCOVA designs, regression designs, response surface designs, mixture designs, etc.; see the description of GLM for details):
Automatically generating the syntax statements. One of the unique features of this user-interface is that in the background STATISTICA will automatically generate the complete set of syntax statements for any design specified via the Quick-specs dialogs (see point 1 above) or the Wizard (see point 2). These "active" logs of even the most complex and customized designs can be re-run, saved for future use, modified, included in STATISTICA Visual Basic scripts to be routinely run on new datasets, etc. Because the syntax for specifying general linear model designs is shared by all of these modules, it is also easy to move specifications from one type of analysis to another, for example, in order to fit the same model in GLM and GLZ.
Computation (training) sample, cross-validation (verification) sample, and prediction sample. All five modules will compute detailed residual statistics that can be saved for further analyses with other modules. Another unique feature of these programs is that the predicted and residual statistics can be computed separately for those observations from which the respective results were computed (i.e., the computation or training sample), for observations explicitly excluded from the model fitting computations (the cross-validation or verification sample), and for cases without observed data for the dependent (response) variables (prediction sample). Moreover, all graphical results options (e.g., probability plots, histograms, scatterplots of selected predicted or residual statistics) can be requested for these samples. Thus, all five programs offer exceptionally thorough diagnostic methods for evaluating the quality of the fit of the model.
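The three-sample idea described above can be illustrated with a minimal Python sketch. It is not STATISTICA code: the data, the `cases` layout, and the helper names `fit_line` and `sample_of` are assumptions made purely for illustration, and a one-predictor least squares line stands in for a full model.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical cases: (x, y, used_for_fitting); y is None when unobserved.
cases = [
    (1.0, 2.1, True), (2.0, 3.9, True), (3.0, 6.2, True), (4.0, 7.8, True),
    (5.0, 10.4, False), (6.0, 11.7, False),   # held out: verification sample
    (7.0, None, False), (8.0, None, False),   # no response: prediction sample
]

def sample_of(y, used_for_fitting):
    """Assign a case to one of the three samples."""
    if y is None:
        return "prediction"
    return "training" if used_for_fitting else "verification"

# Fit the model on the training sample only.
train = [(x, y) for x, y, used in cases if sample_of(y, used) == "training"]
intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])

# Predicted values exist for every case; residuals only where y was observed.
results = {"training": [], "verification": [], "prediction": []}
for x, y, used in cases:
    predicted = intercept + slope * x
    residual = None if y is None else y - predicted
    results[sample_of(y, used)].append((x, predicted, residual))
```

Comparing the residuals in `results["verification"]` with those in `results["training"]` gives a rough check on over-fitting, while `results["prediction"]` holds predicted values only.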
Comparing analyses; modifying analyses. Like all analytic facilities of STATISTICA, multiple instances of all modules can be kept open at the same time, so multiple analyses can simultaneously be performed on the same or on different datasets. This is extremely useful for comparing the results from different analyses of the same data or the same analyses of different data. Modifying an analysis does not require complete respecification of the analysis; only desired changes need to be specified. Results from different modifications of an analysis can be easily compared. STATISTICA GLM, GRM, GDA, GLZ, and PLS can take what-if analyses to a new level, by allowing comparisons of different data and different analyses at the same time.
General Linear Models (GLM)
The following sections summarize some of the most important specific advantages of GLM over other programs, and the unique features and facilities offered in this module; however, it is important to start by stressing the fact that GLM is not only the most computationally advanced GLM tool available on the market but it is also the most comprehensive and complete application that offers a wider selection of options, more graphs, more accompanying statistics and extended diagnostics than any other program. It has been designed with a "no compromise approach" to address the most challenging problems in the area of GLM and also to offer the most comprehensive selections of user-selectable options to handle so-called "controversial problems" that do not have any widely agreed upon solutions.
Designs. The user can choose simple or highly customized one-way, main-effect, factorial, or nested ANOVA or MANOVA designs, repeated measures designs, simple, multiple and polynomial regression designs, response surface designs (with or without blocking), mixture surface designs, simple or complex analysis of covariance designs (e.g., with separate slopes), or general multivariate MANCOVA designs. Factors can be fixed or random (in which case synthesized error terms will be computed). All of these designs can be efficiently specified via any of the three types of user interfaces described above, and customized in various ways (e.g., you can drop effects, specify custom hypotheses, etc.). Also, GLM can handle extremely large analysis designs; for example, repeated measures factors with 1000 levels can be specified, models may include 1000 covariates, or you can analyze very efficiently literally huge between-group designs.
The overparameterized and sigma-restricted model. A detailed discussion is beyond the scope of this summary. Most programs offer only the overparameterized model, and a few offer only the sigma-restricted model; STATISTICA GLM is the only program available on the market that offers both. Each of the two models has its advantages and disadvantages, and both approaches are necessary to offer a truly comprehensive GLM computational platform, capable of properly handling even the most advanced and demanding analytic problems. For example, nested designs and separate slope designs are best analyzed using the overparameterized model; the most common way to estimate variance components and to compute synthesized error terms in mixed model ANOVA is also based on the overparameterized model. Factorial designs with large numbers of factors are best analyzed using the sigma-restricted model. In short, a simple 2-way interaction of two two-level factors requires only a single column in the design matrix using the sigma-restricted parameterization, but 4 columns in the overparameterized model; as a result, analyzing, for example, an 8-way full factorial design with GLM requires only a few seconds.
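The column counts quoted above can be reproduced with a short Python sketch. The coding conventions here are assumptions: the sigma-restricted scheme sets the last level of a factor to -1 in every column, and the overparameterized scheme uses one indicator column per level. The function names are illustrative and not part of STATISTICA.

```python
from itertools import product

def sigma_restricted(level, n_levels):
    """Code a level into n_levels - 1 columns; the last level is -1 throughout."""
    if level == n_levels - 1:
        return [-1] * (n_levels - 1)
    return [1 if level == j else 0 for j in range(n_levels - 1)]

def overparameterized(level, n_levels):
    """Code a level into n_levels indicator (0/1) columns."""
    return [1 if level == j else 0 for j in range(n_levels)]

def interaction(cols_a, cols_b):
    """Interaction columns are all pairwise products of the main-effect columns."""
    return [a * b for a in cols_a for b in cols_b]

# The four cells of a 2 x 2 design, with levels numbered 0 and 1.
cells = list(product(range(2), range(2)))

sr_rows = [interaction(sigma_restricted(a, 2), sigma_restricted(b, 2))
           for a, b in cells]
op_rows = [interaction(overparameterized(a, 2), overparameterized(b, 2))
           for a, b in cells]

print(len(sr_rows[0]), "interaction column(s), sigma-restricted")   # 1
print(len(op_rows[0]), "interaction column(s), overparameterized")  # 4
```

Under the same conventions, the highest-order interaction of eight two-level factors needs a single sigma-restricted column but 2^8 = 256 overparameterized columns, which is why the sigma-restricted form scales well for large factorial designs.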
Handling missing cell designs. STATISTICA GLM will compute the customary Type I through IV sums of squares for unbalanced and incomplete designs; however, as is widely acknowledged (e.g., Searle, 1987; Milliken & Johnson, 1986), applying these methods to "messy" designs with missing cells in more or less random locations in the design can lead to misleading, and even blatantly nonsensical results. STATISTICA GLM therefore also offers two additional methods for analyzing missing cell designs: Hocking's (1985) "effective hypothesis decomposition," and a method that will automatically drop effects that cannot be fully estimated (e.g., when the least squares means do not exist for all levels of the respective main effect or interaction effect). The latter method is the one commonly applied to the analysis of highly fractionalized designs in industrial experimentation (see also STATISTICA DOE). This method leads to results that are unique (not dependent on the ordering of factor levels), easily interpretable, and consistent with the industrial experimentation literature. This highly useful feature is unique to GLM.
Results statistics. GLM will compute all the standard results, including ANOVA tables with univariate and multivariate tests, descriptive statistics, etc. GLM also offers a large number of results options, and in particular graphics options, that are usually not available in other programs. For example, GLM includes a comprehensive selection of types of plots of means (observed, least squares, weighted) for higher-order interactions, with error bars (standard errors) for effects involving between-group factors as well as repeated measures factors; extensive residual analyses and plots (for the "training" or computation sample, for a cross-validation or "verification" sample, or for a prediction sample without observed values for the dependent or response variables); plots of variance components; desirability profiler and response optimization for any model; and adjusted means for traditional analysis of covariance designs. Extensive and flexible options for specifying planned comparisons are provided, including facilities to specify contrasts using either the traditional command syntax or an extremely simple to use (Wizard-style) sequence of "intelligent" contrast dialogs (you can enter contrast coefficients for clearly labeled levels of factors or cells in the design; the program will then evaluate the comparison for the least squares ("predicted") means, i.e., for the means as predicted by and consistent with the current model; this is a unique solution to the problem of planned comparisons in complex and incomplete designs); simple ways to test linear combinations of parameter estimates (e.g., to test for the equality of specific regression coefficients); specifications of custom error terms and effects; comprehensive post-hoc comparison methods for between-group effects as well as repeated measures effects, and the interactions between repeated measures and between-group effects, including Fisher LSD, Bonferroni, Scheffé, Tukey HSD, Unequal N HSD, Newman-Keuls, Duncan, and Dunnett's test (with flexible options for estimating the appropriate error terms for those tests); and tests of assumptions (e.g., Levene's test, plots of means vs. standard deviations, etc.).
General Regression Models (GRM)
STATISTICA General Regression Models (GRM) also provides a unique, highly flexible implementation of the general linear model. Specifically, GRM's implementation permits the user to use stepwise and best subset methods to build models for highly complex designs, including designs with effects for categorical predictor variables. Thus, the "general" in General Regression Models refers both to the use of the general linear model, and to the fact that unlike most other stepwise regression programs, GRM is not limited to the analysis of designs that contain only continuous predictor variables.
Stepwise and best-subset selection for continuous and categorical predictors (ANOVA models) for models with multiple dependent variables. GRM is a "sister program" to the STATISTICA General Linear Model (GLM) module. In addition to the large number of unique analytic options available in GLM (including planned comparisons, custom hypotheses, a wide selection of post-hoc tests, residual analysis options, etc.), the General Regression Models (GRM) module allows you to build models via stepwise and best subset methods. GRM makes these techniques available not only for traditional analytic problems with a single dependent variable, but extends them to analyses of problems with multiple dependent variables; thus, in a sense, GRM can be considered a unique stepwise and best-subset canonical analysis program. These methods can be used with designs that include continuous and/or categorical predictor variables (i.e., ANOVA or ANCOVA designs), and the techniques used in GRM will ensure that multiple degree of freedom effects will be considered (moved in or out of the model) in blocks. Specifically, GRM allows you to build models via forward- or backward-only selection (effects can only be entered or removed once during the selection process), standard forward or backward selection (effects can be moved in or out of the model at each step, according to F or p to enter or remove criteria), or via best subset selection; this latter method gives the user flexible options to control the models considered during the subset search (e.g., maximum and minimum subset sizes; Mallows' Cp, R-square, and adjusted R-square criteria for best subset selection; etc.).
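As a rough illustration of best subset selection with Mallows' Cp, the following Python sketch enumerates predictor subsets between a minimum and maximum size and ranks them by Cp. It is a simplified stand-in, not GRM's algorithm: it handles only single-degree-of-freedom continuous predictors and one dependent variable, and the data and function names (`solve`, `sse`, `best_subsets`) are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def sse(rows, y, subset):
    """Residual sum of squares of an OLS fit (with intercept) on the subset."""
    x = [[1.0] + [row[j] for j in subset] for row in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(x, y))

def best_subsets(rows, y, min_size=1, max_size=None):
    """Return (Cp, subset) pairs sorted from best (lowest Cp) to worst."""
    n, p = len(rows), len(rows[0])
    max_size = p if max_size is None else max_size
    sigma2 = sse(rows, y, range(p)) / (n - p - 1)  # error variance, full model
    results = []
    for size in range(min_size, max_size + 1):
        for subset in combinations(range(p), size):
            # Cp = SSE / sigma2 - n + 2 * (number of parameters incl. intercept)
            cp = sse(rows, y, subset) / sigma2 - n + 2 * (size + 1)
            results.append((cp, subset))
    return sorted(results)

# Hypothetical data: y depends on predictors 0 and 1 but not on predictor 2.
rows = [[float(i), float((i * i) % 7), float((3 * i) % 5)] for i in range(12)]
y = [2 + 1.5 * r[0] - 2 * r[1] + 0.1 * ((7 * i) % 3 - 1)
     for i, r in enumerate(rows)]

ranked = best_subsets(rows, y)
```

Exhaustive enumeration grows as 2^p, so limits such as `min_size` and `max_size` matter in practice; by construction, Cp for the full model equals its number of parameters.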
Results. The General Regression Models (GRM) module offers all standard and unique results options described in the context of the GLM module in the previous section (including desirability profiling, predicted and residual statistics for the computation or training sample, cross-validation or verification sample, and prediction sample; tests of assumptions, means plots, etc.). In addition, unique regression-specific results options are also available, including Pareto charts of parameter estimates, whole model summaries (tests) with various methods for evaluating no-intercept models, partial and semi-partial correlations, etc.
General Discriminant Analysis Models (GDA)
The STATISTICA General Discriminant Analysis Models (GDA) module is an application and extension of the General Linear Model to classification problems. Like the Stepwise Discriminant Function Analysis module, GDA allows you to perform standard and stepwise discriminant analyses. However, GDA implements the discriminant analysis problem as a special case of the general linear model, and thereby offers extremely useful analytic techniques that are innovative, efficient, and extremely powerful.
Computational approach and unique applications. As in traditional discriminant analysis, GDA allows you to specify a categorical dependent variable. For the analysis, the group membership (with regard to the dependent variable) is then coded into indicator variables, and all methods of GRM (described above) can be applied. In the results dialogs, the extensive selection of residual statistics of GRM and GLM is available in GDA as well; for example, you can review all the regression-like residuals and predicted values for each group (each coded dependent indicator variable), and choose from the large number of residual plots. In addition, all specialized prediction and classification statistics are computed that are commonly reviewed in a discriminant analysis; but those statistics can be reviewed in innovative ways because of STATISTICA's unique approach. For example, you can perform "desirability profiling" by combining the posterior prediction probabilities for the groups into a desirability score, and then let the program find the values or combination of categorical predictor settings that will optimize that score. Thus, GDA provides powerful and efficient tools for data mining as well as applied research; for example, you could use the DOE (Design of Experiments) methods to generate an experimental design for quality improvement, apply this design to categorical outcome data (e.g., distinct classifications of an outcome as "superior," "acceptable," or "failed"), and then model the posterior prediction probabilities of those outcomes using the variables of your experimental design.
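The indicator-coding step can be sketched in a few lines of Python. This shows only the general idea (code each group as a 0/1 dependent variable, fit a regression per indicator, and compare the predicted values); it does not compute the discriminant functions or posterior probabilities that GDA reports. The data, the two group labels, and the helper names are assumptions for illustration.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical data: one continuous predictor and a two-level outcome.
x = [1.0, 1.5, 2.0, 2.5, 4.0, 4.5, 5.0, 5.5]
group = ["acceptable"] * 4 + ["failed"] * 4
levels = sorted(set(group))

# Code group membership into one indicator variable per group and fit
# a separate regression of each indicator on the predictor.
models = {}
for g in levels:
    indicator = [1.0 if label == g else 0.0 for label in group]
    models[g] = fit_line(x, indicator)

def predict(value):
    """Regression-like predicted value for each group's indicator."""
    return {g: a + b * value for g, (a, b) in models.items()}

def classify(value):
    """Assign the case to the group with the largest predicted value."""
    scores = predict(value)
    return max(scores, key=scores.get)

print(classify(1.2), classify(5.1))
```

Because the indicators sum to 1 for every case and each model has an intercept, the predicted values across groups also sum to 1, although individually they are not bounded to the 0 to 1 range.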
Standard discriminant analysis results. STATISTICA GDA will compute all standard results for discriminant analysis, including discriminant function coefficients, canonical analysis results (standardized and raw coefficients, step-down tests of canonical roots, etc.), classification statistics (including Mahalanobis distances, posterior probabilities, actual classification of cases in the analysis sample and validation sample, misclassification matrix, etc.), and so on.
Unique features of GDA, currently only available in STATISTICA. In addition, STATISTICA GDA includes numerous unique features and results:
Specifying predictor variables and effects; model building:
1. Support for continuous and categorical predictors; instead of allowing only continuous predictors in the analysis (the common limitation in traditional discriminant function analysis programs), GDA allows the user to specify simple and complex ANOVA and ANCOVA-like designs, e.g., mixtures of continuous and categorical predictors, polynomial (response surface) designs, factorial designs, nested designs, etc.
2. Multiple-degree of freedom effects in stepwise selection; the terms that make up the predictor set (consisting not only of single-degree of freedom continuous predictors, but also multiple-degree of freedom effects) can be used in stepwise discriminant function analyses; multiple-degree of freedom effects will always be entered/removed as blocks.
3. Best subset selection of predictor effects; single- and multiple-degree of freedom effects can be specified for best-subset discriminant analysis; the program will select the effects (up to a user-specified number of effects) that produce the best discrimination between groups.
4. Selection of predictor effects based on misclassification rates; GDA allows the user to perform model building (selection of predictor effects) not only based on traditional criteria (e.g., p-to-enter/remove; Wilks' lambda), but also based on misclassification rates; in other words the program will select those predictor effects that maximize the accuracy of classification, either for those cases from which the parameter estimates were computed, or for a cross-validation sample (to guard against over-fitting); these techniques elevate GDA to the level of a fast neural-network-like data mining tool for classification, that can be used as an alternative to other similar techniques (tree-classifiers, designated neural-network methods, etc.; GDA will tend to be faster than those techniques because it is still based on the more efficient General Linear Model).
Results statistics; profiling:
1. Detailed results and diagnostic statistics and plots; in addition to the standard results statistics, GDA provides a large amount of auxiliary information to help the user judge the adequacy of the chosen discriminant analysis model (descriptive statistics and graphs, Mahalanobis distances, Cook distances, and leverages for predictors, etc.).
2. Profiling of expected classification; GDA includes an adaptation of the general GLM (GRM) response profiler; these options allow the user to quickly determine the values (or levels) of the predictor variables that maximize the posterior classification probability for a single group, or for a set of groups in the analyses; in a sense, the user can quickly determine the typical profiles of values of the predictors (or levels of categorical predictors) that identify a group (or set of groups) in the analysis.
A note of caution for models with categorical predictors, and other advanced techniques. The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most -- if not all -- textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single degree of freedom continuous predictors. No "experience" (in the literature) exists regarding issues of robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method, rather than a statistical analysis technique.
Generalized Linear Model (GLZ)
Generalized linear models make it possible to flexibly search for linear and nonlinear relationships between a continuous, binomial, multinomial, or ordinal multinomial response variable and categorical or continuous predictor variables. (Note that STATISTICA also includes an implementation of Generalized Additive Models, GAM.) A number of widely used types of analyses can be considered special applications of generalized linear models, such as binomial and multinomial logit and probit regression (which can quickly be specified via convenient short-cut dialog options), or Signal Detection Theory (SDT) models. The user-interfaces, methods for specifying designs, and look and feel of the program are identical to those implemented in the other four modules (GLM, GRM, GDA, PLS) described here. For example, you can easily specify ANOVA or ANCOVA-like designs, response surface designs, mixture surface designs, etc.; thus, even novice users will have no difficulty applying generalized linear models to analyze their data.
Models and link functions. A wide range of distributions (from the exponential family) can be specified for the response variable: Normal, Poisson, gamma, binomial, multinomial, ordinal multinomial, and inverse Gaussian. Further, the nature of the relationship between the predictors and the responses can be specified by choosing a so-called link function from a comprehensive list of (common and special-purpose) functions. Available link functions include: log, power, identity, logit, probit, complementary log-log, and log-log links. Unlike other nonlinear models, these models can be fitted via fast estimation procedures, and allow meaningful interpretations (similar to general linear models), and hence, they are extensively employed in the analysis of non-linear relationships in science as well as applied research.
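For readers unfamiliar with link functions, the Python sketch below pairs each link named above with its inverse, using textbook definitions. The dictionary layout and names are illustrative assumptions, not GLZ syntax; the log-log link in particular has more than one sign convention in the literature, and the one shown is only one of them.

```python
import math
from statistics import NormalDist

_normal = NormalDist()  # standard normal distribution, used by the probit link

def power_link(exponent):
    """Return (link, inverse) for the power link g(mu) = mu ** exponent."""
    return (lambda mu: mu ** exponent,
            lambda eta: eta ** (1.0 / exponent))

# Each entry maps a link name to (g, g_inverse), where eta = g(mu) is the
# linear predictor and mu = g_inverse(eta) is the expected response.
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "power(0.5)": power_link(0.5),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_normal.inv_cdf, _normal.cdf),
    "complementary log-log": (lambda mu: math.log(-math.log(1.0 - mu)),
                              lambda eta: 1.0 - math.exp(-math.exp(eta))),
    # Assumed convention: g(mu) = -log(-log(mu)).
    "log-log": (lambda mu: -math.log(-math.log(mu)),
                lambda eta: math.exp(-math.exp(-eta))),
}

# Map an expected response of 0.3 to the linear-predictor scale and back.
for name, (link, inverse) in LINKS.items():
    eta = link(0.3)
    print(f"{name:>22}: eta = {eta: .4f}, back-transformed = {inverse(eta):.4f}")
```

The logit, probit, complementary log-log, and log-log links map probabilities in (0, 1) onto the whole real line, which is why they are typically paired with binomial responses, while the log link is the usual choice for Poisson counts.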
Stepwise and best-subset selection for continuous and categorical predictors (ANOVA-like models). In addition to the standard model fitting techniques, STATISTICA GLZ also provides unique options for exploratory analyses, including model building facilities like forward- or backward-only selection of effects (effects can only be selected for inclusion or removal once during the selection process), standard forward or backward stepwise selection of effects (effects can be entered or removed at each step, using a p to enter or remove criterion), and best subset regression methods (using the likelihood score statistic, model likelihood, or Akaike information criterion). These powerful methods can be applied to categorical predictors (ANOVA-like designs; effects will be moved in or out of the model as multiple-parameter blocks) as well as continuous predictors, and will save significant amounts of time when building appropriate models for complex data.
Results. The Generalized Linear Model module will compute all standard results statistics, including likelihood ratio tests, and Wald and score tests for significant effects, parameter estimates and their standard errors and confidence intervals, etc. In addition, for ANOVA-like designs, tables and plots of predicted means (the equivalent of least squares means computed in the general linear model) with their standard errors can be computed, to aid in the interpretation of results. GLZ also includes a comprehensive selection of model checking tools such as Spreadsheets and graphs for various residuals and outlier detection statistics, including raw residuals, Pearson residuals, deviance residuals, studentized Pearson residuals, studentized deviance residuals, likelihood residuals, differential Chi-square statistics, differential deviance, and generalized Cook distances, etc. As described earlier, predicted and residual statistics can be requested for observations that were used for fitting the model, and those that were not (i.e., for the cross-validation sample).
Partial Least Squares (PLS)
Partial least squares methods for analyzing linear systems have become popular only in the last few years, and many of the algorithms and statistics are still the subject of ongoing research. STATISTICA Partial Least Squares (PLS) offers a selection of algorithms for univariate and multivariate partial least squares problems. The front-end user interface is very similar to that of GLM, GRM, GDA, and GLZ as described above, and all of the advantages and features discussed there apply to PLS as well (e.g., specification of models, auto-updating of results, etc.). Moreover, thanks to the full implementation of the three GLM-style user interfaces (see above), used also in GRM, GDA, and GLZ, it is very easy to set up models in one module (e.g., GLM), and to quickly analyze the data using the same model in PLS (or GLZ). This unique flexibility will allow even novice users to apply these powerful techniques to their analysis problems.
The overparameterized and sigma-restricted model for categorical predictors. Like GLM and GLZ, PLS offers both the overparameterized and sigma restricted parameterization methods for categorical predictors (ANOVA-like models). In partial least squares models, the sigma restricted solution can be particularly useful, because it may produce less complex results (explain more variability with fewer components, made up of design vectors coded in sigma-restricted form).
Algorithms. STATISTICA PLS implements the two most general algorithms for partial least squares analysis: SIMPLS and NIPALS.
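To give a sense of what NIPALS does, here is a minimal Python sketch of the algorithm for a single response variable (often called PLS1). It is a textbook-style illustration under stated assumptions, not STATISTICA's implementation: the function name `nipals_pls1` and the data are hypothetical, variables are centered but not scaled, and the multivariate case (which requires an inner iteration) is omitted.

```python
import math

def nipals_pls1(x_rows, y, n_components):
    """NIPALS for one response: returns fitted values and component scores."""
    n, p = len(x_rows), len(x_rows[0])
    x_mean = [sum(r[j] for r in x_rows) / n for j in range(p)]
    y_mean = sum(y) / n
    x = [[r[j] - x_mean[j] for j in range(p)] for r in x_rows]  # centered X
    e = [yi - y_mean for yi in y]                               # centered y
    fitted = [y_mean] * n
    scores = []
    for _ in range(n_components):
        # Weight vector: direction in X space of maximal covariance with y.
        w = [sum(x[i][j] * e[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
        # Scores, X loadings, and the regression of y on the scores.
        t = [sum(x[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        load = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(e[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        x = [[x[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        e = [e[i] - t[i] * q for i in range(n)]
        fitted = [fitted[i] + t[i] * q for i in range(n)]
        scores.append(t)
    return fitted, scores

# Hypothetical data in which y is an exact linear function of two predictors.
x_rows = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [1 + 2 * a - b for a, b in x_rows]

fit_one, _ = nipals_pls1(x_rows, y, 1)
fit_two, scores = nipals_pls1(x_rows, y, 2)
```

With as many components as there are linearly independent predictors, the fit coincides with ordinary least squares; the practical value of PLS lies in stopping earlier, so the number of components is usually chosen with a cross-validation sample.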
Results. PLS will compute all the standard results for a partial least squares analysis, and also offers a large number of results options and in particular graphics options that are usually not available in other implementations; for example, graphs of parameter values as a function of the number of components, two-dimensional plots for all output statistics (parameters, factor loadings, etc.), two-dimensional plots for all residuals statistics, etc. Also, like GLM, GRM, and GLZ, the Partial Least Squares module offers extensive residual analysis options, and predicted and residual statistics can be requested for observations that were used for fitting the model (the "training" sample), those that were not (i.e., the cross-validation or verification sample), and for cases without observed data on the dependent (response) variables (the prediction sample).
©Copyright StatSoft, Inc., 1984-2008.
StatSoft, StatSoft logo, and STATISTICA, are trademarks of StatSoft, Inc.