Data Mining Modules
Feature Selection and Variable Filtering
This module will automatically select subsets of variables from extremely large data files or databases connected for in-place processing (IDP). The module can handle a practically unlimited number of variables: over a million (!) input variables can be scanned to select predictors for regression or classification. Specifically, the program includes several options for selecting variables ("features") that are likely to be useful or informative in specific subsequent analyses. The unique algorithms implemented in the Feature Selection and Variable Filtering module will select continuous and categorical predictor variables which show a relationship to the continuous or categorical dependent variables of interest, regardless of whether that relationship is simple (e.g., linear) or complex (nonlinear, non-monotone). Hence, the program does not bias the selection in favor of any particular model that you may use to find a final best rule, equation, etc. for prediction or classification. Various advanced feature selection options are also available. This module is particularly useful in conjunction with the in-place processing of databases (without the need to copy or import the input data to the local machine), when it can be used to scan huge lists of input variables, select likely candidates that contain information relevant to the analyses of interest, and automatically select those variables for further analyses with other nodes in the data miner project. Subsets of variables based on an initial scan via this module can be submitted to further (post-) feature selection methods based on neural networks, MAR Splines, linear regression or classifiers, or CHAID. These options allow STATISTICA Data Miner to handle data sets in the multiple giga- and terabyte range (see Comparative performance benchmarks using large data sets).
Association Rules
This module contains a complete implementation of the so-called A-priori algorithm for detecting ("mining for") association rules such as "customers who order product A often also order product B or C" or "employees who said positive things about initiative X also frequently complain about issue Y but are happy with issue Z" (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000). The Association Rules module allows you to rapidly process huge data sets for associations (relationships), based on pre-defined "threshold" values for detection. Specifically, the program will detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects applied to databases containing records of customer transactions (e.g., items purchased by each customer), and also in the area of text mining. As with all modules of STATISTICA, data in external databases can be processed by the STATISTICA Association Rules module in-place (see IDP technology), so the program is prepared to handle extremely large analysis tasks efficiently.
The results can be displayed in tables, and also in unique 2D and 3D graphs where strong associations are highlighted by thick lines connecting the respective items.
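The idea of mining rules against pre-defined thresholds can be illustrated with a minimal Python sketch of an A-priori style search. This is not the STATISTICA implementation; the function names, the toy transactions, and the use of support and confidence as the two thresholds are assumptions made for illustration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    n = len(transactions)
    tx = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in tx for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep only the candidates whose support reaches the threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in tx if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates and prune
        # any candidate that has an infrequent k-item subset.
        k += 1
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent

def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    found = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                confidence = support / frequent[ante]
                if confidence >= min_confidence:
                    found.append((ante, itemset - ante, support, confidence))
    return found
```

With a handful of toy transactions, frequent_itemsets(transactions, 0.4) followed by rules(frequent, 0.7) returns only those "A implies B" rules whose support and confidence both reach the chosen thresholds.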
Interactive Drill-Down Explorer
A first step of many data mining projects is to explore the data interactively, to gain a first "impression" of the types of variables in the analyses, and their possible relationships. The purpose of the Interactive Drill-Down Explorer is to provide a combined graphical, exploratory data analysis, and tabulation tool that will allow you to quickly review the distributions of variables in the analyses, their relationships to other variables, and to identify the actual observations belonging to specific subgroups in the data.
How the Drill-Down Explorer Works. The "drill-down" metaphor within the data mining context summarizes the basic operation of this analytic process quite well: The program allows you to select observations from larger data sets by selecting subgroups based on specific values or ranges of values of particular variables of interest (e.g., Gender and Average Purchase in the example above); in a sense you can expose the "deeper layers" or "strata" in the data by reviewing smaller and smaller subsets of observations selected by increasingly complex logical selection conditions.
Drilling "up." The interactive nature of the Drill-Down Explorer allows you not only to drill down into the data or database (select groups of observations with increasingly specific logical selection conditions), but also to "drill up": at any time, you can select one of the previously specified variable (category) groups and de-select it from the list of drill-down conditions; while processing the data, the program will then select only those observations that fit the remaining logical (case) selection conditions, and update the results accordingly.
Applications of the Interactive Drill-Down Explorer. The example shown earlier is very simple, exposing only the basic functionality of the program. The real power of the STATISTICA Interactive Drill-Down Explorer lies in the various auxiliary results which can automatically be updated during the interactive drill-down/up exploration: you can select a list of variables for review, and compute for the selected cases:
- Descriptive statistics and frequency tables;
- Box-and-whiskers plots summarizing the distributions of continuous variables;
- Scatterplot matrices summarizing the relationships between continuous variables;
- All of the other statistical and graphical analyses available in STATISTICA by extracting the observations belonging to the current subset.
For example, you could review the types of purchases that customers made with different demographic characteristics, study the effectiveness of certain drugs within different treatment groups, ages, etc., or extract likely customers for a new product from a database of previous customers based on careful study of apparent (market) segments exposed by the drill-down analysis.
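The drill-down and drill-up operations described above amount to adding and removing logical selection conditions. The following Python sketch illustrates that idea only; the select function, the record layout, and the sample values are hypothetical and are not part of STATISTICA.

```python
def select(rows, conditions):
    """Return the rows that satisfy every active selection condition."""
    return [row for row in rows
            if all(test(row[var]) for var, test in conditions.items())]

# Hypothetical customer records.
rows = [
    {"Gender": "F", "AvgPurchase": 120.0},
    {"Gender": "F", "AvgPurchase": 40.0},
    {"Gender": "M", "AvgPurchase": 150.0},
    {"Gender": "M", "AvgPurchase": 30.0},
]

conditions = {}
conditions["Gender"] = lambda v: v == "F"          # drill down on Gender
conditions["AvgPurchase"] = lambda v: v >= 100.0   # drill down further
deep = select(rows, conditions)                    # both conditions apply

del conditions["Gender"]                           # drill up: drop one condition
wider = select(rows, conditions)                   # only AvgPurchase applies
```

Any summary (descriptive statistics, plots, and so on) would then be recomputed on the currently selected subset each time a condition is added or removed.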
Generalized EM & k-Means Cluster Analysis
The STATISTICA Generalized EM (Expectation Maximization) and k-Means Clustering module is an extension of the techniques available in the general STATISTICA Cluster Analysis options. It is specifically designed to handle large data sets, to allow clustering of continuous and/or categorical variables, and to provide the functionality for complete unsupervised learning (clustering) for pattern recognition, with all deployment options for predictive clustering. Various cross-validation options are provided (including modified v-fold cross-validation options) that will automatically choose and evaluate a best final solution for the clustering problem. You do not need to specify the number of clusters before an analysis; instead, the program will use automatic (cross-validation based) methods to choose a best cluster solution (number of clusters) for you. The advanced EM clustering technique available in this module is sometimes referred to as probability-based clustering or statistical clustering. The program will cluster observations based on continuous and categorical variables, assuming different distributions for the variables in the analyses (as specified by the user). Detailed output summaries and graphs (e.g., distribution plots for EM clustering) are provided, and detailed classification statistics are computed for each observation. These methods are optimized to handle very large data sets, and various results are provided to facilitate subsequent analyses using the assignment of observations to clusters. Options for deploying cluster solutions (in C, C++, C#, Visual Basic, or XML syntax based PMML), for classifying new observations, are also included.
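For readers unfamiliar with k-means, the basic iteration can be sketched in a few lines of Python. This is the textbook algorithm for continuous variables with a fixed number of clusters and user-supplied starting centers; it does not include the cross-validation based choice of the number of clusters or the EM technique described above, and the function name is an assumption.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Alternate assignment and centroid update until the centers are stable."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for each point.
        labels = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

New observations can then be "deployed" by assigning each one to its nearest final center.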
Generalized Additive Models (GAM)
The STATISTICA Generalized Additive Models facilities are an implementation of methods developed and popularized by Hastie and Tibshirani (1990); additional detailed discussion of these methods can also be found in Schimek (2000). The program will handle continuous and categorical predictor variables. Note that STATISTICA includes a comprehensive selection of methods for fitting non-linear models to data, such as the Nonlinear Estimation module, Generalized Linear Models, General Classification and Regression Trees, etc.
Distributions and link functions. The program allows the user to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable:
Normal, Gamma, and Poisson distributions:
- Log link: f(z) = log(z)
- Inverse link: f(z) = 1/z
- Identity link: f(z) = z
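The three link functions listed above, together with their inverses, can be written out directly. The following Python sketch is purely illustrative; the dictionary name and its layout are assumptions.

```python
import math

# Each entry maps a link name to (link, inverse link).
LINKS = {
    "log": (lambda z: math.log(z), lambda eta: math.exp(eta)),
    "inverse": (lambda z: 1.0 / z, lambda eta: 1.0 / eta),
    "identity": (lambda z: z, lambda eta: eta),
}

link, inverse = LINKS["log"]
eta = link(2.5)        # value on the scale of the additive predictor
mean = inverse(eta)    # back on the scale of the dependent variable
```

The link maps the expected value of the dependent variable to the scale on which the smoothed predictor effects are added; the inverse link maps the sum back.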
Scatterplot smoother. The program uses the cubic spline smoother with user-defined degrees of freedom to find an optimum transformation (function) of the predictor variables.
Results statistics. The program will report a comprehensive set of results statistics to aid in the evaluation of model adequacy, model fit, and interpretation of results; specifically, results include: the iteration history for the model fitting computations, summary statistics including the overall R-square value (computed from the deviance statistic), model degrees of freedom, and detailed observational statistics pertaining to the predicted response, residuals, and the smoothing of the predictor variables. Results graphs include plots of observed responses vs. residual responses, predicted values vs. residuals, histograms of observed and residual values, normal probability plots of residual values, and partial residual plots for each predictor, indicating the cubic spline smoothing fit for the final solution; for binary responses (e.g., logit-models) lift charts can also be computed.
General Classification and Regression Trees (GTrees)
The General Classification and Regression Tree (GC&RT) Model is a recursive partitioning method used to classify or divide cases based on a set of predictor variables. Unlike linear or nonlinear regression-like algorithms, this module will find hierarchical decision rules to provide optimal separation between observations with regard to a categorical or continuous criterion variable, based on splits on one or more continuous and/or categorical predictor variables. This module is a comprehensive implementation of the methods described as CART® by Breiman, Friedman, Olshen, and Stone (1984). However, the General Trees module contains various extensions and options that are typically not found in implementations of this algorithm, and that are particularly useful for data mining applications. In addition to standard analyses, the implementation of these methods in STATISTICA enables you to specify ANOVA/ANCOVA-like designs with continuous and/or categorical predictor variables, and their interactions. In short, ANOVA/ANCOVA-like predictor designs can be specified via dialogs, Wizards, or (design) command syntax; moreover, the command syntax is compatible across modules, so you can quickly apply identical designs to very different analyses (e.g., compare the quality of classification using GDA vs. GTrees).
User interface; specifying "models." The program provides a large number of options for controlling the building of the tree(s), the pruning of the tree(s), and the selection of the best-fitting solution. For continuous dependent (criterion) variables, pruning of the tree can be based on the variance, or on FACT-style pruning. For categorical dependent (criterion) variables, pruning of the tree can be based on misclassification errors, variance, or FACT-style pruning. You can specify the maximum number of nodes for the tree or the minimum n per node. Options are provided for validating the best decision tree, using V-fold cross validation, or by applying the decision tree to new observations in a validation sample. For categorical dependent (criterion) variables, i.e., for classification problems, various measures can be chosen to modify the algorithm and to evaluate the quality of the final classification tree: Options are provided to specify user-defined prior classification probabilities and misclassification costs; goodness-of-fit measures include the Gini measure, Chi-square, and G-Square.
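As an illustration of the Gini measure mentioned above, the following Python sketch computes the textbook Gini impurity and uses it to find the best binary split on a single continuous predictor. It ignores prior probabilities and misclassification costs, and the function names are assumptions; it is not a description of how STATISTICA computes its results.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) of the best binary split."""
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    best = (None, gini(labels))  # start from the impurity of the parent node
    # Candidate thresholds are midpoints between adjacent distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best
```

A tree-building procedure would repeat this search over all predictors at each node and stop according to settings such as the minimum n per node or the maximum number of nodes.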
Missing data and surrogate splits. Missing data values in the predictors can be handled by allowing the program to determine splits for surrogate variables, i.e., variables that are similar to the respective variable used for a particular split (node).
ANOVA/ANCOVA-like designs. In addition to the traditional CART®-style analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, etc.).
Tree browser. In addition to simple summary tree graphs, you can display the results trees in intuitive interactive tree-browsers that allow you to collapse or expand the nodes of the tree, and to quickly review the most salient information regarding the respective tree node or classification. For example, you can highlight (click on) a particular node in the browser-panel and immediately see the classification and misclassification rates for that particular node. The tree-browser provides a very efficient and intuitive facility for reviewing complex tree-structures, using methods that are commonly used in windows-based computer applications to review hierarchically structured information. Multiple tree-browsers can be displayed simultaneously, containing the final tree and different sub-trees pruned from the larger tree; by placing multiple browsers side-by-side, it is easy to compare different tree structures and sub-trees. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees.
Interactive trees. Options are also provided to review trees interactively, either by using STATISTICA Graphics brushing tools or by placing large tree graphs into scrollable graphics windows where large graphs can be inspected "behind" a smaller (scrollable) window.
Results statistics. The STATISTICA GTrees module provides a very large number of results options. Summary results for each node are accessible, detailed statistics are computed pertaining to classification, classification costs, gain, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).
C, C++, C#, Java, STATISTICA Visual Basic, SQL Code generators. The information contained in the final tree can be quickly incorporated into your own custom programs or database queries via the optional C, C++, C#, Java, STATISTICA Visual Basic, or SQL query code generator options. The STATISTICA Visual Basic code will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner.
General CHAID (Chi-square Automatic Interaction Detection) Models
Like the implementation of General Classification and Regression Trees (GTrees) in STATISTICA, the General Chi-square Automatic Interaction Detection module (another recursive partitioning method) provides not only a comprehensive implementation of the original technique, but also extends these methods to the analysis of ANOVA/ANCOVA-like designs.
Standard CHAID. The CHAID analysis can be performed for both continuous and categorical dependent (criterion) variables. Numerous options are available to control the construction of hierarchical trees: the user has control over the minimum n per node, maximum number of nodes, and probabilities for splitting and for merging categories; the user can also request exhaustive searches for the best solution (Exhaustive CHAID); V-fold validation statistics can be computed to evaluate the stability of the final solution; for classification problems, user-defined misclassification costs can be specified.
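For categorical dependent variables, CHAID evaluates candidate splits and category merges with chi-square tests on contingency tables of predictor categories by dependent-variable classes. The Python sketch below computes only the Pearson Chi-square statistic and its degrees of freedom for one such table; converting the statistic to a probability and comparing it with the splitting or merging probability is omitted, and the function name is an assumption.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a table given as a list of rows.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof
```

Two predictor categories whose rows give a small statistic are distributed similarly across the classes of the dependent variable and are therefore candidates for merging.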
ANOVA/ANCOVA-like designs. In addition to the traditional CHAID analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, General Classification and Regression Tree Models, etc.). Refer also to the description of GLM (General Linear Models) and General Classification and Regression Trees (GTrees).
Tree browser. Like the binary results tree used to summarize binary classification and regression trees (see GTrees), the results of the CHAID analysis can be reviewed in the STATISTICA Tree Browser. This unique tree browser provides a very efficient and intuitive facility for reviewing complex tree-structures and for comparing multiple tree-solutions side-by-side (in multiple tree-browsers), using methods that are commonly used in windows-based computer applications to review hierarchically structured information. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees. For additional details, see also the description of the tree browser in the context of General Classification and Regression Trees (GTrees).
Results statistics. The STATISTICA General CHAID Models module provides a very large number of results options. Summary results for each node are accessible, detailed statistics are computed pertaining to classification, classification costs, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).
Interactive Classification and Regression Trees
In addition to the modules for automatic tree building (e.g., General Classification and Regression Trees, General CHAID models), STATISTICA Data Miner also includes designated tools for building such trees interactively. You can choose either the (binary) General Classification and Regression Trees method or the CHAID method for building the (decision) tree, and at each step grow the tree either interactively (by choosing the splitting variable and splitting criterion) or automatically. When growing trees interactively, you have full control over all aspects of how to select and evaluate candidates for each split, how to categorize the range of values in predictors, etc. The highly interactive tools available for this module allow you to grow and prune back trees to quickly evaluate the quality of the tree for classification or regression prediction and to compute all auxiliary statistics at each stage to fully explore the nature of each solution. This tool is extremely useful for predictive data mining as well as for exploratory data analysis (EDA), and includes the complete set of options for automatic deployment, for the prediction or predicted classification of new observations (see also the description of these options in the context of CHAID and the General Classification and Regression Trees modules).
Boosted Trees
The most recent research on statistical and machine learning algorithms suggests that for some "difficult" estimation and prediction (predicted classification) tasks, using successively boosted simple trees can yield more accurate predictions than neural network architectures or complex single trees alone. STATISTICA Data Miner includes an advanced Boosted Trees module for applying this technique to predictive data mining tasks. You have control over all aspects of the estimation procedure and detailed summaries of each stage of the estimation procedures are provided so that the progress over successive steps can be monitored and evaluated. The results include most of the standard summary statistics for classification and regression computed by the General Classification and Regression Trees module. Automatic methods for deployment of the final boosted tree solution for classification or regression prediction are also provided.
Random Forest
The STATISTICA Random Forest module is an implementation of the Random Forest algorithm developed by Breiman. The algorithm is applicable to classification as well as regression problems. A Random Forest consists of a collection (ensemble) of simple tree classifiers, each capable of producing a response when presented with a set of predictor values. You have full control over all key aspects of the estimation procedure and model parameters, including the complexity of the trees fitted to the data, the maximum number of trees in the forest, control over how to stop the algorithm when satisfactory results have been achieved, etc. This module runs efficiently on large datasets and can handle an extremely large number of variables without variable deletion. The results include most of the standard summary statistics for classification and regression computed by the General Classification and Regression Trees module. Automatic methods for deployment of the final Random Forest solution for classification or regression prediction are also provided.
Support Vector Machines (SVM)
This method performs regression and classification tasks by constructing nonlinear decision boundaries. Because of the nature of the feature space in which these boundaries are found, Support Vector Machines (SVM) can exhibit a large degree of flexibility in handling classification and regression tasks of varied complexities. STATISTICA SVM supports four types of Support Vector models with a variety of kernels as basis function expansions, including linear, polynomial, RBF, and sigmoid. It also provides a facility for handling imbalanced data. Cross-validation, a well-established technique, is used for determining the best value of the various model parameters among a set of given values. A large number of graphs and spreadsheets can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Automatic methods for deployment of the final SVM solution for classification or regression prediction are also provided.
Naïve Bayes
The Naïve Bayes classifier is based on Bayes' theorem and is particularly suited to problems where the dimensionality of the inputs is high, owing to its simplifying assumption of independence among predictors. Despite this assumption, Naïve Bayes often performs competitively with more sophisticated classification methods. Although the assumption that the predictor variables are independent is not always accurate, it does simplify the classification task dramatically, since it allows the class conditional densities to be calculated separately for each variable, i.e., it reduces a multidimensional task to a number of one-dimensional tasks. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, so the classification task remains largely unaffected. STATISTICA supports categorical predictors and offers several choices for modeling numeric predictors to suit your analysis. These include normal, lognormal, gamma, and Poisson density functions. STATISTICA also provides automatic methods for deployment of the final Naïve Bayes model.
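The reduction to one-dimensional tasks can be seen in a minimal Python sketch that uses normal densities for numeric predictors. The function names are assumptions, categorical predictors and the other density choices are left out, and the small variance floor is a convenience of this sketch rather than anything documented for STATISTICA.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior and per-variable (mean, variance) for each class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        prior = len(rows) / len(X)
        stats = []
        for column in zip(*rows):  # one predictor at a time
            mean = sum(column) / len(column)
            # Variance floor (assumption) avoids division by zero.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (prior, stats)
    return model

def predict(model, row):
    """Return the class with the largest log posterior (up to a constant)."""
    def log_posterior(label):
        prior, stats = model[label]
        total = math.log(prior)
        # Independence assumption: log densities simply add up.
        for v, (mean, var) in zip(row, stats):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)
```

Because each class conditional density is a product of one-dimensional densities, fitting requires only a mean and a variance per class and predictor.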
K-Nearest Neighbors (KNN)
STATISTICA K-Nearest Neighbors is a memory-based method that, in contrast to other statistical methods, requires no training (i.e., no model to fit). It falls into the category of Prototype Methods. It functions on the intuitive idea that close objects are more likely to be in the same category. Thus, in KNN, predictions for new (i.e., unseen) data are based on a set of prototype examples, using a majority vote (for classification tasks) or averaging (for regression) over the K nearest prototypes. This method can handle large datasets and both continuous and categorical predictors. Cross-validation, a well-established technique, is used to obtain estimates of model parameters that are unknown. A large number of graphs and spreadsheets can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Automatic methods for deployment of the final KNN solution for classification or regression prediction are also provided.
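The majority-vote and averaging rules described above can be sketched in Python as follows. The sketch assumes numeric predictors and Euclidean distance, which is only one possible choice; the function name and its arguments are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k, task="classification"):
    """Predict from the k stored examples nearest to the query point."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),
    )[:k]
    targets = [target for _, target in nearest]
    if task == "classification":
        # Majority vote among the k nearest prototypes.
        return Counter(targets).most_common(1)[0][0]
    # Regression: average of the k nearest target values.
    return sum(targets) / len(targets)
```

Since there is no model to fit, the only quantity left to choose is K itself, which is where cross-validation comes in.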
Multivariate Adaptive Regression Splines (MAR Splines)
The STATISTICA MAR Splines (Multivariate Adaptive Regression Splines) module is based on a complete implementation of this technique, as originally proposed by Friedman (1991; Multivariate Adaptive Regression Splines, Annals of Statistics, 19, 1-141); in STATISTICA Data Miner, the MAR Splines options have further been enhanced to accommodate regression and classification problems, with continuous and categorical predictors.
The program, which in terms of its functionality can be considered a generalization and modification of stepwise Multiple Regression and Classification and Regression Trees (GC&RT), is specifically designed (optimized) for processing very large data sets. A large number of results options and extended diagnostics are available to allow you to evaluate numerically and graphically the quality of the MAR Splines solution.
C/C++, C#, STATISTICA Visual Basic, XML syntax based PMML code generators. The information contained in the model can be quickly incorporated into your own custom programs via the optional C/C++/C#, STATISTICA Visual Basic, or (XML-syntax based) PMML code generator options. STATISTICA Visual Basic will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner. PMML (Predictive Model Markup Language) files with deployment information can be used with the Rapid Deployment of Predictive Models options to compute predictions for large numbers of cases very efficiently; PMML files are fully portable, and deployment information generated via the desktop version of STATISTICA Data Miner can be used in STATISTICA Enterprise Data Miner (i.e., on the server side of Client-Server installations), and vice versa.
Goodness of Fit Computations
The STATISTICA Goodness of Fit module will compute various goodness of fit statistics for continuous and categorical response variables (for regression and classification problems). This module is specifically designed for data mining applications to be included in "competitive evaluation of models" projects as a tool to choose the best solution. The program uses as input the predicted values or classifications as computed from any of the STATISTICA modules for regression and classification, and computes a wide selection of fit statistics as well as graphical summaries for each fitted response or classification. Goodness of fit statistics for continuous responses include least squares deviation (LSD), average deviation, relative squared error, relative absolute error, and the correlation coefficient. For classification problems (for categorical response variables), the program will compute Chi-square, G-square (maximum likelihood chi-square), percent disagreement (misclassification rate), quadratic loss, and information loss statistics.
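Several of these statistics are easy to compute from paired observed and predicted values. The Python sketch below uses common textbook definitions; treating the mean squared error and mean absolute error as counterparts of the "least squares deviation" and "average deviation" is an assumption, and the exact formulas used by STATISTICA may differ.

```python
import math

def regression_fit(observed, predicted):
    """Fit statistics for a continuous response (textbook definitions)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    abs_err = sum(abs(o - p) for o, p in zip(observed, predicted))
    ss_obs = sum((o - mean_obs) ** 2 for o in observed)
    ss_pred = sum((p - mean_pred) ** 2 for p in predicted)
    cross = sum((o - mean_obs) * (p - mean_pred)
                for o, p in zip(observed, predicted))
    return {
        "mean_squared_error": sq_err / n,
        "mean_absolute_error": abs_err / n,
        # Errors relative to always predicting the observed mean.
        "relative_squared_error": sq_err / ss_obs,
        "relative_absolute_error": abs_err / sum(abs(o - mean_obs) for o in observed),
        "correlation": cross / math.sqrt(ss_obs * ss_pred),
    }

def misclassification_rate(observed, predicted):
    """Proportion of cases whose predicted class differs from the observed class."""
    return sum(o != p for o, p in zip(observed, predicted)) / len(observed)
```

Computing the same statistics for the predictions of several competing models, on the same cases, gives a simple basis for choosing among them.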
Rapid Deployment of Predictive Models
The Rapid Deployment of Predictive Models module allows you to load one or more PMML (Predictive Model Markup Language) files with deployment information, and to compute very quickly (in a single pass through the data) predictions for large numbers of observations (for one or more models). PMML files can be generated from practically all modules for predictive data mining (as well as the Generalized EM & k-Means Cluster Analysis options). PMML is an XML-based (Extensible Markup Language) industry-standard set of syntax conventions that is particularly well suited to allow sharing of deployment information in a Client-Server architecture (e.g., via STATISTICA Enterprise).
The Rapid Deployment of Predictive Models options provide the fastest, most efficient methods for computing predictions from fully trained models. All models are pre-programmed in generic form in a highly optimized compiled program; the PMML code only supplies the parameter estimates etc. for the fully trained models, to allow the Rapid Deployment of Predictive Models program to compute predictions or predicted classifications (or cluster assignments) in a single pass through the data.
In fact, it is very difficult to "beat" the performance (speed of computations) of this tool, even if you were to write your own compiled C++ code, based on the (C, C++, or C#) deployment code generated by the respective models.
Note that the Rapid Deployment of Predictive Models module will also automatically compute summary statistics for each model, and if observed values or classifications are available, the program will automatically compute goodness-of-fit indices for participating models, including Gains and Lift charts for one or more models (overlaid lift and gain charts), for binary or multinomial
STATISTICA Base Modules
Descriptive Statistics, Breakdowns, and Exploratory Data Analysis
Descriptive Statistics and Graphs
The program will compute practically all common, general-purpose descriptive statistics including medians, modes, quartiles, user-specified percentiles, means and standard deviations, quartile ranges, confidence limits for the mean, skewness and kurtosis (with their respective standard errors), harmonic means, geometric means, as well as many specialized descriptive statistics and diagnostics, either for all cases or broken down by one or more categorical (grouping) variables. As with all modules of STATISTICA, a wide variety of graphs will aid exploratory analyses, e.g., various types of box-and-whisker plots, histograms, bivariate distribution (3D or categorized) histograms, 2D and 3D scatterplots with marked subsets, normal, half-normal, and detrended probability plots, Q-Q plots, P-P plots, etc. A selection of tests is available for fitting the normal distribution to the data (via the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilk tests; facilities for fitting a wide variety of other distributions are also available; see also STATISTICA Process Analysis; and the section on fitting in the Graphics section).
By-Group Analyses (Breakdowns)
Practically all descriptive statistics as well as summary graphs can be computed for data that are categorized (broken down) by one or more grouping variables. For example, with just a few mouse clicks the user can break down the data by Gender and Age and review categorized histograms, box-and-whisker plots, normal probability plots, scatterplots, etc. If more than two categorical variables are chosen, cascades of the respective graphs can be automatically produced. Options to categorize by continuous variables are provided, e.g., you can request that a variable be split into a requested number of intervals, or use the on-line recode facility to custom-define the way in which the variable will be recoded (categorization options of practically unlimited complexity can be specified at any point and they can reference relations involving all variables in the dataset). In addition, a specialized hierarchical breakdown procedure is provided that allows the user to categorize the data by up to six categorical variables, and compute a variety of categorized graphs, descriptive statistics, and correlation matrices for subgroups (the user can interactively request to ignore some factors in the complete breakdown table, and examine statistics for any marginal tables). Numerous formatting and labeling options allow the user to produce publication-quality tables and reports with long labels and descriptions of variables. Note that extremely large analysis designs can be specified in the breakdown procedure (e.g., 100,000 groups for a single categorization variable), and results include all relevant ANOVA statistics (including the complete ANOVA table, tests of assumptions such as the Levene and Brown-Forsythe tests for homogeneity of variance, a selection of seven post-hoc tests, etc.). 
As in all other modules of STATISTICA, extended precision calculations (the "quadruple" precision, where applicable) are used to provide an unmatched level of accuracy (see the section on Precision). Because of the interactive nature of the program, exploration of data is very easy. For example, exploratory graphs can be produced directly from all results Spreadsheets by pointing with the mouse to specific cells or ranges of cells. Cascades of even complex (e.g., multiple categorized) graphs can be produced with a single-click of the mouse and reviewed in a slide-show manner. In addition to numerous predefined statistical graphs, countless graphical visualizations of raw data, summary statistics, relations between statistics, as well as all breakdowns and categorizations can be custom-defined by the user via straightforward point-and-click facilities designed to reduce the necessary number of mouse clicks. All exploratory graphical techniques (described in the section on Graphics) are integrated with statistics to facilitate graphical data analyses (e.g., via interactive outlier removal, subset selections, smoothing, function fitting, extensive brushing options allowing the user to easily identify and/or extract the selected data, etc.). See also the section on Block Statistics, below.
A comprehensive set of options allows for the exploration of correlations and partial correlations between variables. First, practically all common measures of association can be computed, including Pearson r, Spearman rank order R, Kendall tau (b, c), Gamma, tetrachoric r, Phi, Cramer V, contingency coefficient C, Somers' D, uncertainty coefficients, part and partial correlations, autocorrelations, various distance measures, etc. Correlation matrices can be computed using casewise (listwise) or pairwise deletion of missing data, or mean substitution. As in all other modules of STATISTICA, extended precision calculations (the "quadruple" precision, where applicable) are used to yield an unmatched level of accuracy (see the section on Precision). Like all other results in STATISTICA, correlation matrices are displayed in Spreadsheets offering various formatting options (see below) and extensive facilities to visualize numerical results; the user can "point to" a particular correlation in the Spreadsheet and choose to display a variety of "graphical summaries" of the coefficient (e.g., scatterplots with confidence intervals, various 3D bivariate distribution histograms, probability plots, etc.).
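The difference between casewise and pairwise deletion of missing data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: missing values are represented by `None`, the function names and the small data set are hypothetical, and nothing here reflects how STATISTICA implements these options internally.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks 1..n, with tied values receiving the mean of their ranks."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def spearman_r(x, y):
    """Spearman rank order correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def correlation_matrix(columns, deletion="pairwise"):
    """columns maps a variable name to a list of values (None = missing)."""
    names = list(columns)
    n_rows = len(columns[names[0]])
    # Casewise (listwise) deletion keeps only rows complete on ALL variables.
    complete = [i for i in range(n_rows)
                if all(columns[c][i] is not None for c in names)]
    result = {}
    for a in names:
        for b in names:
            if deletion == "casewise":
                rows = complete
            else:
                # Pairwise deletion keeps rows complete on this pair only.
                rows = [i for i in range(n_rows)
                        if columns[a][i] is not None and columns[b][i] is not None]
            result[(a, b)] = pearson_r([columns[a][i] for i in rows],
                                       [columns[b][i] for i in rows])
    return result


data = {"a": [1, 2, 3, 4, 5], "b": [2, 4, 6, 8, None], "c": [5, 3, None, 1, 0]}
pairwise = correlation_matrix(data, "pairwise")
casewise = correlation_matrix(data, "casewise")
print(pairwise[("a", "c")], casewise[("a", "c")])
```

With pairwise deletion each coefficient may be based on a different number of cases, which is why the two matrices can disagree for the same pair of variables.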
Brushing and outlier detection
The extensive brushing facilities in the scatterplots allow the user to select/deselect individual points in the plot and assess their effect on the regression line (or other fitted function lines).
Display formats of numbers
A variety of global display formats for correlations are supported; significant correlation coefficients can be automatically highlighted, each cell of the Spreadsheet can be expanded to display n and p, or detailed results may be requested that include all descriptive statistics (pairwise means and standard deviations, B weights, intercepts, etc.). Like all other numerical results, correlation matrices are displayed in Spreadsheets offering the zoom option and interactively-controlled display formats (e.g., from +.4 to +.4131089276410193); thus, large matrices can be compressed (via either the zoom or format-width control adjustable by dragging) to facilitate the visual search for coefficients which exceed a user-specified magnitude or significance level (e.g., the respective cells can be marked red in the Spreadsheet).
Scatterplot, scatterplot matrices, by-group analyses
As in all output selection dialogs, numerous global graphics options are available to further study patterns of relationships between variables, e.g., 2D and 3D scatterplots (with or without case labels) designed to identify patterns of relations across subsets of cases or series of variables. Correlation matrices can be computed as categorized by grouping variables and visualized via categorized scatterplots. Also "breakdowns of correlation matrices" can be generated (one matrix per subset of data), displayed in queues of Spreadsheets, and saved as stacked correlation matrices (which can later be used as input into the Structural Equations Modeling and Path Analysis [SEPATH] module). An entire correlation matrix can be summarized in a single graph via the Matrix scatterplot option (of practically unlimited density); large scatterplot matrices can then be reviewed interactively by "zooming in" on selected portions of the graph (or scrolling large graphs in the zoom mode). Also, categorized scatterplot matrix plots can be generated (one matrix plot for each subset of data). Alternatively, a multiple-subset scatterplot matrix plot can be created where specific subsets of data (e.g., defined by levels of a grouping variable or selection conditions of any complexity) are marked with distinctive point markers. Various other graphical methods can be used to visualize matrices of correlations in search of global patterns (e.g., contour plots, non-smoothed surfaces, icons, etc.). All of these operations require only a few mouse clicks and various shortcuts are provided to simplify selections of analyses; any number of Spreadsheets and graphs can be displayed simultaneously on the screen, making interactive exploratory analyses and comparisons very easy.
Basic Statistics From Results Spreadsheets (Tables)
STATISTICA is a single integrated analysis system that presents all numerical results in spreadsheet tables that are suitable (without any further modification) for input into subsequent analyses. Thus, basic statistics (or any other statistical analysis) can be computed for results tables from previous analyses; for example, you could very quickly compute a table of means for 2000 variables, and next use this table as an input data file to further analyze the distribution of those means across the variables. Thus, basic statistics are available at any time during your analyses, and can be applied to any results spreadsheet.
In addition to the detailed descriptive statistics that can be computed for every spreadsheet, you can also highlight blocks of numbers in any spreadsheet, and produce basic descriptive statistics or graphs for the respective subset of numbers only. For example, suppose you computed a results spreadsheet with measures of central tendency for 2000 variables (e.g., with Means, Modes, and Medians, Geometric Means, and Harmonic Means); you could highlight a block of, for example, 200 variables and the Means and Medians, and then in a single operation produce a multiple line graph of those two measures across the subset of 200 variables. Statistical analysis by blocks can be performed by row or by column; for example, you could also compute a multiple line graph for a subset of variables across the different measures of central tendency. To summarize, the block statistics facilities allow you to produce statistics and statistical graphs from values in arbitrarily selected (highlighted) blocks of values in the current data spreadsheet or output Spreadsheet.
Interactive Probability Calculator
A flexible, interactive Probability Calculator is accessible from all toolbars. It features a wide selection of distributions (including Beta, Cauchy, Chi-square, Exponential, Extreme value, F, Gamma, Laplace, Lognormal, Logistic, Pareto, Rayleigh, t (Student), Weibull, and Z (Normal)); interactively (in-place) updated graphs built into the dialog (a plot of the density and distribution functions) allow the user to visually explore distributions taking advantage of the flexible STATISTICA Smart MicroScrolls which allow the user to advance either the last significant digit (press the LEFT-mouse-button) or next to the last significant digit (press the RIGHT-mouse-button). Facilities are provided for generating customizable, compound graphs of distributions with requested cutoff areas. Thus, this calculator allows you to interactively explore the distributions (e.g., the respective probabilities depending on shape parameters).
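The kind of lookup such a calculator performs for the Z (Normal) distribution can be sketched with Python's standard-library `statistics.NormalDist`. The helper names `two_tailed_p` and `critical_value` are hypothetical and serve only to illustrate converting between a quantile and a tail probability.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def two_tailed_p(z_value):
    """Probability of a standard normal value at least as extreme as z_value."""
    return 2.0 * (1.0 - standard_normal.cdf(abs(z_value)))


def critical_value(p):
    """Positive z whose two-tailed cutoff area equals p."""
    return standard_normal.inv_cdf(1.0 - p / 2.0)


print(two_tailed_p(1.96))      # tail area for z = 1.96
print(critical_value(0.05))    # z for a two-tailed area of 0.05
print(standard_normal.pdf(0))  # density at the mean
```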
t-Tests and Other Tests of Group Differences
T-tests for dependent and independent samples, as well as for single samples (testing means against user-specified constants), can be computed; multivariate Hotelling's T² tests are also available (see also ANOVA/MANOVA, and GLM (General Linear Models)). Flexible options are provided to allow comparisons between variables (e.g., treating the data in each column of the input spreadsheet as a separate sample) and coded groups (e.g., if the data includes a categorical variable such as Gender to identify group membership for each case). As with all procedures, extensive diagnostics and graphics options are available from the results menus. For example, for the t-test for independent samples, options are provided to compute t-tests with separate variance estimates, Levene and Brown-Forsythe tests for homogeneity of variance, various box-and-whisker plots, categorized histograms and probability plots, categorized scatterplots, etc. Other (more specialized) tests of group differences are part of many modules (e.g., Nonparametrics, Survival Analysis, Reliability/Item Analysis).
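To make the distinction between the pooled-variance t-test and the t-test with separate variance estimates concrete, here is a minimal Python sketch. It uses the textbook formulas (the separate-variance version is the Welch test with Satterthwaite degrees of freedom); the function names and data are hypothetical, and the sketch returns only the t statistic and degrees of freedom, not a p value.

```python
import math
import statistics as st


def pooled_t(a, b):
    """Independent-samples t statistic assuming equal variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * st.variance(a) + (nb - 1) * st.variance(b)) / (na + nb - 2)
    t = (st.mean(a) - st.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, na + nb - 2


def separate_variance_t(a, b):
    """Welch t statistic with Satterthwaite-approximated degrees of freedom."""
    na, nb = len(a), len(b)
    se2_a = st.variance(a) / na   # squared standard error of mean a
    se2_b = st.variance(b) / nb   # squared standard error of mean b
    t = (st.mean(a) - st.mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
print(pooled_t(group_a, group_b))
print(separate_variance_t(group_a, group_b))
```

With equal group sizes the two t statistics coincide and only the degrees of freedom differ; with unequal sizes and unequal variances both values change.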
Frequency Tables, Crosstabulation Tables, Stub-and-Banner Tables, Multiple Response Analysis, and Tables
Extensive facilities are provided to tabulate continuous, categorical, and multiple response variables, or multiple dichotomies. A wide variety of options are offered to control the layout and format of the tables. For example, for tables involving multiple response variables or multiple dichotomies, marginal counts and percentages can be based on the total number of respondents or responses, multiple response variables can be processed in pairs, and various options are available for counting (or ignoring) missing data. Frequency tables can also be computed based on user-defined logical selection conditions (of any complexity, referencing any relationships between variables in the dataset) that assign cases to categories in the table. All tables can be extensively customized to produce final (publication-quality) reports. For example, unique "multi-way summary" tables can be produced with breakdown-style, hierarchical arrangements of factors, crosstabulation tables may report row, column, and total percentages in each cell, long value labels can be used to describe the categories in the table, frequencies greater than a user-defined cutoff can be highlighted in the table, etc. The program can display cumulative and relative frequencies, Logit- and Probit-transformed frequencies, normal expected frequencies (and the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilk tests), expected and residual frequencies in crosstabulations, etc. Available statistical tests for crosstabulation tables include the Pearson, Maximum-Likelihood, and Yates-corrected Chi-squares, McNemar's Chi-square, the Fisher exact test (one- and two-tailed), Phi, and the tetrachoric r; additional available statistics include Kendall's tau (a, b), Gamma, Spearman r, Somers' D, uncertainty coefficients, etc.
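For a 2 x 2 crosstabulation, the Pearson Chi-square, the Yates-corrected Chi-square, and Phi can be sketched as follows. The function name and the example table are hypothetical, and Phi is returned as a magnitude (the square root of Chi-square divided by n), which is one common convention and an assumption here.

```python
import math


def chi_square_2x2(table):
    """Pearson and Yates-corrected Chi-square plus Phi for a 2 x 2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    pearson = yates = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            pearson += (observed - expected) ** 2 / expected
            # Yates' continuity correction shrinks each deviation by 0.5.
            yates += max(abs(observed - expected) - 0.5, 0.0) ** 2 / expected
    return {"pearson": pearson, "yates": yates, "phi": math.sqrt(pearson / n)}


result = chi_square_2x2([[10, 20], [30, 40]])
print(result)
```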
Graphical options include simple, categorized (multiple), and 3D histograms, cross-section histograms (for any "slices" of the one-, two-, or multi-way tables), and many other graphs including a unique "interaction plot of frequencies" that summarizes the frequencies for complex crosstabulation tables (similar to plots of means in ANOVA). Cascades of even complex (e.g., multiple categorized, or interaction) graphs can be interactively reviewed. See also the section on Block Statistics, above, and sections on Log-linear Analysis and Correspondence Analysis.
Multiple Regression Methods
The Multiple Regression module is a comprehensive implementation of linear regression techniques, including simple, multiple, stepwise (forward, backward, or in blocks), hierarchical, nonlinear (including polynomial, exponential, log, etc.), Ridge regression, with or without intercept (regression through the origin), and weighted least squares models; additional advanced methods are provided in the General Regression Models (GRM) module (e.g., best subset regression, multivariate stepwise regression for multiple dependent variables, for models that may include categorical factor effects; statistical summaries for validation and prediction samples, custom hypotheses, etc.). The Multiple Regression module will calculate a comprehensive set of statistics and extended diagnostics including the complete regression table (with standard errors for B, Beta and intercept, R-square and adjusted R-square for intercept and non-intercept models, and ANOVA table for the regression), part and partial correlation matrices, correlations and covariances for regression weights, the sweep matrix (matrix inverse), the Durbin-Watson d statistic, Mahalanobis and Cook's distances, deleted residuals, confidence intervals for predicted values, and many others.
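Several of the quantities named above (B weights, R-square, adjusted R-square, and the Durbin-Watson d statistic) can be illustrated with a small ordinary least squares sketch in pure Python. This is a teaching example under stated assumptions, not the module's algorithm: it solves the normal equations directly, which is numerically fragile for ill-conditioned designs, and the function names and data are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(predictor_rows, y):
    """Least squares fit with intercept; returns weights and fit statistics."""
    x = [[1.0] + list(row) for row in predictor_rows]
    k, n = len(x[0]), len(y)
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(k)]
    b = solve(xtx, xty)
    residuals = [yi - sum(bi * xi for bi, xi in zip(b, r)) for r, yi in zip(x, y)]
    y_mean = sum(y) / n
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return {
        "b": b,                                   # intercept first, then slopes
        "r_square": r2,
        "adj_r_square": 1.0 - (1.0 - r2) * (n - 1) / (n - k),
        "durbin_watson": sum((residuals[t] - residuals[t - 1]) ** 2
                             for t in range(1, n)) / ss_res,
        "residuals": residuals,
    }


rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, -0.1]
y = [1 + 2 * x1 + 0.5 * x2 + e for (x1, x2), e in zip(rows, noise)]
fit = ols(rows, y)
print(fit["b"], fit["r_square"], fit["durbin_watson"])
```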
Predicted and residual values
The extensive residual and outlier analysis features a large selection of plots, including a variety of scatterplots, histograms, normal and half-normal probability plots, detrended plots, partial correlation plots, different casewise residual and outlier plots and diagrams, and others. The scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results Spreadsheets. Residual and predicted scores can be appended to the current data file. A forecasting routine allows the user to perform what-if analyses, and to interactively compute predicted scores based on user-defined values of predictors.
By-group analysis; related procedures
Extremely large regression designs can be analyzed. An option is also included to perform multiple regression analyses broken down by one or more categorical variables (multiple regression analysis by group); additional add-on procedures include a regression engine that supports models with thousands of variables, a Two-stage Least Squares regression, as well as Box-Cox and Box-Tidwell transformations with graphs. STATISTICA Advanced also includes general nonlinear estimation modules (Nonlinear Estimation, Generalized Linear Models (GLZ), Partial Least Squares models (PLS)) that can estimate practically any user-defined nonlinear model, including Logit, Probit, and others. The program also includes SEPATH, the general Structural Equation Modeling and Path Analysis module, which allows the user to analyze extremely large correlation, covariance, and moment matrices (for intercept models). An implementation of Generalized Additive Models (GAM) is also available in STATISTICA Data Miner.
The Nonparametric Statistics module features a comprehensive selection of inferential and descriptive statistics including all common tests and some special application procedures. Available statistical procedures include the Wald-Wolfowitz runs test, Mann-Whitney U test (with exact probabilities [instead of the Z approximations] for small samples), Kolmogorov-Smirnov tests, Wilcoxon matched pairs test, Kruskal-Wallis ANOVA by ranks, Median test, Sign test, Friedman ANOVA by ranks, Cochran Q test, McNemar test, Kendall coefficient of concordance, Kendall tau (b, c), Spearman rank order R, Fisher's exact test, Chi-square tests, V-square statistic, Phi, Gamma, Somers' d, contingency coefficients, and others. (Specialized nonparametric tests and statistics are also part of many add-on modules, e.g., Survival Analysis, Process Analysis, and others.) All (rank order) tests can handle tied ranks and apply corrections for small n or tied ranks. The program can handle extremely large analysis designs. As in all other modules of STATISTICA, all tests are integrated with graphs (that include various scatterplots, specialized box-and-whisker plots, line plots, histograms and many other 2D and 3D displays).
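The idea of an exact probability for the Mann-Whitney U test in small samples can be sketched by enumerating every possible assignment of the pooled ranks to the first group. The following Python example is a minimal sketch under stated assumptions: the function names are hypothetical, ties receive average ranks, and the enumeration is only practical for small samples.

```python
from itertools import combinations


def average_ranks(values):
    """Ranks 1..n, with tied values receiving the mean of their ranks."""
    return [sum(v < x for v in values) + (sum(v == x for v in values) + 1) / 2
            for x in values]


def u_from_rank_sum(rank_sum, n1, n2):
    """Smaller of the two U statistics implied by group 1's rank sum."""
    u1 = rank_sum - n1 * (n1 + 1) / 2
    return min(u1, n1 * n2 - u1)


def mann_whitney_exact(a, b):
    """Return (U, exact two-tailed p) by full enumeration of rank subsets."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    observed = u_from_rank_sum(sum(ranks[:n1]), n1, n2)
    as_extreme = total = 0
    for subset in combinations(range(n1 + n2), n1):
        u = u_from_rank_sum(sum(ranks[i] for i in subset), n1, n2)
        total += 1
        as_extreme += u <= observed + 1e-12   # tolerance for half ranks
    return observed, as_extreme / total


print(mann_whitney_exact([1, 2, 3], [4, 5, 6]))
```

For two groups of three with no overlap, only 2 of the 20 possible rank assignments are as extreme as the observed one, so the smallest attainable two-tailed probability is 0.1.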
The ANOVA/MANOVA module includes a subset of the functionality of the General Linear Models module and can perform univariate and multivariate analysis of variance of factorial designs with or without one repeated measures variable. For more complicated linear models with categorical and continuous predictor variables, random effects, and multiple repeated measures factors you need the General Linear Models module (stepwise and best-subset options are available in the General Regression Models module). In the ANOVA/MANOVA module, you can specify all designs in the most straightforward, functional terms of actual variables and levels (not in technical terms, e.g., by specifying matrices of dummy codes), and even less-experienced ANOVA users can analyze very complex designs with STATISTICA. Like the General Linear Models module, ANOVA/MANOVA provides three alternative user interfaces for specifying designs: (1) A Design Wizard, that will take you step-by-step through the process of specifying a design, (2) a simple dialog-based user-interface that will allow you to specify designs by selecting variables, codes, levels, and any design options from well-organized dialogs, and (3) a Syntax Editor for specifying designs and design options using keywords and a common design syntax. Computational methods. The program will use, by default, the sigma restricted parameterization for factorial designs, and apply the effective hypothesis approach (see Hocking, 19810) when the design is unbalanced or incomplete. Type I, II, III, and IV hypotheses can also be computed, as can Type V and Type VI hypotheses that will perform tests consistent with the typical analyses of fractional factorial designs in industrial and quality-improvement applications (see also the description of the Experimental Design module).
The ANOVA/MANOVA module is not limited in any of its computational routines for reporting results, so the full suite of detailed analytic tools available in the General Linear Models module is also available here; results include summary ANOVA tables, univariate and multivariate results for repeated measures factors with more than 2 levels, the Greenhouse-Geisser and Huynh-Feldt adjustments, plots of interactions, detailed descriptive statistics, detailed residual statistics, planned and post-hoc comparisons, testing of custom hypotheses and custom error terms, detailed diagnostic statistics and plots (e.g., histogram of within-cell residuals, homogeneity of variance tests, plots of means versus standard deviations, etc.).
The Distribution Fitting options allow the user to compare the distribution of a variable with a wide variety of theoretical distributions. You may fit to the data the Normal, Rectangular, Exponential, Gamma, Lognormal, Chi-square, Weibull, Gompertz, Binomial, Poisson, Geometric, or Bernoulli distribution. The fit can be evaluated via the Chi-square test or the Kolmogorov-Smirnov one-sample test (the fitting parameters can be controlled); the Lilliefors and Shapiro-Wilk tests are also supported (see above). In addition, the fit of a particular hypothesized distribution to the empirical distribution can be evaluated in customized histograms (standard or cumulative) with overlaid selected functions; line and bar graphs of expected and observed frequencies, discrepancies and other results can be produced from the output Spreadsheets. Other distribution fitting options are available in Process Analysis, where the user can compute maximum-likelihood parameter estimates for the Beta, Exponential, Extreme Value (Type I, Gumbel), Gamma, Log-Normal, Rayleigh, and Weibull distributions. Also included in that module are options for automatically selecting and fitting the best distribution for the data, as well as options for general distribution fitting by moments (via Johnson and Pearson curves). User-defined 2- and 3-dimensional functions can also be plotted and overlaid on the graphs. The functions may reference a wide variety of distributions such as the Beta, Binomial, Cauchy, Chi-square, Exponential, Extreme value, F, Gamma, Geometric, Laplace, Logistic, Normal, Log-Normal, Pareto, Poisson, Rayleigh, t (Student), or Weibull distribution, as well as their integrals and inverses. Additional facilities to fit predefined or user-defined functions of practically unlimited complexity to the data are available in Nonlinear Estimation.
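The Kolmogorov-Smirnov one-sample statistic mentioned above is the largest vertical distance between the empirical distribution function and the hypothesized one. A minimal Python sketch follows; the function name `ks_statistic` is hypothetical, and only the D statistic is computed, not its probability. When the distribution's parameters are estimated from the same data, the ordinary Kolmogorov-Smirnov probabilities no longer apply, which is the situation the Lilliefors variant addresses.

```python
from statistics import NormalDist


def ks_statistic(sample, cdf):
    """Largest gap between the empirical CDF of sample and the given cdf."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


data = [-1.0, 0.0, 1.0]
print(ks_statistic(data, NormalDist(0.0, 1.0).cdf))
```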
Multivariate Exploratory Technique Modules
The Cluster Analysis module includes a comprehensive implementation of clustering methods (k-means, hierarchical clustering, two-way joining). The program can process data from either raw data files or matrices of distance measures. The user can cluster cases, variables, or both based on a wide variety of distance measures (including Euclidean, squared Euclidean, City-block (Manhattan), Chebychev, Power distances, Percent disagreement, and 1-r) and amalgamation/linkage rules (including single, complete, weighted and unweighted group average or centroid, Ward's method, and others). Matrices of distances can be saved for further analysis with other modules of the STATISTICA system. In k-means clustering, the user has full control over the initial cluster centers. Extremely large analysis designs can be processed; for example, hierarchical (tree) joining can analyze matrices with over 1,000 variables, or with over 1 million distances. In addition to the standard cluster analysis output, a comprehensive set of descriptive statistics and extended diagnostics (e.g., the complete amalgamation schedule with cohesion levels in hierarchical clustering, the ANOVA table in k-means clustering) is available. Cluster membership data can be appended to the current data file for further processing. Graphics options in the Cluster Analysis module include customizable tree diagrams, discrete contour-style two-way joining matrix plots, plots of amalgamation schedules, plots of means in k-means clustering, and many others.
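The distance measures named above have simple textbook definitions, sketched here in Python. The function names are hypothetical, and the two-parameter form of the power distance (exponent `p` on the coordinate differences, root `r` on the sum) is one common parameterization and an assumption here.

```python
import math


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    return math.sqrt(squared_euclidean(x, y))


def city_block(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebychev(x, y):
    """Largest absolute difference on any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, y))


def power_distance(x, y, p, r):
    """(sum |a - b|^p)^(1/r); p = r = 2 reproduces the Euclidean distance."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / r)


def percent_disagreement(x, y):
    """Share of coordinates that differ; suited to categorical data."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def one_minus_r(x, y):
    """1 minus the Pearson correlation between the two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / math.sqrt(sxx * syy)


print(euclidean([0, 0], [3, 4]), city_block([0, 0], [3, 4]), chebychev([0, 0], [3, 4]))
```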
The Factor Analysis module contains a wide range of statistics and options, and provides a comprehensive implementation of factor (and hierarchical factor) analytic techniques with extended diagnostics and a wide variety of analytic and exploratory graphs. It will perform principal components, common, and hierarchical (oblique) factor analysis, and can handle extremely large analysis problems (e.g., with thousands of variables). Confirmatory factor analysis (as well as path analysis) can also be performed via the Structural Equation Modeling and Path Analysis (SEPATH) module found in STATISTICA Advanced Linear/Non-Linear Models.
Principal Components & Classification Analysis
STATISTICA also includes a designated program for principal components and classification analysis. The output includes eigenvalues (regular, cumulative, relative), factor loadings, factor scores (which can be appended to the input data file, reviewed graphically as icons, and interactively recoded), and a number of more technical statistics and diagnostics. Available rotations include Varimax, Equimax, Quartimax, Biquartimax (either normalized or raw), and Oblique rotations. The factorial space can be plotted and reviewed "slice by slice" in either 2D or 3D scatterplots with labeled variable-points; other integrated graphs include Scree plots, various scatterplots, bar and line graphs, and others. After a factor solution is determined, the user can recalculate (i.e., reconstruct) the correlation matrix from the respective number of factors to evaluate the fit of the factor model. Both raw data files and matrices of correlations can be used as input. Confirmatory factor analysis and other related analyses can be performed with the Structural Equation Modeling and Path Analysis (SEPATH) module available in STATISTICA Advanced Linear/Non-Linear Models, where a designated Confirmatory Factor Analysis Wizard will guide you step by step through the process of specifying the model.
Canonical Correlation Analysis
This module offers a comprehensive implementation of canonical analysis procedures; it can process raw data files or correlation matrices and it computes all of the standard canonical correlation statistics (including eigenvectors, eigenvalues, redundancy coefficients, canonical weights, loadings, extracted variances, significance tests for each root, etc.) and a number of extended diagnostics. The scores of canonical variates can be computed for each case, appended to the data file, and visualized via integrated icon plots. The Canonical Analysis module also includes a variety of integrated graphs (including plots of eigenvalues, canonical correlations, scatterplots of canonical variates, and many others). Note that confirmatory analyses of structural relationships between latent variables can also be performed via the SEPATH (Structural Equation Modeling and Path Analysis) module. Advanced stepwise and best-subset selection of predictor variables for MANOVA/MANCOVA designs (with multiple dependent variables) is available in the General Regression Models (GRM) module.
The Reliability/Item Analysis module includes a comprehensive selection of procedures for the development and evaluation of surveys and questionnaires. As in all other modules of STATISTICA, extremely large designs can be analyzed. The user can calculate reliability statistics for all items in a scale, interactively select subsets, or obtain comparisons between subsets of items via the "split-half" (or split-part) method. In a single run, the user can evaluate the reliability of a sum-scale as well as subscales. When interactively deleting items, the new reliability is computed instantly without processing the data file again. The output includes correlation matrices and descriptive statistics for items, Cronbach alpha, the standardized alpha, the average inter-item correlation, the complete ANOVA table for the scale, the complete set of item-total statistics (including multiple item-total R's), the split-half reliability, and the correlation between the two halves corrected for attenuation. A selection of graphs (including various integrated scatterplots, histograms, line plots and other plots) and a set of interactive what-if procedures are provided to aid in the development of scales. For example, the user can calculate the expected reliability after adding a particular number of items to the scale, and can estimate the number of items that would have to be added to the scale in order to achieve a particular reliability. Also, the user can estimate the correlation corrected for attenuation between the current scale and another measure (given the reliability of the current scale).
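The reliability statistics and what-if calculations described above rest on a few standard formulas: Cronbach's alpha, the Spearman-Brown formula for a lengthened scale, and the correction for attenuation. The sketch below shows them in Python; the function names and item scores are hypothetical, and whether the program uses exactly these formulas for its what-if procedures is an assumption.

```python
import math
import statistics as st


def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]   # sum-scale per respondent
    item_variance = sum(st.variance(item) for item in items)
    return k / (k - 1) * (1.0 - item_variance / st.variance(totals))


def spearman_brown(reliability, length_factor):
    """Expected reliability after multiplying the scale length by length_factor."""
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which the scale must be lengthened to reach the target."""
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


def corrected_for_attenuation(r_xy, reliability_x, reliability_y=1.0):
    """Estimated correlation if the measure(s) were perfectly reliable."""
    return r_xy / math.sqrt(reliability_x * reliability_y)


scale = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4]]
print(cronbach_alpha(scale))
print(spearman_brown(0.6, 2))            # doubling a scale with reliability 0.6
print(length_factor_needed(0.6, 0.75))   # lengthening needed to reach 0.75
```

A length factor of 2 for a 10-item scale means 10 comparable items would have to be added.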
STATISTICA's Classification Trees module provides a comprehensive implementation of the most recently developed algorithms for efficiently producing and testing the robustness of classification trees (a classification tree is a rule for predicting the class of an object from the values of its predictor variables). STATISTICA Data Miner offers additional advanced methods for tree classification, such as Boosted Trees, Random Forests, General Classification and Regression Tree Models (GTrees), and General CHAID (Chi-square Automatic Interaction Detection) models. Classification trees can be produced using categorical predictor variables, ordered predictor variables, or both, and using univariate splits or linear combination splits.
Analysis options include performing exhaustive splits or discriminant-based splits; unbiased variable selection (as in QUEST); direct stopping rules (as in FACT) or bottom-up pruning (as in C&RT); pruning based on misclassification rates or on the deviance function; generalized Chi-square, G-square, or Gini-index goodness of fit measures. Priors and misclassification costs can be specified as equal, estimated from the data, or user-specified. The user can also specify the v value for v-fold cross-validation during tree building, v value for v-fold cross-validation for error estimation, size of the SE rule, minimum node size before pruning, seeds for random number generation, and alpha value for variable selection. Integrated graphics options are provided to explore the input and output data.
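As an illustration of an exhaustive univariate split on an ordered predictor scored with the Gini index, consider the following Python sketch. It is a simplified teaching example: the function names and data are hypothetical, priors and misclassification costs are ignored, and it evaluates a single split rather than growing or pruning a tree.

```python
from collections import Counter


def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def split_improvement(parent, left, right):
    """Reduction in Gini impurity achieved by splitting parent into two nodes."""
    n = sum(parent)
    return (gini(parent)
            - sum(left) / n * gini(left)
            - sum(right) / n * gini(right))


def best_split(values, labels):
    """Try every midpoint between adjacent distinct values; keep the best."""
    classes = sorted(set(labels))

    def class_counts(subset):
        tally = Counter(subset)
        return [tally[c] for c in classes]

    parent = class_counts(labels)
    best_threshold, best_gain = None, 0.0
    cuts = sorted(set(values))
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        left = class_counts(l for v, l in zip(values, labels) if v <= threshold)
        right = class_counts(l for v, l in zip(values, labels) if v > threshold)
        gain = split_improvement(parent, left, right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


print(best_split([1, 2, 3, 4], ["a", "a", "b", "b"]))
```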
The Correspondence Analysis module features a full implementation of simple and multiple correspondence analysis techniques, and can analyze even extremely large tables. The program will accept input data files with grouping (coding) variables that are to be used to compute the crosstabulation table, data files that contain frequencies (or some other measure of correspondence, association, similarity, confusion, etc.) and coding variables that identify (enumerate) the cells in the input table, or data files with frequencies (or other measure of correspondence) only (e.g., the user can directly type in and analyze a frequency table). For multiple correspondence analysis, the user can also directly specify a Burt table as input for the analysis. The program will compute various tables, including the table of row percentages, column percentages, total percentages, expected values, observed minus expected values, standardized deviates, and contributions to the Chi-square values. The Correspondence Analysis module will compute the generalized eigenvalues and eigenvectors, and report all standard diagnostics including the singular values, eigenvalues, and proportions of inertia for each dimension. The user can either manually choose the number of dimensions, or specify a cutoff value for the maximum cumulative percent of inertia. The program will compute the standard coordinate values for column and row points. The user has the choice of row-profile standardization, column-profile standardization, row and column profile standardization, or canonical standardization. For each dimension and row or column point, the program will compute the inertia, quality, and cosine-square values. In addition, the user can display (in spreadsheets) the matrices of the generalized singular vectors; like the values in all spreadsheets, these matrices can be accessed via STATISTICA Visual Basic, for example, in order to implement non-standard methods of computing the coordinates.
The user can compute coordinate values and related statistics (quality and cosine-square values) for supplementary points (row or column), and compare the results with the regular row and column points. Supplementary points can also be specified for multiple correspondence analysis. In addition to the 3D histograms that can be computed for all tables, the user can produce a line plot for the eigenvalues, and 1D, 2D, and 3D plots for the row or column points. Row and column points can also be combined in a single graph, along with any supplementary points (each type of point will use a different color and point marker, so the different types of points can easily be identified in the plots). All points are labeled, and an option is available to truncate the names for the points to a user-specified number of characters.
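The preliminary tables of a simple correspondence analysis (row percentages, expected values, standardized deviates, contributions to Chi-square) and the total inertia, which equals the Chi-square value divided by the table total, can be sketched in Python as follows. The function name and the input table are hypothetical, and the sketch stops before the singular value decomposition that yields the coordinates.

```python
import math


def correspondence_tables(table):
    """Preliminary tables and total inertia for a two-way frequency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    deviates = [[(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
                for obs_row, exp_row in zip(table, expected)]
    contributions = [[d * d for d in row] for row in deviates]
    chi_square = sum(sum(row) for row in contributions)
    return {
        "row_percent": [[100.0 * o / r for o in row]
                        for row, r in zip(table, row_totals)],
        "expected": expected,
        "standardized_deviates": deviates,
        "chi_square_contributions": contributions,
        "chi_square": chi_square,
        "total_inertia": chi_square / n,
    }


result = correspondence_tables([[20, 10], [10, 20]])
print(result["chi_square"], result["total_inertia"])
```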
The Multidimensional Scaling module includes a full implementation of (nonmetric) multidimensional scaling. Matrices of similarities, dissimilarities, or correlations between variables (i.e., "objects" or cases) can be analyzed. The starting configuration can be computed by the program (via principal components analysis) or specified by the user. The program employs an iterative procedure to minimize the stress value and the coefficient of alienation. The user can monitor the iterations and inspect the changes in these values. The final configurations can be reviewed via spreadsheets, and via 2D and 3D scatterplots of the dimensional space with labeled item-points. The output includes the values for the raw stress (raw F), Kruskal stress coefficient S, and the coefficient of alienation. The goodness of fit can be evaluated via Shepard diagrams (with d-hats and d-stars). Like all other results in STATISTICA, the final configuration can be saved to a data file.
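The stress values reported for a configuration compare the reproduced distances with the monotonically transformed input data (the d-hats). A minimal Python sketch of the raw stress and of Kruskal's stress (formula 1) is shown below. The function names are hypothetical, the d-hat values are assumed to be given (for example, from a monotone regression), and whether the program normalizes stress in exactly this way is an assumption.

```python
import math


def raw_stress(distances, d_hats):
    """Sum of squared differences between reproduced distances and d-hats."""
    return sum((d - h) ** 2 for d, h in zip(distances, d_hats))


def kruskal_stress(distances, d_hats):
    """Kruskal's stress formula 1: raw stress scaled by the squared distances."""
    return math.sqrt(raw_stress(distances, d_hats) / sum(d * d for d in distances))


print(kruskal_stress([3.0, 4.0], [3.0, 3.0]))
```

A value of 0 means the configuration reproduces the rank order of the input data perfectly; larger values indicate a poorer fit.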
The Discriminant Analysis module is a full implementation of multiple stepwise discriminant function analysis. STATISTICA also includes the General Discriminant Analysis Models module (below) for fitting ANOVA/ANCOVA-like designs to categorical dependent variables, and to perform various advanced types of analyses (e.g., best subset selection of predictors, profiling of posterior probabilities, etc.). The Discriminant Analysis program will perform forward or backward stepwise analyses, or enter user-specified blocks of variables into the model.
In addition to the numerous graphics and diagnostics describing the discriminant functions, the program also provides a wide range of options and statistics for the classification of old or new cases (for validation of the model). The output includes the respective Wilks' lambdas, partial lambdas, F to enter (or remove), the p levels, the tolerance values, and the R-square. The program will perform a full canonical analysis and report the raw and cumulative eigenvalues for all roots, and their p levels, the raw and standardized discriminant (canonical) function coefficients, the structure coefficient matrix (of factor loadings), the means for the discriminant functions, and the discriminant scores for each case (which can also be automatically appended to the data file). Integrated graphs include histograms of the canonical scores within each group (and all groups combined), special scatterplots for pairs of canonical variables (where group membership of individual cases is visibly marked), a comprehensive selection of categorized (multiple) graphs allowing the user to explore the distribution and relations between dependent variables across the groups (including multiple box-and-whisker plots, histograms, scatterplots, and probability plots), and many others. The Discriminant Analysis module will also compute the standard classification functions for each group. The classification of cases can be reviewed in terms of Mahalanobis distances, posterior probabilities, or actual classifications, and the scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results spreadsheets. All of these values can be automatically appended to the current data file for further analyses. The summary classification matrix of the number and percent of correctly classified cases can also be displayed. 
The user has several options to specify the a priori classification probabilities and can specify selection conditions to include or exclude selected cases from the classification (e.g., to validate the classification functions in a new sample).
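The relationship between Mahalanobis distances, a priori probabilities, and posterior probabilities can be sketched in a few lines of Python. The example assumes two predictors, a pooled within-group covariance matrix shared by all groups, and normally distributed predictors, so that each posterior probability is proportional to the prior times exp(-D²/2). The function names and numbers are hypothetical and are not taken from the module.

```python
import math


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance for two variables and a 2 x 2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Quadratic form with the explicit inverse of the 2 x 2 matrix.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def posterior_probabilities(sq_distances, priors):
    """Posterior group probabilities from squared distances and priors."""
    weights = [p * math.exp(-0.5 * d2) for d2, p in zip(sq_distances, priors)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical case, group centroids, and pooled covariance matrix.
case = (1.0, 1.0)
centroids = [(0.0, 0.0), (3.0, 3.0)]
pooled_cov = [[1.0, 0.0], [0.0, 1.0]]
d2 = [mahalanobis_sq(case, m, pooled_cov) for m in centroids]
posteriors = posterior_probabilities(d2, priors=[0.5, 0.5])
```

The case is assigned to the group with the largest posterior probability; changing the a priori probabilities shifts that assignment accordingly.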
General Discriminant Analysis Models (GDA)
The STATISTICA General Discriminant Analysis (GDA) module is an application and extension of the General Linear Model to classification problems. Like the Discriminant Analysis module, GDA allows you to perform standard and stepwise discriminant analyses. GDA implements the discriminant analysis problem as a special case of the general linear model, and thereby offers analytic techniques that are both innovative and efficient. The computational approach, the standard results, and the features unique to GDA are described in the following paragraphs.
Computational approach and unique applications. As in traditional discriminant analysis, GDA allows you to specify a categorical dependent variable. For the analysis, the group membership (with regard to the dependent variable) is then coded into indicator variables, and all methods of GRM can be applied. In the results dialogs, the extensive selection of residual statistics of GRM and GLM is available in GDA as well; for example, you can review all the regression-like residuals and predicted values for each group (each coded dependent indicator variable), and choose from the large number of residual plots. In addition, all specialized prediction and classification statistics are computed that are commonly reviewed in a discriminant analysis; but those statistics can be reviewed in innovative ways because of STATISTICA's unique approach. For example, you can perform "desirability profiling" by combining the posterior prediction probabilities for the groups into a desirability score, and then let the program find the values or combination of categorical predictor settings that will optimize that score. Thus, GDA provides powerful and efficient tools for data mining as well as applied research; for example, you could use the DOE (Design of Experiments) methods to generate an experimental design for quality improvement, apply this design to categorical outcome data (e.g., distinct classifications of an outcome as "superior," "acceptable," or "failed"), and then model the posterior prediction probabilities of those outcomes using the variables of your experimental design.
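The coding of group membership into indicator variables can be pictured with a short Python sketch. The helper name and the sample outcomes are hypothetical, and the sketch shows only the coding step, not the subsequent model fitting.

```python
def indicator_code(groups):
    """Code a categorical dependent variable into 0/1 indicator columns."""
    levels = sorted(set(groups))
    rows = [[1 if g == level else 0 for level in levels] for g in groups]
    return levels, rows


# Hypothetical outcome classifications for four cases.
outcomes = ["superior", "failed", "acceptable", "failed"]
levels, indicators = indicator_code(outcomes)
```

Each indicator column can then be treated like a dependent variable in a regression-type model, which is what makes regression-like residuals and predicted values available for each group.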
Standard discriminant analysis results. STATISTICA GDA will compute all standard results for discriminant analysis, including discriminant function coefficients, canonical analysis results (standardized and raw coefficients, step-down tests of canonical roots, etc.), classification statistics (including Mahalanobis distances, posterior probabilities, actual classification of cases in the analysis sample and validation sample, misclassification matrix, etc.), and so on.
Unique features of GDA, currently only available in STATISTICA. In addition, STATISTICA GDA includes numerous unique features and results:
Specifying predictor variables and effects; model building:
1. Support for continuous and categorical predictors; instead of allowing only continuous predictors in the analysis (the common limitation in traditional discriminant function analysis programs), GDA allows the user to specify simple and complex ANOVA and ANCOVA-like designs, e.g., mixtures of continuous and categorical predictors, polynomial (response surface) designs, factorial designs, nested designs, etc.
2. Multiple-degree of freedom effects in stepwise selection; the terms that make up the predictor set (consisting not only of single-degree of freedom continuous predictors, but also multiple-degree of freedom effects) can be used in stepwise discriminant function analyses; multiple-degree of freedom effects will always be entered/removed as blocks.
3. Best subset selection of predictor effects; single- and multiple-degree of freedom effects can be specified for best-subset discriminant analysis; the program will select the effects (up to a user-specified number of effects) that produce the best discrimination between groups.
4. Selection of predictor effects based on misclassification rates; GDA allows the user to perform model building (selection of predictor effects) not only based on traditional criteria (e.g., p-to-enter/remove; Wilks' lambda), but also based on misclassification rates; in other words the program will select those predictor effects that maximize the accuracy of classification, either for those cases from which the parameter estimates were computed, or for a cross-validation sample (to guard against over-fitting); these techniques elevate GDA to the level of a fast neural-network-like data mining tool for classification, that can be used as an alternative to other similar techniques (tree-classifiers, designated neural-network methods, etc.; GDA will tend to be faster than those techniques because it is still based on the more efficient General Linear Model).
Results statistics; profiling:
1. Detailed results and diagnostic statistics and plots; in addition to the standard results statistics, GDA provides a large amount of auxiliary information to help the user judge the adequacy of the chosen discriminant analysis model (descriptive statistics and graphs, Mahalanobis distances, Cook distances, and leverages for predictors, etc.).
2. Profiling of expected classification; GDA includes an adaptation of the general GLM (GRM) response profiler; these options allow the user to quickly determine the values (or levels) of the predictor variables that maximize the posterior classification probability for a single group, or for a set of groups in the analyses; in a sense, the user can quickly determine the typical profiles of values of the predictors (or levels of categorical predictors) that identify a group (or set of groups) in the analysis.
A note of caution for models with categorical predictors, and other advanced techniques. The General Discriminant Analysis module provides functionality that makes this technique a general tool for classification and data mining. However, most -- if not all -- textbook treatments of discriminant function analysis are limited to simple and stepwise analyses with single-degree of freedom continuous predictors. No "experience" (in the literature) exists regarding issues of robustness and effectiveness of these techniques when they are generalized in the manner provided in this very powerful module. The use of best-subset methods, in particular when used in conjunction with categorical predictors or when using the misclassification rates in a cross-validation sample for choosing the best subset of predictors, should be considered a heuristic search method, rather than a statistical analysis technique.
Advanced Linear/Non-Linear Modules
Distributions and Simulation
Distributions and Simulation enables users to automatically fit a large number of distributions for continuous and categorical variables to lists of variables. Standard distributions are available (normal, half-normal, log-normal, Weibull, etc.), but also included are specialized and general distributions (Johnson, Gaussian Mixture, Generalized Pareto, Generalized Extreme Value), and STATISTICA automatically ranks the quality of the fit for each selected distribution and variable.
In addition, the distributions fit to the list of selected variables and the covariance between the selected variables can be saved for deployment. The Distributions & Simulation module uses this deployment information to generate simulated data sets that not only faithfully reproduce the respective distributions, but also the covariances between variables. In short, in addition to facilitating efficient distribution fitting to large numbers of variables, this module enables users to fit general multivariate distributions, and simulate from those distributions, using cutting edge simulation techniques (e.g., Latin-Hypercube simulation). When data are not available for which to fit distributions, the Design Simulation tool allows you to generate data from a correlation matrix and selection of distributions. These methods have proven useful in various domains such as modern DOE, reliability engineering, and risk modeling.
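The basic idea of Latin-Hypercube simulation can be sketched as follows: each variable's range is cut into n equally probable strata, and every stratum is sampled exactly once. This minimal Python sketch draws independent uniform values only; mapping them through fitted distributions and inducing the covariances between variables are additional steps that are not shown. The function name and parameters are hypothetical.

```python
import random


def latin_hypercube(n, dims, seed=0):
    """Draw n points in [0, 1)^dims with one point per stratum per dimension."""
    rng = random.Random(seed)
    columns = []
    for _ in range(dims):
        # One value from each of the n strata [i/n, (i+1)/n), in random order.
        column = [(i + rng.random()) / n for i in range(n)]
        rng.shuffle(column)
        columns.append(column)
    return [tuple(column[i] for column in columns) for i in range(n)]


samples = latin_hypercube(n=10, dims=3, seed=1)
```

Compared with plain random sampling, this stratification spreads the simulated values evenly over each variable's range, which is why the approach is attractive for smaller simulation runs.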
Variance Components and Mixed Model ANOVA/ANCOVA
Variance Components and Mixed Model ANOVA/ANCOVA is a specialized module for designs with random effects and/or factors with many levels; options for handling random effects and for estimating variance components are also provided in the General Linear Models module. Random effects (factors) occur frequently in industrial research, when the levels of a factor represent values sampled from a random variable (as opposed to being deliberately chosen or arranged by the experimenter). The Variance Components module will allow you to analyze designs with any combinations of fixed effects, random effects, and covariates. Extremely large ANOVA/ANCOVA designs can be efficiently analyzed: Factors can have several hundreds of levels. The program will analyze standard factorial (crossed) designs as well as hierarchically nested designs, and compute the standard Type I, II, and III analysis of variance sums of squares and mean squares for the effects in the model. In addition, you can compute the table of expected mean squares for the effects in the design, the variance components for the random effects in the model, the coefficients for the denominator synthesis, and the complete ANOVA table with tests based on synthesized error sums of squares and degrees of freedom (using Satterthwaite's method). Other methods for estimating variance components are also supported (e.g., MIVQUE0, Maximum Likelihood [ML], Restricted Maximum Likelihood [REML]). For maximum likelihood estimation, both the Newton-Raphson and Fisher scoring algorithms are used, and the model will not be arbitrarily changed (reduced) during estimation to handle situations where most components are at or near zero. Several options for reviewing the weighted and unweighted marginal means, and their confidence intervals, are also available. Extensive graphics options can be used to visualize the results.
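For the simplest case, a balanced one-way layout with a single random factor, the link between expected mean squares and variance components can be shown in a few lines of Python. The sketch uses the ANOVA (method of moments) estimator, in which the between-group component equals (MS between - MS within) / n; the function name and the data are hypothetical, and the ML, REML, and MIVQUE0 methods mentioned above are not covered.

```python
def one_way_variance_components(groups):
    """ANOVA estimates for a balanced one-way random-effects design."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("this sketch assumes equal group sizes")
    grand_mean = sum(sum(g) for g in groups) / (k * n)
    means = [sum(g) / n for g in groups]
    ms_between = n * sum((m - grand_mean) ** 2 for m in means) / (k - 1)
    ms_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g) / (k * (n - 1))
    return {
        "ms_between": ms_between,
        "ms_within": ms_within,
        # E(MS between) = var_within + n * var_between, solved for var_between.
        "var_between": (ms_between - ms_within) / n,
        "var_within": ms_within,
    }


# Hypothetical measurements from three randomly sampled batches.
components = one_way_variance_components([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Note that this estimator can return a negative value for the between-group component when the between-group mean square is smaller than the within-group mean square.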
Survival/Failure Time Analysis
This module features a comprehensive implementation of a variety of techniques for analyzing censored data from social, biological, and medical research, as well as procedures used in engineering and marketing (e.g., quality control, reliability estimation, etc.). In addition to computing life tables with various descriptive statistics and Kaplan-Meier product limit estimates, the user can compare the survivorship functions in different groups using a large selection of methods (including the Gehan test, Cox F-test, Cox-Mantel test, Log-rank test, and Peto & Peto generalized Wilcoxon test). Also, Kaplan-Meier plots can be computed for groups (uncensored observations are identified in graphs with different point markers). The program also features a selection of survival function fitting procedures (including the Exponential, Linear Hazard, Gompertz, and Weibull functions) based on either unweighted or weighted least squares methods (maximum-likelihood parameter estimates for various distributions, including Weibull, can also be computed via the STATISTICA Process Analysis module). Finally, the program offers full implementations of four general explanatory models (Cox's proportional hazard model, exponential regression model, log-normal and normal regression models) with extended diagnostics, including stratified analysis and graphs of survival for user-specified values of predictors. For Cox proportional hazard regression, the user can choose to stratify the sample to permit different baseline hazards in different strata (but a constant coefficient vector), or the user can allow for different baseline hazards as well as coefficient vectors. In addition, general facilities are provided to define one or more time-dependent covariates.
Time-dependent covariates can be specified via a flexible formula interpreter that allows the user to define the covariates via arithmetic expressions which may include time, as well as the standard logical functions (e.g., timdep=age+age*log(t_)*(age>45), where t_ references survival time) and a wide variety of distribution functions. As in all other modules of STATISTICA, the user can access and change the technical parameters of all procedures (or accept dynamic defaults). The module also offers an extensive selection of graphics and specialized diagrams to aid in the interpretation of results (including plots of cumulative proportions surviving/failing, patterns of censored data, hazard and cumulative hazard functions, probability density functions, group comparison plots, distribution fitting plots, various residual plots, and many others).
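Two of the ideas described above lend themselves to a short Python sketch: the Kaplan-Meier product-limit estimate for censored data, and the example time-dependent covariate formula written as an ordinary function. The function names and the sample data are hypothetical; the sketch only mirrors the arithmetic and is not the module's formula interpreter.

```python
import math


def kaplan_meier(times, events):
    """Product-limit survival estimates; events: 1 = failure, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = removed = 0
        # Collect all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            removed += 1
            i += 1
        if failures:
            survival *= 1 - failures / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


def timdep(age, t):
    """Python form of the example: timdep=age+age*log(t_)*(age>45)."""
    return age + age * math.log(t) * (age > 45)


# Hypothetical survival times; the second observation is censored.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

In the formula, the logical term (age>45) evaluates to 1 or 0, so the time-dependent part of the covariate applies only to cases older than 45.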
Cox Proportional Hazards Models
The Cox Proportional Hazards Models module is a highly scalable tool whose applications include:
- analysis of survival data from patients in medical studies
- customer churn analysis (loss of customers)
- modeling the failure of mechanical parts (reliability)
This tool allows for flexible handling of censored data, categorical predictors, and designs that include interactions and/or nested effects. It uses model building techniques such as best subsets and stepwise regression. Deployment of the survival functions on new data is available with STATISTICA Rapid Deployment.
General Nonlinear Estimation (and Quick Logit/Probit Regression)
The Nonlinear Estimation module allows the user to fit essentially any type of nonlinear model. One of the unique features of this module is that (unlike traditional nonlinear estimation programs) it does not impose any limits on the size of data files that it can process.
The models can be fit using least squares or maximum-likelihood estimation, or any user-specified loss function. When using the least-squares criterion, the very efficient Levenberg-Marquardt and Gauss-Newton algorithms can be used to estimate the parameters for arbitrary linear and nonlinear regression problems. For large datasets or for difficult nonlinear regression problems (such as those rated "higher difficulty" among the Statistical Reference Datasets provided by the National Institute of Standards and Technology; see http://www.nist.gov/itl/div898/strd/index.html), these least-squares algorithms are the recommended way to compute precise parameter estimates. When using arbitrary loss functions, the user can choose from among four very different, powerful estimation procedures (quasi-Newton, Simplex, Hooke-Jeeves pattern moves, and Rosenbrock pattern search method of rotating coordinates) so that stable parameter estimates can be obtained in practically all cases, and even in extremely numerically-demanding conditions (see the Validation Benchmarks).
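To give a feel for how a Levenberg-Marquardt iteration works under the least-squares criterion, here is a minimal Python sketch for the two-parameter model y = a * exp(b * x). It is a textbook-style illustration with a simple damping rule (decrease the damping after a successful step, increase it otherwise), not the module's implementation; the function name, starting values, and data are hypothetical.

```python
import math


def fit_exponential(xs, ys, a=1.0, b=0.0, iterations=200):
    """Levenberg-Marquardt sketch for the model y = a * exp(b * x)."""
    def sse(a, b):
        try:
            return sum((y - a * math.exp(b * x)) ** 2 for x, y in zip(xs, ys))
        except OverflowError:
            return float("inf")  # treat numerically wild trial steps as failures

    damping = 1e-3
    current = sse(a, b)
    for _ in range(iterations):
        # Accumulate J'J and J'r for the two parameters.
        jaa = jab = jbb = ga = gb = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e
            da, db = e, a * x * e  # partial derivatives of the model
            jaa += da * da
            jab += da * db
            jbb += db * db
            ga += da * r
            gb += db * r
        # Damped normal equations, solved explicitly for the 2 x 2 case.
        haa, hbb = jaa * (1 + damping), jbb * (1 + damping)
        det = haa * hbb - jab * jab
        if det == 0:
            break
        step_a = (hbb * ga - jab * gb) / det
        step_b = (haa * gb - jab * ga) / det
        trial = sse(a + step_a, b + step_b)
        if trial < current:
            a, b, current = a + step_a, b + step_b, trial
            damping /= 10  # behave more like Gauss-Newton
        else:
            damping *= 10  # behave more like gradient descent
        if current < 1e-20:
            break
    return a, b


# Hypothetical noise-free data generated from a = 2, b = 0.5.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * math.exp(0.5 * x) for x in xs]
a_hat, b_hat = fit_exponential(xs, ys)
```

With the damping set to zero, each step reduces to a Gauss-Newton step; the damping term is what keeps the iteration stable when the starting values are far from the solution.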
The user can specify any type of model by typing in the respective equation into an equation editor. The equations may include logical operators; thus, discontinuous (piecewise) regression models and models including indicator variables can also be estimated. The equations may also include a wide selection of distribution functions and cumulative distribution functions (Beta, Binomial, Cauchy, Chi-square, Exponential, Extreme value, F, Gamma, Geometric, Laplace, Logistic, Normal, Log-Normal, Pareto, Poisson, Rayleigh, t (Student), or Weibull distribution). The user has full control over all aspects of the estimation procedure (e.g., starting values, step sizes, convergence criteria, etc.). The most common nonlinear regression models are predefined in the Nonlinear Estimation module, and can be chosen simply as menu options. Those regression models include stepwise Probit and Logit regression, the exponential regression model, and linear piecewise (break point) regression. Note that STATISTICA also includes implementations of powerful algorithms for fitting generalized linear models, including probit and multinomial logit models, and generalized additive models; see the respective descriptions for additional details.
In addition to various descriptive statistics, standard results of the nonlinear estimation include the parameter estimates and their standard errors (computed independently of the estimation itself, via finite differencing to optimize precision; see the Validation Benchmarks); the variance/covariance matrix of parameter estimates, the predicted values, residuals, and appropriate measures of goodness-of-fit (e.g., log-likelihood of estimated/null models and Chi-square test of difference, proportion of variance accounted for, classification of cases and odds-ratios for Logit and Probit models, etc.). Predicted and residual values can be appended to the data file for further analyses. For Probit and Logit models, the incremental fit is also automatically computed when adding or deleting parameters from the regression model (thus, the user can explore the data via a stepwise nonlinear estimation procedure; options for automatic forward and backward stepwise regression as well as best-subset selection of predictors in logit and probit models are provided in the Generalized Linear Models module, below).
All output is integrated with extensive selections of graphs, including interactively-adjustable 2D and 3D (surface) arbitrary function fitting graphs which allow the user to visualize the quality of the fit and identify outliers or ranges of discrepancy between the model and the data; the user can interactively adjust the equation of the fitted function (as shown in the graph) without re-processing the data, and visualize practically all aspects of the nonlinear fitting process. Many other specialized graphs are provided to evaluate the fitting process and visualize the results, such as histograms of all selected variables and residual values, scatterplots of observed versus predicted values and predicted versus residual values, normal and half-normal probability plots of residuals, and many others.
Log-Linear Analysis of Frequency Tables
This module offers a complete implementation of log-linear modeling procedures for multi-way frequency tables. Note that STATISTICA also includes the Generalized Linear Models module, which provides options for analyzing binomial and multinomial logit models with coded ANOVA/ANCOVA-like designs. In the Log-Linear Analysis module, the user can analyze up to 7-way tables in a single run. Both complete and incomplete tables (with structural zeros) can be analyzed. Frequency tables can be computed from raw data, or may be entered directly into the program. The Log-Linear Analysis module provides a comprehensive selection of advanced modeling procedures in an interactive and flexible environment that greatly facilitates exploratory and confirmatory analyses of complex tables. The user may at all times review the complete observed table as well as marginal tables, and fitted (expected) values, and may evaluate the fit of all partial and marginal association models or select specific models (marginal tables) to be fitted to the observed data. The program also offers an intelligent automatic model selection procedure that first determines the necessary order of interaction terms required for a model to fit the data, and then, through backwards elimination, determines the best sufficient model to satisfactorily fit the data (using criteria determined by the user). The standard output includes G-square (Maximum-Likelihood Chi-square), the standard Pearson Chi-square with the appropriate degrees of freedom and significance levels, the observed and expected tables, marginal tables, and other statistics. 
Graphics options available in the Log-linear module include a variety of 2D and 3D graphs designed to visualize 2-way and multi-way frequency tables (including interactive, user-controlled cascades of categorized histograms and 3D histograms revealing "slices" of multi-way tables), plots of observed and fitted frequencies, plots of various residuals (standardized, components of Maximum-Likelihood Chi-square, Freeman-Tukey deviates, etc.), and many others.
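The two fit statistics named in the standard output can be computed directly from observed and fitted (expected) cell frequencies, as in the following Python sketch. It uses the usual definitions, G-square = 2 * sum(O * ln(O / E)) and Pearson Chi-square = sum((O - E)² / E); the function name and the cell counts are hypothetical.

```python
import math


def fit_statistics(observed, expected):
    """Return (G-square, Pearson Chi-square) for flattened cell frequencies."""
    # Cells with an observed count of zero contribute nothing to G-square.
    g_square = 2 * sum(o * math.log(o / e)
                       for o, e in zip(observed, expected) if o > 0)
    pearson = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return g_square, pearson


# Hypothetical 2 x 2 table (flattened) and its fitted values under independence.
g_square, pearson = fit_statistics([20, 10, 10, 20], [15.0, 15.0, 15.0, 15.0])
```

Both statistics are compared against the Chi-square distribution with the degrees of freedom of the fitted model; large values indicate that the model does not reproduce the observed table well.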
Time Series Analysis/Forecasting
The Time Series module contains a wide range of descriptive, modeling, decomposition, and forecasting methods for both time and frequency domain models. These procedures are integrated, that is, the results of one analysis (e.g., ARIMA residuals) can be used directly in subsequent analysis (e.g., to compute the autocorrelation of the residuals). Also, numerous flexible options are provided to review and plot single or multiple series. Analyses can be performed on even very long series. Multiple series can be maintained in the active work area of the program (e.g., multiple raw input data series or series resulting from different stages of the analysis); the series can be reviewed and compared. The program will automatically keep track of successive analyses, and maintain a log of transformations and other results (e.g., ARIMA residuals, seasonal components, etc.). Thus, the user can always return to prior transformations or compare (plot) the original series together with its transformations. Information about the consecutive transformations is maintained in the form of long variable labels, so if you save the newly created variables into a dataset, the "history" of each of the series will be permanently preserved. The specific Time Series procedures are described in the following subsections.
Transformations, Modeling, Plots, Autocorrelations
The available time series transformations allow the user to fully explore patterns in the input series, and to perform all common time series transformations, including: de-trending, removal of autocorrelation, moving average smoothing (unweighted and weighted, with user-defined or Daniell, Tukey, Hamming, Parzen, or Bartlett weights), moving median smoothing, simple exponential smoothing (see also the description of all exponential smoothing options below), differencing, integrating, residualizing, shifting, 4253H smoothing, tapering, Fourier (and inverse) transformations, and others. Autocorrelation, partial autocorrelation, and crosscorrelation analyses can also be performed.
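A few of these transformations are simple enough to sketch in plain Python: differencing, unweighted moving average smoothing, and the autocorrelation at a given lag. The function names are hypothetical, and the autocorrelation uses the common estimator that divides the lagged cross-products by the total sum of squares; the module's exact conventions may differ.

```python
def difference(series, lag=1):
    """Differenced series: x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def moving_average(series, window):
    """Unweighted moving averages over consecutive windows."""
    return [sum(series[t:t + window]) / window
            for t in range(len(series) - window + 1)]


def autocorrelation(series, lag):
    """Autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    numerator = sum((series[t] - mean) * (series[t + lag] - mean)
                    for t in range(n - lag))
    denominator = sum((x - mean) ** 2 for x in series)
    return numerator / denominator


# Hypothetical short series with an upward trend.
series = [1, 3, 6, 10, 15]
first_differences = difference(series)
smoothed = moving_average(series, window=3)
r1 = autocorrelation([1, 2, 3, 4, 5], lag=1)
```

Differencing removes the trend from the example series, which is the usual preparation before autocorrelations are inspected for model identification.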
ARIMA and Interrupted Time Series (Intervention) Analysis
The Time Series module offers a complete implementation of ARIMA. Models may include a constant, and the series can be transformed prior to the analysis; these transformations will automatically be "undone" when ARIMA forecasts are computed, so that the forecasts and their standard errors are expressed in terms of the values of the original input series. Approximate and exact maximum-likelihood conditional sums of squares can be computed, and the ARIMA implementation in the Time Series module is uniquely suited to fitting models with long seasonal periods (e.g., periods of 30 days). Standard results include the parameter estimates and their standard errors and the parameter correlations. Forecasts and their standard errors can be computed and plotted, and appended to the input series. In addition, numerous options for examining the ARIMA residuals (for model adequacy) are available, including a large selection of graphs. The implementation of ARIMA in the Time Series module also allows the user to perform interrupted time series (intervention) analysis. Several simultaneous interventions may be modeled, which can either be single-parameter abrupt-permanent interventions, or two-parameter gradual or temporary interventions (graphs of different impact patterns can be reviewed). Forecasts can be computed for all intervention models, which can be plotted (together with the input series) as well as appended to the original series.
Seasonal and Non-Seasonal Exponential Smoothing
The Time Series module contains a complete implementation of all 12 common exponential smoothing models. Models can be specified to contain an additive or multiplicative seasonal component and/or linear, exponential, or damped trend; thus, available models include the popular Holt-Winters linear trend models. The user may specify the initial value for the smoothing transformation, initial trend value, and seasonal factors (if appropriate). Separate smoothing parameters can be specified for the trend and seasonal components. The user can also perform a grid search of the parameter space in order to identify the best parameters; the respective results spreadsheet will report for all combinations of parameter values the mean error, mean absolute error, sum of squares error, mean square error, mean percentage error, and mean absolute percentage error. The smallest value for these fit indices will be highlighted in the spreadsheet. In addition, the user can also request an automatic search for the best parameters with regard to the mean square error, mean absolute error, or mean absolute percentage error (a general function minimization procedure is used for this purpose). The results of the respective exponential smoothing transformation, the residuals, as well as the requested number of forecasts, are available for further analyses and plots. A summary plot is also available to assess the adequacy of the respective exponential smoothing model; that plot will show the original series together with the smoothed values and forecasts, as well as the smoothing residuals plotted separately against the right-Y axis.
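The grid search idea can be illustrated with the simplest of these models, simple exponential smoothing without trend or seasonal components. In this Python sketch the smoothed value is updated as S = S + alpha * (x - S), the first observation serves as the initial value, and the mean square error of the one-step-ahead forecasts is evaluated for each candidate alpha. The function names, the choice of initial value, and the data are assumptions made for the example.

```python
def ses_mean_square_error(series, alpha):
    """Mean square one-step-ahead error of simple exponential smoothing."""
    level = series[0]  # assumed initial value: the first observation
    squared_errors = []
    for x in series[1:]:
        error = x - level  # the forecast for this period is the previous level
        squared_errors.append(error * error)
        level += alpha * error
    return sum(squared_errors) / len(squared_errors)


def grid_search(series, alphas):
    """Return (smallest mean square error, corresponding alpha)."""
    return min((ses_mean_square_error(series, a), a) for a in alphas)


# Hypothetical trending series and a grid of alpha values from 0.1 to 1.0.
series = [1, 2, 3, 4, 5, 6]
best_mse, best_alpha = grid_search(series, [i / 10 for i in range(1, 11)])
```

For a steadily trending series like this one, the search favors the largest alpha, which is a hint that a model with a trend component would be more appropriate.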
Census Method I - Classical Seasonal Decomposition
The user may specify the length of the seasonal period, and choose either the additive or multiplicative seasonal model. The program will compute the moving averages, ratios or differences, seasonal factors, the seasonally adjusted series, the smoothed trend-cycle component, and the irregular component. Those components are available for further analysis; for example, the user may compute histograms, normal probability plots, etc. for any or all of these components (e.g., to test model adequacy).
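The sequence of computations for the additive model can be sketched in Python as follows: a centered moving average estimates the trend-cycle component, the differences between the series and that average are averaged by season to give the seasonal factors, and subtracting those factors yields the seasonally adjusted series. The function names and the test series are hypothetical; the sketch omits the additional smoothing step and the irregular component.

```python
def centered_moving_average(series, period):
    """Centered moving average; undefined (None) at the ends of the series."""
    half = period // 2
    result = [None] * len(series)
    for t in range(half, len(series) - half):
        window = series[t - half:t + half + 1]
        total = sum(window)
        if period % 2 == 0:
            # Even period: the two end points each receive half weight.
            total -= 0.5 * (window[0] + window[-1])
        result[t] = total / period
    return result


def additive_decomposition(series, period):
    """Return (moving averages, seasonal factors, seasonally adjusted series)."""
    averages = centered_moving_average(series, period)
    differences = [[] for _ in range(period)]
    for t, (x, m) in enumerate(zip(series, averages)):
        if m is not None:
            differences[t % period].append(x - m)
    raw = [sum(d) / len(d) for d in differences]
    center = sum(raw) / period
    seasonal = [r - center for r in raw]  # factors are centered to sum to zero
    adjusted = [x - seasonal[t % period] for t, x in enumerate(series)]
    return averages, seasonal, adjusted


# Hypothetical quarterly series: linear trend plus a fixed seasonal pattern.
pattern = [2, -1, -2, 1]
series = [t + pattern[t % 4] for t in range(12)]
averages, seasonal, adjusted = additive_decomposition(series, period=4)
```

For the multiplicative model the same steps apply with ratios in place of differences and division in place of subtraction.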
Census Method II - X-11 Monthly and Quarterly Seasonal Decomposition and Seasonal Adjustment
The Time Series module contains a full-featured implementation of the US Bureau of the Census X-11 variant of the Census Method II seasonal adjustment procedure. While the original X-11 algorithms were not year-2000 compatible (only data prior to January 2000 could be analyzed), the STATISTICA implementation of X-11 can handle data containing dates prior to January 1, 2000, after that date, or series that start prior to that date but terminate in or after the year 2000. The arrangement of options and dialogs closely follows the definitions and conventions described in the Bureau of the Census documentation. Additive and multiplicative seasonal models may be specified. The user may also specify prior trading-day factors and seasonal adjustment factors. Trading-day variation can be estimated via regression (controlling for extreme observations), and used to adjust the series (conditionally if requested). The standard options are provided for graduating extreme observations, for computing the seasonal factors, and for computing the trend-cycle component (the user can choose between various types of weighted moving averages; optimal lengths and types of moving averages can also automatically be chosen by the program). The final components (seasonal, trend-cycle, irregular) and the seasonally adjusted series are automatically available for further analyses and plots; those components can also be saved for further analyses with other programs. The program will produce the plots of the different components, including categorized plots by months (or quarters).
Polynomial Distributed Lag Models
The implementation of the polynomial distributed lag methods in the Time Series module will estimate models with unconstrained lags as well as (constrained) Almon distributed lag models. A selection of graphs is available to examine the distributions of the model variables.
Spectrum (Fourier) and Cross-Spectrum Analysis
The Time Series module includes a full implementation of spectrum (Fourier decomposition) analysis and cross-spectrum analysis techniques. The program is particularly suited for the analysis of unusually long time series (e.g., with over 250,000 observations), and it will not impose any constraints on the length of the series (i.e., the length of the input series does not have to be a power of 2). However, the user may also choose to pad or truncate the series prior to the analysis. Standard pre-analysis transformations include tapering, subtraction of the mean, and detrending. For single spectrum analysis, the standard results include the frequency, period, sine and cosine coefficients, periodogram values, and spectral density estimates. The density estimates can be computed using Daniell, Hamming, Bartlett, Tukey, Parzen, or user-defined weights and user-defined window widths. An option that is particularly useful for long input series is to display only a user-defined number of the largest periodogram or density values in descending order; thus, the most salient periodogram or density peaks can be easily identified in long series. The user can compute the Kolmogorov-Smirnov d test for the periodogram values to test whether they follow an exponential distribution (i.e., whether the input is a white-noise series). Numerous plots are available to summarize the results; the user can plot the sine and cosine coefficients, periodogram values, log-periodogram values, spectral density values, and log-density values against the frequencies, period, or log-period. For long input series, the user can choose the segment (period) for which to plot the respective periodogram or density values, thus enhancing the "resolution" of the periodogram or density plot.
For cross-spectrum analysis, in addition to the single spectrum results for each series, the program computes the cross-periodogram (real and imaginary part), co-spectral density, quadrature spectrum, cross-amplitude, coherency values, gain values, and the phase spectrum. All of these can also be plotted against the frequency, period, or log-period, either for all periods (frequencies) or only for a user-defined segment. A user-defined number of the largest cross-periodogram values (real or imaginary) can also be displayed in a spreadsheet in descending order of magnitude to facilitate the identification of salient peaks when analyzing long input series. As with all other procedures in the Time Series module, all of these result series can be appended to the active work area, and will be available for further analyses with other time series methods or other STATISTICA modules.
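As a rough illustration of the cross-spectrum quantities named above, the following Python sketch computes the co-spectral density, quadrature spectrum, cross-amplitude, squared coherency, gain, and phase from Daniell-smoothed (cross-)periodograms. The function names, the scaling factor, and the sign convention for the quadrature spectrum and phase are assumptions made for the example; conventions differ between textbooks and programs.

```python
import cmath
import math

def _dft(series):
    """Discrete Fourier sums of the mean-centered series for k = 0 .. n//2."""
    n = len(series)
    mean = sum(series) / n
    return [sum((series[t] - mean) * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n // 2 + 1)]

def _daniell(values, width):
    """Equal-weight moving average; the window is truncated at the ends."""
    half = width // 2
    return [sum(values[max(0, k - half): k + half + 1])
            / len(values[max(0, k - half): k + half + 1])
            for k in range(len(values))]

def cross_spectrum(x, y, width=3):
    n = len(x)
    fx, fy = _dft(x), _dft(y)
    scale = 2.0 / n  # assumed scaling
    sxx = _daniell([scale * abs(v) ** 2 for v in fx], width)
    syy = _daniell([scale * abs(v) ** 2 for v in fy], width)
    sxy = _daniell([scale * fx[k].conjugate() * fy[k] for k in range(len(fx))], width)
    rows = []
    for k in range(1, len(fx)):
        amp = abs(sxy[k])
        defined = sxx[k] > 0 and syy[k] > 0
        rows.append({
            "frequency": k / n,
            "cospectrum": sxy[k].real,
            "quadrature": sxy[k].imag,        # sign convention is an assumption
            "cross_amplitude": amp,
            "coherency_sq": amp ** 2 / (sxx[k] * syy[k]) if defined else float("nan"),
            "gain_y_over_x": amp / sxx[k] if defined else float("nan"),
            "phase": math.atan2(sxy[k].imag, sxy[k].real),
        })
    return rows

# y has twice the amplitude of x and lags it by a quarter of pi at frequency 0.1.
n = 60
x = [math.cos(2 * math.pi * 6 * t / n) for t in range(n)]
y = [2.0 * math.cos(2 * math.pi * 6 * t / n - math.pi / 4) for t in range(n)]
rows = cross_spectrum(x, y)
at_signal = rows[6 - 1]  # row for k = 6, i.e. frequency 6/60
print(at_signal)
```

Note that without smoothing the squared coherency is identically 1 at every frequency, which is why the sketch smooths the periodograms before forming the ratio.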
Regression-Based Forecasting Techniques
Finally, STATISTICA offers regression-based time series techniques for lagged or non-lagged variables (including regression through the origin, nonlinear regression, and interactive what-if forecasting).
Structural Equation Modeling and Path Analysis (SEPATH)
STATISTICA includes a comprehensive implementation of structural equation modeling techniques with flexible Monte Carlo simulation facilities (SEPATH). The module is a state-of-the-art program with an "intelligent" user interface. It offers a comprehensive selection of modeling procedures integrated with unique user-interface tools allowing you to specify even complex models without using any command syntax. Via Wizards and Path Tools, you can define the analysis in simple functional terms using menus and dialog boxes (unlike other programs for structural equation modeling, no complex "language" must be mastered).
SEPATH is a complete implementation that includes numerous advanced features: The program can analyze correlation, covariance, and moment matrices (structured means, models with intercepts); all models can be specified via the Path Wizard, Factor Analysis Wizard, and General Path tools; these facilities are highly efficient and allow users to specify even complex models in minutes by making choices from dialogs. The SEPATH module will compute, using constrained optimization techniques, the appropriate standard errors for standardized models, and for models fitted to correlation matrices. The results options include a comprehensive set of diagnostic statistics including the standard fit indices as well as noncentrality-based indices of fit, reflecting the most recent developments in the area of structural equation modeling. The user may fit models to multiple samples (groups), and can specify for each group fixed, free, or constrained (to be equal across groups) parameters. When analyzing moment matrices, these facilities allow you to test complex hypotheses for structured means in different groups. The SEPATH module documentation contains numerous detailed descriptions of examples from the literature, including examples of confirmatory factor analysis, path analysis, test theory models for congeneric tests, multi-trait-multi-method matrices, longitudinal factor analysis, compound symmetry, structured means, etc.
SEPATH Monte Carlo simulation
The STATISTICA Structural Equation Modeling (SEPATH) module includes powerful simulation options: the user can generate (and save) datasets for predefined models, based on normal or skewed distributions. Bootstrap estimates can be computed, as well as distributions for various diagnostic statistics, parameter estimates, etc. over the Monte Carlo trials. Numerous flexible graphing options are available to visualize the results (e.g., distributions of parameters) from Monte Carlo runs.
Analyzing Linear and Nonlinear systems
STATISTICA includes five powerful types of analyses for analyzing linear and nonlinear models: General Linear Models (GLM), General Regression Models (GRM), General Discriminant Analysis Models (GDA), Generalized Linear Models (GLZ), and General Partial Least Squares Models (PLS). Note that STATISTICA also includes implementations of Generalized Additive Models (GAM), Classification and Regression Trees (C&RT), and General CHAID (Chi-square Automatic Interaction Detection), available in STATISTICA Data Miner; these modules can also be used to fit nonlinear (ANOVA/ANCOVA-like) models to continuous or categorical dependent (criterion) variables.
All of these modules are extremely comprehensive and advanced implementations of the respective methods, and all of them share some general user interface solutions.
General Features Common to All Five Modules
Three alternative user-interfaces: (1) Quick-specs dialogs, (2) Wizard, and (3) Syntax. All modules offer three alternative user-interfaces for specifying research designs (e.g., ANOVA/ANCOVA designs, regression designs, response surface designs, mixture designs, etc.; see the description of GLM for details):
- Via Quick-specs dialogs, which prompt the user to specify the necessary variables, etc., given an initial choice of design (e.g., if you choose a response surface design, you are prompted to specify continuous predictors, and an optional blocking variable),
- Via unique, powerful Design Wizards, which lead the user step-by-step through the process of specifying a model, and
- Via a simple command syntax which offers a choice of either the traditional SAS® language or simpler to use and more flexible VGLM language (both options include "quick entry" dialogs with shortcut buttons and facilities to open syntax files saved in text format).
Automatically generating the syntax statements. One of the unique features of this user-interface is that in the background STATISTICA will automatically generate the complete set of syntax statements for any design specified via the Quick-specs dialogs (see point 1 above) or the Wizard (see point 2). These "active" logs of even the most complex and customized designs can be re-run, saved for future use, modified, included in STATISTICA Visual Basic scripts to be routinely run on new datasets, etc. Because the syntax for specifying general linear model designs is shared by all of these modules, it is also easy to move specifications from one type of analysis to another, for example, in order to fit the same model in GLM and GLZ.
Computation (training) sample, cross-validation (verification) sample, and prediction sample. All five modules will compute detailed residual statistics that can be saved for further analyses with other modules. Another unique feature of these programs is that the predicted and residual statistics can be computed separately for those observations from which the respective results were computed (i.e., the computation or training sample), for observations explicitly excluded from the model fitting computations (the cross-validation or verification sample), and for cases without observed data for the dependent (response) variables (prediction sample). Moreover, all graphical results options (e.g., probability plots, histograms, scatterplots of selected predicted or residual statistics) can be requested for these samples. Thus, all five programs offer exceptionally thorough diagnostic methods for evaluating the quality of the fit of the model.
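The distinction between the three samples can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a single predictor, an ordinary least-squares line as the model, and hypothetical field names (`x`, `y`, `holdout`) for the case records.

```python
def fit_line(points):
    """Ordinary least-squares intercept and slope for (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def score_samples(cases):
    """Split cases into the three samples and return (predicted, residual) pairs.

    training:          observed y, used for fitting
    cross-validation:  observed y, explicitly excluded from fitting
    prediction:        no observed y, so only predicted values exist
    """
    training = [c for c in cases if c["y"] is not None and not c["holdout"]]
    crossval = [c for c in cases if c["y"] is not None and c["holdout"]]
    prediction = [c for c in cases if c["y"] is None]
    intercept, slope = fit_line([(c["x"], c["y"]) for c in training])

    def predict(x):
        return intercept + slope * x

    return {
        "training": [(predict(c["x"]), c["y"] - predict(c["x"])) for c in training],
        "cross-validation": [(predict(c["x"]), c["y"] - predict(c["x"])) for c in crossval],
        "prediction": [(predict(c["x"]), None) for c in prediction],
    }

cases = [
    {"x": 1.0, "y": 3.0, "holdout": False},
    {"x": 2.0, "y": 5.0, "holdout": False},
    {"x": 3.0, "y": 7.0, "holdout": False},
    {"x": 4.0, "y": 9.0, "holdout": False},
    {"x": 5.0, "y": 12.0, "holdout": True},    # held out for verification
    {"x": 10.0, "y": None, "holdout": False},  # no observed response
]
scores = score_samples(cases)
print(scores)
```

Only the training cases influence the fitted line; the held-out case contributes a residual but no information to the estimates.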
Comparing analyses; modifying analyses. Like all analytic facilities of STATISTICA, multiple instances of all modules can be kept open at the same time, so multiple analyses can simultaneously be performed on the same or on different datasets. This is extremely useful for comparing the results from different analyses of the same data or the same analyses of different data. Modifying an analysis does not require complete respecification of the analysis; only desired changes need to be specified. Results from different modifications of an analysis can be easily compared. STATISTICA GLM, GRM, GDA, GLZ, and PLS can take what-if analyses to a new level, by allowing comparisons of different data and different analyses at the same time.
General Linear Models (GLM)
STATISTICA General Linear Models (GLM) analyzes responses on one or more continuous dependent variables as a function of one or more categorical or continuous independent variables. GLM is not only the most computationally advanced GLM tool currently on the market, but it is also the most comprehensive and complete application available, offering a larger selection of options, graphs, accompanying statistics and extended diagnostics than any other program. Designed with a "no compromise approach", GLM offers the most extensive selection of options to handle GLM's so-called "controversial problems" that do not have any widely agreed upon solutions. GLM will compute all the standard results, including ANOVA tables with univariate and multivariate tests, descriptive statistics, etc. GLM offers a large number of results and graphics options that are usually not available in other programs. GLM also offers simple ways to test linear combinations of parameter estimates; specifications of custom error terms and effects; and comprehensive post-hoc comparison methods for between group effects as well as repeated measures effects, and the interactions between repeated measures and between effects.
The following sections summarize some of the most important specific advantages of GLM over other programs, and the unique features and facilities offered in this module; however, it is important to start by stressing the fact that GLM is not only the most computationally advanced GLM tool available on the market but it is also the most comprehensive and complete application that offers a wider selection of options, more graphs, more accompanying statistics and extended diagnostics than any other program. It has been designed with a "no compromise approach" to address the most challenging problems in the area of GLM and also to offer the most comprehensive selections of user-selectable options to handle so-called "controversial problems" that do not have any widely agreed upon solutions.
Designs. The user can choose simple or highly customized one-way, main-effect, factorial, or nested ANOVA or MANOVA designs, repeated measures designs, simple, multiple and polynomial regression designs, response surface designs (with or without blocking), mixture surface designs, simple or complex analysis of covariance designs (e.g., with separate slopes), or general multivariate MANCOVA designs. Factors can be fixed or random (in which case synthesized error terms will be computed). All of these designs can be efficiently specified via any of the three types of user interfaces described above, and customized in various ways (e.g., you can drop effects, specify custom hypotheses, etc.). Also, GLM can handle extremely large analysis designs; for example, repeated measures factors with 1000 levels can be specified, models may include 1000 covariates, or you can analyze very efficiently literally huge between-group designs.
The overparameterized and sigma-restricted model. A detailed discussion is beyond the scope of this summary; most programs only offer the overparameterized model, and a few only the sigma restricted model; STATISTICA GLM is the only program available on the market that offers both. Note that each of the two models has its advantages and disadvantages; however, both approaches are necessary to offer a truly comprehensive GLM computational platform, capable of properly handling even the most advanced and demanding analytic problems. For example, nested designs and separate slope designs are best analyzed using the overparameterized model; the most common way to estimate variance components, and to compute synthesized error terms in mixed model ANOVA is based on the overparameterized model. Factorial designs with large numbers of factors are best analyzed using the sigma restricted model; in short, a simple 2-way interaction of two two-level factors requires only a single column in the design matrix using the sigma restricted parameterization, but 4 columns in the overparameterized model; as a result, analyzing, for example, an 8-way full factorial design with GLM only requires a few seconds.
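The column counts quoted above follow from a simple rule: under the sigma-restricted parameterization an effect needs the product of (levels - 1) columns, while under the overparameterized model it needs the product of the level counts. The Python sketch below (function names are hypothetical) applies that rule to a single interaction and to a full factorial design, including one column for the intercept.

```python
from itertools import combinations
from math import prod

def effect_columns(levels, parameterization):
    """Design-matrix columns needed for one effect (main effect or interaction)
    among factors with the given numbers of levels."""
    if parameterization == "sigma-restricted":
        return prod(k - 1 for k in levels)
    if parameterization == "overparameterized":
        return prod(levels)
    raise ValueError(f"unknown parameterization: {parameterization}")

def full_factorial_columns(levels, parameterization):
    """Total columns (intercept plus every main effect and interaction)."""
    total = 1  # intercept
    for order in range(1, len(levels) + 1):
        for subset in combinations(levels, order):
            total += effect_columns(subset, parameterization)
    return total

# The 2-way interaction of two two-level factors:
print(effect_columns([2, 2], "sigma-restricted"))      # 1 column
print(effect_columns([2, 2], "overparameterized"))     # 4 columns

# An 8-way full factorial design with two-level factors:
eight = [2] * 8
print(full_factorial_columns(eight, "sigma-restricted"))    # 256 columns
print(full_factorial_columns(eight, "overparameterized"))   # 6561 columns
```

The gap widens quickly with the number of factors, which is the reason large factorial designs are cheaper to analyze in sigma-restricted form.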
Handling missing cell designs. STATISTICA GLM will compute the customary Type I through IV sums of squares for unbalanced and incomplete designs; however, as is widely acknowledged (e.g., Searle, 1987; Milliken & Johnson, 1986), applying these methods to "messy" designs with missing cells in more or less random locations in the design can lead to misleading, and even blatantly nonsensical results. STATISTICA GLM therefore also offers two additional methods for analyzing missing cell designs: Hocking's (1985) "effective hypothesis decomposition," and a method that will automatically drop effects that cannot be fully estimated (e.g., when the least squares means do not exist for all levels of the respective main effect or interaction effect). The latter method is the one commonly applied to the analysis of highly fractionalized designs in industrial experimentation (see also STATISTICA DOE). This method leads to results that are unique (not dependent on the ordering of factor levels), easily interpretable, and consistent with the industrial experimentation literature. This highly useful feature is unique to GLM.
Results statistics. GLM will compute all the standard results, including ANOVA tables with univariate and multivariate tests, descriptive statistics, etc. GLM also offers a large number of results options and in particular graphics options that are usually not available in other programs. For example, GLM includes a comprehensive selection of types of plots of means (observed, least squares, weighted) for higher-order interactions, with error bars (standard errors) for effects involving between-group factors as well as repeated measures factors; extensive residual analyses and plots (for the "training" or computation sample, for a cross-validation or "verification" sample, or for a prediction sample without observed values for the dependent or response variables), plots of variance components; desirability profiler and response optimization for any model; and adjusted means for traditional analysis of covariance designs. Extensive and flexible options for specifying planned comparisons are provided including facilities to specify contrasts using either the traditional command syntax or an extremely simple to use (Wizard-style) sequence of "intelligent" contrast dialogs (you can enter contrast coefficients for clearly labeled levels of factors or cells in the design; the program will then evaluate the comparison for the least squares ("predicted") means, i.e., for the means as predicted by and consistent with the current model; this is a unique solution to the problem of planned comparisons in complex and incomplete designs); simple ways to test linear combinations of parameter estimates (e.g., to test for the equality of specific regression coefficients); specifications of custom error terms and effects; comprehensive post-hoc comparison methods for between group effects as well as repeated measures effects, and the interactions between repeated measures and between effects including: Fisher LSD, Bonferroni, Scheffé, Tukey HSD, Unequal N HSD, Newman Keuls, Duncan, and Dunnett's test (with flexible options for estimating the appropriate error terms for those tests), tests of assumptions (e.g., Levene's test, plots of means vs. standard deviations, etc.).
General Regression Models (GRM)
STATISTICA General Regression Models (GRM) provides the user with a unique, highly flexible implementation of the standard and unique results options in the general linear models, as well as including a comprehensive set of stepwise regression and best-subset model building techniques supporting both continuous and categorical variables. Stepwise and best subset methods to build models for highly complex designs can be used in GRM, including designs with effects for categorical predictor variables. Thus, the "general" in General Regression Models refers both to the use of the general linear models, and to the fact that unlike most other stepwise regression programs, GRM is not limited to the analysis of designs that contain only continuous predictor variables. In addition, unique regression-specific results options include Pareto charts of parameter estimates, whole model summaries (tests) with various methods for evaluating no-intercept models, partial and semi-partial correlations, etc.
Stepwise and best-subset selection for continuous and categorical predictors (ANOVA models) for models with multiple dependent variables. GRM is a "sister program" to the STATISTICA General Linear Models (GLM) module. In addition to the large number of unique analytic options available in GLM (including planned comparisons, custom hypotheses, a wide selection of post-hoc tests, residual analyses options, etc.), the General Regression Models (GRM) module allows you to build models via stepwise and best subset methods. GRM makes these techniques available not only for traditional analytic problems with a single dependent variable, but extends them to analyses of problems with multiple dependent variables; thus, in a sense, GRM can be considered a unique stepwise and best-subset canonical analysis program. These methods can be used with designs that include continuous and/or categorical predictor variables (i.e., ANOVA or ANCOVA designs), and the techniques used in GRM will ensure that multiple degree of freedom effects will be considered (moved in or out of the model) in blocks. Specifically, GRM allows you to build models via forward- or backward-only selection (effects can only be entered or removed once during the selection process), standard forward or backward selection (effects can be moved in or out of the model at each step, according to F or p to enter or remove criteria), or via best subset selection; this latter method gives the user flexible options to control the models considered during the subset search (e.g., maximum and minimum subset sizes, and Mallows' Cp, R-square, or adjusted R-square as the criterion for best subset selection, etc.).
Results. The General Regression Models (GRM) module offers all standard and unique results options described in the context of the GLM module in the previous section (including desirability profiling, predicted and residual statistics for the computation or training sample, cross-validation or verification sample, and prediction sample; tests of assumptions, means plots, etc.). In addition, unique regression-specific results options are also available, including Pareto charts of parameter estimates, whole model summaries (tests) with various methods for evaluating no-intercept models, partial and semi-partial correlations, etc.
Generalized Linear Models (GLZ)
The Generalized Linear Models (GLZ) module allows the user to search for both linear and nonlinear relationships between a response variable and categorical or continuous predictor variables. Special applications of generalized linear models include a number of widely used types of analyses, such as binomial and multinomial logit and probit regression, Signal Detection Theory (SDT) models, Tweedie models, and many others.
The Tweedie distribution is actually a family of distributions belonging to the class of exponential dispersion models such that the variance is of the form $\mathrm{Var}(Y) = \phi\mu^{p}$, where $\phi > 0$ is the dispersion/scale parameter and $\mu$ is the mean. The power $p$ must be in the set $(-\infty, 0] \cup [1, \infty)$.
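The power-variance relationship described above can be written as a small Python function. This is a sketch only; the function name and the table of named special cases (normal for power 0, Poisson for 1, gamma for 2, inverse Gaussian for 3) are additions for illustration and are not taken from the GLZ documentation.

```python
# Well-known members of the Tweedie family, indexed by the variance power p.
NAMED_CASES = {0: "normal", 1: "Poisson", 2: "gamma", 3: "inverse Gaussian"}

def tweedie_variance(mu, phi, p):
    """Variance phi * mu**p of a Tweedie distribution with mean mu.

    Raises ValueError for powers strictly between 0 and 1, where no
    Tweedie distribution exists, and for a non-positive dispersion.
    """
    if phi <= 0:
        raise ValueError("dispersion phi must be positive")
    if 0 < p < 1:
        raise ValueError("p must lie in (-inf, 0] or [1, inf)")
    if p != 0 and mu <= 0:
        raise ValueError("mean must be positive unless p == 0")
    return phi * mu ** p

for power in (0, 1, 1.5, 2, 3):
    label = NAMED_CASES.get(power, "compound Poisson-gamma" if 1 < power < 2 else "other")
    print(power, label, tweedie_variance(4.0, 0.5, power))
```

For powers between 1 and 2 the distribution has a point mass at zero plus a continuous positive part, which is why it is popular for data such as insurance claim totals.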
Note that STATISTICA Data Miner also includes an implementation of Generalized Additive Models (GAM). The GLZ module will compute all standard results statistics, including likelihood ratio tests, and Wald and score tests for significant effects, parameter estimates and their standard errors and confidence intervals, etc. The user interfaces, methods for specifying designs, and look-and-feel of the program are similar to those of GLM, GRM, and PLS. The user is able to easily specify ANOVA or ANCOVA-like designs, response surface designs, mixture surface designs, etc.; thus, even novice users will have no difficulty applying generalized linear models to analyze their data. In addition, GLZ includes a comprehensive selection of model checking tools such as spreadsheets and graphs for various residuals and outlier detection statistics, including raw residuals, Pearson residuals, deviance residuals, studentized Pearson residuals, studentized deviance residuals, likelihood residuals, differential Chi-square statistics, differential deviance, and generalized Cook distances, etc.
Models and link functions. A wide range of distributions (from the exponential family) can be specified for the response variable: Normal, Poisson, gamma, binomial, multinomial, ordinal multinomial, and inverse Gaussian. Further, the nature of the relationship between the predictors and the responses can be specified by choosing a so-called link function from a comprehensive list of (common and special-purpose) functions. Available link functions include: log, power, identity, logit, probit, complementary log-log, and log-log links. Unlike other nonlinear models, these models can be fitted via fast estimation procedures, and allow meaningful interpretations (similar to general linear models), and hence, they are extensively employed in the analysis of non-linear relationships in science as well as applied research.
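A link function maps the mean of the response onto the scale of the linear predictor, and its inverse maps back. The Python sketch below writes out the links named above as (link, inverse) pairs. The dictionary layout and names are hypothetical, and the sign convention chosen for the log-log link is an assumption, since definitions vary between sources.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()

# Each entry is (link g(mu), inverse link g^{-1}(eta)).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
    "probit": (_STD_NORMAL.inv_cdf, _STD_NORMAL.cdf),
    "cloglog": (lambda mu: math.log(-math.log(1.0 - mu)),
                lambda eta: 1.0 - math.exp(-math.exp(eta))),
    # Assumed convention: g(mu) = -log(-log(mu)).
    "loglog": (lambda mu: -math.log(-math.log(mu)),
               lambda eta: math.exp(-math.exp(-eta))),
}

def power_link(exponent):
    """Power link g(mu) = mu**exponent for positive mu and a nonzero exponent."""
    return (lambda mu: mu ** exponent, lambda eta: eta ** (1.0 / exponent))

LINKS["power(0.5)"] = power_link(0.5)

mu = 0.3
for name, (link, inverse) in LINKS.items():
    eta = link(mu)
    print(f"{name:12s} eta={eta: .4f}  back-transformed mean={inverse(eta):.4f}")
```

Fitting the model happens on the linear-predictor scale, while predicted means are reported after applying the inverse link.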
Stepwise and best-subset selection for continuous and categorical predictors (ANOVA-like models). In addition to the standard model fitting techniques, STATISTICA GLZ also provides unique options for exploratory analyses, including model building facilities like forward- or backward-only selection of effects (effects can only be selected for inclusion or removal once during the selection process), standard forward or backward stepwise selection of effects (effects can be entered or removed at each step, using a p to enter or remove criterion), and best subset regression methods (using the likelihood score statistic, model likelihood, or Akaike information criterion). These powerful methods can be applied to categorical predictors (ANOVA-like designs; effects will be moved in or out of the model as multiple-parameter blocks) as well as continuous predictors, and will save significant amounts of time when building appropriate models for complex data.
Results. The Generalized Linear Model module will compute all standard results statistics, including likelihood ratio tests, and Wald and score tests for significant effects, parameter estimates and their standard errors and confidence intervals, etc. In addition, for ANOVA-like designs, tables and plots of predicted means (the equivalent of least squares means computed in the general linear model) with their standard errors can be computed, to aid in the interpretation of results. GLZ also includes a comprehensive selection of model checking tools such as Spreadsheets and graphs for various residuals and outlier detection statistics, including raw residuals, Pearson residuals, deviance residuals, studentized Pearson residuals, studentized deviance residuals, likelihood residuals, differential Chi-square statistics, differential deviance, and generalized Cook distances, etc. As described earlier, predicted and residual statistics can be requested for observations that were used for fitting the model, and those that were not (i.e., for the cross-validation sample).
General Partial Least Squares Models (PLS)
Partial Least Squares (PLS) includes a comprehensive selection of algorithms for univariate and multivariate partial least squares problems. PLS will compute all the standard results for a partial least squares analysis; in addition, it offers a large number of results options and in particular graphics options that are usually not available in other implementations; for example, graphs of parameter values as a function of the number of components, two-dimensional plots for all output statistics (parameters, factor loadings, etc.), two-dimensional plots for all residuals statistics, etc. Because PLS offers an identical selection of flexible user interfaces to that of GLM, GRM and GLZ, it is very easy to set up models in one module and quickly analyze the data using the same model in PLS. This unique flexibility allows even novice users to apply these powerful techniques to their analysis problems. The partial least squares method is a powerful data mining technique, particularly well suited for determining a smaller number of dimensions in a large number of predictors and response variables. These methods for analyzing linear systems have become popular only in the last few years; thus, many of the algorithms and statistics are still the subject of ongoing research.
The overparameterized and sigma-restricted model for categorical predictors. Like GLM and GLZ, PLS offers both the overparameterized and sigma restricted parameterization methods for categorical predictors (ANOVA-like models). In partial least squares models, the sigma restricted solution can be particularly useful, because it may produce less complex results (explain more variability with fewer components, made up of design vectors coded in sigma-restricted form).
Algorithms. STATISTICA PLS implements the two most general algorithms for partial least squares analysis: SIMPLS and NIPALS.
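For readers unfamiliar with NIPALS, the following is a minimal pure-Python sketch of the algorithm for a single response variable (often called PLS1). It is not the STATISTICA implementation: the function name `nipals_pls1`, the list-of-rows data layout, and the choice to center but not scale the variables are assumptions made for the example.

```python
import math

def nipals_pls1(rows, y, n_components):
    """Fit PLS1 by NIPALS and return a function that predicts y for a new row."""
    n, m = len(rows), len(rows[0])
    x_mean = [sum(r[j] for r in rows) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[r[j] - x_mean[j] for j in range(m)] for r in rows]  # centered predictors
    f = [v - y_mean for v in y]                                # centered response
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with the residual y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]   # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]  # X loadings
        q = sum(f[i] * t[i] for i in range(n)) / tt                     # y loading
        # Deflate: remove what this component explains from X and y.
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))

    def predict(row):
        e = [row[j] - x_mean[j] for j in range(m)]
        y_hat = y_mean
        for w, p, q in components:
            score = sum(e[j] * w[j] for j in range(m))
            y_hat += q * score
            e = [e[j] - score * p[j] for j in range(m)]
        return y_hat

    return predict

# Noise-free illustration: y = 2*x1 - x2 + 3.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [2 * a - b + 3 for a, b in X]
predict = nipals_pls1(X, y, n_components=2)
print(predict([6.0, 4.0]))
```

With as many components as there are linearly independent predictors, the PLS fit coincides with ordinary least squares; the practical value of the method comes from stopping earlier.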
Results. PLS will compute all the standard results for a partial least squares analysis, and also offers a large number of results options and in particular graphics options that are usually not available in other implementations; for example, graphs of parameter values as a function of the number of components, two-dimensional plots for all output statistics (parameters, factor loadings, etc.), two-dimensional plots for all residuals statistics, etc. Also, like GLM, GRM, and GLZ, the Partial Least Squares module offers extensive residual analysis options, and predicted and residual statistics can be requested for observations that were used for fitting the model (the "training" sample), those that were not (i.e., the cross-validation or verification sample), and for cases without observed data on the dependent (response) variables (the prediction sample).
Power Analysis and Interval Estimation Modules
Some of the advantages of STATISTICA Power Analysis and Interval Estimation are:
- Precise and fast computational routines, which maintain their accuracy across a broad range of parameters
- Presentation-quality, automatically-scaled graphs of power vs. sample size, power vs. effect size, and power vs. alpha
- Protocol statements describing calculations in a form that can be transferred directly to a text document
Power Calculation allows you to calculate statistical power for a given analysis type (see List of Tests below), and to produce graphs of power as a function of various quantities that affect power in practice, such as effect size, type I error rate, and sample size.
Sample Size Calculation
Sample Size Calculation allows you to calculate, for a given analysis type (see List of Tests below), the sample size required to attain a given level of power, and to generate plots of required sample size as a function of required power, type I error rate, and effect size.
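The relationship between power, sample size, effect size, and Type I error rate can be illustrated with the simplest case, a two-sided one-sample Z-test with known standard deviation. The Python sketch below is an illustration under that assumption only; the function names are hypothetical, and the t-tests and other procedures in the List of Tests require noncentral distributions rather than the normal distribution used here.

```python
import math
from statistics import NormalDist

_Z = NormalDist()

def z_test_power(effect_size, n, alpha=0.05):
    """Power of a two-sided one-sample Z-test for a standardized effect size."""
    critical = _Z.inv_cdf(1.0 - alpha / 2.0)
    shift = effect_size * math.sqrt(n)
    # Probability of landing in either rejection region under the alternative.
    return _Z.cdf(shift - critical) + _Z.cdf(-shift - critical)

def required_sample_size(effect_size, power=0.80, alpha=0.05):
    """Smallest n whose power reaches the target (simple upward search)."""
    if effect_size == 0:
        raise ValueError("effect size must be nonzero")
    n = 1
    while z_test_power(effect_size, n, alpha) < power:
        n += 1
    return n

# Power as a function of sample size for a medium standardized effect of 0.5.
for n in (10, 20, 30, 40, 50):
    print(n, round(z_test_power(0.5, n), 3))

needed = required_sample_size(0.5, power=0.80, alpha=0.05)
print("required n:", needed)
```

The same two functions, evaluated over a grid of effect sizes or alpha levels instead of sample sizes, give the other curves mentioned above.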
Interval Estimation allows you to calculate, for a given analysis type (see List of Tests below), specialized confidence intervals not generally available in general-purpose statistical packages. These confidence intervals are distinguished in some cases by the fact that they refer to standardized effects, and in others by the fact that they are exact confidence intervals in situations where only approximate techniques have generally been available.
STATISTICA Power Analysis and Interval Estimation is unique among programs of its type in that it calculates confidence intervals for a number of important statistical quantities such as standardized effect size (in t-tests and ANOVA), the correlation coefficient, the squared multiple correlation, the sample proportion, and the difference between proportions (either independent or dependent samples).
These capabilities, in turn, may be used to construct confidence intervals on quantities such as power and sample size, allowing the user to utilize the data from one study to construct an exact confidence interval on the sample size required for another study.
Probability Distributions allows you to perform a variety of calculations on probability distributions that are of special value in performing power and sample size calculations.
The routines are distinguished by their high level of accuracy, and the wide range of parameter values for which they will perform calculations. The noncentral distributions are also distinguished by the ability to calculate a noncentrality parameter that places a given observation at a given percentage point in the noncentral distribution. The ability to perform this calculation is essential to the technique of "noncentrality interval estimation."
These routines, which include the noncentral t, noncentral F, noncentral chi-square, binomial, exact distribution of the correlation coefficient, and the exact distribution of the squared multiple correlation coefficient, are characterized by their ability to solve for an unknown parameter, and for their ability to handle "non-null" cases.
For example, not only can the distribution routine for the Pearson correlation calculate p as a function of r and N for rho=0, it can also perform the calculation for other values of rho. Moreover, it can solve for the exact value of rho that places an observed r at a particular percentage point, for any given N.
List of Tests
STATISTICA Power Analysis and Interval Estimation calculates power as a function of sample size, effect size, and Type I error rate for the tests listed below:
- 1-sample t-test
- 2-sample independent sample t-test
- 2-sample dependent sample t-test
- Planned contrasts
- 1-way ANOVA (fixed and random effects)
- 2-way ANOVA
- Chi-square test on a single variance
- F-test on 2 variances
- Z-test (or chi-square test) on a single proportion
- Z-test on 2 independent proportions
- McNemar's test on 2 dependent proportions
- F-test of significance in multiple regression
- t-test for significance of a single correlation
- Z-test for comparing 2 independent correlations
- Log-rank test in survival analysis
- Test of equal exponential survival, with accrual period
- Test of equal exponential survival, with accrual period and dropouts
- Chi-square test of significance in structural equation modeling
- Tests of "close fit" in structural equation modeling confirmatory factor analysis
Suppose you are planning a 1-Way ANOVA to study the effect of a drug.
Prior to planning the study, you find that there has been a similar study previously. This particular study had 4 groups, with N = 50 subjects per group, and obtained an F-statistic of 15.4.
From this information, as a first step you can (a) gauge the population effect size with an exact confidence interval, and (b) use this information to set a lower bound to appropriate sample size in your study.
Simply enter the data into a convenient dialog, and results are immediately available.
In this case, we discover that a 90% exact confidence interval on the root-mean-square standardized effect (RmsSE) ranges from about .398 to .686. With effects this strong, it is not surprising that the 90% post hoc confidence interval for power ranges from .989 to almost 1. We can use this information to construct a confidence interval on the actual N needed to achieve a power goal (in this case, .90). This confidence interval ranges from 12 to 31. So, based on the information in the study, we are 90% confident that a sample size no greater than 31 would have been adequate to produce a power of .90.
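The effect-size interval in this example can be approximated with a short Python sketch of noncentrality interval estimation. The sketch makes several assumptions that are not stated in the text: the noncentral F distribution is evaluated as a Poisson-weighted sum of incomplete beta functions, the limits are found by bisection, and the RmsSE is taken to be the square root of the noncentrality parameter divided by (groups - 1) times N per group. All function names are hypothetical.

```python
import math

def _betacf(a, b, x, max_iter=500, eps=3e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h

def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b

def ncf_cdf(f, df1, df2, lam):
    """CDF of the noncentral F distribution as a Poisson mixture."""
    x = df1 * f / (df1 * f + df2)
    half = lam / 2.0
    if half == 0.0:
        return reg_inc_beta(df1 / 2.0, df2 / 2.0, x)
    total = 0.0
    for j in range(int(half + 12.0 * math.sqrt(half) + 40.0) + 1):
        log_w = -half + j * math.log(half) - math.lgamma(j + 1)
        if log_w < -40.0:          # negligible Poisson weight
            continue
        total += math.exp(log_w) * reg_inc_beta(df1 / 2.0 + j, df2 / 2.0, x)
    return total

def noncentrality_at(f, df1, df2, target):
    """Noncentrality that places the observed F at cumulative probability `target`."""
    if ncf_cdf(f, df1, df2, 0.0) <= target:
        return 0.0
    lo, hi = 0.0, max(1.0, df1 * f)
    while ncf_cdf(f, df1, df2, hi) > target:
        hi *= 2.0
    for _ in range(50):            # the CDF decreases as the noncentrality grows
        mid = (lo + hi) / 2.0
        if ncf_cdf(f, df1, df2, mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rmsse_interval(f, groups, n_per_group, confidence=0.90):
    df1, df2 = groups - 1, groups * (n_per_group - 1)
    tail = (1.0 - confidence) / 2.0
    lam_lo = noncentrality_at(f, df1, df2, 1.0 - tail)
    lam_hi = noncentrality_at(f, df1, df2, tail)
    scale = (groups - 1) * n_per_group
    return math.sqrt(lam_lo / scale), math.sqrt(lam_hi / scale)

low, high = rmsse_interval(15.4, groups=4, n_per_group=50, confidence=0.90)
print(f"90% interval for RmsSE: {low:.3f} to {high:.3f}")
```

If the assumptions above match the definitions used by the program, the printed limits should land close to the interval quoted in the example.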
Turning to our own study, suppose we examine the relationship between power and effect size for a sample size of 31. The first graph shows quite clearly that as long as the effect size for our drug is in the range of the confidence interval for the previous study, our power will be quite high. However, should the actual effect size for our drug be on the order of .25, power will be inadequate.
If, on the other hand, we use a sample size comparable to the previous study (i.e., 50 per group) we discover that power will remain quite reasonable, even for effects on the order of .28.
With STATISTICA Power Analysis and Interval Estimation, this entire analysis runs in just a minute or two.