DESCRIPTIVE STATISTICS, BREAKDOWNS, AND EXPLORATORY DATA ANALYSIS
Descriptive Statistics and Graphs
The program will compute practically all common, general-purpose descriptive statistics including medians, modes, quartiles, user-specified percentiles, average and standard deviations, quartile ranges, confidence limits for the mean, skewness and kurtosis (with their respective standard errors), harmonic means, geometric means, as well as many specialized descriptive statistics and diagnostics, either for all cases or broken down by one or more categorical (grouping) variables. As with all modules of STATISTICA, a wide variety of graphs will aid exploratory analyses, e.g., various types of box-and-whisker plots, histograms, bivariate distribution (3D or categorized) histograms, 2D and 3D scatterplots with marked subsets, normal, half-normal, detrended probability plots, Q-Q plots, P-P plots, etc. A selection of tests is available for fitting the normal distribution to the data (via the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilks' tests; facilities for fitting a wide variety of other distributions are also available; see also STATISTICA Process Analysis; and the section on fitting in the Graphics section).
By-Group Analyses (Breakdowns)
Practically all descriptive statistics as well as summary graphs can be computed for data that are categorized (broken down) by one or more grouping variables. For example, with just a few mouse clicks the user can break down the data by Gender and Age and review categorized histograms, box-and-whisker plots, normal probability plots, scatterplots, etc. If more than two categorical variables are chosen, cascades of the respective graphs can be automatically produced. Options to categorize by continuous variables are provided, e.g., you can request that a variable be split into a requested number of intervals, or use the on-line recode facility to custom-define the way in which the variable will be recoded (categorization options of practically unlimited complexity can be specified at any point and they can reference relations involving all variables in the dataset). In addition, a specialized hierarchical breakdown procedure is provided that allows the user to categorize the data by up to six categorical variables, and compute a variety of categorized graphs, descriptive statistics, and correlation matrices for subgroups (the user can interactively request to ignore some factors in the complete breakdown table, and examine statistics for any marginal tables). Numerous formatting and labeling options allow the user to produce publication-quality tables and reports with long labels and descriptions of variables. Note that extremely large analysis designs can be specified in the breakdown procedure (e.g., 100,000 groups for a single categorization variable), and results include all relevant ANOVA statistics (including the complete ANOVA table, tests of assumptions such as the Levene and Brown-Forsythe tests for homogeneity of variance, a selection of seven post-hoc tests, etc.). As in all other modules of STATISTICA, extended precision calculations (the "quadruple" precision, where applicable) are used to provide an unmatched level of accuracy (see the section on Precision). Because of the interactive nature of the program, exploration of data is very easy. For example, exploratory graphs can be produced directly from all results Spreadsheets by pointing with the mouse to specific cells or ranges of cells. Cascades of even complex (e.g., multiple categorized) graphs can be produced with a single-click of the mouse and reviewed in a slide-show manner. In addition to numerous predefined statistical graphs, countless graphical visualizations of raw data, summary statistics, relations between statistics, as well as all breakdowns and categorizations can be custom-defined by the user via straightforward point-and-click facilities designed to reduce the necessary number of mouse clicks. All exploratory graphical techniques (described in the section on Graphics) are integrated with statistics to facilitate graphical data analyses (e.g., via interactive outlier removal, subset selections, smoothing, function fitting, extensive brushing options allowing the user to easily identify and/or extract the selected data, etc.). See also the section on Block Statistics, below.
A comprehensive set of options allows for the exploration of correlations and partial correlations between variables. First, practically all common measures of association can be computed, including Pearson r, Spearman rank order R, Kendall tau (b, c), Gamma, tetrachoric r, Phi, Cramer V, contingency coefficient C, Sommer's D, uncertainty coefficients, part and partial correlations, autocorrelations, various distance measures, etc. (nonlinear regressions, regressions for censored data and other specialized measures of correlations are available in Nonlinear Estimation, Survival Analysis, and other modules offered in STATISTICA Advanced Linear/Non-Linear Models). Correlation matrices can be computed using casewise (listwise) or pairwise deletion of missing data, or mean substitution. As in all other modules of STATISTICA, extended precision calculations (the "quadruple" precision, where applicable) are used to yield an unmatched level of accuracy (see the section on Precision). Like all other results in STATISTICA, correlation matrices are displayed in Spreadsheets offering various formatting options (see below) and extensive facilities to visualize numerical results; the user can "point to" a particular correlation in the Spreadsheet and choose to display a variety of "graphical summaries" of the coefficient (e.g., scatterplots with confidence intervals, various 3D bivariate distribution histograms, probability plots, etc.).
Brushing and outlier detection
The extensive brushing facilities in the scatterplots allow the user to select/deselect individual points in the plot and assess their effect on the regression line (or other fitted function lines).
Display formats of numbers
A variety of global display formats for correlations are supported; significant correlation coefficients can be automatically highlighted, each cell of the Spreadsheet can be expanded to display n and p, or detailed results may be requested that include all descriptive statistics (pairwise means and standard deviations, B weights, intercepts, etc.). Like all other numerical results, correlation matrices are displayed in Spreadsheets offering the zoom option and interactively-controlled display formats (e.g., from +.4 to +.4131089276410193); thus, large matrices can be compressed (via either the zoom or format-width control adjustable by dragging) to facilitate the visual search for coefficients which exceed a user-specified magnitude or significance level (e.g., the respective cells can be marked red in the Spreadsheet).
Scatterplot, scatterplot matrices, by-group analyses
As in all output selection dialogs, numerous global graphics options are available to further study patterns of relationships between variables, e.g., 2D and 3D scatterplots (with or without case labels) designed to identify patterns of relations across subsets of cases or series of variables. Correlation matrices can be computed as categorized by grouping variables and visualized via categorized scatterplots. Also "breakdowns of correlation matrices" can be generated (one matrix per subset of data), displayed in queues of Spreadsheets, and saved as stacked correlation matrices (which can later be used as input into the Structural Equations Modeling and Path Analysis [SEPATH] module offered in STATISTICA Advanced Linear/Non-Linear Models). An entire correlation matrix can be summarized in a single graph via the Matrix scatterplot option (of practically unlimited density); large scatterplot matrices can then be reviewed interactively by "zooming in" on selected portions of the graph (or scrolling large graphs in the zoom mode) [see the illustration]. Also, categorized scatterplot matrix plots can be generated (one matrix plot for each subset of data). Alternatively, a multiple-subset scatterplot matrix plot can be created where specific subsets of data (e.g., defined by levels of a grouping variable or selection conditions of any complexity) are marked with distinctive point markers. Various other graphical methods can be used to visualize matrices of correlations in search of global patterns (e.g., contour plots, non-smoothed surfaces, icons, etc.). All of these operations require only a few mouse clicks and various shortcuts are provided to simplify selections of analyses; any number of Spreadsheets and graphs can be displayed simultaneously on the screen, making interactive exploratory analyses and comparisons very easy.
BASIC STATISTICS FROM RESULTS SPREADSHEETS (TABLES)
STATISTICA is a single integrated analysis system that presents all numerical results in spreadsheet tables that are suitable (without any further modification) for input into subsequent analyses. Thus, basic statistics (or any other statistical analysis) can be computed for results tables from previous analyses; for example, you could very quickly compute a table of means for 2000 variables, and next use this table as an input data file to further analyze the distribution of those means across the variables. Thus, basic statistics are available at any time during your analyses, and can be applied to any results spreadsheet.
In addition to the detailed descriptive statistics that can be computed for every spreadsheet, you can also highlight blocks of numbers in any spreadsheet, and produce basic descriptive statistics or graphs for the respective subset of numbers only. For example, suppose you computed a results spreadsheet with measures of central tendency for 2000 variables (e.g., with Means, Modes, and Medians, Geometric Means, and Harmonic Means); you could highlight a block of, for example, 200 variables and the Means and Medians, and then in a single operation produce a multiple line graph of those two measures across the subset of 200 variables. Statistical analysis by blocks can be performed by row or by column; for example, you could also compute a multiple line graph for a subset of variables across the different measures of central tendency. To summarize, the block statistics facilities allow you to produce statistics and statistical graphs from values in arbitrarily selected (highlighted) blocks of values in the current data spreadsheet or output Spreadsheet.
INTERACTIVE PROBABILITY CALCULATOR
A flexible, interactive Probability Calculator is accessible from all toolbars. It features a wide selection of distributions (including Beta, Cauchy, Chi-square, Exponential, Extreme value, F, Gamma, Laplace, Lognormal, Logistic, Pareto, Rayleigh, t (Student), Weibull, and Z (Normal)); interactively (in-place) updated graphs built into the dialog (a plot of the density and distribution functions) allow the user to visually explore distributions taking advantage of the flexible STATISTICA Smart MicroScrolls which allow the user to advance either the last significant digit (press the LEFT-mouse-button) or next to the last significant digit (press the RIGHT-mouse-button). Facilities are provided for generating customizable, compound graphs of distributions with requested cutoff areas. Thus, this calculator allows you to interactively explore the distributions (e.g., the respective probabilities depending on shape parameters).
t-TESTS AND OTHER TESTS OF GROUP DIFFERENCES
T-tests for dependent and independent samples, as well as single samples (testing means against user-specified constants) can be computed, multivariate Hotelling's T 2 tests are also available (see also ANOVA/MANOVA, and GLM (General Linear Models) offered in STATISTICA Advanced Linear/Non-Linear Models. Flexible options are provided to allow comparisons between variables (e.g., treating the data in each column of the input spreadsheet as a separate sample) and coded groups (e.g., if the data includes a categorical variable such as Gender to identify group membership for each case). As with all procedures, extensive diagnostics and graphics options are available from the results menus. For example, for the t-test for independent samples, options are provided to compute t-tests with separate variance estimates, Levene and Brown-Forsythe tests for homogeneity of variance, various box-and-whisker plots, categorized histograms and probability plots, categorized scatterplots, etc. Other (more specialized) tests of group differences are part of many modules (e.g., Nonparametrics (below), Survival Analysis (available in STATISTICA Advanced Linear/Non-Linear Models), Reliability/Item Analysis (available in STATISTICA Multivariate Exploratory Techniques).
FREQUENCY TABLES, CROSSTABULATION TABLES, STUB-AND-BANNER TABLES, MULTIPLE RESPONSE ANALYSIS, AND TABLES
Extensive facilities are provided to tabulate continuous, categorical, and multiple response variables, or multiple dichotomies. A wide variety of options are offered to control the layout and format of the tables. For example, for tables involving multiple response variables or multiple dichotomies, marginal counts and percentages can be based on the total number of respondents or responses, multiple response variables can be processed in pairs, and various options are available for counting (or ignoring) missing data. Frequency tables can also be computed based on user-defined logical selection conditions (of any complexity, referencing any relationships between variables in the dataset) that assign cases to categories in the table. All tables can be extensively customized to produce final (publication-quality) reports. For example, unique "multi-way summary" tables can be produced with breakdown-style, hierarchical arrangements of factors, crosstabulation tables may report row, column, and total percentages in each cell, long value labels can be used to describe the categories in the table, frequencies greater than a user-defined cutoff can be highlighted in the table, etc. The program can display cumulative and relative frequencies, Logit- and Probit-transformed frequencies, normal expected frequencies (and the Kolmogorov-Smirnov, Lilliefors, and Shapiro-Wilks' tests), expected and residual frequencies in crosstabulations, etc. Available statistical tests for crosstabulation tables include the Pearson, Maximum-Likelihood and Yates-corrected Chi-squares; McNemar's Chi-square, the Fisher exact test (one- and two-tailed), Phi, and the tetrachoric r; additional available statistics include Kendall's tau (a, b), Gamma, Spearman r, Sommer's D, uncertainty coefficients, etc.
Graphical options include simple, categorized (multiple), and 3D histograms, cross-section histograms (for any "slices" of the one-, two-, or multi-way tables), and many other graphs including a unique "interaction plot of frequencies" that summarizes the frequencies for complex crosstabulation tables (similar to plots of means in ANOVA). Cascades of even complex (e.g., multiple categorized, or interaction) graphs can be interactively reviewed. See also the section on Block Statistics, above, and sections on Log-linear Analysis (available in STATISTICA Advanced Linear/Non-Linear Models) and Correspondence Analysis (available in STATISTICA Multivariate Exploratory Techniques).
MULTIPLE REGRESSION METHODS
The Multiple Regression module is a comprehensive implementation of linear regression techniques, including simple, multiple, stepwise (forward, backward, or in blocks), hierarchical, nonlinear (including polynomial, exponential, log, etc.), Ridge regression, with or without intercept (regression through the origin), and weighted least squares models; additional advanced methods are provided in the General Regression Models (GRM) module (e.g., best subset regression, multivariate stepwise regression for multiple dependent variables, for models that may include categorical factor effects; statistical summaries for validation and prediction samples, custom hypotheses, etc.). The Multiple Regression module will calculate a comprehensive set of statistics and extended diagnostics including the complete regression table (with standard errors for B, Beta and intercept, R-square and adjusted R-square for intercept and non-intercept models, and ANOVA table for the regression), part and partial correlation matrices, correlations and covariances for regression weights, the sweep matrix (matrix inverse), the Durbin-Watson d statistic, Mahalanobis and Cook's distances, deleted residuals, confidence intervals for predicted values, and many others.
Predicted and residual values
The extensive residual and outlier analysis features a large selection of plots, including a variety of scatterplots, histograms, normal and half-normal probability plots, detrended plots, partial correlation plots, different casewise residual and outlier plots and diagrams, and others. The scores for individual cases can be visualized via exploratory icon plots and other multidimensional graphs integrated directly with the results Spreadsheets. Residual and predicted scores can be appended to the current data file. A forecasting routine allows the user to perform what-if analyses, and to interactively compute predicted scores based on user-defined values of predictors.
By-group analysis; related procedures
Extremely large regression designs can be analyzed. An option is also included to perform multiple regression analyses broken down by one or more categorical variable (multiple regression analysis by group); additional add-on procedures include a regression engine that supports models with thousands of variables, a Two-stage Least Squares regression, as well as Box-Cox and Box-Tidwell transformations with graphs. An add-on package, STATISTICA Advanced Linear/Non-Linear Models, also includes general nonlinear estimation modules (Nonlinear Estimation, Generalized Linear Models (GLZ), Partial Least Squares models (PLS)) that can estimate practically any user-defined nonlinear model, including Logit, Probit, and others. The add-on also includes SEPATH, the general Structural Equation Modeling and Path Analysis module, which allows the user to analyze extremely large correlations, covariances, and moment matrices (for intercept models). An implementation of Generalized Additive Models (GAM) is also available in STATISTICA Data Miner.
The Nonparametric Statistics module features a comprehensive selection of inferential and descriptive statistics including all common tests and some special application procedures. Available statistical procedures include the Wald-Wolfowitz runs test, Mann-Whitney U test (with exact probabilities [instead of the Z approximations] for small samples), Kolmogorov-Smirnov tests, Wilcoxon matched pairs test, Kruskal-Wallis ANOVA by ranks, Median test, Sign test, Friedman ANOVA by ranks, Cochran Q test, McNemar test, Kendall coefficient of concordance, Kendall tau (b, c), Spearman rank order R, Fisher's exact test, Chi-square tests, V-square statistic, Phi, Gamma, Sommer's d, contingency coefficients, and others. (Specialized nonparametric tests and statistics are also part of many add-on modules, e.g., Survival Analysis, STATISTICA Process Analysis, and others.) All (rank order) tests can handle tied ranks and apply corrections for small n or tied ranks. The program can handle extremely large analysis designs. As in all other modules of STATISTICA, all tests are integrated with graphs (that include various scatterplots, specialized box-and-whisker plots, line plots, histograms and many other 2D and 3D displays).
The ANOVA/MANOVA module includes a subset of the functionality of the General Linear Models module (part of the Advanced Linear/Non-Linear Models), and can perform univariate and multivariate analysis of variance of factorial designs with or without one repeated measures variable. For more complicated linear models with categorical and continuous predictor variables, random effects, and multiple repeated measures factors you need the General Linear Models module (stepwise and best-subset options are available in the General Regression Models module). In the ANOVA/MANOVA module, you can specify all designs in the most straightforward, functional terms of actual variables and levels (not in technical terms, e.g., by specifying matrices of dummy codes), and even less-experienced ANOVA users can analyze very complex designs with STATISTICA. Like the General Linear Models module, ANOVA/MANOVA provides three alternative user interfaces for specifying designs: (1) A Design Wizard, that will take you step-by-step through the process of specifying a design, (2) a simple dialog-based user-interface that will allow you to specify designs by selecting variables, codes, levels, and any design options from well-organized dialogs, and (3) a Syntax Editor for specifying designs and design options using keywords and a common design syntax. Computational methods. The program will use, by default, the sigma restricted parameterization for factorial designs, and apply the effective hypothesis approach (see Hocking, 19810) when the design is unbalanced or incomplete. Type I, II, III, and IV hypotheses can also be computed, as can Type V and Type VI hypotheses that will perform tests consistent with the typical analyses of fractional factorial designs in industrial and quality-improvement applications (see also the description of the Experimental Design module).
The ANOVA/MANOVA module is not limited in any of its computational routines for reporting results, so the full suite of detailed analytic tools available in the General Linear Models module is also available here; results include summary ANOVA tables, univariate and multivariate results for repeated measures factors with more than 2 levels, the Greenhouse-Geisser and Huynh-Feldt adjustments, plots of interactions, detailed descriptive statistics, detailed residual statistics, planned and post-hoc comparisons, testing of custom hypotheses and custom error terms, detailed diagnostic statistics and plots (e.g., histogram of within-cell residuals, homogeneity of variance tests, plots of means versus standard deviations, etc.).
The Distribution Fitting options allow the user to compare the distribution of a variable with a wide variety of theoretical distributions. You may fit to the data the Normal, Rectangular, Exponential, Gamma, Lognormal, Chi-square, Weibull, Gompertz, Binomial, Poisson, Geometric, or Bernoulli distribution. The fit can be evaluated via the Chi-square test or the Kolmogorov-Smirnov one-sample test (the fitting parameters can be controlled); the Lilliefors and Shapiro-Wilks' tests are also supported (see above). In addition, the fit of a particular hypothesized distribution to the empirical distribution can be evaluated in customized histograms (standard or cumulative) with overlaid selected functions; line and bar graphs of expected and observed frequencies, discrepancies and other results can be produced from the output Spreadsheets. Other distribution fitting options are available in STATISTICA Process Analysis, where the user can compute maximum-likelihood parameter estimates for the Beta, Exponential, Extreme Value (Type I, Gumbel), Gamma, Log-Normal, Rayleigh, and Weibull distributions. Also included in that module are options for automatically selecting and fitting the best distribution for the data, as well as options for general distribution fitting by moments (via Johnson and Pearson curves). User-defined 2- and 3-dimensional functions can also be plotted and overlaid on the graphs. The functions may reference a wide variety of distributions such as the Beta, Binomial, Cauchy, Chi-square, Exponential, Extreme value, F, Gamma, Geometric, Laplace, Logistic, Normal, Log-Normal, Pareto, Poisson, Rayleigh, t (Student), or Weibull distribution, as well as their integrals and inverses. Additional facilities to fit predefined or user-defined functions of practically unlimited complexity to the data are available in Nonlinear Estimation (available in STATISTICA Advanced Linear/Non-Linear Models).