Fri, 17 May 2013 20:28:00 GMT
STATISTICA Data Miner offers the most comprehensive selection of statistical, exploratory, and visualization techniques available on the market, including leading edge and highly efficient neural network/machine learning and classification procedures.
The complete analytic functionality of STATISTICA is available for data mining, encapsulated in over 300 nodes that can be selected in a structured and customizable Node Browser and dragged into the data mining workspace.
The specialized workspace templates for data mining are optimized for speed and efficiency and can be classified into the following five general areas.
A large number of analysis nodes are available for creating exploratory graphs, to compute descriptive statistics, tabulations, etc. These nodes can be connected to input data sources, or to all intermediate results. A specialized STATISTICA application module is available (STATISTICA Drill-Down Explorer) for interactively exploring your data by drilling down on selected variables, and categories or ranges of values in those variables. For example, you can drill-down on Gender, to display the distribution for a variable Income for females only; next you could drill down on a specific income group, to explore (e.g., create graphical summaries for) selected variables, for females in the selected income group only. A unique feature of STATISTICA Drill-Down Explorer is the ability to select and deselect drill-down variables and categories in any order; so you could next deselect variable Gender and thus display selected graphs and statistics for the selected Income group, but now for both males and females. Another unique feature of the Drill-Down Explorer is its variety of categorization ("slicing") methods. Hence, the STATISTICA Drill-Down Explorer offers tremendous flexibility for "slicing-and-dicing" the data. The STATISTICA Drill-Down Explorer can be applied to raw data, database connections for in-place processing of data in remote databases, or to any intermediate result computed in a STATISTICA Data Miner project.
STATISTICA Data Miner offers the widest selection of tools to perform data mining classification techniques (and build related deployable models) available on the market, including generalized linear models (for binomial and multinomial responses), classification trees, general classification and regression tree modeling (GTrees), general CHAID models, cluster analysis techniques (including "large capacity" implementations of tree-clustering as well as k-means and EM clustering methods with v-fold crossvalidation options to determine automatically the best number of clusters), and general discriminant analysis models (including best-subset selection of predictors). Also, the numerous advanced neural network classifiers available in STATISTICA Neural Networks are available in STATISTICA Data Miner, and can be used in conjunction or competition with other classification techniques.
STATISTICA Data Miner offers the widest selection of tools to build deployable data mining models, based on linear, nonlinear, or neural network techniques and tools to explore data; the user can also build predictive models based on general multivariate techniques. In summary, STATISTICA offers the full range of techniques, from linear and nonlinear regression models, advanced generalized linear and generalized additive models, regression trees and CHAID, to advanced neural network methods and multivariate adaptive regression splines (MARSplines). STATISTICA Data Miner also includes techniques that are not usually found in data mining software, such as partial least squares methods (for feature selection, to reduce large numbers of variables), survival analysis (for analyzing data containing censored observations; e.g. for medical research data and data from industrial reliability and quality control studies), structural equation modeling techniques (to build and evaluate confirmatory linear models), correspondence analysis (for analyzing the structure of complex tables), factor analysis and multidimensional scaling (for exploring structure in large numbers of variables), and many others.
STATISTICA Data Miner includes a broad selection of traditional (i.e., non-neural networks-based) forecasting techniques (including ARIMA, exponential smoothing with seasonal components, Fourier spectral decomposition, seasonal decomposition, regression- and polynomial lags analysis, etc.), as well as neural network methods for time series data.
This tool contains the most comprehensive selection of neural network methods available on the market. This powerful component of STATISTICA Data Miner offers tools to approach virtually any data mining problem (including classification, hidden structure detection, and powerful forecasting). One of the unique features of the NN Explorer is the selection of intelligent problem solvers and automatic wizards that use Artificial Intelligence methods to help you solve the most demanding problems involved in advanced NN analysis (such as selecting the best network architecture and the best subset of variables). The Explorer offers the widest selection of cutting-edge NN architectures and procedures and highly optimized algorithms that include: multilayer perceptrons, radial basis function networks, probabilistic neural networks, generalized regression neural networks, self-organizing feature maps, linear models, principal components network, and cluster networks. Network ensembles of these architectures can also be evaluated. Estimation methods include back propagation, conjugate gradient descent, quasi-Newton, Levenberg-Marquardt, quick propagation, delta-bar-delta, LVQ, pruning algorithms, and more; options are available for cross validation, bootstrapping, subsampling, sensitivity analysis, etc.
A recipe-like step-by-step process to guide you through the data mining process:
A general trend in data mining is the increasing emphasis on solutions based on simple analytic processes, rather than the creation of ever-more sophisticated general analytic tools. The STATISTICA Data Miner Recipe (SDMR) approach provides an intuitive graphical interface to enable those with limited data mining experience to execute a "recipe-like" step-by-step analytic process. With these intuitive dialogs, you can perform various data mining tasks such as regression, classification, and clustering. Other recipes can be built quickly as custom solutions. Completed recipes can be saved and deployed as project files to score new data. SDMR spans the entire data mining process - from querying external databases to the final deployment of solutions - and, in general, consists of the following steps.
1. Identifies the data from which to learn
2. Cleans data and removes the redundant predictors
3. Identifies important predictors from a large pool of predictors that are strongly related to the dependent (outcome or target) variable of interest
4. Generates a pool of eligible models
5. Performs automatic competitive evaluation of models to identify the optimum model with respect to performance and complexity
6. Deploys the model to score new data using the built-in efficient deployment engine (using optional STATISTICA Enterprise)
With only a few clicks, the program will take you through the complete analytic process - from the definition of input data and analysis problem, through data cleaning and preparation and model building, all the way to final model selection and deployment. Most of the computational complexities of data mining are resolved automatically in the STATISTICA Data Miner Recipe, which enables you to move from problem definition to a solution very quickly even if you are a novice; behind the scenes, the program will "apply and try" a large number of advanced data mining algorithms and automatically determine which approach is most successful. Thus, the STATISTICA Data Miner Recipe methodology and user interface enable you to leverage the largest collection of data mining algorithms in a single package to solve your problems.
A large portion of analytic functionality used by STATISTICA Data Miner is driven by the computational engines of modules that are included in various other STATISTICA products:
However, several modules include selections of highly specialized data mining and data mining modeling techniques that are offered only as part of STATISTICA Data Miner. The following sections include technical information about these modules.
This module will automatically select subsets of variables from extremely large data files or databases connected for in-place processing (IDP). The module can handle a practically unlimited number of variables: over a million (!) input variables can be scanned to select predictors for regression or classification. Specifically, the program includes several options for selecting variables ("features") that are likely to be useful or informative in specific subsequent analyses. The unique algorithms implemented in the Feature Selection and Variable Filtering module will select continuous and categorical predictor variables which show a relationship to the continuous or categorical dependent variables of interest, regardless of whether that relationship is simple (e.g., linear) or complex (nonlinear, non-monotone). Hence, the program does not bias the selection in favor of any particular model that you may use to find a final best rule, equation, etc. for prediction or classification. Various advanced feature selection options are also available. This module is particularly useful in conjunction with the in-place processing of databases (without the need to copy or import the input data to the local machine), when it can be used to scan huge lists of input variables, select likely candidates that contain information relevant to the analyses of interest, and automatically select those variables for further analyses with other nodes in the data miner project. Subsets of variables based on an initial scan via this module can be submitted to further (post-) feature selection methods based on neural networks, MAR Splines, linear regression or classifiers, or CHAID. These options allow STATISTICA Data Miner to handle data sets in the multiple giga- and terabyte range (see Comparative performance benchmarks using large data sets).
This module contains a complete implementation of the so-called A-priori algorithm for detecting ("mining for") association rules such as "customers who order product A, often also order product B or C" or "employees who said positive things about initiative X, also frequently complain about issue Y but are happy with issue Z" (see Agrawal and Swami, 1993; Agrawal and Srikant, 1994; Han and Lakshmanan, 2001; see also Witten and Frank, 2000). The Association Rules module allows you to rapidly process huge data sets for associations (relationships), based on pre-defined "threshold" values for detection. Specifically, the program will detect relationships or associations between specific values of categorical variables in large data sets. This is a common task in many data mining projects applied to databases containing records of customer transactions (e.g., items purchased by each customer), and also in the area of text mining. As with all modules of STATISTICA, data in external databases can be processed by the STATISTICA Association Rules module in-place (see IDP technology), so the program is prepared to efficiently handle extremely large analysis tasks.
The results can be displayed in tables, and also in unique 2D and 3D graphs where strong associations are highlighted by thick lines connecting the respective items.
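As a rough illustration of the kind of rule detection described above, the following Python sketch computes support and confidence on a handful of made-up transactions using a level-wise (A-priori style) search. The transactions, function names, and threshold values are assumptions chosen for illustration; this is not the STATISTICA Association Rules implementation.

```python
from itertools import combinations

# Hypothetical toy transactions; item names are illustrative only.
TRANSACTIONS = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping frequent ones."""
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items
               if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    while current:
        for s in current:
            frequent[s] = support(s, transactions)
        k = len(current[0]) + 1
        # Candidates: unions of two frequent itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Pruning step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(sub) in frequent
                          for sub in combinations(c, k - 1))
                   and support(c, transactions) >= min_support]
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) for qualifying rules."""
    out = []
    for itemset, supp in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = supp / frequent[ante]  # P(consequent | antecedent)
                if conf >= min_confidence:
                    out.append((set(ante), set(itemset - ante), supp, conf))
    return out
```

With a minimum support of 0.5 and a minimum confidence of 0.7, `rules(frequent_itemsets(TRANSACTIONS, 0.5), 0.7)` finds two rules on this toy data, A => B and B => A, each with support 0.6 and confidence 0.75.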
A first step of many data mining projects is to explore the data interactively, to gain a first "impression" of the types of variables in the analyses, and their possible relationships. The purpose of the Interactive Drill-Down Explorer is to provide a combined graphical, exploratory data analysis, and tabulation tool that will allow you to quickly review the distributions of variables in the analyses, their relationships to other variables, and to identify the actual observations belonging to specific subgroups in the data.
How the Drill-Down Explorer Works. The "drill-down" metaphor within the data mining context summarizes the basic operation of this analytic process quite well: The program allows you to select observations from larger data sets by selecting subgroups based on specific values or ranges of values of particular variables of interest (e.g., Gender and Average Purchase in the example above); in a sense you can expose the "deeper layers" or "strata" in the data by reviewing smaller and smaller subsets of observations selected by increasingly complex logical selection conditions.
Drilling "up." The interactive nature of the Drill Down Explorer allows you not only to drill down into the data or database (select groups of observations with increasingly specific logical selection conditions), but also to "drill up": At any time, you can select one of the previously specified variable (category) groups and de-select it from the list of drill-down conditions; while processing the data the program will then only select those observations that fit the remaining logical (case) selection conditions, and update the results accordingly.
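The drill-down and drill-up operations can be pictured as adding and removing named selection conditions. The Python sketch below is only a conceptual model under that assumption; the class name, method names, and sample records are hypothetical and do not reflect the STATISTICA interface.

```python
# Hypothetical records; the variable names echo the Gender/Income example.
RECORDS = [
    {"Gender": "F", "Income": 30},
    {"Gender": "F", "Income": 75},
    {"Gender": "M", "Income": 80},
    {"Gender": "M", "Income": 20},
    {"Gender": "F", "Income": 90},
]


class DrillDown:
    """Holds selection conditions that can be added or removed in any order."""

    def __init__(self, records):
        self.records = records
        self.conditions = {}  # variable name -> predicate on that variable's value

    def drill_down(self, variable, predicate):
        """Add (or replace) the condition on one variable and re-select cases."""
        self.conditions[variable] = predicate
        return self.selected()

    def drill_up(self, variable):
        """Drop the condition on one variable; the remaining conditions still apply."""
        self.conditions.pop(variable, None)
        return self.selected()

    def selected(self):
        """Cases that satisfy every currently active condition."""
        return [r for r in self.records
                if all(pred(r[var]) for var, pred in self.conditions.items())]
```

For example, drilling down on `Gender == "F"` and then on `Income >= 70` leaves two records; drilling up on `Gender` afterwards keeps the income condition and returns the three high-income records for both genders.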
Applications of the Interactive Drill-Down Explorer. The example shown earlier is very simple, exposing only the basic functionality of the program. The real power of the STATISTICA Interactive Drill-Down Explorer lies in the various auxiliary results which can automatically be updated during the interactive drill-down/up exploration: you can select a list of variables for review, and compute for the selected cases:
For example, you could review the types of purchases that customers made with different demographic characteristics, study the effectiveness of certain drugs within different treatment groups, ages, etc., or extract likely customers for a new product from a database of previous customers based on careful study of apparent (market) segments exposed by the drill-down analysis.
The STATISTICA Generalized EM (Expectation Maximization) and k-Means Clustering module is an extension of the techniques available in the general STATISTICA Cluster Analysis options, specifically designed to handle large data sets and to allow clustering of continuous and/or categorical variables, and to provide the functionality for complete unsupervised learning (clustering) for pattern recognition, with all deployment options for predictive clustering. Various cross-validation options are provided (including modified v-fold cross-validation options) that will automatically choose and evaluate a best final solution for the clustering problem; you do not need to specify the number of clusters before an analysis; instead the program will use automatic (cross-validation based) methods to choose a best cluster solution (number of clusters) for you. The advanced EM clustering technique available in this module is sometimes referred to as probability-based clustering or statistical clustering. The program will cluster observations based on continuous and categorical variables, assuming different distributions for the variables in the analyses (as specified by the user). Detailed output summaries and graphs (e.g., distribution plots for EM clustering), and detailed classification statistics are computed for each observation. These methods are optimized to handle very large data sets, and various results are provided to facilitate subsequent analyses using the assignment of observations to clusters. Options for deploying cluster solutions (in C, C++, C#, Visual Basic, or XML syntax based PMML), for classifying new observations, are also included.
The STATISTICA Generalized Additive Models facilities are an implementation of methods developed and popularized by Hastie and Tibshirani (1990); additional detailed discussion of these methods can also be found in Schimek (2000). The program will handle continuous and categorical predictor variables. Note that STATISTICA includes a comprehensive selection of methods for fitting non-linear models to data, such as the Nonlinear Estimation module, Generalized Linear Models, General Classification and Regression Trees, etc.
Distributions and link functions. The program allows the user to choose from a wide variety of distributions for the dependent variable, and link functions for the effects of the predictor variables on the dependent variable:
Normal, Gamma, and Poisson distributions:
Binomial distribution:
Scatterplot smoother. The program uses the cubic spline smoother with user-defined degrees of freedom to find an optimum transformation (function) of the predictor variables.
Results statistics. The program will report a comprehensive set of results statistics to aid in the evaluation of the model-adequacy, model fit, and interpretation of results; specifically, results include: the iteration history for the model fitting computations, summary statistics including the overall R-square value (computed from the deviance statistic), model degrees of freedom, and detailed observational statistics pertaining to the predicted response, residuals, and the smoothing of the predictor variables. Results graphs include plots of observed responses vs. residual responses, predicted values vs. residuals, histograms of observed and residual values, normal probability plots of residual values, and partial residual plots for each predictor, indicating the cubic spline smoothing fit for the final solution; for binary responses (e.g., logit-models) lift charts can also be computed.
The General Classification and Regression Tree (GC&RT) Model is a recursive partitioning method used to classify or divide cases based on a set of predictor variables. Unlike linear or nonlinear regression-like algorithms, this module will find hierarchical decision rules to provide optimal separation between observations with regard to a categorical or continuous criterion variable, based on splits on one or more continuous and/or categorical predictor variables. This module is a comprehensive implementation of the methods described as CART® by Breiman, Friedman, Olshen, and Stone (1984). However, the General Trees module contains various extensions and options that are typically not found in implementations of this algorithm, and that are particularly useful for data mining applications. In addition to standard analyses, the implementation of these methods in STATISTICA enables you to specify ANOVA/ANCOVA-like designs with continuous and/or categorical predictor variables, and their interactions. In short, ANOVA/ANCOVA-like predictor designs can be specified via dialogs, Wizards, or (design) command syntax; moreover, the command syntax is compatible across modules, so you can quickly apply identical designs to very different analyses (e.g., compare the quality of classification using GDA vs. GTrees).
User interface; specifying "models." The program provides a large number of options for controlling the building of the tree(s), the pruning of the tree(s), and the selection of the best-fitting solution. For continuous dependent (criterion) variables, pruning of the tree can be based on the variance, or on FACT-style pruning. For categorical dependent (criterion) variables, pruning of the tree can be based on misclassification errors, variance, or FACT-style pruning. You can specify the maximum number of nodes for the tree or the minimum n per node. Options are provided for validating the best decision tree, using V-fold cross validation, or by applying the decision tree to new observations in a validation sample. For categorical dependent (criterion) variables, i.e., for classification problems, various measures can be chosen to modify the algorithm and to evaluate the quality of the final classification tree: Options are provided to specify user-defined prior classification probabilities and misclassification costs; goodness-of-fit measures include the Gini measure, Chi-square, and G-Square.
Missing data and surrogate splits. Missing data values in the predictors can be handled by allowing the program to determine splits for surrogate variables, i.e., variables that are similar to the respective variable used for a particular split (node).
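To make the Gini measure mentioned among the goodness-of-fit options more concrete, here is a minimal Python sketch that scores candidate binary splits on a single continuous predictor by weighted Gini impurity. The function names are hypothetical, and the sketch covers only this one criterion; it does not reproduce the module's priors, misclassification costs, pruning, or surrogate splits.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted impurity) for the best split 'value <= threshold'.

    Returns (None, impurity of the unsplit node) if no split improves on it.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best
```

For the predictor values `[1, 2, 3, 10, 11, 12]` with classes `a, a, a, b, b, b`, the unsplit node has impurity 0.5 and the best split is at 6.5, which separates the two classes completely (impurity 0.0).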
ANOVA/ANCOVA-like designs. In addition to the traditional CART®-style analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, etc.).
Tree browser. In addition to simple summary tree graphs, you can display the results trees in intuitive interactive tree-browsers that allow you to collapse or expand the nodes of the tree, and to quickly review the most salient information regarding the respective tree node or classification. For example, you can highlight (click on) a particular node in the browser-panel and immediately see the classification and misclassification rates for that particular node. The tree-browser provides a very efficient and intuitive facility for reviewing complex tree-structures, using methods that are commonly used in windows-based computer applications to review hierarchically structured information. Multiple tree-browsers can be displayed simultaneously, containing the final tree, and different sub-trees pruned from the larger tree, and by placing multiple browsers side-by-side it is easy to compare different tree structures and sub-trees. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees.
Interactive trees. Options are also provided to review trees interactively, either by using STATISTICA Graphics brushing tools or by placing large tree graphs into scrollable graphics windows where large graphs can be inspected "behind" a smaller (scrollable) window.
Results statistics. The STATISTICA GTrees module provides a very large number of results options. Summary results for each node are accessible, detailed statistics are computed pertaining to classification, classification costs, gain, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).
C, C++, C#, Java, STATISTICA Visual Basic, SQL Code generators. The information contained in the final tree can be quickly incorporated into your own custom programs or database queries via the optional C, C++, C#, Java, STATISTICA Visual Basic, or SQL query code generator options. The STATISTICA Visual Basic code will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner.
Like the implementation of General Classification and Regression Trees (GTrees) in STATISTICA, another recursive partitioning method, the General Chi-square Automatic Interaction Detection (CHAID) module, provides not only a comprehensive implementation of the original technique, but also extends these methods to the analysis of ANOVA/ANCOVA-like designs.
Standard CHAID. The CHAID analysis can be performed for both continuous and categorical dependent (criterion) variables. Numerous options are available to control the construction of hierarchical trees: the user has control over the minimum n per node, maximum number of nodes, and probabilities for splitting and for merging categories; the user can also request exhaustive searches for the best solution (Exhaustive CHAID); V-fold validation statistics can be computed to evaluate the stability of the final solution; for classification problems, user-defined misclassification costs can be specified.
ANOVA/ANCOVA-like designs. In addition to the traditional CHAID analysis, you can combine categorical and continuous predictor variables into ANOVA/ANCOVA-like designs and perform the analysis using a design matrix for the predictors. This allows you to evaluate and compare complex predictor models, and their efficacy for prediction and classification using various analytic techniques (e.g., General Linear Models, Generalized Linear Models, General Discriminant Analysis Models, General Classification and Regression Tree Models, etc.). Refer also to the description of GLM (General Linear Models) and General Classification and Regression Trees (GTrees).
Tree browser. Like the binary results tree used to summarize binary classification and regression trees (see GTrees), the results of the CHAID analysis can be reviewed in the STATISTICA Tree Browser. This unique tree browser provides a very efficient and intuitive facility for reviewing complex tree-structures and for comparing multiple tree-solutions side-by-side (in multiple tree-browsers), using methods that are commonly used in windows-based computer applications to review hierarchically structured information. The STATISTICA Tree Browser is an important innovation to aid with the interpretation of complex decision trees. For additional details, see also the description of the tree browser in the context of the General Classification and Regression Trees (GTrees).
Results statistics. The STATISTICA General CHAID Models module provides a very large number of results options. Summary results for each node are accessible, detailed statistics are computed pertaining to classification, classification costs, and so on. Unique graphical summaries are also available, including histograms (for classification problems) for each node, detailed summary plots for continuous dependent variables (e.g., normal probability plots, scatterplots), and parallel coordinate plots for each node, providing an efficient summary of patterns of responses for large classification problems. As in all statistical procedures of STATISTICA, all numerical results can be used as input for further analyses, allowing you to quickly explore and further analyze observations classified into particular nodes (e.g., you could use the GTrees module to produce an initial classification of cases, and then use best-subset selection of variables in GDA to find additional variables that may aid in the further classification).
In addition to the modules for automatic tree building (e.g., General Classification and Regression Trees, General CHAID models), STATISTICA Data Miner also includes designated tools for building such trees interactively. You can choose either the (binary) General Classification and Regression Trees method or the CHAID method for building the (decision) tree, and at each step grow the tree either interactively (by choosing the splitting variable and splitting criterion) or automatically. When growing trees interactively, you have full control over all aspects of how to select and evaluate candidates for each split, how to categorize the range of values in predictors, etc. The highly interactive tools available for this module allow you to grow and prune back trees to quickly evaluate the quality of the tree for classification or regression prediction and to compute all auxiliary statistics at each stage to fully explore the nature of each solution. This tool is extremely useful for predictive data mining as well as for exploratory data analysis (EDA), and includes the complete set of options for automatic deployment, for the prediction or predicted classification of new observations (see also the description of these options in the context of CHAID and the General Classification and Regression Trees modules).
The most recent research on statistical and machine learning algorithms suggests that for some "difficult" estimation and prediction (predicted classification) tasks, using successively boosted simple trees can yield more accurate predictions than neural network architectures or complex single trees alone. STATISTICA Data Miner includes an advanced Boosted Trees module for applying this technique to predictive data mining tasks. You have control over all aspects of the estimation procedure and detailed summaries of each stage of the estimation procedures are provided so that the progress over successive steps can be monitored and evaluated. The results include most of the standard summary statistics for classification and regression computed by the General Classification and Regression Trees module. Automatic methods for deployment of the final boosted tree solution for classification or regression prediction are also provided.
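The idea of "successively boosted simple trees" can be sketched for a regression task as follows: each new one-split tree (a stump) is fitted to the residuals left by the trees before it, and a shrunken version of its prediction is added to the running total. This Python sketch assumes squared-error loss, a single predictor with at least two distinct values, and illustrative defaults for the number of trees and the learning rate; none of these details are taken from the Boosted Trees module itself.

```python
def fit_stump(xs, residuals):
    """Find the one-split regression tree (stump) that minimizes squared error."""
    best = None
    # Every distinct value except the largest is a candidate threshold,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean = sum(left) / len(left)
        rmean = sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean


def boost(xs, ys, n_trees=20, learning_rate=0.5):
    """Return (predict function, mean squared training error after each tree)."""
    base = sum(ys) / len(ys)  # start from the overall mean
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    history = []
    for _ in range(n_trees):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
        mse = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)
        history.append(mse)
    return predict, history


# Hypothetical training data: y = x squared on eight points.
XS = [1, 2, 3, 4, 5, 6, 7, 8]
YS = [x * x for x in XS]
```

Calling `boost(XS, YS)` returns the fitted predictor together with the training error after each stage, which mirrors the stage-by-stage monitoring described above; the training error does not increase from one stage to the next in this setup.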
The STATISTICA Random Forest module is an implementation of the Random Forest algorithm developed by Breiman. The algorithm is also applicable to regression problems. A Random Forest consists of a collection (ensemble) of simple tree classifiers, each capable of producing a response when presented with a set of predictor values. You have full control over all key aspects of the estimation procedure and model parameters, including the complexity of the trees fitted to the data, the maximum number of trees in the forest, control over how to stop the algorithm when satisfactory results have been achieved, etc. This module runs efficiently on large datasets and can handle an extremely large number of variables without variable deletion. The results include most of the standard summary statistics for classification and regression computed by the General Classification and Regression Trees module. Automatic methods for deployment of the final Random Forest solution for classification or regression prediction are also provided.
This method performs regression and classification tasks by constructing nonlinear decision boundaries. Because of the nature of the feature space in which these boundaries are found, Support Vector Machines (SVM) can exhibit a large degree of flexibility in handling classification and regression tasks of varied complexities. STATISTICA SVM supports four types of Support Vector models with a variety of kernels as basis function expansions including linear, polynomial, RBF, and sigmoid. It also provides a facility for handling imbalanced data. Cross-validation, a well-established technique, is used for determining the best value of the various model parameters among a set of given values. A large number of graphs and spreadsheets can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Automatic methods for deployment of the final SVM solution for classification or regression prediction are also provided.
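The four kernel families named above have widely used textbook forms, sketched here in Python. The parameter names `gamma`, `coef0`, and `degree` follow a common convention and are assumptions; the text does not say how STATISTICA SVM names or parameterizes its kernels.

```python
import math


def dot(x, y):
    """Inner product of two equal-length numeric sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=0.0):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function kernel: decays with squared Euclidean distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)
```

Each function maps a pair of observations to a similarity value; cross-validation over a grid of values such as `gamma` is one common way to pick the kernel parameters.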
The Naïve Bayes classifier is based on Bayes' theorem and is particularly suited to problems where the dimensionality of the inputs is high, because of its simplifying assumption of independence among predictors. Despite this assumption, Naïve Bayes often performs competitively with more sophisticated classification methods. Although the assumption that the predictor variables are independent is not always accurate, it does simplify the classification task dramatically, since it allows the class conditional densities to be calculated separately for each variable, i.e., it reduces a multidimensional task to a number of one-dimensional tasks. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, so the resulting classifications are largely unaffected. STATISTICA supports categorical predictors and offers several choices for modeling numeric predictors to suit your analysis. These include normal, lognormal, gamma, and Poisson density functions. STATISTICA also provides automatic methods for deployment of the final Naïve Bayes model.
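The reduction to one-dimensional tasks can be seen in a small Python sketch that models every numeric predictor with a normal density and sums the per-variable log densities for each class. The function names and the variance floor are assumptions for illustration; only the normal density option from the list above is shown.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate a prior and per-variable (mean, variance) pairs for each class."""
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)
    model = {}
    for label, members in by_class.items():
        stats = []
        for column in zip(*members):  # one predictor at a time
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            stats.append((mean, max(var, 1e-9)))  # floor avoids division by zero
        model[label] = (len(members) / len(rows), stats)
    return model


def normal_log_density(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict_class(model, row):
    """Pick the class with the largest log prior plus summed log densities."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            normal_log_density(x, m, v) for x, (m, v) in zip(row, stats))
    return max(scores, key=scores.get)


# Hypothetical two-predictor training data with two classes.
ROWS = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2), (5.0, 6.0), (5.2, 5.8), (4.8, 6.2)]
LABELS = ["low", "low", "low", "high", "high", "high"]
```

Because the independence assumption turns the joint density into a product, the log score is just a sum over predictors, which is why the method scales easily to many input variables.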
STATISTICA K-Nearest Neighbors is a memory-based method that, in contrast to other statistical methods, requires no training (i.e., no model to fit). It falls into the category of Prototype Methods. It functions on the intuitive idea that close objects are more likely to be in the same category. Thus, in KNN, predictions for new (i.e., unseen) data are based on a set of prototype examples, using a majority vote (for classification tasks) or an average (for regression) over the K nearest prototypes. This method can handle large datasets and both continuous and categorical predictors. Cross-validation, a well-established technique, is used to obtain estimates of model parameters that are unknown. A large number of graphs and spreadsheets can be computed to evaluate the quality of the fit and to aid with the interpretation of results. Automatic methods for deployment of the final KNN solution for classification or regression prediction are also provided.
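The majority vote and averaging rules described above can be sketched in Python as follows. This is an illustrative sketch only: it assumes numeric predictors, Euclidean distance, and prototypes stored as (features, target) pairs, and all names are hypothetical.

```python
import math
from collections import Counter

def k_nearest(prototypes, query, k):
    """Return the k (features, target) pairs closest to the query point."""
    ranked = sorted(prototypes, key=lambda p: math.dist(p[0], query))
    return ranked[:k]

def knn_classify(prototypes, query, k=3):
    # Majority vote over the targets of the k nearest prototypes.
    votes = Counter(target for _, target in k_nearest(prototypes, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(prototypes, query, k=3):
    # Average of the targets of the k nearest prototypes.
    neighbors = k_nearest(prototypes, query, k)
    return sum(target for _, target in neighbors) / len(neighbors)
```

Note that nothing is fitted in advance: all of the work happens at prediction time, which is what "memory-based" means here. The value of K is the kind of parameter that cross-validation would be used to choose.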
The STATISTICA MAR Splines (Multivariate Adaptive Regression Splines) module is based on a complete implementation of this technique, as originally proposed by Friedman (1991; Multivariate Adaptive Regression Splines, Annals of Statistics, 19, 1-141); in STATISTICA Data Miner, the MARSplines options have further been enhanced to accommodate regression and classification problems, with continuous and categorical predictors.
The program, which in terms of its functionality can be considered a generalization and modification of stepwise Multiple Regression and Classification and Regression Trees (GC&RT), is specifically designed (optimized) for processing very large data sets. A large number of results options and extended diagnostics are available to allow you to evaluate numerically and graphically the quality of the MAR Splines solution.
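For readers unfamiliar with the technique, a MARS model is a weighted sum of hinge functions, each of which is zero on one side of a knot and linear on the other. The sketch below, in Python, evaluates such a model for a single predictor; the coefficients, knots, and function names are hypothetical, and the search procedure that selects the terms is not shown.

```python
def hinge(x, knot, direction=1):
    """max(0, x - knot) for direction=+1, max(0, knot - x) for direction=-1."""
    return max(0.0, direction * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum of coefficient * hinge(x, knot, direction).

    `terms` is a list of (coefficient, knot, direction) tuples, assumed to
    have been chosen already by a forward/backward selection pass.
    """
    return intercept + sum(c * hinge(x, k, d) for c, k, d in terms)
```

For example, `mars_predict(5.0, 1.0, [(2.0, 3.0, 1), (0.5, 3.0, -1)])` describes a piecewise-linear curve whose slope changes at the knot 3.0.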
C/C++, C#, STATISTICA Visual Basic, and XML-syntax-based PMML code generators. The information contained in the model can be quickly incorporated into your own custom programs via the optional C/C++/C#, STATISTICA Visual Basic, or (XML-syntax-based) PMML code generator options. STATISTICA Visual Basic code will be generated in a form that is particularly well suited for inclusion in custom nodes for STATISTICA Data Miner. PMML (Predictive Model Markup Language) files with deployment information can be used with the Rapid Deployment of Predictive Models options to compute predictions for large numbers of cases very efficiently; PMML files are fully portable, and deployment information generated via the desktop version of STATISTICA Data Miner can be used in WebSTATISTICA Data Miner (i.e., on the server side of Client-Server installations), and vice versa.
The STATISTICA Goodness of Fit module will compute various goodness-of-fit statistics for continuous and categorical response variables (for regression and classification problems). This module is specifically designed for data mining applications to be included in "competitive evaluation of models" projects as a tool to choose the best solution. The program uses as input the predicted values or classifications as computed from any of the STATISTICA modules for regression and classification, and computes a wide selection of fit statistics as well as graphical summaries for each fitted response or classification. Goodness-of-fit statistics for continuous responses include least squares deviation (LSD), average deviation, relative squared error, relative absolute error, and the correlation coefficient. For classification problems (for categorical response variables), the program will compute Chi-square, G-square (maximum likelihood chi-square), percent disagreement (misclassification rate), quadratic loss, and information loss statistics.
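Several of these statistics are simple to compute from paired observed and predicted values. The sketch below, in Python, uses common textbook definitions, which are assumptions here and may differ in detail from the formulas STATISTICA applies; the function and key names are hypothetical.

```python
import math

def regression_fit_stats(observed, predicted):
    """Fit statistics for a continuous response.

    Assumes neither the observed nor the predicted values are all
    identical; otherwise the relative errors and the correlation
    are undefined (division by zero).
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    sq_err = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    abs_err = sum(abs(o - p) for o, p in zip(observed, predicted))
    # Baseline: errors of always predicting the observed mean.
    sq_base = sum((o - mean_obs) ** 2 for o in observed)
    abs_base = sum(abs(o - mean_obs) for o in observed)
    cov = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    var_pred = sum((p - mean_pred) ** 2 for p in predicted)
    return {
        "mean_squared_error": sq_err / n,
        "mean_absolute_error": abs_err / n,
        "relative_squared_error": sq_err / sq_base,
        "relative_absolute_error": abs_err / abs_base,
        "correlation": cov / math.sqrt(sq_base * var_pred),
    }

def misclassification_rate(observed, predicted):
    """Fraction of cases whose predicted class differs from the observed one."""
    return sum(o != p for o, p in zip(observed, predicted)) / len(observed)
```

In a competitive evaluation, the same functions would be applied to the predictions of each candidate model on the same hold-out cases, and the resulting values compared.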
The Rapid Deployment of Predictive Models module allows you to load one or more PMML (Predictive Model Markup Language) files with deployment information, and to compute very quickly (in a single pass through the data) predictions for large numbers of observations (for one or more models). PMML files can be generated from practically all modules for predictive data mining (as well as the Generalized EM & k-Means Cluster Analysis options). PMML is an XML-based (Extensible Markup Language) industry-standard set of syntax conventions that is particularly well suited to sharing deployment information in a Client-Server architecture (e.g., via WebSTATISTICA).
The Rapid Deployment of Predictive Models options provide the fastest, most efficient methods for computing predictions from fully trained models. All models are pre-programmed in generic form in a highly optimized compiled program; the PMML code only supplies the parameter estimates etc. for the fully trained models, to allow the Rapid Deployment of Predictive Models program to compute predictions or predicted classifications (or cluster assignments) in a single pass through the data.
In fact, it is very difficult to "beat" the performance (speed of computations) of this tool, even if you were to write your own compiled C++ code, based on the (C, C++, or C#) deployment code generated by the respective models.
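The division of labor described above, where generic scoring code is fixed and the PMML file supplies only the parameter estimates, can be illustrated with a minimal sketch in Python. The PMML fragment is a hand-written, heavily simplified linear regression model (real PMML files declare an XML namespace and contain further sections such as a data dictionary); it is not output generated by STATISTICA, and the predictor names and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written PMML fragment (an assumption for illustration).
PMML_TEXT = """
<PMML version="4.1">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="1.5">
      <NumericPredictor name="income" coefficient="0.2"/>
      <NumericPredictor name="age" coefficient="-0.1"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

def load_regression(pmml_text):
    """Read the intercept and coefficients from the fragment above."""
    root = ET.fromstring(pmml_text.strip())
    table = root.find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("NumericPredictor")
    }
    return intercept, coefficients

def score_cases(cases, intercept, coefficients):
    """Yield one prediction per case in a single pass over the data."""
    for case in cases:
        yield intercept + sum(
            coef * case[name] for name, coef in coefficients.items()
        )
```

Usage: `intercept, coefs = load_regression(PMML_TEXT)` followed by `list(score_cases(cases, intercept, coefs))`, where `cases` is any iterable of dictionaries keyed by predictor name. Because `score_cases` is a generator, cases can be streamed from a file or database without holding them all in memory.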
Note that the Rapid Deployment of Predictive Models module will also automatically compute summary statistics for each model, and if observed values or classifications are available, the program will automatically compute goodness-of-fit indices for participating models, including Gains and Lift charts for one or more models (overlaid lift and gain charts), for binary or multinomial
STATISTICA Data Miner is compatible with Windows XP, Windows Vista, and Windows 7.
Native 64-bit versions and highly optimized multiprocessor versions are available.