www.statsoft.com

Performance of STATISTICA on Large Data Sets and Computationally Intensive Analyses

1. Performance of STATISTICA compared to competing data analysis applications

One of the significant differentiators of the STATISTICA family of data analysis software is its performance on large data sets and computationally intensive applications, such as analyses requiring recursive access to data or complex data management and database query operations.

For example, in a recent, carefully designed comparison of competing analytic software packages, run on a quad-core 64-bit machine under a 64-bit Microsoft Windows operating system, STATISTICA outperformed other widely used data analysis packages by a wide margin:

  • Basic descriptive statistics for 30 variables (fields) and 9,000,000 rows or cases (data file size approximately 2.2 gigabytes) were computed in approximately 3 seconds; the two major competing packages in the data analysis/BI market required between 4.5 seconds (on the platform that purportedly also takes advantage of multiple processors) and 37 seconds.
  • Correlation matrices for 500 variables (fields) with 1,000,000 rows or cases (data file size approximately 4 gigabytes) were computed in approximately 5 seconds; competing computing platforms required 20 to 65 seconds to perform the same task.
  • Basic data management operations commonly required in data mining (predictive modeling) work, such as sub-setting of data, execute 3 to 4 times faster in STATISTICA.
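
As a rough plausibility check on the file sizes quoted above, the following sketch recomputes them under the assumption that every value is stored as an 8-byte double-precision number. The storage format is an assumption, not something stated in the comparison, and the function name is illustrative only. The scan rates are simply the recomputed size divided by the quoted, approximate timings.

```python
# Assumption: each value occupies 8 bytes (double precision); the source
# does not state the storage format used in the comparison.
BYTES_PER_VALUE = 8


def raw_size_gb(variables: int, rows: int) -> float:
    """Raw size of a dense numeric table in gigabytes (10**9 bytes)."""
    return variables * rows * BYTES_PER_VALUE / 1e9


# Data set sizes taken from the two benchmark descriptions above.
descriptives_gb = raw_size_gb(30, 9_000_000)    # about 2.2 GB quoted
correlations_gb = raw_size_gb(500, 1_000_000)   # about 4 GB quoted

# Effective scan rates implied by the quoted timings (3 s and 5 s).
descriptives_rate = descriptives_gb / 3
correlations_rate = correlations_gb / 5

print(f"descriptives: {descriptives_gb:.2f} GB, {descriptives_rate:.2f} GB/s")
print(f"correlations: {correlations_gb:.2f} GB, {correlations_rate:.2f} GB/s")
```

Under this assumption the two data sets come to 2.16 and 4.0 gigabytes, which agrees with the approximate sizes reported.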

Read more: Performance comparison of STATISTICA Version 9 on multi-core 64-bit machines with current 64-bit releases of SAS (Version 9.2) and PASW (formerly SPSS) Statistics Version 18 (basic data management, basic statistics, and aggregation operations).

2. The performance optimization technology used in STATISTICA

The current version of STATISTICA software, including STATISTICA Data Miner, takes full advantage of state-of-the-art hardware and software technologies, as well as proprietary performance optimization technologies developed at StatSoft. STATISTICA is available as a native 64-bit application, and most STATISTICA computational (statistical) routines, as well as the key predictive modeling algorithms available in STATISTICA Data Miner, take full advantage of multi-processor computing platforms.

Shown below are performance benchmark data collected as part of the STATISTICA and STATISTICA Data Miner software validation and release process. Each analysis was repeated multiple times on 64-bit computers with 1, 2, 3, or 4 processors (and otherwise identical hardware configurations). STATISTICA was designed to take advantage of available hardware resources to achieve maximum performance for complex predictive modeling analyses (e.g., regression trees, stochastic gradient boosting, or random forests), as well as common statistical analyses (e.g., computing correlation coefficients).

[Figure panels: regression tree; stochastic gradient boosted tree; complex random forest; correlation matrix]

Performance of Predictive Modeling Algorithms

STATISTICA Data Miner contains multithreaded implementations of Classification and Regression Trees, CHAID, stochastic gradient boosting of trees (Boosted Trees), Random Forests (voting trees), and others, as well as multithreaded implementations of traditional generalized linear modeling techniques (e.g., logit regression). The performance of these predictive modeling algorithms on modern 64-bit multi-core hardware and 64-bit operating system platforms is spectacular and, as of this writing, not matched by any other general software platform for predictive modeling (see also the graphs above). Analyses with hundreds of variables and millions of cases will complete in minutes.

Data Buffering and Storage

In part, the unmatched performance of STATISTICA and STATISTICA Data Miner computational algorithms was achieved through carefully redesigned, intelligent data access, storage, and buffering methods. Data can be read asynchronously in multiple threads that service different parallel computations for a single analysis (e.g., classification and regression trees). Data arrays are never stored explicitly in memory, so there are no limitations on file sizes; yet the available memory is used intelligently to buffer the data (read by multiple threads) and make them available for computation.
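
The general idea of buffered, asynchronous reading can be illustrated with a minimal sketch. This is not StatSoft's implementation, whose details are proprietary and not described here. It assumes a single reader thread, a bounded queue as the buffer, and rows that are (x, y) pairs; all function and parameter names are hypothetical. The consumer computes a Pearson correlation with a one-pass (Welford-style) update, so memory use is limited to the buffered chunks rather than the full data set.

```python
import math
import queue
import threading
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[float, float]
_DONE = object()  # sentinel that marks the end of the stream


def read_chunks(rows: Iterable[Row], chunk_size: int) -> Iterator[List[Row]]:
    """Group an iterable of rows into lists of at most chunk_size rows."""
    chunk: List[Row] = []
    for row in rows:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def streaming_correlation(rows: Iterable[Row], chunk_size: int = 1000,
                          buffer_chunks: int = 4) -> float:
    """Pearson correlation of (x, y) rows without holding all rows in memory."""
    # Bounded buffer: at most buffer_chunks chunks are held at any time.
    buf: queue.Queue = queue.Queue(maxsize=buffer_chunks)

    def producer() -> None:
        # Reader thread: fills the buffer while the consumer computes.
        for chunk in read_chunks(rows, chunk_size):
            buf.put(chunk)  # blocks while the buffer is full
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()

    n = 0
    mean_x = mean_y = 0.0
    m2_x = m2_y = c_xy = 0.0  # running sums of squares and cross-products
    while True:
        chunk = buf.get()
        if chunk is _DONE:
            break
        for x, y in chunk:
            n += 1
            dx = x - mean_x          # deviation from the old mean of x
            mean_x += dx / n
            dy = y - mean_y          # deviation from the old mean of y
            mean_y += dy / n
            m2_x += dx * (x - mean_x)
            m2_y += dy * (y - mean_y)
            c_xy += dx * (y - mean_y)
    return c_xy / math.sqrt(m2_x * m2_y)


# Usage: a generator stands in for a file that is too large to load at once.
rows = ((float(i), 2.0 * i + 1.0) for i in range(10_000))
r = streaming_correlation(rows, chunk_size=256)
print(r)  # y is an exact linear function of x, so r should be very close to 1
```

The bounded queue is the design choice that matters in this sketch: the reader can run ahead of the computation only by a fixed number of chunks, so memory use stays constant regardless of file size. A production version would also need to forward reader errors to the consumer, which this sketch omits.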


Using these technologies, the STATISTICA data analysis and STATISTICA Data Miner software have leapfrogged the competition.

Contact Us

Statistica
2300 East 14th Street
Tulsa, Oklahoma, 74104
(918) 749-1119