Text Mining with STATISTICA

  • Overview of Text Mining
  • Access Documents
  • Process Documents
  • Analyze Documents
  • System Requirements

STATISTICA Text Miner is an optional extension of STATISTICA Data Miner, ideal for translating unstructured text data into meaningful, valuable clusters of decision-making "gold."

As most users familiar with text mining already know, real-world data comes in a variety of forms, not always organized or easily ready to analyze. Text mining digs for the underlying information not readily apparent in traditional structured data.  These data sources can be extremely large as well.  STATISTICA Text Miner is optimized and has recently been further enhanced for working with such data.

How Can You Use STATISTICA Text Miner?

  • Analyze the contents of Web pages. For example, users can automatically process and summarize all Web pages of particular companies, message boards, etc.
  • Include unstructured notes in predictive data mining projects. For example, users may include responses to open-ended interview questions, patients' own descriptions of medical symptoms, etc. in data mining projects involving the clustering of patients and symptoms.
  • Analyze large document repositories. For example, users may analyze repositories of documents such as narratives of insurance claims, etc., to include such information in fraud detection projects.

STATISTICA Text Miner was specifically designed as a general, open-architecture tool for mining unstructured information. Its feature extraction/selection and other analytic tools apply not only to text documents and Web pages; they can also be used to index, classify, cluster, or otherwise include in your analyses other unstructured information, such as (pre-processed) bitmaps imported as data matrices.

  • Accessing Documents
  • Processing Documents
  • Analyzing Documents

Integration with STATISTICA, STATISTICA Data Miner, and WebSTATISTICA

The text miner software is fully integrated into the STATISTICA line of software. It is not a stand-alone product manufactured by another vendor and "connected" to STATISTICA. Text mining functionality can be integrated into the STATISTICA Data Miner workspace environment, WebSTATISTICA, or custom STATISTICA applications.

For example, a customer may:

  • automatically access data stored in a data warehouse
  • update certain analyses and numeric summaries of the textual information
  • publish results to authorized users via the Internet

STATISTICA Text Miner is scalable and uses multi-threaded computing technology to extract optimum performance from advanced multiple-processor server hardware.

Accessing Documents

The program contains numerous options for accessing text documents in different formats, including .txt (text), .pdf (Adobe), .html and .xml (Web formats), and most Microsoft Office formats (e.g., .doc, .rtf).

Flexible user interface options (and automation functions) are provided for selecting large numbers of files via wild-cards (e.g., to select all documents in a particular subdirectory structure).
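As a rough illustration of wildcard-based selection outside of STATISTICA, the following Python sketch collects every file that matches a pattern anywhere below a directory. The function name select_documents and the default pattern are assumptions made for this example; they are not part of the product.

```python
from pathlib import Path


def select_documents(root, pattern="*.txt"):
    """Return all files under root (searched recursively) whose name matches pattern."""
    # rglob walks the whole subdirectory structure below root
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())
```

Calling select_documents("claims", "*.pdf") would, under these assumptions, list every PDF file in a hypothetical "claims" directory and all of its subdirectories.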

The program supports full "Web-crawling" capabilities, so that documents can be extracted from the Web, starting at a particular root Web page (URL). All documents linked to that particular page will be included, as well as the documents linked to those sub-documents, and so on, up to a user-specified level or depth.
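The depth-limited crawl described above can be sketched as a breadth-first traversal. This is only an illustration of the idea, not the product's implementation: the function crawl and the caller-supplied get_links function are hypothetical, and an in-memory link graph stands in for real Web pages.

```python
from collections import deque


def crawl(root_url, get_links, max_depth):
    """Breadth-first crawl from root_url, following links up to max_depth levels.

    get_links is a caller-supplied function (hypothetical here) that returns
    the URLs linked from a given page.
    """
    seen = {root_url}
    queue = deque([(root_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the requested depth
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


# Usage with an in-memory link graph standing in for real Web pages
site = {"root": ["a", "b"], "a": ["c"], "b": ["root"], "c": ["d"]}
pages = crawl("root", lambda url: site.get(url, []), max_depth=2)
```

Tracking visited URLs in a set keeps pages that link back to each other from being fetched more than once.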

File names and URLs can also be stored in text variables in STATISTICA data files. The program can therefore process not only actual text stored in text variables but also references to text documents or URLs. Numeric and textual information (including large documents) can thus be stored on a per-case (observation) basis, so that meaningful analyses can be performed on data files in which each observation has both numeric values and (voluminous) unstructured text (e.g., patients' age, height, and weight, along with physicians' narrative descriptions of symptoms).

Options are provided to flexibly import such lists of filenames or URLs into the columns of a STATISTICA spreadsheet.

Processing Documents

Documents can be preprocessed before (in practice, concurrently with) the indexing of all documents. Exclusion rules and stub lists (often called stop lists) can be applied to remove common but uninformative words such as "a", "the", "to", and "is". A stemming algorithm is then applied so that English words like "traveled" and "traveling" both count as instances of "travel".
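To make the two preprocessing steps concrete, here is a deliberately simplified Python sketch. The word list and the three suffix rules are toy assumptions chosen for this example; real stemmers use far more elaborate rules, and nothing here reflects the algorithms shipped with STATISTICA Text Miner.

```python
import re

STOP_WORDS = {"a", "the", "to", "is"}   # the example words from the text
SUFFIXES = ("ing", "ed", "s")           # toy rule set, far simpler than a real stemmer


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, then stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

With these rules, both "traveled" and "traveling" reduce to "travel", so they are counted as the same term in later steps.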

STATISTICA Text Miner includes stub lists and stemming algorithms for Danish, Dutch, English, French, German, Italian, Portuguese, Spanish, Swedish, and other languages. Please email info@statsoft.com about your language needs. Stub lists can be edited (augmented) by the user as needed. The program is designed so that support for additional languages can be added with minimum effort.

Next, the program indexes the "stubbed-and-stemmed" documents to create a frequency count of every word in every document. This "raw-data" (count) information is the basis for all subsequent numerical analyses.
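The indexing step amounts to building a table of word counts per document. A minimal Python sketch follows, assuming the documents have already been tokenized; the function name index_documents and the data layout are illustrative choices, not the product's file format.

```python
from collections import Counter


def index_documents(docs):
    """Build a word-by-document count table from pre-tokenized documents.

    docs maps a document name to its list of tokens.
    Returns (vocabulary, counts) where counts[name][i] is the frequency
    of vocabulary[i] in that document.
    """
    per_doc = {name: Counter(tokens) for name, tokens in docs.items()}
    vocabulary = sorted(set().union(*per_doc.values())) if per_doc else []
    counts = {name: [c[w] for w in vocabulary] for name, c in per_doc.items()}
    return vocabulary, counts
```

Each row of counts corresponds to one document and each column to one word, which is the shape that the later numeric analyses operate on.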

Before creating a STATISTICA Data File containing the counts (etc.) to summarize the documents, various additional filters may be applied. For example, the counts for particular (most frequent) words per document can be:

  • normalized based on the length of each document
  • transformed (e.g., log-transformed)
  • optionally "compressed" by, for example, applying various feature extraction algorithms such as SVD (singular value decomposition, specifically optimized to operate on large sparse matrices)
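The first two filters in the list above can be sketched in a few lines of Python. The choice of log(1 + count) is one common convention and an assumption here; the source does not say which transformation the product applies. SVD compression is not shown in this block.

```python
import math


def relative_counts(row):
    """Divide each count by the document length (total words counted)."""
    total = sum(row)
    return [c / total for c in row] if total else [0.0] * len(row)


def log_counts(row):
    """Dampen large counts with log(1 + count); zero counts stay zero."""
    return [math.log1p(c) for c in row]
```

Length normalization keeps long documents from dominating simply because they contain more words, while the log transform reduces the influence of a few very frequent terms.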

The resulting data file with numeric information (e.g., SVD dimensions, raw counts, relative counts, most-frequent-word counts, and so on) is then ready for further analyses.

Various options are provided for writing the information extracted from text into the input data file, or directly into external databases (see also the description of STATISTICA In-Place Database Processing technology). 

Analyzing Documents

All statistical analysis methods can be applied to the numeric summaries representing the texts. Simple summary statistics may extract the most common words used in the documents.
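As a small example of such a summary statistic, the following Python sketch totals word counts across documents and reports the most frequent words. The function name most_common_words and the sample tokens are made up for illustration.

```python
from collections import Counter


def most_common_words(token_lists, n=3):
    """Sum word counts over all documents and return the n most frequent."""
    totals = Counter()
    for tokens in token_lists:
        totals.update(tokens)
    return totals.most_common(n)
```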

By mapping the documents into the SVD dimensions (e.g., via PCA), dimensional maps of documents can be created, to evaluate the similarity of documents, etc.
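To give a feel for what a position on an SVD dimension means, the sketch below approximates the leading singular value and vectors of a small count matrix by power iteration in plain Python. This is a teaching example under stated assumptions: the function name is hypothetical, only the first dimension is computed, and a production system would use a sparse SVD routine rather than this loop.

```python
import math


def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its vectors by power iteration.

    matrix is a list of rows (documents) by columns (words).
    Returns (sigma, u, v) such that matrix times v is approximately sigma times u.
    """
    n_cols = len(matrix[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # start from a uniform unit vector
    for _ in range(iterations):
        # A v: one score per document
        av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        # A^T (A v): one weight per word
        atav = [sum(row[j] * av[i] for i, row in enumerate(matrix))
                for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in atav))
        if norm == 0.0:
            break
        v = [x / norm for x in atav]
    av = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else av
    return sigma, u, v
```

The product sigma * u[i] is the coordinate of document i on the first dimension; documents with similar word usage receive similar coordinates.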

By mapping documents into dimensions based on original (transformed) word counts, simultaneous maps of documents and words can be created. This reflects the "meaning" of documents.

Clustering techniques (such as EM or k-Means) can be applied to identify clusters of similar documents.
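For readers unfamiliar with k-Means, a minimal pure-Python version of the standard Lloyd iteration is sketched below. Using the first k points as starting centroids is a simplification made so the example is deterministic; it is an assumption of this sketch, not a description of how STATISTICA initializes its clustering.

```python
def k_means(points, k, iterations=20):
    """Lloyd's algorithm with the first k points as initial centroids.

    Returns (labels, centroids); points are equal-length tuples of numbers.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

In a text mining setting, each point would be one document's row of (transformed) counts or SVD coordinates, and documents that share a label form one cluster.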

Predictive data mining techniques can be used to relate the numerical summaries of documents to other indicators of interest, e.g., fraudulent intent, medical diagnosis, and so on.

Key analytic components requiring extensive data processing are implemented via multi-threaded computing technology, to extract optimum performance from advanced multiple-processor server hardware.

STATISTICA Text Miner is compatible with Windows XP, Windows Vista, and Windows 7.

Minimum System Requirements

  • Operating System: Windows XP or above
  • RAM: 1 GB
  • Processor Speed: 2.0 GHz

Recommended System Requirements

  • Operating System: Windows 7
  • RAM: 4 GB or more 
  • Processor Speed: 2.0 GHz, 64-bit, dual core 

Native 64-bit versions and highly optimized multiprocessor versions are available.

 

Copyright © 2013 by StatSoft Inc.