
Sampling Fraction. In probability sampling, the sampling fraction is the (known) probability with which cases in the population are selected into the sample. For example, if we were to take a simple random sample with a sampling fraction of 1/10,000 from a population of 1,000,000 cases, each case would have a 1/10,000 probability of being selected into the sample, which will consist of approximately 1/10,000 * 1,000,000 = 100 observations.

Scalable Software Systems. Software (e.g., a database management system, such as MS SQL Server or Oracle) that can be expanded to meet future requirements without the need to restructure its operation (e.g., split data into smaller segments) to avoid a degradation of its performance. For example, a scalable network allows the network administrator to add many additional nodes without the need to redesign the basic system. An example of a non-scalable architecture is the DOS directory structure (adding files will eventually require splitting them into subdirectories). See also Enterprise-Wide Systems.

Scaling. Altering original variable values (according to a specific function or an algorithm) into a range that meets particular criteria (e.g., positive numbers, fractions, numbers less than 10E12, numbers with a large relative variance).

Scatterplot, 2D. The scatterplot visualizes a relation (correlation) between two variables X and Y (e.g., weight and height). Individual data points are represented in two-dimensional space (see below), where axes represent the variables (X on the horizontal axis and Y on the vertical axis).

The two coordinates (X and Y) that determine the location of each point correspond to its specific values on the two variables. See also, Data Reduction.

Scatterplot, 2D - Categorized Ternary Graph. The points representing the proportions of the component variables (X, Y, and Z) in a ternary graph are plotted in a 2-dimensional display for each level of the grouping variable (or user-defined subset of data). One component graph is produced for each level of the grouping variable (or user-defined subset of data) and all the component graphs are arranged in one display to allow for comparisons between the subsets of data (categories). See also, Data Reduction.

Scatterplot, 2D - Double-Y. This type of scatterplot can be considered to be a combination of two multiple scatterplots for one X-variable and two different sets (lists) of Y-variables. A scatterplot for the X-variable and each of the selected Y-variables will be plotted, but the variables entered into the first list (called Left-Y) will be plotted against the left-Y axis, whereas the variables entered into the second list (called Right-Y) will be plotted against the right-Y axis. The names of all Y-variables from the two lists will be included in the legend followed either by the letter (L) or (R), denoting the left-Y and right-Y axis, respectively.

The Double-Y scatterplot can be used to compare images of several correlations by overlaying them in a single graph. Moreover, because the two lists of variables are scaled independently, it can facilitate comparisons between variables with values in different ranges. See also, Data Reduction.

Scatterplot, 2D - Frequency. Frequency scatterplots display the frequencies of overlapping points between two variables in order to visually represent data point weight or other measurable characteristics of individual data points.

See also, Data Reduction.

Scatterplot, 2D - Multiple. Unlike the regular scatterplot in which one variable is represented by the horizontal axis and one by the vertical axis, the multiple scatterplot consists of multiple plots and represents multiple correlations: one variable (X) is represented by the horizontal axis, and several variables (Y's) are plotted against the vertical axis. A different point marker and color is used for each of the multiple Y-variables and referenced in the legend so that individual plots representing different variables can be discriminated in the graph.

The Multiple scatterplot is used to compare images of several correlations by overlaying them in a single graph that uses one common set of scales (e.g., to reveal the underlying structure of factors or dimensions in Discriminant Function Analysis). See also, Data Reduction.

Scatterplot, 2D - Regular. The regular scatterplot visualizes a relation between two variables X and Y (e.g., weight and height). Individual data points are represented by point markers in two-dimensional space, where axes represent the variables. The two coordinates (X and Y) that determine the location of each point correspond to its specific values on the two variables. If the two variables are strongly related, then the data points form a systematic shape (e.g., a straight line or a clear curve). If the variables are not related, then the points form an irregular "cloud" (see the categorized scatterplot below for examples of both types of data sets).

Fitting functions to scatterplot data helps identify the patterns of relations between variables (see example below).

For more examples of how scatterplot data helps identify the patterns of relations between variables, see Outliers and Brushing. See also, Data Reduction.

Scatterplot, 3D. 3D Scatterplots visualize a relationship between three or more variables, representing the X, Y, and one or more Z (vertical) coordinates of each point in 3-dimensional space (see graph below).

See also, 3D Scatterplot - Custom Ternary Graph, Data Reduction and Data Rotation (in 3D space).

Scatterplot, 3D - Raw Data. An unsmoothed surface (no smoothing function is applied) is drawn through the points in the 3D scatterplot. See also, Data Reduction.

Scatterplot, 3D - Ternary Graph. In this type of ternary graph, the triangular coordinate systems are used to plot four (or more) variables (the components X, Y, and Z, and the responses V1, V2, etc.) in three dimensions (ternary 3D scatterplots or surface plots). Here, the responses (V1, V2, etc.) associated with the proportions of the component variables (X, Y, and Z) in a ternary graph are plotted as the heights of the points. See also, Data Reduction.

Scatterplot Smoothers. In 2D scatterplots, various smoothing methods are available to fit a function through the points to best represent (summarize) the relationship between the variables.

Scheffe's Test. This post hoc test can be used to determine the significant differences between group means in an analysis of variance setting. Scheffe's test is considered to be one of the most conservative post hoc tests (for a detailed discussion of different post hoc tests, see Winer, Michels, & Brown, 1991). For more details, see General Linear Models. See also, Post Hoc Comparisons. For a discussion of statistical significance, see Elementary Concepts.

Score Statistic. This statistic is used to evaluate the statistical significance of parameter estimates computed via maximum likelihood methods. It is also sometimes called the efficient score statistic. The test is based on the behavior of the log-likelihood function at the point where the respective parameter estimate is equal to 0.0 (zero); specifically, it uses the derivative (slope) of the log-likelihood function evaluated at the null hypothesis value of the parameter (parameter = 0.0). While this test is not as accurate as explicit likelihood-ratio test statistics based on the ratio of the likelihoods of the model that includes the parameter of interest, over the likelihood of the model that does not, its computation is usually much faster. It is therefore the preferred method for evaluating the statistical significance of parameter estimates in stepwise or best-subset model building methods. An alternative statistic is the Wald statistic.

Scree Plot, Scree Test. The eigenvalues for successive factors can be displayed in a simple line plot. Cattell (1966) proposed that this scree plot can be used to graphically determine the optimal number of factors to retain.

The scree test involves finding the place where the smooth decrease of eigenvalues appears to level off to the right of the plot. To the right of this point, presumably, we find only "factorial scree" – "scree" is the geological term referring to the debris that collects on the lower part of a rocky slope. Thus, no more than the number of factors to the left of this point should be retained.

For more information on procedures for determining the optimal number of factors to retain, see the section on Reviewing the Results of a Principal Components Analysis in Factor Analysis and How Many Dimensions to Specify in Multi-dimensional Scaling.

S.D. Ratio. In a regression problem, the ratio of the prediction error standard deviation to the original output data standard deviation. A lower S.D. ratio indicates a better prediction. The squared S.D. ratio corresponds to one minus the proportion of variance explained by the model. See Multiple Regression, Neural Networks.
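As a minimal sketch of this definition, the ratio can be computed directly from observed and predicted values. The function name sd_ratio and the data are hypothetical, and the sample (n-1) standard deviation is assumed for both terms.

```python
import statistics

def sd_ratio(observed, predicted):
    """Ratio of the prediction-error standard deviation to the
    standard deviation of the observed output values."""
    errors = [o - p for o, p in zip(observed, predicted)]
    # Assumption: sample (n-1) standard deviations are used for both terms.
    return statistics.stdev(errors) / statistics.stdev(observed)

# Hypothetical observed outputs and model predictions.
observed = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [3.5, 4.5, 7.5, 8.5, 11.0]

ratio = sd_ratio(observed, predicted)
print(f"S.D. ratio: {ratio:.4f}")
```

A ratio near 0 indicates predictions that track the outputs closely; a ratio near 1 indicates a model that does little better than predicting the output mean.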

Semi-Partial (or Part) Correlation. The semi-partial or part correlation is similar to the partial correlation statistic. Like the partial correlation, it is a measure of the correlation between two variables that remains after controlling for (i.e., "partialling" out) the effects of one or more other predictor variables. However, while the squared partial correlation between a predictor X1 and a response variable Y can be interpreted as the proportion of (unique) variance accounted for by X1, in the presence of other predictors X2, ... , Xk, relative to the residual or unexplained variance that cannot be accounted for by X2, ... , Xk, the squared semi-partial or part correlation is the proportion of (unique) variance accounted for by the predictor X1, relative to the total variance of Y. Thus, the semi-partial or part correlation is a better indicator of the "practical relevance" of a predictor, because it is scaled to (i.e., relative to) the total variability in the dependent (response) variable.
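The difference between the two statistics can be illustrated for the simplest case of a single control variable X2, using the standard textbook formulas based on the three pairwise Pearson correlations. This is only a sketch: the helper names and the data are hypothetical, and with more than one control variable a regression-based computation would be needed instead.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_and_semipartial(y, x1, x2):
    """Partial and semi-partial correlation of y with x1, controlling for x2."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    numerator = r_y1 - r_y2 * r_12
    # Semi-partial: x2 is removed from x1 only, so the result is
    # scaled relative to the total variance of y.
    semipartial = numerator / math.sqrt(1 - r_12 ** 2)
    # Partial: x2 is removed from both y and x1, so the result is
    # scaled relative to the variance of y left unexplained by x2.
    partial = numerator / math.sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))
    return partial, semipartial

# Hypothetical data: response y and two correlated predictors.
y = [2, 4, 5, 4, 5, 7, 8, 9]
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
x2 = [2, 1, 4, 3, 6, 5, 8, 7]

partial_r, semipartial_r = partial_and_semipartial(y, x1, x2)
print(f"partial r = {partial_r:.3f}, semi-partial r = {semipartial_r:.3f}")
```

Because the partial correlation divides by an additional term that is at most 1, its absolute value is never smaller than that of the semi-partial correlation.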

See also Correlation, Spurious Correlations, partial correlation, Basic Statistics, Multiple Regression, General Linear Models, General Stepwise Regression, Structural Equation Modeling (SEPATH).

SEMMA. See Models for Data Mining. See also, Data Mining Techniques.

Sensitivity Analysis (in Neural Networks). A sensitivity analysis indicates which input variables are considered most important by that particular neural network. Sensitivity analysis can be used purely for informative purposes, or to perform input pruning.

Sensitivity analysis can give important insights into the usefulness of individual variables. It often identifies variables that can be safely ignored in subsequent analyses, and key variables that must always be retained. However, it must be deployed with some care, for reasons that are explained below.

Input variables are not, in general, independent; that is, there are interdependencies between variables. Sensitivity analysis rates variables according to the deterioration in modeling performance that occurs if that variable is no longer available to the model. In so doing, it assigns a single rating value to each variable. However, the interdependence between variables means that no scheme of single ratings per variable can ever reflect the subtlety of the true situation.

Consider, for example, the case where two input variables encode the same information (they might even be copies of the same variable). A particular model might depend wholly on one, wholly on the other, or on some arbitrary combination of them. Then sensitivity analysis produces an arbitrary relative sensitivity to them. Moreover, if either is eliminated the model may compensate adequately because the other still provides the key information. It may therefore rate the variables as of low sensitivity, even though they might encode key information. Similarly, a variable that encodes relatively unimportant information, but is the only variable to do so, may have higher sensitivity than any number of variables that mutually encode more important information.

There may be interdependent variables that are useful only if included as a set. If the entire set is included in a model, they can be accorded significant sensitivity, but this does not reveal the interdependency. Worse, if only part of the interdependent set is included, their sensitivity will be zero, as they carry no discernable information.

In summary, sensitivity analysis does not rate the "usefulness" of variables in modeling in a reliable or absolute manner. We must be cautious in the conclusions we draw about the importance of variables. Nonetheless, in practice it is extremely useful. If a number of models are studied, it is often possible to identify key variables that are always of high sensitivity, others that are always of low sensitivity, and "ambiguous" variables that change ratings and probably carry mutually redundant information.

How does sensitivity analysis work? Each input variable is treated in turn as if it were "unavailable" (Hunter, 2000). There is a missing value substitution procedure, which is used to allow predictions to be made in the absence of values for one or more inputs. To define the sensitivity of a particular variable, v, we first run the network on a set of test cases, and accumulate the network error.  We then run the network again using the same cases, but this time replacing the observed values of v with the value estimated by the missing value procedure, and again accumulate the network error.

Given that we have effectively removed some information that presumably the network uses (i.e. one of its input variables), we would reasonably expect some deterioration in error to occur. The basic measure of sensitivity is the ratio of the error with missing value substitution to the original error. The more sensitive the network is to a particular input, the greater the deterioration we can expect, and therefore the greater the ratio.

If the ratio is one or lower, making the variable "unavailable" either has no effect on the performance of the network, or actually enhances it. Once sensitivities have been calculated for all variables, they may be ranked in order.
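The procedure described above can be sketched as follows. This is an illustration only: toy_model is a hypothetical stand-in for a trained network, the data are invented, and replacing a variable by its mean over the test cases is an assumed missing value substitution procedure (the actual procedure may differ).

```python
import math

def rms_error(model, cases, targets):
    """Root-mean-square error of the model over a set of cases."""
    squared = [(model(c) - t) ** 2 for c, t in zip(cases, targets)]
    return math.sqrt(sum(squared) / len(squared))

def sensitivity_ratios(model, cases, targets):
    """Error with variable v made 'unavailable' divided by the baseline error."""
    baseline = rms_error(model, cases, targets)
    ratios = []
    for v in range(len(cases[0])):
        # Assumed missing value substitution: the mean of v over the test cases.
        mean_v = sum(c[v] for c in cases) / len(cases)
        substituted = [c[:v] + [mean_v] + c[v + 1:] for c in cases]
        ratios.append(rms_error(model, substituted, targets) / baseline)
    return ratios

def toy_model(inputs):
    # Hypothetical stand-in for a trained network: relies strongly on
    # input 0, weakly on input 1, and ignores input 2 entirely.
    return 3.0 * inputs[0] + 0.2 * inputs[1]

cases = [[1.0, 2.0, 5.0], [2.0, 1.0, 3.0], [3.0, 4.0, 8.0], [4.0, 3.0, 1.0]]
targets = [3.5, 6.1, 9.9, 12.5]

ratios = sensitivity_ratios(toy_model, cases, targets)
# Rank variables from most to least sensitive.
ranking = sorted(range(len(ratios)), key=lambda v: ratios[v], reverse=True)
print(ratios, ranking)
```

In this sketch the ignored input receives a ratio of exactly one, while the input the model relies on most receives the largest ratio.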

Sequential Contour Plot, 3D. This contour plot presents a 2-dimensional projection of the spline-smoothed surface fit to the data (see 3D Sequential Surface Plot). Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Sequential/Stacked Plots. In this type of graph, the sequence of values from each selected variable is stacked on one another.

Sequential/Stacked Plots, 2D - Area. The sequence of values from each selected variable will be represented by consecutive areas stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Column. The sequence of values from each selected variable will be represented by consecutive segments of vertical columns stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Lines. The sequence of values from each selected variable will be represented by consecutive lines stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Mixed Line. In this type of graph, the sequences of values of variables selected in the first list will be represented by consecutive areas stacked on one another while the sequences of values of variables selected in the second list will be represented by consecutive lines stacked on one another (over the area representing the last variable from the first list).

Sequential/Stacked Plots, 2D - Mixed Step. In this type of graph, the sequences of values of variables selected in the first list will be represented by consecutive step areas stacked on one another while the sequences of values of variables selected in the second list will be represented by consecutive step lines stacked on one another (over the step area representing the last variable from the first list).

Sequential/Stacked Plots, 2D - Step. The sequence of values from each selected variable will be represented by consecutive step lines stacked on one another in this type of graph.

Sequential/Stacked Plots, 2D - Step Area. The sequence of values from each selected variable will be represented by consecutive step areas stacked on one another in this type of graph.

Sequential Surface Plot, 3D. In this sequential plot, a spline-smoothed surface is fit to the data points. Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Sets of Samples in Quality Control Charts. While monitoring an ongoing process, it often becomes necessary to adjust the center line values or control limits, as those values are being refined over time. Also, we may want to compute the control limits and center line values from a set of samples that are known to be in control, and apply those values to all subsequent samples. Thus, each set is defined by a set of computation samples (from which various statistics are computed, e.g., sigma, means, etc.) and a set of application samples (to which the respective statistics, etc. are applied). Of course, the computation samples and application samples need not be (and often are not) the same. To reiterate, we may want to estimate sigma from a set of samples that are known to be in control (the computation set), and use that estimate for establishing control limits for all remaining and new samples (the application set).

Note that each sample must be uniquely assigned to one application set; in other words, each sample has control limits based on statistics (e.g., sigma) computed for one particular set. The assignment of application samples to sets proceeds in a hierarchical manner, i.e., each sample is assigned to the first set where it "fits" (where the definition of the application sample set would include the respective sample). This hierarchical search always begins at the last set that the user specified, and not with the all-samples set. Hence, if the user-specified sets encompass all valid samples, the default all-samples set will actually become empty (since all samples will be assigned to one of the user-defined sets).

Shapiro-Wilk W Test. The Shapiro-Wilk W test is used in testing for normality. If the W statistic is significant, then the hypothesis that the respective distribution is normal should be rejected. The Shapiro-Wilk W test is the preferred test of normality because of its good power properties as compared to a wide range of alternative tests (Shapiro, Wilk, & Chen, 1968). Some software programs implement an extension to the test described by Royston (1982), which allows it to be applied to large samples (with up to 5000 observations). See also Kolmogorov-Smirnov test and Lilliefors test.

Shewhart Control Charts. This is a standard graphical tool widely used in statistical Quality Control. The general approach to quality control charting is straightforward: We extract samples of a certain size from the ongoing production process. We then produce line charts of the variability in those samples, and consider their closeness to target specifications. If a trend emerges in those lines, or if samples fall outside pre-specified limits, then the process is declared to be out of control and the operator will take action to find the cause of the problem. These types of charts are sometimes also referred to as Shewhart control charts (named after W. A. Shewhart who is generally credited as being the first to introduce these methods; see Shewhart, 1931). For additional information, see also Quality Control charts; Assignable causes and actions.

Short Run Control Charts. The short run quality control chart, for short production runs, plots transformations of the observations of variables or attributes for multiple parts, each of which constitutes a distinct "run," on the same chart. The transformations rescale the variable values of interest such that they are of comparable magnitudes across the different short production runs (or parts). The control limits computed for those transformed values can then be applied to determine if the production process is in control, to monitor continuing production, and to establish procedures for continuous quality improvement.

Shuffle, Back Propagation (in Neural Networks). Presenting training cases in a random order on each epoch to prevent various undesirable effects that can otherwise occur (such as oscillation and convergence to local minima). See, Neural Networks.

Shuffle Data (in Neural Networks). Randomly assigning cases to the training and verification sets, so that these are (as far as possible) statistically unbiased. See, Neural Networks.

Sigma Restricted Model. A sigma restricted model uses the sigma-restricted coding to represent effects for categorical predictor variables in general linear models and generalized linear models. To illustrate the sigma-restricted coding, suppose that a categorical predictor variable called Gender has two levels (i.e., male and female). Cases in the two groups would be assigned values of 1 or -1, respectively, on the coded predictor variable, so that if the regression coefficient for the variable is positive, the group coded as 1 on the predictor variable will have a higher predicted value (i.e., a higher group mean) on the dependent variable, and if the regression coefficient is negative, the group coded as -1 on the predictor variable will have a higher predicted value on the dependent variable. This coding strategy is aptly called the sigma-restricted parameterization, because the values used to represent group membership (1 and -1) sum to zero. See also, categorical predictor variables, design matrix; or General Linear Models.
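The coding described above can be sketched for a categorical predictor with k levels. The two-level case reproduces the 1 / -1 coding from the Gender example; the extension to k levels (k-1 coded columns, with the last level coded -1 on every column) follows the usual effect-coding scheme and is stated here as an assumption. The function name is hypothetical.

```python
def sigma_restricted_codes(levels):
    """Map each of k levels to k-1 coded values that sum to zero per column."""
    k = len(levels)
    codes = {}
    for i, level in enumerate(levels):
        if i < k - 1:
            # Level i gets a 1 in its own column and 0 elsewhere.
            codes[level] = [1 if j == i else 0 for j in range(k - 1)]
        else:
            # The last level is coded -1 on every column.
            codes[level] = [-1] * (k - 1)
    return codes

gender_codes = sigma_restricted_codes(["male", "female"])
print(gender_codes)   # {'male': [1], 'female': [-1]}

# Hypothetical three-level predictor.
group_codes = sigma_restricted_codes(["low", "medium", "high"])
print(group_codes)    # {'low': [1, 0], 'medium': [0, 1], 'high': [-1, -1]}
```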

Sigmoid Function. An S-shaped curve, with a near-linear central response and saturating limits. See also, logistic function and hyperbolic tangent function.
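As a brief illustration, the two sigmoid functions mentioned here can be evaluated at a few points to show the near-linear central region and the saturating limits. The function name logistic is chosen for this sketch, and very large negative inputs (which would overflow math.exp) are not handled.

```python
import math

def logistic(x):
    """Logistic sigmoid: output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# The hyperbolic tangent (math.tanh) is a sigmoid with output between -1 and 1.
for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    print(f"x={x:5.1f}  logistic={logistic(x):.4f}  tanh={math.tanh(x):.4f}")
```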

Signal Detection Theory (SDT). Signal detection theory (SDT) is an application of statistical decision theory used to detect a signal embedded in noise. SDT is used in psychophysical studies of detection, recognition, and discrimination, and in other areas such as medical research, weather forecasting, survey research, and marketing research.

A general approach to estimating the parameters of the signal detection model is via the use of the generalized linear model. For example, DeCarlo (1998) shows how signal detection models based on different underlying distributions can easily be considered by using the generalized linear model with different link functions.

For discussion of the generalized linear model and the link functions it uses, see the Generalized Linear Models topic.

Simple Random Sampling (SRS). Simple random sampling is a type of probability sampling where observations are randomly selected from a population with a known probability or sampling fraction. Typically, we begin with a list of N observations that comprises the entire population from which we wish to extract a simple random sample (e.g., a list of registered voters); we can then generate k random case numbers (without replacement) in the range from 1 to N, and select the respective cases into the final sample (with a sampling fraction or known selection probability of k/N). Refer to Kish (1965) for a detailed discussion of the advantages and characteristics of probability samples and EPSEM samples.
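The selection step described above can be sketched with Python's standard library. The population size, sample size, seed, and function name are hypothetical choices for illustration.

```python
import random

def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct case numbers from 1..population_size."""
    rng = random.Random(seed)
    # random.sample draws without replacement, so no case is selected twice.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))

N = 1_000_000   # hypothetical population size
k = 100         # hypothetical sample size

cases = simple_random_sample(N, k, seed=42)
sampling_fraction = k / N   # known selection probability for every case
print(len(cases), sampling_fraction)
```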

Simplex Algorithm. A nonlinear estimation algorithm that does not rely on the computation or estimation of the derivatives of the loss function. Instead, at each iteration the function will be evaluated at m+1 points in the m dimensional parameter space. For example, in two dimensions (i.e., when there are two parameters to be estimated), the program will evaluate the function at three points around the current optimum. These three points would define a triangle; in more than two dimensions, the "figure" produced by these points is called a Simplex.
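One widely used derivative-free method that matches this description is the Nelder-Mead downhill simplex procedure. The sketch below is a simplified version of it, written for illustration; whether any particular program uses exactly these update rules is an assumption, and the loss function, starting point, and parameter defaults are hypothetical.

```python
def nelder_mead(f, start, step=1.0, max_iter=1000, tol=1e-8):
    """Minimize f over m parameters using a simplex of m + 1 points."""
    m = len(start)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(start)]
    for i in range(m):
        vertex = list(start)
        vertex[i] += step
        simplex.append(vertex)

    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        size = max(abs(p[i] - best[i]) for p in simplex[1:] for i in range(m))
        if f(worst) - f(best) < tol and size < tol:
            break
        # Centroid of all points except the worst one.
        centroid = [sum(p[i] for p in simplex[:-1]) / m for i in range(m)]

        def along(t):
            # Point on the line through the worst point and the centroid.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            # Reflection is the new best: try expanding further.
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            # Contract outside or inside, depending on the reflected point.
            if f(reflected) < f(worst):
                contracted, limit = along(0.5), f(reflected)
            else:
                contracted, limit = along(-0.5), f(worst)
            if f(contracted) <= limit:
                simplex[-1] = contracted
            else:
                # Shrink every point halfway toward the best point.
                simplex = [best] + [
                    [b + 0.5 * (p - b) for b, p in zip(best, pt)]
                    for pt in simplex[1:]
                ]
    simplex.sort(key=f)
    return simplex[0]

def loss(params):
    # Hypothetical loss function with its minimum at (1, 2).
    return (params[0] - 1.0) ** 2 + (params[1] - 2.0) ** 2

solution = nelder_mead(loss, [0.0, 0.0])
print(solution)
```

With two parameters the simplex is a triangle of three points, as in the example in the text; no derivatives of the loss function are computed at any step.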

Single and Multiple Censoring. There are situations in which censoring can occur at different times (multiple censoring), or only at a particular point in time (single censoring). Consider an example experiment where we start with 100 light bulbs, and terminate the experiment after a certain amount of time. If the experiment is terminated at a particular point in time, then a single point of censoring exists, and the data set is said to be single-censored. However, in biomedical research multiple censoring often exists, for example, when patients are discharged from a hospital after different amounts (times) of treatment, and the researcher knows that the patient survived up to those (differential) points of censoring.

Data sets with censored observations can be analyzed via Survival Analysis or Weibull and Reliability/Failure Time Analysis. See also, Type I and II Censoring and Left and Right Censoring.

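Singular Value Decomposition. A matrix factorization that provides an efficient and numerically stable way of optimizing (fitting) a linear model, for example by computing the least-squares solution via the pseudo-inverse. See also, pseudo-inverse.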

Six Sigma (DMAIC). Six Sigma is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. Six Sigma methodology is based on the combination of well-established statistical quality control techniques, simple and advanced data analysis methods, and the systematic training of all personnel at every level in the organization involved in the activity or process targeted by Six Sigma.

Six Sigma methodology and management strategies provide an overall framework for organizing company-wide quality control efforts. These methods have recently become very popular, due to numerous success stories from major US-based as well as international corporations. For reviews of Six Sigma strategies, refer to Harry and Schroeder (2000), or Pyzdek (2001).

Six Sigma activities are organized into the categories that make up the Six Sigma effort: Define (D), Measure (M), Analyze (A), Improve (I), Control (C); or DMAIC for short.

Define. The Define phase is concerned with the definition of project goals and boundaries, and the identification of issues that need to be addressed to achieve the higher sigma level.

Measure. The goal of the Measure phase is to gather information about the current situation, to obtain baseline data on current process performance, and to identify problem areas.

Analyze. The goal of the Analyze phase is to identify the root cause(s) of quality problems, and to confirm those causes using the appropriate data analysis tools.

Improve. The goal of the Improve phase is to implement solutions that address the problems (root causes) identified during the previous (Analyze) phase.

Control. The goal of the Control phase is to evaluate and monitor the results of the previous phase (Improve).

Six Sigma Process. A six sigma process is one that can be expected to produce only 3.4 defects per one million opportunities. The concept of the six sigma process is important in Six Sigma quality improvement programs. The idea can best be summarized with the following graphs.

The term Six Sigma derives from the goal of achieving a process variation such that ± 6 * sigma (the estimate of the population standard deviation) will "fit" inside the lower and upper specification limits for the process. In that case, even if the process mean shifts by 1.5 * sigma in one direction (e.g., to +1.5 sigma in the direction of the upper specification limit), the process will still produce very few defects.

For example, suppose we expressed the area above the upper specification limit in terms of one million opportunities to produce defects. The 6 * sigma process shifted upwards by 1.5 * sigma will only produce 3.4 defects (i.e., "parts" or "cases" greater than the upper specification limit) per one million opportunities.
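The 3.4 figure can be reproduced from the upper tail of the standard normal distribution: after a 1.5 * sigma shift, the upper specification limit of a 6 * sigma process lies 4.5 standard deviations above the mean. The sketch below assumes a normally distributed process and considers only the upper tail; the function names are hypothetical.

```python
import math

def upper_tail(z):
    """Area under the standard normal curve above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def dpmo(sigma_level, shift=1.5):
    """Defects per million opportunities beyond the upper specification limit."""
    return upper_tail(sigma_level - shift) * 1_000_000

six_sigma_dpmo = dpmo(6.0)
# Percent of the area below the upper specification limit.
yield_percent = 100.0 * (1.0 - six_sigma_dpmo / 1_000_000)
print(f"DPMO: {six_sigma_dpmo:.1f}, yield: {yield_percent:.5f}%")
```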

Shift. An ongoing process that at some point was centered will shift over time. Motorola, in their implementation of Six Sigma strategies, determined that it is reasonable to assume that a process will shift over time by approximately 1.5 * sigma (see, for example, Harry and Schroeder, 2000). Hence, most standard Six Sigma calculators will be based on a 1.5 * sigma shift.

One-sided vs. two-sided limits. In the illustration shown above the area outside the upper specification limit (greater than USL) is defined as one million opportunities to produce defects. Of course, in many cases any "outcomes" (e.g., parts) that are produced that fall below the specification limit can be equally defective. In that case, we may want to consider the lower tail of the respective (shifted) normal distribution as well. However, in practice, we usually ignore the lower tail of the normal curve because (1) in many cases, the process "naturally" has one-sided specification limits (e.g., very low delay times are not really a defect, only very long times; very few customer complaints are not a problem, only very many, etc.), and (2) when a 6 * sigma process has been achieved, the area under the normal curve below the lower specification limit is negligible.

Yield. The illustration shown above focuses on the number of defects that a process produces. The number of non-defects can be considered the Yield of the process. Six Sigma calculators will compute the number of defects per million opportunities (DPMO) as well as the yield, expressed as the percent of the area under the normal curve that falls below the upper specification limit (in the illustration above).

Skewness. Skewness (this term was first used by Pearson, 1895) measures the deviation of the distribution from symmetry. If the skewness is clearly different from 0, then that distribution is asymmetrical, while normal distributions are perfectly symmetrical.

$$\text{Skewness} = \frac{n \cdot M_3}{(n-1)(n-2)\,\sigma^3}$$

where
$M_3$ is equal to $\sum_i (x_i - \bar{x})^3$
$\sigma^3$ is the standard deviation (sigma) raised to the third power
$n$ is the valid number of cases.
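The skewness formula above can be sketched in a few lines. The sample (n-1) standard deviation is assumed for sigma, the data are hypothetical, and at least three valid cases with nonzero variance are required.

```python
import statistics

def skewness(values):
    """Skewness = n*M3 / [(n-1)*(n-2)*sigma^3], with M3 = sum((x - mean)^3)."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: sigma is the sample standard deviation (n-1 denominator).
    sigma = statistics.stdev(values)
    m3 = sum((x - mean) ** 3 for x in values)
    return n * m3 / ((n - 1) * (n - 2) * sigma ** 3)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]

print(skewness(symmetric))      # symmetric data: deviations cancel out
print(skewness(right_skewed))   # long right tail: positive skewness
```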

See also, Descriptive Statistics Overview.

Smoothing. Smoothing techniques can be used in two different situations. Smoothing techniques for 3D Bivariate Histograms allow us to fit surfaces to 3D representations of bivariate frequency data. Thus, every 3D histogram can be turned into a smoothed surface providing a sensitive method for revealing non-salient overall patterns of data and/or identifying patterns to use in developing quantitative models of the investigated phenomenon.

In Time Series analysis, the general purpose of smoothing techniques is to "bring out" the major patterns or trends in a time series, while de-emphasizing minor fluctuations (random noise). Visually, as a result of smoothing, a jagged line pattern should be transformed into a smooth curve.
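As a minimal illustration of the time series case, a centered moving average is one of the simplest smoothers. It is shown here only as an example; the window size, the treatment of the series ends, and the data are assumptions of this sketch.

```python
def moving_average(series, window=3):
    """Centered moving average; the window is truncated at the series ends."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        smoothed.append(sum(series[lo:hi]) / (hi - lo))
    return smoothed

jagged = [1, 5, 1, 5, 1]   # hypothetical series with strong fluctuations
smooth = moving_average(jagged, window=3)
print(smooth)
```

The smoothed values vary over a much narrower range than the original series, which is the "jagged line to smooth curve" effect described above.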

See also, Exploratory Data Analysis and Data Mining Techniques and Smoothing Bivariate Distributions.

SOFMs (Self-Organizing Feature Maps; Kohonen Networks). Neural networks based on the topological properties of the human brain, also known as Kohonen Networks (Kohonen, 1982; Fausett, 1994; Haykin, 1994; Patterson, 1996).

Softmax. A specialized activation function for one-of-N encoded classification networks. Performs a normalized exponential (i.e. the outputs add up to 1). In combination with the cross entropy error function, allows multilayer perceptron networks to be modified for class probability estimation (Bishop, 1995; Bridle, 1990). See, Neural Networks.
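The normalized exponential can be sketched as follows, together with the cross entropy error for a single one-of-N encoded case. The function names and activation values are hypothetical; subtracting the maximum activation before exponentiating is a common numerical safeguard that does not change the result.

```python
import math

def softmax(activations):
    """Normalized exponential: outputs are positive and add up to 1."""
    shift = max(activations)   # guards against overflow in exp
    exps = [math.exp(a - shift) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, target_index):
    """Cross entropy error for one case whose true class is target_index."""
    return -math.log(probabilities[target_index])

# Hypothetical output-layer activations for a three-class network.
probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy(probs, target_index=0))
```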

Space Plots. This type of graph offers a distinctive means of representing 3D Scatterplot data through the use of a separate X-Y plane positioned at a user-selectable level of the vertical Z-axis (which "sticks up" through the middle of the plane).

The specific layout of Space Plots may facilitate exploratory examination of specific types of three-dimensional data. It is recommended to assign variables to axes such that the variable that is most likely to discriminate between patterns of relation among the other two is specified as Z. See also, Data Rotation (in 3D space) in the Graphical Techniques topic.

Spearman R. Spearman R can be interpreted in the same way as the regular Pearson product-moment correlation coefficient (Pearson r), that is, in terms of the proportion of variability accounted for, except that Spearman R is computed from ranks. Spearman R assumes that the variables under consideration were measured on at least an ordinal (rank order) scale; that is, the individual observations (cases) can be ranked into two ordered series. Detailed discussions of the Spearman R statistic, its power and efficiency can be found in Gibbons (1985), Hays (1981), McNemar (1969), Siegel (1956), Siegel and Castellan (1988), Kendall (1948), Olds (1949), or Hotelling and Pabst (1936).
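
The relation to Pearson r can be illustrated with a short Python sketch that ranks both variables and then computes the ordinary Pearson correlation on the ranks. The function names and the example data are illustrative assumptions; tied values receive the average of their ranks, which is one common convention.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_r(x, y):
    """Spearman R computed as Pearson r on the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 100]  # monotone in x, but far from linear
```

For these made-up data the relation is perfectly monotone, so spearman_r(x, y) equals 1, while pearson_r(x, y) is lower because the relation is not linear.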

Spectral Plot. The original application of this type of plot was in the context of spectral analysis in order to investigate the behavior of non-stationary time series. On the horizontal axes, we can plot the frequency of the spectrum against consecutive time intervals, and indicate on the Z-axis the spectral densities at each interval (see for example, Shumway, 1988, page 82).

Spectral plots have clear advantages over the regular 3D Scatterplots when we are interested in examining how a relationship between two variables changes across the levels of a third variable, as is shown in the next illustration. The advantage of Spectral Plots over regular 3D Scatterplots is well-illustrated in the comparison of the two displays of the same data set shown below.

The Spectral Plot makes it easier to see that the relationship between Pressure and Yield changes from an "inverted U" to a "U". See also, Data Rotation (in 3D space) in Graphical Techniques.

Spikes (3D Graphs). In this type of graph, individual values of one or more series of data are represented along the X-axis as a series of "spikes" (point symbols with lines descending to the base plane). Each series to be plotted is spaced along the Y-axis. The "height" of each spike is determined by the respective value of each series.

Spline (2D Graphs). A curve is fitted to the XY coordinate data using the bicubic spline smoothing procedure.

Spline (3D Graphs). A surface is fitted to the XYZ coordinate data using the bicubic spline smoothing procedure.

Split Selection (for Classification Trees). Split selection for classification trees refers to the process of selecting the splits on the predictor variables that are used to predict membership in the classes of the dependent variable for the cases or objects in the analysis. Given the hierarchical nature of classification trees, these splits are selected one at time, starting with the split at the root node, and continuing with splits of resulting child nodes until splitting stops, and the child nodes that have not been split become terminal nodes. The split selection process is described in the Computational Methods section of Classification Trees.

Spurious Correlations. Correlations that are due mostly to the influences of one or more "other" variables. For example, there is a correlation between the total amount of losses in a fire and the number of firemen that were putting out the fire; however, what this correlation does not indicate is that if we call fewer firemen, we would lower the losses. There is a third variable (the initial size of the fire) that influences both the amount of losses and the number of firemen. If we "control" for this variable (e.g., consider only fires of a fixed size), the correlation will either disappear or perhaps even change its sign. The main problem with spurious correlations is that we typically do not know what the "hidden" agent is. However, in cases when we know where to look, we can use partial correlations that control for (i.e., partial out) the influence of specified variables. See also, Correlation, Partial Correlation, Basic Statistics, Multiple Regression, Structural Equation Modeling (SEPATH).
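
The effect of controlling for a third variable can be sketched with the standard first-order partial correlation formula. The data below are invented for illustration only (they are not real fire statistics), and the function names are assumptions.

```python
import math


def pearson_r(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after partialling out z."""
    r_xy, r_xz, r_yz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical data: both firemen and losses follow the size of the fire.
fire_size = [1, 2, 3, 4, 5, 6, 7, 8]
firemen = [1.5, 1.5, 3.5, 3.5, 5.5, 5.5, 7.5, 7.5]
losses = [1.5, 2.5, 2.5, 3.5, 5.5, 6.5, 6.5, 7.5]

raw = pearson_r(firemen, losses)
controlled = partial_r(firemen, losses, fire_size)
```

In this constructed example the raw correlation is high, whereas the partial correlation is close to zero and slightly negative, in line with the description above.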

SQL. SQL (Structured Query Language) enables us to query an outside data source about the data it contains. We can use a SQL statement in order to specify the desired tables, fields, rows, etc. to return as data. For information on SQL syntax, consult an SQL manual.
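
As a small illustration, the following Python sketch uses the standard sqlite3 module to issue a SQL statement that selects specific fields and rows from a table. The table name measurements, its columns, and its contents are hypothetical and exist only in this in-memory example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory data source
conn.execute("CREATE TABLE measurements (id INTEGER, grp TEXT, yield_pct REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?)",
    [(1, "A", 71.5), (2, "A", 68.0), (3, "B", 80.2), (4, "B", 77.9)],
)

# Specify the desired fields (id, yield_pct) and rows (group B only).
rows = conn.execute(
    "SELECT id, yield_pct FROM measurements WHERE grp = ? ORDER BY id",
    ("B",),
).fetchall()
conn.close()
```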

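Square Root of the Signal to Noise Ratio (f). This standardized measure of effect size is used in the Analysis of Variance to characterize the overall level of population effects, and is very similar to the RMSSE. It is the square root of the sum of squared standardized effects divided by the number of effects. For example, in a 1-Way ANOVA, with J groups, f is calculated as

$$f = \left[ \frac{1}{J} \sum_{j=1}^{J} \left( \frac{\alpha_j}{\sigma} \right)^2 \right]^{1/2}$$

where α_j = µ_j - µ is the effect of the jth group (the group mean minus the grand mean) and σ is the common within-group standard deviation.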

For more information, see Power Analysis.
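
A minimal Python sketch of this calculation for the 1-Way ANOVA case follows. It assumes that each effect is the group mean minus the grand mean, that groups are equally sized (so the grand mean is the plain average of the group means), and that sigma is the common within-group standard deviation; the function name and the numbers are illustrative.

```python
import math


def effect_size_f(group_means, sigma):
    """Square root of the mean squared standardized effect."""
    j = len(group_means)
    grand_mean = sum(group_means) / j  # assumes equal group sizes
    standardized = [(m - grand_mean) / sigma for m in group_means]
    return math.sqrt(sum(e ** 2 for e in standardized) / j)


f = effect_size_f([10.0, 12.0, 14.0], sigma=4.0)  # made-up population values
```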

Stacked Generalization. See Stacking.

Stacking (Stacked Generalization). The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000).

In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

Other methods for combining the predictions from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).
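
The idea of stacking can be sketched in Python with two deliberately simple base classifiers and a meta-learner. Note the substitution: instead of a neural network meta-classifier, this sketch uses a lookup table that maps each pattern of base predictions to the majority class observed in a hold-out sample. All names, thresholds, and data are hypothetical.

```python
from collections import Counter, defaultdict


def base_a(row):
    """Hypothetical base classifier using only the first predictor."""
    return int(row[0] > 0.5)


def base_b(row):
    """Hypothetical base classifier using only the second predictor."""
    return int(row[1] > 0.5)


BASE_MODELS = [base_a, base_b]


def base_predictions(row):
    """Predictions of all base models, used as input to the meta-learner."""
    return tuple(model(row) for model in BASE_MODELS)


def fit_meta_learner(rows, labels):
    """Learn the majority class for each pattern of base predictions."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[base_predictions(row)][label] += 1
    return {pattern: c.most_common(1)[0][0] for pattern, c in votes.items()}


def stacked_predict(meta_table, row, default=0):
    """Final classification; unseen patterns fall back to a default class."""
    return meta_table.get(base_predictions(row), default)


def accuracy(predict, rows, labels):
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


# Made-up hold-out sample: the class is 1 only when both predictors are high.
holdout_rows = [(0.2, 0.3), (0.9, 0.1), (0.1, 0.8), (0.7, 0.9), (0.8, 0.6), (0.6, 0.2)]
holdout_labels = [0, 0, 0, 1, 1, 0]
meta_table = fit_meta_learner(holdout_rows, holdout_labels)

test_rows = [(0.95, 0.95), (0.9, 0.2), (0.3, 0.7), (0.1, 0.1)]
test_labels = [1, 0, 0, 0]
stacked_acc = accuracy(lambda r: stacked_predict(meta_table, r), test_rows, test_labels)
```

In this constructed example each base classifier alone misclassifies one of the four test cases, while the stacked classifier combines the two predictions correctly.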

Standard Deviation. The standard deviation (this term was first used by Pearson, 1894) is a commonly-used measure of variation. The standard deviation of a population of values is computed as:

$$\sigma = \left[ \sum (x_i - \mu)^2 / N \right]^{1/2}$$

where
µ     is the population mean
N    is the population size.

The sample estimate of the population standard deviation is computed as:

$$s = \left[ \sum (x_i - \bar{x})^2 / (n - 1) \right]^{1/2}$$

where
x-bar   is the sample mean
n        is the sample size.

See also, Descriptive Statistics Overview.
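
The two computations can be checked with a short Python sketch. The function names and the data are illustrative; the standard library functions statistics.pstdev and statistics.stdev implement the population and sample versions, respectively.

```python
import math
import statistics


def population_sd(values):
    """Divide the sum of squared deviations by N."""
    mu = sum(values) / len(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def sample_sd(values):
    """Divide the sum of squared deviations by n - 1."""
    xbar = sum(values) / len(values)
    return math.sqrt(sum((x - xbar) ** 2 for x in values) / (len(values) - 1))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
sigma = population_sd(data)
s = sample_sd(data)
```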

Standard Error. The standard error (this term was first used by Yule, 1897) is the standard deviation of a mean and is computed as:

$$\text{std.err.} = \sqrt{s^2 / n}$$

where
s² is the sample variance
n is the sample size.

Standard Error of the Mean. The standard error of the mean (first used by Yule, 1897) is the theoretical standard deviation of all sample means of size n drawn from a population and depends on both the population variance (σ²) and the sample size (n) as indicated below:

$$\sigma_{\bar{x}} = (\sigma^2 / n)^{1/2}$$

where
σ²   is the population variance and
n      is the sample size.

Since the population variance is typically unknown, the best estimate for the standard error of the mean is then calculated as:

$$s_{\bar{x}} = (s^2 / n)^{1/2}$$

where
s²    is the sample variance (our best estimate of the population variance) and
n    is the sample size.

See also, Descriptive Statistics Overview.
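
A minimal Python sketch of the estimated standard error of the mean follows; the function name and the sample values are illustrative assumptions.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Estimate the standard error of the mean as sqrt(s^2 / n)."""
    s2 = statistics.variance(sample)  # sample variance (n - 1 denominator)
    return math.sqrt(s2 / len(sample))


sample = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample of size 8
se = standard_error_of_mean(sample)
```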

Standard Error of the Proportion. This is the standard deviation of the distribution of the sample proportion over repeated samples. If the population proportion is p, and the sample size is N, the standard error of the proportion when sampling from an infinite population is

$$s_p = \left( p(1 - p) / N \right)^{1/2}$$

For more information, see Power Analysis.
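
As a brief numerical illustration in Python (the function name and the values are assumptions), quadrupling the sample size halves the standard error of the proportion:

```python
import math


def standard_error_of_proportion(p, n):
    """sqrt(p * (1 - p) / n) for sampling from an infinite population."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(p * (1.0 - p) / n)


se_100 = standard_error_of_proportion(0.5, 100)
se_400 = standard_error_of_proportion(0.5, 400)
```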

Standardization. While in everyday language the term "standardization" means converting to a common standard or making something conform to a standard (i.e., its meaning is similar to the term normalization in data analysis), in statistics this term has a very specific meaning and refers to the transformation of data by subtracting some reference value (typically a sample mean) from each value and dividing the result by the standard deviation (typically a sample SD). This important transformation will bring all values (regardless of their distributions and original units of measurement) to compatible units from a distribution with a mean of 0 and a standard deviation of 1.

This transformation has a wide variety of applications because it makes the distributions of values easy to compare across variables and/or subsets. If applied to the input data, standardization also makes the results of a variety of statistical techniques entirely independent of the ranges of values or the units of measurements (see the discussion of these issues in Elementary Concepts, Basic Statistics, Multiple Regression, Factor Analysis, and others).
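
The independence from units of measurement can be demonstrated with a small Python sketch that standardizes the same made-up heights expressed in centimeters and in inches. The function name and data are illustrative, and the sample mean and sample SD are used as the reference values.

```python
import statistics


def standardize(values):
    """Subtract the sample mean and divide by the sample SD."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_in = [h / 2.54 for h in heights_cm]  # same data, different units

z_cm = standardize(heights_cm)
z_in = standardize(heights_in)
```

Both z_cm and z_in have a mean of 0 and a standard deviation of 1, and the two lists agree value by value up to rounding error.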

Standardized DFFITS. This is another measure of impact of the respective case on the regression equation. The formula for standardized DFFITS is

$$\mathrm{SDFIT}_i = \mathrm{DFFIT}_i / \left( s_i \, \nu_i^{1/2} \right)$$

where h_i is the leverage for the ith case and

$$\nu_i = 1/N + h_i$$

See also, DFFITS, studentized residuals, and studentized deleted residuals. For more information see Hocking (1996) and Ryan (1997).

Standardized Effect (Es). A statistical effect expressed in convenient standardized units. For example, the standardized effect in a 2 Sample t-test is the difference between the two means, divided by the standard deviation, i.e.,

Es = (µ1 - µ2)/s

For more information, see Power Analysis.

Standard Residual Value. This is the standardized residual value (observed minus predicted divided by the square root of the residual mean square). See also, Mahalanobis distance, deleted residual and Cook’s distance.

Stationary Series (in Time Series). In Time Series analysis, a stationary series has a constant mean, variance, and autocorrelation through time (i.e., seasonal dependencies have been removed via Differencing).
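
Differencing can be sketched in Python as follows. The series is artificial (a linear trend plus a fixed seasonal pattern of period 4), and the function name is an assumption; differencing at the seasonal lag removes both the trend and the seasonal dependency in this idealized case.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


seasonal_pattern = [3, -1, -4, 2]  # made-up seasonal component, period 4
series = [2 * t + seasonal_pattern[t % 4] for t in range(16)]

deseasonalized = difference(series, lag=4)
```

After lag-4 differencing every value equals 8 (the trend slope of 2 times the lag of 4), so the mean of the differenced series no longer changes over time.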

STATISTICA Advanced Linear/Nonlinear Models. StatSoft's STATISTICA Advanced Linear/Nonlinear Models offers a wide array of the most advanced linear and nonlinear modeling tools on the market, supports continuous and categorical predictors, interactions, hierarchical models; automatic model selection facilities; also, includes variance components, time series, and many other methods; all analyses include extensive, interactive graphical support and built-in complete Visual Basic scripting.

  • General Linear Models
  • Generalized Linear/Nonlinear Models
  • General Regression Models
  • General Partial Least Squares Models
  • Variance Components
  • Survival Analysis
  • Nonlinear Estimation
  • Fixed Nonlinear Regression
  • Log-Linear Analysis of Frequency Tables
  • Time Series/Forecasting
  • Structural Equation Modeling, and more

STATISTICA Automated Neural Networks. StatSoft's STATISTICA Automated Neural Networks (SANN) contains a comprehensive array of statistics, charting options, network architectures, and training algorithms; C and PMML (Predictive Model Markup Language) code generators. The C code generator is an add-on.

  • Automatic Search for Best Architecture and Network Solutions
  • Multilayer Perceptrons
  • Radial Basis Function Networks
  • Self-Organizing Feature Maps
  • Time Series Neural Networks for both Regression and Classification problems
  • A variety of algorithms for fast and efficient training of Neural Network Models including Gradient Descent, Conjugate Gradient, and BFGS
  • Numerous analytical graphs to aid in generating results and drawing conclusions
  • Sampling of data into subsets for optimizing network performance and enhancing the generalization ability
  • Sensitivity Analysis, Lift Charts, and ROC Curves
  • Creation of Ensembles out of already existing standalone networks
  • C-code and PMML (Predictive Model Markup Language) Neural Network Code Generators that are easy to deploy

STATISTICA Base. StatSoft's STATISTICA Base offers a comprehensive set of essential statistics in a user-friendly package with flexible output management and Web enablement features; it also includes all STATISTICA graphics tools and a comprehensive Visual Basic development environment.

  • All STATISTICA graphics tools
  • Basic Statistics, Breakdowns, and Tables
  • Distribution Fitting
  • Multiple Linear Regression
  • Analysis of Variance
  • Nonparametrics, and more

STATISTICA Data Miner. StatSoft's STATISTICA Data Miner contains the most comprehensive selection of data mining solutions on the market, with an icon-based, extremely easy-to-use user interface. It features a selection of completely integrated, and automated, ready to deploy "as is" (but also easily customizable) specific data mining solutions for a wide variety of business applications. The data mining solutions are driven by powerful procedures from five modules, which can also be used interactively and/or used to build, test, and deploy new solutions:

  • General Slicer/Dicer Explorer (with optional OLAP)
  • General Classifier
  • General Modeler/Multivariate Explorer
  • General Forecaster
  • General Neural Networks Explorer, and more

STATISTICA Data Warehouse. StatSoft's STATISTICA Data Warehouse is the ultimate high-performance, scalable system for intelligent management of unlimited amounts of data, distributed across locations worldwide. STATISTICA Data Warehouse consists of a suite of powerful, flexible component applications, including:

  • STATISTICA Data Warehouse Server Database
  • STATISTICA Data Warehouse Query (featuring STATISTICA Query)
  • STATISTICA Data Warehouse Analyzer (featuring STATISTICA Data Miner, STATISTICA Text Miner, STATISTICA Process Optimization and Root Cause Analysis, or the complete set of STATISTICA Enterprise Server analytics)
  • STATISTICA Data Warehouse Reporter (featuring STATISTICA Enterprise Server Knowledge Portal)
  • STATISTICA Data Warehouse Document Repository (featuring STATISTICA Document Management System)
  • STATISTICA Data Warehouse Scheduler
  • STATISTICA Data Warehouse Real Time Monitor and Reporter (featuring STATISTICA Enterprise Server)

STATISTICA Document Management System (SDMS). StatSoft's STATISTICA Document Management System (SDMS) is a scalable solution for flexible, productivity-enhancing management of local or Web-based document repositories (FDA/ISO compliant).

  • Extremely transparent and easy to use
  • Flexible, customizable (optionally browser/Web-enabled) user interface
  • Electronic signatures
  • Comprehensive auditing trails, approvals
  • Optimized searches
  • Document comparison tools
  • Security
  • Satisfies the FDA 21 CFR Part 11 requirements
  • Satisfies ISO 9000 (9001, 14001) documentation requirements
  • Unlimited scalability (from desktop or network Client-Server versions, to the ultimate size, Web-based worldwide systems)
  • Open architecture and compatibility with industry standards

STATISTICA Enterprise. StatSoft's STATISTICA Enterprise is an integrated multi-user software system designed for general purpose data analysis and business intelligence applications in research, marketing, finance, and other industries. STATISTICA Enterprise provides an efficient interface to enterprise-wide data repositories and a means for collaborative work as well as all the statistical functionality available in any or all STATISTICA products.

  • Integration with data warehouses
  • Intuitive query and filtering tools
  • Easy-to-use administration tools
  • Automatic report distribution
  • Alarm notification, and more

STATISTICA Enterprise/QC. StatSoft's STATISTICA Enterprise/QC is designed for local and global enterprise quality control and improvement applications including Six Sigma. STATISTICA Enterprise/QC offers a high-performance database (or an optimized interface to existing databases), real-time monitoring and alarm notification for the production floor, a comprehensive set of analytical tools for engineers, sophisticated reporting features for management, Six Sigma reporting options, and much more.

  • Web-enabled user interface and reporting tools; interactive querying tools
  • User-specific interfaces for operators, engineers, etc.
  • Groupware functionality for sharing queries, special applications, etc.
  • Open-ended alarm notification including cause/action prompts
  • Scalable, customizable, and can be integrated into existing database/ERP systems, and more

STATISTICA Enterprise Server. StatSoft's STATISTICA Enterprise Server is the ultimate enterprise system that offers full Web enablement, including the ability to run STATISTICA interactively or in batch from a Web browser on any computer (including Linux, UNIX), offload time consuming tasks to the servers (using distributed processing), use multi-tier Client-Server architecture, and manage projects over the Web (supporting multithreading and distributed/parallel processing that scales to multiple server computers).

STATISTICA Monitoring and Alerting Server (MAS). StatSoft's STATISTICA Monitoring and Alerting Server (MAS) is a system that enables users to automate the continual monitoring of hundreds or thousands of critical process and product parameters. The ongoing monitoring is an automated and efficient method for:

  • Monitoring many critical parameters simultaneously
  • Providing status "snapshots" from the results of these monitoring activities to personnel based on their responsibilities.
  • Dashboards associated with User/Group

STATISTICA MultiStream. StatSoft's STATISTICA MultiStream is a solution package for identifying and implementing effective strategies for advanced multivariate process monitoring and control. STATISTICA MultiStream was designed for process industries in general. MultiStream is well suited to help pharmaceutical manufacturers and power generation facilities leverage the data collected into their existing specialized process data bases for multivariate and predictive process control, for actionable advisory systems.

STATISTICA MultiStream is a complete enterprise system built on a robust, advanced client-server (and fully Web-enabled) architecture, offers central administration and management of deployment of models, as well as cutting edge root-cause analysis and predictive data mining technology, and its analytics are seamlessly integrated with a built-in document management system.

STATISTICA Multivariate Statistical Process Control (MSPC). StatSoft's STATISTICA Multivariate Statistical Process Control (MSPC) is a complete solution for multivariate statistical process control, deployed within a scalable, secure analytics software platform.

  • Univariate and multivariate statistical methods for quality control, predictive modeling, and data reduction
  • Functions to determine the most critical process, raw materials, and environment factors and their optimal settings for delivering products of the highest quality
  • Monitoring of process characteristics interactively or automatically during production stages
  • Building, evaluating, and deploying predictive models based on the known outcomes from historical data
  • Historical analysis, data exploration, data visualization, predictive model building and evaluation, model deployment to monitoring server
  • Interactive monitoring with dashboard summary displays and automatic-updating results
  • Automated monitoring with rules, alarm events, and configurable actions
  • Multivariate techniques including Partial Least Squares, Principal Components, Neural Networks, Recursive Partitioning (Tree) Methods, Support Vector Machines, Independent Components Analysis, Cluster Analysis, and more

STATISTICA PI Connector. StatSoft's STATISTICA PI Connector is an optional STATISTICA add-on component that allows for direct integration to data stored in the PI data historian. The STATISTICA PI Connector utilizes the PI user access control and security model, allows for interactive browsing of tags, and takes advantage of dedicated PI functionality for interpolation and snapshot data. STATISTICA integrated with the PI system is being used for streamlined and automated analyses for applications such as Process Analytical Technology (PAT) in FDA-regulated industries, Advanced Process Control (APC) systems in Chemical and Petrochemical industries, and advisory systems for process optimization and compliance in the Energy Utility industry.

STATISTICA PowerSolutions. StatSoft's STATISTICA PowerSolutions is a solution package aimed for use at power generation companies to optimize power plant performance, increase efficiency, and reduce emissions. This product offers a highly economical alternative to multimillion dollar investments in new or upgraded equipment (hardware). Based on more than 20 years of experience in applying advanced data driven, predictive data mining/optimization technologies for process optimization in various industries, STATISTICA PowerSolutions enables power plants to get the most out of their existing equipment and control systems by leveraging all data collected at their sites to identify opportunities for improvement, even for older designs such as coal-fired Cyclone furnaces (as well as wall-fired or T-fired designs).

STATISTICA Process Optimization. StatSoft's STATISTICA Process Optimization is a powerful software solution designed to monitor processes and identify and anticipate problems related to quality control and improvement with unmatched sensitivity and effectiveness. STATISTICA Process Optimization integrates all Quality Control Charts, Process Capability Analyses, Experimental Design procedures, and Six Sigma methods with a comprehensive library of cutting-edge techniques for exploratory and predictive data mining.

  • Predict QC problems with cutting edge data mining methods
  • Discover root causes of problem areas
  • Monitor and improve ROI (Return On Investment)
  • Generate suggestions for improvement
  • Monitor processes in real time over the Web
  • Create and deploy QC/SPC solutions over the Web
  • Use multithreading and distributed processing to rapidly process extremely large streams of data

STATISTICA Quality Control Charts. StatSoft's STATISTICA Quality Control Charts offers fully customizable (e.g., callable from other environments), easy and quick to use, versatile charts with a selection of automation options and user-interface shortcuts to simplify routine work (a comprehensive tool for Six Sigma methods).

  • Multiple Chart (Six Sigma Style) Reports and displays
  • X-bar and R Charts; X-bar and S Charts; Np, P, U, C Charts
  • Pareto Charts
  • Process Capability and Performance Indices
  • Moving Average/Range Charts, EWMA Charts
  • Short Run Charts (including Nominal and Target)
  • CuSum (Cumulative Sum) Charts
  • Runs Tests
  • Interactive
  • Causes and actions, customizable alarms, analytic brushing, and more

STATISTICA Sequence, Association and Link Analysis (SAL). StatSoft's STATISTICA Sequence, Association and Link Analysis (SAL) is designed to address the needs of clients in retailing, banking, insurance, and similar industries by implementing the fastest known highly scalable algorithm with the ability to derive Association and Sequence rules in one single analysis. The program represents a stand-alone module that can be used for both model building and deployment. All tools in STATISTICA Data Miner can be quickly and effortlessly leveraged to analyze and "drill into" results generated via STATISTICA SAL.

  • Uses a tree-building technique to extract Association and Sequence rules from data
  • Uses efficient and thread-safe local relational database technology to store Association and Sequence models
  • Handles multiple response, multiple dichotomy, and continuous variables in one analysis
  • Performs Sequence Analysis while mining for Association rules in a single analysis
  • Simultaneously extracts Association and Sequence rules for more than one dimension
  • Given the ability to perform multidimensional Association and Sequence mining and the capacity to extract only rules for specific items, the program can be used for Predictive Data Mining
  • Performs Hierarchical Single-Linkage Cluster Analysis, which can detect the more likely cluster of items that can occur. This has extremely useful, practical real-world applications, e.g., in retailing

STATISTICA Text Miner. StatSoft's STATISTICA Text Miner is an optional extension of STATISTICA Data Miner. The program features a large selection of text retrieval, pre-processing, and analytic and interpretive mining procedures for unstructured text data (including Web pages), with numerous options for converting text into numeric information (for mapping, clustering, predictive data mining, etc.) and language-specific stemming algorithms. Because of STATISTICA’s flexible data import options, the methods available in STATISTICA Text Miner can also be useful for processing other unstructured input (e.g., image files imported as data matrices, etc.).

STATISTICA Variance Estimation and Precision. StatSoft's STATISTICA Variance Estimation and Precision offers a comprehensive set of techniques for analyzing data from experiments that include both fixed and random effects using REML (Restricted Maximum Likelihood Estimation). With STATISTICA Variance Estimation and Precision, you can obtain estimates of variance components and use them to make precision statements while at the same time comparing fixed effects in the presence of multiple sources of variation.

  • Variability plots
  • Multiple plot layouts to allow direct comparison of multiple dependent variables
  • Expected mean squares and variance components with confidence intervals
  • Flexible handling of multiple dependent variables
  • Graph displays of variance components

Statistical Power. The probability of rejecting a false statistical null hypothesis. For more information, see Power Analysis.

Statistical Process Control (SPC). The term Statistical Process Control (SPC) is typically used in the context of manufacturing processes (although it may also pertain to services and other activities), and it denotes statistical methods used to monitor and improve the quality of the respective operations. By gathering information about the various stages of the process and performing statistical analysis on that information, the SPC engineer is able to take necessary action (often preventive) to ensure that the overall process stays in-control and to allow the product to meet all desired specifications.

SPC involves monitoring processes, identifying problem areas, recommending methods to reduce variation and verifying that they work, optimizing the process, assessing the reliability of parts, and other analytic operations. SPC uses such basic statistical quality control methods as quality control charts (Shewhart, Pareto, and others), capability analysis, gage repeatability/reproducibility analysis, and reliability analysis. However, specialized experimental methods (DOE) and other advanced statistical techniques are also often part of global SPC systems. Important components of effective, modern SPC systems are real-time access to data and facilities to document and respond to incoming QC data on-line, efficient central QC data warehousing, and groupware facilities allowing QC engineers to share data and reports (see also, Enterprise SPC).

See also, Quality Control and Process Analysis. For more information on process control systems, see the ASQC/AIAG's Fundamental statistical process control reference manual (1991).

Statistical Significance (p-value). The statistical significance of a result is an estimated measure of the degree to which it is "true" (in the sense of "representative of the population"). More technically, the value of the p-value represents a decreasing index of the reliability of a result. The higher the p-value, the less we can believe that the observed relation between variables in the sample is a reliable indicator of the relation between the respective variables in the population. Specifically, the p-value represents the probability of error that is involved in accepting our observed result as valid, that is, as "representative of the population." For example, the p-value of .05 (i.e., 1/20) indicates that there is a 5% probability that the relation between the variables found in our sample is a "fluke." In other words, assuming that in the population there was no relation between those variables whatsoever, and we were repeating experiments like ours one after another, we could expect that in approximately one out of every 20 replications of the experiment the relation between the variables in question would be equal to or stronger than in ours. In many areas of research, the p-value of .05 is customarily treated as a "border-line acceptable" error level. See also, Elementary Concepts.
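
The "repeated experiments with no relation in the population" idea can be imitated with a permutation test, sketched below in Python. This is one illustrative way of obtaining a p-value, not the method used by any particular program: shuffling one variable destroys any relation, and the p-value is estimated as the share of shuffles in which the correlation is at least as strong as the observed one. The data, the number of shuffles, and the seed are arbitrary assumptions.

```python
import random


def pearson_r(x, y):
    """Ordinary product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_p_value(x, y, n_shuffles=2000, seed=0):
    """Share of shuffled 'replications' with a relation as strong as observed."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    at_least_as_strong = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # simulates "no relation in the population"
        if abs(pearson_r(x, shuffled)) >= observed:
            at_least_as_strong += 1
    # The +1 terms count the observed arrangement itself.
    return (at_least_as_strong + 1) / (n_shuffles + 1)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3, 6.8, 8.2, 9.1, 9.7]  # made-up data
p_value = permutation_p_value(x, y)
```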

Steepest Descent Iterations. When initial values for the parameters are far from the ultimate minimum, the approximate Hessian used in the Gauss-Newton procedure may fail to yield a proper step direction during iteration. In this case, the program may iterate into a region of the parameter space from which recovery (i.e., successful iteration to the true minimum point) is not possible. One option offered by Structural Equation Modeling is to precede the Gauss-Newton procedure with a few iterations utilizing the "method of steepest descent." In the steepest descent approach, values of the parameter vector θ on each iteration are obtained as

$$\theta_{k+1} = \theta_k + \alpha_k g_k$$

where g_k is the gradient-based step direction and α_k is the step length at iteration k.

In simple terms, what this means is that the Hessian is not used to help find the direction for the next step. Instead, only the first derivative information in the gradient is used.

Hint for beginners. Inserting a few Steepest Descent Iterations may help in situations where the iterative routine "gets lost" after only a few iterations.
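
A minimal Python sketch of steepest descent iterations on a simple two-parameter function is shown below. The objective function, the fixed step length, and the starting values are illustrative assumptions; here the step is written explicitly as a move against the gradient, and no Hessian information is used.

```python
def steepest_descent(gradient, theta, step=0.1, iterations=100):
    """Repeatedly move the parameter vector against the gradient."""
    for _ in range(iterations):
        g = gradient(theta)
        theta = [t - step * gi for t, gi in zip(theta, g)]
    return theta


def gradient(theta):
    """Gradient of F(theta) = (theta0 - 3)^2 + 2 * (theta1 + 1)^2."""
    return [2 * (theta[0] - 3), 4 * (theta[1] + 1)]


# Start far from the minimum, which lies at (3, -1).
solution = steepest_descent(gradient, [10.0, 10.0])
```

In practice a few such iterations would only be used to reach a region where a Hessian-based procedure such as Gauss-Newton behaves well; this sketch simply runs steepest descent to convergence.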

Stemming. An important pre-processing step before indexing input documents for text mining is the stemming of words. The term stemming refers to the reduction of words to their roots so that, for example, different grammatical forms or conjugations of verbs are identified and indexed (counted) as the same word. For example, stemming will ensure that both "travel" and "traveled" will be recognized by the program as the same word. For more information, see Manning and Schütze (2002).
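
The effect of stemming on word counts can be illustrated with a deliberately naive suffix-stripping sketch in Python. This is not a real language-specific stemming algorithm; the suffix list, the minimum stem length, and the function name are simplifying assumptions.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "es", "s")  # tiny illustrative list for English


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


tokens = ["travel", "traveled", "Traveling", "travels", "map"]
counts = Counter(naive_stem(token) for token in tokens)
```

After stemming, the four forms of "travel" are indexed (counted) as the same word.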

Steps. Repetitions of a particular analytic or computational operation or procedure. For example, in neural network time series analysis, Steps is the number of consecutive time steps from which input variable values are drawn to be fed into the neural network input units.

Stepwise Regression. A model-building technique that finds subsets of predictor variables that most adequately predict responses on a dependent variable by linear (or nonlinear) regression, given the specified criteria for adequacy of model fit.

For an overview of stepwise regression and model fit criteria see General Stepwise Regression or Multiple Regression; for nonlinear stepwise and best subset regression, see Generalized Linear Models.

Stiffness Parameter (in Fitting Options). The function that controls the weight is determined by the Stiffness parameter, which can be modified. Thus, the stiffness parameter determines the degree to which the fitted curve depends on local configurations of the analyzed values.

The lower the coefficient, the more the shape of the curve is influenced by individual data points (i.e., the curve "bends" more to accommodate individual values and subsets of values). The range of the stiffness parameter is 0 < s < 1. Large values of the parameter produce smoother curves that adequately represent the overall pattern in the data set at the expense of local details. See also, McLain, 1974.

Stopping Conditions. During an iterative process (e.g., fitting, searching, training), the conditions that must be true for the process to stop. For example, in neural networks, the stopping conditions include the maximum number of epochs, the target error performance, and the minimum error improvement thresholds.

Stopping Conditions (in Neural Networks). The iterative gradient-descent training algorithms (back propagation, Quasi-Newton, conjugate gradient descent, Levenberg-Marquardt, quick propagation, Delta-bar-Delta, and Kohonen) all attempt to reduce the training error on each epoch.

We specify a maximum number of epochs for these iterative algorithms. However, we can also define stopping conditions that may cause training to terminate earlier.

Specifically, training may be stopped when:

  • the error drops below a given level;

  • the error fails to improve by a given amount over a given number of epochs.

The conditions are cumulative; i.e., if several stopping conditions are specified, training ceases when any one of them is satisfied. In particular, a maximum number of epochs must always be specified. The error-based stopping conditions can also be specified independently for the error on the training set and the error on the selection set (if any).

Target Error. We can specify a target error level, for the training subset, the selection subset, or both. If the RMS falls below this level, training ceases.

Minimum Improvement. Specifies that the RMS error on the training subset, the selection subset, or both must improve by at least this amount, or training will cease (if the Window parameter is non-zero).

Sometimes error improvement may slow down for a while or even rise temporarily (particularly if the shuffle option is used with back propagation, or non-zero noise is specified, as these both introduce an element of noise into the training process). To prevent this option from aborting the run prematurely, specify a longer Window. It is particularly recommended to monitor the selection error for minimum improvement, as this helps to prevent over-learning.

Specify a negative improvement threshold if you want to stop training only when a significant deterioration in the error is detected. The algorithm will stop when a number of epochs (given by the Window) pass during which the error is always at least the given amount worse than the best it ever achieved.

Window. The window factor is the number of epochs across which the error must fail to improve by the specified amount, before the algorithm is deemed to have slowed down too much and is stopped. By default the window is zero, which means that the minimum improvement stopping condition is not used at all.
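
The way these conditions combine can be sketched in Python. This is one possible reading of the description, not a reproduction of any particular program: the function name `epochs_run`, its parameters, and the choice to compare each epoch's error against the best error seen so far are all assumptions made for this example.

```python
def epochs_run(errors, max_epochs, target_error=None,
               min_improvement=0.0, window=0):
    """Return (epoch at which training stops, reason).

    `errors` holds one error value per epoch (e.g., RMS on the selection subset).
    """
    best = float("inf")   # best (lowest) error seen so far
    stalled = 0           # consecutive epochs without sufficient improvement
    for epoch, err in enumerate(errors[:max_epochs], start=1):
        # Target Error: stop as soon as the error falls below the target level.
        if target_error is not None and err < target_error:
            return epoch, "target error"
        # Minimum Improvement: only active when the Window is non-zero.
        if window > 0:
            if err <= best - min_improvement:
                stalled = 0
            else:
                stalled += 1
            if stalled >= window:
                return epoch, "minimum improvement"
        best = min(best, err)
    # The maximum number of epochs always applies (or the error list ran out).
    return min(len(errors), max_epochs), "maximum epochs"

errors = [1.0, 0.8, 0.7, 0.69, 0.688, 0.687, 0.3]
slow = epochs_run(errors, max_epochs=100, min_improvement=0.05, window=3)
hit = epochs_run(errors, max_epochs=100, target_error=0.75)
capped = epochs_run(errors, max_epochs=2)
```

In this sketch a longer window would let the run survive the three slow epochs and reach the later drop in error, which is the reason for specifying a longer Window when the error is noisy.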

Stopping Rule (in Classification Trees). The stopping rule for a classification tree refers to the criteria that are used for determining the "right-sized" classification tree, that is, a classification tree with an appropriate number of splits and optimal predictive accuracy. The process of determining the "right-sized" classification tree is described in the Computational Methods section of Classification Trees.

Stratified Random Sampling. In general, random sampling is the process of randomly selecting observations from a population, to create a subsample that "represents" the observations in that population (see Kish, 1965; see also Probability Sampling, Simple Random Sampling, EPSEM Samples; see also Representative Sample for a brief exploration of this often misunderstood notion). In stratified sampling, we usually apply specific (identical or different) sampling fractions to different groups (strata) in the population to draw the sample.

Over-sampling particular strata to over-represent rare events. In some predictive data mining applications, it is often necessary to apply stratified sampling to systematically over-sample (apply a greater sampling fraction) to particular "rare events" of interest. For example, in catalog retailing the response rate to particular catalog offers can be below 1%, and when analyzing historical data (from prior campaigns) to build a model for targeting potential customers more successfully, it is desirable to over-sample past respondents (i.e., the "rare" respondents who ordered from the catalog); we can then apply the various model building techniques for classification (see Data Mining) to a sample consisting of approximately 50% responders and 50% non-responders. Otherwise, if we were to draw a simple random sample for the analysis (with 1% of responders), then practically all model building techniques would likely predict a simple "no-response" for all cases and would be (trivially) correct in 99% of the cases.
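
A minimal Python sketch of over-sampling with stratum-specific sampling fractions follows. The `stratified_sample` helper, the record layout, and the made-up population of 10,000 cases with a 1% response rate are assumptions for illustration.

```python
import random

def stratified_sample(cases, stratum_of, fractions, seed=0):
    """Draw a random sample, applying a separate sampling fraction to each stratum."""
    rng = random.Random(seed)
    strata = {}
    for case in cases:
        strata.setdefault(stratum_of(case), []).append(case)
    sample = []
    for name, members in strata.items():
        n = round(fractions[name] * len(members))   # cases to draw from this stratum
        sample.extend(rng.sample(members, n))
    return sample

# Made-up population: 100 responders (1%) among 10,000 customers.
population = [{"id": i, "responded": i < 100} for i in range(10000)]

# Keep every responder, but only about 1% of the non-responders.
fractions = {True: 1.0, False: 100 / 9900}
sample = stratified_sample(population, lambda c: c["responded"], fractions)
n_responders = sum(1 for c in sample if c["responded"])
```

The resulting sample holds 100 responders and 100 non-responders, i.e., the approximately 50/50 split described above.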

Stub and Banner Tables (Banner Tables). Stub-and-banner tables are essentially two-way tables, except that two lists of categorical variables (instead of just two individual variables) are crosstabulated. In the Stub-and-banner table, one list will be tabulated in the columns (horizontally) and the second list will be tabulated in the rows (vertically) of the Scrollsheet. For more information, see the Stub and Banner Tables section of Basic Statistics.

Studentized Deleted Residuals. In addition to standardized residuals, several methods (including studentized residuals, studentized deleted residuals, DFFITS, and standardized DFFITS) are available for detecting outlying values (observations with extreme values on the set of predictor variables or the dependent variable). In its standard form, the formula for studentized deleted residuals is given by:

$$SDRES_i = e_i \left[ \frac{N - p - 1}{SSE\,(1 - h_i) - e_i^2} \right]^{1/2}$$

where
e_i = residual for the ith case
SSE = sum of squared errors
N = sample size
p = rank of design matrix X
h_i = leverage for the ith case

For more information see Hocking (1996) and Ryan (1997).

Studentized Residuals. In addition to standardized residuals, several methods (including studentized residuals, studentized deleted residuals, DFFITS, and standardized DFFITS) are available for detecting outlying values (observations with extreme values on the set of predictor variables or the dependent variable). The formula for studentized residuals is

$$SRES_i = \frac{e_i / s}{(1 - \nu_i)^{1/2}}$$

where
e_i    is the error for the ith case
h_i    is the leverage for the ith case

and $\nu_i = 1/N + h_i$. For more information see Hocking (1996) and Ryan (1997).
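
For a regression with a single predictor, the computation can be sketched in Python as below. Several points are assumptions made for this example: s is taken to be the residual standard error sqrt(SSE/(N - 2)), the leverage of each case is taken to be the centered leverage (x - mean)^2 / Sxx so that adding 1/N gives the full leverage, and the data are made up.

```python
import math

def studentized_residuals(x, y):
    """Studentized residuals for a least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    s = math.sqrt(sse / (n - 2))          # assumption: residual standard error
    result = []
    for xi, e in zip(x, residuals):
        h = (xi - mean_x) ** 2 / sxx      # assumption: centered leverage
        full_leverage = 1.0 / n + h       # the 1/N + h term in the formula
        result.append((e / s) / math.sqrt(1.0 - full_leverage))
    return result

# Made-up data for illustration.
sres = studentized_residuals([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```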

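Student's t Distribution. The Student's t distribution has density function (for $\nu$ = 1, 2, ...):

$$f(x) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}}$$

where
    $\nu$ is the degrees of freedom
    $\Gamma$ (gamma) is the Gamma function
    $\pi$ is the constant Pi (3.14...)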


The animation above shows various tail areas (p-values) for a Student's t distribution with 15 degrees of freedom.
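
The standard Student's t density can be evaluated with a short Python sketch. The function name `student_t_density` is an assumption; the computation works on the log scale (via `math.lgamma`) so that large degrees of freedom do not overflow the Gamma function.

```python
import math

def student_t_density(x, df):
    """Density of the Student's t distribution with `df` degrees of freedom at x."""
    # log of Gamma((df+1)/2) / (Gamma(df/2) * sqrt(df * pi))
    log_norm = (math.lgamma((df + 1) / 2.0)
                - math.lgamma(df / 2.0)
                - 0.5 * math.log(df * math.pi))
    # log of (1 + x^2/df) ** (-(df+1)/2)
    log_kernel = -(df + 1) / 2.0 * math.log1p(x * x / df)
    return math.exp(log_norm + log_kernel)

peak_15 = student_t_density(0.0, 15)   # height of the density at its center
```

With 1 degree of freedom the density at 0 equals 1/Pi, which gives a quick sanity check of the normalizing constant.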

Sum-Squared Error Function. An error function composed by squaring the difference between sets of target and actual values, and adding these together. See also, loss function.
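
A minimal Python sketch of this error function follows; the function name and the example values are assumptions for illustration.

```python
def sum_squared_error(targets, outputs):
    """Sum of squared differences between target and actual (output) values."""
    if len(targets) != len(outputs):
        raise ValueError("targets and outputs must have the same length")
    return sum((t - o) ** 2 for t, o in zip(targets, outputs))

# Differences are 0.5, -0.5 and 0, so the error is 0.25 + 0.25 + 0.
sse = sum_squared_error([1.0, 0.0, 1.0], [0.5, 0.5, 1.0])
```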

Supervised and Unsupervised Learning. An important distinction in machine learning, and also applicable to data mining, is that between supervised and unsupervised learning algorithms. The term "supervised" learning is usually applied to cases in which a particular classification is already observed and recorded in a training sample, and you want to build a model to predict those classifications (in a new testing sample). For example, you may have a data set that contains information about who from among a list of customers targeted for a special promotion responded to that offer. The purpose of the classification analysis would be to build a model to predict who (from a different list of new potential customers) is likely to respond to the same (or a similar) offer in the future. You may want to review the methods discussed in General Classification and Regression Trees (GC&RT), General CHAID Models (GCHAID), Discriminant Function Analysis and General Discriminant Analysis (GDA), MARSplines (Multivariate Adaptive Regression Splines), and neural networks to learn about different techniques that can be used to build or fit models to data where the outcome variable of interest (e.g., customer did or did not respond to an offer) was observed. These methods are called supervised learning algorithms because the learning (fitting of models) is "guided" or "supervised" by the observed classifications recorded in the data file.

In unsupervised learning, the situation is different. Here the outcome variable of interest is not (and perhaps cannot be) directly observed. Instead, we want to detect some "structure" or clusters in the data that may not be trivially observable. For example, you may have a database of customers with various demographic indicators and variables potentially relevant to future purchasing behavior. Your goal would be to find market segments, i.e., groups of observations that are relatively similar to each other on certain variables; once identified, you could then determine how best to reach one or more clusters by providing certain goods or services you think may have some special utility or appeal to individuals in that segment (cluster). This type of task calls for an unsupervised learning algorithm, because learning (fitting of models) in this case cannot be guided by previously known classifications. Only after identifying certain clusters can you begin to assign labels, for example, based on subsequent research (e.g., after identifying one group of customers as "young risk takers").

There are several methods available for unsupervised learning, including Principal Components and Classification Analysis, Factor Analysis, Multidimensional Scaling, Correspondence Analysis, Neural Networks, Self-Organizing Feature Maps (SOFM, Kohonen networks); particularly powerful algorithms for pattern recognition and clustering are the EM and k-Means clustering algorithms.

Support Value (Association Rules). When applying (in data or text mining) algorithms for deriving association rules of the general form If Body then Head (e.g., If (Car=Porsche and Age<20) then (Risk=High and Insurance=High)), the Support value is computed as the joint probability (relative frequency of co-occurrence) of the Body and Head of each association rule.
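
The computation can be sketched in Python for the example rule. The four records are made up, and the `support` helper (which takes the Body and Head as predicate functions) is an assumption for this example.

```python
def support(records, body, head):
    """Relative frequency of records in which both Body and Head hold."""
    matches = sum(1 for r in records if body(r) and head(r))
    return matches / len(records)

# Made-up records for illustration.
records = [
    {"car": "Porsche", "age": 19, "risk": "High", "insurance": "High"},
    {"car": "Porsche", "age": 45, "risk": "Low", "insurance": "High"},
    {"car": "Sedan", "age": 18, "risk": "High", "insurance": "Low"},
    {"car": "Porsche", "age": 18, "risk": "High", "insurance": "High"},
]

# If (Car=Porsche and Age<20) then (Risk=High and Insurance=High)
rule_support = support(
    records,
    body=lambda r: r["car"] == "Porsche" and r["age"] < 20,
    head=lambda r: r["risk"] == "High" and r["insurance"] == "High",
)
```

Two of the four records satisfy both the Body and the Head, so the support value is 0.5.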

Support Vector. A set of points in the feature space that determines the boundary between objects of different class memberships.

Support Vector Machine (SVM). A classification method based on the maximum margin hyperplane.

Suppressor Variable. A suppressor variable (in Multiple Regression ) has zero (or close to zero) correlation with the criterion but is correlated with one or more of the predictor variables, and therefore, it will suppress irrelevant variance of independent variables. For example, you are trying to predict the times of runners in a 40 meter dash. Your predictors are Height and Weight of the runner. Now, assume that Height is not correlated with Time, but Weight is. Also assume that Weight and Height are correlated. If Height is a suppressor variable, then it will suppress, or control for, irrelevant variance (i.e., variance that is shared with the predictor and not the criterion), thus increasing the partial correlation. This can be viewed as ridding the analysis of noise.

Let t = Time, h = Height, w = Weight, rth = 0.0, rtw = 0.5, and rhw = 0.6.

Weight in this instance accounts for 25% (rtw**2 = 0.5**2) of the variability of Time. However, if Height is included in the model, then an additional 14% of the variability of Time is accounted for even though Height is not correlated with Time (see below):

Rt.hw**2 = 0.5**2/(1 - 0.6**2) = 0.39
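
The figure of 0.39 is a special case of the general expression for the squared multiple correlation with two predictors, in which the terms involving rth vanish because rth = 0. A short Python sketch (the function and argument names are assumptions) reproduces the arithmetic:

```python
def squared_multiple_correlation(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors.

    r_y1, r_y2: correlations of each predictor with the criterion
    r_12:       correlation between the two predictors
    """
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

# Weight (rtw = 0.5), Height (rth = 0.0), and their intercorrelation (rhw = 0.6).
r2_both = squared_multiple_correlation(0.5, 0.0, 0.6)
r2_weight_only = 0.5 ** 2
gain = r2_both - r2_weight_only   # additional variability accounted for by Height
```

The gain is about 0.14, i.e., the additional 14% attributed to the suppressor variable.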

For more information, please refer to Pedhazur, 1982.

Surface Plot (from Raw Data). This sequential plot fits a spline-smoothed surface to each data point. Successive values of each series are plotted along the X-axis, with each successive series represented along the Y-axis.

Survival Analysis. Survival analysis (exploratory and hypothesis testing) techniques include descriptive methods for estimating the distribution of survival times from a sample, methods for comparing survival in two or more groups, and techniques for fitting linear or non-linear regression models to survival data. A defining characteristic of survival time data is that they usually include so-called censored observations, e.g., observations that "survived" to a certain point in time, and then dropped out from the study (e.g., patients who are discharged from a hospital). Instead of discarding such observations from the data analysis altogether (i.e., unnecessarily losing potentially useful information), survival analysis techniques can accommodate censored observations and "use" them in statistical significance testing and model fitting.

Typical survival analysis methods include life table, survival distribution, and Kaplan-Meier survival function estimation, and additional techniques for comparing the survival in two or more groups. Finally, survival analysis includes the use of regression models for estimating the relationship of (multiple) continuous variables to survival times. For more information, see Survival Analysis.

Survivorship Function. The survivorship function (commonly denoted as R(t)) is the complement to the cumulative distribution function (i.e., R(t)=1-F(t)); the survivorship function is also referred to as the reliability or survival function (since it describes the probability of not failing or of surviving until a certain time t; e.g., see Lee, 1992).

For additional information see Survival Analysis or the Weibull and Reliability/Failure Time Analysis section of Process Analysis.
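
The relation R(t) = 1 - F(t) can be illustrated with an empirical distribution function in Python. The failure times are made up, and the sketch assumes that every failure time is fully observed (no censored observations).

```python
def empirical_cdf(times, t):
    """F(t): proportion of observed failure times that are <= t."""
    return sum(1 for x in times if x <= t) / len(times)

def survivorship(times, t):
    """R(t) = 1 - F(t): proportion of cases still surviving after time t."""
    return 1.0 - empirical_cdf(times, t)

# Made-up, fully observed failure times.
failure_times = [2, 3, 3, 5, 8, 13, 21, 34]
r_at_5 = survivorship(failure_times, 5)   # 4 of the 8 cases failed by t = 5
```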

Sweeping. The sweeping transformation of matrices is commonly used to efficiently perform stepwise multiple regression (see Dempster, 1969; Jennrich, 1977) or similar analyses; a modified version of this transformation is also used to compute the g2 generalized inverse. The forward sweeping transformation for a column k can be summarized in the following four steps (where the e's refer to the elements of a symmetric matrix):

  1. eij = eij - eik * ekj / ekk for i<>k, j<>k
  2. ekj = ekj / ekk for j<>k
  3. eik = eik / ekk for i<>k
  4. ekk = -1 / ekk

The reverse sweeping operation reverses the changes effected by these transformations. The sweeping operator is used extensively in General Linear Models, Multiple Regression, and similar techniques.
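
The forward sweep can be sketched in plain Python as follows, with each new element computed from the original (unswept) matrix as e_ij - e_ik * e_kj / e_kk for elements outside row and column k. The `sweep` function and the small made-up regression example are assumptions for illustration; sweeping the cross-products matrix of (1, x, y) on the two predictor columns leaves the regression coefficients in the last column and the residual sum of squares in the lower right corner.

```python
def sweep(matrix, k):
    """Return a copy of a square matrix after a forward sweep on column k."""
    n = len(matrix)
    pivot = matrix[k][k]
    if pivot == 0:
        raise ValueError("cannot sweep on a zero pivot element")
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == k and j == k:
                out[i][j] = -1.0 / pivot                  # step 4
            elif i == k:
                out[i][j] = matrix[k][j] / pivot          # step 2
            elif j == k:
                out[i][j] = matrix[i][k] / pivot          # step 3
            else:                                         # step 1
                out[i][j] = matrix[i][j] - matrix[i][k] * matrix[k][j] / pivot
    return out

# Made-up data: x = 1, 2, 3, 4 and y = 2.1, 3.9, 6.2, 7.8.
# Cross-products matrix of the columns (1, x, y):
cross_products = [
    [4.0, 10.0, 20.0],
    [10.0, 30.0, 59.7],
    [20.0, 59.7, 118.9],
]
swept = sweep(sweep(cross_products, 0), 1)
intercept, slope, sse = swept[0][2], swept[1][2], swept[2][2]
```

For these data the swept matrix yields an intercept of 0.15, a slope of 1.94, and a residual sum of squares of 0.082, matching an ordinary least-squares fit.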

Symmetrical Distribution. If you split the distribution in half at its mean (or median), then the distribution of values would be a "mirror image" about this central point. See also, Descriptive Statistics Overview.

Symmetric Matrix. A matrix is symmetric if the transpose of the matrix is itself (i.e., A = A'). In other words, the lower triangle of the square matrix is a "mirror image" of the upper triangle (the example below also happens to have 1's on the diagonal).

|1 2 3 4|
|2 1 5 6|
|3 5 1 7|
|4 6 7 1|
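
The condition A = A' can be checked element by element, as in this short Python sketch (the function name is an assumption):

```python
def is_symmetric(matrix):
    """True if the matrix is square and equal to its transpose."""
    n = len(matrix)
    if any(len(row) != n for row in matrix):
        return False   # only square matrices can be symmetric
    # Compare each element above the diagonal with its mirror image below it.
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(i + 1, n))

example = [
    [1, 2, 3, 4],
    [2, 1, 5, 6],
    [3, 5, 1, 7],
    [4, 6, 7, 1],
]
example_is_symmetric = is_symmetric(example)
```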

Synaptic Functions (in Neural Networks).

Dot product. Dot product units perform a weighted sum of their inputs, minus the threshold value. In vector terminology, this is the dot product of the weight vector with the input vector, plus a bias value. Dot product units have equal output values along hyperplanes in pattern space. They attempt to perform classification by dividing pattern space into sections using intersecting hyperplanes.

Radial. Radial units calculate the square of the distance between the two points in N dimensional space (where N is the number of inputs) represented by the input pattern vector and the unit's weight vector. Radial units have equal output values lying on hyperspheres in pattern space. They attempt to perform classification by measuring the distance of normalized cases from exemplar points in pattern space (the exemplars being stored by the units). The squared distance is multiplied by the threshold (which is, therefore, actually a deviation value in radial units) to produce the post synaptic value of the unit (which is then passed to the unit's activation function).

Dot product units are used in multilayer perceptron and linear networks, and in the final layers of radial basis function, PNN, and GRNN networks.

Radial units are used in the second layer of Kohonen, radial basis function, Clustering, and probabilistic and generalized regression networks.  They are not used in any other layers of any standard network architecture.

Division. This is specially designed for use in generalized regression networks, and should not be employed elsewhere. It expects one incoming weight to equal +1, one to equal -1, and the others to equal zero. The post-synaptic value is the +1 input divided by the -1 input.
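
The three synaptic functions described above can be sketched in Python as follows. The function names and argument layout are assumptions made for this example, and the sketch covers only the post-synaptic value, not the activation function applied afterwards.

```python
def dot_product_unit(inputs, weights, threshold):
    """Weighted sum of the inputs, minus the threshold value."""
    return sum(w * x for w, x in zip(weights, inputs)) - threshold

def radial_unit(inputs, weights, threshold):
    """Squared distance between input and weight vectors, times the threshold."""
    squared_distance = sum((x - w) ** 2 for x, w in zip(inputs, weights))
    return threshold * squared_distance

def division_unit(inputs, weights):
    """Input on the +1 weight divided by the input on the -1 weight."""
    numerator = inputs[weights.index(1)]
    denominator = inputs[weights.index(-1)]
    return numerator / denominator

dot_value = dot_product_unit([1.0, 2.0], [0.5, 0.25], threshold=0.5)
radial_value = radial_unit([3.0, 4.0], [0.0, 0.0], threshold=2.0)
division_value = division_unit([6.0, 9.0, 3.0], [1, 0, -1])
```

In the radial example the input lies at distance 5 from the unit's weight vector, so the squared distance of 25 is multiplied by the threshold of 2.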