 ### Glossary Index


R. R is a programming language and environment for statistical computing (see http://www.r-project.org). The R environment and its source code are freely available under the GNU GPL license; precompiled binaries exist for Microsoft Windows, Unix, and Mac OS platforms. R uses a command line interface, but several graphical user interfaces are also available.

Radial Basis Functions. A type of neural network employing a hidden layer of radial units and an output layer of linear units, and characterized by reasonably fast training and reasonably compact networks. Introduced by Broomhead and Lowe (1988) and Moody and Darken (1989), they are described in most good neural network textbooks (e.g., Bishop, 1995; Haykin, 1994). See Neural Networks.

Radial Sampling (in Neural Networks). Radial sampling is a simple technique to assign centers to radial units in the first hidden layer of a network by randomly sampling training cases and copying those to the centers. This is a reasonable approach if the training data are distributed in a representative manner for the problem (Lowe, 1989). The number of training cases must at least equal the number of centers to be assigned.

Random Effects (in Mixed Model ANOVA). The term random effects in the context of analysis of variance is used to denote factors in an ANOVA design with levels that were not deliberately arranged by the experimenter (factors with deliberately arranged levels are called fixed effects), but which were instead sampled from a population of possible levels. For example, if we were interested in the effect that the quality of different schools has on academic proficiency, we could select a sample of schools to estimate the amount of variance in academic proficiency (component of variance) that is attributable to differences between schools.

A simple criterion for deciding whether or not an effect in an experiment is random or fixed is to ask how we would select (or arrange) the levels for the respective factor in a replication of the study. For example, if we wanted to replicate the study described in this example, we would choose (take a sample of) different schools from the population of schools. Thus, the factor "school" in this study would be a random factor. In contrast, if we wanted to compare the academic performance of boys to girls in an experiment with a fixed factor Gender, we would always arrange two groups: boys and girls. Hence, in this case, the same (and in this case only) levels of the factor Gender would be chosen when we wanted to replicate the study.

Random Forests. A Random Forest consists of a collection or ensemble of simple tree predictors, each capable of producing a response when presented with a set of predictor values. For classification problems, this response takes the form of a class membership, which associates, or classifies, a set of independent predictor values with one of the categories present in the dependent variable. Alternatively, for regression problems, the tree response is an estimate of the dependent variable given the predictors. The Random Forest algorithm was developed by Breiman.

A Random Forest consists of an arbitrary number of simple trees, which are used to determine the final outcome.  For classification problems, the ensemble of simple trees vote for the most popular class. In the regression problem, their responses are averaged to obtain an estimate of the dependent variable. Using tree ensembles can lead to significant improvement in prediction accuracy (i.e., better ability to predict new data cases).
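The two ways of combining tree responses described above can be sketched in a few lines of Python. This is only an illustration of the voting and averaging step, not of how the trees themselves are grown; the function names and the example responses are hypothetical.

```python
from collections import Counter

def forest_classify(tree_votes):
    """Return the class that receives the most votes from the individual trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_estimates):
    """Return the average of the individual tree estimates."""
    return sum(tree_estimates) / len(tree_estimates)

# Hypothetical responses from five classification trees and four regression trees.
predicted_class = forest_classify(["spam", "ham", "spam", "spam", "ham"])
predicted_value = forest_regress([2.0, 2.5, 3.0, 2.5])
```

An odd number of trees is used in the classification example so that no tie has to be resolved.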

Random Numbers from Arbitrary Distributions. Random numbers can be generated for all continuous and discrete distributions, using the standard inversion method (Muller, 1959; see also Evans, Hastings, & Peacock, 1993). First generate a uniform random number, and then use the inverse distribution function for the distribution of interest to generate the random variate values.

In practice, care must be taken not to use uniform random numbers that are (almost) 0 or (almost) 1, since the inverse distribution functions at the extreme margins may return in some cases the missing data code. For example, suppose you want to generate random variate values for the Weibull distribution with scale parameter .5, shape parameter .6, and location (threshold) parameter 10. You could use a formula to generate those values:

=invWeibull(rnd(1)*.99999+.000001,.5,.6,10)

Note that the rnd(1) function generates uniform random numbers in the range from 0 to 1. By multiplying those values by .99999 and adding .000001, you guarantee that the inverse Weibull distribution function invWeibull is never called with a very small (almost 0) or very large (almost 1) probability value, where it might return a missing value; the constants used in this example guarantee .000001<=p<=.999991. The variable computed in this manner is therefore guaranteed to contain only valid random values from the specified Weibull distribution (except for the extreme tails).
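The same idea can be sketched in Python. The helper names below are hypothetical, and the quantile formula assumes the common three-parameter Weibull parameterization with distribution function F(x) = 1 - exp(-((x - location)/scale)^shape); the constants mirror those in the invWeibull formula above.

```python
import math
import random

def inv_weibull(p, scale, shape, location=0.0):
    """Inverse distribution function of the three-parameter Weibull distribution
    (assumed parameterization: F(x) = 1 - exp(-((x - location) / scale) ** shape))."""
    return location + scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def weibull_variates(n, scale, shape, location=0.0, seed=None):
    """Generate n Weibull variates by the inversion method."""
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        # Keep p inside [.000001, .999991] so the extreme margins are never used.
        p = rng.random() * 0.99999 + 0.000001
        values.append(inv_weibull(p, scale, shape, location))
    return values

sample = weibull_variates(1000, scale=0.5, shape=0.6, location=10.0, seed=42)
```

Because the location (threshold) parameter is 10, every generated value is at least 10.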

Bayesian analysis; Monte Carlo Markov Chain analysis; Gibbs sampler. Bayesian analysis and related techniques rely on the fast, efficient, and "precise" generation of random numbers from a variety of distributions. Numerous different types of specialized libraries are available for this purpose, and a good deal of ongoing research in this area is further refining these techniques and their applications. See also, DIEHARD Suite of Tests and Random Number Generation.

Random Numbers (Uniform). Various techniques for uniform random number generators are discussed in Press, Teukolsky, Vetterling, & Flannery (1995). The DIEHARD suite of tests (Marsaglia, 1998) applies various methods of assembling and combining uniform random numbers, and then performs statistical tests that are expected to be non-significant; this suite of tests has become a standard method of evaluating the quality of uniform random number generator routines (see also McCullough, 1998, 1999). See also Random Numbers from Arbitrary Distributions and DIEHARD Suite of Tests and Random Number Generation.

Random Sub-Sampling in Data Mining. When mining huge data sets with many millions of observations, it is neither practical nor desirable to process all cases (although efficient incremental learning algorithms exist to perform predictive data mining using all observations in the dataset). For example, by properly sampling only 100 observations (from millions of observations) you can compute a very reliable estimate of the mean. One of the rules of statistical sampling that is often not intuitively understood by untrained "observers" is the fact that the reliability and validity of results depend, among many other things, on the size of a random sample, and not on the size of the population from which it is taken. In other words, the mean estimated from 100 randomly sampled observations is as accurate (i.e., falls within the same confidence limits) regardless of whether the sample was taken from 1000 cases or 100 billion cases. Put another way, given a certain (reasonable) degree of accuracy required, there is absolutely no need to process and include all observations in the final computations (for estimating the mean, fitting models, etc.).
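As a rough numerical check of this rule, the sketch below evaluates the textbook standard error of the mean for a simple random sample drawn without replacement, including the finite population correction. The function name is hypothetical, and a population standard deviation of 1 is assumed purely for illustration.

```python
import math

def standard_error_of_mean(sigma, n, population_size):
    """Standard error of the mean for a simple random sample of size n drawn
    without replacement from a finite population of the given size."""
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sigma / math.sqrt(n) * fpc

# Same sample size (100), very different population sizes.
for population_size in (1_000, 1_000_000, 100_000_000_000):
    se = standard_error_of_mean(1.0, 100, population_size)
    print(population_size, se)
```

With 100 observations, the standard error is essentially sigma/10 for any large population. Only when the sample is a noticeable fraction of the population (here 100 out of 1,000) does the correction make the estimate slightly more precise, by about 5%.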

Range Plots - Boxes. In this style of range plot, the range is represented by a "box" (i.e., as a rectangular box where the top of the box is the upper range and the bottom of the box is the lower range). The midpoints are represented either as point markers or horizontal lines that "cut" the box.

Range Plots - Columns. In this style of range plot, a column represents the mid-point (i.e., the top of the column is at the mid-point value) and the range (represented by "whiskers") is overlaid in the column.

Range Plots - Whiskers. In this style of range plot (see example above), the range is represented by "whiskers" (i.e., as a line with a serif on both ends). The midpoints are represented by point markers.

Rank. A rank is a consecutive number assigned to a specific observation in a sample of observations sorted by their values and, thus, reflecting the ordinal relation of the observation to others in the sample. Depending on the order of sorting (ascending or descending), the higher ranks represent the higher values [i.e., ascending ranks, the lowest value is assigned a rank of 1, and the highest value - the "last" (highest) rank] or higher ranks represent the lower values (i.e., descending ranks, the highest value is assigned a rank of 1). See ordinal scale and Coombs, 1950.

Rank Correlation. A rank correlation coefficient is a coefficient of correlation between two random variables that is based on the ranks of the measurements and not the actual values, for example, see Spearman R, Kendall tau, Gamma. Detailed discussions of rank correlations can be found in Hays (1981), Kendall (1948, 1975), Everitt (1977), and Siegel and Castellan (1988). See also Nonparametric Statistics.

Ratio Scale. This scale of measurement contains an absolute zero point; therefore, it allows you not only to quantify and compare the sizes of differences between values, but also to interpret both values in terms of absolute measures of quantity or amount (e.g., time; 3 hours is not only 2 hours more than 1 hour, but it is also 3 times more than 1 hour).

Rayleigh Distribution. The Rayleigh distribution has the probability density function:

$$f(x) = \frac{x}{b^2}\, e^{-x^2/(2b^2)}$$

$$0 \le x < \infty, \quad b > 0$$

where
b     is the scale parameter
e     is the base of the natural logarithm, sometimes called Euler's e (2.71...)

The graphic above shows the changing shape of the Rayleigh distribution when the scale parameter equals 1, 2, and 3.
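For readers who want to evaluate the density numerically, a minimal Python sketch follows. The function name is hypothetical; it simply evaluates the Rayleigh density given above for a scale parameter b.

```python
import math

def rayleigh_pdf(x, b):
    """Rayleigh probability density with scale parameter b > 0."""
    if b <= 0:
        raise ValueError("scale parameter b must be positive")
    if x < 0:
        return 0.0  # the density is zero outside the support
    return (x / b**2) * math.exp(-(x**2) / (2 * b**2))

# Density at x = 1 for the three scale parameters mentioned in the text.
densities = [rayleigh_pdf(1.0, b) for b in (1.0, 2.0, 3.0)]
```

The density reaches its maximum at x = b, which is why larger scale parameters shift the peak to the right and flatten the curve.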

Receiver Operating Characteristic Curve (ROC Curve). A ROC curve can be used to evaluate the goodness of fit for a binary classifier. It is a plot of the true positive rate (rate of events that are correctly predicted as events) against the false positive rate (rate of nonevents predicted to be events) for the different possible cutpoints.

A ROC curve demonstrates the following:

• The trade-off between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).

• The closer the curve follows the left border and then the top border of the ROC space, the more accurate the test.

• The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

Receiver Operating Characteristic (ROC) Curve (in Neural Networks). When a neural network is used for classification, confidence levels (the Accept and Reject thresholds, available from the Pre/Post Processing Editor) determine how the neural network assigns input cases to classes.

In the case of two-class classification problems, by default the output class is indicated by a single output neuron, with high output corresponding to one class and low output to the other. If the Reject threshold is strictly less than the Accept threshold, then the network may include a "doubt" option, where it is not sure of the class if the output lies between the Reject and Accept thresholds.

An alternative approach is to set the Accept and Reject thresholds equal. In this case, as the single decision threshold is adjusted, so the classification behavior of the network changes. At one extreme, all cases will be assigned to one class, and at the other extreme to the other. In between these extremes, different compromises may be found, leading to different trade-offs between the rate of erroneous assignment to each class (i.e. false-positives and false-negatives).

A Receiver Operating Characteristic curve (Zweig, 1993) summarizes the performance of a two-class classifier across the range of possible thresholds. It plots the sensitivity (the rate of class two true positives) versus one minus the specificity (the false positive rate, that is, the rate of class one cases incorrectly assigned to class two). An ideal classifier hugs the left side and top side of the graph, and the area under the curve is 1.0. A random classifier should achieve an area of approximately 0.5 (a classifier with an area less than 0.5 can be improved simply by flipping the class assignment). The ROC curve is recommended for comparing classifiers, as it does not merely summarize performance at a single arbitrarily selected decision threshold, but across all possible decision thresholds.

The ROC curve can be used to select an optimum decision threshold. This threshold (which equalizes the probability of misclassification of either class; i.e. the probability of false-positives and false-negatives) can be used to automatically set confidence thresholds in classification networks with a nominal output variable with the Two-state conversion function.
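The construction of a ROC curve can be sketched as follows: sweep a single decision threshold over the classifier outputs, record the false positive rate and true positive rate at each threshold, and integrate with the trapezoidal rule. The function names, the example scores, and the convention that label 1 marks the positive class are assumptions made for illustration.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct
    threshold, ordered from the strictest threshold to the most lenient."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is accepted
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Hypothetical network outputs and true classes (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = area_under_curve(roc_points(scores, labels))
```

For these six cases the area is 8/9, between the 0.5 expected of a random classifier and the 1.0 of an ideal one.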

Rectangular Distribution. The rectangular distribution (continuous uniform distribution) is useful for describing random variables with a constant probability density over the defined range a<x<b, where a<b:

$$f(x) = \frac{1}{b - a}, \quad a < x < b$$

Regression. A category of problems where the objective is to estimate the value of a continuous output variable from some input variables.

Regression (in Neural Networks). In regression problems the purpose is to predict the value of a continuous output variable. Regression problems can be solved in Neural Networks using multilayer perceptrons, radial basis function networks, (Bayesian) regression networks, and linear networks.

Output Scaling. Multilayer perceptrons include Minimax scaling of both input and output variables. When the network is trained, shift and scale coefficients are determined for each variable, based on the minimum and maximum values in the training set, and the data is transformed by multiplying by the scale factor and adding the shift factor.

The net effect is that a 0.0 output activation level in the network is translated into the minimum value encountered in the training data, and a 1.0 activation level is translated into the maximum training data value. Consequently, the network is able to interpolate between the values represented in the training data. However, extrapolation outside the range encountered in the training set is more circumscribed. Two approaches to encoding the output are available, each of which allows a certain amount of extrapolation.

• A logistic activation function is used for the output, with scaling factors determined so that the range encountered in the training set is mapped to a restricted part of the logistic function's (0,1) range (e.g., to [0.05, 0.95]). This allows a small amount of extrapolation (significant extrapolation from data is usually unjustified anyway). Using the logistic function makes training stable.

• An identity activation function is used in the final layer of the network. This supports a substantial amount of extrapolation, although not unlimited (the hidden units will saturate eventually). As a bonus, the final layer can be "fine-tuned" after iterative training using the pseudo-inverse technique. However, iterative training tends to be less stable than with a non-linear activation function, and the learning rate must be carefully chosen (i.e., less than 0.1) to avoid weight divergence during training if using an algorithm such as back propagation.
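The shift and scale coefficients described above can be sketched in Python as follows. The function names are hypothetical, and the target interval [0.05, 0.95] is taken from the logistic example in the text; this is an illustration of Minimax scaling in general, not of any particular program's implementation.

```python
def fit_minimax(training_values, low=0.0, high=1.0):
    """Return (scale, shift) so that the training minimum maps to `low`
    and the training maximum maps to `high`."""
    smallest, largest = min(training_values), max(training_values)
    if largest == smallest:
        raise ValueError("training values must not all be equal")
    scale = (high - low) / (largest - smallest)
    shift = low - smallest * scale
    return scale, shift

def to_activation(value, scale, shift):
    """Transform a data value by multiplying by the scale and adding the shift."""
    return value * scale + shift

def from_activation(activation, scale, shift):
    """Translate a network activation back into the units of the data."""
    return (activation - shift) / scale

# Hypothetical training targets, mapped into the restricted range [0.05, 0.95].
scale, shift = fit_minimax([10.0, 20.0, 30.0], low=0.05, high=0.95)
largest_reachable = from_activation(1.0, scale, shift)
```

Because the training maximum (30) maps to 0.95 rather than 1.0, an activation approaching 1.0 translates to a value slightly above 30, which is the small amount of extrapolation mentioned above.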

Outliers. Regression networks can be particularly prone to problems with outlying data. The use of the sum-squared network error function means that points lying far from the others have a disproportionate influence on the position of the hyperplanes used in regression. If these points are actually anomalies (for example, spurious points generated by the failure of measuring devices) they can substantially degrade the network's performance.

One approach to this problem is to train the network, test it on the training cases, isolate those that have extremely high error values and remove them, then to retrain the network.

If you believe the outlier is caused by a suspicious value for one of the variables in that case, you can delete that particular value, at which point the case is treated as having a missing value (see Missing Values, below).

Another approach is to use the city-block error function. Rather than summing the squared-differences in each variable to work out an error measure, this simply sums the absolute differences. Removing the square function makes training far less sensitive to outliers.

Whereas the amount of "pull" a case has on a hyperplane is proportional to the distance of the point from the hyperplane in the sum-squared error function, with the city block error function the pull is the same for all points, and the direction of pull simply depends on the side of the hyperplane to which the point lies. Effectively, the sum-squared error function attempts to find the mean, but the city-block error function attempts to find the median.
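A small numerical illustration of the mean versus median behavior, using a single constant prediction and hypothetical data containing one outlier:

```python
import statistics

def sum_squared_error(center, values):
    """Sum of squared differences between each value and a constant prediction."""
    return sum((v - center) ** 2 for v in values)

def city_block_error(center, values):
    """Sum of absolute differences between each value and a constant prediction."""
    return sum(abs(v - center) for v in values)

data = [1.0, 2.0, 3.0, 4.0, 100.0]  # 100.0 is the outlier
mean = statistics.mean(data)
median = statistics.median(data)

# The mean wins under sum-squared error, the median under city-block error.
mean_wins_squared = sum_squared_error(mean, data) < sum_squared_error(median, data)
median_wins_city_block = city_block_error(median, data) < city_block_error(mean, data)
```

The outlier drags the mean to 22, while the median stays at 3, close to the bulk of the data.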

Missing Values. It is not uncommon to come across situations where the data for some cases has some values missing; perhaps because data was unavailable, or corrupted, when gathered. In such cases, you may still need to execute a network (to get the best estimate possible given the information available) or (and this is more suspect) use the partially complete data in training because of an acute shortage of training data.

Where possible, it is usually good practice not to use variables containing a great many missing values. Cases with missing values can be excluded.

Regression Summary Statistics (in Neural Networks). In regression problems, the purpose of the neural network is to learn a mapping from the input variables to a continuous output variable, or variables.

A network is successful at regression if it makes predictions that are more accurate than a simple estimate.

The simplest way to construct an estimate, given training data, is to calculate the mean of the training data, and use that mean as the predicted value for all previously unseen cases.

The average expected error from this procedure is the standard deviation of the training data. The aim in using a regression network is therefore to produce an estimate that has a lower prediction error standard deviation than the training data standard deviation.

The regression statistics are:

Data Mean. Average value of the target output variable.

Data S.D. Standard deviation of the target output variable.

Error Mean. Average error (residual between target and actual output values) of the output variable.

Abs. E. Mean. Average absolute error (difference between target and actual output values) of the output variable.

Error S.D. Standard deviation of errors for the output variable.

S.D. Ratio. The error:data standard deviation ratio.

Correlation. The standard Pearson-R correlation coefficient between the predicted and observed output values.

The degree of predictive accuracy needed varies from application to application. However, generally an s.d. ratio of 0.1 or lower indicates very good regression performance.
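The statistics listed above can be computed with a short Python sketch. Several choices here are assumptions rather than part of the definitions: the error is taken as target minus actual output, standard deviations use the sample (n - 1) formula, and the function name and example values are hypothetical.

```python
import math
import statistics

def regression_summary(targets, outputs):
    """Compute the regression summary statistics for one output variable."""
    errors = [t - o for t, o in zip(targets, outputs)]  # assumed sign convention
    data_mean = statistics.mean(targets)
    output_mean = statistics.mean(outputs)
    data_sd = statistics.stdev(targets)    # sample standard deviation (assumption)
    error_sd = statistics.stdev(errors)
    # Pearson correlation between predicted and observed values.
    covariance = sum((t - data_mean) * (o - output_mean) for t, o in zip(targets, outputs))
    spread_t = sum((t - data_mean) ** 2 for t in targets)
    spread_o = sum((o - output_mean) ** 2 for o in outputs)
    return {
        "data_mean": data_mean,
        "data_sd": data_sd,
        "error_mean": statistics.mean(errors),
        "abs_error_mean": statistics.mean(abs(e) for e in errors),
        "error_sd": error_sd,
        "sd_ratio": error_sd / data_sd,
        "correlation": covariance / math.sqrt(spread_t * spread_o),
    }

summary = regression_summary(
    targets=[1.0, 2.0, 3.0, 4.0, 5.0],
    outputs=[1.1, 1.9, 3.2, 3.8, 5.0],
)
```

In this made-up example the S.D. ratio comes out at 0.1, right at the threshold the text describes as very good regression performance.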

Regular Histogram. This simple histogram will produce a column plot of the frequency distribution for the selected variable (if more than one variable is selected, then one graph will be produced for each variable in the list).

Regularization (in Neural Networks). A modification to training algorithms that attempts to prevent over- or under-fitting of training data by building in a penalty factor for network complexity (typically by penalizing large weights, which correspond to networks modeling functions of high curvature) (Bishop, 1995).

Reject Inference. Used in the building of credit scorecards, reject inference refers to the process of removing bias from credit scoring models. Typically, a financial institution will have historical data concerning applicants who were extended credit. Applicants who were refused credit and those who turned down the credit offer are not represented in the data. This introduces a bias. Predictive models are intended to score all possible applicants; however, the data used to build the models is missing certain types of applicants. Reject inference methods attempt to alleviate this bias.

Relative Function Change Criterion. The relative function change criterion is used to stop iteration when the function value is no longer changing (see Structural Equation Modeling). Basically, it stops iteration when the function ceases to change. The criterion is necessary because, sometimes, it is not possible to reduce the discrepancy function even when the gradient is not close to zero. This occurs, in particular, when one of the parameter estimates is at a boundary value. The "true minimum," where the gradient actually is zero, includes parameter values that are not permitted (like negative variances, or correlations greater than one).

On the i'th iteration, this criterion is equal to

Reliability. There are two very different ways in which this term can be used:

Reliability and item analysis. In this context reliability is defined as the extent to which a measurement taken with a multiple-item scale (e.g., questionnaire) reflects mostly the so-called true score of the dimension that is to be measured, relative to the error. A similar notion of scale reliability is sometimes used when assessing the accuracy (and reliability) of gages or scales used in quality control charting. For additional details refer to the Reliability and Item Analysis or the description of Gage Repeatability/Reproducibility Analysis in Process Analysis.

Weibull and reliability/failure time analysis. In this context reliability is defined in terms of the probability of failure (or death) of an item as a function of time. The reliability function (commonly denoted as R(t)) is the complement to the cumulative distribution function of failure times (i.e., R(t)=1-F(t)); it is also sometimes referred to as the survivorship or survival function, since it describes the probability of not failing (surviving) until a certain time t (e.g., see Lee, 1992). For additional information, see Weibull and Reliability/Failure Time Analysis in Process Analysis.

Reliability and Item Analysis. In many areas of research, the precise measurement of hypothesized processes or variables (theoretical constructs) poses a challenge by itself. For example, in psychology, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. In general, in all social sciences, unreliable measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior. The issue of precision of measurement will also come up in applied research, whenever variables are difficult to observe. For example, reliable measurement of employee performance is usually a difficult task; yet, it is obviously a necessary precursor to any performance-based compensation system.

In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability & Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). Reliability & Item Analysis provides numerous statistics that allow the user to build and evaluate scales following the so-called classical testing theory model.

For more information, see Reliability and Item Analysis.

The term reliability used in industrial statistics denotes a function describing the probability of failure (as a function of time). For a discussion of the concept of reliability as applied to product quality (e.g., in industrial statistics), please refer to the section on Reliability/Failure Time Analysis in Process Analysis (see also the section on Repeatability and Reproducibility and Survival/Failure Time Analysis). For a comparison between these two (very different) concepts of reliability, see Reliability.

Representative Sample. The notion of a "representative sample" is often misunderstood. The general intent usually is to draw a sample from a population so that particular properties of that population can be estimated accurately from the sample. For example, political scientists may draw samples from the population of voters to predict with some certainty the outcome of an election.

In general, only properly drawn probability samples such as EPSEM samples will guarantee that the population to which we wish to generalize is properly "represented." On the other hand, a generally erroneous notion is commonly expressed that, in order to achieve "representativeness," it is desirable to draw a stratified sample using particular "quotas" (quota sampling) where demographic characteristics such as age, gender, race, and so on are properly "balanced," to match precisely the makeup of the underlying population. This notion is false: The precision of the estimates (such as voting margins) for a population computed from such a sample will only be enhanced if the variables that we are attempting to match (age, gender, race...) are (strongly) related to the outcome variable of interest (e.g., voting behavior). However, in practice such a priori knowledge is usually elusive, and applying such quota sampling methods may yield grossly misleading results.

Refer to, for example, Kish (1965) for a detailed discussion of the advantages and characteristics of probability samples and EPSEM samples.

Resampling (in Neural Networks). A major problem with neural networks is the generalization issue (the tendency to overfit the training data), accompanied by the difficulty in quantifying likely performance on new data.

This difficulty can be disturbing if you are accustomed to the relative security of linear modeling, where a given set of data generates a single "optimal" linear model. However, this security may be somewhat deceptive, and if the underlying function is not linear, the model may be very far from optimal.

In contrast, in nonlinear modeling some choice must be made about the complexity (curvature, eccentricity) of the model, and this can lead to a plethora of alternative models. Given this diversity, it is important to have ways to estimate the performance of the models on new data, and to be able to select among them.

Most work on assessing performance in neural modeling concentrates on approaches to resampling. A neural network is optimized using a training subset. Often, a separate subset (the selection subset) is used to halt training to mitigate over-learning, or to select from a number of models trained with different parameters. Then, a third subset (the test subset) is used to perform an unbiased estimation of the network's likely performance.

Although the use of a test set allows us to generate unbiased performance estimates, these estimates may exhibit high variance. Ideally, we would like to repeat the training procedure a number of different times, each time using new training, selection, and test cases drawn from the population; then, we could average the performance prediction over the different test subsets, to get a more reliable indicator of generalization performance.

In reality, we seldom have enough data to perform a number of training runs with entirely separate training, selection and test subsets. However, intuitively we might think we can do better if we train multiple networks, as when a single network is trained, only part of the data is actually involved in training. Can we find a way to use all the data in training, selection and test?

Cross validation is the simplest resampling technique. Suppose that we decide to conduct ten experiments with a given data set. We divide the data set into ten equal parts. Then, for each experiment we select one part to act as the test set. The other nine tenths of the data set are used for training and selection. When the ten experiments are finished, we can average the test set performances of the individual networks.

Cross validation has some obvious advantages. If training a single network, we would probably reserve 25% of the data for test. By using cross validation, we can reduce the individual test set size. In the most extreme version, leave-one-out cross validation, we perform a number of experiments equal to the size of the data set. On each experiment a single case is placed in the test subset, and the rest of the data is used for training. Clearly this may require a substantial number of experiments if the data set is large, but it can give you a very accurate estimate of generalization performance.
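The division of cases into test and training parts can be sketched as follows. The function name is hypothetical, and the folds are taken as contiguous blocks of case indices for simplicity; in practice the cases would usually be shuffled first.

```python
def k_fold_splits(n_cases, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.
    Each case appears in exactly one test fold."""
    if not 2 <= k <= n_cases:
        raise ValueError("k must be between 2 and the number of cases")
    indices = list(range(n_cases))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_cases // k + (1 if i < n_cases % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

ten_fold = list(k_fold_splits(20, 10))     # ten experiments, two test cases each
leave_one_out = list(k_fold_splits(5, 5))  # one test case per experiment
```

Setting k equal to the number of cases gives the leave-one-out variant described above.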

What precisely does cross validation tell us? In cross validation, each of the set of experiments should be performed with the same process parameters (same training algorithms, number of epochs, learning rates, etc.). The averaged performance measure is then an estimate of the performance on new data (drawn from the same distribution as the training data) of a single network trained using the same procedure (including the networks actually generated in the cross validation procedure).

We could select one of the cross validated networks at random and deploy it, using the estimates generated in cross validation to characterize its expected performance. However, this seems intuitively wasteful; having generated a number of networks, why not use them all? We can form the networks into an ensemble, and make predictions by averaging or voting across the resampled member networks (ensembles can also usefully combine the predictions of networks trained using different parameters, or of different architectures).

If we form an ensemble from the cross validated networks, is the performance estimate formed by averaging the test set performance of the individual networks an unbiased estimate of generalization performance?

The answer is: no. The expected performance of an ensemble is not, in general, the same as the average performance of the members. Actually, the expected performance of the ensemble is at least as good as the average performance of the members, and usually better. Thus you can use the estimate so-formed, knowing that it is conservatively biased.
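For the specific case of an ensemble that averages member predictions and is scored by mean squared error, this relationship can be checked directly: the error of the averaged prediction can never exceed the average error of the members, because the squared error is a convex function. The sketch below uses hypothetical targets and predictions from three member networks.

```python
def mean_squared_error(predictions, targets):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

targets = [1.0, 2.0, 3.0, 4.0]
member_predictions = [  # hypothetical outputs of three cross validated networks
    [1.2, 1.7, 3.3, 4.1],
    [0.8, 2.4, 2.9, 3.6],
    [1.1, 2.1, 2.6, 4.3],
]

# Ensemble prediction: average the member outputs case by case.
ensemble_prediction = [sum(case) / len(case) for case in zip(*member_predictions)]

average_member_mse = sum(
    mean_squared_error(p, targets) for p in member_predictions
) / len(member_predictions)
ensemble_mse = mean_squared_error(ensemble_prediction, targets)
```

Here the members' errors partly cancel, so the ensemble error is well below the average member error; the two are equal only if all members make identical predictions.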

Cross validation is one technique for resampling data. There are others:

• Random (Monte Carlo) resampling - the subsets are randomly sampled from the available cases.  Each available case is assigned to one of the three subsets.

• Bootstrapping - this technique (Efron, 1979) samples a data set with replacement (i.e. a single case may be randomly sampled several times into the bootstrap set). The bootstrap can be applied any number of times, for increased accuracy. Compared with random sampling, the use of sampling with replacement can help to iron out generalization problems caused by the finite size of the data set. Breiman (1996) suggested using the bootstrap sampling technique to train multiple models for ensemble averaging (in his case the models were decision trees, but the conclusions carry over to other models), a technique he refers to as bagging.
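Sampling with replacement, as used by the bootstrap, can be sketched as follows. The function name is hypothetical, and treating the cases that were never drawn (the "out-of-bag" cases) as a test subset is an assumption added for illustration, not something the description above prescribes.

```python
import random

def bootstrap_sample(n_cases, rng):
    """Draw n_cases indices with replacement; also return the indices never drawn."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    out_of_bag = sorted(set(range(n_cases)) - set(in_bag))
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(100, rng)
```

Repeating the call yields a different bootstrap set each time, which is how multiple models are trained for bagging.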

Residual. Residuals are differences between the observed values and the corresponding values that are predicted by the model and thus they represent the variance that is not explained by the model. The better the fit of the model, the smaller the values of residuals. The ith residual ($e_i$) is equal to:

$$e_i = y_i - \hat{y}_i$$

where
$y_i$         is the ith observed value
$\hat{y}_i$   is the corresponding predicted value

Resolution. An experimental design of resolution R is one in which no l-way interactions are confounded with any other interaction of order less than R - l. For example, in a design of resolution R equal to 5, no l = 2-way interactions are confounded with any other interaction of order less than R - l = 3, so main effects are unconfounded with each other, main effects are unconfounded with 2-way interactions, and 2-way interactions are unconfounded with each other. For discussions of the role of resolution in experimental design see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

Response Surface. A surface plotted in three dimensions, indicating the response of one or more variables (or a neural network) as two input variables are adjusted with the others held constant. See DOE, Neural Networks.

RMS (Root Mean Squared) Error. To calculate the RMS (root mean squared) error, the individual errors are squared, added together, and divided by the number of individual errors, and the square root of the result is then taken. This gives a single number that summarizes the overall error. See Neural Networks.
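The calculation translates directly into code; the function name and the example errors are hypothetical.

```python
import math

def rms_error(errors):
    """Square the errors, average them, and take the square root."""
    return math.sqrt(sum(e ** 2 for e in errors) / len(errors))

overall_error = rms_error([3.0, -4.0])  # sqrt((9 + 16) / 2)
```

Because the errors are squared before averaging, positive and negative errors do not cancel, and large errors weigh more heavily than small ones.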

Root Cause Analysis. The term root cause analysis is commonly used in manufacturing to summarize the activities involved in determining the variables or factors that impact the final quality or yield of the respective processes. For example, if a particular pattern of defects emerges in manufacture of silicon chips, engineers will pursue various methods and strategies for root cause analysis to determine the ultimate causes of those patterns of quality problems.

Root Mean Square Standardized Effect (RMSSE). This standardized measure of effect size is used in the Analysis of Variance to characterize the overall level of population effects. It is the square root of the sum of squared standardized effects divided by the number of degrees of freedom for the effect. For example, in a 1-Way Anova, the RMSSE is calculated as 