Wald Statistic. The results Scrollsheet with the parameter estimates for the Cox proportional hazard regression model includes the so-called Wald statistic, and the p level for that statistic. This statistic is a test of significance of the regression coefficient; it is based on the asymptotic normality property of maximum likelihood estimates, and is computed as:

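$$W = \hat{\beta} \cdot \frac{1}{\mathrm{Var}(\hat{\beta})} \cdot \hat{\beta} = \frac{\hat{\beta}^{2}}{\mathrm{Var}(\hat{\beta})}$$

In this formula, $\hat{\beta}$ stands for the parameter estimate, and $\mathrm{Var}(\hat{\beta})$ stands for the asymptotic variance of the parameter estimate. The Wald statistic is tested against the Chi-square distribution.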
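As a minimal sketch of the computation described above, the following Python function evaluates the Wald statistic for a single coefficient. It assumes the statistic is referred to a Chi-square distribution with 1 degree of freedom (the usual case for one coefficient, but an assumption here); the function name and the example numbers are hypothetical.

```python
import math

def wald_test(beta_hat, var_beta_hat):
    """Return (W, p) for one regression coefficient.

    Assumes a Chi-square reference distribution with 1 degree of freedom.
    """
    w = beta_hat ** 2 / var_beta_hat
    # For 1 degree of freedom, P(chi-square > w) = erfc(sqrt(w / 2)).
    p = math.erfc(math.sqrt(w / 2.0))
    return w, p

# Hypothetical estimate and asymptotic variance.
w, p = wald_test(beta_hat=0.8, var_beta_hat=0.16)
print(f"W = {w:.3f}, p = {p:.4f}")
```

Because the variance is the squared standard error, W is simply the square of the familiar ratio of the estimate to its standard error.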

STATISTICA Enterprise Server applications (formerly WebSTATISTICA). STATISTICA Enterprise Server is an enterprise system that offers full Web enablement, including the ability to run STATISTICA interactively or in batch from a Web browser on any computer (including Linux and UNIX machines), offload time-consuming tasks to the servers (using distributed processing), use a multi-tier client/server architecture, develop highly customized Web applications, and manage and administer projects over the Web.

Weibull Distribution. The Weibull distribution (Weibull, 1939, 1951; see also Lieblein, 1955) has density function (for positive parameters b, c, and $\theta$):

$$f(x) = \frac{c}{b} \left(\frac{x-\theta}{b}\right)^{c-1} e^{-\left(\frac{x-\theta}{b}\right)^{c}}$$

$\theta < x$,  $b > 0$,  $c > 0$

b     is the scale parameter of the distribution
c     is the shape parameter of the distribution
$\theta$     is the location parameter of the distribution
e     is the base of the natural logarithm, sometimes called Euler's e (2.71...)

[Animation: the Weibull distribution as the shape parameter increases (0.5, 1, 2, 3, 4, 5, and 10).]
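The density can be evaluated directly from its definition. The sketch below is illustrative only: the function name is hypothetical, the location parameter defaults to 0, and the density is taken to be 0 at or below the location parameter. It prints the density at one point for the shape values listed above.

```python
import math

def weibull_pdf(x, b, c, theta=0.0):
    """Weibull density with scale b, shape c, and location theta."""
    if b <= 0 or c <= 0:
        raise ValueError("b and c must be positive")
    if x <= theta:
        return 0.0  # outside the support theta < x
    z = (x - theta) / b
    return (c / b) * z ** (c - 1) * math.exp(-(z ** c))

# Density at x = 1 with scale 1 for each shape value.
for c in (0.5, 1, 2, 3, 4, 5, 10):
    print(c, round(weibull_pdf(1.0, b=1.0, c=c), 4))
```

With c = 1 the expression reduces to the exponential density, which is a convenient check on any implementation.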

Weigend Weight Regularization (in Neural Networks).

A common problem in neural network training (particularly of multilayer perceptrons) is over-fitting. A network with a large number of weights in comparison with the number of training cases available can achieve a low training error by modeling a function that fits the training data well despite failing to capture the underlying model. An over-fitted model typically has high curvature, as the function is contorted to pass through the points, modeling any noise in addition to the underlying data.

There are several approaches in neural network training to deal with the over-fitting problem (Bishop, 1995). These approaches are listed below.

  • Select a neural network with just enough units to model the underlying function. The problem with this approach is determining the correct number of units, which is problem-dependent.

  • Add some noise to the training cases during training (altering the noise on each case each epoch): this "blurs" the position of the training data, and forces the network to model a smoothed version of the data.

  • Stop training (see Stopping Conditions) when the selection error begins to rise, even if the training error continues to fall. This event is a sure sign that the network is beginning to over-fit the data.

  • Use a regularization technique, which explicitly penalizes networks with large curvature, thus encouraging the development of a smoother model.
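The stopping rule in the third approach can be sketched as follows. This is an illustration under assumptions rather than the procedure used by any particular package: the function name, the `patience` parameter (how many epochs without improvement are tolerated), and the error values are all hypothetical.

```python
def early_stopping_epoch(selection_errors, patience=1):
    """Return the epoch with the lowest selection error seen before stopping.

    Training stops once `patience` epochs have passed without a new minimum.
    """
    best_epoch = 0
    best_error = float("inf")
    for epoch, err in enumerate(selection_errors):
        if err < best_error:
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break  # selection error has stopped improving
    return best_epoch

# Hypothetical selection errors recorded after each epoch.
errors = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.49]
print(early_stopping_epoch(errors, patience=2))
```

A small `patience` stops at the first rise in selection error; a larger value tolerates temporary rises at the cost of extra training.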

The last technique mentioned is regularization, and this section describes Weigend weight regularization (Weigend et al., 1991).

A multilayer perceptron model with sigmoid (logistic or hyperbolic tangent) activation functions has higher curvature if the weights are larger. You can see this by considering the shape of the sigmoid curve: if you just look at a small part of the central section, around the value 0.0, it is "nearly linear," and so a network with very small weights will model a "nearly linear" function, which has low curvature. As an aside, note that during training the weights are first set to small values (corresponding to a low curvature function), and then (at least some of them) diverge. One way to promote low curvature therefore is to encourage smaller weights.

Weigend weight regularization does this by adding an extra term to the error function, which penalizes larger weights. Hence the network tends to develop the larger weights that it needs to model the problem, and the others are driven toward zero. The technique can be used with any of the multilayer perceptron training algorithms (back propagation, conjugate gradient descent, quasi-Newton, quick propagation, and Delta-bar-Delta) apart from Levenberg-Marquardt, which makes its own assumptions about the error function.

The technique is commonly referred to as Weigend weight elimination, as it is possible, once weights become very small, to simply remove them from the network. This is an extremely useful technique for developing models with a "sensible" number of hidden units, and for selecting input variables.

Once a model has been trained with Weigend regularization and excess inputs and hidden units removed, it can be further trained with Weigend regularization turned off, to "sharpen up" the final solution.

Weigend regularization can also be very helpful in that it tends to prevent models from becoming over-fitted.

Note: When using Weigend regularization, the error on the progress graph includes the Weigend penalty factor. If you compare a network trained with Weigend to one without, you may get a false impression that the Weigend-trained network is under-performing. To compare such networks, view the error reported in the summary statistics on the model list (this does not include the Weigend error term).

Technical Details. The Weigend error penalty is given by:

$$\lambda \sum_{i} \frac{w_i^{2} / w_0^{2}}{1 + w_i^{2} / w_0^{2}}$$

where $\lambda$ is the Regularization coefficient, $w_i$ is each of the weights, and $w_0$ is the Scale coefficient.

The error penalty is added to the error calculated by the network's error function during training, and its derivative is added to the weight's derivative. However, the penalty is ignored when running a network.
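The penalty and its derivative can be sketched in a few lines. The code assumes the standard weight-elimination form of the penalty, the regularization coefficient times the sum over weights of r / (1 + r) with r = (w / w0)^2; the function names and default values are hypothetical.

```python
def weigend_penalty(weights, lam=0.001, w0=1.0):
    """Penalty added to the training error (assumed weight-elimination form)."""
    total = 0.0
    for w in weights:
        r = (w / w0) ** 2
        total += r / (1.0 + r)
    return lam * total

def weigend_gradient(weights, lam=0.001, w0=1.0):
    """Derivative of the penalty with respect to each weight."""
    grads = []
    for w in weights:
        r = (w / w0) ** 2
        grads.append(lam * (2.0 * w / w0 ** 2) / (1.0 + r) ** 2)
    return grads

weights = [0.05, -0.8, 3.0]
print(weigend_penalty(weights))
print(weigend_gradient(weights))
```

During training, the first value would be added to the network error and each gradient entry to the corresponding weight's error derivative; neither is used when the trained network is run.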

The regularization coefficient is usually manipulated to adjust the selective pressure to prune units. The relationship between this coefficient and the number of active units is roughly logarithmic, so the coefficient is typically altered over a wide range (0.01-0.0001, say).

The scale coefficient defines what is a "large" value to the algorithm. The default setting of 1.0 is usually reasonable, and it is seldom altered.

A feature of the Weigend error penalty is that it does not just penalize larger weights. It also prefers to tolerate an uneven mix of some large and some small weights, as opposed to a number of medium-sized weights. It is this property that allows it to "eliminate" weights.
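A small numerical comparison illustrates this preference. The sketch assumes the standard weight-elimination form of the penalty (the regularization coefficient times the sum of r / (1 + r), with r = (w / w0)^2) and uses two hypothetical weight vectors with the same total absolute weight.

```python
def weigend_penalty(weights, lam=1.0, w0=1.0):
    """Assumed form: lam * sum(r / (1 + r)) with r = (w / w0) ** 2."""
    return lam * sum((w / w0) ** 2 / (1.0 + (w / w0) ** 2) for w in weights)

uneven = [2.0, 0.0]  # one large weight, one eliminated weight
even = [1.0, 1.0]    # two medium-sized weights

penalty_uneven = weigend_penalty(uneven)
penalty_even = weigend_penalty(even)
print(penalty_uneven, penalty_even)
```

Because each weight's contribution saturates below the regularization coefficient as the weight grows, concentrating the magnitude in one weight costs less than spreading it over two, so the uneven vector receives the smaller penalty.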

Weighted Least Squares (in Regression). In some cases it is desirable to apply differential weights to the observations in a regression analysis, and to compute so-called weighted least squares regression estimates. This method is commonly applied when the variances of the residuals are not constant over the range of the independent variable values. In that case, one can apply the inverses of the residual variances as weights and compute weighted least squares estimates. (In practice, these variances are usually not known; however, they are often proportional to the values of the independent variable(s), and this proportionality can be exploited to compute appropriate case weights.) Neter, Wasserman, and Kutner (1985) describe an example of such an analysis.
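For a single independent variable, the weighted estimates have a simple closed form. The following sketch is illustrative: the function name and the data are hypothetical, and the weights 1/x encode the assumption that the residual variance is proportional to x.

```python
def weighted_least_squares(x, y, weights):
    """Return (intercept, slope) minimizing sum(w * (y - a - b * x) ** 2)."""
    sw = sum(weights)
    x_bar = sum(w * xi for w, xi in zip(weights, x)) / sw
    y_bar = sum(w * yi for w, yi in zip(weights, y)) / sw
    sxy = sum(w * (xi - x_bar) * (yi - y_bar) for w, xi, yi in zip(weights, x, y))
    sxx = sum(w * (xi - x_bar) ** 2 for w, xi in zip(weights, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data; residual variance assumed proportional to x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
weights = [1.0 / xi for xi in x]

intercept, slope = weighted_least_squares(x, y, weights)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

Setting all weights to 1 recovers the ordinary least squares estimates.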

Wilcoxon Test. The Wilcoxon test is a nonparametric alternative to the t-test for dependent samples. It is designed to test a hypothesis about the location (median) of a population distribution. It often involves the use of matched pairs, for example, "before" and "after" data, in which case it tests for a median difference of zero.

This procedure assumes that the variables under consideration were measured on a scale that allows the rank ordering of observations based on each variable (i.e., an ordinal scale) and that also allows rank ordering of the differences between variables (this type of scale is sometimes referred to as an ordered metric scale; see Coombs, 1950). For more details, see Siegel & Castellan (1988). See also Nonparametric Statistics.
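The ranking of differences that the test relies on can be sketched as follows. The function computes the sums of ranks for positive and negative differences; it assumes two common conventions that are not stated above, namely that zero differences are dropped and that tied absolute differences receive average ranks. The function name and data are hypothetical, and no p-value is computed.

```python
def wilcoxon_signed_rank(before, after):
    """Return (W_plus, W_minus, n) for matched pairs."""
    diffs = [a - b for b, a in zip(before, after) if a != b]  # drop zeros
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        # Find the run of tied absolute differences starting at position i.
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus, len(diffs)

# Hypothetical "before" and "after" measurements.
before = [10, 12, 9, 14, 11, 13]
after = [12, 11, 13, 14, 15, 16]
print(wilcoxon_signed_rank(before, after))
```

Under the hypothesis of a zero median difference the two rank sums should be similar; a very small value of either one is evidence against it.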

Wilson-Hilferty Transformation. Wilson & Hilferty (1931) found a way to transform a Chi-square variable to the Z-scale so that the p-values of the two are closely approximated. This transformation therefore enables us to compare the statistical significance of Chi-square values with different degrees of freedom. The transformation is:

$$Z = \frac{(Y/n)^{1/3} - \left(1 - \frac{2}{9n}\right)}{\sqrt{\frac{2}{9n}}}$$

where Y is the Chi-square statistic and n is the degrees of freedom.
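A short sketch of the transformation, assuming its usual cube-root form Z = [(Y/n)^(1/3) - (1 - 2/(9n))] / sqrt(2/(9n)); the function name and the example values are hypothetical.

```python
import math

def wilson_hilferty_z(y, n):
    """Approximate Z-score for a Chi-square value y with n degrees of freedom."""
    k = 2.0 / (9.0 * n)
    return ((y / n) ** (1.0 / 3.0) - (1.0 - k)) / math.sqrt(k)

# Two Chi-square values with different degrees of freedom, placed on one scale.
for y, n in [(18.307, 10), (7.0, 2)]:
    print(f"Y = {y}, n = {n}: Z = {wilson_hilferty_z(y, n):.3f}")
```

Expressing both statistics as Z-scores makes their significance directly comparable, which the raw Chi-square values are not.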

Win Frequencies (in Neural Networks). In a Kohonen network, the win frequency of a radial unit is the number of times that the unit is the winner when the data set is executed. Units that win frequently represent cluster centers in the topological map. See also Neural Networks.
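Counting wins can be sketched as follows, assuming that the winner for each case is the unit whose center is closest in squared Euclidean distance. The function name, unit centers, and cases are hypothetical.

```python
def win_frequencies(units, cases):
    """Count how often each unit is the closest one to a case."""
    counts = [0] * len(units)
    for case in cases:
        distances = [
            sum((c - u) ** 2 for c, u in zip(case, unit)) for unit in units
        ]
        counts[distances.index(min(distances))] += 1  # winner gets one win
    return counts

# Hypothetical unit centers and data cases in two dimensions.
units = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
cases = [(0.1, 0.2), (0.9, 1.2), (1.1, 0.8), (0.2, -0.1), (0.8, 0.9)]
print(win_frequencies(units, cases))
```

A unit with a count of zero never wins on this data set, while units with high counts mark the regions where cases cluster.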

Wire. A wire is a line, usually curved, used in a path diagram to represent variances and covariances of exogenous variables.