
Machine Learning. Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether or not the "models" or techniques that are used to generate the prediction are interpretable or open to simple explanation. Good examples of this type of technique often applied to predictive data mining are neural networks and meta-learning techniques such as boosting. These methods usually involve the fitting of very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in crossvalidation samples.

Mahalanobis Distance. We can think of the independent variables (in a regression equation) as defining a multidimensional space in which each observation can be plotted. Also, we can plot a point representing the means for all independent variables. This "mean point" in the multidimensional space is also called the centroid. The *Mahalanobis distance* is the distance of a case from the centroid in the multidimensional space, defined by the correlated independent variables (if the independent variables are uncorrelated and have unit variances, it is the same as the simple Euclidean distance). Thus, this measure provides an indication of whether or not an observation is an outlier with respect to the independent variable values. See also, standard residual value, deleted residual and Cook’s distance.
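As a rough illustration of the definition above, the following minimal sketch computes the Mahalanobis distance for the two-variable case, with the inverse of the 2x2 covariance matrix written out by hand. The function name `mahalanobis_2d` and the small data set are hypothetical and chosen only for illustration; the sketch assumes the sample (n - 1) covariance is the one intended.

```python
from statistics import mean


def mahalanobis_2d(point, data):
    """Mahalanobis distance of `point` from the centroid of 2-variable `data`."""
    xs = [row[0] for row in data]
    ys = [row[1] for row in data]
    mx, my = mean(xs), mean(ys)  # the centroid ("mean point")
    n = len(data)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]] (assumed n - 1 denominator)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] S^{-1} [dx dy]^T, using the closed-form 2x2 inverse
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5


# Hypothetical, uncorrelated data centered on (0, 0)
data = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
distance = mahalanobis_2d((2, 0), data)
```

Because the two variables in this toy data set are uncorrelated, the result is simply the Euclidean distance rescaled by the common standard deviation; with correlated variables the cross-term `sxy` changes the answer.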

Mallow's CP. If *p* regressors are selected from a set of *k*, *Cp* is defined as:

$$C_p = \frac{\sum (y - y_p)^2}{s^2} - n + 2p$$

where

$y_p$ is the predicted value of $y$ from the *p* regressors

$s^2$ is the residual mean square after regression on the complete set of *k*

*n* is the sample size

The model is then chosen to give a minimum value of the criterion, or a value that is acceptably small. It is essentially a special case of the Akaike Information Criterion. Mallow's CP is used in *General Regression Models (GRM)* as the criterion for choosing the best subset of predictor effects when a best subset regression analysis is being performed. This measure of the quality of fit for a model tends to be less dependent (than the *R-square*) on the number of effects in the model, and hence, it tends to find the best subset that includes only the important predictors of the respective dependent variable. See Best Subset Regression Options in GRM for further details.
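The selection rule can be sketched as follows. This is a minimal illustration, assuming the residual sums of squares for each candidate subset have already been computed elsewhere; the function name `mallows_cp`, the subset labels, and all numbers are hypothetical.

```python
def mallows_cp(sse_p, s2_full, n, p):
    """Cp = (residual sum of squares for the p regressors) / s^2 - n + 2p."""
    return sse_p / s2_full - n + 2 * p


# Hypothetical residual sums of squares for three candidate subsets
candidates = {
    ("x1",): 120.0,
    ("x1", "x2"): 52.0,
    ("x1", "x2", "x3"): 50.0,
}
s2_full = 2.0  # residual mean square from the complete set of k regressors (assumed)
n = 30         # sample size (assumed)

scores = {
    subset: mallows_cp(sse, s2_full, n, len(subset))
    for subset, sse in candidates.items()
}
best_subset = min(scores, key=scores.get)  # subset with the smallest Cp
```

In this made-up example the third regressor barely reduces the residual sum of squares, so the penalty term 2p makes the two-regressor subset the preferred one.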

Manifest Variable. A manifest variable is a variable that is directly observable or measurable. In path analysis diagrams used in structural modeling (see Path Diagram), manifest variables are usually represented by enclosing the variable name within a square or a rectangle.

Mann-Scheuer-Fertig Test. This test, proposed by Mann, Scheuer, and Fertig (1973), is described in detail in, for example, Dodson (1994) or Lawless (1982). The null hypothesis for this test is that the population follows the Weibull distribution with the estimated parameters. Nelson (1982) reports this test to have reasonably good power, and this test can be applied to Type II censored data. For computational details refer to Dodson (1994) or Lawless (1982); the critical values for the test statistic have been computed based on Monte Carlo studies, and have been tabulated for *n* (sample sizes) between 3 and 25; for *n* greater than 25, this test is not computed.

The *Mann-Scheuer-Fertig test* is used in Weibull and Reliability/Failure Time Analysis; see also, Hollander-Proschan Test and Anderson-Darling Test.

Map-Reduce. As a general approach, when analyzing hundreds of terabytes of data, or petabytes of data, it is not feasible to extract the data to another location for analysis; the process of moving data across wires to a separate server or servers (for parallel processing) would take too long and require too much bandwidth. Instead, the analytic computations must be performed physically close to where the data are stored. It is easier to bring the analytics to the data than the data to the analytics.

Map-reduce algorithms, i.e., data processing algorithms designed according to this pattern, do exactly that. A central component of the algorithm will map sub-computations to different locations in the distributed file system and combine the results (the reduce-step) that are computed at the individual nodes of the file system. In short, to compute a count, the algorithm would compute sub-totals within each node and in parallel in the distributed file system, and report back to the map component the subtotals, which are then added up.
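The counting example can be sketched in a few lines. This is only an illustration of the pattern, not of any particular distributed file system: the "nodes" are plain in-memory lists, threads stand in for parallel execution on separate machines, and the names `count_on_node` and `distributed_count` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_on_node(records, predicate):
    """Work done locally on one node: count the matching records."""
    return sum(1 for record in records if predicate(record))


def distributed_count(nodes, predicate):
    """Map the sub-computation to every node, then reduce by adding subtotals."""
    with ThreadPoolExecutor() as pool:
        # Each partition is counted independently (threads simulate the nodes)
        subtotals = list(pool.map(lambda part: count_on_node(part, predicate), nodes))
    return sum(subtotals)  # the reduce step


# Hypothetical data, already partitioned across three "nodes"
nodes = [[3, 8, 12], [7, 15], [1, 20, 25, 4]]
total = distributed_count(nodes, lambda value: value > 5)
```

Only the small subtotals travel back to the coordinating component; the raw records never leave the node that stores them, which is the point of the approach.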

Marginal Frequencies. In a Multi-way table, the values in the margins of the table are simply one-way (frequency) tables for all values in the table. They are important in that they help us to evaluate the arrangement of frequencies in individual columns or rows. The differences between the distributions of frequencies in individual rows (or columns) and in the respective margins inform us about the relationship between the crosstabulated variables. For more information on Marginal frequencies, see the Crosstabulations section of Basic Statistics.

Markov Chain Monte Carlo (MCMC). The term *"Monte Carlo method"* (suggested by John von Neumann and S. M. Ulam, in the 1940s) refers to simulation of processes, using random numbers. The term *Monte Carlo* (a city long known for its gambling casinos) derived from the fact that "numbers of chance" (i.e., *Monte Carlo* simulation methods) were used in order to solve some of the integrals of the complex equations involved in the design of the first nuclear bombs (integrals of quantum dynamics). By generating large samples of random numbers from, for example, mixtures of distributions, the integrals of these (complex) distributions can be approximated from the (generated) data.

Complex equations with difficult-to-solve integrals are often involved in Bayesian Statistics Analyses. For a simple example of the *MCMC* method for generating bivariate normal random variables, see the description of the Gibbs Sampler.

For a detailed discussion of *MCMC* methods, see Gilks, Richardson, and Spiegelhalter (1996). See also the description of the Gibbs Sampler, and Bayesian Statistics (Analysis).
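For orientation, a Gibbs sampler for bivariate normal variables can be sketched as below. This is a minimal illustration under stated assumptions, not the textbook's own implementation: both variables are taken to be standard normal with correlation `rho`, so each conditional distribution is normal with mean `rho` times the other variable and variance `1 - rho**2`. The function name and the burn-in length are hypothetical choices.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Draw (x, y) pairs from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    conditional_sd = (1.0 - rho ** 2) ** 0.5
    x = y = 0.0
    samples = []
    for i in range(burn_in + n_samples):
        x = rng.gauss(rho * y, conditional_sd)  # draw x given the current y
        y = rng.gauss(rho * x, conditional_sd)  # draw y given the new x
        if i >= burn_in:                        # discard the warm-up draws
            samples.append((x, y))
    return samples


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
# Monte Carlo approximation of the integral E[xy], which equals rho here
mean_xy = sum(x * y for x, y in samples) / len(samples)
```

The last line shows the general idea of the entry: an expectation (an integral over the joint distribution) is approximated by an average over the generated sample.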

Mass. The term *mass* in correspondence analysis is used to denote the entries in the two-way table of relative frequencies (i.e., each entry is divided by the sum of all entries in the table). Note that the results from correspondence analysis are still valid if the entries in the table are not frequencies, but some other measure of correspondence, association, similarity, confusion, etc. Since the sum of all entries in the table of relative frequencies is equal to 1.0, we could say that the table of relative frequencies shows how one unit of mass is distributed across the cells of the table. In the terminology of correspondence analysis, the row and column totals of the table of relative frequencies are called the row mass and column mass, respectively.
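The computation of the masses can be sketched as follows; the function name `correspondence_masses` and the 2x2 table of counts are hypothetical and serve only to illustrate the definitions above.

```python
def correspondence_masses(table):
    """Return the relative-frequency table plus its row and column masses."""
    total = sum(sum(row) for row in table)
    relative = [[cell / total for cell in row] for row in table]
    row_mass = [sum(row) for row in relative]
    col_mass = [sum(col) for col in zip(*relative)]
    return relative, row_mass, col_mass


# Hypothetical two-way table of frequencies
table = [[10, 20], [30, 40]]
relative, row_mass, col_mass = correspondence_masses(table)
```

Both the row masses and the column masses sum to 1.0, reflecting that one unit of mass is distributed across the cells of the table.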

Matching Moments Method. This method can be employed to determine parameter estimates for a distribution (see *Quantile-Quantile Plots*, *Probability-Probability Plots*, and *Process Analysis*). The method of matching moments sets the distribution moments equal to the data moments and solves to obtain estimates for the distribution parameters. For example, for a distribution with two parameters, the first two moments of the distribution (the mean and variance of the distribution, e.g., $\mu$ and $\sigma^2$, respectively) would be set equal to the first two moments of the data (the sample mean and variance, e.g., the unbiased estimators $\bar{x}$ and $s^2$, respectively) and solved for the parameter estimates. Alternatively, you could use the Maximum Likelihood Method to estimate the parameters. For more information, see Hahn and Shapiro, 1994.
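As a worked illustration, consider a two-parameter gamma distribution (an example chosen here, not taken from the text) with shape $k$ and scale $\theta$, for which the mean is $k\theta$ and the variance is $k\theta^2$. Setting these equal to the sample mean and variance and solving gives $\theta = s^2/\bar{x}$ and $k = \bar{x}^2/s^2$. The function name and data below are hypothetical.

```python
from statistics import mean, variance


def gamma_moment_estimates(data):
    """Match the first two moments of a gamma(shape, scale) distribution."""
    m = mean(data)
    v = variance(data)   # unbiased sample variance (n - 1 denominator)
    scale = v / m        # from  mean = shape * scale,  var = shape * scale**2
    shape = m * m / v
    return shape, scale


# Hypothetical positive-valued sample
shape, scale = gamma_moment_estimates([2, 4, 6, 8])
```

By construction, the fitted distribution reproduces the sample moments: `shape * scale` equals the sample mean and `shape * scale**2` equals the sample variance.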

Matrix Collinearity, Multicollinearity. This term is used in the context of correlation matrices or covariance matrices to describe the condition when one or more variables from which the respective matrix was computed are linear functions of other variables; as a consequence, such matrices cannot be inverted (only the generalized inverse can be computed). See also Matrix Singularity for additional details.

Matrix Ill-Conditioning. *Matrix ill-conditioning* is a general term used to describe a rectangular matrix of values which is unsuitable for use in a particular analysis.

This occurs perhaps most frequently in applications of linear multiple regression when the matrix of *correlations* for the predictors is *singular* and thus the regular matrix inverse cannot be computed. In some modules (e.g., in *Factor Analysis*) this problem is dealt with by issuing a respective warning and then artificially lowering all *correlations* in the correlation matrix by adding a small constant to the diagonal elements of the matrix, and then restandardizing it. This procedure will usually yield a matrix for which the regular matrix inverse can be computed.

Note that in many applications of the general linear model and the generalized linear/nonlinear model, matrix singularity is not abnormal (e.g., when the overparameterized model is used to represent effects for *categorical predictor variables*) and is dealt with by computing a generalized inverse rather than the regular *matrix inverse*.

Another example of matrix ill-conditioning is intransitivity of the correlations in a correlation matrix. If in a correlation matrix variable *A* is highly positively correlated with *B*, *B* is highly positively correlated with *C*, and *A* is highly negatively correlated with *C*, this "impossible" pattern of correlations signals an error in the elements of the matrix. See also matrix singularity, matrix inverse, generalized inverse.

Matrix Inverse. The *regular inverse* of a rectangular matrix of values is an extension of the concept of a numeric reciprocal. For a *nonsingular matrix* **A**, its inverse $\mathbf{A}^{-1}$ is the unique matrix that satisfies

$$\mathbf{A}^{-1}\mathbf{A}\mathbf{A} = \mathbf{A}$$

No such regular inverse exists for singular matrices, but generalized inverses (an infinite number of them) can be computed for any singular matrix. See also matrix singularity, generalized inverse.

Matrix Plots. Matrix graphs summarize the relationships between several variables in a matrix of true *X-Y* plots. The most common type of matrix plot is the scatterplot, which can be considered to be the graphical equivalent of the correlation matrix.

Matrix Plots - Columns. In this type of Matrix Plot, columns represent projections of individual data points onto the *X-axis* (showing the distribution of the maximum values), arranged in a matrix format. Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

Matrix Plots - Lines. In this type of Matrix Plot, a matrix of *X-Y* (i.e., nonsequential) line plots (similar to a scatterplot matrix) is produced in which individual points are connected by a line in the order of their appearance in the data file. Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

Matrix Plots - Scatterplot. In this type of Matrix Plot, 2D Scatterplots are arranged in a matrix format (values of the column variable are used as *X* coordinates, values of the row variable represent the *Y* coordinates). Histograms representing the distribution of each variable are displayed along the diagonal of the matrix (in square matrices) or along the edges (in rectangular matrices).

See also, Data Reduction.

Matrix Rank. The column (or row) *rank* of a rectangular matrix of values (e.g., a sums of squares and cross-products matrix) is equal to the number of linearly independent columns (or rows) of elements in the matrix. If there are no columns that are linearly dependent on other columns, then the rank of the matrix is equal to the number of its columns and the matrix is said to have full (column) *rank*. If the *rank* is less than the number of columns, the matrix is said to have reduced (column) *rank* and is *singular*. See also matrix singularity.

Matrix Singularity. A rectangular matrix of values (e.g., a sums of squares and cross-products matrix) is *singular* if the elements in a column (or row) of the matrix are linearly dependent on the elements in one or more other columns (or rows) of the matrix. For example, if the elements in one column of a matrix are *1, -1, 0*, and the elements in another column of the matrix are *2, -2, 0*, then the matrix is singular because *2* times each of the elements in the first column is equal to each of the respective elements in the second column. Such matrices are also said to suffer from multicollinearity problems, since one or more columns are linearly related to each other.

A unique, regular matrix inverse cannot be computed for singular matrices, but generalized inverses (an infinite number of them) can be computed for any singular matrix. See also, matrix inverse.
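The link between rank and singularity can be checked numerically. The following minimal sketch computes the rank by Gaussian elimination with partial pivoting; the function name `matrix_rank`, the tolerance, and the third column of the example matrix are hypothetical, while the first two columns are the ones from the example above (1, -1, 0 and 2, -2, 0).

```python
def matrix_rank(rows, tol=1e-10):
    """Rank of a matrix (list of rows) via Gaussian elimination."""
    m = [[float(value) for value in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Pick the row (at or below the current rank) with the largest pivot
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # this column depends linearly on the earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
        if rank == n_rows:
            break
    return rank


# Columns 1 and 2 are (1, -1, 0) and (2, -2, 0); column 3 is hypothetical
singular = [[1, 2, 0],
            [-1, -2, 1],
            [0, 0, 1]]
identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]

rank_singular = matrix_rank(singular)
rank_identity = matrix_rank(identity)
```

A rank below the number of columns (here 2 instead of 3) indicates reduced rank, and hence that no regular inverse exists for the first matrix.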

Maximum Likelihood Loss Function. A common alternative to the least squares loss function is to maximize the likelihood or log-likelihood function (or to minimize the negative log-likelihood function; the term maximum likelihood was first used by Fisher, 1922a). These functions are typically used when fitting non-linear models. In most general terms, the likelihood function is defined as:

$$L = F(Y, \text{Model}) = \prod_{i=1}^{n} \{\, p[y_i, \text{Model Parameters}(x_i)] \,\}$$

Maximum Likelihood Method. The method of maximum likelihood (the term first used by Fisher, 1922a) is a general method of estimating parameters of a population by values that maximize the *likelihood* (*L*) of a sample. The *likelihood L* of a sample of *n* observations $x_1, x_2, \ldots, x_n$ is the joint probability function (or, for continuous variables, the joint probability density function) of the observations, evaluated at the observed values.

Let *L* be the likelihood of a sample, where *L* is a function of the parameters $\theta_1, \theta_2, \ldots, \theta_k$. Then the maximum likelihood estimators of $\theta_1, \theta_2, \ldots, \theta_k$ are the values of $\theta_1, \theta_2, \ldots, \theta_k$ that maximize *L*.

Let $\theta$ be an element of $\Omega$. If $\Omega$ is an open interval, and if $L(\theta)$ is differentiable and assumes a maximum on $\Omega$, then the MLE will be a solution of the following equation: $dL(\theta)/d\theta = 0$. For more information, see Bain and Engelhardt (1989) and Neter, Wasserman, and Kutner (1989).
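As a small worked case (chosen here for illustration, not taken from the text), consider an exponential distribution with rate parameter theta. The log-likelihood of a sample is n log(theta) - theta * sum(x); setting its derivative n/theta - sum(x) to zero gives the estimate theta = n / sum(x). Maximizing the log-likelihood gives the same solution as maximizing the likelihood itself because the logarithm is monotone. The function names and data below are hypothetical.

```python
import math


def exp_log_likelihood(theta, xs):
    """Log-likelihood of an exponential sample with rate parameter theta."""
    return len(xs) * math.log(theta) - theta * sum(xs)


def exp_mle(xs):
    """Closed-form solution of d(log L)/d(theta) = n/theta - sum(xs) = 0."""
    return len(xs) / sum(xs)


xs = [0.5, 1.5, 2.0, 4.0]  # hypothetical observations
theta_hat = exp_mle(xs)

# Nearby parameter values give a lower log-likelihood than the estimate
best = exp_log_likelihood(theta_hat, xs)
lower = exp_log_likelihood(theta_hat - 0.1, xs)
higher = exp_log_likelihood(theta_hat + 0.1, xs)
```

When no closed-form solution exists, the same log-likelihood function would instead be handed to a numerical optimizer.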

See also, Nonlinear Estimation or Variance Components and Mixed Model ANOVA/ANCOVA.

Maximum Unconfounding. *Maximum unconfounding* is an experimental design criterion that is subsidiary to the criterion of design resolution. The *maximum unconfounding* criterion specifies that design generators should be chosen such that the maximum number of interactions of less than or equal to the crucial order, given the *resolution*, are unconfounded with all other interactions of the crucial order. It is an alternative to the *minimum aberration* criterion for finding the "best" design of maximum resolution. For discussions of the role of design criteria in experimental design see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

MD (Missing data). See Missing values.

Mean. The mean is a particularly informative measure of the "central tendency" of the variable if it is reported along with its confidence intervals. Usually we are interested in statistics (such as the mean) from our sample only to the extent to which they are informative about the population. The larger the sample size, the more reliable its mean. The larger the variation of data values, the less reliable the mean (see also Elementary Concepts).

$$\text{Mean} = \frac{\sum x_i}{n}$$

where

*n* is the sample size.

See also, Descriptive Statistics

Mean/S.D. An algorithm (used in neural networks) to assign linear scaling coefficients for a set of numbers. The mean and standard deviation of the set are found, and scaling factors selected so that these are mapped to desired mean and standard deviation values. See also, Neural Networks.
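The scaling described above can be sketched as a linear map `scale * value + shift`. This is a minimal illustration; the function name `mean_sd_scaling`, the default targets of 0 and 1, and the use of the population standard deviation are assumptions made here, not details given in the text.

```python
from statistics import mean, pstdev


def mean_sd_scaling(values, target_mean=0.0, target_sd=1.0):
    """Return (scale, shift) so that scale * v + shift has the target mean and S.D."""
    m = mean(values)
    s = pstdev(values)  # population S.D.; the choice of denominator is an assumption
    if s == 0:
        raise ValueError("all values are identical; S.D. is zero")
    scale = target_sd / s
    shift = target_mean - scale * m
    return scale, shift


values = [2, 4, 6, 8]  # hypothetical inputs
scale, shift = mean_sd_scaling(values)
scaled = [scale * v + shift for v in values]
```

The same two coefficients can then be applied to new values, which is why the algorithm stores factors rather than the transformed data.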

Mean Substitution of Missing Data. When you select *Mean Substitution*, the missing data will be replaced by the means for the respective variables during an analysis. See also, Casewise vs. pairwise deletion of missing data.

Median. A measure of central tendency, the *median* (the term first used by Galton, 1882) of a sample is the value for which one-half (50%) of the observations (when ranked) will lie above that value and one-half will lie below that value. When the number of values in the sample is even, the *median* is computed as the average of the two middle values. See also, Descriptive Statistics.

Meta-Learning. The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).

Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.

We can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; however, in practice such an exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.

Minimax. An algorithm to assign linear scaling coefficients for a set of numbers. The minimum and maximum of the set are found, and scaling factors selected so that these are mapped to desired minimum and maximum values. See also, Neural Networks.
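A minimal sketch of this scaling follows, again as a linear map `scale * value + shift`. The function name `minimax_scaling` and the default target range of 0 to 1 are assumptions chosen for illustration.

```python
def minimax_scaling(values, target_min=0.0, target_max=1.0):
    """Return (scale, shift) mapping min(values)/max(values) to the target range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; range is zero")
    scale = (target_max - target_min) / (hi - lo)
    shift = target_min - scale * lo
    return scale, shift


values = [10, 15, 20]  # hypothetical inputs
scale, shift = minimax_scaling(values)
scaled = [scale * v + shift for v in values]
```

Unlike the Mean/S.D. approach, this scaling is determined entirely by the two extreme values, so a single outlier compresses all other values into a narrow part of the target range.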

Minimum Aberration. *Minimum aberration* is an experimental design criterion that is subsidiary to the criterion of design resolution. The *minimum aberration* design is defined as the design of maximum *resolution* "which minimizes the number of words in the defining relation that are of minimum length" (Fries & Hunter, 1980). Less technically, the criterion apparently operates by choosing design generators that produce the smallest number of pairs of confounded interactions of the crucial order. For example, the *minimum aberration* resolution IV design would have the minimum number of pairs of confounded 2-factor interactions. For discussions of the role of design criteria in experimental design see 2**(k-p) fractional factorial designs and 2**(k-p) Maximally Unconfounded and Minimum Aberration Designs.

Missing Values. Values of variables within data sets which are not known. Although such cases that contain missing data are incomplete, they can still be used in data analysis. Various methods exist to substitute missing data (e.g., by mean substitution, various types of interpolations and extrapolations). Also, pairwise deletion of missing data can be used. See also, Pairwise deletion of missing data, Casewise (Listwise) deletion of missing data, Pairwise deletion of missing data vs. mean substitution, and Casewise vs. pairwise deletion of missing data.

Mode. A measure of central tendency, the *mode* (the term first used by Pearson, 1895) of a sample is the value which occurs most frequently in the sample. See also, Descriptive Statistics.

Model Profiles (in Neural Networks). Model profiles are concise text strings indicating the architecture of networks and ensembles. A profile consists of a type code followed by a code giving the number of input and output variables and number of layers and units (networks) or members (ensembles). For time series networks, the number of steps and the lookahead factor are also given. The individual parts of the profile are:

Model Type. The codes are:

Code | Model Type |
---|---|
MLP | Multilayer Perceptron Network |
RBF | Radial Basis Function Network |
SOFM | Kohonen Self-Organizing Feature Map |
Linear | Linear Network |
PNN | Probabilistic Neural Network |
GRNN | Generalized Regression Neural Network |
PCA | Principal Components Network |
Cluster | Cluster Network |
Output | Output Ensemble |
Conf | Confidence Ensemble |

Network architecture. This is of the form I:N-N-N:O, where *I* is the number of input variables, *O* the number of output variables, and *N* the number of units in each layer.

**Example.** 2:4-6-3:1 indicates a network with *2* input variables, *1* output variable, *4* input neurons, *6* hidden neurons, and *3* output neurons.

For a time series network, the steps factor is prepended to the profile, and signified by an "s."

**Example.** s10 1:10-2-1:1 indicates a time series network with steps factor (lagged input) 10.

Ensemble architecture. This is of the form I:[N]:O, where *I* is the number of input variables, *O* the number of output variables, and *N* the number of members of the ensemble.
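The architecture part of a profile is regular enough to be read mechanically. The following minimal sketch parses the forms shown above (network, time series network, and ensemble); it deliberately ignores the leading model type code, and the function name `parse_profile`, the regular expression, and the returned dictionary layout are hypothetical choices made for illustration.

```python
import re

# Optional "sNN " steps prefix, then I : body : O, where the body is either
# N-N-N (layer sizes of a network) or [N] (number of ensemble members).
PROFILE_RE = re.compile(
    r"^(?:s(?P<steps>\d+)\s+)?"
    r"(?P<inputs>\d+):(?P<body>\[\d+\]|\d+(?:-\d+)*):(?P<outputs>\d+)$"
)


def parse_profile(text):
    """Parse the architecture part of a model profile into a dictionary."""
    match = PROFILE_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognized profile: {text!r}")
    result = {
        "steps": int(match["steps"]) if match["steps"] else None,
        "inputs": int(match["inputs"]),
        "outputs": int(match["outputs"]),
    }
    body = match["body"]
    if body.startswith("["):
        result["members"] = int(body[1:-1])                  # ensemble
    else:
        result["layers"] = [int(u) for u in body.split("-")]  # network
    return result


network = parse_profile("2:4-6-3:1")
time_series = parse_profile("s10 1:10-2-1:1")
ensemble = parse_profile("3:[5]:1")  # hypothetical ensemble profile
```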

Models for Data Mining. In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.

One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) general sequence of steps for data mining projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).

Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by SAS Institute called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project.

All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making.

Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.

The general underlying philosophy of StatSoft's *STATISTICA* Data Miner is to provide a flexible data mining workbench that can be integrated into any organization, industry, or organizational culture, regardless of the general data mining process-model that the organization chooses to adopt. For example, *STATISTICA* Data Miner can include the complete set of (specific) necessary tools for ongoing company-wide Six Sigma quality control efforts, and users can take advantage of its (still optional) DMAIC-centric user interface for industrial data mining tools. It can equally well be integrated into ongoing marketing research, CRM (Customer Relationship Management) projects, etc. that follow either the CRISP or SEMMA approach; it fits both of them well without favoring either one. Also, *STATISTICA* Data Miner offers all the advantages of a general data mining oriented "development kit" that includes easy-to-use tools for incorporating into your projects not only such components as custom database gateway solutions, prompted interactive queries, or proprietary algorithms, but also systems of access privileges, workgroup management, and other collaborative work tools that allow you to design large-scale, enterprise-wide systems (e.g., following the CRISP, SEMMA, or a combination of both models) that involve your entire organization. See also Data Mining Techniques.

Monte Carlo. A computer-intensive technique for assessing how a statistic will perform under repeated sampling. In Monte Carlo methods, the computer uses random number simulation techniques to mimic a statistical population. In the *STATISTICA* Monte Carlo procedure, the computer constructs the population according to the user’s prescription, then does the following:

For each Monte Carlo replication, the computer:

- Simulates a random sample from the population,
- Analyzes the sample,
- Stores the results.

After many replications, the stored results will mimic the sampling distribution of the statistic. Monte Carlo techniques can provide information about sampling distributions when exact theory for the sampling distribution is not available.
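The replication loop can be sketched generically. This is a minimal illustration and not the *STATISTICA* procedure itself: the population is assumed to be standard normal, the statistic studied is the sample median, and the names `monte_carlo` and `normal_sample`, the sample size of 25, and the 2000 replications are all hypothetical choices.

```python
import random
from statistics import mean, median, stdev


def monte_carlo(statistic, draw_sample, replications, seed=0):
    """Repeat: simulate a sample, analyze it, and store the result."""
    rng = random.Random(seed)
    results = []
    for _ in range(replications):
        sample = draw_sample(rng)          # simulate a random sample
        results.append(statistic(sample))  # analyze it and store the result
    return results


def normal_sample(rng, n=25):
    """Draw n observations from the assumed standard normal population."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


medians = monte_carlo(median, normal_sample, replications=2000)
center = mean(medians)   # approximates the expected value of the sample median
spread = stdev(medians)  # approximates its standard error
```

The stored medians approximate the sampling distribution of the statistic, so summaries such as `spread` can be read off directly even when no convenient exact formula is at hand.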

Multidimensional Scaling. *Multidimensional scaling* (*MDS*) can be considered to be an alternative to factor analysis (see Factor Analysis), and it is typically used as an exploratory method. In general, the goal of the analysis is to detect meaningful underlying dimensions that allow the researcher to explain observed similarities or dissimilarities (distances) between the investigated objects. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix. With *MDS,* we can analyze not only correlation matrices but also any kind of similarity or dissimilarity matrix (including sets of measures that are not internally consistent, e.g., do not follow the rule of transitivity). For more information, see the Multidimensional Scaling overview.

Multilayer Perceptrons. Feedforward neural networks having linear PSP functions and (usually) non-linear activation functions.

Multimodal Distribution. A distribution that has multiple modes (thus two or more "peaks").

Multimodality of the distribution in a sample is often a strong indication that the distribution of the variable in the population is not normal. Multimodality of the distribution may provide important information about the nature of the investigated variable (i.e., the measured quality). For example, if the variable represents a reported preference or attitude, then multimodality may indicate that there are several pronounced views or patterns of response in the questionnaire. Often, however, the multimodality may indicate that the sample is not homogeneous and the observations come in fact from two or more "overlapping" distributions. Sometimes, multimodality of the distribution may indicate problems with the measurement instrument (e.g., "gage calibration problems" in natural sciences, or "response biases" in social sciences). See also, unimodal distribution, bimodal distribution.

Multinomial Distribution. The multinomial distribution arises when a response variable is categorical in nature, i.e., consists of data describing the membership of the respective cases to a particular category. For example, if a researcher recorded the outcome for the driver in accidents as "uninjured", "injury not requiring hospitalization", "injury requiring hospitalization", or "fatality", then the distribution of the counts in these categories would be multinomial (see Agresti, 1996). The multinomial distribution is a generalization of the binomial distribution to more than two categories.

If the categories for the response variable can be ordered, then the distribution of that variable is referred to as *ordinal multinomial*. For example, if in a survey the responses to a question are recorded such that respondents have to choose from the pre-arranged categories "Strongly agree", "Agree", "Neither agree nor disagree", "Disagree", and "Strongly disagree", then the counts (number of respondents) that endorsed the different categories would follow an ordinal multinomial distribution (since the response categories are ordered with respect to increasing degrees of disagreement).

Specialized methods for analyzing multinomial and ordinal multinomial response variables can be found in *Generalized Linear Models*.

Multinomial Logit and Probit Regression. The multinomial logit and probit regression models are extensions of the standard logit and probit regression models to the case where the dependent variable has more than two categories (e.g., not just *Pass - Fail*, but *Pass*, *Fail*, *Withdrawn*), i.e., when the dependent or response variable of interest follows a multinomial distribution rather than binomial distribution. When multinomial responses contain rank-order information, they are also called *ordinal multinomial responses* (see ordinal multinomial distribution).

For additional details, see also the discussion of Link Functions, Probit Transformation and Regression, Logit Transformation and Regression, or *Generalized Linear Models*.

Multi-Pattern Bar. Multi-pattern bar plots may be used to represent individual data values of the *X* variable (the same type of data as in pie charts), however, consecutive data values of the *X* variable are represented by the heights of sequential vertical bars, each of a different color and pattern (rather than as pie wedges of different widths).

Multiple Axes in Graphs. An arrangement of axes (coordinate scales) in graphs, where two or more axes are placed parallel to each other, in order to either:

- represent different units in which the variable(s) depicted in the graph can be measured (e.g., a Celsius and Fahrenheit scales of temperature), or

- allow for a comparison of trends or shapes between several plots placed in one graph (e.g., one axis for each plot) which otherwise would be obscured by incompatible measurement units or ranges of values for each variable (that is an extension of the common "double-Y" type of graph).

The latter instance requires the appropriate plot legends to be attached to each axis.

Multiple Dichotomies. One possible coding scheme that can be used when more than one response is possible from a given question is to code responses using *Multiple dichotomies*. For example, as part of a larger market survey, suppose you asked a sample of consumers to name their three favorite soft drinks. The specific item on the questionnaire may look like this:

**Write down your three favorite soft drinks:
1:__________ 2:__________ 3:__________**

Suppose in the above example we were only interested in *Coke*, *Pepsi*, and *Sprite*. One way to code the data in that case would be as follows:

COKE | PEPSI | SPRITE | . . . . | |
---|---|---|---|---|

case 1 case 2 case 3 . . . |
1 . . . |
1 1 . . . |
1 . . . |

In other words, one variable was created for each soft drink, then a value of *1* was entered into the respective variable whenever the respective drink was mentioned by the respective respondent. Note that each variable represents a *dichotomy*; that is, only "*1*"s and "*not 1*"s are allowed (we could have entered *1*'s and *0*'s, but to save typing we can also simply leave the *0*'s as blanks or as missing values). When tabulating these variables, we would like to compute the number and percent of respondents (and responses) for each soft drink. In a sense, we "compact" the three variables *Coke*, *Pepsi*, and *Sprite* into a single variable (*Soft Drink*) consisting of *multiple dichotomies*. For more information on *Multiple dichotomies*, see the Multiple Response Tables section of Basic Statistics.
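The tabulation described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the four respondents and the function name `tabulate_dichotomies` are made up for the example, and a missing key plays the role of a blank (non-mentioned) cell.

```python
# Hypothetical data: one dict per respondent; 1 = drink mentioned,
# key absent = blank cell (drink not mentioned).
cases = [
    {"COKE": 1, "PEPSI": 1},
    {"PEPSI": 1},
    {"SPRITE": 1},
    {"COKE": 1, "PEPSI": 1, "SPRITE": 1},
]
drinks = ["COKE", "PEPSI", "SPRITE"]


def tabulate_dichotomies(cases, variables):
    """Count mentions per variable and express each count as a percent
    of respondents and as a percent of all responses."""
    counts = {v: sum(1 for c in cases if c.get(v) == 1) for v in variables}
    n_respondents = len(cases)
    n_responses = sum(counts.values())
    return {
        v: {
            "count": counts[v],
            "pct_respondents": 100.0 * counts[v] / n_respondents,
            "pct_responses": 100.0 * counts[v] / n_responses,
        }
        for v in variables
    }


table = tabulate_dichotomies(cases, drinks)
for drink, row in table.items():
    print(drink, row)
```

Note that the percentages of respondents can sum to more than 100 because one respondent may mention several drinks, whereas the percentages of responses always sum to 100.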

Multiple Histogram. Multiple histograms present frequency distributions of more than one variable in one 2D graph. Unlike the Double-Y Histograms, the frequencies for all variables are plotted against the same left-*Y* axis.

Also, the values of all examined variables are plotted against a single *X-axis*, which facilitates comparisons between analyzed variables.

Multiple R. The coefficient of multiple correlation (Multiple R) is the positive square root of *R-square* (the coefficient of multiple determination, see Residual Variance and R-Square). This statistic is useful in multiple regression (i.e., regression with multiple independent variables) when you want to describe the strength of the relationship between the independent variables, taken together, and the dependent variable.
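As a rough sketch of how Multiple R relates to *R-square*, the following Python code fits a least squares regression with an intercept and then takes the positive square root of R-square. The data values and the helper names (`solve_linear_system`, `multiple_r`) are assumptions made for illustration only; a statistics package would normally do this fitting.

```python
import math


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def multiple_r(predictors, y):
    """Regress y on the predictors (plus an intercept) by least squares
    and return (R-square, Multiple R)."""
    rows = [[1.0] + list(p) for p in predictors]  # design matrix
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_square = 1.0 - ss_res / ss_tot
    return r_square, math.sqrt(max(r_square, 0.0))


# Made-up data: two independent variables and one dependent variable.
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [3.1, 2.9, 7.2, 6.8, 11.1, 10.9]
r_square, r = multiple_r(X, y)
print(r_square, r)
```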

Multiple Regression. The general purpose of *multiple regression* (the term was first used by Pearson, 1908) is to analyze the relationship between several independent or predictor variables and a dependent or criterion variable.

The computational problem that needs to be solved in multiple regression analysis is to fit a straight line (or plane in an *n*-dimensional space, where *n* is the number of independent variables) to a number of points. In the simplest case - one dependent and one independent variable - we can visualize this in a scatterplot (scatterplots are two-dimensional plots of the scores on a pair of variables). Multiple regression is used as either a hypothesis-testing or an exploratory method. For more information, see the Multiple Regression overview.

Multiple Response Variables. Coding the responses to *Multiple response variables* is necessary when more than one response is possible from a given question. For example, as part of a larger market survey, suppose you asked a sample of consumers to name their three favorite soft drinks. The specific item on the questionnaire may look like this:

**Write down your three favorite soft drinks:
1:__________ 2:__________ 3:__________**

Thus, the questionnaires returned to you will contain somewhere between 0 and 3 answers to this item. Also, a wide variety of soft drinks will most likely be named. One way to record the various responses would be to use three *multiple response variables* and a coding scheme for the many soft drinks. Then we could enter the respective codes (or alphanumeric labels) into the three variables, in the same way that respondents wrote them down in the questionnaire.

| | Resp. 1 | Resp. 2 | Resp. 3 |
|---|---|---|---|
| case 1 | COKE | PEPSI | JOLT |
| case 2 | SPRITE | SNAPPLE | DR. PEPPER |
| case 3 | PERRIER | GATORADE | MOUNTAIN DEW |
| . . . | . . . | . . . | . . . |

For more information, see the Multiple Response Tables section of Basic Statistics.
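To illustrate how data coded as multiple response variables might be tabulated, here is a minimal Python sketch. The respondent data and the function names `count_mentions` and `to_dichotomies` are hypothetical; the second function shows one way the same data could be recoded as dichotomies for a few drinks of interest.

```python
from collections import Counter

# Hypothetical coded data: up to three responses per respondent,
# in the order they were written down on the questionnaire.
responses = [
    ("COKE", "PEPSI", "JOLT"),
    ("SPRITE", "SNAPPLE", "DR. PEPPER"),
    ("PERRIER", "GATORADE", "MOUNTAIN DEW"),
    ("COKE", "SPRITE"),  # a respondent who named only two drinks
]


def count_mentions(responses):
    """Count how many respondents mentioned each drink."""
    counts = Counter()
    for answers in responses:
        counts.update(set(answers))  # set(): count a drink once per respondent
    return counts


def to_dichotomies(responses, drinks):
    """Recode each respondent as {drink: 1} for every drink of interest
    that the respondent mentioned (non-mentions are left out)."""
    return [{d: 1 for d in drinks if d in answers} for answers in responses]


counts = count_mentions(responses)
dichotomies = to_dichotomies(responses, ["COKE", "PEPSI", "SPRITE"])
print(counts.most_common(3))
print(dichotomies)
```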

Multiple-Response Tables. *Multiple-response tables* are Crosstabulation tables used when the categories of interest are not mutually exclusive. Such tables can accommodate Multiple response variables as well as Multiple dichotomies.

For more information, see the Multiple Response Tables section of Basic Statistics.

Multiple Stream Group Charts. Variable and attribute control charts (see also Quality Control) can be computed for multiple-stream processes (e.g., operators, machines, assembly lines); the resulting *multiple stream group chart* summarizes the measurements for all streams simultaneously. These charts can also be produced for short production runs, and the measurements summarized in *short run group charts*. In addition to the standard parameters for determining the control limits and other characteristics of the control charts, the number of consecutive points *r* from the same process stream (i.e., "runs" of length *r*) to be highlighted in the chart can be specified.

Multiplicative Season, Damped Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a damped trend component (independently smoothed with the single parameter φ; this model is an extension of Brown's one-parameter linear model, see Gardner, 1985, pp. 12-13) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we wanted to forecast from month to month the number of households that purchase a particular consumer electronics device (e.g., a VCR). Every year, the number of households that purchase a VCR will increase; however, this trend will be damped (i.e., the upward trend will slowly disappear) over time as the market becomes saturated. In addition, there will be a seasonal component, reflecting the seasonal changes in consumer demand for VCRs from month to month (demand will likely be smaller in the summer and greater during the December holidays). This seasonal component may be multiplicative; for example, sales during the December holidays may increase by a factor of 1.4 (or 40%) over the average annual sales. To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, estimates of both *S_{0}* (the initial smoothed value) and *T_{0}* (the initial trend) are necessary. These can be computed as:

$$T_0 = \frac{1}{\phi} \cdot \frac{M_k - M_1}{(k-1)\,p}$$

where

$\phi$ is the smoothing parameter

$k$ is the number of complete seasonal cycles

$M_k$ is the mean for the last seasonal cycle

$M_1$ is the mean for the first seasonal cycle

$p$ is the length of the seasonal cycle

and

$$S_0 = M_1 - \frac{p\,T_0}{2}$$
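The initial-value formulas above can be tried out with a short Python sketch. The quarterly series is made up, the damped-trend smoothing parameter is written as `phi` (an assumed name), and the function names are hypothetical.

```python
def cycle_means(series, p):
    """Return the mean of each complete seasonal cycle of length p."""
    k = len(series) // p
    return [sum(series[i * p:(i + 1) * p]) / p for i in range(k)]


def damped_trend_initial_values(series, p, phi):
    """Return (S0, T0) using the damped-trend formulas given above."""
    means = cycle_means(series, p)
    k = len(means)
    if k < 2:
        raise ValueError("need at least two complete seasonal cycles")
    t0 = (1.0 / phi) * (means[-1] - means[0]) / ((k - 1) * p)
    s0 = means[0] - p * t0 / 2.0
    return s0, t0


# Made-up quarterly series (p = 4) with three complete cycles;
# the cycle means are 12, 16, and 20.
series = [10, 12, 14, 12, 14, 16, 18, 16, 18, 20, 22, 20]
s0, t0 = damped_trend_initial_values(series, p=4, phi=0.8)
print(s0, t0)
```

With these numbers the undamped slope between cycle means is 1 per period, so dividing by `phi = 0.8` gives an initial trend of about 1.25 and an initial smoothed value of about 9.5.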

Multiplicative Season, Exponential Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by an exponential trend component (independently smoothed with parameter γ) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we wanted to forecast the monthly revenue for a resort area. Every year, revenue may increase by a certain percentage or *factor*, resulting in an exponential trend in overall revenue. In addition, there could be a multiplicative seasonal component; that is, given the respective annual revenue, each year 20% of the revenue is produced during the month of December, so that in December the revenue grows by a particular (multiplicative) *factor*.

To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, estimates of both *S_{0}* (the initial smoothed value) and *T_{0}* (the initial trend) are necessary. These can be computed as:

$$T_0 = \exp\left\{\frac{\log(M_2) - \log(M_1)}{p}\right\}$$

where

$M_2$ is the mean for the second seasonal cycle

$M_1$ is the mean for the first seasonal cycle

$p$ is the length of the seasonal cycle

and

$$S_0 = \exp\left\{\log(M_1) - \frac{p\,\log(T_0)}{2}\right\}$$
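A small Python sketch of these two formulas may help to see what they do. The series is invented so that the mean grows from 100 in the first cycle to 121 in the second, and the function name is hypothetical.

```python
import math


def exponential_trend_initial_values(series, p):
    """Return (S0, T0) using the exponential-trend formulas given above."""
    if len(series) < 2 * p:
        raise ValueError("need at least two complete seasonal cycles")
    m1 = sum(series[:p]) / p          # mean of the first seasonal cycle
    m2 = sum(series[p:2 * p]) / p     # mean of the second seasonal cycle
    t0 = math.exp((math.log(m2) - math.log(m1)) / p)
    s0 = math.exp(math.log(m1) - p * math.log(t0) / 2.0)
    return s0, t0


# Made-up quarterly series (p = 4): cycle means are 100 and 121.
series = [90.0, 100.0, 110.0, 100.0, 108.9, 121.0, 133.1, 121.0]
s0, t0 = exponential_trend_initial_values(series, p=4)
print(s0, t0)
```

Here `t0` is the per-period growth factor (its fourth power is approximately 1.21, the cycle-to-cycle ratio), and `s0` projects the first cycle mean back by half a cycle on the log scale.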

Multiplicative Season, Linear Trend. In this Time Series model, the simple exponential smoothing forecasts are "enhanced" both by a linear trend component (independently smoothed with parameter γ) and a multiplicative seasonal component (smoothed with parameter δ). For example, suppose we were to predict the monthly budget for snow removal in a community. There may be a trend component (as the community grows, there is an upward trend for the cost of snow removal from year to year). At the same time, there is obviously a seasonal component, reflecting the differential likelihood of snow during different months of the year. This seasonal component could be multiplicative, meaning that given a respective budget figure, it may increase by a *factor* of, for example, 1.4 during particular winter months; or it may be additive (see above), that is, a particular fixed additional amount of money is necessary during the winter months. To compute the smoothed values for the first season, initial values for the seasonal components are necessary. Also, to compute the smoothed value (forecast) for the first observation in the series, estimates of both *S_{0}* (the initial smoothed value) and *T_{0}* (the initial trend) are necessary. These can be computed as:

$$T_0 = \frac{M_k - M_1}{(k-1)\,p}$$

where

$k$ is the number of complete seasonal cycles

$M_k$ is the mean for the last seasonal cycle

$M_1$ is the mean for the first seasonal cycle

$p$ is the length of the seasonal cycle

and

$$S_0 = M_1 - \frac{T_0}{2}$$
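For comparison with the other trend models, the linear-trend initial values can be computed with the following Python sketch. It applies the formulas exactly as stated in this entry; the series and the function name are illustrative assumptions.

```python
def linear_trend_initial_values(series, p):
    """Return (S0, T0) using the linear-trend formulas as stated above."""
    k = len(series) // p  # number of complete seasonal cycles
    if k < 2:
        raise ValueError("need at least two complete seasonal cycles")
    m_first = sum(series[:p]) / p
    m_last = sum(series[(k - 1) * p:k * p]) / p
    t0 = (m_last - m_first) / ((k - 1) * p)
    s0 = m_first - t0 / 2.0
    return s0, t0


# Made-up quarterly series (p = 4) with three complete cycles;
# the first and last cycle means are 12 and 20.
series = [10, 12, 14, 12, 14, 16, 18, 16, 18, 20, 22, 20]
s0, t0 = linear_trend_initial_values(series, p=4)
print(s0, t0)
```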

Multiplicative Season, No Trend. This Time Series model is partially equivalent to the simple exponential smoothing model; however, in addition, each forecast is "enhanced" by a multiplicative component that is smoothed independently (see *The seasonal smoothing parameter (δ)* in Time Series Analysis). This model would, for example, be adequate when computing forecasts for monthly expected sales for a particular toy. The level of sales may be stable from year to year, or change only slowly; at the same time, there will be seasonal changes (e.g., greater sales during the December holidays), which again may change slowly from year to year. The seasonal changes may affect the sales in a multiplicative fashion; for example, depending on the respective overall level of sales, December sales may always be greater by a *factor* of 1.4.

Multivariate Adaptive Regression Splines (MARSplines). Multivariate adaptive regression splines (or MARSplines for short) is a nonparametric regression procedure that makes no assumption about the underlying functional relationship between the dependent and independent variables. Instead, MARSplines constructs this relation from a set of coefficients and basis functions that are entirely "driven" by the regression data. The MARSplines technique has become particularly popular in the area of data mining, because it does not assume or impose any particular type or class of relationship (e.g., linear, logistic, and so on) between the predictor variables and the dependent (outcome) variable of interest.

The general MARSplines model equation (see Hastie et al., 2001, equation 9.19) is given as:

$$y = f(X) = \beta_0 + \sum_{m=1}^{M} \beta_m h_m(X)$$

where the summation is over the M nonconstant terms (basis functions) in the model. To summarize, y is predicted as a function of the predictor variables X (and their interactions); this function consists of an intercept parameter ($\beta_0$) and the weighted (by $\beta_m$) sum of one or more basis functions $h_m(X)$. You may also think of this model as "selecting" a weighted sum of basis functions from the set of (a large number of) basis functions that span all values of each predictor (i.e., that set would consist of one basis function, and "knot" parameter t, for each distinct value for each predictor variable). The MARSplines algorithm then searches over the space of all inputs and predictor values (knot locations t) as well as interactions between variables. During this search, an increasingly larger number of basis functions is added to the model (selected from the set of possible basis functions) to maximize an overall least squares goodness-of-fit criterion.
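To make the "weighted sum of basis functions" concrete, the following Python sketch evaluates such a model for a single predictor. It uses the hinge-shaped basis functions max(0, x - t) and max(0, t - x) described by Hastie et al. (2001). The knot location, the coefficients, and the function names are made up for illustration, and the sketch only evaluates a given model; it does not perform the search over knots.

```python
def hinge_above(x, t):
    """Basis function that is zero up to the knot t and linear above it."""
    return max(0.0, x - t)


def hinge_below(x, t):
    """Mirror-image basis function: linear below the knot t, zero above."""
    return max(0.0, t - x)


def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum of weight * basis(x) over all terms.

    terms is a list of (weight, basis_function) pairs.
    """
    return intercept + sum(weight * basis(x) for weight, basis in terms)


# Hypothetical fitted model with a single knot at t = 3.
terms = [
    (2.0, lambda x: hinge_above(x, 3.0)),
    (-1.0, lambda x: hinge_below(x, 3.0)),
]
for x in (1.0, 3.0, 5.0):
    print(x, mars_predict(x, 1.0, terms))
```

The resulting function is piecewise linear with a change of slope at the knot, which is how a model of this kind can approximate a nonlinear relationship without assuming its form in advance.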

For more information about this technique, and how it compares to other methods for nonlinear regression (or regression trees), see Hastie, Tibshirani, and Friedman (2001).

Multivariate Statistical Process Control (MSPC). *Multivariate statistical process control* is a methodology for simultaneously monitoring multiple inputs or variables describing a process, for the purpose of ensuring that the overall process is in control. It is an extension of simple univariate (one variable at a time) quality control.

Modern automated production processes typically measure large numbers of variables that describe the process at each stage and across multiple stages. Standard quality control charting techniques (e.g., Shewhart charts, X-bar and R charts, etc.) are applicable only to single variables. Therefore, when applied to modern production processes with hundreds of important variables that need to be monitored, the criteria typically applied to univariate charts will lead to a large number of false alarms, and in many cases nearly constant, perpetual, alarms. Furthermore, this approach will ignore the inherent correlations between variables and, thus, lose important information (e.g., consider a single measure of temperature collected by one sensor drifting out of control, while 50 others stay within control, vs. a scenario where all 50 temperature readings begin to slowly drift upward; intuitively, the latter condition would be the more "significant" event).
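The false-alarm problem can be illustrated with a back-of-the-envelope calculation in Python. The sketch assumes that each chart uses 3-sigma limits on normally distributed data (a false-alarm probability of about 0.0027 per point) and, for simplicity, that the charts are independent. The independence assumption is for illustration only, since real process variables are typically correlated, as noted above.

```python
def prob_any_false_alarm(n_charts, alpha=0.0027):
    """Probability that at least one of n independent in-control charts
    signals at a given time point, when each signals with probability alpha."""
    return 1.0 - (1.0 - alpha) ** n_charts


for n in (1, 10, 100, 500):
    print(n, round(prob_any_false_alarm(n), 4))
```

Under these assumptions, with 100 charts roughly one time point in four produces at least one false alarm, and with 500 charts roughly three in four do.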

To rectify these shortcomings, methods have been developed to monitor simultaneously multiple variables, using multivariate statistical procedures, such as Principal Component Analysis (PCA) and Partial Least Squares (PLS) methods. In short, these techniques will enable you to identify a) when multiple correlated variables start to drift out of control, and b) when the fundamental relationships between variables change (so that the correlations between variables observed when the process was known to be in control are no longer applicable and valid).
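The idea of detecting a change in the relationship between variables can be sketched with a simpler statistic than PCA or PLS: Hotelling's T-square for two correlated variables. The Python code below is an illustration under stated assumptions, not one of the PCA/PLS procedures named above. The in-control means, variances, and correlation are treated as known, the data are assumed to be bivariate normal, and all numeric values are made up.

```python
import math


def hotelling_t2(x, mean, var1, var2, rho):
    """T-square distance of a two-variable observation from the in-control
    mean, given the in-control variances and correlation rho."""
    d1 = (x[0] - mean[0]) / math.sqrt(var1)
    d2 = (x[1] - mean[1]) / math.sqrt(var2)
    return (d1 * d1 - 2.0 * rho * d1 * d2 + d2 * d2) / (1.0 - rho * rho)


# With known in-control parameters, T-square follows a chi-square
# distribution with 2 degrees of freedom, whose upper tail probability
# is exp(-t2 / 2); this gives the control limit for a chosen alpha.
ALPHA = 0.0027
LIMIT = -2.0 * math.log(ALPHA)

mean = (0.0, 0.0)
# Both readings are 1.5 standard deviations from the mean, well inside
# univariate 3-sigma limits; only the second breaks the positive correlation.
in_pattern = hotelling_t2((1.5, 1.5), mean, 1.0, 1.0, 0.9)
off_pattern = hotelling_t2((1.5, -1.5), mean, 1.0, 1.0, 0.9)
print(LIMIT, in_pattern, off_pattern)
```

The second observation would pass both univariate charts but exceeds the multivariate limit, because it contradicts the correlation observed when the process was in control.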

A special application of MSPC is commonly found in process monitoring and quality control for industrial batch processing. Batch processes are those where goods are manufactured in "chunks" or batches, such as beer, pharmaceuticals, chemicals, polymers, paint, fertilizers, cement, petroleum products, biochemicals, perfumes, or semiconductors. In those applications, one can define in-control ("good") batches; those batches can be characterized by particular maturing effects, as various measures systematically change over time (e.g., as the alcohol ferments). By building multivariate models (e.g., via *PLS*) describing the relationship of the various variables of interest to time (i.e., to the maturing process) for those good batches, quality control schemes can be derived to detect when a batch deviates from this known "good" multivariate pattern. For details regarding these procedures, see also Nomikos and MacGregor (1995).