Some of the most interesting statistics come from the analysis of outliers. Deadspin posted an article yesterday about Boston pitcher Aaron Cook, discussing his extremely low strikeout percentage (2 Ks in just under 30 innings, as of the article's writing). Oddly, despite the low number of strikeouts, he continues to have a very good ERA. Traditional thinking is that the two go together: pitchers who strike out more batters tend to post lower ERAs. In other words, it's strange for a pitcher to have a low ERA if he isn't striking out batters. For now, Aaron Cook is an outlier.
As you may know, numerous statistical formulas have been developed for baseball to try to minimize biases like that. The same is true in other fields. Outliers exist in nearly all sample sets, from quality control to social sciences to finance, and there are several ways to handle them. Once it is determined that the outlier isn't caused by poor data entry, a mis-sampled set, or another issue, a researcher must decide whether to keep the outlier in the set or exclude it. In the case of a baseball stat, the outlier isn't a data entry error, so it should be included in the sample. The data could then be Winsorized, with the most extreme values replaced by the nearest non-extreme ones, thereby dampening the outlier's extreme nature without changing the sample size.
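To make the Winsorizing step concrete, here is a rough sketch in Python. The function name `winsorize`, the parameter `k`, and the strikeouts-per-nine values are my own illustration, not real data; only the 0.6 figure comes from the 2 Ks in roughly 30 innings mentioned above. This version clamps the `k` smallest and `k` largest values to their nearest remaining neighbors, which is one common way to Winsorize but not the only one.

```python
def winsorize(values, k=1):
    """Clamp the k lowest and k highest values to the nearest remaining value.

    The list keeps its original length and order; only the extremes change.
    """
    n = len(values)
    if k < 0 or 2 * k >= n:
        raise ValueError("k must be non-negative and less than half the sample size")
    ordered = sorted(values)
    low = ordered[k]           # smallest value that is kept as-is
    high = ordered[n - 1 - k]  # largest value that is kept as-is
    return [min(max(v, low), high) for v in values]


# Hypothetical strikeouts per nine innings for eight pitchers.
# 0.6 is roughly 2 Ks in 30 innings; the other values are made up.
k_per_nine = [6.8, 7.4, 5.9, 8.1, 7.0, 6.5, 0.6, 7.7]
winsorized = winsorize(k_per_nine, k=1)
print(winsorized)
```

The 0.6 is pulled up to the next-lowest value and the top value is pulled down to the next-highest, so the sample still has eight observations.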
The problem with leaving outliers in a sample is that they can distort correlations. For baseball statisticians, this means that outliers are curveballs telling them that their model isn't perfected yet, and potentially that the human element will always affect their system. In this case, the outlier may deserve more in-depth study. Our EST states, "Outliers could be indicative of the occurrence of a phenomenon that is qualitatively different than the typical pattern observed or expected in the sample, thus the relative frequency of outliers could provide evidence of a relative frequency of departure from the process or phenomenon that is typical for the majority of cases in a group." Therefore, at least when developing a model, more may be learned by comparing the subset of outliers to the whole population than by only studying the population in its entirety.
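As a rough illustration of how one outlier can affect correlation, the sketch below computes a Pearson correlation between strikeouts per nine innings and ERA, with and without a single low-strikeout, low-ERA pitcher. Every number here is hypothetical (including the outlier's ERA, which I made up), and the helper `pearson` is my own small implementation rather than anything from a particular stats package.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical pitchers who follow the traditional pattern:
# more strikeouts per nine innings, lower ERA.
k_per_nine = [5.0, 6.0, 7.0, 8.0, 9.0]
era = [4.9, 4.3, 4.1, 3.5, 3.2]

# A hypothetical outlier: almost no strikeouts, but a very good ERA.
outlier_k, outlier_era = 0.6, 3.0

r_without = pearson(k_per_nine, era)
r_with = pearson(k_per_nine + [outlier_k], era + [outlier_era])

print(f"correlation without the outlier: {r_without:.2f}")
print(f"correlation with the outlier:    {r_with:.2f}")
```

With these made-up numbers, the five typical pitchers show a strong negative relationship, and adding the single outlier is enough to wash most of it out. That is the kind of shift worth investigating rather than just smoothing over.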
I apologize for not posting anything last week; I didn't get to review the UFC statistics. In my mind, however, the only one that mattered was another increase in the W column for Anderson Silva.
Writing about statistical analysis in sports, manufacturing, and other topics that happen to interest me that week.