Written by: STATISTICA 9/3/2009 10:43 AM
Outliers are atypical, infrequent observations: data points that do not appear to follow the characteristic distribution of the rest of the data. They can reflect genuine properties of the underlying phenomenon (variable), or they can be due to measurement errors or other anomalies that should not be modeled.
Here are two ways to detect outliers in STATISTICA: graphically, with a box plot, and quantitatively, with the Grubbs test.
Open the Poverty.sta data file, located in the STATISTICA examples folder.
From the Statistics menu, select Basic Statistics/Tables. The Basic Statistics and Tables module provides both graphical and quantitative approaches to detecting outliers. To begin our analysis, double-click Descriptive Statistics.
A common graphical means of detecting outliers is to construct a box plot of the data. To do this, click the Variables button in the Descriptive Statistics dialog to display a variable selection dialog. Because we are interested in detecting any existing outliers, click the Select all button, and then click OK. Now, on the Quick tab of the Descriptive Statistics dialog, click the Box & whisker plot for all variables button.
Clearly, there is greater variability within the variable N_EMPLD than within the other variables. In this initial graph, potential outliers and extreme values are not displayed. To turn this feature on, double-click in the background of the graph to display the Graph Options dialog, and then select Box/Whisker located under Plot.
Click the More button to display the Box/Whiskers More Options dialog, where you can control the display of outliers and extremes. Select Outl. & Extremes in the Outliers drop-down list.
Click Close to return to the Graph Options dialog, and then click OK to update the graph with outliers and extreme values.
As we suspected, there seems to be an outlier in the variable N_EMPLD.
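To make the outlier/extreme distinction concrete, here is a minimal Python sketch of the kind of rule a box plot typically applies. It rests on assumptions that are not stated in this article: the box spans the 25th to 75th percentiles, a point counts as an outlier when it lies more than 1.5 box heights beyond a box edge, and as an extreme when it lies more than twice that far. The function name, the coefficient default, and the sample data are hypothetical, and the quartile method may differ from the one STATISTICA uses.

```python
from statistics import quantiles


def classify_box_plot_points(values, coef=1.5):
    """Split values into regular points, outliers, and extremes.

    Assumed rule: a value is an outlier if it lies more than
    coef * box_height beyond the nearest box edge, and an extreme
    if it lies more than 2 * coef * box_height beyond it.
    """
    # Box edges: first and third quartiles (inclusive method, an assumption).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    box_height = q3 - q1

    regular, outliers, extremes = [], [], []
    for v in values:
        # Distance beyond the nearest box edge (0 for points inside the box).
        distance = max(q1 - v, v - q3, 0.0)
        if distance > 2 * coef * box_height:
            extremes.append(v)
        elif distance > coef * box_height:
            outliers.append(v)
        else:
            regular.append(v)
    return regular, outliers, extremes


# Made-up sample data, not taken from Poverty.sta.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 50]
regular, outliers, extremes = classify_box_plot_points(sample)
print("outliers:", outliers)
print("extremes:", extremes)
```

With this sample, the box runs from 3.5 to 8.5, so 20 falls in the outlier band and 50 in the extreme band.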
The Basic Statistics and Tables module also provides certain quantitative methods for detecting outliers, one of which is the Grubbs test. To perform this test, return to the Descriptive Statistics dialog and select the Robust tab. Select the Grubbs test for outliers check box.
Now click the Summary: statistics button to generate a spreadsheet that contains descriptive statistics for all variables.
Here, we can see that the Grubbs Test Statistic for N_EMPLD is 4.88, with a reported p-value of 0.00 (that is, below 0.005 when rounded to two decimals). This small p-value is evidence that there is at least one outlier in the N_EMPLD variable.
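For readers who want to see what lies behind these two numbers, the following Python sketch computes a two-sided Grubbs statistic and an approximate p-value using only the standard library. It follows the textbook form of the test: G is the largest absolute deviation from the mean divided by the sample standard deviation, and the p-value comes from a Student's t distribution with N - 2 degrees of freedom. Whether STATISTICA computes its p-value in exactly this way is an assumption. The function names and the sample data are hypothetical, and the t-distribution tail is obtained by simple numerical integration, which is adequate for illustration only.

```python
import math
from statistics import mean, stdev


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for Student's t (t >= 0) with Simpson's rule."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    c = math.exp(log_c) / math.sqrt(df * math.pi)

    def pdf(x):
        return c * (1 + x * x / df) ** (-(df + 1) / 2)

    # Integrate the density from 0 to t; steps must be even.
    h = t / steps
    total = pdf(0.0) + pdf(t)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 0.5 - total * h / 3)


def grubbs_test(values):
    """Return (index of most extreme value, G statistic, approximate p-value)."""
    n = len(values)
    if n < 3:
        raise ValueError("the Grubbs test needs at least 3 observations")
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("all values are identical")

    # G: largest absolute deviation from the mean, in standard deviations.
    idx = max(range(n), key=lambda i: abs(values[i] - m))
    g = abs(values[idx] - m) / s

    # Convert G to a t statistic with n - 2 degrees of freedom.
    denom = (n - 1) ** 2 - n * g * g
    if denom <= 0:  # G is at its theoretical maximum
        return idx, g, 0.0
    t = math.sqrt(n * (n - 2) * g * g / denom)

    # Two-sided p-value with a correction for n candidate points.
    p = min(1.0, 2 * n * t_upper_tail(t, n - 2))
    return idx, g, p


# Made-up sample data, not taken from Poverty.sta.
data = [10, 11, 9, 10, 12, 11, 10, 30]
idx, g, p = grubbs_test(data)
print(f"suspect value: {data[idx]}, G = {g:.2f}, p = {p:.4f}")
```

As in the spreadsheet above, a large G paired with a small p-value suggests that the most extreme value (here, 30) does not belong with the rest of the sample. Note that the test assumes the remaining data are approximately normally distributed.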
Once the presence of outliers has been detected, it is up to the researcher to determine whether each outlier represents a genuine property of the underlying phenomenon (variable) or is due to measurement errors or other anomalies that should not be modeled.
SB