Detection of Outliers
Grubbs' test (Grubbs 1969) is used to detect outliers in a
univariate data set. It is based on the assumption of
normality. That is, you should first verify that your data
can be reasonably approximated by a normal distribution
before applying Grubbs' test.
Grubbs' test detects one outlier at a time. This outlier is removed from the data set and the test is repeated until no outliers are detected. However, repeated application changes the probabilities of detection, and the test should not be used for sample sizes of six or fewer since it frequently tags most of the points as outliers.
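The remove-and-retest procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Dataplot's implementation: the function names (`t_upper_tail`, `grubbs_p_value`, `grubbs_iterative`), the default `alpha`, and the sample values are all hypothetical. The sketch uses the two-sided form of the test, expressed as a Bonferroni-style p-value of 2N times the upper tail of a t distribution with N - 2 degrees of freedom, which is equivalent to comparing G with the usual critical value (N - 1)/sqrt(N) * sqrt(t^2 / (N - 2 + t^2)). The t tail is integrated numerically so that only the standard library is needed. The significance level is not adjusted between passes, so the caveat about changed detection probabilities applies to it in full.

```python
import math
import statistics


def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t with df degrees of freedom (t >= 0).

    Substituting t = sqrt(df) * tan(theta) turns the tail integral into
    const * integral of cos(theta)**(df - 1) from theta0 to pi/2, which is
    evaluated with Simpson's rule (steps must be even).
    """
    theta0 = math.atan(t / math.sqrt(df))
    h = (math.pi / 2 - theta0) / steps
    total = math.cos(theta0) ** (df - 1) + math.cos(math.pi / 2) ** (df - 1)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * math.cos(theta0 + i * h) ** (df - 1)
    const = math.exp(math.lgamma((df + 1) / 2) - math.lgamma(df / 2))
    return const / math.sqrt(math.pi) * total * h / 3


def grubbs_p_value(g, n):
    """Approximate two-sided p-value for a Grubbs statistic g from n points."""
    r = g * g * n / (n - 1) ** 2  # equals t^2 / (n - 2 + t^2)
    if r >= 1.0:  # g is at its largest possible value, (n - 1) / sqrt(n)
        return 0.0
    t = math.sqrt(r * (n - 2) / (1.0 - r))
    return min(1.0, 2 * n * t_upper_tail(t, n - 2))


def grubbs_iterative(data, alpha=0.05, min_n=7):
    """Remove one suspected outlier per pass until the test stops rejecting.

    min_n=7 reflects the advice not to apply the test to six or fewer points.
    Returns (removed_points, remaining_points).
    """
    remaining = list(data)
    removed = []
    while len(remaining) >= min_n:
        mean = statistics.fmean(remaining)
        sd = statistics.stdev(remaining)
        if sd == 0:  # all values identical: nothing can be an outlier
            break
        suspect = max(remaining, key=lambda y: abs(y - mean))
        g = abs(suspect - mean) / sd
        if grubbs_p_value(g, len(remaining)) >= alpha:
            break
        remaining.remove(suspect)
        removed.append(suspect)
    return removed, remaining


# Made-up sample with one grossly deviant value.
sample = [9.20, 9.30, 9.25, 9.27, 9.26, 9.24, 9.28, 9.23, 9.29, 12.0]
removed, remaining = grubbs_iterative(sample)
print("removed:", removed)
print("kept:", len(remaining), "points")
```

With this made-up sample the value 12.0 is removed on the first pass, and the second pass finds nothing further to reject.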
Grubbs' test is also known as the maximum normed residual test.
Grubbs' test is defined for the hypothesis:
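In its standard two-sided form, the hypotheses, test statistic, and critical region can be written as below. The symbols are notation assumed here: Y_i are the observations, \bar{Y} and s are the sample mean and standard deviation, N is the sample size, \alpha is the significance level, and t_{p,\nu} is the upper critical value of the t distribution with \nu degrees of freedom at tail probability p.

```latex
\begin{aligned}
H_0 &: \text{there are no outliers in the data set} \\
H_a &: \text{there is at least one outlier in the data set} \\[6pt]
G &= \frac{\max_{i} \lvert Y_i - \bar{Y} \rvert}{s} \\[6pt]
\text{Reject } H_0 \text{ if}\quad
G &> \frac{N-1}{\sqrt{N}}
     \sqrt{\frac{t^{2}_{\alpha/(2N),\,N-2}}{N-2+t^{2}_{\alpha/(2N),\,N-2}}}
\end{aligned}
```

For a one-sided version, which tests only the minimum or only the maximum, G uses the signed deviation of that extreme value and the tail probability \alpha/(2N) is replaced by \alpha/N.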
Dataplot generated the following output for
the ZARR13.DAT data set
showing that Grubbs' test finds no outliers in the dataset:
      *********************
      **  grubbs test y  **
      *********************

            GRUBBS TEST FOR OUTLIERS
            (ASSUMPTION: NORMALITY)

 1. STATISTICS:
       NUMBER OF OBSERVATIONS    =      195
       MINIMUM                   =   9.196848
       MEAN                      =   9.261460
       MAXIMUM                   =   9.327973
       STANDARD DEVIATION        =   0.2278881E-01

       GRUBBS TEST STATISTIC     =   2.918673

 2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
    FOR GRUBBS TEST STATISTIC
       0          % POINT    =   0.000000
       50         % POINT    =   2.774246
       75         % POINT    =   2.984294
       90         % POINT    =   3.242170
       95         % POINT    =   3.424672
       97.5       % POINT    =   3.597898
       99         % POINT    =   3.814610
       100        % POINT    =   13.89263

 3. CONCLUSION (AT THE 5% LEVEL):
       THERE ARE NO OUTLIERS.
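|Interpretation of Sample Output||The output is divided into three sections. The first section lists the sample statistics (number of observations, minimum, mean, maximum, and standard deviation) together with the Grubbs test statistic. The second section lists percent points of the reference distribution for the test statistic. The third section states the conclusion at the 5% level. Here the test statistic, 2.918673, is below the 95% point, 3.424672, which is consistent with the stated conclusion that there are no outliers.|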
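As a quick arithmetic check, the test statistic in the output can be recomputed from the summary statistics printed in its first section. The sketch below assumes the two-sided statistic (the larger of the two extreme deviations divided by the standard deviation); the variable names are illustrative, and the 95% point is copied from the output rather than derived. It also shows that the 100% point is the largest value the statistic can take for this sample size, (N - 1)/sqrt(N).

```python
import math

# Values copied from section 1 of the Dataplot output.
n = 195
minimum = 9.196848
mean = 9.261460
maximum = 9.327973
sd = 0.2278881e-01

# Two-sided Grubbs statistic: largest absolute deviation from the mean, in
# units of the sample standard deviation. Only the two extremes can attain it.
g = max(maximum - mean, mean - minimum) / sd

# Largest value the statistic can take for a sample of size n.
g_max = (n - 1) / math.sqrt(n)

# 95% point copied from section 2 of the output (not recomputed here).
point_95 = 3.424672

print(f"G = {g:.4f}")
print(f"upper bound (N - 1)/sqrt(N) = {g_max:.5f}")
print("outlier flagged at the 5% level:", g > point_95)
```

The recomputed statistic agrees with the printed 2.918673 to within the rounding of the inputs, and it is attained by the maximum rather than the minimum.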
Grubbs' test can be used to answer the following questions:
Many statistical techniques are sensitive to the presence
of outliers. For example, simple calculations of the mean
and standard deviation may be distorted by a single grossly
inaccurate data point.
Checking for outliers should be a routine part of any data analysis. Potential outliers should be examined to see if they are possibly erroneous. If a data point is in error, it should be corrected if possible and deleted if correction is not possible. If there is no reason to believe that the outlying point is in error, it should not be deleted without careful consideration; in that case, the use of more robust techniques may be warranted. Robust techniques will often downweight the effect of outlying points without deleting them.
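The sensitivity of the mean and standard deviation to a single bad point, and the relative stability of robust summaries, can be illustrated with a small sketch. The data are made up, the contaminated value imitates a hypothetical decimal-point slip, and the median and median absolute deviation (MAD) are chosen here only as one familiar pair of robust estimates, not as a recommendation from the text.

```python
import statistics

# Made-up measurements; the contaminated copy replaces 9.30 with 93.0,
# imitating a hypothetical decimal-point slip during data entry.
clean = [9.20, 9.23, 9.24, 9.25, 9.26, 9.27, 9.28, 9.29, 9.30]
contaminated = clean[:-1] + [93.0]


def mad(values):
    """Median absolute deviation from the median (unscaled)."""
    values = list(values)
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


for label, data in (("clean", clean), ("contaminated", contaminated)):
    print(
        f"{label:>12}: mean={statistics.fmean(data):.3f} "
        f"sd={statistics.stdev(data):.3f} "
        f"median={statistics.median(data):.3f} "
        f"MAD={mad(data):.3f}"
    )
```

In this example the one bad value roughly doubles the mean and inflates the standard deviation by orders of magnitude, while the median and MAD are unchanged.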
|Related Techniques||Several graphical techniques can, and should, be used to detect outliers. A simple run sequence plot, a box plot, or a histogram should show any obviously outlying points.|
|Case Study||Heat flow meter data.|
|Software||Some general purpose statistical software programs, including Dataplot, support the Grubbs' test.|