
Grubbs' Test for Outliers

Purpose:
Detection of Outliers
Grubbs' test (Grubbs 1969 and Stefansky 1972) is used to detect outliers in a univariate data set. It is based on the assumption of normality. That is, you should first verify that your data can be reasonably approximated by a normal distribution before applying Grubbs' test.
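
One possible way to check the normality assumption numerically is the correlation coefficient of a normal probability plot. The sketch below is only an illustration: the text above does not prescribe a particular method, the helper name `normal_ppcc` is hypothetical, and the sample is simulated rather than real data. It assumes SciPy is installed.

```python
import random

from scipy import stats


def normal_ppcc(data):
    """Correlation coefficient of the normal probability plot of `data`.

    Values close to 1 suggest the data are reasonably approximated by a
    normal distribution; this is a screening aid, not a formal test.
    """
    (_, _), (_, _, r) = stats.probplot(data, dist="norm")
    return float(r)


# Simulated (not real) measurements, used only to exercise the function.
rng = random.Random(0)
sample = [rng.gauss(9.26, 0.02) for _ in range(200)]
print(f"probability plot correlation: {normal_ppcc(sample):.4f}")
```

A value well below 1, or visible curvature in the plot itself, would be a reason to question the normality assumption before running the test.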

Grubbs' test detects one outlier at a time. The detected outlier is removed from the data set and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or fewer since it frequently tags most of the points as outliers.

Grubbs' test is also known as the maximum normed residual test.

Definition
Grubbs' test is defined for the hypothesis:

H0: There are no outliers in the data set
Ha: There is at least one outlier in the data set

Test Statistic: The Grubbs' test statistic is defined as:

$$G = \frac{\max_i |Y_i - \bar{Y}|}{s}$$

where $\bar{Y}$ and $s$ are the sample mean and standard deviation. The Grubbs' test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation.

Significance Level: $\alpha$.

Critical Region: The hypothesis of no outliers is rejected if

$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$

where $t_{(\alpha/N,\,N-2)}$ is the critical value of the t-distribution with (N-2) degrees of freedom and a significance level of $\alpha/N$.

In the above formula for the critical region, the Handbook follows the convention that $t_{\alpha}$ is the upper critical value from the t-distribution and $t_{1-\alpha}$ is the lower critical value from the t-distribution. Note that this is the opposite of what is used in some texts and software programs. In particular, Dataplot uses the opposite convention.
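
As a minimal sketch of the definition, the following Python code computes the test statistic, G = max|Y_i - mean| / s, and the critical value built from the upper alpha/N point of the t-distribution with N-2 degrees of freedom. The function names are illustrative, the five-point sample is made up, and SciPy is assumed to be available. Some references use alpha/(2N) for a two-sided test; the sketch follows the alpha/N form given in this section.

```python
import math
import statistics

from scipy import stats


def grubbs_statistic(data):
    """Largest absolute deviation from the mean, in units of the sample SD."""
    mean = statistics.fmean(data)
    sd = statistics.stdev(data)  # sample standard deviation (N - 1 divisor)
    return max(abs(y - mean) for y in data) / sd


def grubbs_critical_value(n, alpha=0.05):
    """Critical value based on the upper alpha/N point of t with N - 2 df."""
    t = stats.t.isf(alpha / n, n - 2)  # isf returns the upper-tail critical value
    return (n - 1) / math.sqrt(n) * math.sqrt(t**2 / (n - 2 + t**2))


# Made-up sample, used only to exercise the statistic.
toy = [1.0, 2.0, 3.0, 4.0, 100.0]
print(f"G = {grubbs_statistic(toy):.4f}")

# N = 195 and alpha = 0.05 correspond to the ZARR13.DAT example in the
# Sample Output section.
print(f"critical value = {grubbs_critical_value(195, 0.05):.4f}")
```

For N = 195 and alpha = 0.05 the computed critical value should come out close to the 95% point (3.424672) printed by Dataplot in the Sample Output section.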
Sample Output
Dataplot generated the following output for the ZARR13.DAT data set showing that Grubbs' test finds no outliers in the dataset:
```
*********************
**  grubbs test y  **
*********************

GRUBBS TEST FOR OUTLIERS
(ASSUMPTION: NORMALITY)

1. STATISTICS:
NUMBER OF OBSERVATIONS      =      195
MINIMUM                     =    9.196848
MEAN                        =    9.261460
MAXIMUM                     =    9.327973
STANDARD DEVIATION          =   0.2278881E-01

GRUBBS TEST STATISTIC       =    2.918673

2. PERCENT POINTS OF THE REFERENCE DISTRIBUTION
FOR GRUBBS TEST STATISTIC
0          % POINT    =    0.000000
50         % POINT    =    2.774246
75         % POINT    =    2.984294
90         % POINT    =    3.242170
95         % POINT    =    3.424672
97.5       % POINT    =    3.597898
99         % POINT    =    3.814610
100        % POINT    =    13.89263

3. CONCLUSION (AT THE 5% LEVEL):
THERE ARE NO OUTLIERS.

```
Interpretation of Sample Output
The output is divided into three sections.
1. The first section prints the sample statistics used in the computation of the Grubbs' test and the value of the Grubbs' test statistic.

2. The second section prints the upper critical values of the Grubbs' test statistic distribution corresponding to various significance levels. The value in the first column, the confidence level of the test, is equivalent to $100(1-\alpha)$. We reject the null hypothesis at that significance level if the value of the Grubbs' test statistic printed in section one is greater than the critical value printed in the last column.

3. The third section prints the conclusion for a 95% test. For a different significance level, the appropriate conclusion can be drawn from the table printed in section two. For example, for $\alpha$ = 0.10, we look at the row for 90% confidence and compare the critical value 3.24 to the Grubbs' test statistic 2.92. Since the test statistic is less than the critical value, we do not reject the null hypothesis at the $\alpha$ = 0.10 level.
Output from other statistical software may look somewhat different from the above output.
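
The row-by-row comparison described in this interpretation can be sketched in a few lines of Python. The percent points and the test statistic are copied from the Dataplot output above; the function name `decisions` is illustrative only.

```python
# Percent points (confidence level in % -> critical value) copied from
# section 2 of the Dataplot output, and the statistic from section 1.
percent_points = {
    50: 2.774246,
    75: 2.984294,
    90: 3.242170,
    95: 3.424672,
    97.5: 3.597898,
    99: 3.814610,
}
g = 2.918673


def decisions(statistic, points):
    """Map each confidence level to True if H0 is rejected at that level."""
    return {level: statistic > crit for level, crit in points.items()}


for level, reject in decisions(g, percent_points).items():
    alpha = 1 - level / 100
    verdict = "reject H0" if reject else "do not reject H0"
    print(f"{level:>5}% (alpha = {alpha:.3f}): {verdict}")
```

For this data set the statistic exceeds only the 50% point, so the hypothesis of no outliers is not rejected at any of the commonly used significance levels.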
Questions
Grubbs' test can be used to answer the following questions:
1. Does the data set contain any outliers?
2. How many outliers does it contain?
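
The second question is usually approached by applying the test repeatedly, removing one point at a time. The sketch below is one way to write that loop, under several assumptions: the function names are hypothetical, the sample is made up, SciPy is available, and the loop stops at samples of six or fewer points in line with the caution given under Purpose. Because repeated testing changes the detection probabilities, the count it returns should be read as indicative rather than exact.

```python
import math
import statistics

from scipy import stats


def grubbs_critical_value(n, alpha):
    """Critical value based on the upper alpha/N point of t with N - 2 df."""
    t = stats.t.isf(alpha / n, n - 2)
    return (n - 1) / math.sqrt(n) * math.sqrt(t**2 / (n - 2 + t**2))


def iterative_grubbs(data, alpha=0.05, min_n=7):
    """Remove the most extreme point while the test rejects H0.

    Returns (removed_points, remaining_points). Stops when no outlier is
    detected or when fewer than `min_n` points remain.
    """
    remaining = list(data)
    removed = []
    while len(remaining) >= min_n:
        mean = statistics.fmean(remaining)
        sd = statistics.stdev(remaining)
        if sd == 0:
            break  # all values identical; the statistic is undefined
        # Index of the point farthest from the current mean.
        idx = max(range(len(remaining)), key=lambda i: abs(remaining[i] - mean))
        g = abs(remaining[idx] - mean) / sd
        if g <= grubbs_critical_value(len(remaining), alpha):
            break  # no outlier detected at this step
        removed.append(remaining.pop(idx))
    return removed, remaining


# Made-up sample with one gross error, used only for illustration.
sample = [9.21, 9.24, 9.25, 9.26, 9.26, 9.27, 9.28, 9.29, 9.30, 12.5]
outliers, cleaned = iterative_grubbs(sample)
print("flagged:", outliers)
print("points kept:", len(cleaned))
```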
Importance
Many statistical techniques are sensitive to the presence of outliers. For example, simple calculations of the mean and standard deviation may be distorted by a single grossly inaccurate data point.

Checking for outliers should be a routine part of any data analysis. Potential outliers should be examined to see if they are possibly erroneous. If a data point is in error, it should be corrected if possible and deleted if correction is not possible. If there is no reason to believe that the outlying point is in error, it should not be deleted without careful consideration. However, the use of more robust techniques may be warranted. Robust techniques will often downweight the effect of outlying points without deleting them.
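
To illustrate how a single bad point distorts the mean and standard deviation, the sketch below compares them with the median and the median absolute deviation (MAD). These two robust summaries are chosen here only as examples; the text does not name a specific robust technique. The data are made up, with the last value mimicking a misplaced decimal point, and the helper names are hypothetical.

```python
import statistics


def mad(data):
    """Median absolute deviation from the median."""
    center = statistics.median(data)
    return statistics.median(abs(y - center) for y in data)


def summarize(data):
    """Return classical and robust location/scale summaries."""
    return {
        "mean": statistics.fmean(data),
        "sd": statistics.stdev(data),
        "median": statistics.median(data),
        "mad": mad(data),
    }


# Made-up readings; 92.6 stands in for 9.26 recorded with a shifted decimal.
clean = [9.24, 9.25, 9.26, 9.27, 9.28]
contaminated = clean + [92.6]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    s = summarize(data)
    print(
        f"{name:>12}: mean={s['mean']:.3f} sd={s['sd']:.3f} "
        f"median={s['median']:.3f} mad={s['mad']:.3f}"
    )
```

The mean and standard deviation shift substantially when the bad point is added, while the median and MAD barely move.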

Related Techniques
Several graphical techniques can, and should, be used to detect outliers. A simple run sequence plot, a box plot, or a histogram should show any obviously outlying points.
Case Study
Heat flow meter data.
Software
Some general purpose statistical software programs, including Dataplot, support Grubbs' test.