
1. Exploratory Data Analysis
1.3. EDA Techniques
1.3.5. Quantitative Techniques

Grubbs' Test for Outliers

Detection of Outliers
Grubbs' test (Grubbs 1969 and Stefansky 1972) is used to detect outliers in a univariate data set. It is based on the assumption of normality. That is, you should first verify that your data can be reasonably approximated by a normal distribution before applying Grubbs' test.

Grubbs' test detects one outlier at a time. The detected outlier is removed from the data set and the test is repeated until no outliers are detected. However, multiple iterations change the probabilities of detection, and the test should not be used for sample sizes of six or fewer, since it frequently tags most of the points as outliers.
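The remove-and-retest loop described above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `iterative_grubbs` and the `critical_value` callable (which maps the current sample size N to a critical value for the chosen alpha) are assumptions introduced here, and the sample standard deviation is assumed to use the N - 1 divisor.

```python
import statistics


def iterative_grubbs(values, critical_value):
    """Repeatedly remove the most extreme point while G exceeds the critical value.

    critical_value is a hypothetical callable mapping the current sample
    size N to the critical value for G at the chosen significance level.
    Returns (remaining_values, removed_values).
    """
    remaining = list(values)
    removed = []
    while len(remaining) > 6:  # the text advises against N of six or fewer
        mean = statistics.fmean(remaining)
        s = statistics.stdev(remaining)  # sample standard deviation
        if s == 0:
            break  # all points identical; nothing to test
        # Index of the point farthest from the current mean
        idx = max(range(len(remaining)), key=lambda i: abs(remaining[i] - mean))
        g = abs(remaining[idx] - mean) / s
        if g <= critical_value(len(remaining)):
            break  # no outlier detected; stop iterating
        removed.append(remaining.pop(idx))
    return remaining, removed
```

Because each pass changes the sample size, the critical value has to be recomputed on every iteration, which is why it is passed in as a function of N rather than as a single number. The caveat above still applies: the stated significance level holds for a single test, not for the repeated procedure.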

Grubbs' test is also known as the maximum normed residual test.

Definition
Grubbs' test is defined for the hypotheses:

H0: There are no outliers in the data set
Ha: There is at least one outlier in the data set
Test Statistic: The Grubbs' test statistic is defined as:
    G = MAX|Y(i) - YBAR|/s
where Y(i) is the i-th observation and YBAR and s are the sample mean and sample standard deviation. The Grubbs' test statistic is the largest absolute deviation from the sample mean in units of the sample standard deviation.
Significance Level: alpha.
Critical Region: The hypothesis of no outliers is rejected if
    G > [(N-1)/SQRT(N)]*SQRT(t(alpha/N,N-2)**2/(N - 2 + t(alpha/N,N-2)**2))
where t(alpha/N,N-2) is the critical value of the t-distribution with (N-2) degrees of freedom and a significance level of alpha/N.

In the above formulas for the critical regions, the Handbook follows the convention that t(alpha) is the upper critical value from the t-distribution and t(1-alpha) is the lower critical value from the t-distribution. Note that this is the opposite of what is used in some texts and software programs. In particular, Dataplot uses the opposite convention.
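As a rough illustration of the definition above, the test statistic and the critical value can be computed with the Python standard library. The function names are hypothetical, the sample standard deviation is assumed to use the N - 1 divisor, and the t critical value is taken as an input (from a table or another program) rather than computed here. Some references use alpha/(2N) rather than alpha/N for the two-sided statistic; the sketch is agnostic on that point because the caller supplies the t value.

```python
import math
import statistics


def grubbs_statistic(values):
    """Return G = max|Y(i) - YBAR| / s for a sequence of numbers."""
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (N - 1 divisor)
    return max(abs(y - mean) for y in values) / s


def grubbs_critical_value(n, t_crit):
    """Critical value for G given N and the upper t critical value.

    t_crit is t(alpha/N, N-2) in the notation of the text; it must be
    looked up or computed elsewhere and passed in.
    """
    if n < 3:
        raise ValueError("need at least 3 observations")
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))
```

Two limiting cases give a quick sanity check: a t value of 0 yields a critical value of 0, and a very large t value approaches (N-1)/SQRT(N), which is about 13.89263 for N = 195.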

Sample Output
Dataplot generated the following output for the ZARR13.DAT data set, showing that Grubbs' test finds no outliers in the data set:
       **  grubbs test y  **
               (ASSUMPTION: NORMALITY)
       NUMBER OF OBSERVATIONS      =      195
       MINIMUM                     =    9.196848
       MEAN                        =    9.261460
       MAXIMUM                     =    9.327973
       STANDARD DEVIATION          =   0.2278881E-01
       GRUBBS TEST STATISTIC       =    2.918673
       0          % POINT    =    0.000000
       50         % POINT    =    2.774246
       75         % POINT    =    2.984294
       90         % POINT    =    3.242170
       95         % POINT    =    3.424672
       97.5       % POINT    =    3.597898
       99         % POINT    =    3.814610
       100        % POINT    =    13.89263
Interpretation of Sample Output
The output is divided into three sections.
  1. The first section prints the sample statistics used in the computation of the Grubbs' test and the value of the Grubbs' test statistic.

  2. The second section prints the upper critical values of the Grubbs' test statistic distribution corresponding to various significance levels. The value in the first column, the confidence level of the test, is equivalent to 100(1-alpha). We reject the null hypothesis at that significance level if the value of the Grubbs' test statistic printed in section one is greater than the critical value printed in the last column.

  3. The third section prints the conclusion for a 95% test. For a different significance level, the appropriate conclusion can be drawn from the table printed in section two. For example, for alpha = 0.10, we look at the row for 90% confidence and compare the critical value 3.24 to the Grubbs' test statistic 2.92. Since the test statistic is less than the critical value, we do not reject the null hypothesis at the alpha = 0.10 level.
Output from other statistical software may look somewhat different from the above output.
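The comparison described in the interpretation above can be written out as a short Python sketch. All numbers are copied from the Dataplot output; the names `decisions`, `percent_points`, and `g_check` are illustrative choices made here, not part of Dataplot.

```python
# Values copied from the Dataplot output above
g = 2.918673
percent_points = {
    50.0: 2.774246, 75.0: 2.984294, 90.0: 3.242170,
    95.0: 3.424672, 97.5: 3.597898, 99.0: 3.814610,
}


def decisions(g_stat, table):
    """Map each confidence level to True if H0 is rejected at that level."""
    return {level: g_stat > crit for level, crit in table.items()}


results = decisions(g, percent_points)
for level, reject in results.items():
    alpha = 1 - level / 100
    verdict = "reject H0" if reject else "do not reject H0"
    print(f"{level:5.1f}% (alpha = {alpha:.3f}): {verdict}")

# Cross-check of G from the printed summary statistics:
# the largest deviation from the mean is at either the minimum or the maximum.
mean, minimum, maximum, sd = 9.261460, 9.196848, 9.327973, 0.2278881e-01
g_check = max(maximum - mean, mean - minimum) / sd
```

The cross-check at the end recovers the printed test statistic from the minimum, mean, maximum, and standard deviation alone, since the most extreme observation must be one of the two end points.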
Questions
Grubbs' test can be used to answer the following questions:
  1. Does the data set contain any outliers?
  2. How many outliers does it contain?
Importance
Many statistical techniques are sensitive to the presence of outliers. For example, simple calculations of the mean and standard deviation may be distorted by a single grossly inaccurate data point.

Checking for outliers should be a routine part of any data analysis. Potential outliers should be examined to see if they are possibly erroneous. If a data point is in error, it should be corrected if possible and deleted if correction is not possible. If there is no reason to believe that the outlying point is in error, it should not be deleted without careful consideration. However, the use of more robust techniques may be warranted. Robust techniques will often downweight the effect of outlying points without deleting them.

Related Techniques
Several graphical techniques can, and should, be used to detect outliers. A simple run sequence plot, a box plot, or a histogram should show any obviously outlying points.
Case Study
Heat flow meter data.
Software
Some general purpose statistical software programs, including Dataplot, support the Grubbs' test.