g07ga identifies outlying values using Peirce's criterion.

# Syntax

C# |
---|
`public static void g07ga(int n, int p, double[] y, double mean, double var, int[] iout, out int niout, int ldiff, double[] diff, double[] llamb, out int ifail)` |

Visual Basic |
---|
`Public Shared Sub g07ga (n As Integer, p As Integer, y As Double(), mean As Double, var As Double, iout As Integer(), <OutAttribute> ByRef niout As Integer, ldiff As Integer, diff As Double(), llamb As Double(), <OutAttribute> ByRef ifail As Integer)` |

Visual C++ |
---|
`public: static void g07ga(int n, int p, array<double>^ y, double mean, double var, array<int>^ iout, [OutAttribute] int% niout, int ldiff, array<double>^ diff, array<double>^ llamb, [OutAttribute] int% ifail)` |

F# |
---|
`static member g07ga : n : int * p : int * y : float[] * mean : float * var : float * iout : int[] * niout : int byref * ldiff : int * diff : float[] * llamb : float[] * ifail : int byref -> unit` |

#### Parameters

- n
- Type: System.Int32
*On entry*: $n$, the number of observations.
*Constraint*: ${\mathbf{n}}\ge 3$.

- p
- Type: System.Int32
*On entry*: $p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
*Constraint*: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.

- y
- Type: System.Double[]
An array of size [n]
*On entry*: $y$, the data being tested.

- mean
- Type: System.Double

- var
- Type: System.Double

- iout
- Type: System.Int32[]
An array of size [n]
*On exit*: the indices of the values in y sorted in descending order of the absolute difference from the mean, therefore $\left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-2\right]-1\right]-\mu \right|\ge \left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]-\mu \right|$, for $\mathit{i}=2,3,\dots ,{\mathbf{n}}$.

- niout
- Type: System.Int32%

- ldiff
- Type: System.Int32

- diff
- Type: System.Double[]
An array of size [ldiff]
*On exit*: ${\mathbf{diff}}\left[\mathit{i}-1\right]$ holds $\left|y-\mu \right|-{\sigma}^{2}z$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\,\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.

- llamb
- Type: System.Double[]
An array of size [ldiff]
*On exit*: ${\mathbf{llamb}}\left[\mathit{i}-1\right]$ holds $\mathrm{log}\left({\lambda}^{2}\right)$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\,\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.

- ifail
- Type: System.Int32%
*On exit*: ${\mathbf{ifail}}={0}$ unless the method detects an error or a warning has been flagged (see [Error Indicators and Warnings]).
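
The indexing convention described for iout (1-based indices into y, ordered by decreasing absolute difference from the mean) can be illustrated with a short Python sketch. The helper name `deviation_order` is hypothetical and not part of the library, and the order it gives to tied values is an assumption.

```python
def deviation_order(y, mean):
    """Return 1-based indices of y, largest |y[i] - mean| first.

    Illustrative only: mimics the ordering described for iout.
    Ties keep their original relative order (an assumption).
    """
    zero_based = sorted(range(len(y)),
                        key=lambda i: abs(y[i] - mean),
                        reverse=True)
    # Shift to the 1-based convention, so y[iout[i - 1] - 1] is the i-th value
    return [i + 1 for i in zero_based]


# Hypothetical data, chosen only to show the convention
y = [1.0, 4.0, 2.0, -3.0]
iout = deviation_order(y, mean=1.0)
```

With this convention, `y[iout[0] - 1]` is the observation furthest from the mean.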

# Description

g07ga flags outlying values in data using Peirce's criterion. Let

- $y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
- $m$ denote the number of potential outlying values,
- $\mu $ and ${\sigma}^{2}$ denote the mean and variance of $y$ respectively,
- $\tilde{y}$ denote a vector of length $n-m$ constructed by dropping the $m$ values from $y$ with the largest value of $\left|{y}_{i}-\mu \right|$,
- ${\tilde{\sigma}}^{2}$ denote the (unknown) variance of $\tilde{y}$,
- $\lambda $ denote the ratio of $\tilde{\sigma}$ to $\sigma $, that is, $\lambda =\frac{\tilde{\sigma}}{\sigma}$.

Peirce's method flags ${y}_{i}$ as a potential outlier if $\left|{y}_{i}-\mu \right|\ge x$, where $x={\sigma}^{2}z$ and $z$ is obtained from the solution of

$${R}^{m}={\lambda}^{m-n}\frac{{m}^{m}{\left(n-m\right)}^{n-m}}{{n}^{n}} \tag{1}$$

where

$$R=2\exp\left(\frac{{z}^{2}-1}{2}\right)\left(1-\Phi \left(z\right)\right) \tag{2}$$

and $\Phi $ is the cumulative distribution function for the standard Normal distribution.

As ${\tilde{\sigma}}^{2}$ is unknown, an assumption is made that the relationship between ${\tilde{\sigma}}^{2}$ and ${\sigma}^{2}$, hence $\lambda $, depends only on the sum of squares of the rejected observations, and the ratio is estimated as

$${\lambda}^{2}=\frac{n-p-m{z}^{2}}{n-p-m}$$

which gives

$${z}^{2}=1+\frac{n-p-m}{m}\left(1-{\lambda}^{2}\right) \tag{3}$$
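
The step from the expression for ${\lambda}^{2}$ to equation (3) is a rearrangement only; written out, under no assumptions beyond the two formulas themselves:

$$\begin{aligned}
\left(n-p-m\right){\lambda}^{2} &= n-p-m{z}^{2} \\
m{z}^{2} &= \left(n-p\right)-\left(n-p-m\right){\lambda}^{2} \\
&= m+\left(n-p-m\right)\left(1-{\lambda}^{2}\right) \\
{z}^{2} &= 1+\frac{n-p-m}{m}\left(1-{\lambda}^{2}\right)
\end{aligned}$$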

A value for the cutoff $x$ is calculated iteratively. An initial value of $R=0.2$ is used and a value of $\lambda $ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is then repeated until the relative change in $z$ between consecutive iterations is at most $\sqrt{\epsilon}$, where $\epsilon $ is machine precision.
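
The iteration just described can be sketched in Python as follows. This is an illustration of the three-equation cycle, not the library's implementation: the function name `peirce_z`, the iteration cap and the clamping of ${z}^{2}$ at zero are assumptions, and equation (2) is taken in the form $R=2\exp\left(\left({z}^{2}-1\right)/2\right)\left(1-\Phi \left(z\right)\right)$, using the identity $2\left(1-\Phi \left(z\right)\right)=\mathrm{erfc}\left(z/\sqrt{2}\right)$.

```python
import math
import sys


def peirce_z(n, p, m, r0=0.2, max_iter=500):
    """Fixed-point iteration for z given n observations, p parameters
    and m potential outliers (illustrative sketch, hypothetical helper)."""
    if n < 3 or not 1 <= p <= n - 2:
        raise ValueError("need n >= 3 and 1 <= p <= n - 2")
    if not 1 <= m <= n - p - 1:
        raise ValueError("need 1 <= m <= n - p - 1")

    # log of m^m (n-m)^(n-m) / n^n, kept in logs to avoid overflow
    log_q = m * math.log(m) + (n - m) * math.log(n - m) - n * math.log(n)
    tol = math.sqrt(sys.float_info.epsilon)

    r, z = r0, 0.0
    for _ in range(max_iter):
        # Equation (1): lambda^(n-m) = [m^m (n-m)^(n-m) / n^n] / R^m
        lam2 = math.exp(2.0 * (log_q - m * math.log(r)) / (n - m))
        # Equation (3); clamped at zero (an assumption) so sqrt stays real
        z_new = math.sqrt(max(1.0 + (n - p - m) / m * (1.0 - lam2), 0.0))
        # Equation (2), with 2 * (1 - Phi(z)) written as erfc(z / sqrt(2))
        r = math.exp((z_new * z_new - 1.0) / 2.0) * math.erfc(z_new / math.sqrt(2.0))
        # Stop once the relative change in z is at most sqrt(machine epsilon)
        if abs(z_new - z) <= tol * max(z_new, sys.float_info.min):
            return z_new
        z = z_new
    raise RuntimeError("iteration did not converge")


z_one = peirce_z(10, 1, 1)   # one potential outlier among ten observations
z_two = peirce_z(10, 1, 2)   # two potential outliers
```

Working with the logarithm of the right-hand side of equation (1) is a design choice of this sketch; it avoids forming ${n}^{n}$ directly when $n$ is large.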

By construction, the cutoff for testing for $m+1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Peirce's criterion is therefore applied in sequence, with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated, and so on.

If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
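
A minimal Python sketch of this sequential procedure, including the duplicate rule, is given below. It is not the library's code: `flag_outliers` is a hypothetical helper, the cutoff for each $m$ is supplied by the caller as a function rather than computed, the limit of $n-p-1$ flagged values is taken from the index range documented for diff, and treating "duplicate" as exact equality of values is an assumption.

```python
def flag_outliers(y, mean, cutoff, p=1):
    """Return 1-based indices of flagged values, most extreme first.

    cutoff(m) must return the threshold x used when testing for m
    potential outliers (supplied by the caller in this sketch).
    """
    n = len(y)
    # Indices ordered by decreasing absolute difference from the mean
    order = sorted(range(n), key=lambda i: abs(y[i] - mean), reverse=True)

    niout = 0
    for m in range(1, n - p):          # m = 1, ..., n - p - 1
        # Test the m-th most extreme value against the cutoff for m outliers
        if abs(y[order[m - 1]] - mean) >= cutoff(m):
            niout = m
        else:
            break                      # stop at the first value that passes

    flagged = order[:niout]
    flagged_values = {y[i] for i in flagged}
    # Duplicate rule: any value equal to a flagged value is also flagged
    flagged += [i for i in order[niout:] if y[i] in flagged_values]
    return [i + 1 for i in flagged]


# Hypothetical data and cutoffs, chosen only to exercise the duplicate rule
data = [0.1, -0.2, 5.0, 0.3, -0.1, 5.0, 0.0]
mu = sum(data) / len(data)
result = flag_outliers(data, mu, cutoff=lambda m: 3.0 if m == 1 else 10.0)
```

In this example only the first test succeeds, yet both occurrences of the value 5.0 are returned because they are duplicates of one another.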

# References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application *The Astronomical Journal* **45**

Peirce B (1852) Criterion for the rejection of doubtful observations *The Astronomical Journal* **45**

# Error Indicators and Warnings

Errors or warnings detected by the method:

- ${\mathbf{ifail}}=1$
- On entry, ${\mathbf{n}}<3$.

- ${\mathbf{ifail}}=2$
- On entry, ${\mathbf{p}}\le 0$ or ${\mathbf{p}}>{\mathbf{n}}-2$.

# Accuracy

Not applicable.

# Parallelism and Performance

None.

# Further Comments

One problem with Peirce's algorithm as implemented in g07ga is the assumed relationship between ${\sigma}^{2}$, the variance using the full dataset, and ${\tilde{\sigma}}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, as the regression line may change significantly when outlying values have been dropped, resulting in a radically different set of residuals. In such cases g07gb should be used instead.

# Example

This example reads in a series of data and flags any potential outliers.

The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.

Example program (C#): g07gae.cs