g07ga identifies outlying values using Peirce's criterion.

Syntax

C#
```
public static void g07ga(
    int n,
    int p,
    double[] y,
    double mean,
    double var,
    int[] iout,
    out int niout,
    int ldiff,
    double[] diff,
    double[] llamb,
    out int ifail
)
```
Visual Basic
```
Public Shared Sub g07ga ( _
    n As Integer, _
    p As Integer, _
    y As Double(), _
    mean As Double, _
    var As Double, _
    iout As Integer(), _
    <OutAttribute> ByRef niout As Integer, _
    ldiff As Integer, _
    diff As Double(), _
    llamb As Double(), _
    <OutAttribute> ByRef ifail As Integer _
)
```
Visual C++
```
public:
static void g07ga(
    int n,
    int p,
    array<double>^ y,
    double mean,
    double var,
    array<int>^ iout,
    [OutAttribute] int% niout,
    int ldiff,
    array<double>^ diff,
    array<double>^ llamb,
    [OutAttribute] int% ifail
)
```
F#
```
static member g07ga :
    n : int *
    p : int *
    y : float[] *
    mean : float *
    var : float *
    iout : int[] *
    niout : int byref *
    ldiff : int *
    diff : float[] *
    llamb : float[] *
    ifail : int byref -> unit
```

Parameters

n
Type: System.Int32
On entry: $n$, the number of observations.
Constraint: ${\mathbf{n}}\ge 3$.
p
Type: System.Int32
On entry: $p$, the number of parameters in the model used in obtaining the $y$. If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$, i.e., as if a model just containing the mean had been used.
Constraint: $1\le {\mathbf{p}}\le {\mathbf{n}}-2$.
y
Type: System.Double[]
An array of size [n]
On entry: $y$, the data being tested.
mean
Type: System.Double
On entry: if ${\mathbf{var}}>0.0$, mean must contain $\mu$, the mean of $y$; otherwise mean is not referenced and the mean is calculated from the data supplied in y.
var
Type: System.Double
On entry: if ${\mathbf{var}}>0.0$, var must contain ${\sigma }^{2}$, the variance of $y$; otherwise the variance is calculated from the data supplied in y.
iout
Type: System.Int32[]
An array of size [n]
On exit: the indices of the values in y sorted in descending order of the absolute difference from the mean, so that $\left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-2\right]-1\right]-\mu \right|\ge \left|{\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]-\mu \right|$, for $\mathit{i}=2,3,\dots ,{\mathbf{n}}$.
niout
Type: System.Int32%
On exit: the number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most ${\mathbf{n}}-{\mathbf{p}}-1$ values flagged as outliers.
ldiff
Type: System.Int32
On entry: the maximum number of values to be returned in arrays diff and llamb.
If ${\mathbf{ldiff}}\le 0$, arrays diff and llamb are not referenced.
diff
Type: System.Double[]
An array of size [ldiff]
On exit: ${\mathbf{diff}}\left[\mathit{i}-1\right]$ holds $\left|y-\mu \right|-{\sigma }^{2}z$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
llamb
Type: System.Double[]
An array of size [ldiff]
On exit: ${\mathbf{llamb}}\left[\mathit{i}-1\right]$ holds $\mathrm{log}\left({\lambda }^{2}\right)$ for observation ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$, for $\mathit{i}=1,2,\dots ,\mathrm{min}\phantom{\rule{0.125em}{0ex}}\left({\mathbf{ldiff}},{\mathbf{niout}}+1,{\mathbf{n}}-{\mathbf{p}}-1\right)$.
ifail
Type: System.Int32%
On exit: ${\mathbf{ifail}}={0}$ unless the method detects an error or a warning has been flagged (see [Error Indicators and Warnings]).
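The indices returned in iout are 1-based, as the ${\mathbf{y}}\left[{\mathbf{iout}}\left[\mathit{i}-1\right]-1\right]$ notation above indicates. The following is a minimal sketch of how the flagged observations might be recovered after a call; the class name `OutlierOutput` and the helper `FlaggedValues` are hypothetical and not part of the library, and the arrays are assumed to have been filled by a prior successful call to g07ga.

```csharp
using System;

public static class OutlierOutput
{
    // Hypothetical helper (not part of the library): returns the observations
    // flagged as potential outliers, given the iout and niout values
    // produced by a previous call to g07ga.
    public static double[] FlaggedValues(double[] y, int[] iout, int niout)
    {
        var flagged = new double[niout];
        for (int i = 0; i < niout; i++)
        {
            // iout holds 1-based indices into y, hence the "- 1".
            flagged[i] = y[iout[i] - 1];
        }
        return flagged;
    }
}
```

For raw observations, as opposed to residuals, such a call would typically use ${\mathbf{p}}=1$, and a non-positive var so that the mean and variance are computed from y.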

Description

g07ga flags outlying values in data using Peirce's criterion. Let
• $y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
• $m$ denote the number of potential outlying values,
• $\mu$ and ${\sigma }^{2}$ denote the mean and variance of $y$ respectively,
• $\stackrel{~}{y}$ denote a vector of length $n-m$ constructed by dropping the $m$ values from $y$ with the largest value of $\left|{y}_{i}-\mu \right|$,
• ${\stackrel{~}{\sigma }}^{2}$ denote the (unknown) variance of $\stackrel{~}{y}$,
• $\lambda$ denote the ratio of $\stackrel{~}{\sigma }$ and $\sigma$ with $\lambda =\frac{\stackrel{~}{\sigma }}{\sigma }$.
Peirce's method flags ${y}_{i}$ as a potential outlier if $\left|{y}_{i}-\mu \right|\ge x$, where $x={\sigma }^{2}z$ and $z$ is obtained from the solution of
 ${R}^{m}={\lambda }^{m-n}\frac{{m}^{m}{\left(n-m\right)}^{n-m}}{{n}^{n}}$ (1)
where
 $R=2\mathrm{exp}\left(\frac{{z}^{2}-1}{2}\right)\left(1-\Phi \left(z\right)\right)$ (2)
and $\Phi$ is the cumulative distribution function for the standard Normal distribution.
As ${\stackrel{~}{\sigma }}^{2}$ is unknown, it is assumed that the relationship between ${\stackrel{~}{\sigma }}^{2}$ and ${\sigma }^{2}$, and hence $\lambda$, depends only on the sum of squares of the rejected observations, and the ratio is estimated as
 ${\lambda }^{2}=\frac{n-p-m{z}^{2}}{n-p-m}$
which gives
 ${z}^{2}=1+\frac{n-p-m}{m}\left(1-{\lambda }^{2}\right)$ (3)
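The rearrangement leading to equation (3) can be written out step by step. Starting from the estimate ${\lambda }^{2}=\left(n-p-m{z}^{2}\right)/\left(n-p-m\right)$ and solving for ${z}^{2}$:

```latex
\begin{aligned}
\lambda^{2}\,(n-p-m) &= n-p-m z^{2} \\
m z^{2} &= (n-p) - \lambda^{2}\,(n-p-m) \\
        &= m + (n-p-m)\,(1-\lambda^{2}) \\
z^{2} &= 1 + \frac{n-p-m}{m}\,\bigl(1-\lambda^{2}\bigr)
\end{aligned}
```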
A value for the cutoff $x$ is calculated iteratively. An initial value of $R=0.2$ is used and a value of $\lambda$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is repeated until the relative change in $z$ between consecutive iterations is at most $\sqrt{\epsilon }$, where $\epsilon$ is machine precision.
By construction, the cutoff for testing for $m+1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Therefore Peirce's criterion is used in sequence, with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated, and so on.
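The iteration described above can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: the names `PeirceSketch`, `UpperTail` and `SolveZ` are hypothetical, the Normal tail $1-\Phi \left(z\right)$ is taken from the Abramowitz and Stegun 7.1.26 approximation (absolute error of roughly $1.5\times {10}^{-7}$) because .NET has no built-in error function, and an iteration cap is added as a safeguard.

```csharp
using System;

public static class PeirceSketch
{
    // IEEE double machine precision (note: double.Epsilon is NOT this value).
    const double MachEps = 2.220446049250313e-16;

    // Upper tail 1 - Phi(z) of the standard Normal distribution for z >= 0,
    // using the Abramowitz-Stegun 7.1.26 approximation to erfc
    // (an assumption of this sketch, accurate to about 1.5e-7).
    static double UpperTail(double z)
    {
        double x = z / Math.Sqrt(2.0);
        double t = 1.0 / (1.0 + 0.3275911 * x);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                    + t * (-1.453152027 + t * 1.061405429))));
        return 0.5 * poly * Math.Exp(-x * x);
    }

    // Iterates equations (1)-(3) for n observations, p model parameters and
    // m suspected outliers (1 <= m < n, n - p - m >= 0); returns z.
    public static double SolveZ(int n, int p, int m)
    {
        // log of m^m (n-m)^(n-m) / n^n, kept in logs to avoid overflow.
        double logQ = m * Math.Log(m) + (n - m) * Math.Log(n - m) - n * Math.Log(n);
        double tol = Math.Sqrt(MachEps);
        double r = 0.2;   // initial value of R, as in the text
        double z = 0.0;
        for (int iter = 0; iter < 200; iter++)   // cap is a safeguard only
        {
            // Equation (1): lambda^(n-m) = Q / R^m.
            double logLambda = (logQ - m * Math.Log(r)) / (n - m);
            double lambda2 = Math.Exp(2.0 * logLambda);
            // Equation (3), clamped at zero before taking the square root.
            double z2 = 1.0 + (double)(n - p - m) / m * (1.0 - lambda2);
            double zNew = Math.Sqrt(Math.Max(z2, 0.0));
            // Equation (2): new estimate of R.
            r = 2.0 * Math.Exp((zNew * zNew - 1.0) / 2.0) * UpperTail(zNew);
            // Stop when the relative change in z is at most sqrt(eps).
            bool done = Math.Abs(zNew - z) <= tol * Math.Abs(zNew);
            z = zNew;
            if (done) break;
        }
        return z;
    }
}
```

For example, `PeirceSketch.SolveZ(15, 1, 1)` corresponds to testing for a single outlier among fifteen raw observations.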
If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
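The sequential use of the criterion can be sketched as a simple loop. This is a hedged illustration rather than the method's actual logic: `SequentialPeirce`, `CountOutliers` and the `cutoff` callback (returning the cutoff $x$ when testing for $m$ potential outliers) are hypothetical names, and the handling of duplicate observations described above is deliberately left out.

```csharp
using System;

public static class SequentialPeirce
{
    // absDev:      |y_i - mu| sorted in descending order.
    // cutoff:      hypothetical callback giving the cutoff x for m suspected outliers.
    // maxOutliers: upper bound on m (the parameter notes give n - p - 1).
    public static int CountOutliers(double[] absDev, Func<int, double> cutoff, int maxOutliers)
    {
        int flagged = 0;
        for (int m = 1; m <= maxOutliers && m <= absDev.Length; m++)
        {
            // Test for m outliers only if m - 1 were already found:
            // the m-th largest deviation must reach the cutoff for m.
            if (absDev[m - 1] < cutoff(m)) break;
            flagged = m;
        }
        return flagged;
    }
}
```

Because the cutoff decreases as $m$ grows, stopping at the first $m$ that fails mirrors the "one, then two, and so on" order described in the text.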

References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45
Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

Error Indicators and Warnings

Errors or warnings detected by the method:
${\mathbf{ifail}}=1$
On entry, ${\mathbf{n}}<3$.
${\mathbf{ifail}}=2$
On entry, ${\mathbf{p}}\le 0$ or ${\mathbf{p}}>{\mathbf{n}}-2$.
${\mathbf{ifail}}=-9000$
An error occurred, see message report.
${\mathbf{ifail}}=-8000$
Negative dimension for array $\langle \mathit{\text{value}}\rangle$
${\mathbf{ifail}}=-6000$
Invalid Parameters $\langle \mathit{\text{value}}\rangle$

Accuracy

Not applicable.

Parallelism and Performance

None.

Further Comments

One problem with Peirce's algorithm as implemented in g07ga is the assumed relationship between ${\sigma }^{2}$, the variance using the full dataset, and ${\stackrel{~}{\sigma }}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, because the regression line may change significantly when outlying values are dropped, resulting in a radically different set of residuals. In such cases g07gb should be used instead.

Example

This example reads in a series of data and flags any potential outliers.
The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.

Example program (C#): g07gae.cs

Example program data: g07gae.d

Example program results: g07gae.r