g07ga identifies outlying values using Peirce's criterion.

Syntax

C#
public static void g07ga( int n, int p, double[] y, double mean, double var, int[] iout, out int niout, int ldiff, double[] diff, double[] llamb, out int ifail )

Visual Basic

Public Shared Sub g07ga ( _
	n As Integer, _
	p As Integer, _
	y As Double(), _
	mean As Double, _
	var As Double, _
	iout As Integer(), _
	<OutAttribute> ByRef niout As Integer, _
	ldiff As Integer, _
	diff As Double(), _
	llamb As Double(), _
	<OutAttribute> ByRef ifail As Integer _
)

Visual C++

public:
static void g07ga(
	int n, 
	int p, 
	array<double>^ y, 
	double mean, 
	double var, 
	array<int>^ iout, 
	[OutAttribute] int% niout, 
	int ldiff, 
	array<double>^ diff, 
	array<double>^ llamb, 
	[OutAttribute] int% ifail
)

F#
static member g07ga : n : int * p : int * y : float[] * mean : float * var : float * iout : int[] * niout : int byref * ldiff : int * diff : float[] * llamb : float[] * ifail : int byref -> unit

Parameters

n: Type: System.Int32
On entry: $n$ , the number of observations.

Constraint: $n \geq 3$ .

p: Type: System.Int32
On entry: $p$ , the number of parameters in the model used in obtaining the $y$ . If $y$ is an observed set of values, as opposed to the residuals from fitting a model with $p$ parameters, then $p$ should be set to $1$ , i.e., as if a model just containing the mean had been used.

Constraint: $1 \leq p \leq n - 2$ .

y: Type: System.Double[]
An array of size [n]
On entry: $y$ , the data being tested.

mean: Type: System.Double
On entry: if $var > 0.0$ , mean must contain $μ$ , the mean of $y$ , otherwise mean is not referenced and the mean is calculated from the data supplied in y.

var: Type: System.Double
On entry: if $var > 0.0$ , var must contain $σ^{2}$ , the variance of $y$ , otherwise the variance is calculated from the data supplied in y.

iout: Type: System.Int32[]
An array of size [n]
On exit: the indices of the values in y sorted in descending order of the absolute difference from the mean, therefore $|y [iout [i - 2] - 1] - μ| \geq |y [iout [i - 1] - 1] - μ|$ , for $i = 2, 3, \dots, n$ .
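
The ordering described for iout can be illustrated with a short, library-independent sketch. The function name `rank_by_deviation` is hypothetical, and the treatment of tied deviations is an assumption, since this document does not say how ties are ordered. Indices are 1-based, matching the `y[iout[i - 1] - 1]` convention above.

```python
def rank_by_deviation(y, mu):
    """Return 1-based indices of y, largest |y - mu| first (illustrative only)."""
    # sorted() is stable, so tied deviations keep their input order here;
    # the library's tie handling is not documented and may differ.
    return sorted(range(1, len(y) + 1), key=lambda i: -abs(y[i - 1] - mu))


y = [1.0, 5.0, 2.0, 2.5]
mu = 2.0
iout = rank_by_deviation(y, mu)

# Check the ordering property quoted above for i = 2, ..., n.
for i in range(2, len(y) + 1):
    assert abs(y[iout[i - 2] - 1] - mu) >= abs(y[iout[i - 1] - 1] - mu)
```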

niout: Type: System.Int32%
On exit: the number of potential outliers. The indices for these potential outliers are held in the first niout elements of iout. By construction there can be at most $n - p - 1$ values flagged as outliers.

ldiff: Type: System.Int32
On entry: the maximum number of values to be returned in arrays diff and llamb.
If $ldiff \leq 0$ , arrays diff and llamb are not referenced.

diff: Type: System.Double[]
An array of size [ldiff]
On exit: $diff [i - 1]$ holds $|y - μ| - σ^{2} z$ for observation $y [iout [i - 1] - 1]$ , for $i = 1, 2, \dots, \min (ldiff, niout + 1, n - p - 1)$ .

llamb: Type: System.Double[]
An array of size [ldiff]
On exit: $llamb [i - 1]$ holds $\log (λ^{2})$ for observation $y [iout [i - 1] - 1]$ , for $i = 1, 2, \dots, \min (ldiff, niout + 1, n - p - 1)$ .
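
As a reading aid, the number of leading elements of diff and llamb that carry values follows directly from the index range quoted above. The helper name `returned_count` is hypothetical. Returning zero when ldiff is not positive reflects the statement that the arrays are then not referenced.

```python
def returned_count(ldiff, niout, n, p):
    """Number of leading elements of diff and llamb that are set on exit."""
    if ldiff <= 0:
        # Arrays are not referenced at all in this case.
        return 0
    return min(ldiff, niout + 1, n - p - 1)
```

With n = 15, p = 1 and two flagged values, a caller passing ldiff = 10 would therefore expect three populated elements: the two flagged values plus the next candidate.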

ifail: Type: System.Int32%
On exit: $ifail = 0$ unless the method detects an error or a warning has been flagged (see [Error Indicators and Warnings]).

Description

g07ga flags outlying values in data using Peirce's criterion. Let

$y$ denote a vector of $n$ observations (for example the residuals) obtained from a model with $p$ parameters,
$m$ denote the number of potential outlying values,
$μ$ and $σ^{2}$ denote the mean and variance of $y$ respectively,
$\tilde{y}$ denote a vector of length $n - m$ constructed by dropping the $m$ values from $y$ with the largest value of $|y_{i} - μ|$ ,
${\tilde{σ}}^{2}$ denote the (unknown) variance of $\tilde{y}$ ,
$λ$ denote the ratio of $\tilde{σ}$ and $σ$ with $λ = \frac{\tilde{σ}}{σ}$ .

Peirce's method flags $y_{i}$ as a potential outlier if $|y_{i} - μ| \geq x$, where $x = σ^{2} z$ and $z$ is obtained from the solution of

$R^{m} = λ^{m - n} \frac{m^{m} {(n - m)}^{n - m}}{n^{n}}$ (1)

where

$R = 2 \exp \left( \frac{z^{2} - 1}{2} \right) (1 - Φ (z))$ (2)

and $Φ$ is the cumulative distribution function for the standard Normal distribution.

As ${\tilde{σ}}^{2}$ is unknown, an assumption is made that the relationship between ${\tilde{σ}}^{2}$ and $σ^{2}$, and hence $λ$, depends only on the sum of squares of the rejected observations, and the ratio is estimated as

$λ^{2} = \frac{n - p - m z^{2}}{n - p - m}$

which gives

$z^{2} = 1 + \frac{n - p - m}{m} (1 - λ^{2})$ (3)
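
For readers who want the intermediate algebra, equation (3) follows from the estimate of $λ^{2}$ by clearing the denominator and splitting $n - p$ into $m + (n - p - m)$. The steps below use only the symbols already defined in this section.

```latex
\begin{align*}
\lambda^{2} (n - p - m) &= n - p - m z^{2} \\
m z^{2} &= (n - p) - \lambda^{2} (n - p - m) \\
        &= m + (n - p - m) (1 - \lambda^{2}) \\
z^{2} &= 1 + \frac{n - p - m}{m} (1 - \lambda^{2})
\end{align*}
```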

A value for the cutoff $x$ is calculated iteratively. An initial value of $R = 0.2$ is used and a value of $λ$ is estimated using equation (1). Equation (3) is then used to obtain an estimate of $z$ and then equation (2) is used to get a new estimate for $R$. This process is then repeated until the relative change in $z$ between consecutive iterations is $\leq \sqrt{ε}$, where $ε$ is machine precision.
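
The iteration just described can be sketched as follows. This is a minimal illustration, not the library's implementation. It reads equation (2) as $R = 2 \exp ((z^{2} - 1) / 2) (1 - Φ (z))$ and uses the identity $2 (1 - Φ (z)) = \mathrm{erfc} (z / \sqrt{2})$. The function name `peirce_z`, the `max_iter` safeguard and the handling of a non-positive $z^{2}$ are assumptions made for this sketch.

```python
import math
import sys


def peirce_z(n, p, m, max_iter=200):
    """Iterate equations (1)-(3) for z, given n observations, p model
    parameters and m suspected outliers (1 <= m <= n - p - 1)."""
    if not (1 <= m <= n - p - 1):
        raise ValueError("need 1 <= m <= n - p - 1")
    tol = math.sqrt(sys.float_info.epsilon)
    # log of m^m (n - m)^(n - m) / n^n, kept in logs to avoid overflow
    log_q = m * math.log(m) + (n - m) * math.log(n - m) - n * math.log(n)
    r = 0.2  # starting value for R quoted in the text
    z_prev = None
    for _ in range(max_iter):
        # Equation (1) solved for lambda: lambda^(n - m) = q / R^m
        lam = math.exp((log_q - m * math.log(r)) / (n - m))
        # Equation (3)
        z_sq = 1.0 + (n - p - m) / m * (1.0 - lam * lam)
        if z_sq <= 0.0:
            # No positive cutoff; returning 0.0 is this sketch's own choice.
            return 0.0
        z = math.sqrt(z_sq)
        if z_prev is not None and abs(z - z_prev) <= tol * abs(z):
            return z
        z_prev = z
        # Equation (2), with 2 (1 - Phi(z)) written as erfc(z / sqrt(2))
        r = math.exp((z_sq - 1.0) / 2.0) * math.erfc(z / math.sqrt(2.0))
    raise RuntimeError("iteration did not converge")
```

Working with the logarithm of the right-hand side of equation (1) is a design choice of the sketch: the factor $n^{n}$ overflows double precision for moderately large $n$, whereas its logarithm does not.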

By construction, the cutoff for testing for $m + 1$ potential outliers is less than the cutoff for testing for $m$ potential outliers. Therefore Peirce's criterion is used in sequence, with the existence of a single potential outlier being investigated first. If one is found, the existence of two potential outliers is investigated, and so on.

If one of a duplicate series of observations is flagged as an outlier, then all of them are flagged as outliers.
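
One possible reading of the sequential procedure is sketched below. It assumes that the $m$th largest deviation is compared with the cutoff computed for $m$ potential outliers and that the search stops at the first candidate falling short. The names `count_flagged` and `cutoff_for_m` are hypothetical, and the duplicate-observation rule is deliberately left out.

```python
def count_flagged(sorted_devs, cutoff_for_m, max_m):
    """Count flagged values under one reading of the sequential test.

    sorted_devs  -- |y_i - mu| values in descending order
    cutoff_for_m -- hypothetical callable giving the cutoff x for m outliers
    max_m        -- largest number of outliers considered (n - p - 1 here)
    """
    m = 0
    limit = min(max_m, len(sorted_devs))
    # Test for one outlier first; move on to m + 1 only if the m-th is flagged.
    while m < limit and sorted_devs[m] >= cutoff_for_m(m + 1):
        m += 1
    return m


# Made-up cutoffs that shrink as m grows, as the text says they do.
cutoffs = {1: 3.0, 2: 2.5, 3: 2.0}
flagged = count_flagged([4.0, 2.7, 1.0, 0.5], cutoffs.get, max_m=3)
```

In this made-up case the first two deviations meet their cutoffs and the third does not, so two values are flagged.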

References

Gould B A (1855) On Peirce's criterion for the rejection of doubtful observations, with tables for facilitating its application The Astronomical Journal 45

Peirce B (1852) Criterion for the rejection of doubtful observations The Astronomical Journal 45

Error Indicators and Warnings

Errors or warnings detected by the method:

$ifail = 1$: On entry, $n < 3$ .

$ifail = 2$: On entry, $p \leq 0$ or $p > n - 2$ .

$ifail = -9000$: An error occurred, see message report.
$ifail = -8000$: Negative dimension for array $〈value〉$
$ifail = -6000$: Invalid Parameters $〈value〉$
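
A caller may wish to check the documented constraints before invoking the method. The sketch below mirrors the two input conditions listed above and returns the corresponding ifail value. The function name `check_inputs` is hypothetical, and checking n before p is an assumption based only on the numbering of the codes.

```python
def check_inputs(n, p):
    """Return the ifail value the documented constraints imply, or 0."""
    if n < 3:
        return 1  # On entry, n < 3
    if p <= 0 or p > n - 2:
        return 2  # On entry, p <= 0 or p > n - 2
    return 0
```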

Accuracy

Not applicable.

Parallelism and Performance

None.

Further Comments

One problem with Peirce's algorithm as implemented in g07ga is the assumed relationship between $σ^{2}$, the variance using the full dataset, and ${\tilde{σ}}^{2}$, the variance with the potential outliers removed. In some cases, for example if the data $y$ were the residuals from a linear regression, this assumption may not hold, as the regression line may change significantly when outlying values have been dropped, resulting in a radically different set of residuals. In such cases g07gb should be used instead.

Example

This example reads in a series of data and flags any potential outliers.

The dataset used is from Peirce's original paper and consists of fifteen observations on the vertical semidiameter of Venus.

Example program (C#): g07gae.cs

Example program data: g07gae.d

Example program results: g07gae.r