g02ef calculates a full stepwise selection from $p$ variables by using Clarke's sweep algorithm on the correlation matrix of a design and data matrix, $Z$. The (weighted) variance-covariance, (weighted) means and sum of weights of $Z$ must be supplied.

Syntax

C#

public static void g02ef(
	int m,
	int n,
	double[] wmean,
	double[] c,
	double sw,
	int[] isx,
	double fin,
	double fout,
	double tau,
	double[] b,
	double[] se,
	out double rsq,
	out double rms,
	out int df,
	int monlev,
	G02.G02EF_MONFUN monfun,
	out int ifail
)

Visual Basic

Public Shared Sub g02ef ( _
	m As Integer, _
	n As Integer, _
	wmean As Double(), _
	c As Double(), _
	sw As Double, _
	isx As Integer(), _
	fin As Double, _
	fout As Double, _
	tau As Double, _
	b As Double(), _
	se As Double(), _
	<OutAttribute> ByRef rsq As Double, _
	<OutAttribute> ByRef rms As Double, _
	<OutAttribute> ByRef df As Integer, _
	monlev As Integer, _
	monfun As G02.G02EF_MONFUN, _
	<OutAttribute> ByRef ifail As Integer _
)

Visual C++

public:
static void g02ef(
	int m, 
	int n, 
	array<double>^ wmean, 
	array<double>^ c, 
	double sw, 
	array<int>^ isx, 
	double fin, 
	double fout, 
	double tau, 
	array<double>^ b, 
	array<double>^ se, 
	[OutAttribute] double% rsq, 
	[OutAttribute] double% rms, 
	[OutAttribute] int% df, 
	int monlev, 
	G02::G02EF_MONFUN^ monfun,
	[OutAttribute] int% ifail
)

F#

static member g02ef : 
        m : int * 
        n : int * 
        wmean : float[] * 
        c : float[] * 
        sw : float * 
        isx : int[] * 
        fin : float * 
        fout : float * 
        tau : float * 
        b : float[] * 
        se : float[] * 
        rsq : float byref * 
        rms : float byref * 
        df : int byref * 
        monlev : int * 
        monfun : G02.G02EF_MONFUN * 
        ifail : int byref -> unit

Parameters

m: Type: System.Int32
On entry: the number of explanatory variables available in the design matrix, $Z$.

Constraint: $m > 1$.

n: Type: System.Int32
On entry: the number of observations used in the calculations.

Constraint: $n > 1$.

wmean: Type: array<System.Double>
An array of size [ $m + 1$ ]
On entry: the mean of the design matrix, $Z$.

c: Type: array<System.Double>
An array of size [ $(m + 1) \times (m + 2) / 2$ ]
On entry: the upper-triangular variance-covariance matrix packed by column for the design matrix, $Z$. Because the method computes the correlation matrix $R$ from c, the variance-covariance matrix need only be supplied up to a scaling factor.

sw: Type: System.Double
On entry: if weights were used to calculate c then sw is the sum of positive weight values; otherwise sw is the number of observations used to calculate c.

Constraint: $sw > 1.0$.
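The three summary inputs wmean, c and sw can be prepared from raw data before the call. The following is a minimal unweighted sketch and not part of the NAG library: the class and method names are hypothetical, $Z$ is assumed to hold the $m$ explanatory variables in its first $m$ columns and $y$ in its last, wmean is assumed to hold the column means, and c is filled with sums of squares and cross-products about the mean (acceptable only because c is needed up to a scaling factor). The packed position of zero-based element $(i, j)$, $i \leq j$, is taken to be $i + j(j + 1)/2$, the usual layout for an upper triangle packed by column.

```csharp
using System;

public static class SummarySketch
{
    // Hypothetical helper: unweighted means, packed cross-products and sw
    // for an n-by-(m+1) matrix z = (X | y).
    public static void Compute(double[,] z, out double[] wmean, out double[] c, out double sw)
    {
        int n = z.GetLength(0);
        int cols = z.GetLength(1);          // m + 1 columns
        wmean = new double[cols];
        for (int j = 0; j < cols; j++)
        {
            double sum = 0.0;
            for (int r = 0; r < n; r++) sum += z[r, j];
            wmean[j] = sum / n;
        }

        // Upper triangle packed by column: (i, j) with i <= j goes to i + j(j+1)/2.
        c = new double[cols * (cols + 1) / 2];
        for (int j = 0; j < cols; j++)
        {
            for (int i = 0; i <= j; i++)
            {
                double s = 0.0;
                for (int r = 0; r < n; r++)
                    s += (z[r, i] - wmean[i]) * (z[r, j] - wmean[j]);
                c[i + j * (j + 1) / 2] = s;
            }
        }

        sw = n;  // no weights were used, so sw is the number of observations
    }
}
```

With $m = 1$ the packed array has $(m + 1)(m + 2)/2 = 3$ elements, ordered as the variance term of the explanatory variable, the cross term, and the variance term of $y$.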

isx: Type: array<System.Int32>
An array of size [ $m$ ]
On entry: the value of $isx[j - 1]$ determines the set of variables used to perform full stepwise model selection, for $j = 1, 2, \dots, m$.

$isx[j - 1] = -1$: To exclude the variable corresponding to the $j$th column of $X$ from the final model.
$isx[j - 1] = 1$: To consider the variable corresponding to the $j$th column of $X$ for selection in the final model.
$isx[j - 1] = 2$: To force the inclusion of the variable corresponding to the $j$th column of $X$ in the final model.

Constraint: $isx[j - 1] = -1$, $1$ or $2$, for $j = 1, 2, \dots, m$.

On exit: the value of $isx[j - 1]$ indicates the status of the $j$th explanatory variable in the model.

$isx[j - 1] = -1$: Forced exclusion.
$isx[j - 1] = 0$: Excluded.
$isx[j - 1] = 1$: Selected.
$isx[j - 1] = 2$: Forced selection.
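The exit codes listed above can be decoded after the call. The sketch below is illustrative only; the class and method names are hypothetical and not part of the NAG library. It maps each code to its status and collects the one-based indices $j$ of the variables that ended up in the model (codes $1$ and $2$).

```csharp
using System;
using System.Collections.Generic;

public static class IsxSketch
{
    // Translate an exit code from isx into the status listed in the documentation.
    public static string Status(int code)
    {
        switch (code)
        {
            case -1: return "forced exclusion";
            case 0: return "excluded";
            case 1: return "selected";
            case 2: return "forced selection";
            default: throw new ArgumentOutOfRangeException(nameof(code));
        }
    }

    // One-based indices j of the variables that are in the final model.
    public static List<int> InModel(int[] isx)
    {
        var result = new List<int>();
        for (int j = 1; j <= isx.Length; j++)
        {
            if (isx[j - 1] == 1 || isx[j - 1] == 2) result.Add(j);
        }
        return result;
    }
}
```

For example, an exit array of { 1, 0, -1, 2 } yields the indices 1 and 4.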

fin: Type: System.Double
On entry: the value of the variance ratio which an explanatory variable must exceed to be included in a model.
Suggested value: $fin = 4.0$

Constraint: $fin > 0.0$.

fout: Type: System.Double
On entry: the explanatory variable in a model with the lowest variance ratio value is removed from the model if its value is less than fout. fout is usually set equal to the value of fin; a value less than fin is occasionally preferred.
Suggested value: $fout = fin$

Constraint: $0.0 \leq fout \leq fin$.

tau: Type: System.Double
On entry: the tolerance, $τ$, for detecting collinearities between variables when adding or removing an explanatory variable from a model. Explanatory variables deemed to be collinear are excluded from the final model.
Suggested value: $tau = 1.0 \times 10^{-6}$

Constraint: $tau > 0.0$.

b: Type: array<System.Double>
An array of size [ $m + 1$ ]
On exit: $b[0]$ contains the estimate for the intercept term in the fitted model. If $isx[j - 1] \neq 0$ then $b[j]$ contains the estimate for the $j$th explanatory variable in the fitted model; otherwise $b[j] = 0$.
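The layout of b lends itself to a direct evaluation of the fitted model for a new observation. The following sketch is an illustration only, with hypothetical class and method names; it assumes x holds the values of all $m$ explanatory variables in column order.

```csharp
using System;

public static class PredictSketch
{
    // b has length m + 1 with b[0] the intercept; x holds the m explanatory values.
    public static double Predict(double[] b, double[] x)
    {
        if (b.Length != x.Length + 1)
            throw new ArgumentException("b must have one more element than x");
        double yhat = b[0];
        for (int j = 1; j <= x.Length; j++)
        {
            yhat += b[j] * x[j - 1];   // excluded variables contribute nothing since b[j] = 0
        }
        return yhat;
    }
}
```

Because coefficients of excluded variables are returned as zero, no separate lookup in isx is needed for prediction.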

se: Type: array<System.Double>
An array of size [ $m + 1$ ]
On exit: $se[j - 1]$ contains the standard error for the estimate of $b[j - 1]$, for $j = 1, 2, \dots, m + 1$.

rsq: Type: System.Double%
On exit: the $R^{2}$-statistic for the fitted regression model.

rms: Type: System.Double%
On exit: the mean square of residuals for the fitted regression model.

df: Type: System.Int32%
On exit: the number of degrees of freedom for the sum of squares of residuals.

monlev: Type: System.Int32
On entry: if you provide a function to monitor the model selection process, set monlev to $1$; otherwise set monlev to $0$.

Constraint: $monlev = 0$ or $1$.

monfun: Type: NagLibrary.G02.G02EF_MONFUN
A delegate of type G02EF_MONFUN. You may define your own function or specify the NAG-defined default function G02EFH.

ifail: Type: System.Int32%
On exit: $ifail = 0$ unless the method detects an error or a warning has been flagged (see [Error Indicators and Warnings]).

Description

The general multiple linear regression model is defined by

$$y = β_{0} + X β + ε,$$

where

$y$ is a vector of $n$ observations on the dependent variable,
$β_{0}$ is an intercept coefficient,
$X$ is an $n$ by $p$ matrix of $p$ explanatory variables,
$β$ is a vector of $p$ unknown coefficients, and
$ε$ is a vector of length $n$ of unknown, Normally distributed, random errors.

g02ef employs a full stepwise regression to select a subset of explanatory variables from the $p$ available variables (the intercept is included in the model) and computes regression coefficients and their standard errors, and various other statistical quantities, by minimizing the sum of squares of residuals. The method repeatedly applies a forward selection step followed by a backward elimination step and halts when neither step updates the current model.

The criterion used to update a current model is the variance ratio of residual sum of squares. Let $s_{1}$ and $s_{2}$ be the residual sum of squares of the current model and this model after undergoing a single update, with degrees of freedom $q_{1}$ and $q_{2}$, respectively. Then the condition:

$$\frac{(s_{2} - s_{1}) / (q_{2} - q_{1})}{s_{1} / q_{1}} > f_{1},$$

must be satisfied if a variable $k$ will be considered for entry to the current model, and the condition:

$$\frac{(s_{1} - s_{2}) / (q_{1} - q_{2})}{s_{1} / q_{1}} < f_{2},$$

must be satisfied if a variable $k$ will be considered for removal from the current model, where $f_{1}$ and $f_{2}$ are user-supplied values and $f_{2} \leq f_{1}$.
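As a rough illustration of the two conditions, the sketch below transcribes them as printed. It is not the library's implementation; the class and method names are hypothetical, and it assumes that $f_{1}$ and $f_{2}$ correspond to the arguments fin and fout.

```csharp
using System;

public static class VarianceRatioSketch
{
    // Ratio shared by both conditions: (s2 - s1)/(q2 - q1) equals (s1 - s2)/(q1 - q2).
    public static double Ratio(double s1, int q1, double s2, int q2)
    {
        return ((s2 - s1) / (q2 - q1)) / (s1 / q1);
    }

    // Entry condition: the ratio must exceed fin (assumed to play the role of f1).
    public static bool MayEnter(double s1, int q1, double s2, int q2, double fin)
    {
        return Ratio(s1, q1, s2, q2) > fin;
    }

    // Removal condition: the ratio must be below fout (assumed to play the role of f2).
    public static bool MayLeave(double s1, int q1, double s2, int q2, double fout)
    {
        return Ratio(s1, q1, s2, q2) < fout;
    }
}
```

For instance, with $s_{1} = 120$, $q_{1} = 12$ and a candidate update giving $s_{2} = 60$, $q_{2} = 11$, the ratio is $(60 / 1) / (120 / 12) = 6$, which exceeds the suggested threshold of $4.0$, so the variable qualifies for entry.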

In the entry step the entry statistic is computed for each variable not in the current model. If no variable is associated with a test value that exceeds $f_{1}$ then this step is terminated; otherwise the variable associated with the largest value for the entry statistic is entered into the model.

In the removal step the removal statistic is computed for each variable in the current model. If no variable is associated with a test value less than $f_{2}$ then this step is terminated; otherwise the variable associated with the smallest value for the removal statistic is removed from the model.

The data values $X$ and $y$ are not provided as input to the method. Instead, summary statistics of the design and data matrix $Z = (X \mid y)$ are required.

Explanatory variables are entered into and removed from the current model by using sweep operations on the correlation matrix $R$ of $Z$, given by:

$$R = \begin{pmatrix} 1 & \dots & r_{1 p} & r_{1 y} \\ \vdots & \ddots & \vdots & \vdots \\ r_{p 1} & \dots & 1 & r_{p y} \\ r_{y 1} & \dots & r_{y p} & 1 \end{pmatrix},$$

where $r_{i j}$ is the correlation between the explanatory variables $i$ and $j$, for $i = 1, 2, \dots, p$ and $j = 1, 2, \dots, p$, and $r_{y i}$ (and $r_{i y}$) is the correlation between the response variable $y$ and the $i$th explanatory variable, for $i = 1, 2, \dots, p$.
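To make the link between the packed input c and the matrix $R$ concrete, the following sketch expands a packed upper triangle into a full correlation matrix using $r_{i j} = c_{i j} / \sqrt{c_{i i} c_{j j}}$. It is an illustration rather than the library's code: the names are hypothetical, and the packed position of zero-based element $(i, j)$, $i \leq j$, is assumed to be $i + j(j + 1)/2$.

```csharp
using System;

public static class CorrelationSketch
{
    // Expand a packed upper triangle (by column) of the given order
    // into a full symmetric correlation matrix.
    public static double[,] FromPacked(double[] c, int order)
    {
        if (c.Length != order * (order + 1) / 2)
            throw new ArgumentException("c has the wrong length for the given order");
        var r = new double[order, order];
        for (int j = 0; j < order; j++)
        {
            for (int i = 0; i <= j; i++)
            {
                double cii = c[i + i * (i + 1) / 2];   // diagonal element (i, i)
                double cjj = c[j + j * (j + 1) / 2];   // diagonal element (j, j)
                double rij = c[i + j * (j + 1) / 2] / Math.Sqrt(cii * cjj);
                r[i, j] = rij;
                r[j, i] = rij;
            }
        }
        return r;
    }
}
```

Multiplying every element of c by the same positive constant leaves the result unchanged, which is why the variance-covariance matrix need only be supplied up to a scaling factor.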

A sweep operation on the $k$th row and column ($k \leq p$) of $R$ replaces:

$$\begin{array}{l} r_{k k} \text{ by } -1 / r_{k k}; \\ r_{i k} \text{ by } r_{i k} / |r_{k k}|, \quad i = 1, 2, \dots, p + 1 \ (i \neq k); \\ r_{k j} \text{ by } r_{k j} / |r_{k k}|, \quad j = 1, 2, \dots, p + 1 \ (j \neq k); \\ r_{i j} \text{ by } r_{i j} - r_{i k} r_{k j} / |r_{k k}|, \quad i = 1, 2, \dots, p + 1 \ (i \neq k); \ j = 1, 2, \dots, p + 1 \ (j \neq k). \end{array}$$
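The replacement rules can be sketched on a full (unpacked) matrix. The code below is a simplified illustration and not the packed, symmetry-exploiting routine of Clarke (1981): the names are hypothetical, it handles only the entry case in which the pivot $r_{k k}$ is positive, and it performs no collinearity checks. All right-hand sides use the values held before the sweep, so the off-pivot elements are updated first.

```csharp
using System;

public static class SweepSketch
{
    // Sweep the full symmetric matrix r in place on pivot k (zero-based),
    // for the entry case r[k, k] > 0 only.
    public static void SweepIn(double[,] r, int k)
    {
        int size = r.GetLength(0);
        double d = r[k, k];
        if (d <= 0.0)
            throw new ArgumentException("this sketch only handles a positive pivot");

        // Off-pivot elements first, while row k and column k still hold their old values.
        for (int i = 0; i < size; i++)
        {
            if (i == k) continue;
            for (int j = 0; j < size; j++)
            {
                if (j == k) continue;
                r[i, j] -= r[i, k] * r[k, j] / d;
            }
        }

        // Then scale the pivot column and row, and finally replace the pivot itself.
        for (int i = 0; i < size; i++)
        {
            if (i == k) continue;
            r[i, k] /= d;
            r[k, i] /= d;
        }
        r[k, k] = -1.0 / d;
    }
}
```

For a 3 by 3 matrix of correlations among $(x_{1}, x_{2}, y)$ with $r_{1 2} = 0.5$, $r_{1 y} = 0.8$ and $r_{2 y} = 0.6$, sweeping on the first variable leaves $1 - 0.8^{2} = 0.36$ in the $(y, y)$ position, the proportion of the variance of $y$ not explained by $x_{1}$.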

The $k$th explanatory variable is eligible for entry into the current model if it satisfies the collinearity tests:

$$r_{k k} > τ$$

and

$$\left(r_{i i} - \frac{r_{i k} r_{k i}}{r_{k k}}\right) τ \leq 1,$$

for a user-supplied value ($> 0$) of $τ$ and where the index $i$ runs over explanatory variables in the current model. The sweep operation is its own inverse, so pivoting on an explanatory variable $k$ in the current model has the effect of removing it from the model.
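The two collinearity tests can be written down directly. The sketch below only transcribes the inequalities as printed and makes no claim about how the library evaluates them internally; the names are hypothetical, r is assumed to be the current (partly swept) full matrix, and inModel is assumed to list the zero-based indices of variables already in the model.

```csharp
using System;
using System.Collections.Generic;

public static class CollinearitySketch
{
    // Returns true when variable k passes both tests against the tolerance tau.
    public static bool IsEligible(double[,] r, int k, IEnumerable<int> inModel, double tau)
    {
        if (!(r[k, k] > tau)) return false;            // first test: r_kk > tau
        foreach (int i in inModel)
        {
            double updated = r[i, i] - r[i, k] * r[k, i] / r[k, k];
            if (updated * tau > 1.0) return false;     // second test fails for variable i
        }
        return true;
    }
}
```

A variable whose diagonal element has fallen to, say, $10^{-9}$ is rejected under the suggested tolerance $τ = 10^{-6}$, since the first test requires $r_{k k} > τ$.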

Once the stepwise model selection procedure is finished, the method calculates:

(a)	the least squares estimate for the $i$ th explanatory variable included in the fitted model;
(b)	standard error estimates for each coefficient in the final model;
(c)	the square root of the mean square of residuals and its degrees of freedom;
(d)	the multiple correlation coefficient.

The method makes use of the symmetry of the sweep operations and correlation matrix, which reduces by almost one half the storage and computation required by the sweep algorithm; see Clarke (1981) for details.

References

Clarke M R B (1981) Algorithm AS 178: the Gauss–Jordan sweep operator with detection of collinearity Appl. Statist. 31 166–169

Dempster A P (1969) Elements of Continuous Multivariate Analysis Addison–Wesley

Draper N R and Smith H (1985) Applied Regression Analysis (2nd Edition) Wiley

Error Indicators and Warnings

Errors or warnings detected by the method:

$ifail = 1$

On entry,	$m \leq 1$,
or	$n \leq 1$,
or	$sw \leq 1.0$,
or	$fin \leq 0.0$,
or	$fout < 0.0$,
or	$fout > fin$,
or	$tau \leq 0.0$.

$ifail = 2$

On entry,	at least one element of isx was set incorrectly,
or	there are no explanatory variables to select from, i.e., $isx[i - 1] \neq 1$, for $i = 1, 2, \dots, m$,
or	invalid value for monlev.

$ifail = 3$: Warning: the design and data matrix $Z$ is not positive definite, results may be inaccurate.

$ifail = 4$: All variables are collinear, there is no model to select.

$ifail = -9000$: An error occurred, see message report.
$ifail = -8000$: Negative dimension for array $〈value〉$
$ifail = -6000$: Invalid Parameters $〈value〉$

Accuracy

g02ef returns a warning if the design and data matrix is not positive definite.

Parallelism and Performance

None.

Further Comments

Although the condition for removing or adding a variable to the current model is based on a ratio of variances, these values should not be interpreted as $F$-statistics with the usual interpretation of significance unless the probability levels are adjusted to account for correlations between variables under consideration and the number of possible updates (see, e.g., Draper and Smith (1985)).

g02ef allocates internally $O(4 \times m + (m + 1) \times (m + 2) / 2 + 2)$ of real storage.

Example

Example program (C#): g02efe.cs

Example program data: g02efe.d

Example program results: g02efe.r