NAG Toolbox: nag_rand_kfold_xyw (g05pv)

 Contents

    1  Purpose
    2  Syntax
    3  Description
    4  References
    5  Parameters
    6  Error Indicators and Warnings
    7  Accuracy
    8  Further Comments
    9  Example

Purpose

nag_rand_kfold_xyw (g05pv) generates training and validation datasets suitable for use in cross-validation or jack-knifing.

Syntax

[nt, state, sx, sy, sw, errbuf, ifail] = g05pv(k, fold, x, state, 'n', n, 'm', m, 'sordx', sordx, 'y', y, 'w', w, 'sordsx', sordsx)
[nt, state, sx, sy, sw, errbuf, ifail] = nag_rand_kfold_xyw(k, fold, x, state, 'n', n, 'm', m, 'sordx', sordx, 'y', y, 'w', w, 'sordsx', sordsx)

Description

Let Xo denote a matrix of n observations on m variables and yo and wo each denote a vector of length n. For example, Xo might represent a matrix of independent variables, yo the dependent variable and wo the associated weights in a weighted regression.
nag_rand_kfold_xyw (g05pv) generates a series of training datasets, denoted by the (matrix, vector, vector) triplet (Xt, yt, wt) of nt observations, and validation datasets, denoted (Xv, yv, wv), of nv observations. These training and validation datasets are generated as follows.
Each of the original n observations is randomly assigned to one of K equally sized groups or folds. For the kth sample the validation dataset consists of those observations in group k and the training dataset consists of all those observations not in group k. Therefore at most K samples can be generated.
If n is not divisible by K then the observations are assigned to groups as evenly as possible; therefore any group will be at most one observation larger or smaller than any other group.
When using K=n the resulting datasets are suitable for leave-one-out cross-validation, or the training dataset on its own for jack-knifing. When using K<n the resulting datasets are suitable for K-fold cross-validation. Datasets suitable for reversed cross-validation can be obtained by switching the training and validation datasets, i.e., using the kth group as the training dataset and the rest of the data as the validation dataset.
One of the initialization functions nag_rand_init_repeat (g05kf) (for a repeatable sequence if computed sequentially) or nag_rand_init_nonrepeat (g05kg) (for a non-repeatable sequence) must be called prior to the first call to nag_rand_kfold_xyw (g05pv).
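The assignment scheme above can be sketched as follows. This is a language-agnostic illustration in Python (kfold_split is an illustrative name, not part of the toolbox); g05pv itself draws from the NAG base generator held in state rather than from Python's random module.

```python
import random

def kfold_split(n, K, k, seed=0):
    """Assign n observations to K near-equal folds and return the
    (training, validation) index sets for fold k (1-based)."""
    rng = random.Random(seed)
    # One fold label per observation, repeated as evenly as possible:
    # after shuffling, fold sizes differ by at most one observation.
    labels = [(i % K) + 1 for i in range(n)]
    rng.shuffle(labels)
    validation = [i for i in range(n) if labels[i] == k]
    training = [i for i in range(n) if labels[i] != k]
    return training, validation

training, validation = kfold_split(10, 3, 1)
```

With K=n each validation set contains exactly one observation (leave-one-out); swapping the two returned index sets gives the reversed cross-validation datasets mentioned above.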

References

None.

Parameters

Compulsory Input Parameters

1:     k – nag_int scalar
K, the number of folds.
Constraint: 2 ≤ k ≤ n.
2:     fold – nag_int scalar
The number of the fold to return as the validation dataset.
On the first call to nag_rand_kfold_xyw (g05pv) fold should be set to 1 and then incremented by one at each subsequent call until all K sets of training and validation datasets have been produced. See Further Comments for more details on how a different calling sequence can be used.
Constraint: 1 ≤ fold ≤ k.
3:     x(ldx,:) – double array
The first dimension, ldx, of the array x must satisfy
  • if sordx=2, ldx ≥ m;
  • otherwise ldx ≥ n.
The second dimension of the array x must be at least m if sordx=1 and at least n if sordx=2.
The way the data is stored in x is defined by sordx.
If sordx=1, x(i,j) contains the ith observation for the jth variable, for i=1,2,…,n and j=1,2,…,m.
If sordx=2, x(j,i) contains the ith observation for the jth variable, for i=1,2,…,n and j=1,2,…,m.
If fold=1, x must hold Xo, the values of X for the original dataset; otherwise, x must hold the array returned in sx by the last call to nag_rand_kfold_xyw (g05pv).
4:     state(:) – nag_int array
Note: the actual argument supplied must be the array state supplied to the initialization routines nag_rand_init_repeat (g05kf) or nag_rand_init_nonrepeat (g05kg).
Contains information on the selected base generator and its current state.

Optional Input Parameters

1:     n – nag_int scalar
Default:
  • if sordx=2, the second dimension of x;
  • otherwise the first dimension of x.
n, the number of observations.
Constraint: n ≥ 1.
2:     m – nag_int scalar
Default:
  • if sordx=2, the first dimension of x;
  • otherwise the second dimension of x.
m, the number of variables.
Constraint: m ≥ 1.
3:     sordx – nag_int scalar
Default: 1
Determines how variables are stored in x.
Constraint: sordx=1 or 2.
4:     y(ly) – double array
Optionally, yo, the values of y for the original dataset. If fold≠1, y must hold the vector returned in sy by the last call to nag_rand_kfold_xyw (g05pv).
5:     w(lw) – double array
Optionally, wo, the values of w for the original dataset. If fold≠1, w must hold the vector returned in sw by the last call to nag_rand_kfold_xyw (g05pv).
6:     sordsx – nag_int scalar
Default: sordx
Determines how variables are stored in sx.
Constraint: sordsx=1 or 2.

Output Parameters

1:     nt – nag_int scalar
nt, the number of observations in the training dataset.
2:     state(:) – nag_int array
Contains updated information on the state of the generator.
3:     sx(ldsx,:) – double array
The first dimension, ldsx, of the array sx will be
  • if sordsx=1, ldsx=n;
  • if sordsx=2, ldsx=m.
The second dimension of the array sx will be m if sordsx=1 and n otherwise.
The way the data is stored in sx is defined by sordsx.
If sordsx=1, sx(i,j) contains the ith observation for the jth variable, for i=1,2,…,n and j=1,2,…,m.
If sordsx=2, sx(j,i) contains the ith observation for the jth variable, for i=1,2,…,n and j=1,2,…,m.
sx holds the values of X for the training and validation datasets, with Xt held in observations 1 to nt and Xv in observations nt+1 to n.
4:     sy(lsy) – double array
If y is supplied then sy holds the values of y for the training and validation datasets, with yt held in elements 1 to nt and yv in elements nt+1 to n.
5:     sw(lsw) – double array
If w is supplied then sw holds the values of w for the training and validation datasets, with wt held in elements 1 to nt and wv in elements nt+1 to n.
6:     errbuf – string (length at least 200)
7:     ifail – nag_int scalar
ifail=0 unless the function detects an error (see Error Indicators and Warnings).

Error Indicators and Warnings

Note: nag_rand_kfold_xyw (g05pv) may return useful information for one or more of the following detected errors or warnings.
Errors or warnings detected by the function:

Cases prefixed with W are classified as warnings and do not generate an error of type NAG:error_n. See nag_issue_warnings.

   ifail=11
Constraint: 2 ≤ k ≤ n.
   ifail=21
Constraint: 1 ≤ fold ≤ k.
   ifail=31
Constraint: n ≥ 1.
   ifail=41
Constraint: m ≥ 1.
   ifail=51
Constraint: sordx=1 or 2.
W  ifail=61
More than 50% of the data did not move when the data was shuffled.
   ifail=71
Constraint: if sordx=1, ldx ≥ n.
   ifail=72
Constraint: if sordx=2, ldx ≥ m.
   ifail=131
On entry, state vector has been corrupted or not initialized.
   ifail=161
Constraint: sordsx=1 or 2.
   ifail=-99
An unexpected error has been triggered by this routine. Please contact NAG.
   ifail=-399
Your licence key may have expired or may not have been installed correctly.
   ifail=-999
Dynamic memory allocation failed.

Accuracy

Not applicable.

Further Comments

nag_rand_kfold_xyw (g05pv) will be computationally more efficient if each observation in x is contiguous, that is, when sordx=2.
Because of the way nag_rand_kfold_xyw (g05pv) stores the data you should usually generate the K training and validation datasets in order, i.e., set fold=1 on the first call and increment it by one at each subsequent call. However, there are times when a different calling sequence would be beneficial, for example when performing different cross-validation analyses on different threads. This is possible as long as the arrays supplied in x (and, where used, y and w) for fold>1 are those returned in sx (and sy and sw) by a previous call.
For example, if you have three threads, you would call nag_rand_kfold_xyw (g05pv) once with fold=1. You would then copy the x returned onto each thread and generate the remaining k-1 sets of data by splitting them between the threads, for example the first thread running with fold=2,…,L1, the second with fold=L1+1,…,L2 and the third with fold=L2+1,…,k.
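The three-thread scheme above amounts to partitioning the fold numbers 2,…,k into contiguous ranges. A minimal Python sketch of one way to choose the split points (partition_folds and nthreads are illustrative names, not part of the toolbox):

```python
def partition_folds(k, nthreads):
    """Split fold numbers 2..k into nthreads contiguous ranges.
    Fold 1 is generated once, up front, and its output copied to
    every thread, so it is excluded here."""
    folds = list(range(2, k + 1))
    size, extra = divmod(len(folds), nthreads)
    ranges, start = [], 0
    for t in range(nthreads):
        # Spread any remainder over the first `extra` threads
        stop = start + size + (1 if t < extra else 0)
        ranges.append(folds[start:stop])
        start = stop
    return ranges
```

Each thread then works through its own range sequentially, feeding the arrays returned by one call into the next.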

Example

This example uses nag_rand_kfold_xyw (g05pv) to facilitate K-fold cross-validation.
A set of simulated data is split into 5 training and validation datasets. nag_correg_glm_binomial (g02gb) is used to fit a logistic regression model to each training dataset and then nag_correg_glm_predict (g02gp) is used to predict the response for the observations in the validation dataset.
The counts of true and false positives and negatives, along with the sensitivity and specificity, are then reported.
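The sensitivity and specificity reported are the standard definitions, TP/(TP+FN) and TN/(TN+FP). A small Python sketch of the calculation, using the pooled counts from the example output below (the function name rates is illustrative):

```python
def rates(tp, fn, tn, fp):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    sensitivity = tp / (tp + fn) if (tp + fn) > 0 else None
    specificity = tn / (tn + fp) if (tn + fp) > 0 else None
    return sensitivity, specificity

# Pooled counts from the example's cross-tabulation
sensitivity, specificity = rates(tp=10, fn=8, tn=18, fp=4)
# sensitivity ≈ 0.56, specificity ≈ 0.82
```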
function g05pv_example


fprintf('g05pv example results\n\n');

% Fit a logistic regression model using g02gb and predict values using g02gp
% (binomial error, logistic link, with an intercept)
link = 'G';
mean = 'M';
errfn = 'B';

% Not using the predicted standard errors
vfobs = false;

% Independent variables
x = [ 0.0 -0.1  0.0  1.0;   0.4 -1.1  1.0  1.0;  -0.5  0.2  1.0  0.0;
      0.6  1.1  1.0  0.0;  -0.3 -1.0  1.0  1.0;   2.8 -1.8  0.0  1.0;
      0.4 -0.7  0.0  1.0;  -0.4 -0.3  1.0  0.0;   0.5 -2.6  0.0  0.0;
     -1.6 -0.3  1.0  1.0;   0.4  0.6  1.0  0.0;  -1.6  0.0  1.0  1.0;
      0.0  0.4  1.0  1.0;  -0.1  0.7  1.0  1.0;  -0.2  1.8  1.0  1.0;
     -0.9  0.7  1.0  1.0;  -1.1 -0.5  1.0  1.0;  -0.1 -2.2  1.0  1.0;
     -1.8 -0.5  1.0  1.0;  -0.8 -0.9  0.0  1.0;   1.9 -0.1  1.0  1.0;
      0.3  1.4  1.0  1.0;   0.4 -1.2  1.0  0.0;   2.2  1.8  1.0  0.0;
      1.4 -0.4  0.0  1.0;   0.4  2.4  1.0  1.0;  -0.6  1.1  1.0  1.0;
      1.4 -0.6  1.0  1.0;  -0.1 -0.1  0.0  0.0;  -0.6 -0.4  0.0  0.0;
      0.6 -0.2  1.0  1.0;  -1.8 -0.3  1.0  1.0;  -0.3  1.6  1.0  1.0;
     -0.6  0.8  0.0  1.0;   0.3 -0.5  0.0  0.0;   1.6  1.4  1.0  1.0;
     -1.1  0.6  1.0  1.0;  -0.3  0.6  1.0  1.0;  -0.6  0.1  1.0  1.0;
      1.0  0.6  1.0  1.0];                       

% Dependent variable
y = [0;1;0;0;0;0;1;1;1;0;0;1;1;0;0;0;0;1;1;1;
     1;0;1;1;1;0;0;1;0;0;1;1;0;0;1;0;0;0;0;1];

% Each observation represents a single trial
t = ones(size(x,1),1);

% We want to include all independent variables in the model
isx = int64(ones(size(x,2),1));
ip = int64(sum(isx) + (upper(mean(1:1)) == 'M'));

% In order to use cross-validation we need to initialise the random
% number generator (using L'Ecuyers MRG32k3a and a repeatable sequence)
seed = int64(42321);
genid = int64(6);
subid = int64(0);
[state,ifail] = g05kf( ...
                       genid,subid,seed);

% perform 5-fold sampling
k = int64(5);

% Some of the routines used in this example issue warnings, but return
% sensible results, so save current warning state and turn warnings on
warn_state = nag_issue_warnings();
nag_issue_warnings(true);

tn = 0;
fn = 0;
fp = 0;
tp = 0;

%  Loop over each fold
for i = 1:k
  fold = int64(i);

  % Split the data into training and validation datasets
  [nt,state,x,y,t,ifail] = g05pv( ...
                                  k,fold,x,state,'y',y,'w',t);
  if (ifail~=0 && ifail~=61)
    break
  end

  % Fit generalized linear model, with Binomial errors to training data
  % (the first nt values in x).
  [~,~,b,~,~,cov,~,ifail] = g02gb( ...
                                   link,mean,x,isx,ip,y,t,'n',nt);
  if (ifail~=0 && ifail < 6)
    break
  end

  % Predict the response for the observations in the validation dataset
  [~,~,pred,~,ifail] = g02gp( ...
                              errfn,link,mean,x(nt+1:end,:),isx,b, ...
                              cov,vfobs,'t',t(nt+1:end));
  if (ifail~=0)
    break
  end

  % Cross-tabulate the observed and predicted values
  obs_val = ceil(y(nt+1:end) + 0.5);
  pred_val = (pred >= 0.5) + 1;
  count = zeros(2,2);
  % Use a distinct index so the fold loop variable i is not shadowed
  for j = 1:size(pred_val,1)
    count(pred_val(j),obs_val(j)) = count(pred_val(j),obs_val(j)) + 1;
  end

  % Extract the true/false negatives/positives
  tn = tn + count(1,1);
  fn = fn + count(1,2);
  fp = fp + count(2,1);
  tp = tp + count(2,2);
end

% Reset the warning state to its initial value
nag_issue_warnings(warn_state);

np = tp + fn;
nn = fp + tn;

fprintf('                       Observed\n');
fprintf('             --------------------------\n');
fprintf(' Predicted | Negative  Positive   Total\n');
fprintf(' --------------------------------------\n');
fprintf(' Negative  | %5d     %5d     %5d\n', tn, fn, tn + fn);
fprintf(' Positive  | %5d     %5d     %5d\n', fp, tp, fp + tp);
fprintf(' Total     | %5d     %5d     %5d\n', nn, np, nn + np);
fprintf('\n');

if (np~=0)
  fprintf(' True Positive Rate (Sensitivity): %4.2f\n', tp / np);
else
  fprintf(' True Positive Rate (Sensitivity): No positives in data\n');
end
if (nn~=0)
  fprintf(' True Negative Rate (Specificity): %4.2f\n', tn / nn);
else
  fprintf(' True Negative Rate (Specificity): No negatives in data\n');
end


g05pv example results

                       Observed
             --------------------------
 Predicted | Negative  Positive   Total
 --------------------------------------
 Negative  |    18         8        26
 Positive  |     4        10        14
 Total     |    22        18        40

 True Positive Rate (Sensitivity): 0.56
 True Negative Rate (Specificity): 0.82


© The Numerical Algorithms Group Ltd, Oxford, UK. 2009–2015