Note:this function usesoptional parametersto define choices in the problem specification and in the details of the algorithm. If you wish to use default settings for all of the optional parameters, you need only read Sections 1 to 10 of this document. If, however, you wish to reset some or all of the settings please refer to Section 11 for a detailed description of the algorithm, to Section 12 for a detailed description of the specification of the optional parameters and to Section 13 for a detailed description of the monitoring information produced by the function.
nag_regsn_quant_linear (g02qgc) performs a multiple linear quantile regression. Parameter estimates and, if required, confidence limits, covariance matrices and residuals are calculated. nag_regsn_quant_linear (g02qgc) may be used to perform a weighted quantile regression. A simplified interface for nag_regsn_quant_linear (g02qgc) is provided by nag_regsn_quant_linear_iid (g02qfc).
Given a vector of observed values,
, an design matrix , a column vector, , of length holding the th row of and a quantile , nag_regsn_quant_linear (g02qgc) estimates the -element vector as the solution to
where is the piecewise linear loss function , and is an indicator function taking the value if and otherwise. Weights can be incorporated by replacing and with and respectively, where is an diagonal matrix. Observations with zero weights can either be included or excluded from the analysis; this is in contrast to least squares regression where such observations do not contribute to the objective function and are therefore always dropped.
nag_regsn_quant_linear (g02qgc) uses the interior point algorithm of Portnoy and Koenker (1997), described briefly in Section 11, to obtain the parameter estimates , for a given value of .
Under the assumption of Normally distributed errors, Koenker (2005) shows that the limiting covariance matrix of has the form
where and is a function of , as described below. Given an estimate of the covariance matrix, , lower () and upper () limits for an confidence interval can be calculated for each of the parameters, via
where is the percentile of the Student's distribution with degrees of freedom, where is the rank of the cross-product matrix .
Four methods for estimating the covariance matrix, , are available:
Independent, identically distributed (IID) errors
Under an assumption of IID errors the asymptotic relationship for simplifies to
where is the sparsity function. nag_regsn_quant_linear (g02qgc) estimates from the residuals,
and a bandwidth .
Powell (1991) suggested estimating the matrix by a kernel estimator of the form
where is a kernel function and satisfies and . When the Powell method is chosen, nag_regsn_quant_linear (g02qgc) uses a Gaussian kernel (i.e., ) and sets
where is a bandwidth, and are, respectively, the standard deviation and the and quantiles for the residuals, .
where is a bandwidth and denotes the parameter estimates obtained from a quantile regression using the th quantile. Similarly with .
The last method uses bootstrapping to either estimate a covariance matrix or obtain confidence intervals for the parameter estimates directly. This method therefore does not assume Normally distributed errors. Samples of size are taken from the paired data (i.e., the independent and dependent variables are sampled together). A quantile regression is then fitted to each sample resulting in a series of bootstrap estimates for the model parameters, . A covariance matrix can then be calculated directly from this series of values. Alternatively, confidence limits, and , can be obtained directly from the and sample quantiles of the bootstrap estimates.
Further details of the algorithms used to calculate the covariance matrices can be found in Section 11.
All three asymptotic estimates of the covariance matrix require a bandwidth, . Two alternative methods for determining this are provided:
for a user-supplied value ,
nag_regsn_quant_linear (g02qgc) allows optional arguments to be supplied via the iopts and opts arrays (see Section 12 for details of the available options).
If the default values for these optional arguments are sufficient then iopts and opts can be set to NULL, otherwise prior
to calling nag_regsn_quant_linear (g02qgc) the optional parameter arrays,
must be initialized by calling nag_g02_opt_set (g02zkc) with optstr set to . If bootstrap confidence limits are required () then one of the random number initialization functions nag_rand_init_repeatable (g05kfc) (for a repeatable analysis) or nag_rand_init_nonrepeatable (g05kgc) (for an unrepeatable analysis) must also have been previously called.
Koenker R (2005) Quantile Regression Econometric Society Monographs, Cambridge University Press, New York
Mehrotra S (1992) On the implementation of a primal-dual interior point method SIAM J. Optim.2 575–601
Nocedal J and Wright S J (2006) Numerical Optimization (2nd Edition) Springer Series in Operations Research, Springer, New York
Portnoy S and Koenker R (1997) The Gaussian hare and the Laplacian tortoise: computability of squared-error versus absolute error estimators Statistical Science4 279–300
Powell J L (1991) Estimation of monotonic regression models under quantile restrictions Nonparametric and Semiparametric Methods in Econometrics Cambridge University Press, Cambridge
On entry: the order argument specifies the two-dimensional storage scheme being used, i.e., row-major ordering or column-major ordering. C language defined storage is specified by . See Section 220.127.116.11 in How to Use the NAG Library and its Documentation for a more detailed explanation of the use of this argument.
On entry: indicates whether an intercept will be included in the model. The intercept is included by adding a column of ones as the first column in the design matrix, .
An intercept will be included in the model.
An intercept will not be included in the model.
On entry: the total number of observations in the dataset. If no weights are supplied, or no zero weights are supplied or observations with zero weights are included in the model then . Otherwise the number of observations with zero weights.
On entry: , the total number of variates in the dataset.
– const doubleInput
Note: the dimension, dim, of the array dat
must be at least
Where appears in this document, it refers to the array element
On entry: the
th value for the th variate, for and , must be supplied in
On entry: , the number of independent variables in the model, including the intercept, see intcpt, if present.
if , ;
if , .
– const doubleInput
On entry: , the observations on the dependent variable.
– const doubleInput
On entry: optionally, the diagonal elements of the weight matrix .
If weights are not provided then wt must be set to NULL.
If , the th observation is not included in the model, in which case the effective number of observations, , is the number of observations with nonzero weights. If , the values of res will be set to zero for observations with zero weights.
All observations are included in the model and the effective number of observations is n, i.e., .
the effective number of observations ;
, for all .
On entry: the number of quantiles of interest.
– const doubleInput
On entry: the vector of quantiles of interest. A separate model is fitted to each quantile.
The method used for calculating the interval is controlled by the optional parameters and . The size of the interval, is controlled by the optional parameter .
Note: the dimension, dim, of the array ch
must be at least
, or and
Where appears in this document, it refers to the array element .
On exit: depending on the supplied optional parameters, ch will either not be referenced, hold an estimate of the upper triangular part of the covariance matrix, , or an estimate of the upper triangular parts of and .
If , holds an estimate of the covariance between and .
If , holds an estimate of the th element of and holds an estimate of the th element of , for .
The method used for calculating and is controlled by the optional parameter .
In cases where ch is not going to be referenced it can be set to NULL.
Note: the th element of the matrix is stored in .
On exit: if ,
holds the (weighted) residuals, , for , for and .
If and , the value of res will be set to zero for observations with zero weights.
If , res is not referenced and can be set to NULL.
– const IntegerCommunication Array
Note: the dimension, , of this array is dictated by the requirements of associated functions that must have been previously called. This array MUST be the same array passed as argument iopts in the previous call to nag_g02_opt_set (g02zkc).
On entry: if the default values of the optional arguments are sufficient, iopts can be set to NULL, otherwise theoptional parameter array, as initialized by a call to nag_g02_opt_set (g02zkc) must be supplied.
– const doubleCommunication Array
Note: the dimension, , of this array is dictated by the requirements of associated functions that must have been previously called. This array MUST be the same array passed as argument opts in the previous call to nag_g02_opt_set (g02zkc).
On entry: if the default values of the optional arguments are sufficient, opts can be set to NULL, otherwise theoptional parameter array, as initialized by a call to nag_g02_opt_set (g02zkc) must be supplied.
If , state contains information about the selected random number generator. Otherwise state is not referenced and can be set to NULL.
On exit: holds additional information concerning the model fitting and confidence limit calculations when .
Model fitted and confidence limits (if requested) calculated successfully
The function did not converge. The returned values are based on the estimate at the last iteration. Try increasing whilst calculating the parameter estimates or relaxing the definition of convergence by increasing .
A singular matrix was encountered during the optimization. The model was not fitted for this value of .
Some truncation occurred whilst calculating the confidence limits for this value of . See Section 11 for details. The returned upper and lower limits may be narrower than specified.
The function did not converge whilst calculating the confidence limits. The returned limits are based on the estimate at the last iteration. Try increasing .
Confidence limits for this value of could not be calculated. The returned upper and lower limits are set to a large positive and large negative value respectively as defined by the optional parameter .
It is possible for multiple warnings to be applicable to a single model. In these cases the value returned in info is the sum of the corresponding individual nonzero warning codes.
– NagError *Input/Output
The NAG error argument (see Section 3.7 in How to Use the NAG Library and its Documentation).
Error Indicators and Warnings
Dynamic memory allocation failed.
See Section 18.104.22.168 in How to Use the NAG Library and its Documentation for further information.
On entry, and .
On entry, and .
On entry, argument had an illegal value.
On entry, either the option arrays have not been initialized or they have been corrupted.
On entry, .
On entry, .
On entry, .
On entry, and .
On entry, .
Constraint: or for all .
An internal error has occurred in this function. Check the function call and any array sizes. If the call is correct then please contact NAG for assistance.
See Section 2.7.6 in How to Use the NAG Library and its Documentation for further information.
On entry, state vector has been corrupted or not initialized.
Your licence key may have expired or may not have been installed correctly.
See Section 2.7.5 in How to Use the NAG Library and its Documentation for further information.
On entry, .
On entry, is invalid.
A potential problem occurred whilst fitting the model(s). Additional information has been returned in info.
Parallelism and Performance
nag_regsn_quant_linear (g02qgc) is threaded by NAG for parallel execution in multithreaded implementations of the NAG Library.
nag_regsn_quant_linear (g02qgc) makes calls to BLAS and/or LAPACK routines, which may be threaded within the vendor library used by this implementation. Consult the documentation for the vendor library for further information.
Please consult the x06 Chapter Introduction for information on how to control and interrogate the OpenMP environment used within this function. Please also consult the Users' Note for your implementation for any additional implementation-specific information.
nag_regsn_quant_linear (g02qgc) allocates internally approximately the following elements of double storage:
. If then a further
elements are required, and this increases by
if . Where possible, any user-supplied output arrays are used as workspace and so the amount actually allocated may be less. If , , and an internal copy of the input data is avoided and the amount of locally allocated memory is reduced by .
A quantile regression model is fitted to Engels 1857 study of household expenditure on food. The model regresses the dependent variable, household food expenditure, against two explanatory variables, a column of ones and household income. The model is fit for five different values of and the covariance matrix is estimated assuming Normal IID errors. Both the covariance matrix and the residuals are returned.
By the addition of slack variables the minimization (1) can be reformulated into the linear programming problem
and its associated dual
where is a vector of s. Setting gives the equivalent formulation
The algorithm introduced by Portnoy and Koenker (1997) and used by nag_regsn_quant_linear (g02qgc), uses the primal-dual formulation expressed in equations (2) and (4) along with a logarithmic barrier function to obtain estimates for . The algorithm is based on the predictor-corrector algorithm of Mehrotra (1992) and further details can be obtained from Portnoy and Koenker (1997) and Koenker (2005). A good description of linear programming, interior point algorithms, barrier functions and Mehrotra's predictor-corrector algorithm can be found in Nocedal and Wright (2006).
Interior Point Algorithm
In this section a brief description of the interior point algorithm used to estimate the model parameters is presented. It should be noted that there are some differences in the equations given here – particularly (7) and (9) – compared to those given in Koenker (2005) and Portnoy and Koenker (1997).
Rather than optimize (4) directly, an additional slack variable is added and the constraint is replaced with , for .
The positivity constraint on and is handled using the logarithmic barrier function
The primal-dual form of the problem is used giving the Lagrangian
whose central path is described by the following first order conditions
where denotes the diagonal matrix with diagonal elements given by , similarly with and . By enforcing the inequalities on and strictly, i.e., and for all we ensure that and are positive definite diagonal matrices and hence and exist.
Rather than applying Newton's method to the system of equations given in (5) to obtain the step directions and , Mehrotra substituted the steps directly into (5) giving the augmented system of equations
where and denote the diagonal matrices with diagonal elements given by and respectively.
Affine scaling step
The affine scaling step is constructed by setting in (5) and applying Newton's method to obtain an intermediate set of step directions
Initial step sizes for the primal () and dual () parameters are constructed as
where is a user-supplied scaling factor. If
then the nonlinearity adjustment, described in Section 11.1.3, is not made and the model parameters are updated using the current step size and directions.
In the nonlinearity adjustment step a new estimate of is obtained by letting
and estimating as
This estimate, along with the nonlinear terms (, , and ) from (6) are calculated using the values of
and obtained from the affine scaling step.
Given an updated estimate for and the nonlinear terms the system of equations
are solved and updated values for
Update and convergence
At each iteration the model parameters
are updated using step directions,
and step lengths
Convergence is assessed using the duality gap, that is, the differences between the objective function in the primal and dual formulations. For any feasible point
the duality gap can be calculated from equations (2) and (3) as
and the optimization terminates if the duality gap is smaller than the tolerance supplied in the optional parameter .
Initial values are required for the parameters and . If not supplied by the user, initial values for are calculated from a least squares regression of on . This regression is carried out by first constructing the cross-product matrix and then using a pivoted decomposition as performed by nag_dgeqp3 (f08bfc). In addition, if the cross-product matrix is not of full rank, a rank reduction is carried out and, rather than using the full design matrix, , a matrix formed from the first -rank columns of is used instead, where is the pivot matrix used during the decomposition. Parameter estimates, confidence intervals and the rows and columns of the matrices returned in the argument ch (if any) are set to zero for variables dropped during the rank-reduction. The rank reduction step is performed irrespective of whether initial values are supplied by the user.
Once initial values have been obtained for , the initial values for and are calculated from the residuals. If then a value of is used instead, where is supplied in the optional parameter . The initial values for the and are always set to and respectively.
nag_regsn_quant_linear (g02qgc) supplies four methods to calculate the covariance matrices associated with the parameter estimates for . This section gives some additional detail on three of the algorithms, the fourth, (which uses bootstrapping), is described in Section 3.
Independent, identically distributed (IID) errors
When assuming IID errors, the covariance matrices depend on the sparsity, , which nag_regsn_quant_linear (g02qgc) estimates as follows:
Let denote the residuals from the original quantile regression, that is
Drop any residual where is less than , supplied in the optional parameter .
Sort and relabel the remaining residuals in ascending order, by absolute value, so that
Select the first values where , for some bandwidth .
Sort and relabel these residuals again, so that
and regress them against a design matrix with two columns () and rows given by
using quantile regression with .
Use the resulting estimate of the slope as an estimate of the sparsity.
When using the Powell Sandwich to estimate the matrix , the quantity
is calculated. Dependent on the value of and the method used to calculate the bandwidth (), it is possible for the quantities to be too large or small, compared to machine precision (). More specifically, when , or , a warning flag is raised in info, the value is truncated to or respectively and the covariance matrix calculated as usual.
The Hendricks–Koenker Sandwich requires the calculation of the quantity
As with the Powell Sandwich, in cases where , or , a warning flag is raised in info, the value truncated to or respectively and the covariance matrix calculated as usual.
In addition, it is required that , in this method. Hence, instead of using
in the calculation of ,
is used instead, where is supplied in the optional parameter .
Several optional parameters in nag_regsn_quant_linear (g02qgc) control aspects of the optimization algorithm, methodology used, logic or output. Their values are contained in the arrays iopts and opts; these must be initialized before calling nag_regsn_quant_linear (g02qgc) by first calling nag_g02_opt_set (g02zkc) with optstr set to .
Keywords and character values are case and white space insensitive.
Band Width Alpha
A multiplier used to construct the parameter used when calculating the Sheather–Hall bandwidth (see Section 3), with . Here, is the .
Band Width Method
The method used to calculate the bandwidth used in the calculation of the asymptotic covariance matrix and if , or (see Section 3).
This parameter should be set to something larger than the biggest value supplied in dat and y.
Bootstrap Interval Method
If , controls how the confidence intervals are calculated from the bootstrap estimates.
intervals are calculated. That is, the covariance matrix, is calculated from the bootstrap estimates and the limits calculated as where is the percentage point from a Student's distribution on degrees of freedom, is the effective number of observations and is given by the optional parameter .
Quantile intervals are calculated. That is, the upper and lower limits are taken as the and quantiles of the bootstrap estimates, as calculated using nag_double_quantiles (g01amc).
The number of bootstrap samples used to calculate the confidence limits and covariance matrix (if requested) when .
If and , the parameter estimates for each of the bootstrap samples are displayed. This information is sent to the unit number specified by .
Calculate Initial Values
If then the initial values for the regression parameters, , are calculated from the data. Otherwise they must be supplied in b.
This special keyword is used to reset all optional parameters to their default values.
Drop Zero Weights
If a weighted regression is being performed and then observations with zero weight are dropped from the analysis. Otherwise such observations are included.
, the tolerance used when calculating the covariance matrix and the initial values for and . For additional details see Section 11.2 and Section 11.1.5 respectively.
The value of controls whether confidence limits are returned in bl and bu and how these limits are calculated. This parameter also controls how the matrices returned in ch are calculated.
No limits are calculated and bl, bu and ch are not referenced.
The Powell Sandwich method with a Gaussian kernel is used.
The Hendricks–Koenker Sandwich is used.
The errors are assumed to be identical, and independently distributed.
A bootstrap method is used, where sampling is done on the pair . The number of bootstrap samples is controlled by the parameter and the type of interval constructed from the bootstrap samples is controlled by .
, , , or .
The maximum number of iterations to be performed by the interior point optimization algorithm.
The value of controls the type of matrices returned in ch. If , this parameter is ignored and ch is not referenced. Otherwise:
No matrices are returned and ch is not referenced.
The covariance matrices are returned.
If or , the matrices and are returned. Otherwise no matrices are returned and ch is not referenced.
The matrices returned are calculated as described in Section 3, with the algorithm used specified by . In the case of the covariance matrix is calculated directly from the bootstrap estimates.
, or .
If then the duality gap is displayed at each iteration of the interior point optimization algorithm. In addition, the final estimates for are also displayed.
The monitoring information is sent to the unit number specified by .
The tolerance used to calculate the rank, , of the cross-product matrix, . Letting be the orthogonal matrix obtained from a decomposition of , then the rank is calculated by comparing with .
If the cross-product matrix is rank deficient, the parameter estimates for the columns with the smallest values of are set to zero, along with the corresponding entries in bl, bu and ch, if returned. This is equivalent to dropping these variables from the model. Details on the decomposition used can be found in nag_dgeqp3 (f08bfc).
If , the residuals are returned in res. Otherwise res is not referenced.
The scaling factor used when calculating the affine scaling step size (see equation (8)).
, the size of the confidence interval whose limits are returned in bl and bu.
Convergence tolerance. The optimization is deemed to have converged if the duality gap is less than (see Section 11.1.4).
Output sent to stdout
The unit number to which any monitoring information is sent. See nag_open_file (x04acc) for details on how to assign a file to a unit number. If no unit number is specified then any monitoring information will be sent to standard output (stdout).