Quadratic Programming and Penalized Regression

Quadratic programming is a versatile tool for calculating estimates in penalized regression. It can be used to produce estimates based on $L_1$ roughness penalties, as in total variation denoising. In particular, it can calculate estimates when the roughness penalty is the total variation of a derivative of the estimate. Combining two roughness penalties, the total variation and the total variation of the third derivative, results in an estimate that has a continuous second derivative while controlling the number of spurious local extreme values. A multiresolution criterion may be included in a quadratic program to achieve local smoothing without having to specify smoothing parameters.


Introduction
The problem of regression remains an important aspect of statistics. It is an appropriate way to make inferences about relationships between variables. Nonparametric regression is necessary if the relationship cannot be explained by a small number of parameters. Even the most straightforward case of nonparametric regression between two continuous variables is not an easy problem, and remains far from being "solved". There are many different methods for nonparametric regression and each behaves differently in different situations. Three broad categories are: kernel smoothing, wavelets, and penalized regression.
We will focus on nonparametric regression between continuous one-dimensional response and explanatory variables. All of the methods described below can be extended to the case of multi-dimensional explanatory variables. Given a set of response observations $y_1, \ldots, y_n$ corresponding to ordered observations $t_1 < \cdots < t_n$, thought to have been generated by the signal plus noise model
$$y_i = f(t_i) + \varepsilon_i, \qquad i = 1, \ldots, n,$$
we wish to find an estimate, $\hat f$, of the functional relationship, $f$, sometimes called the signal function, between the explanatory variable $t$ and the response variable $y$. The noise terms $\varepsilon_i$, for $i = 1, \ldots, n$, are realizations of independent random variables with zero mean.
When approaching data for the first time, it is a good idea to try several different methods but not practical to employ every one. It is more convenient to try a subset of methods with different properties, and one such subset is penalized regression. This is an umbrella term for methods that add a penalty term, which quantifies the smoothness or sparsity of the estimated signal function, to the residual sum of squares. A penalized regression estimate is found by minimizing, over all possible functions $\hat f : \{t_1, \ldots, t_n\} \to \mathbb{R}$,
$$\sum_{i=1}^{n} \bigl(y_i - \hat f(t_i)\bigr)^2 + \lambda P(\hat f).$$
This is the sum of the residual sum of squares, which represents the distance between observation and estimate, and a roughness penalty, $P$, which is parameterized by a smoothing parameter, $\lambda$.
The smoothing parameter controls the tradeoff between distance and roughness. The quality of the estimate is affected by the nature of the roughness penalty and the norm on which it is based. Below, we will show that quadratic programming can calculate penalized regression estimates with a variety of roughness penalties, in particular total variation denoising estimates, which use an $L_1$ penalty. A quadratic program is a vector optimization procedure that seeks to minimize a quadratic term subject to a set of linear constraints. A general form for a quadratic program is
$$\min_{x}\ \tfrac{1}{2} x^T G x - d^T x \quad \text{subject to} \quad A^T x \ge b, \tag{1}$$
where $d$ is a vector with the same dimension as $x$, the Hessian $G$ is a square, symmetric matrix with number of rows equal to the dimension of $x$, the matrix $A$ has the same number of rows as $G$, and the vector $b$, together with $A$, encodes constraints on $x$. This can be solved in finite time by, for example, the numerically stable active set algorithm of Goldfarb and Idnani (1983), which is probably the most frequently implemented algorithm for solving quadratic programs and is therefore used in the examples below. In many forms of penalized regression the estimate of the signal function is defined by its values at the explanatory observations, so it is sufficient to calculate $f_i = \hat f(t_i)$ for $i = 1, \ldots, n$.
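As a concrete illustration, the general form (1) can be solved with an off-the-shelf optimizer. The sketch below uses SciPy's SLSQP routine as a stand-in for the Goldfarb and Idnani active set algorithm (which SciPy does not provide); the function name `solve_qp` and the toy problem are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def solve_qp(G, d, A, b):
    """Solve min 1/2 x'Gx - d'x subject to A'x >= b, the general form (1),
    using SciPy's SLSQP solver as a stand-in for a dedicated QP routine."""
    n = len(d)
    obj = lambda x: 0.5 * x @ G @ x - d @ x
    jac = lambda x: G @ x - d
    cons = [{"type": "ineq", "fun": lambda x: A.T @ x - b,
             "jac": lambda x: A.T}]
    res = minimize(obj, np.zeros(n), jac=jac, constraints=cons, method="SLSQP")
    return res.x

# Toy problem: the unconstrained minimum is G^{-1} d = (1, 3), but the
# constraint x1 + x2 <= 2 (written as -x1 - x2 >= -2) is active, so the
# solution is the projection (0, 2).
G = 2.0 * np.eye(2)
d = np.array([2.0, 6.0])
A = np.array([[-1.0], [-1.0]])   # columns of A are constraint normals
b = np.array([-2.0])
x_star = solve_qp(G, d, A, b)    # approximately [0., 2.]
```

The same interface accepts any number of constraint columns, which is how the penalty and multiresolution constraints below enter the program.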

Total Variation Denoising with Higher-order Penalties
Total variation denoising is a type of penalized regression in which the roughness penalty is equal to the total variation of the estimated signal function. The total variation of a function $f$ on the interval $[t_1, t_n]$ is defined as
$$TV(f) = \sup_{t_1 \le s_1 < \cdots < s_m \le t_n} \sum_{j=1}^{m-1} \bigl|f(s_{j+1}) - f(s_j)\bigr|.$$
When this is used as a roughness penalty the resulting estimate is a piecewise-constant function that changes value only at points in a subset of $\{t_1, \ldots, t_n\}$. Therefore, the total variation of the estimated signal function may be expressed as $\sum_{i=1}^{n-1} |f_{i+1} - f_i|$. Since it employs the $L_1$ norm in the roughness penalty, total variation denoising may be thought of as a nonparametric version of the lasso estimator (Tibshirani, 1996), and it hence leads to sparse estimates. When approaching new data it is possible that the underlying signal function has a very simple form; sparse estimates can more easily reveal this. Furthermore, this penalty also reduces the number of local extreme values in the estimate (Davies and Kovac, 2001). The total variation penalty may be extended to multi-dimensional explanatory variables, for instance in image denoising. This was first put forward by Rudin et al. (1992), and fast algorithms for specific penalties were developed by Chambolle (2004) and Kovac and Smith (2011). This roughness penalty is readily extendable and the resulting estimates may be calculated by quadratic programming.
With a one-dimensional explanatory variable, the total variation penalty can be generalized in two ways: by extending it to higher derivatives of the estimate, and by adding local smoothing parameters. We will construct a roughness penalty, $P_{TV}^p$, that is proportional to the total variation of the $p$th derivative of the estimate. The amount of penalization can be varied in different locations by using a vector of local smoothing parameters $\lambda = (\lambda_1, \ldots, \lambda_{n-p-1})^T$. The generalized total variation roughness penalty is
$$P_{TV}^p(\hat f) = \sum_{i=1}^{n-p-1} \lambda_i \bigl|(D_{n-p} \Delta_p f)_i\bigr|,$$
where $D_m$ denotes the first-difference matrix with $m-1$ rows and $m$ columns, so that $D_n$ is a matrix with $n-1$ rows and $n$ columns, and $\Delta_p$ is the matrix of $p$th-order divided differences, with $\Delta_p f$ approximating the $p$th derivative of $\hat f$ and $\Delta_0 = I_n$. To include this roughness penalty in a quadratic program, we must introduce a vector $v = (v_1, \ldots, v_{n-p-1})^T$ of dummy variables. If $v$ does not appear anywhere else in the quadratic program, then $P_{TV}^p$ may be included in the quadratic program as $\lambda^T v$ subject to the linear constraints $-v \le D_{n-p} \Delta_p f \le v$. The residual sum of squares can also be generalized by including individual non-negative weights on each observation, contained in the vector $w = (w_1, \ldots, w_n)^T$. The generalized sum of squares is $\sum_{i=1}^{n} w_i (y_i - f_i)^2$, which is equivalent to $(y - f)^T W (y - f)$, where $y = (y_1, \ldots, y_n)^T$ and $W = \operatorname{diag}(w)$. This is easily included in a quadratic program. Therefore, the estimate may be calculated with a quadratic program (1), where
$$x = \begin{pmatrix} f \\ v \end{pmatrix}, \qquad G = \begin{pmatrix} 2W & 0 \\ 0 & 0 \end{pmatrix}, \qquad d = \begin{pmatrix} 2Wy \\ -\lambda \end{pmatrix},$$
and the columns of $A$, together with $b = 0$, encode the constraints $v \ge \pm D_{n-p} \Delta_p f$. Quadratic program algorithms typically require $G$ to be positive definite, which does not hold in the quadratic program above. We can circumvent this problem by adding a small value to the diagonal entries of $G$, for instance $1/1000$ times the absolute value of the smallest non-zero element of $G$. This does not have a discernible effect on the outcome in any of the examples given below. Alternatively, the quadratic program (1) has a dual formulation (Nocedal and Wright, 1999) and in some cases the dual Hessian matrix is positive definite. When $p = 0$, the dual formulation for the quadratic program above is equivalent to the taut string algorithm (Davies and Kovac, 2001), which is a special case of regression on a graph (Kovac and Smith, 2011).
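A minimal sketch of this construction for $p = 0$ follows, solving the dummy-variable formulation with SciPy's general-purpose SLSQP solver rather than a dedicated quadratic programming routine; the function `tv_denoise_qp` and the toy dataset are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def tv_denoise_qp(y, lam):
    """Total variation denoising (p = 0) as a quadratic program:
    variables x = (f, v), objective (y-f)'(y-f) + lam * sum(v),
    subject to v >= D f and v >= -D f, so v_i >= |f_{i+1} - f_i|."""
    n = len(y)
    D = np.diff(np.eye(n), axis=0)        # first-difference matrix, (n-1) x n

    def obj(x):
        f, v = x[:n], x[n:]
        return np.sum((y - f) ** 2) + lam * np.sum(v)

    cons = [
        {"type": "ineq", "fun": lambda x: x[n:] - D @ x[:n]},   # v >=  D f
        {"type": "ineq", "fun": lambda x: x[n:] + D @ x[:n]},   # v >= -D f
    ]
    x0 = np.concatenate([y, np.abs(D @ y)])   # feasible starting point
    res = minimize(obj, x0, constraints=cons, method="SLSQP")
    return res.x[:n]

# Noisy step function: the estimate should be close to piecewise constant,
# with a total variation no larger than that of the raw data.
rng = np.random.default_rng(0)
y = np.concatenate([np.zeros(10), np.ones(10)]) + 0.05 * rng.standard_normal(20)
f_hat = tv_denoise_qp(y, lam=0.5)
```

At the optimum the dummy variables satisfy $v_i = |f_{i+1} - f_i|$, since the objective pushes each $v_i$ down onto whichever constraint is active.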
Figure 1 shows a sequence of estimates calculated using the quadratic program above. The example datasets were generated by the Blocks and Doppler functions (Donoho and Johnstone, 1994) with 500 observations in each dataset. The explanatory observations were fixed at $t_i = i/500$, for $i = 1, \ldots, 500$, and Gaussian noise, with zero mean, was added to create noisy response variables. The standard deviation of the noise is 0.5 in the Blocks data and 0.05 in the Doppler data. Since the standard deviation is constant, it is appropriate to use equal weights on each observation. It is known (Mammen and van de Geer, 1997) that an estimate based on the roughness penalty $P_{TV}^p$ will be a piecewise polynomial with continuous $(p-1)$th derivative. When $p = 0$, it can be seen that the estimate is piecewise constant, which is appropriate for the Blocks data as the signal function is piecewise constant. However, for the Doppler data, in which the signal function is smooth, the estimate "staircases" in an undesirable manner. When $p \ge 2$, the estimates appear much smoother and have continuous derivatives, which is much more appropriate for the Doppler data. However, the Blocks estimates exhibit additional, unwanted local extreme values.

Combining $L_1$ Penalties
The different qualities of different total variation roughness penalties have advantages and disadvantages. When $p = 0$, the roughness penalty is the total variation itself, which reduces the number of additional, unwanted local extreme values. However, the estimate is piecewise constant and therefore does not have a smooth appearance. When $p = 3$, the roughness penalty is the total variation of the third derivative, so the estimated signal function is piecewise cubic and may be considered to be smooth, as it has the same degree of differentiability as a smoothing spline estimate. However, this estimate often exhibits additional local extreme values. If we have to choose one roughness penalty then we are forced to choose between these opposing advantages. However, it is possible to combine both advantages by combining two roughness penalties. Including the total variation prevents additional local extremes, provided the associated smoothing parameter is not too small. Additionally, including the total variation of the third derivative causes the estimate to have a continuous second derivative, again provided the associated smoothing parameter is not too small. The choice of smoothing parameter is discussed in Sec. 4. A similar approach was used by Aldrin (2004) in the case of semiparametric regression. Therefore, we seek to minimize
$$\sum_{i=1}^{n} w_i (y_i - f_i)^2 + P_{TV}^0(\hat f) + P_{TV}^3(\hat f).$$
There are separate vectors of smoothing parameters: $\lambda = (\lambda_1, \ldots, \lambda_{n-1})^T$ for $P_{TV}^0$ and $\mu = (\mu_1, \ldots, \mu_{n-4})^T$ for $P_{TV}^3$. As before, this minimization problem can be written as a constrained optimization problem and hence as a quadratic program. In order to combine two roughness penalties we require two vectors of dummy variables. The first, $v$, has dimension $n - 1$ and the second, $v'$, has dimension $n - 4$. The estimate that we seek can be found by minimizing the quadratic program (1) with $x = (f^T, v^T, v'^T)^T$ and with constraints $v \ge \pm D_n f$ and $v' \ge \pm D_{n-3} \Delta_3 f$. Examples of this quadratic program, using the Blocks and Doppler datasets seen in Sec. 2, are given in Fig. 2.
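The stacked constraint system for the combined penalty can be sketched as follows, assuming equally spaced explanatory observations so that divided differences reduce to ordinary differences; `combined_tv_constraints` is a hypothetical helper used here only to check the dimensions of the construction.

```python
import numpy as np

def combined_tv_constraints(n):
    """Build A' for the combined penalty P_TV^0 + P_TV^3, with variables
    x = (f, v, v'): v has n-1 entries bounding |D1 f|, and v' has n-4
    entries bounding the fourth-order differences (the first differences
    of third differences, for equally spaced points)."""
    I = np.eye(n)
    D1 = np.diff(I, axis=0)          # (n-1) x n: first differences
    D4 = np.diff(I, n=4, axis=0)     # (n-4) x n: D_{n-3} Delta_3 for equal spacing
    Z1 = np.zeros((n - 1, n - 4))
    Z2 = np.zeros((n - 4, n - 1))
    # Rows encode v >= D1 f, v >= -D1 f, v' >= D4 f, v' >= -D4 f,
    # all written in the form A'x >= 0.
    At = np.block([
        [-D1, np.eye(n - 1), Z1],
        [ D1, np.eye(n - 1), Z1],
        [-D4, Z2, np.eye(n - 4)],
        [ D4, Z2, np.eye(n - 4)],
    ])
    return At

At = combined_tv_constraints(10)
# 2(n-1) + 2(n-4) = 30 constraint rows on n + (n-1) + (n-4) = 25 variables
```

The objective is assembled exactly as in Sec. 2, with the dummy-variable block of $d$ now holding $(-\lambda^T, -\mu^T)^T$.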
The practical outcome is as good as the theory suggests: both estimates are smooth, due to the inclusion of $P_{TV}^3$ as a roughness penalty, but neither exhibits additional, spurious local extreme values, due to the inclusion of $P_{TV}^0$ as a roughness penalty.
Also of note is the estimator obtained by combining the residual sum of squares with the two roughness penalties $P_{TV}^p$ and $P_{TV}^{p+1}$. This is equivalent to a nonparametric version of the fused lasso (Tibshirani et al., 2005).

Involving $L_2$ Penalties
The estimates above use an L 1 penalty for roughness, as a nonparametric version of the lasso. We can also involve an L 2 penalty as well, expressing it, with local smoothing parameters = 1 n−p−1 T , as Using this roughness penalty alone is a nonparametric version of ridge regression. The minimizer of n i=1 w i y i − f i 2 + P RR p f may be found by calculating Alternatively, P RR p f can be easily added included in the quadratic program of Sec. 2 by replacing W with W + 2 T p D T n−p diag 2 D n−p p in the Hessian matrix G. Making this adjustment to the quadratic program in Sec. 2 will find an estimate that minimizes As it combines both L 1 and L 2 roughness penalties, this is a nonparametric version of the elastic net (Zou and Hastie, 2005).

Multiresolution
It is necessary to make an appropriate choice of smoothing parameter, and there are several techniques for this. Commonly used techniques designed for the selection of a single, global smoothing parameter are computationally intensive when generalized to local smoothing parameters. The multiresolution criterion (Davies and Kovac, 2001) can be used to aid the choice of global or local smoothing parameters, without increased complexity for local smoothing. It acts as a test on the residuals $y_i - f_i$ and judges whether they contain any systematic deviations from zero. The residuals are estimates of the noise terms, $\varepsilon_i$; therefore, if the residuals are considered to behave like noise then the signal plus noise model is valid and the values $f_i$ are appropriate estimates of $f(t_i)$. If the residuals are considered to contain some of the signal function then the estimate must be too far from the observations and hence the smoothing parameter must be too large. Multiresolution tests sums of residuals on intervals at different scales and in different locations, and can therefore discern specific intervals in which the smoothing parameter is too large. Through this approach it is possible to choose local smoothing parameters that lead to appropriate, but locally varying, amounts of smoothing in different parts of the estimate. When the noise terms $\varepsilon_i$ are assumed to be independent realisations of a Gaussian random variable with zero mean and constant variance $\sigma^2$, the multiresolution criterion is
$$\frac{\bigl|\sum_{i=l}^{m} w_i (y_i - f_i)\bigr|}{\sqrt{\sum_{i=l}^{m} w_i^2}} \le \hat\sigma \sqrt{2 \log n}, \tag{2}$$
where $l = 2^{J-j}(k-1) + 1$ and $m = \min(2^{J-j} k, n)$, for $j = 0, \ldots, J$, where $J = \lceil \log_2 n \rceil$ and $k = 1, \ldots, \lceil 2^{j-J} n \rceil$. If $\sum_{i=l}^{m} w_i^2 = 0$ then the left-hand side of (2) is considered to be 0. This requires an estimate of the noise variance that is independent of the residuals.
One such estimate is based on the median absolute deviation (MAD) of consecutive differences:
$$\hat\sigma = \frac{\operatorname{median}\bigl(|y_2 - y_1|, \ldots, |y_n - y_{n-1}|\bigr)}{\sqrt{2}\,\Phi^{-1}(0.75)}.$$
Typically, the multiresolution criterion is implemented with an iterative procedure (Davies and Kovac, 2001; Davies and Meise, 2008); however, quadratic programming allows a straightforward implementation that makes it possible to avoid specifying a smoothing parameter but still apply an appropriate amount of smoothing. As the multiresolution criterion (2) is a system of linear inequalities, it can be directly incorporated into the quadratic program. For $j = 0, \ldots, J$ and $k = 1, \ldots, \lceil 2^{j-J} n \rceil$, we append to the columns of $A$ the vectors $W c_{j,k}$ and $-W c_{j,k}$ (padded with zeros in the rows corresponding to any dummy variables), and append to $b$ the values
$$c_{j,k}^T W y - \hat\sigma \sqrt{2 \log n}\, \lVert W c_{j,k} \rVert \quad \text{and} \quad -c_{j,k}^T W y - \hat\sigma \sqrt{2 \log n}\, \lVert W c_{j,k} \rVert,$$
where $c_{j,k}$ is a vector of appropriate dimension with $i$th element equal to 1 if $2^{J-j}(k-1)+1 \le i \le \min(2^{J-j} k, n)$ and 0 otherwise. To obtain as smooth a fitted function as possible we provide the quadratic program with a large global smoothing parameter, such that the estimate is a straight line in the absence of the multiresolution constraints. In the presence of the linear constraints the estimate will be the smoothest function that satisfies (2). In the case of combined total variation penalties (Sec. 3.1), we provide two global smoothing parameters $\lambda$ and $\mu$, and have to choose the ratio $\mu/\lambda$, specifying how much we favour a smooth estimate over controlling local extremes. This should be chosen within a range that ensures that the estimate will be a straight line in the absence of the multiresolution constraints. The specific value of the ratio may be chosen by cross-validation. Alternatively, a pilot estimate may be found using total variation denoising, with $p = 0$, and built-in multiresolution. The resulting estimate is piecewise constant but may be assumed to have the correct number of local extreme values. A pragmatic approach to setting the value of $\mu/\lambda$ is to choose a ratio that gives the same number of local extreme values.
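The MAD estimate and the multiresolution test can be sketched directly. The dyadic interval scheme below follows the index definitions above, but `mad_sigma` and `multires_ok` are illustrative helpers, not the paper's code.

```python
import numpy as np

def mad_sigma(y):
    """Estimate the noise s.d. from consecutive differences via the MAD.
    The divisor sqrt(2) * 0.6745 makes the estimate consistent for Gaussian
    noise (0.6745 is the standard-normal third quartile, Phi^{-1}(0.75))."""
    return np.median(np.abs(np.diff(y))) / (np.sqrt(2) * 0.6745)

def multires_ok(y, f, w=None):
    """Check criterion (2) on dyadic intervals: for every interval [l, m],
    |sum w_i (y_i - f_i)| / sqrt(sum w_i^2) <= sigma_hat * sqrt(2 log n)."""
    n = len(y)
    w = np.ones(n) if w is None else np.asarray(w, float)
    r = w * (y - f)
    bound = mad_sigma(y) * np.sqrt(2 * np.log(n))
    J = int(np.ceil(np.log2(n)))
    for j in range(J + 1):
        width = 2 ** (J - j)
        for k in range(1, int(np.ceil(n / width)) + 1):
            l, m = width * (k - 1), min(width * k, n)   # 0-based slice [l, m)
            denom = np.sqrt(np.sum(w[l:m] ** 2))
            if denom > 0 and abs(r[l:m].sum()) / denom > bound:
                return False
    return True

rng = np.random.default_rng(2)
y = 0.1 * rng.standard_normal(256)
ok = multires_ok(y, y.copy())            # True: zero residuals trivially pass
bad = multires_ok(y, np.full(256, 5.0))  # False: residuals contain clear signal
```

In the quadratic program these checks become the pairs of linear constraints described above, one pair per interval, rather than a post-hoc test.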
In practice, the specific value of this ratio does not seem to be important; values in $[0.5, 2]$ give no discernible difference for the examples here. Figure 3 shows two examples of estimates calculated using a quadratic program with built-in multiresolution, for the datasets in Figures 1 and 2. Both estimates are smooth but neither exhibits additional, spurious local extreme values. Furthermore, the estimates were calculated without any outside specification of the smoothing parameters. It is possible to see the effects of local smoothing in the Doppler estimate: more smoothing has been applied to the right-hand side of the estimate than to the left-hand side. This is appropriate as the signal function is more slowly varying on the right-hand side than on the left-hand side.

Discussion
We have seen that it is possible to find estimates based on $L_1$ penalties without having to choose smoothing parameters, and that quadratic programming is a versatile tool for calculating such estimates. The most appropriate choice of penalty or penalties depends on the qualitative nature of the signal function. If this can be discerned beforehand then it will inform the choice of penalty. For instance, the discontinuities in the Blocks data above are evident from a scatterplot, hence it is appropriate to apply total variation as a roughness penalty. In contrast, a scatterplot of the Doppler data shows the signal function is continuous, and therefore it is more appropriate to apply a higher-order total variation penalty. Quadratic programming allows these different estimates with different qualities to be calculated within the same algorithmic framework.
If the choice of penalty is not clear beforehand, perhaps due to increased noise in the data, then the combined penalty approach of Sec. 3.1 may be used: if the signal function is smooth it can be approximated by a smooth estimate, but if it has discontinuities it can be estimated without introducing spurious local extreme values. This should only be used if there is no clear information available about the nature of the signal function, which is likely to be a rare occurrence.
Estimates based on $L_2$ penalties, such as ridge regression and the elastic net, can also be incorporated into the quadratic program. Other penalties, such as $L_0$, may require a different algorithmic approach. Rippe et al. (2012) used a roughness penalty based on the $L_q$ norm, where $q \in [0, 1]$, for estimating segmentation (changes in the ratio of alleles) in genomic applications. When $q = 0$, this estimate may be calculated iteratively without using quadratic programming algorithms.
Unfortunately, the versatility of quadratic programming must be traded against its computational complexity. The number of calculations performed during the algorithm of Goldfarb and Idnani (1983), for the quadratic program described above, will be $O(n^3)$. The taut string algorithm (Davies and Kovac, 2001) can find an estimate in $O(n)$ calculations, but only when the roughness penalty is $P_{TV}^0$. It would therefore be beneficial, as further work, to develop a faster algorithm specifically designed for the quadratic programs above.