J Optim Theory Appl (2014) 161:688–699 DOI 10.1007/s10957-013-0448-8

Damped Techniques for the Limited Memory BFGS Method for Large-Scale Optimization Mehiddin Al-Baali · Lucio Grandinetti · Ornella Pisacane

Received: 8 June 2013 / Accepted: 4 October 2013 / Published online: 26 October 2013 © Springer Science+Business Media New York 2013

Abstract  This paper aims to extend a certain damped technique, suitable for the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method, to the limited memory BFGS method for large-scale unconstrained optimization. It is shown that the proposed technique maintains the global convergence property of the limited memory BFGS method on uniformly convex functions. Some numerical results are described to illustrate the important role of the damped technique. Since this technique safely enforces the positive definiteness of the BFGS update for any value of the steplength, we also consider imposing only the first Wolfe–Powell condition on the steplength. Then, as in the backtracking framework, only one gradient evaluation is performed on each iteration. It is reported that the proposed damped methods work much better than the limited memory BFGS method in several cases.

Keywords  Large-scale optimization · The limited memory BFGS method · Damped technique · Line search framework

Communicated by David Hull.

Some results were presented by M. Al-Baali at the 40th Workshop on Large Scale Nonlinear Optimization, Erice, Italy, June 22–July 1, 2004.

M. Al-Baali
Department of Mathematics and Statistics, Sultan Qaboos University, Muscat, Oman
e-mail: [email protected]

L. Grandinetti (corresponding author)
Department of Computer Engineering, Electronics and Systems, University of Calabria, Rende, Italy
e-mail: [email protected]

O. Pisacane
Information Engineering Department, Polytechnic University of Marche, Ancona, Italy
e-mail: [email protected]


1 Introduction

Consider the limited memory quasi-Newton methods for minimizing a smooth function f(x) : R^n → R, under the hypothesis that n is large, so that a number of n-vectors can be stored but an n × n symmetric matrix cannot. To solve this large-scale optimization problem, we focus on certain modified Broyden–Fletcher–Goldfarb–Shanno (BFGS) methods which use values of the steplength that satisfy the Wolfe–Powell conditions (see for example [1]).

The main goal of this paper is to extend the recent damped (D-)technique [2], suitable for the BFGS method, which modifies the one described in [3] for constrained optimization, to the limited memory BFGS method [4] for large-scale unconstrained optimization. The resulting damped methods (referred to as L-D-BFGS) are defined in the same way as the limited memory BFGS ones (L-BFGS, for short), except for the BFGS updating formula. In fact, it is replaced by the D-BFGS formula [3], which keeps the Hessian approximations sufficiently positive definite. Indeed, for certain choices of the damped technique, it is shown in [5] that the performance of the D-BFGS method is substantially better than that of the standard BFGS method. Therefore, it is expected that some L-D-BFGS methods would work better than the 'undamped' L-BFGS ones in certain cases, as presented in [6]. This paper studies this conjecture by applying these methods to a set of standard large-scale optimization test problems. Some of these results and other experiments are described in Sect. 4, which shows that L-D-BFGS performs substantially better than L-BFGS in certain cases.

The other sections are organized as follows. Section 2 describes both the L-BFGS and L-D-BFGS methods with some of their features. In particular, it extends the global convergence result of [7], which the L-BFGS method has for convex functions, to the L-D-BFGS class of methods. Section 3 considers certain choices for the damped technique which work well in practice. Finally, Sect. 5 concludes the paper.

2 Limited Memory Methods

Any quasi-Newton method generates a sequence of points {x_k} iteratively by the line search formula

$$x_{k+1} = x_k + \alpha_k d_k, \tag{1}$$

where x_1 is given, α_k is a steplength parameter, and

$$d_k = -H_k g_k \tag{2}$$

is the search direction. Here, g_k denotes the gradient g(x_k) = ∇f(x_k) and H_k approximates the inverse Hessian [∇²f(x_k)]^{-1}. Initially, H_1 is given positive definite, and H_k is updated on every iteration in terms of the differences

$$s_k = x_{k+1} - x_k, \qquad y_k = g_{k+1} - g_k, \tag{3}$$


such that the quasi-Newton condition H_{k+1} y_k = s_k is satisfied. In the BFGS method, the updated Hessian approximation is given by

$$H_{k+1} = \mathrm{bfgs}^{-1}(H_k, s_k, y_k), \tag{4}$$

where, for any vectors s and y and for any symmetric matrix H, the function

$$\mathrm{bfgs}^{-1}(H, s, y) = \left(I - \frac{s y^T}{s^T y}\right) H \left(I - \frac{y s^T}{s^T y}\right) + \frac{s s^T}{s^T y} \tag{5}$$

defines the BFGS updating formula for the inverse Hessian approximation.
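To make formula (5) concrete, the following minimal Python/NumPy sketch applies the inverse BFGS update with an explicit dense matrix; the function name bfgs_inv is ours, and the dense form is for illustration only, since the limited memory methods below never store H explicitly.

```python
import numpy as np

def bfgs_inv(H, s, y):
    """Inverse BFGS update (5): returns bfgs^{-1}(H, s, y).

    H is a symmetric positive definite inverse Hessian approximation,
    and s, y are the step and gradient-difference vectors with s^T y > 0.
    """
    sty = s @ y
    if sty <= 0.0:
        raise ValueError("curvature condition s^T y > 0 violated")
    I = np.eye(len(s))
    V = I - np.outer(y, s) / sty           # V = I - y s^T / (s^T y), as in (10)
    return V.T @ H @ V + np.outer(s, s) / sty
```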

Since this formula maintains H positive definite if s^T y > 0, the BFGS method preserves the positive definiteness of the matrices {H_k} if the curvature condition s_k^T y_k > 0 holds. This condition is guaranteed if α_k is chosen such that the following Wolfe–Powell conditions hold:

$$f_k - f_{k+1} \ge \sigma_0 |s_k^T g_k|, \qquad s_k^T y_k \ge (1 - \sigma_1) |s_k^T g_k|, \tag{6}$$

where f_k denotes f(x_k), and σ_0 ∈ ]0, 0.5[ and σ_1 ∈ ]σ_0, 1[ are fixed parameters (for further details, see for example [1]). To fulfill the conditions in (6) with only a few gradient evaluations, σ_1 is chosen large (usually, σ_1 = 0.9). However, for a sufficiently large value of σ_1 < 1, these conditions may hold with s_k^T y_k close to zero. In addition, this scalar might become negative if only the first condition in (6) is used to accept values of α_k, as in the so-called backtracking line search (see for example [8]). Although this difficulty can be avoided by skipping the update so that H_{k+1} = H_k (see for example [8]), a drawback is that the Hessian approximation then remains unchanged. To circumvent these disadvantages, the following D-BFGS update is considered in [5]:

$$H_{k+1} = \mathrm{bfgs}^{-1}(H_k, s_k, \tilde y_k), \tag{7}$$

which is given by (4) with y_k replaced by the damped vector

$$\tilde y_k = \varphi_k y_k + (1 - \varphi_k) B_k s_k, \tag{8}$$

where B_k = H_k^{-1} and φ_k ∈ ]0, 1] is a parameter. This damped parameter is chosen such that s_k^T ỹ_k is sufficiently positive. Indeed, for certain choices of this parameter, it has been reported in [5] that the performance of the D-BFGS method is substantially better than that of the standard BFGS method.

Both the L-BFGS and the L-D-BFGS methods are defined like the above BFGS and D-BFGS methods, except for the representation of the inverse Hessian approximations, given by (4) and (7), respectively. This information is, in fact, reduced in such a way that the product in (2) is computed efficiently without storing H_k explicitly. The computation is described below for the L-BFGS method; for the L-D-BFGS method, the same process is used with y_k replaced by the damped vector ỹ_k.
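The damping step (7)–(8) amounts to blending y_k with B_k s_k. A small sketch follows, assuming the product B_k s_k is available (in the limited memory setting it would be obtained implicitly rather than from a stored matrix); the helper name damped_y and the assertion are illustrative.

```python
import numpy as np

def damped_y(s, y, Bs, phi):
    """Damped vector (8): y_tilde = phi*y + (1 - phi)*B*s with phi in ]0, 1].

    Bs stands for the product B_k s_k, where B_k = H_k^{-1}.
    """
    y_tilde = phi * y + (1.0 - phi) * Bs
    # Since s^T B s > 0, s^T y_tilde = phi*(s^T y) + (1 - phi)*(s^T B s)
    # stays sufficiently positive for a suitable choice of phi.
    assert s @ y_tilde > 0.0
    return y_tilde
```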


Replacing H_k in (4) by another implicitly stored matrix H_k^0, such that the product H_k^0 v can be computed for any vector v, the updated matrix H_{k+1} would be obtained by (5) in terms of H_k^0 and the vector pair {s_k, y_k} as follows:

$$H_{k+1} = \mathrm{bfgs}^{-1}(H_k^0, s_k, y_k) = V_k^T H_k^0 V_k + \rho_k s_k s_k^T, \tag{9}$$

where

$$V_k = I - \rho_k y_k s_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k}. \tag{10}$$

Hence, the product −H_{k+1} g_{k+1}, which defines the next search direction d_{k+1}, can be computed without storing H_{k+1} explicitly, since for any vector v we have V_k v = v − ρ_k (s_k^T v) y_k and V_k^T v = v − ρ_k (y_k^T v) s_k. In particular, if H_k^0 (referred to as a basic matrix) is given by the identity matrix I, which requires zero storage, we obtain the memoryless BFGS method (see for example [9]). Similarly, the choice H_k^0 = ν_k I, for a certain scalar ν_k, yields the L-BFGS method with least storage (see for example [9]). However, it is useful to increase the storage so that the information about the Hessian approximation is increased, as shown in the following way. Assuming H_{k−1} is replaced by H_k^0, formula (4) yields the 2-BFGS update in terms of H_k^0 and the two vector pairs {s_i, y_i}, i = k−1, k, given by

$$H_{k+1} = \mathrm{bfgs}^{-1}\bigl(\mathrm{bfgs}^{-1}(H_k^0, s_{k-1}, y_{k-1}), s_k, y_k\bigr) = V_k^T V_{k-1}^T H_k^0 V_{k-1} V_k + \rho_{k-1} V_k^T s_{k-1} s_{k-1}^T V_k + \rho_k s_k s_k^T. \tag{11}$$

In general, for m ≥ 1 (which is replaced by k whenever k < m), let H_{k−m+1} be replaced by H_k^0. Then formula (4) yields the m-BFGS update in terms of H_k^0 and the m most recent vector pairs {s_i, y_i}, i = k−m+1, ..., k, which requires 2mn storage locations, given by a sum of symmetric matrices of the structure

$$\begin{aligned}
H_{k+1} = {} & \bigl(V_k^T \cdots V_{k-m+1}^T\bigr) H_k^0 \bigl(V_{k-m+1} \cdots V_k\bigr) \\
& + \rho_{k-m+1} \bigl(V_k^T \cdots V_{k-m+2}^T\bigr) s_{k-m+1} s_{k-m+1}^T \bigl(V_{k-m+2} \cdots V_k\bigr) \\
& + \rho_{k-m+2} \bigl(V_k^T \cdots V_{k-m+3}^T\bigr) s_{k-m+2} s_{k-m+2}^T \bigl(V_{k-m+3} \cdots V_k\bigr) \\
& \;\;\vdots \\
& + \rho_{k-1} V_k^T s_{k-1} s_{k-1}^T V_k + \rho_k s_k s_k^T. \tag{12}
\end{aligned}$$

Hence, the search direction d_{k+1} = −H_{k+1} g_{k+1} (defined by (2) with k replaced by k + 1) can be computed without storing H_{k+1} explicitly. In [4], a two-loop recursive procedure for computing this direction efficiently is derived (for further details, see for example [9]); a sketch of this recursion is given below.
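The following Python/NumPy sketch of the two-loop recursion of [4] is included for illustration only (the experiments in Sect. 4 use the Fortran 77 code of [1]). It uses as basic matrix the scaled identity H_k^0 = ν_k I recommended below in (13)–(14), and for L-D-BFGS the stored pairs would contain the damped vectors ỹ_i of (8) in place of y_i. The function name and data layout are ours.

```python
import numpy as np

def l_dbfgs_direction(g, pairs):
    """Two-loop recursion: returns the search direction d = -H_{k+1} g.

    pairs is the list of the m most recent (s_i, y_i) pairs, oldest first;
    for L-D-BFGS the damped vectors y_tilde_i of (8) are stored in place
    of y_i, otherwise this is the standard L-BFGS recursion.
    """
    q = g.astype(float).copy()
    alphas, rhos = [], []
    for s, y in reversed(pairs):                 # backward loop (newest pair first)
        rho = 1.0 / (y @ s)                      # rho_i of (10)
        a = rho * (s @ q)
        q -= a * y                               # q <- V_i q
        alphas.append(a)
        rhos.append(rho)
    s_last, y_last = pairs[-1]
    nu = (s_last @ y_last) / (y_last @ y_last)   # scaling (14)
    r = nu * q                                   # basic matrix H_k^0 = nu_k I, as in (13)
    for (s, y), a, rho in zip(pairs, reversed(alphas), reversed(rhos)):
        b = rho * (y @ r)                        # forward loop (oldest pair first)
        r += (a - b) * s
    return -r                                    # d = -H_{k+1} g
```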


To define the basic matrix H_k^0, it is worth noting that in [7] a few choices are considered and, in particular, the following choice is recommended:

$$H_k^0 = \nu_k I, \tag{13}$$

where

$$\nu_k = \frac{s_k^T y_k}{y_k^T y_k}, \tag{14}$$

which depends on the most recent vectors in (3).

We now state the global convergence result of the L-D-BFGS method by extending that of [7], since the choice φ_k = 1 reduces the L-D-BFGS update to that of L-BFGS. To do so, we first state the following standard assumptions on the objective function.

Assumption 2.1
1. The objective function f is twice continuously differentiable;
2. the level set D = {x : f(x) ≤ f_1} is convex;
3. there exist positive constants M_1 and M_2 such that

$$M_1 \|z\|^2 \le z^T G(x) z \le M_2 \|z\|^2, \tag{15}$$

where G(x) = ∇²f(x), for all z ∈ R^n and all x ∈ D, and ‖·‖ denotes the usual Euclidean norm.

Theorem 2.1  Let x_1 be any starting point for which f satisfies Assumption 2.1, and assume that the basic matrices H_k^0 are chosen so that {H_k^0} and {(H_k^0)^{-1}} are bounded. Suppose that ỹ_k is defined by (8) with φ_k ≥ ν, where ν is a positive number, and that α_k is chosen such that the Wolfe–Powell conditions (6) hold. Then, for any positive definite H_1, the L-D-BFGS method (defined by (1), (2), (12), and (10) with y_k replaced by ỹ_k) generates a sequence of points {x_k} which converges to the solution x_*. Moreover, there is a constant 0 ≤ r < 1 such that f_{k+1} − f_* ≤ r^k (f_1 − f_*), where f_* denotes f(x_*), which implies that {x_k} converges R-linearly.

Proof  It is a simple extension of the result of [7] for the L-BFGS method. Let the D-BFGS updated Hessian approximation be given by the inverse of matrix (4) with y_k replaced by ỹ_k, that is, the following BFGS update of the Hessian approximation B_k (= H_k^{-1}) in terms of s_k and ỹ_k:

$$\mathrm{bfgs}(B_k, s_k, \tilde y_k) = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{\tilde y_k \tilde y_k^T}{\tilde y_k^T s_k}. \tag{16}$$

Using the result of [2],

$$\frac{\tilde y_k^T \tilde y_k}{s_k^T \tilde y_k} \le \varphi_k \frac{y_k^T y_k}{s_k^T y_k} + (1 - \varphi_k) \frac{s_k^T B_k B_k s_k}{s_k^T B_k s_k},$$


it follows that

$$\mathrm{trace}\bigl(\mathrm{bfgs}(B_k, s_k, \tilde y_k)\bigr) \le \mathrm{trace}(B_k) + \varphi_k \frac{y_k^T y_k}{s_k^T y_k} \le \mathrm{trace}(B_k) + \frac{y_k^T y_k}{s_k^T y_k}, \tag{17}$$

which yields the trace inequality (2.10) of [7].

Noting that s_k^T ỹ_k ≥ φ_k s_k^T y_k and using the formula for the determinant of the BFGS update, we obtain

$$\det\bigl(\mathrm{bfgs}(B_k, s_k, \tilde y_k)\bigr) = \frac{s_k^T \tilde y_k}{s_k^T B_k s_k} \det(B_k) \ge \varphi_k \frac{s_k^T y_k}{s_k^T B_k s_k} \det(B_k) \ge \nu \det\bigl(\mathrm{bfgs}(B_k, s_k, y_k)\bigr). \tag{18}$$

This inequality, together with the expressions (6.10) and (2.11) of [7], yields the determinant inequality (6.12) of that paper with M_4 replaced by νM_4. Following the analysis in that paper, the proof is completed. □

It is worth noting that this result with ỹ_k = y_k for all k yields that of [7] for the L-BFGS method, which also states the possibility of proving it for several other line search strategies, including backtracking, on the basis of the result of [10]. It is also possible to prove the following convergence result for the L-D-BFGS method.

Theorem 2.2  Theorem 2.1 remains valid if the steplength α_k and the damped parameter φ_k are chosen such that the first Wolfe–Powell condition in (6) and the curvature-like condition s_k^T ỹ_k > 0 are satisfied, respectively.

3 Choosing Damped Parameters

We now define some formulas for the damped parameter φ_k which maintain the above convergence properties of the L-BFGS method. In particular, [3] uses φ_k = 1 only if τ_k ≥ 0.2, where

$$\tau_k = \frac{s_k^T y_k}{s_k^T B_k s_k}, \tag{19}$$

but otherwise φ_k = 0.8/(1 − τ_k). This choice has been extended by [6] to

$$\varphi_k =
\begin{cases}
\dfrac{\sigma_2}{1 - \tau_k}, & \tau_k < 1 - \sigma_2, \\[1ex]
\dfrac{\sigma_3}{\tau_k - 1}, & \tau_k > 1 + \sigma_3, \\[1ex]
1, & \text{otherwise},
\end{cases} \tag{20}$$


where σ_2 and σ_3 are positive parameters (values of σ_2 ≥ 0.6 and σ_3 ≥ 3 have been suggested). This formula reduces to that of [3] if σ_2 = 0.8 (or 0.9) and σ_3 = ∞.

Since the value τ_k = h_k, where

$$h_k = \frac{y_k^T H_k y_k}{s_k^T y_k}, \tag{21}$$

reduces the BFGS update to the symmetric rank one update (see for example [11]), which has some useful properties, we use φ_k < 1 only when h_k/τ_k is sufficiently larger than one. Thus, in particular, we modify formula (20) to

$$\varphi_k =
\begin{cases}
\dfrac{\sigma_2}{1 - \tau_k}, & \tau_k \le 0 \ \text{or} \ 0 < \tau_k < \min\bigl(1 - \sigma_2, (1 - \sigma_4) h_k\bigr), \\[1ex]
\dfrac{\sigma_3}{\tau_k - 1}, & 1 + \sigma_3 < \tau_k < (1 - \sigma_4) h_k, \\[1ex]
1, & \text{otherwise},
\end{cases} \tag{22}$$

where σ_4 is another fixed positive parameter. In the D-BFGS method, this formula works better than (20), even for σ_4 = 0. Note that the two formulas are equivalent, except when h_k/τ_k is sufficiently close to one and τ_k is far from one. We also note that the inequality τ_k > 0 is guaranteed by the second Wolfe–Powell condition in (6). We observed that several choices for the damped parameter φ_k work well in practice (see for example [5]). In the next section, we test some of these choices; a sketch of rule (22) is given below.
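A small sketch of rule (22), assuming the scalars τ_k of (19) and h_k of (21) have already been computed; the function name and the default parameter values (taken from the suggestions above) are illustrative.

```python
def choose_phi(tau, h, sigma2=0.6, sigma3=3.0, sigma4=0.0):
    """Damped parameter phi_k chosen by formula (22).

    tau = s^T y / (s^T B s) as in (19), h = y^T H y / (s^T y) as in (21).
    """
    if tau <= 0.0 or 0.0 < tau < min(1.0 - sigma2, (1.0 - sigma4) * h):
        return sigma2 / (1.0 - tau)
    if 1.0 + sigma3 < tau < (1.0 - sigma4) * h:
        return sigma3 / (tau - 1.0)
    return 1.0
```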

4 Numerical Results

We now give some idea of the performance of the L-D-BFGS class of methods versus that of the standard L-BFGS method. For this comparison, we applied some methods to a set of 33 general unconstrained test problems; most of them are used by [12] and [13]. Table 1 presents, for every test problem, the name, the abbreviated name, the reference, and the number of variables n used for the runs. We tested other problems with values of n ≤ 10^4. However, we decided not to include them, since we observed that they do not give a reasonable idea of the behavior of the L-BFGS methods compared with the more significant ones described in Table 1.

Our experiments were performed in double precision with machine epsilon ε = 2^{-52} (≈ 2.22 × 10^{-16}). For given σ_0, σ_1, m and initial H_1 = I, the identity matrix, we used the Fortran 77 software of [1], which implements the L-BFGS method, with y_k replaced by ỹ_k defined by (8) for a value of φ_k based on (22) for some choices of σ_2, σ_3 and σ_4. The modified L-D-BFGS routine reduces to L-BFGS if the value φ_k = 1 is used for all k. The routine ensures finding a value of α_k which satisfies the strong Wolfe–Powell conditions, defined by (6) and

$$s_k^T y_k \le (1 + \sigma_1) |s_k^T g_k|, \tag{23}$$

where σ_1 is defined as in (6). Note that these conditions reduce to the first condition in (6) if a sufficiently large value of σ_1 is used.


Table 1  List of test problems

Abbreviated name   n            Function's name                     Reference
EX-ROSBRK          10^3         Extended Rosenbrock                 [14]
EX-ROSBRKI         10^3         Extended Rosenbrock I               [12]
EX-FRDRTH          10^3         Extended Freudenstein and Roth      [14]
CH-FRDRTH          10^3         Extended Freudenstein and Roth      [15]
EX-POW-SR          10^3         Extended Powell Singular            [14]
EX-POW-SRI         10^3         Extended Powell Singular I          [12]
VAR-DIM            10^3         Variably Dimensioned                [14]
TRGSPD             10^3         Trigonometric of Spedicato          [14]
PENALTYI           10^3         Penalty I                           [14]
TRID1BandL         100, 10^3    Tridiagonal of Broyden              [14]
LINMINSUR          121, 961     Linear Minimum Surface              [15]
BROYTRIDN          10^3         Tridiagonal of Broyden              [14]
EX-ENGVL1          10^3, 10^4   Extended ENGVL1                     [15]
WR-EX-WOOD         10^3         Wrong Extended Wood                 [15]
SP-MXSQRT          100, 10^3    Sparse Matrix Square Root           [7]
CH-ROSBRK          100, 200     Chained Rosenbrock                  [16]
PE                 100, 200     Potential Energy of [17]            [12]
TRID2TOIN          100, 200     TRIG (trigonometric)                [15]
BVP                100          Boundary Value Problem              [15]
MXSQRT1            100          Matrix Square Root 1                [7]
MXSQRT2            100          Matrix Square Root 2                [7]
VAR(-3.4)          100, 200     VAR(λ) (variational)                [18]
QOR                50           Quadratic Operations Research       [18]
GOR                50           Generalized QOR                     [18]
PSP                50           Pseudo Penalty                      [18]

In this case, the backtracking line search framework can be used, which requires only one gradient evaluation for computing an acceptable value of the steplength α_k.

Therefore, all the algorithms (with different values for m) used in our implementation differ only in the preset choices of σ_i, i ≥ 0. We let σ_0 = 10^{-4}, σ_4 = 0, and the other values are given below to define the following algorithms:

• LM1: the standard L-BFGS method; σ_1 = 0.9, σ_2 = 1 and σ_3 = ∞,
• LM2: an L-D-BFGS method; σ_1 = 0.9, σ_2 = 0.6 and σ_3 = 3,
• LM3: the same as LM2, except that σ_1 = 0.9999,
• LM4: the same as LM2, except that σ_1 = ∞.

The runs were terminated when the following condition was satisfied:

$$\|g_k\| \le \tilde\epsilon \max\bigl(1, |f_k|\bigr), \tag{24}$$


where ε̃ = 10√ε (≈ 1.47 × 10^{-7}). We use this fairly accurate stopping condition as a fair criterion for comparing different algorithms. All the runs achieved essentially the same accuracy.

We consider LM2 to examine the usefulness of the damped technique for improving the performance of the standard L-BFGS method. LM3 maintains the theoretical properties of LM2, but uses a weakly accurate line search. Although this choice may worsen the performance of L-BFGS in certain cases, it improves the performance of L-D-BFGS, as shown below. LM4 uses the backtracking line search technique, which is based on enforcing only the first Wolfe–Powell condition in (6) for a steplength estimated by a quadratic interpolation larger than or equal to 1/2 (see [1], for instance). Although this case may yield a failure for L-BFGS, it maintains the global convergence of L-D-BFGS. A feature worth noting is that LM3 requires a number of gradient evaluations less than or equal to that required by LM2, while LM4 requires only one gradient evaluation for obtaining an acceptable value of the steplength. In addition, LM4 is simpler to implement than the other algorithms.

We applied the above algorithms with the practical choices of m = 3, 5, 7, 10, and 20 to our set of tests. We observed that there is little to choose among the damped LM2, LM3, and LM4 algorithms, and all of them perform slightly better than LM1 in terms of the number of function and gradient evaluations (referred to as nfe and nge, respectively) required to solve the test problems. To illustrate this observed behavior for m = 10, we display the numerical results as performance profiles (see [19]) in Figs. 1 and 2, based on nfe and nge, respectively. Nowadays (see for example [20]), the performance profile is a standard tool for presenting such results concisely when comparing two or more algorithms over a set of problems. Considering, for example, nfe, the performance profile plots the fraction of problems (vertical axis) solved within a given factor (horizontal axis) of the best available algorithm, that is, the one requiring the least nfe among the algorithms under comparison. More efficient/robust algorithms lie "on top" towards the left/right side of the plot. Thus, an algorithm that is "on top" throughout the plot can be considered more efficient and more robust than the other algorithms, a useful outcome. A sketch of this construction is given below.

It follows from the graphs in both Figs. 1 and 2 that the performance of the LM3 algorithm is slightly better than that of LM4, which in turn performs slightly better than LM2. These damped algorithms clearly perform better than the standard LM1 algorithm.

Following the experiments of [5], we repeated the runs using σ_3 = max(α_k − 1, 1) rather than σ_3 = 3. We observed that there is little to choose among the above algorithms, except for m = 5. For this choice of m, the corresponding performance profiles comparing nfe and nge are plotted in Figs. 3 and 4, respectively. Both graphs clearly show that the damped LM2, LM3, and LM4 algorithms are preferable to the standard undamped LM1 method. Specifically, LM4 is slightly better than LM3, and both of them are much better than both LM2 and LM1 in terms of nfe and nge. However, Fig. 3 shows that there is little to choose between LM2 and LM1 in terms of nfe, while Fig. 4 shows that in all cases the former algorithm is slightly better than the latter in terms of nge.
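As an illustration of how such profiles are assembled, the following sketch builds Dolan–Moré performance-profile curves [19] from a table of costs (e.g., nfe counts); the function name is ours, and NaN is used here, by assumption, to mark a failure.

```python
import numpy as np

def performance_profile(costs):
    """Performance profiles in the sense of [19].

    costs is a (problems x solvers) array of, e.g., nfe counts, with np.nan
    marking a failure.  Returns (taus, fractions), where fractions[j, i] is
    the fraction of problems solved by solver i within a factor taus[j] of
    the best solver on each problem.
    """
    best = np.nanmin(costs, axis=1, keepdims=True)     # best cost per problem
    ratios = costs / best                              # performance ratios
    ratios[np.isnan(ratios)] = np.inf                  # failures never qualify
    taus = np.unique(ratios[np.isfinite(ratios)])
    fractions = np.array([(ratios <= t).mean(axis=0) for t in taus])
    return taus, fractions
```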


Fig. 1 Performance profile based on nfe; m = 10, σ3 = 3

Fig. 2 Performance profile based on nge; m = 10, σ3 = 3

Fig. 3  Performance profile based on nfe; m = 5, σ3 = max(αk − 1, 1)

Fig. 4  Performance profile based on nge; m = 5, σ3 = max(αk − 1, 1)

The above experiments show that the LM4 algorithm outperforms the others under comparison. Finally, it is worth noting that we repeated the runs for the damped LM3 algorithm, with the exception that an acceptable value of α_k is given by the largest value of 2^{-i}, i = 0, 1, 2, ..., which satisfies the first condition in (6), that is, the backtracking Armijo line search (see for example [9]); a sketch is given below. We observed that the technique described above, which is based on quadratic polynomial interpolation, is preferable to that of Armijo.
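For completeness, a sketch of the backtracking Armijo rule used in this last comparison: the steplength is the largest value 2^{-i} satisfying the first condition in (6). The objective f, the current point x with value fx and gradient g, and the search direction d are assumed to be NumPy-compatible; the function name and the iteration cap are ours.

```python
def armijo_backtracking(f, x, fx, g, d, sigma0=1e-4, max_halvings=30):
    """Largest alpha = 2^{-i}, i = 0, 1, 2, ..., satisfying the first
    Wolfe-Powell condition in (6): f(x) - f(x + alpha*d) >= sigma0*|alpha*g^T d|."""
    gtd = g @ d                      # directional derivative (negative for descent)
    alpha = 1.0
    for _ in range(max_halvings):
        if fx - f(x + alpha * d) >= sigma0 * abs(alpha * gtd):
            return alpha
        alpha *= 0.5
    return alpha                     # give up after max_halvings trial reductions
```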


5 Conclusions

In this paper, it is demonstrated that the proposed class of L-D-BFGS methods has features similar to those of the attractive L-BFGS method, except that it keeps the Hessian approximations sufficiently positive definite and bounded. Indeed, we observed that this class, for certain choices of the damped parameter, performs better than the L-BFGS method, significantly so on certain ill-conditioned problems. In general, the former methods perform better than the latter in terms of the number of function and gradient evaluations required to solve the problems. In particular, the computational results also showed that the number of gradient evaluations required by L-D-BFGS is always smaller than that required by L-BFGS. In conclusion, the damped methods, specifically the one defined by LM4, are recommended. However, further numerical experiments are required.

References

1. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley, Chichester (1987). Reprinted in 2000
2. Al-Baali, M.: Convergence analysis of a class of damped quasi-Newton methods for nonlinear optimization. Research Report, DOMAS 11/2, Sultan Qaboos University, Oman (2011)
3. Powell, M.J.D.: Algorithms for nonlinear constraints that use Lagrange functions. Math. Program. 14, 224–248 (1978)
4. Nocedal, J.: Updating quasi-Newton matrices with limited storage. Math. Comput. 35, 773–782 (1980)
5. Al-Baali, M., Grandinetti, L.: On practical modifications of the quasi-Newton BFGS method. Adv. Model. Optim. 11, 63–76 (2009)
6. Al-Baali, M.: Quasi-Wolfe conditions for quasi-Newton methods for large-scale optimization. In: 40th Workshop on Large Scale Nonlinear Optimization, Erice, Italy, June 22–July 1 (2004)
7. Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)
8. Kelley, C.T.: Iterative Methods for Optimization, 2nd edn. SIAM, Philadelphia (1999)
9. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, London (1999)
10. Byrd, R.H., Nocedal, J.: A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 26, 727–739 (1989)
11. Al-Baali, M.: Variational quasi-Newton methods for unconstrained optimization. J. Optim. Theory Appl. 77, 127–143 (1993)
12. Al-Baali, M.: Improved Hessian approximations for the limited memory BFGS method. Numer. Algorithms 22, 99–112 (1999)
13. Nash, S.G., Nocedal, J.: A numerical study of the limited memory BFGS method and the truncated-Newton method for large scale optimization. SIAM J. Optim. 1, 358–372 (1991)
14. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software. ACM Trans. Math. Softw. 7, 17–41 (1981)
15. Toint, Ph.L.: Test problems for partially separable optimization and results for the routine PSPMIN. Technical Report No. 83/4, Department of Mathematics, Facultés Universitaires de Namur, Namur, Belgium (1983)
16. Grandinetti, L.: Some investigations in a new algorithm for nonlinear optimization based on conic models of the objective function. J. Optim. Theory Appl. 43, 1–21 (1984)
17. Siegel, D.: Implementing and modifying Broyden class updates for large scale optimization. Technical Report NA12, Department of Applied Mathematics and Theoretical Physics, Cambridge University, England (1992)
18. Toint, Ph.L.: Some numerical results using a sparse matrix updating formula in unconstrained optimization. Math. Comput. 32, 839–851 (1978)
19. Dolan, E.D., Moré, J.J.: Benchmarking optimization software with performance profiles. Math. Program. 91, 201–213 (2002)
20. Malmedy, V., Toint, Ph.L.: Approximating Hessians in unconstrained optimization arising from discretized problems. Comput. Optim. Appl. 50, 1–22 (2011)