Journal of Credit Risk

Volume 1/ Number 4, Fall 2005

Linear and non-linear credit scoring by combining logistic regression and support vector machines

Tony Van Gestel* Group Risk Management, Dexia Group, Square Meeus 1, B-1000 Brussels, Belgium; email: [email protected]

Bart Baesens School of Management, University of Southampton, Southampton SO17 1BJ, UK; email: [email protected]

Peter Van Dijcke Research, Dexia Bank Belgium, Av. Galilei 30, B-1000 Brussels, Belgium; email: [email protected]

Johan A. K. Suykens K.U. Leuven, Dept of Electrical Engineering, ESAT-SCD-SISTA, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium; email: [email protected]

Joao Garcia Group Risk Management, Dexia Group, Square Meeus 1, B-1000 Brussels, Belgium; email: [email protected]

Thomas Alderweireld Group Risk Management, Dexia Group, Square Meeus 1, B-1000 Brussels, Belgium; email: [email protected]

The Basel II capital accord encourages banks to develop internal rating models that are financially intuitive, easily interpretable and optimally predictive of default. Standard linear logistic models are easily readable but have limited model flexibility. Advanced neural network and support vector machine (SVM) models are less straightforward to interpret but can capture more complex multivariate non-linear relations. A gradual approach that balances the interpretability and predictability requirements is applied here to rate banks. First, a linear model is estimated; it is then improved by identifying univariate non-linear ratio transformations that emphasize distressed conditions; and finally SVMs are added to capture the remaining multivariate non-linear relations.

*Corresponding author. The authors would like to thank Mark Itterbeek, Geert Kindt, Frank Lierman (Dexia Bank), Luc Leonard and Eric Hermann (Dexia Group) for many helpful comments. Johan Suykens acknowledges support from K.U. Leuven, IUAP V, GOA-Ambiorics and FWO projects G.0407.02, G.0499.04, G.02111.05 and G.0080.01.


1 INTRODUCTION

Credit risk is a major, if not the largest, source of risk for banks. Credit scoring techniques are decision-support tools that aim to discriminate between risky and less risky counterparts: good counterparts receive high scores, while bad counterparts receive low scores. Calibration of the scoring function relates the score to the default probability that is used, eg, for pricing, provisioning, and regulatory and economic capital calculations. The more powerful the discrimination, the better the bank can distinguish between good and bad counterparts. Under the same lending policies, more discriminating scoring functions reduce losses in hold-to-maturity portfolios, while timely and correct indications of changes in credit quality are important for mark-to-market portfolios. Apart from reasons of profitability, the use of advanced credit scoring techniques is also stimulated by internal risk measurement and assessment processes that have become increasingly important, especially in the context of the Basel II capital accord (Basel Committee on Banking Supervision, 2003).

The growing importance of credit scoring is illustrated by the wide variety of credit scoring techniques. Figure 1 indicates the main techniques that are used nowadays for credit scoring: structural models are based on economic theory, while supervised learning models are trained to discriminate between risky and less risky counterparts. Structural models like Merton's (1974) model and the gambler's ruin model (Wilcox, 1971) compare equity value to asset and cashflow volatility, respectively. They are based on a relatively simple and intuitive model in which default is defined as the point at which the equity value becomes zero and the asset value falls below the liabilities. The default risk is expressed as a distance to default or a related default probability, which depends on the difference between assets and debt relative to the asset or cashflow volatility. The initial models assumed simple debt structures, but they have been further extended towards more complex bond and debt repayments.

FIGURE 1 Overview of credit scoring techniques: structural models (Merton, gambler's ruin) and supervised learning models, the latter divided into parametric learning (linear: FDA, logit; intrinsically linear: FDA, logit; non-linear: MLP, RBF network, Bayesian network and other machine learning techniques) and non-parametric learning (kernel machines: linear SVM, SVM, kernel logit; density-based methods: kernel density, nearest neighbour).

This is especially so for the Merton model, which is popular for scoring listed counterparts, as the asset volatility can be derived from the history of the equity value. A well-known application is the KMV default model (Crosbie and Bohn, 2003; Saunders and Allen, 2002).

Supervised learning techniques learn from data to discriminate between good and bad counterparts. Based on financial ratios, qualitative and judgmental indicators, and other potentially relevant information, the credit scoring model computes a score that is related to a default probability. The more powerful the scoring model's discrimination between future defaults and non-defaults, the more the good counterparts receive high scores and the bad counterparts low scores. In terms of default probability, this means that, as the discriminative power of the model improves, low scores correspond to increasingly higher default rates and high scores to increasingly lower default rates. Because of the importance of credit scoring and of classification problems in general, there is a wide variety of models. In linear models, the z-score is a linear combination

z = w_1 x_1 + … + w_n x_n + b    (1)

of the ratios x_1, …, x_n. The model structure is visualized in Figure 2a. The model parameters or weights w_1, …, w_n and the bias term b are estimated by optimizing the discrimination behavior. Depending on the formulation of the problem, one can use classical linear techniques such as Fisher discriminant analysis (FDA) and logistic regression (logit) (Altman, 1968; Fisher, 1936; McCullagh and Nelder, 1989; Meyer and Pifer, 1970; Ohlson, 1980). Sometimes the problem is also formulated as a linear or quadratic programming problem; the latter is closely related to linear support vector machine classifiers (Thomas, Edelman and Crook, 2000; Vapnik, 1998).

A more advanced model formulation uses additional parameters that pre-transform (some of) the ratios by a parameterized function f(x; λ) = f(x; λ_1, …, λ_m):

z = w_1 f_1(x_1; λ_1) + … + w_n f_n(x_n; λ_n) + b    (2)

as indicated in Figure 2b. Such a model belongs to the class of additive models. Each transformation f_i transforms the ratio x_i in a univariate way. The parameters of the transformations are optimized so as to increase the discrimination ability (Box and Tidwell, 1962; Hastie, Tibshirani and Friedman, 2001).

More complex models allow for more complex interactions between the variables. Neural networks are a typical example of such parameterized models. A neural network multilayer perceptron (MLP) model with one hidden layer has the following parameterization (Figure 2c):

z = Σ_{k=1}^{n_h} w_k f_Act( Σ_{l=1}^{n} v_{kl} x_l + b_k ) + b    (3)
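To make the three parametric architectures concrete, the following minimal numpy sketch (not taken from the paper) evaluates the z-score of equations (1)–(3) for a single counterpart; all ratio values, weights and transformations shown are illustrative placeholders.

```python
# Minimal sketch of the score functions (1)-(3); weights and ratios are illustrative.
import numpy as np

def z_linear(x, w, b):                          # equation (1)
    return w @ x + b

def z_additive(x, w, b, transforms):            # equation (2), univariate f_i per ratio
    return w @ np.array([f(xi) for f, xi in zip(transforms, x)]) + b

def z_mlp(x, w, b, V, b_hidden):                # equation (3), f_Act = tanh
    return w @ np.tanh(V @ x + b_hidden) + b

x = np.array([0.08, 1.2, -0.3])                 # three hypothetical ratios
w = np.array([0.5, -0.2, 0.1])
print(z_linear(x, w, b=0.0))
print(z_additive(x, w, 0.0, [np.log1p, np.tanh, lambda v: v]))
V = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])   # n_h = 2 hidden neurons
print(z_mlp(x, np.array([0.3, -0.4]), 0.0, V, np.zeros(2)))
```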

FIGURE 2 Model architectures for computing the z-score from the input ratios x_1, …, x_5. The linear (a), intrinsically linear (b) and MLP model (c) are parametric models with increasing complexity and learning capability. The SVM or kernel-based model (d) is a non-parametric model where the learning capacity depends, among other things, on the choice of the kernel function K.

The parameters b, w_k, b_k and v_{kl} are optimized to provide optimal discrimination ability on both current and future data. Typical activation functions f_Act are the hyperbolic tangent function or other S-shaped, locally bounded, piecewise continuous functions. The MLP model is very popular because of its universal approximation property, which allows one to approximate any analytical function by an MLP with two hidden layers. A disadvantage of the increased learning capacity and modeling flexibility is the model design: the optimization problem is non-convex and thus more complex, because the global optimum is typically hidden among multiple local minima. One should take care not merely to optimize the model parameters towards the learning data set, but rather to estimate them such that sufficient generalization ability on new, unseen counterparts is guaranteed. To this end one can apply, for example, Bayesian learning theory or statistical learning theory.

The above models specify a parameterized form, f(x_1, …, x_n; w_i, λ_j), of the discriminant function, in which the parameters w_i (i = 1, …), λ_j (j = 1, …) are optimized to discriminate between (future) solvent and non-solvent counterparts. Non-parametric kernel-based learning models do not specify a parameterized form of the discriminant function. Instead, a discriminant function z = f(x_1, …, x_n) is estimated that learns to discriminate between good and bad counterparts subject to a smoothness constraint:

f_Approx(x) = argmin_f J(f) = Classification cost(f; data) + ζ · Regularization term(f)

and the resulting discriminant function takes the form

f_Approx(x) = b + Σ_{i=1}^{N} α_i K(x, x_i),   with x = [x_1, …, x_n]^T

The classifier is then sign[b + Σ_{i=1}^{N} α_i K(x, x_i)], where the vector x_i ∈ R^n consists of the n ratios of the ith training data point (i = 1, …, N). The corresponding model structure is indicated in Figure 2d. The training set classification cost is often a cost function that is also used in linear parametric models, such as least squares or the negative log-likelihood, while the regularization term involves a derivative operator that penalizes higher-order derivatives of the function f and promotes smooth discriminant functions. The derivative operator is related to the kernel function K. The regularization parameter ζ > 0 determines the trade-off between the training set cost and the regularization.

Support vector machines (SVMs) and related kernel-based learning techniques can be understood as non-parametric learning techniques. Compared with MLPs, the solution follows from a convex optimization problem. The solution is unique and no longer depends on the starting point of the optimization of the model parameters. Apart from computational convenience, this property ensures reproducibility of the model parameter estimates. The main properties of these kernel-based techniques are well understood, given the links with regularization networks, reproducing kernel Hilbert spaces, Gaussian processes, convex optimization theory, neural networks and learning theory. SVMs have been reported many times to perform at least as well as, or better than, linear techniques and neural networks in many domains (Schölkopf, Tsuda and Vert, 2004; Van Gestel et al., 2004; Wang, 2005). All these theoretical and practical properties mean that SVMs and kernel-based learning techniques are becoming a powerful reference technique. While SVMs and related kernel-based learning techniques focus on the model's discrimination ability, other non-parametric techniques, such as Nadaraya–Watson and nearest neighbor methods, are more related to density estimation.

Neural networks and SVMs are often referred to as data mining and machine learning techniques, which is a broad classification of supervised modeling techniques that learn relations from data. Other techniques, like Bayesian networks, decision trees and other techniques from applied statistics, are also counted among the machine learning techniques (Duda, Hart and Stork, 2001; Hastie, Tibshirani and Friedman, 2001; Mitchell, 1997; Ripley, 1996).


An overview of common machine learning techniques is given in Figures 1 and 2. Non-linear models are more flexible and allow more complex interactions, giving improved modeling capability, but they need appropriate learning techniques¹ to avoid overfitting, and the resulting models are more difficult to interpret. In this paper, the above linear, intrinsically linear and kernel-based learning techniques are combined to produce a model with high readability and discrimination ability.

The paper is organized as follows. The combination of linear and non-linear models for credit scoring is motivated in Section 2. An introduction to SVMs is provided in Section 3, while mathematical details are provided in the Appendix. Section 4 provides the formulation of the mathematical model for combined parametric and non-parametric (semi-parametric) models. Empirical results on the application of credit scoring to banks are reported in Section 5. Conclusions are drawn in Section 6.

1 For example, one can tune the effective number of parameters using Bayesian learning, complexity criteria and statistical learning theory.

2 COMBINING LINEAR AND NON-LINEAR MODELS FOR CREDIT SCORING

Important requirements for internal default rating systems are predictability, stability and interpretability. In this paper, the above techniques are combined to obtain a model with statistically stable and reliable parameter estimates that identifies an appropriate trade-off between predictability and interpretability. Optimal predictability involves good discrimination between defaults and non-defaults. Stability concerns the reliability of the estimated model parameters, such that the model exhibits good generalization behavior on new data points and does not just memorize the historical learning data set on which it was developed. The readability requirement states that the model and its results should be intuitive and interpretable for the user; it should avoid the situation in which the model is a "black box" from which it is unclear to the end-user how, for example, the rating result is obtained and in which, in the case of rating anomalies, it is difficult to motivate manual overruling by an expert.

A stepwise and gradual approach is followed so as to find an optimal trade-off between simple techniques, with excellent readability but restricted model flexibility and complexity, and advanced techniques, with reduced readability but extended learning capacity. The first step consists of the estimation of a linear ordinal logistic regression model (McCullagh, 1980), which serves as the benchmark statistical technique for modeling ordinal categories. In the next step an intrinsically linear model, or additive model, is built by considering univariate non-linear transformations of the explanatory variables using logarithmic, exponential and sigmoid transformations (Bishop, 1995; Box and Tidwell, 1962; Hastie, Tibshirani and Friedman, 2001; Yeo and Johnson, 2000).

FIGURE 3 Diagram showing the gradual combination of linear regression, intrinsically linear regression and kernel-based learning (support vector machines, SVMs) to produce increasing modeling capacity but decreasing readability. Having regard to the trade-off between readability and accuracy, a first step tries to capture as many relations as possible with simple techniques, while advanced techniques are applied in subsequent steps to capture the remaining, more complex relations in the data. Starting from the input data (ratios 1–6), the ratios are processed by the linear nodes (ratios 1–3), the univariate non-linear nodes (ratios 4–5) and the SVM node (ratios 3, 4 and 6). The transformed ratios are combined into the z-score, which is then mapped to the external rating probabilities P(y = AAA), …, P(y = CCC), via the cumulative probabilities P(y ≤ AAA), …, P(y ≤ B) and thresholds θ_1, …, θ_6, using an ordinal logistic regression model.

Finally, a kernel-based technique known as "support vector machines" (SVMs) is introduced to construct an advanced non-linear model on top of the intrinsically linear model, capturing the remaining multivariate non-linear relationships in the data (Schölkopf and Smola, 2002; Suykens et al., 2002; Vapnik, 1998). The three-step process is visualized in Figure 3, where it can be seen that the capacity for generalization increases as the readability of the model decreases.

3 SUPPORT VECTOR MACHINES FOR BINARY CLASSIFICATION

Due to the universal approximation property (Hornik, Stinchcombe and White, 1989), the multilayer perceptron (MLP) neural network is a popular model for both regression and classification and has often been used in financial contexts such as bankruptcy prediction and credit scoring (see, eg, Atiya, 2001; Baesens et al., 2003b; Refenes, 1995; Serrano-Cinca, 1997). Although good training algorithms (eg, Bayesian inference – see Bishop (1995); Suykens et al. (2002)) are now available to design the MLP, a number of drawbacks remain, such as the choice of the architecture of the MLP and the existence of multiple local minima, which implies that the estimated parameters may not be uniquely determined.


Recently, a new learning technique has emerged, known as support vector machines (SVMs), together with related kernel-based learning methods in general, in which the solution is unique and follows from a convex optimization problem (Schölkopf and Smola, 2002; Suykens et al., 2002; Vapnik, 1998). Kernel-based learning techniques have already been successfully applied in many domains, such as bio-informatics, text mining, optical character recognition and time-series analysis (Schölkopf, Tsuda and Vert, 2004; Fan and Yao, 2003; Wang, 2005). For financial applications see, for example, Baesens et al. (2003b), De Servigny and Renault (2004), Hutchinson, Lo and Poggio (1994), Van Gestel et al. (2001) or Van Gestel et al. (2005). For a statistical analysis of support vector machines and kernel-based learning, the reader is referred to Cucker and Smale (2003) or Vapnik (1998).

3.1 Binary SVM classifier

SVMs were first derived for the binary classification problem with class labels −1 and +1, which may correspond, for example, to the default and non-default states. The classifier has the form

y(x) = sign[ w^T φ(x) + b ]    (4)

where the coefficient vector w ∈ R^{n_f} and bias term b have to be estimated from the data. The corresponding score function is equal to z = w^T φ(x) + b, which ranks, for example, counterparts from high to low default risk. The non-linear function

φ(·): R^n → R^{n_f}: x ↦ φ(x)    (5)

maps the input space to a high- (possibly infinite) dimensional feature space (see Figure 4). In this feature space, a linear separating hyperplane, w^T φ(x) + b = 0, is then constructed by applying linear methodology. In SVMs the classifier is obtained from a convex quadratic programming (QP) problem in the parameters w and b subject to 2N constraints, as explained in the Appendix, where N is the number of training data points. A key element of non-linear SVMs and kernel-based learning in general is that the non-linear mapping φ(·) and the weight vector w are never calculated explicitly. Instead, Mercer's theorem

K(x_i, x_j) = φ(x_i)^T φ(x_j)    (6)

is applied to relate the mapping φ(·) to the symmetric and positive definite kernel function K. For K(x_i, x) one typically has the following choices: K(x_i, x) = x_i^T x (linear kernel); K(x_i, x) = (x_i^T x + η)^d (polynomial kernel of degree d, with η a positive real constant); and K(x_i, x) = exp(−‖x − x_i‖_2^2 ⁄ σ^2) (radial basis function (RBF) kernel with bandwidth parameter σ).

FIGURE 4 Illustration of SVM-based classification. The inputs are first mapped in a non-linear way to a high-dimensional feature space (x ↦ φ(x)), in which a linear separating hyperplane is constructed. By taking the Mercer kernel (K(x_i, x_j) = φ(x_i)^T φ(x_j)), a non-linear classifier in the input space is obtained.

Constructing the Lagrangian of the QP problem, one can eliminate w from the conditions of optimality in the saddle point of the Lagrangian and formulate a dual optimization problem in the Lagrange multipliers α = [α_1, …, α_N]^T ∈ R^N. The resulting classifier is depicted in Figure 4 and is given by

y(x) = sign[ Σ_{i=1}^{N} α_i K(x, x_i) + b ]    (7)

with z = Σ_{i=1}^{N} α_i K(x, x_i) + b. For illustrative purposes, the generalization behavior of a least-squares support vector machine (LS-SVM) classifier² is illustrated in Figure 5 for the two-spiral problem, which is known to be a difficult benchmark classification problem. Given the two explanatory variables x_1 and x_2, the SVM classifier is able to generalize well and to follow the complex border between the two intertwined spirals³ (Suykens and Vandewalle, 1999). Further mathematical details on SVMs are reported in the Appendix.

2 The LS-SVM classifier is a modified version of the SVM classifier that uses the least-squares cost function, resulting in a simpler optimization problem and connections with Fisher discriminant analysis (Van Gestel et al., 2002).
3 A Matlab implementation of LS-SVMs is available at www.esat.kuleuven.ac.be/sista/lssvmlab.
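As an illustration of the classifier in equation (7), the following hedged sketch fits an RBF-kernel SVM on a synthetic two-spiral data set similar to the benchmark of Figure 5. It uses scikit-learn's SVC rather than the LS-SVM Matlab toolbox referenced in footnote 3, and the hyperparameter values (C, gamma) are purely illustrative.

```python
# Sketch of an RBF-kernel SVM on a synthetic two-spiral problem (not the paper's code).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
t = np.linspace(0.5, 3 * np.pi, 200)
spiral = np.c_[t * np.cos(t), t * np.sin(t)]              # one spiral arm
X = np.vstack([spiral, -spiral]) + rng.normal(scale=0.1, size=(400, 2))
y = np.r_[np.ones(200), -np.ones(200)]                    # class labels +1 / -1

# RBF kernel K(x_i, x) = exp(-||x - x_i||^2 / sigma^2); gamma plays the role of 1/sigma^2
clf = SVC(kernel="rbf", C=10.0, gamma=1.0).fit(X, y)
z = clf.decision_function(X)                              # latent score, cf equation (7)
print("training accuracy:", clf.score(X, y))
```

The decision_function output plays the role of the score z in equation (7), with the class labels absorbed into the dual coefficients.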


FIGURE 5 LS-SVM classifier result for the two-spiral benchmark problem. The bold + and × signs indicate the training data points for the two intertwined spirals. The classifier result is indicated by the + and × cross-hatching. The complex boundary between the intertwined spirals is well captured by the classifier. The horizontal and vertical axes show the explanatory variables x_1 and x_2.

3.2 Related kernel-based learning techniques

More generally, SVMs and related kernel-based learning techniques for QP classification, Fisher discriminant analysis, logistic regression and least-squares regression (Schölkopf and Smola, 2002; Suykens et al., 2002; Van Gestel et al., 2002; Keerthi et al., 2002; Vapnik, 1998) follow more or less the following steps:

1. one starts by mapping the input space to a high-dimensional feature space, where the mapping itself is implicitly defined by the Mercer kernel (6);
2. a (linear) regression or classification technique is then formulated in the (primal) feature space, where one typically uses a regularization term that penalizes large weights to avoid overfitting;
3. the Lagrangian is constructed and a (dual) optimization problem is formulated in the Lagrange multipliers;
4. the solution is then expressed in terms of the resulting Lagrange multipliers and the kernel function K.


A disadvantage of SVM formulations is that the computational and memory requirements grow as a power of N, the number of training data points.⁴ More recently, methods have been developed to avoid having to solve the large-scale dual-space problem in the Lagrange multipliers, which is typically of dimension N. Instead, one can directly estimate a finite-dimensional approximation to the non-linear mapping φ(x) in (5) such that φ(x_i)^T φ(x_j) = K(x_i, x_j) for all i, j = 1, …, N. As explained in the Appendix, the Nyström approximation allows a further reduction of the computational complexity by estimating φ on a reduced set of M data points. Additionally, not only does one no longer need to construct the Lagrangian, but one can also estimate the discriminant function directly, given the explicit estimate for φ and the resulting data set {φ(x_i), y_i}_{i=1}^{N} (Suykens et al., 2002; Williams and Seeger, 2001).

4 For example, as explained in the Appendix, one often needs to construct the kernel matrix Ω ∈ R^{N×N} with elements Ω_ij = φ(x_i)^T φ(x_j) = K(x_i, x_j) and with size N × N. Memory requirements grow as N².

4 COMBINING LINEAR ORDINAL LOGISTIC REGRESSION WITH SUPPORT VECTOR MACHINES (SVMs)

4.1 Linear ordinal logistic regression

For binary classification problems such as bankruptcy prediction, ordinary least-squares regression⁵ (Altman, 1968) and logistic regression (Ohlson, 1980) are key techniques for building a discriminant function between two classes: class 1 (defaults) and class 2 (non-defaults). Logistic regression is typically preferred for the following reasons: its model formulation is specific to a binary classification problem (defaults/non-defaults); it is known to exhibit better generalization behavior than least-squares regression, as is observed empirically (Baesens et al., 2003b; Lim, Loh and Shih, 2000; Van Gestel et al., 2004); and it is theoretically proven to be more robust to deviations from multivariate Gaussian distributed classes (Efron, 1975).

5 For binary classification problems, ordinary least-squares regression corresponds to Fisher discriminant analysis and canonical correlation analysis (Van Gestel et al., 2002).

The ordinal logistic regression (OLR) model (Agresti, 2002; McCullagh, 1980; Resti, 2002) is an extension of the binary logistic regression model for ordinal multiclass categorization problems – an example of which might be: class 1 (very good), class 2 (good), class 3 (medium), class 4 (bad) and class 5 (very bad). In the cumulative ordinal logistic regression formulation for m classes, the cumulative probability of the rating y being less than or equal to rating i is given by

P(y ≤ i) = 1 ⁄ (1 + exp(−θ_i + β_1 x_1 + β_2 x_2 + … + β_n x_n)),   i = 1, …, m    (8)

with the vector x = [x_1, x_2, …, x_n]^T of n explanatory variables x_1, x_2, …, x_n and the corresponding coefficient vector β = [β_1, β_2, …, β_n]^T. Because P(y ≤ m) = 1, the parameter θ_m is equal to ∞. The latent variable z is the linear combination of the explanatory variables x_i, i = 1, …, n,

z = −β_1 x_1 − β_2 x_2 − … − β_n x_n = −β^T x    (9)

and summarizes the financial information concerning the individual risk of the bank. Essentially, the cumulative probability P(y ≤ i) is linked to the latent variable (plus a category-dependent constant θ_i) via the logistic link function. Given the cumulative probabilities P(y ≤ i), with i = 1, …, m, one obtains the probabilities P(y = i) as follows:

P(y = 1) = P(y ≤ 1)
P(y = i) = P(y ≤ i) − P(y ≤ i − 1),   i = 2, …, m − 1    (10)
P(y = m) = 1 − P(y ≤ m − 1)

Given a training data set D = {x_i, y_i}_{i=1}^{N} of N data points, the parameters θ_1, θ_2, …, θ_m and β_1, β_2, …, β_n are estimated using a maximum-likelihood procedure that minimizes the negative log-likelihood (NLL):

(θ̂_1, θ̂_2, …, θ̂_m; β̂_1, β̂_2, …, β̂_n) = argmin_{θ,β} NLL(θ, β) = − Σ_{i=1}^{N} log(P(y = y_i))    (11)

with θ_m = ∞ and y_i ∈ {1, …, m}. The maximum-likelihood estimate is obtained via iteratively reweighted least squares using Levenberg–Marquardt optimization. As a result of the optimization, not only are the optimal parameters obtained but also the standard errors (s.e.) on the estimates of the coefficients. The model deviance (dev) is equal to twice the negative log-likelihood in the optimum, and it can be used for model comparison, for example, using an appropriate information criterion (Agresti, 2002; Akaike, 1973). The statistical relevance of input i can be assessed from its p-value in the hypothesis test H_0(β_i = 0), obtained from the z-test with z-statistic z = β̂_i ⁄ s.e.(β̂_i) (Friedl and Tilg, 1994). As a second measure of the importance of a ratio, the model deviance⁶ is used to compare the full model M_1 (with inputs 1, …, i − 1, i, i + 1, …, n) and the reduced model M_0 without the corresponding input (inputs 1, …, i − 1, i + 1, …, n). The Bayes factor B_10 indicates the model improvement from model M_0 to model M_1 and is approximated via 2 log(B_10) ≈ dev(M_0) − dev(M_1) = Δdev. Indicative ranges for the Bayes factor are reviewed in Table 1 (Jeffreys, 1961); for this application it is preferred to have strong or decisive evidence of model improvement.

6 It is preferable to report the deviance, as it is straightforward to compute an appropriate complexity criterion, such as the Akaike or Bayesian information criterion, from the deviance.
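The following short Python sketch (not the authors' implementation, which uses iteratively reweighted least squares with Levenberg–Marquardt optimization) illustrates equations (8)–(11): the cumulative probabilities, the class probabilities and a direct numerical minimization of the negative log-likelihood. The arrays X (N × n ratios) and y (ratings coded 1, …, m) are hypothetical, and the ordering θ_1 < … < θ_{m−1} is assumed rather than enforced.

```python
# Sketch of cumulative ordinal logistic regression, equations (8)-(11).
import numpy as np
from scipy.optimize import minimize

def cumulative_probs(theta, beta, X):
    """P(y <= i | x) = 1 / (1 + exp(-theta_i + beta' x)), equation (8)."""
    eta = X @ beta                                               # shape (N,)
    return 1.0 / (1.0 + np.exp(-theta[None, :] + eta[:, None]))  # shape (N, m-1)

def class_probs(theta, beta, X):
    """P(y = i) from the cumulative probabilities, equation (10)."""
    cum = cumulative_probs(theta, beta, X)
    cum = np.hstack([np.zeros((X.shape[0], 1)), cum, np.ones((X.shape[0], 1))])
    return np.diff(cum, axis=1)                                  # shape (N, m)

def nll(params, X, y, m):
    """Negative log-likelihood of equation (11); y takes values 1,...,m."""
    theta, beta = params[:m - 1], params[m - 1:]
    p = class_probs(theta, beta, X)
    return -np.sum(np.log(p[np.arange(len(y)), y - 1] + 1e-12))

def fit_olr(X, y, m):
    """Fit thresholds theta and coefficients beta by direct NLL minimisation."""
    n = X.shape[1]
    x0 = np.concatenate([np.linspace(-1.0, 1.0, m - 1), np.zeros(n)])
    res = minimize(nll, x0, args=(X, y, m), method="BFGS")
    return res.x[:m - 1], res.x[m - 1:]                          # theta_hat, beta_hat
```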


TABLE 1 Evidence against the H_0 hypothesis (no improvement of M_1 over M_0) for different values of the Bayes factor B_10 and of the difference in model deviance.

2 log(B_10)   B_10         Evidence against H_0
0 to 2        1 to 3       Not worth more than a bare mention
2 to 5        3 to 12      Positive
5 to 10       12 to 150    Strong
>10           >150         Decisive
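Building on the fit_olr and nll sketches given above, the deviance comparison and the approximation 2 log(B_10) ≈ dev(M_0) − dev(M_1) can be illustrated as follows; the result is then read against the ranges in Table 1. The helper names are hypothetical.

```python
# Sketch of the deviance/Bayes-factor comparison; reuses nll and fit_olr from above.
import numpy as np

def deviance(theta, beta, X, y, m):
    return 2.0 * nll(np.concatenate([theta, beta]), X, y, m)    # dev = 2 * NLL

def delta_deviance(X, y, m, drop_col):
    th1, b1 = fit_olr(X, y, m)                                   # full model M1
    X0 = np.delete(X, drop_col, axis=1)                          # reduced model M0
    th0, b0 = fit_olr(X0, y, m)
    return deviance(th0, b0, X0, y, m) - deviance(th1, b1, X, y, m)  # ~ 2 log(B10)
```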

4.2 Univariate non-linear ratio transformations

In the linear model formulation (9), a ratio x_i influences the score z in a linear way. However, it can be argued that, in a risk assessment context, a change of a ratio by 1% should not always have the same influence on the score (Atiya, 2001; Baesens et al., 2003a; Serrano-Cinca, 1997). For example, an increase of 1% in the total capital ratio from 7% to 8% may not have the same impact as an increase from 9% to 10% (Estrella, Peristiani and Park, 2002). Therefore, it is often suggested that one should estimate univariate non-linear transformations⁷ (x_i ↦ f_i(x_i)) for some of the independent variables. Applying the transformation to ratios m + 1, …, n, the z-score (9) becomes

z = −β_1 x_1 − … − β_m x_m − β_{m+1} f_{m+1}(x_{m+1}) − … − β_n f_n(x_n)    (12)

The model is referred to as intrinsically linear in the sense that, after applying the non-linear transformation to the explanatory variables, a linear model is estimated (Box and Tidwell, 1962). A non-linear transformation of the explanatory variables is applied only when it is reasonable from a financial as well as a statistical perspective (eg, model fit and performance). Box–Cox power transformations are a well-known type of transformation to improve symmetry, normality or model fit (see, eg, Box and Cox (1964) and Box and Tidwell (1962) for more details). However, these transformations are only defined for positive values x > 0. Recently, an alternative family of transformations, defined by Yeo and Johnson (2000), has been proposed that is of the same form as the Box–Cox transformations and is also valid for negative values:

f(x; λ) = ((1 + x)^λ − 1) ⁄ λ,                λ ≠ 0, x ≥ 0
        = log(x + 1),                          λ = 0, x ≥ 0
        = −((1 − x)^{2−λ} − 1) ⁄ (2 − λ),      λ ≠ 2, x < 0
        = −log(−x + 1),                        λ = 2, x < 0        (13)
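A small numpy sketch (not from the paper) of the Yeo–Johnson transformation in equation (13) follows; in practice the parameter λ would be chosen to improve model fit or normality, for which scipy.stats.yeojohnson offers a maximum-likelihood choice.

```python
# Yeo-Johnson transformation of equation (13); x may contain negative values.
import numpy as np

def yeo_johnson(x, lam):
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos, neg = x >= 0, x < 0
    if lam != 0:
        out[pos] = ((1.0 + x[pos]) ** lam - 1.0) / lam
    else:
        out[pos] = np.log1p(x[pos])                      # log(x + 1)
    if lam != 2:
        out[neg] = -(((1.0 - x[neg]) ** (2.0 - lam)) - 1.0) / (2.0 - lam)
    else:
        out[neg] = -np.log1p(-x[neg])                    # -log(-x + 1)
    return out
```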

TABLE 4 Number of observations for which the model in the row (Y1) yields a larger rating error than the model in the column (Y2), ie, |YTrue − Y1| > |YTrue − Y2|, for the linear (YL), intrinsically linear (YNL) and intrinsically linear + SVM (YSVM) models.

                     YL     YNL    YSVM
(a) Leave-bank-out
YL                    0     486    646
YNL                 280       0    363
YSVM                338     276      0

(b) Leave-one-out
YL                    —     480    639
YNL                 280       0    397
YSVM                328     277      0

Model performance is summarized by reporting the 0, 0–1, 0–2, …, notches difference between the internal model rating and the external rating. These performances are assessed out-of-sample using leave-one-out and leave-bank-out cross-validation.¹⁵ The first performance criterion removes in turn one observation from the training database, re-estimates the model parameters and evaluates the re-estimated model on the observation left out. The second performance criterion applies the same idea, removing in turn all the bank-year observations for the corresponding bank. These cross-validation performances allow use of the full data set for learning, while they provide a close estimate of the out-of-sample performance (Eisenbeis, 1977). The cumulative notch differences¹⁶ with respect to the external rating are reported in Table 3 for the three models (linear, univariate non-linear and univariate non-linear with SVM terms). Detailed performances for each rating range are reported in Table 5. It is seen that the performance increases gradually from the linear model, through the univariate non-linear model, to the univariate non-linear model with SVM terms. When one compares the number of observations in which model M_1 yields a larger error than model M_2, the improvement in performance between the different models is clearly evident in Table 4.

15 Leave-one-out cross-validation is also known as jack-knife cross-validation. Leave-bank-out cross-validation is a specific case of k-fold cross-validation in which each fold of the cross-validation contains all observations for a specific bank over multiple years.
16 For the sake of completeness, it is reported here that we also performed hold-out tests using two-thirds of the banks for training and assessing the performance on the remaining third. The resulting average cumulative notch differences on 100 random choices of the training and test sets were 36.8%, 63.9% and 81.1% for the linear model, 39.2%, 66.2% and 83.9% for the intrinsically linear model, and 41.2%, 69.0% and 85.4% for the intrinsically linear model with SVM terms on 0, 0–1 and 0–2 notches difference.
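The two cross-validation schemes can be sketched with scikit-learn as follows (this is an illustration, not the authors' code): leave-one-out removes one bank-year observation at a time, while leave-bank-out corresponds to leaving out one group per bank. The arrays X, y and the bank identifiers banks are hypothetical.

```python
# Sketch of leave-one-out and leave-bank-out cross-validation.
import numpy as np
from sklearn.model_selection import LeaveOneOut, LeaveOneGroupOut

def cv_predictions(model, X, y, splitter, groups=None):
    y_pred = np.empty_like(y)
    for train, test in splitter.split(X, y, groups):
        fitted = model.fit(X[train], y[train])      # re-estimate on the reduced set
        y_pred[test] = fitted.predict(X[test])      # evaluate on the left-out part
    return y_pred

# leave-one-out:   cv_predictions(model, X, y, LeaveOneOut())
# leave-bank-out:  cv_predictions(model, X, y, LeaveOneGroupOut(), groups=banks)
```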


TABLE 5 Leave-bank-out and leave-one-out distributions of the notch difference between the internal model rating and the external rating, for each rating range. Reference to Table 1 shows that the differences in deviance (leave-one-out/leave-bank-out) between the linear, intrinsically linear and intrinsically linear model with SVM terms are significant.

Linear
                          A…E      A…B–     C+…C–    D+…E
Number of observations    2996     707      993      1296
Leave-bank-out
  >2                      9.8%     0.0%     7.2%     17.3%
  2                       12.1%    0.0%     14.0%    17.2%
  1                       9.3%     6.2%     10.8%    10.0%
  0                       36.0%    38.8%    33.9%    36.2%
  −1                      16.1%    19.5%    17.2%    13.3%
  −2                      7.3%     14.9%    10.5%    0.7%
  <−2                     9.3%     20.7%    6.4%     5.4%
Leave-one-out
  >2                      10.0%    0.0%     7.3%     17.5%
  2                       12.1%    0.0%     13.9%    17.3%
  1                       9.6%     6.4%     11.0%    10.3%
  0                       35.0%    37.8%    33.2%    35.0%
  −1                      16.2%    19.1%    17.5%    13.7%
  −2                      7.5%     15.6%    10.6%    0.7%
  <−2                     9.6%     21.2%    6.5%     5.6%

Intrinsically linear
                          A…E      A…B–     C+…C–    D+…E
Number of observations    2996     707      993      1296
Leave-bank-out
  >2                      8.1%     0.0%     5.9%     14.1%
  2                       11.1%    0.1%     13.2%    15.5%
  1                       10.3%    6.8%     11.6%    11.2%
  0                       39.2%    36.2%    36.1%    43.1%
  −1                      14.4%    19.9%    15.9%    10.1%
  −2                      9.1%     16.4%    11.9%    2.9%
  <−2                     7.9%     20.5%    5.4%     3.0%
Leave-one-out
  >2                      8.2%     0.0%     6.0%     14.3%
  2                       11.1%    0.1%     13.3%    15.4%
  1                       10.5%    6.9%     11.9%    11.3%
  0                       38.4%    35.8%    35.0%    42.4%
  −1                      14.6%    19.9%    16.1%    10.4%
  −2                      9.0%     16.3%    12.0%    2.9%
  <−2                     8.2%     20.9%    5.6%     3.2%

Intrinsically linear + SVM
                          A…E      A…B–     C+…C–    D+…E
Number of observations    2996     707      993      1296
Leave-bank-out
  >2                      7.0%     0.0%     3.3%     13.7%
  2                       10.7%    0.3%     11.4%    16.0%
  1                       11.3%    8.8%     12.7%    11.6%
  0                       41.0%    40.5%    40.8%    41.5%
  −1                      14.8%    18.7%    16.6%    11.3%
  −2                      8.0%     14.6%    9.6%     3.2%
  <−2                     7.1%     17.3%    5.6%     2.7%
Leave-one-out
  >2                      7.3%     0.0%     3.5%     14.1%
  2                       10.8%    0.3%     11.4%    16.0%
  1                       11.6%    8.9%     13.4%    11.8%
  0                       40.1%    39.5%    39.5%    40.8%
  −1                      14.7%    18.7%    16.5%    11.2%
  −2                      8.0%     14.4%    9.7%     3.2%
  <−2                     7.5%     18.2%    6.0%     2.8%

TABLE 6 Leave-bank-out areas under the receiver operating characteristic (AUROC) curve.

Classifier            Linear    Intrinsically linear    Intrinsically linear + SVM
A vs B–E              85.7%     86.2%                   89.4%
A–B vs C–E            90.6%     91.1%                   92.8%
A–C vs D–E            93.2%     94.2%                   93.7%
A–D vs E              89.0%     90.8%                   91.2%
Average               89.6%     90.6%                   91.8%

z-statistic           Linear    Intr. linear    Intr. linear + SVM
Linear                —         −3.54           −5.36
Intr. linear          3.54      —               −3.03
Intr. linear + SVM    5.36      3.03            —

p-values              Linear    Intr. linear    Intr. linear + SVM
Linear                —         100%            100%
Intr. linear          0.0%      —               99.9%
Intr. linear + SVM    0.0%      0.1%            —

The ordinal multiclass AUROC is estimated as the average of the binary classifier AUROCs reported in the top panel. Applying the significance test for the difference between the classifiers (linear, intrinsically linear and intrinsically linear + SVM terms) for each of the binary classification problems, the average z-statistic is reported in the middle panel and the p-values of the significance test, with H_0 that classifier A (in the row) performs worse than classifier B (in the column), are reported in the bottom panel. It is seen that the non-linear classifiers yield significantly better results.

Applying a significance test¹⁷ with the null hypothesis of (pairwise) equal performance, implying equal numbers above and below the diagonal, yields the result that the differences are statistically significant at the 1% level. An alternative measure that is commonly used to compare binary classifiers is the area under the receiver operating characteristic (AUROC) curve, which is closely related to the accuracy ratio of the cumulative accuracy profile. For multiclass problems it is suggested that the AUROC be computed as an average over the individual binary classification problems (Hand and Henley, 1997). For the ordinal classification problem here, the binary classifiers A vs B–E, A–B vs C–E, A–C vs D–E and A–D vs E were considered. The AUROCs of the different binary classifiers are reported in Table 6. In this case too the same conclusion holds. Observe that the resulting average performance measure also depends on the rating distribution.

17 The test statistic used here to compare ordinal multiclass classifiers was similar to the McNemar significance test (Agresti, 2002) for binary classifiers.
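A hedged sketch (not the authors' code) of the ordinal multiclass AUROC used in Table 6: the AUROC of each binary split of the ordered rating classes is computed from the model scores and then averaged. The rating codes y (1 = best class) and the scores z are hypothetical arrays.

```python
# Ordinal multiclass AUROC as the average over the binary splits.
import numpy as np
from sklearn.metrics import roc_auc_score

def ordinal_auroc(y, z, m):
    aurocs = []
    for i in range(1, m):                        # split: classes 1..i vs i+1..m
        better = (y <= i).astype(int)            # 1 = counterpart in the better group
        aurocs.append(roc_auc_score(better, z))  # higher score should mean better rating
    return np.mean(aurocs), aurocs
```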


6 CONCLUSIONS

Linear logistic regression has become a standard modeling methodology that yields an easily readable model but is restricted to the assumption of a linear relationship between the financial ratios and the discriminant score. Univariate non-linear transformations were formulated by Box and Tidwell (1962) to first transform the explanatory ratios and then perform a linear regression on the transformed ratios. The transformations allow the inclusion of saturation effects and emphasize stress zones of the ratios – for example, where the total capital ratio is below 8%. Support vector machines are a recent, kernel-based modeling technique that is able to capture complex multivariate non-linearities in the data. Provided that the model is financially and statistically meaningful, applying a gradual approach in which one starts with a simple model and then improves it allows one to combine good model readability with improved model performance and generalization behavior. This was observed in our case study of the development of a rating model for assessing a bank's financial strength.

APPENDIX: SUPPORT VECTOR MACHINES

For the sake of completeness, the (primal) feature space formulation for SVMs is presented and we show how the corresponding dual optimization problem allows one to estimate and evaluate the classifier in terms of the kernel function, without explicit knowledge of the non-linear mapping x ↦ φ(x). The estimation of an explicit expression for the non-linear mapping is also described.

Primal–dual formulations

Consider a training set of N data points, {(x_i, y_i)}_{i=1}^{N}, with input data x_i ∈ R^n mapped into the feature space as φ(x_i) ∈ R^{n_f}, and corresponding binary class labels y_i ∈ {−1, +1}. When the data of the two classes are separable (Figure 7a), one can write

w^T φ(x_i) + b ≥ +1,   if y_i = +1
w^T φ(x_i) + b ≤ −1,   if y_i = −1

This set of two inequalities can be combined into a single set as follows:

y_i (w^T φ(x_i) + b) ≥ +1,   i = 1, …, N    (15)

As can be seen from Figure 7a, multiple solutions are possible. From the point of view of generalization, it is best to choose the solution with the largest margin, 2 ⁄ ‖w‖_2. In most practical, real-life classification problems, the data are non-separable in the linear or non-linear sense, due to the overlap between the two classes (see Figure 7b).


FIGURE 7 Illustration of SVM classification in two dimensions, (φ_1, φ_2), of the feature space. Left: separable case; right: non-separable case. The hyperplanes w^T φ(x) + b = +1, w^T φ(x) + b = 0 and w^T φ(x) + b = −1 are shown between the two classes C_1 and C_2; the margin between the hyperplanes w^T φ(x) + b = +1 and w^T φ(x) + b = −1 is equal to 2 ⁄ ‖w‖_2.

In such cases, one aims to find a classifier that separates the data as much as possible. The SVM classifier formulation (15) is extended to the non-separable case by introducing slack variables ξ_i ≥ 0 in order to tolerate misclassifications (Vapnik, 1998). The inequalities are changed into

y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,   i = 1, …, N    (16)

where inequality i is violated when ξ_i > 1. In the primal weight space, the optimization problem becomes

min_{w,b,ξ} J_P(w, ξ) = (1/2) w^T w + c Σ_{i=1}^{N} ξ_i    (17)

such that
y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i,   i = 1, …, N
ξ_i ≥ 0,   i = 1, …, N

www.journalofcreditrisk.com

56

Tony Van Gestel et al.

space as a constrained optimization problem, then to formulate the Lagrangian, take the conditions for optimality and finally to solve the problem in the dual space of Lagrange multipliers, which are also called support values. The Lagrangian is equal to N

N

N

i =1

i =1

i =1

L = 0.5wTw + c ∑ ξi – ∑ αi ( yi (wT(xi )) + b – 1 + ξ i ) – ∑ νi ξ i with Lagrange multipliers αi ≥ 0, νi ≥ 0 (i = 1,…, N). The solution is given by the saddle point of the Lagrangian max, minw,b, L(w, b, ; , ), with conditions for optimality          

N ∂L → w= α y  ( xi ) i =1 i i ∂w N ∂L → α y =0 i =1 i i ∂b ∂L → 0 ≤ α i ≤ c , i = 1, …, N ∂ ξi





(18)

Replacing (18) in (17) yields the following dual QP-problem max J D (  ) = − 

1 2

N



N

yi y j  ( x i ) T  ( xj ) α i α j +

i , j =1

∑α

i

i =1

1 = −  T Dy  Dy  + 1 T 2

(19)

with the vectors  = [α1,…, αn] T, 1 = [1,…, 1] T ∈ RN, y = [y1,…, yn] T ∈ RN, the diagonal matrix Dy = diag( y) ∈ RN×N and the kernel matrix  ∈ RN×N, where ij = (xi)T(xj) = K(xi , xj ) (i, j = 1,…, N ). The matrix  is positive (semi-) definite by construction. The bias term b is obtained as a by-product of the QP-calculation or from a non-zero support value. More generally, one obtains other SVM formulations (eg, for least-squares and logistic regression) using the same methodology. The solutions are obtained in the dual space, where the kernel matrix typically has the form  K ( x1, x1 ) K ( x1, x 2 )   K ( x 2 , x1 ) K ( x 2 , x 2 )  =      K(x , x ) K(x , x ) N 1 N 2 

… K ( x1, x1 ) … K ( x2 , x N )   … K(xN, xN )

       

(20)

with dimensions N × N. Journal of Credit Risk

Volume 1/ Number 4, Fall 2005

Linear and non-linear credit scoring by combining logistic regression and support vector machines

Estimation of the non-linear mapping

Given the data points {x_1, …, x_N} and the kernel function K, one can estimate the non-linear mapping φ(x) based on the eigenvalue decomposition of the kernel matrix Ω:

Ω = U Λ U^T    (21)

with U = [u_1, …, u_N] ∈ R^{N×N} and Λ = diag([υ_1, …, υ_N]) ∈ R^{N×N}. The elements φ_i of the mapping φ = [φ_1, …, φ_{n_f}]^T are estimated as follows (Suykens et al., 2002):

φ_i(x) = (√N ⁄ υ_i) Σ_{k=1}^{N} u_{ki} K(x_k, x),   i = 1, …, N    (22)

and φ_i(x) = 0 for υ_i = 0 or i ≥ N + 1. Using this estimate, it is easy to see that φ(x_i)^T φ(x_j) = K(x_i, x_j) for i, j = 1, …, N.

For large data sets, the computational requirements may become too high. The idea of the Nyström approximation is to estimate φ(·) on a (carefully) selected subsample of size M ≤ N from the data {x_i}_{i=1}^{N}. In fixed-size least-squares support vector machines, the Rényi entropy measure is used to select the subsample (Suykens et al., 2002). A solution similar to (22) is obtained with M ≤ N non-zero components. The computational complexity for (22) reduces from O(N³) to O(M³), while the memory requirements drop from O(N²) to O(M²). In practice, the performance does not depend too much on the selected sample if M is sufficiently large for the complexity of the problem.
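The following numpy sketch (not the authors' code) illustrates the Nyström idea behind equations (21)–(22): an explicit feature map is built from the eigendecomposition of the kernel matrix on a subsample of M landmark points. The normalization used here, φ(x) = K(x, ·) U_M Λ_M^{−1/2}, is a common choice and may differ from the scaling in equation (22) by a constant factor; with M = N it reproduces the kernel matrix exactly.

```python
# Nystrom-style finite-dimensional feature map from the kernel eigendecomposition.
import numpy as np

def rbf_kernel(A, B, sigma):
    """K(a, b) = exp(-||a - b||^2 / sigma^2), as in the RBF kernel of Section 3.1."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def nystrom_feature_map(landmarks, sigma):
    K_mm = rbf_kernel(landmarks, landmarks, sigma)
    vals, vecs = np.linalg.eigh(K_mm)                  # eigendecomposition, cf (21)
    keep = vals > 1e-10                                # drop numerically zero modes
    scale = vecs[:, keep] / np.sqrt(vals[keep])        # U_M Lambda_M^{-1/2}
    def phi(X):
        return rbf_kernel(X, landmarks, sigma) @ scale
    return phi

# Usage sketch: with M << N landmark points the memory cost drops from O(N^2) to O(M^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
phi = nystrom_feature_map(landmarks=X[:50], sigma=2.0)
K_approx = phi(X) @ phi(X).T                           # approximates the full kernel matrix
```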

REFERENCES

Agresti, A. (2002). Categorical data analysis. Wiley.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (eds), Proceedings of the 2nd International Symposium on Information Theory, pp. 267–81. Akademia Kiado, Budapest.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance 23, 589–609.
Atiya, A. F. (2001). Bankruptcy prediction for credit risk using neural networks: a survey and new results. IEEE Transactions on Neural Networks 12, 929–35.
Baesens, B., Setiono, R., Mues, C., and Vanthienen, J. (2003a). Using neural network rule extraction and decision tables for credit-risk evaluation. Management Science 49, 312–29.
Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. A. K., and Vanthienen, J. (2003b). Benchmarking state of the art classification algorithms for credit scoring. Journal of the Operational Research Society 54, 627–35.
Basel Committee on Banking Supervision (2003). The new Basel capital accord. Bank for International Settlements, Basel.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.
Box, G. E. P., and Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society, Series B, 26, 211–43.
Box, G. E. P., and Tidwell, P. W. (1962). Transformations of the independent variables. Technometrics 4, 531–50.
Crosbie, P., and Bohn, J. (2003). Modeling default risk, modeling methodology. Moody's KMV.
Cucker, F., and Smale, S. (2003). Best choices for regularization parameters in learning theory: on the bias-variance problem. In J. A. K. Suykens et al. (eds), Advances in learning theory: methods, models and applications, NATO Science Series III, Vol. 190, pp. 29–46. IOS Press.
De Servigny, A., and Renault, O. (2004). Measuring and managing credit risk. McGraw-Hill, New York.
Duda, R. O., Hart, P. E., and Stork, D. G. (2001). Pattern classification, Second edition. John Wiley and Sons.
Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association 70, 892–8.
Eisenbeis, R. A. (1977). Pitfalls in the application of discriminant analysis in business, finance and economics. Journal of Finance 32, 875–900.
Estrella, A., Peristiani, S., and Park, S. (2002). Capital ratios and credit ratings as predictors of bank failures. In M. K. Ong (ed.), Credit ratings, methodologies, rationale and default risk, Chapter 11, pp. 233–56. Risk Books, London.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–88.
Friedl, H., and Tilg, N. (1994). Variance estimates in logistic regression using the bootstrap. Communications in Statistics – Theory and Methods 24, 473–86.
Gilbert, R. A., Meyer, A. P., and Vaughan, M. D. (1999). The role of supervisory screens and econometric models in off-site surveillance. Review, Federal Reserve Bank of St Louis 81, 31–56.
Gilbert, R. A., Meyer, A. P., and Vaughan, M. D. (2000). The role of a CAMEL downgrade model in bank surveillance. Working paper 2000-021A, Federal Reserve Bank of St Louis.
Hand, D. J., and Henley, W. E. (1997). Statistical classification methods in consumer risk. Journal of the Royal Statistical Society, Series A, 160, 523–41.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical learning, data mining, inference, and prediction. Springer.
Hornik, K., Stinchcombe, M., and White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks 2, 359–66.
Hutchinson, J. M., Lo, A. W., and Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. Journal of Finance 49, 851–89.
Fan, J., and Yao, Q. (2003). Nonlinear time series: non-parametric and parametric methods. Springer-Verlag, New York.
Federal Financial Institutions Examination Council (1979). Uniform financial institutions rating system. Examining circular 159.
Jeffreys, H. (1961). Theory of probability. Oxford University Press.
Kaminsky, G., and Reinhart, C. (1999). The twin crises: the causes of banking and balance payment problems. American Economic Review 89(3), 473–500.
Kealhofer, S. (2002). The economics of the banks and off the loan book. Research report, Moody's KMV.
Keerthi, S. S., Duan, K., Shevade, S. K., and Poo, A. N. (2002). A fast dual algorithm for kernel logistic regression. In Proceedings ICML 2002, pp. 299–306. Morgan Kaufmann Publishers.
Kocagil, A. E., Reyngold, A., Stein, R. M., and Ibarra, E. (2002). Moody's RiskCalc™ model for privately-held U.S. banks. Global Credit Research, Moody's Investors Service.
Le Bras, A., and Andrews, D. (2003). Bank rating methodology. Criteria report, Fitch Ratings.
Lim, T. S., Loh, W. Y., and Shih, Y. S. (2000). A comparison of prediction accuracy, complexity and training time of thirty-three old and new classification algorithms. Machine Learning 40, 203–28.
McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, Series B (Methodological), 42, 109–42.
McCullagh, P., and Nelder, J. A. (1989). Generalized linear models. Chapman and Hall, London.
Meyer, P., and Pifer, H. W. (1970). Prediction of bank failures. Journal of Finance 25(4), 853–67.
Merton, R. (1974). On the pricing of corporate debt: the risk structure of interest rates. Journal of Finance 29, 449–52.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research 18, 109–31.
Refenes, A. P. (ed.) (1995). Neural networks in the capital markets. John Wiley and Sons, Chichester, UK.
Resti, A. (2002). Replicating agency rating through multinomial scoring models. In M. K. Ong (ed.), Credit ratings, methodologies, rationale and default risk, Chapter 10, pp. 213–32. Risk Books, London.
Ripley, B. D. (1996). Pattern recognition and neural networks. Cambridge University Press.
Saunders, A., and Allen, L. (2002). Credit risk measurement. Wiley, New York.
Schölkopf, B., and Smola, A. (2002). Learning with kernels. MIT Press, Cambridge, MA.
Schölkopf, B., Tsuda, K., and Vert, J.-P. (eds) (2004). Kernel methods in computational biology. MIT Press, Cambridge, MA.
Serrano-Cinca, C. (1997). Feedforward neural networks in the classification of financial information. European Journal of Finance 3, 183–202.
Soberhart, J. R., Stein, R. M., Mikityanskaya, V., and Li, L. (2000). Moody's public firm risk model: a hybrid approach to modelling default risk. Internal report, Moody's Investors Service rating methodology.
Standard & Poor's (2004). FI criteria, bank rating analysis methodology profile. Research report, March.
Suykens, J. A. K., and Vandewalle, J. (1999). Least squares support vector machine classifiers. Neural Processing Letters 9(3), 293–300.
Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., and Vandewalle, J. (2002). Least squares support vector machines. World Scientific, Singapore.
Thomas, L. C., Edelman, D. B., and Crook, J. N. (2000). Credit scoring and its applications. SIAM Monographs on Mathematical Modeling and Computation, Philadelphia, US.
Van Gestel, T., Suykens, J. A. K., Baestaens, D., Lambrechts, A., Lanckriet, G., Vandaele, B., De Moor, B., and Vandewalle, J. (2001). Predicting financial time series using least squares support vector machines within the evidence framework. IEEE Transactions on Neural Networks, Special Issue on Financial Engineering, 12(4), 809–21.
Van Gestel, T., Suykens, J. A. K., Lanckriet, G., Lambrechts, A., De Moor, B., and Vandewalle, J. (2002). A Bayesian framework for least squares support vector machine classifiers. Neural Computation 15, 1115–47.
Van Gestel, T., Suykens, J. A. K., Baesens, B., Viaene, S., Vanthienen, J., Dedene, G., De Moor, B., and Vandewalle, J. (2004). Benchmarking least squares support vector machine classifiers. Machine Learning 54, 5–32.
Van Gestel, T., Baesens, B., Suykens, J. A. K., Van den Poel, D., Baestaens, D. E., and Willekens, M. (2005). Bayesian kernel based classification for financial distress prediction. European Journal of Operational Research, in press.
Vapnik, V. (1998). Statistical learning theory. John Wiley, New York.
Wheelock, D. C., and Wilson, P. W. (2000). Why do banks disappear? The determinants of US bank failures and acquisitions. Review of Economics and Statistics 82(1), 127–38.
Wilcox, J. W. (1971). A simple theory of financial ratios as predictors of failure. Journal of Accounting Research 9(2), 389–95.
Williams, C. K. I., and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In T. K. Leen, T. G. Diettrich and V. Tresp (eds), Advances in neural information processing systems, vol. 13, pp. 682–8. MIT Press.
Wang, L. (ed.) (2005). Support vector machines: theory and applications. Studies in Fuzziness and Soft Computing, 177, Springer.
Yeo, I. K., and Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika 87, 949–59.