Smart Engineering System Design, 4:195–204, 2002. Copyright © 2002 Taylor & Francis. 1025-5818/02 $12.00 + .00. DOI: 10.1080/10255810290008090

The Limited Coupling Approximation with Application to CMAC Networks

Eric J. Barth, Vanderbilt University, Department of Mechanical Engineering, Nashville, Tennessee, USA

Nader Sadegh George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA

The issue of training a neural network to approximate a multivariate function is addressed at length in the current literature. The lack of a non-iterative or closed-form solution for the weights of sigmoid neural networks motivates an alternate view of neural networks. It has long been known that Cerebellar Model Articulation Controller-type networks are able to approximate multivariate functions, as well as being directly programmable given a training data set. CMAC networks, however, suffer from the curse of dimensionality: the number of receptive field functions that must be employed grows exponentially as the dimension of the function increases. The work in this paper studies the use of networks that limit the number of dimensions onto which high-dimensional functions must be projected for approximation purposes. While preserving the direct programmability of CMAC networks, the work here uses frequency information about the function to limit the number of couplings used in the approximation. It is shown that for bandlimited functions, the number of network parameters or weights required to achieve zero approximation error has only a polynomial growth with dimension. At the conclusion of the paper, a procedure is given for estimating the size of a CMAC network based on the frequency bandwidth of the function to be approximated.

Keywords: neural networks, cerebellar model articulation controller (CMAC), curse of dimensionality, dimensionality reduction, frequency domain properties, multivariate function approximation

NEURAL AND CMAC NETWORKS

Neural networks are often used to analyze, model, or approximate high-dimensional multivariate functions. Networks such as sigmoid networks, and others using one-dimensional activation functions, project high-dimensional spaces onto several single-dimensional spaces, which are nonlinearly activated and summed together. The well-known problem with this approach is not the universal approximation capability of the network [1]; the problem is training. Training is difficult because there is presently no way of knowing the optimal projections in closed form. This leads to the difficult problem of nonlinear optimization, usually in the form of iterative procedures such as gradient descent methods, quasi-Newton methods, the Gauss-Newton method, the Levenberg-Marquardt method, simulated annealing, trust region methods, sequential quadratic programming (SQP), the goal attainment method, or genetic algorithm (GA)

Received 24 January 2001; accepted 21 November 2001. Address correspondence to Eric J. Barth, Vanderbilt University, Department of Mechanical Engineering, VU Station B 351592, 2301 Vanderbilt Place, Nashville, Tennessee 37235-1592. E-mail: [email protected]

techniques. Consequently, much of the literature is dedicated to investigating the problem of training. The effort to produce a reliable training scheme without unreasonable computational requirements has been immense. Many works analyze the internal representations of the network in an effort to understand which projections of the high-dimensional domain yield the optimal decomposition. Approaches in the literature range from studying the linear independence of such internal representations to developing various types of iterative and constructive algorithms for network synthesis. Examples of purely iterative algorithms can be found in [2, 3, 4]. The constructive algorithms employ procedures that place basis functions either one at a time [5, 6, 7] or iteratively in order to optimize the network layer by layer [8, 9, 10]. The most widely used training algorithms in the literature are of the iterative type based on local minimization techniques, such as linear least squares, gradient descent, modified gradient descent, and various mixtures of these and other similar methods [11, 12, 13, 14, 15]. Other training algorithms range from those that select hidden layer nodes based on statistical hypothesis testing [16] to those casting the training problem in the light of optimal control and dynamic programming [17, 18]. While all of these works have areas of applicability and a certain amount of success, they all leave the same thing to be desired: an exact understanding and closed-form optimization of internal representations.

One work that sheds light on the intractability of the training problem is that by Hush [19], who shows that finding the weights for a sigmoid node is NP-hard; the consequence is that constructive algorithms of the type so prized and sought after quickly become computationally intractable as the number of dimensions increases. Indeed, the area of projection pursuit [20, 21], which is expressly concerned with lower-dimensional projections of high-dimensional data, admits the drawback of high demand on computer time as the number of dimensions increases [22]. It may therefore be said that sigmoid and other neural networks with single-dimension activation functions can approximate high-dimensional functions with relatively few activation functions, but are difficult to train.

CMAC (Cerebellar Model Articulation Controller) networks have exactly the opposite problem. CMAC networks are directly programmable given a data set, but the number of basis functions, or receptive field functions as they are often called, grows exponentially with dimension [23]. Whereas sigmoid neural networks have global basis functions, CMAC networks typically have local basis functions, which can give CMACs the desirable property of local learning: what is learned in one region of the domain will not alter what has been learned in another region. These two fields of study represent two extremes; neural networks project a high-dimensional function onto single-dimensional hyperplanes, while CMAC networks divide the entire n-dimensional domain into N^n hypercubes with local overlapping n-dimensional basis functions.
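As an illustration of the trade-off described above, the sketch below directly "programs" a CMAC-like grid approximator from a data set in a single pass, with no iterative optimization. It is a deliberately minimal sketch, not the authors' network: it uses piecewise-constant (nearest-cell) receptive fields rather than overlapping ones, and the grid resolution, helper names, and per-cell averaging are illustrative assumptions. The point it demonstrates is the one made in the text: programming is trivial, but the weight table has N^n entries.

```python
import numpy as np


def program_cmac(samples, targets, n_cells, lo=0.0, hi=1.0):
    """Directly 'program' a piecewise-constant grid approximator from data.

    samples: (m, n) array of inputs in [lo, hi]^n
    targets: length-m sequence of function values
    Returns a weight table with n_cells**n entries -- the exponential
    growth with dimension n discussed in the text.
    """
    samples = np.asarray(samples, dtype=float)
    n = samples.shape[1]
    table = np.zeros((n_cells,) * n)    # N^n weights
    counts = np.zeros((n_cells,) * n)
    # Quantize each sample to its grid cell and average the targets per cell:
    idx = np.clip(((samples - lo) / (hi - lo) * n_cells).astype(int),
                  0, n_cells - 1)
    for cell, y in zip(map(tuple, idx), targets):
        table[cell] += y
        counts[cell] += 1
    nonzero = counts > 0
    table[nonzero] /= counts[nonzero]
    return table


def evaluate(table, x, lo=0.0, hi=1.0):
    """Look up the programmed value for input x (one cell, no blending)."""
    n_cells = table.shape[0]
    cell = tuple(np.clip(((np.asarray(x) - lo) / (hi - lo) * n_cells).astype(int),
                         0, n_cells - 1))
    return table[cell]


# Programming is a single pass over the data, but the table explodes with n:
for n in (2, 4, 8):
    print(n, 10 ** n)    # number of weights for N = 10 cells per axis
```

Note the design choice that makes training easy: because each receptive field is local, each data point touches only its own cell, so the weights can be written directly rather than optimized, at the cost of the exponential table size.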
These observations motivate the main question: is there some middle ground between these two ideas that possesses easy training while remaining moderate in network size? The answer turns out to be affirmative, as long as the function to be approximated is bandlimited. In particular, it is shown that such functions can be approximated arbitrarily closely by a superposition of a finite number of CMAC networks, where the number of network parameters grows polynomially rather than exponentially with the dimension of the problem.

The paper is organized as follows. The next section presents a brief overview of the mathematical framework necessary for the analysis and defines the property of coupling degree. The section after that presents the core of the analysis and shows a correspondence between the coupling degree and the spatial frequency content of a bandlimited function. The following section applies the results to a CMAC network structure; specifically, a reduced-coupling CMAC network is introduced, whereby the concepts of coupling degree and the frequency bandlimit of a function result in a polynomial growth rate with dimension. A procedure is then introduced for estimating the size of a CMAC network based on the frequency bandwidth of the function to be approximated. The section before the conclusions presents an illustrative example.
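The growth rates contrasted above can be previewed with a back-of-the-envelope count. Suppose, as a simplified reading of the reduced-coupling idea, that the full n-dimensional CMAC (N^n weights) is replaced by a superposition of CMAC sub-networks, one per subset of at most d of the n inputs. The counting formula below is an illustrative assumption for this preview, not the paper's exact network size; the coupling degree d and resolution N are hypothetical parameters.

```python
from math import comb


def full_cmac_weights(n, N):
    """Weights in one CMAC over all n inputs: N^n, exponential in n."""
    return N ** n


def limited_coupling_weights(n, N, d):
    """Weights in a superposition of sub-networks, one per input subset of
    size at most d: sum over k <= d of C(n, k) * N^k.
    For fixed d this is polynomial in n (degree d)."""
    return sum(comb(n, k) * N ** k for k in range(d + 1))


# Compare the two counts for N = 10 cells per axis, coupling degree d = 2:
for n in (4, 8, 16):
    print(n, full_cmac_weights(n, 10), limited_coupling_weights(n, 10, 2))
```

Even this crude count shows the qualitative claim of the paper: for a fixed coupling degree the parameter count grows like n^d rather than N^n.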

MATHEMATICAL FRAMEWORK

Before beginning the analysis, it is useful to first define the mathematical framework within which the analysis will be made. Consider a subset of Lebesgue measurable multi-dimensional functions f:
