Chapter 4 - MLP Representational Capabilities

4.2 Universal Approximation Capabilities

The examples above suggest that any bounded function can be approximated with arbitrary accuracy if enough hidden units are available. These ideas have been extended theoretically to show that multilayer perceptrons are universal approximators. That is, they are capable of arbitrarily accurate approximation of essentially arbitrary continuous mappings from the [-1, +1]^n hypercube to the (-1, 1) interval. This is an important result because it says neural networks, as a class, are powerful enough to implement essentially any function we require.

4.2.1 Kolmogorov's Theorem, One Hidden Layer Is Sufficient

A somewhat surprising result is that two hidden layers are not necessary for universal approximation; one hidden layer is sufficient. Kolmogorov [219], [220] showed that a continuous function of several variables can be represented exactly by the superposition of continuous one-dimensional functions of the original input variables. A refined proof by Sprecher [355] is often cited. Hecht-Nielsen [161] introduced the result to the neural network community and showed an implementation in the form of a single-hidden-layer network. Briefly, the result is that any continuous function mapping an n-dimensional input, n ≥ 2, to an m-dimensional output can be implemented exactly by a network with one hidden layer. For φ: I^n → R^m, φ(x) = y, where I is the closed interval [0, 1] and therefore I^n is an n-dimensional hypercube, the function φ can be implemented exactly by a single-hidden-layer neural network having n elements in the input layer, (2n + 1) elements in the middle layer, and m elements in the output layer. That is,

(4.1)   z_j = Σ_{k=1}^{n} λ^k ψ(x_k + jε) + j,   j = 1, ..., 2n + 1

(4.2)   y_i = Σ_{j=1}^{2n+1} g_i(z_j),   i = 1, ..., m

where λ is a real constant, ψ is a continuous real monotonic increasing function independent of φ but dependent on n, and ε is a rational number, 0 < ε ≤ δ, where δ is an arbitrarily chosen positive constant. The output node functions g_i are real and continuous and depend on φ and ε. Unfortunately, the proof is not constructive; it does not say how weights should be chosen to implement a particular mapping or specify the functions g_i (different for each i), which have unknown and often extremely nonlinear shapes.
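
For concreteness, the structure prescribed by equations 4.1 and 4.2 can be sketched directly in code. Because the theorem is non-constructive, the inner function ψ, the output functions g_i, and the constants λ and ε used below are arbitrary placeholders (tanh, sin, cos, 0.5, and 0.01), not the functions and constants whose existence the theorem asserts.

```python
import numpy as np

def kolmogorov_network(x, psi, g, lam=0.5, eps=0.01):
    """Structural sketch of the network in equations 4.1 and 4.2.

    x   : length-n input vector with components in [0, 1]
    psi : stand-in for the (unknown) continuous monotone inner function
    g   : list of m stand-ins for the (unknown) output functions g_i
    lam : stand-in value for the real constant lambda
    eps : stand-in value for the rational constant epsilon, 0 < eps <= delta
    """
    n = len(x)
    # Hidden layer: 2n + 1 units computing z_j as in equation 4.1
    z = [sum(lam ** k * psi(x[k - 1] + j * eps) for k in range(1, n + 1)) + j
         for j in range(1, 2 * n + 2)]
    # Output layer: m units computing y_i as in equation 4.2
    return [sum(g_i(z_j) for z_j in z) for g_i in g]

# Usage with the placeholder functions (n = 2 inputs, 2n + 1 = 5 hidden units, m = 2 outputs)
y = kolmogorov_network([0.2, 0.7], psi=np.tanh, g=[np.sin, np.cos])
```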

Girosi and Poggio [139] point out that the theorem might not be relevant because the inner functions ψ are extremely nonsmooth. (The theorem fails when restricted to smooth functions.) The functions are very unlike the sigmoids normally used in neural networks and not likely to be learnable because of their extreme roughness. Further, the functions g_k depend on φ, are all different, and do not form a parameterized family. They are likely to be at least as complex as φ, if not more so, and therefore no easier to learn than φ itself. In effect, the complexity of approximating the original function is shifted into the task of finding the internal node functions.

In more recent work, Sprecher [356] describes a stronger version of the theorem using translates of a single function in place of the multiple internal functions g_k. The replacement function remains nonsmooth, however, so the objections remain.

Kurkova [231] notes that although exact representation is not possible with smooth internal functions, approximation is, and goes on to show universal approximation properties of sigmoidal networks with two hidden layers. But other work [244] suggests that approximation of the internal functions does not yield approximation of the target function.

4.2.2 Other Proofs of Universal Approximation Capability

Several proofs are based on showing that single-hidden-layer networks can implement Fourier transforms and thus have the same approximation capabilities. The idea is that a sinusoid can be implemented by sums and differences of many shifted sigmoids, one for each half cycle. Irie and Miyake [186] give a proof based on Fourier integrals where the number of nodes required for exact approximation is infinite. Gallant and White [132] give another Fourier transform proof using monotone "cosine-squasher" sigmoids. Funahashi [131] approximates the integral representation of Irie and Miyake with a discrete sum and shows that networks with at least one hidden layer of sigmoid units are universal approximators.
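
The half-cycle idea can be made concrete with a crude numerical sketch: sin(x) is approximated by an alternating sum of logistic sigmoids, one centered on each zero crossing (one per half cycle). The gain, the interval, and the function names below are arbitrary illustration choices, and the fit is deliberately rough.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sinusoid_from_sigmoids(x, half_cycles=8, gain=2.0):
    """Approximate sin(x) by sums and differences of shifted sigmoids:
    each zero crossing k*pi contributes a smoothed step of height +/-2,
    and the running sum is offset by -1 so it starts near sin(0) = 0."""
    y = -np.ones_like(x)
    for k in range(half_cycles):
        y += (-1.0) ** k * 2.0 * sigmoid(gain * (x - k * np.pi))
    return y

x = np.linspace(0.0, 6.0 * np.pi, 400)
print("max |sin(x) - approximation|:",
      np.max(np.abs(np.sin(x) - sinusoid_from_sigmoids(x))))
```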

Cybenko [95], [96] provided one of the first rigorous proofs of universal approximation capability, showing that a network with linear output units and a single hidden layer of sigmoid units is sufficient for uniform approximation of any continuous function in the unit hypercube. Hornik, Stinchcombe, and White [177] and Stinchcombe and White [358] show that standard multilayer networks using arbitrary squashing functions are universal approximators; single-hidden-layer nets are included as a special case. Approximation of a function and its derivatives using arbitrary bounded nonconstant squashing functions is discussed by Hornik [175]. Other work on approximation includes [162], [48], [188], [226], [176].
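
Cybenko's result concerns networks of the form f(x) = Σ_i α_i σ(w_i·x + θ_i): a single hidden layer of sigmoid units feeding a linear output. The sketch below exhibits that form by fitting only the output weights α_i of randomly placed sigmoid units to an arbitrary continuous target by least squares; the random-feature shortcut, the target function, and all constants are illustrative choices, not the construction used in the cited proofs.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# An arbitrary continuous target on [0, 1]
def target(x):
    return np.sin(2.0 * np.pi * x) + 0.5 * x

# Single hidden layer: M sigmoid units with randomly chosen weights and biases.
M = 50
w = rng.uniform(-20.0, 20.0, size=M)
b = rng.uniform(-20.0, 20.0, size=M)

x = np.linspace(0.0, 1.0, 200)
H = sigmoid(np.outer(x, w) + b)                        # hidden activations, shape (200, M)
alpha, *_ = np.linalg.lstsq(H, target(x), rcond=None)  # linear output weights

print("max |f(x) - target(x)|:", np.max(np.abs(H @ alpha - target(x))))
```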

In most cases, these are existence proofs rather than constructive recipes for finding a solution to a particular problem. These results do not guarantee that a chosen optimization method will find the necessary weights or that the solution found will be efficient. A constructive proof based on the Radon transform is given by Carroll and Dickson [64].

It should be noted that universal approximation is not a rare property. Many other systems have similar capabilities: polynomials, trigonometric polynomials (e.g., Fourier series), kernel regression systems, wavelets, and so on. By itself, this property does not make neural networks special. The results are important because they show that neural networks are powerful enough to approximate most functions that people find interesting. The lack of a universal approximation capability, on the other hand, would be bad news; neural networks would then be too weak for many problems and therefore much less appealing.

Given that there are many universal approximation systems, the choice of one over another depends on other factors such as efficiency, robustness, and so on. Barron [21], [22], [23] addresses the problem of how the MLP approximation error scales with the number of training samples and the number of parameters. One important result is that the estimation error decreases like O((Mp/N) log N) as the number of training samples N increases. Another result is that the approximation error decreases like O(1/M) as a function of M, the number of hidden nodes. Unlike other systems (e.g., polynomials), this rate is independent of the input dimension and appears to avoid the "curse of dimensionality" (see [317], [178] for a skeptical view, however). These results can be used to put bounds on the necessary number of hidden nodes (in one hidden layer) and provide another justification for the rule of thumb that the number of training samples should be larger than the number of parameters divided by the desired approximation error, N > O(Mp/ε). Here N is the number of samples, M is the number of hidden nodes, p is the input dimension (so Mp is approximately the number of parameters), and ε is the desired approximation error.
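
As a worked example of this rule of thumb, with hypothetical numbers not taken from the text:

```python
M, p, eps = 50, 10, 0.05   # hidden nodes, input dimension, desired error (hypothetical values)
n_params = M * p           # Mp, roughly the number of weights
print("rule of thumb: N >", n_params / eps)   # suggests N > 10000 training samples
```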

One Hidden Layer Is Not Always Enough The limits of these proofs are sometimes forgotten. They say that a sufficiently large one-hidden-layer network is capable of essentially arbitrarily accurate approximation of continuous functions over compact regions. Although this covers most functions we would like to approximate, it may not include them all. Sontag [350], [351] points out that there are certain functions, important in inverse control, which cannot be approximated by single-hidden-layer nets with any number of units but which can be implemented rather simply with two hidden layers. That is, there are control systems that can be stabilized by a two-hidden-layer network but not by a one-hidden-layer net. The difference depends not on the number of units needed to achieve a certain numerical accuracy but rather on the need to approximate discontinuous functions that may arise as one-sided inverses of continuous functions.