### Chapter 16 - Heuristics for Improving Generalization

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II

## 16.2 Regularization

A problem is said to be ill-posed if small changes in the given information cause large changes in the solution. This instability with respect to the data makes solutions unreliable because small measurement errors or uncertainties in parameters may be greatly magnified and lead to wildly different responses. In contrast, a problem is well-posed if (i) it has a solution, (ii) the solution is unique, and (iii) the solution varies continuously with the given data. Violation of any of these conditions makes the problem ill-posed [370].

The idea behind regularization is to use supplementary information to restate an ill-posed problem in a stable form. The result will be a well-behaved, but approximate, solution of the original problem. Ideally, the bias introduced by the approximation will be more than offset by the gain in reliability. In general, domain-specific knowledge will be needed to stabilize a problem without changing it fundamentally.

Regularization has been studied extensively for linear systems. The book by Tikhonov and Arsenin [370] is a classic reference. In the context of learning from limited data, generalization is an unrealistic goal unless additional information is available beyond the training samples. One of the least restrictive assumptions is that the target function is smooth, that is, that small changes in the input do not cause large changes in the output. Given two functions that fit the data equally well, we tend to prefer the smoother one because it is somehow simpler or more efficient. This bias is embedded in the learning algorithm by adding terms to the cost function to penalize nonsmooth solutions. In addition to the usual term E₀ measuring the approximation error, we add a term ω(y) that measures how well the approximation function y(x) conforms to our preferences:

$$E = E_0 + \lambda\,\omega(y) \tag{16.1}$$
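As a minimal sketch of the penalized cost in equation 16.1 (the function names and the use of a sum-of-squares error for E₀ are illustrative assumptions, not specified in the text):

```python
import numpy as np

def regularized_cost(y_pred, y_true, omega, lam):
    """Total cost E = E0 + lambda * omega, in the spirit of eq. 16.1.

    E0 is a sum-of-squares approximation error (one common choice);
    omega is a penalty measuring how far the fit departs from our
    preferences (e.g., a smoothness penalty); lam is the
    regularization parameter balancing the two.
    """
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    e0 = 0.5 * np.sum((y_pred - y_true) ** 2)  # approximation error E0
    return e0 + lam * omega                    # E = E0 + lambda * omega
```

With λ = 0 this reduces to pure error minimization; increasing λ trades approximation accuracy for conformance to the external constraint.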

The regularizing parameter λ balances the trade-off between minimizing the approximation error and conforming to the external constraints. A regularizer favoring smooth functions is [300]

$$\omega(y) = \frac{1}{2} \int \left\| P\,y(\mathbf{x}) \right\|^2 \, d\mathbf{x} \tag{16.2}$$

where P is a differential operator. This rewards smooth functions (whose derivatives are small, on average) and penalizes nonsmooth functions (those with large derivatives).
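A discrete stand-in for this penalty, taking P to be the first-derivative operator d/dx and approximating the integral with finite differences on sampled function values (the specific discretization is an assumption for illustration):

```python
import numpy as np

def smoothness_penalty(y, dx=1.0):
    """Finite-difference approximation of (1/2) * integral (dy/dx)^2 dx,
    i.e., eq. 16.2 with P = d/dx on uniformly spaced samples y.

    Functions with large derivatives (nonsmooth fits) incur a large
    penalty; a constant function incurs zero penalty.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y) / dx               # forward-difference derivative
    return 0.5 * np.sum(dy ** 2) * dx  # Riemann-sum approximation
```

A flat sequence yields zero penalty, while a zigzag of the same range is penalized heavily, matching the intent of the smoothness prior.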

Regularization can be cast in a Bayesian framework [274], [140]. Equation 16.2, for example, corresponds to a prior in equation 15.6

$$P(y) = \frac{1}{Z}\, e^{-\lambda\,\omega(y)} \tag{16.3}$$

Approximation with radial basis functions (which are linear in their output weights) is equivalent to classical regularization under certain conditions [274], [140]. Radial basis functions, however, form mostly local internal representations and therefore usually do not generalize as well as sigmoid networks (e.g., [56]). Curvature-driven smoothing using second derivative information as a means of improving generalization in radial basis function nets is discussed by Bishop [42].

Regularization provides a way of biasing the learning algorithm, but its success depends on choosing an appropriate value for the regularization parameter λ, which determines how strong the bias should be. Many of the other proposed heuristics contain a similar parameter balancing the need to minimize training error against other constraints. The parameter has an important effect on the eventual solution and is usually determined by criteria such as cross-validation. Although not discussed here, it is often useful to change the parameter dynamically because overfitting is usually not a problem until the later stages of learning. In many cases, it helps to impose the constraints only after the network has made some progress in reducing the initial error. In difficult problems, for example, there may be long periods before the network makes any significant progress. If a strong weight decay rule were in force during this period, the network might never escape from the initial set of weights around w = 0.
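One simple way to delay the constraint, sketched below, is to hold λ at zero until the training error has fallen below some fraction of its initial value and only then switch the penalty on. The function name, the threshold rule, and all parameter values here are assumptions for illustration; the text does not prescribe a particular schedule.

```python
def lambda_schedule(error, initial_error, lam_max=0.1, threshold=0.5):
    """Illustrative dynamic schedule for the regularization parameter.

    Returns 0 while training error remains above `threshold` times its
    initial value, so that a strong penalty (e.g., weight decay) cannot
    trap the network near w = 0 before it has made progress; afterwards
    it returns the full strength lam_max.
    """
    if error > threshold * initial_error:
        return 0.0      # early phase: no penalty yet
    return lam_max      # later phase: full regularization
```

In practice the switch could be softened (e.g., ramping λ up gradually), but the principle is the same: let the network first escape the initial weight region, then apply the bias.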