
Chapter 2 - Supervised Learning

Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology

2.1 Objective Functions

As noted, the role of the objective function is to measure how well the network performs the intended task. The function defines the difference between good or bad performance and thus guides the search for a solution. It has a fundamental effect on the outcome so it is important to choose a function that accurately reflects our design goals.

A few standard error functions are commonly used. The most common is the sum of squared errors (SSE),

$$E_{\mathrm{SSE}} \;=\; \sum_{p}\sum_{i}\bigl(t_{pi}-y_{pi}\bigr)^{2} \qquad (2.1)$$

where p indexes the patterns in the training set, i indexes the output nodes, and $t_{pi}$ and $y_{pi}$ are, respectively, the target and actual network output for the ith output node on the pth pattern. This is the sum of the squared errors on each training pattern. The mean-squared-error (MSE)

$$E_{\mathrm{MSE}} \;=\; \frac{1}{PN}\sum_{p}\sum_{i}\bigl(t_{pi}-y_{pi}\bigr)^{2} \qquad (2.2)$$

normalizes ESSE for the number of training patterns P and the number of network outputs N. The logarithmic or cross-entropy error function

$$E \;=\; -\sum_{p}\sum_{i}\Bigl[\,t_{pi}\ln y_{pi} \;+\; (1-t_{pi})\ln(1-y_{pi})\,\Bigr] \qquad (2.3)$$

is often used for classification problems where the network output is interpreted as the probability that the input pattern belongs to a certain class. Here $y_{pi}$ is the estimated probability that pattern p belongs to class i and $t_{pi} \in \{0,1\}$ is the target. Other functions have been developed for various applications.
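As a concrete illustration, the sketch below shows how these three error measures might be computed with NumPy. The array names `targets` and `outputs` (each of shape P x N) and the small `eps` guard on the logarithms are assumptions of this sketch, not part of the text.

```python
import numpy as np

def sse(targets, outputs):
    """Sum-of-squared-errors, equation (2.1): sum over patterns p and outputs i."""
    return np.sum((targets - outputs) ** 2)

def mse(targets, outputs):
    """Mean-squared-error, equation (2.2): SSE normalized by P patterns and N outputs."""
    P, N = targets.shape
    return np.sum((targets - outputs) ** 2) / (P * N)

def cross_entropy(targets, outputs, eps=1e-12):
    """Cross-entropy error, equation (2.3), for 0/1 targets and outputs in (0, 1).
    eps guards the logarithms against outputs of exactly 0 or 1 (an assumption
    of this sketch; the text does not address the degenerate cases)."""
    y = np.clip(outputs, eps, 1.0 - eps)
    return -np.sum(targets * np.log(y) + (1.0 - targets) * np.log(1.0 - y))
```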

Each of these functions carries assumptions, including assumptions about the distribution of fitting errors that arise given the model and the data. In a statistical setting the mean-squared-error function, for example, corresponds to a maximum-likelihood model under the assumption that errors have a Gaussian distribution (see section 15.2). The logarithmic error function corresponds to a classification model and the assumption of a binomial error distribution. Reasonable performance can be expected if these assumptions match reality, but poor performance may result if they do not. More details can be found in [328], as well as numerous statistics texts.
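To make the Gaussian connection concrete, here is a brief sketch, under the assumption of independent Gaussian fitting errors with fixed variance sigma squared, of why maximizing the likelihood is equivalent to minimizing the squared-error sum:

```latex
% Sketch: assume t_{pi} = y_{pi} + \epsilon_{pi} with independent
% Gaussian noise \epsilon_{pi} \sim N(0, \sigma^2), \sigma fixed.
\begin{align*}
  \mathcal{L} &= \prod_{p}\prod_{i}
      \frac{1}{\sqrt{2\pi\sigma^{2}}}
      \exp\!\left(-\frac{(t_{pi}-y_{pi})^{2}}{2\sigma^{2}}\right), \\
  -\ln\mathcal{L} &= \frac{1}{2\sigma^{2}}
      \sum_{p}\sum_{i}\bigl(t_{pi}-y_{pi}\bigr)^{2} \;+\; \mathrm{const}.
\end{align*}
% Maximizing the likelihood over the network weights therefore minimizes
% E_{SSE} (and hence E_{MSE}, which differs only by the constant factor 1/PN).
```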

These standard functions are convenient to use and well understood. Advantages include easy differentiability and the independence of individual error terms: all deviations of equal magnitude have equal costs, regardless of the input pattern, the sizes of other errors, the trend of previous errors, and so on. These properties simplify analysis considerably and allow valuable theoretical study that would not be possible otherwise. Even so, more idiosyncratic functions may be useful in applications where errors of similar numerical magnitude have quite different costs depending on the input pattern and other factors. These considerations are completely application dependent, however, so the standard error functions are assumed in most of the discussions that follow.
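As one simple departure from the standard functions, the sketch below weights each pattern's squared error by how expensive a mistake on that pattern would be, assuming the application supplies a per-pattern cost array; the function name and arguments are hypothetical.

```python
import numpy as np

def weighted_sse(targets, outputs, pattern_costs):
    """Application-specific variant of SSE in which each pattern's squared
    error is scaled by a cost reflecting how expensive an error on that
    pattern is.  targets, outputs: shape (P, N); pattern_costs: shape (P,)."""
    per_pattern = np.sum((targets - outputs) ** 2, axis=1)
    return np.sum(pattern_costs * per_pattern)
```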

Figure 2.3: Supervised learning can be applied to many different error functions. The figure illustrates a piecewise linear error function with upper and lower tolerance limits; the error is zero when f(x) is within the limits. Functions like this are sometimes useful in engineering applications.

Figure 2.3 illustrates an error function that falls a bit outside the range of standard models but is still included in the supervised learning model. In this case, the target function has piecewise constant upper and lower tolerance limits; the error is zero when y(x) is within the limits and increases quadratically otherwise. Functions like this are sometimes useful in engineering applications. An application-specific error evaluation function is required and the mathematical analysis is not as clean, but the training procedure is basically the same otherwise.
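A minimal sketch of an error of this kind, assuming the tolerance limits are given as arrays aligned with the outputs (the function name and signature are illustrative only):

```python
import numpy as np

def tolerance_band_error(y, lower, upper):
    """Error in the spirit of figure 2.3: zero while y lies between the lower
    and upper tolerance limits, growing quadratically with the amount by which
    y overshoots or undershoots a limit.  All arguments share the same shape."""
    overshoot = np.maximum(y - upper, 0.0)
    undershoot = np.maximum(lower - y, 0.0)
    return np.sum(overshoot ** 2 + undershoot ** 2)
```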

Penalty Terms In addition to the primary terms that measure fitting errors, the cost function is often augmented with terms reflecting goals or preferences that are not directly measurable as differences between outputs and targets on a set of patterns. "Penalty terms" may be added to steer the solution in preferred directions or to enforce constraints; common examples include a bias toward small weights or toward smooth input-output mappings.

Many of the heuristics discussed later can be viewed as modifications of the basic error function which introduce these types of biases. These hints can be especially useful when training data is limited.
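As one illustration, a quadratic weight-decay penalty, one of the heuristics discussed later, can be added to the fitting error. The sketch below assumes the network weights are available as a list of arrays and uses an arbitrary example value for the decay coefficient.

```python
import numpy as np

def augmented_error(targets, outputs, weights, decay=1e-4):
    """Fitting error plus a penalty term.  The penalty here is quadratic
    weight decay, decay * (sum of squared weights), which biases the search
    toward small weights; the value of 'decay' is an assumed example."""
    fit = np.sum((targets - outputs) ** 2)                  # primary SSE term
    penalty = decay * sum(np.sum(w ** 2) for w in weights)  # preference/constraint term
    return fit + penalty
```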