In machine learning, supervised learning has come to mean the process of adjusting a system so it produces specified outputs in response to specified inputs. It is often posed as a function approximation problem (figure 2.1). Given training data consisting of pairs of input patterns, x, and corresponding desired outputs or targets, t, the goal is to find a function y(x) that matches the desired response for each training input. The functional relationship between the input patterns and target outputs is usually unknown (otherwise different methods would be used), so the idea is to start with a system flexible enough to implement many functions and adjust it to fit the given data.
"Training" refers to the adaptation process by which the system "learns" the relationship between the inputs and targets. This is often a repetitive incremental process guided by an optimization algorithm (figure 2.2). The process is "supervised" in the sense that an external "teacher" must specify the correct output for each and every input pattern. In some cases, the teacher is a person who specifies the correct class for each pattern. In other cases, it may be a physical system whose behavior we want to model.
In this book, the learning system is an artificial neural network. During training, each input pattern is presented and propagated through the network to produce an output. Unless the network is perfectly trained, there will be differences between the actual and desired outputs. The real-world significance of these deviations depends on the application and is measured by an objective function whose output rates the quality of the network's response. (The terms "cost function" and "error function" are also used.) The overall goal is then to find a system that minimizes the total error for the given training data.
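The training process described above can be sketched in a few lines of code. The following is a minimal illustration, not a method prescribed by the text: it assumes a network with one input, three tanh hidden units, and one linear output, a mean-squared-error objective, and plain batch gradient descent as the optimization algorithm. The target t = x^2, the learning rate, the number of epochs, and all function names are hypothetical choices made only for this example.

```python
import math
import random

def make_network(n_hidden, rng):
    """Return small random initial weights for a 1-input, 1-output network."""
    return {
        "w": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # input-to-hidden weights
        "b": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # hidden biases
        "v": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)],  # hidden-to-output weights
        "c": 0.0,                                                # output bias
    }

def forward(net, x):
    """Propagate input x through the network; return (hidden activations, output y)."""
    h = [math.tanh(w * x + b) for w, b in zip(net["w"], net["b"])]
    y = sum(v * hj for v, hj in zip(net["v"], h)) + net["c"]
    return h, y

def mean_squared_error(net, data):
    """Objective function: half the mean squared difference between y(x) and t."""
    return sum(0.5 * (forward(net, x)[1] - t) ** 2 for x, t in data) / len(data)

def train(net, data, learning_rate=0.1, epochs=2000):
    """Adjust the weights by batch gradient descent on the objective above."""
    n = len(data)
    n_hidden = len(net["w"])
    for _ in range(epochs):
        gw = [0.0] * n_hidden
        gb = [0.0] * n_hidden
        gv = [0.0] * n_hidden
        gc = 0.0
        for x, t in data:
            h, y = forward(net, x)
            delta = (y - t) / n  # derivative of the objective w.r.t. this output
            gc += delta
            for j in range(n_hidden):
                gv[j] += delta * h[j]
                # Backpropagate through the tanh unit: d tanh(a)/da = 1 - tanh(a)^2
                back = delta * net["v"][j] * (1.0 - h[j] ** 2)
                gw[j] += back * x
                gb[j] += back
        for j in range(n_hidden):
            net["w"][j] -= learning_rate * gw[j]
            net["b"][j] -= learning_rate * gb[j]
            net["v"][j] -= learning_rate * gv[j]
        net["c"] -= learning_rate * gc
    return net

rng = random.Random(0)
# The "teacher" supplies a target t for every input x (here, hypothetically, t = x^2).
data = [(x / 5.0, (x / 5.0) ** 2) for x in range(-5, 6)]
net = make_network(3, rng)
error_before = mean_squared_error(net, data)
train(net, data)
error_after = mean_squared_error(net, data)
print("error before training:", error_before)
print("error after training: ", error_after)
```

Comparing the two printed values shows the effect of training: the error on the training data after adjustment should be lower than the error of the randomly initialized network.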
When defined in this way, training becomes a statistical optimization problem and there are a number of interacting factors to be considered:
Variable selection and representation. What information should be presented to the network and in what form?
Selection and preparation of training data.
Model selection. What structure should the network have?
Choice of error function. How is network performance graded?
Choice of optimization method. The network output is a function of its parameters (weights). How should parameters be adjusted to minimize the error?
Prior knowledge and heuristics. If we know useful rules or heuristics (which may not be learnable from available data), can we somehow insert them into the system? Can we make the system favor particular sorts of solutions?
Generalization. How well does the network really work? Did it learn what we intended or did it simply memorize the training set or find a set of tricks that work on this data but not on others?
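The generalization question in the list above can be made concrete by measuring error on data that was not used for training. The sketch below is only an illustration under stated assumptions: a lookup-table "memorizer" stands in for an overly flexible network, the underlying relationship t = 2x + 1 and the noise level are hypothetical, and all names are invented for this example.

```python
import random

def target(x):
    """Hypothetical underlying relationship that the teacher samples from."""
    return 2.0 * x + 1.0

def make_data(xs, rng, noise=0.3):
    """Pair each input with a noisy target, as a real measurement process might."""
    return [(x, target(x) + rng.gauss(0.0, noise)) for x in xs]

def memorizer(train_set):
    """Return a predictor that answers with the stored target of the nearest training input."""
    def predict(x):
        nearest = min(train_set, key=lambda pair: abs(pair[0] - x))
        return nearest[1]
    return predict

def mean_squared_error(predict, data):
    return sum((predict(x) - t) ** 2 for x, t in data) / len(data)

rng = random.Random(1)
train_set = make_data([i / 10.0 for i in range(10)], rng)        # inputs seen in training
test_set = make_data([i / 10.0 + 0.05 for i in range(10)], rng)  # inputs never seen

predict = memorizer(train_set)
train_error = mean_squared_error(predict, train_set)
test_error = mean_squared_error(predict, test_set)
print("training error:", train_error)
print("held-out error:", test_error)
```

The memorizer reproduces every training target exactly, so its training error is zero, yet its error on the held-out inputs is not. A low training error alone therefore says little about whether the system learned what was intended.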
These issues are discussed at length in the following chapters.