
Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology
 

Appendix D: Sigmoid-like Nonlinear Functions

In a large class of networks, each node computes a function f(wᵀx) of its inputs x. In most cases, f(u) is chosen to be a bounded nondecreasing function of u; the sigmoid and tanh functions are common choices. Back-propagation and other gradient-based training methods require that f be differentiable.

Table D.1 lists some functions commonly used for node nonlinearities. In general, scaled and translated functions g(u) = af(ku) + b, for constants a, b, and k, yield networks with equivalent representational properties. The tanh and sigmoid functions are related by tanh(u) = 2 sigmoid(2u) - 1, for example. There may be practical reasons, however, for choosing one form over another.
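
As a quick numerical check of this relationship, the following sketch (not from the book; the sigmoid helper is our own) compares the two forms directly:

    import numpy as np

    def sigmoid(u):
        # Logistic function (D.1) with lambda = 1.
        return 1.0 / (1.0 + np.exp(-u))

    u = np.linspace(-5.0, 5.0, 101)
    lhs = np.tanh(u)
    rhs = 2.0 * sigmoid(2.0 * u) - 1.0
    print(np.max(np.abs(lhs - rhs)))   # agrees to machine precision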

Sigmoid The sigmoid, or logistic, function

y(u) = \frac{1}{1 + e^{-\lambda u}}    (D.1)

is a bounded, nondecreasing function of u. It approaches 0 for u → -∞, is 1/2 at u = 0, and approaches 1 for u → ∞. It is approximately linear for small inputs (u ≈ 0), but saturates for large positive or negative inputs. The name derives from this "s" shape. Other monotonic functions with similar shapes are often called sigmoidal. The optional parameter λ controls the slope in the linear region; with large values the response approximates a step function. Normally λ = 1 unless otherwise specified, since equivalent results can be obtained by scaling the magnitude of the weight vector.

A useful property of the usual form (D.1) is that its derivative is easily calculated given the output

\frac{dy}{du} = \lambda y (1 - y)    (D.2)

The derivative is a bell-shaped function, positive everywhere and largest at u = 0, where the slope is λ/4. For large positive and negative values of u, it approaches 0.

The inverse is

u = \frac{1}{\lambda} \ln \frac{y}{1 - y}    (D.3)
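
These three formulas translate directly into code. The sketch below uses our own helper names (lam standing in for λ) and NumPy:

    import numpy as np

    def sigmoid(u, lam=1.0):
        # Equation (D.1): output in (0, 1), slope lam/4 at u = 0.
        return 1.0 / (1.0 + np.exp(-lam * u))

    def sigmoid_deriv_from_output(y, lam=1.0):
        # Equation (D.2): derivative computed from the output alone.
        return lam * y * (1.0 - y)

    def sigmoid_inverse(y, lam=1.0):
        # Equation (D.3): recovers u from an output 0 < y < 1.
        return np.log(y / (1.0 - y)) / lam

    y = sigmoid(0.0)                      # 0.5
    print(sigmoid_deriv_from_output(y))   # 0.25, i.e. lambda/4 at u = 0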

Tanh The tanh function

y(u) = \tanh(\lambda u) = \frac{e^{\lambda u} - e^{-\lambda u}}{e^{\lambda u} + e^{-\lambda u}}    (D.4)

is a centered version of the sigmoid. It is -1 for u = -∞, 0 for u = 0, and +1 for u = +∞.

Table D.1: Common Node Nonlinearities

The functions are related by tanh(u) = 2 sigmoid(2u) - 1. Its derivative, in terms of its output, is

\frac{dy}{du} = \lambda (1 - y^2)    (D.5)

At u = 0, the slope is λ. Its inverse is

u = \frac{1}{2\lambda} \ln \frac{1 + y}{1 - y}    (D.6)
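
A corresponding sketch for (D.4)-(D.6), again with hypothetical helper names and lam for λ:

    import numpy as np

    def tanh_deriv_from_output(y, lam=1.0):
        # Equation (D.5): derivative from the output; equals lam at y = 0 (u = 0).
        return lam * (1.0 - y**2)

    def tanh_inverse(y, lam=1.0):
        # Equation (D.6): u = (1 / (2 lam)) ln((1 + y) / (1 - y)) = arctanh(y) / lam.
        return np.arctanh(y) / lam

    y = np.tanh(0.5)            # lam = 1, u = 0.5
    print(tanh_inverse(y))      # recovers 0.5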

Step and Sign The unit step function is

y(u) = \begin{cases} 0, & u \le 0 \\ 1, & u > 0 \end{cases}    (D.7)

In engineering, this is sometimes called the Heaviside step function. A node implementing f(wᵀx), where f is a step function, is also called a linear threshold unit (LTU). The derivative of the step function is the Dirac delta function δ(u), which is infinite at u = 0 and zero everywhere else.

The sign function is the bipolar equivalent of the step function

y(u) = \begin{cases} -1, & u \le 0 \\ +1, & u > 0 \end{cases}    (D.8)

and has derivative 2δ(u).
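
Both threshold functions are easy to express in code; the sketch below uses our own names and follows the conventions of (D.7) and (D.8), assigning the lower value at u = 0:

    import numpy as np

    def step(u):
        # Equation (D.7): 0 for u <= 0, 1 for u > 0.
        return np.where(u > 0.0, 1.0, 0.0)

    def sign(u):
        # Equation (D.8): -1 for u <= 0, +1 for u > 0.
        # (Note np.sign differs: it returns 0 at u = 0.)
        return np.where(u > 0.0, 1.0, -1.0)

    print(step(np.array([-2.0, 0.0, 3.0])))   # [0. 0. 1.]
    print(sign(np.array([-2.0, 0.0, 3.0])))   # [-1. -1.  1.]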

Clipped Linear The output of the clipped linear function is equal to its input for small inputs, but clips at large positive and negative values

y(u) = \begin{cases} -1, & u \le -1/\lambda \\ \lambda u, & |u| < 1/\lambda \\ 1, & u \ge 1/\lambda \end{cases}    (D.9)

This may also be called a semilinear ramp function.

Its derivative is constant in the linear region and zero elsewhere

\frac{dy}{du} = \begin{cases} \lambda, & |u| < 1/\lambda \\ 0, & \text{otherwise} \end{cases}    (D.10)
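
In code, the clipping can be written with a single clamp; the sketch below (our own names, lam for λ) follows (D.9) and (D.10):

    import numpy as np

    def clipped_linear(u, lam=1.0):
        # Equation (D.9): slope lam near the origin, clipped to [-1, 1].
        return np.clip(lam * u, -1.0, 1.0)

    def clipped_linear_deriv(u, lam=1.0):
        # Equation (D.10): lam inside the linear region, 0 elsewhere.
        return np.where(np.abs(u) < 1.0 / lam, lam, 0.0)

    print(clipped_linear(np.array([-3.0, 0.2, 3.0]), lam=2.0))   # [-1.   0.4  1. ]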

Other Functions There are a number of alternative sigmoid-like functions occasionally used in special cases. One is

y(u) = \frac{u}{1 + |u|}    (D.11)

In table D.1 this is called the Inverse Abs function, but it does not have a generally recognized name. The shape is similar to the tanh function, but convergence to the ±1 asymptotes is slower. (The horizontal axis of the thumbnail figure in table D.1 spans -10 < u < 10.) Its derivative is

\frac{dy}{du} = \frac{1}{(1 + |u|)^2}    (D.12)

At u = 0, ∂y/∂u = 1, but the derivative is more sharply peaked near 0 and has wider tails than that of the tanh function.
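
A small sketch (our own names) of (D.11) and (D.12), with a side-by-side comparison against tanh showing the slower approach to the asymptotes:

    import numpy as np

    def inverse_abs(u):
        # Equation (D.11): u / (1 + |u|); needs no exponentials.
        return u / (1.0 + np.abs(u))

    def inverse_abs_deriv(u):
        # Equation (D.12): 1 / (1 + |u|)^2; equals 1 at u = 0.
        return 1.0 / (1.0 + np.abs(u))**2

    u = np.array([0.0, 1.0, 10.0])
    print(inverse_abs(u))   # [0.     0.5    0.909...] -- slow approach to +1
    print(np.tanh(u))       # [0.     0.762  1.000...] -- saturates much faster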

An advantage of this function is that it does not require transcendental functions, which may be time-consuming to calculate on some computers. This may be useful in digital implementations, but it is not a particular advantage for analog electronic implementations, because tanh functions are easily realized with differential amplifiers. The slower convergence to the asymptotes may help prevent paralysis during learning due to saturation of the node nonlinearities.