Clever Algorithms: Nature-Inspired Programming Recipes

A book by Jason Brownlee

Back-propagation

Back-propagation, Backpropagation, Error Back Propagation, Backprop, Delta-rule.

Taxonomy

The Back-propagation algorithm is a supervised learning method for multi-layer feed-forward networks from the field of Artificial Neural Networks and more broadly Computational Intelligence. The name refers to the backward propagation of error during the training of the network. Back-propagation is the basis for many variations and extensions for training multi-layer feed-forward networks not limited to Vogl's Method (Bold Drive), Delta-Bar-Delta, Quickprop, and Rprop.

Inspiration

Feed-forward neural networks are inspired by the information processing of one or more neural cells (called a neuron). A neuron accepts input signals via its dendrites, which pass the electrical signal down to the cell body. The axon carry the signal out to synapses, which are the connections of a cell's axon to other cell's dendrites. In a synapse, the electrical activity is converted into molecular activity (neurotransmitter molecules crossing the synaptic cleft and binding with receptors). The molecular binding develops an electrical signal which is passed onto the connected cells dendrites. The Back-propagation algorithm is a training regime for multi-layer feed forward neural networks and is not directly inspired by the learning processes of the biological system.

Strategy

The information processing objective of the technique is to model a given function by modifying internal weightings of input signals to produce an expected output signal. The system is trained using a supervised learning method, where the error between the system's output and a known expected output is presented to the system and used to modify its internal state. State is maintained in a set of weightings on the input signals. The weights are used to represent an abstraction of the mapping of input vectors to the output signal for the examples that the system was exposed to during training. Each layer of the network provides an abstraction of the information processing of the previous layer, allowing the combination of sub-functions and higher order modeling.

Procedure

The Back-propagation algorithm is a method for training the weights in a multi-layer feed-forward neural network. As such, it requires a network structure to be defined of one or more layers where one layer is fully connected to the next layer. A standard network structure is one input layer, one hidden layer, and one output layer. The method is primarily concerned with adapting the weights to the calculated error in the presence of input patterns, and the method is applied backward from the network output layer through to the input layer.

Algorithm (below) provides a high-level pseudocode for preparing a network using the Back-propagation training method. A weight is initialized for each input plus an additional weight for a fixed bias constant input that is almost always set to 1.0. The activation of a single neuron to a given input pattern is calculated as follows:

$activation = \bigg(\sum_{k=1}^{n} w_{k} \times x_{ki}\bigg) + w_{bias} \times 1.0$

where $n$ is the number of weights and inputs, $x_{ki}$ is the $k^{th}$ attribute on the $i^{th}$ input pattern, and $w_{bias}$ is the bias weight. A logistic transfer function (sigmoid) is used to calculate the output for a neuron $\in [0,1]$ and provide nonlinearities between in the input and output signals: $\frac{1}{1+exp(-a)}$, where $a$ represents the neuron activation.

The weight updates use the delta rule, specifically a modified delta rule where error is backwardly propagated through the network, starting at the output layer and weighted back through the previous layers. The following describes the back-propagation of error and weight updates for a single pattern.

An error signal is calculated for each node and propagated back through the network. For the output nodes this is the sum of the error between the node outputs and the expected outputs:

$es_i = (c_i - o_i) \times td_i$

where $es_i$ is the error signal for the $i^{th}$ node, $c_i$ is the expected output and $o_i$ is the actual output for the $i^{th}$ node. The $td$ term is the derivative of the output of the $i^{th}$ node. If the sigmod transfer function is used, $td_i$ would be $o_i \times (1-o_i)$ For the hidden nodes, the error signal is the sum of the weighted error signals from the next layer.

$es_i = \bigg(\sum_{k=1}^n (w_{ik} \times es_k)\bigg) \times td_i$

where $es_i$ is the error signal for the $i^{th}$ node, $w_{ik}$ is the weight between the $i^{th}$ and the $k^{th}$ nodes, and $es_k$ is the error signal of the $k_th$ node.

The error derivatives for each weight are calculated by combining the input to each node and the error signal for the node.

$ed_i = \sum_{k=1}^n es_i \times x_k$

where $ed_i$ is the error derivative for the $i^{th}$ node, $es_i$ is the error signal for the $i^{th}$ node and $x_k$ is the input from the $k^{th}$ node in the previous layer. This process include the bias input that has a constant value.

Weights are updated in a direction that reduces the error derivative $ed_i$ (error assigned to the weight), metered by a learning coefficient.

$w_i(t+1) = w_i(t) + (ed_k \times learn_{rate})$

where $w_i(t+1)$ is the updated $i^{th}$ weight, $ed_k$ is the error derivative for the $k^{th}$ node and $learn_{rate}$ is an update coefficient parameter.

Input: ProblemSize, InputPatterns, $iterations_{max}$, $learn_{rate}$
Output: Network
Network $\leftarrow$ ConstructNetworkLayers()
$Network_{weights}$ $\leftarrow$ InitializeWeights{Network, ProblemSize}
For ($i=1$ To $iterations_{max}$)
    $Pattern_i$ $\leftarrow$ SelectInputPattern{InputPatterns}
    $Output_i$ $\leftarrow$ ForwardPropagate{$Pattern_i$, Network}
    BackwardPropagateError{$Pattern_i$, $Output_i$, Network}
    UpdateWeights{$Pattern_i$, $Output_i$, Network, $learn_{rate}$}
End
Return (Network)

Pseudocode for Back-propagation.

Heuristics

The Back-propagation algorithm can be used to train a multi-layer network to approximate arbitrary non-linear functions and can be used for regression or classification problems.
Input and output values should be normalized such that $x \in [0,1)$.
The weights can be updated in an online manner (after the exposure to each input pattern) or in batch (after a fixed number of patterns have been observed).
Batch updates are expected to be more stable than online updates for some complex problems.
A logistic (sigmoid) transfer function is commonly used to transfer the activation to a binary output value, although other transfer functions can be used such as the hyperbolic tangent (tanh), Gaussian, and softmax.
It is good practice to expose the system to input patterns in a different random order each enumeration through the input set.
The initial weights are typically small random values $\in [0, 0.5]$.
Typically a small number of layers are used such as 2-4 given that the increase in layers result in an increase in the complexity of the system and the time required to train the weights.
The learning rate can be varied during training, and it is common to introduce a momentum term to limit the rate of change.
The weights of a given network can be initialized with a global optimization method before being refined using the Back-propagation algorithm.
One output node is common for regression problems, where as one output node per class is common for classification problems.

Code Listing

Listing (below) provides an example of the Back-propagation algorithm implemented in the Ruby Programming Language. The problem is the classical XOR boolean problem, where the inputs of the boolean truth table are provided as inputs and the result of the boolean XOR operation is expected as output. This is a classical problem for Back-Propagation because it was the problem instance referenced by Minsky and Papert in their analysis of the Perceptron highlighting the limitations of their simplified models of neural networks [Minsky1969].

The algorithm was implemented using a batch learning method, meaning the weights are updated after each epoch of patterns are observed. A logistic (sigmoid) transfer function is used to convert the activation into an output signal. Weight updates occur at the end of each epoch using the accumulated delta's. A momentum term is used in conjunction with the past weight update to ensure the last update influences the current update, reducing large changes.

A three layer network is demonstrated with 2 nodes in the input layer (two inputs), 2 nodes in the hidden layer and 1 node in the output layer, which is sufficient for the chosen problem. A bias weight is used on each neuron for stability with a constant input of 1.0. The learning process is separated into four steps: forward propagation, backward propagation of error, calculation of error derivatives (assigning blame to the weights) and the weight update. This separation facilities easy extensions such as adding a momentum term and/or weight decay to the update process.

def random_vector(minmax)
  return Array.new(minmax.size) do |i|
    minmax[i][0] + ((minmax[i][1] - minmax[i][0]) * rand())
  end
end

def initialize_weights(num_weights)
  minmax = Array.new(num_weights) {[-rand(),rand()]}
  return random_vector(minmax)
end

def activate(weights, vector)
  sum = weights[weights.size-1] * 1.0
  vector.each_with_index do |input, i|
    sum += weights[i] * input
  end
  return sum
end

def transfer(activation)
  return 1.0 / (1.0 + Math.exp(-activation))
end

def transfer_derivative(output)
  return output * (1.0 - output)
end

def forward_propagate(net, vector)
  net.each_with_index do |layer, i|
    input=(i==0)? vector : Array.new(net[i-1].size){|k|net[i-1][k][:output]}
    layer.each do |neuron|
      neuron[:activation] = activate(neuron[:weights], input)
      neuron[:output] = transfer(neuron[:activation])
    end
  end
  return net.last[0][:output]
end

def backward_propagate_error(network, expected_output)
  network.size.times do |n|
    index = network.size - 1 - n
    if index == network.size-1
      neuron = network[index][0] # assume one node in output layer
      error = (expected_output - neuron[:output])
      neuron[:delta] = error * transfer_derivative(neuron[:output])
    else
      network[index].each_with_index do |neuron, k|
        sum = 0.0
        # only sum errors weighted by connection to the current k'th neuron
        network[index+1].each do |next_neuron|
          sum += (next_neuron[:weights][k] * next_neuron[:delta])
        end
        neuron[:delta] = sum * transfer_derivative(neuron[:output])
      end
    end
  end
end

def calculate_error_derivatives_for_weights(net, vector)
  net.each_with_index do |layer, i|
    input=(i==0)? vector : Array.new(net[i-1].size){|k|net[i-1][k][:output]}
    layer.each do |neuron|
      input.each_with_index do |signal, j|
        neuron[:deriv][j] += neuron[:delta] * signal
      end
      neuron[:deriv][-1] += neuron[:delta] * 1.0
    end
  end
end

def update_weights(network, lrate, mom=0.8)
  network.each do |layer|
    layer.each do |neuron|
      neuron[:weights].each_with_index do |w, j|
        delta = (lrate * neuron[:deriv][j]) + (neuron[:last_delta][j] * mom)
        neuron[:weights][j] += delta
        neuron[:last_delta][j] = delta
        neuron[:deriv][j] = 0.0
      end
    end
  end
end

def train_network(network, domain, num_inputs, iterations, lrate)
  correct = 0
  iterations.times do |epoch|
    domain.each do |pattern|
      vector,expected=Array.new(num_inputs){|k|pattern[k].to_f},pattern.last
      output = forward_propagate(network, vector)
      correct += 1 if output.round == expected
      backward_propagate_error(network, expected)
      calculate_error_derivatives_for_weights(network, vector)
    end
    update_weights(network, lrate)
    if (epoch+1).modulo(100) == 0
      puts "> epoch=#{epoch+1}, Correct=#{correct}/#{100*domain.size}"
      correct = 0
    end
  end
end

def test_network(network, domain, num_inputs)
  correct = 0
  domain.each do |pattern|
    input_vector = Array.new(num_inputs) {|k| pattern[k].to_f}
    output = forward_propagate(network, input_vector)
    correct += 1 if output.round == pattern.last
  end
  puts "Finished test with a score of #{correct}/#{domain.length}"
  return correct
end

def create_neuron(num_inputs)
  return {:weights=>initialize_weights(num_inputs+1),
          :last_delta=>Array.new(num_inputs+1){0.0},
          :deriv=>Array.new(num_inputs+1){0.0}}
end

def execute(domain, num_inputs, iterations, num_nodes, lrate)
  network = []
  network << Array.new(num_nodes){create_neuron(num_inputs)}
  network << Array.new(1){create_neuron(network.last.size)}
  puts "Topology: #{num_inputs} #{network.inject(""){|m,i|m+"#{i.size} "}}"
  train_network(network, domain, num_inputs, iterations, lrate)
  test_network(network, domain, num_inputs)
  return network
end

if __FILE__ == $0
  # problem configuration
  xor = [[0,0,0], [0,1,1], [1,0,1], [1,1,0]]
  inputs = 2
  # algorithm configuration
  learning_rate = 0.3
  num_hidden_nodes = 4
  iterations = 2000
  # execute the algorithm
  execute(xor, inputs, iterations, num_hidden_nodes, learning_rate)
end

Back-propagation in Ruby

Download: backpropagation.rb.

References

Primary Sources

The backward propagation of error method is credited to Bryson and Ho in [Bryson1969]. It was applied to the training of multi-layer networks and called back-propagation by Rumelhart, Hinton and Williams in 1986 [Rumelhart1986b] [Rumelhart1986c]. This effort and the collection of studies edited by Rumelhart and McClelland helped to define the field of Artificial Neural Networks in the late 1980s [Rumelhart1986] [Rumelhart1986a].

Learn More

A seminal book on the approach was "Backpropagation: theory, architectures, and applications" by Chauvin and Rumelhart that provided an excellent introduction (chapter 1) but also a collection of studies applying and extending the approach [Chauvin1995]. Reed and Marks provide an excellent treatment of feed-forward neural networks called "Neural Smithing" that includes chapters dedicated to Back-propagation, the configuration of its parameters, error surface and speed improvements [Reed1999].

Bibliography

[Bryson1969]	A. E. Bryson and Y–C. Ho, "Applied optimal control: optimization, estimation, and control", Taylor \& Francis, 1969.
[Chauvin1995]	Y. Chauvin and D. E. Rumelhart, "Backpropagation: Theory, architectures, and applications", Routledge, 1995.
[Minsky1969]	M. L. Minsky and S. A. Papert, "Perceptrons", MIT Press, 1969.
[Reed1999]	R. D. Reed and R. J. Marks II, "Neural Smithing: Supervised Learning in Feedforward Artificial Neural\n\tNetworks", Mit Press, 1999.
[Rumelhart1986]	D. E. Rumelhart and J. L. McClelland, "Parallel distributed processing: explorations in the microstructure\n\tof cognition. Foundations, Volume 1", MIT Press, 1986.
[Rumelhart1986a]	D. E. Rumelhart and J. L. McClelland, "Parallel distributed processing: Explorations in the microstructure\n\tof cognition. Psychological and biological models, Volume 2", MIT Press, 1986.
[Rumelhart1986b]	D. E. Rumelhart and G. E. Hinton and R. J. Williams, "Learning representations by back-propagating errors", Nature, 1986.
[Rumelhart1986c]	D. E. Rumelhart and G. E. Hinton and R. J. Williams, "Learning internal representations by error propagation", in Parallel distributed processing: explorations in the microstructure\n\tof cognition, vol. 1, pages 318–362, MIT Press, 1986.

Please Note: This content was automatically generated from the book content and may contain minor differences.