Wednesday, 3 June 2009

Back propagation algorithm in Clojure

In my previous post I looked at the most basic type of neural network, the single layer perceptron. Next, we'll look at the multi-layer perceptron network. This is a more powerful class of neural network than the single layer perceptron because it can handle non-linearly separable data (such as the XOR test case which failed last time).

The main difference in a multi-layer perceptron is the hidden layer, where each neuron's output is passed through a non-linear activation function. This is similar in spirit to the kernel trick, which maps a non-linear problem into a higher-dimensional space in which it becomes linearly separable (see also support vector machines). The mathematics behind this is better explained here [PDF].

A typical activation function is tanh, shown below between -10 and 10 (graph generated using Wolfram Alpha).

Graph of tanh(x) from -10 to 10 from Wolfram Alpha
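
A handy property of tanh (my note, not from the original post) is that its derivative can be written in terms of its own output: d/dx tanh(x) = 1 - tanh(x)^2. That's why the activation-function-derivation constant defined below takes the already-activated value y rather than the raw input x. A quick REPL sanity check, comparing the identity against a finite-difference approximation:

(let [x 0.5
      y (Math/tanh x)
      h 1e-6]
  [(- 1.0 (* y y))                                 ;; via the identity 1 - tanh(x)^2
   (/ (- (Math/tanh (+ x h)) (Math/tanh (- x h)))  ;; central finite difference
      (* 2 h))])
;; => both values are ~0.7864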

So how do we model this network in a functional programming language like Clojure? A typical solution using an imperative/OO language (see here for a good Python implementation) would use arrays representing the weights and modify things in situ, but that's not very functional.

We start by defining some constants (including the activation function and its derivative). bp-nn represents the network itself.


(def activation-function (fn [x] (Math/tanh x)))
(def activation-function-derivation (fn [y] (- 1.0 (* y y))))
(def num-hidden 2)

(def learning-rate 0.5)
(def momentum 0.1)

(defstruct bp-nn :weight-input :weight-output :change-input :change-output)


Next we create some simple initialization functions to create an initial neural network, together with some helper functions for iterating over matrices (which we'll model as lists of lists). Usually you'd use random weights to initialize things, but allowing fixed values makes testing possible.



(defn make-matrix
  "Create a matrix (list of lists)"
  [width height]
  (repeat width (repeat height 0)))

(defn matrix-map
  "Apply a function to every element in a matrix"
  [m func]
  (map (fn [x] (map func x)) m))

(defn rand-range
  "Return a random real number within the given range"
  [l h]
  (+ (rand (- h l)) l))

(defn create-network
  "Create a network with the given number of input, hidden and output nodes"
  ([input hidden output]
     (create-network input hidden output true))
  ([input hidden output use-random-weights]
     (let [i (inc input) ;; extra input node for the bias term
           w-func (if use-random-weights (fn [_] (rand-range -0.2 0.2)) (fn [_] 0.2))
           o-func (if use-random-weights (fn [_] (rand-range -2.0 2.0)) (fn [_] 2.0))]
       (struct bp-nn
               (matrix-map (make-matrix i hidden) w-func)
               (matrix-map (make-matrix hidden output) o-func)
               (make-matrix i hidden)
               (make-matrix hidden output)))))
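
As a quick illustration (this snippet is mine, not from the original post), creating a 2-2-1 network with fixed weights makes the shape of the struct easy to see. The fixed-net name is just for the examples that follow; note the extra row in :weight-input for the bias term.

(def fixed-net (create-network 2 2 1 false))

(fixed-net :weight-input)   ;; => ((0.2 0.2) (0.2 0.2) (0.2 0.2)) - 3 rows (2 inputs + bias) x 2 hidden
(fixed-net :weight-output)  ;; => ((2.0) (2.0))                   - 2 hidden x 1 output
(fixed-net :change-input)   ;; => ((0 0) (0 0) (0 0))             - accumulated changes start at zero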



The first thing we should do is run a pattern through the network and see what comes out the other end. We're not just interested in the final output; we also want to know what happened at the hidden layer, so we return a vector of ao (the output activations) and ah (the hidden activations).

comp is functional composition. ((comp x y) 5) is equivalent to (x (y 5)), so in the example below we add the numbers together and then apply the activation function. The nested map calls allow us to iterate over the elements in a matrix.
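
For instance (a REPL illustration of mine), composing the activation function with + sums its arguments and then squashes the result through tanh:

((comp activation-function +) 1 2)
;; => (Math/tanh 3), roughly 0.995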



(defn apply-activation-function
  "Calculate the activations of a layer given its weights and inputs"
  [w i]
  (apply map (comp activation-function +) (map (fn [col p] (map (partial * p) col)) w i)))

(defn run-network
  "Run the network with the given pattern and return the output and the hidden values"
  [pattern network]
  (assert (= (count pattern) (dec (count (network :weight-input)))))
  (let [p (cons 1 pattern) ;; ensure bias term added
        wi (network :weight-input)
        wo (network :weight-output)
        ah (apply-activation-function wi p)
        ao (apply-activation-function wo ah)]
    [ao ah]))
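
Running a pattern through the fixed-weight fixed-net from the earlier sketch gives a deterministic result (the numbers below are approximate):

(run-network [0 1] fixed-net)
;; => [(~0.91) (~0.38 ~0.38)]
;;    i.e. [output-activations hidden-activations]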


In order to perform back-propagation we need a couple of helper functions that work on matrices and vectors to calculate the error terms that will be used to update the weights.

These helper functions are pretty sick - no-one wants to read (map (partial reduce +) ...). A better design would probably be to introduce a proper matrix abstraction. There's the beginnings of one here, but it's a bit too "Java" syntax-heavy for my liking.
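
For reference, the update rule these helper functions implement is the standard back-propagation delta rule. The notation below is mine rather than the original derivation's: f is the activation function, a an activation and d a delta (error term).

  output delta:  d_o = f'(a_o) * (target_o - a_o)
  hidden delta:  d_h = f'(a_h) * sum over outputs o of (w_ho * d_o)
  weight update: w   = w + learning-rate * (d * a_in) + momentum * previous-change

where a_in is the activation on the input side of the weight and (d * a_in) is the "change" remembered for the momentum term.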



(defn calculate-hidden-deltas
  "Calculate the error terms (deltas) for the hidden layer"
  [wo ah od]
  (let [errors (map (partial reduce +) (map (fn [x] (map * x od)) wo))] ;; Sick.
    (map (fn [h e] (* e (activation-function-derivation h))) ah errors)))

(defn update-weights
  "Update the weights, returning the new weights and the changes (kept for momentum)"
  [w deltas co ah]
  (let [x (map
           (fn [wcol ccol h]
             (map (fn [wrow crow od]
                    (let [change (* od h)]
                      [(+ wrow (* learning-rate change) (* momentum crow)) change]))
                  wcol ccol deltas))
           w co ah)]
    [(matrix-map x first) (matrix-map x second)]))


I did warn you...

The next thing we need to implement is the back-propagation algorithm itself. This takes in more parameters than it needs to so that it can be tested standalone (it could be implemented as a local function using a closure to capture some of them). It returns an updated version of the neural network.



(defn back-propagate
  "Back propagate the results through the network and adjust the weights"
  [target p results network]
  (assert (= (count target) (count (first (network :weight-output)))))
  (let [pattern (cons 1 p) ;; ensure bias term added
        ao (first results)
        ah (second results)
        error (map - target ao)
        wi (network :weight-input)
        wo (network :weight-output)
        ci (network :change-input)
        co (network :change-output)
        output-deltas (map (fn [o e] (* e (activation-function-derivation o))) ao error)
        hidden-deltas (calculate-hidden-deltas wo ah output-deltas)
        updated-output-weights (update-weights wo output-deltas co ah)
        updated-input-weights (update-weights wi hidden-deltas ci pattern)]
    (struct bp-nn
            (first updated-input-weights)
            (first updated-output-weights)
            (second updated-input-weights)
            (second updated-output-weights))))


All that remains is to train the network. We need a set of samples with known results, together with a number of iterations to try. I've split this into run-patterns, which runs through the patterns once, and train-network, which creates the initial network and runs it through the patterns the specified number of times.



(defn run-patterns
  "Run through the patterns once, updating the network after each one"
  [network samples expecteds]
  (if (empty? samples)
    network
    (let [expected (first expecteds)
          sample (first samples)
          [ao ah] (run-network sample network)
          updated-network (back-propagate expected sample [ao ah] network)]
      (recur updated-network (rest samples) (rest expecteds)))))

(defn train-network
  "Train a network on the samples for the given number of iterations"
  ([samples expected iterations]
     (train-network (create-network (count (first samples))
                                    num-hidden (count (first expected)))
                    samples expected iterations))
  ([network samples expected iterations]
     (if (zero? iterations)
       network
       (recur (run-patterns network samples expected) samples expected (dec iterations)))))


So how well does it work in practice? Pretty damn good. It converges after a relatively small number of iterations (100 in this case) and consistently gets the XOR test data set right.


(defn example []
  (let [x (apply train-network (conj xor-test-data 100))]
    (println (first (run-network [0 0] x)) "-->" 0)
    (println (first (run-network [0 1] x)) "-->" 1)
    (println (first (run-network [1 0] x)) "-->" 1)
    (println (first (run-network [1 1] x)) "-->" 0)))

;;; Example run through the REPL
uk.co.fatvat.mlperceptron> (example)
(0.10717792758953508) --> 0
(0.993502708495955) --> 1
(0.9930515903590437) --> 1
(0.00883530181467182) --> 0
nil


Total code weighs in at ~100 or so lines of purely functional code. This made it a doddle to test. Coding this was an exercise in map masochism - never have I had to construct so many convoluted map expressions. At least it works in the end. It feels like it would be considerably simpler to implement this using mutability; I might try that for comparison purposes. Any suggestions on improving the code are much appreciated! Full code is on GitHub.

(Update: I've read some of Programming Clojure and applied some of the lessons learnt - see here.)

Neural networks are pretty interesting and in the next post I'll look at how to implement basic OCR using them.



Monday, 25 May 2009

Neural Networks and Clojure

Artificial Neural Networks (ANNs) attempt to "learn" by modelling the behaviour of neurons. Although neural networks sound cool, there is no magic behind them!

Invented in 1957 by Frank Rosenblatt, the single layer perceptron is the simplest type of neural network. It can act as a binary classifier for any linearly separable data set.

Single Layer Perceptron graphic from Wikipedia

The SLP is nothing more than a collection of weights and a threshold: the output is 1 if the weighted sum of the inputs is positive, and 0 otherwise. The Clojure code below allows you to create a network (initially with zero weights) and get a result from the network given some weights and an input. Not very interesting.



(defn create-network
  "Create a network of the given size with all weights set to zero"
  [in]
  (repeat in 0))

(defn run-network
  "Classify the input as 1 or 0 using the given weights"
  [input weights]
  (if (pos? (reduce + (map * input weights))) 1 0))
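
For example (the weights here are made up for illustration), a weight vector whose first component dominates effectively classifies on the first input:

(run-network [1 0 0] [0.05 -0.001 -0.0004]) ;; => 1 (weighted sum is positive)
(run-network [0 1 0] [0.05 -0.001 -0.0004]) ;; => 0 (weighted sum is not positive)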



The clever bit is adapting the weights so that the neural network learns. This process is known as training and is based on a set of data with known expected outputs. The learning algorithm for SLPs is shown below (a small worked example follows the code). Given an error (1, 0 or -1 in this case; when the prediction is already correct nothing changes), adjust each weight in proportion to the size of its input. The learning-rate decides how much to vary the weights; too high and the algorithm won't converge, too low and it'll take forever to converge.



(def learning-rate 0.05)

(defn- update-weights
  [weights inputs error]
  (map
   (fn [weight input] (+ weight (* learning-rate error input)))
   weights inputs))
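
As a worked example (numbers mine): with a learning-rate of 0.05, a weight of 0.1, an input of 1 and an error of -1, the updated weight is 0.1 + (0.05 * -1 * 1) = 0.05, while a weight whose input was 0 is left untouched:

(update-weights [0.1 0.2] [1 0] -1)
;; => (0.05 0.2)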


Finally, we can put this all together with a simple training function. Given a series of samples and the expected values, repeatedly update the weights until the training set is empty.



(defn train
  ([samples expecteds] (train samples expecteds (create-network (count (first samples)))))
  ([samples expecteds weights]
     (if (empty? samples)
       weights
       (let [sample (first samples)
             expected (first expecteds)
             actual (run-network sample weights)
             error (- expected actual)]
         (recur (rest samples) (rest expecteds) (update-weights weights sample error))))))



So we have our network now - how can we use it? First, let's define a couple of data sets, one linearly separable and one not. jiggle adds some random noise to each sample. Note the cool #() syntax for a short function definition (I hadn't seen it before).
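
A quick illustration of that shorthand (mine, assuming the jiggle definition below): #(jiggle [0 1]) is reader syntax for a no-argument anonymous function, which repeatedly can call over and over to generate fresh noisy samples.

;; #(jiggle [0 1]) is equivalent to (fn [] (jiggle [0 1]))
(take 2 (repeatedly #(jiggle [0 1])))
;; => two slightly-noisy copies of [0 1], e.g. ((0.0123 0.9891) (-0.0201 1.0144))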



(defn jiggle [data]
  (map (fn [x] (+ x (- (rand 0.05) 0.025))) data))

(def linearly-separable-test-data
  [(concat
    (take 100 (repeatedly #(jiggle [0 1 0])))
    (take 100 (repeatedly #(jiggle [1 0 0]))))
   (concat
    (repeat 100 0)
    (repeat 100 1))])

(def xor-test-data
  [(concat
    (take 100 (repeatedly #(jiggle [0 1])))
    (take 100 (repeatedly #(jiggle [1 0])))
    (take 100 (repeatedly #(jiggle [0 0])))
    (take 100 (repeatedly #(jiggle [1 1]))))
   (concat
    (repeat 100 1)
    (repeat 100 1)
    (repeat 100 0)
    (repeat 100 0))])


If we run these in the REPL we can see that the results are perfect for the linearly separable data.


> (apply train linearly-separable-test-data)
(0.04982859491606148 -0.0011851610388172009 -4.431771581539448E-4)

> (run-network [0 1 0] (apply train linearly-separable-test-data))
0

> (run-network [1 0 0] (apply train linearly-separable-test-data))
1


However, for the non-linearly separable data the results are completely wrong:


> (apply train xor-test-data)
(-0.02626745010362212 -0.028550312499346104)

> (run-network [1 1] (apply train xor-test-data))
0

> (run-network [0 1] (apply train xor-test-data))
0

> (run-network [1 0] (apply train xor-test-data))
0

> (run-network [0 0] (apply train xor-test-data))
0


The neural network algorithm shown here is really just a gradient descent optimization that only works for linearly separable data. Instead of calculating the solution in an iterative manner, we could have just arrived at an optimal solution in one go.

More complicated networks, such as the multi-layer perceptron, have more classification power and can work with non-linearly separable data. I'll look at them next time!


