# Author's note: This is a good analogy for how optimization techniques are used in neural network (NN) training. Nonetheless, basic knowledge of optimization techniques in NNs is recommended in order to understand the meaning behind the story.
Kangaroos and Training Neural Networks
By: Warren S. Sarle and Net Poohbahs
Revised: Oct 22, 1994
Training an NN is a form of numerical optimization, which can be likened to a kangaroo searching for the top of Mt. Everest. Everest is the _global optimum_, the highest mountain in the world, but the top of any other really tall mountain such as K2 (a good _local optimum_) would be satisfactory. On the other hand, the top of a small hill like Chapel Hill, NC (a bad local optimum), would not be acceptable.
This analogy is framed in terms of maximization, while neural networks are usually discussed in terms of minimizing an error measure such as the least-squares criterion, but if you multiply the error measure by -1, it works out the same. So in this analogy, the higher the altitude, the smaller the error.
The compass directions represent the values of synaptic weights in the network. The north-south direction represents one weight, while the east-west direction represents another weight. Most networks have more than two weights, but representing additional weights would require a multidimensional landscape, which is difficult to visualize. Keep in mind that when you are training a network with more than two weights, everything gets more complicated.
Initial weights are usually chosen randomly, which means that the kangaroo is dropped by parachute somewhere over Asia by a pilot who has lost the map. If you know something about the scales of the inputs, you may be able to give the pilot adequate instructions to get the kangaroo to land near the Himalayas. However, if you make a really bad choice of distributions for the initial weights, the kangaroo may plummet into the Indian Ocean and drown.
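If it helps to see the parachute drop in code, here is a minimal, purely illustrative Python sketch (not part of the original story): random initial weights whose spread is matched to the number and scale of the inputs, so the kangaroo lands somewhere near the mountains instead of in the ocean. The 1/sqrt(fan-in) rule used here is just one common heuristic.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_inputs, n_outputs, input_scale=1.0):
    """Drop the kangaroo somewhere sensible: small random weights whose
    spread shrinks as the number of inputs (and their scale) grows."""
    spread = 1.0 / (input_scale * np.sqrt(n_inputs))
    return rng.uniform(-spread, spread, size=(n_inputs, n_outputs))

W = init_weights(n_inputs=10, n_outputs=3)
```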
With Newton-type (second-order) algorithms, the Himalayas are covered with fog, and the kangaroo can only see a little way around her location. Judging from the local terrain, the kangaroo makes a guess about where the top of the mountain is, assuming that the mountain has a nice, smooth, quadratic shape. The kangaroo then tries to leap all the way to the top in one jump.
Since most mountains do not have a perfect quadratic shape, the kangaroo will rarely reach the top in one jump. Hence the kangaroo must _iterate_, i.e., jump repeatedly as previously described until she finds the top of a mountain. Unfortunately, there is no assurance that this mountain will be Everest.
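As a rough sketch of the Newton-type jump (not from the original story), assume we can evaluate an error function together with its gradient and Hessian; the two-weight quadratic surface below is invented purely for illustration. Each iteration solves for the full Newton step and leaps there in one jump.

```python
import numpy as np

def error(w):                      # altitude with the sign flipped: lower is better
    return (w[0] - 1.0)**2 + 2.0 * (w[1] + 0.5)**2

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

def hessian(w):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

w = np.array([5.0, 5.0])           # wherever the parachute drops her
for _ in range(10):                # iterate: one leap toward the guessed top per step
    step = np.linalg.solve(hessian(w), gradient(w))
    w = w - step                   # on a truly quadratic surface, one jump would suffice
```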
In a stabilized Newton algorithm, the kangaroo has an altimeter, and if the jump takes her to a lower point, she backs up to where she was and takes a shorter jump. If ridge stabilization is used, the kangaroo also adjusts the direction of her jump to go up a steeper slope. If the algorithm isn't stabilized, the kangaroo may mistakenly jump to Shanghai and get served for dinner in a Chinese restaurant.
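A hedged sketch of the altimeter, using the same invented quadratic surface: the kangaroo compares the error before and after the jump, and halves the step until the jump actually takes her higher (lower error).

```python
import numpy as np

def error(w):
    return (w[0] - 1.0)**2 + 2.0 * (w[1] + 0.5)**2

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

def hessian(w):
    return np.array([[2.0, 0.0], [0.0, 4.0]])

w = np.array([5.0, 5.0])
for _ in range(20):
    step = np.linalg.solve(hessian(w), gradient(w))
    scale = 1.0
    # the altimeter: if the jump lands lower (error goes up), back up and jump shorter
    while error(w - scale * step) > error(w):
        scale *= 0.5
    w = w - scale * step
```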
In steepest ascent with line search, the fog is _very_ dense, and the kangaroo can only tell which direction leads up most steeply. The kangaroo hops in this direction until the terrain starts going down. Then the kangaroo looks around again for the new steepest ascent direction and iterates.
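A minimal sketch of steepest descent with a crude line search on the same invented surface: pick the steepest direction, then keep doubling the hop until the terrain starts going down (error starts rising), and repeat.

```python
import numpy as np

def error(w):
    return (w[0] - 1.0)**2 + 2.0 * (w[1] + 0.5)**2

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([5.0, 5.0])
for _ in range(50):
    direction = -gradient(w)   # steepest descent on the error = steepest ascent on the mountain
    if np.linalg.norm(direction) < 1e-8:
        break                  # the ground is flat: we are at (some) peak
    hop = 1e-3
    # keep hopping in this direction until the terrain starts going down (error rises)
    while error(w + 2 * hop * direction) < error(w + hop * direction):
        hop *= 2
    w = w + hop * direction
```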
Using an ODE (ordinary differential equation) solver is similar to steepest ascent, except that the kangaroo crawls on all fives to the top of the nearest mountain, being sure to crawl in the steepest direction at all times.
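In code, the ODE view amounts to integrating dw/dt = -grad E(w). The forward-Euler sketch below (the simplest possible solver, chosen only for illustration, on the same invented surface) takes many tiny steps, always in the steepest direction.

```python
import numpy as np

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([5.0, 5.0])
dt = 0.01                          # crawling: many tiny steps, never a leap
for _ in range(5000):
    w = w - dt * gradient(w)       # forward Euler on dw/dt = -grad E(w)
```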
The following description of conjugate gradient methods is adapted from Tony Plate (1993):
The environment for conjugate gradient search is just like that for steepest ascent with line search: the fog is dense and the kangaroo can only tell which direction leads up. The difference is that the kangaroo has some memory of the directions it has hopped in before, and the kangaroo assumes that the ridges are straight (i.e., the surface is quadratic). The kangaroo chooses a direction to hop in that is upwards, but that does not result in it going downwards in the previous directions it has hopped in. That is, it chooses an upwards direction, moving along which will not undo the work of previous steps. It hops upwards until the terrain starts going down again, then chooses another direction.
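A rough Fletcher-Reeves-style sketch of nonlinear conjugate gradients on the same invented quadratic surface: each new direction is the current steepest direction plus a memory term (beta) times the previous direction, followed by the same crude line search as before.

```python
import numpy as np

def error(w):
    return (w[0] - 1.0)**2 + 2.0 * (w[1] + 0.5)**2

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

def line_search(w, direction):
    # hop along `direction` until the terrain starts going down (error rises)
    hop = 1e-3
    while error(w + 2 * hop * direction) < error(w + hop * direction):
        hop *= 2
    return w + hop * direction

w = np.array([5.0, 5.0])
g = gradient(w)
direction = -g                      # first hop: plain steepest descent
for _ in range(20):
    w = line_search(w, direction)
    g_new = gradient(w)
    if np.linalg.norm(g_new) < 1e-8:
        break
    beta = (g_new @ g_new) / (g @ g)        # Fletcher-Reeves memory term
    direction = -g_new + beta * direction   # don't undo the previous hops
    g = g_new
```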
In standard backprop, the most common NN training method, the kangaroo is blind and has to feel around on the ground to make a guess about which way is up. A major problem with standard backprop is that the distance the kangaroo hops is related to the steepness of the terrain. If the kangaroo starts on a gently sloping plain instead of a mountain side, she will take very small hops and make very slow progress. When she finally starts to ascend a mountain, her hops get longer and more dangerous, and she may hop off the mountain altogether. If the kangaroo ever gets near the peak, she may jump back and forth across the peak without ever landing on it. If you use a decaying step size, the kangaroo gets tired and makes smaller and smaller hops, so if she ever gets near the peak she has a better chance of actually landing on it before the Himalayas erode away.
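A minimal sketch of plain gradient descent with a decaying step size, again on the invented surface: the hop length is the learning rate times the gradient, so it grows with the steepness of the terrain, and the decay schedule below is just one illustrative way to make the kangaroo tire.

```python
import numpy as np

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([5.0, 5.0])
learning_rate = 0.1
for epoch in range(1, 201):
    step = learning_rate * gradient(w)           # hop length is tied to the steepness
    w = w - step
    learning_rate = 0.1 / (1.0 + 0.01 * epoch)   # the kangaroo gets tired: decaying step size
```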
In backprop with momentum, the kangaroo has poor traction and can't make sharp turns. With on-line training, there are frequent earthquakes, and mountains constantly appear and disappear. This makes it difficult for the blind kangaroo to tell whether she has ever reached the top of a mountain, and she has to take small hops to avoid falling into the gaping chasms that can open up at any moment.
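A hedged sketch of momentum with on-line-style noise: the velocity term carries over part of the previous hop (poor traction, no sharp turns), and the added random noise stands in for the earthquakes of per-example updates. The momentum constant and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient(w):
    return np.array([2.0 * (w[0] - 1.0), 4.0 * (w[1] + 0.5)])

w = np.array([5.0, 5.0])
velocity = np.zeros(2)
learning_rate, momentum = 0.05, 0.9
for _ in range(500):
    g = gradient(w) + rng.normal(scale=0.5, size=2)     # on-line training: the terrain shakes
    velocity = momentum * velocity - learning_rate * g  # poor traction: past hops carry over
    w = w + velocity
```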
Notice that in all the methods discussed so far, the kangaroo can hope at best to find the top of a mountain close to where she starts. In other words, these are _local ascent_ methods. There's no guarantee that this mountain will be Everest, or even a very high mountain. Many methods exist to try to find the global optimum.
In simulated annealing, the kangaroo is drunk and hops around randomly for a long time. However, she gradually sobers up, and the more sober she is, the more likely she is to hop uphill. In a random multistart method, lots of kangaroos are parachuted into the Himalayas at random places. You hope that at least one of them will find Everest.
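A minimal simulated-annealing sketch on an invented bumpy surface with many local optima: random hops are always accepted when they improve the error, and accepted with probability exp(-delta/temperature) when they do not, with the temperature slowly lowered as the kangaroo sobers up.

```python
import numpy as np

rng = np.random.default_rng(0)

def error(w):
    # a bumpy test surface with many local optima (purely illustrative)
    return np.sum(w**2) + 3.0 * np.sum(np.sin(3.0 * w)**2)

w = rng.uniform(-5, 5, size=2)
best_w, best_e = w.copy(), error(w)
temperature = 5.0
for _ in range(5000):
    candidate = w + rng.normal(scale=0.5, size=2)      # a drunken hop in a random direction
    delta = error(candidate) - error(w)
    # always accept improvements; accept worsening hops less often as she sobers up
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        w = candidate
        if error(w) < best_e:
            best_w, best_e = w.copy(), error(w)
    temperature *= 0.999                               # gradual sobering: lower the temperature
```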