


Quasi-Newton algorithm

As gradient descent is slow, especially in the neighbourhood of a minimum, we need a more powerful algorithm to reach the minimum quickly. The Quasi-Newton algorithm is based on the idea that a function looks like a quadratic function near its minimum. In the case of a quadratic function, we have:

\begin{displaymath}w_{minimum}=w_0-\left(\nabla^2E(w_0)\right)^{-1}\cdot \nabla E(w_0)\end{displaymath}

so if we use this equation to update the weights, the new weights should be close to the minimum. The minimum can then be reached in a few steps. In the ideal case of a quadratic function, the minimum is found in one iteration.
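
For instance, if $E(w)=\frac{1}{2}\cdot w^T\cdot A\cdot w-b^T\cdot w$ with $A$ symmetric positive definite, then $\nabla E(w_0)=A\cdot w_0-b$ and $\nabla^2E(w_0)=A$, so the update gives

\begin{displaymath}w_{minimum}=w_0-A^{-1}\cdot\left(A\cdot w_0-b\right)=A^{-1}\cdot b\end{displaymath}

whatever the starting point $w_0$ is.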

In practice, the computation of the Hessian of $E$ is very slow, so the convergence speed of this algorithm is not as good as expected. It is usually better to approximate the Hessian matrix in order to find a good descent direction. The BFGS (Broyden, Fletcher, Goldfarb and Shanno) formula was used to approximate the inverse of the Hessian [1]. The descent direction becomes:

\begin{displaymath}d_k=-M_k\cdot \nabla E\left(w_k\right)\end{displaymath}

where $M_k$ is computed using the BFGS formula:

\begin{displaymath}M_{k+1}=M_k+\left[1+\frac{\gamma_k^T\cdot M_k\cdot \gamma_k}{\delta_k^T\cdot \gamma_k}\right]\cdot \frac{\delta_k\cdot \delta_k^T}{\delta_k^T\cdot \gamma_k}-\frac{\delta_k\cdot \gamma_k^T\cdot M_k+M_k\cdot \gamma_k\cdot \delta_k^T}{\delta_k^T\cdot \gamma_k}\end{displaymath}

where $\delta_k=w_{k+1}-w_k$ and $\gamma_k=\nabla E\left(w_{k+1}\right)-\nabla E\left(w_k\right)$.

Even if the BFGS formula preserves positive definiteness, the computation of $M_k$ can lead to a matrix which is not positive definite because of rounding errors. So at each iteration we must check that $\delta_k^T\cdot \gamma_k>0$. If this is not the case, we simply reset $M_k$ to the identity. It can be proved that the sequence $\left(M_k\right)$ tends to $\left(\nabla^2E(w)\right)^{-1}$ [18], so the formula tends to produce the same update as the ideal formula used for a quadratic function.
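
As an illustration, a minimal NumPy sketch of this update and of the safeguard could look as follows (the function names and the data layout are our own choices for the example, not part of the method itself):

\begin{verbatim}
import numpy as np

def bfgs_update(M, delta, gamma):
    """One BFGS update of the inverse-Hessian approximation M_k.

    delta = w_{k+1} - w_k, gamma = grad E(w_{k+1}) - grad E(w_k).
    If the curvature condition delta^T gamma > 0 fails because of
    rounding errors, M is simply reset to the identity.
    """
    dg = delta @ gamma                        # delta_k^T gamma_k
    if dg <= 0.0:
        return np.eye(M.shape[0])             # safeguard: reset to the identity
    Mg = M @ gamma                            # M_k gamma_k
    correction = (1.0 + gamma @ Mg / dg) * np.outer(delta, delta) / dg
    cross_terms = (np.outer(delta, Mg) + np.outer(Mg, delta)) / dg
    return M + correction - cross_terms

def descent_direction(M, g):
    """d_k = -M_k grad E(w_k)."""
    return -M @ g
\end{verbatim}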

As in gradient descent, we also need a step size for the update. This stage is now more important because the minimization requires more precision near the minimum. However, the Quasi-Newton algorithm is not very sensitive to the chosen step, so we can use a simple and efficient method to choose it, such as the Nash method [18].
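
For illustration, a simple backtracking rule driven by a tolerance $\tau$ and a reduction factor $r$ can play this role; this is a generic Armijo-style stand-in and is not necessarily identical to the Nash method of [18]:

\begin{verbatim}
def backtracking_step(E, w, g, d, tau=1e-4, r=0.5, mu=1.0, max_reductions=30):
    """Choose a step size mu along the descent direction d.

    Accept mu as soon as E(w + mu*d) has decreased enough compared with
    the slope g^T d scaled by the tolerance tau; otherwise multiply mu
    by the reduction factor r.  Generic backtracking rule, used here as
    a stand-in for the step-selection method described in the text.
    w, g and d are NumPy vectors, E is the error function.
    """
    slope = g @ d                   # negative for a descent direction
    E0 = E(w)
    for _ in range(max_reductions):
        if E(w + mu * d) <= E0 + tau * mu * slope:
            break
        mu *= r
    return mu
\end{verbatim}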

Combining these stages leads to the following algorithm:

Initialize a tolerance factor $\tau$ between 0 and 1.

Initialize a reduction factor $r$ between 0 and 1.

Initialize the weights $w_0$.

Initialize $k=0$, $g_0=\nabla E\left(w_0\right)$, $M_0=I$, $d_0=-M_0\cdot g_0$, and $\mu=1$.

Do:

Choose a step $\mu$ along $d_k$ with the Nash method, using the tolerance $\tau$ and reducing $\mu$ by the factor $r$ until the decrease of $E$ is sufficient.

Update the weights: $w_{k+1}=w_k+\mu\cdot d_k$, then compute $g_{k+1}=\nabla E\left(w_{k+1}\right)$, $\delta_k$ and $\gamma_k$.

If $\delta_k^T\cdot \gamma_k>0$, compute $M_{k+1}$ with the BFGS formula; otherwise set $M_{k+1}=I$.

Compute $d_{k+1}=-M_{k+1}\cdot g_{k+1}$, reset $\mu=1$ and increment $k$.

Until a certain number of iterations is reached or until the weights almost do not change during an iteration ( $\Vert g_k\Vert\approx 0$).
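
As a concrete illustration of the whole procedure, a minimal self-contained NumPy sketch could look like this; E and grad_E stand for the error function and its gradient, and the simple backtracking rule above plays the role of the step-selection method (both are assumptions of the example, not prescriptions of the text):

\begin{verbatim}
import numpy as np

def quasi_newton_train(E, grad_E, w0, tau=1e-4, r=0.5,
                       max_iter=200, grad_tol=1e-6):
    """Quasi-Newton (BFGS) minimization of E, following the steps above."""
    w = np.asarray(w0, dtype=float)
    g = grad_E(w)
    M = np.eye(w.size)                        # M_0 = I
    for k in range(max_iter):
        if np.linalg.norm(g) <= grad_tol:     # ||g_k|| ~ 0: stop
            break
        d = -M @ g                            # d_k = -M_k g_k
        # backtracking line search with tolerance tau and reduction factor r
        mu, E0, slope = 1.0, E(w), g @ d
        while E(w + mu * d) > E0 + tau * mu * slope and mu > 1e-12:
            mu *= r
        w_new = w + mu * d                    # weight update
        g_new = grad_E(w_new)
        delta, gamma = w_new - w, g_new - g
        dg = delta @ gamma
        if dg > 0.0:                          # BFGS update of M_k
            Mg = M @ gamma
            M = (M + (1.0 + gamma @ Mg / dg) * np.outer(delta, delta) / dg
                   - (np.outer(delta, Mg) + np.outer(Mg, delta)) / dg)
        else:
            M = np.eye(w.size)                # safeguard: reset to the identity
        w, g = w_new, g_new
    return w
\end{verbatim}

For example, calling quasi_newton_train with a quadratic error and its gradient converges in very few iterations, as expected from the discussion above.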

In practice, this algorithm works fine with small networks. However, the matrix $M_k$ has as many rows and columns as the network has weights, so its storage grows with the square of the number of weights. This becomes a problem when we begin to use networks with a few thousand weights. In order to bypass this problem, we used a memoryless version of the Quasi-Newton algorithm [4]. This version is based on an approximation: in the BFGS formula, $M_k$ is replaced by the identity matrix $I$. We can then update the search direction using the formula:


\begin{displaymath}d_{k+1}=-g_{k+1}+A\cdot \delta_k+B\cdot \gamma_k\end{displaymath}

where $A$ and $B$ are two real numbers defined by:


\begin{displaymath}A=-\left(1+\frac{\gamma_k^T\cdot \gamma_k}{\delta_k^T\cdot \gamma_k}\right)\cdot \frac{\delta_k^T\cdot g_{k+1}}{\delta_k^T\cdot \gamma_k}+\frac{\gamma_k^T\cdot g_{k+1}}{\delta_k^T\cdot \gamma_k}\end{displaymath}

and


\begin{displaymath}B=\frac{\delta_k^T\cdot g_{k+1}}{\delta_k^T\cdot \gamma_k}\end{displaymath}

As we can see, this version of the Quasi-Newton algorithm only requires storing a few vectors and no matrix. So if $W$ is the number of weights of the network, the memory used is ${\cal O}\left(W\right)$ instead of ${\cal O}\left(W^2\right)$.
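
A minimal sketch of this memoryless direction update could be the following, under the same assumptions as the previous examples (g_new, delta and gamma are NumPy vectors; the fallback to the plain gradient direction when $\delta_k^T\cdot \gamma_k\leq 0$ is our own safeguard, analogous to the reset to the identity above):

\begin{verbatim}
def memoryless_direction(g_new, delta, gamma):
    """Memoryless Quasi-Newton search direction.

    Computes d_{k+1} = -g_{k+1} + A*delta_k + B*gamma_k, i.e. the BFGS
    direction obtained when M_k is replaced by the identity matrix, so
    only a few vectors of size W have to be stored.
    """
    dg = delta @ gamma                        # delta_k^T gamma_k
    if dg <= 0.0:
        return -g_new                         # assumed safeguard: steepest descent
    A = -(1.0 + gamma @ gamma / dg) * (delta @ g_new) / dg + (gamma @ g_new) / dg
    B = (delta @ g_new) / dg
    return -g_new + A * delta + B * gamma
\end{verbatim}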


