# 3.1 – The Newton-Raphson Method

This post is based on Andrew Ng’s CS229 course at Stanford University.

We now briefly digress to talk about an algorithm for finding the optima of a function (recall that we used batch gradient descent to do this). This method is called Newton’s method, or the Newton-Raphson method, and it usually converges in far fewer iterations than batch gradient descent.

Suppose we have a function $f : \mathbb{R} \to \mathbb{R}$. That is, it takes a real number as input and gives some other real number as output. We’re interested in finding a value $x$ (a real number) such that $f(x) = 0$. We start with some initial guess, say $x_0$. In the Newton-Raphson method, the subscript indicates which iteration we’re currently on, so the 0th iterate is our initial guess. The Newton-Raphson method performs the following update:

$x_{n+1} = x_n - \frac{f(x_n)}{f^{\prime}(x_n)}$
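As a sketch, this update is easy to implement directly. The function and derivative below (a toy quadratic, not from this post) are stand-ins to illustrate the iteration:

```python
def newton_raphson(f, df, x0, tol=1e-10, max_iter=100):
    """Repeatedly apply x := x - f(x)/f'(x) until the step is tiny."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Toy example: the positive root of f(x) = x^2 - 2 is sqrt(2)
root = newton_raphson(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
```

Note the fast convergence: starting from $x_0 = 1$, the iterates reach $\sqrt{2}$ to machine precision in a handful of steps.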

Let’s understand what this intuitively means. Consider the function shown below. The line $y = 0$ is also plotted.

The function crosses zero at around $x = 1.3$. Let’s pick an initial guess of $x_0 = 4.5$ and draw the vertical line $x = 4.5$. We take the point where this vertical line meets the function, and draw the tangent to the function at that point. Then, we note where this tangent meets the horizontal line $y = 0$ that we drew earlier: the $x$-coordinate of that intersection is our next guess, $x_1$. This is illustrated in the next figure, with the steps numbered for clarity.

We continue doing the exact same procedure, now starting with $x_1$, to obtain $x_2$, and so on. $x_1$ is about 2.8, and $x_2$ is about 1.8.

So now we know how to use the Newton-Raphson method (or Newton’s method) to find a value for which $f(x) = 0$. How do we use this to maximize or minimize a function? We note that at a maximum or minimum of a function, $f^{\prime}(x) = 0$. Thus, we can use Newton’s method exactly as described above, but applied to $f^{\prime}(x)$ instead of $f(x)$. We can use this to maximize our log-likelihood function $l(\Theta)$, and thus our update rule is

$\Theta = \Theta - \frac{l^{\prime}(\Theta)}{l^{\prime \prime}(\Theta)}$
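As a sketch of this idea, the same scalar iteration can simply be pointed at the derivative. The quadratic $l(\theta) = -(\theta - 3)^2$ below is a hypothetical example, chosen because its maximum sits at $\theta = 3$:

```python
def newton_optimize(dl, d2l, theta0, tol=1e-10, max_iter=100):
    """Find a stationary point of l by applying Newton-Raphson to l'(theta)."""
    theta = theta0
    for _ in range(max_iter):
        step = dl(theta) / d2l(theta)
        theta = theta - step
        if abs(step) < tol:
            break
    return theta

# Hypothetical example: l(theta) = -(theta - 3)^2, maximized at theta = 3
theta_star = newton_optimize(lambda t: -2.0 * (t - 3.0), lambda t: -2.0, theta0=0.0)
```

Because the update only seeks $l^{\prime}(\theta) = 0$, it finds stationary points in general; whether that point is a maximum or a minimum depends on the function (for a concave log-likelihood it is the maximum). For a quadratic like this one, a single Newton step lands exactly on the optimum.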

We note that for logistic regression, $\Theta$ is a vector and not a real number, so we must modify the equation accordingly. Recall, from our post on linear regression, that we defined $\nabla_\Theta J(\Theta)$ as the gradient of the cost function with respect to the vector $\Theta$. We use the analogous gradient of the log-likelihood, $\nabla_\Theta l(\Theta)$, for the numerator. But what about the denominator? We need something that represents the second derivative of a function of a vector $\Theta$. Fortunately, such an object already exists, and it is called the Hessian. Each entry of the Hessian is defined as

$H_{ij} = \frac{\partial^2 l(\Theta)}{\partial \theta_i \partial \theta_j}$

Since the second derivative appears in the denominator, and division by a matrix is not defined, we multiply by the inverse of the Hessian instead, which gives the final update rule (here, $\Theta$ is a vector):

$\Theta = \Theta - H^{-1} \nabla_\Theta l(\Theta)$

Here, $H$ has dimensions $(n + 1) \times (n + 1)$, where $n$ is the number of features and the extra dimension accounts for the intercept term. While Newton’s method takes fewer steps to find the optimum, each step requires computing and inverting the Hessian, and then multiplying it with the gradient vector. This can be very expensive for large values of $n$. When the Newton-Raphson method is used instead of gradient descent to maximize the logistic regression log-likelihood, the resulting algorithm is called Fisher scoring.
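To make the vector update concrete, here is a sketch in NumPy. The dataset is a made-up toy example, and the helper names are assumptions, not from the original post; note also that in practice one solves the linear system $H s = \nabla_\Theta l(\Theta)$ rather than forming $H^{-1}$ explicitly:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y):
    """One Newton-Raphson update for the logistic regression log-likelihood.

    X: (m, n + 1) design matrix with a leading column of ones (intercept)
    y: (m,) labels in {0, 1}
    """
    h = sigmoid(X @ theta)                   # hypothesis h_theta(x) per example
    grad = X.T @ (y - h)                     # gradient: nabla_theta l(theta)
    S = h * (1.0 - h)                        # per-example weights h(1 - h)
    H = -(X.T * S) @ X                       # Hessian of l(theta): -X^T diag(S) X
    # theta := theta - H^{-1} grad, via a linear solve instead of an explicit inverse
    return theta - np.linalg.solve(H, grad)

# Toy, non-separable dataset (hypothetical): one feature plus an intercept column
X = np.array([[1.0, x] for x in [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

theta = np.zeros(2)
for _ in range(10):
    theta = newton_step(theta, X, y)
```

A handful of these iterations is typically enough to drive the gradient to numerical zero, versus the many passes batch gradient descent might need, which is exactly the trade-off described above.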