3.1 – The Newton-Raphson Method

This post is based on Andrew Ng’s CS229 course at Stanford University.

We now briefly digress to talk about an algorithm used to find optima (remember: we used batch gradient descent to do this). This method is called Newton’s method, or the Newton-Raphson method, and is usually faster than batch gradient descent.

Suppose we have a function f : \mathbb{R} \to \mathbb{R}. That is, it takes in a real number as input, and gives some other real number as output. We’re interested in finding a value x (a real number) such that f(x) = 0. We start with some initial guess, say x_0. For the Newton-Raphson method, the subscript indicates which iteration we’re currently on; the 0th iterate is our initial guess. The Newton-Raphson method performs the following update:

x_{n+1} = x_n - \frac{f(x_n)}{f^{\prime}(x_n)}
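As a concrete sketch, this update is straightforward to code up when f and its derivative are available in closed form (the function and parameter names below are illustrative, not from the post):

```python
# Minimal sketch of the scalar Newton-Raphson iteration.
# Assumes f and f_prime are known in closed form; tol and max_iter
# are illustrative choices.

def newton_root(f, f_prime, x0, tol=1e-10, max_iter=50):
    """Iterate x <- x - f(x)/f'(x) until |f(x)| < tol."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / f_prime(x)
    return x

# Example: f(x) = x^2 - 2 has a root at sqrt(2) ≈ 1.41421356
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x0=4.5)
```

Note that the iteration can fail if f^{\prime}(x_n) is zero (or very small) at some iterate, which is why a maximum iteration count is a sensible safeguard.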

Let’s understand what this intuitively means. Consider the function shown below. The line y = 0 is also plotted.


This function crosses zero at around x = 1.3. Let’s pick an initial guess of x_0 = 4.5. We draw the vertical line x = 4.5, take the point where this line meets the function, and draw the tangent to the function at that point. The point where this tangent meets the horizontal line y = 0 gives our next guess, x_1. This is illustrated in the next figure, with the steps numbered for clarity.


We continue doing the exact same procedure, now starting with x_1, to obtain x_2, and so on. x_1 is about 2.8, and x_2 is about 1.8.
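To see why this geometric procedure matches the update rule, write down the equation of the tangent line at the current guess x_n:

y = f(x_n) + f^{\prime}(x_n)(x - x_n)

Setting y = 0 and solving for x gives x = x_n - \frac{f(x_n)}{f^{\prime}(x_n)}, which is exactly the update we started with.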

So now we know how to use the Newton-Raphson method (or Newton’s method) to find a value x for which f(x) = 0. How do we use this to maximize or minimize a function? We note that at a maximum or minimum of a function, f^{\prime}(x) = 0. Thus, we can use Newton’s method exactly as described above, applied to f^{\prime}(x) instead of f(x). We can use this to maximize our log-likelihood l(\Theta), and thus our update rule is
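In code, this is the same iteration as before, with f replaced by the derivative (a sketch with illustrative names; the example function is mine, not from the post):

```python
# Sketch: minimize a 1-D function by running Newton-Raphson on its
# derivative, i.e. x <- x - f'(x) / f''(x).

def newton_optimize(f_prime, f_double_prime, x0, tol=1e-10, max_iter=50):
    """Find a stationary point: iterate until |f'(x)| < tol."""
    x = x0
    for _ in range(max_iter):
        g = f_prime(x)
        if abs(g) < tol:
            break
        x = x - g / f_double_prime(x)
    return x

# Example: f(x) = (x - 3)^2 has f'(x) = 2(x - 3), f''(x) = 2,
# and its minimum at x = 3. For a quadratic, one step suffices.
x_min = newton_optimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=10.0)
```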

\Theta = \Theta - \frac{l^{\prime}(\Theta)}{l^{\prime \prime}(\Theta)}

We note that for logistic regression, \Theta is a vector and not a real number, so we modify the equation accordingly. Remember, from our post on linear regression, that we defined \nabla_\Theta J(\Theta) as the gradient of the cost function with respect to the vector \Theta. Here we use the analogous gradient of the log-likelihood, \nabla_\Theta l(\Theta), for the numerator. But what about the denominator? We need an object that represents the second derivative of a function of a vector. Fortunately, this already exists, and is called the Hessian. Each entry of the Hessian is defined as

H_{ij} = \frac{\partial^2 l(\Theta)}{\partial \theta_i \partial \theta_j}

Since we need the second derivative in the denominator, and division is not defined for matrices, we multiply by the inverse of the Hessian instead, giving the final update rule (here, \Theta is a vector):

\Theta = \Theta - H^{-1} \nabla_\Theta l(\Theta)
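The update above can be sketched for logistic regression with NumPy. This is a minimal illustration under my own assumptions: a design matrix X of shape (m, n + 1) that already includes the intercept column, the standard gradient \nabla_\Theta l = X^T(y - h) and Hessian H = -X^T \, \mathrm{diag}(h(1-h)) \, X of the log-likelihood, and a tiny synthetic dataset:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def newton_step(theta, X, y):
    """One Newton-Raphson step for the logistic regression log-likelihood."""
    h = sigmoid(X @ theta)            # predictions h_theta(x), shape (m,)
    grad = X.T @ (y - h)              # gradient of l(theta)
    H = -(X.T * (h * (1 - h))) @ X    # Hessian of l(theta): -X^T diag(h(1-h)) X
    # theta <- theta - H^{-1} grad; solve the linear system rather than
    # explicitly inverting H, which is cheaper and more numerically stable.
    return theta - np.linalg.solve(H, grad)

# Tiny synthetic, non-separable dataset (first column is the intercept).
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 0.0, 1.0])
theta = np.zeros(2)
for _ in range(10):
    theta = newton_step(theta, X, y)  # gradient is ~0 after a few steps
```

Note that because l(\Theta) is concave, H is negative definite, so subtracting H^{-1} \nabla_\Theta l(\Theta) moves \Theta uphill, toward the maximum of the log-likelihood.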

Here, H has dimensions (n + 1) \times (n + 1). While Newton’s method takes fewer steps to converge, each step requires computing and inverting the Hessian and then performing a matrix-vector multiplication, which can be very expensive when n is large. When the Newton-Raphson method is used instead of gradient descent to fit logistic regression, it’s called Fisher scoring.
