# 11.1 – Convex Optimization

## A recap of the Lagrangian method

If you remember how to solve an optimization problem using Lagrange multipliers, you can skip this section. Suppose you have an optimization problem \begin{aligned} &\min\limits_w f(w) \\ & \text{s.t. } h_i(w) = 0, i = 1, 2, \ldots, l \end{aligned}

Then you first construct the Lagrangian: $\mathcal{L}(w, \boldsymbol \lambda) = f(w) + \sum\limits_{i=1}^l \lambda_i h_i(w)$

where each $\lambda-i$ is called a Lagrange multiplier. Then, you set all partial derivatives to 0: \begin{aligned} \frac{\partial \mathcal{L}}{\partial w} &= 0 \\ \frac{\partial \mathcal{L}}{\partial \lambda_i} &= 0 \end{aligned}

Only when there exists some values for each $\lambda_i$ satisfying all the above equations, will there exist some solution for $w$.

## Primal problems and duality

Let’s consider a slightly more difficult problem. Suppose our optimization problem is \begin{aligned} \min\limits_w &f(w) \\ \text{s.t. } &g_i(w) \leq 0, i = 1, 2, \ldots, k \\ & h_i(w) = 0, i = 1, 2, \ldots, l \end{aligned}

The Lagrangian for this problem is similar: $\mathcal{L}(w, \boldsymbol \lambda, \boldsymbol \mu) = f(w) + \sum\limits_{i=1}^k \lambda_i g_i(w) + \sum\limits_{i=1}^l \mu_i h_i(w)$

This is where it gets interesting. Let’s define a problem: $\Theta_p(w) = \max\limits_{\substack{\boldsymbol\lambda, \boldsymbol \mu \\ \lambda_i \geq 0}} \mathcal{L}(w, \boldsymbol\lambda, \boldsymbol\mu)$

To understand why this problem is interesting, we need to consider yet another problem: $p^* = \min\limits_w \max\limits_{\substack{\boldsymbol\lambda, \boldsymbol \mu \\ \lambda_i \geq 0}} \mathcal{L}(w, \boldsymbol\lambda, \boldsymbol\mu) = \min\limits_w \Theta_p(w)$

It turns out that this problem is actually equal to our original problem of minimizing $f(w)$ (at the beginning of this topic). To see why, suppose that some $g_i(w) > 0$ (which violates a condition of our original problem). Because the optimizer wants to maximize $\mathcal{L}$, it could simply set $\lambda_i$ to $\infty$. Similarly, suppose we have a violation of the second condition, i.e., for some $i$, $h_i(w) \neq 0$. Then the optimizer could set the corresponding $\mu_i$ to either $+\infty$ or $-\infty$ depending on the sign of $h_i(w)$. Therefore, if the constraints are satisfied, then $\Theta_p(w) = f(w)$, otherwise $\Theta_p(w) = \infty$. Thus, to minimize $\Theta_p(w)$, the optimizer had better find values of $\lambda, \mu$ such that the constraints are satisfied.

The subscript $p$ in the above discussion is due to the fact that the original problem above is called a primal problem. Now for every primal problem, there is a corresponding dual problem. The dual problem is the below. $d^* = \max\limits_{\substack{\boldsymbol\lambda, \boldsymbol \mu \\ \lambda_i \geq 0}}\min\limits_w \mathcal{L}(w, \boldsymbol\lambda, \boldsymbol\mu) = \min\limits_w \Theta_p(w)$

This really is the same problem, with the max and min parts swapped. However, this swap does make a difference. It turns out that for any function to be optimized, $\max\min\mathcal{L}(\ldots) \leq \min\max\mathcal{L}(\ldots)$

Why are we bothering with primal and dual problems in the first place? For some cases, the dual problem is easier to optimize than the primal problem, so the solution to the primal problem involves finding the dual, and then solving that problem. Obviously, this procedure will only work in the cases where the equality above holds true. Let’s look at the conditions for this.

Suppose that $f$ and all the $g_i$ are convex functions, and that all $h_i$ are affine ( $h_i(w) = \alpha^T w + \beta$). Further, assume that all the $g_i$ are strictly feasible–this means that there exists some $w$ such that all the $g_i(w) < 0$ (the strict word here means that all the conditions are satisfied with inequality; there’s no equality in this condition).

Under these conditions, the value of the dual problem will be equal to the value of the primal problem. Further, there will exist $w^*, \boldsymbol\lambda^*, \boldsymbol\mu^*$ such that $w^*$ satisfies the primal problem, $\boldsymbol\lambda^*, \boldsymbol\mu^*$ satisfy the dual problem, and a set of conditions, called the Karush-Kuhn-Tucker (KKT) conditions, will be satisfied. The converse of this statement is also true. There are five conditions, but only three interest us: \begin{aligned} \lambda_i^* g_i(w^*) &= 0 \\ g_i(w^*) &\leq 0 \\ \lambda_i^* &\geq 0 \end{aligned}

Observe that if in the last condition, only the inequality holds, then all of $g_i(w^*)$ must be 0 to satisfy the first condition.

Let’s now use this knowledge to pose the dual of our problem.