Perceptron Convergence Proof

Okay, so here's the setup for proving that the perceptron learning algorithm converges in the binary classification case. If this is new to you, the perceptron is a linear classifier defined as follows:

• $$\hat{y} = \text{sign}(w \cdot x)$$
• The perceptron outputs the sign of the dot product between the weight vector and the input, so the prediction is either $$-1$$ or $$+1$$.
• If $$\hat{y}_t \not = y_t$$ then we update $$w_{k+1} = w_{k} + y_t(x_t)^T$$
• It's an iterative algorithm: on iteration $$k+1$$, the first misclassified input (scaled by its label) is added to our weight vector, and then we iterate again (see the code sketch after this list).
• $$w_1 = 0$$
• We initialize the starting weight vector to the zero vector.
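
To make this concrete, here's a minimal sketch of the algorithm in Python. Everything here (the function name `perceptron_train`, the epoch cap, the array layout) is my own illustration, not a reference implementation:

```python
import numpy as np

def perceptron_train(X, y, max_epochs=1000):
    """Perceptron learning: X is an (n, d) array of inputs,
    y an (n,) array of labels in {-1, +1}.
    Returns the learned weights and the number of updates k."""
    n, d = X.shape
    w = np.zeros(d)                        # w_1 = 0, the zero vector
    k = 0                                  # count of updates (mistakes)
    for _ in range(max_epochs):
        mistakes = False
        for t in range(n):
            # Misclassified (or on the boundary) when y_t (w . x_t) <= 0
            if y[t] * np.dot(w, X[t]) <= 0:
                w = w + y[t] * X[t]        # w_{k+1} = w_k + y_t x_t
                k += 1
                mistakes = True
        if not mistakes:                   # a clean pass: converged
            break
    return w, k
```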

For this proof, we'll assume a few things:

• There exists an optimal $$w = w^*$$ such that for some $$\epsilon > 0$$, it's the case that $$y_i(w^* \cdot x_i) > \epsilon$$ for all inputs in the training set.
• In other words, we assume that a separating solution exists with a margin greater than $$\epsilon$$.
• $$||w^*|| = 1$$
• This pins down the scale of $$w^*$$: any positive rescaling of $$w^*$$ defines the same separating hyperplane, so we fix its norm to 1.
• For every $$x_i \in X$$, it's the case that $$||x_i|| \le R$$.
• We're upper bounding the norm of the data points in our training set (the snippet after this list checks both quantities on a toy example).
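
Here's a quick sanity check of these assumptions on a made-up toy dataset (all values are illustrative, not from any real problem):

```python
import numpy as np

# Toy linearly separable data (illustrative values only)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

# A candidate separator, normalized so that ||w*|| = 1
w_star = np.array([1.0, 1.0])
w_star /= np.linalg.norm(w_star)

R = max(np.linalg.norm(x) for x in X)                 # ||x_i|| <= R for all i
margin = min(y[i] * np.dot(w_star, X[i]) for i in range(len(X)))

print(f"R = {R:.3f}, margin = {margin:.3f}")          # any epsilon below the margin works
```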

To start, we show that $$w_{k+1} \cdot (w^*)^T \ge k \epsilon$$ with a proof by induction.

Base case where $$k = 0$$: $$w_{0+1}\cdot (w^*)^T = 0 \ge 0 \cdot \epsilon$$

Inductive step where $$k \to k+1$$, with $$x_t$$ the misclassified input that triggers update $$k$$:

$$w_{k+1}\cdot (w^*)^T = (w_k + y_t(x_t)^T)\cdot (w^*)^T = w_k \cdot (w^*)^T + y_t (w^* \cdot x_t) \ge (k-1) \epsilon + \epsilon = k \epsilon$$

where the inequality uses the inductive hypothesis $$w_k \cdot (w^*)^T \ge (k-1)\epsilon$$ together with the margin assumption on $$w^*$$.
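The inductive step says each update pushes $$w$$ at least $$\epsilon$$ further along $$w^*$$. A numeric spot check of that claim, reusing `X`, `y`, `w_star`, and `margin` from the toy snippet above:

```python
import numpy as np

w = np.zeros(2)
proj = 0.0                                     # current value of w . w*
for t in range(len(X)):
    if y[t] * np.dot(w, X[t]) <= 0:            # misclassified point
        w = w + y[t] * X[t]
        new_proj = np.dot(w, w_star)
        # each update adds y_t (w* . x_t), which is at least the margin
        assert new_proj - proj >= margin - 1e-9
        proj = new_proj
```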

Next, we show that $$||w_{k+1}|| \ge k \epsilon$$. We can write $$w_{k+1} \cdot (w^*)^T = ||w_{k+1}|| \times ||w^*|| \times \cos(\theta)$$, where $$\theta$$ is the angle between $$w_{k+1}$$ and $$w^*$$. Because $$\cos(\theta) \le 1$$, we have $$||w_{k+1}|| \times ||w^*|| \ge w_{k+1} \cdot (w^*)^T \ge k \epsilon$$. Finally, because $$||w^*|| = 1$$ by assumption, $$||w_{k+1}|| \ge k \epsilon$$.

Now we notice what an update does to the squared norm: $$||w_{k+1}||^2 = ||w_{k} + y_t (x_t)^T||^2 = ||w_k||^2 + 2y_t (w_k \cdot x_t) + ||x_t||^2$$ (using $$y_t^2 = 1$$ for the last term). Because $$x_t$$ is misclassified, $$y_t(w_k \cdot x_t) \le 0$$ (treating a point on the boundary as a mistake), so the cross term is non-positive and $$||w_{k+1}||^2 \le ||w_k||^2 + ||x_t||^2$$. Finally, because $$||x_i|| \le R$$ by assumption, $$||w_{k+1}||^2 \le ||w_k||^2 + R^2$$.
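
The same kind of spot check works for this norm-growth step (again reusing the toy `X`, `y`, and `R` from above): the squared norm grows by at most $$R^2$$ per update because the cross term is non-positive.

```python
import numpy as np

w = np.zeros(2)
for t in range(len(X)):
    if y[t] * np.dot(w, X[t]) <= 0:            # misclassified point
        w_next = w + y[t] * X[t]
        # growth = 2 y_t (w . x_t) + ||x_t||^2  <=  ||x_t||^2  <=  R^2
        growth = np.dot(w_next, w_next) - np.dot(w, w)
        assert growth <= R**2 + 1e-9
        w = w_next
```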

Next we show that $$||w_{k+1}||^2 \le kR^2$$ with a proof by induction.

Base case where $$k = 0$$: $$||w_{0+1}||^2 = 0 \le 0 \cdot R^2$$

Inductive step where $$k \to k+1$$

By the step above, we know that $$||w_{k+1}||^2 \le ||w_k||^2 + R^2$$. Now, by the inductive hypothesis, $$||w_k||^2 \le (k-1)R^2$$, so $$||w_{k+1}||^2 \le (k-1)R^2 + R^2 = kR^2$$.

Squaring the lower bound $$||w_{k+1}|| \ge k \epsilon$$ from before and combining it with the bound we just proved, we have the full inequality: $$k^2 \epsilon^2 \le ||w_{k+1}||^2 \le kR^2$$

Dividing through by $$k \epsilon^2$$, we can see that $$k \le \frac{R^2}{\epsilon^2}$$, bounding the number of updates our algorithm makes in terms of the margin and the norm of the farthest data point. A quick empirical check of the bound follows.
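
Putting it all together, here's a sketch that checks the bound on the toy data, reusing `perceptron_train`, `X`, `y`, `R`, and `margin` from the earlier snippets. Using the measured margin as $$\epsilon$$ gives the tightest version of the bound the data certifies:

```python
w, k = perceptron_train(X, y)
epsilon = margin                          # largest epsilon the toy w* certifies
bound = (R / epsilon) ** 2
print(f"updates: k = {k}, bound: R^2/eps^2 = {bound:.2f}")
assert k <= bound                         # the convergence guarantee
```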

Last Updated: 2020-06-12 07:40
First Published: 2020-06-12 07:40