Proving the No-Free-Lunch Theorem

The No-Free-Lunch Theorem is an important part of statistical learning theory. I'm going to trace the proof found in Understanding Machine Learning for my own benefit.

The theorem is this:

Let \(A\) be any learning algorithm for binary classification with the 0-1 loss over a domain \(X\), and let \(m\) be a training set size smaller than \(|X|/2\). Then there exists a distribution \(D\) over \(X \times \{0,1\}\) such that:

  1. There exists a function \(f: X \to \{0,1\}\) with \(L_D(f) = 0\). That is, there is some binary function over \(X\) whose true loss over \(D\) is 0, i.e. a function that classifies every point correctly.

  2. With probability at least \(\frac{1}{7}\) over the choice of \(S \sim D^m\), we have \(L_D(A(S)) \ge \frac{1}{8}\). In other words, for at least a \(\frac{1}{7}\) fraction (in probability) of the training sets of size \(m\) drawn from \(D\), the true loss over \(D\) of the hypothesis that \(A\) outputs on that training set is at least \(\frac{1}{8}\).

Proof:

Let \(C \subseteq X\) with \(|C| = 2m\) (possible because \(m < |X|/2\)). There are \(T = 2^{2m}\) possible functions from \(C\) to \(\{0,1\}\); call them \(f_1, \dots, f_T\). For each \(f_i\), let \(D_i\) be the distribution over \(C \times \{0,1\}\) that puts probability \(\frac{1}{2m}\) on each pair \((x, f_i(x))\) for \(x \in C\), and probability 0 everywhere else. Note that \(L_{D_i}(f_i) = 0\) by construction.
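
For concreteness (a toy illustration of my own, not from the book): with \(m = 1\) and \(C = \{a, b\}\), there are \(T = 4\) labelings, and the labeling \(f\) with \(f(a) = 0, f(b) = 1\) gives the distribution that puts probability \(\frac{1}{2}\) on \((a, 0)\) and \(\frac{1}{2}\) on \((b, 1)\).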

We're going to show that:

\(\underset{i \in [T]}{\text{max }} \underset{S \sim D_i^m}{\mathbb{E}} [L_{D_i}(A(S))] \ge 1/4\)

That is, among all functions from \(C\) to \(\{0,1\}\) (or rather, among the distributions \(D_i\) supported only on each function's \((x, f_i(x))\) pairs), the largest expected loss of running our learning algorithm on a training set of size \(m\) is at least \(\frac{1}{4}\).

When a random variable \(X\) takes values in \([0,1]\) and \(0 \le a < 1\), a reversed form of Markov's inequality shows that:

\(P(X \ge a) \ge \frac{\mathbb{E}[X]-a}{1-a}\)
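
(One quick way to see this, sketched here for my own benefit: since \(0 \le X \le 1\),

\(\mathbb{E}[X] \le a\,P(X < a) + 1 \cdot P(X \ge a) = a + (1-a)P(X \ge a)\)

and rearranging gives the bound.)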

Applying that (with \(a = \frac{1}{8}\)) to the random variable \(L_{D_i}(A(S))\), where \(D_i\) is a distribution achieving the maximum above, we see that:

\(P\left(L_{D_i}(A(S)) \ge \frac{1}{8}\right) \ge \frac{\frac{1}{4}-\frac{1}{8}}{1-\frac{1}{8}} = \frac{1}{7}\)

as desired. Taking \(D\) to be this \(D_i\), and recalling that \(L_{D_i}(f_i) = 0\), gives both conditions of the theorem.

So now we just have to prove the inequality itself.

Well, what do we know about the expected loss for a given distribution \(D_i\) and a learning algorithm? There are \(k = (2m)^m\) possible sequences of \(m\) examples from \(C\); denote them \(S_1, \dots, S_k\), and write \(S^i_j\) for the sequence \(S_j\) with each point labeled by \(f_i\), i.e. \(S^i_j = ((x_1, f_i(x_1)), \dots, (x_m, f_i(x_m)))\). Under \(D_i\) each of these labeled sequences is equally likely, so the expected loss is just the average over all of them:

\(\underset{S \sim D_i^m}{\mathbb{E}}[L_{D_i}(A(S))] = \frac{1}{k}\sum_{j=1}^kL_{D_i}(A(S^i_j))\)
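
(Continuing my toy illustration with \(m = 1\) and \(C = \{a, b\}\): \(k = 2\), the unlabeled sequences are \(S_1 = (a)\) and \(S_2 = (b)\), and for the labeling \(f_i\) with \(f_i(a) = 0, f_i(b) = 1\) the labeled training sets are \(S^i_1 = ((a, 0))\) and \(S^i_2 = ((b, 1))\).)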

And we care about the maximum of this quantity over all the distributions \(D_i\). The maximum is at least the average over all of them, and after swapping the order of the two sums, that average is at least the minimum over the training sets:

\(\underset{i \in [T]}{\text{max }} \frac{1}{k}\sum_{j=1}^kL_{D_i}(A(S^i_j))\)

\(\ge \frac{1}{T}\sum^T_{i=1}\frac{1}{k}\sum^k_{j=1}L_{D_i}(A(S^i_j))\)

\(=\frac{1}{k}\sum^k_{j=1}\frac{1}{T}\sum^T_{i=1}L_{D_i}(A(S^i_j))\)

\(\ge \underset{j \in [k]}{\text{min }} \frac{1}{T}\sum_{i=1}^TL_{D_i}(A(S^i_j))\)

(To be clear, the last line is the smallest value, over all possible training sets \(S_j\), of the loss averaged across all the distributions \(D_i\).)

Next, let's fix some \(j \in [k]\) and write \(S_j = (x_1, \dots, x_m)\). Let \((v_1, \dots, v_p)\) be the points of \(C\) that don't appear in \(S_j\). Since \(S_j\) contains at most \(m\) distinct points and \(|C| = 2m\), we have \(p \ge m\).

Thus, for any hypothesis \(h\) (and since \(D_i\) puts probability \(\frac{1}{2m}\) on each point of \(C\)):

\(L_{D_i}(h) = \frac{1}{2m}\sum_{x\in C}\mathbf{1}[h(x) \not= f_i(x)]\)

\(\ge \frac{1}{2m}\sum_{r=1}^p\mathbf{1}[h(v_r) \not= f_i(v_r)]\)

\(\ge \frac{1}{2p}\sum_{r=1}^p\mathbf{1}[h(v_r) \not= f_i(v_r)]\)

where the last step uses \(p \ge m\), so that \(\frac{1}{2p} \le \frac{1}{2m}\).

Combining our two inequalities (with \(h = A(S^i_j)\)), we get:

\(\frac{1}{T}\sum_{i=1}^TL_{D_i}(A(S^i_j)) \ge \frac{1}{T}\sum_{i=1}^T \frac{1}{2p}\sum_{r=1}^p\mathbf{1}[A(S^i_j)(v_r) \not= f_i(v_r)]\)

\(= \frac{1}{2p}\sum_{r=1}^p\frac{1}{T}\sum_{i=1}^T \mathbf{1}[A(S^i_j)(v_r) \not= f_i(v_r)]\)

\(\ge \frac{1}{2}\underset{r \in [p]}{\text{min }}\frac{1}{T}\sum_{i=1}^T \mathbf{1}[A(S^i_j)(v_r) \not= f_i(v_r)]\)

Now, fix some \(r \in [p]\) and consider the point \(v_r\). Notice that we can partition the functions \(f_1, \dots, f_T\) into \(T/2\) pairs \((f_a, f_b)\) such that \(f_a\) and \(f_b\) agree on every point of \(C\) except \(v_r\). We can do this because we're ranging over all functions from \(C\) to \(\{0,1\}\): fix \(f(v_r) = 0\) and let all the other values vary, then fix \(f(v_r) = 1\) and do the same, and match up the functions that agree everywhere else. Crucially, since \(v_r\) does not appear in \(S_j\), the two functions in a pair label the training set identically, so \(S^a_j = S^b_j\) and therefore \(A(S^a_j) = A(S^b_j)\).

This means that, writing \(h = A(S^a_j) = A(S^b_j)\):

\(\mathbf{1}[h(v_r) \not = f_a(v_r)] + \mathbf{1}[h(v_r) \not = f_b(v_r)] = 1\)

since \(f_a(v_r) \not= f_b(v_r)\), so \(h(v_r)\) disagrees with exactly one of them. Each pair therefore contributes exactly 1 to the sum \(\sum_{i=1}^T \mathbf{1}[A(S^i_j)(v_r) \not= f_i(v_r)]\), and with \(T/2\) pairs the average over \(i\) is exactly \(\frac{1}{2}\).
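
(In my toy \(m = 1\) case with \(C = \{a, b\}\) and \(S_j = (a)\), the only unseen point is \(v_1 = b\), and the four labelings pair up by their value on \(a\): the two with \(f(a) = 0\) form one pair and the two with \(f(a) = 1\) form the other. Within a pair the learner sees the identical training set \(((a, f(a)))\), so whatever it predicts at \(b\) is wrong for exactly one of the two labelings.)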

Thus, \(\frac{1}{2}\underset{r \in [p]}{\text{min }}\frac{1}{T}\sum_{i=1}^T \mathbf{1}[A(S^i_j)(v_r) \not= f_i(v_r)] = \frac{1}{4}\)

Since \(j\) was arbitrary, the minimum over \(j\) of \(\frac{1}{T}\sum_{i=1}^T L_{D_i}(A(S^i_j))\) is at least \(\frac{1}{4}\), and chaining back through our inequalities, \(\underset{i \in [T]}{\text{max }} \frac{1}{k}\sum_{j=1}^kL_{D_i}(A(S^i_j)) \ge \frac{1}{4}\) as desired.
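
As a sanity check (my own addition, not part of the book's proof), here is a small brute-force simulation with \(m = 2\) and a toy "memorize the sample, predict 0 elsewhere" learner; both of those choices are illustrative assumptions. It enumerates every labeling \(f_i\) and every training sequence and confirms that the maximum expected loss is at least \(\frac{1}{4}\):

```python
# Brute-force sanity check of  max_i E[L_{D_i}(A(S))] >= 1/4
# for a toy learner (an illustrative assumption, not from the book):
# memorize the training labels and predict 0 on unseen points.
from itertools import product

m = 2
C = list(range(2 * m))                              # the 2m points of C
functions = list(product([0, 1], repeat=len(C)))    # all T = 2^(2m) labelings f_i
samples = list(product(C, repeat=m))                # all k = (2m)^m unlabeled sequences S_j

def A(labeled_sample):
    """Toy learner: memorize the training labels, predict 0 elsewhere."""
    memory = dict(labeled_sample)
    return lambda x: memory.get(x, 0)

def true_loss(h, f):
    """0-1 loss under D_i: average disagreement with f over the 2m points of C."""
    return sum(h(x) != f[x] for x in C) / len(C)

expected_losses = []
for f in functions:
    # Under D_i every labeled sequence S^i_j is equally likely,
    # so the expected loss is a plain average over all k of them.
    losses = [true_loss(A([(x, f[x]) for x in S]), f) for S in samples]
    expected_losses.append(sum(losses) / len(losses))

print(max(expected_losses))  # >= 0.25, as the theorem predicts
```

Swapping in any other deterministic learner for `A` should still print a value of at least 0.25, which is exactly what the theorem guarantees.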


Last Updated: 2020-06-12 07:40
First Published: 2020-06-12 07:40