2

The question concerns the proof of theorem 7.2 in Understanding Machine Learning by Shalev-Schwartz & Ben-David. The authors argue in the following way: suppose $\mathcal{H}$ is non-uniformly learnable, that is,

Def: There exists an algorithm $A$ an a function $m_{\mathcal{H}}^{NUL}:(0,1)^2 \times \mathcal{H} \rightarrow \mathbb{N}$ such that for any $\epsilon, \delta\in (0,1)$ and any $h\in\mathcal{H}$, if $m\geq m_{\mathcal{H}}^{NUL}(\epsilon, \delta, h)$, then for any distribution $\mathcal{D}$ over the sample space, if the size of training sample $S$ is equal to $m$, then with probability of at least $1-\delta$ we have $$L_{\mathcal{D}}(A(S)) \leq L_{\mathcal{D}}(h) + \epsilon,$$ where $L_{\mathcal{D}}$ denotes the true risk, i.e. the probability of misclassification.

Now, the theorem concerned says that the class is non-uniform learnable if and only if it is a countable union of hypotheses with a finite VC-dimension. The proof in the book defines $$\mathcal{H}_n = \{h\in \mathcal{H}:m_{\mathcal{H}}^{NUL}(1/8, 1/7, h) \leq n\}.$$ Obviously, $\mathcal{H}$ is a union of $\mathcal{H}_n$. However, it is not clear to me why do $\mathcal{H}_n$ have finite VC-dimension. The proof invokes Fundamental Theorem of Statistical Learning which provides the lower bound for the required training sample guaranteeing $(\epsilon, \delta)$ accuracy-confidence but this theorem assumes the finite VC-dimension, which is precisely what we want to show. Therefore, the argument on the surface appears to be circular. It is also not clear to me why would it break down if I replaced $(1/8, 1/7)$ with $(1/2, 1)$.

Any insight would be much appreciated!

2 Answers2

1

Statement: $VC(\cal{H}_n) \leq 2n$.

Proof:

  1. By contradiction, assume it is larger than $2n$, i.e., there exist $2n$ points ${\cal Y}^{2n} \subseteq \cal{X}$, which can be shattered by $\cal{H}_n$.
  2. The no-free-lunch theorem says, that for the non-uniformly learning algorithm $A^{\text{NUL}}$ (from the definition of the non-uniform learnability) there exists a distribution ${\cal D}_A$ over ${\cal Y}^{2n}$, for which with the probability of at least 1/7 over the choice $S \sim {\cal D}_A^n$ it holds that ${\cal L}_{{\cal D}_A}(A(S)) > \frac18$.
  3. Since ${\cal Y}^{2n}$ is shattered by ${\cal {H}}_n$, there exists a hypothesis $h'$ such that ${\cal L}_{{\cal D}_A}(h') = 0$. The non-uniform learnability definition w.r.t. $h'$ says, in particular, that with the probability of at least 6/7 over the choice $S \sim {\cal D}_A^n$, it holds that ${\cal L}_{{\cal D}_A}(A(S)) \leq {\cal L}_{{\cal D}_A}(h') + \frac18 = \frac18$, what contradicts the statement in (2).
Elnur
  • 352
0

You're right about the argument is circular and I think there's a typo there. Specifically, the paragraph should have been written as Using Corollary 6.4 ... instead of Using the fundamental theorem of statistical learning ....

So it reads as follows:

In addition using the definition of $m_\mathcal{H}^{NUL}$ we know that any distribution $\mathcal{D}$ that satisfies the realizability assumption with respect to $\mathcal{H}_n$, with probability of at least 6/7 over $S \sim \mathcal{D}_n$ we have that $L_\mathcal{D}(A(S)) \le 1/8$.

Using Corollary 6.4, this implies that the VC-dimension of $\mathcal{H}_n$ must be finite...

Why I think so? It is because if VC-dimension of $\mathcal{H}_n$ is infinite, then based on Corollary 6.4, there exists a distribution $\mathcal{D}$ such that $L_\mathcal{D}(h)=0$ (or equivalently the realizability holds) and with probability of at least 1/7 over $S \sim \mathcal{D}^n$ we have that $L_\mathcal{D}(A(S)) \ge 1/8$.

This is clearly contrary to the assumption that $\mathcal{H}_n$ is nonuniform, i.e. for any $\mathcal{D}$ if $L_\mathcal{D}(h) = 0$, or equivalently the realizability assumption holds, then with probability of at least 6/7 over $S \sim \mathcal{D}^n$ we have that $L_\mathcal{D}(A(S)) \le 1/8$.