The first observation is that if H has finite VC dimension , then empirical risk converge to the true risk as the sample size grows .Therefore ERM generates better and better hypothesis . This suggest to look for some set with infinite VC dimension . Regarding the distribution , to make things easy for us and hard for ERM , we should look for some uniform distribution.
An example that I think it works is the following :
- $X = [0,1]$
- H is the set of all functions $h:X \rightarrow \{0,1\}$ . H clearly has infinite VC dimension
- $c \in H$ some arbitrary fixt function
- P is the distribution induced by the uniform distribution over X and $c$
Since $c \in H$ we can take $h^{*} = c$ , so the true risk is $L_{P}(h^{*}) = 0$
The sample always has a 0 measure support on X (it's a finit set) . The algorithm A will pick a hypothesis with 0 empirical risk , but since it does not know anything about the values that the function c has on the rest of the domain X , it can't perform better than random guessing and it's true risk will be 1/2 .
Let $P_{H}$ be the uniform distrubution over H -
$\forall x \in X ,f\sim P_{H} , P(f(x)=0)= P(f(x)=1)=1/2$. For a sample $S$ we can define a new probability distribution $P_{HS}$ by conditioning on the values of the sample : $\forall x_{i} \in S f(x_{i}) = c(x_{i})$ .This ,combined with the fact that support(S) has 0 measure leads to $E_{x\sim P}[E_{h\sim P_{HS}}[A(S)(x) \ne h(x)]] = 1/2$ . Therefore must exist at least a hypothesis h that agrees with the one provided by the algorithm A, but is differnent on the rest of the domain X . Letting c be this hypothesis, we have that $L_{P}(A(S)) = 1/2$