GAN Nash equilibrium

Question

I'm reading Ian Goodfellow’s article about Generative Adversarial Networks (https://arxiv.org/pdf/1701.00160.pdf) and, on page 22, I found a sentence that I don’t understand.

It concerns GAN convergence to the Nash equilibrium of the game.

Ian says:

“… that the game converges to its equilibrium if both players’ policies can be updated directly in function space. In practice, the players are represented with deep neural nets and updates are made in parameter space, so these results, which depend on convexity, do not apply.”

OK, the equilibrium is searched for in parameter space, but I don't understand what function space is, or how searching in function space differs from searching in parameter space.

score 2 · Answer 1 · answered Jul 05 '19 at 01:14

I suggest referring back to the original GAN paper (pg. 5):

Proposition 2. If $G$ and $D$ have enough capacity, and at each step of Algorithm 1, the discriminator is allowed to reach its optimum given $G$, and $p_g$ is updated so as to improve the criterion $$\mathbb{E}_{x\sim p_\text{data}} [\log D^*_G(x)] + \mathbb{E}_{x\sim p_g}[\log(1 - D^*_G(x))]$$ then $p_g$ converges to $p_\text{data}$.

after which it is said:

In practice, adversarial nets represent a limited family of $p_g$ distributions via the function $G(z; \theta_g)$, and we optimize $\theta_g$ rather than $p_g$ itself, so the proofs do not apply.

So there are three effects in play here:

1. The generator and discriminator are artificial neural networks, and therefore do not have infinite capacity. This is why Proposition 2 begins by stipulating that $G$ and $D$ have enough capacity. Note that spaces of continuous functions are infinite-dimensional, so moving in the finite-dimensional parameter space of the model is not the same as moving in the full function space.
2. It's even worse, because $G$ only implicitly represents $p_g$. Even if $G$ is capable of generating a given image, the input latent noise $z\sim U$ may not allow it to do so. Topological obstructions can also arise: if the true data distribution has two separate modes with zero density in between, the vanilla GAN will be unable to represent this. In other words, the distributions that can be represented by $G$ (here, the density functions) are limited, no matter the capacity of $G$, in the vanilla GAN formulation. Some of these limitations have been reduced in more modern GANs.
3. The convexity requirement. This is what is referred to in "... these results, which depend on convexity, do not apply." The GAN value function $V(G,D)$ satisfies the following (paraphrased from the proof of Proposition 2 in the original GAN paper):

Let $V(G,D)=: U(p_g,D)$ be a function of $p_g$. Then $U$ is convex in $p_g$.

Since $U$ is convex, we are guaranteed to be able to reach the global optimum. But there's a problem: this convexity relies on a major assumption in prop 2, namely that "the discriminator is allowed to reach its optimum given $G$". In practice, this will not occur for most scenarios. Thus the theoretical convexity guarantee does not hold, and hence the global convergence guarantee does not hold.
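As a rough numerical illustration of why convexity in $p_g$ does not carry over to parameters, here is a minimal sketch on a toy two-outcome distribution. Everything in it is an assumption made for illustration and comes from neither paper: the data distribution `P_DATA`, the helper names, and the sigmoid parametrisation $p_g = (\sigma(\theta), 1-\sigma(\theta))$. It evaluates the criterion with the optimal discriminator $D^*_G = p_\text{data}/(p_\text{data}+p_g)$ and checks discrete second differences.

```python
import numpy as np

# Hypothetical two-outcome data distribution (illustration only).
P_DATA = np.array([0.3, 0.7])

def criterion(q):
    """C(p_g) = max_D V(G, D) for p_g = (q, 1 - q), using the optimal D*."""
    p_g = np.array([q, 1.0 - q])
    d_star = P_DATA / (P_DATA + p_g)  # optimal discriminator for this p_g
    return float(np.sum(P_DATA * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

def second_differences(values):
    """Discrete second derivative on a uniform grid (non-negative for convex functions)."""
    values = np.asarray(values)
    return values[:-2] - 2.0 * values[1:-1] + values[2:]

def sigmoid(theta):
    return 1.0 / (1.0 + np.exp(-theta))

qs = np.linspace(0.01, 0.99, 99)      # direct "function space" coordinate q
thetas = np.linspace(-8.0, 8.0, 161)  # parameter-space coordinate theta

# Convexity check directly in p_g.
convex_in_p = bool(np.all(second_differences([criterion(q) for q in qs]) >= -1e-12))

# Same criterion, but viewed as a function of theta with q = sigmoid(theta).
convex_in_theta = bool(
    np.all(second_differences([criterion(sigmoid(t)) for t in thetas]) >= -1e-12)
)

print(convex_in_p, convex_in_theta)
```

In this toy setting the criterion is convex in $q$ but not in $\theta$: as a function of $\theta$ it is bounded and non-constant on the whole real line, which a convex function cannot be. A deep network is a far more complicated parametrisation than a single sigmoid, so this only hints at the issue.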

For the last part of your question:

but I don't understand what function space is, or how searching in function space differs from searching in parameter space.

as noted above, the network only covers a small piece of "function space" (e.g., the space of smooth functions), a limitation exacerbated by the use of the implicit density representation. Searching in the parameter space is restricted to that small subset of function space, unlike the "direct updates" to $p_g$ required by the proof.
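The "small subset of function space" point can be sketched on a five-point domain. The setup is hypothetical and chosen only for illustration: a bimodal target `P_DATA`, a one-parameter family $p_\theta(x) \propto \exp(\theta x)$ standing in for the generator, and a grid search standing in for training. It uses the fact from the original paper that the criterion equals $-\log 4 + 2\,\mathrm{JSD}(p_\text{data}\,\|\,p_g)$, so comparing Jensen-Shannon divergences is enough.

```python
import numpy as np

# Hypothetical bimodal target on the domain x in {0, 1, 2, 3, 4}.
X = np.arange(5.0)
P_DATA = np.array([0.45, 0.04, 0.02, 0.04, 0.45])

def jsd(p, q):
    """Jensen-Shannon divergence between two strictly positive distributions."""
    m = 0.5 * (p + q)
    return float(0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m)))

def family(theta):
    """One-parameter family p_theta(x) proportional to exp(theta * x)."""
    logits = theta * X
    weights = np.exp(logits - logits.max())  # subtract max for numerical stability
    return weights / weights.sum()

# "Function space": p_g may be any distribution on the domain, so it can equal P_DATA.
best_in_function_space = jsd(P_DATA, P_DATA.copy())

# "Parameter space": p_g is restricted to the one-dimensional family above.
thetas = np.linspace(-5.0, 5.0, 1001)
best_in_parameter_space = min(jsd(P_DATA, family(t)) for t in thetas)

print(best_in_function_space, best_in_parameter_space)
```

Every member of this family is monotone in $x$, so no value of $\theta$ can match the two modes and the best achievable divergence stays well above zero, whereas the unrestricted search reaches zero.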

Graham Pulford · Answer 2 · 2022-02-13T11:55:16.190

Goodfellow et al.'s 2014 paper formulates the GAN as a variational optimisation (i.e. over spaces of functions), whereas its practical implementation is a parametric optimisation. The theory in the paper is developed for optimisation over the functions D (discriminator) and G (generator). In the implementation, the functions D and G are realised by multilayer or deep neural networks, so this is a parametric optimisation. Thus the variational optimisation indicated in the theory part of the paper is never actually carried out.

Although not stated in the 2014 paper, the theoretical results on GAN convergence and Nash equilibria require dim(z)$\geq$dim(x), where dim(x) is the data space dimension and dim(z) the latent space dimension, for Proposition 1 (concerning the optimal discriminator) to be valid. All the other results in the paper assume the optimal discriminator result.

When dim(z)<dim(x), which is generally the case in practice, Proposition 1 does not hold, since the variational calculus used in the proof (after the change of variables x=G(z)) cannot be applied to discontinuous or non-differentiable integrands. It is demonstrated in the paper: https://www.researchgate.net/publication/356815736_Convergence_and_Optimality_Analysis_of_Low-Dimensional_Generative_Adversarial_Networks_using_Error_Function_Integrals

that $p_g(x)$, the PDF of the generator output, is degenerate whenever dim(z)<dim(x), i.e. it contains delta functions. Thus when writing the optimisation problem for $V[D,G]$ for the function D as: $\int\left( p_d(x)\log(D(x))+p_g(x)\log[1-D(x)]\right)dx$, we cannot apply calculus of variations because the integrand contains delta functions, which are discontinuous, via the term $p_g(x)$. This has been demonstrated in practice by the results in C. Qin et al. (2020).
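The degeneracy itself is easy to see in a toy case. The sketch below assumes a hypothetical linear generator $G(z) = Wz$ with dim(z)=1 and dim(x)=2; it is not taken from the cited papers and only illustrates that the generated samples occupy a one-dimensional subset of the plane, so $p_g$ has no ordinary density on $\mathbb{R}^2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear generator G(z) = W z with dim(z) = 1 and dim(x) = 2.
W = np.array([[1.0], [2.0]])

z = rng.uniform(-1.0, 1.0, size=(1000, 1))  # latent samples
x = z @ W.T                                 # generated samples, shape (1000, 2)

# The sample covariance is singular: all mass lies along one direction.
cov = np.cov(x, rowvar=False)
rank = int(np.linalg.matrix_rank(cov, tol=1e-8))

# Every sample satisfies x_2 = 2 * x_1, i.e. it lies on a single line in the plane.
residual = float(np.max(np.abs(x[:, 1] - 2.0 * x[:, 0])))

print(rank, residual)
```

A nonlinear generator would bend the line into a curve, but the output would still be confined to a set of zero area in the plane.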

