year: 2017
paper: https://arxiv.org/pdf/1701.07875.pdf
website:
code:
status: reference
type: tweak
Takeaways
Wasserstein loss
This loss function depends on a modification of the GAN scheme (called “Wasserstein GAN” or “WGAN”) in which the discriminator does not actually classify instances. For each instance it outputs a number. This number does not have to be between 0 and 1, so we can’t use 0.5 as a threshold to decide whether an instance is real or fake. Discriminator training just tries to make the output larger for real instances than for fake instances.
Because it can’t really discriminate between real and fake, the WGAN discriminator is actually called a “critic” instead of a “discriminator”. This distinction has theoretical importance, but for practical purposes we can treat it as an acknowledgement that the inputs to the loss functions don’t have to be probabilities.
The loss functions themselves are deceptively simple:
Critic Loss: D(x) - D(G(z))
The critic tries to maximize this function. In other words, it tries to maximize the difference between its output on real instances and its output on fake instances.
Generator Loss: D(G(z))
The generator tries to maximize this function. In other words, it tries to maximize the critic’s output for its fake instances.
In these functions:
- D(x) is the critic’s output for a real instance.
- G(z) is the generator’s output when given noise z.
- D(G(z)) is the critic’s output for a fake instance.
- The output of critic D does not have to be between 0 and 1.
- The formulas derive from the earth mover distance between the real and generated distributions.
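Written as quantities to minimize, the two losses are easy to implement directly. Below is a minimal sketch, assuming real_scores and fake_scores are arrays of raw critic outputs; the function names are illustrative, not from the paper.

```python
import numpy as np

def critic_loss(real_scores, fake_scores):
    # The critic maximizes D(x) - D(G(z)); as a loss to minimize, negate it.
    return np.mean(fake_scores) - np.mean(real_scores)

def generator_loss(fake_scores):
    # The generator maximizes D(G(z)); as a loss to minimize, negate it.
    return -np.mean(fake_scores)
```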
In TF-GAN, see wasserstein_generator_loss and wasserstein_discriminator_loss for implementations.
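As a hedged sketch of how those TF-GAN losses are typically called (assuming the tensorflow_gan package and that its loss functions take raw critic outputs; verify against the installed version’s API):

```python
import tensorflow as tf
import tensorflow_gan as tfgan

real_scores = tf.constant([0.8, 1.2])   # critic outputs on real instances
fake_scores = tf.constant([-0.3, 0.1])  # critic outputs on generated instances

gen_loss = tfgan.losses.wasserstein_generator_loss(fake_scores)
critic_loss = tfgan.losses.wasserstein_discriminator_loss(real_scores, fake_scores)
```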
Requirements
The theoretical justification for the Wasserstein GAN (or WGAN) requires that the critic’s weights be clipped so that they remain within a constrained range; this is how the paper enforces the Lipschitz constraint that the earth mover formulation assumes.
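A minimal sketch of that clipping step, assuming the critic’s parameters are available as a list of NumPy arrays (the bound c = 0.01 is the value the paper uses in its experiments):

```python
import numpy as np

C = 0.01  # clipping bound c; 0.01 is the value used in the paper's experiments

def clip_critic_weights(weights, c=C):
    # After each critic update, force every parameter into [-c, c].
    return [np.clip(w, -c, c) for w in weights]
```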
Benefits
Wasserstein GANs are less vulnerable to getting stuck than minimax-based GANs, and avoid problems with vanishing gradients. The earth mover distance also has the advantage of being a true metric: a measure of distance in a space of probability distributions. Cross-entropy is not a metric in this sense.
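To make the metric claim concrete, SciPy ships a 1-D earth mover distance that can be checked directly (the sample values below are arbitrary):

```python
from scipy.stats import wasserstein_distance

# Earth mover (Wasserstein-1) distance between two 1-D empirical distributions.
d = wasserstein_distance([0.0, 1.0, 3.0], [5.0, 6.0, 8.0])
print(d)  # 5.0: each unit of mass moves 5 units on average
```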