Large Scale Adversarial Representation Learning
By Most Husne Jahan, Robert Hensley, Gurinder Ghotra
This post is part of the "superblog" that is the collective work of the participants of the GAN workshop organized by Aggregate Intellect. This post serves as a proof of work, and covers some of the concepts covered in the workshop in addition to advanced concepts pursued by the participants.
Papers Referenced:
1. Comparison of BiGAN, BigGAN, and BigBiGAN
BiGAN: Bidirectional Generative Adversarial Networks (BiGANs)
Figure 1: The structure of Bidirectional Generative Adversarial Networks (BiGAN).
GANs can be used for unsupervised learning where a generator maps latent samples to generate data, but this framework does not include an inverse mapping from data to latent representation.
BiGAN adds an encoder E to the standard generator-discriminator GAN architecture — the encoder takes input data x and outputs a latent representation z of the input. The BiGAN discriminator D discriminates not only in data space (x versus G(z)), but jointly in data and latent space (tuples (x, E(x)) versus (G(z), z)), where the latent component is either an encoder output E(x) or a generator input z.
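To make the joint discrimination concrete, here is a toy numpy sketch of the two kinds of pairs D receives. The linear maps standing in for E and G, and all dimensions, are made up for illustration; in the real model x is an image and E, G are deep networks, so only the pairing structure matters here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, made up for illustration: data x in R^4, latent z in R^2.
x_dim, z_dim = 4, 2
W_e = rng.normal(size=(z_dim, x_dim))   # stand-in for the encoder E
W_g = rng.normal(size=(x_dim, z_dim))   # stand-in for the generator G

x = rng.normal(size=x_dim)              # a "real" data sample
z = rng.normal(size=z_dim)              # a latent sample

# The BiGAN discriminator never sees x or z alone; it scores joint pairs:
encoder_pair = np.concatenate([x, W_e @ x])     # (x, E(x))
generator_pair = np.concatenate([W_g @ z, z])   # (G(z), z)

assert encoder_pair.shape == generator_pair.shape == (x_dim + z_dim,)
```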
BigGAN: LARGE SCALE GAN TRAINING FOR HIGH FIDELITY NATURAL IMAGE SYNTHESIS
BigGAN essentially scales up traditional GAN models, resulting in GANs with more parameters (e.g. more feature maps), larger batch sizes, and architectural changes. The BigGAN architecture also introduces a “truncation trick” used during image generation, which improves image quality and is supported by a specific regularization technique. For image synthesis, the truncation trick means using a different latent distribution for the generator during inference than during training: a standard Gaussian is sampled during training, but during inference a truncated Gaussian is used, where values above a given threshold are resampled. The resulting approach is capable of generating larger and higher-quality images than traditional GANs, such as 256×256 and 512×512 images. The authors' proposed model (BigGAN) focuses its modifications on these aspects: scale (parameters and batch size), architectural changes, and the truncation trick.
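A minimal numpy sketch of the truncation trick described above; the threshold value here is arbitrary, since BigGAN treats it as a knob trading variety for fidelity:

```python
import numpy as np

def truncated_normal(shape, threshold=2.0, rng=None):
    """Sample z ~ N(0, 1) and resample any value whose magnitude exceeds
    `threshold`, as in BigGAN's truncation trick at inference time."""
    rng = rng or np.random.default_rng()
    z = rng.normal(size=shape)
    mask = np.abs(z) > threshold
    while mask.any():                      # resample out-of-range values
        z[mask] = rng.normal(size=mask.sum())
        mask = np.abs(z) > threshold
    return z

# A batch of 64 latent vectors of dimension 128 (sizes chosen arbitrarily).
z = truncated_normal((64, 128), threshold=0.5, rng=np.random.default_rng(0))
assert np.abs(z).max() <= 0.5
```

Lowering the threshold samples closer to the mode of the latent distribution, which typically improves individual sample quality at the cost of diversity.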
Figure 2: Summary of the Self-Attention Module Used in the Self-Attention GAN
Figure 3: Sample images generated by BigGANs
BigBiGAN – bidirectional BigGAN: Large Scale Adversarial Representation Learning
(Unsupervised Representation Learning)
Researchers introduced BigBiGAN, which is built upon the state-of-the-art BigGAN model, extending it to representation learning by adding an encoder and modifying the discriminator. BigBiGAN is a combination of BigGAN and BiGAN which explores the potential of GANs for a wide range of applications, like unsupervised representation learning and unconditional image generation.
It has been shown that BigBiGAN (BiGAN with BigGAN generator) matches the state of the art in unsupervised representation learning on ImageNet. The authors proposed a more stable version of the joint discriminator for BigBiGAN, compared to the discriminator used previously. They also have shown that the representation learning objective also helps unconditional image generation.
Figure 4: An annotated illustration of the architecture of BigBiGAN. The red section is derived from BiGAN, whereas the blue sections are based on the BigGAN structure with the modified discriminators
The above figure shows the structure of the BigBiGAN framework, where a joint discriminator D is used to compute the loss. Its inputs are data-latent representation pairs, either (x ∼ P_{x}, z ∼ E(x)), sampled from the data distribution P_{x} and the encoder E outputs, or (x ∼ G(z), z ∼ P_{z}), sampled from the generator G outputs and the latent distribution P_{z}. The loss includes the unary data term S_{x} and the unary latent term S_{z}, as well as the joint term S_{xz}, which ties the data and latent distributions.
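As a sketch of how the three terms might combine, here is a small example assuming the hinge loss commonly used for GAN discriminators and made-up scalar scores; the sign conventions are illustrative, not the paper's exact equations:

```python
def hinge_d(score, is_encoder_pair):
    # Per-term hinge loss for the discriminator: encoder pairs play the
    # "real" role, generator pairs the "fake" role.
    return max(0.0, 1.0 - score) if is_encoder_pair else max(0.0, 1.0 + score)

# Hypothetical scalar scores for one encoder pair (x, E(x)) ...
s_x_e, s_z_e, s_xz_e = 0.8, -0.1, 0.4
# ... and one generator pair (G(z), z). All numbers are made up.
s_x_g, s_z_g, s_xz_g = -0.6, 0.2, -0.3

# Discriminator loss: sum of the data, latent, and joint terms, both sides.
loss_d = (hinge_d(s_x_e, True) + hinge_d(s_z_e, True) + hinge_d(s_xz_e, True)
          + hinge_d(s_x_g, False) + hinge_d(s_z_g, False) + hinge_d(s_xz_g, False))

# Encoder/generator loss: the unclamped scores with signs flipped, so E and
# G push the discriminator's decision the other way (illustrative convention).
loss_eg = -(s_x_e + s_z_e + s_xz_e) + (s_x_g + s_z_g + s_xz_g)
```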
Figure 5: Selected reconstructions from an unsupervised BigBiGAN model
In summary, BigBiGAN represents progress in image generation quality that translates to substantially improved representation learning performance.
Ref:
 BiGAN Paper: https://arxiv.org/pdf/1605.09782.pdf
 BigBiGAN Paper: https://arxiv.org/pdf/1907.02544.pdf
 BigGAN Paper: https://arxiv.org/pdf/1809.11096.pdf
2. Ablation study conducted for BigBiGAN:
As an ablation study, different elements in the BigBiGAN architecture were removed in order to better understand the effects of the respective elements. The metrics used for the study were IS and FID scores. The IS score measures convergence to major modes, while the FID score measures how well the entire distribution is represented. A higher IS score is considered to be better, whereas a lower FID score is considered better. The following points highlight the findings of the ablation study:
 Latent distribution P_z and stochastic E.
The study upholds BigGAN's finding that random sampling from the latent distribution P_z is the superior method.
 Unary loss terms:
 Removing both terms is equivalent to using BiGAN.
 Removing S_{x} leads to inferior results in classification as S_{x} represents the standard generator loss in the base GAN.
 Removing S_{z} does not have much impact on classification accuracy.
 Keeping only S_{z} has a negative impact on classification accuracy.
Divergence in the IS and FID scores led to the postulation that BigBiGAN may be forcing the generator to produce distinguishable outputs across the entire latent space, rather than collapsing large volumes of latent space into a single mode of the data distribution.
 It would have been interesting to see how much improvement the unary terms provide when the generator is reduced from BigGAN to DCGAN; this change of generator would have shown their advantage more conclusively.
 Table of IS and FID scores (with relevant scores highlighted):
Table 1: Results for variants of BigBiGAN, given in Inception Score (IS) and Fréchet Inception Distance (FID) of the generated images, and ImageNet top-1 classification accuracy percentage
3. Generator Capacity
They found that generator capacity was critical to the results: reducing the generator's capacity reduced classification accuracy. Changing the generator from DCGAN to BigGAN is a key contributor to BigBiGAN's success.
4. Comparison to Standard BigGAN
BigBiGAN without the encoder and with only the S_{x} unary term was found to produce a worse IS metric and the same FID metric when compared to BigGAN. From this, the researchers postulated that the addition of the encoder and the new joint discriminator did not decrease the generated image quality, as can be seen from the FID score. The lower IS score is attributed to reasons similar to those for the S_{z} unary term (as in point 2, Unary loss terms).
5. Higher resolution input for Encoder with varying resolution output from Generator
BigBiGAN uses:
 Higher resolution for the encoder.
 Lower resolution for generator and discriminator.
 They experimented with varying resolution sizes for the encoder and the generator and concluded that an increase in the resolution of the generator with a fixed high resolution for the encoder improves performance.
Note: looking at the table (the relevant portion is highlighted), this seems to be the case only with IS and not with FID, which increases from 15.82 to 38.58 when we go from a low to a high resolution for the generator.
6. Decoupled Encoder / Generator optimizer:
Changing the learning rate for the encoder dramatically improved training and the final representation: using a 10× higher learning rate for the encoder while keeping the generator learning rate fixed led to better results.
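A toy sketch of the effect of decoupled learning rates, using plain gradient descent on two scalar parameters; the objective and all numbers are made up, and only the 10× ratio mirrors the finding above:

```python
# Toy scalar "encoder" and "generator" parameters, both minimizing
# (theta - 1)^2 by plain gradient descent. Everything here is made up;
# only the 10x learning-rate ratio mirrors the BigBiGAN finding.
enc, gen = 5.0, 5.0
base_lr = 0.01
enc_lr = 10 * base_lr                 # encoder gets a 10x higher learning rate

for _ in range(100):
    enc -= enc_lr * 2 * (enc - 1.0)   # gradient of (enc - 1)^2
    gen -= base_lr * 2 * (gen - 1.0)  # gradient of (gen - 1)^2

# With identical gradients, the faster-learning parameter ends up much
# closer to its optimum after the same number of steps.
assert abs(enc - 1.0) < abs(gen - 1.0)
```

In a framework like PyTorch, the same idea would be expressed by giving the encoder and generator separate optimizers (or parameter groups) with different learning rates.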
7. BigBiGAN basic structures compared to the Standard GAN
At the heart of the standard GAN is a generator and a discriminator. The BigBiGAN expands on this, building on the work of BiGAN and BigGAN, to include an encoder and “unary” term discriminators (F and H) which are then jointly discriminated along the lines of “encoder vs generator” through the final discriminator (J). As a result of these additions, some natural model changes emerge.
Change in the discrimination paradigm
Where the standard GAN discriminates between ‘real’ and ‘fake’ inputs, BigBiGAN shifts that paradigm slightly to discriminating between ‘encoder’ and ‘generator’. If you think about the model in terms of “real” and “fake”, you might be tempted to treat the true latent sample z as “real” and the encoded latent E(x) as “fake”; this is different from what they do, and is why the shift towards encoder vs. generator matters. From this point on, each discriminator will be seen as discriminating “encoder from generator” and no longer “real from fake.”
Other natural model changes that emerge from the addition of an encoder and unary terms
Since the generator attempts to generate images, and the encoder attempts to generate latent space (aka the “noise” in the standard GAN), the two outputs have different shapes. The image shapes are handled similarly to a DCGAN, and the latent space shapes are handled with linear layers like the original GAN. As a result, the F discriminator is a CNN that discriminates between encoder and generator images, while the H discriminator is a linear module that accepts a flattened input and discriminates between encoder and generator latent space.
After the first phase of discrimination, the outputs of F are flattened so they can be concatenated with H outputs, then F and H outputs are jointly fed into the final discriminator J. As such, J will then discriminate between the concatenated encoder values [ F_{out}^{e}, H_{out}^{e} ] and the concatenated generator values [ F_{out}^{g}, H_{out}^{g} ], which can also be written as [ F(x), H(E(x)) ] ( vs ) [ F(G(z)), H(z) ].
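The wiring above can be sketched with stand-in linear maps; all dimensions and modules here are hypothetical (the real F is a CNN over images, and E, G are deep networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: image x in R^16, latent z in R^3,
# F features in R^8, H features in R^4.
x_dim, z_dim, f_out, h_out = 16, 3, 8, 4
F = rng.normal(size=(f_out, x_dim))      # stand-in for the CNN F
H = rng.normal(size=(h_out, z_dim))      # stand-in for the linear module H
E = rng.normal(size=(z_dim, x_dim))      # stand-in for the encoder E
G = rng.normal(size=(x_dim, z_dim))      # stand-in for the generator G

x = rng.normal(size=x_dim)               # a real data sample
z = rng.normal(size=z_dim)               # a latent sample

# F sees images, H sees latents; their (flattened) outputs are concatenated
# before J scores the pair: [F(x), H(E(x))] vs [F(G(z)), H(z)].
enc_joint = np.concatenate([F @ x, H @ (E @ x)])
gen_joint = np.concatenate([F @ (G @ z), H @ z])

J = rng.normal(size=f_out + h_out)       # stand-in for J's final projection
s_xz_enc, s_xz_gen = float(J @ enc_joint), float(J @ gen_joint)
assert enc_joint.shape == gen_joint.shape == (f_out + h_out,)
```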
For scoring F, H, and J, with F_{out} and H_{out} needing to be matrices that can be fed into J, F_{out} and H_{out} must each be reduced to a scalar after their respective discrimination. From this requirement emerge the terms S_{x}, S_{xz} and S_{z}: linear layers that simply reduce F_{out}, J_{out} and H_{out} each to a scalar, i.e. S_{x}(F_{out}), S_{xz}(J_{out}) and S_{z}(H_{out}), which can then be summed (S_{x} + S_{xz} + S_{z}) and scored.
Compared to the standard GAN that is discriminating real values from fake values: (x) from G(z), the BigBiGAN can be seen as similarly discriminating a group of encoder values from a group of generator values: (S_{x}^{e} + S_{xz}^{e} + S_{z}^{e}) from (S_{x}^{g} + S_{xz}^{g} + S_{z}^{g}).
Original. Reposted with permission.
Related:
 Semi-supervised learning with Generative Adversarial Networks
 Intro to Adversarial Machine Learning and Generative Adversarial Networks
 An Overview of Density Estimation