
Tips & Tricks for Training GANs

Hard-Won Lessons from Getting Generators to Actually Generate

The Honest Truth

Training GANs is notoriously finicky. The adversarial dynamic means you're optimizing two networks simultaneously in opposition, and the whole system can collapse in creative and frustrating ways. These tips come from real experience training models on real data, not textbook theory.

1. Start Simple, Then Scale

Don't jump straight to your full dataset at full resolution. Build confidence in your pipeline first:

  • Overfit on one image first. If your GAN can't memorize a single image, it definitely can't learn a distribution. This validates your architecture and training loop.
  • Then try 10 images. Can it learn a tiny distribution? Good. Now you know the adversarial dynamics work.
  • Start at low resolution. Train at 64x64 before attempting 256x256 or higher. Debug faster, iterate faster.
  • Use a known dataset first. Before training on your custom data, verify your code works on MNIST or CelebA. If it doesn't work on a standard benchmark, the problem is your code, not your data.
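The single-image overfit test above can be wired up with a tiny dataset wrapper. This is a minimal sketch assuming a PyTorch pipeline; `SingleImageDataset` is a hypothetical helper, not part of any library:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SingleImageDataset(Dataset):
    """Repeats one image `length` times so a normal training loop can overfit it."""
    def __init__(self, image: torch.Tensor, length: int = 1000):
        self.image = image          # (C, H, W), already scaled to [-1, 1]
        self.length = length

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        return self.image

# Stand-in for one real image; plug the loader into your existing loop.
img = torch.rand(3, 64, 64) * 2 - 1
loader = DataLoader(SingleImageDataset(img), batch_size=16)
batch = next(iter(loader))          # every sample in the batch is the same image
```

If your GAN can't drive its output toward this one image within a few hundred steps, debug the architecture and loop before touching the real dataset.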

2. Learning Rates Are Everything

The balance between generator and discriminator learning rates is the single most impactful hyperparameter:

📉
Use a Lower Generator LR

A common starting point: discriminator at 1e-4, generator at 1e-4 or slightly lower. If the generator learns too fast, it'll find degenerate shortcuts. If the discriminator runs away, the generator gets no useful gradient signal.

🧪
TTUR (Two Time-Scale Update Rule)

Use different learning rates for G and D. The original TTUR paper showed that giving the discriminator a higher learning rate (e.g., 4e-4 vs 1e-4) can improve convergence. This lets D provide better gradients without running away.

⚠️
Adam Betas Matter

The default Adam beta1 of 0.9 is often too high for GANs. Try beta1=0.0 or beta1=0.5 with beta2=0.999 or beta2=0.9. High momentum in Adam can cause oscillations in the adversarial game.
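Putting the three cards together, a typical optimizer setup looks like this sketch (the TTUR-style 4e-4/1e-4 split and betas=(0.0, 0.9) are one common starting point, not the only valid one; the Linear models are stand-ins for your real G and D):

```python
import torch
from torch import nn

# Stand-in networks; replace with your actual generator and discriminator.
G = nn.Linear(100, 784)
D = nn.Linear(784, 1)

# TTUR: the discriminator learns faster than the generator,
# and low beta1 avoids momentum-driven oscillation in the adversarial game.
opt_D = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))
```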

3. Fighting Mode Collapse

Mode collapse is when the generator finds a small set of outputs that fool the discriminator and stops exploring. Your generator becomes a one-trick pony. Signs to watch for:

Red Flags

  • Generated images all look nearly identical
  • Loss oscillates wildly or discriminator loss drops to zero
  • Generator loss keeps decreasing but image quality stagnates
  • FID score plateaus while loss still improves

Countermeasures that actually work:

  • Minibatch discrimination: Let the discriminator see statistics across the batch, not just individual samples. If all samples look the same, the discriminator can catch it.
  • Feature matching: Instead of training G to maximize D's error, train it to match the statistics of real data in D's intermediate layers.
  • Use Wasserstein loss: WGAN and WGAN-GP provide more stable gradients and a meaningful loss metric. The Earth Mover's distance doesn't saturate like the original GAN loss.
  • Increase latent dimensionality: If your latent space is too small, the generator can't encode enough variation. 100-512 dimensions is typical.
  • Unrolled optimization: Let the generator differentiate through several future discriminator updates, so G anticipates how D will respond instead of exploiting a momentarily confused discriminator. A cheaper approximation is simply taking multiple D steps per G step.
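Feature matching from the list above is simple enough to sketch directly. This assumes you can pull `(batch, features)` activations from an intermediate discriminator layer; shown in NumPy for clarity:

```python
import numpy as np

def feature_matching_loss(feat_real: np.ndarray, feat_fake: np.ndarray) -> float:
    """Squared L2 distance between batch-mean activations of the same
    intermediate D layer on real vs. generated samples.

    feat_real, feat_fake: arrays of shape (batch, features).
    """
    mu_real = feat_real.mean(axis=0)
    mu_fake = feat_fake.mean(axis=0)
    return float(np.sum((mu_real - mu_fake) ** 2))

f = np.random.randn(8, 128)
# Identical feature statistics give zero loss; G is trained to minimize this
# instead of (or in addition to) maximizing D's classification error.
```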

4. Architecture Choices That Matter

Do This

  • Use strided convolutions instead of pooling
  • BatchNorm in G, LayerNorm or SpectralNorm in D
  • LeakyReLU (slope 0.2) in D, ReLU in G
  • Tanh activation on G's output layer
  • Scale inputs to [-1, 1] to match tanh range
  • Use a power-of-2 image size (64, 128, 256...)

Avoid This

  • Max pooling (loses spatial information)
  • BatchNorm in D (leaks batch information)
  • Sigmoid output in D with WGAN loss
  • Sparse gradients in D (plain ReLU — prefer LeakyReLU)
  • Fully connected layers in deep architectures
  • Very deep networks without residual connections

Spectral Normalization

One of the most impactful stabilization techniques. Normalizes the weight matrices of the discriminator by their spectral norm (largest singular value), enforcing a Lipschitz constraint without the computational overhead of gradient penalty. Often you can use SpectralNorm in D and skip gradient penalty entirely.
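PyTorch ships this as `torch.nn.utils.spectral_norm`; wrapping a layer re-normalizes its weight by the largest singular value (estimated via power iteration) on every forward pass. A small sketch:

```python
import torch
from torch import nn
from torch.nn.utils import spectral_norm

# Wrap each discriminator layer you want constrained.
layer = spectral_norm(nn.Linear(64, 64))

x = torch.randn(8, 64)
for _ in range(20):         # a few forward passes let the power iteration converge
    layer(x)

# The effective weight's spectral norm is now close to 1,
# so the layer is approximately 1-Lipschitz.
sigma = torch.linalg.matrix_norm(layer.weight.detach(), ord=2)
```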

5. Data Preparation Is Half the Battle

Garbage in, garbage out applies doubly to GANs:

  • Clean your dataset aggressively. A few corrupted, mislabeled, or drastically out-of-distribution images can derail training. The discriminator will latch onto artifacts.
  • Align and crop consistently. If you're generating faces, align them. If you're generating snow crystals, center them. Spatial consistency helps the generator focus on content rather than position.
  • Augment carefully. Random flips and small rotations are usually safe. Color jitter and aggressive crops can confuse the distribution the generator is trying to learn.
  • More data beats more training. If your results are bad, consider collecting more data before training longer. GANs memorize small datasets.
  • Normalize to [-1, 1]. Match the tanh output range. This sounds obvious but is a common source of bugs.
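The [-1, 1] normalization and its inverse are two one-liners worth getting exactly right; a NumPy sketch for uint8 images:

```python
import numpy as np

def to_model_range(img_uint8: np.ndarray) -> np.ndarray:
    """Map uint8 pixels [0, 255] to float32 [-1, 1] (matching a tanh output)."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

def to_pixel_range(img: np.ndarray) -> np.ndarray:
    """Map model output in [-1, 1] back to uint8 [0, 255] for saving."""
    return ((img + 1.0) * 127.5).round().clip(0, 255).astype(np.uint8)

x = np.array([0, 127, 255], dtype=np.uint8)
y = to_model_range(x)       # [-1.0, ~-0.004, 1.0]
```

The `round()` before the uint8 cast matters: plain truncation can shift pixel values down by one and makes the round trip lossy.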

6. Monitoring Training

GAN loss curves are notoriously uninformative. Don't rely on them alone:

🖼️
Save Generated Samples Frequently

Generate and save a grid of images every N steps using a fixed latent vector. This lets you visually track improvement and catch mode collapse early. Your eyes are a better metric than the loss curve.

📊
Track FID Score

Fréchet Inception Distance compares the statistics of generated images to real images in Inception feature space. Lower is better. Compute it periodically (not every step, it's expensive). This is the closest thing to an objective quality metric.

⚖️
Watch the D/G Loss Balance

For WGAN: the critic loss should be negative and relatively stable. If it crashes to zero, D has won. If it oscillates wildly, the learning rates are likely too high. For vanilla GAN: D loss around 0.5-0.7 and G loss around 1.0-2.0 is a reasonable range.

7. Choosing a Loss Function

The loss function shapes the entire training dynamic. Here's a practical comparison:

Loss            Stability   Quality    Notes
Vanilla (BCE)   Low         Moderate   Prone to vanishing gradients and mode collapse
WGAN            High        Good       Requires weight clipping, which can limit capacity
WGAN-GP         High        High       Gradient penalty is more principled than clipping
Hinge           High        High       Used in BigGAN and SAGAN; simple and effective
Non-saturating  Moderate    Good       -log(D(G(z))) instead of log(1-D(G(z))); better gradients early on

Recommendation: Start with WGAN-GP or hinge loss. They provide meaningful loss curves and stable training. Only use vanilla GAN loss if you have a specific reason.
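The hinge loss from the table is only a few lines; a NumPy sketch of the SAGAN/BigGAN formulation (raw critic scores, no sigmoid):

```python
import numpy as np

def hinge_d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Discriminator hinge loss: penalize real scores below +1
    and fake scores above -1."""
    return float(np.maximum(0.0, 1.0 - d_real).mean()
                 + np.maximum(0.0, 1.0 + d_fake).mean())

def hinge_g_loss(d_fake: np.ndarray) -> float:
    """Generator side: simply push fake scores up."""
    return float(-d_fake.mean())
```

Once D scores reals above +1 and fakes below -1, its loss hits zero and gradients vanish only for confidently-classified samples, which is part of why this loss trains so stably.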

8. Working with Small Datasets

Not everyone has millions of images. When your dataset is small (hundreds to low thousands):

📐

Progressive Growing

Start at low resolution and scale up. Early phases learn structure from limited data efficiently.

🔄

Data Augmentation

Apply augmentation to both real and fake images (DiffAugment). This prevents D from memorizing the training set.

🧊

Freeze D Layers

With limited data, D overfits fast. Freezing early D layers after initial training can help.
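Freezing early D layers is a one-loop change in PyTorch; the Sequential below is a stand-in discriminator for illustration:

```python
import torch
from torch import nn

# Stand-in discriminator; replace with your real model.
D = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Flatten(), nn.Linear(64 * 16 * 16, 1),
)

# Freeze the first conv block: its weights stop receiving gradient updates,
# while later layers keep adapting.
for param in D[0].parameters():
    param.requires_grad_(False)
```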

The snowGAN project trains on ~2,341 samples using progressive growing. It's proof that you can get meaningful results from small, domain-specific datasets if your training strategy is right.

9. Common Pitfalls & Quick Fixes

Checkerboard artifacts in generated images

Caused by transposed convolutions with stride > 1. Fix: use nearest-neighbor upsampling followed by a regular convolution instead of ConvTranspose2d.
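The upsample-then-convolve replacement looks like this in PyTorch (a sketch; adapt channels and depth to your generator):

```python
import torch
from torch import nn

def upsample_block(in_ch: int, out_ch: int) -> nn.Module:
    """2x nearest-neighbor upsample followed by a 3x3 conv: same upscaling
    as a stride-2 ConvTranspose2d, without the checkerboard pattern."""
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
    )

x = torch.randn(1, 128, 16, 16)
y = upsample_block(128, 64)(x)      # spatial size doubles: 16x16 -> 32x32
```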

Generator produces noise / no recognizable structure

Discriminator is too strong. Lower D's learning rate, reduce D updates per G update, soften D's targets (label smoothing), or add noise to D's inputs (instance noise).
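Both weakening tricks are tiny; a NumPy sketch (the 0.9 target and the 0.1 starting noise scale are conventional choices, not fixed rules):

```python
import numpy as np

def smoothed_real_targets(batch_size: int, smooth: float = 0.9) -> np.ndarray:
    """One-sided label smoothing: real targets become 0.9 instead of 1.0
    (fake targets stay at 0.0)."""
    return np.full(batch_size, smooth, dtype=np.float32)

def add_instance_noise(images: np.ndarray, step: int, total_steps: int,
                       sigma_start: float = 0.1) -> np.ndarray:
    """Gaussian noise on D's inputs, linearly annealed to zero over training."""
    sigma = sigma_start * max(0.0, 1.0 - step / total_steps)
    return images + np.random.randn(*images.shape).astype(np.float32) * sigma
```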

Training loss oscillates wildly

Learning rates too high, or batch size too small. Try halving both LRs first. If that doesn't help, increase batch size or add gradient penalty.
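If you reach for gradient penalty, this is the standard WGAN-GP form, sketched in PyTorch (the linear critic at the bottom is only a sanity check, not a real model):

```python
import torch
from torch import nn

def gradient_penalty(D, real, fake):
    """WGAN-GP: penalize the critic when its gradient norm on random
    real/fake interpolates deviates from 1."""
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    scores = D(interp)
    grads, = torch.autograd.grad(scores.sum(), interp, create_graph=True)
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

# Sanity check: a linear critic whose weight vector has L2 norm 1 has
# gradient norm exactly 1 everywhere, so the penalty should be zero.
critic = nn.Linear(4, 1, bias=False)
with torch.no_grad():
    critic.weight.fill_(0.5)            # (0.5, 0.5, 0.5, 0.5): norm = 1
gp = gradient_penalty(critic, torch.randn(8, 4), torch.randn(8, 4))
```

Add the returned penalty (scaled by a coefficient, typically 10) to the critic's loss.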

Good samples early, then quality degrades

The generator may be overfitting or the adversarial balance shifted. Save checkpoints frequently and consider learning rate decay in late training.

NaN losses

Gradient explosion. Add gradient clipping (max_norm=1.0), lower learning rates, or check for division by zero in custom loss functions. Also verify your data doesn't contain NaN/inf values.
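Gradient clipping is one call in PyTorch; a sketch with a deliberately exploded loss to show the effect:

```python
import torch
from torch import nn

model = nn.Linear(10, 1)
x, target = torch.randn(32, 10), torch.randn(32, 1)

loss = ((model(x) - target) ** 2).mean() * 1e6   # deliberately huge loss
loss.backward()

# Rescales all gradients in place so their combined L2 norm is at most 1.0;
# returns the (large) pre-clip norm, which is worth logging.
total_norm = nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```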

The GAN Training Checklist

Before you start training, run through this:

  • ☐ Data cleaned, aligned, and normalized to [-1, 1]
  • ☐ Architecture validated on a known dataset
  • ☐ Single-image overfit test passed
  • ☐ Learning rates set (try 1e-4 / 1e-4 or TTUR)
  • ☐ Adam betas adjusted (beta1=0.0 or 0.5)
  • ☐ Loss function chosen (WGAN-GP or hinge recommended)
  • ☐ Image saving callback configured (every N steps)
  • ☐ Checkpoint saving enabled
  • ☐ FID evaluation scheduled
  • ☐ GPU memory checked (reduce batch size if needed)