How to Evaluate Generative Adversarial Networks

Generative adversarial networks, or GANs for short, are an effective deep learning approach for developing generative models.

Unlike other deep learning neural network models that are trained with a loss function until convergence, a GAN generator model is trained using a second model called a discriminator that learns to classify images as real or generated. Both the generator and discriminator model are trained together to maintain an equilibrium.

As such, there is no objective loss function used to train the GAN generator models and no way to objectively assess the progress of the training and the relative or absolute quality of the model from loss alone.

Instead, a suite of qualitative and quantitative techniques has been developed to assess the performance of a GAN model based on the quality and diversity of the generated synthetic images.

In this post, you will discover techniques for evaluating generative adversarial network models based on generated synthetic images.

After reading this post, you will know:

  • There is no objective function used when training GAN generator models, meaning models must be evaluated using the quality of the generated synthetic images.
  • Manual inspection of generated images is a good starting point when getting started.
  • Quantitative measures, such as the inception score and the Frechet inception distance, can be combined with qualitative assessment to provide a robust assessment of GAN models.

Kick-start your project with my new book Generative Adversarial Networks with Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Photo by Carol VanHook, some rights reserved.

Overview

This tutorial is divided into five parts; they are:

  1. The Problem of Evaluating GAN Generator Models
  2. Manual GAN Generator Evaluation
  3. Qualitative GAN Generator Evaluation
  4. Quantitative GAN Generator Evaluation
  5. Which GAN Evaluation Scheme to Use

The Problem of Evaluating GAN Generator Models

Generative adversarial networks are a type of deep-learning-based generative model.

GANs have proved remarkably effective at generating high-quality, large synthetic images across a range of problem domains.

Instead of being trained directly, the generator models are trained by a second model, called the discriminator, that learns to differentiate real images from fake or generated images. As such, there is no objective function or objective measure for the generator model.

Generative adversarial networks lack an objective function, which makes it difficult to compare performance of different models.

Improved Techniques for Training GANs, 2016.

This means that there is no generally agreed upon way of evaluating a given GAN generator model.

This is a problem for the research and use of GANs; for example, when:

  • Choosing a final GAN generator model during a training run.
  • Choosing generated images to demonstrate the capability of a GAN generator model.
  • Comparing GAN model architectures.
  • Comparing GAN model configurations.

The objective evaluation of GAN generator models remains an open problem.

While several measures have been introduced, as of yet, there is no consensus as to which measure best captures strengths and limitations of models and should be used for fair model comparison.

Pros and Cons of GAN Evaluation Measures, 2018.

As such, GAN generator models are evaluated based on the quality of the images generated, often in the context of the target problem domain.


Manual GAN Generator Evaluation

Many GAN practitioners fall back to the evaluation of GAN generators via the manual assessment of images synthesized by a generator model.

This involves using the generator model to create a batch of synthetic images, then evaluating the quality and diversity of the images in relation to the target domain.

This may be performed by the researcher or practitioner themselves.

Visual examination of samples by humans is one of the common and most intuitive ways to evaluate GANs.

Pros and Cons of GAN Evaluation Measures, 2018.

The generator model is trained iteratively over many training epochs. As there is no objective measure of model performance, we cannot know when the training process should stop and when a final model should be saved for later use.

Therefore, it is common to use the current state of the model during training to generate a large number of synthetic images and to save the current state of the generator used to generate the images. This allows for the post-hoc evaluation of each saved generator model via its generated images.

One training epoch refers to one cycle through the images in the training dataset used to update the model. Models may be saved systematically across training epochs, such as every one, five, ten, or more training epochs.
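
As a rough illustration of this practice, the sketch below saves the current generator and a batch of its images every few epochs so that each checkpoint can be evaluated after the run. The generator object, latent dimension, and file names are hypothetical placeholders, assuming a Keras-style model.

```python
# Hypothetical sketch: periodically save the generator and a sample of its
# images during training so each checkpoint can be evaluated post hoc.
import numpy as np

def summarize_performance(epoch, generator, latent_dim, n_samples=100):
    # draw random points from the latent space
    z = np.random.randn(n_samples, latent_dim)
    # generate a batch of synthetic images with the current generator
    images = generator.predict(z)
    # save the images and the generator model for later evaluation
    np.save('generated_e%03d.npy' % (epoch + 1), images)
    generator.save('generator_e%03d.h5' % (epoch + 1))

# inside the training loop, e.g. every ten epochs:
# if (epoch + 1) % 10 == 0:
#     summarize_performance(epoch, generator, latent_dim)
```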

Although manual inspection is the simplest method of model evaluation, it has many limitations, including:

  • It is subjective, including biases of the reviewer about the model, its configuration, and the project objective.
  • It requires knowledge of what is realistic and what is not for the target domain.
  • It is limited to the number of images that can be reviewed in a reasonable time.

… evaluating the quality of generated images with human vision is expensive and cumbersome, biased […] difficult to reproduce, and does not fully reflect the capacity of models.

Pros and Cons of GAN Evaluation Measures, 2018.

The subjective nature of manual inspection almost certainly leads to biased model selection and cherry-picking, and it should not be used for final model selection on non-trivial projects.

Nevertheless, it is a starting point for practitioners when getting familiar with the technique.

Thankfully, more sophisticated GAN generator evaluation methods have been proposed and adopted.

For a thorough survey, see the 2018 paper titled “Pros and Cons of GAN Evaluation Measures.” This paper divides GAN generator model evaluation into qualitative and quantitative measures, and we will review some of them in the following sections using this division.

Qualitative GAN Generator Evaluation

Qualitative measures are those measures that are not numerical and often involve human subjective evaluation or evaluation via comparison.

Five qualitative techniques for evaluating GAN generator models are listed below.

  1. Nearest Neighbors.
  2. Rapid Scene Categorization.
  3. Rating and Preference Judgment.
  4. Evaluating Mode Drop and Mode Collapse.
  5. Investigating and Visualizing the Internals of Networks.

Summary of Qualitative GAN Generator Evaluation Methods. Taken from: Pros and Cons of GAN Evaluation Measures.

Perhaps the most widely used qualitative GAN generator evaluation method is an extension of manual image inspection referred to as “Rating and Preference Judgment.”

These types of experiments ask subjects to rate models in terms of the fidelity of their generated images.

Pros and Cons of GAN Evaluation Measures, 2018.

This is where human judges are asked to rank or compare examples of real and generated images from the domain.

The “Rapid Scene Categorization” method is generally the same, although images are presented to human judges for a very limited amount of time, such as a fraction of a second, and classified as real or fake.

Images are often presented in pairs and the human judge is asked which image they prefer, e.g. which image is more realistic. A score or rating is determined based on the number of times images generated by a specific model are preferred in such comparisons. Variance in the judging is reduced by averaging the ratings across multiple different human judges.

This is a labor-intensive exercise, although costs can be lowered by using a crowdsourcing platform like Amazon’s Mechanical Turk, and efficiency can be increased by using a web interface.
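
As a minimal sketch of how such preference judgments might be aggregated, the function below computes a simple win rate per model from a list of pairwise comparisons; the tuple format and model names are assumptions for illustration only.

```python
# Hypothetical sketch: summarize pairwise preference judgments as a win rate
# per model, pooled over all judges and image pairs.
from collections import defaultdict

def win_rates(judgments):
    # judgments: list of (model_a, model_b, winner) tuples collected from judges
    wins, totals = defaultdict(int), defaultdict(int)
    for model_a, model_b, winner in judgments:
        totals[model_a] += 1
        totals[model_b] += 1
        wins[winner] += 1
    # fraction of comparisons each model won
    return {model: wins[model] / totals[model] for model in totals}

# e.g. win_rates([('gan_v1', 'gan_v2', 'gan_v2'), ('gan_v1', 'gan_v2', 'gan_v1')])
```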

One intuitive metric of performance can be obtained by having human annotators judge the visual quality of samples. We automate this process using Amazon Mechanical Turk […] using the web interface […] which we use to ask annotators to distinguish between generated data and real data.

Improved Techniques for Training GANs, 2016.

A major downside of the approach is that the performance of human judges is not fixed and can improve over time. This is especially the case if they are given feedback, such as clues on how to detect generated images.

By learning from such feedback, annotators are better able to point out the flaws in generated images, giving a more pessimistic quality assessment.

Improved Techniques for Training GANs, 2016.

Another popular approach for subjectively summarizing generator performance is “Nearest Neighbors.” This involves selecting examples of real images from the domain and locating one or more most similar generated images for comparison.

A distance measure, such as the Euclidean distance between the image pixel data, is often used to select the most similar generated images.

The nearest neighbor approach is useful to give context for evaluating how realistic the generated images happen to be.
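
A minimal sketch of this pixel-space nearest neighbor lookup is shown below, assuming the real and generated images are NumPy arrays with the same image dimensions.

```python
# Hypothetical sketch: for each real image, find the closest generated image
# by Euclidean distance on the raw pixel values.
import numpy as np

def nearest_generated(real_images, generated_images):
    # flatten so each row is one image's pixel vector
    real = real_images.reshape(len(real_images), -1).astype('float64')
    fake = generated_images.reshape(len(generated_images), -1).astype('float64')
    nearest = []
    for r in real:
        # Euclidean distance from this real image to every generated image
        dists = np.sqrt(((fake - r) ** 2).sum(axis=1))
        nearest.append(int(np.argmin(dists)))
    # index of the closest generated image for each real image
    return nearest
```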

Quantitative GAN Generator Evaluation

Quantitative GAN generator evaluation refers to the calculation of specific numerical scores used to summarize the quality of generated images.

Twenty-four quantitative techniques for evaluating GAN generator models are listed below.

  1. Average Log-likelihood
  2. Coverage Metric
  3. Inception Score (IS)
  4. Modified Inception Score (m-IS)
  5. Mode Score
  6. AM Score
  7. Frechet Inception Distance (FID)
  8. Maximum Mean Discrepancy (MMD)
  9. The Wasserstein Critic
  10. Birthday Paradox Test
  11. Classifier Two-sample Tests (C2ST)
  12. Classification Performance
  13. Boundary Distortion
  14. Number of Statistically-Different Bins (NDB)
  15. Image Retrieval Performance
  16. Generative Adversarial Metric (GAM)
  17. Tournament Win Rate and Skill Rating
  18. Normalized Relative Discriminative Score (NRDS)
  19. Adversarial Accuracy and Adversarial Divergence
  20. Geometry Score
  21. Reconstruction Error
  22. Image Quality Measures (SSIM, PSNR and Sharpness Difference)
  23. Low-level Image Statistics
  24. Precision, Recall and F1 Score

Summary of Quantitative GAN Generator Evaluation Methods. Taken from: Pros and Cons of GAN Evaluation Measures.

The original 2014 GAN paper by Goodfellow, et al. titled “Generative Adversarial Networks” used the “Average Log-likelihood” method, also referred to as kernel estimation or Parzen density estimation, to summarize the quality of the generated images.

This involves the challenging approach of estimating how well the generator captures the probability distribution of images in the domain and has generally been found not to be effective for evaluating GANs.

Parzen windows estimation of likelihood favors trivial models and is irrelevant to visual fidelity of samples. Further, it fails to approximate the true likelihood in high dimensional spaces or to rank models

Pros and Cons of GAN Evaluation Measures, 2018.
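
For completeness, a minimal sketch of the Parzen-window (kernel density) estimate of average log-likelihood is shown below using scikit-learn; the Gaussian kernel, bandwidth, and flattened inputs are assumptions for illustration, not a recommended evaluation.

```python
# Hypothetical sketch: Parzen-window (kernel density) estimate of the average
# log-likelihood of real samples under a density fit to generated samples.
import numpy as np
from sklearn.neighbors import KernelDensity

def average_log_likelihood(generated, real, bandwidth=0.1):
    # each row is one flattened image (or a lower-dimensional feature vector)
    kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth)
    kde.fit(generated)
    # higher values suggest the real data is well covered by the estimate
    return float(np.mean(kde.score_samples(real)))
```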

Two widely adopted metrics for evaluating generated images are the Inception Score and the Frechet Inception Distance.

The inception score was proposed by Tim Salimans, et al. in their 2016 paper titled “Improved Techniques for Training GANs.”

Inception Score (IS) […] is perhaps the most widely adopted score for GAN evaluation.

Pros and Cons of GAN Evaluation Measures, 2018.

Calculating the inception score involves using a pre-trained deep learning neural network model for image classification to classify the generated images. Specifically, it uses the Inception v3 model described by Christian Szegedy, et al. in their 2015 paper titled “Rethinking the Inception Architecture for Computer Vision.” The reliance on the Inception model gives the inception score its name.

A large number of generated images are classified using the model. Specifically, the probability of each image belonging to each known class is predicted. The probabilities are then summarized in the score to capture both how much each image looks like a known class and how diverse the set of images is across the known classes.

A higher inception score indicates better-quality generated images.
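
As a sketch of the calculation, the function below computes the inception score from a matrix of predicted class probabilities (one row per generated image); obtaining those probabilities from the Inception v3 model is omitted, and the ten-split averaging follows common practice rather than a fixed requirement.

```python
# Hypothetical sketch: inception score from class probabilities p(y|x)
# predicted by a classifier (e.g. Inception v3) for the generated images.
import numpy as np

def inception_score(p_yx, n_splits=10, eps=1e-16):
    scores = []
    n = p_yx.shape[0] // n_splits
    for i in range(n_splits):
        p = p_yx[i * n:(i + 1) * n]
        # marginal class distribution p(y) over this split
        p_y = np.expand_dims(p.mean(axis=0), 0)
        # KL divergence between p(y|x) and p(y) for each image
        kl = (p * (np.log(p + eps) - np.log(p_y + eps))).sum(axis=1)
        # the score is the exponent of the average KL divergence
        scores.append(np.exp(kl.mean()))
    # report the mean and standard deviation across splits
    return float(np.mean(scores)), float(np.std(scores))
```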

The Frechet Inception Distance, or FID, score was proposed and used by Martin Heusel, et al. in their 2017 paper titled “GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium.” The score was proposed as an improvement over the existing Inception Score.

FID performs well in terms of discriminability, robustness and computational efficiency. […] It has been shown that FID is consistent with human judgments and is more robust to noise than IS.

Pros and Cons of GAN Evaluation Measures, 2018.

Like the inception score, the FID score uses the inception v3 model. Specifically, the coding layer of the model (the last pooling layer prior to the output classification of images) is used to capture computer vision specific features of an input image. These activations are calculated for a collection of real and generated images.

The activations for each real and generated image are summarized as a multivariate Gaussian and the distance between these two distributions is then calculated using the Frechet distance, also called the Wasserstein-2 distance.

A lower FID score indicates more realistic images that match the statistical properties of real images.
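
A minimal sketch of the FID calculation is shown below, assuming you have already extracted the Inception v3 pooling-layer activations for a set of real and a set of generated images as two NumPy arrays.

```python
# Hypothetical sketch: Frechet Inception Distance between the activation
# statistics of real and generated images.
# FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2))
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(act_real, act_fake):
    # mean and covariance of each set of activations
    mu1, sigma1 = act_real.mean(axis=0), np.cov(act_real, rowvar=False)
    mu2, sigma2 = act_fake.mean(axis=0), np.cov(act_fake, rowvar=False)
    diff = mu1 - mu2
    # matrix square root of the product of the covariances
    covmean = sqrtm(sigma1.dot(sigma2))
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical error
    return float(diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```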

Which GAN Evaluation Scheme to Use

When getting started, it is a good idea to start with the manual inspection of generated images in order to evaluate and select generator models.

  • Manual Image Inspection

Developing GAN models is complex enough for beginners. Manual inspection can get you a long way while refining your model implementation and testing model configurations.
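
To make manual inspection quicker, something like the small sketch below can save a grid of generated images for review after each checkpoint; the generator, latent dimension, and pixel scaling are assumptions, again for a Keras-style model.

```python
# Hypothetical sketch: save a grid of generated images for manual review.
import numpy as np
import matplotlib.pyplot as plt

def save_image_grid(generator, latent_dim, n=5, filename='review.png'):
    # generate n*n images from random points in the latent space
    z = np.random.randn(n * n, latent_dim)
    images = generator.predict(z)
    for i in range(n * n):
        plt.subplot(n, n, i + 1)
        plt.axis('off')
        # assumes pixel values scaled to [0, 1]; squeeze drops a single channel
        plt.imshow(np.squeeze(images[i]), cmap='gray_r')
    plt.savefig(filename)
    plt.close()
```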

Once your confidence in developing GAN models improves, both the Inception Score and the Frechet Inception Distance can be used to quantitatively summarize the quality of generated images. There is no single best, agreed-upon measure, although these two measures come close.

As of yet, there is no consensus regarding the best score. Different scores assess various aspects of the image generation process, and it is unlikely that a single score can cover all aspects. Nevertheless, some measures seem more plausible than others (e.g. FID score).

Pros and Cons of GAN Evaluation Measures, 2018.

These measures capture the quality and diversity of generated images, either alone (the inception score) or compared to real images (the Frechet inception distance), and are widely used.

  • Inception Score
  • Frechet Inception Distance

Both measures are easy to implement and calculate on batches of generated images. As such, the practice of systematically generating images and saving models during training can and should continue to be used to allow post-hoc model selection.

The nearest neighbor method can be used to qualitatively summarize generated images. Human-based ratings and preference judgments can also be used if needed via a crowdsourcing platform.

  • Nearest Neighbors
  • Rating and Preference Judgment

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this post, you discovered techniques for evaluating generative adversarial network models based on generated synthetic images.

Specifically, you learned:

  • There is no objective function used when training GAN generator models, meaning models must be evaluated using the quality of the generated synthetic images.
  • Manual inspection of generated images is a good starting point when getting started.
  • Quantitative measures, such as the inception score and the Frechet inception distance, can be combined with qualitative assessment to provide a robust assessment of GAN models.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


33 Responses to How to Evaluate Generative Adversarial Networks

  1. David March 26, 2020 at 6:16 am #

    Hi! Nice blog, there’s a lot of stuff covered.

    I need to evaluate different versions of a GAN trained to generate faces. Given there’s no person class in the dataset inception v3 was trained with, I assume the inception score is not an option then, right?

    What about the FID? I guess it might be better than IS for this case, as the real images are an input too, but I don’t know if it’d be reliable enough. What do you think?

  2. Christine March 31, 2020 at 10:06 am #

    Hi, thank you for a great tutorial.

    I have a question. I have just started working with GANs, previously I was only working on supervised ML/DL tasks like classification or segmentation. In supervised learning we always compare the loss values on training and validation data.
    1) Is there any logic in doing the same with GANs?

    Let's say I am training a CycleGAN, and I have losses for the generator and discriminator. They have a particular trend. If I compute the loss of the same models on a validation set (unseen data), the trend is slightly different. For example, the training losses are lower for both the discriminator and generator.

    2)Is it possible I derive any conclusions from this about quality of GAN performance?

    • Jason Brownlee March 31, 2020 at 1:35 pm #

      No, not really.

      The most reliable way I use is to use the model to generate images, then choose the model that generates the best images.

  3. Joseph July 29, 2020 at 3:06 pm #

    useful metrics to evaluate a GAN’s performance?

    • Jason Brownlee July 30, 2020 at 6:15 am #

      Yes, that is the topic of the above tutorial.

      If you cannot choose, consider IS or FID.

  4. Rizwan August 23, 2020 at 8:38 pm #

    Hi,
    Thanks for this very useful post.
    I am working on network data that is in tabular form. I am trying to generate similar data using classical and conditional GANs. The purpose is to adversarially train the classifiers with the GAN-generated network traffic data to minimize the false negatives. I am evaluating the generator performance after every epoch. The method I am adopting is to generate data G of the same size as the input data X. Then I divide G and X into two halves and make one training set and one test set, combining X/2 and G/2 for each set. However, I am facing difficulty evaluating the generator performance as I am not getting encouraging results. Need your advice.

    • Jason Brownlee August 24, 2020 at 6:22 am #

      Thanks!

      Perhaps focus on generating images and evaluating them subjectively? It’s an excellent starting point.

  5. Rizwan August 23, 2020 at 8:55 pm #

    I record the weights with lowest accuracy over the test set (X/2 U G/2) after training on the other halves. Then I use the weights to generate data to add into the training set to improve the classifier performance.

    (BTW Sorry for another posting)

    • Jason Brownlee August 24, 2020 at 6:23 am #

      Accuracy is a terrible metric for GANs, please don’t use it.

  6. Mak September 23, 2020 at 6:52 am #

    Hi. What measure would you recommend for evaluating Pix2Pix GAN?

  7. Jayeeta Chakraborty November 24, 2020 at 5:55 pm #

    For augmenting 1-D signals, what do you think would be the appropriate metric?

    • Jason Brownlee November 25, 2020 at 6:42 am #

      I don't know; I guess it depends on the specifics of your problem.

  8. MuhdKiru December 10, 2020 at 12:50 am #

    Amazing. Your works are really saving my butts

  9. Swetashree Mishra February 10, 2021 at 7:13 pm #

    How can we do a comparison of different quantitative evaluation metrics for GANs? Is there any way we can compare Average Log-likelihood with FID, and so on with the others?

    • Jason Brownlee February 11, 2021 at 5:52 am #

      See some of the papers in the “further reading” section for comparisons.

  10. T_SM August 26, 2021 at 12:16 am #

    Hi! I am working with audio anomaly detection with GAN. So which evaluation metric is best?

    • Adrian Tam August 27, 2021 at 5:37 am #

      Let’s think in this way: What can you tell from a good result vs a bad result? How can we convert their difference into a single scalar value?

  11. Ali December 7, 2021 at 4:36 pm #

    Hi, Thank you for this great post
    I am working on a specific skin lesion classification task and there is a lack of labeled data for it, so I want to use GANs to generate synthetic images for training my model. However, I am worried that these generated images wouldn't be correct in terms of medicine, because almost all well-known skin lesion datasets are labeled by dermatologists and histopathologists. So can I use IS and FID for evaluating my GAN, or is that not enough for my project?

    • Adrian Tam December 8, 2021 at 7:59 am #

      I think that’s a better question to the domain experts. They may give you some insight on how to make it better.

  12. Nicholas January 22, 2022 at 7:02 pm #

    Hi, do you know how to evaluate a GAN for tabular data? I'm using a vanilla GAN to oversample the minority class in a credit card fraud dataset.

  13. Bahar April 11, 2022 at 6:56 pm #

    Hi, Thank you for the great post

    For augmenting 1-D data like gene expression data with vector inputs, what do you think would be the appropriate metric?

  14. Kushagra Anand August 9, 2022 at 10:40 am #

    If I use the BRISQUE score to calculate image quality, what range is suitable for a GAN's output, given that the images I have used lie in the range of 11-15?

  15. noel October 16, 2023 at 7:01 am #

    I want the code for the inception score on the MNIST dataset.
