Summary of the paper “Understanding Deep Learning Requires Rethinking Generalization”
Regarding traditional learning theory, there are many researchers and books on the topic. Personally, I do not have a deep understanding of it: I have tried to learn it, but it goes very deep and I don’t feel that I have touched base with it yet. Before I read the paper “UNDERSTANDING DEEP LEARNING REQUIRES RETHINKING GENERALIZATION”, I read several blog articles by the first author of this paper, Chiyuan Zhang. These blogs are great, and they helped me a lot in understanding some of the traditional theory, including the relationship between the law of large numbers and machine learning, Hoeffding’s inequality, and empirical risk minimization.
Since the paper tries to build a new theory to explain deep learning systems, it includes a brief introduction to the old theory. It turns out that the old theory cannot explain deep learning systems well, so the authors use a set of experiments to motivate new theories and hypotheses. This is helpful, as we do need theory to explain why deep learning works so well.
Before diving deep into the paper, I’d like to give a brief summary of the main terminology used in the paper.
generalization error: the difference between “training error” and “test error”.
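To make this concrete, the gap can be written as follows (the notation below is my own shorthand, not taken from the paper):

```latex
% Generalization error = population (test) risk minus empirical (training) risk.
% Notation is my own shorthand: f is the trained model, \ell a loss function,
% \mathcal{D} the data distribution, and (x_i, y_i) the n training samples.
\mathrm{gen}(f) \;=\;
\underbrace{\mathbb{E}_{(x,y)\sim\mathcal{D}}\,\ell\big(f(x),y\big)}_{\text{test error}}
\;-\;
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\ell\big(f(x_i),y_i\big)}_{\text{training error}}
```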
The motivation of this paper
“ It is certainly easy to come up with natural model architectures that generalize poorly. What is it then that distinguishes neural networks that generalize well from those that don’t? A satisfying answer to this question would not only help to make neural networks more interpretable, but it might also lead to more principled and reliable model architecture design.”
Above is the description from the paper. From my personal point of view, one of the motivations is that the authors want to build up the theory side of deep learning, which partly explains why this paper got so much attention.
Several important/interesting points
Deep neural networks easily fit random labels
This one is easy to understand: by the universal approximation theorem, a feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of R^n. Today’s deep neural networks have even greater model capacity, so it is not too surprising that a model can fit the training data even when the labels are random.
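As a quick sanity check of this randomization test, here is a minimal sketch, assuming a PyTorch setup (the network size, data shapes, and hyperparameters are my own toy choices, not the paper’s): it trains a small MLP on random inputs paired with completely random labels, and the training accuracy should climb toward 1.0.

```python
import torch
import torch.nn as nn

# Toy randomization test: random inputs paired with random labels (10 classes).
# All sizes and hyperparameters are arbitrary illustrative choices.
torch.manual_seed(0)
n, d, k = 512, 64, 10
X = torch.randn(n, d)
y = torch.randint(0, k, (n,))          # labels carry no information about X

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(3000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"training accuracy on random labels: {train_acc:.3f}")  # typically close to 1.0
```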
SGD can perform as an implicit regularizer.
It is widely accepted that explicit regularizers such as data augmentation, dropout, and weight decay can help reduce the generalization error. This is natural to expect: data augmentation generates more training samples, and dropout has a similar effect to data augmentation. Weight decay follows the widely accepted idea of Occam’s razor, which is sometimes paraphrased as “the simplest solution is most likely the right one”.
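For reference, this is roughly where these three explicit regularizers show up in a typical training setup. The sketch below assumes PyTorch/torchvision; the specific transforms, dropout rate, and weight-decay value are illustrative choices, not the paper’s exact configuration.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms

# Data augmentation: random crops and flips applied to each training image.
augment = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Dropout: randomly zeroes activations during training.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(512, 10),
)

# Weight decay: an L2 penalty on the weights, folded into the optimizer update.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
```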
In the paper, it is mentioned that SGD itself may act as an implicit regularizer. This idea is fairly new, and we will see whether it holds up once a formal way to prove the hypothesis is found.
“For linear models, SGD always converges to a solution with the small norm. Hence, the algorithm itself is implicitly regularizing the solution. Indeed, we show on small data sets that even Gaussian kernel methods can generalize well with no regularization. Though this doesn’t explain why certain architectures generalize better than other architectures, it does suggest that more investigation is needed to understand exactly what the properties are inherited by models that were trained using SGD.”
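The minimum-norm claim for linear models is easy to check numerically. Below is a small sketch using NumPy on a toy underdetermined least-squares problem of my own construction: SGD on the squared loss, started from zero, ends up essentially at the minimum-norm interpolating solution given by the pseudoinverse.

```python
import numpy as np

# Underdetermined linear regression: more parameters than samples (toy sizes).
rng = np.random.default_rng(0)
n, d = 20, 100
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# SGD on the squared loss, one sample at a time, starting from zero.
w = np.zeros(d)
lr = 0.005
for epoch in range(2000):
    for i in rng.permutation(n):
        grad = (X[i] @ w - y[i]) * X[i]   # gradient of 0.5 * (x_i.w - y_i)^2
        w -= lr * grad

# Minimum-norm interpolating solution via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # should be tiny
print("norm of SGD solution:        ", np.linalg.norm(w))
print("norm of min-norm solution:   ", np.linalg.norm(w_min_norm))
```

Starting from zero matters here: every SGD update is a multiple of some row of X, so the iterates never leave the row space of X, and the only interpolating solution inside that row space is the minimum-norm one.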
New observations about deep neural networks rule out some classical learning theories.
“We discuss in further detail below how these observations rule out all of VC-dimension, Rademacher complexity, and uniform stability as possible explanations for the generalization performance of state-of-the-art neural networks.” I don’t quite understand VC-dimension, Rademacher complexity, and uniform stability yet. Maybe I should learn them and summarize them in separate blog posts in the future.
Main experiment results
True labels, random labels, and several other variants are compared in the following chart, based on the CIFAR10 dataset.
Impact of regularizers on the final performance.
The different regularizers are summarized in the presentation slides from ICLR 2017.
From the chart below on the left, we can see that without data augmentation, weight decay, or dropout, the test performance is around 0.6. With data augmentation and dropout, the performance is around 0.7. With data augmentation, dropout, and weight decay, the performance approaches 0.8. The introduction of weight decay also has an extra benefit: we do not need to worry as much about the test performance dropping when we overtrain the model. From these results a natural question arises: does weight decay always help reduce the generalization error? We cannot draw a conclusion from the results of a single experiment. However, during my research on Frustum VoxNet, I did sometimes observe that an early-stopped model performs better than an overtrained model on the test data set.
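For what it’s worth, the early-stopping behavior I mention is straightforward to implement: keep the checkpoint that scores best on a validation set instead of the last one. A minimal sketch in a PyTorch style is below; `train_one_epoch` and `evaluate` are hypothetical helper functions standing in for an ordinary training loop and an accuracy computation.

```python
import copy

def fit_with_early_stopping(model, train_loader, val_loader, max_epochs=200, patience=10):
    """Keep the model weights that scored best on the validation set.

    train_one_epoch and evaluate are hypothetical helpers (not from the paper):
    one runs a single pass over train_loader, the other returns validation accuracy.
    """
    best_acc = 0.0
    best_state = copy.deepcopy(model.state_dict())
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_loader)
        val_acc = evaluate(model, val_loader)
        if val_acc > best_acc:
            best_acc = val_acc
            best_state = copy.deepcopy(model.state_dict())
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:   # stop once validation accuracy stalls
                break
    model.load_state_dict(best_state)           # roll back to the best checkpoint
    return model, best_acc
```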
From the chart below on the right, we can see that when batch normalization is used, the training process is more stable and the test performance is also more stable. Batch normalization also brings a gain in test performance.
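As a reference point, adding batch normalization is typically a one-line change per layer. A PyTorch-flavored sketch of the same small convolutional block with and without it (the layer sizes are my own choices):

```python
import torch.nn as nn

# Without batch normalization.
plain_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
)

# With batch normalization between the convolution and the activation;
# it normalizes each channel's activations over the mini-batch.
bn_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)
```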
Critique from myself about this paper
This is a great paper. It attracted impressive attention and pushes researchers to keep moving forward and build a better theory for deeply understanding deep learning models. At the same time, I’d like to add my two cents about a couple of parts that do not fully persuade me.
First
“ Hence, the concept is not strong enough to distinguish between the models trained on the true labels (small generalization error) and models trained on random labels (high generalization error)” I don’t quite agree that we can compare a model trained on true labels and tested on true labels with a model trained on random labels and tested on true labels. We always make an i.i.d. assumption during the analysis, right? The latter setting does not follow the i.i.d. assumption.
Second
“In summary, our observations on both explicit and implicit regularizers are consistently suggesting that regularizers, when properly tuned, could help to improve the generalization performance. However, it is unlikely that the regularizers are the fundamental reason for generalization, as the networks continue to perform well after all the regularizers removed.”
What is the fundamental reason for generalization?
Nice slides from the presentation
The first author Chiyuan Zhang’s presentation slides can be found at the link below:
Several slides are pretty useful, and I’d like to share them here.