(Averaged) Cross-Entropy Loss is not a Proper Scoring Rule
In a recent paper, we wrote the following when referring to the KL divergence:
In some domains, the length of the sequences n differs in each example, which can be incorporated by choosing an effective length N = max n, and treating all sequences shorter than N as having a sequence of padding tokens appended,
with a footnote that
Some care is required here, as averaging the loss of each example over its length leads to an inconsistent estimator.
Although we wrote ‘an inconsistent estimator’ in the paper, more precisely we should have specified that it is not a ‘proper scoring rule’: in other words, the minimizer of a certain type of cross-entropy loss with variable-length sequences (in fact, the default loss in PyTorch) is not the data-generating distribution.
Here I will elaborate on this point a bit more, and discuss why I doubt it’s a problem for modern sequence modelling in practice.
Preliminaries
We will start with some definitions.
Proper Scoring Rule
A scoring rule is a way to evaluate how good a probabilistic prediction is. We say a scoring rule is “proper” if the best score (lowest loss) is always achieved by predicting the true probabilities (the probability of the data-generating process).
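To state this a little more formally (the notation $S$ here is my own, not from the original text): write $S(q, x)$ for the loss incurred when we predict the distribution $q$ and then observe the outcome $x$. The scoring rule $S$ is proper if, for every data-generating distribution $p$ and every prediction $q$,
$$ \mathbb{E}_{x \sim p}\left[S(p, x)\right] \;\le\; \mathbb{E}_{x \sim p}\left[S(q, x)\right], $$
and strictly proper if equality holds only for $q = p$; that is, reporting the true distribution is (uniquely) optimal in expectation.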
The KL-divergence
The KL-divergence between two distributions $p$ and $q$ over a random variable $x$ is defined as
$$ D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]. $$
As with all divergences, it is non-negative, and it is equal to zero if and only if $p = q$, so it is minimized exactly when the model matches the data-generating distribution. If our random variable $x$ is discrete, the expectation can be written as the sum
$$ D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}, $$
or, if $x$ is continuous with densities $p$ and $q$, as the integral
$$ D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \log \frac{p(x)}{q(x)} \,\mathrm{d}x. $$
The Cross-Entropy Loss
Since we usually don’t have density estimates for a target distribution $p$, but only samples from it, in practice we instead minimize the cross-entropy loss
$$ H(p, q) = -\mathbb{E}_{x \sim p}\left[\log q(x)\right], $$
which can be estimated using samples from $p$ alone. It is clear that
$$ H(p, q) = H(p) + D_{\mathrm{KL}}(p \,\|\, q), $$
where $H(p) = -\mathbb{E}_{x \sim p}[\log p(x)]$ is the entropy of $p$, from which we can see that the cross-entropy loss is simply a shifted KL-divergence. Since the shift doesn’t depend on $q$, the cross-entropy is minimized by the same $q$ as the KL-divergence, namely $q = p$, so it is also a proper scoring rule.
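As a quick numerical sanity check of this decomposition (a throwaway script of my own, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary categorical distributions over 5 outcomes.
p = rng.random(5); p /= p.sum()
q = rng.random(5); q /= q.sum()

entropy = -(p * np.log(p)).sum()        # H(p)
cross_entropy = -(p * np.log(q)).sum()  # H(p, q)
kl = (p * np.log(p / q)).sum()          # D_KL(p || q)

# The cross-entropy is the KL-divergence shifted by the entropy of p.
assert np.isclose(cross_entropy, entropy + kl)
```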
The Averaged Cross-Entropy Loss
Another variant of the cross-entropy loss is the averaged form. In particular, let’s say we have data sequences which are of varying length. We can imagine they have a generative process where we first pick the length of the sequence $n$, and then generate the content of the sequence conditioned on that length. The averaged cross-entropy loss normalizes the loss of each sequence by its length:
$$ \bar{H}(p, q) = -\mathbb{E}_{x_{1:n} \sim p}\left[\frac{1}{n} \sum_{t=1}^{n} \log q(x_t \mid x_{<t})\right] = -\mathbb{E}_{x_{1:n} \sim p}\left[\frac{1}{n} \log q(x_{1:n})\right]. $$
As discussed below, this is effectively the default cross-entropy loss available for sequence modelling in PyTorch, when using a batch (or micro-batch) size of 1 and the default keyword argument reduction='mean'.
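As a minimal sketch of this behaviour (the shapes and variable names below are my own, chosen just for illustration): with a single sequence, torch.nn.functional.cross_entropy with the default reduction='mean' divides the summed token losses by the number of non-ignored tokens, i.e. by the sequence length.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 5, 4
logits = torch.randn(seq_len, vocab_size)           # per-token logits from a model
targets = torch.randint(0, vocab_size, (seq_len,))  # target token ids

# Default reduction='mean': summed token losses divided by the number of tokens.
mean_loss = F.cross_entropy(logits, targets)
sum_loss = F.cross_entropy(logits, targets, reduction="sum")
assert torch.isclose(mean_loss, sum_loss / seq_len)

# With padding, positions equal to ignore_index are dropped from both the sum and
# the denominator, so each sequence is still averaged over its own length.
pad_id = -100
padded_logits = torch.cat([logits, torch.randn(3, vocab_size)])
padded_targets = torch.cat([targets, torch.full((3,), pad_id)])
padded_mean = F.cross_entropy(padded_logits, padded_targets, ignore_index=pad_id)
assert torch.isclose(padded_mean, mean_loss)
```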
Below, I’ll show that this loss is not a proper scoring rule. We’ll do this by demonstrating a case where its minimizer is not the data-generating distribution.
A Simple Sequential Model
Let’s say we have the following setup: we have a categorical variable $x_t \in \{a, \texttt{<eos>}\}$, where <eos> is the end-of-sentence token. If this is generated, the sequence ends. We have a simple data-generating process $p$: at each timestep, with probability $1 - \lambda$ the token $a$ is generated, and with probability $\lambda$ the <eos> token is generated, so the sequence almost surely has finite length. The length $n$ of a sequence (counting the <eos> token) therefore follows a geometric distribution, $p(n) = (1 - \lambda)^{n-1} \lambda$. We are going to fit this with another distribution $q$, which at each timestep generates the <eos> token with probability $\mu$ and the token $a$ with probability $1 - \mu$, so that $q(n) = (1 - \mu)^{n-1} \mu$.
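As a small illustrative sketch (the helper below is my own, reusing the symbol $\lambda$ from above), the sequence length under $p$ is geometrically distributed with mean $1/\lambda$:

```python
import random

def sample_length(lam: float) -> int:
    """Length of one sequence from p: at each timestep emit <eos> with
    probability lam, otherwise emit the other token and continue."""
    n = 1
    while random.random() >= lam:  # continue with probability 1 - lam
        n += 1
    return n                       # length includes the final <eos> token

lam = 0.3
lengths = [sample_length(lam) for _ in range(100_000)]
print(sum(lengths) / len(lengths))  # close to 1 / lam ≈ 3.33
```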
We’ll calculate the KL-divergence, cross-entropy, and averaged cross-entropy between $p$ and $q$.
The KL-divergence
Computing the KL-divergence between $p$ and $q$, we have
$$ D_{\mathrm{KL}}(p \,\|\, q) = \sum_{n=1}^{\infty} (1-\lambda)^{n-1}\lambda \log \frac{(1-\lambda)^{n-1}\lambda}{(1-\mu)^{n-1}\mu} = \frac{1-\lambda}{\lambda}\log\frac{1-\lambda}{1-\mu} + \log\frac{\lambda}{\mu}, $$
where we have used the result that the mean of the geometric distribution is $\sum_{n=1}^{\infty} n\,(1-\lambda)^{n-1}\lambda = 1/\lambda$, so that $\sum_{n=1}^{\infty} (n-1)\,(1-\lambda)^{n-1}\lambda = (1-\lambda)/\lambda$.
We can see that this is minimized, and equal to zero, exactly when $\mu = \lambda$, i.e. when the model matches the data-generating distribution.
Cross-Entropy
As a check, we’ll first compute the entropy of our sequences:
$$ H(p) = -\sum_{n=1}^{\infty} (1-\lambda)^{n-1}\lambda \left[(n-1)\log(1-\lambda) + \log\lambda\right] = -\frac{1-\lambda}{\lambda}\log(1-\lambda) - \log\lambda. $$
Now we can compute the cross-entropy loss:
$$ H(p, q) = -\sum_{n=1}^{\infty} (1-\lambda)^{n-1}\lambda \left[(n-1)\log(1-\mu) + \log\mu\right] = -\frac{1-\lambda}{\lambda}\log(1-\mu) - \log\mu. $$
We see again that the loss will be minimized when $\mu = \lambda$, at which point the cross-entropy equals the entropy $H(p)$, as expected.
Averaged Cross-Entropy Loss
Finally we can look at the averaged cross-entropy loss. Here we have
$$ \bar{H}(p, q) = -\sum_{n=1}^{\infty} \frac{(1-\lambda)^{n-1}\lambda}{n} \left[(n-1)\log(1-\mu) + \log\mu\right] = -\left(1 - \mathbb{E}_p\!\left[\tfrac{1}{n}\right]\right)\log(1-\mu) - \mathbb{E}_p\!\left[\tfrac{1}{n}\right]\log\mu, $$
where we have used the result that
$$ \mathbb{E}_p\!\left[\frac{1}{n}\right] = \sum_{n=1}^{\infty} \frac{(1-\lambda)^{n-1}\lambda}{n} = \frac{\lambda}{1-\lambda} \sum_{n=1}^{\infty} \frac{(1-\lambda)^{n}}{n} = -\frac{\lambda \log \lambda}{1-\lambda}, $$
which follows from the series $\sum_{n=1}^{\infty} x^{n}/n = -\log(1-x)$. This has the same Bernoulli cross-entropy form as above, but with $\mathbb{E}_p[1/n]$ playing the role of $\lambda$, so it is minimized at $\mu = \mathbb{E}_p[1/n]$, which is not equal to $\lambda$ in general.
So, to minimize the averaged cross-entropy loss, the model has to give a probability which is not equal to the ground-truth probability!
A bit more algebra shows that if the data-generating process emits the <eos> token with probability $\lambda$, then to minimize the averaged cross-entropy we should set
$$ \mu = \mathbb{E}_p\!\left[\frac{1}{n}\right] = -\frac{\lambda \log \lambda}{1-\lambda}, $$
which is strictly greater than $\lambda$ for $\lambda \in (0, 1)$: the fitted model is biased towards ending its sequences earlier than the data.
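As a numerical sanity check of these two minimizers (a short script of my own, not part of the original derivation), we can evaluate both losses on a grid of values of $\mu$ for $\lambda = 0.3$:

```python
import numpy as np

lam = 0.3
n = np.arange(1, 2000)                # truncate the infinite sum over lengths
p_n = (1 - lam) ** (n - 1) * lam      # P(n): geometric length distribution

mus = np.linspace(0.001, 0.999, 999)  # candidate <eos> probabilities
# Log-probability of a length-n sequence under the model q.
log_q = (n[:, None] - 1) * np.log(1 - mus) + np.log(mus)

ce = -(p_n[:, None] * log_q).sum(axis=0)                   # cross-entropy
avg_ce = -(p_n[:, None] / n[:, None] * log_q).sum(axis=0)  # averaged cross-entropy

print(mus[ce.argmin()])                # ~0.300: the ground-truth lambda
print(mus[avg_ce.argmin()])            # ~0.516: not lambda
print(-lam * np.log(lam) / (1 - lam))  # ~0.516: the predicted minimizer
```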
Some Intuition
After going through the derivations, we can see that the averaged cross-entropy is not a proper scoring rule. This might seem a bit strange, since we can usually scale losses by a constant and the minimum is the same. However, the problem here is that the scaling depends on the content of the sequences. In other words, when choosing whether to emit the <eos> token at the first timestep: if the <eos> token appears, this current timestep will have a weighting of 1. However, if <eos> is not chosen, and is chosen at the next timestep, the current timestep will have a weighting of $\tfrac{1}{2}$. The weight given to a timestep therefore depends on what happens later in the sequence. Indeed, I believe that in the case where the content of the sequence is independent from the length, the averaged cross-entropy is still a proper scoring rule. Of course, in contemporary sequence modelling, the content depends heavily on the length: indeed, for a sequence of length $n$, the $n$-th token must be an <eos> token or similar.
Practical Implications
Why is this discussion about averaged cross-entropy relevant? I think it’s worth thinking about because the default implementation of the cross-entropy loss in PyTorch is an averaged cross-entropy loss. Specifically, when dealing with a single masked sequence where the masked positions are set to ignore_index, the default behaviour (reduction='mean') is to average the loss over the remaining tokens, i.e. to carry out an averaged cross-entropy. The key advantage of this behaviour is that the losses are on the same scale for each sequence (roughly the per-token entropy of the text). Practically speaking, having losses that are similar in scale is likely important for stable training.
However, it’s worth pointing out that the behaviour under batching is quite close to the behaviour of the correct cross-entropy loss. If we have two sequences in a batch, one with length 1 plus 1023 padding tokens and one with length 1024, then with reduction='mean' the summed token losses of both sequences are divided by the total number of non-padding tokens, 1025. Every token therefore receives the same weight, matching the unnormalized cross-entropy up to an overall constant for the batch, whereas with a micro-batch size of 1 and gradient accumulation the single token of the short sequence would be weighted 1024 times more heavily than each token of the long sequence.
In principle, this issue would arise more frequently with high amounts of gradient accumulation, such as a micro-batch size of 1.
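To make the batching argument concrete, here is a small sketch (tensor names and sizes are my own) of the example above: a batch containing a length-1 sequence padded to 1024 tokens together with a full length-1024 sequence, compared against micro-batch size 1 with gradient accumulation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, T, pad = 50, 1024, -100

logits = torch.randn(2, T, vocab)
targets = torch.randint(0, vocab, (2, T))
targets[0, 1:] = pad  # sequence 0: length 1 plus 1023 padding tokens
                      # sequence 1: length 1024, no padding

# Batched: one mean over all 1025 non-ignored tokens -> per-token weighting.
batched = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1),
                          ignore_index=pad)

# Micro-batch size 1 with gradient accumulation: each sequence is first averaged
# over its own length -> per-sequence weighting (the improper averaged loss).
per_sequence = torch.stack([
    F.cross_entropy(logits[i], targets[i], ignore_index=pad) for i in range(2)
])
accumulated = per_sequence.mean()

print(batched.item(), accumulated.item())  # these generally differ
```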
I tried to demonstrate this issue in a practical setting by training a small transformer on a corpus and comparing the learned models. I used Andrej Karpathy’s Shakespeare corpus and a vanilla 4-layer transformer. I trained with a minibatch size of 128, first using a per-device ‘microbatch’ size of 128, and then using a microbatch size of 1 with 128 steps of gradient accumulation. I didn’t use any dropout, so these two training setups would have been exactly equivalent if a non-averaged cross-entropy were being used. There are some differences between the trained models that are obtained: based on the analysis above, the model trained with smaller microbatches should be more likely to generate shorter sequences, but the effect is not particularly strong, given the large error bars. The test perplexity of the different models is also very similar across microbatch sizes. Therefore, it seems that the effect of the improper loss is quite small in practice.
