In a recent paper, when referring to the KL divergence, we wrote that:
In some domains, the length of the sequences n differs in each example, which can be incorporated by choosing an effective length N = max n, and treating all sequences shorter than N as having a sequence of padding tokens appended,
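The padding convention in the quoted passage can be illustrated with a minimal sketch. It assumes sequences are plain Python lists and uses a hypothetical `<pad>` symbol; the helper name `pad_to_effective_length` is also an illustration, since the passage names neither a padding token nor an implementation.

```python
PAD = "<pad>"  # hypothetical padding symbol; the quoted passage does not name one


def pad_to_effective_length(sequences, pad_token=PAD):
    """Right-pad every sequence to the effective length N = max n."""
    # N is the length of the longest sequence in the collection.
    effective_length = max(len(seq) for seq in sequences)
    # Shorter sequences get (N - n) padding tokens appended at the end.
    padded = [
        list(seq) + [pad_token] * (effective_length - len(seq))
        for seq in sequences
    ]
    return padded, effective_length


batch = [["a", "b", "c"], ["d"], ["e", "f"]]
padded, N = pad_to_effective_length(batch)
```

After padding, every example has the same length N, so all of them can be treated as length-N sequences over a vocabulary extended by the padding symbol. The sketch expects at least one sequence, because `max` raises an error on an empty collection.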
Last updated on Nov 10, 2024