BART
Official Summary
BART is a denoising autoencoder for pre-training sequence-to-sequence models. BART is trained by corrupting text with an arbitrary noise function and learning a model to reconstruct the original text.
It uses a standard Transformer-based neural machine translation architecture which, despite its simplicity, can be seen as generalizing BERT (due to the bidirectional encoder), GPT (with the left-to-right decoder), and many other more recent pre-training schemes.
BART also presents a new scheme for machine translation where a BART model is stacked above a few additional transformer layers. These layers are trained to essentially translate the foreign language to noised English, by propagation through BART, thereby using BART as a pre-trained target-side language model.
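A minimal sketch (not the authors' code) of how that machine-translation scheme might be wired up, assuming PyTorch and the HuggingFace transformers library; the vocabulary size, layer count, and helper name below are illustrative. A small randomly initialised encoder is stacked below the pretrained BART, which stays (mostly) frozen in the first training stage.

```python
import torch.nn as nn
from transformers import BartForConditionalGeneration

SRC_VOCAB_SIZE = 32000   # hypothetical size of the new source-language vocabulary
D_MODEL = 1024           # hidden size of facebook/bart-large

bart = BartForConditionalGeneration.from_pretrained("facebook/bart-large")
for p in bart.parameters():
    p.requires_grad = False  # first stage: keep pretrained BART frozen
                             # (the paper also unfreezes a few BART parameters here; omitted)

# New, randomly initialised source-side encoder trained from scratch.
src_embed = nn.Embedding(SRC_VOCAB_SIZE, D_MODEL)
src_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=16, batch_first=True),
    num_layers=2,
)

def translation_loss(src_ids, tgt_ids):
    """Map foreign tokens into BART's input space; BART 'denoises' them into English."""
    src_states = src_encoder(src_embed(src_ids))            # (batch, src_len, d_model)
    return bart(inputs_embeds=src_states, labels=tgt_ids).loss
```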
Architecture

It uses the standard sequence-to-sequence Transformer architecture except that, following GPT, ReLU activations are replaced with GeLUs and parameters are initialised from N(0, 0.02). The architecture is closely related to BERT, with the following differences (a minimal configuration sketch follows the list):
- each layer of the decoder additionally performs cross-attention over the final hidden layer of the encoder (as in the transformer sequence-to-sequence model)
- BERT uses an additional feed-forward network before word prediction while BART does not
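As a concrete reference, the sketch below instantiates a small BART-style model with the HuggingFace transformers library; the layer counts and sizes are illustrative, while the GeLU activation and N(0, 0.02) initialisation mirror the description above.

```python
from transformers import BartConfig, BartModel

config = BartConfig(
    vocab_size=50265,
    d_model=768,
    encoder_layers=6,
    decoder_layers=6,
    encoder_attention_heads=12,
    decoder_attention_heads=12,
    encoder_ffn_dim=3072,
    decoder_ffn_dim=3072,
    activation_function="gelu",  # ReLU swapped for GeLU, following GPT
    init_std=0.02,               # weights initialised from N(0, 0.02)
)
model = BartModel(config)

# Each decoder layer cross-attends over the encoder's final hidden states, and,
# unlike BERT, there is no extra feed-forward network before word prediction.
print(sum(p.numel() for p in model.parameters()))
```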
Pre-Training Task
BART pre-trains a model combining Bidirectional and Auto-Regressive Transformers. Pre-training has two stages:
- text is corrupted with an arbitrary noising function
- a sequence-to-sequence model is learned to reconstruct the original text
BART optimizes a reconstruction loss: the cross-entropy between the decoder's output and the original document. Because the corruption step is decoupled from this objective, any type of document corruption can be applied.
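A minimal sketch of these two stages with the HuggingFace transformers API; the corrupted string below is just a single toy example, whereas real pre-training applies a noising function over large batches of documents.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "BART is trained by corrupting text and learning to reconstruct it."
corrupted = "BART is trained by <mask> and learning to reconstruct it."  # any corruption works

batch = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# Reconstruction loss: cross-entropy between the decoder's output and the original text.
loss = model(**batch, labels=labels).loss
loss.backward()
```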

Experiments


Performance
The paper evaluates a number of noising approaches, finding the best performance from combining two corruptions: randomly shuffling the order of the original sentences, and a novel in-filling scheme where spans of text are replaced with a single mask token.
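A rough sketch of those two corruptions over plain word lists (not the authors' implementation); the Poisson(λ = 3) span lengths and ~30% mask budget follow the paper, everything else is simplified.

```python
import random
import numpy as np

MASK = "<mask>"

def sentence_permutation(sentences):
    """Randomly shuffle the order of the original sentences."""
    shuffled = list(sentences)
    random.shuffle(shuffled)
    return shuffled

def text_infilling(tokens, mask_ratio=0.3, poisson_lambda=3.0):
    """Replace sampled spans with a single <mask> token each.

    Span lengths are drawn from Poisson(poisson_lambda); a 0-length span
    corresponds to inserting a mask token without removing anything.
    """
    tokens = list(tokens)
    budget = int(len(tokens) * mask_ratio)
    while budget > 0 and len(tokens) > 1:
        span = min(int(np.random.poisson(poisson_lambda)), len(tokens) - 1)
        if span == 0:
            tokens.insert(random.randrange(len(tokens) + 1), MASK)
        else:
            start = random.randrange(len(tokens) - span + 1)
            tokens[start:start + span] = [MASK]
        budget -= max(span, 1)
    return tokens

sentences = ["He ate the sandwich.", "The moon rose.", "She walked home."]
print(sentence_permutation(sentences))
print(" ".join(text_infilling("BART is a denoising autoencoder for pretraining models".split())))
```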


Fine-Tuning
The representations produced can be used in several ways for downstream applications such as sequence classification, token classification, sequence generation and machine translation.
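For instance, with the HuggingFace transformers library the same pretrained weights can be wrapped with different task heads; the sketch below shows sequence classification and sequence generation (the checkpoint names are common public ones, and the fine-tuning loop itself is omitted).

```python
from transformers import (
    BartTokenizer,
    BartForSequenceClassification,   # sequence classification head
    BartForConditionalGeneration,    # sequence generation, e.g. summarisation
)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")

# Sequence classification: the same input is fed to encoder and decoder, and a
# classification head sits on the final decoder token representation.
clf = BartForSequenceClassification.from_pretrained("facebook/bart-large", num_labels=2)
inputs = tokenizer("This film was surprisingly good.", return_tensors="pt")
logits = clf(**inputs).logits

# Sequence generation: load (or fine-tune) a seq2seq head and decode autoregressively.
summarizer = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
article = tokenizer("Long source document goes here ...", return_tensors="pt")
summary_ids = summarizer.generate(**article, max_length=60, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```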



