Dimsum

Official

GitHub - TysonYu/Laysumm: The code repository for the paper "Dimsum @LaySumm 20: BART-based Approach for Scientific Document Summarization".
https://github.com/TysonYu/Laysumm

Summary

Lay summarization aims to automatically generate lay summaries of scientific papers: summaries that are representative of the content, and comprehensible and interesting to a lay audience. Our approach is based on the BART model and leverages sentence labels as extra supervision signals to improve performance.

We observe that many sentences in lay summaries have corresponding sentences in the original papers. We therefore create binary sentence labels for extractive summarization and use them as extra supervision signals to help the model generate better summaries: the BART encoder produces sentence representations, and extractive summarization is trained jointly with abstractive summarization.

Architecture

The input document is fed into the bidirectional encoder, and the contextual embedding of the i-th [CLS] symbol is used as the representation of the i-th sentence. A feed-forward neural network maps each sentence representation to a binary distribution over whether the sentence belongs to the extractive summary. The abstractive summary is generated by the auto-regressive decoder. The overall loss is a linear combination of the cross-entropy losses of the abstractive and extractive summaries.
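A minimal PyTorch sketch of this joint objective, assuming HuggingFace Transformers; `ext_head`, `cls_positions`, and `lambda_ext` are illustrative names, not taken from the paper's released code:

```python
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration

model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
ext_head = nn.Linear(model.config.d_model, 1)  # feed-forward classifier over [CLS] states

def joint_loss(input_ids, attention_mask, labels, cls_positions, ext_labels,
               lambda_ext=1.0):
    # Abstractive branch: standard seq2seq cross-entropy from BART.
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    abs_loss = out.loss
    # Extractive branch: encoder states at each sentence's [CLS] position.
    enc_states = out.encoder_last_hidden_state               # (batch, seq_len, d_model)
    cls_states = enc_states.gather(
        1, cls_positions.unsqueeze(-1).expand(-1, -1, enc_states.size(-1)))
    logits = ext_head(cls_states).squeeze(-1)                # (batch, n_sents)
    ext_loss = nn.functional.binary_cross_entropy_with_logits(
        logits, ext_labels.float())
    # Overall loss: linear combination of the two cross-entropies.
    return abs_loss + lambda_ext * ext_loss
```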

Pre-Training Task

We first represent the document using the sentences in its Abstract, Introduction, and Conclusion. The first pre-processing step removes tags and outliers. We then truncate all input text to a maximum length of 1024 tokens, the input limit of the BART model.
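A sketch of this input construction, assuming a hypothetical `sections` dict keyed by section name; tag and outlier removal is omitted:

```python
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")

def build_input(sections, max_length=1024):
    # Represent the document by its Abstract, Introduction, and Conclusion,
    # truncated to BART's 1024-token input limit.
    text = " ".join(sections[name]
                    for name in ("abstract", "introduction", "conclusion")
                    if name in sections)
    return tokenizer(text, max_length=max_length, truncation=True,
                     return_tensors="pt")
```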

We use BART fine-tuned on the CNN/DailyMail dataset to initialize our model. We then use an unsupervised approach to convert the abstractive summaries into extractive labels and train both summarization tasks simultaneously.

To create the ground-truth sentence-level binary labels for extractive summarization (called ORACLE), we use a greedy algorithm based on the idea that the sentences selected from the input should be the ones that maximize the ROUGE score with the gold summary.
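A sketch of this greedy labeling using the `rouge-score` package; the exact ROUGE variants, the summary-length cap, and the stopping rule are assumptions here, not details from the paper:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=True)

def greedy_oracle(doc_sents, gold_summary, max_sents=3):
    # Iteratively add the sentence that most improves ROUGE against the gold summary.
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = []
        for i, sent in enumerate(doc_sents):
            if i in selected:
                continue
            cand = " ".join(doc_sents[j] for j in sorted(selected + [i]))
            s = scorer.score(gold_summary, cand)
            gains.append((s["rouge1"].fmeasure + s["rouge2"].fmeasure, i))
        if not gains:
            break
        score, i = max(gains)
        if score <= best:  # stop when no remaining sentence improves ROUGE
            break
        best, selected = score, selected + [i]
    # Binary labels: 1 if the sentence is in the oracle extract, else 0.
    return [1 if i in selected else 0 for i in range(len(doc_sents))]
```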

Datasets

The CL-LaySumm 2020 dataset and ScisummNet.

Laysumm/datasets at master · TysonYu/Laysumm
https://github.com/TysonYu/Laysumm/tree/master/datasets

Performance

Further Readings

Tiezheng Yu, Dan Su, Wenliang Dai, Pascale Fung · Dimsum @LaySumm 20: BART-based Approach for Scientific Document Summarization · SlidesLive
https://slideslive.com/38940741/dimsum-laysumm-20-bartbased-approach-for-scietific-document-summarization