đź“„

T5

Official

GitHub - google-research/text-to-text-transfer-transformer: Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
T5X is the new and improved implementation of T5 (and more) in JAX and Flax. T5 on Tensorflow with MeshTF is no longer actively developed. If you are new to T5, we recommend starting with T5X. The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer.
https://github.com/google-research/text-to-text-transfer-transformer
Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer
Over the past few years, transfer learning has led to a new wave of state-of-the-art results in natural language processing (NLP). Transfer learning's effectiveness comes from pre-training a model on abundantly-available unlabeled text data with a self-supervised task, such as language modeling or filling in missing words.
https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html

Summary

We explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Modern techniques for transfer learning in NLP often pre-train on abundant unlabeled data with an unsupervised objective. Casting everything into the same text-to-text format allows us to systematically compare different pre-training objectives, architectures, unlabeled datasets, and transfer approaches.

The basic idea underlying our work is to treat every text processing problem as a “text-to-text” problem. This framework allows us to directly apply the same model, objective, training procedure, and decoding process to every task, including machine translation, question answering, document summarization, and sentiment classification.
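As a concrete illustration of the text-to-text format, here are a few (input, target) pairs in the spirit of the paper's Figure 1; the exact strings are paraphrased for illustration, not copied from the dataset preprocessors:

```python
# Illustrative (input, target) pairs in the text-to-text format. The task
# prefixes follow the convention described in the paper; the example texts
# themselves are paraphrased.
examples = [
    # Translation: the target is the translated sentence.
    ("translate English to German: That is good.", "Das ist gut."),
    # Linguistic acceptability (CoLA): the class label is rendered as text.
    ("cola sentence: The course is jumping well.", "not acceptable"),
    # Semantic similarity (STS-B): even the regression target becomes a string.
    ("stsb sentence1: The rhino grazed on the grass. "
     "sentence2: A rhino is grazing in a field.", "3.8"),
    # Abstractive summarization: the target is the summary itself.
    ("summarize: state authorities dispatched emergency crews tuesday to "
     "survey the damage after an onslaught of severe weather in mississippi.",
     "six people hospitalized after a storm in attala county."),
]
```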

Early results on transfer learning for NLP leveraged recurrent neural networks, but it has recently become more common to use models based on the “Transformer” architecture. Due to its increasing ubiquity, all of the models we study are based on it.

The Annotated Transformer
There is now a new version of this blog post updated for modern PyTorch. The Transformer from "Attention is All You Need" has been on a lot of people's minds over the last year. Besides producing major improvements in translation quality, it provides a new architecture for many other NLP tasks.
http://nlp.seas.harvard.edu/2018/04/03/attention.html
The Illustrated Transformer
In the previous post, we looked at Attention - a ubiquitous method in modern deep learning models.
https://jalammar.github.io/illustrated-transformer/

Architecture

First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of “blocks”, each of which comprises two sub-components: a self-attention layer followed by a small feed-forward network. We use a simplified version of layer normalization in which the activations are only rescaled and no additive bias is applied, with normalization applied to the input of each sub-component. A residual skip connection adds each sub-component’s input to its output. Dropout is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack.
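A minimal NumPy sketch of this simplified layer normalization (rescaling only, with no mean subtraction and no additive bias) together with the residual/dropout wrapper around a sub-component; the function names and dropout handling are illustrative, not the T5 codebase's API:

```python
import numpy as np

def simplified_layer_norm(x, scale, eps=1e-6):
    # Rescale by the root mean square of the activations; no centering, no bias.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * scale

def residual_sublayer(x, sublayer, scale, dropout_rate=0.1, rng=None):
    # Normalize the input, apply the sub-component (self-attention or the small
    # feed-forward network), apply dropout, then add the skip connection.
    y = sublayer(simplified_layer_norm(x, scale))
    if rng is not None:  # dropout only during training
        keep = (rng.random(y.shape) >= dropout_rate) / (1.0 - dropout_rate)
        y = y * keep
    return x + y
```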

The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs.
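A small sketch of the causal mask this implies (illustrative, not the T5 implementation): position i in the decoder may only attend to positions j <= i.

```python
import numpy as np

def causal_mask(length):
    # Lower-triangular boolean matrix: True where attention is allowed.
    return np.tril(np.ones((length, length), dtype=bool))

def apply_causal_mask(attention_logits, mask):
    # Disallowed positions get a large negative value so the softmax assigns
    # them (effectively) zero weight.
    return np.where(mask, attention_logits, -1e9)
```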

The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent “heads” whose outputs are concatenated before being further processed.
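A minimal sketch of that weight sharing (sizes and names are illustrative): the same embedding matrix is used both for the input lookup and, transposed, as the final dense layer that produces vocabulary logits.

```python
import numpy as np

vocab_size, d_model = 32000, 512                 # illustrative sizes
embedding = 0.02 * np.random.randn(vocab_size, d_model)

def embed(token_ids):
    # Input embedding lookup.
    return embedding[token_ids]

def output_logits(decoder_states):
    # Shared weights: reuse the embedding matrix as the output projection.
    return decoder_states @ embedding.T          # (..., d_model) -> (..., vocab_size)
```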

Instead of using a fixed embedding for each position, relative position embeddings produce a different learned embedding according to the offset between the “key” and “query” being compared in the self-attention mechanism.

We use a simplified form of position embeddings where each “embedding” is simply a scalar that is added to the corresponding logit used for computing the attention weights. For efficiency, we also share the position embedding parameters across all layers in our model, though within a given layer each attention head uses a different learned position embedding.
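A simplified sketch of this scalar position bias (the T5 implementation buckets large offsets logarithmically; this sketch just clips them, and all names here are illustrative):

```python
import numpy as np

num_heads, max_distance = 8, 128
# One learned scalar per head and per clipped relative offset; in the model these
# parameters are shared across layers but differ between heads within a layer.
rel_bias = np.zeros((num_heads, 2 * max_distance + 1))

def position_bias(query_len, key_len):
    # offsets[i, j] = j - i, clipped to the supported range and shifted to >= 0.
    offsets = np.arange(key_len)[None, :] - np.arange(query_len)[:, None]
    offsets = np.clip(offsets, -max_distance, max_distance) + max_distance
    return rel_bias[:, offsets]   # shape: (num_heads, query_len, key_len)

# The bias is simply added to the raw attention logits before the softmax:
#   logits = scores + position_bias(query_len, key_len)
```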

Dataset (C4)

Common Crawl
We build and maintain an open repository of web crawl data that can be accessed and analyzed by anyone.
https://commoncrawl.org/

We used the following heuristics for cleaning up Common Crawl’s web-extracted text: retaining only lines that end in a terminal punctuation mark, discarding pages with too few sentences, removing pages containing offensive words, code-like curly braces, or “lorem ipsum” placeholder text, deduplicating repeated three-sentence spans, and filtering out pages not classified as English by langdetect with high probability (a rough sketch follows the links below).

langdetect
Port of Nakatani Shuyo's language-detection library (version from 03/03/2014) to Python. Install with pip install langdetect; supports Python 2.7 and 3.4+.
https://pypi.org/project/langdetect/
c4 | TensorFlow Datasets
Warning: Manual download required. A colossal, cleaned version of Common Crawl's web crawl corpus, based on the Common Crawl dataset (https://commoncrawl.org). To generate this dataset, follow the instructions from t5. Due to the overhead of cleaning the dataset, it is recommended that you prepare it with a distributed service like Cloud Dataflow.
https://www.tensorflow.org/datasets/catalog/c4
GitHub - jcpeterson/openwebtext: Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
Open clone of OpenAI's unreleased WebText dataset scraper. This version uses pushshift.io files instead of the API for speed.
https://github.com/jcpeterson/openwebtext
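A rough, illustrative sketch of a few of these heuristics (terminal-punctuation line filtering, a minimum page length, removal of code-like or placeholder pages, and English filtering with langdetect); the thresholds follow the paper's description but should be treated as assumptions, not a faithful reimplementation of the C4 pipeline:

```python
import re
from langdetect import detect_langs  # pip install langdetect

TERMINAL = ('.', '!', '?', '"')

def clean_page(text, min_sentences=5, min_words=3):
    # Keep only lines that end in terminal punctuation, have enough words,
    # and do not look like "enable Javascript" boilerplate.
    lines = [line.strip() for line in text.splitlines()]
    kept = [line for line in lines
            if line.endswith(TERMINAL)
            and len(line.split()) >= min_words
            and 'javascript' not in line.lower()]
    page = '\n'.join(kept)
    # Drop pages that are too short, contain code-like braces, or placeholder text.
    if len(re.findall(r'[.!?]', page)) < min_sentences:
        return None
    if '{' in page or 'lorem ipsum' in page.lower():
        return None
    # Keep only pages classified as English with high confidence.
    try:
        langs = {result.lang: result.prob for result in detect_langs(page)}
    except Exception:
        return None
    if langs.get('en', 0.0) < 0.99:
        return None
    return page
```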

Downstream Tasks

We measure performance on the GLUE and SuperGLUE text classification meta-benchmarks:

SuperGLUE Benchmark
SuperGLUE is a new benchmark styled after the original GLUE benchmark, with a set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
https://super.gluebenchmark.com/

Pre-Training

The model is trained with a maximum likelihood objective (using “teacher forcing”) regardless of the task. To specify which task the model should perform, we add a task-specific prefix to the original input sequence before feeding it to the model.

Note that the choice of text prefix used for a given task is essentially a hyperparameter. We found that changing the exact wording of the prefix had limited impact, so we did not perform extensive experiments with different prefix choices. We allow for separately fine-tuning the model on each individual task and use short task prefixes instead of an explicit question-answer format.
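A minimal NumPy sketch of the maximum-likelihood objective with teacher forcing: the decoder is fed the ground-truth target shifted right by one position, and the loss is the cross-entropy of the predicted distribution against the true next token. The model callable and token ids here are placeholders, not T5 API calls.

```python
import numpy as np

def teacher_forcing_loss(model, input_ids, target_ids, start_id=0):
    # Decoder input is the target shifted right, starting from a start/pad token,
    # so the model always conditions on the ground-truth prefix during training.
    decoder_input = np.concatenate([[start_id], target_ids[:-1]])
    logits = model(input_ids, decoder_input)            # (target_len, vocab_size)
    # Log-softmax (shift by the max for numerical stability).
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Negative log-likelihood of each ground-truth token, averaged over the sequence.
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return nll.mean()
```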

We mainly consider models that explicitly process an input with an encoder before generating an output with a separate decoder and we focus on transfer learning rather than zero-shot learning. Our framework also allows for generative tasks like machine translation and abstractive summarization, where it is not possible to enumerate all possible output choices.

During pre-training, we use an “inverse square root” learning rate schedule: 1 / sqrt(max(n, k)), where n is the current training iteration and k is the number of warm-up steps (set to 10^4 in all of our experiments). This sets a constant learning rate of 0.01 for the first 10^4 steps, then exponentially decays the learning rate until pre-training is over.
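This schedule is simple enough to write down directly; a small sketch:

```python
import math

def inverse_sqrt_lr(step, warmup_steps=10_000):
    # Constant at 1 / sqrt(warmup_steps) = 0.01 for the first 10^4 steps,
    # then decays as 1 / sqrt(step).
    return 1.0 / math.sqrt(max(step, warmup_steps))

# e.g. inverse_sqrt_lr(1) == inverse_sqrt_lr(10_000) == 0.01
#      inverse_sqrt_lr(40_000) == 0.005
```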

Experiments

To provide a reasonable means of comparison, we consider multiple configurations for our encoder-decoder model. We will refer to the number of layers and parameters in a BERT_BASE-sized layer stack as L and P, respectively. We will use M to refer to the number of FLOPs required for an L + L-layer encoder-decoder model or an L-layer decoder-only model to process a given input-target pair.

In total, we will compare:

- An encoder-decoder model with L layers in the encoder and L layers in the decoder (2P parameters, M FLOPs).
- The same model but with parameters shared between the encoder and decoder (P parameters, M FLOPs).
- An encoder-decoder model with L/2 layers each in the encoder and decoder (P parameters, M/2 FLOPs).
- A decoder-only language model with L layers (P parameters, M FLOPs).
- A decoder-only prefix LM with the same architecture but fully-visible self-attention over the input (P parameters, M FLOPs).

Performance

Further Readings

Review - T5: Text-to-Text Transfer Transformer
A review by Sik-Ho Tsang of “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” (T5, Google, JMLR 2020, over 3000 citations): a unified framework that converts all text-based language problems into a text-to-text format.
https://sh-tsang.medium.com/review-t5-text-to-text-transfer-transformer-b3f0f3c07295
Understanding T5 Model : Text to Text Transfer Transformer Model
Recent years have seen a plethora of pre-trained models such as ULMFiT, BERT, GPT, etc being open-sourced to the NLP community. Given the size of such humungous models, it's nearly impossible to train such networks from scratch considering the amount of data and computation that is required.
https://towardsdatascience.com/understanding-t5-model-text-to-text-transfer-transformer-model-69ce4c165023
bert/multilingual.md at master · google-research/bert
There are two multilingual models currently available.
https://github.com/google-research/bert/blob/master/multilingual.md
Comparative analysis of T5 model for abstractive text summarization on different datasets
Tawmo T et al., National Institute of Technology (NIT) Silchar; posted 3 May 2022. Abstractive text summarization is a burgeoning natural language processing task that has seen success with the Transformer model.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4096413