By Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu and Jiebo Luo
Best Long Paper
Grammar induction aims to capture syntactic information in sentences in the form of constituency parse trees. Unsupervised grammar induction provides evidence for statistical learning because it builds minimal assumptions about linguistic knowledge into the models.
There are many texts and videos/images on social media, and images have been shown to help induce syntactic structure. In image-aided unsupervised grammar induction, the model exploits regularities between text spans and images. Videos not only depict static objects; they also show actions and dynamic interactions between objects, and can therefore represent verb phrases.
The paper is inspired by Compound PCFG, in which a grammar inducer produces the parse chart and the model is optimized on the marginal likelihood of the sentence. Visually Grounded PCFG (VC-PCFG) additionally considers image-sentence matching during training. However, simply replacing images with videos is not trivial, because of the multimodality of videos and the need for temporal modeling.

Image taken from the presentation
The baseline model created by the authors combines VC-PCFG with object features. The authors also compare this baseline against VC-PCFG augmented with action, scene, audio, OCR, face, and speech features. In the final model (MMC-PCFG), all of the aforementioned features are extracted and combined with a multi-modal Transformer. The paper shows that MMC-PCFG consistently outperforms all the other baselines on three different datasets.
By Simone Conia, Andrea Bacciu and Roberto Navigli
Outstanding Long Paper
Semantic role labeling (SRL) is the task of automatically addressing “who did what to whom, where, when, and how?” SRL includes predicate identification and disambiguation, argument identification, and argument classification. SRL predicate-argument structure inventories are language-specific, and labeling semantics across languages is expensive and requires human experts in the desired languages. Hence, there is a need to exploit heterogeneous linguistic resources. The intuition behind this idea is that semantic relations may be deeply rooted beyond their language-specific realization, so the authors build a model that can learn from many inventories for deeper sentence-level semantics.
The model architecture (illustrated below) can be roughly divided into the following components:
• A universal sentence encoder whose parameters are shared across languages and which produces word encodings that capture predicate-related information
• A universal predicate-argument encoder whose parameters are also shared across languages and which models predicate-argument relations
• A set of language-specific decoders which indicate whether words are predicates, select the most appropriate sense for each predicate, and assign a semantic role to every predicate-argument pair, according to several different SRL inventories

Image taken from the presentation
By Timo Schick and Hinrich Schütze
Outstanding Long Paper
In GPT-3 priming, the model is given a few demonstrations of inputs and corresponding outputs as context for its predictions, but no gradient updates are performed. Using priming, GPT-3 has shown amazing few-shot abilities. However, priming requires gigantic language models, and it does not scale to more than a few examples because the context window is limited to a few hundred tokens.
An alternative approach is pattern-exploiting training (PET), which combines the idea of reformulating tasks as cloze questions with regular gradient-based fine-tuning. More formally, consider the task of mapping inputs x ∈ X to outputs y ∈ Y. PET requires a set of pattern-verbalizer pairs (PVPs), where each PVP p = (P, v) consists of:
• a pattern P : X → T* that maps inputs to cloze questions containing a single mask;
• a verbalizer v : Y → T that maps each output to a single token representing its task-specific meaning in the pattern
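To make the PVP formalism concrete, here is a minimal sketch of one pattern-verbalizer pair for binary sentiment classification (the pattern text and verbalizer tokens are illustrative, not taken from the paper):

```python
# A minimal pattern-verbalizer pair (PVP) in the spirit of PET.
# The pattern maps an input review to a cloze question with a single
# [MASK]; the verbalizer maps each label to one token that fills it.

def pattern(x: str) -> str:
    # P : X -> T*, a cloze question containing a single mask token
    return f"{x} It was [MASK]."

def verbalizer(y: str) -> str:
    # v : Y -> T, one token per label
    return {"positive": "great", "negative": "terrible"}[y]

cloze = pattern("The pizza was cold and bland.")
target = verbalizer("negative")
# The MLM is then fine-tuned so that p([MASK] = target | cloze) is high.
```

At inference time, the label whose verbalizer token gets the highest masked-LM probability is predicted.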
By adopting PET, the authors show that even a small model such as ALBERT can outperform GPT-3 on SuperGLUE. Through a series of ablation studies, they show that patterns and verbalizers contribute significantly to the performance of the model. One can conclude that PET works well with small language models and scales to more examples than fit into the context window. However, compared to priming, PET has the disadvantage of requiring fine-tuning multiple models per task, and it doesn't work for generative tasks.
By Guanghui Qin and Jason Eisner
Best Short Paper
In few-shot settings, we can extract factual knowledge from a model's training corpora by prompting the model. In this paper, the authors show that the choice of prompt is important for model performance and explore the idea of learning prompts by gradient descent, either fine-tuning prompts taken from previous work or starting from random initialization. This is feasible because, from the perspective of the LM, words are just continuous vectors. Hence the prompts used consist of "soft words," i.e., continuous vectors that are not necessarily word-type embeddings from the language model. These are called soft prompts, and their benefit (compared to hard prompts built from actual words) is that they are easy to search with backprop and give access to a much larger space of prompts.
In addition to soft prompting, for each task the authors optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. They show that across multiple English LMs and tasks, their approach hugely outperforms previous methods, suggesting that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

Image taken from the presentation
By Teven Le Scao and Alexander Rush
Outstanding Short Paper
When fine-tuning pre-trained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. In this paper, the authors compare prompted and head-based fine-tuning under equal conditions across many tasks and data sizes. They show that prompting is often worth hundreds of data points on average across classification tasks.
To compare heads vs prompts, they consider two transfer learning settings for text classification: head-based, where a generic head layer takes in pretrained representations to predict an output class; prompt-based, where a task-specific pattern string is designed to coax the model into producing a textual output corresponding to a given class. Both can be utilized for fine-tuning with supervised training data but prompts further allow the user to customize patterns to help the model.
For the prompt model, the authors follow the notation from PET (described earlier in this post) and decompose a prompt into a pattern and a verbalizer. The pattern turns the input text into a cloze task, i.e. a sequence with a masked token or tokens that need to be filled. In order to measure the effectiveness of prompting, the authors introduce a metric, the average data advantage.
Tzu-Hsiang Lin, Yipeng Shi, Chentao Ye, Yang Fan, Weitong Ruan, Emre Barut, Wael Hamza, Chengwei Su
There are many domains in dialogue systems, and depending on the domain, the relevant span of context can vary from minutes to hours. This paper uses temporal representations that combine time-difference (in seconds) and turn-order offset information to utilize both recent and distant context in various encoder architectures. More specifically, the previous nine turns of context within a few days are included in the model. To determine which temporal information (turn order vs. time difference) contributes to the model, the authors consider a time mask, a turn embedding, turn embedding over time mask, time mask over turn embedding, and time and turn embedding together. The model is shown below.

Image taken from the paper
Time mask feeds the time difference into a two-layer network followed by a sigmoid to produce a masking vector. Turn embedding projects the turn difference into a fixed-size embedding vector. Turn embedding over time mask adds turn-order information on top of the seconds signal (assuming turn order is more important), and, symmetrically, time mask over turn embedding adds the seconds signal on top of turn order (assuming time is more important). Finally, time and turn embedding combines both signals.
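The time-mask component can be sketched as follows (a minimal numpy sketch of the idea; the hidden size and random parameters are illustrative, and in the real model they would be trained jointly with the dialogue encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

# Parameters of the 2-layer network (randomly initialized here)
W1, b1 = rng.normal(size=(d, 1)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mask(delta_seconds: float) -> np.ndarray:
    # 2-layer network over the time difference, squashed to (0, 1)
    h = np.tanh(W1 @ np.array([delta_seconds]) + b1)
    return sigmoid(W2 @ h + b2)

turn_repr = rng.normal(size=d)        # encoding of a previous turn
masked = time_mask(30.0) * turn_repr  # gate the turn by its recency
```

Because the mask values lie in (0, 1), the network can softly down-weight turns whose time difference makes them less relevant.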
By Yejin Choi
In this talk, Yejin mainly talks about abduction and counterfactual reasoning and mentions that although they’re different, both involve nonmonotonic reasoning with past context X and future constraint Z.

Picture taken from the slides
Language models such as GPT-2 are only good at conditioning on the past; we can incorporate a future constraint by treating both past and future as the "past," with a special token in between. But this doesn't generalize well to out-of-domain distributions. So, for abduction, they propose using back-propagation as an inference-time algorithm rather than a training-time-only one. For counterfactual reasoning the same approach works; only the loss function needs to change, to a KL divergence. The image below makes this clearer.

Image taken from the slides
Knowledge and reasoning:
Off-the-shelf language models are not equivalent to knowledge models, and we need to build knowledge models if we want to do reasoning. COMET is a commonsense knowledge model trained on a commonsense knowledge graph with 1.33M if-then inferences over 23 relations. COMET is at least 400 times smaller than GPT-3, yet it outperforms GPT-3 on commonsense reasoning and generalizes well to out-of-domain examples.
Key takeaway message: a lot of commonsense reasoning might require the full scope of language; the distinction between knowledge and reasoning is blurry; and we need natural language for reasoning, since there is no way to represent the complexity of commonsense in clean-cut logical forms.
Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur
They examined whether curriculum learning works or not. To do that, they first define a notion of difficulty for each example: its learned iteration, the iteration at which the sample is classified correctly and also remains correctly classified in all later iterations. They sort the data by learned iteration, define a window of size k, and then compare standard training with curriculum learning, anti-curriculum learning, and random ordering (the data is shuffled, but a sample is drawn from each window of size k, one window at a time). They observed:
• Curricula achieve (almost) no improvement in the standard setting: curriculum, random, and anti-curriculum ordering perform almost equally well.
• Curriculum learning improves over standard training when training time is limited. Imitating the large-data regime, where training for multiple epochs is not feasible, they limit the number of training iterations and compare curriculum, random, and anti-curriculum ordering against standard training; the experiments reveal a clear advantage for curriculum learning.
• Curriculum learning improves over standard training in a noisy regime. They mimic noisy data by adding label noise to CIFAR-100; the experiments again indicate a clear advantage for curriculum learning over the other curricula and standard training.
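The ordering scheme described above can be sketched as follows (a simplified sketch of my own, assuming `learned_iteration` scores are already available for every sample):

```python
import random

def curriculum_order(samples, learned_iteration, k, anti=False):
    # Sort samples by learned iteration (easiest first; reverse the sort
    # for anti-curriculum), then emit them window by window of size k.
    # Shuffling within a window mimics drawing samples from the current
    # window one by one.
    ordered = sorted(samples, key=learned_iteration, reverse=anti)
    windows = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    out = []
    for window in windows:
        random.shuffle(window)  # order within a window is free
        out.extend(window)
    return out
```

The "random" baseline corresponds to shuffling the data first and then applying the same windowed sampling.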
Authors: Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller
Traditional attention-based models are not scalable, since attention has quadratic time and space complexity. This paper introduces the Performer, an efficient attention-based model. The Performer provides linear space and time complexity without any assumptions (such as sparsity or low-rankness). To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which approximates the standard attention weights. FAVOR+ is fully compatible with regular Transformers, and the authors show theoretically that it guarantees unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance.

Image taken from the paper
See the paper for the details of how FAVOR+ works, including how it decomposes each attention matrix in a batch into random-feature products of the query and key matrices Q and K, and the properties that follow.
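The core trick can be illustrated in a few lines of numpy (a simplified sketch: it uses plain Gaussian random features and omits the orthogonalization and redrawing schemes of the actual FAVOR+ mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_random_features(X, W):
    # phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m): positive random
    # features whose inner products approximate the softmax kernel
    m = W.shape[0]
    return np.exp(X @ W.T - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

def favor_attention(Q, K, V, m=128):
    d = Q.shape[-1]
    W = rng.normal(size=(m, d))  # (orthogonalization omitted for brevity)
    Qf = positive_random_features(Q / d ** 0.25, W)
    Kf = positive_random_features(K / d ** 0.25, W)
    # Linear-complexity order of operations: compute K^T V first, so the
    # N x N attention matrix is never materialized.
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(0)
    return num / den[:, None]
```

Because the features are positive, the implicit attention weights are positive and normalized, just like softmax attention, and the cost is linear in sequence length.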
Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi
To learn deeper and wider representations, linear layers are used, which allow learning global representations. To improve efficiency, the authors propose dividing the input into groups and applying a linear transformation to each group independently. This reduces the number of parameters, but the model then only learns local representations within each group. To allow global representation learning, the authors introduce feature shuffling: after applying one linear transformation to each group independently, the features are shuffled, and then another group linear transformation is applied. In this way the model learns both global and local representations. In addition to feature shuffling, the authors use an input mixer connection to avoid the vanishing gradient problem, improve training stability, and improve performance with increasing depth (without the input mixer connection, the authors observe the model gets worse as depth increases).

Image taken from the paper
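Group linear transformation plus feature shuffling can be sketched as follows (my own minimal numpy sketch; the sizes are illustrative, and the shuffle is the ShuffleNet-style interleaving):

```python
import numpy as np

rng = np.random.default_rng(0)

def group_linear(x, weights):
    # Split features into g groups and apply an independent linear
    # transform to each: g * (d/g)^2 parameters instead of d^2.
    groups = np.split(x, len(weights))
    return np.concatenate([w @ part for part, w in zip(groups, weights)])

def feature_shuffle(x, g):
    # Interleave features across groups so the next group linear layer
    # mixes information between groups (global representation).
    return x.reshape(g, -1).T.reshape(-1)

d, g = 12, 3
weights = [rng.normal(size=(d // g, d // g)) for _ in range(g)]
x = rng.normal(size=d)
y = feature_shuffle(group_linear(x, weights), g)
```

With g = 3 and d = 12, each layer uses 3 x 16 = 48 weights instead of 144, and the shuffle guarantees every output group sees features from every input group after two layers.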
Authors: Anna Golubeva, Guy Gur-Ari, Behnam Neyshabur
They showed that when fixing the number of parameters while increasing the width of the model, performance improves. More specifically, they observed that for ImageNet, increasing the width (by setting some weights to zero using a randomly chosen mask) leads to almost identical performance as when the number of weights is allowed to increase along with the width. To understand the observed effect theoretically, the authors study a simplified model and show that the improved performance of a wider, sparse network is correlated with a reduced distance between its Gaussian Process kernel and that of an infinitely wide network. They propose that this reduced kernel distance may explain the observed effect.
Authors: Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi
They propose a new framework, the deep bootstrap, to study generalization in deep learning. In the real world, SGD takes steps on the empirical loss; in the ideal world, SGD takes steps on the population loss (and hence train loss = test loss). The ideal world is one with infinite labeled data: as the number of steps increases, training always sees new data points, and no sample is visited more than once. They show that the generalization of models is largely determined by their optimization speed in online and offline learning. They claim the real world behaves like the ideal world as long as the ideal world hasn't converged yet, and that good models and training procedures are those which (1) optimize quickly in the ideal world and (2) do not optimize too quickly in the real world.
Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham
Instead of masking tokens randomly in Transformers, which is suboptimal, the authors propose PMI-Masking, based on the concept of Pointwise Mutual Information (PMI): a token n-gram is jointly masked if it exhibits high collocation over the corpus. They showed that (1) PMI-Masking dramatically accelerates training, matching the end-of-pretraining performance of existing approaches in roughly half of the training time; and (2) PMI-Masking improves upon previous masking approaches at the end of pretraining.
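The collocation score at the heart of this idea is easy to compute; here is a toy sketch of bigram PMI over a miniature corpus (the corpus and scoring granularity are illustrative, not the paper's n-gram extension):

```python
import math
from collections import Counter

corpus = "new york is a city new york is big a city is big".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    # PMI(x, y) = log p(x, y) / (p(x) p(y)); high values mean the pair
    # co-occurs far more often than chance, i.e. a strong collocation.
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y))
```

A PMI-style masking scheme would then mask high-scoring spans like "new york" as a unit rather than token by token.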
For example, in the paper How Does BERT Answer Questions?, a BERT model is trained on Question Answering (QA), since QA is a task that requires combining simpler tasks, such as coreference resolution and relation modeling, to arrive at the correct answer. After the model is trained, a layer-wise visualization of token representations is provided, which reveals information about the internal state of the Transformer network.
In the figure below, you can see an overview of the BERT architecture and the probing setup. The hidden states of each layer are fed as input to a set of probing tasks to examine the encoded information.
Picture taken from the paper, How Does BERT Answer Questions?
To visualize the token representations in each layer, the authors use dimensionality reduction plus k-means clustering. For dimensionality reduction, they apply t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Independent Component Analysis (ICA) to the vectors in each layer; for clustering, they choose the number of clusters k according to the number of clusters observed in PCA. The figure below shows the visualization of each layer of a BERT model trained on the SQuAD dataset. As illustrated, the early layers of BERT-based models group tokens into topical clusters; the resulting vector spaces are similar in nature to embedding spaces from e.g. Word2Vec and hold little task-specific information. These initial layers therefore reach low accuracy on semantic probing tasks, and BERT's early layers can be seen as an implicit replacement of the embedding layers common in neural network architectures; they encode more local syntax rather than more complex semantics. In the middle layers of the observed networks, the clusters of entities are less connected by topical similarity and more by their relation within a certain input context. These task-specific clusters appear to already filter question-relevant entities: figure (b) below shows a cluster with words like countries, schools, detention, and country names, in which 'detention' is a common practice in schools. This cluster helps to solve the question "What is a common punishment in the UK and Ireland?".

In short, the authors observe that the model's ability to recognize entities (named entity labeling), to identify their mentions (coreference resolution), and to find relations (relation recognition) improves toward the higher network layers. The figure below visualizes these abilities: information about named entities is learned first, whereas recognizing coreferences or relations is harder and requires input from additional layers before the model's performance peaks.

References
Paper from Facebook AI
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis

In this paper, the authors try to ground natural language in the reality of our world instead of relying on MLM. The problem with next-token prediction and MLM objectives is that they focus only on linguistic form: they learn the characteristics of coherent language without necessarily associating meaning with it.
MARGE is a pre-trained sequence-to-sequence model learned with an unsupervised multilingual multi-document paraphrasing objective. During pre-training, the input to the model is a batch of evidence documents z1..M and target documents x1..N. The model is trained to maximize the likelihood of the targets, conditioned on the evidence documents and the relevance of each evidence document to each target:
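The equation itself was lost in extraction; a rough reconstruction from the description above (notation mine, not verbatim from the paper):

```latex
% Sketch: maximize the likelihood of each target x_i given the evidence
% documents z_{1..M}, weighted by learned relevance scores f(x_i, z_j).
\mathcal{L} \;=\; -\sum_{i=1}^{N} \log\, p_\theta\!\left(x_i \,\middle|\, z_{1..M},\; f(x_i, z_1), \ldots, f(x_i, z_M)\right)
```

The relevance scores f both retrieve useful evidence batches and bias the model's attention toward the most relevant evidence documents.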
A truly remarkable outcome is that MARGE can perform decent zero-shot machine translation, that is, without any fine-tuning on parallel data.
Paper by PolyAI

They used shared parameters, quantization, and fewer layers, but a very long input sequence (needed for conversation), reducing the number of parameters by an order of magnitude.
Paper from Facebook AI and Stanford
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis

The main contribution is an improvement for downstream tasks that need factuality, like QA. The paper introduces kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next-word distribution with a k-nearest neighbors (kNN) model. The nearest neighbors are computed according to distance in the pre-trained embedding space and can be drawn from any text collection, including the original LM training data. This approach allows rare patterns to be memorized explicitly, rather than implicitly in model parameters. It also improves performance when the same training data is used for learning the prefix representations and the kNN model, strongly suggesting that the prediction problem is more challenging than previously appreciated.
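The interpolation can be sketched in a few lines (a simplified sketch: the datastore, distance-based weighting, and the interpolation weight lambda follow the paper's recipe, but the names and toy sizes are mine):

```python
import numpy as np

def knn_lm_next_word(p_lm, datastore_keys, datastore_next, query,
                     k=3, lam=0.25, vocab=4):
    # kNN-LM: p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x), where
    # p_kNN is built from the k nearest stored (prefix -> next word)
    # pairs in the pre-trained embedding space.
    d2 = ((datastore_keys - query) ** 2).sum(-1)
    nearest = np.argsort(d2)[:k]
    weights = np.exp(-d2[nearest])  # closer prefixes count more
    p_knn = np.zeros(vocab)
    for idx, w in zip(nearest, weights):
        p_knn[datastore_next[idx]] += w
    p_knn /= p_knn.sum()
    return lam * p_knn + (1 - lam) * p_lm
```

The datastore is just a list of (prefix embedding, observed next word) pairs built by one forward pass over any corpus; no retraining of the LM is needed.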
Paper from Google

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself. Findings:
Paper from Technical University Darmstadt, New York University, CIFAR, University of Cambridge, DeepMind

AdapterHub enables you to perform transfer learning of generalized pre-trained transformers such as BERT, RoBERTa, and XLM-R to downstream tasks such as question answering, classification, etc. using adapters instead of fine-tuning. Adapters serve the same purpose as fine-tuning but do it by stitching new layers into the main pre-trained model and updating the weights Φ of these new layers, whilst freezing the weights θ of the pre-trained model. As you might imagine, this makes adapters much more efficient, both in terms of time and storage, compared to fine-tuning. Adapters have also been shown to match the performance of state-of-the-art fine-tuning methods!
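The adapter module itself is tiny; here is a conceptual numpy sketch of a Houlsby-style bottleneck adapter (sizes and initialization are illustrative, not AdapterHub's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, bottleneck = 16, 4  # illustrative sizes; bottleneck << d

class Adapter:
    # Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    # residual connection. Only these small matrices (the weights Phi)
    # are trained; the pre-trained weights theta stay frozen.
    def __init__(self):
        self.W_down = rng.normal(scale=0.02, size=(bottleneck, d))
        self.W_up = rng.normal(scale=0.02, size=(d, bottleneck))

    def __call__(self, h):
        z = np.maximum(0.0, self.W_down @ h)  # ReLU
        return h + self.W_up @ z              # near-identity at init
```

With d = 768 and a bottleneck of 64, each adapter adds roughly 2 * 768 * 64 ≈ 100k parameters per layer, a small fraction of full fine-tuning.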
Here is also my Colab notebook on fine-tuning a T5 model for a summarization task using Transformers + PyTorch Lightning
There are two general approaches to text summarization: extractive and abstractive.
Prior to the hype of deep learning, TextRank and LexRank were two popular methods for extractive summarization: TextRank was mainly used for single documents and LexRank for multi-document summarization. Both create a graph of sentences and run the PageRank algorithm to get the most important sentences (essentially the centroids of the graph). The edges between sentences are based on semantic similarity and content overlap: LexRank uses the cosine similarity of TF-IDF vectors of sentences, while TextRank uses the number of words two sentences have in common, normalized by sentence length.
LexRank uses the sentence scores from the PageRank algorithm as a feature in a larger system alongside other features such as sentence position and sentence length. source
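The shared core of both methods, PageRank over a sentence-similarity graph, fits in a few lines (my own sketch: power iteration over a similarity matrix, with a TextRank-style overlap edge weight; the +1 inside the logs is mine, to avoid log 0 on one-word sentences):

```python
import numpy as np

def pagerank_sentences(sim, d=0.85, iters=50):
    # Power iteration of PageRank over a sentence-similarity matrix;
    # a higher score means a more "central" sentence.
    n = sim.shape[0]
    M = sim / sim.sum(axis=0, keepdims=True)  # column-normalize
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)
    return r

def word_overlap(s1, s2):
    # TextRank-style edge weight: shared words, length-normalized
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / (np.log(len(a) + 1) + np.log(len(b) + 1))
```

The top-ranked sentences are then extracted, in document order, as the summary.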
When training a model for summarization, one can use cross-entropy (as in a language modeling task) to train the model. The offline evaluation metrics are poorly correlated with human judgment and ignore important aspects such as factual correctness. The offline evaluation metrics can be categorized into the following groups.
ROUGE: A Package for Automatic Evaluation of Summaries: ROUGE is the standard automatic evaluation measure for summarization tasks.
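At its core, ROUGE-1 is just unigram overlap; a minimal sketch of how it is computed (simplified: no stemming or stopword handling, which real ROUGE implementations offer as options):

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    # ROUGE-1: unigram overlap between candidate and reference summary,
    # reported as recall, precision, and F1.
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped counts, as in ROUGE
    recall = overlap / sum(r.values())
    precision = overlap / sum(c.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

ROUGE-2 and ROUGE-L follow the same scheme with bigrams and longest common subsequences, respectively.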
Better Summarization Evaluation with Word Embeddings for ROUGE: instead of hard lexical matching of bigrams, ROUGE-WE uses soft matching based on the cosine similarity of word embeddings.
A Graph-theoretic Summary Evaluation for ROUGE: combines lexical and semantic matching by applying graph analysis algorithms to the WordNet semantic network.
ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks: to address ROUGE's problems, the authors propose metrics that leverage synonym dictionaries, such as WordNet, and consider all synonyms of matched words when computing token overlap (the synonym dictionary is customizable by application and domain).
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments: based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstractive Summarization: given that contextualized word representations (such as ELMo, BERT, GPT) have shown a powerful capacity for reflecting distributional semantics, the authors propose using a distributional semantic reward to boost reinforcement-learning-based abstractive summarization systems.
BERTScore: Evaluating Text Generation with BERT
BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, it computes token similarity using contextual embeddings. In other words, BERTScore performs sentence-level generation evaluation by using pre-trained BERT contextualized embeddings to compute the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens. BERTScore has a higher correlation with human evaluation on text generation tasks compared to existing evaluation metrics.
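The matching scheme can be sketched directly on token embeddings (a simplified sketch: it omits BERTScore's optional IDF weighting and baseline rescaling, and takes pre-computed embeddings as input):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    # Pairwise cosine similarity between every candidate and reference
    # token embedding; each token is greedily matched to its most
    # similar counterpart in the other sentence.
    def norm(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    S = norm(cand_emb) @ norm(ref_emb).T  # cosine matrix
    recall = S.max(axis=0).mean()         # best match per reference token
    precision = S.max(axis=1).mean()      # best match per candidate token
    return 2 * precision * recall / (precision + recall)
```

In the real metric, `cand_emb` and `ref_emb` come from a BERT forward pass over each sentence, so the same word gets different vectors in different contexts.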
Paper from Google

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
Findings:
The authors designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization. The hypothesis is that the closer the pre-training self-supervised objective is to the final downstream task, the better the fine-tuning performance.
In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
The authors found that choosing "important" sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. They automatically identified these sentences by finding those that were most similar to the rest of the document according to the ROUGE metric. Similar to T5, the model is pre-trained on a very large corpus of web-crawled documents and then fine-tuned on 12 public downstream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptable to a wide variety of topics.
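The gap-sentence selection step can be sketched as follows (my own simplified sketch: it scores sentences with plain ROUGE-1 F1 against the rest of the document and masks the top-m, ignoring PEGASUS's sequential selection variants):

```python
from collections import Counter

def rouge1_f(a: str, b: str) -> float:
    ca, cb = Counter(a.split()), Counter(b.split())
    o = sum((ca & cb).values())
    if o == 0:
        return 0.0
    p, r = o / sum(ca.values()), o / sum(cb.values())
    return 2 * p * r / (p + r)

def select_gap_sentences(sentences, m=1):
    # Score each sentence by ROUGE-1 F1 against the rest of the document;
    # mask the top-m "important" sentences, which become the generation
    # target (so pre-training resembles summarization).
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((rouge1_f(s, rest), i))
    top = {i for _, i in sorted(scores, reverse=True)[:m]}
    masked = ["[MASK]" if i in top else s for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in sorted(top))
    return masked, target
```

The masked document is the encoder input and the concatenated removed sentences are the decoder target, exactly mirroring the fine-tuning setup.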
AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
Authors: Xinsong Zhang, Hang Li
ByteDance AI Lab
Year: August 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Year: July 2020
Pre-training via Paraphrasing (MARGE)
Authors: Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer
Year: June 2020
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
Google and Stanford
Year: March 2020
Generalization through Memorization: Nearest Neighbor Language Models
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
Facebook and Stanford
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis
Year: Feb 2020
ConveRT: Efficient and Accurate Conversational Representations from Transformers
Authors: Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Tsung-Hsien Wen, Ivan Vulić
PolyAI
Year: Nov 2019
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Authors: Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer
Year: October 2019
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Authors: Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Google and Toyota Technological Institute
Year: September 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
UW and Facebook
Year: July 2019
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
CMU and Google
Year: June 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google
Year: May 2019
Cross-lingual Language Model Pretraining
Authors: Guillaume Lample, Alexis Conneau
Year: January 2019
Improving Language Understanding by Generative Pre-Training
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
OpenAI
Year: June 2018
Deep contextualized word representations (ELMo)
Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer
Allen Institute for Artificial Intelligence and UW
Year: March 2018
Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Year: Dec 2017
Paper from Google

The transformer era began with this paper from Google. The architecture consists of an encoder and a decoder block to solve machine translation.
Positional encoding: using sin and cos functions, the earlier dimensions have smaller wavelengths and capture short-range offsets, while the later dimensions capture longer-distance offsets. This blog does a great job of explaining positional encoding.
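The sinusoidal encoding from the paper is easy to compute directly; a minimal numpy sketch:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Early dimensions oscillate fast (short-range offsets); later
    # dimensions oscillate slowly (long-range offsets).
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

The matrix is simply added to the token embeddings before the first layer, giving the otherwise order-agnostic attention mechanism a sense of position.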
Transformer blocks are characterized by a multi-head self-attention mechanism, a position-wise feed-forward network, layer normalization modules, and residual connections. The input to the Transformer model is typically a tensor of shape B × N, where B is the batch size and N the sequence length.
The residual connections around the self-attention blocks are also efficient at carrying positional information up to the top layers. For a deeper understanding of transformers, I recommend reading the original paper, this blog post by Rémi Louf, and The Annotated Transformer by Alexander Rush. After “Attention Is All You Need”, BERT from Google and GPT from OpenAI were introduced, which I will explain later in this post.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper from Google

BERT is based on subword token encoding and a multi-layer transformer architecture. The transformer blocks are the same as the original blocks introduced in the “Attention Is All You Need” paper; however, BERT only uses the encoder part of the transformer architecture (which is why it is not suitable for text generation tasks).
BERT uses a huge corpus of data for pre-training the model on a self-supervised task, masked language modeling: tokens in the text are masked and the model should predict them. BERT selects 15% of the tokens. Of these selected tokens, 80% are actually replaced by [MASK], 10% are replaced by a random token, and 10% are left unchanged, and the model is expected to predict all of the selected tokens (the loss of all these predictions is backpropagated). See my colab notebook for the code of masking.
You may ask, why not mask all of the selected 15% of tokens? The reason is that if all of the selected tokens were masked, the model would only learn to represent masked positions and ignore the rest. When some tokens are replaced by a random token or left unchanged, the model needs to make a prediction for every single token, because it has no clue which ones were replaced by a random token and which are original, so it makes an effort to predict all the tokens and learn from all of them. After pretraining, the model can be fine-tuned on many language understanding tasks such as translation, NER, QA, and text classification.
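The 80/10/10 scheme above can be sketched in a few lines of Python (function and variable names are mine, not from the BERT codebase):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random vocabulary token, 10% stay unchanged. Returns the
    corrupted sequence and the target labels (None at unselected positions)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(tok)  # unchanged, but still predicted
        else:
            labels.append(None)        # no loss at this position
            corrupted.append(tok)
    return corrupted, labels
```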
One of the disadvantages of BERT is that BERT fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent.
Improving Language Understanding by Generative Pre-Training
Paper from OpenAI

All GPTs use only the transformer decoder (and not the encoder part). In GPT, the model is first pre-trained on a language modeling task (causal LM) and then fine-tuned on the final task. They found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. So the final objective is L3(C) = L2(C) + λ ∗ L1(C), in which L2 is the objective for labeled data and L1 is the objective for LM. Overall, larger datasets benefit from the auxiliary objective ( λ ∗ L1(C) ) but smaller datasets do not.
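A minimal numpy sketch of the two points above (function names are mine): the causal, lower-triangular attention mask that makes a decoder-only model a left-to-right language model, and the combined fine-tuning objective L3(C) = L2(C) + λ ∗ L1(C):

```python
import numpy as np

def causal_mask(n):
    """Decoder-only (causal) attention mask: position i may attend only to
    positions j <= i, which is what makes the model a left-to-right LM."""
    return np.tril(np.ones((n, n), dtype=bool))

def finetune_objective(task_loss, lm_loss, lam=0.5):
    """GPT-1 fine-tuning objective L3 = L2 + λ·L1: the supervised task loss
    plus the auxiliary language-modeling loss, weighted by λ."""
    return task_loss + lam * lm_loss
```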
For training, they used a modified version of L2 regularization with w = 0.01 on all non-bias and non-gain weights. For the activation function, they used GELU. They also learned the position embeddings instead of using the sinusoidal version. For fine-tuning, 3 epochs were sufficient for most tasks, and they set λ = 0.5.
GPT1 was trained on BookCorpus (about 7K unpublished books); the context size (number of tokens) was 512.
GPT2 was trained on outbound links from Reddit (articles linked from Reddit) that received at least 3 karma (keeping only posts with a high number of upvotes). This gave 45M links; after filtering (removing Wikipedia pages, etc.) they ended up with 8M webpages, 40GB of text (around 10B tokens). The GPT2 model has 1.5B parameters (with 48 layers), the vocabulary size is 50,257, the context size is 1024 tokens (remember BERT's was 512), and the batch size is 512. GPT2 achieves SOTA on 7 out of 8 tested language modeling datasets.
Differences between GPT and GPT2: layer normalization was moved to the input of each sub-block, similar to a residual unit of type “building block” (unlike the original “bottleneck” type, which has batch normalization applied before the weight layers). An additional layer normalization was added after the final self-attention block. A modified initialization was used as a function of the model depth: the weights of residual layers were scaled at initialization by a factor of 1/√n, where n is the number of residual layers. GPT2 also uses a larger vocabulary and context size.
GPT2 data compression: tokens/parameters = 10B/1.5B = 6.66. This means that in GPT2 there is one parameter for every 6.66 tokens.
GPT3 has 175B parameters and is trained on 499B tokens (from Common Crawl, WebText2, Books1, Books2, and Wikipedia). The 175 billion parameters need 175 × 4 = 700GB of memory (a floating-point number needs 4 bytes).
GPT3 data compression: tokens/parameters = 499B/175B = 2.85. GPT3 has lower data compression compared to GPT2, which raises the question of whether, with this number of parameters, the model functions by memorizing the training data and pattern matching at inference.
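The back-of-envelope numbers from the last few paragraphs, as a quick sanity check (figures are the ones quoted in the text):

```python
# Parameters and training tokens quoted above, plus fp32 memory (4 bytes each).
gpt2_params, gpt2_tokens = 1.5e9, 10e9
gpt3_params, gpt3_tokens = 175e9, 499e9

gpt2_ratio = gpt2_tokens / gpt2_params   # ~6.67 tokens per parameter
gpt3_ratio = gpt3_tokens / gpt3_params   # ~2.85 tokens per parameter

bytes_per_fp32 = 4
gpt3_memory_gb = gpt3_params * bytes_per_fp32 / 1e9  # 700 GB
```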
GPT-3 shows that it is possible to improve the performance of a model by “simply” increasing the model size, and in consequence, the dataset size and the computation (TFLOP) the model consumes. However, as the performance increases, the model size has to increase more rapidly. Precisely, the model size varies as some power of the improvement of model performance. Remember language model performance scales as a power-law of model size, dataset size, and the amount of computation.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Paper from Carnegie Mellon and Google Research

Similar to BERT, XLNet uses BooksCorpus and English Wikipedia (13 GB of plain text). In addition, the authors include Giga5 (16 GB), ClueWeb (19 GB after filtering), and Common Crawl (110 GB after filtering) for pretraining. In total, they have 32.89B tokens.
XLNet-Large is similar to BERT-Large in model size, and they use sequences of 512 tokens. The authors observe that with a batch size of 8192 it took 5.5 days to train, and the model still underfits the data. XLNet achieves SOTA on 18 out of 20 NLP tasks.
XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL. Remember that the disadvantage of BERT is that it fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent. AR models eliminate this independence assumption (the shortcomings of BERT are also resolved in T5, with a less complicated approach than XLNet's).
XLNet has the advantages of both AR (auto-regressive) and AE (auto-encoding) models. Since an AR language model is only trained to encode a unidirectional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining. Instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., it captures bidirectional context.
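To make the permutation idea concrete, here is a small numpy sketch (illustrative only, not XLNet's actual two-stream implementation) that samples one factorization order and builds the attention mask it induces:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_mask(n):
    """Sample a factorization order and build the attention mask it induces:
    the token at position order[t] may attend to the tokens at positions
    order[0..t-1]. Averaged over many sampled permutations, every position
    ends up seeing context from both its left and its right."""
    order = rng.permutation(n)
    mask = np.zeros((n, n), dtype=bool)
    for t in range(n):
        mask[order[t], order[:t]] = True
    return order, mask

order, mask = permutation_mask(5)
```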
XLNet uses the following techniques: permutation language modeling, two-stream self-attention, and Transformer-XL's segment-level recurrence with relative positional encodings.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Paper from UW and Facebook AI
In the RoBERTa paper, the authors first performed an ablation study on how different parts of BERT affect end-task performance. They observed that a larger batch size increases performance, that NSP doesn't affect performance (sometimes removing NSP increases it), and, comparing static vs. dynamic masking, that dynamic masking improves performance. They gathered all these learnings and proposed RoBERTa (Robustly optimized BERT approach). Specifically, RoBERTa is trained with dynamic masking, FULL-SENTENCES without NSP loss, large mini-batches, and a larger byte-level BPE.
So in short, RoBERTa provides a better pre-trained model than BERT: the authors realized BERT was greatly undertrained, so they used 160GB of text instead of the 16GB dataset originally used to train BERT, and a batch size of 8K instead of 256 in the original BERT base model. They also removed the next sentence prediction objective from the training procedure, as they realized it does not help the model. RoBERTa matches XLNet models on the GLUE benchmark and sets a new state of the art on 4 of the 9 individual GLUE tasks. They also match SOTA on SQuAD and RACE.
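The static-vs-dynamic distinction can be sketched as follows (illustrative functions with names of my choosing, not RoBERTa's actual data pipeline): static masking fixes the masked positions once during preprocessing, while dynamic masking re-samples them every time the sequence is seen:

```python
import random

def sample_mask_positions(tokens, mask_prob=0.15):
    """Sample which positions to mask for one pass over the sequence."""
    return [i for i in range(len(tokens)) if random.random() < mask_prob]

def dynamic_masking(tokens, num_epochs, mask_prob=0.15):
    """Static masking (original BERT) calls sample_mask_positions once and
    reuses the result for every epoch; dynamic masking (RoBERTa) re-samples
    per epoch, so each epoch trains on a different corruption of the text."""
    return [sample_mask_positions(tokens, mask_prob) for _ in range(num_epochs)]
```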
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Paper from Google Research and Toyota Technological Institute
ALBERT incorporates two parameter-reduction techniques that lift the major obstacles in scaling pre-trained models: factorized embedding parameterization and cross-layer parameter sharing.
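One of ALBERT's parameter-reduction techniques is factorizing the embedding matrix: instead of a direct V × H embedding table, it uses a V × E table followed by an E × H projection, with E much smaller than H. A quick parameter count (sizes below are illustrative, not the exact ALBERT configuration) shows the savings:

```python
# Factorized embedding parameterization: V x H becomes V x E plus E x H.
V, H, E = 30000, 4096, 128   # vocab size, hidden size, embedding size

direct = V * H               # one big lookup table
factorized = V * E + E * H   # two much smaller matrices
```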
Inter-sentence coherence loss
To further improve the performance of ALBERT, they also introduce a self-supervised loss for sentence-order prediction (SOP). This forces the model to learn finer-grained distinctions about discourse-level coherence properties. SOP is a much more difficult task than NSP, and a model trained on SOP can solve the NSP task to a reasonable degree. Remember that in NSP the two (negative) sentences/segments differ in topic and are not coherent. In SOP, by contrast, only the order of the sentences is swapped: the topic stays the same but the segments are not coherent, which makes the task harder (topic prediction is easier to learn than coherence prediction, and it overlaps with what is already learned through the MLM loss).
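A minimal sketch of how SOP training pairs can be built (the function is mine, not from the ALBERT codebase): positives are two consecutive segments in their original order, negatives are the same two segments swapped, so the topic is identical and only coherence distinguishes the labels:

```python
import random

def sop_example(seg_a, seg_b):
    """Build one SOP instance from two consecutive segments:
    label 1 = original order (coherent), label 0 = swapped order.
    The topic is the same either way, forcing the model to learn coherence."""
    if random.random() < 0.5:
        return (seg_a, seg_b), 1
    return (seg_b, seg_a), 0
```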
ALBERT doesn't use dropout. ALBERT v2 throws light on the fact that many assumptions taken for granted are not necessarily true: the regularization effect of parameter sharing in ALBERT is so strong that dropout is not needed. (ALBERT v1 had dropout.)
Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Paper from Facebook AI

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture.
They evaluate a number of noising approaches, finding the best performance from both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where arbitrary-length spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks.
BART contains roughly 10% more parameters than the equivalently sized BERT model.
Pretraining noising functions:
Token Masking: random tokens are masked, as in BERT.
Token Deletion: random tokens are deleted from the input; the model must decide which positions are missing inputs.
Text Infilling: a number of text spans are sampled, and each span is replaced with a single [MASK] token.
Sentence Permutation: a document is divided into sentences based on full stops, and these sentences are shuffled in random order.
Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins with that token.
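Two of these noising functions are easy to sketch (illustrative Python, not the BART codebase; the paper draws infill span lengths from a Poisson with λ = 3, which I replace with a uniform draw to keep the sketch dependency-free):

```python
import random

def sentence_permutation(document):
    """Sentence Permutation: split on full stops and shuffle the sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def text_infilling(tokens, max_span=3, mask_token="[MASK]"):
    """Text Infilling: replace one sampled span (possibly of length 0) with a
    single [MASK] token, so the model must also predict how many tokens are
    missing."""
    tokens = list(tokens)
    length = random.randint(0, max_span)
    start = random.randint(0, max(0, len(tokens) - length))
    tokens[start:start + length] = [mask_token]
    return tokens
```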
AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
Paper by ByteDance

AMBERT proposes a simple twist to BERT: tokenize the input twice, once with a fine-grained tokenizer (subword or word level), and once with a coarse-grained tokenizer (phrase level).
The two inputs share a BERT encoder. A forward pass through the model consists of the following two steps:
1. Text tokens → token embeddings (via separate weights): each list of tokens (one fine-grained, one coarse-grained) is looked up in its own embedding matrix and turned into a list of real-valued vectors.
2. Token embeddings → contextual embeddings (via shared weights): the two lists of vectors are fed into the same BERT encoder (a stack of Transformer layers), either sequentially through a single encoder copy or in parallel through two encoder copies with tied parameters. This results in two lists of per-token contextual embeddings.
Because AMBERT uses two embedding matrices (one for each tokenization), its parameter count is notably higher than BERT's (194M vs 110M for the English Base models). However, latency remains relatively unchanged, since AMBERT only adds a new set of dictionary lookups.
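A toy numpy sketch of these two steps (the vocabularies, sizes, and the single-projection "encoder" are stand-ins of my own; the point is two separate embedding tables feeding one set of shared weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # hidden size (toy)
fine_vocab = {"new": 0, "york": 1, "times": 2}   # fine-grained: words
coarse_vocab = {"new york": 0, "times": 1}       # coarse-grained: phrases

# Step 1: separate embedding matrices, one per tokenization.
fine_emb = rng.normal(size=(len(fine_vocab), d))
coarse_emb = rng.normal(size=(len(coarse_vocab), d))

# Step 2: shared encoder weights (a single projection stands in for the
# stack of Transformer layers; both streams use the SAME matrix).
shared_W = rng.normal(size=(d, d))

def shared_encoder(x):
    return x @ shared_W

fine_ids = [fine_vocab[t] for t in ["new", "york", "times"]]
coarse_ids = [coarse_vocab[t] for t in ["new york", "times"]]

fine_ctx = shared_encoder(fine_emb[fine_ids])        # (3, d) contextual vectors
coarse_ctx = shared_encoder(coarse_emb[coarse_ids])  # (2, d) contextual vectors
```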
In order to understand this post, it's helpful to first read about Hidden Markov Models and the Gibbs distribution. This video explains the basics of HMMs, and this video by Daphne Koller is a very good introduction to the Gibbs distribution.
Now let's see why we need CRFs.

From the paper
NER is generally formulated as a sequence labeling problem with a multi-layer perceptron + softmax layer as the tag decoder. The architecture can be a bidirectional LSTM with a softmax layer on top. The softmax layer predicts the tag for each token (e.g. B-PER, I-LOC). One shortcoming of using a softmax layer on top is that each token's tag is predicted independently, so it is possible that the sequence of tags is inconsistent, for example B-PER, I-LOC, O, O, I-PER.
To overcome this shortcoming, we can use a Conditional Random Field (CRF) layer on top instead of a softmax layer. When the CRF predicts the tag for a token, it takes the surrounding tokens and their tags into account as well. More formally, a Conditional Random Field (CRF) is a standard model for predicting the most likely sequence of labels that corresponds to a sequence of inputs.
Formally, we take the sequence of hidden states h = (h1, h2, …, hn) as the input to the CRF layer, and its output is the final predicted label sequence y = (y1, y2, …, yn), where each yi is in the set of all possible labels. We denote Y(h) as the set of all possible label sequences. Then the conditional probability of the output sequence given the input hidden-state sequence is

p(y | h) = ∏i exp(W(yi−1, yi) · hi + b(yi−1, yi)) / Σy′∈Y(h) ∏i exp(W(y′i−1, y′i) · hi + b(y′i−1, y′i))

where W and b are the two weight matrices and the subscript indicates that we extract the weight vector (and bias) for the given label pair (yi−1, yi). To train the CRF layer, we use classic maximum conditional likelihood estimation: the final log-likelihood with respect to the weight matrices is

L(W, b) = Σj log p(y(j) | h(j))

summed over the training pairs (h(j), y(j)).
Finally, we can adopt the Viterbi algorithm to decode the optimal output sequence y.
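Here is a compact numpy version of Viterbi decoding over per-token emission scores and label-pair transition scores (a sketch of the standard algorithm, not any particular paper's code):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.
    emissions: (n, L) matrix of per-token label scores.
    transitions: (L, L) matrix of label-pair scores (prev label -> next label).
    Returns the highest-scoring label sequence as a list of label indices."""
    n, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, L), dtype=int)   # back-pointers
    for i in range(1, n):
        # cand[j, k] = best path ending in j at i-1, then transition j->k
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final label
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```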
For a very nice explanation of CRF, I highly recommend watching these videos by Hugo Larochelle
Paper: Distilling the Knowledge in a Neural Network
We have built sophisticated models that solve complex problems such as natural language inference and common-sense reasoning. However, these large, high-performing models come with their own costs. They need a huge amount of computational resources (such as GPU memory) and are slow (depending on the available computation), and hence cannot be run on low-resource devices such as mobile phones. When computational resources are limited, a model can be made smaller by parameter sharing, quantization, and knowledge distillation. Knowledge distillation is a very successful model compression method in which a small model is trained to mimic a pre-trained, larger model.
In distillation, a large trained model (teacher) is distilled into a smaller model (student). The important point is the cost function of the student model:

L(x; W) = α · H(y, σ(zs; T=1)) + β · H(σ(zt; T), σ(zs; T))

where x is the input, W are the student model parameters, y is the ground-truth label, H is the cross-entropy loss function, σ is the softmax function parameterized by the temperature T, and α and β are coefficients. zs and zt are the logits of the student and teacher respectively.
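A numpy sketch of this loss (names are mine; a real implementation would use a framework's log-softmax for numerical stability, and Hinton's paper additionally scales the soft term's gradients by T²):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(z_s, z_t, y, T=2.0, alpha=0.5, beta=0.5):
    """alpha * cross-entropy(student prediction, ground truth at T=1)
       + beta * cross-entropy(teacher's and student's T-softened outputs)."""
    hard = -np.log(softmax(z_s)[y])                 # H(y, σ(z_s; T=1))
    p_teacher = softmax(z_t, T)
    log_p_student = np.log(softmax(z_s, T))
    soft = -(p_teacher * log_p_student).sum()       # H(σ(z_t; T), σ(z_s; T))
    return alpha * hard + beta * soft
```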

Why distillation works: In my opinion there are two reasons distillation works so well: transferring (1) dark knowledge and (2) the winning lottery tickets.
Transferring dark knowledge: Remember the cost function of the student model: cross-entropy(student's prediction, ground truth) + cross-entropy(student's temperature-softened prediction, teacher's temperature-softened prediction).
The first part is the usual cross-entropy; the second part transfers the dark knowledge from the teacher model to the distilled model. What is dark knowledge? For example, if the model is classifying images, the input is an image and the true label, e.g. a cat. The output of the large model, however, is a distribution of probabilities over the classes. For the image of a cat, the model gives roughly 0 probability to the image being a computer, but it may produce a 0.2 probability for the image being a lion or a dog. Hence, its output carries much more information (AKA dark knowledge) than the hard labels. This knowledge is transferred to the distilled model via the second cross-entropy term. So, in addition to the available class labels, the student model can use the soft probabilities (the softened logits) of the teacher model. This has the effect of providing much more information for each input example and makes the training signal for the student model much richer.
(2) Learning the winning lottery ticket: Often in knowledge distillation, the student model's intermediate layers are forced to produce outputs similar to the teacher's intermediate layers. For example, if the teacher is a 12-layer BERT and the student is a 6-layer BERT, the output of the first layer of the distilled model should look like the output of the 2nd layer of the teacher BERT. Why does this help? Remember the “Lottery Ticket Hypothesis” paper: a very large model tries many, many representations and embeddings, and because it tries so many, the ones that actually help the ML task end up embedded in the trained model. Hence, the large/teacher model contains the results of all those trials and errors, including the useful representations. By forcing its intermediate representations to look like the intermediate representations of the large model, the distilled model is forced to pick up only the useful representations (the winning lottery tickets) from the larger model.