Sanaz Bahargam: Machine Learning, Natural Language Processing

NAACL 2021 (2021-06-12)

Here is a summary of some of the talks from NAACL 2021. So far I have only summarized the following papers; I will summarize more and either append them here or publish a new blog post. Feel free to check back soon.

Table of Contents

Video-aided Unsupervised Grammar Induction

By Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu and Jiebo Luo

Best Long Paper

Paper

Code

Grammar induction aims to capture syntactic information in sentences in the form of constituency parsing trees.

Unsupervised grammar induction provides evidence for statistical learning, because it builds minimal assumptions about linguistic knowledge into the models.

There are many texts and videos/images on social media, and images have been shown to help induce syntactic structure. In image-aided unsupervised grammar induction, the model exploits regularities between text spans and images. Videos not only show static objects; they also capture actions and dynamic interactions between objects, and can therefore represent verb phrases.

The paper is inspired by Compound PCFG, in which a grammar inducer is used to create the parse chart and the marginal likelihood of the sentence is optimized. Visually Grounded Compound PCFG (VC-PCFG) additionally considers image-sentence matching during training. However, simply replacing images with videos is not trivial, due to the multimodality and temporal modeling involved in videos.

VC-PCFG
Image taken from the presentation

The baseline model created by the authors combines VC-PCFG with object features. The authors also compare this baseline with VC-PCFG + [action, scene, audio, OCR, face, speech]. In the final model, all of the aforementioned features are extracted from a Multi-Modal Transformer and used jointly. The paper shows that this model (MMC-PCFG) consistently outperforms all the other baselines on three different datasets.

Unifying Cross-Lingual Semantic Role Labeling with Heterogeneous Linguistic Resources

By Simone Conia, Andrea Bacciu and Roberto Navigli

Outstanding Long Paper

Paper

Code

Semantic role labeling (SRL) is the task of automatically addressing "who did what to whom, where, when, and how?" SRL includes predicate identification and disambiguation, argument identification, and argument classification. SRL predicate-argument structure inventories are language-specific, and labeling semantics across languages simultaneously is expensive and requires human experts (in the desired languages). Hence, there is a need to unify heterogeneous linguistic resources. The intuition behind this idea is that semantic relations may be deeply rooted beyond their language-specific realizations, so the authors build a model that can learn from many inventories for deeper sentence-level semantics.

The model architecture (illustrated below) can be roughly divided into the following components:

  • A universal sentence encoder whose parameters are shared across languages and which produces word encodings that capture predicate-related information
  • A universal predicate-argument encoder whose parameters are also shared across languages and which models predicate-argument relations
  • A set of language-specific decoders which indicate whether words are predicates, select the most appropriate sense for each predicate, and assign a semantic role to every predicate-argument pair, according to several different SRL inventories

SemanticRoleLabeling
Image taken from the presentation

It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners

By Timo Schick and Hinrich Schütze

Outstanding Long Paper

Paper

Code

In GPT-3 priming, the model is given a few demonstrations of inputs and corresponding outputs as context for its predictions, but no gradient updates are performed. Using priming, GPT-3 has shown amazing few-shot abilities. However, priming has two problems: it requires gigantic language models, and it does not scale to more than a few examples because the context window is limited to a few hundred tokens.

An alternative approach is pattern-exploiting training (PET), which combines the idea of reformulating tasks as cloze questions with regular gradient-based finetuning. More formally, for the task of mapping inputs x ∈ X to outputs y ∈ Y, PET requires a set of pattern-verbalizer pairs (PVPs). Each PVP p = (P, v) consists of:

  • a pattern P : X → T* that maps inputs to cloze questions containing a single mask;
  • a verbalizer v : Y → T that maps each output to a single token representing its task-specific meaning in the pattern.

By adopting PET, the authors show that even a small model such as ALBERT can outperform GPT-3 on SuperGLUE. Through a series of ablation studies, they show that patterns and verbalizers contribute significantly to the performance of the model. One can conclude that PET works well with small language models and scales to more examples than fit into the context window. The disadvantages compared to few-shot priming are that PET requires fine-tuning multiple models per task and does not work for generative tasks.
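As a toy illustration of a pattern-verbalizer pair (the pattern and verbalizer below are made up for a sentiment task, not taken from the paper):

```python
# Minimal sketch of a pattern-verbalizer pair (PVP) for binary
# sentiment classification. The pattern string and the verbalizer
# mapping are illustrative, not the ones used in the PET paper.

def pattern(x: str) -> str:
    """Map an input text to a cloze question with a single mask."""
    return f"{x} It was [MASK]."

VERBALIZER = {"positive": "great", "negative": "terrible"}

def verbalize(y: str) -> str:
    """Map an output label to a single token with that meaning."""
    return VERBALIZER[y]

print(pattern("Best movie ever!"))  # Best movie ever! It was [MASK].
print(verbalize("positive"))        # great
```

Fine-tuning then asks the masked LM to fill the mask, and the label whose verbalized token gets the highest probability wins.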

Learning How to Ask: Querying LMs with Mixtures of Soft Prompts

By Guanghui Qin and Jason Eisner

Best Short Paper

Paper

Code

In few-shot settings, we can extract factual knowledge from a model's training corpora by prompting the model. In this paper, the authors show that the choice of prompt is important for the performance of the model and explore the idea of learning prompts by gradient descent, either by fine-tuning prompts taken from previous work or by starting from random initialization. This is possible because, from the perspective of the LM, words are just continuous vectors. Hence the prompts consist of "soft words," i.e., continuous vectors that are not necessarily word-type embeddings from the language model. These are called soft prompts, and their benefit (compared to hard prompts built from real words) is that they are easy to search over with backpropagation and they open up a much larger space of prompts.

In addition to soft prompting, for each task the authors optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. They show that across multiple English LMs and tasks, their approach hugely outperforms previous methods, suggesting that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.
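A minimal numpy sketch of the idea, with a stand-in for the LM and illustrative dimensions: soft prompts are free continuous vectors prepended to the embedded input, and a mixture ensembles the resulting distributions with learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 16                    # toy embedding dim and vocab size
E = rng.normal(size=(vocab, d))     # token embedding table

def soft_prompted_input(prompt, token_ids):
    """Prepend learnable continuous 'soft word' vectors to the embedded
    input; the prompt rows need not match any real word embedding."""
    return np.vstack([prompt, E[token_ids]])

# a mixture of two soft prompts, each 3 "soft words" long
prompts = [rng.normal(size=(3, d)) for _ in range(2)]
weights = np.array([0.6, 0.4])      # learned mixture weights (illustrative)

def toy_lm(x):
    """Stand-in for the LM: pooled input to a distribution over vocab."""
    logits = E @ x.mean(axis=0)
    e = np.exp(logits - logits.max())
    return e / e.sum()

token_ids = np.array([1, 5, 7])
mixture = sum(w * toy_lm(soft_prompted_input(p, token_ids))
              for w, p in zip(weights, prompts))
assert np.isclose(mixture.sum(), 1.0)
```

In the real method, the prompt vectors (and mixture weights) are the only parameters updated by gradient descent; the LM stays fixed.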

SoftPrompts
Image taken from the presentation

How many data points is a prompt worth?

By Teven Le Scao and Alexander Rush

Outstanding Short Paper

Paper

Code

When fine-tuning pre-trained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. In this paper, the authors compare prompted and head-based fine-tuning under equal conditions across many tasks and data sizes. They show that prompting is often worth hundreds of data points on average across classification tasks.

To compare heads vs prompts, they consider two transfer learning settings for text classification: head-based, where a generic head layer takes in pretrained representations to predict an output class; prompt-based, where a task-specific pattern string is designed to coax the model into producing a textual output corresponding to a given class. Both can be utilized for fine-tuning with supervised training data but prompts further allow the user to customize patterns to help the model.

For the prompt model, the authors follow the notation from PET (described earlier) and decompose a prompt into a pattern and a verbalizer. The pattern turns the input text into a cloze task, i.e., a sequence with one or more masked tokens that need to be filled. To measure the effectiveness of prompting, the authors introduce a metric, the average data advantage.

Contextual Domain Classification with Temporal Representations

Tzu-Hsiang Lin, Yipeng Shi, Chentao Ye, Yang Fan, Weitong Ruan, Emre Barut, Wael Hamza, Chengwei Su

Paper

There are many domains in dialogue systems, and depending on the domain, the notion of recent context can vary from minutes to hours. This paper uses temporal representations that combine time differences in seconds and turn-order offsets to exploit both recent and distant context in various encoder architectures. More specifically, the previous 9 turns of context within a few days are included in the model. To determine which temporal information (turn vs. time difference) contributes to the model, the authors consider a time mask, a turn embedding, turn embedding over time mask, time mask over turn embedding, and a combined time-and-turn embedding. The model is shown below.

TemporalRepresentations
Image taken from the paper

The time mask feeds the time difference into a 2-layer network with a sigmoid function to produce a masking vector. The turn embedding projects the turn difference into a fixed-size embedding vector. Turn embedding over time mask adds turn-order information on top of the seconds signal (assuming turn order is more important), and similarly time mask over turn embedding adds the seconds signal on top of turn order (assuming time is more important). Finally, time and turn embedding uses both signals together.
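A sketch of the time-mask gating, assuming a toy 2-layer network (the layer sizes, tanh nonlinearity, and hour-scaling of the time difference are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6                                  # hidden size (illustrative)
W1, b1 = rng.normal(size=(d, 1)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mask(dt_seconds):
    """2-layer network + sigmoid over the time difference, producing
    a masking vector in (0, 1)^d."""
    h = np.tanh(W1 @ np.array([dt_seconds / 3600.0]) + b1)
    return sigmoid(W2 @ h + b2)

utterance_repr = rng.normal(size=d)    # encoding of a previous turn
# elementwise gating: old turns can be softly "faded out" by the mask
masked = time_mask(dt_seconds=120.0) * utterance_repr
assert masked.shape == (d,)
```

The turn-embedding variants would replace or compose this gate with a learned embedding of the turn offset.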

ICLR 2021 (2021-05-12)

Table of Contents

Commonsense AI: Myth and Truth

By Yejin Choi

ICLR Link

In this talk, Yejin mainly talks about abduction and counterfactual reasoning and mentions that although they’re different, both involve nonmonotonic reasoning with past context X and future constraint Z.


Picture taken from the slides

Language models such as GPT-2 are only good at conditioning on the past; we can incorporate the future constraint by concatenating past and future into one "past" sequence with a special token in between. But this does not generalize well to out-of-domain distributions. So they propose using back-propagation as an inference-time algorithm, rather than a training-time-only one, for abduction. For counterfactual reasoning the same approach works; only the loss function needs to change to a KL divergence. Perhaps the image below makes this clearer.

Common sense
Image taken from the slides

Knowledge and reasoning:

Off-the-shelf language models are not equivalent to knowledge models, and we need to build knowledge models if we want to do reasoning. COMET is a commonsense knowledge model trained on a graph of 1.33M commonsense if-then inferences over 23 relations. COMET is at least 400 times smaller than GPT-3, yet it outperforms GPT-3 in commonsense reasoning and generalizes well to out-of-domain examples.

Key takeaway message: a lot of commonsense reasoning may require the full scope of language; the distinction between knowledge and reasoning is blurry; and we need natural language for reasoning, since there is no way to represent the complexity of commonsense in clean-cut logical forms.

WHEN DO CURRICULA WORK

Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur

ICLR Link

Paper

They examine whether curriculum learning works. To do so, they first define a notion of difficulty for each example: the difficulty of a sample is its learned iteration, i.e., the iteration at which the sample is classified correctly and remains correctly classified at all later iterations. They sort the data by learned iteration, define a window of size k, and compare standard training with curriculum learning, anti-curriculum learning, and random ordering (shuffled data, sampled one by one from each window of size k). They observe:

  • Curricula achieve (almost) no improvement in the standard setting: curriculum, random, and anti-curriculum ordering perform almost equally well.
  • Curriculum learning improves over standard training when training time is limited. Imitating the large-data regime, where training for multiple epochs is not feasible, they limit the number of training iterations and compare curriculum, random, and anti-curriculum ordering against standard training; the experiments reveal a clear advantage for curriculum learning.
  • Curriculum learning improves over standard training in a noisy regime. They mimic noisy data by adding label noise to CIFAR-100; the experiments indicate that curriculum learning has a clear advantage over other curricula and standard training.
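The learned-iteration difficulty score can be computed from a sample's per-iteration correctness, roughly as follows (my reading of the definition above):

```python
def learned_iteration(correct):
    """Earliest iteration after which the sample is classified
    correctly at every remaining iteration; None if it is still
    misclassified at the last iteration."""
    last_wrong = -1
    for i, ok in enumerate(correct):
        if not ok:
            last_wrong = i
    if last_wrong == len(correct) - 1:
        return None
    return last_wrong + 1

# per-iteration correctness of one sample over 5 training iterations:
# wrong at iterations 0 and 2, then always right -> learned at 3
assert learned_iteration([False, True, False, True, True]) == 3
# an easy sample, correct from the start, has learned iteration 0
assert learned_iteration([True, True, True]) == 0
```

Sorting samples by this score gives the easy-to-hard ordering the curricula are built from.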

RETHINKING ATTENTION WITH PERFORMERS

Authors: Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller

ICLR Link

Paper

Traditional attention-based models are not scalable, since attention has quadratic time and space complexity. This paper introduces the Performer, an efficient attention-based model. The Performer provides linear space and time complexity without requiring any assumptions (such as sparsity or low-rankness). To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features (FAVOR+) approach, which approximates the standard attention weights. FAVOR+ is fully compatible with regular Transformers, and the authors show theoretically that it guarantees unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance.

Performer
Image taken from the paper

See the paper for more details on how FAVOR+ works by decomposing each attention matrix into two matrices built from the queries and keys, and on their properties.
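As a simplified sketch of the underlying trick (kernelized attention computed in linear time via associativity), using plain positive random features rather than the paper's orthogonal construction; for this feature map, E[phi(q)·phi(k)] equals the softmax kernel exp(q·k):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 4, 8                  # sequence length, head dim, #features

def phi(x, W):
    """Positive random-feature map for the softmax kernel.
    (FAVOR+ additionally makes the rows of W orthogonal; this plain
    Gaussian version is a simplification for illustration.)"""
    return np.exp(x @ W.T - (x ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(r)

Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
W = rng.normal(size=(r, d))        # random projection

Qf, Kf = phi(Q, W), phi(K, W)      # (n, r) feature representations
# associativity: phi(Q) @ (phi(K)^T V) costs O(n*r*d) instead of O(n^2*d)
num = Qf @ (Kf.T @ V)              # (n, d)
den = Qf @ Kf.sum(axis=0)          # (n,) row normalizers
out = num / den[:, None]           # row-normalized, like softmax attention
assert out.shape == (n, d)
```

Because the features are strictly positive, the normalizers cannot vanish, which is part of why FAVOR+ is numerically stable.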

DeLighT: Deep and Light-weight Transformer

Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi

ICLR Link

Paper

To learn deeper and wider representations, linear layers are used, which allow learning global representations. To improve efficiency, the authors propose dividing the input into groups and applying a linear transformation to each group independently. This reduces the number of parameters, but the model then only learns local representations within each group. To allow global representation learning, the authors use feature shuffling: after applying one linear transformation to each group independently, the features are shuffled and another group linear transformation is applied. In this manner the model learns both global and local representations. In addition to feature shuffling, the authors use an input mixer connection to avoid the vanishing gradient problem, improve training stability, and improve performance with increasing depth (without the input mixer connection, the authors observe the model gets worse as depth increases).
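The group linear transformation and feature shuffle can be sketched in a few lines of numpy (dimensions illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, g = 8, 2                               # feature dim, number of groups
x = rng.normal(size=d)
W = rng.normal(size=(g, d // g, d // g))  # one small weight matrix per group

def group_linear(x, W):
    """Apply an independent linear map to each group of features:
    g matrices of size (d/g, d/g) instead of one dense (d, d) matrix."""
    groups = x.reshape(g, -1)
    return np.einsum('gij,gj->gi', W, groups).reshape(-1)

def feature_shuffle(x, g):
    """Interleave features across groups so the next group linear
    transform mixes information globally."""
    return x.reshape(g, -1).T.reshape(-1)

y = group_linear(feature_shuffle(group_linear(x, W), g), W)
assert y.shape == (d,)
# parameter count per layer: g * (d/g)^2 = d^2/g, a g-fold reduction
# over a dense d x d layer
```

Without the shuffle, the two groups would never exchange information, which is exactly the limitation the paper's feature shuffle removes.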

DeLight
Image taken from the paper

Are wider nets better given the same number of parameters?

Authors: Anna Golubeva, Guy Gur-Ari, Behnam Neyshabur

ICLR Link

Paper

They show that when the number of parameters is fixed, increasing the width of the model improves performance. Specifically, for ImageNet they observe that increasing the width while keeping the parameter count fixed (by setting some weights to zero using a randomly chosen mask) leads to almost identical performance as allowing the number of weights to grow along with the width. To understand the observed effect theoretically, the authors study a simplified model and show that the improved performance of a wider, sparse network correlates with a reduced distance between its Gaussian Process kernel and that of an infinitely wide network. They propose that this reduced kernel distance may explain the observed effect.

The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

Authors: Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi

ICLR Link

Paper

They propose a new framework, the deep bootstrap, to study generalization in deep learning. In the real world, SGD takes steps on the empirical loss; in an ideal world, SGD takes steps on the population loss (and hence train loss = test loss). The ideal world is one with infinite labeled data, so as training steps increase, the optimizer always sees new data points and never has to revisit a sample. They show that the generalization of models is largely determined by their optimization speed in online and offline learning. They claim the real world behaves like the ideal world as long as the ideal world has not yet converged, and that good models and training procedures are those which (1) optimize quickly in the ideal world and (2) do not optimize too quickly in the real world.

PMI-MASKING: PRINCIPLED MASKING OF CORRELATED SPANS

Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham

ICLR Link

Paper

Instead of masking tokens randomly in Transformers, which is suboptimal, the authors propose PMI-Masking, based on the concept of Pointwise Mutual Information (PMI): a token n-gram is jointly masked if it exhibits high collocation over the corpus. They show that (1) PMI-Masking dramatically accelerates training, matching the end-of-pretraining performance of existing approaches in roughly half the training time, and (2) PMI-Masking improves upon previous masking approaches at the end of pretraining.
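As a toy example of the scoring idea (corpus and counts made up; the paper estimates PMI over the real pretraining corpus and extends the measure to longer n-grams):

```python
import math
from collections import Counter

# toy corpus; in practice PMI is estimated over the pretraining corpus
tokens = "new york is a city , new york is big , the city is big".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
N = len(tokens)

def pmi(w1, w2):
    """PMI(w1, w2) = log p(w1, w2) / (p(w1) p(w2)). High values mean
    the pair collocates far more often than chance, so PMI-Masking
    would mask it jointly rather than token by token."""
    p_joint = bigrams[(w1, w2)] / (N - 1)
    return math.log(p_joint / ((unigrams[w1] / N) * (unigrams[w2] / N)))

assert pmi("new", "york") > pmi("is", "a")  # "new york" collocates strongly
```

High-PMI spans like "new york" are then treated as single masking units during pretraining.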

Edge Probing (2020-11-06)

In the past couple of years, Transformers have achieved state-of-the-art results in a variety of natural language tasks. To better understand Transformers and what they learn in practice, researchers have done layer-wise analyses of the hidden states to see what a Transformer learns at each layer. A wave of recent work has started to "probe" state-of-the-art Transformers, inspecting the structure of the network to assess whether there exist localizable regions associated with distinct types of linguistic decisions, both syntactic and semantic. Researchers examine the hidden states between encoder layers directly and feed them into a linear layer + softmax to predict what kind of information is encoded in each hidden state.
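A minimal sketch of such a probe: hidden states from a frozen layer go into a trained linear layer + softmax, and the probe's accuracy indicates how much of the target information that layer encodes. The synthetic "hidden states" below stand in for real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes, n = 16, 3, 60        # hidden size, #labels, #tokens (toy)

# stand-in hidden states: each class gets its own mean, so this fake
# "layer" demonstrably encodes the label (real probes use BERT layers)
means = 3.0 * rng.normal(size=(n_classes, d))
y = rng.integers(0, n_classes, size=n)
H = means[y] + 0.5 * rng.normal(size=(n, d))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# train only the linear probe; the encoder stays frozen, so probe
# accuracy reflects what the layer's representations contain
W = np.zeros((n_classes, d))
for _ in range(200):
    err = softmax(H @ W.T)
    err[np.arange(n), y] -= 1.0    # gradient of softmax cross-entropy
    W -= 0.1 * (err.T @ H) / n

acc = (softmax(H @ W.T).argmax(axis=1) == y).mean()
```

A layer whose probe reaches high accuracy on, say, POS tags can be said to encode that syntactic information linearly.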

For example, in the paper How Does BERT Answer Questions?, a BERT model is trained on Question Answering (QA), a task that requires combining multiple simpler tasks, such as Coreference Resolution and Relation Modeling, to arrive at the correct answer. After the model is trained, a layer-wise visualization of token representations is provided, which reveals information about the internal state of Transformer networks.

In the figure below, you can see an overview of the BERT architecture and the probing setup. The hidden states of each layer are used as input to a set of probing tasks to examine the encoded information.

pic Picture taken from the paper, How Does BERT Answer Questions?

To visualize the token representations in each layer, the authors use dimensionality reduction + k-means clustering. For dimensionality reduction, they apply t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Independent Component Analysis (ICA) to the vectors in each layer. For clustering, they choose the number of clusters k according to the number of clusters observed in the PCA plots. The figure below shows the visualization of each layer of a BERT model trained on the SQuAD dataset. As illustrated, the early layers of the BERT-based models group tokens into topical clusters; the resulting vector spaces are similar in nature to embedding spaces from, e.g., Word2Vec and hold little task-specific information. These initial layers therefore reach low accuracy on semantic probing tasks, and BERT's early layers can be seen as an implicit replacement for the embedding layers common in neural network architectures; these lower layers encode local syntax rather than complex semantics. In the middle layers of the observed networks, we see clusters of entities that are less connected by topical similarity and are instead connected by their relations within a given input context. These task-specific clusters appear to already include a filtering of question-relevant entities. Figure (b) below shows a cluster with words like countries, schools, detention, and country names, in which 'detention' is a common practice in schools. This cluster helps solve the question "What is a common punishment in the UK and Ireland?".

pic

In short, the authors observe that the model's ability to recognize entities (Named Entity Labeling), to identify their mentions (Coreference Resolution), and to find relations (Relation Recognition) improves up to the higher network layers. The figure below visualizes these abilities. Information about named entities is learned first, whereas recognizing coreferences or relations is more difficult and requires input from additional layers before the model's performance peaks.

pic

Transformers 2 (2020-09-22)

This blog post is the continuation of my previous blog post, Transformers. There, I explained the original Transformer paper, BERT, GPT, XLNet, RoBERTa, ALBERT, BART, and AMBER. In this post, I will explain MARGE, ConveRT, Generalization through Memorization, AdapterHub, and T5. Images and content used in this blog post, unless otherwise mentioned, are taken from the papers on each model.

Table of Contents

Pre-training via Paraphrasing - MARGE (Multilingual Autoencoder that Retrieves and Generates)

Paper from Facebook AI

Presentation at ACL 2020, "Beyond BERT" by Mike Lewis pic

In this paper, the authors try grounding natural language in the reality of our world instead of using MLM. The problem with next-token prediction and MLM objectives is that they focus only on linguistic form: they learn the characteristics of coherent language without necessarily associating meaning with it.

MARGE is a pre-trained sequence-to-sequence model learned with an unsupervised multilingual multi-document paraphrasing objective. During pre-training, the input to the model is a batch of evidence documents z1..M and target documents x1..N . The model is trained to maximize the likelihood of the targets, conditioned on the evidence documents, and the relevance of each evidence document to each target:

  • The model first computes a relevance score f(xi, zj) between every pair of documents xi and zj, by embedding each document and computing their cosine similarities.
  • The model then computes the likelihood of reconstructing each xi conditioned on z1..M and each f(xi , ·), using a modified seq2seq model. The similarity score encourages the model to attend more to relevant evidence documents. Backpropagating the reconstruction loss therefore improves both the sequence-to-sequence model and the relevance model.
  • They construct batches so that evidence documents are relevant to the targets, using the relevance model for retrieval

A truly remarkable outcome is that MARGE can perform decent zero-shot machine translation, that is, without any fine-tuning on parallel data.

ConveRT: Efficient and Accurate Conversational Representations from Transformers

Paper by PolyAI

pic

They use shared parameters, quantization, and fewer layers, but a very long input sequence (needed for conversation), reducing the number of parameters by an order of magnitude.

Generalization through Memorization: Nearest Neighbor Language Models

Paper from Facebook AI and Stanford

Presentation at ACL 2020, "Beyond BERT" by Mike Lewis

pic

The main contribution is an improvement for downstream tasks that need factuality, such as QA. The paper introduces kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next-word distribution with a k-nearest-neighbors (kNN) model. The nearest neighbors are computed according to distance in the pre-trained embedding space and can be drawn from any text collection, including the original LM training data. This approach allows rare patterns to be memorized explicitly, rather than implicitly in model parameters. It also improves performance when the same training data is used for learning the prefix representations and the kNN model, strongly suggesting that the prediction problem is more challenging than previously appreciated.
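A small numpy sketch of the interpolation (the datastore, the distance-to-probability choice, and λ are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d, store = 10, 4, 100

# datastore of (prefix embedding, next word) pairs built offline by
# running the LM over training text
keys = rng.normal(size=(store, d))
values = rng.integers(0, vocab, size=store)

def knn_distribution(query, k=8):
    """Turn the k nearest datastore entries into a next-word
    distribution, weighting neighbors by softmax(-distance)."""
    dist = np.linalg.norm(keys - query, axis=1)
    nn = np.argsort(dist)[:k]
    w = np.exp(-dist[nn])
    w /= w.sum()
    p = np.zeros(vocab)
    np.add.at(p, values[nn], w)     # accumulate weight per vocab entry
    return p

def knn_lm(p_lm, query, lam=0.25):
    """kNN-LM: p(y|x) = lam * p_knn(y|x) + (1 - lam) * p_lm(y|x)."""
    return lam * knn_distribution(query) + (1 - lam) * p_lm

p_lm = np.full(vocab, 1 / vocab)    # stand-in for the LM's distribution
p = knn_lm(p_lm, rng.normal(size=d))
assert np.isclose(p.sum(), 1.0)
```

Because the datastore lookup happens at inference time, memorized rare patterns can be added or swapped without retraining the LM.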

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, T5

Paper from Google pic

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself. Findings:

  • model architectures, where they found that encoder-decoder models generally outperformed “decoder-only” language models;
  • pre-training objectives, where they confirmed that fill-in-the-blank-style denoising objectives (where the model is trained to recover missing words in the input) worked best and that the most important factor was the computational cost;
  • unlabeled datasets, where they showed that training on in-domain data can be beneficial but that pre-training on smaller datasets can lead to detrimental overfitting;
  • training strategies, where they found that multitask learning could be close to competitive with a pre-train-then-fine-tune approach but requires carefully choosing how often the model is trained on each task;
  • and scale, where they compare scaling up the model size, the training time, and the number of ensembled models to determine how to make the best use of fixed compute power.
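The text-to-text reframing itself is just string formatting. The task prefixes below follow the style reported for T5, but the exact strings for any given checkpoint should be taken from the paper or release:

```python
# Every task becomes "text in, text out" by prefixing the input with a
# short task description; even regression emits the number as a string.
# Inputs and targets here are illustrative fragments, not real data.

examples = {
    "translation": (
        "translate English to German: That is good.",
        "Das ist gut."),
    "classification": (
        "cola sentence: The course is jumping well.",
        "not acceptable"),
    "summarization": (
        "summarize: state authorities dispatched emergency crews ...",
        "six people hospitalized after a storm ..."),
    "regression": (
        "stsb sentence1: ... sentence2: ...",
        "3.8"),  # similarity score predicted as a string
}

for task, (inp, out) in examples.items():
    assert isinstance(inp, str) and isinstance(out, str)
```

One model, one maximum-likelihood objective, one decoding procedure: only the prefix tells T5 which task it is solving.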

AdapterHub: A Framework for Adapting Transformers

Paper from Technical University Darmstadt, New York University, CIFAR, University of Cambridge, and DeepMind

pic

AdapterHub enables you to perform transfer learning of generalized pre-trained transformers such as BERT, RoBERTa, and XLM-R to downstream tasks such as question-answering, classification, etc. using adapters instead of fine-tuning. Adapters serve the same purpose as fine-tuning but do it by stitching in layers to the main pre-trained model, and updating the weights Φ of these new layers, whilst freezing the weights θ of the pre-trained model. As you might imagine, this makes adapters much more efficient, both in terms of time and storage, compared to fine-tuning. Adapters have also been shown to be able to match the performance of state-of-the-art fine-tuning methods!
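A minimal numpy sketch of one bottleneck adapter (dimensions and ReLU are illustrative; real adapters also use layer normalization and are inserted at specific points inside each transformer block):

```python
import numpy as np

rng = np.random.default_rng(0)
d, bottleneck = 16, 4              # hidden size, adapter bottleneck size

# the frozen pre-trained weights (theta) stay untouched; only the small
# adapter weights (phi): a down- and an up-projection, are trained
W_down = 0.1 * rng.normal(size=(bottleneck, d))
W_up = np.zeros((d, bottleneck))   # zero init: adapter starts as identity

def adapter(h):
    """Bottleneck adapter: down-project, nonlinearity, up-project,
    with a residual connection around the whole block."""
    return h + W_up @ np.maximum(0.0, W_down @ h)

h = rng.normal(size=d)             # a transformer layer's output
assert np.allclose(adapter(h), h)  # with W_up = 0 the adapter is a no-op
# trainable params per adapter: 2 * d * bottleneck = 128,
# versus d * d = 256 for even one dense layer of the frozen model
```

Since each task only adds these tiny modules, dozens of task adapters can be stored and stitched into one shared pre-trained model.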

Text Summarization (2020-09-14)

Automatic summarization is the process of computationally shortening a set of data to create a subset (a summary) that represents the most important or relevant information within the original content. Text summarization finds the most informative sentences in a document.

Here is also my Colab notebook on fine-tuning a T5 model for the summarization task using Transformers + PyTorch Lightning.

Table of Contents

Strategies for generating summaries

There are two general approaches to text summarization:

  • Extractive summarization, where salient spans of text are identified as important segments and directly copied into the summary (similar to highlighting text with a marker).
  • Abstractive summarization, where the generated summary is a paraphrase of the important parts of the text, and hence more similar to human-generated summaries.
  • Hybrid models, which combine extractive and abstractive summarization and include two phases: content selection and paraphrasing.

Prior to the hype of deep learning, TextRank and LexRank were two popular methods for extractive summarization. TextRank was mainly used for single documents and LexRank for multi-document summarization. Both create a graph of sentences and run the PageRank algorithm to find the most important sentences (essentially the centroids of the graph). The edges between sentences are based on semantic similarity and content overlap: LexRank uses cosine similarity of TF-IDF vectors of sentences, and TextRank uses the number of words two sentences have in common, normalized by sentence length.

LexRank uses the sentence scores from the PageRank algorithm as one feature in a larger system with other features such as sentence position and sentence length. (source)
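A compact LexRank-style sketch: TF-IDF cosine similarities between sentences, then PageRank by power iteration (toy sentences; real implementations add a similarity threshold and other refinements):

```python
import math
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell sharply on friday",
    "markets and stocks fell on friday",
]

def tfidf(sent, docs):
    """Term frequency weighted by inverse document frequency."""
    n = len(docs)
    return {w: c * math.log(n / sum(w in d.split() for d in docs))
            for w, c in Counter(sent.split()).items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(s, sentences) for s in sentences]
n = len(sentences)
sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
       for i in range(n)]
row = [sum(r) or 1.0 for r in sim]  # out-degree normalization

# PageRank by power iteration over the sentence-similarity graph
scores = [1.0 / n] * n
for _ in range(50):
    scores = [0.15 / n + 0.85 * sum(sim[j][i] / row[j] * scores[j]
                                    for j in range(n))
              for i in range(n)]
```

The highest-scoring sentences (the graph's "centroids") are extracted as the summary.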

Evaluation:

When training a model for summarization, one can use cross-entropy (as in language modeling) to train the model. Offline evaluation metrics, however, correlate poorly with human judgment and ignore important aspects such as factual correctness. The offline evaluation metrics can be categorized into the following groups.

n-GRAM Matching Metrics

ROUGE metric

ROUGE: A Package for Automatic Evaluation of Summaries ROUGE is the standard automatic evaluation measure for evaluating summarization tasks.

pic  

  • Disadvantages:
    • Not suitable for abstractive summarization, since it is based on n-gram overlap: it expects the generated summary to be lexically identical to the reference summary and does not recognize synonymous concepts. It also does not capture subset coverage (it focuses on the complete set of n-gram overlaps).
    • Biased toward shorter summaries.
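A minimal ROUGE-N (recall) implementation makes the first disadvantage concrete: a perfectly synonymous summary can score zero.

```python
from collections import Counter

def rouge_n(candidate, reference, n=2):
    """ROUGE-N recall: clipped n-gram overlap divided by the number
    of n-grams in the reference summary."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(zip(*(toks[i:] for i in range(n))))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(sum(ref.values()), 1)

ref = "the cat sat on the mat"
assert rouge_n("the cat sat on the mat", ref) == 1.0
assert rouge_n("a feline rested on a rug", ref) == 0.0  # synonyms score 0
```

The metric variants below (ROUGE-WE, ROUGE-G, ROUGE 2.0) are different attempts to fix exactly this lexical brittleness.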

ROUGE-WE (R-WE)

Better Summarization Evaluation with Word Embeddings for ROUGE. Instead of hard lexical matching of bigrams, ROUGE-WE uses soft matching based on the cosine similarity of word embeddings.

ROUGE-G

A Graph-theoretic Summary Evaluation for ROUGE  combines lexical and semantic matching by applying graph analysis algorithms to the WordNet semantic network

ROUGE 2.0  

ROUGE 2.0 leverages synonym dictionaries, such as WordNet, and considers all synonyms of matched words when computing token overlap (ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks). To address ROUGE's problems, the authors propose the following metrics:

ROUGE-{N|Topic|TopicUniq}+Synonyms

 It captures synonyms using a synonym dictionary (synonym dictionary customizable by application and domain)

  • ROUGE-Topic - topic or subset coverage (topic customizable by POS occurrence)
  • ROUGE-TopicUniq- unique topic or subset coverage (topic customizable by POS occurrence)

METEOR

An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. METEOR is based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can easily be extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation designed to directly capture how well-ordered the matched words in the machine translation are relative to the reference.

Embedding Based Metrics

Distributional Semantics Reward (DSR)

Deep Reinforcement Learning with Distributional Semantic Rewards for Abstractive Summarization. Given that contextualized word representations (such as ELMo, BERT, GPT) have shown a powerful capacity for reflecting distributional semantics, the authors propose to use a distributional semantic reward to boost the reinforcement-learning-based abstractive summarization system.

  • Advantages:
    • DSR does not rely on cross-entropy loss (XENT) to produce readable phrases. Thus, no exposure bias is introduced.
    • DSR improves generated tokens’ diversity and fluency while avoiding unnecessary repetitions.

BERTScore

BERTScore: Evaluating Text Generation with BERT. BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, it computes token similarity using contextual embeddings. In other words, BERTScore focuses on sentence-level generation evaluation by using pre-trained BERT contextualized embeddings to compute the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens. BERTScore has a higher correlation with human evaluation on text generation tasks compared to existing evaluation metrics.

  • Advantages: BERTScore addresses two common pitfalls in n-gram-based metrics such as BLEU, ROUGE, and METEOR.
    • First, such methods often fail to robustly match paraphrases.  This leads to performance underestimation when semantically-correct phrases are penalized because they differ from the surface form of the reference.
    • Second, n-gram models fail to capture distant dependencies and penalize semantically-critical ordering changes. For example, given a small window of size two, BLEU will only mildly penalize swapping of cause and effect clauses (e.g. A because B instead of B because A), especially when the arguments A and B are long phrases.
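The greedy matching itself is simple; a minimal sketch on precomputed per-token embeddings (toy vectors here — the real metric uses BERT contextual embeddings and optional idf weighting, both omitted):

```python
import math

def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def bertscore(ref_embs, cand_embs):
    """Greedy-matching BERTScore on per-token embeddings:
    recall matches each reference token to its most similar candidate
    token, precision does the reverse, and F1 combines the two."""
    recall = sum(max(_cos(r, c) for c in cand_embs)
                 for r in ref_embs) / len(ref_embs)
    precision = sum(max(_cos(c, r) for r in ref_embs)
                    for c in cand_embs) / len(cand_embs)
    return 2 * precision * recall / (precision + recall)
```

Identical token embeddings yield a score of 1; paraphrases with similar contextual embeddings still score highly, which is exactly the robustness the n-gram metrics lack.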

Human Judgement, Learned Metrics

  • Relevance (selection of important content from the source)
  • Consistency (factual alignment between the summary and the source)
  • Fluency (quality of individual sentences)
  • Coherence (collective quality of all sentences)
  • Grammar
  • Information Quality
  • Duplication
  • Diversity
  • Brevity

Models

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, T5

Paper from Google

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
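The reframing is just string manipulation; a sketch of how a few tasks might be cast to the text-to-text format (the prefix strings follow the style of the paper's examples, but treat the exact wording as illustrative):

```python
def to_text_to_text(task, **fields):
    """Cast different NLP tasks to the single text-in/text-out format
    T5 uses, via task prefixes. Targets are always strings too: a label
    name for classification, the score's string form for regression."""
    if task == "translate":
        return f"translate English to German: {fields['text']}"
    if task == "summarize":
        return f"summarize: {fields['text']}"
    if task == "cola":   # grammatical acceptability -> string label target
        return f"cola sentence: {fields['text']}"
    if task == "stsb":   # regression -> the model emits the score as text
        return f"stsb sentence1: {fields['s1']} sentence2: {fields['s2']}"
    raise ValueError(f"unknown task: {task}")
```

The same encoder-decoder model then consumes every prefixed input and is trained to emit the target string, whatever the task.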

Findings:

  • model architectures, where they found that encoder-decoder models generally outperformed “decoder-only” language models;
  • pre-training objectives, where they confirmed that fill-in-the-blank-style denoising objectives (where the model is trained to recover missing words in the input) worked best and that the most important factor was the computational cost;
  • unlabeled datasets, where they showed that training on in-domain data can be beneficial but that pre-training on smaller datasets can lead to detrimental overfitting;
  • training strategies, where they found that multitask learning could be close to competitive with a pre-train-then-fine-tune approach but requires carefully choosing how often the model is trained on each task;
  • and scale, where they compare scaling up the model size, the training time, and the number of ensembled models to determine how to make the best use of fixed compute power.

PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization

Paper source

The authors designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization. The hypothesis is that the closer the pre-training self-supervised objective is to the final downstream task, the better the fine-tuning performance.

In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.

The authors found that choosing “important” sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. They automatically identified these sentences by finding those most similar to the rest of the document according to the ROUGE metric. Similar to T5, the model is pre-trained on a very large corpus of web-crawled documents and then fine-tuned on 12 public downstream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptive to a wide variety of topics.
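A toy sketch of the gap-sentence selection step, using unigram F1 as a stand-in for the ROUGE scoring and ranking sentences independently (simplified; the paper explores several selection strategies):

```python
def rouge1_f(a, b):
    """Unigram-overlap F1 between two token lists — a rough stand-in
    for the ROUGE scoring PEGASUS uses to rank sentences."""
    inter = len(set(a) & set(b))
    if inter == 0:
        return 0.0
    p, r = inter / len(set(a)), inter / len(set(b))
    return 2 * p * r / (p + r)

def select_gap_sentences(sentences, k=1):
    """Score each sentence against the rest of the document and pick the
    top-k as pre-training targets (the 'masked' summary-like sentences)."""
    def score(i):
        rest = [w for j, s in enumerate(sentences) if j != i
                for w in s.split()]
        return rouge1_f(sentences[i].split(), rest)
    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    return sorted(ranked[:k])
```

The selected sentences are removed from the input and concatenated to form the generation target, mirroring a document/summary pair.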

]]>
Sanaz Bahargam
NLP Papers2020-09-03T00:00:00-07:002020-09-03T00:00:00-07:00https://sanazbahargam.github.io/posts/2020/09/NLPPapersThese are the most important transformer papers (in my opinion) that anyone working with Transformers should know. Also, there is a nice summary of Efficient Transformers: A Survey by folks at Google that I highly recommend as well.

AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization

Authors: Xinsong Zhang, Hang Li

ByteDance AI Lab

Year: August 2020

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. T5

Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Google

Year: July 2020

Pre-training via Paraphrasing

Authors: Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer

Facebook

Year: June 2020

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.

Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.

Google and Stanford

Year: March 2020

Generalization through Memorization: Nearest Neighbor Language Models

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis

Facebook and Stanford. Presentation in ACL 2020, “Beyond BERT” by Mike Lewis

Year: Feb 2020

ConveRT: Efficient and Accurate Conversational Representations from Transformers

Authors: Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Tsung-Hsien Wen, Ivan Vulić

PolyAI

Year: Nov 2019

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Authors: Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer

Facebook

Year: October 2019

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Authors: Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

Google and Toyota Technological Institute

Year: September 2019

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov

UW and Facebook

Year: July 2019

XLNet: Generalized Autoregressive Pretraining for Language Understanding, from Carnegie Mellon and Google Research

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le

Year: June 2019

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova

Google

Year: May 2019

Cross-lingual Language Model Pretraining

Authors: Guillaume Lample, Alexis Conneau

Facebook

year: January 2019

Improving Language Understanding by Generative Pre-Training

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

OpenAI

Year: June 2018

Deep contextualized word representations ELMo

Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer

Allen Institute for Artificial Intelligence and UW

year: March 2018

Attention is all you need

Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Google

Year: Dec 2017

]]>
Sanaz Bahargam
Transformers2020-07-30T00:00:00-07:002020-07-30T00:00:00-07:00https://sanazbahargam.github.io/posts/2020/07/TransformersTransformers: This post contains my notes, collected over the years, on different transformers. These notes are quite crude and not yet edited (more like my cheat sheets), but I thought I’d share them anyway. Please let me know if you have any comments or if you find any mistakes. Unless otherwise mentioned, the images used in this blog post are taken from the papers on each model.

Table of Contents

Attention is all you need

Paper from Google


The transformer era began with this paper from Google. The architecture consists of an encoder and a decoder block to solve machine translation.

Positional encoding: using sin and cos functions, the earlier dimensions have smaller wavelengths and can capture short-range offsets, while the later dimensions can capture longer-distance offsets. This blog does a great job of explaining the positional encoding.
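For reference, a plain-Python sketch of the sinusoidal encoding from the paper (PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos of the same angle):

```python
import math

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings from 'Attention Is All You Need'.
    Early dimensions oscillate quickly (short-range offsets); later
    dimensions oscillate slowly (long-range offsets)."""
    pe = [[0.0] * d_model for _ in range(n_positions)]
    for pos in range(n_positions):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)       # even dimension: sin
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimension: cos
    return pe
```

Position 0 always encodes as alternating 0s and 1s (sin 0 and cos 0), and each later position shifts every sin/cos pair by a frequency that depends on the dimension.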

Transformer blocks are characterized by a multi-head self-attention mechanism, a position-wise feed-forward network, layer normalization modules, and residual connectors. The input to the Transformer model is often a tensor of shape R^(B×N), where B is the batch size and N the sequence length.

There are residual self-attention blocks, which are efficient at transferring positional encodings to the top layers. For a deeper understanding of transformers, I recommend reading the original paper, this blogpost by Rémi Louf, and The Annotated Transformer by Alexander Rush. After “Attention Is All You Need”, BERT from Google and GPT from OpenAI were introduced, which I will explain later in this post.

BERT

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Paper from Google


BERT is based on subword token encoding and multi-layer transformer architecture. The transformer blocks are the same as the original transformer blocks introduced in “Attention is all you need” paper, however BERT only uses the Encoder part of the transformer architecture (and this is why it is not suitable for text generation tasks).

BERT uses a huge corpus of data for pre-training the model on a self-supervised task, masked language modeling. Tokens in the text are masked and the model should predict the masked tokens. BERT selects 15% of the tokens. From the 15% selected tokens, 80% are actually masked, 10% replaced by a random token, and 10% left unchanged, and the model is expected to predict all of these tokens (and the loss of all predictions will be backpropagated). See my colab notebook for the code of masking.

You may ask: why not mask all 15% of the selected tokens?! The reason is that if all selected tokens were masked, the model would try to represent only the masked tokens and ignore the rest. When you replace tokens with random ones or leave them unchanged, the model needs to make a prediction for every single token, because it doesn’t have a clue which ones were replaced by a random token and which are original, so it makes an effort to predict all the tokens and learn from all of them. After pretraining, the model can be fine-tuned on many language understanding tasks such as translation, NER, QA, and text classification.
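The 15% / 80-10-10 scheme described above can be sketched as follows (a simplified illustration, not the exact BERT or HuggingFace implementation):

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", seed=None):
    """BERT-style masking: select 15% of positions as prediction targets;
    of those, 80% become [MASK], 10% a random vocab token, and 10% stay
    unchanged. Returns the corrupted tokens and the target positions
    (the loss is computed only at the target positions)."""
    rng = random.Random(seed)
    tokens = list(tokens)
    n_targets = max(1, round(0.15 * len(tokens)))
    targets = rng.sample(range(len(tokens)), n_targets)
    for pos in targets:
        roll = rng.random()
        if roll < 0.8:
            tokens[pos] = mask_token            # 80%: mask
        elif roll < 0.9:
            tokens[pos] = rng.choice(vocab)     # 10%: random token
        # else: 10% — leave the original token in place
    return tokens, sorted(targets)
```

Note that the model still predicts all target positions, including the 10% left unchanged, which is exactly what forces it to build representations for every token.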

One of the disadvantages of BERT is that BERT fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent.

GPT

Improving Language Understanding by Generative Pre-Training

Paper from OpenAI

All GPTs have only the transformer decoder (and not the encoder part). In GPT, the model is first pre-trained on an LM task (causal LM) and then fine-tuned on the final task. They found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. So the final objective is L3(C) = L2(C) + λ ∗ L1(C), in which L2 is the objective for labeled data and L1 is the objective for LM. Overall, larger datasets benefit from the auxiliary objective (λ ∗ L1(C)) but smaller datasets do not.

For training, they used a modified version of L2 regularization with w=0.01 on all non-bias or gain weights. For the activation function, they used GELU. They also learned the position embeddings instead of using the sinusoidal version. For fine-tuning, 3 epochs were sufficient for most tasks and they set λ = 0.5.

GPT1 was trained on BookCorpus (about 7K unpublished books); the context size (number of tokens) was 512.

GPT2 was trained on outbound links from Reddit (the articles that were linked from Reddit) which received at least 3 karma (keeping only posts with a high number of upvotes). This resulted in 45M links after filtering (removing Wikipedia pages etc.), and they ended up with 8M webpages, 40GB of text (around 10B tokens). The GPT2 model has 1.5B parameters (with 48 layers), a vocabulary size of 50,257, a context size of 1024 tokens (remember, BERT was 512), and a batch size of 512. GPT2 achieves SOTA in 7 out of 7 NLP benchmarks.

Differences between GPT and GPT2:

  • Layer normalization was moved to the input of each sub-block, similar to a residual unit of type “building block” (differently from the original type “bottleneck”, it has batch normalization applied before weight layers).
  • An additional layer normalization was added after the final self-attention block.
  • A modified initialization was constructed as a function of the model depth: the weights of residual layers were initially scaled by a factor of 1/√n, where n is the number of residual layers.
  • A larger vocabulary size and context size were used.

GPT2 vs GPT3 Data compression

GPT2 data compression: (tokens/parameters) = 10B/1.5B=6.66 This means in GPT2 there is one parameter for every 6.66 tokens.

GPT3 has 175B parameters and is trained on 499B tokens (from common crawl, webtext2, books1, books2, Wikipedia). The 175 Billion parameters need 175*4=700GB memory (floating-point needs 4 bytes)

GPT3 data compression: (tokens/parameters) = 499B/175B = 2.85. GPT3 has lower data compression compared to GPT2, so, with this number of parameters, it is unclear whether the model functions by memorizing the data during training and pattern matching at inference.
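The arithmetic above, in code form (using the numbers as quoted in this post):

```python
def tokens_per_parameter(n_tokens, n_params):
    """The 'data compression' ratio used above: training tokens per
    model parameter."""
    return n_tokens / n_params

def fp32_memory_gb(n_params):
    """Memory needed just to hold the weights as 32-bit floats
    (4 bytes per parameter)."""
    return n_params * 4 / 1e9

# GPT-2: ~10B tokens / 1.5B params; GPT-3: ~499B tokens / 175B params.
```

So GPT-2 sees roughly 6.67 tokens per parameter, GPT-3 only about 2.85, and GPT-3's weights alone occupy 700GB in fp32.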

GPT-3 shows that it is possible to improve the performance of a model by “simply” increasing the model size, and in consequence, the dataset size and the computation (TFLOP) the model consumes. However, as the performance increases, the model size has to increase more rapidly. Precisely, the model size varies as some power of the improvement of model performance. Remember language model performance scales as a power-law of model size, dataset size, and the amount of computation.

XLNet

Generalized Autoregressive Pretraining for Language Understanding

Paper from Carnegie Mellon and Google Research

Similar to BERT, XLNet uses BooksCorpus and English Wikipedia (13GB of plain text). In addition, the authors include Giga5 (16GB), ClueWeb (19GB after filtering), and Common Crawl (110GB after filtering) for pretraining. In total, they have 32.89B tokens.

XLNet-Large is similar to BERT-Large in model size, and they use a sequence length of 512 tokens. The authors observe that with a batch size of 8192, it took 5.5 days to train, and it still underfits the data. XLNet achieves SOTA on 18 out of 20 NLP tasks.

XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL. Remember the disadvantage of BERT is that BERT fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent, AR models eliminate the independence assumption made in BERT (the shortcoming of BERT are resolved in T5 as well with a less complicated approached compared to XLNet).

XLNet has the advantages of both AR (auto-regressive) and AE (auto-encoding) models. Since an AR language model is only trained to encode a unidirectional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining. Instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., capturing bidirectional context.

XLnet uses the following techniques:

  • relative positional embeddings (instead of absolute position for each token)
  • extending attention queries across previous (not trainable) token sequences
  • using permutations to let masked language modeling look ahead to future tokens

Implementing the aforementioned techniques is complicated, and perhaps this is why I haven’t seen XLNet being widely used.

RoBERTa

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Paper from UW and Facebook AI

In the RoBERTa paper, the authors first performed an ablation study on the effect of BERT’s different components on end-task performance. They observed that a larger batch size increases performance, that NSP doesn’t affect performance (sometimes removing NSP increases performance), and, trying static vs. dynamic masking, that dynamic masking improved performance. They gathered all these learnings and proposed RoBERTa (Robustly optimized BERT approach). Specifically, RoBERTa is trained with dynamic masking, FULL-SENTENCES without NSP loss, large mini-batches, and a larger byte-level BPE.

So, in short, RoBERTa provides a better pre-trained model compared to BERT by

  • training the model longer, with bigger batches, over more data;
  • removing the next sentence prediction objective;
  • training on longer sequences; and
  • dynamically changing the masking pattern applied to the training data.

They realized BERT is significantly undertrained, so they used 160GB of text instead of the 16GB dataset originally used to train BERT, and a batch size of 8K instead of 256 in the original BERT base model. They also removed the next sentence prediction objective from the training procedure, as they realized it does not help the model. RoBERTa matches XLNet models on the GLUE benchmark and sets a new state of the art in 4/9 individual GLUE tasks. Also, they match SOTA on SQuAD and RACE.

ALBERT

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Paper from Google Research and Toyota Technological Institute

ALBERT incorporates two parameter-reduction techniques that lift the major obstacles in scaling pre-trained models.

  • Factorized embedding parameterization. In BERT, XLNet, and RoBERTa, the WordPiece embedding size E is tied to the hidden layer size H, i.e., E ≡ H. By decomposing the large vocabulary embedding matrix into two small matrices, they separate the size of the hidden layers from the size of the vocabulary embedding. This separation makes it easier to grow the hidden size without significantly increasing the parameter size of the vocabulary embeddings. Instead of projecting the one-hot vectors directly into the hidden space of size H, they first project them into a lower-dimensional embedding space of size E, and then project that into the hidden space. With this decomposition, they reduce the embedding parameters from O(V × H) to O(V × E + E × H). This parameter reduction is significant when H ≫ E.
  • Cross-layer parameter sharing. This technique prevents the parameters from growing with the depth of the network. Both techniques significantly reduce the number of parameters of BERT without seriously hurting performance, thus improving parameter efficiency. An ALBERT configuration similar to BERT-large has 18x fewer parameters and can be trained about 1.7x faster. The parameter reduction techniques also act as a form of regularization that stabilizes training and helps with generalization.
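The effect of the factorized embedding is easy to check numerically (illustrative V, H, E values, not ALBERT's exact configuration):

```python
def embedding_params(vocab_size, hidden_size, emb_size=None):
    """Embedding parameter count: tied V*H (BERT-style, E == H) versus
    ALBERT's factorized V*E + E*H parameterization."""
    if emb_size is None:
        return vocab_size * hidden_size          # O(V * H)
    return vocab_size * emb_size + emb_size * hidden_size  # O(V*E + E*H)

# e.g. V=30000, H=1024: tied ~30.7M params; factorized with E=128 ~4.0M.
```

With V=30000 and H=1024, dropping E to 128 cuts the embedding parameters by nearly 8x, exactly the H ≫ E regime the bullet describes.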

Inter-sentence coherence loss

To further improve the performance of ALBERT, they also introduce a self-supervised loss for sentence-order prediction (SOP). This forces the model to learn finer-grained distinctions about discourse-level coherence properties. SOP is a much more difficult task than NSP, and a model trained on SOP can solve the NSP task to a reasonable degree. Remember, in NSP the two sentences/segments differ in topic and are not coherent. In SOP, however, only the order of the sentences is swapped, so the topic is the same but the sentences are not coherent; hence the task is harder (topic prediction is easier to learn than coherence prediction, and it overlaps with what is learned via the MLM loss).

ALBERT doesn’t use dropout. ALBERT v2 throws light on the fact that many assumptions taken for granted are not necessarily true: the regularization effect of parameter sharing in ALBERT is so strong that dropout is not needed. (ALBERT v1 had dropout.)

BART

Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Paper from Facebook AI

BART is a denoising autoencoder for pretraining sequence-to-sequence models. BART is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture.

They evaluate a number of noising approaches, finding the best performance by both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where arbitrary-length spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks.

BART contains roughly 10% more parameters than the equivalently sized BERT model.

Pretraining noising approaches:

  • Token Masking: like BERT.
  • Token Deletion: random tokens are deleted from the input. The model must decide which positions are missing inputs.
  • Text Infilling: a number of text spans are sampled. Each span is replaced with a single [MASK] token.
  • Sentence Permutation: a document is divided into sentences based on full stops, and these sentences are shuffled in random order.
  • Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins with that token.
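Three of these corruptions are simple enough to sketch directly (simplified illustrations; BART's actual implementation operates on subword tokens and sampled span lengths):

```python
import random

def document_rotation(tokens, rng):
    """Rotate the document so it starts at a uniformly chosen token."""
    k = rng.randrange(len(tokens))
    return tokens[k:] + tokens[:k]

def sentence_permutation(text, rng):
    """Split on full stops and shuffle the sentences into random order."""
    sents = [s for s in text.split(".") if s.strip()]
    rng.shuffle(sents)
    return ".".join(sents) + "."

def token_deletion(tokens, rng, p=0.15):
    """Delete random tokens; the model must infer which positions
    are missing (harder than masking, where positions are marked)."""
    return [t for t in tokens if rng.random() >= p]
```

Each transform keeps the original content recoverable in principle, which is what makes reconstruction a useful pre-training signal.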

AMBERT: A Multigrained BERT

Paper by ByteDance


AMBERT proposes a simple twist to BERT: tokenize the input twice, once with a fine-grained tokenizer (subword or word level), and once with a coarse-grained tokenizer (phrase level).

The two inputs share a BERT encoder. A forward pass through the model consists of the following two steps:

  • text tokens → token embeddings (via separate weights): each list of tokens (one fine-grained, one coarse-grained) is looked up in its own embedding matrix and turned into a list of real-valued vectors.
  • token embeddings → contextual embeddings (via shared weights): the two lists of vectors are fed into the same BERT encoder (a stack of Transformer layers), which can be done either sequentially through a single encoder copy, or in parallel through two encoder copies with tied parameters. This results in two lists of per-token contextual embeddings.

Because AMBERT uses two embedding matrices (one for each tokenization), its parameter count is notably higher than BERT’s (194M vs 110M for the English Base models). However, latency remains relatively unchanged, since AMBERT only adds a new set of dictionary lookups.

]]>
Sanaz Bahargam
Fine Tuning T5 for Summary Generation with PyTorch Lightning2020-07-26T00:00:00-07:002020-07-26T00:00:00-07:00https://sanazbahargam.github.io/posts/2020/07/T5SummarizationMy Colab notebook on fine-tuning the T5 model for the summarization task using Transformers + PyTorch Lightning.

]]>
Sanaz Bahargam
Conditional Random Field2020-04-28T00:00:00-07:002020-04-28T00:00:00-07:00https://sanazbahargam.github.io/posts/2020/04/CRFIn this post, I briefly explain what Conditional Random Fields (CRFs) are and how they can be used for sequence labeling. A CRF is a discriminative model best suited for tasks in which contextual information or the state of the neighbors affects the current prediction. CRFs are widely used in named entity recognition, part-of-speech tagging, gene prediction, noise reduction, and object detection problems.

In order to understand this post, it’s helpful to first read about Hidden Markov Models and Gibbs Distribution. This video explains the basics of HMM and this video by Daphne Koller is a very good introduction on Gibbs distribution.

Now let’ see why we need CRF? pic

From the paper

NER is in general formulated as a sequence labeling problem with a multi-layer perceptron + softmax layer as the tag decoder. The architecture can be a bidirectional LSTM with a softmax layer on top. The softmax layer predicts the tag for each token (e.g. B-PER, I-LOC). One shortcoming of using the softmax layer on top is that each token is predicted independently. So it is possible that the sequence of tags is inconsistent, for example, B-PER, I-LOC, O, O, I-PER.

To overcome this shortcoming, instead of a softmax layer we can use a Conditional Random Field (CRF) layer on top. When the CRF predicts the tag for a token in a sequence, it takes the surrounding tokens and their tags into account as well. More formally, a Conditional Random Field (CRF) is a standard model for predicting the most likely sequence of labels that corresponds to a sequence of inputs.

Formally, we take the above sequence of hidden states h = (h1, h2, …, hn) as the input to the CRF layer, and its output is our final prediction label sequence y = (y1, y2, …, yn), where each yi is in the set of all possible labels. We denote Y(h) as the set of all possible label sequences. Then the conditional probability of the output sequence given the input hidden state sequence is

p(y | h) = exp( Σ_i ( W_{y_{i−1}, y_i} · h_i + b_{y_{i−1}, y_i} ) ) / Σ_{y′ ∈ Y(h)} exp( Σ_i ( W_{y′_{i−1}, y′_i} · h_i + b_{y′_{i−1}, y′_i} ) )

where W and b are the two weight matrices and the subscript indicates that we extract the weight vector for the given label pair (y_{i−1}, y_i). To train the CRF layer, we can use classic maximum conditional likelihood estimation. The final log-likelihood with respect to the weight matrices, summed over the training examples (h^(j), y^(j)), is

L(W, b) = Σ_j log p(y^(j) | h^(j))

Finally, we can adopt the Viterbi algorithm to decode the optimal output sequence y.
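A minimal Viterbi decoder over precomputed emission and transition scores (a sketch: a real CRF layer would learn these scores and use the forward algorithm during training):

```python
def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF. emissions[t][y] is the
    unary score of label y at step t; transitions[y_prev][y] the pairwise
    score. Returns the highest-scoring label sequence via dynamic
    programming plus back-pointers."""
    n_labels = len(emissions[0])
    score = list(emissions[0])   # best score ending in each label at t=0
    back = []                    # back-pointers per step
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda yp: score[yp] + transitions[yp][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score, back = new_score, back + [ptr]
    # Follow back-pointers from the best final label.
    y = max(range(n_labels), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return path[::-1]
```

With strong transition scores for staying in the same label, the decoder prefers globally consistent tag sequences even when a single step's emission score would suggest switching, which is exactly the advantage over an independent softmax.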

For a very nice explanation of CRFs, I highly recommend watching these videos by Hugo Larochelle.

]]>
Sanaz Bahargam
Knowledge Distillation2020-01-09T00:00:00-08:002020-01-09T00:00:00-08:00https://sanazbahargam.github.io/posts/2020/01/KnowledgeDistillationIn this post, I will discuss what knowledge distillation is (also referred to as student-teacher learning), what the intuition behind it is, and why it works!

Paper: Distilling the Knowledge in a Neural Network

We have built sophisticated models that solve complex problems such as natural language inference and common sense reasoning. However, these large, high-performing models come with their own costs. They need a huge amount of computational resources (such as GPU memory), are slow (depending on computational resources), and hence cannot be run on low-resource devices such as mobile devices. When computational resources are limited, the model can be made smaller by sharing parameters, quantization, and knowledge distillation. Knowledge distillation is a very successful model compression method in which a small model is trained to mimic a pre-trained, larger model.

In distillation, a large trained model (the teacher) is distilled into a smaller model (the student). The important point is the cost function of the student model:

L(x; W) = α · H(y, σ(z_s)) + β · H( σ(z_t / T), σ(z_s / T) )

where x is the input, W is the student model parameters, y is the ground truth label, H is the cross-entropy loss function, σ is the softmax function parameterized by the temperature T, and α and β are coefficients. zs and zt are the logits of the student and teacher respectively.
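The loss can be sketched in plain Python for a single example (a simplified illustration; in practice this is a batched PyTorch loss, and the soft term is often additionally scaled by T² to balance gradient magnitudes):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax: higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      T=2.0, alpha=0.5, beta=0.5):
    """alpha * CE(hard label, student) + beta * CE(teacher soft targets,
    student), with both distributions in the second term softened by T."""
    hard = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))
    return alpha * hard + beta * soft
```

Raising T spreads probability mass onto the non-target classes of the teacher's output, which is precisely what exposes the "dark knowledge" discussed below.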


Why distillation works: In my opinion, there are two reasons distillation works really well: (1) transferring dark knowledge and (2) learning the winning lottery ticket.

(1) Transferring dark knowledge. Remember the cost function of the student model: cross-entropy(student’s prediction, ground truth) + cross-entropy(student’s prediction with temperature, teacher’s prediction with temperature).

The first part is the usual cross-entropy; the second part, however, transfers the dark knowledge from the teacher model to the distilled model. What is dark knowledge? For example, if the model is classifying images, the input is an image and the true label, e.g. a cat. The output of the large model, however, is a distribution of probabilities over the classes. For the image of a cat as input, the model gives near-zero probability to the image being a computer, but it may give 0.2 probability to it being a lion or a dog. Hence, it carries much more information (AKA dark knowledge) than the hard labels. This knowledge is transferred to the distilled model through the second cross-entropy term. So, in addition to the available class labels, the student model can use the soft probabilities (or ‘logits’) of the teacher model. This gives the model much more information for each input and makes the input to the student model much richer.

(2) Learning the winning lottery ticket. Often in knowledge distillation, the student model’s intermediate layers are forced to output results similar to the teacher’s intermediate layers. For example, if the teacher is a 12-layer BERT and the student a 6-layer BERT, the output of the first layer of the distilled model should look like the output of the 2nd layer of the teacher BERT. Why does this help? Remember the “Lottery Ticket Hypothesis” paper: a very large model tries many representations and embeddings, and since it tries so many, the ones that help the ML task end up embedded in the trained model. Hence the large/teacher model contains all those trials and useful representations. The distilled model forces its intermediate representations to look like the intermediate representations of the large model and is thus pushed to pick up only the useful representations (the winning lottery tickets) from the larger model.

]]>
Sanaz Bahargam