By Songyang Zhang, Linfeng Song, Lifeng Jin, Kun Xu, Dong Yu and Jiebo Luo
Best Long Paper
Grammar induction aims to capture syntactic information in sentences in the form of constituency parse trees. Unsupervised grammar induction provides evidence for statistical learning because it builds minimal assumptions about linguistic knowledge into the models.
There are many texts and videos/images on social media, and images have been shown to help induce syntactic structure. In image-aided unsupervised grammar induction, the model exploits regularities between text spans and images. Videos not only depict static objects; they also show actions and dynamic interactions between objects, and can therefore represent verb phrases.
The paper is inspired by Compound PCFG, in which a grammar inducer produces the parse chart and the model is optimized on the marginal likelihood of the sentence. Visually Grounded PCFG (VC-PCFG) additionally considers image-sentence matching during training. However, simply replacing images with videos is not trivial, because of the multimodality of videos and the need for temporal modeling.

Image taken from the presentation
The baseline model created by the authors combines VC-PCFG with object features. The authors also compare this baseline against VC-PCFG augmented with action, scene, audio, OCR, face, and speech features. In the final model (MMC-PCFG), all of the aforementioned features are extracted and combined with a multi-modal Transformer. The paper shows that MMC-PCFG consistently outperforms all the other baselines on three different datasets.
By Simone Conia, Andrea Bacciu and Roberto Navigli
Outstanding Long Paper
Semantic role labeling (SRL) is the task of automatically addressing “who did what to whom, where, when, and how?” SRL includes predicate identification and disambiguation, argument identification, and argument classification. SRL predicate-argument structure inventories are language-specific, and labeling semantics across languages is expensive and requires human experts in the desired languages. Hence, there is a need to exploit heterogeneous linguistic resources. The intuition behind this idea is that semantic relations may be deeply rooted beyond their language-specific realization, so the authors build a model that can learn from many inventories for deeper sentence-level semantics.
The model architecture (illustrated below) can be roughly divided into the following components:
• A universal sentence encoder whose parameters are shared across languages and which produces word encodings that capture predicate-related information
• A universal predicate-argument encoder whose parameters are also shared across languages and which models predicate-argument relations
• A set of language-specific decoders which indicate whether words are predicates, select the most appropriate sense for each predicate, and assign a semantic role to every predicate-argument pair, according to several different SRL inventories

Image taken from the presentation
By Timo Schick and Hinrich Schütze
Outstanding Long Paper
In GPT-3 priming, the model is given a few demonstrations of inputs and corresponding outputs as context for its predictions, but no gradient updates are performed. Using priming, GPT-3 has shown amazing few-shot abilities. However, priming requires gigantic language models, and it does not scale to more than a few examples because the context window is limited to a few hundred tokens.
An alternative approach is pattern-exploiting training (PET), which combines the idea of reformulating tasks as cloze questions with regular gradient-based fine-tuning. More formally, consider the task of mapping inputs x ∈ X to outputs y ∈ Y. PET requires a set of pattern-verbalizer pairs (PVPs), where each PVP p = (P, v) consists of:
• a pattern P : X → T* that maps inputs to cloze questions containing a single mask;
• a verbalizer v : Y → T that maps each output to a single token representing its task-specific meaning in the pattern
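To make the PVP formalism concrete, here is a minimal sketch of one pattern-verbalizer pair for binary sentiment classification (the pattern text and verbalizer tokens are illustrative, not taken from the paper):

```python
# A minimal pattern-verbalizer pair (PVP) in the spirit of PET.
# The pattern maps an input review to a cloze question with a single
# [MASK]; the verbalizer maps each label to one token that fills it.

def pattern(x: str) -> str:
    # P : X -> T*, a cloze question containing a single mask token
    return f"{x} It was [MASK]."

def verbalizer(y: str) -> str:
    # v : Y -> T, one token per label
    return {"positive": "great", "negative": "terrible"}[y]

cloze = pattern("The pizza was cold and bland.")
target = verbalizer("negative")
# The MLM is then fine-tuned so that p([MASK] = target | cloze) is high.
```

At inference time, the label whose verbalizer token gets the highest masked-LM probability is predicted.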
By adopting PET, the authors show that even a small model such as ALBERT can outperform GPT-3 on SuperGLUE. Through a series of ablation studies, they show that patterns and verbalizers contribute significantly to the performance of the model. One can conclude that PET works well with small language models and scales to more examples than fit into the context window. However, compared to priming, PET has the disadvantage of requiring fine-tuning multiple models per task, and it doesn't work for generative tasks.
By Guanghui Qin and Jason Eisner
Best Short Paper
In few-shot settings, we can extract factual knowledge from a model's training corpora by prompting the model. In this paper, the authors show that the choice of prompt is important for model performance and explore the idea of learning prompts by gradient descent, either fine-tuning prompts taken from previous work or starting from random initialization. This is feasible because, from the perspective of the LM, words are just continuous vectors. Hence the prompts used consist of "soft words," i.e., continuous vectors that are not necessarily word-type embeddings from the language model. These are called soft prompts, and their benefit (compared to hard prompts built from actual words) is that they are easy to search with backprop and give access to a much larger space of prompts.
In addition to soft prompting, for each task the authors optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. They show that across multiple English LMs and tasks, their approach hugely outperforms previous methods, suggesting that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

Image taken from the presentation
By Teven Le Scao and Alexander Rush
Outstanding Short Paper
When fine-tuning pre-trained models for classification, researchers either use a generic model head or a task-specific prompt for prediction. In this paper, the authors compare prompted and head-based fine-tuning under equal conditions across many tasks and data sizes. They show that prompting is often worth hundreds of data points on average across classification tasks.
To compare heads vs prompts, they consider two transfer learning settings for text classification: head-based, where a generic head layer takes in pretrained representations to predict an output class; prompt-based, where a task-specific pattern string is designed to coax the model into producing a textual output corresponding to a given class. Both can be utilized for fine-tuning with supervised training data but prompts further allow the user to customize patterns to help the model.
For the prompt model, the authors follow the notation from PET (described earlier in this post) and decompose a prompt into a pattern and a verbalizer. The pattern turns the input text into a cloze task, i.e. a sequence with a masked token or tokens that need to be filled. In order to measure the effectiveness of prompting, the authors introduce a metric, the average data advantage.
Tzu-Hsiang Lin, Yipeng Shi, Chentao Ye, Yang Fan, Weitong Ruan, Emre Barut, Wael Hamza, Chengwei Su
There are many domains in dialogue systems, and depending on the domain, the relevant span of context can vary from minutes to hours. This paper uses temporal representations that combine time-difference (in seconds) and turn-order offset information to utilize both recent and distant context in various encoder architectures. More specifically, the previous nine turns of context within a few days are included in the model. To determine which temporal information (turn order vs. time difference) contributes to the model, the authors consider a time mask, a turn embedding, turn embedding over time mask, time mask over turn embedding, and time and turn embedding together. The model is shown below.

Image taken from the paper
Time mask feeds the time difference into a two-layer network followed by a sigmoid to produce a masking vector. Turn embedding projects the turn difference into a fixed-size embedding vector. Turn embedding over time mask adds turn-order information on top of the seconds signal (assuming turn order is more important), and, symmetrically, time mask over turn embedding adds the seconds signal on top of turn order (assuming time is more important). Finally, time and turn embedding combines both signals.
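The time-mask component can be sketched as follows (a minimal numpy sketch of the idea; the hidden size and random parameters are illustrative, and in the real model they would be trained jointly with the dialogue encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # hidden size (illustrative)

# Parameters of the 2-layer network (randomly initialized here)
W1, b1 = rng.normal(size=(d, 1)), np.zeros(d)
W2, b2 = rng.normal(size=(d, d)), np.zeros(d)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def time_mask(delta_seconds: float) -> np.ndarray:
    # 2-layer network over the time difference, squashed to (0, 1)
    h = np.tanh(W1 @ np.array([delta_seconds]) + b1)
    return sigmoid(W2 @ h + b2)

turn_repr = rng.normal(size=d)        # encoding of a previous turn
masked = time_mask(30.0) * turn_repr  # gate the turn by its recency
```

Because the mask values lie in (0, 1), the network can softly down-weight turns whose time difference makes them less relevant.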
By Yejin Choi
In this talk, Yejin mainly talks about abduction and counterfactual reasoning and mentions that although they’re different, both involve nonmonotonic reasoning with past context X and future constraint Z.

Picture taken from the slides
Language models such as GPT-2 are only good at conditioning on the past; we can incorporate a future constraint by treating both past and future as the "past," with a special token in between. But this doesn't generalize well to out-of-domain distributions. So, for abduction, they propose using back-propagation as an inference-time algorithm rather than a training-time-only one. For counterfactual reasoning the same approach works; only the loss function needs to change, to a KL divergence. The image below makes this clearer.

Image taken from the slides
Knowledge and reasoning:
Off-the-shelf language models are not equivalent to knowledge models, and we need to build knowledge models if we want to do reasoning. COMET is a commonsense knowledge model trained on a commonsense knowledge graph with 1.33M if-then inferences over 23 relations. COMET is at least 400 times smaller than GPT-3, yet it outperforms GPT-3 on commonsense reasoning and generalizes well to out-of-domain examples.
Key takeaway message: a lot of commonsense reasoning might require the full scope of language; the distinction between knowledge and reasoning is blurry; and we need natural language for reasoning, since there is no way to represent the complexity of commonsense in clean-cut logical forms.
Authors: Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur
They examined whether curriculum learning works or not. To do that, they first define a notion of difficulty for each example: its learned iteration, the iteration at which the sample is classified correctly and also remains correctly classified in all later iterations. They sort the data by learned iteration, define a window of size k, and then compare standard training with curriculum learning, anti-curriculum learning, and random ordering (the data is shuffled, but a sample is drawn from each window of size k, one window at a time). They observed:
• Curricula achieve (almost) no improvement in the standard setting: curriculum, random, and anti-curriculum ordering perform almost equally well.
• Curriculum learning improves over standard training when training time is limited. Imitating the large-data regime, where training for multiple epochs is not feasible, they limit the number of training iterations and compare curriculum, random, and anti-curriculum ordering against standard training; the experiments reveal a clear advantage for curriculum learning.
• Curriculum learning improves over standard training in a noisy regime. They mimic noisy data by adding label noise to CIFAR-100; the experiments again indicate a clear advantage for curriculum learning over the other curricula and standard training.
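The ordering scheme described above can be sketched as follows (a simplified sketch of my own, assuming `learned_iteration` scores are already available for every sample):

```python
import random

def curriculum_order(samples, learned_iteration, k, anti=False):
    # Sort samples by learned iteration (easiest first; reverse the sort
    # for anti-curriculum), then emit them window by window of size k.
    # Shuffling within a window mimics drawing samples from the current
    # window one by one.
    ordered = sorted(samples, key=learned_iteration, reverse=anti)
    windows = [ordered[i:i + k] for i in range(0, len(ordered), k)]
    out = []
    for window in windows:
        random.shuffle(window)  # order within a window is free
        out.extend(window)
    return out
```

The "random" baseline corresponds to shuffling the data first and then applying the same windowed sampling.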
Authors: Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, David Benjamin Belanger, Lucy J Colwell, Adrian Weller
Traditional attention-based models are not scalable, since attention has quadratic time and space complexity. This paper introduces the Performer, an efficient attention-based model. The Performer provides linear space and time complexity without any assumptions (such as sparsity or low-rankness). To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which approximates the standard attention weights. FAVOR+ is fully compatible with regular Transformers, and the authors show theoretically that it guarantees unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence, and low estimation variance.

Image taken from the paper
See the paper for the details of how FAVOR+ works, including how it decomposes each attention matrix in a batch into random-feature products of the query and key matrices Q and K, and the properties that follow.
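The core trick can be illustrated in a few lines of numpy (a simplified sketch: it uses plain Gaussian random features and omits the orthogonalization and redrawing schemes of the actual FAVOR+ mechanism):

```python
import numpy as np

rng = np.random.default_rng(0)

def positive_random_features(X, W):
    # phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m): positive random
    # features whose inner products approximate the softmax kernel
    m = W.shape[0]
    return np.exp(X @ W.T - (X ** 2).sum(-1, keepdims=True) / 2) / np.sqrt(m)

def favor_attention(Q, K, V, m=128):
    d = Q.shape[-1]
    W = rng.normal(size=(m, d))  # (orthogonalization omitted for brevity)
    Qf = positive_random_features(Q / d ** 0.25, W)
    Kf = positive_random_features(K / d ** 0.25, W)
    # Linear-complexity order of operations: compute K^T V first, so the
    # N x N attention matrix is never materialized.
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(0)
    return num / den[:, None]
```

Because the features are positive, the implicit attention weights are positive and normalized, just like softmax attention, and the cost is linear in sequence length.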
Authors: Sachin Mehta, Marjan Ghazvininejad, Srinivasan Iyer, Luke Zettlemoyer, Hannaneh Hajishirzi
To learn deeper and wider representations, linear layers are used, which allow learning global representations. To improve efficiency, the authors propose dividing the input into groups and applying a linear transformation to each group independently. This reduces the number of parameters, but the model then only learns local representations within each group. To allow global representation learning, the authors introduce feature shuffling: after applying one linear transformation to each group independently, the features are shuffled, and then another group linear transformation is applied. In this way the model learns both global and local representations. In addition to feature shuffling, the authors use an input mixer connection to avoid the vanishing gradient problem, improve training stability, and improve performance with increasing depth (without the input mixer connection, the authors observe the model gets worse as depth increases).

Image taken from the paper
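Group linear transformation plus feature shuffling can be sketched as follows (my own minimal numpy sketch; the sizes are illustrative, and the shuffle is the ShuffleNet-style interleaving):

```python
import numpy as np

rng = np.random.default_rng(0)

def group_linear(x, weights):
    # Split features into g groups and apply an independent linear
    # transform to each: g * (d/g)^2 parameters instead of d^2.
    groups = np.split(x, len(weights))
    return np.concatenate([w @ part for part, w in zip(groups, weights)])

def feature_shuffle(x, g):
    # Interleave features across groups so the next group linear layer
    # mixes information between groups (global representation).
    return x.reshape(g, -1).T.reshape(-1)

d, g = 12, 3
weights = [rng.normal(size=(d // g, d // g)) for _ in range(g)]
x = rng.normal(size=d)
y = feature_shuffle(group_linear(x, weights), g)
```

With g = 3 and d = 12, each layer uses 3 x 16 = 48 weights instead of 144, and the shuffle guarantees every output group sees features from every input group after two layers.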
Authors: Anna Golubeva, Guy Gur-Ari, Behnam Neyshabur
They showed that when fixing the number of parameters while increasing the width of the model, performance improves. More specifically, they observed that for ImageNet, increasing the width (by setting some weights to zero using a randomly chosen mask) leads to almost identical performance as when the number of weights is allowed to increase along with the width. To understand the observed effect theoretically, the authors study a simplified model and show that the improved performance of a wider, sparse network is correlated with a reduced distance between its Gaussian Process kernel and that of an infinitely wide network. They propose that this reduced kernel distance may explain the observed effect.
Authors: Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi
They propose a new framework, the deep bootstrap, to study generalization in deep learning. In the real world, SGD takes steps on the empirical loss; in the ideal world, SGD takes steps on the population loss (and hence train loss = test loss). The ideal world is one with infinite labeled data: as the number of steps increases, training always sees new data points, and no sample is visited more than once. They show that the generalization of models is largely determined by their optimization speed in online and offline learning. They claim the real world behaves like the ideal world as long as the ideal world hasn't converged yet, and that good models and training procedures are those which (1) optimize quickly in the ideal world and (2) do not optimize too quickly in the real world.
Authors: Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, Yoav Shoham
Instead of masking tokens randomly in Transformers, which is suboptimal, the authors propose PMI-Masking, based on the concept of Pointwise Mutual Information (PMI): a token n-gram is jointly masked if it exhibits high collocation over the corpus. They showed that (1) PMI-Masking dramatically accelerates training, matching the end-of-pretraining performance of existing approaches in roughly half of the training time; and (2) PMI-Masking improves upon previous masking approaches at the end of pretraining.
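The collocation score at the heart of this idea is easy to compute; here is a toy sketch of bigram PMI over a miniature corpus (the corpus and scoring granularity are illustrative, not the paper's n-gram extension):

```python
import math
from collections import Counter

corpus = "new york is a city new york is big a city is big".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

def pmi(w1, w2):
    # PMI(x, y) = log p(x, y) / (p(x) p(y)); high values mean the pair
    # co-occurs far more often than chance, i.e. a strong collocation.
    p_xy = bigrams[(w1, w2)] / n_bi
    p_x, p_y = unigrams[w1] / n_uni, unigrams[w2] / n_uni
    return math.log(p_xy / (p_x * p_y))
```

A PMI-style masking scheme would then mask high-scoring spans like "new york" as a unit rather than token by token.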
For example, in the paper How Does BERT Answer Questions?, a BERT model is trained on Question Answering (QA), since QA is a task that requires combining simpler tasks, such as coreference resolution and relation modeling, to arrive at the correct answer. After the model is trained, a layer-wise visualization of token representations is provided, which reveals information about the internal state of the Transformer network.
In the figure below, you can see an overview of the BERT architecture and the probing setup. The hidden states of each layer are fed as input to a set of probing tasks to examine the encoded information.
Picture taken from the paper, How Does BERT Answer Questions?
To visualize the token representations in each layer, the authors use dimensionality reduction plus k-means clustering. For dimensionality reduction, they apply t-distributed Stochastic Neighbor Embedding (t-SNE), Principal Component Analysis (PCA), and Independent Component Analysis (ICA) to the vectors in each layer; for clustering, they choose the number of clusters k according to the number of clusters observed in PCA. The figure below shows the visualization of each layer of a BERT model trained on the SQuAD dataset. As illustrated, the early layers of BERT-based models group tokens into topical clusters; the resulting vector spaces are similar in nature to embedding spaces from e.g. Word2Vec and hold little task-specific information. These initial layers therefore reach low accuracy on semantic probing tasks, and BERT's early layers can be seen as an implicit replacement of the embedding layers common in neural network architectures; they encode more local syntax rather than more complex semantics. In the middle layers of the observed networks, the clusters of entities are less connected by topical similarity and more by their relation within a certain input context. These task-specific clusters appear to already filter question-relevant entities: figure (b) below shows a cluster with words like countries, schools, detention, and country names, in which 'detention' is a common practice in schools. This cluster helps to solve the question "What is a common punishment in the UK and Ireland?".

In short, the authors observe that the model's ability to recognize entities (named entity labeling), to identify their mentions (coreference resolution), and to find relations (relation recognition) improves toward the higher network layers. The figure below visualizes these abilities: information about named entities is learned first, whereas recognizing coreferences or relations is harder and requires input from additional layers before the model's performance peaks.

References
Paper from Facebook AI
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis

In this paper, the authors try to ground natural language in the reality of our world instead of relying on MLM. The problem with next-token prediction and MLM objectives is that they focus only on linguistic form: they learn the characteristics of coherent language without necessarily associating meaning with it.
MARGE is a pre-trained sequence-to-sequence model learned with an unsupervised multilingual multi-document paraphrasing objective. During pre-training, the input to the model is a batch of evidence documents z1..M and target documents x1..N. The model is trained to maximize the likelihood of the targets, conditioned on the evidence documents and the relevance of each evidence document to each target:
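The equation itself was lost in extraction; a rough reconstruction from the description above (notation mine, not verbatim from the paper):

```latex
% Sketch: maximize the likelihood of each target x_i given the evidence
% documents z_{1..M}, weighted by learned relevance scores f(x_i, z_j).
\mathcal{L} \;=\; -\sum_{i=1}^{N} \log\, p_\theta\!\left(x_i \,\middle|\, z_{1..M},\; f(x_i, z_1), \ldots, f(x_i, z_M)\right)
```

The relevance scores f both retrieve useful evidence batches and bias the model's attention toward the most relevant evidence documents.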
A truly remarkable outcome is that MARGE can perform decent zero-shot machine translation, that is, without any fine-tuning on parallel data.
Paper by PolyAI

They used shared parameters, quantization, and fewer layers, but a very long input sequence (needed for conversation), reducing the number of parameters by an order of magnitude.
Paper from Facebook AI and Stanford
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis

The main contribution is an improvement for downstream tasks that need factuality, like QA. The paper introduces kNN-LM, an approach that extends a pre-trained LM by linearly interpolating its next-word distribution with a k-nearest neighbors (kNN) model. The nearest neighbors are computed according to distance in the pre-trained embedding space and can be drawn from any text collection, including the original LM training data. This approach allows rare patterns to be memorized explicitly, rather than implicitly in model parameters. It also improves performance when the same training data is used for learning the prefix representations and the kNN model, strongly suggesting that the prediction problem is more challenging than previously appreciated.
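The interpolation can be sketched in a few lines (a simplified sketch: the datastore, distance-based weighting, and the interpolation weight lambda follow the paper's recipe, but the names and toy sizes are mine):

```python
import numpy as np

def knn_lm_next_word(p_lm, datastore_keys, datastore_next, query,
                     k=3, lam=0.25, vocab=4):
    # kNN-LM: p(y|x) = lam * p_kNN(y|x) + (1 - lam) * p_LM(y|x), where
    # p_kNN is built from the k nearest stored (prefix -> next word)
    # pairs in the pre-trained embedding space.
    d2 = ((datastore_keys - query) ** 2).sum(-1)
    nearest = np.argsort(d2)[:k]
    weights = np.exp(-d2[nearest])  # closer prefixes count more
    p_knn = np.zeros(vocab)
    for idx, w in zip(nearest, weights):
        p_knn[datastore_next[idx]] += w
    p_knn /= p_knn.sum()
    return lam * p_knn + (1 - lam) * p_lm
```

The datastore is just a list of (prefix embedding, observed next word) pairs built by one forward pass over any corpus; no retraining of the LM is needed.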
Paper from Google

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself. Findings:
Paper from Technical University Darmstadt, New York University, CIFAR, University of Cambridge, DeepMind

AdapterHub enables you to perform transfer learning of generalized pre-trained transformers such as BERT, RoBERTa, and XLM-R to downstream tasks such as question answering, classification, etc. using adapters instead of fine-tuning. Adapters serve the same purpose as fine-tuning but do it by stitching new layers into the main pre-trained model and updating the weights Φ of these new layers, whilst freezing the weights θ of the pre-trained model. As you might imagine, this makes adapters much more efficient, both in terms of time and storage, compared to fine-tuning. Adapters have also been shown to match the performance of state-of-the-art fine-tuning methods!
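The adapter module itself is tiny; here is a conceptual numpy sketch of a Houlsby-style bottleneck adapter (sizes and initialization are illustrative, not AdapterHub's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
d, bottleneck = 16, 4  # illustrative sizes; bottleneck << d

class Adapter:
    # Bottleneck adapter: down-project, nonlinearity, up-project, plus a
    # residual connection. Only these small matrices (the weights Phi)
    # are trained; the pre-trained weights theta stay frozen.
    def __init__(self):
        self.W_down = rng.normal(scale=0.02, size=(bottleneck, d))
        self.W_up = rng.normal(scale=0.02, size=(d, bottleneck))

    def __call__(self, h):
        z = np.maximum(0.0, self.W_down @ h)  # ReLU
        return h + self.W_up @ z              # near-identity at init
```

With d = 768 and a bottleneck of 64, each adapter adds roughly 2 * 768 * 64 ≈ 100k parameters per layer, a small fraction of full fine-tuning.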
Here is also my Colab notebook on fine-tuning a T5 model for a summarization task using Transformers + PyTorch Lightning
There are two general approaches to text summarization: extractive and abstractive.
Prior to the hype of deep learning, TextRank and LexRank were two popular methods for extractive summarization: TextRank was mainly used for single documents and LexRank for multi-document summarization. Both create a graph of sentences and run the PageRank algorithm to get the most important sentences (essentially the centroids of the graph). The edges between sentences are based on semantic similarity and content overlap: LexRank uses the cosine similarity of TF-IDF vectors of sentences, while TextRank uses the number of words two sentences have in common, normalized by sentence length.
LexRank uses the sentence scores from the PageRank algorithm as a feature in a larger system alongside other features such as sentence position and sentence length. source
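The shared core of both methods, PageRank over a sentence-similarity graph, fits in a few lines (my own sketch: power iteration over a similarity matrix, with a TextRank-style overlap edge weight; the +1 inside the logs is mine, to avoid log 0 on one-word sentences):

```python
import numpy as np

def pagerank_sentences(sim, d=0.85, iters=50):
    # Power iteration of PageRank over a sentence-similarity matrix;
    # a higher score means a more "central" sentence.
    n = sim.shape[0]
    M = sim / sim.sum(axis=0, keepdims=True)  # column-normalize
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (M @ r)
    return r

def word_overlap(s1, s2):
    # TextRank-style edge weight: shared words, length-normalized
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / (np.log(len(a) + 1) + np.log(len(b) + 1))
```

The top-ranked sentences are then extracted, in document order, as the summary.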
When training a model for summarization, one can use cross-entropy (as in a language modeling task) to train the model. The offline evaluation metrics are poorly correlated with human judgment and ignore important aspects such as factual correctness. The offline evaluation metrics can be categorized into the following groups.
ROUGE: A Package for Automatic Evaluation of Summaries: ROUGE is the standard automatic evaluation measure for summarization tasks.
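At its core, ROUGE-1 is just unigram overlap; a minimal sketch of how it is computed (simplified: no stemming or stopword handling, which real ROUGE implementations offer as options):

```python
from collections import Counter

def rouge_1(candidate: str, reference: str):
    # ROUGE-1: unigram overlap between candidate and reference summary,
    # reported as recall, precision, and F1.
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())  # clipped counts, as in ROUGE
    recall = overlap / sum(r.values())
    precision = overlap / sum(c.values())
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

ROUGE-2 and ROUGE-L follow the same scheme with bigrams and longest common subsequences, respectively.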
Better Summarization Evaluation with Word Embeddings for ROUGE: instead of hard lexical matching of bigrams, ROUGE-WE uses soft matching based on the cosine similarity of word embeddings.
A Graph-theoretic Summary Evaluation for ROUGE: combines lexical and semantic matching by applying graph analysis algorithms to the WordNet semantic network.
ROUGE 2.0: Updated and Improved Measures for Evaluation of Summarization Tasks: to address ROUGE's problems, the authors propose metrics that leverage synonym dictionaries, such as WordNet, and consider all synonyms of matched words when computing token overlap (the synonym dictionary is customizable by application and domain).
METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments: based on a generalized concept of unigram matching between the machine-produced translation and human-produced reference translations. Unigrams can be matched based on their surface forms, stemmed forms, and meanings; furthermore, METEOR can be easily extended to include more advanced matching strategies. Once all generalized unigram matches between the two strings have been found, METEOR computes a score for this matching using a combination of unigram precision, unigram recall, and a measure of fragmentation that is designed to directly capture how well-ordered the matched words in the machine translation are in relation to the reference.
Deep Reinforcement Learning with Distributional Semantic Rewards for Abstractive Summarization: given that contextualized word representations (such as ELMo, BERT, GPT) have shown a powerful capacity for reflecting distributional semantics, the authors propose using a distributional semantic reward to boost reinforcement-learning-based abstractive summarization systems.
BERTScore: Evaluating Text Generation with BERT
BERTScore computes a similarity score for each token in the candidate sentence with each token in the reference sentence. However, instead of exact matches, it computes token similarity using contextual embeddings. In other words, BERTScore performs sentence-level generation evaluation by using pre-trained BERT contextualized embeddings to compute the similarity between two sentences as a weighted aggregation of cosine similarities between their tokens. BERTScore has a higher correlation with human evaluation on text generation tasks compared to existing evaluation metrics.
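The matching scheme can be sketched directly on token embeddings (a simplified sketch: it omits BERTScore's optional IDF weighting and baseline rescaling, and takes pre-computed embeddings as input):

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    # Pairwise cosine similarity between every candidate and reference
    # token embedding; each token is greedily matched to its most
    # similar counterpart in the other sentence.
    def norm(X):
        return X / np.linalg.norm(X, axis=1, keepdims=True)
    S = norm(cand_emb) @ norm(ref_emb).T  # cosine matrix
    recall = S.max(axis=0).mean()         # best match per reference token
    precision = S.max(axis=1).mean()      # best match per candidate token
    return 2 * precision * recall / (precision + recall)
```

In the real metric, `cand_emb` and `ref_emb` come from a BERT forward pass over each sentence, so the same word gets different vectors in different contexts.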
Paper from Google

With T5, the authors propose reframing all NLP tasks into a unified text-to-text format where the input and output are always text strings, in contrast to BERT-style models that can only output either a class label or a span of the input. The text-to-text framework allows using the same model, loss function, and hyperparameters on any NLP task, including machine translation, document summarization, question answering, and classification tasks (e.g., sentiment analysis). One can even apply T5 to regression tasks by training it to predict the string representation of a number instead of the number itself.
Findings:
The authors designed a pre-training self-supervised objective (called gap-sentence generation) for Transformer encoder-decoder models to improve fine-tuning performance on abstractive summarization. The hypothesis is that the closer the pre-training self-supervised objective is to the final downstream task, the better the fine-tuning performance.
In PEGASUS pre-training, several whole sentences are removed from documents and the model is tasked with recovering them. An example input for pre-training is a document with missing sentences, while the output consists of the missing sentences concatenated together. This is an incredibly difficult task that may seem impossible, even for people, and we don’t expect the model to solve it perfectly. However, such a challenging task encourages the model to learn about language and general facts about the world, as well as how to distill information taken from throughout a document in order to generate output that closely resembles the fine-tuning summarization task. The advantage of this self-supervision is that you can create as many examples as there are documents, without any human annotation, which is often the bottleneck in purely supervised systems.
The authors found that choosing "important" sentences to mask worked best, making the output of self-supervised examples even more similar to a summary. They automatically identified these sentences by finding those that were most similar to the rest of the document according to the ROUGE metric. Similar to T5, the model is pre-trained on a very large corpus of web-crawled documents and then fine-tuned on 12 public downstream abstractive summarization datasets, resulting in new state-of-the-art results as measured by automatic metrics, while using only 5% of the number of parameters of T5. The datasets were chosen to be diverse, including news articles, scientific papers, patents, short stories, e-mails, legal documents, and how-to directions, showing that the model framework is adaptable to a wide variety of topics.
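The gap-sentence selection step can be sketched as follows (my own simplified sketch: it scores sentences with plain ROUGE-1 F1 against the rest of the document and masks the top-m, ignoring PEGASUS's sequential selection variants):

```python
from collections import Counter

def rouge1_f(a: str, b: str) -> float:
    ca, cb = Counter(a.split()), Counter(b.split())
    o = sum((ca & cb).values())
    if o == 0:
        return 0.0
    p, r = o / sum(ca.values()), o / sum(cb.values())
    return 2 * p * r / (p + r)

def select_gap_sentences(sentences, m=1):
    # Score each sentence by ROUGE-1 F1 against the rest of the document;
    # mask the top-m "important" sentences, which become the generation
    # target (so pre-training resembles summarization).
    scores = []
    for i, s in enumerate(sentences):
        rest = " ".join(sentences[:i] + sentences[i + 1:])
        scores.append((rouge1_f(s, rest), i))
    top = {i for _, i in sorted(scores, reverse=True)[:m]}
    masked = ["[MASK]" if i in top else s for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in sorted(top))
    return masked, target
```

The masked document is the encoder input and the concatenated removed sentences are the decoder target, exactly mirroring the fine-tuning setup.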
AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
Authors: Xinsong Zhang, Hang Li
ByteDance AI Lab
Year: August 2020
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5)
Authors: Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu
Year: July 2020
Pre-training via Paraphrasing (MARGE)
Authors: Mike Lewis, Marjan Ghazvininejad, Gargi Ghosh, Armen Aghajanyan, Sida Wang, Luke Zettlemoyer
Year: June 2020
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators.
Authors: Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning.
Google and Stanford
Year: March 2020
Generalization through Memorization: Nearest Neighbor Language Models
Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, Mike Lewis
Facebook and Stanford
Presentation in ACL 2020, “Beyond BERT” by Mike Lewis
Year: Feb 2020
ConveRT: Efficient and Accurate Conversational Representations from Transformers
Authors: Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Tsung-Hsien Wen, Ivan Vulić
PolyAI
Year: Nov 2019
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Authors: Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, Luke Zettlemoyer
Year: October 2019
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Authors: Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut
Google and Toyota Technological Institute
Year: September 2019
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Authors: Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov
UW and Facebook
Year: July 2019
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le
CMU and Google
Year: June 2019
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Authors: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google
Year: May 2019
Cross-lingual Language Model Pretraining
Authors: Guillaume Lample, Alexis Conneau
Year: January 2019
Improving Language Understanding by Generative Pre-Training
Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever
OpenAI
Year: June 2018
Deep contextualized word representations (ELMo)
Authors: Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee and Luke Zettlemoyer
Allen Institute for Artificial Intelligence and UW
Year: March 2018
Attention Is All You Need
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
Year: Dec 2017
Paper from Google

The transformer era began with this paper from Google. The architecture consists of an encoder and a decoder block to solve machine translation.
Positional encoding: using sin and cos functions, the earlier dimensions have smaller wavelengths and capture short-range offsets, while the later dimensions capture longer-distance offsets. This blog does a great job of explaining positional encoding.
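The sinusoidal encoding from the paper is easy to compute directly; a minimal numpy sketch:

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    # Sinusoidal encodings from "Attention Is All You Need":
    #   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    #   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    # Early dimensions oscillate fast (short-range offsets); later
    # dimensions oscillate slowly (long-range offsets).
    pos = np.arange(n_pos)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe
```

The matrix is simply added to the token embeddings before the first layer, giving the otherwise order-agnostic attention mechanism a sense of position.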
Transformer blocks are characterized by a multi-head self-attention mechanism, a position-wise feed-forward network, layer normalization modules, and residual connections. The input to the Transformer model is typically a tensor of shape B × N, where B is the batch size and N the sequence length.
The residual connections around the self-attention blocks are also efficient at carrying positional information up to the top layers. For a deeper understanding of transformers, I recommend reading the original paper, this blog post by Rémi Louf, and The Annotated Transformer by Alexander Rush. After “Attention Is All You Need”, BERT from Google and GPT from OpenAI were introduced, which I will explain later in this post.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Paper from Google

BERT is based on subword token encoding and a multi-layer transformer architecture. The transformer blocks are the same as the original blocks introduced in the “Attention Is All You Need” paper; however, BERT only uses the encoder part of the transformer architecture (which is why it is not suitable for text generation tasks).
BERT uses a huge corpus of data for pre-training the model on a self-supervised task, masked language modeling: tokens in the text are masked and the model should predict them. BERT selects 15% of the tokens. Of these selected tokens, 80% are actually replaced by [MASK], 10% are replaced by a random token, and 10% are left unchanged, and the model is expected to predict all of the selected tokens (the loss of all these predictions is backpropagated). See my colab notebook for the code of masking.
You may ask, why not mask all of the selected 15% of tokens? The reason is that if all of the selected tokens were masked, the model would only learn to represent masked positions and ignore the rest. When some tokens are replaced by a random token or left unchanged, the model needs to make a prediction for every single token, because it has no clue which ones were replaced by a random token and which are original, so it makes an effort to predict all the tokens and learn from all of them. After pretraining, the model can be fine-tuned on many language understanding tasks such as translation, NER, QA, and text classification.
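The 80/10/10 scheme above can be sketched in a few lines of Python (function and variable names are mine, not from the BERT codebase):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15, mask_token="[MASK]"):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random vocabulary token, 10% stay unchanged. Returns the
    corrupted sequence and the target labels (None at unselected positions)."""
    corrupted, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token
            r = random.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(random.choice(vocab))
            else:
                corrupted.append(tok)  # unchanged, but still predicted
        else:
            labels.append(None)        # no loss at this position
            corrupted.append(tok)
    return corrupted, labels
```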
One of the disadvantages of BERT is that BERT fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent.
Improving Language Understanding by Generative Pre-Training
Paper from OpenAI

All GPTs use only the transformer decoder (and not the encoder part). In GPT, the model is first pre-trained on a language modeling task (causal LM) and then fine-tuned on the final task. They found that including language modeling as an auxiliary objective during fine-tuning helped learning by (a) improving generalization of the supervised model, and (b) accelerating convergence. So the final objective is L3(C) = L2(C) + λ ∗ L1(C), in which L2 is the objective for labeled data and L1 is the objective for LM. Overall, larger datasets benefit from the auxiliary objective ( λ ∗ L1(C) ) but smaller datasets do not.
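A minimal numpy sketch of the two points above (function names are mine): the causal, lower-triangular attention mask that makes a decoder-only model a left-to-right language model, and the combined fine-tuning objective L3(C) = L2(C) + λ ∗ L1(C):

```python
import numpy as np

def causal_mask(n):
    """Decoder-only (causal) attention mask: position i may attend only to
    positions j <= i, which is what makes the model a left-to-right LM."""
    return np.tril(np.ones((n, n), dtype=bool))

def finetune_objective(task_loss, lm_loss, lam=0.5):
    """GPT-1 fine-tuning objective L3 = L2 + λ·L1: the supervised task loss
    plus the auxiliary language-modeling loss, weighted by λ."""
    return task_loss + lam * lm_loss
```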
For training, they used a modified version of L2 regularization with w = 0.01 on all non-bias and non-gain weights. For the activation function, they used GELU. They also learned the position embeddings instead of using the sinusoidal version. For fine-tuning, 3 epochs were sufficient for most tasks, and they set λ = 0.5.
GPT1 was trained on BookCorpus (about 7K unpublished books); the context size (number of tokens) was 512.
GPT2 was trained on outbound links from Reddit (articles linked from Reddit) that received at least 3 karma (keeping only posts with a high number of upvotes). This gave 45M links; after filtering (removing Wikipedia pages, etc.) they ended up with 8M webpages, 40GB of text (around 10B tokens). The GPT2 model has 1.5B parameters (with 48 layers), the vocabulary size is 50,257, the context size is 1024 tokens (remember BERT's was 512), and the batch size is 512. GPT2 achieves SOTA on 7 out of 8 tested language modeling datasets.
Differences between GPT and GPT2: layer normalization was moved to the input of each sub-block, similar to a residual unit of type “building block” (unlike the original “bottleneck” type, which has batch normalization applied before the weight layers). An additional layer normalization was added after the final self-attention block. A modified initialization was used as a function of the model depth: the weights of residual layers were scaled at initialization by a factor of 1/√n, where n is the number of residual layers. GPT2 also uses a larger vocabulary and context size.
GPT2 data compression: tokens/parameters = 10B/1.5B = 6.66. This means that in GPT2 there is one parameter for every 6.66 tokens.
GPT3 has 175B parameters and is trained on 499B tokens (from Common Crawl, WebText2, Books1, Books2, and Wikipedia). The 175 billion parameters need 175 × 4 = 700GB of memory (a floating-point number needs 4 bytes).
GPT3 data compression: tokens/parameters = 499B/175B = 2.85. GPT3 has lower data compression compared to GPT2, which raises the question of whether, with this number of parameters, the model functions by memorizing the training data and pattern matching at inference.
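The back-of-envelope numbers from the last few paragraphs, as a quick sanity check (figures are the ones quoted in the text):

```python
# Parameters and training tokens quoted above, plus fp32 memory (4 bytes each).
gpt2_params, gpt2_tokens = 1.5e9, 10e9
gpt3_params, gpt3_tokens = 175e9, 499e9

gpt2_ratio = gpt2_tokens / gpt2_params   # ~6.67 tokens per parameter
gpt3_ratio = gpt3_tokens / gpt3_params   # ~2.85 tokens per parameter

bytes_per_fp32 = 4
gpt3_memory_gb = gpt3_params * bytes_per_fp32 / 1e9  # 700 GB
```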
GPT-3 shows that it is possible to improve the performance of a model by “simply” increasing the model size, and in consequence, the dataset size and the computation (TFLOP) the model consumes. However, as the performance increases, the model size has to increase more rapidly. Precisely, the model size varies as some power of the improvement of model performance. Remember language model performance scales as a power-law of model size, dataset size, and the amount of computation.
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Paper from Carnegie Mellon and Google Research

Similar to BERT, XLNet uses BooksCorpus and English Wikipedia (13 GB of plain text). In addition, the authors include Giga5 (16 GB), ClueWeb (19 GB after filtering), and Common Crawl (110 GB after filtering) for pretraining. In total, they have 32.89B tokens.
XLNet-Large is similar to BERT-Large in model size, and they use sequences of 512 tokens. The authors observe that with a batch size of 8192 it took 5.5 days to train, and the model still underfits the data. XLNet achieves SOTA on 18 out of 20 NLP tasks.
XLNet combines the bidirectional capability of BERT with the autoregressive technology of Transformer-XL. Remember that the disadvantage of BERT is that it fails to model the joint probability of the predicted tokens, i.e. it assumes that predicted tokens ([MASK]s) are independent. AR models eliminate this independence assumption (the shortcomings of BERT are also resolved in T5, with a less complicated approach than XLNet's).
XLNet has the advantages of both AR (auto-regressive) and AE (auto-encoding) models. Since an AR language model is only trained to encode a unidirectional context (either forward or backward), it is not effective at modeling deep bidirectional contexts. On the contrary, downstream language understanding tasks often require bidirectional context information. This results in a gap between AR language modeling and effective pretraining. Instead of using a fixed forward or backward factorization order as in conventional AR models, XLNet maximizes the expected log-likelihood of a sequence w.r.t. all possible permutations of the factorization order. Thanks to the permutation operation, the context for each position can consist of tokens from both left and right. In expectation, each position learns to utilize contextual information from all positions, i.e., it captures bidirectional context.
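To make the permutation idea concrete, here is a small numpy sketch (illustrative only, not XLNet's actual two-stream implementation) that samples one factorization order and builds the attention mask it induces:

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_mask(n):
    """Sample a factorization order and build the attention mask it induces:
    the token at position order[t] may attend to the tokens at positions
    order[0..t-1]. Averaged over many sampled permutations, every position
    ends up seeing context from both its left and its right."""
    order = rng.permutation(n)
    mask = np.zeros((n, n), dtype=bool)
    for t in range(n):
        mask[order[t], order[:t]] = True
    return order, mask

order, mask = permutation_mask(5)
```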
XLNet uses the following techniques: permutation language modeling, two-stream self-attention, and Transformer-XL's segment-level recurrence with relative positional encodings.
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Paper from UW and Facebook AI
In the RoBERTa paper, the authors first performed an ablation study on how different parts of BERT affect end-task performance. They observed that a larger batch size increases performance, that NSP doesn't affect performance (sometimes removing NSP increases it), and, comparing static vs. dynamic masking, that dynamic masking improves performance. They gathered all these learnings and proposed RoBERTa (Robustly optimized BERT approach). Specifically, RoBERTa is trained with dynamic masking, FULL-SENTENCES without NSP loss, large mini-batches, and a larger byte-level BPE.
So in short, RoBERTa provides a better pre-trained model than BERT: the authors realized BERT was greatly undertrained, so they used 160GB of text instead of the 16GB dataset originally used to train BERT, and a batch size of 8K instead of 256 in the original BERT base model. They also removed the next sentence prediction objective from the training procedure, as they realized it does not help the model. RoBERTa matches XLNet models on the GLUE benchmark and sets a new state of the art on 4 of the 9 individual GLUE tasks. They also match SOTA on SQuAD and RACE.
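The static-vs-dynamic distinction can be sketched as follows (illustrative functions with names of my choosing, not RoBERTa's actual data pipeline): static masking fixes the masked positions once during preprocessing, while dynamic masking re-samples them every time the sequence is seen:

```python
import random

def sample_mask_positions(tokens, mask_prob=0.15):
    """Sample which positions to mask for one pass over the sequence."""
    return [i for i in range(len(tokens)) if random.random() < mask_prob]

def dynamic_masking(tokens, num_epochs, mask_prob=0.15):
    """Static masking (original BERT) calls sample_mask_positions once and
    reuses the result for every epoch; dynamic masking (RoBERTa) re-samples
    per epoch, so each epoch trains on a different corruption of the text."""
    return [sample_mask_positions(tokens, mask_prob) for _ in range(num_epochs)]
```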
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Paper from Google Research and Toyota Technological Institute
ALBERT incorporates two parameter-reduction techniques that lift the major obstacles in scaling pre-trained models: factorized embedding parameterization and cross-layer parameter sharing.
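One of ALBERT's parameter-reduction techniques is factorizing the embedding matrix: instead of a direct V × H embedding table, it uses a V × E table followed by an E × H projection, with E much smaller than H. A quick parameter count (sizes below are illustrative, not the exact ALBERT configuration) shows the savings:

```python
# Factorized embedding parameterization: V x H becomes V x E plus E x H.
V, H, E = 30000, 4096, 128   # vocab size, hidden size, embedding size

direct = V * H               # one big lookup table
factorized = V * E + E * H   # two much smaller matrices
```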
Inter-sentence coherence loss
To further improve the performance of ALBERT, they also introduce a self-supervised loss for sentence-order prediction (SOP). This forces the model to learn finer-grained distinctions about discourse-level coherence properties. SOP is a much more difficult task than NSP, and a model trained on SOP can solve the NSP task to a reasonable degree. Remember that in NSP the two (negative) sentences/segments differ in topic and are not coherent. In SOP, by contrast, only the order of the sentences is swapped: the topic stays the same but the segments are not coherent, which makes the task harder (topic prediction is easier to learn than coherence prediction, and it overlaps with what is already learned through the MLM loss).
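A minimal sketch of how SOP training pairs can be built (the function is mine, not from the ALBERT codebase): positives are two consecutive segments in their original order, negatives are the same two segments swapped, so the topic is identical and only coherence distinguishes the labels:

```python
import random

def sop_example(seg_a, seg_b):
    """Build one SOP instance from two consecutive segments:
    label 1 = original order (coherent), label 0 = swapped order.
    The topic is the same either way, forcing the model to learn coherence."""
    if random.random() < 0.5:
        return (seg_a, seg_b), 1
    return (seg_b, seg_a), 0
```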
ALBERT doesn't use dropout. ALBERT v2 throws light on the fact that many assumptions taken for granted are not necessarily true: the regularization effect of parameter sharing in ALBERT is so strong that dropout is not needed. (ALBERT v1 had dropout.)
Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Paper from Facebook AI

BART is a denoising autoencoder for pretraining sequence-to-sequence models. It is trained by (1) corrupting text with an arbitrary noising function, and (2) learning a model to reconstruct the original text. It uses a standard Transformer-based neural machine translation architecture.
They evaluate a number of noising approaches, finding the best performance from both randomly shuffling the order of the original sentences and using a novel in-filling scheme, where arbitrary-length spans of text are replaced with a single mask token. BART is particularly effective when fine-tuned for text generation but also works well for comprehension tasks.
BART contains roughly 10% more parameters than the equivalently sized BERT model.
Pretraining noising functions:
Token Masking: random tokens are masked, as in BERT.
Token Deletion: random tokens are deleted from the input; the model must decide which positions are missing inputs.
Text Infilling: a number of text spans are sampled, and each span is replaced with a single [MASK] token.
Sentence Permutation: a document is divided into sentences based on full stops, and these sentences are shuffled in random order.
Document Rotation: a token is chosen uniformly at random, and the document is rotated so that it begins with that token.
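Two of these noising functions are easy to sketch (illustrative Python, not the BART codebase; the paper draws infill span lengths from a Poisson with λ = 3, which I replace with a uniform draw to keep the sketch dependency-free):

```python
import random

def sentence_permutation(document):
    """Sentence Permutation: split on full stops and shuffle the sentences."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def text_infilling(tokens, max_span=3, mask_token="[MASK]"):
    """Text Infilling: replace one sampled span (possibly of length 0) with a
    single [MASK] token, so the model must also predict how many tokens are
    missing."""
    tokens = list(tokens)
    length = random.randint(0, max_span)
    start = random.randint(0, max(0, len(tokens) - length))
    tokens[start:start + length] = [mask_token]
    return tokens
```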
AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
Paper by ByteDance

AMBERT proposes a simple twist to BERT: tokenize the input twice, once with a fine-grained tokenizer (subword or word level), and once with a coarse-grained tokenizer (phrase level).
The two inputs share a BERT encoder. A forward pass through the model consists of the following two steps:
1. Text tokens → token embeddings (via separate weights): each list of tokens (one fine-grained, one coarse-grained) is looked up in its own embedding matrix and turned into a list of real-valued vectors.
2. Token embeddings → contextual embeddings (via shared weights): the two lists of vectors are fed into the same BERT encoder (a stack of Transformer layers), either sequentially through a single encoder copy or in parallel through two encoder copies with tied parameters. This results in two lists of per-token contextual embeddings.
Because AMBERT uses two embedding matrices (one for each tokenization), its parameter count is notably higher than BERT's (194M vs 110M for the English Base models). However, latency remains relatively unchanged, since AMBERT only adds a new set of dictionary lookups.
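A toy numpy sketch of these two steps (the vocabularies, sizes, and the single-projection "encoder" are stand-ins of my own; the point is two separate embedding tables feeding one set of shared weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                            # hidden size (toy)
fine_vocab = {"new": 0, "york": 1, "times": 2}   # fine-grained: words
coarse_vocab = {"new york": 0, "times": 1}       # coarse-grained: phrases

# Step 1: separate embedding matrices, one per tokenization.
fine_emb = rng.normal(size=(len(fine_vocab), d))
coarse_emb = rng.normal(size=(len(coarse_vocab), d))

# Step 2: shared encoder weights (a single projection stands in for the
# stack of Transformer layers; both streams use the SAME matrix).
shared_W = rng.normal(size=(d, d))

def shared_encoder(x):
    return x @ shared_W

fine_ids = [fine_vocab[t] for t in ["new", "york", "times"]]
coarse_ids = [coarse_vocab[t] for t in ["new york", "times"]]

fine_ctx = shared_encoder(fine_emb[fine_ids])        # (3, d) contextual vectors
coarse_ctx = shared_encoder(coarse_emb[coarse_ids])  # (2, d) contextual vectors
```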
In order to understand this post, it's helpful to first read about Hidden Markov Models and the Gibbs distribution. This video explains the basics of HMMs, and this video by Daphne Koller is a very good introduction to the Gibbs distribution.
Now let's see why we need CRFs.

From the paper
NER is generally formulated as a sequence labeling problem with a multi-layer perceptron + softmax layer as the tag decoder. The architecture can be a bidirectional LSTM with a softmax layer on top. The softmax layer predicts the tag for each token (e.g. B-PER, I-LOC). One shortcoming of using a softmax layer on top is that each token's tag is predicted independently, so it is possible that the sequence of tags is inconsistent, for example B-PER, I-LOC, O, O, I-PER.
To overcome this shortcoming, we can use a Conditional Random Field (CRF) layer on top instead of a softmax layer. When the CRF predicts the tag for a token, it takes the surrounding tokens and their tags into account as well. More formally, a Conditional Random Field (CRF) is a standard model for predicting the most likely sequence of labels that corresponds to a sequence of inputs.
Formally, we take the sequence of hidden states h = (h1, h2, …, hn) as the input to the CRF layer, and its output is the final predicted label sequence y = (y1, y2, …, yn), where each yi is in the set of all possible labels. We denote Y(h) as the set of all possible label sequences. Then the conditional probability of the output sequence given the input hidden-state sequence is

p(y | h) = ∏i exp(W(yi−1, yi) · hi + b(yi−1, yi)) / Σy′∈Y(h) ∏i exp(W(y′i−1, y′i) · hi + b(y′i−1, y′i))

where W and b are the two weight matrices and the subscript indicates that we extract the weight vector (and bias) for the given label pair (yi−1, yi). To train the CRF layer, we use classic maximum conditional likelihood estimation: the final log-likelihood with respect to the weight matrices is

L(W, b) = Σj log p(y(j) | h(j))

summed over the training pairs (h(j), y(j)).
Finally, we can adopt the Viterbi algorithm to decode the optimal output sequence y.
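Here is a compact numpy version of Viterbi decoding over per-token emission scores and label-pair transition scores (a sketch of the standard algorithm, not any particular paper's code):

```python
import numpy as np

def viterbi(emissions, transitions):
    """Viterbi decoding for a linear-chain CRF.
    emissions: (n, L) matrix of per-token label scores.
    transitions: (L, L) matrix of label-pair scores (prev label -> next label).
    Returns the highest-scoring label sequence as a list of label indices."""
    n, L = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((n, L), dtype=int)   # back-pointers
    for i in range(1, n):
        # cand[j, k] = best path ending in j at i-1, then transition j->k
        cand = score[:, None] + transitions + emissions[i][None, :]
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # follow back-pointers from the best final label
    best = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```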
For a very nice explanation of CRF, I highly recommend watching these videos by Hugo Larochelle
Paper: Distilling the Knowledge in a Neural Network
We have built sophisticated models that solve complex problems such as natural language inference and common-sense reasoning. However, these large, high-performing models come with their own costs. They need a huge amount of computational resources (such as GPU memory) and are slow (depending on the available computation), and hence cannot be run on low-resource devices such as mobile phones. When computational resources are limited, a model can be made smaller by parameter sharing, quantization, and knowledge distillation. Knowledge distillation is a very successful model compression method in which a small model is trained to mimic a pre-trained, larger model.
In distillation, a large trained model (teacher) is distilled into a smaller model (student). The important point is the cost function of the student model:

L(x; W) = α · H(y, σ(zs; T=1)) + β · H(σ(zt; T), σ(zs; T))

where x is the input, W are the student model parameters, y is the ground-truth label, H is the cross-entropy loss function, σ is the softmax function parameterized by the temperature T, and α and β are coefficients. zs and zt are the logits of the student and teacher respectively.
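A numpy sketch of this loss (names are mine; a real implementation would use a framework's log-softmax for numerical stability, and Hinton's paper additionally scales the soft term's gradients by T²):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T gives a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(z_s, z_t, y, T=2.0, alpha=0.5, beta=0.5):
    """alpha * cross-entropy(student prediction, ground truth at T=1)
       + beta * cross-entropy(teacher's and student's T-softened outputs)."""
    hard = -np.log(softmax(z_s)[y])                 # H(y, σ(z_s; T=1))
    p_teacher = softmax(z_t, T)
    log_p_student = np.log(softmax(z_s, T))
    soft = -(p_teacher * log_p_student).sum()       # H(σ(z_t; T), σ(z_s; T))
    return alpha * hard + beta * soft
```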

Why distillation works: In my opinion there are two reasons distillation works so well: transferring (1) dark knowledge and (2) the winning lottery tickets.
Transferring dark knowledge: Remember the cost function of the student model: cross-entropy(student's prediction, ground truth) + cross-entropy(student's temperature-softened prediction, teacher's temperature-softened prediction).
The first part is the usual cross-entropy; the second part transfers the dark knowledge from the teacher model to the distilled model. What is dark knowledge? For example, if the model is classifying images, the input is an image and the true label, e.g. a cat. The output of the large model, however, is a distribution of probabilities over the classes. For the image of a cat, the model gives roughly 0 probability to the image being a computer, but it may produce a 0.2 probability for the image being a lion or a dog. Hence, its output carries much more information (AKA dark knowledge) than the hard labels. This knowledge is transferred to the distilled model via the second cross-entropy term. So, in addition to the available class labels, the student model can use the soft probabilities (the softened logits) of the teacher model. This has the effect of providing much more information for each input example and makes the training signal for the student model much richer.
(2) Learning the winning lottery ticket: Often in knowledge distillation, the student model's intermediate layers are forced to produce outputs similar to the teacher's intermediate layers. For example, if the teacher is a 12-layer BERT and the student is a 6-layer BERT, the output of the first layer of the distilled model should look like the output of the 2nd layer of the teacher BERT. Why does this help? Remember the “Lottery Ticket Hypothesis” paper: a very large model tries many, many representations and embeddings, and because it tries so many, the ones that actually help the ML task end up embedded in the trained model. Hence, the large/teacher model contains the results of all those trials and errors, including the useful representations. By forcing its intermediate representations to look like the intermediate representations of the large model, the distilled model is forced to pick up only the useful representations (the winning lottery tickets) from the larger model.