Introduction
Word embeddings suffer from a “meaning conflation deficiency”: they fail to effectively capture the distinct meanings of polysemous words. This causes words such as “wrong” and “left” to be clustered too closely in embedding space by analogy with different senses of “right.” The paper “PolyLM: Learning about Polysemy through Language Modeling” addresses this problem by learning embeddings for the multiple “senses” of a word through unsupervised learning.
Through the debiasing lab, we became aware that word embeddings learn subtle interconnections between words, much like humans do. In linguistics, the priming of “wrong” by “left” through their shared connection to “right” is called mediated semantic priming. We chose to implement this paper because it tries to disambiguate word senses at the input level, which could have implications for how humans and machines “understand” language.
Related Work
Asaf Amrami and Yoav Goldberg address the same problem in their 2019 paper “Towards better substitution-based word sense induction.” They use substitute vectors: for each focus word, they learn the set of words most likely to have occurred in its place, then cluster these sets to represent the different senses of the focus word.
We have found a TensorFlow implementation of the model by the paper’s authors, but we intend to implement the paper in PyTorch.
Data
We plan to use the WikiText dataset or the WMT 2011 News Crawl dataset as our training corpus. We also plan to test the model on word sense induction (WSI) datasets such as SemEval-2010 Task 14 and SemEval-2013 Task 13.
Methodology
The model described in the paper has several layers for processing input, disambiguating sense embeddings, and predicting masked tokens. It is compatible with any contextualizer (transformer model); we will most likely use BERT or ELMo. The loss function combines a language modeling loss, a distinctness loss, and a match loss, which respectively ensure that the model predicts the correct target token, learns distinct senses, and keeps its disambiguation sense probabilities close to its prediction sense probabilities.
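To make the three-part objective concrete, here is a minimal sketch of how such losses could be combined in PyTorch. The KL-divergence match term and the entropy-based distinctness term are our illustrative stand-ins; the paper's exact formulations and weighting schedule differ.

```python
import torch
import torch.nn.functional as F

def combined_loss(lm_loss, disamb_probs, pred_probs,
                  lambda_d=0.1, lambda_m=0.1):
    """Illustrative combination of PolyLM-style losses.

    lm_loss:      scalar language-modeling loss (e.g. cross-entropy)
    disamb_probs: (batch, n_senses) sense probs from the disambiguation layer
    pred_probs:   (batch, n_senses) sense probs implied by prediction
    """
    # Match loss: push the disambiguation distribution toward the
    # prediction-implied distribution (KL divergence as a stand-in).
    match = F.kl_div(disamb_probs.clamp_min(1e-9).log(), pred_probs,
                     reduction="batchmean")
    # Distinctness loss: penalize high-entropy sense distributions so
    # superfluous senses fade out (illustrative entropy penalty).
    entropy = -(disamb_probs * disamb_probs.clamp_min(1e-9).log()).sum(-1).mean()
    return lm_loss + lambda_d * entropy + lambda_m * match
```

In practice the three terms would be weighted and annealed over training, as the hyperparameters here only gesture at.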
The most difficult part of implementing the model will most likely be training the contextualizer. It is not possible to use pretrained contextualizers, since their word embeddings need to match our sense embeddings. The original paper trained the model over 6M batches; we can probably only manage a small fraction of that.
Metrics
The authors of the paper evaluated PolyLM on human-labeled word sense induction (WSI) datasets. These datasets consist of passages containing one of a set of polysemous focus words whose senses have been labeled by humans. The authors measured performance using paired F-score, V-Measure, Fuzzy B-Cubed, and Fuzzy Normalized Mutual Information, and compared their results to those of Amrami and Goldberg (see above). Because we do not have the resources to train a BERT-Large model, we will instead reimplement the code in PyTorch and compare its performance to the original TensorFlow implementation. We expect to get similar results on the WSI datasets with the same amount of training. We then plan to adjust the architecture to see whether we can get better results with fewer parameters, one of the benefits of PolyLM compared to previous state-of-the-art systems.
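For intuition, here is a simplified sketch of the paired F-score for the hard-clustering case: a clustering is represented by the set of instance pairs it places in the same cluster, and precision/recall are computed over those pairs. The actual SemEval datasets use graded, multi-label sense annotations, which require a fuzzier variant than this.

```python
from itertools import combinations

def paired_f_score(gold, pred):
    """Paired F-score for hard clusterings (simplified sketch).

    gold, pred: dicts mapping instance id -> cluster label.
    """
    def pairs(labeling):
        # All unordered pairs of instances sharing a cluster label.
        return {frozenset(p) for p in combinations(sorted(labeling), 2)
                if labeling[p[0]] == labeling[p[1]]}
    gold_pairs, pred_pairs = pairs(gold), pairs(pred)
    if not gold_pairs or not pred_pairs:
        return 0.0
    tp = len(gold_pairs & pred_pairs)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_pairs)
    recall = tp / len(gold_pairs)
    return 2 * precision * recall / (precision + recall)
```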
Ethics
As the landscape around machine learning and language modeling shifts every day, it is extremely important to discuss the possible deficiencies of word embeddings and how they might bias the outputs of these increasingly popular models. As we learned in the debiasing lab, language models can easily be swayed to output correlations that do not reflect an even-handed or ethical worldview. Our paper does not seek to solve this issue, but it can give us more insight into how we might go about doing so. By extracting distinct senses and associations, we can highlight associations we deem problematic or inaccurate, and address those issues with a finer-toothed comb. As language models become ubiquitous in daily life, the algorithm we are referencing has the potential to impact millions; mistakes in such an algorithm can create problematic associations that quickly diffuse into the way we perceive language.
Division of Labor
Max - Write polylm.py in PyTorch
Jason - Write polylm.py in PyTorch
Claire - Write bert.py in PyTorch
Chloe
- Try training the original model on OSCAR to estimate how long it would take to train an LLM from scratch
- Try preprocessing the original data using the data.py script
- Try testing the original model using the SemEval 2010 and 2013 WSI datasets
Here is our reflection for Check-In #3.
Poster:
CODE:
https://github.com/qiaochloe/polylm
Reflection/Writeup:
Title: Box Jellyfish: “PolyLM: Automatically Disambiguating Senses from Word Tokens”
Who: Chloe Qiao, Jason Lin, Max Guo, Claire Robertson
Introduction: Traditional word embeddings often conflate multiple senses of a word into a single vector, limiting their effectiveness in tasks like word sense induction (WSI). To tackle this issue, we have implemented an architecture called PolyLM (Ansell et al. 2021). PolyLM treats learning sense embeddings as a language modeling problem, leveraging two key assumptions: first, that the probability of a word in a context is the sum of its sense probabilities, and second, that word senses are predictable from context. PolyLM's architecture trains a language model to predict a masked word by summing the probabilities of its senses. The model thus learns multiple embeddings in tandem for the distinct senses of a word, sidestepping the meaning conflation deficiency: naïve architectures cluster tokens like “wrong” and “left” unnaturally closely in the embedding space by analogy with different senses of “right.” Avoiding this deficiency lets the senses of polysemous words cluster at distinct positions in the embedding space, improving performance in context.
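The first assumption can be sketched in a few lines: if every sense of every word has its own embedding and the softmax runs over the full sense vocabulary, the probability of a word is just the sum over its senses. Shapes and names below are illustrative, not the paper's exact interface.

```python
import torch

def word_probs(context_vec, sense_embs, word_to_senses):
    """P(word | context) as the sum of its sense probabilities.

    sense_embs:     (S, d) one row per sense of every word in the vocab
    context_vec:    (d,) contextualized representation of the masked slot
    word_to_senses: word -> list of row indices into sense_embs
    """
    # Softmax over the *entire* sense inventory, not per word.
    sense_probs = torch.softmax(sense_embs @ context_vec, dim=0)  # (S,)
    return {w: sense_probs[ix].sum().item()
            for w, ix in word_to_senses.items()}
```

Because the softmax normalizes over all senses, the resulting word probabilities still sum to one across the vocabulary.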
The PolyLM architecture achieves very high performance with far fewer parameters than state-of-the-art models, making it a viable tool for researchers and practitioners. The original Ansell et al. implementation was written in TensorFlow. In our implementation, we first ported the architecture to PyTorch, then modified the model to fit within our constraints (OSCAR GPU availability and our compute budget). We trained the model for roughly 2 hours and successfully replicated the results of Ansell et al. (2021).
Methodology: To train our model, we downloaded the Simple English Wikipedia dataset (Coster and Kauchak 2011), a repository of text files extracted from Simple English Wikipedia articles. We first ran this dataset through the original PolyLM implementation’s preprocessing package. This produced a split and tokenized dataset suitable for training.
Before beginning our own implementation, we ran this data through the TensorFlow implementation from the Ansell et al. paper. This step served as a benchmark to confirm that two goals were attainable: first, that the model could train at a reasonable pace, and second, that it would perform as advertised. Though it took some time, we were able to run the final model of the PolyLM paper and its BERT-integrated variant on the OSCAR supercomputer system with relatively low loss.
After confirming that the original model was effective, we began the conversion of our model from TensorFlow to PyTorch. This process was especially challenging because much of the original implementation was dependent on processes unique to the TensorFlow library, such as variable scope. This forced us to take a radically different approach to some parts of our code.
In addition, the Ansell et al. implementation used manual gradient descent and performed cross-entropy calculations without relying on built-in TensorFlow or NumPy routines, which complicated the program design considerably. Most of our effort was therefore spent simplifying and debugging the TensorFlow implementation prior to reimplementing it. After this complete overhaul, our model has diverged so greatly from the original that the two are nearly unrecognizable as related.
Finally, we trained our implementation on the Simple Wikipedia data (2,000 batches) for 2 hours. Once training had concluded, we analyzed the effectiveness of our model using PCA to visualize our output embeddings.
Results: Loss for PolyLM is separated into three distinct losses that address different components of polysemy: language modeling loss, distinctness loss, and match loss. Language modeling loss is self-explanatory, measuring the accuracy of the model's token predictions in a very general sense. Distinctness loss works to phase out superfluous senses, since not every word is guaranteed to use the preset number of senses in the model. Finally, match loss keeps the disambiguation module's sense probabilities close to the sense probabilities implied by prediction.
Our model was effective at minimizing total loss, stabilizing at ~0.03 after 500 batches. This suggests that we created a model with accurate language modeling capabilities that can also split words into contextually significant senses with a reasonable probability distribution.
By mapping our embedding weights and interpreting them via principal component analysis (PCA), we can examine how the embeddings cluster and thereby how the model separates words. As shown above, PolyLM produced very little separation between individual words, so we conclude that the model was clustering words as we intended and spreading its embeddings effectively to represent them. This was good news.
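The PCA projection we used for these plots can be reproduced with a few lines of plain NumPy (via SVD, so the sketch carries no sklearn dependency); the variable names are illustrative.

```python
import numpy as np

def pca_2d(embeddings):
    """Project embedding rows to 2-D with PCA for plotting.

    embeddings: (n_senses, d) array of sense-embedding rows.
    """
    # Center the data, then take the top-2 right singular vectors,
    # which are the first two principal axes.
    X = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T  # (n_senses, 2) coordinates for a scatter plot
```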
WSI portion: After much difficulty, we were able to integrate the WSI component of the PolyLM model. This enabled us to achieve an F-score of 0.634.
In addition, it generated some gem sentences, quoted verbatim below:

"johnson never would have believe she have a son that age . mrs. roebuck think johnson be a " sweet bawh t ' - lah lahk thet " , but her herman be get to be a man , there be no get around it . " just befoh he leave foh his academeh we wuh hevin dack - rihs on the vuhranduh , major roebuck an ah , an huhmun say <3RD> ' may ah hev one too ' ? ?"

"on the other hand , howsomever , maybe you would n't either . i figger it 's probl ' y a sixty - five - mile walk , and i c 'n maybe get this spring patch up in a couple of hour " . " how - with what ? ?"

"" we aim t ' be see - lective , y ' know ? ? do n't like to bother no one unless we have to , which i figger we do , in your case . figger we get to be plumb careful with any of you highlands big shot " ."

"do n't like to bother no one unless we have to , which i figger we do , in your case . figger we get to be plumb careful with any of you highlands big shot " . mcbride redden ."

"" hell , yes . she 's be hangin ' around me a lot here lately , and i figgered i might as well 's try it . besides i hear her old uncle that stay <3RD> there have <3RD> be doin ' it " ."

"" he 's in morgan 's ferry " . barton half - straighten in surprise . " what 's he do there " ? ?"
It is still a bit of a mystery why there are such spelling errors and rather strange word choices (e.g. “vuhranduh”). It may be an artifact of how the words are tokenized, or of oddities in our training data, but our model did perform as intended.
Challenges: The most difficult part of the project was translating legacy TensorFlow code into modern PyTorch. This forced us to restructure many parts of the program, including creating new classes to replace variable scopes. In addition, the original code targets TensorFlow 1, which has limited documentation on converting to PyTorch. Notably, there is a significant lack of information on how to handle TensorFlow's variable-scope approach, as well as some of the embedding features and other methods that do not translate directly to PyTorch.
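For readers unfamiliar with the mismatch: TF1's `tf.variable_scope("encoder")` prefixed and reused variables by name, while PyTorch has no direct equivalent. The closest analogue, and roughly the approach our conversion took, is an `nn.Module` hierarchy, where a parameter's attribute path plays the role of the scope prefix. Class and attribute names below are illustrative.

```python
import torch.nn as nn

class Encoder(nn.Module):
    """Stands in for what lived under tf.variable_scope("encoder")."""
    def __init__(self, d_model=64):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

class PolyLMSketch(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        # Parameters are automatically namespaced under "encoder.*",
        # mirroring the scope-prefixed names of the TF1 graph.
        self.encoder = Encoder(d_model)

names = [n for n, _ in PolyLMSketch().named_parameters()]
# names includes "encoder.proj.weight" and "encoder.proj.bias"
```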
Reflection: Our project definitely did not turn out how we wanted it to. While we did reach our base and target goals, getting there was far more overwhelming and tiring than we had first imagined. It took many sleepless nights of trial and error before our code was recognizable and would even compile and run. The process was arduous, but in the end we met our functionality requirements and thus our base and target goals.
Our model ended up quite different from how we envisioned it. We thought the TensorFlow-to-PyTorch translation would be relatively quick, but it turned out to be the hardest part of the process. Because of TensorFlow's variable-scope functionality, we had to translate this feature into a number of different classes and new methods, which massively complicated the code and made our architecture very distinct from the original PolyLM code. However, the overall conceptual model remained the same: PolyLM integrates two models, a BERT encoder and the PolyLM head itself. The BERT encoder is a specially modified version of BERT equipped to assign senses to words; these embeddings are then passed to the PolyLM head, which discerns between the embeddings, computes the losses, and derives the metrics used for training. The model then trains on the next batch and repeats. Notably, the original paper used significantly more epochs and parallelized across more GPUs, so its authors trained much faster than we did. So while our model was not identical, we ended up following a similar approach to the paper's, which is what we expected.
Our approach changed wildly over time. Our biggest pivot was working out the TensorFlow-to-PyTorch conversion of the variable scopes, which had to be turned into new classes and structures in the program. This introduced entirely new classes into our model and significantly altered its structure and overall composition. Another pivot was starting with PolyLM and fixing BERT afterward. We had difficulty integrating our BERT with our PyTorch version of PolyLM, which led to an overwhelming number of errors that frankly made life too difficult. Instead, we decided to focus on fixing PolyLM and integrating it with the original TensorFlow implementation of BERT. This worked, and we then fixed BERT to get the entire model in order. As a group, we believe we could have chosen a different project. Fundamentally, this project was more of a hassle than it was rewarding in terms of knowledge gained. While we did learn a lot about the differences between TensorFlow and PyTorch, a better project might have been implementing a predictive regression model in PyTorch, which would have taught us the same lessons with the added benefit of finishing a project we had more to show for.
We could certainly train our model for much longer to get better results. While our loss reached a nice low of 0.03 and our PCA embeddings looked reasonable, the outputs of our model were questionable compared to what they could have been. Training for a few more hours might well have produced a more precise and effective model and better results.
Our greatest takeaway from this project was learning about PolyLM, and about the difficulty of implementing it. We also learned about the differences between TensorFlow and PyTorch implementations. Overall, we had our fair share of takeaways and learned a great deal, though at the cost of many hours of sleep.