This project provides a library for evaluating the quality of sentence embeddings for the Russian language. We assess their generalization power by using them as features on a broad and diverse set of tasks. SentEvalRu currently includes 17 NLP tasks.
Our goal is to evaluate different text-representation algorithms on datasets in Russian. To our knowledge, this is the first such effort for the Russian language. We were inspired to create this library by SentEval [1].
The project was implemented in the context of the ComptechNsk'19 winter school; the idea of creating SentEvalRu belongs to MIPT's Neural Networks and Deep Learning Lab, which develops the iPavlov artificial intelligence system.
Project participants:
- Mosolova Anna (project manager)
- Obukhova Alisa (technical writer)
- Pauls Aleksey (engineer)
- Stroganov Mikhail (engineer)
- Timasova Ekaterina (researcher)
- Shugalevskaya Natalya (researcher)
Sentence embeddings are used in a wide range of NLP tasks. For example:
- intent classification;
- QA systems;
- sentiment analysis;
- machine translation;
- document clustering.
Our tool helps to evaluate sentence embeddings, which can be useful for anyone who works on these tasks or studies embedding quality for Russian.
Our project currently includes the following text-representation models for the Russian language:
- BERT [2]
- FastText [4]
- FastText+IDF [4] [5]
- Skip-Thought [6]
There is no way to evaluate embedding quality directly, so we instead solve NLP tasks using these embeddings as features and judge them by the performance of the resulting systems.
For example, we can use the following tasks:
- sentiment analysis;
- named-entity recognition;
- topic modelling; etc.
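The idea behind this indirect evaluation can be illustrated with a minimal sketch (not part of SentEvalRu): treat fixed sentence embeddings as input features, train a simple classifier on a downstream task, and use its test accuracy as a proxy for embedding quality. The toy Gaussian "embeddings" below are purely illustrative stand-ins for real sentence vectors.

```python
# Minimal illustration of indirect embedding evaluation: train a simple
# classifier on fixed "sentence embeddings" and report test accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy "embeddings": two classes drawn from shifted Gaussians stand in
# for, e.g., positive and negative sentence vectors.
n, dim = 200, 50
X = np.vstack([rng.normal(0.0, 1.0, (n, dim)),
               rng.normal(0.7, 1.0, (n, dim))])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# The classifier on top is deliberately simple, so that the score
# mostly reflects how much task signal the embeddings carry.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"downstream accuracy: {accuracy:.2f}")
```

Better embeddings yield higher downstream scores under the same simple classifier; this is exactly the protocol the tasks below follow.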
We suggest evaluating embeddings by means of these tasks:
| Tag | Task | Type | Description |
|---|---|---|---|
| MRPC | MRPC | paraphrase detection | Detect whether one sentence is a paraphrase of another |
| SST-3 | SST/dialog-2016 | ternary sentiment analysis | Detect a text sentiment (positive (1), neutral (0), negative (-1)) |
| SST-2 | SST/binary | binary sentiment analysis | Detect a text sentiment (positive (1), negative (-1)) |
| TagCl | Tags classifier | tag classifier | Detect a tag of news from Interfax Corpus |
| ReadabilityCl | Readability classifier | readability classification | Detect the readability grade of a text (1-10) |
| PoemsCl | Poems classifier | genre classifier | Detect poem's genre |
| ProzaCl | Proza classifier | genre classifier | Detect prose's genre |
| TREC | TREC (translated) | question-type classification | Detect a type of a question (about entity, human, description, location etc.) |
| SICK | SICK-E (translated) | natural language inference | Detect whether the second sentence is an entailment, a contradiction, or neutral with respect to the first |
| STS | STS (translated) | semantic textual similarity | Detect a semantic similarity grade of two texts |
Further information is available in `/data`.
You should install all required modules before starting:
- Python 3 with NumPy/SciPy
- PyTorch >= 0.4
- scikit-learn >= 0.18.0
- TensorFlow >= 1.12.0
- Keras >= 2.2.4
- ...

We recommend using the Anaconda distribution, or you can simply run:

```
pip3 install -r requirements.txt
```
To get the code, clone the repository:

```
git clone https://github.com/comptechml/SentEvalRu.git
cd SentEvalRu
```
You should store your datasets in `/data`, and you can add your own examples (new embeddings) to `/examples`. The available tasks for Russian are also located in `/examples`.
To evaluate your sentence embeddings, SentEval requires that you implement two functions:
- `prepare` (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors, etc.)
- `batcher` (transforms a batch of text sentences into sentence embeddings)

`batcher` only sees one batch at a time, while the `samples` argument of `prepare` contains all the sentences of a task.
```
prepare(params, samples)
```

- `params`: SentEval parameters.
- `samples`: list of all sentences from the transfer task.
- output: no output. Arguments stored in `params` can further be used by `batcher`.

Example: in `bow.py`, `prepare` is used to build the vocabulary of words and construct the `params.word_vec` dictionary of word vectors.
```
batcher(params, batch)
```

- `params`: SentEval parameters.
- `batch`: numpy array of text sentences (of size `params.batch_size`).
- output: numpy array of sentence embeddings (of size `params.batch_size`).

Example: in `bow.py`, `batcher` is used to compute the mean of the word vectors for each sentence in the batch using `params.word_vec`. Use your own encoder in that function to encode sentences.
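The two hooks can be sketched as follows for a bag-of-words encoder. This is an illustrative, self-contained version, not the library's actual `bow.py`: the random word vectors, the pre-tokenized input, and the plain dict for `params` (SentEval itself passes an object with attribute access, e.g. `params.word_vec`) are all simplifying assumptions.

```python
import numpy as np

EMB_DIM = 4  # toy dimensionality for illustration


def prepare(params, samples):
    """Build a word -> vector dictionary from all task sentences.

    Random vectors stand in for real ones; an actual setup would load
    FastText or similar pretrained embeddings here.
    """
    rng = np.random.default_rng(0)
    vocab = {word for sent in samples for word in sent}
    params['word_vec'] = {w: rng.normal(size=EMB_DIM) for w in vocab}
    # No return value: state is stored in params for batcher to use.


def batcher(params, batch):
    """Encode each tokenized sentence as the mean of its word vectors."""
    embeddings = []
    for sent in batch:
        vecs = [params['word_vec'][w] for w in sent
                if w in params['word_vec']]
        if vecs:
            embeddings.append(np.mean(vecs, axis=0))
        else:
            embeddings.append(np.zeros(EMB_DIM))  # unseen-word fallback
    return np.vstack(embeddings)


# Minimal usage: prepare sees every sentence, batcher sees one batch.
params = {}
samples = [['привет', 'мир'], ['как', 'дела']]
prepare(params, samples)
emb = batcher(params, [['привет', 'мир']])
print(emb.shape)  # one row per sentence in the batch, EMB_DIM columns
```

Your own encoder simply replaces the body of `batcher` (and whatever state-building it needs in `prepare`), keeping the same signatures.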
After having implemented the `prepare` and `batcher` functions for your own sentence encoder:

- To perform the actual evaluation, first import senteval and set its parameters:

```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
```

- (Optional) Set the parameters of the classifier (when applicable):

```python
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                        'tenacity': 5, 'epoch_size': 4}
```

You can choose `nhid=0` (logistic regression) or `nhid>0` (MLP) and define the parameters for training.

- Create an instance of the class `SE`:

```python
se = senteval.engine.SE(params, batcher, prepare)
```

- Define the set of transfer tasks and run the evaluation:

```python
transfer_tasks = ['MRPC', 'SICK', 'STS', 'SST2']
results = se.eval(transfer_tasks)
```

The current list of available tasks is:

```python
['SST2', 'SST3', 'MRPC', 'ReadabilityCl', 'TagCl', 'PoemsCl', 'ProzaCl', 'TREC', 'STS', 'SICK']
```

Global parameters of SentEval:
```
# senteval parameters
task_path   # path to SentEval datasets (required)
seed        # random seed
usepytorch  # use CUDA/PyTorch (else scikit-learn) where possible
kfold       # k-fold validation for MR/CR/SUB/MPQA
```

Parameters of the classifier:

```
nhid:       # number of hidden units (0: logistic regression, >0: MLP); default nonlinearity: Tanh
optim:      # optimizer ("sgd,lr=0.1", "adam", "rmsprop", ...)
tenacity:   # how many times dev accuracy may fail to increase before training stops
epoch_size: # each epoch corresponds to epoch_size passes over the train set
max_epoch:  # max number of epochs
dropout:    # dropout for MLP
```

[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, ... Universal Sentence Encoder
[4] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
[5] Martin Klein, Michael L. Nelson, Approximating Document Frequency with Term Count Values
[6] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler, Skip-Thought Vectors