This project provides a library for evaluating the quality of sentence embeddings for the Russian language. We assess their generalization power by using them as features on a broad and diverse set of tasks. SentEvalRu currently includes 17 NLP tasks.
Our goal is to evaluate different text-representation algorithms on datasets in Russian. To our knowledge, this is the first such effort for the Russian language. We were inspired to create this library by SentEval [1].
The project was implemented in the context of the ComptechNsk'19 winter school; the idea of creating SentEvalRu belongs to MIPT's Neural Networks and Deep Learning Lab, which develops the iPavlov artificial intelligence system.
Project participants:
- Mosolova Anna (project manager)
- Obukhova Alisa (technical writer)
- Pauls Aleksey (engineer)
- Stroganov Mikhail (engineer)
- Timasova Ekaterina (researcher)
- Shugalevskaya Natalya (researcher)
Sentence embeddings are used in a wide range of NLP tasks. For example:
- intent classification;
- QA systems;
- sentiment analysis;
- machine translation;
- document clustering.
Our tool helps to evaluate sentence embeddings, which can be useful for anyone who works on these tasks or studies embedding quality for Russian.
Our project currently includes the following text-representation models for the Russian language:
- BERT [2]
- FastText [4]
- FastText+IDF [4] [5]
- Skip-Thought [6]
There is no way to evaluate embedding quality directly, so we instead solve NLP tasks using these embeddings as features and judge them by the performance of the resulting systems.
For example, we can use the following tasks:
- sentiment analysis;
- named-entity recognition;
- topic modelling; etc.
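The idea behind this indirect evaluation can be illustrated with a minimal sketch (not part of SentEvalRu): treat fixed sentence embeddings as input features, train a simple classifier on a downstream task, and use its test accuracy as a proxy for embedding quality. The toy Gaussian "embeddings" below are purely illustrative stand-ins for real sentence vectors.

```python
# Minimal illustration of indirect embedding evaluation: train a simple
# classifier on fixed "sentence embeddings" and report test accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy "embeddings": two classes drawn from shifted Gaussians stand in
# for, e.g., positive and negative sentence vectors.
n, dim = 200, 50
X = np.vstack([rng.normal(0.0, 1.0, (n, dim)),
               rng.normal(0.7, 1.0, (n, dim))])
y = np.array([0] * n + [1] * n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# The classifier on top is deliberately simple, so that the score
# mostly reflects how much task signal the embeddings carry.
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = clf.score(X_te, y_te)
print(f"downstream accuracy: {accuracy:.2f}")
```

Better embeddings yield higher downstream scores under the same simple classifier; this is exactly the protocol the tasks below follow.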
We suggest evaluating embeddings by means of these tasks:
| Tag | Task | Type | Description |
|---|---|---|---|
| MRPC | MRPC | paraphrase detection | Detect whether one sentence is a paraphrase of another |
| SST-3 | SST/dialog-2016 | ternary sentiment analysis | Detect a text sentiment (positive (1), neutral (0), negative (-1)) |
| SST-2 | SST/binary | binary sentiment analysis | Detect a text sentiment (positive (1), negative (-1)) |
| TagCl | Tags classifier | tag classifier | Detect a tag of news from Interfax Corpus |
| ReadabilityCl | Readability classifier | readability classification | Detect the readability grade of a text (1-10) |
| PoemsCl | Poems classifier | genre classifier | Detect poem's genre |
| ProzaCl | Proza classifier | genre classifier | Detect prose's genre |
| TREC | TREC (translated) | question-type classification | Detect a type of a question (about entity, human, description, location etc.) |
| SICK | SICK-E (translated) | natural language inference | Detect whether the second sentence is an entailment, a contradiction, or neutral with respect to the first |
| STS | STS (translated) | semantic textual similarity | Detect a semantic similarity grade of two texts |
Further information is available in `/data`.
You should install all required modules before starting:
- Python 3 with NumPy/SciPy
- PyTorch >= 0.4
- scikit-learn >= 0.18.0
- TensorFlow >= 1.12.0
- Keras >= 2.2.4
- ...

We recommend using the Anaconda distribution, or you can simply run:

```
pip3 install -r requirements.txt
```
To get the code, clone the repository:

```
git clone https://github.com/comptechml/SentEvalRu.git
cd SentEvalRu
```
You should store your datasets in `/data`, and you can add your own examples (new embeddings) to `/examples`. The available tasks for Russian are also located in `/examples`.
To evaluate your sentence embeddings, SentEval requires that you implement two functions:
- `prepare` (sees the whole dataset of each task and can thus construct the word vocabulary, the dictionary of word vectors, etc.)
- `batcher` (transforms a batch of text sentences into sentence embeddings)

`batcher` only sees one batch at a time, while the `samples` argument of `prepare` contains all the sentences of a task.
```
prepare(params, samples)
```

- `params`: SentEval parameters.
- `samples`: list of all sentences from the transfer task.
- output: no output. Arguments stored in `params` can further be used by `batcher`.

Example: in `bow.py`, `prepare` is used to build the vocabulary of words and construct the `params.word_vec` dictionary of word vectors.
```
batcher(params, batch)
```

- `params`: SentEval parameters.
- `batch`: numpy array of text sentences (of size `params.batch_size`).
- output: numpy array of sentence embeddings (of size `params.batch_size`).

Example: in `bow.py`, `batcher` is used to compute the mean of the word vectors for each sentence in the batch using `params.word_vec`. Use your own encoder in that function to encode sentences.
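The two hooks can be sketched as follows for a bag-of-words encoder. This is an illustrative, self-contained version, not the library's actual `bow.py`: the random word vectors, the pre-tokenized input, and the plain dict for `params` (SentEval itself passes an object with attribute access, e.g. `params.word_vec`) are all simplifying assumptions.

```python
import numpy as np

EMB_DIM = 4  # toy dimensionality for illustration


def prepare(params, samples):
    """Build a word -> vector dictionary from all task sentences.

    Random vectors stand in for real ones; an actual setup would load
    FastText or similar pretrained embeddings here.
    """
    rng = np.random.default_rng(0)
    vocab = {word for sent in samples for word in sent}
    params['word_vec'] = {w: rng.normal(size=EMB_DIM) for w in vocab}
    # No return value: state is stored in params for batcher to use.


def batcher(params, batch):
    """Encode each tokenized sentence as the mean of its word vectors."""
    embeddings = []
    for sent in batch:
        vecs = [params['word_vec'][w] for w in sent
                if w in params['word_vec']]
        if vecs:
            embeddings.append(np.mean(vecs, axis=0))
        else:
            embeddings.append(np.zeros(EMB_DIM))  # unseen-word fallback
    return np.vstack(embeddings)


# Minimal usage: prepare sees every sentence, batcher sees one batch.
params = {}
samples = [['привет', 'мир'], ['как', 'дела']]
prepare(params, samples)
emb = batcher(params, [['привет', 'мир']])
print(emb.shape)  # one row per sentence in the batch, EMB_DIM columns
```

Your own encoder simply replaces the body of `batcher` (and whatever state-building it needs in `prepare`), keeping the same signatures.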
After having implemented the `prepare` and `batcher` functions for your own sentence encoder:

- To perform the actual evaluation, first import senteval and set its parameters:

```python
import senteval
params = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
```

- (Optional) Set the parameters of the classifier (when applicable):

```python
params['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                        'tenacity': 5, 'epoch_size': 4}
```

You can choose `nhid=0` (logistic regression) or `nhid>0` (MLP) and define the parameters for training.

- Create an instance of the class `SE`:

```python
se = senteval.engine.SE(params, batcher, prepare)
```

- Define the set of transfer tasks and run the evaluation:

```python
transfer_tasks = ['MRPC', 'SICK', 'STS', 'SST2']
results = se.eval(transfer_tasks)
```

The current list of available tasks is:

```python
['SST2', 'SST3', 'MRPC', 'ReadabilityCl', 'TagCl', 'PoemsCl', 'ProzaCl', 'TREC', 'STS', 'SICK']
```

Global parameters of SentEval:
```
# senteval parameters
task_path   # path to SentEval datasets (required)
seed        # random seed
usepytorch  # use CUDA/PyTorch (else scikit-learn) where possible
kfold       # k-fold validation for MR/CR/SUB/MPQA
```

Parameters of the classifier:

```
nhid:       # number of hidden units (0: logistic regression, >0: MLP); default nonlinearity: Tanh
optim:      # optimizer ("sgd,lr=0.1", "adam", "rmsprop", ...)
tenacity:   # how many times dev accuracy may fail to increase before training stops
epoch_size: # each epoch corresponds to epoch_size passes over the train set
max_epoch:  # max number of epochs
dropout:    # dropout for MLP
```

[1] A. Conneau, D. Kiela, SentEval: An Evaluation Toolkit for Universal Sentence Representations
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[3] Daniel Cer, Yinfei Yang, Sheng-yi Kong, ... Universal Sentence Encoder
[4] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
[5] Martin Klein, Michael L. Nelson, Approximating Document Frequency with Term Count Values
[6] Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S. Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler, Skip-Thought Vectors