Official repository for the paper "Pretraining and Benchmarking Modern Encoders for Latvian".
This repository provides a suite of Latvian-specific pretrained encoder models based on RoBERTa, DeBERTaV3, and ModernBERT, together with evaluation resources and benchmark results. The models are trained on a curated Latvian corpus of 6.4B words, combining large-scale web data with high-quality sources such as news, books, and academic texts.
Paper • Model collection • Datasets • Citation
- Latvian-specific pretrained encoders across multiple modern architectures
- Long-context variants supporting up to 8,192 tokens
- Unified Latvian benchmark covering diagnostic, morphosyntactic, and semantic tasks
- lv-deberta-base achieves the strongest overall performance across the evaluation benchmarks
| Model | Params | Architecture | Context | Link |
|---|---|---|---|---|
| lv-deberta-base | 111M | DeBERTaV3 | 1024 | HF |
| lv-mbert-mini | 59M | ModernBERT | 8192 | HF |
| lv-mbert-base | 136M | ModernBERT | 8192 | HF |
| lv-mbert-large | 377M | ModernBERT | 8192 | HF |
| lv-roberta-base | 124M | RoBERTa | 1024 | HF |
- LTEC (Twitter sentiment)
- ScaLA (linguistic acceptability)
- FSNER (named entity recognition)
- WikiQA (reading comprehension)
- COPA (commonsense reasoning)
- Latvian Universal Dependencies v2.16
- Latvian Word Sense Disambiguation (WSD)
For full results, see the paper.
```python
from transformers import AutoTokenizer, AutoModel

model_name = "AiLab-IMCS-UL/lv-deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

text = "Šis ir piemērs latviešu valodā."  # "This is an example in Latvian."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```

```bibtex
@inproceedings{znotins-2026-pretraining,
    title = "Pretraining and Benchmarking Modern Encoders for {L}atvian",
    author = "Znotins, Arturs",
    booktitle = "Proceedings of the Second Workshop on Language Models for Low-Resource Languages ({L}o{R}es{LM} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.loreslm-1.40/",
    doi = "10.18653/v1/2026.loreslm-1.40",
    pages = "461--470",
    ISBN = "979-8-89176-377-7"
}
```
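The usage example returns token-level hidden states of shape `(batch, tokens, hidden)`. A common way to reduce these to a single sentence vector is masked mean pooling, averaging only over real (non-padding) tokens. The sketch below is illustrative, not part of the released models; it uses a dummy tensor in place of real encoder output so it runs without downloading weights.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask.unsqueeze(-1).float()      # (batch, tokens, 1)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # (batch, 1), avoid div by 0
    return summed / counts

# Dummy stand-in for model(**inputs).last_hidden_state:
# batch of 2 sentences, 5 token positions, hidden size 8.
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0],   # first sentence: 3 real tokens
                     [1, 1, 1, 1, 1]])  # second sentence: 5 real tokens
embeddings = mean_pool(hidden, mask)
print(embeddings.shape)  # torch.Size([2, 8])
```

With a real model, `hidden` would be `outputs.last_hidden_state` and `mask` would be `inputs["attention_mask"]` from the tokenizer.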