Latvian Encoders

Official repository for the paper "Pretraining and Benchmarking Modern Encoders for Latvian".

This repository provides a suite of Latvian-specific pretrained encoder models based on RoBERTa, DeBERTaV3, and ModernBERT, together with evaluation resources and benchmark results. The models are trained on a curated Latvian corpus of 6.4B words, combining large-scale web data with high-quality sources such as news, books, and academic texts.

Paper · Model collection · Datasets · Citation

Highlights

  • Latvian-specific pretrained encoders across multiple modern architectures
  • Long-context variants supporting up to 8,192 tokens
  • Unified Latvian benchmark covering diagnostic, morphosyntactic, and semantic tasks
  • lv-deberta-base achieves the strongest overall performance across the evaluation benchmarks

Released models

| Model           | Params | Architecture | Context (tokens) | Link |
|-----------------|--------|--------------|------------------|------|
| lv-deberta-base | 111M   | DeBERTaV3    | 1024             | HF   |
| lv-mbert-mini   | 59M    | ModernBERT   | 8192             | HF   |
| lv-mbert-base   | 136M   | ModernBERT   | 8192             | HF   |
| lv-mbert-large  | 377M   | ModernBERT   | 8192             | HF   |
| lv-roberta-base | 124M   | RoBERTa      | 1024             | HF   |
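
The ModernBERT variants accept inputs of up to 8,192 tokens. The following is a minimal loading sketch; the ID lv-mbert-base is assumed to follow the same AiLab-IMCS-UL/<name> naming used in the Usage section below, so check the model collection for the exact Hugging Face IDs.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face ID, mirroring the lv-deberta-base ID in the Usage section;
# see the model collection for the exact names.
model_name = "AiLab-IMCS-UL/lv-mbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# A hypothetical long document; the ModernBERT variants accept up to 8,192 tokens.
long_text = " ".join(["Šis ir piemērs latviešu valodā."] * 500)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)        # (1, sequence length), capped at 8,192 tokens
print(outputs.last_hidden_state.shape)  # (1, sequence length, hidden size)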

Benchmark

  • LTEC (Twitter sentiment)
  • ScaLA (linguistic acceptability)
  • FSNER (named entity recognition)
  • WikiQA (reading comprehension)
  • COPA (commonsense reasoning)
  • Latvian Universal Dependencies v2.16
  • Latvian Word Sense Disambiguation (WSD)

For full results, see the paper.
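
Encoders like these are typically evaluated on such tasks by adding a classification head and fine-tuning. The sketch below uses the standard Hugging Face sequence-classification setup with a few hypothetical toy sentences standing in for a real benchmark split; it is not the paper's exact training recipe.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_name = "AiLab-IMCS-UL/lv-deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical toy examples standing in for a real sentiment split
# (0 = negative, 1 = neutral, 2 = positive).
train_data = Dataset.from_dict({
    "text": ["Lielisks pakalpojums!", "Tas bija briesmīgi.", "Nekas īpašs."],
    "label": [2, 0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="lv-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()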

Usage

from transformers import AutoTokenizer, AutoModel

model_name = "AiLab-IMCS-UL/lv-deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# "This is an example in Latvian."
text = "Šis ir piemērs latviešu valodā."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
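
To turn the token-level output into a single sentence embedding, a common generic recipe (not something prescribed by the paper) is mean pooling over the attention mask. Continuing from the snippet above:

# Mean-pool the token states over non-padding positions (continues from the snippet above).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, sequence length, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum the states of real tokens
sentence_embedding = summed / mask.sum(dim=1)           # divide by the number of real tokens

print(sentence_embedding.shape)  # (1, hidden size)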

Citation

@inproceedings{znotins-2026-pretraining,
    title = "Pretraining and Benchmarking Modern Encoders for {L}atvian",
    author = "Znotins, Arturs",
    booktitle = "Proceedings of the Second Workshop on Language Models for Low-Resource Languages ({L}o{R}es{LM} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.loreslm-1.40/",
    doi = "10.18653/v1/2026.loreslm-1.40",
    pages = "461--470",
    ISBN = "979-8-89176-377-7"
}
