Latvian Encoders

Official repository for the paper "Pretraining and Benchmarking Modern Encoders for Latvian".

This repository provides a suite of Latvian-specific pretrained encoder models based on RoBERTa, DeBERTaV3, and ModernBERT, together with evaluation resources and benchmark results. The models are trained on a curated Latvian corpus of 6.4B words, combining large-scale web data with high-quality sources such as news, books, and academic texts.

Paper · Model collection · Datasets · Citation

Highlights

  • Latvian-specific pretrained encoders across multiple modern architectures
  • Long-context variants supporting up to 8,192 tokens
  • Unified Latvian benchmark covering diagnostic, morphosyntactic, and semantic tasks
  • lv-deberta-base achieves the strongest overall performance across the evaluation benchmarks

Released models

| Model           | Params | Architecture | Context (tokens) | Link |
|-----------------|--------|--------------|------------------|------|
| lv-deberta-base | 111M   | DeBERTaV3    | 1024             | HF   |
| lv-mbert-mini   | 59M    | ModernBERT   | 8192             | HF   |
| lv-mbert-base   | 136M   | ModernBERT   | 8192             | HF   |
| lv-mbert-large  | 377M   | ModernBERT   | 8192             | HF   |
| lv-roberta-base | 124M   | RoBERTa      | 1024             | HF   |
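
The ModernBERT variants accept inputs of up to 8,192 tokens. The following is a minimal loading sketch; the ID lv-mbert-base is assumed to follow the same AiLab-IMCS-UL/<name> naming used in the Usage section below, so check the model collection for the exact Hugging Face IDs.

import torch
from transformers import AutoTokenizer, AutoModel

# Assumed Hugging Face ID, mirroring the lv-deberta-base ID in the Usage section;
# see the model collection for the exact names.
model_name = "AiLab-IMCS-UL/lv-mbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# A hypothetical long document; the ModernBERT variants accept up to 8,192 tokens.
long_text = " ".join(["Šis ir piemērs latviešu valodā."] * 500)
inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=8192)

with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)        # (1, sequence length), capped at 8,192 tokens
print(outputs.last_hidden_state.shape)  # (1, sequence length, hidden size)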

Benchmark

  • LTEC (Twitter sentiment)
  • ScaLA (linguistic acceptability)
  • FSNER (named entity recognition)
  • WikiQA (reading comprehension)
  • COPA (commonsense reasoning)
  • Latvian Universal Dependencies v2.16
  • Latvian Word Sense Disambiguation (WSD)

For full results, see the paper.
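
Encoders like these are typically evaluated on such tasks by adding a classification head and fine-tuning. The sketch below uses the standard Hugging Face sequence-classification setup with a few hypothetical toy sentences standing in for a real benchmark split; it is not the paper's exact training recipe.

from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

model_name = "AiLab-IMCS-UL/lv-deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical toy examples standing in for a real sentiment split
# (0 = negative, 1 = neutral, 2 = positive).
train_data = Dataset.from_dict({
    "text": ["Lielisks pakalpojums!", "Tas bija briesmīgi.", "Nekas īpašs."],
    "label": [2, 0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_data = train_data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="lv-sentiment",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_data,
    data_collator=DataCollatorWithPadding(tokenizer),  # pad dynamically per batch
)
trainer.train()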

Usage

from transformers import AutoTokenizer, AutoModel

model_name = "AiLab-IMCS-UL/lv-deberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# "This is an example in Latvian."
text = "Šis ir piemērs latviešu valodā."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (batch size, sequence length, hidden size)
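
To turn the token-level output into a single sentence embedding, a common generic recipe (not something prescribed by the paper) is mean pooling over the attention mask. Continuing from the snippet above:

# Mean-pool the token states over non-padding positions (continues from the snippet above).
mask = inputs["attention_mask"].unsqueeze(-1).float()   # (1, sequence length, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum the states of real tokens
sentence_embedding = summed / mask.sum(dim=1)           # divide by the number of real tokens

print(sentence_embedding.shape)  # (1, hidden size)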

Citation

@inproceedings{znotins-2026-pretraining,
    title = "Pretraining and Benchmarking Modern Encoders for {L}atvian",
    author = "Znotins, Arturs",
    booktitle = "Proceedings of the Second Workshop on Language Models for Low-Resource Languages ({L}o{R}es{LM} 2026)",
    month = mar,
    year = "2026",
    address = "Rabat, Morocco",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.loreslm-1.40/",
    doi = "10.18653/v1/2026.loreslm-1.40",
    pages = "461--470",
    ISBN = "979-8-89176-377-7"
}
