
🪶 LakotaBERT: Low-Resource Language Model

License: MIT · Model: RoBERTa · Language: Lakota · 🏆 Best Paper Award at RTIP2P-2024

📅 Project Date: Spring 2024
🧠 Model Architecture: RoBERTa (Robustly Optimized BERT Pretraining Approach)
📉 Objective: Masked Language Modeling (MLM) for Endangered Language Revitalization
🛠️ Tech Stack: PyTorch, Hugging Face Transformers, Tesseract OCR, Python


📖 Research Abstract

Lakota is a critically endangered language of the Sioux people in North America. This project introduces LakotaBERT, the first large language model (LLM) tailored for Lakota, aiming to support language revitalization efforts.

Unlike English-based models, LakotaBERT was pre-trained from scratch on a custom-compiled corpus of 105K sentences. The model achieved a Masked Language Modeling (MLM) accuracy of 51.48%, demonstrating performance comparable to that of English-based models.


📊 Performance Metrics

We evaluated the model against baseline models under a single-ground-truth assumption (each masked position has exactly one correct token). The detailed results for LakotaBERT are below:

| Metric | Score | Description |
|--------|-------|-------------|
| Accuracy | 51.48% | Percentage of masked tokens correctly predicted |
| Precision | 0.56 | Proportion of correct predictions among all positive predictions |
| F1 Score | 0.49 | Balances precision and recall into a single performance metric |
| MRR | 0.51 | Average reciprocal rank of the correct answer within the predicted list |
| CER | 0.43 | Character-level prediction errors normalized by the length of the longest string |
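
For reference, accuracy counts a masked prediction as correct only when the ground-truth token is ranked first, while MRR credits the reciprocal of the rank at which it appears. The snippet below is a minimal per-token scoring sketch, assuming a model and tokenizer loaded as shown in the Getting Started section; it compares only the first sub-token of the target word and is not the repository's actual evaluation script.

import torch

# Minimal per-token scoring sketch (illustrative, not the repo's evaluation code).
# `text` must contain the tokenizer's mask token, e.g. "<mask>" for RoBERTa.
def score_masked_prediction(model, tokenizer, text, target_word):
    inputs = tokenizer(text, return_tensors="pt")
    mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]  # scores over the vocabulary
    # Simplification: score only the first sub-token of the target word.
    target_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(" " + target_word)[0])
    ranking = logits.argsort(descending=True)          # token ids, best first
    rank = (ranking == target_id).nonzero().item() + 1  # 1-based rank of the truth
    return {"correct": rank == 1, "reciprocal_rank": 1.0 / rank}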

🏗️ Pipeline Architecture

The project followed a three-stage pipeline for training a transformer-based model tailored to Lakota:

  • Data Acquisition: Gathered bilingual and monolingual sources totaling approximately 105K lines of Lakota and English, using the Tesseract OCR engine to extract text from PDF documents (see the OCR sketch after this list).
  • Tokenization: Trained a Byte Pair Encoding (BPE) tokenizer with a vocabulary size of 52,000 to capture the diversity of words and tokens in the Lakota language (see the tokenizer sketch after this list).
  • Pre-training: Used the RoBERTa architecture with a masking probability of 15% for masked language modeling; the corresponding configuration and training setup are shown under Implementation Details below.
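
The OCR step is not reproduced in this README; the following is a minimal extraction sketch using pdf2image and pytesseract, where the file name and DPI are illustrative assumptions rather than the project's exact settings.

from pdf2image import convert_from_path
import pytesseract

# Illustrative OCR sketch: rasterize each PDF page, then run Tesseract on it.
def extract_text_from_pdf(pdf_path):
    pages = convert_from_path(pdf_path, dpi=300)  # dpi value is an assumption
    return "\n".join(pytesseract.image_to_string(page) for page in pages)

corpus_text = extract_text_from_pdf("lakota_source.pdf")  # placeholder file name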
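
The 52,000-token BPE vocabulary can be trained with the Hugging Face tokenizers library. The corpus path, minimum frequency, and special tokens below are assumptions consistent with a standard RoBERTa setup, not necessarily the repository's exact values.

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the compiled corpus (paths are placeholders).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["lakota_corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,  # assumption
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("./lakota_tokenizer")  # writes vocab.json and merges.txt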

🚀 Getting Started & Model Weights

The pre-trained model weights, configuration, and tokenizer files are hosted on Hugging Face. You can load the model directly via the transformers library:

from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download the tokenizer and pre-trained MLM weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("kanishka7878/LakotaBERT")
model = AutoModelForMaskedLM.from_pretrained("kanishka7878/LakotaBERT")
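
As a quick sanity check, the checkpoint can be wrapped in a fill-mask pipeline. The sentence below is only an English placeholder; substitute a Lakota sentence containing the tokenizer's mask token.

from transformers import pipeline

# Wrap the checkpoint in a fill-mask pipeline for quick inspection.
fill_mask = pipeline("fill-mask", model="kanishka7878/LakotaBERT")

# Replace the placeholder with a Lakota sentence; <mask> marks the position to predict.
for prediction in fill_mask("The weather today is <mask>."):
    print(prediction["token_str"], round(prediction["score"], 4))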

💻 Implementation Details

The training script uses the Hugging Face Trainer API with hyperparameters sized for a low-resource setting. The model configuration is shown below, followed by a sketch of the accompanying training setup:

# Configuration for Low-Resource setting (from src/train_lakota_roberta.py)
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,  # Optimized for smaller dataset size
    type_vocab_size=1,
)
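
The full training loop lives in the repository; below is a minimal sketch of how the configuration above might be combined with a 15% masking collator and the Trainer API. The tokenizer path, dataset variable, and training-argument values are illustrative assumptions, not the script's exact settings.

from transformers import (
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Tokenizer trained on the Lakota corpus (path is a placeholder).
tokenizer = RobertaTokenizerFast.from_pretrained("./lakota_tokenizer", model_max_length=512)

# Fresh RoBERTa weights initialized from the config defined above.
model = RobertaForMaskedLM(config=config)

# Mask 15% of input tokens for the MLM objective, as stated in the pipeline section.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Training arguments below are illustrative, not the repository's exact values.
training_args = TrainingArguments(
    output_dir="./lakota_roberta",
    num_train_epochs=10,
    per_device_train_batch_size=32,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,  # assumed: a tokenized dataset of Lakota sentences
)
trainer.train()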

About

A transformer-based language model for the low-resource Lakota language, pre-trained on a custom 105K-sentence corpus using the RoBERTa architecture.
