
DamienGVA/beto-esg-es


BETO-ESG-ES — Spanish ESG Domain-Adaptive Pretraining & Fine-tuning

This repo trains a Spanish ESG-specific BERT model. It starts from BETO, continues pretraining on ESG text via Domain-Adaptive Pre-Training (DAPT), and then fine-tunes the result on downstream ESG tasks (sentence classification / NER).

Quick start

  1. Install
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
  2. Data layout

Place the raw text files (either one document per line or a single large corpus file) in a folder on your Drive and point DATA_DIR at that folder.

Recommended layout:

/MyDrive/esg_corpus/
  train.txt     # large corpus (one document per line)
  valid.txt     # small held-out file
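The layout above can be produced with a small helper. The following is a minimal sketch, not part of the repo: the function name `split_corpus`, the 5% validation fraction, and the seed are illustrative assumptions. It writes one document per line and holds out a small validation file.

```python
import random
from pathlib import Path


def split_corpus(docs, out_dir, valid_fraction=0.05, seed=13):
    """Write train.txt and valid.txt (one document per line) into out_dir.

    Hypothetical helper: the name and default values are assumptions.
    """
    # Collapse internal whitespace so each document occupies exactly one line,
    # and drop documents that are empty after cleaning.
    cleaned = [" ".join(doc.split()) for doc in docs]
    cleaned = [doc for doc in cleaned if doc]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(cleaned)

    # Hold out at least one document whenever there is more than one.
    n_valid = max(1, int(len(cleaned) * valid_fraction)) if len(cleaned) > 1 else 0
    valid, train = cleaned[:n_valid], cleaned[n_valid:]

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "train.txt").write_text("".join(d + "\n" for d in train), encoding="utf-8")
    (out / "valid.txt").write_text("".join(d + "\n" for d in valid), encoding="utf-8")
    return len(train), len(valid)
```

Calling `split_corpus(list_of_documents, "/content/drive/MyDrive/esg_corpus")` would then create both files in the folder that DATA_DIR points to.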
  3. Run DAPT (MLM)
python scripts/train_mlm.py --config configs/mvp.yaml
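For readers new to MLM, the sketch below illustrates the masking rule from the original BERT recipe: roughly 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Whether `train_mlm.py` uses exactly these rates is an assumption; the function name `mask_tokens` and the toy vocabulary are hypothetical. Real pipelines usually delegate this to a library (for example, Hugging Face's `DataCollatorForLanguageModeling`).

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (masked_tokens, labels) following the BERT 80/10/10 rule.

    labels[i] is the original token where a prediction is required,
    and None where the position is ignored by the loss.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                # 10%: keep the token unchanged
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels


if __name__ == "__main__":
    sentence = "la empresa redujo sus emisiones de carbono".split()
    toy_vocab = ["energía", "riesgo", "gobernanza", "agua"]
    print(mask_tokens(sentence, toy_vocab, rng=random.Random(0)))
```

DAPT applies this same objective to the in-domain ESG corpus instead of general-purpose text, which is what adapts BETO's representations to the domain.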
  4. Resume on a bigger GPU
  • Upload checkpoints/beto_esg_mvp to your cloud VM.
  • Rerun train_mlm.py with a larger config (e.g., configs/a100_300m.yaml) in which resume_from_checkpoint is set to checkpoints/beto_esg_mvp.
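Before resuming, it can help to confirm that the upload is complete and to see which step training will continue from. The sketch below assumes the run directory contains `checkpoint-<step>` subfolders, as Hugging Face's `Trainer` writes them; the README does not describe the actual layout, so treat both that assumption and the helper name `latest_checkpoint` as hypothetical.

```python
import re
from pathlib import Path

# Assumed naming scheme: checkpoint-500, checkpoint-1000, ...
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    run = Path(run_dir)
    if not run.is_dir():
        return None

    found = []
    for child in run.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))

    if not found:
        return None
    # Compare numerically so that checkpoint-1500 beats checkpoint-500.
    return max(found, key=lambda item: item[0])[1]


if __name__ == "__main__":
    print(latest_checkpoint("checkpoints/beto_esg_mvp"))
```

A `None` result means the folder is missing or contains no step subfolders, in which case resuming would most likely fail or restart from scratch.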

Colab / Kaggle notes

  • Colab: mount Drive and set DATA_DIR: "/content/drive/MyDrive/esg_corpus".
  • Kaggle: upload train.txt as a Kaggle Dataset and set DATA_DIR accordingly (e.g., /kaggle/input/esg-corpus).
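One way to avoid editing the config per platform is to probe the candidate locations at startup. This is a minimal sketch under stated assumptions: the helper `resolve_data_dir` is hypothetical, and the two paths are simply the examples given above (the Kaggle path depends on the name you give the Dataset).

```python
from pathlib import Path

# Candidate locations taken from the notes above; adjust to your own setup.
CANDIDATE_DATA_DIRS = (
    "/content/drive/MyDrive/esg_corpus",  # Colab, after mounting Drive
    "/kaggle/input/esg-corpus",           # Kaggle Dataset (example name)
)


def resolve_data_dir(candidates=CANDIDATE_DATA_DIRS, required=("train.txt",)):
    """Return the first candidate directory that contains all required files."""
    for candidate in candidates:
        path = Path(candidate)
        if all((path / name).is_file() for name in required):
            return path
    raise FileNotFoundError(
        f"None of {list(candidates)} contains the required files {list(required)}"
    )


if __name__ == "__main__":
    try:
        print("DATA_DIR =", resolve_data_dir())
    except FileNotFoundError as err:
        print(err)
```

Failing early with a clear message is usually preferable to discovering a missing corpus after the model has already been loaded onto the GPU.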

Project tree

configs/        # YAML configs for runs
scripts/        # training, preprocessing, eval
data_raw/       # raw PDFs/HTML (ignored; store in Drive)
data_proc/      # processed text (ignored)
checkpoints/    # model checkpoints (ignored)
notebooks/      # Colab/Kaggle entrypoints
docs/           # model card, datasheets

Optional tools

  • Consider DVC for tracking large data stored in Drive/S3, and Weights & Biases for experiment tracking.

About

Spanish ESG-specific NLP model trained from BETO
