This repo trains a Spanish ESG-specific BERT starting from BETO via Domain-Adaptive Pre-Training (DAPT) and then fine-tunes on ESG downstream tasks (sentence classification / NER).
- Install

  ```bash
  python -m venv .venv && source .venv/bin/activate
  pip install -r requirements.txt
  ```

- Data layout
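The contents of `requirements.txt` are not shown in this README; as an assumption, a minimal set for BETO + Hugging Face DAPT would look something like this (pin versions to match your GPU/CUDA setup):

```text
torch
transformers
datasets
pyyaml
```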
Place raw text files under your Drive (either one document per line, or one large corpus file) and point `DATA_DIR` at that location.
Recommended layout:

  ```text
  /MyDrive/esg_corpus/
    train.txt   # large corpus (one document per line)
    valid.txt   # small held-out file
  ```
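If you start from a single corpus file, `valid.txt` can be carved off with a small script. A sketch (the file names follow the layout above; `split_corpus` and the 1% hold-out fraction are illustrative choices, not part of the repo):

```python
import random
from pathlib import Path

def split_corpus(src: Path, train: Path, valid: Path,
                 valid_frac: float = 0.01, seed: int = 42) -> None:
    """Shuffle documents (one per line) and hold out a small validation slice."""
    docs = [ln for ln in src.read_text(encoding="utf-8").splitlines() if ln.strip()]
    random.Random(seed).shuffle(docs)  # fixed seed keeps the split reproducible
    n_valid = max(1, int(len(docs) * valid_frac))
    valid.write_text("\n".join(docs[:n_valid]) + "\n", encoding="utf-8")
    train.write_text("\n".join(docs[n_valid:]) + "\n", encoding="utf-8")
```

Shuffling before splitting matters when the corpus is ordered by source (e.g., all reports from one company together), otherwise the held-out slice is not representative.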
- Run DAPT (MLM)

  ```bash
  python scripts/train_mlm.py --config configs/mvp.yaml
  ```

- Resume on a bigger GPU
  - Upload `checkpoints/beto_esg_mvp` to your cloud VM.
  - Rerun `train_mlm.py` with a larger config (e.g., `configs/a100_300m.yaml`) and `resume_from_checkpoint: checkpoints/beto_esg_mvp`.
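For reference, the MLM objective behind DAPT uses BERT-style masking: roughly 15% of tokens are selected for prediction, and of those, 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. In practice `transformers`' `DataCollatorForLanguageModeling` handles this; the sketch below just illustrates the rule on raw token IDs (the `MASK_ID` and vocabulary size are made-up placeholders):

```python
import random

MASK_ID = 4          # hypothetical [MASK] token id
VOCAB_SIZE = 31002   # placeholder; BETO's WordPiece vocab is ~31k

def mask_tokens(ids, mlm_prob=0.15, seed=0):
    """Return (masked_ids, labels); labels are -100 where no loss is taken."""
    rng = random.Random(seed)
    masked, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mlm_prob:
            labels[i] = tok                            # predict original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID                    # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: random token
            # else 10%: keep the original token (model still predicts it)
    return masked, labels
```

The 10% "keep" case is what forces the model to build useful representations for unmasked tokens too, which is the point of DAPT on domain text.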
- Colab: mount Drive and set `DATA_DIR: "/content/drive/MyDrive/esg_corpus"`.
- Kaggle: upload `train.txt` as a Kaggle Dataset and set `DATA_DIR` accordingly (e.g., `/kaggle/input/esg-corpus`).
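One way to keep a single entry point across Colab, Kaggle, and local runs is to resolve `DATA_DIR` at startup. A sketch, assuming the paths above; `resolve_data_dir` and the `ESG_DATA_DIR` override variable are hypothetical, not part of the repo:

```python
import os
from pathlib import Path

# Checked in order; first existing directory wins.
CANDIDATES = [
    "/content/drive/MyDrive/esg_corpus",  # Colab (after mounting Drive)
    "/kaggle/input/esg-corpus",           # Kaggle Dataset
    "data_proc",                          # local fallback
]

def resolve_data_dir() -> Path:
    """Prefer an explicit env override, else the first candidate that exists."""
    override = os.environ.get("ESG_DATA_DIR")
    if override:
        return Path(override)
    for cand in CANDIDATES:
        if Path(cand).is_dir():
            return Path(cand)
    raise FileNotFoundError("no data directory found; set ESG_DATA_DIR")
```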
- Repo layout

  ```text
  configs/      # YAML configs for runs
  scripts/      # training, preprocessing, eval
  data_raw/     # (ignored) raw PDFs/HTML (store in Drive)
  data_proc/    # processed text (ignored)
  checkpoints/  # model checkpoints (ignored)
  notebooks/    # Colab/Kaggle entrypoints
  docs/         # model card, datasheets
  ```
- Consider DVC for tracking large data stored in Drive/S3, and Weights & Biases for experiment tracking.