To set up the conda environment, follow the same steps as LLaVA, as described in README_LLAVA.md.
Please check out LLM360/CrystalChat-7B-MLLM, the MLLM based on CrystalChat trained on the LLaVA fine-tuning data, and LLM360/CrystalChat-7B-Web2Code, the Web2Code model.
Chat about images using our model. It also supports multiple GPUs, as well as 4-bit and 8-bit quantized inference.
```shell
python -m llava.serve.cli \
    --model-path /path/to/the/model \
    --image-file "path_to_image.jpg" \
    --load-4bit
```

We use a similar set of hyperparameters as Vicuna in fine-tuning. The hyperparameters used in both pretraining and fine-tuning are provided below.
- Pretraining
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Web2Code | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Web2Code | 128 | 2e-5 | 1 | 2048 | 0 |
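The global batch sizes above are the effective batch sizes seen by the optimizer, i.e. per-device batch × gradient-accumulation steps × number of GPUs. A minimal sketch of the arithmetic, assuming the 16x A100 setup noted below (the per-device and accumulation values here are illustrative, not from the paper):

```python
# Sketch: how a global batch size decomposes across GPUs.
# The per-device batch and accumulation steps below are illustrative
# assumptions, not values taken from the training scripts.

def global_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Effective batch size seen by the optimizer per update step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Pretraining: 256 = 16 per GPU x 1 accumulation step x 16 GPUs
assert global_batch_size(16, 1, 16) == 256
# Fine-tuning: 128 = 8 per GPU x 1 accumulation step x 16 GPUs
assert global_batch_size(8, 1, 16) == 128
```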
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Training script with DeepSpeed ZeRO-2: pretrain_crystal_chat.sh.
- Prepare data
Please download the annotation of the final mixture of the LLaVA instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
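Since the OCR-VQA images arrive in mixed formats, they need to be re-saved as `.jpg`. A minimal sketch of such a conversion using Pillow (the helper and paths are illustrative, not part of the repo's download script):

```python
import os
from PIL import Image

def convert_to_jpg(src_dir: str, dst_dir: str) -> None:
    """Re-save every image in src_dir as an RGB .jpg in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        stem, _ = os.path.splitext(name)
        # convert("RGB") drops alpha channels, which JPEG cannot store
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
        img.save(os.path.join(dst_dir, stem + ".jpg"))
```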
After downloading all of them, organize the data as follows in `./playground/data`:

```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
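Before launching training, it can help to sanity-check that the layout above is in place. A small illustrative helper (not part of the repo) that reports any missing sub-directories:

```python
import os

# Expected sub-directories under ./playground/data, matching the tree above.
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

def missing_dirs(root: str) -> list:
    """Return the expected sub-directories that are absent under root."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]
```

Usage: `missing_dirs("./playground/data")` returns an empty list when everything is in place.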
To prepare the Web2Code dataset, download the annotation and images of the final dataset: [Web2Code Dataset]. The images are organized as follows:

```
Web2Code_image
├── games
│   ├── 01
│   ├── ...
│   └── 09
├── jobs
│   ├── 03
│   ├── ...
│   └── 13
```
- Start training!
Pretraining takes around 20 hours for the model on 16x A100 (40G). Visual instruction tuning takes around 26 hours on the same hardware.
Training script with DeepSpeed ZeRO-3: finetune_crystal_chat.sh.
If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝:
```bibtex
@article{web2code2024,
  title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Sukmin Yun and Haokun Lin and Rusiru Thushara and Mohammad Qazim Bhat and Yongxin Wang and Zutao Jiang and Mingkai Deng and Jinhong Wang and Tianhua Tao and Junbo Li and Haonan Li and Preslav Nakov and Timothy Baldwin and Zhengzhong Liu and Eric P. Xing and Xiaodan Liang and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}
```