To set up the conda environment, follow the same steps as LLaVA, as described in README_LLAVA.md.
Please check out LLM360/CrystalChat-7B-MLLM, the MLLM based on CrystalChat trained on the LLaVA fine-tuning data, and LLM360/CrystalChat-7B-Web2Code, the Web2Code model.
Chat about images using our model. It also supports multiple GPUs, as well as 4-bit and 8-bit quantized inference.
```shell
python -m llava.serve.cli \
    --model-path /path/to/the/model \
    --image-file "path_to_image.jpg" \
    --load-4bit
```

We use a similar set of hyperparameters as Vicuna in fine-tuning. The hyperparameters used in both pretraining and fine-tuning are provided below.
- Pretraining
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Web2Code | 256 | 1e-3 | 1 | 2048 | 0 |
- Finetuning
| Hyperparameter | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|---|---|---|---|---|---|
| Web2Code | 128 | 2e-5 | 1 | 2048 | 0 |
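The global batch sizes above are the effective batch sizes seen by the optimizer, i.e. per-device batch × gradient-accumulation steps × number of GPUs. A minimal sketch of the arithmetic, assuming the 16x A100 setup noted below (the per-device and accumulation values here are illustrative, not from the paper):

```python
# Sketch: how a global batch size decomposes across GPUs.
# The per-device batch and accumulation steps below are illustrative
# assumptions, not values taken from the training scripts.

def global_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Effective batch size seen by the optimizer per update step."""
    return per_device_batch * grad_accum_steps * num_gpus

# Pretraining: 256 = 16 per GPU x 1 accumulation step x 16 GPUs
assert global_batch_size(16, 1, 16) == 256
# Fine-tuning: 128 = 8 per GPU x 1 accumulation step x 16 GPUs
assert global_batch_size(8, 1, 16) == 128
```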
Please download the 558K subset of the LAION-CC-SBU dataset with BLIP captions we use in the paper here.
Training script with DeepSpeed ZeRO-2: pretrain_crystal_chat.sh.
- Prepare data
Please download the annotation of the final mixture of the LLaVA instruction-tuning data, llava_v1_5_mix665k.json, and download the images from the constituent datasets:
- COCO: train2017
- GQA: images
- OCR-VQA: download script; we save all files as `.jpg`
- TextVQA: train_val_images
- VisualGenome: part1, part2
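Since the OCR-VQA images arrive in mixed formats, they need to be re-saved as `.jpg`. A minimal sketch of such a conversion using Pillow (the helper and paths are illustrative, not part of the repo's download script):

```python
import os
from PIL import Image

def convert_to_jpg(src_dir: str, dst_dir: str) -> None:
    """Re-save every image in src_dir as an RGB .jpg in dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    for name in os.listdir(src_dir):
        stem, _ = os.path.splitext(name)
        # convert("RGB") drops alpha channels, which JPEG cannot store
        img = Image.open(os.path.join(src_dir, name)).convert("RGB")
        img.save(os.path.join(dst_dir, stem + ".jpg"))
```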
After downloading all of them, organize the data as follows in `./playground/data`:

```
├── coco
│   └── train2017
├── gqa
│   └── images
├── ocr_vqa
│   └── images
├── textvqa
│   └── train_images
└── vg
    ├── VG_100K
    └── VG_100K_2
```
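Before launching training, it can help to sanity-check that the layout above is in place. A small illustrative helper (not part of the repo) that reports any missing sub-directories:

```python
import os

# Expected sub-directories under ./playground/data, matching the tree above.
EXPECTED = [
    "coco/train2017",
    "gqa/images",
    "ocr_vqa/images",
    "textvqa/train_images",
    "vg/VG_100K",
    "vg/VG_100K_2",
]

def missing_dirs(root: str) -> list:
    """Return the expected sub-directories that are absent under root."""
    return [d for d in EXPECTED if not os.path.isdir(os.path.join(root, d))]
```

Usage: `missing_dirs("./playground/data")` returns an empty list when everything is in place.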
To prepare the Web2Code dataset, download the annotation and images of the final dataset: [Web2Code Dataset]. The images are organized as follows:

```
Web2Code_image
├── games
│   ├── 01
│   ├── ...
│   └── 09
├── jobs
│   ├── 03
│   ├── ...
│   └── 13
```
- Start training!
Pretraining takes around 20 hours for the model on 16x A100 (40G). Visual instruction tuning takes around 26 hours on the same hardware.
Training script with DeepSpeed ZeRO-3: finetune_crystal_chat.sh.
If you find our work helpful for your research, please consider giving a star ⭐ and a citation 📝:
```bibtex
@article{web2code2024,
  title={Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs},
  author={Sukmin Yun and Haokun Lin and Rusiru Thushara and Mohammad Qazim Bhat and Yongxin Wang and Zutao Jiang and Mingkai Deng and Jinhong Wang and Tianhua Tao and Junbo Li and Haonan Li and Preslav Nakov and Timothy Baldwin and Zhengzhong Liu and Eric P. Xing and Xiaodan Liang and Zhiqiang Shen},
  journal={arXiv preprint arXiv:2406.20098},
  year={2024}
}
```