Under Submission
Yu Zhou*†, Sohyun An*, Haikang Deng*, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To address this, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
Build the DialectGen environment with conda:

```bash
conda env create -f environment.yml
conda activate DialectGen
```

For the models evaluated in the DialectGen paper, please use the scripts in `src/img_generation`. To evaluate your own model, duplicate any existing script in `src/img_generation` and replace its generation function with your model's.
```bash
python src/img_generation/sd35-turbo.py --dialects aae bre che ine sge --mode concise
```

- `--dialects`: The dialects to generate images for; can be any of `[aae, bre, che, ine, sge]`.
- `--mode`: The evaluation mode to use; one of `[concise, detailed, polysemy]`.
- `--replace`: Add this flag to re-generate images for the given dialect and mode.
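If you are adding your own script, a minimal template along these lines may help. It only mirrors the flags above; the `generate` stub (and the diffusers call mentioned in its comment) is a placeholder you replace with your model's actual generation code:

```python
import argparse

DIALECTS = ["aae", "bre", "che", "ine", "sge"]
MODES = ["concise", "detailed", "polysemy"]

def build_parser():
    # Mirror the command-line interface of the scripts in src/img_generation
    parser = argparse.ArgumentParser()
    parser.add_argument("--dialects", nargs="+", choices=DIALECTS, required=True)
    parser.add_argument("--mode", choices=MODES, default="concise")
    parser.add_argument("--replace", action="store_true",
                        help="re-generate images even if they already exist")
    return parser

def generate(prompt, n_images=10):
    # Placeholder: swap in your model's generation call, e.g. a diffusers
    # pipeline call like pipe(prompt, num_images_per_prompt=n_images).images
    raise NotImplementedError
```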
```
DialectGen/
└── data/
    └── image/
        └── {mode}/
            └── {model}/
                ├── sae_images/
                │   └── ...
                └── dialect_imgs/
                    └── {prompt}/
                        ├── 0.jpg
                        ├── ...
                        └── 9.jpg
```
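The generation scripts write into this layout; as a small illustration, the path for one image can be built like so (the helper name and the example model/prompt strings are hypothetical, not part of the repo):

```python
from pathlib import Path

def image_path(mode, model, split, prompt, idx, root="data/image"):
    # Builds data/image/{mode}/{model}/{split}/{prompt}/{idx}.jpg,
    # where split is "sae_images" or "dialect_imgs"
    return Path(root) / mode / model / split / prompt / f"{idx}.jpg"
```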
We strongly recommend creating a fresh Conda environment per model to avoid dependency conflicts.
```bash
conda create -n <env_name> python=3.10 -y
conda activate <env_name>
pip install -r src/video_generation/<ModelDir>/requirements.txt
pip install diffusers accelerate transformers
```

Each subfolder ships a `run.sh`.
Example for VideoCrafter:
```bash
cd src/video_generation/VideoCrafter
bash run.sh
```

CogVideo is implemented in diffusers, so you can run it with just the diffusers library installed.
To run CogVideo, simply run the `gen_cog.sh` script:

```bash
bash gen_cog.sh
```

Please follow the instructions in the VQAScore GitHub repo to create the `t2v` conda environment.
```bash
conda activate t2v
```

Run the following scripts with the required parameters:
For VQA Score evaluation:
```bash
python src/evaluation/eval_vqa_score.py --models stable-diffusion-3.5-large-turbo --modes concise,detailed --dialects sge bre
```

For CLIP Score evaluation:

```bash
python src/evaluation/eval_clip_score.py --models stable-diffusion-3.5-large-turbo,stable-diffusion3-medium --modes concise,detailed --dialects sge bre
```

- `--models`: The list of models to evaluate.
- `--modes`: The list of modes to evaluate.
- `--dialects`: The dialects to evaluate; can be any of `[aae, bre, che, ine, sge]`.
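For reference, CLIP Score is the standard reference-free text-image metric: a rescaled cosine similarity between CLIP embeddings of the image and the prompt. A sketch of the arithmetic on precomputed embeddings (the eval script computes real CLIP embeddings; the `w=2.5` rescaling follows the common CLIPScore definition and is an assumption about this implementation):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore-style metric: w * max(cosine similarity, 0)
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)
```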
For aggregating evaluation results and calculating final scores for each dataset split, please refer to `src/evaluation/aggragate_model_scores.py` and `src/evaluation/calculate_split_scores.py`.
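At its core the aggregation is a two-level average: per-image scores are averaged within each prompt, then prompt scores are averaged within each dialect split. A sketch of that step (the nested-dict input format is an assumption, not the scripts' actual file format):

```python
def aggregate_scores(per_image_scores):
    # per_image_scores: {dialect: {prompt: [score, score, ...]}}
    out = {}
    for dialect, prompts in per_image_scores.items():
        prompt_means = [sum(scores) / len(scores) for scores in prompts.values()]
        out[dialect] = sum(prompt_means) / len(prompt_means)
    return out
```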
Navigate to the directory by running the following command:

```bash
cd src/mitigation
```

Switch back to the DialectGen environment and install the following additional packages:

```bash
pip install wandb
pip install datasets==3.1.0
```

If the MSCOCO dataset is not available, you will need to first download it by running `download_mscoco.sh`. This will create a folder named `mscoco` under the `data` directory and download the data into it.
```bash
bash download_mscoco.sh
```

Fine-tune a text encoder using the following command. The relevant configuration is included in the `configs` folder.

```bash
python finetune.py --config configs/sd15.yaml
```

- `--config`: The path to the `yaml` file that contains the configuration used for fine-tuning.
- `--dialect`: The dialects to be used for fine-tuning.
- `--mode`: The mode of dialect (`concise` or `detailed`).
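The actual objective lives in `finetune.py` and the config. As a toy illustration of the idea described in the paper, teaching the encoder new dialect features while preserving SAE behavior, the sketch below aligns a linear "encoder's" dialect-prompt embedding with a frozen copy's SAE embedding while penalizing drift on SAE prompts. Every name, the toy encoder, and the exact loss form are assumptions, not the repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W_frozen = rng.normal(size=(dim, dim))      # original (frozen) text encoder
W_tuned = W_frozen.copy()                   # copy being fine-tuned

# Toy token embeddings for a dialect prompt and its SAE paraphrase
dialect_tokens = rng.normal(size=(5, dim))  # e.g. a prompt using a dialect word
sae_tokens = rng.normal(size=(5, dim))      # its SAE paraphrase

def encode(W, tokens):
    # Toy "encoder": mean-pooled token embeddings through a linear map
    return tokens.mean(axis=0) @ W

def loss(W):
    target = encode(W_frozen, sae_tokens)                       # frozen SAE embedding
    align = np.mean((encode(W, dialect_tokens) - target) ** 2)  # learn the dialect word
    preserve = np.mean((encode(W, sae_tokens) - target) ** 2)   # keep SAE behavior
    return align + preserve

# Plain gradient descent on the linear "encoder"
md = dialect_tokens.mean(axis=0, keepdims=True)
ms = sae_tokens.mean(axis=0, keepdims=True)
target = encode(W_frozen, sae_tokens)[None, :]
for _ in range(1000):
    grad = (2 / dim) * (md.T @ (md @ W_tuned - target)
                        + ms.T @ (ms @ W_tuned - target))
    W_tuned -= 0.2 * grad
```

In the repo, the analogue of this loop is the fine-tuning of the diffusion model's real text encoder, driven by the hyperparameters in `configs/sd15.yaml`.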
After the encoder has been fine-tuned, generate images with it. Specify the path to the fine-tuned encoder in `encoder_path`. If `--swap 1`, images are generated using the fine-tuned encoder (i.e., it is swapped in); if `--swap 0`, the original encoder is used for image generation.
```bash
python generate_images.py --model stable-diffusion-v1-5/stable-diffusion-v1-5 --encoder models/... --swap 1 --dialect sge
python generate_images_polysemy.py --model stable-diffusion-v1-5/stable-diffusion-v1-5 --encoder models/... --swap 1 --dialect sge
python generate_images_mscoco.py --model $model --encoder models/... --swap sge
```

- `--encoder`: The path to the fine-tuned encoder used for image generation.
- `--swap`: If set to 1, uses the fine-tuned encoder; if 0, uses the original encoder.
- `--dialect`: The target dialect for image generation.
Once all images are generated, perform scoring using the VQA metric. To do this, first switch back to the t2v environment.
```bash
conda activate t2v
python vqa_score_understanding.py --res_dir data/generated/... --dialect sge
python vqa_score_understanding_polysemy.py --res_dir data/generated/... --dialect sge
```

A file named `vqa_score_understanding_polysemy.json` will then be created under the `res_dir` directory. Run the following script to aggregate the results:
```bash
python aggregate_polysemy_res.py --res_path home/.../vqa_score_understanding_polysemy.json
```

- `--res_path`: The absolute path of `vqa_score_understanding_polysemy.json`.
```bash
python vqa_score_understanding_mscoco.py --res_dir data/generated/...
```

- `--res_dir`: The directory where the images were generated.
If you find our work helpful, please cite it :)
```bibtex
@article{zhou2025dialectgen,
  title={DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation},
  author={Zhou, Yu and An, Sohyun and Deng, Haikang and Yin, Da and Peng, Clark and Hsieh, Cho-Jui and Chang, Kai-Wei and Peng, Nanyun},
  journal={arXiv preprint arXiv:2510.14949},
  year={2025}
}
```