Under Submission
Yu Zhou*†, Sohyun An*, Haikang Deng*, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng
Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To address this, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.
Build the DialectGen environment with conda:

```bash
conda env create -f environment.yml
conda activate DialectGen
```

For the models evaluated in the DialectGen paper, please use the scripts in `src/img_generation`. To evaluate your own model, duplicate any existing script in `src/img_generation` and replace its generation function with your model's.
```bash
python src/img_generation/sd35-turbo.py --dialects aae bre che ine sge --mode concise
```

- `--dialects`: The dialects to generate images for; can be any of `[aae, bre, che, ine, sge]`.
- `--mode`: The evaluation mode to use; one of `[concise, detailed, polysemy]`.
- `--replace`: Add this flag to re-generate images for the given dialect and mode.
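If you are adding your own script, a minimal template along these lines may help. It only mirrors the flags above; the `generate` stub (and the diffusers call mentioned in its comment) is a placeholder you replace with your model's actual generation code:

```python
import argparse

DIALECTS = ["aae", "bre", "che", "ine", "sge"]
MODES = ["concise", "detailed", "polysemy"]

def build_parser():
    # Mirror the command-line interface of the scripts in src/img_generation
    parser = argparse.ArgumentParser()
    parser.add_argument("--dialects", nargs="+", choices=DIALECTS, required=True)
    parser.add_argument("--mode", choices=MODES, default="concise")
    parser.add_argument("--replace", action="store_true",
                        help="re-generate images even if they already exist")
    return parser

def generate(prompt, n_images=10):
    # Placeholder: swap in your model's generation call, e.g. a diffusers
    # pipeline call like pipe(prompt, num_images_per_prompt=n_images).images
    raise NotImplementedError
```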
```
DialectGen/
└── data/
    └── image/
        └── {mode}/
            └── {model}/
                ├── sae_images/
                │   └── ...
                └── dialect_imgs/
                    └── {prompt}/
                        ├── 0.jpg
                        ├── ...
                        └── 9.jpg
```
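The generation scripts write into this layout; as a small illustration, the path for one image can be built like so (the helper name and the example model/prompt strings are hypothetical, not part of the repo):

```python
from pathlib import Path

def image_path(mode, model, split, prompt, idx, root="data/image"):
    # Builds data/image/{mode}/{model}/{split}/{prompt}/{idx}.jpg,
    # where split is "sae_images" or "dialect_imgs"
    return Path(root) / mode / model / split / prompt / f"{idx}.jpg"
```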
We strongly recommend creating a fresh Conda environment per model to avoid dependency conflicts.
```bash
conda create -n <env_name> python=3.10 -y
conda activate <env_name>
pip install -r src/video_generation/<ModelDir>/requirements.txt
pip install diffusers accelerate transformers
```

Each subfolder ships a `run.sh`.
Example for VideoCrafter:
```bash
cd src/video_generation/VideoCrafter
bash run.sh
```

CogVideo is implemented in diffusers, so you can run it with just the diffusers library installed.
To run CogVideo, simply run the `gen_cog.sh` script:

```bash
bash gen_cog.sh
```

Please follow the instructions in the VQAScore GitHub repo to create the `t2v` conda environment.
```bash
conda activate t2v
```

Run the following scripts with the required parameters:
For VQA Score evaluation:
```bash
python src/evaluation/eval_vqa_score.py --models stable-diffusion-3.5-large-turbo --modes concise,detailed --dialects sge bre
```

For CLIP Score evaluation:

```bash
python src/evaluation/eval_clip_score.py --models stable-diffusion-3.5-large-turbo,stable-diffusion3-medium --modes concise,detailed --dialects sge bre
```

- `--models`: The list of models to evaluate.
- `--modes`: The list of modes to evaluate.
- `--dialects`: The dialects to evaluate; can be any of `[aae, bre, che, ine, sge]`.
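For reference, CLIP Score is the standard reference-free text-image metric: a rescaled cosine similarity between CLIP embeddings of the image and the prompt. A sketch of the arithmetic on precomputed embeddings (the eval script computes real CLIP embeddings; the `w=2.5` rescaling follows the common CLIPScore definition and is an assumption about this implementation):

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    # CLIPScore-style metric: w * max(cosine similarity, 0)
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return w * max(float(image_emb @ text_emb), 0.0)
```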
For aggregating evaluation results and calculating final scores for each dataset split, please refer to `src/evaluation/aggragate_model_scores.py` and `src/evaluation/calculate_split_scores.py`.
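At its core the aggregation is a two-level average: per-image scores are averaged within each prompt, then prompt scores are averaged within each dialect split. A sketch of that step (the nested-dict input format is an assumption, not the scripts' actual file format):

```python
def aggregate_scores(per_image_scores):
    # per_image_scores: {dialect: {prompt: [score, score, ...]}}
    out = {}
    for dialect, prompts in per_image_scores.items():
        prompt_means = [sum(scores) / len(scores) for scores in prompts.values()]
        out[dialect] = sum(prompt_means) / len(prompt_means)
    return out
```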
Navigate to the directory by running the following command:

```bash
cd src/mitigation
```

Switch back to the DialectGen environment and install the following additional packages:

```bash
pip install wandb
pip install datasets==3.1.0
```

If the MSCOCO dataset is not available, you will need to first download it by running `download_mscoco.sh`. This will create a folder named `mscoco` under the `data` directory and download the data into it.
```bash
bash download_mscoco.sh
```

Fine-tune a text encoder using the following command. The relevant configuration is included in the `configs` folder.

```bash
python finetune.py --config configs/sd15.yaml
```

- `--config`: The path to the `yaml` file that contains the configuration used for fine-tuning.
- `--dialect`: The dialects to be used for fine-tuning.
- `--mode`: The mode of dialect (`concise` or `detailed`).
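The actual objective lives in `finetune.py` and the config. As a toy illustration of the idea described in the paper, teaching the encoder new dialect features while preserving SAE behavior, the sketch below aligns a linear "encoder's" dialect-prompt embedding with a frozen copy's SAE embedding while penalizing drift on SAE prompts. Every name, the toy encoder, and the exact loss form are assumptions, not the repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
W_frozen = rng.normal(size=(dim, dim))      # original (frozen) text encoder
W_tuned = W_frozen.copy()                   # copy being fine-tuned

# Toy token embeddings for a dialect prompt and its SAE paraphrase
dialect_tokens = rng.normal(size=(5, dim))  # e.g. a prompt using a dialect word
sae_tokens = rng.normal(size=(5, dim))      # its SAE paraphrase

def encode(W, tokens):
    # Toy "encoder": mean-pooled token embeddings through a linear map
    return tokens.mean(axis=0) @ W

def loss(W):
    target = encode(W_frozen, sae_tokens)                       # frozen SAE embedding
    align = np.mean((encode(W, dialect_tokens) - target) ** 2)  # learn the dialect word
    preserve = np.mean((encode(W, sae_tokens) - target) ** 2)   # keep SAE behavior
    return align + preserve

# Plain gradient descent on the linear "encoder"
md = dialect_tokens.mean(axis=0, keepdims=True)
ms = sae_tokens.mean(axis=0, keepdims=True)
target = encode(W_frozen, sae_tokens)[None, :]
for _ in range(1000):
    grad = (2 / dim) * (md.T @ (md @ W_tuned - target)
                        + ms.T @ (ms @ W_tuned - target))
    W_tuned -= 0.2 * grad
```

In the repo, the analogue of this loop is the fine-tuning of the diffusion model's real text encoder, driven by the hyperparameters in `configs/sd15.yaml`.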
After the encoder has been fine-tuned, generate images with it. Specify the path to the fine-tuned encoder in `encoder_path`. If `--swap 1`, images are generated using the fine-tuned encoder (i.e., it is swapped in); if `--swap 0`, the original encoder is used for image generation.
```bash
python generate_images.py --model stable-diffusion-v1-5/stable-diffusion-v1-5 --encoder models/... --swap 1 --dialect sge
python generate_images_polysemy.py --model stable-diffusion-v1-5/stable-diffusion-v1-5 --encoder models/... --swap 1 --dialect sge
python generate_images_mscoco.py --model $model --encoder models/... --swap sge
```

- `--encoder`: The path to the fine-tuned encoder used for image generation.
- `--swap`: If set to 1, uses the fine-tuned encoder; if 0, uses the original encoder.
- `--dialect`: The target dialect for image generation.
Once all images are generated, perform scoring using the VQA metric. To do this, first switch back to the t2v environment.
```bash
conda activate t2v
python vqa_score_understanding.py --res_dir data/generated/... --dialect sge
python vqa_score_understanding_polysemy.py --res_dir data/generated/... --dialect sge
```

A file named `vqa_score_understanding_polysemy.json` will then be created under the `res_dir` directory. Run the following script to aggregate the results:
```bash
python aggregate_polysemy_res.py --res_path home/.../vqa_score_understanding_polysemy.json
```

- `--res_path`: The absolute path of `vqa_score_understanding_polysemy.json`.
```bash
python vqa_score_understanding_mscoco.py --res_dir data/generated/...
```

- `--res_dir`: The directory where the images were generated.
If you find our work helpful, please cite it :)
```bibtex
@article{zhou2025dialectgen,
  title={DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation},
  author={Zhou, Yu and An, Sohyun and Deng, Haikang and Yin, Da and Peng, Clark and Hsieh, Cho-Jui and Chang, Kai-Wei and Peng, Nanyun},
  journal={arXiv preprint arXiv:2510.14949},
  year={2025}
}
```