Haoyi Qiu

I am a second-year PhD candidate in Computer Science at UCLA, advised by Prof. Nanyun (Violet) Peng. Previously, I graduated from the University of Michigan with a B.S. in Computer Science and Mathematics, where I was advised by Prof. Joyce Y. Chai.

My research addresses fundamental questions in model evaluation, safety alignment, multimodal understanding, and multimodal reasoning:

  • How can we rigorously evaluate AI systems (LLMs, VLMs, agents, diffusion models) beyond surface-level metrics?
  • How do we align foundation models with diverse cultural norms and ensure geo-diverse safety through post-training?
  • How do we ensure foundation models understand and respect pluralism?
  • How can we build models that reason effectively across modalities while remaining trustworthy and aligned with human values?

I have been recognized with the Outstanding Graduate MS Student Award from the UCLA CS Department (2025) and the Outstanding Reviewer Award at EMNLP 2023 (top 0.74%).

If you're interested in my research, potential collaborations, or simply want to catch up, feel free to drop me an email.

Email: [email protected] 💌 /  Google Scholar 👩🏻‍🎓  /  Twitter 📝  /  Life 🌷  / 

Research Projects

* denotes equal contribution and project lead

From Preferences to Prejudice: The Role of Alignment Tuning in Shaping Social Bias in Video Diffusion Models
Zefan Cai*, Haoyi Qiu*, Haozhe Zhao*, Ke Wan, Jiachen Li, Jiuxiang Gu, Wen Xiao, Nanyun Peng, Junjie Hu
TMLR, 2026

We introduce VideoBiasEval, a diagnostic framework that traces how social biases evolve through the alignment pipeline of video diffusion models. By training our own reward models and alignment-tuned diffusion models, we provide a full training recipe and reveal that alignment tuning not only strengthens representational biases but also makes them temporally stable across generated videos.

MMGR: Multi-Modal Generative Reasoning
Zefan Cai*, Haoyi Qiu*, Tianyi Ma*, Haozhe Zhao*, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu
ArXiv, 2025

We propose MMGR, a principled evaluation framework for generative reasoning across abstract reasoning, embodied navigation, and physical commonsense, revealing that leading video and image models struggle with causal correctness and long-horizon spatial planning.

Multimodal Cultural Safety: Evaluation Framework and Alignment Strategies
Haoyi Qiu, Kung-Hsiang Huang, Ruichen Zheng, Jiao Sun, Nanyun Peng
TMLR, 2025

We introduce CROSS, a benchmark of 1,284 multilingual visually grounded queries from 16 countries to assess cultural safety in LVLMs. We develop supervised fine-tuning with culturally grounded data and preference tuning with contrastive response pairs, substantially improving GPT-4o's cultural awareness (+60%) and compliance (+55%) while preserving general capabilities.

MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
Haoyi Qiu, Yilun Zhou, Pranav Narayanan Venkit, Kung-Hsiang Huang, Jiaxin Zhang, Nanyun Peng, Chien-Sheng Wu
ArXiv, 2025

We introduce MMPersuade, a unified framework for studying multimodal persuasion in LVLMs, showing that multimodal inputs substantially increase model susceptibility to persuasion—especially in misinformation scenarios—and that different persuasion strategies vary in effectiveness across contexts.

GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness
Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu
ArXiv, 2025

We propose GUI-KV, a plug-and-play KV cache compression method for GUI agents that exploits spatial saliency and temporal redundancy, reducing decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over full-cache baselines without retraining.

Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding
Kung-Hsiang Huang, Can Qin, Haoyi Qiu, Philippe Laban, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu
ACL Findings, 2025

We propose CogAlign, a post-training strategy inspired by Piaget's cognitive development theory that trains VLMs to recognize invariant properties under visual transformations. CogAlign outperforms supervised fine-tuning while requiring 60% less training data, significantly improving visual arithmetic and downstream chart and geometry understanding.

Evaluating Cultural and Social Awareness of LLM Web Agents
Haoyi Qiu, Alexander R. Fabbri, Divyansh Agarwal, Kung-Hsiang Huang, Sarah Tan, Nanyun Peng, Chien-Sheng Wu
NAACL Findings, 2025 (Best Paper Nomination)

We introduce CASA, a benchmark for evaluating LLM web agents' cultural and social awareness across shopping and forum tasks, finding that agents achieve less than 10% awareness coverage with over 40% norm violation rates. We show that fine-tuning on culture-specific datasets enhances cross-region generalization, while prompting boosts complex task navigation.

From Pixels to Insights: A Survey on Automatic Chart Understanding in the Era of Large Foundation Models
Kung-Hsiang Huang, Hou Pong Chan, Yi R. Fung, Haoyi Qiu, Mingyang Zhou, Shafiq Joty, Shih-Fu Chang, Heng Ji
TKDE, 2024

We provide a comprehensive survey of automatic chart understanding in the era of large foundation models, covering tasks, evaluation metrics, modeling strategies, and future directions including domain-specific charts and agent-oriented settings.

SafeWorld: Geo-Diverse Safety Alignment
Da Yin*, Haoyi Qiu*, Kung-Hsiang Huang, Kai-Wei Chang, Nanyun Peng
NeurIPS, 2024

We introduce SafeWorld, a benchmark of 2,342 queries grounded in human-verified cultural norms and legal policies from 50 countries, along with a multi-dimensional safety evaluation framework. Using synthesized preference pairs for DPO, our trained SafeWorldLM outperforms GPT-4o across all evaluation dimensions by a large margin.

New Job, New Gender? Measuring the Social Bias in Image Generation Models
Wenxuan Wang, Haonan Bai, Jen-tse Huang, Yuxuan Wan, Youliang Yuan, Haoyi Qiu, Nanyun Peng, Michael R. Lyu
ACM MM, 2024 (Oral)

We propose BiasPainter, an evaluation framework that automatically triggers and detects social biases in image generation models across 62 professions, 39 activities, 57 objects, and 70 personality traits, achieving 90.8% accuracy on automatic bias detection.

VALOR-Eval: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models
Haoyi Qiu*, Wenbo Hu*, Zi-Yi Dou, Nanyun Peng
ACL Findings, 2024

We introduce a multi-dimensional benchmark and LLM-based evaluation framework for LVLM hallucination that covers objects, attributes, and relations, generalizing the CHAIR metric to jointly assess both faithfulness and coverage of model outputs.

AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation
Haoyi Qiu, Kung-Hsiang Huang*, Jingnong Qu*, Nanyun Peng
NAACL, 2024 (Oral)

We propose AMRFact, a framework that leverages Abstract Meaning Representations to generate coherent factually inconsistent summaries with high error-type coverage for training factual consistency evaluators, significantly outperforming previous systems on the AggreFact-SOTA benchmark.

Gender Biases in Automatic Evaluation Metrics for Image Captioning
Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
EMNLP, 2023

We conduct a systematic study of gender biases in model-based evaluation metrics (e.g., CLIPScore) for image captioning, revealing that using biased metrics as rewards in RL training propagates and amplifies stereotypes in generation models. We propose an effective debiasing method without hurting correlation with human judgments.

Award

Outstanding Graduate MS Student Award (one per department) - UCLA CS Department, 05/2025

Outstanding Reviewer Award (top 0.74%) - EMNLP 2023

James B. Angell Scholar, University Honors, EECS Scholar, The University of Michigan

Service

Area Chair: ACL Rolling Review 2025

Reviewer: ICLR 2025, ACL Rolling Review 2024, EMNLP 2023

Education
University of California, Los Angeles
Master of Science in Computer Science
Thesis: Geo-Diverse Safety and Cultural Alignment in Language Models: Evaluating Cultural Awareness and Norm Sensitivity
Sep. 2022 - Jun. 2024
University of Michigan
Bachelor of Science in Computer Science, Pure Mathematics, and Statistics
May 2022


Teaching Experience
Teaching Assistant: Intro to AI (EECS 492)
University of Michigan, FA 2021


Work Experience
Salesforce AI Research
Research Scientist Intern
Hosts: Yilun Zhou, Pranav Narayanan Venkit, Chien-Sheng Wu
Jun. 2025 - Sep. 2025

Salesforce AI Research
Research Scientist Intern
Hosts: Alexander R. Fabbri, Divyansh Agarwal, Chien-Sheng Wu
Jun. 2024 - Sep. 2024

Meta
Machine Learning Engineer Intern
May 2022 - Aug. 2022

Goldman Sachs
Engineering Summer Analyst
Jun. 2021 - Aug. 2021


This website is built using the source code from Jon Barron's public academic website.