Exploring the Frontiers of AI

Large Language Models (LLMs) have rapidly evolved into foundational tools across industries and research domains, driving unprecedented advances in natural language understanding, generation, and reasoning. At the same time, their growing scale raises pressing challenges around transparency, safety, fairness, and accountability.

This one-day workshop at Rice University brings together researchers and practitioners to explore the forefront of LLM development — from novel architectures, data curation, and training strategies to interpretability, agentic AI, and human-AI collaboration — while fostering interdisciplinary connections across the broader research community.

Distinguished Speakers

Yangfeng Ji

Associate Professor

Computer Science
University of Virginia

Arman Cohan

Assistant Professor

Computer Science
Yale University

Mark Yatskar

Assistant Professor

Computer and Information Science
University of Pennsylvania

Vera Liao

Associate Professor

Computer Science and Engineering
University of Michigan

Jiawei Zhou

Assistant Professor

Computer Science
Stony Brook University

Kuan-Hao Huang

Assistant Professor

Computer Science & Engineering
Texas A&M University

Schedule

8:30 – 9:00
Registration & Coffee
9:00 – 9:10
Opening Remarks
Hanjie Chen & Vicente Ordóñez
9:10 – 9:55
Limitations of Current Vision Language Models
Raymond J. Mooney
There has been impressive recent progress on Vision-Language Models (VLMs), and current models can solve a wide range of challenging multimodal problems. However, they still suffer from several important limitations. Current LLMs have demonstrated impressive reasoning abilities in text; however, a number of recent results show that current VLMs are relatively poor at "multimodal reasoning" that requires integrating information from both text and images. We have developed a new benchmark on multimodal entity tracking, i.e., determining how actions change the states of entities, when actions are expressed using either text, images, or both. Although the latest reasoning LLMs are quite good at tracking entities when actions are expressed textually, VLMs still struggle to track entities when actions are expressed visually or multimodally. Our results demonstrate an interesting inability of frontier models to properly integrate linguistic and visual reasoning. The progress on automatic image and video generation from text is also impressive; however, current models that directly generate pixels from language can still produce unrealistic images or videos that violate fundamental physics. I will argue that an approach that first connects language to a coherent 3D dynamic model of the world and then generates images from such 3D models is potentially more promising.
9:55 – 10:40
The Mismeasure of Models: Missing Factors in LLM Evaluation
Yangfeng Ji
The evaluation of Large Language Models (LLMs) is often distilled into a single score on a leaderboard, a practice that can be misleading and obscure a model's true capabilities. This talk argues for a more nuanced, multi-faceted approach to LLM evaluation, moving beyond simplistic metrics. We delve into several 'missing factors' critical for a comprehensive understanding of performance. Specifically, we explore how to disentangle syntactic complexity from semantic meaning to reveal that many errors stem from unfamiliar linguistic structures rather than a lack of understanding. Furthermore, we introduce activation steering as a technique to measure the model's sensitivity to specific latent concepts. Finally, we examine how evaluation design, such as length biases and example permutations in in-context learning, can significantly impact results. This talk advocates for targeted, diagnostic frameworks that move beyond accuracy to provide a deeper understanding of what LLMs truly know.
10:40 – 11:00
Coffee Break
11:00 – 11:45
Can LLMs Judge Alignment? From Benchmark Limits to Reference-Guided Improvement
Arman Cohan
Many important problems in LLM alignment live in non-verifiable domains, where there is no reliable automatic reward. In these settings, progress depends critically on evaluation: if we cannot verify outputs directly, we need models that can judge them well. This talk explores that challenge: I will begin by briefly revisiting the limitations of current automatic alignment benchmarking, showing that system-level rankings are sensitive to benchmark design and become unreliable when models are close in quality. I will then discuss evidence that a model's ability to evaluate alignment is closely related to its own alignment quality, motivating benchmarks such as AlignEval that assess models in their role as evaluators rather than only through their generations. Finally, I will present a reference-guided approach for non-verifiable domains, where high-quality reference answers substantially improve LLM judges and turn them into effective signals for post-training. These reference-guided judges enable self-improvement that outperforms both direct supervised distillation on references and self-improvement with reference-free judges.
11:45 – 12:30
Toward Safer Medical AI via Scholarly Supervision
Mark Yatskar
Medical AI still depends on painstakingly curated, modest-sized datasets that are expensive to build and quickly become incomplete or outdated. As a result, models often learn too narrowly, becoming confounded in some settings and unsafe in others. In this talk, we show how to turn PubMed and PubChem text into scalable supervision using language models for two domains: clinical radiology and therapeutic discovery. In radiology, literature-built models perform strongly and transfer far more robustly across hospitals. In drug design, we introduce MedexCLIP, a multimodal foundation model of molecules and text trained from literature, enabling zero-shot prediction of safety and pharmacokinetic properties and practical constraints for automated discovery pipelines. Together, these results position academic literature as a powerful, continuously updated training signal for medical AI.
12:30 – 14:00
Lunch Break
14:00 – 14:45
Revisiting Intelligence Augmentation: Investigating and Mitigating the Risks of AI to Human Intelligence
Vera Liao
Powerful AI technologies, especially recently developed large language models, are increasingly mediating or even replacing human thinking, from information and knowledge acquisition, judgment and decision making, and creativity, to our understanding of the world. In 1962, Douglas Engelbart described a vision of Intelligence Augmentation (IA), in which machines should augment, rather than replace, human thinking processes. In this talk, I will revisit this vision and pose the question: Are we moving away from IA with increasingly capable and agentic AI? Drawing on human-computer interaction research, including our own work, I will examine two interconnected threats. First, I will present findings from research that studies and mitigates people's overreliance on AI, highlighting fundamental obstacles to maintaining human oversight of AI and arguing that a productivity-oriented approach to AI development and use will structurally worsen these obstacles. I will then discuss our recent work studying how new affordances of LLMs threaten the integrity of information and knowledge acquisition, and situate this discussion in broader empirical research that has identified how AI is reshaping human cognition. I will close the talk with reflections on what intelligence augmentation actually requires, and how these requirements might be embedded in the technical objectives of AI and the sociotechnical infrastructures through which AI is deployed.
14:45 – 15:30
Beyond Scaling: Advancing Efficient, Multimodal, and Trustworthy Large Language Models
Jiawei Zhou
The rapid advancement and widespread adoption of large language models (LLMs) are transforming how we work, enabling increasingly sophisticated applications in real-world settings such as software development, workflow automation, and even scientific research. Although recent excitement around LLM-powered agents and applications (e.g., OpenClaw and MoltBook) has further amplified the visibility of these systems, many fundamental challenges remain in the underlying models. In this talk, drawing on personal experiences developing AI-assisted tools, we examine three critical areas where continued technical advances are needed to enable the next generation of more capable and trustworthy AI systems. These directions include improving computational efficiency and overcoming the bottleneck of LLM autoregressive generation, advancing multimodal models that integrate information from the real world beyond text, and developing methods for trustworthy personalization and safety in increasingly autonomous AI systems. I will share our recent work addressing these directions, including algorithmic advances for accelerating LLM inference and integrating dynamic knowledge, approaches for reducing hallucinations in vision-language models, leveraging visual text inputs for long-context processing, and new evaluation frameworks for fairness and personalization. I will conclude with reflections on open research challenges and what the next generation of AI systems may look like.
15:30 – 16:00
Coffee Break
16:00 – 16:45
Controlling Large Language Models via Inference-Time Steering
Kuan-Hao Huang
Large language models (LLMs) are increasingly used in real-world applications, where developers and users often require fine-grained control over model behavior. However, modifying model capabilities typically requires costly retraining or complex alignment procedures, making it difficult to adapt models to new tasks, languages, or user preferences after deployment. In this talk, I will present our recent work on inference-time steering, a lightweight approach for controlling LLM behavior by directly modifying internal representations. I will begin by showing how LLMs can be shifted into different "language modes" through activation steering, enabling language-specific behavior and improving multilingual performance. Next, I will present how inference-time steering enables more flexible and controllable trade-offs among multiple user preferences, offering advantages over traditional preference optimization methods. Finally, I will discuss the limitations of global steering vectors in long-form and multi-attribute generation, and introduce a context-aware steering approach that adapts steering directions based on the model's current representation, leading to more reliable and effective control. I will conclude the talk by discussing open challenges for building more controllable and adaptable LLMs.
16:45 – 17:30
Multi-Model Training for Multi-Agent Communication Skills
Elias Stengel-Eskin
As we scale from individual agents to teams of agents, inter-agent communication will become increasingly important. In this talk, I will describe a general paradigm for teaching multi-agent communication skills through multi-model reinforcement learning, which I will illustrate via three key collaborative skills: expressing confidence in a calibrated way, responding robustly to positive and negative persuasion, and expressing reasoning faithfully. I will show how these problems can be framed in terms of speaker-listener games, and how this framing allows us to teach models collaborative skills, often using games simulated on smaller models to train larger models.
17:30 – 17:40
Closing Remarks
Hanjie Chen & Vicente Ordóñez

Organizers

Hanjie Chen

Assistant Professor

Department of Computer Science
Rice University

[email protected]
Vicente Ordóñez

Associate Professor

Department of Computer Science
Rice University

[email protected]

Sponsors

Venue

Ralph S. O’Connor Building

5th Floor · Rice University
Houston, Texas 77005

Date & Time

Friday, March 13, 2026
8:30 AM – 5:40 PM

Directions

The O’Connor Building is located on the Rice University campus. Visitor parking is available in the adjacent parking structures.
