Taiwei Shi

Experiential Reinforcement Learning

2026-02-14T00:00:00+00:00

Abstract

Reinforcement learning has become the central approach for language models (LMs) to learn from environmental reward or feedback. In practice, the environmental feedback is usually sparse and delayed. Learning from such signals is challenging, as LMs must implicitly infer how observed failures should translate into behavioral changes for future iterations.

We introduce Experiential Reinforcement Learning (ERL), a training paradigm that embeds an explicit experience–reflection–consolidation loop into the reinforcement learning process. Given a task, the model generates an initial attempt, receives environmental feedback, and produces a reflection that guides a refined second attempt, whose success is reinforced and internalized into the base policy.

This process converts feedback into structured behavioral revision, improving exploration and stabilizing optimization while preserving gains at deployment without additional inference cost. Across sparse-reward control environments and agentic reasoning benchmarks, ERL consistently improves learning efficiency and final performance over strong reinforcement learning baselines, achieving gains of up to +81% in complex multi-step environments and up to +11% in tool-using reasoning tasks.

Overview

Experiential Reinforcement Learning (ERL) augments traditional reinforcement learning with an explicit loop:

Experience — The model attempts a task and receives feedback
Reflection — The model generates a structured critique
Consolidation — The model attempts a second time and the successful revisions are internalized into the policy

This enables models to transform sparse rewards into actionable behavioral updates.
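The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `attempt_fn`, `feedback_fn`, `reflect_fn`, and `success_threshold` are hypothetical names, and storing the successful second attempt as a (task, attempt) pair without the reflection is an assumption about how consolidation lets the policy act without reflection at inference.

```python
from typing import Callable, Optional, Tuple


def erl_episode(
    task: str,
    attempt_fn: Callable[[str, Optional[str]], str],
    feedback_fn: Callable[[str, str], Tuple[float, str]],
    reflect_fn: Callable[[str, str, str], str],
    success_threshold: float = 1.0,
) -> dict:
    """Run one experience-reflection-consolidation episode (illustrative)."""
    # Experience: first attempt without any reflection, then environment feedback.
    first = attempt_fn(task, None)
    reward_1, feedback = feedback_fn(task, first)

    # Reflection: turn the feedback into an explicit critique.
    reflection = reflect_fn(task, first, feedback)

    # Second attempt, guided by the reflection.
    second = attempt_fn(task, reflection)
    reward_2, _ = feedback_fn(task, second)

    # Consolidation (assumed form): keep a successful revision as a training
    # target conditioned on the task alone, so no reflection is needed later.
    distill_example = (task, second) if reward_2 >= success_threshold else None

    return {
        "reward_1": reward_1,
        "reward_2": reward_2,
        "reflection": reflection,
        "distill_example": distill_example,
    }


# Toy stand-ins for the policy and the environment (hypothetical).
def toy_attempt(task, reflection):
    return "left" if reflection is None else "right"


def toy_feedback(task, attempt):
    if attempt == "right":
        return 1.0, "reached the goal"
    return 0.0, "fell into a hole going left"


def toy_reflect(task, attempt, feedback):
    return f"Attempt '{attempt}' failed: {feedback}. Try the other direction."


result = erl_episode("cross the lake", toy_attempt, toy_feedback, toy_reflect)
print(result["reward_1"], result["reward_2"], result["distill_example"])
```

In an actual training run, both attempts would also feed the reinforcement learning update; the sketch only shows how feedback flows through reflection into a distillation target.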

Motivation

Traditional RL with verifiable rewards (RLVR) relies on trial-and-error driven by scalar rewards, which can lead to inefficient exploration and unstable learning in sparse-reward environments.

ERL introduces structured intermediate reasoning to:

Accelerate learning
Enable within-episode correction
Preserve improvements without reflection at inference

Method

Experiments

We evaluate ERL on:

FrozenLake — Sparse-reward navigation
Sokoban — Long-horizon planning
HotpotQA — Tool-using reasoning

Models:

Qwen3-4B-Instruct
Olmo-3-7B-Instruct

Optimizer: GRPO

Results

ERL consistently achieves:

Faster convergence
Higher final reward
Improved learning efficiency

Final Performance

Task	Qwen RLVR	Qwen ERL	Olmo RLVR	Olmo ERL
FrozenLake	0.86	0.94	0.39	0.66
HotpotQA	0.45	0.56	0.47	0.50
Sokoban	0.06	0.87	0.04	0.20
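
As a quick check on the table, the absolute gains of ERL over RLVR can be recomputed from the reported scores. The dictionary below simply re-enters the numbers from the table; the variable and function names are illustrative.

```python
# (RLVR, ERL) final scores per task and model, copied from the table above.
final_scores = {
    "FrozenLake": {"Qwen": (0.86, 0.94), "Olmo": (0.39, 0.66)},
    "HotpotQA": {"Qwen": (0.45, 0.56), "Olmo": (0.47, 0.50)},
    "Sokoban": {"Qwen": (0.06, 0.87), "Olmo": (0.04, 0.20)},
}


def absolute_gains(table):
    """Return ERL minus RLVR for every (task, model) pair, rounded to 2 places."""
    return {
        (task, model): round(erl - rlvr, 2)
        for task, models in table.items()
        for model, (rlvr, erl) in models.items()
    }


gains = absolute_gains(final_scores)
for (task, model), gain in gains.items():
    print(f"{task:<11} {model:<5} {gain:+.2f}")
```

The largest differences, 0.81 on Sokoban and 0.11 on HotpotQA (both with Qwen), appear to correspond to the "+81%" and "+11%" figures quoted in the abstract, which suggests those figures are absolute differences in score.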

Learning Dynamics

Post-reflection trajectories consistently outperform both pre-reflection and RLVR, demonstrating that reflection provides immediate within-episode improvement.

Ablation Study

Task	RLVR	ERL	ERL w/o Memory	ERL w/o Reflection
FrozenLake (Qwen)	0.86	0.94	0.86	0.60
HotpotQA (Qwen)	0.45	0.56	0.56	0.48
Sokoban (Qwen)	0.06	0.87	0.87	0.59
FrozenLake (Olmo)	0.39	0.66	0.64	0.54
HotpotQA (Olmo)	0.47	0.50	0.47	0.46
Sokoban (Olmo)	0.04	0.20	0.24	0.06

To isolate the contribution of individual components in ERL, we conduct ablations that remove either cross-episode memory or structured reflection while keeping the rest of the training setup fixed.

ERL w/o Memory removes cross-episode reflection reuse but keeps within-episode reflection and retry.
ERL w/o Reflection keeps the two-attempt structure but replaces structured reflection with a generic retry prompt.

Results show that removing reflection leads to the largest performance drop, indicating that structured reflective reasoning is the primary driver of ERL’s gains. Removing memory generally slows convergence and slightly reduces performance, suggesting it mainly improves stability and cumulative learning across episodes.

Key Contributions

Introduces Experiential Reinforcement Learning, embedding reflection into RL training
Proposes an internalization mechanism via selective distillation
Demonstrates improved efficiency and performance across control and reasoning tasks

One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

2026-02-05T00:00:00+00:00

Abstract

This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.

CoAct-1: Computer-using Agents with Coding as Actions

2025-08-22T00:00:00+00:00

Abstract

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

The Hallucination Tax of Reinforcement Finetuning

2025-08-22T00:00:00+00:00

Abstract

Reinforcement finetuning (RFT) has become a standard approach for enhancing the reasoning capabilities of large language models (LLMs). However, its impact on model trustworthiness remains underexplored. In this work, we identify and systematically study a critical side effect of RFT, which we term the hallucination tax: a degradation in refusal behavior that causes models to confidently produce hallucinated answers to unanswerable questions. To investigate this, we introduce SUM (Synthetic Unanswerable Math), a high-quality dataset of unanswerable math problems designed to probe models’ ability to recognize an unanswerable question by reasoning from insufficient or ambiguous information. Our results show that standard RFT training could reduce model refusal rates by more than 80%, which significantly increases the model’s tendency to hallucinate. We further demonstrate that incorporating just 10% SUM during RFT substantially restores appropriate refusal behavior, with minimal accuracy trade-offs on solvable tasks. Crucially, this approach enables LLMs to leverage inference-time compute to reason about their own uncertainty and knowledge boundaries, improving generalization not only to out-of-domain math problems but also to factual question answering tasks.
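
The idea of incorporating 10% SUM during RFT can be illustrated with a small data-mixing sketch. The abstract does not say how the mixture is built, so this is an assumption: the total training set size is kept fixed and a fraction of the answerable problems is swapped for unanswerable ones. The function name, the toy data, and the "I don't know" target are hypothetical.

```python
import random


def mix_unanswerable(answerable, unanswerable, fraction=0.10, seed=0):
    """Return a shuffled training set of the same size as `answerable`,
    in which about `fraction` of the items are unanswerable problems."""
    rng = random.Random(seed)
    n_total = len(answerable)
    n_unanswerable = round(fraction * n_total)
    if n_unanswerable > len(unanswerable):
        raise ValueError("not enough unanswerable problems for this fraction")

    kept = rng.sample(answerable, n_total - n_unanswerable)
    added = rng.sample(unanswerable, n_unanswerable)
    mixed = kept + added
    rng.shuffle(mixed)
    return mixed


# Toy data: each item is (question, expected answer, is_unanswerable).
answerable = [(f"solvable problem {i}", str(i), False) for i in range(100)]
unanswerable = [(f"underspecified problem {i}", "I don't know", True) for i in range(20)]

train_set = mix_unanswerable(answerable, unanswerable, fraction=0.10)
n_refusal_targets = sum(1 for _, _, is_unanswerable in train_set if is_unanswerable)
print(len(train_set), n_refusal_targets)
```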

STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models

2025-08-22T00:00:00+00:00

Abstract

Steerability, or the ability of large language models (LLMs) to adapt outputs to align with diverse community-specific norms, perspectives, and communication styles, is critical for real-world applications but remains under-evaluated. We introduce Steer-Bench, a benchmark for assessing population-specific steering using contrasting Reddit communities. Covering 30 contrasting subreddit pairs across 19 domains, Steer-Bench includes over 10,000 instruction-response pairs and 5,500 validated multiple-choice questions with corresponding silver labels to test alignment with diverse community norms. Our evaluation of 13 popular LLMs using Steer-Bench reveals that while human experts achieve an accuracy of 81% with silver labels, the best-performing models reach only around 65% accuracy depending on the domain and configuration. Some models lag behind human-level alignment by over 15 percentage points, highlighting significant gaps in community-sensitive steerability. Steer-Bench is a benchmark to systematically assess how effectively LLMs understand community-specific instructions, their resilience to adversarial steering attempts, and their ability to accurately represent diverse cultural and ideological perspectives.

Efficient Reinforcement Finetuning via Adaptive Curriculum Learning

2025-04-07T00:00:00+00:00

Abstract

Reinforcement finetuning (RFT) has shown great potential for enhancing the mathematical reasoning capabilities of large language models (LLMs), but it is often sample- and compute-inefficient, requiring extensive training. In this work, we introduce AdaRFT (Adaptive Curriculum Reinforcement Finetuning), a method that significantly improves both the efficiency and final accuracy of RFT through adaptive curriculum learning. AdaRFT dynamically adjusts the difficulty of training problems based on the model’s recent reward signals, ensuring that the model consistently trains on tasks that are challenging but solvable. This adaptive sampling strategy accelerates learning by maintaining an optimal difficulty range, avoiding wasted computation on problems that are too easy or too hard. AdaRFT requires only a lightweight extension to standard RFT algorithms like Proximal Policy Optimization (PPO), without modifying the reward function or model architecture. Experiments on competition-level math datasets (including AMC, AIME, and IMO-style problems) demonstrate that AdaRFT significantly improves both training efficiency and reasoning performance. We evaluate AdaRFT across multiple data distributions and model sizes, showing that it reduces the number of training steps by up to 2x and improves accuracy by a considerable margin, offering a more scalable and effective RFT framework.

Code: github.com/uscnlp-lime/verl
Dataset: huggingface.co/datasets/lime-nlp/DeepScaleR_Difficulty

Highlights

Dynamically adapts training difficulty using a lightweight curriculum scheduler
Compatible with standard RFT algorithms like PPO, GRPO, and REINFORCE++
Improves both sample efficiency and final accuracy on math reasoning benchmarks
Up to 2× faster convergence vs PPO baseline
Integrates into any RFT framework without modifying reward functions or model architectures

How It Works

AdaRFT tracks the model’s reward signal and adaptively shifts the target difficulty.
At each step, it samples training problems closest to the current target difficulty.
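
The two steps above can be sketched as follows. The text only says that the target difficulty shifts with the reward signal, so the specific update used here (a bounded `tanh` step around a goal reward, then clipping) is an assumption, as are all parameter names and default values.

```python
import math


def update_target(target, avg_reward, goal_reward=0.5, step_size=50.0,
                  sensitivity=2.0, d_min=0.0, d_max=100.0):
    """Shift the target difficulty up when recent reward exceeds the goal
    and down when it falls short (assumed update rule, for illustration)."""
    shift = step_size * math.tanh(sensitivity * (avg_reward - goal_reward))
    return min(d_max, max(d_min, target + shift))


def sample_closest(problems, target, batch_size):
    """Pick the `batch_size` problems whose difficulty is closest to the target.

    `problems` is a list of (problem_id, difficulty) pairs.
    """
    return sorted(problems, key=lambda p: abs(p[1] - target))[:batch_size]


# Toy pool with difficulty scores on a 0-100 scale (hypothetical data).
pool = [("p1", 5.0), ("p2", 30.0), ("p3", 48.0), ("p4", 55.0), ("p5", 90.0)]

target = 50.0
batch = sample_closest(pool, target, batch_size=2)
print([pid for pid, _ in batch])

# The model did well on that batch, so the next target moves to harder problems.
target = update_target(target, avg_reward=0.9)
print(round(target, 2))
```

Because the curriculum only changes which problems are sampled, a scheduler of this kind can sit in front of PPO or a similar algorithm without touching the reward function.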

Results

Learning Curves

AdaRFT (the orange curve) consistently leads in early training and reaches higher accuracy. For instance, on skew-difficult data, PPO needs 71.7% more steps to reach AdaRFT’s performance.

Final Accuracy at Step 100

Across all setups and model sizes, AdaRFT outperforms PPO and PPO with filtered data.

Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base

2025-03-30T00:00:00+00:00

Abstract

Large language models (LLMs) possess impressive linguistic capabilities but often fail to faithfully retain factual knowledge, leading to hallucinations and unreliable outputs. Understanding LLMs’ knowledge deficiencies by exhaustively evaluating against full-scale knowledge bases is computationally prohibitive, especially for closed-weight models. We propose stochastic error ascent (SEA), a scalable and efficient framework for discovering knowledge deficiencies (errors) in closed-weight LLMs under a strict query budget. Rather than naively probing all knowledge candidates, SEA formulates error discovery as a stochastic optimization process: it iteratively retrieves new high-error candidates by leveraging the semantic similarity to previously observed failures. To further enhance search efficiency and coverage, SEA employs hierarchical retrieval across document and paragraph levels, and constructs a relation directed acyclic graph to model error propagation and identify systematic failure modes. Empirically, SEA uncovers 40.7x more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, while reducing the cost-per-error by 599x and 9x, respectively. Human evaluation confirms the high quality of generated questions, while ablation and convergence analyses validate the contribution of each component in SEA. Further analysis on the discovered errors reveals correlated failure patterns across LLM families and recurring deficits, highlighting the need for better data coverage and targeted fine-tuning in future LLM development.

Key Contributions

Stochastic Error Ascent (SEA): A novel framework that efficiently identifies knowledge deficiencies in LLMs by modeling error discovery as a stochastic optimization process.
Hierarchical Retrieval: Utilizes both document and paragraph-level retrieval to enhance search efficiency and coverage.
Relation DAG Construction: Builds a directed acyclic graph to model error propagation and uncover systematic failure patterns.
Empirical Validation: SEA uncovers 40.7× more knowledge errors than Automated Capability Discovery and 26.7% more than AutoBencher, with significant reductions in cost per error.
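
The core search idea, retrieving new candidates that are semantically close to previously observed failures under a fixed query budget, can be sketched as below. This is a simplified illustration rather than the SEA algorithm itself: it omits hierarchical retrieval and the relation DAG, and the names `error_ascent`, `is_error`, and the toy embeddings are hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def error_ascent(candidates, is_error, budget, batch_size=2, seed=0):
    """Search for model errors under a query budget.

    candidates: dict mapping candidate id -> embedding vector.
    is_error:   callable(id) -> bool; each call counts as one model query.
    Returns the list of candidate ids on which the model failed.
    """
    rng = random.Random(seed)
    remaining = set(candidates)
    errors = []
    queries = 0
    while queries < budget and remaining:
        k = min(batch_size, budget - queries, len(remaining))
        if errors:
            # Rank unqueried candidates by similarity to the closest known error.
            def score(c):
                return max(cosine(candidates[c], candidates[e]) for e in errors)
            batch = sorted(remaining, key=lambda c: (-score(c), c))[:k]
        else:
            # No failures observed yet: fall back to random exploration.
            batch = rng.sample(sorted(remaining), k)
        for c in batch:
            remaining.discard(c)
            queries += 1
            if is_error(c):
                errors.append(c)
    return errors


# Toy knowledge base: the model fails on everything in cluster "a".
toy = {"a1": [1.0, 0.0], "a2": [0.9, 0.1], "a3": [0.8, 0.2],
       "b1": [0.0, 1.0], "b2": [0.1, 0.9], "b3": [0.2, 0.8]}
found = error_ascent(toy, is_error=lambda c: c.startswith("a"), budget=6)
print(sorted(found))
```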

SEA Framework Overview

Experimental Results

Error Discovery Efficiency:
- SEA uncovers 40.7× more knowledge errors than Automated Capability Discovery.
- SEA identifies 26.7% more errors than AutoBencher.
Cost Reduction:
- Achieves a 599× reduction in cost per error compared to Automated Capability Discovery.
- Offers a 9× cost reduction compared to AutoBencher.
Human Evaluation: Confirms the high quality of questions generated by SEA.
Ablation Studies: Validate the contribution of each component within the SEA framework.

Conclusion

SEA presents a scalable and efficient approach to uncovering knowledge deficiencies in LLMs, particularly under constraints of limited query budgets. By leveraging semantic similarities and modeling error propagation, SEA significantly enhances the discovery of knowledge errors, paving the way for more reliable and accurate language models.

Detecting and Filtering Unsafe Training Data via Data Attribution

2025-02-16T00:00:00+00:00

Abstract

Large language models (LLMs) are highly susceptible to unsafe training data, where even small quantities of harmful data can lead to undesirable model behaviors. Identifying and filtering such unsafe data is crucial for the development of trustworthy AI systems. Existing methods primarily rely on moderation classifiers, which suffer from high computational costs, rigidity in predefined taxonomies, and lack of insight into the training process. To address these issues, we introduce DABUF (Data-Attribution-Based Unsafe Training Data Detection and Filtering), a novel approach that leverages data attribution techniques to trace harmful model outputs back to their influential training instances. Unlike traditional moderation classifiers, DABUF enables flexible and adaptable detection of various unsafe data types. Our experiments demonstrate that DABUF outperforms state-of-the-art methods in identifying and filtering unsafe training data, leading to significantly safer LLMs across multiple domains, including jailbreaking detection and gender bias mitigation.

Method Overview

DABUF operates through a two-phase process:

Unsafe Training Data Detection

DABUF detects unsafe training data by attributing harmful model outputs to specific training instances. Using gradient similarity techniques, DABUF quantifies the influence of each data point on unsafe model behaviors. However, long-form outputs with mixed safe and unsafe content pose challenges for direct attribution. To address this, DABUF integrates moderation classifiers to refine attribution targets, ensuring a more precise identification of harmful training instances.
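
A gradient-similarity score of this kind can be sketched with plain Python lists standing in for per-example gradients. The choice of cosine similarity averaged over unsafe target outputs is an assumption for illustration; the function names and toy vectors are hypothetical, and a real pipeline would obtain the gradients from the model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribution_scores(train_grads, unsafe_target_grads):
    """Score each training example by its mean gradient similarity
    to the gradients of the unsafe target outputs."""
    return {
        example_id: sum(cosine(grad, target) for target in unsafe_target_grads)
        / len(unsafe_target_grads)
        for example_id, grad in train_grads.items()
    }


# Toy gradients (hypothetical): "a" points the same way as the unsafe targets.
train_grads = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
unsafe_target_grads = [[1.0, 0.0], [0.9, 0.1]]

scores = attribution_scores(train_grads, unsafe_target_grads)
for example_id, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(example_id, round(score, 3))
```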

Unsafe Data Filtering

Once identified, the most influential unsafe training samples are removed from the dataset, mitigating unsafe model behaviors. The method balances recall and precision to ensure that filtering does not overly impact benign data while significantly improving model safety.
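
The filtering step can be illustrated with a small top-k sketch. The cutoff rule (remove the `top_k` highest-scoring examples) and the precision and recall check against known labels are assumptions made for the example; the names and toy data are hypothetical.

```python
def filter_most_influential(scores, dataset, top_k):
    """Remove the `top_k` training examples with the highest attribution scores.

    scores:  dict mapping example id -> attribution score.
    dataset: list of dicts, each with an "id" key.
    Returns (kept_examples, flagged_ids).
    """
    flagged = sorted(scores, key=scores.get, reverse=True)[:top_k]
    flagged_set = set(flagged)
    kept = [example for example in dataset if example["id"] not in flagged_set]
    return kept, flagged


def precision_recall(flagged, truly_unsafe):
    """Precision and recall of the flagged ids against known unsafe ids."""
    true_positives = len(set(flagged) & set(truly_unsafe))
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truly_unsafe) if truly_unsafe else 0.0
    return precision, recall


# Toy example (hypothetical scores and labels).
toy_scores = {"a": 0.9, "b": 0.1, "c": -0.5, "d": 0.7}
toy_dataset = [{"id": "a"}, {"id": "b"}, {"id": "c"}, {"id": "d"}]

kept, flagged = filter_most_influential(toy_scores, toy_dataset, top_k=2)
print([example["id"] for example in kept], flagged)
print(precision_recall(flagged, truly_unsafe={"a"}))
```

A larger `top_k` raises recall at the cost of removing more benign data, which is the trade-off the paragraph above describes.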

Key Takeaways

We apply DABUF to various datasets, including jailbreaking data (ToxicChat, XSTest-Response) and gender bias data (Bias in Bios), and compare its performance against leading moderation tools such as OpenAI’s Moderation API, Llama-Guard-3-8B, Wildguard, and GradSafe. Key findings include:

DABUF significantly improves unsafe data detection.
- DABUF outperforms baseline classifiers, achieving a 7.5% higher AUPRC in jailbreaking detection and a 44.1% improvement in gender bias detection compared to state-of-the-art models.
- Unlike existing moderation classifiers, DABUF effectively handles diverse safety concerns without relying on predefined taxonomies.
Filtering unsafe training data with DABUF leads to safer models.
- Retraining on DABUF-filtered data results in significantly lower attack success rates (ASR) in jailbreaking evaluations.
- In gender bias mitigation, DABUF reduces the True Positive Rate (TPR) gender gap, showcasing its effectiveness in reducing biases in LLM outputs.
DABUF’s data attribution approach generalizes well across domains.
- The method’s flexibility allows it to be applied beyond traditional content moderation, extending to adversarial attack resistance and fairness-related biases.
- Unlike heuristic-based classifiers, DABUF directly ties unsafe behaviors to their training sources, providing a transparent and explainable filtering mechanism.

Conclusion

DABUF presents a scalable, adaptable, and effective approach for detecting and filtering unsafe training data in LLMs. By leveraging data attribution instead of rigid moderation classifiers, DABUF provides a more flexible and efficient method for enhancing model safety. Future directions include refining attribution techniques for more complex unsafe behaviors and extending the method to additional fairness and security-related challenges.

On the Trustworthiness of Generative Foundation Models

2025-02-16T00:00:00+00:00

Abstract

Generative Foundation Models (GenFMs) have become widely used across various domains, but concerns remain about their trustworthiness, including truthfulness, safety, fairness, robustness, and privacy. This paper presents a framework to address these issues through three key contributions. First, we review global AI governance policies and industry standards and propose standardized guidelines for assessing and improving GenFM trustworthiness. Second, we introduce TrustGen, a benchmarking platform designed to evaluate models across different dimensions, including text-to-image, large language, and vision-language models. Unlike traditional evaluation methods, TrustGen enables adaptive assessments through metadata curation, test case generation, and contextual variation. Using TrustGen, we analyze the trustworthiness of current GenFMs, highlighting progress while identifying challenges such as overly cautious safety measures and persistent vulnerabilities in open-source models. This work provides a foundation for developing safer and more responsible generative AI and includes an open-source evaluation toolkit for further research.

Method Overview

TrustGen evaluates GenFMs through three main components:

Standardized Guidelines for Trustworthy GenFMs

A set of guidelines developed through multidisciplinary collaboration, integrating technical, ethical, legal, and societal perspectives. These guidelines provide a structured approach to evaluating and improving model trustworthiness.

Dynamic Trustworthiness Evaluation

TrustGen moves beyond static evaluation benchmarks by introducing a modular framework with three key components:

Metadata Curation – Collects and organizes evaluation metadata dynamically.
Test Case Generation – Produces diverse evaluation cases for different trust dimensions.
Contextual Variation – Modifies test cases to ensure adaptability across models and scenarios.

This approach allows for real-time, flexible assessments that reduce biases from predefined test cases.

Trustworthiness Assessment of State-of-the-Art Models

We apply TrustGen to benchmark leading generative models, including text-to-image, large language, and vision-language models. Our results show that while models have made progress, key trade-offs remain between safety, usability, and robustness.

Key Takeaways

Our evaluation of state-of-the-art GenFMs using TrustGen reveals several insights:

Persistent trustworthiness challenges:
- While leading models perform well in safety and fairness, issues remain in truthfulness and robustness.
- Some models are overly cautious, leading to reduced usefulness in benign scenarios.
Open-source models are catching up:
- Certain open-source models now match or outperform proprietary ones in areas like privacy and fairness.
- Models such as CogView-3-Plus and Llama-3.2-70B show trustworthiness levels comparable to top commercial models.
Narrowing trustworthiness gap:
- The differences in trust scores among top models have decreased, suggesting improvements across the industry.
- Collaboration and shared best practices have contributed to more consistent trustworthiness enhancements.
Interconnected trustworthiness factors:
- Improvements in one area (e.g., safety) often impact others (e.g., usability).
- A balanced approach is needed to ensure models remain both useful and responsible.

Conclusion

TrustGen provides a new benchmark for evaluating the trustworthiness of GenFMs, enabling adaptive and iterative assessments that address limitations of static evaluation methods. While significant improvements have been made, challenges remain, particularly in balancing safety, robustness, and practical usability. Future research should focus on refining evaluation strategies and fostering interdisciplinary collaboration to ensure that generative AI systems are fair, reliable, and aligned with human needs.

To support further research, the TrustEval-toolkit is available on GitHub.

WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback

2024-09-05T00:00:00+00:00

Abstract

As large language models (LLMs) continue to advance, aligning these models with human preferences has emerged as a critical challenge. Traditional alignment methods, relying on human or LLM annotated datasets, are limited by their resource-intensive nature, inherent subjectivity, misalignment with real-world user preferences, and the risk of feedback loops that amplify model biases. To overcome these limitations, we introduce WildFeedback, a novel framework that leverages in-situ user feedback during conversations with LLMs to create preference datasets automatically. Given a corpus of multi-turn user-LLM conversations, WildFeedback identifies and classifies user feedback to LLM responses between conversation turns. The user feedback is then used to create examples of preferred and dispreferred responses according to users’ preferences. Our experiments demonstrate that LLMs fine-tuned on the WildFeedback dataset exhibit significantly improved alignment with user preferences, as evidenced by both traditional benchmarks and our proposed checklist-guided evaluation. By incorporating in-situ feedback from actual users, WildFeedback addresses the scalability, subjectivity, and bias challenges that plague existing approaches, marking a significant step toward developing LLMs that are more responsive to the diverse and evolving needs of their users.

Method Overview

WildFeedback operates through a three-step process:

Feedback Signal Identification

This step involves analyzing user-LLM interactions to identify feedback signals (satisfaction or dissatisfaction). Feedback signals are extracted from real dialogues using rubrics to classify user satisfaction (SAT) and dissatisfaction (DSAT).

Preference Data Construction

Conversations containing satisfaction (SAT) or dissatisfaction (DSAT) signals are used to identify prompts and summarize user preferences, such as preferences for more detailed or precise responses. Dispreferred responses are directly taken from instances that triggered DSAT signals, while preferred responses are generated using GPT-4 or on-policy models guided by the summarized user preferences. To ensure safety, additional instructions prevent generating harmful content, and moderation filters are applied. This approach produces a dataset that better captures authentic user preferences, enhancing LLM alignment with real-world user expectations.
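
The construction of preference pairs from DSAT signals can be sketched as follows. This is an illustration under stated assumptions, not the WildFeedback pipeline: the three callables stand in for the rubric-based classifier, the preference summarizer, and the response generator, and the `Turn` structure and field names are hypothetical. The safety instructions and moderation filters mentioned above are omitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    prompt: str
    response: str
    next_user_message: Optional[str]  # the user's follow-up, which may carry feedback


def build_preference_pairs(
    turns: List[Turn],
    classify_feedback: Callable[[str], str],
    summarize_preference: Callable[[str], str],
    generate_preferred: Callable[[str, str], str],
) -> List[dict]:
    """Turn DSAT-triggering responses into (chosen, rejected) preference pairs."""
    pairs = []
    for turn in turns:
        if turn.next_user_message is None:
            continue
        if classify_feedback(turn.next_user_message) != "DSAT":
            continue
        preference = summarize_preference(turn.next_user_message)
        pairs.append({
            "prompt": turn.prompt,
            "rejected": turn.response,  # the response that triggered DSAT
            "chosen": generate_preferred(turn.prompt, preference),
            "preference": preference,
        })
    return pairs


# Toy stand-ins for the classifier, summarizer, and generator (hypothetical).
def toy_classify(message):
    return "DSAT" if "too short" in message else "SAT"


def toy_summarize(message):
    return "prefers more detailed answers"


def toy_generate(prompt, preference):
    return f"[detailed answer to: {prompt}]"


turns = [
    Turn("Explain GRPO.", "It is an RL method.", "That is too short, please elaborate."),
    Turn("What is 2 + 2?", "4", "Thanks!"),
    Turn("Define recall.", "TP / (TP + FN)", None),
]
pairs = build_preference_pairs(turns, toy_classify, toy_summarize, toy_generate)
print(pairs)
```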

User-guided Evaluation

The user-guided evaluation in WildFeedback aligns model assessments with real user preferences by incorporating direct user feedback into the evaluation process. Instead of relying solely on automated or human annotator judgments, this method uses feedback signals from user-LLM interactions to guide evaluations, ensuring they reflect actual user expectations. Evaluators, including LLMs like GPT-4, are provided with a checklist of summarized user preferences from the dataset, which informs their assessment of model responses. This approach reduces biases common in traditional benchmarks and ensures that the evaluation process accurately measures how well models meet user needs, leading to more reliable and user-aligned performance metrics.

Key Takeaways

We applied WildFeedback to the WildChat dataset and constructed a preference dataset of more than 20k samples. To validate the effectiveness of WildFeedback, we finetune Mistral, Phi 3, and LLaMA 3 on it and compare their performance with the non-finetuned models on MT-Bench, AlpacaEval 2, Arena-Hard, and the held-out test set of WildFeedback. For WildFeedback evaluation, we report the win, tie, and loss percentages against the off-the-shelf instruct models with GPT-4 as the judge. Results are shown in Table 3. Some key takeaways are:

Training models on the WildFeedback dataset can significantly and consistently boost model performance across all benchmarks. Models trained on the dataset exhibit higher win rates across AlpacaEval 2, Arena-Hard, and MT-Bench, as well as improved performance in both settings of WildFeedback (with and without a checklist).
WildFeedback significantly enhances model alignment with in-situ user feedback. As detailed in the previous section, WildFeedback has two versions, differing in whether the preferred responses are generated by GPT-4 or the policy models themselves. Models trained on either version of WildFeedback demonstrate a stronger alignment with real user preferences, winning much more often on the WildFeedback test set than both the off-the-shelf instruct models and the models trained on UltraFeedback.
WildFeedback does not compromise model performance on other benchmarks. Training on either version of WildFeedback aligns models more closely with user preferences without hurting performance on other benchmarks; in most cases, it even leads to improvements.