alphaXiv


Events

Watch Recordings
- Maximum Likelihood Reinforcement Learning (03/19 · Fahim Tajwar · CMU)
- Data-driven Discovery at Ai2 (03/21 · Bodhisattwa Majumder · Ai2)
Briefs
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
12 Mar 2026
Yulu Gan
Phillip Isola

MIT CSAIL researchers Yulu Gan and Phillip Isola propose that large pretrained neural networks exist in a "thicket" regime, where their weight space is dense with diverse task-specific experts accessible via simple random perturbations. Their "RandOpt" algorithm, which randomly samples and ensembles these perturbations, achieved performance competitive with or superior to established gradient-based post-training methods on various LLM and VLM tasks.
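The brief gives only the high-level recipe, but the sample-and-ensemble idea behind RandOpt is easy to sketch. Below is a minimal NumPy toy on a linear-regression stand-in task; the loss, task, sampling budget, and all function names are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # toy stand-in task: mean squared error of a linear model
    return np.mean((X @ w - y) ** 2)

def rand_opt(w_pretrained, X, y, n_samples=200, sigma=0.1, top_k=10):
    """Sample random perturbations around the pretrained weights,
    keep the best-scoring candidates, and ensemble them by averaging."""
    candidates = []
    for _ in range(n_samples):
        w = w_pretrained + sigma * rng.standard_normal(w_pretrained.shape)
        candidates.append((loss(w, X, y), w))
    candidates.sort(key=lambda t: t[0])
    experts = [w for _, w in candidates[:top_k]]
    return np.mean(experts, axis=0)  # simple ensemble: average the experts

# toy demo: random perturbations around a rough "pretrained" point improve the fit
X = rng.standard_normal((64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w0 = w_true + 0.3 * rng.standard_normal(4)   # "pretrained" weights near the optimum
w_ens = rand_opt(w0, X, y)
assert loss(w_ens, X, y) < loss(w0, X, y)
```

Because the toy loss is convex, the average of the top-scoring perturbations is guaranteed to score at least as well as their mean loss, which illustrates why ensembling cheap random "experts" can be competitive in the dense-thicket regime the paper describes.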

#computer-science #artificial-intelligence #machine-learning
Temporal Straightening for Latent Planning
12 Mar 2026
University of Toronto · New York University
Ying Wang
Oumayma Bounou
Gaoyue Zhou

Researchers introduced "temporal straightening," a geometric regularization technique that encourages straighter trajectories in latent space, improving representations for model-based reinforcement learning. Inspired by the perceptual straightening hypothesis, the method creates a latent space in which Euclidean distances more accurately reflect true environmental distances, yielding substantial gains in gradient-based planning performance and goal-reaching success rates.
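The brief does not specify the regularizer beyond "encouraging straighter trajectories." One common way to implement such a penalty (a hedged sketch, not necessarily the authors' exact loss) is to penalize the angle between consecutive displacement vectors along a latent trajectory:

```python
import numpy as np

def straightness_penalty(z, eps=1e-8):
    """Curvature penalty for a latent trajectory z of shape (T, d).

    Computes successive displacement vectors and penalizes the angle
    between consecutive displacements: near 0 when the trajectory is a
    straight line, larger the more it bends.
    """
    v = np.diff(z, axis=0)                               # (T-1, d) displacements
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    cos = np.sum(v[:-1] * v[1:], axis=1)                 # cosine between steps
    return np.mean(1.0 - cos)                            # 0 for perfectly straight

# a straight trajectory incurs ~zero penalty; a curved one does not
t = np.linspace(0, 1, 10)[:, None]
straight = np.hstack([t, 2 * t])          # points on a line
bent = np.hstack([t, np.sin(6 * t)])      # curved path
assert straightness_penalty(straight) < 1e-6
assert straightness_penalty(bent) > 0.1
```

In a training loop such a term would be added to the world-model loss so that gradient-based planners can follow near-linear paths through latent space.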

#computer-science #machine-learning #deep-reinforcement-learning
OpenClaw-RL: Train Any Agent Simply by Talking
10 Mar 2026
Yinjie Wang
Xuyang Chen
Xiaolong Jin

OpenClaw-RL is a framework that systematically converts real-time next-state signals from AI agent interactions into continuous, online learning sources. The system recovers both implicit evaluative signals and explicit directive signals, enabling agents to achieve rapid personalization in conversational settings and improve performance across diverse general agent tasks like terminal, GUI, SWE, and tool-calling environments.

#agentic-frameworks #agents #computer-science
Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
12 Mar 2026
Songlin Wei
Hongyi Jing
Boqian Li

Ψ0 introduces an open foundation model for humanoid loco-manipulation, employing a decoupled learning strategy that pre-trains on human egocentric videos for generalizable visual-action representations and post-trains on significantly less robot data for precise joint control. This approach achieves over 40% higher success rates on complex, long-horizon tasks compared to state-of-the-art baselines, demonstrating improved data efficiency.

#computer-science #robotics
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
12 Mar 2026
Guanyu Jiang
Zhaochen Su
Xiaoye Qu

XSkill introduces a dual-stream framework that enables multimodal agents to continually learn from visually grounded task-level skills and action-level experiences without explicit retraining. By improving tool-use efficiency and flexibility, the approach consistently lifts agent performance, with Average@4 gains of 2.58 to 6.71 points over baselines across various benchmarks.

#agents #computer-science #continual-learning
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
12 Mar 2026
Yixin Liu
Yue Yu
DiJia Su

Researchers at Meta Superintelligence Labs and Yale University conducted a controlled study on the practical impact of reasoning LLM-judges in policy post-training within non-verifiable domains, finding that while these judges prevent reward hacking, policies trained with them learn to generate highly effective adversarial outputs. Policies trained using reasoning judges achieved strong performance against a gold-standard evaluator, but exploited judge vulnerabilities through strategies like over-refusal and fabricated policies, generalizing to models like GPT-4.1 and achieving win rates up to 90% on Arena-Hard-V2 subsets.

#adversarial-attacks #agents #computer-science
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
12 Mar 2026
Yushi Bai
Qian Dong
Ting Jiang

IndexCache, developed by Tsinghua University and Z.ai, accelerates sparse attention in large language models by reusing token selection indices across transformer layers, reducing the O(L^2) indexer computation cost. This method yields up to a 1.82x speedup in prefill latency and a 1.48x speedup in decode throughput for 200K token contexts, while maintaining model quality.
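The cross-layer reuse idea can be sketched in a toy single-query setting. The "indexer" here (a dot-product top-k over keys) and all shapes are assumptions for illustration, not the paper's actual components:

```python
import numpy as np

def topk_indices(scores, k):
    """Indices of the k highest-scoring keys (the 'indexer' step)."""
    return np.argpartition(scores, -k)[-k:]

def sparse_attention(q, K, V, idx):
    """Attend only over the keys/values selected by idx."""
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
L_ctx, d, k = 1024, 32, 64
q = rng.standard_normal(d)
layers = [(rng.standard_normal((L_ctx, d)), rng.standard_normal((L_ctx, d)))
          for _ in range(4)]

# Without caching, every layer would run its own O(L) indexer over the
# full context. With cross-layer reuse, the indices are computed once at
# the first layer of a group and reused by the layers that follow.
K0, _ = layers[0]
cached_idx = topk_indices(K0 @ q, k)  # indexer runs once...
outs = [sparse_attention(q, K, V, cached_idx) for K, V in layers]  # ...reused 4x
assert len(outs) == 4 and all(o.shape == (d,) for o in outs)
```

The speedup comes from amortizing the index computation: attention itself stays sparse at every layer, but only one layer per group pays the indexer cost.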

#agents #attention-mechanisms #computer-science
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
12 Mar 2026
Fangfu Liu
Diankun Wu
Jiawei Chi

Spatial-TTT equips Multimodal Large Language Models (MLLMs) with the ability to process and reason about 3D spaces from continuous video streams using a test-time training framework. It integrates a hybrid architecture and spatial-predictive mechanism, achieving state-of-the-art performance, including a 12.3 percentage point improvement on MindCube-Tiny and robust object counting over 120-minute videos.

#computer-science #continual-learning #computer-vision-and-pattern-recognition
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
12 Mar 2026
Ruiying Li
Yunlang Zhou
YuYao Zhu

RoboClaw introduces an agentic framework that unifies data collection, policy learning, and task execution for long-horizon robotic manipulation by using an off-the-shelf Vision-Language Model as a meta-controller. This approach employs self-resetting data collection and continuous process supervision to reduce human intervention and enhance task success rates in real-world environments.

#agentic-frameworks #agents #computer-science
Ranking Reasoning LLMs under Test-Time Scaling
11 Mar 2026
Case Western Reserve University
Mohsen Hariri
Michael Hinczewski
Jing Ma

Researchers at Case Western Reserve University formalized a dense benchmark ranking framework for reasoning LLMs under test-time scaling, systematically comparing 72 statistical methods. Their analysis revealed that while most methods agree at high trial budgets, a Bayesian estimator incorporating a greedy prior achieved the highest low-budget stability, reducing the standard deviation of Kendall's τb by 16–52% at N=1.
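The brief does not define the "greedy prior" estimator; one plausible reading (a hypothetical Beta-Binomial sketch, not the paper's formulation) is to shrink the sampled success rate toward the model's greedy-decoding accuracy, so that rankings stay stable when only a few trials are available:

```python
def posterior_accuracy(successes, trials, greedy_acc, strength=4.0):
    """Beta-Binomial shrinkage: a Beta prior centered on the model's
    greedy-decoding accuracy, updated with N sampled trials.

    At small N the estimate stays close to greedy_acc; as N grows the
    observed success rate dominates.
    """
    alpha = greedy_acc * strength
    beta = (1.0 - greedy_acc) * strength
    return (successes + alpha) / (trials + alpha + beta)

# With a single trial (N=1) the estimate barely moves from the prior,
# which is what stabilizes low-budget rankings; at N=100 the data wins.
est_n1 = posterior_accuracy(successes=1, trials=1, greedy_acc=0.6)
est_n100 = posterior_accuracy(successes=80, trials=100, greedy_acc=0.6)
```

The `strength` parameter (an assumption here) controls how many pseudo-observations the greedy score is worth relative to real sampled trials.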

#computer-science #machine-learning #mathematics
daVinci-Env: Open SWE Environment Synthesis at Scale
13 Mar 2026
Dayuan Fu
Shenyu Wu
Yunze Wu

The daVinci-Env (OpenSWE) framework synthesizes 45,320 executable Docker environments from over 12.8k repositories, creating the largest open-source dataset for software engineering (SWE) agent training. Models trained on this dataset achieve up to 66.0% Pass@1 on SWE-Bench Verified and show improved performance across various general capability benchmarks.

#agentic-frameworks #agents #computer-science
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
12 Mar 2026
Baifeng Shi
Stephanie Fu
Long Lian

Researchers from UC Berkeley, NVIDIA, MIT, and Clarifai developed AutoGaze, a lightweight module that enables Multi-modal Large Language Models to efficiently process long-form, high-resolution videos by adaptively selecting multi-scale informative visual patches before attention. This approach achieved up to 19x Vision Transformer and 10x MLLM latency reduction, scaling MLLMs to 1K frames at 4K resolution and improving performance by 10.1% on the new HLVid benchmark compared to baselines.
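The "attend before attention" idea, i.e. pruning to a small set of informative patches before the expensive attention stack, can be sketched as follows. The saliency score here (feature norm) is a hypothetical proxy; the paper's gazing module is learned, not this heuristic:

```python
import numpy as np

def select_patches(frame_feats, keep_ratio=0.1):
    """Keep only the highest-scoring patches before running attention.

    frame_feats: (T, P, d) patch features for T frames of P patches each.
    Returns the kept tokens of shape (k, d), k = keep_ratio * T * P.
    """
    T, P, d = frame_feats.shape
    flat = frame_feats.reshape(T * P, d)
    scores = np.linalg.norm(flat, axis=1)     # proxy "informativeness" score
    k = max(1, int(keep_ratio * T * P))
    keep = np.argsort(scores)[-k:]
    return flat[keep]                         # tokens actually fed to the MLLM

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 196, 64))   # 200 frames x 196 patches
tokens = select_patches(feats, keep_ratio=0.05)
assert tokens.shape == (int(0.05 * 200 * 196), 64)  # 1,960 of 39,200 tokens
```

Since attention cost grows quadratically in sequence length, keeping 5% of patches cuts the attention workload by orders of magnitude, which is the mechanism behind the reported latency reductions.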

#computer-science #computer-vision-and-pattern-recognition
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
13 Mar 2026
Tsinghua University
Yichen Zhang
Da Peng
Zonghao Guo

Researchers from Tsinghua University and collaborators developed CHEERS, a unified multimodal model that integrates visual comprehension and high-fidelity image generation by decoupling patch-level details from semantic representations. This architecture achieves competitive performance on understanding and generation benchmarks using reduced training data and demonstrates emergent zero-shot image editing.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition
Multimodal OCR: Parse Anything from Documents
13 Mar 2026
Handong Zheng
Yumeng Li
Kaile Zhang

A new Multimodal OCR (MOCR) paradigm is introduced, which unifies the parsing of both textual content and visual graphics like charts and diagrams into structured representations, often as SVG code. The dots.mocr system demonstrates superior performance on document parsing benchmarks and excels in converting graphics to SVG, while retaining robust general vision-language capabilities.

#computer-science #computer-vision-and-pattern-recognition #data-curation
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
12 Mar 2026
Zi-Han Wang
Lam Nguyen
Zhengyang Zhao

CreativeBench provides a new benchmark for quantitatively evaluating machine creativity in code generation, covering both combinatorial and exploratory creativity via a cognitive framework. As a complement, EvoRePE offers an inference-time steering strategy that boosts model creativity; the benchmark reveals that while model scaling improves correctness, it can suppress novelty, whereas EvoRePE consistently enhances output diversity with low overhead.

#computer-science #artificial-intelligence
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
12 Mar 2026
Tianwei Xiong
Jun Hao Liew
Zilong Huang

Researchers at The University of Hong Kong and ByteDance Seed developed EVATok, an adaptive tokenization framework that dynamically assigns tokens based on video content complexity. This method enables more efficient video reconstruction and generation, reducing token usage by over 24% for reconstruction and 26% for generation on UCF-101, while achieving superior quality and new state-of-the-art performance in downstream autoregressive tasks.
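Adaptive token allocation of this kind can be illustrated with a toy budget rule; the complexity measure below (mean inter-frame difference energy) and the budget bounds are assumptions for illustration, not EVATok's learned scorer:

```python
import numpy as np

def adaptive_token_budget(video, min_tokens=16, max_tokens=256):
    """Assign a per-clip token budget from content complexity.

    Uses a hypothetical proxy (mean absolute inter-frame difference, which
    lies in [0, 1] for frames scaled to [0, 1]) so that static clips get
    few tokens and dynamic clips get many.
    """
    diffs = np.abs(np.diff(video.astype(np.float64), axis=0))
    complexity = diffs.mean()
    budget = min_tokens + complexity * (max_tokens - min_tokens)
    return int(np.clip(round(budget), min_tokens, max_tokens))

rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 8, 8)), 16, axis=0)   # 16 identical frames
dynamic = rng.random((16, 8, 8))                        # every frame differs
assert adaptive_token_budget(static) == 16
assert adaptive_token_budget(dynamic) > adaptive_token_budget(static)
```

The reported token savings follow from exactly this asymmetry: most real videos contain long low-complexity stretches that need far fewer tokens than a fixed-length tokenizer would spend on them.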

#computer-science #computer-vision-and-pattern-recognition #generative-models
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
12 Mar 2026
Xuanlang Dai
Yujie Zhou
Long Xing

EndoCoT introduces a framework for diffusion models to perform endogenous Chain-of-Thought reasoning by iteratively refining latent thought states. The model achieved an average accuracy of 92.1% across diverse visual reasoning tasks, outperforming Diff Thinker by 8.3 percentage points, and demonstrated scalable reasoning and interpretable, step-by-step problem-solving.

#chain-of-thought #computer-science #computation-and-language
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
12 Mar 2026
Moayed Haji-Ali
Willi Menapace
Ivan Skorokhodov

The Elastic Latent Interface Transformer (ELIT) integrates a flexible latent representation into Diffusion Transformers (DiTs), enabling adaptive and non-uniform computation allocation and variable test-time compute from a single model. This approach yielded substantial FID improvements on ImageNet-1K and allowed up to 63% FLOPs reduction at high resolutions with graceful quality trade-offs.

#attention-mechanisms #computer-science #computer-vision-and-pattern-recognition
VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model
13 Mar 2026
Xiangyu Sun
Shijie Wang
Fengyi Zhang

VGGT-World presents a geometry world modeling framework that directly forecasts the temporal evolution of 3D scene geometry using latent features from a frozen Geometry Foundation Model (GFM). This approach achieves superior geometric consistency in future depth and point cloud predictions, improving metrics like AbsRel by up to 32%, while significantly reducing computational overhead with 3.6x to 5x faster inference and a much smaller trainable footprint compared to video-centric world models.

#computer-science #computer-vision-and-pattern-recognition #geometric-deep-learning
ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control
12 Mar 2026
Chetan Borse
Zhixian Xie
Wei-Cheng Huang

ComFree-Sim introduces a GPU-parallelized analytical contact physics engine for robotics simulation, overcoming the contact resolution bottleneck with near-linear runtime scaling and significantly higher throughput than existing iterative solvers. The engine achieves up to 3x faster simulation and 2x higher throughput compared to MJWarp, directly improving real-time control success rates by an average of 27 percentage points in dexterous manipulation tasks.

#computer-science #robotics