alphaXiv


Events

Watch Recordings
- Maximum Likelihood Reinforcement Learning (03/19 · Fahim Tajwar · CMU)
- Data-driven Discovery at Ai2 (03/21 · Bodhisattwa Majumder · Ai2)
Briefs
Neural Thickets: Diverse Task Experts Are Dense Around Pretrained Weights
12 Mar 2026
Yulu Gan
Phillip Isola

MIT CSAIL researchers Yulu Gan and Phillip Isola propose that large pretrained neural networks exist in a "thicket" regime, where their weight space is dense with diverse task-specific experts accessible via simple random perturbations. Their "RandOpt" algorithm, which randomly samples and ensembles these perturbations, achieved performance competitive with or superior to established gradient-based post-training methods on various LLM and VLM tasks.
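The brief gives only the high-level recipe, but the sample-and-ensemble idea behind RandOpt is easy to sketch. Below is a minimal NumPy toy on a linear-regression stand-in task; the loss, task, sampling budget, and all function names are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, X, y):
    # toy stand-in task: mean squared error of a linear model
    return np.mean((X @ w - y) ** 2)

def rand_opt(w_pretrained, X, y, n_samples=200, sigma=0.1, top_k=10):
    """Sample random perturbations around the pretrained weights,
    keep the best-scoring candidates, and ensemble them by averaging."""
    candidates = []
    for _ in range(n_samples):
        w = w_pretrained + sigma * rng.standard_normal(w_pretrained.shape)
        candidates.append((loss(w, X, y), w))
    candidates.sort(key=lambda t: t[0])
    experts = [w for _, w in candidates[:top_k]]
    return np.mean(experts, axis=0)  # simple ensemble: average the experts

# toy demo: random perturbations around a rough "pretrained" point improve the fit
X = rng.standard_normal((64, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true
w0 = w_true + 0.3 * rng.standard_normal(4)   # "pretrained" weights near the optimum
w_ens = rand_opt(w0, X, y)
assert loss(w_ens, X, y) < loss(w0, X, y)
```

Because the toy loss is convex, the average of the top-scoring perturbations is guaranteed to score at least as well as their mean loss, which illustrates why ensembling cheap random "experts" can be competitive in the dense-thicket regime the paper describes.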

#computer-science #artificial-intelligence #machine-learning
Temporal Straightening for Latent Planning
12 Mar 2026
University of Toronto · New York University
Ying Wang
Oumayma Bounou
Gaoyue Zhou

Researchers introduced "temporal straightening," a geometric regularization technique that encourages straighter trajectories in latent space, improving representations for model-based reinforcement learning. Inspired by the perceptual straightening hypothesis, the method creates a latent space in which Euclidean distances more accurately reflect true environmental distances, yielding substantial gains in gradient-based planning performance and goal-reaching success rates.
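The brief does not specify the regularizer beyond "encouraging straighter trajectories." One common way to implement such a penalty (a hedged sketch, not necessarily the authors' exact loss) is to penalize the angle between consecutive displacement vectors along a latent trajectory:

```python
import numpy as np

def straightness_penalty(z, eps=1e-8):
    """Curvature penalty for a latent trajectory z of shape (T, d).

    Computes successive displacement vectors and penalizes the angle
    between consecutive displacements: near 0 when the trajectory is a
    straight line, larger the more it bends.
    """
    v = np.diff(z, axis=0)                               # (T-1, d) displacements
    v = v / (np.linalg.norm(v, axis=1, keepdims=True) + eps)
    cos = np.sum(v[:-1] * v[1:], axis=1)                 # cosine between steps
    return np.mean(1.0 - cos)                            # 0 for perfectly straight

# a straight trajectory incurs ~zero penalty; a curved one does not
t = np.linspace(0, 1, 10)[:, None]
straight = np.hstack([t, 2 * t])          # points on a line
bent = np.hstack([t, np.sin(6 * t)])      # curved path
assert straightness_penalty(straight) < 1e-6
assert straightness_penalty(bent) > 0.1
```

In a training loop such a term would be added to the world-model loss so that gradient-based planners can follow near-linear paths through latent space.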

#computer-science #machine-learning #deep-reinforcement-learning
OpenClaw-RL: Train Any Agent Simply by Talking
10 Mar 2026
Yinjie Wang
Xuyang Chen
Xiaolong Jin

OpenClaw-RL is a framework that systematically converts real-time next-state signals from AI agent interactions into continuous, online learning sources. The system recovers both implicit evaluative signals and explicit directive signals, enabling agents to achieve rapid personalization in conversational settings and improve performance across diverse general agent tasks like terminal, GUI, SWE, and tool-calling environments.

#agentic-frameworks #agents #computer-science
Ψ₀: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation
12 Mar 2026
Songlin Wei
Hongyi Jing
Boqian Li

Ψ0 introduces an open foundation model for humanoid loco-manipulation, employing a decoupled learning strategy that pre-trains on human egocentric videos for generalizable visual-action representations and post-trains on significantly less robot data for precise joint control. This approach achieves over 40% higher success rates on complex, long-horizon tasks compared to state-of-the-art baselines, demonstrating improved data efficiency.

#computer-science #robotics
XSkill: Continual Learning from Experience and Skills in Multimodal Agents
12 Mar 2026
Guanyu Jiang
Zhaochen Su
Xiaoye Qu

XSkill introduces a dual-stream framework that enables multimodal agents to continually learn from visually grounded task-level skills and action-level experiences without explicit retraining. By improving tool-use efficiency and flexibility, the approach consistently lifts agent performance, with Average@4 gains of 2.58 to 6.71 points over baselines across various benchmarks.

#agents #computer-science #continual-learning
Examining Reasoning LLMs-as-Judges in Non-Verifiable LLM Post-Training
12 Mar 2026
Yixin Liu
Yue Yu
DiJia Su

Researchers at Meta Superintelligence Labs and Yale University conducted a controlled study on the practical impact of reasoning LLM-judges in policy post-training within non-verifiable domains, finding that while these judges prevent reward hacking, policies trained with them learn to generate highly effective adversarial outputs. Policies trained using reasoning judges achieved strong performance against a gold-standard evaluator, but exploited judge vulnerabilities through strategies like over-refusal and fabricated policies, generalizing to models like GPT-4.1 and achieving win rates up to 90% on Arena-Hard-V2 subsets.

#adversarial-attacks #agents #computer-science
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
12 Mar 2026
Yushi Bai
Qian Dong
Ting Jiang

IndexCache, developed by Tsinghua University and Z.ai, accelerates sparse attention in large language models by reusing token selection indices across transformer layers, reducing the O(L^2) indexer computation cost. This method yields up to a 1.82x speedup in prefill latency and a 1.48x speedup in decode throughput for 200K token contexts, while maintaining model quality.
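The cross-layer reuse idea can be sketched in a toy single-query setting. The "indexer" here (a dot-product top-k over keys) and all shapes are assumptions for illustration, not the paper's actual components:

```python
import numpy as np

def topk_indices(scores, k):
    """Indices of the k highest-scoring keys (the 'indexer' step)."""
    return np.argpartition(scores, -k)[-k:]

def sparse_attention(q, K, V, idx):
    """Attend only over the keys/values selected by idx."""
    logits = K[idx] @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w @ V[idx]

rng = np.random.default_rng(0)
L_ctx, d, k = 1024, 32, 64
q = rng.standard_normal(d)
layers = [(rng.standard_normal((L_ctx, d)), rng.standard_normal((L_ctx, d)))
          for _ in range(4)]

# Without caching, every layer would run its own O(L) indexer over the
# full context. With cross-layer reuse, the indices are computed once at
# the first layer of a group and reused by the layers that follow.
K0, _ = layers[0]
cached_idx = topk_indices(K0 @ q, k)  # indexer runs once...
outs = [sparse_attention(q, K, V, cached_idx) for K, V in layers]  # ...reused 4x
assert len(outs) == 4 and all(o.shape == (d,) for o in outs)
```

The speedup comes from amortizing the index computation: attention itself stays sparse at every layer, but only one layer per group pays the indexer cost.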

#agents #attention-mechanisms #computer-science
Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training
12 Mar 2026
Fangfu Liu
Diankun Wu
Jiawei Chi

Spatial-TTT equips Multimodal Large Language Models (MLLMs) with the ability to process and reason about 3D spaces from continuous video streams using a test-time training framework. It integrates a hybrid architecture and spatial-predictive mechanism, achieving state-of-the-art performance, including a 12.3 percentage point improvement on MindCube-Tiny and robust object counting over 120-minute videos.

#computer-science #continual-learning #computer-vision-and-pattern-recognition
RoboClaw: An Agentic Framework for Scalable Long-Horizon Robotic Tasks
12 Mar 2026
Ruiying Li
Yunlang Zhou
YuYao Zhu

RoboClaw introduces an agentic framework that unifies data collection, policy learning, and task execution for long-horizon robotic manipulation by using an off-the-shelf Vision-Language Model as a meta-controller. This approach employs self-resetting data collection and continuous process supervision to reduce human intervention and enhance task success rates in real-world environments.

#agentic-frameworks #agents #computer-science
Ranking Reasoning LLMs under Test-Time Scaling
11 Mar 2026
Case Western Reserve University
Mohsen Hariri
Michael Hinczewski
Jing Ma

Researchers at Case Western Reserve University formalized a dense benchmark ranking framework for reasoning LLMs under test-time scaling, systematically comparing 72 statistical methods. Their analysis revealed that while most methods agree at high trial budgets, a Bayesian estimator incorporating a greedy prior achieved the highest low-budget stability, reducing the standard deviation of Kendall's τb by 16–52% at N=1.
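The brief does not define the "greedy prior" estimator; one plausible reading (a hypothetical Beta-Binomial sketch, not the paper's formulation) is to shrink the sampled success rate toward the model's greedy-decoding accuracy, so that rankings stay stable when only a few trials are available:

```python
def posterior_accuracy(successes, trials, greedy_acc, strength=4.0):
    """Beta-Binomial shrinkage: a Beta prior centered on the model's
    greedy-decoding accuracy, updated with N sampled trials.

    At small N the estimate stays close to greedy_acc; as N grows the
    observed success rate dominates.
    """
    alpha = greedy_acc * strength
    beta = (1.0 - greedy_acc) * strength
    return (successes + alpha) / (trials + alpha + beta)

# With a single trial (N=1) the estimate barely moves from the prior,
# which is what stabilizes low-budget rankings; at N=100 the data wins.
est_n1 = posterior_accuracy(successes=1, trials=1, greedy_acc=0.6)
est_n100 = posterior_accuracy(successes=80, trials=100, greedy_acc=0.6)
```

The `strength` parameter (an assumption here) controls how many pseudo-observations the greedy score is worth relative to real sampled trials.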

#computer-science #machine-learning #mathematics
daVinci-Env: Open SWE Environment Synthesis at Scale
13 Mar 2026
Dayuan Fu
Shenyu Wu
Yunze Wu

The daVinci-Env (OpenSWE) framework synthesizes 45,320 executable Docker environments from over 12.8k repositories, creating the largest open-source dataset for software engineering (SWE) agent training. Models trained on this dataset achieve up to 66.0% Pass@1 on SWE-Bench Verified and show improved performance across various general capability benchmarks.

#agentic-frameworks #agents #computer-science
Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing
12 Mar 2026
Baifeng Shi
Stephanie Fu
Long Lian

Researchers from UC Berkeley, NVIDIA, MIT, and Clarifai developed AutoGaze, a lightweight module that enables Multi-modal Large Language Models to efficiently process long-form, high-resolution videos by adaptively selecting multi-scale informative visual patches before attention. This approach achieved up to 19x Vision Transformer and 10x MLLM latency reduction, scaling MLLMs to 1K frames at 4K resolution and improving performance by 10.1% on the new HLVid benchmark compared to baselines.
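The "attend before attention" idea, i.e. pruning to a small set of informative patches before the expensive attention stack, can be sketched as follows. The saliency score here (feature norm) is a hypothetical proxy; the paper's gazing module is learned, not this heuristic:

```python
import numpy as np

def select_patches(frame_feats, keep_ratio=0.1):
    """Keep only the highest-scoring patches before running attention.

    frame_feats: (T, P, d) patch features for T frames of P patches each.
    Returns the kept tokens of shape (k, d), k = keep_ratio * T * P.
    """
    T, P, d = frame_feats.shape
    flat = frame_feats.reshape(T * P, d)
    scores = np.linalg.norm(flat, axis=1)     # proxy "informativeness" score
    k = max(1, int(keep_ratio * T * P))
    keep = np.argsort(scores)[-k:]
    return flat[keep]                         # tokens actually fed to the MLLM

rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 196, 64))   # 200 frames x 196 patches
tokens = select_patches(feats, keep_ratio=0.05)
assert tokens.shape == (int(0.05 * 200 * 196), 64)  # 1,960 of 39,200 tokens
```

Since attention cost grows quadratically in sequence length, keeping 5% of patches cuts the attention workload by orders of magnitude, which is the mechanism behind the reported latency reductions.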

#computer-science #computer-vision-and-pattern-recognition
Cheers: Decoupling Patch Details from Semantic Representations Enables Unified Multimodal Comprehension and Generation
13 Mar 2026
Tsinghua University
Yichen Zhang
Da Peng
Zonghao Guo

Researchers from Tsinghua University and collaborators developed CHEERS, a unified multimodal model that integrates visual comprehension and high-fidelity image generation by decoupling patch-level details from semantic representations. This architecture achieves competitive performance on understanding and generation benchmarks using reduced training data and demonstrates emergent zero-shot image editing.

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition
Multimodal OCR: Parse Anything from Documents
13 Mar 2026
Handong Zheng
Yumeng Li
Kaile Zhang

A new Multimodal OCR (MOCR) paradigm is introduced, which unifies the parsing of both textual content and visual graphics like charts and diagrams into structured representations, often as SVG code. The dots.mocr system demonstrates superior performance on document parsing benchmarks and excels in converting graphics to SVG, while retaining robust general vision-language capabilities.

#computer-science #computer-vision-and-pattern-recognition #data-curation
CreativeBench: Benchmarking and Enhancing Machine Creativity via Self-Evolving Challenges
12 Mar 2026
Zi-Han Wang
Lam Nguyen
Zhengyang Zhao

CreativeBench provides a new benchmark for quantitatively evaluating machine creativity in code generation, covering both combinatorial and exploratory creativity via a cognitive framework. As a complement, EvoRePE offers an inference-time steering strategy that boosts model creativity; the benchmark reveals that while model scaling improves correctness, it can suppress novelty, whereas EvoRePE consistently enhances output diversity with low overhead.

#computer-science #artificial-intelligence
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
12 Mar 2026
Tianwei Xiong
Jun Hao Liew
Zilong Huang

Researchers at The University of Hong Kong and ByteDance Seed developed EVATok, an adaptive tokenization framework that dynamically assigns tokens based on video content complexity. This method enables more efficient video reconstruction and generation, reducing token usage by over 24% for reconstruction and 26% for generation on UCF-101, while achieving superior quality and new state-of-the-art performance in downstream autoregressive tasks.
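Adaptive token allocation of this kind can be illustrated with a toy budget rule; the complexity measure below (mean inter-frame difference energy) and the budget bounds are assumptions for illustration, not EVATok's learned scorer:

```python
import numpy as np

def adaptive_token_budget(video, min_tokens=16, max_tokens=256):
    """Assign a per-clip token budget from content complexity.

    Uses a hypothetical proxy (mean absolute inter-frame difference, which
    lies in [0, 1] for frames scaled to [0, 1]) so that static clips get
    few tokens and dynamic clips get many.
    """
    diffs = np.abs(np.diff(video.astype(np.float64), axis=0))
    complexity = diffs.mean()
    budget = min_tokens + complexity * (max_tokens - min_tokens)
    return int(np.clip(round(budget), min_tokens, max_tokens))

rng = np.random.default_rng(0)
static = np.repeat(rng.random((1, 8, 8)), 16, axis=0)   # 16 identical frames
dynamic = rng.random((16, 8, 8))                        # every frame differs
assert adaptive_token_budget(static) == 16
assert adaptive_token_budget(dynamic) > adaptive_token_budget(static)
```

The reported token savings follow from exactly this asymmetry: most real videos contain long low-complexity stretches that need far fewer tokens than a fixed-length tokenizer would spend on them.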

#computer-science #computer-vision-and-pattern-recognition #generative-models
EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models
12 Mar 2026
Xuanlang Dai
Yujie Zhou
Long Xing

EndoCoT introduces a framework for diffusion models to perform endogenous Chain-of-Thought reasoning by iteratively refining latent thought states. The model achieved an average accuracy of 92.1% across diverse visual reasoning tasks, outperforming Diff Thinker by 8.3 percentage points, and demonstrated scalable reasoning and interpretable, step-by-step problem-solving.

#chain-of-thought #computer-science #computation-and-language
One Model, Many Budgets: Elastic Latent Interfaces for Diffusion Transformers
12 Mar 2026
Moayed Haji-Ali
Willi Menapace
Ivan Skorokhodov

The Elastic Latent Interface Transformer (ELIT) integrates a flexible latent representation into Diffusion Transformers (DiTs), enabling adaptive and non-uniform computation allocation and variable test-time compute from a single model. This approach yielded substantial FID improvements on ImageNet-1K and allowed up to 63% FLOPs reduction at high resolutions with graceful quality trade-offs.

#attention-mechanisms #computer-science #computer-vision-and-pattern-recognition
VGGT-World: Transforming VGGT into an Autoregressive Geometry World Model
13 Mar 2026
Xiangyu Sun
Shijie Wang
Fengyi Zhang

VGGT-World presents a geometry world modeling framework that directly forecasts the temporal evolution of 3D scene geometry using latent features from a frozen Geometry Foundation Model (GFM). This approach achieves superior geometric consistency in future depth and point cloud predictions, improving metrics like AbsRel by up to 32%, while significantly reducing computational overhead with 3.6x to 5x faster inference and a much smaller trainable footprint compared to video-centric world models.

#computer-science #computer-vision-and-pattern-recognition #geometric-deep-learning
ComFree-Sim: A GPU-Parallelized Analytical Contact Physics Engine for Scalable Contact-Rich Robotics Simulation and Control
12 Mar 2026
Chetan Borse
Zhixian Xie
Wei-Cheng Huang

ComFree-Sim introduces a GPU-parallelized analytical contact physics engine for robotics simulation, overcoming the contact resolution bottleneck with near-linear runtime scaling and significantly higher throughput than existing iterative solvers. The engine achieves up to 3x faster simulation and 2x higher throughput compared to MJWarp, directly improving real-time control success rates by an average of 27 percentage points in dexterous manipulation tasks.

#computer-science #robotics