alphaXiv

Qwen3.5-Omni Technical Report
17 Apr 2026
Qwen Team

Alibaba Cloud's Qwen Team developed Qwen3.5-Omni, a large language model scaling to hundreds of billions of parameters that processes and generates across text, images, audio, and video within a unified architecture. It achieves robust omnimodal understanding, real-time streaming speech synthesis, and agentic capabilities, demonstrating strong performance on diverse benchmarks and an emergent "Audio-Visual Vibe Coding" function.

#computer-science #computation-and-language #audio-and-speech-processing

Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
20 Apr 2026
Guanting Dong
Junting Lu
Junjie Huang

Renmin University of China and ByteDance Seed researchers introduced Agent-World, a framework for advancing general agent intelligence through scalable real-world environment synthesis and a continuous self-evolving training mechanism. The approach enables agents to learn and adapt by autonomously discovering and constructing a diverse ecosystem of stateful, executable tools and environments from real-world sources, achieving consistent performance improvements across 23 challenging agent benchmarks and outperforming prior environment-scaling methods.

#computer-science #artificial-intelligence #computation-and-language

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap
17 Apr 2026
Yige Xu
Yongjie Wang
Zizhuo Wu

Researchers at Nanyang Technological University introduce CROSSMATH, a novel multimodal reasoning benchmark designed to rigorously assess whether Vision-Language Models (VLMs) truly perform visual reasoning or primarily rely on text. The study reveals a substantial performance gap in VLMs when reasoning with images versus text, but demonstrates that targeted post-training strategies can significantly enhance visual reasoning capabilities and improve out-of-domain generalization, such as increasing Qwen3.5-9B's image-only Macro Accuracy from 3.2% to 50.4%.

#computer-science #computation-and-language #computer-vision-and-pattern-recognition

(1D) Ordered Tokens Enable Efficient Test-Time Search
16 Apr 2026
Zhitong Gao
Parham Rezaei
Ali Cy

This research reveals that 1D ordered tokenization, which processes images from coarse-to-fine semantic detail, intrinsically improves the efficiency of test-time search in autoregressive image generation models. This enables more effective zero-shot multimodal control and even training-free generation, outperforming traditional 2D grid tokenization in inference-time scaling.
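As a hedged illustration of the general idea (a toy sketch, not the paper's model or tokenizer): when early tokens carry coarse global semantics, candidate generations can be scored and pruned after only a short prefix, making best-of-N search cheap. The target sequence, scorer, and pruning budget below are all invented for illustration.

```python
import random

random.seed(0)

# Stand-in "image": earlier positions = coarser semantic detail.
TARGET = [3, 1, 4, 1, 5, 9]

def sample_candidate(n: int) -> list:
    # Stand-in for sampling a token sequence from a generative model.
    return [random.randint(0, 9) for _ in range(n)]

def prefix_score(seq: list, k: int) -> int:
    # Reward agreement with the target on the first k positions.
    return sum(a == b for a, b in zip(seq[:k], TARGET[:k]))

# Best-of-N with early pruning: score candidates on a short coarse
# prefix, keep only the top few, then rank survivors on the full length.
candidates = [sample_candidate(len(TARGET)) for _ in range(64)]
survivors = sorted(candidates, key=lambda s: -prefix_score(s, 2))[:8]
best = max(survivors, key=lambda s: prefix_score(s, len(TARGET)))
```

With a coarse-to-fine ordering the short prefix is informative, so pruning rarely discards good candidates; with a 2D raster order the early tokens are just the top rows of the image and prefix scores carry little signal.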

#computer-science #artificial-intelligence #computer-vision-and-pattern-recognition

Learning Whole-Body Humanoid Locomotion via Motion Generation and Motion Tracking
19 Apr 2026
Zewei Zhang
Kehan Wen
Michael Xu

This research introduces a whole-body humanoid locomotion framework that merges a diffusion-based motion generator with an RL-based motion tracker. The system enables the Unitree G1 robot to adaptively navigate diverse, challenging terrains in real-time by producing perception-aware reference motions and executing them robustly.

#computer-science #robotics

OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
20 Apr 2026
Jinghui Lu
Jiayi Guan
Zhijian Huang

The Xiaomi Embodied Intelligence Team developed OneVL, a framework integrating a Vision-Language-Action model with a world model auxiliary for autonomous driving, which is the first latent Chain-of-Thought (CoT) method to surpass explicit autoregressive CoT in trajectory prediction performance while maintaining answer-only inference latency. OneVL achieved an 88.84 PDM-score on NAVSIM, outperforming prior 8B models by up to 2.64 points, with inference latency comparable to answer-only prediction.

#autonomous-vehicles #causal-inference #chain-of-thought

LLM Reasoning Is Latent, Not the Chain of Thought
17 Apr 2026
Wenshuo Wang

Research demonstrates that the core mechanism of Large Language Model reasoning is primarily driven by internal latent-state trajectories, rather than solely by explicit chains of thought, with the dominant factor shifting based on the task's structural demands. This work provides a framework to differentiate and empirically validate how generic computation, surface verbalizations, or latent dynamics mediate performance gains in LLMs.

#agents #chain-of-thought #computer-science

MultiWorld: Scalable Multi-Agent Multi-View Video World Models
20 Apr 2026
Haoyu Wu
Jiwen Yu
Yingtian Zou

MultiWorld presents a unified framework for multi-agent, multi-view video world modeling, designed to simulate complex interactive environments with precise multi-agent controllability and multi-view consistency. The framework demonstrates improved visual quality, action-following ability, and significantly reduced reprojection error on multi-player video game and multi-robot manipulation datasets.

#agent-based-systems #computer-science #computer-vision-and-pattern-recognition

Towards Ultra-High-Rate Quantum Error Correction with Reconfigurable Atom Arrays
17 Apr 2026
Chen Zhao
Casey Duckering
Andi Gu

Researchers co-designed ultra-high-rate quantum low-density parity-check (qLDPC) codes with reconfigurable neutral atom arrays, identifying structural conditions on affine permutation matrices to enable efficient syndrome extraction. Circuit-level simulations demonstrated logical error rates as low as 1.3 × 10^-13 per logical qubit per round at a 0.1% physical error rate, indicating the practical viability of these codes for fault-tolerant quantum computing.

#computer-science #information-theory #physics

Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
14 Apr 2026
Jiacheng Liu
Xiaohan Zhao
Xinyi Shang

A detailed architectural analysis of Anthropic's Claude Code, a production-grade AI agent, is provided through source-level examination, revealing a design philosophy prioritizing a robust operational harness over raw AI decision logic. The study contrasts this with OpenClaw, an open-source system, to highlight how different deployment contexts influence architectural choices and identifies future design challenges for agent systems.

#agentic-frameworks #agents #computer-science

Repurposing 3D Generative Model for Autoregressive Layout Generation
17 Apr 2026
Haoran Feng
Yifan Niu
Zehuan Huang

LaviGen adapts a 3D generative model for autoregressive 3D scene layout generation driven by text instructions, operating directly in native 3D space with a dual-guidance self-rollout distillation strategy. The system yields layouts with approximately 19% greater physical plausibility and 65% faster inference compared to prior state-of-the-art methods.

#computer-science #computer-vision-and-pattern-recognition #generative-models

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
16 Apr 2026
Ruoyu Qin
Weiran He
Yaoyu Wang

The Prefill-as-a-Service (PrfaaS) architecture from Moonshot AI and Tsinghua University enables practical cross-datacenter prefill-decode disaggregation for hybrid-attention large language models by transferring KVCache over commodity Ethernet. This system achieves 54% higher throughput and a 50% reduction in mean Time-To-First-Token compared to a homogeneous baseline, utilizing only 13% of available inter-cluster bandwidth.

#computer-science #distributed-parallel-and-cluster-computing

π0.7: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
16 Apr 2026
Physical Intelligence
Bo Ai
Ali Amin

Researchers at Physical Intelligence developed π0.7, a 5-billion-parameter generalist robotic foundation model. This model utilizes a diversified prompting strategy, incorporating language instructions, episode metadata, and generated subgoal images, to achieve strong out-of-the-box performance and compositional generalization across diverse tasks and robot platforms.

#agents #computer-science #machine-learning

EasyVideoR1: Easier RL for Video Understanding
18 Apr 2026
Chuanyu Qin
Chenxu Yang
Qingyi Si

Reinforcement learning from verifiable rewards (RLVR) has demonstrated remarkable effectiveness in improving the reasoning capabilities of large language models. As models evolve into natively multimodal architectures, extending RLVR to video understanding becomes increasingly important yet remains largely unexplored, due to the diversity of video task types, the computational overhead of repeatedly decoding and preprocessing high-dimensional visual inputs, and the difficulty of reproducible evaluation across numerous sensitive hyperparameters. Existing open-source RL training frameworks provide solid infrastructure for text and image scenarios but lack systematic optimizations tailored to the video modality. In this work, we present EasyVideoR1, a complete and efficient reinforcement learning framework specifically designed for training large vision-language models on video understanding tasks. EasyVideoR1 makes the following contributions: (1) a full video RL training pipeline with offline preprocessing and tensor caching that eliminates redundant video decoding and yields a 1.47× throughput improvement; (2) a comprehensive, task-aware reward system covering 11 distinct video and image problem types with unified routing and modular extension; (3) a mixed offline-online data training paradigm that combines curated high-quality trajectories with on-policy exploration, benefiting the learning of more challenging tasks; (4) joint image-video training with independently configurable pixel budgets, allowing the two modalities to mutually reinforce each other; and (5) an asynchronous multi-benchmark evaluation framework covering 22 mainstream video understanding benchmarks, with reproduced accuracy closely aligned with officially reported scores.

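The tensor-caching idea in contribution (1) can be sketched as follows — a minimal illustration, not the EasyVideoR1 code; the `decode_video` stand-in, cache layout, and class names are all assumptions. Each video is decoded and preprocessed once, and subsequent RL epochs reuse the cached result:

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

def decode_video(path: str) -> list:
    # Stand-in for an expensive decode + preprocess step
    # (frame sampling, resizing, normalization, ...).
    return [f"{path}:frame{i}" for i in range(4)]

class TensorCache:
    """Offline cache of preprocessed video tensors keyed by path."""

    def __init__(self, cache_dir: str):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)
        self.decodes = 0  # count of real decodes, for illustration

    def _key(self, path: str) -> Path:
        return self.dir / (hashlib.sha1(path.encode()).hexdigest() + ".pkl")

    def get(self, path: str) -> list:
        f = self._key(path)
        if f.exists():          # cache hit: skip decoding entirely
            return pickle.loads(f.read_bytes())
        frames = decode_video(path)
        self.decodes += 1
        f.write_bytes(pickle.dumps(frames))
        return frames

cache = TensorCache(tempfile.mkdtemp())
for epoch in range(3):          # three RL "epochs" over the same clip
    frames = cache.get("clip_001.mp4")
assert cache.decodes == 1       # decoded once, reused on later epochs
```

Since on-policy RL revisits the same clips across rollouts and epochs, amortizing the decode this way is plausibly where a throughput gain of the reported order comes from.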
#agents #computer-science #computer-vision-and-pattern-recognition

Geometric Context Transformer for Streaming 3D Reconstruction
16 Apr 2026
Lin-Zhuo Chen
Jian Gao
Yihang Chen

LingBot-Map, a feed-forward 3D foundation model, performs streaming 3D reconstruction by introducing a Geometric Context Transformer (GCT) that employs a novel attention mechanism to manage multi-level geometric context. The system achieves superior accuracy and efficiency in camera pose estimation and dense 3D reconstruction compared to existing streaming methods, operating at approximately 20 FPS for sequences up to 10,000 frames.

#attention-mechanisms #computer-science #computer-vision-and-pattern-recognition

Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
15 Apr 2026
Yaxuan Li
Yuxin Zuo
Bingxiang He

This research systematically investigates On-Policy Distillation (OPD) for Large Language Models (LLMs), identifying that successful distillation relies on thinking-pattern consistency and a teacher's novel capabilities, not merely higher scores. The study reveals token-level alignment dynamics and proposes practical strategies like off-policy cold starts and teacher-aligned prompts, while also highlighting OPD's limitations in long-horizon tasks.

#computer-science #artificial-intelligence #computation-and-language

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents
19 Apr 2026
Ziao Zhang
Kou Shi
Shiting Huang

SKILLFLOW, a new benchmark, evaluates how autonomous agents continuously discover, repair, and manage skills from their own experience across sequential tasks. Experiments reveal that leading LLM-based agents exhibit varied capacities for skill evolution, with some improving task success by over 8% through effective skill refinement, while others struggle with skill fragmentation and propagating errors.

#agentic-frameworks #agents #computer-science

Autogenesis: A Self-Evolving Agent Protocol
16 Apr 2026
Wentao Zhang

Autogenesis introduces a self-evolving agent protocol (AGP) that formalizes resource management and state transitions for LLM-based agents. This protocol enables auditable and safe self-modification, leading to enhanced performance across scientific, general agent, and algorithmic coding benchmarks.

#agentic-frameworks #agents #computer-science

AgentV-RL: Scaling Reward Modeling with Agentic Verifier
17 Apr 2026
Jiazheng Zhang
Ziche Fu
Zhiheng Xi

AgentV-RL introduces an agentic verifier framework that uses bidirectional, multi-turn, tool-augmented reasoning to scrutinize LLM-generated solutions for complex problems. It achieves state-of-the-art results in mathematical reasoning, outperforming larger models by up to 25.2 percentage points on MATH500 by integrating a comprehensive verification process.

#agentic-frameworks #agents #computer-science

Seedance 2.0: Advancing Video Generation for World Complexity
15 Apr 2026
Team Seedance
De Chen
Liyang Chen

ByteDance Seed's Seedance 2.0 introduces a unified, large-scale model for multi-modal audio-video generation, significantly advancing quality and control across text, image, and audio inputs. The model achieved top rankings in expert and public benchmarks, demonstrating superior performance in aspects like natural motion, temporal coherence, and synchronized high-fidelity audio.

#computer-science #computer-vision-and-pattern-recognition #generative-models