A Model Context Protocol (MCP) server for searching, downloading, and reading arXiv papers — designed as a specialist agent for integration into multi-agent systems like Microsoft Magentic-UI and AutoGen.
The idea: rather than treating arXiv search as a simple lookup tool, this server is structured as a first-class research agent — one you can plug directly into a Magentic-One-style team as an `McpAgent`, giving an Orchestrator access to the full scientific literature as a delegatable resource.
Magentic-UI supports custom `McpAgent` instances via `mcp_agent_configs` in its config file. This server plugs in directly:
```yaml
# examples/magentic_ui_config.yaml
client:
  mcp_agent_configs:
    - agent_name: ArxivResearcher
      description: >
        Specialist agent for searching and reading arXiv papers.
        Use when the task requires finding academic papers, understanding
        research literature, or retrieving technical details from published work.
      server_params:
        type: StdioServerParams
        command: python
        args: ["-m", "arxiv_mcp_server"]
        env:
          PYTHONPATH: /path/to/arxiv-deep-research/src
```

Once registered, the Magentic-UI Orchestrator can delegate research subtasks to this agent through the standard Task Ledger / Progress Ledger pattern — exactly how WebSurfer handles web browsing, but for academic literature.
See `examples/autogen_research_team.py` for a complete 3-agent team:

```
Orchestrator (MagenticOneGroupChat)
├── ArxivSurfer   ← this MCP server, wrapped via StdioServerParams + mcp_server_tools
└── Coder         ← synthesizes findings into structured markdown reports
```
```bash
pip install "autogen-agentchat" "autogen-ext[openai]" "mcp>=1.2.0"
export OPENAI_API_KEY=...
python examples/autogen_research_team.py
```

| Tool | Description |
|---|---|
| `search_papers` | Query arXiv with advanced filters: date range, category, sort by relevance or date |
| `download_paper` | Fetch a paper PDF and convert to clean markdown for LLM consumption |
| `read_paper` | Access previously downloaded paper content |
| `list_papers` | View all papers in local storage |
Supports rich query syntax — quoted phrases, boolean operators, field-specific search (`ti:`, `au:`, `abs:`), and category filtering:
```json
{
  "query": "\"multi-agent\" AND \"orchestration\" ANDNOT survey",
  "max_results": 10,
  "date_from": "2024-01-01",
  "categories": ["cs.AI", "cs.MA"],
  "sort_by": "relevance"
}
```

At a high level, arxiv-deep-research runs a simple but powerful multi-stage loop:
1. Plan the research task: a coordinator agent (for example the AutoGen `MagenticOneGroupChat` orchestrator) takes the user goal and breaks it into sub-tasks.
2. Discover candidate papers: the coordinator calls the MCP `search_papers` tool to find relevant arXiv papers by topic, category, and date.
3. Download and normalize content: for selected IDs, it calls `download_paper`, which fetches the PDF and converts it into clean markdown for LLMs to read.
4. Deep paper analysis: the coordinator (or another agent) uses the `deep-paper-analysis` prompt to ask for a structured analysis of a given paper ID, optionally across multiple calls as you explore related work.
5. Synthesis and reporting: a downstream agent such as `Coder` (in the AutoGen example) turns these analyses into a final research report: summaries, comparison tables, open problems, and next-step suggestions.

You can run this pipeline manually by calling the tools and prompts from any MCP-aware client, or automatically using the sample AutoGen team.
The repo includes a retrieval quality benchmark (`eval/benchmark.py`) measuring:
- Precision@K — fraction of top-K results that are relevant
- Recall@K — fraction of known relevant papers found in top-K
- MRR — Mean Reciprocal Rank of first relevant result
Ground-truth queries are seeded from landmark papers (AutoGen 2308.08155, Magentic-One 2411.04468, RAG 2005.11401, CoT 2201.11903) and can be extended automatically using the synthetic data pipeline below.
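For reference, the three metrics can be sketched in a few lines of plain Python. This is an illustrative re-implementation, not the actual code in `eval/benchmark.py`:

```python
def precision_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for r in results[:k] if r in relevant) / k

def recall_at_k(results: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the known relevant papers found in the top-k."""
    return sum(1 for r in results[:k] if r in relevant) / len(relevant)

def mrr(results: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none appears)."""
    for rank, r in enumerate(results, start=1):
        if r in relevant:
            return 1.0 / rank
    return 0.0

# Example: a ranked list of arXiv IDs scored against a two-paper ground truth
ranked = ["2308.08155", "9999.00001", "2411.04468"]  # second ID is a dummy miss
truth = {"2308.08155", "2411.04468"}
print(precision_at_k(ranked, truth, 3))  # 2 of 3 top results are relevant
print(recall_at_k(ranked, truth, 3))     # 1.0 — both relevant papers found
print(mrr(ranked, truth))                # 1.0 — first relevant result at rank 1
```

Because MRR only looks at the first relevant hit, it rewards ranking quality even when recall is identical across systems.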
```bash
python eval/benchmark.py --k 10 --output results.json
```

`scripts/generate_eval_tasks.py` implements a 4-stage pipeline that generates diverse benchmark queries from arXiv abstracts — mirroring the AgentInstruct approach:
```
Stage 1: Seed collection    → fetch paper abstracts from arXiv by category
Stage 2: Content transform  → extract key concepts and problem statements
Stage 3: Instruction gen    → generate realistic research queries via GPT-4o-mini
Stage 4: Instruction refine → create harder variants at subtopic intersections
```

```bash
export OPENAI_API_KEY=...
python scripts/generate_eval_tasks.py --seed-category cs.AI --num-seeds 20 --output eval/generated_queries.json
```

Output includes easy/medium/hard difficulty tiers for stratified evaluation.
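The four stages compose naturally as plain functions. The sketch below shows the shape of such a pipeline with the LLM calls stubbed out; function and field names are illustrative, not the actual `generate_eval_tasks.py` API:

```python
from dataclasses import dataclass

@dataclass
class EvalQuery:
    query: str
    difficulty: str  # "easy" | "medium" | "hard"

def collect_seeds(category: str, n: int) -> list[str]:
    # Stage 1: the real pipeline fetches abstracts from the arXiv API here.
    return [f"abstract {i} in {category}" for i in range(n)]

def extract_concepts(abstract: str) -> list[str]:
    # Stage 2: placeholder concept extraction; the real version is LLM-assisted.
    return abstract.split()[:3]

def generate_query(concepts: list[str]) -> EvalQuery:
    # Stage 3: stub for the GPT-4o-mini call that writes a realistic query.
    return EvalQuery(query=" AND ".join(concepts), difficulty="easy")

def refine(q: EvalQuery) -> EvalQuery:
    # Stage 4: produce a harder variant at a subtopic intersection.
    return EvalQuery(query=q.query + ' AND "benchmark"', difficulty="hard")

def pipeline(category: str, n: int) -> list[EvalQuery]:
    seeds = collect_seeds(category, n)
    easy = [generate_query(extract_concepts(a)) for a in seeds]
    return easy + [refine(q) for q in easy]

tasks = pipeline("cs.AI", 2)
print(len(tasks))  # 4: two easy queries plus their two hard variants
```

Keeping each stage a pure function makes it easy to swap the stubbed steps for real LLM calls or to unit-test the stages in isolation.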
Every tool call is instrumented with OpenTelemetry spans (mirrors AutoGen v0.4's built-in OTel support):
```bash
# Console output (no infrastructure needed)
export ARXIV_MCP_TRACE_CONSOLE=true
python -m arxiv_mcp_server

# OTLP export to Jaeger / Azure Monitor
docker run -d --name jaeger -p 16686:16686 -p 4317:4317 jaegertracing/all-in-one
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_SERVICE_NAME=arxiv-mcp-server
python -m arxiv_mcp_server

# View traces: http://localhost:16686
```

Spans recorded: `mcp.tool.search_papers`, `mcp.tool.download_paper`, `mcp.tool.read_paper` — each with query, categories, result count, latency, and error status as attributes.
Tracing is a zero-cost no-op when `opentelemetry-sdk` is not installed.
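The no-op fallback is the standard optional-dependency pattern. A minimal sketch of how a `trace_tool` decorator can degrade gracefully — illustrative, not the exact code in `tracing.py`:

```python
import functools

try:
    from opentelemetry import trace
    _tracer = trace.get_tracer("arxiv-mcp-server")
except ImportError:
    _tracer = None  # OTel not installed: tracing becomes a no-op

def trace_tool(name: str):
    """Wrap a tool handler in an OTel span, or do nothing if the SDK is absent."""
    def decorator(fn):
        if _tracer is None:
            return fn  # zero-cost: the original function is returned unchanged
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with _tracer.start_as_current_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

@trace_tool("mcp.tool.search_papers")
def search_papers(query: str) -> str:
    return f"results for {query!r}"
```

Because the decorator returns the original function when the import fails, there is no per-call overhead in the untraced case.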
Requires Python 3.11+
```bash
git clone https://github.com/freyzo/arxiv-deep-research
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

# Optional: OTel tracing
pip install -e ".[tracing]"
```

```json
{
  "mcpServers": {
    "arxiv": {
      "command": "/path/to/.venv/bin/python",
      "args": ["-m", "arxiv_mcp_server", "--storage-path", "/path/to/papers"]
    }
  }
}
```

```json
{
  "mcpServers": {
    "arxiv": {
      "command": "python",
      "args": ["-m", "arxiv_mcp_server"],
      "env": { "PYTHONPATH": "/path/to/arxiv-deep-research/src" }
    }
  }
}
```

Comprehensive analysis workflow covering executive summary, methodology, results, implications, and future directions:
```json
{ "paper_id": "2401.12345" }
```

There are two main ways to run research sessions today.
This uses OpenAI models to coordinate a full research workflow.
```bash
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
pip install "autogen-agentchat" "autogen-ext[openai]" "mcp>=1.2.0"

export OPENAI_API_KEY=your_openai_key
python examples/autogen_research_team.py
```

This starts an interactive console UI where:
- the Orchestrator plans the work,
- ArxivSurfer searches and downloads papers via MCP, and
- Coder writes the final markdown report.
To resume a session, you can:
- run the script again and paste the previous summary as part of a new task, or
- keep the same console session open and give the team a follow‑up instruction (for example, “Now focus on safety trade‑offs”).
You can also talk to the MCP server directly and build your own loop:
```bash
cd arxiv-deep-research
python3 -m venv .venv
source .venv/bin/activate
pip install -e .

export ARXIV_MCP_TRACE_CONSOLE=true  # optional
python -m arxiv_mcp_server
```

While this server runs, any MCP-aware client can:
- call `search_papers` and `download_paper`,
- use `read_paper` to pull content into the chat, and
- call the `deep-paper-analysis` prompt multiple times.
The prompt handler keeps a simple global research context, so repeated calls in the same process will mention previously analyzed paper IDs and encourage the model to connect them. In practice, “resuming” a research session means:
- keeping the same MCP server process alive, and
- issuing new `deep-paper-analysis` calls for new paper IDs from the same client or workspace.
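The "global research context" can be as simple as module-level state that the prompt handler consults on each call. A hypothetical sketch of the pattern (names are illustrative, not the server's actual handler):

```python
# Hypothetical sketch: process-global context shared across prompt calls.
_analyzed_papers: list[str] = []

def deep_paper_analysis_prompt(paper_id: str) -> str:
    """Build the analysis prompt, referencing papers already seen this session."""
    context = ""
    if _analyzed_papers:
        context = (
            "Previously analyzed papers: "
            + ", ".join(_analyzed_papers)
            + ". Relate the new paper to them where relevant. "
        )
    _analyzed_papers.append(paper_id)
    return context + f"Provide a structured deep analysis of arXiv paper {paper_id}."

first = deep_paper_analysis_prompt("2308.08155")
second = deep_paper_analysis_prompt("2411.04468")
print("2308.08155" in second)  # True: the second call mentions the first paper
```

Because the state lives in the server process, it vanishes on restart — which is exactly why the roadmap proposes explicit session IDs and persisted state.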
```
arxiv-deep-research/
├── src/arxiv_mcp_server/
│   ├── server.py                  # MCP server + OTel init
│   ├── tracing.py                 # @trace_tool decorator, OTLP + console exporters
│   ├── config.py
│   ├── tools/                     # search, download, read, list
│   └── prompts/                   # deep research analysis prompt
├── examples/
│   ├── autogen_research_team.py   # Magentic-One-style 3-agent team
│   └── magentic_ui_config.yaml    # McpAgent config for Magentic-UI
├── eval/
│   └── benchmark.py               # Precision@K / Recall@K / MRR harness
├── scripts/
│   └── generate_eval_tasks.py     # AgentInstruct-style query generator
└── pyproject.toml
```
| Variable | Default | Description |
|---|---|---|
| `ARXIV_STORAGE_PATH` | `~/.arxiv-mcp-server/papers` | Paper storage location |
| `ARXIV_MCP_TRACE_CONSOLE` | `false` | Enable console trace output |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | — | OTLP endpoint (e.g. `http://localhost:4317`) |
| `OTEL_SERVICE_NAME` | `arxiv-mcp-server` | Service name in traces |
If you use the optional eval data generator, you also need:
| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | Used by `scripts/generate_eval_tasks.py` to talk to `gpt-4o-mini` |
- Model support is OpenAI-only today.
  - The AutoGen research team and the synthetic eval generator both call OpenAI models (`gpt-4o` / `gpt-4o-mini`) via the OpenAI Python SDK.
  - There is no first-class `google-genai` / Gemini or Gemma integration yet, even though the design would support it.
- No MCP Resources yet.
  - Papers are exposed only via tools (`read_paper`) rather than as MCP Resources with stable `arxiv://` URIs. MCP clients that prefer Resources cannot list papers yet.
- Limited testing.
  - The core retrieval and eval logic has very light automated testing; metric functions and tool handlers should gain unit tests over time.
Planned improvements (subject to change):
- Gemini / Gemma support via `google-genai`
  - Add an optional `google-genai` dependency and a small runner that can call Gemini/Gemma models using `GEMINI_API_KEY`.
  - Expose this as an alternative backend for the research team demo and the eval generator.
- MCP Resources for downloaded papers
  - Implement `list_resources` / `read_resource` so downloaded PDFs appear as `arxiv://paper_id` resources in MCP clients.
- Stronger testing and evals
  - Add unit tests for metrics, search helpers, and prompt handlers.
  - Automate running `eval/benchmark.py` and track regressions over time.
- Richer research sessions
  - Replace the simple global research context with explicit session IDs and persisted state, so "resume session X" becomes a first-class feature across restarts.
