The `search` folder provides a small toolkit and FastAPI service to search research papers across multiple sources:

- Exa Websets (`exa_search.py`) — rich web search with enrichment and caching
- Hugging Face Papers (`hugging_face_paper.py`) — daily/weekly/monthly feeds
- MCP arXiv (`mcp_client.py`) — call the `academia_mcp` arXiv tool via MCP
- MCP Google Scholar (`mcp_google_scholar.py`) — call a Google Scholar MCP tool
- Unified router (`paper_search.py`) — register/search across tools
- FastAPI (`api.py`) — simple HTTP API exposing the tools and a unified `/search`
Requirements:

- Python 3.10+
- Packages from the repo-level `requirements.txt` (FastAPI, requests, exa-py, mcp, etc.)
Install:

```bash
pip install -r requirements.txt
```

Create a `.env` at the repo root (or export in your shell):

- `EXA_API_KEY` — required for Exa Websets
- `ACADEMIA_MCP_API_KEY` — required for MCP-based tools (arXiv, Google Scholar)

Example `.env`:

```bash
EXA_API_KEY=exa_live_xxx
ACADEMIA_MCP_API_KEY=smithery_xxx
```
Tools and parameters:

- Exa (`exa`)
  - Required: `enrichment_description`
  - Optional: `count` (int, default 10), `use_cache` (bool, default true)
  - Notes: reads `EXA_API_KEY`; results cached under `search/.cache/exa/`
- Hugging Face (`hf_daily`, `hf_weekly`, `hf_monthly`)
  - `hf_daily`: `date` (YYYY-MM-DD, optional)
  - `hf_weekly`: `end_date` (YYYY-MM-DD, optional), `days` (int, default 7)
  - `hf_monthly`: `end_date` (YYYY-MM-DD, optional), `days` (int, default 30)
  - Notes: `query` may be empty; server-side filtering matches title/summary/highlights
- MCP arXiv (`mcp_arxiv`)
  - Required: `query`
  - Optional: `limit`, `offset`, `sort_by`, `sort_order`, `start_date`, `end_date`, `include_abstracts`
  - Notes: uses `ACADEMIA_MCP_API_KEY`; parameters are forwarded to the MCP tool
- MCP Google Scholar (`mcp_google_scholar`)
  - Required: `query`
  - Optional: `author`, `startYear`, `endYear`, `numResults`
  - Notes: uses `ACADEMIA_MCP_API_KEY`; parameters are forwarded to the MCP tool
Run locally:

```bash
python -m uvicorn search.api:app --host 127.0.0.1 --port 8000 --reload
```

Endpoints:

- `GET /health` → `{ "status": "ok" }`
- `GET /tools` → list of registered tool names
- `POST /tools/exa` → register Exa at runtime if not auto-registered
- `POST /search` → run a query through the chosen tool
- `GET /trend` → get trending Hugging Face papers without configuring tools
Request/response models are defined in `search/api.py`.

Example: `GET /trend`

```bash
curl -s "http://127.0.0.1:8000/trend?period=weekly&end_date=2025-10-17&days=7&limit=20" | jq
```

Query params:

- `period`: one of `daily|weekly|monthly` (default: `daily`)
- `date`: ISO date `YYYY-MM-DD` (for `daily`)
- `end_date`: ISO date to end the trailing window (for `weekly`/`monthly`)
- `days`: window size override (`weekly` default 7, `monthly` default 30)
- `limit`: truncate results to the first N
Response shape:

```json
{
  "period": "weekly",
  "count": 20,
  "results": [ { "title": "...", "authors": ["..."], "publishedAt": "..." } ]
}
```

`POST /search` responds with:

```json
{
  "tool": "exa",
  "count": 5,
  "results": [ { "...": "tool-specific fields" } ]
}
```

`results` is a list of JSON-like dicts; its shape varies by tool. `limit` (if provided in the request) caps the list length after tool execution.
Examples of `results` entries by tool (indicative, not exhaustive):

Exa:

```json
{
  "id": "webset-item-id",
  "url": "https://example.com/paper",
  "title": "Paper title",
  "enrichments": [ { "description": "Main research outcome", "text": "..." } ]
}
```

Hugging Face daily/weekly/monthly (normalized fields):

The HF tools return a simplified record with these fields extracted by `hugging_face_paper.py`:

- `title` — paper title
- `authors` — list of author names (prefers nested `user.fullname` when present)
- `publishedAt` — ISO date/time string if available
- `summary` — summary/highlights text if available
- `upvotes` — number of upvotes
- `githubrepo` — canonical GitHub repo URL if present
- `ai_keywords` — list of AI keywords (may be empty)
- `githubstart` — GitHub star count (if provided by the source)
Example:

```json
{
  "title": "Paper title",
  "authors": ["Author A", "Author B"],
  "publishedAt": "2025-10-17",
  "summary": "...",
  "upvotes": 42,
  "githubrepo": "https://github.com/org/repo",
  "ai_keywords": ["LLM", "RAG"],
  "githubstart": 1234
}
```

Notes:

- Weekly/monthly helpers fetch trailing windows ending at an optional `end_date` (inclusive), and results are sorted by `upvotes` descending.
- When persisting via `save_papers_as_json(..., include_links=True)`, best-effort `links` are attached (e.g., `arxiv`, `huggingface`, `source`, `pdf`, `github`).
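The trailing-window semantics (a `days`-long window ending at `end_date`, inclusive) can be sketched as below. This is only an illustration of the window the weekly/monthly helpers describe; `trailing_window` is this sketch's own name and the real helpers may enumerate dates differently:

```python
from datetime import date, timedelta


def trailing_window(end_date: date, days: int) -> list[date]:
    """All dates in a trailing window of `days` days ending at end_date (inclusive).

    Illustrative sketch of the window described for fetch_weekly_papers /
    fetch_monthly_papers; the actual implementation may differ.
    """
    start = end_date - timedelta(days=days - 1)
    return [start + timedelta(days=i) for i in range(days)]


# A 7-day window ending 2025-10-17 starts on 2025-10-11.
week = trailing_window(date(2025, 10, 17), 7)
```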
Normalization and utilities from `hugging_face_paper.py`:

- `fetch_daily_papers(date?: YYYY-MM-DD)` → returns normalized records sorted by upvotes
- `fetch_weekly_papers(end_date?: YYYY-MM-DD, days: int = 7)` → trailing window ending at `end_date` (inclusive)
- `fetch_monthly_papers(end_date?: YYYY-MM-DD, days: int = 30)` → trailing window ending at `end_date` (inclusive)
- `sort_by_upvotes(records)` → helper to sort any HF list by `upvotes` descending
- `format_papers(records)` → multi-line human-readable list for console output
- `save_papers_as_json(records, path, include_links: bool = False)` → pretty JSON to disk; if `include_links` is true, attaches the link map
Link resolution with `include_links=True` uses best-effort keys:

- `arxiv` — `https://arxiv.org/abs/{id}` when the paper id matches arXiv patterns
- `huggingface` — `https://huggingface.co/papers/{id}` when `paper.id` is present
- `source` — original source URL if provided by the Hugging Face payload
- `pdf` — direct PDF URL when available
- `github` — canonical `https://github.com/{org}/{repo}` derived from the repo hint
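The best-effort keys above can be illustrated with a small sketch. The function name and parameters here are this sketch's own, and the actual logic in `hugging_face_paper.py` may differ (for instance in how the PDF URL is discovered):

```python
import re

# New-style arXiv ids look like 2510.01234 (optionally with a version suffix).
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def build_links(paper_id=None, source_url=None, github_repo=None):
    """Assemble a best-effort link map in the spirit of include_links=True.

    Illustrative sketch only; keys are added only when the corresponding
    hint is present, mirroring the keys documented above.
    """
    links = {}
    if paper_id:
        links["huggingface"] = f"https://huggingface.co/papers/{paper_id}"
        if _ARXIV_ID.match(paper_id):
            links["arxiv"] = f"https://arxiv.org/abs/{paper_id}"
            links["pdf"] = f"https://arxiv.org/pdf/{paper_id}"
    if source_url:
        links["source"] = source_url
    if github_repo:
        links["github"] = f"https://github.com/{github_repo}"
    return links
```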
Saving example:

```python
from pathlib import Path

from search.hugging_face_paper import fetch_daily_papers, save_papers_as_json

papers = fetch_daily_papers()
save_papers_as_json(papers, Path("search/data") / "huggingface_daily_papers.json", include_links=True)
```

MCP arXiv / Google Scholar:

```json
{
  "title": "Paper title",
  "authors": ["Author A", "Author B"],
  "abstract": "...",
  "url": "https://arxiv.org/abs/XXXX.YYYYY"
}
```

If `EXA_API_KEY` is in the environment, Exa registers automatically. You can also register via the API:
```bash
curl -X POST http://127.0.0.1:8000/tools/exa \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "exa",
    "enrichment_description": "Main research outcome",
    "count": 10,
    "use_cache": true
  }'
```

The unified `/search` accepts:

- `query` (string; optional for HF tools)
- `tool` (string; one of the names in `/tools`, e.g. `exa`, `mcp_arxiv`, `mcp_google_scholar`, `hf_daily`, `hf_weekly`, `hf_monthly`)
- `tool_kwargs` (object; forwarded to the specific tool)
- `limit` (int; optional hard cap applied after the tool returns)
Example: Exa

```bash
curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "speculative decoding transformers",
    "tool": "exa",
    "tool_kwargs": {"enrichment_description": "Main research outcome", "count": 5, "use_cache": true}
  }'
```

Example: MCP arXiv

```bash
curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "graph neural networks",
    "tool": "mcp_arxiv",
    "tool_kwargs": {"limit": 25, "include_abstracts": true, "sort_by": "submitted_date", "sort_order": "descending"},
    "limit": 25
  }'
```

Example: MCP Google Scholar

```bash
curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "multimodal large language models",
    "tool": "mcp_google_scholar",
    "tool_kwargs": {"author": "Kaiming He", "startYear": 2018, "endYear": 2025, "numResults": 10}
  }'
```

Example: Hugging Face (no query required)

```bash
curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "tool": "hf_daily",
    "tool_kwargs": {"date": "2025-10-17"},
    "limit": 50
  }'
```

Exa (returns a list of dicts):
```python
from search.exa_search import exa_search

items = exa_search(
    query="Top AI research labs focusing on LLMs",
    enrichment_description="Main research outcome",
    count=5,
    use_cache=True,
)
```

Hugging Face daily/weekly/monthly:

```python
from search.hugging_face_paper import (
    fetch_daily_papers,
    fetch_weekly_papers,
    fetch_monthly_papers,
    sort_by_upvotes,
    format_papers,
)

daily = fetch_daily_papers()      # optional date="YYYY-MM-DD"
weekly = fetch_weekly_papers()    # optional end_date, days
monthly = fetch_monthly_papers()  # optional end_date, days
top = sort_by_upvotes(daily)[:20]
print(format_papers(top))
```

MCP arXiv tool:
```python
import asyncio

from search.mcp_client import arxiv_search


async def run():
    result = await arxiv_search(
        query="diffusion model inversion",
        limit=10,
        include_abstracts=True,
    )
    print(result)


asyncio.run(run())
```

MCP Google Scholar tool:
```python
import asyncio

from search.mcp_google_scholar import search_google_scholar


async def run():
    result = await search_google_scholar(
        query="retrieval augmented generation",
        author="MacGlashan",
        startYear=2020,
        endYear=2025,
        numResults=10,
    )
    print(result)


asyncio.run(run())
```

You can compose tools with `ResearchPaperSearcher` from `paper_search.py`:
```python
from search.paper_search import ResearchPaperSearcher

searcher = ResearchPaperSearcher()
searcher.add_hf_tools(make_default=True)  # registers hf_daily, hf_weekly, hf_monthly

# Optionally add Exa and MCP tools if the env vars are set
searcher.add_exa_tool(enrichment_description="Main research outcome", name="exa")
searcher.add_mcp_arxiv_tool(name="mcp_arxiv")
searcher.add_mcp_google_scholar_tool(name="mcp_google_scholar")

results = searcher.search("vision-language models", tool="exa", count=5)
```

Notes:

- HF feeds require no credentials but depend on public endpoints; a network issue can reduce coverage for specific dates.
- Exa responses are cached under `search/.cache/exa/`, keyed by query + enrichment.
- MCP servers are hosted via Smithery; ensure `ACADEMIA_MCP_API_KEY` is valid and has access.
- The FastAPI service attempts to register tools opportunistically; missing keys won't crash the server.
Maintenance tips:

- Clear the Exa cache by deleting files in `search/.cache/exa/`.
- Use `GET /tools` to confirm which tools registered successfully at startup.