Search module

The search folder provides a small toolkit and FastAPI service to search research papers across multiple sources:

  • Exa Websets (exa_search.py) — rich web search with enrichment and caching
  • Hugging Face Papers (hugging_face_paper.py) — daily/weekly/monthly feeds
  • MCP arXiv (mcp_client.py) — call the academia_mcp arXiv tool via MCP
  • MCP Google Scholar (mcp_google_scholar.py) — call a Google Scholar MCP tool
  • Unified router (paper_search.py) — register/search across tools
  • FastAPI (api.py) — simple HTTP API exposing tools and a unified /search

Requirements

  • Python 3.10+
  • Packages from the repo-level requirements.txt (FastAPI, requests, exa-py, mcp, etc.)

Install:

pip install -r requirements.txt

Environment variables

Create a .env at the repo root (or export in your shell):

  • EXA_API_KEY — required for Exa Websets
  • ACADEMIA_MCP_API_KEY — required for MCP-based tools (arXiv, Google Scholar)

Example .env:

EXA_API_KEY=exa_live_xxx
ACADEMIA_MCP_API_KEY=smithery_xxx
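If you export these variables in your shell, nothing else is needed. If you rely on a .env file, a loader must read it before the modules run. As an illustrative sketch of what such a loader does (the repo may use python-dotenv or similar instead; load_env here is hypothetical):

```python
import os
from pathlib import Path

def load_env(path: Path = Path(".env")) -> None:
    """Illustrative .env loader: KEY=VALUE lines, '#' comments, no quoting rules."""
    if not path.exists():
        return
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Do not overwrite variables already exported in the shell.
        os.environ.setdefault(key.strip(), value.strip())
```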

Tool parameters at a glance

  • Exa (exa)

    • Required: enrichment_description
    • Optional: count (int, default 10), use_cache (bool, default true)
    • Notes: reads EXA_API_KEY; results cached under search/.cache/exa/
  • Hugging Face

    • hf_daily: date (YYYY-MM-DD, optional)
    • hf_weekly: end_date (YYYY-MM-DD, optional), days (int, default 7)
    • hf_monthly: end_date (YYYY-MM-DD, optional), days (int, default 30)
    • Notes: query may be empty; server-side filtering matches title/summary/highlights
  • MCP arXiv (mcp_arxiv)

    • Required: query
    • Optional: limit, offset, sort_by, end_date, sort_order, start_date, include_abstracts
    • Notes: uses ACADEMIA_MCP_API_KEY; parameters are forwarded to the MCP tool
  • MCP Google Scholar (mcp_google_scholar)

    • Required: query
    • Optional: author, startYear, endYear, numResults
    • Notes: uses ACADEMIA_MCP_API_KEY; parameters are forwarded to the MCP tool
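Exa caching (noted above) keys responses on the query plus enrichment description. The exact scheme lives in exa_search.py; a hypothetical key derivation, purely to illustrate why both fields matter, could look like:

```python
import hashlib
import json

def cache_key(query: str, enrichment_description: str) -> str:
    """Hypothetical cache key: a stable hash over query + enrichment description.

    Not the actual function used by exa_search.py; shown for illustration only.
    """
    payload = json.dumps(
        {"query": query, "enrichment": enrichment_description},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Because the enrichment description is part of the key, changing it produces a fresh cache entry even for an identical query.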

FastAPI service

Run locally:

python -m uvicorn search.api:app --host 127.0.0.1 --port 8000 --reload

Endpoints:

  • GET /health → { "status": "ok" }
  • GET /tools → list of registered tool names
  • POST /tools/exa → register Exa at runtime if not auto-registered
  • POST /search → run a query through the chosen tool
  • GET /trend → get trending Hugging Face papers without configuring tools

Request/response models are defined in search/api.py.

Response shape

Example: GET /trend

curl -s "http://127.0.0.1:8000/trend?period=weekly&end_date=2025-10-17&days=7&limit=20" | jq

Query params:

  • period: one of daily | weekly | monthly (default: daily)
  • date: ISO date YYYY-MM-DD (for daily)
  • end_date: ISO date to end the trailing window (for weekly/monthly)
  • days: window size override (weekly default 7, monthly default 30)
  • limit: truncate results to first N

Response shape:

{
  "period": "weekly",
  "count": 20,
  "results": [ { "title": "...", "authors": ["..."], "publishedAt": "..." } ]
}
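Downstream code can post-process a /trend payload directly. For instance, a small helper to keep only the most upvoted entries (top_trending is illustrative; it assumes results may carry the normalized upvotes field described further below, treating a missing value as 0):

```python
def top_trending(response: dict, n: int = 5) -> list[dict]:
    """Return the n results with the most upvotes from a /trend response."""
    results = response.get("results", [])
    # Missing "upvotes" counts as 0 so partially-populated records still sort.
    return sorted(results, key=lambda r: r.get("upvotes", 0), reverse=True)[:n]
```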

POST /search responds with:

{
  "tool": "exa",
  "count": 5,
  "results": [ { "...": "tool-specific fields" } ]
}
  • results is a list of JSON-like dicts. Shape varies by tool.
  • limit (if provided in the request) caps the list length after tool execution.

Examples of results entries by tool (indicative, not exhaustive):

  • Exa:
{
  "id": "webset-item-id",
  "url": "https://example.com/paper",
  "title": "Paper title",
  "enrichments": [ { "description": "Main research outcome", "text": "..." } ]
}
  • Hugging Face daily/weekly/monthly (normalized fields):

The HF tools return a simplified record with these fields extracted by hugging_face_paper.py:

  • title — paper title
  • authors — list of author names (prefers nested user.fullname when present)
  • publishedAt — ISO date/time string if available
  • summary — summary/highlights text if available
  • upvotes — number of upvotes
  • githubrepo — canonical GitHub repo URL if present
  • ai_keywords — list of AI keywords (may be empty)
  • githubstart — GitHub star count (if provided by source)

Example:

{
  "title": "Paper title",
  "authors": ["Author A", "Author B"],
  "publishedAt": "2025-10-17",
  "summary": "...",
  "upvotes": 42,
  "githubrepo": "https://github.com/org/repo",
  "ai_keywords": ["LLM", "RAG"],
  "githubstart": 1234
}

Notes:

  • Weekly/monthly helpers fetch trailing windows ending at optional end_date (inclusive) and results are sorted by upvotes descending.
  • When persisting via save_papers_as_json(..., include_links=True), best‑effort links are attached (e.g., arxiv, huggingface, source, pdf, github).

Hugging Face details

Normalization and utilities from hugging_face_paper.py:

  • fetch_daily_papers(date?: YYYY-MM-DD) → returns normalized records sorted by upvotes
  • fetch_weekly_papers(end_date?: YYYY-MM-DD, days: int = 7) → trailing window ending at end_date (inclusive)
  • fetch_monthly_papers(end_date?: YYYY-MM-DD, days: int = 30) → trailing window ending at end_date (inclusive)
  • sort_by_upvotes(records) → helper to sort any HF list by upvotes desc
  • format_papers(records) → multi-line human-readable list for console output
  • save_papers_as_json(records, path, include_links: bool = False) → pretty JSON to disk; if include_links is true, attaches link map

Link resolution with include_links=True uses best-effort keys:

  • arxiv — https://arxiv.org/abs/{id} when the paper id matches arXiv patterns
  • huggingface — https://huggingface.co/papers/{id} when paper.id is present
  • source — original source URL if provided by Hugging Face payload
  • pdf — direct PDF URL when available
  • github — canonical https://github.com/{org}/{repo} derived from repo hint

Saving examples:

from pathlib import Path
from search.hugging_face_paper import fetch_daily_papers, save_papers_as_json

papers = fetch_daily_papers()
save_papers_as_json(papers, Path("search/data") / "huggingface_daily_papers.json", include_links=True)

Example results entry for the MCP tools (mcp_arxiv, mcp_google_scholar):
{
  "title": "Paper title",
  "authors": ["Author A", "Author B"],
  "abstract": "...",
  "url": "https://arxiv.org/abs/XXXX.YYYYY"
}

Register Exa (optional)

If EXA_API_KEY is in the environment, Exa registers automatically. You can also register via the API:

curl -X POST http://127.0.0.1:8000/tools/exa \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "exa",
    "enrichment_description": "Main research outcome",
    "count": 10,
    "use_cache": true
  }'

Search via HTTP

The unified /search accepts:

  • query (string; optional for HF tools)
  • tool (string; one of the names in /tools, e.g. exa, mcp_arxiv, mcp_google_scholar, hf_daily, hf_weekly, hf_monthly)
  • tool_kwargs (object; forwarded to the specific tool)
  • limit (int; optional hard cap applied after tool returns)

Example: Exa

curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "speculative decoding transformers",
    "tool": "exa",
    "tool_kwargs": {"enrichment_description": "Main research outcome", "count": 5, "use_cache": true}
  }'

Example: MCP arXiv

curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "graph neural networks",
    "tool": "mcp_arxiv",
    "tool_kwargs": {"limit": 25, "include_abstracts": true, "sort_by": "submitted_date", "sort_order": "descending"},
    "limit": 25
  }'

Example: MCP Google Scholar

curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "multimodal large language models",
    "tool": "mcp_google_scholar",
    "tool_kwargs": {"author": "Kaiming He", "startYear": 2018, "endYear": 2025, "numResults": 10}
  }'

Example: Hugging Face (no query required)

curl -X POST http://127.0.0.1:8000/search \
  -H 'Content-Type: application/json' \
  -d '{
    "tool": "hf_daily",
    "tool_kwargs": {"date": "2025-10-17"},
    "limit": 50
  }'

Using the modules directly (Python)

Exa (returns list of dicts):

from search.exa_search import exa_search

items = exa_search(
    query="Top AI research labs focusing on LLMs",
    enrichment_description="Main research outcome",
    count=5,
    use_cache=True,
)

Hugging Face daily/weekly/monthly:

from search.hugging_face_paper import fetch_daily_papers, fetch_weekly_papers, fetch_monthly_papers, sort_by_upvotes, format_papers

daily = fetch_daily_papers()  # optional date="YYYY-MM-DD"
weekly = fetch_weekly_papers()  # optional end_date, days
monthly = fetch_monthly_papers()  # optional end_date, days

top = sort_by_upvotes(daily)[:20]
print(format_papers(top))

MCP arXiv tool:

import asyncio
from search.mcp_client import arxiv_search

async def run():
    result = await arxiv_search(
        query="diffusion model inversion",
        limit=10,
        include_abstracts=True,
    )
    print(result)

asyncio.run(run())

MCP Google Scholar tool:

import asyncio
from search.mcp_google_scholar import search_google_scholar

async def run():
    result = await search_google_scholar(
        query="retrieval augmented generation",
        author="MacGlashan",
        startYear=2020,
        endYear=2025,
        numResults=10,
    )
    print(result)

asyncio.run(run())

Unified router (in-process)

You can compose tools with ResearchPaperSearcher from paper_search.py:

from search.paper_search import ResearchPaperSearcher

searcher = ResearchPaperSearcher()
searcher.add_hf_tools(make_default=True)  # registers hf_daily, hf_weekly, hf_monthly
# Optionally add Exa and MCP tools if env vars are set
searcher.add_exa_tool(enrichment_description="Main research outcome", name="exa")
searcher.add_mcp_arxiv_tool(name="mcp_arxiv")
searcher.add_mcp_google_scholar_tool(name="mcp_google_scholar")

results = searcher.search("vision-language models", tool="exa", count=5)

Notes

  • HF feeds require no credentials but depend on public endpoints; a network issue can reduce coverage for specific dates.
  • Exa responses are cached under search/.cache/exa/ keyed by query + enrichment.
  • MCP servers are hosted via Smithery; ensure ACADEMIA_MCP_API_KEY is valid and has access.
  • The FastAPI service attempts to register tools opportunistically; missing keys won’t crash the server.

Maintenance tips:

  • Clear Exa cache by deleting files in search/.cache/exa/.
  • Use GET /tools to confirm which tools registered successfully at startup.
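The cache-clearing tip can also be scripted; a minimal sketch, run from the repo root:

```python
import shutil
from pathlib import Path

cache_dir = Path("search/.cache/exa")
# Remove cached Exa responses; the cache is rebuilt lazily on the next query.
shutil.rmtree(cache_dir, ignore_errors=True)
cache_dir.mkdir(parents=True, exist_ok=True)
```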