CollabNet is a full-stack prototype that helps researchers discover potential collaborators by combining live OpenAlex data with curated analytics. The React frontend surfaces trends, search tools, and compatibility insights, while the Flask backend orchestrates API calls, computes proactive metrics, and falls back to an offline corpus when the network is unreachable.
- Architecture Overview
- Prerequisites
- Quick Start
- Data Flow Walkthrough
- Backend Capabilities
- Offline Dataset
- Extending the Project
- Troubleshooting
```
frontend/ (React + Tailwind + Chart.js)
├── src/pages
│   ├── Dashboard.jsx        ← Trend visualisations & quick search
│   ├── SearchResults.jsx    ← Topic/author search with co-author graph previews
│   ├── MatchEvaluating.jsx  ← Spinner that triggers /api/match
│   └── Compatibility.jsx    ← Gauge, bar chart & evidence lists
└── public/

backend/ (Flask)
├── app.py               ← REST API, OpenAlex integrations, analytics
└── openalex_offline.py  ← Offline sample corpus & helper lookups
```
- Frontend: React (React Router, Chart.js, Leaflet, React Force Graph). It proxies API calls to the backend (`package.json` sets `"proxy": "http://localhost:5000"`).
- Backend: Flask app that wraps OpenAlex endpoints, aggregates data, and performs analytics (trending topics/researchers, compatibility scoring, co-author graphs). Uses `requests` with custom retry logic and resilient fallbacks.
- Offline data: Rich sample dataset in `openalex_offline.py` to keep the UI functional without network access.
- Python: 3.11+ recommended (the repo ships with a sample `venv/` but you can create your own).
- Node.js: 18.x (React Scripts 5 requires ≥ Node 14; Node 18 LTS tested).
- npm: matches Node installation.
- OpenAlex connectivity: Optional, but required for live data. When unavailable, the backend automatically falls back to offline samples.
```bash
cd backend
python -m venv .venv              # optional – create your own virtualenv
.\.venv\Scripts\activate          # Windows
# source .venv/bin/activate       # macOS / Linux
pip install -r requirements.txt
python app.py
```

By default the Flask app runs on http://0.0.0.0:5000 with debug logging enabled. The only external dependency is OpenAlex, accessed via HTTPS with a contact address supplied in the `mailto` parameter.
```bash
cd frontend
npm install
npm start
```

The development server starts at http://localhost:3000 and transparently proxies API requests to the Flask backend.
File: frontend/src/pages/Dashboard.jsx
- Trending fetch – On mount, issues parallel requests to:
  - `/api/trending/topics` → returns topics with recent publication surges.
  - `/api/trending/scientists` → returns authors with high recent activity.
- Institution aggregation – For each trending topic, fetches `/api/institutions/<topic>` to pull institutional attribution counts, merges them, and keeps the top five institutions by works count.
- Network sizing – For each trending scientist, calls `/api/coauthor-network/author/<id>` and inspects `payload.stats.node_count` to quantify the breadth of their co-author graph.
- Charts:
  - Trending Topics (bar) – labels = concept display names, values = total `works_count` (global tally) for scale context.
  - Top Institutions (bar) – labels = institution names, values = aggregated works totals from the previous step.
  - Network Momentum (line) – labels = trending researcher names; values = node counts from their co-author networks.
- Overlay – Submitting an empty search toggles an overlay panel that showcases the same trending data the user sees in the charts, encouraging exploration.
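The institution aggregation step can be sketched as follows. This is an illustration only: it is written in Python for consistency with the backend, whereas the real logic lives in the React component, and the function name `top_institutions` and the `display_name` / `works_count` keys are assumptions about the payload shape.

```python
from collections import Counter

def top_institutions(per_topic_payloads, keep=5):
    """Merge per-topic institution lists and keep the `keep` largest by works."""
    totals = Counter()
    for institutions in per_topic_payloads:
        for inst in institutions:
            # Hypothetical keys; the real /api/institutions payload may differ.
            totals[inst["display_name"]] += inst.get("works_count", 0)
    return totals.most_common(keep)

# One list per trending topic, as if returned by /api/institutions/<topic>.
payloads = [
    [{"display_name": "MIT", "works_count": 120},
     {"display_name": "ETH Zurich", "works_count": 80}],
    [{"display_name": "MIT", "works_count": 30},
     {"display_name": "Stanford University", "works_count": 100}],
]
top_two = top_institutions(payloads, keep=2)
```

Here `top_two` holds MIT with 150 works followed by Stanford University with 100, which is the shape the Top Institutions bar chart needs (labels and values).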
File: frontend/src/pages/SearchResults.jsx
- Topic mode: `/api/topics?query=<q>` fetches matching concepts, the first concept is selected, and `/api/authors/<concept>` fetches leading authors. All cards display works, citations, and last-known institutions.
- Author mode: `/api/authors?query=<q>` returns matching authors outright.
- Selecting an author triggers two follow-up calls:
  - `/api/author/<id>` for enriched profile data if summary stats were missing.
  - `/api/coauthor-network/author/<id>` to render a force-directed collaboration network (nodes = authors, weight = shared works).
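To make "nodes = authors, weight = shared works" concrete, here is a minimal sketch of how such a graph could be assembled from a list of works. The function name `build_coauthor_graph`, the input shape, and the `edge_count` field are assumptions; only `stats.node_count` is named elsewhere in this README.

```python
from collections import Counter
from itertools import combinations

def build_coauthor_graph(works):
    """Return (nodes, edges); each edge weight counts the works two authors share."""
    weights = Counter()
    names = {}
    for work in works:
        authors = {a["id"]: a["display_name"] for a in work["authors"]}
        names.update(authors)
        # Every unordered pair of authors on a work shares that work.
        for a, b in combinations(sorted(authors), 2):
            weights[(a, b)] += 1
    nodes = [{"id": aid, "name": name} for aid, name in sorted(names.items())]
    edges = [{"source": a, "target": b, "weight": w}
             for (a, b), w in sorted(weights.items())]
    return nodes, edges

works = [
    {"authors": [{"id": "A1", "display_name": "Ada"},
                 {"id": "A2", "display_name": "Ben"}]},
    {"authors": [{"id": "A1", "display_name": "Ada"},
                 {"id": "A2", "display_name": "Ben"},
                 {"id": "A3", "display_name": "Cy"}]},
]
nodes, edges = build_coauthor_graph(works)
stats = {"node_count": len(nodes), "edge_count": len(edges)}
```

Ada and Ben appear together on both works, so their edge carries weight 2, while the edges to Cy carry weight 1.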
Files: frontend/src/pages/MatchEvaluating.jsx, frontend/src/pages/Compatibility.jsx
- Match flow begins on `/match-evaluating?target=<id>&name=<display>`:
  - Displays a progress indicator.
  - POSTs `{ "target_id": "<OpenAlex Author ID>" }` to `/api/match`.
- `/api/match` aggregates detailed profiles and returns:
  - `breakdown`: overall score and sub-metrics on a 0–100 scale.
  - `evidence`: overlapping concepts, mutual co-authors, aligned publications, and median publication years.
- Compatibility page renders:
  - Gauge – overall percentage from `breakdown.overall` using Chart.js Doughnut.
  - Horizontal Bar Chart – sub-metrics (topic similarity, co-author distance, institution proximity, recency alignment).
  - Evidence Lists – bullet lists for overlapping concepts, shared co-authors, and a list of aligned publications `{ title, year }` plus context about publication recency alignment.
Implementation: backend/app.py (compute_trending_topics, compute_trending_scientists)
- Time windows – Uses a rolling six-month window (`TRENDING_WINDOW_MONTHS = 6`). Dates are computed on the fly with `trending_window_strings()`:
  - Recent window: `[recent_start, today]`
  - Previous window: prior six months ending the day before `recent_start`
- OpenAlex groupings:
  - `/works?group_by=concepts.id` for topics.
  - `/works?group_by=authorships.author.id` for scientists.
  - Both calls request up to `limit * 4` entries to ensure enough candidates before filtering.
- Growth computation:
  - `recent_count` = works in recent window.
  - `previous_count` = works in previous window (default 0 if absent).
  - `growth` = `recent_count - previous_count`.
  - `growth_rate` = `(recent - previous) / (previous or recent)` for tie-breaking.
- Ranking:
  - Topics prefer positive growth. The ranking tuple is `(growth, growth_rate, recent_count)` descending.
  - Scientists sort by `(recent_count, growth)` descending before enrichment.
- Enrichment:
  - Concepts → `fetch_concept_brief` hits `/concepts/<id>` (parallelised with `ThreadPoolExecutor`).
  - Authors → `fetch_author_brief` hits `/authors/<id>` for `works_count`, `cited_by_count`, and institution metadata.
- Return payload:

  ```json
  {
    "topics": [
      {
        "id": "https://openalex.org/T101",
        "display_name": "Machine Learning",
        "description": "...",
        "works_count": 2450000,
        "recent_publications": 18450,
        "growth": 3220
      }
    ],
    "scientists": [
      {
        "id": "https://openalex.org/A1969205032",
        "display_name": "Fei-Fei Li",
        "works_count": 420,
        "cited_by_count": 98000,
        "last_known_institution": {...},
        "recent_publications": 28,
        "growth": 9
      }
    ]
  }
  ```
- Fallback – If any upstream call fails, the endpoint supplies offline equivalents (`OFFLINE_DATA.trending_topics`, `OFFLINE_DATA.trending_scientists`).
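The window and growth rules above can be sketched in a few lines. This is not the code in `app.py`: `shift_months_back`, `window_strings`, and `rank_topics` are hypothetical names, the month arithmetic (clamping to the last day of a shorter month) is an assumption, and "prefer positive growth" is interpreted here simply as a descending sort.

```python
from datetime import date, timedelta

def shift_months_back(day, months):
    """Move a date back by whole months, clamping to the target month's length."""
    index = day.year * 12 + (day.month - 1) - months
    year, month0 = divmod(index, 12)  # month0 is zero-based
    first_of_next = date(year + (month0 == 11), (month0 + 1) % 12 + 1, 1)
    last_day = (first_of_next - timedelta(days=1)).day
    return date(year, month0 + 1, min(day.day, last_day))

def window_strings(today, months=6):
    """Recent window ends today; previous window ends the day before recent_start."""
    recent_start = shift_months_back(today, months)
    previous_end = recent_start - timedelta(days=1)
    previous_start = shift_months_back(recent_start, months)
    return {
        "recent": (recent_start.isoformat(), today.isoformat()),
        "previous": (previous_start.isoformat(), previous_end.isoformat()),
    }

def rank_topics(recent, previous, limit=5):
    """Rank group IDs by (growth, growth_rate, recent_count), descending."""
    rows = []
    for key, recent_count in recent.items():
        previous_count = previous.get(key, 0)  # default 0 if absent
        growth = recent_count - previous_count
        denominator = previous_count or recent_count
        growth_rate = growth / denominator if denominator else 0.0
        rows.append({"id": key, "recent_publications": recent_count,
                     "growth": growth, "growth_rate": growth_rate})
    rows.sort(key=lambda r: (r["growth"], r["growth_rate"], r["recent_publications"]),
              reverse=True)
    return rows[:limit]

windows = window_strings(date(2024, 7, 15))
ranked = rank_topics({"C1": 120, "C2": 90, "C3": 40}, {"C1": 100, "C2": 30})
```

In the example, `C2` grows by 60 works, `C3` by 40 (it has no previous count, so its rate falls back to `recent / recent = 1.0`), and `C1` by 20, giving the order `C2`, `C3`, `C1`.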
Implementation: backend/app.py (compute_compatibility and helpers)
Profile construction
- `collect_research_profile(author_id)` builds a profile by combining:
  - Summary stats (`fetch_author_brief` → `/authors/<id>`).
  - Works (`fetch_author_works` → `/works` filtered by author, newest first, up to 200 entries).
  - Fallbacks to the offline corpus for both metadata and works when necessary.
- `build_research_profile` extracts:
  - `concept_counts`: weighted sum of concept scores across the author's works.
  - `concept_names`: display names for reporting.
  - `coauthors`: mapping of collaborator ID → display name.
  - `coauthor_graph`: adjacency set for BFS pathfinding.
  - `works`: curated work summaries for evidence (title, year, concepts, authors, citations).
  - `median_year`: median publication year (rounded).
  - `institution`: normalized institution object (`id`, `display_name`, `type`, `country_code`).
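As a rough illustration of the extraction step, the sketch below derives a few of these fields from simplified work records. The function name `summarise_works` and the input keys (`concepts`, `score`, `authors`, `year`) are assumptions, not the shapes used by `build_research_profile`.

```python
from collections import defaultdict
from statistics import median

def summarise_works(works, author_id):
    """Aggregate concept weights, co-authors, and the median year from works."""
    concept_counts = defaultdict(float)
    concept_names = {}
    coauthors = {}
    years = []
    for work in works:
        for concept in work.get("concepts", []):
            # Weighted sum: each work contributes its concept score.
            concept_counts[concept["id"]] += concept.get("score", 0.0)
            concept_names[concept["id"]] = concept["display_name"]
        for author in work.get("authors", []):
            if author["id"] != author_id:  # skip the profile owner
                coauthors[author["id"]] = author["display_name"]
        if work.get("year") is not None:
            years.append(work["year"])
    return {
        "concept_counts": dict(concept_counts),
        "concept_names": concept_names,
        "coauthors": coauthors,
        "median_year": round(median(years)) if years else None,
    }

ml = {"id": "C1", "display_name": "Machine Learning"}
works = [
    {"year": 2020, "concepts": [{**ml, "score": 0.5}],
     "authors": [{"id": "A1", "display_name": "Ada"}, {"id": "A2", "display_name": "Ben"}]},
    {"year": 2022, "concepts": [{**ml, "score": 0.25}],
     "authors": [{"id": "A1", "display_name": "Ada"}]},
    {"year": 2023, "concepts": [], "authors": [{"id": "A1", "display_name": "Ada"}]},
]
profile = summarise_works(works, author_id="A1")
```

Returning `None` for `median_year` when no years are known is a design choice of this sketch; it lets downstream scoring treat missing data explicitly.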
Metrics
- Topic Similarity (40%)
  - Converts concept frequency dictionaries into sparse vectors.
  - Cosine similarity = `dot(A, B) / (||A|| * ||B||)`.
  - Scaled to 0–100 by multiplying by 100 and clamping.
  - Evidence: top five overlapping concepts with relative weights, e.g. `Machine Learning (you 12.50 · them 10.80)`.
- Co-author Distance (30%)
  - Merges user and target co-author graphs.
  - Shortest path computed with BFS (up to depth 6).
  - Score mapping:

    | Path length | Score |
    |---|---|
    | 0 (same person) | 100 |
    | 1 (direct co-author) | 100 |
    | 2 | 80 |
    | 3 | 55 |
    | 4 | 35 |
    | ≥5 or unknown | 15 (or 0 if disconnected) |

  - Evidence: list of mutual co-author names (up to five).
- Institution Proximity (20%)
  - Hierarchical comparison:
    - Same `institution.id` → 100.
    - Same `country_code` → 80.
    - Same `type` (e.g., education, business) → 60.
    - Both institutions present but otherwise different → 40.
    - Missing data → 0.
- Recency Alignment (10%)
  - Uses median publication year for user (`y_u`) and target (`y_t`).
  - Score = `max(0, 100 - 12.5 * |y_u - y_t|)` (every year difference costs 12.5 points).
  - Evidence includes both medians for context, even when missing.
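The four sub-metrics can be sketched as small pure functions. All names below are hypothetical, and two behaviours are assumptions rather than documented facts: an unreachable target scores 0 (the table also lists 15 for "unknown"), and a missing median year scores 0.

```python
import math
from collections import deque

def topic_similarity(a, b):
    """Cosine similarity of two sparse concept vectors, scaled to 0-100."""
    dot = sum(weight * b.get(key, 0.0) for key, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return max(0.0, min(100.0, 100.0 * dot / (norm_a * norm_b)))

def merge_graphs(*graphs):
    """Union several adjacency-set graphs into one."""
    merged = {}
    for graph in graphs:
        for node, neighbours in graph.items():
            merged.setdefault(node, set()).update(neighbours)
    return merged

def shortest_path_length(graph, start, goal, max_depth=6):
    """Breadth-first search; None if goal is not reached within max_depth hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbour in graph.get(node, ()):
            if neighbour == goal:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return None

PATH_SCORES = {0: 100, 1: 100, 2: 80, 3: 55, 4: 35}

def coauthor_distance_score(length):
    if length is None:  # assumption: no path found means disconnected
        return 0
    return PATH_SCORES.get(length, 15)  # any length >= 5 maps to 15

def institution_proximity(a, b):
    """Walk the hierarchy: same id, same country, same type, else merely present."""
    if not a or not b:
        return 0
    for key, score in (("id", 100), ("country_code", 80), ("type", 60)):
        if a.get(key) and a.get(key) == b.get(key):
            return score
    return 40

def recency_alignment(y_u, y_t):
    if y_u is None or y_t is None:  # assumption: missing medians score zero
        return 0.0
    return max(0.0, 100.0 - 12.5 * abs(y_u - y_t))

graph = merge_graphs({"u": {"a"}, "a": {"u", "b"}}, {"t": {"b"}, "b": {"a", "t"}})
distance = shortest_path_length(graph, "u", "t")
```

In the example the user reaches the target through two intermediaries (`u → a → b → t`), so `distance` is 3 and the corresponding score is 55.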
Overall Score
```python
overall = round(
    0.40 * topic_similarity +
    0.30 * coauthor_distance +
    0.20 * institution_proximity +
    0.10 * recency_alignment
)
```

The result is clamped to `[0, 100]`. All sub-metrics are stored in `breakdown` for display on the compatibility page.
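A worked example of the weighted sum, using made-up sub-metric values. The `WEIGHTS` dictionary and `overall_score` helper are illustrative names; only the weights themselves come from the formula above.

```python
WEIGHTS = {
    "topic_similarity": 0.40,
    "coauthor_distance": 0.30,
    "institution_proximity": 0.20,
    "recency_alignment": 0.10,
}

def overall_score(breakdown):
    """Weighted sum of the sub-metrics, rounded and clamped to [0, 100]."""
    total = sum(WEIGHTS[name] * breakdown.get(name, 0) for name in WEIGHTS)
    return max(0, min(100, round(total)))

# Hypothetical sub-metric values for one user/target pair.
breakdown = {
    "topic_similarity": 72.0,
    "coauthor_distance": 80,
    "institution_proximity": 40,
    "recency_alignment": 75.0,
}
breakdown["overall"] = overall_score(breakdown)
```

The contributions are 28.8 + 24.0 + 8.0 + 7.5 = 68.3, which rounds to an overall score of 68.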
Aligned Publications
- Filters target works to those containing overlapping concept IDs (top 10 overlaps).
- Returns up to five entries sorted by publication year (newest first).
- Each entry includes `title`, `year`, and the matched concepts.
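The filtering described above might look like this. The function name, parameter names, and work shape are assumptions, and `overlap_ids` is assumed to arrive already ordered from strongest to weakest overlap.

```python
def aligned_publications(target_works, overlap_ids, max_overlaps=10, limit=5):
    """Keep target works that touch an overlapping concept, newest first."""
    wanted = set(overlap_ids[:max_overlaps])
    matches = []
    for work in target_works:
        matched = [c["display_name"] for c in work.get("concepts", [])
                   if c["id"] in wanted]
        if matched:
            matches.append({"title": work["title"], "year": work.get("year"),
                            "concepts": matched})
    # Works without a year sort last.
    matches.sort(key=lambda m: m["year"] or 0, reverse=True)
    return matches[:limit]

works = [
    {"title": "Old ML", "year": 2018,
     "concepts": [{"id": "C1", "display_name": "Machine Learning"}]},
    {"title": "New CV", "year": 2023,
     "concepts": [{"id": "C2", "display_name": "Computer Vision"}]},
    {"title": "Unrelated", "year": 2024,
     "concepts": [{"id": "C9", "display_name": "Botany"}]},
]
aligned = aligned_publications(works, ["C1", "C2"])
```

The unrelated 2024 paper is dropped because none of its concepts overlap, leaving the 2023 and 2018 works in that order.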
- All OpenAlex helper functions (`fetch_openalex`, `fetch_author_endpoint`, `fetch_works_endpoint`, `fetch_institution_endpoint`) catch `requests.RequestException`, log details, and return `None` to indicate failure.
- Endpoint handlers interpret `None` as "switch to offline data":
  - Search endpoints use `OFFLINE_DATA.search_topics` / `search_authors`.
  - Trending endpoints use curated trending lists.
  - Compatibility uses `OFFLINE_DATA.author_profile` + `author_works` to rebuild complete profiles offline.
- The offline fallbacks ensure responses always contain arrays/objects (never `null`), so the frontend can render consistent placeholders.
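The "return `None`, then fall back" pattern can be sketched without any network access by passing the live and offline lookups in as callables. `safe_fetch` and `search_topics` here are hypothetical helpers, and catching `OSError` stands in for `requests.RequestException` (which derives from it) so that the sketch stays standard-library only.

```python
import logging

logger = logging.getLogger("collabnet.sketch")

def safe_fetch(fetch, *args):
    """Run a live fetch; return None instead of raising on network errors."""
    try:
        return fetch(*args)
    except OSError as exc:  # stand-in for requests.RequestException
        logger.warning("OpenAlex call failed: %s", exc)
        return None

def search_topics(query, live_search, offline_search):
    """Prefer live results; treat None as the signal to use offline data."""
    results = safe_fetch(live_search, query)
    if results is None:
        results = offline_search(query)
    return results or []  # never hand null to the frontend

def broken_live_search(query):
    raise ConnectionError("network unreachable")

def offline_search(query):
    return [{"display_name": "Machine Learning"}]

topics = search_topics("ml", broken_live_search, offline_search)
```

Note the distinction between `None` (the call failed) and an empty list (the call succeeded but matched nothing): only the former triggers the offline path.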
| Method | Route | Description | Key Params |
|---|---|---|---|
| GET | `/api/health` | Simple ok check | – |
| GET | `/api/topics` | Search OpenAlex concepts | `q` (or `query`), `limit` |
| GET | `/api/authors` | Search OpenAlex authors by name | `q` (or `query`), `limit` |
| GET | `/api/authors/<topic_id>` | Authors associated with a concept | `limit` |
| GET | `/api/author/<author_id>` | Detailed author profile | – |
| GET | `/api/institutions/<topic_id>` | Institutions active in a concept | `limit` |
| GET | `/api/coauthor-network/<topic_id>` | Co-author graph seeded from works with a topic | `limit_works` (≤200) |
| GET | `/api/coauthor-network/author/<author_id>` | Author-centric co-author graph | `limit_works` (≤200) |
| GET | `/api/trending/topics` | Top topics by recent growth | `limit` (default 5, max 20) |
| GET | `/api/trending/scientists` | Top researchers by recent output | `limit` (default 5, max 20) |
| POST | `/api/match` | Compatibility analysis | JSON body `{ target_id, user_id? }` |
Sample compatibility request:

```bash
curl -X POST http://localhost:5000/api/match \
  -H "Content-Type: application/json" \
  -d '{"target_id": "https://openalex.org/A1969205032"}'
```

The offline dataset is defined in `backend/openalex_offline.py`:
- Mimics OpenAlex payload shapes (topics, authors, institutions, networks).
- Includes curated trending metrics (`_trending_topics_data`, `_trending_scientists_data`) aligned with the UI's expectations (`recent_publications`, `growth`, counts).
- Provides synthetic works per author (titles, years, concepts, co-authors) so compatibility scoring remains meaningful offline.
- Offers helper methods (`search_topics`, `author_profile`, `author_works`, `trending_topics`, etc.) consumed by the backend when real API calls fail.
- `default_user_id()` currently returns Fei-Fei Li's OpenAlex ID, used when `/api/match` is called without a `user_id`.
- Change user profile source: Right now the backend assumes a default user ID until a dedicated user profile endpoint is introduced. Plugging in an authenticated profile would simply require replacing the `user_id` selection logic in `/api/match`.
- Adjust weightings: Tune the constants in `compute_compatibility` to emphasise different collaboration signals.
- Add more metrics: e.g., funding overlap, geographic proximity, research impact trajectory. Compute your metric, normalise to `[0, 100]`, and adjust the weighted average.
- Persist results: Introduce a database to store cached trending lists or previously computed compatibility scores for quicker access.
- Broaden offline samples: Extend `openalex_offline.py` with additional concepts/authors to demonstrate richer scenarios without network access.
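For the "add more metrics" route, one possible approach is to normalise the raw value linearly and shrink the existing weights proportionally so they still sum to 1. Both helpers below are hypothetical, and `funding_overlap` with its 0 to 12 range is an invented example metric.

```python
def normalise(value, low, high):
    """Linearly map value from [low, high] onto [0, 100], clamping outliers."""
    if high == low:
        return 0.0
    return max(0.0, min(100.0, 100.0 * (value - low) / (high - low)))

def add_metric(weights, name, weight):
    """Give `name` the requested weight and rescale the others to sum to 1."""
    if not 0 < weight < 1:
        raise ValueError("weight must be strictly between 0 and 1")
    scale = (1.0 - weight) / sum(weights.values())
    rescaled = {key: value * scale for key, value in weights.items()}
    rescaled[name] = weight
    return rescaled

current = {
    "topic_similarity": 0.40,
    "coauthor_distance": 0.30,
    "institution_proximity": 0.20,
    "recency_alignment": 0.10,
}
# Hypothetical metric: 3 shared funders out of an assumed maximum of 12.
funding_score = normalise(3, low=0, high=12)
updated = add_metric(current, "funding_overlap", 0.20)
```

Proportional rescaling keeps the relative importance of the existing signals intact: topic similarity drops from 0.40 to 0.32 but remains twice as heavy as institution proximity.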
- No charts / empty data: Check backend logs. If OpenAlex is unreachable, the backend should log warnings and still serve offline data. Ensure the Flask process has not crashed.
- CORS issues: CORS is enabled (`flask_cors.CORS(app)`). When deploying, tighten the origin list as needed.
- Rate limiting: Grouping requests can be heavy. The backend already reduces `per_page` and retries without `select` on failure. When moving to production, consider caching responses and make sure every request carries the `mailto` parameter.
- React build errors: Ensure Node 18+ is installed. Delete `node_modules` and rerun `npm install` if dependencies drift.
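If caching is the route taken for rate limiting, a small in-memory cache with a time-to-live is one possible starting point. `TTLCache` is a hypothetical helper, not part of the current backend, and it is neither thread-safe nor size-bounded.

```python
import time

class TTLCache:
    """Cache values per key and recompute them once they are older than ttl."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the expiry logic can be tested
        self._store = {}

    def get_or_compute(self, key, compute):
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip the expensive call
        value = compute()
        self._store[key] = (now, value)
        return value

calls = []

def expensive_trending_lookup():
    calls.append("hit")  # stands in for a grouped OpenAlex request
    return ["Machine Learning"]

fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_compute("trending:topics", expensive_trending_lookup)
second = cache.get_or_compute("trending:topics", expensive_trending_lookup)
fake_now[0] = 61.0  # pretend a minute has passed
third = cache.get_or_compute("trending:topics", expensive_trending_lookup)
```

The second lookup is served from memory, and the third triggers a fresh computation because the entry has outlived its 60-second lifetime.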
CollabNet is designed as a demonstrative sandbox: every response is structured to keep the UI responsive whether or not the network is available. Dive into the code, adjust the analytics, and tailor the research matching engine to your needs!