An intelligent AI-powered search engine for scientific literature, built with SPECTER embeddings and multi-index FAISS for semantic similarity search across 4,000+ research articles from Scopus.
- 🧠 Semantic Search: Natural language queries like "machine learning in healthcare"
- 👥 Author Search: Find research by specific authors or research groups
- 🏢 Institution Search: Discover research from specific universities or countries
- 📊 Multi-Index System: 5 specialized FAISS indexes for different search strategies
- 🔍 Intelligent Query Processing: Automatically detects search intent (author, institution, semantic)
- 📱 Modern UI: Clean Gradio interface with pagination and detailed results
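The intent detection mentioned above can be sketched with simple keyword heuristics. This is an illustrative stand-in, not the project's actual logic in `app.py`; the function name and patterns are hypothetical:

```python
import re

def detect_intent(query: str) -> str:
    """Classify a query as author, institution, or semantic search.

    A minimal heuristic sketch; the real query processor may use
    different (and richer) rules.
    """
    q = query.lower()
    # Author-style queries: "papers by X", "works authored by X"
    if re.search(r"\b(by|author|authored)\b", q):
        return "author"
    # Institution-style queries: university / institute / country keywords
    if re.search(r"\b(university|institute|institution|college|country)\b", q):
        return "institution"
    # Default: semantic similarity over title + abstract
    return "semantic"

print(detect_intent("papers by Jane Doe"))              # author
print(detect_intent("research at Stanford University")) # institution
print(detect_intent("machine learning in healthcare"))  # semantic
```

Each intent would then be routed to the matching FAISS index (author and institution queries to the metadata-oriented indexes, everything else to the content index).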
```
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│   Scopus API    │───▶│  Data Pipeline   │───▶│   SQLite DB     │
│   (Raw Data)    │     │ (Clean & Store)  │     │  (Structured)   │
└─────────────────┘     └──────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌───────────────────┐     ┌─────────────────┐
│  Multi-Index    │◀───│ SPECTER Embeddings│◀───│  Text Content   │
│  FAISS Search   │     │ (Semantic Vectors)│     │ (Title+Abstract)│
└─────────────────┘     └───────────────────┘     └─────────────────┘
        │
        ▼
┌─────────────────────────────────────────────────────────────────┐
│                     Gradio Chat Interface                       │
│        (Intelligent Query Processing + Results Display)         │
└─────────────────────────────────────────────────────────────────┘
```
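The "Clean & Store" stage of the pipeline can be sketched with the standard `sqlite3` module. The real schema lives in `database.py`; the table and column names below are illustrative assumptions, not the project's actual schema:

```python
import sqlite3

# Minimal sketch of the "Clean & Store" stage; columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (scopus_id TEXT PRIMARY KEY, title TEXT, abstract TEXT)"
)

# A raw Scopus record (field names invented for illustration)
raw = {"id": "SCOPUS:123", "title": "  A Study  ", "abstract": "Text."}

# "Clean" step: normalize whitespace before storing
conn.execute(
    "INSERT INTO articles VALUES (?, ?, ?)",
    (raw["id"], raw["title"].strip(), raw["abstract"]),
)

row = conn.execute("SELECT title FROM articles").fetchone()
print(row[0])  # A Study
```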
The system uses 5 specialized FAISS indexes for different search scenarios:
- Content Index: Title + Abstract semantic search
- Metadata Index: Authors + Keywords + Institutions
- Full Index: Complete article text (title + abstract + metadata)
- Institution Index: Institutional affiliations and locations
- Combined Index: Unified search across all content types
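The multi-index idea above can be illustrated without FAISS itself: each "index" maps article IDs to embeddings built from a different text field, and a query is ranked against whichever index fits its intent. This pure-Python cosine search is a toy stand-in for what a FAISS `IndexFlatIP` does on normalized vectors; all data here is invented:

```python
from math import sqrt

# Toy stand-in for two of the five indexes; ids and vectors are invented.
indexes = {
    "content":     {"a1": [1.0, 0.0], "a2": [0.0, 1.0]},
    "institution": {"a1": [0.5, 0.5], "a2": [1.0, 0.0]},
}

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(x * x for x in v)))

def search(index_name, query_vec, k=1):
    """Top-k article ids by cosine similarity, the same ranking FAISS
    inner-product search gives on L2-normalized vectors."""
    idx = indexes[index_name]
    ranked = sorted(idx, key=lambda a: cosine(idx[a], query_vec), reverse=True)
    return ranked[:k]

print(search("content", [0.9, 0.1]))      # ['a1']
print(search("institution", [0.0, 1.0]))  # ['a1']
```

In the real system the query vector would come from the same SPECTER model used to embed the articles, and the index would be chosen by the query-intent detector.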
```bash
git clone https://github.com/anVSS1/Scientific-Article-Recommender.git
cd Scientific-Article-Recommender
pip install -r requirements.txt
```

- Get your Scopus API key from the Elsevier Developer Portal
- Create a `config.py` file:
```python
# config.py
API_KEY = "your_scopus_api_key_here"
INST_TOKEN = "your_scopus_inst_token_here"
```

- Update `scopus_api.py` to use the config:

```python
from config import API_KEY, INST_TOKEN
```

```bash
# Collect data from Scopus API
python scopus_api.py
```
```bash
# Populate the database
python populate_database.py

# Generate FAISS indexes for semantic search
python enhanced_semantic_indexing.py

# Local development
python app.py

# For Hugging Face Spaces deployment
python huggingface\ space/app_hf.py
```

- `scopus_database.db` - SQLite database with articles (created by the data pipeline)
- `scopus_combined_metadata_index.faiss` - Main FAISS index
- `scopus_article_ids_for_index.json` - Article ID mappings
- `scopus_content_index.faiss` - Content-only search
- `scopus_metadata_index.faiss` - Metadata-focused search
- `scopus_institution_index.faiss` - Institution-based search
- `scopus_full_index.faiss` - Comprehensive search
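FAISS searches return integer row positions, which is why the generated files above include `scopus_article_ids_for_index.json`: it stores article IDs in the same order as the index rows. A hypothetical lookup might look like this (the sample contents are invented for illustration):

```python
import json

# Sample ID mapping, stored in FAISS row order; contents are invented.
sample_mapping = json.loads('["SCOPUS:001", "SCOPUS:002", "SCOPUS:003"]')

def rows_to_ids(row_positions, id_mapping):
    """Translate FAISS result row positions back to Scopus article ids."""
    return [id_mapping[i] for i in row_positions]

# A FAISS search returns integer rows, e.g. index.search(query, k) -> (D, I)
print(rows_to_ids([2, 0], sample_mapping))  # ['SCOPUS:003', 'SCOPUS:001']
```

The resolved IDs can then be joined against `scopus_database.db` to fetch titles, authors, and abstracts for display.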
- Backend: Python, SQLite, FAISS
- ML/AI: SPECTER embeddings (scientific papers), sentence-transformers
- Frontend: Gradio
- Deployment: Hugging Face Spaces ready
```
Scientific-Article-Recommender/
├── app.py                          # Main Gradio application
├── scopus_api.py                   # Scopus API data collection
├── database.py                     # Database schema and operations
├── populate_database.py            # Data processing pipeline
├── enhanced_semantic_indexing.py   # FAISS index creation
├── requirements.txt                # Python dependencies
├── README.md                       # This file
└── kaggle_semantic_indexing.ipynb  # Jupyter notebook for indexing
```
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- API Limits: Scopus API has rate limits and requires institutional access
- Data Size: Full database and indexes can be several GB
- Performance: Semantic search requires substantial computational resources
- Privacy: This project handles academic data responsibly per Scopus terms
Note: This repository contains the code and scripts but excludes large data files (database, indexes) and sensitive API credentials. See setup instructions above for complete deployment.