A hybrid recommendation system for scientific papers that combines semantic embeddings, knowledge graphs, and user behavior analysis to provide personalized research paper recommendations.
- Hybrid Recommendation Engine: Combines content-based filtering using SciBERT embeddings with collaborative filtering
- Knowledge Graph Integration: Uses Neo4j to store and query relationships between papers, concepts, and authors
- Semantic Search: Leverages SciBERT embeddings for semantic similarity between research papers
- Ontology-based Recommendations: Integrates scientific concept hierarchies for better recommendations
- Modern Web Interface: Clean, responsive UI built with Flask and TailwindCSS
- Real-time Recommendations: Instant paper suggestions based on topics and user preferences
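The hybrid idea in the list above can be illustrated with a short sketch. This is not the project's actual engine: it assumes each paper already has an embedding vector and an ontology score between 0 and 1, and the names `cosine_similarity`, `hybrid_score`, `rank_papers`, and the dictionary layout are hypothetical. The 0.6 and 0.4 defaults mirror the weights this README quotes for Website/backend/reco.py.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_score(content_sim, ontology_sim, content_weight=0.6, ontology_weight=0.4):
    """Weighted sum of the content-based and ontology-based signals."""
    return content_weight * content_sim + ontology_weight * ontology_sim


def rank_papers(query_vec, papers, top_k=5):
    """Rank papers by hybrid score, best first.

    `papers` is a list of dicts with the (hypothetical) keys
    'id', 'embedding', and 'ontology_score'.
    """
    scored = [
        (p["id"], hybrid_score(cosine_similarity(query_vec, p["embedding"]), p["ontology_score"]))
        for p in papers
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Keeping the two signals separate until the final weighted sum makes it easy to tune the weights without touching the similarity code.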
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Flask Web App  │◄──►│ Recommendation  │◄──►│   Neo4j Graph   │
│                 │    │     Engine      │    │    Database     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │                      │                      │
    ┌─────────┐         ┌──────────────┐       ┌──────────────┐
    │   UI    │         │   SciBERT    │       │   OpenAlex   │
    │Templates│         │  Embeddings  │       │     API      │
    └─────────┘         └──────────────┘       └──────────────┘
- Python 3.8+
- Neo4j Database (Desktop or Cloud)
- Git
- 8GB+ RAM (for SciBERT embedding generation)
git clone https://github.com/anVSS1/Scientific-Article-Recommender.git
cd Scientific-Article-Recommender
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
- Download Neo4j Desktop: https://neo4j.com/download/
- Create a new database
- Set a password (you will need it for NEO4J_PASSWORD in .env)
- Start the database
# Copy environment template
cp .env.example .env
# Edit .env file with your settings:
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_actual_password
FLASK_DEBUG=True
FLASK_PORT=5050
- Edit the notebook: existing_scripts/openalec-fetcher.ipynb
- Add your email (required by OpenAlex API):
EMAIL_FOR_OPENALEX = "your.email@example.com"  # Replace this!
- Configure domains you want to fetch:
DOMAINS = [
    ("Computer Science", "https://openalex.org/C41008148", 1400),
    ("Artificial Intelligence", "https://openalex.org/C154945302", 400),
    ("Physics", "https://openalex.org/C121332964", 200),
    # Add more domains as needed
]
- Run the notebook to fetch articles and concepts
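For orientation, here is a minimal sketch of how one `DOMAINS` entry could be turned into an OpenAlex request URL. It is an assumption about what the notebook does, not a copy of it: the helper names `concept_short_id` and `build_works_url` are hypothetical, and the sketch assumes the works are filtered by concept ID with cursor paging and a `mailto` parameter carrying your email.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_ENDPOINT = "https://api.openalex.org/works"


def concept_short_id(concept_url):
    """Extract the short ID, e.g. 'https://openalex.org/C41008148' -> 'C41008148'."""
    return concept_url.rstrip("/").rsplit("/", 1)[-1]


def build_works_url(concept_url, email, per_page=200, cursor="*"):
    """Build a works query for one concept (hypothetical helper).

    cursor="*" requests the first page when cursor paging is used;
    later pages would pass the cursor value returned by the previous response.
    """
    params = {
        "filter": f"concepts.id:{concept_short_id(concept_url)}",
        "per-page": per_page,
        "cursor": cursor,
        "mailto": email,
    }
    return f"{OPENALEX_WORKS_ENDPOINT}?{urlencode(params)}"


# Example with the Computer Science entry from DOMAINS
print(build_works_url("https://openalex.org/C41008148", "your.email@example.com"))
```

The sketch only builds the URL; fetching, retrying, and stopping at the per-domain target count are left to the notebook.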
All scripts have hardcoded paths that you MUST change to match your setup:
# Change these paths to match your data location:
articles_file = "data/cleaned data/processed_articles.json" # Update path
concepts_file = "data/cleaned data/processed_concepts.json" # Update path
output_dir = "data/embeddings/" # Update path

# Update these paths:
articles_file = "data/cleaned data/processed_articles.json"
concepts_file = "data/cleaned data/processed_concepts.json"

# Update these paths:
articles_file = "data/cleaned data/processed_articles.json"
concepts_file = "data/cleaned data/processed_concepts.json"
output_file = "data/fake_user_logs.csv"

# Update paths in the Flask app:
embeddings_file = "data/embeddings/embeddings_articles.csv"
concepts_embeddings_file = "data/embeddings/embeddings_concepts.csv"

# Generate SciBERT embeddings for your articles
python existing_scripts/generate_embeddings.py
Note: This process can take several hours depending on your dataset size and hardware.
# Import articles and concepts to Neo4j
python existing_scripts/import_data_to_neo4j.py
# Load embeddings to Neo4j
python existing_scripts/load_embeddings_to_neo4j.py
# Generate fake user data for testing
python existing_scripts/fake_user_generator.py
# Create user profiles in Neo4j
python existing_scripts/populate_user_profiles_neo4j.py
cd Website
python app.py
Navigate to http://localhost:5050
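The data-preparation commands above have to run in a fixed order, so they lend themselves to a small runner script. The following is a sketch only: `PIPELINE_STEPS`, `build_commands`, and `run_pipeline` are hypothetical names, and it assumes the scripts are launched from the repository root with the same Python interpreter.

```python
import subprocess
import sys

# Script order taken from the setup steps in this README
PIPELINE_STEPS = [
    "existing_scripts/generate_embeddings.py",
    "existing_scripts/import_data_to_neo4j.py",
    "existing_scripts/load_embeddings_to_neo4j.py",
    "existing_scripts/fake_user_generator.py",
    "existing_scripts/populate_user_profiles_neo4j.py",
]


def build_commands(steps=PIPELINE_STEPS, python=sys.executable):
    """Return one argv list per pipeline step."""
    return [[python, step] for step in steps]


def run_pipeline(commands, runner=subprocess.run):
    """Run each command in order; stop at the first failure.

    With the default runner, check=True raises CalledProcessError
    when a script exits with a non-zero status.
    """
    for cmd in commands:
        print("Running:", " ".join(cmd))
        runner(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would execute all five steps; passing a custom `runner` makes a dry run possible without touching Neo4j.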
Scientific-Article-Recommender/
├── Website/                             # Flask web application
│   ├── app.py                           # Main Flask app
│   ├── backend/reco.py                  # Recommendation engine
│   └── templates/                       # HTML templates
├── existing_scripts/                    # Data processing scripts
│   ├── openalec-fetcher.ipynb           # Fetch data from OpenAlex
│   ├── generate_embeddings.py           # Generate SciBERT embeddings
│   ├── import_data_to_neo4j.py          # Import to Neo4j
│   ├── load_embeddings_to_neo4j.py      # Load embeddings
│   ├── fake_user_generator.py           # Generate test users
│   └── populate_user_profiles_neo4j.py  # Setup user profiles
├── data/                                # YOUR DATA GOES HERE
│   ├── fetched data/                    # Raw OpenAlex data
│   ├── cleaned data/                    # Processed articles/concepts
│   └── embeddings/                      # Generated embeddings
├── requirements.txt                     # Python dependencies
├── .env.example                         # Environment template
└── README.md                            # This file
Edit openalec-fetcher.ipynb to fetch different scientific domains:
DOMAINS = [
    ("Your Domain", "OpenAlex_Concept_ID", target_count),
    ("Biology", "https://openalex.org/C86803240", 500),
    ("Medicine", "https://openalex.org/C71924100", 300),
]
Change the SciBERT model in generate_embeddings.py:
self.tokenizer = AutoTokenizer.from_pretrained("your-preferred-model")
self.model = AutoModel.from_pretrained("your-preferred-model")
Modify the hybrid recommendation weights in Website/backend/reco.py:
content_weight = 0.6   # Content-based filtering weight
ontology_weight = 0.4  # Ontology-based weight
- Navigate to the main page
- Select "Topic Search"
- Enter: "Neural Networks", "Machine Learning", etc.
- Get semantically similar papers
- Select "Personalized" mode
- Choose a user profile (generated by fake_user_generator.py)
- Enter a search query
- Receive recommendations based on user history
- Click "Explore Ontology" to browse concept hierarchies
- Search for concepts and explore relationships
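One plausible way for a "Personalized" mode to use a user's history is to blend the query embedding with the average embedding of papers the user has read. The sketch below illustrates that idea under stated assumptions; it is not taken from Website/backend/reco.py, and `mean_vector`, `personalized_query`, and the `history_weight` parameter are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def personalized_query(query_vec, history_vecs, history_weight=0.3):
    """Blend the query embedding with the user's profile embedding.

    The profile is the mean of the embeddings in the user's history.
    With no history, the query is returned unchanged.
    """
    if not history_vecs:
        return list(query_vec)
    profile = mean_vector(history_vecs)
    return [
        (1 - history_weight) * q + history_weight * p
        for q, p in zip(query_vec, profile)
    ]
```

The blended vector can then be compared against paper embeddings exactly like a plain topic query, which keeps the two modes on one code path.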
"No module named 'neo4j'"
pip install neo4j
"Connection refused" to Neo4j
- Make sure Neo4j is running
- Check your .env file credentials
- Verify the URI (usually bolt://localhost:7687)
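To work through the checks above, a small script can read the same variables the .env file defines and flag obvious mistakes before the app starts. This is a sketch under assumptions: `Settings`, `load_settings`, and `check_settings` are hypothetical names, and it reads from the process environment, so the .env file must already be loaded (for example with python-dotenv's `load_dotenv()`, if the project uses it).

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    flask_debug: bool
    flask_port: int


def load_settings(env=None):
    """Read settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_password=env.get("NEO4J_PASSWORD", ""),
        flask_debug=env.get("FLASK_DEBUG", "False").strip().lower() in ("1", "true", "yes"),
        flask_port=int(env.get("FLASK_PORT", "5050")),
    )


def check_settings(settings):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    schemes = ("bolt://", "bolt+s://", "bolt+ssc://", "neo4j://", "neo4j+s://", "neo4j+ssc://")
    if not settings.neo4j_uri.startswith(schemes):
        problems.append(f"NEO4J_URI has an unexpected scheme: {settings.neo4j_uri}")
    if not settings.neo4j_password:
        problems.append("NEO4J_PASSWORD is empty")
    return problems

# With the neo4j package installed, connectivity itself could be verified with:
#   from neo4j import GraphDatabase
#   GraphDatabase.driver(uri, auth=(user, password)).verify_connectivity()
```

An empty list from `check_settings(load_settings())` only means the values look well formed; a running database is still required.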
"File not found" errors
- Update all file paths in the scripts to match your setup
- Make sure you've run the data fetching steps
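A quick way to narrow down "File not found" errors is to check which expected data files are missing before running any script. The sketch below assumes the default relative paths quoted earlier in this README; `EXPECTED_FILES` and `missing_files` are hypothetical names, and the list should be adjusted if you changed the paths.

```python
from pathlib import Path

# Default locations quoted in this README; adjust to match your setup
EXPECTED_FILES = [
    "data/cleaned data/processed_articles.json",
    "data/cleaned data/processed_concepts.json",
    "data/embeddings/embeddings_articles.csv",
    "data/embeddings/embeddings_concepts.csv",
]


def missing_files(project_root, expected=EXPECTED_FILES):
    """Return the expected paths that do not exist as files under project_root."""
    root = Path(project_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    for rel in missing_files("."):
        print("Missing:", rel)
```

Run from the repository root, this prints one line per missing file and nothing when everything is in place.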
SciBERT model download fails
# Install transformers properly
pip install transformers torch
Embedding generation is slow
- Use a GPU if available (install torch with CUDA)
- Reduce batch size in generate_embeddings.py
- Process smaller datasets first
- Use GPU: Install PyTorch with CUDA for faster embedding generation
- Increase RAM: 8GB+ recommended for large datasets
- SSD Storage: Faster I/O for large data processing
- Batch Processing: Adjust batch sizes based on your hardware
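Batch size is the main knob for trading memory against speed during embedding generation. The following sketch shows the general pattern only; it is not the code in generate_embeddings.py, and `batched`, `embed_in_batches`, and the `embed_fn` callback (any function that maps a list of texts to a list of vectors) are hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(texts, embed_fn, batch_size=16):
    """Embed texts batch by batch so only one batch is processed at a time.

    `embed_fn` stands in for the real model call and must return
    one vector per input text, in the same order.
    """
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors
```

Lowering `batch_size` reduces peak memory at the cost of more model calls, which is usually the first thing to try when generation runs out of RAM or GPU memory.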
- Website: https://openalex.org/
- API Docs: https://docs.openalex.org/
- Rate Limits: Be respectful and include your email in API requests so OpenAlex can identify you
- Data License: CC0 (public domain)
- Computer Science
- Artificial Intelligence
- Physics
- Biology
- Medicine
- Mathematics
- Engineering
- And many more...
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Update file paths in your modifications
- Test with your own data
- Commit changes (git commit -m 'Add AmazingFeature')
- Push to branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- OpenAlex for providing open scientific data
- SciBERT for scientific text embeddings
- Neo4j for graph database technology
- Flask for the web framework
Developer 1: anVSS1
Email: [email protected]
LinkedIn: LinkedIn Profile
GitHub: @anVSS1
Developer 2: KAN
LinkedIn: LinkedIn Profile
Project Link: https://github.com/anVSS1/Scientific-Article-Recommender
- No Data Included: This repository contains only code. You must fetch and generate your own data.
- Update All Paths: Every script has hardcoded file paths that need to be updated for your system.
- OpenAlex Email Required: You must add your email to the OpenAlex fetcher script.
- Hardware Requirements: Embedding generation requires significant computational resources.
- Neo4j Setup: You must install and configure Neo4j before running the application.
⭐ Star this repo if you find it helpful!