🔬 Scientific Article Recommender

A hybrid recommendation system for scientific papers that combines semantic embeddings, knowledge graphs, and user behavior analysis to provide personalized research paper recommendations.


🌟 Features

  • Hybrid Recommendation Engine: Combines content-based filtering using SciBERT embeddings with collaborative filtering
  • Knowledge Graph Integration: Uses Neo4j to store and query relationships between papers, concepts, and authors
  • Semantic Search: Leverages SciBERT embeddings for semantic similarity between research papers
  • Ontology-based Recommendations: Integrates scientific concept hierarchies for better recommendations
  • Modern Web Interface: Clean, responsive UI built with Flask and TailwindCSS
  • Real-time Recommendations: Instant paper suggestions based on topics and user preferences
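The semantic search feature rests on comparing embedding vectors. A minimal sketch of that idea, assuming cosine similarity as the measure (the README does not say which metric the project uses) and using toy 3-dimensional vectors in place of 768-dimensional SciBERT embeddings. The function names and paper IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, paper_vecs, top_k=3):
    """Return (paper_id, score) pairs, best match first."""
    scored = [(pid, cosine_similarity(query_vec, vec))
              for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real SciBERT embeddings (hypothetical data).
papers = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
top = rank_by_similarity([1.0, 0.0, 0.0], papers, top_k=2)
```

Here `top` lists `paper_a` first and `paper_b` second, because their vectors point in nearly the same direction as the query.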

🏗️ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Flask Web App │────│  Recommendation │────│   Neo4j Graph   │
│                 │    │     Engine      │    │    Database     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         │                       │                       │
    ┌─────────┐          ┌──────────────┐        ┌──────────────┐
    │   UI    │          │   SciBERT    │        │   OpenAlex   │
    │Templates│          │  Embeddings  │        │     API      │
    └─────────┘          └──────────────┘        └──────────────┘

🚀 Complete Setup Guide

Prerequisites

  • Python 3.8+
  • Neo4j Database (Desktop or Cloud)
  • Git
  • 8GB+ RAM (for SciBERT embedding generation)

1. Clone and Setup Repository

git clone https://github.com/anVSS1/Scientific-Article-Recommender.git
cd Scientific-Article-Recommender

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

2. Setup Neo4j Database

  1. Download Neo4j Desktop: https://neo4j.com/download/
  2. Create a new database
  3. Set a password (you will need it as NEO4J_PASSWORD in the next step)
  4. Start the database

3. Configure Environment Variables

# Copy environment template
cp .env.example .env

# Edit .env file with your settings:
NEO4J_URI=bolt://localhost:7687
NEO4J_USER=neo4j
NEO4J_PASSWORD=your_actual_password
FLASK_DEBUG=True
FLASK_PORT=5050
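
For reference, a minimal sketch of how a Python process could read these settings. It assumes the variables are already present in the process environment; how Website/app.py actually loads the .env file is not shown in this README. The `load_settings` helper is hypothetical, and the fallback values simply mirror the example above.

```python
import os


def load_settings(env=None):
    """Read connection and server settings, falling back to the example values."""
    env = os.environ if env is None else env
    debug_flag = env.get("FLASK_DEBUG", "False").strip().lower()
    return {
        "neo4j_uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "neo4j_user": env.get("NEO4J_USER", "neo4j"),
        "neo4j_password": env.get("NEO4J_PASSWORD", ""),
        # Environment values are strings, so convert them explicitly.
        "flask_debug": debug_flag in ("1", "true", "yes"),
        "flask_port": int(env.get("FLASK_PORT", "5050")),
    }


# A plain dict is passed here so the example does not depend on the real environment.
settings = load_settings({"NEO4J_PASSWORD": "secret", "FLASK_DEBUG": "True"})
```

Passing a dict instead of `os.environ` keeps the helper easy to test; calling `load_settings()` with no argument reads the real environment.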

📊 Data Setup (IMPORTANT!)

⚠️ This repository contains NO data files - you need to fetch and generate your own data.

Step 1: Fetch Scientific Articles

  1. Edit the notebook: existing_scripts/openalec-fetcher.ipynb
  2. Add your email (required by OpenAlex API):
    EMAIL_FOR_OPENALEX = "[email protected]"  # Replace this!
  3. Configure domains you want to fetch:
    DOMAINS = [
        ("Computer Science", "https://openalex.org/C41008148", 1400),
        ("Artificial Intelligence", "https://openalex.org/C154945302", 400),
        ("Physics", "https://openalex.org/C121332964", 200),
        # Add more domains as needed
    ]
  4. Run the notebook to fetch articles and concepts
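
To illustrate what a fetch for one domain might look like, the sketch below builds an OpenAlex works URL from a concept URL and an email address. The notebook's actual request code is not shown in this README, so the `build_works_url` helper, the page size, and the example email are assumptions. The query parameters themselves (`filter`, `per-page`, `cursor`, `mailto`) are part of the public OpenAlex API.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_url(concept_url, email, per_page=200, cursor="*"):
    """Build a works query for one concept, identified by its OpenAlex URL."""
    # "https://openalex.org/C41008148" -> "C41008148"
    concept_id = concept_url.rstrip("/").rsplit("/", 1)[-1]
    params = {
        "filter": f"concepts.id:{concept_id}",
        "per-page": per_page,
        "cursor": cursor,   # "*" asks for the first page of cursor paging
        "mailto": email,    # identifies the caller to OpenAlex
    }
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


# Hypothetical email; use your own address.
url = build_works_url("https://openalex.org/C41008148", "you@example.com")
```

The sketch only builds the URL and sends no request, so it can be run without network access.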

Step 2: Update File Paths in Scripts

All scripts have hardcoded paths that you MUST change to match your setup:

existing_scripts/generate_embeddings.py

# Change these paths to match your data location:
articles_file = "data/cleaned data/processed_articles.json"  # Update path
concepts_file = "data/cleaned data/processed_concepts.json"  # Update path
output_dir = "data/embeddings/"  # Update path

existing_scripts/import_data_to_neo4j.py

# Update these paths:
articles_file = "data/cleaned data/processed_articles.json"
concepts_file = "data/cleaned data/processed_concepts.json"

existing_scripts/fake_user_generator.py

# Update these paths:
articles_file = "data/cleaned data/processed_articles.json"
concepts_file = "data/cleaned data/processed_concepts.json"
output_file = "data/fake_user_logs.csv"

Website/app.py

# Update paths in the Flask app:
embeddings_file = "data/embeddings/embeddings_articles.csv"
concepts_embeddings_file = "data/embeddings/embeddings_concepts.csv"
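
Before running the scripts, it can help to confirm that the files they expect actually exist. A minimal sketch, assuming the relative paths listed above are resolved against the project root; the `missing_files` helper is hypothetical and not part of the repository.

```python
from pathlib import Path

# Paths copied from the script snippets above, relative to the project root.
REQUIRED_FILES = [
    "data/cleaned data/processed_articles.json",
    "data/cleaned data/processed_concepts.json",
    "data/embeddings/embeddings_articles.csv",
    "data/embeddings/embeddings_concepts.csv",
]


def missing_files(project_root, relative_paths=REQUIRED_FILES):
    """Return the relative paths that do not exist as files under project_root."""
    root = Path(project_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]
```

Calling `missing_files(".")` from the repository root returns an empty list once everything is in place. The two embeddings files are expected to be missing until Step 3 has been run.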

Step 3: Generate Embeddings

# Generate SciBERT embeddings for your articles
python existing_scripts/generate_embeddings.py

Note: This process can take several hours depending on your dataset size and hardware.

Step 4: Import Data to Neo4j

# Import articles and concepts to Neo4j
python existing_scripts/import_data_to_neo4j.py

# Load embeddings to Neo4j
python existing_scripts/load_embeddings_to_neo4j.py

# Generate fake user data for testing
python existing_scripts/fake_user_generator.py

# Create user profiles in Neo4j
python existing_scripts/populate_user_profiles_neo4j.py
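
Each of these scripts presumably depends on the one before it, so it is worth stopping at the first failure. A minimal sketch of a runner that does this, assuming each script signals failure through a non-zero exit code; the `run_in_order` helper is hypothetical and not part of the repository.

```python
import subprocess
import sys

# Script order taken from Step 4 above.
STEP_4_SCRIPTS = [
    "existing_scripts/import_data_to_neo4j.py",
    "existing_scripts/load_embeddings_to_neo4j.py",
    "existing_scripts/fake_user_generator.py",
    "existing_scripts/populate_user_profiles_neo4j.py",
]


def run_in_order(scripts, runner=subprocess.run):
    """Run each script with the current interpreter and stop at the first failure.

    Returns (completed_scripts, failed_script), where failed_script is None
    if every script exited with code 0.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            return completed, script
        completed.append(script)
    return completed, None
```

Calling `run_in_order(STEP_4_SCRIPTS)` from the repository root runs the four commands in sequence. The `runner` parameter exists only so the function can be exercised without launching real processes.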

Step 5: Run the Application

cd Website
python app.py

Navigate to http://localhost:5050

📁 Project Structure

Scientific-Article-Recommender/
├── Website/                            # Flask web application
│   ├── app.py                          # Main Flask app
│   ├── backend/reco.py                 # Recommendation engine
│   └── templates/                      # HTML templates
├── existing_scripts/                   # Data processing scripts
│   ├── openalec-fetcher.ipynb          # Fetch data from OpenAlex
│   ├── generate_embeddings.py          # Generate SciBERT embeddings
│   ├── import_data_to_neo4j.py         # Import to Neo4j
│   ├── load_embeddings_to_neo4j.py     # Load embeddings
│   ├── fake_user_generator.py          # Generate test users
│   └── populate_user_profiles_neo4j.py # Setup user profiles
├── data/                               # YOUR DATA GOES HERE
│   ├── fetched data/                   # Raw OpenAlex data
│   ├── cleaned data/                   # Processed articles/concepts
│   └── embeddings/                     # Generated embeddings
├── requirements.txt                    # Python dependencies
├── .env.example                        # Environment template
└── README.md                           # This file

🔧 Customization Options

Change Research Domains

Edit openalec-fetcher.ipynb to fetch different scientific domains:

DOMAINS = [
    ("Your Domain", "OpenAlex_Concept_ID", target_count),
    ("Biology", "https://openalex.org/C86803240", 500),
    ("Medicine", "https://openalex.org/C71924100", 300),
]

Modify Embedding Model

Change the SciBERT model in generate_embeddings.py:

self.tokenizer = AutoTokenizer.from_pretrained("your-preferred-model")
self.model = AutoModel.from_pretrained("your-preferred-model")

Adjust Recommendation Weights

Modify the hybrid recommendation weights in Website/backend/reco.py:

content_weight = 0.6  # Content-based filtering weight
ontology_weight = 0.4  # Ontology-based weight
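
How Website/backend/reco.py combines the two signals is not shown in this README. One common approach, sketched here as an assumption, is a normalized weighted sum of the two scores; the `hybrid_score` function is hypothetical.

```python
def hybrid_score(content_score, ontology_score,
                 content_weight=0.6, ontology_weight=0.4):
    """Blend two scores with a weighted average.

    Dividing by the weight total keeps the result on the same scale as the
    inputs even if the weights are changed and no longer sum to 1.
    """
    total = content_weight + ontology_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return (content_weight * content_score
            + ontology_weight * ontology_score) / total


# A paper that matches well on content but poorly on ontology.
score = hybrid_score(content_score=0.9, ontology_score=0.2)
```

Raising `content_weight` favors papers whose text is close to the query; raising `ontology_weight` favors papers that share concepts with it.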

🧪 Usage Examples

Topic-based Search

  1. Navigate to the main page
  2. Select "Topic Search"
  3. Enter a topic such as "Neural Networks" or "Machine Learning"
  4. Get semantically similar papers

Personalized Recommendations

  1. Select "Personalized" mode
  2. Choose a user profile (generated by fake_user_generator.py)
  3. Enter a search query
  4. Receive recommendations based on user history
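
One simple way to turn user history into recommendations, sketched below as an assumption rather than a description of the project's engine, is to average the embeddings of the papers a user has interacted with and rank unseen papers against that profile vector. All names and vectors here are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def recommend_for_user(history_ids, embeddings, top_k=5):
    """Rank papers the user has not seen by dot product with the user profile."""
    profile = mean_vector([embeddings[pid] for pid in history_ids])
    seen = set(history_ids)
    scored = [
        (pid, sum(p * v for p, v in zip(profile, vec)))
        for pid, vec in embeddings.items()
        if pid not in seen
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional embeddings (hypothetical data).
embeddings = {
    "p1": [1.0, 0.0],
    "p2": [0.8, 0.2],
    "p3": [0.0, 1.0],
    "p4": [0.9, 0.1],
}
recommendations = recommend_for_user(["p1", "p2"], embeddings, top_k=2)
```

With this history the profile leans toward the first dimension, so `p4` ranks ahead of `p3`.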

Ontology Explorer

  • Click "Explore Ontology" to browse concept hierarchies
  • Search for concepts and explore relationships

🔍 Troubleshooting

Common Issues

"No module named 'neo4j'"

pip install neo4j

"Connection refused" to Neo4j

  • Make sure Neo4j is running
  • Check your .env file credentials
  • Verify the URI (usually bolt://localhost:7687)
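
A quick way to separate "Neo4j is not running" from "the credentials are wrong" is to test whether the Bolt port accepts TCP connections at all. A minimal sketch using only the standard library; both helper functions are hypothetical, and a successful connection only shows that something is listening on the port, not that it is Neo4j.

```python
import socket
from urllib.parse import urlsplit


def parse_bolt_uri(uri, default_port=7687):
    """Split a URI such as bolt://localhost:7687 into (host, port)."""
    parts = urlsplit(uri)
    return parts.hostname or "localhost", parts.port or default_port


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


host, port = parse_bolt_uri("bolt://localhost:7687")
```

If `can_connect(host, port)` returns False, start the database or fix the URI before looking at the username and password.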

"File not found" errors

  • Update all file paths in the scripts to match your setup
  • Make sure you've run the data fetching steps

SciBERT model download fails

# Install transformers properly
pip install --upgrade transformers torch

Embedding generation is slow

  • Use GPU if available (install torch with CUDA)
  • Reduce batch size in generate_embeddings.py
  • Process smaller datasets first
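
Batch size controls how many texts are encoded at once, which is the main lever for trading speed against memory. The batching code in generate_embeddings.py is not shown in this README; the sketch below only illustrates the general pattern, and the `batched` helper is hypothetical.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


abstracts = ["text 1", "text 2", "text 3", "text 4", "text 5"]
batches = list(batched(abstracts, batch_size=2))
```

Five texts with a batch size of 2 produce three batches, the last holding a single text. A smaller batch size lowers peak memory use at the cost of more model calls.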

Performance Tips

  • Use GPU: Install PyTorch with CUDA for faster embedding generation
  • Increase RAM: 8GB+ recommended for large datasets
  • SSD Storage: Faster I/O for large data processing
  • Batch Processing: Adjust batch sizes based on your hardware

📊 Data Sources & APIs

OpenAlex Integration

Scientific Domains Available

  • Computer Science
  • Artificial Intelligence
  • Physics
  • Biology
  • Medicine
  • Mathematics
  • Engineering
  • And many more...

🀝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Update any hardcoded file paths that your changes touch
  4. Test with your own data
  5. Commit changes (git commit -m 'Add AmazingFeature')
  6. Push to branch (git push origin feature/AmazingFeature)
  7. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAlex for providing open scientific data
  • SciBERT for scientific text embeddings
  • Neo4j for graph database technology
  • Flask for the web framework

📧 Contact

Developer 1: anVSS1
Email: [email protected]
LinkedIn: LinkedIn Profile
GitHub: @anVSS1

Developer 2: KAN
LinkedIn: LinkedIn Profile

Project Link: https://github.com/anVSS1/Scientific-Article-Recommender


⚠️ IMPORTANT NOTES

  1. No Data Included: This repository contains only code. You must fetch and generate your own data.

  2. Update All Paths: Every script has hardcoded file paths that need to be updated for your system.

  3. OpenAlex Email Required: You must add your email to the OpenAlex fetcher script.

  4. Hardware Requirements: Embedding generation requires significant computational resources.

  5. Neo4j Setup: You must install and configure Neo4j before running the application.

⭐ Star this repo if you find it helpful!
