rzhangbq/paperswithcode-rebuilt

Papers with Code Rebuilt

⚠️ IMPORTANT WARNING ⚠️

This application is rebuilt from the discontinued Papers with Code website. The data has not been updated since the website shut down and should be treated as a historical snapshot, not as current research information.

What this means:

  • Data is frozen as of when Papers with Code was discontinued
  • No new papers, methods, or datasets are being added
  • Performance metrics and leaderboards are not current
  • Use this for historical research and reference purposes only

This web application provides an interface for exploring academic papers, code repositories, datasets, methods, and leaderboards from the Papers with Code platform. It rebuilds the core functionality of Papers with Code with a focus on performance, user experience, and modern web technologies.

🎯 Purpose

This application serves as a research tool for:

  • Researchers looking for papers in their field, up to the snapshot date
  • Developers seeking code implementations of research papers
  • Students exploring datasets and methods for their projects
  • Anyone interested in browsing AI/ML research as recorded by Papers with Code

🚀 Features

  • 📄 Papers Browser: Search and browse academic papers with abstracts
  • 💻 Code Repositories: Find official and community code implementations
  • 🏆 Leaderboards: View performance evaluations across different datasets and tasks
  • 📊 Datasets: Explore datasets used in research
  • 🔬 Methods: Discover research methods and approaches
  • 🔍 Advanced Search: Search across papers, titles, and abstracts
  • 📱 Responsive Design: Works on desktop and mobile devices

🛠️ Technology Stack

Frontend

  • React 18 with TypeScript
  • Vite for fast development and building
  • Tailwind CSS for styling
  • React Query for data fetching and caching
  • React Router for navigation
  • Lucide React for icons

Backend

  • Node.js with Express.js
  • SQLite for data storage and efficient querying
  • Database-driven architecture for fast data access

Data Sources

  • Papers with Code API data (papers, code links, evaluations, methods, datasets)
  • SQLite database with optimized schema and indexes

📦 Installation

Prerequisites

  • Node.js (v16 or higher)
  • npm or yarn
  • Python 3.7+ (for database building and cleaning)
  • wget (for downloading data files)
  • gunzip (for extracting compressed files)

Setup Instructions

  1. Clone the repository

    git clone <repository-url>
    cd paperswithcode-rebuilt
  2. Install dependencies

    npm install
  3. Build the database (if not already built)

    cd data
    python build_database.py

    This will:

    • Create the SQLite database from JSON files
    • Set up optimized schema with indexes
    • Import all data efficiently
  4. Clean the database (recommended for optimal performance)

    python clean_methods_database.py

    This will:

    • Remove spam entries and irrelevant content
    • Clean up customer service spam, phone numbers, and commercial content
    • Ensure only legitimate academic content remains
  5. Start the development server

    npm run dev:full

    This starts both the backend server (port 3001) and frontend development server (port 5173)
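
Before starting the servers, it can help to confirm that the build step produced a usable database. The following is a minimal sketch and not part of the repository: the helper name `missing_tables` is hypothetical, and the expected table names are the core tables listed in the Database Schema section of this README.

```python
import sqlite3
from typing import Set

# Core tables named in this README's Database Schema section.
EXPECTED_TABLES = {
    "papers", "authors", "tasks", "methods",
    "datasets", "evaluations", "code_links",
}

def missing_tables(db_path: str) -> Set[str]:
    """Return the expected tables that are absent from the database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        ).fetchall()
    finally:
        conn.close()
    return EXPECTED_TABLES - {name for (name,) in rows}
```

Calling `missing_tables("data/papers_with_code.db")` from the repository root should return an empty set if the build completed. Note that `sqlite3.connect` creates an empty file when the path does not exist, in which case every expected table is reported as missing.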

🏃‍♂️ Running the Application

Development Mode

# Start both frontend and backend
npm run dev:full

# Or start them separately
npm run server    # Backend only (port 3001)
npm run dev       # Frontend only (port 5173)

Production Build

# Build the frontend
npm run build

# Start the production server
npm run server

🏗️ Project Structure

paperswithcode-rebuilt/
├── src/                          # Frontend source code
│   ├── components/               # React components
│   │   ├── Header.tsx           # Navigation and search header
│   │   ├── PaperCard.tsx        # Individual paper display
│   │   ├── DatasetCard.tsx      # Dataset information display
│   │   ├── MethodCard.tsx       # Method information display
│   │   ├── LeaderboardTable.tsx # Performance leaderboard
│   │   ├── LeaderboardChart.tsx # Chart visualization for leaderboards
│   │   ├── PerformanceChart.tsx # Performance metrics visualization
│   │   ├── ContentRenderer.tsx  # Content rendering utilities
│   │   ├── MathRenderer.tsx     # Mathematical expression rendering
│   │   └── LoadingSpinner.tsx   # Loading indicator
│   ├── hooks/                   # Custom React hooks
│   │   └── useData.ts          # Data fetching and caching hooks
│   ├── services/                # API service functions
│   ├── types/                   # TypeScript type definitions
│   ├── pages/                   # Page components
│   │   ├── PapersPage.tsx      # Papers listing page
│   │   ├── DatasetsPage.tsx    # Datasets listing page
│   │   ├── MethodsPage.tsx     # Methods listing page
│   │   └── LeaderboardsPage.tsx # Leaderboards page
│   ├── utils/                   # Utility functions
│   │   └── dateUtils.ts        # Date handling utilities
│   ├── App.tsx                  # Main application component
│   └── main.tsx                 # Application entry point
├── data/                        # Data files and database
│   ├── papers_with_code.db     # SQLite database with all data
│   ├── build_database.py       # Database builder script
│   ├── clean_methods_database.py # Database cleaning script
│   ├── clean_dataset_database.py # Dataset cleaning script
│   └── README.md               # Database documentation
├── server.js                   # Express.js backend server
└── package.json                # Project dependencies and scripts

🔧 Components Overview

Frontend Components

  • Header: Navigation bar with search functionality and tab switching
  • PaperCard: Displays paper information including title, authors, abstract, and code links
  • DatasetCard: Shows dataset details and usage statistics
  • MethodCard: Presents research methods and their applications
  • LeaderboardTable: Displays performance rankings for different tasks and datasets
  • LeaderboardChart: Interactive charts for visualizing performance data
  • PerformanceChart: Performance metrics visualization
  • ContentRenderer: Utilities for rendering various content types
  • MathRenderer: Mathematical expression rendering with LaTeX support
  • LoadingSpinner: Visual feedback during data loading

Backend Services

  • Database API: Fast SQLite-based data access with optimized queries
  • Search API: Provides fast search across papers and abstracts using database indexes
  • Pagination Service: Efficient pagination for large datasets
  • Leaderboard Service: Performance rankings and evaluations computed from the stored snapshot
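
To illustrate what a leaderboard query could look like, here is a minimal sketch using Python's built-in `sqlite3` module. It is not the project's server code (the backend is Express.js). The flat `evaluations` table with `dataset`, `metric`, `model`, and `value` columns is an assumption made for the example, as is the rule that a higher value ranks first. The rows are made up.

```python
import sqlite3

def leaderboard(conn, dataset, metric, limit=10):
    """Return (rank, model, value) tuples for one dataset and metric, best first.

    Assumes a higher metric value is better; reverse the ordering for
    error-style metrics.
    """
    rows = conn.execute(
        "SELECT model, value FROM evaluations "
        "WHERE dataset = ? AND metric = ? "
        "ORDER BY value DESC LIMIT ?",
        (dataset, metric, limit),
    ).fetchall()
    return [(rank, model, value) for rank, (model, value) in enumerate(rows, start=1)]

# Hypothetical flat table and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE evaluations (dataset TEXT, metric TEXT, model TEXT, value REAL)")
conn.executemany(
    "INSERT INTO evaluations VALUES (?, ?, ?, ?)",
    [
        ("demo-set", "accuracy", "model-a", 91.5),
        ("demo-set", "accuracy", "model-b", 94.0),
        ("demo-set", "f1", "model-a", 88.0),
    ],
)
print(leaderboard(conn, "demo-set", "accuracy"))
```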

📊 Data Sources

The application uses data from Papers with Code:

  • Papers: Academic papers with abstracts and metadata
  • Code Links: Connections between papers and their code implementations
  • Evaluations: Performance metrics and leaderboards
  • Methods: Research methods and approaches
  • Datasets: Dataset information and usage statistics

🗄️ Database Architecture

The application now uses a SQLite database instead of JSON streaming for improved performance:

Benefits:

  • Faster queries with database indexes
  • Reduced memory usage (no need to load large JSON files)
  • Better search performance with SQL LIKE queries
  • Efficient pagination for large datasets
  • Relational data structure with proper foreign keys
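
The effect of an index can be checked with SQLite's `EXPLAIN QUERY PLAN`. The sketch below is illustrative only: the table layout and the index name `idx_papers_title` are assumptions, not the project's actual schema. It also shows a general SQLite behavior worth keeping in mind: a `LIKE` pattern with a leading wildcard cannot use an ordinary index and falls back to a scan.

```python
import sqlite3

# Hypothetical table and index, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT)")
conn.execute("CREATE INDEX idx_papers_title ON papers (title)")

def query_plan(sql, params=()):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]

# Exact match on the indexed column: the plan mentions the index.
exact = query_plan("SELECT * FROM papers WHERE title = ?", ("x",))
# Substring match with a leading wildcard: the plan is a table scan.
substring = query_plan("SELECT * FROM papers WHERE title LIKE ?", ("%x%",))
print(exact)
print(substring)
```

The exact wording of the plan varies between SQLite versions, so the example prints it instead of relying on a fixed string.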

Database Schema:

  • Core tables: papers, authors, tasks, methods, datasets, evaluations, code_links
  • Relationship tables: paper_authors, paper_tasks, paper_methods, evaluation_categories_rel
  • Optimized indexes on frequently queried fields
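
As a rough picture of how a core table, a relationship table, and an index fit together, here is a minimal sketch. The table names `papers`, `authors`, and `paper_authors` come from the list above; every column name, the index name, and the sample rows are assumptions for illustration and may differ from what `build_database.py` creates.

```python
import sqlite3

# Assumed columns; only the table names are taken from the README.
SCHEMA = """
CREATE TABLE papers (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    abstract TEXT
);
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE paper_authors (
    paper_id INTEGER NOT NULL REFERENCES papers(id),
    author_id INTEGER NOT NULL REFERENCES authors(id),
    PRIMARY KEY (paper_id, author_id)
);
CREATE INDEX idx_paper_authors_author ON paper_authors (author_id);
"""

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO papers (id, title) VALUES (1, 'Example paper')")
conn.execute("INSERT INTO authors (id, name) VALUES (1, 'A. Author')")
conn.execute("INSERT INTO paper_authors VALUES (1, 1)")

def papers_by_author(name):
    """Return paper titles for an author by joining through paper_authors."""
    rows = conn.execute(
        "SELECT p.title FROM papers AS p "
        "JOIN paper_authors AS pa ON pa.paper_id = p.id "
        "JOIN authors AS a ON a.id = pa.author_id "
        "WHERE a.name = ? ORDER BY p.id",
        (name,),
    ).fetchall()
    return [title for (title,) in rows]

print(papers_by_author("A. Author"))
```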

Current Database Statistics:

  • Total papers: ~2.4M academic papers with abstracts
  • Methods: 1,940 legitimate research methods (cleaned from 8,725 total)
  • Datasets: Cleaned and validated research datasets
  • Code links: Connections between papers and implementations
  • Evaluations: Performance metrics and leaderboards

Database Cleaning:

  • Comprehensive spam removal from methods database
  • Removed 6,706 spam entries (76.8% of original methods were spam)
  • Cleaned categories: Customer service spam, phone numbers, travel/airline spam, commercial advertising
  • Preserved legitimate content: All academic methods and research content maintained
  • Multi-language support: Handles spam in English and Spanish
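
The kind of heuristic described above can be sketched as follows. This is not the logic of `clean_methods_database.py`: the phrase list, the phone-number pattern, the ten-digit threshold, and the function names are all assumptions chosen for illustration, and a real cleaner would need a broader, reviewed rule set.

```python
import re

# Candidate phone numbers: digit runs that may contain spaces, dots,
# dashes, or parentheses. Hypothetical pattern for illustration.
PHONE_CANDIDATE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

# Hypothetical English and Spanish spam phrases.
SPAM_PHRASES = (
    "customer service",
    "servicio al cliente",
    "helpline",
    "book a flight",
    "reservar vuelo",
)

def has_phone_number(text):
    """True if any candidate contains at least 10 digits.

    The digit threshold keeps short identifiers such as arXiv IDs
    (9 digits) from being flagged.
    """
    return any(
        sum(ch.isdigit() for ch in match.group()) >= 10
        for match in PHONE_CANDIDATE.finditer(text)
    )

def looks_like_spam(name, description=""):
    """Flag a method entry that contains a spam phrase or a phone number."""
    text = (name + " " + description).lower()
    return has_phone_number(text) or any(p in text for p in SPAM_PHRASES)

print(looks_like_spam("Airline customer service", "Call +1-800-555-0199"))
print(looks_like_spam("Residual Connection", "Introduced in arXiv 1512.03385"))
```

A heuristic like this can still misfire, for example on long runs of years or numeric tables, so flagged entries are best reviewed before deletion.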

Migration:

  • The old JSON streaming approach has been replaced with database queries
  • All data is now stored in data/papers_with_code.db
  • Large JSON files have been removed to save disk space (~2.7GB freed)
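
A JSON-to-SQLite import of the kind described here can be sketched as follows. It is a simplified illustration and not the contents of `build_database.py`: the function name `import_papers`, the three-column table, and the `title` and `abstract` field names are assumptions. It also parses the whole document in memory, which would not suit multi-gigabyte dumps; those would need a streaming parser.

```python
import json
import sqlite3

def import_papers(conn, json_text):
    """Insert a JSON array of paper records and return the table's row count."""
    records = json.loads(json_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS papers ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, abstract TEXT)"
    )
    # Using the connection as a context manager commits once at the end,
    # which is much faster than committing per row.
    with conn:
        conn.executemany(
            "INSERT INTO papers (title, abstract) VALUES (?, ?)",
            ((r.get("title", ""), r.get("abstract")) for r in records),
        )
    return conn.execute("SELECT COUNT(*) FROM papers").fetchone()[0]
```

For example, `import_papers(sqlite3.connect(":memory:"), '[{"title": "T1", "abstract": "A1"}]')` inserts one row and returns the resulting row count.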

🔍 Search Functionality

  • Full-text search across paper titles and abstracts
  • Real-time results with debounced input
  • Pagination for large result sets
  • Filtering by different data types
  • Mathematical expression rendering with LaTeX support
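
Search over titles and abstracts combined with pagination can be sketched as below. The example assumes `LIKE`-based matching, as mentioned under Database Architecture, and a `papers` table with `id`, `title`, and `abstract` columns. The function names and sample rows are hypothetical, and the real backend is written in JavaScript, not Python.

```python
import sqlite3

def escape_like(term):
    """Escape LIKE wildcards so user input is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")

def search_papers(conn, term, page=1, page_size=20):
    """Return one page of (id, title) rows whose title or abstract contains term."""
    pattern = "%" + escape_like(term) + "%"
    offset = (page - 1) * page_size
    return conn.execute(
        "SELECT id, title FROM papers "
        "WHERE title LIKE ? ESCAPE '\\' OR abstract LIKE ? ESCAPE '\\' "
        "ORDER BY id LIMIT ? OFFSET ?",
        (pattern, pattern, page_size, offset),
    ).fetchall()

# Made-up rows for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER PRIMARY KEY, title TEXT, abstract TEXT)")
conn.executemany(
    "INSERT INTO papers (id, title, abstract) VALUES (?, ?, ?)",
    [
        (1, "Residual networks", "An example abstract."),
        (2, "Sequence models", "Uses residual connections."),
        (3, "Graph methods", "Unrelated text."),
    ],
)
print(search_papers(conn, "residual", page=1, page_size=1))
print(search_papers(conn, "residual", page=2, page_size=1))
```

Passing the term as a bound parameter avoids SQL injection, and the stable `ORDER BY id` keeps pages from overlapping between requests.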

🎨 UI/UX Features

  • Modern Design: Clean, responsive interface using Tailwind CSS
  • Dark/Light Mode: Automatic theme detection
  • Loading States: Smooth loading indicators
  • Error Handling: Graceful error messages and recovery
  • Mobile Responsive: Optimized for all screen sizes
  • Interactive Charts: Performance visualization with charts
  • Mathematical Rendering: Beautiful LaTeX math expression display

🚀 Performance Optimizations

  • SQLite database with optimized schema and indexes for fast queries
  • React Query caching for API responses
  • Lazy loading of components and data
  • Efficient pagination for large datasets
  • Database-driven search with full-text search capabilities
  • Cleaned database with only legitimate academic content for faster queries

🧹 Database Maintenance

Cleaning Scripts

The project includes automated cleaning scripts to maintain data quality:

  • clean_methods_database.py: Removes spam from methods database

    • Detects customer service spam, phone numbers, travel content
    • Removes commercial advertising and irrelevant content
    • Preserves legitimate research methods
    • Multi-language spam detection (English/Spanish)
  • clean_dataset_database.py: Cleans dataset database

    • Removes invalid homepages and spam entries
    • Ensures dataset quality and relevance
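
One way to detect an invalid homepage is to parse the stored value as a URL and require an HTTP(S) scheme and a dotted host name. The sketch below is an assumption about what such a check might look like, not the actual rule used by `clean_dataset_database.py`. The function name is hypothetical.

```python
from urllib.parse import urlparse

def is_valid_homepage(url):
    """Return True for http(s) URLs whose host name contains a dot."""
    if not url or not url.strip():
        return False
    try:
        parsed = urlparse(url.strip())
        host = parsed.hostname or ""
    except ValueError:
        # urlparse rejects some malformed inputs, e.g. broken IPv6 literals.
        return False
    return parsed.scheme in ("http", "https") and "." in host

print(is_valid_homepage("https://www.cs.toronto.edu/~kriz/cifar.html"))
print(is_valid_homepage("N/A"))
```

This only checks the shape of the value. It does not verify that the page still exists.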

Running Maintenance

cd data
# Clean methods database
python clean_methods_database.py

# Clean datasets database
python clean_dataset_database.py

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow TypeScript best practices
  • Use Tailwind CSS for styling
  • Ensure responsive design for mobile devices
  • Add tests for new functionality
  • Update documentation for new features

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • Papers with Code for providing the data
  • The open-source community for the amazing tools and libraries used in this project
  • Contributors who helped clean and maintain the database quality

📞 Support

If you encounter any issues or have questions:

  1. Check the existing issues in the repository
  2. Create a new issue with detailed information about your problem
  3. Include system information and error messages
  4. For database-related issues, check the data/ directory documentation

🔄 Data Updates

If a newer Papers with Code data dump becomes available, the application can be rebuilt from it:

  1. Download new data from the official sources
  2. Rebuild the database using python build_database.py
  3. Clean the database using the cleaning scripts
  4. Restart the application to use the updated data

Happy researching! 🎓

Papers with code datasets

You can download the full dataset behind paperswithcode.com here:

Download links for the data dumps are:

The last JSON is in the sota-extractor format and the code from there can be used to load in the JSON into a set of Python classes.

At the moment, data is regenerated daily.

Part of the data is coming from the sources listed in the sota-extractor README.

Licence

All data is licenced under CC-BY-SA.

About

Forked from https://github.com/paperswithcode/paperswithcode-data, to rebuild paperswithcode.
