
ScaleLLM: A Technique for Scalable LLM-augmented Data Systems

Large language models (LLMs) offer powerful semantic insights for data analytics, but row-by-row LLM calls quickly become prohibitively expensive on large datasets. ScaleLLM is a novel system that substantially reduces both latency and cost on text classification tasks by coupling LLM-generated labels on a small subset of the data with a lightweight machine learning model for large-scale inference.

This approach provides significant speed-ups of up to 37× while maintaining accuracy close to that of a full LLM baseline, converging to within 1% of its accuracy on several tasks. ScaleLLM also provides cost-accuracy projections, giving users fine-grained control over the trade-off between expense and quality.

Features

  • Efficient Inference: Up to 37× speed-up compared to full LLM baselines
  • Cost Optimization: Significant reduction in API costs while maintaining accuracy
  • Embedding Views: Reusable embedding representations for efficient querying
  • Web Interface: Visual UI for exploring and analyzing results
  • Multiple Datasets: Support for various text classification tasks including:
    • Yelp restaurant reviews classification
    • Yahoo Answers classification
    • Hate speech detection
    • Offensive tweets classification
    • MTOP dataset processing

Prerequisites

  • Python 3.10
  • PostgreSQL with pgvector extension
  • Node.js (for frontend)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd scalellm
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Set up environment variables: Create a .env file in the root directory with your configuration:

    # Database configuration
    DATABASE_URL=postgresql://postgres:postgres@localhost:5432/dev
    
    # OpenAI API key (required for LLM operations)
    OPENAI_API_KEY=your_openai_api_key_here
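If you want to load this file from Python without extra dependencies, a minimal parser along these lines works (the python-dotenv package provides the same behavior via `load_dotenv()`; whether it appears in requirements.txt is an assumption, so this sketch sticks to the standard library):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, existing vars win."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```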

Running Instructions

1. Start PostgreSQL Instance

Option A: Using Docker Compose (Recommended)

docker-compose up -d postgres

This starts PostgreSQL with the pgvector extension on port 5432.

Option B: Local PostgreSQL Installation

  • Install PostgreSQL and the pgvector extension
  • Create a database named dev
  • Ensure the database is accessible at localhost:5432
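Whichever option you choose, it can save debugging time to check that your `DATABASE_URL` parses the way the pipeline expects before running anything. A stdlib-only sketch (the exact checks are illustrative, not part of the repository):

```python
from urllib.parse import urlsplit

def check_database_url(url):
    """Parse a postgresql:// URL and return the pieces the app will connect with."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # default Postgres port when omitted
        "database": parts.path.lstrip("/"),
    }
```

Run it against the value in your `.env` and confirm it yields `localhost`, `5432`, and `dev` for the setup described above.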

2. Run Data Loader Scripts

Load your desired dataset using the available dataloaders:

# Yelp dataset (restaurant reviews)
python dataloaders/load_yelp_dataset.py

# Yahoo Answers classification
python dataloaders/yahoo_answer_classification.py

# Hate speech detection
python dataloaders/hate_speech_dataloader.py

# Offensive tweets classification
python dataloaders/offensive_tweets_dataset.py

# MTOP dataset
python dataloaders/mtop_dataset.py

3. Run Main Application

Execute the main ScaleLLM pipeline:

cd src
python main.py

This will:

  • Install the pgvector extension
  • Set up metadata tables
  • Run the classification pipeline on the loaded data
  • Generate embeddings and perform inference
  • Clean up temporary data

4. Launch Web Applications

Backend API:

cd src/webapp/backend
uvicorn app:app --reload --port 8000

Frontend UI:

cd src/webapp/frontend
npm install
npm run dev

The web interface will be available at http://localhost:5173 (or the port shown in the terminal).

Project Structure

scalellm/
├── src/
│   ├── main.py                 # Main application entry point
│   ├── embeddings.py           # Embedding generation and management
│   ├── generations.py          # LLM generation utilities
│   ├── webapp/                 # Web application
│   │   ├── backend/            # FastAPI backend
│   │   └── frontend/           # React frontend
│   └── models/                 # ML models and utilities
├── dataloaders/                # Dataset loading scripts
├── requirements.txt            # Python dependencies
├── docker-compose.yml          # Docker setup for PostgreSQL
└── README.md                   # This file

Usage Examples

Text Classification

ScaleLLM can classify restaurant reviews by cuisine type, detect hate speech, classify Yahoo Answers, and more. The system automatically:

  1. Generates embeddings for text data
  2. Creates a small labeled subset using LLM calls
  3. Trains a lightweight classifier
  4. Performs efficient inference on the full dataset
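In miniature, the four steps above look like the sketch below. Everything in it is a stand-in: the real system uses OpenAI embeddings and LLM calls (see `src/embeddings.py` and `src/generations.py`), while the byte-histogram "embedding", keyword "LLM", and nearest-centroid classifier here exist only to keep the example self-contained and runnable offline.

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: normalized byte histogram.
    v = np.zeros(16)
    for b in text.encode():
        v[b % 16] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def llm_label(text):
    # Stand-in for an LLM call: a keyword rule, so the sketch runs offline.
    return "positive" if "great" in text else "negative"

def scalellm_classify(texts, llm_budget=4):
    X = np.stack([embed(t) for t in texts])               # 1. embed every row once
    labeled = {i: llm_label(texts[i])                     # 2. spend the LLM budget
               for i in range(min(llm_budget, len(texts)))}  #    on a small subset
    classes = sorted(set(labeled.values()))
    centroids = {c: X[[i for i in labeled if labeled[i] == c]].mean(axis=0)
                 for c in classes}                        # 3. train a cheap model
    return [labeled.get(i) or                             # 4. cheap inference on
            max(classes, key=lambda c: float(X[i] @ centroids[c]))  # the full set
            for i in range(len(texts))]
```

The key property is that only `llm_budget` rows ever touch the LLM; every remaining row is classified by the lightweight model over precomputed embeddings, which is where the speed-up comes from.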

Cost-Accuracy Trade-offs

The system provides projections for different cost-accuracy trade-offs, allowing users to choose the optimal balance for their use case.
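One way such projections can be tabulated is by pairing, for each candidate LLM-label budget, the API spend against accuracy measured on a held-out slice. This sketch assumes that shape of input; the function name, fields, and numbers are illustrative, not the repository's API:

```python
def cost_accuracy_frontier(n_rows, llm_cost_per_row, holdout_accuracy):
    """holdout_accuracy maps labeled-subset size -> accuracy on a held-out slice."""
    return [
        {
            "labeled_rows": k,
            "llm_cost": k * llm_cost_per_row,
            "accuracy": acc,
            "cost_saving_vs_full_llm": 1 - k / n_rows,
        }
        for k, acc in sorted(holdout_accuracy.items())
    ]
```

A user would then pick the smallest budget whose projected accuracy clears their target, rather than paying for LLM calls on all `n_rows`.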

Citation

If you use ScaleLLM in your research, please cite:

@inproceedings{alaparthi2025scalellm,
  title={ScaleLLM: A Technique for Scalable LLM-augmented Data Systems},
  author={Alaparthi, Ashwin and Loh, Paul and Marcus, Ryan},
  booktitle={Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion '25)},
  pages={1--4},
  year={2025},
  organization={ACM},
  doi={10.1145/3722212.3725130}
}

Coming Soon

A second part of this work focusing on constrained LLMs is coming soon! This extension will explore techniques for incorporating domain-specific constraints and business rules into the LLM-augmented data processing pipeline.

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was supported by research grants and computing resources from our institutions. We thank the open-source community for the tools and libraries that made this project possible.
