
ScaleLLM: A Technique for Scalable LLM-augmented Data Systems

Large language models (LLMs) offer powerful semantic insights for data analytics, but row-by-row LLM calls quickly become prohibitively expensive on large datasets. ScaleLLM is a novel system that substantially reduces both latency and cost on text classification tasks by coupling LLM-generated labels on a small subset of the data with a lightweight machine learning model for large-scale inference.

This approach provides significant speed-ups of up to 37× while maintaining accuracy close to that of a full LLM baseline, converging to within 1% of its accuracy on several tasks. ScaleLLM also provides cost-accuracy projections, giving users fine-grained control over the trade-off between expense and quality.

Features

  • Efficient Inference: Up to 37× speed-up compared to full LLM baselines
  • Cost Optimization: Significant reduction in API costs while maintaining accuracy
  • Embedding Views: Reusable embedding representations for efficient querying
  • Web Interface: Visual UI for exploring and analyzing results
  • Multiple Datasets: Support for various text classification tasks including:
    • Yelp restaurant reviews classification
    • Yahoo Answers classification
    • Hate speech detection
    • Offensive tweets classification
    • MTOP dataset processing

Prerequisites

  • Python 3.10
  • PostgreSQL with pgvector extension
  • Node.js (for frontend)

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd scalellm
  2. Install Python dependencies:

    pip install -r requirements.txt
  3. Set up environment variables: Create a .env file in the root directory with your configuration:

    # Database configuration
    DATABASE_URL=postgresql://postgres:postgres@localhost:5432/dev
    
    # OpenAI API key (required for LLM operations)
    OPENAI_API_KEY=your_openai_api_key_here
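If you want to load this file from Python without extra dependencies, a minimal parser along these lines works (the python-dotenv package provides the same behavior via `load_dotenv()`; whether it appears in requirements.txt is an assumption, so this sketch sticks to the standard library):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=VALUE lines, '#' comments, existing vars win."""
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```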

Running Instructions

1. Start PostgreSQL Instance

Option A: Using Docker Compose (Recommended)

docker-compose up -d postgres

This starts PostgreSQL with the pgvector extension on port 5432.

Option B: Local PostgreSQL Installation

  • Install PostgreSQL and the pgvector extension
  • Create a database named dev
  • Ensure the database is accessible at localhost:5432
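Whichever option you choose, it can save debugging time to check that your `DATABASE_URL` parses the way the pipeline expects before running anything. A stdlib-only sketch (the exact checks are illustrative, not part of the repository):

```python
from urllib.parse import urlsplit

def check_database_url(url):
    """Parse a postgresql:// URL and return the pieces the app will connect with."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # default Postgres port when omitted
        "database": parts.path.lstrip("/"),
    }
```

Run it against the value in your `.env` and confirm it yields `localhost`, `5432`, and `dev` for the setup described above.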

2. Run Data Loader Scripts

Load your desired dataset using the available dataloaders:

# Yelp dataset (restaurant reviews)
python dataloaders/load_yelp_dataset.py

# Yahoo Answers classification
python dataloaders/yahoo_answer_classification.py

# Hate speech detection
python dataloaders/hate_speech_dataloader.py

# Offensive tweets classification
python dataloaders/offensive_tweets_dataset.py

# MTOP dataset
python dataloaders/mtop_dataset.py

3. Run Main Application

Execute the main ScaleLLM pipeline:

cd src
python main.py

This will:

  • Install the pgvector extension
  • Set up metadata tables
  • Run the classification pipeline on the loaded data
  • Generate embeddings and perform inference
  • Clean up temporary data

4. Launch Web Applications

Backend API:

cd src/webapp/backend
uvicorn app:app --reload --port 8000

Frontend UI:

cd src/webapp/frontend
npm install
npm run dev

The web interface will be available at http://localhost:5173 (or the port shown in the terminal).

Project Structure

scalellm/
├── src/
│   ├── main.py                 # Main application entry point
│   ├── embeddings.py           # Embedding generation and management
│   ├── generations.py          # LLM generation utilities
│   ├── webapp/                 # Web application
│   │   ├── backend/            # FastAPI backend
│   │   └── frontend/           # React frontend
│   └── models/                 # ML models and utilities
├── dataloaders/                # Dataset loading scripts
├── requirements.txt            # Python dependencies
├── docker-compose.yml          # Docker setup for PostgreSQL
└── README.md                   # This file

Usage Examples

Text Classification

ScaleLLM can classify restaurant reviews by cuisine type, detect hate speech, classify Yahoo Answers, and more. The system automatically:

  1. Generates embeddings for text data
  2. Creates a small labeled subset using LLM calls
  3. Trains a lightweight classifier
  4. Performs efficient inference on the full dataset
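In miniature, the four steps above look like the sketch below. Everything in it is a stand-in: the real system uses OpenAI embeddings and LLM calls (see `src/embeddings.py` and `src/generations.py`), while the byte-histogram "embedding", keyword "LLM", and nearest-centroid classifier here exist only to keep the example self-contained and runnable offline.

```python
import numpy as np

def embed(text):
    # Stand-in for a real embedding model: normalized byte histogram.
    v = np.zeros(16)
    for b in text.encode():
        v[b % 16] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def llm_label(text):
    # Stand-in for an LLM call: a keyword rule, so the sketch runs offline.
    return "positive" if "great" in text else "negative"

def scalellm_classify(texts, llm_budget=4):
    X = np.stack([embed(t) for t in texts])               # 1. embed every row once
    labeled = {i: llm_label(texts[i])                     # 2. spend the LLM budget
               for i in range(min(llm_budget, len(texts)))}  #    on a small subset
    classes = sorted(set(labeled.values()))
    centroids = {c: X[[i for i in labeled if labeled[i] == c]].mean(axis=0)
                 for c in classes}                        # 3. train a cheap model
    return [labeled.get(i) or                             # 4. cheap inference on
            max(classes, key=lambda c: float(X[i] @ centroids[c]))  # the full set
            for i in range(len(texts))]
```

The key property is that only `llm_budget` rows ever touch the LLM; every remaining row is classified by the lightweight model over precomputed embeddings, which is where the speed-up comes from.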

Cost-Accuracy Trade-offs

The system provides projections for different cost-accuracy trade-offs, allowing users to choose the optimal balance for their use case.
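One way such projections can be tabulated is by pairing, for each candidate LLM-label budget, the API spend against accuracy measured on a held-out slice. This sketch assumes that shape of input; the function name, fields, and numbers are illustrative, not the repository's API:

```python
def cost_accuracy_frontier(n_rows, llm_cost_per_row, holdout_accuracy):
    """holdout_accuracy maps labeled-subset size -> accuracy on a held-out slice."""
    return [
        {
            "labeled_rows": k,
            "llm_cost": k * llm_cost_per_row,
            "accuracy": acc,
            "cost_saving_vs_full_llm": 1 - k / n_rows,
        }
        for k, acc in sorted(holdout_accuracy.items())
    ]
```

A user would then pick the smallest budget whose projected accuracy clears their target, rather than paying for LLM calls on all `n_rows`.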

Citation

If you use ScaleLLM in your research, please cite:

@inproceedings{alaparthi2025scalellm,
  title={ScaleLLM: A Technique for Scalable LLM-augmented Data Systems},
  author={Alaparthi, Ashwin and Loh, Paul and Marcus, Ryan},
  booktitle={Companion of the 2025 International Conference on Management of Data (SIGMOD-Companion '25)},
  pages={1--4},
  year={2025},
  organization={ACM},
  doi={10.1145/3722212.3725130}
}

Coming Soon

A second part of this work focusing on constrained LLMs is coming soon! This extension will explore techniques for incorporating domain-specific constraints and business rules into the LLM-augmented data processing pipeline.

Contributing

We welcome contributions! Please feel free to submit issues, feature requests, or pull requests.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

This work was supported by research grants and computing resources from our institutions. We thank the open-source community for the tools and libraries that made this project possible.
