GPU-accelerated data curation for training better AI models, faster. Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.
Part of the NVIDIA NeMo software suite for managing the AI agent lifecycle.
- 2026-04: NeMo Curator 26.04 released with the Cosmos-Xenna 0.2.0 upgrade, a simplified Resources API, and Ray 2.54. See the release notes.
- 2026-02: NeMo Curator 26.02 released with a Ray-based pipeline architecture for all modalities: text, image, video, and audio.
| Modality | Key Capabilities | Get Started |
|---|---|---|
| Text | Deduplication • Classification • Quality Filtering • Language Detection | Text Guide |
| Image | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | Image Guide |
| Video | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | Video Guide |
| Audio | ASR Transcription • Quality Assessment • WER Filtering | Audio Guide |
```bash
# Install for your modality
uv pip install "nemo-curator[text_cuda12]"

# Run the quickstart example
python tutorials/quickstart.py
```

Full setup: Installation Guide • Docker • Tutorials
NeMo Curator uses a modular, Ray-based pipeline architecture. Data flows through composable processing stages — each stage handles a discrete curation task (loading, filtering, deduplication, etc.) and can be configured with independent resource requirements.
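The stage-and-pipeline pattern can be sketched in a few lines of plain Python. This is an illustrative toy, not the actual NeMo Curator API: the `Stage` and `Pipeline` classes, their fields, and the per-stage `num_cpus`/`num_gpus` hints are all hypothetical stand-ins for the real, Ray-backed implementation.

```python
# Illustrative sketch only -- these classes are NOT the NeMo Curator API.
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class Stage:
    """One curation task; fn returns the record, or None to drop it."""
    name: str
    fn: Callable[[dict], Optional[dict]]
    num_cpus: float = 1.0   # per-replica resource request (hypothetical)
    num_gpus: float = 0.0


@dataclass
class Pipeline:
    stages: list = field(default_factory=list)

    def add_stage(self, stage: Stage) -> "Pipeline":
        self.stages.append(stage)
        return self

    def run(self, records: Iterable[dict]) -> Iterator[dict]:
        # A real engine streams batches through concurrent stage replicas;
        # this toy just applies stages in order, one record at a time.
        for record in records:
            for stage in self.stages:
                record = stage.fn(record)
                if record is None:      # record was filtered out
                    break
            else:
                yield record


pipe = (
    Pipeline()
    .add_stage(Stage("clean", lambda r: {**r, "text": r["text"].strip()}))
    .add_stage(Stage("min_len", lambda r: r if len(r["text"]) >= 5 else None))
)
out = list(pipe.run([{"text": "  hello world  "}, {"text": "hi "}]))
```

The key idea the sketch preserves is that each stage is a self-contained, composable unit with its own resource declaration, so an execution engine can scale replicas of each stage independently.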
Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.
| Category | Features | Documentation |
|---|---|---|
| Data Sources | Common Crawl • Wikipedia • ArXiv • Custom datasets | Load Data |
| Quality Filtering | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | Quality Assessment |
| Deduplication | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | Deduplication |
| Processing | Text cleaning • Language identification | Content Processing |
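To make the fuzzy-deduplication row concrete, here is a minimal MinHash sketch in pure Python (LSH banding and the GPU acceleration are omitted for brevity; this is an illustration of the technique, not NeMo Curator's implementation):

```python
# Minimal MinHash fuzzy-duplicate detection (illustrative only).
import hashlib
from itertools import combinations


def shingles(text: str, k: int = 3) -> set:
    """Set of k-word shingles for a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash(sh: set, num_perm: int = 64) -> tuple:
    """MinHash signature: min of a seeded hash over shingles, per permutation."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in sh
        ))
    return tuple(sig)


def jaccard_est(a: tuple, b: tuple) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


docs = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over the lazy cat",   # near-duplicate of doc 0
    "completely unrelated text about gpu pipelines",
]
sigs = [minhash(shingles(d)) for d in docs]
pairs = [(i, j) for i, j in combinations(range(len(docs)), 2)
         if jaccard_est(sigs[i], sigs[j]) > 0.5]
```

In production the signatures are banded into LSH buckets so only candidate pairs are compared, which is what makes the approach tractable at trillion-token scale.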
Curate large-scale image datasets for vision language models (VLMs) and generative AI training.
| Category | Features | Documentation |
|---|---|---|
| Data Loading | WebDataset format • Large-scale image-text pairs | Load Data |
| Embeddings | CLIP embeddings for semantic analysis | Embeddings |
| Filtering | Aesthetic quality scoring • NSFW detection | Filters |
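Aesthetic-style scoring is commonly implemented as a small learned head over image embeddings. The sketch below shows the shape of that filter with NumPy, using random vectors as stand-ins for CLIP embeddings and a random vector as a hypothetical learned linear head; it is not NeMo Curator's scorer.

```python
# Illustrative embedding-based quality filter (stand-in data, not real CLIP).
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 512))            # stand-in CLIP image embeddings
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

head_w = rng.normal(size=512)                      # hypothetical learned linear head
scores = embeddings @ head_w                       # one scalar score per image

threshold = np.median(scores)                      # keep the top half, for the demo
kept_indices = np.flatnonzero(scores >= threshold).tolist()
```

A real pipeline would batch embedding generation on GPU and set the threshold from a calibrated score (e.g. an aesthetic rating scale) rather than the batch median.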
Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).
| Category | Features | Documentation |
|---|---|---|
| Data Loading | Local paths • S3-compatible storage • HTTP(S) URLs | Load Data |
| Clipping | Fixed-stride splitting • Scene-change detection (TransNetV2) | Clipping |
| Processing | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | Processing |
| Embeddings | Cosmos-Embed1 for clip-level embeddings | Embeddings |
| Deduplication | K-means clustering • Pairwise similarity for near-duplicates | Deduplication |
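The cluster-then-compare pattern in the deduplication row can be sketched with a toy k-means over clip embeddings: cluster first, then compute pairwise cosine similarity only within each cluster. This is a small illustration of the idea, not the distributed GPU implementation.

```python
# Toy cluster-then-compare near-duplicate detection (illustrative only).
import numpy as np


def kmeans(x: np.ndarray, k: int = 2, iters: int = 10, seed: int = 0):
    """Plain Lloyd's algorithm; returns a cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), k, replace=False)]
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = x[labels == j].mean(0)
    return labels


# Toy clip embeddings: rows 0 and 1 are near-duplicates.
emb = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
labels = kmeans(emb)

dupes = []
for c in set(labels.tolist()):
    idx = np.flatnonzero(labels == c)
    sims = emb[idx] @ emb[idx].T            # cosine similarity (unit vectors)
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            if sims[a, b] > 0.99:           # near-duplicate threshold
                dupes.append((int(idx[a]), int(idx[b])))
```

Clustering first keeps the pairwise comparison quadratic only within clusters rather than across the whole corpus, which is what makes near-duplicate search feasible on large video datasets.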
Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.
| Category | Features | Documentation |
|---|---|---|
| Data Loading | Local files • Custom manifests • Public datasets (FLEURS) | Load Data |
| ASR Processing | NeMo Framework pretrained models • Automatic transcription | ASR Inference |
| Quality Assessment | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | Quality Assessment |
| Integration | Text curation workflow integration for multimodal pipelines | Text Integration |
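WER-based filtering compares a pretrained ASR model's transcript against a reference and drops samples above an error threshold. A minimal word-level edit-distance WER, with a hypothetical 25% threshold, looks like this (illustrative, not NeMo Curator's implementation):

```python
# Word Error Rate via word-level edit distance, plus threshold filtering.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(1, len(ref))


samples = [
    ("the cat sat", "the cat sat"),       # WER 0.0 -> kept
    ("the cat sat", "a cat sat down"),    # WER 2/3 -> filtered
]
kept = [(r, h) for r, h in samples if wer(r, h) <= 0.25]  # hypothetical cutoff
```

In practice the edit distance also yields substitution/insertion/deletion counts, which are useful for diagnosing whether transcripts fail on vocabulary or on segmentation.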
High-quality training data is the single most important factor in building performant AI models. Raw datasets contain noise, duplicates, low-quality content, and potentially harmful material that degrade model performance and increase training costs.
At scale, data curation is a throughput maximization problem. A typical pipeline chains stages with very different compute profiles — lightweight CPU tokenization, small GPU classifiers, large GPU inference models — and a naive sequential approach leaves most hardware idle most of the time.
Example: Consider a pipeline with language identification (0.5B model, 1 GB VRAM, 2 s/sample), tokenization (CPU-only, 1 s/sample), and a 5B answer model (10 GB VRAM, 10 s/sample) processing 1,000 questions within a 102 GB GPU memory budget:
| Approach | How it works | Total runtime |
|---|---|---|
| Sequential | Process each sample through all stages, one at a time | ~13,000 seconds |
| NeMo Curator | Stream batches, auto-scale replicas per stage, overlap CPU/GPU work | ~1,000 seconds |
NeMo Curator achieves this by streaming data through the pipeline so all stages run concurrently, auto-balancing replicas to match each stage's throughput (2× language ID, 1× tokenizer, 10× answer model), and keeping GPU workers busy over 99% of the time after an initial warm-up period. See the scaling concepts for details.
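The arithmetic behind those two runtime figures is worth making explicit. Sequentially, every sample pays the sum of all stage latencies; with streaming, throughput is set by the slowest stage after replication, and the replica counts above (2x, 1x, 10x) were chosen to equalize per-stage rates within the memory budget:

```python
# Back-of-the-envelope check of the runtime numbers above.
n = 1000                                                  # samples
stage_time = {"lang_id": 2.0, "tokenize": 1.0, "answer": 10.0}   # s/sample
replicas   = {"lang_id": 2,   "tokenize": 1,   "answer": 10}

# Sequential: every sample pays every stage, one at a time.
sequential = n * sum(stage_time.values())                 # 13,000 s

# Streaming: throughput limited by the slowest replicated stage.
per_stage_rate = {s: replicas[s] / t for s, t in stage_time.items()}
pipelined = n / min(per_stage_rate.values())              # 1,000 s

# GPU memory for this replica mix: 2 x 1 GB + 10 x 10 GB = 102 GB.
vram_gb = replicas["lang_id"] * 1 + replicas["answer"] * 10
```

With every stage balanced to 1 sample/s, the pipeline drains 1,000 samples in roughly 1,000 seconds, a ~13x improvement over the sequential baseline.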
NeMo Curator powers the data pipelines behind NVIDIA Nemotron models. For example, the Nemotron-4 pre-training dataset was curated using NeMo Curator's text processing pipeline across 8+ trillion tokens of multilingual web data, applying quality filtering, deduplication, and domain classification at scale.
NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.
Proven Results:
- 16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
- 40% lower total cost of ownership (TCO) compared to CPU-based alternatives
- Near-linear scaling from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)
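The scaling figure above checks out as near-linear (in fact slightly superlinear, which can happen when more aggregate GPU memory reduces spill or recomputation):

```python
# Quick check of the 1-node vs 4-node scaling claim above.
t1, t4 = 2.05, 0.50           # runtime in hours on 1 and 4 nodes
speedup = t1 / t4             # 4.1x on 4x the hardware
efficiency = speedup / 4      # ~1.02, i.e. near-linear scaling
```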
Real-World Recipe: The Nemotron-CC curation pipeline uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the Nemotron-CC datasets. The SDG stage is also available as an in-repo tutorial.
Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:
Results: Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.
| Resource | Links |
|---|---|
| Documentation | Main Docs • API Reference • Concepts |
| Tutorials | Text • Image • Video • Audio |
| Recipes | Nemotron-CC: end-to-end web data curation • SDG tutorial (in-repo) |
| Deployment | Installation • Infrastructure |
| Community | GitHub Discussions • Issues |
We welcome community contributions! Please refer to CONTRIBUTING.md for guidelines.
If you find NeMo Curator useful in your research, please cite:
@misc{nemo_curator,
title = {NeMo Curator: GPU-Accelerated Data Curation for Training AI Models},
author = {NVIDIA},
year = {2024},
url = {https://github.com/NVIDIA-NeMo/Curator}
}

For the data curation pipeline behind Nemotron models, please also cite:
@article{parmar2024nemotron4,
title = {Nemotron-4 15B Technical Report},
author = {Parmar, Jupinder and Prabhumoye, Shrimai and others},
journal = {arXiv preprint arXiv:2402.16819},
year = {2024}
}


