
NVIDIA NeMo Curator

GPU-accelerated data curation for training better AI models, faster. Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.

Part of the NVIDIA NeMo software suite for managing the AI agent lifecycle.

Updates

  • 2026-04: NeMo Curator 26.04 released with Cosmos-Xenna 0.2.0 upgrade, simplified Resources API, and Ray 2.54. See the release notes.
  • 2026-02: NeMo Curator 26.02 released with Ray-based pipeline architecture for all modalities — text, image, video, and audio.

What You Can Do

| Modality | Key Capabilities | Get Started |
| --- | --- | --- |
| Text | Deduplication • Classification • Quality Filtering • Language Detection | Text Guide |
| Image | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | Image Guide |
| Video | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | Video Guide |
| Audio | ASR Transcription • Quality Assessment • WER Filtering | Audio Guide |

Quick Start

```bash
# Install for your modality
uv pip install "nemo-curator[text_cuda12]"

# Run the quickstart example
python tutorials/quickstart.py
```

Full setup: Installation Guide • Docker • Tutorials


Architecture

NeMo Curator uses a modular, Ray-based pipeline architecture. Data flows through composable processing stages — each stage handles a discrete curation task (loading, filtering, deduplication, etc.) and can be configured with independent resource requirements.
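To make the stage model concrete, here is a minimal sketch of composable pipeline stages with per-stage resource hints. All names here (`Stage`, `Pipeline`, the `resources` field) are illustrative assumptions, not the actual NeMo Curator API:

```python
# Illustrative sketch of a stage-based pipeline: each stage declares its
# own resource needs and transforms a batch of records. Hypothetical
# names -- see the NeMo Curator docs for the real API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Stage:
    name: str
    fn: Callable[[list[dict]], list[dict]]
    # Per-stage resource hints a scheduler could use for placement.
    resources: dict = field(default_factory=dict)

@dataclass
class Pipeline:
    stages: list[Stage]

    def run(self, batch: list[dict]) -> list[dict]:
        # Sequential for clarity; a real executor streams batches
        # through all stages concurrently.
        for stage in self.stages:
            batch = stage.fn(batch)
        return batch

# Two toy stages: normalize text, then drop very short documents.
load = Stage("load", lambda b: [{"text": r["text"].strip()} for r in b], {"cpus": 1})
filter_short = Stage("filter", lambda b: [r for r in b if len(r["text"]) > 5], {"cpus": 1})

pipe = Pipeline([load, filter_short])
out = pipe.run([{"text": "  hello world  "}, {"text": " hi "}])
```

Because each stage is an independent unit with its own resource declaration, an executor can scale the number of replicas per stage independently, which is what the Ray-based runtime does in practice.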

NeMo Curator architecture diagram showing modular pipeline stages


Features by Modality

Text Curation

Process and curate high-quality text datasets for large language model (LLM) training with multilingual support.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Sources | Common Crawl • Wikipedia • ArXiv • Custom datasets | Load Data |
| Quality Filtering | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | Quality Assessment |
| Deduplication | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | Deduplication |
| Processing | Text cleaning • Language identification | Content Processing |
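The fuzzy deduplication listed above is based on MinHash. As a rough intuition for how MinHash signatures approximate document similarity, here is a self-contained pure-Python sketch (NeMo Curator's actual implementation is GPU-accelerated and adds LSH banding to avoid all-pairs comparison):

```python
# Minimal MinHash sketch: signature agreement estimates Jaccard
# similarity of word-shingle sets. Illustrative only.
import hashlib

def shingles(text: str, k: int = 3) -> set:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(sh: set, num_perm: int = 64) -> list:
    # One seeded hash per "permutation"; keep the minimum over shingles.
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big")
            for s in sh))
    return sig

def similarity(a: list, b: list) -> float:
    # Fraction of matching signature slots ~ Jaccard similarity.
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"   # near-duplicate
doc3 = "completely unrelated text about data curation"

s1, s2, s3 = (minhash(shingles(d)) for d in (doc1, doc2, doc3))
# Near-duplicates score high; unrelated documents score near zero.
```

At scale, comparing every pair is infeasible, which is why LSH groups candidate near-duplicates into buckets first.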

Image Curation

Curate large-scale image datasets for vision language models (VLMs) and generative AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | WebDataset format • Large-scale image-text pairs | Load Data |
| Embeddings | CLIP embeddings for semantic analysis | Embeddings |
| Filtering | Aesthetic quality scoring • NSFW detection | Filters |

Video Curation

Process large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local paths • S3-compatible storage • HTTP(S) URLs | Load Data |
| Clipping | Fixed-stride splitting • Scene-change detection (TransNetV2) | Clipping |
| Processing | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | Processing |
| Embeddings | Cosmos-Embed1 for clip-level embeddings | Embeddings |
| Deduplication | K-means clustering • Pairwise similarity for near-duplicates | Deduplication |
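The pairwise-similarity step in the table above compares clip embeddings and flags pairs above a cosine-similarity threshold. A toy sketch of that idea, using made-up three-dimensional embeddings (real clip embeddings are high-dimensional, and k-means clustering is applied first so comparisons only happen within clusters):

```python
# Toy embedding-based near-duplicate detection. Illustrative only.
import math

def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def near_duplicates(embeddings: dict, threshold: float = 0.95) -> list:
    # Compare each pair once; report pairs above the threshold.
    ids = sorted(embeddings)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if cosine(embeddings[a], embeddings[b]) >= threshold:
                pairs.append((a, b))
    return pairs

clips = {
    "clip_a": [0.9, 0.1, 0.0],
    "clip_b": [0.89, 0.11, 0.01],  # nearly identical to clip_a
    "clip_c": [0.0, 0.2, 0.9],     # different content
}
dupes = near_duplicates(clips)
```

Clustering first turns an O(n²) all-pairs comparison into many small within-cluster comparisons, which is what makes this tractable on large video corpora.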

Audio Curation

Prepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.

| Category | Features | Documentation |
| --- | --- | --- |
| Data Loading | Local files • Custom manifests • Public datasets (FLEURS) | Load Data |
| ASR Processing | NeMo Framework pretrained models • Automatic transcription | ASR Inference |
| Quality Assessment | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | Quality Assessment |
| Integration | Text curation workflow integration for multimodal pipelines | Text Integration |
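WER, the metric behind the quality filtering above, is word-level edit distance divided by reference length. A minimal sketch of the WER-based filtering idea (not NeMo Curator's own code; the sample sentences are made up):

```python
# Word Error Rate via Levenshtein distance on word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: edit distance between the two word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Keep samples whose ASR transcription is close enough to the reference.
samples = [("the cat sat on the mat", "the cat sat on the mat"),
           ("the cat sat on the mat", "the cat sat on a hat")]
kept = [s for s in samples if wer(*s) <= 0.2]
```

A high WER between a reference transcript and the ASR output usually signals noisy audio or a mismatched transcript, making it a useful filtering signal.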

Why Data Curation?

High-quality training data is the single most important factor in building performant AI models. Raw datasets contain noise, duplicates, low-quality content, and potentially harmful material that degrade model performance and increase training costs.

Common data curation challenges: quality, deduplication, filtering, and scale

At scale, data curation is a throughput maximization problem. A typical pipeline chains stages with very different compute profiles — lightweight CPU tokenization, small GPU classifiers, large GPU inference models — and a naive sequential approach leaves most hardware idle most of the time.

Example: Consider a pipeline with language identification (0.5B model, 1 GB VRAM, 2s/sample), tokenization (CPU-only, 1s/sample), and a 5B answer model (10 GB VRAM, 10s/sample) processing 1,000 questions on a single 102 GB GPU:

| Approach | How it works | Total runtime |
| --- | --- | --- |
| Sequential | Process each sample through all stages, one at a time | ~13,000 seconds |
| NeMo Curator | Stream batches, auto-scale replicas per stage, overlap CPU/GPU work | ~1,000 seconds |

NeMo Curator achieves this by streaming data through the pipeline so all stages run concurrently, auto-balancing replicas to match each stage's throughput (2× language ID, 1× tokenizer, 10× answer model), and keeping GPU workers busy over 99% of the time after an initial warm-up period. See the scaling concepts for details.
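The runtime figures above can be checked with a few lines of arithmetic. Sequential processing pays the sum of all stage latencies per sample; with replicas balanced to stage cost, steady-state throughput is set by the slowest stage's effective rate (warm-up ignored, as in the example):

```python
# Runtime arithmetic for the example pipeline: 1,000 samples through
# three stages (language ID 2 s, tokenizer 1 s, answer model 10 s).
samples = 1_000
stage_secs = {"lang_id": 2, "tokenize": 1, "answer": 10}

# Sequential: every sample pays the sum of all stage latencies.
sequential = samples * sum(stage_secs.values())  # 13,000 s

# Streaming with replicas matched to stage cost (2x language ID,
# 1x tokenizer, 10x answer model): each stage's effective time per
# sample becomes equal, and the slowest one sets the pace.
replicas = {"lang_id": 2, "tokenize": 1, "answer": 10}
effective = {s: stage_secs[s] / replicas[s] for s in stage_secs}
streamed = samples * max(effective.values())  # 1,000 s
```

With the replica counts chosen this way every stage streams at 1 s/sample, so no stage starves or backs up the others.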


Proven at Scale: Nemotron

NeMo Curator powers the data pipelines behind NVIDIA Nemotron models. For example, the Nemotron-4 pre-training dataset was curated using NeMo Curator's text processing pipeline across 8+ trillion tokens of multilingual web data, applying quality filtering, deduplication, and domain classification at scale.


Why NeMo Curator?

Performance at Scale

NeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.

Proven Results:

  • 16× faster fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)
  • 40% lower total cost of ownership (TCO) compared to CPU-based alternatives
  • Near-linear scaling from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)

Real-World Recipe: The Nemotron-CC curation pipeline uses NeMo Curator end-to-end — from Common Crawl extraction through language identification, exact/fuzzy/substring deduplication, ensemble quality classification, and LLM-based synthetic data generation — to reproduce the Nemotron-CC datasets. The SDG stage is also available as an in-repo tutorial.

Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling

Quality Improvements

Data curation modules measurably improve model performance. In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:

Model accuracy improvements across curation pipeline stages

Results: Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.


Learn More

| Resource | Links |
| --- | --- |
| Documentation | Main Docs • API Reference • Concepts |
| Tutorials | Text • Image • Video • Audio |
| Recipes | Nemotron-CC: end-to-end web data curation • SDG tutorial (in-repo) |
| Deployment | Installation • Infrastructure |
| Community | GitHub Discussions • Issues |

Contribute

We welcome community contributions! Please refer to CONTRIBUTING.md for guidelines.


Citation

If you find NeMo Curator useful in your research, please cite:

```bibtex
@misc{nemo_curator,
  title = {NeMo Curator: GPU-Accelerated Data Curation for Training AI Models},
  author = {NVIDIA},
  year = {2024},
  url = {https://github.com/NVIDIA-NeMo/Curator}
}
```

For the data curation pipeline behind Nemotron models, please also cite:

```bibtex
@article{parmar2024nemotron4,
  title = {Nemotron-4 15B Technical Report},
  author = {Parmar, Jupinder and Prabhumoye, Shrimai and others},
  journal = {arXiv preprint arXiv:2402.16819},
  year = {2024}
}
```