A curated collection of public datasets for artificial intelligence and machine learning workflows.
This repository organizes those datasets into a structured catalog intended for:
- Training and fine-tuning machine learning models
- Building data pipelines and ETL processes
- Benchmarking algorithm performance
- Practicing data cleaning and preprocessing techniques
- Supporting research in various AI domains
Browse our dataset collections by type:
- Computer Vision Datasets - Image datasets for tasks such as object detection and classification
- Natural Language Processing Datasets - Text datasets for NLP tasks
- Tabular Datasets - Structured data for regression and classification
- Time Series Datasets - Sequential data for forecasting and analysis
- Graph Datasets - Network and relationship data
- Audio Datasets - Speech and sound processing collections
- Multimodal Datasets - Combined data types for advanced applications
Guides and walkthroughs:
- Data Cleaning Best Practices - Essential techniques for preparing ML-ready data
- Model Evaluation Metrics - Comprehensive guide to measuring model performance
- ML Data Pipeline Architecture - Building scalable data pipelines for ML
- Tabular Data Exploration - Walkthrough of exploring and preprocessing tabular data
- NLP Text Preprocessing - Techniques for preparing text data for ML models
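As a small illustration of the data cleaning topics listed above, here is a minimal sketch that uses only the Python standard library. The sample data, the column names, and the `clean_rows` helper are hypothetical and are not taken from this repository's guides; they only show the general shape of a cleaning pass (drop exact duplicates, skip rows with missing or non-numeric values, convert types).

```python
import csv
import io

# Hypothetical raw data: column names and values are made up for illustration.
RAW_CSV = """id,age,income
1,34,52000
2,,61000
1,34,52000
3,forty,48000
4,29,
5,41,75000
"""


def clean_rows(text):
    """Return rows with complete numeric fields, dropping exact duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        key = tuple(row.values())
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        try:
            cleaned.append({
                "id": int(row["id"]),
                "age": int(row["age"]),
                "income": float(row["income"]),
            })
        except (TypeError, ValueError):
            continue  # missing or non-numeric value in this row
    return cleaned


rows = clean_rows(RAW_CSV)
for r in rows:
    print(r)
```

Dropping incomplete rows is only one possible policy. Depending on the dataset, imputing missing values or flagging them for review may be a better fit.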
For a complete list of dataset sources, see our original dataset catalog.
We welcome contributions! Please see our Contributing Guidelines for details on how to add datasets or examples.
This catalog is available under the MIT License. Individual datasets may have their own licenses.