Training Data Bot helps you turn raw documents into clean training datasets for LLM fine-tuning.
It supports mixed input sources (PDFs, text files, folders, URLs), generates examples for multiple task types, scores their quality, and exports the result in common formats.
People often have useful content spread across manuals, notes, web pages, and internal docs, but that content is not ready for model training.
This project helps by:
- Collecting content automatically from files, directories, and websites.
- Converting content into training examples (QA, classification, summarization).
- Filtering low-quality examples using quality scoring.
- Exporting a dataset in JSONL/JSON/CSV for downstream ML workflows.
In simple terms: it saves time and gives teams a repeatable pipeline instead of manual copy-paste dataset creation.
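For context, JSONL (the default export format) simply stores one JSON object per line, so each training example stands alone. The field names below are illustrative only, not the bot's actual schema:

```python
import json

# Hypothetical training examples (illustrative schema, not the bot's real one).
examples = [
    {"task": "qa", "question": "What formats can be exported?", "answer": "JSONL, JSON, or CSV."},
    {"task": "summarization", "input": "A long passage of source text...", "output": "A short summary."},
]

# Write one JSON object per line.
with open("sample.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Each line parses back independently.
with open("sample.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]["task"])  # qa
```

Because every line is self-contained, JSONL files can be streamed, appended to, and split without parsing the whole file.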
The main orchestrator is `TrainingDataBot`. It runs this workflow:

- **Load documents.** `UnifiedLoader` detects the input type and routes to `PDFLoader`, `DocumentLoader`, or `WebLoader`.
- **Preprocess text.** `TextPreprocessor` chunks each document into overlapping text segments.
- **Generate training tasks.** `TaskManager` applies task generators: QA generation, classification, and summarization.
- **Evaluate quality.** `QualityEvaluator` scores dataset quality and returns a `QualityReport`; optional quality filtering removes weaker examples.
- **Export dataset.** `DatasetExporter` writes output as `jsonl`, `json`, or `csv`, with optional split export for train/validation/test.
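The overlapping-chunk idea in the preprocessing step can be sketched in plain Python. This is a simplified character-based illustration; `TextPreprocessor`'s actual tokenization and boundaries may differ:

```python
def chunk_text(text: str, chunk_size: int = 800, overlap: int = 120) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats the
    last `overlap` characters of the previous one (illustrative sketch)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

# Tiny example so the overlap is visible:
chunks = chunk_text("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap keeps sentences that straddle a chunk boundary from being lost to both chunks, at the cost of some duplicated text.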
- `training_data_bot/bot.py` → main orchestrator (`TrainingDataBot`)
- `training_data_bot/sources/` → file/PDF/web loading
- `training_data_bot/preprocessing/` → chunking and text preparation
- `training_data_bot/tasks/` → training example generation
- `training_data_bot/evaluation/` → quality scoring
- `training_data_bot/storage/` → dataset exporting and persistence helpers
- `training_data_bot/models.py` → core data models
Make sure your terminal is inside the project folder:

```
D:\Training_Data_Bot-main
```
Create and activate a virtual environment:

```
python -m venv .venv
.venv\Scripts\Activate.ps1
```

If PowerShell blocks script execution, run once:

```
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
```

Then activate again.
Install dependencies:

```
pip install -r requirements.txt
```

Create a file named `run_example.py` in the project root and paste:
```python
import asyncio
from training_data_bot import TrainingDataBot

async def main():
    async with TrainingDataBot() as bot:
        documents = await bot.load_documents([
            "docs/manual.pdf",
            "docs/notes.txt",
            "https://example.com/help",
        ])

        dataset = await bot.process_documents(
            documents=documents,
            task_types=None,
            quality_filter=True,
            chunk_size=800,
            overlap=120,
            quality_threshold=0.65,
        )

        report = await bot.evaluate_dataset(dataset)
        print("Quality score:", report.overall_score)
        print("Passed:", report.passed)

        output_path = await bot.export_dataset(
            dataset=dataset,
            output_path="output/training_data.jsonl",
            format="jsonl",
            split_data=True,
        )
        print("Exported to:", output_path)
        print("Stats:", bot.get_statistics())

if __name__ == "__main__":
    asyncio.run(main())
```

Run it:
```
python run_example.py
```

After a successful run, you should see dataset files under `output/`.
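A train/validation/test split like the one `split_data=True` requests can be sketched as follows. This is a hedged illustration; the exporter's actual ratios, shuffling, and file naming may differ:

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle and partition examples into train/validation/test slices
    (illustrative sketch; ratios and seed are assumptions, not the bot's defaults)."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))  # 80 10 10
```

Shuffling before splitting matters: documents are loaded in order, so an unshuffled split would put whole documents into only one slice and bias the evaluation.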
```python
from training_data_bot import TrainingDataBot
```

Main methods:

- `load_documents(...)`
- `process_documents(...)`
- `evaluate_dataset(...)`
- `export_dataset(...)`
- `get_statistics()`
- **Support chatbot training**: convert help-center docs and FAQs into QA training examples.
- **Internal knowledge assistant**: transform company policies and manuals into structured fine-tuning data.
- **Domain-specific model improvement**: build a custom dataset from technical documentation in finance, legal, healthcare, etc.
- **Data curation workflows for ML teams**: standardize document-to-dataset preprocessing in one repeatable pipeline.
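The FAQ-to-QA use case boils down to a simple mapping from question/answer pairs to training records. A minimal sketch, with an illustrative record schema that is an assumption rather than the bot's actual output format:

```python
def faq_to_examples(faq_pairs):
    """Convert (question, answer) pairs into QA training records
    (hypothetical field names for illustration)."""
    return [
        {"task_type": "qa_generation", "input": question, "output": answer}
        for question, answer in faq_pairs
    ]

records = faq_to_examples([
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
    ("Where are exports saved?", "Under the output/ directory by default."),
])
print(records[0]["input"])
```

In practice the bot generates such pairs from raw document chunks; this sketch only shows the shape of the resulting records.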
- **Module import errors**: confirm the virtual environment is active before running Python.
- **No output examples generated**: check that source files/URLs are valid and contain enough text.
- **Very few examples after filtering**: lower `quality_threshold` or set `quality_filter=False`.
- **Path/file not found**: use correct absolute/relative paths for your local files.
- Current AI/Decodo clients are lightweight placeholders; the pipeline is designed to be extendable.
- The default generation/evaluation behavior is deterministic and tutorial-friendly.
See LICENSE.
```mermaid
graph LR
    %% Main Nodes
    Root["📁 Project Root"]

    %% Top-Level Files & Folders
    Root --> D1["📁 Training Data Bot Tutorial"]
    Root --> F1("📄 Architecting_the_Training_Data_Bot.pdf")
    Root --> F2("📄 Production AI Tutorial.pdf")
    Root --> F3("🖼️ Mind Map.png")

    %% Main Package
    Root --> Pkg["📁 training_data_bot"]
    Pkg --> Pkg_Init("🐍 __init__.py")
    Pkg --> Pkg_Bot("🐍 bot.py")

    %% Subpackages
    Pkg --> AI["📁 ai"]
    AI --> AI_Init("🐍 __init__.py")
    AI --> AI_Client("🐍 client.py")
    Pkg --> Core["📁 core"]
    Core --> Core_Init("🐍 __init__.py")
    Core --> Core_Config("🐍 config.py")
    Core --> Core_Exc("🐍 exceptions.py")
    Core --> Core_Log("🐍 logging.py")
    Pkg --> Decodo["📁 decodo"]
    Decodo --> Dec_Init("🐍 __init__.py")
    Decodo --> Dec_Client("🐍 client.py")
    Pkg --> Eval["📁 evaluation"]
    Eval --> Eval_Init("🐍 __init__.py")
    Eval --> Eval_Qual("🐍 quality_evaluator.py")
    Pkg --> Preproc["📁 preprocessing"]
    Preproc --> Pre_Init("🐍 __init__.py")
    Preproc --> Pre_Text("🐍 text_preprocessor.py")
    Pkg --> Sources["📁 sources"]
    Sources --> Src_Init("🐍 __init__.py")
    Sources --> Src_Base("🐍 base.py")
    Sources --> Src_Doc("🐍 document.py")
    Sources --> Src_PDF("🐍 pdf.py")
    Sources --> Src_Uni("🐍 unified.py")
    Sources --> Src_Web("🐍 web.py")
    Pkg --> Storage["📁 storage"]
    Storage --> Sto_Init("🐍 __init__.py")
    Storage --> Sto_DB("🐍 database.py")
    Storage --> Sto_Exp("🐍 exporter.py")
    Pkg --> Tasks["📁 tasks"]
    Tasks --> Task_Init("🐍 __init__.py")
    Tasks --> Task_Gen("🐍 generators.py")
    Tasks --> Task_Man("🐍 manager.py")

    %% Styling classes for a polished look
    classDef folder fill:#e1f5fe,stroke:#01579b,stroke-width:2px,color:#000;
    classDef file fill:#fff8e1,stroke:#ff8f00,stroke-width:1px,color:#000;
    classDef python fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px,color:#000;

    %% Applying styles
    class Root,D1,Pkg,AI,Core,Decodo,Eval,Preproc,Sources,Storage,Tasks folder;
    class F1,F2,F3 file;
    class Pkg_Init,Pkg_Bot,AI_Init,AI_Client,Core_Init,Core_Config,Core_Exc,Core_Log,Dec_Init,Dec_Client,Eval_Init,Eval_Qual,Pre_Init,Pre_Text,Src_Init,Src_Base,Src_Doc,Src_PDF,Src_Uni,Src_Web,Sto_Init,Sto_DB,Sto_Exp,Task_Init,Task_Gen,Task_Man python;
```