Generate, validate, deduplicate, and export synthetic training data for LLM fine-tuning and evaluation.
synthdata-gen is a complete pipeline for producing synthetic training data. Define a schema describing the shape of each example, optionally plug in any LLM, and the library handles generation, output parsing, schema validation, quality heuristics, deduplication, and export to training-ready formats (OpenAI fine-tuning JSONL, Alpaca, ShareGPT, CSV, plain JSONL).
The library works in three modes:
- Template-based generation -- no LLM required. A built-in deterministic generator produces examples matching your schema using seeded pseudo-random values. Useful for testing pipelines, prototyping schemas, and generating placeholder data.
- LLM-based generation -- provide any async function that calls an LLM. The pipeline builds prompts from your schema, parses structured output from LLM responses (including JSON embedded in markdown fences), retries on failure, and tracks token usage and cost.
- Custom generation -- provide your own `generateFn` callback for full control over how examples are produced, while still benefiting from the validation, dedup, and export stages.
Each pipeline stage (generation, validation, deduplication, export) is independently usable as a standalone function.
```bash
npm install synthdata-gen
```

Requires Node.js >= 18.
```ts
import { generate } from 'synthdata-gen';
import type { ExampleSchema } from 'synthdata-gen';

const schema: ExampleSchema = {
  fields: {
    instruction: { type: 'string', min: 10, max: 200, description: 'A clear instruction' },
    output: { type: 'string', min: 20, max: 1000, description: 'The expected response' },
    category: { type: 'enum', enum: ['coding', 'writing', 'reasoning'] },
  },
};

const result = await generate(schema, { count: 100 });
console.log(result.data);  // GeneratedExample[]
console.log(result.stats); // GenerationStats
```

```ts
import { generate } from 'synthdata-gen';
import type { LlmFunction } from 'synthdata-gen';

const myLlm: LlmFunction = async (messages, options) => {
  const response = await callMyProvider(messages, options);
  return {
    content: response.text,
    usage: {
      promptTokens: response.usage.prompt_tokens,
      completionTokens: response.usage.completion_tokens,
      totalTokens: response.usage.total_tokens,
    },
  };
};

const result = await generate(schema, {
  llm: myLlm,
  count: 500,
  batchSize: 5,
  format: 'openai',
  seeds: [
    { instruction: 'Explain recursion', output: 'Recursion is when a function calls itself...', category: 'coding' },
  ],
  costTracking: {
    promptTokenCost: 0.000003,
    completionTokenCost: 0.000015,
    currency: 'USD',
  },
});

console.log(result.exported);         // OpenAI fine-tuning JSONL string
console.log(result.stats.cost);       // { promptTokens, completionTokens, totalCost, currency }
console.log(result.stats.durationMs); // wall-clock time
```

- Schema-driven generation -- define field types, constraints, descriptions, and required fields; the library compiles schemas into LLM prompts and validates output against them.
- Three generation modes -- template-based (no LLM), LLM-based (any provider), or custom callback.
- Robust LLM output parsing -- extracts JSON from bare responses, markdown code fences, and mixed text.
- Schema validation -- type checking, string length constraints, numeric ranges, regex patterns, enum membership, array bounds, and nested object validation.
- Quality heuristics -- detect empty fields, placeholder text (lorem ipsum, TODO, N/A), duplicate field values, and enforce minimum word counts.
- Custom validators -- plug in arbitrary validation functions alongside built-in checks.
- Three deduplication strategies -- exact match (normalized hash), near-duplicate (Jaccard similarity on n-grams), and semantic (cosine similarity on embeddings via a pluggable embedder).
- Cross-set deduplication -- remove generated examples that overlap with an existing dataset.
- Five export formats -- OpenAI fine-tuning JSONL, Alpaca, ShareGPT, CSV, and plain JSONL, with configurable field mappings.
- Diversity controls -- temperature variation (linear, cycle, random), topic rotation, seed example rotation, negative example generation, and constraint variation.
- Cost tracking -- track prompt and completion tokens, compute estimated cost per run.
- Deterministic generation -- seeded PRNG for reproducible template-based output.
- Full TypeScript support -- all types exported, strict mode compatible.
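To make the near-duplicate strategy concrete: it scores pairs by Jaccard similarity over n-grams. The following is a minimal sketch of that technique over word bigrams, written for illustration only -- it is not the library's internal code, and the library's tokenization may differ.

```ts
// Sketch of Jaccard similarity over word n-grams, the technique behind
// near-duplicate detection. Illustrative; not the library's implementation.
function wordNgrams(text: string, n: number): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const grams = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    grams.add(words.slice(i, i + n).join(' '));
  }
  return grams;
}

function jaccard(a: string, b: string, n = 2): number {
  const ga = wordNgrams(a, n);
  const gb = wordNgrams(b, n);
  if (ga.size === 0 && gb.size === 0) return 1;
  let intersection = 0;
  for (const g of ga) if (gb.has(g)) intersection++;
  const union = ga.size + gb.size - intersection;
  return union === 0 ? 0 : intersection / union;
}
```

A pair scoring above the configured threshold (default 0.85 for the near strategy) would be treated as near-duplicates.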
Main pipeline function. Generates examples, validates, deduplicates, and optionally exports.
```ts
function generate(schema: ExampleSchema, options: GenerateOptions): Promise<GenerateResult>
```

Parameters:

| Parameter | Type | Description |
|---|---|---|
| `schema` | `ExampleSchema` | Schema defining the shape of each example |
| `options` | `GenerateOptions` | Pipeline configuration (see below) |
GenerateOptions:

| Option | Type | Default | Description |
|---|---|---|---|
| `count` | `number` | required | Number of examples to generate |
| `llm` | `LlmFunction` | `undefined` | Async function that calls an LLM |
| `generateFn` | `(schema, batchIndex) => Record[]` | `undefined` | Custom generation callback |
| `batchSize` | `number` | `1` | Examples per LLM call |
| `systemPrompt` | `string` | `undefined` | Custom system prompt (use the `{schema_description}` placeholder) |
| `additionalInstructions` | `string` | `undefined` | Extra instructions appended to the system prompt |
| `seeds` | `Record<string, unknown>[]` | `undefined` | Few-shot seed examples |
| `diversity` | `DiversityConfig` | `undefined` | Diversity strategy configuration |
| `validation` | `ValidationConfig` | `undefined` | Validation and heuristics configuration |
| `retry` | `RetryConfig` | `{ maxRetries: 3 }` | Retry configuration for LLM failures |
| `dedup` | `DedupOptions` | `{ strategy: 'exact' }` | Deduplication configuration |
| `invalidHandling` | `'discard' \| 'log' \| 'repair'` | `'discard'` | How to handle invalid examples |
| `structuredOutput` | `boolean` | `undefined` | Request JSON mode from the LLM provider |
| `costTracking` | `CostConfig` | `undefined` | Token cost tracking configuration |
| `format` | `ExportFormat` | `undefined` | Export format for the `exported` field in the result |
Returns `GenerateResult`:

| Field | Type | Description |
|---|---|---|
| `data` | `GeneratedExample[]` | Final validated, deduplicated examples |
| `stats` | `GenerationStats` | Pipeline statistics |
| `exported` | `string \| undefined` | Formatted output string (if `format` was specified) |
Validate an array of examples against a schema. Returns a ValidationResult for each example.
```ts
function validate(
  examples: Record<string, unknown>[],
  schema: ExampleSchema,
  config?: ValidationConfig,
): ValidationResult[]
```

Returns an array of:

```ts
interface ValidationResult {
  valid: boolean;
  index: number;
  errors: ValidationError[];
}

interface ValidationError {
  path: string[];
  message: string;
  code: string;
}
```

Validation error codes: `required`, `invalid_type`, `too_small`, `too_big`, `invalid_string`, `invalid_enum_value`, `heuristic_non_empty`, `heuristic_placeholder`, `heuristic_duplicate_fields`, `heuristic_min_words`, `global_min_length`, `global_max_length`, `custom_<name>`.
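To show how these codes surface in practice, here is an illustrative sketch of a string-length check that emits `too_small`/`too_big` errors in the same structured shape. `checkStringLength` is a hypothetical helper for illustration, not a library export.

```ts
// Illustrative sketch of how string-length constraints can produce
// structured errors with `too_small` / `too_big` codes.
// Not the library's validator; `checkStringLength` is hypothetical.
interface Err {
  path: string[];
  message: string;
  code: string;
}

function checkStringLength(
  path: string[],
  value: string,
  min?: number,
  max?: number,
): Err[] {
  const errors: Err[] = [];
  if (min !== undefined && value.length < min) {
    errors.push({
      path,
      code: 'too_small',
      message: `String must contain at least ${min} character(s), received ${value.length}`,
    });
  }
  if (max !== undefined && value.length > max) {
    errors.push({
      path,
      code: 'too_big',
      message: `String must contain at most ${max} character(s), received ${value.length}`,
    });
  }
  return errors;
}
```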
Validate a single example. Returns an array of ValidationError objects (empty array means valid).
```ts
function validateExample(
  example: Record<string, unknown>,
  schema: ExampleSchema,
  config?: ValidationConfig,
): ValidationError[]
```

Deduplicate an array of examples. Supports exact, near-duplicate, and semantic strategies.
```ts
function deduplicate(
  examples: Record<string, unknown>[],
  options?: Partial<DedupOptions>,
): Promise<DedupResult>
```

DedupOptions:

| Option | Type | Default | Description |
|---|---|---|---|
| `strategy` | `'exact' \| 'near' \| 'semantic' \| 'none'` | `'exact'` | Deduplication strategy |
| `threshold` | `number` | `0.85` (near) / `0.92` (semantic) | Similarity threshold for near/semantic dedup |
| `ngramSize` | `number` | `2` | N-gram size for near-duplicate detection |
| `fields` | `string[]` | all fields | Subset of fields to compare |
| `embedder` | `(text: string) => Promise<number[]>` | `undefined` | Embedding function (required for semantic strategy) |
| `existingData` | `Record<string, unknown>[]` | `undefined` | Existing dataset for cross-set deduplication |
Returns `DedupResult`:

```ts
interface DedupResult {
  data: Record<string, unknown>[];        // Deduplicated examples
  removed: number;                        // Number of duplicates removed
  pairs: Array<[number, number, number]>; // [indexA, indexB, similarity]
}
```

Export examples to a training-ready format string.
```ts
function exportAs(
  examples: Record<string, unknown>[],
  format: ExportFormat,
  options?: ExportOptions,
): string
```

ExportFormat: `'openai' | 'alpaca' | 'sharegpt' | 'csv' | 'jsonl'`

ExportOptions:

| Option | Type | Default | Description |
|---|---|---|---|
| `fieldMap` | `Record<string, string>` | `undefined` | Map format roles to your field names |
| `systemPrompt` | `string` | `undefined` | Static system prompt (OpenAI/ShareGPT formats) |
| `delimiter` | `string` | `','` | CSV column delimiter |
| `quote` | `string` | `'"'` | CSV quote character |
| `header` | `boolean` | `true` | Include CSV header row |
| `fields` | `string[]` | all fields | Subset of fields to include |
Each export format is available as a standalone function:
```ts
import { exportOpenAI, exportAlpaca, exportShareGPT, exportCSV, exportJSONL } from 'synthdata-gen';
```

| Function | Output format |
|---|---|
| `exportOpenAI(examples, options?)` | `{"messages": [{"role": "system", ...}, {"role": "user", ...}, {"role": "assistant", ...}]}` per line |
| `exportAlpaca(examples, options?)` | `{"instruction": "...", "input": "...", "output": "..."}` per line |
| `exportShareGPT(examples, options?)` | `{"conversations": [{"from": "human", "value": "..."}, {"from": "gpt", "value": "..."}]}` per line |
| `exportCSV(examples, options?)` | Comma-separated values with header row |
| `exportJSONL(examples, options?)` | One JSON object per line |
All exporters automatically exclude `_meta` fields from output.
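To make the OpenAI row shape above concrete, the following sketch builds one fine-tuning JSONL line from a question/answer example. `toOpenAILine` is a hypothetical helper for illustration only; use `exportOpenAI` in practice (which also handles field mapping and `_meta` stripping).

```ts
// Sketch of one OpenAI fine-tuning JSONL line for a Q/A example.
// `toOpenAILine` is hypothetical, not a library export.
function toOpenAILine(
  example: { question: string; answer: string },
  systemPrompt?: string,
): string {
  const messages = [
    ...(systemPrompt ? [{ role: 'system', content: systemPrompt }] : []),
    { role: 'user', content: example.question },
    { role: 'assistant', content: example.answer },
  ];
  return JSON.stringify({ messages });
}
```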
Low-level functions for template-based generation and prompt construction:
```ts
import {
  generateExample,
  generateExamples,
  buildSchemaPrompt,
  buildSystemPrompt,
  parseJsonResponse,
} from 'synthdata-gen';
```

| Function | Description |
|---|---|
| `generateExample(schema, seed?)` | Generate a single example from a schema using the built-in template generator. Deterministic when a seed is provided. |
| `generateExamples(schema, count, baseSeed?)` | Generate multiple examples. Each example uses `baseSeed + index` for its seed. |
| `buildSchemaPrompt(schema)` | Compile a schema into a natural-language prompt describing the expected JSON structure. |
| `buildSystemPrompt(schema, customPrompt?, additionalInstructions?)` | Build the full system prompt for LLM-based generation. Supports a custom prompt template with a `{schema_description}` placeholder. |
| `parseJsonResponse(text)` | Extract JSON objects/arrays from an LLM response. Handles bare JSON, markdown code fences, and JSON embedded in explanatory text. |
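The kind of extraction `parseJsonResponse` performs can be approximated as: prefer fenced code blocks, then fall back to scanning for a bare object or array. A simplified sketch of that technique (not the library's implementation, which is more robust):

```ts
// Simplified sketch of pulling JSON out of an LLM reply: try fenced blocks
// first, then fall back to the first {...} or [...] span in the text.
// Illustrative only; `parseJsonResponse` handles more edge cases.
function extractJson(text: string): unknown[] {
  const results: unknown[] = [];
  const fence = /```(?:json)?\s*([\s\S]*?)```/g;
  let m: RegExpExecArray | null;
  while ((m = fence.exec(text)) !== null) {
    try {
      results.push(JSON.parse(m[1].trim()));
    } catch {
      // ignore malformed fenced content
    }
  }
  if (results.length > 0) return results;
  const bare = text.match(/[\[{][\s\S]*[\]}]/);
  if (bare) {
    try {
      results.push(JSON.parse(bare[0]));
    } catch {
      // ignore malformed bare content
    }
  }
  return results;
}
```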
A schema defines the structure of each generated example using ExampleSchema:
```ts
const schema: ExampleSchema = {
  fields: {
    question: { type: 'string', min: 10, max: 500, description: 'A natural language question' },
    answer: { type: 'string', min: 20, max: 2000, pattern: '^[A-Z]' },
    category: { type: 'enum', enum: ['science', 'history', 'technology'] },
    difficulty: { type: 'integer', min: 1, max: 5 },
    score: { type: 'number', min: 0, max: 100 },
    active: { type: 'boolean' },
    tags: { type: 'array', items: { type: 'string' }, min: 1, max: 5 },
    metadata: {
      type: 'object',
      properties: {
        source: { type: 'string' },
        verified: { type: 'boolean' },
      },
      requiredFields: ['source'],
    },
  },
  required: ['question', 'answer', 'category'],
};
```

Supported field types:

| Type | SchemaField properties |
|---|---|
| `string` | `min` (min length), `max` (max length), `pattern` (regex), `description` |
| `number` | `min`, `max`, `description` |
| `integer` | `min`, `max`, `description` |
| `boolean` | `description` |
| `enum` | `enum` (valid values array), `description` |
| `array` | `items` (element schema), `min` (min items), `max` (max items), `description` |
| `object` | `properties` (nested fields), `requiredFields` (required property names), `description` |
All fields support `required` (default: `true`) and `default` (a value to use when the field is omitted).
```ts
const config: ValidationConfig = {
  // Global string field length constraints
  minFieldLength: 10,
  maxFieldLength: 5000,

  // Quality heuristics
  heuristics: {
    nonEmpty: true,       // Reject empty/whitespace-only required string fields
    noPlaceholder: true,  // Reject placeholder text (lorem ipsum, TODO, TBD, N/A, etc.)
    noDuplicateFields: {  // Reject examples where specified field pairs are identical
      pairs: [['question', 'answer']],
    },
    minWordCount: {       // Enforce minimum word count on specified fields
      fields: ['answer'],
      min: 5,
    },
  },

  // Custom validators
  custom: [
    {
      name: 'no-question-in-output',
      validate: (example) => ({
        valid: !String(example.answer).endsWith('?'),
        message: 'Answer should not end with a question mark',
      }),
    },
  ],
};
```

The `noDuplicateFields` and `minWordCount` heuristics also accept `true` to use automatic inference: `noDuplicateFields: true` pairs common field name patterns (question/answer, instruction/output, input/output, prompt/response, query/response), and `minWordCount: true` applies a default minimum of 3 words to all string fields.
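As an illustration of what a placeholder heuristic like `noPlaceholder` amounts to, here is a simple pattern-based check in that spirit. The library's actual pattern list and matching rules may differ.

```ts
// Sketch of a placeholder-text heuristic: flag common filler strings.
// Illustrative; the library's pattern list may differ.
const PLACEHOLDER_PATTERNS = [/lorem ipsum/i, /\bTODO\b/i, /\bTBD\b/i, /^n\/?a$/i];

function looksLikePlaceholder(value: string): boolean {
  const text = value.trim();
  return PLACEHOLDER_PATTERNS.some((p) => p.test(text));
}
```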
```ts
const diversity: DiversityConfig = {
  temperature: {
    min: 0.3,
    max: 1.2,
    strategy: 'cycle', // 'linear' | 'cycle' | 'random'
  },
  topics: ['algorithms', 'databases', 'networking', 'security'],
  negativeExampleRatio: 0.1,
  negativeInstructions: 'Generate an example with a subtle factual error.',
  constraintVariation: [
    { instruction: 'Write in a formal academic tone.' },
    { instruction: 'Write in a casual conversational tone.' },
  ],
};
```

```ts
const retry: RetryConfig = {
  maxRetries: 3,          // Maximum retry attempts per batch
  includeFeedback: true,  // Include validation error feedback in retry prompt
  backoff: 'exponential', // 'none' | 'linear' | 'exponential'
  backoffMs: 1000,        // Base backoff delay in milliseconds
};
```

```ts
const costTracking: CostConfig = {
  promptTokenCost: 0.000003,     // Cost per prompt token
  completionTokenCost: 0.000015, // Cost per completion token
  currency: 'USD',
};
```

The `GenerationStats.cost` field in the result contains `promptTokens`, `completionTokens`, `totalCost`, and `currency`.
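The estimated cost is a straightforward product of token counts and per-token prices. A worked sketch of the arithmetic (illustrative, not the library's internal code):

```ts
// Sketch of how a run's estimated cost follows from CostConfig values:
// tokens consumed multiplied by per-token prices, summed.
function estimateCost(
  promptTokens: number,
  completionTokens: number,
  promptTokenCost: number,
  completionTokenCost: number,
): number {
  return promptTokens * promptTokenCost + completionTokens * completionTokenCost;
}

// 200k prompt tokens and 50k completion tokens at the rates above:
// 200000 * 0.000003 = 0.60, 50000 * 0.000015 = 0.75, total 1.35 USD
const total = estimateCost(200_000, 50_000, 0.000003, 0.000015);
```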
The `validate` and `validateExample` functions return structured error objects rather than throwing. Each `ValidationError` includes:

- `path` -- array of field names locating the error (e.g., `['address', 'zip']` for nested fields, `['tags', '0']` for array elements)
- `message` -- human-readable description of the failure
- `code` -- machine-readable error code for programmatic handling
```ts
import { validateExample } from 'synthdata-gen';

const errors = validateExample(
  { instruction: 'Hi', output: 123, category: 'invalid' },
  schema,
);

for (const err of errors) {
  console.log(`[${err.code}] ${err.path.join('.')}: ${err.message}`);
}
// [too_small] instruction: String must contain at least 10 character(s), received 2
// [invalid_type] output: Expected string, received number
// [invalid_enum_value] category: Invalid enum value. Expected one of ["coding", ...], received "invalid"
```

The `generate` function handles LLM failures internally using the retry configuration. Invalid examples are handled according to the `invalidHandling` option:
- `'discard'` (default) -- silently drops invalid examples
- `'log'` -- discards but records invalid examples and their errors in `stats.invalidExamples`
- `'repair'` -- includes invalid examples in the output with `_meta.repaired: true`
Validation error counts are always available in `stats.validationErrors` regardless of the handling mode.
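For context on the deduplication behavior discussed next, the exact strategy's core idea is hashing a normalized serialization of each example, so that key order, surrounding whitespace, and letter case do not defeat the match. A sketch of that idea (illustrative; the library's normalization rules may differ):

```ts
// Sketch of exact deduplication via a normalized key: sort object keys and
// trim/lowercase string values before serializing. Not the library's code.
function normalizedKey(example: Record<string, unknown>): string {
  const entries = Object.keys(example)
    .sort()
    .map((k) => {
      const v = example[k];
      return [k, typeof v === 'string' ? v.trim().toLowerCase() : v];
    });
  return JSON.stringify(entries);
}

function dedupExact(examples: Record<string, unknown>[]) {
  const seen = new Set<string>();
  const data: Record<string, unknown>[] = [];
  for (const ex of examples) {
    const key = normalizedKey(ex);
    if (!seen.has(key)) {
      seen.add(key);
      data.push(ex);
    }
  }
  return { data, removed: examples.length - data.length };
}
```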
The `deduplicate` function throws an `Error` if the semantic strategy is used without providing an embedder function:

```ts
// Throws: "Semantic dedup requires an embedder function"
await deduplicate(examples, { strategy: 'semantic' });
```

The `exportAs` function throws an `Error` for unsupported format strings:

```ts
// Throws: "Unsupported export format: xml"
exportAs(examples, 'xml' as ExportFormat);
```

Override the default system prompt using the `{schema_description}` placeholder:
```ts
const result = await generate(schema, {
  count: 100,
  llm: myLlm,
  systemPrompt: 'You are a medical expert generating training data.\n\n{schema_description}',
  additionalInstructions: 'All examples must be about cardiology.',
});
```

Remove generated examples that duplicate entries in an existing dataset:
```ts
import { deduplicate } from 'synthdata-gen';

const result = await deduplicate(newExamples, {
  strategy: 'exact',
  existingData: existingDataset,
});

console.log(`Removed ${result.removed} duplicates of existing data`);
```

Provide an embedding function for meaning-level deduplication:
```ts
const result = await deduplicate(examples, {
  strategy: 'semantic',
  threshold: 0.92,
  embedder: async (text) => {
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  },
});
```

Deduplicate based on a subset of fields:
```ts
const result = await deduplicate(examples, {
  strategy: 'near',
  threshold: 0.85,
  ngramSize: 2,
  fields: ['instruction'], // Only compare instruction fields
});
```

Map your schema fields to the roles expected by each export format:
```ts
import { exportOpenAI, exportAlpaca } from 'synthdata-gen';

const qaData = [
  { question: 'What is TCP?', answer: 'TCP is a connection-oriented protocol.' },
];

// Map question -> user, answer -> assistant
const openai = exportOpenAI(qaData, {
  fieldMap: { user: 'question', assistant: 'answer' },
  systemPrompt: 'You are a networking expert.',
});

// Map question -> instruction, answer -> output
const alpaca = exportAlpaca(qaData, {
  fieldMap: { instruction: 'question', output: 'answer' },
});
```

Request multiple examples per LLM call to reduce API costs:
```ts
const result = await generate(schema, {
  llm: myLlm,
  count: 1000,
  batchSize: 10,          // 10 examples per LLM call
  structuredOutput: true, // Request JSON mode if the provider supports it
});
```

Use the template generator directly without the full pipeline:
```ts
import { generateExample, generateExamples } from 'synthdata-gen';

// Single example, deterministic with seed
const example = generateExample(schema, 42);

// Multiple examples, deterministic with base seed
const examples = generateExamples(schema, 100, 42);
```

Generate the prompt that would be sent to an LLM, without calling one:
```ts
import { buildSchemaPrompt, buildSystemPrompt } from 'synthdata-gen';

const schemaPrompt = buildSchemaPrompt(schema);
// "Generate a JSON object with the following structure:\n{ ... }"

const systemPrompt = buildSystemPrompt(schema, undefined, 'Focus on edge cases.');
// Full system prompt with schema description and additional instructions
```

Extract JSON from messy LLM output:
```ts
import { parseJsonResponse } from 'synthdata-gen';

const objects = parseJsonResponse('Here is the result:\n```json\n{"key": "value"}\n```\nDone!');
// [{ key: "value" }]

const array = parseJsonResponse('[{"a": 1}, {"b": 2}]');
// [{ a: 1 }, { b: 2 }]
```

All types are exported from the package entry point:
```ts
import type {
  // LLM interface
  Message,
  LlmCallOptions,
  LlmResponse,
  LlmFunction,

  // Schema
  FieldType,
  SchemaField,
  ExampleSchema,

  // Generation
  GeneratedExample,
  DiversityConfig,
  HeuristicsConfig,
  CustomValidator,
  ValidationConfig,
  RetryConfig,
  CostConfig,
  DedupOptions,
  GenerateOptions,
  ExportFormat,
  ExportOptions,

  // Results
  GenerateResult,
  GenerationStats,
  ValidationResult,
  ValidationError,
  DedupResult,
} from 'synthdata-gen';
```

The `GeneratedExample<T>` type is generic. By default it is `Record<string, unknown> & { _meta?: ... }`. You can narrow it with your own type:
```ts
interface QAPair {
  question: string;
  answer: string;
  category: string;
}

const result = await generate(schema, { count: 10 });
const data = result.data as GeneratedExample<QAPair>[];
```

MIT