
GLiNER2 Synthetic Data Generation


Project Structure

Here is an overview of the key files and directories for the synthetic data generation framework:

.
├── pyproject.toml                     <- Project dependencies and tool config
├── design.md                          <- Project documentation
├── generate.py                        <- CLI script for synthetic dataset generation
├── generate.ipynb                     <- Interactive notebook for dataset generation
├── train.py                           <- Minimal fine-tuning and evaluation script
│
├── datagen                            <- Synthetic data generation framework
│   │
│   ├── api                            <- Public Python API and CLI entrypoints
│   │   ├── generator.py               <- Main `DataGenerator` public API
│   │   └── cli.py                     <- CLI implementation
│   │
│   ├── runtime                        <- LLM client and config management
│   │   ├── config.py                  <- Environment variable loading
│   │   └── llm.py                     <- LLM provider client wrapper
│   │
│   ├── workflow                       <- LangGraph execution engine
│   │   ├── graph.py                   <- Pipeline assembly and state machine
│   │   ├── context.py                 <- Execution context management
│   │   └── nodes                      <- Graph step implementations
│   │       ├── decomposition.py       <- Task specification parsing
│   │       ├── planning.py            <- Diversity constraint planning
│   │       ├── generation.py          <- Example generation execution
│   │       ├── validation.py          <- Quality control checks
│   │       ├── retry.py               <- Failed example retry logic
│   │       └── finalize.py            <- Workflow state packaging
│   │
│   ├── prompting                      <- System prompts and rendering
│   │   ├── decomposition.py           <- Parsing prompt templates
│   │   ├── generation.py              <- Generation prompt templates
│   │   └── templates.py               <- Reusable template formatting
│   │
│   ├── prompts                        <- Raw markdown prompt templates
│   │   ├── decomposition              <- Inference of GLiNER2 task shapes
│   │   └── generation                 <- LLM guidelines for example creation
│   │
│   ├── planning                       <- Diversity generation
│   │   ├── constraints.py             <- Constraint models
│   │   └── defaults.py                <- Fallback diversity values
│   │
│   ├── validation                     <- Quality control and metrics
│   │   ├── records.py                 <- Grounding and shape validation
│   │   └── metrics.py                 <- Success and retry tracking
│   │
│   ├── domain                         <- Core system models
│   │   ├── specs.py                   <- Task specification schemas
│   │   ├── examples.py                <- Output record formats
│   │   └── state.py                   <- Graph execution state models
│   │
│   └── io                             <- Output formatting
│       └── reporting.py               <- Markdown summary generation
│
└── misc                               <- Plots and assets used in docs

Environment Setup

Install uv if needed:

curl -LsSf https://astral.sh/uv/install.sh | sh

Create the environment and install everything:

uv sync --all-groups

Design Reasoning

Architecture Overview

The synthetic data generation framework for GLiNER2 is implemented in the datagen package. It is designed to be fully compatible with the GLiNER2 ecosystem, supporting all core task types: Named Entity Recognition (NER), Single-label and Multi-label Classification, Relation Extraction, and JSON Extraction.

At its core, it wraps a LangGraph workflow that methodically guides the generation process from a free-form task description to a validated dataset of GLiNER2-compatible JSONL records.

The key design principle is that deterministic logic runs first, and LLM generation only receives pre-validated, structured context.

Datagen Pipeline Overview

The internal workflow relies on specific state schemas to maintain structure throughout the pipeline. Core models like GenerationState track the overall pipeline execution, while ExampleConstraint strictly governs the requirements and diversity constraints mapped to each specific example. To ensure correctness and schema adherence, the pipeline also relies on GLiNER2's native classes such as InputExample as validation-oriented structures.

Output Schemas Overview

Process Walkthrough

The workflow in datagen/workflow/graph.py is intentionally organized as a short sequence of focused steps:

  1. decompose_spec
  2. plan_constraints
  3. generate_examples
  4. validate_and_measure
  5. prepare_retry
  6. finalize_output

That ordering is part of the design. It gives the model less ambiguity at each stage and makes it easier to reason about failure modes, coverage, and quality.
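The staged flow can be sketched in plain Python. This is an illustrative stand-in only: the real pipeline wires these steps into a LangGraph state machine in datagen/workflow/graph.py, and every function body below is a placeholder.

```python
# Plain-Python sketch of the staged flow; all bodies are illustrative
# stand-ins for the real LangGraph nodes.

def decompose_spec(state: dict) -> dict:
    # The real step calls a reasoning model to infer the task spec.
    state["spec"] = {"tasks": ["classification"], "labels": ["pos", "neg"]}
    return state

def plan_constraints(state: dict) -> dict:
    labels = state["spec"]["labels"]
    # Round-robin label assignment keeps classes balanced.
    state["constraints"] = [{"example_id": i, "label": labels[i % len(labels)]}
                            for i in range(state["n_examples"])]
    return state

def generate_examples(state: dict) -> dict:
    # The real step calls the generation model once per constraint.
    state["candidates"] = [{"text": f"example {c['example_id']}", **c}
                           for c in state["constraints"]]
    return state

def validate_and_measure(state: dict) -> dict:
    # The real step runs shape, grounding, and duplicate checks.
    state["accepted"] = [c for c in state["candidates"] if c["text"]]
    return state

def finalize_output(state: dict) -> dict:
    state["result"] = state["accepted"]
    return state

PIPELINE = [decompose_spec, plan_constraints, generate_examples,
            validate_and_measure, finalize_output]

state = {"n_examples": 4}
for step in PIPELINE:
    state = step(state)
```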

Decompose Spec

A core challenge of the system is determining which GLiNER2 tasks are implicitly requested by a natural-language prompt such as "Extract company names and classify sentiment." This task-type inference happens first, in decompose_spec.

Rather than asking one prompt to simultaneously infer the schema and write examples, the pipeline first uses a reasoning model (DATAGEN_DECOMPOSE_MODEL) to convert the free-form description into a normalized TaskSpec. That spec captures:

  • the inferred task families (ner, classification, relation_extraction, json_extraction)
  • the supporting schema details each task needs, such as entity types, labels, relation fields, or JSON fields
  • domain and subtopic pools that can seed diverse examples
  • optional planning hints, but only when the prompt makes them clear enough to trust
  • whether the request implies composed multi-task examples via requires_composition

This decomposition step is also where we keep the output schema honest. TaskSpec is validated so, for example, NER cannot be inferred without entity types and classification cannot be inferred without labels. That gives downstream steps a stable contract instead of a loosely interpreted prompt.
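The "honest schema" constraint can be illustrated with a minimal dataclass sketch. The real TaskSpec in datagen/domain/specs.py is richer, and the field names here are assumptions for demonstration.

```python
from dataclasses import dataclass, field

# Illustrative sketch: a spec cannot claim a task family without the
# supporting details that family needs. Field names are assumptions.

@dataclass
class TaskSpec:
    tasks: list[str]
    entity_types: list[str] = field(default_factory=list)
    labels: list[str] = field(default_factory=list)

    def __post_init__(self):
        if "ner" in self.tasks and not self.entity_types:
            raise ValueError("NER inferred but no entity types provided")
        if "classification" in self.tasks and not self.labels:
            raise ValueError("classification inferred but no labels provided")

spec = TaskSpec(tasks=["ner", "classification"],
                entity_types=["company"],
                labels=["positive", "negative"])
```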

Plan Constraints

Once the spec is fixed, plan_constraints converts it into one ExampleConstraint per requested example. This is the main mechanism we use to enforce diversity and control label balance before generation starts.

Each constraint can vary:

  • domain and subtopic, to avoid collapsing the dataset into one semantic niche
  • tone and perspective, to expose the model to different surface forms
  • length and sentence complexity, to cover both simple and denser examples
  • negation and ambiguity flags, to deliberately include harder cases
  • approximate entity and relation budgets, to avoid uniform annotation density

The important split is deterministic versus stochastic planning. Classification label assignment is deterministic: labels are rotated in round-robin fashion so the requested classes stay balanced across the run, including multi-label tasks when relevant. Other dimensions are diversified with a mix of cycling and sampling, while still honoring any prompt-derived hints from decomposition. If the prompt does not provide useful domain coverage, the planner falls back to small generic defaults rather than leaving diversity unspecified.

At a concrete level, ExampleConstraint is the handoff object between planning and generation. A typical planned item looks like:

ExampleConstraint(
    example_id=7,
    classification_targets={"sentiment": ["negative"]},
    domain="customer reviews",
    subtopic="service complaints",
    tone="casual",
    perspective="first_person",
    length="medium length (2-3 sentences)",
    sentence_complexity="moderate syntax",
    has_negation=False,
    has_ambiguity=True,
    n_entities=2,
    n_relations=0,
)

The filling logic is intentionally simple and explicit: prompt-derived hints override defaults when trustworthy; classification targets are assigned deterministically for label balance; domain, subtopic, tone, perspective, length, and complexity are cycled from small pools; and flags or annotation budgets such as has_negation, has_ambiguity, n_entities, and n_relations are lightly sampled within bounded ranges, only when the inferred task type makes them relevant. generate_examples then consumes exactly one such constraint per request, so the model is not asked to invent the target label mix or diversity pattern on its own, and prepare_retry can resend only the failed constraints without replanning the whole batch.
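The deterministic label rotation can be sketched in a few lines (illustrative only, not the actual planner code, which also cycles domain, tone, and the other dimensions):

```python
from collections import Counter
from itertools import cycle

# Sketch of deterministic round-robin label assignment for balance.

def assign_labels(labels: list[str], n_examples: int) -> list[str]:
    rotation = cycle(labels)
    return [next(rotation) for _ in range(n_examples)]

assigned = assign_labels(["positive", "negative", "neutral"], 10)
balance = Counter(assigned)
# With 10 examples over 3 labels, counts differ by at most one.
```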

Generate Examples

generate_examples then asks the generation model (DATAGEN_GENERATION_MODEL) for one candidate per constraint, using the normalized task spec plus the specific per-example blueprint.

Diversity is also encouraged at sampling time by using a relatively high generation temperature, so outputs do not collapse too quickly toward the same phrasing or structure.

This is where multi-task composition is made concrete. When the decomposed spec says a request should compose multiple tasks, the generation prompt explicitly instructs the model to merge them into one coherent output object instead of emitting disconnected mini-examples. For example, one record may contain both "entities" and "classifications", or pair relation extraction with JSON extraction, as long as all annotations are supported by the same input text.

Giving the model both the full task spec and a single structured constraint makes the request much more precise than a flat natural-language instruction. It also helps the generated records stay internally consistent: the text, the task payloads, and the intended label assignment are all planned together.

Validate and Measure

Before any candidate is accepted, validate_and_measure runs it through a local validation pipeline in datagen/validation/records.py.

Current checks include:

  • top-level shape validation for the GLiNER2-compatible record structure
  • conversion into GLiNER2 InputExample objects to catch malformed payloads
  • duplicate filtering within the current run and against already accepted examples
  • content validation so examples contain at least one supported task
  • grounding checks that require extracted values to literally appear in the source text
  • relation-consistency checks across the dataset so a relation name does not drift in shape between examples

This step also computes run statistics, including classification balance over the accepted examples from that validation pass. In practice, validation is where we turn "the model produced JSON" into "the pipeline accepts this as usable training data."
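A minimal version of the literal grounding check might look like the following. The record shape shown is a simplified assumption; the real checks in datagen/validation/records.py cover shape, duplicates, and relation consistency as well.

```python
# Sketch of a literal grounding check: every extracted span must appear
# verbatim in the source text. Record shape is a simplified assumption.

def is_grounded(record: dict) -> bool:
    text = record.get("text", "")
    for entity in record.get("entities", []):
        if entity.get("text", "") not in text:
            return False
    return True

good = {"text": "Acme Corp shipped late.",
        "entities": [{"type": "company", "text": "Acme Corp"}]}
bad = {"text": "The shipment was late.",
       "entities": [{"type": "company", "text": "Acme Corp"}]}
```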

Finalize Output

If validation leaves gaps and retries remain, prepare_retry sends only the missing or failed constraints back through generation. Accepted examples are preserved, so retries focus on recovery rather than regenerating the entire batch.

finalize_output then returns the accepted examples as a list of InputExample objects, computes final aggregate metrics such as classification balance and entity balance, and packages the state returned by the public API. This keeps the external interface simple even though the internal workflow is staged and retry-aware.

Architecture Decisions and Tradeoffs

The main architectural decision was to separate decomposition, planning, generation, and validation instead of relying on one monolithic prompt.

  • Benefit: reliability improves because each step has a narrower job and a clearer contract.
  • Benefit: modularity improves because the spec, planner, generator, and validator can evolve independently.
  • Benefit: model choice becomes more flexible because decomposition and generation can use different models (DATAGEN_DECOMPOSE_MODEL vs. DATAGEN_GENERATION_MODEL).
  • Tradeoff: total request count and end-to-end latency increase relative to a single-call design.
  • Mitigation: bounded parallelism (DATAGEN_GENERATION_PARALLELISM) reduces the wall-clock cost of the generation stage.

Another explicit tradeoff is using JSON mode plus local validation as the primary control mechanism, instead of provider-enforced structured outputs. We also deliberately avoid asking the model to emit the final, fully structured record JSON directly and validating only after the fact: that approach usually consumes more tokens, because the model has to reproduce the entire schema shape and field structure on every sample, which increases both latency and generation cost.

  • Benefit: the prompting stack stays simpler and more portable across providers.
  • Benefit: generation remains lightweight even for composed or moderately nested schemas.
  • Tradeoff: syntactic guarantees are weaker up front, so correctness depends on strict downstream validation and retries.

For this submission, that tradeoff felt reasonable: local validation already checks shape, grounding, consistency, and duplicates, so the system does not equate "valid JSON" with "accepted example." Empirically, across the generated train and held-out datasets, using OpenAI and Anthropic models for a total of 400 examples, we observed 0 validation errors or retries due to formatting, so structured-output enforcement did not appear necessary to avoid malformed generations in this run.

Current Limitations

The current design is reliable for small to medium runs, but several limitations remain:

  • accepted examples are accumulated in memory and only packaged at the end, which is less durable for long or expensive runs
  • observability is intentionally light, so there is limited built-in tracing for latency, cost, and failure analysis
  • diversity planning is still heuristic and relatively coarse; it varies several useful dimensions, but it does not yet model richer personas or deeper semantic coverage explicitly
  • grounding checks are intentionally strict but still simple, relying mainly on literal text presence rather than stronger semantic verification

These are known limitations of the current submission; the Future Work section below describes how we would address them in a fuller iteration.

Testing Strategy

The datagen package is covered by focused unit tests in tests/test_datagen.py. These tests target the behaviors that matter for this submission: deterministic planning, schema normalization, conversion into native GLiNER2 InputExample objects, validation rules such as grounding and relation-shape consistency, and the local CLI, reporting, and runtime helpers around generation.

The goal is to verify the contract of the generation pipeline rather than the quality of a particular remote model response. In practice, that means the tests check that datagen produces valid, internally consistent training records and that its deterministic support logic behaves predictably under retry, validation, and configuration scenarios.

Run the datagen test file with:

uv run pytest tests/test_datagen.py

Environment Variables

The generator reads configuration from environment variables through datagen/runtime/config.py.

Minimal provider setup:

# For OpenAI
OPENAI_API_KEY=your_openai_key

# For Anthropic
ANTHROPIC_API_KEY=your_anthropic_key

Optional datagen-specific settings:

See .env.example for the documented DATAGEN_* settings, example values, and inline explanations for each variable.

Notes:

  • DATAGEN_DECOMPOSE_MODEL is the model used in the first graph step, where the task description is converted into a normalized spec with task types, labels, entity types, relation shapes, and JSON schema hints.
  • DATAGEN_GENERATION_MODEL is the model used in the main generation step, where per-example requests are created from the spec plus the planned diversity and label-balance constraints.
  • DATAGEN_GENERATION_PARALLELISM limits how many generation requests can run at once.
  • DATAGEN_REQUESTS_PER_MINUTE caps the total outgoing LLM request rate across the client and defaults to 200.
  • You can keep both on the same model for simplicity, or use a cheaper/faster model for DATAGEN_DECOMPOSE_MODEL and a stronger model for DATAGEN_GENERATION_MODEL.
  • LiteLLM uses the provider-specific credentials for the selected model. For OpenAI models, set OPENAI_API_KEY. For Anthropic models, set ANTHROPIC_API_KEY. Other providers supported by LiteLLM can also be used by setting their respective API keys.
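A sketch of the environment-driven loading described above. The model-name defaults here are hypothetical placeholders; only the 200 requests-per-minute default comes from the documentation, and the real datagen/runtime/config.py may differ.

```python
import os

# Sketch of environment-driven configuration; defaults for model names
# are hypothetical, the 200 RPM default matches the documented behavior.

def load_config() -> dict:
    return {
        "decompose_model": os.environ.get(
            "DATAGEN_DECOMPOSE_MODEL", "gpt-4o-mini"),      # placeholder
        "generation_model": os.environ.get(
            "DATAGEN_GENERATION_MODEL", "gpt-4o"),          # placeholder
        "parallelism": int(os.environ.get(
            "DATAGEN_GENERATION_PARALLELISM", "4")),        # placeholder
        "requests_per_minute": int(os.environ.get(
            "DATAGEN_REQUESTS_PER_MINUTE", "200")),         # documented default
    }

os.environ["DATAGEN_GENERATION_PARALLELISM"] = "8"
config = load_config()
```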

Python API

Use the public datagen.DataGenerator class for programmatic generation:

from datagen import DataGenerator

generator = DataGenerator()
examples = generator.generate(
    "Extract company names and classify sentiment.",
    n=3,
)

API notes:

  • generate(task_description, n) returns a list[dict] of accepted GLiNER2-compatible examples
  • generate_state(task_description, n) returns the finalized internal workflow state, including the inferred spec, planned constraints, accepted records, and run stats
  • n must be a positive integer; otherwise the API raises ValueError
  • If the workflow cannot produce enough valid examples after exhausting retries, the API raises RuntimeError
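Callers can guard against both documented failure modes. The stub below only mirrors the contract so the control flow is runnable without a provider key; it is not the real DataGenerator.

```python
# Stub mirroring the documented DataGenerator error contract, so the
# caller-side handling is visible without a live LLM provider.

class DataGeneratorStub:
    def generate(self, task_description: str, n: int) -> list[dict]:
        if not isinstance(n, int) or n <= 0:
            raise ValueError("n must be a positive integer")
        # A real run raises RuntimeError if retries are exhausted before
        # enough valid examples are accepted.
        return [{"text": f"example {i}"} for i in range(n)]

gen = DataGeneratorStub()
try:
    examples = gen.generate(
        "Extract company names and classify sentiment.", n=3)
except (ValueError, RuntimeError) as exc:
    examples = []
    print(f"generation failed: {exc}")
```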

CLI Usage

Run the CLI from the repository root (optionally also write a custom Markdown run summary):

uv run python -m datagen \
  --task-description "Extract company names and classify sentiment." \
  --num-examples 3 \
  --output-file data/company_sentiment.jsonl \
  --summary-file data/company_sentiment_report.md

Arguments:

  • --task-description: task description
  • --num-examples / -n: number of examples to generate
  • --output-file / -o: required path for the generated examples JSONL file; parent directories are created automatically
  • --summary-file: write a Markdown summary of the inferred task spec, planned constraints, and output statistics
  • --show-config / --no-show-config: print the resolved runtime configuration before running; enabled by default
  • --verbose: enable INFO-level logs for datagen and LiteLLM

Fine-Tune And Evaluate

The repository also includes a minimal train.py script for the bonus task. It loads GLiNER2 JSONL files directly, evaluates the base fastino/gliner2-base-v1 model on a held-out file, fine-tunes on the training file, then evaluates again and writes a JSON metrics report.

For the experiment reported here, I used two independently generated literature datasets:

  • data/literature_multitask-gpt.jsonl: 300 training examples generated with GPT. See data/literature_multitask_report-gpt.md.
  • data/literature_multitask-claude.jsonl: 100 held-out examples generated with Claude. See data/literature_multitask_report-claude.md.

I chose the literature domain because the GLiNER2 paper reports CrossNER literature as one of the weaker settings, so it is a useful stress test for a targeted synthetic-data fine-tune. This is still only a directional internal experiment; a direct comparison against the paper's exact split and protocol is pending.

Run the training script from the repository root:

uv run python train.py \
  --train-path data/literature_multitask-gpt.jsonl \
  --heldout-path data/literature_multitask-claude.jsonl \
  --output-dir outputs/literature_multitask_finetune \
  --wandb-project fastino_interview

Current script behavior:

  • --train-path is required
  • --heldout-path is required
  • --output-dir defaults to outputs/company_sentiment_training
  • --wandb-project is optional; providing it enables W&B logging, while omitting it keeps W&B logging disabled

Evaluation results for the literature experiment are summarized below, comparing the base model against the LoRA fine-tuned model on the Claude held-out set:

Task                              Metric     Base     Fine-tuned  Improvement
NER                               Precision  46.97%   53.16%      +6.19%
NER                               Recall     92.66%   94.21%      +1.54%
NER                               F1         62.34%   67.97%      +5.63%
Classification (literary_period)  Accuracy   86.12%   90.07%      +3.95%

These results suggest that the synthetic literature data provides a useful fine-tuning signal for both tasks. The clearest NER gain is in precision, which improves substantially while recall remains high, leading to a meaningful F1 increase overall. The classification task also improves by nearly four points, which is encouraging, although the result should still be treated as directional because the held-out evaluation set is synthetic rather than a fully independent human-labeled benchmark.

Future Work

The items below are prioritized by balancing expected implementation effort against likely practical payoff for this project. Higher-priority items are the ones that seem most likely to improve cost, reliability, or dataset quality without requiring a disproportionate increase in system complexity, while lower-priority items are either more speculative, more expensive to build, or better suited to a later production-hardening phase.

Prompt Caching for Cost Reduction (High)

Most of the input tokens per generation request consist of the fixed system prompt, task schema, and rules. Since modern LLM APIs (like Anthropic and OpenAI) offer prompt caching that reduces input token costs by approximately 90% (and improves latency), caching these shared prefixes would yield massive cost savings for large runs.

To effectively leverage this:

  • The static pieces of the prompt (system instructions, schemas, formatting rules) must be placed at the very beginning of the prompt context and marked for caching.
  • The dynamic pieces (the specific ExampleConstraint such as length, tone, and domain for the current batch) should be moved to the very end of the prompt so they don't break the cache prefix.
  • Structuring the templates this way ensures that only the tiny dynamic constraint portion incurs the full input token price across the hundreds or thousands of parallel example generation requests.
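As a sketch of that prompt layout, the request below uses Anthropic-style cache_control markers on the static system prefix. The model name and prompt text are placeholders, and other providers expose prefix caching differently.

```python
# Sketch of a cache-friendly request layout: static prefix first and
# marked for caching, dynamic per-example constraint last. Model name
# and prompt text are placeholders.

STATIC_SYSTEM_PROMPT = (
    "You generate GLiNER2 training examples...\n"
    "<task schema and formatting rules go here>"
)

def build_request(constraint: dict) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # cache static prefix
        }],
        # Dynamic per-example constraint goes last so the prefix is stable.
        "messages": [{"role": "user",
                      "content": f"Constraint: {constraint}"}],
    }

req_a = build_request({"tone": "casual", "domain": "reviews"})
req_b = build_request({"tone": "formal", "domain": "finance"})
# The cached prefix is byte-identical across requests; only the tiny
# trailing constraint differs.
```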

Large Or Demanding Runs (High)

The current datagen implementation is optimized for simplicity: the workflow collects accepted examples in memory, and the public Python API returns those records directly as an in-memory list.

That behavior is fine for normal small-scale runs, but it may become a poor fit for larger or more expensive jobs. If a user requests a high number of examples, keeping all accepted records in memory can become unnecessarily wasteful. There is also a durability concern: if a long run fails partway through because of a retry limit, provider error, timeout, or interruption, the accepted examples generated so far are not durably persisted by the Python API itself, even though they already incurred model cost.

One possible future direction would be to incrementally append validated records to a JSONL file as they are accepted, instead of treating the final in-memory list as the primary output boundary. In that design, the generator could return the path to the JSONL file rather than a list of records. That would reduce memory pressure for large runs and preserve partial progress if execution stops before the full request completes.
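A minimal sketch of that incremental-append direction, assuming a hypothetical append_record helper (this is a future-work illustration, not current behavior):

```python
import json
import tempfile
from pathlib import Path

# Sketch of durable incremental output: each accepted record is appended
# to JSONL immediately, so a crash preserves everything accepted so far.

def append_record(path: Path, record: dict) -> None:
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

out = Path(tempfile.mkdtemp()) / "accepted.jsonl"
for i in range(3):
    append_record(out, {"text": f"example {i}"})

# Even if the run stopped here, all appended records would survive.
records = [json.loads(line) for line in out.read_text().splitlines()]
```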

This is intentionally documented as future work only. The current interface and behavior have not been changed yet, and the existing implementation should still be understood as an in-memory API with optional file output layered around it.

Another natural extension is provider-aware rate control: add tokens-per-minute budgeting, inspect LiteLLM/provider rate-limit metadata more deeply, and choose between waiting, backing off, or failing over to another allowed model/vendor pool when request budgets are exhausted.

Data Deduplication And Quality (Medium)

The current pipeline already does part of this job today: it validates the GLiNER2 record shape, rejects malformed examples, and filters exact duplicates within a run by using a normalized input/output fingerprint. That gives a solid baseline for correctness and removes obvious repeated samples, but it still leaves room for stronger dataset hygiene as the system scales.

One useful next step would be broader deduplication beyond exact matches. In particular, semantic-similarity deduplication could help catch examples that are not string-identical but still express nearly the same underlying content. A practical version of that would embed candidate examples into vector space and compare them by similarity, or cluster nearby examples and keep only one representative from each cluster. The same general idea could also be used for lightweight decontamination by checking training candidates against evaluation examples and removing overlaps that are too semantically close.

I also would not use a single fixed similarity threshold for every run. The acceptable distance between examples should probably be dynamic because users can prompt for anything from very narrow domains to broad heterogeneous ones. If the requested domain is narrow, a higher bar for novelty makes sense because semantically near-duplicates are more likely; if the domain is broad, the natural variance should be higher and an overly aggressive threshold could throw away legitimately distinct examples. In practice, that could mean generating a larger candidate pool, validating all candidates, scoring pairwise or cluster-level similarity, removing the examples that fall into the too-similar region, and then selectively retrying generation until the target count is met. The thresholding itself could be calibrated from the batch distribution rather than hard-coded, for example by using distance percentiles, intra-batch variance or standard-deviation-based cutoffs, or heuristics that adapt based on the requested domain granularity.
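A toy sketch of the similarity filter: in practice the vectors would come from an embedding model, and the threshold would be calibrated per run rather than fixed, as discussed above.

```python
import math

# Sketch of similarity-based deduplication over precomputed embeddings.
# The toy 2-D vectors stand in for real embedding-model outputs.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def dedup(embedded: list[tuple[str, list[float]]],
          threshold: float = 0.95) -> list[str]:
    kept: list[tuple[str, list[float]]] = []
    for text, vec in embedded:
        # Keep a candidate only if it is not too close to anything kept.
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

candidates = [
    ("The service was slow.", [1.0, 0.0]),
    ("Service was really slow.", [0.99, 0.05]),   # near-duplicate
    ("Great food, quick delivery.", [0.0, 1.0]),  # distinct
]
unique = dedup(candidates)
```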

Another worthwhile extension would be data quality evaluation beyond structural validation. There are several reasonable options here: an LLM-as-a-judge pass with a concise rubric, a reward model that scores instruction/response quality, or a smaller classifier that predicts whether an example is acceptable for training. I would treat these as optional ranking or filtering layers on top of the current deterministic validation, mainly to improve relevance, diversity, and overall signal quality rather than to replace the existing schema checks.

Entity And Relation Distribution Diversity (Medium)

The current pipeline already pays some attention to output balance for classification tasks, but it does not explicitly manage diversity in the extracted supervision itself, especially for entity, relation, and JSON annotations. That means a dataset can look varied at the prompt or label level while still over-concentrating on a narrow subset of entity types, relation types, span patterns, annotation structures, or repeated slot values.

I think this is an important missing signal of dataset diversity. For extraction-style tasks, it would be useful to track not only the distribution of entity labels and relation labels, but also the diversity of the extracted values themselves. For example, it is not enough for the dataset to contain many company_name annotations if those spans keep collapsing onto the same few values such as Microsoft, Google, or Apple. The same issue applies to relation extraction and JSON extraction: a dataset may have the right schema keys while still reusing a narrow set of entity values, relation arguments, field values, or value combinations. In addition to label counts, I would therefore track example-level annotation counts, span lengths, simple co-occurrence patterns, and value-level frequency distributions or concentration metrics.

In practice, this could be implemented as a lightweight feedback loop: monitor the accepted dataset so far, estimate which label types, structural patterns, and value patterns are currently overrepresented or under-covered, and bias future planning or prompting toward the missing regions. That would not guarantee perfect balance, and it should not force unnatural examples, but it could materially improve coverage and make the resulting corpus more useful for extraction training than relying on prompt diversity alone.
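One cheap concentration signal would be the share of spans taken by the top-k most frequent values. This is a hypothetical metric, not something the pipeline computes today.

```python
from collections import Counter

# Sketch of a value-concentration signal: what share of all spans for a
# label is taken by the k most frequent values. A high share means the
# dataset keeps reusing the same few entities.

def top_k_share(values: list[str], k: int = 3) -> float:
    counts = Counter(values)
    top = sum(count for _, count in counts.most_common(k))
    return top / len(values)

company_spans = ["Microsoft", "Google", "Microsoft", "Apple",
                 "Google", "Microsoft", "Apple", "Nvidia"]
share = top_k_share(company_spans)
# Microsoft(3) + Google(2) + Apple(2) = 7 of 8 spans: heavy concentration.
```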

Observability And Traceability (Medium)

The current submission intentionally keeps observability concerns out of the core datagen package so the interview solution stays focused and avoids extra tracing-specific dependencies. That tradeoff keeps installation and runtime behavior simpler, which felt more appropriate for a compact assessment submission than adding a second layer of instrumentation.

If this were hardened for production or longer-running research workflows, a natural next step would be optional tracing around decomposition and generation calls, plus richer run-level metadata for cost, latency, and failure analysis. LangSmith would be one reasonable option here, but the same idea could also be implemented with another tracing backend or a lightweight custom logging layer.

Multi-Model Resilience And Diversity (Low)

Right now the system assumes a single configured model per run, which keeps behavior predictable and the implementation simple. A useful low-priority extension would be to support multiple models in two distinct ways that solve different problems.

The first is resilience: allow a primary model plus a fallback cascade for cases where a provider is unavailable, an API quota is exhausted, latency becomes unacceptable, or repeated transient errors make progress stall. In that design, the system would try the preferred model first and then fail over through an ordered list of allowed backups so long runs can continue without manual intervention.

The second is diversity: allow sampling from a user-specified pool of models on purpose so the dataset is not generated from a single model's style, bias profile, or failure mode. That could be useful when users want broader stylistic variation or want to reduce the risk that one provider's generation habits dominate the synthetic corpus.

I would likely keep these as separate arguments rather than merging them into one setting, because "fallbacks for reliability" and "multiple models for intentional diversity" are related but operationally different goals. For example, one argument could define an ordered failover list, while another could define a sampling pool plus optional weights. Keeping them separate would make behavior easier to reason about and would avoid ambiguity about whether secondary models should only be used on failure or should participate in normal generation.

Richer Dataset Diversity Through Personas (Low)

The current diversity planner already varies domain, subtopic, tone, length, and complexity, but it still operates with fairly coarse-grained control over perspective. A useful future extension would be to explicitly sample or construct personas during planning so generation can cover a wider range of voices, goals, expertise levels, and communicative contexts.

In practice, that could mean attaching persona-like attributes to each planned example, such as role, background knowledge, intent, region, or writing situation, then conditioning the generation prompt on those attributes. That would make it easier to explore dataset diversity beyond surface style variation and could reduce the risk of producing many semantically similar examples that only differ lexically.

This idea is directly aligned with recent work on persona-driven synthesis, especially Scaling Synthetic Data Creation with 1,000,000,000 Personas, which argues that large persona sets can unlock broader synthetic data coverage. I would treat that as future research rather than a guaranteed improvement here, but it is a strong direction for expanding the diversity-planning layer beyond the current handcrafted constraint pools.

ToDos

  • Hugging Face integration for versioning and management
    • Dataset
    • Model checkpoints
  • Cache the prompts for cost reduction
  • Save-to-disk generation for large runs

About

LangGraph-powered synthetic data pipeline for GLiNER2. Features automated task decomposition, diversity planning, and strict grounding validation.
