Norovirus Capture Sequencing Simulator. Generates realistic paired-end FASTQ from simulated hybrid capture enrichment of Norovirus VP1, calibrated from empirical data in Bhamidipati et al. (2025) Sci Rep 15:20526.
Human Norovirus genotype is defined by the VP1 region (ORF2, ~1,700 bp). Hybrid capture enrichment followed by paired-end sequencing on NovaSeq is the current gold standard for lineage surveillance. This simulator produces controlled FASTQ datasets — with known genotypes, Ct values, and ground truth coverage — so that full analysis pipelines can be benchmarked before deployment.
Empirical calibration source:
Bhamidipati et al. (2025) "Complete genomic characterization of RSV and HuNoV using probe-based capture enrichment." Sci Rep 15:20526 (PMC12216758)
The simulation pipeline has six stages that mirror the wet-lab workflow:
1. Ct-to-viral-fraction model. Each sample's Ct value is converted to a
viral fraction using vf = 0.03 * 2^((25 - ct) / 3.32). In standard qPCR
theory a 3.32 Ct difference corresponds to a 10-fold change in template
concentration (2^3.32 ≈ 10). The formula as written uses 3.32 as the divisor
of a base-2 exponent, so the viral fraction halves for every 3.32 Ct
increase, a shallower slope than the theoretical 10-fold relationship.
The calibration parameters (base fraction 0.03 at Ct 25) were tuned to
produce coverage outputs consistent with the empirical results in Table 1
of Bhamidipati et al. (2025); they are not directly reported in the paper.
Under this formula a Ct of 20 gives ~8.5% viral reads and a Ct of 35 gives
~0.37%. The total pre-capture library (default 500,000 fragments) is split
into n_viral and n_background based on this fraction.
2. Fragment generation. Viral fragments are sampled from the reference genome using a truncated normal length distribution (mean=380 bp, sd=80 bp, range 200-600 bp) matching heat fragmentation at 94 °C for 10 minutes. Each fragment records its genomic coordinates, strand, GC content, and whether it overlaps VP1. Background fragments come from human DNA (GC ~0.41), gut microbiome (GC ~0.52), or wastewater environmental sources (GC ~0.48), mixed according to sample type (stool: 80/20 human/microbiome; wastewater: 10/30/60 human/microbiome/environmental).
3. Capture enrichment. Hybrid capture is modelled per-fragment. Viral
fragments that overlap VP1 are retained with probability
P = exp(-6.0 * (gc - 0.47)^2), a Gaussian centered on the optimal probe GC
of 0.47. Fragments outside VP1 and all background fragments pass through at
the configured off-target rate (default 59.2%, from 1 - 0.408 on-target rate
in Table 1 of Bhamidipati et al.). This produces on-target and off-target
fragment pools.
4. PCR duplicate model. Each captured fragment is assigned a copy count
drawn from a geometric distribution with p = 1 - dup_rate. At the default
40% duplicate rate, most fragments appear once, but some are amplified 2-5x,
matching the 12-20 PCR cycles used in post-capture amplification.
5. Read simulation. The duplicated fragment pool is written as a PBSIM3 TSV (sequence, start, end per fragment) and passed to art_modern, which generates paired-end 2x150 bp reads with Illumina NovaSeq-calibrated base quality scores and error profiles. The output is deinterleaved into R1/R2 FASTQ files and gzip-compressed.
6. Truth and QC. Ground truth VP1 coverage is computed directly from
fragment coordinates: mean depth and the fraction of VP1 bases at >= 20x.
Each sample receives a completeness call following the criteria in the
paper: complete (>= 90% of VP1 at >= 20x), low_coverage (>= 90% breadth but
< 20x mean), or incomplete (< 90% breadth).
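The formulas in stages 1, 3, and 4 can be sketched in a few lines of Python. This is an illustrative sketch only, not the package's actual code: the function names are hypothetical, the cap at 1.0 in `viral_fraction` is an added assumption, and the geometric draw is assumed to start at one copy.

```python
import math
import random


def viral_fraction(ct: float, base_fraction: float = 0.03, ref_ct: float = 25.0) -> float:
    """Stage 1 formula as written above; the cap at 1.0 is an assumption."""
    return min(1.0, base_fraction * 2 ** ((ref_ct - ct) / 3.32))


def split_library(ct: float, total_fragments: int = 500_000) -> tuple[int, int]:
    """Split the pre-capture library into (n_viral, n_background)."""
    n_viral = round(total_fragments * viral_fraction(ct))
    return n_viral, total_fragments - n_viral


def capture_probability(gc: float, optimal_gc: float = 0.47, k: float = 6.0) -> float:
    """Stage 3 retention probability for a VP1-overlapping fragment."""
    return math.exp(-k * (gc - optimal_gc) ** 2)


def pcr_copies(rng: random.Random, dup_rate: float = 0.40) -> int:
    """Stage 4 copy count: geometric on {1, 2, ...} with p = 1 - dup_rate."""
    p = 1.0 - dup_rate
    copies = 1
    while rng.random() >= p:  # keep amplifying with probability dup_rate
        copies += 1
    return copies


rng = random.Random(42)
print(split_library(ct=25.0))          # (15000, 485000)
print(capture_probability(gc=0.47))    # 1.0 at the optimal probe GC
print([pcr_copies(rng) for _ in range(10)])
```

With p = 1 - dup_rate the expected copy count is 1 / p, so the expected duplicate fraction 1 - p equals the configured duplicate rate.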
- Python >= 3.11
- art_modern (external binary, see below)
- Python dependencies: click, biopython, numpy, scipy

git clone <repo>
cd nocasim
uv venv .venv && source .venv/bin/activate
uv pip install -e ".[dev]"

Download the .deb from the art_modern releases page, then install it:

sudo dpkg -i art-modern_*.deb
sudo apt-get install -f -y

Verify the install:

art_modern --version

Pre-built images are available from Google Artifact Registry:

docker pull us-docker.pkg.dev/general-theiagen/theiagen/nocasim:v0.3.0

Run a single sample:

docker run --rm -v $(pwd)/results:/output \
us-docker.pkg.dev/general-theiagen/theiagen/nocasim:v0.3.0 single \
--reference /opt/nocasim/data/references/GII.4.fasta \
--ct 28.0 --outdir /output --art-modern art_modern

Run batch mode with a custom sample sheet:

docker run --rm \
-v $(pwd)/samples.tsv:/data/samples.tsv \
-v $(pwd)/results:/output \
us-docker.pkg.dev/general-theiagen/theiagen/nocasim:v0.3.0 simulate \
--sample-sheet /data/samples.tsv \
--references /opt/nocasim/data/references/ \
--art-modern art_modern \
--outdir /output

To build locally instead:

docker build -t nocasim .

nocasim single \
--reference data/references/GII.4.fasta \
--ct 28.0 \
--outdir results/ \
--art-modern art_modern

nocasim simulate \
--sample-sheet samples.tsv \
--references data/references/ \
--art-modern art_modern \
--outdir results/

Use --sample-type wastewater to simulate wastewater surveillance samples.
The background composition shifts from stool (80% human / 20% microbiome) to
wastewater (10% human / 30% microbiome / 60% environmental):
nocasim simulate \
--sample-sheet samples.tsv \
--references data/references/ \
--art-modern art_modern \
--sample-type wastewater \
--outdir results/

Optionally provide a wastewater background FASTA for realistic environmental sequences instead of synthetic ones:

nocasim simulate \
--sample-sheet samples.tsv \
--references data/references/ \
--art-modern art_modern \
--sample-type wastewater \
--wastewater-bg data/background/wastewater_metagenome.fasta \
--outdir results/

Built-in presets simulate realistic wastewater genotype distributions:

nocasim simulate \
--sample-sheet samples.tsv \
--references data/references/ \
--art-modern art_modern \
--preset us-2024 \
--outdir results/

| Preset | Composition | Source |
|---|---|---|
| us-2024 | GII.17:0.75, GII.4:0.11, GII.2:0.05, GI.1:0.04, GII.6:0.03, GI.3:0.02 | CaliciNet 2024-25 (PMC12205451) |
| diverse | GII.17:0.25, GII.4:0.20, GII.2:0.15, GI.1:0.15, GII.6:0.13, GI.7:0.12 | Synthetic high-diversity |
| gi-dominant | GI.1:0.40, GI.3:0.25, GI.7:0.20, GII.17:0.15 | Synthetic GI-heavy |
| outbreak | GII.17:0.90, GII.4:0.10 | Single-genotype emergence |

For custom mixtures, use --mixture instead:

nocasim simulate \
--sample-sheet samples.tsv \
--references data/references/ \
--art-modern art_modern \
--mixture "GII.17:0.60,GII.4:0.25,GI.1:0.15" \
--outdir results/

--mixture and --preset are mutually exclusive.

nocasim download-probes --outdir data/probes/

This downloads Supplementary File 2 from PMC12216758 and saves it as
data/probes/hunov_probes.txt (BED-like probe interval file).

The sample sheet is a TSV file with three columns:
sample_id genotype ct_value
sample_001 GII.4 24.5
sample_002 GII.17 30.0
sample_003 GI.1 31.7
The genotype value must match the filename stem of a FASTA file in the
--references directory. For example, GII.4 maps to
data/references/GII.4.fasta.
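The stem-matching rule can be illustrated with `pathlib`. This is a sketch under assumptions, not the tool's own lookup code: `index_references` and `resolve_reference` are hypothetical names, and only files ending in `.fasta` are considered.

```python
from pathlib import Path


def index_references(references_dir: str) -> dict[str, Path]:
    """Map each FASTA filename stem (e.g. 'GII.4') to its path."""
    # Path.stem strips only the final suffix, so 'GII.4.fasta' -> 'GII.4'.
    return {p.stem: p for p in sorted(Path(references_dir).glob("*.fasta"))}


def resolve_reference(genotype: str, references_dir: str) -> Path:
    """Return the FASTA path for a genotype, or raise if none matches."""
    references = index_references(references_dir)
    if genotype not in references:
        known = ", ".join(references) or "none"
        raise KeyError(f"No reference for genotype {genotype!r}; available: {known}")
    return references[genotype]
```

Checking every genotype in the sample sheet this way before a long batch run surfaces typos such as `GII4` early.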
The genotype column also accepts comma-separated mixture specs for
simulating wastewater samples with multiple co-circulating lineages:
sample_id genotype ct_value
ww_001 GII.17:0.75,GII.4:0.15,GI.1:0.10 24.5
ww_002 GII.4 28.0
Proportions must sum to 1.0. Samples with a bare genotype name are treated
as single-lineage (equivalent to GII.4:1.0).
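A minimal sketch of how a genotype cell could be parsed and validated, assuming the syntax shown above. The function name `parse_mixture` and the tolerance used for the sum check are assumptions, not part of the documented interface.

```python
def parse_mixture(spec: str, tol: float = 1e-6) -> dict[str, float]:
    """Parse 'GII.17:0.75,GII.4:0.25' or a bare genotype into proportions."""
    spec = spec.strip()
    if ":" not in spec:
        # A bare genotype name is a single-lineage sample.
        return {spec: 1.0}

    mixture: dict[str, float] = {}
    for part in spec.split(","):
        genotype, _, proportion = part.strip().partition(":")
        if not genotype or not proportion:
            raise ValueError(f"Malformed mixture entry: {part!r}")
        if genotype in mixture:
            raise ValueError(f"Genotype listed twice: {genotype!r}")
        mixture[genotype] = float(proportion)

    total = sum(mixture.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"Proportions sum to {total:.4f}, expected 1.0")
    return mixture


print(parse_mixture("GII.17:0.75,GII.4:0.15,GI.1:0.10"))
print(parse_mixture("GII.4"))  # {'GII.4': 1.0}
```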
Alternatively, use --preset or --mixture CLI flags to apply a mixture
to all samples at once (see the mixture presets above).
Per-sample mixture syntax in the TSV overrides CLI flags.
Example sample sheets are included at data/samples.tsv (single-lineage)
and data/samples_mixture.tsv (multi-lineage).
The repository includes 35 reference sequences (9 GI + 26 GII genotypes)
from the CDC Calicivirus Typing Tool.
One FASTA file per genotype in data/references/, where the filename stem
matches the genotype string in the sample sheet:
data/references/
├── GI.1.fasta (EU085529, 3081 bp)
├── GI.2.fasta (AF435807, 2354 bp)
├── ...
├── GII.4.fasta (X76716, 3881 bp)
├── GII.17.fasta (KJ156329, 3723 bp)
├── ...
└── GII.27.fasta (MK733205, 7308 bp)
References range from partial VP1 (~1,000 bp) to full genome (~7,700 bp).
Full-genome references use the configured vp1_start/vp1_end coordinates
to extract the target region (defaults: 5,100-6,800 bp). Shorter sequences
are treated as the target region in their entirety.
results/
├── summary.tsv
└── sample_001/
├── sample_001_R1.fastq.gz
├── sample_001_R2.fastq.gz
└── sample_001_manifest.json
summary.tsv contains one row per sample with columns: sample_id,
genotype, ct_value, vp1_mean_depth, vp1_completeness_20x,
completeness_call, lineage_detail. For multi-lineage samples,
lineage_detail contains per-lineage stats in the format
GII.17:67.1x/0.94;GII.4:13.4x/0.82 (depth/completeness per genotype).
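For downstream benchmarking scripts, the `lineage_detail` string can be unpacked as sketched below. The parser assumes exactly the `genotype:depthx/completeness` format shown above; `parse_lineage_detail` is a hypothetical helper name.

```python
def parse_lineage_detail(detail: str) -> dict[str, tuple[float, float]]:
    """Return {genotype: (mean_depth, completeness)} from a lineage_detail cell."""
    stats: dict[str, tuple[float, float]] = {}
    for entry in detail.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate an empty cell or a trailing semicolon
        genotype, _, values = entry.partition(":")
        depth_text, _, completeness_text = values.partition("/")
        # Depth carries a trailing 'x' (e.g. '67.1x'); completeness is a fraction.
        stats[genotype] = (float(depth_text.rstrip("x")), float(completeness_text))
    return stats


print(parse_lineage_detail("GII.17:67.1x/0.94;GII.4:13.4x/0.82"))
# {'GII.17': (67.1, 0.94), 'GII.4': (13.4, 0.82)}
```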
sample_001_manifest.json contains per-sample ground truth metrics including
achieved on-target rate, duplicate rate, mean VP1 depth, and completeness call.
For multi-lineage samples, the manifest also includes mixture (input
proportions), per_lineage (per-genotype coverage stats), and aggregate
(weighted overall stats).
Completeness calls follow the paper's criteria:
- complete: >= 20x coverage across >= 90% of VP1
- low_coverage: >= 90% of VP1 breadth covered but mean depth < 20x
- incomplete: < 90% of VP1 covered
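The sketch below shows one way to derive per-base VP1 depth from fragment coordinates and apply these calls. It rests on several assumptions that the text does not specify: coordinates are 0-based and half-open, "breadth" means depth >= 1, and a sample that reaches 90% breadth without reaching 90% of bases at 20x falls through to low_coverage. Function names are hypothetical.

```python
def vp1_depth(fragments: list[tuple[int, int]], vp1_start: int, vp1_end: int) -> list[int]:
    """Per-base depth over [vp1_start, vp1_end) using a difference array."""
    length = vp1_end - vp1_start
    delta = [0] * (length + 1)
    for start, end in fragments:
        lo = max(start, vp1_start) - vp1_start
        hi = min(end, vp1_end) - vp1_start
        if lo < hi:  # fragment overlaps the target region
            delta[lo] += 1
            delta[hi] -= 1
    depth, running = [], 0
    for change in delta[:length]:
        running += change
        depth.append(running)
    return depth


def completeness_call(depth: list[int], min_depth: int = 20, min_breadth: float = 0.90) -> str:
    """Classify a sample from its per-base VP1 depth (assumed interpretation)."""
    n = len(depth)
    breadth = sum(d > 0 for d in depth) / n
    fraction_at_min_depth = sum(d >= min_depth for d in depth) / n
    if fraction_at_min_depth >= min_breadth:
        return "complete"
    if breadth >= min_breadth:
        return "low_coverage"
    return "incomplete"


depth = vp1_depth([(5000, 7000)] * 25, vp1_start=5100, vp1_end=6800)
print(completeness_call(depth))  # complete
```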
| Parameter | Default | Source |
|---|---|---|
| --read-len | 150 | NovaSeq 2x150 bp |
| --dup-rate | 0.40 | 12-20 post-capture PCR cycles |
| --off-target | 0.592 | 1 - 0.408 on-target rate (Table 1) |
| --total-fragments | 500000 | Pre-capture library size |
| --sample-type | stool | Background composition model (stool or wastewater) |
| --preset | (none) | Built-in mixture preset (us-2024, diverse, gi-dominant, outbreak) |
| --mixture | (none) | Custom mixture spec, e.g. "GII.17:0.75,GII.4:0.25" |
| --seed | 42 | RNG seed for reproducibility |

| Source | Stool | Wastewater |
|---|---|---|
| Human | 80% | 10% |
| Microbiome | 20% | 30% |
| Environmental | 0% | 60% |

Background FASTA files (--human-bg, --microbiome-bg, --wastewater-bg) are
optional. When omitted, synthetic sequences with realistic GC content profiles
are generated as placeholders.
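As an illustration of the background model, the sketch below splits a background fragment count by sample type and draws a placeholder sequence at a target GC content. The proportions and GC values come from this README; the function names, the remainder handling, and the independent per-base sampling are assumptions rather than the simulator's actual method.

```python
import random

BACKGROUND_MIX = {
    "stool": {"human": 0.80, "microbiome": 0.20, "environmental": 0.00},
    "wastewater": {"human": 0.10, "microbiome": 0.30, "environmental": 0.60},
}
SOURCE_GC = {"human": 0.41, "microbiome": 0.52, "environmental": 0.48}


def background_counts(n_background: int, sample_type: str) -> dict[str, int]:
    """Allocate background fragments to sources; counts always sum to n_background."""
    mix = BACKGROUND_MIX[sample_type]
    counts = {source: int(n_background * share) for source, share in mix.items()}
    # Assumption: any rounding remainder goes to the largest source.
    largest = max(mix, key=mix.get)
    counts[largest] += n_background - sum(counts.values())
    return counts


def synthetic_sequence(rng: random.Random, length: int, gc: float) -> str:
    """Draw bases independently so that the expected GC fraction equals gc."""
    weights = [(1 - gc) / 2, gc / 2, gc / 2, (1 - gc) / 2]  # A, C, G, T
    return "".join(rng.choices("ACGT", weights=weights, k=length))


rng = random.Random(42)
print(background_counts(1000, "wastewater"))
print(synthetic_sequence(rng, 60, SOURCE_GC["human"]))
```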
Fragment length distribution is a truncated normal (mean=380 bp, sd=80 bp, range 200-600 bp) matching heat fragmentation at 94 °C for 10 minutes as described in the paper. Sonication produces a log-normal length distribution instead, so do not substitute a log-normal distribution here.
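A truncated normal can be sampled with nothing beyond the standard library by rejecting draws that fall outside the range. This is a sketch with a hypothetical function name; rounding to whole base pairs before the range check is an assumption, and the simulator itself may use a different sampler (for example from scipy).

```python
import random


def sample_fragment_length(
    rng: random.Random,
    mean: float = 380.0,
    sd: float = 80.0,
    lo: int = 200,
    hi: int = 600,
) -> int:
    """Rejection-sample a fragment length from a normal truncated to [lo, hi]."""
    while True:
        length = round(rng.gauss(mean, sd))
        if lo <= length <= hi:
            return length


rng = random.Random(42)
lengths = [sample_fragment_length(rng) for _ in range(10_000)]
print(min(lengths), max(lengths), sum(lengths) / len(lengths))
```

The bounds sit more than two standard deviations from the mean, so only a small share of draws is rejected and the loop terminates quickly.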
pytest tests/ -v

Out of scope for this simulator:

- ORF1 / polymerase region (recombination makes it irrelevant for genotyping)
- Intra-lineage quasispecies diversity or SNV simulation
- Recombinant genomes (chimeric ORF1/ORF2 junctions)
- RNA extraction efficiency
- Reverse transcription efficiency