scisciuro/automa-tic

automa-tic

Automated GC-MS TIC chromatogram preprocessing.

Version 0.2.0 — see CHANGELOG.md for release notes.

Preprocessing pipeline for GC-MS Total Ion Current (TIC) chromatograms, designed for downstream chemometrics and machine learning.

Handles data from multiple instrument vendors (Agilent, Shimadzu) and produces standardised CSV outputs per measurement.

Pipeline overview

Raw CSV → Cutoff → Resample → Smooth → Baseline correction → RT Alignment → Peak picking → Normalization → Export

Each step is optional or configurable via config.json. All parameters used are recorded in per-sample QC files for full reproducibility.
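As a rough illustration, the first stages (cutoff, resampling, smoothing) can be sketched with numpy/scipy. The function name, defaults, and hard-coded RT range below are illustrative only, not the package's actual API:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

def preprocess_sketch(rt_min, tic, dt_min=100 / 60000.0,
                      window_min=0.015, polyorder=3):
    """Cutoff -> resample to a uniform grid -> Savitzky-Golay smoothing.
    Hypothetical helper mirroring the pipeline stages, not its API."""
    # Cutoff: keep RT 1.6-28.0 min (drops solvent delay / column bleed)
    mask = (rt_min >= 1.6) & (rt_min <= 28.0)
    rt, y = rt_min[mask], tic[mask]
    # Resample onto a uniform grid with spacing dt_min (here 100 ms)
    grid = np.arange(rt[0], rt[-1], dt_min)
    y_res = interp1d(rt, y)(grid)
    # Smoothing window is given in minutes; convert to an odd point count
    win = max(3, int(round(window_min / dt_min)) | 1)
    return grid, savgol_filter(y_res, win, polyorder)
```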

Installation

Requires Python ≥ 3.9. Dependencies:

numpy
scipy
pandas
pybaselines
matplotlib

Install with:

pip install -r requirements.txt

Quick start

Try it on the included demo files:

python run_files.py -c config.example.json

This processes demo_data/shimadzu_demo.csv with sensible defaults (alignment to D3, normalization to IS_heptaneD16) and writes results to output_demo/.

Use it on your own data:

  1. Place your CSV data files in the project directory (or a subfolder).
  2. Copy config.example.json to config.json and edit input_path, out_dir, method, and processing parameters.
  3. Edit methods.json to define anchor substances for your instrument/method.
  4. Run:
python run_files.py

CLI overrides are available for all common options:

python run_files.py -i mydata.csv                        # single file
python run_files.py -i data_folder/ -o results/           # batch mode
python run_files.py -c my_config.json                     # custom config
python run_files.py -i data/ -p "*.csv"                   # batch with glob pattern
python run_files.py -f "250611"                            # only files containing "250611"
python run_files.py -m agilent_standard                    # override method
python run_files.py -s asls_test                           # suffix on output filenames
python run_files.py -i data/ -f "250611" -s batch1 -m shimadzu_standard  # combined

| Flag | Long form | Description |
| ---- | --------- | ----------- |
| -c | --config | Path to config file (default: config.json) |
| -i | --input | Input file or directory (overrides config) |
| -o | --output | Output directory (overrides config) |
| -p | --pattern | Glob pattern for batch mode (overrides config) |
| -f | --filter | Filename filter: only process files whose name contains this string |
| -s | --suffix | Suffix appended to output filenames (overrides run_id) |
| -m | --method | Method preset name from methods.json (overrides config) |

Project structure

gcms_preproc/                   # Python package (automa-tic)
├── __init__.py             # Package init, __version__
├── io/
│   ├── readers.py          # CSV import (Agilent, Shimadzu, generic fallback)
│   └── writers.py          # CSV/JSON export (with optional suffix)
├── core/
│   ├── models.py           # Chromatogram dataclass
│   ├── resampling.py       # Uniform grid interpolation
│   ├── smoothing.py        # Savitzky-Golay filter
│   ├── baseline.py         # Baseline correction (asls, psalsa, aspls)
│   ├── baseevaluation.py   # Baseline quality metrics and parameter sweeps
│   ├── cutoff.py           # RT range or point-based trimming
│   ├── alignment.py        # RT alignment (shift, linear) with peak validation
│   ├── normalization.py    # Normalization (anchor IS, total peak area, fallback)
│   ├── peaks.py            # SNR-based peak picking, EMG deconvolution, quality metrics
│   └── qc.py              # Noise estimation (MAD-based)
├── pipeline.py             # Main orchestrator (single file + batch, progress logging)
├── run_files.py            # CLI entry point, config/method loader, early validation
├── plot-one.py             # Interactive QC plot (matplotlib, suffix support, auto-rescale)
├── config.json             # Your processing parameters (edit for your data)
├── config.example.json     # Working example pointing at the demo data
├── methods.json            # Per-instrument anchor substance definitions
├── demo_data/              # Example chromatograms (one per instrument)
│   ├── shimadzu_demo.csv
│   └── agilent_demo.csv
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── CHANGELOG.md            # Release notes
└── LICENSE                 # MIT

Configuration

Settings are split across two files:

  • config.json — processing parameters and mode selection (what to do)
  • methods.json — per-instrument/method anchor substance definitions (what's available)

This separation means you can switch between instruments or analytical methods without changing processing settings, and vice versa.

config.json reference

Paths and general

| Key | Type | Description |
| --- | ---- | ----------- |
| input_path | string | Input file or directory (relative to config location) |
| out_dir | string | Output directory |
| pattern | string | Glob pattern for batch mode (e.g. "*.csv") |
| filename_filter | string or null | Only process files containing this string (batch mode) |
| method | string | Method preset name from methods.json |
| methods_file | string | Path to methods file (default: "methods.json") |
| dt_min_ms | number | Sampling interval in milliseconds (typically 100) |
| run_id | string or null | Optional label; defaults to filename stem. Also used as output filename suffix when set. |
| blmetrics | bool | If true, log baseline evaluation metrics to baseline_eval_results.csv |

Cutoff

Trim chromatogram before processing. Useful for removing solvent delay or column bleed at the end.

"cutoff": {"mode": "rt", "start_min": 1.60, "end_min": 28.00}

Modes: "none", "rt" (start_min/end_min), "points" (drop_first/drop_last).

Smoothing

Savitzky-Golay filter applied after resampling to uniform grid.

"smoothing": {"window_min": 0.015, "polyorder": 3}

window_min is in minutes and converted to points internally based on dt_min_ms.
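The conversion is straightforward; this hypothetical helper shows the idea (Savitzky-Golay requires an odd window length, the actual internal rounding may differ):

```python
def window_points(window_min, dt_min_ms):
    """Convert a smoothing window from minutes to an odd number of points,
    given the sampling interval in milliseconds. Illustrative helper."""
    dt_min = dt_min_ms / 60000.0              # sampling interval in minutes
    n = max(3, round(window_min / dt_min))
    return n if n % 2 == 1 else n + 1         # enforce an odd window length
```

With the defaults above (0.015 min window at 100 ms sampling), this yields a 9-point window.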

Baseline correction

Three methods available. Switch by changing baseline_method; method-specific parameters are in baseline_presets, shared parameters in baseline_common.

"baseline_method": "asls",
"baseline_presets": {
    "asls":   {"lam": 1e6, "p": 0.01},
    "psalsa": {"lam": 1e5, "p": 0.05},
    "aspls":  {"lam": 1e6}
},
"baseline_common": {"diff_order": 2, "max_iter": 50, "tol": 1e-3}

| Method | Description | Reference |
| ------ | ----------- | --------- |
| asls | Asymmetric least squares | Eilers, Anal Chem 2003, 75:3631 |
| psalsa | Penalized asls with smoothing | Oller-Moreno et al., 2014 |
| aspls | Adaptive smoothness penalized LS (no p, auto-estimated) | Zhang et al., 2020 |

Lambda (lam) is automatically scaled for the sampling rate via dt_ref_min.

Peak picking

SNR-based detection using scipy.signal.find_peaks, with optional EMG-based deconvolution and per-peak quality metrics.

"peaks": {
    "min_snr": 6,
    "min_width_min": 0.008,
    "max_width_min": 0.50,
    "deconvolution": false,
    "deconv_max_components": 4
}

| Key | Description |
| --- | ----------- |
| min_snr | Minimum signal-to-noise ratio. Prominence threshold = min_snr × noise_sigma. 6 ≈ 2× LOD (ICH Q2) |
| min_width_min | Minimum peak FWHM in minutes (filters spikes) |
| max_width_min | Maximum peak FWHM in minutes (filters broad artifacts); null = no limit |
| deconvolution | If true, attempt EMG-based deconvolution of overlapping peaks (see below) |
| deconv_max_components | Maximum number of peak components to fit per region (default 4) |

Width filtering uses FWHM (full width at half maximum, rel_height=0.5). Integration boundaries are determined separately at 95% base width (rel_height=0.95) to capture the full peak area including tails. This prevents tailing peaks from being incorrectly filtered while still measuring their complete area.
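That scheme can be sketched with scipy.signal; the function name and the externally supplied noise estimate are assumptions, not the package's API:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def pick_peaks_sketch(y, dt_min, noise_sigma, min_snr=6, min_width_min=0.008):
    """SNR-thresholded detection: FWHM (rel_height=0.5) for filtering,
    95% base width (rel_height=0.95) for integration boundaries."""
    peaks, _ = find_peaks(y, prominence=min_snr * noise_sigma)
    fwhm_min = peak_widths(y, peaks, rel_height=0.5)[0] * dt_min   # minutes
    keep = fwhm_min >= min_width_min              # drops single-sample spikes
    peaks, fwhm_min = peaks[keep], fwhm_min[keep]
    # Wider boundaries capture tailing without affecting the width filter
    _, _, left_ips, right_ips = peak_widths(y, peaks, rel_height=0.95)
    return peaks, fwhm_min, left_ips * dt_min, right_ips * dt_min
```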

Noise is estimated internally via MAD of first differences (Rousseeuw & Croux, 1993).
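Concretely, that estimator fits in a few lines; the constants follow the usual Gaussian-consistency conventions:

```python
import statistics

def noise_sigma_mad(y):
    """Noise sigma from the MAD of first differences: differencing removes
    slowly varying signal; sqrt(2) undoes the variance doubling of a
    difference of two independent noisy samples; 1.4826 makes the MAD
    consistent with a Gaussian standard deviation."""
    d = [b - a for a, b in zip(y, y[1:])]
    med = statistics.median(d)
    mad = statistics.median(abs(v - med) for v in d)
    return 1.4826 * mad / 2 ** 0.5
```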

Deconvolution

When enabled, the pipeline uses curve fitting to resolve overlapping peaks within merged integration regions. The approach is selective — only regions with multiple detected local maxima are fitted; well-resolved single peaks are left untouched.

The fitting model is the Exponentially Modified Gaussian (EMG), which is the standard chromatographic peak model (Foley & Dorsey, 1983, Anal Chem 55:730). It has 4 parameters per component: amplitude, center (mu), Gaussian width (sigma), and exponential decay (tau). The tau parameter captures the characteristic tailing observed in GC-MS peaks due to column dispersion and interaction kinetics.
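For reference, the EMG is available in scipy as scipy.stats.exponnorm (shape parameter K = tau/sigma). Because the underlying pdf integrates to one, the amplitude of a fitted component equals its analytical area, which is why deconvolved areas can be read directly off the fit. The function name here is illustrative:

```python
import numpy as np
from scipy.stats import exponnorm

def emg_peak(t, amplitude, mu, sigma, tau):
    """Exponentially Modified Gaussian: a Gaussian (mu, sigma) convolved
    with an exponential decay (tau). The pdf integrates to 1, so
    `amplitude` is also the analytical peak area."""
    return amplitude * exponnorm.pdf(t, K=tau / sigma, loc=mu, scale=sigma)
```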

Fitting procedure per region:

  1. Detect local maxima with a low prominence threshold to find shoulders
  2. Estimate initial parameters from signal shape (FWHM → sigma, trailing decay rate → tau)
  3. Fit sum-of-EMGs with constrained centers (±0.01 min around detected maxima)
  4. If EMG fit fails or is poor (normalized RMSE > 0.12), fall back to sum-of-Gaussians
  5. If both fail, keep the region as a single peak
  6. Adjacent deconvolved peak boundaries are set at valley minima in the original signal (no overlap)

Integration regions are merged using a smart criterion: overlapping intervals are only merged when the valley between the two peaks stays above 30% of the smaller peak's height. If the valley drops below this threshold, the regions are kept separate.
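The merge criterion reduces to a one-line test (a sketch with assumed names):

```python
def should_merge(valley_height, height_a, height_b, threshold=0.30):
    """Merge two overlapping integration regions only if the valley between
    them stays above `threshold` of the smaller peak's height."""
    return valley_height > threshold * min(height_a, height_b)
```

For example, a valley at 20% of a 100-unit peak keeps the regions separate, while the same valley next to a 50-unit neighbor (where it sits at 40% of the smaller height) merges them.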

Deconvolved peaks are marked with is_deconvolved = 1.0 in the peak table. Their area is computed from the fitted EMG/Gaussian component (analytical integration), not from the raw signal.

Typical performance overhead: ~0.5–2 seconds per chromatogram (depending on the number of multi-peak regions).

Quality metrics

Each detected peak is assigned quality metrics:

| Metric | Description |
| ------ | ----------- |
| prominence_height_ratio | Prominence / height. Close to 1.0 for well-resolved peaks rising from baseline; low for peaks riding on the tail of a neighbor. |
| tailing_factor | USP tailing factor at 5% peak height (USP <621>, EP 2.2.46). 1.0 = symmetric, >1 = tailing, <1 = fronting. NaN if unmeasurable. |
| local_snr | SNR computed from a local noise estimate (±0.3 min window) rather than the global noise. More representative for chromatograms with non-uniform noise. |
| quality | Combined classification: high, good, medium, or low (see below). |

Quality classification thresholds (all must pass for the level):

| Level | local_snr | prominence_height_ratio | tailing_factor |
| ----- | --------- | ----------------------- | -------------- |
| high | ≥ 50 | ≥ 0.8 | 0.8 – 2.0 |
| good | ≥ 20 | ≥ 0.5 | 0.5 – 3.0 |
| medium | ≥ 10 | ≥ 0.3 | 0.3 – 5.0 |
| low | below medium | | |

Tailing ranges are relaxed compared to HPLC conventions to accommodate typical GC-MS peak shapes. Unmeasurable tailing (NaN) does not penalize quality. These thresholds are designed as a starting point — peaks are flagged, not filtered, so downstream analysis can apply its own criteria.
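The thresholds translate into a small cascading check. This sketch treats NaN tailing as passing, matching the "does not penalize" rule; the function name is illustrative:

```python
import math

def classify_quality(local_snr, phr, tailing):
    """Return the highest quality level whose thresholds all pass;
    NaN tailing counts as passing (unmeasurable does not penalize)."""
    levels = [("high", 50, 0.8, (0.8, 2.0)),
              ("good", 20, 0.5, (0.5, 3.0)),
              ("medium", 10, 0.3, (0.3, 5.0))]
    for name, snr_min, phr_min, (t_lo, t_hi) in levels:
        t_ok = math.isnan(tailing) or t_lo <= tailing <= t_hi
        if local_snr >= snr_min and phr >= phr_min and t_ok:
            return name
    return "low"
```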

Alignment

RT correction using one or two anchor substances (referenced by name from the active method).

"alignment": {"mode": "none", "substances": []}
"alignment": {"mode": "shift", "substances": ["D5"]}
"alignment": {"mode": "linear", "substances": ["D3", "IS_hexadecaneD34"]}

| Mode | Description |
| ---- | ----------- |
| none | No alignment |
| shift | Constant RT shift (1 anchor): RT' = RT − offset |
| linear | Linear transform (2 anchors): RT' = a·RT + b |
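Solving the two-anchor case is elementary; a sketch with assumed variable names:

```python
def linear_rt_transform(obs1, obs2, ref1, ref2):
    """Coefficients for RT' = a*RT + b mapping two observed anchor RTs
    onto their expected (method-defined) RTs."""
    a = (ref2 - ref1) / (obs2 - obs1)
    b = ref1 - a * obs1
    return a, b
```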

Anchors are validated before use: the found maximum must pass an SNR check (≥ 5) and must not be at the window edge. If validation fails, alignment is skipped with a warning identifying the sample and substance.

Local linear detrending is applied to the substance window before searching for the apex. A straight line is fitted between mean-of-3 anchors at the window edges and subtracted from the signal. This makes anchor detection robust against small peaks sitting on the tail of a larger neighbor and against peaks on a sloping baseline residual.
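A minimal version of that detrending step, anchoring the line at the centers of the two mean-of-3 edge windows (a sketch, not the package's code):

```python
def detrend_window_sketch(y):
    """Fit a line between mean-of-3 anchors at the window edges (centered
    at indices 1 and n-2) and subtract it, so a small peak sitting on a
    sloping tail ends up on a locally flat baseline."""
    n = len(y)
    left = sum(y[:3]) / 3.0        # mean-of-3 anchor at the left edge
    right = sum(y[-3:]) / 3.0      # mean-of-3 anchor at the right edge
    slope = (right - left) / (n - 3)
    return [v - (left + slope * (i - 1)) for i, v in enumerate(y)]
```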

Normalization

"normalization": {"mode": "none", "substance": null, "fallback": "total_peak_area"}
"normalization": {"mode": "anchor_area", "substance": "IS_heptaneD16", "fallback": "total_peak_area"}
"normalization": {"mode": "total_peak_area"}

| Mode | Description |
| ---- | ----------- |
| none | No normalization |
| anchor_area | Divide by IS peak area (recommended) |
| anchor_height | Divide by IS peak height |
| total_peak_area | Divide by sum of all integrated peak areas |

Anchor-based modes validate the IS peak (SNR check, optional area range check). If validation fails, the pipeline automatically falls back to the mode specified in fallback. The substance name is looked up in the active method in methods.json.

For anchor modes, the IS peak is re-found and re-integrated using local linear detrending in the substance window — independent of whether the global peak picker found it. This gives a tangent-skim-style area equivalent to commercial chromatography software, robust against sloping baselines and small peaks on tails. The detrended values then replace the IS row's area/height/RT in the peak table; if the global picker missed the IS entirely, a new row is inserted with quality = "anchor_detrended" and NaN for area_raw / height_raw.

methods.json reference

Defines per-instrument/method anchor substance dictionaries. Each method has a name, optional description, and an anchor_substances dictionary.

{
    "shimadzu_standard": {
        "description": "Shimadzu GC-MS, standard method",
        "anchor_substances": {
            "D3":               {"expected_rt_min": 3.70, "window_min": 0.08},
            "IS_heptaneD16":    {"expected_rt_min": 6.00, "window_min": 0.08, "expected_area_range": null},
            ...
        }
    },
    "agilent_standard": {
        "description": "Agilent GC-MS, standard method",
        "anchor_substances": { ... }
    }
}

Substance names referenced in config.json (alignment/normalization) are looked up in the active method's anchor_substances. This is validated at startup before any file is processed — a typo in a substance name gives an immediate error, not a mid-batch failure.

expected_area_range is optional. Set to [low, high] after establishing typical IS areas from a few runs. Used by normalization to flag magnitude anomalies or co-elution.

Output files

For each processed sample, the pipeline writes three files. If a suffix is set (via --suffix or run_id), it is inserted into the filename: <sample_id>__<suffix>__<tag>.<ext>. Without a suffix: <sample_id>__<tag>.<ext>.

| File | Contents |
| ---- | -------- |
| *__processed_tic.csv | Full signal trace (see columns below) |
| *__peak_table.csv | Peak list (see columns below) |
| *__qc.json | Complete parameter record and quality metrics |

processed_tic.csv columns

| Column | Description |
| ------ | ----------- |
| rt_min | Retention time on uniform resampled grid (minutes) |
| tic_raw | Intensity after resampling (before processing) |
| tic_smooth | After Savitzky-Golay smoothing |
| baseline | Estimated baseline curve |
| tic_corrected | tic_smooth − baseline |
| rt_min_aligned | RT axis after alignment (= rt_min if alignment is off) |
| tic_aligned_corrected | Corrected signal on aligned RT axis |
| tic_final | After normalization (fully processed signal) |

For downstream use, rt_min_aligned + tic_final is always the fully processed pair.
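For example, a downstream script might load the processed pair with pandas (hypothetical helper name; column names are from the table above):

```python
import pandas as pd

def load_final_trace(path):
    """Return the fully processed (RT, intensity) pair from a
    *__processed_tic.csv file."""
    df = pd.read_csv(path)
    return df["rt_min_aligned"].to_numpy(), df["tic_final"].to_numpy()
```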

peak_table.csv columns

| Column | Description |
| ------ | ----------- |
| sample_id | Sample identifier |
| rt_apex_min | Retention time of peak apex (minutes) |
| rt_start_min | Integration start (minutes), at 95% base width |
| rt_end_min | Integration end (minutes), at 95% base width |
| area_raw | Integrated area before normalization |
| area | Integrated area after normalization |
| height_raw | Peak height before normalization |
| height | Peak height after normalization |
| prominence | Peak prominence above local baseline |
| width_min | Peak FWHM in minutes (full width at half maximum) |
| snr | Global signal-to-noise ratio (height / global noise_sigma) |
| is_deconvolved | 1.0 if the peak comes from EMG/Gaussian deconvolution, 0.0 otherwise |
| prominence_height_ratio | Prominence / height (1.0 = rises from baseline, low = on shoulder) |
| tailing_factor | USP tailing factor at 5% height (1.0 = symmetric; NaN if unmeasurable) |
| local_snr | SNR from local noise estimate (±0.3 min window around peak) |
| quality | Combined quality: high, good, medium, or low |

When normalization is "none", area_raw = area and height_raw = height.

Batch outputs

In batch mode, additional files are written to the output directory:

| File | Description |
| ---- | ----------- |
| batch_summary.csv | Per-sample summary (see columns below) |
| <timestamp>_run-batch.log | Full log with timestamps, progress, warnings, and errors for every file. Filename includes the run timestamp (YYYY-MM-DD_HH-MM-SS) so multiple runs in the same output directory don't overwrite each other. |
| baseline_eval_results.csv | Baseline quality scores (only if blmetrics is enabled) |

batch_summary.csv columns

Compiled from all per-sample QC files after batch processing. One row per sample.

| Column | Description |
| ------ | ----------- |
| sample_id | Sample identifier |
| n_peaks | Number of peaks detected |
| norm_mode | Normalization mode used (none, anchor_area, total_peak_area, etc.) |
| norm_status | Normalization outcome (ok, anchor_not_found, anchor_validation_failed, etc.) |
| norm_factor | Applied normalization factor |

The following columns appear when anchor-based normalization is used:

| Column | Description |
| ------ | ----------- |
| anchor_rt | Detected RT of the anchor/IS peak (minutes) |
| anchor_area | Raw integrated area of the anchor peak |
| anchor_height | Height of the anchor peak |
| anchor_snr | SNR of the anchor peak |
| anchor_valid | True if anchor passed all validation checks |
| anchor_issues | Semicolon-separated list of validation failure reasons (empty if valid) |

The following columns appear when anchor validation failed and fallback was triggered:

| Column | Description |
| ------ | ----------- |
| fallback_mode | Fallback normalization mode that was applied (e.g., total_peak_area) |
| fallback_status | Outcome of the fallback normalization |

The following columns are added automatically when ≥ 3 samples have valid anchor areas, for batch-level IS consistency checking:

| Column | Description |
| ------ | ----------- |
| anchor_area_zscore | Robust z-score of the anchor area relative to the batch (MAD-scaled deviation from the median) |
| anchor_area_outlier | True if the absolute z-score exceeds 2.0, indicating the IS area for this sample is anomalous compared to the batch |

Outlier flags help identify samples with potential co-elution on the IS peak, sample preparation errors, or injection volume inconsistencies — issues that per-sample validation alone may not catch.
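The batch-level check can be sketched as follows (the constant 1.4826 is the usual Gaussian-consistency factor; the function name is assumed):

```python
import statistics

def anchor_area_flags(areas, cutoff=2.0):
    """Robust z-scores via MAD-scaled deviation from the median,
    plus outlier flags at |z| > cutoff."""
    med = statistics.median(areas)
    mad = statistics.median(abs(a - med) for a in areas)
    # Guard against a zero MAD (all areas identical): nothing is an outlier
    scale = 1.4826 * mad if mad > 0 else float("inf")
    z = [(a - med) / scale for a in areas]
    return z, [abs(v) > cutoff for v in z]
```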

Logging and progress

The pipeline logs to both the terminal and a timestamped log file (<timestamp>_run-batch.log) in the output directory. The timestamp format is YYYY-MM-DD_HH-MM-SS, so each run produces its own log file.

Terminal output shows a parameter summary at the start of each run, per-file progress with peak counts, and any warnings or errors. Example:

============================================================
automa-tic v0.2.0
============================================================
  Method:        shimadzu_standard
                 Shimadzu GC-MS, standard method
  Sampling:      100 ms
  Cutoff:        RT 1.6 – 28.0 min
  Smoothing:     SavGol window=0.015 min, order=3
  Baseline:      asls (lam=1e+06, p=0.01)
  Peaks:         SNR >= 6, width 0.008–0.5 min
  Alignment:     shift → D5
  Normalization: anchor_area → IS_heptaneD16
============================================================
  Batch: 12 file(s) in data/
--- File 1/12: sample_001.csv ---
  [sample_001] Reading...
  [sample_001] 47 peaks found
  [sample_001] Done.
--- File 2/12: sample_002.csv ---
  [sample_002] Reading...
  [sample_002] Alignment anchor 'D5' (expected RT=7.300): below_snr (SNR=2.1, required=5.0). Skipping alignment.
  [sample_002] 38 peaks found
  [sample_002] Done.

The log file includes additional debug-level detail for each processing step.

Visualization

plot-one.py — interactive matplotlib plot for one sample:

python plot-one.py output/ sample_id
python plot-one.py output/ sample_id -s asls_test    # with suffix

Defaults to showing tic_final (fully processed signal) with y-axis auto-scaled. Toggleable layers: raw, smooth, baseline, corrected, final, and peak spans. Y-axis auto-rescales when toggling between traces with different magnitudes (e.g., raw vs normalized).

Baseline parameter sweeps

To compare baseline settings:

  1. Set "blmetrics": true in the config.
  2. Set "run_id" to a descriptive label (e.g. "asls_lam1e6_p0.01") — this also becomes the output filename suffix.
  3. Run the pipeline. Results are appended to baseline_eval_results.csv in the output directory.
  4. Change baseline parameters and run_id, repeat.
  5. Compare scores — lower is better. Key metrics: NAR_neg_area_ratio (over-subtraction), residual_ac1_in_mask (residual structure), residual_local_mean_drift_MAD (remaining drift).

The sweep can also be run programmatically using baseevaluation.evaluate_sweep().

Versioning

automa-tic uses Semantic Versioning:

  • MAJOR (1.0.0): breaking changes to config format, output structure, or API
  • MINOR (0.2.0): new features, new output columns, new processing options
  • PATCH (0.1.1): bug fixes, parameter tuning, documentation updates

The current version is stored in gcms_preproc/__init__.py and recorded in every QC JSON output (automa_tic_version field), ensuring full traceability of which pipeline version produced each result.

License

Released under the MIT License — free to use, modify, and distribute, including for commercial purposes, as long as the copyright notice is preserved.
