Automated GC-MS TIC chromatogram preprocessing.
Version 0.2.0 — see CHANGELOG.md for release notes.
Preprocessing pipeline for GC-MS Total Ion Current (TIC) chromatograms, designed for downstream chemometrics and machine learning.
Handles data from multiple instrument vendors (Agilent, Shimadzu) and produces standardised CSV outputs per measurement.
Raw CSV → Cutoff → Resample → Smooth → Baseline correction → RT Alignment → Peak picking → Normalization → Export
Each step is optional or configurable via config.json. All parameters used are recorded in per-sample QC files for full reproducibility.
Requires Python ≥ 3.9. Dependencies:

- numpy
- scipy
- pandas
- pybaselines
- matplotlib

Install with:

```bash
pip install -r requirements.txt
```

Try it on the included demo files:

```bash
python run_files.py -c config.example.json
```

This processes `demo_data/shimadzu_demo.csv` with sensible defaults (alignment to D3, normalization to IS_heptaneD16) and writes results to `output_demo/`.
Use it on your own data:

- Place your CSV data files in the project directory (or a subfolder).
- Copy `config.example.json` to `config.json` and edit `input_path`, `out_dir`, `method`, and processing parameters.
- Edit `methods.json` to define anchor substances for your instrument/method.
- Run:

```bash
python run_files.py
```

CLI overrides are available for all common options:
```bash
python run_files.py -i mydata.csv                # single file
python run_files.py -i data_folder/ -o results/  # batch mode
python run_files.py -c my_config.json            # custom config
python run_files.py -i data/ -p "*.csv"          # batch with glob pattern
python run_files.py -f "250611"                  # only files containing "250611"
python run_files.py -m agilent_standard          # override method
python run_files.py -s asls_test                 # suffix on output filenames
python run_files.py -i data/ -f "250611" -s batch1 -m shimadzu_standard  # combined
```

| Flag | Long form | Description |
|---|---|---|
| `-c` | `--config` | Path to config file (default: `config.json`) |
| `-i` | `--input` | Input file or directory (overrides config) |
| `-o` | `--output` | Output directory (overrides config) |
| `-p` | `--pattern` | Glob pattern for batch mode (overrides config) |
| `-f` | `--filter` | Filename filter: only process files whose name contains this string |
| `-s` | `--suffix` | Suffix appended to output filenames (overrides `run_id`) |
| `-m` | `--method` | Method preset name from `methods.json` (overrides config) |
```
gcms_preproc/                # Python package (automa-tic)
├── __init__.py              # Package init, __version__
├── io/
│   ├── readers.py           # CSV import (Agilent, Shimadzu, generic fallback)
│   └── writers.py           # CSV/JSON export (with optional suffix)
├── core/
│   ├── models.py            # Chromatogram dataclass
│   ├── resampling.py        # Uniform grid interpolation
│   ├── smoothing.py         # Savitzky-Golay filter
│   ├── baseline.py          # Baseline correction (asls, psalsa, aspls)
│   ├── baseevaluation.py    # Baseline quality metrics and parameter sweeps
│   ├── cutoff.py            # RT range or point-based trimming
│   ├── alignment.py         # RT alignment (shift, linear) with peak validation
│   ├── normalization.py     # Normalization (anchor IS, total peak area, fallback)
│   ├── peaks.py             # SNR-based peak picking, EMG deconvolution, quality metrics
│   └── qc.py                # Noise estimation (MAD-based)
├── pipeline.py              # Main orchestrator (single file + batch, progress logging)
├── run_files.py             # CLI entry point, config/method loader, early validation
├── plot-one.py              # Interactive QC plot (matplotlib, suffix support, auto-rescale)
├── config.json              # Your processing parameters (edit for your data)
├── config.example.json      # Working example pointing at the demo data
├── methods.json             # Per-instrument anchor substance definitions
├── demo_data/               # Example chromatograms (one per instrument)
│   ├── shimadzu_demo.csv
│   └── agilent_demo.csv
├── requirements.txt         # Python dependencies
├── README.md                # This file
├── CHANGELOG.md             # Release notes
└── LICENSE                  # MIT
```
Settings are split across two files:
- `config.json` — processing parameters and mode selection (what to do)
- `methods.json` — per-instrument/method anchor substance definitions (what's available)
This separation means you can switch between instruments or analytical methods without changing processing settings, and vice versa.
| Key | Type | Description |
|---|---|---|
| `input_path` | string | Input file or directory (relative to config location) |
| `out_dir` | string | Output directory |
| `pattern` | string | Glob pattern for batch mode (e.g. `"*.csv"`) |
| `filename_filter` | string or null | Only process files containing this string (batch mode) |
| `method` | string | Method preset name from `methods.json` |
| `methods_file` | string | Path to methods file (default: `"methods.json"`) |
| `dt_min_ms` | number | Sampling interval in milliseconds (typically 100) |
| `run_id` | string or null | Optional label; defaults to filename stem. Also used as output filename suffix when set. |
| `blmetrics` | bool | If true, log baseline evaluation metrics to `baseline_eval_results.csv` |
Trim chromatogram before processing. Useful for removing solvent delay or column bleed at the end.
"cutoff": {"mode": "rt", "start_min": 1.60, "end_min": 28.00}Modes: "none", "rt" (start_min/end_min), "points" (drop_first/drop_last).
Savitzky-Golay filter applied after resampling to uniform grid.
"smoothing": {"window_min": 0.015, "polyorder": 3}window_min is in minutes and converted to points internally based on dt_min_ms.
Three methods available. Switch by changing baseline_method; method-specific parameters are in baseline_presets, shared parameters in baseline_common.
"baseline_method": "asls",
"baseline_presets": {
"asls": {"lam": 1e6, "p": 0.01},
"psalsa": {"lam": 1e5, "p": 0.05},
"aspls": {"lam": 1e6}
},
"baseline_common": {"diff_order": 2, "max_iter": 50, "tol": 1e-3}| Method | Description | Reference |
|---|---|---|
| asls | Asymmetric least squares | Eilers, Anal Chem 2003, 75:3631 |
| psalsa | Penalized asls with smoothing | Oller-Moreno et al., 2014 |
| aspls | Adaptive smoothness penalized LS (no p, auto-estimated) |
Zhang et al., 2020 |
Lambda (lam) is automatically scaled for the sampling rate via dt_ref_min.
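The exact rescaling rule is internal to the pipeline; one common convention for Whittaker-type smoothers (assumed here purely for illustration, not taken from the source) scales the penalty by the grid-spacing ratio raised to twice the difference order:

```python
def scale_lam(lam: float, dt_min: float, dt_ref_min: float, diff_order: int = 2) -> float:
    """Rescale a Whittaker-smoother lambda for a different grid spacing.

    Assumption (not from the pipeline source): a penalty on the d-th
    difference scales roughly with (spacing ratio)**(2*d), so finer
    sampling requires a proportionally larger effective lambda for the
    same degree of smoothness.
    """
    return lam * (dt_ref_min / dt_min) ** (2 * diff_order)
```

Halving the sampling interval at `diff_order=2` would thus multiply the effective lambda by 16.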
SNR-based detection using scipy.signal.find_peaks, with optional EMG-based deconvolution and per-peak quality metrics.
"peaks": {
"min_snr": 6,
"min_width_min": 0.008,
"max_width_min": 0.50,
"deconvolution": false,
"deconv_max_components": 4
}| Key | Description |
|---|---|
min_snr |
Minimum signal-to-noise ratio. Prominence threshold = min_snr × noise_sigma. 6 ≈ 2× LOD (ICH Q2) |
min_width_min |
Minimum peak FWHM in minutes (filters spikes) |
max_width_min |
Maximum peak FWHM in minutes (filters broad artifacts); null = no limit |
deconvolution |
If true, attempt EMG-based deconvolution of overlapping peaks (see below) |
deconv_max_components |
Maximum number of peak components to fit per region (default 4) |
Width filtering uses FWHM (full width at half maximum, rel_height=0.5). Integration boundaries are determined separately at 95% base width (rel_height=0.95) to capture the full peak area including tails. This prevents tailing peaks from being incorrectly filtered while still measuring their complete area.
Noise is estimated internally via MAD of first differences (Rousseeuw & Croux, 1993).
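A minimal sketch of that estimator (the function name and the exact scaling constants are standard conventions assumed here, not copied from the pipeline):

```python
import numpy as np

def noise_sigma_mad(signal: np.ndarray) -> float:
    """Estimate white-noise sigma from the MAD of first differences.

    Differencing removes slow baseline trends, the MAD gives a spread
    estimate robust to peaks, 1.4826 rescales MAD to a Gaussian sigma,
    and sqrt(2) undoes the variance doubling caused by differencing.
    """
    d = np.diff(signal)
    mad = np.median(np.abs(d - np.median(d)))
    return 1.4826 * mad / np.sqrt(2.0)
```

On a trace with sparse peaks, this recovers the noise level of the flat regions without being inflated by peak apexes.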
When enabled, the pipeline uses curve fitting to resolve overlapping peaks within merged integration regions. The approach is selective — only regions with multiple detected local maxima are fitted; well-resolved single peaks are left untouched.
The fitting model is the Exponentially Modified Gaussian (EMG), which is the standard chromatographic peak model (Foley & Dorsey, 1983, Anal Chem 55:730). It has 4 parameters per component: amplitude, center (mu), Gaussian width (sigma), and exponential decay (tau). The tau parameter captures the characteristic tailing observed in GC-MS peaks due to column dispersion and interaction kinetics.
Fitting procedure per region:
- Detect local maxima with a low prominence threshold to find shoulders
- Estimate initial parameters from signal shape (FWHM → sigma, trailing decay rate → tau)
- Fit sum-of-EMGs with constrained centers (±0.01 min around detected maxima)
- If EMG fit fails or is poor (normalized RMSE > 0.12), fall back to sum-of-Gaussians
- If both fail, keep the region as a single peak
- Adjacent deconvolved peak boundaries are set at valley minima in the original signal (no overlap)
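The EMG component itself can be sketched with SciPy's exponentially modified normal distribution (this helper and its area parameterization are illustrative, not the pipeline's actual code):

```python
import numpy as np
from scipy.stats import exponnorm

def emg_component(t, area, mu, sigma, tau):
    """One EMG peak: a Gaussian (center mu, width sigma) convolved with an
    exponential decay (time constant tau), capturing chromatographic tailing.

    Parameterized by peak area: exponnorm.pdf integrates to 1, so the
    component integrates to `area`. SciPy's shape parameter is K = tau/sigma.
    """
    return area * exponnorm.pdf(t, tau / sigma, loc=mu, scale=sigma)
```

A multi-peak region would then be fitted with a sum of such components (e.g. via `scipy.optimize.curve_fit`), applying the fallbacks listed above when the fit fails.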
Integration regions are merged using a smart criterion: overlapping intervals are only merged when the valley between the two peaks stays above 30% of the smaller peak's height. If the valley drops below this threshold, the regions are kept separate.
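A minimal sketch of that valley criterion (function and argument names are hypothetical; apex arguments are indices into the signal array):

```python
import numpy as np

def should_merge(y: np.ndarray, apex_a: int, apex_b: int,
                 valley_frac: float = 0.3) -> bool:
    """Merge two overlapping integration regions only if the valley between
    the apexes stays above `valley_frac` of the smaller peak's height."""
    lo, hi = sorted((apex_a, apex_b))
    valley = float(y[lo:hi + 1].min())          # deepest point between apexes
    smaller = min(float(y[apex_a]), float(y[apex_b]))
    return valley > valley_frac * smaller
```

A shallow saddle between two heavily overlapped peaks triggers a merge; a valley that drops toward baseline keeps the regions separate.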
Deconvolved peaks are marked with is_deconvolved = 1.0 in the peak table. Their area is computed from the fitted EMG/Gaussian component (analytical integration), not from the raw signal.
Typical performance overhead: ~0.5–2 seconds per chromatogram (depending on the number of multi-peak regions).
Each detected peak is assigned quality metrics:
| Metric | Description |
|---|---|
| `prominence_height_ratio` | Prominence / height. Close to 1.0 for well-resolved peaks rising from baseline; low for peaks riding on the tail of a neighbor. |
| `tailing_factor` | USP tailing factor at 5% peak height (USP <621>, EP 2.2.46). 1.0 = symmetric, >1 = tailing, <1 = fronting. NaN if unmeasurable. |
| `local_snr` | SNR computed from a local noise estimate (±0.3 min window) rather than the global noise. More representative for chromatograms with non-uniform noise. |
| `quality` | Combined classification: high, good, medium, or low (see below). |

Quality classification thresholds (all must pass for the level):

| Level | local_snr | prominence_height_ratio | tailing_factor |
|---|---|---|---|
| high | ≥ 50 | ≥ 0.8 | 0.8 – 2.0 |
| good | ≥ 20 | ≥ 0.5 | 0.5 – 3.0 |
| medium | ≥ 10 | ≥ 0.3 | 0.3 – 5.0 |
| low | below medium | | |
Tailing ranges are relaxed compared to HPLC conventions to accommodate typical GC-MS peak shapes. Unmeasurable tailing (NaN) does not penalize quality. These thresholds are designed as a starting point — peaks are flagged, not filtered, so downstream analysis can apply its own criteria.
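The threshold table can be expressed as a small classifier (a sketch mirroring the table; the function name is illustrative, and NaN tailing is treated as passing, per the note above):

```python
import math

def classify_peak(local_snr: float, phr: float, tailing: float) -> str:
    """Apply the quality thresholds: all criteria must pass for a level;
    an unmeasurable (NaN) tailing factor never penalizes quality."""
    def tailing_ok(lo: float, hi: float) -> bool:
        return math.isnan(tailing) or lo <= tailing <= hi

    if local_snr >= 50 and phr >= 0.8 and tailing_ok(0.8, 2.0):
        return "high"
    if local_snr >= 20 and phr >= 0.5 and tailing_ok(0.5, 3.0):
        return "good"
    if local_snr >= 10 and phr >= 0.3 and tailing_ok(0.3, 5.0):
        return "medium"
    return "low"
```

A strong but heavily tailing peak drops from "high" to "good" rather than being discarded, consistent with the flag-don't-filter design.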
RT correction using one or two anchor substances (referenced by name from the active method).
"alignment": {"mode": "none", "substances": []}
"alignment": {"mode": "shift", "substances": ["D5"]}
"alignment": {"mode": "linear", "substances": ["D3", "IS_hexadecaneD34"]}| Mode | Description |
|---|---|
none |
No alignment |
shift |
Constant RT shift (1 anchor): RT' = RT − offset |
linear |
Linear transform (2 anchors): RT' = a·RT + b |
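For the `linear` mode, the two-anchor transform reduces to solving a line through two points (an illustrative helper, not the pipeline's code):

```python
def linear_alignment(obs_rt, ref_rt):
    """Solve RT' = a*RT + b from two anchors.

    obs_rt: the two detected anchor RTs in this sample;
    ref_rt: their expected (reference) RTs from the method definition.
    """
    (o1, o2), (r1, r2) = obs_rt, ref_rt
    a = (r2 - r1) / (o2 - o1)   # slope: stretch/compress the RT axis
    b = r1 - a * o1             # intercept: shift so anchor 1 lands on target
    return a, b
```

Applying `a·RT + b` then maps both detected anchors exactly onto their expected retention times.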
Anchors are validated before use: the found maximum must pass an SNR check (≥ 5) and must not be at the window edge. If validation fails, alignment is skipped with a warning identifying the sample and substance.
Local linear detrending is applied to the substance window before searching for the apex. A straight line is fitted between mean-of-3 anchors at the window edges and subtracted from the signal. This makes anchor detection robust against small peaks sitting on the tail of a larger neighbor and against peaks on a sloping baseline residual.
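A sketch of that detrending step (mean-of-3 edge anchors as described above; the helper name is assumed):

```python
import numpy as np

def detrend_window(y: np.ndarray) -> np.ndarray:
    """Subtract a straight line fitted between the means of the first 3 and
    last 3 samples of a substance window; the apex search then runs on the
    detrended signal."""
    n = len(y)
    left = float(np.mean(y[:3]))
    right = float(np.mean(y[-3:]))
    trend = left + (right - left) * np.arange(n) / (n - 1)
    return y - trend
```

On a window dominated by a sloping tail, the raw maximum sits at the window edge, while the detrended maximum lands on the actual anchor peak.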
"normalization": {"mode": "none", "substance": null, "fallback": "total_peak_area"}
"normalization": {"mode": "anchor_area", "substance": "IS_heptaneD16", "fallback": "total_peak_area"}
"normalization": {"mode": "total_peak_area"}| Mode | Description |
|---|---|
none |
No normalization |
anchor_area |
Divide by IS peak area (recommended) |
anchor_height |
Divide by IS peak height |
total_peak_area |
Divide by sum of all integrated peak areas |
Anchor-based modes validate the IS peak (SNR check, optional area range check). If validation fails, automatically falls back to the mode specified in fallback. The substance name is referenced from the active method in methods.json.
For anchor modes, the IS peak is re-found and re-integrated using local linear detrending in the substance window — independent of whether the global peak picker found it. This gives a tangent-skim-style area equivalent to commercial chromatography software, robust against sloping baselines and small peaks on tails. The detrended values then replace the IS row's area/height/RT in the peak table; if the global picker missed the IS entirely, a new row is inserted with quality = "anchor_detrended" and NaN for area_raw / height_raw.
Defines per-instrument/method anchor substance dictionaries. Each method has a name, optional description, and an anchor_substances dictionary.
```json
{
  "shimadzu_standard": {
    "description": "Shimadzu GC-MS, standard method",
    "anchor_substances": {
      "D3": {"expected_rt_min": 3.70, "window_min": 0.08},
      "IS_heptaneD16": {"expected_rt_min": 6.00, "window_min": 0.08, "expected_area_range": null},
      ...
    }
  },
  "agilent_standard": {
    "description": "Agilent GC-MS, standard method",
    "anchor_substances": { ... }
  }
}
```

Substance names referenced in `config.json` (alignment/normalization) are looked up in the active method's `anchor_substances`. This is validated at startup before any file is processed — a typo in a substance name gives an immediate error, not a mid-batch failure.
expected_area_range is optional. Set to [low, high] after establishing typical IS areas from a few runs. Used by normalization to flag magnitude anomalies or co-elution.
For each processed sample, the pipeline writes three files. If a suffix is set (via `--suffix` or `run_id`), it is inserted into the filename: `<sample_id>__<suffix>__<tag>.<ext>`. Without a suffix: `<sample_id>__<tag>.<ext>`.
| File | Contents |
|---|---|
| `*__processed_tic.csv` | Full signal trace (see columns below) |
| `*__peak_table.csv` | Peak list (see columns below) |
| `*__qc.json` | Complete parameter record and quality metrics |
| Column | Description |
|---|---|
| `rt_min` | Retention time on uniform resampled grid (minutes) |
| `tic_raw` | Intensity after resampling (before processing) |
| `tic_smooth` | After Savitzky-Golay smoothing |
| `baseline` | Estimated baseline curve |
| `tic_corrected` | `tic_smooth` − `baseline` |
| `rt_min_aligned` | RT axis after alignment (= `rt_min` if alignment is off) |
| `tic_aligned_corrected` | Corrected signal on aligned RT axis |
| `tic_final` | After normalization (fully processed signal) |
For downstream use, rt_min_aligned + tic_final is always the fully processed pair.
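For example, picking up that pair with pandas for a downstream feature matrix (the column names come from the table above; the helper name is illustrative):

```python
import pandas as pd

def load_final_trace(path):
    """Read a *__processed_tic.csv and return only the fully processed pair:
    rt_min_aligned (RT axis) and tic_final (normalized intensity)."""
    return pd.read_csv(path)[["rt_min_aligned", "tic_final"]]
```

Stacking these traces across samples (already on a shared, aligned RT grid) gives a matrix ready for chemometrics or ML.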
| Column | Description |
|---|---|
| `sample_id` | Sample identifier |
| `rt_apex_min` | Retention time of peak apex (minutes) |
| `rt_start_min` | Integration start (minutes), at 95% base width |
| `rt_end_min` | Integration end (minutes), at 95% base width |
| `area_raw` | Integrated area before normalization |
| `area` | Integrated area after normalization |
| `height_raw` | Peak height before normalization |
| `height` | Peak height after normalization |
| `prominence` | Peak prominence above local baseline |
| `width_min` | Peak FWHM in minutes (full width at half maximum) |
| `snr` | Global signal-to-noise ratio (height / global `noise_sigma`) |
| `is_deconvolved` | 1.0 if from valley-based deconvolution, 0.0 otherwise |
| `prominence_height_ratio` | Prominence / height (1.0 = rises from baseline, low = on shoulder) |
| `tailing_factor` | USP tailing factor at 5% height (1.0 = symmetric; NaN if unmeasurable) |
| `local_snr` | SNR from local noise estimate (±0.3 min window around peak) |
| `quality` | Combined quality: high, good, medium, or low |
When normalization is "none", area_raw = area and height_raw = height.
In batch mode, additional files are written to the output directory:
| File | Description |
|---|---|
| `batch_summary.csv` | Per-sample summary (see columns below) |
| `<timestamp>_run-batch.log` | Full log with timestamps, progress, warnings, and errors for every file. Filename includes the run timestamp (`YYYY-MM-DD_HH-MM-SS`) so multiple runs in the same output directory don't overwrite each other. |
| `baseline_eval_results.csv` | Baseline quality scores (only if `blmetrics` is enabled) |
Compiled from all per-sample QC files after batch processing. One row per sample.
| Column | Description |
|---|---|
| `sample_id` | Sample identifier |
| `n_peaks` | Number of peaks detected |
| `norm_mode` | Normalization mode used (`none`, `anchor_area`, `total_peak_area`, etc.) |
| `norm_status` | Normalization outcome (`ok`, `anchor_not_found`, `anchor_validation_failed`, etc.) |
| `norm_factor` | Applied normalization factor |
The following columns appear when anchor-based normalization is used:
| Column | Description |
|---|---|
| `anchor_rt` | Detected RT of the anchor/IS peak (minutes) |
| `anchor_area` | Raw integrated area of the anchor peak |
| `anchor_height` | Height of the anchor peak |
| `anchor_snr` | SNR of the anchor peak |
| `anchor_valid` | True if anchor passed all validation checks |
| `anchor_issues` | Semicolon-separated list of validation failure reasons (empty if valid) |
The following columns appear when anchor validation failed and fallback was triggered:
| Column | Description |
|---|---|
| `fallback_mode` | Fallback normalization mode that was applied (e.g., `total_peak_area`) |
| `fallback_status` | Outcome of the fallback normalization |
The following columns are added automatically when ≥ 3 samples have valid anchor areas, for batch-level IS consistency checking:
| Column | Description |
|---|---|
| `anchor_area_zscore` | Robust z-score of the anchor area relative to the batch (MAD-scaled deviation from the median) |
| `anchor_area_outlier` | True if the absolute z-score exceeds 2.0, indicating the IS area for this sample is anomalous compared to the batch |
Outlier flags help identify samples with potential co-elution on the IS peak, sample preparation errors, or injection volume inconsistencies — issues that per-sample validation alone may not catch.
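The batch-level check can be sketched as follows (the 1.4826 MAD-to-sigma scaling is a common convention, assumed here rather than taken from the pipeline source):

```python
import numpy as np

def robust_zscores(areas) -> np.ndarray:
    """Robust z-scores for batch IS-area consistency checking: deviation
    from the median, scaled by 1.4826 * MAD (the MAD-to-Gaussian-sigma
    convention), so a single bad sample cannot mask itself."""
    a = np.asarray(areas, dtype=float)
    med = np.median(a)
    mad = np.median(np.abs(a - med))
    return (a - med) / (1.4826 * mad)
```

With the 2.0 cutoff described above, a sample with a wildly different IS area is flagged while the consistent samples stay below threshold.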
The pipeline logs to both the terminal and a timestamped log file (`<timestamp>_run-batch.log`) in the output directory. The timestamp format is `YYYY-MM-DD_HH-MM-SS`, so each run produces its own log file.
Terminal output shows a parameter summary at the start of each run, per-file progress with peak counts, and any warnings or errors. Example:
```
============================================================
automa-tic v0.2.0
============================================================
Method:        shimadzu_standard
               Shimadzu GC-MS, standard method
Sampling:      100 ms
Cutoff:        RT 1.6 – 28.0 min
Smoothing:     SavGol window=0.015 min, order=3
Baseline:      asls (lam=1e+06, p=0.01)
Peaks:         SNR >= 6, width 0.008–0.5 min
Alignment:     shift → D5
Normalization: anchor_area → IS_heptaneD16
============================================================
Batch: 12 file(s) in data/

--- File 1/12: sample_001.csv ---
[sample_001] Reading...
[sample_001] 47 peaks found
[sample_001] Done.
--- File 2/12: sample_002.csv ---
[sample_002] Reading...
[sample_002] Alignment anchor 'D5' (expected RT=7.300): below_snr (SNR=2.1, required=5.0). Skipping alignment.
[sample_002] 38 peaks found
[sample_002] Done.
```
The log file includes additional debug-level detail for each processing step.
`plot-one.py` — interactive matplotlib plot for one sample:

```bash
python plot-one.py output/ sample_id
python plot-one.py output/ sample_id -s asls_test   # with suffix
```

Defaults to showing `tic_final` (the fully processed signal) with the y-axis auto-scaled. Toggleable layers: raw, smooth, baseline, corrected, final, and peak spans. The y-axis auto-rescales when toggling between traces with different magnitudes (e.g., raw vs normalized).
To compare baseline settings:
- Set `"blmetrics": true` in the config.
- Set `"run_id"` to a descriptive label (e.g. `"asls_lam1e6_p0.01"`) — this also becomes the output filename suffix.
- Run the pipeline. Results are appended to `baseline_eval_results.csv` in the output directory.
- Change baseline parameters and `run_id`, repeat.
- Compare scores — lower is better. Key metrics: `NAR_neg_area_ratio` (over-subtraction), `residual_ac1_in_mask` (residual structure), `residual_local_mean_drift_MAD` (remaining drift).
The sweep can also be run programmatically using `baseevaluation.evaluate_sweep()`.
automa-tic uses Semantic Versioning:
- MAJOR (1.0.0): breaking changes to config format, output structure, or API
- MINOR (0.2.0): new features, new output columns, new processing options
- PATCH (0.1.1): bug fixes, parameter tuning, documentation updates
The current version is stored in `gcms_preproc/__init__.py` and recorded in every QC JSON output (`automa_tic_version` field), ensuring full traceability of which pipeline version produced each result.
Released under the MIT License — free to use, modify, and distribute, including for commercial purposes, as long as the copyright notice is preserved.