scisciuro/automa-tic

automa-tic

Automated GC-MS TIC chromatogram preprocessing.

Version 0.2.0 — see CHANGELOG.md for release notes.

Preprocessing pipeline for GC-MS Total Ion Current (TIC) chromatograms, designed for downstream chemometrics and machine learning.

Handles data from multiple instrument vendors (Agilent, Shimadzu) and produces standardised CSV outputs per measurement.

Pipeline overview

Raw CSV → Cutoff → Resample → Smooth → Baseline correction → RT Alignment → Peak picking → Normalization → Export

Each step is optional or configurable via config.json. All parameters used are recorded in per-sample QC files for full reproducibility.
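As a rough illustration, the first stages (cutoff, resampling, smoothing) can be sketched with numpy/scipy. The function name, defaults, and hard-coded RT range below are illustrative only, not the package's actual API:

```python
import numpy as np
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

def preprocess_sketch(rt_min, tic, dt_min=100 / 60000.0,
                      window_min=0.015, polyorder=3):
    """Cutoff -> resample to a uniform grid -> Savitzky-Golay smoothing.
    Hypothetical helper mirroring the pipeline stages, not its API."""
    # Cutoff: keep RT 1.6-28.0 min (drops solvent delay / column bleed)
    mask = (rt_min >= 1.6) & (rt_min <= 28.0)
    rt, y = rt_min[mask], tic[mask]
    # Resample onto a uniform grid with spacing dt_min (here 100 ms)
    grid = np.arange(rt[0], rt[-1], dt_min)
    y_res = interp1d(rt, y)(grid)
    # Smoothing window is given in minutes; convert to an odd point count
    win = max(3, int(round(window_min / dt_min)) | 1)
    return grid, savgol_filter(y_res, win, polyorder)
```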

Installation

Requires Python ≥ 3.9. Dependencies:

numpy
scipy
pandas
pybaselines
matplotlib

Install with:

pip install -r requirements.txt

Quick start

Try it on the included demo files:

python run_files.py -c config.example.json

This processes demo_data/shimadzu_demo.csv with sensible defaults (alignment to D3, normalization to IS_heptaneD16) and writes results to output_demo/.

Use it on your own data:

  1. Place your CSV data files in the project directory (or a subfolder).
  2. Copy config.example.json to config.json and edit input_path, out_dir, method, and processing parameters.
  3. Edit methods.json to define anchor substances for your instrument/method.
  4. Run:
python run_files.py

CLI overrides are available for all common options:

python run_files.py -i mydata.csv                        # single file
python run_files.py -i data_folder/ -o results/           # batch mode
python run_files.py -c my_config.json                     # custom config
python run_files.py -i data/ -p "*.csv"                   # batch with glob pattern
python run_files.py -f "250611"                            # only files containing "250611"
python run_files.py -m agilent_standard                    # override method
python run_files.py -s asls_test                           # suffix on output filenames
python run_files.py -i data/ -f "250611" -s batch1 -m shimadzu_standard  # combined

| Flag | Long form | Description |
| ---- | --------- | ----------- |
| -c | --config | Path to config file (default: config.json) |
| -i | --input | Input file or directory (overrides config) |
| -o | --output | Output directory (overrides config) |
| -p | --pattern | Glob pattern for batch mode (overrides config) |
| -f | --filter | Filename filter: only process files whose name contains this string |
| -s | --suffix | Suffix appended to output filenames (overrides run_id) |
| -m | --method | Method preset name from methods.json (overrides config) |

Project structure

gcms_preproc/                   # Python package (automa-tic)
├── __init__.py             # Package init, __version__
├── io/
│   ├── readers.py          # CSV import (Agilent, Shimadzu, generic fallback)
│   └── writers.py          # CSV/JSON export (with optional suffix)
├── core/
│   ├── models.py           # Chromatogram dataclass
│   ├── resampling.py       # Uniform grid interpolation
│   ├── smoothing.py        # Savitzky-Golay filter
│   ├── baseline.py         # Baseline correction (asls, psalsa, aspls)
│   ├── baseevaluation.py   # Baseline quality metrics and parameter sweeps
│   ├── cutoff.py           # RT range or point-based trimming
│   ├── alignment.py        # RT alignment (shift, linear) with peak validation
│   ├── normalization.py    # Normalization (anchor IS, total peak area, fallback)
│   ├── peaks.py            # SNR-based peak picking, EMG deconvolution, quality metrics
│   └── qc.py              # Noise estimation (MAD-based)
├── pipeline.py             # Main orchestrator (single file + batch, progress logging)
├── run_files.py            # CLI entry point, config/method loader, early validation
├── plot-one.py             # Interactive QC plot (matplotlib, suffix support, auto-rescale)
├── config.json             # Your processing parameters (edit for your data)
├── config.example.json     # Working example pointing at the demo data
├── methods.json            # Per-instrument anchor substance definitions
├── demo_data/              # Example chromatograms (one per instrument)
│   ├── shimadzu_demo.csv
│   └── agilent_demo.csv
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── CHANGELOG.md            # Release notes
└── LICENSE                 # MIT

Configuration

Settings are split across two files:

  • config.json — processing parameters and mode selection (what to do)
  • methods.json — per-instrument/method anchor substance definitions (what's available)

This separation means you can switch between instruments or analytical methods without changing processing settings, and vice versa.

config.json reference

Paths and general

| Key | Type | Description |
| --- | ---- | ----------- |
| input_path | string | Input file or directory (relative to config location) |
| out_dir | string | Output directory |
| pattern | string | Glob pattern for batch mode (e.g. "*.csv") |
| filename_filter | string or null | Only process files containing this string (batch mode) |
| method | string | Method preset name from methods.json |
| methods_file | string | Path to methods file (default: "methods.json") |
| dt_min_ms | number | Sampling interval in milliseconds (typically 100) |
| run_id | string or null | Optional label; defaults to filename stem. Also used as output filename suffix when set. |
| blmetrics | bool | If true, log baseline evaluation metrics to baseline_eval_results.csv |

Cutoff

Trim chromatogram before processing. Useful for removing solvent delay or column bleed at the end.

"cutoff": {"mode": "rt", "start_min": 1.60, "end_min": 28.00}

Modes: "none", "rt" (start_min/end_min), "points" (drop_first/drop_last).

Smoothing

Savitzky-Golay filter applied after resampling to uniform grid.

"smoothing": {"window_min": 0.015, "polyorder": 3}

window_min is in minutes and converted to points internally based on dt_min_ms.
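The conversion is straightforward; this hypothetical helper shows the idea (Savitzky-Golay requires an odd window length, the actual internal rounding may differ):

```python
def window_points(window_min, dt_min_ms):
    """Convert a smoothing window from minutes to an odd number of points,
    given the sampling interval in milliseconds. Illustrative helper."""
    dt_min = dt_min_ms / 60000.0              # sampling interval in minutes
    n = max(3, round(window_min / dt_min))
    return n if n % 2 == 1 else n + 1         # enforce an odd window length
```

With the defaults above (0.015 min window at 100 ms sampling), this yields a 9-point window.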

Baseline correction

Three methods available. Switch by changing baseline_method; method-specific parameters are in baseline_presets, shared parameters in baseline_common.

"baseline_method": "asls",
"baseline_presets": {
    "asls":   {"lam": 1e6, "p": 0.01},
    "psalsa": {"lam": 1e5, "p": 0.05},
    "aspls":  {"lam": 1e6}
},
"baseline_common": {"diff_order": 2, "max_iter": 50, "tol": 1e-3}

| Method | Description | Reference |
| ------ | ----------- | --------- |
| asls | Asymmetric least squares | Eilers, Anal Chem 2003, 75:3631 |
| psalsa | Penalized asls with smoothing | Oller-Moreno et al., 2014 |
| aspls | Adaptive smoothness penalized LS (no p, auto-estimated) | Zhang et al., 2020 |

Lambda (lam) is automatically scaled for the sampling rate via dt_ref_min.

Peak picking

SNR-based detection using scipy.signal.find_peaks, with optional EMG-based deconvolution and per-peak quality metrics.

"peaks": {
    "min_snr": 6,
    "min_width_min": 0.008,
    "max_width_min": 0.50,
    "deconvolution": false,
    "deconv_max_components": 4
}

| Key | Description |
| --- | ----------- |
| min_snr | Minimum signal-to-noise ratio. Prominence threshold = min_snr × noise_sigma. 6 ≈ 2× LOD (ICH Q2) |
| min_width_min | Minimum peak FWHM in minutes (filters spikes) |
| max_width_min | Maximum peak FWHM in minutes (filters broad artifacts); null = no limit |
| deconvolution | If true, attempt EMG-based deconvolution of overlapping peaks (see below) |
| deconv_max_components | Maximum number of peak components to fit per region (default 4) |

Width filtering uses FWHM (full width at half maximum, rel_height=0.5). Integration boundaries are determined separately at 95% base width (rel_height=0.95) to capture the full peak area including tails. This prevents tailing peaks from being incorrectly filtered while still measuring their complete area.
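That scheme can be sketched with scipy.signal; the function name and the externally supplied noise estimate are assumptions, not the package's API:

```python
import numpy as np
from scipy.signal import find_peaks, peak_widths

def pick_peaks_sketch(y, dt_min, noise_sigma, min_snr=6, min_width_min=0.008):
    """SNR-thresholded detection: FWHM (rel_height=0.5) for filtering,
    95% base width (rel_height=0.95) for integration boundaries."""
    peaks, _ = find_peaks(y, prominence=min_snr * noise_sigma)
    fwhm_min = peak_widths(y, peaks, rel_height=0.5)[0] * dt_min   # minutes
    keep = fwhm_min >= min_width_min              # drops single-sample spikes
    peaks, fwhm_min = peaks[keep], fwhm_min[keep]
    # Wider boundaries capture tailing without affecting the width filter
    _, _, left_ips, right_ips = peak_widths(y, peaks, rel_height=0.95)
    return peaks, fwhm_min, left_ips * dt_min, right_ips * dt_min
```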

Noise is estimated internally via MAD of first differences (Rousseeuw & Croux, 1993).
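Concretely, that estimator fits in a few lines; the constants follow the usual Gaussian-consistency conventions:

```python
import statistics

def noise_sigma_mad(y):
    """Noise sigma from the MAD of first differences: differencing removes
    slowly varying signal; sqrt(2) undoes the variance doubling of a
    difference of two independent noisy samples; 1.4826 makes the MAD
    consistent with a Gaussian standard deviation."""
    d = [b - a for a, b in zip(y, y[1:])]
    med = statistics.median(d)
    mad = statistics.median(abs(v - med) for v in d)
    return 1.4826 * mad / 2 ** 0.5
```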

Deconvolution

When enabled, the pipeline uses curve fitting to resolve overlapping peaks within merged integration regions. The approach is selective — only regions with multiple detected local maxima are fitted; well-resolved single peaks are left untouched.

The fitting model is the Exponentially Modified Gaussian (EMG), which is the standard chromatographic peak model (Foley & Dorsey, 1983, Anal Chem 55:730). It has 4 parameters per component: amplitude, center (mu), Gaussian width (sigma), and exponential decay (tau). The tau parameter captures the characteristic tailing observed in GC-MS peaks due to column dispersion and interaction kinetics.
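For reference, the EMG is available in scipy as scipy.stats.exponnorm (shape parameter K = tau/sigma). Because the underlying pdf integrates to one, the amplitude of a fitted component equals its analytical area, which is why deconvolved areas can be read directly off the fit. The function name here is illustrative:

```python
import numpy as np
from scipy.stats import exponnorm

def emg_peak(t, amplitude, mu, sigma, tau):
    """Exponentially Modified Gaussian: a Gaussian (mu, sigma) convolved
    with an exponential decay (tau). The pdf integrates to 1, so
    `amplitude` is also the analytical peak area."""
    return amplitude * exponnorm.pdf(t, K=tau / sigma, loc=mu, scale=sigma)
```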

Fitting procedure per region:

  1. Detect local maxima with a low prominence threshold to find shoulders
  2. Estimate initial parameters from signal shape (FWHM → sigma, trailing decay rate → tau)
  3. Fit sum-of-EMGs with constrained centers (±0.01 min around detected maxima)
  4. If EMG fit fails or is poor (normalized RMSE > 0.12), fall back to sum-of-Gaussians
  5. If both fail, keep the region as a single peak
  6. Adjacent deconvolved peak boundaries are set at valley minima in the original signal (no overlap)

Integration regions are merged using a smart criterion: overlapping intervals are only merged when the valley between the two peaks stays above 30% of the smaller peak's height. If the valley drops below this threshold, the regions are kept separate.
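The merge criterion reduces to a one-line test (a sketch with assumed names):

```python
def should_merge(valley_height, height_a, height_b, threshold=0.30):
    """Merge two overlapping integration regions only if the valley between
    them stays above `threshold` of the smaller peak's height."""
    return valley_height > threshold * min(height_a, height_b)
```

For example, a valley at 20% of a 100-unit peak keeps the regions separate, while the same valley next to a 50-unit neighbor (where it sits at 40% of the smaller height) merges them.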

Deconvolved peaks are marked with is_deconvolved = 1.0 in the peak table. Their area is computed from the fitted EMG/Gaussian component (analytical integration), not from the raw signal.

Typical performance overhead: ~0.5–2 seconds per chromatogram (depending on the number of multi-peak regions).

Quality metrics

Each detected peak is assigned quality metrics:

| Metric | Description |
| ------ | ----------- |
| prominence_height_ratio | Prominence / height. Close to 1.0 for well-resolved peaks rising from baseline; low for peaks riding on the tail of a neighbor. |
| tailing_factor | USP tailing factor at 5% peak height (USP <621>, EP 2.2.46). 1.0 = symmetric, >1 = tailing, <1 = fronting. NaN if unmeasurable. |
| local_snr | SNR computed from a local noise estimate (±0.3 min window) rather than the global noise. More representative for chromatograms with non-uniform noise. |
| quality | Combined classification: high, good, medium, or low (see below). |

Quality classification thresholds (all must pass for the level):

| Level | local_snr | prominence_height_ratio | tailing_factor |
| ----- | --------- | ----------------------- | -------------- |
| high | ≥ 50 | ≥ 0.8 | 0.8 – 2.0 |
| good | ≥ 20 | ≥ 0.5 | 0.5 – 3.0 |
| medium | ≥ 10 | ≥ 0.3 | 0.3 – 5.0 |
| low | below medium | | |

Tailing ranges are relaxed compared to HPLC conventions to accommodate typical GC-MS peak shapes. Unmeasurable tailing (NaN) does not penalize quality. These thresholds are designed as a starting point — peaks are flagged, not filtered, so downstream analysis can apply its own criteria.
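The thresholds translate into a small cascading check. This sketch treats NaN tailing as passing, matching the "does not penalize" rule; the function name is illustrative:

```python
import math

def classify_quality(local_snr, phr, tailing):
    """Return the highest quality level whose thresholds all pass;
    NaN tailing counts as passing (unmeasurable does not penalize)."""
    levels = [("high", 50, 0.8, (0.8, 2.0)),
              ("good", 20, 0.5, (0.5, 3.0)),
              ("medium", 10, 0.3, (0.3, 5.0))]
    for name, snr_min, phr_min, (t_lo, t_hi) in levels:
        t_ok = math.isnan(tailing) or t_lo <= tailing <= t_hi
        if local_snr >= snr_min and phr >= phr_min and t_ok:
            return name
    return "low"
```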

Alignment

RT correction using one or two anchor substances (referenced by name from the active method).

"alignment": {"mode": "none", "substances": []}
"alignment": {"mode": "shift", "substances": ["D5"]}
"alignment": {"mode": "linear", "substances": ["D3", "IS_hexadecaneD34"]}

| Mode | Description |
| ---- | ----------- |
| none | No alignment |
| shift | Constant RT shift (1 anchor): RT' = RT − offset |
| linear | Linear transform (2 anchors): RT' = a·RT + b |
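Solving the two-anchor case is elementary; a sketch with assumed variable names:

```python
def linear_rt_transform(obs1, obs2, ref1, ref2):
    """Coefficients for RT' = a*RT + b mapping two observed anchor RTs
    onto their expected (method-defined) RTs."""
    a = (ref2 - ref1) / (obs2 - obs1)
    b = ref1 - a * obs1
    return a, b
```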

Anchors are validated before use: the found maximum must pass an SNR check (≥ 5) and must not be at the window edge. If validation fails, alignment is skipped with a warning identifying the sample and substance.

Local linear detrending is applied to the substance window before searching for the apex. A straight line is fitted between mean-of-3 anchors at the window edges and subtracted from the signal. This makes anchor detection robust against small peaks sitting on the tail of a larger neighbor and against peaks on a sloping baseline residual.
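A minimal version of that detrending step, anchoring the line at the centers of the two mean-of-3 edge windows (a sketch, not the package's code):

```python
def detrend_window_sketch(y):
    """Fit a line between mean-of-3 anchors at the window edges (centered
    at indices 1 and n-2) and subtract it, so a small peak sitting on a
    sloping tail ends up on a locally flat baseline."""
    n = len(y)
    left = sum(y[:3]) / 3.0        # mean-of-3 anchor at the left edge
    right = sum(y[-3:]) / 3.0      # mean-of-3 anchor at the right edge
    slope = (right - left) / (n - 3)
    return [v - (left + slope * (i - 1)) for i, v in enumerate(y)]
```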

Normalization

"normalization": {"mode": "none", "substance": null, "fallback": "total_peak_area"}
"normalization": {"mode": "anchor_area", "substance": "IS_heptaneD16", "fallback": "total_peak_area"}
"normalization": {"mode": "total_peak_area"}

| Mode | Description |
| ---- | ----------- |
| none | No normalization |
| anchor_area | Divide by IS peak area (recommended) |
| anchor_height | Divide by IS peak height |
| total_peak_area | Divide by sum of all integrated peak areas |

Anchor-based modes validate the IS peak (SNR check, optional area range check). If validation fails, the pipeline automatically falls back to the mode specified in fallback. The substance name is looked up in the active method in methods.json.

For anchor modes, the IS peak is re-found and re-integrated using local linear detrending in the substance window — independent of whether the global peak picker found it. This gives a tangent-skim-style area equivalent to commercial chromatography software, robust against sloping baselines and small peaks on tails. The detrended values then replace the IS row's area/height/RT in the peak table; if the global picker missed the IS entirely, a new row is inserted with quality = "anchor_detrended" and NaN for area_raw / height_raw.

methods.json reference

Defines per-instrument/method anchor substance dictionaries. Each method has a name, optional description, and an anchor_substances dictionary.

{
    "shimadzu_standard": {
        "description": "Shimadzu GC-MS, standard method",
        "anchor_substances": {
            "D3":               {"expected_rt_min": 3.70, "window_min": 0.08},
            "IS_heptaneD16":    {"expected_rt_min": 6.00, "window_min": 0.08, "expected_area_range": null},
            ...
        }
    },
    "agilent_standard": {
        "description": "Agilent GC-MS, standard method",
        "anchor_substances": { ... }
    }
}

Substance names referenced in config.json (alignment/normalization) are looked up in the active method's anchor_substances. This is validated at startup before any file is processed — a typo in a substance name gives an immediate error, not a mid-batch failure.

expected_area_range is optional. Set to [low, high] after establishing typical IS areas from a few runs. Used by normalization to flag magnitude anomalies or co-elution.

Output files

For each processed sample, the pipeline writes three files. If a suffix is set (via --suffix or run_id), it is inserted into the filename: <sample_id>__<suffix>__<tag>.<ext>. Without a suffix: <sample_id>__<tag>.<ext>.

| File | Contents |
| ---- | -------- |
| *__processed_tic.csv | Full signal trace (see columns below) |
| *__peak_table.csv | Peak list (see columns below) |
| *__qc.json | Complete parameter record and quality metrics |

processed_tic.csv columns

| Column | Description |
| ------ | ----------- |
| rt_min | Retention time on uniform resampled grid (minutes) |
| tic_raw | Intensity after resampling (before processing) |
| tic_smooth | After Savitzky-Golay smoothing |
| baseline | Estimated baseline curve |
| tic_corrected | tic_smooth − baseline |
| rt_min_aligned | RT axis after alignment (= rt_min if alignment is off) |
| tic_aligned_corrected | Corrected signal on aligned RT axis |
| tic_final | After normalization (fully processed signal) |

For downstream use, rt_min_aligned + tic_final is always the fully processed pair.
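For example, a downstream script might load the processed pair with pandas (hypothetical helper name; column names are from the table above):

```python
import pandas as pd

def load_final_trace(path):
    """Return the fully processed (RT, intensity) pair from a
    *__processed_tic.csv file."""
    df = pd.read_csv(path)
    return df["rt_min_aligned"].to_numpy(), df["tic_final"].to_numpy()
```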

peak_table.csv columns

| Column | Description |
| ------ | ----------- |
| sample_id | Sample identifier |
| rt_apex_min | Retention time of peak apex (minutes) |
| rt_start_min | Integration start (minutes), at 95% base width |
| rt_end_min | Integration end (minutes), at 95% base width |
| area_raw | Integrated area before normalization |
| area | Integrated area after normalization |
| height_raw | Peak height before normalization |
| height | Peak height after normalization |
| prominence | Peak prominence above local baseline |
| width_min | Peak FWHM in minutes (full width at half maximum) |
| snr | Global signal-to-noise ratio (height / global noise_sigma) |
| is_deconvolved | 1.0 if the peak comes from EMG/Gaussian deconvolution, 0.0 otherwise |
| prominence_height_ratio | Prominence / height (1.0 = rises from baseline, low = on shoulder) |
| tailing_factor | USP tailing factor at 5% height (1.0 = symmetric; NaN if unmeasurable) |
| local_snr | SNR from local noise estimate (±0.3 min window around peak) |
| quality | Combined quality: high, good, medium, or low |

When normalization is "none", area_raw = area and height_raw = height.

Batch outputs

In batch mode, additional files are written to the output directory:

| File | Description |
| ---- | ----------- |
| batch_summary.csv | Per-sample summary (see columns below) |
| <timestamp>_run-batch.log | Full log with timestamps, progress, warnings, and errors for every file. Filename includes the run timestamp (YYYY-MM-DD_HH-MM-SS) so multiple runs in the same output directory don't overwrite each other. |
| baseline_eval_results.csv | Baseline quality scores (only if blmetrics is enabled) |

batch_summary.csv columns

Compiled from all per-sample QC files after batch processing. One row per sample.

| Column | Description |
| ------ | ----------- |
| sample_id | Sample identifier |
| n_peaks | Number of peaks detected |
| norm_mode | Normalization mode used (none, anchor_area, total_peak_area, etc.) |
| norm_status | Normalization outcome (ok, anchor_not_found, anchor_validation_failed, etc.) |
| norm_factor | Applied normalization factor |

The following columns appear when anchor-based normalization is used:

| Column | Description |
| ------ | ----------- |
| anchor_rt | Detected RT of the anchor/IS peak (minutes) |
| anchor_area | Raw integrated area of the anchor peak |
| anchor_height | Height of the anchor peak |
| anchor_snr | SNR of the anchor peak |
| anchor_valid | True if anchor passed all validation checks |
| anchor_issues | Semicolon-separated list of validation failure reasons (empty if valid) |

The following columns appear when anchor validation failed and fallback was triggered:

| Column | Description |
| ------ | ----------- |
| fallback_mode | Fallback normalization mode that was applied (e.g., total_peak_area) |
| fallback_status | Outcome of the fallback normalization |

The following columns are added automatically when ≥ 3 samples have valid anchor areas, for batch-level IS consistency checking:

| Column | Description |
| ------ | ----------- |
| anchor_area_zscore | Robust z-score of the anchor area relative to the batch (MAD-scaled deviation from the median) |
| anchor_area_outlier | True if the absolute z-score exceeds 2.0, indicating the IS area for this sample is anomalous compared to the batch |

Outlier flags help identify samples with potential co-elution on the IS peak, sample preparation errors, or injection volume inconsistencies — issues that per-sample validation alone may not catch.
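The batch-level check can be sketched as follows (the constant 1.4826 is the usual Gaussian-consistency factor; the function name is assumed):

```python
import statistics

def anchor_area_flags(areas, cutoff=2.0):
    """Robust z-scores via MAD-scaled deviation from the median,
    plus outlier flags at |z| > cutoff."""
    med = statistics.median(areas)
    mad = statistics.median(abs(a - med) for a in areas)
    # Guard against a zero MAD (all areas identical): nothing is an outlier
    scale = 1.4826 * mad if mad > 0 else float("inf")
    z = [(a - med) / scale for a in areas]
    return z, [abs(v) > cutoff for v in z]
```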

Logging and progress

The pipeline logs to both the terminal and a timestamped log file (<timestamp>_run-batch.log) in the output directory. The timestamp format is YYYY-MM-DD_HH-MM-SS, so each run produces its own log file.

Terminal output shows a parameter summary at the start of each run, per-file progress with peak counts, and any warnings or errors. Example:

============================================================
automa-tic v0.2.0
============================================================
  Method:        shimadzu_standard
                 Shimadzu GC-MS, standard method
  Sampling:      100 ms
  Cutoff:        RT 1.6 – 28.0 min
  Smoothing:     SavGol window=0.015 min, order=3
  Baseline:      asls (lam=1e+06, p=0.01)
  Peaks:         SNR >= 6, width 0.008–0.5 min
  Alignment:     shift → D5
  Normalization: anchor_area → IS_heptaneD16
============================================================
  Batch: 12 file(s) in data/
--- File 1/12: sample_001.csv ---
  [sample_001] Reading...
  [sample_001] 47 peaks found
  [sample_001] Done.
--- File 2/12: sample_002.csv ---
  [sample_002] Reading...
  [sample_002] Alignment anchor 'D5' (expected RT=7.300): below_snr (SNR=2.1, required=5.0). Skipping alignment.
  [sample_002] 38 peaks found
  [sample_002] Done.

The log file includes additional debug-level detail for each processing step.

Visualization

plot-one.py — interactive matplotlib plot for one sample:

python plot-one.py output/ sample_id
python plot-one.py output/ sample_id -s asls_test    # with suffix

Defaults to showing tic_final (fully processed signal) with y-axis auto-scaled. Toggleable layers: raw, smooth, baseline, corrected, final, and peak spans. Y-axis auto-rescales when toggling between traces with different magnitudes (e.g., raw vs normalized).

Baseline parameter sweeps

To compare baseline settings:

  1. Set "blmetrics": true in the config.
  2. Set "run_id" to a descriptive label (e.g. "asls_lam1e6_p0.01") — this also becomes the output filename suffix.
  3. Run the pipeline. Results are appended to baseline_eval_results.csv in the output directory.
  4. Change baseline parameters and run_id, repeat.
  5. Compare scores — lower is better. Key metrics: NAR_neg_area_ratio (over-subtraction), residual_ac1_in_mask (residual structure), residual_local_mean_drift_MAD (remaining drift).

The sweep can also be run programmatically using baseevaluation.evaluate_sweep().

Versioning

automa-tic uses Semantic Versioning:

  • MAJOR (1.0.0): breaking changes to config format, output structure, or API
  • MINOR (0.2.0): new features, new output columns, new processing options
  • PATCH (0.1.1): bug fixes, parameter tuning, documentation updates

The current version is stored in gcms_preproc/__init__.py and recorded in every QC JSON output (automa_tic_version field), ensuring full traceability of which pipeline version produced each result.

License

Released under the MIT License — free to use, modify, and distribute, including for commercial purposes, as long as the copyright notice is preserved.
