Audit Proteomics Exchange (PRIDE) study metadata from the command line.
PXAudit fetches a PRIDE dataset's project metadata and file list, classifies every file with a deterministic FileTypeClassifier, then assigns two orthogonal quality scores:
- Tier — a 7-level FAIR ladder from None through Diamond, based on metadata completeness and open-data practices.
- Quant Tier — a secondary axis for quantification readiness, from No Quant through Quant-Complete.
Results are written to a local SQLite database so you can query, compare, and track scores over time.
Requires Python ≥ 3.12. uv is the recommended runner.
# Clone and install in editable mode
git clone https://github.com/LangeLab/PXAudit.git
cd PXAudit
uv syncThe pxaudit console script is installed automatically. Verify with:
uv run pxaudit --helpuv run pxaudit check PXD000001On first run, PXAudit fetches project metadata and file lists from the PRIDE REST API and caches both responses under ~/.pxaudit_cache/. Subsequent runs for the same accession are instant (cache hits skip the network entirely). Audit results are written to pxaudit_results.db in the current directory.
Usage: pxaudit check [OPTIONS] ACCESSION
Audit a single Proteomics Exchange accession.
Options:
--no-cache Skip cache reads and always fetch fresh data from PRIDE.
--db PATH SQLite output path. [default: pxaudit_results.db]
--help Show this message and exit.# Audit a single dataset
uv run pxaudit check PXD004683
# Force a fresh fetch (bypasses local cache)
uv run pxaudit check PXD004683 --no-cache
# Write results to a specific database file
uv run pxaudit check PXD004683 --db ~/audits/lab.dbNon-PRIDE accessions (e.g. MSV, JPST, IPX) are accepted without error and assigned the Unverifiable tier — PXAudit only has access to the PRIDE REST API.
Accession : PXD000001
Tier : Silver
Quant Tier: Partial
------------------------------------------------
Metadata
✔ Title TMT proteomics of human cell lines
✔ Organism Homo sapiens (9606)
✔ Instrument LTQ Orbitrap Velos
✘ Organism part annotated
✔ Publication linked
✘ Quant metadata (CV methods)
------------------------------------------------
Files (142 total)
✔ Result/Search files present
✔ PSI-standard results (mzIdentML / mzTab-ID)
✔ Open spectra (mzML / MGF)
✘ SDRF file present
✔ mzTab summary present
✘ Tabular quant table (proteinGroups / evidence)
------------------------------------------------
PXAudit scores each dataset on a 7-tier FAIR ladder. Every tier adds one FAIR requirement to the previous; a dataset must satisfy all criteria up to and including the tier it claims.
| Tier | Requirements |
|---|---|
| None | Missing a mandatory metadata field (title, organism, or instrument). |
| Raw | Mandatory metadata present; no processed result files found. |
| Bronze | Result/search files present, but none are PSI-standard (mzIdentML / mzTab). |
| Silver | PSI-standard results present; no SDRF experimental-design file. |
| Gold | SDRF present; open spectra (mzML / MGF) or organism-part annotation missing. |
| Platinum | Open spectra + organism-part annotation present; no linked PubMed publication. |
| Diamond | All FAIR criteria met: PSI results, SDRF, open spectra, organism part, and a publication. |
Tier logic is version-stamped (
tier_logic_version = "v2.0") and stored in the database so that re-scoring after a logic update can be detected.
The quant tier is independent of the FAIR tier and indicates quantification readiness.
| Quant Tier | Meaning |
|---|---|
| Unverifiable | Non-PRIDE accession; cannot be evaluated. |
| No Quant | No PSI-standard results and no tabular quant files. |
| Partial | Either PSI-standard IDs or a quant table — but not both. |
| Quant-Ready | PSI IDs + tabular quant table present; CV-term quantification metadata missing. |
| Quant-Complete | PSI IDs + tabular quant table + CV-term method metadata — fully described. |
The following scores were produced by running pxaudit check against the live PRIDE REST API on 2026-03-21 and are included in the integration test suite.
| Accession | Tier | Quant Tier |
|---|---|---|
| PXD057701 | Raw | No Quant |
| PXD002244 | Bronze | No Quant |
| PXD000001 | Silver | Partial |
| PXD073444 | Platinum | Partial |
| PXD075811 | Platinum | Partial |
| PXD004683 | Diamond | Partial |
Every check run upserts three tables in the SQLite database:
| Table | Description |
|---|---|
study |
One row per accession — title, organism, instrument, submission year and type, keywords. |
study_files |
One row per file — name, PRIDE category, extension, FTP URL, size in bytes. |
audit |
One row per accession — computed tier, quant tier, 13 has_* quality flags, files_fetch_failed, is_unverifiable, and tier_logic_version. |
Example queries
-- Tier distribution across all audited datasets
SELECT tier, COUNT(*) AS n FROM audit GROUP BY tier ORDER BY n DESC;
-- All Diamond datasets
SELECT accession, quant_tier FROM audit WHERE tier = 'Diamond';
-- Datasets ready for re-scoring after a logic update
SELECT accession FROM audit WHERE tier_logic_version != 'v2.0';
-- File-type breakdown for a single accession
SELECT file_category, COUNT(*) AS n
FROM study_files
WHERE accession = 'PXD004683'
GROUP BY file_category;uv sync
uv run pre-commit installPre-commit runs ruff (lint + format, line-length 100) on every commit.
src/pxaudit/
├── cli.py # click entry points (pxaudit check)
├── tier_engine.py # 7-tier FAIR ladder + quant tier logic
├── file_classifier.py # deterministic FileClass assignment for every file type
├── pride_client.py # PRIDE REST API v3 client with pagination + retry/backoff
├── db.py # SQLite schema + upsert helpers
└── cache.py # local JSON response cache (~/.pxaudit_cache/)
# Unit tests (default — no network required)
uv run pytest
# With coverage report
uv run pytest --cov=pxaudit --cov-report=term-missing
# Live integration tests against the real PRIDE API (requires network)
uv run pytest -m integration -v --no-covThe default run excludes integration tests (-m 'not integration' is set in pyproject.toml). The test suite has 384 unit tests with 100% branch coverage across all modules, plus 10 live integration tests covering six real PRIDE accessions.
PXAudit only has access to the PRIDE REST API. Accessions from other repositories (MassIVE MSV, jPOST JPST, iProX IPX) are accepted as valid input but receive:
tier = "Unverifiable"quant_tier = "Unverifiable"- All Boolean flags set to
False
This is by design — no assertion of quality is made for data that cannot be inspected.
v0.1.0 covers single-study auditing. Planned work beyond this release:
- Bulk audit —
pxaudit bulk-audit --input accessions.txtwith rate-limited batch processing and TSV/JSON export alongside the SQLite database. - File manifest — extend
study_fileswith richer provenance (download URLs, file sizes already captured; planned: per-file hash, last-seen timestamps). - Reporting —
pxaudit report --db results.dbgenerating tier distributions, SDRF adoption trends, metadata completeness over time, and an exemplar shortlist as a Quarto-rendered HTML report. - Multi-repository — plugin adapters for MassIVE, jPOST, and iProX so non-PRIDE accessions are audited rather than marked Unverifiable.
Contributions and issue reports are welcome.
If you use PXAudit in your research, please cite it as:
@software{ergin_pxaudit_2026,
author = {Ergin, Enes Kemal},
title = {{PXAudit}: A command-line tool for auditing {Proteomics Exchange} study metadata},
year = {2026},
version = {0.1.0},
url = {https://github.com/LangeLab/PXAudit},
license = {MIT},
}A CITATION.cff file is included in the repository root for tools that parse it automatically (e.g. GitHub's Cite this repository button, Zenodo).
MIT