---
extra_gated_heading: "Acknowledge license and PhysioNet data use agreement"
extra_gated_description: "This dataset contains derived data from PhysioNet restricted-access datasets (MIMIC-CXR). By requesting access, you confirm that you have an active PhysioNet credentialed account and have signed the relevant data use agreements."
extra_gated_button_content: "Request access"
extra_gated_prompt: "You agree to not use this dataset to conduct experiments that cause harm to human subjects, and you confirm compliance with the PhysioNet data use agreement."
extra_gated_fields:
  Full Name: text
  Affiliation: text
  Country: country
  PhysioNet Username: text
  I want to use this dataset for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I have a valid PhysioNet credentialed account with MIMIC-CXR access: checkbox
  I agree to use this dataset for non-commercial use ONLY: checkbox
tags:
  - medical-imaging
  - chest-xray
  - embeddings
  - shortcut-detection
  - fairness
  - bias-detection
  - celeba
  - chexpert
  - mimic-cxr
---
ShortKit-ML Benchmark Data
Pre-computed embeddings, metadata, and full original dataset labels for reproducing paper benchmarks. All embeddings were extracted with seed=42 for full reproducibility.
Full Dataset Files (not just embeddings)
This repository includes the complete original label and metadata files for CheXpert and MIMIC-CXR, not only the embedding subsets used in our experiments:
| File | Rows | Description |
|---|---|---|
| train.csv | 223,414 | Full CheXpert training set: Path, Sex, Age, AP/PA, 14 diagnosis labels |
| valid.csv | 234 | Full CheXpert validation set, same schema as train.csv |
| mimic_cxr/mimic-cxr-2.0.0-chexpert.csv | 227,827 | Full MIMIC-CXR diagnosis labels: 14 CheXpert-style labels per study |
| mimic_cxr/mimic-cxr-2.0.0-metadata.csv | 377,110 | Full MIMIC-CXR DICOM metadata: view position, rows, cols, study date |
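After downloading, it can be useful to confirm that the files match the row counts listed above. The following is a minimal sketch, not part of the repository's tooling: the helper names `count_data_rows` and `check_row_counts` are hypothetical, and it assumes the Rows column counts data records and excludes the header line.

```python
import csv
import io
from pathlib import Path

# Expected data-row counts, copied from the table above.
# Assumption: the counts exclude the CSV header line.
EXPECTED_ROWS = {
    "train.csv": 223_414,
    "valid.csv": 234,
    "mimic_cxr/mimic-cxr-2.0.0-chexpert.csv": 227_827,
    "mimic_cxr/mimic-cxr-2.0.0-metadata.csv": 377_110,
}


def count_data_rows(handle):
    """Count CSV records that follow the header row."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header
    return sum(1 for _ in reader)


def check_row_counts(root):
    """Return {relative_path: (expected, actual)} for present files that mismatch."""
    root = Path(root)
    mismatches = {}
    for rel, expected in EXPECTED_ROWS.items():
        path = root / rel
        if not path.exists():
            continue  # files that were not downloaded are skipped, not reported
        with path.open(newline="") as fh:
            actual = count_data_rows(fh)
        if actual != expected:
            mismatches[rel] = (expected, actual)
    return mismatches


# Tiny in-memory demo with two data rows (made-up values, not real CheXpert rows).
sample = io.StringIO("Path,Sex,Age\na.jpg,Male,60\nb.jpg,Female,45\n")
demo_count = count_data_rows(sample)
```

Calling `check_row_counts("path/to/download")` on a local copy returns an empty dict when every file that is present has the expected number of rows.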
MIMIC-CXR: Johnson et al. MIMIC-CXR-JPG v2.1.0. Embeddings via MITCriticalData/qml-mimic-cxr-embeddings. Diagnosis labels from PhysioNet (CheXpert labeler output). Demographics from MIMIC-IV via subject_id join.
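The subject_id join mentioned above can be pictured as a left join from image-level rows to patient-level rows. The sketch below is illustrative only: the function `join_demographics` is hypothetical, and the choice of `gender` and `anchor_age` as demographic fields is an assumption about which MIMIC-IV patient columns were used, not a description of the actual pipeline.

```python
def join_demographics(image_rows, patient_rows, fields=("gender", "anchor_age")):
    """Left-join patient-level fields onto image-level rows by subject_id."""
    by_subject = {p["subject_id"]: p for p in patient_rows}
    joined = []
    for row in image_rows:
        patient = by_subject.get(row["subject_id"], {})
        merged = dict(row)
        for field in fields:
            # None marks images whose subject has no demographic record.
            merged[field] = patient.get(field)
        joined.append(merged)
    return joined


# Made-up rows for illustration; real identifiers look different.
images = [
    {"dicom_id": "img-a", "subject_id": "100"},
    {"dicom_id": "img-b", "subject_id": "100"},
    {"dicom_id": "img-c", "subject_id": "200"},
]
patients = [{"subject_id": "100", "gender": "F", "anchor_age": "52"}]
rows = join_demographics(images, patients)
```

A left join keeps every image row, so subjects without a demographic record surface as `None` values rather than disappearing silently.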
The _cache/ subdirectory in chexpert_multibackbone/ contains raw PIL images cached during extraction. It is excluded from the HuggingFace upload because it is a large binary pickle; re-run the extraction script to regenerate it.
The MIMIC-CXR *_metadata_orig.csv files are backups taken before the diagnosis join. The *_metadata.csv files are the joined versions and add 14 diagnosis columns.
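One way to sanity-check a backup against its joined counterpart is to compare the two headers and confirm that exactly 14 columns were added. This is a sketch under assumptions: `read_header` and `added_columns` are hypothetical helpers, and the joined files are assumed to use the standard CheXpert label names as column headers, which may not match the actual files.

```python
import csv
import io

# The 14 standard CheXpert labels. Assumption: the joined files use these
# exact strings as column names.
CHEXPERT_LABELS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]


def read_header(handle):
    """Return the first CSV row (the column names)."""
    return next(csv.reader(handle))


def added_columns(orig_header, joined_header):
    """Columns present in the joined file but not in the backup, in file order."""
    orig = set(orig_header)
    return [col for col in joined_header if col not in orig]


# In-memory stand-ins for a *_metadata_orig.csv / *_metadata.csv pair.
base_cols = ["subject_id", "study_id", "ViewPosition"]
orig_csv = io.StringIO(",".join(base_cols) + "\n")
joined_csv = io.StringIO(",".join(base_cols + CHEXPERT_LABELS) + "\n")

new_cols = added_columns(read_header(orig_csv), read_header(joined_csv))
```

With real files, replace the `io.StringIO` objects with handles from `open(path, newline="")`.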
All random seeds are fixed to 42. CheXpert multi-backbone uses the first 2000 samples from the streaming iterator (deterministic ordering from HuggingFace).
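Taking the first 2000 samples from a streaming iterator can be sketched with `itertools.islice`. This is an illustration, not the extraction script: `fake_stream` is a stand-in generator for the real streaming dataset, `take_first` is a hypothetical helper, and seeding Python's `random` module is only one of the seeds a real pipeline would likely need to fix.

```python
import random
from itertools import islice

SEED = 42         # seed stated in this README
N_SAMPLES = 2000  # subset size stated in this README


def fake_stream():
    """Stand-in for a streaming dataset iterator with a fixed order."""
    index = 0
    while True:
        yield {"index": index}
        index += 1


def take_first(stream, n=N_SAMPLES):
    """Materialize the first n items of an iterator without reading further."""
    return list(islice(stream, n))


random.seed(SEED)  # fix Python-level randomness before any sampling code runs
subset = take_first(fake_stream())
```

The subset is reproducible only as long as the upstream iterator yields items in the same order on every run. Taking a prefix involves no random sampling, so the seed does not affect which samples are selected.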