This repo includes technical documentation and a Python codebase for preparing data for 3D brain MR image classification using the ADNI dataset.
We have used this preprocessing and processing pipeline for the ADNI dataset in several publications (see: About Citation).
✅ REFACTORED: Converted from Jupyter Notebooks to modular Python codebase
- Clean package structure with reusable modules
- Executable scripts for each pipeline step
- Comprehensive logging and configuration
- Full documentation in STRUCTURE.md
Execute the complete data processing pipeline:
python scripts/run_pipeline.py --seq T1 --cond AD --step all
See Usage Examples for detailed instructions.
ADNI-processing/
├── src/ # Core Python package
│ ├── config.py # Configuration & constants
│ ├── metadata.py # Metadata utilities
│ ├── file_operations.py # File moving functions
│ └── logging.py # Logging utilities
│
├── scripts/ # Executable workflows
│ ├── move_preprocessed_files.py
│ ├── move_to_preprocess.py
│ ├── move_to_convert.py
│ ├── move_final_files.py
│ └── run_pipeline.py # Master orchestrator
│
├── outputs/ # Results & logs
│ └── logs/
│
├── TempMeta/ # Metadata CSVs
├── TempData/ # Temporary data
├── preprocessed/ # Final preprocessed files
├── final/ # Final output
│
└── STRUCTURE.md # Detailed structure documentation
See STRUCTURE.md for complete details on the Python codebase structure, module reference, and usage examples.
The ADNI data processing pipeline follows these high-level steps:
1. Data Collection: Download raw DICOM and NIfTI files from ADNI
2. Data Cleaning: Filter relevant sequences (T1-weighted MP-RAGE, T2-weighted TSE/FSE) and create matched T1/T2 pairs
3. Data Organizing: Move files to preprocessing queue with proper directory structure
4. Preprocessing: Apply SPM normalization, skull-stripping, registration, and scaling
5. Augmentation: Apply data augmentation (flipping, rotation) to balance dataset
6. Final Output: Move processed files to final output directory
The refactored Python codebase automates steps 3-6 with reusable, configurable modules. Steps 1-2 (collection and cleaning) involve manual data acquisition and metadata preparation.
Input data structure expected:
./DICOM/{seq}/{cond}/ # Raw DICOM files
./3T/{seq}/{cond}/ # Raw NIfTI files
./TempMeta/ # Metadata CSVs
Required metadata file:
- Balanced_Meta_{seq}w_{cond}.csv - Finalized balanced dataset list
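The expected layout can be checked before running any step. The following is a minimal sketch rather than the repository's own code: the helper names `expected_inputs` and `missing_inputs` are hypothetical, and the sketch relies only on the directory and file naming listed above.

```python
from pathlib import Path


def expected_inputs(seq: str, cond: str, root: str = ".") -> dict:
    """Build the input paths described above for one sequence/condition pair."""
    base = Path(root)
    return {
        "dicom": base / "DICOM" / seq / cond,   # raw DICOM files
        "nifti": base / "3T" / seq / cond,      # raw NIfTI files
        "metadata": base / "TempMeta" / f"Balanced_Meta_{seq}w_{cond}.csv",
    }


def missing_inputs(seq: str, cond: str, root: str = ".") -> list:
    """Return the names of expected inputs that do not exist on disk."""
    return [name for name, path in expected_inputs(seq, cond, root).items()
            if not path.exists()]
```

Calling `missing_inputs("T1", "AD")` from the project root returns an empty list when all three inputs are in place.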
The original Jupyter notebooks have been converted into a modular Python package that automates the data processing pipeline.
The core file movement and organization operations are implemented as reusable Python functions in src/file_operations.py. Each function handles a specific step in the data processing pipeline.
Collects already-preprocessed files from archive folders and moves them to the target preprocessed directory.
Purpose: Handle previously processed data to avoid reprocessing
Function:
movePreprocessed(meta_df, path, seq, cond, tesla=3, divider="raw_")
What it does:
- Searches for preprocessed files matching the pattern **/wm*.nii
- Compares metadata with available files
- Moves matching files to /preprocessed/{seq}/{cond}/
- Exports the list of not-yet-preprocessed files as a CSV for the next step
- Indexes files with the metadata ID: /preprocessed/{seq}/{cond}/{meta-id}-{filename}.nii
Inputs:
- Balanced metadata CSV: TempMeta/Balanced_Meta_{seq}w_{cond}.csv
- Source preprocessed files: ./preprocessed_old/{seq}/{cond}/
Outputs:
- Organized preprocessed files: ./preprocessed/{seq}/{cond}/
- Unprocessed list: TempMeta/To-Be-Preprocessed_{seq}w_{cond}.csv
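The matching and indexing logic can be illustrated with a dry-run planner. This is a sketch under stated assumptions, not the body of movePreprocessed: the function name `plan_preprocessed_moves` is hypothetical, and matching a record to a file through an image ID at the end of the filename is an assumption made for illustration.

```python
from pathlib import Path


def plan_preprocessed_moves(records, source_dir, target_dir):
    """Pair metadata records with wm*.nii files and plan indexed destinations.

    records: iterable of (meta_id, image_id) tuples. Matching on an image ID
    embedded in the filename is an assumption of this sketch.
    """
    candidates = sorted(Path(source_dir).glob("**/wm*.nii"))
    moves, missing = [], []
    for meta_id, image_id in records:
        # A file matches when its stem ends with "_<image_id>".
        match = next((f for f in candidates
                      if f.stem.endswith(f"_{image_id}")), None)
        if match is None:
            missing.append((meta_id, image_id))  # still needs preprocessing
        else:
            # Destination follows the {meta-id}-{filename}.nii convention.
            moves.append((match, Path(target_dir) / f"{meta_id}-{match.name}"))
    return moves, missing
```

The `missing` list plays the role of the exported To-Be-Preprocessed CSV, while `moves` holds the source and destination pairs that would be moved.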
Usage:
python scripts/move_preprocessed_files.py --seq T1 --cond AD --path ./preprocessed_old
Organizes raw NIfTI files that need preprocessing with proper directory structure for Windows-based processing.
Purpose: Prepare files for SPM/MATLAB preprocessing on Windows machines
Function:
move2preprocess(meta_df, seq, cond, tesla=3, divider="raw_")
What it does:
- Identifies files to be preprocessed from metadata
- Organizes each file in a subdirectory: {subject-id}-{series-id}-{image-id}
- Moves to /TempData/{seq}/{cond}/ for transfer to Windows
- Maintains filename integrity for later matching
File Organization:
TempData/T1/AD/
├── 002_S_0001-S29096-I41124/
│ └── ADNI_002_S_0001_MR_MPRAGE_br_raw_20070329110738780_1_S29096_I41124.nii
├── 002_S_0456-S29097-I41125/
│ └── ADNI_002_S_0456_MR_MPRAGE_br_raw_20070330120442156_2_S29097_I41125.nii
└── ...
Inputs:
- Unprocessed metadata: TempMeta/To-Be-Preprocessed_{seq}w_{cond}.csv
- Raw files: ./3T/{seq}/
Outputs:
- Organized files: ./TempData/{seq}/{cond}/{subject}-{series}-{image}/
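The subdirectory name can be derived from the filename itself. The sketch below is illustrative only: the helper `subject_dir_name` and its regular expression are assumptions based on the two example filenames shown under File Organization, not code taken from move2preprocess.

```python
import re

# Assumed filename shape, inferred from the examples above:
# ADNI_<subject>_..._<series>_<image>.nii
ADNI_NAME = re.compile(
    r"^ADNI_(?P<subject>\d{3}_S_\d{4})_.*_(?P<series>S\d+)_(?P<image>I\d+)\.nii$"
)


def subject_dir_name(filename: str) -> str:
    """Return '{subject-id}-{series-id}-{image-id}' for an ADNI NIfTI filename."""
    match = ADNI_NAME.match(filename)
    if match is None:
        raise ValueError(f"Unrecognized ADNI filename: {filename}")
    return "-".join(match.group("subject", "series", "image"))


name = "ADNI_002_S_0001_MR_MPRAGE_br_raw_20070329110738780_1_S29096_I41124.nii"
print(subject_dir_name(name))  # 002_S_0001-S29096-I41124
```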
Usage:
python scripts/move_to_preprocess.py --seq T1 --cond AD
Prepares DICOM files for DICOM-to-NIfTI conversion by organizing them with proper metadata-based directory structure.
Purpose: Organize raw DICOM files before conversion process
Function:
move2convert(meta_df, seq, cond, tesla=3, divider="raw_")
What it does:
- Matches DICOM files with metadata
- Creates subdirectories: {subject-id}-{series-id}_{image-id}
- Moves to /2convert/{seq}/{cond}/ for conversion process
- Preserves original DICOM filenames for conversion tools
File Organization:
2convert/T1/AD/
├── 002_S_0001-S29096_I41124/
│ ├── DICOM_file_001.dcm
│ ├── DICOM_file_002.dcm
│ └── ...
├── 002_S_0456-S29097_I41125/
│ └── ...
└── ...
Notes:
- DICOM files must be in: ./DICOM/{seq}/{cond}/
- Directory naming helps conversion tools process related images together
- This step is crucial before DICOM-to-NIfTI conversion
Inputs:
- Balanced metadata CSV: TempMeta/Balanced_Meta_{seq}w_{cond}.csv
- DICOM files: ./DICOM/{seq}/{cond}/
Outputs:
- Organized DICOM: ./2convert/{seq}/{cond}/{subject}-{series}_{image}/
Usage:
python scripts/move_to_convert.py --seq T1 --cond AD
Moves converted NIfTI files from conversion folder to preprocessed folder with metadata indexing.
Purpose: Archive converted files in preprocessed folder
Function:
moveConverted(meta_df, seq, cond, tesla=3, divider="br_")
What it does:
- Searches for white matter segmented files: **/wm*.nii
- Matches metadata with filenames
- Moves to /preprocessed/{seq}/{cond}/
- Indexes with metadata ID for later pairing of T1 and T2
Inputs:
- Balanced metadata CSV: TempMeta/Balanced_Meta_{seq}w_{cond}.csv
- Converted files: ./Converted/{seq}/{cond}/
Outputs:
- Preprocessed files: ./preprocessed/{seq}/{cond}/{meta-id}-{filename}.nii
Flexible function that moves any files based on filename pattern matching.
Purpose: Move final preprocessed/processed files to output directory
Function:
freemove(source_path, target_path, seq, cond, tesla=3, file_format='**/*wm*.nii')
What it does:
- Uses customizable glob pattern to find files
- Default pattern matches white matter segmented files: **/wm*.nii
- Moves to /final/{seq}/{cond}/ with indexed naming
- Flexible for different file types and patterns
File Organization:
final/T1/AD/
├── 0-wm{filename}.nii
├── 1-wm{filename}.nii
├── 2-wm{filename}.nii
└── ...
Parameters:
- file_format: Glob pattern (e.g., **/*wm*.nii, **/*y_*.nii, **/*p*.nii)
- Can move different file types: segmented, normalized, warped, etc.
Inputs:
- Processed files: ./processed/{seq}/{cond}/
Outputs:
- Final files: ./final/{seq}/{cond}/{index}-{filename}.nii
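Pattern-based moving with indexed names can be sketched with `pathlib` and `shutil`. This is an illustration rather than the actual freemove implementation: the function name `indexed_move` is hypothetical, and assigning indices in sorted path order is an assumption, since the indexing order is not documented here.

```python
import shutil
from pathlib import Path


def indexed_move(source_path, target_path, seq, cond, file_format="**/*wm*.nii"):
    """Move files matching a glob pattern to {target}/{seq}/{cond}/{index}-{name}."""
    source = Path(source_path) / seq / cond
    target = Path(target_path) / seq / cond
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() materializes the matches before any file is moved and makes
    # the index assignment reproducible (an assumption of this sketch).
    for index, src in enumerate(sorted(source.glob(file_format))):
        dst = target / f"{index}-{src.name}"
        shutil.move(str(src), str(dst))
        moved.append(dst)
    return moved
```

Passing a different `file_format`, such as `"**/*y_*.nii"`, selects another file type without changing the rest of the logic.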
Usage:
# Move white matter segmented files (default)
python scripts/move_final_files.py --seq T1 --cond AD \
--source ./processed --target ./final
# Move normalized files
python scripts/move_final_files.py --seq T1 --cond AD \
--source ./processed --target ./final --pattern "**/*n*.nii"
# Move warped files
python scripts/move_final_files.py --seq T1 --cond AD \
--source ./processed --target ./final --pattern "**/*wn*.nii"
Organizes hold-out datasets for robustness evaluation with proper grouping.
Purpose: Prepare new/hold-out data that has never been used in training
Function:
move2separate(meta_df, seq, tesla=3, ONLY_BASELINE=False, divider="Br_")
What it does:
- Filters new datasets from different sources
- Excludes data from subjects already in training set
- Organizes into /DataSep/ for manual verification
- Enables robustness testing with completely new data
Inputs:
- New dataset metadata: HoldOut_Cleaned_{seq}w_{cond}_{tesla}T.csv
- New data files: source directory
Outputs:
- Separated data: ./DataSep/{seq}/{subject-id}-{series-id}/
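The exclusion of subjects already seen in training can be expressed as a simple set filter. The sketch below is not the move2separate implementation: the function name `exclude_training_subjects` and the `"Subject"` key are hypothetical and would need to match the real metadata columns.

```python
def exclude_training_subjects(holdout_rows, training_subjects, key="Subject"):
    """Keep only hold-out rows whose subject never appears in the training set.

    holdout_rows: list of dicts (e.g. rows read with csv.DictReader).
    key: name of the subject-ID column; "Subject" is an assumed column name.
    """
    seen = set(training_subjects)
    return [row for row in holdout_rows if row[key] not in seen]


rows = [
    {"Subject": "002_S_0001", "Series": "S29096"},
    {"Subject": "002_S_0456", "Series": "S29097"},
]
kept = exclude_training_subjects(rows, ["002_S_0001"])
print([row["Subject"] for row in kept])  # ['002_S_0456']
```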
Run the complete workflow automatically:
python scripts/run_pipeline.py --seq T1 --cond AD --step all
Pipeline Steps:
- Move preprocessed files
- Move files to preprocessing queue
- Move DICOM files for conversion
- Move final processed files
Options:
--seq {T1, T2} # MRI sequence (required)
--cond {AD, CN, MCI} # Condition (required)
--step {all|step_name} # Which step to run (default: all)
--old-path PATH # Path to old preprocessed files
--source-path PATH # Path to processed files
--target-path PATH # Output path for final files
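These options map naturally onto `argparse`. The following is a minimal sketch of such a parser, assuming only the flags and choices listed above; it is not copied from scripts/run_pipeline.py, and the function name `build_parser` is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented run_pipeline.py options."""
    parser = argparse.ArgumentParser(
        description="Run the ADNI file-organization pipeline")
    parser.add_argument("--seq", required=True, choices=["T1", "T2"],
                        help="MRI sequence")
    parser.add_argument("--cond", required=True, choices=["AD", "CN", "MCI"],
                        help="Condition")
    parser.add_argument("--step", default="all",
                        help="Which step to run (default: all)")
    parser.add_argument("--old-path", help="Path to old preprocessed files")
    parser.add_argument("--source-path", help="Path to processed files")
    parser.add_argument("--target-path", help="Output path for final files")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--seq", "T1", "--cond", "AD"])
print(args.step)  # all
```

With `choices` set, an unsupported value such as `--seq T3` is rejected by the parser before any file is touched.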
Examples:
# Process single group
python scripts/run_pipeline.py --seq T1 --cond AD --step all
# Run only preprocessing step
python scripts/run_pipeline.py --seq T1 --cond AD --step move_to_preprocess
# Process multiple groups (T1)
for cond in AD CN MCI; do
python scripts/run_pipeline.py --seq T1 --cond $cond --step all
done
# Process all groups (T1 and T2)
for seq in T1 T2; do
for cond in AD CN MCI; do
python scripts/run_pipeline.py --seq $seq --cond $cond --step all
done
done
All operations are logged to outputs/logs/ with timestamps and details:
outputs/logs/
├── move_preprocessed_files_20241229_101523.log
├── move_to_preprocess_20241229_101545.log
├── move_to_convert_20241229_101612.log
└── move_final_files_20241229_101634.log
Log example:
2024-12-29 10:15:23,456 - move_preprocessed_files - INFO - Starting: Move Preprocessed Files
2024-12-29 10:15:24,123 - move_preprocessed_files - INFO - Loaded metadata with 150 records
2024-12-29 10:15:45,789 - move_preprocessed_files - INFO - Total T1w-AD data is 145 and not preprocessed is 5
2024-12-29 10:15:45,456 - move_preprocessed_files - INFO - Completed Move Preprocessed Files in 21.57s
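A logger that produces file names and lines in this shape can be set up with the standard `logging` module. This is a sketch only, assuming the format seen in the example above; the function name `setup_logger` is hypothetical and the repository's src/logging.py may be organized differently.

```python
import logging
from datetime import datetime
from pathlib import Path


def setup_logger(name, log_dir="outputs/logs"):
    """Create a file logger writing to {log_dir}/{name}_{YYYYmmdd_HHMMSS}.log."""
    Path(log_dir).mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = Path(log_dir) / f"{name}_{stamp}.log"

    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_file)
    # Same field order as the log example: time - name - level - message.
    handler.setFormatter(logging.Formatter(
        "%(asctime)s - %(name)s - %(levelname)s - %(message)s"))
    # Note: calling this twice with the same name adds a second handler.
    logger.addHandler(handler)
    return logger, log_file
```

A script would then call `logger, _ = setup_logger("move_to_convert")` once at startup and use `logger.info(...)` for each step.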
Global configuration in src/config.py:
- Directory paths (inputs, outputs, temporary)
- Sequences, conditions, Tesla strengths
- File patterns and delimiters
- Metadata column definitions
- Logging settings
Customize these settings for different project configurations.
Data preprocessing is performed using SPM and includes:
- Normalization / Intensity normalization
- Skull-stripping
- Registration to standard template
- Scaling
Data augmentation (flipping and rotation) is applied post-preprocessing to balance dataset sizes.
For model training, data loading ensures:
- Proper split-before-augmentation to prevent data leakage
- Paired T1/T2 images fed together for joint training
- Metadata ID-based grouping for cross-validation
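The split-before-augmentation idea can be sketched in a few lines. This is an illustration under stated assumptions, not the training code used in the publication: the names `split_by_group` and `augment` are hypothetical, the group key stands in for the metadata ID, and the augmentation is a placeholder that only tags items.

```python
import random
from collections import defaultdict


def split_by_group(samples, test_fraction=0.2, seed=0):
    """Split (group_id, item) pairs so that no group spans both sets."""
    by_group = defaultdict(list)
    for group_id, item in samples:
        by_group[group_id].append(item)
    groups = sorted(by_group)
    random.Random(seed).shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test = [(g, i) for g in groups[:n_test] for i in by_group[g]]
    train = [(g, i) for g in groups[n_test:] for i in by_group[g]]
    return train, test


def augment(train):
    """Placeholder augmentation: add one tagged copy per training item."""
    return train + [(g, f"{i}_flipped") for g, i in train]


samples = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("d", 5), ("e", 6)]
train, test = split_by_group(samples)
train = augment(train)  # augmentation touches the training set only
```

Because augmentation runs after the split, no augmented copy of a test image can end up in the training set.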
For more detail, please refer to the publication:
[1] Rumala, D.J. (2023). How You Split Matters: Data Leakage and Subject Characteristics Studies in Longitudinal Brain MRI Analysis. In: Wesarg, S., et al. Clinical Image-Based Procedures, Fairness of AI in Medical Imaging, and Ethical and Philosophical Issues in Medical Imaging. CLIP 2023, EPIMI 2023, FAIMI 2023. Lecture Notes in Computer Science, vol 14242. Springer, Cham. https://doi.org/10.1007/978-3-031-45249-9_23.
More information and a comprehensive summary of this publication are freely available at: https://djrumala.github.io/publications/how-you-split-matters