Docs are also available in Portuguese (PT)
An opinionated ETL (Extract, Transform, Load) pipeline designed to process Brazil's public healthcare data (DataSUS) from raw CSV files into analysis-ready datasets.
Brazil's SUS (Sistema Único de Saúde) provides extensive public health data, but it requires domain-specific preprocessing before analysis. This includes removing unnecessary columns, handling missing values, and optimizing data types. Manually scripting these transformations for each dataset is time-consuming and error-prone.
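To make the manual work concrete, here is a minimal pandas sketch of the kind of preprocessing described above: dropping a column, handling missing values, and tightening data types. It is an illustration only, not etlSUS's actual code. The sample data, the column names (`record_id`, `mother_age`, `sex`, `weight_g`, `internal_code`), and the `preprocess` helper are all hypothetical and do not reflect the real DataSUS schema.

```python
import io

import pandas as pd

# Hypothetical sample; column names are illustrative, not the real DataSUS schema.
RAW_CSV = """record_id;mother_age;sex;weight_g;internal_code
1;27;F;3200;A1
2;;M;2950;B2
3;31;F;;C3
"""


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step: drop, filter, and downcast."""
    # Remove a column assumed to be irrelevant for analysis.
    df = raw.drop(columns=["internal_code"])
    # Handle missing values: drop rows lacking the (assumed) key measurement.
    df = df.dropna(subset=["weight_g"])
    # Optimize data types: nullable small integers and a categorical column.
    df = df.astype({"mother_age": "Int8", "weight_g": "Int16", "sex": "category"})
    return df.reset_index(drop=True)


raw_df = pd.read_csv(io.StringIO(RAW_CSV), sep=";")
clean_df = preprocess(raw_df)
```

Repeating a script like this for every dataset and year is the effort the library is meant to remove.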
etlSUS automates the entire process. Specify the dataset, and the library handles downloading, transforming, and loading the data into a database, merging all files into a single file, or both.
Install with Poetry:

    poetry add git+https://github.com/GOPAD-Datasus/etlSUS.git

Then run the pipeline from Python:

    from etlsus import pipeline

    if __name__ == '__main__':
        pipeline(
            dataset='SINASC',  # Choose between 'SINASC' or 'SIM'
            data_dir='path/to/data/dir',
        )

- Simple Interface: Select your dataset (SINASC or SIM) and specify the base directory
- Automated Processing: Handles download, transformation, and loading automatically
- Optimized Transformations: Removes irrelevant columns and values while keeping the data useful for analysis
- Multiple Output Formats:
  - Direct export to relational databases
  - Merged single file for multi-year analysis
  - Multiple files
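As a rough illustration of the merge and database outputs listed above, the sketch below stacks per-year tables into one and writes the result to a relational database. It does not show how etlSUS implements these steps. The per-year frames, the `year` column, the table name `sinasc`, and the choice of in-memory SQLite are all assumptions made for the example.

```python
import sqlite3

import pandas as pd

# Hypothetical per-year frames standing in for processed files.
yearly = {
    2021: pd.DataFrame({"record_id": [1, 2], "weight_g": [3200, 2950]}),
    2022: pd.DataFrame({"record_id": [3], "weight_g": [3410]}),
}

# Merge: tag each frame with its year, then stack everything into one table.
merged = pd.concat(
    [df.assign(year=year) for year, df in yearly.items()],
    ignore_index=True,
)

# Load: write the merged table to a relational database (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
merged.to_sql("sinasc", conn, index=False, if_exists="replace")
row_count = conn.execute("SELECT COUNT(*) FROM sinasc").fetchone()[0]
conn.close()
```

Keeping the year as a column is one simple way to make a merged file usable for multi-year analysis.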
After running the pipeline, your data directory will be organized as follows:
    # Using data_dir = "./data"
    ./data
    ├── raw/                   # Downloaded CSV files
    ├── processed/             # Cleaned and transformed files
    └── dataset.parquet.gzip   # (Optional) Merged file
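For downstream scripts, the layout above can be captured in a small helper. This is a sketch only: `expected_layout` is a hypothetical function, not part of etlSUS, and it assumes the merged file is named exactly as shown in the tree.

```python
from pathlib import Path


def expected_layout(data_dir: str) -> dict[str, Path]:
    """Hypothetical helper mapping the directory tree above to paths."""
    base = Path(data_dir)
    return {
        "raw": base / "raw",                       # downloaded CSV files
        "processed": base / "processed",           # cleaned and transformed files
        "merged": base / "dataset.parquet.gzip",   # optional merged file
    }


layout = expected_layout("./data")

# List processed files only if the pipeline has already created the folder.
processed_dir = layout["processed"]
processed_files = sorted(processed_dir.iterdir()) if processed_dir.is_dir() else []
```

If the merged file exists, `pandas.read_parquet(layout["merged"])` should be able to load it, provided a Parquet engine such as pyarrow is installed.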
- Limitation: only Parquet output files are supported.
LGNU | © GOPAD 2025