etlSUS

License: LGPL 2.1

Docs are also available in Portuguese (PT)

An opinionated ETL (Extract, Transform, Load) pipeline designed to process Brazil's public healthcare data (DataSUS) from raw CSV files into analysis-ready datasets.

Overview

The Problem

Brazil's SUS (Sistema Único de Saúde) provides extensive public health data, but it requires domain-specific preprocessing before analysis. This includes removing unnecessary columns, handling missing values, and optimizing data types. Manually scripting these transformations for each dataset is time-consuming and error-prone.
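To illustrate the kind of manual scripting described above, here is a minimal sketch in plain Python of the three steps: dropping a column, mapping missing-value codes to `None`, and casting text to integers. The sample data, the column names, and the `9`/`9999` sentinel codes are illustrative assumptions, not the actual DataSUS schema and not how etlSUS implements its transformations.

```python
import csv
import io

# Hypothetical sample; the column names and the "9"/"9999" sentinel codes
# are illustrative assumptions, not the actual DataSUS schema.
RAW_CSV = """CONTADOR;SEXO;PESO
1;1;3200
2;2;2950
3;9;9999
"""

DROP_COLUMNS = {"CONTADOR"}                        # columns with no analytical value
MISSING_CODES = {"SEXO": {"9"}, "PESO": {"9999"}}  # sentinel values meaning "unknown"
INT_COLUMNS = {"SEXO", "PESO"}                     # columns to store as integers


def clean_row(row):
    """Drop unwanted columns, map sentinel codes to None, and cast integers."""
    cleaned = {}
    for name, value in row.items():
        if name in DROP_COLUMNS:
            continue
        if value == "" or value in MISSING_CODES.get(name, set()):
            cleaned[name] = None
        elif name in INT_COLUMNS:
            cleaned[name] = int(value)
        else:
            cleaned[name] = value
    return cleaned


reader = csv.DictReader(io.StringIO(RAW_CSV), delimiter=";")
rows = [clean_row(row) for row in reader]
```

Even for this toy input, the rules (which columns to drop, which codes mean "unknown") have to be written out by hand for every dataset, which is where the effort and the errors come from.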

The Solution

etlSUS automates the entire process. Specify the dataset and a data directory, and the library downloads the raw files, transforms them, and then loads the result into a database, merges all files into a single file, or both.

🚀 Quick Start

1. Installation

poetry add git+https://github.com/GOPAD-Datasus/etlSUS.git

2. Run the Pipeline

from etlsus import pipeline


if __name__ == '__main__':
    pipeline(
        dataset='SINASC',  # Choose between 'SINASC' or 'SIM'
        data_dir='path/to/data/dir',
    )

📌 Features

  • Simple Interface: Select your dataset (SINASC or SIM) and specify the base directory
  • Automated Processing: Handles download, transformation, and loading automatically
  • Optimized Transformations: Removes irrelevant columns and values while preserving analytical value
  • Multiple Output Formats:
    • Direct export to relational databases
    • Merged single file for multi-year analysis
    • Multiple files

📁 Project Structure

After running the pipeline, your data directory will be organized as follows:

# Using data_dir = "./data"

./data
├── raw/                  # Downloaded CSV files
├── processed/            # Cleaned and transformed files
└── dataset.parquet.gzip  # (Optional) Merged file
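As a quick way to check this layout after a run, the sketch below inspects a data directory using only the standard library. The helper name `describe_data_dir` is hypothetical and not part of etlSUS; it assumes only the `raw/`, `processed/`, and `dataset.parquet.gzip` names shown above and makes no assumption about the names of the individual files.

```python
from pathlib import Path


def describe_data_dir(data_dir):
    """Summarize the layout shown above for a given data directory.

    Hypothetical helper, not part of etlSUS. It only relies on the
    directory and file names from the tree above.
    """
    base = Path(data_dir)

    def list_files(folder):
        # Return sorted file names, or an empty list if the folder is missing.
        if not folder.is_dir():
            return []
        return sorted(p.name for p in folder.iterdir() if p.is_file())

    return {
        "raw_files": list_files(base / "raw"),
        "processed_files": list_files(base / "processed"),
        # The merged file is optional, so report whether it exists.
        "has_merged_file": (base / "dataset.parquet.gzip").is_file(),
    }
```

Calling `describe_data_dir("./data")` returns a dictionary with the raw and processed file names and a flag for the optional merged file.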

Limitation

  • Only Parquet output files are supported.

📝 License

LGPL 2.1 | © GOPAD 2025
