deem-data/stratum

Stratum

Stratum is an ML system for efficiently executing large-scale agentic pipeline search. It integrates with MLE agents by representing batches of agent-generated pipelines as lazily evaluated DAGs, applying logical and runtime optimizations, and executing them across heterogeneous backends, including a Rust-based runtime. Stratum builds on skrub's operator abstraction and is under active development.


Design Principles

  • Provide seamless and unrestricted support for arbitrary ML libraries without operator porting.
  • Enable lazy evaluation and provide operator semantics that enable logical rewrites and cost-based optimizations.
  • Implement a runtime with efficient operator kernels (in Rust), scheduling across CPUs, GPUs, and distributed backends, plus runtime optimizations such as buffer pools, reuse of intermediates, and inter- and intra-operator parallelization.
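The ideas of lazy evaluation and reuse of intermediates can be illustrated with a small, self-contained sketch. This is not Stratum's implementation: the `Node` class, the `evaluate` function, and the toy operators are hypothetical names chosen for illustration. The sketch only shows how two pipelines that share a prefix can be represented as one DAG so that the shared operators run once.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Node:
    """A lazily evaluated operator: nothing runs until `evaluate` is called."""

    op: str  # operator name, part of the node's identity
    fn: Callable[..., Any] = field(compare=False)  # ignored for equality and hashing
    inputs: Tuple["Node", ...] = ()  # upstream nodes this operator depends on


def evaluate(node: Node, cache: Dict[Node, Any], counts: Dict[str, int]) -> Any:
    """Evaluate `node` recursively, reusing intermediates already in `cache`."""
    if node in cache:
        return cache[node]
    args = [evaluate(parent, cache, counts) for parent in node.inputs]
    counts[node.op] = counts.get(node.op, 0) + 1  # record each real execution
    result = node.fn(*args)
    cache[node] = result
    return result


# Two toy pipelines that share the "load" and "scale" steps.
load = Node("load", lambda: [3, 1, 2])
scale = Node("scale", lambda xs: [2 * x for x in xs], (load,))
pipeline_a = Node("sum", lambda xs: sum(xs), (scale,))
pipeline_b = Node("max", lambda xs: max(xs), (scale,))

cache: Dict[Node, Any] = {}
counts: Dict[str, int] = {}
results = [evaluate(p, cache, counts) for p in (pipeline_a, pipeline_b)]
print(results)  # [12, 6]
print(counts)   # each operator executed exactly once
```

Because nodes with the same operator name and the same inputs compare equal, the shared prefix is evaluated a single time for the whole batch. A real system would also need cache eviction and a cost model, which this sketch leaves out.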

Installation

For now, you need to build stratum from source.

Requirements:

From the repository root, install the extension in editable (development) mode:

maturin develop --release

For more details (including building wheels), see the Developer Instructions section below.


Usage

To use Stratum, agent prompts or pipelines need only minor changes: prompts should be modified so that the agent generates code in skrub's DataOps syntax.

Stratum can also significantly speed up human-written skrub code.

The following flags enable different features of Stratum. These flags can be set via environment variables or directly in code:

import stratum

stratum.set_config(
    rust_backend=True,
    scheduler=True,
    stats=True,
    debug_timing=False,
)
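As a rough illustration of how flags set in code and flags set through environment variables might be combined, here is a minimal sketch. It does not reproduce Stratum's `_config.py`: the `DEMO_STRATUM_` prefix, the default values, and the precedence rule (values set in code win over the environment) are all assumptions made for this example.

```python
import os

# Hypothetical defaults and variable prefix; Stratum's real ones may differ.
_DEFAULTS = {
    "rust_backend": False,
    "scheduler": False,
    "stats": False,
    "debug_timing": False,
}
_ENV_PREFIX = "DEMO_STRATUM_"
_overrides = {}  # flags set explicitly through set_config


def _parse_bool(text):
    """Interpret common truthy spellings of an environment variable."""
    return text.strip().lower() in {"1", "true", "yes", "on"}


def get_config():
    """Return defaults, overlaid by environment variables, then by code."""
    config = dict(_DEFAULTS)
    for key in _DEFAULTS:
        raw = os.environ.get(_ENV_PREFIX + key.upper())
        if raw is not None:
            config[key] = _parse_bool(raw)
    config.update(_overrides)
    return config


def set_config(**flags):
    """Set flags in code; unknown flag names are rejected."""
    unknown = set(flags) - set(_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown flags: {sorted(unknown)}")
    _overrides.update(flags)


os.environ["DEMO_STRATUM_STATS"] = "1"  # flag supplied by the environment
set_config(rust_backend=True)           # flag supplied in code
config = get_config()
print(config)
```

Check the Stratum source for the actual environment variable names before relying on them.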

Example Code

import stratum as skrub  # drop-in replacement
from sklearn.preprocessing import OneHotEncoder
from skrub.datasets import fetch_employee_salaries
from skrub import TableVectorizer, StringEncoder

def main():
    # Load dataset
    dataset = fetch_employee_salaries()
    employees, salaries = dataset.X, dataset.y
    employees = employees.dropna()

    skrub.set_config(rust_backend=True, debug_timing=True, scheduler=True, stats=True)  # Stratum's config
    vectorizer = TableVectorizer(high_cardinality=StringEncoder(), low_cardinality=OneHotEncoder())
    employees_enc = vectorizer.fit_transform(employees)
    print(f"Encoded data shape: {employees_enc.shape}")

if __name__ == "__main__":
    main()

Repository Layout

stratum/
├─ pyproject.toml          # Project metadata + Python/Rust build config (maturin)
├─ README.md
├─ LICENSE
├─ _rust/                  # Rust crate (PyO3 extension)
│  ├─ Cargo.toml
│  └─ src/lib.rs           # Defines #[pymodule] fn _rust_backend_native(...)
└─ stratum/                # Python package
   ├─ __init__.py          # Façade over skrub + automatic patching
   ├─ _config.py           # set_config/get_config + runtime/env sync
   ├─ _api.py              # High-level grid search / evaluate helpers
   ├─ _rust_backend.py     # Python <-> Rust shim (re-exports native fns)
   ├─ adapters/            # Public API (dispatch to Rust or fall back to skrub)
   │  ├─ string_encoder.py # RustyStringEncoder
   │  └─ one_hot_encoder.py # RustyOneHotEncoder
   ├─ logical_optimizer/   # DAG representation + logical rewrites
   ├─ runtime/             # Schedulers and runtime execution
   ├─ patching/            # Hooks that patch upstream skrub
   └─ tests/               # Test suite
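The layout describes the adapters as dispatching to Rust or falling back to skrub. The general pattern can be sketched as follows. This is an assumption-laden illustration, not the code in `adapters/`: the module name `demo_native_backend` and the functions `one_hot` and `_python_one_hot` are hypothetical, and the pure-Python fallback stands in for the skrub implementation.

```python
def _load_native_encoder():
    """Try to import a compiled kernel; return None if it is unavailable."""
    try:
        # Hypothetical module name, used only for this sketch.
        from demo_native_backend import one_hot_encode
        return one_hot_encode
    except ImportError:
        return None


def _python_one_hot(values):
    """Pure-Python fallback: one column per category, in sorted order."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    width = len(categories)
    return [[1 if index[v] == j else 0 for j in range(width)] for v in values]


def one_hot(values, use_native=True):
    """Dispatch to the native kernel when present, else use the fallback."""
    native = _load_native_encoder() if use_native else None
    kernel = native if native is not None else _python_one_hot
    return kernel(values)


encoded = one_hot(["b", "a", "b"])
print(encoded)  # [[0, 1], [1, 0], [0, 1]] when the fallback is used
```

Keeping the fallback behind the same function signature means callers do not need to know which backend ran.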

Developer Instructions

Local Dev Install (Editable)

maturin develop            # Debug mode
maturin develop --release  # Optimized dev build

Building Wheels

This produces redistributable .whl files under dist/.

# Linux / macOS
maturin build --release -o dist --interpreter python3.10 --compatibility linux

# Windows
maturin build --release -o dist

Then install with:

pip install ./dist/stratum-*.whl

License

Apache License 2.0. See LICENSE for details.
