
Matyan

A scalable, self-hosted ML experiment tracker

Aim-compatible UI and SDK · FoundationDB · Kafka · Cloud Storage (S3, GCS, Azure)





ℹ️ About

Matyan (մատյան, book of records in Armenian) is a self-hosted ML experiment tracking stack forked from Aim. The backend is fully reimplemented on FoundationDB, Kafka, and Cloud Storage (S3, GCS, Azure) for horizontal scalability — while the original Aim React UI and Python client SDK API are preserved unchanged.

Matyan logs your training runs and any ML metadata, enables a beautiful UI to compare and observe them, and provides an SDK to query them programmatically.

Log Metadata Across Your ML Pipeline 💾

  • Metrics, hyperparameters, images, audio, text, distributions
  • Structured and terminal run logs
  • Aim-compatible SDK — no code changes needed

Visualize & Compare Metadata via UI 📊

  • Metadata visualization via explorers (metrics, images, audio, …)
  • Grouping, aggregation, and subplots
  • Querying using MatyanQL (Python expressions)

Scale to Thousands of Runs ⚡

  • FoundationDB backend — handles 10,000s of runs
  • Kafka-based ingestion pipeline with consumer workers
  • Secondary indexes (Tier 1 + Tier 2 hparam) for fast queries

Production-Ready Deployment 🚀

  • Helm chart for Kubernetes with all components
  • Stateless, horizontally scalable services
  • S3, GCS, or Azure Blob Storage for large artifact blobs
Matyan demo


🏁 Quick Start

1. Start the infrastructure

./dev/compose-cluster.sh up -d

This starts FoundationDB, Kafka, and S3-compatible storage (RustFS) locally via Docker Compose; GCS and Azure backends are supported in production. Then start the backend, frontier, and UI from their package directories (see each component README for the uv run commands).

2. Install the client

python3 -m pip install matyan-client

Or with uv:

uv add matyan-client

3. Log a training run

from matyan_client import Run

# Create a new run
run = Run()

# Log hyperparameters as structured run params
run["hparams"] = {
    "learning_rate": 0.001,
    "batch_size": 32,
}

# Track metric values per step; `context` distinguishes e.g. train vs. val
for i in range(100):
    run.track(i * 0.01, name="loss", step=i, context={"subset": "train"})
    run.track(1 - i * 0.01, name="acc", step=i, context={"subset": "train"})

# Flush buffered data and finalize the run
run.close()

The same Run API works as in Aim — see Supported types for images, audio, distributions, figures, and text.

Query runs programmatically via SDK

from matyan_client import Repo

# Connect to the backend REST API
repo = Repo("http://localhost:53800")

# MatyanQL query: a Python expression evaluated against each metric
query = "metric.name == 'loss'"

for run_metrics_collection in repo.query_metrics(query).iter_runs():
    for metric in run_metrics_collection:
        # All tracked params of the owning run
        params = metric.run[...]
        # Steps and values as NumPy arrays
        steps, values = metric.values.sparse_numpy()
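Since MatyanQL queries are plain Python expressions, their filtering behavior can be illustrated without a running server. The sketch below is an illustrative stand-in, not the real server-side evaluator: `matches` and the `SimpleNamespace` objects are hypothetical, standing in for matyan-client's actual metric/run types.

```python
from types import SimpleNamespace

def matches(expr: str, metric) -> bool:
    # The query is a Python expression evaluated with `metric`
    # (and its owning `run`) in scope; builtins are disabled.
    return bool(eval(expr, {"__builtins__": {}}, {"metric": metric, "run": metric.run}))

run = SimpleNamespace(hparams={"learning_rate": 0.001, "batch_size": 32})
metrics = [SimpleNamespace(name="loss", run=run),
           SimpleNamespace(name="acc", run=run)]

selected = [m.name for m in metrics if matches("metric.name == 'loss'", m)]
print(selected)  # ['loss']
```

The same expression style extends to run params, e.g. `"metric.name == 'loss' and run.hparams['batch_size'] == 32"`.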
Deploy on Kubernetes

helm install matyan deploy/helm/matyan -f deploy/helm/matyan/values-production.yaml

See deploy/helm/matyan/README.md for all values and production notes.

Read the full documentation at matyan-core/deployment 📖


🏗 Architecture

flowchart TB
    subgraph clients["Training Clients"]
        C["matyan-client"]
    end

    subgraph ui["UI"]
        U["matyan-ui"]
    end

    subgraph ingestion["Ingestion path"]
        F["matyan-frontier<br/>(Ingestion Gateway)"]
        K["Kafka<br/>data-ingestion"]
        IW["Ingestion Workers"]
        STR["Cloud Storage<br/>(S3 / GCS / Azure)"]
    end

    subgraph control["Control path"]
        B["matyan-backend<br/>(REST API)"]
        KC["Kafka<br/>control-events"]
        CW["Control Workers"]
    end

    subgraph storage["Storage"]
        FDB["FoundationDB"]
    end

    C -->|"WebSocket (metrics, hparams)"| F
    C -->|"PUT blob"| STR
    F -->|"blob ref"| K
    K --> IW
    IW --> FDB
    U --> B
    B --> FDB
    B --> KC
    KC --> CW
    CW -->|"cleanup"| STR
| Concern | Entry point | Consistency |
|---|---|---|
| UI reads | matyan-backend | Immediate |
| UI control ops (delete, rename) | matyan-backend | Immediate for user, async for cleanup |
| Client metrics / hparams ingestion | frontier (WebSocket) | Eventual |
| Client blob upload (images, audio) | frontier (presigned URL) | Eventual |
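Because client-side ingestion is eventually consistent, tracked data may not be queryable the instant `run.close()` returns. Scripts that read back what they just wrote can poll until the data lands. A minimal generic helper (not part of the matyan-client API):

```python
import time

def wait_until(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within timeout")
        time.sleep(interval)
```

For example, after logging a run one might call `wait_until(lambda: list(repo.query_metrics("metric.name == 'loss'").iter_runs()))` before reading metrics back.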

📁 Repo Layout

| Path | Purpose |
|---|---|
| extra/matyan-backend/ | REST API, FDB storage, ingestion/control Kafka workers, CLI. README |
| extra/matyan-frontier/ | Ingestion gateway: WebSocket + presigned URLs (S3, GCS, Azure SAS); publishes to Kafka. README |
| extra/matyan-ui/ | React frontend (from Aim) + Python wrapper for serving. README |
| extra/matyan-client/ | Python client SDK (Aim-compatible API); connects to frontier and backend. |
| extra/matyan-api-models/ | Shared Pydantic models (WS, Kafka, REST). README |
| deploy/helm/matyan/ | Helm chart for Kubernetes. README |
| dev/docker-compose.yml | Local dev: FDB, Kafka, S3 (RustFS), optional app services. |
| docs/ | MkDocs source for the documentation site. |

🚢 Deployment

Local development

The fastest way to get everything running is Docker Compose. A single script starts all infrastructure dependencies and the Matyan services:

./dev/compose-cluster.sh up -d

This brings up:

| Service | Port | Purpose |
|---|---|---|
| FoundationDB | (internal) | Primary storage |
| Apache Kafka | 9092 | Ingestion + control event bus |
| RustFS (S3-compatible) | 9000 / 9001 | Blob artifact storage + console |
| matyan-backend | 53800 | REST API |
| matyan-frontier | 53801 | WebSocket ingestion gateway |
| matyan-ui | 8000 | React UI |

Point your browser to http://localhost:8000 once all services are healthy. Use http://localhost:9001 for the RustFS console (credentials: rustfsadmin / rustfsadmin).
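A quick way to confirm the stack is up is to check that each service's TCP port accepts connections. This generic sketch uses only the ports listed above; it is not a Matyan health check, just a connectivity probe:

```python
import socket

SERVICES = {"matyan-backend": 53800, "matyan-frontier": 53801, "matyan-ui": 8000}

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for name, port in SERVICES.items():
    state = "up" if port_open("localhost", port) else "down"
    print(f"{name:16s} :{port}  {state}")
```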

To seed demo data into a running stack:

cd extra/matyan-backend
uv run python scripts/seed_data.py seed

Kubernetes

Matyan ships a Helm chart covering all application services and their infrastructure dependencies (FoundationDB via the fdb-operator, Kafka, RustFS).

Prerequisites: a Kubernetes cluster (1.25+) with a default or named StorageClass.

Generate a Fernet key (required for encrypted blob URIs):

uvx --from cryptography python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"
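The same key can be generated from any Python environment with the `cryptography` package installed (this is what the uvx one-liner above runs):

```python
from cryptography.fernet import Fernet

# A Fernet key is 32 random bytes, urlsafe-base64 encoded (44 characters).
key = Fernet.generate_key().decode()
print(key)
```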

Install the chart:

helm install matyan deploy/helm/matyan \
  -f deploy/helm/matyan/values-dev.yaml \
  --set ui.hostBase=https://matyan.example.com \
  --set backend.hostBase=https://matyan.example.com \
  --set blobUriSecret.value=<your-fernet-key> \
  --set fdb-cluster.processes.general.volumeClaimTemplate.storageClassName=<your-storage-class>
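The same overrides can live in a custom values file instead of repeated --set flags. A minimal sketch; the replicaCount placement is an assumption, so check the chart's README for the authoritative values reference:

```yaml
# my-values.yaml -- hypothetical overrides layered on top of values-dev.yaml
ui:
  hostBase: https://matyan.example.com
backend:
  hostBase: https://matyan.example.com
  replicaCount: 2            # assumed key; see the chart README
blobUriSecret:
  value: <your-fernet-key>   # Fernet key for encrypted blob URIs
fdb-cluster:
  processes:
    general:
      volumeClaimTemplate:
        storageClassName: <your-storage-class>
```

Then install with `helm install matyan deploy/helm/matyan -f deploy/helm/matyan/values-dev.yaml -f my-values.yaml`.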

Scaling: all application services (matyan-backend, matyan-frontier, ingestion workers, control workers) are stateless. Scale any of them independently by adjusting replicaCount in the values file — FoundationDB and Kafka handle coordination automatically.

See deploy/helm/matyan/README.md for the full values reference and production configuration notes (TLS, resource limits, external Kafka/S3, multi-node FDB).


🆚 Comparisons to familiar tools

TensorBoard vs Matyan

Training run comparison

  • Tracked parameters are first-class citizens in Matyan. You can search, group, and aggregate by params — deeply exploring all tracked data (metrics, hyperparameters, images, audio) in the UI.
  • With TensorBoard, users are forced to encode parameters into the run name to search and compare them. TensorBoard has no grouping, aggregation, or subplot features.

Scalability

  • Matyan is built on FoundationDB and Kafka to handle 10,000s of training runs at both the storage and UI layers.
  • TensorBoard becomes slow and hard to use when a few hundred training runs are queried or compared.

MLflow vs Matyan

MLflow is an end-to-end ML lifecycle tool. Matyan is focused on training tracking and observability.

Run comparison

  • Matyan treats tracked parameters as first-class citizens. Users can query runs, metrics, images, and filter using params with full grouping, aggregation, and subplotting.
  • MLflow has basic search by config but lacks grouping, aggregation, and rich comparison features.

UI scalability

  • Matyan's UI handles thousands of metrics with thousands of steps smoothly.
  • MLflow's UI slows noticeably with a few hundred runs.

Deployment

  • Both are self-hosted and open-source.
  • Matyan adds a Kafka-based ingestion pipeline and FoundationDB for high-throughput, horizontally scalable deployments.

Weights and Biases vs Matyan

Hosted vs self-hosted

  • Weights and Biases is a hosted, closed-source MLOps platform. Your experiment data lives on their servers.
  • Matyan is fully self-hosted and open-source — your data stays in your own infrastructure (FoundationDB + S3/GCS/Azure).

Cost

  • W&B charges per seat / usage at scale.
  • Matyan is free; you only pay for your own compute and storage.

Aim vs Matyan

Matyan is a fork of Aim. The UI and Python SDK API surface are almost identical, so switching typically requires few or no code changes.

Storage backend

  • Aim uses an embedded RocksDB store (custom Cython extensions) on a single node. Storage is tied to the machine running aim up.
  • Matyan replaces RocksDB with FoundationDB — a distributed, ACID-compliant key-value store designed for horizontal scaling. All runs share a single logical key space across a cluster.

Ingestion pipeline

  • Aim writes tracking data synchronously in the same process as the server.
  • Matyan routes tracking data through Kafka → ingestion workers → FoundationDB, decoupling the write path from the API. The frontier service can handle bursts from many concurrent training jobs without backpressure on the API.

Deployment model

  • Aim is a single aim up process — simple to start, harder to scale.
  • Matyan is a set of stateless, horizontally scalable microservices (backend API, frontier, ingestion workers, control workers) deployable on Kubernetes via Helm.

When to use Aim

Aim is a great choice for individual researchers running experiments on a single machine where simplicity matters more than scale.

When to use Matyan

Matyan is the right choice when you need to scale to many concurrent training jobs, many users, or large run counts — while keeping the familiar Aim UI and SDK.


📦 Component READMEs


📬 Contact

Questions, feedback, or collaboration? Reach out at [email protected].


⚖️ License

Apache 2.0 — see LICENSE.

Matyan is a fork of Aim by AimStack, used under the Apache 2.0 license.
