
Steerling


An interpretable causal diffusion language model.

Steerling-8B combines masked diffusion language modeling with concept decomposition, enabling:

  • Generation: Non-autoregressive text generation via confidence-based unmasking
  • Attribution: Decompose predictions into known concept contributions
  • Steering: Intervene on concept activations to control generation
  • Embeddings: Extract hidden, composed, known, or discovered representations

For more information, tutorials, and updates, visit guidelabs.ai. To learn more about the architecture behind Steerling, check out our blog posts on Scaling Interpretable Models with 8B Parameters and Causal Diffusion Language Models.

Quick Start

uv venv
source .venv/bin/activate
uv pip install steerling
import torch
from transformers import AutoModel, AutoTokenizer
from steerling import SteerlingGenerator
from steerling.configs.generation import GenerationConfig

model = AutoModel.from_pretrained("guidelabs/steerling-8b", trust_remote_code=True, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("guidelabs/steerling-8b", trust_remote_code=True)
generator = SteerlingGenerator.from_model(model, tokenizer, device="cuda")

prompt = "The key to understanding neural networks is"
config = GenerationConfig(max_new_tokens=128, steps=128, temperature=0.4)
text = generator.generate(prompt, config)
print(text)
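Under the hood, `generate` commits tokens over multiple denoising steps rather than left to right. A minimal sketch of confidence-based unmasking, assuming a toy model; all names here are illustrative, not the actual Steerling API:

```python
import numpy as np

def unmask_by_confidence(logits_fn, length, steps, mask_id=-1):
    """Iteratively commit the most confident masked positions.
    `logits_fn(tokens)` returns a (length, vocab) array of logits."""
    tokens = np.full(length, mask_id)
    per_step = max(1, length // steps)
    while (tokens == mask_id).any():
        logits = logits_fn(tokens)
        probs = np.exp(logits - logits.max(-1, keepdims=True))
        probs /= probs.sum(-1, keepdims=True)
        conf = probs.max(-1)                  # model confidence per position
        conf[tokens != mask_id] = -np.inf     # skip already-committed positions
        k = min(per_step, int((tokens == mask_id).sum()))
        pick = np.argsort(conf)[-k:]          # most confident masked slots
        tokens[pick] = probs[pick].argmax(-1)
    return tokens

# Toy "model": logits strongly favor token id == position (hypothetical).
def toy_logits(tokens):
    L, V = len(tokens), 8
    logits = np.zeros((L, V))
    logits[np.arange(L), np.arange(L) % V] = 5.0
    return logits

out = unmask_by_confidence(toy_logits, length=8, steps=4)
print(out.tolist())  # → [0, 1, 2, 3, 4, 5, 6, 7]
```

With `steps` equal to `max_new_tokens` (as in the Quick Start config) one position is committed per step; fewer steps commit several positions at once, trading quality for speed.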

Requirements: Python >= 3.13, GPU with >= 18 GB VRAM (H100, A100, A6000, RTX 4090), CUDA 12.8

Model Details

| Property            | Value                                           |
|---------------------|-------------------------------------------------|
| Parameters          | ~8B                                             |
| Architecture        | CausalDiffusionLM + Interpretable Concept Head  |
| Context Length      | 4096 tokens                                     |
| Vocabulary          | 100,281 (cl100k_base + specials)                |
| Known Concepts      | 33,732                                          |
| Discovered Concepts | 101,196                                         |
| Attention (GQA)     | 32 query heads, 4 KV heads                      |
| Precision           | bfloat16                                        |

Architecture

Steerling uses block-causal attention (bidirectional within 64-token blocks, causal across blocks) with masked diffusion training. At inference, tokens are generated by iteratively unmasking positions in order of model confidence. The interpretable concept heads decompose transformer hidden states h into:

h → known_features + unk_hat + epsilon = composed → lm_head → logits

Embedding decomposition animation

  • known_features: Weighted sum of top-k learned concept embeddings
  • unk_hat: Residual features captured by a factorized discovered concept head
  • epsilon: Small correction term for reconstruction fidelity

Interpretability

Every prediction Steerling makes can be decomposed into three components: known concepts (human-interpretable features), discovered concepts (learned residual features), and epsilon (reconstruction correction). The plot below shows the fraction of each token's logit attributable to each component:

Per-token logit contribution decomposition

See logit_contribution.ipynb for per-token decomposition and chunk_level_concept_attribution.ipynb for chunk-level concept attribution.
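Those per-token fractions follow from the linearity of the decomposition: the lm_head row for the predicted token dotted with each component gives that component's contribution, and the three contributions sum to the full logit. A hedged numpy sketch with toy values (not the notebooks' actual code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
known, discovered, epsilon = (rng.standard_normal(d) for _ in range(3))
epsilon *= 0.01                      # epsilon is a small correction term
w = rng.standard_normal(d)           # lm_head row for the predicted token

total = w @ (known + discovered + epsilon)
fractions = {name: (w @ part) / total
             for name, part in [("known", known),
                                ("discovered", discovered),
                                ("epsilon", epsilon)]}
print(fractions)                     # the three fractions sum to 1
```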

Installation

# From PyPI
uv pip install steerling

# From source
git clone https://github.com/guidelabs/steerling.git
cd steerling
uv sync --extra dev    # full dev environment
source .venv/bin/activate

Note: PyTorch is installed with CUDA 12.8 support automatically via the PyTorch index configured in pyproject.toml. If you need a different CUDA version, install PyTorch manually before installing steerling.

Evaluation

We provide evaluation scripts based on lm-evaluation-harness.

# Run all benchmarks (HellaSwag, ARC-Challenge, WinoGrande, PIQA, MMLU, GSM8K)
bash scripts/eval_steerling_lm_eval.sh

# Specify a model path
MODEL_PATH=/path/to/local/model bash scripts/eval_steerling_lm_eval.sh

# Run specific tasks
TASKS="hellaswag arc_challenge" bash scripts/eval_steerling_lm_eval.sh

# Or use the Python CLI directly
python scripts/evaluate.py --model guidelabs/steerling-8b --tasks hellaswag arc_challenge

Notebooks

| Notebook | Description |
|----------|-------------|
| generation.ipynb | Text generation: block-by-block unmasking, special tokens, early stopping with `<\|endofchunk\|>` |
| logit_contribution.ipynb | Decompose each predicted token's logit into known concept, discovered concept, and residual contributions |
| chunk_level_concept_attribution.ipynb | Attribute generated text chunks to known concepts using normalized feature contributions |

FAQ

  • Where can I read more about the architecture?
    See our blog posts: Scaling Interpretable Models with 8B Parameters and Causal Diffusion Language Models. A detailed technical report is coming soon.

  • Is there an instruction-tuned model?
    Stay tuned.

  • What dataset was this trained on?
    An augmented version of the Nemotron-CC-HQ dataset for approximately 1.35 trillion tokens.

  • What is block-causal attention?
    Standard causal attention only lets each token attend to previous tokens. Block-causal attention groups tokens into blocks of 64 and allows bidirectional attention within each block, while maintaining causal ordering across blocks. See Causal Diffusion Language Models for more details.
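That rule translates directly into a mask: position i may attend to position j iff j's block index is at most i's block index. A minimal sketch (the real model uses block size 64; a small example is shown for readability):

```python
import numpy as np

def block_causal_mask(seq_len, block_size=64):
    """True where attention is allowed: bidirectional within a block,
    causal across blocks."""
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]

m = block_causal_mask(6, block_size=2).astype(int)
print(m)
# Token 0 (block 0) sees tokens 0-1, including the "future" token 1;
# token 4 (block 2) sees all of tokens 0-5.
```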

  • What are "known" and "discovered" concepts?
    The model decomposes its internal representations into two parts:

    • Known concepts (33,732): learned, supervised features corresponding to identifiable patterns a human can understand.
    • Discovered concepts (101,196): capture the signal that known concepts don't explain.
    • Together they reconstruct the full hidden state: hidden ≈ known_features + discovered_features + epsilon.
  • How do I find concept IDs for steering?
    A full walkthrough of concept extraction and steering is coming in the next few weeks.

  • What GPU do I need?
    Steerling-8B in bfloat16 requires approximately 18 GB VRAM. It fits on a single H100, A100 (40GB or 80GB), A6000 (48GB), or RTX 4090 (24GB).

  • Can I fine-tune this model?
    Yes, but fine-tuning code is not included in this release. If there is sufficient interest, we will support it in a future release.

  • What tokenizer does Steerling-8B use?
    OpenAI's cl100k_base tokenizer (via tiktoken) with 4 additional special tokens: <|pad|>, <|bos|>, <|endofchunk|>, and <|mask|>, for a total vocabulary of 100,281 tokens.

  • How do I get training data attributions?
    This release supports concept and feature attributions via the provided notebooks. Training data attribution is not currently supported but will be added in a future release.

License

The Steerling source code is released under the Apache License 2.0.

The model weights are provided for research and evaluation purposes. The weights were trained on datasets with varying license terms, including Nemotron-CC-HQ and Dolmino Mix. Some training data includes synthetic content generated by third-party models with their own license terms. We are currently reviewing the implications of these upstream licenses for downstream use of the model weights. Please check back for updates.

For questions about commercial use of the model weights, contact us at [email protected].