Finetorch is a Rust-native CLI and library toolkit for practical LLM finetuning on a single GPU. It is designed around lightweight adapter training rather than full pretraining, with clear boundaries between dataset preparation, backend integration, training orchestration, and evaluation.
- Architecture
- Configuration Guide
- Getting Started
- CLI Workflows
- Use Cases
- Backend and Adapter Design
- Changelog
Create a small JSONL dataset:

```sh
mkdir -p data
cat > data/train.jsonl <<'EOF'
{"instruction":"Answer briefly","input":"What is LoRA?","output":"LoRA is a parameter-efficient finetuning method."}
{"prompt":"Complete: Gemma is","completion":"a family of language models."}
EOF
```

Prepare shards:

```sh
cargo run -- prepare-dataset \
  --input data/train.jsonl \
  --output artifacts/dataset
```

Run the scaffolded training flow:

```sh
cargo run -- train --config configs/example_run.toml
```

Evaluate a held-out file:

```sh
cargo run -- eval \
  --config configs/example_run.toml \
  --dataset data/train.jsonl
```

Finetorch is split into four primary layers:
- CLI layer (`src/cli/`)
  - Parses commands and config paths.
  - Orchestrates dataset preparation, training runs, and evaluation jobs.
  - Emits user-facing summaries and output locations.
- Data layer (`src/data/`)
  - Reads JSONL instruction-tuning data.
  - Normalizes mixed schemas into one internal example format.
  - Applies tokenizer selection and tokenization.
  - Produces shard manifests and train/val split directories.
- Model layer (`src/model/`)
  - Defines the `LlmBackend` trait for backend-neutral finetuning.
  - Hosts LoRA and QLoRA configuration structs.
  - Wraps backend-specific loading and adapter persistence.
  - Starts with a `llama_cpp` bridge and leaves room for more backends.
- Training and evaluation layer (`src/train/`, `src/eval/`)
  - Loads config-driven training jobs.
  - Builds optimizer and scheduler state.
  - Runs a lightweight training loop suitable for LoRA/QLoRA adapters.
  - Computes task metrics for small-scale evaluation.
- `finetorch prepare-dataset --input data.jsonl --output dataset/`
  - Read JSONL examples.
  - Normalize records into `{ prompt, completion }` pairs.
  - Tokenize with the selected tokenizer.
  - Shuffle and shard into `train/` and `val/` outputs.
  - Write a dataset manifest for downstream runs.
- `finetorch train --config configs/example_run.toml`
  - Load `run.toml`.
  - Instantiate the selected backend.
  - Load the base model and apply LoRA/QLoRA settings.
  - Run the training loop with optimizer, scheduler, and accumulation settings.
  - Save adapter weights and JSONL training logs.
- `finetorch eval --config configs/example_run.toml --dataset eval.jsonl`
  - Load config and backend.
  - Read evaluation examples.
  - Run forward passes over the dataset.
  - Compute perplexity, exact match, BLEU, and ROUGE-L summaries.
Repository layout:

```
src/
  main.rs
  lib.rs
  config.rs
  cli/
    mod.rs
    prepare.rs
    train.rs
    eval.rs
  data/
    mod.rs
    jsonl.rs
    tokenizer.rs
    sharding.rs
  model/
    mod.rs
    backend.rs
    llama_cpp.rs
    lora.rs
  train/
    mod.rs
    loop.rs
    optimizer.rs
    scheduler.rs
  eval/
    mod.rs
    metrics.rs
configs/
  example_run.toml
docs/
  architecture.md
  configuration.md
  getting-started.md
  cli-workflows.md
  use-cases.md
  backends.md
```
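The train and eval commands are driven by `configs/example_run.toml`. A hypothetical run config might look like the following; every section and key here is an assumption about shape, not the scaffold's actual schema:

```toml
# Illustrative run config -- keys are assumptions, not the real schema.
[model]
backend = "llama_cpp"
base_path = "models/base.gguf"

[lora]
rank = 8
alpha = 16.0
dropout = 0.05
quantized = true   # QLoRA-style quantized base weights

[train]
dataset_dir = "artifacts/dataset"
learning_rate = 2e-4
batch_size = 4
grad_accum_steps = 8
epochs = 3
output_dir = "artifacts/adapter"
```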
Prepare a dataset:

```sh
cargo run -- prepare-dataset \
  --input data/alpaca_like.jsonl \
  --output artifacts/dataset \
  --tokenizer sentencepiece:models/llama-3/tokenizer.model \
  --train-ratio 0.95 \
  --shard-size 2048
```

Run a small finetuning job:

```sh
cargo run -- train --config configs/example_run.toml
```

Evaluate the resulting adapter:

```sh
cargo run -- eval \
  --config configs/example_run.toml \
  --dataset data/eval.jsonl
```

This scaffold focuses on:
- LoRA and QLoRA adapter workflows
- Config-driven orchestration
- Dataset preparation and sharding
- Backend extensibility
This scaffold does not yet implement a production-grade GPU training kernel. It establishes the module boundaries and execution flow needed to add those pieces incrementally.