Distributed-GPT-Node is a Rust project for local and distributed Llama inference based on Candle. It is intended as an LLM node: a node can run a model entirely locally or—via routing with a BlocksMap—load only selected Transformer layers and execute the remaining layers over a P2P network (libp2p).
The focus is on practical inference: an interactive CLI chat application with token streaming, KV-cache optimizations, and distributed block forwarding (layer sharding).
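The split between local and remote layers can be pictured as a per-layer dispatch loop. The following is a minimal sketch, assuming a routing lookup that returns the owning peer per layer; the function and type names are illustrative, not the project's actual API:

```rust
/// Conceptual forward pass under layer sharding (illustrative only; the real
/// loop lives in local_llama.rs and the p2p_* modules). Layers the routing
/// assigns to this node run locally; all others are forwarded to their peer.
fn forward_sharded(
    mut hidden: Vec<f32>,
    n_layers: usize,
    owner_of: impl Fn(usize) -> Option<String>, // None = compute locally
) -> Vec<f32> {
    for layer in 0..n_layers {
        hidden = match owner_of(layer) {
            None => run_block_locally(layer, hidden),
            Some(peer) => forward_block_to_peer(&peer, layer, hidden),
        };
    }
    hidden
}

// Stand-ins that keep the sketch self-contained; real versions would call
// into Candle and the libp2p runtime respectively.
fn run_block_locally(_layer: usize, h: Vec<f32>) -> Vec<f32> { h }
fn forward_block_to_peer(_peer: &str, _layer: usize, h: Vec<f32>) -> Vec<f32> { h }
```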
Distributed-GPT-Node enables distributed use of LLMs without requiring a single machine to bear the entire load. The project offers several practical advantages:
- **Distributing AI models across many different nodes**: layers can be split among peers instead of keeping everything on one system.
- **Using large models despite limited capacity**: even if individual computers have little RAM/VRAM, the overall model can run across multiple machines.
- **No expensive hardware required**: instead of a single large GPU machine, existing hardware is combined.
- **Node PCs are often idle**: free resources on existing computers can be used for inference.
- **Improved fault tolerance**: if a node fails, routing can be adjusted so the system continues running.
- **Flexible scaling**: more nodes mean more resources and potentially higher throughput.
- **Incremental expansion instead of big-bang upgrades**: new machines can be added gradually without a complete migration.
- **Heterogeneous hardware is usable**: different CPUs/GPUs and systems can work together.
- **Better resource utilization**: load can be distributed instead of overloading individual machines.
- **More control over model and data distribution**: depending on the routing, not all weights need to reside everywhere.
- **Practical for local networks and edge setups**: inference can run closer to where it is used, without forcing a cloud dependency.
- **Fewer bottlenecks through layer sharding**: layers can be distributed in ways that reduce bottlenecks.
- **Faster experimentation with distributed deployments**: routing via `BlocksMap` makes new distributions easier to test.
- **Better cost transparency**: capacity grows through many small contributions rather than one large investment.
- Provides an interactive CLI chat application (token streaming, stop sequences, prompt templates)
- Loads Llama weights (mmap safetensors) and performs inference with Candle
- Supports distributed execution: individual Transformer layers can be loaded per peer and computed remotely
- Manages server-side KV caches per peer, so different peers maintain separate contexts
- Paged KV cache (chunked growth) instead of `Tensor::cat` per token (a sketch follows after this list)
- Mixed-precision path (attention scores in F32, the rest in the model dtype)
- Cached additive -inf masks instead of expensive broadcast masked-fill patterns
- GQA/MQA handling with reduced duplication (fallback to `repeat_kv`)
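To make the first optimization concrete, here is a minimal sketch of chunked KV growth over a flat buffer. It is deliberately simplified (plain `Vec<f32>` instead of Candle tensors, a single head, no paging metadata); the project's actual `PagedKv` in local_llama.rs may differ:

```rust
/// Conceptual paged KV buffer (hypothetical, not the project's PagedKv):
/// capacity grows in fixed-size chunks, so appending a token is a copy into
/// preallocated memory rather than a reallocation/concat per token.
struct PagedKv {
    chunk_tokens: usize, // tokens added per allocation chunk
    head_dim: usize,     // elements stored per token
    data: Vec<f32>,      // flat storage, grown chunk-by-chunk
    len: usize,          // tokens currently stored
}

impl PagedKv {
    fn new(chunk_tokens: usize, head_dim: usize) -> Self {
        Self { chunk_tokens, head_dim, data: Vec::new(), len: 0 }
    }

    /// Append one token's K (or V) vector; only reallocates when a chunk fills.
    fn append(&mut self, token_kv: &[f32]) {
        assert_eq!(token_kv.len(), self.head_dim);
        let needed = (self.len + 1) * self.head_dim;
        if needed > self.data.len() {
            // Grow by one whole chunk instead of one token.
            self.data
                .resize(self.data.len() + self.chunk_tokens * self.head_dim, 0.0);
        }
        let start = self.len * self.head_dim;
        self.data[start..start + self.head_dim].copy_from_slice(token_kv);
        self.len += 1;
    }

    /// View of the valid prefix (what attention would actually read).
    fn valid(&self) -> &[f32] {
        &self.data[..self.len * self.head_dim]
    }
}
```

The point is amortization: appending is a `copy_from_slice` into preallocated memory and a reallocation happens only once per `chunk_tokens` tokens, whereas `Tensor::cat` per token copies the entire cache on every step.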
- Selective layer loading per block via `BlocksMap`
- Block forwarding: a peer can execute `forward_one_block` server-side (a hypothetical message sketch follows below)
- Auto-connect to all required peers (derived from the routing)
- Server handler with a cache per peer hash plus clear/reset logic
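The actual wire format lives in the p2p_* modules and is not documented here, so the following enum is purely hypothetical; it only illustrates the pieces the feature list implies (a per-peer cache key, a layer index, activations, and a cache-clear path):

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical block-forwarding messages; the project's real wire format
/// and codec are not reproduced here.
#[derive(Debug, Serialize, Deserialize)]
enum BlockMsg {
    /// Ask a peer to run one Transformer block server-side.
    Forward {
        peer_hash: u64,   // keys the server-side KV cache for this client
        layer: usize,     // which block to execute (forward_one_block)
        pos: usize,       // position offset into that client's KV cache
        hidden: Vec<f32>, // flattened activations for the current tokens
    },
    /// Drop the KV cache held for this client (the clear/reset path).
    ClearCache { peer_hash: u64 },
}
```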
- Token streaming (delta decoding; a sketch follows after this list)
- Stop sequences as strings or token-ID sequences
- Prompt-template auto-detection:
  - ChatML (two variants), Llama2, Llama3, Mistral, Gemma, Alpaca, fallback tags
- Commands: `help`, `exit`, `peers`, `connect`, `clear`
- Save/load of the token context to a local file (CSV)
- Space key aborts ongoing streaming output
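Delta decoding for token streaming is a standard trick; a minimal sketch (an assumed approach using the `tokenizers` crate, not the project's exact code) looks like this:

```rust
use tokenizers::Tokenizer;

/// Re-decode the whole sequence each step and print only the new suffix.
/// This keeps multi-byte and merged tokens rendering correctly, because a
/// single token on its own often does not decode to valid text.
fn stream_token(
    tokenizer: &Tokenizer,
    tokens: &mut Vec<u32>,
    printed_bytes: &mut usize, // how much of the decoded text was emitted
    next_token: u32,
) -> anyhow::Result<()> {
    tokens.push(next_token);
    let text = tokenizer
        .decode(tokens, true) // true = skip special tokens
        .map_err(anyhow::Error::msg)?;
    if text.len() > *printed_bytes {
        print!("{}", &text[*printed_bytes..]);
        *printed_bytes = text.len();
    }
    Ok(())
}
```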
Distributed-GPT-Node is designed for distributed execution of LLMs. Instead of a single node holding all weights and computing the full forward pass, the routing (via BlocksMap) distributes the Transformer layers across multiple peers.
This makes the project a technical basis for a form of federated inference: multiple independent nodes cooperate, and no single node must own everything or compute everything.
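As a rough mental model (the real `BlocksMap` in the p2p_* modules may be shaped differently), the routing can be read as a table from layer index to owning peer, from which the set of peers to auto-connect falls out directly:

```rust
use std::collections::HashMap;

type PeerId = String; // stand-in for libp2p::PeerId in this sketch

/// Illustrative-only routing table in the spirit of BlocksMap: each
/// Transformer layer index maps to the peer responsible for it, with
/// None meaning "compute locally".
struct BlocksMapSketch {
    routes: HashMap<usize, Option<PeerId>>,
}

impl BlocksMapSketch {
    /// Peers this node must connect to before inference can start.
    fn required_peers(&self) -> Vec<&PeerId> {
        let mut peers: Vec<&PeerId> = self.routes.values().flatten().collect();
        peers.sort();
        peers.dedup();
        peers
    }
}
```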
- `main.rs`: CLI chat, prompt-template detection, streaming output, save/load context, starting the P2P runtime and server handler, backend selection.
- `local_llama.rs`: Candle-based model implementation plus optimizations: `LocalLlama`, `Llama`, `Block`, `CausalSelfAttention`, `Cache`, `PagedKv`.
- `p2p_*` modules: libp2p runtime, wire format, codec, block forwarding, `BlocksMap` routing, identity/keys.
- Rust (stable)
- Candle (CPU or GPU backend depending on setup)
- Model assets (a loading sketch follows below):
  - safetensors weights (optionally sharded files)
  - JSON config (`LlamaConfig`)
  - tokenizer JSON
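For orientation, this is roughly how such assets are commonly wired up with Candle; the file names are placeholders and `load_assets` is illustrative, not part of Distributed-GPT-Node:

```rust
use candle_core::{DType, Device};
use candle_nn::VarBuilder;
use tokenizers::Tokenizer;

/// Hedged sketch of loading the three asset types with Candle.
fn load_assets() -> anyhow::Result<()> {
    let device = Device::Cpu;
    // Sharded safetensors files are memory-mapped rather than read eagerly.
    let weight_files = vec![
        "model-00001-of-00002.safetensors".to_string(),
        "model-00002-of-00002.safetensors".to_string(),
    ];
    let vb = unsafe { VarBuilder::from_mmaped_safetensors(&weight_files, DType::F16, &device)? };
    // The JSON config (parsed into something like LlamaConfig) and the
    // tokenizer are plain files next to the weights.
    let config: serde_json::Value = serde_json::from_slice(&std::fs::read("config.json")?)?;
    let tokenizer = Tokenizer::from_file("tokenizer.json").map_err(anyhow::Error::msg)?;
    let _ = (vb, config, tokenizer); // hand these to the model constructor
    Ok(())
}
```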
```bash
cargo build --release
```
The project is designed for practical inference and distributed experimentation. Issues and PRs are welcome, especially regarding:
- routing/orchestration around `BlocksMap`
- performance (KV cache, transfer, quantization)
- stability in the libp2p layer
- better observability (logs/metrics)