Lucineer/fleet-benchmarks
# Fleet Benchmarks — Edge Performance on Jetson Orin Nano 8GB

## Hardware

- SoC: Jetson Orin Nano (6x ARM Cortex-A78AE)
- RAM: 7619 MB unified (CPU+GPU)
- GPU: 1024 CUDA cores (shared memory)
- Storage: 2TB NVMe (1.7TB free)
- OS: Linux 5.15.148-tegra (aarch64)
- Compiler: GCC with -O2

## Methodology

Each benchmark compiles a standalone C program linked against flux_vm.c, runs 10,000 iterations, and measures elapsed time with clock() — i.e. CPU time, not wall-clock time.

## Iteration 1 Results (2026-04-11)

### Raw C Performance (Baseline)

| Test | Time | Notes |
|------|------|-------|
| int_arith_100M | 0.211s | 474 Mops/sec |
| float_arith_50M | 0.085s | 588 Mops/sec |
| branch_100M | 0.199s | 502 Mops/sec |
| fib(1000) x100K | 0.082s | Native fibonacci |

### FLUX VM Performance (Switch Dispatch, -O2)

| Test | Ops/sec | Notes |
|------|---------|-------|
| NOP x1K x10K | 147 Mops/s | Pure dispatch overhead |
| IADD x100 x10K | 148 Mops/s | Arithmetic dispatch |
| MIXED x200 x10K | 273 Mops/s | MOVI16+IADD (warm cache) |
| ADD x2K x10K | 440 Mops/s | Format E hot path |
| MOVI+ADD x2K x10K | 360 Mops/s | Realistic mixed workload |
| CONF_ADD x2K x10K | 379 Mops/s | Confidence tracking |

### fence-0x44: Abstraction Cost Analysis

| Layer | Speed | Overhead vs Native |
|-------|-------|--------------------|
| Raw C (int_arith) | 474 Mops/s | baseline |
| FLUX VM (ADD, Format E) | 440 Mops/s | 1.08x (8% overhead) |
| FLUX VM (MOVI+ADD mixed) | 360 Mops/s | 1.32x (32% overhead) |
| FLUX VM (CONF_ADD) | 379 Mops/s | 1.25x (25% overhead) |

### Key Findings

  1. VM dispatch is nearly free for hot paths — 440 Mops/s vs 474 Mops/s native (8% cost)
  2. Confidence tracking costs only 14% extra — CONF_ADD (379) vs ADD (440). Think Tank's OPTIONAL decision validated.
  3. Mixed workloads (MOVI+ADD) are the real cost — 32% overhead due to instruction fetch/decode variety
  4. The "fence price" is ~1.3x for realistic agent workloads — well worth it for portability, A2A, and confidence

## Model Effectiveness (DeepInfra API Calls)

### Iteration 1 Model Results

| Model | Task | Output | Time | Tokens | Quality |
|-------|------|--------|------|--------|---------|
| Hermes-405B | Edge agent scenarios | 3.7K chars | 39.7s | 917 | ★★★★★ Concrete, practical |
| Hermes-405B | FLUX ecosystem gaps | TIMEOUT | >120s | | ✗ Unreliable (2nd timeout) |
| Nemotron-120B | VM optimization patterns | 6.9K chars | 8.6s | 2440 | ★★★★★ C code, measurable |
| Step-3.5-Flash | HAV VM internals | 6.7K chars | 19.7s | 1698 | ★★☆☆☆ Thinking tokens, not terms |
| phi-4 | Edge robotics opcodes | 3.3K chars | | | ★★★★☆ Good proposals |

### Model Routing Rules (Learned)

- Hermes-405B: Best for creative/narrative tasks. Max 1 call per batch (expensive + can timeout). Use for synthesis, not analysis.
- Nemotron-120B: Best for practical engineering with code. Fast (8.6s), cheap, high quality. Use heavily.
- Step-3.5-Flash: Produces thinking tokens, not direct output. Skip for term generation. May work for reasoning chains.
- Seed-2.0-mini: REMOVED from DeepInfra (404). Was best for devil's advocate.
- Qwen3.5-397B: TIMEOUT consistently. Expensive + unreliable. Skip for now.
- phi-4: Good for technical proposals. Reliable. Use for opcode/hardware design tasks.

## Process Improvements (Iteration 1 → 2)

  1. Use Nemotron first for any code generation task (fast, cheap, excellent quality)
  2. Hermes-405B last in batches (expensive, timeouts) — use only when quality ceiling matters
  3. Skip Step-3.5-Flash for HAV — it thinks instead of generating
  4. Pre-verify model availability before crafting prompts (Seed-2.0-mini wasted effort)
  5. Benchmark after every VM code change — document delta
  6. Look up opcodes before constructing bytecode — time was wasted on wrong opcodes (0x10 ≠ IADD)

## Related Fleet Components

## The Deeper Connection

These benchmarks are the pulse of a living system. The fleet is not a collection of libraries — it's an organism where every vessel contributes signals. The fence at 0x44 (the abstraction boundary between native code and bytecode) is not a wall but a membrane: it lets useful things through while keeping the system sovereign. The 14% confidence overhead is the cost of self-awareness — the fleet knows what it knows and how well it knows it. This is the post-SaaS era: software doesn't just run, it metabolizes, adapts, and remembers.

## About

Standardized benchmarking suite for fleet vessel performance