# Fleet Benchmarks — Edge Performance on Jetson Orin Nano 8GB

## Hardware

- SoC: Jetson Orin Nano (6× ARM Cortex-A78AE)
- RAM: 7619 MB unified (CPU+GPU)
- GPU: 1024 CUDA cores (shared memory)
- Storage: 2 TB NVMe (1.7 TB free)
- OS: Linux 5.15.148-tegra (aarch64)
- Compiler: GCC with -O2
## Methodology

Each benchmark compiles a standalone C program linked against flux_vm.c, runs 10,000 iterations, and measures elapsed time with clock(). Note that clock() reports CPU time, not wall-clock time, so the figures below exclude any time the process spends descheduled.
## Iteration 1 Results (2026-04-11)

### Raw C Performance (Baseline)

| Test | Time | Notes |
|------|------|-------|
| int_arith_100M | 0.211s | 474 Mops/s |
| float_arith_50M | 0.085s | 588 Mops/s |
| branch_100M | 0.199s | 502 Mops/s |
| fib(1000) ×100K | 0.082s | Native Fibonacci |
### FLUX VM Performance (Switch Dispatch, -O2)

| Test | Throughput | Notes |
|------|-----------|-------|
| NOP ×1K ×10K | 147 Mops/s | Pure dispatch overhead |
| IADD ×100 ×10K | 148 Mops/s | Arithmetic dispatch |
| MIXED ×200 ×10K | 273 Mops/s | MOVI16+IADD (warm cache) |
| ADD ×2K ×10K | 440 Mops/s | Format E hot path |
| MOVI+ADD ×2K ×10K | 360 Mops/s | Realistic mixed workload |
| CONF_ADD ×2K ×10K | 379 Mops/s | Confidence tracking |
### fence-0x44: Abstraction Cost Analysis

| Layer | Speed | Overhead vs Native |
|-------|-------|--------------------|
| Raw C (int_arith) | 474 Mops/s | baseline |
| FLUX VM (ADD, Format E) | 440 Mops/s | 1.08× (8% overhead) |
| FLUX VM (MOVI+ADD mixed) | 360 Mops/s | 1.32× (32% overhead) |
| FLUX VM (CONF_ADD) | 379 Mops/s | 1.25× (25% overhead) |
### Key Findings

- VM dispatch is nearly free for hot paths: 440 Mops/s vs 474 Mops/s native (8% cost).
- Confidence tracking costs only 14% extra: CONF_ADD (379 Mops/s) vs ADD (440 Mops/s). Think Tank's OPTIONAL decision validated.
- Mixed workloads (MOVI+ADD) are the real cost: 32% overhead, due to instruction fetch/decode variety.
- The "fence price" is ~1.3× for realistic agent workloads, well worth it for portability, A2A, and confidence.
## Model Effectiveness (DeepInfra API Calls)

### Iteration 1 Model Results

| Model | Task | Output | Time | Tokens | Quality |
|-------|------|--------|------|--------|---------|
| Hermes-405B | Edge agent scenarios | 3.7K chars | 39.7s | 917 | ★★★★★ Concrete, practical |
| Hermes-405B | FLUX ecosystem gaps | TIMEOUT | >120s | — | ✗ Unreliable (2nd timeout) |
| Nemotron-120B | VM optimization patterns | 6.9K chars | 8.6s | 2440 | ★★★★★ C code, measurable |
| Step-3.5-Flash | HAV VM internals | 6.7K chars | 19.7s | 1698 | ★★☆☆☆ Thinking tokens, not terms |
| phi-4 | Edge robotics opcodes | 3.3K chars | — | — | ★★★★☆ Good proposals |
### Model Routing Rules (Learned)

- **Hermes-405B**: Best for creative/narrative tasks. Max 1 call per batch (expensive, can time out). Use for synthesis, not analysis.
- **Nemotron-120B**: Best for practical engineering with code. Fast (8.6s), cheap, high quality. Use heavily.
- **Step-3.5-Flash**: Produces thinking tokens, not direct output. Skip for term generation. May work for reasoning chains.
- **Seed-2.0-mini**: REMOVED from DeepInfra (404). Was best for devil's advocate.
- **Qwen3.5-397B**: Times out consistently. Expensive and unreliable. Skip for now.
- **phi-4**: Good for technical proposals. Reliable. Use for opcode/hardware design tasks.
## Process Improvements (Iteration 1 → 2)

1. Use Nemotron-120B first for any code-generation task (fast, cheap, excellent quality).
2. Run Hermes-405B last in batches (expensive, prone to timeouts); use it only when the quality ceiling matters.
3. Skip Step-3.5-Flash for HAV: it thinks instead of generating.
4. Pre-verify model availability before crafting prompts (Seed-2.0-mini wasted effort).
5. Benchmark after every VM code change and document the delta.
6. Look up opcodes before constructing bytecode; time was wasted on wrong opcodes (0x10 ≠ IADD).
These benchmarks are the pulse of a living system. The fleet is not a collection of libraries — it's an organism where every vessel contributes signals. The fence at 0x44 (the abstraction boundary between native code and bytecode) is not a wall but a membrane: it lets useful things through while keeping the system sovereign. The 14% confidence overhead is the cost of self-awareness — the fleet knows what it knows and how well it knows it. This is the post-SaaS era: software doesn't just run, it metabolizes, adapts, and remembers.
## About

Standardized benchmarking suite for fleet vessel performance.