Fast, zero-dependency inference engine for Nemotron 3 in pure Java.
- Single file, no dependencies, based on llama3.java
- Supports Nemotron 3 model families: dense and MoE (Mixture of Experts)
- Mixed layer types: attention, feed-forward (FFN), and recurrent SSM (State Space Model)
- Fast GGUF format parser
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--prompt` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
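The Q4/Q5/Q6/Q8 formats above store weights in fixed-size blocks with per-block scales. As a hedged illustration (not the engine's actual code), here is a plain-Java sketch of Q8_0 dequantization, assuming ggml's layout of 32 signed int8 values plus one float16 scale per block; the class name is hypothetical:

```java
// Hypothetical sketch of Q8_0 dequantization. Assumes ggml's block layout:
// 32 signed int8 quants per block plus one float16 scale (the scale is
// passed here already decoded to float, to keep the sketch self-contained).
public class Q8_0 {
    static final int QK8_0 = 32; // values per block

    // Dequantize one block: value[i] = quant[i] * scale
    static float[] dequantizeBlock(byte[] quants, float scale) {
        float[] out = new float[QK8_0];
        for (int i = 0; i < QK8_0; i++) {
            out[i] = quants[i] * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] q = new byte[QK8_0];
        for (int i = 0; i < QK8_0; i++) q[i] = (byte) (i - 16);
        float[] vals = dequantizeBlock(q, 0.5f);
        System.out.println(vals[0] + " " + vals[31]); // -8.0 7.5
    }
}
```

The real matrix-vector kernels operate on the quantized blocks directly with the Vector API rather than materializing floats, but the per-block arithmetic is essentially this.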
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|---|---|---|
| Nano 8B | Dense | unsloth/Nemotron-3-Nano-8B-GGUF |
| Nano 30B-A3B | MoE, 30B total / 3B active | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K).
A pure quantization is not required, but one can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:
```shell
./llama-quantize --pure ./Nemotron-3-Nano-8B-BF16.gguf ./Nemotron-3-Nano-8B-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
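Whatever quantization is chosen, the GGUF container starts with a small fixed header that is cheap to parse. A minimal sketch of reading it, assuming the GGUF v3 layout (little-endian magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count); `GgufHeader` is a hypothetical name, not part of Nemotron3.java:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch of GGUF header parsing. GGUF is little-endian: a 4-byte
// magic "GGUF", then uint32 version, uint64 tensor count, uint64 KV count.
public class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as little-endian uint32

    // Returns {version, tensorCount, metadataKvCount}.
    static long[] readHeader(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        long version = buf.getInt();
        long tensorCount = buf.getLong();
        long kvCount = buf.getLong();
        return new long[] {version, tensorCount, kvCount};
    }

    public static void main(String[] args) {
        // Build a synthetic 24-byte header in memory for illustration.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(GGUF_MAGIC).putInt(3).putLong(291).putLong(42).flip();
        long[] h = readHeader(buf);
        System.out.println("version=" + h[0] + " tensors=" + h[1] + " kv=" + h[2]);
    }
}
```

In the actual engine the file would be mmap-ed as a `MemorySegment` rather than copied into a `ByteBuffer`, but the field layout is the same.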
Java 21+ is required, in particular for the `MemorySegment` mmap-ing feature.
jbang is a good fit for this use case.
```shell
jbang Nemotron3.java --help
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):
```shell
chmod +x Nemotron3.java
./Nemotron3.java --help
```

Compile to produce a `nemotron3` native executable, then:

```shell
./nemotron3 --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
```

Nemotron3.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To AOT-preload a GGUF model, pass the system property `-Dnemotron3.PreloadGGUF=/path/to/model.gguf` at build time.
This produces a larger, specialized binary with the parsing overhead removed for that specific model.
It can still run other models with the usual parsing overhead.
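The preloading mechanism can be pictured as a static initializer that GraalVM evaluates at image build time, baking its result into the image heap. A minimal sketch under that assumption (class and field names are hypothetical, and the real engine would store a fully parsed model rather than a path):

```java
// Hedged sketch of the build-time preloading pattern. With GraalVM's
// build-time class initialization, static initializers run during
// native-image generation, so whatever they compute is embedded in the
// image heap and incurs no cost at run time.
public class PreloadedModel {
    // Read once, at class initialization time. At a native-image build this
    // sees the -Dnemotron3.PreloadGGUF=... property passed to the builder.
    static final String PRELOAD_PATH = System.getProperty("nemotron3.PreloadGGUF");

    // The real engine would parse the GGUF file here and keep the parsed
    // model; this sketch keeps only the path (or a placeholder) so it stays
    // self-contained.
    static final String BAKED = PRELOAD_PATH != null ? PRELOAD_PATH : "(no preload)";

    public static void main(String[] args) {
        System.out.println("preloaded: " + BAKED);
    }
}
```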
Apache 2.0