mukel/nemotron3.java
Nemotron3.java

Java 21+ · License: Apache 2.0 · GraalVM Platform

Fast, zero-dependency inference engine for Nemotron 3 in pure Java.


Features

  • Single file, no dependencies, based on llama3.java
  • Supports Nemotron 3 model families: dense and MoE (Mixture of Experts)
  • Mixed layer types: attention, feed-forward (FFN), and recurrent SSM (State Space Model)
  • Fast GGUF format parser
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --prompt modes
  • Thinking mode control with --think off|on|inline
  • GraalVM Native Image support
  • AOT model preloading for instant time-to-first-token
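To illustrate the block formats behind the quantizations listed above, here is a minimal sketch of Q4_0 dequantization. The layout follows the standard llama.cpp/GGUF convention (18-byte blocks: one fp16 scale plus 16 bytes of packed 4-bit weights); the class and method names are illustrative, not this repository's actual kernel.

```java
// Hedged sketch of Q4_0 dequantization (standard llama.cpp/GGUF block layout).
// A Q4_0 block is 18 bytes: a little-endian fp16 scale followed by 16 bytes
// of packed 4-bit quants; each value decodes to (q - 8) * scale.
public class Q40Sketch {
    static final int QK4_0 = 32; // values per block

    // blockBytes: one 18-byte Q4_0 block; returns 32 dequantized floats.
    static float[] dequantizeBlock(byte[] blockBytes) {
        // fp16 scale is stored little-endian in the first two bytes
        short bits = (short) ((blockBytes[0] & 0xFF) | ((blockBytes[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits); // Java 20+
        float[] out = new float[QK4_0];
        for (int j = 0; j < QK4_0 / 2; j++) {
            int b = blockBytes[2 + j] & 0xFF;
            out[j]             = ((b & 0x0F) - 8) * d; // low nibble  -> first half
            out[j + QK4_0 / 2] = ((b >>> 4)  - 8) * d; // high nibble -> second half
        }
        return out;
    }
}
```

Note the interleaving: low nibbles map to the first 16 outputs and high nibbles to the last 16, which is why a pure-quant file (see below) is easier to handle than one mixing several such formats.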

Setup

Download GGUF models from Hugging Face:

Model         Architecture                  GGUF Repository
Nano 8B       Dense                         unsloth/Nemotron-3-Nano-8B-GGUF
Nano 30B-A3B  MoE, 30B total / 3B active    unsloth/Nemotron-3-Nano-30B-A3B-GGUF

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./Nemotron-3-Nano-8B-BF16.gguf ./Nemotron-3-Nano-8B-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.

Build and run

Java 21+ is required, in particular for the MemorySegment mmap-ing feature.
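The MemorySegment mmap-ing referred to here can be sketched as follows. This is a minimal illustration of mapping a GGUF file with the Foreign Function &amp; Memory API and checking its "GGUF" magic number, not this repository's actual parser; the API is a preview feature on Java 21 and final from Java 22.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hedged sketch: mmap a file as a MemorySegment and check the GGUF magic.
public class GgufMmapSketch {
    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as a little-endian int
    static final ValueLayout.OfInt LE_INT =
            ValueLayout.JAVA_INT_UNALIGNED.withOrder(ByteOrder.LITTLE_ENDIAN);

    static boolean hasGgufMagic(Path file) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             Arena arena = Arena.ofConfined()) {
            // Map the whole file; the segment's lifetime is tied to the arena.
            MemorySegment seg = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
            return seg.get(LE_INT, 0) == GGUF_MAGIC;
        }
    }

    public static void main(String[] args) throws Exception {
        Path tmp = Files.createTempFile("demo", ".gguf");
        Files.write(tmp, new byte[] {'G', 'G', 'U', 'F', 3, 0, 0, 0});
        System.out.println(hasGgufMagic(tmp)); // prints true
        Files.delete(tmp);
    }
}
```

Mapping the model file rather than reading it means tensor data is paged in on demand, which keeps startup cheap even for multi-gigabyte GGUF files.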

jbang is a good fit for this use case.

jbang Nemotron3.java --help
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x Nemotron3.java
./Nemotron3.java --help

GraalVM Native Image

Compile Nemotron3.java with GraalVM's native-image to produce a nemotron3 native executable, then:

./nemotron3 --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
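The build step itself is not shown above; the following is a hedged sketch, not the repository's actual build script. The flag set is an assumption, in particular the Vector API incubator module that the matrix-vector kernels rely on.

```shell
# Sketch of a GraalVM native-image build (assumed flags, not the repo's script).
javac --add-modules jdk.incubator.vector Nemotron3.java
native-image --add-modules jdk.incubator.vector -cp . -o nemotron3 Nemotron3
```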

AOT model preloading

Nemotron3.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To preload a GGUF model ahead of time, pass the system property -Dnemotron3.PreloadGGUF=/path/to/model.gguf at build time. This produces a larger, specialized binary with the parsing overhead removed for that specific model; the binary can still run other models, with the usual parsing overhead.
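Combined with a native-image build, this might look as follows; only the -Dnemotron3.PreloadGGUF property comes from this README, the remaining flags are assumptions.

```shell
# Hedged sketch: bake a specific model's parsed metadata into the binary at build time.
native-image -Dnemotron3.PreloadGGUF=/path/to/model.gguf --add-modules jdk.incubator.vector -cp . -o nemotron3 Nemotron3
```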

Related Repositories

  • mukel/llama3.java: the single-file Llama 3 inference engine this project is based on

License

Apache 2.0
