Fast, zero-dependency inference engine for Nemotron 3 in pure Java.
- Single file, no dependencies, based on llama3.java
- Supports Nemotron 3 model families: dense and MoE (Mixture of Experts)
- Mixed layer types: attention, feed-forward (FFN), and recurrent SSM (State Space Model)
- Fast GGUF format parser
- Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
- Matrix-vector kernels using Java's Vector API
- CLI with `--chat` and `--prompt` modes
- Thinking mode control with `--think off|on|inline`
- GraalVM Native Image support
- AOT model preloading for instant time-to-first-token
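The Q4/Q5/Q6/Q8 formats above store weights in fixed-size blocks with per-block scales. As a hedged illustration (not the engine's actual code), here is a plain-Java sketch of Q8_0 dequantization, assuming ggml's layout of 32 signed int8 values plus one float16 scale per block; the class name is hypothetical:

```java
// Hypothetical sketch of Q8_0 dequantization. Assumes ggml's block layout:
// 32 signed int8 quants per block plus one float16 scale (the scale is
// passed here already decoded to float, to keep the sketch self-contained).
public class Q8_0 {
    static final int QK8_0 = 32; // values per block

    // Dequantize one block: value[i] = quant[i] * scale
    static float[] dequantizeBlock(byte[] quants, float scale) {
        float[] out = new float[QK8_0];
        for (int i = 0; i < QK8_0; i++) {
            out[i] = quants[i] * scale;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] q = new byte[QK8_0];
        for (int i = 0; i < QK8_0; i++) q[i] = (byte) (i - 16);
        float[] vals = dequantizeBlock(q, 0.5f);
        System.out.println(vals[0] + " " + vals[31]); // -8.0 7.5
    }
}
```

The real matrix-vector kernels operate on the quantized blocks directly with the Vector API rather than materializing floats, but the per-block arithmetic is essentially this.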
Download GGUF models from Hugging Face:
| Model | Architecture | GGUF Repository |
|---|---|---|
| Nano 8B | Dense | unsloth/Nemotron-3-Nano-8B-GGUF |
| Nano 30B-A3B | MoE, 30B total / 3B active | unsloth/Nemotron-3-Nano-30B-A3B-GGUF |
Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K).
A pure quantization is not required, but one can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:
```shell
./llama-quantize --pure ./Nemotron-3-Nano-8B-BF16.gguf ./Nemotron-3-Nano-8B-Q4_0.gguf Q4_0
```

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
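Whatever quantization is chosen, the GGUF container starts with a small fixed header that is cheap to parse. A minimal sketch of reading it, assuming the GGUF v3 layout (little-endian magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata KV count); `GgufHeader` is a hypothetical name, not part of Nemotron3.java:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hedged sketch of GGUF header parsing. GGUF is little-endian: a 4-byte
// magic "GGUF", then uint32 version, uint64 tensor count, uint64 KV count.
public class GgufHeader {
    static final int GGUF_MAGIC = 0x46554747; // "GGUF" read as little-endian uint32

    // Returns {version, tensorCount, metadataKvCount}.
    static long[] readHeader(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt() != GGUF_MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        long version = buf.getInt();
        long tensorCount = buf.getLong();
        long kvCount = buf.getLong();
        return new long[] {version, tensorCount, kvCount};
    }

    public static void main(String[] args) {
        // Build a synthetic 24-byte header in memory for illustration.
        ByteBuffer buf = ByteBuffer.allocate(24).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(GGUF_MAGIC).putInt(3).putLong(291).putLong(42).flip();
        long[] h = readHeader(buf);
        System.out.println("version=" + h[0] + " tensors=" + h[1] + " kv=" + h[2]);
    }
}
```

In the actual engine the file would be mmap-ed as a `MemorySegment` rather than copied into a `ByteBuffer`, but the field layout is the same.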
Java 21+ is required, in particular for the `MemorySegment` mmap-ing feature.
jbang is a good fit for this use case.
```shell
jbang Nemotron3.java --help
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
jbang Nemotron3.java --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --prompt "Explain quantum computing like I'm five"
```
Or run it directly (still via jbang):
```shell
chmod +x Nemotron3.java
./Nemotron3.java --help
```

Compile to produce a `nemotron3` native executable, then:

```shell
./nemotron3 --model ./Nemotron-3-Nano-30B-A3B-Q8_0.gguf --chat
```

Nemotron3.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).
To AOT-preload a GGUF model, pass the system property `-Dnemotron3.PreloadGGUF=/path/to/model.gguf` at build time.
This produces a larger, specialized binary with the parsing overhead removed for that specific model.
It can still run other models with the usual parsing overhead.
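The preloading mechanism can be pictured as a static initializer that GraalVM evaluates at image build time, baking its result into the image heap. A minimal sketch under that assumption (class and field names are hypothetical, and the real engine would store a fully parsed model rather than a path):

```java
// Hedged sketch of the build-time preloading pattern. With GraalVM's
// build-time class initialization, static initializers run during
// native-image generation, so whatever they compute is embedded in the
// image heap and incurs no cost at run time.
public class PreloadedModel {
    // Read once, at class initialization time. At a native-image build this
    // sees the -Dnemotron3.PreloadGGUF=... property passed to the builder.
    static final String PRELOAD_PATH = System.getProperty("nemotron3.PreloadGGUF");

    // The real engine would parse the GGUF file here and keep the parsed
    // model; this sketch keeps only the path (or a placeholder) so it stays
    // self-contained.
    static final String BAKED = PRELOAD_PATH != null ? PRELOAD_PATH : "(no preload)";

    public static void main(String[] args) {
        System.out.println("preloaded: " + BAKED);
    }
}
```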
Apache 2.0