
Qwen35.java

Java 21+ License: Apache 2.0 GraalVM Platform

Fast, zero-dependency inference engine for Qwen 3.5 in pure Java.


Features

  • Single file, no dependencies, based on llama3.java
  • Supports all tested Qwen 3.5 model families: 0.8B, 2B, 4B, 9B, 27B, and 35B-A3B (MoE)
  • Fast GGUF format parser
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --prompt modes
  • GraalVM Native Image support
  • AOT model preloading for instant time-to-first-token
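
For reference, a GGUF file starts with a small fixed-layout header: the magic bytes "GGUF", a u32 version, a u64 tensor count, and a u64 metadata key/value count. A hypothetical sketch of parsing just that prefix (not the actual code in Qwen35.java):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: parse the fixed-size prefix of a GGUF header.
record GgufHeader(int version, long tensorCount, long metadataKvCount) {
    static final int MAGIC = 0x46554747; // "GGUF" read as a little-endian u32

    static GgufHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN); // GGUF is little-endian throughout
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        return new GgufHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }
}
```

The metadata key/value pairs and tensor descriptors follow this prefix; parsing them is what the "fast GGUF parser" feature refers to.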

Setup

Download GGUF models from Hugging Face:

Model    Architecture                 GGUF Repository
0.8B     Dense, ~0.8B total params    unsloth/Qwen3.5-0.8B-GGUF
2B       Dense, ~2B total params      unsloth/Qwen3.5-2B-GGUF
4B       Dense, ~4B total params      unsloth/Qwen3.5-4B-GGUF
9B       Dense, ~9B total params      unsloth/Qwen3.5-9B-GGUF
27B      Dense, ~27B total params     unsloth/Qwen3.5-27B-GGUF
35B-A3B  Mixture of Experts (MoE)     unsloth/Qwen3.5-35B-A3B-GGUF

Larger MoE models are not tested.

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./Qwen3.5-4B-F16.gguf ./Qwen3.5-4B-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
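
To give an idea of what a pure quantization stores, here is a hypothetical sketch of Q4_0 dequantization (layout per the GGUF/llama.cpp Q4_0 format: each 32-value block holds one f16 scale followed by 16 bytes of packed 4-bit quants; each value is (nibble - 8) * scale). This is an illustration, not the actual kernel in Qwen35.java:

```java
// Hypothetical sketch: dequantize one Q4_0 block (18 bytes -> 32 floats).
final class Q40 {
    static final int QK = 32; // values per block

    static float[] dequantize(byte[] block) {
        // 2-byte little-endian f16 scale
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits);
        float[] y = new float[QK];
        for (int j = 0; j < QK / 2; j++) {
            int q = block[2 + j];
            y[j]          = ((q & 0x0F) - 8) * d;        // low nibble
            y[j + QK / 2] = (((q >> 4) & 0x0F) - 8) * d; // high nibble
        }
        return y;
    }
}
```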

Build and run

Java 21+ is required, in particular for mmap-ing the model file into a MemorySegment.
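
A minimal sketch of that mmap step, assuming Java 22+ where the Foreign Function & Memory API is final (on Java 21 it is a preview feature and needs --enable-preview); this is an illustration, not the file's actual code:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: map a GGUF file read-only into off-heap memory.
// The returned segment stays valid for as long as the arena is open.
final class GgufMmap {
    static MemorySegment mmap(Path gguf, Arena arena) throws Exception {
        try (FileChannel ch = FileChannel.open(gguf, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }
}
```

Mapping instead of reading means multi-gigabyte models load without copying the weights onto the Java heap.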

jbang is a good fit for this use case.

No-setup one-liner: no git clone or manual model download required. The model (~5GB) is downloaded once, then cached by jbang:

jbang qwen35@mukel \
    --model %{https://hf.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q8_0.gguf} \
    --system-prompt "You are a helpful assistant" \
    --chat

Alternatively:

jbang Qwen35.java --help
jbang Qwen35.java --model ./Qwen3.5-4B-Q4_0.gguf --chat
jbang Qwen35.java --model ./Qwen3.5-4B-Q4_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x Qwen35.java
./Qwen35.java --help

Optional: Makefile

A simple Makefile is provided. Run make jar to produce qwen35.jar.

Run the resulting qwen35.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar qwen35.jar --help
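
The --add-modules jdk.incubator.vector flag is needed because the matrix-vector kernels use the incubator Vector API. A hypothetical sketch of such a kernel, a SIMD dot product (not the actual code in Qwen35.java):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: a vectorized dot product, the inner loop of a
// matrix-vector multiply. Requires --add-modules jdk.incubator.vector.
final class VecKernels {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        // Vectorized main loop: S.length() lanes per iteration.
        for (; i <= a.length - S.length(); i += S.length()) {
            sum += FloatVector.fromArray(S, a, i)
                              .mul(FloatVector.fromArray(S, b, i))
                              .reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```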

GraalVM Native Image

Compile with make native to produce a qwen35 executable, then:

./qwen35 --model ./Qwen3.5-4B-Q8_0.gguf --chat

AOT model preloading

Qwen35.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To AOT pre-load a GGUF model:

PRELOAD_GGUF=/path/to/model.gguf make native

This produces a larger, specialized binary with the parsing overhead removed for that specific model. The binary can still run other models, with the usual parsing overhead.

Related Repositories

License

Apache 2.0