
Qwen35.java

Java 21+ License: Apache 2.0 GraalVM Platform

Fast, zero-dependency inference engine for Qwen 3.5 in pure Java.


Features

  • Single file, no dependencies, based on llama3.java
  • Supports all tested Qwen 3.5 model families: 0.8B, 2B, 4B, 9B, 27B, and 35B-A3B (MoE)
  • Fast GGUF format parser
  • Supported dtypes/quantizations: F16, BF16, F32, Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, Q8_0
  • Matrix-vector kernels using Java's Vector API
  • CLI with --chat and --prompt modes
  • GraalVM Native Image support
  • AOT model preloading for instant time-to-first-token
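
For reference, a GGUF file starts with a small fixed-layout header: the magic bytes "GGUF", a u32 version, a u64 tensor count, and a u64 metadata key/value count. A hypothetical sketch of parsing just that prefix (not the actual code in Qwen35.java):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: parse the fixed-size prefix of a GGUF header.
record GgufHeader(int version, long tensorCount, long metadataKvCount) {
    static final int MAGIC = 0x46554747; // "GGUF" read as a little-endian u32

    static GgufHeader parse(ByteBuffer buf) {
        buf.order(ByteOrder.LITTLE_ENDIAN); // GGUF is little-endian throughout
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("not a GGUF file");
        }
        return new GgufHeader(buf.getInt(), buf.getLong(), buf.getLong());
    }
}
```

The metadata key/value pairs and tensor descriptors follow this prefix; parsing them is what the "fast GGUF parser" feature refers to.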

Setup

Download GGUF models from Hugging Face:

Model    Architecture                 GGUF Repository
0.8B     Dense, ~0.8B total params    unsloth/Qwen3.5-0.8B-GGUF
2B       Dense, ~2B total params      unsloth/Qwen3.5-2B-GGUF
4B       Dense, ~4B total params      unsloth/Qwen3.5-4B-GGUF
9B       Dense, ~9B total params      unsloth/Qwen3.5-9B-GGUF
27B      Dense, ~27B total params     unsloth/Qwen3.5-27B-GGUF
35B-A3B  Mixture of Experts (MoE)     unsloth/Qwen3.5-35B-A3B-GGUF

Larger MoE models are not tested.

Optional: pure quantizations

Q4_0 files are often mixed-quant in practice (for example, token_embd.weight and output.weight may use Q6_K). A pure quantization is not required, but can be generated from an F32/F16/BF16 GGUF source with llama-quantize from llama.cpp:

./llama-quantize --pure ./Qwen3.5-4B-F16.gguf ./Qwen3.5-4B-Q4_0.gguf Q4_0

Pick any supported target quantization, for example Q4_0, Q4_1, Q4_K, Q5_K, Q6_K, or Q8_0.
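
To give an idea of what a pure quantization stores, here is a hypothetical sketch of Q4_0 dequantization (layout per the GGUF/llama.cpp Q4_0 format: each 32-value block holds one f16 scale followed by 16 bytes of packed 4-bit quants; each value is (nibble - 8) * scale). This is an illustration, not the actual kernel in Qwen35.java:

```java
// Hypothetical sketch: dequantize one Q4_0 block (18 bytes -> 32 floats).
final class Q40 {
    static final int QK = 32; // values per block

    static float[] dequantize(byte[] block) {
        // 2-byte little-endian f16 scale
        short bits = (short) ((block[0] & 0xFF) | ((block[1] & 0xFF) << 8));
        float d = Float.float16ToFloat(bits);
        float[] y = new float[QK];
        for (int j = 0; j < QK / 2; j++) {
            int q = block[2 + j];
            y[j]          = ((q & 0x0F) - 8) * d;        // low nibble
            y[j + QK / 2] = (((q >> 4) & 0x0F) - 8) * d; // high nibble
        }
        return y;
    }
}
```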

Build and run

Java 21+ is required, in particular for mmap-ing the model file into a MemorySegment.
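
A minimal sketch of that mmap step, assuming Java 22+ where the Foreign Function & Memory API is final (on Java 21 it is a preview feature and needs --enable-preview); this is an illustration, not the file's actual code:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: map a GGUF file read-only into off-heap memory.
// The returned segment stays valid for as long as the arena is open.
final class GgufMmap {
    static MemorySegment mmap(Path gguf, Arena arena) throws Exception {
        try (FileChannel ch = FileChannel.open(gguf, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size(), arena);
        }
    }
}
```

Mapping instead of reading means multi-gigabyte models load without copying the weights onto the Java heap.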

jbang is a good fit for this use case.

No-setup one-liner: no git clone or manual model download required. The model (~5GB) is downloaded once, then cached by jbang:

jbang qwen35@mukel \
    --model %{https://hf.co/unsloth/Qwen3.5-4B-GGUF/resolve/main/Qwen3.5-4B-Q8_0.gguf} \
    --system-prompt "You are a helpful assistant" \
    --chat

Alternatively:

jbang Qwen35.java --help
jbang Qwen35.java --model ./Qwen3.5-4B-Q4_0.gguf --chat
jbang Qwen35.java --model ./Qwen3.5-4B-Q4_0.gguf --prompt "Explain quantum computing like I'm five"

Or run it directly (still via jbang):

chmod +x Qwen35.java
./Qwen35.java --help

Optional: Makefile

A simple Makefile is provided. Run make jar to produce qwen35.jar.

Run the resulting qwen35.jar as follows:

java --enable-preview --add-modules jdk.incubator.vector -jar qwen35.jar --help
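
The --add-modules jdk.incubator.vector flag is needed because the matrix-vector kernels use the incubator Vector API. A hypothetical sketch of such a kernel, a SIMD dot product (not the actual code in Qwen35.java):

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Hypothetical sketch: a vectorized dot product, the inner loop of a
// matrix-vector multiply. Requires --add-modules jdk.incubator.vector.
final class VecKernels {
    static final VectorSpecies<Float> S = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        // Vectorized main loop: S.length() lanes per iteration.
        for (; i <= a.length - S.length(); i += S.length()) {
            sum += FloatVector.fromArray(S, a, i)
                              .mul(FloatVector.fromArray(S, b, i))
                              .reduceLanes(VectorOperators.ADD);
        }
        // Scalar tail for the remaining elements.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
}
```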

GraalVM Native Image

Compile with make native to produce a qwen35 executable, then:

./qwen35 --model ./Qwen3.5-4B-Q8_0.gguf --chat

AOT model preloading

Qwen35.java supports AOT model preloading to reduce parse overhead and time-to-first-token (TTFT).

To AOT pre-load a GGUF model:

PRELOAD_GGUF=/path/to/model.gguf make native

This produces a larger, specialized binary with the parsing overhead removed for that specific model. The binary can still run other models, with the usual parsing overhead.

Related Repositories

License

Apache 2.0