Convert plain English into executable shell commands using an 800M-parameter model that runs on your laptop.
NL2Shell is an open-source fine-tune of Qwen3.5-0.8B — a hybrid DeltaNet architecture with 75% linear + 25% softmax attention layers — trained on 11,894 deduplicated natural language to bash pairs using QLoRA. It outputs only the shell command, no explanations, no markdown fences. Edge-deployable via GGUF (q4_k_m ~400 MB).
```shell
# Install Ollama: https://ollama.com
ollama run hf.co/AryaYT/nl2shell-0.8b "find all Python files modified in the last 24 hours"
# find . -name '*.py' -mtime -1
```

Add a shell helper so you can type `nl <description>` from anywhere:
```shell
# Add to ~/.zshrc or ~/.bashrc
nl() { ollama run hf.co/AryaYT/nl2shell-0.8b "$*" 2>/dev/null; }

# Then:
nl show disk usage of each subdirectory
# du -sh */ | sort -rh
```

For programmatic use with the Hugging Face `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AryaYT/nl2shell-0.8b"
SYSTEM = (
    "You are an expert shell programmer. Given a natural language request, "
    "output ONLY the corresponding shell command. No explanations."
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def nl2shell(request: str) -> str:
    prompt = (
        f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
        f"<|im_start|>user\n{request}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=128, temperature=0.1,
        do_sample=True, pad_token_id=tokenizer.eos_token_id,
    )
    full = tokenizer.decode(outputs[0], skip_special_tokens=False)
    return full.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0].strip()

print(nl2shell("show all running Docker containers"))
# docker ps
```

Try it in your browser with no setup — huggingface.co/spaces/AryaYT/nl2shell-demo
| Natural Language | Generated Shell Command |
|---|---|
| list all files sorted by size | `ls -lhS` |
| find all Python files modified in the last 24 hours | `find . -name '*.py' -mtime -1` |
| kill the process using port 3000 | `lsof -ti:3000 \| xargs kill -9` |
| show disk usage of each subdirectory | `du -sh */ \| sort -rh` |
| compress the src directory into a tar.gz | `tar -czf src.tar.gz src/` |
| count lines of code in all TypeScript files | `find . -name '*.ts' \| xargs wc -l` |
| show git log as one-line summaries | `git log --oneline -20` |
| generate a random 32-character password | `openssl rand -base64 32` |
| find and delete all node_modules directories | `find . -name 'node_modules' -type d -prune -exec rm -rf {} +` |
| check SSL certificate expiry of a domain | `echo \| openssl s_client -connect example.com:443 2>/dev/null \| openssl x509 -noout -dates` |
| restart the DNS cache on macOS | `sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder` |
| show which process is using the most CPU | `ps aux --sort=-%cpu \| head -5` |
- Sub-1B model — 859M parameters, fits in 1 GB RAM (GGUF q4_k_m). Runs on a Raspberry Pi 4.
- GGUF-ready — q4_k_m (~400 MB) for edge/Raspberry Pi, q8_0 (~650 MB) for desktop, via Ollama.
- Hybrid DeltaNet architecture — Qwen3.5-0.8B uses 75% linear attention + 25% softmax attention layers. First NL2Bash model on this architecture class.
- Response-only loss masking — trained only on assistant shell output tokens, not user instructions. Cleaner gradient signal.
- ChatML prompt format — standard `<|im_start|>` / `<|im_end|>` tokens, compatible with any OpenAI-style chat pipeline.
- Linux + macOS coverage — NL2Bash benchmark pairs (GNU tools) plus 40 handcrafted macOS/Homebrew synthetic pairs.
- MIT licensed — use freely in commercial products, modify, redistribute.
- Reproducible — full training code, dataset pipeline, and benchmark script included.
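The response-only loss masking bullet above can be illustrated without any training framework: label positions covering the system and user turns are set to -100 (the ignore index used by PyTorch-style cross-entropy), so only assistant tokens contribute to the loss. A minimal sketch with a hypothetical `build_labels` helper (not from this repo):

```python
IGNORE_INDEX = -100  # positions with this label are skipped by the loss

def build_labels(token_ids, response_start, response_end):
    """Mask every token outside [response_start, response_end) from the loss."""
    return [
        tok if response_start <= i < response_end else IGNORE_INDEX
        for i, tok in enumerate(token_ids)
    ]

# Toy sequence: 6 prompt tokens (system + user) followed by 3 response tokens.
ids = [101, 102, 103, 104, 105, 106, 201, 202, 203]
labels = build_labels(ids, response_start=6, response_end=9)
print(labels)  # [-100, -100, -100, -100, -100, -100, 201, 202, 203]
```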
This project uses uv for Python package management.
```shell
# Clone the repo
git clone https://github.com/aryateja2106/nl2shell
cd nl2shell

# Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dev tools (linting, type checking)
uv sync --group dev

# Install for inference / demo
uv sync --group demo

# Install for training (requires CUDA GPU)
uv sync --group train

# Install for benchmarking
uv sync --group eval
```

Requirements: Python 3.10+. For training: CUDA GPU with at least 16 GB VRAM (A100 recommended). For GGUF inference via Ollama: any hardware, including Apple Silicon and Raspberry Pi.
```shell
uv sync --group demo
python app.py
# Opens at http://localhost:7860
```

The Gradio interface lets you type a natural language description and get the shell command instantly. Includes 15 example prompts and copy-to-clipboard.
Training requires a CUDA GPU. The scripts are designed for Google Colab A100 but work on any compatible GPU.
```shell
uv sync --group train

# Set your HuggingFace token
export HF_TOKEN=hf_...

# Run QLoRA fine-tuning
python train.py
```

The training script:
- Loads Qwen3.5-0.8B with 4-bit NF4 quantization via Unsloth (falls back to standard PEFT if Unsloth is unavailable)
- Applies QLoRA adapters (r=16, alpha=32) to all attention and MLP projections
- Loads and formats the dataset as ChatML
- Applies `train_on_responses_only` to mask system/user tokens from the loss
- Trains for 3–4 epochs with a cosine LR schedule
- Evaluates on 7 test prompts
- Merges LoRA adapters, exports GGUF (q4_k_m + q8_0), and pushes to HuggingFace
Important: Do not modify prepare.py (dataset and eval utilities). Edit only train.py for hyperparameter changes.
```shell
uv sync --group data

# Rebuild the deduplicated 11,894-pair dataset and push to HuggingFace
export HF_TOKEN=hf_...
python build_v2_dataset.py
```

```shell
uv sync --group eval

# Evaluate on NL2Bash test split (606 examples)
# Metrics: charBLEU, template accuracy, exact match
python benchmark.py
```

Results are saved to `benchmark_results.json` with per-example predictions.
```shell
uv sync --group dev

ruff check .   # lint
ruff format .  # format
ty check .     # type check
```

Qwen3.5-0.8B is Alibaba's 859M-parameter hybrid language model. Unlike standard transformers, which use full softmax attention in every layer, it uses a DeltaNet architecture — a mix of linear recurrent attention (efficient O(n) complexity) and traditional softmax attention (O(n²) but higher expressivity):
| Property | Value |
|---|---|
| Total parameters | 859M |
| Architecture | Hybrid DeltaNet |
| Attention layers | 25% softmax (every 4th layer) |
| Linear layers | 75% DeltaNet recurrent |
| Context window | 262,144 tokens |
| Hidden size | 1024 |
| Layers | 24 |
| Heads | 16 |
| Prompt format | ChatML (`<\|im_start\|>` / `<\|im_end\|>`) |
The linear attention layers give Qwen3.5 better inference efficiency at long contexts. Shell commands are short (< 128 tokens), so the practical benefit here is reduced model size and faster loading rather than sequence length.
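The O(n) claim is easiest to see in a toy recurrence: linear attention carries a fixed-size state forward instead of attending over all previous tokens. A schematic one-dimensional sketch (illustrative only; this is not the actual DeltaNet update rule):

```python
def linear_attention(keys, values, queries):
    """O(n) recurrent read: a constant-size state S accumulates key-value
    products, so per-token cost does not grow with sequence length."""
    S = 0
    outputs = []
    for k, v, q in zip(keys, values, queries):
        S += k * v             # state update: O(1) per token
        outputs.append(q * S)  # read against the running state
    return outputs

print(linear_attention([1, 1, 1], [2, 3, 4], [1, 1, 1]))  # [2, 5, 9]
```

Softmax attention, by contrast, recomputes scores against every previous token at each step, which is what makes it O(n²) over the full sequence.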
NL2Shell is believed to be the first NL2Bash fine-tune on a hybrid DeltaNet model, establishing a baseline for this architecture class on the task.
| Hyperparameter | Value |
|---|---|
| Method | QLoRA (PEFT) |
| Base quantization | 4-bit NF4 |
| Compute dtype | bfloat16 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Loss masking | Response-only (system + user tokens masked) |
| Sequence packing | Yes (efficient GPU utilization) |
| Gradient checkpointing | Yes (Unsloth) |
| Optimizer | AdamW 8-bit |
| LR scheduler | Cosine |
| Max sequence length | 512 tokens |
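Sequence packing (from the table above) concatenates short training examples until the 512-token window is full, reducing padding waste. A greedy first-fit sketch (hypothetical `pack` helper; the real packing is handled inside the training framework):

```python
def pack(lengths, max_len=512):
    """Greedily pack example lengths into bins of at most max_len tokens."""
    bins, current, used = [], [], 0
    for n in lengths:
        if used + n > max_len and current:
            bins.append(current)      # bin full: start a new one
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        bins.append(current)
    return bins

print(pack([200, 200, 200, 100, 400]))  # [[200, 200], [200, 100], [400]]
```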
HuggingFace: AryaYT/nl2shell-training
The v2 dataset contains 11,894 deduplicated natural language to bash command pairs sourced from:
| Source | Pairs | Notes |
|---|---|---|
| GWHed/nl2bash | ~8,090 | Classic NL2Bash benchmark, GNU/Linux commands |
| AnishJoshi/nl2bash-custom | ~3,764 | Supplemental NL2Bash pairs |
| macOS synthetic (handcrafted) | 40 | Homebrew, launchd, macOS-specific commands |
| Total (deduplicated) | 11,894 | SHA256 hash dedup on (nl, bash) pairs |
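The SHA-256 dedup in the last row can be sketched as follows (an assumed shape; the actual pipeline lives in build_v2_dataset.py):

```python
import hashlib

def dedup_pairs(pairs):
    """Keep the first occurrence of each (nl, bash) pair, keyed by SHA-256."""
    seen, unique = set(), []
    for nl, bash in pairs:
        key = hashlib.sha256(f"{nl}\x00{bash}".encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((nl, bash))
    return unique

pairs = [
    ("list files", "ls"),
    ("list files", "ls"),      # exact duplicate, dropped
    ("list files", "ls -la"),  # same NL, different command, kept
]
print(len(dedup_pairs(pairs)))  # 2
```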
Every pair is formatted as ChatML before training:
```
<|im_start|>system
You are an expert shell programmer. Given a natural language request, output ONLY the corresponding shell command. No explanations.<|im_end|>
<|im_start|>user
{natural language request}<|im_end|>
<|im_start|>assistant
{shell command}<|im_end|>
```
The dataset is shuffled with seed 42 for reproducibility.
| | v1 | v2 |
|---|---|---|
| Training pairs | 8,130 | 11,894 |
| Epochs | 3 | 4 |
| Hardware | A100 (Colab) | H100 (Colab Pro) |
| Batch size | 8 | 8 |
| Gradient accumulation | 4 (effective batch 32) | 8 (effective batch 64) |
| Learning rate | 2e-4 | 2e-4 |
| Warmup | 20 steps | 5% of steps |
| Final train loss | 0.6338 | in progress |
| Unsloth | Yes | Yes |
| Sequence packing | Yes | Yes |
v2 hyperparameters were reviewed by Amp; warmup was increased to 5% of steps and the effective batch size doubled to 64 for the larger dataset.
Evaluated on the GWHed/nl2bash test split (606 examples).
- charBLEU — character-level BLEU-4 with brevity penalty (NLC2CMD standard)
- Template accuracy — commands match after normalizing quoted strings, paths, and numbers to placeholders
- Exact match — identical string after stripping whitespace
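Template accuracy can be approximated with a few regex normalizations: quoted strings, absolute paths, and numbers collapse to placeholders before comparison (a simplified sketch; the exact placeholder rules are in benchmark.py):

```python
import re

def to_template(cmd: str) -> str:
    """Collapse literals to placeholders so commands compare structurally."""
    cmd = re.sub(r"'[^']*'|\"[^\"]*\"", "STR", cmd)  # quoted strings
    cmd = re.sub(r"(?<=\s)/\S+", "PATH", cmd)        # absolute path arguments
    cmd = re.sub(r"\d+", "NUM", cmd)                 # numeric literals
    return " ".join(cmd.split())

# Different literals, same template:
print(to_template("ps aux | head -5"))   # ps aux | head -NUM
print(to_template("ps aux | head -20"))  # ps aux | head -NUM
```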
| Model | charBLEU | Template Acc | Exact Match |
|---|---|---|---|
| NL2Shell-0.8b v1 | pending | pending | pending |
| NL2Shell-0.8b v2 | pending | pending | pending |
| Qwen2.5-Coder-0.5B + LoRA | — | — | 0.46 IC-ALFA* |
| Llama-3.2-1B + LoRA | — | — | 0.37 IC-ALFA* |
*IC-ALFA (Westenfelder 2025) is execution accuracy, not directly comparable to charBLEU. Apples-to-apples comparison pending.
To run benchmarks yourself:
```shell
uv sync --group eval
python benchmark.py
# Saves per-example results to benchmark_results.json
```

NL2Shell is a fine-tuned language model that converts plain English descriptions into shell/bash commands. You describe what you want to do — "find all log files larger than 100 MB" — and the model outputs the exact command: `find / -name '*.log' -size +100M`. It is a research project and open-source tool built on top of Qwen3.5-0.8B, an 859M-parameter hybrid DeltaNet model.
The easiest path is Ollama:
```shell
ollama run hf.co/AryaYT/nl2shell-0.8b "your description here"
```

For programmatic use, load `AryaYT/nl2shell-0.8b` with the Hugging Face `transformers` library and send a ChatML-formatted prompt (see the Quick Start section above). There is also a no-install web demo.
As of March 2026, NL2Shell-0.8b is the only publicly available fine-tune specifically targeting shell command generation on Qwen3.5-0.8B (hybrid DeltaNet architecture). It is competitive with Qwen2.5-Coder-0.5B + LoRA at a similar or smaller parameter count. For edge deployment (Raspberry Pi, M-series Mac with no GPU), the GGUF q4_k_m variant at ~400 MB is the most practical option in its class.
After downloading the GGUF model once, Ollama runs fully offline:
```shell
# First run downloads the model (~400 MB for q4_k_m)
ollama run hf.co/AryaYT/nl2shell-0.8b "list all files sorted by modification date"

# Subsequent runs are instant, no internet required
```

The q4_k_m GGUF runs on CPU. On Apple Silicon (M1/M2/M3/M4), Ollama uses Metal GPU acceleration automatically.
Qwen3.5-0.8B (and the NL2Shell fine-tune) uses the ChatML prompt format. The key is to use the correct system prompt and output format:
```python
SYSTEM = "You are an expert shell programmer. Given a natural language request, output ONLY the corresponding shell command. No explanations."

prompt = (
    f"<|im_start|>system\n{SYSTEM}<|im_end|>\n"
    f"<|im_start|>user\n{your_request}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)
```

Set temperature to 0.1 for deterministic, predictable output. Shell commands are not a creative task — low temperature is strongly recommended.
| Project | Model | Size | GGUF | Notes |
|---|---|---|---|---|
| NL2Shell (this project) | Qwen3.5-0.8B | 859M | Yes | Hybrid DeltaNet, MIT |
| Westenfelder 2025 | Qwen2.5-Coder-0.5B | 500M | No | IC-ALFA 0.46, not public |
| Westenfelder 2025 | Llama-3.2-1B | 1B | No | IC-ALFA 0.37, not public |
| NLC2CMD (2021) | various | large | No | IBM, competition-era |
| ShellGPT | GPT-4o via API | API | No | Requires OpenAI key |
| GitHub Copilot CLI | GPT-4o via API | API | No | Subscription required |
NL2Shell is the only open-weight, edge-deployable, MIT-licensed option purpose-trained for NL-to-bash on the Qwen3.5 architecture.
The full pipeline is in this repo. The key steps:
- Dataset: Collect (natural language, bash command) pairs. NL2Bash (~10K pairs from GWHed/nl2bash) is the standard benchmark dataset. Add domain-specific pairs for your use case (macOS, Kubernetes, etc.).
- Format: Convert to ChatML format with a system prompt instructing the model to output only the command.
- QLoRA: Use `peft` + `trl` (SFTTrainer) with r=16, alpha=32, NF4 4-bit quantization. Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj.
- Loss masking: Apply `train_on_responses_only` (TRL/Unsloth) to mask system and user tokens. The model should only learn to predict the shell command, not re-learn the prompt.
- Export: Merge adapters back into the base model, then export GGUF with llama.cpp via Unsloth's `save_pretrained_gguf`.
See train.py for the complete implementation and prepare.py for dataset formatting utilities.
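The QLoRA step above corresponds to a configuration roughly like this (a sketch under assumed `peft`/`transformers` APIs; the authoritative values are in train.py):

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 base quantization (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on all attention and MLP projections
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

Pass `bnb_config` when loading the base model and `lora_config` to the PEFT wrapper or SFTTrainer.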
```python
# pip install transformers torch
from transformers import pipeline

pipe = pipeline("text-generation", model="AryaYT/nl2shell-0.8b", device_map="auto")

SYSTEM = "You are an expert shell programmer. Given a natural language request, output ONLY the corresponding shell command. No explanations."

def bash(request: str) -> str:
    prompt = f"<|im_start|>system\n{SYSTEM}<|im_end|>\n<|im_start|>user\n{request}<|im_end|>\n<|im_start|>assistant\n"
    result = pipe(prompt, max_new_tokens=64, temperature=0.1, do_sample=True)
    full = result[0]["generated_text"]
    return full.split("<|im_start|>assistant\n")[-1].split("<|im_end|>")[0].strip()

print(bash("show all running Docker containers"))  # docker ps
print(bash("count lines in all Python files"))     # find . -name '*.py' | xargs wc -l
```

The model generates syntactically valid shell commands but does not verify correctness or safety. Always review the generated command before running it, especially when it involves `rm`, `kill`, `sudo`, `chmod`, or network operations. Commands involving paths, ports, or domain names will use placeholder values from training data — you must substitute your actual values. This is a research tool, not a production-grade shell assistant.
```
nl2shell/
├── prepare.py           # Dataset loading + ChatML formatting (IMMUTABLE)
├── train.py             # QLoRA fine-tuning script (editable)
├── benchmark.py         # charBLEU / template acc / exact match evaluation
├── build_v2_dataset.py  # Dataset pipeline: dedup, merge, push to HF
├── app.py               # Gradio demo (local + HF Spaces)
├── deploy-space.sh      # HF Space deployment script
├── experiment_gpu.py    # GPU memory profiling experiments
├── pyproject.toml       # uv project config, dependency groups
├── program.md           # Training constraints and budget
├── SPACE_README.md      # HuggingFace Spaces card
├── CLAUDE.md            # AI agent instructions
├── docs/
│   ├── TASK-PLAN.md     # T1-T9 dependency graph for project tasks
│   ├── RUNBOOK.md       # Copy-paste commands for training workflow
│   ├── paper-outline.md # ACL short paper structure
│   └── research/        # Prior art, benchmarks, neural memory notes
├── paper/
│   ├── nl2shell.tex     # ACL short paper (LaTeX)
│   └── nl2shell.bib     # Bibliography
└── notebooks/
    ├── train-v1.ipynb   # v1 training notebook (A100, 8,130 pairs)
    └── train-v2.ipynb   # v2 training notebook (H100, 11,894 pairs)
```
Contributions are welcome. Areas where help is most valuable:
- Benchmarking — run `benchmark.py` and open an issue with your results (hardware, model version, metrics)
- Dataset expansion — shell command pairs for domains not well covered: Kubernetes, AWS CLI, PowerShell, zsh-specific syntax
- Alternative architectures — fine-tuning comparisons on Llama-3.2-1B, Phi-3.5-mini, or Mistral-7B for the same task
- Evaluation harness — implementing IC-ALFA (execution accuracy via Docker sandbox) for apples-to-apples comparison with prior work
- Edge deployment — testing on Raspberry Pi 4/5, RISC-V boards, or other constrained hardware
To contribute:
```shell
git clone https://github.com/aryateja2106/nl2shell
cd nl2shell
uv sync --group dev
git checkout -b feat/your-feature

# Lint and format before submitting
ruff check .
ruff format .
ty check .
```

Open a pull request with a clear description of what changed and why.
If you use NL2Shell in your research, please cite:
```bibtex
@misc{nl2shell2026,
  title        = {NL2Shell: Natural Language to Shell Command Translation
                  with Hybrid DeltaNet Architecture},
  author       = {Arya Teja},
  year         = {2026},
  howpublished = {\url{https://github.com/aryateja2106/nl2shell}},
  note         = {Fine-tuned Qwen3.5-0.8B (QLoRA) on NL2Bash + macOS synthetic pairs.
                  Model: \url{https://huggingface.co/AryaYT/nl2shell-0.8b}},
}
```

| Resource | URL |
|---|---|
| Model (HuggingFace) | https://huggingface.co/AryaYT/nl2shell-0.8b |
| Dataset (HuggingFace) | https://huggingface.co/datasets/AryaYT/nl2shell-training |
| Demo (Gradio Space) | https://huggingface.co/spaces/AryaYT/nl2shell-demo |
| GitHub | https://github.com/aryateja2106/nl2shell |
| CloudAGI | https://cloudagi.ai |
| Author | https://github.com/aryateja2106 |
| Base model | https://huggingface.co/Qwen/Qwen3.5-0.8B |
| NL2Bash dataset | https://huggingface.co/datasets/GWHed/nl2bash |
MIT License. See LICENSE.
Built by Arya Teja as part of CloudAGI — Agent Credit Economy.