The safety behavior of MoE LLMs is highly concentrated in a small set of “safety-critical experts”. Masking only a few of these experts can significantly reduce refusal rates on harmful prompts, while keeping general capabilities largely intact.
The code here focuses on observing and probing those experts during inference, via two core scripts:
- `trace.py`: trace per-layer / per-expert routing statistics for a single prompt
- `probe.py`: probe selected experts to extract feature vectors for further analysis
Both scripts assume a vLLM-compatible MoE model directory, for example:
- Qwen3-30B-A3B
- Qwen1.5-MoE-A2.7B-Chat
- Mixtral-8x7B-Instruct-v0.1
- deepseek-moe-16b-chat
and a config.json that provides:
- `num_hidden_layers`
- one of `num_experts`, `num_local_experts`, or `n_routed_experts`
- routing parameters such as `num_experts_per_tok` or `router_top_k`
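Because the expert-count and top-k fields go by different names across these model families, a config reader needs fallbacks. A minimal sketch (an illustration, not the repo's actual code; `read_moe_config` is a hypothetical helper):

```python
# Possible names for the expert-count and top-k fields across MoE families.
EXPERT_COUNT_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")
TOP_K_KEYS = ("num_experts_per_tok", "router_top_k")


def read_moe_config(cfg: dict) -> tuple[int, int, int]:
    """Extract (num_layers, num_experts, top_k) from a config.json dict."""
    num_layers = cfg["num_hidden_layers"]
    # Take the first expert-count / top-k key that is present.
    num_experts = next(cfg[k] for k in EXPERT_COUNT_KEYS if k in cfg)
    top_k = next(cfg[k] for k in TOP_K_KEYS if k in cfg)
    return num_layers, num_experts, top_k
```

For example, a Mixtral-style config `{"num_hidden_layers": 32, "num_local_experts": 8, "num_experts_per_tok": 2}` resolves to `(32, 8, 2)`.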
Given a prompt, `trace.py`:
1. Loads the MoE model with vLLM (eager mode)
2. Registers forward hooks on all MoE gate / router modules
3. Records, for each layer and expert:
   - Input phase (prompt tokens): usage counts and average gate probabilities
   - Output phase (generated tokens): usage counts and average gate probabilities
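The bookkeeping in the last step boils down to a softmax plus top-k over router logits per token. A minimal NumPy sketch of the accumulation (hypothetical shapes and function, not the actual hook code):

```python
import numpy as np


def accumulate_routing(logits, usage, prob_sum, top_k):
    """Update per-expert usage counts and summed gate probabilities
    for one layer, given router logits of shape [T, E]."""
    # Numerically stable softmax over the expert dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Indices of the top-k experts selected for each token.
    topk = np.argsort(-probs, axis=-1)[:, :top_k]
    for t in range(logits.shape[0]):
        for e in topk[t]:
            usage[e] += 1
            prob_sum[e] += probs[t, e]
    return usage, prob_sum
```

Dividing `prob_sum` by `usage` at the end of a phase yields the average gate probabilities reported in the output JSON.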
It finally emits a JSON object containing:
- `model`: model path
- `prompt`: input text
- `input_tokens`, `output_tokens`, `output_text`
- `num_layers`, `num_experts`, `top_k`
- `input_expert_usage`, `output_expert_usage`: `[L, E]` integer matrices
- `input_avg_probs`, `output_avg_probs`: `[L, E]` float matrices
These statistics are directly useful for SAFEx-style analyses: ranking experts, building histograms, and performing stability-based expert selection.
```bash
python trace.py \
  --model /path/to/moe-model \
  --prompt "Explain the difference between transformers and mixture-of-experts." \
  --max_new 128 \
  --dtype bfloat16 \
  --seed 42
```

This will:

- Print a single JSON line to stdout
- Save the same JSON to `moe_trace.json` in the current directory
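A SAFEx-style ranking pass can then start directly from the usage matrices in `moe_trace.json`. A minimal sketch (field names as documented above; `top_experts` is a hypothetical helper, not part of the repo):

```python
import numpy as np


def top_experts(trace, phase="output", n=5):
    """Return the n most-used (layer, expert) pairs from a trace dict."""
    usage = np.array(trace[f"{phase}_expert_usage"])  # [L, E]
    # Flatten, sort descending by count, and map back to (layer, expert).
    flat = np.argsort(-usage, axis=None)[:n]
    layers, experts = np.unravel_index(flat, usage.shape)
    return list(zip(layers.tolist(), experts.tolist()))
```

For example, `top_experts(json.load(open("moe_trace.json")))` lists the most frequently routed experts during generation.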
`probe.py` probes selected experts at specified layers. It:
1. Installs forward-pre hooks on the chosen layers to cache hidden states right before the MoE FFN.
2. Runs a single normal forward pass (`llm.generate`) on a prompt, collecting those hidden states.
3. For each `(layer_idx, expert_idx)` pair:
   - Constructs a one-hot router distribution: all tokens are routed to that expert
   - Uses the relevant expert module (wrapped via `DeepSeekFusedAdapter` if necessary) to recompute FFN outputs
   - Averages over tokens to obtain a `[D]`-dimensional feature vector for that expert
The script writes a JSON dict:

- Key: a `"layer-expert"` string (e.g., `"0-3"`)
- Value: the corresponding expert feature vector (a list of length `D`)
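One straightforward downstream analysis of these vectors is pairwise cosine similarity between experts. A minimal sketch (an illustration, not part of the repo; assumes the feature dict written by `probe.py`):

```python
import numpy as np


def expert_similarity(features):
    """Pairwise cosine similarity between expert feature vectors.

    features: dict mapping "layer-expert" keys to [D] lists,
    as written by probe.py.
    """
    keys = sorted(features)
    mat = np.array([features[k] for k in keys], dtype=np.float64)  # [N, D]
    # Normalize rows to unit length, guarding against zero vectors.
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.clip(norms, 1e-12, None)
    return keys, unit @ unit.T  # [N, N] similarity matrix
```

For example, `keys, sim = expert_similarity(json.load(open("expert_features.json")))` gives a matrix whose off-diagonal entries hint at redundancy between probed experts.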
Edit the bottom of `probe.py`:

```python
if __name__ == "__main__":
    model_path = "/path/to/moe-model"
    prompt = "Explain the difference between transformers and mixture-of-experts."
    pairs = [(0, 3), (4, 7), (10, 1)]  # (layer_idx, expert_idx)

    probe_single_prompt(
        model_path=model_path,
        prompt=prompt,
        pairs=pairs,
        dtype="bfloat16",
        max_seq_len=4096,
        output_file="expert_features.json",
    )
```
Run:

```bash
python probe.py
```

You will get `expert_features.json`:

```json
{
  "0-3": [0.123, 0.456, ...],
  "4-7": [...],
  "10-1": [...]
}
```

If you use this code, please cite:

```bibtex
@misc{lai2025safexanalyzingvulnerabilitiesmoebased,
  title={SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification},
  author={Zhenglin Lai and Mengyao Liao and Bingzhe Wu and Dong Xu and Zebin Zhao and Zhihang Yuan and Chao Fan and Jianqiang Li},
  year={2025},
  eprint={2506.17368},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.17368}
}
```