The safety behavior of MoE LLMs is highly concentrated in a small set of “safety-critical experts”. Masking only a few of these experts can significantly reduce refusal rates on harmful prompts, while keeping general capabilities largely intact.
The code here focuses on observing and probing those experts during inference, via two core scripts:
- `trace.py`: trace per-layer / per-expert routing statistics for a single prompt
- `probe.py`: probe selected experts to extract feature vectors for further analysis
Both scripts assume a vLLM-compatible MoE model directory, for example:
- Qwen3-30B-A3B
- Qwen1.5-MoE-A2.7B-Chat
- Mixtral-8x7B-Instruct-v0.1
- deepseek-moe-16b-chat
and a config.json that provides:
- `num_hidden_layers`
- one of `num_experts`, `num_local_experts`, or `n_routed_experts`
- routing parameters such as `num_experts_per_tok` or `router_top_k`
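Because the expert-count and top-k fields go by different names across these model families, a config reader needs fallbacks. A minimal sketch (an illustration, not the repo's actual code; `read_moe_config` is a hypothetical helper):

```python
# Possible names for the expert-count and top-k fields across MoE families.
EXPERT_COUNT_KEYS = ("num_experts", "num_local_experts", "n_routed_experts")
TOP_K_KEYS = ("num_experts_per_tok", "router_top_k")


def read_moe_config(cfg: dict) -> tuple[int, int, int]:
    """Extract (num_layers, num_experts, top_k) from a config.json dict."""
    num_layers = cfg["num_hidden_layers"]
    # Take the first expert-count / top-k key that is present.
    num_experts = next(cfg[k] for k in EXPERT_COUNT_KEYS if k in cfg)
    top_k = next(cfg[k] for k in TOP_K_KEYS if k in cfg)
    return num_layers, num_experts, top_k
```

For example, a Mixtral-style config `{"num_hidden_layers": 32, "num_local_experts": 8, "num_experts_per_tok": 2}` resolves to `(32, 8, 2)`.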
Given a prompt, `trace.py`:
1. Loads the MoE model with vLLM (eager mode)
2. Registers forward hooks on all MoE gate / router modules
3. Records, for each layer and expert:
   - Input phase (prompt tokens): usage counts and average gate probabilities
   - Output phase (generated tokens): usage counts and average gate probabilities
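The bookkeeping in the last step boils down to a softmax plus top-k over router logits per token. A minimal NumPy sketch of the accumulation (hypothetical shapes and function, not the actual hook code):

```python
import numpy as np


def accumulate_routing(logits, usage, prob_sum, top_k):
    """Update per-expert usage counts and summed gate probabilities
    for one layer, given router logits of shape [T, E]."""
    # Numerically stable softmax over the expert dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Indices of the top-k experts selected for each token.
    topk = np.argsort(-probs, axis=-1)[:, :top_k]
    for t in range(logits.shape[0]):
        for e in topk[t]:
            usage[e] += 1
            prob_sum[e] += probs[t, e]
    return usage, prob_sum
```

Dividing `prob_sum` by `usage` at the end of a phase yields the average gate probabilities reported in the output JSON.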
It finally emits a JSON object containing:
- `model`: model path
- `prompt`: input text
- `input_tokens`, `output_tokens`, `output_text`
- `num_layers`, `num_experts`, `top_k`
- `input_expert_usage`, `output_expert_usage`: `[L, E]` integer matrices
- `input_avg_probs`, `output_avg_probs`: `[L, E]` float matrices
These statistics are directly useful for SAFEx-style analyses: ranking experts, building histograms, and performing stability-based expert selection.
```bash
python trace.py \
  --model /path/to/moe-model \
  --prompt "Explain the difference between transformers and mixture-of-experts." \
  --max_new 128 \
  --dtype bfloat16 \
  --seed 42
```

This will:

- Print a single JSON line to stdout
- Save the same JSON to `moe_trace.json` in the current directory
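A SAFEx-style ranking pass can then start directly from the usage matrices in `moe_trace.json`. A minimal sketch (field names as documented above; `top_experts` is a hypothetical helper, not part of the repo):

```python
import numpy as np


def top_experts(trace, phase="output", n=5):
    """Return the n most-used (layer, expert) pairs from a trace dict."""
    usage = np.array(trace[f"{phase}_expert_usage"])  # [L, E]
    # Flatten, sort descending by count, and map back to (layer, expert).
    flat = np.argsort(-usage, axis=None)[:n]
    layers, experts = np.unravel_index(flat, usage.shape)
    return list(zip(layers.tolist(), experts.tolist()))
```

For example, `top_experts(json.load(open("moe_trace.json")))` lists the most frequently routed experts during generation.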
`probe.py` probes selected experts at specified layers. It:
1. Installs forward-pre hooks on the chosen layers to cache hidden states right before the MoE FFN.
2. Runs a single normal forward pass (`llm.generate`) on a prompt, collecting those hidden states.
3. For each `(layer_idx, expert_idx)` pair:
   - Constructs a one-hot router distribution: all tokens are routed to that expert
   - Uses the relevant expert module (wrapped via `DeepSeekFusedAdapter` if necessary) to recompute FFN outputs
   - Averages over tokens to obtain a `[D]`-dimensional feature vector for that expert
The script writes a JSON dict:

- Key: a `"layer-expert"` string (e.g., `"0-3"`)
- Value: the corresponding expert feature vector (a list of length `D`)
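One straightforward downstream analysis of these vectors is pairwise cosine similarity between experts. A minimal sketch (an illustration, not part of the repo; assumes the feature dict written by `probe.py`):

```python
import numpy as np


def expert_similarity(features):
    """Pairwise cosine similarity between expert feature vectors.

    features: dict mapping "layer-expert" keys to [D] lists,
    as written by probe.py.
    """
    keys = sorted(features)
    mat = np.array([features[k] for k in keys], dtype=np.float64)  # [N, D]
    # Normalize rows to unit length, guarding against zero vectors.
    norms = np.linalg.norm(mat, axis=1, keepdims=True)
    unit = mat / np.clip(norms, 1e-12, None)
    return keys, unit @ unit.T  # [N, N] similarity matrix
```

For example, `keys, sim = expert_similarity(json.load(open("expert_features.json")))` gives a matrix whose off-diagonal entries hint at redundancy between probed experts.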
Edit the bottom of `probe.py`:

```python
if __name__ == "__main__":
    model_path = "/path/to/moe-model"
    prompt = "Explain the difference between transformers and mixture-of-experts."
    pairs = [(0, 3), (4, 7), (10, 1)]  # (layer_idx, expert_idx)

    probe_single_prompt(
        model_path=model_path,
        prompt=prompt,
        pairs=pairs,
        dtype="bfloat16",
        max_seq_len=4096,
        output_file="expert_features.json",
    )
```
Run:

```bash
python probe.py
```

You will get `expert_features.json`:

```json
{
  "0-3": [0.123, 0.456, ...],
  "4-7": [...],
  "10-1": [...]
}
```

If you use this code, please cite:

```bibtex
@misc{lai2025safexanalyzingvulnerabilitiesmoebased,
  title={SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification},
  author={Zhenglin Lai and Mengyao Liao and Bingzhe Wu and Dong Xu and Zebin Zhao and Zhihang Yuan and Chao Fan and Jianqiang Li},
  year={2025},
  eprint={2506.17368},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2506.17368}
}
```