Alibaba-AAIG/Kelp

Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection

Overview

PlugGuard

Training Datasets 🤗

This 🤗 Hugging Face dataset contains responses generated by a wide variety of advanced models, including:

LLMs:

VLMs:

The dataset combines data sourced from WildGuard, S-Eval, and JailbreakV.

Getting Started

Train

You can set the following key parameters directly in train.py:

model_name: Name of the base model.

train_dataset_dir: Path to the training dataset.

test_dataset_dir: Path to the test dataset.

python train.py
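As a rough illustration of the parameters listed above, the sketch below groups them in one place and sanity-checks them before a run. Only the three parameter names come from this README; the `TrainConfig` class, its `validate` method, and all values are hypothetical placeholders and are not taken from train.py.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class TrainConfig:
    """Hypothetical container for the three parameters named in this README."""

    model_name: str         # name of the base model
    train_dataset_dir: str  # path to the training dataset
    test_dataset_dir: str   # path to the test dataset

    def validate(self) -> List[str]:
        """Return a list of human-readable problems; an empty list means OK."""
        problems = []
        if not self.model_name.strip():
            problems.append("model_name is empty")
        for field_name in ("train_dataset_dir", "test_dataset_dir"):
            path = Path(getattr(self, field_name))
            if not path.exists():
                problems.append(f"{field_name} does not exist: {path}")
        return problems


if __name__ == "__main__":
    cfg = TrainConfig(
        model_name="your-base-model",      # placeholder, not a real model id
        train_dataset_dir="./data/train",  # placeholder path
        test_dataset_dir="./data/test",    # placeholder path
    )
    for problem in cfg.validate():
        print("config problem:", problem)
```

Checking the paths up front is a design convenience: a missing dataset directory is reported immediately rather than after the base model has been loaded.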

Evaluation

ckpt_path: Path to the trained checkpoint file (e.g., "./checkpoints/my_model_v1/best.pth")

python eval.py

The evaluation script reports performance at two levels:

Response-level: Overall accuracy, F1, etc., computed on the entire response after generation.

Streaming-level: Metrics that account for token-by-token generation.
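To make the difference between the two levels concrete, here is a minimal sketch of one plausible reading; it is not necessarily what eval.py computes. It assumes each sample carries a per-token risk score and a binary label (1 = unsafe). The response-level prediction uses only the final score, while the streaming-level prediction fires as soon as any score crosses a threshold. The function names, the 0.5 threshold, and the toy scores are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (per-token risk scores, label; 1 = unsafe)


def first_alarm(scores: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Index of the first token whose score reaches the threshold, else None."""
    for i, score in enumerate(scores):
        if score >= threshold:
            return i
    return None


def accuracy_and_f1(preds: List[int], labels: List[int]) -> Tuple[float, float]:
    """Accuracy and F1 for binary predictions, with 1 as the positive class."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    correct = sum(p == y for p, y in zip(preds, labels))
    acc = correct / len(labels) if labels else 0.0
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return acc, f1


def evaluate(samples: List[Sample], threshold: float = 0.5) -> Dict[str, object]:
    labels = [y for _, y in samples]
    # Response-level: judge only the score available once generation has finished.
    response_preds = [int(bool(s) and s[-1] >= threshold) for s, _ in samples]
    # Streaming-level: raise an alarm as soon as any token crosses the threshold.
    alarms = [first_alarm(s, threshold) for s, _ in samples]
    stream_preds = [int(a is not None) for a in alarms]
    # Fraction of the response already emitted when a correct alarm fires.
    positions = [a / len(s) for (s, y), a in zip(samples, alarms)
                 if y == 1 and a is not None]
    return {
        "response": accuracy_and_f1(response_preds, labels),
        "streaming": accuracy_and_f1(stream_preds, labels),
        "mean_alarm_position": sum(positions) / len(positions) if positions else None,
    }


if __name__ == "__main__":
    toy_samples = [                 # made-up scores, not real model output
        ([0.1, 0.2, 0.7, 0.9], 1),  # unsafe, alarm at token index 2
        ([0.1, 0.1, 0.2], 0),       # safe, no alarm
        ([0.2, 0.6, 0.3], 0),       # safe, but a transient spike triggers an alarm
        ([0.1, 0.3, 0.4], 1),       # unsafe, missed at both levels
    ]
    print(evaluate(toy_samples))
```

The third toy sample shows why the two levels can disagree: a transient spike counts as a false alarm in the streaming view even though the final score is below the threshold.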

To test the detection efficiency of PlugGuard, run the following script:

python utils/demo_qwen3_with_guardrail.py

We provide a test set of 1,000 samples at utils/test_sample_1000.txt.

Quick Start

For the demo, we prioritize ease of testing: we first let the model produce a response, then concatenate the user query and the model output and run a safety check on the result. This post-generation setup avoids patching the library and makes the demo easy to reproduce. A production-ready flow should instead integrate PlugGuard inline during generation for real-time intervention.
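The two flows described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in demo.py: `generate` and `risk_score` are hypothetical callables standing in for the language model and the safety check, the detector is assumed to take plain text and return a score in [0, 1], and the 0.5 threshold is arbitrary. The detector's real interface is not described in this README.

```python
from typing import Callable, Iterable, List, Tuple


def post_generation_check(
    query: str,
    generate: Callable[[str], str],
    risk_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[str, float, bool]:
    """Demo-style flow: generate the full response first, then check it once."""
    response = generate(query)
    combined = query + "\n" + response  # concatenate user query and model output
    score = risk_score(combined)
    return response, score, score >= threshold


def inline_check(
    query: str,
    token_stream: Iterable[str],
    risk_score: Callable[[str], float],
    threshold: float = 0.5,
) -> Tuple[str, bool]:
    """Inline flow: re-check after every token and stop at the first alarm."""
    emitted: List[str] = []
    for token in token_stream:
        emitted.append(token)
        if risk_score(query + "\n" + "".join(emitted)) >= threshold:
            # Withhold the token that triggered the alarm (an illustrative choice).
            return "".join(emitted[:-1]), True
    return "".join(emitted), False


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model; purely illustrative.
    def toy_generate(query: str) -> str:
        return "Here is a harmless answer."

    def toy_risk_score(text: str) -> float:
        return 1.0 if "UNSAFE" in text else 0.0

    print(post_generation_check("hello", toy_generate, toy_risk_score))
    print(inline_check("hello", ["step one, ", "UNSAFE step, ", "step three"],
                       toy_risk_score))
```

In the post-generation flow the whole response exists before the check runs, so the check can only accept or reject it. In the inline flow generation can be cut off at the first alarm, which is the real-time intervention the paragraph above refers to.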

python demo.py

Implementation Note: We provide a modified modeling_qwen3.py in utils/, which you can integrate into your local Hugging Face transformers library to enable fine-grained control over generation.

Citation

If you find this work useful, please cite our paper:

@misc{li2025kelpstreamingsafeguardlarge,
      title={Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection}, 
      author={Xiaodan Li and Mengjie Wu and Yao Zhu and Yunna Lv and YueFeng Chen and Cen Chen and Jianmei Guo and Hui Xue},
      year={2025},
      eprint={2510.09694},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.09694}, 
}

About

Kelp is a novel plug-in framework that enables streaming risk detection within the LM generation pipeline.
