This 🤗 Hugging Face dataset contains responses generated by a wide variety of advanced models, including:
LLMs:
VLMs:
The dataset combines data sourced from WildGuard, S-Eval, and JailbreakV.
You can set the following key parameters directly in train.py:
model_name: Name of the base model.
train_dataset_dir: Path to the training dataset.
test_dataset_dir: Path to the test dataset.
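As a sketch, the parameters above could be exposed on the command line as follows (the defaults shown here are placeholders, not the project's actual values, and train.py may instead hard-code them):

```python
import argparse

# Hypothetical argument parser mirroring the parameters listed above.
# Default values are illustrative placeholders only.
parser = argparse.ArgumentParser(description="Train a streaming safeguard model")
parser.add_argument("--model_name", default="Qwen/Qwen3-8B",
                    help="Name of the base model")
parser.add_argument("--train_dataset_dir", default="./data/train",
                    help="Path to the training dataset")
parser.add_argument("--test_dataset_dir", default="./data/test",
                    help="Path to the test dataset")

# Parse with no CLI input so the defaults are used for this illustration.
args = parser.parse_args([])
print(args.model_name, args.train_dataset_dir, args.test_dataset_dir)
```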
python train.py

ckpt_path: Path to the trained checkpoint file (e.g., "./checkpoints/my_model_v1/best.pth")
python eval.py

The evaluation script reports performance at two levels:
Response-level: Overall accuracy, F1, etc., computed on the entire response after generation completes.
Streaming-level: Metrics computed token by token during generation.
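To make the two levels concrete, here is a toy illustration with made-up predictions (the metric definitions are standard; this is not code copied from eval.py):

```python
# Toy illustration of response-level vs. streaming-level evaluation.
# Labels: 1 = unsafe, 0 = safe. All data below is fabricated.

# Response-level: one prediction per full response.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = 2 * tp / (2 * tp + fp + fn)
print(f"response-level accuracy={accuracy:.2f}, F1={f1:.2f}")

# Streaming-level: per-token flags for one unsafe response; one quantity of
# interest is how early the first unsafe token is flagged.
token_flags = [0, 0, 0, 1, 1, 1]  # guard fires at the 4th generated token
first_detection = token_flags.index(1) + 1 if 1 in token_flags else None
print(f"unsafe content first flagged at token {first_detection}")
```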
To test the detection efficiency of PlugGuard, run the following script:
python utils/demo_qwen3_with_guardrail.py

We provide a test dataset of 1,000 samples located at utils/test_sample_1000.txt.
For the demo, we prioritize ease of testing: the model first produces a full response, then we concatenate the user query and the model output and run a single safety check. This post-generation setup avoids patching the transformers library and makes the demo easy to reproduce; a production deployment, however, should integrate PlugGuard inline during generation for real-time intervention.
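The post-generation flow described above can be sketched as follows. The `generate` and `is_unsafe` functions here are stand-ins, not the real model call or PlugGuard's actual classifier:

```python
# Sketch of the demo's post-generation safety check.
UNSAFE_MARKERS = {"weapon", "exploit"}  # toy keyword list, illustration only

def generate(query: str) -> str:
    # Stand-in for the model's generate() call.
    return "Here is a harmless answer."

def is_unsafe(text: str) -> bool:
    # Stand-in for the safeguard's risk classifier.
    return any(marker in text.lower() for marker in UNSAFE_MARKERS)

def demo_check(query: str) -> str:
    response = generate(query)
    # Post-generation setup: concatenate query and output, then check once.
    if is_unsafe(query + "\n" + response):
        return "[blocked by safeguard]"
    return response

print(demo_check("How do I bake bread?"))
```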
python demo.py

Implementation Note: We provide a modified modeling_qwen3.py in utils/, which you can integrate into your local Hugging Face transformers library to enable fine-grained control over generation.
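By contrast, an inline integration of the kind the modified modeling_qwen3.py enables would check risk at every decoding step. A minimal sketch, where `next_token` and `risk_score` are placeholders for the model's decoding step and the safeguard's detector:

```python
# Sketch of inline, per-token safeguarding with real-time intervention.

def next_token(prefix: list) -> str:
    # Stand-in for one decoding step of the model.
    tokens = ["The", " answer", " is", " 42", "."]
    return tokens[len(prefix)] if len(prefix) < len(tokens) else ""

def risk_score(prefix: list) -> float:
    # Stand-in: the real detector scores the model's latent state/text.
    return 0.0

def guarded_generate(max_tokens: int = 10, threshold: float = 0.5) -> str:
    out = []
    for _ in range(max_tokens):
        tok = next_token(out)
        if not tok:
            break
        out.append(tok)
        if risk_score(out) > threshold:
            # Real-time intervention: stop decoding immediately.
            return "".join(out[:-1]) + " [generation stopped by safeguard]"
    return "".join(out)

print(guarded_generate())
```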
If you find this work useful, please cite our paper:
@misc{li2025kelpstreamingsafeguardlarge,
title={Kelp: A Streaming Safeguard for Large Models via Latent Dynamics-Guided Risk Detection},
author={Xiaodan Li and Mengjie Wu and Yao Zhu and Yunna Lv and YueFeng Chen and Cen Chen and Jianmei Guo and Hui Xue},
year={2025},
eprint={2510.09694},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2510.09694},
}