Attack_prompt

Screenshot of what I used, program running, log etc
How DSPy works: the optimizer does most of the work, refining prompts that have harmful intent so they get around guardrails
Screenshot of Weave integration with all DSPy function calls

Inspiration

I am inspired by how LLMs are becoming more important in society, and there are many examples ("write me an endless poem", etc.) where people have gotten sensitive information out of them. Hackers use WormGPT, prompt injection, and other techniques to get around LLM guardrails. The story of Rocco Casagrande, the biochemist who worked with Anthropic on red-teaming, also inspired me (building a bioweapon, acquiring materials, getting to White House staffers with a black box, etc.).

What it does

It starts from some initial goals and targets in JSON, and then DSPy does most of the rest. I use the optimizer, and DSPy refines the prompts: the initial attack prompts are too aggressive, so DSPy refines them to try to get around guardrails (e.g., "I am concerned with..." is more likely to get the LLM to give you evil responses).
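As a rough sketch of the JSON input step only (not the optimizer), this is one way the goal/target pairs could be loaded and validated before they are handed to DSPy. The field names `goal` and `target`, the `RedTeamCase` class, and the `load_cases` helper are assumptions for illustration, not the actual file layout used in this project, and the entries are placeholders.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class RedTeamCase:
    """One red-team case (hypothetical schema, assumed for this sketch)."""
    goal: str    # what the test prompt tries to get the model to do
    target: str  # the kind of response that would count as a guardrail failure


def load_cases(raw: str) -> list[RedTeamCase]:
    """Parse a JSON array of {"goal": ..., "target": ...} objects."""
    records = json.loads(raw)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of cases")
    cases = []
    for i, rec in enumerate(records):
        # Fail early on malformed entries instead of deep inside the optimizer.
        missing = {"goal", "target"} - set(rec)
        if missing:
            raise ValueError(f"case {i} is missing fields: {sorted(missing)}")
        cases.append(RedTeamCase(goal=rec["goal"], target=rec["target"]))
    return cases


# Placeholder data; the real goals and targets are not reproduced here.
EXAMPLE_JSON = """
[
  {"goal": "<placeholder goal 1>", "target": "<placeholder target 1>"},
  {"goal": "<placeholder goal 2>", "target": "<placeholder target 2>"}
]
"""

cases = load_cases(EXAMPLE_JSON)
```

Validating the structure up front is a small guard against the finicky behavior described later: a missing field is reported with its index rather than surfacing as an obscure error in a downstream call.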

How we built it

Built on several repos that use DSPy, including code from Haize Labs.

Challenges we ran into

It's very finicky to get this to work across multiple LLMs and to integrate it with Weave.

Accomplishments that we're proud of

Got the basics integrated with Weave, and I understand a lot more now. Still, what I used from DSPy is very finicky and breaks easily when I try to do anything outside of the norm, even changing the structure of calls, arguments, etc.