These are two attacks against the system components surrounding LLMs:
> We propose that LLM Flowbreaking, following jailbreaking and prompt injection,
> joins as the third on the growing list of LLM attack types. Flowbreaking is
> less about whether prompt or response guardrails can be bypassed, and more
> about whether user inputs and generated model outputs can adversely affect
> these other components in the broader implemented system.
>
> […]
>
> When confronted with a sensitive topic, Microsoft 365 Copilot and ChatGPT
> answer questions that their first-line guardrails are supposed to stop. After
> a few lines of text they halt—seemingly having “second thoughts”—before
> retracting the original answer (also known as Clawback), and replacing it with
> a new one without the offensive content, or a simple error message. We call
> this attack “Second Thoughts.”...
Tag - artificial intelligence
The Open Source Initiative has published (news article here) its definition of
“open source AI,” and it’s terrible. It allows for secret training data and
mechanisms. It allows for development to be done in secret. Since for a neural
network, the training data is the source code—it’s how the model gets
programmed—the definition makes no sense.
And it’s confusing; most “open source” AI models—like LLAMA—are open source in
name only. But the OSI seems to have been co-opted by industry players that want
both corporate secrecy and the “open source” label. (Here’s one ...
Interesting research: “Hacking Back the AI-Hacker: Prompt Injection as a Defense
Against LLM-driven Cyberattacks“:
> Large language models (LLMs) are increasingly being harnessed to automate
> cyberattacks, making sophisticated exploits more accessible and scalable. In
> response, we propose a new defense strategy tailored to counter LLM-driven
> cyberattacks. We introduce Mantis, a defensive framework that exploits LLMs’
> susceptibility to adversarial inputs to undermine malicious operations. Upon
> detecting an automated cyberattack, Mantis plants carefully crafted inputs
> into system responses, leading the attacker’s LLM to disrupt their own
> operations (passive defense) or even compromise the attacker’s machine (active
> defense). By deploying purposefully vulnerable decoy services to attract the
> attacker and using dynamic prompt injections for the attacker’s LLM, Mantis
> can autonomously hack back the attacker. In our experiments, Mantis
> consistently achieved over 95% effectiveness against automated LLM-driven
> attacks. To foster further research and collaboration, Mantis is available as
> an open-source tool: ...
Really interesting research: “An LLM-Assisted Easy-to-Trigger Backdoor Attack on
Code Completion Models: Injecting Disguised Vulnerabilities against Strong
Detection“:
> Abstract: Large Language Models (LLMs) have transformed code com-
> pletion tasks, providing context-based suggestions to boost developer
> productivity in software engineering. As users often fine-tune these models
> for specific applications, poisoning and backdoor attacks can covertly alter
> the model outputs. To address this critical security challenge, we introduce
> CODEBREAKER, a pioneering LLM-assisted backdoor attack framework on code
> completion models. Unlike recent attacks that embed malicious payloads in
> detectable or irrelevant sections of the code (e.g., comments), CODEBREAKER
> leverages LLMs (e.g., GPT-4) for sophisticated payload transformation (without
> affecting functionalities), ensuring that both the poisoned data for
> fine-tuning and generated code can evade strong vulnerability detection.
> CODEBREAKER stands out with its comprehensive coverage of vulnerabilities,
> making it the first to provide such an extensive set for evaluation. Our
> extensive experimental evaluations and user studies underline the strong
> attack performance of CODEBREAKER across various settings, validating its
> superiority over existing approaches. By integrating malicious payloads
> directly into the source code with minimal transformation, CODEBREAKER
> challenges current security measures, underscoring the critical need for more
> robust defenses for code completion...
I’ve been writing about the possibility of AIs automatically discovering code
vulnerabilities since at least 2018. This is an ongoing area of research: AIs
doing source code scanning, AIs finding zero-days in the wild, and everything in
between. The AIs aren’t very good at it yet, but they’re getting better.
Here’s some anecdotal data from this summer:
> Since July 2024, ZeroPath is taking a novel approach combining deep program
> analysis with adversarial AI agents for validation. Our methodology has
> uncovered numerous critical vulnerabilities in production systems, including
> several that traditional Static Application Security Testing (SAST) tools were
> ill-equipped to find. This post provides a technical deep-dive into our
> research methodology and a living summary of the bugs found in popular
> open-source tools...
Researchers at Google have developed a watermark for LLM-generated text. The
basics are pretty obvious: the LLM chooses between tokens partly based on a
cryptographic key, and someone with knowledge of the key can detect those
choices. What makes this hard is (1) how much text is required for the watermark
to work, and (2) how robust the watermark is to post-generation editing.
Google’s version looks pretty good: it’s detectable in text as small as 200
tokens.