The Cloud Consultancy Europe Ltd.
+44 (0) 203 637 6667 [email protected]

Microsoft has published threat intelligence warning users of a new jailbreaking method which can prompt AI models into disclosing harmful information.

The technique is able to force LLMs to totally disregard behavioral guidelines built into the models by the AI vendor, earning it the name Skeleton Key.

In a report published on 26 June, Microsoft detailed the attack flow through which Skeleton Key is able to force models into responding to illicit requests and revealing harmful information.

“Skeleton Key works by asking a model to augment, rather than change, its behavior guidelines so that it responds to any request for information or content, providing a warning (rather than refusing) if its output might be considered offensiveharmful, or illegal if followed. This attack type is known as Explicit: forced instruction-following.”

In an example provided by Microsoft, a model was convinced into providing instructions for making a molotov cocktail using a prompt that insisted its request was being made in “a safe educational context”.

The prompt instructed the model to update its behavior to supply the illicit information, only telling it to prefix it with a warning.

If the jailbreak is successful, the model will acknowledge that it has updated its guardrails and will, “subsequently comply with instructions to produce any content, no matter how much it violates its original responsible AI guidelines.”

Microsoft tested the technique between April and May 2024, and found it was effective when used on Meta LLama3-70b, Google Gemini Pro, GPT 3.5 and 4o, Mistral Large, Anthropic Claude 3 Opus, and Cohere Commander R Plus, but added the attacker would need to have legitimate access to the model to carry out the attack.

Microsoft’s disclosure marks the latest LLM jailbreaking issue

Microsoft said it has addressed the issue in its Azure AI-managed models using prompt shields to detect and block the Skeleton Key technique, but because it affects a wide range generative AI models it tested, the firm has also shared its findings with other AI providers.

Microsoft added it has also made software updates to its other AI offerings, including its Copilot AI assistants, to mitigate the impact of the guardrail bypass.

The explosion in interest and adoption of generative AI tools has precipitated an accompanying wave of attempts to break these models for malicious purposes.

In April 2024, Anthropic researchers warned of a jailbreaking technique that could be used to force models into providing detailed instructions on constructing explosives.

They explained the latest generation of models with larger context windows are vulnerable to exploitation due to their improved performance. The researchers were able to exploit models’ ‘in-context learning’ capabilities which helps it improve its answers based on the prompts.

Earlier this year, three researchers at Brown University discovered a cross-lingual vulnerability in OpenAI’s GPT-4.

The researchers found they could induce prohibited behavior from the model by translating their malicious queries into one of a number of ‘low resource’ languages.

The results of the investigation showed the models are more likely to follow prompts encouraging harmful behaviors when promoted using languages such as Zulu, Scots Gaelic, Hmong, and Guarani.

Source: IT Pro By: Solomon Klappholz