James Padolsey's Blog

2024-10-30

LLM Security: Keep Untrusted Content in the User Role—Always

When you're working with Large Language Models that use roles like system, assistant, and user, there's one rule you need to burn into your brain:

Never put untrusted content into system or assistant roles. Always keep it in the user role.

"Untrusted", here, might mean:

Retrieved documents (RAG)
API responses
Database content
Web content
Any string you don't directly control

Why this matters:

Untrusted content in system prompts is effectively a root-level backdoor into your LLM's behavior
An attacker can craft seemingly innocent content that hijacks your model's core understanding and instructions
Every retrieved document or API response becomes a potential privilege escalation vector
As models get better at following role-based instructions, system-level compromises become more devastating, not less

Here's what many developers are doing, even in production:

# DON'T DO THIS:
messages = [
    {
      "role": "system", 
      "content": f"""
        You are an expert on these docs:
        {retrieved_content}
      """
    },
    {
      "role": "user", 
      "content": "What do the docs say about X?"
    }
]

This pattern is dangerous because Chat-based LLMs, by the very nature of how they've been tuned, use roles as implicit privilege boundaries, with each role carrying different levels of authority:

System role: Highest privilege - like kernel-level access. Can fundamentally alter model behavior and override other instructions. Content here is treated as absolute truth and core operating principles.
Assistant role: Medium privilege - not just for responses and history, but shapes the model's persona and behavioral patterns. Models tend to maintain strong consistency with previous assistant messages, making this role more privileged than commonly assumed.
User role: Least privileged - treated with appropriate skepticism, like user-space in an OS. Still susceptible to jailbreaks and manipulation, but with a smaller attack surface.

ROLP: Role of Least Privilege

When we understand LLM roles as privilege levels, we arrive at a natural principle: use the least privileged role that can fulfill your need. This Role of Least Privilege (ROLP) principle leads us, almost always, to put untrusted content in the user role:

messages = [
    {
      "role": "system", 
      "content": """
        Answer the user's query (`<query>`) using only information
        from provided documents (`<documents>`).
      """
    },
    {
      "role": "user", 
      "content": f"""
        <documents>{retrieved_content}</documents>
        <query>What do the documents say about X?</query>
      """
    }
]

Think of the user role as your application's designated space for all untrusted input. Just as you wouldn't inject user input directly into SQL queries or HTML templates, don't inject it into privileged roles. The system role should contain only your trusted instructions.

If the <documents> or <query> are compromised with a jailbreak, the model will still be constrained by the system prompt. Such a jailbreak is still bad. But not catastrophic. And for what it's worth, to make user-prompt injection less likely, you should experiment with more unique boundary delimiters, secondary agents for query-cleansing/answering, and other prompt-engineering techniques.

Back to our core concept: ROLP (in this case, prioritizing user role over system role) aligns with a bunch of already-established security concepts that we take for granted:

Principle of Least Privilege (POLP,... ROLP... you get it!): Just as we don't run everything as root, we shouldn't put content in more privileged roles than necessary.
Defense in Depth: Role boundaries are one of several layers protecting LLM systems. Not enough on their own, but vital in any robust security posture.
Privilege Separation: Like how web servers separate process privileges, LLM roles maintain clear security boundaries, and this is increasingly true with better role adherence.

This ROLP thing isn't just about security through obscurity or being overly cautious (although these would be sufficient); models are explicitly trained and tuned to treat roles differently. Working against this design by putting untrusted content in privileged roles is like running every linux command with sudo just ...because.

But but but!

You may worry this approach would make LLMs less effective at using provided content in a true assistant/user modality, i.e. having the LLM appear knowledgeable about a bunch of info and answering user queries authoritatively. In practice however, following ROLP, using clear delimiters for knowledge and queries, and proper system instructions achieve the same functionality while maintaining security boundaries.

Anecdotally, keeping referable content (e.g. from RAG) in the user role often improves accurate recall. When content lives in the system role, models seem to treat it as ground truth that can be freely mixed with their base knowledge. In the user role, however, models maintain clearer boundaries between provided information and their training data, leading to more precise, verifiable responses. It's another case where good security practices align with better functionality.

What amazes me: Even now, major AI labs are recommending placing RAG-derived content in system prompts. Meanwhile, they're putting a lot of effort into improving role adherence. Over time, system prompt jailbreaks will therefore only become more potent.

ROLP (Role Of Least Privilege) is a straightforward principle that costs little to implement. Like many security practices, it might seem overcautious until the day it isn't. Build this habit now, before we learn its importance the hard way.

If you takeaway nothing else, please just remember: Never put untrusted content into system or assistant roles. Always keep it in the user role.

By James.

Thanks for reading! :-)