James Padolsey's Blog

2024-05-25

Intercepting LLM Streams for Improved Chat UX

I’ve been building LLM chat interfaces for a while now and wanted to share some weird methods I’ve been using to get finer-grained control over text streams.

As each token comes down on an HTTP stream (usually from an LLM cloud provider), I intercept it in Node.js, apply a bunch of transformations, and then forward it on to the client so it can render appropriately. Typically, there are four broad things I wish to do to the tokens before showing them to the user:

  • Intercept special markup or functions
  • Block bad stuff like jailbroken output or harmful material
  • Tell the client-side what's happening as it happens
  • Render custom things on the client-side (e.g. markdown)
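
To give a rough shape to all of that: the middleman ends up looking something like the sketch below. This assumes an Express-style route and an async iterable of already-decoded text tokens from the provider; interceptDeclarations, screenForHarm and getProviderTokenStream are placeholder names for the kinds of transformations discussed in this post, not real libraries.

// A minimal sketch of the Node.js middleman, assuming an Express-style
// route and an async iterable of decoded text tokens from the provider.
// interceptDeclarations(), screenForHarm() and getProviderTokenStream()
// are hypothetical stand-ins for the steps described in this post.

app.post('/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/plain; charset=utf-8');

  // However you obtain tokens from your LLM provider's streaming API
  const tokens = await getProviderTokenStream(req.body.messages);

  for await (const token of interceptDeclarations(screenForHarm(tokens))) {
    res.write(token); // forward each (possibly transformed) token to the client
  }

  res.end();
});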

First, let’s pretend we’ve told the LLM to do the following in its SYSTEM prompt:

When a user says "I want to send an email to [email protected]",
declare it in this custom format so that it can be
intercepted by middleware before being displayed:

§<email_form addr="[email protected]" text="hello there" />

You can output regular prose around the form, e.g.

"""
Yes absolutely, if you're happy with the
below email you can press send.

§<email_form
  addr="[email protected]"
  text="hello there, this is my email"
/>
"""

First you may ask: what on earth is §?

Well, it's just an arbitrary character that LLMs will be able to use, but highly unlikely to regurgitate in 99% of normal usage. It's harmless but helpful; we're only using it as an indicator that our declaration might be incoming. FWIW, § is actually a Section Sign. It's likely to be in training data but unlikely to be in everyday prose.

"Explain!" Ok, this is just a weird (but effective!) thing that I personally use to better delineate XML-like markup. For this strange purpose, it's important to pick characters that are common enough in LLM training sets to be outputtable but rare enough to avoid inunintended output. I don't want to block streams on "<", i.e. the beginning of an XML opening tag, because it's just too common and may lead to unpolished delays in the user-received stream.

Second you may ask: "why not use function-calling APIs?"

You're right. I could ask the LLM to give me structured JSON instructions which my middleware could then process. It'd surely save me from going through all this parsing mayhem, right?

Well, not really.

  1. Anecdotally, I have found function-calling way less deterministic and reliable than XML-like syntax (I have theories as to why).
  2. I want the LLM to form natural prose around the custom markup, not split things up computationally as that can affect the flow of meaning.
  3. Speed is a priority. Function-calling outputs are usually slower than regular streaming-completions.

So, assuming we're happy enough with § and the premise of custom XML-like declarations, we can move on.

Rough implementation

On each call to our LLM, as the HTTP stream comes down to us, we can do the following:

  1. Forward the stream until "§".
  2. When encountered, stop forwarding the stream.
  3. If not followed by <email... then continue forwarding the stream.
  4. If it is followed by <email..., then gather incoming tokens until '/>'.
  5. While gathering, output a "waiting token" like \uE006 to the client.
  6. Process/Filter stuff in-stream: e.g. validate the email address.
  7. When all gathered, the whole declaration can be sent as one to the client.

For (1), (2), (3) and (4), you can look at some JS here on GitHub which shows you how a stream might be temporarily blocked while delimited content is gathered (e.g. an HTML element or a custom declaration we've asked the LLM to produce).
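
In case that link ever dies, the rough shape is an async generator along these lines. To be clear, this is my own loose reconstruction rather than the linked code; validateDeclaration is a hypothetical hook for step (6), and \uE006 is the "please wait" codepoint explained below.

// A rough reconstruction of the gathering logic (not the linked code).
// `tokens` is an async iterable of text chunks from the LLM provider;
// validateDeclaration() is a hypothetical hook for step (6).
const WAIT = '\uE006'; // the "please wait" PUA codepoint (explained below)
const START = '§<email_form';
const END = '/>';

async function* gatherDeclarations(tokens, validateDeclaration) {
  let buffer = '';

  for await (const chunk of tokens) {
    buffer += chunk;

    while (buffer.length) {
      const start = buffer.indexOf('§');

      // (1) No delimiter in sight: forward everything we have.
      if (start === -1) {
        yield buffer;
        buffer = '';
        break;
      }

      // Forward anything sitting before the delimiter untouched.
      if (start > 0) {
        yield buffer.slice(0, start);
        buffer = buffer.slice(start);
      }

      // (2)/(3) We've hit a '§': does it still look like our declaration?
      if (!START.startsWith(buffer.slice(0, START.length))) {
        // Definitely not our markup: release the '§' and carry on.
        yield buffer[0];
        buffer = buffer.slice(1);
        continue;
      }

      // (4)/(5) Possibly our declaration: hold the stream, tell the client to wait.
      const end = buffer.indexOf(END);
      if (end === -1) {
        yield WAIT;
        break; // keep gathering on the next chunk
      }

      // (6)/(7) Whole declaration gathered: validate it, then emit it in one piece.
      const declaration = buffer.slice(0, end + END.length);
      buffer = buffer.slice(end + END.length);
      yield validateDeclaration ? validateDeclaration(declaration) : declaration;
    }
  }

  if (buffer) yield buffer; // flush any trailing, incomplete text
}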

For (5), we can use a PUA Unicode character (e.g. \uE006), and then the client can wait for this character and just keep displaying a specified loading state until it sees some other codepoint.

PUA, or "Private Use Area" is a range of codepoints in the Unicode spec that are designed for private usage and will not be assigned characters (at least not by the Unicode Consortium). This means they are extremely unlikely to be in normal LLM output, and even if they are, they won't constitute a useful part of the response. So we can use them however we like!

We could use a more richly defined indicator like a readable string: "[[Client:PleaseWait]]", or even just \n, but why risk ambiguity (conflicts with legitimate content) or use up bandwidth if single unique codepoints suffice? And we don't want to risk chunk fragmentation on our HTTP stream to the client. Single PUA codepoints just win! They are atomic, unique, tiny, and using them here is entirely on-spec! Also, PUAs, if they were to sneak into raw LLM output, can be wiped without worrying about the quality of the completions.
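
Wiping them is cheap, too, because the BMP's Private Use Area is one contiguous range (U+E000 to U+F8FF). Something like this, run over the raw LLM tokens before our own trusted indicators get added (the helper name is mine):

// Strip any PUA codepoints the model itself happens to emit, so the only
// ones reaching the client are the trusted indicators we add server-side.
const PUA = /[\uE000-\uF8FF]/g;

const sanitizeRawToken = (token) => token.replace(PUA, '');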


For what it's worth, for tiptap.chat I use a variety of codepoints to indicate specific types of states to the client. A bit like this:

const UNICODE_INDICATORS = {
  UNSUITABLE: '\uE000',
  UNRELATED: '\uE001',
  HARM: '\uE002',
  CONTAINS_FORM: '\uE003',
  NONSENSE: '\uE004',
  EVENT: '\uE005',
  WAIT: '\uE006'
  // etc.
};

This means the client just needs to keep a lookout for specific codepoints on the stream and can then enter states or render content as needed:

// E.g. Providing messages to the user in cases of possible harm,
// irrelevant or unsuitable content, or even jailbreaking attempts.

function optionallyRenderCustomMessage(content) {

  if (content.includes(UNICODE_INDICATORS.WAIT)) {
    return <Loader/>;
  }

  if (content.includes(UNICODE_INDICATORS.HARM)) {
    return <div>
      Your message is concerning. Please call the emergency services,
      or seek other help if possible. Click here for more info:
      <button></button>
    </div>;
  }

  if (content.includes(UNICODE_INDICATORS.UNSUITABLE)) {
    return <>
      Sorry, we can't help with that.
      See <a /> for more details.
    </>;
  }

  // etc.
  return null;
}
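
One last client-side detail: the indicators only exist to drive UI state, so before rendering the actual visible text you'd presumably strip them back out. A tiny sketch (the helper is hypothetical):

// Remove indicator codepoints before displaying the user-visible text;
// their presence drives the loaders/warnings above, but they shouldn't
// show up as characters in the rendered message itself.
const INDICATOR_PATTERN = new RegExp(
  `[${Object.values(UNICODE_INDICATORS).join('')}]`,
  'g'
);

function stripIndicators(content) {
  return content.replace(INDICATOR_PATTERN, '');
}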

The main takeaways: Consider creating richer LLM functionalities by intercepting the stream before it reaches the client. Also consider using PUA codepoints! They can act as a "secret" trusted channel of communication between your server and the client, existing alongside, but not being polluted by, the less-trusted LLM tokens.


thanks for reading!