Streaming AI responses is hard, and...rare?
The first and only explanation of how to handle streaming responses from AI Models properly, and why it works.
When I started building Magicdoor, I knew that I had to make sure that it was fast. If you’re chatting with AI, for it to feel natural, you can’t be looking at a loading animation for 10-20 seconds at a time. One important element to this is to let the answer appear on screen as it is being generated by the model. Anyone who has used any AI chat interface knows what I mean; it looks like the AI is typing while you watch. I learned quickly that this behavior is called ‘streaming’. But what I didn’t know was that wrangling with streams would take many, many, many hours and days of my life.
In January, I went through a sort of involuntary feature freeze with Magicdoor because I decided I should build a ‘canvas’, or a way for the AI to put some of the content in the chat response, and other parts in a separate, editable document. To get that to work reliably across different models was a hellish puzzle. Once I got it to work with Claude, it would not work for GPT-4o, and then when I fixed the issues with GPT, something I did broke Claude again.
Two months ago, I decided I should rewrite the core part of the app to use the Vercel AI SDK, a bunch of code made by ‘real’ human programmers that should make it easier to build AI workflows. Again, streaming was the source of 90% of my struggles. If you go on YouTube or Reddit, there are virtually no tutorials on streaming. A couple of AI developers I asked about it had limited experience with streaming too. It seems like this is not a super common thing to try to do.
Because of the limited examples, my AI coder buddies in Cursor are also extremely bad at anything that has to do with streaming. More often than not, when faced with an issue related to streaming, they will try to just turn it off. And yes, that will solve the problems, but it’s not an acceptable solution.
I’m going to explain some programming, but it will be easy to follow even if you have no experience with code at all.
The problem with streaming
When you send a question to an AI model, it will return the answer as a bunch of text in a specific format called JSON. It might look like this:
SEND THIS TO THE MODEL:
const response = await chat.completions.create({
model: 'awesome_model_5',
messages: [{"role": "user", "content": "hello AI"}]
})
THE MODEL SENDS BACK:
{"id": "abc123",
"role": "assistant",
"content": [{"type":"text", "text": "Hello back!"}],
"usage": {"prompt_tokens": 2, "output_tokens": 2}}

This is all pretty straightforward. Even if you have never seen code before, try to read the above and I guarantee you will be able to follow it. As long as you don’t stream this response, putting it onto someone’s screen is very easy. Just write some code that looks for "text": and then puts whatever comes between the "quotes" after it onto the front-end.
The AI’s reply text is in the "content" field of the response. We can access it easily and store it in its own new variable (a variable is like a cell in Excel).
// Extract the text content into a variable called aiMessage:
const aiMessage = response.content[0].text;
We now have a nice variable that contains just the response text, i.e. aiMessage = "Hello back!"
So far so good, now let’s see what happens when we get the AI to stream this.
THE MODEL SENDS BACK THE RESPONSE AS A STREAM OF TINY CHUNKS:
1: {"
2: id"
3: : "
4: abc
5: 123
6: ",
7: "ro
8: le"
9: : "
10: ass
11: ist
12: ant"
13: ,
14: "con
15: ten
16: t":
17: [{"
18: typ
19: e":
20: "te
21: xt"
22: , "
23: tex
24: t":
25: "Hel
26: lo
...ETC

Take a close look at chunks 22 through 24. That critical "text": bit that we used to extract the content from the response isn’t there in one piece; it’s split across multiple chunks. If we check each chunk as it arrives, the signal never appears in full, so it slips through undetected. We have to find a way to stream the text to the UI without ever holding fully formed JSON. To make matters worse, different AI model providers (OpenAI, Anthropic, etc.) use different chunk sizes and have other minor differences in how the chunks arrive. And then we also have to handle newer features like reasoning, images, and tool calls, all without breaking the stream.
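To see the problem concretely, here is a tiny sketch in TypeScript, using the same hypothetical chunks as above. Checking each chunk on its own never finds the "text": signal, while checking an accumulating buffer does:

```typescript
// Hypothetical chunks, as in the example above: the "text": signal
// is split across chunk boundaries.
const chunks = [
  '{"con', 'ten', 't":', '[{"', 'typ', 'e":', '"te', 'xt"',
  ', "', 'tex', 't":', ' "Hel', 'lo',
];

// Naive approach: look for the signal inside each chunk individually.
const naiveHit = chunks.some((chunk) => chunk.includes('"text":'));
console.log(naiveHit); // false: the signal never appears whole in any single chunk

// Buffered approach: accumulate chunks and check the buffer after each one.
let buffer = '';
let bufferedHit = false;
for (const chunk of chunks) {
  buffer += chunk;
  if (buffer.includes('"text":')) {
    bufferedHit = true;
    break;
  }
}
console.log(bufferedHit); // true: the buffer eventually contains the full signal
```

This is the whole problem in miniature: the signal exists in the stream, but never inside any single chunk.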
AI’s solutions to this problem
Faced with this issue, AI will want to buffer the stream until the full response has arrived and the text can be extracted just like before. That means the user stares at a loading animation until the full text appears all at once: we stream in the backend, wait for the stream to complete, and then display the entire answer in one go. This completely defeats the purpose of streaming; we might as well not bother streaming at all. Ohhhh, dumb AI 🤦1
After a couple of attempts, the AI will indeed usually propose to just not stream at all. Then, you’ll get into a doom loop where the AI flip-flops between proposing to turn off streaming altogether and buffering the stream until the full response has arrived.
“This will break streaming! If you propose any ‘solution’ that will break streaming again I’m going to personally come down to the datacenter where you live and unplug your f*cking server!”
I’m not proud that I lost my temper and threatened to kill an AI… but man, vibe coding can be frustrating sometimes.
The correct solution to this problem
What I am proud of is that the solution I came up with is actually the same one Vercel uses under the hood in their streamText function. There are two elements to getting this to work reliably across AI models:
Element 1: There is no way to avoid the need to buffer the stream until we find something useful. But we need to immediately stream parts to the front-end as soon as we have the minimum level of information to know that they should be visible to the user. That means we have to buffer just enough to ‘catch’ the signal strings on the fly, and then stream the content.
To do this, we watch the buffer for the key signals, like "content": and "text":, and as soon as one of them appears, we flip a switch (a state variable) to on (true). We can call the switch streamContent, or something:
let buffer = '';
let streamContent = false;
if (buffer.includes('"text": "')) { streamContent = true }

Element 2: The shorter we can make the signal strings, the smaller the buffer needs to be. And we can control how the AI structures the response: it does not have to be "content": and "text":
Let’s instead use “c” for content, and “t” for text: "c": [{"type":"text", "t": "Hello back!"}]
By making the signals shorter, the buffer will stay smaller too. This turns out to be exactly what Vercel is doing in their streamText implementation.
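Putting Element 1 into actual code, here is a minimal sketch of the idea. To be clear, this is my own illustration, not Vercel’s real implementation: it buffers incoming chunks only until the "text": " signal appears, then immediately forwards every piece of text to the UI as it arrives (and it ignores escaped quotes for simplicity):

```typescript
// Minimal sketch of Element 1 (my own illustration, not Vercel's code):
// buffer chunks only until the signal string shows up, then forward
// text to the UI immediately, piece by piece.
function createTextExtractor(onText: (piece: string) => void) {
  const signal = '"text": "'; // the string we are waiting to 'catch'
  let buffer = '';
  let streamContent = false;  // the switch

  return function handleChunk(chunk: string) {
    if (!streamContent) {
      buffer += chunk;
      const at = buffer.indexOf(signal);
      if (at === -1) return; // signal not complete yet: keep buffering
      streamContent = true;  // flip the switch
      chunk = buffer.slice(at + signal.length); // content already past the signal
      buffer = '';
    }
    // Streaming mode: forward everything up to the closing quote.
    // (A real implementation would also handle escaped quotes like \".)
    const end = chunk.indexOf('"');
    if (end === -1) {
      onText(chunk);
    } else {
      onText(chunk.slice(0, end));
      streamContent = false; // the text field has ended
    }
  };
}

// Feed it the kind of tiny chunks from the example above:
const pieces: string[] = [];
const handle = createTextExtractor((piece) => pieces.push(piece));
for (const chunk of ['{"id": "abc123", "con', 'tent": [{"type":"text", "', 'tex', 't": ', '"Hel', 'lo ', 'back!"}]}']) {
  handle(chunk);
}
console.log(pieces.join('')); // "Hello back!"
```

With Element 2 applied, the signal would simply become the much shorter '"t": "', so the buffer stays tiny.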
Vercel’s streamText maps each stream part type to a short code:
"0" → text (text deltas)
"9" → tool_call
"a" → tool_result
"b" → tool_call_streaming_start
"c" → tool_call_delta
"d" → finish_message
"e" → finish_step
"g" → reasoning
"h" → source
"3" → error

Pretty neat! So this is how to do it. Maybe I shouldn’t be telling others this, or maybe I should create a YouTube video and become a content creator after all (not a chance).
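For illustration, here is roughly what consuming such a coded stream looks like. In Vercel’s data stream protocol, each line on the wire is the code, a colon, and a JSON payload (for example, 0:"Hel" for a text delta). The parser below is a simplified sketch of that idea, not the SDK’s actual code:

```typescript
// Sketch of parsing a coded stream, where each line is "<code>:<JSON payload>"
// and the code tells us what kind of part we are looking at.
type StreamPart =
  | { type: 'text'; value: string }
  | { type: 'error'; value: string }
  | { type: 'other'; code: string; value: unknown };

function parseStreamLine(line: string): StreamPart {
  const sep = line.indexOf(':');
  const code = line.slice(0, sep);
  const payload = JSON.parse(line.slice(sep + 1));
  switch (code) {
    case '0': return { type: 'text', value: payload as string };  // text delta
    case '3': return { type: 'error', value: payload as string }; // error
    default:  return { type: 'other', code, value: payload };     // tool calls, etc.
  }
}

// Example: three protocol lines as they might arrive over the wire.
const lines = ['0:"Hel"', '0:"lo back!"', 'd:{"finishReason":"stop"}'];
const text = lines
  .map(parseStreamLine)
  .flatMap((part) => (part.type === 'text' ? [part.value] : []))
  .join('');
console.log(text); // "Hello back!"
```

Because every line carries its own one-character type code, the consumer never has to reassemble a big JSON object before it can start painting text onto the screen.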
In any case, if you need to stream AI responses, hope this helps. If you don’t, hope you found this interesting.
1. This is an extremely typical example of one of LLMs’ main failure modes. Since the ‘intelligence’ is so surface-level and devoid of real understanding, the LLM just cannot fathom that this solution is not valid: that it is the worst of both worlds, keeping the additional complexity of streaming in the back-end while handling it in a way that fully negates the reason for doing it in the first place.
