Tactics · Jun 5, 2026

Streaming LLM Tokens: The Production SSE Setup for AI Products

Pavel Espitia's spectr-ai streams LLM output to the browser, transforming a 15-40 second wait into a real-time, interactive experience. This tactic addresses perceived latency in AI applications.

Pavel Espitia's spectr-ai, a security report generator, faces a core challenge: its underlying LLM takes 15 to 40 seconds to produce a full report. Instead of showing a static spinner, Espitia uses Server-Sent Events (SSE) to stream tokens to the browser, so the report "writes itself" in real time. The generation takes just as long, but the user reads along instead of waiting, which directly addresses perceived latency in AI-powered applications.

The UX of Latency

Espitia calls this the "spinner is a lie" problem. According to him, an LLM-generated security report in spectr-ai takes between 15 and 40 seconds to complete. Presenting the report only after generation finishes forces users to stare at a blank screen or a generic loading indicator for that entire time. The solution is to stream each token to the browser as it arrives from the LLM, the same experience users know from conversational interfaces like ChatGPT. Total generation time is unchanged; what changes is that the user starts reading as soon as the first tokens arrive, so the wait no longer feels like one.

Choosing the Right Client for SSE

The standard browser EventSource API is designed for SSE and offers automatic reconnection. However, spectr-ai needs to send a POST request whose body contains the contract source and the model choice. EventSource only issues GET requests, cannot send a request body or custom headers (for example, for authentication), and does not accept an AbortController signal; its only cancellation mechanism is close(). For these reasons, Espitia opted for fetch with response.body.getReader(), which supports POST requests, custom headers, and explicit cancellation via AbortController. The one feature given up, auto-reconnect, is one Espitia considers undesirable for a one-shot LLM request: a reconnect would restart the generation and bill for it twice.
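The request setup described above can be sketched as follows. This is a minimal illustration, not spectr-ai's actual code: the function name startGeneration, the payload fields source and model, the bearer-token header, and the injectable fetchImpl parameter are all assumptions made for the example.

```typescript
// Hypothetical request shape: field names are assumptions, not spectr-ai's API.
interface GenerationRequest {
  source: string; // contract source to analyze
  model: string; // model choice, e.g. "ollama" or "claude"
}

interface Generation {
  response: Promise<Response>; // resolves once response headers arrive
  cancel: () => void; // aborts the request and any in-progress body read
}

// Starts a one-shot streaming request. `fetchImpl` is injectable so the
// function can be exercised without a network.
function startGeneration(
  url: string,
  payload: GenerationRequest,
  authToken: string,
  fetchImpl: typeof fetch = fetch,
): Generation {
  const controller = new AbortController();
  const response = fetchImpl(url, {
    method: "POST", // EventSource can only issue GET
    headers: {
      "Content-Type": "application/json",
      Accept: "text/event-stream",
      Authorization: `Bearer ${authToken}`, // custom header, unavailable with EventSource
    },
    body: JSON.stringify(payload),
    signal: controller.signal, // ties the request lifetime to cancel()
  });
  return { response, cancel: () => controller.abort() };
}
```

Returning cancel alongside the response promise lets a "Stop" button abort the request without the caller having to manage the AbortController itself. Nothing here retries, which matches the point that a one-shot generation should not silently restart.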

Server as an SSE Proxy

The server side is a Next.js 15 Route Handler that acts as both an SSE client and an SSE server. Upstream, it consumes the stream from the chosen LLM (Ollama or Claude). Espitia notes that both provide OpenAI-compatible endpoints that stream responses as SSE: each event is a line of the form data: {json} followed by a blank line, and the stream ends with data: [DONE]. Downstream, the Route Handler re-emits the extracted text fragments as SSE to the browser. Because network chunks do not align with event boundaries, the handler has to buffer the incoming model stream, parse each complete event, pull out the token delta, and forward it on the outgoing stream.
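The parsing half of that proxy might look like the sketch below. It assumes the upstream follows the OpenAI chat-completions chunk shape, where the token text sits at choices[0].delta.content, and that events are separated by "\n\n" rather than "\r\n\r\n". The class name OpenAiSseParser, the helper toBrowserEvent, and the outgoing {"text": ...} frame shape are hypothetical choices for this example, not taken from the original implementation.

```typescript
// Extracts the token text from one upstream JSON payload.
// Assumes the OpenAI-compatible chunk shape: choices[0].delta.content.
function extractDelta(json: string): string {
  try {
    const parsed = JSON.parse(json) as {
      choices?: { delta?: { content?: unknown } }[];
    };
    const content = parsed.choices?.[0]?.delta?.content;
    return typeof content === "string" ? content : "";
  } catch {
    return ""; // skip malformed payloads instead of failing the whole stream
  }
}

// Buffers raw upstream text and returns the token deltas found in each
// complete event. Incomplete trailing data stays buffered until more arrives.
class OpenAiSseParser {
  private buffer = "";
  done = false; // set once the "data: [DONE]" sentinel has been seen

  push(chunk: string): string[] {
    this.buffer += chunk;
    const deltas: string[] = [];
    let boundary = this.buffer.indexOf("\n\n");
    while (boundary !== -1) {
      const rawEvent = this.buffer.slice(0, boundary);
      this.buffer = this.buffer.slice(boundary + 2);
      boundary = this.buffer.indexOf("\n\n");
      for (const line of rawEvent.split("\n")) {
        if (!line.startsWith("data:")) continue; // ignore comments and other fields
        const data = line.slice(5).trim();
        if (data === "[DONE]") {
          this.done = true;
          return deltas;
        }
        const text = extractDelta(data);
        if (text) deltas.push(text);
      }
    }
    return deltas;
  }
}

// Re-encodes one text fragment as an SSE event for the browser.
// The {"text": ...} shape is an assumption of this sketch.
function toBrowserEvent(text: string): string {
  return `data: ${JSON.stringify({ text })}\n\n`;
}
```

JSON-encoding the outgoing fragment keeps newlines inside a token from being mistaken for SSE event boundaries.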

Client-Side Rendering and Cancellation

On the client side, the application reads the fetch response stream with response.body.getReader() and appends each token to the page as it arrives. Because a generation runs for 15 to 40 seconds, cancellation and error handling matter as much as rendering. The AbortController attached to the fetch request lets the user stop an ongoing generation; the pending read then rejects with an AbortError, which the client should treat as a normal outcome rather than a failure. Stopping early gives users control over long-running operations and, provided the server propagates the disconnect to the model, avoids paying for output nobody will read.
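A read loop along these lines could look like the sketch below. It assumes the server sends frames of the form data: {"text":"..."} separated by blank lines and finishes with data: [DONE]; that frame shape and the names readTokens, StreamOutcome, and isAbortError are assumptions for illustration, not details confirmed by the source.

```typescript
type StreamOutcome = "done" | "aborted";

// An aborted fetch rejects pending reads with an error named "AbortError".
function isAbortError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { name?: unknown }).name === "AbortError"
  );
}

// Reads an SSE body, calling onText for every text fragment.
// Resolves to "aborted" when the user cancelled, "done" otherwise.
async function readTokens(
  body: ReadableStream<Uint8Array>,
  onText: (text: string) => void,
): Promise<StreamOutcome> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) return "done";
      // stream: true keeps multi-byte characters split across chunks intact
      buffer += decoder.decode(value, { stream: true });
      let boundary = buffer.indexOf("\n\n");
      while (boundary !== -1) {
        const frame = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        boundary = buffer.indexOf("\n\n");
        if (!frame.startsWith("data: ")) continue;
        const payload = frame.slice(6);
        if (payload === "[DONE]") return "done";
        onText((JSON.parse(payload) as { text: string }).text);
      }
    }
  } catch (err) {
    if (isAbortError(err)) return "aborted"; // user pressed stop: not a failure
    throw err; // real network or parsing errors still surface
  } finally {
    reader.releaseLock();
  }
}
```

In a UI, onText would typically append to component state, and the "aborted" outcome would leave the partial report on screen rather than showing an error.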

What We'd Change

Espitia's approach is tailored to the specific constraints of LLM streaming, particularly the need for POST requests and manual cancellation. While fetch is a pragmatic choice here, for simpler SSE use cases that do not require a request body or custom headers, EventSource remains a more straightforward option due to its built-in reconnection logic. Reimplementing reconnection for fetch streams in other contexts adds boilerplate.

The reliance on Next.js 15 Route Handlers ties the solution to a specific framework and version. While common in the JavaScript ecosystem, a more portable solution might abstract the SSE server logic into a framework-agnostic middleware or function, allowing for easier adoption across different backend environments. The streamModel helper, as presented, implies a direct consumption of OpenAI-compatible streams. Founders using models with different streaming protocols would need to adapt this parsing layer, adding complexity.
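One way to sketch the framework-agnostic variant suggested above is a function that depends only on the web-standard Response and ReadableStream types. Here the model is abstracted as any AsyncIterable of text fragments, so a different streaming protocol only needs its own adapter. The name sseResponse, the {"text": ...} frame shape, and the closing data: [DONE] frame are assumptions of this example, not part of Espitia's implementation.

```typescript
// Wraps any async source of text fragments in a streaming SSE Response.
// Uses only web-standard APIs, so it is not tied to one framework.
function sseResponse(fragments: AsyncIterable<string>): Response {
  const encoder = new TextEncoder();
  const iterator = fragments[Symbol.asyncIterator]();

  const stream = new ReadableStream<Uint8Array>({
    // pull runs whenever the consumer is ready for more data,
    // so the model is only read as fast as the client can receive.
    async pull(controller) {
      const next = await iterator.next();
      if (next.done) {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
        return;
      }
      const frame = `data: ${JSON.stringify({ text: next.value })}\n\n`;
      controller.enqueue(encoder.encode(frame));
    },
    // cancel runs when the consumer goes away (for example, the browser
    // aborted the request); stopping the iterator lets the model adapter
    // release its upstream connection.
    async cancel() {
      await iterator.return?.();
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache, no-transform",
    },
  });
}
```

In a Next.js Route Handler, the exported POST function could build the fragment source from the request body and return sseResponse(source); other runtimes that accept a standard Response could reuse the same function unchanged.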

The article focuses on the technical implementation and does not report the effect on conversion rates or user retention, the metrics that would validate a UX improvement. The qualitative benefit, a "completely different feel", is plausible, but quantifying it would strengthen the playbook.

Landing

The spectr-ai implementation demonstrates that perceived performance often outweighs raw speed, especially in AI applications where inherent latency is high. By transforming a static wait into a dynamic, real-time display, Espitia shifts the user's focus from the delay itself to the unfolding content. This tactic provides a concrete engineering blueprint for any AI product grappling with long-running generative processes and the challenge of maintaining user engagement.

The investor read

This technical deep dive into LLM streaming highlights a critical user experience challenge in AI products: managing perceived latency. As long as LLM inference times remain significant (15 to 40 seconds in this case, according to the founder), founders must invest in frontend and backend streaming architecture to maintain engagement. Products that fail to address this UX friction risk higher bounce rates, even if their core AI capabilities are strong. This signals growing demand for production-grade streaming infrastructure and tooling, and potentially an opportunity for specialized libraries or platform services that abstract away the complexity of SSE, WebSockets, and model-specific streaming protocols. For investors, a team's ability to execute on this level of technical detail alongside core AI development is a differentiator in a crowded market.

Sources · how we verified
  1. Streaming LLM Tokens to the Browser: The Production SSE Setup

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.