Tactics·Jun 21, 2026

A Java Playbook for Semantic Caching with pgvector to Reduce LLM Costs


A technical playbook for Java developers using Spring AI and a local ONNX model to intercept semantically duplicate LLM queries, cutting API spend and improving application latency.

A user asking “How do I reset my password?” and another asking “Password reset steps” can trigger two separate, expensive LLM API calls for what is effectively the same question. Traditional key-value caching fails here because the query strings don't match exactly. A technical post on the blogging platform Dev.to outlines a playbook for solving this with semantic caching, intercepting queries before they reach an external LLM.

The approach moves beyond simple string matching to recognize and serve cached responses for queries with the same meaning, not just the same words. The author claims this can prevent enterprises from “bleeding thousands of dollars” on redundant API calls, a significant operational cost for any AI-native product.

Intercept calls with a Spring AI Advisor

The entry point for this caching strategy is a framework-level interceptor. The playbook uses Spring AI’s CallAroundAdvisor interface to build a custom SemanticCacheAdvisor. This component sits between the application's business logic and the external LLM client (like OpenAiChatClient).

Its function is to inspect every outgoing prompt. Before proceeding to the external API, the advisor first checks the semantic cache for a sufficiently similar past query. If a high-quality match is found, it returns the cached response immediately, avoiding the network latency and API cost of the external call. The process is transparent to the rest of the application.
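The check-then-call flow described above can be sketched without any framework. This is a minimal illustration, not the playbook's code: CachingInterceptor, SemanticCache, and the Function standing in for the LLM client are all hypothetical names, and in a Spring AI application the same branch would live inside the custom advisor.

```java
import java.util.Optional;
import java.util.function.Function;

/** Framework-free sketch of the cache-check-then-call flow (all names hypothetical). */
class CachingInterceptor {

    /** Hypothetical cache abstraction; a pgvector-backed lookup would sit behind it. */
    public interface SemanticCache {
        Optional<String> lookup(String prompt);
        void store(String prompt, String response);
    }

    private final SemanticCache cache;
    private final Function<String, String> llm; // stand-in for the external chat client

    public CachingInterceptor(SemanticCache cache, Function<String, String> llm) {
        this.cache = cache;
        this.llm = llm;
    }

    /** Returns a cached response on a hit; otherwise calls the LLM and stores the result. */
    public String call(String prompt) {
        Optional<String> hit = cache.lookup(prompt);
        if (hit.isPresent()) {
            return hit.get(); // cache hit: no network call, no API cost
        }
        String response = llm.apply(prompt); // cache miss: pay for the external call
        cache.store(prompt, response);       // remember it for future similar prompts
        return response;
    }
}
```

Callers only ever see call(prompt), which is what makes the cache transparent to the business logic.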

Generate embeddings locally

To check for semantic similarity, the query must be converted into a vector embedding. A common mistake is to call an external service like OpenAI’s embedding API for this step. That approach introduces a network hop just to check the cache, partially defeating the goal of low latency.

The playbook avoids this by running a local embedding model directly within the Java application. It specifies using an ONNX model like all-MiniLM-L6-v2. The author claims this can generate embeddings in under 5 milliseconds. This keeps the entire cache-check process self-contained and fast, deciding whether to hit the external LLM without first making a different external call.
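The sub-5 ms figure is the author's claim, so it is worth measuring on your own hardware. The sketch below is one hedged way to do that: EmbeddingLatencyProbe is a hypothetical helper, and the embedder is passed in as a plain Function so that whatever wraps the ONNX model in your application can be plugged in.

```java
import java.util.Arrays;
import java.util.function.Function;

/** Sketch for checking a local embedder's latency against a budget (names hypothetical). */
class EmbeddingLatencyProbe {

    /** Nearest-rank percentile over an ascending-sorted array; p in (0, 1]. */
    public static long percentile(long[] sortedNanos, double p) {
        int rank = (int) Math.ceil(p * sortedNanos.length);
        return sortedNanos[Math.max(rank, 1) - 1];
    }

    /** Times `iterations` embed calls after `warmup` untimed ones; returns the p95 in nanoseconds. */
    public static long p95Nanos(Function<String, float[]> embedder, String text,
                                int warmup, int iterations) {
        for (int i = 0; i < warmup; i++) {
            embedder.apply(text); // untimed: lets the JIT and any model caches settle
        }
        long[] samples = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            embedder.apply(text);
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return percentile(samples, 0.95);
    }
}
```

Comparing the returned value against 5,000,000 ns shows whether the claimed budget holds for your prompts and hardware.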

Query pgvector with a similarity threshold

Once the local embedding is generated, the advisor queries a PostgreSQL database running the pgvector extension. The database stores embeddings of previous prompts alongside their corresponding LLM responses. The query uses cosine distance to find the nearest neighbors to the new prompt's vector.
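For readers who want the metric spelled out: cosine similarity is the dot product of two vectors divided by the product of their lengths, and cosine distance is 1 minus that value. A minimal sketch in plain Java follows; the Cosine class name is an illustrative assumption, and in the playbook's setup the database performs this computation.

```java
/** Cosine similarity and distance between two embedding vectors (illustrative helper). */
class Cosine {

    /** Returns a value in [-1, 1]; the result is NaN if either vector is all zeros. */
    public static double similarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Cosine distance: 0 for vectors pointing the same way, 2 for opposite directions. */
    public static double distance(float[] a, float[] b) {
        return 1.0 - similarity(a, b);
    }
}
```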

The provided code snippet specifies a similarity threshold of 0.96. Only past queries with a cosine similarity score above this strict threshold are considered a match. Because pgvector's cosine operator (<=>) returns distance, defined as 1 minus similarity, this is equivalent to requiring a distance below 0.04. This number is the critical tuning parameter. If a match is found, its associated response is served from the cache. If not, the request proceeds to the external LLM, and the new prompt and response are then stored in the vector database for future requests.
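A lookup along these lines could be written with plain JDBC. This is a sketch under stated assumptions rather than the playbook's code: the semantic_cache table and its embedding and response columns are hypothetical names. pgvector's <=> operator returns cosine distance, so the similarity threshold is converted to a maximum distance before it is bound.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Optional;

/** Sketch of a thresholded nearest-neighbour lookup (table and column names are assumptions). */
class PgVectorCacheLookup {

    // <=> is pgvector's cosine distance operator; smaller means more similar.
    public static final String LOOKUP_SQL =
        "SELECT response FROM semantic_cache "
        + "WHERE (embedding <=> ?::vector) < ? "
        + "ORDER BY embedding <=> ?::vector LIMIT 1";

    /** Formats a vector in pgvector's text form, e.g. [0.5,-1.0,2.0]. */
    public static String toVectorLiteral(float[] v) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < v.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(v[i]);
        }
        return sb.append(']').toString();
    }

    /** Converts a minimum cosine similarity into the matching maximum cosine distance. */
    public static double maxDistance(double minSimilarity) {
        return 1.0 - minSimilarity;
    }

    /** Returns the closest cached response if it clears the threshold, else empty. */
    public static Optional<String> lookup(Connection conn, float[] embedding,
                                          double minSimilarity) throws SQLException {
        String vec = toVectorLiteral(embedding);
        try (PreparedStatement ps = conn.prepareStatement(LOOKUP_SQL)) {
            ps.setString(1, vec);
            ps.setDouble(2, maxDistance(minSimilarity));
            ps.setString(3, vec);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? Optional.ofNullable(rs.getString("response"))
                                 : Optional.empty();
            }
        }
    }
}
```

Calling lookup(conn, embedding, 0.96) returns a response only when the nearest stored prompt is within a cosine distance of 0.04.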

WHAT WE'D CHANGE

The playbook is a strong, specific starting point for developers in the Spring ecosystem. However, implementing it successfully requires addressing several operational questions the original post overlooks.

The 0.96 similarity threshold is presented as a fixed value, but the right value is application-specific. A support chatbot might tolerate a lower threshold to maximize cache hits, while a code generation tool would require a much higher one to avoid returning subtly incorrect results. Teams must test and tune this value based on their own tolerance for false positives. The playbook lacks guidance on this validation process.
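One way to run that validation is to hand-label a sample of prompt pairs as equivalent or not, record each pair's cosine similarity, and sweep candidate thresholds. The sketch below assumes such a labeled sample exists; ThresholdSweep and its method names are hypothetical.

```java
/** Sketch for comparing candidate thresholds on a hand-labeled sample (names hypothetical). */
class ThresholdSweep {

    /** Share of all sampled pairs that would be served from cache at this threshold. */
    public static double hitRate(double[] similarities, double threshold) {
        int hits = 0;
        for (double s : similarities) {
            if (s > threshold) hits++;
        }
        return similarities.length == 0 ? 0.0 : (double) hits / similarities.length;
    }

    /**
     * Share of cache hits whose pair was labeled NOT equivalent, i.e. cases where
     * a wrong cached answer would have been served. Returns 0 when there are no hits.
     */
    public static double falseHitShare(double[] similarities, boolean[] equivalent,
                                       double threshold) {
        int hits = 0, falseHits = 0;
        for (int i = 0; i < similarities.length; i++) {
            if (similarities[i] > threshold) {
                hits++;
                if (!equivalent[i]) falseHits++;
            }
        }
        return hits == 0 ? 0.0 : (double) falseHits / hits;
    }
}
```

Evaluating both numbers at several thresholds makes the trade-off explicit: raising the threshold lowers the share of wrong answers but also lowers the hit rate.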

More importantly, the playbook makes no mention of cache invalidation. If the correct answer to a question changes, or if a bug produced a bad response, how is the stale entry removed from the vector store? A simple time-to-live (TTL) policy may not suffice. A robust implementation needs a mechanism to manually or automatically purge specific entries when underlying information is updated.
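To make the two purge paths concrete, here is an in-memory sketch that tags each entry with a topic and a creation time. Everything in it is an illustrative assumption: InvalidatingCacheStore and the topic field are hypothetical, and against Postgres the two purge methods would become DELETE statements over equivalent columns.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of TTL and targeted invalidation (all names hypothetical). */
class InvalidatingCacheStore {

    static final class Entry {
        final String prompt;
        final String response;
        final String topic;      // what the answer depends on, e.g. "billing"
        final Instant createdAt; // when the response was cached

        Entry(String prompt, String response, String topic, Instant createdAt) {
            this.prompt = prompt;
            this.response = response;
            this.topic = topic;
            this.createdAt = createdAt;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void put(String prompt, String response, String topic, Instant createdAt) {
        entries.add(new Entry(prompt, response, topic, createdAt));
    }

    /** TTL path: drops entries created before (now - ttl); returns how many were removed. */
    public int purgeOlderThan(Duration ttl, Instant now) {
        Instant cutoff = now.minus(ttl);
        int before = entries.size();
        entries.removeIf(e -> e.createdAt.isBefore(cutoff));
        return before - entries.size();
    }

    /** Targeted path: drops every entry for a topic whose source information changed. */
    public int purgeTopic(String topic) {
        int before = entries.size();
        entries.removeIf(e -> e.topic.equals(topic));
        return before - entries.size();
    }

    public int size() {
        return entries.size();
    }
}
```

The targeted path is the one a TTL alone cannot provide: when a billing policy changes, purgeTopic("billing") removes the affected answers immediately instead of waiting for them to expire.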

Finally, the principles are portable even if the code is not. A Python team could achieve the same outcome using FastAPI middleware, a local SentenceTransformers model, and a library like psycopg3 to interact with pgvector. The core architecture of local embedding and a vector database cache check is the key takeaway, not the specific Java implementation.

LANDING

This tactic represents a broader maturation in the AI development stack. Early-stage projects focus on making the core product functional. This playbook shifts focus to operational efficiency and gross margin. Treating LLM API calls as a core cost of goods sold to be actively managed is a sign of a business moving from a prototype to a scalable product. While not a defensible moat, implementing infrastructure for capital efficiency signals a founder's focus on building a profitable, resilient business.

The investor read

This playbook signals a shift from proof-of-concept to profitability-focused engineering in AI startups. Managing LLM API spend is a direct lever on gross margins. Founders implementing this level of infrastructure demonstrate operational discipline. The choice of an open-source stack (Postgres, pgvector, ONNX) over managed vector databases or embedding APIs indicates a preference for lower variable costs in exchange for higher initial engineering overhead. This is a classic bootstrapped or capital-efficient mindset. While this specific tactic is a commodity layer, not a defensible moat, its presence in a company's stack is a positive signal for investors prioritizing sustainable unit economics over hyper-growth backed by high burn.


Sources · how we verified
  1. Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector


Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.