Codebase RAG: AST Chunking, Hybrid Retrieval, Capped Agents
A founder’s open-source codebase assistant, Nexus, details a RAG system using Tree-sitter for AST-aware chunking, hybrid PostgreSQL search, and a Cohere reranker. This technical playbook offers specific implementation choices.
A codebase assistant named Nexus, detailed by its builder u/eviltwin7648, implements a multi-stage Retrieval-Augmented Generation (RAG) system for understanding code. The project's public GitHub repository provides a concrete artifact outlining specific choices for chunking, retrieval, and agentic behavior within a code-focused context. This detailed technical breakdown offers a playbook for founders building similar LLM-powered developer tools.
AST-Aware Chunking for Code Context
Nexus addresses a core challenge in code RAG: fixed-token splitting loses critical structural context. The founder reports Nexus uses AST-aware chunking via Tree-sitter, parsing syntax trees to extract meaningful nodes like functions, methods, classes, structs, and interfaces. This process supports Java, JavaScript, TypeScript, Python, and Go.
The system incorporates several techniques to refine chunk quality. Recursive chunking allows deeper parsing into methods if a class remains too large. Sibling merging combines small fragments (under ~80 tokens) with their subsequent siblings, preventing database bloat from trivial chunks. Each chunk also carries metadata—file, language, symbol, symbol type, and parent—to provide contextual information to the LLM. For markdown files, Nexus uses a heading-based chunker. Unidentified file types default to size-based splitting with overlap.
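The recursive descent and sibling merging described above can be sketched roughly as follows. This is a minimal illustration, not Nexus's code: the `Node` and `Chunk` types, the whitespace-based `count_tokens`, and the `MAX_TOKENS` limit are all assumptions made for the example (the write-up only gives the ~80-token merge threshold), and a real implementation would walk Tree-sitter nodes instead of this hand-built tree.

```python
from dataclasses import dataclass, field

MAX_TOKENS = 400  # hypothetical upper bound; the source does not state one
MIN_TOKENS = 80   # approximate merge threshold mentioned in the write-up


@dataclass
class Node:
    """Stand-in for a syntax-tree node (hypothetical, not a Tree-sitter type)."""
    kind: str   # e.g. "class", "method", "function"
    name: str
    text: str
    children: list["Node"] = field(default_factory=list)


@dataclass
class Chunk:
    file: str
    language: str
    symbol: str
    symbol_type: str
    parent: str
    text: str


def count_tokens(text: str) -> int:
    # Crude whitespace count; a real system would use the model's tokenizer.
    return len(text.split())


def chunk_node(node, file, language, parent="", max_tokens=MAX_TOKENS):
    """Emit the node whole if it fits, otherwise recurse into its children."""
    if count_tokens(node.text) <= max_tokens or not node.children:
        return [Chunk(file, language, node.name, node.kind, parent, node.text)]
    chunks = []
    # Note: text outside the children (e.g. a class header) is dropped here.
    for child in node.children:
        chunks.extend(chunk_node(child, file, language, node.name, max_tokens))
    return chunks


def merge_small_siblings(chunks, min_tokens=MIN_TOKENS):
    """Fold each under-sized chunk into the next chunk sharing its parent."""
    merged = []
    pending = None  # small chunk waiting for a following sibling
    for chunk in chunks:
        if pending is not None and pending.parent == chunk.parent:
            chunk = Chunk(chunk.file, chunk.language,
                          f"{pending.symbol}+{chunk.symbol}",
                          chunk.symbol_type, chunk.parent,
                          pending.text + "\n" + chunk.text)
        elif pending is not None:
            merged.append(pending)  # no sibling followed; keep it as-is
        if count_tokens(chunk.text) < min_tokens:
            pending = chunk
        else:
            pending = None
            merged.append(chunk)
    if pending is not None:
        merged.append(pending)
    return merged
```

Each emitted `Chunk` carries the metadata fields listed above (file, language, symbol, symbol type, parent), so a merged chunk still records which parent it came from.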
Hybrid Retrieval with PostgreSQL and Reranking
Pure vector search often fails on exact code identifiers like LeadScoringWorker or tenant_id, which embedding models interpret as "nonsense token soup." Nexus implements a hybrid retrieval approach, running both vector and lexical searches concurrently. Vector search uses pgvector with HNSW indexes and cosine similarity for semantic queries. Lexical search employs PostgreSQL's Full-Text Search (FTS) with GIN indexes, using the 'simple' text search configuration, which skips stemming and stopword removal so that code keywords and syntax-critical tokens are indexed as written.
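A rough sketch of the two concurrent queries is shown below. The `chunks` table, its `id`, `content`, and `embedding` columns, the psycopg-style `%s` placeholders, and the `run_query` callable are all assumptions for illustration; the SQL uses pgvector's cosine distance operator (`<=>`) and PostgreSQL's `to_tsvector`, `plainto_tsquery`, and `ts_rank` with the 'simple' configuration, but it is not taken from the Nexus repository.

```python
from concurrent.futures import ThreadPoolExecutor

# Semantic side: order by cosine distance (pgvector's <=> operator).
VECTOR_SQL = """
SELECT id FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""

# Lexical side: 'simple' keeps identifiers unstemmed and drops no stopwords.
LEXICAL_SQL = """
SELECT id FROM chunks
WHERE to_tsvector('simple', content) @@ plainto_tsquery('simple', %s)
ORDER BY ts_rank(to_tsvector('simple', content),
                 plainto_tsquery('simple', %s)) DESC
LIMIT %s
"""


def hybrid_search(run_query, query_text, query_embedding, limit=20):
    """Run both searches in parallel and return (vector_ids, lexical_ids).

    run_query(sql, params) is a hypothetical helper that executes the
    statement and returns a list of ids in result order.
    """
    # pgvector accepts a '[v1,v2,...]' text literal cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with ThreadPoolExecutor(max_workers=2) as pool:
        vec = pool.submit(run_query, VECTOR_SQL, (vector_literal, limit))
        lex = pool.submit(run_query, LEXICAL_SQL,
                          (query_text, query_text, limit))
        return vec.result(), lex.result()
```

The two ranked id lists are returned separately because their scores (cosine distance and `ts_rank`) are not directly comparable.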
The system merges results using Reciprocal Rank Fusion (RRF) with a smoothing constant k=60. This method avoids direct score comparison between disparate metrics like cosine similarity and ts_rank. The top 20 candidates from the RRF stage then proceed to a Cohere cross-encoder reranker. This reranker processes the query and document together, offering higher accuracy than bi-encoders at the cost of speed. The reranker filters the candidates down to the top 8.
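The fusion and rerank stages can be illustrated with a short sketch. RRF scores each document as the sum of 1 / (k + rank) over the lists it appears in, so only rank positions matter. The function names and the `score_pair` callable are hypothetical; in particular, `score_pair` stands in for a cross-encoder such as Cohere's reranker and is not its actual API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse ranked id lists using RRF: score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


def rerank(query, candidates, score_pair, top_n=8):
    """Keep the top_n candidates by a (query, document) relevance score.

    score_pair is a placeholder for a cross-encoder call.
    """
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]
```

A document that appears near the top of both lists outranks one that tops only a single list, which is why exact-identifier matches from the lexical side can surface even when the vector side ranks them poorly.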
Capped Agentic Loop for Conclusive Answers
Nexus incorporates an agentic query interface, designed to prevent indefinite tool-call loops. The system caps the agent's operation at five iterations. On the final iteration, the agent is forced to provide a conclusive answer, rather than continuing to search or refine its query. This design choice prioritizes a definitive response over potentially endless exploration.
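A capped loop of this kind might look like the sketch below. The `call_model` and `run_tool` callables, the dictionary reply shape, and the `allow_tools` flag are assumptions for illustration; the source only states the five-iteration cap and the forced final answer.

```python
MAX_ITERATIONS = 5  # cap reported in the write-up


def run_agent(question, call_model, run_tool, max_iterations=MAX_ITERATIONS):
    """Loop over tool calls, forcing an answer on the last iteration.

    call_model(history, allow_tools=...) is a hypothetical function that
    returns either {"tool": name, "args": {...}} or {"answer": text}.
    Returns (answer, iterations_used).
    """
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    history = [("user", question)]
    for iteration in range(1, max_iterations + 1):
        final_turn = iteration == max_iterations
        # On the final turn tools are disabled, so the model must answer.
        reply = call_model(history, allow_tools=not final_turn)
        if final_turn or reply.get("tool") is None:
            # If the model ignores the instruction, fall back to empty text.
            return reply.get("answer", ""), iteration
        result = run_tool(reply["tool"], reply.get("args", {}))
        history.append(("tool", result))
    raise AssertionError("unreachable: the final turn always returns")
```

With a model that keeps requesting tools, this runs four tool calls and returns the answer produced on the fifth iteration.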
What We'd Change
The Nexus implementation provides a detailed technical blueprint, but its utility as a general playbook is limited by the absence of empirical validation. The founder claims RRF is "surprisingly effective," but this assertion lacks quantitative support. Without metrics on retrieval precision, recall, or end-user satisfaction, it is difficult to assess the actual performance gains from the chosen chunking and retrieval strategies. A founder adopting this playbook would need to establish their own benchmarks to confirm the efficacy of each component.
The cost implications of specific choices are also not addressed. While PostgreSQL and Tree-sitter are open-source, Cohere's reranker is a commercial API. For bootstrapped projects, this introduces a recurring cost that might necessitate exploring open-source cross-encoder alternatives or a simpler reranking approach. The decision to force a conclusive answer after five iterations, while preventing infinite loops, could also lead to confident but incorrect responses if the underlying retrieval confidence is low. A more robust agent design might incorporate an "I don't know" state or a confidence threshold before forcing an answer.
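One way to sketch the suggested confidence threshold is to gate the final answer on the reranker's scores. This is our illustration of the proposed change, not something Nexus is reported to do; the `select_context` name and the 0.3 cutoff are hypothetical, and a usable threshold would have to be calibrated against real queries.

```python
def select_context(reranked, min_score=0.3):
    """Return doc ids worth answering from, or None for "I don't know".

    reranked is a list of (doc_id, relevance_score) pairs; min_score is a
    hypothetical cutoff that would need tuning per reranker.
    """
    if not reranked:
        return None
    best = max(score for _, score in reranked)
    if best < min_score:
        return None  # nothing retrieved is relevant enough; abstain
    return [doc for doc, score in reranked if score >= min_score]
```

A caller would check for `None` before the forced final turn and return an explicit "I don't know" instead of prompting the model for a conclusive answer.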
The chunking and lexical search configurations are tuned for code and may not translate directly to other data types. The 'simple' FTS configuration, for example, suits code identifiers but would likely perform poorly on natural language documents, where stemming and stopword removal improve recall. Adapting this playbook for mixed-content RAG systems would require significant modifications to these components.
The Nexus project offers a high-fidelity technical demonstration of a sophisticated RAG system for codebases. Its value lies in the concrete, open-source implementation of advanced techniques like AST-aware chunking and hybrid retrieval. For founders building developer tools, the GitHub repository serves as a practical starting point, detailing specific choices for overcoming common RAG challenges. However, the absence of performance data means that replicating this architecture requires independent validation of its effectiveness against specific use cases and cost constraints.
The investor read
The Nexus project highlights a growing trend in developer tooling: leveraging advanced RAG techniques to build intelligent code assistants. The focus on AST-aware chunking and hybrid retrieval (vector + lexical) reflects the industry's recognition that semantic search alone is insufficient for structured data like code. The integration of a commercial reranker like Cohere suggests a willingness to invest in best-in-class components for accuracy, even in a side project. For investors, this signals continued capital and attention flowing into developer productivity tools, particularly those that can demonstrably improve code comprehension and generation. Benchmarks for such tools typically include code understanding accuracy, response latency, and developer adoption. An investable product would need to show strong empirical evidence of superior performance and a clear path to monetization beyond a personal assistant.
- Built a codebase assistant from scratch in golang. here's what I learned about chunking, retrieval, and agents
- eviltwin7648/nexus