HomeReadTactics deskSQLite FTS5: Fixing Japanese Substring Search Failures
Tactics·Jun 17, 2026

SQLite FTS5: Fixing Japanese Substring Search Failures

SQLite's FTS5 unicode61 tokenizer silently drops Japanese substring queries. A one-line SQL fix using trigram tokenization restores full-text search functionality for multilingual applications. When…

SQLite's FTS5 unicode61 tokenizer silently drops Japanese substring queries. A one-line SQL fix using trigram tokenization restores full-text search functionality for multilingual applications.

When building personal memory layers or indexed knowledge bases with SQLite FTS5, a common pitfall involves the default unicode61 tokenizer. This tokenizer silently fails to index most Japanese substrings, rendering significant portions of content unreachable in search results. Founder omochi_dev reports encountering this issue while indexing approximately 800 Claude Code conversations.

FTS5's Silent Failure for Japanese

The unicode61 tokenizer, default for SQLite FTS5, processes text by splitting on whitespace and punctuation. This approach works for many Western languages, but it is ineffective for East Asian languages like Japanese, which do not use spaces between words. The result is that entire Japanese phrases are treated as single tokens or dropped, making substring searches impossible.

omochi_dev demonstrated this failure with a Python script. After creating a virtual FTS5 table with default settings and inserting a multilingual sentence containing both English and Japanese, queries for English terms ('time', 'blocking') returned hits, while queries for Japanese terms ('朝の運用', '午後') returned zero hits. This behavior means users might believe their search is working, unaware that specific linguistic content is unsearchable.

import sqlite3
conn = sqlite3.connect(":memory:")
conn.execute("""
 CREATE VIRTUAL TABLE notes USING fts5(content)
""")
conn.execute("""
 INSERT INTO notes(content) VALUES (
'Tried time-blocking with the new 朝の運用フロー — '
'the 9-11 slot worked but the 午後 part collapsed again.'
)
""")
for q in ["time", "blocking", "朝の運用", "午後"]:
 hits = conn.execute(
 "SELECT count(*) FROM notes WHERE content MATCH ?",
 (q,)
 ).fetchone()[0]
print(f"{q!r}: {hits} hit(s)")

This script outputs:

'time': 1 hit(s)
'blocking': 1 hit(s)
'朝の運用': 0 hit(s)
'午後': 0 hit(s)

The Trigram Tokenizer Fix

The solution involves specifying tokenize='trigram' when creating the FTS5 virtual table. A trigram tokenizer breaks text into overlapping sequences of three characters, regardless of word boundaries. This ensures that Japanese substrings are indexed and become searchable. The one-line SQL modification is:

CREATE VIRTUAL TABLE notes USING fts5(content, tokenize='trigram');

Implementing this change allows queries for Japanese terms like '朝の運用' and '午後' to return correct hits, restoring expected search functionality. This approach is standard for full-text search in languages without explicit word delimiters.

A Production Design for Personal AI Data

omochi_dev reports using this FTS5 configuration within a two-layer Git + SQLite design for managing approximately 800 Claude Code conversations. In this architecture, Git likely handles version control and primary storage of conversation data, while SQLite FTS5 provides fast, searchable indexing on top of that data. This setup suggests a focus on local-first data management and robust search for personal knowledge bases, where data integrity and retrieval speed are critical.

What We'd Change

While the trigram tokenizer effectively addresses Japanese substring search, it introduces trade-offs. Trigram tokenization can increase index size and query times compared to word-based tokenizers, especially for large datasets or languages where word-based tokenization is sufficient. For applications primarily dealing with Western languages, trigram might be an unnecessary performance overhead. A more sophisticated solution for multilingual applications might involve language detection and dynamic tokenizer selection, or using a dedicated CJK tokenizer if FTS5 supported it natively.

Furthermore, the two-layer Git + SQLite design, while robust for personal use cases like ~800 conversations, presents scalability limitations. As the volume of data grows significantly beyond this reported number, Git's performance for large repositories and SQLite's single-file nature can become bottlenecks. For enterprise-scale applications requiring high concurrency, distributed search, or petabytes of data, alternative solutions like Elasticsearch, OpenSearch, or Meilisearch, which offer more advanced linguistic analysis and distributed indexing capabilities, would be more appropriate. The founder also notes a separate FTS5 footgun around the hyphen character, which can break queries for terms like time-blocking. This suggests FTS5 requires careful configuration and testing for nuanced search requirements.

Landing

The omochi_dev experience highlights the necessity of understanding underlying tokenizer behavior in full-text search engines, particularly in multilingual contexts. The trigram tokenizer provides a direct, one-line fix for SQLite FTS5's silent failure with Japanese substrings, enabling robust search for personal knowledge bases and AI conversation histories. Founders deploying search functionality in applications handling diverse linguistic data must explicitly configure tokenization to prevent data from becoming unretrievable.

The investor read

This signal reflects the ongoing trend of indie and micro-SaaS founders building local-first or privacy-preserving applications, often leveraging SQLite for its embedded nature and low operational overhead. The focus on indexing personal AI conversation data, such as Claude Code interactions, indicates a niche where users prioritize data ownership and rapid local search. While the technical fix is specific, the broader pattern of optimizing open-source components for niche linguistic requirements is notable. For venture-backed plays, the scalability of a Git + SQLite architecture would be a concern beyond the reported ~800 conversations, suggesting this approach is more aligned with bootstrapped or lifestyle businesses rather than those targeting hyper-growth.

Sources · how we verified
  1. Why SQLite FTS5's default tokenizer drops your Japanese substrings (and the one-line fix)

Every claim ties to a primary source. See our methodology.

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place. How we work · About · File a correction.
M
Maya

The Maya desk covers tactics: concrete playbooks, growth experiments, and operating decisions indie founders are running now. Every claim is sourced and linked. Operated by Founderr (RIKHATH LLC) See the desk →

Founderr Pulse — free & independent. The desk for people who build & back.