Tactics · Jul 5, 2026

Building global ID OCR: Why data variation breaks naive systems

A playbook for handling the three axes of variation in national ID cards: disparate fields, non-Latin scripts, and country-specific data formats like the Thai Buddhist calendar.

A Know Your Customer (KYC) system ingests a national ID card and returns a clean data structure. Then it ingests an ID from a second country and the date of birth field reads '2567', a year more than five centuries in the future. The system has not failed at reading text. It has failed at understanding context.

The central challenge of building a global identity system is not the optical character recognition, but the fact that every country’s ID is a fundamentally different document. A technical post from developer 'deepfox' on Dev.to outlines an architecture for handling this variation, moving beyond simple text extraction to a system of managed normalization.

Model for variable fields, not a fixed schema

There is no universal standard for the data printed on a national ID. The author points to several examples of non-standard, yet official, fields. A Thai ID card includes the holder's religion. A German ID card specifies height and eye color. A Chinese ID card lists the holder's ethnicity. Attempting to force these disparate schemas into a single, fixed IdCard data type results in either data loss or a sparse table of null values and special cases.

The proposed solution is to treat the set of available fields as dependent on the document's country of origin. This requires a flexible data model, likely a key-value store or a document database structure, where the application logic is aware that eye_color will exist for German users but not for Thai users. The name field itself can vary from a single string to separate given and family name fields, sometimes in multiple scripts.
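A minimal sketch of such a country-dependent model, in Python. The `IdDocument` class and the `COUNTRY_FIELDS` registry are hypothetical names chosen for illustration, and the registry holds only the three examples mentioned above, not a real catalogue of fields.

```python
from dataclasses import dataclass, field

# Hypothetical registry: which extra fields each country's card may carry.
# Entries mirror the article's examples only; a real system would need far more.
COUNTRY_FIELDS = {
    "TH": {"religion"},
    "DE": {"height", "eye_color"},
    "CN": {"ethnicity"},
}


@dataclass
class IdDocument:
    country: str                      # ISO 3166-1 alpha-2 code, e.g. "DE"
    name: dict                        # {"full": ...} or {"given": ..., "family": ...}
    extra: dict = field(default_factory=dict)  # country-specific fields

    def __post_init__(self):
        # Reject fields that this country's card is not known to carry.
        allowed = COUNTRY_FIELDS.get(self.country, set())
        unknown = set(self.extra) - allowed
        if unknown:
            raise ValueError(
                f"fields not defined for {self.country}: {sorted(unknown)}"
            )

    def get(self, key):
        # Returns None when the field does not exist for this document.
        return self.extra.get(key)


german = IdDocument(
    country="DE",
    name={"given": "Erika", "family": "Mustermann"},
    extra={"eye_color": "green"},
)
print(german.get("eye_color"))  # green
print(german.get("religion"))   # None
```

Keeping the per-country field sets in data rather than in class definitions means adding a country is a registry change, not a schema migration.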

Store native scripts alongside Latin transliterations

Forcing all names into a Latin script via transliteration is a common but flawed approach for global user bases. The process is lossy. Diacritics are dropped and multiple native spellings can collapse into a single, ambiguous Latin form. This makes it impossible to reliably match the name back to the source document or government databases, undermining the purpose of KYC.

The recommended architecture is to store the name exactly as it appears on the document, in its native script. If the card also provides a Latin version, that is stored as a separate field. This creates two handles for the identity: a local, high-fidelity version for matching against official records, and a Latin version for systems requiring ASCII compatibility. The original data is never destroyed.
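To make the lossiness concrete, here is a small Python sketch. `PersonName` and `ascii_fold` are hypothetical names, and the folding function is a deliberately naive stand-in for transliteration, not the method described in the post.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PersonName:
    native: str                  # exactly as printed on the document
    latin: Optional[str] = None  # only set when the card itself prints a Latin form


def ascii_fold(text: str) -> str:
    """Naive transliteration: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Two distinct spellings collapse into the same Latin form once folded,
# so the folded value alone cannot be matched back to the source document.
print(ascii_fold("Müller"))  # Muller
print(ascii_fold("Muller"))  # Muller

# Storing both handles keeps the original intact (illustrative values).
name = PersonName(native="สมชาย", latin="Somchai")
```

The frozen dataclass is a design choice for the sketch: once a name is captured from the document, nothing downstream can overwrite the native form.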

Normalize formats while preserving original strings

Specific data formats, particularly dates and identification numbers, present parsing challenges. The author uses the Thai ID card as a primary example. It prints dates in the Buddhist Era (BE), whose year count runs 543 years ahead of the Gregorian calendar, so BE 2567 is 2024 CE. According to the post, the card also uses Thai numerals rather than Arabic digits.

A naive parser fails on two counts: it cannot parse the numerals, and if it could, it would misinterpret the year by more than five centuries. The correct process involves multiple steps: recognize and convert the numerals, subtract the 543-year offset, and normalize the date to a standard like ISO 8601. Critically, the original string should be preserved for auditing and display purposes. Similarly, national ID numbers have country-specific lengths and internal structures, such as checksums or region codes, which can be used for validation if the system is designed to know the rules for each format.
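The steps above can be sketched in Python as follows. Several things here are assumptions rather than details from the post: the function names, the `DD/MM/YYYY` input layout (a real card prints month names, which the OCR layer would have to map first), and the mod-11 check digit, which is the scheme commonly described for 13-digit Thai ID numbers and should be verified against an official specification before use.

```python
from datetime import date

# Thai digits occupy U+0E50..U+0E59; map each one to its ASCII counterpart.
THAI_DIGITS = {0x0E50 + i: str(i) for i in range(10)}
BE_OFFSET = 543  # Buddhist Era year minus 543 gives the Gregorian year


def normalize_be_date(raw: str) -> dict:
    """Assumes the OCR layer yields 'DD/MM/YYYY' with a Buddhist Era year."""
    parts = raw.translate(THAI_DIGITS).split("/")
    day, month, be_year = (int(part) for part in parts)
    iso = date(be_year - BE_OFFSET, month, day).isoformat()
    # Keep the original string next to the ISO 8601 form for audit and display.
    return {"raw": raw, "iso": iso}


def thai_id_checksum_ok(number: str) -> bool:
    """Mod-11 check digit as commonly described for Thai IDs (assumption)."""
    digits = number.translate(THAI_DIGITS)
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights run 13, 12, ..., 2 over the first twelve digits.
    total = sum(int(d) * (13 - i) for i, d in enumerate(digits[:12]))
    return (11 - total % 11) % 10 == int(digits[12])


# Hypothetical per-country registry of ID number validators.
ID_VALIDATORS = {"TH": thai_id_checksum_ok}

raw_dob = "๑๕/๐๑/๒๕๓๐"  # 15/01/2530 BE in Thai numerals
print(normalize_be_date(raw_dob)["iso"])       # 1987-01-15
print(ID_VALIDATORS["TH"]("1101700203450"))    # True (made-up number)
```

Returning the raw string together with the ISO value keeps the normalization reversible, and the validator registry gives each country's rules a single place to live.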

What We'd Change

The playbook is a sound engineering blueprint for the data modeling problem. It understates, however, the operational burden required to maintain such a system. The core asset is not the code but the proprietary knowledge base of every global ID's specific format, fields, and validation rules. This 'document almanac' requires continuous research and updates as governments issue new card versions. This is a product management and data governance function, not a one-time engineering task.

Furthermore, the 'store both' approach (native and Latin script, raw and normalized date) pushes complexity downstream. Every internal service that consumes this identity data, from fraud analysis to customer support tools, must be built to handle these dual representations. This can lead to integration errors if API contracts are not strictly enforced. The engineering cost is not isolated to the OCR service; it is distributed across the entire organization.
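One way to keep that downstream contract explicit is a single wrapper type that every consumer must handle. The sketch below is an illustration under assumptions, not something the post proposes; `DualField` and `to_payload` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DualField:
    raw: str                          # exactly as printed on the document
    normalized: Optional[str] = None  # e.g. ISO 8601 date or Latin name, if available

    def __post_init__(self):
        # The source string is mandatory; a normalized form may be absent.
        if not self.raw:
            raise ValueError("raw value is required")


def to_payload(fields: dict) -> dict:
    """Serialize identity fields, rejecting anything that is not a DualField."""
    payload = {}
    for key, value in fields.items():
        if not isinstance(value, DualField):
            raise TypeError(f"{key} must be a DualField, got {type(value).__name__}")
        # Always emit both keys so consumers never have to guess which form they hold.
        payload[key] = {"raw": value.raw, "normalized": value.normalized}
    return payload


print(to_payload({"date_of_birth": DualField("15/01/2530", "1987-01-15")}))
# {'date_of_birth': {'raw': '15/01/2530', 'normalized': '1987-01-15'}}
```

Rejecting bare strings at the serialization boundary moves the failure to the producing service, where it is cheaper to fix than in a fraud tool or support console.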

Finally, the analysis implicitly favors a 'build' decision in a market that has largely shifted to 'buy'. The complexity described is the explicit business model of identity verification (IDV) vendors like Stripe Identity, Veriff, and Onfido. For most companies, building and maintaining this system from scratch is a strategic error. The playbook is most relevant for the IDV vendors themselves or for companies operating in a niche so specific that off-the-shelf solutions are inadequate.

Landing

The hard part of global identity verification is not text recognition. OCR technology is largely commoditized. The durable challenge is informational and architectural: creating a system that treats the diversity of global identity documents as its core domain. This requires building a flexible data model that normalizes inputs without destroying source information. It is a system for managing exceptions as the rule, where every new country represents a new schema to be mapped and maintained. For most companies, this is a clear signal to integrate a specialized third-party vendor.

The investor read

This playbook details the technical moat of the Identity Verification (IDV) market. Companies like Stripe Identity, Onfido, and Veriff sell solutions to this exact problem. The high, ongoing operational cost of maintaining a global 'document almanac' makes a 'build' decision a red flag for most startups. An investor would question why a company is allocating engineering resources to a non-core, solved problem instead of using a vendor. The exception is a startup whose entire business is to disrupt the IDV space itself, in which case this playbook is table stakes. For all others, it's a clear case for 'buy,' where the vendor's margin is justified by the immense complexity of the underlying task.

Sources · how we verified
  1. The hard part of national ID OCR isn't the OCR

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.