Tactics·Jun 13, 2026

Extract CRM Data from Email Signatures with Regex, Not LLMs

By Maya · Tactics desk·Human-reviewed·✓ Verified Jun 13, 2026·3 min read·1 source

A founder claims a regex-first approach can extract structured data from email signatures with high accuracy, bypassing large language models for predictable text patterns and reducing operational costs.

A founder writing on dev.to outlines a technical process for extracting structured data from email signatures, asserting that a regex-based agent can achieve production-usable accuracy. The approach directly challenges the common assumption that large language models (LLMs) are necessary for such tasks, particularly when the extracted data is destined for CRM records.

Regex for Predictable Structure

The founder claims that email signatures, despite appearing as prose, are predictably structured. They typically run 3-6 lines and are often set off by the "-- " delimiter line (two hyphens followed by a space) described in RFC 3676. This predictable format allows for a regex-first strategy, which the founder reports catches "over 95% of well-formed signatures." The method is described as running in microseconds and costing nothing per message, which positions LLMs as a fallback for the remaining signatures (under 5% by the founder's figure), or as something to skip entirely in an initial build.

The provided Python code snippet demonstrates how to identify signature boundaries using a list of delimiters, including the RFC 3676 standard and common phrases like "Sent from my iPhone" or "Regards." This initial split isolates the signature block from the main email body, a critical first step for subsequent field extraction.
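The boundary-detection step can be sketched roughly as follows. This is a minimal illustration, not the founder's code: the pattern list, the `MAX_TAIL_LINES` window, and the `split_signature` name are all assumptions made for the example.

```python
import re
from typing import Tuple

# Hypothetical delimiter patterns; the post's exact list is not reproduced here.
DELIMITER_PATTERNS = [
    r"-- ?",                                    # RFC 3676 "-- " line (trailing space is often stripped)
    r"sent from my \w+.*",                      # mobile footers such as "Sent from my iPhone"
    r"(?:best |kind |warm )?regards[,.!]?\s*",  # "Regards," and common variants
    r"(?:thanks|cheers|best)[,.!]?\s*",         # short sign-offs on a line of their own
]
DELIMITER_RE = re.compile(
    "|".join(f"(?:{p})" for p in DELIMITER_PATTERNS), re.IGNORECASE
)
MAX_TAIL_LINES = 10  # assumed window: only the end of the message is scanned


def split_signature(body: str) -> Tuple[str, str]:
    """Return (message, signature); signature is "" when no delimiter is found."""
    lines = body.replace("\r\n", "\n").split("\n")
    start = max(0, len(lines) - MAX_TAIL_LINES)
    for i in range(start, len(lines)):
        # fullmatch ensures the whole line is a delimiter, not a sentence
        # that merely begins with "Thanks" or "Regards".
        if DELIMITER_RE.fullmatch(lines[i]):
            message = "\n".join(lines[:i]).strip()
            signature = "\n".join(lines[i + 1:]).strip()
            return message, signature
    return body.strip(), ""


message, signature = split_signature(
    "Hi Sam,\n\nCan we meet Tuesday?\n\n-- \nJane Doe\nVP of Sales, Acme Corp\n"
)
print(signature)
```

Scanning only the tail of the message is a design choice in this sketch: it reduces the chance that a sign-off phrase quoted earlier in a thread is mistaken for the boundary.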

Extracting and Tiering Key Fields

Once the signature is isolated, the process extracts specific fields: phone numbers, LinkedIn URLs (specifically the /in/ format, since the /pub/ format is deprecated), website addresses, job titles, and company names. The founder emphasizes the strategic value of classifying job titles into tiers: C-suite, VP, Director, Manager, and Individual Contributor (IC). This classification turns raw title strings into routing signals, making the extracted data more useful for sales and marketing automation within a CRM.
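A rough sketch of field extraction and title tiering might look like this. Every regex, the tier keywords, and the `extract_fields` and `classify_title` names are assumptions for illustration; the post's own patterns are not reproduced, and real phone and title formats vary far more than this covers.

```python
import re
from typing import Dict, Optional

# Illustrative patterns only (assumptions, not the original post's regexes).
PHONE_RE = re.compile(
    r"(?:\+?\d{1,3}[ .-]?)?(?:\(\d{2,4}\)|\d{2,4})[ .-]?\d{3,4}[ .-]?\d{3,4}"
)
LINKEDIN_RE = re.compile(
    r"(?:https?://)?(?:www\.)?linkedin\.com/in/[\w%-]+/?", re.IGNORECASE
)
URL_RE = re.compile(r"(?:https?://|www\.)[^\s<>|,]+", re.IGNORECASE)

# Checked in order; anything unmatched is treated as an individual contributor.
TITLE_TIERS = [
    ("c_suite", re.compile(
        r"\b(?:chief\s+[\w\s]+?officer|ceo|cfo|cto|coo|cmo|cio|founder"
        r"|(?<!vice )president)\b", re.IGNORECASE)),
    ("vp", re.compile(r"\b(?:vp|svp|evp|vice president)\b", re.IGNORECASE)),
    ("director", re.compile(r"\b(?:director|head of)\b", re.IGNORECASE)),
    ("manager", re.compile(r"\b(?:manager|lead)\b", re.IGNORECASE)),
]


def _match_tier(text: str) -> Optional[str]:
    for tier, pattern in TITLE_TIERS:
        if pattern.search(text):
            return tier
    return None


def classify_title(title: str) -> str:
    """Map a raw title string to a tier, defaulting to 'ic'."""
    return _match_tier(title) or "ic"


def extract_fields(signature: str) -> Dict[str, Optional[str]]:
    fields: Dict[str, Optional[str]] = dict.fromkeys(
        ["phone", "linkedin", "website", "title", "company", "tier"]
    )
    phone = PHONE_RE.search(signature)
    if phone:
        fields["phone"] = phone.group().strip()
    linkedin = LINKEDIN_RE.search(signature)
    if linkedin:
        fields["linkedin"] = linkedin.group()
    for url in URL_RE.findall(signature):
        if "linkedin.com" not in url.lower():  # first non-LinkedIn URL wins
            fields["website"] = url
            break
    for line in signature.splitlines():
        tier = _match_tier(line)
        if tier:
            # Assumed layout: "Title, Company", "Title | Company" or "Title at Company".
            parts = re.split(r"\s*(?:,|\||@|\bat\b)\s*", line.strip(), maxsplit=1)
            fields["title"] = parts[0]
            fields["company"] = parts[1] if len(parts) > 1 else None
            fields["tier"] = tier
            break
    return fields
```

One limitation of this sketch is that a title line is only recognized when it contains a tier keyword, so plain IC titles such as "Software Engineer" are not picked up by `extract_fields`, even though `classify_title` would label them `ic` when given the string directly.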

The Cross-Referencing Accuracy Boost

The founder claims that accuracy can improve from 67% to 91% by cross-referencing multiple emails from the same sender. This "trick" addresses the variability of email signatures. A quick reply from a mobile device might contain only a sender's name, while a more formal email could include a full signature with company, title, and contact details. By aggregating information across several messages from the same contact, the system can build a more complete and accurate profile, compensating for incomplete data in individual emails.
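The aggregation idea can be illustrated with a short sketch. The post does not specify how values from different emails are reconciled, so the majority-vote rule, the `merge_profiles` name, and the input shape used here are assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple

Fields = Dict[str, Optional[str]]


def merge_profiles(extractions: Iterable[Tuple[str, Fields]]) -> Dict[str, Dict[str, str]]:
    """Combine per-email extractions into one profile per sender.

    For each field, the most frequently seen non-empty value wins
    (an assumed rule; ties go to the value seen first).
    """
    votes: Dict[str, Dict[str, Counter]] = defaultdict(lambda: defaultdict(Counter))
    for sender, fields in extractions:
        for name, value in fields.items():
            if value:  # skip fields a sparse signature did not provide
                votes[sender.lower()][name][value] += 1
    return {
        sender: {name: counts.most_common(1)[0][0] for name, counts in by_field.items()}
        for sender, by_field in votes.items()
    }


profiles = merge_profiles([
    ("jane@acme.example", {"title": None, "phone": None}),  # quick mobile reply
    ("jane@acme.example", {"title": "VP of Sales", "phone": "+1 415 555 0100"}),
    ("Jane@acme.example", {"title": "VP of Sales", "company": "Acme Corp"}),
])
print(profiles["jane@acme.example"])
```

Voting per field, rather than picking the single most complete email, lets a phone number from one message and a company name from another end up in the same profile.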

What We'd Change

The central claims regarding 95% regex accuracy and the 67% to 91% improvement from cross-referencing are presented as founder reports without independent verification or primary artifacts. While the technical approach is sound, the specific performance metrics remain unbacked claims. Founders considering this playbook should validate these numbers against their own datasets.

Reliance on regex for signature extraction introduces maintenance overhead. Email signature formats evolve, and new delimiters or patterns emerge. A regex-based system requires continuous monitoring and updates to its rule set, which can become a significant operational cost over time. An LLM, while likely more expensive per inference, might adapt better to novel signature patterns without manual rule engineering.

Furthermore, the post briefly mentions a

The investor read

This signal highlights a recurring tension in data extraction: bespoke, rule-based systems versus adaptable, model-based solutions. The founder's claim of high accuracy (95% regex, 91% with cross-referencing) at near-zero cost per inference suggests a potential niche for highly optimized, domain-specific tools. For investors, this points to opportunities in vertical SaaS where predictable data structures allow for cost-efficient, non-LLM solutions. However, the long-term maintenance costs of regex rules and the scalability to diverse, international signature formats remain key considerations. An investable product would need to demonstrate robust, automated rule updates or a hybrid approach that leverages LLMs for edge cases efficiently.

Sources · how we verified

Import Email Signatures Into Your CRM With an Agent ↗

Reported by the Maya desk on Founderr Pulse’s Tactics beat. Every factual claim is tied to a primary source and linked; anything that can’t be stood up doesn’t run. Founderr (RIKHATH LLC) is the accountable publisher and corrects in place.
