Why We Rebuilt Our OCR Pipeline From Scratch
Our first OCR engine worked great on passports. Then someone uploaded a Singapore NRIC. Then an Irish PPSN card. Then a Malaysian MyKad. This is the story of how a failing Tesseract pipeline, a conversation with a trusted old buddy, and the blessing of the great Lord led us to build something we're genuinely proud of.
The Problem With "Just Use Tesseract"
When we first built FaceVault's document intelligence layer, the plan was simple: extract the MRZ (Machine Readable Zone) from passports using PassportEye, and fall back to Tesseract OCR for everything else. MRZ extraction was rock solid. Passports have that lovely two-line code at the bottom with standardised formatting. Parse it, done.
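Part of what makes MRZ parsing so solid is that the format is self-verifying: ICAO 9303 defines a check digit over each field using repeating weights of 7, 3, 1, with letters valued 10–35 and the `<` filler as 0. A minimal sketch of that arithmetic (independent of our pipeline's actual code):

```python
def mrz_check_digit(field: str) -> int:
    """ICAO 9303 check digit: weights 7,3,1 repeating;
    digits keep their value, A-Z map to 10-35, '<' counts as 0."""
    def value(ch: str) -> int:
        if ch.isdigit():
            return int(ch)
        if ch == "<":
            return 0
        return ord(ch) - ord("A") + 10

    weights = (7, 3, 1)
    return sum(value(c) * weights[i % 3] for i, c in enumerate(field)) % 10

# Fields from the ICAO specimen passport MRZ:
print(mrz_check_digit("L898902C3"))  # document number -> 6
print(mrz_check_digit("740812"))     # date of birth   -> 2
```

A failed check digit immediately tells you the OCR misread a character — which is exactly the kind of structural safety net that non-MRZ documents lack.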
Then the real world showed up.
A Singapore NRIC doesn't have an MRZ. Neither does a Malaysian MyKad, an Irish PPSN card, most EU national IDs, or the majority of driver's licences worldwide. For these documents, we were relying entirely on Tesseract — and it was struggling. The extraction would come back garbled, fields would bleed into each other, and names would get mixed up with document headers. We were seeing match rates that weren't acceptable for a product people would trust with their identity.
We tried the obvious fixes. We tweaked Tesseract's page segmentation modes. We added image preprocessing — CLAHE enhancement, adaptive thresholding, denoising. We tried different binarisation strategies. Each tweak helped a little, but the fundamental problem remained: Tesseract was giving us a wall of text with no structure, and we were using fragile regex patterns to fish out names and dates from that wall. It was a house of cards.
The Conversation That Changed Everything
I was stuck. The kind of stuck where you've been staring at OCR output logs for hours and everything looks like MRZ filler characters. So I did what any engineer does when they've exhausted their own ideas — I called a trusted old buddy.
The conversation was one of those rare ones where the solution doesn't come from a single brilliant insight, but from someone asking the right questions. "Why are you merging all the OCR text into one blob before extracting fields?" — that one hit different. "Why are you using the same preprocessed image for both engines?" — that one too.
And then, with the blessing of the great Lord and a fresh perspective, the architecture of the new pipeline crystallised. Not a tweak. Not a patch. A complete rethink of how we approach document text extraction.
The New Pipeline: Four Layers Deep
Our new OCR pipeline runs in four stages. Each stage is designed to succeed where the previous one might fail, and the confidence-weighted reconciliation at the end picks the best result for each individual field.
1 Smart Preprocessing
Before any OCR engine touches the image, we run perspective correction (flattening angled phone photos), upscale to at least 1500px wide, and then produce three different image variants — each optimised for a different engine:
- Colour-enhanced — CLAHE on the LAB luminance channel, preserving colour. Optimised for OnnxTR's neural text detection.
- Binary threshold — adaptive Gaussian thresholding that kills patterned backgrounds. Optimised for Tesseract.
- Per-channel enhanced — aggressive CLAHE on each RGB channel independently. Reveals labels that are the same colour as the card background.
This decoupling was one of the first changes we made. Previously, we were feeding the same preprocessed image to both OnnxTR and Tesseract — but these engines have completely different strengths. OnnxTR's neural network works best on colour images with natural contrast. Tesseract's LSTM engine works best on clean, high-contrast binarised text. Giving each engine the image it was designed for was an immediate accuracy boost.
2 Neural OCR with Triple Extraction
We run OnnxTR (a neural OCR engine using db_resnet50 for text detection and ParseQ for recognition) once on the full document. One pass, three extraction strategies:
- A. Label-value association — finds known labels like "SURNAME", "DATE OF BIRTH", "NATIONALITY" and extracts the value from the same line or the line directly below. This is the most reliable strategy because it knows what each piece of text represents.
- B. Spatial clustering — groups text lines by vertical proximity into logical blocks (header, upper, middle, lower regions), then applies field-specific regex within each region. A date in the upper region is probably a date of birth; a date in the lower region is probably an expiry date.
- C. Full-text regex — the catch-all. Concatenates all OCR text and runs pattern matching. Same approach as the old pipeline, but now it's the fallback rather than the primary strategy.
Results are merged with priority: labels > spatial > full-text. Each field carries a confidence score derived from OnnxTR's per-word confidence, so we know exactly how much to trust each extraction.
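The priority merge itself is simple — the subtlety is all in the extraction strategies feeding it. A minimal sketch, with field names and the `(value, confidence)` shape as assumptions for the example:

```python
def merge_extractions(labels: dict, spatial: dict, fulltext: dict) -> dict:
    """Merge the three strategies' per-field results, keeping the
    highest-priority source that produced a value.
    Each input maps field name -> (value, confidence)."""
    merged = {}
    # Iterate lowest priority first, so higher-priority sources overwrite.
    for source in (fulltext, spatial, labels):
        for field, (value, conf) in source.items():
            if value:
                merged[field] = (value, conf)
    return merged

labels   = {"surname": ("O'BRIEN", 0.95)}
spatial  = {"surname": ("OBRIEN", 0.60), "dob": ("1990-01-01", 0.80)}
fulltext = {"nationality": ("IRL", 0.50)}

merged = merge_extractions(labels, spatial, fulltext)
# Label-based surname wins; spatial and full-text fill the gaps.
```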
3 Tesseract as Specialist Backup
Tesseract only runs when OnnxTR is missing fields or reports low confidence (< 0.7) for a field. It gets the binarised image variant, and crucially, each Tesseract pass extracts fields independently rather than dumping raw text into a shared pool. This prevents cross-contamination — a common source of the garbled output we saw with the old pipeline.
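The gating decision — which fields actually warrant a Tesseract pass — can be sketched like this. The expected-field set and the 0.7 threshold come from the description above; the function and field names are illustrative:

```python
def fields_needing_backup(onnxtr: dict, threshold: float = 0.7) -> set:
    """Return the fields that should get a Tesseract pass:
    anything OnnxTR missed, plus anything it reported with
    low confidence. Input maps field -> (value, confidence)."""
    expected = {"surname", "forename", "dob", "nationality", "doc_number"}
    weak = {f for f, (_, conf) in onnxtr.items() if conf < threshold}
    missing = expected - set(onnxtr)
    return weak | missing

onnxtr = {"surname": ("O'BRIEN", 0.9), "dob": ("1990-01-01", 0.5)}
needed = fields_needing_backup(onnxtr)
# 'dob' is low-confidence, 'nationality' is missing -> both get a backup pass;
# 'surname' is confident and is skipped.
```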
4 Confidence-Weighted Reconciliation
The final step compares OnnxTR and Tesseract results field by field. High-confidence OnnxTR extractions are locked — Tesseract can't override them. Low-confidence OnnxTR results can be replaced if Tesseract found something better. Missing fields are filled by whichever engine has a result. The output is the best possible extraction across both engines for each individual field.
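The reconciliation rules above reduce to a short per-field comparison. This is a sketch under the same assumed `(value, confidence)` shape, not our production code:

```python
LOCK_THRESHOLD = 0.7  # OnnxTR results at or above this are locked

def reconcile(onnxtr: dict, tesseract: dict,
              lock: float = LOCK_THRESHOLD) -> dict:
    """Field-by-field reconciliation across both engines.
    Inputs map field -> (value, confidence)."""
    result = {}
    for field in set(onnxtr) | set(tesseract):
        o = onnxtr.get(field)
        t = tesseract.get(field)
        if o and o[1] >= lock:
            result[field] = o                      # locked: Tesseract can't override
        elif o and t:
            result[field] = t if t[1] > o[1] else o  # better engine wins
        else:
            result[field] = o or t                 # fill missing from either engine
    return result

onnxtr = {"name": ("SARAH", 0.9), "dob": ("1990-01-01", 0.4)}
tess   = {"dob": ("1990-07-01", 0.8), "nationality": ("IRL", 0.6)}
final = reconcile(onnxtr, tess)
```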
Label-Aware Extraction: Reading Like a Human
The single biggest accuracy improvement came from label-value association. Think about how you read an ID card. You don't scan every word and try to figure out which one is the name. You look for the label "SURNAME" and then read the text next to it or below it. That's exactly what our pipeline does now.
We maintain a pattern library covering labels in English, Malay, French, German, Spanish, Italian, Irish, and more. The system recognises that "NAMA PENUH" on a Malaysian MyKad means the same thing as "FULL NAME" on a UK driving licence, or "SLOINNE" on an Irish PPSN card means "SURNAME".
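In spirit, the pattern library is a mapping from label text in any supported language to a canonical field name. This is a tiny illustrative slice — the entries beyond those named in this post ("NAMA PENUH", "SLOINNE", "AINM") are my own assumed translations, and the real library uses fuzzier matching than exact lookup:

```python
# Illustrative slice of a multilingual label library (not the real data).
LABELS = {
    "SURNAME": "surname", "SLOINNE": "surname", "NOM": "surname",
    "NACHNAME": "surname", "APELLIDOS": "surname", "COGNOME": "surname",
    "FORENAME": "forename", "AINM": "forename", "GIVEN NAMES": "forename",
    "PRENOM": "forename", "VORNAME": "forename",
    "FULL NAME": "full_name", "NAMA PENUH": "full_name",
    "DATE OF BIRTH": "dob", "TARIKH LAHIR": "dob",
}

def canonical_field(text: str):
    """Map an OCR'd label (any supported language) to a canonical
    field name, or None if the text isn't a known label."""
    return LABELS.get(text.strip().upper())
```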
Critically, the system handles split name labels. Many European and Irish IDs don't have a single "FULL NAME" field — they have separate "SURNAME" and "FORENAME" labels. The old pipeline would find "SURNAME", grab the value, call it the full name, and move on. The new pipeline extracts surname and forename independently, then combines them: forename + surname = full name. Simple, but it was the difference between a failed match and a passed verification for every Irish, German, French, and many other EU documents.
SURNAME / SLOINNE
O'BRIEN
FORENAME / AINM
SARAH
→ Extracted: Sarah O'Brien
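The combination step is deliberately boring — prefer an explicit full-name field if one was extracted, otherwise join the parts. A sketch with assumed field names:

```python
def combine_name(fields: dict):
    """Build a full name from extracted fields: use a direct
    full-name extraction when present, otherwise forename + surname."""
    if fields.get("full_name"):
        return fields["full_name"]
    parts = [fields.get("forename"), fields.get("surname")]
    return " ".join(p for p in parts if p) or None

print(combine_name({"forename": "SARAH", "surname": "O'BRIEN"}))
# -> SARAH O'BRIEN
```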
The Invisible Label Problem
Just when we thought we'd cracked it, we hit another wall. The Irish PPSN card is green. A nice, official, government-issued green. And the field labels — "SURNAME", "FORENAME", "DATE OF BIRTH" — are also printed in green. A slightly different shade, barely distinguishable to the eye, and completely invisible to standard OCR preprocessing.
The actual values (the person's name, their date of birth) are printed in black. Those come through perfectly. But without the labels, our label-value extraction strategy had nothing to anchor to. It was like having a perfectly good map but no street signs.
The fix required understanding why the labels were invisible. Our standard preprocessing converts to LAB colour space and applies CLAHE (Contrast Limited Adaptive Histogram Equalisation) to the L channel — the lightness channel. This works brilliantly for most documents because text is usually darker or lighter than the background. But when the label and background have the same lightness and only differ in hue or saturation, the L channel shows zero contrast.
So the fix was a second OnnxTR pass over the per-channel enhanced variant — the aggressive per-RGB-channel CLAHE from the preprocessing stage — which surfaces differences in hue and saturation that the L channel flattens away. This second pass only triggers when the first pass finds no labels at all, so there's no performance penalty for documents that work fine with standard preprocessing. When it does trigger, it specifically targets label extraction, feeding the newly visible labels back into the same label-value association logic. The result: documents that were completely unreadable now extract cleanly.
What This Means For You
If you're a CTO or product lead evaluating KYC providers, here's what this pipeline delivers in practical terms:
Global document support
Passports (MRZ), Singapore NRIC, Malaysian MyKad, Irish PPSN, EU national IDs, driver's licences. Any document with printed text.
Automatic cross-checking
The extracted data is automatically compared against the user's self-declared information. Name, date of birth, nationality — each field is independently verified and reported in your dashboard.
No manual review required
For high-confidence extractions, the pipeline returns a definitive pass/fail without human intervention. When confidence is lower, the dashboard provides a clear side-by-side comparison for quick human review.
Privacy preserved
All processing happens on our infrastructure. No third-party OCR APIs. No data leaves your pipeline. Photos are automatically purged after your configured retention period.
If you're a developer integrating FaceVault, nothing changes in your API calls. The OCR pipeline runs automatically when a user uploads an ID document. You get the same data_match, name_match, dob_match, and nationality_match fields in the session response — they're just far more likely to be correct now.
Being Honest About Our Approach
We could have taken the easier route. Cloud OCR APIs from the big providers would have given us decent results with a fraction of the engineering effort. But that would mean sending your users' identity documents to a third party. Every passport photo, every selfie, every name and date of birth — routed through someone else's infrastructure. For a product built on the promise of privacy, that was a non-starter.
So we built our own. OnnxTR running locally on ONNX Runtime, Tesseract as a specialist backup, and a custom extraction layer that understands what identity documents actually look like. It took longer. It was harder. But now every byte of your users' data stays within FaceVault's infrastructure, processed in memory, never touching an external API, and purged on schedule.
Is it perfect? No. There will always be edge cases — damaged documents, unusual layouts, handwritten fields. But the architecture is designed to improve incrementally. Every new document type we encounter teaches the system something new, and the layered approach means we can add new extraction strategies without rewriting the pipeline.
Sometimes the best engineering comes from the simplest realisations: give each tool the input it was designed for, read documents the way a human would, and trust the math to pick the best answer. The rest is just persistence — and having good people to talk to when you're stuck.
Ready to try it?
Create a free account and verify your first document in under 10 minutes.
Get started