How FaceVault Verifies a Face in Under 30 Seconds
Most KYC providers treat their AI pipeline like a trade secret. We think the opposite. Here's exactly how FaceVault matches a selfie against an ID document — the models, the math, and the engineering decisions behind every verification.
The Verification Pipeline
When a user taps "Verify" in your app, a cascade of five AI models fires in sequence. Each layer has a specific job. Each layer can reject the session before the expensive models even run. The entire pipeline completes in 10–30 seconds on a single CPU core.
MediaPipe FaceLandmarker
Maps 478 3D face landmarks on the selfie. No face? Rejected immediately.
OpenCV Haar Cascade
Detects the face on the ID document. Tolerant of small, printed faces.
ArcFace Neural Network
Encodes both faces into 512-dimensional vectors. Compares them in angular space.
MRZ + OCR Extraction
Reads the machine-readable zone or runs Tesseract OCR to extract name, DOB, nationality.
Liveness & Anti-Spoofing
Head-turn sequence runs during selfie capture. Server-side multi-signal analysis (depth, rPPG, GAN texture) adds a second layer.
This isn't a black box. Each of these models is a published, peer-reviewed piece of research. Let's tear them apart.
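In code, the early-exit behaviour described above looks roughly like this. The stage names, session shape, and `Rejection` type are illustrative, not FaceVault's actual implementation:

```python
class Rejection(Exception):
    """Raised by any stage to stop the pipeline early."""

def detect_selfie_face(session):
    # Cheapest check first: is there a face at all?
    if not session.get("selfie_landmarks"):
        raise Rejection("No face detected. Please look directly at the camera.")
    return session

def run_pipeline(session, stages):
    # Stages run in order; an early rejection skips the expensive models.
    for stage in stages:
        session = stage(session)
    return session

# A session with no detected landmarks never reaches the later stages.
try:
    run_pipeline({"selfie_landmarks": []}, [detect_selfie_face])
except Rejection as e:
    message = str(e)
```

The point of the structure is that each layer pays for itself: a bad upload costs one cheap check, not a full ArcFace pass.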
Layer 1: MediaPipe Face Landmarker
Before we do anything expensive, we need to answer one question: is there actually a face in this selfie?
Google's MediaPipe FaceLandmarker answers that question with surgical precision. It runs two neural networks back-to-back:
BlazeFace — The Spotter
A lightweight face detector optimised for mobile GPUs. It scans the full image and outputs bounding box coordinates for every face it finds. FaceVault runs it with num_faces=1 and a confidence threshold of 0.5 — we only need one face, and we need to be sure it's there.
Face Mesh — The Mapper
Once BlazeFace finds a face, the mesh model maps 478 three-dimensional landmarks onto it. We're talking sub-millimetre precision: the bridge of the nose, the cupid's bow, the outer corner of each eyebrow. It also outputs 52 blendshape coefficients — floating-point values that describe how much the face is smiling, squinting, raising eyebrows, or opening the mouth.
If MediaPipe returns zero landmarks, the upload is rejected instantly. The user gets a message: "No face detected. Please look directly at the camera." This happens before the image even hits the GPU-heavy models, saving compute on bad uploads.
Layer 2: OpenCV Haar Cascades
The Haar cascade classifier is a 2001 algorithm. It's old enough to rent a car. And it's still one of the best tools for detecting small, printed faces on ID documents.
Published by Paul Viola and Michael Jones, the algorithm works in three stages:
Haar-Like Features
Instead of looking at raw pixels, the detector uses rectangular patterns called Haar features. Each feature computes the difference in brightness between adjacent regions. The eye region, for example, is typically darker than the cheek below it. A 24×24 detection window generates over 160,000 of these features.
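To make the idea concrete, here is a toy two-rectangle feature computed with NumPy on a synthetic 24×24 window. The window contents and region choices are invented for illustration:

```python
import numpy as np

def two_rect_feature(window, top, bottom):
    """Haar-like feature: brightness of one rectangle minus its neighbour.
    top/bottom are (row_slice, col_slice) pairs into the 24x24 window."""
    return window[bottom].sum() - window[top].sum()

# Toy 24x24 "face": a dark band where the eyes would be, lighter cheeks below.
win = np.full((24, 24), 200, dtype=np.int64)
win[6:10, 4:20] = 60   # eye region is darker

feature = two_rect_feature(
    win,
    top=(slice(6, 10), slice(4, 20)),      # eye band
    bottom=(slice(10, 14), slice(4, 20)),  # cheek band
)
# A large positive value means cheeks brighter than eyes, consistent
# with the face pattern this feature was designed to catch.
```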
Integral Images
Computing 160,000 rectangular sums per window would be impossibly slow. The integral image trick solves this: by pre-computing a running sum of all pixels above and to the left of each point, any rectangular sum reduces to exactly four lookups. Regardless of rectangle size. This is what makes Haar cascades run in real time.
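The four-lookup trick is easy to verify in a few lines of NumPy. This is a generic sketch of the technique, not FaceVault's code:

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; the zero row/column pad means
    # lookups at the top and left edges never go out of bounds.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in exactly four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# Same answer as summing the slice directly, but O(1) per rectangle.
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```

The pre-computation is a single pass over the image; after that, every one of the 160,000 features costs four array reads and three additions, whatever its size.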
AdaBoost Cascade
Not all 160,000 features matter. AdaBoost selects the most discriminative ones and arranges them into a cascade of 38 stages. The first stage uses just one feature. If a region fails stage one, it's immediately discarded — no need to evaluate the remaining 37 stages. On average, only 10 features out of 6,000+ are evaluated per sub-window. This cascade architecture is why a 25-year-old algorithm can still process an image in milliseconds.
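A toy simulation shows why the cascade is fast. The stage sizes and thresholds below are invented for illustration; real cascades learn both from training data:

```python
def cascade_evaluate(window_score, stages):
    """Run a window through cascade stages; return (is_face, features_used).
    Each stage is (n_features, threshold): the window must score above the
    threshold to survive to the next stage."""
    used = 0
    for n_features, threshold in stages:
        used += n_features
        if window_score < threshold:
            return False, used   # rejected: remaining stages never run
    return True, used

# The first stage uses a single permissive feature; later stages get
# larger and stricter, mirroring the Viola-Jones design.
stages = [(1, 0.1), (10, 0.3), (25, 0.6), (50, 0.9)]

background = cascade_evaluate(0.05, stages)  # rejected after 1 feature
face = cascade_evaluate(0.95, stages)        # survives all 86 features
```

Most of an image is background, so most windows pay the one-feature price, which is where the milliseconds-per-image figure comes from.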
FaceVault runs the Haar cascade with scaleFactor=1.1, minNeighbors=3, and a minSize of 30×30 pixels. This catches passport photos where the face might only be 40 pixels wide.
If even the Haar cascade can't find a face on the document, the pipeline doesn't hard-fail at this step: the comparison stage runs ArcFace with enforce_detection=False and will catch mismatches anyway.
Layer 3: ArcFace — The Brain
This is where the real magic happens. ArcFace (Additive Angular Margin Loss) is a face recognition model published by Deng et al. in 2019. It doesn't just detect faces — it understands them.
From pixels to vectors
ArcFace takes a face image and runs it through a deep convolutional neural network (ResNet-100 backbone). The output isn't a classification like "this is John" — it's a 512-dimensional embedding vector. A point in high-dimensional space that encodes everything about that face: bone structure, eye spacing, nose shape, jaw line.
Two photos of the same person produce vectors that are close together. Two different people produce vectors that are far apart. The genius is in how "close" is defined.
The hypersphere
ArcFace normalises every embedding vector to unit length, which constrains all face representations to the surface of a hypersphere. In this space, similarity is measured by the angle between two vectors, not the Euclidean distance. Same person? Small angle. Different person? Large angle.
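That angular comparison is just normalise-then-arccos, sketched here with NumPy on made-up 2-D "embeddings" (real ArcFace vectors have 512 dimensions):

```python
import numpy as np

def angular_distance(a, b):
    """Angle in radians between two embeddings after unit normalisation."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Clip guards against floating-point drift just outside [-1, 1].
    return np.arccos(np.clip(a @ b, -1.0, 1.0))

# Two nearby vectors (same person) vs. two distant ones (different people).
same = angular_distance(np.array([1.0, 0.10]), np.array([1.0, 0.12]))
diff = angular_distance(np.array([1.0, 0.10]), np.array([0.1, 1.00]))
assert same < diff  # small angle for a match, large angle otherwise
```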
The loss function that makes it work
L = -(1/N) Σ_i log( e^(s·cos(θ_yi + m)) / ( e^(s·cos(θ_yi + m)) + Σ_{j≠yi} e^(s·cos(θ_j)) ) )

θ_yi — the angle between the feature vector and the correct identity's weight vector
m — the additive angular margin (the secret sauce — typically 0.5 radians)
s — a scaling factor (typically 64) that controls gradient magnitude
The margin m is what makes ArcFace special. During training, it artificially increases the angle between a face and its correct class by m radians. This forces the network to learn tighter clusters — faces of the same person must be even closer together to compensate for the penalty. The result is an embedding space where:
• Intra-class distance is minimised (same person → tight cluster)
• Inter-class distance is maximised (different people → far apart)
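A toy single-sample version of the loss shows the margin at work. The angles, class count, and function shape here are invented for illustration:

```python
import numpy as np

def arcface_loss(theta_correct, thetas_other, s=64.0, m=0.5):
    """ArcFace loss for one sample: the margin m is added to the angle of
    the correct class before the softmax, penalising loose clusters."""
    logits = np.concatenate(([s * np.cos(theta_correct + m)],
                             s * np.cos(np.asarray(thetas_other))))
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# With m = 0.5, a face 0.6 rad from its class centre is scored as if it
# were 1.1 rad away, so training must pull it much tighter to win.
loose = arcface_loss(0.6, [1.4, 1.5])
tight = arcface_loss(0.2, [1.4, 1.5])
assert tight < loose
```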
ArcFace achieves 99.83% accuracy on LFW (Labeled Faces in the Wild) — the standard face recognition benchmark. That's near-perfect performance across lighting changes, ageing, facial hair, makeup, and accessories.
How FaceVault uses it
When you call /complete, FaceVault passes both the ID photo and the selfie through ArcFace via DeepFace. Each image is encoded into a 512-dimensional vector. The cosine distance between them is computed. If the distance is below the threshold — the faces match.
from deepface import DeepFace

result = DeepFace.verify(
    img1_path="id.jpg",          # ID document
    img2_path="selfie.jpg",      # Live selfie
    model_name="ArcFace",        # 512-dim embeddings
    detector_backend="opencv",   # Haar cascade for face extraction
    enforce_detection=False,     # Gracefully handle edge cases
)
# result = {
#     "verified": True,
#     "distance": 0.3142,   # Lower = closer match
#     "threshold": 0.6800,  # ArcFace default threshold
# }
The distance score is what FaceVault returns in the webhook payload as face_match_score. Lower means the faces are more similar. The threshold is ArcFace's calibrated decision boundary — if the distance is below it, the verification passes.
Layer 4: Document Intelligence
While ArcFace handles face matching, a parallel pipeline extracts structured data from the ID document. This runs in a background thread immediately after the ID photo is uploaded — by the time the user finishes their selfie, the data is already extracted.
PassportEye — MRZ Reader
Passports and many national IDs have a Machine Readable Zone (MRZ) — those two or three lines of blocky text at the bottom. PassportEye locates the MRZ region, then runs Tesseract OCR with legacy mode (--oem 0) to decode it. The MRZ encodes full name, date of birth, nationality, document number, sex, and expiry date in a standardised format with built-in check digits.
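The check-digit arithmetic is specified by ICAO Doc 9303 and is easy to reproduce. PassportEye validates these internally; this is just a minimal sketch of the scheme:

```python
def mrz_check_digit(field):
    """ICAO 9303 check digit: 7-3-1 weighting over the field, mod 10.
    Digits keep their value, A-Z map to 10-35, the filler '<' is 0."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10
        total += value * weights[i % 3]
    return total % 10

# Document number and date of birth from the ICAO 9303 specimen passport:
assert mrz_check_digit("L898902C3") == 6
assert mrz_check_digit("740812") == 2
```

A wrong check digit is a cheap, deterministic signal that the OCR misread the line, which is why the MRZ path is preferred over free-form OCR whenever it's available.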
Tesseract OCR — The Fallback
Not all IDs have an MRZ. Singapore NRICs, Malaysian MyKads, and most driver's licences don't. When MRZ extraction fails, FaceVault falls back to full-page Tesseract OCR with regex pattern matching. We detect Singapore NRIC numbers ([STFGM]\d{7}[A-Z]), Malaysian MyKad numbers (12-digit with birth date encoded), and generic document numbers — then extract dates of birth from multiple formats.
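A minimal sketch of that regex fallback, using the NRIC pattern quoted above. The MyKad regex and the `extract_ids` helper are illustrative, not FaceVault's actual rules:

```python
import re

# NRIC pattern as given in the text; the MyKad pattern is a sketch of the
# 12-digit format (YYMMDD birth date + place code + serial, dashes optional).
NRIC = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")
MYKAD = re.compile(r"\b(\d{2})(\d{2})(\d{2})-?\d{2}-?\d{4}\b")

def extract_ids(text):
    out = {}
    if m := NRIC.search(text):
        out["nric"] = m.group()
    if m := MYKAD.search(text):
        yy, mm, dd = m.group(1), m.group(2), m.group(3)
        out["mykad"] = m.group()
        out["dob_hint"] = f"{yy}-{mm}-{dd}"  # century must be inferred separately
    return out
```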
The extracted data is stored as JSON on the session and included in the webhook payload. Your backend can cross-reference it against user-provided details for an additional layer of verification.
Layer 5: Liveness Detection
The best face matching in the world is worthless if someone holds up a photo of the victim. Liveness detection solves this.
FaceVault uses active liveness integrated directly into the selfie capture step — there's no separate user action. During selfie capture, the user performs a head turn sequence tracked entirely in the browser.
The Liveness Sequence
The technical trick is surprisingly elegant. Using face-api.js (TinyFaceDetector + 68-point landmarks), we compute the yaw angle of the head by measuring the horizontal offset of the nose tip relative to the midpoint of the eyes, normalised by face width. This gives a resolution-independent rotation signal that works across any camera, any device, any distance.
noseTip = landmarks.getNose()[3]                    // Tip of nose
eyeMidX = (mean(leftEye).x + mean(rightEye).x) / 2  // Midpoint of the eyes
faceWidth = jawline[16].x - jawline[0].x            // Normaliser
yaw = (noseTip.x - eyeMidX) / faceWidth
// Positive = turned left, negative = turned right
// |yaw| > 0.08 confirms a deliberate turn
The user must hold each position for 5 consecutive detection frames (sampled every 180ms). This prevents random head movement from being counted as an intentional turn and makes it extremely difficult for a pre-recorded video to pass — the video would need to contain the exact calibrate → left → center → right sequence with the right timing.
What it defeats
✓ Printed photo attacks — a flat photo cannot produce 3D head movement
✓ Screen replay attacks — pre-recorded video won't match the exact turn sequence and timing
✓ Static deepfakes — generated images fail the head turn sequence entirely
✓ Photo holding — even holding a phone screen up to the camera fails because the "face" can't turn independently
The Engineering Trade-Offs
Building a production ML pipeline is about more than picking the best model. Here are the real engineering decisions we wrestled with.
Two detectors, one pipeline
We use MediaPipe for selfies and OpenCV for ID documents. MediaPipe gives us 478 landmarks and blendshapes — overkill for ID documents, but perfect for validating a live face. OpenCV's Haar cascade is simpler but handles the tiny, printed face on an ID document better. Using one detector for everything would mean either false rejections on IDs (MediaPipe) or weaker selfie validation (OpenCV).
Single worker, no forking
TensorFlow and MediaPipe load C++ objects into memory that don't survive fork(). That means no Gunicorn with multiple workers, no --preload flag. FaceVault runs one Uvicorn worker with models pre-warmed at startup. This is a deliberate trade-off: we sacrifice parallelism for reliability. The pipeline is CPU-bound anyway — multiple workers would just fight over the same cores.
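Concretely, that means starting the server with exactly one worker and no fork-based preloading. The module path below is a placeholder, not FaceVault's actual entrypoint:

```shell
# One Uvicorn worker, models pre-warmed at import time.
# No Gunicorn forking, no --preload: the C++ model state must not cross fork().
uvicorn app.main:app --workers 1 --host 0.0.0.0 --port 8000
```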
Resize before analysis
Modern phone cameras shoot 12+ megapixel photos. Feeding a 4032×3024 image directly into ArcFace would take 10x longer with zero accuracy benefit. FaceVault downscales selfies to a maximum of 800px on the longest edge using INTER_AREA interpolation (best for downsampling) before passing them to the face comparison pipeline.
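The target-size arithmetic is simple; here is a sketch (the helper name is ours, and the actual resize would be a cv2.resize call with INTER_AREA):

```python
def downscale_size(width, height, max_edge=800):
    """Target dimensions with the longest edge capped at max_edge,
    preserving aspect ratio. Feed the result to something like
    cv2.resize(img, size, interpolation=cv2.INTER_AREA)."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height       # already small enough: no-op
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

# A 12 MP phone photo collapses to well under 1 MP before ArcFace sees it.
assert downscale_size(4032, 3024) == (800, 600)
```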
Graceful degradation on IDs
Some ID documents genuinely don't contain a detectable face — chip-based cards with no photo visible, heavily worn documents, or cards photographed at an angle. Instead of rejecting these outright, we accept the upload and rely on ArcFace's enforce_detection=False mode. If there's any face-like region, ArcFace will find it. If there isn't, the comparison fails at the /complete step with a clear error message.
Background OCR, foreground matching
MRZ extraction and OCR run in a background thread immediately after the ID photo is uploaded. By the time the user takes their selfie and passes liveness, the document data is already extracted and written to the database. Face matching — the critical path — runs synchronously during the /complete call. This parallelism shaves 3–5 seconds off the total verification time.
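The overlap can be sketched with a plain thread; the helper functions below are stand-ins for the real OCR and matching steps:

```python
import threading

def run_background_ocr(image, results):
    # Stand-in for the PassportEye / Tesseract extraction step.
    results["document_data"] = {"source": image}

def verify(id_image, take_selfie, match_faces):
    results = {}
    # Kick off OCR the moment the ID is uploaded...
    ocr = threading.Thread(target=run_background_ocr, args=(id_image, results))
    ocr.start()
    # ...while the user spends several seconds on selfie capture and liveness.
    selfie = take_selfie()
    match = match_faces(id_image, selfie)   # critical path, synchronous
    ocr.join()                              # OCR has long since finished
    return match, results["document_data"]
```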
By the Numbers
478 — 3D face landmarks (MediaPipe)
512 — embedding dimensions (ArcFace)
99.83% — LFW accuracy (ArcFace)
38 — cascade stages (Haar detector)
<30s — end-to-end verification time
5 — AI models in the pipeline
Open by Design
Every model in FaceVault's pipeline is open-source or based on published research. MediaPipe is Apache 2.0. OpenCV is BSD. ArcFace's architecture and weights are publicly available. Tesseract is Apache 2.0. face-api.js is MIT.
We don't believe in security through obscurity. You should know exactly how your users' faces are being processed, what models are being used, and what trade-offs were made. That's what this post is about.
Want to see it in action? The full pipeline is available via our API — integrate it in 10 minutes.
References & Further Reading
ArcFace: Additive Angular Margin Loss for Deep Face Recognition — Deng et al., CVPR 2019
MediaPipe Face Landmarker — Google AI Edge documentation
OpenCV Haar Cascade Face Detection — OpenCV documentation
DeepFace — lightweight face recognition framework for Python
face-api.js — JavaScript face detection and recognition library
FaceVault API Documentation — integrate in 10 minutes