How FaceVault Verifies a Face in Under 30 Seconds
Most KYC providers treat their AI pipeline like a trade secret. We think the opposite. Here's exactly how FaceVault matches a selfie against an ID document — the models, the math, and the engineering decisions behind every verification.
The Verification Pipeline
When a user taps "Verify" in your app, a cascade of five AI models fires in sequence. Each layer has a specific job. Each layer can reject the session before the expensive models even run. The entire pipeline completes in 10–30 seconds on a single CPU core.
MediaPipe FaceLandmarker
Maps 478 3D face landmarks on the selfie. No face? Rejected immediately.
OpenCV Haar Cascade
Detects the face on the ID document. Tolerant of small, printed faces.
ArcFace Neural Network
Encodes both faces into 512-dimensional vectors. Compares them in angular space.
MRZ + OCR Extraction
Reads the machine-readable zone or runs Tesseract OCR to extract name, DOB, nationality.
Liveness & Anti-Spoofing
Head-turn sequence runs during selfie capture. Server-side multi-signal analysis (depth, rPPG, GAN texture) adds a second layer.
This isn't a black box. Each of these models is a published, peer-reviewed piece of research. Let's tear them apart.
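In code, the early-exit behaviour described above looks roughly like this. The stage names, session shape, and `Rejection` type are illustrative, not FaceVault's actual implementation:

```python
class Rejection(Exception):
    """Raised by any stage to stop the pipeline early."""

def detect_selfie_face(session):
    # Cheapest check first: is there a face at all?
    if not session.get("selfie_landmarks"):
        raise Rejection("No face detected. Please look directly at the camera.")
    return session

def run_pipeline(session, stages):
    # Stages run in order; an early rejection skips the expensive models.
    for stage in stages:
        session = stage(session)
    return session

# A session with no detected landmarks never reaches the later stages.
try:
    run_pipeline({"selfie_landmarks": []}, [detect_selfie_face])
except Rejection as e:
    message = str(e)
```

The point of the structure is that each layer pays for itself: a bad upload costs one cheap check, not a full ArcFace pass.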
Layer 1: MediaPipe Face Landmarker
Before we do anything expensive, we need to answer one question: is there actually a face in this selfie?
Google's MediaPipe FaceLandmarker answers that question with surgical precision. It runs two neural networks back-to-back:
BlazeFace — The Spotter
A lightweight face detector optimised for mobile GPUs. It scans the full image and outputs bounding box coordinates for every face it finds. FaceVault runs it with num_faces=1 and a confidence threshold of 0.5 — we only need one face, and we need to be sure it's there.
Face Mesh — The Mapper
Once BlazeFace finds a face, the mesh model maps 478 three-dimensional landmarks onto it. We're talking sub-millimetre precision: the bridge of the nose, the cupid's bow, the outer corner of each eyebrow. It also outputs 52 blendshape coefficients — floating-point values that describe how much the face is smiling, squinting, raising eyebrows, or opening the mouth.
If MediaPipe returns zero landmarks, the upload is rejected instantly. The user gets a message: "No face detected. Please look directly at the camera." This happens before the image even hits the GPU-heavy models, saving compute on bad uploads.
Layer 2: OpenCV Haar Cascades
The Haar cascade classifier is a 2001 algorithm. It's old enough to rent a car. And it's still one of the best tools for detecting small, printed faces on ID documents.
Published by Paul Viola and Michael Jones, the algorithm works in three stages:
Haar-Like Features
Instead of looking at raw pixels, the detector uses rectangular patterns called Haar features. Each feature computes the difference in brightness between adjacent regions. The eye region, for example, is typically darker than the cheek below it. A 24×24 detection window generates over 160,000 of these features.
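To make the idea concrete, here is a toy two-rectangle feature computed with NumPy on a synthetic 24×24 window. The window contents and region choices are invented for illustration:

```python
import numpy as np

def two_rect_feature(window, top, bottom):
    """Haar-like feature: brightness of one rectangle minus its neighbour.
    top/bottom are (row_slice, col_slice) pairs into the 24x24 window."""
    return window[bottom].sum() - window[top].sum()

# Toy 24x24 "face": a dark band where the eyes would be, lighter cheeks below.
win = np.full((24, 24), 200, dtype=np.int64)
win[6:10, 4:20] = 60   # eye region is darker

feature = two_rect_feature(
    win,
    top=(slice(6, 10), slice(4, 20)),      # eye band
    bottom=(slice(10, 14), slice(4, 20)),  # cheek band
)
# A large positive value means cheeks brighter than eyes, consistent
# with the face pattern this feature was designed to catch.
```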
Integral Images
Computing 160,000 rectangular sums per window would be impossibly slow. The integral image trick solves this: by pre-computing a running sum of all pixels above and to the left of each point, any rectangular sum reduces to exactly four lookups. Regardless of rectangle size. This is what makes Haar cascades run in real time.
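The four-lookup trick is easy to verify in a few lines of NumPy. This is a generic sketch of the technique, not FaceVault's code:

```python
import numpy as np

def integral_image(img):
    # ii[r, c] = sum of img[:r, :c]; the zero row/column pad means
    # lookups at the top and left edges never go out of bounds.
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def rect_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in exactly four lookups."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
# Same answer as summing the slice directly, but O(1) per rectangle.
assert rect_sum(ii, 1, 1, 3, 3) == img[1:3, 1:3].sum()
```

The pre-computation is a single pass over the image; after that, every one of the 160,000 features costs four array reads and three additions, whatever its size.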
AdaBoost Cascade
Not all 160,000 features matter. AdaBoost selects the most discriminative ones and arranges them into a cascade of 38 stages. The first stage uses just one feature. If a region fails stage one, it's immediately discarded — no need to evaluate the remaining 37 stages. On average, only 10 features out of 6,000+ are evaluated per sub-window. This cascade architecture is why a 25-year-old algorithm can still process an image in milliseconds.
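A toy simulation shows why the cascade is fast. The stage sizes and thresholds below are invented for illustration; real cascades learn both from training data:

```python
def cascade_evaluate(window_score, stages):
    """Run a window through cascade stages; return (is_face, features_used).
    Each stage is (n_features, threshold): the window must score above the
    threshold to survive to the next stage."""
    used = 0
    for n_features, threshold in stages:
        used += n_features
        if window_score < threshold:
            return False, used   # rejected: remaining stages never run
    return True, used

# The first stage uses a single permissive feature; later stages get
# larger and stricter, mirroring the Viola-Jones design.
stages = [(1, 0.1), (10, 0.3), (25, 0.6), (50, 0.9)]

background = cascade_evaluate(0.05, stages)  # rejected after 1 feature
face = cascade_evaluate(0.95, stages)        # survives all 86 features
```

Most of an image is background, so most windows pay the one-feature price, which is where the milliseconds-per-image figure comes from.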
FaceVault runs the Haar cascade with scaleFactor=1.1, minNeighbors=3, and a minSize of 30×30 pixels. This catches passport photos where the face might only be 40 pixels wide.
If even the Haar cascade can't find a face on the document, the pipeline doesn't hard-fail at this step: the comparison stage runs ArcFace with enforce_detection=False and will catch mismatches anyway.
Layer 3: ArcFace — The Brain
This is where the real magic happens. ArcFace (Additive Angular Margin Loss) is a face recognition model published by Deng et al. in 2019. It doesn't just detect faces — it understands them.
From pixels to vectors
ArcFace takes a face image and runs it through a deep convolutional neural network (ResNet-100 backbone). The output isn't a classification like "this is John" — it's a 512-dimensional embedding vector. A point in high-dimensional space that encodes everything about that face: bone structure, eye spacing, nose shape, jaw line.
Two photos of the same person produce vectors that are close together. Two different people produce vectors that are far apart. The genius is in how "close" is defined.
The hypersphere
ArcFace normalises every embedding vector to unit length, which constrains all face representations to the surface of a hypersphere. In this space, similarity is measured by the angle between two vectors, not the Euclidean distance. Same person? Small angle. Different person? Large angle.
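That angular comparison is just normalise-then-arccos, sketched here with NumPy on made-up 2-D "embeddings" (real ArcFace vectors have 512 dimensions):

```python
import numpy as np

def angular_distance(a, b):
    """Angle in radians between two embeddings after unit normalisation."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    # Clip guards against floating-point drift just outside [-1, 1].
    return np.arccos(np.clip(a @ b, -1.0, 1.0))

# Two nearby vectors (same person) vs. two distant ones (different people).
same = angular_distance(np.array([1.0, 0.10]), np.array([1.0, 0.12]))
diff = angular_distance(np.array([1.0, 0.10]), np.array([0.1, 1.00]))
assert same < diff  # small angle for a match, large angle otherwise
```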
The loss function that makes it work
L = -(1/N) Σ_i log( e^(s·cos(θ_yi + m)) / ( e^(s·cos(θ_yi + m)) + Σ_{j≠yi} e^(s·cos(θ_j)) ) )

θ_yi — the angle between the feature vector and the correct identity's weight vector
m — the additive angular margin (the secret sauce — typically 0.5 radians)
s — a scaling factor (typically 64) that controls gradient magnitude
The margin m is what makes ArcFace special. During training, it artificially increases the angle between a face and its correct class by m radians. This forces the network to learn tighter clusters — faces of the same person must be even closer together to compensate for the penalty. The result is an embedding space where:
• Intra-class distance is minimised (same person → tight cluster)
• Inter-class distance is maximised (different people → far apart)
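A toy single-sample version of the loss shows the margin at work. The angles, class count, and function shape here are invented for illustration:

```python
import numpy as np

def arcface_loss(theta_correct, thetas_other, s=64.0, m=0.5):
    """ArcFace loss for one sample: the margin m is added to the angle of
    the correct class before the softmax, penalising loose clusters."""
    logits = np.concatenate(([s * np.cos(theta_correct + m)],
                             s * np.cos(np.asarray(thetas_other))))
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])

# With m = 0.5, a face 0.6 rad from its class centre is scored as if it
# were 1.1 rad away, so training must pull it much tighter to win.
loose = arcface_loss(0.6, [1.4, 1.5])
tight = arcface_loss(0.2, [1.4, 1.5])
assert tight < loose
```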
ArcFace achieves 99.83% accuracy on LFW (Labeled Faces in the Wild) — the standard face recognition benchmark. That's near-perfect performance across lighting changes, ageing, facial hair, makeup, and accessories.
How FaceVault uses it
When you call /complete, FaceVault passes both the ID photo and the selfie through ArcFace via DeepFace. Each image is encoded into a 512-dimensional vector. The cosine distance between them is computed. If the distance is below the threshold — the faces match.
from deepface import DeepFace

result = DeepFace.verify(
    img1_path="id.jpg",          # ID document
    img2_path="selfie.jpg",      # Live selfie
    model_name="ArcFace",        # 512-dim embeddings
    detector_backend="opencv",   # Haar cascade for face extraction
    enforce_detection=False,     # Gracefully handle edge cases
)
# result = {
#     "verified": True,
#     "distance": 0.3142,   # Lower = closer match
#     "threshold": 0.6800,  # ArcFace default threshold
# }
The distance score is what FaceVault returns in the webhook payload as face_match_score. Lower means the faces are more similar. The threshold is ArcFace's calibrated decision boundary — if the distance is below it, the verification passes.
Layer 4: Document Intelligence
While ArcFace handles face matching, a parallel pipeline extracts structured data from the ID document. This runs in a background thread immediately after the ID photo is uploaded — by the time the user finishes their selfie, the data is already extracted.
PassportEye — MRZ Reader
Passports and many national IDs have a Machine Readable Zone (MRZ) — those two or three lines of blocky text at the bottom. PassportEye locates the MRZ region, then runs Tesseract OCR with legacy mode (--oem 0) to decode it. The MRZ encodes full name, date of birth, nationality, document number, sex, and expiry date in a standardised format with built-in check digits.
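The check-digit arithmetic is specified by ICAO Doc 9303 and is easy to reproduce. PassportEye validates these internally; this is just a minimal sketch of the scheme:

```python
def mrz_check_digit(field):
    """ICAO 9303 check digit: 7-3-1 weighting over the field, mod 10.
    Digits keep their value, A-Z map to 10-35, the filler '<' is 0."""
    weights = (7, 3, 1)
    total = 0
    for i, ch in enumerate(field):
        if ch.isdigit():
            value = int(ch)
        elif ch == "<":
            value = 0
        else:
            value = ord(ch) - ord("A") + 10
        total += value * weights[i % 3]
    return total % 10

# Document number and date of birth from the ICAO 9303 specimen passport:
assert mrz_check_digit("L898902C3") == 6
assert mrz_check_digit("740812") == 2
```

A wrong check digit is a cheap, deterministic signal that the OCR misread the line, which is why the MRZ path is preferred over free-form OCR whenever it's available.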
Tesseract OCR — The Fallback
Not all IDs have an MRZ. Singapore NRICs, Malaysian MyKads, and most driver's licences don't. When MRZ extraction fails, FaceVault falls back to full-page Tesseract OCR with regex pattern matching. We detect Singapore NRIC numbers ([STFGM]\d{7}[A-Z]), Malaysian MyKad numbers (12-digit with birth date encoded), and generic document numbers — then extract dates of birth from multiple formats.
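A minimal sketch of that regex fallback, using the NRIC pattern quoted above. The MyKad regex and the `extract_ids` helper are illustrative, not FaceVault's actual rules:

```python
import re

# NRIC pattern as given in the text; the MyKad pattern is a sketch of the
# 12-digit format (YYMMDD birth date + place code + serial, dashes optional).
NRIC = re.compile(r"\b[STFGM]\d{7}[A-Z]\b")
MYKAD = re.compile(r"\b(\d{2})(\d{2})(\d{2})-?\d{2}-?\d{4}\b")

def extract_ids(text):
    out = {}
    if m := NRIC.search(text):
        out["nric"] = m.group()
    if m := MYKAD.search(text):
        yy, mm, dd = m.group(1), m.group(2), m.group(3)
        out["mykad"] = m.group()
        out["dob_hint"] = f"{yy}-{mm}-{dd}"  # century must be inferred separately
    return out
```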
The extracted data is stored as JSON on the session and included in the webhook payload. Your backend can cross-reference it against user-provided details for an additional layer of verification.
Layer 5: Liveness Detection
The best face matching in the world is worthless if someone holds up a photo of the victim. Liveness detection solves this.
FaceVault uses active liveness integrated directly into the selfie capture step — there's no separate user action. During selfie capture, the user performs a head turn sequence tracked entirely in the browser.
The Liveness Sequence
The technical trick is surprisingly elegant. Using face-api.js (TinyFaceDetector + 68-point landmarks), we compute the yaw angle of the head by measuring the horizontal offset of the nose tip relative to the midpoint of the eyes, normalised by face width. This gives a resolution-independent rotation signal that works across any camera, any device, any distance.
noseTip = landmarks.getNose()[3]                    // Tip of nose
eyeMidX = (mean(leftEye).x + mean(rightEye).x) / 2  // Midpoint of the eyes
faceWidth = jawline[16].x - jawline[0].x            // Normaliser
yaw = (noseTip.x - eyeMidX) / faceWidth
// Positive = turned left, negative = turned right
// |yaw| > 0.08 confirms a deliberate turn
The user must hold each position for 5 consecutive detection frames (sampled every 180ms). This prevents random head movement from being counted as an intentional turn and makes it extremely difficult for a pre-recorded video to pass — the video would need to contain the exact calibrate → left → center → right sequence with the right timing.
What it defeats
✓ Printed photo attacks — a flat photo cannot produce 3D head movement
✓ Screen replay attacks — pre-recorded video won't match the exact turn sequence and timing
✓ Static deepfakes — generated images fail the head turn sequence entirely
✓ Photo holding — even holding a phone screen up to the camera fails because the "face" can't turn independently
The Engineering Trade-Offs
Building a production ML pipeline is about more than picking the best model. Here are the real engineering decisions we wrestled with.
Two detectors, one pipeline
We use MediaPipe for selfies and OpenCV for ID documents. MediaPipe gives us 478 landmarks and blendshapes — overkill for ID documents, but perfect for validating a live face. OpenCV's Haar cascade is simpler but handles the tiny, printed face on an ID document better. Using one detector for everything would mean either false rejections on IDs (MediaPipe) or weaker selfie validation (OpenCV).
Single worker, no forking
TensorFlow and MediaPipe load C++ objects into memory that don't survive fork(). That means no Gunicorn with multiple workers, no --preload flag. FaceVault runs one Uvicorn worker with models pre-warmed at startup. This is a deliberate trade-off: we sacrifice parallelism for reliability. The pipeline is CPU-bound anyway — multiple workers would just fight over the same cores.
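Concretely, that means starting the server with exactly one worker and no fork-based preloading. The module path below is a placeholder, not FaceVault's actual entrypoint:

```shell
# One Uvicorn worker, models pre-warmed at import time.
# No Gunicorn forking, no --preload: the C++ model state must not cross fork().
uvicorn app.main:app --workers 1 --host 0.0.0.0 --port 8000
```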
Resize before analysis
Modern phone cameras shoot 12+ megapixel photos. Feeding a 4032×3024 image directly into ArcFace would take 10x longer with zero accuracy benefit. FaceVault downscales selfies to a maximum of 800px on the longest edge using INTER_AREA interpolation (best for downsampling) before passing them to the face comparison pipeline.
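The target-size arithmetic is simple; here is a sketch (the helper name is ours, and the actual resize would be a cv2.resize call with INTER_AREA):

```python
def downscale_size(width, height, max_edge=800):
    """Target dimensions with the longest edge capped at max_edge,
    preserving aspect ratio. Feed the result to something like
    cv2.resize(img, size, interpolation=cv2.INTER_AREA)."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height       # already small enough: no-op
    scale = max_edge / longest
    return round(width * scale), round(height * scale)

# A 12 MP phone photo collapses to well under 1 MP before ArcFace sees it.
assert downscale_size(4032, 3024) == (800, 600)
```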
Graceful degradation on IDs
Some ID documents genuinely don't contain a detectable face — chip-based cards with no photo visible, heavily worn documents, or cards photographed at an angle. Instead of rejecting these outright, we accept the upload and rely on ArcFace's enforce_detection=False mode. If there's any face-like region, ArcFace will find it. If there isn't, the comparison fails at the /complete step with a clear error message.
Background OCR, foreground matching
MRZ extraction and OCR run in a background thread immediately after the ID photo is uploaded. By the time the user takes their selfie and passes liveness, the document data is already extracted and written to the database. Face matching — the critical path — runs synchronously during the /complete call. This parallelism shaves 3–5 seconds off the total verification time.
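The overlap can be sketched with a plain thread; the helper functions below are stand-ins for the real OCR and matching steps:

```python
import threading

def run_background_ocr(image, results):
    # Stand-in for the PassportEye / Tesseract extraction step.
    results["document_data"] = {"source": image}

def verify(id_image, take_selfie, match_faces):
    results = {}
    # Kick off OCR the moment the ID is uploaded...
    ocr = threading.Thread(target=run_background_ocr, args=(id_image, results))
    ocr.start()
    # ...while the user spends several seconds on selfie capture and liveness.
    selfie = take_selfie()
    match = match_faces(id_image, selfie)   # critical path, synchronous
    ocr.join()                              # OCR has long since finished
    return match, results["document_data"]
```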
By the Numbers
478 — 3D face landmarks (MediaPipe)
512 — embedding dimensions (ArcFace)
99.83% — LFW accuracy (ArcFace)
38 — cascade stages (Haar detector)
<30s — end-to-end verification time
5 — AI models in the pipeline
Open by Design
Every model in FaceVault's pipeline is open-source or based on published research. MediaPipe is Apache 2.0. OpenCV is BSD. ArcFace's architecture and weights are publicly available. Tesseract is Apache 2.0. face-api.js is MIT.
We don't believe in security through obscurity. You should know exactly how your users' faces are being processed, what models are being used, and what trade-offs were made. That's what this post is about.
Want to see it in action? The full pipeline is available via our API — integrate it in 10 minutes.
References & Further Reading
ArcFace: Additive Angular Margin Loss for Deep Face Recognition — Deng et al., CVPR 2019
MediaPipe Face Landmarker — Google AI Edge documentation
OpenCV Haar Cascade Face Detection — OpenCV documentation
DeepFace — lightweight face recognition framework for Python
face-api.js — JavaScript face detection and recognition library
FaceVault API Documentation — integrate in 10 minutes