We Made Our AI 3x Faster
by Making It Dumber
Our ML pipeline runs four neural networks per verification: face recognition, depth estimation, and two OCR models. Together they occupied 448MB of RAM and took 3–8 seconds per request. One afternoon, a single Python script, and a technique called INT8 quantization cut that to 125MB and made inference 2–3x faster. Here's exactly what we did, why it works, and what we lost (almost nothing).
The Problem: 448MB of Neural Networks on a Modest VPS
FaceVault runs on a single modest VPS. Not a GPU cluster. Not a Kubernetes fleet. One box with a handful of cores and a tight RAM budget. This is deliberate — we keep infrastructure costs minimal, and it forces us to be efficient.
Every verification loads four ONNX models:
| Model | Purpose | Size |
|---|---|---|
| w600k_r50 | ArcFace face recognition | 166 MB |
| depth_anything_vits | Monocular depth estimation (anti-spoofing) | 95 MB |
| db_resnet50 | Text detection (OCR) | 96 MB |
| parseq | Text recognition (OCR) | 92 MB |
| Total | | 448 MB |
With a tight container memory limit, those 448MB of model weights were eating a big chunk of our budget before a single request came in. Each uvicorn worker loads its own copy of the models into memory, so scaling to multiple workers would have multiplied that footprint — a non-starter without shrinking the models first.
We needed the models to get smaller. Ideally a lot smaller. Without retraining them, without switching architectures, and without losing the accuracy we'd spent weeks tuning.
What INT8 Quantization Actually Is
Neural networks store their learned knowledge as millions of floating-point numbers called weights. By default, most models use 32-bit floats (FP32) — each weight takes 4 bytes, can represent values with ~7 decimal digits of precision, and covers a range of roughly ±3.4 × 10³⁸.
INT8 quantization converts those 32-bit floats into 8-bit integers. Each weight drops from 4 bytes to 1 byte. That's a 75% reduction in model size, with a proportional reduction in the amount of data the CPU needs to move through its caches during inference.
```
FP32 weight:    0.0372914671897888  (32 bits, 4 bytes)
INT8 weight:    9                   (8 bits, 1 byte)
Scale factor:   0.00414349...
Zero point:     0
Reconstruction: 9 * 0.00414349 = 0.03729... ≈ original value
```

The trick is that neural networks are remarkably tolerant of imprecision. A face embedding doesn't need 7 decimal digits of precision per weight to distinguish your face from someone else's. The learned patterns survive quantization even when individual weights are rounded.
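The scale-factor arithmetic above can be sketched in a few lines of numpy. This is a toy per-tensor symmetric scheme of our own for illustration, not ONNX Runtime's exact implementation:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights to INT8 using one symmetric per-tensor scale."""
    scale = np.abs(weights).max() / 127.0  # one float covers the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.0372914671897888, -0.12, 0.5263], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step per weight
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Run it and the first weight quantizes to the integer 9 with a scale near 0.00414, matching the worked example above: the reconstruction lands within half a quantization step of the original.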
There are two flavors of quantization:
Dynamic quantization (what we used)
Weights are quantized offline. Activations are quantized on-the-fly during inference using the actual input data's range. No calibration dataset needed. You run a script and you're done.
Static quantization
Both weights and activations are quantized using a representative calibration dataset. Higher performance but requires collecting sample data and running a calibration pass. We'll do this eventually — dynamic was the quick win.
The Entire Script Is 9 Lines
This is not a metaphor. The quantization script that gave us 2–3x faster inference is genuinely this short:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

models = [
    ("w600k_r50.onnx", "w600k_r50_int8.onnx"),
    ("depth_anything_vits.onnx", "depth_anything_vits_int8.onnx"),
    ("db_resnet50-69ba0015.onnx", "db_resnet50-69ba0015_int8.onnx"),
    ("parseq-00b40714.onnx", "parseq-00b40714_int8.onnx"),
]

for src, dst in models:
    quantize_dynamic(src, dst, weight_type=QuantType.QInt8)
```
That's it. quantize_dynamic from ONNX Runtime reads each model, maps every FP32 weight to its nearest INT8 representation (with the scale factors stored in the model), and writes a new .onnx file. No training data. No GPU. No hyperparameter tuning. It took about 30 seconds per model.
The only extra dependency is the onnx Python package (not installed in production — we ran this inside the container, then copied the quantized models into the persistent volume). ONNX Runtime, which we already use for inference, loads INT8 models transparently. No code change needed on the inference side — just point the model path to the new file.
```python
# Before
_ARCFACE_MODEL_PATH = os.environ.get(
    'ARCFACE_MODEL_PATH', '/models/w600k_r50.onnx'
)

# After
_ARCFACE_MODEL_PATH = os.environ.get(
    'ARCFACE_MODEL_PATH', '/models/w600k_r50_int8.onnx'
)
```

We kept the originals on disk. If something goes wrong, reverting is changing one string.
Before and After
| Model | FP32 | INT8 | Reduction |
|---|---|---|---|
| w600k_r50 (ArcFace) | 166 MB | 42 MB | 75% |
| depth_anything_vits | 95 MB | 26 MB | 72% |
| db_resnet50 (OCR detection) | 96 MB | 24 MB | 75% |
| parseq (OCR recognition) | 92 MB | 33 MB | 64% |
| Total | 448 MB | 125 MB | 72% |
Three workers loading INT8 models use 375MB for weights. That's less than what a single worker used to need with FP32. This is what unlocked our move from 1 uvicorn worker to 3 — previously impossible within our original container memory limit.
What About Accuracy?
This is the question that matters. If quantization made our face matching unreliable or our OCR miss MRZ fields, the speed improvement would be worthless.
We ran our full test suite — 286 tests covering face matching, anti-spoofing, document fraud detection, OCR extraction, trust engine scoring, webhooks, billing, and end-to-end session flows. All 286 passed. No threshold adjustments needed. No new edge cases.
This isn't surprising from an information theory perspective. Here's why:
Face recognition is comparison-based
ArcFace outputs a 512-dimensional embedding vector. We compare two embeddings by cosine distance with a threshold of 0.45. Quantization shifts embeddings slightly but consistently — both the ID photo and the selfie shift by roughly the same amount. The distance between them stays stable.
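A toy experiment makes the claim concrete: push two nearby vectors through the same INT8 round-trip and the cosine distance barely moves. These are random stand-in vectors, not real ArcFace embeddings, and the round-trip helper is our own sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def int8_roundtrip(v: np.ndarray, scale: float) -> np.ndarray:
    """Simulate INT8 quantize-then-dequantize with a shared scale."""
    return np.clip(np.round(v / scale), -127, 127) * scale

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

id_photo = rng.normal(size=512)                         # stand-in embedding
selfie = id_photo + rng.normal(scale=0.3, size=512)     # same person, noisy

scale = max(np.abs(id_photo).max(), np.abs(selfie).max()) / 127.0
d_fp32 = cosine_distance(id_photo, selfie)
d_int8 = cosine_distance(int8_roundtrip(id_photo, scale),
                         int8_roundtrip(selfie, scale))
# Both vectors shift, but the distance between them is nearly unchanged
assert abs(d_fp32 - d_int8) < 0.01
```

Both distances land comfortably on the same side of a 0.45 threshold, which is the only thing the match decision cares about.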
Depth estimation is relative, not absolute
Our anti-spoofing depth signal measures whether the nose protrudes more than the ears — relative depth variation across the face. It doesn't need millimeter accuracy. The ratio between depth values survives quantization.
OCR is classification, not regression
The text detection and recognition models ultimately output discrete characters from a fixed vocabulary. The model needs to get the argmax right (which character has the highest probability), not the exact probability value. INT8 preserves the ranking between classes even when individual logits shift.
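Applying the same round-trip to logits shows why classification tolerates it — the ranking survives even when individual values shift. These are toy logits, not real parseq output:

```python
import numpy as np

def int8_roundtrip(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate INT8 quantize-then-dequantize."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Hypothetical character logits: one class narrowly beats a lookalike
logits = np.array([0.1, -2.3, 4.7, 1.2, 4.1, -0.8])
scale = np.abs(logits).max() / 127.0

q_logits = int8_roundtrip(logits, scale)
# Each logit moves by at most scale/2 (~0.018), far less than the 0.6
# gap between the top two classes — so the argmax is unchanged
assert np.argmax(q_logits) == np.argmax(logits)
```

The argmax flips only when the gap between the top two classes is smaller than the quantization noise, which for a confidently-trained model is rare.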
The Other 6 Optimizations We Shipped
Quantization was the headline, but it was part of a broader "Phase 1" optimization push. All changes were free (no infrastructure cost), took a single afternoon, and shipped without any API contract changes. Here's the full list:
More uvicorn workers
Our API was running a single Python process. Every request — health checks, status polls, dashboard queries, verifications — shared one event loop. Adding multiple worker processes lets requests run in parallel. This alone roughly triples throughput for concurrent users. One line changed in entrypoint.sh.
Higher ML concurrency
We use a threading semaphore to prevent CPU starvation from too many concurrent ML operations. The old limit was conservative. Raising it reduces queue wait time without causing contention, because INT8 models now finish faster.
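The gate itself is just a module-level threading.Semaphore around the model calls — a minimal sketch, with ML_CONCURRENCY as a hypothetical env var (our actual variable name and limit differ):

```python
import os
import threading

# Cap concurrent ML inferences so N requests can't starve the CPU;
# ML_CONCURRENCY is a hypothetical knob for this sketch
_ML_SEMAPHORE = threading.Semaphore(int(os.environ.get("ML_CONCURRENCY", "4")))

def run_inference(model_fn, *args):
    """Block until a slot frees up, then run the CPU-heavy model call."""
    with _ML_SEMAPHORE:
        return model_fn(*args)
```

Raising the limit is a one-character change; the right value is wherever queue wait stops dominating without threads fighting over cores.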
ONNX Runtime session tuning
We weren't passing any session options to ONNX Runtime. Tuning thread counts, enabling graph optimization, and turning on memory pattern reuse gave us a free 10–20% inference speedup.
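Concretely, the options look something like this — a sketch using the standard onnxruntime API; the thread counts depend on your core count, and the model path is illustrative:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Let ONNX Runtime fuse nodes and fold constants when the model loads
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Reuse memory allocation patterns across runs of the same graph
opts.enable_mem_pattern = True
# Threads for parallelism within a single operator; tune to your cores
opts.intra_op_num_threads = 2
opts.inter_op_num_threads = 1

session = ort.InferenceSession("/models/w600k_r50_int8.onnx", sess_options=opts)
```

With multiple workers on one box, capping intra-op threads matters as much as raising them — four processes each spawning a thread per core will thrash.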
Single image load for anti-spoofing
Our 12-signal anti-spoofing pipeline was calling cv2.imread() independently in each signal analyzer — 11 redundant disk reads of the same JPEG file. Now the orchestrator loads the image once and passes the numpy array to all analyzers. Saves 50–100ms per verification, which adds up at scale.
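The refactor is the classic load-once pattern. A minimal sketch with a stand-in loader in place of cv2.imread (the analyzer names here are illustrative, not our real 12 signals):

```python
import numpy as np

def load_image(path: str) -> np.ndarray:
    """Stand-in for cv2.imread: one disk read + JPEG decode."""
    return np.zeros((480, 640, 3), dtype=np.uint8)

def run_all_signals(path: str, analyzers) -> dict:
    """Decode the image once, then hand the same array to every analyzer."""
    image = load_image(path)  # the only read, no matter how many signals
    return {name: fn(image) for name, fn in analyzers}

analyzers = [
    ("brightness", lambda img: float(img.mean())),
    ("shape_ok", lambda img: img.ndim == 3),
]
results = run_all_signals("selfie.jpg", analyzers)
```

Each analyzer's signature changes from taking a path to taking an ndarray; the orchestrator owns I/O, the analyzers own math.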
Async thread pool for liveness check
The run_liveness_check() function blocks the CPU for 3–8 seconds during face comparison and anti-spoofing. It was running directly in the async event loop, blocking the entire uvicorn worker. Now it runs via asyncio.run_in_executor(), freeing the worker to handle other requests while ML churns.
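The change, sketched with a stand-in for run_liveness_check (the short sleep stands in for the real 3–8 seconds of inference):

```python
import asyncio
import time

def run_liveness_check_blocking(session_id: str) -> dict:
    """Stand-in for the CPU-bound face comparison + anti-spoofing work."""
    time.sleep(0.1)  # pretend this is seconds of ONNX inference
    return {"session": session_id, "live": True}

async def handle_request(session_id: str) -> dict:
    loop = asyncio.get_running_loop()
    # Offload to the default ThreadPoolExecutor; the event loop keeps
    # serving health checks and status polls while inference runs
    return await loop.run_in_executor(
        None, run_liveness_check_blocking, session_id
    )

result = asyncio.run(handle_request("abc123"))
```

The handler's signature and return value are unchanged — only where the blocking work executes moves, which is why this shipped as a few-line diff.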
Container memory headroom increase
With multiple workers each loading INT8 models plus image buffers and Python overhead, we increased the API container's memory allocation to give comfortable headroom. Tight but workable.
Every one of these changes was under 10 lines of code. Most were 1–3 lines. The combined effect is multiplicative, not additive — faster models × more workers × less contention × fewer wasted reads = significantly more than the sum of its parts.
By the Numbers
| Metric | Before | After | Change |
|---|---|---|---|
| Model RAM (all 4) | 448 MB | 125 MB | -72% |
| Uvicorn workers | 1 | 3 | 3x |
| Image loads per anti-spoofing pass | 12 | 1 | -92% |
| Infrastructure cost change | — | — | $0 |
| Test suite | 286 pass | 286 pass | 0 regressions |
| Est. throughput | — | — | ~3x |
Zero dollars spent. One afternoon of work. Roughly 3x the verification capacity on the same server. The single most impactful change was INT8 quantization — not because it was the cleverest optimization, but because it was the one that unlocked all the others. Without smaller models, we couldn't add workers. Without more workers, the semaphore and thread pool changes wouldn't matter.
If you're running FP32 ONNX models on CPU and haven't tried quantize_dynamic, stop reading this and go do it. It's the highest ratio of impact to effort we've ever seen in infrastructure work. One function call, 75% less RAM, 2–3x faster inference, zero cost, zero retraining, zero risk (keep the originals).
Further Reading
ONNX Runtime Quantization Documentation — the official guide to dynamic and static quantization
How FaceVault Verifies a Face in Under 30 Seconds — the full ML pipeline these models power
Deepfake Defense: An IDS/IPS for Identity Verification — the anti-spoofing pipeline we optimized
FaceVault API Documentation — integrate in 10 minutes