We Made Our AI 3x Faster
by Making It Dumber
Our ML pipeline runs four neural networks per verification: face recognition, depth estimation, and two OCR models. Together they occupied 448MB of RAM and took 3–8 seconds per request. One afternoon, a single Python script, and a technique called INT8 quantization cut that to 125MB and made inference 2–3x faster. Here's exactly what we did, why it works, and what we lost (almost nothing).
The Problem: 448MB of Neural Networks on a Modest VPS
FaceVault runs on a single modest VPS. Not a GPU cluster. Not a Kubernetes fleet. One box with a handful of cores and a tight RAM budget. This is deliberate — we keep infrastructure costs minimal, and it forces us to be efficient.
Every verification loads four ONNX models:
| Model | Purpose | Size |
|---|---|---|
| w600k_r50 | ArcFace face recognition | 166 MB |
| depth_anything_vits | Monocular depth estimation (anti-spoofing) | 95 MB |
| db_resnet50 | Text detection (OCR) | 96 MB |
| parseq | Text recognition (OCR) | 92 MB |
| Total | | 448 MB |
With a tight container memory limit, those 448MB of model weights were eating a big chunk of our budget before a single request came in. Each uvicorn worker loads its own copy of the models into memory, so scaling to multiple workers would have multiplied that footprint — a non-starter without shrinking the models first.
We needed the models to get smaller. Ideally a lot smaller. Without retraining them, without switching architectures, and without losing the accuracy we'd spent weeks tuning.
What INT8 Quantization Actually Is
Neural networks store their learned knowledge as millions of floating-point numbers called weights. By default, most models use 32-bit floats (FP32) — each weight takes 4 bytes, can represent values with ~7 decimal digits of precision, and covers a range of roughly ±3.4 × 10³⁸.
INT8 quantization converts those 32-bit floats into 8-bit integers. Each weight drops from 4 bytes to 1 byte. That's a 75% reduction in model size, with a proportional reduction in the amount of data the CPU needs to move through its caches during inference.
```
FP32 weight:    0.0372914671897888  (32 bits, 4 bytes)
INT8 weight:    9                   (8 bits, 1 byte)
Scale factor:   0.00414349...
Zero point:     0
Reconstruction: 9 * 0.00414349 = 0.03729... ≈ original value
```

The trick is that neural networks are remarkably tolerant of imprecision. A face embedding doesn't need 7 decimal digits of precision per weight to distinguish your face from someone else's. The learned patterns survive quantization even when individual weights are rounded.
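The scale-factor arithmetic above can be sketched in a few lines of numpy. This is a toy per-tensor symmetric scheme of our own for illustration, not ONNX Runtime's exact implementation:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map FP32 weights to INT8 using one symmetric per-tensor scale."""
    scale = np.abs(weights).max() / 127.0  # one float covers the whole tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct approximate FP32 values from INT8 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.0372914671897888, -0.12, 0.5263], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Reconstruction error is bounded by half a quantization step per weight
assert np.max(np.abs(w - w_hat)) <= scale / 2 + 1e-6
```

Run it and the first weight quantizes to the integer 9 with a scale near 0.00414, matching the worked example above: the reconstruction lands within half a quantization step of the original.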
There are two flavors of quantization:
Dynamic quantization (what we used)
Weights are quantized offline. Activations are quantized on-the-fly during inference using the actual input data's range. No calibration dataset needed. You run a script and you're done.
Static quantization
Both weights and activations are quantized using a representative calibration dataset. Higher performance but requires collecting sample data and running a calibration pass. We'll do this eventually — dynamic was the quick win.
The Entire Script Is 9 Lines
This is not a metaphor. The quantization script that gave us 2–3x faster inference is genuinely this short:
```python
from onnxruntime.quantization import quantize_dynamic, QuantType

models = [
    ("w600k_r50.onnx", "w600k_r50_int8.onnx"),
    ("depth_anything_vits.onnx", "depth_anything_vits_int8.onnx"),
    ("db_resnet50-69ba0015.onnx", "db_resnet50-69ba0015_int8.onnx"),
    ("parseq-00b40714.onnx", "parseq-00b40714_int8.onnx"),
]

for src, dst in models:
    quantize_dynamic(src, dst, weight_type=QuantType.QInt8)
```
That's it. quantize_dynamic from ONNX Runtime reads each model, maps every FP32 weight to its nearest INT8 representation (with the scale factors stored in the model), and writes a new .onnx file. No training data. No GPU. No hyperparameter tuning. It took about 30 seconds per model.
The only extra dependency is the onnx Python package (not installed in production — we ran this inside the container, then copied the quantized models into the persistent volume). ONNX Runtime, which we already use for inference, loads INT8 models transparently. No code change needed on the inference side — just point the model path to the new file.
```python
# Before
_ARCFACE_MODEL_PATH = os.environ.get(
    'ARCFACE_MODEL_PATH', '/models/w600k_r50.onnx'
)

# After
_ARCFACE_MODEL_PATH = os.environ.get(
    'ARCFACE_MODEL_PATH', '/models/w600k_r50_int8.onnx'
)
```

We kept the originals on disk. If something goes wrong, reverting is changing one string.
Before and After
| Model | FP32 | INT8 | Reduction |
|---|---|---|---|
| w600k_r50 (ArcFace) | 166 MB | 42 MB | 75% |
| depth_anything_vits | 95 MB | 26 MB | 72% |
| db_resnet50 (OCR detection) | 96 MB | 24 MB | 75% |
| parseq (OCR recognition) | 92 MB | 33 MB | 64% |
| Total | 448 MB | 125 MB | 72% |
Three workers loading INT8 models use 375MB for weights. That's less than what a single worker used to need with FP32. This is what unlocked our move from 1 uvicorn worker to 3 — previously impossible within our original container memory limit.
What About Accuracy?
This is the question that matters. If quantization made our face matching unreliable or our OCR miss MRZ fields, the speed improvement would be worthless.
We ran our full test suite — 286 tests covering face matching, anti-spoofing, document fraud detection, OCR extraction, trust engine scoring, webhooks, billing, and end-to-end session flows. All 286 passed. No threshold adjustments needed. No new edge cases.
This isn't surprising from an information theory perspective. Here's why:
Face recognition is comparison-based
ArcFace outputs a 512-dimensional embedding vector. We compare two embeddings by cosine distance with a threshold of 0.45. Quantization shifts embeddings slightly but consistently — both the ID photo and the selfie shift by roughly the same amount. The distance between them stays stable.
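A toy experiment makes the claim concrete: push two nearby vectors through the same INT8 round-trip and the cosine distance barely moves. These are random stand-in vectors, not real ArcFace embeddings, and the round-trip helper is our own sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def int8_roundtrip(v: np.ndarray, scale: float) -> np.ndarray:
    """Simulate INT8 quantize-then-dequantize with a shared scale."""
    return np.clip(np.round(v / scale), -127, 127) * scale

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

id_photo = rng.normal(size=512)                         # stand-in embedding
selfie = id_photo + rng.normal(scale=0.3, size=512)     # same person, noisy

scale = max(np.abs(id_photo).max(), np.abs(selfie).max()) / 127.0
d_fp32 = cosine_distance(id_photo, selfie)
d_int8 = cosine_distance(int8_roundtrip(id_photo, scale),
                         int8_roundtrip(selfie, scale))
# Both vectors shift, but the distance between them is nearly unchanged
assert abs(d_fp32 - d_int8) < 0.01
```

Both distances land comfortably on the same side of a 0.45 threshold, which is the only thing the match decision cares about.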
Depth estimation is relative, not absolute
Our anti-spoofing depth signal measures whether the nose protrudes more than the ears — relative depth variation across the face. It doesn't need millimeter accuracy. The ratio between depth values survives quantization.
OCR is classification, not regression
The text detection and recognition models ultimately output discrete characters from a fixed vocabulary. The model needs to get the argmax right (which character has the highest probability), not the exact probability value. INT8 preserves the ranking between classes even when individual logits shift.
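Applying the same round-trip to logits shows why classification tolerates it — the ranking survives even when individual values shift. These are toy logits, not real parseq output:

```python
import numpy as np

def int8_roundtrip(x: np.ndarray, scale: float) -> np.ndarray:
    """Simulate INT8 quantize-then-dequantize."""
    return np.clip(np.round(x / scale), -127, 127) * scale

# Hypothetical character logits: one class narrowly beats a lookalike
logits = np.array([0.1, -2.3, 4.7, 1.2, 4.1, -0.8])
scale = np.abs(logits).max() / 127.0

q_logits = int8_roundtrip(logits, scale)
# Each logit moves by at most scale/2 (~0.018), far less than the 0.6
# gap between the top two classes — so the argmax is unchanged
assert np.argmax(q_logits) == np.argmax(logits)
```

The argmax flips only when the gap between the top two classes is smaller than the quantization noise, which for a confidently-trained model is rare.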
The Other 6 Optimizations We Shipped
Quantization was the headline, but it was part of a broader "Phase 1" optimization push. All changes were free (no infrastructure cost), took a single afternoon, and shipped without any API contract changes. Here's the full list:
More uvicorn workers
Our API was running a single Python process. Every request — health checks, status polls, dashboard queries, verifications — shared one event loop. Adding multiple worker processes lets requests run in parallel. This alone roughly triples throughput for concurrent users. One line changed in entrypoint.sh.
Higher ML concurrency
We use a threading semaphore to prevent CPU starvation from too many concurrent ML operations. The old limit was conservative. Raising it reduces queue wait time without causing contention, because INT8 models now finish faster.
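The gate itself is just a module-level threading.Semaphore around the model calls — a minimal sketch, with ML_CONCURRENCY as a hypothetical env var (our actual variable name and limit differ):

```python
import os
import threading

# Cap concurrent ML inferences so N requests can't starve the CPU;
# ML_CONCURRENCY is a hypothetical knob for this sketch
_ML_SEMAPHORE = threading.Semaphore(int(os.environ.get("ML_CONCURRENCY", "4")))

def run_inference(model_fn, *args):
    """Block until a slot frees up, then run the CPU-heavy model call."""
    with _ML_SEMAPHORE:
        return model_fn(*args)
```

Raising the limit is a one-character change; the right value is wherever queue wait stops dominating without threads fighting over cores.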
ONNX Runtime session tuning
We weren't passing any session options to ONNX Runtime. Tuning thread counts, enabling graph optimization, and turning on memory pattern reuse gave us a free 10–20% inference speedup.
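Concretely, the options look something like this — a sketch using the standard onnxruntime API; the thread counts depend on your core count, and the model path is illustrative:

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Let ONNX Runtime fuse nodes and fold constants when the model loads
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Reuse memory allocation patterns across runs of the same graph
opts.enable_mem_pattern = True
# Threads for parallelism within a single operator; tune to your cores
opts.intra_op_num_threads = 2
opts.inter_op_num_threads = 1

session = ort.InferenceSession("/models/w600k_r50_int8.onnx", sess_options=opts)
```

With multiple workers on one box, capping intra-op threads matters as much as raising them — four processes each spawning a thread per core will thrash.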
Single image load for anti-spoofing
Our 12-signal anti-spoofing pipeline was calling cv2.imread() independently in each signal analyzer — 11 redundant disk reads of the same JPEG file. Now the orchestrator loads the image once and passes the numpy array to all analyzers. Saves 50–100ms per verification, which adds up at scale.
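The refactor is the classic load-once pattern. A minimal sketch with a stand-in loader in place of cv2.imread (the analyzer names here are illustrative, not our real 12 signals):

```python
import numpy as np

def load_image(path: str) -> np.ndarray:
    """Stand-in for cv2.imread: one disk read + JPEG decode."""
    return np.zeros((480, 640, 3), dtype=np.uint8)

def run_all_signals(path: str, analyzers) -> dict:
    """Decode the image once, then hand the same array to every analyzer."""
    image = load_image(path)  # the only read, no matter how many signals
    return {name: fn(image) for name, fn in analyzers}

analyzers = [
    ("brightness", lambda img: float(img.mean())),
    ("shape_ok", lambda img: img.ndim == 3),
]
results = run_all_signals("selfie.jpg", analyzers)
```

Each analyzer's signature changes from taking a path to taking an ndarray; the orchestrator owns I/O, the analyzers own math.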
Async thread pool for liveness check
The run_liveness_check() function blocks the CPU for 3–8 seconds during face comparison and anti-spoofing. It was running directly in the async event loop, blocking the entire uvicorn worker. Now it runs via asyncio.run_in_executor(), freeing the worker to handle other requests while ML churns.
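The change, sketched with a stand-in for run_liveness_check (the short sleep stands in for the real 3–8 seconds of inference):

```python
import asyncio
import time

def run_liveness_check_blocking(session_id: str) -> dict:
    """Stand-in for the CPU-bound face comparison + anti-spoofing work."""
    time.sleep(0.1)  # pretend this is seconds of ONNX inference
    return {"session": session_id, "live": True}

async def handle_request(session_id: str) -> dict:
    loop = asyncio.get_running_loop()
    # Offload to the default ThreadPoolExecutor; the event loop keeps
    # serving health checks and status polls while inference runs
    return await loop.run_in_executor(
        None, run_liveness_check_blocking, session_id
    )

result = asyncio.run(handle_request("abc123"))
```

The handler's signature and return value are unchanged — only where the blocking work executes moves, which is why this shipped as a few-line diff.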
Container memory headroom increase
With multiple workers each loading INT8 models plus image buffers and Python overhead, we increased the API container's memory allocation to give comfortable headroom. Tight but workable.
Every one of these changes was under 10 lines of code. Most were 1–3 lines. The combined effect is multiplicative, not additive — faster models × more workers × less contention × fewer wasted reads = significantly more than the sum of its parts.
By the Numbers
| Metric | Before | After | Change |
|---|---|---|---|
| Model RAM (all 4) | 448 MB | 125 MB | -72% |
| Uvicorn workers | 1 | 3 | 3x |
| Image loads per anti-spoofing pass | 12 | 1 | -92% |
| Infrastructure cost change | — | — | $0 |
| Test suite | 286 pass | 286 pass | 0 regressions |
| Est. throughput | — | — | ~3x |
Zero dollars spent. One afternoon of work. Roughly 3x the verification capacity on the same server. The single most impactful change was INT8 quantization — not because it was the cleverest optimization, but because it was the one that unlocked all the others. Without smaller models, we couldn't add workers. Without more workers, the semaphore and thread pool changes wouldn't matter.
If you're running FP32 ONNX models on CPU and haven't tried quantize_dynamic, stop reading this and go do it. It's the highest ratio of impact to effort we've ever seen in infrastructure work. One function call, 75% less RAM, 2–3x faster inference, zero cost, zero retraining, zero risk (keep the originals).
Further Reading
ONNX Runtime Quantization Documentation — the official guide to dynamic and static quantization
How FaceVault Verifies a Face in Under 30 Seconds — the full ML pipeline these models power
Deepfake Defense: An IDS/IPS for Identity Verification — the anti-spoofing pipeline we optimized
FaceVault API Documentation — integrate in 10 minutes