Why Your TensorRT FP16 Speedup Looks Smaller Than Promised
The benchmark says 4× faster. Your end-to-end pipeline shows 1.5×. Both numbers are correct. Here's how to attribute latency to the layer where the optimization actually lives, instead of measuring the wrong wall.

You read the TensorRT FP16 quantization benchmark on a fresh YOLO checkpoint. The paper says 3.8× speedup over the FP32 PyTorch baseline. You quantize your own fine-tuned detector, you wire it into your pipeline, and you measure end-to-end frame latency. You see 1.5× — maybe 2× on a good run.
You think you did something wrong. You haven't. You measured the wrong wall.
The two latencies that get conflated
Almost every real-time vision pipeline has at least two stages downstream of the camera:
- Predict — run the detector on the current frame, get bounding boxes + class IDs + confidence scores
- Track — match those detections against the previous frame's tracks, update IDs, handle births and deaths
If you've added a vision-language model, a downstream classifier, a feature extractor, or any post-processing, those become stages 3, 4, 5. Each one has its own latency distribution, its own bottleneck, and its own dependency on the hardware path.
TensorRT FP16 quantization affects exactly one of those stages: the detector. It cuts the predict latency by a factor that depends on the model, the batch size, and the GPU. On a fine-tuned YOLO26n at 640×640 on a Jetson Orin AGX 64GB at MODE_30W, you'll see something like:
- PyTorch FP32: ~25-30ms per frame for predict alone
- TensorRT FP32: ~12-15ms (~2× over PyTorch FP32 — the engine optimizations alone)
- TensorRT FP16: ~6-8ms (~4× over PyTorch FP32 — the engine plus the quantization)
That 4× speedup is real. It's just localized to the predict layer.
What happens when you measure end-to-end
Once you add the tracker, your pipeline latency is the sum:
total = predict + track + post-process + i/o
Take the same Jetson, the same detector, a BoT-SORT-noreid tracker, and a modest JSONL output writer. With the FP32 PyTorch baseline, you might see:
predict: 28ms
track: 16ms
i/o: 3ms
total: 47ms (21 FPS)
Switch to TensorRT FP16:
predict: 7ms
track: 16ms
i/o: 3ms
total: 26ms (38 FPS)
The predict layer dropped from 28ms to 7ms — that's the 4× speedup the benchmark promised. But end-to-end you got 47ms → 26ms — about 1.8×. The tracker, the I/O, the kernel launches, the bookkeeping — none of those got faster. They couldn't. TensorRT didn't touch them.
This is the moment most engineers conclude "TensorRT FP16 isn't worth it." They're wrong. They're measuring the wrong wall. The predict layer is 4× faster; the rest of the pipeline is exactly as fast as it was. The end-to-end speedup is determined by how much of the original budget the predict layer consumed — and that's a property of the rest of the pipeline, not the optimization.
The attribution rule
The structural insight: when you optimize one stage of a pipeline, you need to measure that stage in isolation, not the whole pipeline.
Per-stage profiling tells you whether the optimization worked. End-to-end profiling tells you whether it mattered enough.
These are two different questions. Conflating them produces wrong conclusions in both directions.
The conclusion you can draw from per-stage timing: "TensorRT FP16 cut my detector from 28ms to 7ms — the quantization works on this hardware." That's the question the FP16 paper answers. It's also the question that lets you decide whether to ship the FP16 engine vs. keep the FP32 PyTorch model.
The conclusion you can draw from end-to-end timing: "My pipeline went from 21 FPS to 38 FPS — I'm now under the 33ms / 30 FPS budget." That's the operational question. It depends on every stage, not just the one you optimized.
A team that only measures end-to-end will dismiss real optimizations because they don't move the headline number enough. A team that only measures per-stage will ship optimizations that didn't help the operational metric. You need both.
Profiling the right way
The minimum measurement infrastructure to get this right on an edge box:
import time
from collections import defaultdict

# predictor, tracker, write_jsonl, and frames are your own pipeline pieces;
# the only instrumentation added here is the perf_counter bracketing per stage.
latencies = defaultdict(list)

for frame in frames:
    t0 = time.perf_counter()
    detections = predictor(frame)   # detector inference: the stage TensorRT touches
    t1 = time.perf_counter()
    tracks = tracker(detections)    # association and track management
    t2 = time.perf_counter()
    write_jsonl(tracks)             # output I/O
    t3 = time.perf_counter()

    latencies['predict'].append(t1 - t0)
    latencies['track'].append(t2 - t1)
    latencies['io'].append(t3 - t2)
    latencies['total'].append(t3 - t0)
Run that across 500-1000 frames. Compute the p50, p95, and p99 for each stage. Plot them as stacked bars. Now you can see what owns the latency budget at each percentile.
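A minimal sketch of that summary step, assuming the latencies dict from the loop above and NumPy available on the box (the stacked-bar plot is left out):
import numpy as np

# Summarize each stage's distribution; latencies holds per-frame seconds.
for stage, samples in latencies.items():
    p50, p95, p99 = np.percentile(samples, [50, 95, 99])
    print(f"{stage:>8}: p50={p50*1e3:6.1f}ms  p95={p95*1e3:6.1f}ms  p99={p99*1e3:6.1f}ms")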
Two patterns to look for:
Bottleneck shift. Before optimization, predict owned the budget. After optimization, track owns it. That's the signal that the next optimization should target the tracker, not the detector. This is the right time to revisit your tracker choice (a slower tracker that was acceptable when predict cost 28ms might be unacceptable now that predict costs 7ms).
Long-tail pattern. p99 is much higher than p95 for one stage. That's an outlier signature — usually I/O, sometimes GPU contention from another process, sometimes a thermal-throttle event. Always investigate. A pipeline that meets its frame budget at p95 but blows it at p99 will drop frames on real workloads.
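A rough way to flag that signature automatically, reusing the same latencies dict; the 2× ratio is an arbitrary starting threshold, not a standard:
import numpy as np

# Flag stages whose p99 sits far above p95: the long-tail signature.
for stage, samples in latencies.items():
    p95, p99 = np.percentile(samples, [95, 99])
    if p99 > 2 * p95:  # arbitrary threshold; tune it for your pipeline
        print(f"long tail in '{stage}': p95={p95*1e3:.1f}ms, p99={p99*1e3:.1f}ms")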
When the speedup actually delivers
The end-to-end gain from optimizing one stage is governed by Amdahl's law. If predict owns 60% of your total latency, halving predict cuts end-to-end latency by 30% (100 units drop to 70, about a 1.4× speedup). If predict owns 90% of your total latency, halving predict cuts end-to-end latency by 45% (100 drops to 55, about a 1.8× speedup). The marginal value of detector optimization is highest when the detector dominates the budget.
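The arithmetic is easy to sanity-check with a few lines; the helper below is illustrative, not taken from any benchmark:
def end_to_end_speedup(stage_fraction, stage_speedup):
    # Amdahl's law: overall speedup when one stage gets faster.
    # stage_fraction: share of total latency the optimized stage owned (0..1)
    # stage_speedup:  per-stage speedup factor (2.0 for a halving)
    return 1.0 / ((1.0 - stage_fraction) + stage_fraction / stage_speedup)

print(end_to_end_speedup(0.6, 2.0))  # ~1.43x, i.e. 100 units of latency -> 70
print(end_to_end_speedup(0.9, 2.0))  # ~1.82x, i.e. 100 units of latency -> 55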
This is why most edge AI pipelines should optimize the detector first, then re-profile, then decide whether to optimize the tracker next. By the time the detector is at 7ms, the tracker at 16ms is now the bottleneck — and that's a different optimization problem.
The mistake teams make is optimizing all stages in parallel. They put effort into TensorRT FP16, they put effort into a faster tracker, they put effort into batching the post-process. Then they ship and measure. Whatever speedup they see, they have no way to attribute. They can't decide which optimizations to roll back, which to double down on, or which were wasted effort.
Optimize one stage. Measure. Decide.
The deployment implication
TensorRT FP16 is almost always worth the effort on Jetson hardware. The quantization cost is one-time (build the engine offline; ship the engine file). The deployment cost is zero (the runtime is already on the device). The speedup on the detector is real.
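If the detector is an Ultralytics-style YOLO checkpoint, the offline build can be a one-liner. A sketch, assuming that API; the checkpoint path is a placeholder and the arguments shown are the common ones, not a complete recipe:
from ultralytics import YOLO

# One-time, offline: build the FP16 TensorRT engine from the fine-tuned weights.
# Run this on the target device (or an identical GPU and TensorRT version);
# engine files are not portable across GPU architectures or TensorRT versions.
model = YOLO("your_detector.pt")                              # placeholder checkpoint
model.export(format="engine", half=True, imgsz=640, device=0) # writes your_detector.engine

# At runtime, the pipeline loads the engine file directly:
# predictor = YOLO("your_detector.engine")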
But — and this matters for setting customer expectations — the headline benchmark number is a per-stage number. When you tell a stakeholder "we got 4× speedup from quantization," they hear "the whole pipeline is 4× faster." It isn't. The predict stage is 4× faster. The rest of the pipeline is whatever it was.
The honest way to frame the result:
The detector inference dropped from 28ms to 7ms per frame, freeing 21ms of frame budget. With the existing tracker at 16ms, the pipeline now meets a 30 FPS target with roughly 7ms of headroom. Without the quantization, the pipeline was running at 21 FPS, missing the 33ms frame budget by about 14ms per frame.
That's a measurable operational outcome. "4× faster on the detector" is a benchmark number. "We now meet 30 FPS at 13 watts" is the deployment number. Both are useful. They're not the same.
What this generalizes to
The predict-vs-track attribution pattern shows up everywhere in pipelined systems:
- Database query optimization. Speeding up the query doesn't speed up the application if the application is bottlenecked on serialization or network round-trips.
- Network compression. Gzipping the response doesn't help if the client renders synchronously on a slow device.
- Caching. Hitting the cache doesn't help if the rest of the request lifecycle dominates the wall clock.
The pattern: when you optimize one stage of a pipelined workload, measure that stage in isolation before you measure end-to-end. The two numbers answer different questions. Conflating them gets you wrong conclusions about whether the optimization worked, whether it helped, and what to do next.
Per-stage timing is cheap. Four time.perf_counter() calls and a dict. There's no excuse not to have it on every real-time pipeline you ship.
Related reading:
- The Four-Tracker Spectrum: Picking the Right Multi-Object Tracker for Edge Vision — what's downstream of the predict layer
- When You Have to Vendor: Pip-Install Reality on Legacy Edge Hardware — the deployment context