Estimating Range From a Single Camera: The Math That Replaces LIDAR for Most Cases
You don't always need a LIDAR or a stereo rig to know how far away something is. If you know what the object is, what it looks like, and what your camera's focal length is, you can recover range to within 5-10% from a single image. Here's how — and when to use it.

There's an instinct in computer vision projects to assume that any time you need to know "how far away is that thing," you need a depth sensor. Stereo cameras, LIDAR, time-of-flight, structured light — pick your physics, pay your hardware bill, ship the box.
That instinct is wrong for a surprising number of real applications. If you already know what the object is — its category, its rough physical dimensions, its expected shape — you can recover range from a single camera image to within 5-10% accuracy. The math is high-school geometry. The tradeoff is that it only works when you know what you're looking at. For applications where the detector is already classifying the targets, that condition is usually already met.
Here's the math, the calibration setup, and the boundary conditions where this approach is the right call.
The basic equation
A camera with focal length f (in pixels) imaging an object of known real-world size H (meters) at distance Z (meters) projects an image of pixel size h:
h = f * H / Z
Solve for Z:
Z = f * H / h
That's it. Given:
- f: the camera's focal length in pixels (from intrinsic calibration)
- H: the object's known real-world height in meters (from a known class)
- h: the object's pixel height in the image (from the detector)
You get Z, the range in meters. No depth sensor required.
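In code, the whole estimator is a few lines. A minimal sketch (the function name and the worked numbers are mine, for illustration):

```python
def range_from_height(f_px: float, H_m: float, h_px: float) -> float:
    """Estimate range in meters from focal length (pixels), known
    object height (meters), and measured pixel height of the detection."""
    if h_px <= 0:
        raise ValueError("pixel height must be positive")
    return f_px * H_m / h_px

# A 1.75 m person imaged 140 px tall by a camera with f = 800 px:
# 800 * 1.75 / 140 = 10.0 m
print(range_from_height(800.0, 1.75, 140.0))  # -> 10.0
```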
The math is exact when the object is exactly the size you assume and the camera's optical center is perfectly aligned with the target. In practice neither is true, and the error sources are what determine whether this approach is usable.
Where the error comes from
Three sources of error dominate, in order of impact:
Object size variation. If you're identifying "a person standing upright" and using H = 1.75m, your actual targets will range from 1.55m (small adult) to 1.95m (tall adult). A 1.55m person whose measured pixel height you interpret through the 1.75m prior will be reported as roughly 13% farther away than they actually are, because the estimate scales with the assumed height (1.75/1.55 ≈ 1.13).
For known-shape rigid objects (vehicles, aircraft, equipment), the size variation is much smaller. A Cessna 172 wingspan is 11.0m ± 0.05m across the fleet. A school bus is 12.2m ± 0.3m. The smaller the per-class size variance, the more accurate the range estimate.
Pixel measurement error. The detector's bounding box isn't exact. A 32-pixel-tall detection might actually represent a target whose true pixel height is anywhere from 30 to 35 pixels — depending on the detector's localization accuracy, the target's pose, the lighting, and where the bounding box draws its edges (tight crop vs. loose crop).
This error scales linearly with range: a 2-pixel localization error on a 100-pixel target is 2%; the same error on a 20-pixel target is 10%; on a 4-pixel target it's 50%. Small distant targets have terrible range estimates from monocular vision. Large near targets have excellent ones.
Focal length calibration error. If your camera's focal length is reported as 800 pixels but is actually 820, every range estimate is off by about 2.5%. Camera intrinsics from datasheets are notoriously imprecise; intrinsics from on-board EEPROM are better; intrinsics from per-board calibration with a checkerboard are best. Most cameras need per-unit calibration, consumer-grade ones especially.
Combined, the typical error budget on a well-calibrated camera tracking a known target class is:
- ±5-7% for rigid objects of known dimensions (vehicles, aircraft, fixed structures)
- ±10-15% for humans (more size variance)
- ±20-40% for arbitrary objects in unconstrained scenes
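To see how the terms combine, here's a sketch that root-sum-squares the three relative error sources. The independence assumption and the example numbers are mine, chosen to land inside the budget above, not measured figures:

```python
import math

def range_error_budget(size_sigma: float, px_error: float,
                       h_px: float, focal_sigma: float) -> float:
    """Approximate relative range error, assuming the three sources
    are independent and small enough to combine in quadrature."""
    return math.sqrt(size_sigma**2 + (px_error / h_px)**2 + focal_sigma**2)

# Rigid vehicle: 3% size spread, 2 px box error on a 50 px target,
# 1% focal calibration error -> ~0.05, i.e. about +/-5%
print(range_error_budget(0.03, 2.0, 50.0, 0.01))

# Person: 6% size spread, same box error on a 20 px target
# -> ~0.12, i.e. about +/-12%
print(range_error_budget(0.06, 2.0, 20.0, 0.01))
```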
For comparison, a consumer-grade LIDAR gives you ±2-5cm absolute (much better at short range, much worse than the percentage estimates at long range). Monocular ranging is worse at short range, comparable at medium range, and often better at long range where LIDAR returns get sparse.
When this approach is the right call
The decision tree:
Use monocular range estimation when:
- You're already running a detector that classifies targets — the classification is free.
- Your target classes have small physical size variance (vehicles, aircraft, signs, equipment).
- The range envelope you care about puts the target at 30+ pixels height. (At 30 pixels, you're getting reasonable estimates. At 100 pixels, you're getting good ones.)
- You can afford ±5-15% range error in your downstream consumer (track management, threat prioritization, alerting).
- You're power, weight, or cost-constrained — a single camera + a detector running on existing compute is enormously cheaper than adding a LIDAR.
Use a real depth sensor when:
- You need centimeter-level accuracy at short range (collision avoidance, gesture recognition, pick-and-place robotics).
- You're imaging arbitrary, unclassified objects.
- Your scene has a lot of clutter and you need depth-based segmentation.
- You're going to do depth-aware fusion downstream (mesh reconstruction, 3D semantic mapping, dense scene understanding).
- You're operating in environments where the targets aren't visually distinct enough to classify reliably.
For most aerial surveillance, vehicle traffic monitoring, perimeter sensing, agricultural inspection, and a lot of industrial visual workflows — monocular ranging is the right tradeoff. For autonomous driving stacks, robotics manipulation, and dense scene understanding — depth sensors earn their cost.
Calibration: the part that actually matters
The math is easy. The calibration is where range estimation lives or dies.
Focal length in pixels. Don't trust the spec sheet. For each camera, run OpenCV's cv2.calibrateCamera() against a checkerboard of known square size. Capture at least 20 images covering the full field of view. The output is your camera matrix; the relevant entry is fx (and fy: the two should be approximately equal for a square-pixel sensor, and if they're meaningfully different, use the one matching the dimension you measure, fy for object height, fx for object width).
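A minimal calibration sketch with OpenCV; the pattern dimensions, square size, and image directory are assumptions for illustration:

```python
import glob
import cv2
import numpy as np

PATTERN = (9, 6)      # inner-corner count of the printed checkerboard
SQUARE_SIZE = 0.025   # meters per square (assumes a 25 mm print)

# 3D corner coordinates in the board's own frame (z = 0 plane)
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_SIZE

obj_points, img_points = [], []
for path in glob.glob("calib/*.png"):  # hypothetical capture directory
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

rms, K, dist, _, _ = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print(f"RMS reprojection error: {rms:.3f} px")
print(f"fx = {K[0, 0]:.1f} px, fy = {K[1, 1]:.1f} px")
```

A sub-pixel RMS reprojection error is a reasonable sanity check that the fx you feed into the range math is trustworthy.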
For long focal length lenses (telephoto), calibration matters less: the narrow field of view keeps distortion small and the pinhole model holds well. For short focal length lenses (wide-angle, fisheye), calibration matters a lot, and you may need to undistort the image (or the detection coordinates) before applying the simple math, because the simple math assumes a pinhole camera.
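On a wide lens, one way to keep the simple pinhole math is to undistort just the detection's corner points rather than the whole image. A sketch, assuming K and dist come from a calibration like the one above:

```python
import numpy as np
import cv2

def undistorted_pixel_height(box, K, dist):
    """box = (x1, y1, x2, y2) in distorted image coordinates.
    Maps the box's top and bottom corners through the lens model and
    returns the pixel height the pinhole equation expects."""
    pts = np.array([[[box[0], box[1]]], [[box[2], box[3]]]], np.float32)
    # P=K keeps the output in pixel coordinates of an ideal pinhole camera
    undist = cv2.undistortPoints(pts, K, dist, P=K)
    return float(abs(undist[1, 0, 1] - undist[0, 0, 1]))
```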
Per-class size priors. For each object class your detector produces, document the assumed physical size. Pick the median, not the mean — the median is more robust to a few outlier instances. Track the size variance per class in your design documentation: classes with low variance (commercial aircraft, fleet vehicles) get good range; classes with high variance (people, animals) get rough range; classes with no useful prior (debris, novel objects) get no range estimate at all.
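In practice this ends up as a per-class table living next to the detector's label map. A sketch with illustrative numbers (the medians and spreads are placeholders, not survey data):

```python
# class name -> (median characteristic dimension in meters, relative spread)
# None means no useful prior: emit no range estimate for that class.
SIZE_PRIORS = {
    "person":     (1.75, 0.06),   # height; high variance -> rough range
    "school_bus": (12.2, 0.03),   # length; low variance
    "cessna_172": (11.0, 0.005),  # wingspan; near-zero fleet variance
    "debris":     None,
}

def size_prior(cls: str):
    """Return (median_m, relative_spread) or None if no prior exists."""
    return SIZE_PRIORS.get(cls)
```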
Pose correction. The object's pixel extent depends on its orientation relative to the camera. A car viewed broadside fills a bounding box whose width corresponds to its roughly 4.5m length; the same car viewed end-on fills a box whose width corresponds to its roughly 1.8m width, while its roughly 1.5m height stays nearly pose-invariant. If your detector predicts a 2D bounding box without orientation, you're conflating these dimensions.
The mitigation: either restrict the math to a dimension that's pose-invariant (e.g., aircraft wingspan from above), use a detector that predicts orientation (oriented bounding boxes, often called OBB), or carry a known confidence interval that absorbs the pose variation.
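The interval option is cheap to implement: bracket the range with the smallest and largest physical extent the class could present along the measured axis. A sketch, with illustrative car dimensions:

```python
def range_interval(f_px: float, px_extent: float,
                   dim_min_m: float, dim_max_m: float):
    """Range bracket when pose is unknown: the measured pixel extent
    could correspond to anything between the object's smallest and
    largest physical dimension along that axis."""
    return (f_px * dim_min_m / px_extent, f_px * dim_max_m / px_extent)

# Car with unknown heading, measured 60 px wide, f = 800 px:
# the 60 px could span 1.8 m (end-on) up to ~4.5 m (broadside).
lo, hi = range_interval(800.0, 60.0, 1.8, 4.5)
print(lo, hi)  # 24.0 60.0 -> the target is somewhere in [24 m, 60 m]
```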
Atmospheric / optical effects. At long range, atmospheric refraction and lens blur start to shrink the apparent target size. The pixel height of a target at 5km isn't quite f*H/5000 — it's slightly less, because the image is degraded. This is a small correction in clear air but matters more in haze, rain, or at low elevation angles where you're looking through a lot of atmosphere.
Stacking the technique with tracking
The single-frame range estimate is noisy. A 10% range error, redrawn every frame, looks like a target dithering in and out of position. The fix is to stack the technique with tracking and a Kalman filter.
The detector produces a bounding box per frame. The range estimate is a noisy measurement of true range. The tracker maintains a state estimate that includes range and range-rate, and the Kalman filter blends new measurements with the predicted state. Over 5-10 frames, the noise averages out and you get a stable range trajectory.
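A minimal constant-velocity Kalman filter over range alone shows the pattern. The state is [range, range-rate]; the noise values here are illustrative and should be tuned against your detector:

```python
import numpy as np

dt = 1 / 30.0                          # frame period (assumes 30 fps)
F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity motion model
H = np.array([[1.0, 0.0]])             # we measure range only
Q = np.diag([0.01, 0.1])               # process noise (illustrative)

x = np.array([[100.0], [0.0]])         # initial state: 100 m, stationary
P = np.eye(2) * 10.0                   # initial state covariance

def kf_step(x, P, z_range, z_sigma):
    """One predict/update cycle. z_sigma is the per-frame measurement
    std dev, e.g. 0.10 * z_range for a ~10% monocular estimate."""
    x = F @ x                          # predict state forward one frame
    P = F @ P @ F.T + Q
    R = np.array([[z_sigma**2]])
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (np.array([[z_range]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

# Feed noisy per-frame estimates; the state settles within ~5-10 frames.
for z in [103.0, 96.0, 108.0, 99.0, 101.0]:
    x, P = kf_step(x, P, z, z_sigma=0.10 * z)
print(float(x[0, 0]))  # smoothed range estimate
```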
This is the same pattern as any single-sensor estimation problem: a noisy measurement model + a motion model + Bayesian filtering = a stable state estimate. Adding a second sensor (radar, LIDAR, AIS, ADS-B) gives you even better fusion, but the single-camera path produces usable results on its own.
The mission-design implication
When designing a sensor payload for a new system, the default assumption shouldn't be "we need a depth sensor." The default should be "we need a calibrated camera, a detector with class priors, and a Kalman filter." If that doesn't meet the mission's range accuracy requirement, then escalate to stereo, then to LIDAR, then to radar.
Each escalation adds cost, weight, power, and integration complexity. Each escalation is often unnecessary for the accuracy the mission actually requires. The teams that produce the most cost-effective sensor payloads aren't the ones with the most expensive sensors. They're the ones who measured what the mission needed and stopped at the cheapest sensor that delivered.
Monocular range estimation is one of those cheapest-sensors-that-deliver tools. It deserves to be the default starting point on any visual surveillance, monitoring, or identification system. Add more capable sensors when measurement evidence shows you need to. Don't add them by default.
Related reading:
- The Four-Tracker Spectrum: Picking the Right Multi-Object Tracker for Edge Vision — the tracker that stabilizes the noisy per-frame range estimate
- MAVLink Isn't Just for Drones: A Protocol Worth Stealing for Civil Systems — how range estimates get published downstream
- Why Your TensorRT FP16 Speedup Looks Smaller Than Promised — the detector that produces the bounding boxes range estimation depends on