IoT Edge AI Deployment Guide: Jetson Nano, Coral TPU, Hailo, and Constrained Hardware

IoT edge AI operates under constraints that make cloud or server inference engineering look straightforward. Memory is measured in single-digit gigabytes rather than hundreds. Power budgets are 5–15W instead of 300W. Hardware may run in environments ranging from −40°C to 85°C. And deployment happens at scale, across hundreds or thousands of nodes, where per-unit cost determines project viability.

The hardware platforms that dominate this space each have distinct architectures, quantization requirements, and support ecosystems. Choosing the wrong one costs months of porting work.

Hardware Platform Comparison

Platform	Peak Inference	Memory	Power (TDP)	Primary Quantization	Cost (approx.)
Jetson Nano (original)	472 GFLOPS FP16	4 GB LPDDR4	5–10W	FP16 (TensorRT; no hardware INT8 on its Maxwell GPU)	~$50–$100 module
Jetson Orin Nano	40 TOPS	4–8 GB LPDDR5	5–15W	INT8, FP16	~$150–$250 module
Google Coral TPU (Edge)	4 TOPS	N/A (host memory)	~2W (TPU only)	INT8 only	~$25–$60
Hailo-8	26 TOPS	N/A (host memory)	2.5W	INT8, INT4	~$200–$400 module
Hailo-8L	13 TOPS	N/A (host memory)	1.5W	INT8, INT4	~$100–$200 module
Intel Movidius Myriad X	4 TOPS	512 MB LPDDR4 (4 Gbit, in-package variant)	2.5W	FP16, INT8	~$50–$100

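These prices are approximate and fluctuate with supply conditions. The TOPS ratings are INT8 figures where applicable; the original Jetson Nano is the exception and is rated in FP16 GFLOPS. On platforms that support both precisions, FP16 throughput is often roughly half the INT8 TOPS figure, though the ratio varies by architecture.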

Jetson (NVIDIA)

Jetson platforms run NVIDIA’s full CUDA stack at reduced scale. TensorRT handles the inference optimization path — it accepts ONNX models and applies layer fusion, precision calibration, and kernel selection for the target hardware automatically.

INT8 quantization in TensorRT requires a calibration dataset — a representative sample of real inference inputs. TensorRT uses this to determine per-tensor quantization scales. Post-training quantization (PTQ) with TensorRT calibration typically achieves less than 1% accuracy degradation on classification models; object detection and segmentation require more careful calibration.
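
To make the idea of a per-tensor quantization scale concrete, here is a minimal sketch that derives a symmetric INT8 scale from calibration activations using a simple max-abs rule. This is an illustrative stand-in, not TensorRT's method: TensorRT's own calibrators (the entropy calibrator, for example) choose ranges more carefully. The function names are hypothetical and the calibration values are made up.

```python
def calibrate_scale(calibration_batches):
    """Return a symmetric per-tensor INT8 scale from calibration activations.

    Assumption: simple max-abs calibration, used here only for illustration.
    """
    max_abs = max(abs(v) for batch in calibration_batches for v in batch)
    if max_abs == 0.0:
        return 1.0  # degenerate tensor: any scale works
    return max_abs / 127.0


def quantize(values, scale):
    """Map floats to INT8 codes, clamping to the symmetric range [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    # Made-up activations standing in for a representative calibration set.
    calibration = [[-1.2, 0.4, 2.54], [0.9, -2.0, 1.1]]
    scale = calibrate_scale(calibration)
    sample = [0.5, -1.0, 2.54]
    codes = quantize(sample, scale)
    errors = [abs(a - b) for a, b in zip(sample, dequantize(codes, scale))]
    print("scale:", scale)
    print("codes:", codes)
    print("max round-trip error:", max(errors))
```

The sketch also shows why the calibration set has to be representative: any field input larger than the calibrated maximum is clamped, and that clipping error shows up as accuracy loss.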

FP16 is available across the Jetson hardware listed here and requires no calibration; it is simply a precision switch at engine build time. Most Jetson deployments start with FP16 (easier, good accuracy) and only move to INT8 if memory or throughput requirements demand it.

Google Coral TPU

The Coral Edge TPU is a coprocessor attached to a host CPU. It executes only INT8 models; this is not optional. A model must be fully INT8 quantized, weights and activations alike, before the Edge TPU compiler will map it to the chip. Operations the Edge TPU doesn’t support fall back to the host CPU, which can severely degrade throughput.

The workflow: train in TensorFlow, quantize with TFLite INT8 (full integer quantization, not just weights), and compile with the Edge TPU compiler. The compiler reports how many operations run on-chip vs fall back. Models not designed for Edge TPU compatibility often have 30–50% CPU fallback, which defeats the power and latency advantage.
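
The fallback effect can be sketched with a simplified model of how the compiler partitions a graph. The assumption here, based on Coral's documented behavior, is that the compiler splits the model once: the first unsupported operation and everything after it run on the CPU. Real graphs are not a flat list, and the op names and supported set below are illustrative only, not the compiler's actual list.

```python
def partition_ops(ops, supported):
    """Split an ordered op list at the first unsupported op.

    Returns (on_chip_ops, cpu_ops). Simplified linear-graph assumption.
    """
    for index, op in enumerate(ops):
        if op not in supported:
            return list(ops[:index]), list(ops[index:])
    return list(ops), []


def on_chip_fraction(ops, supported):
    """Fraction of ops that would execute on the accelerator."""
    if not ops:
        return 0.0
    on_chip, _ = partition_ops(ops, supported)
    return len(on_chip) / len(ops)


if __name__ == "__main__":
    # Hypothetical model: one custom op in the middle of the graph.
    model_ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "CONV_2D",
                 "CUSTOM_NMS", "RESHAPE", "SOFTMAX"]
    supported_ops = {"CONV_2D", "DEPTHWISE_CONV_2D", "RESHAPE", "SOFTMAX"}
    on_chip, on_cpu = partition_ops(model_ops, supported_ops)
    print("on-chip:", on_chip)
    print("cpu:", on_cpu)
    print("on-chip fraction:", on_chip_fraction(model_ops, supported_ops))
```

Note that `RESHAPE` and `SOFTMAX` land on the CPU here even though they are in the supported set, because they follow the unsupported op. Under this assumption, a single incompatible layer early in the graph costs far more than one near the output.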

In our experience, using architectures explicitly designed or validated for Edge TPU — MobileNet v2/v3, EfficientDet-Lite — achieves full on-chip execution. Generic research architectures rarely run entirely on-chip without architecture modifications.

Hailo

Hailo’s toolchain (Dataflow Compiler) is model-compiler-based: you provide an ONNX or TensorFlow model, and the compiler performs architecture-aware mapping onto the Hailo-8 dataflow array. Hailo supports INT8 and INT4 precision. INT4 enables higher throughput and lower power at the cost of accuracy, typically requiring QAT (quantization-aware training) to recover accuracy to acceptable levels.

Hailo’s architecture is fundamentally different from GPU-style execution — it’s a spatial dataflow architecture where operations are statically mapped to processor elements. This makes it extremely efficient for steady-state inference of fixed-topology models but inflexible for dynamic workloads or variable batch sizes.

How do you choose between on-device and edge-server inference?

For IoT deployments at scale, the choice between putting inference on the sensing device itself vs routing to a local edge server affects system design, cost, and maintainability.

On-device inference

Pros:

No network dependency for inference
Lowest end-to-end latency
No bandwidth cost for streaming raw sensor data

Cons:

Per-device compute cost at scale (1,000 Hailo-8 modules vs 5 edge servers)
Model updates require over-the-air update to every device
Limited to models that fit the platform’s memory and quantization constraints

Edge-server inference

Pros:

Higher compute capability (Jetson AGX Orin or full server GPU)
Aggregate feeds from many sensors
Centralized model management

Cons:

Network required between sensors and server
Single point of failure for many sensors
Additional infrastructure to manage

The break-even point is typically around 10–20 sensors per edge server, depending on sensor frame rate and model inference time. Below that ratio, on-device inference is often simpler. Above it, edge servers are more economical. This is an observed range across deployments we’ve seen, not a benchmarked rate — your actual break-even depends on model size, frame rate, and network reliability.
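
A rough way to estimate your own break-even is a hardware-only cost comparison. The sketch below assumes each sensor either carries its own accelerator or shares an edge server whose capacity is set by its measured inference throughput and the sensor frame rate. All prices and throughput figures are hypothetical placeholders, and the model ignores networking, power, and maintenance costs.

```python
import math


def sensors_per_server(server_inferences_per_s, sensor_fps):
    """How many sensor streams one server can keep up with."""
    return int(server_inferences_per_s // sensor_fps)


def on_device_cost(n_sensors, accelerator_cost):
    """Hardware cost when every sensor has its own accelerator."""
    return n_sensors * accelerator_cost


def edge_server_cost(n_sensors, server_cost, capacity):
    """Hardware cost when sensors share servers of the given capacity."""
    return math.ceil(n_sensors / capacity) * server_cost


def break_even_sensors(accelerator_cost, server_cost, capacity, max_sensors=1000):
    """Smallest sensor count at which servers cost no more than on-device."""
    for n in range(1, max_sensors + 1):
        if edge_server_cost(n, server_cost, capacity) <= on_device_cost(n, accelerator_cost):
            return n
    return None


if __name__ == "__main__":
    # Hypothetical inputs: replace with measured throughput and real quotes.
    capacity = sensors_per_server(server_inferences_per_s=600, sensor_fps=15)
    n = break_even_sensors(accelerator_cost=150, server_cost=2000, capacity=capacity)
    print("streams per server:", capacity)
    print("break-even sensor count:", n)
```

With these placeholder numbers the break-even lands at 14 sensors. Raising the frame rate or the model's inference time lowers the server capacity and can change the result, which is why the range above is only a starting point.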

Quantization Workflow Checklist for IoT Edge

Define minimum acceptable accuracy metric before starting quantization
Confirm target hardware quantization support (INT8 only? FP16? INT4?)
Collect calibration dataset (typically 100–1,000 representative inputs drawn from field conditions)
Apply PTQ first — measure accuracy delta
If PTQ accuracy is insufficient: apply QAT or knowledge distillation
Profile inference time on target hardware (not development machine)
Measure power draw under sustained load (not peak TDP from datasheet)
Test at temperature extremes if operating environment requires it
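
The PTQ-first steps in the checklist amount to a simple gate, sketched below. It assumes accuracy is a single scalar metric where higher is better, and that the minimum acceptable value was fixed before quantization started. The function name and return labels are hypothetical.

```python
def next_quantization_step(baseline_acc, quantized_acc, min_acceptable_acc):
    """Decide what to do after measuring a PTQ model.

    Returns (decision, accuracy_delta), where accuracy_delta is the
    accuracy lost relative to the unquantized baseline.
    """
    delta = baseline_acc - quantized_acc
    if quantized_acc >= min_acceptable_acc:
        return "ship_ptq", delta
    return "try_qat_or_distillation", delta


if __name__ == "__main__":
    # Made-up accuracy figures for illustration.
    print(next_quantization_step(0.921, 0.915, 0.910))
    print(next_quantization_step(0.921, 0.880, 0.910))
```

The gate compares against the pre-agreed minimum rather than the size of the delta alone, so a large drop from a very strong baseline can still pass while a small drop from a marginal baseline fails.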

For the broader comparison of distillation vs quantization approaches — and when to use each — see our analysis of distillation vs quantisation for multi-platform edge inference.

What are the common failure modes in IoT edge deployments?

Accuracy validation on the wrong data distribution. Models calibrated on lab data fail in the field when lighting, sensor variation, or product variation differs from calibration inputs.

Insufficient memory profiling. Peak memory during inference (including intermediate activations) often exceeds the model parameter size estimate. A 50 MB model file can require 200+ MB of runtime memory.
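
A back-of-the-envelope estimate shows how the gap arises. The sketch below assumes a purely sequential network in which each layer needs its input and output tensors resident at the same time, and it ignores runtime workspace and framework overhead. The tensor sizes are invented for illustration; measure on the target device for real numbers.

```python
def peak_runtime_mb(weights_mb, tensor_sizes_mb):
    """Estimate peak memory for a sequential model.

    tensor_sizes_mb[i] is the input to layer i and tensor_sizes_mb[i + 1]
    is its output; both must be resident while the layer executes.
    """
    if len(tensor_sizes_mb) < 2:
        raise ValueError("need at least an input and an output tensor")
    peak_activations = max(
        src + dst for src, dst in zip(tensor_sizes_mb, tensor_sizes_mb[1:])
    )
    return weights_mb + peak_activations


if __name__ == "__main__":
    # Hypothetical 50 MB model with large early feature maps.
    total = peak_runtime_mb(weights_mb=50, tensor_sizes_mb=[6, 96, 64, 16, 1])
    print("estimated peak MB:", total)
```

With these made-up sizes the estimate is 210 MB for a 50 MB model file, dominated by the two largest adjacent feature maps rather than by the weights.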

Ignoring update infrastructure. How do you update 500 deployed devices when the model needs retraining? This question needs an answer before deployment, not during an incident.

Assuming TOPS ratings are directly comparable. Hailo-8’s 26 TOPS, Coral’s 4 TOPS, and Jetson Orin Nano’s 40 TOPS are measured on different operations, at different precision levels, with different benchmarking assumptions. Profile your specific model on your specific hardware.
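
One way to put a datasheet figure in perspective is to compute the throughput your model actually achieves and compare it with the rating. The sketch below assumes you know the model's operation count per inference and have measured inferences per second on the target device. The example numbers are hypothetical and do not describe any real benchmark.

```python
def effective_tops(ops_per_inference, inferences_per_second):
    """Achieved tera-operations per second for a measured workload."""
    return ops_per_inference * inferences_per_second / 1e12


def utilization(ops_per_inference, inferences_per_second, rated_tops):
    """Fraction of the rated TOPS the workload actually uses."""
    return effective_tops(ops_per_inference, inferences_per_second) / rated_tops


if __name__ == "__main__":
    # Hypothetical: 8 billion ops per inference, 400 inferences/s measured,
    # on a device with a 26 TOPS datasheet rating.
    achieved = effective_tops(8e9, 400)
    share = utilization(8e9, 400, rated_tops=26)
    print(f"achieved: {achieved:.2f} TOPS, utilization: {share:.1%}")
```

In this invented case the device delivers 3.2 TOPS on the model, about 12% of its rating. Two accelerators with the same rating can differ widely on this measure for the same model, which is why it has to be measured rather than read off a datasheet.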

Lessons from the field

IoT edge AI requires hardware-specific quantization workflows that differ substantially between platforms. Coral TPU mandates full INT8 with architecture-compatible models. Jetson supports FP16 and INT8 via TensorRT, with calibration required only for INT8. Hailo uses a compiler-based mapping workflow with INT8 and INT4 support. The on-device vs edge-server architecture decision depends on the number of sensors, update requirements, and acceptable infrastructure complexity. Start with PTQ, measure accuracy on representative field data, and validate on the target hardware before committing to a deployment architecture.

Frequently Asked Questions

Which edge AI accelerator should I choose for IoT deployment?

It depends on your model architecture and quantization tolerance. Coral Edge TPU is the cheapest option and among the lowest in power, but it mandates full INT8 quantization with compatible architectures like MobileNet or EfficientDet-Lite. Jetson Orin Nano offers the most flexibility (FP16 and INT8 via TensorRT) and runs most ONNX models, at higher per-unit cost and power draw than Coral. Hailo-8 sits between them in rated throughput (higher than Coral, lower than Jetson Orin Nano) and draws far less power than a Jetson module, but it needs a host processor, and its dataflow architecture suits fixed-topology models with steady-state workloads.

Does quantization always degrade model accuracy?

Not meaningfully, if done correctly. Post-training INT8 quantization on classification models typically loses less than 1% accuracy when calibrated with a representative dataset. Object detection and segmentation are more sensitive and may require quantization-aware training (QAT) to recover accuracy. INT4 almost always requires QAT. FP16 generally costs no measurable accuracy on modern architectures.

When should I use on-device inference vs an edge server?

The decision turns on sensor count, model size, and update infrastructure. Below roughly 10–20 sensors per location, on-device inference is usually simpler — no network dependency, no single point of failure. Above that ratio, an edge server aggregating many sensor feeds onto a single Jetson AGX Orin or server GPU is typically more economical. This is an observed range, not a benchmarked rate; profile your specific workload.

Why don’t TOPS ratings predict real-world performance?

TOPS figures from different vendors are measured on different operations, at different precision levels, with different assumptions about batch size and utilization. Hailo’s 26 TOPS, Coral’s 4 TOPS, and Jetson Orin Nano’s 40 TOPS aren’t directly comparable. The only reliable measurement is your specific model running on the target hardware under realistic load — datasheet TOPS is a marketing number, not an operational one.