System-on-a-Chip for AI: Why Integration Doesn’t Eliminate the Software Stack

“Integrated” doesn’t mean “self-contained”

A system-on-a-chip integrates compute, memory controllers, and accelerator blocks (neural processing units, GPU blocks, sometimes specialised ASIC blocks for vision or audio) onto a single die. The integration is real and substantial: data does not have to traverse a board-level interconnect to move between the CPU and the AI accelerator block, which removes a class of bottlenecks that constrain discrete-accelerator systems. It is reasonable to expect this integration to change the performance reasoning that applies to AI workloads on the device. It is also common to overshoot that expectation.

What the integration does not do is collapse the AI Executor, the hardware together with the software stack that drives it, into the silicon. The software stack (drivers, runtime, kernels, framework backends for the SoC’s specific accelerator block) is still the half of the executor without which the hardware does nothing. If anything, SoC integration makes the software stack more decisive, not less. The accelerator block is vendor-specific and physically tied to that SoC, so the software stack that targets it is the only one that can extract its performance. There is no second-source ecosystem to fall back on, and no neutral benchmark that strips the stack out.

This is the same principle that holds for discrete accelerators, where performance emerges from the hardware and software together, only in concentrated form. We see it consistently when teams move workloads from a familiar discrete-GPU environment onto an embedded or edge SoC and discover that the variance they treated as noise on the desktop is now the first-order term in their measurements.

What changes and what doesn’t with SoC integration

The clearest way to show which parts of the reasoning shift and which do not is a side-by-side comparison. The table below is written to stand on its own, so it should still make sense when lifted out of this article.

Performance dimension	Discrete accelerator	System-on-a-chip
CPU ↔ accelerator data movement	Crosses board-level interconnect; latency-bound	On-die; substantially lower latency
Memory bandwidth to accelerator	Dedicated memory subsystem	Shared with CPU; bandwidth contention possible
Software stack maturity	Mature ecosystems for major vendors	Per-SoC; varies widely with vendor investment
Driver/runtime portability	Broadly standardised by major frameworks	Per-SoC; rarely portable across SoCs
Effect of stack version on result	Significant	Often dominant

The top two rows get most of the marketing attention: on-die data movement and shared memory are the visible differences. The bottom three rows are the ones that matter for evaluation. SoC integration changes the physical bottlenecks; it does not change the principle that performance is a property of the hardware-and-software stack, observed under a specific workload. That is a pattern observed across the embedded AI work we have done, not a benchmarked rate.

Why does the software stack matter more on an SoC than on a discrete accelerator?

For a discrete accelerator from a major vendor — an NVIDIA GPU running CUDA, cuDNN and TensorRT, an AMD GPU on ROCm, an Intel accelerator on oneAPI — the software stack is mature, broadly portable across host platforms, and well-supported by the major frameworks. A team evaluating that hardware can reasonably assume the software side of the AI Executor is approximately constant across deployments and that performance differences they measure reflect hardware differences with limited stack noise. PyTorch with torch.compile, ONNX Runtime, or a graph compiler like XLA will produce results that travel.

For an SoC’s AI accelerator block, that assumption breaks. The drivers and runtime that target the block are vendor-specific to that SoC. Framework support — PyTorch, TensorFlow Lite, ONNX Runtime, llama.cpp — varies by SoC and by SoC generation. Two devices built around the same SoC silicon, running the same model weights, can exhibit substantially different observed performance because the vendor-supplied software stacks differ in maturity for that block, the framework integration is at a different version on each device, or the accelerator-block kernels have been tuned for one version of a framework but not another. We see this pattern repeatedly across customer engagements: identical silicon, different SDK minor versions, multiplicative performance gaps.

The implication is concrete. A benchmark report for an SoC that omits the software-stack version is reporting a number whose generalisation to a different deployment cannot be assessed. The hardware identity is constant; the executor is not.

What does it mean to treat AI performance as a systems problem on an SoC?

It means accepting that the unit of measurement is the executor, not the chip. Evaluating an SoC for an AI workload is a stack-disclosure exercise more than a hardware exercise. The dimensions that have to be captured before any benchmark result is interpretable include:

The exact SoC model and silicon revision.
The vendor SDK version that targets the AI accelerator block.
The driver and runtime versions (often vendor-specific to the SoC, sometimes board-vendor-specific on top of that).
The framework version and the framework’s SoC backend version, for example PyTorch with a vendor-supplied backend or delegate, or ONNX Runtime with a vendor execution provider (EP).
Whether the workload actually runs on the AI accelerator block, the integrated GPU block, or falls back to the CPU. Silent fallback is a common cause of reported underperformance.
The precision configuration the accelerator block supports for the workload (INT8, FP16, mixed precision, or a vendor-specific quantisation scheme).
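The dimensions above can be captured as a small record that travels with every benchmark number. The following is a minimal sketch, not an established schema: the `ExecutorSpec` class, its field names, and the `missing_dimensions` helper are hypothetical names introduced for illustration, and the values are placeholders rather than a real device.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass(frozen=True)
class ExecutorSpec:
    """Disclosure record for one benchmark run (hypothetical field names)."""
    soc_model: Optional[str] = None
    silicon_revision: Optional[str] = None
    vendor_sdk_version: Optional[str] = None
    driver_version: Optional[str] = None
    runtime_version: Optional[str] = None
    framework_version: Optional[str] = None
    backend_version: Optional[str] = None
    execution_target: Optional[str] = None  # where it ran: "npu", "gpu", "cpu"
    precision: Optional[str] = None         # e.g. "int8", "fp16"


def missing_dimensions(spec: ExecutorSpec) -> List[str]:
    """Return the names of disclosure fields that were left empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


# Placeholder values only; they do not describe a real device or SDK.
spec = ExecutorSpec(
    soc_model="example-soc",
    silicon_revision="rev-b",
    framework_version="x.y.z",
    execution_target="npu",
    precision="int8",
)

# Lists the dimensions that still have to be disclosed for this run.
print(missing_dimensions(spec))
```

A report pipeline could refuse to publish a result while `missing_dimensions` returns a non-empty list, which turns the disclosure requirement into a check rather than a convention.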

A benchmark that captures these dimensions produces a result that another team can reproduce or at least interpret. A benchmark that omits any of them reports a number whose validity is confined to the original measurement environment.

The point is not that SoC benchmarking is harder than discrete-accelerator benchmarking. It is that the methodological discipline that applies to both becomes operationally visible on SoCs, because the stack variance is large enough to dominate naïve comparisons. On discrete hardware you can sometimes get away with sloppy disclosure and still produce numbers that roughly travel. On an SoC, you cannot.

SoC vs discrete: a worked comparison frame

A team comparing an SoC-based deployment with a discrete-accelerator alternative is not comparing two pieces of silicon. It is comparing two AI Executors: (SoC, vendor SDK, framework backend, precision) versus (discrete accelerator, CUDA / ROCm / oneAPI stack, framework, precision). The two executors differ on the hardware axis — but they also differ on multiple software axes, on memory architecture (shared vs dedicated), and on the workload-handling pattern (continuous on-die data movement vs board-level transfer over PCIe).

A benchmark that holds the workload constant and varies only one of these axes is informative about that axis. A benchmark that varies all of them and reports a single comparison number is informative about the combined trade-off and uninformative about which axis drove the result. Both kinds of benchmark have their uses; conflating them is the methodological error. When a procurement document compares “the SoC” with “the GPU” on a single throughput number, it is implicitly making a claim about every axis at once and naming none of them.
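One way to keep the two kinds of benchmark apart is to list the axes on which the compared executors differ before reading the number. This is a sketch under assumptions: the axis names, the dictionary layout, and the functions `differing_axes` and `classify_comparison` are hypothetical, and the values are placeholders.

```python
from typing import Dict, List

Executor = Dict[str, str]


def differing_axes(a: Executor, b: Executor) -> List[str]:
    """List every axis on which two executor descriptions disagree."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def classify_comparison(a: Executor, b: Executor) -> str:
    """Say whether a performance gap can be attributed to a single axis."""
    axes = differing_axes(a, b)
    if not axes:
        return "identical executors: any gap is run-to-run variance"
    if len(axes) == 1:
        return f"single-axis comparison: attributable to {axes[0]}"
    return "combined trade-off: cannot attribute to any one of " + ", ".join(axes)


# Placeholder descriptions of the two executors being compared.
soc = {"hardware": "soc", "stack": "vendor-sdk", "framework": "backend-a",
       "precision": "int8", "memory": "shared"}
gpu = {"hardware": "discrete", "stack": "cuda", "framework": "backend-b",
       "precision": "fp16", "memory": "dedicated"}

# Same SoC executor with only the SDK changed.
sdk_bump = dict(soc, stack="vendor-sdk-next")

print(classify_comparison(soc, gpu))
print(classify_comparison(soc, sdk_bump))
```

The first comparison varies five axes at once and so only describes the combined trade-off; the second varies one and can be read as a statement about the SDK change.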

The pattern repeats whenever someone tries to reason about a stack one layer at a time. The kernel writer assumes the framework’s scheduling is fixed; the framework integrator assumes the kernels are fixed; the application team assumes both are fixed. On an SoC the cross-layer interactions — quantisation scheme tied to kernel availability tied to runtime version tied to SDK release — refuse to stay inside their layers. Reasoning layer-by-layer breaks down because the layers are not independent.

The framing that helps

A system-on-a-chip is not a hardware-only object. It is an integrated AI Executor whose software stack is per-SoC, often vendor-specific, and frequently the dominant source of measured performance variance across devices built around the same silicon. SoC evaluation that omits the software stack is reporting a hardware identity and calling it a benchmark.

LynxBench AI treats the SoC’s vendor SDK, runtime, framework backend, and precision configuration as part of the AI Executor specification — alongside the silicon — because the on-die integration removes some bottlenecks but not the methodological requirement that the full stack be disclosed for the benchmark result to transfer. That is what carries the principle that AI performance emerges from the hardware × software stack into the embedded and edge context without dilution. For your SoC deployment, which vendor SDK version, runtime, framework backend, and accelerator-block kernels is the published benchmark actually running on — and does your firmware reproduce them?

Frequently Asked Questions

Does shared CPU-and-accelerator memory on an SoC create new performance variance?

It can. On-die integration removes board-level transfer latency, but the accelerator block now shares a memory subsystem with the CPU, so bandwidth contention becomes possible under concurrent load. That contention is workload-dependent and rarely shows up in a single-stream benchmark, which is one more reason the workload and the full stack have to be disclosed before an SoC number is interpretable.
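A rough way to see why the contention is workload-dependent is a roofline-style bound, where attainable throughput is the lower of the compute peak and the memory bandwidth multiplied by the workload's arithmetic intensity. The sketch below assumes a deliberately crude contention model (CPU traffic is simply subtracted from the shared bandwidth budget), and every number in it is an illustrative placeholder, not a measurement of any device.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         bandwidth_bytes_per_s: float,
                         ops_per_byte: float) -> float:
    """Roofline bound: compute-limited or memory-limited, whichever is lower."""
    return min(peak_ops_per_s, bandwidth_bytes_per_s * ops_per_byte)


def accelerator_bandwidth(total: float, cpu_traffic: float) -> float:
    """Crude assumption: CPU traffic is subtracted from the shared budget."""
    return max(total - cpu_traffic, 0.0)


# Illustrative placeholders, not figures for any real SoC.
PEAK = 4e12        # accelerator peak, ops/s
TOTAL_BW = 50e9    # shared memory bandwidth, bytes/s
CPU_LOAD = 20e9    # concurrent CPU memory traffic, bytes/s

for ops_per_byte in (20.0, 200.0):
    idle = attainable_ops_per_s(
        PEAK, accelerator_bandwidth(TOTAL_BW, 0.0), ops_per_byte)
    busy = attainable_ops_per_s(
        PEAK, accelerator_bandwidth(TOTAL_BW, CPU_LOAD), ops_per_byte)
    print(ops_per_byte, idle, busy)
```

Under these assumptions the low-intensity workload loses throughput when the CPU is busy, while the high-intensity workload stays compute-bound and shows no change, so a single-stream benchmark of the second workload would not reveal the contention at all.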

What is “silent fallback” on an SoC, and why does it cause reported underperformance?

Silent fallback is when a workload that was expected to run on the AI accelerator block quietly executes on the integrated GPU block or the CPU instead — usually because a kernel or operator is unsupported in the installed SDK or framework backend version. The reported number is real, but it is measuring the wrong execution path. Capturing where the workload actually ran is one of the disclosure dimensions that separates an interpretable SoC benchmark from a misleading one.
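Checking for fallback amounts to asking, per operator, which device actually executed it. The sketch below assumes that a placement map has already been obtained from whatever profiling or logging the vendor runtime offers; the `fallback_report` function, the device labels, and the operator names are hypothetical and do not correspond to any specific SDK.

```python
from collections import Counter
from typing import Dict


def fallback_report(placement: Dict[str, str],
                    expected: str = "npu") -> Dict[str, object]:
    """Summarise where operators ran and flag any that left the expected device."""
    off_target = sorted(op for op, dev in placement.items() if dev != expected)
    return {
        "per_device": dict(Counter(placement.values())),
        "off_target_ops": off_target,
        "fully_on_target": not off_target,
    }


# Hypothetical placement, as if parsed from a runtime's profiling output.
placement = {
    "conv_1": "npu",
    "conv_2": "npu",
    "layer_norm": "cpu",   # unsupported on the accelerator in this made-up case
    "softmax": "npu",
}

report = fallback_report(placement)
print(report["per_device"])
print(report["off_target_ops"])
print(report["fully_on_target"])
```

Recording the report next to the benchmark number makes it clear whether the result measures the accelerator block or a mixed execution path.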

Can a benchmark from one SoC device be reused for another device built on the same silicon?

Not safely. Two devices on identical silicon can differ in vendor SDK minor version, framework backend version, and accelerator-block kernel tuning, and any of those can shift observed performance multiplicatively. Unless the second device reproduces the same vendor SDK, runtime, framework backend, and precision configuration, the original result is confined to its original measurement environment.

When does an SoC-versus-discrete throughput number become misleading in procurement?

It becomes misleading when a single comparison number stands in for several axes at once — hardware, multiple software layers, shared-versus-dedicated memory, and the data-movement pattern. Such a number is informative about the combined trade-off but says nothing about which axis drove the result. A procurement comparison that holds the workload constant and varies one axis at a time is the one that supports a defensible decision.
