A Field Guide to AI Chips

A beginner-to-intermediate field guide to the silicon powering modern AI. GPUs, TPUs, NPUs, CPUs, ASICs, FPGAs, edge chips and emerging architectures — what each is for, where you'll meet it, and how much it costs.

By Ajay Walia · Apr 29, 2026 · 10 min read

Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.
❦ ❦ ❦

Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a “where you’ll meet it” entry for each. Each section links to a deeper entry for the curious.

The Roll Call

Chip | Best for | Memory & Interconnect | Cost & Access | Notable Specimens (2026)
GPU | Training + inference | 80–192GB HBM3/3e; NVLink 5, PCIe 5 | $25–40K each; cloud-only at scale | NVIDIA H100, B200, GB200 NVL72; AMD MI325X
TPU | Hyperscale training | 95–192GB HBM; OCS interconnect | Google Cloud only | TPU v5p, v6 Trillium
NPU | On-device AI | Shared LPDDR / unified memory | Bundled in device | Apple Neural Engine (M4), Intel AI Boost (Lunar Lake), Qualcomm Hexagon (8 Elite)
CPU | Orchestration & control plane | DDR5; PCIe 5, CXL | $1–15K; retail | Intel Xeon 6, AMD EPYC 9005
ASIC | Inference at scale; specialised training | Custom HBM / SRAM; proprietary fabric | Cloud-only | AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L
FPGA | Custom, low-latency, adaptive | DDR/HBM; reprogrammable fabric | $5–50K each; cloud | AMD Versal AI Edge, Intel Agilex 7
Edge AI | Mobile, robotics, IoT | LPDDR; low-power | $50–2000, embedded in product | NVIDIA Jetson Orin, Google Coral, Hailo-8
Emerging | Frontier R&D | Wafer-scale SRAM / photonic / analog | Mostly research, limited cloud | Cerebras (covered above), Lightmatter, Mythic

1 · GPU — Graphics Processing Unit

GPU · The Apex Predator
Class: Parallel beast
Memory: 80–192GB HBM3 / HBM3e
Interconnect: NVLink 5, PCIe 5, InfiniBand
Power: 350–1200W per GPU
Habitat: Hyperscale datacenters
Cost: $25–40K per card · cloud rental at scale
Best Prey: LLM training, diffusion, multimodal pretraining
Specimens: NVIDIA H100, B200, GB200 NVL72; AMD MI325X

GPUs are the apex predator of the AI hardware ecosystem in 2026. Originally designed for graphics, they turned out to be ideal for the dense matrix multiplications that dominate neural network training. NVIDIA’s H100 made the LLM era possible; B200 and the rack-scale GB200 NVL72 (72 GPUs treated as one machine, lashed together by NVLink switches) define the current frontier.

GPUs dominate not just because of parallel processing but because of a combination nothing else has matched at scale: HBM (high-bandwidth memory mounted directly on the chip package), tensor cores (specialised matrix-multiply units), and a mature software ecosystem (CUDA, PyTorch, JAX). AMD’s MI325X is the only serious open-market competitor, and even it leans on ROCm’s HIP layer, a CUDA-like API with tooling for porting CUDA code, rather than running CUDA itself.
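To see why the HBM figure matters so much, here is a rough sketch that counts how many accelerators it takes just to hold a model's weights. The 70B-parameter model is a hypothetical example, the helper names are made up for illustration, and the estimate deliberately ignores activations, KV cache and optimizer state, so treat the answer as a floor rather than a sizing guide.

```python
import math

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billions: float, dtype: str) -> float:
    """Weights-only footprint in GB (taking 1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


def devices_needed(params_billions: float, dtype: str, hbm_gb: float) -> int:
    """Minimum device count to hold the weights alone.

    Assumption: nothing else competes for memory, which is never true
    in practice, so real deployments need headroom on top of this.
    """
    return math.ceil(weight_gb(params_billions, dtype) / hbm_gb)


# Hypothetical 70B-parameter model at bf16, on the 80GB and 192GB
# ends of the HBM range quoted in the stat block.
for hbm in (80, 192):
    print(f"{hbm}GB HBM: {devices_needed(70, 'bf16', hbm)} device(s)")
```

At bf16 the hypothetical model needs 140GB for weights, so it spills across two 80GB cards but fits on a single 192GB one.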

The catch: you cannot really buy them. H100s and B200s ship into hyperscaler datacenters first and reach the open market — when they do — through Lambda, CoreWeave, AWS, and friends, rented by the hour at $2–8 each.
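The two price ranges quoted in this section ($25–40K per card, $2–8 per GPU-hour) can be put side by side with a little arithmetic. This is a back-of-envelope sketch only: the function name is invented here, and it assumes 100% utilisation while ignoring power, cooling, hosting, networking and resale value, all of which matter in a real decision.

```python
def break_even_hours(purchase_usd: float, rent_usd_per_hour: float) -> float:
    """Hours of rental that cost as much as buying the card outright.

    Assumption: purchase price is the only ownership cost (it is not).
    """
    return purchase_usd / rent_usd_per_hour


# Best and worst case from the article's own ranges.
fastest = break_even_hours(25_000, 8)   # cheap card, expensive rental
slowest = break_even_hours(40_000, 2)   # expensive card, cheap rental

print(f"fastest break-even: {fastest:.0f} h (~{fastest / 24:.0f} days)")
print(f"slowest break-even: {slowest:.0f} h (~{slowest / 24:.0f} days)")
```

On these numbers, renting costs as much as buying somewhere between roughly four months and a bit over two years of round-the-clock use.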

→ Full entry: Field Guide · GPUs

2 · TPU — Tensor Processing Unit

TPU · Google's Matmul Colossus
Class: Bespoke matrix engine
Memory: 95–192GB HBM (varies by generation)
Interconnect: OCS (Optical Circuit Switching) + ICI
Power: ~200–300W per chip
Habitat: Google Cloud (only)
Cost: Cloud rental only
Best Prey: Hyperscale training of Gemini-class models
Specimens: TPU v5p (training), v5e (inference), v6 Trillium

Google designed TPUs in-house to avoid paying NVIDIA’s margins on a workload it knew exactly: TensorFlow matrix multiplications at hyperscale. Each generation has narrowed the flexibility gap with GPUs while extending the TPU’s lead in energy efficiency (work done per watt).

The architectural bet is the systolic array: a grid of multiply-add units that pumps data through in lockstep, achieving near-peak utilisation on matmul-heavy workloads. The trade-off is that anything outside that sweet spot (irregular memory access, highly dynamic shapes) runs less efficiently than on a GPU. The OCS-based interconnect lets Google rewire a TPU pod’s topology per job, which matters enormously at the scale of a Gemini training run.
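The "pumps data through in lockstep" idea is easier to see in a toy simulation. The sketch below models one textbook variant, an output-stationary array in which each cell keeps its own running sum while operands slide right and down one cell per cycle. It is an illustration of the general technique, not a description of how any TPU generation is actually wired, and the function name is made up.

```python
def systolic_matmul(A, B):
    """Multiply A (m x k) by B (k x n) on a simulated m x n systolic array.

    Each cell (i, j) owns accumulator C[i][j]. Rows of A enter from the
    left and columns of B from the top, each delayed by its index so that
    matching operands meet in the right cell on the right cycle.
    Returns (C, number_of_cycles).
    """
    m, k = len(A), len(A[0])
    n = len(B[0])
    assert len(B) == k, "inner dimensions must match"

    acc = [[0] * n for _ in range(m)]    # per-cell accumulators
    a_reg = [[0] * n for _ in range(m)]  # operand travelling rightwards
    b_reg = [[0] * n for _ in range(m)]  # operand travelling downwards

    cycles = m + n + k - 2               # time for the last operands to arrive
    for t in range(cycles):
        # Shift A values one cell to the right; feed column 0 from A.
        for i in range(m):
            for j in range(n - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
            s = t - i                    # row i starts i cycles late
            a_reg[i][0] = A[i][s] if 0 <= s < k else 0
        # Shift B values one cell down; feed row 0 from B.
        for j in range(n):
            for i in range(m - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
            s = t - j                    # column j starts j cycles late
            b_reg[0][j] = B[s][j] if 0 <= s < k else 0
        # Every cell does one multiply-add per cycle, in lockstep.
        for i in range(m):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

Notice that no cell ever fetches from memory by address: operands arrive from a neighbour, get used and move on. That is the source of the high utilisation on dense matmuls, and also why irregular access patterns fit the design poorly.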

You cannot buy a TPU. They exist exclusively inside Google Cloud, rented by the hour. Gemini was trained on them; many third parties (Anthropic for a stretch, plus enterprise customers) rent slices for their own runs.

→ Full entry: Field Guide · TPUs

3 · NPU — Neural Processing Unit

NPU · The Resident Familiar
Class: On-device specialist
Memory: Shared LPDDR / unified system memory
Interconnect: SoC fabric (on-die)
Power: 5–40W
Habitat: Laptops, phones, tablets
Cost: Bundled (no separate purchase)
Best Prey: Voice, camera AI, on-device LLMs, Copilot features
Specimens: Apple Neural Engine (M4), Intel AI Boost (Lunar Lake / Arrow Lake), Qualcomm Hexagon (8 Elite, X Elite)

NPUs are the chip type most people interact with every day without knowing it. They live inside the SoC of your phone or laptop, optimised for running already-trained models locally with extreme power efficiency. Voice transcription, Face ID, Pixel’s call screening, the on-device chat in Copilot+ PCs — all NPU workloads.

The defining trait is integer-quantised math (INT8 / INT4) at very low wattage. Where a datacenter GPU might pull 700W serving a model, an NPU runs inference on a small model, or a quantised-down version of a larger one, at 5–15W. The weights sit in the device’s main memory because an NPU has no dedicated accelerator memory of its own.
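For a feel of what "integer-quantised" means in practice, here is a minimal sketch of symmetric per-tensor INT8 quantisation, one of the simplest schemes. It is an assumption that any particular NPU toolchain works this way; production stacks typically use per-channel scales, calibration data and other refinements. The function names are illustrative only.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 0.0]          # toy values, not real model weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each weight now occupies 1 byte instead of 4 (FP32); the price is a
# rounding error of at most half a quantisation step per value.
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(q, f"scale={scale:.6f}", f"worst error={worst:.6f}")
```

The storage saving is the point: a quarter of the bytes means a quarter of the memory traffic, which is where most of the energy goes on a battery-powered device.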

Microsoft now requires 40+ TOPS of NPU performance for a laptop to qualify as a “Copilot+ PC” — a forcing function that pushed Qualcomm, Intel and AMD into a 12-month arms race. As of 2026, top mobile SoCs ship 50–60 TOPS of NPU performance.

→ Full entry: Field Guide · NPUs

4 · CPU — Central Processing Unit

CPU · The Foundational Workhorse
Class: General-purpose
Memory: DDR5 (system RAM)
Interconnect: PCIe 5, CXL
Power: 100–500W
Habitat: Every server, every workstation
Cost: $1–15K, retail and widely available
Best Prey: Orchestration, preprocessing, control plane, small-batch inference
Specimens: Intel Xeon 6, AMD EPYC 9005

CPUs aren’t obsolete; they’re indispensable. Every AI training run needs CPUs to feed data to the accelerators (decompression, tokenisation, augmentation), schedule jobs and run the control plane. The ratio matters: a typical training node pairs eight GPUs with one or two CPU sockets.

Modern server CPUs ship with AI-targeted extensions (AVX-512, AMX or Advanced Matrix Extensions, bf16 support) that let them handle small-batch inference and embedding generation reasonably well. For models under 7B parameters at low traffic, a CPU is often more economical than a dedicated accelerator.

What CPUs cannot do is train frontier models. LLM pretraining needs 10–100× the arithmetic density and memory bandwidth per watt that a CPU delivers. CPUs do the surrounding work; accelerators do the math.

→ Full entry: Field Guide · CPUs

5 · ASIC — Application-Specific Integrated Circuit

ASIC · The Purpose-Bred Specialist
Class: Fixed-function accelerator
Memory: Custom HBM / on-die SRAM
Interconnect: Proprietary fabric
Power: 75–450W per chip
Habitat: Hyperscaler and vendor clouds (AWS, Cerebras, Groq, SambaNova)
Cost: Cloud rental only
Best Prey: Inference at scale, specialised training
Specimens: AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L

ASICs are chips designed for one thing only, and they are brutally good at it. AWS Inferentia2 runs production inference for Anthropic, Amazon search and Alexa at a cost-per-token that beats GPUs. Trainium2 is AWS’s training equivalent, taking aim at NVIDIA’s H100/B200 dominance. Groq’s LPU targets latency-critical inference, with per-token latency and tokens-per-second figures that GPUs struggle to match.

The architectural philosophy is “build silicon for the specific math you do most often, throw away the rest.” Cerebras takes this furthest: their Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer (900,000 cores, 44GB on-die SRAM) that eliminates the multi-GPU communication overhead which plagues distributed training.

The price of specialisation: you cannot pivot. When the dominant architecture changes — and it does (Mamba, MoE, diffusion, JEPA) — ASICs designed for the last era stop being competitive overnight. GPUs hedge their bets; ASICs commit.

→ Full entry: Field Guide · ASICs

6 · FPGA — Field-Programmable Gate Array

FPGA · The Shapeshifter
Class: Reprogrammable logic
Memory: DDR / HBM (model-dependent)
Interconnect: PCIe; custom
Power: 50–300W
Habitat: Trading desks, 5G basebands, telecom, occasional inference
Cost: $5–50K each; cloud
Best Prey: Ultra-low-latency inference, custom protocols, evolving workloads
Specimens: AMD/Xilinx Versal AI Edge, Intel Agilex 7

FPGAs occupy a strange ecological niche. Unlike ASICs, their internal wiring is reprogrammable: you can compile a new circuit into them, deploy it, and reprogram it tomorrow. This makes them ideal for workloads that change faster than chip fabrication cycles (which run to years), or where you need ultra-low, predictable latency that general-purpose accelerators struggle to deliver.

In AI specifically, FPGAs are rarely the first choice for mainstream training or inference — they are slower to develop for and harder to program than GPUs. Where they shine: when the model is small enough to fit, the latency budget is brutal (single-digit microseconds), and the workload spec might shift quarterly. Microsoft used FPGAs heavily in early Bing Search ranking and Azure networking; financial firms still run them for inline ML in trading.

For most readers, FPGAs will be a “did you know?” category rather than a chip you’ll ever deploy.

→ Full entry: Field Guide · FPGAs

7 · Edge AI — Mobile, Robotics, IoT

Edge AI · The Frontier Ranger
Class: Embedded inference
Memory: LPDDR, sometimes onboard SRAM
Interconnect: PCIe, MIPI, USB
Power: 1–60W
Habitat: Drones, robots, cameras, autonomous systems, sensors
Cost: $50–2000; embedded in product
Best Prey: Real-time inference, computer vision, robotics
Specimens: NVIDIA Jetson Orin, Google Coral / Edge TPU, Hailo-8, Ambarella CV5

Edge AI chips are NPUs’ cousins — same family, different role. Where an NPU lives inside a consumer laptop alongside other compute, an edge AI chip is purpose-built for an embedded device: a security camera, a drone, a forklift, a Tesla.

The defining constraints are size, power and latency. A camera processing 4K video at 30fps cannot afford to ship frames to a cloud GPU; it has to detect motion locally, identify objects locally and signal events within tens of milliseconds — on a few watts, because the device runs on battery or is fanless.
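The camera example above implies two hard numbers: how long the chip has per frame, and how much data it would have to ship if it sent raw video off-device. The sketch below works both out. It assumes uncompressed 8-bit RGB at 3840×2160, which overstates the bandwidth of a real camera that compresses its stream, and the function names are illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame before the next arrives."""
    return 1000.0 / fps


def raw_video_mb_per_s(width: int, height: int, fps: float,
                       bytes_per_pixel: int = 3) -> float:
    """Uncompressed video data rate in MB/s (1 MB = 1e6 bytes).

    Assumption: 3 bytes per pixel (8-bit RGB), no compression.
    """
    return width * height * bytes_per_pixel * fps / 1e6


budget = frame_budget_ms(30)
rate = raw_video_mb_per_s(3840, 2160, 30)
print(f"per-frame budget: {budget:.1f} ms")
print(f"raw 4K30 stream:  {rate:.1f} MB/s")
```

Roughly 33 ms per frame and about 746 MB/s of raw pixels: detection, classification and event signalling all have to fit inside the first number, and the second is why streaming everything to a cloud GPU is rarely an option.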

NVIDIA’s Jetson family is the broadest platform — same CUDA software stack as their datacenter GPUs, scaled down to 7–60W. Google’s Edge TPU is the smallest, cheapest and lowest power (Coral USB stick: $40, 2W). Hailo-8 and Ambarella sit in between, targeting industrial and automotive customers.

→ Full entry: Field Guide · Edge AI

8 · Emerging Architectures

Emerging · The Frontier Beasts
Class: Experimental
Memory: Wafer-scale SRAM / photonic / analog
Interconnect: On-wafer / optical / in-memory
Power: Varies wildly
Habitat: Research labs, narrow cloud offerings
Cost: Mostly inaccessible; limited cloud
Best Prey: The next 10× efficiency leap
Specimens: Cerebras WSE-3 (wafer-scale); Lightmatter, Lightelligence (photonic); Mythic, IBM (analog/in-memory)

Three architectures sit at the frontier — promising, but not yet mainstream.

  • Wafer-scale (Cerebras): a single chip the size of an entire silicon wafer. Keeps communication on the wafer instead of between separate chips, and presents the whole system to software as a single device. Already commercial.
  • Photonic / optical AI (Lightmatter, Lightelligence): perform matrix math using light interference instead of electricity. Potentially orders of magnitude lower energy per operation; currently limited to inference and constrained models.
  • Analog / in-memory compute (Mythic, IBM, several startups): compute inside memory arrays using analog voltage levels. Sidesteps the von Neumann bottleneck, the constant shuttling of data between memory and compute. Promising for low-power inference; precision limitations make training hard today.

→ Full entry: Field Guide · Emerging Architectures

Current Industry Reality (2026)

  • GPUs dominate training. Most frontier models (GPT-5, Claude, Llama, Grok) are still trained largely on GPUs at hyperscale, mostly NVIDIA’s; Gemini, trained on Google’s TPUs, is the main exception.
  • ASICs are ascendant in inference. AWS reports more than 40% of internal inference now runs on Inferentia and Trainium; Groq leads on latency-critical applications.
  • NPUs are exploding on consumer devices. Every premium laptop and phone shipped in 2026 has a 40+ TOPS NPU.
  • CPUs remain foundational. No accelerator runs without one.
  • TPUs are Google-only. Gemini, Veo and Imagen were all trained on TPU v5p / v6.

Simplified View

Use case | Typical chip
Train a GPT-class model | GPU clusters (or TPU pods if you're Google)
Run ChatGPT-class inference at scale | GPUs + ASICs (Inferentia, Groq, Trainium)
AI on laptop | NPU + integrated GPU
AI on phone | Mobile NPU
Robot or drone AI | Edge AI chips (Jetson, Hailo)
Ultra-low-latency custom AI | FPGA or ASIC

The Industry Trend

The industry is moving from "general-purpose GPU everything" to "specialised chip for each layer of the stack."

Power and inference cost are now the binding constraints. A frontier model serving billions of queries can cost more to serve in a year than its entire training run cost. The economics force specialisation: train once on GPUs, serve forever on cheaper inference silicon. Expect the gap between training hardware (still GPU-dominant) and inference hardware (rapidly fragmenting across ASICs and NPUs) to widen.
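The train-once, serve-forever argument comes down to a simple crossover: how many days of serving add up to the one-off training bill. The sketch below makes that explicit. Every input is a hypothetical placeholder chosen for round numbers, not a figure from this guide or from any vendor, and the function name is invented for illustration.

```python
def days_until_serving_exceeds_training(training_cost_usd: float,
                                        queries_per_day: float,
                                        cost_per_query_usd: float) -> float:
    """Days of serving after which cumulative inference spend passes training spend."""
    daily_cost = queries_per_day * cost_per_query_usd
    if daily_cost <= 0:
        raise ValueError("daily serving cost must be positive")
    return training_cost_usd / daily_cost


# Hypothetical inputs: a $100M training run, one billion queries a day,
# a tenth of a cent per query. None of these are measured values.
days = days_until_serving_exceeds_training(100e6, 1e9, 0.001)
print(f"serving overtakes training after ~{days:.0f} days")
```

With those placeholder inputs the crossover lands at about 100 days. Halving the cost per query doubles that, which is the lever cheaper inference silicon is pulling.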

❦ ❦ ❦

About the Author

Ajay Walia

AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.