A Field Guide · 2026

A Field Guide to AI Chips

Stat blocks, lairs and notable specimens for the eight kinds of silicon that power modern AI.

❦ ❦ ❦

Modern AI runs on a small zoo of specialised chips. Each evolved to handle a different workload — training a frontier model, answering a billion queries a day, recognising a face on your phone, keeping a drone alive in the air. This guide catalogues eight of them, with a stat block and a “where you’ll meet it” entry for each. Each section links to a deeper entry for the curious.

The Roll Call

Chip	Best for	Memory & Interconnect	Cost & Access	Notable Specimens (2026)
GPU	Training + inference	80–256GB HBM3/3e; NVLink 5, PCIe 5	$25–40K each; cloud-only at scale	NVIDIA H100, B200, GB200 NVL72; AMD MI325X
TPU	Hyperscale training	95–192GB HBM; OCS interconnect	Google Cloud only	TPU v5p, v6 Trillium
NPU	On-device AI	Shared LPDDR / unified memory	Bundled in device	Apple Neural Engine (M4), Intel AI Boost (Lunar Lake), Qualcomm Hexagon (8 Elite)
CPU	Orchestration & control plane	DDR5; PCIe 5, CXL	$1–15K; retail	Intel Xeon 6, AMD EPYC 9005
ASIC	Inference at scale; specialised training	Custom HBM / SRAM; proprietary fabric	Cloud-only	AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L
FPGA	Custom, low-latency, adaptive	DDR/HBM; reprogrammable fabric	$5–50K each; cloud	AMD Versal AI Edge, Intel Agilex 7
Edge AI	Mobile, robotics, IoT	LPDDR; low-power	$50–2000, embedded in product	NVIDIA Jetson Orin, Google Coral, Hailo-8
Emerging	Frontier R&D	Wafer-scale SRAM / photonic / analog	Mostly research, limited cloud	Cerebras (covered above), Lightmatter, Mythic

1 · GPU — Graphics Processing Unit

GPU · The Apex Predator

Class: Parallel beast
Memory: 80–256GB HBM3 / HBM3e
Interconnect: NVLink 5, PCIe 5, InfiniBand
Power: 350–1200W per GPU
Habitat: Hyperscale datacenters
Cost: $25–40K per card · cloud rental at scale
Best Prey: LLM training, diffusion, multimodal pretraining
Specimens: NVIDIA H100, B200, GB200 NVL72; AMD MI325X

GPUs are the apex predator of the AI hardware ecosystem in 2026. Originally designed for graphics, they turned out to be ideal for the dense matrix multiplications that dominate neural network training. NVIDIA’s H100 made the LLM era possible; B200 and the rack-scale GB200 NVL72 (72 GPUs treated as one machine, lashed together by NVLink switches) define the current frontier.

GPUs dominate not just because of parallel processing, but because of a combination nothing else has matched at scale: HBM (high-bandwidth memory mounted directly on the chip package), tensor cores (specialised matrix-multiply units), and a mature software ecosystem (CUDA, PyTorch, JAX). AMD’s MI325X is the only serious open-market competitor, and even it leans on CUDA compatibility: its ROCm stack uses HIP, a CUDA-like API, together with porting tools that translate CUDA source code rather than running CUDA binaries directly.

The catch: most buyers cannot simply order them. H100s and B200s ship into hyperscaler datacenters first and reach the open market, when they do, through Lambda, CoreWeave, AWS, and friends, rented at roughly $2–8 per GPU-hour.
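
As a rough sense of scale, the hourly range above can be turned into a bill. The sketch below is plain arithmetic; the function name `rental_cost_range`, the cluster size and the job length are hypothetical inputs for illustration, not figures from any provider.

```python
def rental_cost_range(num_gpus, hours, low_rate=2.0, high_rate=8.0):
    """Return (low, high) total rental cost in dollars.

    low_rate and high_rate default to the $2-8 per GPU-hour range
    quoted in the text; real prices vary by provider and contract.
    """
    gpu_hours = num_gpus * hours
    return gpu_hours * low_rate, gpu_hours * high_rate


# Hypothetical job: 64 GPUs rented for one week (7 * 24 hours).
low, high = rental_cost_range(num_gpus=64, hours=7 * 24)
print(f"${low:,.0f} to ${high:,.0f}")
```

Even this small, assumed job lands in the tens of thousands of dollars, which is why per-hour pricing matters more to most teams than the sticker price of a card.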

→ Full entry: Field Guide · GPUs

2 · TPU — Tensor Processing Unit

TPU · Google's Matmul Colossus

Class: Bespoke matrix engine
Memory: 95–192GB HBM (varies by generation)
Interconnect: OCS (Optical Circuit Switching) + ICI
Power: ~200–300W per chip
Habitat: Google Cloud (only)
Cost: Cloud rental only
Best Prey: Hyperscale training of Gemini-class models
Specimens: TPU v5p (training), v5e (inference), v6 Trillium

Google designed TPUs in-house to avoid paying NVIDIA’s margins on a workload they knew exactly — TensorFlow matrix multiplications at hyperscale. Each generation has narrowed the gap with GPUs on flexibility while widening it on energy efficiency per FLOP.

The architectural bet is the systolic array: a grid of multiply-add units that pumps data through in lockstep, achieving near-peak utilisation on matmul-heavy workloads. The trade-off is that anything outside that sweet spot (irregular memory access, highly dynamic shapes) runs less efficiently than on a GPU. The OCS-based interconnect lets Google rewire a TPU pod’s topology per job, which matters enormously at the scale of a Gemini training run.
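
To make the lockstep data flow concrete, here is a minimal Python simulation of a systolic array. It is an illustrative sketch only, not Google's design: the function name `systolic_matmul`, the output-stationary layout (each cell accumulates one output element) and the skewed input schedule are assumptions chosen for clarity.

```python
def systolic_matmul(A, B):
    """Multiply A (M x K) by B (K x N) on a simulated M x N grid of cells.

    Each cell holds one accumulator. Values of A enter from the left edge
    and move right; values of B enter from the top edge and move down.
    Row i and column j are delayed by i and j cycles so that matching
    operands meet in the right cell at the right time.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]
    a_reg = [[0] * N for _ in range(M)]  # operand travelling rightwards
    b_reg = [[0] * N for _ in range(M)]  # operand travelling downwards
    cycles = M + N + K - 2
    for t in range(cycles):
        # Walk from the bottom-right corner so each cell still sees the
        # value its left/upper neighbour held on the previous cycle.
        for i in reversed(range(M)):
            for j in reversed(range(N)):
                if j == 0:
                    k = t - i  # skewed feed on the left edge
                    a_in = A[i][k] if 0 <= k < K else 0
                else:
                    a_in = a_reg[i][j - 1]
                if i == 0:
                    k = t - j  # skewed feed on the top edge
                    b_in = B[k][j] if 0 <= k < K else 0
                else:
                    b_in = b_reg[i - 1][j]
                a_reg[i][j] = a_in
                b_reg[i][j] = b_in
                acc[i][j] += a_in * b_in  # one multiply-add per cell per cycle
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

In this sketch no cell ever fetches from memory on its own: operands arrive from neighbours on a fixed schedule. That rigidity is what makes dense matmul efficient and irregular access patterns awkward.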

You cannot buy a TPU. They exist exclusively inside Google Cloud, rented by the hour. Gemini was trained on them; many third parties (Anthropic for a stretch, plus enterprise customers) rent slices for their own runs.

→ Full entry: Field Guide · TPUs

3 · NPU — Neural Processing Unit

NPU · The Resident Familiar

Class: On-device specialist
Memory: Shared LPDDR / unified system memory
Interconnect: SoC fabric (on-die)
Power: 5–40W
Habitat: Laptops, phones, tablets
Cost: Bundled — no separate purchase
Best Prey: Voice, camera AI, on-device LLMs, Copilot features
Specimens: Apple Neural Engine (M4), Intel AI Boost (Lunar Lake / Arrow Lake), Qualcomm Hexagon (8 Elite, X Elite)

NPUs are the chip type most people interact with every day without knowing it. They live inside the SoC of your phone or laptop, optimised for running already-trained models locally with extreme power efficiency. Voice transcription, Face ID, Pixel’s call screening, the on-device chat in Copilot+ PCs — all NPU workloads.

The defining trait is integer-quantised math (INT8 / INT4) at very low wattage. Where a datacenter GPU might pull 700W to serve a model, an NPU runs inference on a quantised-down version of a comparable model at 5–15W. The weights sit in the device’s main memory, because there is no discrete accelerator memory to fill.
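
What "quantised down" means can be shown in a few lines. The following is a minimal sketch of symmetric, per-tensor INT8 quantisation in plain Python. It is one common scheme, assumed here for illustration; real NPU toolchains use their own (often per-channel or calibrated) variants, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]


weights = [0.8, -1.0, 0.3, 0.0]          # illustrative values only
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q)
print([round(r, 4) for r in restored])
```

Each weight now needs one byte instead of the four used by FP32, at the price of a rounding error of at most half a scale step per weight.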

Microsoft now requires 40+ TOPS of NPU performance for a laptop to qualify as a “Copilot+ PC” — a forcing function that pushed Qualcomm, Intel and AMD into a 12-month arms race. As of 2026, top mobile SoCs ship 50–60 TOPS of NPU performance.

→ Full entry: Field Guide · NPUs

4 · CPU — Central Processing Unit

CPU · The Foundational Workhorse

Class: General-purpose
Memory: DDR5 (system RAM)
Interconnect: PCIe 5, CXL
Power: 100–500W
Habitat: Every server, every workstation
Cost: $1–15K, retail and widely available
Best Prey: Orchestration, preprocessing, control plane, small-batch inference
Specimens: Intel Xeon 6, AMD EPYC 9005

CPUs aren’t obsolete — they’re indispensable. Every AI training run needs CPUs to feed data to the accelerators (decompression, tokenisation, augmentation), schedule jobs and run the control plane. The ratio matters: a typical training cluster pairs eight GPUs with one or two CPU sockets.
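
The feeding role can be sketched as a producer and a consumer joined by a bounded queue. This is a toy illustration using only the Python standard library: `preprocess` is a stand-in for decompression and tokenisation, the summing step is a stand-in for the accelerator, and all names are hypothetical.

```python
import queue
import threading


def preprocess(raw):
    # Stand-in for CPU-side work such as decompression or tokenisation.
    return [len(word) for word in raw.split()]


def producer(raw_samples, out_q):
    for raw in raw_samples:
        out_q.put(preprocess(raw))   # blocks when the prefetch queue is full
    out_q.put(None)                  # sentinel: no more batches


def run_pipeline(raw_samples, prefetch=2):
    """CPU thread prepares batches ahead of time; the consumer drains them."""
    q = queue.Queue(maxsize=prefetch)
    worker = threading.Thread(target=producer, args=(raw_samples, q), daemon=True)
    worker.start()
    consumed = []
    while True:
        batch = q.get()
        if batch is None:
            break
        consumed.append(sum(batch))  # stand-in for the accelerator step
    worker.join()
    return consumed


print(run_pipeline(["a bb ccc", "dddd"]))
```

If the producer cannot keep the queue full, the consumer sits idle, which is the practical reason the CPU-to-GPU ratio matters.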

Modern server CPUs ship with AI-targeted extensions such as AVX-512, AMX (Advanced Matrix Extensions) and bf16 support, which let them handle small-batch inference and embedding generation reasonably well. For models under roughly 7B parameters served at low traffic, a CPU is often more economical than a dedicated accelerator.
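
A quick way to see why the 7B mark is a reasonable boundary is to compute how much memory the weights alone occupy. The sketch below is simple arithmetic; the helper name and the dtype table are illustrative assumptions, and it ignores activations, KV cache and runtime overhead.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}


def weight_footprint_gb(num_params, dtype):
    """Memory for model weights only, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("bf16", "int8", "int4"):
    print(dtype, weight_footprint_gb(7e9, dtype), "GB")
```

A 7B-parameter model needs about 14 GB in bf16 and about 7 GB in INT8, which fits comfortably in ordinary server DDR5.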

What CPUs cannot do is train frontier models. The arithmetic density and memory bandwidth needed for LLM pretraining are 10–100× what a CPU delivers per watt. CPUs do the surrounding work; accelerators do the math.

→ Full entry: Field Guide · CPUs

5 · ASIC — Application-Specific Integrated Circuit

ASIC · The Purpose-Bred Specialist

Class: Fixed-function accelerator
Memory: Custom HBM / on-die SRAM
Interconnect: Proprietary fabric
Power: 75–450W per chip
Habitat: Hyperscaler and vendor-run clouds (AWS, Cerebras, Groq, SambaNova)
Cost: Cloud rental only
Best Prey: Inference at scale, specialised training
Specimens: AWS Inferentia2, Trainium2; Cerebras WSE-3; Groq LPU; SambaNova SN40L

ASICs are chips designed for one thing and one thing only — and they’re brutally good at that thing. AWS Inferentia2 runs production inference for Anthropic, Amazon search and Alexa at a cost-per-token that beats GPUs. Trainium2 is AWS’s training equivalent, taking aim at NVIDIA’s H100/B200 dominance. Groq’s LPU posts inference latencies — sub-1ms first-token for many models — that GPUs simply cannot match.

The architectural philosophy is “build silicon for the specific math you do most often, throw away the rest.” Cerebras takes this furthest: their Wafer-Scale Engine 3 is a single chip the size of an entire silicon wafer (900,000 cores, 44GB on-die SRAM) that eliminates the multi-GPU communication overhead which plagues distributed training.

The price of specialisation: you cannot pivot. When the dominant architecture changes — and it does (Mamba, MoE, diffusion, JEPA) — ASICs designed for the last era stop being competitive overnight. GPUs hedge their bets; ASICs commit.

→ Full entry: Field Guide · ASICs

6 · FPGA — Field-Programmable Gate Array

FPGA · The Shapeshifter

Class: Reprogrammable logic
Memory: DDR / HBM (model-dependent)
Interconnect: PCIe; custom
Power: 50–300W
Habitat: Trading desks, 5G basebands, telecom, occasional inference
Cost: $5–50K each; cloud
Best Prey: Ultra-low-latency inference, custom protocols, evolving workloads
Specimens: AMD/Xilinx Versal AI Edge, Intel Agilex 7

FPGAs occupy a strange ecological niche. Unlike ASICs, their internal wiring is reprogrammable: you can compile a new circuit into them, deploy it, and reprogram it tomorrow. This makes them ideal for workloads that change faster than chip fabrication cycles (years), or where you need ultra-low, predictable latency that GPUs and CPUs struggle to deliver.
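
The basic building block behind that reprogrammability is the lookup table (LUT): a tiny memory whose contents define a logic function. The Python model below is a deliberately simplified sketch. The `LookupTable` class is hypothetical, and real FPGAs add programmable routing, flip-flops, DSP blocks and much more.

```python
class LookupTable:
    """A k-input LUT: the 'circuit' is a truth table stored as config bits."""

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.bits = [0] * (2 ** num_inputs)

    def program(self, fn):
        """Load a new function by filling in every truth-table entry."""
        for index in range(len(self.bits)):
            inputs = [(index >> pos) & 1 for pos in range(self.num_inputs)]
            self.bits[index] = int(bool(fn(*inputs)))

    def evaluate(self, *inputs):
        """Look up the output: the inputs simply form a memory address."""
        index = sum(bit << pos for pos, bit in enumerate(inputs))
        return self.bits[index]


lut = LookupTable(2)
lut.program(lambda a, b: a & b)   # today the hardware is an AND gate
print(lut.evaluate(1, 1))
lut.program(lambda a, b: a ^ b)   # tomorrow the same hardware is an XOR gate
print(lut.evaluate(1, 1))
```

Nothing physical changes between the two `program` calls, only the stored bits. An FPGA applies the same idea across very large numbers of LUTs and the wires between them.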

In AI specifically, FPGAs are rarely the first choice for mainstream training or inference — they are slower to develop for and harder to program than GPUs. Where they shine: when the model is small enough to fit, the latency budget is brutal (single-digit microseconds), and the workload spec might shift quarterly. Microsoft used FPGAs heavily in early Bing Search ranking and Azure networking; financial firms still run them for inline ML in trading.

For most readers, FPGAs will be a “did you know?” category rather than a chip you’ll ever deploy.

→ Full entry: Field Guide · FPGAs

7 · Edge AI — Mobile, Robotics, IoT

Edge AI · The Frontier Ranger

Class: Embedded inference
Memory: LPDDR, sometimes onboard SRAM
Interconnect: PCIe, MIPI, USB
Power: 1–60W
Habitat: Drones, robots, cameras, autonomous systems, sensors
Cost: $50–2000; embedded in product
Best Prey: Real-time inference, computer vision, robotics
Specimens: NVIDIA Jetson Orin, Google Coral / Edge TPU, Hailo-8, Ambarella CV5

Edge AI chips are NPUs’ cousins — same family, different role. Where an NPU lives inside a consumer laptop alongside other compute, an edge AI chip is purpose-built for an embedded device: a security camera, a drone, a forklift, a Tesla.

The defining constraints are size, power and latency. A camera processing 4K video at 30fps cannot afford to ship frames to a cloud GPU; it has to detect motion locally, identify objects locally and signal events within tens of milliseconds — on a few watts, because the device runs on battery or is fanless.
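
The numbers behind that constraint are easy to work out. The sketch below assumes 4K means 3840×2160 and that frames are uncompressed 3-byte RGB; real cameras compress, so treat the data rate as an upper bound. The function names are illustrative.

```python
def frame_budget_ms(fps):
    """Time available to process one frame before the next arrives."""
    return 1000.0 / fps


def raw_video_rate_mb_s(width, height, fps, bytes_per_pixel=3):
    """Uncompressed video data rate in megabytes per second (1 MB = 1e6 bytes)."""
    return width * height * bytes_per_pixel * fps / 1e6


print(f"{frame_budget_ms(30):.1f} ms per frame")
print(f"{raw_video_rate_mb_s(3840, 2160, 30):.1f} MB/s raw")
```

At 30fps the chip has about 33 ms per frame, and under these assumptions the raw stream is roughly 746 MB/s, far more than a battery-powered device can reasonably push to the cloud.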

NVIDIA’s Jetson family is the broadest platform: the same CUDA software stack as their datacenter GPUs, scaled down to 7–60W. Google’s Edge TPU is the smallest, cheapest and lowest power (the Coral USB stick costs roughly $60 and draws about 2W). Hailo-8 and Ambarella sit in between, targeting industrial and automotive customers.

→ Full entry: Field Guide · Edge AI

8 · Emerging Architectures

Emerging · The Frontier Beasts

Class: Experimental
Memory: Wafer-scale SRAM / photonic / analog
Interconnect: On-wafer / optical / in-memory
Power: Varies wildly
Habitat: Research labs, narrow cloud offerings
Cost: Mostly inaccessible; limited cloud
Best Prey: The next 10× efficiency leap
Specimens: Cerebras WSE-3 (wafer-scale); Lightmatter, Lightelligence (photonic); Mythic, IBM (analog/in-memory)

Three architectures sit at the frontier — promising, but not yet mainstream.

Wafer-scale (Cerebras): a single chip the size of an entire silicon wafer. Eliminates chip-to-chip communication for any model that fits on one wafer; presents the whole system to software as a single device. Already commercial.
Photonic / optical AI (Lightmatter, Lightelligence): perform matrix math using light interference instead of electricity. Potentially orders of magnitude lower energy per operation; currently limited to inference and constrained models.
Analog / in-memory compute (Mythic, IBM, several startups): compute inside memory arrays using analog voltage levels. Removes the von Neumann bottleneck (the constant shuttling of data between memory and compute). Promising for low-power inference; precision limitations make training hard today.
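
The precision limitation of analog compute can be illustrated with a toy crossbar model. This is a loose sketch under stated assumptions, not a description of any vendor's hardware: weights are stored as conductances that can only take a fixed number of evenly spaced levels, inputs are voltages, and each output is the summed current. Noise, drift and the differential pairs used for negative weights are ignored, and all names are hypothetical.

```python
def snap_to_level(weight, levels, max_abs=1.0):
    """Round a weight to the nearest of `levels` evenly spaced states."""
    step = 2 * max_abs / (levels - 1)
    clipped = max(-max_abs, min(max_abs, weight))
    return round((clipped + max_abs) / step) * step - max_abs


def crossbar_matvec(weights, voltages, levels):
    """Matrix-vector product where each stored weight has limited precision."""
    return [
        sum(snap_to_level(w, levels) * v for w, v in zip(row, voltages))
        for row in weights
    ]


weights = [[0.6, -0.4], [0.25, 0.75]]   # illustrative values only
voltages = [1.0, 1.0]
print(crossbar_matvec(weights, voltages, levels=3))     # very coarse cells
print(crossbar_matvec(weights, voltages, levels=257))   # much finer cells
```

With only three levels the first output comes out as 1.0 instead of the true 0.2. Inference can often tolerate errors of this kind after calibration, while the small gradient updates used in training are easily lost.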

→ Full entry: Field Guide · Emerging Architectures

Current Industry Reality (2026)

GPUs dominate training. Most frontier models, including GPT-5, Claude, Llama and Grok, are still trained largely on NVIDIA or AMD silicon at hyperscale; Gemini, trained on Google’s TPUs, is the notable exception.
ASICs are ascendant in inference. AWS reports more than 40% of internal inference now runs on Inferentia and Trainium; Groq leads on latency-critical applications.
NPUs are exploding on consumer devices. Nearly every premium laptop and phone shipped in 2026 includes a dedicated NPU, most of them in the 40+ TOPS class.
CPUs remain foundational. No accelerator runs without one.
TPUs are Google-only. Gemini, Veo and Imagen were all trained on TPU v5p / v6.

Simplified View

Use case	Typical chip
Train a GPT-class model	GPU clusters (or TPU pods on Google Cloud)
Run ChatGPT-class inference at scale	GPUs + ASICs (Inferentia, Groq, Trainium)
AI on laptop	NPU + integrated GPU
AI on phone	Mobile NPU
Robot or drone AI	Edge AI chips (Jetson, Hailo)
Ultra-low-latency custom AI	FPGA or ASIC

The Industry Trend

The industry is moving from "general-purpose GPU everything" to "specialised chip for each layer of the stack."

Power and inference cost are now the binding constraints. A frontier model serving billions of queries can spend more on inference compute and electricity in a year than its entire training run cost. The economics force specialisation: train once on GPUs, serve forever on cheaper inference silicon. Expect the gap between training hardware (still GPU-dominant) and inference hardware (rapidly ASIC- and NPU-fragmented) to widen.
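
The crossover between training and serving cost is simple to estimate. Every number in the example below is a made-up placeholder chosen to keep the arithmetic easy; none comes from a real model or provider, and the function name is hypothetical.

```python
def days_until_inference_exceeds_training(training_cost, cost_per_query, queries_per_day):
    """Days of serving after which cumulative inference spend passes training cost."""
    daily_inference_cost = cost_per_query * queries_per_day
    return training_cost / daily_inference_cost


# Placeholder inputs: $100M training run, $0.001 per query, 1 billion queries a day.
days = days_until_inference_exceeds_training(
    training_cost=100_000_000,
    cost_per_query=0.001,
    queries_per_day=1_000_000_000,
)
print(f"about {days:.0f} days")
```

Under these assumed inputs, serving overtakes training in about 100 days, and halving the cost per query doubles that runway. That lever is what cheaper inference silicon is meant to pull.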

❦ ❦ ❦

About the Author

Ajay Walia

AI {IT Architect} focusing on local-first multi-agent AI engineering, zero-data-egress systems. Ideator, Creator and Executor on Curious Bit.
