Reiner Pope, CEO of MatX, explains AI chip design from the bottom up: from logic gates and multiply-accumulate units to systolic arrays and the trade-offs between compute and communication. The episode covers how Tensor Cores evolved from CUDA cores, the role of clock cycles, FPGA vs ASIC, and the brain vs chip comparison.
Dwarkesh Patel – host
Reiner Pope – CEO of MatX (AI chip company)
Multiply-accumulate (MAC) is the fundamental primitive for AI chips because matrix multiplication consists of repeated MAC operations.
In MAC, accumulation needs higher precision than multiplication because rounding errors accumulate over many additions.
Data movement (muxes, register files) dominates chip area and energy, often costing more than the actual compute logic.
Systolic arrays (Tensor Cores) solve the data movement problem by storing weights locally and amortizing communication over many computations.
The quadratic scaling of multiplier area with bit width makes low-precision arithmetic (FP4 vs FP8) much more efficient than linear scaling suggests.
Clock cycle is set by the longest combinational path; pipelining can increase frequency but adds register area and may break feedback loops.
FPGAs are ~10x less efficient than ASICs because LUTs and routing muxes use many more gates than a direct gate implementation.
Deterministic latency in CPUs is broken by caches; scratchpad memories (like in TPUs) give software control and deterministic timing.
GPUs use many small SMs (each like a tiny TPU) for flexibility, while TPUs use fewer larger matrix units for higher efficiency on large matmuls.
The brain's unstructured sparsity, slower clock, and co-located memory/compute differ from current chip designs, but some principles (e.g., locality) are similar.
Fundamental primitive: Multiply-Accumulate (MAC)
The core operation in AI chips is multiply-accumulate: multiply two numbers and add to an accumulator.
Matrix multiplication is a triple loop (i, j, k) where each step is a MAC: output[i][k] += input[i][j] * other_input[j][k].
Accumulation needs higher precision than multiplication because errors accumulate over many additions (e.g., 4-bit multiply, 8-bit accumulate).
Example: 4-bit × 4-bit multiply with 8-bit add. The multiplication produces 16 partial products (using AND gates) plus the 8-bit accumulator term.
Partial products are generated by AND gates: each bit of one number AND each bit of the other (P×Q AND gates for P-bit × Q-bit).
Summing partial products uses full adders (3-to-2 compressors) that take three bits in the same column and output a sum and a carry.
Each full adder turns three bits into two, so it removes exactly one bit. The number of full adders needed is therefore input bits minus output bits: (P×Q + P + Q) - (P + Q) = P×Q.
This MAC design is area-efficient and matches the recurrence in matrix multiplication.
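The loop structure and the gate counts above can be sketched in Python. This is a minimal illustration rather than a hardware description: the function names (`mac`, `matmul`, `partial_products`, `full_adder_count`) are invented for this example, and the full-adder count only restates the bit-counting argument from the notes.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: acc + a * b."""
    return acc + a * b


def matmul(x, w):
    """Triple-loop matrix multiply where every inner step is a single MAC."""
    rows, inner, cols = len(x), len(w), len(w[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(inner):
            for k in range(cols):
                out[i][k] = mac(out[i][k], x[i][j], w[j][k])
    return out


def partial_products(a, b, p_bits, q_bits):
    """AND every bit of `a` with every bit of `b`, keeping each result at its column weight."""
    terms = []
    for i in range(p_bits):
        for j in range(q_bits):
            bit = ((a >> i) & 1) & ((b >> j) & 1)  # one AND gate
            terms.append(bit << (i + j))           # the bit lands in column i + j
    return terms


def full_adder_count(p_bits, q_bits):
    """Input bits minus output bits, since each full adder removes one bit."""
    input_bits = p_bits * q_bits + (p_bits + q_bits)  # partial products + accumulator
    output_bits = p_bits + q_bits
    return input_bits - output_bits


terms = partial_products(13, 11, 4, 4)
print(len(terms), sum(terms))   # 16 partial products whose sum is 13 * 11 = 143
print(full_adder_count(4, 4))   # 16
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

Summing the shifted partial products in software stands in for the tree of full adders that does the same job in silicon.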
Quadratic scaling with bit width and precision trade-offs
Multiplier area scales quadratically with bit width: roughly P×Q gates for a P-bit × Q-bit multiply.
Halving the bit width (FP8 to FP4) therefore cuts multiplier area by about 4×, so the same area yields roughly 4× the throughput, not just 2×.
Nvidia historically reported 2× FP4 vs FP8, but B300 and later acknowledge 3× (though theory says 4×); floating-point exponent logic complicates the ratio.
The quadratic scaling is the key reason low precision works so well for neural networks.
Data movement also benefits: two 4-bit numbers pack into the same storage as one 8-bit number, halving memory bandwidth.
Choosing how much FP4 vs FP8 hardware to include is a major design decision, often based on customer requirements or power budgets.
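A rough numerical sketch of the two effects above, quadratic multiplier area and linear storage savings. It treats the operands as plain integers, so it ignores the floating-point exponent logic that the notes say complicates the real ratio; the helper names are hypothetical.

```python
def multiplier_gates(bits):
    """Approximate multiplier size for a bits x bits multiply (P x Q with P == Q)."""
    return bits * bits


def pack_nibbles(lo, hi):
    """Store two 4-bit values in the space of one 8-bit value."""
    if not (0 <= lo < 16 and 0 <= hi < 16):
        raise ValueError("both values must fit in 4 bits")
    return (hi << 4) | lo


def unpack_nibbles(byte):
    """Recover the two 4-bit values from one byte."""
    return byte & 0xF, byte >> 4


# Area ratio between an 8-bit and a 4-bit multiplier under this model.
print(multiplier_gates(8) // multiplier_gates(4))  # 4

packed = pack_nibbles(0b0101, 0b1110)
print(packed, unpack_nibbles(packed))  # 229 (5, 14)
```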
Data movement cost: muxes and register files
In a traditional CUDA core or CPU, a register file (e.g., 8 entries) feeds a MAC unit via muxes.
A mux selects one of N inputs (each P bits) using AND-OR logic: N×P AND gates and (N-1)×P OR gates.
For a MAC with three inputs, each fed by its own mux, the mux cost is roughly 3×N×P gates, while the MAC itself costs only P×Q gates.
With N=8, P=4, Q=4: mux cost = 3×8×4 = 96 gates; MAC cost = 4×4 = 16 gates. Data movement dominates (~86% of area).
This hidden data movement cost is the motivation for Tensor Cores / systolic arrays.
The goal: increase compute per data movement by batching more operations together.
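The arithmetic behind the 86% figure can be reproduced in a few lines. This sketch follows the notes' simplified estimate of N×P gates per mux (it does not count the OR gates separately), and the function names are made up for illustration.

```python
def mux_gates(num_inputs, width):
    """Approximate gate count of one mux, using the notes' N x P estimate."""
    return num_inputs * width


def mac_gates(p_bits, q_bits):
    """Approximate gate count of a P-bit x Q-bit multiply-accumulate."""
    return p_bits * q_bits


def data_movement_fraction(entries, p_bits, q_bits, operands=3):
    """Return (mux gates, MAC gates, share of total gates spent on muxes)."""
    movement = operands * mux_gates(entries, p_bits)
    compute = mac_gates(p_bits, q_bits)
    return movement, compute, movement / (movement + compute)


movement, compute, fraction = data_movement_fraction(entries=8, p_bits=4, q_bits=4)
print(movement, compute, round(fraction, 2))  # 96 16 0.86
```

Growing the register file (larger `entries`) makes the mux share worse, while the MAC cost stays fixed.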
Systolic arrays (Tensor Cores) to amortize communication
A systolic array bakes an entire matrix-vector multiplication loop into hardware, not just a single MAC.
It stores the weight matrix locally (in registers) and streams input vectors through, reusing the weights many times.
For a 2×2 matrix, 4 MACs are performed; input/output vectors have size 2 (linear), while compute is quadratic (4).
Weights are loaded slowly, daisy-chained from one cell to the next, so the bandwidth from the register file stays linear in vector size rather than quadratic.
This achieves the goal: compute scales as X×Y, communication scales as X (or Y), giving a Y× advantage.
In practice, systolic arrays are much larger, e.g. 256×256 in TPU v1 and 128×128 in later TPUs, amortizing register file costs over many MACs.
The same principle (maximize compute/communication) applies at all levels: from precision to chip-to-chip networking.
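The compute-versus-communication argument can be made concrete with a small software model. This is only a counting sketch, not a cycle-accurate systolic array: `stream_vectors` is a hypothetical helper that keeps the weights fixed, streams vectors through, and tallies MACs against the words that cross the array boundary.

```python
def stream_vectors(weights, vectors):
    """Weight-stationary matrix-vector products with MAC and I/O counters.

    The weight load (rows * cols words) is paid once and is not counted here;
    it is amortized over however many vectors are streamed through.
    """
    rows, cols = len(weights), len(weights[0])
    macs = 0
    io_words = 0
    outputs = []
    for vec in vectors:
        io_words += cols                 # one input vector enters the array
        out = [0] * rows
        for r in range(rows):
            for c in range(cols):
                out[r] += weights[r][c] * vec[c]
                macs += 1
        io_words += rows                 # one output vector leaves the array
        outputs.append(out)
    return outputs, macs, io_words


outputs, macs, io_words = stream_vectors([[1, 2], [3, 4]], [[1, 1]])
print(outputs, macs, io_words)  # [[3, 7]] 4 4

# A 128 x 128 array does 16384 MACs per vector while moving only 256 words.
big = [[1] * 128 for _ in range(128)]
_, macs, io_words = stream_vectors(big, [[1] * 128])
print(macs, io_words, macs // io_words)  # 16384 256 64
```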
Clock cycles, pipelining, and feedback loops
Chips synchronize globally every clock cycle (e.g., every nanosecond) using registers between logic clouds.
Clock frequency is limited by the longest combinational path (critical path) between registers.
Pipelining: inserting registers splits logic into shorter stages, allowing higher frequency but adding register area.
Feedback loops (e.g., an accumulator) cannot be pipelined naively because splitting the loop changes the computation (e.g., it yields separate sums of the even and odd terms instead of one total).
These loops often set the clock cycle; designers must balance frequency vs. area vs. correctness.
TSMC provides primitives (AND, full adder) with ~10 ps delay; typical clock cycle allows 10-30 gates in series.
Over-pipelining wastes area on registers; under-pipelining limits frequency. The sweet spot maximizes throughput (ops/clock × clocks/sec).
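Two of the points above lend themselves to a small model: the critical path sets the clock period, and naively pipelining an accumulator changes what it computes. The sketch below assumes a uniform gate delay and ignores register setup time and wire delay; all names are illustrative.

```python
def clock_period_ps(path_gate_counts, gate_delay_ps=10):
    """Clock period is set by the longest chain of gates between registers."""
    return max(path_gate_counts) * gate_delay_ps


def accumulate(xs):
    """Single-cycle feedback loop: each input is added to the running total."""
    acc = 0
    for x in xs:
        acc += x
    return acc


def pipelined_accumulate(xs, stages=2):
    """Accumulator whose adder result only comes back `stages` cycles later.

    Each input is added to the value from `stages` cycles ago, so the loop
    ends up holding `stages` interleaved partial sums rather than one total.
    """
    lanes = [0] * stages
    for cycle, x in enumerate(xs):
        lanes[cycle % stages] += x
    return lanes


print(clock_period_ps([12, 30, 18]))  # 300

xs = [1, 2, 3, 4, 5, 6]
print(accumulate(xs))                 # 21
print(pipelined_accumulate(xs))       # [9, 12]  (even-index sum, odd-index sum)
```

An extra add is needed at the end to merge the lanes, which is one reason feedback loops tend to be left unpipelined and end up setting the clock cycle.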
FPGA vs ASIC: flexibility vs efficiency
FPGAs emulate ASIC logic using lookup tables (LUTs) and programmable routing (muxes).
A 4-input LUT is a 16:1 mux (truth table) that can implement any 4-input boolean function.
Each LUT costs ~32 gates (16 ANDs + 16 ORs) but replaces only ~3 gates for a simple function like 4-input AND.
Routing muxes add further overhead; total FPGA area/energy is ~10× worse than ASIC.
FPGAs are used when logic changes frequently (e.g., high-frequency trading) because ASIC tapeout costs $30M+.
Deterministic latency is achievable in both FPGAs and ASICs; CPUs lose determinism due to caches and branch predictors.
Scratchpad memories (TPU) give software control over data movement, unlike hardware-managed caches (CPU).
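A 4-input LUT is easy to model in software: the configuration is a 16-entry truth table and the four inputs form the index. This sketch uses the gate estimates from the notes (about 32 gates per LUT versus 3 two-input gates for a 4-input AND); the table and constant names are invented for the example.

```python
def lut4(table, a, b, c, d):
    """Look up the output of a 4-input boolean function in a 16-entry truth table."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return table[index]


# Reprogramming the LUT just means loading a different table.
AND4_TABLE = [0] * 15 + [1]                               # 1 only when all inputs are 1
XOR4_TABLE = [bin(i).count("1") & 1 for i in range(16)]   # parity of the inputs

LUT_GATES = 16 + 16   # notes' estimate: 16 ANDs + 16 ORs for the 16:1 mux
AND4_GATES = 3        # a 4-input AND built from three 2-input AND gates

print(lut4(AND4_TABLE, 1, 1, 1, 1), lut4(AND4_TABLE, 1, 0, 1, 1))  # 1 0
print(lut4(XOR4_TABLE, 1, 1, 1, 0))                                # 1
print(LUT_GATES / AND4_GATES)  # roughly 10.7x more gates than the direct implementation
```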
GPU vs TPU architecture: many small vs few large
GPU: many Streaming Multiprocessors (SMs), each with small Tensor Cores, register files, and schedulers, tiled across the chip.
TPU: fewer, larger Matrix Units (systolic arrays) with a central vector unit; coarser-grained.
An SM is like a tiny TPU; GPUs have many tiny TPUs, TPUs have a few large ones.
Larger systolic arrays amortize register file costs better, but require more data movement across the chip (perimeter bottleneck).
GPUs have higher data bandwidth between vector and matrix units because each SM has its own local connections.
MatX's 'splittable systolic array' aims to combine the flexibility of small arrays with the efficiency of large ones.
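The amortization side of this trade-off can be illustrated by splitting a fixed MAC budget into arrays of different sizes. The model below is a deliberate simplification: it assumes each square array moves one input vector and one output vector across its edge per cycle, and the function names are hypothetical. It does not capture the cross-chip wiring cost of large arrays.

```python
def macs_per_cycle(array_size, num_arrays=1):
    """Total MAC units across `num_arrays` square arrays of side `array_size`."""
    return num_arrays * array_size * array_size


def edge_words_per_cycle(array_size, num_arrays=1):
    """Words crossing array edges per cycle: one input and one output vector each."""
    return num_arrays * 2 * array_size


# Same MAC budget, organized two ways.
tpu_like = (macs_per_cycle(128), edge_words_per_cycle(128))
gpu_like = (macs_per_cycle(16, num_arrays=64), edge_words_per_cycle(16, num_arrays=64))

print(tpu_like, tpu_like[0] // tpu_like[1])  # (16384, 256) 64
print(gpu_like, gpu_like[0] // gpu_like[1])  # (16384, 2048) 8
```

Under these assumptions the single large array does 64 MACs per word moved and the 64 small arrays do 8, while the small arrays offer 8× the aggregate edge bandwidth to their local vector units.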
Brain vs chip: clock speed, sparsity, and energy
Brain clock is ~100 Hz vs chip GHz; slower clock reduces switching energy (dynamic power scales linearly with frequency).
Brain has unstructured sparsity (any neuron can connect to any other), while chips use structured sparsity for efficiency.
Memory and compute are co-located in the brain (synapses), similar to systolic arrays storing weights locally.
Running a chip at a lower frequency (e.g., 1 MHz) reduces power proportionally, but it does not reduce the energy per operation much: dynamic energy is spent per transition, and idle time consumes little dynamic power.
Dynamic power dominates chip energy: charging/discharging capacitors on each 0→1 or 1→0 transition.
The brain's energy efficiency comes from different physics (ion channels, analog computation), not just from a slower clock.
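The frequency argument above follows from the standard first-order model of CMOS dynamic power. The symbols are not from the episode notes and are introduced here as assumptions: α is the fraction of capacitance that switches per cycle, C the total switched capacitance, V the supply voltage, and f the clock frequency.

```latex
P_{\text{dyn}} = \alpha \, C \, V^{2} \, f
\qquad
E_{\text{cycle}} = \frac{P_{\text{dyn}}}{f} = \alpha \, C \, V^{2}
```

Power is linear in f, but the energy per cycle does not depend on f at all, so slowing the clock by itself lowers power without making each operation cheaper.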
Practical steps
When designing AI chips, prioritize maximizing compute per data movement at every level: precision, systolic array size, and inter-chip communication.
Use lower precision (FP4) where possible to exploit quadratic area savings and reduce memory bandwidth.
Consider scratchpad memories instead of caches for deterministic latency and software-controlled data movement.
For applications requiring frequent logic changes (e.g., trading), use FPGAs; for fixed high-volume workloads, use ASICs.
Balance clock frequency and pipeline depth to maximize throughput, not just frequency; avoid over-pipelining.
Evaluate trade-offs between many small compute units (GPU-like) vs few large ones (TPU-like) based on workload granularity.
Notable quotes
"Data movement is the hidden cost; almost all the area in a traditional core is spent on muxes and register files, not the actual multiply."
"The quadratic scaling of multiplier area with bit width is the single reason low precision has worked so well for neural nets."
"In a systolic array, you store the weight matrix locally and stream vectors through, amortizing communication over many computations."
"The clock cycle is set by the longest path; feedback loops like accumulators are the hardest to pipeline because they change the computation."
"An FPGA is about 10x less efficient than an ASIC because a LUT uses 32 gates to do what 3 gates can do."
"A GPU is many tiny TPUs; a TPU is a few large ones. The trade-off is flexibility vs. efficiency."
Mentioned in the episode
MatX – AI chip company founded by Reiner Pope
Nvidia Volta – GPU generation that introduced Tensor Cores
Nvidia B100/B200/B300 – recent GPU architectures with FP4/FP8 ratios
TPU (Tensor Processing Unit) – Google's AI accelerator with systolic arrays
CUDA core – traditional GPU compute unit before Tensor Cores