The Tensor Core

A Tensor Core is a specialized hardware unit inside an NVIDIA GPU that performs matrix multiply-accumulate operations. NVIDIA introduced Tensor Cores with the Volta architecture in 2017, in the Tesla V100. Where an ordinary GPU core multiplies and adds individual numbers, a Tensor Core operates on small matrices in a single step, computing the product of two matrices and adding it to a third. This directly targets the workload that dominates deep learning: large chains of matrix multiplications.
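The multiply-accumulate step described above can be sketched in software. The following is only an illustrative model, not how the hardware is programmed: plain Python lists stand in for registers, the names mma_tile and tiled_matmul are hypothetical, and the 4×4 tile size is taken from NVIDIA's description of Volta rather than from the text here. It shows how a larger matrix product reduces to repeated small D = A × B + C steps.

```python
# Illustrative model only: Python lists stand in for hardware registers.
TILE = 4  # assumed tile size, following NVIDIA's description of Volta


def mma_tile(a, b, c):
    """Return D = A x B + C for TILE x TILE matrices (one modeled step)."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE)) for j in range(TILE)]
        for i in range(TILE)
    ]


def tiled_matmul(a, b):
    """Multiply square matrices (size a multiple of TILE) using only mma_tile."""
    n = len(a)
    assert n % TILE == 0, "matrix size must be a multiple of TILE"
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, TILE):          # row block of the output
        for bj in range(0, n, TILE):      # column block of the output
            acc = [[0.0] * TILE for _ in range(TILE)]
            for bk in range(0, n, TILE):  # walk the shared dimension tile by tile
                a_tile = [row[bk:bk + TILE] for row in a[bi:bi + TILE]]
                b_tile = [row[bj:bj + TILE] for row in b[bk:bk + TILE]]
                acc = mma_tile(a_tile, b_tile, acc)  # accumulate into the same tile
            for i in range(TILE):
                out[bi + i][bj:bj + TILE] = acc[i]
    return out


# Usage: multiplying by the identity matrix returns the original matrix.
a = [[float(i + j) for j in range(8)] for i in range(8)]
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
print(tiled_matmul(a, identity) == a)  # True
```

The accumulator tile is fed back in as C on every step, which is why the operation is a multiply-accumulate and not a bare multiply.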

NVIDIA’s Volta architecture whitepaper describes the unit precisely. Each Tesla V100 carries 640 Tensor Cores, eight in each of its 80 streaming multiprocessors, and each one “performs 64 floating-point fused-multiply-add (FMA) operations per clock.” Taken together across the whole chip, not within a single streaming multiprocessor, the Tensor Cores “deliver up to 125 TFLOPS for training and inference applications.” That figure is roughly eight times the V100’s peak FP32 throughput of about 15.7 TFLOPS on its conventional floating-point cores, which is why Tensor Cores produced a step change rather than an incremental gain.
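The headline figure can be checked with simple arithmetic. The sketch below takes the core count and per-clock rate from the text; the boost clock of about 1530 MHz is an assumption (the commonly quoted figure for the SXM2 version of the V100) and is not stated in this article.

```python
TENSOR_CORES = 640        # per Tesla V100, from the text
FMA_PER_CLOCK = 64        # per Tensor Core, from the text
FLOPS_PER_FMA = 2         # one multiply plus one add
BOOST_CLOCK_HZ = 1.53e9   # assumption: roughly 1530 MHz boost clock

peak_flops = TENSOR_CORES * FMA_PER_CLOCK * FLOPS_PER_FMA * BOOST_CLOCK_HZ
peak_tflops = peak_flops / 1e12
print(f"{peak_tflops:.1f} TFLOPS")  # 125.3 TFLOPS
```

This is a theoretical peak: it assumes every Tensor Core issues a full set of FMAs on every clock, which real workloads only approach.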

The key technique is mixed precision. A Tensor Core multiplies matrices stored in half precision (FP16) but accumulates the running sum in single precision (FP32). Deep-learning training tolerates the reduced precision of the inputs because the network is statistical and self-correcting, while the FP32 accumulation preserves enough numerical accuracy in the sum to keep training stable. This trade lets the hardware move and multiply data in a compact 16-bit format, roughly doubling effective throughput and halving memory traffic relative to 32-bit storage, without the accuracy collapse that pure low-precision arithmetic would cause.
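Why the accumulator width matters can be seen in a small software model. The sketch below is not Tensor Core code: it uses the Python standard library's struct module to round values to IEEE 754 half and single precision, the helper names are hypothetical, and the all-ones vectors of length 4096 are chosen purely to make the effect easy to see.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_acc):
    """Dot product of FP16 inputs; round_acc rounds the running sum after each add."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(x) * to_fp16(y)  # product of two FP16 values is exact in a double
        acc = round_acc(acc + product)     # model the accumulator's precision
    return acc


n = 4096
a = [1.0] * n
b = [1.0] * n

fp16_accumulated = dot(a, b, to_fp16)  # FP16 inputs, FP16 accumulator
fp32_accumulated = dot(a, b, to_fp32)  # FP16 inputs, FP32 accumulator (mixed precision)

print(fp16_accumulated)  # 2048.0
print(fp32_accumulated)  # 4096.0
```

With an FP16 accumulator the sum stalls at 2048, because above that value adjacent half-precision numbers are 2 apart and adding 1 rounds back to the same number. The FP32 accumulator reaches the correct total of 4096 from the same FP16 inputs.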

Architecturally, the Tensor Core is an example of building a fixed-function accelerator for a specific operation, in the spirit of an ASIC, inside an otherwise general-purpose programmable processor. The general SIMD-style cores of the GPU remain for everything else, while the Tensor Cores handle the one operation, dense matrix multiply, that deep learning performs in overwhelming volume. Later NVIDIA architectures extended the idea with additional numeric formats and higher throughput, but the basic bargain, dedicated matrix hardware plus mixed precision, was set in Volta.

For software, Tensor Cores are mostly invisible: a framework calls a routine in an NVIDIA library such as cuDNN or cuBLAS, and the library issues Tensor Core instructions when the operation and data types allow. This is why a Tensor Core is best understood as a hardware concept whose impact is felt through the software stack above it. Its arrival is one of the clearest cases of hardware being co-designed with a software workload: the GPU was reshaped specifically to make neural-network math fast.
