cuDNN

cuDNN, the CUDA Deep Neural Network library, is NVIDIA’s GPU-accelerated library of primitives for deep learning. First released in 2014, it sits one layer above CUDA and provides highly optimized implementations of the operations that neural networks spend most of their time on. NVIDIA’s documentation describes it as “a GPU-accelerated library of primitives for deep neural networks” that “provides highly tuned implementations for standard routines such as forward and backward convolution, attention, matmul, pooling, and normalization.”

The library exists because the performance-critical inner loops of deep learning are a small, recurring set of operations. Rather than have every framework author write and re-tune convolution kernels for each new GPU generation, NVIDIA centralized that work in cuDNN and updates it as new hardware ships. A framework calls a cuDNN routine; cuDNN selects and runs an implementation tuned for the specific GPU. This division of labor is why a single optimization inside cuDNN can speed up every framework that depends on it, without any change to the framework's own code.
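
The dispatch pattern described above can be sketched in a few lines of Python. This is a toy model, not cuDNN's API: the registry, the device labels "new_gpu" and "old_gpu", and the functions conv1d_reference, conv1d_unrolled3, select_conv1d, and run_conv1d are all hypothetical, chosen only to show a caller naming an operation while the library picks the variant.

```python
# Toy model of "caller names the operation, library picks the kernel".
# Every name here is hypothetical; none of it is cuDNN's real API.

def conv1d_reference(signal, kernel):
    """General fallback: valid-mode 1D correlation for any kernel width."""
    n, k = len(signal), len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(n - k + 1)]

def conv1d_unrolled3(signal, kernel):
    """Variant specialized for 3-tap kernels (stands in for a tuned kernel)."""
    a, b, c = kernel
    return [signal[i] * a + signal[i + 1] * b + signal[i + 2] * c
            for i in range(len(signal) - 2)]

# Registry keyed by (device label, kernel width); the keys are made up.
_IMPLEMENTATIONS = {
    ("new_gpu", 3): conv1d_unrolled3,
}

def select_conv1d(device, kernel_width):
    """Return the specialized variant if one is registered, else the fallback."""
    return _IMPLEMENTATIONS.get((device, kernel_width), conv1d_reference)

def run_conv1d(device, signal, kernel):
    """What the 'framework' calls: it never names a specific implementation."""
    impl = select_conv1d(device, len(kernel))
    return impl(signal, kernel)

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
kernel = [1.0, 0.0, -1.0]
result_new = run_conv1d("new_gpu", signal, kernel)  # uses the specialized variant
result_old = run_conv1d("old_gpu", signal, kernel)  # falls back to the reference
```

Both calls return the same values; only the code path differs. Registering a faster variant in the table benefits every caller of run_conv1d without any caller changing, which is the property the paragraph above attributes to cuDNN.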

cuDNN is the largely invisible engine beneath the popular frameworks. TensorFlow and PyTorch do not, in the common case, implement their own GPU convolutions; they dispatch to cuDNN. When a researcher writes a few lines of Python to define and train a network, much of the heavy numerical work is executed by cuDNN kernels running on CUDA. This layering (framework on top of cuDNN, cuDNN on top of CUDA, CUDA on top of the GPU) is the standard architecture of modern deep-learning compute.
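
One way to see this layering from the framework side is to ask PyTorch what it knows about cuDNN. The sketch below assumes PyTorch may or may not be installed and uses its torch.backends.cudnn module; the helper name cudnn_status and the shape of the returned dictionary are illustrative choices, not part of any library.

```python
# Query PyTorch's cuDNN backend; degrades gracefully when PyTorch is absent.
# cudnn_status is a hypothetical helper written for this sketch.

def cudnn_status():
    """Report whether PyTorch is importable and whether it can see cuDNN."""
    try:
        import torch
    except ImportError:
        return {
            "torch_installed": False,
            "cudnn_available": False,
            "cudnn_version": None,
        }
    available = bool(torch.backends.cudnn.is_available())
    return {
        "torch_installed": True,
        "cudnn_available": available,
        # version() returns an integer-encoded version when cuDNN is present.
        "cudnn_version": torch.backends.cudnn.version() if available else None,
    }

status = cudnn_status()
print(status)
```

On a machine with a CUDA build of PyTorch, the same module also exposes the flag torch.backends.cudnn.benchmark; setting it to True asks PyTorch to let cuDNN time several convolution algorithms and keep the fastest for each input configuration.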

Architecturally, cuDNN evolved from a fixed set of per-operation functions toward a graph-based interface. NVIDIA’s documentation explains that the cuDNN frontend and backend APIs are entry points to “the graph API,” which lets callers describe a graph of operations that cuDNN can fuse and optimize as a unit. Operation fusion (combining several steps into one kernel so that intermediate results do not make round-trips to memory) is one of the main levers cuDNN uses to extract performance from the underlying hardware, including the Tensor Cores that NVIDIA first introduced with its Volta architecture in 2017.
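
The idea behind fusion can be illustrated without a GPU. The following is a conceptual sketch only: plain Python lists stand in for GPU memory, the functions unfused and fused are hypothetical, and the scale, bias, and ReLU chain is an assumed example rather than a specific cuDNN graph. It shows that three separate passes and one combined pass compute the same result while touching memory a different number of times.

```python
# Conceptual model of operation fusion: scale, add bias, then ReLU.
# Plain Python lists stand in for GPU memory; nothing here calls cuDNN.

def unfused(x, scale, bias):
    """Three separate passes, each writing a full intermediate buffer."""
    scaled = [v * scale for v in x]        # pass 1: read x, write scaled
    shifted = [v + bias for v in scaled]   # pass 2: read scaled, write shifted
    out = [max(v, 0.0) for v in shifted]   # pass 3: read shifted, write out
    buffers_written = 3
    return out, buffers_written

def fused(x, scale, bias):
    """One pass: each element is read once and written once."""
    out = [max(v * scale + bias, 0.0) for v in x]
    buffers_written = 1
    return out, buffers_written

x = [-2.0, -0.5, 0.0, 1.5, 3.0]
out_a, writes_a = unfused(x, scale=2.0, bias=1.0)
out_b, writes_b = fused(x, scale=2.0, bias=1.0)
```

The two outputs are identical, but the fused version writes one buffer instead of three. On a GPU the arithmetic in such a chain is cheap relative to memory traffic, so removing the intermediate buffers is where most of the gain from fusion comes from.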

Because it is tied to CUDA and to NVIDIA hardware, cuDNN is also part of the company’s competitive moat. The deep-learning frameworks that the field standardized on were built and optimized against cuDNN’s behavior, which made NVIDIA GPUs the path of least resistance for training and inference. A framework can in principle target other backends, but for years the most thoroughly optimized and best-supported route ran through cuDNN.