cuDNN Library and Chapter Summary

#application-case-study #deep-learning #cudnn #gemm #algorithm-selection

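Overview

Item	Key point	Notes
cuDNN	NVIDIA's library of optimized deep learning primitives	Called directly by frameworks such as Caffe, TensorFlow, Theano, and Torch
Data location	Input and output must reside in GPU device memory	Same requirement as cuBLAS
Thread safety	Thread-safe; routines can be called from different host threads
Interface abstraction	Tensors and filters are accessed through opaque descriptors	Layout can be described with arbitrary strides
Core primitive	A particular form of batched convolution	Forward and backward passes share one descriptor
Algorithm selection	GEMM / Winograd / FFT-based ...	The same layer can use different algorithms (algorithm-selection)
Key optimization	Lazy on-chip materialization: X_unroll is generated on the fly, on chip only	Avoids the bandwidth and space cost of writing im2col output back to global memory
Final step	A tensor transposition after the computation restores the user's layout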

Important

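cuDNN's core selling point: framework users can get convolution performance close to the hardware's peak without writing CUDA kernels themselves. Internally, its GEMM path addresses the main drawbacks of the im2col + GEMM approach from Section 16.4: extra memory traffic, the space taken by X_unroll, and poor utilization on small matrices.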

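cuDNN Library Overview

Positioning: a library of optimized routines for deep learning primitives, intended to make it easier for deep learning frameworks to exploit GPU performance.
API: a flexible, easy-to-use C-language API that integrates neatly into existing frameworks.
Data residency: input and output data must already be in GPU device memory (as with cuBLAS), so copy the data to the device before calling.
Thread safety: routines can be called concurrently from multiple host threads.
Descriptor model:
- The forward and backward paths of a convolution share one descriptor that encapsulates all of the layer's attributes.
- Tensors and filters are accessed through opaque descriptors, so the layout of each dimension can be given with arbitrary strides (not tied to row-major).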

Tip

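A "descriptor" packs a layer's shape, strides, padding, and similar parameters into one opaque object. The benefit is that the same compute code can serve any layout or stride, so a framework does not have to rewrite a kernel for every way a tensor can be laid out.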
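To make the stride idea concrete, here is a minimal sketch in plain Python. `TensorDesc`, `nchw`, `nhwc`, and `read` are hypothetical names invented for illustration; they are not cuDNN's API, and a real descriptor is opaque and also carries a data type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorDesc:
    """Hypothetical stand-in for a tensor descriptor: logical dims plus strides."""
    dims: tuple     # logical (N, C, H, W)
    strides: tuple  # element stride of each of the four dims

    def offset(self, n, c, h, w):
        # Linear element offset of the logical index (n, c, h, w).
        sn, sc, sh, sw = self.strides
        return n * sn + c * sc + h * sh + w * sw


def nchw(N, C, H, W):
    # W varies fastest, then H, then C, then N.
    return TensorDesc((N, C, H, W), (C * H * W, H * W, W, 1))


def nhwc(N, C, H, W):
    # C varies fastest, then W, then H, then N.
    return TensorDesc((N, C, H, W), (H * W * C, 1, W * C, C))


def read(buf, desc, n, c, h, w):
    # Layout-agnostic accessor: the same code serves any stride pattern.
    return buf[desc.offset(n, c, h, w)]
```

The caller always uses logical `(n, c, h, w)` indices; only the descriptor knows where the element actually lives in the flat buffer.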

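Batched Convolution and Parameters (Tensor / Filter Descriptors & Parameters)

cuDNN's most important computational primitive is a form of batched convolution that processes a whole minibatch at once. Note that cuDNN's naming differs slightly from the earlier sections.

Parameter	Meaning	Earlier sections
N	Number of images in the minibatch	N (sample)
C	Number of input feature maps	C (channel)
H / W	Height / width of the input image	H / W
K	Number of output feature maps	M (note: not the filter size)
R / S	Height / width of the filter	K×K
u / v	Vertical / horizontal stride	—
pad_h / pad_w	Height / width of the zero padding	ghost cells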

Warning

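Naming trap: in the earlier sections K is the filter side length and M is the number of output maps. In cuDNN, K becomes the number of output feature maps and the filter dimensions are R/S. Exams and interviews often use this as a trap.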

Three 4D tensors:

D : N × C × H × W   (input data)            ── each sample has C input maps of size H×W
F : K × C × R × S   (convolution filters)   ── each of the K outputs applies an R×S filter to each of the C inputs
O : N × K × P × Q   (output)                ── each sample has K output maps of size P×Q

        D[N,C,H,W]            F[K,C,R,S]                 O[N,K,P,Q]
   ┌───────────────┐     ┌────────────────┐        ┌───────────────┐
   │ sample n      │     │ filterbank K×C │   ===> │ sample n      │
   │  ┌──┐ C       │  ⊛  │  R×S each      │        │  ┌──┐ K       │
   │  │HW│ input   │     │                │        │  │PQ│ output  │
   │  └──┘ maps    │     └────────────────┘        │  └──┘ maps    │
   └───────────────┘                               └───────────────┘
   P = f(H, R, u, pad_h)      Q = f(W, S, v, pad_w)

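stride (u, v): only a subset of the output pixels is computed, which directly reduces the computational load.
padding (pad_h, pad_w): rows and columns of zeros are added at the edges of each feature map, which improves memory alignment and vectorized execution (and also takes care of ghost cells).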
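The three tensors and the stride/padding parameters can be tied together with a direct (non-optimized) reference convolution. This is only a sketch of the semantics using cuDNN's parameter names, not how cuDNN computes it. `conv_forward` is a hypothetical helper, tensors are nested Python lists, and the output size uses the usual floor-based formula, which is an assumption on top of the book's abstract f(·).

```python
def conv_forward(D, F, u=1, v=1, pad_h=0, pad_w=0):
    """Direct batched convolution (no filter flip, as in CNN layers).

    D is a nested list [N][C][H][W]; F is a nested list [K][C][R][S].
    Returns O as a nested list [N][K][P][Q].
    """
    N, C, H, W = len(D), len(D[0]), len(D[0][0]), len(D[0][0][0])
    K, R, S = len(F), len(F[0][0]), len(F[0][0][0])
    P = (H + 2 * pad_h - R) // u + 1
    Q = (W + 2 * pad_w - S) // v + 1
    O = [[[[0.0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for p in range(P):
                for q in range(Q):
                    acc = 0.0
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                h = p * u + r - pad_h
                                w = q * v + s - pad_w
                                # Out-of-range reads are the zero padding.
                                if 0 <= h < H and 0 <= w < W:
                                    acc += D[n][c][h][w] * F[k][c][r][s]
                    O[n][k][p][q] = acc
    return O
```

With a stride of 2 the loops over `p` and `q` visit roughly a quarter as many output pixels, which is the reduction in computational load described above.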

Tip

The conventional formula for the output size (the book writes it only as an abstract f(·)):
P = ⌊(H − R + 2·pad_h) / u⌋ + 1, Q = ⌊(W − S + 2·pad_w) / v⌋ + 1.
A larger stride or a smaller padding gives a smaller output.
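A few worked numbers for the formula above, as a small sketch. `conv_out_dim` is a hypothetical helper name, and the example shapes are chosen only for illustration.

```python
def conv_out_dim(size, filt, stride, pad):
    # Floor division: a window position that would run past the padded
    # edge is dropped rather than partially computed.
    return (size - filt + 2 * pad) // stride + 1


# (H, R, u, pad_h) examples
print(conv_out_dim(32, 5, 1, 0))    # 28: 32x32 input, 5x5 filter, no padding
print(conv_out_dim(224, 7, 2, 3))   # 112: stride 2 roughly halves the size
print(conv_out_dim(6, 3, 2, 0))     # 2: (6 - 3) / 2 is not an integer, so floor applies
```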

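Algorithm Selection: GEMM / Winograd / FFT

For a given convolutional layer, cuDNN supports several underlying algorithms, so the fastest one can be chosen for the layer's shape (algorithm-selection).

Algorithm	Idea	Best suited to
GEMM-based	Unroll with im2col, then multiply matrices (as in Section 16.4)	General purpose; efficient when the matrices are large enough
Winograd	Minimal-multiplication transforms (Lavin & Gray, 2016)	Small filters (e.g. 3×3); fewer multiplications
FFT-based	Convolution in the frequency domain (Vasilache et al., 2014)	Large filters / large feature maps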

Important

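No algorithm is always fastest. In practice the framework (or cuDNN's heuristics / autotuning) picks a different algorithm for each layer based on N, C, H, W, K, R, S and the hardware. This is algorithm selection and its trade-offs as they appear in a real library.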
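To show what "fewer multiplications" means for the Winograd row, here is a sketch of the smallest 1D case, usually written F(2,3): two outputs of a 3-tap filter from four inputs, using 4 data-by-filter multiplications instead of 6. The function names are hypothetical. This illustrates the textbook identity only and says nothing about how cuDNN implements its Winograd path.

```python
def direct_f23(d, g):
    # Two outputs of a 3-tap filter over 4 inputs: 6 multiplications.
    return (d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2])


def winograd_f23(d, g):
    # Filter-side transform; it can be precomputed once per filter.
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    # Only these 4 products involve the input data.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return (m1 + m2 + m3, m2 - m3 - m4)
```

The saving is paid for with extra additions and a transform of the filter, which is one reason the best choice depends on the layer shape.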

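Lazy On-Chip Materialization

This is the key improvement of cuDNN's GEMM path over naive im2col.

Two major drawbacks of naive im2col (recap of Section 16.4):

- X_unroll is materialized in off-chip global memory, which takes a lot of space (up to R×S times the size of X).
- X_unroll must be written and then read back, on top of reading the original X, so memory traffic balloons and computational intensity drops.

cuDNN's approach: rather than gathering all of X_unroll off chip first, it lazily loads only small unrolled pieces into on-chip memory and generates them as the computation proceeds.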
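The size of the blow-up can be estimated with a short calculation. This sketch assumes stride 1 and no padding; `unroll_expansion` is a hypothetical helper.

```python
def unroll_expansion(C, H, W, R, S):
    """Ratio of X_unroll elements to X elements (stride 1, no padding)."""
    P, Q = H - R + 1, W - S + 1
    x_elems = C * H * W
    # X_unroll has C*R*S rows and P*Q columns.
    x_unroll_elems = (C * R * S) * (P * Q)
    return x_unroll_elems / x_elems


print(unroll_expansion(3, 32, 32, 5, 5))      # 19.140625, below the 5*5 = 25 bound
print(unroll_expansion(3, 1000, 1000, 3, 3))  # just under the 3*3 = 9 bound
```

The ratio approaches R×S as the feature map grows relative to the filter, because almost every input pixel is then copied R×S times.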

Conventional im2col + GEMM:
   X(global) ──im2col──► X_unroll(global, large!) ──read──► GEMM ──► O
                         ▲ an extra write + read of DRAM traffic

cuDNN lazy materialization:
   X(global) ──────────────────────────────────────────────────► O
                 └─ tiles loaded on chip, unrolled on the fly ─┘
                    (X_unroll never lands in global memory)
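The contrast between the two data flows can be sketched in plain Python for a single sample with stride 1 and no padding. All function names here are hypothetical. In `gemm_lazy` a small index function stands in for the on-chip unrolling: each X_unroll element is produced on demand from X and never stored as a full matrix. Real cuDNN does this per tile in on-chip memory, which this sketch does not model.

```python
def im2col(X, R, S):
    """Materialize X_unroll, (C*R*S) rows by (P*Q) columns, from X[C][H][W]."""
    C, H, W = len(X), len(X[0]), len(X[0][0])
    P, Q = H - R + 1, W - S + 1
    return [[X[c][p + r][q + s] for p in range(P) for q in range(Q)]
            for c in range(C) for r in range(R) for s in range(S)]


def gemm_materialized(Fmat, X, R, S):
    Xu = im2col(X, R, S)  # the full expanded copy exists in memory
    return [[sum(a * Xu[i][j] for i, a in enumerate(row))
             for j in range(len(Xu[0]))] for row in Fmat]


def gemm_lazy(Fmat, X, R, S):
    C, H, W = len(X), len(X[0]), len(X[0][0])
    P, Q = H - R + 1, W - S + 1

    def x_unroll(i, j):
        # Decode matrix coordinates (i, j) into tensor coordinates on demand.
        c, rs = divmod(i, R * S)
        r, s = divmod(rs, S)
        p, q = divmod(j, Q)
        return X[c][p + r][q + s]

    return [[sum(row[i] * x_unroll(i, j) for i in range(C * R * S))
             for j in range(P * Q)] for row in Fmat]
```

Here `Fmat` is the filter bank flattened to K rows of C·R·S weights. Both functions return the same K × (P·Q) result; only the lazy one avoids building the expanded matrix.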

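The tiled pipeline of the underlying matrix-multiply routine (similar to Tan et al., 2011):

 Fixed-size submatrices (tiles) of A and B are loaded on chip in turn
 ┌────────── time ──────────────────────────────►
 compute  : [ tile k   ][ tile k+1 ][ tile k+2 ] ...   ← limited only by arithmetic time
 fetch    :    [ tile k+1 ][ tile k+2 ][ tile k+3 ]    ← prefetch the next tile
            └ compute overlaps DRAM fetches → memory latency is hidden

Fixed-size submatrices of A and B are read into on-chip memory one after another to compute one submatrix of C.
While the current tile is being computed, the next set of tiles is prefetched from off-chip memory. This hides memory latency, so the computation is limited only by arithmetic time (approaching peak floating-point throughput).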
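The ordering of loads and computation in such a pipeline can be sketched as a double-buffered tiled matrix multiply. Python runs the steps sequentially, so nothing actually overlaps here; the sketch only shows where the prefetch is issued relative to the math. `load_tile` and `tiled_matmul` are hypothetical names, and the matrices are assumed square with a size that is a multiple of the tile size T.

```python
def load_tile(M, row0, col0, T):
    # Stand-in for copying a T x T tile from off-chip memory into an on-chip buffer.
    return [[M[row0 + i][col0 + j] for j in range(T)] for i in range(T)]


def tiled_matmul(A, B, T):
    """C = A @ B for square matrices whose size n is a multiple of T."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    steps = n // T
    for bi in range(0, n, T):
        for bj in range(0, n, T):
            # Prologue: fetch the first pair of tiles.
            nxt = (load_tile(A, bi, 0, T), load_tile(B, 0, bj, T))
            for k in range(steps):
                a_tile, b_tile = nxt
                if k + 1 < steps:
                    # Issue the fetch for step k+1 before computing step k.
                    # On a GPU this load would be in flight during the math below.
                    nxt = (load_tile(A, bi, (k + 1) * T, T),
                           load_tile(B, (k + 1) * T, bj, T))
                for i in range(T):
                    for j in range(T):
                        acc = 0
                        for t in range(T):
                            acc += a_tile[i][t] * b_tile[t][j]
                        C[bi + i][bj + j] += acc
    return C
```

Two buffers are live at any time: the tile pair being consumed and the pair being fetched, which is the usual cost of this kind of latency hiding.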

Warning

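The tiling is independent of the convolution parameters, so the mapping between X_unroll tile boundaries and the convolution problem is nontrivial. cuDNN must:

- compute this mapping dynamically, loading the correct elements of A and B into on-chip memory as the computation proceeds;
- pay for extra indexing arithmetic (more than a plain matmul), in exchange for reusing the highly optimized matrix-multiply engine unchanged;
- perform a tensor transposition after the computation to store the result in the data layout the user wants.

Arithmetic intensity view: matmul is especially fast on GPUs because its FLOP/byte ratio is high, and the ratio grows with matrix size. cuDNN avoids writing X_unroll to global memory precisely so that this ratio is not diluted. See arithmetic intensity / roofline.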
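The final transposition step might look like the following sketch. It assumes, purely for illustration, that the GEMM produces a K × (N·P·Q) matrix whose column index is `(n*P + p)*Q + q`, and that the user wants a flat N×K×P×Q buffer. The source only states that a transposition is needed; the specific layouts and the name `gemm_out_to_nkpq` are assumptions.

```python
def gemm_out_to_nkpq(Omat, N, K, P, Q):
    """Reorder a K x (N*P*Q) GEMM result into a flat N x K x P x Q buffer."""
    out = [0] * (N * K * P * Q)
    for k in range(K):
        for j, val in enumerate(Omat[k]):
            # Column j encodes (n, p, q); keep (p, q) together as one offset.
            n, pq = divmod(j, P * Q)
            out[(n * K + k) * P * Q + pq] = val
    return out


# N=2 samples, K=2 output maps, each map is 1x2.
Omat = [[0, 1, 2, 3],        # k=0: n0q0, n0q1, n1q0, n1q1
        [10, 11, 12, 13]]    # k=1
print(gemm_out_to_nkpq(Omat, 2, 2, 1, 2))  # [0, 1, 10, 11, 2, 3, 12, 13]
```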
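A back-of-the-envelope sketch of why the ratio grows with matrix size, assuming square n×n matrices, 4-byte elements, and ideal reuse (A and B each read once, C written once). `matmul_intensity` is a hypothetical helper and the model ignores caches and tiling details.

```python
def matmul_intensity(n, bytes_per_elem=4):
    """FLOP per byte for an n x n matmul under an ideal-reuse model."""
    flops = 2 * n ** 3                         # one multiply and one add per inner-product term
    bytes_moved = 3 * n * n * bytes_per_elem   # read A and B once, write C once
    return flops / bytes_moved


print(matmul_intensity(6))     # 1.0
print(matmul_intensity(1200))  # 200.0
```

Under this model the ratio is n/6 for 4-byte elements, so it grows linearly with n. Any extra traffic, such as writing and re-reading X_unroll, adds bytes to the denominator without adding FLOPs.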

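Chapter Summary

The overall arc of this chapter (Ch. 16), tying the four sibling notes together:

Stage	Content	Note
1. ML foundations	classification, perceptron, MLP, differentiable activations, chain-rule backpropagation	16-Deep-Learning/01-Machine-Learning-Foundations-Perceptrons-Backpropagation
2. CNN structure	LeNet-5; convolution / subsampling / fully connected layers; sequential implementation and minibatches	16-Deep-Learning/02-Convolutional-Neural-Networks-Layers
3. GPU kernel	Four levels of parallelism (N/M/H/W); 2D-block tiled inference kernel	16-Deep-Learning/03-GPU-Convolutional-Layer-CUDA-Kernel-and-GEMM
4. GEMM formulation	im2col unrolling → cuBLAS GEMM; expansion ratio versus memory-traffic trade-off	16-Deep-Learning/03-GPU-Convolutional-Layer-CUDA-Kernel-and-GEMM
5. cuDNN	Use an optimized library to get high performance without writing kernels	(this note)

- The convolutional layer is the most compute-intensive layer in a CNN, so it is the main target for GPU acceleration.
- Every layer type (conv / pooling / FC) can be viewed as a special case or simple variant of the perceptron.
- The conv layer can be built on the convolution pattern from Chapter 7 and then optimized with constant memory and shared-memory tiling (left as an exercise).
- Turning conv into matrix multiplication leverages highly optimized GEMM libraries.
- In the end, most frameworks call cuDNN directly, so users get optimized layer implementations without writing CUDA kernels themselves.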

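Exam / Interview Patterns

Scenario / keyword	Answer / technique
What does cuDNN require about data location?	Input and output must be in GPU device memory (same as cuBLAS)
How does a cuDNN tensor describe its layout?	Through opaque descriptors, with arbitrary strides
What is cuDNN's most important primitive?	A batched convolution (forward and backward share a descriptor)
What does the parameter `K` mean in cuDNN?	The number of output feature maps (not the filter side length; filters use `R`/`S`)
Shapes of the input / filter / output tensors?	`D[N,C,H,W]`, `F[K,C,R,S]`, `O[N,K,P,Q]`
What do `u`/`v` and `pad_h`/`pad_w` do?	Stride reduces the amount of computation; padding (zeros) improves alignment / vectorization
Which conv algorithms does cuDNN support?	GEMM / Winograd / FFT-based and others (the fastest can be chosen per layer)
How does cuDNN avoid im2col's memory waste?	Lazy on-chip materialization: X_unroll is generated on the fly on chip and never written to global memory
How does the matmul routine reach high FP utilization?	Tiles of A and B are brought on chip; computation overlaps prefetching the next tile, hiding latency
Why is extra indexing arithmetic needed?	The matmul tiling is independent of the conv parameters, so the tile↔conv mapping must be computed dynamically
What is the last step after the computation?	A tensor transposition that restores the layout the user wants
Why is matmul fast on GPUs?	High FLOP/byte (arithmetic intensity), and the ratio grows with matrix size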
