PMPP 速查表 (Quick Reference)
用法
本表彙整全書高價值、可考的公式與數值比值;公式與 CUDA 術語保留英文,說明用繁體中文。每個主題標題後的 → 連到最相關的概念筆記。最後的 必背公式 是考前最後衝刺清單。
效能模型 (Amdahl / Roofline / Arithmetic Intensity) → 優化清單與效能瓶頸
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Speedup 定義 | Speedup_{A/B} = T_B / T_A | 200s / 10s = 20× |
| Amdahl's Law | Speedup = 1 / ((1−p) + p/s) | p = 可平行比例, s = 該部分加速倍率 |
| Amdahl 硬上限 | s→∞ ⇒ 1 / (1−p) | p=30% → 1.43×;p=99% → 100×(s=100 時僅 ≈50×) |
| Arithmetic Intensity | OP/B = compute_ops / bytes_from_global | 高 → compute-bound;低 → memory-bound |
| Machine balance (ridge) | required OP/B = peak_FLOPS / peak_BW | A100: 19500 / 1555 = 12.5 OP/B |
| Bound 判定 | OP/B < peak_FLOPS/peak_BW ⇒ memory-bound;否則 compute-bound | roofline ridge point |
| Memory-bound 吞吐 | GFLOPS = peak_BW × OP/B | 1555 GB/s × 0.25 = 389 GFLOPS |
| 平行加速期望 | ~10×(飽和 DRAM 頻寬) → >100×(>99.9% 平行 + 大量調校) | 直接平行化常卡在 ~10× |
記憶體 (Coalescing / Memory Types / Tiling) → Tiling 技術與 Tiled 矩陣乘法 Kernel
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Row-major 線性化 | addr(M[row][col]) = row*Width + col | 同列相鄰,上下相隔 Width |
| Coalesced 樣式 | M[k*Width + col], col = bx*bd.x + tx ⇒ stride = 1 | warp i=0..31 存 X+i → 合併 1 個 burst |
| Uncoalesced 樣式 | M[col*Width + k] ⇒ stride = Width | column-major / transpose |
| Tiling 流量縮減 | reduction factor = TILE_WIDTH | 每元素讀 N/T 次(原 N 次) |
| Tiled matmul OP/B | OP/B = 0.25 × TILE_WIDTH | 16×16 → 4;32×32 → 8 |
| Phases 數 | num_phases = Width / TILE_WIDTH | |
| 協作載入索引 | Mds[ty][tx]=M[RowW + phT + tx];Nds[ty][tx]=N[(ph*T+ty)*W + Col] | T = TILE_WIDTH |
| Constant memory 容量 | ≤ 64 KB (65,536 bytes) / application | F 由 constant cache 服務 |
| Shared mem / thread (full occ, A100) | 164 KB / 2048 = 82 B/thread 門檻 | tiled matmul 僅用 8 B/thread |
| DDR 頻寬 | BW = width(B) × 2 × clock(GHz) | 8B × 2 × 1GHz = 16 GB/s |
| 飽和匯流排 banks 數 | banks ≥ R + 1, R = latency / transfer_time | single-bank 利用率 = 1/(R+1) |
執行模型 (Warps / Occupancy / Scheduling) → Resource Partitioning、Occupancy 與 Device Properties 查詢
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Warps / block | ceil(blockDim / 32) | warp n = threads 32n .. 32(n+1)−1 |
| SIMD efficiency | active_threads / 32 | divergence 降低之 |
| Occupancy | threads_assigned_to_SM / max_threads_per_SM | = warps_assigned / max_warps |
| Full-occupancy 暫存器上限 | regs/thread ≤ regs_per_SM / max_threads = 65536/2048 = 32 | |
| 暫存器限制的 max threads | regs_per_SM / regs_per_thread | 65536/64 = 1024 → ≤ 50% occupancy |
| Oversubscription | max_threads_per_SM / cores_per_SM | A100: 2048 / 64 = 32× |
| Execution waves | ceil(num_blocks / blocks_per_wave) | transparent scalability |
| 不整除 block (768) | floor(2048/768) = 2 → 1536/2048 = 75% | 浪費 resource slots |
| A100 cores | 108 SM × 64 cores = 6912 cores |
Thread/Grid 索引與啟動慣用語 (Indexing & Launch Idioms) → 呼叫 Kernel、Compilation 與本章總結
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Global index (1D) | i = blockIdx.x*blockDim.x + threadIdx.x | 1 thread : 1 element |
| Ceiling-division 啟動 | numBlocks = (n + blockSize − 1) / blockSize = ceil(n/blockSize) | 整數除法慣用語, e.g. (N+255)/256 |
| Grid threads ≥ n | ceil(n/256) × 256 ≥ n | 需邊界檢查 if (i < n) |
| 2D row / col | row = bybd.y + ty;col = bxbd.x + tx | |
| 2D row-major | index(M[j][i]) = j*Width + i | column-major: i*Height + j |
| 3D row-major | index = plane*(W*H) + row*W + col | |
| Block thread 上限 | bd.x × bd.y × bd.z ≤ 1024 | |
| gridDim 範圍 | x ≤ 2^31−1;y, z ≤ 65535 (2^16−1) | |
| RGB byte offset | (row*width + col) × CHANNELS | CHANNELS = 3 |
平行模式複雜度 (Scan / Reduction / Merge / Sort) → 演算法選擇與權衡
| 模式 | Work / Steps | 說明/條件 |
|---|---|---|
| Reduction | ops = N−1;steps = log2(N);speedup = N/log2(N) | avg parallelism = (N−1)/log2(N) |
| Sequential scan | adds = N−1 = O(N) | baseline |
| Kogge-Stone scan | work = N·log2(N)−(N−1) = O(N log N);steps = log2(N) | step 少, work 多(非 work-efficient) |
| Brent-Kung scan | work = 2N−2−log2(N) = O(N);steps = O(2 log N) | work-efficient |
| Hierarchical scan | len(S) = N / SECTION_SIZE;回加 Y[i] += S[bx−1] (bx>0) | 任意長度三-kernel |
| Merge | output size = m+n;sequential = O(m+n) | co-rank 不變式 k = i + j |
| Co-rank search | O(log N), N = max(m, n) | binary search;i_low=max(0,k−n) |
| Sort 下界 | comparison-based ≥ O(N log N) | radix sort < O(N log N) (非比較式) |
| 其他 sort work | merge O(N log N);odd-even transposition O(N²);bitonic O(N log²N) | |
| Radix sort | iterations = ceil(N_bits / r);buckets = 2^r | dst0 = i − ones_before;dst1 = size − ones_total + ones_before |
Convolution 與 Stencil 強度 (Arithmetic Intensity Bounds) → Stencil 的 Shared Memory Tiling
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| 1D convolution | y[i] = Σ_{j=−r..r} f[j+r]·x[i+j];filter size = 2r+1 | cross-correlation 形式 |
| 2D convolution | P = ΣΣ f·N;filter = (2r+1)² | |
| Basic conv OP/B | 2 / 8 = 0.25 | constant-mem 版: 2/4 = 0.5(2×) |
| Input tile dim | IN_TILE_DIM = OUT_TILE_DIM + 2·FILTER_RADIUS | halo cells |
| Conv tiled 上界 | OP/B_max = (2r+1)²·2 / 4 (tile ≫ r) | 5×5 → 12.5;9×9 → 40.5 |
| Stencil point count | 1D = 2·order + 1 | 7-point 3D = self + 6 |
| Stencil basic OP/B | 13 / (7×4) = 0.46 | |
| Tiled stencil OP/B (7-pt) | (13/4)(1 − 2/T)³ → 3.25 (T→∞) | T=8 → 1.37;T=32 → 2.68 |
| Stencil/conv 上界規則 | OP/B_max ≈ (stencil point count) / 2 | 5-pt → 2.5;3×3 conv → 4.5 |
| Halo fraction (3D, order-1) | 1 − ((T−2)/T)³ | T=8 → 58% |
稀疏矩陣與圖遍歷 (Sparse Matrix & Graph) → 稀疏矩陣與基礎儲存格式
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| SpMV | y = A·x + y;OP/B ≈ 0.25 (value 4B + colIdx 4B, 無重用) | bandwidth-bound |
| COO 儲存 | rowIdx[nnz] + colIdx[nnz] + value[nnz] | 1% 非零 → 總量 ≈ dense 的 3% |
| CSR 儲存 | colIdx[nnz] + value[nnz] + rowPtrs[numRows+1] | rowPtrs[numRows] = nnz |
| ELL (column-major) | i = t*numRows + row;size = numRows × maxNnzPerRow | padding 隨最長 row 增長 |
| Graph 儲存 | adjacency = N²;CSR/CSC = (N+1)+E;COO = 2E | out-degree(i)=srcPtrs[i+1]−srcPtrs[i] |
| BFS 複雜度 | O(V + E) | level[v] = root 到 v 的最少 hops |
| Threads / level | vertex-centric = V;edge-centric = E (E ≫ V) | edge-centric 平行度更高 |
| Frontier | threadsLaunched = numPrevFrontier;結束於 curr frontier = 0 | 非全圖 N |
| First-visit 判定 | atomicCAS(&level[nbr], UINT_MAX, currLevel) == UINT_MAX | 第一個訪問者入隊 |
深度學習 (CNN / Conv Layer / GEMM) → GPU Convolutional Layer:CUDA Kernel 與 GEMM 化
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Perceptron / FC | y = sign(W·x + b);FC: y = sigmoid(W·x + b), W 為 m×n | cost = m·n weights |
| Cost / 梯度 | E = (y−t)²/2;∂E/∂y = y−t;更新 θ ← θ − ε·∂E/∂θ | gradient descent |
| Conv output size | H_out = H − K + 1 | cuDNN: P = (H − R + 2·pad)/u + 1 |
| Conv forward | Y[m,h,w] = Σc Σp Σq X[c,h+p,w+q]·W[m,c,p,q] | filter banks = M × C |
| FC backprop | ∂E/∂x = Wᵀ(∂E/∂y);∂E/∂w = (∂E/∂y)xᵀ | outer product |
| im2col 形狀 | filter M×(C·K·K);X_unroll (C·K·K)×(H_out·W_out);Y M×(H_out·W_out) | GEMM 化 |
| Expansion ratio | (C·K·K·H_out·W_out)/(C·H·W) → K·K | 範例 16/9 ≈ 1.8 |
| 可用平行度 | N × M × H_out × W_out | 四層 easy-parallel loops |
數值 (Floating-Point) → 浮點數資料表示
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| IEEE-754 值 | value = (−1)^S × 1.M × 2^(E − bias) | bias = 2^(e−1) − 1 (e = exponent bits) |
| Effective mantissa | m-bit mantissa = (m+1)-bit (隱含 1.) | |
| Denormalized (E=0) | value = (−1)^S × 0.M × 2^(2 − 2^(n−1)) | single → exp = −126;填平 0 附近空洞 |
| 特殊樣式 | E 全1 & M=0 → ±∞;E 全1 & M≠0 → NaN;E=0 & M=0 → 0 | NaN 來源:0/0, 0×∞, ∞−∞ |
| 每區間代表數 | 2^N (N = mantissa bits) | 越靠近 0 越密 |
| single vs double | double error = single error / 2^29 | mantissa 多 29 bits |
| Rounding error | 正確捨入 add/sub ≤ 0.5 ULP;div / 超越函數可能 > 0.5 ULP | round-to-zero ≤ 1 ULP |
| Precision vs Accuracy | precision ← mantissa 位數;accuracy ← 運算與順序 | FP 加法不符結合律 → 平行 reduction 順序影響結果 |
硬體數據與其他 (Atomics / Bandwidth / Dynamic Parallelism) → 記憶體頻寬與運算吞吐量
| 項目 | 公式/數值 | 說明/條件 |
|---|---|---|
| Atomic 吞吐 | period ≈ 2 × memory_latency;throughput ≈ 1/(2·latency) | 1/400cyc × 1GHz = 2.5 M atomics/s |
| Privatization | private hist bytes = gridDim.x × NUM_BINS × 4;競爭 ↓ ≈ active blocks | 每 block 僅 1 次 global atomicAdd |
| A100 FP peak (2021) | FP64 = 9.7;FP32 = 156;FP16 = 312 TFLOPS | GPU 頻寬 ≈ 10× 同代 CPU |
| FP64 相對速度 | early ≈ FP32/8;Fermi+ ≈ FP32/2 | FP16 資料搬運量 = 0.5× FP32 |
| 系統互連頻寬 | PCIe (system interconnect) BW < 10% global mem BW | NVLink ≤ 5× PCIe 3.0;zero-copy 受限 |
| G80→A100 (13 年) | compute 452×;bandwidth 18× → gap 25× | memory wall 加深 → OP/B 需求上升 |
| DCS gather | OP/B = 9 FLOP / 4 access = 2.25;複雜度 O(N_atom·N_grid) = O(V²) | cutoff binning → O(V) |
| Dynamic parallelism | child gridDim = ceil((end−start)/BLOCK_DIM) | max nesting = 24;pending pool = 2048 |
| MPI cluster | compute procs = np−1 (rank np−1 = data server);halo = 4·dimx·dimy | overlap:T ≈ T_stage1 + max(T_compute, T_comm) |
必背公式 (Must-Know Formulas)
- Amdahl's Law — Speedup = 1 / ((1−p) + p/s),上限 s→∞ ⇒ 1/(1−p) → 平行運算的目標與 Amdahl's Law
- Arithmetic Intensity — OP/B = compute_ops / bytes_from_global → 優化清單與效能瓶頸
- Roofline ridge (machine balance) — required OP/B = peak_FLOPS / peak_BW (A100 = 12.5) → 記憶體存取效率與 CUDA 記憶體類型
- Global thread index — i = blockIdx.x*blockDim.x + threadIdx.x → Kernel Functions 與 Threading
- Ceiling-division 啟動 — numBlocks = (n + blockSize − 1) / blockSize → 呼叫 Kernel、Compilation 與本章總結
- Occupancy — threads_assigned_to_SM / max_threads_per_SM → Resource Partitioning、Occupancy 與 Device Properties 查詢
- Coalescing 條件 — index = kWidth + (bxbd+tx) ⇒ stride = 1 → 記憶體合併存取
- Tiled matmul OP/B — OP/B = 0.25 × TILE_WIDTH → Tiling 技術與 Tiled 矩陣乘法 Kernel
- Reduction work/steps — ops = N−1, steps = log2(N), speedup = N/log2(N) → Reduction 基礎與簡單 Kernel
- Scan work — Kogge-Stone O(N log N) vs Brent-Kung 2N−2−log2(N) = O(N) → 高 Work-Efficiency 的 Scan:Brent-Kung 與 Thread Coarsening
- IEEE-754 值 — value = (−1)^S × 1.M × 2^(E − bias), bias = 2^(e−1) − 1 → 浮點數資料表示