PMPP 速查表 (Quick Reference)

用法

本表彙整全書高價值、可考的公式與數值比值;公式與 CUDA 術語保留英文,說明用繁體中文。每個主題標題後的 → 連到最相關的概念筆記。最後的 必背公式 是考前最後衝刺清單。

效能模型 (Amdahl / Roofline / Arithmetic Intensity) → 優化清單與效能瓶頸

項目 公式/數值 說明/條件
Speedup 定義 Speedup_{A/B} = T_B / T_A 200s / 10s = 20×
Amdahl's Law Speedup = 1 / ((1−p) + p/s) p = 可平行比例, s = 該部分加速倍率
Amdahl 硬上限 s→∞ ⇒ 1 / (1−p) p=30% → 1.43×;p=99% → 100×(s=100 時僅 ≈50×)
Arithmetic Intensity OP/B = compute_ops / bytes_from_global 高 → compute-bound;低 → memory-bound
Machine balance (ridge) required OP/B = peak_FLOPS / peak_BW A100: 19500 / 1555 = 12.5 OP/B
Bound 判定 OP/B < peak_FLOPS/peak_BW ⇒ memory-bound;否則 compute-bound roofline ridge point
Memory-bound 吞吐 GFLOPS = peak_BW × OP/B 1555 GB/s × 0.25 = 389 GFLOPS
平行加速期望 ~10×(飽和 DRAM 頻寬) → >100×(>99.9% 平行 + 大量調校) 直接平行化常卡在 ~10×

記憶體 (Coalescing / Memory Types / Tiling) → Tiling 技術與 Tiled 矩陣乘法 Kernel

項目 公式/數值 說明/條件
Row-major 線性化 addr(M[row][col]) = row*Width + col 同列相鄰,上下相隔 Width
Coalesced 樣式 M[k*Width + col], col = bx*bd.x + tx ⇒ stride = 1 warp i=0..31 存 X+i → 合併 1 個 burst
Uncoalesced 樣式 M[col*Width + k] ⇒ stride = Width column-major / transpose
Tiling 流量縮減 reduction factor = TILE_WIDTH 每元素讀 N/T 次(原 N 次)
Tiled matmul OP/B OP/B = 0.25 × TILE_WIDTH 16×16 → 4;32×32 → 8
Phases 數 num_phases = Width / TILE_WIDTH
協作載入索引 Mds[ty][tx]=M[RowW + phT + tx];Nds[ty][tx]=N[(ph*T+ty)*W + Col] T = TILE_WIDTH
Constant memory 容量 ≤ 64 KB (65,536 bytes) / application F 由 constant cache 服務
Shared mem / thread (full occ, A100) 164 KB / 2048 = 82 B/thread 門檻 tiled matmul 僅用 8 B/thread
DDR 頻寬 BW = width(B) × 2 × clock(GHz) 8B × 2 × 1GHz = 16 GB/s
飽和匯流排 banks 數 banks ≥ R + 1, R = latency / transfer_time single-bank 利用率 = 1/(R+1)

執行模型 (Warps / Occupancy / Scheduling) → Resource Partitioning、Occupancy 與 Device Properties 查詢

項目 公式/數值 說明/條件
Warps / block ceil(blockDim / 32) warp n = threads 32n .. 32(n+1)−1
SIMD efficiency active_threads / 32 divergence 降低之
Occupancy threads_assigned_to_SM / max_threads_per_SM = warps_assigned / max_warps
Full-occupancy 暫存器上限 regs/thread ≤ regs_per_SM / max_threads = 65536/2048 = 32
暫存器限制的 max threads regs_per_SM / regs_per_thread 65536/64 = 1024 → ≤ 50% occupancy
Oversubscription max_threads_per_SM / cores_per_SM A100: 2048 / 64 = 32×
Execution waves ceil(num_blocks / blocks_per_wave) transparent scalability
不整除 block (768) floor(2048/768) = 2 → 1536/2048 = 75% 浪費 resource slots
A100 cores 108 SM × 64 cores = 6912 cores

Thread/Grid 索引與啟動慣用語 (Indexing & Launch Idioms) → 呼叫 Kernel、Compilation 與本章總結

項目 公式/數值 說明/條件
Global index (1D) i = blockIdx.x*blockDim.x + threadIdx.x 1 thread : 1 element
Ceiling-division 啟動 numBlocks = (n + blockSize − 1) / blockSize = ceil(n/blockSize) 整數除法慣用語, e.g. (N+255)/256
Grid threads ≥ n ceil(n/256) × 256 ≥ n 需邊界檢查 if (i < n)
2D row / col row = bybd.y + ty;col = bxbd.x + tx
2D row-major index(M[j][i]) = j*Width + i column-major: i*Height + j
3D row-major index = plane*(W*H) + row*W + col
Block thread 上限 bd.x × bd.y × bd.z ≤ 1024
gridDim 範圍 x ≤ 2^31−1;y, z ≤ 65535 (2^16−1)
RGB byte offset (row*width + col) × CHANNELS CHANNELS = 3

平行模式複雜度 (Scan / Reduction / Merge / Sort) → 演算法選擇與權衡

模式 Work / Steps 說明/條件
Reduction ops = N−1;steps = log2(N);speedup = N/log2(N) avg parallelism = (N−1)/log2(N)
Sequential scan adds = N−1 = O(N) baseline
Kogge-Stone scan work = N·log2(N)−(N−1) = O(N log N);steps = log2(N) step 少, work 多(非 work-efficient)
Brent-Kung scan work = 2N−2−log2(N) = O(N);steps = O(2 log N) work-efficient
Hierarchical scan len(S) = N / SECTION_SIZE;回加 Y[i] += S[bx−1] (bx>0) 任意長度三-kernel
Merge output size = m+n;sequential = O(m+n) co-rank 不變式 k = i + j
Co-rank search O(log N), N = max(m, n) binary search;i_low=max(0,k−n)
Sort 下界 comparison-based ≥ O(N log N) radix sort < O(N log N) (非比較式)
其他 sort work merge O(N log N);odd-even transposition O(N²);bitonic O(N log²N)
Radix sort iterations = ceil(N_bits / r);buckets = 2^r dst0 = i − ones_before;dst1 = size − ones_total + ones_before

Convolution 與 Stencil 強度 (Arithmetic Intensity Bounds) → Stencil 的 Shared Memory Tiling

項目 公式/數值 說明/條件
1D convolution y[i] = Σ_{j=−r..r} f[j+r]·x[i+j];filter size = 2r+1 cross-correlation 形式
2D convolution P = ΣΣ f·N;filter = (2r+1)²
Basic conv OP/B 2 / 8 = 0.25 constant-mem 版: 2/4 = 0.5(2×)
Input tile dim IN_TILE_DIM = OUT_TILE_DIM + 2·FILTER_RADIUS halo cells
Conv tiled 上界 OP/B_max = (2r+1)²·2 / 4 (tile ≫ r) 5×5 → 12.5;9×9 → 40.5
Stencil point count 1D = 2·order + 1 7-point 3D = self + 6
Stencil basic OP/B 13 / (7×4) = 0.46
Tiled stencil OP/B (7-pt) (13/4)(1 − 2/T)³ → 3.25 (T→∞) T=8 → 1.37;T=32 → 2.68
Stencil/conv 上界規則 OP/B_max ≈ (stencil point count) / 2 5-pt → 2.5;3×3 conv → 4.5
Halo fraction (3D, order-1) 1 − ((T−2)/T)³ T=8 → 58%

稀疏矩陣與圖遍歷 (Sparse Matrix & Graph) → 稀疏矩陣與基礎儲存格式

項目 公式/數值 說明/條件
SpMV y = A·x + y;OP/B ≈ 0.25 (value 4B + colIdx 4B, 無重用) bandwidth-bound
COO 儲存 rowIdx[nnz] + colIdx[nnz] + value[nnz] 1% 非零 → 總量 ≈ dense 的 3%
CSR 儲存 colIdx[nnz] + value[nnz] + rowPtrs[numRows+1] rowPtrs[numRows] = nnz
ELL (column-major) i = t*numRows + row;size = numRows × maxNnzPerRow padding 隨最長 row 增長
Graph 儲存 adjacency = N²;CSR/CSC = (N+1)+E;COO = 2E out-degree(i)=srcPtrs[i+1]−srcPtrs[i]
BFS 複雜度 O(V + E) level[v] = root 到 v 的最少 hops
Threads / level vertex-centric = V;edge-centric = E (E ≫ V) edge-centric 平行度更高
Frontier threadsLaunched = numPrevFrontier;結束於 curr frontier = 0 非全圖 N
First-visit 判定 atomicCAS(&level[nbr], UINT_MAX, currLevel) == UINT_MAX 第一個訪問者入隊

深度學習 (CNN / Conv Layer / GEMM) → GPU Convolutional Layer:CUDA Kernel 與 GEMM 化

項目 公式/數值 說明/條件
Perceptron / FC y = sign(W·x + b);FC: y = sigmoid(W·x + b), W 為 m×n cost = m·n weights
Cost / 梯度 E = (y−t)²/2;∂E/∂y = y−t;更新 θ ← θ − ε·∂E/∂θ gradient descent
Conv output size H_out = H − K + 1 cuDNN: P = (H − R + 2·pad)/u + 1
Conv forward Y[m,h,w] = Σc Σp Σq X[c,h+p,w+q]·W[m,c,p,q] filter banks = M × C
FC backprop ∂E/∂x = Wᵀ(∂E/∂y);∂E/∂w = (∂E/∂y)xᵀ outer product
im2col 形狀 filter M×(C·K·K);X_unroll (C·K·K)×(H_out·W_out);Y M×(H_out·W_out) GEMM 化
Expansion ratio (C·K·K·H_out·W_out)/(C·H·W) → K·K 範例 16/9 ≈ 1.8
可用平行度 N × M × H_out × W_out 四層 easy-parallel loops

數值 (Floating-Point) → 浮點數資料表示

項目 公式/數值 說明/條件
IEEE-754 值 value = (−1)^S × 1.M × 2^(E − bias) bias = 2^(e−1) − 1 (e = exponent bits)
Effective mantissa m-bit mantissa = (m+1)-bit (隱含 1.)
Denormalized (E=0) value = (−1)^S × 0.M × 2^(2 − 2^(n−1)) single → exp = −126;填平 0 附近空洞
特殊樣式 E 全1 & M=0 → ±∞;E 全1 & M≠0 → NaN;E=0 & M=0 → 0 NaN 來源:0/0, 0×∞, ∞−∞
每區間代表數 2^N (N = mantissa bits) 越靠近 0 越密
single vs double double error = single error / 2^29 mantissa 多 29 bits
Rounding error 正確捨入 add/sub ≤ 0.5 ULP;div / 超越函數可能 > 0.5 ULP round-to-zero ≤ 1 ULP
Precision vs Accuracy precision ← mantissa 位數;accuracy ← 運算與順序 FP 加法不符結合律 → 平行 reduction 順序影響結果

硬體數據與其他 (Atomics / Bandwidth / Dynamic Parallelism) → 記憶體頻寬與運算吞吐量

項目 公式/數值 說明/條件
Atomic 吞吐 period ≈ 2 × memory_latency;throughput ≈ 1/(2·latency) 1/400cyc × 1GHz = 2.5 M atomics/s
Privatization private hist bytes = gridDim.x × NUM_BINS × 4;競爭 ↓ ≈ active blocks 每 block 僅 1 次 global atomicAdd
A100 FP peak (2021) FP64 = 9.7;FP32 = 156;FP16 = 312 TFLOPS GPU 頻寬 ≈ 10× 同代 CPU
FP64 相對速度 early ≈ FP32/8;Fermi+ ≈ FP32/2 FP16 資料搬運量 = 0.5× FP32
系統互連頻寬 PCIe (system interconnect) BW < 10% global mem BW NVLink ≤ 5× PCIe 3.0;zero-copy 受限
G80→A100 (13 年) compute 452×;bandwidth 18× → gap 25× memory wall 加深 → OP/B 需求上升
DCS gather OP/B = 9 FLOP / 4 access = 2.25;複雜度 O(N_atom·N_grid) = O(V²) cutoff binning → O(V)
Dynamic parallelism child gridDim = ceil((end−start)/BLOCK_DIM) max nesting = 24;pending pool = 2048
MPI cluster compute procs = np−1 (rank np−1 = data server);halo = 4·dimx·dimy overlap:T ≈ T_stage1 + max(T_compute, T_comm)

必背公式 (Must-Know Formulas)

  1. Amdahl's Law — Speedup = 1 / ((1−p) + p/s),上限 s→∞ ⇒ 1/(1−p) → 平行運算的目標與 Amdahl's Law
  2. Arithmetic Intensity — OP/B = compute_ops / bytes_from_global → 優化清單與效能瓶頸
  3. Roofline ridge (machine balance) — required OP/B = peak_FLOPS / peak_BW (A100 = 12.5) → 記憶體存取效率與 CUDA 記憶體類型
  4. Global thread index — i = blockIdx.x*blockDim.x + threadIdx.x → Kernel Functions 與 Threading
  5. Ceiling-division 啟動 — numBlocks = (n + blockSize − 1) / blockSize → 呼叫 Kernel、Compilation 與本章總結
  6. Occupancy — threads_assigned_to_SM / max_threads_per_SM → Resource Partitioning、Occupancy 與 Device Properties 查詢
  7. Coalescing 條件 — index = kWidth + (bxbd+tx) ⇒ stride = 1 → 記憶體合併存取
  8. Tiled matmul OP/B — OP/B = 0.25 × TILE_WIDTH → Tiling 技術與 Tiled 矩陣乘法 Kernel
  9. Reduction work/steps — ops = N−1, steps = log2(N), speedup = N/log2(N) → Reduction 基礎與簡單 Kernel
  10. Scan work — Kogge-Stone O(N log N) vs Brent-Kung 2N−2−log2(N) = O(N) → 高 Work-Efficiency 的 Scan:Brent-Kung 與 Thread Coarsening
  11. IEEE-754 值 — value = (−1)^S × 1.M × 2^(E − bias), bias = 2^(e−1) − 1 → 浮點數資料表示