Memory Bandwidth and Compute Throughput

#advanced-practice #memory-bandwidth #atomic-operations #control-divergence #floating-point

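Overview

This section summarizes the hardware advances in compute throughput and memory bandwidth across GPU generations. The overall trend is to let a direct port of CPU code to the GPU reach reasonable performance (smoothing the porting path), which reduces the pressure to restructure algorithms.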

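Aspect	Early GPU limitation	Generation introduced	Key improvement	What it means for programmers
Double-precision speed	About 8x slower than single-precision	Fermi+	DP raised to about 1/2 the speed of SP	Porting CPU numerical code no longer requires a costly evaluation of whether it can drop to SP
Half-precision (FP16)	Not supported	Pascal (HW) / Ampere (tensor core)	A100: FP16 tensor 312 TFLOPS vs FP32 19.5 TFLOPS	Data in medical imaging, remote sensing, astronomy, etc. fits in 16 bits, saving bandwidth and energy
Control flow efficiency	Poor divergence handling	Fermi+	Compiler-driven predication	Data-driven applications (ray tracing, cellular automata) benefit greatly
On-chip memory	Purely programmer-managed scratchpad	Fermi+	Configurable split between cache and shared memory	Predictable access → use shared memory; unpredictable / CPU-ported → use cache
Atomic operations	Slow and limited in functionality	Fermi → Kepler → Maxwell	Faster and more general, with shared-memory throughput raised again	Less need to switch to prefix-sum / sort
Global memory access	Slow random access, heavily dependent on coalescing	Fermi/Kepler; Pascal	Faster random access; HBM2 (3x), NVLink (5x)	Coalescing is less critical; multi-GPU scalability improves greatly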

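The common thread

The shared goal of these hardware enhancements is to lower the bar of "performance only after expert tuning", so that more object-oriented, data-driven applications that are hard to organize into tiled arrays can also benefit from GPU performance directly.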

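Double / Half-Precision Speed

Early GPUs: double-precision (FP64) was about 8x slower than single-precision (FP32), a frequent complaint from the HPC community.
Fermi and its successors: the FP64 units were strengthened substantially, reaching about half the speed of FP32 (2x slower).
- The biggest beneficiaries are developers porting CPU numerical code to the GPU: they no longer need to spend effort evaluating whether the code fits in single-precision, which cuts porting cost considerably.
Applications whose data is naturally small (8-bit, 16-bit, or FP32) still favor lower precision, and the reason is bandwidth: 32-bit data moves half as many bytes as 64-bit data.
Pascal added hardware support for 16-bit half-precision (FP16); Ampere (A100) pushes FP16 throughput much further with tensor cores.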

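Precision × speed × bandwidth trade-offs

Precision	Bits	Relative compute speed	Relative data volume	Typical use
FP64 (double)	64	Early GPUs: 1/8 of FP32 → Fermi+: 1/2 of FP32	2x of FP32	High-precision scientific computing, CPU ports
FP32 (single)	32	Baseline (1x)	Baseline	General purpose
FP16 (half)	16	Very high (tensor core)	0.5x of FP32	Deep learning, imaging, remote sensing, astronomy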

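Order-of-magnitude comparison

A100 (Ampere): FP16 tensor-core throughput is 312 TFLOPS (156 TFLOPS is the TF32 tensor-core figure), versus 19.5 TFLOPS for standard FP32 → roughly a 16x gap.
Choosing the right precision buys both higher compute throughput and lower memory bandwidth pressure.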

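Lower precision is not a free speedup

Reducing precision only works if the data itself tolerates the representable range and round-off of the narrower format. Confirm numerical stability first (see the precision concepts in 24-Numerical-Considerations/02-Representable-Numbers-Precision-and-Accuracy), then reduce precision.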
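The bandwidth argument and the tolerance check can be sketched together. The following is a minimal illustration, not a tuned implementation: data is stored as FP16 (2 bytes per element) while the arithmetic is done in FP32, and a host helper measures the worst round-off against an FP32 reference. The names `scale_half` and `max_fp16_error` are hypothetical, error checking is omitted, and the input is assumed to be non-empty.

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cmath>
#include <vector>

// Scales values that are stored as FP16; the multiply itself runs in FP32.
__global__ void scale_half(const __half* in, __half* out, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = __half2float(in[i]);      // widen after the 2-byte load
        out[i] = __float2half(x * factor);  // narrow before the 2-byte store
    }
}

// Hypothetical host helper: returns the largest absolute error of the
// FP16-stored result versus an FP32 reference. Assumes data is non-empty.
float max_fp16_error(const std::vector<float>& data, float factor) {
    int n = static_cast<int>(data.size());
    std::vector<__half> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = __float2half(data[i]);

    __half* d_in = nullptr;
    __half* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(__half));
    cudaMalloc(&d_out, n * sizeof(__half));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(__half), cudaMemcpyHostToDevice);
    scale_half<<<(n + 255) / 256, 256>>>(d_in, d_out, factor, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(__half), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float reference = data[i] * factor;  // FP32 reference on the host
        worst = std::fmax(worst, std::fabs(__half2float(h_out[i]) - reference));
    }
    return worst;
}
```

Running `max_fp16_error` on a representative sample of real data gives a quick first indication of whether the FP16 representation is tolerable before committing to it.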

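Better Control Flow Efficiency

Starting with Fermi, CUDA uses a compiler-driven predication technique (Mahlke et al., 1995) to handle control flow.
Predication converts control dependence into data dependence: a predicate (condition mask) decides whether an instruction takes effect on a given lane, which removes the branch instructions and the reconvergence overhead.
On VLIW machines predication gave only moderate gains, but on GPU warp-style SIMD execution it delivers more substantial speedups.
The applications that benefit most are data-driven: ray tracing, quantum chemistry visualization, and cellular automata simulation.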

Traditional branch divergence                     predication (Fermi+)
─────────────────────────────                     ─────────────────────────
if (cond) A(); else B();                          p = cond;          // build the mask
lanes in a warp: T T F F T F T F                  @p   A();          // takes effect only on lanes where p==T
  step1: run A  (F lanes idle)                    @!p  B();          // takes effect only on lanes where p==F
  step2: run B  (T lanes idle)
  → two serialized passes, low SIMD utilization   → no branch, no reconvergence, high SIMD utilization

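Predication is not a cure-all

Predicated code still issues the instructions of both sides (they become no-ops on lanes where the predicate is false). That pays off for very short if-bodies; when the work inside a branch is large, a real branch is cheaper. In practice the compiler makes this trade-off itself. For a full discussion of control divergence, see 04-Compute-Architecture-And-Scheduling/02-Warps-SIMD-and-Control-Divergence.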
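As an illustration of the kind of short if/else that is a natural candidate for predication, here is a minimal sketch. The kernel `clamp_or_halve` and its host helper are hypothetical names. Whether the compiler actually emits predicated or select instructions is its own decision and is not guaranteed by this source form; inspecting the generated SASS (for example with `cuobjdump -sass`) is one way to check.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Short if/else written as "compute both sides, then select".
__global__ void clamp_or_halve(const float* in, float* out, float limit, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];
    bool p = x > limit;   // per-lane predicate, the "p = cond" step
    float a = limit;      // result of the "A" side
    float b = 0.5f * x;   // result of the "B" side
    out[i] = p ? a : b;   // both sides are evaluated; p picks one per lane
}

// Hypothetical host helper; assumes data is non-empty, no error checking.
std::vector<float> run_clamp_or_halve(const std::vector<float>& data, float limit) {
    int n = static_cast<int>(data.size());
    std::vector<float> result(n);
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    clamp_or_halve<<<(n + 255) / 256, 256>>>(d_in, d_out, limit, n);
    cudaMemcpy(result.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Both sides here are a single cheap operation, which is the case where issuing both is cheaper than diverging. With a long body on either side, an ordinary branch is likely the better choice.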

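Configurable Caching and Scratchpad

Early shared memory was only a programmer-managed scratchpad: a good fit for data with predictable, localized access patterns.
Starting with Fermi, on-chip memory became larger and configurable, so that part of it acts as cache and part as shared memory.
- Both predictable and unpredictable access patterns can benefit from on-chip memory.
- Programmers apportion the resource according to the characteristics of the application.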

Fermi+ configurable on-chip memory (illustrative, e.g. 64 KB)

Config A: predictable (tiling kernel)    Config B: CPU-ported / irregular access
┌───────────────────────┬─────────┐      ┌─────────┬───────────────────────┐
│ Shared memory (48 KB) │ L1 (16) │      │ Sh (16) │   L1 Cache (48 KB)    │
└───────────────────────┴─────────┘      └─────────┴───────────────────────┘
  programmer manages scratchpad by hand    hardware manages cache automatically
  → suits stencil / tiled matmul           → suits directly ported CPU code

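Use case	Preferred configuration	Reason
Early-stage ports / direct ports of CPU code	Mostly cache	Improves "easy performance" with no manual management
Existing CUDA code with predictable access (tiling)	Mostly shared memory	Keeps the occupancy of the previous generation while holding more data in fast memory
Kernels limited by shared memory size (e.g. stencil / finite difference)	Larger shared memory	Improves memory bandwidth efficiency and performance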
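The preferences in the table can be expressed per kernel through the CUDA runtime. The sketch below assumes two hypothetical kernels, `tiled_kernel` (predictable access staged through shared memory) and `irregular_kernel` (index-driven gather), and uses `cudaFuncSetCacheConfig` to state a preference for each. The setting is a hint: on architectures where the split is fixed or managed differently the runtime may ignore it, and newer architectures expose a carveout attribute through `cudaFuncSetAttribute` instead.

```cuda
#include <cuda_runtime.h>

// Predictable access: stage a tile in shared memory, then reuse neighbors.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void tiled_kernel(const float* in, float* out, int n) {
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();
    int left = (threadIdx.x > 0) ? threadIdx.x - 1 : threadIdx.x;
    if (i < n) out[i] = 0.5f * (tile[left] + tile[threadIdx.x]);
}

// Irregular access: a gather through an index array, where a hardware
// cache tends to help more than a hand-managed scratchpad.
__global__ void irregular_kernel(const int* idx, const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// States the preferred on-chip split for each kernel (a hint, not a guarantee).
cudaError_t configure_on_chip_memory() {
    cudaError_t err = cudaFuncSetCacheConfig(tiled_kernel, cudaFuncCachePreferShared);
    if (err != cudaSuccess) return err;
    return cudaFuncSetCacheConfig(irregular_kernel, cudaFuncCachePreferL1);
}
```

Calling `configure_on_chip_memory()` once before the first launches is enough; the preference is attached to each kernel function rather than to a single launch.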

Relation to tiling

A larger shared memory directly helps tiling-centric applications such as stencil (see 08-Stencil/02-Shared-Memory-Tiling-for-Stencil): larger tile → more data reuse → less global memory bandwidth consumed.

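Enhanced Atomic Operations

Performance improved generation by generation: Fermi was much faster than its predecessors → Kepler was faster still and more general → Maxwell further raised atomic throughput on shared memory variables.
Atomics are commonly used in random scatter computation patterns; the typical example is the histogram (see 09-Parallel-Histogram/01-Atomic-Operations-and-Basic-Histogram).
Knock-on effects of faster atomics:
- Less need for algorithm transformations such as prefix sum (scan) and sorting (these transformations add kernel launches and total work).
- Less need to hand collective operations, or updates to a shared data structure by multiple blocks, back to the host CPU → lower CPU↔GPU data transfer pressure.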

Random scatter (atomics)                    Traditional ways to avoid atomics
─────────────────────────                   ─────────────────────────────────
each thread:                                reorder with prefix-sum / sort
  atomicAdd(&hist[bin], 1);                   + more kernel launches
  → many threads contend for one address      + more total work
faster atomics (Fermi→Kepler→Maxwell)         or hand back to the CPU → more CPU↔GPU transfers
  → direct scatter becomes acceptable
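The left-hand column of the diagram corresponds to a kernel like the following minimal sketch. It assumes 8-bit input where the byte value itself is the bin index (256 bins), which is a simplification; `histogram_atomic` and `histogram_on_gpu` are hypothetical names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One thread per input element; every update goes straight to global memory.
__global__ void histogram_atomic(const unsigned char* data, unsigned int* hist, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&hist[data[i]], 1u);  // threads may contend on one bin
}

// Hypothetical host helper: returns a 256-bin histogram. Assumes non-empty data.
std::vector<unsigned int> histogram_on_gpu(const std::vector<unsigned char>& data) {
    int n = static_cast<int>(data.size());
    std::vector<unsigned int> hist(256, 0);
    unsigned char* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_hist, 256 * sizeof(unsigned int));
    cudaMemcpy(d_data, data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, 256 * sizeof(unsigned int));
    histogram_atomic<<<(n + 255) / 256, 256>>>(d_data, d_hist, n);
    cudaMemcpy(hist.data(), d_hist, 256 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return hist;
}
```

This is a single kernel launch with no reordering pass, which is the simplicity that faster atomics make affordable.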

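"Fast atomics" ≠ "no optimization needed"

Faster atomics reduce the pressure to switch to scan/sort, but high contention still hurts performance. Techniques such as privatization, coarsening, and aggregation remain effective (see 09-Parallel-Histogram/02-Histogram-Optimizations-Privatization-Coarsening-Aggregation).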
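A minimal sketch of privatization combined with coarsening, under the same simplifying assumption that 8-bit input values are the bin indices: each block accumulates into its own shared-memory copy and merges it into the global histogram once at the end. The names `histogram_privatized` and `histogram_privatized_on_gpu`, the 256-thread block size, and the cap of 64 blocks are illustrative choices, not recommendations.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int NUM_BINS = 256;

__global__ void histogram_privatized(const unsigned char* data, unsigned int* hist, int n) {
    __shared__ unsigned int local[NUM_BINS];  // private copy for this block
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Coarsening: a grid-stride loop lets each thread handle several elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&local[data[i]], 1u);       // contention stays inside the block
    __syncthreads();

    // Merge: at most one global atomic per non-empty bin per block.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        if (local[b] > 0) atomicAdd(&hist[b], local[b]);
}

// Hypothetical host helper; assumes non-empty data, no error checking.
std::vector<unsigned int> histogram_privatized_on_gpu(const std::vector<unsigned char>& data) {
    int n = static_cast<int>(data.size());
    std::vector<unsigned int> hist(NUM_BINS, 0);
    unsigned char* d_data = nullptr;
    unsigned int* d_hist = nullptr;
    cudaMalloc(&d_data, n);
    cudaMalloc(&d_hist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(d_data, data.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(d_hist, 0, NUM_BINS * sizeof(unsigned int));
    int blocks = std::min(64, (n + 255) / 256);  // illustrative cap on grid size
    histogram_privatized<<<blocks, 256>>>(d_data, d_hist, n);
    cudaMemcpy(hist.data(), d_hist, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_hist);
    return hist;
}
```

The shared-memory atomics in the middle loop are the operations whose throughput the notes attribute to Maxwell-era improvements; the global atomics are reduced to the merge step.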

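Enhanced Global Memory Access

Random memory access is much faster on Fermi / Kepler than on earlier generations → programmers can worry less about memory coalescing.
- More CPU algorithms can be used directly as an acceptable baseline on the GPU (especially ray tracing, heavily object-oriented code, and applications that are hard to convert into perfectly tiled arrays).
Pascal brought two major leaps in bandwidth and interconnect:
- HBM2 (High-Bandwidth Memory v2), a 3D-stacked DRAM → up to 3x the memory bandwidth of the previous Maxwell generation.
- First support for the NVLink processor interconnect → on the Tesla P100, GPU-GPU and GPU-CPU communication reaches up to 5x the performance of PCI Express 3.0, which greatly improves multi-GPU scalability within a node and GPU/CPU data sharing.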

Older architecture (PCIe-only)            Pascal P100 + NVLink
──────────────────────────────            ─────────────────────────────
 GPU0   GPU1                              GPU0 ═══NVLink (≤5x PCIe3)═══ GPU1
   \    /   all share PCIe 3.0            ║HBM2 (≤3x Maxwell BW)         ║HBM2
    CPU      (slow, shared)               GPU2 ═══NVLink═══════════════ GPU3
                                            ╲          NVLink           ╱
                                                 CPU (NVLink-capable)

Interconnect / memory	Role	Relative improvement
HBM2	Bandwidth of the DRAM inside the GPU package (off-chip memory)	≤ 3x of Maxwell
NVLink	Processor interconnect (GPU↔GPU, GPU↔CPU)	≤ 5x of PCIe 3.0

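HBM2 and NVLink are two different things

HBM2 addresses the bandwidth at which a GPU reads and writes its own DRAM; NVLink addresses the bandwidth at which a GPU communicates with other GPUs or the CPU. They are complementary and should not be conflated. For what NVLink implies for the memory model (direct addressing across GPUs), see 22-Advanced-Practices-And-Future-Evolution/01-Host-Device-Interaction-Memory-Model.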
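On the software side, GPU-to-GPU communication is exposed through the CUDA peer-access API, sketched below. The same calls work over PCIe as well; the interconnect (NVLink or PCIe) determines how fast the traffic moves, not which API is used. The helper names `enable_peer_access` and `copy_between_gpus` are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Tries to let device `src` directly address memory that lives on device `dst`.
// Returns false when the pair does not support peer access.
bool enable_peer_access(int src, int dst) {
    int can_access = 0;
    if (cudaDeviceCanAccessPeer(&can_access, src, dst) != cudaSuccess || !can_access)
        return false;
    if (cudaSetDevice(src) != cudaSuccess) return false;
    cudaError_t err = cudaDeviceEnablePeerAccess(dst, 0);  // flags must be 0
    if (err == cudaErrorPeerAccessAlreadyEnabled) {
        cudaGetLastError();  // clear the error state; access is already on
        return true;
    }
    return err == cudaSuccess;
}

// Copies `bytes` from a buffer on src_dev to a buffer on dst_dev.
// With peer access enabled the copy can go device to device directly.
cudaError_t copy_between_gpus(void* dst, int dst_dev,
                              const void* src, int src_dev, size_t bytes) {
    return cudaMemcpyPeer(dst, dst_dev, src, src_dev, bytes);
}
```

Once peer access is enabled, a kernel running on `src` can also dereference pointers allocated on `dst`, which is the direct-addressing model the cross-reference above describes.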

Coalescing has not gone away

"Worry less about coalescing" is relative to earlier generations; for bandwidth-bound kernels, coalesced access is still the decisive factor (see 06-Performance-Considerations/01-Memory-Coalescing). For comparison: zero-copy / system interconnect bandwidth is typically < 10% of global memory bandwidth.
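One way to see how much coalescing still matters on a given GPU is to time the same copy with two access patterns. The sketch below is a rough measurement aid under stated assumptions, not a benchmark: `copy_coalesced`, `copy_strided`, and `time_copy` are hypothetical names, a single untimed warm-up run is skipped, and the strided kernel assumes `gcd(stride, n) == 1` so that every element is still copied exactly once.

```cuda
#include <cuda_runtime.h>

// Adjacent threads touch adjacent addresses: accesses coalesce.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Adjacent threads touch addresses `stride` elements apart: poorly coalesced.
// Assumes gcd(stride, n) == 1 so the index mapping is a permutation of [0, n).
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n) {
        int i = static_cast<int>((static_cast<long long>(t) * stride) % n);
        out[i] = in[i];
    }
}

// Returns the elapsed kernel time in milliseconds for one run.
float time_copy(bool strided, int n, int stride) {
    float* in = nullptr;
    float* out = nullptr;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    int blocks = (n + 255) / 256;

    cudaEventRecord(start);
    if (strided) copy_strided<<<blocks, 256>>>(in, out, n, stride);
    else         copy_coalesced<<<blocks, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return ms;
}
```

For example, comparing `time_copy(false, 1 << 20, 33)` with `time_copy(true, 1 << 20, 33)` moves the same number of bytes both times, so any difference in elapsed time comes from the access pattern alone. The size of that difference depends on the architecture and is not predicted here.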

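Exam / Test Patterns

Scenario / keyword	Answer / technique
"How much slower is double-precision on a GPU?"	Early GPUs: ~8x slower; after Fermi: about 1/2 the speed of single (2x slower)
"A100 FP16 vs FP32 throughput?"	Tensor-core FP16 312 TFLOPS vs FP32 19.5 TFLOPS ≈ 16x
"Why use half / single instead of double?"	Speed + bandwidth (half the data moved) + energy; provided the data tolerates that precision
"How did Fermi improve control divergence?"	Compiler-driven predication (control dependence → data dependence); warp SIMD benefits the most
"How to choose the on-chip memory configuration?"	Predictable / tiling → mostly shared memory; CPU-ported / irregular → mostly cache
"What do faster atomics bring?"	Less need to switch to prefix-sum / sorting and less work handed back to the CPU → lower CPU↔GPU transfer
"How much does HBM2 improve, and what?"	≤3x of Maxwell; improves the GPU's own DRAM bandwidth
"How much does NVLink improve, and what?"	≤5x of PCIe 3.0; improves GPU↔GPU / GPU↔CPU communication
"Is coalescing still needed after Fermi+?"	Random access is faster, but bandwidth-bound kernels still need coalescing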
