記憶體存取效率與 CUDA 記憶體類型 (Memory Access Efficiency & CUDA Memory Types)

#memory-and-performance #arithmetic-intensity #roofline-model #cuda-memory-types #shared-memory

重點總覽 (Overview)

主題	核心要點	量化指標
Compute-to-global-memory-access ratio	每存取 1 byte global memory 所做的 FLOP 數，又稱 arithmetic / computational intensity	原始 matmul = 0.25 OP/B
Memory-bound	執行速度被記憶體頻寬卡住，無法逼近運算 peak	A100 0.25 OP/B → 僅 389 GFLOPS (peak 的 2%)
Compute-bound	執行速度被運算單元卡住，記憶體頻寬有餘裕	需 ≥ 12.5 OP/B 才能餵滿 A100 19,500 GFLOPS
Roofline Model	視覺化模型：x = intensity (OP/B)，y = throughput (GFLOPS)，斜線=頻寬上限、水平線=運算上限	交點 = memory↔compute bound 分界
CUDA 記憶體類型	registers / local / shared / global / constant，各有 scope、lifetime、速度	constant ≤ 65,536 B
優化方向	提高 ratio = 減少 global memory 存取 → 善用 on-chip memory (register / shared)	tiling 可把 matmul 推到 4 OP/B

Important

本章的中心論點:到目前為止寫的 kernel 只發揮硬體極小部分潛能,因為 global memory (off-chip DRAM) 延遲高 (數百 cycle)、頻寬有限。解法是把資料搬到 on-chip 記憶體,提高 compute-to-memory ratio。

記憶體存取效率的重要性 (Importance of Memory Access Efficiency)

Compute-to-Global-Memory-Access Ratio

以 Ch.3 矩陣乘法 kernel 最熱的內層 dot-product 迴圈為例:

// 內層迴圈每一次迭代:
for (int k = 0; k < Width; ++k) {
    Pvalue += M[row*Width + k] * N[k*Width + col];
    //         ^^^^^^^^^^^^^^ 1 次 global load (4B)
    //                          ^^^^^^^^^^^^^^ 1 次 global load (4B)
    // 1 次 FP multiply + 1 次 FP add = 2 FLOP
}

每次迭代:2 FLOP 對 8 B global memory 存取 → ratio = 2/8 = 0.25 FLOP/B (OP/B)。
這個比值就是 compute to global memory access ratio,文獻中也叫 arithmetic intensity 或 computational intensity。

為何 0.25 OP/B 是災難

A100 peak global memory bandwidth = 1555 GB/s。

可達 FP throughput = 1555 GB/s × 0.25 FLOP/B = 389 GFLOPS。
但 A100 FP32 peak = 19,500 GFLOPS → 只用到 2%。
若比對 tensor core peak 156,000 GFLOPS → 僅 0.25%。
kernel 速度被「資料從 memory 送到 core 的速率」嚴重卡住 → 這就是 memory-bound program。

要餵滿運算單元需要的 ratio:

{ratio}_{需求} = \frac{19,500 GOP/s}{1555 GB/s} = 12.5 OP/B

代表每讀 1 個 4-byte 浮點值,必須做約 50 個浮點運算才能完全發揮 A100。能否達到取決於計算本身的 data reuse 程度。

Memory-bound vs Compute-bound

	Memory-bound	Compute-bound
瓶頸	global memory 頻寬	運算單元 throughput
intensity	低 (左側)	高 (右側)
症狀	core idle、等資料	頻寬有餘裕
唯一/主要解法	提高 intensity (減少 global 存取)	已逼近 peak,改演算法才有空間

Roofline Model

評估「應用程式相對於硬體上限」效率的視覺模型。

 GFLOPS (throughput, y軸, log)
   ^
   |                 ______________________  <- 水平線: peak compute (GFLOPS)
   |                /        A3 (compute-bound)
   |               /  • A3
   |   斜線:       /
   |   peak BW   /•A1   <- A1 貼著斜線: 高效使用頻寬
   |   (slope)  /
   |          / • A2    <- A2 遠低於斜線: 頻寬利用率差,有優化空間
   |         /
   |        /
   +-------+------------------------------> arithmetic intensity (OP/B, x軸 log)
        交點 = memory-bound ↔ compute-bound 分界

x 軸:arithmetic intensity (OP/B) = 每 byte 載入做多少 work。
y 軸:實測 computational throughput (GFLOPS)。
斜線(從原點出發,正斜率)= peak memory bandwidth 上限。
水平線 = peak compute throughput 上限。
任何 app 的點必在兩線之下(不可能超過硬體 peak)。
點離線越近 = 資源用得越有效率;交點橫座標 = memory↔compute bound 的分界 intensity。

A1 vs A2 vs A3 的解讀

A1(memory-bound,貼斜線):頻寬已用滿,唯一提速法是「提高 intensity」。
A2(memory-bound,遠離斜線):頻寬沒用好,可先優化 memory 存取(如 coalescing)拉升 throughput。
A3(compute-bound):落在水平線附近,不受頻寬限制。

intensity 高 ≠ 一定快

Roofline 只給「上限」。落在 line 下方很遠代表還沒榨乾該資源;同一 intensity 下實測點可高可低。診斷要看「點離哪條線、離多遠」,而非只看 OP/B 數字。

CUDA 記憶體類型 (CUDA Memory Types)

CUDA device 提供多種記憶體幫助提高 compute-to-memory ratio。底層回到 von Neumann model:global memory 對應 Memory box(off-chip DRAM,高延遲低頻寬),register file 在 processor chip 上(極短延遲、極高頻寬)。

            CUDA Device (晶片邊界 = processor box)
  +-------------------------------------------------+
  |  SM                                             |
  |   +------------+   +------------------------+   |  on-chip:
  |   | Registers  |   |   Shared Memory        |   |  超高速、平行
  |   | (per-thread)|  |   (per-block, scratchpad)| |
  |   +------------+   +------------------------+   |
  |        ^                    ^                   |
  +--------|--------------------|-------------------+
           |  (R/W, no load     |  (需 load 指令,
           |   instruction)     |   但 on-chip 低延遲)
           v                    v
  +-------------------------------------------------+
  |  Global Memory (DRAM, off-chip)  | Constant Mem |  off-chip:
  |  R/W by host & device            | R-only, 有 cache|  高延遲 (數百 cycle)
  |  (Local memory 也實際放這裡)      | host W / dev R |
  +-------------------------------------------------+

Registers vs Shared vs Global 三者差異

屬性	Registers	Shared Memory	Global Memory
位置	on-chip (register file)	on-chip (scratchpad)	off-chip (DRAM)
延遲	最低	低 (需 load)	高 (數百 cycle)
頻寬	最高 (比 global 高 ≥2 個數量級)	高	低
共享範圍	單一 thread 私有	整個 block 共享	所有 thread / kernel
存取指令	內建於算術指令 (無額外 load)	需 memory load 指令	需 memory load 指令
能耗	最低 (比 global 低 ≥1 個數量級)	中	最高

為何 register 比 global 快得多 (三個理由)

頻寬/延遲:on-chip register file 聚合頻寬比 global memory 高 ≥2 個數量級;且存到 register 後,該存取不再消耗 global 頻寬 → 直接拉高 compute-to-memory ratio。
指令數:算術指令的 operand 內建為 register (fadd r1, r2, r3)。若 operand 在 global,需先 load r2, r4, offset 再 fadd,多一條指令 → 多耗 cycle。
能耗:讀 register 的能量比讀 global memory 低 ≥1 個數量級。

每個 thread 可用 register 數量有限。register 用太多會降低 SM occupancy(可同時駐留的 thread 變少),反而傷害延遲隱藏能力。需在「減少 global 存取」與「不超量 register」之間取捨。詳見 04-Compute-Architecture-And-Scheduling/03-Resource-Partitioning-and-Occupancy。

Shared Memory 的角色

在電腦架構術語裡是一種 scratchpad memory(程式員手動管理,非自動 cache)。
因為 on-chip,延遲遠低於 global;但因仍需 load 操作,延遲/頻寬不如 register。
關鍵差異:shared 變數 block 內所有 thread 可見,而 register 是 thread 私有 → shared memory 是 thread 間高頻寬協作的手段(這是 tiling 的基礎)。

CPU vs GPU register 架構

CPU context switch 需把 outgoing thread 的 register 存回記憶體、載入 incoming thread → 有 overhead。GPU 把所有已排程 thread 的 register 同時保存在 register file 中,達成 zero-overhead scheduling(warp 切換瞬間完成)。因此 GPU register file 遠大於 CPU,且支援 dynamic partitioning(每 thread 配多/少 register ↔ 駐留多/少 thread)。

宣告語法、Scope 與 Lifetime (Table 5.1)

變數宣告	Memory	Scope	Lifetime
自動純量變數 (非陣列)	Register	Thread	Grid
自動陣列變數	Local	Thread	Grid
`__device__ __shared__ int SharedVar;`	Shared	Block	Grid
`__device__ int GlobalVar;`	Global	Grid	Application
`__device__ __constant__ int ConstVar;`	Constant	Grid	Application

// 各記憶體類型宣告範例
__global__ void kernel(...) {
    int idx = threadIdx.x;          // 自動純量 -> Register, scope=thread
    float tmp[4];                   // 自動陣列 -> Local memory (見警告)
    __shared__ float tile[16][16];  // Shared, scope=block, 也可寫 __device__ __shared__
    // ...
}
__device__ int   GlobalVar;          // Global, 宣告在任何 function 之外
__constant__ int ConstVar[256];      // Constant, 必須在 function 之外宣告

Scope:能存取此變數的 thread 集合。

thread scope:每個 thread 有私有副本。launch 一百萬 thread → 產生一百萬份。
block scope:每個 block 一份,block 內 thread 共用。
grid scope:所有 grid 所有 thread 共用同一份。

Lifetime:變數在程式執行期間可用的時段。

Grid lifetime:必須宣告在 kernel function body 內;kernel 多次呼叫間「不保留」值,每次都要重新初始化。
Application lifetime:必須宣告在任何 function body 之外;值在整個 application 期間維持,所有 kernel 可見。

自動陣列不在 register

規則:自動「純量」變數 → register。但自動「陣列」變數 → local memory(實際位於 global memory,延遲長、可能塞車)。例外:若編譯器發現陣列所有索引都是常數,可能放進 register。實務上 kernel 很少需要自動陣列。

Local memory 名字會騙人

Local memory 不是 on-chip!它實際放在 global memory,延遲與 global 相同,只是「thread 私有」。用途:靜態配置的陣列、spilled registers、call stack。

Constant 與 Global 變數細節

	Constant	Global
宣告位置	function 外	function 外 (`__device__`)
Device 存取	唯讀,short-latency 高頻寬 (有 cache)	可讀寫,但慢 (近代靠 cache 改善)
Host 存取	可讀寫 (W/R)	可讀寫 (W/R)
容量	≤ 65,536 bytes	大
典型用途	餵給 kernel 的唯讀輸入值	kernel 間傳遞資訊、跨 block 協作

跨 block 同步陷阱

Global 變數雖對所有 kernel 所有 thread 可見並持久,但沒有簡單方式同步不同 block 的 thread,或保證 global memory 存取的資料一致性。只能靠 atomic operations 或「結束當前 kernel」來達成。故 global 變數常用於「kernel 呼叫之間」傳遞資訊。

Constant memory 細節留待 Ch.7

Constant 變數存在 global memory 但被 cache,搭配適當存取模式極快。完整應用見 07-Convolution/02-Constant-Memory-and-Caching。

Pointer 用法

kernel/device function 中 pointer 指向 global memory 的兩種典型方式:

host 用 cudaMalloc 配置物件,pointer 當 kernel 參數傳入(Ch.2 / Ch.3 的作法)。
把 global 變數的位址指派給 pointer,例如 kernel 內 float* ptr = &GlobalVar;。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「compute-to-global-memory ratio / arithmetic intensity」定義	每存取 1 byte global memory 所執行的 FLOP 數 (OP/B);原始 matmul = 0.25
算 matmul ratio	內層迴圈 2 FLOP / 8 B = 0.25 OP/B
給 peak BW × ratio	throughput 上限 = BW × intensity;A100: 1555×0.25 = 389 GFLOPS
判斷 memory-bound vs compute-bound	算 kernel intensity 與「機器平衡點」`peak FLOPS / peak BW` 比較;低於→memory-bound,高於→compute-bound
「為了餵滿 A100 需多少 ratio」	19,500/1555 ≈ 12.5 OP/B
Roofline 圖三條件	x=intensity, y=throughput;斜線=peak BW,水平線=peak compute,交點=分界
點貼斜線 (A1) 想提速	已用滿頻寬 → 只能提高 intensity(不是再優化 memory)
點遠離斜線 (A2)	頻寬利用差 → 先改善 memory 存取效率 (coalescing 等)
register vs global 為何快 (3 理由)	頻寬高 / 不佔 global 頻寬;無需額外 load 指令;能耗低 ≥1 數量級
自動純量 vs 自動陣列放哪	純量→register;陣列→local memory (除非全常數索引)
local memory 在哪	實際在 global memory (off-chip),非 on-chip
shared 變數 scope/lifetime	scope=block,lifetime=grid (kernel 結束即消失)
constant 變數 scope/lifetime/容量	scope=grid,lifetime=application,≤ 65,536 B,device 唯讀
launch N threads,某 thread-scope 變數有幾份	N 份 (每 thread 私有副本)
跨 block 通訊/一致性	用 global 變數 + atomic operations 或結束 kernel;無內建跨 block 同步