記憶體架構與資料局部性練習題 (Practice - Memory Access Efficiency and CUDA Memory Types)

#practice #memory-and-performance #tiling #shared-memory

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案 / 公式
compute-to-global-memory-access ratio 定義	每存取 1 byte global memory 所做的 FLOP 數 (OP/B)，又稱 arithmetic / computational intensity
naive matmul 內層迴圈 ratio	2 FLOP / 8 B = 0.25 OP/B
throughput 上限公式	`throughput = bandwidth × intensity`；A100：1555 × 0.25 = 389 GFLOPS
memory-bound vs compute-bound 判斷	比較 kernel intensity 與機器平衡點 `peak FLOPS / peak BW`；低於 → memory-bound，高於 → compute-bound
register 比 global 快的 3 理由	頻寬高且不佔 global 頻寬 / 無需額外 load 指令 / 能耗低 ≥1 數量級
自動純量 vs 自動陣列放哪	純量 → register；陣列 → local memory (實際在 global，除非全常數索引)
shared 變數 scope / lifetime	scope = block，lifetime = grid (kernel 結束即消失)
constant 變數 scope / lifetime / 容量	scope = grid，lifetime = application，≤ 65,536 B，device 唯讀
tiling 減量因子	TILE_WIDTH（T×T tile → 每元素讀取從 N 降為 N/T）
兩個 `__syncthreads()` 各防什麼	#1 = RAW (true dependence)；#2 = WAR (false dependence)
越界載入填什麼	`0.0f`（乘加皆中性，不改變內積）
三組獨立邊界檢查	載入 M、載入 N、寫入 P 各自的索引界線不同
full-occupancy shared-mem 門檻 (A100)	164 KB / 2048 = 82 B/thread
動態 shared memory	`extern __shared__` (省略大小) + launch 第三參數 + `cudaGetDeviceProperties`

Question 1 - Compute-to-Memory Ratio 定義 [recall]

情境/題目：什麼是 compute-to-global-memory-access ratio？以 Ch.3 naive 矩陣乘法 kernel 的內層 dot-product 迴圈計算其數值，並說明它的別名。

查看答案

ratio = 每從 global memory 存取 1 byte 所執行的 FLOP 數。內層迴圈每次迭代做 1 次 FP multiply + 1 次 FP add = 2 FLOP，並讀 M、N 各 1 個 4-byte 值 = 8 B → ratio = 2/8 = 0.25 OP/B。別名為 arithmetic intensity 或 computational intensity。

Question 2 - 為何 Naive Matmul 是 Memory-Bound [recall]

情境/題目：A100 的 peak global memory bandwidth = 1555 GB/s，FP32 peak = 19,500 GFLOPS。為何 0.25 OP/B 的 matmul kernel 被稱為 memory-bound？算出它能達到的吞吐量上限。

查看答案

吞吐量上限 = bandwidth × intensity = 1555 GB/s × 0.25 FLOP/B = 389 GFLOPS，僅為 FP32 peak 的 2%（對比 tensor core 156,000 GFLOPS 更只有 0.25%）。執行速度被「資料從 memory 送到 core 的速率」卡住，而非運算單元 → 即 memory-bound program。要餵滿 19,500 GFLOPS 需 ratio ≥ 19,500/1555 ≈ 12.5 OP/B。

Question 3 - CUDA 記憶體類型 Scope 與 Lifetime [recall]

情境/題目：列出 自動純量變數、自動陣列變數、__shared__、__device__、__constant__ 各落在哪種記憶體，以及其 scope 與 lifetime。

查看答案

宣告	Memory	Scope	Lifetime
自動純量變數	Register	Thread	Grid
自動陣列變數	Local	Thread	Grid
`__shared__`	Shared	Block	Grid
`__device__`	Global	Grid	Application
`__constant__`	Constant	Grid	Application

關鍵陷阱：自動「陣列」放 local memory（實際位於 off-chip global，非 register，除非編譯器發現索引全為常數）。

Question 4 - Register 為何比 Global Memory 快 [recall]

情境/題目：給出三個理由，說明為何把 operand 放在 register 比放在 global memory 快。

查看答案

頻寬/延遲：on-chip register file 聚合頻寬比 global memory 高 ≥2 個數量級；且值存進 register 後該存取不再消耗 global 頻寬，等於提高 compute-to-memory ratio。
指令數：算術指令的 operand 內建為 register（fadd r1, r2, r3）；若 operand 在 global，需先 load 再 fadd，多一條指令多耗 cycle。
能耗：讀 register 的能量比讀 global memory 低 ≥1 個數量級。

Question 5 - Tiling 的減量因子 [recall]

情境/題目：對 N×N 矩陣乘法，每個輸入元素被從 global memory 請求幾次？(a) 無 tiling；(b) 使用 T×T tile。並說明 16×16 tile 把 ratio 提升到多少。

查看答案

(a) 無 tiling：每元素被請求 N 次。(b) T×T tile：協同載入後每元素只被請求 N/T 次 → global memory 流量減量因子 = TILE_WIDTH（不是 T²）。16×16 tile → ratio = 0.25 × 16 = 4 OP/B，A100 吞吐上限提升到 1555 × 4 = 6220 GFLOPS。

Question 6 - 兩個 __syncthreads() 防的 Hazard [recall]

情境/題目：tiled matmul kernel 在「載入 tile 後」與「用完 tile 後」各有一個 __syncthreads()。各防止哪種資料相依？哪個是 true / false dependence？

查看答案

載入後 (line 21)：防 Read-after-write (RAW) = true dependence（reader 真的需要別 thread 寫入的 tile 資料，非等不可）。
用完後 (line 26)：防 Write-after-read (WAR) = false dependence（writer 不需 reader 的資料，僅因重用同一 shared 位置才相依，換位置即可消除）。漏任一個都會造成 race condition。

Question 7 - 越界邊界檢查 [recall]

情境/題目：當矩陣寬度不是 TILE_WIDTH 的整數倍時，為何需要邊界檢查？需要幾組獨立檢查？越界載入該填什麼值？

查看答案

row-major 線性化下，超出列尾的存取會「靜默」讀到下一列的合法值而汙染內積；超出陣列尾則回傳隨機值或 crash。需 三組獨立檢查：載入 M (Row<Width && ph*TILE_WIDTH+tx<Width)、載入 N (ph*TILE_WIDTH+ty<Width && Col<Width)、寫入 P (Row<Width && Col<Width)，因為三者索引界線不同。越界載入填 0.0f（乘加皆中性，不改變內積）。注意越界發生在所有 phase，不只最後一個。

Question 8 - 計算 Shared Memory 對 Occupancy 的限制 [application]

情境/題目：A100 每 SM 最多 164 KB shared memory、最多 2048 threads。某 kernel 每 block 用 32 KB shared memory、256 threads。它能否達到 full occupancy？最大 occupancy 約多少？

查看答案

平均用量 = 32 KB / 256 threads = 128 B/thread，超過 full-occupancy 門檻 164 KB / 2048 = 82 B/thread → 無法 full occupancy。每 SM 可容納 threads ≈ 164 KB / 128 B ≈ 1272–1312（書中算 ≈132 B/thread → 1272），max occupancy ≈ 1272 / 2048 ≈ 62%，限制因子是 shared memory。（對比 tiled matmul 只用 8 B/thread，不受限。）

Question 9 - 判斷 Memory-Bound 或 Compute-Bound [application]

情境/題目：某 kernel 每 thread 做 36 個 FLOP 與 7 次 32-bit global memory 存取。判斷在以下兩種裝置上是 memory-bound 還是 compute-bound：(a) peak 200 GFLOPS、100 GB/s；(b) peak 300 GFLOPS、250 GB/s。

查看答案

kernel intensity = 36 FLOP / (7 × 4 B) = 36/28 ≈ 1.29 OP/B。比較機器平衡點 peak FLOPS / peak BW：

(a) 200/100 = 2.0 OP/B。kernel 1.29 < 2.0 → memory-bound。
(b) 300/250 = 1.2 OP/B。kernel 1.29 > 1.2 → compute-bound。

同一 kernel 在不同硬體可能落在分界兩側，取決於機器平衡點。

Question 10 - Naive vs Tiled Matmul 的效能取捨 [analysis]

情境/題目：比較 naive 與 tiled (16×16) 矩陣乘法在 A100 上的吞吐量。為何 tiling 有效？tiled 版仍只達峰值 32%，根本原因是什麼，可如何進一步改善？

查看答案

naive：ratio 0.25 → 389 GFLOPS（峰值 2%）。tiled 16×16：協同載入讓每元素只讀 1 次，ratio 升 16 倍至 4 OP/B → 6220 GFLOPS（峰值 32%）。tiling 有效是因為 matmul 具 data reuse / locality，把熱資料放進 on-chip shared memory，移除大量 global 流量。仍非峰值是因為 ratio 4 OP/B 距餵滿所需的 12.5 OP/B 仍有差距（殘留 memory-bound）。進一步改善：thread coarsening、register tiling，或直接用高度優化的 cuBLAS / CUTLASS。

Question 11 - 為何用 Shared Memory 而非 Register 持有載入值 [analysis]

情境/題目：假設 register 與 shared memory 容量都不是問題，為何在 tiled matmul 中要用 shared memory（而非 register）來持有從 global memory 載入的 M、N 元素？

查看答案

關鍵差異在 scope：register 是 thread 私有，一個 thread 載入的值只有它自己能用；shared memory block 內所有 thread 可見。tiled matmul 的精髓是「協同載入」——一個元素被多個 thread 重複使用（如同 row 的 thread 共用 M、同 col 的 thread 共用 N）。把值放 shared memory，每個 global 元素只需載入一次就能服務多個 thread，達成 TILE_WIDTH 倍的流量減量；若放 register，每個 thread 都得自己重載，無法共享，失去 tiling 的全部好處。

Question 12 - GPU 顯式 Shared Memory vs CPU 隱式 Cache [analysis]

情境/題目：tiling/blocking 在 CPU 與 GPU 都能提升效能。為何 GPU 需要顯式用 shared memory，而 CPU 可依賴隱式 cache？

查看答案

CPU 核心通常一次只跑 1–2 個 thread，重用資料很可能仍留在 cache 中，靠硬體自動管理即可（隱式 blocking）。GPU 的 SM 同時跑大量 thread 以隱藏延遲，這些 thread 會競爭 cache slot，使 cache 留住重用資料變得不可靠。因此 GPU 必須用 shared memory (scratchpad) 顯式、程式員手動管理重要的重用資料，確保它留在 on-chip。兩者目標相同（提升 data locality、把重用資料留在晶片上），但 GPU 因高並行度而需顯式控制。

Question 13 - Strip-mining 與 Phase 數 [recall]

情境/題目：tiled matmul kernel 的外層迴圈用到一種叫 strip-mining 的技術。strip-mining 是什麼？對 Width×Width 矩陣、tile 寬 TILE_WIDTH，dot product 會分成幾個 phase？

查看答案

Strip-mining：把一條長迴圈拆成「外層 phase 迴圈 + 每 phase 執行少數連續迭代的內層迴圈」，並在內層前後加 barrier，強迫同 block 的 thread 每 phase 聚焦於同一段輸入 → 正是 tiling 所需的 phase 機制。phase 數 = Width / TILE_WIDTH（任意寬度時用 ceil(Width/(float)TILE_WIDTH)）。

Question 14 - Constant 變數與動態 Shared Memory [recall]

情境/題目：(a) constant 變數的 scope、lifetime、容量上限與 device 端存取權限為何？(b) 若想在執行期（不重新編譯）依硬體調整 shared memory 大小，需要哪三項配合？

查看答案

(a) constant 變數：scope = grid（所有 grid 所有 thread 共用），lifetime = application，容量上限 ≤ 65,536 bytes，device 端唯讀（存在 global memory 但被 cache，搭配適當存取模式極快）。
(b) 三項：① extern __shared__ 宣告（省略大小、合併成一個一維陣列）；② kernel launch 第三個 <<<grid, block, size>>> 參數傳入位元組數；③ 用 cudaGetDeviceProperties 查 devProp.sharedMemPerBlock 決定 size。需手動切分 Mds/Nds 起點並改用線性化索引。

Question 15 - 計算變數的版本數 [application]

情境/題目：kernel 以 1000 個 thread block、每 block 512 threads 啟動。在整個 kernel 執行生命週期中：(a) 一個宣告為 local 的變數會產生幾份？(b) 一個宣告為 shared memory 的變數會產生幾份？

查看答案

(a) local 變數 scope = thread → 每個 thread 一份 = 1000 × 512 = 512,000 份。
(b) shared 變數 scope = block → 每個 block 一份 = 1000 份。
重點：版本數由 scope 決定（thread-scope = 總 thread 數；block-scope = block 數；grid-scope = 1 份）。

重點整理 (Pattern Summary，點擊展開)

主題	核心要點
Arithmetic intensity	OP/B = FLOP 數 / global bytes；naive matmul = 0.25；餵滿 A100 需 ≥ 12.5
Memory- vs compute-bound	比較 kernel intensity 與機器平衡點 `peak FLOPS / peak BW`
Roofline Model	x = intensity，y = throughput；斜線 = peak BW，水平線 = peak compute，交點 = 分界
CUDA 記憶體類型	register / local / shared / global / constant，各有 scope、lifetime、速度；local 實際在 global
Register 優勢	頻寬高 + 不佔 global 頻寬 / 無額外 load 指令 / 能耗低
Tiling	協同載入到 shared memory，流量減量 = TILE_WIDTH，ratio = TILE_WIDTH × 0.25
兩個 barrier	載入後防 RAW (true)；用完後防 WAR (false)；漏掉 → race condition
Strip-mining	把長迴圈拆成 outer phase + inner 連續迭代，加 barrier 形成 tiling 的 phase
邊界檢查	三組獨立（載 M / 載 N / 寫 P），越界填 `0.0f`，phase 上界用 `ceil`
Occupancy 與記憶體	avg = sharedMem/block ÷ threads/block；A100 門檻 82 B/thread；tiled matmul 僅 8 B/thread
動態 shared memory	`extern __shared__` + launch 第三參數 + `cudaGetDeviceProperties` 查 `sharedMemPerBlock`

Related Concepts

Question 1 - Compute-to-Memory Ratio 定義 [recall]

Question 2 - 為何 Naive Matmul 是 Memory-Bound [recall]

Question 3 - CUDA 記憶體類型 Scope 與 Lifetime [recall]

Question 4 - Register 為何比 Global Memory 快 [recall]

Question 5 - Tiling 的減量因子 [recall]

Question 6 - 兩個 __syncthreads() 防的 Hazard [recall]

Question 7 - 越界邊界檢查 [recall]

Question 8 - 計算 Shared Memory 對 Occupancy 的限制 [application]

Question 9 - 判斷 Memory-Bound 或 Compute-Bound [application]

Question 10 - Naive vs Tiled Matmul 的效能取捨 [analysis]

Question 11 - 為何用 Shared Memory 而非 Register 持有載入值 [analysis]

Question 12 - GPU 顯式 Shared Memory vs CPU 隱式 Cache [analysis]

Question 13 - Strip-mining 與 Phase 數 [recall]

Question 14 - Constant 變數與動態 Shared Memory [recall]

Question 15 - 計算變數的版本數 [application]