FHD Kernel 的記憶體頻寬優化：Registers 與 Constant Memory

#application-case-study #register-optimization #constant-memory #arithmetic-intensity #cache-hierarchy

重點總覽 (Overview)

承接 17-Iterative-MRI-Reconstruction/02-FHD-Kernel-Parallelism-Structure 選定的 gather kernel（Fig. 17.10），本節是 Step 2：把這個 kernel 從記憶體頻寬綁定 (memory-bandwidth bound) 一路救回運算綁定邊緣。手法是逐步抬高 compute-to-global-memory-access ratio (OP/B，即 arithmetic intensity)。

階段	手法	每次迭代 global accesses	OP/B	關鍵限制
Fig. 17.10 起點	gather kernel（全部走 global memory）	14	0.23	bandwidth bound
Fig. 17.11	把 `x[n] y[n] z[n] rFhD[n] iFhD[n]` 提升到 registers	7	0.46	register/occupancy
Fig. 17.13	`kx ky kz` 放 constant memory + chunking	2	1.63	constant cache 容量
Fig. 17.16	k-space 改 array-of-structures (AoS) 佈局	2	1.63（但 cache 命中率大增）	解 cache thrashing

Important

目標數字來自 Chapter 5：OP/B 太低時 kernel 受限於 memory bandwidth。三步把 0.23 → 1.63 OP/B，使 bandwidth 不再是唯一瓶頸（剩下交給 Step 3 的硬體三角函數）。

瓶頸分析：Compute-to-Memory Ratio (Arithmetic Intensity)

逐項數 Fig. 17.10 內層迴圈每次迭代的 global memory 存取（共 14 次，每次 4 bytes = float）：

kx[m] ky[m] kz[m]        ← 3  (k-space 座標)
x[n]  y[n]  z[n]         ← 3  (voxel 座標)
rMu[m] (×2) iMu[m] (×2)  ← 4  (Mu，編譯器看成 2 個位置)
rFhD[n] read+write       ← 2
iFhD[n] read+write       ← 2
                         ───
                          14 accesses

約 13 個浮點運算（乘/加/三角）：

OP/B = \frac{13}{14 \times 4} = 0.23 OP/B

Warning

三角函數的 op 計數有歧義：若 sin/cos 用 5 項 Taylor（13、12 ops）→ 最佳約 0.75 OP/B；若視為單一硬體指令 → 最差 0.25 OP/B（見 17.3 開頭分析）。本節以「每函數算 1 op」的保守口徑得到 0.23，作為優化基準。

第一招：把陣列元素提升到 Registers (Register Promotion)

判斷哪些 global 存取可以變成 register：看 index 與 threadIdx 的關係。

變數	index	在 m-loop 中是否變動？	可否進 register？
`x[n] y[n] z[n]`	`n`（thread 固定）	否（整圈不變）	可：迴圈前載入一次
`rFhD[n] iFhD[n]`	`n`（thread 固定）	累加（讀寫同一位置）	可：圈內用 register 累加，圈後寫回一次
`kx[m] ky[m] kz[m]`	`m`（每圈變）	是	不可（register 救不了 → 改用 constant memory）

__global__ void cmpFhD(/* ... */, int M) {
    int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
    // 頻繁存取的座標與輸出 → automatic variables (registers)
    float xn_r = x[n]; float yn_r = y[n]; float zn_r = z[n];
    float rFhDn_r = rFhD[n]; float iFhDn_r = iFhD[n];
    for (int m = 0; m < M; m++) {
        float expFhD = 2*PI*(kx[m]*xn_r + ky[m]*yn_r + kz[m]*zn_r);
        float cArg = cos(expFhD);
        float sArg = sin(expFhD);
        rFhDn_r += rMu[m]*cArg - iMu[m]*sArg;   // 累加在 register
        iFhDn_r += iMu[m]*cArg + rMu[m]*sArg;
    }
    rFhD[n] = rFhDn_r; iFhD[n] = iFhDn_r;       // 圈後寫回一次
}

thread n 的 m-loop（register 重用）:

  x[n],y[n],z[n] ── 載入 1 次 ──► [xn_r][yn_r][zn_r]  (整圈唯讀重用)
                                        │
   m=0 ─► m=1 ─► m=2 ─► ... ─► m=M-1    │ 全程從 register 取
                                        ▼
  rFhDn_r / iFhDn_r  累加於 register ───► 圈結束才寫回 global 1 次

效果：每圈 global accesses 14 → 7，OP/B 0.23 → 0.46（13/(7×4)）。

每 thread 多用 5 個 register → 每 block 多用 5 × FHD_THREADS_PER_BLOCK = 5 × 1024 = 5120 registers。SM 總額 65,536 registers（SM 3.5+）由所有常駐 block 共享。本 kernel 還 OK，但再加 register 就可能限制每 SM 的 block 數（occupancy ↓）。詳見 04-Compute-Architecture-And-Scheduling/03-Resource-Partitioning-and-Occupancy。

第二招：k-space 放 Constant Memory + Chunking

kx[m] ky[m] kz[m] 為何適合 constant memory（而非 register）？

唯讀：kernel 不修改 k-space。
index = m，與 threadIdx 無關 → warp 內 32 個 thread 同一時刻存取同一個元素。

constant cache 廣播 (broadcast)：
   kx_c[m] ──取一次進 cache──► broadcast 給 warp 內全部 32 threads
   ⇒ 每 32 次存取，≥31 次由 cache 服務 ⇒ 消除 96%+ 的 global accesses

Tip

這個存取型態讓 constant memory「幾乎和 register 一樣快」——差別只在仍需一條 memory load 指令。

容量限制 → Chunking 64KB

Constant memory 容量僅 64KB，但 k-space 樣本可達數百萬個 → 必須切塊 (chunk)，host 端多次呼叫 kernel，每次先把一塊塞進 constant memory：

__constant__ float kx_c[CHUNK_SIZE], ky_c[CHUNK_SIZE], kz_c[CHUNK_SIZE];

void main() {
    for (int i = 0; i < M/CHUNK_SIZE; i++) {
        cudaMemcpyToSymbol(kx_c, &kx[i*CHUNK_SIZE], 4*CHUNK_SIZE, cudaMemcpyHostToDevice);
        cudaMemcpyToSymbol(ky_c, &ky[i*CHUNK_SIZE], 4*CHUNK_SIZE, cudaMemcpyHostToDevice);
        cudaMemcpyToSymbol(kz_c, &kz[i*CHUNK_SIZE], 4*CHUNK_SIZE, cudaMemcpyHostToDevice);
        cmpFhD<<<N/FHD_THREADS_PER_BLOCK, FHD_THREADS_PER_BLOCK>>>(/* ... */, CHUNK_SIZE);
    }
    // 若 M 不是 CHUNK_SIZE 的整數倍，需再補一輪 memcpy + kernel
}

kernel 內 kx[m] 改成 global __constant__ 變數 kx_c[m]（不再放參數表）。剩下唯一的 global 存取是 rMu[m]、iMu[m]（編譯器把 4 次合併成 2 次 load）：

OP/B = \frac{13}{2 \times 4} = 1.63 OP/B

為何 chunking 對本 kernel特別容易

所有 thread 一齊 (lock-step) 掃過同一段 k-space 陣列（同一迭代存取同一元素），大資料只是「迴圈跑更多圈」。所以把迴圈切段、每段配一個 chunk 即可。
（較新裝置可改用 const __restrict__ 參數宣告，讓資料走 read-only data cache，效果類似但更簡單。）

Warning

並非所有唯讀資料都這麼適合 constant memory。若不同 block 在同一迭代存取不同元素（diverged pattern），就很難把足夠資料塞進 constant memory（見原文 footnote 3）。

第三招：Array-of-Structures (AoS) 佈局調整 Cache

跑完 Fig. 17.13 在某些裝置上加速不如預期——問題出在 constant cache 設計 + k-space 記憶體佈局。

Cache line 會一次載入多個連續字。三個分離陣列 kx ky kz → 每次迭代需要 3 條 cache line（分屬三個陣列的不同區段）。

Working set 爆掉 → cache thrashing

1024 threads/block, 2 blocks/SM 同時常駐：
  warps/SM = (1024/32) × 2 = 64 warps
  每 warp 需 3 條 cache line（kx/ky/kz 各一）
  worst case working set = 64 × 3 = 192 cache lines
  (即使 3 個 warp 共享同一迭代) ≈ 64 cache lines

但有些裝置 constant cache 只有 ~32 條 line → 容不下 working set → 不同 warp 互相搶 line：下一圈要用的元素早被別的 warp 擠出去 → cache 無法消除 global accesses。

解法：SoA → AoS

把 x/y/z 三分量塞進一個 struct 的三個欄位，強制它們連續存放 → 一次迭代的 3 分量塞進同一條 cache line：

(A) Separate arrays (SoA)            (B) Array of structures (AoS)
  kx: [..x[i]..]  ─ line 1            k: [ x y z | x y z | x y z ... ]
  ky: [..y[i]..]  ─ line 2               └──┬──┘
  kz: [..z[i]..]  ─ line 3            一次迭代的 x,y,z ─► 同一條 cache line
  → 每 warp 需 3 lines                 → 每 warp 只需 1 line (working set ÷3)

struct kdata { float x, y, z; };
__constant__ struct kdata k[CHUNK_SIZE];
// kernel 內：kx[m] → k[m].x, ky[m] → k[m].y, kz[m] → k[m].z
float expFhD = 2*PI*(k[m].x*xn_r + k[m].y*yn_r + k[m].z*zn_r);

附帶好處：只剩一個陣列 → 只需一次 cudaMemcpyToSymbol，搬移大小由 4*CHUNK_SIZE 改為 12*CHUNK_SIZE（一次搬三分量）。

AoS 與 coalescing 規則相反，別搞混！

在 global memory 中，當 warp 內相鄰 thread 存取連續元素時，反而要用 SoA 才能 coalesce（見 06-Performance-Considerations/01-Memory-Coalescing）。這裡用 AoS 是因為 warp 內所有 thread 存取的是同一個元素，不是連續元素——存取型態不同，佈局結論就相反。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
算 compute-to-memory ratio	OP/B = (浮點運算數) ÷ (global accesses × 4 bytes)；起點 13/(14×4)=0.23
哪些陣列可放 register？	index 與 `threadIdx` 綁定且整迴圈不變/僅累加者（`x[n] y[n] z[n] rFhD[n] iFhD[n]`）；隨迴圈變的 `kx[m]` 不行
register promotion 抬升多少	14→7 accesses，0.23→0.46 OP/B；代價是每 thread +5 register（每 block +5120）
為何 `kx[m]` 不能進 register	每次迭代 `m` 都變，無法整圈重用 → 改用 constant memory（練習 3 考點）
k-space 為何適合 constant memory	唯讀 + index=`m` 與 threadIdx 無關 → warp 內全 32 thread 同一元素 → broadcast，消除 96%+ 存取
constant memory 容量限制	64KB → chunking，host 端迴圈多次 `cudaMemcpyToSymbol` + 多次 kernel 啟動
加 constant memory 後 OP/B	剩 2 個 global access（rMu/iMu），13/(2×4)=1.63 OP/B
constant cache 反而慢的原因	cache line 載多字 + 三分離陣列每 warp 需 3 line → working set (~64–192) 超過 cache line 數(~32) → thrashing
working set 算法	warps/SM = (threads/32)×blocks/SM；× 每 warp cache lines = 64×3=192 (worst)
AoS vs SoA 該用哪個	同一元素全 warp 共讀 → AoS（一 line 裝 x/y/z）；相鄰 thread 讀連續元素 → SoA（coalescing）
AoS 後的 memcpy 改變	一個陣列、一次 copy，大小 4·CHUNK → 12·CHUNK
`const __restrict__` 作用	讓參數走 read-only data cache，新裝置上替代 constant memory 的簡便寫法

重點總覽 (Overview)

瓶頸分析：Compute-to-Memory Ratio (Arithmetic Intensity)

第一招：把陣列元素提升到 Registers (Register Promotion)

第二招：k-space 放 Constant Memory + Chunking

容量限制 → Chunking 64KB

第三招：Array-of-Structures (AoS) 佈局調整 Cache

Working set 爆掉 → cache thrashing

解法：SoA → AoS

考試/面試重點 (Exam / Test Patterns)

Related Notes