FHD Kernel 的平行化結構：Scatter vs Gather 與迴圈轉換

#application-case-study #scatter-vs-gather #atomic-operations #loop-transformation #mri-reconstruction

重點總覽 (Overview)

本筆記對應 PMPP 17.3 的 Step 1：決定 kernel 的平行化結構。目標是把計算 FHD 的循序 C 雙重迴圈，安全地轉成「不需要 global-memory atomics」的 CUDA kernel。

項目	內容	關鍵點
問題規模	`M` = k-space 樣本數 (~10^5–10^6)、`N` = 影像 voxel 數 (128³ ≈ 2.1M)	每個 FHD voxel 依賴所有樣本，但 voxel 之間互相獨立 → 可全平行
循序結構	外迴圈跑 `m` (樣本)、內迴圈跑 `n` (voxel)	內迴圈前有 `rMu/iMu` 計算 → 非完美巢狀
Scatter	一個 thread = 一個輸入樣本，寫入所有 voxel	多 thread 撞同一輸出 → 必須 atomicAdd → 慢
Gather	一個 thread = 一個輸出 voxel，收集所有樣本	每 thread 只寫自己的 voxel → 無衝突、無 atomics
致能轉換	Loop fission (拆出 `cmpMu`) → Loop interchange (n/m 對調)	fission 先讓迴圈變完美巢狀，才能 interchange
三種設計選項	①thread=內迴圈 (N×M)、②thread=外迴圈 (M)、③thread=輸出 (N)	選項③ (gather) 最佳
效能瓶頸 (預告)	OP/B 僅 0.25–0.75；FP:trig = 13:2	由 03 與 04 處理

Important

本章的核心結論：從 scatter 轉成 gather，是避免 atomic operations 的關鍵；而要做到 gather，需要 loop fission + loop interchange 兩個迴圈轉換來搭橋。

循序 FHD 程式碼 (Sequential FHD Code)

// Fig. 17.4 — sequential C, computing FHD
for (int m = 0; m < M; m++) {                 // outer: k-space samples
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];      // pre-inner statements
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {                // inner: image voxels
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD);
    float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;       // accumulate into voxel n
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

M = k-space 總樣本數；N = 重建影像的 voxel 總數。
每個 FHD[n] 累加所有 m 的貢獻 → voxel 依賴全部樣本。
但 任一 voxel 不依賴其他 voxel → 所有 FHD 元素可平行；外、內迴圈的所有 iteration 皆可平行。
例外（依賴）：同一個外迴圈 iteration 內，內迴圈用到該迭代先算好的 rMu[m]/iMu[m]（迴圈內的 statement 依賴）。

兩個效能瓶頸（在此先點名，細節見姊妹筆記）

記憶體頻寬：FHD 迴圈的 compute-to-memory 比僅 0.25 OP/B（最差，trig 算硬體單一指令）到 0.75 OP/B（最佳，trig 用 5 項 Taylor 級數 = sin 13 / cos 12 FP ops）。遠低於不被頻寬限制所需的比值。
三角函數：FP 算術 : trig 函數 ≈ 13 : 2，sin/cos 長延遲、低吞吐，會主導執行時間。

Scatter vs Gather

直觀對應 (Mapping)

SCATTER  (Fig 17.5)  thread <-> 一個 k-space 樣本 m
   thread m  ──writes──▶  rFhD[0] rFhD[1] ... rFhD[N-1]   (寫"所有"voxel)
   thread m' ──writes──▶  rFhD[0] rFhD[1] ... rFhD[N-1]
                              ▲撞      ▲撞   => 必須 atomicAdd

GATHER   (Fig 17.10) thread <-> 一個輸出 voxel n
   k[0]k[1]...k[M-1] ──read──▶ thread n  ──writes──▶ rFhD[n]   (只寫自己)
   k[0]k[1]...k[M-1] ──read──▶ thread n' ──writes──▶ rFhD[n']
                                          每 thread 唯一 n => 無衝突、無 atomics

第一版 kernel（scatter，反例）

// Fig. 17.5 — first (scatter) version: maps OUTER loop to threads
#define FHD_THREADS_PER_BLOCK 1024
__global__ void cmpFhD(/* ...rPhi,iPhi,rD,iD,kx,ky,kz,x,y,z,rMu,iMu,rFhD,iFhD */, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;   // thread = 1 sample
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD); float sArg = sin(expFhD);
    atomicAdd(&rFhD[n], rMu[m]*cArg - iMu[m]*sArg);   // <-- 所有 thread 都寫每個 voxel
    atomicAdd(&iFhD[n], iMu[m]*cArg + rMu[m]*sArg);   // <-- 必須 atomic
  }
}

Grid/Block：gridDim = M/FHD_THREADS_PER_BLOCK、blockDim = FHD_THREADS_PER_BLOCK；m = blockIdx.x*1024 + threadIdx.x。
例：M = 1,000,000、block = 1024 → grid = 1,000,000/1024 = 977 blocks。
致命缺陷：所有 thread 都寫進每一個 rFhD/iFhD voxel → 內迴圈必須用 global-memory atomicAdd 防止互相覆蓋。

Warning

Scatter 為何不能改用 privatization（如 Ch.9 直方圖那樣搬到 shared memory）？因為 rFhD/iFhD 的大小 = 影像 voxel 總數（百萬級），塞不進 shared memory，無法做私有化。所以必須換思路 → gather。

Scatter vs Gather 比較

面向	Scatter (input-centric)	Gather (output-centric)
thread 對應	一個輸入元素 (k-space 樣本)	一個輸出元素 (voxel)
動作	把該輸入的影響散播到多個輸出	從所有輸入收集到單一輸出
輸出寫入衝突	多 thread 寫同一輸出 → 有	每 thread 唯一輸出 → 無
是否需 atomics	需要（global memory，慢）	不需要
平行度	M threads	N threads
本章評價	慢（Fig 17.5 / 17.8）	最佳（Fig 17.10）

Tip

Scatter↔Gather 是跨章節反覆出現的決策模式：見 DCS 的 Scatter vs Gather 與 Output-centric vs Input-centric 分解。共同心法：讓 thread 擁有「唯一的輸出」，就能消滅 atomics。

迴圈分裂 (Loop Fission)

要走 gather，必須把內迴圈 n 提到外層（loop interchange）。但 interchange 需要 完美巢狀 (perfectly nested) 迴圈：外、內 for 之間不能有其他 statement。Fig 17.4 的外迴圈在內迴圈前有 rMu/iMu 計算 → 不是完美巢狀。

Loop fission（迴圈分裂 / loop splitting）：把一個迴圈的 body 拆成兩個獨立迴圈。

// Fig. 17.6 — loop fission: split into Mu-loop and FHD-loop
for (int m = 0; m < M; m++) {                 // Loop 1: 只算 rMu/iMu
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}
for (int m = 0; m < M; m++) {                 // Loop 2: 完美巢狀，可 interchange
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD); float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

執行順序的改變（關鍵正確性論證）：

fission 前：  [pre_0][inner_0]  [pre_1][inner_1]  [pre_2][inner_2] ...
fission 後：  [pre_0][pre_1][pre_2]...  [inner_0][inner_1][inner_2]...

第一部份（rMu/iMu）不依賴任何前次迭代第二部份（內迴圈）的結果 → 先全部做完 pre 再做 inner 不改變結果。
這也是進階編譯器（能分析 iteration 間依賴）常自動做的轉換。

第一個迴圈直接變成 cmpMu kernel（每 thread 一個 m，各自寫自己的 rMu[m]/iMu[m]，無衝突）：

// Fig. 17.7 — cmpMu kernel (first loop)
#define MU_THREADS_PER_BLOCK 1024
__global__ void cmpMu(float* rPhi, iPhi, rD, iD, rMu, iMu) {
  int m = blockIdx.x*MU_THREADS_PER_BLOCK + threadIdx.x;
  rMu[m] = rPhi[m]*rD[m] + iPhi[m]*iD[m];
  iMu[m] = rPhi[m]*iD[m] - iPhi[m]*rD[m];
}

Tip

拆成兩個依序執行的 kernel 不損失任何平行度：第二個 kernel 本就需要第一個的結果。兩 kernel 之間的順序由 kernel launch 順序保證。

迴圈交換 (Loop Interchange) 與三種 Kernel 設計選項

對 Fig 17.6 的第二個（完美巢狀）迴圈，設計 kernel 有三種選項：

選項	thread 對應	thread 數量	是否需 atomics	評價
①	一個內迴圈 iteration	N × M（百萬 × 十萬）	否	thread 太多，遠超裝置所需 → 不可行
②	一個外迴圈 iteration (m)	M	是（同 Fig 17.5 問題）	寫衝突大量 → 慢 (Fig 17.8)
③	一個輸出 voxel (n)	N	否	需 loop interchange；最佳 ✅

選項② (反例，仍是 scatter)

// Fig. 17.8 — option 2: thread = outer (m) iteration, STILL needs atomics
__global__ void cmpFhD(/* ... */, int N) {
  int m = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;
  for (int n = 0; n < N; n++) {
    float expFhD = 2*PI*(kx[m]*x[n]+ky[m]*y[n]+kz[m]*z[n]);
    float cArg = cos(expFhD); float sArg = sin(expFhD);
    atomicAdd(&rFhD[n], rMu[m]*cArg - iMu[m]*sArg);   // 又見 atomic
    atomicAdd(&iFhD[n], iMu[m]*cArg + rMu[m]*sArg);
  }
}

選項③：loop interchange → gather

// Fig. 17.9 — loop interchange: n becomes the OUTER loop
for (int n = 0; n < N; n++) {            // outer: output voxel
  for (int m = 0; m < M; m++) {          // inner: collect ALL samples
    float expFhD = 2*PI*(kx[m]*x[n] + ky[m]*y[n] + kz[m]*z[n]);
    float cArg = cos(expFhD); float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

// Fig. 17.10 — third (gather) kernel: thread = one output voxel n
#define FHD_THREADS_PER_BLOCK 1024
__global__ void cmpFhD(float* rPhi, iPhi, phiMag,
                       kx, ky, kz, x, y, z, rMu, iMu, int M) {
  int n = blockIdx.x * FHD_THREADS_PER_BLOCK + threadIdx.x;  // thread = 1 voxel
  for (int m = 0; m < M; m++) {
    float expFhD = 2*PI*(kx[m]*x[n]+ky[m]*y[n]+kz[m]*z[n]);
    float cArg = cos(expFhD); float sArg = sin(expFhD);
    rFhD[n] += rMu[m]*cArg - iMu[m]*sArg;   // 每 thread 唯一 n -> 直接 +=，無 atomics
    iFhD[n] += iMu[m]*cArg + rMu[m]*sArg;
  }
}

為何 interchange 合法：兩層迴圈的所有 iteration 互相獨立、可任意順序執行 → 交換順序不影響結果。
Grid/Block：gridDim = N/FHD_THREADS_PER_BLOCK、blockDim = FHD_THREADS_PER_BLOCK；n = blockIdx.x*1024 + threadIdx.x。
例：128³ = 2,097,152 voxel、block = 1024 → grid = 2,097,152/1024 = 2048 blocks。
每 thread 有唯一 n → 只累加自己的 rFhD[n]/iFhD[n] → 無衝突、無 atomics，故為三者中最佳。

高解析度的例外處理

對更高解析度（如 512³）voxel 數可能超過單一 grid 上限，需改用 多維 grid、launch 多個 grid、或讓一個 thread 處理多個 voxel（thread coarsening）。

Step 1 之後仍未解決的事

選項③ 雖然比 Fig 17.5/17.8 快很多，但仍受 memory bandwidth 限制（compute-to-memory 比僅約 0.23 OP/B）。提升比值（registers、constant memory、array-of-structs）與 trig 加速屬 Step 2–4，見 17-Iterative-MRI-Reconstruction/03-FHD-Memory-Bandwidth-Optimization 與 17-Iterative-MRI-Reconstruction/04-Hardware-Trig-Accuracy-and-Performance-Tuning。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「為什麼第一版 FHD kernel 需要 atomicAdd？」	它是 scatter：每 thread (一個樣本 m) 寫進所有 voxel，多 thread 撞同一輸出，必須 atomic 防覆蓋
「scatter 為何不能用 shared-memory privatization 解？」	輸出 (`rFhD/iFhD`) = voxel 總數（百萬級），塞不進 shared memory
「如何消除 atomics？」	改成 gather：讓每 thread 擁有唯一的輸出 voxel，從所有輸入收集 → 無衝突
「為什麼需要 loop fission？」	interchange 需完美巢狀迴圈；`rMu/iMu` 卡在內迴圈前，須先 fission 拆出去
「loop fission 改變執行順序，為何結果不變？」	第一部份 (`rMu/iMu`) 不依賴前次迭代內迴圈的結果，故先做完全部 pre 再做 inner 仍正確
「loop interchange 何時合法？」	當兩層迴圈所有 iteration 互相獨立、可任意順序執行時
「三選項各幾條 thread？選誰？」	①N×M（太多）②M（要 atomics）③N（gather，最佳）
「scatter vs gather 的 thread 數差異」	scatter = M threads；gather = N threads（本例 N = voxel 數 = 2.1M for 128³）
「FHD 的瓶頸數字」	OP/B = 0.25（最差）~0.75（最佳）；FP:trig ≈ 13:2
「512³ 解析度怎麼辦？」	多維 grid / 多 grid launch / 一 thread 多 voxel（coarsening）

重點總覽 (Overview)

循序 FHD 程式碼 (Sequential FHD Code)

Scatter vs Gather

直觀對應 (Mapping)

第一版 kernel（scatter，反例）

Scatter vs Gather 比較

迴圈分裂 (Loop Fission)

迴圈交換 (Loop Interchange) 與三種 Kernel 設計選項

選項② (反例，仍是 scatter)

選項③：loop interchange → gather

考試/面試重點 (Exam / Test Patterns)

Related Notes