優化單一 Block 的 Reduction Kernel (Control/Memory Divergence 與 Shared Memory)

#parallel-pattern #reduction #control-divergence #memory-coalescing #shared-memory

重點總覽 (Overview)

本筆記針對 simple sum reduction kernel (Fig. 10.6) 做三個連續優化，每一步只改幾行卻有顯著效能差異。基準輸入皆以 N = 256 (即 128 threads = 4 warps) 為例。

優化步驟	關鍵改動	解決的瓶頸	N=256 量化結果
(0) Simple kernel (Fig.10.6)	`i = 2*threadIdx.x`；stride 遞增 1→2→4…	基準 (有 control + memory divergence)	效率 35%；141 次 global requests
(1) Convergent kernel (Fig.10.9)	`i = threadIdx.x`；stride 遞減 blockDim.x→1；`if(tid < stride)`	降低 control divergence + 達成 coalescing	效率 66%；36 次 requests
(2) Shared-memory kernel (Fig.10.11)	partial sum 存 `__shared__`，迴圈只動 shared memory	降低 global memory accesses	global requests → 9 (再 ~4×)

Important

三個 kernel 的「有用加法次數完全相同 (N-1)」與「每次迭代的 active thread 數完全相同」。差別純粹在於 active thread 在 warp 內的相對位置 與 記憶體位置的相鄰性。這正是「需要清楚理解 SIMD 硬體執行才能做的微調」。

降低 Control Divergence (Minimizing Control Divergence, 10.4)

問題：active thread 隨時間越離越遠

Simple kernel 把 thread 對應到 i = 2*threadIdx.x (偶數位置)，條件是 threadIdx.x % stride == 0。隨 stride 遞增，留下的 active thread 在 warp 內越分越散，使每個 warp 同時混有 active / inactive thread → divergence。

Simple (Fig.10.7): stride 遞增, active thread 越離越遠
in[]: 0  1  2  3  4  5  6  7  ...
T:    0     1     2     3        (i = 2*tid)
s=1:  +1    +1    +1    +1   每個 active thread 與鄰居間隔大
s=2:  T0          T2           (只剩 tid 為 2 的倍數 → warp 內穿插 divergence)
s=4:  T0                       (只剩 tid 為 4 的倍數)

解法：讓 active thread 保持「相鄰」(stride 遞減)

改成 i = threadIdx.x、stride 從 blockDim.x 遞減，條件 threadIdx.x < stride。如此 active thread 永遠是 thread 0..stride-1 的連續區段，整段 warp 要嘛全 active、要嘛全 inactive。

Convergent (Fig.10.8, N=16, 8 threads): stride 遞減, active thread 永遠相鄰
in[]:  0  1  2  3  4  5  6  7 | 8  9 10 11 12 13 14 15
s=8:   T0 T1 T2 T3 T4 T5 T6 T7  in[t]+=in[t+8]   (全部 active)
s=4:   T0 T1 T2 T3 .  .  .  .   in[t]+=in[t+4]
s=2:   T0 T1 .  .                in[t]+=in[t+2]
s=1:   T0 .                      in[0]+=in[1]  → 結果在 in[0]

// Fig.10.9  Convergent kernel — 低 control divergence + coalesced
__global__ void ConvergentSumReductionKernel(float* input, float* output) {
    unsigned int i = threadIdx.x;                 // 改 1: 不再乘 2，owner 位置相鄰
    for (unsigned int stride = blockDim.x; stride >= 1; stride /= 2) {  // 改 2: 遞減
        if (threadIdx.x < stride) {               // 改 3: 連續區段的 thread 才 active
            input[i] += input[i + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) *output = input[0];
}

為何 if-statement 還在，divergence 卻變少？

重點不是「有沒有 if」，而是 active 與 inactive thread 的相對位置。當 active thread 落在「整個 warp 的邊界」上時，每個 warp 內部走同一條路徑 → no control divergence。

量化：execution resource utilization efficiency

efficiency = \frac{總 committed (active) threads}{總消耗的 execution resources}

分母關鍵：一個 warp 只要有任何一個 active thread，就吃掉整 32 個 thread 的 resource。

	消耗 resource (×32)	committed threads	efficiency
Simple (Fig.10.6)	(4·5 + 2 + 1)·32 = 736	4·(32+16+…+1)+2+1 = 255	255/736 ≈ 0.35
Convergent (Fig.10.9)	(4+2+1 + 5·1)·32 = 384	同上 = 255	255/384 ≈ 0.66

N/64 = 啟動的 warp 數 (N/2 threads ÷ 32)。Simple 前 5 次迭代所有 warp 都活著 (warp 內 ≥1 active)，故 ×5。
Convergent：整個 warp 會一起退場，前段每次迭代 active warp 減半，直到剩 1 個 warp 撐完最後 5 次迭代 (+5·1)。

divergence 沒有完全消除

對 N=256，從第 4 次迭代起 active thread 數降到 32 以下 (16, 8, 4, 2, 1)，最後 5 次迭代仍有 warp 內 divergence。Convergent kernel 只是把「有 divergence 的迭代數」從 10 降到 5，而非歸零。

降低 Memory Divergence (Minimizing Memory Divergence, 10.5)

Memory divergence = warp 內相鄰 thread 沒有存取相鄰位址 → 無法 coalesce，一次 warp 存取被拆成多個 global memory requests。

每次迭代每個 active thread 做 2 reads + 1 write (讀 owner、讀 stride 外、寫回 owner)。

Simple kernel: 相鄰 thread 的 owner 位址相隔 2 (= 2*tid) → 非 coalesced
warp 第一次 read:  in[0] in[2] in[4] ...  位址間隔 2
  → 觸發 2 個 memory requests，回傳資料有一半沒用到

Convergent kernel: 相鄰 thread 存取相鄰位址 (i = tid) → 全 coalesced
warp read:  in[0] in[1] in[2] ... in[31]  連續
  → 每個 read/write 只觸發 1 個 memory request

量化：global memory requests (N=256，×3 代表 2讀+1寫)

	公式	requests
Simple (Fig.10.6)	(4·5·2 + 4+2+1)·3	141
Convergent (Fig.10.9)	((4+2+1) + 5)·3	36

Simple 的 ·2：前 5 次迭代每個 warp 因非 coalesced 而觸發 2 個 request。
比值 141/36 ≈ 3.9×。在 N=2048 時為 1149 vs 204 ≈ 5.6× (warp 越多、初期非 coalesced 浪費越大)。

Important

Fig.10.9 一次優化同時治好 control divergence (10.4) 與 memory divergence (10.5)——因為「讓 active thread 連續」恰好也讓「相鄰 thread 存取相鄰位址」。兩者是同一個 thread-to-data mapping 改動的兩個面向。

降低 Global Memory 存取 (Minimizing Global Memory Accesses, 10.6)

觀察：convergent kernel 每次迭代仍把 partial sum 寫回 global memory，下一次又讀回來。由於 shared memory 延遲更低、頻寬更高，可把 partial sum 全程留在 shared memory。

Fig.10.10: 第一次加法直接在 load 時於 global memory 完成，
之後所有迭代都只在 shared memory input_s[] 進行：

global in[]: [.... 2*BLOCK_DIM 個原始元素 ....]
   每個 thread t:  input_s[t] = in[t] + in[t + BLOCK_DIM]   (coalesced, 一次做完第一層)
                          │
                          ▼  (以下只動 shared memory)
   shared input_s[]:  reduction tree (stride 從 blockDim/2 遞減)
                          │
                          ▼
   thread 0:  output = input_s[0]   (唯一寫回 global)

// Fig.10.11  Shared-memory kernel
#define BLOCK_DIM 1024
__global__ void SharedMemorySumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    unsigned int t = threadIdx.x;
    input_s[t] = input[t] + input[t + BLOCK_DIM];   // 第一層在 global 上做掉, 兩個 read 皆 coalesced
    for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2) { // 從 blockDim/2 起
        __syncthreads();                            // 移到迴圈開頭: 同步 shared 寫入與下一輪讀取
        if (threadIdx.x < stride) {
            input_s[t] += input_s[t + stride];      // 只動 shared memory
        }
    }
    if (threadIdx.x == 0) *output = input_s[0];      // 唯一寫回 global
}

量化：global memory accesses / requests

指標	公式	N=256
Global memory accesses	N + 1 (載入 N 個 + 最後寫 1 個)	257
Global memory requests (coalesced)	N/32 + 1	8 + 1 = 9

對比 convergent 的 36 → 9，再 ~4× 改善。
__syncthreads() 必須移到迴圈開頭：確保第一層 (迴圈外) 的 shared memory 寫入，與第一次迭代的讀取之間有 barrier synchronization。

額外好處：input 不被破壞

前兩個 kernel 是 in-place 覆寫 input[]，shared-memory 版只讀取原陣列。若原始資料之後還要用，這個性質很有價值。

block size 上限仍在

仍是單一 block kernel：__syncthreads() 只能在同一 block 內同步，故一個 block 最多 1024 threads → 至多處理 2048 元素。任意長度需 hierarchical / segmented multiblock reduction (10.7) 配 atomic operations。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
Fig.10.6 → Fig.10.9 改了哪幾行？	`i=threadIdx.x` (非 `2*tid`)、stride 由 `blockDim.x` 遞減、`if(tid<stride)`。三行同時治 control + memory divergence。
為何 if-statement 還在卻沒 divergence？	因 active thread 變成 thread 0..stride-1 連續區段，整個 warp 走同一路徑。
efficiency 公式	active(committed) threads ÷ 消耗 execution resources；warp 只要 1 thread active 就吃滿 32。
N=256 efficiency 數字	Simple 255/736 ≈ 35%；Convergent 255/384 ≈ 66% (約 2×)。
convergent 是否完全無 divergence？	否。active thread < 32 (最後 5 次迭代) 時 warp 內仍 divergent；只是迭代數從 10 降到 5。
global memory requests 比值	N=256: 141/36 ≈ 3.9×；N=2048: 1149/204 ≈ 5.6× (warp 越多差越大)。
每次迭代每 thread 幾次 global access？	2 reads + 1 write = 3 (故公式 `×3`)。
shared-memory 版 global accesses	N+1 次；coalesced 後 requests = N/32+1 (N=256 → 9)。
為何 `__syncthreads()` 移到迴圈開頭？	同步「迴圈外第一層寫入 shared」與「第一次迭代讀 shared」。
為何 commutative 也需要？	遞減-stride 重排了 operand 順序 (非僅插括號)，故除 associative 外還需 commutative。
reduction tree 的 work / step 複雜度	操作數 = N-1 (work-efficient)；time steps = log₂N；average parallelism = (N-1)/log₂N。