Reduction 平行歸約與最小化分歧練習題 (Practice - Reduction Fundamentals and the Simple Kernel)

#practice #parallel-pattern #reduction #control-divergence

10-Reduction/01-Reduction-Fundamentals-and-Simple-Kernel — Reduction 基礎與簡單 Kernel
10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel — 優化單一 Block 的 Reduction Kernel (Control/Memory Divergence 與 Shared Memory)
10-Reduction/03-Scaling-Reduction-Hierarchical-and-Coarsening — 擴展 Reduction：Hierarchical Reduction 與 Thread Coarsening

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案重點
Reduction 需要什麼運算性質	binary operator 要有 identity value；tree 重排運算順序需 associative，重排 operand 位置需 commutative
`sum/product/min/max` 的 identity	`0.0` / `1.0` / `+∞` / `−∞`
reduction tree 總工作量 / 步數	work = N−1 (work-efficient)；time steps = log₂N
peak / average parallelism	peak = N/2；average = (N−1)/log₂N
simple kernel 的 thread 映射	`i = 2threadIdx.x` (偶數位置)、owner-computes*、stride 遞增 1→2→4
單 block 上限	`__syncthreads()` 只能同 block → ≤1024 threads → ≤ 2048 元素
convergent kernel 三行改動	`i=threadIdx.x`、stride 遞減 blockDim.x→1、`if(threadIdx.x < stride)`
N=256 efficiency	simple 255/736 ≈ 35%；convergent 255/384 ≈ 66%
N=256 global requests	simple 141；convergent 36；shared-memory 9
shared-memory kernel global accesses	N+1 (coalesced 後 requests = N/32+1)
任意長度 / 多 block	segment 大小 `2blockDim.x`，每 block 一棵 tree，結尾 `atomicAdd`*，output 先設為 identity
thread coarsening	每 thread 在 register 序列加 COARSE_FACTOR×2 元素，砍掉 underutilization + sync + shared overhead

Question 1 - Reduction 的定義與 identity value [recall]

情境：什麼是 reduction？要能對一個 binary operator 定義 reduction，數學上必須具備什麼條件？請寫出 sum、product、min、max 四種 reduction 的 identity value。

查看答案

Reduction 用一個 binary operator 把整個 array of values 縮成單一值 (sum / max / min / product …)。前提是該 operator 要有 identity value（滿足 v ⊕ identity = v）。identity：sum → 0.0、product → 1.0、min → +∞、max → −∞。

Question 2 - Associative 與 Commutative [recall]

情境：把 sequential reduction 改寫成 reduction tree，operator 需要哪個性質？若要進一步「重排 operand 位置」(如 §10.4 讓 active thread 相鄰) 又需要哪個性質？

查看答案

sequential → tree 只是改變加括號的位置 (運算順序)，需 associative：(a⊕b)⊕c = a⊕(b⊕c)。若還要重排 operand 位置（convergent kernel 把遠處元素配成相鄰 pair），則額外需 commutative：a⊕b = b⊕a。sum/min/max/product 兩者皆滿足；整數減法兩者皆不滿足。（浮點加法嚴格上非 associative，但實務上在容忍誤差內當作 associative。）

Question 3 - Reduction tree 的工作量與步數 [recall]

情境：對 N 個元素的 reduction tree，總共要做幾次運算？需要幾個 time steps？這與 sequential 版本相比是否更省「工作量」？

查看答案

總運算數 = N/2 + N/4 + … + 1 = N−1，與 sequential 完全相同（所以 reduction tree 是 work-efficient，平行不減少總工作量）。time steps = log₂N。換到的好處是步數從 N → log₂N，代價是需要大量平行硬體（peak 同時要 N/2 個運算單元）。

Question 4 - Peak 與 Average Parallelism [application]

情境：N = 1024。請計算這個 reduction tree 的 (a) peak parallelism、(b) average parallelism、(c) 相對 sequential 的理想 speedup。

查看答案

(a) peak = N/2 = 512（發生在第 1 步）。(b) average = (N−1)/log₂N = 1023/10 ≈ 102.3。(c) 理想 speedup = N/log₂N = 1024/10 = 102.4×。peak (512) 遠大於 average (≈102)，正反映各 time step 平行度從 N/2 一路掉到 1，使資源利用率成為 reduction 的挑戰。

Question 5 - Simple Kernel 的 Thread 映射與 owner-computes [recall]

情境：在 Fig. 10.6 的 simple sum reduction kernel 中，thread k 擁有哪個 input 位置？stride 如何變化？__syncthreads() 放在迴圈內的作用是什麼？

查看答案

採 owner-computes：每個位置只由唯一的 owner thread 寫入。thread k 的 owner 位置 i = 2*threadIdx.x（即所有偶數索引）。stride 從 1 開始每輪 ×2 (1,2,4,…)，第 n 輪只有 threadIdx.x % stride == 0 的 thread 為 active。每個 active thread 每輪做 2 reads + 1 write (input[i] += input[i+stride])。__syncthreads() 確保本輪所有 partial sum 都寫回 array，下一輪 active thread 才能正確讀取（防 race condition）。

Question 6 - 為何只能單一 Block / 2048 元素上限 [recall]

情境：前幾版 reduction kernel 都只 launch 一個 block，最多只能處理 2048 個元素。為什麼有這個限制？

查看答案

reduction tree 需要所有參與 thread 互相同步，而 kernel 用 __syncthreads() 當 barrier。__syncthreads() 只能同步同一個 block 內的 threads，跨 block 沒有好的 barrier。一個 block 最多 1024 threads，每個 thread 負責 2 個元素 → 最多 2×1024 = 2048 元素。任意長度需用 §10.7 的 segmented multiblock reduction + atomic。

Question 7 - Convergent Kernel 的三行關鍵改動 [recall]

情境：從 simple kernel (Fig. 10.6) 改成 convergent kernel (Fig. 10.9) 只改了三處，請列出。這三處同時解決了哪兩個瓶頸？

查看答案

(1) i = threadIdx.x（不再乘 2，owner 位置相鄰）；(2) stride 改為從 blockDim.x 遞減到 1（stride /= 2）；(3) active 條件改為 if (threadIdx.x < stride)（連續區段的 thread 才 active）。這同一個 thread-to-data 改動同時治好 control divergence (active thread 變連續) 與 memory divergence (相鄰 thread 存相鄰位址 → coalesced)。

Question 8 - 為何 if 還在，divergence 卻變少 [analysis]

情境：convergent kernel 仍然有 if-statement，且每次迭代的 active thread 數量與 simple kernel 完全相同。那為什麼 control divergence 會大幅下降、efficiency 從 35% 升到 66%？

查看答案

關鍵不是「有沒有 if」或「active 數量」，而是 active 與 inactive thread 在 warp 內的相對位置。simple kernel 的 active thread (index 為 2ⁿ 倍數) 隨 stride 遞增而越來越分散，使每個 warp 內混有 active/inactive → divergence。convergent kernel 的 active thread 永遠是 thread 0..stride−1 的連續區段，整個 warp 要嘛全 active、要嘛全 inactive。一個 warp 只要有任一 active thread 就吃滿 32 個 resource，故位置決定資源利用率：N=256 時 simple 消耗 (4·5+2+1)·32=736、convergent (4+2+1+5)·32=384，committed 都是 255 → efficiency 255/736≈35% vs 255/384≈66%。（divergence 沒完全消除：最後 5 次迭代 active < 32 仍 divergent。）

Question 9 - Memory Coalescing：Strided vs Coalesced [analysis]

情境：比較 simple kernel 與 convergent kernel 的 global memory 存取模式。為什麼 simple kernel 是非 coalesced？對 N=256 各觸發幾次 global memory requests？為何此比值在 N 越大時越懸殊？

查看答案

simple kernel 相鄰 thread 的 owner 位址相隔 2 (i=2*tid)，warp 內存取位址散開 → 無法 coalesce，前 5 次迭代每個 warp 須 2 個 request；convergent kernel i=tid，warp 內存取連續位址 → 1 個 request，且整個 warp 一起退場。N=256：simple (4·5·2+4+2+1)·3 = 141，convergent ((4+2+1)+5)·3 = 36，比值 ≈ 3.9×（×3 = 2 reads + 1 write）。N=2048 時為 1149 vs 204 ≈ 5.6×——warp 越多，simple 初期「非 coalesced × 2 倍 request」的浪費被放大，故差距更大。

Question 10 - Shared-Memory Kernel 的 Global 存取量 [recall]

情境：Fig. 10.11 用 shared memory 後，對 N 個元素的 reduction 只需多少次 global memory accesses？coalesced 後又是多少 requests？__syncthreads() 為何要移到迴圈開頭？

查看答案

global memory accesses = N+1（載入 N 個原始元素 + thread 0 最後寫 1 個 output），中間迭代全部在 shared memory input_s[] 進行。coalesced 後 requests = N/32 + 1（N=256 → 9，對比 convergent 的 36 再 ~4×）。__syncthreads() 移到迴圈開頭，是為了同步「迴圈外第一層寫入 shared memory」與「第一次迭代讀取 shared memory」。額外好處：不破壞原始 input array。

Question 11 - 練習：Simple Kernel 第五次迭代的 divergence warp 數 [application]

情境 (Exercise 1)：simple kernel (Fig. 10.6)，N = 1024、warp size = 32。第 5 次迭代時，block 中有幾個 warp 會發生 control divergence？

查看答案

N=1024 → 512 threads → 16 warps。simple kernel 第 5 次迭代 stride = 2⁴ = 16，active 條件 threadIdx.x % 16 == 0 → active threads 為 0,16,32,…,496，共 32 個，平均散在每個 warp 各 2 個 active。因此每個 warp 都同時有 active 與 inactive thread → 16 個 warp 全部 divergent。

Question 12 - 練習：Convergent Kernel 第五次迭代的 divergence warp 數 [application]

情境 (Exercise 2)：convergent kernel (Fig. 10.9)，N = 1024、warp size = 32。第 5 次迭代時有幾個 warp 會 divergence？與上一題對比說明。

查看答案

stride 遞減：iter1=512, iter2=256, iter3=128, iter4=64, iter5=32。第 5 次迭代 active 條件 threadIdx.x < 32 → 恰好 warp 0 的 32 個 thread 全 active，其餘 15 個 warp 全 inactive。沒有任何 warp 內 active/inactive 混雜 → 0 個 warp divergent。對比上題 simple kernel 的 16，凸顯「讓 active thread 落在 warp 邊界上」如何徹底消除這一輪的 divergence。

Question 13 - 階層式分段多 Block Reduction 與 atomicAdd [recall]

情境：要對數百萬元素做 reduction，如何擴展到多個 block？segment 大小是多少？各 block 的 partial sum 如何合併？使用前 host 端必須先做什麼？

查看答案

把輸入切成 segments，每段大小 = 2 * blockDim.x，block b 的段起點 = 2*blockDim.x*blockIdx.x。每個 block 在自己的 segment 上獨立跑一棵 reduction tree（用私有 input_s）。跨 block 無法 barrier 同步，所以每個 block 由 thread 0 用 atomicAdd(output, input_s[0]) 把 partial sum 累加到唯一 output。host 端必須在 launch 前把 output 初始化為 identity（sum → 0.0），否則 atomic 會把舊值一起累加。atomic 只發生 N/2048 次（每 block 一次），競爭極低。

Question 14 - Thread Coarsening 的做法 [recall]

情境：thread coarsening 在 reduction 中具體怎麼做？coarsen factor 為 C 時每個 thread 序列加幾個元素？這個序列階段為什麼不需要 __syncthreads()？它砍掉了哪三種 overhead？

查看答案

每個 block 多吃 COARSE_FACTOR 倍資料（每 block COARSE_FACTOR*2*blockDim.x 元素）。每個 thread 獨立、序列地把它負責的 COARSE_FACTOR*2 個元素累加到一個 register 變數 sum，加完才寫進 input_s[t] 再進 reduction tree。序列階段全部 thread 都活躍且彼此獨立（各加各的），所以不需要 __syncthreads()、也不需寫 shared memory。它同時砍掉三種 overhead：hardware underutilization、barrier synchronization、shared memory access。

Question 15 - Coarsening 的權衡與 factor 上限 [analysis]

情境：既然硬體在資源不足時本來就會把多餘的 block 序列化執行，為什麼我們還要「自己」用 thread coarsening 序列化？以 Fig. 10.16 (factor=2) 的步數說明，並說明 coarsening factor 過大會有什麼後果。

查看答案

reduction tree 的「逐漸閒置」(underutilization + sync + shared access) 階段，每一個 launch 的 block 都要付一次。若由硬體序列化兩個未粗化 block：共 8 步，只有 2 步全利用、6 步 underutil（每步要 barrier + shared access）。若改成 1 個 coarsen×2 的 block：前 3 步在 register 序列相加（全利用、無 sync、無 shared），只剩 3 步走 tree → 共 6 步（3 全利用 / 3 underutil，僅 3 次 sync）。所以「自己粗化」比讓硬體序列化更省。但 factor 不是越大越好：粗化越多 → 平行做的事越少；若 block 數少於硬體能同時執行的量，就浪費平行硬體。最佳 factor 取決於輸入總大小與裝置 (SM 數、可駐留 block 數)。

重點整理 (Pattern Summary，點擊展開)

主題	核心重點	數值 / 公式
Reduction 定義	binary operator + identity value，縮成單一值	sum=0.0 / prod=1.0 / min=+∞ / max=−∞
必要性質	tree 重排運算順序需 associative；重排 operand 需 commutative	`(a⊕b)⊕c=a⊕(b⊕c)`；`a⊕b=b⊕a`
Tree 工作量 / 步數	work-efficient，步數降到對數	work = N−1；steps = log₂N
Parallelism	peak 遠大於 average → 資源利用挑戰	peak = N/2；avg = (N−1)/log₂N
Simple kernel	owner-computes，`i=2*tid`，stride 遞增	≤2048 元素 (單 block)
Convergent kernel	`i=tid`、stride 遞減、`if(tid<stride)` → 連續 active	efficiency 35%→66%；requests 141→36
Shared-memory kernel	partial sum 全留 shared，`__syncthreads()` 移到迴圈頭	global accesses = N+1；requests = N/32+1
Segmented multiblock	每 block 一棵 tree + `atomicAdd` 合併	segment = 2·blockDim.x；output 先設 identity
Thread coarsening	register 序列加 C×2 元素，砍 3 種 overhead	factor 過大 → block 數 < 硬體量 → 浪費平行度

Related Concepts

Question 1 - Reduction 的定義與 identity value [recall]

Question 2 - Associative 與 Commutative [recall]

Question 3 - Reduction tree 的工作量與步數 [recall]

Question 4 - Peak 與 Average Parallelism [application]

Question 5 - Simple Kernel 的 Thread 映射與 owner-computes [recall]

Question 6 - 為何只能單一 Block / 2048 元素上限 [recall]

Question 7 - Convergent Kernel 的三行關鍵改動 [recall]

Question 8 - 為何 if 還在，divergence 卻變少 [analysis]

Question 9 - Memory Coalescing：Strided vs Coalesced [analysis]

Question 10 - Shared-Memory Kernel 的 Global 存取量 [recall]

Question 11 - 練習：Simple Kernel 第五次迭代的 divergence warp 數 [application]

Question 12 - 練習：Convergent Kernel 第五次迭代的 divergence warp 數 [application]

Question 13 - 階層式分段多 Block Reduction 與 atomicAdd [recall]

Question 14 - Thread Coarsening 的做法 [recall]

Question 15 - Coarsening 的權衡與 factor 上限 [analysis]