Reduction 基礎與簡單 Kernel

#parallel-pattern #reduction #barrier-synchronization #work-efficiency #control-divergence

重點總覽 (Overview)

主題	關鍵點	數值 / 公式
Reduction 定義	用一個有 identity value 的 binary operator,把整個 array 縮成單一值	sum, max, min, product, ...
必要數學性質	reduction tree 重排「運算順序」需 associative;再重排「operand 位置」需 commutative	`(a⊕b)⊕c = a⊕(b⊕c)`
平行模型	reduction tree:每一 time step 平行做一半的運算,逐層收斂到 root	共 N−1 次運算,log₂N 步
加速 vs 代價	步數從 N 降到 log₂N,但需大量平行硬體(peak = N/2)	speedup = `N/log₂N`
Simple kernel	單一 block、N/2 threads、owner-computes、stride 由 1 倍增	thread k 擁有 `input[2k]`
限制	1 block ≤ 1024 threads;`__syncthreads()` 只能同 block 同步	≤ 2048 elements

Important

本筆記只涵蓋 §10.1–10.3:定義、reduction tree、第一個 simple kernel。Simple kernel 帶有的 control divergence / memory divergence / 過多 global memory 存取等問題,以及對它的三個優化,屬於 10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel;多 block、atomic、coarsening 屬於 10-Reduction/03-Scaling-Reduction-Hierarchical-and-Coarsening。

背景:Reduction 的定義 (Background)

Reduction = 從一個 array of values 導出單一值(sum / max / min / product / ...),輸入型別可為 int、float、double、half、char,計算結構皆相同。
數學上:給定一個有 identity value 的 binary operator 即可定義 reduction。identity 指「v ⊕ identity = v」的那個值。
與 histogram 一樣,reduction 是把大量資料壓成摘要的重要 pattern;平行版需要 threads 互相協調 (coordination),是展示效能瓶頸與其解法的好教材。

Operator (⊕)	Reduction 結果	Identity value
floating-point `+`	sum(總和)	`0.0`
floating-point `*`	product(乘積)	`1.0`
`min` 比較	最小值	`+∞`
`max` 比較	最大值	`−∞`

Sequential reduction(O(N),N 次迭代,N 步):

// Fig. 10.1 — sum 版本
float sum = 0.0f;                       // 初始化為 identity
for (unsigned int i = 0; i < N; ++i)
    sum += input[i];                    // 逐一累加

// Fig. 10.2 — 通用形式(可換成 max / min / product)
acc = IDENTITY;
for (unsigned int i = 0; i < N; ++i)
    acc = Operator(acc, input[i]);      // max 回傳較大、min 回傳較小 ...

範例

集合 {7.0, 2.1, 5.3, 9.0, 11.2} 的 sum reduction = 7.0+2.1+5.3+9.0+11.2 = 34.6。

Reduction Tree 平行模型 (Reduction Trees)

把 sequential 的一條長鏈,重新加括號成一棵樹:每個 time step 同時對所有「pair」做運算,逐層折半收斂到 root。葉子是原始輸入,root 是最終結果(注意:這是概念上的樹,edge 只代表 time step 間的資訊流,不是指標型 tree 資料結構)。

8 個輸入的 sum reduction tree(time 由上往下)
index:    0    1    2    3    4    5    6    7
         [3]  [1]  [7]  [0]  [4]  [1]  [6]  [3]
           \  /      \  /      \  /      \  /
step 1:    [4]       [7]       [5]       [9]      ← N/2 = 4 次運算 (peak)
             \       /           \       /
step 2:       [ 11 ]              [ 14 ]          ← N/4 = 2 次運算
                 \               /
step 3:             [   25   ]                    ← 1 次運算 = root

順序被改變了:sequential 是 ((((((3 max 1) max 7) max 0)...) 一路左結合;tree 則是 ((3⊕1)⊕(7⊕0)) ⊕ ((4⊕1)⊕(6⊕3))——同一份運算清單,只是括號位置不同。

兩個必要性質

性質	定義	重排什麼	反例
Associative(結合律)	`(a⊕b)⊕c = a⊕(b⊕c)`	改變運算順序(插不同括號)→ 把 sequential 變成 tree	整數減法 `(1−2)−3 ≠ 1−(2−3)`
Commutative(交換律)	`a⊕b = b⊕a`	額外改變 operand 位置 → §10.4 的優化用得到	整數減法 `1−2 ≠ 2−1`

max / min / sum / product 皆同時 associative 且 commutative。
把 sequential → tree(Fig 10.7)只需 associative;若要像 10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel 那樣重排 operand(讓 active threads 靠在一起),則還需 commutative。

浮點數嚴格來說不是 associative

不同的加括號方式可能造成不同的 rounding,使結果略有差異。實務上多數應用接受「在容忍誤差內視為相同」,因此允許把 floating-point addition 當作 associative 處理(詳見 PMPP Appendix A 與 24-Numerical-Considerations/03-Algorithm-Considerations-and-Numerical-Stability)。

工作量、步數與平行度

量	公式	N=1024
總運算數 (work)	`N/2 + N/4 + ... + 1 = N − 1`	1023
Time steps	`log₂N`	10
Speedup vs sequential	`N / log₂N`	102.4× (理想)
Peak parallelism(第 1 步)	`N/2`	512
Average parallelism	`(N − 1) / log₂N`	≈ 102.3

平行不會減少總工作量

reduction tree 的總運算數仍是 N−1,和 sequential 相同(work-efficient)。它換到的是步數從 N → log₂N。代價是需要大量平行硬體,且各 time step 的平行度劇烈變化(從 N/2 一路掉到 1),使資源利用率成為挑戰。

Simple Reduction Kernel (A Simple Reduction Kernel)

因為 reduction tree 需要所有 threads 協作,而跨整個 grid 無法做 barrier,所以第一版只在單一 block 內做。對 N 個元素,launch 一個 block、N/2 threads。

一個 block 最多 1024 threads → simple kernel 最多處理 2048 個 elements(§10.8 才解除此限制)。
採 "owner computes":每個資料位置由唯一一個 thread「擁有」,且只有 owner 會寫入它。thread k 的 owner 位置是 i = 2*threadIdx.x(即所有偶數索引)。
stride 從 1 開始每輪 ×2(1,2,4,...);第 n 輪只有 threadIdx.x % stride == 0(即 index 為 2ⁿ 倍數)的 thread 為 active,逐輪減半,最後只剩 thread 0。

// Fig. 10.6 — 單一 block 的 simple sum reduction kernel
__global__ void SimpleSumReductionKernel(float* input, float* output) {
    unsigned int i = 2 * threadIdx.x;                      // owner = 偶數索引
    for (unsigned int stride = 1; stride <= blockDim.x; stride *= 2) {
        if (threadIdx.x % stride == 0) {                   // active = 2^n 的倍數
            input[i] += input[i + stride];                 // 讀 owner + 讀 stride 外 + 寫 owner
        }
        __syncthreads();                                   // 等全部 partial sum 寫完才進下一輪
    }
    if (threadIdx.x == 0) {
        *output = input[0];                                // thread 0 輸出最終結果
    }
}

Thread → Data 映射與執行(N=8, 4 threads)

thread:   T0       T1       T2       T3          (T_k owns input[2k])
index:    0   1    2   3    4   5    6   7
         [a] [b]  [c] [d]  [e] [f]  [g] [h]
stride=1  T0:0+=1  T1:2+=3  T2:4+=5  T3:6+=7      ← 4 threads 全 active,相鄰 pair
         [ab]     [cd]     [ef]     [gh]          (寫回 idx 0,2,4,6)
            __syncthreads()
stride=2  T0:0+=2           T2:4+=6               ← 只 T0,T2 active;T1,T3 idle
         [abcd]            [efgh]                 (寫回 idx 0,4)
            __syncthreads()
stride=4  T0:0+=4                                 ← 只 T0 active
         [abcdefgh]                               (寫回 idx 0)
            __syncthreads()
output = input[0]            (T0)

每個 active thread 每輪做 2 reads + 1 write:讀自己的 owner、讀距離 stride 外的元素、寫回 owner。
__syncthreads() 確保「本輪所有 partial sum 都已寫進 input array」後,下一輪的 active threads 才會去讀(避免 race)。詳見 04-Compute-Architecture-And-Scheduling/01-GPU-Architecture-and-Block-Scheduling 的 barrier。

這個 simple kernel 效率不佳(由 §10.4–10.6 修正)

stride 遞增使 active threads 隨時間越來越分散:

Control divergence:同一 warp 內 active/inactive 混雜,大量執行資源被浪費(256-element 的利用率僅 255/736 ≈ 35%)。
Memory divergence:相鄰 threads 不存取相鄰位置 → 無法 coalesce。
過多 global memory 存取:每輪都把 partial sum 寫回 global memory 再讀。

解法(stride 改成遞減、用 shared memory)見 10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel;而把 simple kernel 的修正版搬上 control divergence 概念,見 04-Compute-Architecture-And-Scheduling/02-Warps-SIMD-and-Control-Divergence。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
reduction tree 為何能成立?需要什麼性質	operator 必須 associative(才能改加括號的順序)
若還要重排 operand 位置(讓 active threads 相鄰)	再加 commutative;§10.4 的低 divergence 版本需要它
`max` / `min` 的 identity value	`−∞` / `+∞`(sum=0.0, product=1.0)
N 個元素的 reduction tree 總運算數	N−1(與 sequential 相同,work-efficient)
reduction tree 的 time steps	log₂N(N=1024 → 10 步)
tree vs sequential 的理想 speedup	`N / log₂N`(8 元素 = 8/3 ≈ 2.67×)
第一步需要多少平行資源 / peak parallelism	`N/2`(N=1024 → 512)
average parallelism	`(N−1) / log₂N`(N=1024 → ≈102.3)
simple kernel 為何只能單一 block	`__syncthreads()` 只能同步同 block 的 threads,跨 block 無 barrier
simple kernel 最多處理幾個元素	2048(1024 threads × 每 thread 2 元素)
simple kernel launch 幾個 threads	N/2
"owner computes" 是什麼	每個位置由唯一 owner thread 寫入;thread k 擁有 `input[2k]`
`__syncthreads()` 放在迴圈內的目的	確保本輪 partial sums 全部寫完,下一輪才讀(防 race condition)
浮點數加法是 associative 嗎	嚴格不是(rounding),但實務上在容忍誤差內當作 associative

重點總覽 (Overview)

背景:Reduction 的定義 (Background)

Reduction Tree 平行模型 (Reduction Trees)

兩個必要性質

工作量、步數與平行度

Simple Reduction Kernel (A Simple Reduction Kernel)

Thread → Data 映射與執行(N=8, 4 threads)

考試/面試重點 (Exam / Test Patterns)

Related Notes