擴展 Reduction：Hierarchical Reduction 與 Thread Coarsening

#parallel-pattern #reduction #atomic-operations #thread-coarsening #shared-memory

重點總覽 (Overview)

主題	解決的問題	核心手法	代價 / 限制
Hierarchical / Segmented multiblock reduction	單一 block 受 `__syncthreads()` 限制,最多 1024 threads → 只能處理 ≤2048 元素	將輸入切成 segments,每個 block 獨立做一棵 reduction tree,再用 `atomicAdd` 把 partial sum 累加到唯一 output	跨 block 無法 barrier 同步 → 必須靠 atomic;output 須先初始化為 identity
Thread coarsening	把工作切到太多 block,每個 block 都重複付出 hardware underutilization + sync + shared memory 的成本	每個 thread 先序列地加 `COARSE_FACTOR*2` 個元素(全 thread 活躍、免同步),再進 reduction tree	coarsening 太大 → block 數少於硬體可同時跑的量 → 浪費平行度
章節總結	從 sequential 到高效大規模 reduction	divergence 減少 + shared memory + segmented atomic + coarsening 缺一不可	是 Chapter 11 Scan 的基礎

Important

本篇承接 10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel 的 shared memory kernel (Fig 10.11),把它從「單一 block / 固定長度」擴展成「任意長度 / 多 block」並降低 overhead。數學定義、reduction tree 與 divergence 分析見前兩篇。

階層式分段多 Block Reduction (Hierarchical Segmented Multiblock Reduction)

為何單 block 不夠

之前所有 kernel 都假設 launch 一個 block,因為用 __syncthreads() 當 barrier。
__syncthreads() 只能同步同一個 block 內的 threads → 平行度上限 = 1024 threads = 2048 元素。
百萬/十億級輸入需要更多 thread → 必須讓不同 block 獨立執行(因為跨 block 沒有好的 barrier)。

分段策略

把輸入切成多個 segment,每段大小 = 2 * blockDim.x(一個 block 處理的量)。
每個 block 在自己的 segment 上跑「整棵」reduction tree,就好像 segment 是整個輸入。
每個 block 最終的 partial sum (input_s[0]) 用 atomicAdd 累加到單一 output。

segment size = 2 * blockDim.x   (blockDim.x = 1024 → 2048)

global input:  [ ---- block 0 ---- ][ ---- block 1 ---- ][ ---- block 2 ---- ] ...
segment start:  0                    2048                 4096
                 \  reduce tree /     \  reduce tree /     \  reduce tree /
                  input_s[0]            input_s[0]            input_s[0]
                       \                    |                    /
                        \____ atomicAdd ___ output ____________/

Kernel (Fig 10.13)

__global__ void SegmentedSumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];               // BLOCK_DIM == blockDim.x
    unsigned int segment = 2 * blockDim.x * blockIdx.x; // 此 block 的段起點 (line 03)
    unsigned int i = segment + threadIdx.x;             // thread 在全域陣列的擁有位置 (line 04)
    unsigned int t = threadIdx.x;                       // thread 在 shared 陣列的位置
    input_s[t] = input[i] + input[i + BLOCK_DIM];       // 第一步:用全域 index i 載入並相加 (line 06)
    for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];          // reduction tree (與 Fig 10.11 完全相同)
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);                  // 把段的 partial sum 原子累加 (line 14)
    }
}

關鍵差異 vs 單 block 版:(1) 用 i (全域、含 segment 位移) 讀 global memory;(2) tree 迴圈內仍用 t (因每個 block 有自己私有的 input_s);(3) 結尾改成 atomicAdd 而非直接寫。
i 是「全域擁有位置」,t 是「shared 內擁有位置」── 兩者分工是這支 kernel 的核心。

output 必須先初始化為 identity

因為多個 block 用 atomicAdd 累加到同一個 output,host 端必須在 launch 前把 output 設為運算子的 identity (sum → 0.0)。換成 max/min reduction 時,用對應的 atomic (如 atomicMax) 並初始化為 -∞/+∞。

為何 atomic 在這裡便宜

整個 grid 只有「每個 block 一次」atomic(共 N/2048 次),而非每個 thread 一次。絕大部分工作都在各 block 的 tree 內完成,atomic 只負責最後跨 block 的少量合併 → 競爭極低。對照 09-Parallel-Histogram/01-Atomic-Operations-and-Basic-Histogram 的高競爭情境。

公式	說明
segment 大小	`2 * blockDim.x`
block 段起點	`2 * blockDim.x * blockIdx.x`
block 數 (無 coarsening)	`N / (2 * blockDim.x) = N/2048` (blockDim.x=1024)
atomic 次數	= block 數 = `N/2048`

執行緒粗化降低 Overhead (Thread Coarsening for Reduced Overhead)

動機:我們為平行付了「重稅」

Reduction tree 每往上一層,就有更多 warp 閒置、最後一個 warp control divergence 更嚴重 → hardware underutilization。
這個「逐漸閒置」的階段,每一個 launch 的 block 都要付一次。
若硬體資源不足,多餘的 block 本來就會被硬體序列化執行(跑完一個換下一個)。
既然要被序列化,不如自己用更有效率的方式序列化 → 這就是 thread coarsening(把工作塞進較少 thread 以減少平行化 overhead)。

做法:每個 block 多吃資料,前段序列相加

把 block 粗化 COARSE_FACTOR 倍 → 每個 block 拿 COARSE_FACTOR * 2 * blockDim.x 個元素。
例:原本每 block 16 元素(每 thread 2 個);coarsen ×2 → 每 block 32 元素(每 thread 4 個)。
每個 thread 獨立、序列地把它負責的 COARSE_FACTOR*2 個元素加進 register sum:
- 此階段全部 thread 都活躍(硬體全利用)。
- 不需要 __syncthreads()、不需要寫 shared memory(thread 彼此獨立)。
加總完才把 sum 寫進 input_s[t],接著進入和之前完全相同的 reduction tree。

COARSE_FACTOR = 2, blockDim.x = 8, 32 elements/block, 每 thread 4 元素

序列粗化階段 (全 thread 活躍, 無 sync, 無 shared):
 step 1   sum  = input[i]
 step 2   sum += input[i + 1*BLOCK_DIM]
 step 3   sum += input[i + 2*BLOCK_DIM]
          sum += input[i + 3*BLOCK_DIM]   (寫入 input_s[t])
            |  全 8 threads 活躍 → 硬體全利用
            v
reduction tree 階段 (每步半數 thread drop, 需 sync + shared):
 step 4   stride=4  ▓▓▓▓░░░░
 step 5   stride=2  ▓▓░░
 step 6   stride=1  ▓░

Kernel (Fig 10.15)

__global__ void CoarsenedSumReductionKernel(float* input, float* output) {
    __shared__ float input_s[BLOCK_DIM];
    // 段更大 COARSE_FACTOR 倍 (line 03)
    unsigned int segment = COARSE_FACTOR * 2 * blockDim.x * blockIdx.x;
    unsigned int i = segment + threadIdx.x;
    unsigned int t = threadIdx.x;
    float sum = input[i];                               // 粗化迴圈:序列相加 (lines 06-09)
    for (unsigned int tile = 1; tile < COARSE_FACTOR * 2; ++tile) {
        sum += input[i + tile * BLOCK_DIM];             // 全 thread 活躍、無 __syncthreads()
    }
    input_s[t] = sum;                                   // 加總完才入 shared memory
    for (unsigned int stride = blockDim.x/2; stride >= 1; stride /= 2) {
        __syncthreads();
        if (t < stride) {
            input_s[t] += input_s[t + stride];
        }
    }
    if (t == 0) {
        atomicAdd(output, input_s[0]);
    }
}

為何省:2 序列 block vs 1 coarsened block (Fig 10.16)

比較項 (blockDim.x=8, 共 32 元素)	(A) 2 個 block 被硬體序列化	(B) 1 個 coarsen ×2 的 block
全利用步數 (full util, 無 sync/shared)	2 步	3 步
underutil 步數 (需 sync + shared)	6 步	3 步
總步數	8 步	6 步
barrier / shared memory 存取	6 次	3 次

coarsening 同時砍掉三種 overhead:hardware underutilization、barrier synchronization、shared memory access。
全利用階段在 register 上做、且無需同步 → 比 tree 階段便宜得多。

coarsening factor 不是越大越好

粗化越多 → 平行做的事越少。若 COARSE_FACTOR 太大,使得 launch 的 block 數少於硬體能同時執行的量,就會浪費平行硬體資源。
最佳 factor:剛好讓 block 數足以填滿硬體 → 取決於輸入總大小與特定裝置特性(SM 數、可同時駐留的 block 數)。

公式	說明
每 block 元素數	`COARSE_FACTOR * 2 * blockDim.x`
每 thread 序列加的元素數	`COARSE_FACTOR * 2`
block 數 (coarsened)	`N / (COARSE_FACTOR * 2 * blockDim.x)`
段起點	`COARSE_FACTOR * 2 * blockDim.x * blockIdx.x`

本章總結 (Chapter Summary)

平行 reduction 看似簡單(sequential 版只是一個 for-loop),但要對大型輸入達到高效能,需要層層疊加的技巧:

技巧	解決的瓶頸	對應章節
Thread index assignment (stride 由大到小)	control divergence + memory divergence	10.4 / 10.5 → 10-Reduction/02-Optimizing-Single-Block-Reduction-Kernel
Shared memory 暫存 partial sum	global memory access (降到 `N+1`)	10.6 → 同上
Segmented reduction + atomic	任意長度 / 跨 block 平行	10.7 (本篇)
Thread coarsening	underutilization + sync + shared overhead	10.8 (本篇)

Reduction 是許多資料彙整應用的核心,也是 prefix-sum (scan) 的重要基礎 → 延伸至 11-Prefix-Sum-Scan/01-Scan-Foundations-and-Kogge-Stone。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
為何單 block reduction 上限 2048 元素?	`__syncthreads()` 只能同步同一 block,block 上限 1024 threads,每 thread 載 2 元素 → 2048
跨 block 如何合併 partial sum?	每 block 算完 `input_s[0]`,由 thread 0 用 `atomicAdd(output, ...)` 累加;跨 block 不能 barrier
segment 起點公式?	無 coarsening: `2blockDim.xblockIdx.x`;coarsened: `COARSE_FACTOR2blockDim.x*blockIdx.x`
用 atomic 前忘了做什麼會錯?	host 未把 `output` 初始化為 identity (sum→0.0);多 block atomic 會把舊值一起累加
為何 atomic 在此不致命?	只有 `N/2048` 次 atomic(每 block 一次),競爭極低,主要工作在 block 內 tree 完成
thread coarsening 省下哪三種 overhead?	hardware underutilization、barrier synchronization、shared memory access
coarsening 階段為何不用 `__syncthreads()`?	每個 thread 各自序列累加自己負責的元素到 register,thread 間互相獨立,且全 thread 活躍
coarsening factor 太大的後果?	block 數 < 硬體可同時跑的量 → 浪費平行硬體;最佳值取決於輸入大小與裝置
Fig 10.16 數字 (factor 2)	2 序列 block = 8 步 (2 full / 6 underutil);1 coarsened block = 6 步 (3 full / 3 underutil)
coarsening 後每 thread 加幾個元素?	`COARSE_FACTOR * 2`