直方圖優化：Privatization、Coarsening 與 Aggregation (Histogram Optimizations)

#parallel-pattern #histogram #privatization #thread-coarsening #aggregation

重點總覽 (Overview)

承接 09-Parallel-Histogram/01-Atomic-Operations-and-Basic-Histogram：basic kernel 用 atomicAdd 序列化更新，throughput 受限於 ~1/(2×memory latency)。本筆記三個技術依序「降低衝突 → 降低延遲 → 降低 commit 開銷 → 降低同位置更新次數」。

技術	核心想法	解決什麼	主要成本	適用時機
Privatization (global)	每個 block 一份私有 histogram (在 global memory)	把 contention 從「全部 thread」縮到「同 block」+ commit 階段	需 `gridDim.x × NUM_BINS × 4` bytes + commit	永遠值得；private copy 靠 L2 cache 加速
Privatization (shared)	私有 copy 放 shared memory	atomic latency 從數百 cycle → 數 cycle	bins 數須夠小放進 shared memory	bins 少 (例: NUM_BINS ≤ 上限)
Thread Coarsening	減少 block 數，每 thread 處理多個元素	block 被序列化時不必要的 commit 開銷	平行度下降 (block 太少)	launch block 數 > 硬體可同時執行數
└ Contiguous partition	thread 取連續區段	CPU 友善	GPU 上 non-coalesced	CPU / 少量 thread
└ Interleaved partition	warp 內相鄰 thread 取相鄰位址	memory coalescing	索引計算稍複雜	GPU 首選
Aggregation	連續相同 bin 的更新先在 register 累加，bin 變了才 flush	skewed data 對單一 bin 的重度 contention	多用 register/指令，可能 control divergence	資料高度集中 (如天空大片同色)

Important

四步遞進都是為了同一目標:讓「對同一記憶體位置的 atomic 序列化」變短或變少。Privatization 減少同位置的競爭者；shared memory 縮短每次 atomic 的延遲;coarsening 減少要 merge 的份數;aggregation 直接合併同位置的多次更新。

Privatization(私有化)

想法:把高度競爭的輸出資料結構複製成多份 private copies,讓一群 thread 各更新自己那份,最後再 merge 回去。

Contention 降低倍數 ≈ 同時 active 的 block 數(across all SMs)。
在大規模平行系統,privatization 以 thread subset(通常一個 block) 為單位,而非每個 thread 一份(否則 merge 成本爆炸)。

                       不做 privatization                每 block 一份 private copy
   Block0  Block1 ...  ──┐                          Block0→[histo_0]  Block1→[histo_1] ...
   全部 thread 都 atomic  ├─► [ 單一 histo ] (重度衝突)        ↑只與同block衝突       ↑
                       ──┘                                   └──── commit ───► [histo_0 當公開版]

為何選「每個 thread block 一份」?

Tip

block 內可用 __syncthreads() 等待彼此後再 commit;若 private copy 被多個 block 共用,就得另開 kernel 或更複雜手法來 merge。
若 bins 數夠小,private copy 可宣告在 shared memory(block 之間看不到彼此 shared memory,所以必須 per-block)。

版本 A:Private copy 在 Global Memory (Fig. 9.9)

Host 端須配置 gridDim.x × NUM_BINS × 4 bytes;每 thread 加上 blockIdx.x*NUM_BINS 的 offset 指向自己 block 那份。

__global__ void histo_private_kernel(char* data, unsigned int length,
                                     unsigned int* histo) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < length) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo[blockIdx.x*NUM_BINS + pos/4], 1);   // 更新私有版
    }
    if (blockIdx.x > 0) {                 // block 0 那份直接當公開版
        __syncthreads();                  // 等本 block 更新完
        for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
            unsigned int v = histo[blockIdx.x*NUM_BINS + bin];
            if (v > 0) atomicAdd(&histo[bin], v);   // commit 到 block 0 的版本
        }
    }
}

Commit 階段每個 bin 只有「每 block 一個 thread」競爭 → contention 極低。
最終結果在 histo[0 .. NUM_BINS-1](即 block 0 那份)。

版本 B:Private copy 在 Shared Memory (Fig. 9.10)

Important

任何 latency 的降低都「直接」轉成同位置 atomic 的 throughput 提升。Shared memory 是 per-SM、latency 僅數 cycle,因此把 private histogram 放 shared memory 可大幅提升 atomic throughput。這是 shared memory 的重要使用情境之一。

__global__ void histo_private_kernel(char* data, unsigned int length,
                                     unsigned int* histo) {
    __shared__ unsigned int histo_s[NUM_BINS];
    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x)
        histo_s[bin] = 0u;                // 平行初始化
    __syncthreads();                      // 確保全清零後才開始更新

    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < length) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26)
            atomicAdd(&histo_s[pos/4], 1);   // atomic 作用在 shared memory
    }
    __syncthreads();

    for (unsigned int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
        unsigned int v = histo_s[bin];
        if (v > 0) atomicAdd(&histo[bin], v);   // commit 回 global
    }
}

Warning

Shared-memory privatization 的前提是 bins 數量夠小 能塞進 shared memory。bins 太多(如數萬)時只能退回 global-memory privatization(版本 A)。

Coarsening(執行緒粗化)

動機:privatization 的開銷是「每 block commit 一次」。launch 的 block 越多,commit 開銷越大。

Warning

一般原則:block 平行執行時,commit 開銷值得付。例外:若 launch 的 block 數 > 硬體能同時執行的數量,硬體會把這些 block 序列化,此時為「不會同時跑」的 block 各做一次 commit 是白白浪費。

解法:減少 block 數、讓每個 thread 處理多個輸入元素(coarsening factor = CFACTOR),把 private copy 份數降下來。分配方式有兩種。

Contiguous Partitioning (Fig. 9.12)

每個 thread 取一段連續元素:i 從 tid*CFACTOR 到 min((tid+1)*CFACTOR, length)。

    // ... shared histo_s 初始化同 Fig. 9.10 ...
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (unsigned int i = tid*CFACTOR;
         i < min((tid+1)*CFACTOR, length); ++i) {       // 連續區段
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26) atomicAdd(&histo_s[pos/4], 1);
    }
    __syncthreads();
    // ... commit 同 Fig. 9.10 ...

4 thread、CFACTOR=4 (連續切):
data:  | p r o g | r a m m | i n g _ | m a s s |
          T0        T1        T2        T3
iter0:    p         r         i         m   ← warp 內位址相隔 CFACTOR → NON-coalesced
iter1:    r         a         n         a

Tip

Contiguous 在 CPU 上最佳:thread 少、每個 cache 只服務少量 thread,連續存取能充分重用 cache line。

Warning

Contiguous 在 GPU 上次佳:SM 內大量 thread 互相干擾 cache,單一 thread 的連續存取無法保留在 cache。應讓 warp 內 thread 存取相鄰位址以達成 coalescing → 改用 interleaved。

Interleaved Partitioning (Fig. 9.14)

i 從 tid 出發,每次跳 blockDim.x*gridDim.x(全 grid 總 thread 數)。

    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (unsigned int i = tid; i < length; i += blockDim.x*gridDim.x) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26) atomicAdd(&histo_s[pos/4], 1);
    }

8 thread、交錯切:
data:  p r o g r a m m i n g _ m a s s ...
iter0: T0 T1 T2 T3 T4 T5 T6 T7  ← 相鄰位址,一次 DRAM 存取即 coalesced
iter1:                         T0 T1 ... (整體 +blockDim*gridDim)

第 i 輪所有 thread 共同處理一段 blockDim.x*gridDim.x 元素;位址連續 → memory coalescing,一次 DRAM burst 取回。
輸入長度不一定是總 thread 數倍數 → 尾端某些 thread 少跑一輪,屬正常。

Tip

更細緻的考量:每個 thread 每輪最好處理 4 個 char(一個 32-bit word),以充分利用 cache 與 SM 之間的 interconnect 頻寬。

比較項	Contiguous	Interleaved
同一輪 warp 內位址	相隔 `CFACTOR`(strided)	相鄰(連續)
Memory coalescing	否	是
最佳平台	CPU(少 thread,重用 cache line)	GPU
索引公式	`tidCFACTOR .. (tid+1)CFACTOR`	`tid; i += blockDim*gridDim`

Aggregation(更新聚合)

動機:某些資料局部高度集中相同值(如天空大片同色 pixel),導致對單一 bin 的重度 contention。

想法 (Merrill, 2015):每個 thread 把連續、針對同一 bin 的更新先在 register 累加,等 bin 變了才用一次 atomic flush 出去 → 大幅減少對熱門 bin 的 atomic 次數。

    // ... shared histo_s 初始化 ...
    unsigned int accumulator = 0;          // 已聚合的更新數
    int prevBinIdx = -1;                    // 上一個聚合中的 bin(-1 不會 match)
    unsigned int tid = blockIdx.x*blockDim.x + threadIdx.x;
    for (unsigned int i = tid; i < length; i += blockDim.x*gridDim.x) {
        int pos = data[i] - 'a';
        if (pos >= 0 && pos < 26) {
            int bin = pos/4;
            if (bin == prevBinIdx) {        // 同 bin → 連勝(streak),只累加
                ++accumulator;
            } else {                        // 換 bin → 結束 streak,flush 舊的
                if (accumulator > 0) atomicAdd(&histo_s[prevBinIdx], accumulator);
                accumulator = 1;
                prevBinIdx   = bin;
            }
        }
    }
    if (accumulator > 0) atomicAdd(&histo_s[prevBinIdx], accumulator); // 收尾 flush
    __syncthreads();
    // ... commit 回 global ...

Important

更新「永遠落後一格」:目前的 streak 還沒 flush。所以迴圈結束後必須檢查 accumulator 並做最後一次 flush(否則最後一段 streak 會遺失)。

Warning

Aggregation 不是萬靈丹:多了變數與 if 判斷。

低 contention 時,aggregated kernel 可能比 simple kernel 更慢。
高 contention 時,可顯著加速。
那個 if 可能造成 control divergence;但「完全無 contention」或「重度 contention」兩種極端下,thread 多半同時 flush 或同時 streak,divergence 很小;中間情況的 divergence 通常被降低的 contention 補償掉。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
Privatization 降低 contention 的倍數	≈ 同時 active 的 block 數(across all SMs)
為何 private copy 選 per-block	可用 `__syncthreads()` 合併 + 可放 shared memory(blocks 看不到彼此 shared mem)
Global private histogram 需多少記憶體	`gridDim.x × NUM_BINS × 4` bytes
Shared-memory privatization 的限制	bins 須夠小才放得進 shared memory;否則退回 global 版
為何 shared memory 能提升 atomic throughput	latency 數 cycle(vs DRAM 數百 cycle),latency↓ 直接 → throughput↑
何時該做 coarsening	launch block 數 > 硬體可同時執行數(否則 block 被序列化,commit 開銷白付)
GPU 上 contiguous vs interleaved 該選哪個	Interleaved(warp 相鄰 thread 取相鄰位址 → coalescing);contiguous 適合 CPU
Interleaved 步長	`i += blockDim.x * gridDim.x`(全 grid 總 thread 數)
Aggregation 適用資料	局部高度集中相同值(skewed),對單一 bin 重度 contention
Aggregation 收尾為何要再 flush	更新永遠落後一格,最後一段 streak 仍在 accumulator 未寫出
Aggregation 何時反而變慢	低 contention(多餘指令/變數)+ 可能 control divergence
Fig.9.6 basic:N=524288 的 global atomic 數	524288(無 privatization,每元素一次 global atomic)
Fig.9.10 priv+shared(1024 t/block):global atomic 上限	blocks=524288/1024=512,每 block commit ≤128 bins → 512×128 = 65536(histogram 階段的 atomic 在 shared,不算 global)
Fig.9.14 +coarsening factor 4:global atomic 上限	blocks=524288/(1024×4)=128 → 128×128 = 16384

重點總覽 (Overview)

Privatization(私有化)

為何選「每個 thread block 一份」?

版本 A:Private copy 在 Global Memory (Fig. 9.9)

版本 B:Private copy 在 Shared Memory (Fig. 9.10)

Coarsening(執行緒粗化)

Contiguous Partitioning (Fig. 9.12)

Interleaved Partitioning (Fig. 9.14)

Aggregation(更新聚合)

考試/面試重點 (Exam / Test Patterns)

Related Notes