平行直方圖 (Parallel Histogram) 練習題 (Practice - Atomic Operations and the Basic Histogram Kernel)

#practice #parallel-pattern #histogram #atomic-operations

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案重點
為何不能用 owner-computes rule	輸出位置 data-dependent，多 thread 可能更新同一 bin → output interference
read-modify-write race condition	兩個以上同位址更新序列交錯 (overlap)，最終值隨時序變 → 可能 lost update
`atomicAdd(address, val)` 語意	RMW 不可分割；回傳舊值 (old value)；intrinsic → 單一硬體指令
atomic 是否保證執行順序	否，只保證互斥/序列化；trailing thread 等 leading 完成，但誰先誰後任意
單一位址 atomic 最大吞吐	≈ 1 / (2 × memory latency)；e.g. 100 ns → 1/(200 ns) = 5 M atomics/s
為何 atomic 吞吐遠低於一般存取	同位址無法 overlap，週期 ≈ load + store latency，延遲主導
L2 / last-level-cache atomics 加速	latency 數百 → 數十 cycle；熱門變數常駐 cache → 吞吐升一個數量級
Privatization 降 contention 倍數	≈ 同時 active 的 block 數 (across all SMs)
Global private histogram 記憶體	`gridDim.x × NUM_BINS × 4` bytes
為何 private copy 選 per-block	可用 `__syncthreads()` 合併 + 可放 shared memory(block 間互不可見)
Shared memory 為何提升 atomic 吞吐	latency 僅數 cycle，latency↓ 直接 → throughput↑
何時做 thread coarsening	launch block 數 > 硬體可同時執行數(否則 block 序列化，commit 開銷白付)
Contiguous vs Interleaved (GPU)	Interleaved：warp 相鄰 thread 取相鄰位址 → coalescing；contiguous 適合 CPU
Interleaved 步長	`i += blockDim.x * gridDim.x`(全 grid 總 thread 數)
Aggregation 適用 / 收尾	skewed 資料(單 bin 重度 contention)；迴圈後須再 flush(更新永遠落後一格)

Question 1 - Owner-computes rule 為何失效 [recall]

情境：前幾章的 pattern(如 vector add、convolution)都遵守 owner-computes rule。為什麼 histogram 無法套用此規則？此現象稱為什麼？

查看答案

因為每個 thread 的 輸出位置是 data-dependent(由輸入值決定 histo[(data[i]-'a')/4]),多個 thread 可能更新 同一個 bin,無法把每個 output element 專屬給某個 thread。這種多 thread 寫同一輸出位置稱為 output interference,是 race condition 的根源。

Question 2 - atomicAdd 的語意與回傳值 [recall]

情境：CUDA kernel 呼叫 atomicAdd(&histo[k], 1)。請說明它對記憶體做了什麼、回傳什麼、以及為何沒有函式呼叫開銷。

查看答案

它讀取 address 指向的 32-bit word(global 或 shared memory),加上 val,把結果寫回同一位址,整個 read-modify-write 形成 不可分割 (undividable) 單位,並 回傳該位址的舊值 (old value)。atomicAdd 是 intrinsic function,編譯成單一硬體 atomic 指令,程式中沒有真正的函式呼叫。

Question 3 - Read-modify-write race condition [recall]

情境：兩個 thread 都對初值為 0 的 histo[x] 各做一次 histo[x]++(非 atomic)。畫出/說明一種會得到錯誤結果 1(而非 2)的交錯,並指出問題本質。

查看答案

當兩 thread 的 RMW overlap:T1 R(0)→T2 R(0)(此時 T1 還沒寫回)→T1 W(1)→T2 W(1),最終 histo[x]=1,T1 的更新被覆蓋(lost update)。問題本質:histo[x]++ 是 read→+1→write 三步驟,可被其他 thread 交錯;最終值隨相對時序變動即為 read-modify-write race condition。

Question 4 - 單一位址 atomic 的吞吐瓶頸 [recall]

情境：許多 thread 同時對 同一個 histo bin 做 atomicAdd。為什麼「越多並行存取 → 越高吞吐」的常規在此失效？單一位址的吞吐約等於多少？

查看答案

同一位址的 atomic 無法 overlap:trailing thread 的 RMW 必須等 leading thread 的 RMW 完成才能開始(Fig. 9.7),一次只有一個進行中。每個 atomic 約佔 load latency + store latency ≈ 2 × memory latency,故單一位址吞吐 ≈ 1 / (2 × latency)。此時 延遲(非頻寬)主導,且會 dominate kernel 執行時間。

Question 5 - Last-level cache atomics 為何加速 [recall]

情境：現代 GPU 允許在所有 SM 共享的 last-level cache (L2) 執行 atomic。為什麼這能大幅提升 atomic 吞吐?

查看答案

因為 throughput ≈ 1/(2×latency),降低 latency 直接提升吞吐。L2 存取約 數十 cycle(vs DRAM 數百 cycle)。高度競爭的變數被多 thread 頻繁存取,一旦載入便 常駐 L2(命中即直接在 cache 更新),使 atomic 吞吐相較早期 GPU 至少提升一個數量級。

Question 6 - Privatization 的概念與記憶體需求 [recall]

情境：解釋 privatization 如何降低 atomic contention,常見以什麼為單位複製,以及 global-memory 版(Fig. 9.9)host 端要配置多少記憶體、每個 thread 用什麼 offset。

查看答案

把高度競爭的 histogram 複製成多份 private copies,讓一群 thread 各更新自己那份;通常以 每個 thread block 一份 為單位。Contention 約降低 同時 active 的 block 數 倍。Global 版須配置 gridDim.x × NUM_BINS × 4 bytes,thread 以 offset blockIdx.x * NUM_BINS 指向自己 block 那份;最後每 block 把私有值 commit(atomicAdd)回 block 0 的版本。

Question 7 - 為何 per-block + 放 shared memory [recall]

情境：為什麼把 private histogram 建在「每個 thread block 一份」特別有利?把它放進 shared memory(Fig. 9.10)又帶來什麼好處與前提?

查看答案

Per-block 的好處:(1) block 內可用 __syncthreads() 等彼此更新完再 commit(若跨 block 共用就得另開 kernel 或更複雜手法);(2) 若 bins 夠少,private copy 可宣告在 shared memory(block 間看不到彼此的 shared memory,所以必須 per-block)。Shared memory latency 僅 數 cycle,直接轉成更高的 atomic throughput。前提是 bins 數夠小 能塞進 shared memory,否則退回 global 版。

Question 8 - Interleaved partitioning 的步長與 coalescing [recall]

情境：在 coarsened kernel(Fig. 9.14)中,interleaved partitioning 的迴圈索引如何前進?為什麼這在 GPU 上比 contiguous 好?

查看答案

每個 thread 從 i = tid(global thread index)開始,每輪 i += blockDim.x * gridDim.x(全 grid 總 thread 數)。同一輪內,相鄰 thread 存取 相鄰位址,一次 DRAM burst 即可 coalesced 取回,充分利用記憶體頻寬;contiguous 則讓 warp 內位址相隔 CFACTOR,non-coalesced。

Question 9 - Exercise 1：atomic 最大吞吐計算 [application]

情境：假設 DRAM 系統中每個 atomic operation 的總延遲為 100 ns。對 同一個 global memory 變數做 atomic,最大吞吐是多少?

查看答案

同一位址的 atomic 不能 overlap,故每 100 ns 最多完成一個:throughput = 1 / 100 ns = 10 M atomics/s。
(若題目把 100 ns 視為單次 load 或 store 的 latency,則一個 atomic ≈ read+write = 200 ns → 1/(2×latency) = 5 M atomics/s;按 Ch.9 公式 throughput ≈ 1/(2×latency)。)

Question 10 - Exercise 5：正確的 atomicAdd 寫法 [application]

情境：要把整數變數 Partial 的值 atomic 加到 global memory 整數變數 Total,下列何者正確? (a) atomicAdd(Total, 1) (b) atomicAdd(&Total, &Partial) (c) atomicAdd(Total, &Partial) (d) atomicAdd(&Total, Partial)

查看答案

(d) atomicAdd(&Total, Partial);。第一參數是要更新位置的位址(&Total),第二參數是要加上的值(Partial,非位址)。

Question 11 - Exercise 6：各 kernel 的 global atomic 次數 [application]

情境：輸入 524,288 個元素,histogram 128 bins,1024 threads/block。分別求 (a) Fig. 9.6 basic kernel、(b) Fig. 9.10 privatization+shared、(c) Fig. 9.14 +coarsening(factor 4) 在 global memory 上最多做幾次 atomic?

查看答案

(a) 524,288(每元素一次 global atomic,無 privatization)。
(b) blocks = 524288/1024 = 512;histogram 階段 atomic 在 shared memory,只有 commit 階段打 global,每 block 至多 128 bins → 512 × 128 = 65,536。
(c) blocks = 524288/(1024×4) = 128 → 128 × 128 = 16,384。

Question 12 - Contiguous vs Interleaved partitioning [analysis]

情境：同樣是 thread coarsening,為什麼 contiguous partitioning 在 CPU 上常是最佳,卻在 GPU 上次佳?從 cache 與 coalescing 角度比較兩種分法。

查看答案

CPU:thread 少、每個 cache 只服務少量 thread,contiguous 的循序存取能重用 cache line 且互不干擾 → 最佳。GPU:SM 內大量同時 active thread 互相干擾 cache,單一 thread 的連續資料無法保留在 cache;且 contiguous 讓 warp 內位址相隔 CFACTOR → non-coalesced,浪費 DRAM 頻寬。Interleaved 讓 warp 相鄰 thread 取相鄰位址 → memory coalescing,一次 burst 服務整個 warp,故為 GPU 首選。

Question 13 - Aggregation vs simple kernel 的取捨 [analysis]

情境：aggregation kernel(Fig. 9.15)在何種資料下能顯著加速、何種下反而更慢?它新增的 if 是否一定造成嚴重 control divergence?

查看答案

Aggregation 把連續、同一 bin 的更新先在 register 累加,bin 變了才 atomic flush → 減少對熱門 bin 的 atomic 次數。高 contention(skewed,如天空大片同色) 時顯著加速;低 contention 時因多了變數與指令,可能比 simple kernel 更慢。if 的 divergence 在兩個極端都小(全部 streak 或全部 flush);中間情況的 divergence 通常被 降低的 contention 補償 掉。

重點整理 (Pattern Summary，點擊展開)

主題	核心結論
Output interference	輸出位置 data-dependent,多 thread 寫同一 bin → 不能用 owner-computes rule
Race condition	RMW(read→+1→write)交錯 → 最終值隨時序變,可能 lost update
atomicAdd	RMW 不可分割、回傳舊值、intrinsic;只保證互斥不保證順序
Atomic 吞吐	同位址 ≈ 1/(2×latency);延遲主導,遠低於一般記憶體存取
L2 atomics	在 last-level cache 做 atomic,latency 數百→數十 cycle,吞吐升一個數量級
Privatization	每 block 一份 private copy,contention 降 ≈ active block 數;global 需 `gridDim.x×NUM_BINS×4` bytes
Shared memory 版	latency 數 cycle → 吞吐再升;前提 bins 夠少;per-block 才能用 `__syncthreads()` + shared memory
Coarsening	block 數 > 硬體可同時執行數時,減 block 數降 commit 開銷;GPU 用 interleaved 取得 coalescing
Aggregation	skewed 資料先在 register 累加同 bin 更新,bin 變才 flush;迴圈後須收尾 flush

Related Concepts

Question 1 - Owner-computes rule 為何失效 [recall]

Question 2 - atomicAdd 的語意與回傳值 [recall]

Question 3 - Read-modify-write race condition [recall]

Question 4 - 單一位址 atomic 的吞吐瓶頸 [recall]

Question 5 - Last-level cache atomics 為何加速 [recall]

Question 6 - Privatization 的概念與記憶體需求 [recall]

Question 7 - 為何 per-block + 放 shared memory [recall]

Question 8 - Interleaved partitioning 的步長與 coalescing [recall]

Question 9 - Exercise 1：atomic 最大吞吐計算 [application]

Question 10 - Exercise 5：正確的 atomicAdd 寫法 [application]

Question 11 - Exercise 6：各 kernel 的 global atomic 次數 [application]

Question 12 - Contiguous vs Interleaved partitioning [analysis]

Question 13 - Aggregation vs simple kernel 的取捨 [analysis]