靜電位圖計算練習題 (Practice - DCS Algorithm and Scatter vs Gather Kernel Design)

#practice #application-case-study #electrostatic-potential #cutoff-binning

18-Electrostatic-Potential-Map/01-DCS-Scatter-vs-Gather — DCS 演算法與 Scatter vs Gather 核心設計 (Direct Coulomb Summation)
18-Electrostatic-Potential-Map/02-Thread-Coarsening-and-Memory-Coalescing — Thread Coarsening 與記憶體合併存取 (Memory Coalescing)
18-Electrostatic-Potential-Map/03-Cutoff-Binning-for-Scalability — Cutoff Binning 與資料規模擴充性 (Data Size Scalability)

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 觸發點	答案要點
DCS 定義	每個 grid point = 所有 atom 貢獻和，貢獻 = `charge / r_ij`
DCS 複雜度	$O (N_{a t o m s} \times N_{g r i d})$ ；atoms、grid 皆 ∝ 體積 → O(V²) quadratic，不可擴充
atom 表示	`atoms[]` 連續 4 個 float：`{x, y, z, charge}`
loop interchange 合法性	三層迴圈 perfectly nested 且各迭代獨立
交換解鎖什麼	提取 loop-invariant：`dz/dz2` (slice 共用)、`dy/dy2` (row 共用) 移出內層
最佳化序列碼 = 什麼形態	scatter：atom 在最外層，`+=` 散播到所有 grid point
scatter kernel	1 thread = 1 atom；需 `atomicAdd` → 序列化、慢
gather kernel	1 thread = 1 grid point；各寫各的、免 atomic；建立在未最佳化 Fig.18.3
gather arithmetic intensity	9 FLOP / 4 memory access = 2.25；constant cache broadcast → 非 bandwidth-bound
CHUNK_SIZE 限制	`CHUNK_SIZE*4 ≤ 64K` (constant memory 容量)；host 分 chunk 載入
coarsening 重用什麼	`dy`、`dz`、`dysqdzsq = dy²+dz²`、`charge` 取一次存 register，多點共用
coarsening 量化 (1 atom×4 點)	constant 存取 16→4 (4×)、FP ops 48→24 (2×)
Fig.18.8 為何寫回未合併	連續指派 → 相鄰 thread 同一寫回敘述位址相隔 `COARSEN_FACTOR`(=4)，warp 內 strided
coalescing 修法	interleaved：`i = blockIdx.xblockDim.xCF + threadIdx.x`，步距改 `blockDim.x`
coalescing 改了多少運算	0——只重排 index，免費加速
cutoff summation	只收半徑內 atom，複雜度 O(V²)→O(V)；遠端暗原子用 implicit method 補
為何 cutoff 用 grid-centric	atom-centric 是 scatter (需 atomic)；grid-centric gather 各寫各的
supercircle 半徑	cutoff + ½ × bin 對角線
neighborhood 範例	9 full + 12 partial = 21 bins；以 relative offset 放 constant memory
control divergence 來源	bin 內原子不一定落在每 thread 的 cutoff 圈 → 各 thread `if` 判斷不同
cutoff 為何改用 global+shared	不同 block 存取不同 neighborhood，constant memory 裝不下 → block 協同載入 shared mem
bins 大小不一處理	固定 size 填 dummy atoms (charge 0)；較佳設小 size + overflow list 由 host 收尾
overflow 隱藏延遲	subvolume 分段 → device 跑下一 kernel 時 host 處理上一批 overflow (<3% 可全隱藏)

Question 1 - DCS 方法與複雜度 [recall]

說明 Direct Coulomb Summation (DCS) 如何計算單一 grid point 的靜電位，並寫出其計算複雜度與為何「隨體積平方成長」。

查看答案

DCS 把每個 grid point 的位能算成系統中所有 atom 的貢獻和： $potential [j] = \sum_{i} \frac{atom [i] . charge}{r_{i j}}$ ，其中 $r_{i j} = \sqrt{(x_{j} - x_{i})^{2} + (y_{j} - y_{i})^{2} + (z_{j} - z_{i})^{2}}$ 。計算次數 $\propto N_{a t o m s} \times N_{g r i d}$ 。放大模擬體積時 atoms 和 grid points 都正比於體積成長，故總運算量 $\propto V \times V = V^{2}$ (quadratic)，對真實生物系統不可擴充。每個 atom 以 atoms[] 中連續 4 個 float {x, y, z, charge} 表示。

Question 2 - Loop Interchange 合法性與解鎖的最佳化 [recall]

Fig.18.4 把 atom 迴圈交換到最外層。(a) 為何此 loop interchange 合法？(b) 交換後解鎖了哪些 loop-invariant 計算的提取？

查看答案

(a) 合法：Fig.18.3 的三層迴圈是 perfectly nested 且各迭代彼此獨立（同 FHD kernel 的論證）。
(b) atom 移到最外層後可提取 loop-invariant：同一 slice 所有 grid point 共用 z → dz、dz2 在 slice 層算 1 次；同一 row 共用 y → dy、dy2 在 row 層算 1 次。Fig.18.3 因 atom 在最內層，這些值每次都要重算。副作用：寫回從 = energy 變成 += contribution（scatter 形態），這正是平行化需要 atomic 的根源。

Question 3 - Scatter Kernel 與 atomicAdd [recall]

Fig.18.5 的 scatter kernel 如何把工作分給 thread？它為何必須使用 atomicAdd，且 CHUNK_SIZE 受什麼限制？

查看答案

Scatter kernel 直接平行化最佳化序列碼 (Fig.18.4)：每個 thread = 一個 atom（最外層迴圈的一個迭代），把該 atom 的貢獻散播到所有 grid point。因為多個 atom-thread 會同時 += 同一個 energygrid[] 位置 → race condition，故必須用 atomicAdd 序列化更新 → 嚴重拖慢。atom 放 constant memory，CHUNK_SIZE*4 ≤ 64K（constant memory 容量），host 端分 chunk 逐批載入並呼叫 kernel。

Question 4 - Gather Kernel 與無 Atomic [recall]

Fig.18.6 的 gather kernel 如何分配工作而不需 atomic？它建立在哪一份序列碼上？需要什麼小轉換才能平行化？

查看答案

Gather kernel：每個 thread = 一個 grid point，用 2D thread grid 對應 2D grid point，迴圈跑所有 atom 收集貢獻。每個 thread 只寫自己的 grid point → 無寫入衝突、完全不需 atomic。它建立在未最佳化的 Fig.18.3 迴圈順序上。轉換：把兩層外迴圈做成 perfectly nested——本章選擇「把 y 座標計算搬進內層」（只多算幾次 y，計算量極小），而非 loop fission（需兩 kernel 經 global memory 溝通 + launch overhead）。這是「用少量重複計算換最大平行度」的取捨。

Question 5 - Gather 的 Arithmetic Intensity 與 Constant Cache [recall]

Fig.18.6 的 gather kernel 雖然每個 thread 要讀所有 atom，為何 global memory bandwidth 不是瓶頸？它的 arithmetic intensity 是多少？

查看答案

每個 thread 對每個 atom 做 9 個浮點運算 / 4 個 memory 存取 → arithmetic intensity = 9/4 = 2.25 OP/access。關鍵是 atoms[] 放 constant memory：每個 SM 的 constant cache 會快取並把資料 broadcast 給同一 warp 中眾多 thread。由於所有 thread 同步掃描同一批 atom，跨 thread 的大量重用使 constant cache 極有效，幾乎消除 DRAM 存取 → global memory bandwidth 不是限制因素。

Question 6 - Thread Coarsening 的 Register 重用 [recall]

Fig.18.8 的 thread coarsening 讓一個 thread 算同一 row 上 4 個 grid point。哪些量被取一次存進 register 重複使用？處理「1 個 atom × 4 個 grid point」的 constant 存取與 FP ops 從多少降到多少？

查看答案

同一 row 上 4 個 grid point 共用同一 atom 的 dy、dz、dysqdzsq = dy*dy + dz*dz、charge——這些從 constant memory 取一次存 register，4 個點共用；只有 dx 隨 x 改變（dx1=dx0+gridspacing 等由 register 推導）。量化（1 atom × 4 grid points）：constant 存取 16→4（x/y/z/charge 各省 3，4× 縮減）、FP ops 48→24（2× 縮減）。代價是每 thread 用更多 register；只要不超過上限就不影響 occupancy。

Question 7 - Cutoff Summation 的複雜度與遠端原子 [recall]

Cutoff summation 如何把 DCS 的複雜度從 O(V²) 降到 O(V)？被切掉的「遠端原子」貢獻如何處理？這屬於哪一類演算法策略？

查看答案

物理觀察：貢獻 ∝ charge / r，遠處原子貢獻極小。Cutoff summation 讓每個 grid point 只累加固定半徑 (cutoff) 內的原子貢獻 → 每點看的原子數固定 → 複雜度降為 O(V) 線性。cutoff 圈外的暗原子貢獻用一個低複雜度的 implicit method 集體補回。這屬於「若允許最終解輕微變動，可換取更大效率提升」的積極策略——犧牲少量精度換取可擴充性，與「替代演算法須得相同解」的一般規則不同。

Question 8 - Cutoff Binning 的 Control Divergence 與記憶體選擇 [recall]

在 grid-centric cutoff binning 中，(a) control divergence 從何而來？(b) 為何 atom 從 constant memory 改放 global memory，並如何緩解頻寬？

查看答案

(a) neighborhood bin 內的原子不保證都落在每個 thread 的 cutoff 半徑內，因此處理 bin 內每顆原子時 warp 中各 thread 要各自 if (r2 <= cutoff2) 判斷，做出不同的 include/exclude 決定 → control divergence。
(b) 不同 block 存取不同 neighborhood，原子總量大，有限的 constant memory 裝不下所有 active blocks 所需原子 → 改放 global memory。緩解：block 內 thread 協同載入共同的 neighborhood bin 到 shared memory，再各自從 shared memory 讀取（與 stencil tiling 同思路）。

Question 9 - Binning 資料結構與 Neighborhood List [recall]

在 cutoff binning 中，atom 如何被組織成 bins？grid point 的「neighborhood」如何定義與表示？為何要在 launch grid 之前就先算好？

查看答案

Binning：把輸入原子依座標 sort 進 bins，每個 bin 對應 energy grid space 的一個 box，內含座標落在該 box 的所有原子。資料結構是多維陣列：x、y、z 三維加上第四維（bin 內原子的 vector）。grid point 的 neighborhood = 所有可能貢獻能量的 bin 集合，以 home bin 為原點的 relative offset 清單表示（範例 9 full + 12 partial = 21 bins），通常放 constant memory array。因為由 cutoff 半徑反推 neighborhood bins 是複雜且耗時的幾何問題，故為整個 block 定義一份、在 launch 之前先準備好（可視為 energy grid space 中的一個 stencil）。

Question 10 - Exercise 2：Coarsening Factor 8 的運算量比較 [application]

對 coarsening factor 8，比較 Fig.18.8（coarsened，1 次迴圈迭代）與 Fig.18.6（gather，對應 8 次迭代）在 constant 存取、FP 運算與分支上的差異。

查看答案

處理 1 個 atom 對 8 個 grid point：
Fig.18.6 (gather) = 8 次迭代 = 8×(4 constant 存取 + 12 FP ops + 1 迴圈分支) = 32 constant 存取、96 FP ops、8 分支。
Fig.18.8 (coarsened factor 8) = 1 次迭代：constant 存取仍只 4（x/y/z/charge 各取一次）；FP ops 因 dy、dz、dysqdzsq、charge 共用而大幅縮減至約 45（3 sub + ~24 add + ~10 mul + 8 div，m*gridspacing 常數被 hoist 出迴圈）；分支 1。
結論：constant 存取 8× 縮減、分支 8× 縮減、FP ops 約 2× 縮減 —— 同時降低執行時間與能耗。

Question 11 - Per-Block Neighborhood 與 Supercircle 計算 [application]

假設 grid spacing = 0.5 Å、block = 8×8×8、cutoff = 12 Å。(a) 每個 block 覆蓋多大的 energy grid 體積？(b) supercircle 半徑公式與其數值為何？(c) 一個 block 有幾個 grid point 各自帶 cutoff 圈？

查看答案

(a) 每維 8 × 0.5 Å = 4 Å → block 覆蓋 4 Å × 4 Å × 4 Å 立方體（即一個 bin 的大小）。
(b) supercircle 半徑 = cutoff + ½ × bin 對角線。3D bin 對角線 = $4 \sqrt{3} \approx 6.93$ Å，故半徑 = $12 + \frac{1}{2} (4 \sqrt{3}) = 12 + 2 \sqrt{3} \approx 15.46 Å$ 。此 supercircle 以 bin 中心為圓心，可涵蓋「以 bin 四角為心的所有 cutoff 圈」。
(c) block 有 $8^{3} = 512$ 個 grid point，每個都有自己的 cutoff 圈。只需列出被 supercircle 完全/部分覆蓋的 bin 作為 neighborhood list（範例為 9 full + 12 partial = 21 bins），以 relative offset 放 constant memory。

Question 12 - Dummy Atoms vs Overflow List [application]

某資料集的原子在空間中分佈不均。為了 memory coalescing 需讓所有 bin 同樣大小。(a) 用 dummy atoms 補滿有何兩個負面效應？(b) overflow list 是更好的設計，如何運作並隱藏延遲？

查看答案

(a) Dummy atoms (charge = 0) 的缺點：① 佔用 global / shared memory 空間，並消耗傳到 device 的傳輸頻寬；② 拖慢那些「真實原子很少」的 thread block——dummy 仍要逐顆被處理，延長其執行時間。
(b) Overflow list：把 bin size 設在能容納絕大多數 bin 原子數的合理值（遠小於最大可能值）。binning 時若某原子的 home bin 已滿 → 改放入 overflow list。device kernel 完成後，host 對 overflow list 跑序列 cutoff 補上缺失貢獻。把計算切成 subvolume，device 跑下一個 kernel 的同時 host 處理上一批 overflow；只要 overflow 佔比小（例如 < 3%），其序列時間短於 device 時間，可被完全隱藏。

Question 13 - Scatter vs Gather 的平行化兩難 [analysis]

為何說「最佳化的序列碼反而更不適合平行化」？比較 scatter 與 gather 兩種 DCS kernel 的 thread 對應、寫入模式、效能瓶頸與單 thread 計算量，並說明本章如何化解此兩難。

查看答案

兩難根源：把序列碼最佳化（loop interchange）會把 atom 推到最外層，得到 scatter 形態（+= 散播）；直接平行化它 → 多 thread 寫同一 grid point → 需 atomicAdd 序列化，平行效能差。要 gather（每 thread 寫自己的點、免 atomic、平行好）就必須退回未最佳化的 Fig.18.3 迴圈順序，使每個 thread 內部重算 dy/dz、計算量較多。

比較項	Scatter (Fig.18.5)	Gather (Fig.18.6, 首選)
thread 對應	1 thread = 1 atom	1 thread = 1 grid point
寫入模式	多 thread 寫同一點 (`+=`)	各寫各的點
是否需 atomic	需要	不需要
瓶頸	atomic 序列化 → 慢	無 atomic → 快
單 thread 計算	較少 (享 loop-invariant)	較多 (y/z 每次重算)
decomposition	input-centric (atom)	output-centric (grid pt)

化解：先選 gather 取得高平行度與免 atomic，再用 thread coarsening 把 atom 資料抓進 register 重用於多點，奪回最佳化序列碼省下的計算。output-centric (gather) 通常較佳，因每個 output 由單一 thread 擁有。

Question 14 - Coalesced vs Strided 寫回 [analysis]

Fig.18.8（連續指派）與 Fig.18.10（interleaved 指派）運算量完全相同，為何後者效能更好？分析其 warp 寫入位址型態，並說明為何 coalescing 是「免費」的優化、以及為何它不影響對 atoms[] 的讀取。

查看答案

Fig.18.8 連續指派：thread 負責 i, i+1, i+2, i+3。對任一寫回敘述（如 energy0），warp 內相鄰 thread 位址相隔 COARSEN_FACTOR(=4) → 位址 0─4─8─12（中間各空 3 格）→ uncoalesced，每次寫只用到 DRAM burst 的一小部分。
Fig.18.10 interleaved 指派：i = blockIdx.x*blockDim.x*CF + threadIdx.x，thread 內步距改 blockDim.x*gridspacing。同一寫回敘述中 warp 寫的是連續位址 0─1─2─3 → fully coalesced，一個 burst 服務整個 warp。
為何免費：兩版 FP 與 constant 存取完全一樣，只重排了 grid-point→thread 的折疊索引，不增減任何運算 → 純粹靠對齊 DRAM burst 取得加速。
為何不影響 atoms[]：atoms[] 在 constant memory，走 broadcast / constant cache，不是 DRAM burst；coalescing 規則只適用於對 global memory energygrid[] 的寫回。

Question 15 - DCS vs Cutoff 的演算法選擇權衡 [analysis]

同一問題常有多種演算法。比較 DCS 與 cutoff binning 在精度、複雜度、平行度、大型系統適用性上的取捨，並說明 cutoff binning 為何仍選擇 grid-centric (gather) 分解而非 atom-centric。

查看答案

演算法在四個面向（計算步數、平行度、數值穩定性、記憶體頻寬）各有取捨，沒有單一演算法樣樣最佳。

面向	DCS	Cutoff Binning
每點收的原子	全部	僅 cutoff 半徑內
精度	高（完全精確）	略低（遠端可 implicit 補回）
複雜度	O(V²) quadratic	O(V) linear
平行度	極高	仍維持高平行
大型系統	不可擴充	可擴充

Cutoff 的複雜度改善來自「每 thread 只檢查一小撮 neighborhood bin 原子」。仍用 grid-centric (gather) 的原因：序列 cutoff 是 atom-centric（每次處理一原子、迭代其半徑內 grid points），這是 scatter 更新模式需 atomic、無法良好平行；改成 grid-centric 讓每個 thread 算一個 grid point、各寫各的、免 atomic，才能在 GPU 上保持高平行度（Rodrigues et al., 2008）。代價是 neighborhood bin 帶來的 control divergence 與 binning 的負載不均（用 overflow list 處理）。

重點整理 (Pattern Summary，點擊展開)

Kernel / 策略 (節)	thread 對應	需 atomic	寫回	關鍵手法 / 效果
Scatter (18.2, Fig.18.5)	1 atom	是	`atomicAdd`	平行化最佳化序列碼 → atomic 序列化，慢
Gather (18.2, Fig.18.6)	1 grid point	否	各寫各的	未最佳化迴圈順序 + constant cache broadcast
Coarsening (18.3)	1 thread×F 點	否	F 個 += (連續)	register 重用 dy/dz/dysqdzsq/charge：16→4、48→24
Coalescing (18.4)	1 thread×F 點	否	coalesced	interleaved 指派，步距 `blockDim.x`，免費
Cutoff Binning (18.5)	1 grid point	否	各寫各的	bins + neighborhood list + shared memory，O(V)

核心觀念鏈：DCS = $O (V^{2})$ 不可擴充 → 最佳化序列碼=scatter(需 atomic) → 退回未最佳化得 gather(免 atomic) → thread coarsening 用 register 奪回效率 → coalescing-aware 索引重排換完全合併寫入 → cutoff binning 犧牲少量精度把複雜度降到 $O (V)$ 。

關鍵公式：DCS 貢獻 charge / r_ij；複雜度 $O (N_{a t o m s} \times N_{g r i d}) \approx O (V^{2})$ ；arithmetic intensity 9/4=2.25；coarsening (1 atom×4 點) constant 16→4、FP 48→24；CHUNK_SIZE*4 ≤ 64K；supercircle 半徑 = cutoff + ½×bin 對角線；neighborhood 範例 9+12=21 bins。

一句話總結：本章是「平行化決策的演進」案例——output-centric(gather) 戰勝 input-centric(scatter)，再靠 coarsening + coalescing 補回單 thread 效率，最後用 cutoff binning 換取可擴充性。

Related Concepts

Question 1 - DCS 方法與複雜度 [recall]

Question 2 - Loop Interchange 合法性與解鎖的最佳化 [recall]

Question 3 - Scatter Kernel 與 atomicAdd [recall]

Question 4 - Gather Kernel 與無 Atomic [recall]

Question 5 - Gather 的 Arithmetic Intensity 與 Constant Cache [recall]

Question 6 - Thread Coarsening 的 Register 重用 [recall]

Question 7 - Cutoff Summation 的複雜度與遠端原子 [recall]

Question 8 - Cutoff Binning 的 Control Divergence 與記憶體選擇 [recall]

Question 9 - Binning 資料結構與 Neighborhood List [recall]

Question 10 - Exercise 2：Coarsening Factor 8 的運算量比較 [application]

Question 11 - Per-Block Neighborhood 與 Supercircle 計算 [application]

Question 12 - Dummy Atoms vs Overflow List [application]

Question 13 - Scatter vs Gather 的平行化兩難 [analysis]

Question 14 - Coalesced vs Strided 寫回 [analysis]

Question 15 - DCS vs Cutoff 的演算法選擇權衡 [analysis]