Cutoff Binning 與資料規模擴充性 (Data Size Scalability)

#application-case-study #cutoff-binning #electrostatic-potential #control-divergence #shared-memory

重點總覽 (Overview)

項目	重點
核心問題	DCS 計算量 ∝ atoms × grid points ≈ O(V²)（與體積平方成長），不可擴充
解法策略	Cutoff summation：只累加 cutoff 半徑內的原子貢獻，犧牲少量精度換取 O(V) 線性複雜度
平行化分解	必須用 grid-centric (gather)，每 thread 算一個 grid point；atom-centric (scatter) 無法平行化
資料結構	把原子依座標 sort 進 bins（多維陣列 `[x][y][z][atom vector]`）
Neighborhood	一個 grid point 的 cutoff 圈會覆蓋多個 bin；用 relative-offset list 表示（範例 9 full + 12 partial = 21 bins）
記憶體選擇	不同 block 存取不同 neighborhood → constant memory 不夠用 → 改用 global memory + 協同載入 shared memory
Control divergence	neighborhood bin 內的原子不一定落在每個 thread 的 cutoff 圈內 → 各 thread `if` 判斷不同 → divergence
負載不均	bins 原子數不一；固定 bin size 需填 dummy atoms (charge 0) → 用 overflow list + host 收尾
遠端原子	cutoff 圈外的暗原子貢獻極小，用 separate implicit method 集體處理

演算法選擇的權衡 (Algorithm Selection Tradeoffs)

同一問題通常有多種演算法，各自在四個面向上取捨，沒有單一演算法在所有面向都最佳：

面向	說明
計算步數	有些演算法步數較少
平行度	有些暴露更高的平行執行度
數值穩定性	有些 numerical stability 較好
記憶體頻寬	有些消耗較少 memory bandwidth

Important

一般規則：替代演算法應得到相同的解，只能在效率/平行度上優化。
例外（更積極的策略）：若問題允許最終解有輕微變動，可換取更大幅度的效率提升 —— Cutoff summation 正是這類「犧牲少量精度」的策略。

延伸閱讀見 19-Parallel-Programming-And-Computational-Thinking/02-Algorithm-Selection-and-Tradeoffs。

Cutoff Summation 與擴充性 (Cutoff Summation & Scalability)

物理觀察：grid point 的能量貢獻與距離成反比（charge / r），遠處原子貢獻極小，可被集體用低複雜度的 implicit method 處理。

   Direct Summation (A)              Cutoff Summation (B)
   每個 grid point 收所有原子貢獻        只收 cutoff 半徑 r 內的原子貢獻

        o    o    o                       (dark)  o
     o     [G]     o                    o    .-----.
        o    o    o                          ( [G] )  r
     o     o    o                         o   '-----'   o (dark)
                                              ^ 圈外暗原子 → implicit method

   計算量 ∝ N_atom × N_grid            每點只看半徑內固定數量原子
          ≈ O(V²)  (quadratic)               ≈ O(V)  (linear)

	Direct Coulomb Summation (DCS)	Cutoff Summation
每個 grid point 收的原子	全部原子	僅 cutoff 半徑內原子
精度	高（完全精確）	略低（捨去遠端，可 implicit 補回）
複雜度	O(V²)（atoms 與 grid points 皆 ∝ V）	O(V)（線性 ∝ 體積）
平行度	極高	仍維持高平行度
大型系統適用性	不可擴充（計算過久）	可擴充

Warning

增大模擬體積時，grid points 和原子數同時正比於體積成長，所以 DCS 的操作數 ≈ V × V = V² (quadratic)，對真實生物系統不可行。

Grid-Centric Cutoff Binning 演算法 (Grid-Centric Binning Algorithm)

為何不能直接平行化序列 cutoff？ 序列版是 atom-centric（每次處理一個原子，迭代其半徑內 grid points）—— 這是 scatter 更新模式，需 atomic operations，平行化效果差（見 18-Electrostatic-Potential-Map/01-DCS-Scatter-vs-Gather）。

解法：採 grid-centric (gather) 分解 —— 每個 thread 算一個 grid point 的能量（Rodrigues et al., 2008）。

核心步驟：

Binning：把輸入原子依座標 sort 進 bins。每個 bin 對應 energy grid space 的一個 box，內含所有座標落在該 box 的原子。
資料結構：bins 為多維陣列 —— x、y、z 三維 加上第四維：bin 內原子的 vector。
Neighborhood：定義 grid point 的 neighborhood = 所有可能貢獻能量的 bin 集合。

   Fig 18.12 概念：grid point 的 cutoff 圈覆蓋周邊 bins
   +-----+-----+-----+
   | bin | bin | bin |   九個 bin 與 cutoff 圈重疊
   +-----+--.--+-----+   → 這 9 個 bin 內所有原子都要納入考慮
   | bin |(  [G] ) | bin|   ( ) = cutoff circle, [G] = grid point
   +-----+--'--+-----+
   | bin | bin | bin |   但 bin 內某些原子可能仍落在圈外!
   +-----+-----+-----+

Control Divergence 來源

neighborhood bin 內的原子不保證都落在每個 thread 的 cutoff 半徑內。因此處理 bin 內每顆原子時，warp 中各 thread 須各自 if (dist <= cutoff) 判斷，做出不同的 include/exclude 決定 → 造成 control divergence（見 04-Compute-Architecture-And-Scheduling/02-Warps-SIMD-and-Control-Divergence）。

Neighborhood List 與 Per-Block 設計 (Neighborhood List & Per-Block Design)

計算 neighborhood bins 是複雜的幾何問題、耗時。因此為整個 block 定義一份 neighborhood，並在 launch grid 之前先準備好（neighborhood bins 就像 energy grid space 中的一個 stencil）。

Per-block 推導（Fig 18.13/18.14 範例）：

參數	數值
Grid spacing	0.5 Å
Block 維度	8 × 8 × 8
每個 block 覆蓋體積	(8×0.5)³ = 4 Å × 4 Å × 4 Å cube
Cutoff distance	典型 12 Å
Block 內 grid points	512（每個都有自己的 cutoff 圈）

保守近似 — Supercircle：以 bin 中心為圓心，半徑取

supercircle_radius = cutoff_distance + ½ × bin_diagonal

此 supercircle 可覆蓋「以 bin 四角為圓心的所有 cutoff 圈」，於是只需列出被 supercircle 完全或部分覆蓋的 bin。

   5×5 bin neighborhood（以 home bin [G] 為原點的相對 offset）
   .  = 未覆蓋(角落)    p = 部分覆蓋(partial)    F = 完全覆蓋(full)

      .    p    p    p    .
      p    F    F    F    p
      p    F   [G]   F    p      [G] = home bin, offset (0,0)
      p    F    F    F    p
      .    p    p    p    .

   full (含 [G]) = 9    partial = 12    neighborhood list = 21 bins
   9 個 full 的相對 offset:
   (-1,-1)(0,-1)(1,-1)(-1,0)(0,0)(1,0)(-1,1)(0,1)(1,1)

Neighborhood list 以 relative offset 表示，通常放在 constant memory array 供 kernel 使用。

// 預先算好、放 constant memory 的 neighborhood 相對 offset 清單
__constant__ int nbrlist[21][3];   // 每筆 {dx, dy, dz} bin offset
__constant__ int numNbrs;          // = 21 (9 full + 12 partial)

__global__ void cutoff_cenergy(float *energygrid, /* grid 維度等 */ ,
                               const float *binBase, int binSize,
                               float gridspacing, float cutoff2) {
    // 1) 由 block/thread index 算出本 thread 負責的 grid point 座標
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    float gx = i*gridspacing, gy = j*gridspacing /* gz ... */;
    float energy = 0.0f;

    __shared__ float atomBin[BIN_SIZE*4];   // 協同載入的 bin 緩衝

    // 2) 所有 thread 迭代共同的 neighborhood list
    for (int b = 0; b < numNbrs; ++b) {
        // 對 block 所屬 bin 套用相對 offset → 鄰居 bin 座標
        int binIdx = blockBinIndex + nbrlist[b][0] /* + y,z 維度 */;

        // 3) 協同把該 bin 的原子 global mem -> shared mem
        for (int t = threadIdx.x; t < binSize*4; t += blockDim.x)
            atomBin[t] = binBase[binIdx*binSize*4 + t];
        __syncthreads();

        // 4) 每個 thread 各自檢查 shared mem 中每顆原子是否在 cutoff 內
        for (int a = 0; a < binSize*4; a += 4) {
            float dx = gx - atomBin[a], dy = gy - atomBin[a+1],
                  dz = /*gz*/ - atomBin[a+2];
            float r2 = dx*dx + dy*dy + dz*dz;
            if (r2 <= cutoff2)                 // <-- control divergence 點
                energy += atomBin[a+3] * rsqrtf(r2);
        }
        __syncthreads();
    }
    energygrid[/* linear index */] += energy;   // gather: 各寫各的, 免 atomic
}

執行流程：

for each bin in neighborhood_list[21]:     (block 內所有 thread 一起)
   bin_coord = block_bin_coord + offset      套用相對 offset
   ── 協同載入 ──   atoms: global memory ──▶ shared memory
   __syncthreads()
   ── 各自計算 ──   每 thread 掃 shared mem 的原子:
                      if dist <= cutoff:  energy += charge / dist   ← divergence
   __syncthreads()

Shared Memory 與記憶體選擇 (Shared Memory & Memory Choice)

複雜度的改善來自：每個 thread 只檢查 neighborhood bins 定義的一小撮原子（而非全部原子）。但這帶來記憶體選擇的轉變：

	DCS（前幾節）	Cutoff Binning
原子放哪	constant memory（broadcast + 重用，cache 極有效）	global memory
為何改變	全 thread 看同一批原子	不同 block 存取不同 neighborhood，原子總量大，constant memory 太小裝不下所有 active blocks 所需原子
頻寬對策	constant cache 廣播	block 內 thread 協同載入共同 neighborhood bin 至 shared memory，再各自從 shared memory 讀取

Tip

Neighborhood bin 對一個 block 內所有 thread 是共同的資料，因此非常適合協同搬到 shared memory 重用 —— 與 08-Stencil/02-Shared-Memory-Tiling-for-Stencil 的 tiling 思路相同。

Bin 大小、Dummy Atoms 與 Overflow List (Bin Size, Dummy Atoms & Overflow List)

問題：原子在空間中統計分佈不均 —— 有些 bin 很多原子、有些完全沒有。

為了 memory coalescing：所有 bin 必須同樣大小且對齊 coalescing 邊界 → 需用 dummy atoms（charge = 0） 把較空的 bin 補滿。

Dummy atoms 的兩個負面效應

佔用：dummy atoms 仍佔 global / shared memory 空間，並消耗傳到 device 的傳輸頻寬。
拖慢：dummy atoms 延長那些「真實原子很少」的 thread block 的執行時間（仍要逐顆處理）。

較佳作法 — Overflow List：

步驟	做法
設定 bin size	設在能容納絕大多數 bin 原子數的合理值（遠小於最大可能原子數）
溢位處理	處理原子時若其 home bin 已滿 → 改放入 overflow list
Device 完成後	grid point 能量值傳回 host
Host 收尾	host 對 overflow list 跑序列 cutoff 演算法，補上這些溢位原子的缺失貢獻

Host/Device 重疊隱藏延遲

設計成每次 kernel 只算一個 subvolume：device 跑下一個 kernel 的同時，host 處理上一個 kernel 的 overflow atoms。只要 overflow 原子佔比小（例如 < 3%），其序列處理時間通常短於 device 執行時間，可被完全隱藏。

本章總結 (Chapter Summary)

整章是一連串平行化決策與權衡的演進：

 highly-optimized 序列 C  ──平行化──▶  scatter kernel  (需大量 atomicAdd, 慢)
                                            │  改用 gather
 less-optimized 序列 C    ──平行化──▶  gather kernel   (各寫各的, 快很多)
                                            │  thread coarsening
                                       重拾優化序列的效率 (register 重用)
                                            │  coalescing-aware 配置
                                       完全 coalesced 的記憶體寫入
                                            │  資料規模擴充
 ─────────────────────────────────────────────────────────────────
 DCS = 高精度但 O(V²) 不可擴充
        │  犧牲少量精度
 Cutoff summation + binning = 大幅降低複雜度 (O(V)) 且維持高平行度

DCS 精確但操作數 ∝ atoms × grid points → 隨體積 quadratic 成長，不適合真實生物系統。
透過 scatter → gather 避開 atomic、thread coarsening 重用暫存器、coalescing-aware 配置完全合併寫入（前兩篇）。
Cutoff binning：稍降精度即把複雜度從 O(V²) 降到 O(V)，同時保持高度平行。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
DCS 為何不可擴充？	atoms 與 grid points 都 ∝ 體積 → 操作數 ∝ V² (quadratic)
Cutoff summation 的複雜度？	每點只看固定半徑內原子 → O(V) 線性；代價是略損精度
為何 cutoff 要用 grid-centric？	atom-centric 是 scatter（需 atomic）無法良好平行；grid-centric gather 各寫各的
Bins 資料結構？	多維陣列：x,y,z + 第四維為 bin 內原子 vector
Neighborhood list 怎麼表示 / 放哪？	bin 的相對 offset 清單；放 constant memory；launch 前先算好（幾何計算貴）
Supercircle 半徑公式	cutoff + ½ × bin 對角線，可涵蓋以 bin 四角為心的所有 cutoff 圈
範例 neighborhood 有幾個 bin？	9 full + 12 partial = 21
Control divergence 從何而來？	neighborhood bin 內原子不一定落在每 thread 的 cutoff 圈內 → 各 thread `if` 判斷不同
為何 cutoff 不用 constant memory 放原子？	不同 block 存取不同 neighborhood，constant memory 太小 → 改 global memory + shared memory 協同載入
Bins 大小不一怎麼辦？	固定 size 需填 dummy atoms (charge 0)；但較佳是設小 size + overflow list 由 host 收尾
Dummy atoms 兩個缺點	(1) 佔記憶體 / 傳輸頻寬；(2) 拖慢原子少的 block
Overflow 隱藏延遲	subvolume 分段 → device 跑下一 kernel 時 host 處理上一批 overflow（<3% 可完全隱藏）
遠端原子貢獻怎麼處理？	cutoff 圈外暗原子貢獻極小 → 另用 implicit method 集體補回

重點總覽 (Overview)

演算法選擇的權衡 (Algorithm Selection Tradeoffs)

Cutoff Summation 與擴充性 (Cutoff Summation & Scalability)

Grid-Centric Cutoff Binning 演算法 (Grid-Centric Binning Algorithm)

Neighborhood List 與 Per-Block 設計 (Neighborhood List & Per-Block Design)

Shared Memory 與記憶體選擇 (Shared Memory & Memory Choice)

Bin 大小、Dummy Atoms 與 Overflow List (Bin Size, Dummy Atoms & Overflow List)

本章總結 (Chapter Summary)

考試/面試重點 (Exam / Test Patterns)

Related Notes