DCS 演算法與 Scatter vs Gather 核心設計 (Direct Coulomb Summation)

#application-case-study #electrostatic-potential #scatter-vs-gather #atomic-operations #loop-transformation

重點總覽 (Overview)

本章以 VMD 分子動力學軟體中的「靜電位圖 (electrostatic potential map)」計算為案例，介紹 Direct Coulomb Summation (DCS) 方法，並比較兩種平行化策略。

項目	內容	關鍵點
應用	VMD 計算 grid point 的靜電位，用於放置 ions / 分子動力學模擬	計算量大，傳統用 batch mode
方法	DCS：每個 grid point = 所有 atom 貢獻之和	高精度但複雜度高
複雜度	$O (N_{a t o m s} \times N_{g r i d})$	隨體積平方成長 (不可擴充)
序列最佳化	迴圈交換 (loop interchange) + 提取迴圈不變計算	atom 迴圈移到最外層
Scatter 核心	一個 thread = 一個 atom，散播到所有 grid point	需 `atomicAdd`，慢
Gather 核心	一個 thread = 一個 grid point，收集所有 atom 貢獻	無需 atomic，首選
核心困境	最佳化的序列碼 ↔ 平行化的序列碼不是同一份	平行化要犧牲序列效率

本章的核心教訓

「高度最佳化的序列程式碼往往比未最佳化的版本更不適合平行化。」 Gather 是首選 (避免 atomic)，但它必須建立在「較慢」的未最佳化迴圈順序之上 — 後續再靠 thread coarsening 把效率賺回來。

VMD 與靜電位圖背景 (Background: VMD & Electrostatic Potential Map)

VMD (Visual Molecular Dynamics)：顯示、動畫化、分析生物分子系統的熱門軟體 (20 萬+ 使用者)，被稱為「計算顯微鏡 (computational microscope)」。
計算內容：在 3D grid space 的每個 grid point 計算靜電位值。
- 用途：辨識可放入 ions (紅點) 的空間位置；計算分子動力學模擬中時間平均的電場位圖。
加速動機：把原本只能跑 batch mode 的長時間計算加速到 interactive (互動) 等級，大幅提升科學研究效率。CUDA 桌機普及使影響廣泛。

grid point ≠ thread grid

此處的「grid point」是離散化空間取樣點 (如同 Stencil 中的離散網格)，與 CUDA 的 thread grid 無關，勿混淆。

Direct Coulomb Summation 方法 (DCS Method)

DCS 是計算靜電位圖中高精度、且特別適合 GPU 的方法。每個 grid point 的位能 = 系統中所有 atom 貢獻之總和。

單一 atom 對 lattice point 的貢獻：

potential [j] + = \frac{atom [i] . charge}{r_{i j}}, r_{i j} = \sqrt{(x_{j} - x_{i})^{2} + (y_{j} - y_{i})^{2} + (z_{j} - z_{i})^{2}}

      atom[i] (charge q_i)
          *  \
          |    \  r_ij  = 距離
          |      \
   z      |        \
   ^      v          v
   |   . . . . . . [j] . . . .   <- lattice (grid) points (同一 slice 共用 z)
   +---> x
   貢獻 = q_i / r_ij ，grid point j 累加所有 atom 的貢獻

每個 atom 用 atoms[] 陣列中連續 4 個 float 表示：{x, y, z, charge}。
函式一次處理 3D grid 的一個 2D slice (同一 slice 所有 grid point 共用 z 座標，由呼叫端預先算好傳入)。
均勻網格 (uniform grid)：座標 = index * gridspacing，三維間距相同，可即時 (on-the-fly) 計算座標。

複雜度警示 (不可擴充)

計算次數 $\propto N_{a t o m s} \times N_{g r i d}$ 。當系統體積放大時，atom 數與 grid point 數都隨體積成長，故總運算量 $\propto {Volume}^{2}$ (二次成長)。這正是 Cutoff Binning 要解決的可擴充性問題。

序列程式碼與迴圈交換最佳化 (Sequential Code & Loop Interchange)

未最佳化版本 (Fig. 18.3) — grid-point 在外、atom 在內

void cenergy(float *energygrid, dim3 grid, float gridspacing, float z,
             const float *atoms, int numatoms) {
  int atomarrdim = numatoms * 4;        // x,y,z,charge per atom
  for (int j=0; j<grid.y; j++) {        // grid y
    float y = gridspacing * (float) j;
    for (int i=0; i<grid.x; i++) {      // grid x
      float x = gridspacing * (float) i;
      float energy = 0.0f;
      for (int n=0; n<atomarrdim; n+=4) {        // <-- 最內層：所有 atom
        float dx = x - atoms[n];
        float dy = y - atoms[n+1];               // y,z 距離在最內層重算!
        float dz = z - atoms[n+2];
        energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
      }
      energygrid[grid.x*grid.y*z + grid.x*j + i] = energy;  // 直接寫，無 +=
    }
  }
}

最佳化版本 (Fig. 18.4) — loop interchange：atom 移到最外層

void cenergy(/* ... */) {
  int atomarrdim = numatoms * 4;
  int grid_slice_offset = (grid.x*grid.y*z) / gridspacing;
  for (int n=0; n<atomarrdim; n+=4) {            // <-- atom 移到最外層 (scatter)
    float dz  = z - atoms[n+2];                  // 整個 slice 共用 z -> 算一次
    float dz2 = dz*dz;
    float charge = atoms[n+3];
    for (int j=0; j<grid.y; j++) {
      float y   = gridspacing * (float) j;
      float dy  = y - atoms[n+1];                // 整列共用 y -> 在內層外算
      float dy2 = dy*dy;
      int grid_row_offset = grid_slice_offset + grid.x*j;
      for (int i=0; i<grid.x; i++) {
        float dx = gridspacing*(float)i - atoms[n];
        energygrid[grid_row_offset+i] += charge / sqrtf(dx*dx + dy2 + dz2); // +=
      }
    }
  }
}

迴圈交換為何合法：三層迴圈是 perfectly nested 且各迭代彼此獨立 (同 FHD kernel 的論證)。

交換後解鎖的最佳化 (提取迴圈不變量)：

計算	Fig. 18.3 (未最佳化)	Fig. 18.4 (最佳化)
`dz, dz2`	最內層 atom 迴圈中每次重算	slice 層算 1 次
`dy, dy2`	最內層中每次重算	row 層算 1 次 (內層外)
寫回方式	`= energy` (整批寫一次)	`+= contribution` (累加散播)

關鍵副作用

最佳化後 atom 在最外層 → 每個 atom 把貢獻 scatter (散播) 到所有 grid point，且寫回變成 += (read-modify-write)。這個 += 正是平行化時需要 atomic 的根源。

                 i-loop, j-loop (grid points)
   Fig 18.3:  [grid pt] --gather--> Σ over all atoms      (寫 1 次, 但 dy/dz 重算)
   Fig 18.4:  [atom]    --scatter--> += to all grid pts    (dy/dz 提取, 但要 +=)
              ^ 最佳化序列碼 = scatter 形態

Scatter 核心 (Scatter Kernel with atomicAdd)

直接平行化「最佳化序列碼 (Fig. 18.4)」：每個 thread = 一個 atom (最外層迴圈的一個迭代)，把該 atom 的貢獻散播到所有 grid point。

__constant__ float atoms[CHUNK_SIZE*4];   // atom 資料放 constant memory
void __global__ cenergy(float *energygrid, dim3 grid, float gridspacing, float z) {
  int n = (blockIdx.x * blockDim.x + threadIdx.x) * 4;   // thread -> 一個 atom
  float dz = z - atoms[n+2];  float dz2 = dz*dz;
  int grid_slice_offset = (grid.x*grid.y*z) / gridspacing;
  float charge = atoms[n+3];
  for (int j=0; j<grid.y; j++) {
    float dy = gridspacing*(float)j - atoms[n+1];  float dy2 = dy*dy;
    int grid_row_offset = grid_slice_offset + grid.x*j;
    for (int i=0; i<grid.x; i++) {
      float dx = gridspacing*(float)i - atoms[n];
      atomicAdd(&energygrid[grid_row_offset+i],          // <-- 必須 atomic!
                charge / sqrtf(dx*dx + dy2 + dz2));
    }
  }
}

CHUNK_SIZE * 4 必須 ≤ 64K (constant memory 容量限制)；host 端把 atom 分 chunk 載入。
致命缺點：多個 thread (atom) 會同時更新同一個 energygrid[] 位置 → 必須 atomicAdd → 序列化、嚴重拖慢。

   thread (atom) 0 ─┐
   thread (atom) 1 ─┼─ atomicAdd ─> energygrid[k]   (寫衝突 -> 序列化!)
   thread (atom) 2 ─┘

Scatter = 寫入衝突

Scatter 的本質是「多個來源寫同一目的地」，必然產生 atomic 競爭。與 MRI FHD 章節結論一致：scatter 平行化效能差。

Gather 核心 (Gather Kernel — 首選)

改用 gather：每個 thread = 一個 grid point，收集所有 atom 對它的貢獻。每個 thread 只寫自己的 grid point → 完全不需 atomic。

__constant__ float atoms[CHUNK_SIZE*4];
void __global__ cenergy(float *energygrid, dim3 grid, float gridspacing,
                        float z, int numatoms) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // 2D thread grid 對應
  int j = blockIdx.y * blockDim.y + threadIdx.y;   //   2D grid point
  int atomarrdim = numatoms * 4;
  int k = z / gridspacing;
  float x = gridspacing * (float) i;
  float y = gridspacing * (float) j;
  float energy = 0.0f;
  for (int n=0; n<atomarrdim; n+=4) {              // gather: 迴圈跑所有 atom
    float dx = x - atoms[n];
    float dy = y - atoms[n+1];
    float dz = z - atoms[n+2];
    energy += atoms[n+3] / sqrtf(dx*dx + dy*dy + dz*dz);
  }
  energygrid[grid.x*grid.y*k + grid.x*j + i] += energy;  // 每 thread 寫自己的點
}

它建立在「未最佳化」的 Fig. 18.3 迴圈順序上，需做一個小轉換：

把 Fig. 18.3 的兩層外迴圈 (i, j) 做成 perfectly nested，才能讓一個 thread 執行一個 (i,j) 迭代。
做法有二：
1. Loop fission：另建陣列存 y 值 → 需兩個 kernel 經 global memory 溝通 + kernel launch overhead。
2. 把 y 座標計算搬進內層 (本章選此)：只是多算幾次 y，計算量極小。
取捨：用少量重複計算換取最大平行度 (所有 i, j 迭代可平行)。

        所有 atom (constant memory, broadcast)
         |    |    |
         v    v    v
   thread(grid pt)  -- gather: energy += Σ atom 貢獻 --> 寫「自己」的 grid point
   (無寫入衝突, 無 atomic)

Gather 核心效能良好 (即使是未最佳化迴圈順序)：

無 atomic 拖累。
Arithmetic intensity：每 4 個 memory access 做 9 個浮點運算 → $OP/B = 9 / 4$ 。
atoms[] 在每個 SM 的 constant cache 中被快取，並 broadcast 給眾多 thread → 大量跨 thread 重用，幾乎消除 DRAM 存取 → global memory bandwidth 不是瓶頸。

Gather 的下個最佳化

Gather 雖避開 atomic，但每 9 個浮點運算仍要 4 次 constant memory 存取。下一步用 thread coarsening 把 atom 資料抓進 register 重用於多個 grid point，奪回最佳化序列碼省下的計算。

Scatter vs Gather 困境 (The Parallelization Dilemma)

比較項	Scatter (Fig. 18.5)	Gather (Fig. 18.6, 首選)
thread 對應	1 thread = 1 atom	1 thread = 1 grid point
迴圈來源	最佳化序列碼 (Fig. 18.4)	未最佳化序列碼 (Fig. 18.3)
寫入模式	多 thread 寫同一 grid point	每 thread 寫自己的 grid point
是否需 atomic	需要 `atomicAdd`	不需要
效能瓶頸	atomic 序列化 → 慢	無 atomic，效能好
單 thread 計算量	較少 (享 loop-invariant 提取)	較多 (y/z 距離每次重算)
decomposition	input-centric (atom 為中心)	output-centric (grid point 為中心)

平行化的常見兩難 (要背)

最佳化的序列碼 = scatter 形態 (需 atomic，平行差)；要 gather (平行好) 就得回到未最佳化的迴圈順序，使每個 thread 內部變慢。 解法：先選 gather 取得高平行度，再用 thread coarsening 把單 thread 效率補回來。

這與 output-centric vs input-centric 分解是同一個主題：output-centric (gather) 通常較好，因為每個 output 由單一 thread 擁有。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
DCS 計算複雜度	$O (N_{a t o m s} \times N_{g r i d})$ ；隨體積平方成長 → 不可擴充
「為何 DCS 不適合大型系統」	atom 數與 grid 數都隨體積線性增 → 總運算量 $\propto {Volume}^{2}$
loop interchange 為何合法	三層迴圈 perfectly nested 且各迭代獨立
最佳化序列碼解鎖什麼	把 `dz/dz2` (slice 不變)、`dy/dy2` (row 不變) 提取到迴圈外
scatter 為何需要 atomic	多個 atom-thread 同時 `+=` 同一 grid point → race condition
gather 為何不需 atomic	每個 thread 只寫自己擁有的 grid point
哪個是首選？為何	Gather：避免 atomic 序列化；但建立在較慢的迴圈順序上
gather 用未最佳化碼的轉換	把 y 座標計算搬進內層 (或 loop fission)，使 i/j 迴圈 perfectly nested
gather 的 arithmetic intensity	9 FLOP / 4 memory accesses = 2.25
為何 gather 不受 bandwidth 限制	`atoms[]` 在 constant cache 被 broadcast 大量重用，消除多數 DRAM 存取
CHUNK_SIZE 限制	`CHUNK_SIZE*4` ≤ 64K (constant memory 容量)
「平行化兩難」	最佳化序列碼=scatter(差)；gather(好)需犧牲單 thread 效率，再靠 coarsening 補