Thread Coarsening 與記憶體合併存取 (Memory Coalescing)

#application-case-study #thread-coarsening #memory-coalescing #electrostatic-potential #register-optimization

重點總覽 (Overview)

接續 18-Electrostatic-Potential-Map/01-DCS-Scatter-vs-Gather 的 gather kernel (Fig. 18.6)：它靠 constant cache 避開了 DRAM 瓶頸，但仍是「每 9 個浮點運算就要 4 次 constant memory 存取指令」。本節用兩個正交的優化把效率拉回接近最佳化序列碼的水準。

優化	解決什麼問題	核心手法	量化效果 (處理 1 個 atom × 4 個 grid points)
Thread Coarsening (§18.3, Fig. 18.8)	constant 存取指令與冗餘算術佔用硬體資源/耗能	一個 thread 算 COARSEN_FACTOR 個 grid points，把 `atom.x/y/z/charge`、`dy`、`dz`、`dysqdzsq` 取一次存 register 重複用	constant 存取 16→4、FP ops 48→24
Memory Coalescing (§18.4, Fig. 18.10)	Fig. 18.8 的寫回 `energygrid[]` 未合併 (warp 內相鄰 thread 位址差 4)	改用 coalescing-aware 指派：相鄰 grid point 給相鄰 thread，每個 thread 的點彼此相隔 `blockDim.x`	所有寫回 fully coalesced

Important

兩個優化各自獨立：coarsening 省的是 constant memory 存取指令 + 算術（atom 資料在 cache 中，非 DRAM）；coalescing 修的是對 global memory energygrid[] 的寫回。考試常把兩者混為一談。

執行緒粗化 (Thread Coarsening)

動機：同一 row 的 grid points 共用 y/z 計算

如 Fig. 18.7：同一 row (y 維) 上所有 grid points 的 y 座標相同 → atom 與這些點的 dy（以及 dz、dy*dy+dz*dz、charge）都一樣。Fig. 18.6 讓每個 thread 各自重算這些值，是冗餘的。

            row j  (固定 y)         atom i
   x: ──●────●────●────●──→        (ax, ay, az, q)
       g0   g1   g2   g3
       dy = y - ay   ← 對 g0..g3 完全相同 (算一次即可)
       dz = z - az   ← 整個 slice 都相同
   只有 dx 隨 x 改變: dx_m = (x + m·spacing) - ax

想法：讓一個 thread 負責同一 row 上的 COARSEN_FACTOR 個 grid points，把可共用的量從 constant memory 取一次、存進 register，供多個點重複使用。

Kernel (COARSEN_FACTOR = 4)

__constant__ float atoms[CHUNK_SIZE*4];
#define COARSEN_FACTOR 4
__global__ void cenergy(float *energygrid, dim3 grid,
                        float gridspacing, float z, int numatoms) {
    // 連續指派：thread 負責 4 個「相鄰」grid points (i, i+1, i+2, i+3)
    int i = (blockIdx.x*blockDim.x + threadIdx.x) * COARSEN_FACTOR;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int atomarrdim = numatoms*4;
    int k = z/gridspacing;
    float y = gridspacing*(float)j;
    float x = gridspacing*(float)i;
    float energy0=0, energy1=0, energy2=0, energy3=0;
    for (int n=0; n<atomarrdim; n+=4) {
        float dx0 = x - atoms[n];                 // x 從 constant 取一次
        float dx1 = dx0 + gridspacing;            // 其餘 dx 由 register 推導
        float dx2 = dx0 + 2*gridspacing;
        float dx3 = dx0 + 3*gridspacing;
        float dy  = y - atoms[n+1];               // dy 取/算一次
        float dz  = z - atoms[n+2];               // dz 取/算一次
        float dysqdzsq = dy*dy + dz*dz;           // 存 register，4 點共用
        float charge   = atoms[n+3];              // charge 取一次
        energy0 += charge / sqrtf(dx0*dx0 + dysqdzsq);
        energy1 += charge / sqrtf(dx1*dx1 + dysqdzsq);
        energy2 += charge / sqrtf(dx2*dx2 + dysqdzsq);
        energy3 += charge / sqrtf(dx3*dx3 + dysqdzsq);
    }
    energygrid[grid.x*grid.y*k + grid.x*j + i  ] += energy0;
    energygrid[grid.x*grid.y*k + grid.x*j + i+1] += energy1;   // ← 寫回未合併!
    energygrid[grid.x*grid.y*k + grid.x*j + i+2] += energy2;
    energygrid[grid.x*grid.y*k + grid.x*j + i+3] += energy3;
}

每個 atom、每 4 個 grid points 的運算量比較

項目	Fig. 18.6 gather (4 threads×1 點)	Fig. 18.8 coarsened (1 thread×4 點)	省下
constant memory 存取	16	4	12 (x/y/z/charge 各省 3)
FP 減法	12	3	9
FP 加法	12	11	—
FP 乘法	12	6	—
FP 除法	12	4	—
FP ops 合計	48	24	24

Tip

重點不是逐項精確值（會隨 compiler 把 2*gridspacing、3*gridspacing hoist 出迴圈而略有出入），而是 constant 存取 4× 縮減、FP ops 2× 縮減。少掉的存取指令同時降低 執行時間與能耗。

Warning

代價：每個 thread 用更多 register（energy0..3、dx0..3…）。若 register 數超過上限會壓低每個 SM 可容納的 thread 數（occupancy）。書中此例仍在允許範圍內，故 不影響 occupancy——但這是 SIMPLIFICATION-WITH-EXCEPTION：coarsening factor 越大越可能撞到 register 上限。參見 04-Compute-Architecture-And-Scheduling/03-Resource-Partitioning-and-Occupancy。

記憶體合併存取 (Memory Coalescing)

問題：連續指派造成 strided 寫回

Fig. 18.8 用 連續指派（thread 負責 i, i+1, i+2, i+3）。對任一條寫回敘述（例如 energy0），warp 內相鄰 thread 的寫入位址相隔 COARSEN_FACTOR 個元素 → 未合併 (uncoalesced)。

連續指派 (Fig. 18.8)  blockDim.x=4, COARSEN_FACTOR=4 → block 蓋 16 點
  thread:  T0          T1          T2          T3
  負責點:  0 1 2 3     4 5 6 7     8 9 10 11   12 13 14 15
                       ↑ T0,T1 的 energy0 位址相隔 4 (strided)

  寫 energy0 的 warp 位址:  0 ── 4 ── 8 ── 12   (中間各空 3 格) ✗ uncoalesced

解法：coalescing-aware 指派（interleaved）

把 grid points 交錯分給 thread：先把連續 blockDim.x 個點分給 thread 0..N-1，再把「下一段」blockDim.x 個點分給同一批 thread，重複 COARSEN_FACTOR 次（Fig. 18.9）。如此同一條寫回敘述中，warp 寫入的就是 連續位址。

interleaved 指派 (Fig. 18.10)  blockDim.x=4, COARSEN_FACTOR=4
  thread:  T0          T1          T2          T3
  負責點:  0 4 8 12     1 5 9 13    2 6 10 14   3 7 11 15
           ↑每個 thread 內部相隔 blockDim.x(=4)

  寫 energy0 的 warp 位址:  0 ─ 1 ─ 2 ─ 3       (連續) ✓ coalesced
  寫 energy1 的 warp 位址:  4 ─ 5 ─ 6 ─ 7       (連續) ✓ coalesced

Kernel (Fig. 18.10)

僅兩處關鍵改動：(1) index 公式 改成讓相鄰 thread 相鄰；(2) dx 與寫回的步距 從 gridspacing 改成 blockDim.x*gridspacing。

__constant__ float atoms[CHUNK_SIZE*4];
#define COARSEN_FACTOR 4
__global__ void cenergy(float *energygrid, dim3 grid,
                        float gridspacing, float z, int numatoms) {
    // interleaved：相鄰 thread → 相鄰 grid point (差 1，不再差 4)
    int i = blockIdx.x*blockDim.x*COARSEN_FACTOR + threadIdx.x;
    int j = blockIdx.y*blockDim.y + threadIdx.y;
    int atomarrdim = numatoms*4;
    int k = z/gridspacing;
    float y = gridspacing*(float)j;
    float x = gridspacing*(float)i;
    float energy0=0, energy1=0, energy2=0, energy3=0;
    for (int n=0; n<atomarrdim; n+=4) {
        float dx0 = x - atoms[n];
        float dx1 = dx0 +   blockDim.x*gridspacing;  // ← 步距改為 blockDim.x 個格
        float dx2 = dx0 + 2*blockDim.x*gridspacing;
        float dx3 = dx0 + 3*blockDim.x*gridspacing;
        float dy = y - atoms[n+1];
        float dz = z - atoms[n+2];
        float dysqdzsq = dy*dy + dz*dz;
        float charge   = atoms[n+3];
        energy0 += charge / sqrtf(dx0*dx0 + dysqdzsq);
        energy1 += charge / sqrtf(dx1*dx1 + dysqdzsq);
        energy2 += charge / sqrtf(dx2*dx2 + dysqdzsq);
        energy3 += charge / sqrtf(dx3*dx3 + dysqdzsq);
    }
    energygrid[grid.x*grid.y*k + grid.x*j + i              ] += energy0;
    energygrid[grid.x*grid.y*k + grid.x*j + i +   blockDim.x] += energy1; // ✓ 合併
    energygrid[grid.x*grid.y*k + grid.x*j + i + 2*blockDim.x] += energy2;
    energygrid[grid.x*grid.y*k + grid.x*j + i + 3*blockDim.x] += energy3;
}

	Fig. 18.8 連續	Fig. 18.10 interleaved
index `i`	`(blockIdx.xblockDim.x+threadIdx.x)CF`	`blockIdx.xblockDim.xCF + threadIdx.x`
thread 內步距	`gridspacing`（差 1 格）	`blockDim.x*gridspacing`（差 blockDim.x 格）
寫回位址（warp 內）	strided（相隔 CF）	連續
global write	uncoalesced ✗	fully coalesced ✓
運算量	與 18.10 相同	與 18.8 相同

Important

兩版的 運算量完全一樣；差別純粹在 grid-point→thread 的「折疊方式」。這說明 coalescing 是 免費的 加速——只重排索引，不增減任何 FP 或 constant 存取。背景知識見 06-Performance-Considerations/01-Memory-Coalescing（DRAM bursts）。

Warning

Coalescing 對 constant memory 的 atoms[] 讀取沒有影響——constant memory 走的是 broadcast / constant cache，不是 DRAM burst。本節的合併只針對對 global memory 的 energygrid[] 寫回。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「coarsening 省了什麼？」	省 constant memory 存取指令與冗餘算術（dy/dz/dysqdzsq/charge 取一次存 register 共用）；非省 DRAM 頻寬
「16→4 與 48→24」	處理 1 atom × 4 grid points：constant 存取 16→4 (4×)、FP ops 48→24 (2×)
「Fig. 18.8 為何寫回未合併？」	連續指派下相鄰 thread 的同一寫回敘述位址相隔 `COARSEN_FACTOR`（=4），warp 內 strided
「如何修成 coalesced？」	interleaved 指派：`i = blockIdx.xblockDim.xCF + threadIdx.x`，thread 內步距改 `blockDim.xgridspacing`、寫回 `+mblockDim.x`
「coarsening 的兩個缺點」(Exercise 3)	(1) 用更多 register/資源 → 可能降 occupancy；(2) 平行度下降（thread 變少），且若做過頭反而變慢
「coalescing 改了多少運算？」	0——運算量不變，只重排 index；屬「免費」優化
「為什麼不直接對 atoms[] 做 coalescing？」	atoms 在 constant memory（broadcast/cache），合併規則不適用；合併針對 global `energygrid[]` 寫回
「coarsening 一定傷 occupancy 嗎？」	否，只在 register 超出上限時才傷；書例仍在限內故不影響

重點總覽 (Overview)

執行緒粗化 (Thread Coarsening)

動機：同一 row 的 grid points 共用 y/z 計算

Kernel (COARSEN_FACTOR = 4)

每個 atom、每 4 個 grid points 的運算量比較

記憶體合併存取 (Memory Coalescing)

問題：連續指派造成 strided 寫回

解法：coalescing-aware 指派（interleaved）

Kernel (Fig. 18.10)

考試/面試重點 (Exam / Test Patterns)

Related Notes