Stencil 的 Shared Memory Tiling

#parallel-pattern #stencil #tiling #shared-memory #arithmetic-intensity

重點總覽 (Overview)

項目	內容
核心技術	shared memory tiling（設計與 convolution 幾乎相同，僅少數差異）
input tile	每維 `T` 個 grid points（cube `T³`）
output tile	每維 `T-2`（order-1 stencil），即 `(T-2)³` 個有效輸出
block size	= input tile size = `T³` → 受 1024 上限限制，實務 T ≤ 8（8³=512 threads）
達成 ratio (T=8)	1.37 OP/B（遠低於上限）
漸近上限 (T→∞)	13/4 = 3.25 OP/B
stencil vs conv 上限	stencil 因「稀疏（只取軸向鄰居）」上限遠低於 convolution
兩大缺點	① halo overhead 高（3D 8³ tile 達 ~58%）② 小 tile 使 coalescing 變差

Important

相較 08-Stencil/01-Stencil-Background-and-Basic-Kernel 的 basic kernel 只有 0.46 OP/B，tiling 可拉高比值；但因 stencil 為 3D + 稀疏 pattern，提升幅度遠不如 convolution，這正是後續 thread coarsening 與 register tiling 的動機。

與 Convolution 的相似與差異 (Tiling Design vs Convolution)

shared memory tiling 對 stencil 與 convolution 幾乎相同：所有 convolution 載入 input tile 的策略都可直接套用。但有幾個微妙卻重要的差異：

input tile 不含角落 (corner) grid points：2D five-point stencil 只取上下左右四個鄰居 + 中心，不取對角線；3×3 convolution 則包含 9 個（含角落）。
此「無角落」特性是 08-Stencil/03-Thread-Coarsening-and-Register-Tiling 中 register tiling 能成立的關鍵。
stencil 資料常為 double/high-precision 浮點 → 同樣的 tile 佔用更多 on-chip 記憶體，使 tiling 更吃緊。

2D five-point stencil 的 input tile          3×3 convolution 的 input tile
(中心 + 4 個軸向鄰居，無角落)               (3×3 全含角落)

        . X . .                                X X X .
        X X X .                                X X X .
        . X . .                                X X X .
        . . . .                                . . . .
   → 每個 output 只讀 5 個 input             → 每個 output 讀 9 個 input

算術-記憶體比值的上限 (Arithmetic-to-Memory Ratio Upper Bounds)

把 input 完美重複利用（每個 grid 值只從 global memory 載入一次）時的理論上限。每個 stencil point 貢獻約 2 個 op（1 mult + 1 add），故：

上限 OP/B ≈ (stencil point 數) / 2

Pattern	維度/order	point 數	上限 OP/B
stencil five-point	2D order 1	5	2.5
convolution 3×3	2D	9	4.5
stencil nine-point	2D order 2	9	4.5
convolution 5×5	2D	25	12.5
stencil 13-point	2D order 3	13	6.5
convolution 7×7	2D	49	24.5
stencil 19-point	3D order 3	19	9.5
convolution 7×7×7	3D	343	171.5

Important

關鍵結論：stencil 因 pattern 稀疏（只取軸向鄰居，數量隨 order 線性成長），重用度遠低於 dense convolution（point 數隨維度/order 多項式爆炸）。3D 情境差距最大（9.5 vs 171.5 OP/B），代表「把一個 grid 值載入 shared memory 的效益」對 stencil 而言低很多 —— 而 3D 正是 stencil 的主要應用場景。

Tiled 3D Seven-Point Stencil Kernel (Fig. 8.8)

block 大小 = input tile 大小，多載入的 thread 在計算階段被關閉（與 tiled convolution 同套路）。

#define IN_TILE_DIM  8
#define OUT_TILE_DIM (IN_TILE_DIM - 2)   // order-1：每側少 1，故 -2

// c0..c6 為 stencil 係數（依微分方程而定，常放 constant memory）
__global__ void stencil_kernel(float* in, float* out, unsigned int N) {
    // 每個 thread 對應 input patch 起點；-1 = stencil order（每側 1 點）
    int i = blockIdx.z * OUT_TILE_DIM + threadIdx.z - 1;   // z
    int j = blockIdx.y * OUT_TILE_DIM + threadIdx.y - 1;   // y
    int k = blockIdx.x * OUT_TILE_DIM + threadIdx.x - 1;   // x

    __shared__ float in_s[IN_TILE_DIM][IN_TILE_DIM][IN_TILE_DIM];

    // (1) 協同載入 input tile；i/j/k >=0 與 <N 擋掉 ghost cells（越界）
    if (i >= 0 && i < N && j >= 0 && j < N && k >= 0 && k < N) {
        in_s[threadIdx.z][threadIdx.y][threadIdx.x] = in[i*N*N + j*N + k];
    }
    __syncthreads();                       // 等所有 tile 元素就位

    // (2) 關閉邊界 thread（boundary 保持 initial condition 不計算）
    if (i >= 1 && i < N-1 && j >= 1 && j < N-1 && k >= 1 && k < N-1) {
        // (3) 關閉只為載入而存在的 halo thread；僅內部 thread 算 output
        if (threadIdx.z >= 1 && threadIdx.z < IN_TILE_DIM-1 &&
            threadIdx.y >= 1 && threadIdx.y < IN_TILE_DIM-1 &&
            threadIdx.x >= 1 && threadIdx.x < IN_TILE_DIM-1) {
            out[i*N*N + j*N + k] =
                  c0 * in_s[threadIdx.z]  [threadIdx.y]  [threadIdx.x]
                + c1 * in_s[threadIdx.z]  [threadIdx.y]  [threadIdx.x-1]
                + c2 * in_s[threadIdx.z]  [threadIdx.y]  [threadIdx.x+1]
                + c3 * in_s[threadIdx.z]  [threadIdx.y-1][threadIdx.x]
                + c4 * in_s[threadIdx.z]  [threadIdx.y+1][threadIdx.x]
                + c5 * in_s[threadIdx.z-1][threadIdx.y]  [threadIdx.x]
                + c6 * in_s[threadIdx.z+1][threadIdx.y]  [threadIdx.x];
        }
    }
}

減去的 1 = stencil 的 order（每側點數）；高階 stencil 要減更大值。
兩層 if：外層擋 grid boundary（不更新），內層關掉只負責載入 halo 的多餘 thread。

達成比值與 T 的硬體限制 (Achieved Ratio & the T Limit)

input tile = cube，每維 T；active threads = (T-2)³，每個做 13 ops；每個 block 載入 T³ 個 4-byte 值：

OP/B = 13(T-2)³ / (4·T³) = (13/4)·(1 - 2/T)³

T	block threads	達成 OP/B
8	512	1.37
→ ∞	—	3.25（= 13/4 上限）

計算驗證（T=8）：13 × 6³ / (4 × 8³) = 13×216 / (4×512) = 2808/2048 ≈ 1.37。

Warning

為何上限是 13/4 而非 7/2？ 一般「point 數/2」估法給 seven-point 為 14/4=3.5；但實際 kernel 只需 6 次加法（加 7 個值），故 op = 7 mult + 6 add = 13，漸近上限為 13/4 = 3.25 OP/B。

Warning

T 受 block size 卡死：block size = input tile size = T³，受 1024 thread 上限約束 → 實務 T ≤ 8（8³=512）。此外 shared memory 用量 ∝ T³，更大的 T 會急速吃光 shared memory。對比 convolution 處理 2D 影像可輕鬆用 32×32 大 tile，stencil 被迫用小 tile。

缺點一：3D Tile 的 Halo Overhead

T 變小 → halo（邊界載入但不產出 output 的元素）佔比飆升 → 重用率與 OP/B 雙降。halo 元素比非 halo 元素重用次數少。

情境	input tile	output tile	halo 元素	halo 比例
2D convolution radius 1	32×32 = 1024	30×30 = 900	124	~12%
3D stencil order 1	8×8×8 = 512	6×6×6 = 216	296	~58%

3D 8×8×8 input tile（一層 x-y 切面，外圈為 halo，內部 6×6 才算 output）

   H H H H H H H H      H = halo（只載入，不產 output）
   H · · · · · · H      · = output（6×6=36 / 切面）
   H · · · · · · H
   H · · · · · · H      整顆 cube：output 216，halo 296 → 58% 是 halo！
   H · · · · · · H
   H · · · · · · H
   H · · · · · · H
   H H H H H H H H

Important

3D 的 halo 是整層表面（6 個面），相對於體積佔比遠高於 2D 的「一圈」。這是 3D stencil tiling 報酬偏低的根本原因。

缺點二：小 Tile 傷害 Memory Coalescing

8×8×8 tile 中，最快變化的維度（x）只有 8 寬，但一個 warp 有 32 threads：

warp（32 threads）需載入 tile 的「4 個不同 row」，每 row 僅 8 個值：

  row(j=0): t0 t1 t2 t3 t4 t5 t6 t7      ← global memory 位置 A
  row(j=1): t8 ...               t15     ← 距 A 為 N（遠）
  row(j=2): t16...               t23     ← 又距一個 N
  row(j=3): t24...               t31     ← 又距一個 N

  → 同一 load 指令落在「至少 4 個相距甚遠」的 global memory 位置
  → 無法 coalesce，DRAM burst 利用率低、頻寬浪費

Tip

若 T 能 ≥ 32（讓一整個 row ≥ warp 寬），同一 row 的 32 threads 就能合併成連續存取。但 block size = T³ 的限制使 3D 無法做到 → 動機就是把 z 維拆出去做 06-Performance-Considerations/03-Thread-Coarsening 式的 coarsening，讓 block 只剩 T² threads，便可用 T=32。詳見 08-Stencil/03-Thread-Coarsening-and-Register-Tiling。

參見 06-Performance-Considerations/01-Memory-Coalescing 了解 coalescing 與 DRAM burst 的底層機制。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「tiled 3D seven-point 的 OP/B 公式」	`13(T-2)³ / (4T³)`；漸近上限 3.25 OP/B（=13/4）
「T=8 的實際比值」	1.37 OP/B（`13·6³/(4·8³)`）
「為何 stencil 上限比 convolution 低很多」	stencil pattern 稀疏（只軸向鄰居），point 數隨 order 線性；convolution dense，point 數隨維度多項式爆炸
「2D five-point vs 3×3 conv 上限」	2.5 vs 4.5 OP/B（上限 ≈ point 數 / 2）
「3D order-3 stencil vs 7×7×7 conv 上限」	9.5 vs 171.5 OP/B（差距最誇張）
「為何 T 不能大」	block size = input tile = `T³` ≤ 1024 → T ≤ 8；且 shared memory ∝ `T³`
「8³ tile 的 halo 比例」	~58%（halo 296 / 512）；2D 32² conv 僅 ~12%
「小 tile 為何傷 coalescing」	x 維僅 8 寬 < warp 32 → 一 warp 跨 4 個相距 N 的 row，無法合併
「減去的常數 1 代表什麼」	stencil 的 order（每側點數）；高階要減更大
「兩層 if 的作用」	外層擋 grid boundary（不更新）；內層關掉只載入 halo 的多餘 thread

重點總覽 (Overview)

與 Convolution 的相似與差異 (Tiling Design vs Convolution)

算術-記憶體比值的上限 (Arithmetic-to-Memory Ratio Upper Bounds)

Tiled 3D Seven-Point Stencil Kernel (Fig. 8.8)

達成比值與 T 的硬體限制 (Achieved Ratio & the T Limit)

缺點一：3D Tile 的 Halo Overhead

缺點二：小 Tile 傷害 Memory Coalescing

考試/面試重點 (Exam / Test Patterns)

Related Notes