Stencil 練習題 (Practice - Stencil Background and the Basic Parallel Kernel)

#practice #parallel-pattern #stencil #thread-coarsening

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案重點
Stencil order	中心點每一側的格點數 = 近似導數的階數;1D 總點數 = 2·order+1
3D seven-point stencil 幾階	order 1(每軸每側 1 點);6 鄰點 + 1 中心
Finite difference 一階近似	`f'(x) ≈ (f(x+h)−f(x−h))/(2h) + O(h²)`,誤差 ∝ `h²`
三點 stencil 係數	`[-1/2h, 0, 1/2h]`(由微分方程決定,非影像濾鏡)
Basic kernel OP/B	`13/(7×4) = 0.46 OP/B` → memory-bound
Tiled OP/B 公式	`13(T−2)³/(4T³) = (13/4)(1−2/T)³`;上限 3.25
T=8 / T=32 達成比值	1.37 / 2.68 OP/B
Halo 佔比(3D order-1, 邊長 T)	`1 − ((T−2)/T)³`;T=8 → 58%
為何 T 不能大	block size = input tile = `T³` ≤ 1024 → T≤8;且 shared mem ∝ `T³`
Thread coarsening	沿 z 方向;block 降為 T² threads → 可取 T=32;shared mem `3T²`
Register tiling	把 z 鄰居 `inPrev`/`inNext` 放 register;`inCurr_s` 留 shared;shared 降為 `T²` (1/3)
Register tiling 對 OP/B / 流量	皆不變,只把 reuse 從 shared 移到更快的 register
Stencil vs convolution 上限	stencil pattern 稀疏(只軸向鄰居),上限遠低於 dense convolution

Question 1 - Stencil Order 的定義 [recall]

情境/題目:何謂 stencil 的「order(階數)」?一個 3D seven-point stencil 是幾階?它有幾個鄰點?

查看答案

Order = 中心點每一側的格點數,也等於所近似導數的階數(1D 總點數 = 2·order+1)。3D seven-point stencil 是 order 1(x/y/z 每軸每側各 1 點),共 6 個鄰點 + 1 中心 = 7 點。

Question 2 - Finite Difference 一階近似 [recall]

情境/題目:寫出一階導數的經典有限差分近似公式,並說明其誤差項;由它推出的三點 stencil 係數為何?

查看答案

f'(x) ≈ (f(x+h) − f(x−h)) / (2h) + O(h²),誤差 O(h²) 與格距 h 的平方成正比(h 愈小愈準)。因 x = i·h,寫成 FD[i] = (F[i+1] − F[i−1])/(2h),對應係數 [−1/2h, 0, 1/2h] — 此即 1D three-point stencil。係數由所解的微分方程決定。

Question 3 - Basic Kernel 的 OP/B [recall]

情境/題目:Fig. 8.6 的基本 3D seven-point stencil sweep kernel,其 floating-point 對 global memory access 比值是多少?如何算出?這代表什麼?

查看答案

每 thread 做 7 乘 + 6 加 = 13 FLOP,載入 7 個 4-byte 輸入 = 28 B → OP/B = 13/(7×4) ≈ **0.46 OP/B**。此值太低,kernel 被 global memory 頻寬綁死(memory-bound),遠未用到算力,因此需要 shared memory tiling 提升重用。

Question 4 - 邊界格點的處理 [recall]

情境/題目:在 stencil sweep 中,grid 的邊界格點(boundary cells)如何處理?kernel 用什麼條件實作?

查看答案

邊界格點儲存 boundary condition,輸出 = 輸入、不更新(只計算內部陰影區)。kernel 以 i>=1 && i<N-1 && j>=1 && j<N-1 && k>=1 && k<N-1 關閉落在 index 0 或 N-1 的 thread。邊界形成 grid 表面的一層。

Question 5 - Stencil vs Convolution 的差異 [recall]

情境/題目:stencil sweep 在程式形態上幾乎與 convolution 相同,但有哪些本質差異使它引出不同的最佳化?

查看答案

(1) 權重係數來自所解的微分方程而非影像濾鏡;(2) 用於迭代求解 PDE;(3) 常用 3D 高精度浮點(double) 資料,佔更多 on-chip 記憶體;(4) pattern 稀疏(只取軸向鄰居,不含角落)。這些差異促成 thread coarsening 與 register tiling。

Question 6 - 為何 z 鄰居可放 Register [recall]

情境/題目:在 coarsening + register tiling kernel(Fig. 8.12)中,為何 inPrev/inNext(z 鄰居)可以放 register,但 inCurr 的 x-y 鄰居必須留在 shared memory?

查看答案

inPrev/inNext 每個元素只被相同 (x,y) 的單一 thread 在計算自己那條 column 時使用 → 不需跨 thread 共享,放 register 即可。inCurr_s 的 x-y 鄰居會被多個相鄰 thread 共享,所以必須留在 shared memory。這是因為軸向 stencil 不含角落,z 鄰居才會被單一 thread 獨用。

Question 7 - Coarsening 的方向與 Shared Memory 用量 [recall]

情境/題目:3D seven-point stencil 的 thread coarsening 沿哪個方向?coarsening 後 block size 與 shared memory 需求各變成多少?帶來什麼好處?

查看答案

沿 z 方向:block 只覆蓋一個 x-y plane(T² threads),沿 z 逐層 iterate。任一時刻只需 3 層在片上 → shared memory 從 T³ 降為 3T² 元素。好處:block 從 T³ 降到 T² → 可取 T=32(1024 threads),降低 halo 佔比、改善 coalescing,OP/B 從 1.37 升到 2.68。

Question 8 - 3D Tile 的 Halo Overhead [recall]

情境/題目:對 3D order-1 stencil 的 8×8×8 input tile,halo 元素佔比約多少?和 2D 32×32 convolution tile 相比如何?為何差這麼多?

查看答案

8³ input tile = 512,output tile 6³ = 216 → halo = 296,佔 ≈58%(公式 1 − ((T−2)/T)³)。2D 32×32 convolution tile 的 halo 僅 ≈12%。因為 3D 的 halo 是整個表面的一層(6 個面),相對體積佔比遠高於 2D 的「一圈」,這是 3D stencil tiling 報酬偏低的根本原因。

Question 9 - 計算 Tiled OP/B [application]

情境/題目:tiled 3D seven-point stencil 的 OP/B 公式為何?分別代入 T=8 與 T=32,算出達成比值(理論上限是多少?)。

查看答案

公式 OP/B = 13(T−2)³/(4T³) = (13/4)(1−2/T)³。
T=8:13×6³/(4×8³) = 2808/2048 ≈ **1.37 OP/B**。
T=32:(13/4)(30/32)³ = 3.25×0.824 ≈ **2.68 OP/B**。
T→∞ 漸近上限 = 13/4 = **3.25 OP/B**。

Question 10 - 練習題 1:Thread Block 數量 [application]

情境/題目:grid 大小 120×120×120(含邊界)。(a) 每次 sweep 計算幾個 output 格點?(b) basic kernel(Fig. 8.6,block 8×8×8)需幾個 block?(c) tiled kernel(Fig. 8.8,block 8×8×8)需幾個 block?(d) coarsening kernel(Fig. 8.10,block 32×32)需幾個 block?

查看答案

(a) 內部格點 = 118³ = **1,643,032**(邊界不更新)。
(b) 每 thread 算一個格點,⌈120/8⌉=15 → 15³ = **3375** blocks。
(c) OUT_TILE_DIM = 8−2 = 6,⌈120/6⌉=20 → 20³ = **8000** blocks。
(d) OUT_TILE_DIM = 32−2 = 30,x/y/z 各 ⌈120/30⌉=4 → 4³ = **64** blocks。

Question 11 - 練習題 2:Tile 大小與 Shared Memory [application]

情境/題目:七點 3D stencil,block 32×32,coarsening factor = 16(每 block 沿 z 處理 16 個連續 output plane)。求 (a) input tile 元素數;(b) output tile 元素數;(c) OP/B;(d) Fig. 8.10(無 register tiling)每 block shared memory;(e) Fig. 8.12(有 register tiling)每 block shared memory。

查看答案

(a) input plane 32×32,z 需 16+2 = 18 層 → 32×32×18 = **18432**。
(b) 30×30×16 = **14400**。
(c) 13×14400/(4×18432) = 187200/73728 ≈ **2.54 OP/B**。
(d) 3 層:3×32²×4B = **12288 B (12KB)**。
(e) 1 層:32²×4B = **4096 B (4KB)**(= 1/3)。

Question 12 - 為何 Stencil 的 Tiling 上限遠低於 Convolution [analysis]

情境/題目:同樣套 shared memory tiling,為何 stencil 的算術-記憶體比值上限明顯低於 convolution?以 2D five-point vs 3×3 conv、3D order-3(19-point) vs 7×7×7 conv 為例比較,並說明根本原因。

查看答案

上限 ≈ (stencil point 數)/2。2D:five-point = 2.5 OP/B,3×3 conv = 4.5 OP/B。3D:19-point = 9.5 OP/B,但 7×7×7 conv 高達 171.5 OP/B。根本原因:stencil pattern 稀疏(只取軸向鄰居,point 數隨 order 線性成長),而 convolution dense(point 數隨維度/order 多項式爆炸)。所以「把一個 grid 值載入 shared memory 的效益」對 stencil 低很多 — 而 3D 正是 stencil 主場,差距最誇張,這正是 thread coarsening / register tiling 的動機。

Question 13 - 小 Tile 傷 Coalescing 與 Register Tiling 取捨 [analysis]

情境/題目:(a) 為何 8×8×8 tile 會傷害 memory coalescing,thread coarsening 如何解決?(b) 在 coarsening 之上再加 register tiling,對 OP/B、global memory 流量、shared memory、register 各有何影響?何時該回退?

查看答案

(a) 8³ tile 最快維(x)只有 8 寬 < warp 的 32 threads → 一個 warp 跨「4 個相距 N 的 row」,無法 coalesce、浪費 DRAM 頻寬。coarsening 把 z 拆出 → block 只剩 T² threads,可用 T=32,讓一整 row(32)≥ warp 寬 → 同 row 32 threads 合併連續存取。
(b) register tiling 不改 OP/B、不改 global memory 流量(reuse 從 shared 移到更快的 register,latency 更低、bandwidth 更高 → 更快),shared memory 降為 1/3(3T²→T²),代價是每 thread 多 3 registers(32×32 block → 3072 regs/block)。若 register 變瓶頸壓低 occupancy,就把部分 plane 改回 shared memory — 典型的 shared memory ↔ register 取捨。

重點整理 (Pattern Summary，點擊展開)

Kernel 版本	Block size	On-chip 用量	可用 T	OP/B (七點)
Basic sweep (Fig 8.6)	T³ threads	無 tiling	—	0.46
Shared-mem tiling (Fig 8.8)	T³ threads	shared `T³`	實際 T=8	1.37 (T=8)
Coarsening z (Fig 8.10)	T²	shared `3T²`	T=32	2.68 (T=32)
Coarsening + Register tiling (Fig 8.12)	T²	shared `T²` + 3 regs/thr	T=32	2.68(更快)
理論上限 (T→∞)	—	—	—	3.25

Order = 每側格點數;3D 7-point = order 1。Finite difference 一階誤差 O(h²)。
Basic OP/B 13/28 = 0.46 → memory-bound;Tiled (13/4)(1−2/T)³,上限 3.25。
3D 小 tile 兩大痛點:halo 58% + coalescing 差(x 維 8 < warp 32)。
Coarsening(z) 提高 OP/B(T 變大、halo 降);Register tiling 不改 OP/B / global 流量,只把 z 鄰居搬進 register、省 2/3 shared memory。
Stencil 稀疏 → tiling 上限遠低於 dense convolution(3D 最誇張:9.5 vs 171.5 OP/B)。

Related Concepts

Question 1 - Stencil Order 的定義 [recall]

Question 2 - Finite Difference 一階近似 [recall]

Question 3 - Basic Kernel 的 OP/B [recall]

Question 4 - 邊界格點的處理 [recall]

Question 5 - Stencil vs Convolution 的差異 [recall]

Question 6 - 為何 z 鄰居可放 Register [recall]

Question 7 - Coarsening 的方向與 Shared Memory 用量 [recall]

Question 8 - 3D Tile 的 Halo Overhead [recall]

Question 9 - 計算 Tiled OP/B [application]

Question 10 - 練習題 1:Thread Block 數量 [application]

Question 11 - 練習題 2:Tile 大小與 Shared Memory [application]

Question 12 - 為何 Stencil 的 Tiling 上限遠低於 Convolution [analysis]

Question 13 - 小 Tile 傷 Coalescing 與 Register Tiling 取捨 [analysis]