Convolution 練習題 (Practice - Convolution Fundamentals and the Basic Parallel Kernel)

#practice #parallel-pattern #convolution #tiling

07-Convolution/01-Convolution-Fundamentals-and-Basic-Kernel — Convolution 基礎與基本平行 Kernel
07-Convolution/02-Constant-Memory-and-Caching — Constant Memory 與快取階層 (Caching)
07-Convolution/03-Tiled-Convolution-and-Halo-Handling — Tiled Convolution 與 Halo Cells 處理

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案 / 重點
"convolution kernel" 指什麼	指 filter (權重陣列 F)；書中改稱 filter 以免與 CUDA kernel function 撞名
filter 大小為何 `2r+1`（奇數）	使加權和對稱於中心，每側剛好 `r` 個鄰居；`r` = radius
Ghost cells	陣列邊界外不存在的輸入元素；多數應用補 0（也可填最近邊緣值）
Halo cells vs Ghost cells	halo = 鄰居 tile 的真實內部元素；ghost = 陣列外、補 0 的不存在元素
基本 kernel thread 映射	2D grid，1 thread = 1 output pixel；`outCol=bxbdimx+tx`、`outRow=bybdimy+ty`
基本 kernel OP/B	0.25 OP/B = 2 ops / 8 bytes（載 N 4B + F 4B）→ memory-bound
Constant memory 宣告 / 拷貝	`__constant__`（全域 scope）＋ `cudaMemcpyToSymbol(dest,src,size)`，容量 ≈ 64 KB
Constant memory 後 OP/B	0.5 OP/B = 2 ops / 4 bytes（F 由 constant cache 供應，不耗 DRAM）
Constant cache 何時最有效	warp 內所有 thread 讀同一位址（F 索引與 threadIdx 無關）→ 廣播
Cache vs Shared memory	cache 透明（硬體自動）；shared memory 是 scratchpad，需手動 `__shared__`
Input tile 公式	`IN_TILE_DIM = OUT_TILE_DIM + 2*FILTER_RADIUS`
Tiled 算術強度公式	`OUT²·(2r+1)²·2 / (OUT+2r)²·4` OP/B；上界 `(2r+1)²·2/4`（tile ≫ r）
5×5、32×32 input tile 的 ratio	9.57 OP/B（上界 12.5）
方案 A 兩個 if 各管什麼	第一個管 ghost（補 0）、第二個管 deactivation（停用外圈 threads）
`__syncthreads()` 放哪	載入完整 input tile 之後、開始計算 output 之前
方案 B 靠什麼省流量	halo 已被鄰居 block 帶進 L2 cache，直接讀 global 即可

Question 1 - Convolution 定義與 filter radius [recall]

用一句話定義 convolution；filter 為何大小是奇數 2r+1？書中為何把權重陣列改稱 "filter" 而非 "convolution kernel"？

查看答案

Convolution：每個輸出元素是「對應輸入元素 + 中心對稱鄰居」的 weighted sum，權重由 filter 給定。大小 2r+1 為奇數，使加權和對稱於被計算的中心元素，每側剛好 r 個鄰居（r = radius）。書中為避免與 CUDA kernel function 撞名，把權重陣列一律稱 filter (F)。

Question 2 - Ghost cells 與邊界條件 [recall]

什麼是 ghost cells？多數應用給它什麼預設值？2D convolution 的邊界條件比 1D 複雜在哪？

查看答案

Ghost cells 是靠近陣列端點時、落在邊界外不存在的輸入元素；多數應用補預設值 0（如 audio 假設錄音前後音量為 0；也有應用填最近的有效邊緣值）。2D 更複雜，因為輸出可能同時碰到水平、垂直或兩者皆有的邊界（例如 P_{1,0} 缺 2 行 + 1 列）。

Question 3 - 基本 2D Kernel 的映射與瓶頸 [recall]

基本平行 2D convolution kernel 如何把 thread 映射到 output？它的算術強度是多少、瓶頸為何？另有什麼次要問題？

查看答案

用 2D grid，每 thread 算一個 output pixel：outCol = blockIdx.x*blockDim.x + threadIdx.x、outRow = blockIdx.y*blockDim.y + threadIdx.y（與 Ch.3 ColorToGrayscale 相同）。瓶頸是 global memory bandwidth：inner loop 2 ops / 載入 8 bytes（N 4B + F 4B）= 0.25 OP/B。次要問題是邊緣 thread 處理不同數量 ghost cells 造成的 control divergence（大圖小 filter 時影響輕微到可忽略）。

Question 4 - Constant Memory 宣告與拷貝 [recall]

如何把 filter F 放進 constant memory？宣告語法、host 端拷貝 API、容量限制、以及 kernel 如何存取 F 分別為何？

查看答案

宣告：__constant__ float F[..][..];，必須放在所有函式之外（global scope）。Host 拷貝用專用 API cudaMemcpyToSymbol(dest, src, size)（不是 cudaMemcpy），它告知 runtime 這份資料 kernel 期間不會被改。容量約 64 KB。Kernel 把 F 當全域變數直接存取，不必當參數傳入（跨檔需 extern 宣告）。

Question 5 - 為何 F 適合 Constant Cache [recall]

書中列出 F 的哪三項存取特性使它適合 constant memory + caching？constant cache 在什麼存取模式下效益最大？最終把 OP/B 提升到多少？

查看答案

三特性：(1) 體積小（radius 通常 ≤ 7，3D filter ≤ 7³=343）；(2) kernel 執行中內容不變（read-only）；(3) 所有 thread 以相同順序存取 F。當一個 warp 內 32 個 thread 同時讀同一位址（F 索引與 threadIdx 無關），硬體可一次廣播，提供龐大頻寬。因 F 視為永遠命中、不耗 DRAM，OP/B 由 0.25 翻倍到 0.5 OP/B（2 ops / 4 bytes，只剩 N 的 4B）。

Question 6 - Input/Output Tile、Halo 與兩個 if [recall]

寫出 input tile 與 output tile 的尺寸關係公式；halo cells 與 ghost cells 差在哪？方案 A (Fig 7.12) 的兩個 if 條件各負責什麼？__syncthreads() 該放在哪？

查看答案

公式：IN_TILE_DIM = OUT_TILE_DIM + 2*FILTER_RADIUS（每方向各延伸 r 以涵蓋 halo）。Halo = 鄰居 tile 的真實內部元素；ghost = 陣列外、補 0 的不存在元素。方案 A 第一個 if 管 ghost cell（邊界外補 0），第二個 if 管 deactivation（block 比 output tile 大，停用外圈 FILTER_RADIUS 層 threads）。__syncthreads() 放在載入完整 input tile 之後、開始計算之前（barrier）。

Question 7 - 計算 1D Convolution 輸出 [application]

對 N = {4, 1, 3, 2, 3}、filter F = {2, 1, 4}（r=1，ghost = 0）做 1D convolution，求輸出陣列。

查看答案

用 y[i] = F[0]·x[i-1] + F[1]·x[i] + F[2]·x[i+1]，邊界 ghost = 0：

y[0] = 2·0 + 1·4 + 4·1 = 8
y[1] = 2·4 + 1·1 + 4·3 = 21
y[2] = 2·1 + 1·3 + 4·2 = 13
y[3] = 2·3 + 1·2 + 4·3 = 20
y[4] = 2·2 + 1·3 + 4·0 = 7

輸出 = {8, 21, 13, 20, 7}。

Question 8 - Tiled Kernel 的資源用量與比值計算 [application]

用方案 A (Fig 7.12) 對 N×N 陣列、M×M filter、T×T output tile 做 tiled 2D convolution：(a) 需要幾個 thread block？(b) 每 block 幾個 thread？(c) 每 block 需多少 shared memory？(d) 並驗證 5×5 filter、32×32 input tile 的算術強度。

查看答案

令 M = 2r+1、IN_TILE_DIM = T + 2r = T + M - 1。

(a) blocks = ⌈N/T⌉²（每 block 一個 output tile）。
(b) threads/block = IN_TILE_DIM² = (T+M-1)²（方案 A：block 尺寸 = input tile）。
(c) shared memory = (T+M-1)² × 4 bytes。
(d) 5×5 (r=2)、IN=32 → OUT=28：ratio = 28²·25·2 / (32²·4) = 39200/4096 ≈ 9.57 OP/B（上界 (2r+1)²·2/4 = 12.5）。

Question 9 - 方案 A vs 方案 B (halo in shared vs halo in cache) [analysis]

比較「方案 A：block = input tile、把 halo 載進 shared memory」與「方案 B：block = output tile、halo 靠 cache」兩種 tiled 設計，在載入複雜度、計算複雜度、divergence、block/tile 尺寸與 halo DRAM 流量上的取捨。

查看答案

兩者把複雜度在「載入 ↔ 計算」之間搬家。方案 A：載入單純（每 thread 載 1 元素）、但需停用外圈 threads且 input≠output tile（不易為 2 的次方、divergence 較多）；halo 已在 shared，halo DRAM 流量為 0。方案 B：input = output = TILE_DIM（可為 2 的次方、divergence 較少、載入只需檢查一般邊界），但計算迴圈較複雜（內部讀 shared、halo 讀 global 且仍要判 ghost），halo 倚賴 L2 cache 命中（機率高但非保證）。

Question 10 - Constant Cache vs Shared Memory，及大 Filter 的權衡 [analysis]

(a) constant cache 與 shared memory 在「透明度 / 使用機制」上最大的差異是什麼？(b) 為何 filter 越大、算術強度的「上界」越高，但「實際值與上界的落差」也越大？小 tile 為何危險？

查看答案

(a) 透明度不同：cache（含 constant cache）對程式透明，程式只存取原始 global 變數、硬體自動快取並記住位址；shared memory 是 scratchpad，須手動宣告 __shared__ 並明確拷貝。(b) filter 越大，每個 N 元素被約 (2r+1)² 個 thread 重用，故上界 (2r+1)²·2/4 越高；但 halo 隨 r 增大、把 output tile 擠小，使實際 ratio 與上界落差變大。小 tile 危險：8×8 input tile 配 5×5 filter 僅 3.13 OP/B（halo 佔比過高），3D convolution 因 on-chip memory 不足常被迫用小 tile，reduction 效果大打折扣。

重點整理 (Pattern Summary，點擊展開)

階段 / 優化	機制	算術強度 (OP/B)
基本 2D kernel (Fig 7.7)	1 thread = 1 output；N、F 都讀 DRAM	0.25（2 ops / 8 B）
+ Constant memory (Fig 7.9)	F 放 `__constant__`，constant cache 廣播、不耗 DRAM	0.5（2 ops / 4 B）
+ Tiled / shared (Fig 7.12)	協同載 input tile 進 shared，停用外圈、N 重用	5×5/32 tile → 9.57
+ Cache for halo (Fig 7.15)	shared 只放內部元素，halo 靠 L2；block/tile 同尺寸	簡化程式、divergence 較少

概念	一句話
filter / radius	權重陣列 F，大小 `2r+1` 奇數對稱，`r` = radius
ghost cell	陣列邊界外不存在的輸入，補 0
halo cell	input tile 外圈，是鄰居 tile 的真實內部元素
constant cache	唯讀、無寫入邏輯 → 省面積/功耗；warp 同址讀可廣播
input tile 公式	`IN_TILE_DIM = OUT_TILE_DIM + 2*FILTER_RADIUS`
tiled 上界	`(2r+1)²·2/4` OP/B（tile ≫ r）；大 filter 上界高但落差大、小 tile 危險

Related Concepts

Question 1 - Convolution 定義與 filter radius [recall]

Question 2 - Ghost cells 與邊界條件 [recall]

Question 3 - 基本 2D Kernel 的映射與瓶頸 [recall]

Question 4 - Constant Memory 宣告與拷貝 [recall]

Question 5 - 為何 F 適合 Constant Cache [recall]

Question 6 - Input/Output Tile、Halo 與兩個 if [recall]

Question 7 - 計算 1D Convolution 輸出 [application]

Question 8 - Tiled Kernel 的資源用量與比值計算 [application]

Question 9 - 方案 A vs 方案 B (halo in shared vs halo in cache) [analysis]

Question 10 - Constant Cache vs Shared Memory，及大 Filter 的權衡 [analysis]