Convolution 基礎與基本平行 Kernel

#parallel-pattern #convolution #control-divergence #ghost-cells #arithmetic-intensity

重點總覽 (Overview)

項目	內容	關鍵數值 / 備註
Convolution	每個輸出 = 對應輸入元素及其鄰居的 weighted sum	權重來自 filter (convolution kernel)
Filter / radius	filter 大小為奇數 `2r+1`，對稱於中心	每側 `r` 個鄰居，`r` = radius
維度	1D (audio)、2D (image)、3D (video)	本章程式碼以 2D 平方 filter 為例
Ghost cells	邊界外不存在的輸入元素	多數應用填 0（也可填邊緣值）
Thread mapping	2D grid，每 thread 算 1 個 output pixel	與 Ch.3 ColorToGrayscale 相同
Control divergence	邊界 thread 處理不同數量 ghost cells	大圖小 filter 時影響「輕微到可忽略」
效能瓶頸	global memory bandwidth	0.25 OP/B（2 ops / 8 bytes）

命名衝突

書中為避免與 CUDA kernel function 混淆，把 convolution 的權重陣列一律稱為 filter (F)，而非 "convolution kernel"。

卷積定義 (Convolution Definition)

Convolution：每個輸出元素是「對應輸入元素 + 中心對稱鄰居」的加權和；權重由 filter array 給定。

1D Convolution

輸入 x[0..n-1]、filter f[0..2r]（共 2r+1 個權重）。
輸出（PMPP 寫法，實為 cross-correlation）：

y [i] = \sum_{j = - r}^{r} f [j + r] \times x [i + j]

filter 大小是 奇數 2r+1 → 加權和對稱於被計算的中心元素，每側 r 個。
每個 y[i] = x 子陣列（起點 x[i-r]）與 f 的 inner product。

範例 (r=2):  x = [8, 2, 5, 4, 1, 7, 3]   f = [1, 3, 5, 3, 1]

           x[0] x[1] x[2] x[3] x[4]
            8    2    5    4    1
            ×    ×    ×    ×    ×
           f[0] f[1] f[2] f[3] f[4]
            1    3    5    3    1
            ----------------------
   y[2] =  8 + 6 + 25 + 12 + 1 = 52      (中心對齊 x[2])

  視窗右移一格：
   y[3] = 1·2 + 3·5 + 5·4 + 3·1 + 1·7 = 47   (中心對齊 x[3])

2D Convolution

filter 為 2D，大小 (2r_x+1) × (2r_y+1)（典型為正方形）。

P_{y, x} = \sum_{j = - r_{y}}^{r_{y}} \sum_{k = - r_{x}}^{r_{x}} f_{j + r_{y}, k + r_{x}} \times N_{y + j, x + k}

做法：取以對應位置為中心的輸入子陣列，與 filter 逐元素相乘 (product array)，再把整個 product array 加總。
範例：5×5 filter（r_x=r_y=2），P_{2,2} = 以 N_{2,2} 為中心的 5×5 子陣列與 f 的內積。

應用

Image blur 用平均型 filter 平滑訊號；Gaussian filter 可銳化物件邊界與邊緣。Convolution 是 signal/image/video processing 與 computer vision 的核心運算。

邊界條件與 Ghost Cells (Boundary Conditions & Ghost Cells)

因卷積靠鄰居定義，靠近陣列端點的輸出會缺少輸入元素。
Ghost cells：那些不存在的邊界外輸入元素。
慣例：缺的元素填 預設值 0（如 audio 假設錄音前後音量為 0）。

1D 邊界 (計算 y[1], r=2):

  ghost            x[0] x[1] x[2] x[3]
   [0]    →   缺左側 1 格
    0      8    2    5    4
    ×      ×    ×    ×    ×
   f[0]  f[1] f[2] f[3] f[4]
    1     3    5    3    1
   y[1] = 1·0 + 3·8 + 5·2 + 3·5 + 1·4 = 53
            ↑ ghost (跳過 / 視為 0)

2D 更複雜：輸出可能同時碰到水平、垂直或兩者皆有的邊界。例如 P_{1,0}（r=2）會缺 1 列 + 2 行（少 1 個 ghost 列、2 個 ghost 行）。

並非永遠填 0

有些應用讓 ghost cells 取「最近的有效邊緣元素值」而非 0。此外 tiling 還會產生另一種 ghost/halo cells（見 07-Convolution/03-Tiled-Convolution-and-Halo-Handling），會影響 tiling 的效率。

基本平行 2D Kernel (Basic Parallel 2D Kernel)

Thread → Data 映射

output 是 2D → 用 2D grid，每個 thread 算一個 output element（pixel）。
thread 的 (x, y) index 直接對應 output pixel 的 (x, y)。與 Ch.3 的 ColorToGrayscale 完全相同。

 16×16 image，4×4 blocks，每 block 4×4 threads：

   Grid (4×4 blocks)              Block(1,1) 內 thread(1,1)
   +----+----+----+----+          → P[1*4+1][1*4+1] = P[5][5]
   |b00 |b10 |b20 |b30 |
   +----+----+----+----+          算 P[5][5] 需要的輸入 patch：
   |b01 |b11 |b21 |b31 |             x: outCol-r .. outCol+r = 3..7
   +----+----+----+----+             y: outRow-r .. outRow+r = 3..7
   |b02 |b12 |b22 |b32 |          左上角 (outCol-r, outRow-r) = patch 起點
   +----+----+----+----+
   |b03 |b13 |b23 |b33 |          input (global memory)        output
   +----+----+----+----+          +---------------+           +-------+
                                  |   . . . . .   | r          | P     |
   outRow,outCol = block*dim      | . [patch ] .  |---map----> | pixel |
                + threadIdx       |   . . . . .   |            +-------+

Kernel 程式碼 (Fig. 7.7)

__global__ void convolution_2D_basic_kernel(float *N, float *F, float *P,
                                            int r, int width, int height) {
    // 02-03: 由 block/thread index 算出本 thread 負責的 output 座標
    int outCol = blockIdx.x * blockDim.x + threadIdx.x;
    int outRow = blockIdx.y * blockDim.y + threadIdx.y;

    float Pvalue = 0.0f;                 // 用 register 累加，省 DRAM 頻寬
    for (int fRow = 0; fRow < 2*r+1; fRow++) {
        for (int fCol = 0; fCol < 2*r+1; fCol++) {
            int inRow = outRow - r + fRow;   // 對應輸入列
            int inCol = outCol - r + fCol;   // 對應輸入行
            // 09: 檢查是否為 ghost cell（在 N 邊界外）；是 → 跳過 (視為 0)
            if (inRow >= 0 && inRow < height &&
                inCol >= 0 && inCol < width) {
                // 10: 1 次乘 + 1 次加 = 2 ops；載入 F+N = 8 bytes
                Pvalue += F[fRow*(2*r+1) + fCol] * N[inRow*width + inCol];
            }
        }
    }
    P[outRow*width + outCol] = Pvalue;   // 14: 寫回 output
}

五個重點參數：N（輸入）、F（filter）、P（輸出）、r（radius）、width、height。
Pvalue 是 register 累加器，最後一次性寫回 P，避免反覆讀寫 DRAM。
if 直接跳過 ghost cell 的乘加（因為乘 0 無貢獻）。

Control Divergence 與頻寬瓶頸 (Divergence & Bandwidth)

觀察一：Control Divergence

靠近 P 四個邊緣的 thread 需處理 ghost cells，且各自碰到的 ghost 數量不同（line 09 的 if 分支結果不一致）。
例：算 P[0][0] 的 thread 多數時候跳過乘加；算 P[0][1] 跳得較少…形成分歧。
成本取決於 width、height 與 filter r。

大圖小 filter → 影響可忽略

對大型輸入陣列 + 小 filter，divergence 只發生在「一小部分」邊緣輸出。由於卷積常用於大圖，control divergence 的影響「從輕微到可忽略」。詳見 04-Compute-Architecture-And-Scheduling/02-Warps-SIMD-and-Control-Divergence。

觀察二：Memory Bandwidth（更嚴重）

line 10 每次 multiply-accumulate：
- 運算：2 ops（1 乘 + 1 加）
- 記憶體：載入 F 元素 4 bytes + N 元素 4 bytes = 8 bytes

Arithmetic Intensity = \frac{2 OP}{8 B} = 0.25 OP/B

與 basic matrix multiplication 同病：只能跑到 peak 效能的極小比例 → memory-bound。

兩個後續優化方向

把 F 放進 constant memory + constant cache → 不再耗 DRAM 頻寬讀 filter → 升到 0.5 OP/B（見 07-Convolution/02-Constant-Memory-and-Caching）。
用 shared-memory tiling 重用 N 元素 → 大幅提升比值（見 07-Convolution/03-Tiled-Convolution-and-Halo-Handling）。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
filter 大小為何是奇數 `2r+1`？	使加權和對稱於中心元素，每側剛好 `r` 個鄰居；`r` 即 radius
"convolution kernel" 指什麼？	指 filter (權重陣列)；書中改稱 filter 以免與 CUDA kernel function 混淆
Ghost cells 是什麼？預設值？	邊界外不存在的輸入；通常填 0（也可填最近邊緣值）
計算 1D `y[i]` / 2D `P_{y,x}`	以該位置為中心取輸入子陣列，與 filter 做 inner product（逐元素乘後加總）
基本 2D kernel 的 thread 映射	2D grid，1 thread = 1 output pixel；`outCol=bxbdimx+tx`, `outRow=bybdimy+ty`
為何 `if` 條件存在	偵測 ghost cell；落在 N 範圍外則跳過乘加（乘 0 無貢獻）
Control divergence 來源	邊緣 thread 處理不同數量 ghost cells；大圖小 filter 時影響小
基本 kernel 的瓶頸 / OP/B	memory bandwidth；`0.25 OP/B`＝2 ops / 8 bytes（F 4B + N 4B）
為何用 `Pvalue` register	累加中間結果，最後才寫 P，節省 DRAM 頻寬
1D filter `[1/3 1/3 1/3]` 作用	區域平均 → blur/smoothing（考題常見的 filter 語意判讀）

重點總覽 (Overview)

卷積定義 (Convolution Definition)

1D Convolution

2D Convolution

邊界條件與 Ghost Cells (Boundary Conditions & Ghost Cells)

基本平行 2D Kernel (Basic Parallel 2D Kernel)

Thread → Data 映射

Kernel 程式碼 (Fig. 7.7)

Control Divergence 與頻寬瓶頸 (Divergence & Bandwidth)

觀察一：Control Divergence

觀察二：Memory Bandwidth（更嚴重）

考試/面試重點 (Exam / Test Patterns)

Related Notes