Tiled Convolution 與 Halo Cells 處理

#parallel-pattern #tiling #shared-memory #halo-cells #arithmetic-intensity

重點總覽 (Overview)

項目	內容	關鍵數字 / 重點
目標	用 shared memory tiling 降低 N 的 global memory 存取	0.5 → 9.57 OP/B (5×5, 32×32 tile)
Output tile	一個 block 負責產生的輸出元素集合	大小 = `OUT_TILE_DIM`²
Input tile	算出整個 output tile 所需的全部 N 元素	每個方向各 +`FILTER_RADIUS`，含 halo
Halo cells	input tile 邊緣、屬於鄰居 tile 內部的元素	使 input tile > output tile
方案 A (Fig 7.12)	block 尺寸 = input tile，停用外圈 threads	載入單純、計算需 deactivate
方案 B (Fig 7.15)	block 尺寸 = output tile，halo 靠 L2 cache	shared memory 只放內部元素，較單純
核心比值公式	arithmetic-to-global-memory ratio	見下方公式區
tile 越大	ratio 越接近上界	32×32 是目前 GPU 上限
filter 越大	上界越高，但實際與上界落差越大	halo 變多、output tile 變小

Important

本章前兩步 (basic kernel 0.25 OP/B、constant memory 把 F 移出 DRAM → 0.5 OP/B) 已在 07-Convolution/01-Convolution-Fundamentals-and-Basic-Kernel 與 07-Convolution/02-Constant-Memory-and-Caching 處理。本篇處理「N 元素」的 DRAM 流量。

輸入塊與輸出塊 (Input Tile vs Output Tile)

Output tile：每個 block 負責計算的 P 元素集合。
Input tile：要算出該 output tile，需要載入的全部 N 元素——必須往四周各延伸 FILTER_RADIUS，才能涵蓋 output tile 邊緣元素所需的鄰居。
延伸出來的外圈即 halo cells;它們同時是相鄰 block 的內部元素(這正是 §7.5 cache 方案的依據)。

Input tile (左)  =  output tile  + 四周各 r 的 halo
            r=2 範例：output 4x4 → input 8x8

  H H H H H H H H        H = halo cell (鄰居 tile 的內部元素)
  H H H H H H H H        O = output tile 元素 (此 block 真正算的)
  H H O O O O H H
  H H O O O O H H        IN_TILE_DIM = OUT_TILE_DIM + 2*FILTER_RADIUS
  H H O O O O H H
  H H O O O O H H
  H H H H H H H H
  H H H H H H H H

尺寸	Output tile	Input tile	input/output 比
Toy (視覺化用)	4×4 = 16	8×8 = 64	4.0× (誇張，因 tile 太小)
實務 (5×5 filter)	16×16 = 256	20×20 = 400	1.56×

這是與 Tiled Matrix Multiplication 的關鍵差異

05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication 假設 input tile = output tile;convolution 的 input tile 比 output tile 大(因 halo),這讓 thread 與 tile 的對應變複雜,是本章所有麻煩的根源。

兩種 Thread 組織 (Two Thread Organizations)

	方案 A:block = input tile (Fig 7.12)	方案 B:block = output tile (習題)
載入 input tile	簡單:每 thread 載入 1 個元素	複雜:thread 需迴圈多載 (含 halo)
計算 output	需停用外圈 `FILTER_RADIUS` 層 threads	簡單:全部 thread 都算,不需停用
block / tile 尺寸	input/output 不同 → 不易是 2 的次方	可相同、可為 2 的次方
divergence	較多 (input ≠ output)	較少

本章正文採用方案 A。

方案 A:Tiled Kernel 與外圈停用 (Deactivated Exterior Threads)

#define IN_TILE_DIM 32
#define OUT_TILE_DIM ((IN_TILE_DIM) - 2*(FILTER_RADIUS))
__constant__ float F_c[2*FILTER_RADIUS+1][2*FILTER_RADIUS+1];

__global__ void convolution_tiled_2D_const_mem_kernel(float *N, float *P,
                                                      int width, int height) {
    // 1) 每個 thread 對應的「載入 / 計算」座標 (-r 偏移到 input tile 左上角)
    int col = blockIdx.x*OUT_TILE_DIM + threadIdx.x - FILTER_RADIUS;
    int row = blockIdx.y*OUT_TILE_DIM + threadIdx.y - FILTER_RADIUS;

    // 2) 協同載入 input tile 進 shared memory;ghost cell 補 0
    __shared__ float N_s[IN_TILE_DIM][IN_TILE_DIM];
    if (row >= 0 && row < height && col >= 0 && col < width) {
        N_s[threadIdx.y][threadIdx.x] = N[row*width + col];
    } else {
        N_s[threadIdx.y][threadIdx.x] = 0.0f;     // ghost cell → 0
    }
    __syncthreads();                              // barrier:確保整塊載齊

    // 3) 只有「內圈」active threads 計算 output
    int tileCol = threadIdx.x - FILTER_RADIUS;
    int tileRow = threadIdx.y - FILTER_RADIUS;
    if (col >= 0 && col < width && row >= 0 && row < height) {        // 邊界
        if (tileCol >= 0 && tileCol < OUT_TILE_DIM &&
            tileRow >= 0 && tileRow < OUT_TILE_DIM) {                 // 停用外圈
            float Pvalue = 0.0f;
            for (int fRow = 0; fRow < 2*FILTER_RADIUS+1; fRow++)
                for (int fCol = 0; fCol < 2*FILTER_RADIUS+1; fCol++)
                    Pvalue += F_c[fRow][fCol]
                            * N_s[tileRow+fRow][tileCol+fCol];        // 全部讀 shared
            P[row*width + col] = Pvalue;
        }
    }
}

外圈停用 (deactivation) 圖解 — 3×3 filter (r=1)、8×8 block/input tile、6×6 output tile:

 block = input tile (8x8),  threadIdx 範圍 0..7
 +---------------------------+
 | .  .  .  .  .  .  .  .   |   '.' = 外圈被停用的 thread (只載入,不計算)
 | .  #  #  #  #  #  #  .   |   '#' = active thread, tx,ty ∈ [1..6]
 | .  #  #  #  #  #  #  .   |
 | .  #  #  #  #  #  #  .   |   active (tx,ty) → output (tx-r, ty-r)
 | .  #  #  #  #  #  #  .   |   patch 左上角 = N_s[ty-r][tx-r]
 | .  #  #  #  #  #  #  .   |
 | .  #  #  #  #  #  #  .   |   thread(1,1) → output(0,0), patch 左上角 N_s[0][0]
 | .  .  .  .  .  .  .  .   |   thread(6,6) → output(5,5), patch 左上角 N_s[5][5]
 +---------------------------+

兩段式角色

同一個 thread 在載入階段對應 input tile 的一格 (含 halo);在計算階段只有內圈被啟用,外圈 FILTER_RADIUS 層純粹是為了把 halo 搬進 shared memory。
[!warning] 兩個獨立的 if
第一個 if 管 ghost cells(陣列邊界外,補 0);第二個 if 管 deactivation(block 比 output tile 大,停用外圈)。兩者目的不同,別混淆。

算術強度分析 (Arithmetic-to-Global-Memory Ratio)

只計算「內部 block」(halo 都不是 ghost cell);large input + small filter 時 ghost cell 影響可忽略。

運算量:每個 output 元素做 (2r+1)² 次乘加 = 2(2r+1)² ops → 整塊 OUT_TILE_DIM² · (2r+1)² · 2
記憶體:所有 global 存取都移到「載入 input tile」階段,每元素載 4 bytes → 整塊 IN_TILE_DIM² · 4 = (OUT_TILE_DIM+2r)² · 4 bytes

ratio = \frac{{OUT_TILE_DIM}^{2} (2 r + 1)^{2} \cdot 2}{(OUT_TILE_DIM + 2 r)^{2} \cdot 4} OP/B

漸近上界 (當 OUT_TILE_DIM ≫ r,分母 ≈ OUT_TILE_DIM):

{ratio}_{max} = \frac{(2 r + 1)^{2} \cdot 2}{4} OP/B

直覺:原始 kernel 中每個 N 元素被約 (2r+1)² 個 thread 重複載入;tile 無限大時每元素只載一次,故上界即「重複次數 × 2 ops / 4 bytes」。

Filter	IN_TILE_DIM	OUT_TILE_DIM	ratio (OP/B)	上界 (OP/B)
5×5 (r=2)	8	4	3.13	12.5
5×5 (r=2)	16	12	7.03	12.5
5×5 (r=2)	32	28	9.57	12.5
9×9 (r=4)	16	8	10.13	40.5
9×9 (r=4)	32	24	22.78	40.5

32×32 input tile 是目前 GPU 的上限 (1024 threads/block)。
tile 越大 → 越接近上界(halo 佔比下降)。
filter 越大 → 上界越高(每個 N 被更多 thread 用),但實際值與上界落差也越大(halo 變多 → output tile 被擠小)。

小 tile 陷阱

8×8 block 配 5×5 filter 只有 3.13 OP/B,遠低於預期。實務上常因 on-chip memory 不足(尤其 3D convolution,所需 shared memory 隨維度暴增)而被迫用小 tile,reduction 效果大打折扣。

方案 B:用 Cache 處理 Halo (Caching for Halo Cells)

核心觀察:halo cells 是鄰居 tile 的內部元素,當本 block 需要它們時,很可能已被鄰居 block 的存取帶進 L2 cache,不必再吃 DRAM 流量。因此只把 tile 內部元素載進 shared memory,halo 直接從 global memory(實則 L2)讀。

→ input tile = output tile = TILE_DIM,block 也同尺寸 (可為 2 的次方)。

#define TILE_DIM 32
__constant__ float F_c[2*FILTER_RADIUS+1][2*FILTER_RADIUS+1];

__global__ void convolution_cached_tiled_2D_const_mem_kernel(float *N, float *P,
                                                             int width, int height) {
    int col = blockIdx.x*TILE_DIM + threadIdx.x;
    int row = blockIdx.y*TILE_DIM + threadIdx.y;

    // 載入:不含 halo → 條件只需檢查一般陣列邊界 (無 ghost cell 之虞)
    __shared__ float N_s[TILE_DIM][TILE_DIM];
    if (row < height && col < width)
        N_s[threadIdx.y][threadIdx.x] = N[row*width + col];
    else
        N_s[threadIdx.y][threadIdx.x] = 0.0f;
    __syncthreads();

    if (col < width && row < height) {
        float Pvalue = 0.0f;
        for (int fRow = 0; fRow < 2*FILTER_RADIUS+1; fRow++) {
            for (int fCol = 0; fCol < 2*FILTER_RADIUS+1; fCol++) {
                if (threadIdx.x-FILTER_RADIUS+fCol >= 0 &&
                    threadIdx.x-FILTER_RADIUS+fCol < TILE_DIM &&
                    threadIdx.y-FILTER_RADIUS+fRow >= 0 &&
                    threadIdx.y-FILTER_RADIUS+fRow < TILE_DIM) {
                    // 內部 → 讀 shared memory
                    Pvalue += F_c[fRow][fCol]
                            * N_s[threadIdx.y-FILTER_RADIUS+fRow]
                                 [threadIdx.x-FILTER_RADIUS+fCol];
                } else {
                    // halo → 檢查是否 ghost cell;非 ghost 才從 global memory 讀
                    if (row-FILTER_RADIUS+fRow >= 0 && row-FILTER_RADIUS+fRow < height &&
                        col-FILTER_RADIUS+fCol >= 0 && col-FILTER_RADIUS+fCol < width) {
                        Pvalue += F_c[fRow][fCol]
                                * N[(row-FILTER_RADIUS+fRow)*width
                                    + col-FILTER_RADIUS+fCol];
                    }
                }
            }
        }
        P[row*width + col] = Pvalue;
    }
}

	方案 A (halo in shared, Fig 7.12)	方案 B (halo in cache, Fig 7.15)
Shared memory 內容	內部 + halo (整個 input tile)	只有內部元素
載入條件	含 ghost cell 判斷	只需一般邊界判斷(較單純)
計算迴圈	全讀 shared (單純)	較複雜:內部讀 shared、halo 讀 global,且 halo 仍要判 ghost
block/tile 尺寸	input ≠ output,不易是 2 的次方	input = output,可為 2 的次方
divergence	較多 (input ≠ output)	較少
halo DRAM 流量	0 (已載入 shared)	倚賴 L2 命中 (機率高但非保證)

複雜度搬家

兩方案是把複雜度在「載入階段 ↔ 計算階段」之間轉移:方案 A 把麻煩放在載入(deactivation),方案 B 把麻煩放在計算迴圈(三路判斷:內部 / ghost / 有效 halo)。

本章總結 (Chapter Summary)

Convolution 是核心平行 pattern,也是 stencil(08-Stencil/01-Stencil-Background-and-Basic-Kernel)、grid 力/位能計算(17-Iterative-MRI-Reconstruction/01-MRI-Background-and-Iterative-Reconstruction)、CNN(16-Deep-Learning/02-Convolutional-Neural-Networks-Layers) 的基礎。
三步演進:basic kernel (DRAM-bound, 0.25 OP/B) → constant memory 消除 filter 的 DRAM 存取 (0.5 OP/B) → tiling 用 shared memory 降低 N 的 DRAM 存取 (數倍提升,但增加 divergence 與程式複雜度) → L2 cache for halo 簡化程式。
算術強度分析是可遷移的重要技能;從中看出小 tile 的限制,在大 filter 與 3D convolution 尤其嚴重。
技巧可直接推廣到 1D / 3D,只是索引計算與迴圈巢狀更深。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
input tile 尺寸公式	`IN_TILE_DIM = OUT_TILE_DIM + 2*FILTER_RADIUS`
為何 input tile > output tile	需要 halo cells(output tile 邊緣元素的鄰居)
Halo vs Ghost cell	halo = 鄰居 tile 的真實內部元素;ghost = 陣列邊界外、補 0 的不存在元素
算術強度公式	`OUT²(2r+1)²·2 / (OUT+2r)²·4` OP/B
漸近上界	`(2r+1)²·2 / 4` OP/B (tile ≫ r)
5×5、32×32 tile 的 ratio	9.57 OP/B(上界 12.5)
tile 越大 ratio 怎麼變	越接近上界(halo 佔比下降)
filter 越大的影響	上界更高,但與實際值落差更大(halo 多、output tile 被擠小)
小 tile 為何危險	reduction 遠不如預期(8×8/5×5 僅 3.13 OP/B);3D 受 on-chip memory 限制更明顯
方案 A 兩個 if 各管什麼	第一個管 ghost(補 0)、第二個管 deactivation(停用外圈)
方案 B 靠什麼省流量	halo 已被鄰居 block 帶進 L2 cache,直接讀 global
方案 B 的好處	block/tile 同尺寸、可為 2 的次方,divergence 較少;代價是計算迴圈較複雜
需要 `__syncthreads()` 之處	載入 input tile 之後、開始計算之前(barrier)