Tiling 技術與 Tiled 矩陣乘法 Kernel (Tiling & Tiled Matrix Multiplication)

#memory-and-performance #tiling #shared-memory #matrix-multiplication #barrier-synchronization

重點總覽 (Overview)

項目	內容	關鍵數字
核心問題	Global memory 大但慢；naive matmul 的 compute-to-memory ratio 僅 0.25 OP/B，是 memory-bound	A100 僅達 389 GFLOPS (峰值 2%)
策略 (Tiling)	將資料切成可放進 shared memory 的 tiles，threads 協同載入並重複使用	減少 global memory 流量
協同載入 (Collaborative Loading)	每個 thread 載入 1 個 M、1 個 N 元素到 shared memory，全 block 共用	每元素只從 global 讀 1 次
執行分階段 (Phases)	dot product 拆成 `Width / TILE_WIDTH` 個 phase，每 phase 處理一對 tile	strip-mining
同步 (Barriers)	載入後 / 使用後各一個 `__syncthreads()`，防止 RAW / WAR hazard	2 個 barrier
效益	global memory 存取量降為 1/TILE_WIDTH，ratio 提升至 TILE_WIDTH × 0.25	16×16 tile → 4 OP/B → 6220 GFLOPS
限制	Shared memory 容量小；Fig 5.9 假設方陣且 Width 是 TILE_WIDTH 倍數	邊界處理見 05-Memory-Architecture-And-Data-Locality/03-Boundary-Checks-and-Memory-Occupancy

Important

本篇接續 05-Memory-Architecture-And-Data-Locality/01-Memory-Access-Efficiency-and-CUDA-Memory-Types 的動機（compute-to-global-memory-access ratio、memory-bound）與記憶體類型，把 shared memory 實際用在演算法上。

Tiling 的核心思想 (Tiling for Reduced Memory Traffic)

記憶體的內在權衡 (The Tradeoff)

Memory	大小	速度	角色
Global memory	大 (GB)	慢 (off-chip DRAM)	存整個矩陣
Shared memory	小 (KB/block)	快 (on-chip scratchpad)	存一塊 tile

Tile 比喻：大牆面 (global data) 由一片片小磁磚 (能放進 shared memory 的子集) 覆蓋。
前提：tile 上的計算必須彼此獨立才能切割。
並非所有資料結構都能切成 tile；要看 kernel 的存取模式是否具備重用性 (data reuse)。

為何 matmul 適合 tiling — 存取重疊分析

在 block(0,0) 中，每個 M、N 元素都被該 block 的 threads 存取剛好兩次（2×2 block）。若讓 threads 協同，只從 global memory 載入一次即可。

block(0,0) 對 global memory 的存取 (Fig 5.6 概念)
時間 →
thread(0,0):  M0,0  M0,1  M0,2  M0,3   N0,0  N1,0  N2,0  N3,0
thread(0,1):  M0,0  M0,1  M0,2  M0,3   N0,1  N1,1  N2,1  N3,1   ← M row0 與 t(0,0) 完全重疊
thread(1,0):  M1,0  M1,1  M1,2  M1,3   N0,0  N1,0  N2,0  N3,0   ← N col0 與 t(0,0) 完全重疊
thread(1,1):  M1,0  M1,1  M1,2  M1,3   N0,1  N1,1  N2,1  N3,1
              └── 同 row 的 thread 共用 M ──┘ └── 同 col 的 thread 共用 N ──┘

潛在減量 ∝ block 維度

用 Width×Width 的 block，global memory 流量最多可降為原本的 1/Width。例如 16×16 block → 降為 1/16。

執行分階段 (Execution Phases) 與協同載入

將 dot product 拆成多個 phase，每 phase 全 block 協同把一對 tile 載入 shared 陣列 Mds、Nds，再用之計算部分內積。

Tiled matmul 的兩個 phase (Fig 5.8, block(0,0), TILE_WIDTH=2)

 Phase 1                              Phase 2
┌──────────────┐                    ┌──────────────┐
│ load M tile0 │  M0,0 M0,1         │ load M tile1 │  M0,2 M0,3
│ load N tile0 │  M1,0 M1,1         │ load N tile1 │  M1,2 M1,3
│  __sync      │                    │  __sync      │
│ Pvalue +=    │  Mds·Nds (k=0,1)   │ Pvalue +=    │  Mds·Nds (k=0,1)
│  __sync      │                    │  __sync      │
└──────────────┘                    └──────────────┘
        Mds / Nds 在各 phase 間「重複使用」(reuse) → 只需很小的 shared memory

phase 數 = Width / TILE_WIDTH；每 phase 處理一對 TILE_WIDTH 寬的 tile。
shared 陣列 Mds/Nds 跨 phase 重複使用 → 小容量即可服務大量 global 存取。
此「聚焦於小子集」的存取行為稱為 locality (資料局部性)，是高速記憶體能發揮作用的根本原因（CPU cache 亦同）。

Tiled 矩陣乘法 Kernel (A Tiled Matrix Multiplication Kernel)

完整 Kernel (Fig 5.9)

#define TILE_WIDTH 16

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    // 04-05: shared memory tiles，scope = block（每 block 一份）
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    // 07-08: 存進 register（縮短名稱）
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // 11-12: 此 thread 負責的 P 元素位置
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    // 16: strip-mining 的外層迴圈，逐 phase 計算
    float Pvalue = 0;
    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
        // 19-20: 協同載入一對 tile 到 shared memory
        Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();                 // 21: RAW barrier（載入完才能讀）

        // 23: 內層迴圈，對 tile 做部分內積
        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();                 // 26: WAR barrier（讀完才能覆寫）
    }
    P[Row*Width + Col] = Pvalue;         // 29: 寫回結果
}

Index 計算 (Fig 5.10)

量	公式	說明
Col (P 的 x)	`bx*TILE_WIDTH + tx`	前面 bx 個 block 各蓋 TILE_WIDTH 欄，再加 tx
Row (P 的 y)	`by*TILE_WIDTH + ty`	同理（垂直方向）
載入的 M	`M[RowWidth + phTILE_WIDTH + tx]`	列固定為 Row，欄起點 `ph*TILE_WIDTH` 偏移 tx
載入的 N	`N[(phTILE_WIDTH + ty)Width + Col]`	欄固定為 Col，列起點 `ph*TILE_WIDTH` 偏移 ty

Thread(ty,tx) 在每 phase 的角色：
   M  ────────────────►  (沿 row Row 由左往右，每 phase 取一段 tile)
   ┌───┬───┐
   │tile│tile│ ...           P 的 (Row,Col) = Σ_phase Σ_k Mds[ty][k]*Nds[k][tx]
   └───┴───┘
   N  │  每 phase 取一段 tile（沿 col Col 由上往下）
      ▼
TILE_WIDTH² 個 thread 協同 → 載入 TILE_WIDTH² 個 M + TILE_WIDTH² 個 N
（每 thread 恰好載入 1 M + 1 N，且 index 各不重複）

Strip-mining 與兩個 Barrier

Strip-mining：把一條長迴圈 (line 16-28) 切成「外層 phase 迴圈 + 內層幾次連續迭代」。在內層前後加 barrier，強迫同 block 的 threads 每 phase 聚焦在同一段輸入資料 → 這正是 tiling 所需的 phase 機制。

Barrier	位置	防止的 hazard	別名	為何要等
`__syncthreads()` #1	載入 tile 之後 (line 21)	Read-after-write (RAW)	true dependence（真相依）	reader 真的需要別的 thread 寫入的資料，非等不可
`__syncthreads()` #2	用完 tile 之後 (line 26)	Write-after-read (WAR)	false dependence（假相依）	writer 不需 reader 的資料；只因重用同一記憶體位置才相依（換位置即可消除）

漏掉任一個 __syncthreads() 都會造成 race condition：

漏 #1：有 thread 在 tile 尚未載入完成時就開始讀 → 讀到舊/未定值。
漏 #2：有 thread 在別人還沒讀完舊 tile 時就覆寫成下一個 tile → 破壞他人輸入。
Barrier 的同步機制細節見 04-Compute-Architecture-And-Scheduling/01-GPU-Architecture-and-Block-Scheduling。

效益量化

指標	Naive (Ch.3)	Tiled (16×16)
Compute-to-global-memory ratio	0.25 OP/B	4 OP/B (= 16 × 0.25)
每元素被 global 讀取次數	N	N / TILE_WIDTH
A100 可達吞吐量	389 GFLOPS	6220 GFLOPS (1555 GB/s × 4)

核心公式

Global memory 存取減量因子 = TILE_WIDTH（不是 TILE_WIDTH²！）。
Ratio 提升：ratio_tiled = TILE_WIDTH × 0.25 OP/B。
吞吐上限：throughput = bandwidth × ratio。

Tiled 版 6220 GFLOPS 仍只有 A100 峰值 (19,500 GFLOPS) 的 32%。要更快需進一步優化（thread coarsening、register tiling 等），或直接用 cuBLAS / CUTLASS 等高度優化函式庫。

CPU vs GPU tiling

	CPU (blocking)	GPU (tiling)
重用資料留在晶片上的方式	依賴 cache（隱式）	用 shared memory（顯式）
原因	一核心通常只跑 1-2 threads，cache 較可靠	SM 同時跑大量 threads 競爭 cache slot，故需顯式 shared memory

簡化假設 (本 Kernel 的限制)

Fig 5.9 做了兩個簡化假設：

Width 是 block 寬度的整數倍（否則無法正確處理任意寬度）。
矩陣為方陣 (square)。
一般 (rectangular) 矩陣與邊界檢查見 05-Memory-Architecture-And-Data-Locality/03-Boundary-Checks-and-Memory-Occupancy。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
Tiling 把 global memory 存取減少幾倍？	TILE_WIDTH 倍（T×T tile → 每元素讀取次數從 N 降為 N/T）
16×16 tile 的 compute-to-memory ratio？	0.25 × 16 = 4 OP/B；32×32 → 減量 32 倍
phase 數量怎麼算？	`Width / TILE_WIDTH`
兩個 `__syncthreads()` 各防什麼？	#1 = RAW (true dependence)；#2 = WAR (false dependence)
忘記 `__syncthreads()` 會怎樣？	race condition：讀到未載入資料 (#1) 或 tile 被提早覆寫 (#2)
為何 false dependence 是「假」相依？	writer 不需 reader 的資料，僅因重用同一 shared 位置；換位置即無相依
strip-mining 是什麼？	把長迴圈拆成 outer phase 迴圈 + inner 連續迭代，加 barrier 形成 tiling 的 phase
每個 thread 載入幾個元素？	恰 1 個 M + 1 個 N；TILE_WIDTH² threads 共載 TILE_WIDTH² 對
`Mds`/`Nds` scope 與 lifetime？	scope = block，lifetime = kernel；每 block 一份，跨 phase 重用
為何 GPU 用 shared memory 而非靠 cache？	SM 同時跑大量 threads 競爭 cache，cache 不可靠 → 需顯式 shared memory
Tiled 仍非峰值的原因？	仍 memory-bound 殘留；需 coarsening/register tiling 或 cuBLAS/CUTLASS
Col / Row 公式？	`Col = bxTILE_WIDTH+tx`，`Row = byTILE_WIDTH+ty`