矩陣乘法 Kernel (Matrix Multiplication)

#cuda-programming #matrix-multiplication #thread-to-data-mapping #row-major-layout #tiling

重點總覽 (Overview)

項目	內容
問題	`M` (I×j) × `N` (j×k) = `P` (I×k)；本章簡化為方陣 `Width × Width`
每個 `P` 元素	一個 inner product / dot product：`M` 的某一列 · `N` 的某一行
平行策略	one thread → one P element（與 `colorToGrayscaleConversion` 相同的映射）
`row` / `col`	`row = blockIdx.yblockDim.y + threadIdx.y`；`col = blockIdx.xblockDim.x + threadIdx.x`
`M` 存取	row-major、連續：`M[row*Width + k]`
`N` 存取	同樣 row-major 儲存，但取「行(column)」需跨列跳躍：`N[k*Width + col]`
Tiling	thread-to-data 映射把 `P` 切成 tiles，每個 block 算一塊 tile
BLAS 定位	Level-3 `C = αAB + βC` 的特例（α=1, β=0）
擴充限制	grid 大小受 max blocks/threads 限制 → 拆 submatrix 或讓每 thread 算多個 `P` 元素

Important

這是全書第一個「每個 thread 內含一個迴圈、做多次運算」的 kernel。vecAdd 與 grayscale 每 thread 只算一個輸出且只做幾次運算；matrix multiplication 每 thread 要跑 Width 次迭代做 inner product。

矩陣乘法定義與 Inner Product (Matrix Multiplication & Inner Product)

I×j 矩陣 M 乘上 j×k 矩陣 N，得到 I×k 矩陣 P。
P 的每個元素是 M 的一列 與 N 的一行 的 inner product（dot product）：

P_{r o w, c o l} = \sum_{k = 0}^{W i d t h - 1} M_{r o w, k} \cdot N_{k, c o l}

例（row=1, col=5）：

P_{1, 5} = M_{1, 0} N_{0, 5} + M_{1, 1} N_{1, 5} + M_{1, 2} N_{2, 5} + \dots + M_{1, W i d t h - 1} N_{W i d t h - 1, 5}

        N (column strip ↓)
        +---+---+---+---+
        |   |   |col|   |
        |   |   | ∥ |   |
        |   |   | ∥ |   |
        +---+---+---+---+
M (row strip →)        P
+---+---+---+---+    +---+---+---+---+
|row══════════> |    |   |   |   |   |
+---+---+---+---+    +---+ [P_row,col]+   <- 一個 thread
|   |   |   |   |    |   |   |   |   |
+---+---+---+---+    +---+---+---+---+
   P_row,col = (M 第 row 列) · (N 第 col 行)

BLAS Level 提示

本例屬 BLAS Level-3（matrix-matrix，C=αAB+βC）的特例。Level-1 是 vector（y=αx+y，vecAdd 即 α=1 的特例），Level-2 是 matrix-vector（y=αAx+βy）。Level 越高、運算量越大、越適合 GPU 加速。

Thread 對輸出 P 的映射 (Thread-to-Output Mapping)

與 colorToGrayscaleConversion 完全相同的映射；差別只是這裡簡化為方陣，把 width 與 height 都換成 Width。

維度	公式	對應
`row`（垂直）	`blockIdx.y*blockDim.y + threadIdx.y`	`M` 的列 / `P` 的列
`col`（水平）	`blockIdx.x*blockDim.x + threadIdx.x`	`N` 的行 / `P` 的行

軸向別搞反

.y → row、.x → col。grid/block 的維度排序（x 先）與資料/數學的排序（row 先、即 y）是相反的；混用是最常見的索引 bug。詳見 03-Multidimensional-Grids-And-Data/02-Mapping-Threads-to-Multidimensional-Data。

矩陣乘法 Kernel 程式碼 (matrixMulKernel)

// Fig. 3.11 — one thread computes one P element (square matrices, size Width)
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // line 03
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // line 04
    if ((row < Width) && (col < Width)) {              // line 05  邊界檢查
        float Pvalue = 0;                              // line 06
        for (int k = 0; k < Width; ++k) {              // line 07  inner product
            Pvalue += M[row*Width + k] * N[k*Width + col];  // line 08
        }
        P[row*Width + col] = Pvalue;                   // line 10  寫回
    }
}

Pvalue 是 register 中的區域累加器（避免每次迭代都寫回 global memory）。
迴圈結束後一次寫回 P[row*Width + col]。
與 grayscale 的差異：多了 for-loop（做 Width 次 multiply-add）。

簡化假設

此 kernel 只處理方陣（M、N、P 皆 Width×Width）。非方陣需分別傳入 M 的列數/行數與 N 的行數，並把三個 Width 換成正確維度。

M 與 N 的記憶體存取 (Row-major M vs Column-strided N)

兩個矩陣都用 row-major 線性化儲存，但因 inner product 需要 M 的一列與 N 的一行，存取樣態不同：

	取用方向	線性索引	記憶體樣態
`M`	第 `row` 列	`M[row*Width + k]`	連續（同列元素相鄰）
`N`	第 `col` 行	`N[k*Width + col]`	跨列跳躍（每次 +Width，stride 存取）

記憶體 (row-major, Width=4)：
M:  [ M0,0 M0,1 M0,2 M0,3 | M1,0 ... ]   取 row 1 → 連續讀 M[4],M[5],M[6],M[7]
                                              k 增加 → 位址 +1   ✔ 連續

N:  [ N0,0 N0,1 N0,2 N0,3 | N1,0 ... ]   取 col 0 → 讀 N[0],N[4],N[8],N[12]
                                              k 增加 → 位址 +Width  ✗ 跳躍(stride)

推導：N 第 col 行的起點是 N[col]；同一行的「下一個」元素在下一列，故需跳過一整列 → N[k*Width + col]。

為什麼這很重要

同一個 warp 內相鄰 thread 的 col 相差 1，它們對 N[k*Width + col] 的存取其實是相鄰位址（合併良好）；而對 M[row*Width + k]，相鄰 thread row 不同 → 存取相距 Width。這裡的 stride 樣態正是後續 memory coalescing 與 tiling 優化的起點（見 06-Performance-Considerations/01-Memory-Coalescing、05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication）。

Tiling P 與多 Block 執行 (Tiling P Across Blocks)

one-thread-one-element 的映射「自然地」把 P 切成一塊塊 tile，每個 block 負責一塊 tile。
小範例：4×4 的 P，BLOCK_WIDTH = 2 → 4 個 block（2×2 排列），每個 block 是 2×2 threads。

P (4×4)  切成 4 個 2×2 tiles：
        col→ 0   1   2   3
 row 0  | B(0,0) | B(0,1) |
 row 1  |        |        |
        +--------+--------+
 row 2  | B(1,0) | B(1,1) |
 row 3  |        |        |

thread(0,0) of block(0,0) → P0,0
thread(1,0) of block(0,0) → row=0*2+1=1, col=0*2+0=0 → P1,0
thread(0,0) of block(1,0) → row=1*2+0=2, col=0       → P2,0

走訪驗證 thread(0,0)@block(0,0) 的迴圈（Width=4）：

k	`M[row*W+k]`	`N[k*W+col]`	對應
0	`M[0]`	`N[0]`	M0,0 · N0,0
1	`M[1]`	`N[4]`	M0,1 · N1,0
2	`M[2]`	`N[8]`	M0,2 · N2,0
3	`M[3]`	`N[12]`	M0,3 · N3,0

→ 正是 M 第 0 列與 N 第 0 行的 inner product，寫入 P[0] = P0,0。

擴充限制 (Scaling Limits)

grid 受 max blocks/grid 與 max threads/block (1024) 限制 → matrixMulKernel 能處理的最大 P 也受限。
超過上限的兩種對策：

對策	做法	後續章節
拆 submatrix	host 把 `P` 切成數個子矩陣，每個子矩陣 launch 一個 grid	—
每 thread 算多個元素	修改 kernel 讓單一 thread 計算多個 `P` 元素	thread coarsening，見 06-Performance-Considerations/03-Thread-Coarsening

複雜度與 arithmetic intensity（前瞻）

量	值
每 thread 工作	`O(Width)`（迴圈跑 Width 次）
thread 總數	`Width²`
總工作量	`O(Width³)`
Arithmetic intensity	每迭代讀 2 個 float (8 B) 做 2 FLOP → 0.25 OP/B（記憶體受限）

為什麼這個基本版很慢

0.25 OP/B 代表它是 memory-bound：每筆 global memory 流量只換到極少運算，且 M、N 的元素被重複從 global memory 讀取多次。解法是把資料載入 shared memory 重複使用（tiling），見 05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication。本節僅止於基本 kernel。

本章總結 (Chapter Summary)

CUDA grid 與 block 最多 3 維；多維度的目的在於把 threads 自然對應到多維資料。
execution configuration parameters 定義 grid/block 維度；blockIdx 與 threadIdx 讓每個 thread 辨識自己負責的資料範圍——由程式設計者負責正確使用這些變數。
存取多維資料時通常要把多維索引 linearize 成 1D offset，因為 C 的動態配置多維陣列以 row-major 的 1D 形式儲存。
本章用 vecAdd → grayscale → image blur → matrix multiplication 一連串遞增複雜度的範例，建立「用多維 grid 處理多維資料」的基本功，是後續 parallel patterns 與優化技巧的基礎。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
每個 thread 負責什麼？	一個 `P` 元素 = 一個 inner product（`M` 一列 · `N` 一行）
`M` 的存取索引	`M[row*Width + k]`（row-major，連續）
`N` 的存取索引	`N[kWidth + col]`（取行需 stride `Width`，非連續*）
`P` 的寫回索引	`P[row*Width + col]`
row / col 公式	`row = blockIdx.yblockDim.y + threadIdx.y`；`col = blockIdx.xblockDim.x + threadIdx.x`
為何要 `if (row<Width && col<Width)`	grid 的 thread 數是 blockDim 的倍數，可能超出矩陣邊界，需擋掉越界 thread
為何用 `Pvalue` 累加	在 register 累加，避免每次迭代寫 global memory（迴圈後才寫回一次）
BLAS 屬於哪一級	Level-3（matrix-matrix `C=αAB+βC`），本例 α=1, β=0
矩陣太大超出 grid 上限怎麼辦	(1) 拆 submatrix 多次 launch；(2) 每 thread 算多個 `P` 元素（thread coarsening）
為何基本版效能差	arithmetic intensity ≈ 0.25 OP/B、`M`/`N` 元素被重複讀取 → memory-bound → 用 shared-memory tiling 改善
總工作複雜度	`O(Width³)`（`Width²` 個 thread × 每個 `O(Width)`）