執行緒粗化 (Thread Coarsening)

#memory-and-performance #thread-coarsening #tiling #matrix-multiplication #occupancy

重點總覽 (Overview)

項目	說明
核心想法	不再用「最細粒度 (finest granularity)」每個 thread 只做一個 work unit；改成每個 thread 負責多個 work units，部分串行化以省下重複成本
動機	平行化是有「price of parallelism」的（重複載入資料、重複計算、同步開銷…）。若硬體反正要串行化 thread blocks，這個 price 等於白付
何時有用	該 price 存在且硬體因資源不足而把工作串行化時
關鍵參數	`COARSE_FACTOR` = 每個 coarsened thread 負責的原始 work unit 數量
核心結構	coarsening loop：在 thread 內以迴圈走訪它負責的多個 work units
代價	暴露給硬體的平行度 (exposed parallelism) 下降 `COARSE_FACTOR` 倍；register / shared memory 用量上升
三大陷阱	(1) 不必要時硬套 (2) 過度粗化使資源閒置 (3) 資源用量過大壓垮 occupancy
本章範例	Tiled matrix multiplication：相鄰輸出 tile 共用同一塊 M input tile，coarsening 讓 M tile 只載入一次

一句話

Thread coarsening = 「拿暴露的平行度，換掉不必要的重複工作」。只有當平行度本來就用不完、而平行又有額外成本時，這筆交易才划算。

細粒度 vs 粗化 (Finest Granularity vs Coarsening)

到目前為止的所有 kernel 都採用最細粒度平行化（每 thread 最小工作單位）：

Kernel	每個 thread 負責
Vector addition	1 個輸出元素
RGB→grayscale / image blur	1 個輸出 pixel
Matrix multiplication	1 個輸出矩陣元素

最細粒度的優點 — transparent scalability

暴露最多平行度給硬體。
資源夠 → 全部平行執行；資源不夠 → 硬體自動把 thread blocks 一個接一個串行化，程式不必改。

最細粒度的缺點 — price of parallelism

平行化常需付出額外成本：
- redundant loading of data（不同 block 重複載入同一份資料）
- redundant work（重複計算）
- synchronization overhead（同步開銷）
thread 真的平行跑時，這個 price 值得付。
但若硬體因資源不足而串行化了這些工作 → price 白付了。

交易的本質

既然硬體反正要串行化，不如由程式主動部分串行化：讓一個 thread block 處理多個工作單位，把重複成本省下來。這就是 thread coarsening。

範例：Tiled 矩陣乘法 (Coarsening for Tiled MatMul)

Price of parallelism = 重複載入 M tile

計算 P 中兩個水平相鄰的 output tile 時：

                N (input, 不同 tile)
            ┌──────────┬──────────┐
            │  N tile  │  N tile  │   ← 每個 output tile 要不同的 N tile
            │   (左)   │   (右)   │
   ─────────┼──────────┼──────────┤
   M tile   │   P_tile │  P_tile  │   ← 兩個相鄰的 output tile
  (同一塊!) │    左    │    右    │
   ─────────┴──────────┴──────────┘
       ▲
       └─ 同一塊 M input tile 同時餵給「左」「右」兩個 output tile

兩個 output tile 需要不同的 N input tiles。
但需要相同的 M input tiles。
Ch.5 tiled 版：每個 output tile 由不同 block 處理 → shared memory 不能跨 block 共享 → 每個 block 各載入一份 M tile（重複！）。

方案	處理 2 個相鄰 output tile	M tile 載入次數
Ch.5 細粒度（2 個 block）	2 個 block 平行	載入 2 次（重複）
Coarsened（1 個 block，每 thread 做 2 元素）	1 個 block 串行兩 tile	載入 1 次（重用）

只有在 blocks 會被串行化時才划算

若這兩個 block 真的能平行跑，重複載入是值得的；只有當它們被硬體串行化時，coarsening 把 M tile 重用才會帶來實質加速。

Coarsening loop 程式碼 (Fig. 6.13)

#define TILE_WIDTH   32
#define COARSE_FACTOR 4          // line 02: 每個 coarsened thread 負責幾個 work unit

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int row = by * TILE_WIDTH + ty;
    // line 13: 第一個負責的 column；block 跨度 = TILE_WIDTH * COARSE_FACTOR
    int colStart = bx * TILE_WIDTH * COARSE_FACTOR + tx;

    // lines 16-19: 為每個負責的元素各開一個 Pvalue 累加器
    float Pvalue[COARSE_FACTOR];
    for (int c = 0; c < COARSE_FACTOR; ++c)      // line 17: 初始化的 coarsening loop
        Pvalue[c] = 0.0f;

    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {   // line 22: 走訪 input tiles
        // 一個 phase 只載入「一塊」M tile（與原版相同）
        Mds[ty][tx] = M[row * Width + ph * TILE_WIDTH + tx];

        for (int c = 0; c < COARSE_FACTOR; ++c) {       // line 27: 主 coarsening loop
            int col = colStart + c * TILE_WIDTH;        // line 29: 算出本次負責的 column

            Nds[ty][tx] = N[(ph * TILE_WIDTH + ty) * Width + col];  // line 32: 載入對應 N tile
            __syncthreads();

            for (int k = 0; k < TILE_WIDTH; ++k)        // lines 35-37: 用同一塊 M tile
                Pvalue[c] += Mds[ty][k] * Nds[k][tx];   //   更新「不同」的 Pvalue[c]
            __syncthreads();
        }
    }

    for (int c = 0; c < COARSE_FACTOR; ++c) {           // lines 44-47: 寫回各輸出元素
        int col = colStart + c * TILE_WIDTH;
        P[row * Width + col] = Pvalue[c];
    }
}

關鍵改動對照

行為	Ch.5 細粒度	Coarsened (Fig. 6.13)
每個 thread 負責的輸出元素	1	`COARSE_FACTOR` 個
Block column 跨度	`TILE_WIDTH`	`TILE_WIDTH * COARSE_FACTOR`
累加器	單一 `Pvalue`	`Pvalue[COARSE_FACTOR]` 陣列
每個 phase 載入 M tile	1 塊	1 塊（不變 → 重用!）
每個 phase 載入 N tile	1 塊	`COARSE_FACTOR` 塊（不可避免）

coarsening loop 放在哪

M tile 在迴圈外只載入一次；coarsening loop 在內，對「同一塊 M tile」搭配「多塊 N tile」算出多個 Pvalue。這正是省下重複載入的關鍵位置。

算術強度提升 (Arithmetic Intensity, OP/B)

以 floating-point 對 global memory 存取比 (OP/B) 量化好處（書末習題 4）：

Kernel	OP/B	推導
簡單版（無優化）	0.25	2 FLOP / 8 B（載入 1 M + 1 N）
Tiled 32×32	8	0.25 × TILE_WIDTH = 0.25 × 32
Tiled 32×32 + COARSE_FACTOR 4	12.8	每 phase 載 1 M + 4 N（5 tiles）算 4 個 output tile：262144 FLOP / 20480 B

Coarsening 把每組 4 個 output tile 的 M tile 載入從 4 塊降到 1 塊 → global memory traffic 下降 → OP/B 從 8 升到 12.8。

三大陷阱 (Three Pitfalls)

Pitfall 1 — 不必要時硬套

只有存在 price of parallelism（重複載入 / 重複計算 / 同步開銷…）時才有效。
像 vector addition、RGB→grayscale 這類 kernel，平行處理不同元素沒有額外代價 → 套 coarsening 不會有明顯效益。

Pitfall 2 — 過度粗化使硬體閒置

Coarsening 會減少暴露的平行度（÷ COARSE_FACTOR）。
COARSE_FACTOR 太大 → 暴露的平行度不足 → 部分執行資源閒置。
最佳 COARSE_FACTOR 是 device-specific 且 dataset-specific，換裝置 / 換資料需重新調校 → transparent scalability 變得不再透明。

Pitfall 3 — 壓垮 occupancy

Coarsening 常需要更多 registers / thread（多個 Pvalue）或更多 shared memory / block。
若資源用量過大壓低了 occupancy，降低 occupancy 的傷害可能大於 coarsening 帶來的好處。

跨章節的 coarsening 應用 (Where It Recurs)

不同章節「price of parallelism」不同，coarsening 攻擊點也不同：

章節 / 模式	Coarsening 省下的代價
Ch.6 Matrix Mul	重複載入 input tile
Ch.8 Stencil	重複載入 input data（register tiling）
Ch.9 Histogram	減少需要 commit 的 private copy 數量
Ch.10 Reduction / Ch.11 Scan	同步與 control divergence 開銷
Ch.11 Scan	也減少平行演算法相對序列版的 redundant work
Ch.12 Merge	減少 binary search（co-rank）次數
Ch.13 Sorting	改善 memory coalescing

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「Thread coarsening 是什麼」	讓一個 thread 負責多個 work unit，部分串行化以減少 price of parallelism
「何時該用 / 不該用」	有 price of parallelism 且硬體會串行化 → 用；vector add / grayscale 這種無代價的 → 不用（Pitfall 1）
「coarsening loop 在哪、做什麼」	thread 內走訪它負責的多個 work units；把載入一次的 M tile 重用於多個 N tile / `Pvalue`
`COARSE_FACTOR` 太大會怎樣	暴露平行度不足 → 執行資源閒置；且 scalability 不再 transparent（Pitfall 2）
「coarsening 的副作用」	registers / shared memory 用量上升 → 可能壓低 occupancy（Pitfall 3）
`colStart` 怎麼算	`bx * TILE_WIDTH * COARSE_FACTOR + tx`（block 跨度乘上 `COARSE_FACTOR`）
Tiled 32×32 + coarse 4 的 OP/B	12.8 OP/B（vs tiled 8、簡單版 0.25）
coarsening 對 transparent scalability	降低（減少暴露平行度，需逐裝置 / 逐資料調校）
Table 6.1 中 coarsening 的好處	compute：less redundant work / divergence / sync；memory：less redundant global memory traffic

重點總覽 (Overview)

細粒度 vs 粗化 (Finest Granularity vs Coarsening)

範例：Tiled 矩陣乘法 (Coarsening for Tiled MatMul)

Price of parallelism = 重複載入 M tile

Coarsening loop 程式碼 (Fig. 6.13)

算術強度提升 (Arithmetic Intensity, OP/B)

三大陷阱 (Three Pitfalls)

跨章節的 coarsening 應用 (Where It Recurs)

考試/面試重點 (Exam / Test Patterns)

Related Notes