邊界檢查與記憶體使用對 Occupancy 的影響 (Boundary Checks & Occupancy)

#memory-and-performance #occupancy #shared-memory #tiling #matrix-multiplication

重點總覽 (Overview)

主題	核心問題	解法 / 關鍵公式
Boundary Checks	tile 寬不整除矩陣寬時,thread 讀到不存在的元素	每個記憶體存取都加上界線測試;越界載入填 `0.0f`
三種獨立檢查	載入 M、載入 N、寫入 P 的越界條件各不相同	不能只靠「排除無效 thread」解決
General (rectangular) matmul	kernel 只支援方陣	`Width` 拆成 `j, k, l` (j×k · k×l = j×l)
Register / Shared-mem 對 occupancy	每 thread 用越多資源 → 每 SM 可容納 thread 越少	full-occupancy 門檻 = `SM_sharedMem / maxThreads` (B/thread)
Dynamic shared memory	`__shared__` 大小編譯期寫死,需重編譯	`extern __shared__` + kernel launch 第三參數 + `cudaGetDeviceProperties`

Important

一句話總結:Tiling 用 registers / shared memory 換取較高的 compute-to-global-memory-access ratio,但這兩種片上記憶體容量有限,過度使用會壓低 occupancy,反而降低延遲容忍能力與吞吐量。

邊界檢查 (Boundary Checks)

05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication 的 tiled kernel 有兩個簡化假設:(1) 矩陣寬度是 tile 寬度的整數倍;(2) 矩陣是方陣。本節移除假設 (1)。

為什麼會越界 (row-major linearization 的陷阱)

以 3×3 矩陣、TILE_WIDTH=2 為例 (寬度 3 不是 2 的倍數):

3x3 M, TILE_WIDTH=2, phase ph=1 of block(0,0)
            tile col0  tile col1(越界)
Row 0:  [ M0,0  M0,1 | M0,2   (M0,3 ✗) ]
Row 1:  [ M1,0  M1,1 | M1,2   (M1,3 ✗) ]
Row 2:  [ M2,0  M2,1 | M2,2   (M2,3 ✗) ]

Linear (row-major): M0,0 M0,1 M0,2  M1,0 M1,1 M1,2  M2,0 ...
                                 ^index 3
thread(0,1) 想讀 M[Row*Width + ph*TILE_WIDTH + tx]
          = M[0*3 + 1*2 + 1] = M[3] = M1,0  ← 取到「下一列」的元素,結果被悄悄汙染!

兩種越界後果:

越界方向	例子	後果
超出列尾 (M)	`M0,3` → 實際取到 `M1,0`	拿到錯誤但合法的鄰列值,靜默地汙染內積
超出欄尾 / 陣列尾 (N)	`N` 存取超出 allocation	回傳隨機值,或直接 crash / abort

Warning

越界不是只發生在「最後一個 phase」。 block(1,1) 在 phase 0 就會讀到不存在的 M3,0。所以不能用「只在最後 phase 做特判」的捷徑。

Warning

不能只排除「不算有效 P 的 thread」。 例如 block(1,1) 的 thread(1,0) 不負責任何有效 P,但它仍須載入 M2,1 供同 block 其他 thread 使用;反之,負責有效 P 的 thread(0,1) 也可能去讀不存在的 M0,3。→ 載入 M、載入 N、寫入 P 必須各自獨立檢查。

三組獨立的界線條件

Tip

拇指法則 (rule of thumb):每一次記憶體存取,都要有一個對應的檢查,確保索引落在被存取陣列的界線內。

動作	索引	界線條件	不滿足時
載入 M tile	row=`Row`, col=`ph*TILE_WIDTH+tx`	`Row < Width && (ph*TILE_WIDTH+tx) < Width`	寫入 `0.0f`
載入 N tile	row=`ph*TILE_WIDTH+ty`, col=`Col`	`(ph*TILE_WIDTH+ty) < Width && Col < Width`	寫入 `0.0f`
寫入 P	row=`Row`, col=`Col`	`Row < Width && Col < Width`	不寫

Important

越界載入填 0.0f 是安全的:0 乘任何數為 0、加到 Pvalue 不改變內積結果。這是「padding 中性元素」的通用技巧。

帶界線檢查的 kernel (Fig. 5.13)

#define TILE_WIDTH 16

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x, by = blockIdx.y;
    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    float Pvalue = 0;
    // 任意寬度:phase 數改用 ceil(Width / TILE_WIDTH)
    for (int ph = 0; ph < ceil(Width / (float)TILE_WIDTH); ++ph) {
        // --- 載入 M tile,含界線檢查 ---
        if ((Row < Width) && (ph*TILE_WIDTH + tx) < Width)
            Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        else
            Mds[ty][tx] = 0.0f;                       // 中性 padding
        // --- 載入 N tile,含界線檢查 ---
        if ((ph*TILE_WIDTH + ty) < Width && (Col < Width))
            Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
        else
            Nds[ty][tx] = 0.0f;
        __syncthreads();                              // read-after-write barrier

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();                              // write-after-read barrier
    }
    // --- 寫入 P,含界線檢查 ---
    if ((Row < Width) && (Col < Width))
        P[Row*Width + Col] = Pvalue;
}

Warning

迴圈上界由 Width/TILE_WIDTH 改成 ceil(Width/(float)TILE_WIDTH);否則寬度非整除時會少做最後一個 (部分) phase。__syncthreads() 仍須由 全 block 一致執行,所以界線檢查只能包住「載入/寫入」,不能包住整個迴圈本體 (否則 barrier 發散會掛)。

推廣到任意矩形矩陣 (general matmul)

M(j×k) · N(k×l) = P(j×l)

把單一 Width 換成三個 unsigned:

原 `Width` 的角色	換成
M 的高 / P 的高	`j`
M 的寬 / N 的高 (內積長度)	`k`
N 的寬 / P 的寬	`l`

(書中把完整改寫留作習題。)

記憶體使用對 Occupancy 的影響 (Impact of Memory Usage on Occupancy)

回顧 04-Compute-Architecture-And-Scheduling/03-Resource-Partitioning-and-Occupancy:SM 的資源 (registers、shared memory、thread/block slots) 有限,每個 thread 要的資源越多,每個 SM 能同時駐留的 thread 就越少,occupancy 下降 → 隱藏長延遲的能力變差。

Full-occupancy 的 shared-memory 預算 (A100 範例)

A100:每 SM 最多 164 KB shared memory、最多 2048 threads。

full-occupancy 門檻 = \frac{164 KB}{2048 threads} = 82 B/thread

每 thread 平均用超過 82 B 的 shared memory,就無法填滿 2048 個 thread slot。

Tiled matmul 的 shared-memory 用量

每個 block 有 TILE_WIDTH² 個 thread,使用 TILE_WIDTH²·4B (Mds) + TILE_WIDTH²·4B (Nds):

\frac{2 \cdot {TILE_WIDTH}^{2} \cdot 4 B}{{TILE_WIDTH}^{2} threads} = 8 B/thread

8 B/thread ≪ 82 B/thread → tiled matmul 的 occupancy 不受 shared memory 限制。

受限的反例

某 kernel:每 block 用 32 KB shared memory、256 threads。

平均用量 = 32 KB / 256 threads = 128 B/thread   (> 82 → 無法 full occupancy)
SM 可容納 threads = 164 KB / 128 B ≈ 1312 threads
max occupancy ≈ 1312 / 2048 ≈ 64%

Warning

書中此例的算術鏈為 ≈132 B/thread → 1272 threads → 62% (32KB/256 嚴格等於 128 B/thread)。重點不在精確百分比,而在 方法:avg = sharedMem_per_block / threads_per_block,再用 SM_sharedMem / avg 反推每 SM 可駐留 thread 數。各裝置的 shared memory 容量不同 (implementation-dependent)。

場景	每 thread shared mem	是否限制 occupancy
Tiled matmul (任意 TILE_WIDTH)	8 B	否 (遠低於 82 B)
32 KB / 256 threads kernel	128 B	是 → 約 64%

動態大小的 Shared Memory (Dynamically Sized Shared Memory)

問題:Fig. 5.9 / 5.13 的 __shared__ float Mds[TILE_WIDTH][TILE_WIDTH]; 把大小寫死成編譯期常數,要換大小就得改 TILE_WIDTH 並重新編譯。但不同世代 GPU 的 shared memory 容量不同,我們希望 host 端能在執行期決定用量。

三件事

查詢硬體容量:cudaGetDeviceProperties(&devProp, dev),讀 devProp.sharedMemPerBlock 得知每 block 可用的 shared memory。
extern __shared__ 宣告:省略陣列大小,合併成一個一維動態陣列。
kernel launch 第三參數:傳入要配置的 shared memory 位元組數 (size_t)。

// 動態 shared memory:一個合併的一維陣列,大小由 launch 時決定
extern __shared__ float Mds_Nds[];

__global__ void matrixMulKernel(float* M, float* N, float* P, int Width,
                                unsigned Mds_sz, unsigned Nds_sz) {
    float* Mds = (float*) Mds_Nds;              // line 06: Mds 區段起點
    float* Nds = (float*) Mds_Nds + Mds_sz;     // line 07: Nds 接在 Mds 之後
    // ...
    // 一維化存取:Mds[ty][tx]  →  Mds[ty*TILE_WIDTH + tx]
}

// Host 端:依裝置查詢結果動態決定 size,當作第三個 <<<>>> 參數
size_t size = /* 依 devProp.sharedMemPerBlock 計算,例:16x16 tile → 2*16*16*4 = 2048 B */;
matrixMulKernel<<<dimGrid, dimBlock, size>>>(Md, Nd, Pd, Width, size/2, size/2);

合併陣列的記憶體佈局:

extern __shared__ Mds_Nds[]   (size bytes, e.g. 2048 B for 16x16 tile)
|<---------- Mds_sz ---------->|<---------- Nds_sz ---------->|
[ Mds[0..255]                 ][ Nds[0..255]                 ]
 ^Mds = Mds_Nds                ^Nds = Mds_Nds + Mds_sz

Tip

因為只有一個合併陣列,必須手動切分 Mds / Nds 起點,且陣列是一維的 → 用線性化索引 Mds[ty*TILE_WIDTH + tx] 取代 Mds[ty][tx]。host 端傳 size/2 作為各區段大小 (16×16 tile 時為 1024 B)。

本章總結 (Chapter Summary, 5.7)

現代處理器的執行速度常被記憶體速度限制;compute-to-global-memory-access ratio 低 → memory-bound (見 05-Memory-Architecture-And-Data-Locality/01-Memory-Access-Efficiency-and-CUDA-Memory-Types)。
CUDA 提供 registers / shared memory / constant memory:比 global memory 小但快,有效使用需重新設計演算法。
Tiling 用 barrier synchronization 強迫多個 thread 在每個 phase 共同聚焦於輸入資料的一個子集,把子集放進高速記憶體 → 提升 data locality (見 05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication)。
這些特殊記憶體容量有限且 implementation-dependent;一旦超用,會限制每 SM 同時執行的 thread 數,傷害吞吐量與延遲容忍能力。「能就硬體限制推理」是平行程式設計的核心能力。
Tiling 不只適用 GPU:多核 CPU 靠 on-chip cache 達到同樣的 locality 效益 (CPU 隱式靠 cache,GPU 顯式用 shared memory)。
本章未討論用 registers 做 tiling (register tiling) → 留待 Part II (見 08-Stencil/03-Thread-Coarsening-and-Register-Tiling)。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
矩陣寬度非 tile 寬整數倍會怎樣?	thread 越界:超列尾 → 靜默讀到鄰列 (row-major 線性化);超陣列尾 → 隨機值或 crash
邊界檢查要幾組?能否共用?	三組獨立:載入 M、載入 N、寫入 P,各自的索引界線不同;不可只用「排除無效 P 的 thread」
越界載入該填什麼?	`0.0f` (乘加皆中性,不改變內積)
phase 迴圈上界怎麼改?	`ceil(Width/(float)TILE_WIDTH)`,否則漏掉最後的部分 tile
為何不能把整個 phase 迴圈包進 `if (boundary)`?	`__syncthreads()` 須全 block 一致執行,否則 barrier 發散 → deadlock
general (rectangular) matmul 怎麼改?	`Width` → `j, k, l`;j×k · k×l = j×l
full occupancy 的 shared-mem 門檻 (A100)	164 KB / 2048 = 82 B/thread
tiled matmul 每 thread 用多少 shared mem?	`2·TILE_WIDTH²·4B / TILE_WIDTH²` = 8 B/thread (不限制 occupancy)
32 KB/block、256 threads 的 occupancy?	128 B/thread → 164KB/128B ≈ 1312 → ≈ 64% (≈書中 62%);限制因子=shared memory
如何執行期調整 shared memory?	`extern __shared__` (省略大小) + `<<<grid,block,size>>>` 第三參數 + `cudaGetDeviceProperties` 查 `sharedMemPerBlock`
`extern __shared__` 有何注意事項?	只有一個合併一維陣列,須手動切 Mds/Nds 起點、用線性化索引
registers vs shared memory 對 occupancy	兩者皆片上、皆有限;用量越大 → 每 SM thread 越少 → occupancy 越低

重點總覽 (Overview)

邊界檢查 (Boundary Checks)

為什麼會越界 (row-major linearization 的陷阱)

三組獨立的界線條件

帶界線檢查的 kernel (Fig. 5.13)

推廣到任意矩形矩陣 (general matmul)

記憶體使用對 Occupancy 的影響 (Impact of Memory Usage on Occupancy)

Full-occupancy 的 shared-memory 預算 (A100 範例)

Tiled matmul 的 shared-memory 用量

受限的反例

動態大小的 Shared Memory (Dynamically Sized Shared Memory)

三件事

本章總結 (Chapter Summary, 5.7)

考試/面試重點 (Exam / Test Patterns)

Related Notes