Resource Partitioning, Occupancy, and Querying Device Properties

#gpu-architecture #occupancy #register-optimization #block-scheduling #latency-tolerance

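Overview

Concept	Key point	Why it matters
Occupancy	= (assigned warps) / (max warps per SM); equivalent to the thread-based ratio	Higher occupancy leaves more warps available to hide long-latency operations
Dynamic partitioning	An SM's registers, shared memory, block slots, and thread slots are divided among blocks according to their needs	More flexible than fixed partitioning; avoids wasted slots
Block-slot limit	Each SM has a max blocks/SM limit (A100 = 32)	Blocks that are too small (e.g. 32 threads) hit the block-slot limit first → low occupancy
Thread-slot divisibility	Block size must divide max threads/SM evenly to fill the SM	768 threads/block → 75% (512 slots left idle)
Register limit	(registers per thread) × (resident threads) cannot exceed the SM's total register count	Too many registers per thread → fewer resident threads → lower occupancy
Performance cliff	A small increase in resource usage → a sharp drop in parallelism and performance	Declaring 2 more variables (31→33 regs) can drop occupancy from 100% to 75%
Device query	`cudaGetDeviceProperties` fills in the fields of a `cudaDeviceProp` struct	Lets the same program adapt to different hardware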

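Example hardware used in these notes

The examples use the Ampere A100 (compute capability 8.0). Per-SM limits: 32 blocks, 64 warps = 2048 threads, 65,536 registers, and 64 cores; a single block may hold at most 1024 threads. These numbers vary with compute capability, so always query the actual values.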

Resource Partitioning and the Definition of Occupancy

Occupancy is the ratio that measures how "full" an SM is:

occupancy = (warps assigned to SM) / (max warps supported by SM)
          = (threads assigned to SM) / (max threads supported by SM)

Execution resources within an SM that can be partitioned:

Registers
Shared memory (see 05-Memory-Architecture-And-Data-Locality/01-Memory-Access-Efficiency-and-CUDA-Memory-Types)
Thread block slots (the number of blocks an SM can hold at once)
Thread slots (the number of threads an SM can hold at once)

Dynamic vs Fixed partitioning

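CUDA uses dynamic partitioning: thread slots are divided among however many blocks fit, depending on block size, so an SM can run a few blocks with many threads each or many blocks with few threads each. Fixed partitioning would give every block the same fixed share of resources, wasting slots when a block needs less and failing to fit the block when it needs more.

How an A100 SM divides its 2048 thread slots at different block sizes: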

block size	blocks/SM	threads/SM	occupancy
1024	2	2048	100%
512	4	2048	100%
256	8	2048	100%
128	16	2048	100%
64	32	2048	100%
32	32 (block-slot limit reached; 64 would be needed)	1024	50%
768	2 (512 slots left idle)	1536	75%
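
The table can be reproduced with a few lines of integer arithmetic. The following is a minimal sketch in plain C++ (host-side code that is also valid in a CUDA source file). The struct `SmLimits` and both function names are hypothetical, the default values are the A100 figures used in these notes, and the model considers only thread slots and block slots, ignoring registers and shared memory.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the per-SM limits used in these notes (A100 values).
struct SmLimits {
    int max_threads = 2048;  // thread slots per SM
    int max_blocks  = 32;    // block slots per SM
};

// Blocks resident on one SM when only thread slots and block slots are limiting.
int blocks_per_sm(int block_size, const SmLimits& sm) {
    assert(block_size > 0);
    int by_threads = sm.max_threads / block_size;  // integer division drops the partial block
    return std::min(by_threads, sm.max_blocks);
}

// Occupancy as a fraction in [0, 1]: resident threads / thread slots.
double occupancy(int block_size, const SmLimits& sm) {
    int resident_threads = blocks_per_sm(block_size, sm) * block_size;
    return static_cast<double>(resident_threads) / sm.max_threads;
}
```

Calling `occupancy(768, SmLimits())` exercises the integer division that leaves 512 slots idle, and `occupancy(32, SmLimits())` exercises the `std::min` against the block-slot limit. The sketch does not check the separate 1024 threads/block limit.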

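Two Subtle Occupancy Pitfalls (Subtle Interactions)

Pitfall 1: block-slot limit (blocks too small)

Filling 2048 thread slots requires: 2048 / 32 = 64 blocks
But the A100 has only 32 block slots
→ at most 32 blocks × 32 threads = 1024 threads fit
→ occupancy = 1024 / 2048 = 50%

At least 64 threads/block

Reaching 100% requires threads_per_block × 32 ≥ 2048, i.e. at least 64 threads per block. With 32 threads/block, occupancy is capped at 50%.

Pitfall 2: thread slots do not divide evenly (block size does not divide max threads/SM)

block size = 768:
2048 / 768 = 2.67 → only 2 blocks fit = 1536 threads
512 slots are left idle: a third block would need 2304 > 2048 thread slots, even though the block-slot limit (32) is nowhere near reached
→ occupancy = 1536 / 2048 = 75%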

SM thread slots (max 2048)
┌──────────┬──────────┬──────────┐
│ block0   │ block1   │  XXXXXX  │
│  768     │  768     │  512 idle│
└──────────┴──────────┴──────────┘

Register Limits and the Performance Cliff

The more registers each thread uses, the fewer threads an SM can hold at once:

Maximum registers per thread that still allows full occupancy:
regs_per_thread ≤ (registers per SM) / (max threads per SM)
              = 65,536 / 2048 = 32 registers/thread

If a kernel uses 64 regs/thread:
max threads = 65,536 / 64 = 1024 → occupancy ≤ 1024/2048 = 50%
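
The two calculations above can be written as small helper functions. This is a minimal sketch in plain C++; the function names are hypothetical, and the model looks at registers alone, ignoring block granularity and any hardware register-allocation granularity, so it gives an upper bound rather than an exact figure.

```cpp
#include <algorithm>
#include <cassert>

// Largest per-thread register count that still allows every thread slot to be used.
int max_regs_for_full_occupancy(int regs_per_sm, int max_threads_per_sm) {
    assert(max_threads_per_sm > 0);
    return regs_per_sm / max_threads_per_sm;
}

// Upper bound on occupancy imposed by registers alone, as a fraction in [0, 1].
double register_occupancy_bound(int regs_per_thread, int regs_per_sm, int max_threads_per_sm) {
    assert(regs_per_thread > 0 && max_threads_per_sm > 0);
    int threads_by_regs = regs_per_sm / regs_per_thread;            // threads the registers can feed
    int resident = std::min(threads_by_regs, max_threads_per_sm);   // cannot exceed the thread slots
    return static_cast<double>(resident) / max_threads_per_sm;
}
```

With the A100 figures, `max_regs_for_full_occupancy(65536, 2048)` corresponds to the 32 registers/thread bound, and `register_occupancy_bound(64, 65536, 2048)` corresponds to the 64 regs/thread case.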

Classic performance-cliff example (512-thread blocks)

Scenario	regs/thread	blocks/SM	threads/SM	total regs	occupancy
Original	31	4	2048	2048×31 = 63,488 ✅	100%
2 more variables declared	33	4 would need 67,584 > 65,536 ❌ → reduced to 3	1536	3×512×33 = 50,688 ✅	75%

Performance Cliff (Ryoo et al., 2008)

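A "small" increase in resource usage (31→33 registers, just 2 more automatic variables) can make the runtime drop from 4 blocks to 3 blocks per SM, cutting parallelism and performance sharply (occupancy 100%→75%). Stay alert to resource boundaries when optimizing.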
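
The cliff appears because whole blocks, not individual threads, are added to or removed from an SM. The sketch below, in plain C++, combines the three limits discussed so far. The function names and default arguments (A100 values) are hypothetical, and registers are counted simply as block_size × regs_per_thread per block, which ignores the allocation granularity of real hardware.

```cpp
#include <algorithm>
#include <cassert>

// Blocks resident on one SM under three limits: thread slots, block slots, registers.
// Simplified model: each block consumes block_size * regs_per_thread registers.
int resident_blocks(int block_size, int regs_per_thread,
                    int max_threads = 2048, int max_blocks = 32, int regs_per_sm = 65536) {
    assert(block_size > 0 && regs_per_thread > 0);
    int by_threads = max_threads / block_size;
    int by_regs    = regs_per_sm / (block_size * regs_per_thread);
    return std::min({by_threads, max_blocks, by_regs});  // the tightest limit wins
}

// Occupancy as a fraction in [0, 1] for the given configuration.
double occupancy_with_registers(int block_size, int regs_per_thread, int max_threads = 2048) {
    int blocks = resident_blocks(block_size, regs_per_thread, max_threads);
    return static_cast<double>(blocks * block_size) / max_threads;
}
```

Comparing `resident_blocks(512, 31)` with `resident_blocks(512, 33)` reproduces the drop from 4 to 3 blocks in the table. Within this simplified model, a smaller block size loses fewer threads per dropped block: `resident_blocks(256, 33)` gives 7 blocks, i.e. 1792 of 2048 thread slots.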

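The compiler may perform register spilling, moving some register values to memory to lower the per-thread register requirement and raise occupancy. The cost is that threads must access the spilled values from memory → longer execution time, so the grid as a whole may actually run slower.
Estimating the actual number of threads per SM precisely is difficult; the CUDA Occupancy Calculator (a spreadsheet downloadable from NVIDIA) computes it from a kernel's resource usage.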

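High occupancy ≠ always fastest

High occupancy means a better ability to hide latency (latency tolerance), but it is a means, not an end. Memory bandwidth, coalescing, and other factors can still be the bottleneck; see 06-Performance-Considerations/04-Optimization-Checklist-and-Bottlenecks.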

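Querying Device Properties

The resources available on a device are determined by its compute capability (higher usually means more resources; A100 = 8.0). Host code can query them at runtime so that the same program adapts to different hardware.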

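#include <cuda_runtime.h>

int main() {
    // 1) Get the number of available CUDA devices
    int devCount = 0;
    if (cudaGetDeviceCount(&devCount) != cudaSuccess) return 1;

    // 2) Query each device's properties and pick one with enough resources
    cudaDeviceProp devProp;
    for (int i = 0; i < devCount; i++) {  // int, to match devCount's type
        if (cudaGetDeviceProperties(&devProp, i) != cudaSuccess) continue;
        // Decide from the devProp fields whether this device is suitable
    }
    return 0;
}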

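Integrated GPUs

Modern PCs often have more than one CUDA device, frequently including an integrated GPU that provides only basic graphics functionality; CUDA applications usually perform poorly on it. This is why host code iterates over all devices, queries their capabilities, and selects a suitable one.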

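Important `cudaDeviceProp` fields

Field	Meaning	Typical use
`maxThreadsPerBlock`	Max threads per block (commonly 1024; may grow on future devices)	Check that a block configuration is legal
`multiProcessorCount`	Number of SMs	Filter out devices with too few SMs to perform well
`clockRate`	Clock frequency	Combined with the SM count → estimate peak compute throughput
`maxThreadsDim[0..2]`	Max threads along each block dimension (x/y/z)	Bound the block-dimension range for auto-tuning
`maxGridSize[0..2]`	Max blocks along each grid dimension (x/y/z)	Determine whether a single grid can cover the whole data set
`regsPerBlock`	Registers available to a block; on most compute capabilities this equals the SM total (the name is slightly misleading)	Determine whether registers limit a kernel's occupancy
`warpSize`	Warp size (hardware dependent)	Compute warp counts; write warp-aligned optimizations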

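The name regsPerBlock is misleading

On most compute capabilities, the maximum number of registers a block can use equals the total number of registers in the SM; on some compute capabilities, however, the per-block maximum is smaller than the SM total.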
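
One possible selection policy over the queried properties is sketched below in plain C++ so that it can be tried without a GPU. The struct `DeviceSummary` and the function `pick_device` are hypothetical; in real host code the values would be copied from `cudaDeviceProp` after each `cudaGetDeviceProperties` call (its `integrated` field marks an integrated GPU). The policy itself, skipping integrated GPUs and preferring the device with the most SMs, is an assumption for illustration, not a recommendation from the source.

```cpp
#include <cassert>
#include <vector>

// Hypothetical summary of one device; the values would come from cudaDeviceProp.
struct DeviceSummary {
    int  multiProcessorCount;  // number of SMs
    int  maxThreadsPerBlock;   // per-block thread limit
    bool integrated;           // true for an integrated GPU
};

// Returns the index of the discrete device with the most SMs that can run
// blocks of required_block_size threads, or -1 if no device qualifies.
int pick_device(const std::vector<DeviceSummary>& devices, int required_block_size) {
    assert(required_block_size > 0);
    int best = -1;
    for (int i = 0; i < static_cast<int>(devices.size()); ++i) {
        const DeviceSummary& d = devices[i];
        if (d.integrated) continue;                                // avoid integrated GPUs
        if (d.maxThreadsPerBlock < required_block_size) continue;  // block would be illegal
        if (best < 0 || d.multiProcessorCount > devices[best].multiProcessorCount) {
            best = i;
        }
    }
    return best;
}
```

The returned index could then be passed to `cudaSetDevice`; a result of -1 means the caller should fall back or report that no suitable device exists.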

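Chapter Summary

A GPU consists of multiple SMs; each SM contains multiple cores, grouped into processing blocks, that share control logic and memory resources.
The blocks of a grid are assigned to SMs in arbitrary order → transparent scalability; the cost is that threads in different blocks cannot synchronize with each other (see 04-Compute-Architecture-And-Scheduling/01-GPU-Architecture-and-Block-Scheduling).
Threads are assigned to SMs block by block and then divided into warps that execute under the SIMD model; when threads within a warp take different paths → control divergence, executed in multiple passes (see 04-Compute-Architecture-And-Scheduling/02-Warps-SIMD-and-Control-Divergence).
An SM usually holds many more warps than it can execute at once → other warps fill in during long-latency waits → latency tolerance.
Occupancy = assigned threads / max threads; higher occupancy hides latency better, but it is constrained by the interaction of several resources: block slots, thread-slot divisibility, registers, and shared memory.
Resource limits differ across devices; CUDA C provides runtime queries (cudaDeviceProp) so that programs can adapt to the hardware.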

Exam / Interview Key Points (Exam / Test Patterns)

Scenario / keyword	Answer / technique
Computing occupancy	(assigned threads) / (max threads per SM); first work out how many threads actually fit
block size = 32, A100	Hits the 32 block-slot limit → 32×32 = 1024 threads → 50%; needs ≥64 threads/block for 100%
Block size does not divide max threads (e.g. 768)	2 blocks = 1536 → 75%, 512 slots left idle
Max registers per thread for full occupancy	regs/SM ÷ max threads = 65,536/2048 = 32 regs/thread
Kernel uses 64 regs/thread	max threads = 65,536/64 = 1024 → occupancy ≤ 50%, regardless of block size
"Performance cliff"	Small increase in resource usage → blocks/SM drops sharply → occupancy and performance drop sharply (Ryoo 2008)
Register spilling	Lowers per-thread register use to raise occupancy, but adds memory accesses → may be slower
Why pack more warps onto an SM	Latency tolerance / zero-overhead scheduling; the A100 can be oversubscribed 32× (2048 threads vs 64 cores)
Finding SM count / warp size / per-block thread limit	`multiProcessorCount` / `warpSize` / `maxThreadsPerBlock`
Distinguishing multiple devices / avoiding an integrated GPU	Query each device with `cudaGetDeviceCount` + `cudaGetDeviceProperties` and choose
Compute capability	Describes the amount of SM resources; A100 = 8.0, higher usually means more resources