Kernel 執行控制 (Kernel Execution Control)

#advanced-practice #kernel-execution-control #cooperative-kernels #dynamic-parallelism #cuda-streams

重點總覽 (Overview)

本節整理 CUDA 在「kernel 執行期能力」上的演進，重點不在效能，而在軟體工程能力 (composability、可維護性、除錯) 與排程彈性 (priority、load balance)。

能力 (Capability)	首次支援 (Arch / CUDA)	解決的痛點	關鍵 API / 機制
In-kernel function calls	Kepler / CUDA 5	早期須全部 inline，不支援 recursion、system call、virtual function	per-thread call frame stack
Recursion + std library	Kepler / CUDA 5	無法 cut-paste CPU 演算法、無法在 kernel 內輸出	`printf()`、`malloc()`
Device lambdas	CUDA 8 (C++11)	無法寫泛型可重用 kernel	`__host__ __device__` extended lambda (`--extended-lambda`)
Exception handling (limited)	—	無法偵測 rare condition / 越界	debugger breakpoint、run-until-fault
Simultaneous multiple grids	Fermi	小 grid 無法填滿 device；priority 無法區分	CUDA streams + 同時執行
Hardware queues	Kepler / CUDA 5	單一 queue 造成 false serialization	多條 HW queue
Dynamic parallelism	Kepler / CUDA 5	data-dependent 工作量須回 CPU	child grid launch (見 Ch.21)
Interruptable grids	Fermi	長時間計算無法中止	grid cancel
Cooperative kernels	CUDA 11	irregular data 的 load imbalance、跨 block 合作 deadlock	`cudaLaunchCooperativeKernel()`、`grid.sync()`

Important

這些能力大多是為了降低開發/移植/維護成本而生 (讓 GPU code 更像 CPU code、更 composable)，硬體效能提升另見 22-Advanced-Practices-And-Future-Evolution/03-Memory-Bandwidth-and-Compute-Throughput。

Kernel 內函式呼叫 (Function Calls within Kernel Functions)

早期模型：原始碼可以「看起來」有函式呼叫，但 compiler 必須把所有函式 body inline 進 kernel object，runtime 時 kernel 內沒有真正的 call。

不支援：system calls、dynamically linked library calls、recursion、virtual functions (C++)。
對 performance-critical 區段尚可，但無法支撐複雜應用的軟體工程實務。

Kepler / CUDA 5 之後：支援 runtime 真正的函式呼叫。

Compiler 不再被強制 inline (仍可為了效能選擇 inline)。
由 per-thread 的 massively parallel call frame stack (有 cache、快速實作) 支撐。
帶來 composability：不同作者寫不同 kernel 元件再組裝，免大改；廠商可釋出不含原始碼的 device library (IP 保護)。

仍存在的限制

virtual functions 僅支援由 device code 建構的物件。
device code 的 dynamic libraries 仍不支援。

// CUDA 5+ (Kepler): 真正的 runtime function call，不再強制 inline
__device__ int factorial(int n) {      // recursion 現在合法
    if (n <= 1) return 1;
    return n * factorial(n - 1);       // 遞迴呼叫使用 per-thread call stack
}

__global__ void demo(int *out, int n) {
    out[threadIdx.x] = factorial(n);
    printf("tid %d done\n", threadIdx.x);  // 在 kernel 內呼叫標準函式庫
}

支援 recursion，並能直接「cut-and-paste」CPU code 取得堪用 kernel (後續仍建議調效)。
kernel 可呼叫 printf()、malloc()。
- printf() 對 production 除錯特別有用：可加一個 dump 內部狀態的模式，讓非技術 end user 也能回報有意義的 bug report。

Device lambdas (CUDA 8 + C++11)

搭配 metaprogramming → 高效能、可重用 code。
lambda 可作為參數傳入 kernel：寫泛型 kernel，例如 sorting kernel 把比較函式當輸入參數。
Extended lambdas (--extended-lambda 旗標)：可在 C++ lambda 上加 __host__ __device__，進一步簡化可重用 code。

// 由比較 lambda 參數化的泛型 kernel
template <typename Compare>
__global__ void sort_kernel(int *data, int n, Compare cmp) {
    // ... 用 cmp(a, b) 決定順序 ...
}

// extended lambda：同一份程式碼可在 host / device 執行
auto cmp = [] __host__ __device__ (int a, int b) { return a < b; };
sort_kernel<<<grid, block>>>(d_data, n, cmp);

Kernel 內例外處理 (Exception Handling in Kernel Functions)

早期 CUDA：kernel code 完全不支援 exception handling。
對 performance-critical 區段影響不大，但 production 品質應用常依賴 exception 來偵測/處理罕見情況 (不必額外寫顯式測試碼)。
目前提供有限 (limited) 的 exception handling，主要透過 CUDA debugger 體現：

Debugger 能力	用途
step-by-step execution	逐步追蹤 kernel
breakpoints	在指定點暫停
run-until invalid memory access	跑到越界存取自動停
inspect local / global variables	暫停時檢視變數值

Tip

CUDA debugger 對偵測 out-of-bounds memory access 與潛在 race condition 非常有用。

多 Grid 同時執行 (Simultaneous Execution of Multiple Grids)

最早期：每個 device 同一時間只能執行一個 grid。多個 grid 雖可用 CUDA streams 提交，但被排在 queue 中緩衝，前一個跑完才放下一個。
Fermi 之後：允許同一應用的多個 grid 同時執行。
- 降低開發者把多個 kernel「batch」成一個大 kernel 以填滿 device 的壓力。
- 可把工作切成不同 priority 的 chunk。

Pre-Fermi (一次一個 grid，串列):
  stream queue : [Grid A][Grid B][Grid C]
  GPU 時間軸   : |==A==|==B==|==C==|     小 grid 無法填滿 SM -> 低利用率

Fermi+ (多 grid 同時執行):
  SMs : |===== Grid A (low-prio local) =========|
        |   |== Grid B (high-prio remote) ==|   |   <- 立刻插入執行
        time -->

典型情境 (parallel cluster)：工作分成 local 與 remote partition，remote 牽涉與其他 node 互動、位於全域進度的 critical path 上。

舊系統的兩難	結果
等 remote work 才動作	device 閒置、低利用率
急著先做 local work 填滿 device	remote work 卡在大 grid 後面，latency 變高
多 grid 同時執行	用較小 grid 提交工作，高優先 remote work 到達時低延遲立即啟動

硬體佇列與動態平行 (Hardware Queues and Dynamic Parallelism)

Kepler / CUDA 5：在多 grid launch 之上加入多條 hardware queues，更有效率地排程「來自多個 stream 的多個 grid」的 thread blocks。

單一 HW queue (false serialization):
  stream1: A1->A2     stream2: B1
  HW queue: [A1][A2][B1]   <- B1 雖獨立卻被卡在 A2 之後

多 HW queues (Kepler):
  stream1 --> HWQ0: [A1][A2]
  stream2 --> HWQ1: [B1]      <- B1 與 A1/A2 併發派發

Dynamic parallelism (詳見 21-CUDA-Dynamic-Parallelism/01-Dynamic-Parallelism-Fundamentals)：

GPU grid 可非同步、動態、依資料/計算負載地 launch child grids (GPU work creation)。
減少 CPU-GPU 互動與同步；CPU 因此可去做其他有用運算。

Tip

Hardware queues 解決的是「既有多個 stream/grid 之間」的排程；Dynamic parallelism 解決的是「GPU 自己產生新工作」。兩者互補。

可中斷的 Grid (Interruptable Grids)

Fermi：允許正在執行的 grid 被「cancel」。
讓 CUDA 加速應用可隨時中止長時間計算，且開發者幾乎不需特別設計。
用途：
- 實作 user-level task scheduling 系統。
- GPU node 間更好的 load balance。
- 某顆 GPU 負載過重、跑得比同儕慢時，能更優雅地處理。

協作式 Kernel (Cooperative Kernels)

GPU 處理 irregular data 常受 load imbalance 之苦。CUDA 11 引入 cooperative kernels。

特性	說明
block 數上限	最多可達「填滿整個 GPU」的 thread block 數
並發保證 (concurrency guarantee)	CUDA runtime 保證所有 thread blocks 同時執行
用途	讓 blocks 能安全合作，不會 deadlock 於共享互斥機制 (如 mutex) 來保護共享結構 (如 work queue)
啟動 API	`cudaLaunchCooperativeKernel()`
跨裝置	`cudaLaunchCooperativeKernelMultiDevice()`
device API	用來辨識並切分 thread 群組

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void coop_kernel(/* ... */) {
    cg::grid_group grid = cg::this_grid();
    // ... phase 1 ...
    grid.sync();   // grid-wide barrier：唯有「所有 block 併發」才安全
    // ... phase 2 ...
}

// 啟動：grid 必須能塞進 GPU，不可 oversubscribe
void *args[] = { /* ... */ };
cudaLaunchCooperativeKernel((void*)coop_kernel, gridDim, blockDim, args);

一般 grid (block 數 > SM 容量) -> 分波 (waves)，排程順序未定義
  若呼叫 grid-wide barrier -> 等待中的 block 永遠不會上場 -> DEADLOCK

Cooperative grid (block 數 <= GPU 容量) -> 全部同時駐留
  SM0:[B0][B1]   SM1:[B2][B3]  ... 全在跑  -> grid.sync() 安全

為什麼一般 grid 不能做 grid.sync()

一般 grid 的 block 排程是 transparent scalability 設計 (見 04-Compute-Architecture-And-Scheduling/01-GPU-Architecture-and-Block-Scheduling)：blocks 可分波、順序不保證、彼此不可互等。cooperative kernel 用「犧牲可任意擴張的 block 數」換取「全 grid 同時駐留」的並發保證，才讓跨 block 同步/合作成為可能。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
早期 kernel 為何不支援 recursion / virtual function?	compiler 必須 inline 全部函式，runtime 沒有真正 call；Kepler/CUDA 5 後有 per-thread call frame stack 才支援
kernel 內可呼叫哪些 std 函式？最大用途？	`printf()`、`malloc()`；`printf()` 用於 production 除錯，讓 end user dump 內部狀態回報 bug
想寫一個比較函式可換的泛型 sort kernel	用 device lambda 當 kernel 參數；跨 host/device 需 `__host__ __device__` + `--extended-lambda`
kernel 為何要支援 function call (非效能理由)？	composability：元件可組裝、廠商可釋出無原始碼的 device library (IP 保護)
多個 grid 透過 stream 提交卻不併發？(舊系統)	早期一次只跑一個 grid，stream 內被 buffer 在 queue；Fermi 起才同時執行多 grid
local/remote 工作、remote 在 critical path	用多 grid + 小 grid size，讓高優先 remote work 低延遲插入
多個獨立 stream 仍被序列化 (false serialization)	Kepler 多 hardware queues 解決
GPU 自己依資料量動態產生工作	Dynamic parallelism (child grid)，非 hardware queues
讓使用者中止長時間 GPU 計算	Interruptable grids (Fermi，grid cancel)
多 block 要用 mutex 共享 work queue 卻怕 deadlock	Cooperative kernel (CUDA 11)：保證所有 block 併發，`cudaLaunchCooperativeKernel()` + `grid.sync()`
cooperative kernel 的限制	block 數不可超過 GPU 容量 (不可 oversubscribe)，否則無法保證並發
一般 grid 為何不能做 grid-wide barrier	block 分波、排程順序不保證 → 等待者永不上場 → deadlock

重點總覽 (Overview)

Kernel 內函式呼叫 (Function Calls within Kernel Functions)

Device lambdas (CUDA 8 + C++11)

Kernel 內例外處理 (Exception Handling in Kernel Functions)

多 Grid 同時執行 (Simultaneous Execution of Multiple Grids)

硬體佇列與動態平行 (Hardware Queues and Dynamic Parallelism)

可中斷的 Grid (Interruptable Grids)

協作式 Kernel (Cooperative Kernels)

考試/面試重點 (Exam / Test Patterns)

Related Notes