重要執行考量與總結

#advanced-practice #dynamic-parallelism #parent-child-kernels #cuda-streams #nesting-depth

重點總覽 (Overview)

要正確且有效地使用 Dynamic Parallelism,必須掌握四大執行考量。下表為總覽:

考量項目 (Consideration)	關鍵規則	預設值 / 上限	出錯後果
Memory & data visibility	只能傳 global / constant / texture 指標給 child;不可傳 local / shared	—	存取無效記憶體 / 讀到舊資料
Pending launch pool	追蹤執行中與待執行 kernel 的固定緩衝區	2048 pending launches	超量改用 virtualized pool,慢 10×+
Streams (per-thread)	stream 範圍 private to block;不指定則用 NULL stream	NULL stream	同 block 內全部 launch 被序列化
Nesting / synchronization depth	巢狀 launch 的層數限制	最大 24 層	超過上限 launch 失敗

Important

本書 (4th ed.) 描述的 parent-child 顯式同步 (cudaDeviceSynchronize 於 device 端) 在 CUDA 12 / Compute Capability 9.0 已被棄用,衍生出「synchronization depth」這個額外限制。本筆記依書本內容說明;新平台請查最新 CUDA C++ Programming Guide。

記憶體與資料可見性 (Memory and Data Visibility)

當 parent thread 把指標傳給 child grid 時,必須保證該記憶體對 child 是可存取的,且要清楚何時寫入的資料才會彼此可見。

可傳遞 (兩者皆可存取):

記憶體類型	可傳給 child?	原因
Global memory	✅	device 全域共享
Constant memory	✅	device 全域共享 (唯讀)
Texture memory	✅	device 全域共享
Local memory	❌	thread 私有
Shared memory	❌	block 私有

Warning

把 __shared__ 陣列或區域變數的位址傳給 child grid 是未定義行為——child 看到的是無效位址。__device__ 全域變數或 cudaMalloc (host 或 device 端) 配置的指標才安全。

兩個一致視圖時間點 (Two coherence points)

parent 與 child 對記憶體只有兩個保證一致的時刻:

parent thread 時間軸 (memory writes)
  │
  ▼  write A  ──────────► [可見] child 看得到 launch 前的所有寫入 (A)
  │
 ┌┴───────────────┐  ← (1) LAUNCH child grid   ← 一致視圖點 #1
 │  launch child   │
 └┬───────────────┘
  │  write B  ──────────► [不保證] child 不一定看得到 launch 後的寫入 (B)
  │
  │     ┌──────────── child grid 執行中, write C ────────────┐
  │     └──────────────────────────────────────────────────┘
  │
 ┌┴───────────────┐  ← (2) parent SYNC 於 child 完成 ← 一致視圖點 #2
 │  cudaDevice...  │
 └┬───────────────┘
  ▼  read C  ──────────► [可見] 同步後才保證看得到 child 的寫入 (C)

點 #1 — child launch 時刻:launch 之前的所有寫入,child 一定看得到;launch 之後的寫入,無保證。
點 #2 — parent 同步於 child 完成時:在此之前,child 的寫入不保證對 parent 可見。

Tip

想讓 child 讀到某筆資料,就在 launch child 之前寫完;想讓 parent 讀 child 的結果,就先同步再讀。

Pending Launch Pool 配置 (Pending Launch Pool Configuration)

Pending launch pool = 追蹤「執行中或等待執行」kernel 的緩衝區,大小固定。

                ┌─────────────────────────────┐
launches ─────► │  Fixed pool (預設 2048 slots) │ ──► 快速排程
                └─────────────────────────────┘
                          │ 超過 2048
                          ▼
                ┌─────────────────────────────┐
                │   Virtualized pool          │ ──► 慢 10× 以上
                └─────────────────────────────┘

預設可容納 2048 個 pending kernel calls。
超過 → 啟用 virtualized pool,效能掉 一個數量級 (order of magnitude) 以上。
解法:在 host 端、launch parent 之前呼叫:

// 把固定 pool 調到「預期 launch 的 grid 數」
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, N_LINES);
computeBezierLines_parent<<<N_LINES/BLOCK_DIM, BLOCK_DIM>>>(...);

Bezier 範例量化:

N_LINES	child grids 數	預設 2048 pool	結果
4096	4096	一半 (2048) 溢位到 virtualized	顯著變慢
4096	4096	先設 limit = 4096	全進 fixed pool,快很多

Warning

一般建議:只在預期 launch 數超過預設 2048 時才調高。把 pool 調比實際需求小 (例如 N_LINES=1024 卻設成 1024) 既無加速也無意義——預設 2048 已足夠涵蓋。

Per-Thread Streams (Streams)

device thread 也能像 host 一樣用 streams 讓 child grids 併發。關鍵:stream scope 是 block-private。

情境	行為	平行度
不指定 stream → 用 NULL stream	同 block 內所有 launch (即使來自不同 thread) 被序列化	低
每個 thread 建立 named stream	同 block 內不同 thread 的 grid 併發	高

同一個 block (多個 thread 各 launch 一個 child grid)

【NULL stream — 序列化】          【Named streams — 併發】
 t0 ▓▓▓                            t0 ▓▓▓
 t1     ▓▓▓                        t1 ▓▓▓
 t2         ▓▓▓                    t2 ▓▓▓
 t3             ▓▓▓                t3 ▓▓▓
 ───────────時間──────►            ───時間──►
 (浪費 SM,平行度大降)             (充分利用 SM)

取代 Bezier 範例 (Fig. 21.7) line 19 的程式碼 (Fig. 21.12):

cudaStream_t stream;
// 每個 thread 建立自己的 non-blocking stream (block-private)
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
computeBezierLines_child<<<gridDim, blockDim, 0, stream>>>(
        bLines_d, lidx, bLines_d[lidx].nPoints);
cudaStreamDestroy(stream);

Warning

若 Bezier child kernel 沿用預設 NULL stream,同一 parent block launch 的所有 child grid 會被序列化,平行度甚至比沒用 dynamic parallelism 的原始版本還差。必須改用 named streams 才能贏回併發。

巢狀深度與同步深度 (Nesting Depth & Synchronization Depth)

child kernel 自己也能再 launch kernel,層層相疊。

depth 0  parent grid (host launch)
   └─► depth 1   child grid
          └─► depth 2   grandchild grid
                 └─►  ...           ...
                        └─► depth 23  (最深合法層)
                               ✗ depth 24+ → 不合法

Nesting depth:總共抵達的層數;目前硬體上限 = 24 層。
像 quadtree 這種遞迴,launch 前必須先檢查是否已達 max depth (if (depth >= MAX_DEPTH) return;),否則 launch 失敗。
Synchronization depth:當存在 parent-child 同步時,系統需保存 parent grid 狀態,記憶體成本更高 → 對巢狀深度有額外更嚴格的限制 (≤ nesting depth)。

Tip

nested parallelism 像「樹狀處理」。樹要矮要胖才有效率:每個節點 (grid) deploy 多 block / 多 thread (thick nodes)、且 branch degree 大 (每個 parent 多 children)。深而瘦的樹因深度受限無法有效實作。

本章總結 (Chapter Summary)

Dynamic Parallelism 擴充 CUDA 模型,讓 kernel 可呼叫 kernel,核心價值:

能力	帶來的好處	範例
thread 動態發現工作並 launch grid	更佳的 thread/block 間負載平衡、消除 control divergence	Bezier 曲線
thread 端 device-side `cudaMalloc`/`cudaFree`	依曲率配置「剛剛好」的記憶體,大幅省記憶體	Bezier 曲線
支援遞迴演算法	把遞迴自然映射到巢狀 launch	Quadtree

Warning

小 grid 陷阱:launch 只有少量 thread 的 child grid 會嚴重低度利用 (underutilize) GPU。一般建議:child grid 要有很多 block;若 block 少,至少每個 block 要有很多 thread。

Important

要用好 dynamic parallelism,必須理解三件事:memory 可見性、pending launch count、streams。parent/child 之間記憶體與 stream 的謹慎使用,是正確執行與達成預期平行度的關鍵。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
child 可存取哪些 parent 記憶體?	global / constant / texture 可;shared / local 不可 (private)
parent/child 何時記憶體視圖一致?	(1) launch child 時 (2) parent 同步於 child 完成時
launch 後 parent 才寫的資料 child 看得到嗎?	不保證 (只保證 launch 前的寫入)
預設 pending launch pool 大小?	2048;超過用 virtualized pool,慢 10×+
如何避免 pending pool 變慢?	host 端 `cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, N)`
N_LINES=1024, fixed pool 調成 1024 有幫助嗎?	沒有——預設 2048 已足夠,且調小無益 (Ex.1b = False)
stream 不指定 → 後果?	用 NULL stream,同 block launch 全序列化;需 named stream 才併發
stream 的作用範圍?	block-private (不可跨 block 使用其他 block 建的 stream)
最大 nesting depth?	24 層;遞迴 kernel launch 前須檢查
synchronization depth 是什麼?	有 parent-child 同步時,因需存 parent state 而產生的更嚴格深度限制
Ex.1a: N_LINES=1024, BLOCK_DIM=64,launch 幾個 child?	1024 (每 thread 一個 child),非 16;16 是 block 數 = N_LINES/BLOCK_DIM → 敘述 False
Ex.2: 64 個等距點的 quadtree 最大深度 (含 root,min>2)?	4 (64→16→4→1;depth 0~3)
Ex.3: 同上 quadtree 總 child kernel launch 數?	21 = 1 + 4 + 16 (depth 0/1/2 各層節點皆 launch,depth 3 只剩 1 點不 launch)
Ex.4: parent 可定義新 `__constant__` 給 child 繼承?	False (`__constant__` 為編譯期 file-scope,非 runtime 建立)
Ex.5: child 可存取 parent 的 shared/local?	False (private)
Ex.6: 6 blocks × 256 threads,用 NULL stream,幾個 child 併發?	6 (block 內序列化,跨 block 併發 → 每 block 1 個)
小 child grid 的代價?	GPU 嚴重低度利用;child grid 要多 block / 多 thread

重點總覽 (Overview)

記憶體與資料可見性 (Memory and Data Visibility)

兩個一致視圖時間點 (Two coherence points)

Pending Launch Pool 配置 (Pending Launch Pool Configuration)

Per-Thread Streams (Streams)

巢狀深度與同步深度 (Nesting Depth & Synchronization Depth)

本章總結 (Chapter Summary)

考試/面試重點 (Exam / Test Patterns)

Related Notes