範例：Bezier Curves 與動態工作量

#advanced-practice #dynamic-parallelism #bezier-curves #device-side-malloc #parent-child-kernels

重點總覽 (Overview)

本節以 adaptive subdivision of spline curves（Bezier 曲線自適應細分）示範「每個工作單位的工作量會動態變化」時，Dynamic Parallelism 如何同時 (1) 抽取更多平行度、(2) 消除 control divergence、(3) 用 device-side cudaMalloc 精準配置記憶體。

項目	無 Dynamic Parallelism (Fig. 21.6)	有 Dynamic Parallelism (Fig. 21.7)
Kernel 數量	1 個 `computeBezierLines`	`computeBezierLines_parent` + `computeBezierLines_child` (+ `freeVertexMem`)
一組控制點由誰處理	一個 block	parent 用一條 thread；實際細分由一個 child grid
工作量決定方式	block 算 curvature → for-loop 迭代	parent thread 算 curvature → launch 大小可變的 child grid
變動工作量的代價	block 間迭代數差異大 → SM 利用率下降；warp 內迭代數差異 → control divergence	child grid 內每 thread 只做 1 個 vertex → load balance 佳、無 divergence
vertex 記憶體	每條線固定配 `MAX_TESS_POINTS`（最壞情況）	`vertexPos` 為指標，device 端 `cudaMalloc` 精準配置
釋放記憶體	不需要（靜態陣列）	device 配的記憶體須由 device kernel `freeVertexMem` 釋放

Important

核心觀念：當「每個 thread 要做的工作量在執行期才知道、且彼此差異很大」時，把 for-loop 改成「parent thread 啟動一個 child grid」就能把迴圈迭代轉成平行 thread。Bezier 例子展示的正是 variable amount of child grid launches（依工作量而定的可變 child grid 啟動）。

Bezier 曲線數學背景 (Bezier Curve Math)

Bezier 曲線由一組 control points P0…Pn 定義，n 為階數（order）。首尾控制點一定在曲線上，中間控制點通常不在曲線上。

類型	控制點	公式
Linear（一階）	`P0, P1`	`B(t) = (1−t)P0 + t·P1`
Quadratic（二階）	`P0, P1, P2`	`B(t) = (1−t)²P0 + 2(1−t)t·P1 + t²P2`

本範例使用 quadratic（3 個控制點）。
curvature（曲率）越大 → 需要越多 tessellation points 才能畫得平滑，這就是每組控制點工作量不同的來源。

tessellation 取樣：t ∈ [0,1] 等分成 nTessPoints 個點
u = idx / (nTessPoints − 1)        // idx = 0 .. nTessPoints-1
B3u[0]=(1-u)²   B3u[1]=2(1-u)u   B3u[2]=u²
position = B3u[0]*P0 + B3u[1]*P1 + B3u[2]*P2

無 Dynamic Parallelism 的版本 (Without Dynamic Parallelism)

設計：一個 block 處理一組控制點，block 先算 curvature 推得點數，再用 for-loop（步進 blockDim.x）讓 block 內所有 thread 協作填滿 vertex 陣列。

#define MAX_TESS_POINTS 32
struct BezierLine {
    float2 CP[3];                       // 3 control points (quadratic)
    float2 vertexPos[MAX_TESS_POINTS];  // 固定配最壞情況大小！
    int    nVertices;
};

__global__ void computeBezierLines(BezierLine *bLines, int nLines) {
    int bidx = blockIdx.x;                       // <-- block 為單位
    if (bidx < nLines) {
        float curvature = computeCurvature(bLines);
        int nTessPoints = min(max((int)(curvature*CURVE_TESS_FACTOR),4),
                              MAX_TESS_POINTS);   // 工作量 ∝ curvature
        bLines[bidx].nVertices = nTessPoints;
        // for-loop：迭代數因 block 而異 -> 工作量不均
        for (int inc = 0; inc < nTessPoints; inc += blockDim.x) {
            int idx = inc + threadIdx.x;
            if (idx < nTessPoints) {
                float u = (float)idx/(float)(nTessPoints-1);
                float omu = 1.0f - u, B3u[3] = {omu*omu, 2.0f*u*omu, u*u};
                float2 position = {0,0};
                for (int i=0;i<3;i++) position = position + B3u[i]*bLines[bidx].CP[i];
                bLines[bidx].vertexPos[idx] = position;
            }
        }
    }
}

兩個問題：

block0 (curvature 小, 4 pts)   : [#-------]  迭代 1 次後閒置
block1 (curvature 大, 32 pts)  : [########]  迭代 32 次
block2 (curvature 小, 6 pts)   : [##------]  迭代 2 次後閒置
   ^ block 間工作量差異 -> SM 利用率低
   ^ 同一 warp 內 thread 迭代數不同 -> control divergence
   ^ vertexPos 每條都吃 MAX_TESS_POINTS 空間 -> 記憶體浪費

Warning

vertexPos[MAX_TESS_POINTS] 對每條線都配置「最壞情況」大小。當各組控制點 curvature 差異很大時，大量空間被閒置。

有 Dynamic Parallelism 的版本 (With Dynamic Parallelism)

把原 kernel 拆成 parent（發現工作量） 與 child（執行計算）。

關鍵差異 1：thread 為單位 vs block 為單位

parent 每組控制點的工作（算 curvature + launch）很小，所以用一條 thread做就好（原版用一個 block）。因此 host 端 launch 時把 grid 大小除以 BLOCK_DIM：

// host 端
computeBezierLines_parent<<<ceil((float)N_LINES/BLOCK_DIM), BLOCK_DIM>>>(bLines_d, N_LINES);

關鍵差異 2：device-side `cudaMalloc` 精準配置

struct BezierLine {
    float2  CP[3];
    float2 *vertexPos;   // <-- 改成指標，動態配置
    int     nVertices;
};

__global__ void computeBezierLines_parent(BezierLine *bLines, int nLines) {
    int lidx = threadIdx.x + blockDim.x*blockIdx.x;   // <-- thread 為單位 (vs Fig.21.6 的 blockIdx.x)
    if (lidx < nLines) {
        float curvature = computeCurvature(bLines);
        bLines[lidx].nVertices = min(max((int)(curvature*CURVE_TESS_FACTOR),4),
                                     MAX_TESS_POINTS);
        // 在 device 端依實際點數精準配置記憶體
        cudaMalloc((void**)&bLines[lidx].vertexPos,
                   bLines[lidx].nVertices*sizeof(float2));
        // launch 一個大小可變的 child grid（依 nVertices）
        computeBezierLines_child<<<ceil((float)bLines[lidx].nVertices/32.0f), 32>>>
            (lidx, bLines, bLines[lidx].nVertices);
    }
}

__global__ void computeBezierLines_child(int lidx, BezierLine *bLines, int nTessPoints) {
    int idx = threadIdx.x + blockDim.x*blockIdx.x;   // 每 thread 只算 1 個 vertex
    if (idx < nTessPoints) {
        float u = (float)idx/(float)(nTessPoints-1);
        float omu = 1.0f - u, B3u[3] = {omu*omu, 2.0f*u*omu, u*u};
        float2 position = {0,0};
        for (int i=0;i<3;i++) position = position + B3u[i]*bLines[lidx].CP[i];
        bLines[lidx].vertexPos[idx] = position;       // 無迴圈、無 divergence
    }
}

thread → child grid 映射

parent grid (N_LINES 條 thread, 每 thread = 一組控制點)
 t0 ─launch→ child grid A: [#-#-]     (4 pts  -> 1 block)
 t1 ─launch→ child grid B: [########################] (32 pts -> 1 block,32thr)
 t2 ─launch→ child grid C: [#-#-#-]   (6 pts  -> 1 block)
 ...
 每條 child grid 大小不同 = variable child grid launches
 child grid 內「1 thread = 1 vertex」-> 完美 load balance, 0 divergence

Tip

Dynamic parallelism kernel call 與 host launch 一樣是 asynchronous：parent thread 啟動 child 後可繼續執行，不必等待。若要等待須明確同步；即使不同步，parent grid 結束前也有 implicit synchronization 確保所有 child grid 都已完成。

Important

「parent thread 啟動 child 那一刻」與「child 完成且 parent 同步那一刻」，是 parent 與 child 記憶體視圖一致的兩個時間點。launch 之後 parent 的寫入不保證被 child 看到；child 的寫入要等 parent 同步後才保證可見（詳見 21-CUDA-Dynamic-Parallelism/04-Execution-Considerations-and-Summary）。

Device 端記憶體釋放 (Device-Side cudaFree)

device kernel 用 cudaMalloc 配置的記憶體，必須由 device kernel cudaFree 釋放（host 不能直接 free 它）。因此另寫一個 kernel 由 host 呼叫：

__global__ void freeVertexMem(BezierLine *bLines, int nLines) {
    int lidx = threadIdx.x + blockDim.x*blockIdx.x;
    if (lidx < nLines)
        cudaFree(bLines[lidx].vertexPos);   // 平行釋放每條線的 vertex 記憶體
}
// host 端
freeVertexMem<<<ceil((float)N_LINES/BLOCK_DIM), BLOCK_DIM>>>(bLines_d, N_LINES);

配置/釋放位置	由誰配置	須由誰釋放
device-side `cudaMalloc`（在 kernel 內）	device kernel	device kernel `cudaFree`（不能由 host）
host-side `cudaMalloc`	host	host `cudaFree`

工作量、Pool 與 Stream 的關聯 (Workload, Pool & Streams)

此範例是後續「重要執行考量」的 running example，這裡只點出與工作量直接相關的兩個陷阱（細節見 21-CUDA-Dynamic-Parallelism/04-Execution-Considerations-and-Summary）：

Pending launch pool：parent 每條 thread 各 launch 一個 child grid → 共 launch N_LINES 個 child grid。預設 pool 容量 2048；若 N_LINES 超過會落入 virtualized pool，效能可能慢一個數量級。對策：
```
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, N_LINES);
```
Named streams：stream scope 為 block 私有；不指定 stream 時，同一 block 內所有 launch 都用 NULL stream → 被序列化（即使來自不同 thread）。要讓同 block 各 thread 的 child grid 並行，須每 thread 建立 non-blocking named stream：
```
cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
computeBezierLines_child<<<ceil((float)bLines[lidx].nVertices/32.0f), 32, 0, stream>>>
    (lidx, bLines, bLines[lidx].nVertices);
cudaStreamDestroy(stream);
```

Warning

若直接用預設 NULL stream，同 block 的 child grid 全部序列化，平行度甚至可能比原本「無 dynamic parallelism」版本還差。

關鍵公式與計數 (Key Formulas & Counts)

量	公式
Quadratic Bezier	`B(t) = (1−t)²P0 + 2(1−t)t·P1 + t²P2`
tessellation 取樣	`u = idx / (nTessPoints − 1)`, `idx = 0 … nTessPoints−1`
tessellation 點數	`nTess = min(max((int)(curvature·CURVE_TESS_FACTOR), 4), MAX_TESS_POINTS)`
parent grid blocks	`ceil(N_LINES / BLOCK_DIM)`，block 大小 `BLOCK_DIM`
child grid（每條線）	blocks = `ceil(nVertices / 32)`，block 大小 `32`
啟動的 child kernel 總數	`= N_LINES`（每條 parent thread 一個，與 parent block 數無關）
per-thread streams 總數	`= N_LINES`（每條 parent thread 各建一個）

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
`N_LINES=1024, BLOCK_DIM=64`，launch 幾個 child kernel？	1024（每條 thread 一個 child grid），不是 16；16 只是 parent block 數
`N_LINES=1024`，該把 fixed pool 從 2048 調成 1024 嗎？	否。預設 2048 已足夠涵蓋 1024；調小不會變快，反而可能限制。只有「預期 launch 數 > 預設」才需調大
`N_LINES=1024, BLOCK_DIM=64` + per-thread streams，幾個 stream？	1024（每條 thread 各一個），不是 16
6 個 block × 256 thread 跑 parent，且用預設 NULL stream，最多幾個 child kernel 並行？	6（stream 為 block 私有，NULL stream 使同 block launch 序列化 → 每 block 最多 1 個並行）
為什麼把 block-per-line 改成 thread-per-line？	parent 工作（curvature + launch）很小，用一條 thread 即可；實際運算交給 child grid
為什麼 child 版能消除 control divergence？	child grid 內「1 thread = 1 vertex」，無變長 for-loop，warp 內工作量一致
device kernel 內 `cudaMalloc` 的記憶體誰來 free？	必須由 device kernel `cudaFree`（如 `freeVertexMem`），host 不能直接釋放
dynamic parallelism launch 是同步還是非同步？	非同步；不顯式同步也有 parent grid 結束前的 implicit synchronization
parent 寫入何時對 child 可見？	只保證「launch 之前」的寫入可見；child 的寫入要等 parent 同步後才保證可見
Bezier 範例的意義（一句話）	示範 variable amount of child grid launches：工作量隨資料動態變化時用 child grid 取代 for-loop

重點總覽 (Overview)

Bezier 曲線數學背景 (Bezier Curve Math)

無 Dynamic Parallelism 的版本 (Without Dynamic Parallelism)

有 Dynamic Parallelism 的版本 (With Dynamic Parallelism)

關鍵差異 1：thread 為單位 vs block 為單位

關鍵差異 2：device-side cudaMalloc 精準配置

thread → child grid 映射

Device 端記憶體釋放 (Device-Side cudaFree)

工作量、Pool 與 Stream 的關聯 (Workload, Pool & Streams)

關鍵公式與計數 (Key Formulas & Counts)

考試/面試重點 (Exam / Test Patterns)

Related Notes

關鍵差異 2：device-side `cudaMalloc` 精準配置