Frontier、Privatization 與其他 BFS 最佳化 (Frontiers, Privatization, Other Optimizations)

#parallel-pattern #graph-traversal #privatization #atomic-operations #load-balancing

重點總覽 (Overview)

前兩篇的 vertex-/edge-centric BFS 每個 level 都會對「全部 vertex 或 edge」啟動 thread，浪費大量無用工作。本篇用 frontier (邊界佇列) 只啟動處理「上一層真正活躍 vertex」的 thread，再用 privatization 把 frontier 搬到 shared memory 降低 atomic 競爭，最後處理 launch overhead 與 load balance。

最佳化	解決的問題	核心機制	代價 / 限制
Frontier (15.5)	全圖掃描、冗餘 thread	`prevFrontier`/`currFrontier` 佇列；`atomicCAS` 防重複入隊	atomic 長延遲；`atomicAdd` 對共用 counter 高度競爭
Privatization (15.6)	全域 frontier counter 競爭過高	每個 block 在 shared memory 維護 local frontier，結束時合併到 global	local frontier 容量有限，溢位需 fallback；需 `__syncthreads()`
Reducing launch overhead (15.7)	小 frontier 的 kernel 啟動成本 > 平行收益	單一 block kernel 用 `__syncthreads()` 連跑多個 level	受限於單 block 容量，溢位即返回 host
Load balancing (15.7)	高 variance degree (celebrity vertex)	依 degree 分 bucket：small→thread / medium→warp / large→block	需排序/分桶前處理

Important

Frontier 的本質：把「對所有 vertex 做相關性檢查」轉成「只列舉上一層產生的活躍 vertex」。這是把 input-centric 重新映射為 work-centric 的典型手法。

Frontier (Improving Efficiency with Frontiers)

動機

無 frontier 版本：每個 level 都對所有 vertex (push) 或所有 edge (edge-centric) 啟動 thread，多數 thread 只發現「我不相關」就結束 → 大量 wasted work。
Frontier 版本：上一層的 thread 協作建立 frontier，記錄它們新訪問到的 vertex；下一層只對 frontier 內的 vertex 啟動 thread (Luo et al., 2010)。

無 frontier (push)           有 frontier (push)
所有 9 個 vertex 各 1 thread   只有 prevFrontier 內的 vertex
┌─┬─┬─┬─┬─┬─┬─┬─┬─┐          prevFrontier = [1, 2]
│0│1│2│3│4│5│6│7│8│          ┌─┬─┐
└─┴─┴─┴─┴─┴─┴─┴─┴─┘          │1│2│  ← 只啟動 2 個 thread
 只有 1,2 真正做事            └─┴─┘
 其餘 7 個白跑                 currFrontier ← 新訪問的鄰居

Kernel 程式碼 (Fig. 15.12)

__global__ void bfs_frontier_kernel(
        CSRGraph csrGraph, unsigned int* level,
        unsigned int* prevFrontier, unsigned int* currFrontier,
        unsigned int numPrevFrontier, unsigned int* numCurrFrontier,
        unsigned int currLevel) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPrevFrontier) {                          // 只 launch 上層 frontier 大小
        unsigned int vertex = prevFrontier[i];          // 取出要處理的 vertex
        for (unsigned int edge = csrGraph.srcPtrs[vertex];
             edge < csrGraph.srcPtrs[vertex + 1]; ++edge) {
            unsigned int neighbor = csrGraph.dst[edge];
            // 一次原子完成「檢查未訪問 + 標記」
            if (atomicCAS(&level[neighbor], UINT_MAX, currLevel) == UINT_MAX) {
                unsigned int idx = atomicAdd(numCurrFrontier, 1);  // 取得入隊位置
                currFrontier[idx] = neighbor;                       // 寫入 currFrontier
            }
        }
    }
}

Note

不再需要 newVertexVisited flag。Host 端改以「numCurrFrontier == 0」判斷 BFS 結束。

為什麼必須用 atomicCAS (而非 idempotent 寫入)

版本	檢查+標記	為何如此
無 frontier (push)	非原子，直接寫 `level[neighbor]=currLevel`	idempotent：多 thread 寫相同值，重複標記無害
有 frontier	必須 atomicCAS	若多 thread 都看到 neighbor 未訪問，會把它重複加入 frontier，下一層被重複處理 → 浪費

atomicCAS(addr, compare, val)：若 *addr == compare 則 *addr = val；回傳舊值。
判斷成功：atomicCAS(...) == UINT_MAX 代表「這個 thread 是第一個訪問者」，只有它能入隊。

Warning

Frontier 的代價是 atomic 長延遲。其中 atomicCAS (line 11) 競爭中等(只有訪問同一鄰居的少數 thread 相撞)；但 atomicAdd(numCurrFrontier,1) (line 12) 競爭極高(所有成功 thread 都對同一個全域 counter 累加) → 由 15.6 的 privatization 解決。

Privatization (Reducing Contention with Privatization)

概念

與 09-Parallel-Histogram/02-Histogram-Optimizations-Privatization-Coarsening-Aggregation 中的 privatized histogram 相同思路：

每個 block 在 shared memory 維護自己的 local frontier 與 local counter。
thread 只與同 block 的 thread 競爭該 local counter → 競爭範圍從「整個 grid」縮小到「一個 block」。
結束時整批 commit 到 global frontier，且寫入可 coalesced (memory coalescing)。

       Block 0                 Block 1                Block 2
  ┌──────────────┐       ┌──────────────┐      ┌──────────────┐
  │ shared mem   │       │ shared mem   │      │ shared mem   │
  │ frontier_s[] │       │ frontier_s[] │      │ frontier_s[] │
  │ count_s (低延遲)│      │ count_s      │      │ count_s      │
  └──────┬───────┘       └──────┬───────┘      └──────┬───────┘
         │  block 內各 thread     │                    │
         │  atomicAdd 只撞同 block │                    │
         ▼                       ▼                    ▼
   一次 atomicAdd(global, count_s) 取得 startIdx，整批 coalesced 寫入
  ┌───────────────────────────────────────────────────────────┐
  │             global currFrontier[]  (public)                │
  └───────────────────────────────────────────────────────────┘

Kernel 程式碼 (Fig. 15.14)

__global__ void bfs_privatized_kernel(/* ...同上參數... */
        unsigned int currLevel) {
    // (1) 宣告並初始化 block 私有 frontier
    __shared__ unsigned int currFrontier_s[LOCAL_FRONTIER_CAPACITY];
    __shared__ unsigned int numCurrFrontier_s;
    if (threadIdx.x == 0) numCurrFrontier_s = 0;
    __syncthreads();                                   // 等 counter 初始化完成

    // (2) BFS 主體 (同 frontier 版)
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPrevFrontier) {
        unsigned int vertex = prevFrontier[i];
        for (unsigned int edge = csrGraph.srcPtrs[vertex];
             edge < csrGraph.srcPtrs[vertex + 1]; ++edge) {
            unsigned int neighbor = csrGraph.dst[edge];
            if (atomicCAS(&level[neighbor], UINT_MAX, currLevel) == UINT_MAX) {
                unsigned int idx_s = atomicAdd(&numCurrFrontier_s, 1); // shared，低延遲
                if (idx_s < LOCAL_FRONTIER_CAPACITY) {
                    currFrontier_s[idx_s] = neighbor;                  // 進 local frontier
                } else {                                               // local 溢位 → fallback
                    numCurrFrontier_s = LOCAL_FRONTIER_CAPACITY;       // 復原 counter
                    unsigned int idx = atomicAdd(numCurrFrontier, 1);  // 直接寫 global
                    currFrontier[idx] = neighbor;
                }
            }
        }
    }
    __syncthreads();                                   // 確保 local frontier 已完成

    // (3) 由一個 thread 在 global frontier 一次配置整段空間
    __shared__ unsigned int currFrontierStartIdx;
    if (threadIdx.x == 0)
        currFrontierStartIdx = atomicAdd(numCurrFrontier, numCurrFrontier_s);
    __syncthreads();

    // (4) coalesced 地把 local frontier 搬到 global frontier
    for (unsigned int idx_s = threadIdx.x; idx_s < numCurrFrontier_s;
         idx_s += blockDim.x) {
        currFrontier[currFrontierStartIdx + idx_s] = currFrontier_s[idx_s];
    }
}

競爭層級	atomic 目標	競爭範圍	延遲
無 privatization	`numCurrFrontier` (global)	整個 grid	高 (global atomic)
有 privatization	`numCurrFrontier_s` (shared)	單一 block	低 (shared atomic)
commit 階段	`numCurrFrontier` (global)	每 block 只 1 次 atomicAdd(count_s)	攤提後極低

Tip

commit 寫入用 idx_s = threadIdx.x; idx_s += blockDim.x 的步進，使連續 thread 寫連續 global 位址 → stores are coalesced，是 privatization 的額外紅利。

Warning

local frontier 容量 LOCAL_FRONTIER_CAPACITY 有限。溢位時必須 (a) 把 counter 釘回 capacity，(b) 改走 global atomicAdd fallback，否則會越界寫入 shared memory。

Reducing Launch Overhead (Other Optimizations)

BFS 前幾個 level 的 frontier 通常很小 (第 1 層只有 source 的鄰居)；末幾層也可能很小。
對這些小 frontier 的 level，終止舊 grid + 啟動新 grid 的開銷可能超過平行收益。
解法:準備一個只用單一 thread block 的 kernel，可連續處理多個 level，level 之間以 __syncthreads() 同步,全程只用 block 級 local frontier。

(A) 每個 level 各 launch 一次 grid     (B) 小 level 合併進單一 block grid
 host ─ launch ─► [grid L0]            host ─ launch ─► ┌ single-block grid ┐
        launch ─► [grid L1]                              │ L0 → __syncthreads()│
        launch ─► [grid L2] ...                          │ L1 → __syncthreads()│
   ▲ 每次都有 launch overhead                            │ (frontier 溢位即返回)│
                                                         └─────────────────────┘
                                       ▲ 一次 launch 跑完多個小 level

當 frontier 大到溢出 block-level frontier 時，threads 把 block-level frontier 複製到 global frontier 後返回 host；host 再改用一般 (multi-grid) kernel 繼續，直到 frontier 又變小。
效果：消除小 frontier 迭代的 launch overhead (書中留作習題實作)。

Improving Load Balance (Other Optimizations)

vertex-centric 中每個 thread 的工作量 = 其 vertex 的 degree。社群網路的 celebrity vertex degree 可能比一般 vertex 高出數個數量級 → 少數 thread 拖垮整個 grid。
已知對策之一：改用 edge-centric 實作 (見 15-Graph-Traversal/02-Vertex-Centric-and-Edge-Centric-BFS)。
另一對策 (Merrill & Garland, 2012)：把 frontier 內 vertex 依 degree 分桶，每桶用合適規模的處理器群:

frontier 內 vertex 依 degree 分 3 桶
  small  degree ──► kernel A: 1 vertex / 1 thread
  medium degree ──► kernel B: 1 vertex / 1 warp   (32 threads)
  large  degree ──► kernel C: 1 vertex / 1 block

Bucket	處理單位	適用	理由
small	1 thread	低 degree vertex	工作少，thread 足夠
medium	1 warp	中 degree	warp 內平行掃鄰居，控制 divergence
large	1 block	celebrity vertex	大量鄰居用整個 block 攤平

Tip

此技巧對 high variance degree 的圖特別有效，本質是把 thread→work 的映射依工作量重排，與 thread coarsening / edge-centric 同屬 load-balancing 思路。

Further Challenges & Summary (15.7–15.8)

BFS 雖是最簡單的圖演算法，卻體現複雜應用的共同挑戰：
1. problem decomposition 萃取平行性 (vertex vs edge vs frontier)。
2. 善用 privatization 降低 atomic 競爭。
3. 實作 fine-grained load balancing (分桶)。
4. 確保正確 synchronization (__syncthreads()、atomicCAS)。
進階挑戰：圖大小超過 GPU 記憶體容量；以及前處理成其他格式以暴露更多平行性 / locality / 利於 load balance。
應用領域:推薦系統、社群偵測、圖樣式比對、異常偵測。

全章演進路徑	解決的痛點
Vertex-centric (push/pull)	基本平行化
Edge-centric (COO)	暴露更多平行、降低 divergence
Direction-optimized	早層 push / 晚層 pull
Frontier	消除冗餘 thread/work
Privatization	降低 atomic 競爭
Launch reduction + bucket	降啟動開銷、平衡負載

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
Frontier 版每層啟動幾個 thread？	`numPrevFrontier`(上一層 frontier 大小)，非全圖 vertex 數
為何無 frontier 的 push 可不用 atomic，frontier 版卻要？	無 frontier 只標記 level → idempotent；frontier 版還要入隊，重複入隊會浪費下一層 → 需 `atomicCAS`
`atomicCAS(addr, compare, val)` 回傳什麼？如何判成功？	回傳舊值；`== UINT_MAX` 代表此 thread 第一個訪問該 neighbor，才入隊
Frontier 哪個 atomic 競爭最高？	全域 `atomicAdd(numCurrFrontier,1)`(所有成功 thread 撞同一 counter) → 用 privatization
Privatization 把競爭從哪裡降到哪裡？	grid 級 global atomic → block 級 shared atomic；每 block 只對 global 做 1 次 atomicAdd
Privatization 額外好處？	local→global commit 用 `threadIdx.x` 步進 → coalesced 寫入
local frontier 溢位怎麼辦？	counter 釘回 `LOCAL_FRONTIER_CAPACITY`，改走 global atomicAdd fallback
BFS 結束條件 (frontier 版)？	`numCurrFrontier == 0`(不再需要 visited flag)
小 frontier level 的開銷問題與解法？	launch overhead > 平行收益 → 單 block kernel 用 `__syncthreads()` 連跑多 level
celebrity vertex 造成 load imbalance 的兩個解法？	(1) edge-centric；(2) degree 分桶:small→thread / medium→warp / large→block

重點總覽 (Overview)

Frontier (Improving Efficiency with Frontiers)

動機

Kernel 程式碼 (Fig. 15.12)

為什麼必須用 atomicCAS (而非 idempotent 寫入)

Privatization (Reducing Contention with Privatization)

概念

Kernel 程式碼 (Fig. 15.14)

Reducing Launch Overhead (Other Optimizations)

Improving Load Balance (Other Optimizations)

Further Challenges & Summary (15.7–15.8)

考試/面試重點 (Exam / Test Patterns)

Related Notes