圖走訪與廣度優先搜尋練習題 (Practice - Graph Representations and Breadth-First Search)

#practice #parallel-pattern #graph-traversal #vertex-centric

15-Graph-Traversal/01-Graph-Representations-and-BFS — 圖的表示法與廣度優先搜尋 (Graph Representations and BFS)
15-Graph-Traversal/02-Vertex-Centric-and-Edge-Centric-BFS — Vertex-centric 與 Edge-centric 的 BFS 平行化 (Push, Pull, Edge-Centric)
15-Graph-Traversal/03-Frontiers-Privatization-and-Optimizations — Frontier、Privatization 與其他 BFS 最佳化 (Frontiers, Privatization, Other Optimizations)

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案 / 重點
fully connected 有向圖邊數	`N(N−1)`（無自環，每點 out-degree = `N−1`）
格式 ↔ 演算法對應	CSR(出邊)↔push/top-down；CSC(入邊)↔pull/bottom-up；COO(src+dst)↔edge-centric
BFS label 代表什麼	root 到該 vertex 的最少邊數 (level/depth = 最短 hop)；複雜度 O(V+E)
為何每層各 launch 一個 kernel	需跨層 barrier：等上一層全標完才標下一層，否則 level 標錯
Push 檢查條件 / 動作	`level[vertex]==currLevel−1` → 沿出邊把未訪問鄰居標 `currLevel`
Push 為何免 atomic	idempotent：多 thread 寫相同值，非 read-modify-write
Pull 檢查條件 / 為何能 break	`level[vertex]==UINT_MAX` → 找到一個前驅在上一層即標自己並 break（夠了，降 divergence）
Edge-centric 兩優點	暴露更多平行度（`E≫V`，塞滿小圖）；每 thread 恰 1 邊 → load imbalance 最低
Edge-centric 兩缺點	掃每條邊（無法跳過無關頂點整串邊）；COO 儲存多（`2E`）
Frontier 每層 launch 幾個 thread	`numPrevFrontier`（上一層 frontier 大小），非全圖 V
Frontier 為何要 atomicCAS	還要入隊，多 thread 重複入隊 → 下一層重複處理浪費；CAS 一次完成「檢查未訪問+標記」
`atomicCAS(addr,cmp,val)` 回傳	舊值；`== UINT_MAX` 代表此 thread 第一個訪問者 → 才入隊
Frontier 結束條件	`numCurrFrontier == 0`（不再需要 `newVertexVisited` flag）
競爭最高的 atomic / 解法	全域 `atomicAdd(numCurrFrontier,1)` → privatization（shared memory local frontier）
Privatization 額外紅利	local→global commit 用 `threadIdx.x` 步進 → coalesced 寫入
local frontier 溢位	counter 釘回 `LOCAL_FRONTIER_CAPACITY`，改走 global `atomicAdd` fallback
小 frontier 的 launch overhead 解法	單一 block kernel 用 `__syncthreads()` 連跑多 level，溢位才返回 host
celebrity vertex load imbalance	(1) edge-centric；(2) degree 分桶 small→thread / medium→warp / large→block
Direction-optimized	早期 push、後期 pull；需同時存 CSR+CSC（undirected 對稱則一份）
push/pull ↔ scatter/gather	push=scatter（推給鄰居），pull=gather（拉自鄰居）

Question 1 - 圖表示法與格式對應 [recall]

情境/題目：一個有 N 個 vertex 的有向圖若 fully connected，共有幾條邊？CSR、CSC、COO 三種稀疏格式各自最易取得什麼資訊，分別對應哪一種 BFS 平行化策略？

查看答案

Fully connected 有向圖共 N(N−1) 條邊（無自環，每點 out-degree = N−1）。

CSR(srcPtrs+dst)：易取某 source 的出邊（row 非零）→ push / top-down。
CSC(dstPtrs+src)：易取某 destination 的入邊（column 非零）→ pull / bottom-up（也用於最短路徑回溯找 predecessor）。
COO(src+dst)：由一條 edge 直接取 (src, dst) → edge-centric。
核心觀念：格式選擇 = 演算法選擇。

Question 2 - BFS level labeling 與跨層同步 [recall]

情境/題目：BFS 對每個 vertex 的 label（level/depth）代表什麼意義？為何平行 BFS 必須對每一層各 launch 一個 kernel？整體計算複雜度是多少？

查看答案

label = 從 root 到該 vertex 的最少邊數（最短 hop count = level/depth）。
每層各 launch 一個 kernel 是因為需要一個跨層 barrier：必須等上一層所有 vertex 都標完，才能開始標下一層，否則可能把某 vertex 標到錯誤的 level。換 root 後 BFS 結果可能截然不同（即使新 root 只差 1 條邊）。
複雜度 O(V + E)（每 vertex、每 edge 各走訪一次）。

Question 3 - Push kernel 與 idempotence [recall]

情境/題目：vertex-centric push（top-down）kernel 中，每個 thread 對應什麼、檢查條件與動作是什麼？為何這個 kernel 不需要 atomic operation？

查看答案

thread = 一個 vertex。檢查 level[vertex] == currLevel − 1（我的 vertex 在上一層）；若成立，沿 CSR 出邊（srcPtrs/dst）把所有未訪問（level == UINT_MAX）鄰居標為 currLevel，並設 newVertexVisited = 1。
不需 atomic 因為這是 idempotent 寫入：多個 thread 即使同時標同一鄰居，寫的都是相同的值（currLevel、1），非 read-modify-write，最終結果不變。

Question 4 - Pull kernel 與提早 break [recall]

情境/題目：vertex-centric pull（bottom-up）kernel 使用哪種格式？thread 的檢查條件是什麼？為何找到一個符合的鄰居就能 break 早退，這對效能有何幫助？

查看答案

使用 CSC(dstPtrs+src，入邊)。每個 thread 先檢查自己 level[vertex] == UINT_MAX（尚未訪問）；若是，沿入邊找是否有任一鄰居（前驅）在 currLevel − 1，找到就把自己標為 currLevel 並 break。
能 break 因為：只要一個前驅在上一層就足以判定本 vertex 屬當層，不必再檢查其餘鄰居。只有「沒有任何前驅在上一層」的 vertex 才會跑完整串清單。對 high-degree / 高變異的圖，提早 break 可大幅降低 load imbalance 與 control divergence。

Question 5 - Edge-centric 的優缺點 [recall]

情境/題目：edge-centric BFS 使用哪種格式？相對於 vertex-centric，它有哪兩個主要優點與兩個主要缺點？

查看答案

使用 COO（src+dst）。
優點：(1) 暴露更多平行度 — 邊通常遠多於頂點（E ≫ V），小圖也能塞滿 device；(2) load imbalance / divergence 最低 — 每 thread 恰好處理 1 條邊（屬於重排 thread→work 映射以減 divergence，見 Ch.6）。
缺點：(1) 掃每一條邊 — 無關頂點 v 的 n 條邊各 launch 一個 thread，重複檢查 v 共 n 次（vertex-centric 只需 1 thread 檢查 v 一次即跳過全部 n 條邊）；(2) COO 儲存成本高 — 需顯式存 src+dst 共 2E，多於 CSR/CSC 的 (V+1)+E。

Question 6 - Frontier 與 atomicCAS 的必要性 [recall]

情境/題目：使用 frontier 的 push kernel，每層 launch 多少個 thread？為何它必須改用 atomicCAS，而無 frontier 版可以直接寫入？BFS 的結束條件變成什麼？

查看答案

每層只 launch numPrevFrontier 個 thread（= 上一層 frontier 的大小），而非全圖 V 個 → 消除冗餘 thread。
必須用 atomicCAS 是因為 frontier 版除了標 level，還要把鄰居入隊到 currFrontier。若多個 thread 都看到該鄰居未訪問，就會把它重複入隊，導致下一層重複處理 → 浪費。atomicCAS(&level[neighbor], UINT_MAX, currLevel) 一次原子完成「檢查未訪問 + 標記」，回傳值 == UINT_MAX 者才是第一個訪問者，只有它能入隊。
結束條件：numCurrFrontier == 0（不再需要 newVertexVisited flag）。

Question 7 - Privatization 降低 atomic 競爭 [recall]

情境/題目：frontier 版的 push 中哪一個 atomic operation 競爭最嚴重？privatization 如何降低它的競爭？此手法又帶來什麼額外的記憶體存取好處？

查看答案

競爭最嚴重的是全域 atomicAdd(numCurrFrontier, 1) — 所有成功訪問的 thread 都對同一個 global counter 累加。（atomicCAS 競爭只是中等，因為僅少數 thread 撞同一鄰居。）
Privatization：每個 block 在 shared memory 維護自己的 local frontier 與 local counter，thread 只與同 block 的 thread 競爭（shared atomic 低延遲）；commit 時每個 block 只對 global counter 做 1 次 atomicAdd 配置整段空間。
額外好處：local → global 的 commit 用 idx_s = threadIdx.x; idx_s += blockDim.x 步進，使連續 thread 寫連續 global 位址 → coalesced stores。（local frontier 溢位時需把 counter 釘回 LOCAL_FRONTIER_CAPACITY 並改走 global atomicAdd fallback。）

Question 8 - Launch overhead 與 load balance 最佳化 [recall]

情境/題目：對於 frontier 很小的前幾層/末幾層，會遇到什麼開銷問題？解法為何？另外，celebrity vertex 造成的 load imbalance 有哪兩種解法？

查看答案

Launch overhead：小 frontier 時，終止舊 grid + 啟動新 grid 的開銷可能超過平行收益。解法：準備一個只用單一 thread block 的 kernel，以 __syncthreads() 在 level 之間同步、連續跑多個小 level；當 frontier 大到溢出 block-level frontier 時，把內容複製到 global frontier 並返回 host，host 再改用一般 multi-grid kernel。
Load imbalance（celebrity vertex degree 高出數個數量級）兩解法：(1) 改用 edge-centric；(2) 把 frontier 內 vertex 依 degree 分桶（Merrill & Garland）：small → 1 thread / vertex、medium → 1 warp / vertex、large → 1 block / vertex。

Question 9 - 由邊集建構 CSR [application]

情境/題目：給定有向圖（5 個 vertex），邊集為 {0→1, 0→3, 1→2, 3→2, 3→4, 4→1}。寫出它的 CSR 表示（srcPtrs、dst），每個 vertex 的鄰居清單須依 dst 遞增排序。

查看答案

各 vertex 出邊：v0→{1,3}、v1→{2}、v2→{}、v3→{2,4}、v4→{1}，共 E = 6。

srcPtrs = [0, 2, 3, 3, 5, 6]（長度 N+1 = 6；末元素 6 = 總邊數）。
- v0 起點 0（2 邊）、v1 起點 2（1 邊）、v2 起點 3（0 邊，故 srcPtrs[2]==srcPtrs[3]）、v3 起點 3（2 邊）、v4 起點 5（1 邊）。
dst = [1, 3, 2, 2, 4, 1]。
驗證：v3 出邊 = dst[srcPtrs[3]..srcPtrs[4]−1] = dst[3..4] = {2, 4} ✓。

Question 10 - 計算各實作 launch / 工作的 thread 數 [application]

情境/題目：沿用 Q9 的圖，root = 0（level 0），BFS 結果為 L1={1,3}、L2={2,4}。在「從 level 1 標到 level 2」這一輪（currLevel = 2），push、pull、edge-centric 各 launch 幾個 thread？push 有幾個 thread 真正迭代鄰居？pull 有幾個 thread 迭代鄰居、幾個 thread 標記自己？edge-centric 有幾個 thread 可能 label 一個 vertex？

查看答案

進入本輪前已標：0(L0)、1(L1)、3(L1)；未訪問 = {2, 4}。

Push：launch V = 5 個 thread；只有 vertex 在 currLevel−1 = 1 的 thread 迭代鄰居 → {1, 3} 共 2 個。
Pull：launch V = 5 個 thread；迭代鄰居的是所有未訪問頂點 {2, 4} 共 2 個；兩者都找到上一層前驅（2 的入邊有 1,3；4 的入邊有 3）→ 2 個 thread 標記自己。
Edge-centric：launch E = 6 個 thread（每邊一個）；src 在 level 1 的邊 = 1→2, 3→2, 3→4 → 3 個 thread 可能 label vertex。

Question 11 - atomicCAS 的參數與成功判定 [application]

情境/題目：atomicCAS(addr, compare, val) 的三個參數與回傳值各是什麼？在 frontier kernel 中如何用它「一次原子完成檢查未訪問 + 標記」，並判斷自己是否為第一個訪問該鄰居的 thread？

查看答案

參數：addr = 資料記憶體位址、compare = 用來比較的值、val = 比較成功時要寫入的值。語意（原子）：if (*addr == compare) *addr = val;。回傳值 = *addr 的舊值。
用法：atomicCAS(&level[neighbor], UINT_MAX, currLevel) — 把 level[neighbor] 與 UINT_MAX（未訪問）比較，相等則設成 currLevel。
成功判定：若回傳值 == UINT_MAX，代表此 thread 是第一個把該鄰居從未訪問改成已訪問的 thread → 只有它負責把 neighbor 入隊，避免重複入隊。

Question 12 - Push vs Pull 取捨與 direction-optimized [analysis]

情境/題目：push 與 pull 都是 1 thread / vertex，但有兩個關鍵差異影響效能。請說明這兩個差異，並解釋為何 early level 偏好 push、late level 偏好 pull，以及 direction-optimized 的儲存需求與「push=scatter / pull=gather」的對應。

查看答案

差異 1（迴圈早退）：push 一定跑完整個鄰居清單；pull 找到一個前驅在上一層即 break，對 high-degree / 高變異圖能大幅降低 load imbalance 與 control divergence。
差異 2（哪些 thread 迭代鄰居）：push 只有「上一層頂點」迭代；pull 是「所有未訪問頂點」都迭代。
early level：上一層頂點少、未訪問頂點多 → push 只掃少數清單（快），pull 要掃大量未訪問頂點（慢）。
late level：上一層頂點多、未訪問頂點少，且命中已訪問前驅而早退的機率高 → pull 快，push 慢。
Direction-optimized = 早期 push + 後期 pull；high-degree / small-world 圖切換較早，low-degree 圖切換較晚。需同時存 CSR + CSC；若圖 undirected（adjacency matrix 對稱）則兩者等價，只需一份。
對應：push = scatter（主動把 depth 推給鄰居，寫入分散）、pull = gather（主動把鄰居狀態拉進自己，寫入集中），與 Ch.17/18 的 scatter-vs-gather 同源。

Question 13 - Edge-centric vs Vertex-centric 取捨 [analysis]

情境/題目：從「暴露的平行度」「load imbalance / divergence」「儲存成本」「能否跳過無關工作」四個維度，比較 edge-centric 與 vertex-centric 實作，並說明各自適合哪種圖。

查看答案

平行度：edge-centric = E（最高，E ≫ V，小圖也能塞滿 device）；vertex-centric = V（小圖可能 thread 不足）。
load imbalance / divergence：edge-centric 最低（每 thread 恰 1 邊，等同 Ch.6 重排 thread→work 映射）；vertex-centric 依各 vertex 的 degree，高變異圖嚴重 imbalance。
儲存成本：edge-centric COO = 2E（高）；vertex-centric CSR/CSC = (V+1)+E（低）。
跳過無關工作：vertex-centric 能跳過 — 一個無關 vertex v 只需 1 thread 檢查一次就略過全部 n 條邊；edge-centric 不能 — v 的 n 條邊各 launch 一個 thread，重複檢查 v 共 n 次。
適用：edge-centric → 小圖、或 degree 變異極大（celebrity）的圖；vertex-centric → degree 較均勻的圖，且可進一步搭配 frontier 剪掉無關 thread、再用 privatization 降低 atomic 競爭。

重點整理 (Pattern Summary，點擊展開)

實作	Thread 對應	格式	每層 launch	真正迭代鄰居	Atomic	適用
Push (top-down)	1 / vertex	CSR	`V`	上一層頂點	否 (idempotent)	早期層、低 degree
Pull (bottom-up)	1 / vertex	CSC	`V`	所有未訪問頂點	否 (可 break)	後期層、高 degree
Edge-centric	1 / edge	COO	`E`	每 thread 1 邊	否	小圖 / degree 變異
Frontier push	1 / frontier	CSR	`numPrevFrontier`	frontier 內頂點	`atomicCAS`+`atomicAdd`	消除冗餘 thread
Privatized frontier	1 / frontier	CSR	`numPrevFrontier`	frontier 內頂點	shared + 1×global	降 atomic 競爭、coalesced commit

核心訊息：格式選擇 = 演算法選擇（CSR↔push、CSC↔pull、COO↔edge）；BFS = O(V+E) 逐層 wavefront；最佳化路徑 = vertex/edge-centric → direction-optimized → frontier（剪冗餘）→ privatization（降競爭）→ launch reduction + degree 分桶（降開銷、平衡負載）。

Related Concepts

Question 1 - 圖表示法與格式對應 [recall]

Question 2 - BFS level labeling 與跨層同步 [recall]

Question 3 - Push kernel 與 idempotence [recall]

Question 4 - Pull kernel 與提早 break [recall]

Question 5 - Edge-centric 的優缺點 [recall]

Question 6 - Frontier 與 atomicCAS 的必要性 [recall]

Question 7 - Privatization 降低 atomic 競爭 [recall]

Question 8 - Launch overhead 與 load balance 最佳化 [recall]

Question 9 - 由邊集建構 CSR [application]

Question 10 - 計算各實作 launch / 工作的 thread 數 [application]

Question 11 - atomicCAS 的參數與成功判定 [application]

Question 12 - Push vs Pull 取捨與 direction-optimized [analysis]

Question 13 - Edge-centric vs Vertex-centric 取捨 [analysis]