有序合併與 Co-rank 概念練習題 (Practice - Merge Foundations and the Co-rank Concept)

#practice #parallel-pattern #ordered-merge #co-rank-function

12-Merge/01-Merge-Foundations-and-Co-Rank-Concept — Merge 基礎與 Co-rank 概念 (Sequential Merge & Parallelization)
12-Merge/02-Co-Rank-Function-and-Basic-Merge-Kernel — Co-rank Function 實作與 Basic Parallel Merge Kernel
12-Merge/03-Tiled-and-Circular-Buffer-Merge-Kernels — Tiled 與 Circular Buffer Merge Kernel (Memory Coalescing)
12-Merge/04-Thread-Coarsening-and-Summary — Thread Coarsening 與本章總結 (Summary)

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 觸發點	答案要點
Sequential merge 複雜度	O(m+n)；每元素讀一次、每 C 位置寫一次
tie 為何用 `A[i] <= B[j]`	維持 stability across lists(A 優先);用 `<` 會破壞穩定性
Rank vs Co-rank	rank = `k`(C 索引);co-rank = `(i,j)`(A、B 對應索引),且 `i,j` 唯一
Co-rank 不變式	`k = i + j`,任何 marker 調整都維持
co_rank 接受條件	`A[i-1] <= B[j]` 且 `B[j-1] < A[i]`(`<=`/`<` 不對稱 → stability)
搜尋上界 / 下界	`i = min(k,m)`、`i_low = max(0,k-n)`、`j_low = max(0,k-m)`
為何 binary search / 複雜度	輸入已排序 → search;O(log N), N = max(m,n)
basic kernel 每 thread co_rank 次數	2 次(`k_curr` 取起點、`k_next` 取終點),中間跑 `merge_sequential`
basic kernel 弱點	merge 與 co_rank 存取 uncoalesced,浪費 bandwidth
tiled kernel 改善策略	Ch.6 策略 (3):coalesced 載入 shared memory,不規則存取改在 shared memory
block 級 co-rank 用幾個 thread	1 個(threadIdx.x==0),存 A_S[0]/A_S[1] 經 `__syncthreads()` 廣播
tiled kernel 浪費多少頻寬	50%:每輪載 2×tile_size,只消耗 tile_size
為何載 2x 只保證生成 x	最壞情況 x 個 output 可能全來自單一輸入
circular buffer 解決什麼	保留上輪未消耗元素,只 refill 被消耗部分,index 用 `% tile_size` 環繞 → ~100% 使用
simplified model 目的	讓 user code 看到連續 tile(`a_next ≥ a_curr`),環繞藏進 `_circular` 函式
為何 merge 一定要 coarsening	攤提 binary search 成本;`T=1` 時每元素一次 search,prohibitively expensive
coarsening tradeoff	`T` 太大 → thread/block 數少 → occupancy / 平行度下降

Question 1 - Sequential Merge 結構與複雜度 [recall]

描述 merge_sequential 的兩段式結構,並說明其演算法複雜度為何。

查看答案

Part 1:while-loop 用雙指標 i、j 同時掃描 A、B,每次迭代比較後填一個 C 位置(A[i]<=B[j] 取 A,否則取 B)。Part 2:當一方耗盡後,複製另一方剩餘元素到 C。每個輸入元素讀一次、每個 C 位置寫一次 → O(m+n),與總元素數線性。

Question 2 - Stability 與比較運算子 [recall]

Merge 的 tie-breaking 為何使用 A[i] <= B[j](而非 A[i] < B[j])?改用嚴格小於會發生什麼?

查看答案

用 <= 表示「平手時 A 元素優先放入 C」,維持 stability across lists(跨清單穩定性)——可保留先前依其他 key 排序的成果。若改成 <,平手時改取 B,變成「B 優先」:仍是合法排序,但不再 stable across lists。

Question 3 - Rank、Co-rank 與不變式 [recall]

定義 rank k 與 co-rank (i, j),並寫出它們之間恆成立的不變式(invariant)。

查看答案

rank k = 輸出元素 C[k] 的索引;co-rank (i, j) = 產生 prefix C[0..k-1] 所需的 A、B prefix 長度,即 C[0..k-1] = merge(A[0..i-1], B[0..j-1])。不變式:k = i + j,且 Siebert & Träff 證明對每個 k 此 (i,j) 唯一。co_rank 只回傳 i,caller 自行算 j = k - i。

Question 4 - co_rank 的搜尋邊界初始化 [recall]

co_rank 函式中 i、i_low、j_low 的初始值各是什麼?為何 i_low 不直接設為 0?

查看答案

i = min(k, m)(i 不可超過 A 大小或 k);i_low = max(0, k-n);j_low = max(0, k-m);j = k - i。i_low 設成 0 仍正確但較慢——因 B 最多貢獻 n 個元素,當 k > n 時 A 至少要貢獻 k-n 個,故下界可收緊為 k-n 以縮小搜尋範圍。

Question 5 - co_rank 的接受條件與 delta [recall]

co_rank 的 while-loop 在什麼條件下停止(接受當前 i、j)?delta 為何用 ceil 而非 floor?

查看答案

接受條件:A[i-1] <= B[j] 且 B[j-1] < A[i](注意 <=/< 不對稱以維持 stability)。delta = (i - i_low + 1) >> 1 是 ceil,確保每步至少縮減 1,避免搜尋區間卡住、保證 while-loop 終止。搜尋為 O(log₂N)。

Question 6 - Basic Merge Kernel 的兩次 co_rank [recall]

Basic merge kernel 中每個 thread 為何要呼叫 co_rank 兩次?各取得什麼?

查看答案

第一次以 k_curr 為 rank → 得 i_curr(輸入子陣列起點);第二次以 k_next 為 rank → 得 i_next(輸入子陣列終點)。A 子陣列長度 = i_next - i_curr,B 長度 = j_next - j_curr。k_next 須 min((tid+1)*elems, m+n) 做 clamp,因 m+n 不一定整除 thread 數。

Question 7 - Circular Buffer 的 Simplified Model [recall]

為何 circular buffer kernel 要引入「simplified buffer access model」?co_rank_circular 比原 co_rank 多了哪些參數?

查看答案

直接把環繞索引交給 user code 很複雜(a_next 可能因 wrap 而小於 a_curr,長度要分兩種情況算)。Simplified model 對外呈現「tile 從 A_S_start 起、連續最多 tile_size 個」的幻覺,使 a_next ≥ a_curr 恆成立;環繞只在實際存取元素時於 _circular 函式內展開。co_rank_circular 多 3 個參數:A_S_start, B_S_start, tile_size。

Question 8 - 為何載 2x 元素只保證生成 x [recall]

Tiled merge kernel 每輪在 shared memory 載入 A、B 各 tile_size 個(共 2×tile_size),為何只能保證生成 tile_size 個輸出元素?

查看答案

因為輸入用量是 data-dependent:最壞情況下,當前 tile_size 個輸出可能全部來自單一輸入陣列(全 A 或全 B)。為保證每個 thread 都拿得到所需元素,A、B 必須各備 tile_size 個。實際只會消耗合計 tile_size 個(A_consumed_iter + B_consumed_iter = tile_size),剩下一半被丟棄重載 → tiled kernel 浪費 50% 頻寬。

Question 9 - Exercise 1:計算 C[8] 的 co-rank [application]

合併 A=(1,7,8,9,10) 與 B=(7,10,10,12),求 C[8] 的 co-rank 值。

查看答案

合併結果 C=[1,7,7,8,9,10,10,10,12]。C[8] 的 rank k=8,prefix C[0..7]=[1,7,7,8,9,10,10,10] = merge(A[0..4], B[0..2]),故 co-rank i=5(整個 A)、j=3,驗證 i+j=5+3=8 ✓(C[8]=12 來自 B[3])。

Question 10 - Exercise 4:幾個 thread 做 binary search [application]

合併兩陣列大小 1,030,400 與 608,000,每 thread 合併 8 個元素,block size = 1024。(a) basic kernel 中幾個 thread 在 global memory 做 binary search?(b) tiled kernel 中幾個在 global memory?(c) tiled kernel 中幾個在 shared memory?

查看答案

總輸出 m+n = 1,638,400,thread 數 = 1,638,400 / 8 = 204,800。
(a) basic:每個 thread 都在 global memory 做 → 204,800。
(b) tiled:每 block 只 1 個 thread 做 block 級 global co-rank → blocks = 204,800/1024 = 200 → 200。
(c) tiled:thread 級 co-rank 全在 shared memory → 全部 204,800。

Question 11 - Tiled Kernel 的 shared memory 與迭代次數 [application]

tiled kernel 使用 tile_size = 1024,每 block 需配置多少 shared memory(bytes)?若 m=33000, n=31000、用 16 個 block,每 block 的 while-loop 跑幾輪?

查看答案

Shared memory = A_S 與 B_S 各 1024 個 int → (1024+1024)×4 = 8192 bytes。每 block 輸出 = 64000/16 = 4000,迭代次數 = ceil(C_length/tile_size) = ceil(4000/1024) = 4 輪。每輪載入 for-loop 跑 tile_size/blockDim.x = 1024/128 = 8 次。

Question 12 - Basic vs Tiled vs Circular 的記憶體效率比較 [analysis]

比較 basic、tiled、circular buffer 三種 merge kernel 在 coalescing 與 shared memory 利用率上的差異,並說明為何這個演進方向對 merge 是值得的。

查看答案

Basic:A/B/C 全在 global,每 thread 順序掃描造成相鄰 thread 讀 A[0],A[2],B[0]、寫 C[0],C[3],C[6] → uncoalesced,co-rank binary search 也不規則。Tiled:用 Ch.6 策略(3) coalesced 載入 shared memory,但每輪載 2×tile_size 只用一半 → 50% 利用率。Circular buffer:保留上輪未消耗元素只 refill 被消耗部分(% tile_size 環繞)→ ~100% 利用率。由於 merge 是 memory-bandwidth bound,節省頻寬直接提升效能,值得付出程式複雜度與額外 register 的代價。

Question 13 - Thread Coarsening 對 Merge 的必要性與取捨 [analysis]

為何 thread coarsening 對 merge 是「必需品」而非單純優化?它攤提的成本與 reduction/scan 中 coarsening 攤提的成本有何不同?過度 coarsening 的代價是什麼?

查看答案

Merge 平行化的主要代價是每個 thread 都要做自己的 binary-search co-rank(輸入範圍 data-dependent)。T=1(未 coarsen)時每個輸出元素都要一次 binary search → prohibitively expensive;coarsening 讓 (m+n)/T 個 thread 每個只做 2 次 co-rank,中間 T 個元素用 O(T) 的 merge_sequential 填滿,把 2·log N 的 search 攤提到 T 個元素。不同點:reduction/scan 的 coarsening 省的是同步 / 重複載入 / atomic 成本,merge 省的是 binary search 次數。代價:T 太大 → thread/block 數變少 → occupancy / 平行度下降(但 memory-bound 下通常可接受,最佳 T 需 profiling)。

重點整理 (Pattern Summary，點擊展開)

Kernel (節)	global co-rank	shared co-rank	Coalescing	Shared mem 利用率	複雜度
Basic (12.5)	每 thread × 2	無	差 (uncoalesced)	不使用	低
Tiled (12.6)	每 block 1 thread × 2	每 thread × 2	載入 coalesced	50% 浪費	中
Circular Buffer (12.7)	每 block 1 thread × 2	每 thread × 2	載入 coalesced	~100% 使用	高
Coarsening (12.8)	已套用於以上全部	—	—	—	—

核心觀念鏈:輸入範圍 data-dependent(dynamic input identification)→ co_rank binary search O(log N) 反查 → thread coarsening 攤提 search 成本 → shared-memory tiling 取得 coalesced load → circular buffer 達 100% 頻寬利用 → simplified model 把環繞複雜度封裝進 library 函式。

關鍵公式:k = i + j(co-rank 不變式)、i = min(k,m)、i_low = max(0,k-n)、j_low = max(0,k-m)、delta = (i-i_low+1)>>1、total_iteration = ceil(C_length/tile_size)、numThreads = (m+n)/T。

一句話總結:merge 是 memory-bandwidth bound 的「動態輸入辨識」pattern,co-rank 解決「要讀哪段」,tiling + circular buffer 解決「怎麼讀得快」,coarsening 解決「search 成本太貴」。

Related Concepts

Question 1 - Sequential Merge 結構與複雜度 [recall]

Question 2 - Stability 與比較運算子 [recall]

Question 3 - Rank、Co-rank 與不變式 [recall]

Question 4 - co_rank 的搜尋邊界初始化 [recall]

Question 5 - co_rank 的接受條件與 delta [recall]

Question 6 - Basic Merge Kernel 的兩次 co_rank [recall]

Question 7 - Circular Buffer 的 Simplified Model [recall]

Question 8 - 為何載 2x 元素只保證生成 x [recall]

Question 9 - Exercise 1:計算 C[8] 的 co-rank [application]

Question 10 - Exercise 4:幾個 thread 做 binary search [application]

Question 11 - Tiled Kernel 的 shared memory 與迭代次數 [application]

Question 12 - Basic vs Tiled vs Circular 的記憶體效率比較 [analysis]

Question 13 - Thread Coarsening 對 Merge 的必要性與取捨 [analysis]