平行程式設計與計算思維練習題 (Practice - Parallel Programming and Computational Thinking)

#practice #computational-thinking #scatter-vs-gather #algorithm-selection

19-Parallel-Programming-And-Computational-Thinking/01-Goals-of-Parallel-Computing-and-Amdahls-Law — 平行運算的目標與 Amdahl's Law
19-Parallel-Programming-And-Computational-Thinking/02-Algorithm-Selection-and-Tradeoffs — 演算法選擇與權衡 (Algorithm Selection and Tradeoffs)
19-Parallel-Programming-And-Computational-Thinking/03-Problem-Decomposition-Output-vs-Input-Centric — 問題分解：Output-centric 與 Input-centric (Problem Decomposition)
19-Parallel-Programming-And-Computational-Thinking/04-Computational-Thinking-and-Parallelization-Approaches — 計算思維與平行化策略 (Computational Thinking)

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 情境	答案 / 重點
平行運算三大目標	less time / bigger problem / better solution，本質都是 increased speed
好的平行候選	大資料量、高計算複雜度、多迭代；小問題不值得平行化
Amdahl's Law speedup 公式	$\frac{1}{(1 - p) + p / s}$ ； $s \to \infty$ 上限 $= \frac{1}{1 - p}$ （由序列段封頂）
MD 範例 (95% / 100×)	無重疊 $1 / (0.05 + 0.0095) \approx$ 17×；重疊 $1 / 0.05 =$ 20×
緩解序列段	Task-level parallelism(CUDA streams / multicore host) 或 hierarchical MPI-CUDA
演算法三性質	definiteness、effective computability、finiteness
演算法四面向	complexity(work)、parallelism、generality、accuracy/numerical stability;無單一全勝
複雜度 vs 平行度	Kogge-Stone(平行度高、步數少) vs Brent-Kung(work-efficient、O(N))
一般性 vs 效率	radix sort(快但限特定 key) vs merge sort(comparison-based、任何 key)
複雜度 vs 準確度	DCS(全準確,work `O(V²)`) vs cutoff summation(犧牲少量準確度降複雜度)
Output-centric = gather	一 thread 對一 output、累積在 register、可共享 input、免 atomics(GPU 偏好)
Input-centric = scatter	一 thread 對一 input、散佈到多 output、需 atomic operations(較慢)
選分解四考量	gather/scatter(atomics)、parallelism、input→output 映射難易、load balance
Histogram 為何 input-centric	output-centric:bin 數 << input(平行度低)、O(bins×inputs)(不 work-efficient)、load imbalance
Cutoff binning	grid-centric 分解 + bins/neighborhood + overflow list;SmallBin-Overlap 達 17×
計算思維定義	把 domain problem formulate 成 computation steps 與 algorithms (Wing, 2006)
三大基礎技能	computer architecture、programming interfaces & compilers、domain knowledge
Good / Better / Best	recompile+libs / rewrite / holistic rethink(連 numerical methods 一起)
平行程式三步驟	algorithm selection → problem decomposition → optimization & tuning

Question 1 - 平行運算的三大目標 [recall]

情境/題目：人們追求平行運算的三個主要原因是什麼？這三者的共同本質是什麼？舉投資公司風險分析為例分別說明。

查看答案

三大目標:① Less time——用更少時間解同一問題(序列需 200 小時,但決策要求 4 小時內完成);② Bigger problem——固定時間內解更大問題(擴大持股數量後序列版會超時);③ Better solution——固定時間內得到更好的解(改用更精確、更複雜的模型)。三者本質都是 increased speed(分別作用在「現有模型+現有規模」「現有模型+更大規模」「更複雜模型+現有規模」),且可組合。好的平行候選具備:大資料量、每次迭代計算量大、迭代次數多。

Question 2 - 演算法的三大基本性質 [recall]

情境/題目：一個程序要稱得上「演算法 (algorithm)」必須具備哪三個基本性質？各是什麼意思？這三性質與「適不適合平行化」有關嗎？

查看答案

① Definiteness(明確):每一步都精確陳述,沒有模糊空間;② Effective computability(可計算):每一步都能由電腦實際執行;③ Finiteness(會終止):保證會結束、不無限執行。這三性質是「是不是合法演算法」的門檻,與平行/序列無關;選哪個演算法上平行硬體則是另一層(四面向)的問題。

Question 3 - 演算法選擇的四個比較面向 [recall]

情境/題目：同一問題常有多個演算法可解。平行程式設計者比較演算法時要看哪四個面向？並指出本書反覆出現的三個「經典權衡」分別對應哪兩個面向與哪個範例章節。

查看答案

四面向:algorithmic complexity / work、parallelism exposed、generality、accuracy / numerical stability;通常無單一演算法在四者全勝,須為目標硬體挑最佳折衷。三大經典權衡:① 複雜度 vs 平行度 → Kogge-Stone vs Brent-Kung(Ch.11 Scan);② 一般性 vs 效率 → radix sort vs merge sort(Ch.13 Sorting);③ 複雜度 vs 準確度 → DCS vs cutoff summation(Ch.18 電位圖)。高複雜度可用 hybrid 或 thread coarsening 緩解。

Question 4 - radix sort vs merge sort 的權衡 [recall]

情境/題目：radix sort 與 merge sort 屬於哪一種經典權衡？各自的優勢與限制是什麼？

查看答案

屬於 一般性 (generality) vs 效率 (efficiency) 的權衡。Radix sort 是 noncomparison sort,複雜度可低於 merge sort 且高度適合平行化,但只能用於特定型別的 key(一般性低)。Merge sort 是 comparison-based,任何有良好定義比較運算子的 key 都能用(一般性高),但效率/平行度較不及 radix。選哪個取決於 key 型別與資料特性。

Question 5 - Output-centric (gather) vs Input-centric (scatter) [recall]

情境/題目：問題分解最常見的兩種策略是 output-centric 與 input-centric。請分別說明每個 thread 對應到什麼、表現出哪種記憶體存取行為、結果累積在哪、是否需要 atomics，以及為何 GPU 通常偏好其中一種。

查看答案

Output-centric:每個 thread 處理一個 output 元素,gather/collect 多個 input 的貢獻,結果累積在 private register;單一寫者 → 免 atomics,且多 thread 可共享 input(用 constant/shared memory 省 bandwidth)。Input-centric:每個 thread 處理一個 input 元素,scatter/distribute 影響到多個 output;多 thread 可能同時寫同一 output → 必須用 atomic operations 防 race condition,而 atomics 遠慢於 register 存取。故 GPU 通常偏好 output-centric (gather)。兩種分解結果相同,但效能可能天差地別。

Question 6 - 緩解序列段：Task-level 與 Hierarchical MPI-CUDA [recall]

情境/題目：Amdahl's Law 顯示「不值得單獨平行化的小活動」累積起來會封住整體 speedup。本章提出哪兩種手段來緩解這塊序列尾巴？各自的機制為何？

查看答案

① Task-level parallelism:某些小活動不值得「細粒度大規模平行」,但資料夠大時可讓它們彼此並行——用 multicore host 同時跑多個小任務,或用 CUDA streams 同時啟動多個小 kernel(各對應一個 task)。② Hierarchical MPI-CUDA:用階層式 data parallelism,MPI 把空間網格大塊分送到叢集各節點(粗粒度),每個節點的 host CPU 算小模組(如 vibrational/rotational force)、CUDA device 高倍率算大模組(nonbonded force),節點間再交換邊界資料。MPI 與 CUDA 互補、階層式地共同達到更高速度。

Question 7 - 計算思維的三大基礎技能 [recall]

情境/題目：要成為有效的 computational thinker 需橫跨哪三大基礎知識領域？並區分 SIMT / SPMD / SIMD 三者的層級差異。

查看答案

三大基礎技能:① Computer architecture(memory organization、caching & locality、bandwidth、SIMT/SPMD/SIMD、floating-point precision vs accuracy);② Programming interfaces & compilers(parallel execution models、memory 種類、synchronization、data layout、thread granularity transformation);③ Domain knowledge(problem formulation、hard vs soft constraints、numerical methods、numerical stability)。
SIMT/SPMD/SIMD 差異:SIMD = 硬體層級,一條指令操作多筆資料(vector lane);SIMT = GPU 執行模型,warp 內 thread 各有自己 PC/狀態但鎖步執行(容許 control divergence);SPMD = 程式層級,所有 thread/process 跑同一份程式但處理不同資料(含 MPI ranks)。

Question 8 - Good / Better / Best 三種平行化策略 [recall]

情境/題目：攻擊「computation-hungry applications」有 good / better / best 三種策略，難度與報酬同步遞增。請分別說明做法、需要哪些步驟、適合誰，以及為何前兩者無法發揮全部潛力。

查看答案

Good——「accelerate」legacy code:最基本是 recompile 換平台,可再用 optimized libraries/directives(CuBLAS、CuFFT、Thrust、MATLAB、OpenACC);不需 algorithm selection/decomposition/tuning,適合 domain scientists,立即見效但無平行洞見。Better——用平行技巧 rewrite 或從頭寫新 code,做 decomposition+optimization,適合 nondomain computer scientists;但缺 domain 知識,無法做有效的 algorithm selection。Best——holistic:三步驟全做,並 rethink numerical methods & algorithms(如 cutoff binning),需 CS+domain 跨領域,報酬最大、可能帶來 fundamental new discoveries。目標是讓科學「better, not just faster」。

Question 9 - Amdahl's Law 計算（Molecular Dynamics）[application]

情境/題目：molecular dynamics 應用中 nonbonded force 佔原始序列執行時間 95%，在 GPU 上加速 100×，其餘 5% 留在 host 且無加速。分別計算「host 與 device 不重疊」與「device 執行完全藏在 host 陰影下」兩種情況的 application-level speedup，並說明它揭示的核心啟示。

查看答案

公式 $speedup = \frac{1}{(1 - p) + p / s}$ ,此處 $p = 0.95$ 、 $s = 100$ 。

無重疊: $\frac{1}{0.05 + 0.95 / 100} = \frac{1}{0.05 + 0.0095} = \frac{1}{0.0595} \approx$ 17×。
有重疊(平行段完全隱藏):整體只剩序列段 → $\frac{1}{0.05} =$ 20×。
啟示(Amdahl's Law):即使平行段加速高達 100× 並完全被隱藏,僅 5% 的序列段就把整體封在 20×。 $s \to \infty$ 時上限 $= \frac{1}{1 - p} = 20 \times$ ,由序列段決定天花板。Ch.17 的 CG 只佔 ~1% 卻成 speedup 限制,即同一現象。

Question 10 - Cutoff Binning 的 grid-centric 設計 [application]

情境/題目：序列 cutoff 演算法採 atom-centric（一次處理一個原子）。為什麼它難以直接搬上 GPU？平行 cutoff binning 改採什麼分解？請描述 bins、neighborhood、overflow list 三者的角色，以及 SmallBin-Overlap 優化什麼。

查看答案

atom-centric 的散佈 (scatter) 記憶體存取行為不利 GPU(多原子向外更新 grid),故改採 grid-centric (output-centric) 分解:每條 thread 算一個 grid point 的能量,用 gather 把鄰近原子貢獻累積到 register。① Bins:先把輸入原子依座標排序進固定容量的 bins(每個 bin 對應 grid 空間一個 box)。② Neighborhood:為每個 grid point 定義「所有可能對其能量有貢獻的原子所在的那些 bins」;執行時各 block 走訪自己的 neighborhood bins,block 內所有 thread 一起掃描原子但各自判斷是否落在 cutoff radius 內。③ Overflow list:bin 容量設得遠小於最大可能原子數,若某原子的 home bin 已滿就改放進 overflow list,kernel 結束後由 host 序列補算缺漏貢獻(解 load imbalance)。SmallBin-Overlap 把這段序列 overflow 處理與下一個 kernel 執行重疊,對 CPU-SSE 序列版達 17× 並維持大體積可擴充性。

Question 11 - 為 SpMV/COO 選擇分解策略 [application]

情境/題目：稀疏矩陣向量乘法 (SpMV) 的 COO kernel 採用了哪一種分解？它使用了原本「不利 GPU」的 scatter + atomics，為什麼仍然值得？最佳分解最終取決於什麼？

查看答案

SpMV/COO 採 input-centric 分解:每條 thread 對應輸入矩陣的一個 nonzero,並用 atomicAdd 更新 output vector 對應元素(scatter access)。雖然付出 atomics 代價,但相較 output-centric kernel 它能抽取更多平行度(nonzero 數通常遠多於 output 元素)且load balance 更好(每 thread 處理一個 nonzero,工作量均勻)。這說明「output-centric 永遠較好」是錯誤簡化:當平行度與負載均衡更關鍵時,input-centric 反而勝出。最佳分解最終取決於 input dataset;Hybrid ELL-COO 即是混合兩種分解的例子。

Question 12 - Kogge-Stone vs Brent-Kung 的權衡 [analysis]

情境/題目：Brent-Kung 的 work-efficiency 高於 Kogge-Stone（O(N) vs O(N log N)），是否就一定比較快？請從「複雜度 vs 平行度」這個經典權衡出發，分析在不同硬體條件下誰勝出，以及如何緩解高複雜度演算法。

查看答案

不一定。這正是 複雜度 (work) vs 平行度 (parallelism) 的經典權衡:

Brent-Kung:複雜度低、做同一計算所需 operations 少(O(N),~2N−3)→ work-efficient;但兩階段(reduction tree + reverse distribution tree)使步數較多(~2 倍 log N)。
Kogge-Stone:暴露更多平行度,能在更少 iteration(log₂N 步)完成;但 work 達 O(N log N),做了較多冗餘運算。
誰勝出取決於目標硬體:執行單元充足(P 大、延遲主導)時 Kogge-Stone 步數少而較快;執行資源有限(P 小、總運算量主導)時 Brent-Kung work 少而較快。緩解高複雜度的手段:用 hybrid(結合兩個平行演算法)或把平行演算法與低複雜度的序列演算法經由 thread coarsening 結合——這也呼應「best」approach 在演算法層級重新設計的精神。

Question 13 - Histogram：input-centric (scatter) 為何勝過 output-centric (gather) [analysis]

情境/題目：本書多數計算（image、matmul、convolution、stencil）都用 output-centric (gather) 以避開 atomics，但 histogram 卻反其道採用 input-centric (scatter)、需要 atomics。請比較兩種分解，從三個面向分析為何 histogram 偏好 input-centric。

查看答案

對 histogram,output-centric(讓每條 thread 負責一個 output bin、去搜尋哪些 input 屬於它)看似能用 gather 免除 atomics,但有三大致命缺點:① 平行度低——output bin 數通常遠小於 input 數,可用 thread 大減;② 不 work-efficient——thread 無法在不檢視所有 input 的情況下知道哪些 input 屬於自己的 bin,每個 bin thread 都得掃過全部 input,工作量 O(#bins × #inputs) 而非 O(#inputs);③ load imbalance——每個 bin 對應的 input 數量不同。三點皆不利,故改採 input-centric:每條 thread 對應一個 input 元素並 atomically 更新對應 bin,雖付出 scatter+atomics 代價,卻換到高平行度(≈#inputs)、work-efficient(每 input 處理一次)與良好負載均衡。此例說明 gather vs scatter 不是唯一考量,parallelism、mapping、load balance 同樣關鍵。

重點整理 (Pattern Summary，點擊展開)

主題	關鍵內容	核心取捨 / 公式
三大目標	less time / bigger problem / better solution	本質都是 increased speed
Amdahl's Law	序列段封住整體 speedup;MD 95%/100× → 17×(無重疊) / 20×(重疊)	$\frac{1}{(1 - p) + p / s}$ ;上限 $\frac{1}{1 - p}$
緩解序列段	task-level parallelism(streams/multicore) 或 hierarchical MPI-CUDA	等效降低 $(1 - p)$
演算法三性質	definiteness、effective computability、finiteness	「是否合法演算法」的門檻
演算法四面向	complexity、parallelism、generality、accuracy	無單一全勝,為硬體挑折衷
複雜度 vs 平行度	Kogge-Stone(平行度高) vs Brent-Kung(work-efficient O(N))	硬體決定誰快;hybrid / coarsening 緩解
一般性 vs 效率	merge sort(任何 key) vs radix sort(限特定 key、快)	comparison vs noncomparison
複雜度 vs 準確度	DCS(全準確,work `O(V²)`) vs cutoff(降複雜度)	cutoff binning + overflow list,17×
Output-centric (gather)	一 thread 對一 output、register 累積、免 atomics	image/matmul/conv/stencil/merge/MRI/EP
Input-centric (scatter)	一 thread 對一 input、需 atomics;但平行度/load balance 可能更好	histogram、SpMV/COO、BFS push/edge
選分解四考量	gather/scatter、parallelism、input→output 映射、load balance	「output 恆優」是錯誤簡化;依資料集
計算思維 + 三步驟	formulate domain problem 成 computation/algorithms;algorithm selection→decomposition→tuning	三大技能:architecture / interfaces / domain
Good/Better/Best	recompile+libs / rewrite / holistic rethink	「make science better, not just faster」

核心心法:平行程式設計 = 三步驟(algorithm selection → problem decomposition → optimization & tuning)。Amdahl's Law 提醒天花板由序列段決定;演算法選擇是在四面向間取最佳折衷;問題分解的 gather/scatter 只是考量之一,平行度與 load balance 常逆轉直覺;真正的突破來自 fresh computational thinking——以 domain 洞見重新設計演算法(如 cutoff binning),而非「just recompile」。

Related Concepts

Question 1 - 平行運算的三大目標 [recall]

Question 2 - 演算法的三大基本性質 [recall]

Question 3 - 演算法選擇的四個比較面向 [recall]

Question 4 - radix sort vs merge sort 的權衡 [recall]

Question 5 - Output-centric (gather) vs Input-centric (scatter) [recall]

Question 6 - 緩解序列段：Task-level 與 Hierarchical MPI-CUDA [recall]

Question 7 - 計算思維的三大基礎技能 [recall]

Question 8 - Good / Better / Best 三種平行化策略 [recall]

Question 9 - Amdahl's Law 計算（Molecular Dynamics）[application]

Question 10 - Cutoff Binning 的 grid-centric 設計 [application]

Question 11 - 為 SpMV/COO 選擇分解策略 [application]

Question 12 - Kogge-Stone vs Brent-Kung 的權衡 [analysis]

Question 13 - Histogram：input-centric (scatter) 為何勝過 output-centric (gather) [analysis]