進階實務與未來演進練習題 (Practice - Model of Host/Device Interaction and Memory Evolution)

#practice #advanced-practice #unified-memory #kernel-execution-control

22-Advanced-Practices-And-Future-Evolution/01-Host-Device-Interaction-Memory-Model — Host/Device 互動模型與記憶體演進 (Host/Device Interaction Model)
22-Advanced-Practices-And-Future-Evolution/02-Kernel-Execution-Control — Kernel 執行控制 (Kernel Execution Control)
22-Advanced-Practices-And-Future-Evolution/03-Memory-Bandwidth-and-Compute-Throughput — 記憶體頻寬與運算吞吐量 (Memory Bandwidth and Compute Throughput)
22-Advanced-Practices-And-Future-Evolution/04-Programming-Environment-and-Future-Outlook — 程式開發環境與未來展望 (Programming Environment and Future Outlook)

關鍵題型 (Key Patterns，點擊展開)

關鍵字 / 觸發點	答案要點
簡單模型兩大痛點	I/O 裝置只認 host memory(多搬一趟)；大資料結構塞不進小 device memory(須切塊)
Zero-copy memory 配置 (CUDA 2.2)	`cudaHostAlloc(..., cudaHostAllocMapped)` → `cudaHostGetDevicePointer()`；走 PCIe 直接讀 host pinned memory
Zero-copy 頻寬規則	system interconnect 頻寬 < global memory 的 10%；只適合偶爾、稀疏存取的資料
為何 zero-copy 要 pinned	防止 OS 在 GPU 透過 PCIe 存取時把頁面 page out
UVA (CUDA 4) 做什麼	單一虛擬位址空間，每個實體位址 ↔ 唯一虛擬位址 → `cudaMemcpy` 免指定方向
UVA 是否保證可存取	否。host 仍不能 deref `cudaMalloc()` 指標(zero-copy 例外)
32-bit GPU 記憶體上限	$2^{32}$ = 4 GB(虛擬與實體皆 32-bit)
Kepler (2013) 位址改變	64-bit 虛擬 + ≥40-bit 實體 → >4 GB DRAM、host/device 同一指標值、GPU 間 peer access
Unified (managed) memory (CUDA 6)	`cudaMallocManaged()`；單一指標、runtime 自動 migration/coherence
Kepler/Maxwell unified memory 三限制	launch 前須 flush、CPU/GPU 不能同時存取、受限於 GPU 實體記憶體大小
Pascal (2016) 兩大硬體特性	49-bit 虛擬位址(涵蓋 CPU 48-bit) + page fault handling
Pascal 如何免 flush	page fault + page protection 做 invalidate-based coherence，on-demand migrate 或 map-on-access
遍歷 host linked list 需要	Pascal 起 + 同一指標值(system-wide VA)；不必放 zero-copy
CUDA 11 low-level VA API	`cuMemAddressReserve` / `cuMemCreate` / `cuMemMap` → 跨裝置自訂佈局、單一指標
In-kernel function call (Kepler/CUDA 5)	per-thread call frame stack；compiler 不再強制 inline → composability、無原始碼 device library
kernel 可呼叫的 std 函式	`printf()`(production 除錯、讓 end user dump 狀態)、`malloc()`；支援 recursion
Device lambda (CUDA 8, C++11)	lambda 當 kernel 參數寫泛型 kernel；extended lambda 加 `__host__ __device__` (`--extended-lambda`)
Simultaneous multiple grids (Fermi)	同一 app 多 grid 併發；用小 grid + priority 讓 critical-path 上的 remote work 低延遲插入
Hardware queues (Kepler)	多條 HW queue 消除 false serialization(獨立 stream 被卡)
Interruptable grids (Fermi)	grid cancel → user-level 排程、GPU 間 load balance
Cooperative kernels (CUDA 11)	保證所有 block 併發 → 可安全用 mutex/`grid.sync()` 跨 block 合作；不可 oversubscribe
Double-precision 速度	早期 ~8x slower → Fermi 後 ≈ single 的 1/2(2x slower)
Half-precision (A100)	tensor-core FP16 156 TFLOPS vs FP32 19.5 TFLOPS ≈ 8x；省算力 + 省頻寬(搬運量減半)
Predication (Fermi+)	編譯器驅動，控制相依→資料相依，warp SIMD 受益最大(ray tracing、cellular automata)
Configurable cache/scratchpad (Fermi+)	on-chip memory 可配置 cache vs shared；CPU-ported/不規則 → cache、tiling/可預測 → shared
Enhanced atomics	Fermi→Kepler→Maxwell 漸快/更通用 → 減少改用 prefix-sum/sort、減少回丟 CPU
HBM2 vs NVLink (Pascal)	HBM2 = GPU 自身 DRAM 頻寬 ≤3x Maxwell；NVLink = GPU↔GPU/CPU 互連 ≤5x PCIe 3.0
Unified device memory space (Fermi 2009)	把 global/local/shared 併入單一位址空間，單一 load/store 指令 → device function composable
兩個 unified 別搞混	device memory space(晶片內 g/l/s 位址統一) ≠ managed memory(CPU-GPU 共享 pool)
高階介面	OpenACC(`#pragma`)、Thrust(type-generic STL 風)、CUDA FORTRAN、C++AMP
PC sampling (CUDA 7.5)	指令級 profiling，定位最耗時的程式行
Critical path analysis (CUDA 8, 2016)	Visual Profiler 找真正決定整體時間的 kernel/API；非關鍵路徑灰階淡化

Question 1 - 簡單 Host/Device 模型的兩大痛點 [recall]

前面章節假設的「簡單模型」中，device memory 與 host memory 完全分離、只能用 cudaMemcpy() 來回搬資料。請說出此模型在應用層面造成的兩大問題。

查看答案

(1) I/O 效率:disk controller、NIC 等 I/O 裝置只在 host memory 上運作；資料須先 H2D 搬入運算、運算完再 D2H 搬出才能做 I/O，增加 I/O latency、降低 throughput。(2) 大型資料結構:早期 GPU device memory 很小，逼開發者把大資料結構切塊 (partition) 才塞得下(如 Ch.18 把 3D 能量網格切成 2D slices)；有些應用甚至沒有好的切法。根源是早期 GPU 的記憶體架構限制，非 CUDA 語言本身。

Question 2 - Zero-Copy Memory 的配置與頻寬規則 [recall]

Zero-copy memory(CUDA 2.2, 2009)如何配置?為何 host 指標不能直接傳給 kernel?它最適合什麼存取模式?

查看答案

用 cudaHostAlloc(..., cudaHostAllocMapped) 配置 pinned host memory(pinned 才不會被 OS page out)。host 指標須先經 cudaHostGetDevicePointer() 換成 device 指標再傳給 kernel；kernel deref 時走 PCIe / system interconnect 直接讀寫 host memory，免 cudaMemcpy()。但 system interconnect 頻寬 < global memory 的 10%，故只適合偶爾、稀疏存取的資料結構。

Question 3 - UVA 解決什麼、又不保證什麼 [recall]

Unified Virtual Addressing(UVA, CUDA 4)相對於先前的記憶體模型帶來什麼改變?它是否保證 host code 能直接存取 device 指標所指的資料?

查看答案

UVA 把 host 與 device 的獨立虛擬位址空間合併成單一共用虛擬位址空間 (UVAS)，保證每個實體位址只映射到唯一虛擬位址。runtime 只看指標值即可判斷它指向 host 還是 device → cudaMemcpy() 不再需要指定方向(可用 cudaMemcpyDefault)。但 UVA 不保證 accessibility:host code 仍不能直接 deref cudaMalloc() 回傳的 device 指標(反之亦然)；唯一例外是 zero-copy 指標可直接當 kernel 參數。

Question 4 - 32-bit 天花板與 Kepler 的大型位址空間 [recall]

早期 CUDA GPU 的記憶體容量上限是多少、為什麼?Kepler(2013)在虛擬/實體位址上做了什麼改變,又額外開啟了哪種 GPU 間能力?

查看答案

早期 GPU 用 32-bit 虛擬 + 最多 32-bit 實體位址 → 可定址 $2^{32}$ = 4 GB，device memory 與 kernel 操作的資料集都不能超過 4 GB；且 32-bit VA 裝不下現代 CPU 的 48-bit VA。Kepler 起改用 64-bit 虛擬位址 + ≥40-bit 實體位址 → 可裝 >4 GB DRAM、host/device 可用完全相同的指標值指同一資料，並把多顆 GPU 的 device memory 併入統一實體位址空間 → 一顆 GPU 可直接 deref 指標存取另一顆 GPU 的記憶體(peer access，免 host 觸發的 D2D copy)。

Question 5 - Unified Memory 與 Kepler/Maxwell 的限制 [recall]

CUDA 6(2013)的 unified (managed) memory 用哪個 API 配置?在 Kepler/Maxwell 世代上有哪三項限制?根本原因是什麼?

查看答案

用 cudaMallocManaged() 配置 CPU/GPU 共享的 managed memory pool，單一指標存取，runtime + 硬體負責 migration/coherence。Kepler/Maxwell 上的三限制:(1) 每次 grid launch 前，CPU 改過的 managed memory 全部要 flush 到 GPU；(2) CPU 與 GPU 不能同時存取同一塊 managed allocation；(3) unified address space 受限於 GPU 實體記憶體大小。根本原因:這些世代缺乏 host/device 間的硬體 coherence，migration 多由軟體完成。

Question 6 - Pascal 的 49-bit VA 與 Page Fault Handling [recall]

Pascal(2016)新增哪兩大硬體特性?page fault handling 如何讓 unified memory 不必在每次 launch 前 flush?

查看答案

兩大特性:(1) 49-bit 虛擬位址(足以涵蓋現代 CPU 的 48-bit VA + GPU 自身記憶體) → 整個系統(所有 CPU/GPU)成為單一虛擬位址空間；(2) memory page fault handling。有了 page fault，CUDA runtime 改用 invalidate-based coherence:host/device 修改 managed 變數時用 page mapping/protection 讓對方副本失效；launch 時不必把所有 GPU 副本更新到最新——若 grid 存取到被 invalidate 或不在 device 的 page，就觸發 page fault，按需 migrate 到 GPU(或 map-on-access 走 interconnect)後恢復執行。

Question 7 - Kernel 內函式呼叫與其意義 [recall]

早期 CUDA 為何不支援 recursion、system call、virtual function?Kepler/CUDA 5 起靠什麼機制支援真正的函式呼叫?除了 recursion,這帶來哪些軟體工程效益?

查看答案

早期 compiler 必須把所有函式 body inline 進 kernel object，runtime 沒有真正的 call，故不支援 system call、dynamic library call、recursion、virtual function。Kepler/CUDA 5 起由 per-thread 的 massively parallel call frame stack(有 cache、快速實作)支撐真正的 runtime call，compiler 不再被強制 inline。效益:composability(不同作者寫不同元件再組裝、廠商可釋出不含原始碼的 device library 做 IP 保護)、可呼叫 printf()/malloc()(printf() 對 production 除錯特別有用，讓非技術 end user dump 內部狀態回報 bug)。

Question 8 - Cooperative Kernels 的並發保證與限制 [recall]

Cooperative kernels(CUDA 11)解決什麼問題?它提供什麼保證、用哪個 API 啟動?為何 block 數不能 oversubscribe?

查看答案

解決 irregular data 的 load imbalance、以及多 block 想用 mutex 等互斥機制合作(如共享 work queue)時的 deadlock 風險。它保證所有 thread block 同時併發執行(最多到填滿 GPU 的 block 數)，因此跨 block 同步/合作不會 deadlock。用 cudaLaunchCooperativeKernel() 啟動(跨裝置用 cudaLaunchCooperativeKernelMultiDevice())，配合 grid.sync()。不可 oversubscribe:一般 grid 的 block 採 transparent scalability 分波排程、順序不保證，若 block 數超過 GPU 容量，做 grid-wide barrier 時等待中的 block 永遠不會上場 → deadlock；cooperative kernel 用「犧牲可任意擴張的 block 數」換「全 grid 同時駐留」。

Question 9 - Double / Half-Precision 速度的演進 [recall]

早期 GPU 的 double-precision 相對 single-precision 慢多少?Fermi 後改善到什麼程度?half-precision(FP16)硬體支援在哪個世代引入?

查看答案

早期 GPU 的 FP64 約比 FP32 慢 8 倍(被 HPC 社群詬病)。Fermi 及後繼把 FP64 運算單元強化到約 FP32 的 一半速度(2x slower)，讓移植 CPU 數值程式的開發者不必再費力評估能否降成 single-precision，大幅降低移植成本。half-precision (FP16) 硬體支援在 Pascal 引入，Ampere (A100) 再透過 tensor cores 把 FP16 吞吐推到極致。

Question 10 - A100 精度吞吐量計算與選擇 [application]

A100(Ampere)的 FP16 tensor-core 吞吐為 156 TFLOPS、FP32 為 19.5 TFLOPS。(a) FP16 相對 FP32 約快幾倍?(b) 一個醫療影像應用的資料可容忍 16-bit 表示,除了算力,改用 FP16 還在哪個面向受益、原因為何?

查看答案

(a) $156 / 19.5 = 8$ → FP16 約為 FP32 的 8x 吞吐。(b) 還在 記憶體頻寬 受益:16-bit 資料的搬運量只有 32-bit 的一半,對 bandwidth-bound 的 kernel 直接減半頻寬壓力(也省能耗)。前提是資料本身能容忍該表示範圍與 round-off(先確認數值穩定性再降精度)。

Question 11 - 移植含 qsort 與 linked list 的 CPU 程式 [application]

你要把一支使用 malloc/free 與 qsort()、且資料是 host 上 linked structure 的 CPU 程式移植到 GPU,希望改動最小。(a) 依 Fig 22.1,unified memory 讓 host code 只需改哪兩處?(b) 要讓 GPU kernel 能直接遍歷 host memory 上的 linked list,需要哪個世代/特性,關鍵原理是什麼?

查看答案

(a) 兩處:malloc/free → cudaMallocManaged()/cudaFree()；qsort() → launch 一個平行 sort kernel + cudaDeviceSynchronize()(仍須有 parallel qsort kernel,但 host code 改動極小)。(b) 需 Pascal 起的 page fault handling + system-wide VA:因為 host 與 device 用同一指標值指同一變數,host 建立的指標鏈 device 也能跟著走(反之亦然),且資料不必放在 zero-copy memory;page fault 會按需把 page migrate 到 GPU。這對 CAD 等需數百 GB「in core」資料的應用特別關鍵。

Question 12 - Cluster 的 local/remote 工作排程 [application]

一個 parallel cluster 應用把工作分成 local 與 remote 兩種,remote work 牽涉與其他 node 互動、位於全域進度的 critical path 上。舊 CUDA 系統有什麼兩難?哪個硬體能力解決它、具體怎麼做?

查看答案

舊系統一次只能跑一個 grid,且 grid 要夠大才能填滿 device → 兩難:等 remote work 才動作會讓 device 閒置、低利用率;急著先做 local work 又會讓 remote work 卡在大 grid 後面、latency 變高。Fermi 起的 simultaneous multiple grids(配合 Kepler 的多 hardware queues)解決:用較小的 grid size 提交工作,並可分 priority;當高優先的 remote work 到達時,可低延遲立即插入執行,而不必卡在一大塊 local computation 之後。

Question 13 - Zero-Copy vs Unified Memory(Pascal)的取捨 [analysis]

比較 zero-copy memory 與 Pascal 起的 unified memory(page fault)在「資料存取方式、能否遍歷指標鏈、資料量上限、coherence」四個面向的差異,並說明為何 unified memory 是更通用的 CPU/GPU 互動機制。

查看答案

存取方式:zero-copy 永遠走 PCIe 直接讀、無遷移(頻寬 <10% global memory,故只能稀疏存取);unified memory 會按需 migrate page 到存取端,或 map-on-access。遍歷指標鏈:zero-copy 只在「全部記憶體都用 cudaHostAlloc() 配置」時才能 traverse;unified memory(Pascal)可遍歷任意 host 建立的 linked structure,因 host/device 共用同一指標值。資料量上限:zero-copy 受限於 host pinned memory,unified memory 涵蓋全系統(CPU + 多 GPU)的 VA。Coherence:zero-copy 無(手動),unified memory 由硬體 + page fault 自動維護。因此 unified memory 不再受「能 copy/pin 到 device 的資料量」限制,還能自動把熱資料搬到存取端,是更通用的機制。

Question 14 - PC Sampling vs Critical Path Analysis [analysis]

CUDA 7.5 的 PC sampling 與 CUDA 8 的 critical path analysis 各解決什麼問題?為什麼「執行時間最長的 kernel」未必是最該優化的對象?這與 Amdahl 思維有何關聯?

查看答案

PC sampling(CUDA 7.5)提供指令級 profiling,告訴你哪一行程式碼最耗時(熱點);但它不知道該 kernel 是否與 CPU 活動重疊。Critical path analysis(CUDA 8 Visual Profiler)則找出哪一個 kernel/API 真正決定整體執行時間,非關鍵路徑活動灰階淡化。為何最長 kernel 未必該優化:如 Fig 22.2,Kernel X 雖長卻完全與 CPU 活動 A 重疊,只加速 X 而不同時加速 A → speedup ≈ 1;反觀較短的 Kernel Y 在 critical path 上、CPU 正空等它,加速 Y 才真能縮短整體時間。這正是 Amdahl 思維:只有優化「真正占用串列關鍵路徑時間」的部分才有 speedup,故兩工具互補,缺一不可。

Question 15 - 直接 Atomic Scatter vs Prefix-Sum/Sort 轉換 [analysis]

在 histogram 這類 random scatter 計算中,可以直接用 atomic operations,也可以做演算法轉換改用 prefix-sum(scan)或 sorting。隨各世代 atomic 變快,這個取捨如何改變?「atomic 變快」是否代表不再需要任何優化?

查看答案

早期 GPU 的 atomic 慢且功能受限,開發者常被迫做演算法轉換(prefix-sum / sort)來避開 atomic,但這些轉換會增加 kernel 啟動次數與總運算量,有時甚至要把 collective/多 block 更新工作回丟 host CPU(增加 CPU↔GPU 傳輸)。隨 Fermi→Kepler→Maxwell atomic 逐代變快、更通用(Maxwell 再強化 shared-memory atomic 吞吐),直接 atomic scatter 變得可接受,減少了改用 scan/sort 或回丟 CPU 的需要。但「快的 atomic」≠ 不用優化:高競爭 (contention) 仍會嚴重拖慢效能,privatization、coarsening、aggregation 等技巧依然有效(見 09-Parallel-Histogram/02-Histogram-Optimizations-Privatization-Coarsening-Aggregation)。

重點整理 (Pattern Summary，點擊展開)

記憶體模型演進時間軸

年代 / 版本	硬體世代	里程碑	關鍵 API / 數字
2009 / CUDA 2.2	(pre-Fermi)	Zero-copy memory	`cudaHostAlloc(cudaHostAllocMapped)`；頻寬 <10%
2011 / CUDA 4	Fermi	UVA	每個實體位址 ↔ 唯一虛擬；`cudaMemcpy` 免方向
2013 / Kepler	Kepler	大型位址空間 + peer access	64-bit VA、≥40-bit PA、>4 GB DRAM
2013 / CUDA 6	Kepler/Maxwell	Unified (managed) memory	`cudaMallocManaged()`(須 flush、不可同時存取)
2016 / Pascal	Pascal	Page fault handling + 49-bit VA	免 flush、遍歷 host 指標鏈、HBM2(3x)、NVLink(5x)
CUDA 11	Ampere+	VA space control	`cuMemAddressReserve/Create/Map`

Kernel 執行控制能力(首見世代)

能力	首見	核心價值
In-kernel function call	Kepler/CUDA 5	call frame stack → composability、`printf`
Device lambda	CUDA 8	泛型 kernel；`__host__ __device__` extended
Simultaneous grids	Fermi	小 grid + priority → critical-path 低延遲
Hardware queues	Kepler	消除 false serialization
Interruptable grids	Fermi	grid cancel → load balance
Cooperative kernels	CUDA 11	併發保證 → 跨 block 合作；不可 oversubscribe

吞吐量 / 頻寬硬體進步:DP 8x→1/2(Fermi)；FP16 tensor 156 vs 19.5 TFLOPS(A100)；predication(Fermi)；可配置 cache/shared(Fermi)；atomic Fermi→Kepler→Maxwell 漸快；HBM2 ≤3x、NVLink ≤5x(Pascal)。

生產力工具:unified device memory space(Fermi,把晶片內 g/l/s 位址統一,勿與 managed memory 混淆)；OpenACC / Thrust / CUDA FORTRAN / C++AMP；PC sampling(CUDA 7.5,哪一行) + critical path analysis(CUDA 8,哪一個 kernel)。

一句話總結:Ch.22 的主線是「讓指標愈來愈通用、讓 GPU code 愈來愈像 CPU code」——記憶體模型從手動 cudaMemcpy 走向單一指標自動遷移,kernel 能力從強制 inline 走向 composable,硬體吞吐拉低「easy performance」門檻,工具從「最熱的行」走向「真正在 critical path 上的 kernel」。

Related Concepts

Question 1 - 簡單 Host/Device 模型的兩大痛點 [recall]

Question 2 - Zero-Copy Memory 的配置與頻寬規則 [recall]

Question 3 - UVA 解決什麼、又不保證什麼 [recall]

Question 4 - 32-bit 天花板與 Kepler 的大型位址空間 [recall]

Question 5 - Unified Memory 與 Kepler/Maxwell 的限制 [recall]

Question 6 - Pascal 的 49-bit VA 與 Page Fault Handling [recall]

Question 7 - Kernel 內函式呼叫與其意義 [recall]

Question 8 - Cooperative Kernels 的並發保證與限制 [recall]

Question 9 - Double / Half-Precision 速度的演進 [recall]

Question 10 - A100 精度吞吐量計算與選擇 [application]

Question 11 - 移植含 qsort 與 linked list 的 CPU 程式 [application]

Question 12 - Cluster 的 local/remote 工作排程 [application]

Question 13 - Zero-Copy vs Unified Memory(Pascal)的取捨 [analysis]

Question 14 - PC Sampling vs Critical Path Analysis [analysis]

Question 15 - 直接 Atomic Scatter vs Prefix-Sum/Sort 轉換 [analysis]