Multidimensional Grids and Data: Practice Problems (Practice - Multidimensional Grid Organization)

#practice #cuda-programming #multidimensional-grids #matrix-multiplication

Key Patterns

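Keyword / scenario	Answer / formula
`<<<A, B>>>` two parameters	A = grid dimensions (in blocks), B = block dimensions (in threads); both are of type `dim3`
1D shorthand `<<<ceil(n/256.0), 256>>>`	The arithmetic expression fills `.x`; `.y` and `.z` default to 1
Max threads per block	1024, i.e. `blockDim.x * blockDim.y * blockDim.z ≤ 1024`; `(32,32,2)` = 2048 is illegal
`gridDim` limit per dimension	`.x`: 1 to 2³¹−1; `.y`/`.z`: 1 to 65,535 (2¹⁶−1)
row / col formulas	`row = blockIdx.y*blockDim.y + threadIdx.y`; `col = blockIdx.x*blockDim.x + threadIdx.x`
Axis mapping	`.y` → row (vertical, n), `.x` → col (horizontal, m); figures label the highest dimension first, whereas the C execution configuration lists the lowest dimension first (reverse order)
2D row-major linearization	`idx = row*Width + col`
2D column-major linearization	`idx = col*Height + row` (equivalent to row-major on the transpose)
3D row-major linearization	`idx = plane*(Width*Height) + row*Width + col`
Boundary check	`if (col < width && row < height)`; filters out the extra out-of-range threads the grid creates
RGB offset into `Pin`	`rgbOffset = (row*width + col)*CHANNELS`, with CHANNELS = 3
`BLUR_SIZE`	Patch radius; patch side length = `2*BLUR_SIZE+1`; 3×3 → 1, 7×7 → 3
Two nested ifs in blur	Outer if guards out-of-range threads, inner if guards out-of-range patch neighbors; divide by the actual `pixels` count
matmul `M` access	`M[row*Width + k]` (contiguous as k advances)
matmul `N` access	`N[k*Width + col]` (stride = Width as k advances, non-contiguous)
matmul `P` write-back	`P[row*Width + col]`; `Pvalue` accumulates in a register
Basic matmul performance	arithmetic intensity ≈ 0.25 OP/B → memory-bound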

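Question 1 - dim3 and Execution Configuration Parameters [recall]

Scenario: In a kernel call, what do the two execution configuration parameters inside `<<< , >>>` represent? What is their type? Why does the 1D form `<<<ceil(n/256.0), 256>>>` work?

Answer

The first parameter is the grid dimensions (counted in blocks) and the second is the block dimensions (counted in threads). Both are of type `dim3`, a three-element integer vector (x, y, z). The 1D shorthand relies on the default arguments of the `dim3` C++ constructor: a single arithmetic expression fills `.x`, while `.y` and `.z` take the default value 1. For n = 4000, ceil(4000/256.0) = ceil(15.625) = 16, so `gridDim.x` = 16 and `blockDim.x` = 256.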
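The launch arithmetic can be checked on the host without a GPU. The following is a minimal C++ sketch under stated assumptions: `Dim3` is a hypothetical stand-in that mimics the default arguments of CUDA's `dim3`, and `gridFor` is an illustrative helper, not part of any CUDA API.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for CUDA's dim3: unspecified dimensions default to 1.
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Grid size so that gridDim.x * blockSize >= n, i.e. ceil(n / blockSize).
Dim3 gridFor(unsigned n, unsigned blockSize) {
    assert(blockSize > 0);
    // A single value fills .x; .y and .z keep their default of 1.
    return Dim3(static_cast<unsigned>(std::ceil(n / static_cast<double>(blockSize))));
}
```

Calling `gridFor(4000, 256)` reproduces the n = 4000 case: 16 blocks in x, with y and z left at 1.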

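Question 2 - Block and Grid Size Limits [recall]

Scenario: How many threads can a block hold at most? Which of (32, 16, 2) and (32, 32, 2) is legal? How do the upper limits of `gridDim.x` and `gridDim.y`/`.z` differ?

Answer

A block may hold at most 1024 threads in total (x*y*z ≤ 1024). (32,16,2) = 1024 is legal; (32,32,2) = 2048 > 1024 is not. `gridDim.x` ranges from 1 to 2³¹−1, whereas `gridDim.y` and `gridDim.z` only go up to 65,535 (2¹⁶−1).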
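The total-thread rule can be expressed as a one-line check. This is a small host-side C++ sketch; `blockFits` and `kMaxThreadsPerBlock` are illustrative names, and only the 1024-thread total quoted above is modeled.

```cpp
#include <cassert>

// Total-thread limit per block quoted in the text.
const unsigned kMaxThreadsPerBlock = 1024;

// True when x*y*z stays within the per-block thread limit.
// Per-dimension limits also exist on real devices; they are not modeled here.
bool blockFits(unsigned x, unsigned y, unsigned z) {
    assert(x > 0 && y > 0 && z > 0);
    // Widen to 64 bits so the product cannot overflow before the comparison.
    unsigned long long total = 1ULL * x * y * z;
    return total <= kMaxThreadsPerBlock;
}
```

With this helper, `blockFits(32, 16, 2)` is true and `blockFits(32, 32, 2)` is false, matching the two cases in the question.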

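Question 3 - Meaning of the Four Built-in Variables [recall]

Scenario: What do `gridDim`, `blockDim`, `blockIdx`, and `threadIdx` each represent? Which are determined by the host's execution configuration, and which can be renamed inside a kernel?

Answer

The `*Dim` variables are sizes (`gridDim` = number of blocks in each dimension, `blockDim` = number of threads in each dimension) and are determined by the execution configuration parameters on the host. The `*Idx` variables are indices (`blockIdx` = coordinates of this block, `threadIdx` = coordinates of this thread within its block) and are assigned by the hardware at scheduling time. All four names are fixed by CUDA C and cannot be renamed inside a kernel; only the host-side variables such as `dimGrid`/`dimBlock` can be named freely.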

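Question 4 - Mapping Threads to Multidimensional Data [recall]

Scenario: Write the formulas that derive `row` and `col` from the thread coordinates in a 2D kernel, explain the axis mapping (which one corresponds to x and which to y), and explain why an `if` boundary check is needed.

Answer

`row = blockIdx.y*blockDim.y + threadIdx.y` (vertical, n) and `col = blockIdx.x*blockDim.x + threadIdx.x` (horizontal, m); that is, `.y` → row and `.x` → col. The number of threads in each grid direction is always a multiple of `blockDim` that is at least the data size, so whenever the data size is not an exact multiple there are extra out-of-range threads, and `if (col < width && row < height)` must filter them out.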
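The mapping and the boundary check can be tried on the host. In this C++ sketch the function parameters stand in for `blockIdx`, `blockDim`, and `threadIdx`, which exist only inside a real kernel; `Coord`, `globalCoord`, and `inBounds` are hypothetical names chosen for illustration.

```cpp
#include <cassert>

struct Coord { int row; int col; };

// Same arithmetic a 2D kernel performs on its built-in variables.
Coord globalCoord(int blockIdxX, int blockIdxY, int blockDimX, int blockDimY,
                  int threadIdxX, int threadIdxY) {
    assert(blockDimX > 0 && blockDimY > 0);
    Coord c;
    c.row = blockIdxY * blockDimY + threadIdxY;  // .y -> row (vertical)
    c.col = blockIdxX * blockDimX + threadIdxX;  // .x -> col (horizontal)
    return c;
}

// The boundary check: only threads that land inside the data do work.
bool inBounds(Coord c, int width, int height) {
    return c.col < width && c.row < height;
}
```

For a 76-wide, 62-high image with 16×16 blocks, the last thread of the last block lands at row 63, col 79 and is rejected by `inBounds`.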

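Question 5 - Row-major Linearization and Dynamic Arrays [recall]

Scenario: When a 4×4 matrix M is flattened in row-major order, what is the 1D index of `M[j][i]`? Why must dynamically allocated multidimensional arrays in CUDA C be flattened manually by the programmer?

Answer

The row-major 1D index is `j*Width + i` (`j*Width` skips the preceding rows, `i` selects the element within the row). Modern computers use a flat memory space (one address per byte), so every multidimensional array is ultimately laid out in 1D. For a static array the compiler knows the number of columns and can translate the index automatically. For a dynamically allocated array the number of columns is only known at run time, and ANSI C needs it at compile time to support `[j][i]`, so the access must be written by hand as `Pin[row*width + col]`.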
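Manual flattening with a width that is only known at run time might look like the following C++ sketch. `FlatMatrix` is a hypothetical helper type, not something from the source; it simply wraps one flat buffer and applies `j*width + i`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A matrix whose width is only known at run time, stored as one flat buffer.
struct FlatMatrix {
    std::size_t width, height;
    std::vector<float> data;
    FlatMatrix(std::size_t w, std::size_t h) : width(w), height(h), data(w * h, 0.0f) {}

    // Manual row-major flattening: j*width skips j full rows, i picks the column.
    float& at(std::size_t j, std::size_t i) {
        assert(j < height && i < width);
        return data[j * width + i];
    }
};
```

In a 4×4 matrix, `at(2, 1)` refers to flat element 2*4 + 1 = 9.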

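Question 6 - Index Computation in colorToGrayscale [recall]

Scenario: In `colorToGrayscaleConversion`, how are the output grayscale index `grayOffset` and the input color start offset `rgbOffset` computed? Why do the two differ by a constant factor?

Answer

`grayOffset = row*width + col` (the output has 1 byte per pixel). `rgbOffset = grayOffset*CHANNELS` (CHANNELS = 3, since each color pixel occupies 3 bytes r, g, b), after which 3 consecutive bytes are read. Example: in a 62×76 image processed with 16×16 blocks, thread(0,0) of block(1,0) (that is, `blockIdx.y` = 1, `blockIdx.x` = 0) maps to P₁₆,₀: grayOffset = 16*76 + 0 = 1216 and rgbOffset = 1216*3 = 3648.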
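The two offsets can be written as small host-side C++ helpers. This is only a sketch of the index arithmetic (the function names mirror the variable names in the text; the color-to-gray weighting itself is left out).

```cpp
#include <cassert>

const int CHANNELS = 3;  // r, g, b bytes per color pixel

// Index of the pixel in the 1-byte-per-pixel grayscale output.
int grayOffset(int row, int col, int width) {
    assert(row >= 0 && col >= 0 && col < width);
    return row * width + col;
}

// Index of the pixel's first byte (r) in the interleaved RGB input;
// g and b follow at rgbOffset + 1 and rgbOffset + 2.
int rgbOffset(int row, int col, int width) {
    return grayOffset(row, col, width) * CHANNELS;
}
```

For pixel P₁₆,₀ of the 76-wide image, `grayOffset(16, 0, 76)` gives 1216 and `rgbOffset(16, 0, 76)` gives 3648.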

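Question 7 - The Two Nested ifs and the pixels Count in the Blur Kernel [recall]

Scenario: What does `BLUR_SIZE` represent in `blurKernel`? Why are both an outer and an inner `if` needed? Why is the final sum divided by `pixels` rather than by the fixed value (2*BLUR_SIZE+1)²?

Answer

`BLUR_SIZE` is the patch radius (the number of pixels on each side of the center), so the patch side length is `2*BLUR_SIZE+1` (3×3 → 1, 7×7 → 3). The outer `if` filters out the extra grid threads that map outside the image; the inner `if` (`curRow` in [0, h) and `curCol` in [0, w)) skips patch neighbors that fall outside the image. Patches at an edge or corner are partly out of bounds, so the number of valid pixels varies (for a 3×3 patch: 9 in the interior, 6 on an edge, 4 in a corner), and the sum must be divided by the actual accumulated `pixels` count to obtain a correct average.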
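The two guards and the `pixels` count can be exercised on the host. The C++ sketch below assumes a single-channel 8-bit image in row-major order; `blurAt` and `BlurResult` are hypothetical names, and the function plays the role of one thread's body rather than reproducing the source kernel.

```cpp
#include <cassert>
#include <vector>

const int BLUR_SIZE = 1;  // radius 1 -> 3x3 patch

struct BlurResult { int pixels; unsigned char value; };

// Host-side version of the per-thread blur body for output pixel (row, col).
// Returns pixels = 0 when (row, col) is itself outside the image (outer if).
BlurResult blurAt(const std::vector<unsigned char>& in, int row, int col, int w, int h) {
    assert(static_cast<int>(in.size()) == w * h);
    BlurResult r = {0, 0};
    if (col < w && row < h) {  // outer if: is this thread inside the image?
        int sum = 0;
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int curRow = row + dr, curCol = col + dc;
                // inner if: skip neighbors that fall off the image
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    sum += in[curRow * w + curCol];
                    ++r.pixels;
                }
            }
        }
        r.value = static_cast<unsigned char>(sum / r.pixels);  // divide by the real count
    }
    return r;
}
```

On a uniform image the average stays equal to the pixel value everywhere, while `pixels` is 4 in a corner, 6 on an edge, and 9 in the interior; dividing by a fixed 9 would darken the border.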

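Question 8 - Memory Accesses in matrixMulKernel [recall]

Scenario: In `matrixMulKernel` (square matrices of size Width), what are the access indices for M and N inside the inner-product loop? How is P written back? Why accumulate in the local variable `Pvalue`?

Answer

`Pvalue += M[row*Width + k] * N[k*Width + col]`, with the loop running k = 0..Width−1 to form the inner product; afterwards `P[row*Width + col] = Pvalue`. The thread walks one row of M (contiguous) and one column of N (stride = Width). `Pvalue` accumulates in a register, which avoids writing to global memory on every iteration; the result is written back once after the loop.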
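The per-thread work can be emulated on the host to check the indexing. In this C++ sketch, `matmulElement` plays the role of one thread and the nested loops in `matmul` stand in for the grid launch; both names are illustrative, and `std::vector` replaces device memory.

```cpp
#include <cassert>
#include <vector>

// Work of one thread in the basic kernel: one element P[row][col].
void matmulElement(const std::vector<float>& M, const std::vector<float>& N,
                   std::vector<float>& P, int Width, int row, int col) {
    if (row < Width && col < Width) {
        float Pvalue = 0.0f;  // local accumulator (a register on the GPU)
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[row * Width + k] * N[k * Width + col];
        }
        P[row * Width + col] = Pvalue;  // single write-back after the loop
    }
}

// Stand-in for the grid launch: visit every (row, col) once.
std::vector<float> matmul(const std::vector<float>& M, const std::vector<float>& N, int Width) {
    assert(static_cast<int>(M.size()) == Width * Width);
    assert(static_cast<int>(N.size()) == Width * Width);
    std::vector<float> P(Width * Width, 0.0f);
    for (int row = 0; row < Width; ++row)
        for (int col = 0; col < Width; ++col)
            matmulElement(M, N, P, Width, row, col);
    return P;
}
```

For the 2×2 inputs M = [1 2; 3 4] and N = [5 6; 7 8], this yields P = [19 22; 43 50].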

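Question 9 - Row-major and Column-major Indices [application]

Scenario (Exercise 4): A 2D matrix with width = 400 and height = 500 is stored as a 1D array. Find the array index of the element at row = 20, col = 10 in (a) row-major order and (b) column-major order.

Answer

(a) Row-major: row*Width + col = 20*400 + 10 = 8010.
(b) Column-major: col*Height + row = 10*500 + 20 = 5020.
The difference comes from which direction is stored contiguously (rows first vs. columns first); column-major is equivalent to the row-major layout of the transposed matrix.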
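Both layouts reduce to one line of arithmetic each. A minimal C++ sketch, with `rowMajorIndex` and `colMajorIndex` as illustrative helper names:

```cpp
#include <cassert>

// Row-major: consecutive elements of a row are adjacent in memory.
long rowMajorIndex(long row, long col, long width) {
    assert(col >= 0 && col < width);
    return row * width + col;
}

// Column-major: consecutive elements of a column are adjacent in memory.
long colMajorIndex(long row, long col, long height) {
    assert(row >= 0 && row < height);
    return col * height + row;
}
```

Note that `colMajorIndex(row, col, height)` equals `rowMajorIndex(col, row, height)`, which is the transpose relationship stated above.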

Question 10 - Linearizing a 3D Tensor [application]

Scenario (Exercise 5): A 3D tensor with width = 400, height = 500, and depth = 300 is stored as a 1D array in row-major order. Find the array index of the element at (x = 10, y = 20, z = 5).

Answer

3D row-major: idx = plane*(Width*Height) + row*Width + col, where plane = z = 5, row = y = 20, col = x = 10.
= 5*(400*500) + 20*400 + 10 = 1,000,000 + 8,000 + 10 = 1,008,010.
The planes (each n×m) are placed into the address space one after another.
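The same formula as a C++ helper, as a quick way to check the arithmetic (`index3D` is an illustrative name):

```cpp
#include <cassert>

// Row-major 3D index: whole planes first, then rows within the plane, then columns.
long index3D(long plane, long row, long col, long width, long height) {
    assert(row >= 0 && row < height && col >= 0 && col < width);
    return plane * (width * height) + row * width + col;
}
```

Here `index3D(5, 20, 10, 400, 500)` evaluates to 1,008,010. The depth (300) does not appear in the formula; it only bounds the valid range of `plane`.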

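Question 11 - Launch Configuration and Boundary Block Behavior [application]

Scenario: A 62 (height, n) × 76 (width, m) image is to be processed with 16×16 blocks. How many blocks does the grid need? How many threads are created in total? Why are there out-of-range threads?

Answer

The number of blocks in each direction is a ceiling: ceil(76/16) = 5 in x and ceil(62/16) = 4 in y, giving 20 blocks and (5*16) × (4*16) = 80 × 64 = 5120 threads in total. Because the thread count is always a multiple of `blockDim` (80 ≥ 76, 64 ≥ 62), 4 extra thread columns in x and 2 extra thread rows in y fall out of range and must be filtered by `if (col < 76 && row < 62)`; only 62 × 76 = 4712 pixels are valid.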
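These counts can be confirmed by walking an emulated grid on the host. In the C++ sketch below, nested loops stand in for the blocks and threads of a 2D launch; `LaunchStats` and `emulateLaunch` are hypothetical names used only for this illustration.

```cpp
#include <cassert>

struct LaunchStats { int blocksX, blocksY, totalThreads, activeThreads; };

// Walks every thread of the emulated grid and counts how many pass the boundary check.
LaunchStats emulateLaunch(int width, int height, int blockDimX, int blockDimY) {
    assert(width > 0 && height > 0 && blockDimX > 0 && blockDimY > 0);
    LaunchStats s;
    s.blocksX = (width + blockDimX - 1) / blockDimX;   // integer form of ceil(width / blockDimX)
    s.blocksY = (height + blockDimY - 1) / blockDimY;  // integer form of ceil(height / blockDimY)
    s.totalThreads = 0;
    s.activeThreads = 0;
    for (int by = 0; by < s.blocksY; ++by)
        for (int bx = 0; bx < s.blocksX; ++bx)
            for (int ty = 0; ty < blockDimY; ++ty)
                for (int tx = 0; tx < blockDimX; ++tx) {
                    int row = by * blockDimY + ty;
                    int col = bx * blockDimX + tx;
                    ++s.totalThreads;
                    if (col < width && row < height) ++s.activeThreads;
                }
    return s;
}
```

For `emulateLaunch(76, 62, 16, 16)` the result is a 5 × 4 grid, 5120 threads in total, and 4712 threads that pass the check, so 408 threads do no work.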

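Question 12 - Contiguous M vs. Strided N Access [analysis]

Scenario: In the basic `matrixMulKernel`, how do the access patterns of adjacent threads in the same warp (`col` differing by 1) differ for `M[row*Width+k]` and `N[k*Width+col]`? How does this affect memory access efficiency?

Answer

For `N[k*Width+col]`: adjacent threads in a warp differ by 1 in `col`, so at a given k they access adjacent addresses (stride 1) and coalesce well. For `M[row*Width+k]`: adjacent threads in the same thread row share `row`, so they read the same address (which can be served by the cache or a broadcast), while threads in adjacent rows access addresses `Width` apart. This is the across-thread view; within a single thread, as k advances, M is walked contiguously and N with stride `Width`. What matters for efficiency is whether adjacent threads access adjacent addresses, which is the starting point for the later memory coalescing and tiling optimizations (see 06-Performance-Considerations/01-Memory-Coalescing).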
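The across-thread distances can be made concrete by computing the flat indices directly. A small C++ sketch; all four function names are hypothetical and only the index arithmetic from the kernel is assumed.

```cpp
#include <cassert>

// Flat indices touched by the thread computing P[row][col] at loop step k.
long mIndex(long row, long k, long Width) { assert(k < Width); return row * Width + k; }
long nIndex(long k, long col, long Width) { assert(col < Width); return k * Width + col; }

// Index distance in N between the threads for (row, col) and (row, col + 1) at the same k.
long nStepAcrossCols(long k, long col, long Width) {
    return nIndex(k, col + 1, Width) - nIndex(k, col, Width);
}

// Index distance in M between the threads for (row, col) and (row + 1, col) at the same k.
long mStepAcrossRows(long row, long k, long Width) {
    return mIndex(row + 1, k, Width) - mIndex(row, k, Width);
}
```

`mIndex` takes no `col` argument at all, which is why threads that differ only in `col` read the same M element. With Width = 64, `nStepAcrossCols` is 1 and `mStepAcrossRows` is 64.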

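Question 13 - One Row vs. One Column vs. One Element per Thread [analysis]

Scenario (Exercise 1): Besides "each thread computes one P element", a matmul kernel can also be designed so that each thread computes an entire row or an entire column. Compare the trade-offs, and explain why the basic matmul is memory-bound.

Answer

With one thread per row or per column, the number of threads drops from Width² to Width, so parallelism falls sharply and each thread's work grows to O(Width²). Such a version can reuse data from one of the two matrices (M or N, for example by loading it into registers) but still rereads the other from global memory; the lack of parallelism usually makes it slower overall. The basic one-element version reads 2 floats (8 B) and performs 2 FLOP per iteration, giving an arithmetic intensity of about 0.25 OP/B, and elements of M and N are read from global memory repeatedly by many threads. It is therefore memory-bound and needs shared-memory tiling to improve (see 05-Memory-Architecture-And-Data-Locality/02-Tiling-and-Tiled-Matrix-Multiplication).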
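One possible shape of the row-per-thread variant, emulated on the host in C++. This is a sketch under assumptions, not the reference solution to the exercise: `matmulRow` plays the role of one thread, the outer loop in `matmulByRows` stands in for a 1D launch of Width threads, and both names are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// Work of one thread in a hypothetical "one thread per output row" kernel:
// the thread loops over every column of its row, so it does O(Width^2) work.
void matmulRow(const std::vector<float>& M, const std::vector<float>& N,
               std::vector<float>& P, int Width, int row) {
    if (row < Width) {  // boundary check on the 1D thread index
        for (int col = 0; col < Width; ++col) {
            float Pvalue = 0.0f;
            for (int k = 0; k < Width; ++k)
                Pvalue += M[row * Width + k] * N[k * Width + col];
            P[row * Width + col] = Pvalue;
        }
    }
}

// Stand-in for a 1D launch with only Width threads (instead of Width * Width).
std::vector<float> matmulByRows(const std::vector<float>& M, const std::vector<float>& N, int Width) {
    assert(static_cast<int>(M.size()) == Width * Width);
    assert(static_cast<int>(N.size()) == Width * Width);
    std::vector<float> P(Width * Width, 0.0f);
    for (int row = 0; row < Width; ++row)
        matmulRow(M, N, P, Width, row);
    return P;
}
```

The result is identical to the one-element-per-thread version; what changes is that only Width independent units of work exist, each reusing its own row of M while sweeping all of N.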

Pattern Summary

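Topic	Key point
Grid/Block hierarchy	A grid is a 3D array of blocks and a block is a 3D array of threads; `dim3(x,y,z)`, with unused dimensions set to 1
Execution configuration	`<<<dimGrid, dimBlock>>>`: first parameter is the grid dimensions, second is the block dimensions
Size limits	block ≤ 1024 threads; `gridDim.x` ≤ 2³¹−1, `.y`/`.z` ≤ 65,535
Built-in variables	`*Dim` = size (set by the host), `*Idx` = index (assigned by hardware); cannot be renamed inside a kernel
Thread → Data	`row=blockIdx.y*blockDim.y+threadIdx.y`, `col=blockIdx.x*blockDim.x+threadIdx.x`; `.y` → row, `.x` → col
Boundary if	Thread count is always a multiple of blockDim that is ≥ the data size → out-of-range threads must be filtered
Row-major linearization	2D: `row*Width+col`; 3D: `plane*(Width*Height)+row*Width+col`; column-major: `col*Height+row`
Manual flatten	Column count of a dynamic array is known only at run time → must write `A[row*width+col]`
colorToGrayscale	`grayOffset=row*width+col`, `rgbOffset=grayOffset*3`
blurKernel	`BLUR_SIZE` = radius; two nested ifs (out-of-range thread + out-of-range patch neighbor); divide by the actual `pixels`
matrixMulKernel	`M[row*W+k]` (contiguous), `N[k*W+col]` (stride W), `P[row*W+col]`; `Pvalue` in a register
Basic matmul bottleneck	arithmetic intensity ≈ 0.25 OP/B, elements read repeatedly → memory-bound → tiling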
