Thread 對多維資料的映射與線性化 (Mapping Threads to Multidimensional Data)

#cuda-programming #memory-linearization #thread-indexing #row-major-layout #multidimensional-grids

重點總覽 (Overview)

項目	公式 / 規則	重點
Row 索引	`row = blockIdx.y*blockDim.y + threadIdx.y`	對應資料的垂直 (y) 方向
Col 索引	`col = blockIdx.x*blockDim.x + threadIdx.x`	對應資料的水平 (x) 方向
2D 線性化 (row-major)	`index = row*Width + col`	C/CUDA C 預設配置
2D 線性化 (column-major)	`index = col*Height + row`	FORTRAN / 多數 C-BLAS 使用
3D 線性化	`index = plane(WidthHeight) + row*Width + col`	一個 plane 接一個 plane 排放
邊界檢查	`if (col < width && row < height)`	thread 數恆 ≥ 像素數,需擋多餘 thread
RGB → 灰階偏移	`rgbOffset = grayOffset*CHANNELS`	每像素 3 bytes (r,g,b)

Important

維度標示順序與 C 程式碼相反。書中文字/圖以「高維在前」標示像素 P[y][x] = P_{row,col}(y 先),但 dim3 與 blockDim/gridDim 的欄位是「低維在前」(.x 先)。映射時 y↔row、x↔col,務必對齊,否則索引全錯。

Thread 座標導出資料索引 (Deriving Row/Col from Thread Coordinates)

每個 thread 用內建變數算出自己負責的資料座標(本書 §3.1 sibling 03-Multidimensional-Grids-And-Data/01-Multidimensional-Grid-Organization 詳述這些內建變數):

row = blockIdx.y * blockDim.y + threadIdx.y   // 垂直 (vertical)
col = blockIdx.x * blockDim.x + threadIdx.x   // 水平 (horizontal)

範例:62×76 圖片,採 16×16 block。block (1,0) 的 thread (0,0):

row = 1*16 + 0 = 16 ,  col = 0*16 + 0 = 0   →  P_{16,0}

ASCII:thread grid 覆蓋 62×76 圖片(需 4×5 = 20 個 16×16 block,共 64×80 thread):

                 col →  0          16         ...        76  (有效寬度)
              ┌─────────┬─────────┬─────────┬─────────┬───┐
   row 0 ─────│ block   │ block   │ block   │ block   │blk│ ← 80 thread/列
              │ (0,0)   │ (0,1)   │ (0,2)   │ (0,3)   │(0,4)│   (多出 4)
       16 ────├─────────┼─────────┼─────────┼─────────┼───┤
              │ (1,0)   │ ...                          │   │
       ...    │                                        │   │
       62 ┄┄┄┄│┄┄┄┄┄┄┄┄┄ 有效像素邊界 (n=62) ┄┄┄┄┄┄┄┄┄┄│   │
              │ (3,0)   │         │         │         │   │ ← 多出 2 列
       64 ────└─────────┴─────────┴─────────┴─────────┴───┘
   共 64×80 = 5120 thread,但只 62×76 = 4712 個有效像素

Warning

grid 各方向 thread 數恆為 blockDim 的整數倍,所以幾乎總會 多出邊界 thread(此例 y 多 2、x 多 4)。如同 Ch.2 的 1D vecAddKernel,必須用 if 擋掉越界 thread。

主機端啟動程式碼(block 固定 16×16,grid 依圖片大小算):

dim3 dimBlock(16, 16, 1);
dim3 dimGrid(ceil(m/16.0), ceil(n/16.0), 1);  // m=width(x), n=height(y)
colorToGrayscaleConversion<<<dimGrid, dimBlock>>>(Pout_d, Pin_d, m, n);

Tip

1500×2000(3 百萬像素)圖片 → 94 (y) × 125 (x) = 11,750 個 block。kernel 內 gridDim.x=125, gridDim.y=94, blockDim.x=blockDim.y=16。

扁平記憶體空間與線性化 (Flat Memory Space & Linearization)

現代電腦的記憶體是 flat memory space:每個 byte 一個位址(0 ~ 最大),所以 所有多維陣列最終都被攤平成 1D。
靜態配置陣列:編譯器知道欄數,允許 Pin[j][i] 語法,並在底層自動換算 1D offset。
動態配置陣列:ANSI C 要求欄數在編譯期已知才能用 [j][i];動態陣列欄數在 runtime 才確定,故 CUDA C 程式設計師必須手動線性化(flatten)。

Important

這就是為何 CUDA kernel 內幾乎都看到 Pin[row*width + col] 而非 Pin[row][col]——動態配置的 device 記憶體缺編譯期維度資訊。

Row-major vs Column-major 配置 (Layout Comparison)

4×4 矩陣 M 的 row-major 攤平(M_{j,i} = M[j*Width + i]):

邏輯 2D:                 row-major 1D (整列接整列):
 M00 M01 M02 M03         [M00 M01 M02 M03 | M10 M11 M12 M13 | M20 ... ]
 M10 M11 M12 M13           idx0  1   2   3    4   5   6   7    8  ...
 M20 M21 M22 M23
 M30 M31 M32 M33         M_{2,1} → 2*4 + 1 = idx 9

項目	Row-major (列優先)	Column-major (欄優先)
連續存放	同一列的元素相鄰	同一欄的元素相鄰
`M_{j,i}` 的 1D 索引	`j*Width + i`	`i*Height + j`
使用者	C / CUDA C	FORTRAN、許多 C-BLAS 程式庫
等價關係	—	= 其轉置的 row-major
跨列/欄存取成本	同列 stride=1	同欄 stride=1

Warning

呼叫 FORTRAN-style(column-major)程式庫時,手冊常要你 先轉置 輸入。CUDA C 一律 row-major;弄錯 layout 會讀到完全錯的資料。row-major 的連續存放也是後續 memory coalescing 的基礎,見 06-Performance-Considerations/01-Memory-Coalescing。

colorToGrayscaleConversion Kernel

把彩色像素轉灰階:L = 0.21*r + 0.72*g + 0.07*b。每個 thread 處理一個像素。

#define CHANNELS 3   // RGB 每像素 3 bytes (r, g, b)

__global__ void colorToGrayscaleConversion(unsigned char* Pout,
                                           unsigned char* Pin,
                                           int width, int height) {
    int col = blockIdx.x*blockDim.x + threadIdx.x;   // line 05-06
    int row = blockIdx.y*blockDim.y + threadIdx.y;
    if (col < width && row < height) {               // line 07 邊界檢查
        int grayOffset = row*width + col;            // line 10 輸出 1D 索引
        int rgbOffset  = grayOffset*CHANNELS;        // line 13 輸入起始 byte
        unsigned char r = Pin[rgbOffset    ];        // line 14
        unsigned char g = Pin[rgbOffset + 1];        // line 15
        unsigned char b = Pin[rgbOffset + 2];        // line 16
        Pout[grayOffset] = 0.21f*r + 0.72f*g + 0.07f*b;  // line 19
    }
}

輸出 Pout 每像素 1 byte → 索引 = grayOffset = row*width + col。
輸入 Pin 每像素 3 bytes → 起始 = rgbOffset = grayOffset*3,連讀 3 個 byte。
驗證(62×76 圖,block (1,0) 的 thread (0,0) → P_{16,0}):
- grayOffset = 16*76 + 0 = 1216 → Pout[1216]
- rgbOffset = 1216*3 = 3648 → Pin[3648..3650]

四種 block 邊界行為 (62×76 圖, 16×16 block)

Area	位置 / block 數	越界情形	實際處理像素 / 256
1	內部 12 blocks	全部有效	16×16 = 256
2	右上 3 blocks	col > 76,每列多 4 thread	12×16 = 192
3	左下 4 blocks	row > 62,每欄多 2 thread	16×14 = 224
4	右下 1 block	col 與 row 同時越界	14×12 = 168

Tip

同樣的 row/col 計算 + if 邊界檢查模式會原封不動地出現在 sibling 03-Multidimensional-Grids-And-Data/03-Image-Blur-Kernel 與 03-Multidimensional-Grids-And-Data/04-Matrix-Multiplication-Kernel。

延伸至 3D 陣列 (Extension to 3D Arrays)

多一個 z 維度,各 plane 一個接一個放入位址空間:

int plane = blockIdx.z*blockDim.z + threadIdx.z;
int row   = blockIdx.y*blockDim.y + threadIdx.y;
int col   = blockIdx.x*blockDim.x + threadIdx.x;
// row-major 3D 線性化 (m=width 欄數, n=height 列數):
// P[plane*(m*n) + row*m + col]

plane 0          plane 1          plane 2
┌──────┐         ┌──────┐         ┌──────┐
│ n×m  │  接著   │ n×m  │  接著   │ n×m  │  → 攤平成單一 1D
└──────┘         └──────┘         └──────┘
 offset 0         offset (m*n)     offset 2*(m*n)

kernel 須檢查三個 global index 都在範圍內:plane, row, col。
3D 陣列在 CUDA kernel 中的進一步應用見 Ch.8 Stencil:08-Stencil/01-Stencil-Background-and-Basic-Kernel。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
給 thread 座標求像素	`row = blockIdx.yblockDim.y+threadIdx.y`;`col = blockIdx.xblockDim.x+threadIdx.x`(y→row, x→col)
`M[j][i]` 的 row-major 1D 索引	`j*Width + i`(Width = 每列元素數)
400(width)×500(height) 矩陣,row 20 col 10,row-major	`20*400 + 10 = 8010`
同上,column-major	`10*500 + 20 = 5020`
3D tensor W=400,H=500,D=300,x=10,y=20,z=5,row-major	`5(400500) + 20*400 + 10 = 1,008,010`
為何手動 flatten 動態陣列?	動態配置欄數於 runtime 才知,ANSI C 需編譯期欄數才能用 `[ ][ ]`
RGB 像素 → Pin 偏移	`rgbOffset = (rowwidth+col)CHANNELS`(CHANNELS=3)
為何需要 `if (col<width && row<height)`?	grid thread 數恆為 blockDim 整數倍 ≥ 像素數,擋越界 thread
62×76 圖 + 16×16 block 共幾個 block / thread?	4×5=20 blocks;64×80=5120 threads
CUDA C 用哪種 layout?	row-major(非 FORTRAN 的 column-major)
column-major 等價於什麼?	該陣列轉置的 row-major

重點總覽 (Overview)

Thread 座標導出資料索引 (Deriving Row/Col from Thread Coordinates)

扁平記憶體空間與線性化 (Flat Memory Space & Linearization)

Row-major vs Column-major 配置 (Layout Comparison)

colorToGrayscaleConversion Kernel

四種 block 邊界行為 (62×76 圖, 16×16 block)

延伸至 3D 陣列 (Extension to 3D Arrays)

考試/面試重點 (Exam / Test Patterns)

Related Notes