記憶體合併存取 (Memory Coalescing) 與 DRAM Bursts

#memory-and-performance #memory-coalescing #corner-turning #dram-bursts #row-major-layout

重點總覽 (Overview)

項目	重點	一句話
DRAM cell	電容 + sense amplifier，charge sharing 偵測	存取需「數十 ns」,遠慢於 sub-ns 時脈 → 慢
DRAM burst	一次存取會帶出一整段連續位置	多個 sensor 平行讀,連續位置高速送出
Coalescing	warp 內 32 threads 同時 load 連續位置 → 硬體合併成 1 個請求	把 burst 用滿,減少 global memory traffic
Row-major	C/CUDA 多維陣列以列為主線性化	`M[row*Width + col]`,同列相鄰、上下相隔 `Width`
Coalesced 條件	`index = k*Width + col`,`col` 含 `threadIdx.x`	連續 thread → 連續位址 → 1 burst
Uncoalesced	`index = col*Width + k`(column-major)	連續 thread → 相隔 `Width` → 多次存取
Corner turning	以 coalesced 方式搬入 shared memory,再做不利存取	修正 column-major 輸入,SRAM 不需 coalescing

一句話定義

Memory coalescing:當一個 warp 的所有 threads 在同一個 load 指令存取「連續的 global memory 位址」時,硬體會把它們合併 (coalesce) 成一個對連續 DRAM 位置的請求,讓 DRAM 以一個 burst 交付資料。

為什麼 DRAM 很慢 (Why DRAM Is Slow) 與 DRAM Bursts

DRAM cell = 微小電容,以是否帶電荷區分 0/1。讀取時電容要把電荷分享給高電容的 bit line,再由 sense amplifier 判斷 → 類比:走廊另一端用「咖啡香味」判斷咖啡口味。
此過程約 數十奈秒 (tens of ns),相對於 sub-nanosecond 的時脈極慢;且電容為了塞更多 bit 而越做越小 → latency 長年沒下降。
解法:用平行性提升 access throughput。
- DRAM burst:每次存取一個位置,會把「包含該位置的一段連續位置」一起帶出。許多 sensor 平行運作,各偵測一個 bit,偵測完整段連續資料高速送往處理器。
- 若程式聚焦使用 burst 內的資料,DRAM 供給速率遠高於隨機存取。

DRAM access cost:  [====== long row access latency (tens of ns) ======][ burst transfer ]
                    很慢 (electrical charge sharing)                     很快 (連續資料)
聚焦使用一個 burst 內的所有 byte  →  攤平 latency,逼近 peak 頻寬

CPU 也適用

CPU cache line 通常對應一或多個 DRAM burst。用滿每條 cache line 的位元組,效能遠高於隨機存取。本章技巧可移植到 CPU 程式。

合併存取硬體與 Row-Major 線性化 (Coalescing Hardware & Row-Major Layout)

利用 warp 內 threads 同時執行同一指令 (SIMD) 的事實:當整個 warp 執行一個 load,硬體偵測它們是否存取連續的 global memory 位置。
最佳模式:thread 0 → X、thread 1 → X+1、thread 2 → X+2 …,硬體合併成「對連續位置的單一請求」,DRAM 以 burst 交付。
要判斷能否 coalesce,先回顧 C/CUDA 多維陣列的 row-major 線性化:同一列的相鄰元素放在連續位址。

2D matrix M (4x4)                 Row-major 線性記憶體
 M00 M01 M02 M03      →    addr: 0   1   2   3   4   5   6   7   ...
 M10 M11 M12 M13                 M00 M01 M02 M03 M10 M11 M12 M13 ...
 M20 M21 M22 M23                 ▲                ▲
 M30 M31 M32 M33                 M00 與 M10「相隔 Width(=4) 個位置」,非相鄰

對齊要求 (alignment)

某些 CUDA device 要求起始位址 X 對齊到 16-word(64-byte)邊界(低 6 bit 為 0)。新裝置因有 L2 cache 已放寬此限制。此外,現代裝置的 cache 會自動 coalesce 部分模式,但手動安排存取模式在可見的未來仍顯著影響效能。

Coalesced vs Uncoalesced 存取模式

以矩陣乘法中對「第二個輸入矩陣 M」的存取為例:col = blockIdx.x*blockDim.x + threadIdx.x(連續 thread → 連續 col),迴圈以 k 走過列。

// === COALESCED:M 為 row-major ===
int col = blockIdx.x * blockDim.x + threadIdx.x;
for (int k = 0; k < Width; ++k) {
    // index = k*Width + col,k 與 Width 對 warp 內所有 thread 相同
    // 連續 thread 的 col 連續 → 存取「相鄰」位址 → 可合併
    value += M[k * Width + col] * /* ... */;
}

// === UNCOALESCED:M 為 column-major(例:row-major 矩陣的 transpose)===
int col = blockIdx.x * blockDim.x + threadIdx.x;
for (int k = 0; k < Width; ++k) {
    // index = col*Width + k,col 被乘上 Width
    // 連續 thread 的位址「相隔 Width」(可能上百/上千) → 無法合併
    value += M[col * Width + k] * /* ... */;
}

Coalesced (k*Width+col)          Uncoalesced (col*Width+k)
iter k=0: T0 T1 T2 T3            iter k=0: T0    T1    T2    T3
          |  |  |  |                       |     |     |     |
addr:    [0][1][2][3]  row0      addr:    [0] [W] [2W] [3W]
         相鄰 → 1 burst                   相隔 Width → N 次獨立存取
iter k=1: T0 T1 T2 T3            (實際矩陣 Width=數百~數千,
addr:    [4][5][6][7]  row1       neighbour thread 位址相隔上千元素)

比較	Coalesced	Uncoalesced
索引型式	`k*Width + col`	`col*Width + k`
連續 thread 位址	相鄰 (stride 1)	相隔 `Width`
對 DRAM 的請求	合併成 1 個 burst	warp 內多個分散請求
資料佈局	row-major 第二輸入	column-major / transpose
global memory traffic	低	高(浪費 burst 中其它 byte)

「naturally coalesced」的前提

目前(第一篇)看到的所有 kernel 都是自然 coalesced;但第二、三篇會出現許多不規則存取,需額外努力才能達成 coalescing(reduction 改 thread→data 對應、sparse matrix 改資料佈局 ELL/JDS 等)。

Corner Turning:修正 Column-Major 輸入

當計算「不自然」適合 coalescing 時,三種策略:

重排 thread→data 的對應 (rearrange mapping)。
重排資料佈局 (rearrange data layout)。
以 coalesced 方式在 global↔shared 之間搬資料,把不利的存取留在 shared memory(SRAM,不需 coalescing) 執行 → 本章範例即 corner turning。

情境:A(row-major)× B(column-major)= C(row-major),tiled 載入 input tile。

A(row-major):四個 thread 載入 tile 上緣(同一列),連續 thread → 連續元素 → coalesced(同 Ch.5)。
B(column-major):若沿用 Ch.5 的列載入 (Fig.6.4A),連續 thread 載入 tile 上緣 → 元素在記憶體中相距甚遠 → uncoalesced。
修正 (Fig.6.4B):改派連續 thread 載入 tile 左緣(同一行/column) 的連續元素。直覺上就是在計算 B tile 的線性索引時交換 threadIdx.x 與 threadIdx.y 的角色。因為 B 是 column-major,同一行的相鄰元素在記憶體中相鄰 → coalesced。

B 為 column-major(同一 column 在記憶體相鄰)

(A) 沿用 Ch5 列載入             (B) Corner turning:改載入 column
   tile:                          tile:
   T0 T1 T2 T3   ← 載 tile 上緣      T0 ┐
   ──────────                       T1 │ ← 載 tile 左緣(同 column)
   位址相隔 Width                    T2 │   位址相鄰
   → UNCOALESCED                    T3 ┘  → COALESCED
                                    存入 shared mem 後,row-wise 存取在 SRAM 進行(免費)

// 概念:載入 B 的 tile 時交換 tx/ty 角色,使連續 thread 走同一 column(相鄰位址)
// B 為 column-major:B[row + col*Width]
// (A) 原始(uncoalesced):Bds[ty][tx] = B[(ph*TILE_WIDTH + ty) + Col*Width];
// (B) corner turning(coalesced):tx 對應到「同一 column 內連續的 row」
Bds[ty][tx] = B[(ph*TILE_WIDTH + tx) + (bx*TILE_WIDTH + ty)*Width];
// 存入 shared memory 後,可用 column-major 或 row-major 排,皆無明顯 penalty(SRAM)

為什麼能「免費」做不利存取

Shared memory 以 SRAM 實作,不需要 coalescing。只要進出 global memory 那一步是 coalesced 的,把不規則 access 放到 shared memory 做幾乎沒有效能代價。

Coalescing 的好處與 Carpooling 類比

主要好處:把多個記憶體存取合併成單一存取,降低 global memory traffic。合併條件:同時發生 + 存取相鄰位址。
Carpooling 類比:資料 = 通勤者,DRAM 請求 = 車輛。請求速率 > 頻寬 → traffic congestion,算術單元閒置。多個 thread 存取同一 DRAM 位置可「共乘」合併成一個請求,但需要相似的執行排程。
warp 內 threads 是完美共乘者:因 SIMD,它們同時執行同一個 load 指令,排程天然一致。

類比元素	對應
通勤者 (commuters)	要被讀取的資料
車輛 (vehicles)	DRAM access requests
共乘 (carpool)	多 thread 合併存取 = coalescing
共同作息時間表	相似執行排程 → 同一 warp 的 SIMD

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
「為什麼 DRAM 慢?」	電容小、charge sharing 經 sense amplifier 偵測需數十 ns;電容越做越小所以 latency 不降
「什麼是 DRAM burst?」	一次存取帶出一整段連續位置,多 sensor 平行;聚焦使用 burst 即可逼近 peak 頻寬
判斷某 access 是否 coalesced	看 warp 內連續 threadIdx.x 是否對應連續位址;索引中 `threadIdx.x` 的係數=stride,係數 1 → coalesced
`M[kWidth+col]` vs `M[colWidth+k]`	前者 coalesced(`col` 含 `threadIdx.x`,stride 1);後者 uncoalesced(stride `Width`)
column-major 輸入造成 uncoalesced 怎麼辦	Corner turning:交換 `threadIdx.x/.y` 角色,以 coalesced 載入 shared memory,不利存取在 SRAM 做
為何 shared memory 不需 coalescing	SRAM 實作,隨機存取無 burst 限制(但注意之後章節的 bank conflict)
三種 coalescing 策略	(1) 重排 thread→data 對應 (2) 重排資料佈局 (3) 經 shared memory 中轉(corner turning)
同一 warp 為何適合「共乘」	SIMD,所有 thread 同時執行同一 load,排程一致,可被硬體合併
「正方形 tiled MM 哪些 BLOCK_SIZE 完全避免 uncoalesced?」(習題 2)	BLOCK_SIZE 為 warp size(32)的因數/整除關係,使每個 warp 落在同一列的連續區段

一行口訣

stride = 索引中 threadIdx.x 的係數:係數為 1 → coalesced;係數為 Width → uncoalesced。修不了就「coalesced 進 shared memory,SRAM 裡亂存取」。