Constant Memory 與快取階層 (Caching)

#parallel-pattern #constant-memory #cache-hierarchy #arithmetic-intensity #convolution

重點總覽 (Overview)

本節把 07-Convolution/01-Convolution-Fundamentals-and-Basic-Kernel 中卡在 0.25 OP/B 的基本 kernel，透過把 filter F 放進 constant memory 並倚靠 GPU 的 constant cache，將算術強度提升到 0.5 OP/B（翻倍）。

項目	內容	關鍵數字 / API
為何 F 適合 constant memory	(1) 體積小 (2) kernel 執行中不變 (3) 所有 thread 以相同順序存取	radius 通常 ≤ 7；3D filter ≤ 7³=343
宣告	`__constant__` 全域變數，置於所有函式之外	constant memory 容量 ≈ 64 KB
Host 拷貝	`cudaMemcpyToSymbol(dest, src, size)`，告知 runtime 資料不會被改	不用 `cudaMemcpy`
Kernel 存取	當作全域變數直接用，不必當參數傳入	適用 C 全域變數 scoping 規則
快取階層	regs → L1 (16–64 KB) → L2 (數百 KB–數 MB) → (L3 可達數百 MB) → DRAM	Cache 對程式「透明」
Constant cache	唯讀、無寫入邏輯 → 省晶片面積/功耗；warp 同址存取可廣播	F 視為「永遠命中」，不耗 DRAM 頻寬
結果	F 的 DRAM 存取被消除，每次 inner loop 只剩 N 的 4 bytes	0.25 → 0.5 OP/B

Important

Constant memory 與 shared memory 都能省 DRAM 頻寬，但機制相反：constant cache 由硬體自動快取（透明）；shared memory 由程式設計師手動搬移（scratchpad）。本節用前者處理 F，下一節 07-Convolution/03-Tiled-Convolution-and-Halo-Handling 用後者處理 N。

為什麼 Filter F 適合 Constant Memory (Why F Fits Constant Memory)

書中指出 F 的存取有三項特性，使它成為 constant memory + caching 的理想對象：

體積小 (small)：多數 convolution filter 的 radius ≤ 7；即使 3D filter 通常也只有 ≤ 7³ = 343 個元素。
內容不變 (read-only)：整個 kernel 執行期間 F 的內容都不會被修改。
共用且順序一致 (uniform access)：所有 thread 都存取 F，而且以相同順序（從 F[0][0] 起逐一前進）走訪 doubly-nested loop。

Tip

第三點是 constant cache 能發揮的關鍵：當一個 warp 內 32 個 thread 同時讀同一個位址（F 的索引與 threadIdx 無關），硬體可一次「廣播」給整個 warp，提供極高頻寬。

宣告與拷貝 (Declaration & cudaMemcpyToSymbol)

宣告 (全域變數，函式之外)

#define FILTER_RADIUS 2
// 必須放在所有函式之外 (global scope)
__constant__ float F[2*FILTER_RADIUS+1][2*FILTER_RADIUS+1];

__constant__（兩側各兩個底線）告訴 compiler 把 F 放進 device 的 constant memory。
與 global memory 變數一樣，constant memory 變數對所有 thread block 可見；差別在於 kernel 執行期間 thread 不能修改它。

Host 端拷貝

// 假設 F_h 已在 host 配置並填好 (2*FILTER_RADIUS+1)^2 個元素
cudaMemcpyToSymbol(F, F_h,
    (2*FILTER_RADIUS+1)*(2*FILTER_RADIUS+1)*sizeof(float));

通用形式：

cudaMemcpyToSymbol(dest, src, size);
//  dest : 指向 constant memory 中的目的位置
//  src  : 指向 host memory 中的來源資料
//  size : 要複製的位元組數

Warning

這是專用的拷貝函式，會告知 CUDA runtime「這份資料在 kernel 執行期間不會被改」，runtime 才能放心積極快取它 —— 不能用一般的 cudaMemcpy。（此函式還有 offset、kind 兩個少用參數，通常省略。）

Kernel 端使用 (Kernel Using Constant Memory)

對照基本 kernel，唯一差別是：F 不再透過參數指標傳入，而是當作全域變數直接存取，同時 r 由編譯期常數 FILTER_RADIUS 取代。

__global__ void convolution_2D_const_mem_kernel(float *N, float *P,
                                                int width, int height) {
    int outCol = blockIdx.x*blockDim.x + threadIdx.x;
    int outRow = blockIdx.y*blockDim.y + threadIdx.y;
    float Pvalue = 0.0f;
    for (int fRow = 0; fRow < 2*FILTER_RADIUS+1; fRow++) {
        for (int fCol = 0; fCol < 2*FILTER_RADIUS+1; fCol++) {
            int inRow = outRow - FILTER_RADIUS + fRow;
            int inCol = outCol - FILTER_RADIUS + fCol;
            if (inRow >= 0 && inRow < height &&
                inCol >= 0 && inCol < width) {
                // F 直接以全域變數存取，不再是傳入指標
                Pvalue += F[fRow][fCol] * N[inRow*width + inCol];
            }
        }
    }
    P[outRow*width + outCol] = Pvalue;
}

Warning

Scoping 規則：F 是全域變數，適用 C 語言全域變數的 scoping 規則。若 host code 與 kernel code 位於不同檔案，kernel 所在的檔案必須加上相應的 extern 宣告，才能讓 F 對 kernel 可見。

現代快取階層 (The Cache Hierarchy)

Constant memory 變數其實也住在 DRAM；但因 runtime 知道它不會被改，會指示硬體積極快取。要理解好處，需先看現代處理器的記憶體/快取階層。

                The Processor (GPU)
  ┌────────────┬────────────┬────────────┐
  │  Core/SM   │  Core/SM   │  Core/SM   │
  │  ┌──────┐  │  ┌──────┐  │  ┌──────┐  │  regs : 最快, per-thread
  │  │ regs │  │  │ regs │  │  │ regs │  │
  │  └──────┘  │  └──────┘  │  └──────┘  │
  │ ┌────────┐ │ ┌────────┐ │ ┌────────┐ │  L1   : 16–64 KB, 近核心速度
  │ │  L1 $  │ │ │  L1 $  │ │ │  L1 $  │ │        每核心(SM)私有
  │ └────────┘ │ └────────┘ │ └────────┘ │
  └────────────┴────────────┴────────────┘
  ┌──────────────────────────────────────┐  L2   : 數百 KB ~ 數 MB
  │               L2 Cache               │        ~數十 cycle, 多 SM 共享
  └──────────────────────────────────────┘        (頻寬被分攤)
       (高階處理器可再有 L3, 可達數百 MB)
  ┌──────────────────────────────────────┐  DRAM : 最大最慢
  │      Main Memory / DRAM (Global +     │        global & constant 都住這
  │             Constant Memory)         │
  └──────────────────────────────────────┘

層級	容量	延遲	共享範圍
Registers	極小	最快	單一 thread
L1 Cache	16–64 KB	近核心速度	單一核心 / SM
L2 Cache	數百 KB ~ 數 MB	數十 cycle	多核心 / 多 SM 共享
L3 Cache (高階)	可達數百 MB	更高	全晶片
DRAM	最大	長延遲、頻寬受限	全部

Important

Cache vs Scratchpad (shared memory)：兩者最大差異是「透明度」。

Cache（含 constant cache）對程式透明：程式只存取原始 global 變數，硬體自動保留最近/最常用的資料並記住其原始位址；下次存取命中時直接由 cache 服務，省去 DRAM 存取。
Shared memory / scratchpad 不透明：程式必須宣告 __shared__ 並明確地把 global 變數拷貝進去。

階層之所以分多層，是因為記憶體容量與速度之間存在 tradeoff：越靠近核心越快但越小。編號（L1/L2/L3）反映與處理器的距離。

專用的 Constant Cache (The Specialized Constant Cache)

Constant memory 變數在快取設計上扮演特別角色：

因為 constant 變數在 kernel 執行期間不會被寫入，快取它時不需支援 thread 寫入。
支援高吞吐量寫入需要複雜硬體邏輯，在晶片面積與功耗上昂貴。
省掉寫入支援後，可以設計出在面積/功耗上極高效的專用快取 —— 即現代 GPU 的 constant cache。
constant memory 很小（64 KB），所以一個小型專用快取就能高效抓住每個 kernel 重度使用的 constant 變數。

 Warp (32 threads) 同時讀 F[fRow][fCol]   (同一位址, 與 threadIdx 無關)
            │  broadcast
            ▼
   ┌─────────────────┐   唯讀, 無寫入邏輯 → 便宜 + 高廣播頻寬
   │  Constant Cache │
   └─────────────────┘
            │  (僅首次 miss; F 極小 → 視為永遠命中)
            ▼
   DRAM (Constant Memory, 64 KB)  ← kernel 期間 F 幾乎 0 DRAM 流量

Tip

當 warp 內所有 thread 存取同一個 constant 變數（如本例的 F，索引與 thread index 無關），constant cache 能提供龐大頻寬滿足所有 thread。由於 F 很小，可以假設 F 永遠從 constant cache 取得，亦即不耗任何 DRAM 頻寬。

算術強度的提升 (Arithmetic Intensity: 0.25 → 0.5 OP/B)

	基本 kernel (Fig 7.7)	Constant memory kernel (Fig 7.9)
inner loop 運算	2 ops（1 乘 + 1 加）	2 ops（1 乘 + 1 加）
inner loop DRAM 載入	8 bytes：N 的 4 B + F 的 4 B	4 bytes：只剩 N 的 4 B（F 由 constant cache 供應）
OP/B	0.25	0.5

{OP/B}_{b a s i c} = \frac{2 ops}{8 bytes} = 0.25 {OP/B}_{c o n s t} = \frac{2 ops}{4 bytes} = 0.5

把 F 的 DRAM 存取消除後，算術強度翻倍。
但 0.5 OP/B 對 memory-bound 而言仍偏低 —— 真正的大幅提升來自把 N 也搬上 on-chip（shared memory tiling），見 07-Convolution/03-Tiled-Convolution-and-Halo-Handling。
書中亦預告：對 N 陣列的存取同樣能受益於快取（L2 for halo cells，見 7.5）。

Warning

「F 不耗 DRAM 頻寬」是個簡化假設，前提是 (1) 一個 warp 內存取同址（uniform access），且 (2) F 夠小能整份留在 constant cache。若 filter 很大或存取模式發散，constant cache 的廣播優勢會降低。

考試/面試重點 (Exam / Test Patterns)

情境 / 關鍵字	答案 / 技巧
如何宣告 constant memory 變數？	`__constant__` 修飾，全域 scope（所有函式之外）
如何把資料放進 constant memory？	`cudaMemcpyToSymbol(dest, src, size)`，非 `cudaMemcpy`
constant memory 容量？	約 64 KB
kernel 怎麼存取 `F`？	當全域變數直接用，不需當參數傳入；遵守 C 全域變數 scoping（跨檔需 `extern`）
constant memory 與 global memory 差異？	都在 DRAM、都對所有 block 可見；但 constant kernel 期間唯讀 → 可積極快取
Cache 與 shared memory 差別？	Cache 透明（硬體自動）；shared memory 是 scratchpad，需手動 `__shared__` + 拷貝
為什麼能設計出高效 constant cache？	唯讀 → 不需寫入邏輯，省面積/功耗；且 64 KB 小 → 小快取就高效
何時 constant cache 廣播最有效？	warp 內所有 thread 讀同一位址（索引與 threadIdx 無關）
L1 / L2 容量與共享？	L1：16–64 KB、per-SM；L2：數百 KB–數 MB、多 SM 共享
用 constant memory 後 OP/B？怎麼算？	0.5 OP/B = 2 ops / 4 bytes（F 不再從 DRAM 載入，只剩 N 的 4 B）
為什麼 OP/B 從 0.25 變 0.5？	消除 F 的 8 bytes 中的一半（F 那 4 bytes 由 constant cache 服務）

重點總覽 (Overview)

為什麼 Filter F 適合 Constant Memory (Why F Fits Constant Memory)

宣告與拷貝 (Declaration & cudaMemcpyToSymbol)

宣告 (全域變數，函式之外)

Host 端拷貝

Kernel 端使用 (Kernel Using Constant Memory)

現代快取階層 (The Cache Hierarchy)

專用的 Constant Cache (The Specialized Constant Cache)

算術強度的提升 (Arithmetic Intensity: 0.25 → 0.5 OP/B)

考試/面試重點 (Exam / Test Patterns)

Related Notes