MPI Point-to-Point Communication and Data Distribution (MPI_Send / MPI_Recv)

#advanced-practice #mpi-cluster #point-to-point-communication #halo-exchange #halo-cells

Overview

This section covers MPI's two most basic point-to-point functions, MPI_Send and MPI_Recv. Using the data server → compute process distribution pattern of a 3D 25-point stencil, it shows how each domain partition is sent together with its halo slices, and how edge processes differ from internal processes.

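Item	Key point
Communication type	Point-to-point: one source (`MPI_Send`) to one destination (`MPI_Recv`)
`MPI_Send` parameter count	6 (buf, count, type, dest, tag, comm)
`MPI_Recv` parameter count	7 (adds `MPI_Status*` as the last parameter)
Roles (SPMD)	rank `np-1` is the data server; ranks `0…np-2` are compute processes
Partition direction	Split along z; one z slice = `dimx*dimy` elements, contiguous in memory
Halo requirement	25-point stencil uses 4 neighbors per direction → 4 halo slices per side
Internal process	Neighbors on both sides → receives `dimx*dimy*(S+8)` points (S = z slices per partition)
Edge process	Neighbor on one side only → receives `dimx*dimy*(S+4)` points; the other side uses ghost cells (= 0), which are not sent
Receive flexibility	The `MPI_Recv` count may exceed what actually arrives; only the received bytes are written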

Important

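MPI assumes a distributed memory model: processes share no variables, and all exchange happens by sending and receiving messages. MPI_Send / MPI_Recv address processes by logical rank, much like a phone number, so the program does not need to know about the underlying interconnect.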

MPI_Send / MPI_Recv Syntax (Point-to-Point)

// 6 parameters: called by the source process (buf is const-qualified since MPI-3.0)
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

// 7 parameters: called by the destination process (adds status)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);

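Parameter	`MPI_Send`	`MPI_Recv`
1 `buf`	Start address of the data to send	Address where received data is stored
2 `count`	Number of elements to send	Maximum number of elements that can be received (upper bound)
3 `datatype`	`MPI_Datatype`, see below	Same
4	`dest` = rank of the destination process	`source` = rank of the source process
5 `tag`	Integer label used to classify messages	Expected tag (`MPI_ANY_TAG` = any tag)
6 `comm`	Communicator (`MPI_COMM_WORLD`)	Same
7	n/a	`MPI_Status*`, reports the receive status (actual source, tag, error code)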

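Common MPI_Datatype values (defined in mpi.h): MPI_DOUBLE, MPI_FLOAT, MPI_INT, MPI_CHAR. Each maps to a C type (double, float, int, char), and its actual size is the size of that C type on the host.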

Tip

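Bytes transferred = count × sizeof(datatype). Conversely, given the total bytes and the count, you can recover the element size (e.g. 4000 bytes / 1000 elements = 4 bytes per element, consistent with MPI_FLOAT).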
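The byte arithmetic in this tip can be sketched in plain C. The helper names `message_bytes` and `element_size` are hypothetical (they are not MPI calls); the sketch only mirrors the count × size relation and its inverse.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes moved by one message of `count` elements,
   each `elem_size` bytes wide. */
size_t message_bytes(size_t count, size_t elem_size) {
    return count * elem_size;
}

/* Hypothetical helper: element size implied by a total byte count and an
   element count. Requires an exact division. */
size_t element_size(size_t total_bytes, size_t count) {
    assert(count > 0 && total_bytes % count == 0);
    return total_bytes / count;
}
```

For instance, `element_size(4000, 1000)` yields 4, which matches a 4-byte type such as `float` on typical hosts.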

Warning

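The MPI_Recv count is an upper bound and need not equal the number of elements the sender actually sent. MPI writes only the bytes actually received into buf and leaves the rest of the buffer untouched; the actual element count can be queried from the status with MPI_Get_count. The reverse case is not allowed: a message longer than count is a truncation error (MPI_ERR_TRUNCATE). This upper-bound behavior is what lets an edge process declare a full-size buffer.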

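Data Distribution: Data Server Side

The data server is the process that handles I/O in the SPMD program (in this simplified example it initializes the grid with random numbers and then distributes it). It splits the whole grid along z into domain partitions and sends each one, together with its halo slices, to a compute process with MPI_Send.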

#include <mpi.h>
#include <stdlib.h>

void data_server(int dimx, int dimy, int dimz, int nreps) {
  int np;
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  unsigned int num_comp_nodes = np - 1, first_node = 0, last_node = np - 2;
  unsigned int num_points = dimx * dimy * dimz;
  float *input = (float*)malloc(num_points * sizeof(float)); /* output buffer omitted */
  random_data(input, dimx, dimy, dimz, 1, 10);  // helper defined elsewhere

  // 4 halo slices per side → an edge process gets 4 extra slices, an internal one gets 8
  int edge_num_points = dimx * dimy * ((dimz / num_comp_nodes) + 4);
  int int_num_points  = dimx * dimy * ((dimz / num_comp_nodes) + 8);
  float *send_address = input;

  // (1) First (edge) process: needs only the right halo
  MPI_Send(send_address, edge_num_points, MPI_FLOAT, first_node, 0, MPI_COMM_WORLD);
  send_address += dimx * dimy * ((dimz / num_comp_nodes) - 4);   // advance one partition, minus 4 slices for the left halo

  // (2) Internal processes: halos on both sides
  for (int process = 1; process < last_node; process++) {
    MPI_Send(send_address, int_num_points, MPI_FLOAT, process, 0, MPI_COMM_WORLD);
    send_address += dimx * dimy * (dimz / num_comp_nodes);       // net stride = one partition
  }

  // (3) Last (edge) process: needs only the left halo
  MPI_Send(send_address, edge_num_points, MPI_FLOAT, last_node, 0, MPI_COMM_WORLD);
}

send_address offset logic (let S = dimz / num_comp_nodes; one slice = dimx*dimy elements; the offsets below are counted in slices):

input array (contiguous along z, P = 4 compute nodes):
 slice→  0        S          2S         3S        4S(=dimz)
         |---D1---|----D2----|----D3----|----D4----|

P0 (edge) sends (S+4) slices:      [====D1====|hh hh]            start = input
                                   └ partition┘└ right halo (first 4 slices of D2)

P1 (internal) sends (S+8) slices:  [hh hh|====D2====|hh hh]      start = input + (S-4)
                                   └ halo┘└partition┘└halo┘

P2 (internal) sends (S+8) slices:  [hh hh|====D3====|hh hh]      start = input + (2S-4)

P3 (edge) sends (S+4) slices:      [hh hh|====D4====]            start = input + (3S-4)
                                   └ halo┘└partition┘  (left halo = last 4 slices of D3)

Important

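Each internal process's start address is backed up 4 slices to include its left halo, but the pointer then advances by exactly one full partition (S slices) per iteration. The net distance between consecutive start addresses is therefore one partition, and the 4-slice backup carries over unchanged to every later process.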
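The send_address walk can also be written as a closed form per rank, which makes it easy to check that no message reads past the end of the input array. This is a minimal sketch in plain C (no MPI calls); `SendPlan`, `send_plan`, and the `HALO` constant are hypothetical names, and it assumes dimz divides evenly by the number of compute processes P.

```c
#include <assert.h>
#include <stddef.h>

#define HALO 4  /* halo slices per side for the 25-point stencil */

typedef struct {
    size_t offset;  /* start of the message, in elements from `input` */
    size_t count;   /* number of elements in the message */
} SendPlan;

/* Hypothetical helper: message layout for compute rank `rank` in [0, P),
   with the grid split along z into P equal partitions. */
SendPlan send_plan(size_t dimx, size_t dimy, size_t dimz, size_t P, size_t rank) {
    assert(P >= 2 && rank < P && dimz % P == 0 && dimz / P >= HALO);
    size_t slice = dimx * dimy;          /* elements in one z slice */
    size_t S = dimz / P;                 /* z slices per partition */
    int first = (rank == 0), last = (rank == P - 1);
    SendPlan p;
    /* Every rank but the first starts 4 slices early to include its left halo. */
    p.offset = first ? 0 : slice * (rank * S - HALO);
    /* Edge ranks carry one halo, internal ranks carry two. */
    p.count  = slice * (S + ((first || last) ? HALO : 2 * HALO));
    return p;
}
```

With P = 4 this reproduces the start offsets 0, S-4, 2S-4, 3S-4 (in slices) from the diagram, and the last rank's message ends exactly at slice 4S = dimz.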

Edge vs Internal Process and Halo / Ghost Cells

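	Edge process	Internal process
Example	process 0 (computes D1), last compute process (computes D4)	processes 1, 2 (compute D2, D3)
Neighbors	One side only	Both sides
Halo needed	4 slices on one side	4 slices on each side
Points received	`dimx*dimy*(S+4)`	`dimx*dimy*(S+8)`
Side without a neighbor	Uses ghost cells = 0, nothing transferred	n/a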

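Halo cells: slices of data from a neighboring partition that are needed to compute the boundary of a partition (one partition's halo is its neighbor's boundary).
Ghost cells: positions at the outer edge of the domain that have no real neighbor. As with convolution boundaries, they are simply filled with 0 and need not be sent over the network.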
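The role split described so far (highest rank serves data, ranks 0 and np-2 are edge processes, the rest are internal) can be sketched as a small classifier. `Role`, `role_of`, and `halo_slices` are hypothetical names introduced for illustration, and the sketch assumes at least two compute processes (np ≥ 3).

```c
#include <assert.h>

typedef enum { ROLE_SERVER, ROLE_EDGE, ROLE_INTERNAL } Role;

/* Hypothetical helper: role of `rank` among `np` processes.
   Rank np-1 is the data server; ranks 0 and np-2 are edge compute
   processes; every rank in between is an internal compute process. */
Role role_of(int rank, int np) {
    assert(np >= 3 && rank >= 0 && rank < np);
    if (rank == np - 1) return ROLE_SERVER;
    if (rank == 0 || rank == np - 2) return ROLE_EDGE;
    return ROLE_INTERNAL;
}

/* Hypothetical helper: halo slices a process receives, at 4 slices per
   side that has a real neighbor. */
int halo_slices(Role r) {
    switch (r) {
        case ROLE_EDGE:     return 4;
        case ROLE_INTERNAL: return 8;
        default:            return 0;  /* the data server receives no halo */
    }
}
```

With np = 5, ranks 0 and 3 come out as edge processes, ranks 1 and 2 as internal, and rank 4 as the data server.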

Warning

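A 25-point stencil takes 4 neighbors in each direction, so each side needs exactly 4 halo slices. With k neighbors per direction, the halo depth becomes k slices. The +4 / +8 terms in the formulas both follow from "4 slices per side".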
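The dependence on k can be made explicit with a short sketch. It assumes an axis-aligned 3D stencil (k neighbors along each of ±x, ±y, ±z plus the center point), which gives 25 points for k = 4; the function names `stencil_points_3d` and `recv_slices` are hypothetical.

```c
#include <assert.h>

/* Points in an axis-aligned 3D stencil with k neighbors per direction:
   6 directions (+-x, +-y, +-z) times k neighbors, plus the center. */
int stencil_points_3d(int k) {
    assert(k >= 0);
    return 6 * k + 1;
}

/* z slices a compute process receives: its S-slice partition plus k halo
   slices for each side with a neighbor (1 side if edge, 2 if internal). */
int recv_slices(int S, int k, int is_edge) {
    assert(k >= 0 && S >= k);
    return S + (is_edge ? 1 : 2) * k;
}
```

For k = 4 and S = 128 this reproduces the S+4 = 132 and S+8 = 136 slice counts; a stencil with k = 2 would need only 130 and 132.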

Receiving Side: Compute Process Buffer Alignment

#include <mpi.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Note: dimz here is this process's partition depth (S), not the full grid depth
void compute_node_stencil(int dimx, int dimy, int dimz, int nreps) {
  int np, pid;
  MPI_Status status;                                 // filled in by MPI_Recv
  MPI_Comm_rank(MPI_COMM_WORLD, &pid);
  MPI_Comm_size(MPI_COMM_WORLD, &np);
  int server_process = np - 1;                       // data server = highest rank

  unsigned int num_points     = dimx * dimy * (dimz + 8);   // always full size (both halos included)
  unsigned int num_bytes      = num_points * sizeof(float);
  unsigned int num_halo_points= 4 * dimx * dimy;            // 4 slices = one side's halo

  // calloc zero-fills the buffer so the unused halo slots really hold ghost cells = 0
  float *h_input = (float*)calloc(num_points, sizeof(float));
  float *d_input = NULL;  cudaMalloc((void**)&d_input, num_bytes);

  // Key step: process 0 has no left neighbor, so it skips the first 4 slices (the left halo slots)
  float *rcv_address = h_input + ((0 == pid) ? num_halo_points : 0);
  MPI_Recv(rcv_address, num_points, MPI_FLOAT, server_process,
           MPI_ANY_TAG, MPI_COMM_WORLD, &status);
  cudaMemcpy(d_input, h_input, num_bytes, cudaMemcpyHostToDevice);
}

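The host buffers of all compute processes share the same layout, [left halo 4 | partition S | right halo 4], which simplifies the subsequent kernel. The only difference is which end of the buffer is unused on the edge processes:

buffer layout (S+8 slices):  [ left halo | partition (S) | right halo ]
                               4 slices     S slices        4 slices

process 0 (edge, no left neighbor): MPI_Recv receives (S+4) slices, stored at +num_halo_points
                        [ skip 4 | partition | right halo ]   ← first 4 slices unused (ghost = 0)

internal process:        MPI_Recv receives (S+8) slices, stored from the start
                        [ left halo | partition | right halo ]  ← all valid

process np-2 (edge, no right neighbor): MPI_Recv receives (S+4) slices, stored from the start
                        [ left halo | partition | skip 4 ]   ← last 4 slices unused (ghost = 0)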

Tip

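An edge process still allocates the full size (S+8 slices) purely for simplicity; the extra halo space on one side goes unused. Combined with the fact that the MPI_Recv count may exceed what is actually received, the same code serves both edge and internal processes.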

Important

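The offset ((0 == pid) ? num_halo_points : 0) makes process 0 store its message, which carries no left halo, starting 4 slices into the buffer. Its partition then lines up with the common layout and the first 4 slices remain as the ghost region. Process np-2 stores from the start of the buffer, so its unused slices fall at the right end.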
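Which slices of the common (S+8)-slice buffer actually hold received data follows directly from this offset rule. The sketch below models it in plain C; `ValidSlices`, `valid_slices`, and `HALO` are hypothetical names, and P denotes the number of compute processes (np - 1).

```c
#include <assert.h>
#include <stddef.h>

#define HALO 4  /* halo slices per side */

typedef struct {
    size_t first_valid;  /* index of the first slice holding received data */
    size_t num_valid;    /* number of consecutive slices holding received data */
} ValidSlices;

/* Hypothetical helper: valid slice range inside the (S + 2*HALO)-slice
   buffer of compute rank `pid` out of P compute processes. */
ValidSlices valid_slices(size_t pid, size_t P, size_t S) {
    assert(P >= 2 && pid < P);
    int first = (pid == 0), last = (pid == P - 1);
    ValidSlices v;
    /* Rank 0 skips the left halo slots; everyone else stores from slice 0. */
    v.first_valid = first ? HALO : 0;
    /* Edge ranks receive one halo, internal ranks receive two. */
    v.num_valid = S + ((first || last) ? HALO : 2 * HALO);
    return v;
}
```

For S = 8 and P = 4, rank 0 fills slices 4 to 15, internal ranks fill 0 to 15, and rank 3 fills 0 to 11, leaving slices 12 to 15 as the unused right end.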

Key Formulas

Let P = num_comp_nodes = np - 1, S = dimz / P (z slices per partition), and one slice = dimx*dimy elements.

Quantity	Formula
Number of compute processes	`P = np - 1`
Points per partition	`dimx * dimy * S`
Halo points on one side	`num_halo_points = 4 * dimx * dimy`
Points received by an edge process	`dimx * dimy * (S + 4)`
Points received by an internal process	`dimx * dimy * (S + 8)`
First send_address advance	`dimx * dimy * (S - 4)`
send_address advance inside the loop	`dimx * dimy * S` (net stride = one partition)
Bytes transferred	`count * sizeof(datatype)`
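As a worked instance of these formulas, the block below plugs in illustrative values. The sizes dimx = dimy = 64, dimz = 2048, and P = 16 are assumptions chosen for the arithmetic, and the byte figures assume 4-byte floats.

```latex
\begin{aligned}
S &= \frac{\texttt{dimz}}{P} = \frac{2048}{16} = 128 \\
\text{points per partition} &= 64 \cdot 64 \cdot 128 = 524288 \\
\text{halo points on one side} &= 4 \cdot 64 \cdot 64 = 16384 \\
\text{edge process receives} &= 64 \cdot 64 \cdot (128 + 4) = 540672 \quad (2162688 \text{ bytes}) \\
\text{internal process receives} &= 64 \cdot 64 \cdot (128 + 8) = 557056 \quad (2228224 \text{ bytes}) \\
\text{first send\_address advance} &= 64 \cdot 64 \cdot (128 - 4) = 507904
\end{aligned}
```

The internal and edge counts differ by exactly one side's halo, 557056 - 540672 = 16384 points.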

Exam / Interview Key Points

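Scenario / keyword	Answer / technique
How many parameters does `MPI_Send` take	6 (buf, count, type, dest, tag, comm)
What does `MPI_Recv` have that `MPI_Send` lacks	A 7th parameter, `MPI_Status*`; also, the 4th parameter is `source` rather than `dest`
`MPI_Send(ptr,1000,MPI_FLOAT,…)` sends 4000 bytes; how many bytes per element	`4000/1000 = 4` bytes (`MPI_FLOAT`)
Are `MPI_Send` / `MPI_Recv` blocking	Both are blocking. `MPI_Send` returns once the send buffer can safely be reused (the message has not necessarily been received yet); `MPI_Recv` returns only after the message has been stored in buf
Receive with any tag	Use `MPI_ANY_TAG`
What if the recv count is larger than what was sent	Legal; only the bytes actually received are written, the rest of the buffer is untouched
How many halo slices per side for a 25-point stencil	4 slices (4 neighbors per direction); internal `+8`, edge `+4`
Edge vs internal difference	Edge has a neighbor on one side only, the missing side uses ghost = 0 and is not sent; internal receives halos on both sides
Who is the data server	rank `np-1` (highest rank); compute = `0…np-2`
Why does process 0 offset its receive by 4 slices	It has no left neighbor, so the message must start 4 slices into the buffer, leaving the first 4 slices as the ghost region
Why does an edge process still allocate a full-size buffer	Simplicity: the same code serves both roles, and the extra halo space is unused
Relation between halo slice and boundary slice	One partition's halo is its neighbor's boundary (each holds the other's data)
Exercise: dimz=2048, 16 compute processes; how many points does each process compute	`S = 2048/16 = 128` slices → `64*64*128` = 524288 output points (taking dimx = dimy = 64, as the `64*64` factor implies)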
