Boosting GPU Utilization: Exploring NVIDIA's MIG and MPS Virtualization Technologies
1. Background
GPU cards are currently in short supply while business demand keeps growing, so we either run out of whole cards to allocate or waste resources through low GPU utilization.
At the same time, many application scenarios simply do not need much compute, for example:
- AI inference: mostly online, real-time computation with low latency requirements, small batch sizes, and a modest amount of computation.
- AI development machines: GPUs shared within a team, with low compute requirements.
These scenarios are extremely common, and in them an AI application cannot come close to exhausting a GPU's compute power. As a result, GPU utilization has long been low for many users, typically only 10%-30%.
The demand for GPU partitioning (virtualization) comes from two sides: ordinary consumers and compute/service centers.
Ordinary consumers (users) want access to the features of newly released GPUs, such as certain high-performance CUDA operations that are only available on newer SM architectures; at the same time, they often cannot use all the resources of a full card (such as a V100 or A100); and "data center" class GPUs are expensive (V100 and A100 are priced in the tens of thousands of RMB). So on both the usage and the price side, consumers want small-sized but high-performance GPU resources.
(GPU prices on an e-commerce platform)
Service providers (e.g. cloud vendors) need, on the one hand, to offer users cheap and stable GPU capacity; since a whole card is costly, the rental fee cannot be very low. On the other hand, large compute centers manage tens of thousands of GPUs and want to raise cluster utilization, and smaller-grained GPU resources make allocation more fine-grained, which in turn improves overall cluster GPU utilization.
Today, for GPUs such as the V100, some vendors already let multiple users share one card to lower the cost per user. A key part of sharing a GPU is virtualization, but virtualization still has considerable room for improvement in terms of security and quality of service.
2. GPU Sharing Strategies
① MIG (Multi-Instance GPU)
With the release of the Ampere architecture, NVIDIA launched the groundbreaking A100, whose performance reached unprecedented levels. From the standpoint of squeezing out performance, it is very hard for an ordinary AI application to use all of an A100; put the other way around, a large share of the card sits idle and is wasted.
MIG (Multi-Instance GPU) was born to address exactly this.
MIG changes how GPU resources are allocated: on the A100 it can partition one GPU at the hardware level into up to 7 GPU instances, each with its own SMs and memory system. Put simply, you can run 7 different AI applications on the card concurrently and make the most of its capacity.
Because the partitioning is done in hardware, the memory spaces of the GPU instances do not interfere with one another, so every user gets predictable latency and throughput.
img
Because it is a hard partition, isolation between GPU instances is excellent, but flexibility is limited: each GPU instance can only be sized according to a fixed set of profiles:
img
The table clearly shows, for each GPU instance size, its share of streaming multiprocessors, its share of memory, and how many instances of that size can be allocated.
The ways these profiles can be combined are also quite limited, as shown below:
img
② MPS (Multi-Process Service)
MPS, the Multi-Process Service, ships with the CUDA toolkit. It is a drop-in, binary-compatible implementation of the CUDA APIs and consists of three components:
- A control daemon, which starts and stops the MPS server process and brokers connections between client processes and the server.
- A server process, which provides the shared connection to a single GPU for multiple clients and executes their work concurrently.
- A client runtime, integrated into the CUDA driver library, which is transparent to CUDA applications.
MPS is the tool to reach for when you want multiple processes to exploit the GPU concurrently. It allows multiple processes to share a single GPU context, which avoids the overhead of context switching and the stretched timeline caused by serialized execution. It also allows kernels and memcpy operations from different processes to run concurrently on the same GPU, maximizing GPU utilization.
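Outside of Kubernetes, the MPS control daemon can also be driven by hand; here is a minimal sketch on a bare host (the GPU index and directories are just examples):

```bash
# Start the MPS control daemon for GPU 0 (run as the user that owns the GPU, typically root)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps        # where client processes find the daemon
export CUDA_MPS_LOG_DIRECTORY=/tmp/nvidia-mps-log     # daemon and server logs
nvidia-cuda-mps-control -d                            # launch the control daemon in the background

echo get_server_list | nvidia-cuda-mps-control        # list MPS servers spawned for client processes
echo quit | nvidia-cuda-mps-control                   # shut the daemon down when done
```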
The two figures below contrast execution with and without MPS.
First, without MPS, two processes A (blue) and B (red) each have their own CUDA context. Although the two processes submit work at the same time, in practice execution is serialized: the GPU's time-slice scheduler rotates the two contexts onto the GPU in turn, which is why the overall timeline is stretched.
img
With MPS enabled and the same two processes A (blue) and B (red), the MPS server merges their two CUDA contexts into a single one, and this is the key difference. With only one context there is no context rotation on the GPU, so that overhead disappears; and on the timeline the kernels of A and B genuinely execute concurrently. That is the benefit MPS brings.
img
The benefits of MPS are obvious: higher GPU utilization, less time spent on GPU context switches, and less GPU memory spent storing contexts. In short, it makes fuller use of the GPU. So why is such a useful technology so rarely used in the industry?
Because the context-merging approach has a serious drawback: faults propagate. If a process exits abnormally (including being killed) while one of its kernels is executing, the other processes sharing IPC and UVM with it will also fail and exit. That makes MPS hard to use at scale in production.
A node can only be assigned a single GPU sharing strategy.
| GPU sharing strategy | Description | Notes |
| --- | --- | --- |
| MPS | Multiple processes or applications share the resources of one GPU; suited to workloads that process lots of data in parallel or run complex computations. | The card is split into equal shares, and you can choose how many cards on a node are split. Example: share count → per-share resources, e.g. GPU: 1/4 of an NVIDIA Ampere A100; memory: 1/4 × 24 GB, i.e. 6 GB. |
| MIG | Runs multiple independent GPU instances on one physical GPU without mutual interference. | Each card can be given its own MIG strategy; on a single card the placement order is fixed and at most 7 slices are possible (e.g. 7×1, 4+2+1, 4+1+1+1, ...). |
| GPU model | MIG (only supported on A-series and H-series cards) |
| --- | --- |
| NVIDIA Ampere A800 (80G) | 7 × (1g.10gb); 4 × (1g.20gb); 3 × (2g.20gb); 2 × (3g.40gb); 1 × (4g.40gb); 1 × (7g.80gb); 1 × (1g.12gb) + 1 × (2g.24gb) + 1 × (3g.47gb); 2 × (1g.10gb) + 1 × (2g.20gb) + 1 × (3g.40gb) |
| NVIDIA GeForce RTX 4090 (24G) | Not supported |
| NVIDIA Ampere A100 (40G) | 7 × (1g.5gb); 3 × (2g.10gb); 2 × (3g.20gb); 1 × (4g.20gb); 1 × (7g.40gb) |
| NVIDIA Ampere A30 (24G) | 4 × (1g.6gb); 2 × (2g.12gb); 1 × (4g.24gb); 2 × (1g.6gb) + 1 × (2g.12gb) |
| NVIDIA GeForce RTX 3090 | Not supported |
3. Practice and Testing
Option 1 (MPS)
Based on NVIDIA's official open-source project: https://github.com/nvidia/k8s-device-plugin
nvidia-device-plugin supports two GPU sharing modes, time-slicing and MPS; the two are mutually exclusive, so you must pick one.
- Time-slicing: each application can use the full GPU memory, and applications share the GPU's compute via time slices; memory is not isolated between applications (we ruled this option out immediately).
- MPS: the GPU is managed by the MPS daemon; GPU memory is divided evenly by the configured number of shared replicas, an application that exceeds its share gets an OOM, GPU memory is isolated between processes, and the GPU's compute is also split proportionally (still time-slice based underneath; the underlying control knobs are sketched below).
By comparison, the MPS option is stronger in isolation and resource allocation, so this round of validation only covered MPS, not time-slicing.
Granularity: sharing can be enabled per node, i.e. MPS or time-slicing can be turned on for a single node only.
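For reference, the per-client memory and compute caps that the MPS option relies on correspond to standard CUDA MPS control variables; a minimal sketch follows (whether the plugin sets exactly these variables is an assumption here):

```bash
# Cap each MPS client's memory on device 0 at half of a 24 GB card, and give it half of the SMs
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=12G"
export CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=50
```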
Enabling MPS
When installing nvidia-device-plugin, enable a second configuration: the default configuration enables neither MPS nor time-slicing, and GPU sharing is turned on in the second configuration.
nvidia-device-plugin can enable MPS on specific nodes; to turn MPS on for a node, add the corresponding labels to it:
nvidia.com/mps.capable — whether to enable MPS on the node, e.g. true
nvidia.com/device-plugin.config — the name of the configuration the node uses, e.g. config1
In theory there is no limit on the number of configurations: within one cluster you can define several, e.g. nvidia-share-4 and nvidia-share-2, and split different nodes at different ratios according to business needs. The node labels are applied as shown in the example below.
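A minimal sketch of labeling a node for MPS sharing (the node name is a placeholder; config names must exist in the ConfigMap):

```bash
# Mark the node as MPS-capable and point it at one of the sharing configs
kubectl label node <node-name> nvidia.com/mps.capable=true --overwrite
kubectl label node <node-name> nvidia.com/device-plugin.config=config1 --overwrite

# Switching a node to a different split ratio is just a relabel, e.g. a hypothetical 4-way config
kubectl label node <node-name> nvidia.com/device-plugin.config=nvidia-share-4 --overwrite
```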
A complete deployment manifest is shown below (adjust the node affinity and tolerations in the YAML as needed):
---
# Source: nvidia-device-plugin/templates/service-account.yml
apiVersion: v1
kind: ServiceAccount
metadata:
name: nvidia-device-plugin-service-account
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
---
# Source: nvidia-device-plugin/templates/configmap.yml
apiVersion: v1
kind: ConfigMap
metadata:
name: nvidia-device-plugin-configs
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
data:
config0: |-
version: v1
config1: |-
version: v1
sharing:
mps:
renameByDefault: true
resources:
- name: nvidia.com/gpu
replicas: 2
---
# Source: nvidia-device-plugin/templates/role.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: nvidia-device-plugin-role
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
rules:
- apiGroups: [""]
resources: ["nodes"]
verbs: ["get", "list", "watch"]
---
# Source: nvidia-device-plugin/templates/role-binding.yml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: nvidia-device-plugin-role-binding
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
subjects:
- kind: ServiceAccount
name: nvidia-device-plugin-service-account
namespace: kube-system
roleRef:
kind: ClusterRole
name: nvidia-device-plugin-role
apiGroup: rbac.authorization.k8s.io
---
# Source: nvidia-device-plugin/templates/daemonset-device-plugin.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
spec:
selector:
matchLabels:
app.kubernetes.io/name: nvidia-device-plugin
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app.kubernetes.io/name: nvidia-device-plugin
annotations:
checksum/config: 5cae25ed78745124db43b014773455550cf9c60962da45074548790b2acb66f0
spec:
priorityClassName: system-node-critical
securityContext:
{}
serviceAccountName: nvidia-device-plugin-service-account
shareProcessNamespace: true
initContainers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: nvidia-device-plugin-init
command: ["config-manager"]
env:
- name: ONESHOT
value: "true"
- name: KUBECONFIG
value: ""
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: "spec.nodeName"
- name: NODE_LABEL
value: "nvidia.com/device-plugin.config"
- name: CONFIG_FILE_SRCDIR
value: "/available-configs"
- name: CONFIG_FILE_DST
value: "/config/config.yaml"
- name: DEFAULT_CONFIG
value: "config0"
- name: FALLBACK_STRATEGIES
value: "named,single"
- name: SEND_SIGNAL
value: "false"
- name: SIGNAL
value: ""
- name: PROCESS_TO_SIGNAL
value: ""
volumeMounts:
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
containers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: nvidia-device-plugin-sidecar
command: ["config-manager"]
env:
- name: ONESHOT
value: "false"
- name: KUBECONFIG
value: ""
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: "spec.nodeName"
- name: NODE_LABEL
value: "nvidia.com/device-plugin.config"
- name: CONFIG_FILE_SRCDIR
value: "/available-configs"
- name: CONFIG_FILE_DST
value: "/config/config.yaml"
- name: DEFAULT_CONFIG
value: "config0"
- name: FALLBACK_STRATEGIES
value: "named,single"
- name: SEND_SIGNAL
value: "true"
- name: SIGNAL
value: "1" # SIGHUP
- name: PROCESS_TO_SIGNAL
value: "nvidia-device-plugin"
volumeMounts:
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
securityContext:
capabilities:
add:
- SYS_ADMIN
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
imagePullPolicy: IfNotPresent
name: nvidia-device-plugin-ctr
command: ["nvidia-device-plugin"]
env:
- name: MPS_ROOT
value: "/run/nvidia/mps"
- name: CONFIG_FILE
value: /config/config.yaml
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
securityContext:
capabilities:
add:
- SYS_ADMIN
volumeMounts:
- name: device-plugin
mountPath: /var/lib/kubelet/device-plugins
# The MPS /dev/shm is needed to allow for MPS daemon health-checking.
- name: mps-shm
mountPath: /dev/shm
- name: mps-root
mountPath: /mps
- name: cdi-root
mountPath: /var/run/cdi
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
volumes:
- name: device-plugin
hostPath:
path: /var/lib/kubelet/device-plugins
- name: mps-root
hostPath:
path: /run/nvidia/mps
type: DirectoryOrCreate
- name: mps-shm
hostPath:
path: /run/nvidia/mps/shm
- name: cdi-root
hostPath:
path: /var/run/cdi
type: DirectoryOrCreate
- name: available-configs
configMap:
name: "nvidia-device-plugin-configs"
- name: config
emptyDir: {}
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
---
# Source: nvidia-device-plugin/templates/daemonset-mps-control-daemon.yml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: nvidia-device-plugin-mps-control-daemon
namespace: kube-system
labels:
helm.sh/chart: nvidia-device-plugin-0.15.0-rc.2
app.kubernetes.io/name: nvidia-device-plugin
app.kubernetes.io/version: "0.15.0-rc.2"
app.kubernetes.io/managed-by: Helm
spec:
selector:
matchLabels:
app.kubernetes.io/name: nvidia-device-plugin
updateStrategy:
type: RollingUpdate
template:
metadata:
labels:
app.kubernetes.io/name: nvidia-device-plugin
annotations:
checksum/config: 5cae25ed78745124db43b014773455550cf9c60962da45074548790b2acb66f0
spec:
priorityClassName: system-node-critical
securityContext:
{}
serviceAccountName: nvidia-device-plugin-service-account
shareProcessNamespace: true
initContainers:
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: mps-control-daemon-mounts
command: [mps-control-daemon, mount-shm]
securityContext:
privileged: true
volumeMounts:
- name: mps-root
mountPath: /mps
mountPropagation: Bidirectional
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: mps-control-daemon-init
command: ["config-manager"]
env:
- name: ONESHOT
value: "true"
- name: KUBECONFIG
value: ""
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: "spec.nodeName"
- name: NODE_LABEL
value: "nvidia.com/device-plugin.config"
- name: CONFIG_FILE_SRCDIR
value: "/available-configs"
- name: CONFIG_FILE_DST
value: "/config/config.yaml"
- name: DEFAULT_CONFIG
value: "config0"
- name: FALLBACK_STRATEGIES
value: "named,single"
- name: SEND_SIGNAL
value: "false"
- name: SIGNAL
value: ""
- name: PROCESS_TO_SIGNAL
value: ""
volumeMounts:
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
containers:
# TODO: How do we synchronize the plugin and control-daemon on restart.
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
name: mps-control-daemon-sidecar
command: ["config-manager"]
env:
- name: ONESHOT
value: "false"
- name: KUBECONFIG
value: ""
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: "spec.nodeName"
- name: NODE_LABEL
value: "nvidia.com/device-plugin.config"
- name: CONFIG_FILE_SRCDIR
value: "/available-configs"
- name: CONFIG_FILE_DST
value: "/config/config.yaml"
- name: DEFAULT_CONFIG
value: "config0"
- name: FALLBACK_STRATEGIES
value: "named,single"
- name: SEND_SIGNAL
value: "true"
- name: SIGNAL
value: "1"
- name: PROCESS_TO_SIGNAL
value: "/usr/bin/mps-control-daemon"
volumeMounts:
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
- image: nvcr.io/nvidia/k8s-device-plugin:v0.17.1
imagePullPolicy: IfNotPresent
name: mps-control-daemon-ctr
command: [mps-control-daemon]
env:
- name: NODE_NAME
valueFrom:
fieldRef:
apiVersion: v1
fieldPath: spec.nodeName
- name: CONFIG_FILE
value: /config/config.yaml
- name: NVIDIA_MIG_MONITOR_DEVICES
value: all
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: compute,utility
securityContext:
privileged: true
volumeMounts:
- name: mps-shm
mountPath: /dev/shm
- name: mps-root
mountPath: /mps
- name: available-configs
mountPath: /available-configs
- name: config
mountPath: /config
volumes:
- name: mps-root
hostPath:
path: /run/nvidia/mps
type: DirectoryOrCreate
- name: mps-shm
hostPath:
path: /run/nvidia/mps/shm
- name: available-configs
configMap:
name: "nvidia-device-plugin-configs"
- name: config
emptyDir: {}
nodeSelector:
# We only deploy this pod if the following sharing label is applied.
nvidia.com/mps.capable: "true"
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: nvidia.com/gpu.present
operator: In
values:
- "true"
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- effect: NoSchedule
key: nvidia.com/gpu
          operator: Exists

Verification
Our test cluster has a single node with eight 4090 cards, each with 24 GB of GPU memory. With the MPS option and each card shared into two replicas, the cluster node should in theory expose 16 nvidia.com/gpu.shared resources, each with roughly 12 GB of usable GPU memory.
- Deploy the YAML to the cluster
img

The deployment succeeded and the node resources are as expected: twice the number of GPU cards (because we shared each card into 2).
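A quick way to confirm the shared resource count on the node (node name is a placeholder; the resource name follows from renameByDefault: true above):

```bash
# The node should advertise 16 shared devices (8 cards x 2 replicas)
kubectl get node <node-name> -o jsonpath='{.status.allocatable}' | tr ',' '\n' | grep nvidia.com
```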
- Deploy a container that uses a shared resource (it can use only one shared device)
img

Deployment succeeded.
- Verify GPU memory
Initial state:
img
Since the theoretical ceiling is 12 GB, and a (1024, 1024, 1024) fp32 tensor is 4 GB:
- to stay under 12 GB, allocating (1024, 1024, 1024) × 2 + (1024, 1024, 512) should not OOM;
- to hit 12 GB, allocating (1024, 1024, 1024) × 3 should OOM.
A sketch of this allocation check follows below.
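A minimal sketch of the allocation check, run from the pod's shell (assumes PyTorch with CUDA is available in the image):

```bash
# Two 4 GiB tensors plus one 2 GiB tensor: ~10 GiB, should fit under the 12 GiB share
python -c "
import torch
xs = [torch.ones((1024, 1024, 1024), dtype=torch.float32, device='cuda') for _ in range(2)]
xs.append(torch.ones((1024, 1024, 512), dtype=torch.float32, device='cuda'))
print(torch.cuda.memory_allocated() / 2**30, 'GiB allocated')
"

# A third 4 GiB tensor pushes past 12 GiB and should raise a CUDA out-of-memory error
python -c "
import torch
xs = [torch.ones((1024, 1024, 1024), dtype=torch.float32, device='cuda') for _ in range(3)]
"
```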
- (1024, 1024, 1024) × 2 + (1024, 1024, 512): no OOM, as expected.
img
- (1024, 1024, 1024) × 3: OOM, as expected.
img
Performance
Test script:
import torch
import time

# number of random points sampled per iteration
w = 10000000

def compute_pi(n):
    # Monte Carlo estimate of pi: sample points in [-1, 1]^2 and count those inside the unit circle
    count = 0
    total = n // w
    print(f"total is {total}")
    for idx in range(total):
        print(f"{idx}/{total}")
        x = torch.rand(w, device='cuda') * 2 - 1
        y = torch.rand(w, device='cuda') * 2 - 1
        distance = x**2 + y**2
        count += (distance <= 1).sum().item()
    return 4 * count / n

n = w * 100000
s = time.time()
pi = compute_pi(n)
e = time.time()
print(f"total {e - s}")
print(f"pi is {pi}")

Test plan: directly on the host, on a card without MPS enabled, run the computation with two processes in parallel and then serially, and record the run times.
Then run the same computation on an MPS-enabled card and compare the run times of the two setups.
On the test machine (4090 cards), after enabling MPS, run the test script on a single card and compare the time taken by two instances executing simultaneously with the time taken by one instance executing alone. The commands used are sketched below.
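A rough sketch of how the two runs were driven (compute_pi.py is a hypothetical filename for the script above):

```bash
# Parallel: two instances of the script on the same GPU
python compute_pi.py & python compute_pi.py & wait

# Serial baseline: one instance after the other
python compute_pi.py && python compute_pi.py
```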
- On the MPS-enabled card, run the two processes in parallel.
Instance 1: 193 seconds
img
Instance 2: 193 seconds

GPU utilization: with MPS enabled the GPU is in Exclusive mode, and utilization is 100%.
img
- On the MPS-enabled card, run the computation tasks serially (one task after the other).
Instance 1: 74 seconds
img
Instance 2: 74 seconds
img
GPU utilization screenshot
img
- On a card without MPS, run the computation in parallel.
Instance 1: 188 seconds
img
Instance 2: 188 seconds
img
- Serial computation: 71 seconds

Conclusion: enabling MPS has little impact on performance: 193 seconds with MPS vs. 188 seconds without, a loss of (193 − 188) / 188 ≈ 2.7%.
Applicable scenarios
- Small models
- Workloads with modest compute requirements and moderate QPS
Enabling MPS per card
The official nvidia-device-plugin does not currently support enabling MPS only on selected cards within a node (for example, a node with 8 cards where 4 run MPS and 4 stay exclusive). The source code shows support for multiple resources, but the parameter is blocked; in addition, with multiple resources in a single configuration, multiple MPS daemons get started and the nvidia-device-plugin container fails to run properly.
We made a dedicated adaptation for this, and per-card MPS enabling now works; it needs more large-scale testing before going to production.
Option 2 (vGPU)
The vGPU solution from 4Paradigm; it is also what Volcano currently integrates (Volcano >= 1.8.0 required). https://github.com/Project-HAMi/HAMi
Its goal is to unify the virtualization and scheduling of accelerator cards, and it is in the process of integrating Huawei vNPU support (tracked in an upstream issue); it is worth investigating and investing in.
Option 3 (MIG)
For how to manage MIG we referred to this Zhihu article (https://zhuanlan.zhihu.com/p/558046644); if you care about resource utilization, look at the throughput and performance comparison in its vgpu and mig-vgpu sections. A MIG device consists of a GI and a CI: the GI is the GPU instance and the CI is the compute instance.
MIG partitioning workflow
The MIG shell operations fall into three groups: inspection, creation, and deletion. All MIG operations require root, so non-root users must prefix the commands with sudo; the examples below assume the root user. The commands are listed first, and then the creation and deletion operations are explained.
| Function | Command | Notes |
| --- | --- | --- |
| [Enable] MIG on a specific card | nvidia-smi -i 0 -mig 1 | -i selects the GPU index, e.g. 0, 1, 3 |
| [Disable] MIG on a specific card | nvidia-smi -i 0 -mig 0 | |
| [Enable] MIG on all cards | nvidia-smi -mig 1 | 1 enables; 0 disables |
| [List] GPU instance profiles | nvidia-smi mig -lgip | Shows which GPU instances can be created |
| [List] GPU instance placements | nvidia-smi mig -lgipp | Shows where GPU instances can be placed |
| [List] CI profiles on a GPU instance | nvidia-smi mig -lcip | Add -gi to target a specific GPU instance; e.g. to list the CIs already created on GPU instance 2: nvidia-smi mig -lci -gi 2 |
| [List] created GPU instances | nvidia-smi mig -lgi | |
| [Create] a GI plus its CI | nvidia-smi mig -i 0 -cgi 9 -C | -i: parent GPU; -cgi: GI profile(s) to create, written as 9, 3g.20gb, or MIG 3g.20gb; -C: create the CI at the same time |
| [Create] a GI | nvidia-smi mig -i 0 -cgi 9 | Creates a GI with profile 9: 3 compute slices + 20 GB of memory (profile IDs and sizes depend on the card) |
| [Create] CIs on a GI | nvidia-smi mig -cci 0,1 -gi 1 | -cci: CI profile IDs to create; -gi: target GPU instance |
| [Delete] a GI | nvidia-smi mig -dgi -i 0 -gi 1 | -i: parent GPU; -gi: GI to delete |
| [Delete] a CI on a GI | nvidia-smi mig -dci -i 0 -gi 1 -ci 0 | -i: parent GPU; -gi: GI to operate on; -ci: CI to delete |
| [List] all MIG devices | nvidia-smi -L | |
The overall order of MIG operations is:
enable MIG -> create GI instances -> create CI instances -> delete CI instances -> delete GI instances -> disable MIG
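Putting the table together, an end-to-end sketch of that lifecycle on GPU 0 (profile and instance IDs below are illustrative; check -lgip / -lgi output on your card first):

```bash
nvidia-smi -i 0 -mig 1                  # enable MIG mode on GPU 0
nvidia-smi mig -i 0 -lgip               # list the GI profiles the card supports
nvidia-smi mig -i 0 -cgi 9 -C           # create a GI with profile 9 and its default CI
nvidia-smi mig -i 0 -lgi                # note the GI instance ID that was assigned (e.g. 1)
nvidia-smi mig -dci -i 0 -gi 1 -ci 0    # tear down: delete the CI first...
nvidia-smi mig -dgi -i 0 -gi 1          # ...then the GI
nvidia-smi -i 0 -mig 0                  # finally disable MIG mode
```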
img
- Check the card type with nvidia-smi; the card must be A-series or newer.
- Enable MIG on a single card:
nvidia-smi -i 0 -mig 1 — if the change stays pending, the machine may need a reboot.
img
- List the supported MIG profiles:
nvidia-smi mig -i 0 -lgip

- Here we split card 0 into two 3g.40gb units:
nvidia-smi mig -i 0 -cgi 9 — run it twice to create two instances.

- Check the result:
nvidia-smi mig -i 0 -lgi
img
- List the CI profiles supported by each GI instance:
nvidia-smi mig -i 0 -lcip
img
- Create a CI for the MIG instance:
nvidia-smi mig -gi 0 -cci 2
img
- Check the final result:
nvidia-smi
img
apiVersion: v1
kind: Pod
metadata:
name: test1
spec:
containers:
- image: harbor.maip.io/base/pytorch-alpaca:v3
imagePullPolicy: IfNotPresent
name: test
command:
- /bin/bash
- -c
- "sleep infinity"
resources:
requests:
nvidia.com/mig-3g.40gb: 1
limits:
nvidia.com/mig-3g.40gb: 1
---
apiVersion: v1
kind: Pod
metadata:
name: test2
spec:
containers:
- image: harbor.maip.io/base/pytorch-alpaca:v3
imagePullPolicy: IfNotPresent
name: test
command:
- /bin/bash
- -c
- "sleep infinity"
resources:
requests:
nvidia.com/mig-3g.40gb: 1
limits:
nvidia.com/mig-3g.40gb: 1
---
apiVersion: v1
kind: Pod
metadata:
name: test3
spec:
containers:
- image: harbor.maip.io/base/pytorch-alpaca:v3
imagePullPolicy: IfNotPresent
name: test
command:
- /bin/bash
- -c
- "sleep infinity"
resources:
requests:
nvidia.com/gpu: 1
limits:
        nvidia.com/gpu: 1
- Check the results
img
img
img
img
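To confirm what each pod actually sees, a quick (hypothetical) check from inside one of the test pods:

```bash
# test1 requested nvidia.com/mig-3g.40gb, so it should see exactly one MIG device
kubectl exec -it test1 -- nvidia-smi -L
```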
To delete a MIG instance, delete the CI first and then the GI.
Delete the CI: nvidia-smi mig -dci -i 0 -gi 1 -ci 0
Delete the GI: nvidia-smi mig -dgi -i 0 -gi 1

4. One-click deployment with gpu-operator
GPU Operator is a Kubernetes Operator from NVIDIA that simplifies using GPUs in a Kubernetes cluster by automating the installation of the GPU driver, the NVIDIA Device Plugin, DCGM Exporter, and other components.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
  --set driver.enabled=false   # if the NVIDIA driver is already installed on the nodes, disable driver installation here (or via values.yaml)
- The MPS and MIG settings are added under devicePlugin and migManager in values.yaml, and the corresponding ConfigMaps are then created:
devicePlugin:
enabled: true
repository: nvcr.io/nvidia
image: k8s-device-plugin
version: v0.17.1
imagePullPolicy: IfNotPresent
env:
- name: PASS_DEVICE_SPECS
value: "true"
- name: FAIL_ON_INIT_ERROR
value: "true"
- name: DEVICE_LIST_STRATEGY
value: envvar
- name: DEVICE_ID_STRATEGY
value: uuid
- name: NVIDIA_VISIBLE_DEVICES
value: all
- name: NVIDIA_DRIVER_CAPABILITIES
value: all
- name: NODE_LABEL
value: nvidia.com/device-plugin.config
config:
name: device-plugin-config
default: default
migManager:
enabled: true
repository: nvcr.io/nvidia/cloud-native
image: k8s-mig-manager
version: v0.12.1-ubuntu20.04
imagePullPolicy: IfNotPresent
env:
- name: WITH_REBOOT
value: "false"
config:
default: all-disabled
name: custom-mig-parted-config
gpuClientsConfig:
name: ""- 測(cè)試用的device-plugin配置文件。
---
# Source: gpu-operator/templates/plugin_config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: device-plugin-config
namespace: gpu-operator
labels:
app.kubernetes.io/name: gpu-operator
helm.sh/chart: gpu-operator-v24.3.0
app.kubernetes.io/instance: stable
app.kubernetes.io/version: "v24.3.0"
app.kubernetes.io/managed-by: Helm
data:
default: |-
version: v1
flags:
migStrategy: none
mig-mixed: |-
version: v1
flags:
migStrategy: mixed
mig-single: |-
version: v1
flags:
migStrategy: single
config1: |-
flags:
migStrategy: none
sharing:
mps:
renameByDefault: true
resources:
- devices:
- "0"
- "1"
name: nvidia.com/gpu
replicas: 2
    version: v1
The MIG strategy configuration used for testing:
apiVersion: v1
kind: ConfigMap
metadata:
name: custom-mig-parted-config
namespace: gpu-operator
data:
config.yaml: |
version: v1
    # MIG layout enabled only for the first card
mig-configs:
custom-1:
- devices: [0]
mig-enabled: true
mig-devices:
"1g.10gb": 2
"2g.20gb": 1
"3g.40gb": 1
all-disabled:
- devices: all
mig-enabled: false
all-enabled:
- devices: all
mig-enabled: true
mig-devices: {}
# A100-40GB, A800-40GB
all-1g.5gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.5gb": 7
all-1g.5gb.me:
- devices: all
mig-enabled: true
mig-devices:
"1g.5gb+me": 1
all-2g.10gb:
- devices: all
mig-enabled: true
mig-devices:
"2g.10gb": 3
all-3g.20gb:
- devices: all
mig-enabled: true
mig-devices:
"3g.20gb": 2
all-4g.20gb:
- devices: all
mig-enabled: true
mig-devices:
"4g.20gb": 1
all-7g.40gb:
- devices: all
mig-enabled: true
mig-devices:
"7g.40gb": 1
# H100-80GB, H800-80GB, A100-80GB, A800-80GB, A100-40GB, A800-40GB
all-1g.10gb:
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
- device-filter: ["0x233010DE", "0x233110DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 7
# A100-40GB, A800-40GB
- device-filter: ["0x20B010DE", "0x20B110DE", "0x20F110DE", "0x20F610DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 4
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
all-1g.10gb.me:
- devices: all
mig-enabled: true
mig-devices:
"1g.10gb+me": 1
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
all-1g.20gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.20gb": 4
all-2g.20gb:
- devices: all
mig-enabled: true
mig-devices:
"2g.20gb": 3
all-3g.40gb:
- devices: all
mig-enabled: true
mig-devices:
"3g.40gb": 2
all-4g.40gb:
- devices: all
mig-enabled: true
mig-devices:
"4g.40gb": 1
all-7g.80gb:
- devices: all
mig-enabled: true
mig-devices:
"7g.80gb": 1
# A30-24GB
all-1g.6gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.6gb": 4
all-1g.6gb.me:
- devices: all
mig-enabled: true
mig-devices:
"1g.6gb+me": 1
all-2g.12gb:
- devices: all
mig-enabled: true
mig-devices:
"2g.12gb": 2
all-2g.12gb.me:
- devices: all
mig-enabled: true
mig-devices:
"2g.12gb+me": 1
all-4g.24gb:
- devices: all
mig-enabled: true
mig-devices:
"4g.24gb": 1
# H100 NVL, H800 NVL, GH200
all-1g.12gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.12gb": 7
all-1g.12gb.me:
- devices: all
mig-enabled: true
mig-devices:
"1g.12gb+me": 1
all-1g.24gb:
- devices: all
mig-enabled: true
mig-devices:
"1g.24gb": 4
all-2g.24gb:
- devices: all
mig-enabled: true
mig-devices:
"2g.24gb": 3
# H100 NVL, H800 NVL
all-3g.47gb:
- devices: all
mig-enabled: true
mig-devices:
"3g.47gb": 2
all-4g.47gb:
- devices: all
mig-enabled: true
mig-devices:
"4g.47gb": 1
all-7g.94gb:
- devices: all
mig-enabled: true
mig-devices:
"7g.94gb": 1
# H100-96GB, PG506-96GB, GH200
all-3g.48gb:
- devices: all
mig-enabled: true
mig-devices:
"3g.48gb": 2
all-4g.48gb:
- devices: all
mig-enabled: true
mig-devices:
"4g.48gb": 1
all-7g.96gb:
- devices: all
mig-enabled: true
mig-devices:
"7g.96gb": 1
# H100-96GB, GH200, H100 NVL, H800 NVL, H100-80GB, H800-80GB, A800-40GB, A800-80GB, A100-40GB, A100-80GB, A30-24GB, PG506-96GB
all-balanced:
# H100 NVL, H800 NVL
- device-filter: ["0x232110DE", "0x233A10DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.12gb": 1
"2g.24gb": 1
"3g.47gb": 1
# H100-80GB, H800-80GB, A100-80GB, A800-80GB
- device-filter: ["0x233010DE", "0x233110DE", "0x232210DE", "0x20B210DE", "0x20B510DE", "0x20F310DE", "0x20F510DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.10gb": 2
"2g.20gb": 1
"3g.40gb": 1
# A100-40GB, A800-40GB
- device-filter: ["0x20B010DE", "0x20B110DE", "0x20F110DE", "0x20F610DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.5gb": 2
"2g.10gb": 1
"3g.20gb": 1
# A30-24GB
- device-filter: "0x20B710DE"
devices: all
mig-enabled: true
mig-devices:
"1g.6gb": 2
"2g.12gb": 1
# H100-96GB, PG506-96GB, GH200
- device-filter: ["0x234210DE", "0x233D10DE", "0x20B610DE"]
devices: all
mig-enabled: true
mig-devices:
"1g.12gb": 2
"2g.24gb": 1
"3g.48gb": 1- 部署完成之后查看服務(wù)狀態(tài)
kubectl get pod -n gpu-operator
- Enable the corresponding sharing strategy by labeling the GPU nodes
How to operate:
- Enable the MPS strategy: set the node label nvidia.com/device-plugin.config to the desired MPS configuration; the value must exist in the configured ConfigMap, e.g. config1 above, which enables MPS for two of the cards.
- Enable the MIG strategy: set the node label nvidia.com/device-plugin.config to mig-mixed, which allows MIG instances of different sizes on the same machine, and additionally set the node label nvidia.com/mig.config to a concrete MIG configuration name, e.g. custom-1 above, which enables MIG only on the first card.
- To verify after enabling, follow the steps in the Practice and Testing section; example label commands are shown below.
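A sketch of the label commands for the two strategies (node name is a placeholder; config names come from the ConfigMaps above):

```bash
# MPS: share GPUs 0 and 1 according to config1
kubectl label node <node-name> nvidia.com/device-plugin.config=config1 --overwrite

# MIG: mixed strategy for the device plugin plus the concrete layout for mig-manager
kubectl label node <node-name> nvidia.com/device-plugin.config=mig-mixed --overwrite
kubectl label node <node-name> nvidia.com/mig.config=custom-1 --overwrite

# Watch the gpu-operator pods pick up the new configuration
kubectl get pod -n gpu-operator -w
```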
(Screenshots: NVIDIA Ampere A100, 3g.40gb)