GPUs are a critical compute resource in production. Overheating, VRAM leaks, and hardware faults can take down live services if they go unnoticed. This post documents a complete GPU monitoring and alerting setup: a fully automated pipeline from metric collection to Feishu notifications.
## Monitoring Targets
| Dimension | Metric | Alert threshold |
|---|---|---|
| Temperature | GPU core temp / memory temp | > 83°C / > 100°C |
| VRAM | VRAM utilization | > 90% |
| Power | GPU power draw | > 150 W |
| Hardware health | PCIe replays, row-remapping errors | any occurrence |
| Availability | DCGM-Exporter process liveness | down for 1m |
## Architecture Overview
```mermaid
flowchart LR
    A[DCGM-Exporter<br>monkey:9400] -->|scrape| B[Prometheus<br>robin:9090]
    B -->|alert rules| C[Alertmanager<br>robin:9093]
    C -->|webhook| D[alert-transformer<br>rivo:9091]
    D -->|hooks| E[OpenClaw]
    E -->|Feishu bot| F[Feishu]
```
## Deploying DCGM-Exporter
Deploy DCGM-Exporter on the GPU node via Docker. The DaoCloud mirror speeds up image pulls from within China:
```yaml
# /opt/dcgm-exporter/docker-compose.yaml
services:
  dcgm-exporter:
    image: m.daocloud.io/docker.io/nvidia/dcgm-exporter:4.5.2-4.8.1-distroless
    container_name: dcgm-exporter
    restart: unless-stopped
    runtime: nvidia
    ports:
      - 9400:9400
```
```bash
# Create the directory and start the container
sudo mkdir -p /opt/dcgm-exporter && sudo chown ubuntu:ubuntu /opt/dcgm-exporter
cd /opt/dcgm-exporter && docker compose up -d
```
Prerequisite: Docker must be configured with the nvidia runtime, and nvidia-smi must work on the host.
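If the runtime isn't configured yet, here is a minimal sketch using the NVIDIA Container Toolkit's own CLI (this assumes nvidia-container-toolkit is already installed; the CUDA image tag below is only illustrative):

```bash
# Register the nvidia runtime in /etc/docker/daemon.json and restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Sanity check: a container should be able to see the GPU
# (any CUDA base image works; this tag is just an example)
docker run --rm --runtime=nvidia nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```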
## Prometheus Scrape Configuration
Add a new scrape job to prometheus.yml:
```yaml
- job_name: gpu
  static_configs:
    - targets:
        - 192.168.0.73:9400
      labels:
        hostname: monkey
```
Sync the config and restart:
```bash
scp monitor/prometheus/prometheus.yml robin:/tmp/prometheus.yml
ssh robin "sudo cp /tmp/prometheus.yml /opt/monitor/prometheus/prometheus.yml"
ssh robin "cd /opt/monitor && sudo docker compose restart prometheus"
```
## Alert Rules

Add a gpu_alerts rule group to alerts.yml, with thresholds based on the RTX 4060 Ti's specs:
```yaml
- name: gpu_alerts
  interval: 30s
  rules:
    - alert: GPUTargetDown
      expr: up{job="gpu"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "DCGM-Exporter down"
        description: "GPU metrics exporter on host {{ $labels.hostname }} is down"
    - alert: HighGPUTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 83
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU temperature high"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) at {{ $value }}°C"
    - alert: CriticalGPUTemperature
      expr: DCGM_FI_DEV_GPU_TEMP > 90
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "GPU temperature critically high"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) at {{ $value }}°C"
    - alert: HighGPUMemoryTemperature
      expr: DCGM_FI_DEV_MEMORY_TEMP > 100
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU memory temperature high"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} memory at {{ $value }}°C"
    - alert: HighGPUVRAMUsage
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 90
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU VRAM usage above 90%"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) VRAM at {{ $value | printf \"%.1f\" }}%"
    - alert: CriticalGPUVRAMUsage
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 95
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "GPU VRAM nearly exhausted"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} VRAM at {{ $value | printf \"%.1f\" }}%"
    - alert: HighGPUPowerUsage
      expr: DCGM_FI_DEV_POWER_USAGE > 150
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU power draw high"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} drawing {{ $value }} W"
    - alert: GPUPCIEReplay
      expr: rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[5m]) > 0
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU PCIe replay errors"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} PCIe replay rate {{ $value | printf \"%.2f\" }}/s"
    - alert: GPUUncorrectableRows
      expr: increase(DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS[5m]) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU hardware fault (uncorrectable row remapping)"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} reported uncorrectable remapped rows"
    - alert: GPURowRemapFailure
      expr: DCGM_FI_DEV_ROW_REMAP_FAILURE > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "GPU row remapping failed; hardware replacement needed"
        description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} row remapping failure"
```
Sync and restart the same way:
```bash
scp monitor/prometheus/alerts.yml robin:/tmp/alerts.yml
ssh robin "sudo cp /tmp/alerts.yml /opt/monitor/prometheus/alerts.yml"
ssh robin "cd /opt/monitor && sudo docker compose restart prometheus"
```
## Alert Flow

The complete path once an alert fires:
```text
DCGM-Exporter      GPU metric goes out of range
        │
Prometheus         alert rule matches, alert starts firing
        │
Alertmanager       groups and dedupes, sends via webhook
        │
alert-transformer  formats the message into Chinese
        │
OpenClaw           routes to the designated agent, calls the Feishu bot
        │
Feishu             instant notification (alert name, host, current value, remediation guidance)
```
The Alertmanager config routes all alerts to alert-transformer:
```yaml
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'openclaw'
receivers:
  - name: 'openclaw'
    webhook_configs:
      - url: 'http://192.168.0.99:9091/alertmanager'
        send_resolved: true
```
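For context, here is a minimal, hypothetical stand-in for the receiving end of that webhook; the real alert-transformer is not shown in this post, but the payload fields follow Alertmanager's documented webhook format:

```python
# Hypothetical stand-in for alert-transformer: a bare Alertmanager
# webhook receiver listening on the port the config above points at.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        # Alertmanager batches alerts into a single webhook call.
        for alert in payload.get("alerts", []):
            labels = alert.get("labels", {})
            annotations = alert.get("annotations", {})
            # The real service would format this and hand it to OpenClaw.
            print(f"[{alert.get('status')}] {labels.get('alertname')} "
                  f"on {labels.get('hostname', '?')}: "
                  f"{annotations.get('description', '')}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 9091), AlertHandler).serve_forever()
```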
Alert severities map to different handling in OpenClaw:
- critical: highest priority, forced push, 120s timeout
- warning: normal push, 60s timeout
- info: low priority, 30s timeout
## Verification
```bash
# 1. Confirm DCGM-Exporter is running
ssh monkey "docker ps | grep dcgm"

# 2. Confirm GPU metrics are exposed
curl -s http://192.168.0.73:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP

# 3. Confirm Prometheus has discovered the target
ssh robin "curl -s http://localhost:9090/api/v1/targets | python3 -c \"import sys,json; d=json.load(sys.stdin); [print(t['labels']['job'], t['labels']['hostname'], t['health']) for t in d['data']['activeTargets'] if t['labels'].get('hostname')=='monkey']\""

# 4. Confirm the alert rules are loaded
ssh robin "curl -s http://localhost:9090/api/v1/rules | python3 -c 'import sys,json; d=json.load(sys.stdin); [print(r[\"name\"], r[\"state\"]) for g in d[\"data\"][\"groups\"] if g[\"name\"]==\"gpu_alerts\" for r in g[\"rules\"]]'"
```
## Summary

Combining DCGM-Exporter, Prometheus, Alertmanager, and OpenClaw gives fully automated GPU monitoring and alerting. When a key metric (temperature, VRAM, hardware errors) goes bad, a Feishu notification reaches a human within minutes. The same setup scales to multiple GPU nodes: deploy DCGM-Exporter on each node and add it as a Prometheus target, as sketched below.
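For instance, a second GPU node would just be another entry under the same scrape job (the 192.168.0.74 address and tiger hostname here are hypothetical):

```yaml
- job_name: gpu
  static_configs:
    - targets: ['192.168.0.73:9400']
      labels:
        hostname: monkey
    - targets: ['192.168.0.74:9400']  # hypothetical second node
      labels:
        hostname: tiger
```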