GPUs are critical compute resources in production; overheating, VRAM leaks, and hardware faults can impact online services if not caught promptly. This post documents a complete GPU monitoring and alerting setup: a fully automated pipeline from metric collection to Feishu notifications.

Monitoring Targets

| Dimension | Metric | Alert threshold |
| --- | --- | --- |
| Temperature | GPU core temp / memory temp | > 83°C / > 100°C |
| VRAM | VRAM usage | > 90% |
| Power | GPU power draw | > 150 W |
| Hardware health | PCIe replays, row-remap errors | any occurrence |
| Availability | DCGM-Exporter liveness | down for 1m |

Architecture Overview

flowchart LR
    A[DCGM-Exporter<br>monkey:9400] -->|scrape| B[Prometheus<br>robin:9090]
    B -->|alert rules| C[Alertmanager<br>robin:9093]
    C -->|webhook| D[alert-transformer<br>rivo:9091]
    D -->|hooks| E[OpenClaw]
    E -->|Feishu bot| F[Feishu]

Deploying DCGM-Exporter

Deploy DCGM-Exporter on the GPU node via Docker, using the DaoCloud mirror to speed up image pulls from inside China:

# /opt/dcgm-exporter/docker-compose.yaml
services:
  dcgm-exporter:
    image: m.daocloud.io/docker.io/nvidia/dcgm-exporter:4.5.2-4.8.1-distroless
    container_name: dcgm-exporter
    restart: unless-stopped
    runtime: nvidia
    ports:
      - 9400:9400

# Create the directory and start the container
sudo mkdir -p /opt/dcgm-exporter && sudo chown ubuntu:ubuntu /opt/dcgm-exporter
cd /opt/dcgm-exporter && docker compose up -d
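
Once the container is up, a quick check on the node confirms the exporter is actually serving GPU metrics (the same metric name the alert rules below query):

curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP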

Prerequisite: Docker must have the nvidia runtime configured, and nvidia-smi must run correctly on the host.
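
If the runtime is not configured yet, a minimal sketch using the NVIDIA Container Toolkit (assuming nvidia-container-toolkit is already installed on the node) registers and verifies it:

# Register the nvidia runtime in /etc/docker/daemon.json, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# The nvidia runtime should now appear in the runtime list
docker info | grep -i runtimes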

Prometheus Scrape Configuration

Add a new scrape job to prometheus.yml:

  - job_name: gpu
    static_configs:
      - targets:
          - 192.168.0.73:9400
        labels:
          hostname: monkey

Sync the config and restart:

scp monitor/prometheus/prometheus.yml robin:/tmp/prometheus.yml
ssh robin "sudo cp /tmp/prometheus.yml /opt/monitor/prometheus/prometheus.yml"
ssh robin "cd /opt/monitor && sudo docker compose restart prometheus"

Alert Rules

Add a gpu_alerts rule group to alerts.yml, with thresholds sized for the RTX 4060 Ti:

  - name: gpu_alerts
    interval: 30s
    rules:
      - alert: GPUTargetDown
        expr: up{job="gpu"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "DCGM-Exporter 宕机"
          description: "主机 {{ $labels.hostname }} GPU 指标采集器已宕机"

      - alert: HighGPUTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 83
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 温度过高"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) 温度 {{ $value }}°C"

      - alert: CriticalGPUTemperature
        expr: DCGM_FI_DEV_GPU_TEMP > 90
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU 温度严重过高"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) 温度 {{ $value }}°C"

      - alert: HighGPUMemoryTemperature
        expr: DCGM_FI_DEV_MEMORY_TEMP > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 显存温度过高"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} 显存温度 {{ $value }}°C"

      - alert: HighGPUVRAMUsage
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 显存使用率超过 90%"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} ({{ $labels.modelName }}) 显存 {{ $value | printf \"%.1f\" }}%"

      - alert: CriticalGPUVRAMUsage
        expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) * 100 > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "GPU 显存即将耗尽"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} 显存使用率 {{ $value | printf \"%.1f\" }}%"

      - alert: HighGPUPowerUsage
        expr: DCGM_FI_DEV_POWER_USAGE > 150
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU 功耗过高"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} 当前功耗: {{ $value }} W"

      - alert: GPUPCIEReplay
        expr: rate(DCGM_FI_DEV_PCIE_REPLAY_COUNTER[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "GPU PCIe 重连错误"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} PCIe 重连速率 {{ $value | printf \"%.2f\" }}/s"

      - alert: GPUUncorrectableRows
        expr: increase(DCGM_FI_DEV_UNCORRECTABLE_REMAPPED_ROWS[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU 硬件故障(不可纠正行重映射)"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} 出现不可纠正的行重映射"

      - alert: GPURowRemapFailure
        expr: DCGM_FI_DEV_ROW_REMAP_FAILURE > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "GPU 行重映射失败,需更换硬件"
          description: "{{ $labels.hostname }} GPU {{ $labels.gpu }} 行重映射失败"

Sync and restart the same way:

scp monitor/prometheus/alerts.yml robin:/tmp/alerts.yml
ssh robin "sudo cp /tmp/alerts.yml /opt/monitor/prometheus/alerts.yml"
ssh robin "cd /opt/monitor && sudo docker compose restart prometheus"

Alert Flow

The complete path once an alert fires:

DCGM-Exporter: a GPU metric goes abnormal
      │
Prometheus: matches an alert rule, produces a firing event
      │
Alertmanager: groups and deduplicates, sends a webhook
      │
alert-transformer: formats the message in Chinese
      │
OpenClaw: routes to the designated agent, calls the Feishu bot
      │
Feishu: instant notification (alert name, host, current value, remediation guidance)
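
The webhook body that alert-transformer receives is Alertmanager's standard JSON payload; a trimmed example for a firing temperature alert (the field values here are illustrative):

{
  "version": "4",
  "status": "firing",
  "receiver": "openclaw",
  "groupLabels": { "alertname": "HighGPUTemperature" },
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "HighGPUTemperature",
        "severity": "warning",
        "hostname": "monkey",
        "gpu": "0"
      },
      "annotations": {
        "summary": "GPU temperature high",
        "description": "monkey GPU 0 (NVIDIA GeForce RTX 4060 Ti) at 85°C"
      },
      "startsAt": "2025-01-01T08:00:00Z"
    }
  ]
}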

The Alertmanager configuration routes all alerts to alert-transformer:

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'openclaw'

receivers:
  - name: 'openclaw'
    webhook_configs:
      - url: 'http://192.168.0.99:9091/alertmanager'
        send_resolved: true
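
To exercise the chain end to end without waiting for a real incident, a synthetic alert can be POSTed straight to Alertmanager's v2 API (run on robin; the alert name and labels are made up for the test):

curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestGPUAlert", "severity": "warning", "hostname": "monkey"},
       "annotations": {"summary": "pipeline test"}}]'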

Alert severities map to different handling in OpenClaw:

  • critical: highest priority, forced push, 120s timeout
  • warning: normal push, 60s timeout
  • info: low priority, 30s timeout

Verification

# 1. Confirm DCGM-Exporter is running
ssh monkey "docker ps | grep dcgm"

# 2. Confirm GPU metrics are reachable
curl -s http://192.168.0.73:9400/metrics | grep DCGM_FI_DEV_GPU_TEMP

# 3. Confirm Prometheus has discovered the target
ssh robin "curl -s http://localhost:9090/api/v1/targets | python3 -c \"import sys,json; d=json.load(sys.stdin); [print(t['labels']['job'], t['labels']['hostname'], t['health']) for t in d['data']['activeTargets'] if t['labels']['hostname']=='monkey']\""

# 4. Confirm the alert rules are loaded
ssh robin "curl -s http://localhost:9090/api/v1/rules | python3 -c 'import sys,json; d=json.load(sys.stdin); [print(r[\"name\"], r[\"state\"]) for g in d[\"data\"][\"groups\"] if g[\"name\"]==\"gpu_alerts\" for r in g[\"rules\"]]'"

Summary

Combining DCGM-Exporter + Prometheus + Alertmanager + OpenClaw gives fully automated GPU monitoring and alerting. When a key metric (temperature, VRAM, hardware errors) goes abnormal, a Feishu notification reaches a human within minutes. The same scheme extends to multi-GPU-node setups: deploy DCGM-Exporter on each node and add it as a Prometheus target.