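Like the MinIO and Qdrant posts, this one takes the built-in metrics route: no separate exporter needs to be deployed. RabbitMQ's Prometheus endpoint is disabled by default, though; rabbitmq_prometheus has to be added to enabled_plugins before the node listens on port 15692.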

RabbitMQ Prometheus endpoint

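| Item | Value |
| --- | --- |
| Image | rabbitmq:4.2.2-management-alpine |
| Plugin | rabbitmq_prometheus (bundled in the image, must be enabled manually) |
| Port | 15692 |
| Path | /metrics |
| Authentication | not required |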

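Once enabled, the 15692 endpoint and the 15672 management UI are two separate listeners. The endpoint does not go through the management API's authentication; the rabbitmq_prometheus plugin serves the Prometheus text format directly.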

Configuration steps

1. Edit docker-compose.yaml

Two changes: publish port 15692 and add rabbitmq_prometheus to enabled_plugins.

services:
  rabbitmq:
    image: rabbitmq:4.2.2-management-alpine
    container_name: rabbitmq
    restart: always
    ports:
      - 5672:5672
      - 15672:15672
      - 15692:15692          # added
    env_file: .env
    environment:
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS: "-rabbitmq_management_agent disable_metrics_collector false"
    configs:
      - source: rabbitmq-plugins
        target: /etc/rabbitmq/enabled_plugins
    volumes:
      - rabbitmq-lib:/var/lib/rabbitmq/
      - rabbitmq-log:/var/log/rabbitmq

configs:
  rabbitmq-plugins:
    content: "[rabbitmq_management,rabbitmq_prometheus]."   # add rabbitmq_prometheus

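The rabbitmq_prometheus plugin is already compiled into the image; it just is not loaded by default. It becomes active only after enabled_plugins is changed and the container is brought back up.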

2. Restart the container

cd /opt/rabbitmq
docker compose down && docker compose up -d

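docker compose restart only restarts the existing container and does not apply changes made to docker-compose.yaml, so neither the new port mapping nor the updated enabled_plugins content is picked up. The container has to be recreated with down followed by up.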

3. Verify

# Plugin is enabled ([E*] means explicitly enabled and running)
$ docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus
[E*] rabbitmq_prometheus       4.2.2

# The 15692 endpoint returns 200
$ curl -s -o /dev/null -w 'HTTP %{http_code}  size=%{size_download} bytes\n' http://localhost:15692/metrics
HTTP 200  size=199205 bytes

Prometheus scrape configuration

The 15692 endpoint needs no authentication, so the job is minimal:

- job_name: rabbitmq
  metrics_path: /metrics
  static_configs:
    - targets:
        - <rabbitmq-host>:15692
      labels:
        hostname: <rabbitmq-host>

Deploy to the monitoring server:

scp monitor/prometheus/prometheus.yml <monitor-host>:/tmp/
ssh <monitor-host> "
  sudo cp /tmp/prometheus.yml /opt/monitor/prometheus/ &&
  cd /opt/monitor && sudo docker compose restart prometheus
"

Exposed metrics

Metrics that are always exposed

| Category | Metrics |
| --- | --- |
| Erlang VM | erlang_vm_* (about 50, covering memory/atom/process/scheduler) |
| Process | rabbitmq_process_open_fds, rabbitmq_process_max_fds, rabbitmq_process_resident_memory_bytes |
| Memory watermark | rabbitmq_resident_memory_limit_bytes (high watermark; 22 GB here with default settings) |
| Disk | rabbitmq_disk_space_available_bytes, rabbitmq_disk_space_available_limit_bytes |
| Global counts | rabbitmq_connections, rabbitmq_queues, rabbitmq_connections_opened_total, rabbitmq_connections_closed_total |
| Queue state | rabbitmq_queue_messages_ready, rabbitmq_queue_messages_unacked, rabbitmq_queue_consumers |
| Queue totals | rabbitmq_queue_messages_acked_total, rabbitmq_queue_messages_published_total, rabbitmq_queue_messages_delivered_total |
| Queue resources | rabbitmq_queue_messages_bytes, rabbitmq_queue_process_memory_bytes |
| Metadata | rabbitmq_build_info |

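Metrics removed or renamed in 4.x

| Metric in older versions/tutorials | Actual status | Alternative |
| --- | --- | --- |
| rabbitmq_partition | removed in 4.x | infer from abnormal increases in rabbitmq_connections_closed_total |
| rabbitmq_process_open_tcp_sockets | not exposed in 4.x | use rabbitmq_connections instead |
| rabbitmq_disk_space_capacity_bytes | wrong name | the metric that exists is rabbitmq_disk_space_available_limit_bytes (the free-space low watermark, not total capacity) |
| rabbitmq_app_info | does not exist | the metric that exists is rabbitmq_build_info |

Before writing alert rules, check the real metric set of the current instance with: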

curl -s http://localhost:15692/metrics \
  | grep -E '^[a-z]' | awk '{print $1}' | sed 's/{.*//' | sort -u
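
When the metric list feeds other tooling, the same extraction can be scripted. The sketch below is a rough Python equivalent of the grep/awk/sed pipeline, applied to exposition text that has already been fetched. `metric_names` and `SAMPLE` are illustrative names made up for this example, not part of RabbitMQ or Prometheus.

```python
def metric_names(exposition_text: str) -> list[str]:
    """Return the sorted, de-duplicated metric names in Prometheus text output."""
    names = set()
    for line in exposition_text.splitlines():
        # Keep only sample lines; '# HELP' / '# TYPE' comments and blank
        # lines are skipped, like grep -E '^[a-z]' in the shell version.
        if not line or not ("a" <= line[0] <= "z"):
            continue
        # The name ends at the first '{' (labels) or space (value).
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return sorted(names)


# Hypothetical excerpt of a /metrics response, for illustration only.
SAMPLE = """\
# TYPE rabbitmq_connections gauge
rabbitmq_connections 3
rabbitmq_build_info{rabbitmq_version="4.2.2"} 1
rabbitmq_queues 0
"""

if __name__ == "__main__":
    for name in metric_names(SAMPLE):
        print(name)
```

In practice the input would be the body returned by the 15692 endpoint, for example saved to a file with curl and read back in.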

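Alert rules

alerts.yml gains a rabbitmq_alerts group with 9 rules:

| Category | Alert | Expression | Threshold | For | Severity |
| --- | --- | --- | --- | --- | --- |
| Availability | RabbitMQDown | up{job="rabbitmq"} == 0 | - | 1m | critical |
| Health | RabbitMQNotRunning | absent(rabbitmq_build_info) | - | 1m | critical |
| Resources | RabbitMQFileDescriptors | FD usage | > 80% | 5m | warning |
| Resources | RabbitMQDiskSpace | disk usage | > 90% | 5m | critical |
| Resources | RabbitMQMemoryHigh | memory / high watermark | > 85% | 5m | warning |
| Traffic | RabbitMQConnectionsHigh | connections | > 5000 | 5m | warning |
| Queue | RabbitMQUnackedMessages | unacked messages per queue | > 1000 | 5m | warning |
| Queue | RabbitMQMessagesReady | ready messages per queue | > 10000 | 10m | warning |
| Queue | RabbitMQConsumersLow | consumers < 1 and queues > 0 | - | 5m | critical |

Full YAML: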

- name: rabbitmq_alerts
  interval: 30s
  rules:
    - alert: RabbitMQDown
      expr: up{job="rabbitmq"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ is down"
        description: "RabbitMQ has not responded for more than 1 minute"

    - alert: RabbitMQNotRunning
      expr: absent(rabbitmq_build_info{job="rabbitmq"})
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ node is not running"
        description: "The node is not reporting build_info; the process may have crashed or the plugin may not be enabled"

    - alert: RabbitMQFileDescriptors
      expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds > 0.8
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ file descriptor usage is high"
        description: "FD usage: {{ $value | humanizePercentage }}"

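    - alert: RabbitMQDiskSpace
      # Caution: rabbitmq_disk_space_available_limit_bytes is the free-space low
      # watermark (disk_free_limit), not the disk capacity, so this ratio is not a
      # true usage percentage. Check it against your instance before relying on it.
      expr: (1 - rabbitmq_disk_space_available_bytes / rabbitmq_disk_space_available_limit_bytes) * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ disk usage is above 90%"
        description: "Disk: {{ $value | printf \"%.1f\" }}%"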

    - alert: RabbitMQMemoryHigh
      expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ memory usage is above 85% of the high watermark"
        description: "Memory usage: {{ $value | humanizePercentage }}"

    - alert: RabbitMQConnectionsHigh
      expr: rabbitmq_connections > 5000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ connection count is high"
        description: "Current connections: {{ $value }}"

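    - alert: RabbitMQUnackedMessages
      # The queue and vhost labels are only present when per-object metrics are
      # exposed (prometheus.return_per_object_metrics = true, or scraping
      # /metrics/per-object); with aggregated metrics they are empty.
      expr: sum by(hostname, queue, vhost) (rabbitmq_queue_messages_unacked) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ queue has a backlog of unacknowledged messages"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} unacked messages: {{ $value }}"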

    - alert: RabbitMQMessagesReady
      expr: sum by(hostname, queue, vhost) (rabbitmq_queue_messages_ready) > 10000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ queue has a backlog of ready messages"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} ready messages: {{ $value }}"

    - alert: RabbitMQConsumersLow
      expr: rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ queue has no consumers"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} currently has no consumers"
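
The two ratio-based resource rules (file descriptors and memory) reduce to simple arithmetic, which can be handy to reproduce when tuning thresholds offline. The following is a minimal sketch of that arithmetic only; `RatioRule`, `RULES` and `breached` are hypothetical names for this example, and the `for:` duration that Prometheus applies before firing is deliberately not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RatioRule:
    name: str          # alert name from the rules above
    numerator: str     # metric holding the current usage
    denominator: str   # metric holding the limit
    threshold: float   # fire when numerator / denominator exceeds this


RULES = [
    RatioRule("RabbitMQFileDescriptors",
              "rabbitmq_process_open_fds",
              "rabbitmq_process_max_fds", 0.80),
    RatioRule("RabbitMQMemoryHigh",
              "rabbitmq_process_resident_memory_bytes",
              "rabbitmq_resident_memory_limit_bytes", 0.85),
]


def breached(samples: dict[str, float], rules=RULES) -> list[str]:
    """Return 'alert: ratio' strings for every rule whose ratio is over threshold."""
    results = []
    for rule in rules:
        used = samples.get(rule.numerator)
        limit = samples.get(rule.denominator)
        if used is None or not limit:
            continue  # metric missing or limit is zero: nothing to compare
        ratio = used / limit
        if ratio > rule.threshold:
            results.append(f"{rule.name}: {ratio:.1%}")
    return results
```

For instance, 900 open descriptors out of 1000 gives a ratio of 0.9 and trips the first rule, while 1 GB resident against a 2 GB watermark stays below 0.85.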

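Pitfalls

Pitfall 1: configuration drift with rabbitmq-plugins enable

Symptom: after enabling the plugin with docker exec rabbitmq rabbitmq-plugins enable rabbitmq_prometheus, it is lost again when the container restarts.

Diagnosis:

# Checked right after enabling: the plugin is loaded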
$ docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus
[E*] rabbitmq_prometheus       4.2.2

# But the enabled_plugins file has not been updated
$ docker exec rabbitmq cat /etc/rabbitmq/enabled_plugins
[rabbitmq_management].

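Cause: in this setup, rabbitmq-plugins enable changes only the running node's state inside the container. The change never reaches /etc/rabbitmq/enabled_plugins, which Compose supplies from configs, so when the container comes back the node loads its plugin list from that file again and the plugin is gone.

Fix: edit the configs.rabbitmq-plugins.content field in docker-compose.yaml and put the plugin name in the configuration: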

configs:
  rabbitmq-plugins:
    content: "[rabbitmq_management,rabbitmq_prometheus]."

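After the change, run docker compose down && docker compose up -d once; only then is the setting truly persistent.

Using rabbitmq-plugins enable for a quick check while debugging is fine, but production must change the configuration file.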
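
The enabled_plugins content is a small Erlang term: a bracketed, comma-separated list of plugin names terminated by a period. If the Compose file is generated by a script, a helper along these lines could keep that string well-formed. This is only a sketch; `parse_enabled_plugins`, `render_enabled_plugins` and `ensure_plugin` are hypothetical helper names, and the parser handles just the simple one-line form used in this post.

```python
def parse_enabled_plugins(content: str) -> list[str]:
    """Parse '[a,b].' into ['a', 'b'] (simple one-line form only)."""
    body = content.strip()
    if not (body.startswith("[") and body.endswith("].")):
        raise ValueError(f"not an enabled_plugins term: {content!r}")
    inner = body[1:-2]  # drop the leading '[' and the trailing '].'
    return [name.strip() for name in inner.split(",") if name.strip()]


def render_enabled_plugins(plugins: list[str]) -> str:
    """Render plugin names as an Erlang list term, dropping duplicates."""
    unique = []
    for name in plugins:
        if name not in unique:
            unique.append(name)
    return "[" + ",".join(unique) + "]."


def ensure_plugin(content: str, plugin: str) -> str:
    """Return content with plugin present, leaving existing entries in order."""
    return render_enabled_plugins(parse_enabled_plugins(content) + [plugin])


if __name__ == "__main__":
    print(ensure_plugin("[rabbitmq_management].", "rabbitmq_prometheus"))
```

Calling `ensure_plugin` on a string that already lists the plugin returns it unchanged, so the helper is safe to run repeatedly.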

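Pitfall 2: metric names and availability changed in 4.x

Symptom: alert rules written from older tutorials return no data in Prometheus:

# Older (≤ 3.x) style, no longer applicable in 4.x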
rabbitmq_partition > 0
rabbitmq_process_open_tcp_sockets / rabbitmq_process_max_tcp_sockets > 0.8
(1 - rabbitmq_disk_space_available / rabbitmq_disk_space_capacity) * 100 > 90

Diagnosis:

# Scrape once and see which metrics actually exist
$ curl -s http://localhost:15692/metrics | grep -E '^(rabbitmq_partition|rabbitmq_process_open_tcp_sockets|rabbitmq_disk_space_capacity)'
# No output

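Cause: RabbitMQ 4.x removed or renamed a batch of metrics:

| Older versions (≤ 3.x) | Newer versions (≥ 4.0) |
| --- | --- |
| rabbitmq_partition | removed |
| rabbitmq_process_open_tcp_sockets | no longer exposed |
| rabbitmq_disk_space_capacity | replaced by rabbitmq_disk_space_available_limit_bytes |
| rabbitmq_app_info | replaced by rabbitmq_build_info |

Fix: before writing expressions, check the real metric set of the running version: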

curl -s http://localhost:15692/metrics \
  | grep -E '^[a-z]' | awk '{print $1}' | sed 's/{.*//' | sort -u > /tmp/rabbitmq-metrics.txt
# Then grep for the categories you need
grep -E '^(rabbitmq_queue|rabbitmq_connections|rabbitmq_disk|rabbitmq_process)' /tmp/rabbitmq-metrics.txt
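
Once the metric list is saved, rule expressions can be checked against it mechanically instead of by eye. The sketch below is one way to do that, assuming the names were loaded into a Python set. `missing_metrics` is a hypothetical helper, and the regular expression is a heuristic: it treats any token starting with `rabbitmq_` or `erlang_` as a metric name, so a label value with that prefix would be flagged too.

```python
import re

# Heuristic: metric names in this post all start with one of these prefixes.
METRIC_RE = re.compile(r"\b(?:rabbitmq|erlang)_[A-Za-z0-9_]+")


def missing_metrics(expressions: list[str], exposed: set[str]) -> dict[str, list[str]]:
    """Map each expression to the metric names it uses that are not exposed."""
    report = {}
    for expr in expressions:
        absent = sorted(set(METRIC_RE.findall(expr)) - exposed)
        if absent:
            report[expr] = absent
    return report


if __name__ == "__main__":
    # Illustrative subset of an exposed metric list.
    exposed = {"rabbitmq_connections", "rabbitmq_process_open_fds",
               "rabbitmq_process_max_fds"}
    rules = [
        "rabbitmq_partition > 0",
        "rabbitmq_process_open_fds / rabbitmq_process_max_fds > 0.8",
    ]
    for expr, names in missing_metrics(rules, exposed).items():
        print(f"{expr!r} references missing metrics: {', '.join(names)}")
```

An empty result means every metric referenced by the rules was present in the scrape the list came from.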

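Pitfall 3: RabbitMQConsumersLow false positive when there are no queues

Symptom: right after deploying Prometheus, with no queues on the instance yet, RabbitMQConsumersLow sits in pending and turns firing after 5 minutes.

Diagnosis:

# Inspect the active alerts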
$ curl -s 'http://<prometheus-host>:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname=="RabbitMQConsumersLow")'
{
  "labels": {
    "alertname": "RabbitMQConsumersLow",
    "hostname": "<rabbitmq-host>",
    "severity": "critical"
  },
  "state": "pending",
  "value": "0e+00"
}

# The queue count is 0
$ curl -s 'http://<prometheus-host>:9090/api/v1/query?query=rabbitmq_queues'
{"data":{"result":[{"value":[..., "0"]}]}}

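Cause: on an instance with no queues, rabbitmq_queue_consumers is still exported with a value of 0 (see "value": "0e+00" in the alert above), so the < 1 comparison holds and the alert stays pending. PromQL does not treat an empty result as true: a comparison against a metric that returns no series yields nothing and cannot fire. The problem here is a series that exists and reports 0.

Fix: add an and rabbitmq_queues > 0 guard to the expression: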

- alert: RabbitMQConsumersLow
  expr: rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0   # key guard
  for: 5m
  ...

After a reload, the alert no longer triggers while the queue count is 0 and fires only when queues exist without consumers.
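
To make the filtering behavior concrete, here is a toy model of how PromQL comparison filters and the `and` operator act on instant vectors. It is a simplified sketch, not Prometheus code: a vector is modeled as a dict from label set to value, the `host` label set is hypothetical, and `keep_if`, `and_` and `consumers_low` are names invented for the example.

```python
# A toy instant vector: label set -> sample value. The metric name is left
# out because PromQL ignores it when matching series across operands.
Vector = dict[frozenset, float]


def keep_if(vector: Vector, predicate) -> Vector:
    """Model a comparison filter such as '< 1': keep series whose value passes."""
    return {labels: value for labels, value in vector.items() if predicate(value)}


def and_(left: Vector, right: Vector) -> Vector:
    """Model 'left and right': keep left series whose label set also exists in right."""
    return {labels: value for labels, value in left.items() if labels in right}


def consumers_low(consumers: Vector, queues: Vector) -> Vector:
    """Model: rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0."""
    return and_(keep_if(consumers, lambda v: v < 1),
                keep_if(queues, lambda v: v > 0))


host = frozenset({("hostname", "mq-1")})  # hypothetical label set

if __name__ == "__main__":
    print(keep_if({}, lambda v: v < 1))                # empty input: nothing can fire
    print(keep_if({host: 0.0}, lambda v: v < 1))       # a series at 0 passes '< 1'
    print(consumers_low({host: 0.0}, {host: 0.0}))     # guard drops it: no queues
    print(consumers_low({host: 0.0}, {host: 2.0}))     # queues exist: alert candidate
```

The model also shows why the guard works only when both sides carry the same label set; if one side had extra labels such as queue or vhost, the real expression would need an explicit `on(...)` matching clause.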

Verification checklist

# 1. Plugin is enabled
ssh <rabbitmq-host> "docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus"
# Expected: [E*] rabbitmq_prometheus  4.2.2

# 2. Endpoint returns 200
curl -s -o /dev/null -w 'HTTP %{http_code}\n' http://<rabbitmq-host>:15692/metrics
# Expected: HTTP 200

# 3. promtool check
ssh <monitor-host> "sudo docker exec prometheus promtool check config /etc/prometheus/prometheus.yml"
# Expected: SUCCESS, 45 rules found

# 4. Target health
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/targets \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); \
  [print(t[\"labels\"].get(\"job\"), t[\"health\"]) \
  for t in d[\"data\"][\"activeTargets\"] if t[\"labels\"].get(\"job\")==\"rabbitmq\"]'"
# Expected: rabbitmq up

# 5. Key metrics
# -G with --data-urlencode keeps curl from treating {...} in the query as a URL glob
ssh <monitor-host> "curl -sG http://localhost:9090/api/v1/query --data-urlencode 'query=up{job=\"rabbitmq\"}'"
ssh <monitor-host> "curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_connections'"
ssh <monitor-host> "curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_queues'"

# 6. Number of loaded rules
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/rules \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); \
  [print(g[\"name\"], len(g[\"rules\"])) \
  for g in d[\"data\"][\"groups\"] if g[\"name\"]==\"rabbitmq_alerts\"]'"
# Expected: rabbitmq_alerts 9

# 7. No active alerts (ConsumersLow does not fire on an instance with no queues)
# Filter on alertname: rules that aggregate with sum by(...) drop the job label
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/alerts \
  | python3 -c 'import sys,json; d=json.load(sys.stdin); \
  print(\"Active RabbitMQ alerts:\", len([a for a in d[\"data\"][\"alerts\"] if a[\"labels\"].get(\"alertname\",\"\").startswith(\"RabbitMQ\")]))'"
# Expected: 0
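
The inline python3 one-liners get hard to read once quoting is layered through ssh. As an alternative, the alert check could live in a small script on the monitoring host. The sketch below parses a saved /api/v1/alerts response and lists RabbitMQ alerts by name and state; `rabbitmq_alert_states` and `SAMPLE` are illustrative names, and the sample payload is made up to match the response shape shown earlier.

```python
import json


def rabbitmq_alert_states(payload: str, prefix: str = "RabbitMQ") -> list[tuple[str, str]]:
    """Return sorted (alertname, state) pairs for alerts whose name starts with prefix."""
    alerts = json.loads(payload)["data"]["alerts"]
    return sorted(
        (alert["labels"]["alertname"], alert["state"])
        for alert in alerts
        if alert["labels"].get("alertname", "").startswith(prefix)
    )


# Made-up response body in the shape returned by /api/v1/alerts.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"labels": {"alertname": "RabbitMQConsumersLow", "severity": "critical"},
         "state": "pending", "value": "0e+00"},
        {"labels": {"alertname": "NodeDown", "severity": "critical"},
         "state": "firing", "value": "1e+00"},
    ]},
})

if __name__ == "__main__":
    for name, state in rabbitmq_alert_states(SAMPLE):
        print(name, state)
```

Matching on the alert name prefix rather than on a job label keeps alerts from aggregated expressions in the result.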

Summary

Steps to hook RabbitMQ up to Prometheus:

  1. docker-compose.yaml: add rabbitmq_prometheus to enabled_plugins and publish port 15692
  2. docker compose down && docker compose up -d (not restart)
  3. prometheus.yml: add the rabbitmq job (no authentication needed)
  4. alerts.yml: add the 9 rabbitmq_alerts rules
  5. Sync to the monitoring server and restart prometheus

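What the queue alerts point to:

  • RabbitMQMessagesReady > 10000: messages arrive faster than they are consumed
  • RabbitMQUnackedMessages > 1000: consumers have taken messages but not acked them, possibly because processing is slow or hung
  • RabbitMQConsumersLow: queues exist but no consumer is subscribed, so messages pile up without bound

These 3, together with RabbitMQDown / NotRunning, are the core RabbitMQ alerts. The resource alerts (FD / Disk / Memory / Connections) are secondary, and their thresholds are best tuned during capacity planning.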