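Monitoring series:
- CouchDB log alerting → Alloy + Loki + LogQL
- Windmill log alerting → Alloy + Loki + LogQL
- Alert notifications to Feishu → Alertmanager → transformer → OpenClaw
- MinIO metrics alerting → built-in metrics + a one-line environment variable
- Qdrant metrics alerting → built-in metrics + API key authentication
- RabbitMQ metrics alerting (this post) → plugin exposes the 15692 endpoint + 9 rules
Like the MinIO and Qdrant posts, this one follows the "built-in metrics" route: no separate exporter has to be deployed. RabbitMQ's Prometheus endpoint is not enabled by default, though; rabbitmq_prometheus has to be added to enabled_plugins before the node listens on port 15692.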
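RabbitMQ Prometheus endpoint
| Item | Value |
|---|---|
| Image | rabbitmq:4.2.2-management-alpine |
| Plugin | rabbitmq_prometheus (bundled in the image, must be enabled manually) |
| Port | 15692 |
| Path | /metrics |
| Authentication | None required |
Once enabled, the 15692 endpoint and the 15672 management UI are two independent listeners. The 15692 endpoint does not go through the management API's authentication; the rabbitmq_prometheus plugin serves the Prometheus text format directly.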
Configuration steps
1. Modify docker-compose.yaml
Two changes:
services:
  rabbitmq:
    image: rabbitmq:4.2.2-management-alpine
    container_name: rabbitmq
    restart: always
    ports:
      - 5672:5672
      - 15672:15672
      - 15692:15692  # added: Prometheus metrics endpoint
    env_file: .env
    environment:
      RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS: "-rabbitmq_management_agent disable_metrics_collector false"
    configs:
      - source: rabbitmq-plugins
        target: /etc/rabbitmq/enabled_plugins
    volumes:
      - rabbitmq-lib:/var/lib/rabbitmq/
      - rabbitmq-log:/var/log/rabbitmq
configs:
  rabbitmq-plugins:
    content: "[rabbitmq_management,rabbitmq_prometheus]."  # added rabbitmq_prometheus
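The rabbitmq_prometheus plugin is already compiled into the image; it just isn't loaded by default. It becomes active only once the enabled_plugins file has been changed and the node is started again with the new file.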
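2. Restart the container
cd /opt/rabbitmq
docker compose down && docker compose up -d
docker compose restart does not apply changes made to docker-compose.yaml (the new port mapping and the updated enabled_plugins content), so the container has to be recreated with down + up.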
3. Verify
# Plugin is enabled (E = explicitly enabled, * = running)
$ docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus
[E*] rabbitmq_prometheus 4.2.2
# The 15692 endpoint returns 200
$ curl -s -o /dev/null -w 'HTTP %{http_code} size=%{size_download} bytes\n' http://localhost:15692/metrics
HTTP 200 size=199205 bytes
Prometheus scrape configuration
The 15692 endpoint needs no authentication, so the job is minimal:
- job_name: rabbitmq
  metrics_path: /metrics
  static_configs:
    - targets:
        - <rabbitmq-host>:15692
      labels:
        hostname: <rabbitmq-host>
Deploy to the monitoring server:
scp monitor/prometheus/prometheus.yml <monitor-host>:/tmp/
ssh <monitor-host> "
sudo cp /tmp/prometheus.yml /opt/monitor/prometheus/ &&
cd /opt/monitor && sudo docker compose restart prometheus
"
Exposed metrics
Metrics that are always exposed
| Category | Metrics |
|---|---|
| Erlang VM | erlang_vm_* (about 50 metrics, covering memory/atom/process/scheduler) |
| Process | rabbitmq_process_open_fds, rabbitmq_process_max_fds, rabbitmq_process_resident_memory_bytes |
| Memory watermark | rabbitmq_resident_memory_limit_bytes (high watermark; 22 GB with the default settings on this host) |
| Disk | rabbitmq_disk_space_available_bytes, rabbitmq_disk_space_available_limit_bytes (free-space low watermark) |
| Global counts | rabbitmq_connections, rabbitmq_queues, rabbitmq_connections_opened_total, rabbitmq_connections_closed_total |
| Queue state | rabbitmq_queue_messages_ready, rabbitmq_queue_messages_unacked, rabbitmq_queue_consumers |
| Queue totals | rabbitmq_queue_messages_acked_total, rabbitmq_queue_messages_published_total, rabbitmq_queue_messages_delivered_total |
| Queue resources | rabbitmq_queue_messages_bytes, rabbitmq_queue_process_memory_bytes |
| Metadata | rabbitmq_build_info |
Metrics removed or renamed in 4.x
| Metric from older versions/tutorials | Actual status | Alternative |
|---|---|---|
| rabbitmq_partition | Removed in 4.x | Infer from abnormal increases in rabbitmq_connections_closed_total |
| rabbitmq_process_open_tcp_sockets | Not exposed in 4.x | Use rabbitmq_connections instead |
| rabbitmq_disk_space_capacity_bytes | Wrong name | The actual metric is rabbitmq_disk_space_available_limit_bytes |
| rabbitmq_app_info | Does not exist | The actual metric is rabbitmq_build_info |
Before writing alert rules, check the real metric set of the current instance with:
curl -s http://localhost:15692/metrics \
  | grep -E '^[a-z]' | awk '{print $1}' | sed 's/{.*//' | sort -u
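The same name extraction can be sketched in Python, which is handy when the list feeds later checks. This is a minimal sketch rather than a full exposition-format parser; the SAMPLE payload is hypothetical, and real input would be the body returned by http://localhost:15692/metrics.

```python
import re

# A metric name is the leading identifier on a sample line,
# i.e. everything before the first "{" or whitespace.
NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")

def metric_names(exposition_text: str) -> list[str]:
    """Return the sorted, unique metric names in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = NAME_RE.match(line)
        if match:
            names.add(match.group(0))
    return sorted(names)

# Hypothetical sample payload, trimmed for illustration.
SAMPLE = """\
# TYPE rabbitmq_connections gauge
rabbitmq_connections 3
# TYPE rabbitmq_queue_messages_ready gauge
rabbitmq_queue_messages_ready{queue="orders",vhost="/"} 12
rabbitmq_queue_messages_ready{queue="emails",vhost="/"} 0
rabbitmq_build_info{rabbitmq_version="4.2.2"} 1
"""

for name in metric_names(SAMPLE):
    print(name)
```

Each name appears once regardless of how many labelled series it has, which matches what the sort -u step does in the shell pipeline.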
Alert rules
Add a rabbitmq_alerts group to alerts.yml with 9 rules:
| Category | Alert name | Expression | Threshold | Duration | Severity |
|---|---|---|---|---|---|
| Availability | RabbitMQDown | up{job="rabbitmq"} == 0 | - | 1m | critical |
| Health | RabbitMQNotRunning | absent(rabbitmq_build_info{job="rabbitmq"}) | - | 1m | critical |
| Resources | RabbitMQFileDescriptors | FD usage | > 80% | 5m | warning |
| Resources | RabbitMQDiskSpace | Disk usage | > 90% | 5m | critical |
| Resources | RabbitMQMemoryHigh | Memory / high watermark | > 85% | 5m | warning |
| Traffic | RabbitMQConnectionsHigh | Connection count | > 5000 | 5m | warning |
| Queue | RabbitMQUnackedMessages | Unacknowledged messages per queue | > 1000 | 5m | warning |
| Queue | RabbitMQMessagesReady | Ready messages per queue | > 10000 | 10m | warning |
| Queue | RabbitMQConsumersLow | Consumers < 1 and queues > 0 | - | 5m | critical |
Full YAML:
- name: rabbitmq_alerts
  interval: 30s
  rules:
    - alert: RabbitMQDown
      expr: up{job="rabbitmq"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ is down"
        description: "RabbitMQ has not responded to scrapes for more than 1 minute"
    - alert: RabbitMQNotRunning
      expr: absent(rabbitmq_build_info{job="rabbitmq"})
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ node is not running"
        description: "The node is not reporting build_info; the process may have crashed or the plugin may not be enabled"
    - alert: RabbitMQFileDescriptors
      # Scaled by 100 so that $value is a percentage, matching the "%" in the description.
      expr: rabbitmq_process_open_fds / rabbitmq_process_max_fds * 100 > 80
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ file descriptor usage is high"
        description: "FD usage: {{ $value | printf \"%.1f\" }}%"
    - alert: RabbitMQDiskSpace
      # Caveat: rabbitmq_disk_space_available_limit_bytes is the free-space low watermark
      # (disk_free_limit), not the disk capacity; confirm this ratio expresses the condition you want.
      expr: (1 - rabbitmq_disk_space_available_bytes / rabbitmq_disk_space_available_limit_bytes) * 100 > 90
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ disk usage is above 90%"
        description: "Disk: {{ $value | printf \"%.1f\" }}%"
    - alert: RabbitMQMemoryHigh
      # Scaled by 100 so that $value is a percentage, matching the "%" in the description.
      expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100 > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ memory usage is above 85% of the high watermark"
        description: "Memory usage: {{ $value | printf \"%.1f\" }}%"
    - alert: RabbitMQConnectionsHigh
      expr: rabbitmq_connections > 5000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ connection count is high"
        description: "Current connections: {{ $value }}"
    - alert: RabbitMQUnackedMessages
      expr: sum by(hostname, queue, vhost) (rabbitmq_queue_messages_unacked) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ queue has a backlog of unacknowledged messages"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} unacknowledged messages: {{ $value }}"
    - alert: RabbitMQMessagesReady
      expr: sum by(hostname, queue, vhost) (rabbitmq_queue_messages_ready) > 10000
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "RabbitMQ queue has a backlog of ready messages"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} ready messages: {{ $value }}"
    - alert: RabbitMQConsumersLow
      expr: rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "RabbitMQ queue has no consumers"
        description: "Queue {{ $labels.vhost }}/{{ $labels.queue }} currently has no consumers"
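One detail worth checking in percentage-style rules is that the expression and the annotation agree on units: a ratio between 0 and 1 rendered through printf "%.1f" with a trailing "%" reads 100 times too small. The arithmetic can be sketched in Python; the file descriptor counts below are made up for illustration.

```python
def format_annotation(value: float) -> str:
    """Mimic a `printf "%.1f"` template followed by a literal percent sign."""
    return "%.1f%%" % value

# Hypothetical sample: 870 open file descriptors out of a limit of 1000.
ratio = 870 / 1000

print(format_annotation(ratio))        # 0.9%  (ratio rendered as if it were a percentage)
print(format_annotation(ratio * 100))  # 87.0% (value scaled to a percentage first)
```

Scaling inside the expression keeps the threshold and the rendered value in the same unit, so the number in the notification is the number that was compared.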
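Pitfalls
Pitfall 1: configuration drift with rabbitmq-plugins enable
Symptom: after enabling the plugin with docker exec rabbitmq rabbitmq-plugins enable rabbitmq_prometheus, it is gone again once the container restarts.
Diagnosis:
# Right after enabling: the plugin is loaded
$ docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus
[E*] rabbitmq_prometheus 4.2.2
# But the enabled_plugins file has not been updated
$ docker exec rabbitmq cat /etc/rabbitmq/enabled_plugins
[rabbitmq_management].
Cause: in this setup, rabbitmq-plugins enable changes the runtime state inside the container but does not write the change back to /etc/rabbitmq/enabled_plugins. After a restart the node loads its plugins from that file again, so the change is lost.
Fix: edit the configs.rabbitmq-plugins.content field in docker-compose.yaml and put the plugin name into the config:
configs:
  rabbitmq-plugins:
    content: "[rabbitmq_management,rabbitmq_prometheus]."
Run docker compose down && docker compose up -d once after the change; only then is the configuration actually persistent.
Using rabbitmq-plugins enable for a quick check while debugging is fine, but in production the configuration file must be changed.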
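This kind of drift can be caught by comparing the content of enabled_plugins against the plugins you expect. The sketch below parses the flat Erlang-term list used in this post; it is a minimal illustration, the function names are hypothetical, and it does not handle comments or nested terms.

```python
import re

def parse_enabled_plugins(content: str) -> list[str]:
    """Parse a flat Erlang list such as "[a,b]." into plugin names."""
    match = re.fullmatch(r"\s*\[(.*)\]\.\s*", content, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"not an enabled_plugins term: {content!r}")
    return [name.strip() for name in match.group(1).split(",") if name.strip()]

def missing_plugins(content: str, required: list[str]) -> list[str]:
    """Return the required plugins that the file content does not list."""
    enabled = set(parse_enabled_plugins(content))
    return [name for name in required if name not in enabled]

REQUIRED = ["rabbitmq_management", "rabbitmq_prometheus"]

# File content seen in the diagnosis above: the Prometheus plugin is missing.
print(missing_plugins("[rabbitmq_management].", REQUIRED))
# File content after the docker-compose.yaml change: nothing is missing.
print(missing_plugins("[rabbitmq_management,rabbitmq_prometheus].", REQUIRED))
```

The input string would come from the output of docker exec rabbitmq cat /etc/rabbitmq/enabled_plugins.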
Pitfall 2: metric names and availability changed in 4.x
Symptom: alert rules written from tutorials for older versions return no data in Prometheus:
# Older (≤ 3.x) style, no longer valid on 4.x
rabbitmq_partition > 0
rabbitmq_process_open_tcp_sockets / rabbitmq_process_max_tcp_sockets > 0.8
(1 - rabbitmq_disk_space_available / rabbitmq_disk_space_capacity) * 100 > 90
Diagnosis:
# Scrape once and see which metrics exist
$ curl -s http://localhost:15692/metrics | grep -E '^(rabbitmq_partition|rabbitmq_process_open_tcp_sockets|rabbitmq_disk_space_capacity)'
# No output
Cause: RabbitMQ 4.x removed or renamed a batch of metrics:
| Older versions (≤ 3.x) | Newer versions (≥ 4.0) |
|---|---|
| rabbitmq_partition | Removed |
| rabbitmq_process_open_tcp_sockets | No longer exposed |
| rabbitmq_disk_space_capacity | Renamed to rabbitmq_disk_space_available_limit |
| rabbitmq_app_info | Renamed to rabbitmq_build_info |
Fix: before writing expressions, check the real metric set of the running version:
curl -s http://localhost:15692/metrics \
  | grep -E '^[a-z]' | awk '{print $1}' | sed 's/{.*//' | sort -u > /tmp/rabbitmq-metrics.txt
# Then grep for the categories you need
grep -E '^(rabbitmq_queue|rabbitmq_connections|rabbitmq_disk|rabbitmq_process)' /tmp/rabbitmq-metrics.txt
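Once the metric list is saved, rule expressions can be checked against it mechanically. The sketch below pulls rabbitmq_* and erlang_vm_* identifiers out of each expression with a regular expression and reports the ones the instance does not expose. It is a rough heuristic, not a PromQL parser; the AVAILABLE set is a hand-picked subset used for illustration, where a real run would load /tmp/rabbitmq-metrics.txt.

```python
import re

# Heuristic: any identifier starting with rabbitmq_ or erlang_vm_ is treated as a metric name.
METRIC_RE = re.compile(r"\b(?:rabbitmq|erlang_vm)_[a-z0-9_]+\b")

def missing_metrics(expressions, available):
    """Map each expression to the referenced metric names absent from `available`."""
    known = set(available)
    report = {}
    for expr in expressions:
        missing = sorted(set(METRIC_RE.findall(expr)) - known)
        if missing:
            report[expr] = missing
    return report

# Illustrative subset of names; a real run would read /tmp/rabbitmq-metrics.txt instead.
AVAILABLE = {
    "rabbitmq_process_open_fds",
    "rabbitmq_process_max_fds",
    "rabbitmq_disk_space_available_bytes",
    "rabbitmq_disk_space_available_limit_bytes",
    "rabbitmq_build_info",
}

EXPRESSIONS = [
    "rabbitmq_partition > 0",
    "rabbitmq_process_open_tcp_sockets / rabbitmq_process_max_tcp_sockets > 0.8",
    "rabbitmq_process_open_fds / rabbitmq_process_max_fds > 0.8",
    'absent(rabbitmq_build_info{job="rabbitmq"})',
]

for expr, names in missing_metrics(EXPRESSIONS, AVAILABLE).items():
    print(f"{expr}\n  missing: {', '.join(names)}")
```

Label values such as job="rabbitmq" are not flagged because the pattern requires an underscore after the prefix.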
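Pitfall 3: RabbitMQConsumersLow false positive on an instance with no queues
Symptom: right after Prometheus is deployed, the instance has no queues at all, yet RabbitMQConsumersLow stays pending and turns firing after 5 minutes.
Diagnosis:
# Look at active alerts
$ curl -s 'http://<prometheus-host>:9090/api/v1/alerts' \
  | jq '.data.alerts[] | select(.labels.alertname=="RabbitMQConsumersLow")'
{
  "labels": {
    "alertname": "RabbitMQConsumersLow",
    "hostname": "<rabbitmq-host>",
    "severity": "critical"
  },
  "state": "pending",
  "value": "0e+00"
}
# The queue count is 0
$ curl -s 'http://<prometheus-host>:9090/api/v1/query?query=rabbitmq_queues'
{"data":{"result":[{"value":[..., "0"]}]}}
Cause: the alert above carries a value of 0 and no queue or vhost label, which means rabbitmq_queue_consumers is present as a node-level series reading 0 even though no queue exists. That is consistent with the endpoint serving aggregated metrics by default. Without a guard, rabbitmq_queue_consumers < 1 therefore holds on an empty instance. (A PromQL comparison over an empty vector returns an empty vector, so a truly absent series would not raise the alert.)
Fix: add an and rabbitmq_queues > 0 guard to the expression:
- alert: RabbitMQConsumersLow
  expr: rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0  # the key guard
  for: 5m
  ...
After the rules are reloaded, the alert no longer triggers when the queue count is 0 and triggers only when queues exist but have no consumers.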
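How the guard filters samples can be illustrated with a simplified Python model of PromQL instant vectors, where a vector is a dict from label set to value. This is a sketch under stated assumptions: the label values are made up, both metrics are assumed to carry identical label sets, and matching modifiers such as on() or ignoring() are not modeled.

```python
def below(vector, threshold):
    """Model `vector < threshold`: keep only samples whose value is below the threshold."""
    return {labels: value for labels, value in vector.items() if value < threshold}

def above(vector, threshold):
    """Model `vector > threshold`: keep only samples whose value is above the threshold."""
    return {labels: value for labels, value in vector.items() if value > threshold}

def logical_and(left, right):
    """Model `left and right`: keep left samples whose label set also exists on the right."""
    return {labels: value for labels, value in left.items() if labels in right}

def consumers_low(consumers, queues):
    """Model `rabbitmq_queue_consumers < 1 and rabbitmq_queues > 0`."""
    return logical_and(below(consumers, 1), above(queues, 0))

# Hypothetical label set shared by both node-level series.
NODE = frozenset({("hostname", "mq1"), ("job", "rabbitmq")})

print(len(below({NODE: 0}, 1)))                  # 1: unguarded rule matches the empty instance
print(len(consumers_low({NODE: 0}, {NODE: 0})))  # 0: guard suppresses it when there are no queues
print(len(consumers_low({NODE: 0}, {NODE: 3})))  # 1: queues exist, no consumers
print(len(consumers_low({}, {NODE: 3})))         # 0: no input samples, nothing to alert on
```

Because and matches on identical label sets, the guard relies on both series sharing their labels; if the consumer metric carried per-queue labels, the match would need an explicit on(...) clause.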
Verification checklist
# 1. Plugin is enabled
ssh <rabbitmq-host> "docker exec rabbitmq rabbitmq-plugins list -e | grep prometheus"
# Expected: [E*] rabbitmq_prometheus 4.2.2
# 2. Endpoint returns 200
curl -s -o /dev/null -w 'HTTP %{http_code}\n' http://<rabbitmq-host>:15692/metrics
# Expected: HTTP 200
# 3. promtool check
ssh <monitor-host> "sudo docker exec prometheus promtool check config /etc/prometheus/prometheus.yml"
# Expected: SUCCESS, 45 rules found
# 4. Target health
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/targets \
| python3 -c 'import sys,json; d=json.load(sys.stdin); \
[print(t[\"labels\"].get(\"job\"), t[\"health\"]) \
for t in d[\"data\"][\"activeTargets\"] if t[\"labels\"].get(\"job\")==\"rabbitmq\"]'"
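# Expected: rabbitmq up
# 5. Key metrics (--data-urlencode keeps curl from treating the braces as a URL glob)
ssh <monitor-host> "curl -s -G --data-urlencode 'query=up{job=\"rabbitmq\"}' http://localhost:9090/api/v1/query"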
ssh <monitor-host> "curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_connections'"
ssh <monitor-host> "curl -s 'http://localhost:9090/api/v1/query?query=rabbitmq_queues'"
# 6. Number of loaded rules
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/rules \
| python3 -c 'import sys,json; d=json.load(sys.stdin); \
[print(g[\"name\"], len(g[\"rules\"])) \
for g in d[\"data\"][\"groups\"] if g[\"name\"]==\"rabbitmq_alerts\"]'"
# Expected: rabbitmq_alerts 9
# 7. No active alerts (ConsumersLow does not trigger on an instance with no queues)
ssh <monitor-host> "curl -s http://localhost:9090/api/v1/alerts \
| python3 -c 'import sys,json; d=json.load(sys.stdin); \
print(\"Active RabbitMQ alerts:\", len([a for a in d[\"data\"][\"alerts\"] if a[\"labels\"].get(\"job\")==\"rabbitmq\"]))'"
# Expected: 0
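The inline python3 one-liners in checks 4, 6 and 7 are hard to read through two layers of quoting. As an alternative, the same filters can be sketched as small functions that take the decoded JSON of each Prometheus API response. The sample payloads below are hypothetical and trimmed to the fields the functions read.

```python
def target_health(targets_payload: dict, job: str) -> list[str]:
    """Health of every active target of `job` in a /api/v1/targets response."""
    return [
        target["health"]
        for target in targets_payload["data"]["activeTargets"]
        if target["labels"].get("job") == job
    ]

def rule_count(rules_payload: dict, group: str) -> int:
    """Number of rules loaded for `group` in a /api/v1/rules response."""
    return sum(
        len(g["rules"]) for g in rules_payload["data"]["groups"] if g["name"] == group
    )

def active_alerts(alerts_payload: dict, job: str) -> list[dict]:
    """Alerts in a /api/v1/alerts response whose `job` label equals `job`."""
    return [a for a in alerts_payload["data"]["alerts"] if a["labels"].get("job") == job]

# Hypothetical, trimmed responses for illustration only.
targets = {"data": {"activeTargets": [
    {"labels": {"job": "rabbitmq"}, "health": "up"},
    {"labels": {"job": "minio"}, "health": "up"},
]}}
rules = {"data": {"groups": [
    {"name": "rabbitmq_alerts", "rules": [{"name": f"rule{i}"} for i in range(9)]},
    {"name": "minio_alerts", "rules": [{"name": "MinIODown"}]},
]}}
alerts = {"data": {"alerts": []}}

print("rabbitmq", *target_health(targets, "rabbitmq"))
print("rabbitmq_alerts", rule_count(rules, "rabbitmq_alerts"))
print("Active RabbitMQ alerts:", len(active_alerts(alerts, "rabbitmq")))
```

In practice each payload would come from json.load on the corresponding curl output, piped in the same way as the one-liners.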
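Summary
Steps to hook RabbitMQ up to Prometheus:
- Edit docker-compose.yaml: add rabbitmq_prometheus to enabled_plugins and 15692 to ports
- Run docker compose down && docker compose up -d (note: not restart)
- Add the rabbitmq job to prometheus.yml (no authentication needed)
- Add the 9 rabbitmq_alerts rules to alerts.yml
- Sync to the monitoring server and restart prometheus
What the queue alerts tell you:
- RabbitMQMessagesReady > 10000: messages arrive faster than they are consumed
- RabbitMQUnackedMessages > 1000: consumers took messages but have not acked them; processing may be slow or hung
- RabbitMQConsumersLow: the queue exists but no consumer is subscribed, so messages pile up without bound
These three, together with RabbitMQDown / NotRunning, are the core RabbitMQ alerts; the resource rules (FD / Disk / Memory / Connections) are secondary, with thresholds to be tuned during capacity planning.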