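The previous post covered log alerting for CouchDB with Alloy + Loki. This post applies the same approach to Windmill.

Windmill is a workflow orchestration platform made up of several Docker containers: server (API service), worker (job execution), and worker_gpu (GPU jobs). The logs of these containers contain various runtime errors:

  • Insufficient API token permissions
  • Missing S3 storage configuration
  • Worker execution errors

None of these show up in Prometheus metrics; they are visible only in the container logs, so catching them promptly requires log-level alerting.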

Log format analysis

Windmill logs are structured JSON; each entry carries fields such as level, msg, and timestamp:

{"level":"ERROR","msg":"Permission denied. Required scope: jobs:run:flows","timestamp":"..."}
{"level":"ERROR","msg":"Storage _default_ not found at the workspace level","timestamp":"..."}

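The advantage of JSON is that Loki can extract fields directly with | json and then filter on exact field values, which is more robust than regex matching on plain text.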
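To make the idea concrete, here is a minimal local sketch of what `| json | level = "ERROR"` does conceptually: parse each line, then keep the entries whose level field equals the requested value. It is only an illustration, not how Loki implements it; the INFO line, the plain-text line, and the timestamps are invented sample data, and `filter_by_level` is a hypothetical helper name.

```python
import json

# Sample lines in the shape shown above. The timestamps, the INFO entry and
# the plain-text line are made up for illustration.
SAMPLE_LINES = [
    '{"level":"ERROR","msg":"Permission denied. Required scope: jobs:run:flows","timestamp":"2024-01-01T00:00:00Z"}',
    '{"level":"INFO","msg":"job completed","timestamp":"2024-01-01T00:00:01Z"}',
    "plain text line that is not JSON",
]


def filter_by_level(lines, level):
    """Keep the parsed entries whose level field equals the given value."""
    matched = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # no fields can be extracted from a non-JSON line
        if isinstance(record, dict) and record.get("level") == level:
            matched.append(record)
    return matched


errors = filter_by_level(SAMPLE_LINES, "ERROR")
print([entry["msg"] for entry in errors])
```

Because the comparison is on a parsed field rather than on the raw text, a message body that merely contains the word ERROR does not produce a false match.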

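Common Windmill ERROR types:

Error type           Meaning                                                   Severity
Permission denied    API token scope is insufficient                           warning
Storage not found    S3 storage configuration is missing                       critical
unshare isolation    Container isolation configuration missing (harmless)      ignorable
Worker ERROR         Job execution error                                       warning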
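The table can be read as a small classification function. The sketch below encodes it in Python under a few assumptions: the patterns for the first two rows match the sample messages shown earlier, while matching on the word "unshare" for the third row is a guess, since the exact message text is not shown in this post. `classify` and the category names are hypothetical.

```python
import re

# (pattern, category, severity), checked in order; the first match wins.
# The "unshare" pattern is an assumption about the message wording.
RULES = [
    (re.compile(r"Permission denied"), "permission_denied", "warning"),
    (re.compile(r"Storage.*not found"), "storage_not_found", "critical"),
    (re.compile(r"unshare"), "unshare_isolation", "ignorable"),
]


def classify(msg):
    """Map an ERROR message to (category, severity) following the table."""
    for pattern, category, severity in RULES:
        if pattern.search(msg):
            return category, severity
    # Anything else is treated like a generic worker/server error.
    return "other_error", "warning"


print(classify("Storage _default_ not found at the workspace level"))
```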

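Overall architecture

flowchart LR
    A["Windmill containers<br/>server / worker<br/>worker_gpu"] --> B["Alloy<br/>loki.source.docker"]
    B --> C["Loki<br/>log storage + Ruler"]
    C --> D["Alertmanager<br/>alert dedup / routing"]
    D --> E["alert-transformer<br/>formatting"]
    E --> F["OpenClaw"]
    F --> G["Feishu notification"]
    H["Prometheus<br/>metric alerts"] --> D

Log alerts and the existing metric alerts converge at Alertmanager and share the same notification path to Feishu.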

Alloy deployment

Compared with the CouchDB deployment in the previous post, the Alloy container setup is identical; only the relabel rules and the log processing in config.alloy differ.

docker-compose.yaml

services:
  alloy:
    image: m.daocloud.io/docker.io/grafana/alloy:v1.14.1
    container_name: alloy
    restart: unless-stopped
    ports:
      - 12345:12345
    volumes:
      - ./config.alloy:/etc/alloy/config.alloy:ro
      - /var/run/docker.sock:/var/run/docker.sock
    command:
      - run
      - --server.http.listen-addr=0.0.0.0:12345
      - --storage.path=/var/lib/alloy/data
      - /etc/alloy/config.alloy

config.alloy

// Discover Docker containers
discovery.docker "local" {
  host             = "unix:///var/run/docker.sock"
  refresh_interval = "5s"
}

// Label each container and drop unrelated ones
discovery.relabel "local" {
  targets = discovery.docker.local.targets

  // server
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_server-1"
    target_label  = "container"
    replacement   = "windmill_server-1"
  }
  // worker (two replicas)
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker-1"
    target_label  = "container"
    replacement   = "windmill_worker-1"
  }
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker-2"
    target_label  = "container"
    replacement   = "windmill_worker-2"
  }
  // worker_gpu
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-windmill_worker_gpu-1"
    target_label  = "container"
    replacement   = "windmill_worker_gpu-1"
  }
  // Drop caddy (reverse proxy) and windmill_extra (LSP debugger).
  // Relabel regexes are fully anchored, so the trailing .* covers the replica suffix.
  rule {
    source_labels = ["__meta_docker_container_name"]
    regex         = "/windmill-(caddy|windmill_extra).*"
    action        = "drop"
  }
}

loki.source.docker "local" {
  host             = "unix:///var/run/docker.sock"
  targets          = discovery.relabel.local.output
  forward_to       = [loki.process.local.receiver]
  refresh_interval = "5s"
}

// Parse the JSON log line and promote level to a label
loki.process "local" {
  stage.json {
    expressions = {
      level = "level",
      msg   = "msg",
    }
  }
  stage.labels {
    values = {
      level     = "",
      container = "",
    }
  }
  forward_to = [loki.write.remote.receiver]
}

// Push to Loki
loki.write "remote" {
  endpoint {
    url       = "http://loki.example.com:3100/loki/api/v1/push"
    tenant_id = "my-tenant"
  }
}

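Configuration notes

  1. Mount /var/run/docker.sock: this lets Alloy discover Docker containers and read their logs.

  2. stage.json + stage.labels: unlike in the CouchDB post, Windmill's JSON logs allow the level and msg fields to be extracted. stage.json extracts the fields and stage.labels promotes level to a label, so LogQL can filter on level = "ERROR" directly against the stream label. Only rules that also filter on msg still need | json, because msg is not promoted to a label.

  3. Skip unrelated containers: the logs of caddy (reverse proxy) and windmill_extra (LSP debugger) have nothing to do with runtime errors; dropping them reduces storage and alert noise.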
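One detail that is easy to miss when writing relabel rules: the regex has to match the whole value, because Alloy follows Prometheus relabeling, where patterns are anchored at both ends. The Python sketch below imitates that behaviour with `re.fullmatch`. It assumes the caddy and windmill_extra containers are named with the same project prefix and replica suffix as the Windmill services; `relabel_matches` is a hypothetical helper, and Python's regex engine stands in for Go's RE2, which behaves the same for patterns this simple.

```python
import re


def relabel_matches(regex, value):
    """Imitate relabel matching: the pattern must cover the entire value."""
    return re.fullmatch(regex, value) is not None


# Assumed container names as exposed by Docker discovery (leading slash).
names = [
    "/windmill-caddy-1",
    "/windmill-windmill_extra-1",
    "/windmill-windmill_server-1",
]

prefix_only = "/windmill-(caddy|windmill_extra)"
with_suffix = "/windmill-(caddy|windmill_extra).*"

for name in names:
    print(name, relabel_matches(prefix_only, name), relabel_matches(with_suffix, name))
```

The prefix-only pattern matches none of these names because of the trailing replica index, while the variant ending in `.*` matches the two unrelated containers and still leaves windmill_server alone.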

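Container name pitfall

Docker Compose appends a replica index to the service name when naming containers (the full Docker name also carries the project prefix, e.g. windmill-windmill_server-1):

Docker Compose service name      Actual container name
windmill_server                  windmill_server-1
windmill_worker (replicas: 2)    windmill_worker-1, windmill_worker-2
windmill_worker_gpu              windmill_worker_gpu-1

Query the label values in Loki to confirm the actual container names: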
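The naming scheme can be written down as a small helper, which is handy for generating the regex values used in the relabel rules. This sketch assumes Docker Compose v2 default naming (`<project>-<service>-<index>`) and a project named `windmill`, as implied by the regexes in config.alloy; both function names are hypothetical.

```python
def compose_container_names(project, service, replicas=1):
    """Default Compose v2 container names: <project>-<service>-<index>."""
    return [f"{project}-{service}-{index}" for index in range(1, replicas + 1)]


def docker_meta_name(container_name):
    """Docker discovery reports the container name with a leading slash."""
    return "/" + container_name


workers = compose_container_names("windmill", "windmill_worker", replicas=2)
print([docker_meta_name(name) for name in workers])
```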

curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/label/container/values"

Loki Ruler alerting rules

Enabling the Ruler

As described in the previous post, enable the Ruler in loki-config.yaml:

ruler:
  enable_api: true
  enable_alertmanager_v2: true
  alertmanager_url: http://alertmanager:9093
  poll_interval: 30s
  storage:
    type: local
    local:
      directory: /loki/rules

Rule file

Create <rules_dir>/<tenant_id>/windmill.yaml:

groups:
  - name: windmill_errors
    interval: 30s
    rules:
      # Too many server errors (aggregated, debounced with for)
      - alert: WindmillServerError
        expr: |
          sum(count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR" [1m])) > 3
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Windmill server: too many errors"
          description: "{{ $value }} ERROR entries in the last minute (condition held for 2 minutes)"

      # API permission problems (match on the msg field; label-filter
      # regexes are fully anchored, hence the trailing .*)
      - alert: WindmillPermissionDenied
        expr: |
          count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR"
            | msg =~ "Permission denied.*" [5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Windmill API permission denied"
          description: "The API token may have expired or lacks the required scope"

      # S3 storage configuration problems (regex anchored, hence the trailing .*)
      - alert: WindmillS3Error
        expr: |
          count_over_time({container="windmill_server-1"}
            | json
            | level = "ERROR"
            | msg =~ "Storage.*not found.*" [5m]) > 0
        labels:
          severity: critical
        annotations:
          summary: "Windmill S3 storage misconfigured"
          description: "S3 storage is unavailable; file access is affected"

      # Worker execution errors (note: windmill_worker-\d+$ matches only the two workers, not worker_gpu)
      - alert: WindmillWorkerError
        expr: |
          sum(count_over_time({container=~"windmill_worker-\\d+$"}
            | json
            | level = "ERROR" [5m])) > 0
        labels:
          severity: warning
        annotations:
          summary: "Windmill worker runtime error"
          description: "ERROR entries appeared in a worker container's logs"
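How `for: 2m` interacts with `interval: 30s` can be illustrated with a small simulation. This is a simplified model of Prometheus-style rule evaluation, which the Loki Ruler follows, not the Ruler's actual code; `evaluate` and its parameters are names invented for this sketch.

```python
def evaluate(samples, threshold, interval_s, for_s):
    """Return the alert state after each evaluation cycle.

    samples holds the expression value seen at each cycle. An alert is
    pending while the condition holds but has not yet held for for_s
    seconds, and firing once it has.
    """
    states = []
    active_since = None
    for cycle, value in enumerate(samples):
        now = cycle * interval_s
        if value > threshold:
            if active_since is None:
                active_since = now  # condition became true at this cycle
            held = now - active_since
            states.append("firing" if held >= for_s else "pending")
        else:
            active_since = None  # any gap resets the timer
            states.append("inactive")
    return states


# WindmillServerError: threshold 3, evaluated every 30s, for: 2m
print(evaluate([5, 5, 5, 5, 5], threshold=3, interval_s=30, for_s=120))
# A single quiet evaluation in between resets the pending state.
print(evaluate([5, 5, 0, 5], threshold=3, interval_s=30, for_s=120))
```

Rules without a `for` clause correspond to `for_s=0` in this model and fire on the first evaluation that crosses the threshold.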

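LogQL syntax notes

Syntax                                  Meaning                                   Example / note
{container="windmill_server-1"}         Select log streams by label               Exact match
{container=~"windmill_worker-\\d+$"}    Select log streams by regex               Matches worker-1 and worker-2
| json                                  Parse the line as JSON                    Extracts level, msg, etc.
level = "ERROR"                         Filter on a field value                   Keeps only ERROR entries
msg =~ "Permission denied.*"            Regex match on a field value              Anchored at both ends, hence the trailing .*
count_over_time(... [1m])               Count matching lines in the last minute   Used for counting
for: 2m                                 Fire only after 2 minutes of violation    Debounce

What sum() does: count_over_time returns one series per log stream, so when several containers match the same rule (e.g. multiple workers), sum() adds their counts together and the threshold is applied to the total.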
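The difference between per-stream counts and the summed total is easy to see in a toy model. The sketch below treats log entries as `(stream, timestamp)` pairs and counts those that fall inside the lookback window; the sample entries are invented, and the exact window-boundary handling is a simplification rather than a statement about Loki's implementation.

```python
from collections import defaultdict


def count_over_time(entries, now, window_s):
    """Count entries per stream whose timestamp lies in (now - window_s, now]."""
    counts = defaultdict(int)
    for stream, timestamp in entries:
        if now - window_s < timestamp <= now:
            counts[stream] += 1
    return dict(counts)


# Invented ERROR entries: (container label, unix-style timestamp in seconds)
entries = [
    ("windmill_worker-1", 100),
    ("windmill_worker-1", 130),
    ("windmill_worker-2", 140),
    ("windmill_worker-2", 20),  # too old, falls outside the window
]

per_stream = count_over_time(entries, now=150, window_s=60)
total = sum(per_stream.values())
print(per_stream, total)
```

With a threshold of `> 2`, neither worker would trip the rule on its own in this example, but the summed value would.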

Alertmanager configuration

The existing configuration is reused as-is; no new route is needed:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'openclaw'

receivers:
  - name: 'openclaw'
    webhook_configs:
      - url: 'http://alert-transformer:9091/alertmanager'
        send_resolved: true

Verification

# List the values of the container label
curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/label/container/values"

# Confirm the alerting rules are loaded
curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/rules"

# Query windmill_server ERROR logs
curl -s -H "X-Scope-OrgID: my-tenant" \
  "http://loki.example.com:3100/loki/api/v1/query_range?\
query=%7Bcontainer%3D%22windmill_server-1%22%7D%20%7C%20json%20%7C%20level%20%3D%20%22ERROR%22&limit=3"

# Check the alerts in Alertmanager
curl -s "http://alertmanager:9093/api/v2/alerts"
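The percent-encoded query string in the query_range call is tedious to write by hand. A short Python helper can generate it from the readable LogQL expression; `query_range_url` is a hypothetical helper name, and the base URL is the placeholder used throughout this post.

```python
from urllib.parse import quote


def query_range_url(base_url, logql, limit):
    """Build a Loki query_range URL with the LogQL expression percent-encoded."""
    encoded = quote(logql, safe="")  # encode every reserved character
    return f"{base_url}/loki/api/v1/query_range?query={encoded}&limit={limit}"


logql = '{container="windmill_server-1"} | json | level = "ERROR"'
url = query_range_url("http://loki.example.com:3100", logql, 3)
print(url)
```

The resulting URL can be passed to curl together with the X-Scope-OrgID header, exactly as in the commands above.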

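Summary

Comparison with the CouchDB log monitoring from the previous post:

Aspect             CouchDB (text logs)            Windmill (JSON logs)
Log format         Erlang [error]                 {"level":"ERROR","msg":"..."}
Alloy parsing      None, passed through as-is     stage.json extracts fields
LogQL              |~ "\\[error\\]"               | json | level = "ERROR"
Field filtering    Not supported                  Precise matching on msg

For JSON logs, | json is more robust and more expressive than plain-text regex matching.

Any other service in the project that logs JSON can be onboarded with exactly the same pattern: only the relabel rules change in the Alloy configuration, and Loki only needs one additional YAML rule file.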