Prometheus: Alertmanager receivers and alert routing based on severity level and labels

We have three working environments – Dev, Stage, Production.

There are also a number of alerts with different severities – info, warning, and critical.

For example:

...
- name: SSLexpiry.rules
  rules:
  - alert: SSLCertExpiring30days
    expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "SSL certificate warning"
      description: "SSL certificate for the {{ $labels.instance }} will expire within 30 days!"
...
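
The routing later in this post matches on an env label, and the Slack templates print a monitor label, while the rule above sets only severity. One common way to attach such labels to every alert is external_labels in the Prometheus configuration. The fragment below is a sketch under that assumption; the label values are hypothetical placeholders, not taken from the real setup:

```yaml
# prometheus.yml (fragment): assumed setup with one Prometheus per environment
global:
  external_labels:
    # Hypothetical values. The route regexes only require that env
    # contains "-dev" or "-stage" for the non-production environments.
    env: 'example-dev'
    monitor: 'monitoring-dev'
```

Labels set this way are added to every alert that this Prometheus instance sends to Alertmanager, so individual rules do not need to repeat them.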

Alerts are sent to Slack and OpsGenie.

The task: depending on the environment and severity level, send an alert to Slack only, or to both Slack and OpsGenie.

OpsGenie, in turn, acts depending on the severity level:

- warning – sends an email plus a notification to its mobile application
- critical – sends an email plus a notification to its mobile application, plus a phone call from a bot

So the whole logic looks like this:

Dev:
- all messages, regardless of severity – Slack only
Staging:
- info – Slack only
- warning and critical – Slack + OpsGenie with the warning priority for OpsGenie (P3)
Production:
- info – Slack only
- warning – Slack + OpsGenie with the warning priority for OpsGenie (P3)
- critical – Slack + OpsGenie with the critical priority for OpsGenie (P1)

To split messages between Slack and OpsGenie we have three receivers configured; the warning and critical receivers additionally set the OpsGenie priority, P3 or P1:

...
receivers:

  - name: 'default'
    slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:confused:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"

  - name: 'warning'
    slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:disappointed_relieved:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
    opsgenie_configs:
      - priority: P3

  - name: 'critical'
    slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:scream:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
    opsgenie_configs:
      - priority: P1
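
The opsgenie_configs entries above show only the priority, so the API key and the Slack webhook have to come from somewhere else, for example from the global section. A minimal sketch, assuming a single OpsGenie integration key and a single Slack incoming webhook shared by all receivers; the key, the webhook URL and the channel name are placeholders, and the message and tags templates are illustrative choices rather than the original configuration:

```yaml
global:
  # Placeholders: substitute your own webhook URL and integration key.
  slack_api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'
  opsgenie_api_key: 'REPLACE_WITH_OPSGENIE_API_KEY'

receivers:
  - name: 'critical'
    slack_configs:
      - send_resolved: true
        channel: '#alerts'   # hypothetical channel name
    opsgenie_configs:
      - priority: P1
        send_resolved: true
        # Illustrative templates built from labels and annotations
        # shared by all alerts in the notification group.
        message: '{{ .CommonAnnotations.summary }}'
        tags: '{{ .CommonLabels.env }},{{ .CommonLabels.severity }}'
```

The tags template assumes env and severity are the same for every alert in a group, which holds here because env is in group_by and each OpsGenie receiver is reached through a route that matches a single severity.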

And the routing itself is done in the route block:

...
route:

  group_by: ['alertname', 'cluster', 'job', 'env']
  repeat_interval: 24h
  group_interval: 5m

  # capture All Dev + All INFO
  receiver: 'default'

  routes:

    # capture All WARN to the 'warning' with P3
    - match:
        severity: warning
      receiver: warning

      routes:
      # forward Dev WARN to the 'default'
      - match_re:
          env: .*(-dev).*
        receiver: default

    # capture All CRIT to the 'critical' with P1
    - match:
        severity: critical
      receiver: critical

      routes:
      # forward Stage CRIT to the 'warning'
      - match_re:
          env: .*(-stage).*
        receiver: warning
      # forward Dev CRIT to the 'default'
      - match_re:
          env: .*(-dev).*
        receiver: default
...
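
In newer Alertmanager releases (0.22 and later) the match and match_re keys are deprecated in favour of a single matchers list. As a sketch, assuming such a version, the same tree could be written like this, with receivers and grouping settings unchanged:

```yaml
route:
  group_by: ['alertname', 'cluster', 'job', 'env']
  repeat_interval: 24h
  group_interval: 5m
  receiver: 'default'              # all Dev + all info
  routes:
    - matchers:
        - 'severity = "warning"'
      receiver: 'warning'          # P3
      routes:
        - matchers:
            - 'env =~ ".*-dev.*"'  # Dev warning goes back to Slack only
          receiver: 'default'
    - matchers:
        - 'severity = "critical"'
      receiver: 'critical'         # P1
      routes:
        - matchers:
            - 'env =~ ".*-stage.*"'  # Stage critical is downgraded to P3
          receiver: 'warning'
        - matchers:
            - 'env =~ ".*-dev.*"'    # Dev critical goes back to Slack only
          receiver: 'default'
```

Either variant can be checked without firing real alerts: amtool config routes test --config.file=alertmanager.yml severity=critical env=example-stage prints the receiver the given label set would be routed to. The file name and the env value here are assumptions.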

Here the 'default' receiver is set on the top-level route: all alerts that do not match any of the rules below it are sent via this route, which sends a Slack notification only.

The additional routes work as follows:

- match catches alerts with the severity: warning label
- in the nested route, match_re checks the env label – if its value contains “-dev”, the alert goes to the default receiver
- all other alerts with the warning level stay on the parent route and go through the warning receiver

The next route works similarly – it catches alerts with severity: critical and checks them:

- if env matches .*(-stage).* – the alert goes to the warning receiver
- if env matches .*(-dev).* – the alert goes to the default receiver
- everything else (only env == production and severity == critical is left) – goes through the critical receiver

Using this approach you can write rules on any labels and nest routes to any depth to check conditions and select the next route for an alert.
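
The same decisions can also be expressed by checking the environment first. The variant below is only a sketch of that alternative, not the configuration used in this post. It relies on sibling routes being evaluated from top to bottom with the first match winning, which is the default behaviour when continue is not set; grouping settings are omitted for brevity:

```yaml
route:
  receiver: 'default'            # Slack only: all info and anything unmatched
  routes:
    # Dev: every severity stays on Slack
    - match_re:
        env: .*(-dev).*
      receiver: 'default'
    # Stage: warning and critical both go out with P3
    - match_re:
        env: .*(-stage).*
        severity: warning|critical
      receiver: 'warning'
    # Whatever reaches this point is production
    - match:
        severity: warning
      receiver: 'warning'        # P3
    - match:
        severity: critical
      receiver: 'critical'       # P1
```

The flat form is easier to scan, while the nested form above keeps each severity and its exceptions together; which one reads better depends on whether you think in terms of environments or severities first.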
