We have three working environments – Dev, Stage, Production.
There are also a number of alerts with different severities – info, warning, and critical.
For example:
    ...
    - name: SSLexpiry.rules
      rules:
      - alert: SSLCertExpiring30days
        expr: probe_ssl_earliest_cert_expiry{job="blackbox"} - time() < 86400 * 30
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "SSL certificate warning"
          description: "SSL certificate for the {{ $labels.instance }} will expire within 30 days!"
    ...
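To make the threshold in this expression concrete: 86400 is the number of seconds in a day, so 86400 * 30 is 30 days in seconds, and the condition holds once the certificate's expiry timestamp is less than that far ahead of the current time. Below is a minimal Python sketch of the same arithmetic; the function name and the sample timestamp are illustrative assumptions, not part of the Prometheus setup.

```python
SECONDS_PER_DAY = 86400


def expires_within(expiry_ts: float, now_ts: float, days: int = 30) -> bool:
    """Mirror the rule expression: expiry - time() < 86400 * days."""
    return expiry_ts - now_ts < SECONDS_PER_DAY * days


now = 1_700_000_000  # arbitrary sample "current time" in Unix seconds (assumption)

print(expires_within(now + 29 * SECONDS_PER_DAY, now))  # True: 29 days left
print(expires_within(now + 31 * SECONDS_PER_DAY, now))  # False: 31 days left
```

In the rule itself, the "for: 10m" clause additionally requires the condition to hold for 10 minutes before the alert fires.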
Alerts are sent to Slack and OpsGenie.
The task is to route each alert, depending on its environment and severity level, either to Slack only or to both Slack and OpsGenie.
OpsGenie, in turn, acts depending on the severity level:
- for warning – sends an email plus a notification to its mobile application
- for critical – sends an email plus a notification to its mobile application, plus a bot call to a mobile phone
Thus the whole logic looks as follows:
- Dev
  - all messages, regardless of severity – Slack only
- Staging
  - info – Slack only
  - warning and critical – Slack + OpsGenie, with the warning priority set for OpsGenie (P3)
- Production
  - info – Slack only
  - warning and critical – Slack + OpsGenie, with the critical priority set for OpsGenie (P1)
To split messages between Slack and OpsGenie, we have three receivers configured; the warning and critical receivers additionally set the OpsGenie priority, P3 and P1 respectively:
    ...
    receivers:
    - name: 'default'
      slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:confused:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
    - name: 'warning'
      slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:disappointed_relieved:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
      opsgenie_configs:
      - priority: P3
    - name: 'critical'
      slack_configs:
      - send_resolved: true
        title_link: 'https://monitor.example.com/prometheus/alerts'
        title: '{{ if eq .Status "firing" }}:scream:{{ else }}:dancing_panda:{{ end }} [{{ .Status | toUpper }}] {{ .CommonAnnotations.summary }}'
        text: "{{ range .Alerts }}*Priority*: `{{ .Labels.severity | toUpper }}`\nMonitoring host: {{ .Labels.monitor }}\n{{ .Annotations.description }}\n{{ end }}"
      opsgenie_configs:
      - priority: P1
The routing itself is done in the route block:
    ...
    route:
      group_by: ['alertname', 'cluster', 'job', 'env']
      repeat_interval: 24h
      group_interval: 5m
      # capture All Dev + All INFO
      receiver: 'default'
      routes:
      # capture All WARN to the 'warning' with P3
      - match:
          severity: warning
        receiver: warning
        routes:
        # forward Dev WARN to the 'default'
        - match_re:
            env: .*(-dev).*
          receiver: default
      # capture All CRIT to the 'critical' with P1
      - match:
          severity: critical
        receiver: critical
        routes:
        # forward Stage CRIT to the 'warning'
        - match_re:
            env: .*(-stage).*
          receiver: warning
        # forward Dev CRIT to the 'default'
        - match_re:
            env: .*(-dev).*
          receiver: default
    ...
Here we have the 'default' route set: all alerts that do not match any of the rules below it will be sent via this route, which sends a Slack notification only.
The additional routes work as follows:
- match catches alerts with the severity: warning label
  - in the nested route, match_re checks the env label: if its value contains "-dev", the alert is sent back to the default receiver
  - all other alerts with the warning level go through the warning receiver
Similarly, the rules on the next level are applied: they catch alerts with severity: critical and check them:
- if env matches .*(-stage).* – the alert goes to the warning receiver
- if env matches .*(-dev).* – the alert goes to the default receiver
- everything else (only env == production and severity == critical are left) – goes through the critical receiver
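The walk through the tree described above can be checked with a small Python sketch. It is not Alertmanager code, only a simplified model of receiver selection under stated assumptions: the first matching child route wins, regex matchers are treated as fully anchored, and the continue option, grouping, and timing settings are ignored. The env values app-dev, app-stage, and app-prod are hypothetical samples; the article only implies that the values contain "-dev" or "-stage".

```python
import re

# Route tree transcribed from the 'route' block above; only the fields that
# affect receiver selection are kept.
ROUTE = {
    "receiver": "default",
    "routes": [
        {
            "match": {"severity": "warning"},
            "receiver": "warning",
            "routes": [
                {"match_re": {"env": r".*(-dev).*"}, "receiver": "default"},
            ],
        },
        {
            "match": {"severity": "critical"},
            "receiver": "critical",
            "routes": [
                {"match_re": {"env": r".*(-stage).*"}, "receiver": "warning"},
                {"match_re": {"env": r".*(-dev).*"}, "receiver": "default"},
            ],
        },
    ],
}


def matches(route: dict, labels: dict) -> bool:
    """Return True if every matcher of the route accepts the alert labels."""
    for name, value in route.get("match", {}).items():
        if labels.get(name, "") != value:
            return False
    for name, pattern in route.get("match_re", {}).items():
        # Assumption: the whole label value must match the pattern.
        if re.fullmatch(pattern, labels.get(name, "")) is None:
            return False
    return True


def pick_receiver(route: dict, labels: dict) -> str:
    """Descend into the first matching child; fall back to this route's receiver."""
    for child in route.get("routes", []):
        if matches(child, labels):
            return pick_receiver(child, labels)
    return route["receiver"]


# Hypothetical env label values, used only for illustration.
for env in ("app-dev", "app-stage", "app-prod"):
    for severity in ("info", "warning", "critical"):
        labels = {"env": env, "severity": severity}
        print(f"{env:10} {severity:9} -> {pick_receiver(ROUTE, labels)}")
```

With these sample values, every Dev alert and every info alert resolves to default, Stage warning and critical alerts resolve to warning, and on Production a warning alert resolves to warning while a critical alert resolves to critical.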
With this approach you can write rules on any labels and nest routes to any depth to check conditions and select the next route for an alert.