Prometheus: blackbox-exporter probe_http_status_code == 0 and its debug

By | 03/06/2019

Today I decided to upgrade Grafana to already released version 6.0 and all other Docker images as well.

Upgrade was successful – Loki eventually started displaying previously missed log-file names and other tags, just – immediately I got a bunch of CRITICAL alerts in our Slack from the blackbox-exporter which is used to check every our API endpoints.

Actually, in this post – an example of how to debug the blackbox-exporter.

Exactly monitoring’s stack consist of the next containers:

[simterm]

admin@monitoring-production:~$ cat /opt/prometheus/prometheus-compose.yml | grep -B1 image | grep -v image
  prometheus-server:
--
  grafana-ui:
--
  cloudwatch-exporter-us-east-1:
--
  cloudwatch-exporter-us-east-2:
--
  cloudwatch-exporter-eu-west-1:
--
  blackbox-exporter:
--
  alertmanager:
--
  loki:
--
  grafana-loki:

[/simterm]

On the Prometheus graphs in looked like:

At the first moment, I was a bit scared as this is real Production API endpoints and alerts… And all of them suddenly “went down”.

But obviously – that was something with the blackbox-exporter itself after Docker images upgrade (and after this, I eventually added Ansible variables with Docker images tags/versions instead of the just “latest“).

Okay – let’s try to debug and find what’s happened.

First – let’s check Google instead of our URLs:

[simterm]

root@monitoring-production:/home/admin# curl -s "localhost:9115/probe?target=https://google.com&module=http_200_module" | grep -v \#
probe_dns_lookup_time_seconds 0.001286702
probe_duration_seconds 0.001563372
probe_failed_due_to_regex 0
probe_http_content_length 0
probe_http_duration_seconds{phase="connect"} 0
probe_http_duration_seconds{phase="processing"} 0
probe_http_duration_seconds{phase="resolve"} 0.001286702
probe_http_duration_seconds{phase="tls"} 0
probe_http_duration_seconds{phase="transfer"} 0
probe_http_redirects 0
probe_http_ssl 0
probe_http_status_code 0
probe_http_version 0
probe_ip_protocol 6
probe_success 0

[/simterm]

And the probe_http_status_code 0, and the probe_success 0

“Something went wrong” (с)

Add the debug=true parameter to the probe’s request:

[simterm]

root@monitoring-production:/home/admin# curl -s "localhost:9115/probe?debug=true&target=https://google.com&module=http_200_module" | grep -v \#
Logs for the probe:
ts=2019-03-06T14:52:43.226669976Z caller=main.go:118 module=http_200_module target=https://google.com level=info msg="Beginning probe" probe=http timeout_seconds=4.5
ts=2019-03-06T14:52:43.226800447Z caller=utils.go:42 module=http_200_module target=https://google.com level=info msg="Resolving target address" ip_protocol=ip6
ts=2019-03-06T14:52:43.227902933Z caller=utils.go:70 module=http_200_module target=https://google.com level=info msg="Resolved target address" ip=2607:f8b0:4009:804::200e
ts=2019-03-06T14:52:43.227948957Z caller=http.go:349 module=http_200_module target=https://google.com level=info msg="Making HTTP request" url=https://[2607:f8b0:4009:804::200e] host=google.com
ts=2019-03-06T14:52:43.228076802Z caller=http.go:364 module=http_200_module target=https://google.com level=error msg="Error for HTTP request" err="Get https://[2607:f8b0:4009:804::200e]: dial tcp [2607:f8b0:4009:804::200e]:443: connect: cannot assign requested address"
ts=2019-03-06T14:52:43.228098528Z caller=http.go:450 module=http_200_module target=https://google.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2019-03-06T14:52:43.227999657Z dnsDone=2019-03-06T14:52:43.227999657Z connectDone=2019-03-06T14:52:43.228064489Z gotConn=0001-01-01T00:00:00Z responseStart=0001-01-01T00:00:00Z end=0001-01-01T00:00:00Z
ts=2019-03-06T14:52:43.228120026Z caller=main.go:131 module=http_200_module target=https://google.com level=error msg="Probe failed" duration_seconds=0.001397763

Metrics that would have been returned:
...
probe_http_status_code 0
probe_http_version 0
probe_ip_protocol 6
probe_success 0

Module configuration:
prober: http
timeout: 5s

[/simterm]

Eeeerrr…

ts=2019-03-06T14:52:43.227948957Z caller=http.go:349 module=http_200_module target=https://google.com level=info msg=”Making HTTP request” url=https://[2607:f8b0:4009:804::200e] host=google.com
ts=2019-03-06T14:52:43.228076802Z caller=http.go:364 module=http_200_module target=https://google.com level=error msg=”Error for HTTP request” err=”Get https://[2607:f8b0:4009:804::200e]: dial tcp [2607:f8b0:4009:804::200e]:443: connect:
cannot assign requested address”

url=https://[2607:f8b0:4009:804::200e] host=google.com – WTF? IP v6? But there always IP v4 was used…

Check the blackbox_exporter‘s documentation:

...
  # The IP protocol of the HTTP probe (ip4, ip6).
  [ preferred_ip_protocol: <string> | default = "ip6" ]
...

Ah, well, okay…

The default version was changed.

Suddenly – IP v6 used much more in the world then I thought.

You can check the statistic on the ipv6-test.com.

OK – update config and set ip4 explicitly:

modules:
  icmp:
    prober: icmp
    timeout: 5s

  http_200_module:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: ip4
      valid_status_codes: [200]
      method: GET
      no_follow_redirects: true
      fail_if_not_ssl: true

Check again:

[simterm]

root@monitoring-production:/home/admin# curl -s "localhost:9115/probe?debug=true&target=https://google.com&module=http_200_module" | grep -v \#
Logs for the probe:
ts=2019-03-06T14:57:41.827341666Z caller=main.go:118 module=http_200_module target=https://google.com level=info msg="Beginning probe" probe=http timeout_seconds=4.5
ts=2019-03-06T14:57:41.82743191Z caller=utils.go:42 module=http_200_module target=https://google.com level=info msg="Resolving target address" ip_protocol=ip4
ts=2019-03-06T14:57:41.847978323Z caller=utils.go:70 module=http_200_module target=https://google.com level=info msg="Resolved target address" ip=172.217.8.174
ts=2019-03-06T14:57:41.848021827Z caller=http.go:349 module=http_200_module target=https://google.com level=info msg="Making HTTP request" url=https://[172.217.8.174] host=google.com
ts=2019-03-06T14:57:41.949592972Z caller=http.go:305 module=http_200_module target=https://google.com level=info msg="Received redirect" url=https://www.google.com/
ts=2019-03-06T14:57:42.075895742Z caller=http.go:368 module=http_200_module target=https://google.com level=info msg="Received HTTP response" status_code=200
ts=2019-03-06T14:57:42.078492632Z caller=http.go:450 module=http_200_module target=https://google.com level=info msg="Response timings for roundtrip" roundtrip=0 start=2019-03-06T14:57:41.848092002Z dnsDone=2019-03-06T14:57:41.848092002Z connectDone=2019-03-06T14:57:41.867216977Z gotConn=2019-03-06T14:57:41.917500103Z responseStart=2019-03-06T14:57:41.949545477Z end=0001-01-01T00:00:00Z
ts=2019-03-06T14:57:42.078528466Z caller=http.go:450 module=http_200_module target=https://google.com level=info msg="Response timings for roundtrip" roundtrip=1 start=2019-03-06T14:57:41.949692456Z dnsDone=2019-03-06T14:57:41.950657776Z connectDone=2019-03-06T14:57:41.969811463Z gotConn=2019-03-06T14:57:42.020416388Z responseStart=2019-03-06T14:57:42.075831768Z end=2019-03-06T14:57:42.078481538Z
ts=2019-03-06T14:57:42.078563182Z caller=main.go:129 module=http_200_module target=https://google.com level=info msg="Probe succeeded" duration_seconds=0.251186806

Metrics that would have been returned:
...
probe_http_status_code 200
probe_http_version 1.1
probe_ip_protocol 4
probe_ssl_earliest_cert_expiry 1.5586899e+09
probe_success 1

Module configuration:
prober: http
timeout: 5s
http:
  preferred_ip_protocol: ip4

[/simterm]

Everything went back to the OK state.

Done.