The second option doesn't look very likely to me at the moment, so to debug dnsmasq let's take a look at its logging and debugging options and see whether we have any issues there, for example with the cache size.
Prometheus dnsmasq metrics
The very first thing we did, after agreeing that the issue could be caused by a dnsmasq failure, was to add it to our Prometheus monitoring with the dnsmasq_exporter service.
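The same counters can also be read by hand, which is handy for checking what the exporter should be reporting. The dnsmasq man page documents them as CHAOS-class TXT records under the `bind` domain (`cachesize.bind`, `hits.bind`, `misses.bind` and so on) and gives `dig +short chaos txt cachesize.bind` as an example. Below is a minimal sketch built around that command; the `127.0.0.1` address and the helper names are assumptions made for illustration, not part of our setup.

```python
import subprocess

# Counter names listed in the dnsmasq man page (queried as <name>.bind).
BIND_COUNTERS = ("cachesize", "insertions", "evictions", "misses", "hits", "auth")


def dig_command(counter, server="127.0.0.1"):
    """Build the dig invocation for one CHAOS TXT counter (server is an assumption)."""
    return ["dig", "+short", "chaos", "txt", f"{counter}.bind", f"@{server}"]


def parse_counter(dig_output):
    """dig +short prints TXT data wrapped in double quotes; strip them and convert."""
    return int(dig_output.strip().strip('"'))


def read_counter(counter, server="127.0.0.1"):
    """Run dig and return the counter value; needs dig installed and dnsmasq running."""
    result = subprocess.run(
        dig_command(counter, server),
        capture_output=True,
        text=True,
        check=True,
        timeout=5,
    )
    return parse_counter(result.stdout)
```

Calling `read_counter("hits")` and `read_counter("misses")` on the host should then give the values behind the "answered locally" and "forwarded" numbers in the statistics dump.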
Oct 26 10:53:45 localhost dnsmasq: queries for authoritative zones 0
Oct 26 10:53:45 localhost dnsmasq: server 10.0.3.2#53: queries sent 11352, retried or failed 0
Oct 26 10:53:45 localhost dnsmasq: server 18.104.22.168#53: queries sent 46, retried or failed 0
And here we can see an incredible amount of useful information:
time 1572076425 – the time at which the statistics were dumped, in UNIX epoch format (it matches the Oct 26 10:53:45 timestamp of the log lines)
cache size 150, 0/45516 cache insertions re-used unexpired cache entries – exactly what we are looking for: if newly added records push out records that are already in the cache and whose TTL has not expired yet, this is a signal that the cache size is not enough
queries forwarded 22241, queries answered locally 633009 – the number of requests that weren't found in the cache and were forwarded to an upstream server (the value is taken from misses.bind), and the number of requests that were served from the dnsmasq cache (the hits.bind value)
queries for authoritative zones – if dnsmasq is used as an authoritative server (SOA), the number of requests to those zones
server 10.0.3.2#53: queries sent 11352, retried or failed 0 – the most interesting line here: in theory, it will help us see whether we have issues on the AWS VPC DNS side the next time the “php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution” error happens
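To avoid reading these lines by eye every time, the counters can be pulled out of the log with a few regular expressions. This is only a minimal sketch, assuming the line formats are exactly as quoted in this post; the function names `parse_dnsmasq_stats` and `hit_ratio` are made up for illustration.

```python
import re

# Patterns follow the statistics lines exactly as they are quoted in this post.
CACHE_RE = re.compile(
    r"cache size (\d+), (\d+)/(\d+) cache insertions re-used unexpired cache entries"
)
QUERIES_RE = re.compile(r"queries forwarded (\d+), queries answered locally (\d+)")
SERVER_RE = re.compile(r"server (\S+)#\d+: queries sent (\d+), retried or failed (\d+)")


def parse_dnsmasq_stats(lines):
    """Collect the counters from dnsmasq statistics log lines into a dict."""
    stats = {"servers": {}}
    for line in lines:
        cache = CACHE_RE.search(line)
        queries = QUERIES_RE.search(line)
        server = SERVER_RE.search(line)
        if cache:
            stats["cache_size"] = int(cache.group(1))
            # Unexpired entries pushed out to make room for new ones.
            stats["evicted_unexpired"] = int(cache.group(2))
            stats["insertions"] = int(cache.group(3))
        elif queries:
            stats["forwarded"] = int(queries.group(1))
            stats["answered_locally"] = int(queries.group(2))
        elif server:
            stats["servers"][server.group(1)] = {
                "sent": int(server.group(2)),
                "failed": int(server.group(3)),
            }
    return stats


def hit_ratio(stats):
    """Share of queries answered from the cache; 0.0 if nothing was counted."""
    total = stats.get("forwarded", 0) + stats.get("answered_locally", 0)
    return stats.get("answered_locally", 0) / total if total else 0.0


sample = [
    "cache size 150, 0/45516 cache insertions re-used unexpired cache entries",
    "queries forwarded 22241, queries answered locally 633009",
    "server 10.0.3.2#53: queries sent 11352, retried or failed 0",
]
stats = parse_dnsmasq_stats(sample)
print(stats["evicted_unexpired"], round(hit_ratio(stats), 3), stats["servers"])
```

A non-zero `evicted_unexpired` would point at a cache that is too small, and a growing `failed` counter for `10.0.3.2` would point at the upstream VPC DNS rather than at dnsmasq itself.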
AWS “Temporary failure in name resolution”
By the time of writing this post, we got our lovely errors again.
It looks like this (dnsmasq metrics):
But still, in its statistics, everything is just perfect: