We are using AWS VPC DNS and from time to time face errors like “php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution”.
The only advice from AWS tech support was to configure a local dnsmasq service to act as a local DNS cache, but I already did this a year ago, and the issue still happens once every one to three months.
Although this post is more about dnsmasq configuration, here are a couple of possible causes at this moment:
the local dnsmasq fails for some reason and goes down, then all requests are sent to the AWS VPC DNS service and hit its hard limit of 1024 requests per second, so it fails too and we get our error
The second one doesn't seem very likely to me at this moment, but to debug dnsmasq, let's take a look at its logging and debugging options to see if we have any issues there, for example with the cache size.
Prometheus dnsmasq metrics
The very first thing we did after agreeing on the assumption that the issue could be caused by a dnsmasq failure was to add it to our Prometheus monitoring with the dnsmasq_exporter service.
You can grab its metrics yourself using the Chaos DNS request class (see DNS classes) with the dig utility.
For example, to get the list of upstream DNS servers that our dnsmasq will forward requests to if a domain wasn't found in the local dnsmasq's cache:
Well, that's it, dnsmasq uses them: 10.0.3.2 is the VPC DNS, and 1.1.1.1 is the CloudFlare DNS used as a fallback in case the VPC DNS does not respond (although this didn't help too much, as we can see).
The data here are:
cachesize.bind: cache size (how many domains will be kept in the cache)
insertions.bind: number of domains added to the cache
evictions.bind: number of domains removed from the cache (because of a domain’s TTL or if the cache is exhausted)
misses.bind: number of requests which weren't found in the cache and were forwarded to upstream servers
hits.bind: number of requests served by the dnsmasq from its cache
auth.bind: number of requests to authoritative zones (dnsmasq can be used as an authoritative, SOA, host)
servers.bind: already reviewed above – servers to forward unknown domains to
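As a rough illustration, the counters listed above could be read from a script by wrapping the dig utility. This is only a sketch under assumptions: dig is installed, dnsmasq listens on 127.0.0.1, and the helper names `chaos_query_cmd` and `read_stat` are made up for this example.

```python
import subprocess

# The *.bind names come from the list above.
STAT_NAMES = [
    "cachesize.bind", "insertions.bind", "evictions.bind",
    "misses.bind", "hits.bind", "auth.bind", "servers.bind",
]


def chaos_query_cmd(name, server="127.0.0.1"):
    """Build a dig command asking `server` for a TXT record in the CHAOS class.

    The default server address is an assumption; use your dnsmasq address.
    """
    return ["dig", "+short", "chaos", "txt", name, f"@{server}"]


def read_stat(name, server="127.0.0.1"):
    """Run dig and return its raw, unparsed answer as a string."""
    result = subprocess.run(
        chaos_query_cmd(name, server),
        capture_output=True, text=True, timeout=5, check=True,
    )
    return result.stdout.strip()
```

Calling `read_stat(name)` for every entry in `STAT_NAMES` would collect all the values in one pass; nothing is executed until you call it.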
Using those metrics we can build a graph, for example, of the hit ratio: the share of all requests that dnsmasq served from its cache.
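The arithmetic behind such a ratio is simple. Here is a minimal sketch, assuming that "overall requests" means hits plus misses; the function name `hit_ratio` is hypothetical, and the numbers are the "answered locally" and "forwarded" values from the SIGUSR1 statistics discussed later in this post.

```python
def hit_ratio(hits, misses):
    """Share of requests answered from the cache, in the 0..1 range."""
    total = hits + misses
    if total == 0:
        return 0.0  # no traffic yet, avoid dividing by zero
    return hits / total


# 633009 queries answered locally (hits.bind), 22241 forwarded (misses.bind)
ratio = hit_ratio(hits=633009, misses=22241)
print(f"cache hit ratio: {ratio:.1%}")  # prints "cache hit ratio: 96.6%"
```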
Repeat the same request and check STDOUT again:
...
dnsmasq: query[A] ya.ru from ::1
dnsmasq: cached ya.ru is 87.250.250.242
...
On the first attempt, the ya.ru domain wasn’t found in the cache and dnsmasq forwarded it to its upstream servers.
On the second request, the value was returned from the cache.
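With a busy resolver it can be easier to count such log lines than to read them. The sketch below tallies the action word that follows the "dnsmasq:" prefix. The helper name `count_actions` is hypothetical, and the shape of the "forwarded" line is an assumption, since only the "query" and "cached" lines are shown above.

```python
import re
from collections import Counter

# Captures the action word after "dnsmasq: " (or "dnsmasq[PID]: "),
# e.g. "query[A]" -> "query", "cached" -> "cached".
ACTION_RE = re.compile(r"dnsmasq(?:\[\d+\])?: (\w+)")


def count_actions(lines):
    """Count log lines per action word; lines without a match are skipped."""
    counts = Counter()
    for line in lines:
        match = ACTION_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "dnsmasq: query[A] ya.ru from ::1",
    "dnsmasq: forwarded ya.ru to 10.0.3.2",  # assumed shape of a forwarded line
    "dnsmasq: query[A] ya.ru from ::1",
    "dnsmasq: cached ya.ru is 87.250.250.242",
]
counts = count_actions(sample)
print(counts["query"], counts["forwarded"], counts["cached"])  # prints "2 1 1"
```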
You can also pass the extra value to the log-queries option to prefix each log line with a serial number that ties together the lines belonging to one query, plus the requester's address, which can be useful when you have a lot of records:
Another logging option is log-facility, where you can set the syslog facility: by default it is DAEMON, or LOCAL0 when running with -d (--no-daemon), see the syslog facility.
Also, instead of using syslog, you can set a log file to be used:
log-facility="/var/log/dnsmasq.log"
And add log-queries if you want to see all the requests served.
Check the log-async option if a lot of logs are passed via syslog.
resolv.conf and resolv.dnsmasq
It may be a good idea to move the upstream servers out of the host's resolv.conf into a separate file such as resolv.dnsmasq and leave only the dnsmasq address in resolv.conf, to make it the only DNS server for the system.
Also, in this way you can have a dynamic resolv.conf, for example, if it is periodically updated by NetworkManager.
Still, dnsmasq will keep checking resolv.conf for updates. To stop it from reading resolv.conf at all, add the no-resolv option (the no-poll option only disables polling the file for changes).
--strict-order
By default, dnsmasq forwards requests to all upstream servers, and the first one to reply becomes the primary: dnsmasq will try it first for subsequent forwards.
This can be any server from resolv.conf, resolv.dnsmasq, or a server option in the dnsmasq config file.
To make dnsmasq use the servers strictly in the order they are listed in the configs, add the strict-order option.
dnsmasq cache size recommended
And the most interesting question: what is the best cache size for dnsmasq?
By default, it has a limit of 150 records; the maximum is 10000.
It can be adjusted with the cache-size option:
cache-size=1000
But still – do we need to increase it?
SIGUSR1
To check whether your current cache size is sufficient, you can make dnsmasq dump its statistics collected since startup by sending it the SIGUSR1 signal:
Oct 26 10:53:45 localhost dnsmasq[30433]: queries for authoritative zones 0
Oct 26 10:53:45 localhost dnsmasq[30433]: server 10.0.3.2#53: queries sent 11352, retried or failed 0
Oct 26 10:53:45 localhost dnsmasq[30433]: server 1.1.1.1#53: queries sent 46, retried or failed 0
And here you can see a lot of useful information:
time 1572076425 – the time of the dump in UNIX epoch format (not the time since the service started)
cache size 150, 0/45516 cache insertions re-used unexpired cache entries – this is exactly what we are looking for: if newly added records push out records that are already in the cache and whose TTL has not expired yet, this is the signal that the cache size is not enough; here the value is 0 out of 45516 insertions, so the cache is large enough
queries forwarded 22241, queries answered locally 633009 – the number of requests which weren't found in the cache and were forwarded to an upstream server (the value is taken from misses.bind), and the number of requests which were served from the dnsmasq cache (the hits.bind value)
queries for authoritative zones – if dnsmasq is used as an authoritative (SOA) host, the number of requests to those zones
server 10.0.3.2#53: queries sent 11352, retried or failed 0 – the most interesting line here: in theory, it will help us see whether there are issues on the AWS VPC DNS side the next time the “php_network_getaddresses: getaddrinfo failed: Temporary failure in name resolution” error happens
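To avoid eyeballing these lines every time, the dump could be checked by a small script. The sketch below parses the line shapes quoted above and flags two conditions: a non-zero count of re-used unexpired entries (cache too small) and a non-zero "retried or failed" count for an upstream. The names `parse_stats` and `health_hints` are hypothetical, and the regular expressions only assume the wording shown in this post.

```python
import re

CACHE_RE = re.compile(
    r"cache size (\d+), (\d+)/(\d+) cache insertions re-used unexpired cache entries"
)
QUERIES_RE = re.compile(r"queries forwarded (\d+), queries answered locally (\d+)")
SERVER_RE = re.compile(r"server (\S+): queries sent (\d+), retried or failed (\d+)")


def parse_stats(lines):
    """Collect cache and per-upstream counters from a statistics dump."""
    stats = {"servers": {}}
    for line in lines:
        cache = CACHE_RE.search(line)
        queries = QUERIES_RE.search(line)
        server = SERVER_RE.search(line)
        if cache:
            size, reused, inserted = map(int, cache.groups())
            stats.update(cache_size=size, reused_unexpired=reused, insertions=inserted)
        elif queries:
            stats["forwarded"], stats["answered_locally"] = map(int, queries.groups())
        elif server:
            address, sent, failed = server.groups()
            stats["servers"][address] = {"sent": int(sent), "failed": int(failed)}
    return stats


def health_hints(stats):
    """Return hints for a too-small cache or an upstream with failed queries."""
    hints = []
    if stats.get("reused_unexpired", 0) > 0:
        hints.append("cache too small: unexpired entries were pushed out")
    for address, counters in stats["servers"].items():
        if counters["failed"] > 0:
            hints.append(f"upstream {address}: {counters['failed']} retried or failed")
    return hints


dump = [
    "cache size 150, 0/45516 cache insertions re-used unexpired cache entries",
    "queries forwarded 22241, queries answered locally 633009",
    "Oct 26 10:53:45 localhost dnsmasq[30433]: server 10.0.3.2#53: queries sent 11352, retried or failed 0",
    "Oct 26 10:53:45 localhost dnsmasq[30433]: server 1.1.1.1#53: queries sent 46, retried or failed 0",
]
stats = parse_stats(dump)
print(health_hints(stats))  # empty list: nothing evicted early, no failed queries
```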
AWS “Temporary failure in name resolution”
By the time of writing this post, we got our lovely errors again.
Here is how it looks in the dnsmasq metrics:
But still, in its statistics everything is just perfect: