I’d been wrestling with the problem of accessing AWS EKS from the office for a long time – finally lost my patience and figured it out 🙂
Here’s the problem: there’s an AWS EKS cluster with both Public and Private endpoints for the API.
Working from my office laptop, sometimes requests to it go through fine – and sometimes they die with an “i/o timeout” error:
$ kk get pod
[...] Get \"https://F07***D78.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": dial tcp 10.0.64.9:443: i/o timeout" ...
Let’s go digging – because there are nuances here both with DNS and with network routes.
AWS VPC DNS and my VPNs
In my case EKS has both Public and Private endpoints enabled – so DNS resolution uses split-horizon DNS:
- for a request from the “public internet” AWS VPC DNS returns a Public IP
- for a request from inside the VPC – it’ll be a Private IP
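To see the split-horizon behaviour directly, you can ask each resolver for the endpoint and classify what comes back. Below is a minimal sketch in POSIX shell. The helper name `ip_scope` is made up for illustration, it only knows the RFC 1918 ranges, and it does no input validation.

```shell
# ip_scope: print "private" for an RFC 1918 IPv4 address, "public" otherwise.
# Hypothetical helper for illustration; IPv4 only, no input validation.
ip_scope() {
  case "$1" in
    10.*|192.168.*)                         echo private ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*)  echo private ;;
    *)                                      echo public ;;
  esac
}

# Usage (needs dig and a reachable resolver, so it is left commented out):
#   dig +short @1.1.1.1  F07***D78.gr7.us-east-1.eks.amazonaws.com | while read -r ip; do ip_scope "$ip"; done
#   dig +short @10.0.0.2 F07***D78.gr7.us-east-1.eks.amazonaws.com | while read -r ip; do ip_scope "$ip"; done
```

With the setup described here, the first query would be expected to print `public` and the second `private`.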
Next: I have two active VPN connections + the office WiFi, and the problem starts when I add AWS VPC DNS to resolv.conf, because:
- there’s the project’s Pritunl/OpenVPN and there are AWS domains that need to resolve through 10.0.0.2 – AWS VPC DNS
- there’s my own WireGuard and domains that need to resolve from my home MikroTik through 10.100.0.1 (see MikroTik: setting up WireGuard and connecting Linux peers)
- and there are just public DNS zones that need to resolve through 1.1.1.1
In resolv.conf it looks like this:
nameserver 1.1.1.1 # CloudFlare DNS, returns EKS Endpoint Public IP
nameserver 10.100.0.1 # my MikroTik with WireGuard, returns EKS Endpoint Public IP
nameserver 10.0.0.2 # AWS VPC DNS via OpenVPN, returns EKS Endpoint Private IP 10.0.64.9
The file is managed by openresolv, which WireGuard launches when the tunnel starts – WireGuard sets its own DNS:
$ sudo cat /etc/wireguard/wg0.conf
...
DNS = 10.100.0.1, 10.0.0.2, 192.168.0.1
...
In the timeout error we can see that the request to F07***D78.gr7.us-east-1.eks.amazonaws.com goes to IP 10.0.64.9 – meaning DNS resolution went through OpenVPN and AWS VPC DNS 10.0.0.2.
Linux DNS and systemd-resolved
Let’s check who’s actually responsible for DNS in the system – grep the /etc/nsswitch.conf file:
$ grep hosts /etc/nsswitch.conf
hosts: mymachines resolve [!UNAVAIL=return] files myhostname dns
Here the resolve option means the nss-resolve module, which hands lookups over to systemd-resolved (the debug log further down shows them arriving over Varlink).
And it comes first, before the files parameters (nss-files and /etc/hosts) and dns (the nss-dns module and the “classic” glibc DNS resolver) – so requests go to systemd-resolved first.
See Domain name resolution on the Arch Wiki.
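The `[!UNAVAIL=return]` action matters later on: as long as systemd-resolved is reachable, its answer is final and `files` and `dns` are never consulted. Only when it is unavailable does the lookup fall through to them. To print the lookup order on a given machine, here is a small sketch; the function name `nss_hosts_order` is hypothetical.

```shell
# nss_hosts_order: print the "hosts:" sources from nsswitch.conf, one per line,
# in the order glibc consults them. Takes an optional path (default: /etc/nsswitch.conf).
nss_hosts_order() {
  sed -n 's/^hosts:[[:space:]]*//p' "${1:-/etc/nsswitch.conf}" | tr -s ' \t' '\n\n'
}

# Usage:
#   nss_hosts_order
```

Keep in mind that `dig` talks to the servers from resolv.conf directly and bypasses this chain, while `getent hosts <name>` goes through it.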
systemd-resolved, DNS resolution and network interfaces
Now the interesting part – exactly how systemd-resolved performs DNS resolution.
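systemd-resolved takes its DNS servers from the /etc/resolv.conf that openresolv generates (hence "resolv.conf mode: foreign" below) – let’s look at its parameters: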
$ resolvectl status
Global
Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: foreign
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.100.0.1 192.168.0.1
...
Next, let’s check what’s happening in the system – enable the debug log for systemd-resolved:
$ sudo resolvectl log-level debug
Then in one window we open the logs:
$ sudo journalctl -u systemd-resolved -f | grep "F07***D78.gr7.us-east-1.eks.amazonaws.com"
Run kubectl get pod – and in the logs we see:
...
May 20 12:06:39 setevoy-office systemd-resolved[698]: varlink-28-28: Received message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":0,"ifindex":0}}
...
May 20 12:06:39 setevoy-office systemd-resolved[698]: varlink-28-28: Sending message: {"parameters":{"addresses":[{"ifindex":6,"family":2,"address":[10,0,64,9]},{"ifindex":6,"family":2,"address":[10,0,65,205]}],"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":1048577}}
Here:
- Received message: the request came in with the parameter "ifindex":0 – “don’t care where to look”
- Sending message: the response came back through "ifindex":6 – that’s tun0, OpenVPN and AWS VPC DNS
Let’s check the interfaces:
$ ip -o link | awk -F': ' '{print $1, $2}'
1 lo
2 enp0s31f6
4 wlan0
5 enp0s13f0u3u4u4
6 tun0
...
"ifindex":6 is the tun0 interface, the work OpenVPN, and the result returned from AWS VPC DNS – "address":[10,0,64,9], because AWS VPC DNS returns a private address.
We repeat the request – and now the result is different:
...
varlink-28-28: Sending message: {"parameters":{"addresses":[{"ifindex":4,"family":2,"address":[44,216,7,46]},{"ifindex":4,"family":2,"address":[3,***,***,161]}],"name":"F07***D78.gr7.us-east-1.eks.amazonaws.com","flags":8388609}}
...
This time the response is from "ifindex":4 – wlan0, and we get a public IP.
Why – because in the same log we see:
...
Firing regular transaction 49587 ... IN A> scope dns on */*
Firing regular transaction 59798 ... IN A> scope dns on wlan0/*
...
Here the first entry is a request through the global pool, to all the servers in it:
$ resolvectl status
Global
Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: foreign
Current DNS Server: 10.0.0.2
DNS Servers: 10.0.0.2 10.100.0.1 192.168.0.1
...
And the result comes back from whoever answers first, see systemd-resolved.service:
If lookups are routed to multiple interfaces, the first successful response is returned
And in this case it was wlan0:
... Added positive ... cache entry ... on wlan0/INET/10.0.0.1 ...
Since the request went through wlan0 – the response from AWS DNS for the EKS endpoint was a public IP.
While on the first attempt it was tun0:
... Added positive ... cache entry ... on tun0/INET/10.0.0.2 ...
And in response we got the private IP 10.0.64.9.
So:
- systemd-resolved queries all available DNS servers
- returns the result from whoever responds first
- if the request is from wlan0, the office network – we get a public IP, and the connection goes through
- if the request is from tun0, OpenVPN and AWS VPC DNS – we get a private IP, and the connection fails with a timeout
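While debug logging is still on, the varlink "Sending message" lines can be made easier to read: they encode each address as a JSON array of octets. Below is a minimal sketch of a filter that turns them into `ifindex address` pairs. It assumes the message layout from the log excerpts above and handles IPv4 only (`"family":2`); the function name `varlink_addrs` is made up here.

```shell
# varlink_addrs: read systemd-resolved debug log lines on stdin and print
# one "ifindex dotted-ipv4" pair per address found in varlink replies.
# Assumes the {"ifindex":N,"family":2,"address":[a,b,c,d]} layout.
varlink_addrs() {
  grep -oE '"ifindex":[0-9]+,"family":2,"address":\[[0-9]+(,[0-9]+){3}\]' |
    sed -E 's/"ifindex":([0-9]+),"family":2,"address":\[([0-9]+),([0-9]+),([0-9]+),([0-9]+)\]/\1 \2.\3.\4.\5/'
}

# Usage (not run here):
#   sudo journalctl -u systemd-resolved --since "5 min ago" | grep 'Sending message' | varlink_addrs
```

The first column can then be matched against the `ip -o link` output to see which interface produced the answer.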
Don’t forget to set the log level back to info:
$ sudo resolvectl log-level info
Now let’s move on to routing – why exactly does the connection fail with a timeout error?
VPN and Linux IP routes mess
Let’s look at the routes on the work laptop:
$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.0.0.1        0.0.0.0         UG    600    0        0 wlan0
10.0.0.0        0.0.0.0         255.255.255.0   U     600    0        0 wlan0
10.0.0.2        172.16.0.1      255.255.255.255 UGH   0      0        0 tun0
10.0.6.162      172.16.0.1      255.255.255.255 UGH   0      0        0 tun0
10.0.32.0       172.16.0.1      255.255.240.0   UG    0      0        0 tun0
10.0.48.0       172.16.0.1      255.255.240.0   UG    0      0        0 tun0
10.0.66.0       172.16.0.1      255.255.255.0   UG    0      0        0 tun0
10.0.67.0       172.16.0.1      255.255.255.0   UG    0      0        0 tun0
10.100.0.0      0.0.0.0         255.255.255.0   U     0      0        0 wg0
...
Here:
- 10.0.0.0 through wlan0: because this is the office network and we have MacMinis here that we need access to, plus internet access
- 10.0.0.2, 10.0.32.0, 10.0.48.0 etc – through tun0: these are AWS VPC Private Subnets – requests are routed here through the work OpenVPN for access to AWS RDS and other private resources
- 10.100.0.0 through wg0: this is my WireGuard network through MikroTik for access to my home networks
And now – here’s where the problem shows up: when the EKS endpoint resolves through AWS VPC DNS 10.0.0.2 – we get the private address 10.0.64.9.
But there’s no dedicated route for it through OpenVPN – so it gets routed through 10.0.0.1, the office router and the public internet:
$ kk get pod
[...] Get \"https://F07***D78.gr7.us-east-1.eks.amazonaws.com/api?timeout=32s\": dial tcp 10.0.64.9:443: i/o timeout" ...
We check the route itself – and we see it goes via 10.0.0.1, the office router, instead of tun0 and OpenVPN:
$ ip route get 10.0.64.9
10.0.64.9 via 10.0.0.1 dev wlan0 src 10.0.0.133 uid 1000
cache
And of course traceroute:
$ traceroute 10.0.64.9
traceroute to 10.0.64.9 (10.0.64.9), 30 hops max, 60 byte packets
 1  office.example.dev (10.0.0.1)  14.261 ms  14.614 ms  14.592 ms
 2  * * *
 3  * * *
...
And obviously, a request to an address in a private subnet sent through the public internet just dies.
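To make the gap in the route table explicit: none of the tun0 prefixes listed above contains 10.0.64.9, so the kernel falls back to the default route via wlan0. Below is a small sketch of that prefix check in POSIX shell arithmetic. `ip_to_int`, `ip_in_cidr` and `tun0_routes_for` are hypothetical helper names. The prefixes are the tun0 entries visible in the `route -n` output, with netmasks converted to prefix lengths; the table there is truncated, so the list may be incomplete.

```shell
# Convert a dotted IPv4 address to an integer.
ip_to_int() {
  old_ifs=$IFS; IFS=.
  set -- $1                     # split the address on dots into $1..$4
  IFS=$old_ifs
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed if address $1 falls inside prefix $2 (for example 10.0.48.0/20).
ip_in_cidr() {
  bits=${2#*/}
  if [ "$bits" -eq 0 ]; then
    mask=0
  else
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
  fi
  [ $(( $(ip_to_int "$1") & $mask )) -eq $(( $(ip_to_int "${2%/*}") & $mask )) ]
}

# Print every tun0 prefix (taken from the route table above) that contains address $1.
tun0_routes_for() {
  for p in 10.0.0.2/32 10.0.6.162/32 10.0.32.0/20 10.0.48.0/20 10.0.66.0/24 10.0.67.0/24; do
    if ip_in_cidr "$1" "$p"; then echo "$p"; fi
  done
}

# Usage:
#   tun0_routes_for 10.0.64.9     # prints nothing: no tun0 route covers it
#   tun0_routes_for 10.0.66.14    # prints 10.0.66.0/24
```

The 10.0.48.0/20 route ends at 10.0.63.255, one address short of the 10.0.64.x range where the EKS endpoint lives.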
Possible solutions
There are a few options – either just add 10.0.64.9 to OpenVPN, or set up split-DNS – and resolve the domains correctly:
- just add 10.0.64.9 to OpenVPN – then it’ll create a route through tun0 on top of the existing ones
- you can set up split-DNS through systemd-resolved
- you can set up split-DNS on a local Unbound or dnsmasq – and switch all DNS queries over to it
The option with 10.0.64.9 on OpenVPN is a hack.
Note: only after I’d written the whole post did I remember that the EKS Control Plane lives in its own VPC Subnets, and I could’ve just added those to OpenVPN the same way it’s done for RDS, but whatever – it turned out interesting anyway 🙂
The split-DNS solution through systemd-resolved looks kind of painful.
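For comparison, the systemd-resolved variant boils down to giving a link its own DNS server plus routing-only domains (the `~` prefix) via `resolvectl dns` and `resolvectl domain`. Below is a rough sketch that only prints the commands rather than running them. The link name and server come from the setup above, and the function name `split_dns_cmds` is invented. As far as I know these per-link settings are runtime-only and would have to be reapplied after every reconnect, and 10.0.0.2 would presumably also have to come out of the global server list.

```shell
# split_dns_cmds LINK SERVER DOMAIN...
# Print (not execute) the resolvectl commands that would pin DOMAINs to SERVER on LINK.
split_dns_cmds() {
  link=$1 server=$2
  shift 2
  echo "resolvectl dns $link $server"
  domains=
  for d in "$@"; do
    domains="$domains '~$d'"    # "~" marks a routing-only domain
  done
  echo "resolvectl domain $link$domains"
}

# Usage: review the output, then run the printed commands with sudo
#   split_dns_cmds tun0 10.0.0.2 compute.internal ops.example.com
```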
And I’d already run Unbound on FreeBSD for my home NAS (see FreeBSD: Home NAS, part 4 – a local DNS with Unbound), the config is simple and clear, and on top of that it kicks systemd-resolved with all its complexities out of the picture – a solid option.
Although dnsmasq might’ve been a better solution for a laptop – because the config is even simpler, but I really liked Unbound – so I went with it.
Arch Linux and Unbound
Install the package itself:
$ sudo pacman -S unbound
What we need to do:
- route all queries for compute.internal (AWS EC2 etc) through OpenVPN and AWS VPC DNS
- same for all queries to ops.example.com, because that’s where we have records for AWS RDS like db.prod.ops.example.com
- route all queries for grafana.net.setevoy through MikroTik, because that’s my local zone for home hosts
- everything else – send to 1.1.1.1 and 8.8.8.8
We write the /etc/unbound/unbound.conf file, describing three forward-zone blocks with our own DNS and one with public DNS:
server:
    interface: 127.0.0.1
    access-control: 127.0.0.0/8 allow
    do-ip6: no
    hide-identity: yes
    hide-version: yes
    prefetch: yes

# local homelab via MikroTik
forward-zone:
    name: "setevoy."
    forward-addr: 10.100.0.1
    forward-addr: 192.168.0.1

forward-zone:
    name: "compute.internal."
    forward-addr: 10.0.0.2

forward-zone:
    name: "ops.example.com."
    forward-addr: 10.0.0.2

# everything else
forward-zone:
    name: "."
    forward-addr: 1.1.1.1
    forward-addr: 8.8.8.8
Check the syntax:
$ sudo unbound-checkconf
unbound-checkconf: no errors in /etc/unbound/unbound.conf
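Unbound uses the most specific forward-zone that encloses the queried name. That is why the EKS endpoint (under amazonaws.com) falls through to "." and the public resolvers, while the RDS records under ops.example.com go to AWS VPC DNS. Below is a toy model of that selection with the zone table copied from the config above. `pick_forwarder` is a made-up name; this illustrates the matching rule, not how Unbound implements it, and it ignores case.

```shell
# pick_forwarder NAME
# Print the most specific forward-zone enclosing NAME and its forwarders.
pick_forwarder() {
  name=${1%.}.                       # normalise to a trailing dot
  best_zone=.
  best_fwd="1.1.1.1 8.8.8.8"         # the "." zone: everything else
  while read -r zone fwd; do
    case "$name" in
      "$zone"|*."$zone")
        if [ ${#zone} -gt ${#best_zone} ]; then
          best_zone=$zone
          best_fwd=$fwd
        fi ;;
    esac
  done <<EOF
setevoy. 10.100.0.1 192.168.0.1
compute.internal. 10.0.0.2
ops.example.com. 10.0.0.2
EOF
  echo "$best_zone -> $best_fwd"
}

# Usage:
#   pick_forwarder prod.db.kraken.ops.example.com
#   pick_forwarder F07.gr7.us-east-1.eks.amazonaws.com
```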
Disabling systemd-resolved
In the post Arch Linux: WireGuard Peer for connecting to MikroTik I described a solution to a different problem, and there I added dns=systemd-resolved for NetworkManager.
If it’s there – remove it in /etc/NetworkManager/NetworkManager.conf, just set dns=none:
...
[main]
dns=none
Disable systemd-resolved (the internet will drop here – because there’s nowhere to send DNS):
$ sudo systemctl disable --now systemd-resolved systemd-resolved-monitor.socket systemd-resolved-varlink.socket
Restart NetworkManager:
$ sudo systemctl restart NetworkManager
Check port 53 – if systemd-resolved is still alive (it shows up as systemd-resolve in the process list), that means something is triggering its startup:
$ sudo ss -tulpn | grep ':53'
...
tcp LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=723720,fd=25))
tcp LISTEN 0 4096 127.0.0.54:53 0.0.0.0:* users:(("systemd-resolve",pid=723720,fd=27))
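A side note on the filter: a plain `grep ':53'` also matches ports such as 5353 (mDNS). Below is a small sketch of a stricter filter over `ss -tulpn` output that prints only the names of processes bound to port 53. `port53_owners` is a hypothetical name, and the parsing assumes the `users:(("name",pid=...))` format shown above.

```shell
# port53_owners: read `ss -tulpn` output on stdin and print the unique
# process names that are listening on port 53 exactly (not 5353 etc).
port53_owners() {
  grep -E ':53[[:space:]]' |
    sed -nE 's/.*users:\(\("([^"]+)".*/\1/p' |
    sort -u
}

# Usage (not run here):
#   sudo ss -tulpn | port53_owners
```

An empty result means port 53 is free.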
You can hard-block its startup with systemctl mask:
$ sudo systemctl mask systemd-resolved
$ sudo systemctl stop systemd-resolved
Check the ports once more, and if there’s no longer anyone on port 53 – start unbound.service:
$ sudo systemctl stop systemd-resolved
$ sudo systemctl enable --now unbound
Created symlink '/etc/systemd/system/multi-user.target.wants/unbound.service' → '/usr/lib/systemd/system/unbound.service'.
$ sudo ss -tulpn | grep ':53\b'
udp UNCONN 0 0 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=727532,fd=3))
tcp LISTEN 0 256 127.0.0.1:53 0.0.0.0:* users:(("unbound",pid=727532,fd=4))
Edit /etc/resolv.conf – point all DNS through it:
nameserver 127.0.0.1
And try something public:
$ dig google.com +short
216.58.207.14
Then the EKS endpoint – it should return public IPs:
$ dig F07***D78.gr7.us-east-1.eks.amazonaws.com +short
3.***.***.161
44.***.***.46
Try RDS – it should return private IPs from the VPC pool:
$ dig prod.db.kraken.ops.example.com +short
kraken-ops-rds-prod.***.us-east-1.rds.amazonaws.com.
10.0.66.14
Edit the WireGuard /etc/wireguard/wg0.conf – change the DNS parameter:
[Interface]
...
DNS = 127.0.0.1
...
Run sudo resolvconf -u, because we edited /etc/resolv.conf by hand and WireGuard will complain otherwise.
Restart WireGuard:
$ sudo wg-quick down wg0 && sudo wg-quick up wg0
Check the file:
$ cat /etc/resolv.conf
# Generated by resolvconf
nameserver 127.0.0.1
And now everything works as it should.