For a couple of months now, my work laptop, a Lenovo ThinkPad T14 Gen 5 running Arch Linux, has been having trouble opening new websites – for the first 10-15 seconds, the site loads in “pieces”, for example:

But then it “wakes up”, and everything starts working perfectly:

Finally, when I started setting up a proper home network with a VPN (see FreeBSD: Home NAS, part 3 – WireGuard VPN, Linux peer, and routing), and then DNS for it (see FreeBSD: Home NAS, part 4 – local DNS with Unbound), I got around to dealing with this problem.

And the problem turned out to be very interesting. I spent a long time searching for the cause and checked a bunch of different settings – from IPv6 and DNS to the network card driver.

The main thing was that the problem wasn’t exactly critical – overall the internet worked, so I would occasionally start looking for the cause, then give up, then return to it again.

The issue: “communications error to 192.168.0.1#53: timed out”

Interestingly, the problem was only observed on an Ethernet connection – on WiFi everything worked perfectly.

And on Ethernet, it was reproducible with different cables and through different routers.

So – what does that mean? It means either I tinkered with something in my Linux manually, or a “buggy” update arrived for the kernel, the driver, or some library.

I don’t remember why, but at first I blamed DNS, because, as everyone knows, “it’s always DNS”.

And indeed – I managed to reproduce it precisely with DNS during tests with dig – so I spent a long time digging in that direction.

The problem looked like this: we run dig, 10-15 requests pass normally, and then “communications error to 192.168.0.1#53: timed out” arrives:

$ time dig google.com +short @192.168.0.1
;; communications error to 192.168.0.1#53: timed out
...

real    0m5.018s
user    0m0.004s
sys     0m0.008s

And this looked like the actual reason why websites were slow to load their content: if DNS periodically drops out, and a site pulls a bunch of additional scripts and images from other hosts, then by the time every host is resolved we get exactly this delay of 10-15 seconds.

Logical? Yes.

Therefore, all subsequent tests were done in a loop with dig:

$ for i in {1..50}; do { time dig +nocookie +noedns +tries=1 +time=2 google.com >/dev/null; } 2>&1; done
...
real    0m0.016s
...
real    0m2.015s
...
real    0m0.013s
...
real    0m1.392s

And the result was consistent: a batch of requests passes normally with “real 0m0.016s”, and then one of them times out with “real 0m2.015s” (because +time=2 makes dig wait 2 seconds instead of the default 5).
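To compare runs of this loop without reading every line, the `real` values can be tallied automatically. This is only a minimal sketch: the function name `summarize_real_times` and the idea of a threshold argument are my own assumptions for illustration, not part of the original tests.

```shell
# Count how many "real" timings from `time` reach or exceed a threshold.
# Usage: summarize_real_times THRESHOLD_SECONDS   (reads the loop output on stdin)
summarize_real_times() {
    awk -v limit="$1" '
        $1 == "real" {
            # $2 looks like "0m2.015s": split it into minutes and seconds
            split($2, t, "m")
            sub(/s$/, "", t[2])
            secs = t[1] * 60 + t[2]
            total++
            if (secs >= limit) slow++
        }
        END { printf "total=%d slow=%d\n", total, slow }
    '
}
```

Since the loop already redirects the `time` output to stdout, it can be piped straight in, for example `for i in {1..50}; do ...; done | summarize_real_times 1`.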

The same problem was visible with tcpdump: at 09:57:47 the request is sent, but no response is received; after 2 seconds, at 09:57:49 – a new request, and a response arrives for that one:

...
09:57:47.717951 IP setevoy-work.40923 > _gateway.domain: 13058+ [1au] A? google.com. (51)
09:57:49.729589 IP setevoy-work.45441 > _gateway.domain: 63641+ [1au] A? google.com. (51)
09:57:49.730249 IP _gateway.domain > setevoy-work.45441: 63641 6/4/4 A 142.250.109.101, A 142.250.109.100, A 142.250.109.139, A 142.250.109.138, A 142.250.109.102, A 142.250.109.113 (260)
...
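In a longer capture, spotting the unanswered query by eye gets tedious. As a rough sketch, assuming tcpdump's one-line text format with resolved names exactly as shown above, queries and responses can be paired by the DNS transaction ID (the number before the `+`). The function name `unanswered_dns_queries` is hypothetical, and the pairing ignores the fact that IDs may repeat in a long capture.

```shell
# Print "timestamp id" for every DNS query that never got a response.
# Assumes port 53 is printed as ".domain" (tcpdump without -n), as in the capture above.
unanswered_dns_queries() {
    awk '
        $2 == "IP" && $5 ~ /\.domain:$/ {   # destination is port 53: a query
            id = $6; gsub(/[^0-9]/, "", id) # strip flags such as "+"
            sent[id] = $1
            order[++n] = id
        }
        $2 == "IP" && $3 ~ /\.domain$/ {    # source is port 53: a response
            id = $6; gsub(/[^0-9]/, "", id)
            answered[id] = 1
        }
        END {
            for (i = 1; i <= n; i++)
                if (!(order[i] in answered))
                    print sent[order[i]], order[i]
        }
    '
}
```

Fed with the three lines above, it should report only the 09:57:47 query with ID 13058.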

The problem was similarly visible with strace:

$ strace -r -e trace=network dig google.com
...
     0.002788 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 15
...
     ;; communications error to 192.168.0.1#53: timed out
     5.005754 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 16
...

Here at 0.002788 a socket is opened to send the request, and after 5 seconds (5.005754) – since dig was now running without +time=2 – a new socket opens for a new request because there was no response to the previous one.

Searching for the cause

Here I will describe what I checked – it turned out to be quite a quest.

I didn’t record everything I did, but I saved the main parts – I’ve had a habit for a long time of throwing notes into a draft post on RTFM while debugging problems.

Checking DNS in Linux

First – what’s up with DNS in the system?

The router is specified in /etc/resolv.conf:

# Generated by NetworkManager
nameserver 192.168.0.1

Changed to 1.1.1.1 or 8.8.8.8 – the problem remains.

Okay… Maybe there’s another active resolver in the system, and a “DNS race in the kernel” begins – the request “wanders” between them?

Checked systemd-resolved – no, not running:

$ systemctl status systemd-resolved
○ systemd-resolved.service - Network Name Resolution
     Loaded: loaded (/usr/lib/systemd/system/systemd-resolved.service; disabled; preset: enabled)
     Active: inactive (dead)
...

Maybe dnsmasq?

Also disabled:

$ systemctl status dnsmasq
○ dnsmasq.service - dnsmasq - A lightweight DHCP and caching DNS server
     Loaded: loaded (/usr/lib/systemd/system/dnsmasq.service; disabled; preset: disabled)
     Active: inactive (dead)
...

So, DNS requests are going directly to the router, and… What? Is the router lagging with responses? Do the requests not reach it – are they lost occasionally?

What could it be?

local firewall on Linux or the router?
- no – disabled them, problem remained
race between several local DNS services?
- ruled out above
network card power management – is it going to sleep?
- unlikely, but I checked this later as well
network card driver bug?
- possible, because the problem appeared not long ago; before this, everything worked without issues on this laptop and this system
some problems specifically with UDP?
- also no – ran dig +tcp google.com, problem remained
response to DNS request returning from a different IP?
- an exotic idea, but as an option – the router has several network interfaces combined in a bridge, and – theoretically – the router could send the response from a different one
- but this is something very extraordinary, and the problem occurred identically on different routers, and it didn’t exist before

IPv6 and DNS

I don’t remember why, but early on I suspected that IPv6 was interfering with DNS resolution.

/etc/gai.conf manages the address selection algorithm in glibc (GAI = getaddrinfo()), and determines which address (IPv4 or IPv6) an application making a DNS request will choose first if DNS returned both A and AAAA records.

You can enable IPv4 first by uncommenting the line:

...
precedence ::ffff:0:0/96 100
...

Check which returns first – the IPv4 address or IPv6:

$ getent ahosts google.com
142.250.130.100 STREAM google.com
...  
2a00:1450:4025:800::64 STREAM 
...

IPv4 first, but that didn’t help either.

I tried disabling IPv6 in the kernel entirely:

$ sudo sysctl -w net.ipv6.conf.all.disable_ipv6=1
$ sudo sysctl -w net.ipv6.conf.default.disable_ipv6=1

At this point, it seemed like the problem was found – because the first time everything went through without issues, but no – then timeouts appeared again.

NIC Offloading

NIC Offloading means that some operations are performed by the network interface itself, i.e., some tasks are moved from the laptop’s CPU to the card’s controller.

Check active ones with ethtool -k:

$ sudo ethtool -k enp0s31f6 | grep ': on'
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ip-generic: on
scatter-gather: on
        tx-scatter-gather: on
tcp-segmentation-offload: on
...
generic-segmentation-offload: on
generic-receive-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
receive-hashing: on
...

The most interesting ones here are:

TSO (TCP Segmentation Offloading): the kernel hands the card one large chunk of data (e.g., 64 KB), and the card itself “slices” it into TCP segments that fit the 1500-byte MTU
GSO (Generic Segmentation Offloading): the same idea as TSO, but more universal (works not only for TCP) and done in software by the kernel, right before the data is handed to the driver
GRO (Generic Receive Offloading): the reverse process – many small incoming packets are “glued” into one large one before being passed up the network stack, which saves CPU resources (this is also done in the kernel; the hardware variant is LRO)
RX and TX Checksum Offloading: the card, instead of the CPU, verifies the IP/TCP/UDP checksums of incoming packets and computes them for outgoing ones; this is separate from the Ethernet frame CRC, which the card always checks in hardware, normally dropping corrupted frames and counting them as errors

Disable them one by one and run the dig tests:

sudo ethtool -K enp0s31f6 gro off: didn’t help
sudo ethtool -K enp0s31f6 gso off: didn’t help
sudo ethtool -K enp0s31f6 tso off: didn’t help
sudo ethtool -K enp0s31f6 rx off: didn’t help, and it actually made it worse

In hindsight, things getting worse after disabling RX Checksum Offloading was already a hint that packets were arriving damaged: presumably, while the card had been dealing with bad packets itself, now they all reached the kernel for verification, creating additional load and chaos in the packet queue, so useful DNS responses got lost even more often.
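When toggling features like this, it is easy to forget to switch one back before testing the next. The sketch below wraps the same `ethtool -K` calls in a loop. The function name `test_offloads_one_by_one` and the `RUN` variable are hypothetical helpers of mine, and restoring each feature to `on` assumes it was enabled to begin with, as in the `ethtool -k` output above.

```shell
# Dry-run switch: with RUN=echo the ethtool commands are only printed.
RUN=${RUN:-sudo}

# Usage: test_offloads_one_by_one IFACE [TEST_COMMAND...]
test_offloads_one_by_one() {
    iface=$1
    shift
    for feature in gro gso tso rx; do
        $RUN ethtool -K "$iface" "$feature" off
        "$@"    # the check to run, e.g. a script wrapping the dig loop
        # Assumes the feature was "on" before; restore it for the next round.
        $RUN ethtool -K "$iface" "$feature" on
    done
}
```

Running it first with `RUN=echo` shows the exact commands without changing anything on the interface.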

NIC Power Management

EEE (Energy Efficient Ethernet) is supposed to reduce the energy consumption of the card.

Checking:

$ sudo ethtool --show-eee enp0s31f6
EEE settings for enp0s31f6:
        EEE status: enabled - active
        Tx LPI: 17 (us)
        Supported EEE link modes:  100baseT/Full
                                   1000baseT/Full
        Advertised EEE link modes:  100baseT/Full
                                    1000baseT/Full
        Link partner advertised EEE link modes:  100baseT/Full
                                                 1000baseT/Full

Currently “enabled – active” – disabling:

$ sudo ethtool --set-eee enp0s31f6 eee off

Didn’t help.

I also tried this: ran ping with short intervals so the card doesn’t fall asleep:

$ ping -i 0.2 192.168.0.1

And simultaneously launched the loop with dig – but the problem remained.

I separately checked the Runtime Power Management settings:

Find the PCI address for the device enp0s31f6:

$ ls -l /sys/class/net/enp0s31f6/device
lrwxrwxrwx 1 root root 0 Jan 19 09:38 /sys/class/net/enp0s31f6/device -> ../../../0000:00:1f.6

Or:

$ lspci -D | grep Ethernet
0000:00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (18) I219-LM (rev 20)

And check power parameters:

$ cat /sys/bus/pci/devices/0000:00:1f.6/power/control
on

“on” means runtime power management is disabled for this device, i.e., it stays powered all the time, so it shouldn’t be going to sleep.

The Driver and Message Signaled Interrupts

Check the driver:

$ lspci -k -s 00:1f.6
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (18) I219-LM (rev 20)
        Subsystem: Lenovo Device 2327
        Kernel driver in use: e1000e
        Kernel modules: e1000e

Network controller – Intel I219-LM, and the e1000e driver, which is said to be “capricious”.

Interrupt parameters:

$ cat /proc/interrupts | grep -i enp0s31f6
... IR-PCI-MSI-0000:00:1f.6 0-edge enp0s31f6

IR-PCI-MSI-0000:00:1f.6 – the driver uses MSI (Message Signaled Interrupts), which reportedly can cause drops for UDP on some Intel cards in Linux.

I created the file /etc/modprobe.d/e1000e.conf and set the interrupt mode to legacy (see Linux* Driver for Intel(R) Ethernet Network Connection):

options e1000e IntMode=0

Rebooted and checked:

$ cat /proc/interrupts | grep -i enp0s31f6
  19:     240716         ...  IR-IO-APIC   19-fasteoi   enp0s31f6

Didn’t help – the problem was still there.

And besides, dig +tcp google.com was still having problems.

Final: `rx_crc_errors` and reducing speed

And the thing I had skipped at the very beginning was checking the error counters on the interface.

I missed it because the number of errors was not growing during tests:

$ ip -s link show enp0s31f6
3: enp0s31f6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether c4:c6:e6:e7:e4:26 brd ff:ff:ff:ff:ff:ff
    RX:  bytes packets errors dropped  missed   mcast           
     750558152  589207    104       0       0       0 
    TX:  bytes packets errors dropped carrier collsns           
      40067575  157761      0       2       0       0 
    altname enxc4c6e6e7e426

Or with ethtool:

$ sudo ethtool -S enp0s31f6 | grep -E "errors|missed|dropped|timeout|tx_aborted" | grep -v ": 0"
     rx_errors: 114
     tx_dropped: 26
     rx_crc_errors: 57

rx_crc_errors indicates a problem with packet integrity, and – if the router and cable are fine (and the problem was observed on different routers and with different cables) – it is most likely a problem with the RJ-45 port on the laptop itself, although the contacts look fine.
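Because the counters did not visibly grow during a short test, it can help to compare two saved snapshots of `ethtool -S` taken some time apart instead of watching the numbers by eye. A minimal sketch, with the function name `counter_delta` and the snapshot file names being my own assumptions:

```shell
# Print every counter whose value differs between two saved `ethtool -S` outputs.
# Usage: counter_delta BEFORE_FILE AFTER_FILE
counter_delta() {
    awk -F': *' '
        { sub(/^[ \t]+/, "", $1) }               # drop the leading indentation
        NR == FNR { before[$1] = $2; next }      # first file: remember the values
        ($1 in before) && $2 != before[$1] {     # second file: report what changed
            printf "%s: %+d\n", $1, $2 - before[$1]
        }
    ' "$1" "$2"
}
```

For example: `sudo ethtool -S enp0s31f6 > before.txt`, work on the wired connection for a while, `sudo ethtool -S enp0s31f6 > after.txt`, then `counter_delta before.txt after.txt`.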

I tried forcibly reducing the speed on the interface from gigabit to 100 Mbps:

$ sudo ethtool -s enp0s31f6 speed 100 duplex full autoneg on

And a miracle! Everything works!

Returned to 1000 again:

$ sudo ethtool -s enp0s31f6 speed 1000 duplex full autoneg on

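And the problem reappears. This fits the suspicion about the port: 100BASE-TX uses only two of the four twisted pairs, while 1000BASE-T needs all four, so a poor contact on one of the other pairs would hurt only the gigabit link.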
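After forcing a speed, it is worth confirming what the link actually negotiated before drawing conclusions from a test run. A small sketch that pulls the `Speed` and `Duplex` fields out of the regular `ethtool` output; the function name `link_summary` is hypothetical:

```shell
# Print the negotiated speed and duplex from `ethtool IFACE` output read on stdin.
link_summary() {
    awk -F': *' '
        /^[ \t]*Speed:/  { speed = $2 }
        /^[ \t]*Duplex:/ { duplex = $2 }
        END { print speed, duplex }
    '
}
```

Usage: `sudo ethtool enp0s31f6 | link_summary`.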

I could have just left it at 100 Mbps – but I didn’t run a cable and pay for gigabit GPON for nothing, right?

Fortunately, I have a few USB Ethernet adapters at home; I switched the cable to one:

$ ip a s enp0s13f0u2u3
2: enp0s13f0u2u3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether c8:4d:44:29:27:6b brd ff:ff:ff:ff:ff:ff
    altname enxc84d4429276b
    inet 192.168.0.198/24 brd 192.168.0.255 scope global dynamic noprefixroute enp0s13f0u2u3
...

Gigabit and Full Duplex are available:

$ sudo ethtool enp0s13f0u2u3
Settings for enp0s13f0u2u3:
        Supported ports: [ TP    MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Half 1000baseT/Full
        ...
        Speed: 1000Mb/s
        Duplex: Full
        ...
                               drv probe link timer ifdown ifup rx_err tx_err tx_queued intr tx_done rx_status pktdata hw wol
        Link detected: yes

And now everything works without issues.