Excessive negative cache time for DNS lookups

  • Open
  • quality assurance status badge
Details
3 participants
  • Thompson, David
  • Ludovic Courtès
  • Mark H Weaver
Owner
unassigned
Submitted by
Mark H Weaver
Severity
normal
M
M
Mark H Weaver wrote on 14 Jun 2015 19:43
(address . bug-guix@gnu.org)
87si9titcm.fsf@netris.org
On my GuixSD system, DNS lookup failures (due to lack of network) are
cached for at least 4 minutes, and possibly much longer (I've not had
the patience to wait until it expires).

Even after restarting nscd, the negative cache entry persists for about
another 20 seconds.

I run into this problem quite frequently after waking my system up from
suspend-to-ram. It typically takes several seconds for wicd to
establish a network connection, and if I make the mistake of triggering
a name lookup of my server (world.peace.net), then I must do the
following:

(1) deco restart nscd
(2) wait about 20 seconds

Looking at %nscd-default-caches in (gnu services base), I see that the
intent is to cache host lookup failures for 20 seconds, and service
lookup failures for an hour. The fact that the negative cache entry
persists for 20 seconds after restarting nscd would seem to indicate
that nscd considers this to be a host lookup. So perhaps this is a bug
in nscd.

Is anyone else able to reproduce this problem?

Mark
T
T
Thompson, David wrote on 15 Jun 2015 06:15
(name . Mark H Weaver)(address . mhw@netris.org)(address . 20816@debbugs.gnu.org)
CAJ=RwfYU6HK8XDxgxnnjipoC7SGJvic1A5zOShWb68T3+AbPOw@mail.gmail.com
On Sun, Jun 14, 2015 at 10:43 PM, Mark H Weaver <mhw@netris.org> wrote:
Toggle quote (25 lines)
> On my GuixSD system, DNS lookup failures (due to lack of network) are
> cached for at least 4 minutes, and possibly much longer (I've not had
> the patience to wait until it expires).
>
> Even after restarting nscd, the negative cache entry persists for about
> another 20 seconds.
>
> I run into this problem quite frequently after waking my system up from
> suspend-to-ram. It typically takes several seconds for wicd to
> establish a network connection, and if I make the mistake of triggering
> a name lookup of my server (world.peace.net), then I must do the
> following:
>
> (1) deco restart nscd
> (2) wait about 20 seconds
>
> Looking at %nscd-default-caches in (gnu services base), I see that the
> intent is to cache host lookup failures for 20 seconds, and service
> lookup failures for an hour. The fact that the negative cache entry
> persists for 20 seconds after restarting nscd would seem to indicate
> that nscd considers this to be a host lookup. So perhaps this is a bug
> in nscd.
>
> Is anyone else able to reproduce this problem?

Yes, I had this exact same issue on Friday after having wireless
network connectivity problems.

- Dave
L
L
Ludovic Courtès wrote on 15 Jun 2015 13:02
(name . Mark H Weaver)(address . mhw@netris.org)(address . 20816@debbugs.gnu.org)
874mm891ud.fsf@inria.fr
Mark H Weaver <mhw@netris.org> skribis:

Toggle quote (7 lines)
> On my GuixSD system, DNS lookup failures (due to lack of network) are
> cached for at least 4 minutes, and possibly much longer (I've not had
> the patience to wait until it expires).
>
> Even after restarting nscd, the negative cache entry persists for about
> another 20 seconds.

To reproduce it, I ran ‘guix system vm’ on this:

Toggle snippet (24 lines)
(use-modules (gnu) (guix))
(use-package-modules base)

(operating-system
(host-name "komputilo")
(timezone "Europe/Berlin")
(locale "en_US.UTF-8")
(bootloader (grub-configuration (device "/dev/sdX")))
(file-systems (cons (file-system
(device "root")
(title 'label)
(mount-point "/")
(type "ext4"))
%base-file-systems))
(packages (cons* (canonical-package glibc) %base-packages))
(services (map (lambda (mservice)
(mlet %store-monad ((service mservice))
(if (memq 'nscd (service-provision service))
(nscd-service (nscd-configuration
(debug-level 10)))
mservice)))
%base-services)))

Then I had one console running “tail -f /var/log/nscd.log”, and another
one where I would run “ping www.gnu.org.”

Initially looking up www.gnu.org fails (because the network is down),
and nscd.log indeed shows:

Toggle snippet (6 lines)
Mon Jun 15 21:04:20 2015 - 276: handle_request: request received (Version = 2) from PID 296
Mon Jun 15 21:04:20 2015 - 276: GETAI (www.gnu.org)
Mon Jun 15 21:04:20 2015 - 276: Haven't found "www.gnu.org" in hosts cache!
Mon Jun 15 21:04:20 2015 - 276: add new entry "www.gnu.org" of type GETAI for hosts to cache (first)

For subsequent requests, it shows fewer details, meaning that the cached
failure is used:

Toggle snippet (5 lines)
Mon Jun 15 21:04:25 2015 - 276: handle_request: request received (Version = 2) from PID 297
Mon Jun 15 21:04:25 2015 - 276: GETFDHST
Mon Jun 15 21:04:25 2015 - 276: provide access to FD 6, for hosts

Twenty seconds after the original failure, we see:

Toggle snippet (4 lines)
Mon Jun 15 21:04:40 2015 - 276: pruning hosts cache; time 1434395080
Mon Jun 15 21:04:40 2015 - 276: considering GETAI entry "www.gnu.org", timeout 1434395080

So cache expiry seems to be working as expected.

Now, if we enable networking with “dhclient -v eth0” and see what
happens, we get again a series of lookup failures, even after the
www.gnu.org entry has been removed from cache; apparently new failures
get cached. Eventually, after a bunch of cache prunes (so a few times
20 seconds), lookup succeeds.

The situation is different if you use ‘getent hosts’ instead of ‘ping’–
note that the latter uses getaddrinfo whereas the former uses
gethostbyname{,v6}, which is cached differently. Here we get a
successful lookup within 20 seconds after networking has been put up.

So my guess is that the GETAI cache is racy or something.

I’ve run nscd with Helgrind, which reports a few things (see attached
file), but I suspect most of it are false positives.

Ludo’.
Attachment: vg.log
?