cuirass: missing derivation error

Status: Open
Participants (6): 宋文武, John Kehayias, Ludovic Courtès, Maxim Cournoyer, Maxime Devos, Mathieu Othacehe
Owner: unassigned
Submitted by: Mathieu Othacehe
Severity: important

Mathieu Othacehe wrote on 18 Mar 2022 05:36
(address . bug-guix@gnu.org)
877d8r4etz.fsf@gnu.org
Hello,

A lot of builds, among them ~20 system tests[1], are failing with:
"cannot build missing derivation
‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
errors.

Those derivations are present on the CI head node. This means that the
errors occur during substitution. This is most likely caused by some
issue with the publish server, because:

- The publish server serves a 404 error. We should get rid once and for
all of this 404 thing, pushing something like:
https://issues.guix.gnu.org/50040.

or

- The publish server is not fast enough and hits an Nginx timeout that
closes the communication.

Any other cause I could be missing?

Thanks,

Mathieu

Maxime Devos wrote on 10 Aug 2022 08:30
(address . 54447@debbugs.gnu.org)
66c88669-b72b-d1a0-613c-abb346cf73e7@telenet.be
On 10-08-2022 11:43, Maxime Devos wrote:
> Here's another instance: https://ci.guix.gnu.org/eval/528710
>
More information:
* non-ASCII does not seem to be set up (see: ?) (looks irrelevant)
* there are connection failures
Log:
> substitute:
> substitute: updating substitutes from 'http://141.80.167.131'... 0.0%
> guix substitute: warning: 141.80.167.131: connection failed: Connection refused
> substitute:
> cannot build missing derivation ‘/gnu/store/4gqj2byvj9zz30wzvwkbijpya3vn1bjw-rust-dogged-0.2.0.drv’
Greetings,
Maxime.
Ludovic Courtès wrote on 10 Dec 2022 02:57
Re: bug#54447: cuirass: missing derivation error
(address . 54447@debbugs.gnu.org)
87a63v5xwd.fsf@gnu.org
Mathieu Othacehe <othacehe@gnu.org> skribis:

> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
> errors.
>
> Those derivations are present on the CI head node. This means that the
> errors occur during substitution. This is most likely caused by some
> issue with the publish server, because:
>
> - The publish server serves a 404 error. We should get rid once and for
> all of this 404 thing, pushing something like:
> https://issues.guix.gnu.org/50040.
>
> or
>
> - The publish server is not fast enough and hits an Nginx timeout that
> closes the communication.

Also being discussed at https://issues.guix.gnu.org/48468#12.

Ludo’.
Ludovic Courtès wrote on 10 Dec 2022 02:56
control message for bug #54447
(address . control@debbugs.gnu.org)
87edt75xxm.fsf@gnu.org
severity 54447 important
quit
Maxim Cournoyer wrote on 21 Aug 2023 20:38
Re: bug#54447: cuirass: missing derivation error
(name . Mathieu Othacehe)(address . othacehe@gnu.org)(address . 54447@debbugs.gnu.org)
87ttsrbvb2.fsf@gmail.com
Hello,

Mathieu Othacehe <othacehe@gnu.org> writes:

> Hello,
>
> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
> errors.
>
> Those derivations are present on the CI head node. This means that the
> errors occur during substitution. This is most likely caused by some
> issue with the publish server, because:
>
> - The publish server serves a 404 error. We should get rid once and for
> all of this 404 thing, pushing something like:
> https://issues.guix.gnu.org/50040.
>
> or
>
> - The publish server is not fast enough and hits an Nginx timeout that
> closes the communication.
>
> Any other cause I could be missing?

Looking at several recent 'cannot build missing derivation' build
failures on Cuirass, I see for example:

substitute:
substitute: updating substitutes from 'http://141.80.167.131'... 0.0%
substitute: could not fetch http://141.80.167.131/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo 504
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
cannot build missing derivation ‘/gnu/store/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym-python-asdf-standard-1.0.3.drv’

So it seems the error originated from guix-publish being too heavily
under load to produce a timely reply, and the nginx proxy issued a 504
(timeout) error response.
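
To collect such failures across many build logs, a small script along these lines could help. This is only a sketch: the regular expression is inferred from the log excerpt above, and the function name `failed_narinfo_fetches` is hypothetical, not part of Guix or Cuirass.

```python
import re

# Matches lines such as:
#   substitute: could not fetch http://HOST/HASH.narinfo 504
# The pattern is an assumption based on the excerpt in this thread.
FETCH_FAILURE = re.compile(
    r"could not fetch (?P<url>\S+/(?P<hash>[0-9a-z]{32})\.narinfo) (?P<status>\d{3})"
)


def failed_narinfo_fetches(log_text):
    """Return (store hash, HTTP status) pairs for failed narinfo fetches."""
    return [(m.group("hash"), int(m.group("status")))
            for m in FETCH_FAILURE.finditer(log_text)]


sample = """\
substitute: updating substitutes from 'http://141.80.167.131'... 0.0%
substitute: could not fetch http://141.80.167.131/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo 504
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
"""
print(failed_narinfo_fetches(sample))
```

Grouping the results by status code would separate the 504 (timeout) cases from the 404 ones mentioned earlier in this report.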

Looking into /var/log/guix-publish.log for a corresponding entry, I
found:

2023-08-21 23:59:35 GET /rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo
2023-08-21 23:59:35 In web/server/http.scm:
2023-08-21 23:59:35 159:7 2 (http-write #<<http-server> socket: #<input-output: fi…> …)
2023-08-21 23:59:35 In unknown file:
2023-08-21 23:59:35 1 (put-bytevector #<input-output: socket 42> #vu8(83 # …) …)
2023-08-21 23:59:35 In ice-9/boot-9.scm:
2023-08-21 23:59:35 1685:16 0 (raise-exception _ #:continuable? _)
2023-08-21 23:59:35 In procedure fport_write: Broken pipe

So the connection was apparently severed (?), resulting in the "broken
pipe" error.

Here's a different one:

substitute:
substitute: updating substitutes from 'http://141.80.167.131'... 0.0%
substitute: could not fetch http://141.80.167.131/p2lfyvbxicjqsm4qp6368bx76gp0g948.narinfo 504
substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
cannot build missing derivation ‘/gnu/store/p2lfyvbxicjqsm4qp6368bx76gp0g948-python-astropy-healpix-0.7.drv’

It occurred around the same time, and the failure mode was the same, per
guix-publish.log:

2023-08-21 23:59:35 GET /p2lfyvbxicjqsm4qp6368bx76gp0g948.narinfo
2023-08-21 23:59:35 In web/server/http.scm:
2023-08-21 23:59:35 159:7 2 (http-write #<<http-server> socket: #<input-output: fi…> …)
2023-08-21 23:59:35 In unknown file:
2023-08-21 23:59:35 1 (put-bytevector #<input-output: socket 50> #vu8(83 # …) …)
2023-08-21 23:59:35 In ice-9/boot-9.scm:
2023-08-21 23:59:35 1685:16 0 (raise-exception _ #:continuable? _)
2023-08-21 23:59:35 In procedure fport_write: Broken pipe

I wonder if these could be related to the DDoS protection discovered on
the Berlin network. I'll keep looking for other, potentially different
occurrences.

--
Thanks,
Maxim
Ludovic Courtès wrote on 22 Aug 2023 13:38
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
871qfu4xtr.fsf@gnu.org
Hi,

Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

> Looking at multiple of recent 'cannot build missing derivation' build
> failures on Cuirass, I see for example:
>
> substitute:
> substitute: updating substitutes from 'http://141.80.167.131'... 0.0%
> substitute: could not fetch http://141.80.167.131/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo 504
> substitute: updating substitutes from 'http://141.80.167.131'... 100.0%
> cannot build missing derivation ‘/gnu/store/rhgrs3ac6h64siz0krqh2ia8kkn3h6ym-python-asdf-standard-1.0.3.drv’
>
>
> So it seems the error originated from guix-publish being too heavily
> under load to produce a timely reply, and the nginx proxy issued a 504
> (timeout) error response.
>
> Looking into /var/log/guix-publish.log for a corresponding entry, I
> found:
>
> 2023-08-21 23:59:35 GET /rhgrs3ac6h64siz0krqh2ia8kkn3h6ym.narinfo
> 2023-08-21 23:59:35 In web/server/http.scm:
> 2023-08-21 23:59:35 159:7 2 (http-write #<<http-server> socket: #<input-output: fi…> …)
> 2023-08-21 23:59:35 In unknown file:
> 2023-08-21 23:59:35 1 (put-bytevector #<input-output: socket 42> #vu8(83 # …) …)
> 2023-08-21 23:59:35 In ice-9/boot-9.scm:
> 2023-08-21 23:59:35 1685:16 0 (raise-exception _ #:continuable? _)
> 2023-08-21 23:59:35 In procedure fport_write: Broken pipe
>
>
> So the connection was apparently severed (?), resulting in the "broken
> pipe" error.

I think it’s just that, when ‘guix publish’ eventually replied, the
client had left, hence EPIPE.

The initial problem does look like ‘guix publish’ being too slow. Do
the corresponding nginx logs confirm the “backend too slow => 504”
hypothesis?

Thanks for investigating!

Ludo’.
宋文武 wrote on 30 Aug 2023 05:17
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87y1hsu3lb.fsf@envs.net
Maxim Cournoyer <maxim.cournoyer@gmail.com> writes:

> I wonder if these could be related to the DDoS protection discovered on
> the Berlin network. I'll keep looking for other, potentially different
> occurrences.



cannot build missing derivation ‘/gnu/store/anzz2p18b7r9x45y350avnk8br2yihi2-ddd-3.4.0.drv’

Restarting it on CI still produced the same error.
Ludovic Courtès wrote on 10 Oct 2023 08:52
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87r0m2v5ih.fsf@gnu.org
Hello!

Mathieu Othacehe <othacehe@gnu.org> skribis:

> A lot of builds, among them ~20 system tests[1], are failing with:
> "cannot build missing derivation
> ‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
> errors.

I have a disappointingly simple hypothesis for this. Remember that
“missing derivation” errors happen primarily for system tests.

Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
mcron job, explicitly removes GC roots for things like *-os-encrypted
once they’re more than two days old, as well as GC roots for the
corresponding .drv.

I think this was increasing the likelihood that a .drv would be GC’d by
the time we run the test: under high load¹, it’s plausible that a system
test wouldn’t be built within two days after it’s been queued.

I’m proposing the change below to address this; I don’t think we need
‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
things in ‘guix publish’ cache first and foremost.

Thoughts?

In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
days in practice). That’s okay, except that it would be safer to delete
GC roots for a .drv if and only if it’s been built already.
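
The "delete a .drv root only once it has been built" rule can be illustrated with a short sketch. It is written in Python for illustration only; Cuirass itself is written in Guile Scheme, and `GcRoot`, its fields and `roots_to_delete` are hypothetical names, not Cuirass's actual data model.

```python
import time
from dataclasses import dataclass


@dataclass
class GcRoot:
    name: str     # basename of the root; .drv roots end in ".drv"
    mtime: float  # last modification time, seconds since the epoch
    built: bool   # assumption: we can tell whether the build completed


def roots_to_delete(roots, ttl_seconds, now=None):
    """Select expired roots, keeping .drv roots whose build has not run yet."""
    now = time.time() if now is None else now
    selected = []
    for root in roots:
        expired = now - root.mtime > ttl_seconds
        pending_drv = root.name.endswith(".drv") and not root.built
        if expired and not pending_drv:
            selected.append(root.name)
    return selected
```

With a 30-day TTL, an expired root for a derivation that is still queued would be kept, so the daemon can still find the .drv when the build finally starts.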

Thanks,
Ludo’.

¹ The queue was often processed slowly, with many workers remaining idle
due to the bug fixed by
diff --git a/hydra/modules/sysadmin/services.scm b/hydra/modules/sysadmin/services.scm
index fecfdde..e6f2b44 100644
--- a/hydra/modules/sysadmin/services.scm
+++ b/hydra/modules/sysadmin/services.scm
@@ -110,9 +110,7 @@
((guix config) => ,(make-config.scm)))
#~(begin
(use-modules (ice-9 ftw)
- (srfi srfi-1)
- (guix store)
- (guix derivations))
+ (srfi srfi-1))
(define %roots-directory
"/var/guix/profiles/per-user/cuirass/cuirass")
@@ -157,28 +155,6 @@
deleted))
deleted))
- (define (root-target root)
- ;; Return the store item ROOT refers to.
- (string-append (%store-prefix) "/" (basename root)))
-
- (define (derivation-referrers store item)
- ;; Return the referrers of the derivers of ITEM.
- (let* ((derivers (valid-derivers store item))
- (referrers (append-map (lambda (drv)
- (referrers store drv))
- derivers)))
- (delete-duplicates referrers)))
-
- (define (delete-gc-root-for-derivation drv)
- ;; Delete the GC root for DRV, if any.
- (catch 'system-error
- (lambda ()
- (let ((item (derivation-path->output-path drv)))
- (delete-file
- (string-append %roots-directory
- "/" (basename drv)))))
- (const #f)))
-
;; Note: 'scandir' would introduce too much overhead due
;; to the large number of entries that it would sort.
(define deleted
@@ -197,17 +173,7 @@
(for-each (lambda (file)
(display file port)
(newline port))
- deleted)))
-
- ;; Since we run 'guix-daemon --gc-keep-outputs
- ;; --gc-keep-derivations', also remove GC roots for the outputs of
- ;; derivations that refer to the derivers of DELETED.
- (for-each delete-gc-root-for-derivation
- (with-store store
- (append-map (lambda (root)
- (derivation-referrers
- store (root-target root)))
- deleted))))))))
+ deleted))))))))
(define (gc-jobs threshold)
"Return the garbage collection mcron jobs. The garbage collection
@@ -251,8 +217,7 @@ collection instead."
(build-accounts (* build-accounts-to-max-jobs-ratio max-jobs))
(extra-options (list "--max-jobs" (number->string max-jobs)
- "--cores" (number->string cores)
- "--gc-keep-outputs" "--gc-keep-derivations"))))
+ "--cores" (number->string cores)))))
;;;
Maxim Cournoyer wrote on 10 Oct 2023 20:08
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mswpeu03.fsf@gmail.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Hello!
>
> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> A lot of builds, among them ~20 system tests[1], are failing with:
>> "cannot build missing derivation
>> ‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
>> errors.
>
> I have a disappointingly simple hypothesis for this. Remember that
> “missing derivation” errors happen primarily for system tests.
>
> Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
> mcron job, explicitly removes GC roots for things like *-os-encrypted
> once they’re more than two days old, as well as GC roots for the
> corresponding .drv.
>
> I think this was increasing the likelihood that a .drv would be GC’d by
> the time we run the test: under high load¹, it’s plausible that a system
> test wouldn’t be built within two days after it’s been queued.
>
> I’m proposing the change below to address this; I don’t think we need
> ‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
> things in ‘guix publish’ cache first and foremost.
>
> Thoughts?

Ah, so that mcron job is kind of a hack to hasten garbage collecting
only *some* items faster than the default policy of 30 days? And we'd
now avoid deleting selected .drv files while still deleting their
outputs, so in the case something that needs it took more than 2 days to
build, it could lead to having to rebuild the garbage collected outputs?

I'm not sure if we need such a fancy hack with the 100 TiB of data we
now have, but your fix seems reasonable (LGTM!)

> In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
> procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
> days in practice). That’s okay, except that it would be safer to delete
> GC roots for a .drv if and only if it’s been built already.

Hm. I wonder if this could explain the other cases we've seen. It
could be that building a derivation was interrupted or canceled for some
reason, then 30 days elapsed, then was garbage collected, and after
which it doesn't get recreated and we get the error of the missing .drv?

--
Thanks,
Maxim
Maxim Cournoyer wrote on 10 Oct 2023 20:21
(name . 宋文武)(address . iyzsong@envs.net)
87il7detde.fsf@gmail.com
Hello,

宋文武 <iyzsong@envs.net> writes:

[...]

>
> cannot build missing derivation ‘/gnu/store/anzz2p18b7r9x45y350avnk8br2yihi2-ddd-3.4.0.drv’
>
> Restart it on CI still got the same error.

Another example: https://ci.guix.gnu.org/build/1982454/details

substitute:
substitute: updating substitutes from 'http://10.0.0.1'... 0.0%
substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
cannot build missing derivation ‘/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv’

--
Thanks,
Maxim
Ludovic Courtès wrote on 15 Oct 2023 09:45
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87r0lv96mm.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>
> substitute:
> substitute: updating substitutes from 'http://10.0.0.1'... 0.0%
> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
> cannot build missing derivation ‘/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv’

This one is from Sep. 9, which is before I deployed the remote-worker
fixes, so I’ll dismiss it (happy to look at more recent ones though!).

Tip of the day: M-: (build-farm-build 1982454)

Ludo’.
Ludovic Courtès wrote on 15 Oct 2023 13:21
(address . 54447@debbugs.gnu.org)
87pm1f8wm1.fsf@gnu.org
Hi!

Ludovic Courtès <ludo@gnu.org> skribis:

> Mathieu Othacehe <othacehe@gnu.org> skribis:
>
>> A lot of builds, among them ~20 system tests[1], are failing with:
>> "cannot build missing derivation
>> ‘/gnu/store/hs6kp1lqgymhyp3jndc0dsp0pn4psgv0-gui-installed-desktop-os-encrypted.drv’"
>> errors.
>>
>> Those derivations are present on the CI head node. This means that the
>> errors occur during substitution. This is most likely caused by some
>> issue with the publish server, because:
>>
>> - The publish server serves a 404 error. We should get rid once and for
>> all of this 404 thing, pushing something like:
>> https://issues.guix.gnu.org/50040.
>>
>> or
>>
>> - The publish server is not fast enough and hits an Nginx timeout that
>> closes the communication.
>
> Also being discussed at <https://issues.guix.gnu.org/48468#12>.

I got confirmation that the cache-bypass-threshold hypothesis holds, at
least for system tests.

which ends like this:

@ substituter-succeeded /gnu/store/qh2876i5l1wvxgwhg9fbl9zmb3px3n2m-gc-roots.drv
fetching path `/gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder'...
@ substituter-started /gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder substitute
Downloading http://141.80.167.131/nar/lzip/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder...
. xdg-mime-database-builder 3.6MiB/s 00:00 | 3KiB transferred. xdg-mime-database-builder 1.9MiB/s 00:00 | 3KiB transferred

@ substituter-succeeded /gnu/store/fh9dnmrfsz429pwqmvsjnk0snlm959kc-xdg-mime-database-builder
cannot build missing derivation ‘/gnu/store/4r1wij3bzj9zv75ds82a93jl7bcman2x-installed-extlinux-os.drv’

Looking at the nginx and ‘guix publish’ logs, I found that the missing
substitute is not that of 4r1wij3bzj9zv75ds82a93jl7bcman2x (the .drv
itself) but rather that of a dependency of that .drv:

[14/Oct/2023:23:22:09 +0200] "GET /wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg.narinfo HTTP/1.1" 404 58 "-" "GNU Guile"

That item’s size is above the cache bypass threshold of 100 MiB as
currently configured on berlin:

$ du -hs /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
124M /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5

The immediate fix/workaround is to raise that threshold.
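
The arithmetic behind this can be sketched as follows. The sketch assumes the behaviour the Guix manual describes for `--cache-bypass-threshold`: uncached items smaller than the threshold are served immediately, while larger ones yield a 404 until they are cached. The function name is hypothetical, and the `du` figure above is only an approximation of the size `guix publish` actually compares.

```python
MIB = 1024 * 1024


def served_on_cache_miss(item_size_bytes, bypass_threshold_bytes):
    """True if an uncached item is small enough to be served right away."""
    return item_size_bytes < bypass_threshold_bytes


item = 124 * MIB  # approximate size reported by `du -hs` above
print(served_on_cache_miss(item, 100 * MIB))  # False: client gets a 404
print(served_on_cache_miss(item, 150 * MIB))  # True: served despite the miss
```

The 150 MiB value is the threshold later set on berlin; any item that grows past it would hit the same 404 again.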

A better solution would be for system tests to depend on a fixed-output
derivation for the Guix source instead of the “source” above (I use
“source” as it is used in the context of <derivation>).

Thanks,
Ludo’.
Ludovic Courtès wrote on 15 Oct 2023 13:34
(address . 54447@debbugs.gnu.org)
87lec38w1a.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Looking at the nginx and ‘guix publish’ logs, I found that the missing
> substitute is not that of 4r1wij3bzj9zv75ds82a93jl7bcman2x (the .drv
> itself) but rather that of a dependency of that .drv:
>
> [14/Oct/2023:23:22:09 +0200] "GET /wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg.narinfo HTTP/1.1" 404 58 "-" "GNU Guile"
>
> That item’s size is above the cache bypass threshold of 100 MiB as
> currently configured on berlin:
>
> $ du -hs /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
> 124M /gnu/store/wqqzcxrhbnv0nzg64iiiqd5grr4vk2zg-guix-5a6b1a5
>
> The immediate fix/workaround is to raise that threshold.

I raised the threshold to 150 MiB in maintenance.git commit
213384e43de63ce3a5a55599e8fb89891ffef7eb.

I reconfigured berlin and restarted ‘guix publish’ seconds ago.
Hopefully next time installation tests won’t have that problem.

Ludo’.
Ludovic Courtès wrote on 15 Oct 2023 13:42
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87h6mr8vo9.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> In addition to the mcron job, Cuirass’s own ‘register-gc-roots’
> procedure periodically deletes GC roots older than ‘%gc-roots-ttl’ (30
> days in practice). That’s okay, except that it would be safer to delete
> GC roots for a .drv if and only if it’s been built already.

Fixed in Cuirass commit 55af0f70c0d4938b8eda777382bbc4d8f5698a37.

Ludo'.
Maxim Cournoyer wrote on 16 Oct 2023 06:25
(name . Ludovic Courtès)(address . ludo@gnu.org)
87sf6azolb.fsf@gmail.com
Hi,

Ludovic Courtès <ludo@gnu.org> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Another example: https://ci.guix.gnu.org/build/1982454/details
>>
>> substitute:
>> substitute: updating substitutes from 'http://10.0.0.1'... 0.0%
>> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
>> cannot build missing derivation ‘/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv’
>
> This one is from Sep. 9, which is before I deployed the remote-worker
> fixes, so I’ll dismiss it (happy to look at more recent ones though!).
>
> Tip of the day: M-: (build-farm-build 1982454)

I don't have such a function in scope, is this from the guix-emacs
package?

--
Thanks,
Maxim
Ludovic Courtès wrote on 16 Oct 2023 10:39
(name . Maxim Cournoyer)(address . maxim.cournoyer@gmail.com)
87edhu79hm.fsf@gnu.org
Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:

>> Tip of the day: M-: (build-farm-build 1982454)
>
> I don't have such a function in scope, is this from the guix-emacs
> package?

It’s from the ‘emacs-build-farm’ package, which I recommend. :-)

Ludo’.
Ludovic Courtès wrote on 16 Oct 2023 10:44
(name . Mathieu Othacehe)(address . othacehe@gnu.org)
87a5si7986.fsf@gnu.org
Ludovic Courtès <ludo@gnu.org> skribis:

> Turns out that ‘cleanup-cuirass-roots’ in maintenance.git, used as an
> mcron job, explicitly removes GC roots for things like *-os-encrypted
> once they’re more than two days old, as well as GC roots for the
> corresponding .drv.
>
> I think this was increasing the likelihood that a .drv would be GC’d by
> the time we run the test: under high load¹, it’s plausible that a system
> test wouldn’t be built within two days after it’s been queued.
>
> I’m proposing the change below to address this; I don’t think we need
> ‘--gc-keep-outputs --gc-keep-derivations’ anymore now that we keep
> things in ‘guix publish’ cache first and foremost.

I pushed a variant of this patch:

053839d hydra: services: Leave “guix-binary.tar.xz” GC roots.
e40d961 hydra: services: Preserve Cuirass .drv GC roots.
b8fc66c hydra: cuirass: Fix build product regexps.

I didn’t dare remove “--gc-keep-derivations”. I reconfigured berlin
just now from this commit and restarted mcron (I didn’t restart
guix-daemon to avoid downtime; we should do that when the queue is close
to empty).

We’ll have to monitor disk usage to make sure it’s not negatively
affected.

Ludo’.
Maxim Cournoyer wrote on 20 Nov 2023 11:09
(name . Ludovic Courtès)(address . ludo@gnu.org)
87zfz89r8i.fsf@gmail.com
Hi Ludovic,

Ludovic Courtès <ludo@gnu.org> writes:

> Maxim Cournoyer <maxim.cournoyer@gmail.com> skribis:
>
>> Another example: https://ci.guix.gnu.org/build/1982454/details
>>
>> substitute:
>> substitute: updating substitutes from 'http://10.0.0.1'... 0.0%
>> substitute: updating substitutes from 'http://10.0.0.1'... 100.0%
>> cannot build missing derivation ‘/gnu/store/vwhgs9dkj9spryglb180j27dr5vidjxv-ecl-23.9.9.drv’
>
> This one is from Sep. 9, which is before I deployed the remote-worker
> fixes, so I’ll dismiss it (happy to look at more recent ones though!).

Here's a more recent occurrence:

I haven't restarted it to leave proof of its existence :-)

--
Thanks,
Maxim
Ludovic Courtès wrote on 4 Apr 14:33 -0700
(address . 54447@debbugs.gnu.org)
87cyr4u825.fsf@gnu.org
Hello!

News from the everlasting bug!

cannot build missing derivation ‘/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv’

(From <https://ci.guix.gnu.org/build/3861708/>.)

Why was it missing this time? /var/log/nginx/error.log:

2024/04/04 17:15:03 [error] 98751#0: *152293778 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo", host: "141.80.167.131"

Oops! (There are dozens of upstream timeouts logged on that minute.)

/var/log/guix-publish.log:

2024-04-04 17:14:51 GET /nar/lzip/pz39bkq7pd1hgy5rwiynqa33gyjvpgs5-python-pygments-2.12.0
2024-04-04 17:14:51 GET /z2xxwwxswdd4b8c8iwmxhqnqbp5nwz09.narinfo
2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo
2024-04-04 17:14:51 GET /nar/zstd/jxkglr445f215m2faqz1i2lgmbans4rf-texlive-amsmath-66594-doc
2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo
2024-04-04 17:15:33 GET /nar/zstd/i2hp3q2pfhsyl0al7z38am7cqpddi4qr-texlive-capt-of-66594-doc
2024-04-04 17:15:33 GET /hh0gdbljj3cjdnjbr88kfm21mhys5sy7.narinfo
2024-04-04 17:15:33 GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo
2024-04-04 17:15:33 GET /yj63wifalfr6sla42h7mkqg011qrl5d0.narinfo
2024-04-04 17:15:33 GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo
2024-04-04 17:15:33 -> GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo: 404
2024-04-04 17:15:33 GET /nar/lzip/6zxlrw15b9dsv73s7v5fqabl7iv5v5il-python-exceptiongroup-1.1.1
2024-04-04 17:15:33 GET /nar/zstd/pychjd114abscbqlzcr3s7myf1497vw2-julia-compilersupportlibraries-jll-0.4.0%2B1

‘guix publish’ replied, but 40s too late (nginx has
“proxy_connect_timeout 10s;” for .narinfo URLs¹).

Notice the 40s pause time between 17:14:51 and 17:15:33. Stop-the-world
GC? Unlikely, because ‘guix publish’ had been running for ~3h, so even
with a leak², it’s hard to believe GC could take this long.
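
Pauses like this one can be located mechanically. The sketch below assumes every relevant line of guix-publish.log starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in the excerpt above; `find_stalls` is a hypothetical helper, and the 10-second cutoff simply mirrors the nginx timeout quoted above.

```python
from datetime import datetime


def find_stalls(log_lines, min_gap_seconds):
    """Return (previous time, next time, gap in seconds) for each long gap."""
    stalls = []
    previous = None
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap >= min_gap_seconds:
                stalls.append((previous, stamp, gap))
        previous = stamp
    return stalls


lines = [
    "2024-04-04 17:14:51 GET /z2xxwwxswdd4b8c8iwmxhqnqbp5nwz09.narinfo",
    "2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo",
    "2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo",
]
for start, end, gap in find_stalls(lines, min_gap_seconds=10):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {gap:.0f}s")
# prints "17:14:51 -> 17:15:33: 42s"
```

Matching the stall windows against the timestamps in nginx's error.log would show whether each upstream timeout falls inside such a pause.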

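Ludo’.

¹ https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/nginx/berlin.scm#n103
² https://issues.guix.gnu.org/69596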

John Kehayias wrote on 13 Apr 17:15 -0700
(name . Ludovic Courtès)(address . ludo@gnu.org)
87mspwbxzn.fsf@protonmail.com
Hi all,

On Thu, Apr 04, 2024 at 11:33 PM, Ludovic Courtès wrote:

> Hello!
>
> News from the everlasting bug!
>
> cannot build missing derivation
> ‘/gnu/store/dfgc46q3l8wlnymv49a1wjnxypin8p0y-plink-1.07.drv’
>
> (From <https://ci.guix.gnu.org/build/3861708/>.)
>
> Why was it missing this time? /var/log/nginx/error.log:
>
> 2024/04/04 17:15:03 [error] 98751#0: *152293778 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 141.80.167.169, server: ci.guix.gnu.org, request: "GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo HTTP/1.1", upstream: "http://127.0.0.1:3000/dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo", host: "141.80.167.131"
>
>
> Oops! (There are dozens of upstream timeouts logged on that minute.)
>
> /var/log/guix-publish.log:
>
> 2024-04-04 17:14:51 GET /nar/lzip/pz39bkq7pd1hgy5rwiynqa33gyjvpgs5-python-pygments-2.12.0
> 2024-04-04 17:14:51 GET /z2xxwwxswdd4b8c8iwmxhqnqbp5nwz09.narinfo
> 2024-04-04 17:14:51 GET /lgyck285bsxzwrnh3x5ix5dwzd3n3wga.narinfo
> 2024-04-04 17:14:51 GET /nar/zstd/jxkglr445f215m2faqz1i2lgmbans4rf-texlive-amsmath-66594-doc
> 2024-04-04 17:15:33 GET /qg5cxb869i42jn7x2dm6k5l41ikkz21w.narinfo
> 2024-04-04 17:15:33 GET /nar/zstd/i2hp3q2pfhsyl0al7z38am7cqpddi4qr-texlive-capt-of-66594-doc
> 2024-04-04 17:15:33 GET /hh0gdbljj3cjdnjbr88kfm21mhys5sy7.narinfo
> 2024-04-04 17:15:33 GET /dfgc46q3l8wlnymv49a1wjnxypin8p0y.narinfo
> 2024-04-04 17:15:33 GET /yj63wifalfr6sla42h7mkqg011qrl5d0.narinfo
> 2024-04-04 17:15:33 GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo
> 2024-04-04 17:15:33 -> GET /h2s2g2adxbnd67g34mnjnpcr6p3nhr69.narinfo: 404
> 2024-04-04 17:15:33 GET /nar/lzip/6zxlrw15b9dsv73s7v5fqabl7iv5v5il-python-exceptiongroup-1.1.1
> 2024-04-04 17:15:33 GET /nar/zstd/pychjd114abscbqlzcr3s7myf1497vw2-julia-compilersupportlibraries-jll-0.4.0%2B1
>
> ‘guix publish’ replied, but 40s too late (nginx has
> “proxy_connect_timeout 10s;” for .narinfo URLs¹).
>
> Notice the 40s pause time between 17:14:51 and 17:15:33. Stop-the-world
> GC? Unlikely, because ‘guix publish’ had been running for ~3h, so even
> with a leak², it’s hard to believe GC could take this long.
>
> Ludo’.
>
> ¹
> https://git.savannah.gnu.org/cgit/guix/maintenance.git/tree/hydra/nginx/berlin.scm#n103
> ² https://issues.guix.gnu.org/69596

I don't have any insight, but if anyone wants to see this in action at a
large scale, take a look at pretty much any red dot on

From my quick look, all the CL and texlive failures were missing
derivations. I've tried restarting a bunch to get i686 coverage going, so
hopefully some will disappear. But I can't/won't manually restart the
thousands(?) of failed builds. I didn't see such issues on x86_64, while
other architectures take a really long time to build on Berlin so I
haven't looked.

I don't know if this is helpful, but thought I would chime in if anyone
wants potentially a bunch of data. And if there are good ideas to
recover (just restart all builds?) that would be great so mesa-updates
will be built on i686 since otherwise it looks good.

Thanks!
John