*** artom has quit IRC | 00:14 | |
*** artom has joined #opendev | 00:14 | |
clarkb | fungi: it looks like maybe the zuul restart caught the periodic jobs for requirements in a weird spot? they are all retry limits and the ref is 000000 | 00:18 |
clarkb | (just noting it, I expect that the 0600 enqueuing of those jobs will be fine) | 00:18 |
fungi | i meant to follow up on that, i got several tracebacks from reenqueues | 00:20 |
fungi | checking to see if i can tell which ones | 00:20 |
fungi | so this is what i ran for it: zuul enqueue-ref --tenant openstack --pipeline periodic --project openstack/requirements --ref refs/heads/master | 00:21 |
fungi | no errors from that | 00:22 |
fungi | the ones which did throw errors were pyca/cryptography | 00:23 |
fungi | http://paste.openstack.org/show/804064 | 00:24 |
fungi | clarkb: the retry limit may be separate from the reason for the restarts | 00:29 |
fungi | retries | 00:29 |
fungi | er, let me retry | 00:29 |
fungi | the reason for the retries may be unrelated to the 0 ref | 00:30 |
openstackgerrit | Merged opendev/system-config master: review01.openstack.org: add key for gerrit data copying https://review.opendev.org/c/opendev/system-config/+/783778 | 00:30 |
fungi | last zuul scheduler restart we saw a similar situation with an hourly deploy item enqueued with a 0 ref, but it grabbed the branch anyway | 00:30 |
*** hamalq_ has quit IRC | 00:49 | |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: Document algorithm var for remove-build-sshkey https://review.opendev.org/c/zuul/zuul-jobs/+/783988 | 00:56 |
*** iurygregory has quit IRC | 01:16 | |
*** iurygregory has joined #opendev | 01:17 | |
*** iurygregory has quit IRC | 01:18 | |
*** iurygregory has joined #opendev | 01:18 | |
*** osmanlicilegi has joined #opendev | 01:20 | |
openstackgerrit | Jeremy Stanley proposed opendev/base-jobs master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/base-jobs/+/783989 | 01:44 |
openstackgerrit | Jeremy Stanley proposed openstack/project-config master: Clean up OpenEdge configuration https://review.opendev.org/c/openstack/project-config/+/783990 | 01:44 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/system-config/+/783991 | 01:45 |
*** iurygregory has quit IRC | 02:08 | |
*** iurygregory has joined #opendev | 02:09 | |
ianw | our new review02.opendev.org can't ping review01.openstack.org via ipv6, but the other way (review01 -> review02) *does* work | 02:19 |
ianw | i'm taking suggestions on how i might have messed this up :) | 02:20 |
ianw | i am connecting to review02 via ipv6. i can also ping it locally here. so it's not ipv6 in general | 02:25 |
ianw | i know nobody is around, but dumping some debugging info in #vexxhost channel | 02:38 |
*** diablo_rojo has quit IRC | 02:40 | |
fungi | i can take a look when i wake up too | 02:48 |
ianw | fungi: :) thanks, you'll probably have more cross-over with vexxhost people | 02:49 |
fungi | it does seem on the face to be similar to some of the ipv6 oddness we've seen with rackspace in the past, so i wouldn't assume there's anything to do with how things are set up in vexxhost | 02:50 |
fungi | anyway, passing out now, will sleep on it | 02:51 |
ianw | yeah; i agree. however in that case we usually did see the packets coming *into* the host, which responded, but the packets never found their way back | 02:51 |
fungi | true | 02:51 |
ianw | in this case, a tcpdump doesn't show the ping packets making it to the host | 02:51 |
ianw | also, of the major ipv6 destinations i can think of, it can't seem to ping most of them, but it can ping google | 02:52 |
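A minimal sketch of the capture ianw is describing, run on review02 while review01 pings it (the interface name ens3 is an assumption):

    # watch for inbound ICMPv6 echo requests; if nothing appears here while
    # review01 is pinging, the packets are lost before they reach the host
    sudo tcpdump -ni ens3 'icmp6 and ip6[40] == 128'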
ianw | anyway, i have review02 now syncing via ipv4 | 03:07 |
ianw | ~45MB/s so not too shabby | 03:08 |
*** akahat has quit IRC | 03:08 | |
*** kopecmartin has quit IRC | 03:09 | |
*** fbo has quit IRC | 03:09 | |
*** kopecmartin has joined #opendev | 03:13 | |
*** fbo has joined #opendev | 03:14 | |
*** akahat has joined #opendev | 03:22 | |
*** ykarel|away has joined #opendev | 04:20 | |
*** ykarel|away is now known as ykarel | 04:39 | |
*** marios has joined #opendev | 05:03 | |
*** zbr|rover4 has joined #opendev | 05:04 | |
*** zbr|rover has quit IRC | 05:06 | |
*** zbr|rover4 is now known as zbr|rover | 05:06 | |
*** whoami-rajat has joined #opendev | 05:17 | |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Add upload-logs-azure role https://review.opendev.org/c/zuul/zuul-jobs/+/782004 | 05:27 |
*** auristor has quit IRC | 05:27 | |
ianw | ok, https://review02.opendev.org has some content | 05:41 |
*** ysandeep|away is now known as ysandeep | 05:59 | |
openstackgerrit | Ian Wienand proposed opendev/system-config master: AFS documentation : add notes on replication https://review.opendev.org/c/opendev/system-config/+/784002 | 06:01 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: review02 : bump heap limit to 96gb https://review.opendev.org/c/opendev/system-config/+/784003 | 06:01 |
ianw | time docker-compose run shell java -jar /var/gerrit/bin/gerrit.war reindex -d /var/gerrit --threads 32 | 06:04 |
ianw | real 50m8.443s | 06:04 |
*** ralonsoh has joined #opendev | 06:10 | |
*** slaweq has joined #opendev | 06:10 | |
*** sboyron has joined #opendev | 06:21 | |
*** eolivare has joined #opendev | 06:30 | |
*** hashar has joined #opendev | 06:45 | |
openstackgerrit | Hervé Beraud proposed openstack/project-config master: Use publish-to-pypi on barbican ansible roles https://review.opendev.org/c/openstack/project-config/+/784011 | 06:52 |
ianw | 24 threads was "real 52m10.284s" fyi | 07:19 |
ianw | i just rebooted the review02.opendev.org host, and all ipv6 seems to work now | 07:25 |
*** tosky has joined #opendev | 07:33 | |
*** ysandeep is now known as ysandeep|lunch | 07:47 | |
*** ykarel has quit IRC | 08:01 | |
openstackgerrit | Merged opendev/irc-meetings master: Remove Automation SIG meeting https://review.opendev.org/c/opendev/irc-meetings/+/783878 | 08:06 |
*** dpawlik0 is now known as dpawlik | 08:08 | |
*** hrw has joined #opendev | 08:11 | |
hrw | morning | 08:11 |
hrw | can someone help me get centos-8-stream-arm64 node running? | 08:14 |
hrw | project-config has such | 08:14 |
hrw | https://zuul.openstack.org/nodes does not | 08:14 |
openstackgerrit | Merged openstack/project-config master: Use publish-to-pypi on barbican ansible roles https://review.opendev.org/c/openstack/project-config/+/784011 | 08:16 |
ianw | hrw: this build doesn't look good | 08:21 |
ianw | https://nb03.opendev.org/centos-8-stream-arm64-0000001549.log | 08:22 |
hrw | ianw: let me look | 08:22 |
ianw | 2021-03-31 07:44:32.339 | + /usr/sbin/grub2-install '--modules=part_msdos part_gpt lvm' --removable --force /dev/loop6 | 08:22 |
ianw | 2021-03-31 07:44:32.341 | /usr/sbin/grub2-install: error: this utility cannot be used for EFI platforms because it does not support UEFI Secure Boot. | 08:22 |
ianw | it may be possible we fixed this but haven't either done a dib release or included a new dib release in the nodepool container... | 08:22 |
ianw | diskimage-builder version 3.7.0 | 08:22 |
hrw | grub-- | 08:23 |
ianw | looks like https://review.opendev.org/c/openstack/diskimage-builder/+/779106 | 08:24 |
ianw | it seems we are due a release | 08:24 |
hrw | looks like | 08:25 |
ianw | ok, i pushed 3.8.0, but we'll have to pull it into nodepool and then deploy to the builders. sorry, no quick route there :/ | 08:27 |
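Once the updated nodepool image is deployed, a quick sanity check on the builder could look like this (the container name nodepool-builder is an assumption, not confirmed from nb03):

    # confirm the builder container picked up the new diskimage-builder release
    sudo docker exec nodepool-builder pip3 show diskimage-builder | grep Version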
hrw | no problem, happens | 08:28 |
ianw | ok, https://review.opendev.org/c/zuul/nodepool/+/784026 will start the process | 08:29 |
*** ykarel has joined #opendev | 08:43 | |
*** ykarel is now known as ykarel|lunch | 08:43 | |
jrosser | debian-bullseye-updates and debian-bullseye-backports don't seem to be being mirrored, the logs are zero length here https://files.openstack.org/mirror/logs/reprepro/ | 09:05 |
*** ysandeep|lunch is now known as ysandeep | 09:07 | |
*** klonn has joined #opendev | 09:11 | |
*** klonn has quit IRC | 09:50 | |
*** gibi is now known as gibi_away | 09:52 | |
*** hashar has quit IRC | 09:52 | |
hrw | jrosser: are there such repos upstream already? | 09:52 |
*** klonn has joined #opendev | 09:52 | |
jrosser | hrw: they seem to be here http://ftp.uk.debian.org/debian/dists/ | 09:52 |
hrw | o, they are. nice | 09:52 |
hrw | jrosser: note that https://files.openstack.org/mirror/logs/reprepro/debian-buster-updates.log.1 has content so perhaps there was nothing to mirror last time reprepro ran | 09:54 |
hrw | oops, wrong version | 09:55 |
jrosser | there's a patch to build images for bullseye which fails because it tries to add the -updates and -backports repos, and apt update is upset that there's no Release file | 09:56 |
*** ykarel|lunch is now known as ykarel | 10:04 | |
mordred | bullseye is current testing - it's not going to have -updates or -backports yet | 10:32 |
mordred | it won't grow working versions of those until it is actually released | 10:32 |
mordred | hrm. I take that back - I agree that -updates exists for real | 10:33 |
chkumar|ruck | Hello Infra, we are seeing a few retry limits on one patch https://zuul.opendev.org/t/openstack/status#782187 , please have a look, thanks! | 10:34 |
chkumar|ruck | join name : tripleo-ansible-centos-8-molecule-tripleo-modules | 10:35 |
chkumar|ruck | *job | 10:35 |
mordred | same with -backports - how weird (although I gotta say it makes automation nice) | 10:35 |
* mordred goes back to morning caffeine | 10:35 | |
chkumar|ruck | it is the earlier retry_limit job https://zuul.opendev.org/t/openstack/build/d28be58628484f92a36bd8ab87279d6e | 10:35 |
*** klonn has quit IRC | 10:45 | |
*** mugsie__ is now known as mugsie | 11:01 | |
*** dtantsur|afk is now known as dtantsur | 11:34 | |
*** lpetrut has joined #opendev | 11:37 | |
fungi | jrosser: hrw: mordred: catching up, but the problem is that reprepro won't create empty repositories, even if they exist empty at the source end. there is a set of commands we can run to create the empty indices, documented in the reprepro manpage i think, i vaguely recall doing that for buster | 11:41 |
fungi | chkumar|ruck: zbr|rover was also asking about those retries in #openstack-infra, seemed like it could be related to a specific job or node type, i can help get an autohold set up for it in a bit and then we can try to retrigger the failure and investigate the resultant state of the vm after the failure and also try to extract a vm console log from it | 11:43 |
mordred | Ahhh right | 11:45 |
chkumar|ruck | fungi: thanks :-) | 11:50 |
hrw | fungi: maybe the scripts which call reprepro should take care of creating empty RELEASE-{backports,update} ones when a new release gets added? | 11:56 |
hrw | fungi: so in 2 years time we will not get into same discussion again ;) | 11:56 |
fungi | hrw: maybe, but how to add that requires some thought in declarative configuration management. we don't have a lot of tasks which are run-once-on-setup | 11:56 |
*** sshnaidm|off is now known as sshnaidm | 11:57 | |
hrw | fungi: understood | 11:58 |
fungi | though maybe our script which runs reprepro could run it if the suites are missing at the end or something | 11:59 |
fungi | basically "create these empty if they don't exist at completion" | 12:00 |
hrw | ;) | 12:01 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: Update SSL exceptions https://review.opendev.org/c/opendev/gear/+/784082 | 12:01 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: WIP: Client: use NonBlockingConnection to allow TLSv1.3 https://review.opendev.org/c/opendev/gear/+/784083 | 12:01 |
*** auristor has joined #opendev | 12:05 | |
openstackgerrit | Jeremy Stanley proposed opendev/zone-opendev.org master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/zone-opendev.org/+/784086 | 12:10 |
openstackgerrit | Guillaume Chauvel proposed opendev/gear master: WIP: Client: use NonBlockingConnection to allow TLSv1.3 https://review.opendev.org/c/opendev/gear/+/784083 | 12:14 |
fungi | zbr|rover: chkumar|ruck: i've set an autohold for the failing job on https://review.opendev.org/782187 so feel free to recheck it and we can take a closer look at the node once it fails again | 12:35 |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 12:42 |
fungi | jrosser: hrw: mordred: according to the reprepro manpage, i think something like `reprepro export buster-updates` is what we want, but i'll get some caffeine in me before attempting | 12:42 |
hrw | fungi: s/buster/bullseye/ and also -backports but yes, it looks like it | 12:44 |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 12:45 |
jrosser | fungi: thanks for taking a look at that :) | 12:45 |
*** smcginnis has quit IRC | 12:53 | |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 13:04 |
fungi | jrosser: hrw: mordred: apparently i did it almost a year ago for focal-backports: http://eavesdrop.openstack.org/irclogs/%23opendev/%23opendev.2020-04-24.log.html#t2020-04-24T00:36:00 | 13:13 |
hrw | "we just have to remember to do that whenever adding a new release i guess" | 13:14 |
hrw | ;) | 13:14 |
fungi | well, or like i said, maybe if we can detect in our script that no indices were created for a configured dist, we make it run that command at the end | 13:18 |
fungi | i'm going to work on that angle here in a bit | 13:18 |
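A rough sketch of the conditional approach fungi describes, assuming the usual reprepro layout (the confdir, dist names, and mirror path here are placeholders rather than the real mirror-update configuration):

    # after the normal update run, export indices for any configured dist that
    # still has no Release file, i.e. one reprepro skipped because it was empty
    for dist in bullseye-updates bullseye-backports; do
        if [ ! -f "/afs/.openstack.org/mirror/debian/dists/${dist}/Release" ]; then
            reprepro --confdir /etc/reprepro/debian export "${dist}"
        fi
    done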
*** ralonsoh has left #opendev | 13:20 | |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 13:30 |
zbr|rover | fungi: it's probably going to happen with https://zuul.opendev.org/t/openstack/stream/5ebc409e1d554b89b5569c6fbbfcc1f7?logfile=console.log too | 13:32 |
zbr|rover | already >12 min without any reply, it's probably stuck. | 13:33 |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 13:34 |
zbr|rover | fungi: yep, it did fail too. | 13:36 |
*** darshna has joined #opendev | 14:09 | |
*** ykarel is now known as ykarel|away | 14:40 | |
clarkb | chkumar|ruck: zbr|rover fungi note that if the problem is network connectivity, holding the node may not be helpful | 14:46 |
fungi | clarkb: well, the idea is that i may at least be able to capture a vm console log from the held node, or reboot it with the nova api, or even boot it on a rescue image to get at the logs | 14:48 |
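For reference, the out-of-band options fungi mentions map onto standard openstack client calls along these lines (the server uuid is a placeholder):

    # pull whatever the hypervisor has buffered from the guest console
    openstack console log show <server-uuid> | tail -n 100
    # hard reboot it, or boot it into a rescue image to inspect the disk
    openstack server reboot --hard <server-uuid>
    openstack server rescue <server-uuid>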
fungi | but we've also seen similar cases where something during the job drove system load or disk i/o up so high that sshd ceased responding fast enough to beat ansible's timeout | 14:49 |
clarkb | fair enough. The last time tripleo had network setup issues the console logs did help in that it showed that network manager was undoing our static ip config | 14:49 |
fungi | and after a time the vm recovers and can be reached again | 14:49 |
zbr|rover | i am now trying to manually run the same tests that are happening inside that job, so i may be able to identify whether there is an issue with these tests or not. | 14:56 |
zbr|rover | i will find out soon, already passed 1/6 | 14:57 |
*** mfixtex has joined #opendev | 15:14 | |
*** lpetrut has quit IRC | 15:22 | |
*** Dmitrii-Sh4 has joined #opendev | 15:26 | |
*** Dmitrii-Sh has quit IRC | 15:26 | |
*** Dmitrii-Sh4 is now known as Dmitrii-Sh | 15:26 | |
*** noonedeadpunk has quit IRC | 15:27 | |
*** noonedeadpunk has joined #opendev | 15:28 | |
*** ykarel|away has quit IRC | 15:37 | |
*** hashar has joined #opendev | 15:41 | |
*** ysandeep is now known as ysandeep|away | 15:47 | |
*** spotz has joined #opendev | 15:50 | |
*** diablo_rojo has joined #opendev | 16:06 | |
*** hamalq has joined #opendev | 16:18 | |
*** hamalq_ has joined #opendev | 16:19 | |
*** hamalq has quit IRC | 16:22 | |
*** Dmitrii-Sh has quit IRC | 16:23 | |
*** Dmitrii-Sh has joined #opendev | 16:24 | |
*** hamalq_ has quit IRC | 16:41 | |
*** hamalq has joined #opendev | 16:41 | |
*** eolivare has quit IRC | 16:43 | |
*** marios is now known as marios|out | 16:47 | |
*** marios|out has quit IRC | 16:59 | |
*** dtantsur is now known as dtantsur|afk | 17:04 | |
corvus | i'm getting started on looking into the zuul memory leak now | 17:23 |
zbr|rover | fungi: clarkb: re the stuck job, we are currently making it nv and we have another patch that may fix the root cause, but we are not sure yet. | 17:29 |
clarkb | zbr|rover: it's failing in a loop though, right? ideally we wouldn't just set it to nv in that case | 17:29 |
fungi | zbr|rover: thanks for the update, i also realized there was no reason to limit the autohold to a single change so broadened it to any tripleo-ansible-centos-8-molecule-tripleo-modules failure for openstack/tripleo-ansible | 17:30 |
zbr|rover | i am almost sure it's not a zuul or infra issue here, it's a genuine bug. | 17:30 |
fungi | also, clarkb is right, setting it nv means we'll eat 3x the node count for that job anyway and just throw away the results | 17:30 |
zbr|rover | we do want to see the impact of the real fix first. i do not expect it to stay nv for more than a day. | 17:30 |
fungi | the change with the fix can readd the job to the check and gate pipelines | 17:31 |
fungi | that way you still see the effects of the fix on the fix change and any queued after it | 17:31 |
fungi | just not on the changes where you expect it to fail | 17:31 |
zbr|rover | to be clear it does not always fail. in fact we do not really know what introduced the issue. | 17:32 |
fungi | <100% failure would also be a possible explanation for how the bug itself got merged if it wasn't an outside shift | 17:32 |
clarkb | if it is a similar network issue to last time it had to do with the job forcing dhcp in regions without dhcp | 17:35 |
clarkb | all the regions that used dhcp said ok whatever and kept running, but those that use static IPs immediately broke | 17:35 |
clarkb | corvus: I need to take a break, but let me know if I can help with the memory leak and I can dive into that after | 17:36 |
corvus | clarkb: thx. i'm still at step 1: waiting for the first sigusr2 objgraph most common types report to finish | 17:36 |
corvus | about 15 minutes into that | 17:37 |
zbr|rover | done, updated the patch to disable job | 17:40 |
zbr|rover | fungi: clarkb thanks for helping on that issue. time for me to go offline now. | 17:42 |
fungi | have a good evening zbr|rover! | 17:42 |
*** avass has quit IRC | 17:56 | |
*** yourname has joined #opendev | 17:57 | |
*** yourname is now known as avass | 17:58 | |
*** avass has quit IRC | 18:00 | |
*** yourname has joined #opendev | 18:01 | |
*** yourname is now known as avass | 18:02 | |
*** avass has quit IRC | 18:07 | |
*** yourname has joined #opendev | 18:10 | |
*** yourname is now known as avass | 18:10 | |
fungi | poking at the missing empty debian bullseye dists a bit more, i'm starting to think it may make the most sense to just reprepro export unconditionally after every reprepro update. testing that theory now | 18:13 |
fungi | if it's reasonably quick even on nonempty dists, then the cost is low enough to warrant the one-line fix rather than lots of unnecessary config parsing and conditionals | 18:15 |
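A minimal sketch of the unconditional approach being tested, i.e. always exporting after an update (the confdir path is an assumption):

    # "update" skips writing indices for dists it considers empty; an explicit
    # "export" afterwards writes index files for every configured dist
    reprepro --confdir /etc/reprepro/debian update
    reprepro --confdir /etc/reprepro/debian export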
*** avass has quit IRC | 18:17 | |
*** yourname has joined #opendev | 18:18 | |
*** yourname is now known as avass | 18:18 | |
*** avass has quit IRC | 18:19 | |
*** yourname has joined #opendev | 18:19 | |
*** yourname has quit IRC | 18:21 | |
*** avass has joined #opendev | 18:21 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update Gerrit to 3.2.8 https://review.opendev.org/c/opendev/system-config/+/784152 | 18:22 |
clarkb | that isn't urgent, but i noticed they made a new release so i figure we should try and keep up if we can | 18:22 |
fungi | the dstat effort is not yet to the point where we have enough data to decide on a timeline for 3.3 i guess | 18:22 |
clarkb | ya I think we really want to improve the gatling git stuff for 3.3 | 18:23 |
clarkb | that said, the new bigger server gives us a lot of headroom and I think we can be less cautious (early data says 3.3 uses more memory but is faster) | 18:23 |
johnsom | Hmm, things are queuing in an odd way. I posted a patch fifteen minutes ago and it still isn't listed in the check pipeline | 18:33 |
johnsom | https://review.opendev.org/c/openstack/octavia-tempest-plugin/+/760465 | 18:33 |
clarkb | johnsom: we're (mostly corvus) doing object introspection on zuul to try and root cause this memory leak and that slows stuff down | 18:34 |
clarkb | hopefully just temporary until we get data we need | 18:34 |
johnsom | Ah, ok. Just thought I would give a heads up | 18:34 |
*** DSpider has joined #opendev | 18:39 | |
*** DSpider has quit IRC | 18:39 | |
fungi | okay, so adding an explicit reprepro export on an otherwise noop update added 5.5 minutes for the debian repo (9 dists covering hundreds of thousands of packages) | 19:01 |
fungi | i'll see what it adds for debian-security | 19:01 |
clarkb | not too bad considering the runtime of a full sync | 19:01 |
fungi | so the fix to debian-security is going to have to be different, i think | 19:02 |
fungi | aptmethod error receiving 'http://security.debian.org/dists/bullseye/updates/Release': | 19:02 |
clarkb | because it doesn't exist at all upstream yet or ? | 19:02 |
fungi | '404 Not Found | 19:02 |
clarkb | ya | 19:02 |
mordred | clarkb: *amazing* that upstream released a point release and there is an opendev patch up to maybe run it | 19:03 |
clarkb | mordred: ya we've managed to keep up with point releases | 19:03 |
clarkb | 3.3 scares me a bit simply because a few people on the repo-discuss list reverted | 19:03 |
mordred | yah | 19:04 |
clarkb | but I've been trying to add better testing of it when I can. We added dstat for system level stats and also have a clunky gatling git thing up | 19:04 |
mordred | fungi: deb http://security.debian.org/debian-security bullseye-security main is what's in the bullseye docker image for security - is security of *updates* the thing that doesn't exist until it exists that I was thinking about earlier? | 19:04 |
mordred | oh ... | 19:05 |
mordred | fungi: http://security.debian.org/debian-security/dists/bullseye-security/updates/ | 19:05 |
mordred | fungi: it's security.d.o bullseye-security - not security.d.o bullseye | 19:06 |
mordred | (even though the other releases do not have a -security suffix) | 19:06 |
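To illustrate the naming change mordred is pointing at, the security suite is addressed differently from bullseye onward (generic sources.list lines, not the opendev reprepro config):

    # buster and earlier: a "<codename>/updates" suite on security.debian.org
    deb http://security.debian.org/debian-security buster/updates main
    # bullseye onward: the suite is "<codename>-security" instead
    deb http://security.debian.org/debian-security bullseye-security main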
*** hashar has quit IRC | 19:08 | |
fungi | oh, righto, that's changing in bullseye | 19:09 |
fungi | totally forgot about that announcement | 19:09 |
fungi | so we probably need the fix i'm considering *and* some config change for bullseye's security repo | 19:10 |
fungi | okay, lemme get this pushed up first then, it seems to solve the first issue | 19:12 |
fungi | after that i can meditate on the reprepro config a bit | 19:12 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Explicitly create empty reprepro dists https://review.opendev.org/c/opendev/system-config/+/784158 | 19:27 |
fungi | jrosser: hrw: ^ that's part of the solution | 19:27 |
fungi | now to work out the config change we need for bullseye-security | 19:27 |
*** sboyron has quit IRC | 20:12 | |
*** d34dh0r53 has quit IRC | 20:35 | |
*** d34dh0r53 has joined #opendev | 20:40 | |
*** whoami-rajat has quit IRC | 20:47 | |
*** dhellmann_ has joined #opendev | 21:12 | |
*** dhellmann has quit IRC | 21:12 | |
*** dhellmann_ is now known as dhellmann | 21:14 | |
fungi | okay, so i've worked out the fix for bullseye-security. unfortunately, the bullseye addition had been breaking that reprepro invocation, so we haven't been updating the debian-security volume since it went in, which means we're some days behind on security repo state for stretch/buster and only just now starting to mirror it for bullseye | 21:14 |
fungi | good news is it's a one-line fix | 21:15 |
fungi | though i had to run reprepro clearvanished on it to clean up old incomplete references to the wrong bullseye security repo | 21:15 |
fungi | which was part of what was preventing it from running | 21:15 |
*** dhellmann has quit IRC | 21:16 | |
*** dhellmann has joined #opendev | 21:17 | |
*** dhellmann has quit IRC | 21:23 | |
*** dhellmann has joined #opendev | 21:25 | |
fungi | so one down-side here is i think i need to put the mirror-update server in the emergency disable list until i get the config patch deployed | 21:27 |
fungi | well and merged | 21:27 |
fungi | and uploaded | 21:27 |
fungi | and written | 21:27 |
fungi | one thing at a time ;) | 21:27 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Correct debian-security repo codename for bullseye https://review.opendev.org/c/opendev/system-config/+/784169 | 21:32 |
fungi | infra-root: ^ appreciate an expedited review of that and the parent change since i have the server in emergency disable with the latter fix applied manually to prevent ansible from re-breaking it and requiring additional manual cleanup | 21:33 |
clarkb | looking | 21:34 |
fungi | jrosser: hrw: ^ that's the remaining fix, but at this point i've also applied and run it on the mirror-update server so we should be ready to move forward and recheck the dib job addition now | 21:36 |
fungi | and assuming that passes, approve the nodeset addition too | 21:37 |
ianw | fungi: lgtm, thanks | 21:38 |
hrw | fungi: cool, thanks! | 21:40 |
*** artom has quit IRC | 21:44 | |
*** artom has joined #opendev | 21:45 | |
fungi | i'm confused by the error on https://zuul.opendev.org/t/openstack/build/dbe8af6f6b054f0eb85401a70f74b188 | 22:12 |
fungi | i wonder if that test has bitrotted | 22:12 |
fungi | vos examine exiting 255 | 22:14 |
fungi | famous last words, but i don't think my change is causing that, seems to be arising in a wholly separate script | 22:14 |
fungi | the last time system-config-run-mirror-update succeeded was two days ago | 22:16 |
fungi | but these changes are the first to run it since | 22:16 |
clarkb | fungi: could it just be a fluke related to udp and internets? | 22:19 |
clarkb | and/or did it run on an ipv6-only cloud which might be more sensitive to problems? | 22:19 |
fungi | maybe, but both changes hit the same error a couple of hours apart | 22:20 |
fungi | aha! | 22:21 |
fungi | external cause | 22:21 |
fungi | Volume does not exist on server afs01.ord.openstack.org as indicated by the VLDB | 22:21 |
fungi | just tried it from my workstation | 22:21 |
fungi | i guess that will clear up once ianw's vos releases finish | 22:22 |
fungi | ianw: would it be safe to go ahead and replicate project.zuul.readonly to ord ahead of the others? | 22:22 |
fungi | since we explicitly reference it in that test, it can't pass currently | 22:22 |
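The check and fix being discussed correspond to standard openafs vos commands, roughly as below (whether -localauth is used is an assumption about how it's invoked on the afs servers):

    # list the replication sites the vldb knows about for the volume
    vos examine project.zuul.readonly
    # push the current read-write contents out to the read-only sites, including ord
    vos release project.zuul -localauth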
ianw | oh, i guess alphabetically that came last in the loop | 22:26 |
clarkb | I'm going through johnsom's list of CI issues and seeing if I can provide any help/feedback/fixes | 22:26 |
ianw | i'm releasing it now | 22:26 |
clarkb | https://etherpad.opendev.org/p/wallaby-RC1-ci-challenges <- is the list | 22:27 |
fungi | thanks ianw! lmk when it completes and i'll approve those debian mirror fixes | 22:27 |
ianw | fungi: Released volume project.zuul successfully | 22:29 |
clarkb | fungi: ianw: if you get a chance can you look at my orange-ish notes on item 3 in that etherpad and tell me if that looks like the pip solver to you? | 22:34 |
clarkb | I wonder if it is really slow on amd/vexxhost for some reason | 22:35 |
TheJulia | do we have more general ci grumpiness? a lot of jobs just went to 2nd retry | 22:35 |
TheJulia | At least, looking at https://zuul.opendev.org/t/openstack/status#ironic | 22:36 |
clarkb | TheJulia: there was a zk reconnection about an hour ago? something like that | 22:36 |
clarkb | corvus is actively debugging, which at times has an impact on zuul performance and can trigger that (even though last I checked memory use and thus swap was fine) | 22:36 |
TheJulia | Looks fairly recent-ish :( | 22:36 |
TheJulia | okay | 22:36 |
fungi | cacti claims we're not back into memory pressure on the scheduler yet at least, but maybe the repl work is stalling zk connections out | 22:37 |
clarkb | johnsom: out of curiosity are you enabling nested virt on any of these that have libvirt/cpu trouble? | 22:38 |
clarkb | johnsom: yes I think so as the label being used is explicitly the nested virt label | 22:39 |
johnsom | All of them, but the errors are in nova<->libvirt. The qemu/kvm layer has no errors | 22:39 |
clarkb | johnsom: well the cpu lockup was in the kernel/cpu/etc | 22:40 |
johnsom | It seems related to bionic as well, they are all stable jobs that I have seen | 22:40 |
clarkb | I have a very strong suspicion that that one is related to nested virt | 22:40 |
johnsom | It always goes through the "try CPU type", that is not unusual. The speculation is it is a bug in libvirt/glib combo | 22:43 |
clarkb | johnsom: sure, but all of your examples are vexxhost so far :) | 22:44 |
clarkb | maybe it is a bug with libvirt/glib + amd :) | 22:44 |
clarkb | johnsom: also reading the qemu log I'm not sure the amd nested virt flag is being set properly on those hosts | 22:44 |
fungi | keeping in mind that amd nested virt accel is different than intel nested virt accel too | 22:44 |
clarkb | should be svm but it doesn't seem to be in the opteron_g2 flag list | 22:44 |
clarkb | fungi: yup though I'm not convinced it is properly enabled, but if it is that could be another factor | 22:45 |
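A quick way to test that suspicion from inside one of the affected guests (nothing vexxhost-specific assumed):

    # count cpuinfo lines advertising the amd nested-virt flag; zero would
    # support the theory that svm is not being passed through to the guest
    grep -c -w svm /proc/cpuinfo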
johnsom | Yeah, it always whines about that stuff too. Not unusual | 22:45 |
clarkb | fungi: in the stackviz one that failed on name resolution that you checked against unbound it is running /tmp/stackviz/bin/pip3 install -u file://path/to/stackviz.tar.gz | 22:55 |
clarkb | that then does a python setup.py egg_info somewhere which does the fetch against pypi directly | 22:55 |
clarkb | I suspect that somehow we are tripping over easy install? | 22:55 |
clarkb | I wonder if an explicit install of pbr into the virtualenv first would help | 22:56 |
fungi | yeah, i was more wondering why stackviz install is being done that way | 22:56 |
fungi | we could unpack the stackviz tree and then just pip install /the/path/to/it | 22:57 |
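A sketch of the two alternatives being floated here (all paths are placeholders; the real tarball location is whatever the job stages):

    # fungi's suggestion: unpack the tree and point pip at the directory, so
    # metadata comes from the unpacked source instead of an egg_info fetch
    mkdir -p /tmp/stackviz-src
    tar -xzf /path/to/stackviz.tar.gz -C /tmp/stackviz-src
    /tmp/stackviz/bin/pip3 install /tmp/stackviz-src/*
    # clarkb's idea: seed pbr into the venv first so egg_info has what it needs
    /tmp/stackviz/bin/pip3 install pbr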
clarkb | johnsom: another thing to consider is your jobs are running on a reduced set of clouds due to the nested virt request. Limestone which I guess sometimes has dns failures, vexxhost which may have amd weirdness and also pip SAT solver slowness?, and ovh which I haven't seen any specific issues against yet | 22:57 |
clarkb | calling that out because if those clouds have problems your jobs will notice much more than background | 22:57 |
clarkb | also simply turning off the problematic clouds won't help much if they are the only ones that can run the flavors you want | 22:58 |
johnsom | Ha, well, do we have other clouds? I'm ignoring RAX as it has its own set of Xen problems | 22:58 |
fungi | and internap | 22:58 |
fungi | we have lots of nodes there | 22:58 |
clarkb | johnsom: rax and inap are the other two clouds currently used for x86. Neither does nested virt | 22:58 |
clarkb | but they provide a majority of resources iirc | 22:59 |
fungi | er, right, they're inap now not internap | 22:59 |
johnsom | Hmm, internap did at one point | 22:59 |
clarkb | johnsom: we don't put the nested-virt label there as we don't get the same attention for debugging nested virt problems | 22:59 |
fungi | possible we just don't create a special node type there to add it | 22:59 |
clarkb | so even if it is enabled we won't put the special label there | 22:59 |
johnsom | Admittedly, this sampling is very small. It is all from just one patch and not the normal day-to-day | 22:59 |
fungi | mgagne may be able to suggest someone who can help with more low-level investigation of nested virt issues there, but he's not in here at the moment | 23:00 |
clarkb | I think it is worth investigating further if the amd cpus are having trouble with pip solving and/or nested virt | 23:00 |
johnsom | I am 90% sure we used to just "turn it on" there in the past, before the nodeset existed. | 23:00 |
clarkb | the pip install timing on those is really weird | 23:01 |
fungi | but if we want to consider exposing a nested-virt label for inap i agree that would be a prerequisite | 23:01 |
clarkb | johnsom: right but that isn't how we are exposing the label | 23:01 |
johnsom | Yeah, I know | 23:01 |
clarkb | johnsom: for the label we've gotten those clouds to minimally buy into helping debug things when we can attribute them to nested virt | 23:01 |
fungi | clarkb: oh, speaking of clouds, i did get the openedge cleanup pushed under topic:openedge | 23:02 |
clarkb | fungi: do you know if pip solving slowness looks like https://zuul.opendev.org/t/openstack/build/d35cc616da1744e98c2d5b081866d541/log/job-output.txt#6209-6211 ? | 23:02 |
clarkb | the reason I don't really suspect network slowness is that after the first package everything else is much quicker | 23:03 |
clarkb | I kind of expect an oddity in how pip logs things, where the solving just produces no logging and you jump ahead a minute later when the downloads start, but I'm not sure | 23:03 |
clarkb | fungi: is there an order to those openedge cleanup changes? | 23:04 |
fungi | dns depends on system-config which depends on the others | 23:04 |
fungi | base-jobs and project-config can merge first | 23:05 |
fungi | sudo -H LC_ALL=en_US.UTF-8 SETUPTOOLS_USE_DISTUTILS=stdlib http_proxy= https_proxy= no_proxy= PIP_FIND_LINKS= SETUPTOOLS_SYS_PATH_TECHNIQUE=rewrite python3.8 -m pip install -c /opt/stack/requirements/upper-constraints.txt etcd3gw | 23:05 |
fungi | i guess that would be worth testing | 23:05 |
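If that gets tested on a held vexxhost node, simply timing it should show whether the stall is per-invocation (command reproduced from the paste above, trimmed of the env vars):

    # if this sits ~60s before any download output on a first run but not on
    # a second run, that points at a connection timeout rather than the solver
    time sudo python3.8 -m pip install -c /opt/stack/requirements/upper-constraints.txt etcd3gw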
clarkb | (a gerrit plugin showing depends-on chains in a list of changes would be neat but probably difficult to do in a way that performance isn't terrible since the gerrit db knows nothing about depends-on) | 23:05 |
fungi | clarkb: i agree it could be dep solver slowness if the pip version being used is new enough to have it | 23:06 |
clarkb | fungi: I think devstack upgrades pip very early | 23:06 |
clarkb | but not sure | 23:07 |
fungi | like 4175 claims pip 20.0.2 | 23:08 |
clarkb | dep solver is 21? | 23:09 |
fungi | 20.0.3 i think | 23:10 |
clarkb | I wonder if devstack pinned pre-solver, but that would also rule out that theory | 23:10 |
clarkb | maybe we should boot one of those vexxhost nodes and profile it? | 23:11 |
clarkb | ianw: ^ possibly related to gerrit things | 23:11 |
clarkb | it definitely seems like it just goes out to lunch every time it needs to install something | 23:11 |
clarkb | but then catches up after the first dep is pulled | 23:11 |
ianw | i'm seeing ipv6 weirdness, many sites unavailable. so possibly it gives up on something and falls back? | 23:13 |
fungi | clarkb: sorry, 20.3 | 23:13 |
fungi | but yes, not new enough to be the new solver | 23:13 |
fungi | clarkb: the log looks like it's using distro python version from focal | 23:14 |
clarkb | ianw: oh that is a good theory, ya it could be that | 23:15 |
clarkb | ianw: and then it remembers to use ipv4 for everything subsequent | 23:15 |
fungi | "pip 20.0.2 from /usr/lib/python3/dist-packages/pip (python 3.8)" | 23:15 |
openstackgerrit | Merged opendev/base-jobs master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/base-jobs/+/783989 | 23:15 |
* clarkb updates the etherpad | 23:16 | |
fungi | and yes, ipv6 connection timeout could explain the long delay | 23:17 |
clarkb | and why it is so consistent | 23:17 |
clarkb | of about a minute exactly | 23:17 |
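One way to test the fallback theory directly from an affected node (standard curl flags, nothing job-specific assumed):

    # compare connect behaviour to pypi over v6 and v4; a v6 attempt hanging
    # for roughly a minute before giving up would match the gap in the job logs
    time curl -6 --connect-timeout 70 -sI https://pypi.org/simple/ > /dev/null
    time curl -4 --connect-timeout 70 -sI https://pypi.org/simple/ > /dev/null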
fungi | that would definitely make setup take a long time given how many different pip install commands devstack likes to break up into | 23:18 |
ianw | i'm tracking things somewhat between #vexxhost channel and https://etherpad.opendev.org/p/gerrit-upgrade-2021 | 23:19 |
openstackgerrit | Merged openstack/project-config master: Clean up OpenEdge configuration https://review.opendev.org/c/openstack/project-config/+/783990 | 23:22 |
clarkb | fungi: I +2'd ^ the changes in the stack but didn't approve the later two as I can't watch the big inventory change go in | 23:23 |
clarkb | I'm going to need to sort out dinner and enjoy this 70F march day shortly | 23:23 |
fungi | go enjoy it, was a great day here too. had the windows open all day | 23:24 |
*** tosky has quit IRC | 23:30 | |
TheJulia | sigh, 3rd retry on multiple jobs :( | 23:34 |
corvus | TheJulia: i'm sorry :( | 23:53 |
TheJulia | c'est la vie | 23:53 |
TheJulia | All I can do is wait it out | 23:53 |