Thursday, 2024-11-21

clarkbwe run unbound for local lookups on localhost so that is why we don't bind to 0.0.0.0 or ::00:00
clarkbI wonder if that is just a buggy systemd unit then if we're trying to start too quickly.00:00
corvus#status log restarted schedulers and zuul-launcher to pick up zuul.yaml config change00:01
opendevstatuscorvus: finished logging00:03
clarkbre hostkeys and sshfp: when I was in suwon I wrote myself a little python utility that I could copy the sha from ssh output and the sha from sshfp output and have python calculate if they are the same or different (since they use different formats because why make this easy for humans)00:05
fungiopenssh client is able to check them too00:10
clarkbfungi: oh I thought there were problems with that? I guess I can try configuring it again too. But basically converting formats between the two keys and checking equivalent with python has been easy enough00:11
clarkbbecause I only have to do it once per host and then the local known hosts carries it going forward00:11
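(Aside on the format mismatch described above: OpenSSH prints a SHA-256 host key fingerprint as unpadded base64, while an SSHFP record carries the same digest as hex. The following is a minimal sketch of that comparison, not clarkb's actual utility, which is not shown in the log. The function names and the sample digest are made up for illustration.)

```python
import base64
import hashlib


def ssh_sha256_to_hex(fingerprint: str) -> str:
    """Convert an OpenSSH style 'SHA256:<base64>' fingerprint to lowercase hex."""
    prefix = "SHA256:"
    b64 = fingerprint[len(prefix):] if fingerprint.startswith(prefix) else fingerprint
    # OpenSSH strips the trailing '=' padding, so restore it before decoding.
    b64 += "=" * (-len(b64) % 4)
    return base64.b64decode(b64).hex()


def fingerprints_match(ssh_fingerprint: str, sshfp_hex: str) -> bool:
    """Return True if the ssh fingerprint and the SSHFP hex digest are the same hash."""
    normalized = "".join(sshfp_hex.split()).lower()
    return ssh_sha256_to_hex(ssh_fingerprint) == normalized


# Hypothetical sample data: a digest of arbitrary bytes, not a real host key.
digest = hashlib.sha256(b"not a real host key").digest()
ssh_style = "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
sshfp_style = digest.hex().upper()
print(fingerprints_match(ssh_style, sshfp_style))
```

In practice the two inputs would be pasted from the ssh client output and from the fingerprint field of a SHA-256 SSHFP record.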
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add vexxhost provider and build raw images  https://review.opendev.org/c/opendev/zuul-jobs/+/93583800:13
corvusclarkb: fungi ^ that should be safe to go ahead and approve00:15
clarkbdone00:24
clarkbianw: the arm64 job succeeded with the grub-mkconfig in it00:48
ianw! genius :)00:48
opendevmeetianw: Error: "genius" is not a valid command.00:48
clarkbI'll fix the zuul loggre stuff and then maybe with ns04 restarted we'll get package installs working more reliably00:49
opendevreviewClark Boylan proposed opendev/system-config master: Upgrade and reboot test nodes before openafs installation  https://review.opendev.org/c/opendev/system-config/+/93581200:51
corvusopendevmeet: indeed it is not.00:51
opendevmeetcorvus: Error: "indeed" is not a valid command.00:51
opendevreviewMerged openstack/diskimage-builder master: Fall back to extract-image on ubuntu build  https://review.opendev.org/c/openstack/diskimage-builder/+/92674801:01
clarkbkarolinku[m]: one thing to note is that nested virt isn't super stable (this is one reason we give it a special label because we really only want people trying to use it where it can possibly be debugged if it fails). Additionally, you should be wary of accidentally adding additional dependencies on top of the centos 10 cpu requirements01:04
clarkbbasically I would avoid using kvm instead of qemu in those jobs so that you don't unnecessarily further restrict where and how the jobs can run01:04
clarkbianw: huh a rerun resulted in kernel mismatch failures so maybe it isn't deterministic?01:05
clarkbalso ansible lint saying playbooks/openafs-rpm-package-build/pre.yaml:6: package-latest: Package installs should not use latest. is just factually wrong01:05
clarkbI should be allowed to use ansible to upgrade all packages if I want to..01:06
ianw... sigh ... non deterministic kernel installs are great01:17
JayFI'll note, Ironic just had a failure with "Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" -- I know you all merged something today to try and help with this, sorry to be the bearer of bad news in the form of anecdata :D 02:04
JayFhttps://zuul.opendev.org/t/openstack/build/98c619e4fef3461a8beac1b8f3a22f6802:04
opendevreviewMerged opendev/zuul-jobs master: Add vexxhost provider and build raw images  https://review.opendev.org/c/opendev/zuul-jobs/+/93583802:32
Clark[m]JayF that job doesn't seem to use the stuff we modified (so you weren't using the caching proxy) but yes we've still seen errors. Hoping that maybe the overall rate falls a bit so to make it easier to move away from docker hub03:45
JayFIf there's something we should modify to make it run better on our hardware, we can try to get the knob put into that code to do it03:50
Clark[m]2 weeks ago there was. Now it's a big who knows what the best choices are03:52
*** jroll09 is now known as jroll007:36
opendevreviewMarios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message  https://review.opendev.org/c/opendev/irc-meetings/+/93580608:44
opendevreviewMarios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message  https://review.opendev.org/c/opendev/irc-meetings/+/93580609:02
opendevreviewMarios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message  https://review.opendev.org/c/opendev/irc-meetings/+/93580612:50
opendevreviewMarios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message  https://review.opendev.org/c/opendev/irc-meetings/+/93580612:50
corvusClark: fungi regarding raw images -- the job ran once and succeeded (openmetal), then ran 3 more times and failed due to disk space issues (rax).  should we adjust the job to do something like mount an extra disk at /opt in rax?  i don't recall what the current state of disk resources are across our providers.15:06
fungiyeah, rax classic nodes have a 40gb rootfs with an 80gb ephemeral drive15:07
fungii think (we should also collect an initial df in the build logs)15:08
corvushttps://zuul.opendev.org/t/opendev/build/4029030b9b3e483b9d91023717d25aed was the success; sadly, that was the gate-build, not the image-build-pipeline-build, so we've seen the raw image upload time (1 hour), but not the download time.15:08
corvusfungi: yes we do; i think it shows 40g at / and nothing else mounted15:08
fungisounds right15:08
corvushttps://zuul.opendev.org/t/opendev/build/b38234f4bad0451788efcea88f9b9bb4/log/zuul-info/zuul-info.ubuntu-noble.txt#12015:08
corvusthere 'tis15:09
corvusso we might need to copy that bit of launch-node into the job pre-run... do we do that in any other jobs these days?  and do we have at least 80gb everywhere?  (i guess i'm wondering if it's safe to assume we can get 80g from each provider, or if we should assume 40g max)15:10
fungiyes, we have at least 80gb everywhere afaik, so long as you mount available ephemeral drives15:15
fungiat least we assert it as a minimum expectation in our documentation15:15
corvuscool, i'll work on copying that chunk of make_swap into a playbook.  i don't see anything else doing something like that.15:16
corvusregarding the 1-hour upload time, i have two thoughts (aside from just accepting it -- i think it's liveable):  1) the raw images should compress well, so we may be able to compress them quite a bit before uploading, if we have the launcher understand that and uncompress after downloading.  2) we could probably have the launcher perform an image conversion, but that makes the launcher more complex (and most installations aren't going to deal15:18
corvuswith multiple formats anyway).15:18
fungiyeah, compression is probably a good compromise, i agree that more complex on-the-fly format conversions are best avoided15:19
fungicorvus: generic sql question... for the update queries at steps 4.4. and 4.5 in https://etherpad.opendev.org/p/lists-openinfra-org-migration is there a way to do substring substitution (a la sed) or will it be easier to just expand those to 15 separate queries each?15:24
fungi(not urgent, of course)15:26
Clark[m]corvus: there is already a job role somewhere that manages the ephemeral disk. Not exactly sure where though15:38
Clark[m]Devstack jobs use it but not sure if it is part of devstack 15:38
corvusfungi: yes, i added a note in there for what i *think* is the correct syntax.  worth testing/double checking.  :)15:54
corvusClark: ack thx, i'll poke around there15:54
fungicorvus: perfect, thanks! and yeah, i intend to source a production db dump into a held test node and then try all the queries once i have the correct values inserted15:55
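(Aside on the substring substitution question: the syntax corvus noted in the etherpad is not shown in this log. One common approach, which may or may not be what was suggested, is the SQL REPLACE() string function inside an UPDATE. The sketch below uses an in-memory SQLite database purely for illustration. The table name, column name, and domain names are hypothetical, and the production database engine may behave differently.)

```python
import sqlite3


def replace_substring(conn, old, new):
    """Rewrite every occurrence of `old` inside the address column with `new`."""
    cur = conn.execute(
        # example_lists and address are hypothetical names for this sketch.
        "UPDATE example_lists SET address = REPLACE(address, ?, ?) "
        "WHERE instr(address, ?) > 0",
        (old, new, old),
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE example_lists (address TEXT)")
conn.executemany(
    "INSERT INTO example_lists VALUES (?)",
    [("a@lists.old.example",), ("b@lists.old.example",), ("c@other.example",)],
)
changed = replace_substring(conn, "lists.old.example", "lists.new.example")
rows = [r[0] for r in conn.execute("SELECT address FROM example_lists ORDER BY address")]
print(changed, rows)
```

A single statement of this shape would stand in for a batch of per-value UPDATE queries, which is why testing it against a copy of the real data first, as planned above, seems prudent.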
corvusClark: here it is: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/roles/configure-swap/README.rst16:00
corvusI wonder if we should move that into opendev/base-jobs16:02
corvusoh wait we already do something with /opt in our image build job...16:04
corvusi think i need to take a second look and see exactly where we ran out of space16:05
corvusokay i think the problem is that the output images are on / not /opt16:06
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Move dib image output to /opt  https://review.opendev.org/c/opendev/zuul-jobs/+/93589316:11
corvusClark: fungi ^ i think that may be all we need to do :)16:11
fricklercorvus: seems you need to pre-create the target dir, also added another comment16:16
clarkbsha is up to dc0 thinking it will finish sometime tomorrow16:26
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Move dib image output to /opt  https://review.opendev.org/c/opendev/zuul-jobs/+/93589316:49
corvusfrickler: thanks16:49
opendevreviewMerged zuul/zuul-jobs master: Allow overriding the buildset registry image  https://review.opendev.org/c/zuul/zuul-jobs/+/84998916:52
corvusi'm running some compression benchmarks on nb01; i've installed some extra packages (qemu-img, and zstd) to help with that.17:46
clarkbI guess thats a convenient location since it has many images to work with17:47
corvusand space17:49
corvusi can remove the packages when done, but i think it's harmless to keep them17:50
clarkbya I'm not worried about having those installed17:51
fungigeek humor, i just noticed that if you create a venv with cpython 3.14, it adds a 𝜋thon symlink to the python executable17:51
clarkbif they were services it would be different but those are just utilities17:51
clarkbfungi: is there a piethon too ?17:52
fungionly when i have time to spend baking next week17:52
opendevreviewMerged opendev/zuul-jobs master: Move dib image output to /opt  https://review.opendev.org/c/opendev/zuul-jobs/+/93589318:13
opendevreviewClark Boylan proposed opendev/system-config master: Upgrade and reboot test nodes before openafs installation  https://review.opendev.org/c/opendev/system-config/+/93581218:18
clarkbhttps://zuul.opendev.org/t/openstack/build/2f97721d42f3448e961500d1391ca581/log/job-output.txt#327 the centos jobs have been hitting errors like this a lot18:25
clarkbthat job ran in raxflex but I think I've seen it in rax iad as well (and possibly others)18:25
clarkbmy first thought was maybe openafs is sad but nothing in dmesg output recently. Looking at the apache access log I see the node grabbing other files on the mirror before the error18:25
clarkbthe error logs are either empty or don't have relevant info18:27
clarkbits almost like the tcp connection doesn't successfully establish18:28
fungirace condition?18:29
clarkbcould be anything in setup for the connection I guess since the log from dnf/yum is basically useless18:30
clarkbDNS, tcp, apache, openafs (and all the networking in between)18:30
clarkblooking at the mirror, the total number of tcp connections isn't super high so I don't think this is a mirror getting overloaded and running out of connection slots18:31
clarkblet me see if I can find other examples18:31
clarkbhttps://zuul.opendev.org/t/openstack/build/a3a4f2abfe7544f0aa3e1495d23a6661/log/job-output.txt#324 rax-iad https://zuul.opendev.org/t/openstack/build/690f7e08605f4e61b862e214fb967f54/log/job-output.txt#329 raxflex https://zuul.opendev.org/t/openstack/build/f0609499f6ef4801ba6d38b11154320f/log/job-output.txt#335 openmetal18:33
clarkbwith that sampling it seems unlikely to be networking specific to one cloud (say the NAT in raxflex)18:33
clarkbalso if it were an openafs issue you would expect debuntu to have similar errors18:35
clarkball of these roles use the package module instead of dnf/apt18:36
clarkbI suspect package farms out to dnf or apt based on the platform so probably not an issue with the ansible module18:37
clarkbNext step is probably fetching the /var/log/dnf.log and /var/log/yum.log files18:40
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597018:58
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597018:58
corvus^ that change and the depends-on should handle compression of raw images.  it takes zstd 2m15s to compress a 25GB raw image to 8GB, ~ the same size as qcow2, so should have similar upload times.19:01
corvus(same time to decompress too)19:01
fungithat's surprisingly fast19:01
corvuszstd eked out a slight win over lz4 on time and size;  lz4 was 9GB in 2m32s19:02
corvusgzip was 8.4GB in 10m42s19:02
clarkbpoor gzip19:02
corvusfungi: yeah, i'm not displeased19:03
corvusinterestingly, multi-thread/core zstd isn't worth it for this; it saved 10s.  i'd be curious to learn when it's advantageous.19:04
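(Aside: the benchmark figures quoted above can be turned into compression ratios and rough throughput. This is a small arithmetic sketch. It assumes all three tools were run against the same 25GB raw image, which the log implies but does not state outright, and it treats 1GB as 1000MB.)

```python
# Output size in GB and wall time in seconds, as quoted in the channel.
BENCHMARKS = {
    "zstd": (8.0, 2 * 60 + 15),
    "lz4": (9.0, 2 * 60 + 32),
    "gzip": (8.4, 10 * 60 + 42),
}
RAW_GB = 25.0  # assumed to be the same input image for every tool


def summarize(out_gb, seconds, raw_gb=RAW_GB):
    """Return (compression ratio, input throughput in MB/s)."""
    ratio = raw_gb / out_gb
    throughput_mb_s = raw_gb * 1000 / seconds
    return ratio, throughput_mb_s


summary = {name: summarize(*values) for name, values in BENCHMARKS.items()}
for name, (ratio, mb_s) in summary.items():
    print(f"{name}: {ratio:.2f}x smaller, about {mb_s:.0f} MB/s of input")
```

On these numbers zstd processes input several times faster than gzip while producing a slightly smaller file.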
clarkbcorvus: couple of thoughts inline but I +2'd because they're thinking about what happens after raw which can be handled later if you prefer19:06
corvusclarkb: good point; i agree i think we can leave it for now and rework it when we add vhd.  i wasn't sure about that.  :)19:07
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597019:09
corvusclarkb: oops but i left an extra no_log in that i just removed19:09
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597019:12
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597019:12
corvusclarkb: well, while i'm in there, i updated it for vhd too ^ :)19:12
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images  https://review.opendev.org/c/opendev/zuul-jobs/+/93597019:13
corvusfor real.  i'm done.19:14
clarkbI checked and didn't see vhd glad I wasn't going crazy :)19:14
corvusnope that's me not seeing the "post-inner.yaml change since last saved" prompt...19:15
clarkbis the role that copies zuul_dir/logs/ as set up by stage-output not in our base job?19:37
clarkbI've been trying to copy logs using stage output and I can see it stages a couple of files but then they don't end up in the resulting job logs so they must not get ferried to the executor then to swift?19:37
clarkbbecause I feel like I'm beating my head against this openafs on centos problem (and now the general can't find the package problem) I've gone ahead and rerequested a new image build on nb0419:42
clarkbit looks like the arm builds may be happier now that nsd was restarted? at least a jammy build from today didn't fail to look up opendev.org19:42
clarkbthe ttl on opendev.org is 3600 too which was the other thing I thought about adjusting but one hour seems plenty19:43
clarkband now lunch19:43
opendevreviewJeremy Stanley proposed opendev/zone-zuul-ci.org master: Add Bluesky Verification Record for Zuul Account  https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/93597720:15
clarkbfungi: I'm +2 on ^ but would be good for someone in the zuul community that isn't one of us to review it too21:42
fungiyes, agreed21:45
fungithere's no rush21:45
clarkbthinking about the centos mirror fetch issue a bit more I checked the mirrors and all three are a different version of ubuntu and the noble one is a different openafs version too21:50
clarkbso no fingers pointing at openafs upgrades21:50
clarkbhttps://zuul.opendev.org/t/openstack/build/d23154399d404949ad4dc366e7b19355/log/dnf.log#81-97 roughly coincides with https://zuul.opendev.org/t/openstack/build/d23154399d404949ad4dc366e7b19355/log/job-output.txt#324-329 but there is nothing logged in there about failing to contact any mirrors22:08
clarkband the grub config has a single menuentry for efi firmware settings not linux22:15
clarkbI know I'm getting a bit annoyed that debugging this is taking upwards of several days at this point, but also why have we removed all the configs from where the configs go and not logged errors to the package manager logs?22:16
clarkbit feels like we stopped trying to make the software understandable by the people who have to run it22:16
clarkband note that the grub entry is for x86 so I don't think this explains why arm64 is broken its just that we've decided to do booting differently now and its not clear what we're actually booting anymore from a grub config file22:17
clarkbI'm waving the flag of surrender and have put holds in place22:21
clarkblooks like the old menu contents for grub may go in /boot/loader/entries/$SHA.$cpuarch.conf files22:35
clarkbI wonder if the sha makes them unordered and thus we get the old one?22:35
ianwclarkb: i do low-key suspect the mirror; i have deja-vu on having what to the client end looks like a 404 coming from posix-ish-level failures from the openafs mount on the mirror22:54
ianwthat percolate up through apache and end up as a 40422:55
ianwi'm trying to remember if this showed up as like timeouts in the afs logs on the mirrors ... am i nuts or is this ringing bells for anyone else?!22:57
clarkbianw: I don't recall. But the weird thing is its three different mirrors running three different openafs packages (2 different versions) with a mix of other networking too22:58
clarkbit seems really odd that debuntu doesn't seem to have these issues either22:58
clarkbbut also why can't dnf just log a 404 that would make my life so much easier (note I couldn't find a 404 in the server logs fwiw)22:58
clarkbianw: for the arm boots problem grubby shows both kernels but the old one is the default. I just pushed a patch to the ozj change which sets an /etc/default/grub option to explicitly enable bls which the internet indicates may help if your default kernel isn't updating23:00
ianwi might be getting confused with when we were trying kafs.  my notes have23:00
clarkbI have an autohold in place to debug the no more mirrors error but haven't tripped a failure yet (I would say its about a 50% hit rate)23:00
ianw2019-07-26 : kafs: investigation into 404 error; http://lists.infradead.org/pipermail/linux-afs/2019-July/003122.html23:01
ianwwhich in the ultimate irony is now a 40423:01
clarkbI think latest push may have caught one though so thats "good"23:01
clarkbheh23:01
clarkbI did go ahead and run the grubby command to change the default kernel on my held arm node and I can't figure out what it actually changed on the running system to make it the default23:03
clarkbthere must be some config file that has a default kernel to boot == this index in the bls entries23:03
clarkbsomehow having an explicit "this is the default entry" in the grub config wasn't good enough /rant23:03
clarkb(there seems to be a general trend away from "everything is configured via a well known file" to "use this utility, it will make your life easier, we swear" and I'm not sure the promises are holding up. I'm particularly not fond of the new firewall stuff)23:05
ianwso it definitely gets a new "/boot/loader/entries/" file?23:06
ianwthe new kernel?23:06
clarkbyes23:06
clarkbbut the old kernel remains the default23:06
clarkband if I use grubby to change the default it doesn't seem to change anything in the entries themselves23:07
JayFis there a .csv file in your EFI dir?23:07
JayFBOOT.CSV23:07
clarkbso I'm somewhat lost on where to work backwards to to figure out where it determines what the default is to figure out why it isn't set23:07
JayFThat is one way you can end up with weird stuff booted that many folks aren't aware of (but unlikely in your case I think)23:07
clarkbthere is a BOOTAA64.CSV23:07
clarkbJayF: but in this case grubby knows the old kernel is the default so I don't think it is anything weird beyond the system has decided not to update what the default kernel is23:08
clarkband my frustration is that we've made that determination of that default extremely obtuse compared to grub saying this entry is default and it is clear how to change and how it is determined23:08
JayFis grubby even involved?23:09
ianwi'm feeling like you might need to use "grubby" -- "grubby --default-kernel" i think should show, and we might need "grubby --set-default"23:09
JayFif boot.csv points to a kernel; it's likely being loaded as an efi stub23:09
JayFat least AIUI23:09
JayF(I'm operating on limited context)23:09
clarkbJayF: grubby says old kernel is default and old kernel is what boots. It implies grubby at least knows what is going on23:09
clarkbJayF: we are upgrading the kernel via dnf on centos 9 stream aarch64 and rebooting with the theory we reboot onto the new kernel but then we actually boot on old kernel23:10
clarkbgrubby seems to accurately report the old kernel is the default23:10
clarkbI've changed that manually with grubby to see if I can observe a change in system state I can work off of but I haven't seen where that happens yet23:10
JayFhow was the kernel you're upgrading from installed?23:10
JayFjust in the image, or did we do something manually to get it in there23:10
clarkbits whatever dib does to install the packages using dnf23:11
ianwit's been so long, i think grubby changes things that then grub2-mkconfig writes out?  i don't know ... i page this stuff in occasionally for a bug then it falls right back out23:11
clarkband really I think that's why I'm grumpy. This was straightforward and apparent before we decided we needed a new utility to manage this with obtuse config files that don't even tell you what the default is without running a command23:12
clarkbits a model shift that has been happening across the linux ecosystem and sometimes I just want the dead simple file says foo is default so foo is default23:13
clarkband I don't even necessarily think its all a bad thing except that because its all new we end up with poor documentation and stuff that is impossible to debug23:14
clarkbby the time we sort that out we'll be onto the new shiny and rinse and repeat 23:14
JayFclarkb: https://github.com/openstack/diskimage-builder/blob/master/diskimage_builder/elements/centos/post-install.d/03-reset-bls-entries#L24 I don't know much about centos, but this smells bad to me23:14
JayFe.g. maybe the entries we add/edit are not getting changed by grubby?23:14
clarkbJayF: I think we actually use this element https://github.com/openstack/diskimage-builder/tree/master/diskimage_builder/elements/centos-minimal23:15
clarkband then something else must do the grub/bls config?23:15
clarkbok my latest patchset seems to maybe be working23:16
clarkbI'll have to recheck a couple of times but setting "GRUB_ENABLE_BLSCFG=true" >> /etc/default/grub may have been the ticket23:17
JayFwhoa, yum-minimal is basically debootstrap for "enterprise linux"23:17
JayFthat's neat23:17
clarkbif we operate on that assumption then the issue is likely that when booting efi you really want to explicitly enable blsconfig otherwise grub doesn't do enough work on kernel updates to update the kernel23:18
clarkbhowever, one wonders why there is a world where this isn't the default23:18
clarkbbut thats not my problem to solve as I can work with a workaround like this. Now to try and debug the dnf can't find my mirror problems23:18
JayFsome days it's "why?" and some days it's "WHY!?!?" :) 23:18
clarkbon the held node for broken dnf install (not the kernel decision problem) I tried installing kernel-devel manually (the package that failed) and it succeeded23:23
clarkbso its not a consistent error even in these cloud regions23:23
clarkbnow to figure out how to get dnf to forcefully go back through the process of installing this package...23:23
clarkb`while dnf reinstall -y -v kernel-devel ; do echo next ; done` is running23:26
clarkbwhat if the problem is a race post reboot and trying to dnf install something?23:27
clarkbthat would be weird because we'd already be ssh'd back in so networking should be up23:27
ianwhrm ... does that still have the package dumping happening that i put in?23:38
ianwi feel like that would update the dnf metadata, which would i guess mean that networking was minimally working?23:38
clarkbianw: you added it to system-config and I've been working in ozj because its a smaller set of jobs. I guess I could add it in23:40
clarkbbut also I think what you added is all pre reboot not post23:40
ianwahh could be right; was just trying to think.  i mean i guess as you say networking is up because we ssh'd in ... _maybe_ name resolution is not though?  what else could it be ...23:42
clarkbhrm updating blsconfig grub stuff is not the fix for booting the correct kernel23:42
clarkbit failed on the second pass. What is interesting about that is it means the correct kernel is chosen sometimes23:42
clarkbbut not every time23:42
clarkbianw: oh ya name resolution could still be waiting on unbound to come up that is a very good point23:43
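(Aside: if the post-reboot failures do come from the local resolver not yet answering, which is only a hypothesis at this point in the log, a job could wait for name resolution before invoking the package manager. The helper below is a hypothetical sketch and not an existing role. The default timeout and interval are arbitrary, and the resolver, sleep, and clock parameters exist only so the loop can be exercised without a network.)

```python
import socket
import time


def wait_for_dns(hostname, timeout=60.0, interval=2.0,
                 resolver=socket.getaddrinfo, sleep=time.sleep,
                 clock=time.monotonic):
    """Poll until `hostname` resolves; return False once `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        try:
            resolver(hostname, None)
            return True
        except socket.gaierror:
            # Resolution failed; give up if the deadline passed, else retry.
            if clock() >= deadline:
                return False
            sleep(interval)
```

A call such as wait_for_dns("opendev.org") right after the reboot and before the package install step would make a resolver startup race visible as an explicit failure instead of an opaque dnf error.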
clarkbunfortunately I'm back to the drawing board on how to make arm happy. I could explicitly set the default kernel with grubby but you need to feed it a specific version and that changes over time so I have to figure out how to glean that from the package installs23:43
clarkb(doing something similar to what you added earlier I think)23:43
clarkbbut also that seems super hacky as the distro should just handle that for us23:44
clarkblist /boot | grep vmlinuz | sort by update time23:46
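(Aside: the pseudocode pipeline above could be sketched roughly as follows. This is an illustration under stated assumptions, not the change that was eventually pushed: it assumes kernel images live directly in /boot with names starting with vmlinuz-, and it skips rescue images by name, which is a guess about what one would want here.)

```python
from pathlib import Path


def newest_kernel_image(boot_dir="/boot"):
    """Return the most recently modified vmlinuz-* file in boot_dir, or None."""
    candidates = [
        path
        for path in Path(boot_dir).glob("vmlinuz-*")
        # Assumption: rescue images should never become the default kernel.
        if path.is_file() and "rescue" not in path.name
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)
```

The returned path is the kind of value that could then be handed to the grubby --set-default option ianw mentioned earlier, though as noted in the channel the distro should arguably handle this on its own.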
JayFHave we attempted to contact upstream about this? We should have no shortage of contacts who are CentOS-adjacent23:54
JayFand I agree with you it sounds like it's a distro bug23:54
ianwis it -> https://bugzilla.redhat.com/show_bug.cgi?id=2032680 ?23:55
ianwthat doesn't explain why it sometimes works and not.  that was the issue linked in dib that the install wouldn't touch bls entries with the wrong machine-id.  maybe it's something about setting the default kernel in the kernel install scripts that isn't matching the machine id?23:56
* ianw is clutching at proverbial straws23:56
clarkbianw: that actually looks like it yes23:57
clarkbthe machine id entries in bls are different for my held node23:57
JayFthis is reverse23:57
JayFyou *need the suspicious looking code in centos element*23:57
JayFrather than it being at fault in this case23:58
JayFthis code https://github.com/openstack/diskimage-builder/blob/master/diskimage_builder/elements/centos/post-install.d/03-reset-bls-entries#L24 23:58
JayFI think you tied it all together ian :) 23:58
clarkbJayF: I think in that case you're still using the wrong machine id because it is the machine id of the builder not what you boot on later23:59
clarkbhonestly using machine id here seems like a flaw that doesn't account for virtual machines or even moving a disk from one laptop to another23:59
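(Aside: the machine id mismatch discussed above can be checked mechanically. The sketch below assumes the common layout where boot loader entries are files named <machine-id>-<kernel-version>.conf under /boot/loader/entries and the running system's id is in /etc/machine-id. The function name is hypothetical, and the paths are parameters so the check can be pointed at a mounted image instead of a live system.)

```python
from pathlib import Path


def split_bls_entries(entries_dir="/boot/loader/entries",
                      machine_id_file="/etc/machine-id"):
    """Group BLS entry file names by whether they carry this system's machine id."""
    machine_id = Path(machine_id_file).read_text().strip()
    matching, foreign = [], []
    for entry in sorted(Path(entries_dir).glob("*.conf")):
        if entry.name.startswith(machine_id + "-"):
            matching.append(entry.name)
        else:
            foreign.append(entry.name)
    return machine_id, matching, foreign
```

On a node showing the problem one would expect the old kernel's entry to land in the foreign list, carrying the image builder's machine id, though that is an inference from the discussion and not something confirmed in the log.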
