clarkb | we run unbound for local lookups on localhost so that is why we don't bind to 0.0.0.0 or :: | 00:00 |
---|---|---|
clarkb | I wonder if that is just a buggy systemd unit then if we're trying to start too quickly. | 00:00 |
corvus | #status log restarted schedulers and zuul-launcher to pick up zuul.yaml config change | 00:01 |
opendevstatus | corvus: finished logging | 00:03 |
clarkb | re hostkeys and sshfp: when I was in suwon I wrote myself a little python utility that I could copy the sha from ssh output and the sha from sshfp output and have python calculate if they are the same or different (since they use different formats because why make this easy for humans) | 00:05 |
fungi | openssh client is able to check them too | 00:10 |
clarkb | fungi: oh I thought there were problems with that? I guess I can try configuring it again too. But basically converting formats between the two keys and checking equivalence with python has been easy enough | 00:11 |
clarkb | because I only have to do it once per host and then the local known hosts carries it going forward | 00:11 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add vexxhost provider and build raw images https://review.opendev.org/c/opendev/zuul-jobs/+/935838 | 00:13 |
corvus | clarkb: fungi ^ that should be safe to go ahead and approve | 00:15 |
clarkb | done | 00:24 |
clarkb | ianw: the arm64 job succeeded with the grub-mkconfig in it | 00:48 |
ianw | ! genius :) | 00:48 |
opendevmeet | ianw: Error: "genius" is not a valid command. | 00:48 |
clarkb | I'll fix the zuul logger stuff and then maybe with ns04 restarted we'll get package installs working more reliably | 00:49 |
opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade and reboot test nodes before openafs installation https://review.opendev.org/c/opendev/system-config/+/935812 | 00:51 |
corvus | opendevmeet: indeed it is not. | 00:51 |
opendevmeet | corvus: Error: "indeed" is not a valid command. | 00:51 |
opendevreview | Merged openstack/diskimage-builder master: Fall back to extract-image on ubuntu build https://review.opendev.org/c/openstack/diskimage-builder/+/926748 | 01:01 |
clarkb | karolinku[m]: one thing to note is that nested virt isn't super stable (this is one reason we give it a special label because we really only want people trying to use it where it can possibly be debugged if it fails). Additionally, you should be wary of accidentally adding additional dependencies on top of the centos 10 cpu requirements | 01:04 |
clarkb | basically I would avoid using kvm instead of qemu in those jobs so that you don't unnecessarily further restrict where and how the jobs can run | 01:04 |
clarkb | ianw: huh a rerun resulted in kernel mismatch failures so maybe it isn't deterministic? | 01:05 |
clarkb | also ansible lint saying playbooks/openafs-rpm-package-build/pre.yaml:6: package-latest: Package installs should not use latest. is just factually wrong | 01:05 |
clarkb | I should be allowed to use ansible to upgrade all packages if I want to.. | 01:06 |
ianw | ... sigh ... non deterministic kernel installs are great | 01:17 |
JayF | I'll note, Ironic just had a failure with "Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit" -- I know you all merged something today to try and help with this, sorry to be the bearer of bad news in the form of anecdata :D | 02:04 |
JayF | https://zuul.opendev.org/t/openstack/build/98c619e4fef3461a8beac1b8f3a22f68 | 02:04 |
opendevreview | Merged opendev/zuul-jobs master: Add vexxhost provider and build raw images https://review.opendev.org/c/opendev/zuul-jobs/+/935838 | 02:32 |
Clark[m] | JayF that job doesn't seem to use the stuff we modified (so you weren't using the caching proxy) but yes we've still seen errors. Hoping that maybe the overall rate falls a bit so to make it easier to move away from docker hub | 03:45 |
JayF | If there's something we should modify to make it run better on our hardware, we can try to get the knob put into that code to do it | 03:50 |
Clark[m] | 2 weeks ago there was. Now it's a big who knows what the best choices are | 03:52 |
*** jroll09 is now known as jroll0 | 07:36 | |
opendevreview | Marios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message https://review.opendev.org/c/opendev/irc-meetings/+/935806 | 08:44 |
opendevreview | Marios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message https://review.opendev.org/c/opendev/irc-meetings/+/935806 | 09:02 |
opendevreview | Marios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message https://review.opendev.org/c/opendev/irc-meetings/+/935806 | 12:50 |
opendevreview | Marios Andreou proposed opendev/irc-meetings master: Update Watcher team meeting and clarify check_chair error message https://review.opendev.org/c/opendev/irc-meetings/+/935806 | 12:50 |
corvus | Clark: fungi regarding raw images -- the job ran once and succeeded (openmetal), then ran 3 more times and failed due to disk space issues (rax). should we adjust the job to do something like mount an extra disk at /opt in rax? i don't recall what the current state of disk resources are across our providers. | 15:06 |
fungi | yeah, rax classic nodes have a 40gb rootfs with an 80gb ephemeral drive | 15:07 |
fungi | i think (we should also collect an initial df in the build logs) | 15:08 |
corvus | https://zuul.opendev.org/t/opendev/build/4029030b9b3e483b9d91023717d25aed was the success; sadly, that was the gate-build, not the image-build-pipeline-build, so we've seen the raw image upload time (1 hour), but not the download time. | 15:08 |
corvus | fungi: yes we do; i think it shows 40g at / and nothing else mounted | 15:08 |
fungi | sounds right | 15:08 |
corvus | https://zuul.opendev.org/t/opendev/build/b38234f4bad0451788efcea88f9b9bb4/log/zuul-info/zuul-info.ubuntu-noble.txt#120 | 15:08 |
corvus | there 'tis | 15:09 |
corvus | so we might need to copy that bit of launch-node into the job pre-run... do we do that in any other jobs these days? and do we have at least 80gb everywhere? (i guess i'm wondering if it's safe to assume we can get 80g from each provider, or if we should assume 40g max) | 15:10 |
fungi | yes, we have at least 80gb everywhere afaik, so long as you mount available ephemeral drives | 15:15 |
fungi | at least we assert it as a minimum expectation in our documentation | 15:15 |
corvus | cool, i'll work on copying that chunk of make_swap into a playbook. i don't see anything else doing something like that. | 15:16 |
corvus | regarding the 1-hour upload time, i have two thoughts (aside from just accepting it -- i think it's liveable): 1) the raw images should compress well, so we may be able to compress them quite a bit before uploading, if we have the launcher understand that and uncompress after downloading. 2) we could probably have the launcher perform an image conversion, but that makes the launcher more complex (and most installations aren't going to deal | 15:18 |
corvus | with multiple formats anyway). | 15:18 |
fungi | yeah, compression is probably a good compromise, i agree that more complex on-the-fly format conversions are best avoided | 15:19 |
fungi | corvus: generic sql question... for the update queries at steps 4.4. and 4.5 in https://etherpad.opendev.org/p/lists-openinfra-org-migration is there a way to do substring substitution (a la sed) or will it be easier to just expand those to 15 separate queries each? | 15:24 |
fungi | (not urgent, of course) | 15:26 |
Clark[m] | corvus: there is already a job role somewhere that manages the ephemeral disk. Not exactly sure where though | 15:38 |
Clark[m] | Devstack jobs use it but not sure if it is part of devstack | 15:38 |
corvus | fungi: yes, i added a note in there for what i *think* is the correct syntax. worth testing/double checking. :) | 15:54 |
corvus | Clark: ack thx, i'll poke around there | 15:54 |
fungi | corvus: perfect, thanks! and yeah, i intend to source a production db dump into a held test node and then try all the queries once i have the correct values inserted | 15:55 |
corvus | Clark: here it is: https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/roles/configure-swap/README.rst | 16:00 |
corvus | I wonder if we should move that into opendev/base-jobs | 16:02 |
corvus | oh wait we already do something with /opt in our image build job... | 16:04 |
corvus | i think i need to take a second look and see exactly where we ran out of space | 16:05 |
corvus | okay i think the problem is that the output images are on / not /opt | 16:06 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Move dib image output to /opt https://review.opendev.org/c/opendev/zuul-jobs/+/935893 | 16:11 |
corvus | Clark: fungi ^ i think that may be all we need to do :) | 16:11 |
frickler | corvus: seems you need to pre-create the target dir, also added another comment | 16:16 |
clarkb | sha is up to dc0 thinking it will finish sometime tomorrow | 16:26 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Move dib image output to /opt https://review.opendev.org/c/opendev/zuul-jobs/+/935893 | 16:49 |
corvus | frickler: thanks | 16:49 |
opendevreview | Merged zuul/zuul-jobs master: Allow overriding the buildset registry image https://review.opendev.org/c/zuul/zuul-jobs/+/849989 | 16:52 |
corvus | i'm running some compression benchmarks on nb01; i've installed some extra packages (qemu-img, and zstd) to help with that. | 17:46 |
clarkb | I guess thats a convenient location since it has many images to work with | 17:47 |
corvus | and space | 17:49 |
corvus | i can remove the packages when done, but i think it's harmless to keep them | 17:50 |
clarkb | ya I'm not worried about having those installed | 17:51 |
fungi | geek humor, i just noticed that if you create a venv with cpython 3.14, it adds a 𝜋thon symlink to the python executable | 17:51 |
clarkb | if they were services it would be different but those are just utilities | 17:51 |
clarkb | fungi: is there a piethon too ? | 17:52 |
fungi | only when i have time to spend baking next week | 17:52 |
opendevreview | Merged opendev/zuul-jobs master: Move dib image output to /opt https://review.opendev.org/c/opendev/zuul-jobs/+/935893 | 18:13 |
opendevreview | Clark Boylan proposed opendev/system-config master: Upgrade and reboot test nodes before openafs installation https://review.opendev.org/c/opendev/system-config/+/935812 | 18:18 |
clarkb | https://zuul.opendev.org/t/openstack/build/2f97721d42f3448e961500d1391ca581/log/job-output.txt#327 the centos jobs have been hitting errors like this a lot | 18:25 |
clarkb | that job ran in raxflex but I think I've seen it in rax iad as well (and possibly others) | 18:25 |
clarkb | my first thought was maybe openafs is sad but nothing in dmesg output recently. Looking at the apache access log I see the node grabbing other files on the mirror before the error | 18:25 |
clarkb | the error logs are either empty or don't have relevant info | 18:27 |
clarkb | its almost like the tcp connection doesn't successfully establish | 18:28 |
fungi | race condition? | 18:29 |
clarkb | could be anything in setup for the connection I guess since the log from dnf/yum is basically useless | 18:30 |
clarkb | DNS, tcp, apache, openafs (and all the networking in between) | 18:30 |
clarkb | looking at the mirror the total number of tcp connections isn't super high so I don't think this is a mirror getting overloaded and running out of connection slots | 18:31 |
clarkb | let me see if I can find other examples | 18:31 |
clarkb | https://zuul.opendev.org/t/openstack/build/a3a4f2abfe7544f0aa3e1495d23a6661/log/job-output.txt#324 rax-iad https://zuul.opendev.org/t/openstack/build/690f7e08605f4e61b862e214fb967f54/log/job-output.txt#329 raxflex https://zuul.opendev.org/t/openstack/build/f0609499f6ef4801ba6d38b11154320f/log/job-output.txt#335 openmetal | 18:33 |
clarkb | with that sampling it seems unlikely to be networking specific to one cloud (say the NAT in raxflex) | 18:33 |
clarkb | also if it were an openafs issue you would expect debuntu to have similar errors | 18:35 |
clarkb | all of these roles use the package module instead of dnf/apt | 18:36 |
clarkb | I suspect package farms out to dnf or apt based on the platform so probably not an issue with the ansible module | 18:37 |
clarkb | Next step is probably fetching the /var/log/dnf.log and /var/log/yum.log files | 18:40 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 18:58 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 18:58 |
corvus | ^ that change and the depends-on should handle compression of raw images. it takes zstd 2m15s to compress a 25GB raw image to 8GB, ~ the same size as qcow2, so should have similar upload times. | 19:01 |
corvus | (same time to decompress too) | 19:01 |
fungi | that's surprisingly fast | 19:01 |
corvus | zstd eked out a slight win over lz4 on time and size; lz4 was 9GB in 2m32s | 19:02 |
corvus | gzip was 8.4GB in 10m42s | 19:02 |
clarkb | poor gzip | 19:02 |
corvus | fungi: yeah, i'm not displeased | 19:03 |
corvus | interestingly, multi-thread/core zstd isn't worth it for this; it saved 10s. i'd be curious to learn when it's advantageous. | 19:04 |
clarkb | corvus: couple of thoughts inline but I +2'd because they're thinking about what happens after raw which can be handled later if you prefer | 19:06 |
corvus | clarkb: good point; i agree i think we can leave it for now and rework it when we add vhd. i wasn't sure about that. :) | 19:07 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 19:09 |
corvus | clarkb: oops but i left an extra no_log in that i just removed | 19:09 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 19:12 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 19:12 |
corvus | clarkb: well, while i'm in there, i updated it for vhd too ^ :) | 19:12 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Compress raw/vhd images https://review.opendev.org/c/opendev/zuul-jobs/+/935970 | 19:13 |
corvus | for real. i'm done. | 19:14 |
clarkb | I checked and didn't see vhd glad I wasn't going crazy :) | 19:14 |
corvus | nope that's me not seeing the "post-inner.yaml change since last saved" prompt... | 19:15 |
clarkb | is the role that copies zuul_dir/logs/ as set up by stage-output not in our base job? | 19:37 |
clarkb | I've been trying to copy logs using stage output and I can see it stages a couple of files but then they don't end up in the resulting job logs so they must not get ferried to the executor then to swift? | 19:37 |
clarkb | because I feel like I'm beating my head against this openafs on centos problem (and now the general can't find the package problem) I've gone ahead and re-requested a new image build on nb04 | 19:42 |
clarkb | it looks like the arm builds may be happier now that nsd was restarted? at least a jammy build from today didn't fail to look up opendev.org | 19:42 |
clarkb | the ttl on opendev.org is 3600 too which was the other thing I thought about adjusting but one hour seems plenty | 19:43 |
clarkb | and now lunch | 19:43 |
opendevreview | Jeremy Stanley proposed opendev/zone-zuul-ci.org master: Add Bluesky Verification Record for Zuul Account https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/935977 | 20:15 |
clarkb | fungi: I'm +2 on ^ but would be good for someone in the zuul community that isn't one of us to review it too | 21:42 |
fungi | yes, agreed | 21:45 |
fungi | there's no rush | 21:45 |
clarkb | thinking about the centos mirror fetch issue a bit more I checked the mirrors and all three are a different version of ubuntu and the noble one is a different openafs version too | 21:50 |
clarkb | so no fingers pointing at openafs upgrades | 21:50 |
clarkb | https://zuul.opendev.org/t/openstack/build/d23154399d404949ad4dc366e7b19355/log/dnf.log#81-97 roughly coincides with https://zuul.opendev.org/t/openstack/build/d23154399d404949ad4dc366e7b19355/log/job-output.txt#324-329 but there is nothing logged in there about failing to contact any mirrors | 22:08 |
clarkb | and the grub config has a single menuentry for efi firmware settings not linux | 22:15 |
clarkb | I know I'm getting a bit annoyed that debugging this is taking upwards of several days at this point, but also why have we removed all the configs from where the configs go and not logged errors to the package manager logs? | 22:16 |
clarkb | it feels like we stopped trying to make the software understandable by the people who have to run it | 22:16 |
clarkb | and note that the grub entry is for x86 so I don't think this explains why arm64 is broken its just that we've decided to do booting differently now and its not clear what we're actually booting anymore from a grub config file | 22:17 |
clarkb | I'm waving the flag of surrender and have put holds in place | 22:21 |
clarkb | looks like the old menu contents for grub may go in /boot/loader/entries/$SHA.$cpuarch.conf files | 22:35 |
clarkb | I wonder if the sha makes them unordered and thus we get the old one? | 22:35 |
ianw | clarkb: i do low-key suspect the mirror; i have deja-vu on having what to the client end looks like a 404 coming from posix-ish-level failures from the openafs mount on the mirror | 22:54 |
ianw | that percolate up through apache and end up as a 404 | 22:55 |
ianw | i'm trying to remember if this showed up as like timeouts in the afs logs on the mirrors ... am i nuts or is this ringing bells for anyone else?! | 22:57 |
clarkb | ianw: I don't recall. But the weird thing is its three different mirrors running three different openafs packages (2 different versions) with a mix of other networking too | 22:58 |
clarkb | it seems really odd that debuntu doesn't seem to have these issues either | 22:58 |
clarkb | but also why can't dnf just log a 404 that would make my life so much easier (note I couldn't find a 404 in the server logs fwiw) | 22:58 |
clarkb | ianw: for the arm boots problem grubby shows both kernels but the old one is the default. I just pushed a patch to the ozj change which sets an /etc/default/grub option to explicitly enable bls which the internet indicates may help if your default kernel isn't updating | 23:00 |
ianw | i might be getting confused with when we were trying kafs. my notes have | 23:00 |
clarkb | I have an autohold in place to debug the no more mirrors error but haven't tripped a failure yet (I would say its about a 50% hit rate) | 23:00 |
ianw | 2019-07-26 : kafs: investigation into 404 error; http://lists.infradead.org/pipermail/linux-afs/2019-July/003122.html | 23:01 |
ianw | which in the ultimate irony is now a 404 | 23:01 |
clarkb | I think latest push may have caught one though so thats "good" | 23:01 |
clarkb | heh | 23:01 |
clarkb | I did go ahead and run the grubby command to change the default kernel on my held arm node and I can't figure out what it actually changed on the running system to make it the default | 23:03 |
clarkb | there must be some config file that has a default kernel to boot == this index in the bls entries | 23:03 |
clarkb | somehow having an explicit "this is the default entry" in the grub config wasn't good enough /rant | 23:03 |
clarkb | (there seems to be a general trend away from everything is configured via a well known file to use this utility it will make your life easier we swear and I'm not sure the promises are holding up. I'm particularly not fond of the new firewall stuff) | 23:05 |
ianw | so it definitely gets a new "/boot/loader/entries/" file? | 23:06 |
ianw | the new kernel? | 23:06 |
clarkb | yes | 23:06 |
clarkb | but the old kernel remains the default | 23:06 |
clarkb | and if I use grubby to change the default it doesn't seem to change anything in the entries themselves | 23:07 |
JayF | is there a .csv file in your EFI dir? | 23:07 |
JayF | BOOT.CSV | 23:07 |
clarkb | so I'm somewhat lost on where to work backwards to to figure out where it determines what the default is to figure out why it isn't set | 23:07 |
JayF | That is one way you can end up with weird stuff booted that many folks aren't aware of (but unlikely in your case I think) | 23:07 |
clarkb | there is a BOOTAA64.CSV | 23:07 |
clarkb | JayF: but in this case grubby knows the old kernel is the default so I don't think it is anything weird beyond the system has decided not to update what the default kernel is | 23:08 |
clarkb | and my frustration is that we've made that determination of that default extremely obtuse compared to grub saying this entry is default and it is clear how to change and how it is determined | 23:08 |
JayF | is grubby even involved? | 23:09 |
ianw | i'm feeling like you might need to use "grubby" -- "grubby --default-kernel" i think should show, and we might need "grubby --set-default" | 23:09 |
JayF | if boot.csv points to a kernel; it's likely being loaded as an efi stub | 23:09 |
JayF | at least AIUI | 23:09 |
JayF | (I'm operating on limited context) | 23:09 |
clarkb | JayF: grubby says old kernel is default and old kernel is what boots. It implies grubby at least knows what is going on | 23:09 |
clarkb | JayF: we are upgrading the kernel via dnf on centos 9 stream aarch64 and rebooting with the theory we reboot onto the new kernel but then we actually boot on old kernel | 23:10 |
clarkb | grubby seems to accurately report the old kernel is the default | 23:10 |
clarkb | I've changed that manually with grubby to see if I can observe a change in system state I can work off of but I haven't seen where that happens yet | 23:10 |
JayF | how was the kernel you're upgrading from installed? | 23:10 |
JayF | just in the image, or did we do something manually to get it in there | 23:10 |
clarkb | its whatever dib does to install the packages using dnf | 23:11 |
ianw | it's been so long, i think grubby changes things that then grub2-mkconfig writes out? i don't know ... i page this stuff in occasionally for a bug then it falls right back out | 23:11 |
clarkb | and really I think thats why I'm grumpy. This was straightforward and apparent before we decided we needed a new utility to manage this with obtuse config files that don't even tell you what the default is without running a command | 23:12 |
clarkb | its a model shift that has been happening across the linux ecosystem and sometimes I just want the dead simple file says foo is default so foo is default | 23:13 |
clarkb | and I don't even necessarily think its all a bad thing except that because its all new we end up with poor documentation and stuff that is impossible to debug | 23:14 |
clarkb | by the time we sort that out we'll be onto the new shiny and rinse and repeat | 23:14 |
JayF | clarkb: https://github.com/openstack/diskimage-builder/blob/master/diskimage_builder/elements/centos/post-install.d/03-reset-bls-entries#L24 I don't know much about centos, but this smells bad to me | 23:14 |
JayF | e.g. maybe the entries we add/edit are not getting changed by grubby? | 23:14 |
clarkb | JayF: I think we actually use this element https://github.com/openstack/diskimage-builder/tree/master/diskimage_builder/elements/centos-minimal | 23:15 |
clarkb | and then something else must do the grub/bls config? | 23:15 |
clarkb | ok my latest patchset seems to maybe be working | 23:16 |
clarkb | I'll have to recheck a couple of times but setting "GRUB_ENABLE_BLSCFG=true" >> /etc/default/grub may have been the ticket | 23:17 |
JayF | whoa, yum-minimal is basically debootstrap for "enterprise linux" | 23:17 |
JayF | that's neat | 23:17 |
clarkb | if we operate on that assumption then the issue is likely that when booting efi you really want to explicitly enable blsconfig otherwise grub doesn't do enough work on kernel updates to update the kernel | 23:18 |
clarkb | however, one wonders why there is a world where this isn't the default | 23:18 |
clarkb | but thats not my problem to solve as I can work with a workaround like this. Now to try and debug the dnf can't find my mirror problems | 23:18 |
JayF | some days it's "why?" and some days it's "WHY!?!?" :) | 23:18 |
clarkb | on the held node for broken dnf install (not the kernel decision problem) I tried installing kernel-devel manually (the package that failed) and it succeeded | 23:23 |
clarkb | so its not a consistent error even in these cloud regions | 23:23 |
clarkb | now to figure out how to get dnf to forcefully go back through the process of installing this package... | 23:23 |
clarkb | `while dnf reinstall -y -v kernel-devel ; do echo next ; done` is running | 23:26 |
clarkb | what if the problem is a race post reboot and trying to dnf install something? | 23:27 |
clarkb | that would be weird because we'd already be ssh'd back in so networking should be up | 23:27 |
ianw | hrm ... does that still have the package dumping happening that i put in? | 23:38 |
ianw | i feel like that would update the dnf metadata, which would i guess mean that networking was minimally working? | 23:38 |
clarkb | ianw: you added it to system-config and I've been working in ozj because its a smaller set of jobs. I guess I could add it in | 23:40 |
clarkb | but also I think what you added is all pre reboot not post | 23:40 |
ianw | ahh could be right; was just trying to think. i mean i guess as you say networking is up because we ssh'd in ... _maybe_ name resolution is not though? what else could it be ... | 23:42 |
clarkb | hrm updating blsconfig grub stuff is not the fix for booting the correct kernel | 23:42 |
clarkb | it failed on the second pass. What is interesting about that is it means the correct kernel is chosen sometimes | 23:42 |
clarkb | but not every time | 23:42 |
clarkb | ianw: oh ya name resolution could still be waiting on unbound to come up that is a very good point | 23:43 |
clarkb | unfortunately I'm back to the drawing board on how to make arm happy. I could explicitly set the default kernel with grubby but you need to feed it a specific version and that changes over time so I have to figure out how to glean that from the package installs | 23:43 |
clarkb | (doing something similar to what you added earlier I think) | 23:43 |
clarkb | but also that seems super hacky as the distro should just handle that for us | 23:44 |
clarkb | list /boot | grep vmlinuz | sort by update time | 23:46 |
JayF | Have we attempted to contact upstream about this? We should have no shortage of contacts who are CentOS-adjacent | 23:54 |
JayF | and I agree with you it sounds like it's a distro bug | 23:54 |
ianw | is it -> https://bugzilla.redhat.com/show_bug.cgi?id=2032680 ? | 23:55 |
ianw | that doesn't explain why it sometimes works and not. that was the issue linked in dib that the install wouldn't touch bls entries with the wrong machine-id. maybe it's something about setting the default kernel in the kernel install scripts that isn't matching the machine id? | 23:56 |
* ianw is clutching at proverbial straws | 23:56 | |
clarkb | ianw: that actually looks like it yes | 23:57 |
clarkb | the machine id entries in bls are different for my held node | 23:57 |
JayF | this is reverse | 23:57 |
JayF | you *need the suspicious looking code in centos element* | 23:57 |
JayF | rather than it being at fault in this case | 23:58 |
JayF | this code https://github.com/openstack/diskimage-builder/blob/master/diskimage_builder/elements/centos/post-install.d/03-reset-bls-entries#L24 | 23:58 |
JayF | I think you tied it all together ian :) | 23:58 |
clarkb | JayF: I think in that case you're still using the wrong machine id because it is the machine id of the builder not what you boot on later | 23:59 |
clarkb | honestly using machine id here seems like a flaw that doesn't account for virtual machines or even moving a disk from one laptop to another | 23:59 |
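
Editor's note on the hostkey/SSHFP comparison clarkb mentions at 00:05: his utility is not shown in the log, so the following is only a minimal sketch of the idea, assuming the usual formats (OpenSSH prints `SHA256:` followed by unpadded base64, while an SSHFP type 2 record carries the same SHA-256 digest as hex). The function names are hypothetical, and the host key blob is synthetic.

```python
import base64
import hashlib


def ssh_fingerprint_to_hex(fingerprint: str) -> str:
    """Convert an OpenSSH-style 'SHA256:<base64>' fingerprint to lowercase hex."""
    b64 = fingerprint.strip()
    if b64.upper().startswith("SHA256:"):
        b64 = b64[len("SHA256:"):]
    # OpenSSH strips the trailing '=' padding; restore it before decoding.
    b64 += "=" * (-len(b64) % 4)
    return base64.b64decode(b64).hex()


def sshfp_matches(fingerprint: str, sshfp_hex: str) -> bool:
    """Return True if the ssh fingerprint and the SSHFP hex digest are equal."""
    # SSHFP output may be upper case and split by whitespace.
    wanted = "".join(sshfp_hex.split()).lower()
    return ssh_fingerprint_to_hex(fingerprint) == wanted


# Synthetic stand-in for a real host key blob (assumption, for illustration).
digest = hashlib.sha256(b"example host key blob").digest()
ssh_style = "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
sshfp_style = digest.hex().upper()
print(sshfp_matches(ssh_style, sshfp_style))
```

In practice the two inputs would be pasted from the ssh client output and from a DNS lookup of the SSHFP record. As fungi notes, the OpenSSH client can also do this check itself.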
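
Editor's note on clarkb's 23:46 idea ("list /boot | grep vmlinuz | sort by update time"): a rough Python sketch of that selection step, under stated assumptions. The function name is hypothetical, rescue images are assumed to be named like `vmlinuz-0-rescue-<machine-id>`, and modification time is only a proxy for "most recently installed", not a real version comparison.

```python
from pathlib import Path
from typing import Optional


def newest_kernel(boot_dir: str = "/boot") -> Optional[Path]:
    """Return the most recently modified vmlinuz-* file, or None if there is none."""
    kernels = [
        p
        for p in Path(boot_dir).glob("vmlinuz-*")
        # Assumption: rescue images carry '-rescue-' in the name and are skipped.
        if p.is_file() and "-rescue-" not in p.name
    ]
    if not kernels:
        return None
    # Newest by mtime, mirroring the "sort by update time" idea from the log.
    return max(kernels, key=lambda p: p.stat().st_mtime)


if __name__ == "__main__":
    print(newest_kernel())
```

The resulting path is the kind of value that could then be handed to the `grubby --set-default` invocation ianw mentions. Whether that is the right fix is left open in the log, since the discussion ends up pointing at the machine-id mismatch in the BLS entries instead.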