Thursday, 2024-08-29

fungii'm reenqueuing 927214,1 into the deploy pipeline to see if that rax iad api error is transient00:05
fungisame error again00:37
fungihttps://rackspace.service-now.com/system_status doesn't indicate any known problem00:38
fungibuild history says the job's been working in daily periodic00:45
fungithe same keypair task succeeded for rax dfw and ord but failed on iad, so this must be temporary00:48
fungimaybe when the periodic run happens in a couple of hours i'll still be up, otherwise i can reenqueue it again tomorrow when i get a free moment00:49
Clark[m]You can also probably remove that region temporarily?01:35
opendevreviewtzing proposed opendev/system-config master: Update openEuler mirror repo  https://review.opendev.org/c/opendev/system-config/+/92746206:33
opendevreviewtzing proposed openstack/diskimage-builder master: Upgrade openEuler to 22.03 LTS  https://review.opendev.org/c/openstack/diskimage-builder/+/92746606:41
fricklerthe keypair api was still broken for the periodic job, too, but it is working when I'm checking it manually from bridge now, so I'll try to reenqueue once again06:51
fricklerlikely not related, but checking grafana it looks like we have had about 70 instances stuck deleting in rax for a week https://grafana.opendev.org/d/a8667d6647/nodepool3a-rackspace?orgId=1&from=now-30d&to=now06:54
frickleralso ubuntu-ports mirroring is broken, might be related to arm slowness issues. The lock file '/afs/.openstack.org/mirror/ubuntu-ports/db/lockfile' already exists.06:59
fricklernot sure if I'll get to any of this today, though06:59
opendevreviewtzing proposed openstack/diskimage-builder master: Upgrade openEuler to 24.03 LTS  https://review.opendev.org/c/openstack/diskimage-builder/+/92746608:35
fricklernext failure, this time in infra-prod-base, but transient: E: Could not get lock /var/lib/dpkg/lock-frontend. It is held by process 15213 (unattended-upgr)09:24
fricklerdoing another enqueue ...09:24
frickleroh, seems interest in openeuler has returned. curious to see for how long it will last this time09:26
opendevreviewThomas Bachman proposed openstack/project-config master: Re-introduce gate jobs for networking-cisco  https://review.opendev.org/c/openstack/project-config/+/92749310:38
opendevreviewJens Harbott proposed openstack/project-config master: Pin publish-openstack-releasenotes-python3 to jammy  https://review.opendev.org/c/openstack/project-config/+/92749511:43
fricklerfungi: elodilles: ^^11:43
fungithanks11:43
elodillesthanks, too o/11:50
opendevreviewMerged openstack/project-config master: Pin publish-openstack-releasenotes-python3 to jammy  https://review.opendev.org/c/openstack/project-config/+/92749512:03
fricklerfungi: latest invocation of infra-prod-run-cloud-launcher seems to have passed12:10
fricklerconfirming keypairs are in place for SJC3(flex)12:12
fungiyep, i saw you mention rerunning it earlier, thanks!12:14
fungionce i'm in a stable place for a bit, and assuming nothing more urgent comes up, i'll try launching a mirror server there12:16
cardoeIf ya need any help on the flex bits or the existing rax bits, you can ping me.12:51
fricklercardoe: nice, thx, we'll certainly come back to that ... like possibly with the stuck instances I mentioned earlier12:53
cardoeYou know the tenant / project? I’ll poke it when I’m at my desk.12:56
fungicardoe: we were getting a lengthy (at least several hours) 503 "service unavailable" error back from the iad keystone endpoint over night, at least 23:38-02:35 utc but could have been longer. the rackspace status page didn't indicate any outage though13:00
fungiit's working fine now, just figured you'd want to know if it wasn't an unannounced maintenance or something13:01
frickleractually when I checked this morning, only the os-keypair API seemed affected, other things like "flavor list" did work, which seemed very confusing to me13:04
cardoeYeah that's weird.13:04
fricklerthe issue was gone when I tried to debug in more detail later13:04
fungioh, interesting, so it was just os-keypair. agreed that's extra strange13:05
fricklercardoe: project_id 637776, example instance 97bf1d90-7137-420e-9156-95224ff72945, stuck in deleting for a week now13:07
cardoepoking13:20
fungicardoe: if you're interested, i tried to boot our first instance in flex sjc3, dc5d05c7-8bdb-4b68-a1ee-eac44d6312f2 (project name 610275_Flex) errored with "Exceeded max scheduling attempts 3... Last exception: HTTP HTTPInternalServerError" as the detail in server show13:41
fungisimilar info returned from the sdk during the server create:13:43
fungiopenstack.exceptions.SDKException: Error in creating the server. Compute service reports fault: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance dc5d05c7-8bdb-4b68-a1ee-eac44d6312f2. Last exception: HTTP HTTPInternalServerError13:43
fungii'll leave that one for now and try booting a second one just to see if it persists13:49
fungimy second try succeeded, so it must have been an intermittent failure13:51
fungiinfra-root: new development in raxflex testing... the Ubuntu-24.04 image they have lacks ifupdown (i think it must be all netplan/networkmanager and maybe systemd-networkd?). does that sound familiar? is there a preference for falling back on jammy for now vs trying with an official noble image downloaded from ubuntu.com?14:37
fungior is there an outstanding change for our launch script to handle noble?14:38
fricklerfungi: interesting, best refer to how tonyb did this, iiuc he did create mirror nodes running noble already?14:42
fricklermy noble vm running the default cloud image doesn't have ifupdown either14:44
Clark[m]Maybe login to the vexxhost nodes and see what they have?14:48
Clark[m]I wasn't aware of any patches to launch node to make that work14:48
fricklerdpkg-query: package 'ifupdown' is not installed14:53
frickleron mirror02.sjc3.vexxhost14:54
fungiit's possible i misread where the actual error was on launch14:55
fungiin openeuler mirror news, the reason for the failures/stale lock is because the volume is full14:55
fungiit's got a 350g quota right now, should i try increasing that to 400g?14:56
fungialso i've commented on https://review.opendev.org/927462 trying to get a handle on what the addition of another release means for storage requirements14:56
fungiokay, yeah it's actually make_swap.sh that the launch failed on14:57
fungiswapon: /dev/sdc: read swap header failed14:58
fungimount: /mnt: unknown filesystem type 'swap'.14:58
frickleris that the ephemeral volume or an extra one? the one on vexxhost doesn't have any extra disks, so this may be an untested scenario15:01
fungioh, interesting, it also formatted a /swapfile so i think what's happening is that there's a preexisting entry for /dev/sdc (probably the ephemeral disk?) as a swap device15:01
fungii'll have to rerun with --keep and inspect the fstab just to be sure15:01
fricklerseems the script only knows about /dev/vdb and /dev/xvde, so we'll need to adapt it15:02
fungioh, good point, i could give that a try too. right now struggling with how to get ssh'd into the held node15:09
fungis/held node/kept server/15:09
fungiaha, it may be added by cloud-init? found this entry in fstab:15:14
fungi/dev/sdc        none    swap    sw,comment=cloudconfig  0       015:14
fungibut yeah, i can try patching it to add /dev/sdc as a possibility and see what happens15:14
Clark[m]If you keep the node you should be able to use the temporary ssh key that launch node generates15:25
fungiyeah, i did, that's how i got the fstab15:29
fungii think the server is coming up with /dev/sdc already active as swap too, so i'm testing unswapping and deleting that line out of fstab in the script before it tries to configure swap15:30
clarkbwe could also just have two swap devices if that is simpler15:30
clarkbbut ya maybe we should do something like if `lsblk | grep SWAP`; then skip; fi15:31
fungiwell, the ephemeral disk is configured as swap but then we try mounting it15:31
clarkbhrm reading make_swap.sh we check if the device is mounted, if it is we unmount it then format it then remount the formatted device15:33
clarkbhow is it getting to the point where it mounts something that hasn't been reformatted?15:33
clarkband all of that is already protected by a check for "only do this if we have no swap"15:33
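The "only do this if we have no swap" guard described above can be sketched as follows. This is a minimal illustration, not the actual make_swap.sh logic; the function names are hypothetical, and it assumes the usual /proc/swaps layout of one header line followed by one line per active swap area.

```python
def has_active_swap(proc_swaps_text: str) -> bool:
    """Return True if the text of /proc/swaps lists at least one swap area."""
    lines = [line for line in proc_swaps_text.splitlines() if line.strip()]
    # The first non-empty line is the column header (Filename, Type, ...),
    # so anything beyond it is an active swap device or file.
    return len(lines) > 1


def plan_swap_setup(proc_swaps_text: str) -> str:
    """Hypothetical helper: decide whether swap setup should run at all."""
    if has_active_swap(proc_swaps_text):
        return "skip"
    return "configure"
```

On a real host the input would come from reading /proc/swaps. Note that a device which is formatted as swap but not yet activated does not show up there, so this guard alone would not catch that case.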
clarkbre openeuler, maybe the thing that makes the most sense is to delete what we've already got and start fresh?15:36
clarkbI'm not sure there was much if any utilization of the old release15:36
fungiyeah, i'm not sure. i've instrumented some additional debugging around that part to find out what exactly is happening15:37
clarkbis it possible that the swap device is one that we created so ended up in that code15:37
clarkbthen it's the reformatting we got wrong preventing it from being mounted?15:38
fungioh, there's also already a /dev/sdb mounted as swap15:42
fungier, not mounted, but formatted15:43
fungiblkid reports this:15:43
fungi/dev/sdb: UUID="db3566d3-28f5-4d1b-86f7-d0edc30f2b14" TYPE="swap"15:43
clarkbmight also help to get a before picture too15:43
fungithat is before15:43
fungialso this:15:44
fungi/dev/sdc: LABEL="ephemeral0" UUID="f8bf32ef-29c0-4eb5-99b0-8b28e6527add" BLOCK_SIZE="4096" TYPE="ext4"15:44
clarkbif there is already a swap device we should completely skip the code in make_swap.sh15:44
fungibut then this is what's in the fstab:15:44
fungi/dev/sdb     /mnt    auto    defaults,nofail,x-systemd.requires=cloud-init.service,_netdev,comment=cloudconfig       0       215:44
clarkboh!15:44
clarkbso the disk layout for the image is buggy15:45
clarkbin that case maybe uploading the upstream image does make sense15:45
fungioh, er15:45
fungialso this is in the original fstab:15:46
fungi/dev/sdc     none    swap    sw,comment=cloudconfig  0       015:47
fungiso i think that means cloud-init detected sdb and sdc backwards?15:47
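The comparison being done by eye here (blkid output versus fstab entries) can be sketched as a small check. This is only an illustration under assumptions: the function names are hypothetical, the sample data is copied from the lines pasted in this log, and it only flags entries where fstab and blkid disagree about whether a device is swap.

```python
import re

# Sample data copied from the blkid and fstab lines quoted in this log.
BLKID_SAMPLE = (
    '/dev/sdb: UUID="db3566d3-28f5-4d1b-86f7-d0edc30f2b14" TYPE="swap"\n'
    '/dev/sdc: LABEL="ephemeral0" UUID="f8bf32ef-29c0-4eb5-99b0-8b28e6527add" '
    'BLOCK_SIZE="4096" TYPE="ext4"'
)
FSTAB_SAMPLE = (
    "/dev/sdb /mnt auto defaults,nofail,x-systemd.requires=cloud-init.service,"
    "_netdev,comment=cloudconfig 0 2\n"
    "/dev/sdc none swap sw,comment=cloudconfig 0 0"
)


def parse_blkid(output: str) -> dict:
    """Map device path -> TYPE value from blkid-style output lines."""
    types = {}
    for line in output.splitlines():
        device, sep, rest = line.partition(":")
        match = re.search(r'\bTYPE="([^"]*)"', rest)
        if sep and match:
            types[device.strip()] = match.group(1)
    return types


def find_swap_mismatches(fstab: str, blkid_types: dict) -> list:
    """Return (device, fstab_type, actual_type) where the two disagree on swap."""
    mismatches = []
    for line in fstab.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue
        device, fs_type = fields[0], fields[2]
        actual = blkid_types.get(device)
        if actual is None:
            continue  # device not reported by blkid, nothing to compare
        if (fs_type == "swap") != (actual == "swap"):
            mismatches.append((device, fs_type, actual))
    return mismatches
```

Calling `find_swap_mismatches(FSTAB_SAMPLE, parse_blkid(BLKID_SAMPLE))` reports both devices, which matches the "backwards" observation above.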
clarkbya seems like something got them mixed up (could be udev changing the order too I think)15:48
mordredstupid cloud-init15:48
corvusmordred with the hot takes :)15:48
clarkbI think the next thing to try would be to upload an upstream image to see if it behaves any better15:48
mordredI'm useful15:48
fungiassuming comment=cloudconfig is an indicator the entry was added by cloud-init anyway15:48
fungiyeah, i can try uploading a known official ubuntu image later today maybe, we'll see how far i get with it15:49
clarkbya but maybe their image is a snapshot of an image that was booted and added that stuff prior to our boot 15:50
clarkbthen when we boot off the snapshot udev runs against new devices and gets the names mixed up15:50
clarkbwould need to look at cloud-init logs to see where it is doing that15:50
fricklerfungi: what does the flavor say that you used? iiuc nova can create both swap + ephemeral if both are in the flavor15:51
fungiswap                       | 409615:52
fungiOS-FLV-EXT-DATA:ephemeral  | 6415:52
fungiso yes, it has both devices, but they're switched around in the fstab for some reason15:52
fungior maybe blkid is identifying them backwards15:53
clarkbfungi: I think this is the classic issue with using device paths in fstab. Order isn't guaranteed boot to boot15:53
clarkbyou should prefer uuids or labels instead15:53
fungiyep15:53
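The advice to prefer UUIDs or labels can be sketched as building fstab entries from what each device actually contains rather than from its /dev/sdX name. This is a hedged illustration, not how cloud-init generates its entries: `fstab_entries` is a hypothetical helper, the input dicts mirror the key/value pairs blkid printed in this log, and the mount options are simplified.

```python
def fstab_entries(devices: list) -> list:
    """Build fstab lines keyed on UUID/LABEL instead of device paths.

    devices: list of dicts with optional LABEL, UUID and TYPE keys,
    as reported by blkid for each block device.
    """
    entries = []
    for dev in devices:
        if dev.get("TYPE") == "swap" and "UUID" in dev:
            # Identify swap by its UUID so probe order no longer matters.
            entries.append("UUID=" + dev["UUID"] + "\tnone\tswap\tsw\t0\t0")
        elif dev.get("LABEL") == "ephemeral0":
            # The ephemeral disk carries a filesystem label, so mount by label.
            entries.append("LABEL=ephemeral0\t/mnt\tauto\tdefaults,nofail\t0\t2")
    return entries
```

Because neither entry mentions sdb or sdc, the result stays the same whichever order the kernel enumerates the two disks in.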
clarkbwhich is why I wonder if the fstab was written by cloud-init then the image was snapshotted then we boot that later and get a different order15:53
frickleroh, that isn't even shown in "flavor list" by default15:53
fungiyeah, it's possible the /etc/fstab is baked into the image and it's a custom doctored image supplied by the provider instead of an official distro image15:54
fricklermaybe that's also something cardoe can find out more about15:54
fricklerbut also another reason to try with a stock image15:55
fungitrying to upload one from https://cloud-images.ubuntu.com/noble/current/ right now15:57
fungiit's queued as ubuntu-noble-server-cloudimg-2024-08-2215:57
fungionce it's done importing glance-side i'll try booting with that instead15:57
clarkb(side note, we may need to try multiple reboots too to check that things are stable as we may encounter the same issue down the road with a reboot)15:58
fungiagreed15:58
fungiyeah, looks like that's getting past the swap setup issues15:59
fungii'll unwind my changes to the script and see if it "just works" without fixing15:59
clarkbthat == upstream image?16:00
fungiyes16:00
fungiit got as far as trying to start unbound and then brokw16:01
fungibroke16:01
fungiunbound.service: Start request repeated too quickly.16:02
fungiunbound.service: Failed with result 'exit-code'.16:02
clarkbI think that means it failed for some other reason previously and is trying to restart too quickly which systemd doesn't like16:03
clarkbmay need to keep the node and check journalctl logs16:03
fungithough now openstack server list is failing there... 502 Bad Gateway: nginx/1.27.016:04
clarkbmaybe we wrote out ipv6 configs for unbound when we shouldn't have or something16:04
fungicoming back from the nova api16:04
fungiand now it's back to working again16:04
fungiconfirmed that the official noble image doesn't have a problem with the original launch node script before i worked on trying to fix it16:06
clarkbfungi: oh so undoing your changes fixed the unbound issue?16:07
fungii don't know about that yet, just noting that they were unnecessary to get make_swap.sh to succeed once i switched to the official image16:09
fungiand no, unbound still isn't starting, but i told it to keep the instance this time so i can look closer16:10
clarkbgot it16:10
fungierror: unable to open /var/lib/unbound/root.key for reading: No such file or directory16:14
fungii guess we need to fix that in the launch script too16:14
frickleroh, I think I remember something. that could be in a separate pkg that moved from required to optional16:15
fungimaybe we have a race condition? dns-root-data was already installed, and stopping/starting unbound "just worked"16:16
fricklerhmm, the file exists on mirror02, but is not known to dpkg16:16
clarkbfungi: I thought there was a fix for that in system-config. We don't install unbound until we run ansible so that should cover it16:17
fungiyeah16:17
clarkbbut ya maybe it is an order of operations thing we install unbound before getting to that package?16:17
fricklerthe pkgs puts it into /usr/share/dns/root.key16:17
clarkbhttps://review.opendev.org/c/opendev/system-config/+/925311/2/playbooks/roles/base/unbound/tasks/main.yaml16:18
fungiyeah, it's an order of operations problem16:18
clarkbyes I think the order is the issue16:18
clarkbwe need to flip the task we added and the one above it16:18
fungiunbound is already enabled, and fails to start, then we install dns-root-data and we try to enable unbound which is already enabled and remains in a failed state16:19
fricklerExecStartPre=-/usr/libexec/unbound-helper root_trust_anchor_update16:19
fricklerso the service needs a restart after installing dns-root-data I'd think16:20
fungiright, unless we can install dns-root-data before unbound16:20
opendevreviewJeremy Stanley proposed opendev/system-config master: Install dns-root-data before unbound  https://review.opendev.org/c/opendev/system-config/+/92752616:22
clarkbthe dns-root-data package looks simple and should be able to write the data out without any interaction with anything else. Then when we install unbound I would expect it to work16:26
clarkbbut I didn't approve that change in case there was any other debugging you wanted to do first. Feel free to approve if not16:26
fricklerseems like tonyb hit the same error, but didn't test with a fresh install https://review.opendev.org/c/opendev/system-config/+/92531116:28
fricklerso that fix seems reasonable16:28
fungii don't need to check anything else i don't think, i'll go ahead and approve it so i can test again here in a bit and see if we have any other problems16:29
fungioh, frickler approved it. thanks all!16:30
cardoeapologies. been in video chats for hours on end.16:57
cardoefungi: Flex SJC3 should be fixed now.16:58
fungicardoe: thanks!16:58
fungii'll go ahead and delete the error node i was leaving there16:59
cardoeSo for some context, flex is cloudnull's17:00
cardoeIt used to be my org/team so that's why I said I'd wrangle.17:00
cardoeI'm doing the ironic-y side17:00
fungino problem, thanks for the assist!17:01
cardoeI know the image is custom and it's silly/annoying.17:01
fungino worries, we thought we'd try it first17:01
fungibut we're used to uploading our own images for stuff, so it's no biggie17:01
fungijust be aware there might be a mismatch between the swap/ephemeral devices and the baked-in fstab on the noble image there17:02
opendevreviewMerged opendev/system-config master: Install dns-root-data before unbound  https://review.opendev.org/c/opendev/system-config/+/92752617:36
cardoewell now I can finally peek.17:59
cardoefungi: the noble image on flex is straight Ubuntu... https://docs.rackspacecloud.com/openstack-glance-images/#ubuntu-2404-noble that's how it's sourced.18:06
cardoeThe image is custom on OSPC (the old stuff) and that's annoying.18:06
Clark[m]cardoe: is it edited in any way? It seemed like the fstab entries between swap and the ephemeral disk were getting mixed up. One theory I had is maybe y'all are snapshotting a slightly edited version that includes a cloud-init-written fstab that maybe gets it wrong when booted later18:13
Clark[m]I also theorized this mixup would occur in subsequent boots regardless and we'll need to test that18:13
cardoeI'm being told that it's not. But I'm gonna compare hashes.18:14
JayFThis issue was pointed out in Ironic regarding device reordering; it was cited as being seen on new RHELs but maybe worth consideration? https://review.opendev.org/c/openstack/ironic/+/927518 tl;dr: new async device initialization can cause direct-use of e.g. /dev/sda to fail (and reorder)18:20
cardoeIt's unmodified. It's old. It's from June but it's the same hash.18:20
JayFI'd be surprised if this is it, but just worth a thought at all18:21
cardoeSo I've reproduced your issue and I even grabbed the latest image from Canonical and reproduced it as well.18:41
fungica18:54
fungigah, sorry, typing on a phone18:54
fungicardoe: when uploading the current noble cloud image (url mentioned above) we didn't run into the issue, but at the moment i'm not in a position to investigate further18:56
cardoehrm. I literally did exactly what was in that doc and reproduced it.18:57
fungiinteresting19:00
fungii'll be doing some more test boots soon and can try to replicate both ways19:01
fungionce i'm at a computer for a few minutes19:01
clarkbya it could be a race19:07
clarkband if so the solution is likely to have cloud-init write out fstab using labels or uuids19:07
clarkbor maybe there is something that can be done on the kvm/qemu side to make the devices attach more consistently though I think that could always still be potentially problematic19:08
fungicardoe: `sha512sum noble-server-cloudimg-amd64.img` claims "06a897542d73dbfd40ad5131ba108a21aed84b027c71451e4fd756681802dd62529f32867c4f38c148cb068076eeba334b7e8880336c7f1a829cd57fa3e9850b" which is also what `openstack image show ubuntu-noble-server-cloudimg-2024-08-22` claims while `openstack image show Ubuntu-24.04` says it's "19:14
fungier, says its hash is "06a897542d73dbfd40ad5131ba108a21aed84b027c71451e4fd756681802dd62529f32867c4f38c148cb068076eeba334b7e8880336c7f1a829cd57fa3e9850b"19:15
fungiso both match, yes19:15
fungii'll retry with your latest Ubuntu-24.04 now19:15
fungicardoe: sorry, no, i pasted from the wrong output, the Ubuntu-24.04 hash claimed by glance is actually "a8c6a2adfcf440f5d8f849afc1e84790558fbf163abe9bb925b05f6e861d5c90dca3c3652a54d852821edfad1d0d2bc7749e0661f5055209ca063c1e99421bb3" instead19:16
fungiso different image. which one did you download?19:17
fungithe image i was having success with is https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img specifically19:17
Clark[m]The hash reported by glance is never the same as the one you upload aiui (which is crazy but ya) unfortunately 19:22
fungioddly, it does match in the one i uploaded19:24
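The hash comparison done here with `sha512sum` and `openstack image show` can be sketched in Python. This assumes the value glance reports is a SHA-512 hex digest, as the 128-character strings pasted above suggest; the function names are hypothetical, and the reported hash is passed in as a string rather than fetched from any API.

```python
import hashlib


def sha512_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-512 hex digest of a file, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte images.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reported_hash(path: str, reported_hex: str) -> bool:
    """Compare a local image file against a hash string copied from glance."""
    return sha512_of_file(path) == reported_hex.strip().lower()
```

A mismatch only shows that the bytes differ. As the discussion that follows suggests, two uploads of the same file can still behave differently if their image properties differ.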
fungihttps://paste.opendev.org/show/bXaIezBstl4YLazqm1j9/19:27
fungithough maybe the problem isn't a difference in the image, but a difference in the image properties which is changing nova's behavior19:28
fungii didn't set any properties at all when uploading to glance19:28
fungiyeah, i'm still consistently failing on "mount: /mnt: unknown filesystem type 'swap'." when launching with Ubuntu-24.04 while the ubuntu-noble-server-cloudimg-2024-08-22 i uploaded is working fine19:31
Clark[m]Have you tried rebooting the one you uploaded multiple times on a single instance or just new boots via new instances?19:35
funginot yet19:37
fungistruggling with car chargers, netbook battery is about to give out19:38
fungiand cellular wifi is being finicky too19:38
cardoeSo I set all those properties and it was a baddie.19:46
cardoeThe deleting stuff on the old rax are cleaned up now.19:48
fungijust confirmed for the image i uploaded (no properties), clout-init correctly found and used /dev/vdb as ephemeral and /dev/vdc as swap19:49
fungier, cloud-init19:49
fungiso virtual disk driver instead of scsi disk driver19:50
fungiguess that's the hw_scsi_model='virtio-scsi' property at work for the latter19:51
fungiand/or hw_disk_bus='scsi'19:52
fungirepeatedly rebooting the instance created from my propertyless image upload, several times now, comes up with vdb as ephemeral and vdc as swap every time so far20:02
JayFfungi: this is sounding really similar to the issue TheJulia documented in Ironic that I linked above ^^20:03
fungii can try reuploading and setting the same scsi properties i see on the public Ubuntu-24.04 that's been exhibiting the problem, maybe that will narrow it down20:04
JayFespecially since vdx works and sdx breaks20:04
fungiyeah, that's what i'm surmising20:04
fungicardoe: JayF: yep, bug reproduced when i boot from the exact same image file uploaded with --property hw_disk_bus='scsi' --property hw_scsi_model='virtio-scsi'20:13
JayFWhat is it, they change ordering from some device every 5 years in the kernel? You could set a very slow annoying clock by it20:14
funginot sure which of those two properties is sufficient to reproduce the problem, or whether both are required together20:15
fungii guess for now, from opendev's perspective, it's a question of whether infra-root sysadmins would prefer a mirror server that's using virtio or an older ubuntu jammy instead of noble20:18
JayFthere are absolutely ways to fix this, I suspect even in a cloud-init script20:18
JayFe.g. sub out device names for LABEL= in a first boot script20:18
JayFas long as cloud-init stuff is setup to find the right device and not *also* hardcoded to sda20:19
cardoeWell so for you to use virtio-scsi you gotta be using the scsi disk bus20:21
cardoeSo I suspect that disabling either effectively turns them both off.20:21
fungii suspected as much, which is why i simply set both20:22
Clark[m]I suspect that we don't need whatever performance improvement that might bring for the mirror node20:23
Clark[m]Its bottleneck is almost certainly going to be network and not disk related20:23
fungiJayF: well, i think it's being detected the wrong way around *by* cloud-init20:23
JayFthey're going to get a flood of bug reports then and will have to find a way to solve that20:24
fungiClark[m]: also i'm going to attach a cinder volume regardless because neither the rootfs nor the ephemeral disk is large enough to hold all our usual caches20:24
fungiso yeah, we'll hardly be touching these devices either way20:24
fungiand sorry for slow replies. i switched to my backup netbook which had a full charge but teensy little keyboard20:25
cardoeI was trying to see if there was an existing report against cloud-init20:26
cardoeCould it be how nova creates the entry in libvirt is causing it to hint sda?20:26
cardoeI'm not crazy familiar with the nova side but I do know the libvirt side.20:26
fungiwell, sda is the rootfs in this case, but yeah b and c for ephemeral and swap are being detected in reverse20:27
cardoeI meant sdb/sdc sorry.20:27
cardoeBasically in libvirt you'd end up having <target dev='sdb' bus='scsi'/>20:28
fungii do happen to know some nova people, unsurprisingly. we can certainly ask around ;)20:28
cardoeBut you don't have to specify those and instead can just say where on the bus it's hanging from.20:28
cardoeAnd you can even say "gimme the next one"20:28
JayFcardoe: AIUI, those are even reordering, the pci path stuff20:29
JayFcardoe: at least according to the doc julia wrote up20:29
fungiso it depends on the guest kernel vintage i guess?20:29
cardoeKevin says he reproduced it in 22.0420:30
cardoeWhich shouldn't have async device initialization20:30
fungioh, good, in that case i can cancel my boot attempt i was testing that exact question with20:31
fungithough yeah it already failed on the same problem before i could cancel20:32
fungiso, double-confirmed i guess20:32
cardoeSo removed that property on all the images.22:50
cardoeissue on Flex should go away.22:50
fungithanks cardoe! easily worked around, but i'm sure it was going to trip up other users who might not be in as good a position to figure it out23:53
