Tuesday, 2024-12-10

clarkbfrickler: good idea I'll add that00:08
clarkbfungi: I appreciated the oregon trail joke00:09
clarkbok my last agenda updates are in. I'll send that out in like 15 minutes if there are any other edits to add00:16
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729000:35
clarkbany idea why we need the -i to point to the key there? I didn't think it was necessary but I got permission denied public key on the most recent run00:35
clarkbthen I saw an ssh command later does supply -i. Its a default ed25519 keyname in /root/.ssh/ so it should be autodetected I would've thought00:35
clarkbagenda is on its way out00:37
clarkbhrm maybe $HOME isn't properly set so ssh doesn't know how to find the .ssh dir?00:42
clarkbin any case I'm hopeful that being explicit about the key to use is sufficient00:42
fungiyeah, wrong $HOME would have been my guess too, i normally only add -i when working with non-default key names/paths00:57
Clark[m]It passed finally01:46
Clark[m]Tomorrow after meetings I'll have to try and hold a 3.10 test node to retest the openid fixup01:48
Clark[m]And look at reenabling the upgrade test01:48
*** diablo_rojo_phone is now known as Guest263908:38
PetraeaMorning folks. I found a small bug in diskimage-builder where grub2-install won't proceed in EL9 when it can't find a secureboot path - turns out you can just --force to get past this. Has this been covered before by someone else? Am I just reporting a dupe?08:44
opendevreviewGwen Dawes proposed openstack/diskimage-builder master: Force grub2-install to bypass secureboot complaints.  https://review.opendev.org/c/openstack/diskimage-builder/+/93744209:11
PetraeaFound the bug already reported, so I referenced it. Hope that's OK, I've tried to follow as many of the style guides as I can!09:13
*** ralonsoh_ is now known as ralonsoh09:21
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems  https://review.opendev.org/c/openstack/diskimage-builder/+/93433212:41
fricklerdid anyone else note that pypi.org seems to be protected by fastly now? at least when browsing. how long until they also introduce rate limits for wheel downloads? :-/13:00
fungiPetraea: also, #openstack-dib is a dedicated irc channel for diskimage-builder discussions14:33
fungimight be easier to reach the right people in there (though a lot of them hang out in here too of course)14:34
Petraeafungi: Oh, cool, thanks. I'll go bug people there too, in short order.14:35
fungifrickler: yeah, it came up wrt to having broken some tools that had frontends for pypi's search feature too: https://discuss.python.org/t/fastly-interfering-with-pypi-search/7359714:35
fricklerhmm, was gerrit user history trimmed somehow? I see "Joined: 16.12.2015" for a lot of accounts that I think should be older, like corvus and fungi and myself (though the latter only by a bit I think)15:15
clarkbI believe that was due to some database edits 9 years ago.15:16
clarkbmaybe to change the openid provider?15:16
fungifrickler: long, long ago in a database far, far away, gerrit set their timestamps to reset on any change to their corresponding rows. during one database migration where we didn't explicitly force the timestamp fields to their existing values, they updated to the date of the maintenance and we didn't notice it for weeks15:16
clarkbI don't recall what exactly but yes15:16
clarkbah fungi recalls better than I do15:17
fricklerah, ok, strange that I seem to never have noticed this before. but almost exactly 9y ago would match15:18
opendevreviewGwen Dawes proposed openstack/diskimage-builder master: Force grub2-install to bypass secureboot complaints.  https://review.opendev.org/c/openstack/diskimage-builder/+/93744215:56
opendevreviewClark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test the 3.10 openid update  https://review.opendev.org/c/opendev/system-config/+/89357116:15
clarkbI'm going to put a new autohold on ^ to test the openid update16:15
clarkboh huh gerrit 3.11 actually recommends java 21. But I think we sort that out as a followon to the current stack of changes since java 17 will keep working fine16:19
opendevreviewClark Boylan proposed opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11  https://review.opendev.org/c/opendev/system-config/+/93746516:21
opendevreviewJeremy Stanley proposed openstack/diskimage-builder master: Temporarily disable OpenEuler functional testing  https://review.opendev.org/c/openstack/diskimage-builder/+/93746616:22
clarkbI suspect the switch to java 21 might be easier after we upgrade to 3.11 then we can update our base image to use java 21 instead of 17 for both 3.11 and 3.12 at the same time rather than trying to sort it out between 3.10 and 3.1116:30
opendevreviewClark Boylan proposed opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11  https://review.opendev.org/c/opendev/system-config/+/93746517:13
cardoeclarkb: what's the label issue with raxflex?17:33
clarkbcardoe: the kolla-ansible jobs repartition and reformat the raxflex node ephemeral drive then attempt to mount one of the partitions referring to it by label and then mount says it cannot find that label17:47
clarkbcardoe: it's possible that there is a driver issue with reloading that data into the kernel or maybe a sync issue?17:48
klamath@clarkb can you please show me the output from blkid17:49
clarkbklamath: I don't have a node handy for that. frickler might? Otherwise we can probably boot one out of band or something17:49
cardoeThat's who can fix it. Couldn't remember his IRC handle.17:49
clarkbthis is the downside to having a CI system recycle nodes aggressively17:49
klamaththere shouldnt be anything preventing the block device from getting a label17:54
clarkbklamath: the repartitioning to set the label succeeds17:55
clarkbbut when you run mount based on the label mount says that label is not present anywhere and that fails17:56
klamaththe blkid will be helpful in trying to understand what the OS is seeing17:56
clarkbklamath: do you have any involvement in the older xen based cloud stuff? we've found that older xen compared to hypervisors within the same cloud region appear to boot ubuntu noble with one vcpu when it should have 8 (and does have 8 when newer xen is used)17:58
clarkbfrickler: maybe you can collect the blkid info for klamath since I think you're plugged into kolla17:58
klamathxen is bad for my health, i dont touch that stuff17:58
fricklersince we have the workaround now in kolla, I didn't continue trying to get a held node with the issue18:06
clarkbwe can just boot one directly, frickler did the operating system matter or was this seen on a variety of OSs?18:07
clarkbklamath: the workaround was to use the device path to mount rather than a label.18:08
klamathYeah, I saw the PR, i was interested in verifying the issue before spinning people up18:08
clarkbya I can boot a node directly and reproduce outside of ansible and capture blkid info18:09
clarkbbut I'm in meetings for a bit so won't be until this afternoon18:09
fricklerclarkb: I think it was seen on all oses kolla uses (d/r/u)18:12
clarkbthanks for confirming18:13
clarkbok all of the gerrit image and testing update changes pass. Now I need to test openid logins on the rebuilt gerrit 3.10 then we can merge the stack18:21
clarkband maybe plan for that tomorrow which should be ~long enough to not expect a revert and also have time to restart gerrit on the updated 3.10 version with openid updates18:21
clarkber I'll do my best to test today so we can land stuff tomorrow18:22
fungiand backup purge later in the week?18:25
clarkbfungi: I think we can probably still do that today or tomorrow?18:27
clarkbI don't expect that to take long (particularly since no docker rate limits should be involved)18:27
fungisure, i probably have time after the next meeting, though may need to do it after dinner18:28
clarkbya after the meeting i have lunch to consume. But after that I can test openid logins with newer gerrit 3.10, boot a test node in raxflex to check mount by label, and also purge retired backups18:38
clarkblooks like we need to pause openeuler builds in nodepool too cc fungi 18:59
clarkbhttps://nb01.opendev.org/openEuler-22-03-LTS-88110d3fbc1b473f9006054260b897ea.log is a random failure I pulled up18:59
mnasiadkaclarkb: we switched to not use the label, it works with raxflex, it was convenient but whatever ;)18:59
clarkbmnasiadka: right but now rax is asking us to help debug18:59
clarkbso I can boot a node directly then repartition and format and mount and see what happens then capture blkid19:01
clarkbor you could too but it's a bit more involved for you to do it and wasteful as you aren't guaranteed a boot in the right cloud provider19:02
clarkbmnasiadka: frickler: if you can point at the code that was doing the formatting and mount by label that could be helpful so that I partition in a similar manner19:04
fricklerclarkb: https://review.opendev.org/c/openstack/kolla/+/937345 19:05
clarkbthanks19:05
*** atmark_ is now known as atmark19:37
clarkbfungi: ok I'm going to eat lunch and then dive into openid redirect testing. When you're back from food too I think we can go ahead and approve the backup purge change if it looks good to you19:57
clarkband once openid testing is done I'm going to look into the raxflex mount by label issue19:58
fungisgtm, yep. i'll also try to work on the nodepool openeuler image build pause change then too19:58
clarkbthanks!19:58
opendevreviewMerged zuul/zuul-jobs master: zuul_debug_info: Add /proc/cpuinfo output  https://review.opendev.org/c/zuul/zuul-jobs/+/93737620:07
fungiJayF: in positive news, i think we've finally reached steady-state with any lingering defunct subscribers on openstack-discuss having either been auto-disabled or manually disabled by me after looking through the uncaught bounce notifications. any after this point are probably accounts newly going dead or one-off bounces for things like dmarc policy violations20:11
JayFnice, thanks!20:11
fungithanks for bearing with me on that, and apologies for all the noise20:12
JayFmy role was "search and hit delete every now and then", you did the real work :D20:21
fungipopping out to grab eats, should be back on soon20:24
clarkbinfra-root held gerrit 3.10.3 with the openid fix is in place at 172.99.69.123 you'll have to override review.opendev.org in /etc/hosts to get a fully working login path there but it works for me.20:53
clarkbI think we're good to land the gerrit upgrade changes when people are comfortable with not rolling back and with a restart to pick up the openid change20:53
clarkbI did notice one weird thing which is that it redirects you to https://review.opendev.org/c/opendev/system-config/%2F/893571 when following change sign in links. However, I reproduced this with the current version of gerrit without my openid fix so this is unrelated and it successfully sends you to https://review.opendev.org/c/opendev/system-config/+/893571 from there which is the20:54
clarkbcanonical url20:54
clarkbI am booting clarkb-test-jammy in raxflex now20:57
clarkband I'm quickly reminded I need to remember how to use floating ips20:58
clarkbI have created a floating IP and attached it but the server does not respond to pings or ssh connections at that address. Could be any number of things but this generally works with nodepool so I'm guessing it isn't the image boot up sequence21:05
clarkbI can see glean logging that it brought up the ens3 interface in the console log21:06
fricklerdid you add a security group?21:06
clarkbnot explicitly so it should use the default group21:07
clarkbI can ping the private address from a nodepool booted test node in the cloud region so network is up21:07
clarkbya security group is set to default according to server show21:07
clarkband that matches the nodepool node I jumped over to21:08
clarkbthe floating IP also pings from that nodepool node but not externally21:09
clarkbthat does make it seem like a firewall/security group issue but it is using the default group just like the others so I'm confused21:10
clarkbI may try another floating IP21:10
clarkbsecond floating IP works21:13
clarkbklamath: 63.131.145.251 this floating IP did not work from the outside world when attached to clarkb-test-jammy21:14
clarkbworking through the kolla playbook the ephemeral device is identified via `/sbin/blkid -L ephemeral0` and running that on my test node returns /dev/vdb as expected.21:17
clarkb(so we are not using the wrong device)21:17
klamathI can pass along this float IP info to team. 21:19
clarkbfrickler: mnasiadka one thing I notice working through your playbook is you don't create a new partition you simply create an ext4 filesystem directly on /dev/vdb21:19
klamathis the blkid returning a disk label now?21:20
clarkbit wouldn't surprise me if this is the root of the issue since labels would be written to a partition table but we don't have one?21:20
clarkbklamath: yes the current label is ephemeral0. I haven't started on the steps to do what kolla was doing yet as I'm trying to map it from ansible to parted and mkfs21:20
klamathwith or without a partition a disk label should work21:21
clarkbklamath: but does mkfs -L foo set a disklabel or a filesystem label in a partition table if operating directly against the device and not a partition? I should know soon21:22
klamathsample here, formatting mkfs.btrfs -L shows this:21:23
klamath /dev/rbd0: LABEL="urbackup-btrfs" UUID="d428c0de-e0a7-4d10-ba6a-c352c1c5604e" UUID_SUB="89336cf6-3ad0-4dd7-81fc-414685526488" BLOCK_SIZE="4096" TYPE="btrfs"21:23
clarkbok I am not able to reproduce the kolla issue21:27
clarkbdoing `parted /dev/vdb rm 1` then `mkfs.ext4 -L kolla -m 0 /dev/vdb` produces a disk with label = kolla21:28
clarkboh I guess I didn't try mounting it via the label yet21:28
clarkbthat still succeeds but I mounted it via device path first which maybe updated things in the kernel. I'll try again with a new label and attempt to mount first by label only21:30
klamathi would be impressed if there is a race condition in which we are formatting the disk faster than the kernel gets updated...21:31
clarkbya I mean I'm just trying to make the test as 1:1 as possible without running kolla ansible21:32
clarkbanyway it still doesn't reproduce. Using parted and mkfs.ext4 directly it works fine21:32
clarkbthis makes me suspect this is a bug in ansible21:32
clarkbmnasiadka: frickler ^ fyi I cannot reproduce the issue on an ubuntu jammy node in raxflex using parted and mkfs.ext4 directly21:33
clarkbI suspect that if you want to test this further you'll need to do so via ansible parted and filesystem and mount modules as I'm beginning to suspect the bug is with ansible21:33
clarkbI did `parted /dev/vdb rm 1 ; mkfs.ext4 -L kollatwo -m 0 /dev/vdb ; mount LABEL=kollatwo /root/mnt` and it worked21:34
clarkband ls works against the mount point and shows the expected lost+found dir21:35
clarkbfrickler: I used the infra root keys when booting the node if you want to jump on it yourself and investigate further21:35
clarkbfrickler: 65.17.193.149 is the ip21:35
clarkbI've unmounted the filesystem too so you can modify it without impact21:36
clarkb/dev/vdb: LABEL="kollatwo" UUID="fd1c4718-2cd8-4130-bfa1-54f9d25d9d23" BLOCK_SIZE="4096" TYPE="ext4" <- is what blkid reports for completeness cc klamath 21:36
clarkbreading ansible documentation you don't set force on the filesystem creation. I half wonder if the partition deletion is what fails. With the commands I used above I had to manually hit 'y' to confirm I want a new filesystem over the old one21:42
clarkbbut maybe its failing to force so you don't update the label from ephemeral0 to kolla ?21:42
clarkb`parted /dev/vdb print` does show a single "partition" with id 1. But it continues to do so after you `parted /dev/vdb rm 1` too possibly because parted wants to do that cleanup on a partition table and not a filesystem without a table?21:43
clarkbya thats my hunch21:43
clarkbyou can probably confirm this by recording the blkid output in your kolla jobs with your workaround in place? the workaround may be working simply because the existing filesystem on the ephemeral device fills the device and is ext4 so it's "good enough"21:44
clarkbbut the actual ansible tasks may not be doing what you expect?21:44
clarkbklamath: ^ thats my best guess to what is going on. And I'm guessing rax classic ephemeral devices actually have a partition table on the ephemeral device that we wipe out and mkfs then works without a force21:45
clarkbI'll leave the test node in place for now for additional debugging21:46
klamathmaybe using wipefs -af might be a better option than parted?21:46
clarkbI assume they are using parted because ansible comes with parted modules out of the box but ya a more aggressive tool may be more appropriate here21:48
clarkbso ya I think kolla should update the configure ephemeral role to capture blkid output after doing the filesystem creation. We may discover that we're simply mounting the existing ephemeral0 labeled ext4 fs that came with the ephemeral drive when we booted21:52
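A minimal sketch of the check suggested above: read the label back out of blkid output after the filesystem step and compare it with the label the job meant to set. The function names `label_from_blkid_line` and `label_matches` are hypothetical and not part of kolla; the sample line in the comment is the blkid output quoted earlier in this log.

```shell
#!/bin/sh
# Hypothetical helper: print the LABEL value from one line of blkid output.
# Prints nothing when the line has no LABEL field (PARTLABEL is ignored).
label_from_blkid_line() {
    printf '%s\n' "$1" | sed -n 's/.*[[:space:]]LABEL="\([^"]*\)".*/\1/p'
}

# Hypothetical check: succeed only if the line carries the expected label.
label_matches() {
    [ "$(label_from_blkid_line "$1")" = "$2" ]
}

# Example use on a real node (assumes the device is /dev/vdb):
#   line=$(blkid /dev/vdb)
#   label_matches "$line" kolla || echo "label was not updated: $line" >&2
```

If the old filesystem was never replaced, a check like this would report ephemeral0 instead of the new label, which is the case being hypothesised here.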
fungiokay, back21:53
fungiclarkb: one thing i realized when testing out flex with my personal account is that if i specify the existing public net (not the one we created), auto network will "just work"21:55
clarkbfungi: is that still using floating IPs or attaching you directly to the public net?21:55
fungidirect attach, similar to rax classic21:56
clarkbfungi: either way switching to a different floating ip worked so I suspect the problem is in the ip itself (maybe that portion of a range isn't routed or whatever)21:56
fungiit's _possible_ they added that after we started using it, or that we misunderstood the requirements as described to us21:56
clarkbya it might be worth switching to direct attaching for consistency with the other clouds21:57
fungimy notes say i did `openstack server create --flavor=flavorname --image=imagename --network=PUBLICNET servername`21:58
fungiand ended up with a reachable v4 addy directly assigned on the ens3 interface21:59
fungiwhether that was intended to work, i can't say, but it did21:59
clarkbfungi: ya that's how openmetal works too. Basically you can go the floating ip route but you sacrifice IPs for router interfaces if you do22:00
fungi(i did create a security group first with ingress allowed, and specified it with --security-group=groupname of course)22:00
clarkbor you just direct attach. I think it is intentional within neutron that this works22:00
clarkbfungi: to catch you up since you went for dinner I tested openid logins with a held gerrit and all looks well with newer 3.10.3 build. I think we can proceed with landing gerrit changes when we are ready to give up on 3.9 (rollback unlikely) and restart on 3.10 latest to pick up that openid update. I also did the whole raxflex test node thing and was unable to reproduce kolla's22:03
clarkbproblem which led to the theory that it isn't actually replacing the filesystem in the first place so the label doesn't update. But the filesystem that is already there is compatible so their workaround works22:03
clarkband now I'm ready for the next thing22:03
clarkbwhich according to my notes is pausing openeuler image builds and purging retired backups22:04
fungiyeah, it could be that the ephemeral device in flex is preformatted in such a way that kolla's job isn't thinking it needs to reformat it22:05
clarkbfungi: more specifically I think it is formatted in such a way that if parted isn't properly clearing it out I think they need to use the force option22:06
clarkband I'm guessing not using the force option isn't an error in this case22:06
fungiyeah, setting the label is probably more like "and if this command results in formatting the device then set the label while doing so"22:08
fungias opposed to something more explicit like tune2fs -L22:09
clarkbya or use wipefs before hand like klamath suggested so that mkfs doesn't see another fs there and proceed without --force22:15
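The wipe-then-format idea discussed above could be sketched as follows. This is an assumption-laden illustration, not what kolla runs: `run` and `reformat_with_label` are hypothetical names, the device, label and mount point are parameters, and the script defaults to a dry run that only prints the commands because `wipefs` and `mkfs.ext4` destroy data and need root.

```shell
#!/bin/sh
# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: clear old signatures, create a labelled ext4
# filesystem, then look the device up and mount it by that label.
reformat_with_label() {
    dev=$1
    label=$2
    mnt=$3
    run wipefs -a "$dev"                       # erase existing signatures
    run mkfs.ext4 -F -L "$label" -m 0 "$dev"   # -F: force, no confirmation
    run blkid -L "$label"                      # expected to print the device
    run mount "LABEL=$label" "$mnt"
}

# Example (dry run): reformat_with_label /dev/vdb kolla /mnt
```

Either step alone may be enough (wiping first so nothing is detected, or forcing the format); the sketch shows both only to make the sequence explicit.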
clarkbfungi: is there an openeuler pause change yet?22:26
funginot yet, almost caught up on e-mail and will get it pushed up shortly22:27
opendevreviewMerged opendev/system-config master: Purge previously retired backups on vexxhost backup server  https://review.opendev.org/c/opendev/system-config/+/93704022:27
clarkbok cool just making sure I didn't miss it in all of the eventful afternoon happenings22:27
clarkbcurrent backup server disk use is 75%22:28
clarkbwill be a few minutes before we get to the job that does the purging22:29
clarkbbackup job just started22:38
clarkbborg-ask01 is gone now22:40
opendevreviewJeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds  https://review.opendev.org/c/openstack/project-config/+/93748822:40
clarkbthat got us down to 67% and spot checking the hosts I expect to still have backups, I see that they do indeed still have them22:41
clarkbthe job completed successfully too22:41
clarkbso I think the last thing to do is rerun the prune script and make sure it doesn't complain about anything we purged22:42
clarkbfungi: there is an arm64 build that needs pausing too I think22:42
clarkbin the nb04 file22:42
fungioh, good call22:42
clarkbinterestingly in the nb04 file we have pause directives in a different location in the file22:44
clarkbnot sure whether one is more correct than the other22:44
clarkbin theory they are both valid because nodepool validates the yaml schema22:44
opendevreviewJeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds  https://review.opendev.org/c/openstack/project-config/+/93748822:44
clarkboh I guess one is provider specific and the other is global.22:45
fungione pauses builds, the other uploads22:45
clarkback22:45
clarkbfungi: pause: false needs to be pause: true in the nb04 file22:45
clarkbthen I'll approve it22:45
opendevreviewJeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds  https://review.opendev.org/c/openstack/project-config/+/93748822:46
fungid'oh! thanks22:46
clarkband then we were going to add the openeuler person to that review. Do you have that info available?22:48
fungijust did on both changes22:48
clarkbthanks!22:49
funginow on to backups. just need to run another prune on backup02.ca-ymq-1.vexxhost?22:49
clarkbfungi: yes I think that is the last step in the retire -> prune -> purge -> prune process22:49
clarkbmostly as a sanity check that we haven't made the script unhappy by purging22:49
fungi/dev/mapper/main--202010-backups--202010 1007G  669G  338G  67% /opt/backups-20201022:49
clarkbwill also give us a low water mark22:50
fungistarting the prune now in a root screen session on the server22:50
opendevreviewMerged zuul/zuul-jobs master: Add mirror-container-images role and job  https://review.opendev.org/c/zuul/zuul-jobs/+/93557423:00
fungi/dev/mapper/main--202010-backups--202010 1007G  591G  417G  59% /opt/backups-20201023:03
clarkbthat is with pruning complete?23:05
clarkbif so I think we roughly doubled our free disk space?23:06
clarkber free disk space at the low water mark I should say23:06
fungiyeah, it finished moments before i posted that23:06
clarkbnot too bad overall23:06
fungiindeed23:06
clarkband the script didn't complain about missing files etc so I think we're good23:07
fungi#status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 67% to 59% (after the purge of retired backups brought it down from 75%)23:08
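As a quick check on the percentages in the two df lines above, the Use% column can be reproduced from the Used and Avail columns. This sketch assumes df rounds used / (used + available) up to the next whole percent; `use_percent` is a hypothetical helper name.

```shell
#!/bin/sh
# Ceiling of used * 100 / (used + avail), using integer arithmetic only.
use_percent() {
    used=$1
    avail=$2
    total=$((used + avail))
    echo $(( (used * 100 + total - 1) / total ))
}

# Figures from the df output above, in GiB:
#   use_percent 669 338   # after the purge, before the prune
#   use_percent 591 417   # after the prune
```

With those inputs the function yields 67 and 59, matching the reported values, and the free space between the two readings grows from 338G to 417G.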
fungiyeah, zero error messages23:08
clarkbfungi: https://review.opendev.org/c/opendev/gerritlib/+/937276 is one of the gerrit upgrade related changes that should be completely safe to land now if we like23:08
fungiscript silently exited 023:08
clarkbit is independent of the production stuff and simply updates gerritlib and jeepyb testing to test against 3.10 to match production23:08
clarkbperfect23:08
fungii'm realizing that the slowness of wiki.openstack.org is what's causing statusbot to take so long responding23:09
clarkbhttps://review.opendev.org/c/opendev/system-config/+/937277 is also technically safe I think as worst case we'll retain more log files than we want but we can probably land it with the other stuff tomorrow since that does touch production23:09
clarkbfungi: oh that makes sense23:09
fungiclosing out the entirely uneventful screen session23:09
opendevreviewJay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API  https://review.opendev.org/c/openstack/diskimage-builder/+/90769123:09
opendevreviewMerged opendev/gerritlib master: Test gerrit lib (and jeepyb) against Gerrit 3.10.3  https://review.opendev.org/c/opendev/gerritlib/+/93727623:20
opendevreviewMerged openstack/project-config master: Pause openEuler-22-03-LTS image builds  https://review.opendev.org/c/openstack/project-config/+/93748823:44
