Tuesday, 2024-12-10

clarkb	frickler: good idea I'll add that	00:08
clarkb	fungi: I appreciated the oregon trail joke	00:09
clarkb	ok my last agenda updates are in. I'll send that out in like 15 minutes if there are any other edits to add	00:16
opendevreview	Clark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing https://review.opendev.org/c/opendev/system-config/+/937290	00:35
clarkb	any idea why we need the -i to point to the key there? I didn't think it was necessary but I got permission denied public key on the most recent run	00:35
clarkb	then I saw an ssh command later does supply -i. It's a default ed25519 keyname in /root/.ssh/ so it should be autodetected I would've thought	00:35
clarkb	agenda is on its way out	00:37
clarkb	hrm maybe $HOME isn't properly set so ssh doesn't know how to find the .ssh dir?	00:42
clarkb	in any case I'm hopeful that being explicit about the key to use is sufficient	00:42
fungi	yeah, wrong $HOME would have been my guess too, i normally only add -i when working with non-default key names/paths	00:57
Clark[m]	It passed finally	01:46
Clark[m]	Tomorrow after meetings I'll have to try and hold a 3.10 test node to retest the openid fixup	01:48
Clark[m]	And look at reenabling the upgrade test	01:48
*** diablo_rojo_phone is now known as Guest2639		08:38
Petraea	Morning folks. I found a small bug in diskimage-builder where grub2-install won't proceed in EL9 when it can't find a secureboot path - turns out you can just --force to get past this. Has this been covered before by someone else? Am I just reporting a dupe?	08:44
opendevreview	Gwen Dawes proposed openstack/diskimage-builder master: Force grub2-install to bypass secureboot complaints. https://review.opendev.org/c/openstack/diskimage-builder/+/937442	09:11
Petraea	Found the bug already reported, so I referenced it. Hope that's OK, I've tried to follow as many of the style guides as I can!	09:13
*** ralonsoh_ is now known as ralonsoh		09:21
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add support for DNF5-based systems https://review.opendev.org/c/openstack/diskimage-builder/+/934332	12:41
frickler	did anyone else note that pypi.org seems to be protected by fastly now? at least when browsing. how long until they also introduce rate limits for wheel downloads? :-/	13:00
fungi	Petraea: also, #openstack-dib is a dedicated irc channel for diskimage-builder discussions	14:33
fungi	might be easier to reach the right people in there (though a lot of them hang out in here too of course)	14:34
Petraea	fungi: Oh, cool, thanks. I'll go bug people there too, in short order.	14:35
fungi	frickler: yeah, it came up wrt having broken some tools that had frontends for pypi's search feature too: https://discuss.python.org/t/fastly-interfering-with-pypi-search/73597	14:35
frickler	hmm, was gerrit user history trimmed somehow? I see "Joined: 16.12.2015" for a lot of accounts that I think should be older, like corvus and fungi and myself (though the latter only by a bit I think)	15:15
clarkb	I believe that was due to some database edits 9 years ago.	15:16
clarkb	maybe to change the openid provider?	15:16
fungi	frickler: long, long ago in a database far, far away, gerrit set their timestamps to reset on any change to their corresponding rows. during one database migration where we didn't explicitly force the timestamp fields to their existing values, they updated to the date of the maintenance and we didn't notice it for weeks	15:16
clarkb	I don't recall what exactly but yes	15:16
clarkb	ah fungi recalls better than I do	15:17
frickler	ah, ok, strange that I seem to never have noticed this before. but almost exactly 9y ago would match	15:18
opendevreview	Gwen Dawes proposed openstack/diskimage-builder master: Force grub2-install to bypass secureboot complaints. https://review.opendev.org/c/openstack/diskimage-builder/+/937442	15:56
opendevreview	Clark Boylan proposed opendev/system-config master: DNM Forced fail on Gerrit to test the 3.10 openid update https://review.opendev.org/c/opendev/system-config/+/893571	16:15
clarkb	I'm going to put a new autohold on ^ to test the openid update	16:15
clarkb	oh huh gerrit 3.11 actually recommends java 21. But I think we sort that out as a followon to the current stack of changes since java 17 will keep working fine	16:19
opendevreview	Clark Boylan proposed opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11 https://review.opendev.org/c/opendev/system-config/+/937465	16:21
opendevreview	Jeremy Stanley proposed openstack/diskimage-builder master: Temporarily disable OpenEuler functional testing https://review.opendev.org/c/openstack/diskimage-builder/+/937466	16:22
clarkb	I suspect the switch to java 21 might be easier after we upgrade to 3.11 then we can update our base image to use java 21 instead of 17 for both 3.11 and 3.12 at the same time rather than trying to sort it out between 3.10 and 3.11	16:30
opendevreview	Clark Boylan proposed opendev/system-config master: Reenable Gerrit upgrade job now testing 3.10 to 3.11 https://review.opendev.org/c/opendev/system-config/+/937465	17:13
cardoe	clarkb: what's the label issue with raxflex?	17:33
clarkb	cardoe: the kolla-ansible jobs repartition and reformat the raxflex node ephemeral drive then attempt to mount one of the partitions referring to it by label and then mount says it cannot find that label	17:47
clarkb	cardoe: it's possible that there is a driver issue with reloading that data into the kernel or maybe a sync issue?	17:48
klamath	@clarkb can you please show me the output from blkid	17:49
clarkb	klamath: I don't have a node handy for that. frickler might? Otherwise we can probably boot one out of band or something	17:49
cardoe	That's who can fix it. Couldn't remember his IRC handle.	17:49
clarkb	this is the downside to having a CI system recycle nodes aggressively	17:49
klamath	there shouldnt be anything preventing the block device from getting a label	17:54
clarkb	klamath: the repartitioning to set the label succeeds	17:55
clarkb	but when you run mount based on the label mount says that label is not present anywhere and that fails	17:56
klamath	the blkid will be helpful in trying to understand what the OS is seeing	17:56
clarkb	klamath: do you have any involvement in the older xen based cloud stuff? we've found that older xen hypervisors, compared to newer ones within the same cloud region, appear to boot ubuntu noble with one vcpu when it should have 8 (and does have 8 when newer xen is used)	17:58
clarkb	frickler: maybe you can collect the blkid info for klamath since I think you're plugged into kolla	17:58
klamath	xen is bad for my health, i dont touch that stuff	17:58
frickler	since we have the workaround now in kolla, I didn't continue trying to get a held node with the issue	18:06
clarkb	we can just boot one directly, frickler did the operating system matter or was this seen on a variety of OSs?	18:07
clarkb	klamath: the workaround was to use the device path to mount rather than a label.	18:08
klamath	Yeah, I saw the PR, i was interested in verifying the issue before spinning people up	18:08
clarkb	ya I can boot a node directly and reproduce outside of ansible and capture blkid info	18:09
clarkb	but I'm in meetings for a bit so won't be until this afternoon	18:09
frickler	clarkb: I think it was seen on all oses kolla uses (d/r/u)	18:12
clarkb	thanks for confirming	18:13
clarkb	ok all of the gerrit image and testing update changes pass. Now I need to test openid logins on the rebuilt gerrit 3.10 then we can merge the stack	18:21
clarkb	and maybe plan for that tomorrow which should be ~long enough to not expect a revert and also have time to restart gerrit on the updated 3.10 version with openid updates	18:21
clarkb	er I'll do my best to test today so we can land stuff tomorrow	18:22
fungi	and backup purge later in the week?	18:25
clarkb	fungi: I think we can probably still do that today or tomorrow?	18:27
clarkb	I don't expect that to take long (particularly since no docker rate limits should be involved)	18:27
fungi	sure, i probably have time after the next meeting, though may need to do it after dinner	18:28
clarkb	ya after the meeting i have lunch to consume. But after that I can test openid logins with newer gerrit 3.10, boot a test node in raxflex to check mount by label, and also purge retired backups	18:38
clarkb	looks like we need to pause openeuler builds in nodepool too cc fungi	18:59
clarkb	https://nb01.opendev.org/openEuler-22-03-LTS-88110d3fbc1b473f9006054260b897ea.log is a random failure I pulled up	18:59
mnasiadka	clarkb: we switched to not use the label, it works with raxflex, it was convenient but whatever ;)	18:59
clarkb	mnasiadka: right but now rax is asking us to help debug	18:59
clarkb	so I can boot a node directly then repartition and format and mount and see what happens then capture blkid	19:01
clarkb	or you could too but it's a bit more involved for you to do it and wasteful as you aren't guaranteed a boot in the right cloud provider	19:02
clarkb	mnasiadka: frickler: if you can point at the code that was doing the formatting and mount by label that could be helpful so that I partition in a similar manner	19:04
frickler	clarkb: https://review.opendev.org/c/openstack/kolla/+/937345	19:05
clarkb	thanks	19:05
*** atmark_ is now known as atmark		19:37
clarkb	fungi: ok I'm going to eat lunch and then dive into openid redirect testing. When you're back from food too I think we can go ahead and approve the backup purge change if it looks good to you	19:57
clarkb	and once openid testing is done I'm going to look into the raxflex mount by label issue	19:58
fungi	sgtm, yep. i'll also try to work on the nodepool openeuler image build pause change then too	19:58
clarkb	thanks!	19:58
opendevreview	Merged zuul/zuul-jobs master: zuul_debug_info: Add /proc/cpuinfo output https://review.opendev.org/c/zuul/zuul-jobs/+/937376	20:07
fungi	JayF: in positive news, i think we've finally reached steady-state with any lingering defunct subscribers on openstack-discuss having either been auto-disabled or manually disabled by me after looking through the uncaught bounce notifications. any after this point are probably accounts newly going dead or one-off bounces for things like dmarc policy violations	20:11
JayF	nice, thanks!	20:11
fungi	thanks for bearing with me on that, and apologies for all the noise	20:12
JayF	my role was "search and hit delete every now and then", you did the real work :D	20:21
fungi	popping out to grab eats, should be back on soon	20:24
clarkb	infra-root held gerrit 3.10.3 with the openid fix is in place at 172.99.69.123 you'll have to override review.opendev.org in /etc/hosts to get a fully working login path there but it works for me.	20:53
clarkb	I think we're good to land the gerrit upgrade changes when people are comfortable with not rolling back and with a restart to pick up the openid change	20:53
clarkb	I did notice one weird thing which is that it redirects you to https://review.opendev.org/c/opendev/system-config/%2F/893571 when following change sign in links. However, I reproduced this with the current version of gerrit without my openid fix so this is unrelated and it successfully sends you to https://review.opendev.org/c/opendev/system-config/+/893571 from there which is the canonical url	20:54
clarkb	I am booting clarkb-test-jammy in raxflex now	20:57
clarkb	and I'm quickly reminded I need to remember how to use floating ips	20:58
clarkb	I have created a floating IP and attached it but the server does not respond to pings or ssh connections at that address. Could be any number of things but this generally works with nodepool so I'm guessing it isn't the image boot up sequence	21:05
clarkb	I can see glean logging that it brought up the ens3 interface in the console log	21:06
frickler	did you add a security group?	21:06
clarkb	not explicitly so it should use the default group	21:07
clarkb	I can ping the private address from a nodepool booted test node in the cloud region so network is up	21:07
clarkb	ya security group is set to default according to server show	21:07
clarkb	and that matches the nodepool node I jumped over to	21:08
clarkb	the floating IP also pings from that nodepool node but not externally	21:09
clarkb	that does make it seem like a firewall/security group issue but it is using the default group just like the others so I'm confused	21:10
clarkb	I may try another floating IP	21:10
clarkb	second floating IP works	21:13
clarkb	klamath: 63.131.145.251 this floating IP did not work from the outside world when attached to clarkb-test-jammy	21:14
clarkb	working through the kolla playbook the ephemeral device is identified via `/sbin/blkid -L ephemeral0` and running that on my test node returns /dev/vdb as expected.	21:17
clarkb	(so we are not using the wrong device)	21:17
klamath	I can pass along this float IP info to team.	21:19
clarkb	frickler: mnasiadka one thing I notice working through your playbook is you don't create a new partition you simply create an ext4 filesystem directly on /dev/vdb	21:19
klamath	is the blkid returning a disk label now?	21:20
clarkb	it wouldn't surprise me if this is the root of the issue since labels would be written to a partition table but we don't have one?	21:20
clarkb	klamath: yes the current label is ephemeral0. I haven't started on the steps to do what kolla was doing yet as I'm trying to map it from ansible to parted and mkfs	21:20
klamath	with or without a partition a disk label should work	21:21
clarkb	klamath: but does mkfs -L foo set a disklabel or a filesystem label in a partition table if operating directly against the device and not a partition? I should know soon	21:22
klamath	sample here, formatting mkfs.btrfs -L shows this:	21:23
klamath	/dev/rbd0: LABEL="urbackup-btrfs" UUID="d428c0de-e0a7-4d10-ba6a-c352c1c5604e" UUID_SUB="89336cf6-3ad0-4dd7-81fc-414685526488" BLOCK_SIZE="4096" TYPE="btrfs"	21:23
clarkb	ok I am not able to reproduce the kolla issue	21:27
clarkb	doing `parted /dev/vdb rm 1` then `mkfs.ext4 -L kolla -m 0 /dev/vdb` produces a disk with label = kolla	21:28
clarkb	oh I guess I didn't try mounting it via the label yet	21:28
clarkb	that still succeeds but I mounted it via device path first which maybe updated things in the kernel. I'll try again with a new label and attempt to mount first by label only	21:30
klamath	i would be impressed if there is a race condition in which we are formatting the disk faster than the kernel gets updated...	21:31
clarkb	ya I mean I'm just trying to make the test as 1:1 as possible without running kolla ansible	21:32
clarkb	anyway it still doesn't reproduce. Using parted and mkfs.ext4 directly it works fine	21:32
clarkb	this makes me suspect this is a bug in ansible	21:32
clarkb	mnasiadka: frickler ^ fyi I cannot reproduce the issue on an ubuntu jammy node in raxflex using parted and mkfs.ext4 directly	21:33
clarkb	I suspect that if you want to test this further you'll need to do so via ansible parted and filesystem and mount modules as I'm beginning to suspect the bug is with ansible	21:33
clarkb	I did `parted /dev/vdb rm 1 ; mkfs.ext4 -L kollatwo -m 0 /dev/vdb ; mount LABEL=kollatwo /root/mnt` and it worked	21:34
clarkb	and ls works against the mount point and shows the expected lost+found dir	21:35
clarkb	frickler: I used the infra root keys when booting the node if you want to jump on it yourself and investigate further	21:35
clarkb	frickler: 65.17.193.149 is the ip	21:35
clarkb	I've unmounted the filesystem too so you can modify it without impact	21:36
clarkb	/dev/vdb: LABEL="kollatwo" UUID="fd1c4718-2cd8-4130-bfa1-54f9d25d9d23" BLOCK_SIZE="4096" TYPE="ext4" <- is what blkid reports for completeness cc klamath	21:36
clarkb	reading ansible documentation you don't set force on the filesystem creation. I half wonder if the partition deletion is what fails. With the commands I used above I had to manually hit 'y' to confirm I want a new filesystem over the old one	21:42
clarkb	but maybe its failing to force so you don't update the label from ephemeral0 to kolla ?	21:42
clarkb	`parted /dev/vdb print` does show a single "partition" with id 1. But it continues to do so after you `parted /dev/vdb rm 1` too possibly because parted wants to do that cleanup on a partition table and not a filesystem without a table?	21:43
clarkb	ya thats my hunch	21:43
clarkb	you can probably confirm this by recording the blkid output in your kolla jobs with your workaround in place? the workaround may be working simply because the existing filesystem on the ephemeral device fills the device and is ext4 so its "good enough"	21:44
clarkb	but the actual ansible tasks may not be doing what you expect?	21:44
clarkb	klamath: ^ thats my best guess to what is going on. And I'm guessing rax classic ephemeral devices actually have a partition table on the ephemeral device that we wipe out and mkfs then works without a force	21:45
clarkb	I'll leave the test node in place for now for additional debugging	21:46
klamath	maybe using wipefs -af might be a better option than parted?	21:46
clarkb	I assume they are using parted because ansible comes with parted modules out of the box but ya a more aggressive tool may be more appropriate here	21:48
clarkb	so ya I think kolla should update the configure ephemeral role to capture blkid output after doing the filesystem creation. We may discover that we're simply mounting the existing ephemeral0 labeled ext4 fs that came with the ephemeral drive when we booted	21:52
fungi	okay, back	21:53
fungi	clarkb: one thing i realized when testing out flex with my personal account is that if i specify the existing public net (not the one we created), auto network will "just work"	21:55
clarkb	fungi: is that still using floating IPs or attaching you directly to the public net?	21:55
fungi	direct attach, similar to rax classic	21:56
clarkb	fungi: either way switching to a different floating ip worked so I suspect the problem is in the ip itself (maybe that portion of a range isn't routed or whatever)	21:56
fungi	it's _possible_ they added that after we started using it, or that we misunderstood the requirements as described to us	21:56
clarkb	ya it might be worth switching to direct attaching for consistency with the other clouds	21:57
fungi	my notes say i did `openstack server create --flavor=flavorname --image=imagename --network=PUBLICNET servername`	21:58
fungi	and ended up with a reachable v4 addy directly assigned on the ens3 interface	21:59
fungi	whether that was intended to work, i can't say, but it did	21:59
clarkb	fungi: ya thats how openmetal works too. Basically you can go the floating ip route but you sacrifice IPs for router interfaces if you do	22:00
fungi	(i did create a security group first with ingress allowed, and specified it with --security-group=groupname of course)	22:00
clarkb	or you just direct attach. I think it is intentional within neutron that this works	22:00
clarkb	fungi: to catch you up since you went for dinner I tested openid logins with a held gerrit and all looks well with newer 3.10.3 build. I think we can proceed with landing gerrit changes when we are ready to give up on 3.9 (rollback unlikely) and restart on 3.10 latest to pick up that openid update. I also did the whole raxflex test node thing and was unable to reproduce kolla's problem which led to the theory that it isn't actually replacing the filesystem in the first place so the label doesn't update. But the filesystem that is already there is compatible so their workaround works	22:03
clarkb	and now I'm ready for the next thing	22:03
clarkb	which according to my notes is pausing openeuler image builds and purging retired backups	22:04
fungi	yeah, it could be that the ephemeral device in flex is preformatted in such a way that kolla's job isn't thinking it needs to reformat it	22:05
clarkb	fungi: more specifically I think it is formatted in such a way that if parted isn't properly clearing it out I think they need to use the force option	22:06
clarkb	and I'm guessing not using the force option isn't an error in this case	22:06
fungi	yeah, setting the label is probably more like "and if this command results in formatting the device then set the label while doing so"	22:08
fungi	as opposed to something more explicit like tune2fs -L	22:09
clarkb	ya or use wipefs beforehand like klamath suggested so that mkfs doesn't see another fs there and can proceed without --force	22:15
clarkb	fungi: is there an openeuler pause change yet?	22:26
fungi	not yet, almost caught up on e-mail and will get it pushed up shortly	22:27
opendevreview	Merged opendev/system-config master: Purge previously retired backups on vexxhost backup server https://review.opendev.org/c/opendev/system-config/+/937040	22:27
clarkb	ok cool just making sure I didn't miss it in all of the eventful afternoon happenings	22:27
clarkb	current backup server disk use is 75%	22:28
clarkb	will be a few minutes before we get to the job that does the purging	22:29
clarkb	backup job just started	22:38
clarkb	borg-ask01 is gone now	22:40
opendevreview	Jeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds https://review.opendev.org/c/openstack/project-config/+/937488	22:40
clarkb	that got us down to 67% and spot checking the servers I expect to still have backups, I see that they do indeed still have them	22:41
clarkb	the job completed successfully too	22:41
clarkb	so I think the last thing to do is rerun the prune script and make sure it doesn't complain about anything we purged	22:42
clarkb	fungi: there is an arm64 build that needs pausing too I think	22:42
clarkb	in the nb04 file	22:42
fungi	oh, good call	22:42
clarkb	interestingly in the nb04 file we have pause directives in a different location in the file	22:44
clarkb	not sure whether one is more correct than the other	22:44
clarkb	in theory they are both valid because nodepool validates the yaml schema	22:44
opendevreview	Jeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds https://review.opendev.org/c/openstack/project-config/+/937488	22:44
clarkb	oh I guess one is provider specific and the other is global.	22:45
fungi	one pauses builds, the other uploads	22:45
clarkb	ack	22:45
clarkb	fungi: pause: false needs to be pause: true in the nb04 file	22:45
clarkb	then I'll approve it	22:45
opendevreview	Jeremy Stanley proposed openstack/project-config master: Pause openEuler-22-03-LTS image builds https://review.opendev.org/c/openstack/project-config/+/937488	22:46
fungi	d'oh! thanks	22:46
clarkb	and then we were going to add the openeuler person to that review. Do you have that info available?	22:48
fungi	just did on both changes	22:48
clarkb	thanks!	22:49
fungi	now on to backups. just need to run another prune on backup02.ca-ymq-1.vexxhost?	22:49
clarkb	fungi: yes I think that is the last step in the retire -> prune -> purge -> prune process	22:49
clarkb	mostly as a sanity check that we haven't made the script unhappy by purging	22:49
fungi	/dev/mapper/main--202010-backups--202010 1007G 669G 338G 67% /opt/backups-202010	22:49
clarkb	will also give us a low water mark	22:50
fungi	starting the prune now in a root screen session on the server	22:50
opendevreview	Merged zuul/zuul-jobs master: Add mirror-container-images role and job https://review.opendev.org/c/zuul/zuul-jobs/+/935574	23:00
fungi	/dev/mapper/main--202010-backups--202010 1007G 591G 417G 59% /opt/backups-202010	23:03
clarkb	that is with pruning complete?	23:05
clarkb	if so I think we roughly doubled our free disk space?	23:06
clarkb	er free disk space at the low water mark I should say	23:06
fungi	yeah, it finished moments before i posted that	23:06
clarkb	not too bad overall	23:06
fungi	indeed	23:06
clarkb	and the script didn't complain about missing files etc so I think we're good	23:07
fungi	#status log Pruned backups on backup02.ca-ymq-1.vexxhost.opendev.org reducing volume utilization from 67% to 59% (after the purge of retired backups brought it down from 75%)	23:08
fungi	yeah, zero error messages	23:08
clarkb	fungi: https://review.opendev.org/c/opendev/gerritlib/+/937276 is one of the gerrit upgrade related changes that should be completely safe to land now if we like	23:08
fungi	script silently exited 0	23:08
clarkb	it is independent of the production stuff and simply updates gerritlib and jeepyb testing to test against 3.10 to match production	23:08
clarkb	perfect	23:08
fungi	i'm realizing that the slowness of wiki.openstack.org is what's causing statusbot to take so long responding	23:09
clarkb	https://review.opendev.org/c/opendev/system-config/+/937277 is also technically safe I think as worst case we'll retain more log files than we want but we can probably land it with the other stuff tomorrow since that does touch production	23:09
clarkb	fungi: oh that makes sense	23:09
fungi	closing out the entirely uneventful screen session	23:09
opendevreview	Jay Faulkner proposed openstack/diskimage-builder master: Stop using deprecated pkg_resources API https://review.opendev.org/c/openstack/diskimage-builder/+/907691	23:09
opendevreview	Merged opendev/gerritlib master: Test gerrit lib (and jeepyb) against Gerrit 3.10.3 https://review.opendev.org/c/opendev/gerritlib/+/937276	23:20
opendevreview	Merged openstack/project-config master: Pause openEuler-22-03-LTS image builds https://review.opendev.org/c/openstack/project-config/+/937488	23:44
