Thursday, 2024-06-20

opendevreviewTony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs  https://review.opendev.org/c/openstack/project-config/+/92235700:27
opendevreviewTony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs  https://review.opendev.org/c/openstack/project-config/+/92235701:03
opendevreviewTony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config  https://review.opendev.org/c/opendev/system-config/+/92132201:16
*** enick_791 is now known as mordred01:37
mordredtonyb: am I showing up properly now?01:38
tonybmordred: Yes, Yes you are :)01:38
mordred\o/01:38
tonybThank you for indulging me :)01:39
mordredsorry about that - I had nickserv username and password configured - but for some reason it wasn't actually setting the nick. who knows01:39
tonybNo problem, at first I thought it was just a "drive-by-chat" thing ... but then others knew who you were/are01:41
opendevreviewTony Breeds proposed opendev/system-config master: Add noble repo files  https://review.opendev.org/c/opendev/system-config/+/92177002:18
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178602:18
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177102:18
opendevreviewTony Breeds proposed opendev/system-config master: Remove *zuul-role-integration-centos-8-stream* jobs  https://review.opendev.org/c/opendev/system-config/+/92236002:18
opendevreviewMerged opendev/system-config master: Remove old meetpad and jvb servers  https://review.opendev.org/c/opendev/system-config/+/92076202:31
fricklerI think there was a stuck buildset on openmetal due to the "max-servers: 1" test. there was a zuul-build-image job in "paused" state and I think it couldn't progress because the dependent zuul-quick-start jobs need to run on the same provider? I manually bumped max-servers for openmetal to 3 in order to allow this to progress, will make a patch to bump to 50 next06:35
fricklerjust wondering whether there would be a way for zuul/nodepool to avoid such a deadlock, or whether we just need to enforce some sane minimum for max-servers06:36
fricklersee https://zuul.opendev.org/t/zuul/buildset/a47d2ae8a5d84838946a9bc394d32661 for the timeline of the affected patch06:37
opendevreviewJens Harbott proposed openstack/project-config master: Bump max-servers for openmetal cloud to maximum  https://review.opendev.org/c/openstack/project-config/+/92236807:01
fricklerI'm a bit worried about the ceph setup for openmetal, just 32 PGs per pool seems too low for 7 OSDs, and I also don't trust ceph to autotune this, but let's run some jobs there first and see07:02
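A minimal sketch of how the PG situation could be checked on the ceph side, using standard ceph CLI commands (the pool name in the last line is a placeholder):

    ceph osd pool ls detail           # pg_num / pgp_num per pool
    ceph osd pool autoscale-status    # what the autoscaler would target
    ceph osd df                       # per-OSD usage, PG count in the PGS column
    # if autotuning is not trusted, pg_num can be raised by hand, e.g.:
    # ceph osd pool set <pool> pg_num 128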
opendevreviewMerged openstack/project-config master: Bump max-servers for openmetal cloud to maximum  https://review.opendev.org/c/openstack/project-config/+/92236807:19
fricklerhmm, I guess I never looked at it this closely, but it seems nodepool does not use max-servers as a weight for pool usage? I was surprised to see openmetal getting maxed out during this relatively quiet period, but according to grafana all regions are at about 50 nodes in use, which is like 30% for some regions and 100% for openmetal, not sure if that's intentional?08:28
frickleralso seems 50 servers might be too many after all, seeing some "no valid host found" errors in the nodepool log. I've put nl02.opendev.org into the emergency list and will try to find a maximum working value by manual testing now08:34
opendevreviewTony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config  https://review.opendev.org/c/opendev/system-config/+/92132210:15
opendevreviewMerged openstack/project-config master: Drop wheel publishing for centos-8-stream  https://review.opendev.org/c/openstack/project-config/+/92231311:01
opendevreviewMark Goddard proposed openstack/diskimage-builder master: Fix manifest element with non-root user  https://review.opendev.org/c/openstack/diskimage-builder/+/92238511:07
fricklerso it seems max-servers=46 is the sweet spot for openmetal, will watch for another day (including the rush from the periodic pipeline) before doing the final commit on that, though. unless we decide to leave some headroom and that 42 is a nicer number anyway ;)12:31
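For reference, the knob being tuned here is max-servers in the nodepool launcher config in openstack/project-config. A minimal sketch of the relevant stanza; the provider name and cloud entry are illustrative placeholders, not the real file contents:

    providers:
      - name: openmetal-iad3          # illustrative provider name
        cloud: openmetal              # illustrative clouds.yaml entry
        pools:
          - name: main
            max-servers: 46           # the manually-found value discussed above
            # labels omitted; each label also needs a flavor and diskimage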
fricklerseems launchpad is having some major issues, just ftr12:54
fungithanks frickler! i guess at this point we just need a change to s/50/46/ for its max-servers? i can push that up momentarily13:00
fricklerwell I'd prefer to wait until tomorrow just to be on the safe side, but feel free to go ahead. one other thing to think about however is whether we'd want to use the cloud also for larger/nested-virt flavors, in which case we'd likely need to revise the max-servers value again, too13:05
fungioh, yeah this may also point to our quotas being off a bit, so maybe they could be adjusted to more accurately reflect actual capacity13:19
fricklerwe got a nice mention from pleia2 ;) https://floss.social/@pleia2/11264939318159306914:34
*** ykarel_ is now known as ykarel14:59
*** ykarel is now known as ykarel|away15:00
fungii miss her15:07
clarkbfungi: frickler: are there any urgent todos related to the openmetal cloud? I saw there were some issues with uploading images due to pruning the raw images. I was going to suggest we could stop doing that, but I think it may have been sorted out generally?15:12
*** Guest10177 is now known as dasm15:15
fungiclarkb: yeah, that issue is eventually consistent (with the exception of gentoo and centos-8-stream since builds for those are paused)15:16
fungijust took more waiting and/or manually forcing image rotations15:17
clarkbfor centos-8-stream we're going to delete the image entirely in the nearish future which should remove that problem. Not sure if we should do the same for gentoo (it's probably not used much if at all?)15:17
fungithere are also a couple of nodepool upstream fixes related to problems we encountered: image uploads for missing files getting rapidly retried, and api errors blocking uploads for other providers15:17
fungiand we discovered that nodepool randomly wanted to boot nodes in the restricted az, possibly something we can fix on the cloud side but for now we amended nodepool's config to specify the "nova" az15:18
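The AZ pinning fungi mentions is done in the same nodepool pool config; a sketch of just that attribute (name per the nodepool OpenStack driver docs):

        pools:
          - name: main
            availability-zones:
              - nova                  # only boot in the "nova" AZ, avoiding the restricted one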
fungialso it looks like quotas may be tuned too high, max-servers of 50 was resulting in "no valid host" fails but 46 seems to work15:19
clarkbya I saw that change looking at the etherpad todo list. I think specifying the nova az explicitly is correct15:19
fungii'm popping out to lunch briefly but can help dig into things in an hour-ish15:20
clarkbstill no gitea 1.22.1 release15:20
fricklerclarkb: I agree with fungi, nothing urgent, seems to be working smoothly now. my idea for tomorrow, when we have some more data, is to repeat the statistics that sean did and see if the timings for devstack and tempest are comparable to other clouds or maybe even better15:28
clarkbfrickler: great, thank you all for finishing that process up15:29
clarkbhttps://review.opendev.org/c/opendev/gerritlib/+/920837 is a not super important change I pushed up as part of the gerrit upgrade work. Should be an easy review if we can get that pushed in15:42
corvusthe fix for retrying uploads merged, so the order of rax-dfw in the config file shouldn't matter any more15:52
clarkbcorvus: are there any other fixes I should be reviewing?15:52
fricklercorvus: do you want us to revert the ordering in order to verify or are you confident enough about that fix?15:54
corvusi think that was the only nodepool source-code fix; the other thing fungi mentioned was a config file tuning (specifying the nova az); technically that's fine to leave indefinitely, the only question about that is if we wanted to figure out how to tell the cloud not to include the reserved az so that nodepool never sees it.15:54
corvusfrickler: i don't think verification is necessary; we can revert or not depending on our aesthetic choices for the config file :)15:55
fricklerok, I think then we can leave it as is. I just verified that the builders and launchers are running the new nodepool image from 10h ago15:56
opendevreviewClark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing  https://review.opendev.org/c/opendev/gerritlib/+/92083716:27
clarkbfrickler: ^ that should tell us if we still need those caps16:27
opendevreviewClark Boylan proposed opendev/system-config master: Cleanup leftover inmotion configs  https://review.opendev.org/c/opendev/system-config/+/92242216:54
clarkbthat is the last bit of inmotion cleanup that I'm seeing via git grep and hound16:55
fricklerclarkb: looks like the caps are indeed still needed :-( sorry for the annoyance17:10
clarkbno problem I'll restore the older patchset17:10
opendevreviewClark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing  https://review.opendev.org/c/opendev/gerritlib/+/92083717:11
*** jph5 is now known as jph17:46
clarkbfrickler: did you check yet if the hypervisors have nested virt enabled?17:56
clarkbI suspect they do, but I'm thinking I should write a followup email to the openmetal folks to point them at the grafana dashboard and ask about the datadog integration. Could mention this too if it isn't already enabled17:57
fricklerclarkb: I didn't check yet, I can try to do so tomorrow18:04
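A quick way to check nested virt on the hypervisors, assuming Intel CPUs (on AMD hosts the module is kvm_amd instead):

    cat /sys/module/kvm_intel/parameters/nested   # "Y" or "1" means nested virt is enabled
    # cat /sys/module/kvm_amd/parameters/nested   # AMD equivalent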
clarkbhttps://zuul.opendev.org/t/openstack/build/a46b253da05640d3a6febc6d412cb94e two different failures on system-config-run-base. We might need to drop xenial from there (I thought we already had). Not sure what is going on with ara-manage yet18:04
clarkbfrickler: ack it isn't urgent. I'll go ahead and followup on the other bits18:05
clarkbno new ara releases recently18:05
fricklerclarkb: there was some launchpad outage earlier, seems the task wants to install a ppa18:06
clarkbfrickler: ah ok so that is probably transient. Not sure about the ara failure18:06
clarkbmaybe we skipped installing it because of the xenial failure. I'll recheck18:06
fricklernot sure the outage is actually completely resolved yet. but dropping xenial might be a good idea nevertheless18:07
clarkback and agreed18:07
fungilp is still acting a bit "weird"18:08
fricklerthe ara venv is created in the run playbook, which was skipped because of the earlier failure, see opendev.org/opendev/system-config/playbooks/zuul/run-base.yaml18:11
fricklerthe topic in the lp channel still says "down"18:11
clarkbok I shall practice patience in that case then; when I've got other stuff done I'll swing around to cleaning up the xenial testing18:12
fungiyeah, i'm seeing bug comments returned in odd orders, and getting timeout errors trying to add comments to bugs in lp18:25
clarkbbefore I send this email to openmetal, is this a true statement: "we haven't had to make any tuning changes to the cloud"18:28
fungiafaik we haven't yet, though we're still testing the waters and keeping an eye on it18:29
clarkbcool then I think what I've written is fine (basically we haven't tuned it yet but we'll let you know if we think tuning needs to happen)18:30
fungii expect we'll probably need to tune the quota values for nodepool since it seemed to be having trouble finding enough host capacity for 50 nodes18:30
fungiand was getting errors back from nova instead of figuring that out from the available numbers18:31
clarkbyup I've made note of that as well18:31
clarkbI suspect it has to do with the reserved az18:31
clarkbcutting into total available resources for the nova az18:31
fricklerwell as mentioned earlier I think the ceph setup needs tuning19:09
frickleralso still need to investigate the az issue, either nodepool is misbehaving or the cloud is badly set up I think19:10
fricklerthe value of compute_overbooking seemed to be hand tuned to achieve around 50 servers, was this discussed in your meeting with openmetal or did they come up with that value on their own?19:12
fricklercpu_allocation_ratio = 2.619:13
fricklerthat only seems to be set on the four compute nodes, however19:14
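For anyone verifying this later, a sketch of how the setting could be located on a compute node; the /etc/nova path is an assumption and may differ if nova runs in kolla containers:

    grep -r cpu_allocation_ratio /etc/nova/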
fricklerthe other thing we may want to change is the assignment of 2 of the 3 controllers to the nova zone. we may either want to move those to reserved or not run nova-compute on them at all19:15
*** dmellado075 is now known as dmellado0719:44
clarkbfrickler: the ratio was sorted out in the old cloud iirc, yes, to find a balance that resulted in jobs that didn't run too slowly while still giving us a reasonable number of test nodes20:14
clarkbin the meeting they said they would port that ratio over. IMO the default in nova is actually completely wrong and should be changed too fwiw.20:14
clarkbfrickler: re removing the all-in-one nodes from compute, I'm not sure I agree. We ran that way in the old cloud just fine? There were periods of api slowness that may have been due to load, but I don't think any of that was particularly problematic20:15
clarkband having two additional compute nodes allows us to run more VMs20:15
clarkband for ceph I think we can follow up with them. I'm not sure what the concern is yet (I haven't made it through all of the scrollback and may not, so a summary would be helpful)20:16
clarkbaha, searching the scrollback was useful. There are 32 PGs for 7 OSDs. https://docs.ceph.com/en/latest/dev/placement-group/ recommends 100 PGs per OSD, so we're like 25x lower?20:18
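As a rough sanity check using the old rule of thumb from the ceph docs: total PGs ≈ (OSDs × 100) / replica count, so with 7 OSDs and an assumed replica count of 3 that is roughly 700 / 3 ≈ 233, typically rounded to the next power of two (256) and spread across all pools; 32 in a heavily used pool is well short of that.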
clarkbhttps://review.opendev.org/c/opendev/system-config/+/922422 did end up passing after a recheck21:22
fungioh awesome, yuriy says we should be able to crank max-servers back up to 60 now!22:03
fungier, 50 i mean22:05
clarkbwe can see if the grafana reported error count drops off22:07
fungipip 24.1 just arrived. it drops support for python 3.7 and may also start breaking on some non-pep-440 version strings if those are still in use anywhere22:38
fungihttps://pip.pypa.io/en/stable/news/#v24-1 but relevant entries are in the b1 and b2 sections22:41
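For anyone wanting to audit version strings ahead of the pip 24.1 change, a minimal sketch using the packaging library (which implements PEP 440); the sample strings are illustrative:

    # requires the "packaging" library (pip install packaging)
    from packaging.version import Version, InvalidVersion

    for candidate in ("1.2.3", "2.0.0.0rc1", "0.23ubuntu1"):  # illustrative strings
        try:
            Version(candidate)
            print(candidate, "parses as PEP 440")
        except InvalidVersion:
            print(candidate, "is NOT PEP 440 compliant")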
clarkbI can understand why the release notes are organized that way, but it isn't really end-user friendly. I wonder if they could easily roll those up for the final non-beta releases22:50
fungiyeah23:02
