Thursday, 2024-06-20

opendevreviewTony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs  https://review.opendev.org/c/openstack/project-config/+/92235700:27
opendevreviewTony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs  https://review.opendev.org/c/openstack/project-config/+/92235701:03
opendevreviewTony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config  https://review.opendev.org/c/opendev/system-config/+/92132201:16
*** enick_791 is now known as mordred01:37
mordredtonyb: am I showing up properly now?01:38
tonybmordred: Yes, Yes you are :)01:38
mordred\o/01:38
tonybThank you for indulging me :)01:39
mordredsorry about that - I had nickserv username and password configured - but for some reason it wasn't actually setting the nick. who knows01:39
tonybNo problem, at first I thought it was just a "drive-by-chat" thing ... but then others knew who you were/are01:41
opendevreviewTony Breeds proposed opendev/system-config master: Add noble repo files  https://review.opendev.org/c/opendev/system-config/+/92177002:18
opendevreviewTony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later  https://review.opendev.org/c/opendev/system-config/+/92178602:18
opendevreviewTony Breeds proposed opendev/system-config master: Test mirror services on noble  https://review.opendev.org/c/opendev/system-config/+/92177102:18
opendevreviewTony Breeds proposed opendev/system-config master: Remove *zuul-role-integration-centos-8-stream* jobs  https://review.opendev.org/c/opendev/system-config/+/92236002:18
opendevreviewMerged opendev/system-config master: Remove old meetpad and jvb servers  https://review.opendev.org/c/opendev/system-config/+/92076202:31
fricklerI think there was a stuck buildset on openmetal due to the "max-servers: 1" test. there was a zuul-build-image job in "paused" state and I think it couldn't progress because the dependent zuul-quick-start jobs need to run on the same provider? I manually bumped max-servers for openmetal to 3 in order to allow this to progress, will make a patch to bump to 50 next06:35
fricklerjust wondering whether there would be a way for zuul/nodepool to avoid such a deadlock, or whether we just need to enforce some sane minimum for max-servers06:36
fricklersee https://zuul.opendev.org/t/zuul/buildset/a47d2ae8a5d84838946a9bc394d32661 for the timeline of the affected patch06:37
opendevreviewJens Harbott proposed openstack/project-config master: Bump max-servers for openmetal cloud to maximum  https://review.opendev.org/c/openstack/project-config/+/92236807:01
fricklerI'm a bit worried about the ceph setup for openmetal, just 32 PGs per pool seems too low for 7 OSDs, and I also don't trust ceph to autotune this, but let's run some jobs there first and see07:02
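A minimal sketch of how the PG situation could be checked on the ceph side, using standard ceph CLI commands (the pool name in the last line is a placeholder):

    ceph osd pool ls detail           # pg_num / pgp_num per pool
    ceph osd pool autoscale-status    # what the autoscaler would target
    ceph osd df                       # per-OSD usage, PG count in the PGS column
    # if autotuning is not trusted, pg_num can be raised by hand, e.g.:
    # ceph osd pool set <pool> pg_num 128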
opendevreviewMerged openstack/project-config master: Bump max-servers for openmetal cloud to maximum  https://review.opendev.org/c/openstack/project-config/+/92236807:19
fricklerhmm, I guess I never looked at it this closely, but it seems nodepool does not use max-servers as a weight for pool usage? I was surprised to see openmetal getting maxed out during this relatively quiet period, but according to grafana all regions are at about 50 nodes in use, which is like 30% for some regions and 100% for openmetal, not sure if that's intentional?08:28
frickleralso seems 50 servers might be too many after all, seeing some "no valid host found" errors in the nodepool log. I've put nl02.opendev.org into the emergency list and will try to find a maximum working value by manual testing now08:34
opendevreviewTony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config  https://review.opendev.org/c/opendev/system-config/+/92132210:15
opendevreviewMerged openstack/project-config master: Drop wheel publishing for centos-8-stream  https://review.opendev.org/c/openstack/project-config/+/92231311:01
opendevreviewMark Goddard proposed openstack/diskimage-builder master: Fix manifest element with non-root user  https://review.opendev.org/c/openstack/diskimage-builder/+/92238511:07
fricklerso it seems max-servers=46 is the sweet spot for openmetal, will watch for another day (including the rush from the periodic pipeline) before doing the final commit on that, though. unless we decide to leave some headroom and that 42 is a nicer number anyway ;)12:31
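For reference, the knob being tuned here is max-servers in the nodepool launcher config in openstack/project-config. A minimal sketch of the relevant stanza; the provider name and cloud entry are illustrative placeholders, not the real file contents:

    providers:
      - name: openmetal-iad3          # illustrative provider name
        cloud: openmetal              # illustrative clouds.yaml entry
        pools:
          - name: main
            max-servers: 46           # the manually-found value discussed above
            # labels omitted; each label also needs a flavor and diskimage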
fricklerseems launchpad is having some major issues, just ftr12:54
fungithanks frickler! i guess at this point we just need a change to s/50/46/ for its max-servers? i can push that up momentarily13:00
fricklerwell I'd prefer to wait until tomorrow just to be on the safe side, but feel free to go ahead. one other thing to think about however is whether we'd want to use the cloud also for larger/nested-virt flavors, in which case we'd likely need to revise the max-servers value again, too13:05
fungioh, yeah this may also point to our quotas being off a bit, so maybe they could be adjusted to more accurately reflect actual capacity13:19
fricklerwe got a nice mention from pleia2 ;) https://floss.social/@pleia2/11264939318159306914:34
*** ykarel_ is now known as ykarel14:59
*** ykarel is now known as ykarel|away15:00
fungii miss her15:07
clarkbfungi: frickler: are there any urgent todos related to the openmetal cloud? I saw there were some issues with uploading images due to pruning the raw images. I was going to suggest we could stop doing that, but I think it may have been sorted out generally?15:12
*** Guest10177 is now known as dasm15:15
fungiclarkb: yeah, that issue is eventually consistent (with the exception of gentoo and centos-8-stream since builds for those are paused)15:16
fungijust took more waiting and/or manually forcing image rotations15:17
clarkbfor centos-8-stream we're going to delete the image entirely in the nearish future which should remove that problem. Not sure if we should do the same for gentoo (it's probably not used much if at all?)15:17
fungithere are also a couple of nodepool upstream fixes related to problems we encountered: image uploads for missing files getting rapidly retried, and api errors blocking uploads for other providers15:17
fungiand we discovered that nodepool randomly wanted to boot nodes in the restricted az, possibly something we can fix on the cloud side but for now we amended nodepool's config to specify the "nova" az15:18
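The AZ pinning fungi mentions is done in the same nodepool pool config; a sketch of just that attribute (name per the nodepool OpenStack driver docs):

        pools:
          - name: main
            availability-zones:
              - nova                  # only boot in the "nova" AZ, avoiding the restricted one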
fungialso it looks like quotas may be tuned too high, max-servers of 50 was resulting in "no valid host" fails but 46 seems to work15:19
clarkbya I saw that change looking at the etherpad todo list. I think specifying the nova az explicitly is correct15:19
fungii'm popping out to lunch briefly but can help dig into things in an hour-ish15:20
clarkbstill no gitea 1.22.1 release15:20
fricklerclarkb: I agree with fungi, nothing urgent, seems to be working smoothly now. my idea for tomorrow, when we have some more data, is to repeat the statistics that sean did and see if the timings for devstack and tempest are comparable to other clouds or maybe even better15:28
clarkbfrickler: great, thank you all for finishing that process up15:29
clarkbhttps://review.opendev.org/c/opendev/gerritlib/+/920837 is a not super important change I pushed up as part of the gerrit upgrade work. Should be an easy review if we can get that pushed in15:42
corvusthe fix for retrying uploads merged, so the order of rax-dfw in the config file shouldn't matter any more15:52
clarkbcorvus: are there any other fixes I should be reviewing?15:52
fricklercorvus: do you want us to revert the ordering in order to verify or are you confident enough about that fix?15:54
corvusi think that was the only nodepool source-code fix; the other thing fungi mentioned was a config file tuning (specifying the nova az); technically that's fine to leave indefinitely, the only question about that is if we wanted to figure out how to tell the cloud not to include the reserved az so that nodepool never sees it.15:54
corvusfrickler: i don't think verification is necessary; we can revert or not depending on our aesthetic choices for the config file :)15:55
fricklerok, I think then we can leave it as is. I just verified that the builders and launchers are running the new nodepool image from 10h ago15:56
opendevreviewClark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing  https://review.opendev.org/c/opendev/gerritlib/+/92083716:27
clarkbfrickler: ^ that should tell us if we still need those caps16:27
opendevreviewClark Boylan proposed opendev/system-config master: Cleanup leftover inmotion configs  https://review.opendev.org/c/opendev/system-config/+/92242216:54
clarkbthat is the last bit of inmotion cleanup that I'm seeing via git grep and hound16:55
fricklerclarkb: looks like the caps are indeed still needed :-( sorry for the annoyance17:10
clarkbno problem I'll restore the older patchset17:10
opendevreviewClark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing  https://review.opendev.org/c/opendev/gerritlib/+/92083717:11
*** jph5 is now known as jph17:46
clarkbfrickler: did you check yet if the hypervisors have nested virt enabled?17:56
clarkbI suspect they do, but I'm thinking I should write a followup email to the openmetal folks to point them at the grafana dashboard and ask about the datadog integration. Could mention this too if it isn't already enabled17:57
fricklerclarkb: I didn't check yet, I can try to do so tomorrow18:04
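A quick way to check nested virt on the hypervisors, assuming Intel CPUs (on AMD hosts the module is kvm_amd instead):

    cat /sys/module/kvm_intel/parameters/nested   # "Y" or "1" means nested virt is enabled
    # cat /sys/module/kvm_amd/parameters/nested   # AMD equivalent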
clarkbhttps://zuul.opendev.org/t/openstack/build/a46b253da05640d3a6febc6d412cb94e two different failures on system-config-run-base. We might need to drop xenial from there (I thought we already had). Not sure what is going on with ara-manage yet18:04
clarkbfrickler: ack it isn't urgent. I'll go ahead and followup on the other bits18:05
clarkbno new ara releases recently18:05
fricklerclarkb: there was some launchpad outage earlier, seems the task wants to install a ppa18:06
clarkbfrickler: ah ok so that is probably transient. Not sure about the ara failure18:06
clarkbmaybe we skipped installing it because of the xenial failure. I'll recheck18:06
fricklernot sure the outage is actually completely resolved yet. but dropping xenial might be a good idea nevertheless18:07
clarkback and agreed18:07
fungilp is still acting a bit "weird"18:08
fricklerthe ara venv is created in the run playbook, which was skipped because of the earlier failure, see opendev.org/opendev/system-config/playbooks/zuul/run-base.yaml18:11
fricklerthe topic in the lp channel still says "down"18:11
clarkbok I shall practice patience in that case then; when I've got other stuff done I'll swing around to cleaning up the xenial testing18:12
fungiyeah, i'm seeing bug comments returned in odd orders, and getting timeout errors trying to add comments to bugs in lp18:25
clarkbbefore I send this email to openmetal, is this a true statement: "we haven't had to make any tuning changes to the cloud"18:28
fungiafaik we haven't yet, though we're still testing the waters and keeping an eye on it18:29
clarkbcool then I think what I've written is fine (basically we haven't tuned it yet but we'll let you know if we think tuning needs to happen)18:30
fungii expect we'll probably need to tune the quota values for nodepool since it seemed to be having trouble finding enough host capacity for 50 nodes18:30
fungiand was getting errors back from nova instead of figuring that out from the available numbers18:31
clarkbyup I've made note of that as well18:31
clarkbI suspect it has to do with the reserved az18:31
clarkbcutting into total available resources for the nova az18:31
fricklerwell as mentioned earlier I think the ceph setup needs tuning19:09
frickleralso still need to investigate the az issue, either nodepool is misbehaving or the cloud is badly set up I think19:10
fricklerthe value of compute_overbooking seemed to be hand tuned to achieve around 50 servers, was this discussed in your meeting with openmetal or did they come up with that value on their own?19:12
fricklercpu_allocation_ratio = 2.619:13
fricklerthat only seems to be set on the four compute nodes, however19:14
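For anyone verifying this later, a sketch of how the setting could be located on a compute node; the /etc/nova path is an assumption and may differ if nova runs in kolla containers:

    grep -r cpu_allocation_ratio /etc/nova/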
fricklerthe other thing we may want to change is the assignment of 2 of the 3 controllers to the nova zone. we may either want to move those to reserved or not run nova-compute on them at all19:15
*** dmellado075 is now known as dmellado0719:44
clarkbfrickler: the ratio was sorted out in the old cloud iirc, yes, to find a balance that resulted in jobs that didn't run too slowly while still giving us a reasonable number of test nodes20:14
clarkbin the meeting they said they would port that ratio over. IMO the default in nova is actually completely wrong and should be changed too fwiw.20:14
clarkbfrickler: re removing the all-in-one nodes from compute, I'm not sure I agree. We ran that way in the old cloud just fine? There were periods of api slowness that may have been due to load, but I don't think any of that was particularly problematic20:15
clarkband having two additional compute nodes allows us to run more VMs20:15
clarkband for ceph I think we can follow up with them. I'm not sure what the concern is yet (I haven't made it through all of the scrollback and may not, so a summary would be helpful)20:16
clarkbaha, searching the scrollback was useful. There are 32 PGs for 7 OSDs. https://docs.ceph.com/en/latest/dev/placement-group/ recommends 100 PGs per OSD, so we're like 25x lower?20:18
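As a rough sanity check using the old rule of thumb from the ceph docs: total PGs ≈ (OSDs × 100) / replica count, so with 7 OSDs and an assumed replica count of 3 that is roughly 700 / 3 ≈ 233, typically rounded to the next power of two (256) and spread across all pools; 32 in a heavily used pool is well short of that.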
clarkbhttps://review.opendev.org/c/opendev/system-config/+/922422 did end up passing after a recheck21:22
fungioh awesome, yuriy says we should be able to crank max-servers back up to 60 now!22:03
fungier, 50 i mean22:05
clarkbwe can see if the grafana reported error count drops off22:07
fungipip 24.1 just arrived. it drops support for python 3.7 and may also start breaking on some non-pep-440 version strings if those are still in use anywhere22:38
fungihttps://pip.pypa.io/en/stable/news/#v24-1 but relevant entries are in the b1 and b2 sections22:41
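For anyone wanting to audit version strings ahead of the pip 24.1 change, a minimal sketch using the packaging library (which implements PEP 440); the sample strings are illustrative:

    # requires the "packaging" library (pip install packaging)
    from packaging.version import Version, InvalidVersion

    for candidate in ("1.2.3", "2.0.0.0rc1", "0.23ubuntu1"):  # illustrative strings
        try:
            Version(candidate)
            print(candidate, "parses as PEP 440")
        except InvalidVersion:
            print(candidate, "is NOT PEP 440 compliant")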
clarkbI can understand why the release notes are organized that way, but it isn't really end-user friendly. I wonder if they could easily roll those up for the final non-beta releases22:50
fungiyeah23:02
