opendevreview | Tony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs https://review.opendev.org/c/openstack/project-config/+/922357 | 00:27 |
opendevreview | Tony Breeds proposed openstack/project-config master: Remove publish-wheel-cache-centos-8-stream* jobs https://review.opendev.org/c/openstack/project-config/+/922357 | 01:03 |
opendevreview | Tony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config https://review.opendev.org/c/opendev/system-config/+/921322 | 01:16 |
*** enick_791 is now known as mordred | 01:37 | |
mordred | tonyb: am I showing up properly now? | 01:38 |
tonyb | mordred: Yes, Yes you are :) | 01:38 |
mordred | \o/ | 01:38 |
tonyb | Thank you for indulging me :) | 01:39 |
mordred | sorry about that - I had nickserv username and password configured - but for some reason it wasn't actually setting the nick. who knows | 01:39 |
tonyb | No problem, at first I thought it was just a "drive-by-chat" thing ... but then others knew who you were/are | 01:41 |
opendevreview | Tony Breeds proposed opendev/system-config master: Add noble repo files https://review.opendev.org/c/opendev/system-config/+/921770 | 02:18 |
opendevreview | Tony Breeds proposed opendev/system-config master: Don't use the openstack-ci-core/openafs PPA on noble and later https://review.opendev.org/c/opendev/system-config/+/921786 | 02:18 |
opendevreview | Tony Breeds proposed opendev/system-config master: Test mirror services on noble https://review.opendev.org/c/opendev/system-config/+/921771 | 02:18 |
opendevreview | Tony Breeds proposed opendev/system-config master: Remove *zuul-role-integration-centos-8-stream* jobs https://review.opendev.org/c/opendev/system-config/+/922360 | 02:18 |
opendevreview | Merged opendev/system-config master: Remove old meetpad and jvb servers https://review.opendev.org/c/opendev/system-config/+/920762 | 02:31 |
frickler | I think there was a stuck buildset on openmetal due to the "max-servers: 1" test. there was a zuul-build-image job in "paused" state and I think it couldn't progress because the dependent zuul-quick-start job needs to run on the same provider? I manually bumped max-servers for openmetal to 3 in order to allow this to progress, will make a patch to bump it to 50 next | 06:35 |
frickler | just wondering whether there would be a way for zuul/nodepool to prevent such a deadlock, or whether we just need to enforce some sane minimum for max-servers | 06:36 |
frickler | see https://zuul.opendev.org/t/zuul/buildset/a47d2ae8a5d84838946a9bc394d32661 for the timeline of the affected patch | 06:37 |
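For context, a minimal sketch of the kind of provider pool setting involved, following the usual nodepool OpenStack-driver layout; the provider and pool names are hypothetical and the values illustrative, not the actual project-config contents:

```yaml
# Sketch only: names are hypothetical, values illustrative. With
# max-servers: 1, a paused zuul-build-image build holds the only slot, so the
# dependent zuul-quick-start job that must land on the same provider can
# never get a node and the buildset deadlocks.
providers:
  - name: openmetal
    pools:
      - name: main
        max-servers: 3   # bumped from 1 so a second node can be scheduled
```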
opendevreview | Jens Harbott proposed openstack/project-config master: Bump max-servers for openmetal cloud to maximum https://review.opendev.org/c/openstack/project-config/+/922368 | 07:01 |
frickler | I'm a bit worried about the ceph setup for openmetal, just 32 pgs per pool seems too low for 7 OSDs. and I also don't trust ceph to autotune this, but let's run some jobs there first and see | 07:02 |
opendevreview | Merged openstack/project-config master: Bump max-servers for openmetal cloud to maximum https://review.opendev.org/c/openstack/project-config/+/922368 | 07:19 |
frickler | hmm, I guess I never looked at it this closely, but it seems nodepool does not use max-servers as a weight for pool usage? I was surprised to see openmetal getting maxed out during this relatively quiet period, but according to grafana all regions have about 50 nodes in use, which is like 30% for some regions and 100% for openmetal; not sure if that's intentional? | 08:28 |
frickler | also it seems 50 servers might be too much after all, I'm seeing some "no valid host found" errors in the nodepool log. I've put nl02.opendev.org into the emergency list and will try to find the maximum working value by manual testing now | 08:34 |
opendevreview | Tony Breeds proposed opendev/system-config master: DNM: Initial dump or mediawiki role and config https://review.opendev.org/c/opendev/system-config/+/921322 | 10:15 |
opendevreview | Merged openstack/project-config master: Drop wheel publishing for centos-8-stream https://review.opendev.org/c/openstack/project-config/+/922313 | 11:01 |
opendevreview | Mark Goddard proposed openstack/diskimage-builder master: Fix manifest element with non-root user https://review.opendev.org/c/openstack/diskimage-builder/+/922385 | 11:07 |
frickler | so it seems max-servers=46 is the sweet spot for openmetal, will watch for another day (including the rush from the periodic pipeline) before doing the final commit on that, though. unless we decide to leave some headroom and that 42 is a nicer number anyway ;) | 12:31 |
frickler | seems launchpad is having some major issues, just ftr | 12:54 |
fungi | thanks frickler! i guess at this point we just need a change to s/50/46/ for its max-servers? i can push that up momentarily | 13:00 |
frickler | well I'd prefer to wait until tomorrow just to be on the safe side, but feel free to go ahead. one other thing to think about however is whether we'd want to use the cloud also for larger/nested-virt flavors, in which case we'd likely need to revise the max-servers value again, too | 13:05 |
fungi | oh, yeah this may also point to our quotas being off a bit, so maybe they could be adjusted to more accurately reflect actual capacity | 13:19 |
frickler | we got a nice mention from pleia2 ;) https://floss.social/@pleia2/112649393181593069 | 14:34 |
*** ykarel_ is now known as ykarel | 14:59 | |
*** ykarel is now known as ykarel|away | 15:00 | |
fungi | i miss her | 15:07 |
clarkb | fungi: frickler: are there any urgent todos related to the openmetal cloud? I saw there were some issues with uploading images due to pruning the raw images. I was going to suggest we could stop doing that, but I think it may have been sorted out generally? | 15:12 |
*** Guest10177 is now known as dasm | 15:15 | |
fungi | clarkb: yeah, that issue is eventually consistent (with the exception of gentoo and centos-8-stream since builds for those are paused) | 15:16 |
fungi | just took more waiting and/or manually forcing image rotations | 15:17 |
clarkb | for centos-8-stream we're going to delete the image entirely in the nearish future which should remove that problem. Not sure if we should do the same for gentoo (it's probably not used much if at all?) | 15:17 |
fungi | there are also a couple of nodepool upstream fixes for problems we encountered: image uploads for missing files getting rapidly retried, and api errors from one provider blocking uploads to other providers | 15:17 |
fungi | and we discovered that nodepool randomly wanted to boot nodes in the restricted az, possibly something we can fix on the cloud side but for now we amended nodepool's config to specify the "nova" az | 15:18 |
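A minimal sketch of the shape of that config amendment, using the nodepool OpenStack driver's pool-level availability-zones option; the provider and pool names are hypothetical:

```yaml
# Sketch only; provider and pool names are hypothetical. Listing the AZ
# explicitly keeps nodepool from ever scheduling into the restricted AZ.
providers:
  - name: openmetal
    pools:
      - name: main
        availability-zones:
          - nova
```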
fungi | also it looks like quotas may be tuned too high, max-servers of 50 was resulting in "no valid host" fails but 46 seems to work | 15:19 |
clarkb | ya I saw that change looking at the etherpad todo list. I think specifying the nova az explicitly is correct | 15:19 |
fungi | i'm popping out to lunch briefly but can help dig into things in an hour-ish | 15:20 |
clarkb | still no gitea 1.22.1 release | 15:20 |
frickler | clarkb: I agree with fungi, nothing urgent, seems to be working smoothly now. my idea for tomorrow, when we have some more data, is to repeat the statistics that sean did and see whether the timings for devstack and tempest are comparable to other clouds or maybe even better | 15:28 |
clarkb | frickler: great, thank you all for finishing that process up | 15:29 |
clarkb | https://review.opendev.org/c/opendev/gerritlib/+/920837 is a not super important change I pushed up as part of the gerrit upgrade work. Should be an easy review if we can get that pushed in | 15:42 |
corvus | the fix for retrying uploads merged, so the order of rax-dfw in the config file shouldn't matter any more | 15:52 |
clarkb | corvus: are there any other fixes I should be reviewing? | 15:52 |
frickler | corvus: do you want us to revert the ordering in order to verify or are you confident enough about that fix? | 15:54 |
corvus | i think that was the only nodepool source-code fix; the other thing fungi mentioned was a config file tuning (specifying the nova az); technically that's fine to leave indefinitely, the only question about that is if we wanted to figure out how to tell the cloud not to include the reserved az so that nodepool never sees it. | 15:54 |
corvus | frickler: i don't think verification is necessary; we can revert or not depending on our aesthetic choices for the config file :) | 15:55 |
frickler | ok, I think then we can leave it as is. I just verified that the builders and launchers are running the new nodepool image from 10h ago | 15:56 |
opendevreview | Clark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing https://review.opendev.org/c/opendev/gerritlib/+/920837 | 16:27 |
clarkb | frickler: ^ that should tell us if we still need those caps | 16:27 |
opendevreview | Clark Boylan proposed opendev/system-config master: Cleanup leftover inmotion configs https://review.opendev.org/c/opendev/system-config/+/922422 | 16:54 |
clarkb | that is the last bit of inmotion cleanup that I'm seeing via git grep and hound | 16:55 |
frickler | clarkb: looks like the caps are indeed still needed :-( sorry for the annoyance | 17:10 |
clarkb | no problem I'll restore the older patchset | 17:10 |
opendevreview | Clark Boylan proposed opendev/gerritlib master: Fixup gerritlib jeepyb integration testing https://review.opendev.org/c/opendev/gerritlib/+/920837 | 17:11 |
*** jph5 is now known as jph | 17:46 | |
clarkb | frickler: did you check yet if the hypervisors have nested virt enabled? | 17:56 |
clarkb | I suspect they do, but I'm thinking I should write a followup email to the openmetal folks to point them at the grafana dashboard and ask about the datadog integration. Could mention this too if it isn't already enabled | 17:57 |
frickler | clarkb: I didn't check yet, I can try to do so tomorrow | 18:04 |
clarkb | https://zuul.opendev.org/t/openstack/build/a46b253da05640d3a6febc6d412cb94e two different failures on system-config-run-base. We might need to drop xenial from there (I thought we already had). Not sure what is going on with ara-manage yet | 18:04 |
clarkb | frickler: ack it isn't urgent. I'll go ahead and followup on the other bits | 18:05 |
clarkb | no new ara releases recently | 18:05 |
frickler | clarkb: there was some launchpad outage earlier, seems the task wants to install a ppa | 18:06 |
clarkb | frickler: ah ok so that is probably transient. Not sure about the ara failure | 18:06 |
clarkb | maybe we skipped installing it because of the xenial failure. I'll recheck | 18:06 |
frickler | not sure the outage is actually completely resolved yet. but dropping xenial might be a good idea nevertheless | 18:07 |
clarkb | ack and agreed | 18:07 |
fungi | lp is still acting a bit "weird" | 18:08 |
frickler | the ara venv is created in the run playbook, which was skipped because of the earlier failure opendev.org/opendev/system-config/playbooks/zuul/run-base.yaml | 18:11 |
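Not the actual run-base.yaml content, just a hedged illustration of why a mid-run failure leaves ara-manage unavailable: the venv is only built by a task in the run playbook, so anything aborting the run earlier skips it. A sketch using Ansible's pip module, with a made-up venv path:

```yaml
# Hypothetical task shape (path and naming invented for illustration): if the
# plays before this one fail, this never runs and there is no ara venv for
# ara-manage to use afterwards.
- name: Install ara into a virtualenv for report generation
  ansible.builtin.pip:
    name: ara
    virtualenv: /root/ara-venv
    virtualenv_command: python3 -m venv
```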
frickler | the topic in the lp channel still says "down" | 18:11 |
clarkb | ok I shall practice patience in that case then. when I've got other stuff done I'll swing around to cleaning up the xenial testing | 18:12 |
fungi | yeah, i'm seeing bug comments returned in odd orders, and getting timeout errors trying to add comments to bugs in lp | 18:25 |
clarkb | before I send this email to openmetal, is this a true statement: "we haven't had to make any tuning changes to the cloud" | 18:28 |
fungi | afaik we haven't yet, though we're still testing the waters and keeping an eye on it | 18:29 |
clarkb | cool then I think what I've written is fine (basically we haven't tuned it yet but we'll let you know if we think tuning needs to happen) | 18:30 |
fungi | i expect we'll probably need to tune the quota values for nodepool since it seemed to be having trouble finding enough host capacity for 50 nodes | 18:30 |
fungi | and was getting errors back from nova instead of figuring that out from the available numbers | 18:31 |
clarkb | yup I've made note of that as well | 18:31 |
clarkb | I suspect it has to do with the reserved az | 18:31 |
clarkb | cutting into total available resources for the nova az | 18:31 |
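For reference, the nodepool OpenStack driver also accepts pool-level max-cores and max-ram caps alongside max-servers, so the pool can be held below the tenant's nominal quota when part of the cloud (here, the reserved AZ) isn't actually usable. The numbers below are illustrative only:

```yaml
# Illustrative values only (46 nodes at an assumed 8 vCPU / 8 GiB each);
# capping cores and RAM below the nominal quota leaves headroom for capacity
# that nova reports but nodepool can't actually use.
pools:
  - name: main
    max-servers: 46
    max-cores: 368
    max-ram: 376832   # MiB
```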
frickler | well as mentioned earlier I think the ceph setup needs tuning | 19:09 |
frickler | also still need to investigate the az issue, either nodepool is misbehaving or the cloud is badly set up I think | 19:10 |
frickler | the value of compute_overbooking seemed to be hand tuned to achieve around 50 servers, was this discussed in your meeting with openmetal or did they come up with that value on their own? | 19:12 |
frickler | cpu_allocation_ratio = 2.6 | 19:13 |
frickler | that only seems to be set on the four compute nodes, however | 19:14 |
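Roughly, that corresponds to the following nova.conf setting on the compute nodes, plus some back-of-the-envelope arithmetic; the per-host core count below is purely an assumption for illustration:

```ini
# nova.conf on the compute nodes; 2.6 is the value observed on the cloud
[DEFAULT]
cpu_allocation_ratio = 2.6
```

Assuming, say, 40 physical cores per host across the 4 compute nodes, that yields about 4 x 40 x 2.6 = 416 schedulable vCPUs, i.e. roughly 52 eight-vCPU test nodes, which lines up with a target of around 50 servers.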
frickler | the other thing we may want to change is the assignment of 2 of the 3 controllers to the nova zone. we may either want to move those to reserved or not run nova-compute on them at all | 19:15 |
*** dmellado075 is now known as dmellado07 | 19:44 | |
clarkb | frickler: the ratio was sorted out in the old cloud iirc, yes, to find a balance that kept jobs from running too slowly while still giving us a reasonable number of test nodes | 20:14 |
clarkb | in the meeting they said they would port that ratio over. IMO the default in nova is actually completely wrong and should be changed too fwiw. | 20:14 |
clarkb | frickler: re removing the all-in-one nodes from compute I'm not sure I agree. We ran that way in the old cloud just fine? There were periods of api slowness that may have been due to load, but I don't think any of that was particularly problematic | 20:15 |
clarkb | and having two additional compute nodes allows us to run more VMs | 20:15 |
clarkb | and for ceph I think we can follow up with them. I'm not sure what the concern is yet (I haven't made it through all of the scrollback and may not, so a summary would be helpful) | 20:16 |
clarkb | aha search in scrollback was useful. There are 32 pgs for 7 osds. https://docs.ceph.com/en/latest/dev/placement-group/ recommends 100 pgs per osd. So we're like 25x lower? | 20:18 |
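A quick sketch of the usual rule-of-thumb sizing, assuming a replication factor of 3 (the actual pool size on this cloud hasn't been checked here):

```python
# Rule-of-thumb PG sizing per the linked Ceph docs: target ~100 PGs per OSD,
# divide by the replication factor, then round up to a power of two.
osds = 7
target_pgs_per_osd = 100
pool_size = 3  # assumed replication factor

total_pgs = osds * target_pgs_per_osd / pool_size    # ~233
pg_num = 1 << (int(total_pgs) - 1).bit_length()      # next power of two: 256
print(pg_num)  # 256, versus the 32 currently configured
```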
clarkb | https://review.opendev.org/c/opendev/system-config/+/922422 did end up passing after a recheck | 21:22 |
fungi | oh awesome, yuriy says we should be able to crank max-servers back up to 60 now! | 22:03 |
fungi | er, 50 i mean | 22:05 |
clarkb | we can see if the grafana reported error count drops off | 22:07 |
fungi | pip 24.1 just arrived. drops support for python 3.7, may also start breaking on some non-pep-440 version strings if still in use anywhere | 22:38 |
fungi | https://pip.pypa.io/en/stable/news/#v24-1 but relevant entries are in the b1 and b2 sections | 22:41 |
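A small illustration of the second point, using the packaging library that pip vendors; the non-compliant version string below is just an example of the kind of string that stops parsing:

```python
# pip 24.1's vendored packaging no longer falls back to "legacy" version
# parsing, so distributions with non-PEP-440 versions start raising errors.
from packaging.version import Version, InvalidVersion

for candidate in ["24.1", "2.0.0-SNAPSHOT"]:
    try:
        print(candidate, "->", Version(candidate))
    except InvalidVersion:
        print(candidate, "-> not PEP 440 compliant; rejected by pip >= 24.1")
```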
clarkb | I can understand why the release notes are organized that way, but it isn't really end-user friendly. I wonder if they could easily roll those up for the final non-beta releases | 22:50 |
fungi | yeah | 23:02 |