ianw | clarkb: should we merge 811233 and restart with it? | 01:11 |
---|---|---|
Clark[m] | Ya I think we should. If you approve it I can sort out the restart tomorrow | 01:26 |
Clark[m] | Or if you want to get it done feel free | 01:26 |
ianw | i'll get it in and see how we go | 01:36 |
opendevreview | Merged opendev/system-config master: Properly copy gerrit static files https://review.opendev.org/c/opendev/system-config/+/811233 | 02:26 |
opendevreview | 赵晨凯 proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 02:46 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 02:50 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 03:17 |
frickler | ehm, did the latest stuff clean up autoholds? my hold from yesterday evening seems to be gone and there is only one currently which has an id of 0000000000 | 03:32 |
frickler | ah, I should've read all backlog | 03:32 |
opendevreview | fupingxie proposed openstack/project-config master: test https://review.opendev.org/c/openstack/project-config/+/811295 | 04:03 |
opendevreview | Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672 | 04:48 |
opendevreview | Ian Wienand proposed opendev/system-config master: infra-prod: clone source once https://review.opendev.org/c/opendev/system-config/+/807808 | 04:48 |
*** ysandeep|out is now known as ysandeep | 05:51 | |
ianw | clarkb: i doubt i will make the meeting, but i added some notes on the parallel job changes. i think they're ready for review now | 07:03 |
ianw | https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc?context=explore | 07:05 |
ianw | is the latest gerrit image | 07:05 |
ianw | that matches what we have for gerrit 3.2 tag ... "opendevorg/gerrit@sha256:8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc" | 07:06 |
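A digest check like the one ianw describes can be reproduced against the locally pulled image; a minimal sketch, assuming the opendevorg/gerrit:3.2 image is already present on the Gerrit host:

```shell
# Print the repo digest of the locally pulled 3.2 image; if it matches the
# sha256 shown on Docker Hub, a restart will not pull anything new.
docker image inspect opendevorg/gerrit:3.2 --format '{{index .RepoDigests 0}}'
```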
ianw | so i'll just do a quick restart to pick up the fixed static content changes | 07:06 |
ianw | #status log restarted gerrit to pick up https://review.opendev.org/c/opendev/system-config/+/811233 | 07:08 |
opendevstatus | ianw: finished logging | 07:08 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 07:09 |
*** ianw is now known as ianw_pto | 07:21 | |
*** jpena|off is now known as jpena | 07:31 | |
*** ykarel is now known as ykarel|lunch | 09:02 | |
*** ykarel|lunch is now known as ykarel | 10:15 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 10:44 |
*** ysandeep is now known as ysandeep|brb | 10:47 | |
opendevreview | Yuriy Shyyan proposed openstack/project-config master: Adjusting tenancy limits for this cloud. https://review.opendev.org/c/openstack/project-config/+/811395 | 10:59 |
yuriys | ianw: clarkb: just saw that from yesterday, unlucky... adjusting the limit today. We use the libvirt driver, and libvirt uses the kvm virt_type, not qemu. The biggest issue I'm trying to figure out is why, when a change set requires multiple instances for testing, they all seem to get started/pushed on the same baremetal node | 11:06 |
yuriys | it has an explosive effect and doesn't naturally balance out. | 11:08 |
*** ysandeep|brb is now known as ysandeep | 11:13 | |
*** bhagyashris_ is now known as bhagyashris|rover | 11:16 | |
yuriys | in the nl files, what does rate: do? I can't find that info in the zuul-ci.org docs | 11:16 |
*** jpena is now known as jpena|lunch | 11:24 | |
*** dviroel|out is now known as dviroel | 11:27 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 12:14 |
*** jpena|lunch is now known as jpena | 12:17 | |
*** ysandeep is now known as ysandeep|brb | 12:23 | |
*** ysandeep|brb is now known as ysandeep | 12:31 | |
fungi | yuriys: the rate there is a throttle for how quickly nodepool will make api calls to the provider | 14:12 |
fungi | some providers have rate limiters in front of their apis and will return errors if nodepool makes too many calls in rapid succession | 14:13 |
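The rate setting fungi describes is a per-provider option in the nodepool launcher configuration (the nl*.yaml files); a minimal sketch, with the provider and cloud names chosen purely for illustration:

```yaml
# Nodepool launcher provider sketch: "rate" is the delay in seconds that
# nodepool waits between API operations against this provider.
providers:
  - name: example-provider   # illustrative name, not an actual provider
    cloud: example-cloud
    rate: 1.0                # at most one API call per second
```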
yuriys | I'm just concerned that placement may not be instant, still validating that part. So if it's returning the same zone/node availability to nova, that may be causing the instances to be scheduled on the same infra nodes. | 14:23 |
opendevreview | Merged openstack/project-config master: Adjusting tenancy limits for this cloud. https://review.opendev.org/c/openstack/project-config/+/811395 | 14:24 |
yuriys | Although there is a whole message queuing system inside, so that part isn't up to your rate limit, but rather how placement handles its queue, I suppose. | 14:24 |
yuriys | Basically the problem I'm trying to solve is that sometimes I'll see a node with, say, 11 instances on it while another one only has 5 or 6, so it's not 'balanced'. I've been thinking maybe tweaking overcommits to force the correct balancing behavior, so that once one node gets to 8 or so, new instances have to be provisioned on some other node and it won't be in the node list placement returns. | 14:27 |
fungi | i guess placement tries to follow something like a round-robin or least-loaded scheme? | 14:27 |
yuriys | From everything I have seen that is not the case, or maybe we didn't configure that part properly. | 14:28 |
fungi | least-loaded could work at cross purposes since that might allow one under-utilized host to suddenly get a lot of instances placed | 14:28 |
yuriys | imo it should always return the least loaded node. | 14:28 |
fungi | depending on how racy the determination is | 14:28 |
yuriys | Yeah, when I see a subset of tests return this: | 14:29 |
yuriys | (victoria) [root@lucky-firefox ~]# for i in 682b7f82-3433-474a-a4c1-76c8a8316abd 64f48d2c-9cf8-4c3d-86f7-017a4f7f6ad8 aaf52bf4-e0a9-41b8-a307-1b0e637bcb69; do openstack server show $i -c OS-EXT-SRV-ATTR:host -f shell; done | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | I go full /reeeeee | 14:29 |
yuriys | And I'm not sure throwing more hardware at it would solve this problem, unless I figure out why placement is misbehaving, basically. | 14:30 |
yuriys | Otherwise it looks like you'll be like 'hey cloud, give me 3 instances for this test', and it will just create 3 instances on 1 node regardless of how many there are in the cloud. | 14:31 |
yuriys | Yeah looks like we don't really customize placement, womp womp. It's all like defaults. | 14:34 |
fungi | i see placement.randomize_allocation_candidates is false by default, i wonder if that would help | 14:34 |
fungi | just looking through the config sample and docs for it now, i'm unfortunately not particularly familiar with it | 14:35 |
yuriys | maybe enabling randomize_allocation_candidates would help , idk, worth a try | 14:35 |
*** ykarel is now known as ykarel|away | 14:36 | |
fungi | https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates | 14:37 |
fungi | yeah, it seems like a long shot | 14:37 |
yuriys | that docstring describes what I think is happening to me sometimes: "That is, all things being equal, two requests for allocation candidates will return the same results in the same order" | 14:38 |
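The option under discussion is toggled in placement.conf on the placement service; a minimal sketch of what enabling it might look like (the value shown is the opposite of the default):

```ini
# placement.conf sketch: shuffle allocation candidates so equivalent requests
# do not keep returning the same hosts in the same order (defaults to false).
[placement]
randomize_allocation_candidates = true
```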
fungi | it looks like more advanced load distribution would maybe have to be done in nova's scheduler still? i'm trying to quickly digest the docs | 14:40 |
fungi | ahh, maybe what i was expecting to be static configuration is actually behaviors set through the placement api, via creation of "resource providers"? | 14:45 |
fungi | anyway, i need to switch gears, more meetings on the way | 14:45 |
yuriys | yup, ty for taking a peek, we'll see how we do with the new limit, tired of ianw yelling at me! | 14:46 |
*** ysandeep is now known as ysandeep|out | 14:57 | |
clarkb | yuriys: fungi: melwitt helped with placement things when we had the leaks and may have input too. Though I think this morning everyone is still trying to sort through the devstack apache issue | 15:10 |
*** marios is now known as marios|out | 15:50 | |
corvus | clarkb: i believe the change at the head of the starlingx gate queue is stuck due to the zuul issue. | 16:14 |
opendevreview | Alfredo Moralejo proposed openstack/project-config master: Add support for CentOS Stream 9 in nodepool elements https://review.opendev.org/c/openstack/project-config/+/811442 | 16:15 |
fungi | corvus: clarkb and i are both on a call at the moment but i can try to take a look | 16:17 |
corvus | fungi: no need, i'm looking into the zuul bug. at some point we may want to dequeue/enqueue to see if it fixes it, but for now i'd appreciate the opportunity to learn more in situ | 16:18 |
fungi | corvus: oh, thanks, no problem i can try to get the tests going for that change again once you're done looking at it | 16:19 |
fungi | i'd probably try a promote on it first just to "reorder" the queue in the same order and see if that would be less disruptive | 16:20 |
corvus | i have a suspicion that neither would work and we would need to dequeue it and not touch it for 2 hours to fix it (absent external zk intervention) | 16:26 |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 16:28 |
*** jpena is now known as jpena|off | 16:31 | |
*** artom_ is now known as artom | 16:35 | |
clarkb | corvus: my call is done. Let me know if I can help, but from what I can tell you've got it under control and possibly need reviews for changes in the near future | 17:00 |
corvus | clarkb: yep, making progress. will update soon. | 17:01 |
clarkb | thanks | 17:01 |
corvus | clarkb, fungi: i think i'm done inspecting the state. i suspect now that a dequeue/enqueue may actually fix the immediate issue (that is, if the dequeue manages to complete). if you want to try a promote (but i'm 80% confident that won't work), and then dequeue/enqueue on 810014,2 i think that's appropriate. | 17:06 |
fungi | thanks, i'll try in that sequence | 17:07 |
fungi | as anticipated, the promote seems to have done nothing | 17:12 |
fungi | zuul dequeue also doesn't seem to have done anything | 17:14 |
*** slaweq_ is now known as slaweq | 17:22 | |
corvus | fungi: hrm i don't see the dequeue command in the log :/ | 17:23 |
fungi | yeah, i was trying to find it | 17:23 |
fungi | ran as `sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate 810762,6` but that exited 1 so it may not have worked | 17:24 |
corvus | fungi: wrong change number | 17:24 |
clarkb | fungi: sudo docker exec zuul-scheduler_scheduler_1 zuul dequeue --tenant openstack --pipeline gate --project openstack/placement --change 809366,1 is what i used last week | 17:25 |
corvus | oh you were promoting the change behind it | 17:25 |
fungi | i tried both | 17:25 |
corvus | adding a '--change' argument like clarkb may help | 17:26 |
fungi | oh, yep | 17:26 |
fungi | --help also returns nothing though | 17:26 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate --change=810014,2 | 17:27 |
fungi | is what i ran just now | 17:27 |
fungi | i'll try via docker exec instead | 17:27 |
corvus | fungi: sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul --help | 17:27 |
fungi | oh, probably need --project too | 17:27 |
corvus | service name is 'scheduler' not 'zuul-scheduler' | 17:27 |
fungi | aha, yes thank you | 17:28 |
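Putting the corrections together (service name scheduler, plus --project and --change), the dequeue invocation would look roughly like this; the project name is a placeholder since it does not appear in the log:

```shell
# Sketch of the assembled dequeue command; <project> is a placeholder, not
# taken from the log.
sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler \
    zuul dequeue --tenant=openstack --pipeline=gate \
    --project=<project> --change=810014,2
```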
fungi | okay, now the promote try first | 17:29 |
fungi | all the builds for 810014,2 are back to a waiting state | 17:29 |
fungi | i guess we'll know in a moment whether they get nodes assigned | 17:29 |
fungi | i see builds starting | 17:30 |
fungi | corvus: the promote seems to have done the trick | 17:30 |
corvus | fungi: i believe a re-enqueue of 810014,2 should be okay | 17:31 |
corvus | oh nm | 17:31 |
corvus | that's the one you promoted :) | 17:31 |
corvus | so we're all done | 17:31 |
fungi | yeah | 17:31 |
fungi | i first tried promoting the stuck change to "reorder" it in the same order | 17:32 |
fungi | it was just a matter of getting the docker-compose command plumbing correct, thanks! | 17:32 |
corvus | lemme check the change object ids real quick and see if i can anticipate further problems or not | 17:32 |
fungi | for posterity i did this: | 17:32 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2fungi@zuul02:~ | 17:32 |
fungi | er, my cursor also seems to have grabbed the prompt on the next line | 17:33 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2 | 17:33 |
fungi | that | 17:33 |
corvus | fungi: unfortunately, i think that this will not work, it's still using the outdated change object. | 17:34 |
fungi | status page suggests an eta of 10 minutes 'til merge for 810014,2 | 17:34 |
corvus | fungi: i think it's going to require a dequeue/enqueue (and optionally promote) | 17:34 |
fungi | i guess it would get stuck at completion? | 17:35 |
corvus | yep | 17:35 |
fungi | okay, dequeuing it now | 17:35 |
fungi | and enqueuing | 17:36 |
fungi | and promoting | 17:36 |
corvus | fungi: great, that looks like it's using the new change object, so we should be good | 17:36 |
fungi | okay, it's at the top of the queue again | 17:36 |
fungi | thanks corvus! | 17:37 |
fungi | so the bug has to do with outdated change objects in zk? | 17:37 |
corvus | fungi, Clark: we are highly susceptible to this error; basically, any network issue between zuul<->gerrit could cause this. | 17:37 |
corvus | fungi: outdated in memory actually; full explanation in commit msg on https://review.opendev.org/811452 | 17:38 |
fungi | oh, cool looking | 17:38 |
corvus | i think we should restart with that asap. | 17:38 |
clarkb | I've approved the fix | 17:38 |
fungi | yes, restart as soon as there are new images sounds prudent | 17:39 |
corvus | https://zuul.opendev.org/t/openstack/status/change/805981,3 is a lot of jobs | 17:39 |
fungi | oh wow | 17:41 |
fungi | i guess they're running all their molecule jobs because of the ansible bump | 17:43 |
clarkb | I'm amazed they all succeeded | 17:45 |
clarkb | corvus: any idea why some changes have ended up in zuul's periodic pipeline that don't appear to belong there? https://zuul.opendev.org/t/zuul/status | 17:52 |
corvus | clarkb: it may be related to the other traceback i haven't started digging into yet. that was hitting periodic pipelines. | 17:53 |
corvus | i'm going to afk for a bit, then resume work on that | 17:53 |
clarkb | ok | 17:54 |
melwitt | yuriys: re: your placement query from earlier, default behavior of the nova scheduler is to "stack" instances/maximize efficiency. if you want to "spread" instances you can adjust configuration, | 18:46 |
melwitt | https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.host_subset_size is the main one. increasing it will increase the spread by picking randomly from a subset of hosts that can fit the instance | 18:48 |
melwitt | yuriys: there is also https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.shuffle_best_same_weighed_hosts which will randomly shuffle hosts that have the same weight to get more spread. this one says it's particularly well suited for ironic deployments | 18:52 |
melwitt | yuriys: and finally, as fungi mentioned https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates is useful when you have more compute nodes than https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.max_placement_results it will shuffle hosts before truncating at the max results which will allow spread placement | 19:00 |
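A combined sketch of the nova-side options melwitt points at; the values are purely illustrative, not tuning recommendations:

```ini
# nova.conf sketch (scheduler host): spread instances instead of stacking them.
[filter_scheduler]
host_subset_size = 5                    # choose randomly among the 5 best hosts
shuffle_best_same_weighed_hosts = true  # shuffle hosts with identical weights

[scheduler]
max_placement_results = 1000            # pairs with placement's randomize_allocation_candidates
```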
clarkb | melwitt: thanks for all the pointers! | 19:00 |
fungi | thanks melwitt! and yeah, i realized after digging deeper that most of the control over that was from the nova side rather than the placement side | 19:01 |
melwitt | np | 19:01 |
fungi | clarkb: corvus: looks like promote on 811452 completed roughly 20 minutes ago | 19:01 |
yuriys | melwitt: clarkb: tyty! that makes big sense! | 19:12 |
yuriys | I did find the weight docs, and was probably going that way as well. Ideally we pick the most suitable host per instance, per create call. | 19:13 |
clarkb | fungi: if you have time can you weigh in on https://review.opendev.org/c/opendev/system-config/+/810284 ? I think the replication issue being corrected by network updates shows this isn't necessary, though it may still help improve things | 20:10 |
clarkb | curious what you think about it given what we've learned | 20:10 |
corvus | fungi, clarkb: how about i restart zuul now? | 20:12 |
clarkb | let me see what queues look like | 20:13 |
clarkb | I don't see any openstack release jobs | 20:13 |
clarkb | I think we're good and fungi gave them notice a bit earlier | 20:13 |
clarkb | there is a stack of tripleo changes that may be mergeable in ~17 minutes | 20:13 |
corvus | cool, restarting now | 20:14 |
corvus | oh :( | 20:14 |
clarkb | I don't think that is very critical | 20:14 |
clarkb | they also release after openstack does | 20:14 |
corvus | i had just hit enter when i got that msg; so restart is proceeding | 20:14 |
clarkb | no worries | 20:14 |
clarkb | The bug is bad enough that we should get it fixed | 20:15 |
fungi | yeah, sorry, stepped away to do dinner prep but now seems like a good enough time to restart | 20:27 |
fungi | i'll approve 810284 once zuul's running again | 20:28 |
corvus | re-enqueing | 20:32 |
fungi | clarkb: once 810284 is in for a few days or a week we can check cacti graphs for the gitea servers and see if maybe it helps or worsens cpu, memory, i/o, et cetera | 20:36 |
clarkb | ++ | 20:36 |
fungi | with the level of churn some projects like nova see, i wouldn't be surprised if a week is a long time to go between repacks | 20:36 |
clarkb | ya | 20:36 |
clarkb | one reason I suspected that was that projects like nova, cinder, ironic, etc. seemed more likely to hit the replication issues. That could be because they are more active or because they are larger (or both). In this case it's because they are more active, but when I wrote that change I was trying to hedge against the various concerns | 20:37 |
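For reference, the GC/pack change discussed here (810284) amounts to periodically garbage-collecting each bare repository on the gitea backends; a rough sketch of the equivalent manual operation, with the repository path being an assumption rather than the actual layout:

```shell
# Sketch only: walk bare repos on a gitea backend and garbage-collect/repack them.
# The /data/git/repositories path is an assumed example, not the real layout.
for repo in /data/git/repositories/*/*.git; do
    git -C "$repo" gc
done
```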
corvus | re-enqueue complete | 20:40 |
fungi | thanks corvus! | 20:40 |
corvus | #status log restarted all of zuul on commit 29d0534696b3b541701b863bef626f7c804b90f2 to pick up change cache fix | 20:41 |
opendevstatus | corvus: finished logging | 20:41 |
priteau | Do we need to recheck any change submitted during the restart? | 20:44 |
clarkb | priteau: if they don't show up in the status dashboard then yes | 20:44 |
priteau | Thanks, a recheck has put it in the queue | 20:45 |
*** elodilles is now known as elodilles_pto | 20:52 | |
opendevreview | Merged opendev/system-config master: GC/pack gitea repos every other day https://review.opendev.org/c/opendev/system-config/+/810284 | 21:35 |