ianw | clarkb: should we merge 811233 and restart with it? | 01:11 |
---|---|---|
Clark[m] | Ya I think we should. If you approve it I can sort out the restart tomorrow | 01:26 |
Clark[m] | Or if you want to get it done feel free | 01:26 |
ianw | i'll get it in and see how we go | 01:36 |
opendevreview | Merged opendev/system-config master: Properly copy gerrit static files https://review.opendev.org/c/opendev/system-config/+/811233 | 02:26 |
opendevreview | 赵晨凯 proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 02:46 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 02:50 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 03:17 |
frickler | ehm, did the latest stuff clean up autoholds? my hold from yesterday evening seems to be gone and there is only one currently which has an id of 0000000000 | 03:32 |
frickler | ah, I should've read all backlog | 03:32 |
opendevreview | fupingxie proposed openstack/project-config master: test https://review.opendev.org/c/openstack/project-config/+/811295 | 04:03 |
opendevreview | Ian Wienand proposed opendev/system-config master: Refactor infra-prod jobs for parallel running https://review.opendev.org/c/opendev/system-config/+/807672 | 04:48 |
opendevreview | Ian Wienand proposed opendev/system-config master: infra-prod: clone source once https://review.opendev.org/c/opendev/system-config/+/807808 | 04:48 |
*** ysandeep|out is now known as ysandeep | 05:51 | |
ianw | clarkb: i doubt i will make the meeting, but i added some notes on the parallel job changes. i think they're ready for review now | 07:03 |
ianw | https://hub.docker.com/layers/opendevorg/gerrit/3.2/images/sha256-8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc?context=explore | 07:05 |
ianw | is the latest gerrit image | 07:05 |
ianw | that matches what we have for gerrit 3.2 tag ... "opendevorg/gerrit@sha256:8d847be97aea80ac1b395819b1a3197ff1e69c5dcb594bec2a16715884b540cc" | 07:06 |
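A digest check like the one ianw describes can be reproduced against the locally pulled image; a minimal sketch, assuming the opendevorg/gerrit:3.2 image is already present on the Gerrit host:

```shell
# Print the repo digest of the locally pulled 3.2 image; if it matches the
# sha256 shown on Docker Hub, a restart will not pull anything new.
docker image inspect opendevorg/gerrit:3.2 --format '{{index .RepoDigests 0}}'
```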
ianw | so i'll just do a quick restart to pick up the fixed static content changes | 07:06 |
ianw | #status log restarted gerrit to pick up https://review.opendev.org/c/opendev/system-config/+/811233 | 07:08 |
opendevstatus | ianw: finished logging | 07:08 |
opendevreview | NMG-K proposed openstack/project-config master: add taibai namesapce and base project https://review.opendev.org/c/openstack/project-config/+/811290 | 07:09 |
*** ianw is now known as ianw_pto | 07:21 | |
*** jpena|off is now known as jpena | 07:31 | |
*** ykarel is now known as ykarel|lunch | 09:02 | |
*** ykarel|lunch is now known as ykarel | 10:15 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 10:44 |
*** ysandeep is now known as ysandeep|brb | 10:47 | |
opendevreview | Yuriy Shyyan proposed openstack/project-config master: Adjusting tenancy limits for this cloud. https://review.opendev.org/c/openstack/project-config/+/811395 | 10:59 |
yuriys | ianw: clarkb: just saw that from yesterday, unlucky... adjusting the limit today. We use the libvirt driver, and libvirt uses the kvm virt_type, not qemu. The biggest issue I'm trying to figure out is why, when a change set requires multiple instances for testing, they all seem to get started/pushed on the same baremetal node | 11:06 |
yuriys | it has an explosive effect and doesn't naturally balance out. | 11:08 |
*** ysandeep|brb is now known as ysandeep | 11:13 | |
*** bhagyashris_ is now known as bhagyashris|rover | 11:16 | |
yuriys | in the nl files, what does rate: do? I can't find that info in the zuul-ci.org docs | 11:16 |
*** jpena is now known as jpena|lunch | 11:24 | |
*** dviroel|out is now known as dviroel | 11:27 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 12:14 |
*** jpena|lunch is now known as jpena | 12:17 | |
*** ysandeep is now known as ysandeep|brb | 12:23 | |
*** ysandeep|brb is now known as ysandeep | 12:31 | |
fungi | yuriys: the rate there is a throttle for how quickly nodepool will make api calls to the provider | 14:12 |
fungi | some providers have rate limiters in front of their apis and will return errors if nodepool makes too many calls in rapid succession | 14:13 |
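The rate setting fungi describes is a per-provider option in the nodepool launcher configuration (the nl*.yaml files); a minimal sketch, with the provider and cloud names chosen purely for illustration:

```yaml
# Nodepool launcher provider sketch: "rate" is the delay in seconds that
# nodepool waits between API operations against this provider.
providers:
  - name: example-provider   # illustrative name, not an actual provider
    cloud: example-cloud
    rate: 1.0                # at most one API call per second
```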
yuriys | I'm just concerned that placement may not be instant, still validating that part. So if it's returning the same zone/node availability to nova, that may be causing the instances to be scheduled on the same infra nodes. | 14:23 |
opendevreview | Merged openstack/project-config master: Adjusting tenancy limits for this cloud. https://review.opendev.org/c/openstack/project-config/+/811395 | 14:24 |
yuriys | Although there is a whole message queuing system inside, so that part isn't up to your rate limit, but rather how placement handles its queue, I suppose. | 14:24 |
yuriys | Basically the problem I'm trying to solve is that sometimes I'll see a node with, say, 11 instances on it while another one only has 5 or 6, so it's not 'balanced'. I've been thinking maybe tweaking overcommits to force the correct balancing behavior, so that once one node gets to 8 or so, new instances have to be provisioned on some other node and it won't be in the node list placement returns. | 14:27 |
fungi | i guess placement tries to follow something like a round-robin or least-loaded scheme? | 14:27 |
yuriys | From everything I have seen that is not the case, or maybe we didn't configure that part properly. | 14:28 |
fungi | least-loaded could work at cross purposes since that might allow one under-utilized host to suddenly get a lot of instances placed | 14:28 |
yuriys | imo it should always return the least loaded node. | 14:28 |
fungi | depending on how racy the determination is | 14:28 |
yuriys | Yeah, when I see a subset of tests return this: | 14:29 |
yuriys | (victoria) [root@lucky-firefox ~]# for i in 682b7f82-3433-474a-a4c1-76c8a8316abd 64f48d2c-9cf8-4c3d-86f7-017a4f7f6ad8 aaf52bf4-e0a9-41b8-a307-1b0e637bcb69; do openstack server show $i -c OS-EXT-SRV-ATTR:host -f shell; done | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | os_ext_srv_attr_host="dashing-tiglon.local" | 14:29 |
yuriys | I go full /reeeeee | 14:29 |
yuriys | And I'm not sure throwing more hardware at it would solve this problem, unless I figure out why placement is misbehaving, basically. | 14:30 |
yuriys | Otherwise it looks like you'll be like 'hey cloud, give me 3 instances for this test', and it will just create 3 instances on 1 node regardless of how many there are in the cloud. | 14:31 |
yuriys | Yeah looks like we don't really customize placement, womp womp. It's all like defaults. | 14:34 |
fungi | i see placement.randomize_allocation_candidates is false by default, i wonder if that would help | 14:34 |
fungi | just looking through the config sample and docs for it now, i'm unfortunately not particularly familiar with it | 14:35 |
yuriys | maybe enabling randomize_allocation_candidates would help , idk, worth a try | 14:35 |
*** ykarel is now known as ykarel|away | 14:36 | |
fungi | https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates | 14:37 |
fungi | yeah, it seems like a long shot | 14:37 |
yuriys | that docstring describes what I think is happening to me sometimes: "That is, all things being equal, two requests for allocation candidates will return the same results in the same order" | 14:38 |
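The option under discussion is toggled in placement.conf on the placement service; a minimal sketch of what enabling it might look like (the value shown is the opposite of the default):

```ini
# placement.conf sketch: shuffle allocation candidates so equivalent requests
# do not keep returning the same hosts in the same order (defaults to false).
[placement]
randomize_allocation_candidates = true
```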
fungi | it looks like more advanced load distribution would maybe have to be done in nova's scheduler still? i'm trying to quickly digest the docs | 14:40 |
fungi | ahh, maybe what i was expecting to be static configuration is actually behaviors set through the placement api, via creation of "resource providers"? | 14:45 |
fungi | anyway, i need to switch gears, more meetings on the way | 14:45 |
yuriys | yup, ty for taking a peek, we'll see how we do with the new limit, tired of ianw yelling at me! | 14:46 |
*** ysandeep is now known as ysandeep|out | 14:57 | |
clarkb | yuriys: fungi: melwitt helped with placement things when we had the leaks and may have input too. Though I think this morning everyone is still trying to sort through the devstack apache issue | 15:10 |
*** marios is now known as marios|out | 15:50 | |
corvus | clarkb: i believe the change at the head of the starlingx gate queue is stuck due to the zuul issue. | 16:14 |
opendevreview | Alfredo Moralejo proposed openstack/project-config master: Add support for CentOS Stream 9 in nodepool elements https://review.opendev.org/c/openstack/project-config/+/811442 | 16:15 |
fungi | corvus: clarkb and i are both on a call at the moment but i can try to take a look | 16:17 |
corvus | fungi: no need, i'm looking into the zuul bug. at some point we may want to dequeue/enqueue to see if it fixes it, but for now i'd appreciate the opportunity to learn more in situ | 16:18 |
fungi | corvus: oh, thanks, no problem i can try to get the tests going for that change again once you're done looking at it | 16:19 |
fungi | i'd probably try a promote on it first just to "reorder" the queue in the same order and see if that would be less disruptive | 16:20 |
corvus | i have a suspicion that neither would work and we would need to dequeue it and not touch it for 2 hours to fix it (absent external zk intervention) | 16:26 |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: [WIP] Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 16:28 |
*** jpena is now known as jpena|off | 16:31 | |
*** artom_ is now known as artom | 16:35 | |
clarkb | corvus: my call is done. Let me know if I can help, but from what I can tell you've got it under control and possibly need reviews for changes in the near future | 17:00 |
corvus | clarkb: yep, making progress. will update soon. | 17:01 |
clarkb | thanks | 17:01 |
corvus | clarkb, fungi: i think i'm done inspecting the state. i suspect now that a dequeue/enqueue may actually fix the immediate issue (that is, if the dequeue manages to complete). if you want to try a promote (but i'm 80% confident that won't work), and then dequeue/enqueue on 810014,2 i think that's appropriate. | 17:06 |
fungi | thanks, i'll try in that sequence | 17:07 |
fungi | as anticipated, the promote seems to have done nothing | 17:12 |
fungi | zuul dequeue also doesn't seem to have done anything | 17:14 |
*** slaweq_ is now known as slaweq | 17:22 | |
corvus | fungi: hrm i don't see the dequeue command in the log :/ | 17:23 |
fungi | yeah, i was trying to find it | 17:23 |
fungi | ran as `sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate 810762,6` but that exited 1 so it may not have worked | 17:24 |
corvus | fungi: wrong change number | 17:24 |
clarkb | fungi: sudo docker exec zuul-scheduler_scheduler_1 zuul dequeue --tenant openstack --pipeline gate --project openstack/placement --change 809366,1 is what i used last week | 17:25 |
corvus | oh you were promoting the change behind it | 17:25 |
fungi | i tried both | 17:25 |
corvus | adding a '--change' argument like clarkb may help | 17:26 |
fungi | oh, yep | 17:26 |
fungi | --help also returns nothing though | 17:26 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec zuul-scheduler zuul dequeue --tenant=openstack --pipeline=gate --change=810014,2 | 17:27 |
fungi | is what i ran just now | 17:27 |
fungi | i'll try via docker exec instead | 17:27 |
corvus | fungi: sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul --help | 17:27 |
fungi | oh, probably need --project too | 17:27 |
corvus | service name is 'scheduler' not 'zuul-scheduler' | 17:27 |
fungi | aha, yes thank you | 17:28 |
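Putting the corrections together (service name scheduler, plus --project and --change), the dequeue invocation would look roughly like this; the project name is a placeholder since it does not appear in the log:

```shell
# Sketch of the assembled dequeue command; <project> is a placeholder, not
# taken from the log.
sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler \
    zuul dequeue --tenant=openstack --pipeline=gate \
    --project=<project> --change=810014,2
```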
fungi | okay, now the promote try first | 17:29 |
fungi | all the builds for 810014,2 are back to a waiting state | 17:29 |
fungi | i guess we'll know in a moment whether they get nodes assigned | 17:29 |
fungi | i see builds starting | 17:30 |
fungi | corvus: the promote seems to have done the trick | 17:30 |
corvus | fungi: i believe a re-enqueue of 810014,2 should be okay | 17:31 |
corvus | oh nm | 17:31 |
corvus | that's the one you promoted :) | 17:31 |
corvus | so we're all done | 17:31 |
fungi | yeah | 17:31 |
fungi | i first tried promoting the stuck change to "reorder" it in the same order | 17:32 |
fungi | it was just a matter of getting the docker-compose command plumbing correct, thanks! | 17:32 |
corvus | lemme check the change object ids real quick and see if i can anticipate further problems or not | 17:32 |
fungi | for posterity i did this: | 17:32 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2fungi@zuul02:~ | 17:32 |
fungi | er, my cursor also seems to have grabbed the prompt on the next line | 17:33 |
fungi | sudo docker-compose -f /etc/zuul-scheduler/docker-compose.yaml exec scheduler zuul promote --tenant=openstack --pipeline=gate --changes=810014,2 | 17:33 |
fungi | that | 17:33 |
corvus | fungi: unfortunately, i think that this will not work, it's still using the outdated change object. | 17:34 |
fungi | status page suggests an eta of 10 minutes 'til merge for 810014,2 | 17:34 |
corvus | fungi: i think it's going to require a dequeue/enqueue (and optionally promote) | 17:34 |
fungi | i guess it would get stuck at completion? | 17:35 |
corvus | yep | 17:35 |
fungi | okay, dequeuing it now | 17:35 |
fungi | and enqueuing | 17:36 |
fungi | and promoting | 17:36 |
corvus | fungi: great, that looks like it's using the new change object, so we should be good | 17:36 |
fungi | okay, it's at the top of the queue again | 17:36 |
fungi | thanks corvus! | 17:37 |
fungi | so the bug has to do with outdated change objects in zk? | 17:37 |
corvus | fungi, Clark: we are highly susceptible to this error; basically, any network issue between zuul<->gerrit could cause this. | 17:37 |
corvus | fungi: outdated in memory actually; full explanation in commit msg on https://review.opendev.org/811452 | 17:38 |
fungi | oh, cool looking | 17:38 |
corvus | i think we should restart with that asap. | 17:38 |
clarkb | I've approved the fix | 17:38 |
fungi | yes, restart as soon as there are new images sounds prudent | 17:39 |
corvus | https://zuul.opendev.org/t/openstack/status/change/805981,3 is a lot of jobs | 17:39 |
fungi | oh wow | 17:41 |
fungi | i guess they're running all their molecule jobs because of the ansible bump | 17:43 |
clarkb | I'm amazed they all succeeded | 17:45 |
clarkb | corvus: any idea why some changes have ended up in zuul's periodic pipeline that don't appear to belong there? https://zuul.opendev.org/t/zuul/status | 17:52 |
corvus | clarkb: it may be related to the other traceback i haven't started digging into yet. that was hitting periodic pipelines. | 17:53 |
corvus | i'm going to afk for a bit, then resume work on that | 17:53 |
clarkb | ok | 17:54 |
melwitt | yuriys: re: your placement query from earlier, default behavior of the nova scheduler is to "stack" instances/maximize efficiency. if you want to "spread" instances you can adjust configuration, | 18:46 |
melwitt | https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.host_subset_size is the main one. increasing it will increase the spread by picking randomly from a subset of hosts that can fit the instance | 18:48 |
melwitt | yuriys: there is also https://docs.openstack.org/nova/latest/configuration/config.html#filter_scheduler.shuffle_best_same_weighed_hosts which will randomly shuffle hosts that have the same weight to get more spread. this one says it's particularly well suited for ironic deployments | 18:52 |
melwitt | yuriys: and finally, as fungi mentioned https://docs.openstack.org/placement/latest/configuration/config.html#placement.randomize_allocation_candidates is useful when you have more compute nodes than https://docs.openstack.org/nova/latest/configuration/config.html#scheduler.max_placement_results it will shuffle hosts before truncating at the max results which will allow spread placement | 19:00 |
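A combined sketch of the nova-side options melwitt points at; the values are purely illustrative, not tuning recommendations:

```ini
# nova.conf sketch (scheduler host): spread instances instead of stacking them.
[filter_scheduler]
host_subset_size = 5                    # choose randomly among the 5 best hosts
shuffle_best_same_weighed_hosts = true  # shuffle hosts with identical weights

[scheduler]
max_placement_results = 1000            # pairs with placement's randomize_allocation_candidates
```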
clarkb | melwitt: thanks for all the pointers! | 19:00 |
fungi | thanks melwitt! and yeah, i realized after digging deeper that most of the control over that was from the nova side rather than the placement side | 19:01 |
melwitt | np | 19:01 |
fungi | clarkb: corvus: looks like promote on 811452 completed roughly 20 minutes ago | 19:01 |
yuriys | melwitt: clarkb: tyty! that makes big sense! | 19:12 |
yuriys | I did find the weight docs, and was probably going that way as well. Ideally we pick the most suitable host per instance, per create call. | 19:13 |
clarkb | fungi: if you have time can you weigh in on https://review.opendev.org/c/opendev/system-config/+/810284 ? I think the replication issue being corrected by network updates shows this isn't necessary, though it may still help improve things | 20:10 |
clarkb | curious what you think about it given what we've learned | 20:10 |
corvus | fungi, clarkb: how about i restart zuul now? | 20:12 |
clarkb | let me see what queues look like | 20:13 |
clarkb | I don't see any openstack release jobs | 20:13 |
clarkb | I think we're good and fungi gave them notice a bit earlier | 20:13 |
clarkb | there is a stack of tripleo changes that may be mergeable in ~17 minutes | 20:13 |
corvus | cool, restarting now | 20:14 |
corvus | oh :( | 20:14 |
clarkb | I don't think that is very critical | 20:14 |
clarkb | they also release after openstack does | 20:14 |
corvus | i had just hit enter when i got that msg; so restart is proceeding | 20:14 |
clarkb | no worries | 20:14 |
clarkb | The bug is bad enough that we should get it fixed | 20:15 |
fungi | yeah, sorry, stepped away to do dinner prep but now seems like a good enough time to restart | 20:27 |
fungi | i'll approve 810284 once zuul's running again | 20:28 |
corvus | re-enqueing | 20:32 |
fungi | clarkb: once 810284 is in for a few days or a week we can check cacti graphs for the gitea servers and see if maybe it helps or worsens cpu, memory, i/o, et cetera | 20:36 |
clarkb | ++ | 20:36 |
fungi | with the level of churn some projects like nova see, i wouldn't be surprised if a week is a long time to go between repacks | 20:36 |
clarkb | ya | 20:36 |
clarkb | one reason I suspected that was that projects like nova, cinder, ironic, etc. seemed more likely to hit the replication issues. That could be because they are more active or because they are larger (or both). In this case it's because they are more active, but when I wrote that change I was trying to hedge against the various concerns | 20:37 |
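For reference, the GC/pack change discussed here (810284) amounts to periodically garbage-collecting each bare repository on the gitea backends; a rough sketch of the equivalent manual operation, with the repository path being an assumption rather than the actual layout:

```shell
# Sketch only: walk bare repos on a gitea backend and garbage-collect/repack them.
# The /data/git/repositories path is an assumed example, not the real layout.
for repo in /data/git/repositories/*/*.git; do
    git -C "$repo" gc
done
```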
corvus | re-enqueue complete | 20:40 |
fungi | thanks corvus! | 20:40 |
corvus | #status log restarted all of zuul on commit 29d0534696b3b541701b863bef626f7c804b90f2 to pick up change cache fix | 20:41 |
opendevstatus | corvus: finished logging | 20:41 |
priteau | Do we need to recheck any change submitted during the restart? | 20:44 |
clarkb | priteau: if they don't show up in the status dashboard then yes | 20:44 |
priteau | Thanks, a recheck has put it in the queue | 20:45 |
*** elodilles is now known as elodilles_pto | 20:52 | |
opendevreview | Merged opendev/system-config master: GC/pack gitea repos every other day https://review.opendev.org/c/opendev/system-config/+/810284 | 21:35 |