corvus | #status log restarted zuul-launchers with latest fixes | 00:22 |
opendevstatus | corvus: finished logging | 00:23 |
corvus | anyone remember what the rackspace classic network caps are? cause looking at the cacti graphs, i'm suspecting we're hitting a 100Mbit outbound cap on the launchers. that's on a 2GB ram node. | 00:29 |
corvus | that's nbd for node operations, but it'll slow down image uploads | 00:29 |
corvus | the previous version of zl01, which was an 8GB vm, was able to push at least 200Mbps, possibly more (hard to tell in our degraded cacti due to rollup artifacts) | 00:31 |
corvus | we're generating more images now than normal, so it's possible this may be reduced once things settle down, but it looks like we're spending around 20 hours each of the last 2 days uploading images. so getting pretty close to the point where this could be a problem; maybe past that point now with the new images. | 00:33 |
corvus | so we might need to add another launcher just to handle the image load, or upsize the vms to get more bandwidth. which is a shame, because they're nowhere near using their allocated cpu or ram, which is minuscule. | 00:34 |
corvus | i think if we decide to upsize, we should go back to 8GB; i think a 4GB node might still limit our bandwidth. | 00:45 |
corvus | that's the way i'm leaning. i can start on that tomorrow if folks agree. | 00:49 |
Clark[m] | corvus I think flavor show shows some network cap info | 00:51 |
Clark[m] | It may be ratios between the flavors and not raw bit rate and ya bigger flavors get more bit rate so maybe that is a workaround | 00:52 |
corvus | yeah, that matches the docs i found. the number is 400 for 2GB, 800 for 4GB, and 1600 for 8GB. based on those ratios, i guess we should be able to push 400Mbps with an 8GB node. | 01:22 |
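For reference, a back-of-the-envelope version of that scaling argument as a minimal Python sketch; the ~100 Mbps observation and the 400/800/1600 weights come from the discussion above, everything else is illustrative:

```python
# Rough estimate of outbound bandwidth per Rackspace classic flavor,
# assuming the published network weights scale linearly and the 2GB
# flavor is observed topping out around 100 Mbps (per the cacti graphs).
observed_mbps_2gb = 100
weights = {"2GB": 400, "4GB": 800, "8GB": 1600}

for flavor, weight in weights.items():
    estimate = observed_mbps_2gb * weight / weights["2GB"]
    print(f"{flavor}: ~{estimate:.0f} Mbps outbound")
# Expected output: 2GB ~100, 4GB ~200, 8GB ~400 Mbps
```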
opendevreview | Clark Boylan proposed opendev/system-config master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/system-config/+/954382 | 01:25 |
opendevreview | Clark Boylan proposed opendev/base-jobs master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/base-jobs/+/954383 | 01:29 |
noonedeadpunk | so translations were proposed at least for us: https://review.opendev.org/c/openstack/openstack-ansible/+/954400 | 07:07 |
mnasiadka | noonedeadpunk: for a lot of other projects as well | 07:12 |
noonedeadpunk | for a lot it failed :) | 07:13 |
noonedeadpunk | https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&pipeline=periodic&skip=0 | 07:13 |
noonedeadpunk | as yesterday some were passing as well, just because they didn't have updates for translations | 07:17 |
noonedeadpunk | like this seems to use old whitelist_externals in tox: https://zuul.opendev.org/t/openstack/build/b9c83152e96f42a58a91af2afbdc00e2/log/job-output.txt | 07:18 |
noonedeadpunk | which I think is broken with nodeset switch | 07:19 |
noonedeadpunk | but they seem to fail on unrelated things I'd say... One fails to build Pillow, Ironic has conflicting constraint to requirements. | 07:27 |
frickler | noonedeadpunk: I saw https://review.opendev.org/c/openstack/designate-dashboard/+/954385 which tries to delete all releasenote translations, which doesn't seem right to me, either | 07:45 |
frickler | oh, although this is for stable branches, maybe the reno translations only apply on the master branch, so this change could be correct? some neutron-* changes look similar. might be good to confirm with I18N people though | 07:47 |
noonedeadpunk | am I right that designate dashboard is not functional anyway? | 08:09 |
noonedeadpunk | ah, ok, sorry, mixed it up with barbican | 08:10 |
noonedeadpunk | yeah, I agree that it would be good to confirm that | 08:11 |
noonedeadpunk | as it's not super clear if we fixed or bricked things | 08:11 |
noonedeadpunk | like - there's a horizon proposal as well, which I've heard should not have happened | 08:11 |
noonedeadpunk | but also only for stable branches | 08:12 |
stephenfin | clarkb: fungi: I think you reached the same conclusion, but yes, deprecation and deprecation for removal are separate things and I'm mostly pursuing the former | 12:38 |
stephenfin | The only exceptions are things like the `test` and `build_doc` commands, which are not core to building a package and rely on other packages | 12:39 |
stephenfin | I've indicated this on a few of the reviews, but I'm open to adding verbiage to indicate this. I want to drive new users towards the upstream variants of options, but don't want to break existing workflows | 12:40 |
opendevreview | Merged opendev/base-jobs master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/base-jobs/+/954383 | 12:51 |
ykarel | fungi, clarkb still seeing that multinode issue, mixing cloud providers | 12:56 |
ykarel | https://zuul.openstack.org/build/83acbc4e29ca4117bb2d7825d4504d30 | 12:56 |
ykarel | https://zuul.openstack.org/build/91af210331494b35bcf21d9b7cbdb0d5 | 12:56 |
ykarel | from today ^ | 12:56 |
fungi | ykarel: thanks! looks like those are both periodic jobs, i wonder if the burst when those all get enqueued at once is overwhelming some providers and causing repeated boot failures/timeouts that eventually force it over to trying get the node from another provider rather than failing the job outright | 13:00 |
ykarel | yes seen in periodic ones | 13:06 |
opendevreview | Merged opendev/system-config master: Disable recursive git clones with ansible https://review.opendev.org/c/opendev/system-config/+/954382 | 13:21 |
frickler | corvus: ^^ so the question would be whether we still consider that an issue in zuul-launcher or whether we need to somehow handle this in devstack, either by a pre-check that rejects such a nodeset or by changing the handling of internal vs. public addresses | 13:29 |
fungi | one of the suggestions was some means to express at the job level when nodes should be allocated from a single provider or otherwise immediately failed when that's not possible, rather than trying to proceed with mixed providers | 13:32 |
corvus | yeah, i just want to make sure that zuul-launcher is operating as intended before we start talking about changing it. let me investigate that before we decide how to proceed. | 13:34 |
corvus | okay, after looking at the first instance, it exceeded its three retries and did failover to another provider for one of the nodes. so that part seems to be working better. but what's interesting is why it failed 3 times | 14:41 |
corvus | 1) launch timeout | 14:42 |
corvus | 2) nova scheduling failure | 14:42 |
corvus | 3) quota exceeded (cores) | 14:42 |
fungi | it's going for a full bingo card | 14:42 |
fungi | was that in rackspace classic? | 14:43 |
corvus | the third one should not have resulted in a permanent failure (it should not count against the 3 retries). i think ovh issues a slightly different error message than it did when we wrote the code originally (we have to string match). so i wrote https://review.opendev.org/c/zuul/zuul/+/954505 to update it | 14:43 |
corvus | no, ovh | 14:43 |
fungi | oh, even more surprising | 14:43 |
corvus | failure #1, the timeout, is also a new thing for us, since we just (re-)implemented launch timeouts yesterday. we would not have seen those in yesterday's rush of periodic jobs. | 14:44 |
corvus | so i wonder if we should increase the launch-timeout on ovh? | 14:44 |
fungi | or maybe not surprising, from what i understand our tenant in ovh has a dedicated host aggregate with separate overcommit tuning, mainly to avoid impacting other customers as a noisy neighbor | 14:45 |
corvus | i'm not sure why we would get a nova error instead of a quota error for #2... | 14:45 |
corvus | this is the error it got: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 81889faa-c359-4f23-8b17-77d91877406f. | 14:46 |
corvus | should zuul treat that as a temporary failure? or should we still consider that a real error? is there something else we should do to avoid it? | 14:47 |
opendevreview | James E. Blair proposed opendev/zuul-providers master: Increase launch timeout for ovh https://review.opendev.org/c/opendev/zuul-providers/+/954508 | 14:48 |
clarkb | corvus: I think that is nova indicating there was no hypervisor to assign the node. I hesitate to treat that as a temporary error because in the past we've seen errors like that require human intervention due to stale/broken placement data | 14:54 |
corvus | that sounds like we should treat it as a non-temporary error. basically, if we mark the node as "tempfailed" then we will not count the failure against the retry counter; if we mark it as "failed" we will. so errors that result in "failed" will try 3 times before falling back on another provider. "tempfailed" could loop forever (literally) in the provider, so we only want to do that for cases where we actually expect the provider to recover on | 14:57 |
corvus | its own (eg, quota) | 14:57 |
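For readers following along, the tempfailed/failed split described above can be sketched roughly like this; this is not the actual zuul-launcher code, and the marker strings and helper name are illustrative stand-ins for the string matching corvus mentions:

```python
# Sketch: classify a provider launch error as "tempfailed" (retried
# indefinitely within the provider, e.g. quota) or "failed" (counts
# against the 3 retries before falling back to another provider).
QUOTA_MARKERS = (
    "quota exceeded",            # wording seen in the failure above
    "exceeds available quota",   # hypothetical variant; clouds differ
)

def classify_launch_error(message: str) -> str:
    msg = message.lower()
    if any(marker in msg for marker in QUOTA_MARKERS):
        return "tempfailed"   # provider is expected to recover on its own
    return "failed"           # counts toward this provider's retry limit

print(classify_launch_error("Quota exceeded for cores"))  # tempfailed
print(classify_launch_error("No valid host was found."))  # failed
```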
clarkb | ya quota makes sense for doing that | 14:58 |
corvus | so what do you think? leave "Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance" as a "permanent" error? or should we treat it as quota error? | 15:04 |
clarkb | I almost wonder if we need a third state which is a "temporary" temporary failure | 15:05 |
clarkb | basically if it persists past some limit then we treat it as permanent | 15:05 |
frickler | sadly nova doesn't bubble up more details about why scheduling failed, so keeping it permanent is the safer option IMO. how about increasing the number of retries? | 15:06 |
corvus | yeah, we can increase retries. but this also didn't happen as often. most of the errors were regular quota errors. so if we fix those (in progress) we might be able to leave the current value | 15:09 |
johnsom | noonedeadpunk Designate dashboard should be fully functional. Are you seeing otherwise? | 15:18 |
noonedeadpunk | johnsom: nah, as I said, I mixed up with barbican | 15:24 |
johnsom | ack | 15:25 |
*** dhill is now known as Guest21627 | 15:33 |
clarkb | stephenfin: I think there is a small bug in https://review.opendev.org/c/openstack/pbr/+/954514/ but otherwise the stack up to about https://review.opendev.org/c/openstack/pbr/+/954047/ looks mergeable to me | 16:04 |
clarkb | stephenfin: fungi ^ maybe we focus on getting that first half in to avoid too much rebase churn when fixing small bugs like the one in 954514? Then tackle the second half of the stack after? | 16:05 |
clarkb | maybe that doesn't help much I dunno | 16:05 |
fungi | makes sense, sure | 16:05 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack https://review.opendev.org/c/zuul/zuul-jobs/+/954528 | 16:38 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph https://review.opendev.org/c/zuul/zuul-jobs/+/954529 | 16:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-python https://review.opendev.org/c/zuul/zuul-jobs/+/954530 | 16:40 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo https://review.opendev.org/c/zuul/zuul-jobs/+/954531 | 16:41 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-npm https://review.opendev.org/c/zuul/zuul-jobs/+/954532 | 16:42 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test https://review.opendev.org/c/zuul/zuul-jobs/+/954533 | 16:43 |
corvus | https://grafana.opendev.org/d/21a6e53ea4/zuul-status?orgId=1&from=now-24h&to=now&timezone=utc | 16:50 |
corvus | today was a little anomalous -- a big bunch of node requests at 15:00 | 16:51 |
corvus | but i think the launchers did what they were supposed to do | 16:51 |
corvus | infra-root: i do think we should recreate the launchers as 8gb nodes in order to get the extra bandwidth. i didn't see any feedback on that overnight... so i'm... going to take that as consensus? :) | 16:52 |
corvus | last chance to object or suggest an alternative | 16:53 |
fungi | i concur | 16:54 |
fungi | sad but necessary, i guess | 16:54 |
fungi | heading out on some quick errands, should only be a few minutes | 16:55 |
clarkb | corvus: ya bigger nodes seems reasonable to me | 16:59 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace zl01 https://review.opendev.org/c/opendev/zone-opendev.org/+/954536 | 17:20 |
opendevreview | James E. Blair proposed opendev/zone-opendev.org master: Replace zl02 https://review.opendev.org/c/opendev/zone-opendev.org/+/954537 | 17:20 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace zl01 https://review.opendev.org/c/opendev/system-config/+/954538 | 17:20 |
opendevreview | James E. Blair proposed opendev/system-config master: Replace zl02 https://review.opendev.org/c/opendev/system-config/+/954539 | 17:20 |
corvus | infra-root: if you want to +2 those, i'll +w and replace one at a time. | 17:21 |
clarkb | done | 17:28 |
opendevreview | Merged opendev/zone-opendev.org master: Replace zl01 https://review.opendev.org/c/opendev/zone-opendev.org/+/954536 | 17:40 |
corvus | looking at the status page, i'm seeing some node_failures for arm nodes -- it looks like it's due to launch-timeouts. | 17:41 |
corvus | do we need a larger value for osuosl as well? | 17:41 |
ianychoi | <frickler> "noonedeadpunk: I saw https://..." <- I have just replied to the patch - releasenotes translation is only managed on master branch, so deleting on stable branches is right. Thank you for helping I18n team | 17:43 |
clarkb | the current values appear to have been ported over from nodepool? Thinking out loud here I wonder if the more regular image builds are impacting nova's caching of images | 17:43 |
clarkb | corvus: a little while back we reduced the total number of image builds in nodepool to ease back on that and update less common images less often. That's my best guess as to what is happening. Increasing the timeouts would help if that is the problem | 17:44 |
corvus | oh... it looks like we're counting the "paused" time in the launch timeout... which is probably fine for other providers, but not osuosl... | 17:44 |
noonedeadpunk | ianychoi: aha, cool, thanks for having a look at that, as I was really unsure if I broke something with it or not | 17:44 |
corvus | clarkb: let me take a look at tweaking this first | 17:44 |
clarkb | corvus: basically when you boot a node on a hypervisor that hasn't used that image yet it has to be copied from glance if not using ceph | 17:44 |
clarkb | corvus: ack | 17:44 |
noonedeadpunk | then this project is kinda terribly broken overall I believe: https://review.opendev.org/c/openstack/training-guides/+/954432 | 17:44 |
noonedeadpunk | but it's not I18n concern... | 17:45 |
corvus | yeah, that may be happening too, but for the example i'm looking at in the logs, i think it's the launch-timeout issue | 17:45 |
ianychoi | noonedeadpunk: Thank you too for helping Ubuntu node upgrade!! | 17:45 |
noonedeadpunk | ianychoi: ah, no worries, I was just pinged that expected translations didn't land so I had to check anyway | 17:45 |
clarkb | the weather is cooperating today and I don't have a morning full of meetings. I'm going to pop out for a bike ride while I can. | 17:48 |
clarkb | back in a bit | 17:48 |
fungi | you definitely should | 17:54 |
corvus | okay, this should address the node_errors for arm64 nodes: https://review.opendev.org/954540 | 17:54 |
corvus | i have stopped zl01 in preparation for its replacement; zl02 is handling all the node requests now | 18:01 |
opendevreview | Merged opendev/system-config master: Replace zl01 https://review.opendev.org/c/opendev/system-config/+/954538 | 18:17 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-devstack https://review.opendev.org/c/zuul/zuul-jobs/+/954528 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-dstat-graph https://review.opendev.org/c/zuul/zuul-jobs/+/954529 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-python https://review.opendev.org/c/zuul/zuul-jobs/+/954530 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for ensure-skopeo https://review.opendev.org/c/zuul/zuul-jobs/+/954531 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for upload-npm https://review.opendev.org/c/zuul/zuul-jobs/+/954532 | 18:20 |
opendevreview | Merged zuul/zuul-jobs master: Disable recursive git clone for upload-git-mirror test https://review.opendev.org/c/zuul/zuul-jobs/+/954533 | 18:20 |
corvus | i am in favor of increasing the system-config mutex :) | 18:57 |
corvus | #status log replaced zl01 with an 8GB server | 20:05 |
opendevstatus | corvus: finished logging | 20:06 |
corvus | i've stopped zl02, so all requests are handled by the new zl01 now | 20:07 |
corvus | clarkb: fungi would you mind taking a look at https://review.opendev.org/954246 -- that should make the zuul status graphs a bit easier to read | 20:10 |
opendevreview | Merged opendev/zone-opendev.org master: Replace zl02 https://review.opendev.org/c/opendev/zone-opendev.org/+/954537 | 20:12 |
fungi | i see that infra-prod-service-zuul-preview failed in deploy for 954538, service-zuul-preview.yaml.log on bridge says it was a "HTTP status: 504 Gateway Time-out" when pulling the zuul-preview image with docker-compose | 20:12 |
fungi | probably not anything to worry about, and just wait for the next deploy | 20:12 |
corvus | yeah, we're not changing anything about that, so it doesn't affect anything | 20:12 |
fungi | corvus: one review comment on 954246 | 20:14 |
corvus | replied | 20:16 |
fungi | thanks! lgtm | 20:16 |
clarkb | reviewing that now | 20:27 |
clarkb | corvus: is there a semaphore count increase change yet? I'm happy to do that and can write the change if not | 20:27 |
corvus | nope, was just sharing thoughts to gauge opinions :) | 20:31 |
clarkb | cool, I'll work on that after I review that other change | 20:32 |
corvus | unlike some other clouds, when a server in rax-flex is in error state, the 'fault' does not show up in the result we get back from listing /servers/detail. but it does show up with "openstack server show". is that a change in the openstack api? is there something we can add to get it returned in the list? | 20:34 |
clarkb | corvus: why drop the deleted nodes graphing? | 20:35 |
clarkb | corvus: chances are that is a change to nova since rax-flex is a relatively new version aiui | 20:35 |
corvus | there is no "deleting" node state | 20:35 |
clarkb | corvus: aha | 20:35 |
clarkb | melwitt: ^ do you know the answer to corvus' nova server api question there? | 20:35 |
opendevreview | Merged zuul/zuul-jobs master: Remove no_log for image upload tasks https://review.opendev.org/c/zuul/zuul-jobs/+/953983 | 20:36 |
corvus | i don't see anything in https://docs.openstack.org/api-ref/compute/ that would explain the behavior; in fact, it just says that "fault" is "Only displayed when the server status is ERROR or DELETED and a fault occurred." | 20:38 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6 https://review.opendev.org/c/opendev/system-config/+/954545 | 20:39 |
corvus | this is a great time to investigate this, because every node we're building in flex is getting: {'code': 500, 'created': '2025-07-09T20:15:49Z', 'message': 'No valid host was found. There are not enough hosts available.'} | 20:39 |
corvus | (but zuul can't see that, it just sees "ERROR") | 20:40 |
fungi | i've sent the service-announce post about removing cla enforcement from remaining projects in our gerrit | 20:41 |
clarkb | fungi: I've received the email | 20:42 |
fungi | the zl02 replacement inventory change failed in the gate | 20:42 |
clarkb | probably on the same image fetch 504 problem. I'm guessing quay is having a sad | 20:43 |
fungi | yeah, i haven't dug in | 20:44 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/954545 is an easy review if you think that is safe enough | 20:46 |
corvus | https://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#25700 | 20:46 |
corvus | it failed something something dstat deb on noble? | 20:46 |
clarkb | oh ya that is a known issue | 20:47 |
corvus | oh? | 20:47 |
clarkb | dstat is deprecated so some other tool picked it up (something copilot) and provides a shim for it. The packaging for that new tool has been flaky at times | 20:47 |
corvus | how is that non-deterministic? | 20:48 |
fungi | clarkb: approved. now going to try to tackle mowing the lawn for a bit before the rains return | 20:48 |
corvus | i guess the deb installs a python post-install script, and it's non-deterministic python? | 20:48 |
clarkb | corvus: I think it's a race amongst the related copilot tools starting up | 20:49 |
clarkb | corvus: they have dependencies and don't properly express that in their units | 20:49 |
clarkb | though the line you linked to seems to be something different | 20:50 |
corvus | is anything using dstat-logger? | 20:51 |
clarkb | corvus: I don't think anything uses it from a production standpoint. Its there for info gathering in CI | 20:51 |
clarkb | oh wait | 20:52 |
clarkb | corvus: the line you linked is for focal I think | 20:52 |
clarkb | https://zuul.opendev.org/t/openstack/build/d4994768717840c29cf6fc29a74e9074/log/job-output.txt#25640 is the actual error | 20:52 |
clarkb | corvus: and I think rc -13 is what happens when ansible attempts to use the controlpersist connection as ssh is shutting down that connection | 20:53 |
clarkb | we increased the controlpersist timeout to make this happen less often but some tasks can still trip over it if they run long enough | 20:53 |
clarkb | in this case a recheck is maybe sufficient. If we see rc -13 occurring more often again then we might tweak the controlpersist timeout again | 20:53 |
corvus | or maybe set up a keepalive | 20:54 |
corvus | zuul does -o ControlPersist=60s -o ServerAliveInterval=60 | 20:54 |
corvus | i have re-enqueued the zl02 change | 20:55 |
clarkb | let me see if we set serveraliveinterval | 20:55 |
opendevreview | Merged opendev/zuul-providers master: Increase launch timeout for ovh https://review.opendev.org/c/opendev/zuul-providers/+/954508 | 20:56 |
opendevreview | Merged openstack/project-config master: Update Zuul status node graphs https://review.opendev.org/c/openstack/project-config/+/954246 | 20:56 |
opendevreview | Merged opendev/system-config master: Increase infra-prod-playbook-limit semaphore to 6 https://review.opendev.org/c/opendev/system-config/+/954545 | 20:57 |
clarkb | corvus: man 5 ssh_config implies that ControlPersist relies on client connections to determine when to shut down the control master | 20:59 |
clarkb | that said I'm willing to add serveraliveinterval to see if it helps. Maybe the internal ping pong for the serveraliveinterval counts as a client connection | 20:59 |
melwitt | clarkb: I thought it should be returned when the GET /servers/detail API is called https://docs.openstack.org/api-ref/compute/#list-servers-detailed I will look into if there is some reason it would not | 21:00 |
clarkb | melwitt: thank you | 21:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update Ansible config to set ssh ServerAliveInterval https://review.opendev.org/c/opendev/system-config/+/954547 | 21:02 |
clarkb | melwitt: as mentioned we've got a cloud where we can pretty consistently reproduce from the client side and dan_with may be willing to help us debug further on the server side if there are things you think we should check | 21:04 |
corvus | clarkb: yes, i think that should help. can probably ramp the timeout down too, but it's not important, and not worth doing now. | 21:05 |
corvus | clarkb: do you mind giving https://review.opendev.org/951018 a re-review? it needed a rebase | 21:05 |
clarkb | on it. Should I recheck it as well? | 21:06 |
clarkb | I went ahead and rechecked it | 21:07 |
clarkb | looks like the only reason my vote was removed was the new depends on which has merged so basically a noop now | 21:07 |
melwitt | clarkb: so far I'm not seeing anything in https://github.com/openstack/nova/blob/master/nova/api/openstack/compute/views/servers.py#L504 that would prevent it from showing the fault. do you know what is being used to query the API? I'm guessing not OSC if you are wanting more attributes from the detail list | 21:08 |
clarkb | corvus: probably knows how that is being requested | 21:10 |
melwitt | I have a devstack handy so actually I can try this myself to see | 21:10 |
clarkb | melwitt: I think its doing a direct request with python requests | 21:11 |
clarkb | let me link the code | 21:11 |
corvus | https://paste.opendev.org/show/bOK5kTOJ66AJfTR3F1tQ/ | 21:11 |
clarkb | https://opendev.org/zuul/zuul/src/commit/9554f9593df757fcd3e5fbb8e6cc86ea55fd9c51/zuul/driver/openstack/openstackendpoint.py#L764-L804 | 21:12 |
corvus | there's the test script i'm using (for simplicity) | 21:12 |
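The paste above holds the actual script; a minimal stand-in for that kind of check could look like the following, assuming a keystone token and compute endpoint are already in hand (the URL and token here are placeholders):

```python
import requests

COMPUTE_URL = "https://compute.example.com/v2.1"  # placeholder endpoint
TOKEN = "gAAAA..."                                # placeholder keystone token
headers = {"X-Auth-Token": TOKEN}

# Compare the list-with-details response against the per-server GET:
# 'fault' is documented for servers in ERROR (or DELETED) state, but on
# the cloud in question it only appears in the single-server response.
listed = requests.get(f"{COMPUTE_URL}/servers/detail", headers=headers).json()
for server in listed["servers"]:
    if server["status"] != "ERROR":
        continue
    shown = requests.get(
        f"{COMPUTE_URL}/servers/{server['id']}", headers=headers
    ).json()["server"]
    print(server["id"], "fault in list:", "fault" in server,
          "fault in show:", "fault" in shown)
```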
melwitt | thx | 21:12 |
opendevreview | Merged opendev/system-config master: Replace zl02 https://review.opendev.org/c/opendev/system-config/+/954539 | 21:12 |
clarkb | corvus: ^ we should see how much quicker that one deploys compared to zl01 | 21:13 |
corvus | good point | 21:13 |
melwitt | I'm getting the same behavior ... no fault field in the list. I will need to debug what is going on here. as far as I know it's not intentional but I could be wrong. the only hint I find that it might be intentional is this sentence in the microversion history https://docs.openstack.org/nova/latest/reference/api-microversion-history.html#id77 "API returns a details parameter for each failed event with a fault message, similar to the server | 21:17 |
melwitt | fault.message parameter in GET /servers/{server_id} for a server with status ERROR" that is, it specifically mentions /servers/{server_id} only with regard to the fault field | 21:17 |
melwitt | gmaan might know actually, in the meantime | 21:18 |
corvus | hopefully that's just a harmless omission (ie, the fault field shows up in servers/{server_id} as well as servers/detail, but they omitted the last part because it doesn't change the meaning) | 21:20 |
corvus | melwitt: thanks for looking into it | 21:21 |
clarkb | looks like the letsencrypt job failed on the zl02 deployment | 21:33 |
clarkb | rc -13 on gitea13 checking the acme.sh script | 21:34 |
clarkb | nothing else has merged since. I'll just reenqueue that change to deploy | 21:34 |
clarkb | ok reenqueued | 21:36 |
corvus | thx | 21:36 |
melwitt | the more I looked at the code the more it looked like a bug and sure enough https://bugs.launchpad.net/nova/+bug/1856329 | 22:08 |
clarkb | hrm does that imply it is a cells configuration thing then? So maybe less a new openstack problem (that bug is fairly old) and more how the cells are laid out? | 22:10 |
melwitt | I think it's just a bug in the nova code, unfortunately | 22:12 |
corvus | the bug mentions a change which claims to fix it: https://review.opendev.org/c/openstack/nova/+/699176 | 22:15 |
clarkb | gitea13 is under high load and that is why the manage-projects job is taking a while I think | 22:16 |
clarkb | load seems to be slowly falling now so maybe manage-projects will complete successfully without intervention? | 22:20 |
corvus | The Meta-ExternalAgent crawler crawls the web for use cases such as training AI models or improving products by indexing content directly. | 22:20 |
melwitt | corvus: I see. off the top of my head I'm not sure if that's the best way to fix it but will review | 22:21 |
melwitt | the instance_list itself has code in it to fan out across cells so it feels like there's something wrong at a lower layer but I have to look into it more | 22:23 |
clarkb | corvus: if the manage-projects run times out or fails then the hourly runs should run zuul without depending on the manage-projects job | 22:26 |
corvus | oh good | 22:26 |
clarkb | for that reason I don't think we should bother re-enqueuing again or trying to manually run the playbook. A bit lazy but I think it will get things done | 22:26 |
corvus | yay it did not fail | 22:36 |
clarkb | woot | 22:36 |
corvus | #status log replaced zl02 with an 8gb server | 22:45 |
opendevstatus | corvus: finished logging | 22:46 |
corvus | i've restarted zl01 on the newest zuul commit as well; transferred the volumes to the new servers, and deleted the old servers | 22:46 |
clarkb | did zl02 deploy the latest code due to timing? | 22:47 |
corvus | outbound traffic on zl01 peaked around 800Mbps | 22:47 |
corvus | yes it did | 22:47 |
corvus | at some point we may want to look at what's capping the inbound traffic at 100Mbps, but that's not as important right now. | 22:48 |
clarkb | I wonder if that is limited on the source side | 22:49 |
corvus | yeah i wonder -- but we have 5 threads downloading in parallel, so if it's per-connection it's 20Mbps each; it might be more sophisticated traffic shaping based on endpoints and flows. | 22:53 |
corvus | clarkb: fungi do you mind taking a look at https://review.opendev.org/931824 and https://review.opendev.org/953820 ? that addresses something we saw last week, and i'd like to start exercising that fix | 22:57 |
clarkb | corvus: the checksum that is being validated is the one between intermediate upload storage and the launcher right? not launcher and glance? | 23:07 |
corvus | right, that's the checksum that we generate locally when we upload to intermediate storage, and we store that checksum in zuul's db | 23:09 |
corvus | locally == on the image build worker node | 23:09 |
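For context, that build-side checksum can be computed while the image is streamed to intermediate storage; a minimal sketch follows. Only the "hash on upload, record it, verify later" flow comes from the discussion; the sha256 choice, chunk size, and function name are assumptions:

```python
import hashlib

def upload_with_checksum(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Read an image file in chunks, hashing as we go.

    The returned digest is what would be recorded (e.g. in Zuul's DB) and
    later compared against the copy fetched from intermediate storage.
    """
    digest = hashlib.sha256()  # algorithm is an assumption, not confirmed
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
            # ... in the real flow each chunk would also be sent to the
            # intermediate storage upload here ...
    return digest.hexdigest()

# Usage (path is illustrative):
# print(upload_with_checksum("/tmp/build/image.qcow2"))
```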
clarkb | thanks, I think I found a bug in that one. Posted as comments | 23:10 |
corvus | clarkb: thanks fixed | 23:22 |
corvus | we're going to burn a lot of cycles trying to spin up failed servers in flex... do we want to drop the limits, or leave it in place and see how the system performs under very adverse conditions? | 23:24 |
corvus | (i think it will be okay) | 23:24 |
corvus | (just maybe a little slower than it would be if we turned flex off) | 23:25 |
corvus | https://grafana.opendev.org/d/0172e0bb72/zuul-launcher3a-rackspace-flex?orgId=1&from=now-24h&to=now&timezone=utc&var-region=$__all | 23:26 |
corvus | you can see the problem's already been happening for most of the day | 23:26 |
clarkb | corvus: maybe dan_with has an opinion? | 23:26 |
clarkb | james denton isn't in here | 23:26 |
gmaan | melwitt: clarkb corvus fault field should be present in both GET server and list server details. only thing is it is included in response if server is in fault status (error or deleted) otherwise nova does not add it | 23:48 |
clarkb | gmaan: ya except there is a bug apparently | 23:48 |
gmaan | so there are chances that some server response might not have it if they are not in error or deleted | 23:48 |
clarkb | so it isn't properly included even when in an error state | 23:49 |
melwitt | thanks gmaan. I later found there is already a bug open about it also https://bugs.launchpad.net/nova/+bug/1856329 I have pinged about it in #openstack-nova for awareness | 23:49 |
gmaan | clarkb: in both API responses you mean, right? or is there a diff between show and list detail? | 23:49 |
corvus | gmaan: thanks. that's my understanding of the intended behavior. we are seeing that fault is not present in /servers/detail on a very new cloud even on servers in ERROR (and the fault does appear with "server show") | 23:49 |
gmaan | ohk, got it. yeah just saw that | 23:49 |
clarkb | corvus: I've approved the checksum checking change despite fungi's comment since fungi +2'd it I figured its fine for now | 23:50 |
clarkb | we can adjust based on real world data if it is too slow as is | 23:50 |
corvus | ack, thx | 23:52 |