opendevreview | Merged opendev/system-config master: Add LE config for zuul01 https://review.opendev.org/c/opendev/system-config/+/816903 | 00:51 |
---|---|---|
corvus | zuul is still looking good; i think we should just continue to run it on master with both schedulers | 00:56 |
corvus | note that there is a minor consideration to be aware of re zuul: it is possible for the status json to be slightly out-of-date or flap between two states. it's random which scheduler formats the json, and they can be slightly out of sync with each other. i don't expect users to notice most of the time (i haven't even noticed so far), so i think it's okay to run in this configuration, and the benefits of early testing of 2 schedulers outweigh | 01:11 |
corvus | the downsides. but i wanted to make sure folks are aware of that in case it is noticed. (that's the next thing that's being worked on, so this isn't a long-term issue). | 01:11 |
opendevreview | Merged opendev/system-config master: refstack: don't chown db directory https://review.opendev.org/c/opendev/system-config/+/816761 | 02:00 |
ianw | Digest:sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac is the build from 816618 | 02:03 |
ianw | "opendevorg/gerrit@sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac" | 02:04 |
ianw | so we've got the right image on the server | 02:04 |
ianw | i'm going to restart now to get the new plugin | 02:06 |
ianw | #status log restarted gerrit to pick up plugin changes from #816618 | 02:09 |
opendevstatus | ianw: finished logging | 02:10 |
opendevreview | Merged opendev/system-config master: Adjust RefStack build for osf->openinfra rename https://review.opendev.org/c/opendev/system-config/+/808480 | 02:20 |
fungi | i'm back, but looks like everything's okay | 02:51 |
*** diablo_rojo_phone_ is now known as diablo_rojo_phone | 04:06 | |
*** diablo_rojo_phone is now known as Guest5311 | 04:06 | |
*** gouthamr_ is now known as gouthamr | 04:07 | |
*** Guest5311 is now known as diablo_rojo_phone | 04:08 | |
opendevreview | Ian Wienand proposed openstack/project-config master: Add Fedora 35 https://review.opendev.org/c/openstack/project-config/+/816931 | 04:15 |
opendevreview | Ian Wienand proposed openstack/project-config master: Add Fedora 35 https://review.opendev.org/c/openstack/project-config/+/816931 | 04:20 |
opendevreview | Ian Wienand proposed openstack/project-config master: Set Fedora 34 min-ready to zero https://review.opendev.org/c/openstack/project-config/+/816932 | 04:20 |
opendevreview | Ian Wienand proposed openstack/project-config master: Remove Fedora 34 https://review.opendev.org/c/openstack/project-config/+/816933 | 04:20 |
ianw | 2021-11-08 04:28:04.208 | ++ /opt/dib_tmp/dib_build.VHZvuUPn/hooks/root.d/08-containerfile:main:62 : podman run -d dib-work-image /bin/sh | 04:29 |
ianw | 2021-11-08 04:28:04.487 | Error: error allocating lock for new container: no space left on device | 04:29 |
ianw | this is weird because there is space | 04:29 |
ianw | so it must be talking about something else ... | 04:30 |
ianw | nb02 | 04:30 |
ianw | same thing on nb01 | 04:31 |
opendevreview | Merged openstack/diskimage-builder master: tests: remove debootstrap install https://review.opendev.org/c/openstack/diskimage-builder/+/815571 | 04:59 |
ianw | $ podman ps -a | wc -l | 06:33 |
ianw | 2049 | 06:33 |
ianw | i think this has to do with it | 06:33 |
*** ysandeep|pto is now known as ysandeep | 06:36 | |
ianw | $ podman ps -a | wc -l | 06:37 |
ianw | 2049 | 06:37 |
ianw | same number on nb01 as nb02 ... that can not be a coincidence | 06:37 |
ianw | I've run "podman system reset"; the builders have daily queues going now, and i suspect that will fix it for now | 06:38 |
ianw | i'd say we're leaking the dib-work-image. i'll look into it more tomorrow | 06:38 |
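The "no space left on device" here is most likely podman's per-container lock pool rather than filesystem space: podman allocates a fixed number of SHM locks (num_locks in containers.conf, 2048 by default if memory serves), and ~2049 lines of `podman ps -a` output (one header plus roughly 2048 containers) sits right at that ceiling. A rough sketch of the kind of inspection and cleanup involved, assuming the leaked containers all descend from the dib-work-image built by the containerfile element:

```shell
# how many containers exist (the header line is included in the count)
podman ps -a | wc -l

# list the exited containers created from the dib work image
podman ps -a --filter status=exited --filter ancestor=dib-work-image

# remove just those leaked containers ...
podman ps -a -q --filter ancestor=dib-work-image | xargs -r podman rm

# ... or, as was done here, wipe all podman state on the builder
podman system reset --force
```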
*** akahat is now known as akahat|rover | 07:00 | |
*** pojadhav is now known as pojadhav|lunch | 07:47 | |
*** sshnaidm|off is now known as sshnaidm | 08:04 | |
*** pojadhav|lunch is now known as pojadhav | 08:19 | |
*** ykarel is now known as ykarel|lunch | 08:53 | |
*** ysandeep is now known as ysandeep|afk | 09:18 | |
*** ysandeep|afk is now known as ysandeep | 09:51 | |
*** ykarel|lunch is now known as ykarel | 10:21 | |
opendevreview | Martin Kopec proposed opendev/system-config master: Update Interop doc https://review.opendev.org/c/opendev/system-config/+/816982 | 10:33 |
tobias-urdin | ianw: thanks for fixing the fedora image patches when it was removed from mirrors :) | 10:33 |
*** pojadhav is now known as pojadhav|afk | 10:51 | |
*** tobias-urdin5 is now known as tobias-urdin | 11:06 | |
*** lbragstad7 is now known as lbragstad | 11:41 | |
*** ysandeep is now known as ysandeep|brb | 11:45 | |
*** ysandeep|brb is now known as ysandeep | 11:58 | |
*** jpena|off is now known as jpena | 13:24 | |
fungi | in the openstack tenant i see we've got some queue items waiting hours with a handful of their builds queued. no idea yet if it's related to the zuul updates over the weekend and dual schedulers, or maybe a recurrence of the problem we observed with dependent job sets stuck trying to get nodes from very constrained providers like airship-kna1. i'll start looking once my coffee kicks in a little | 13:36 |
fungi | also i get the impression the periodic-stable pipeline e-mails did not get distributed to the stable-maint ml. i'll double-check it, but we may be sending e-mail differently (or not at all?) from the second scheduler | 13:42 |
*** ykarel is now known as ykarel|away | 14:36 | |
corvus | fungi: thanks, i'll start looking into those | 14:38 |
fungi | stable-maint ml may just be me imagining things. i see at least some failures posted there as recently as Mon Nov 8 03:32:26 UTC 2021 | 14:43 |
corvus | fungi: oh, i believe i did not re-enqueue a batch of periodic jobs over the weekend; so they may not have gotten saturday/sunday but should have gotten sunday/monday | 14:44 |
fungi | yep, seems like that's working as expected, thanks | 14:45 |
corvus | oops, one of the log lines has accidentally changed to be less helpful, which will make tracking down this node issue a little harder | 14:46 |
fungi | i think check queue item 816807,3 in the zuul tenant may be exhibiting the same situation i'm seeing for a number of changes in the openstack tenant | 14:47 |
corvus | "for job <zuul.model.FrozenJob object at 0x7ff3fe56dbb0>" that used to be a job name :/ | 14:47 |
*** pojadhav|afk is now known as pojadhav | 14:50 | |
corvus | i used the repl to get the node request id for the top check queued job: 'neutron-tempest-plugin-scenario-openvswitch': '300-0016042207' | 14:51 |
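For reference, a node request like that can be cross-checked from the nodepool side as well; a minimal sketch, assuming shell access to a launcher with the nodepool CLI available:

```shell
# is request 300-0016042207 still present in zookeeper, and in what state?
nodepool request-list | grep 300-0016042207
```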
fungi | as for general exception roundup, i'm now spotting a couple of new ones in the scheduler debug logs aside from the previously observed "TypeError: 'NoneType' object is not subscriptable" | 14:51 |
fungi | this hit a couple of times on zuul02: https://paste.opendev.org/show/810851/ | 14:53 |
corvus | that could definitely cause problems; maybe even the one we're seeing now with unhandled node requests | 14:54 |
fungi | and this on zuul01: https://paste.opendev.org/show/810852/ | 14:54 |
fungi | also zuul01 logged 13 occurrences of kazoo.exceptions.ConnectionLoss and zuul02 logged 54 of them | 14:55 |
fungi | no idea if any of that is related | 14:55 |
corvus | that one is probably harmless; i think it's a delete of an already deleted node, but could be a follow-on from the first error | 14:55 |
corvus | that's a lot of connection losses | 14:56 |
fungi | trying to see if they're interspersed randomly or clustered | 14:56 |
corvus | would also be good to see what the rate was last week | 14:56 |
fungi | /var/log/zuul/debug.log.6.gz on zuul02 contains none as far as i can see | 14:57 |
fungi | checking a few other days | 14:58 |
fungi | as for clustering, it looks like the entries today on zuul02 came in bunches, around 11:38, 12:03-12:06, 12:10, 12:16, 12:27, 13:18, 13:36, 13:55, 14:16, 14:26, 14:29, 14:34, 14:58 | 15:00 |
fungi | some clusters were larger, some as few as one exception | 15:00 |
fungi | there could be some ongoing/intermittent network disruption | 15:01 |
fungi | did we maybe only start logging kazoo.exceptions.ConnectionLoss in some very recent commit? i've scanned back more than two weeks in the zuul02 debug log and found no hits before today | 15:04 |
corvus | wow, zuul01 saw a multi-hour period it couldn't connect to zk: from 06:56:17 to 08:09:34 | 15:05 |
corvus | fungi: no that should be unchanged; something is likely very different with our zk connection, whether that's zuul or network issues is unclear | 15:05 |
fungi | yeah, i'm continuing to grep farther and farther back, finally found one from 2021-10-22 07:55:08,906 | 15:06 |
fungi | so it's been almost 3 weeks | 15:07 |
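A sketch of the kind of sweep being described, using the scheduler debug log paths mentioned above:

```shell
# count ConnectionLoss hits per rotated scheduler debug log
for f in /var/log/zuul/debug.log*; do
  printf '%s ' "$f"
  zgrep -c 'kazoo.exceptions.ConnectionLoss' "$f" || true
done
```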
fungi | corvus: this is likely related? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70037&rra_id=all | 15:07 |
fungi | the other two zk servers are only using 42% of their rootfs | 15:08 |
corvus | that is entirely possible | 15:09 |
fungi | /var/zookeeper/datalog/version-2 contains 28gb of data | 15:10 |
fungi | and that's around 9gb on the other two servers | 15:11 |
fungi | so 3x larger on 04 | 15:11 |
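Roughly the checks behind those numbers, run on each zk server:

```shell
# rootfs usage, then the size of the zookeeper transaction logs and snapshots
df -h /
du -sh /var/zookeeper/datalog/version-2
```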
corvus | starting at 02:00 the amount of data in zk spiked according to https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1&from=now-24h&to=now | 15:11 |
fungi | periodic-stable jobs were going on around then | 15:11 |
fungi | for the openstack tenant | 15:12 |
fungi | periodic-stable triggers at 02:01 utc daily | 15:12 |
fungi | and periodic at 02:00 utc | 15:13 |
fungi | so that could be related to queuing hundreds/thousands of builds in a matter of minutes | 15:14 |
corvus | makes sense | 15:17 |
corvus | it looks like the other zk servers had their storage fill up as well, but they recovered? | 15:18 |
fungi | looks like zk may have crashed on 04? i'm trying to see if there are clues in the docker-compose logs | 15:21 |
fungi | 2021-11-08 06:53:42,330 [myid:4] - ERROR [Snapshot Thread:ZooKeeperServer@329] - Severe unrecoverable error, exiting | 15:22 |
fungi | java.io.IOException: No space left on device | 15:22 |
fungi | seems like filling up the rootfs crashed it when trying to snapshot | 15:22 |
fungi | i wonder if it would recover if we freed up enough space to start it again | 15:23 |
fungi | also i wonder if it should really be consuming that much disk, and if so we should probably attach dedicated volumes for them | 15:24 |
*** ysandeep is now known as ysandeep|dinner | 15:25 | |
fungi | happy to get started on that if it's the direction we want to go | 15:25 |
corvus | we have zk purge snapshots every 6 hours; that corresponds with the drop in this graph: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70101&rra_id=all | 15:25 |
corvus | i wonder if we're keeping too many snapshots and should reduce that | 15:26 |
corvus | aiui, we retain 3 snapshots, but only delete the ones >3 every 6 hours | 15:26 |
fungi | the disk utilization graph suggests that it doesn't normally use anywhere close to this much on previous days | 15:27 |
corvus | on zk04 (which is helpfully frozen in time at max usage), we have around 75 snapshot files | 15:28 |
corvus | and around 25 on zk05 | 15:28 |
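A hedged sketch of how those counts can be taken; ZooKeeper keeps snapshot files under the data dir's version-2 directory, assumed here to be /var/zookeeper/data (only the datalog path is confirmed above):

```shell
# count retained snapshot files on a zk server (data dir path assumed)
ls /var/zookeeper/data/version-2/snapshot.* | wc -l
```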
fungi | growth on all three zk servers seems to have increased by close to an order of magnitude | 15:29 |
corvus | fungi: maybe before looking at expanding the disk, we should increase the snapshot cleanup frequency? | 15:29 |
corvus | maybe every 2 hours? | 15:30 |
fungi | yeah, that seems worthwhile, though it also appears that letting zk fill up the rootfs destabilizes it; dedicating a filesystem would shield the system from that impact | 15:30 |
corvus | yes, that's true. want to do both then? | 15:30 |
fungi | i'm fine with focusing on cleanup improvements first, and then working on adding cinder volumes after we've stabilized the cluster | 15:31 |
corvus | ok. i'll propose a patch to change the interval | 15:32 |
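The knobs in question are ZooKeeper's autopurge settings in zoo.cfg; something along these lines is presumably what the proposed change adjusts (values illustrative):

```
# zoo.cfg
autopurge.snapRetainCount=3   # keep the 3 most recent snapshots (ZooKeeper's minimum)
autopurge.purgeInterval=2     # run the purge task every 2 hours instead of every 6
```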
fungi | i'm curious what's come in after 4.10.4 though which causes it to create so much larger snapshots | 15:32 |
corvus | queue state :) | 15:32 |
fungi | ahh, okay, so that's the churn from moving queue state in | 15:32 |
corvus | i have some thoughts on reducing the data size, but they depend on having the queue state in zk to start with | 15:33 |
corvus | basically, we should be able to share some data which we currently duplicate in the executor queue | 15:33 |
corvus | (global repo state, job vars, etc; with queue state in zk now, executors can reach all of that data directly) | 15:34 |
fungi | what's the safest way to get zk going again on 04? i could free up ~1gb by clearing /var/log/journal for instance | 15:34 |
fungi | or can we just start it as-is and expect it will clean up old snapshots before it tries to write anything? | 15:35 |
corvus | fungi: i'm unsure; it might be okay to manually delete the snapshots | 15:36 |
Clark[m] | Not really here yet but there might be old docker images to prune too if we aren't pruning them regularly | 15:36 |
corvus | i think we should get the new interval in place first though | 15:36 |
corvus | (that way we can do just one rolling restart to get it in use) | 15:37 |
fungi | there's only one zookeeper image and two opendevorg/zookeeper-statsd images according to `docker image list` | 15:37 |
fungi | i could delete the older opendevorg/zookeeper-statsd which would free up 122mb | 15:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Purge ZK snapshots more frequently https://review.opendev.org/c/opendev/system-config/+/817038 | 15:38 |
corvus | i'm going to work on breakfast, then resume zk work | 15:55 |
fungi | i see what you mean about the status json being inconsistent, i was watching that change in the status page and saw a couple of builds transition from starting back to queued | 15:55 |
*** ysandeep|dinner is now known as ysandeep | 15:56 | |
yuriys | I noticed some errors for launch attempts on the inmotion cloud, and all of them were due to (raise exception.NoMoreFixedIps(net=network_id) , which shouldn't be happening, because that's static configuration, unless we're not 'releasing' ips fast enough during destroy+provision requests. | 15:57 |
yuriys | although it did go from 0 to 100 at 13:58 UTC... hmm | 16:00 |
fungi | we have at times in the past seen floating ips leak due to pathological conditions in clouds preventing us from deleting them when we free up the ports/servers to which they were bound | 16:02 |
yuriys | Yeah looks like at 13:58 a call to destroy nearly all instances went out, shortly followed by rebuilding all of them; not quite sure how to combat that for the ip release delay vs the reprovision | 16:02 |
fungi | nodepool expects to occasionally see errors for such things, it's fine as long as they don't persist and it's able to eventually start booting new server instances there | 16:03 |
fungi | it's possible we could teach the launchers to track floating ip capacity similar to how it does other limits and refrain from making boot calls until there's capacity to assign new floating ips, i'm not quite sure what would be entailed in implementing that however | 16:05 |
yuriys | yeah it looks good now, just need a delay between destroy and create calls | 16:06 |
Clark[m] | Note this wasn't floating ips just normal IPs aiui | 16:06 |
fungi | oh, yep sorry i see now from the pasted error it was fixed ips | 16:07 |
fungi | so i guess the "ip release delay" is something neutron does? | 16:08 |
fungi | and likely doesn't expose it via the api such that we could query | 16:08 |
clarkb | as soon as my glasses are defogged I can ACK the zk purge interval change | 16:18 |
clarkb | Let me know if I can help with zk otherwise I'll stay out of the way as I've not been on top of it with y'all | 16:20 |
fungi | there's not much to catch up on. basically latest zuul moved queue state into zk which creates a ton of new churn resulting in much larger zk snapshots, so much so that it's easy for them to overrun the available space on the rootfs within the 6-hour cleanup interval we'd configured | 16:21 |
fungi | and that can also lead to crashing zk entirely if it runs out of space while writing a snapshot | 16:22 |
clarkb | I think another option available to us is to increase the snapshot interval | 16:22 |
clarkb | so that we produce snapshots less often | 16:22 |
clarkb | that is based on numbers of transactions iirc | 16:23 |
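The transaction-count setting being referred to is ZooKeeper's snapCount; raising it spaces snapshots further apart, at the cost of more transaction log data between them:

```
# zoo.cfg
snapCount=100000   # ZooKeeper's default; a larger value makes snapshots less frequent
```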
fungi | either way, yes the goal is to keep the snapshot count to a manageable level, but also probably to relocate /var/zookeeper or at least /var/zookeeper/datalog to a dedicated filesystem/volume | 16:23 |
clarkb | apparently I've got a kernel update waiting so first reboots, then acking that change | 16:25 |
fungi | also we may need some manual intervention on zk04 as the zk service on it crashed ~10 hours ago leaving its rootfs full | 16:25 |
fungi | but holding off there until the cleanup interval change deploys | 16:26 |
opendevreview | Clark Boylan proposed opendev/bindep master: Try out PBR pep 517 support https://review.opendev.org/c/opendev/bindep/+/816741 | 16:43 |
clarkb | through local testing I think maybe all we need to do is keep the setup.py script and then it will work? we'll find out | 16:43 |
clarkb | if that does work I have a docs update that I need to push to pbr | 16:44 |
fungi | if it winds up calling into setup.py even with latest build/pip/wheel, then that probably doesn't avoid the deprecation warning | 16:45 |
clarkb | fungi: ya I'm not sure on that point yet. You'd think that the official method of doing this via setuptools wouldn't trip the deprecation warning. But I don't know yet | 16:46 |
fungi | though i guess it's just calls to `setup.py install` and setup_requires/easy_install which trigger the warnings | 16:47 |
fungi | so calling into setup.py on its own may not be a problem | 16:47 |
clarkb | fungi: https://github.com/pypa/setuptools/blob/main/setuptools/build_meta.py#L158 ultimately that is what it does | 16:47 |
clarkb | oh ya and its doing setup.py sdist and setup.py bdist_wheel equivalents | 16:48 |
clarkb | and then those artifacts are handled directly by pip or poetry or whatever | 16:48 |
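For context, the pattern under discussion looks roughly like this: declare the setuptools PEP 517 backend in pyproject.toml while keeping a minimal setup.py so pbr can hook in. A sketch of what the bindep change is presumably trying, not a copy of it (version pins illustrative):

```toml
# pyproject.toml
[build-system]
requires = ["pbr>=5.7.0", "setuptools>=36.6.0", "wheel"]
build-backend = "setuptools.build_meta"
# setup.py is kept, containing little more than setuptools.setup(pbr=True)
```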
*** marios is now known as marios|out | 17:03 | |
*** ysandeep is now known as ysandeep|out | 17:04 | |
corvus | i think i have identified the issue that's caused some stuck 'queued' jobs. details and a fix in #zuul. i'd like to merge the fix, then do what i guess would be our first ever rolling scheduler restart to deploy it. | 17:08 |
fungi | woah, i'm game! | 17:14 |
corvus | looks like we're about 5m or so from the zk interval change landing. perhaps once it does, we can go ahead and manually update it and roll zk | 17:15 |
fungi | should we expect the rolling restart to get those stuck queue items moving again? (for whatever definition of "expect" we have at this point in regards to rolling scheduler restarts) | 17:19 |
fungi | looking at the fix, i assume we'll need to manually dequeue/enqueue them | 17:20 |
clarkb | ya I think we would have to reenqueue so they get new node requests | 17:20 |
fungi | as they presumably correspond to incorrectly cleaned-up node requests | 17:20 |
fungi | or we'd have to wait for the scheduler's node request timeout, which if memory serves is fairly long (8 hours? 10?) | 17:21 |
corvus | we may not have a node request timeout. and yeah, dequeue/enqueue. we could go ahead and do that now. | 17:23 |
corvus | i'll get started on that | 17:23 |
opendevreview | Merged opendev/system-config master: Purge ZK snapshots more frequently https://review.opendev.org/c/opendev/system-config/+/817038 | 17:24 |
fungi | there are a couple of builds to go in the opendev-prod-hourly buildset, before it will be able to start infra-prod-service-zookeeper in deploy | 17:26 |
fungi | huh, infra-prod-service-zuul just failed in opendev-prod-hourly | 17:28 |
fungi | oh, it's the lack of letsencrypting breaking apache reloads | 17:30 |
fungi | on zuul01 | 17:30 |
clarkb | yes, known problem with a proposed fix | 17:31 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/816903 hrm, the fix is merged; maybe it was an insufficient fix? | 17:31 |
clarkb | ACME records I bet. One sec | 17:31 |
fungi | clarkb: following https://discuss.python.org/t/11810 it seems like we may still be forced to call into deprecated setuptools functions when doing editable installs | 17:31 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add acme record for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/817061 | 17:32 |
clarkb | fungi: corvus ^ re zuul01 le | 17:33 |
fungi | and yeah, i agree with 817061 | 17:33 |
fungi | [Mon Nov 8 03:41:12 UTC 2021] zuul01.opendev.org:Verify error:DNS problem: NXDOMAIN looking up TXT for _acme-challenge.zuul01.opendev.org - check that a DNS record exists for this domain | 17:33 |
fungi | rather, acme.sh agrees ;) | 17:33 |
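The failure is visible directly with a DNS query; once 817061 lands and the zone is published, this should stop returning NXDOMAIN (the added record is presumably a CNAME pointing the challenge name at the acme zone, as is the usual pattern here):

```shell
# what acme.sh is looking up, and the delegation it likely rides on
dig +short TXT _acme-challenge.zuul01.opendev.org
dig +short CNAME _acme-challenge.zuul01.opendev.org
```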
clarkb | fungi: re editable installs sounds like setuptools is hoping to address that at some point | 17:34 |
fungi | right, they simply haven't yet | 17:34 |
fungi | i should have said "...for now" | 17:34 |
corvus | clarkb: thx, sorry | 17:35 |
corvus | there are many items that need re-enqueueing; i'm having emacs help me with the commands | 17:35 |
fungi | elisp to the rescue? | 17:36 |
corvus | okay, it's 122 commands (half dequeue half enqueue) | 17:36 |
corvus | that's running. it will take a long time. | 17:37 |
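The generated commands are pairs along these lines, run with the zuul RPC client on a scheduler (project and change values here are hypothetical):

```shell
# drop and re-add one affected change so it gets fresh node requests
zuul dequeue --tenant openstack --pipeline check \
    --project opendev.org/openstack/nova --change 816999,1
zuul enqueue --tenant openstack --pipeline check \
    --project opendev.org/openstack/nova --change 816999,1
```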
corvus | it should be safe to manually edit the zoo.cfg files now, right? | 17:38 |
fungi | infra-prod-service-zookeeper failed in deploy | 17:38 |
*** jpena is now known as jpena|off | 17:38 | |
clarkb | likely due to the lack of disk space to ansible on zk04? | 17:38 |
corvus | probably because zk04 is out of space | 17:38 |
corvus | yeah | 17:38 |
fungi | mkdir: cannot create directory ‘/root/.ansible/tmp/ansible-tmp-1636392894.8472807-6361-108486208652702’: No space left on device | 17:39 |
fungi | indeed | 17:39 |
corvus | i'd like to manually make those edits on all 3 servers; delete the snapshots from under zk04, then restart zk04. how does that sound? | 17:40 |
clarkb | that seems reasonable. | 17:40 |
clarkb | (I don't know how zk04 will handle the loss of snapshots but in theory they are intended to be moved around for restoring stuff so should be safe to move them away/delete them) | 17:40 |
corvus | yeah; i'm expecting it to fully resync from the other servers; we'll see | 17:41 |
fungi | sounds great to me, thanks | 17:41 |
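The plan, as a rough shell outline; the compose directory and data dir paths are guesses, and the key points are to leave the node stopped while clearing the version-2 contents (keeping myid intact), then let it resync from the quorum on startup:

```shell
# on zk04, with the zookeeper container already down
cd /etc/zookeeper-compose                              # hypothetical path
docker-compose ps                                      # confirm zk is stopped
sudo rm /var/zookeeper/datalog/version-2/log.*         # transaction logs
sudo rm /var/zookeeper/data/version-2/snapshot.*       # snapshots (data dir path assumed)
docker-compose up -d                                   # rejoin and resync from the leader
```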
corvus | i deleted the logs and snapshots. that appears to have caused a zk outage :/ | 17:47 |
corvus | zuul is...recovering now? | 17:49 |
fungi | was that by deleting from 05 and/or 06 with 04 still down, or was 04 running again? | 17:50 |
corvus | only restarting 04 after deleting files from disk | 17:50 |
fungi | maybe the cluster members got really busy trying to restore missing snapshots so stopped handling requests for a while? | 17:51 |
fungi | i guess zk doesn't like to answer requests if it has reason to believe the cluster is incoherent | 17:51 |
clarkb | I think snapshots are per node and sort of forgotten about once written, aside from purging | 17:51 |
fungi | ahh, so maybe not then | 17:51 |
clarkb | but zk04 reentering the cluster will cause the cluster to bring it up to speed for the current dataset | 17:52 |
corvus | 05 crashed when i brought 04 up | 17:52 |
fungi | ouch | 17:52 |
corvus | i think that explains the outage (but i don't know why 05 crashed) | 17:52 |
corvus | 06 stayed up | 17:53 |
fungi | something was clearly going on with zk05 around 17:39 | 17:54 |
corvus | my dequeue command is stuck. i wonder if the rpc handler thread is stuck | 17:54 |
corvus | however, both schedulers are still processing pipelines | 17:56 |
corvus | i've restarted the dequeue/enqueue script | 17:56 |
corvus | and it is proceeding (i'm unsure if only one scheduler is handling those commands now or both; it doesn't really matter since we're going to restart them) | 17:57 |
fungi | zk05 logged quite a few "java.io.IOException: ZK down" and immediately prior to those appearing there was this: | 17:57 |
fungi | 2021-11-08 17:39:42,051 [myid:5] - WARN [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@96] - Exception when following the leader | 17:57 |
corvus | the zk graphs look reasonable now; i think it's time to update the config for 05 and restart it now | 17:57 |
corvus | oh, zk05 disk is full | 17:58 |
fungi | d'oh! | 17:58 |
fungi | you'd think i would have checked that FIRST | 17:58 |
* fungi sighs | 17:58 | |
corvus | now we know why it crashed | 17:58 |
corvus | i did check it first, but i clicked on ze05 and not zk05 :( | 17:59 |
* corvus sighs | 17:59 | |
corvus | okay, so i'm going to repeat the process i did on zk04 | 17:59 |
clarkb | should we check zk06 state first? | 17:59 |
clarkb | then see if restarting zk05 causes it to fill its disk? | 18:00 |
corvus | there's lots of space on zk06 right now | 18:00 |
fungi | /dev/root 39G 7.9G 31G 21% / | 18:00 |
clarkb | like becoming the leader causes an explosion of disk use maybe? | 18:00 |
corvus | so i think we have good reason to think it might work this time | 18:00 |
clarkb | cool | 18:00 |
corvus | starting 05 now | 18:02 |
fungi | disk usage is climbing on 06, but not super rapidly | 18:03 |
opendevreview | Merged opendev/zone-opendev.org master: Add acme record for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/817061 | 18:05 |
corvus | i don't know if we get a confirmation that zk05 is fully synced or not. it looks like it has the right amount of data, but there's no warm-fuzzy log message saying everything is okay. | 18:06 |
clarkb | corvus: I think with zookeeper it can't be part of the cluster without having synced fully | 18:06 |
clarkb | basically once it shows up as a follower to one of the others it is good. But ya no explicit indication of that | 18:06 |
corvus | i think i'm ready to assume that everything is okay and we can proceed to zk06 now | 18:06 |
corvus | ++ | 18:07 |
corvus | i will shut down zk06 now. it's the leader, so this will cause some churn | 18:07 |
corvus | starting 06 | 18:08 |
corvus | it's back up; all 3 servers are up; 05 is the leader now | 18:09 |
corvus | i think that concludes zk work. dequeue/enqueue script is still crawling along. | 18:10 |
clarkb | then next up is the update to fix the identified zuul bugs? | 18:11 |
corvus | incidentally, i just observed the expected status page flap for the first time on the zuul tenant status page | 18:11 |
corvus | yep | 18:11 |
fungi | i observed it a few times earlier today watching system-config changes in the status page for the openstack tenant | 18:12 |
corvus | once those changes merge, we'll do the rolling restart. | 18:12 |
clarkb | corvus: for the status page flap that happens because zuul-web can load the state from the non active scheduler and get stale info? | 18:12 |
corvus | clarkb: yep (though i wouldn't characterize it as a non-active scheduler, but rather, a scheduler who may not have processed the pipeline most recently) | 18:13 |
clarkb | the one that won the election least recently :) | 18:14 |
corvus | (since it's sort of random which scheduler processes which pipeline(s) at any given time); odds are they should eventually converge on the same data relatively quickly | 18:14 |
corvus | they're both active at the same time. they just both race through the list of tenants and pipelines and try to lock and process each one individually | 18:15 |
corvus | if they start at the same time, and one of them locks openstack/check the other will skip that one and lock openstack/gate | 18:15 |
clarkb | got it | 18:15 |
corvus | then when the first finishes openstack/check, it'll try to lock openstack/gate. if the second has finished, it will succeed and process it again (which is wasteful but harmless and quick); if the second still has the lock, it'll leapfrog and go to openstack/post. | 18:16 |
fungi | keeping an eye on the filesystem utilization on the zk servers... based on the fairly linear trending i anticipate we'll keep them under 60% used when they prune snapshots | 19:02 |
fungi | s/when/just before/ | 19:02 |
clarkb | note I'm trying to dequeue/enqueue things to address the fight between the schedulers over builds with retry results | 19:02 |
fungi | this should also give us some idea on sizing for cinder volumes | 19:02 |
clarkb | it is slow going. | 19:03 |
fungi | though i think rackspace requires cinder volumes be a minimum of 100gb, which will probably be 4x what we end up using with the current churn and pruning | 19:04 |
clarkb | I am unable to tell if the tripleo gate queue is thrashing in this way or not. It is hard for me to see with the flapping status page. It seems like tripleo might be ok? | 19:06 |
corvus | i monkeypatched in a fix for the retry thrashing bug; actual fix is https://review.opendev.org/817067 (and that's == what i patched) | 19:26 |
clarkb | I've approved it just now | 19:28 |
clarkb | I have ~7 changes to reenqueue that were dequeued and that is in progress | 19:28 |
corvus | clarkb: yeah, i don't see a way to dequeue that periodic item, but with the immediate retry thrashing issue resolved i think we can probably just leave it? | 19:28 |
corvus | (if we really wanted to dequeue it i could probably come up with something in the repl, but danger seems to outweigh benefit) | 19:29 |
clarkb | ok | 19:29 |
corvus | incidentally, my de/re-enqueue of nodepool-bug-affected changes is continuing | 19:30 |
fungi | it will eventually time out the node requests, right? | 19:30 |
clarkb | a number of the jobs are actually running I think | 19:30 |
clarkb | it should largely be harmless other than node consumption | 19:30 |
corvus | fungi: at this point i don't think the periodic item is suffering from any bugs | 19:30 |
corvus | (and i don't believe node requests have a timeout) | 19:31 |
corvus | so i think the current state is: retry bug hotpatched; nodepool bug still present but rarely triggered; remediation of earlier nodepool bug fixes in progress; waiting for zuul changes to merge in order to rolling-restart with all known bugs fixed | 19:32 |
fungi | i thought the scheduler would resubmit a node request if not satisfied in something like 8 hours? | 19:32 |
fungi | (i vaguely recall reviewing a change for that, at least) | 19:33 |
SpamapS | Ohai, trying to add an email identity on my opendev account.. just wondering if the email loop verification emails are working. I waited a whole 5 minutes. ;) | 19:33 |
clarkb | down to two reenqueues in progress and then I should be done recovering | 19:33 |
fungi | SpamapS: i can check the mta on the server | 19:33 |
corvus | i'm going to afk for a bit | 19:34 |
SpamapS | fungi: TY, I don't mean to be urgent I'd just like to submit the patch before I have to task switch back to not-zuul things. :) | 19:35 |
fungi | SpamapS: i see a message for what i assume to be your new work address accepted by aspmx.l.google.com at 19:29:21 utc with gsmtp queue id e18si38289043vsu.306 | 19:35 |
fungi | maybe check your junk mail? | 19:35 |
fungi | definitely wasn't rejected, at least | 19:36 |
SpamapS | Oh boo it was counted as spam wow that's rare. | 19:36 |
SpamapS | TY, reported as not-spam :) | 19:36 |
fungi | SpamapS: if you can help us determine why it was counted as spam we could try to remedy it or pass the reasons along to the gerrit upstream maintainers | 19:37 |
clarkb | reenqueue complete for the things I dequeued. I think we're in a good spot now. But will keep an eye on the openstack check queue in particular to make sure things are rotating out of there | 19:38 |
SpamapS | fungi: DM'd you the full text with headers. I don't see many clues in there. It's possible some creepy AI made a few assumptions. | 19:39 |
clarkb | I want to say in the past we've had to force smtp over ipv4 for gmail addresses. Maybe we need to find a way to send email via ipv4 for any google hosted domain? | 19:40 |
fungi | clarkb: in this case it went over ipv4 anyway from the looks of the headers and our logs | 19:46 |
clarkb | gotcha likely unrelated to google assuming ipv6 mail is spam then | 19:47 |
fungi | i like that my personal mailserver's spamassassin is configured to inject a header line indicating which sa rules a given message matched so i can have some idea of why it was or wasn't considered spam | 19:47 |
SpamapS | It's also possible that it's just assuming nobody has ever sent @spotify.com an email from opendev and so it made me +1 the first one. | 19:47 |
fungi | wish other mailservers did that | 19:47 |
clarkb | I think the tripleo gate queue may have reset without a failing job. Indication of another bug? I don't have hard evidence of that though; it's more that I was skimming and noticed jobs seemed to go away and have now restarted, but I don't see a failed change | 19:52 |
fungi | zk04 pruned its snapshots around 50% utilization on the fs | 19:54 |
clarkb | tooz runs a ridiculous number of jobs | 19:54 |
fungi | i expect 05 and 06 to also do so shortly, and they're at similar utilization levels | 19:54 |
fungi | yep, they all cleaned up at around 20-21gb used on the rootfs | 20:28 |
fungi | hopefully that's enough to survive through tomorrow's periodic jobs | 20:29 |
clarkb | we are able to reduce the cleanup period to an hour too | 20:29 |
clarkb | a few more options available to us to trim it back further if we feel that is necessary | 20:30 |
fungi | yep | 20:30 |
clarkb | the zuul fixes are landing now and over the next half hour should hopefully all be in | 20:31 |
clarkb | I'm going to have to grab lunch soon but will be back after and can help with watching it post restart | 20:31 |
fungi | yeah, i'll be context switching to dinner prep in roughly an hour, but after i've eaten i can help with restarts or whatever | 20:31 |
clarkb | the tripleo queue has reset but this time clearly due to a failing job so no concern there | 21:09 |
*** dviroel is now known as dviroel|out | 21:11 | |
ianw | anyone mind if i supervise f35 builds today? https://review.opendev.org/c/openstack/project-config/+/816931 | 21:13 |
ianw | i don't want to spend time debugging f34 any more; if f35 still has issues booting i guess we/i can debug that | 21:13 |
ianw | we also have some sort of leak with containerfile builds + podman images; probably on an error path, but i'm not sure yet. i'll look into that | 21:14 |
clarkb | ianw: is there a corresponding grafyaml update we need to do? But +2 from me | 21:15 |
ianw | clarkb: hrm, looks like we haven't updated that in a while; i'll do that today | 21:17 |
ianw | on the gmail spam thing -- prompted me to look as all my @redhat.com is gmail and i have 30 days history -- nov 3, 5 & 6 i got one mail, and today every mail in the stack https://review.opendev.org/c/openstack/heat/+/816604 got put in there | 21:19 |
clarkb | fungi: corvus: looks like we had a kazoo connection loss at ~20:03 UTC from zuul02 | 21:20 |
ianw | so, that would be another data point that perhaps in early nov something did change to make gmail think gerrit might be more spammy | 21:20 |
ianw | for all their machine learning skills, why can't they realise i've probably got tens of thousands of emails from this address and never marked one as spam, and have filters setup to apply labels ... maybe that's a pretty big flag i do want to see these mails | 21:21 |
clarkb | we seem to have a number of these "connection closed due to Len error" messages in the recent zk05 docker container log | 21:26 |
clarkb | side note we seem to write a new snapshot every 5 minutes or so | 21:26 |
ianw | tobias-urdin: yeah, sorry i missed that use of the image. although i'd strongly suggest that with the stable branches, fedora is not really a good choice. i presume it's because something recent is required -- but perhaps 8-stream, or even 9-stream now, would work better? | 21:27 |
corvus | my de/re-enqueues from earlier are done | 21:28 |
corvus | i think we may be writing a too-large object into zk which is triggering a connection loss | 21:31 |
clarkb | ah ok | 21:31 |
clarkb | the size is ~1.8MB and our limit is 1MB? | 21:31 |
clarkb | the size from the len error I mean is ~1.8MB | 21:32 |
corvus | https://paste.opendev.org/show/810857/ | 21:32 |
corvus | what's the len error? | 21:32 |
corvus | oh that's the error from the zk server? | 21:32 |
clarkb | yes if you docker logs the zk container you'll see the len error tracebacks | 21:32 |
clarkb | with what I presume is a byte length | 21:32 |
corvus | that sounds plausible | 21:33 |
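For the record, the "Len error" is ZooKeeper refusing a request larger than jute.maxbuffer (roughly 1MB by default), which is why a ~1.8MB write kills the client connection. The server-side evidence was pulled from the container logs along these lines:

```shell
# on a zk server; the container name is whatever `docker ps` shows for zookeeper
docker logs <zookeeper-container> 2>&1 | grep -i 'len error'
```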
clarkb | I noticed it when looking into the "Unknown Configuration Error" messages for https://review.opendev.org/c/openstack/octavia/+/798151 the latest recheck seems to have enqueued though | 21:34 |
corvus | it looks like we're writing a big serialized config error to zk | 21:34 |
corvus | i don't quite understand why we're doing that in the format status json call | 21:35 |
corvus | i think i want to assume the second traceback is somehow corrupt, and i only need to look at the first one. it makes sense. | 21:36 |
clarkb | looking at it I think the second one may be side effect of the connection loss but not due to the specific write | 21:37 |
clarkb | basically first one explains the connection loss then second one is explained by the connection loss | 21:37 |
corvus | yeah, just not sure why the second one is doing any writes; like, _loadDynamicLayout shouldn't be in the call stack for a status json get | 21:38 |
clarkb | oh I see, could be interleaved writes maybe? | 21:40 |
corvus | i don't think we get that with the logging module | 21:41 |
clarkb | looks like it is trying to get the event queue length to report in the status json. Maybe that loads more info than anticipated? | 21:43 |
corvus | how do we go from "raise self._exception" to "item.setConfigErrors" in the next frame? | 21:44 |
corvus | maybe it has something to do with kazoo's internal async processing and retry handling | 21:45 |
corvus | and the second thread has an async retry of the original exception in its call stack. i dunno. it's weird. i'm still just going to try to fix the first error. :) | 21:46 |
corvus | unfortunately, i don't know which attribute of the ConfigurationError we're trying to serialize which is too big | 21:46 |
clarkb | my hunch would be the message content since that can be quite large iirc | 21:47 |
corvus | right, but "error" or "short_error" ? | 21:48 |
corvus | or is it that there are a large number of ConfigurationError objects | 21:48 |
clarkb | gotcha | 21:49 |
clarkb | corvus: should we shard all the error objects? since they can easily get large? or are you still trying to track down the specific attribute(s) | 22:33 |
corvus | clarkb: i'm working on a change to shard all the error objects | 22:34 |
clarkb | ok I look forward to reviewing it | 22:34 |
fungi | okay, i'm back from cooking and eating foods | 22:49 |
clarkb | infra-root I've updated https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting with notes about multi zuul scheduler and the container user stuff I started looking at. I dropped fips stuff and fedora 34. Anything else we should add? | 22:50 |
fungi | and caught up on the zk disconnect sitch | 22:50 |
clarkb | ianw: btw the new js theme looks great | 22:50 |
ianw | thanks, i think that's it. will be working on the 3.4 upgrade prep | 22:52 |
ianw | i don't know if i just hadn't noticed before, but it was great to see on the polymer docs page i was looking at that it's all deprecated and now you should use "lit" | 22:56 |
clarkb | hahaha | 22:56 |
ianw | so just as we get polygerrit and i have some vague understanding of it ... it's already old news | 22:56 |
fungi | web frameworks will stabilize some day | 23:16 |
fungi | and on that day, i'll begin to understand them | 23:16 |
clarkb | corvus: fungi there are ~8 changes in openstack check that look like they might be stuck. They are the first 8 | 23:24 |
clarkb | is this possibly related to the lost noderequests somehow? | 23:24 |
corvus | clarkb: yes, likely so. | 23:24 |
clarkb | ok I can dequeue enqueue them if we think that is the best step for them | 23:25 |
corvus | i think all the bugfixes so far have landed, except the one i just uploaded for the configerrors | 23:25 |
corvus | clarkb: yeah, i'm leaning toward dequeue/enqueue and wait for the next bugfix to land before doing the rolling restart | 23:26 |
clarkb | ok I'll start on those now | 23:26 |
corvus | ok. i'm going to afk so i can be around when it lands | 23:27 |
clarkb | alright dequeue enqueue complete for the changes that looked suspect to me | 23:39 |
clarkb | corvus: when you return 815696,4 seems to be stuck but is also only just 2 hours old which means it should be more recent than the monkey patches. I'll hold off on dequeue/enqueueing that one in case that is a good one to examine for further debugging | 23:43 |
corvus | clarkb: i only monkeypatched the retry fix, not nodepool | 23:47 |
corvus | so i think you should just go ahead and re-enqueue that one | 23:48 |
clarkb | oh got it | 23:48 |
clarkb | proceeding | 23:48 |
corvus | whew, i saw zuul tenant gate head with a queued job, but it just started :) | 23:54 |
clarkb | I notice there are a few more changes that need to be dequeue/enqueued. Working on that now | 23:59 |
clarkb | they weren't all at the top of the queue | 23:59 |