tkajinam | clarkb corvus, fyi: I tried recheck and it works now. | 01:39 |
---|---|---|
*** ramishra_ is now known as ramishra | 04:37 | |
fungi | sorry, left my phone behind when heading out to the conference yesterday and was too exhausted to check in once i got back to the rental last night. i may not be able to catch up on all the scrollback, but let me know if there's still anything urgent needing my attention | 13:30 |
fungi | ran into pleia2 and olaph here so far. pabelanger is apparently around here somewhere too, trying to track him down | 14:17 |
Clark[m] | fungi: mostly just struggles getting the mm3 exim update in place but that has happened. | 14:30 |
Clark[m] | I'm going to follow up on some of the container updates/cleanups today and try to upgrade zk to 3.8 as well cc corvus | 14:31 |
NeilHanlon | fungi: really wish I could've made it to ATO this year :\ hope you're having a good time! | 14:31 |
fungi | Clark[m]: cool, thanks for that! | 14:32 |
fungi | NeilHanlon: it's great, but also changed a lot since the very first one, which was the only other time i was able to make it | 14:33 |
fungi | they started running it while i was still living in raleigh, which was super convenient | 14:34 |
NeilHanlon | that does sound very convenient heh | 14:37 |
fungi | yeah, now it's a ~4hr drive from the beach for me, not terrible but does still require more planning | 14:50 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/898475 and its parent are the main things I tripped over yesterday trying to get the exim update deployed | 15:10 |
clarkb | fungi: landing the parent or something like it is probably the most important fix since without it we can't land some updates to system-config https://review.opendev.org/c/opendev/system-config/+/898502/2 | 15:10 |
clarkb | corvus: zk05 is our zk leader. I think the rough plan is to put zk04, 05, and 06 in the emergency file, manually edit the docker-compose.yaml on zk04 to use the :3.8 label, docker-compose pull, docker-compose down, docker-compose up -d. Check that the node is a follower again. Repeat on zk06, then zk05 and check we have a new leader. Finally land | 15:17 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/897985 and take the nodes out of the emergency file | 15:17 |
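A rough shell sketch of that per-node step, for reference (the compose file location and the exact image tag edit are assumptions for illustration, not copied from the hosts):

```shell
# Run on each zk node in turn (followers first, the leader last), with the
# host in the emergency file so Ansible does not revert the manual edit.
cd /etc/zookeeper-compose   # assumed location of the docker-compose.yaml
sudo sed -i 's|zookeeper:3.7|zookeeper:3.8|' docker-compose.yaml
sudo docker-compose pull
sudo docker-compose down
sudo docker-compose up -d

# Confirm the node rejoined the quorum before moving on; stat is already in
# the four letter command whitelist and reports both version and Mode.
echo stat | nc localhost 2181 | grep -E 'version|Mode'
```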
corvus | clarkb: sounds good | 15:19 |
clarkb | corvus: I'm good to start that now if you think now is a good time for it | 15:20 |
clarkb | zk nodes are in the emergency file | 15:27 |
clarkb | the tripleo jobs are hitting retry limits due to the galaxy api updates... | 15:30 |
corvus | clarkb: sounds good; i'm around | 15:32 |
clarkb | ok I'll proceed with zk04 now | 15:33 |
clarkb | zk04 is now 3.8.3 | 15:35 |
clarkb | as far as I can tell things are still working | 15:36 |
corvus | i'm just reviewing the log now | 15:36 |
clarkb | ack let me know when you are happy for me to do zk06 | 15:36 |
clarkb | grafana graphs look good though zk04 is very idle (I think this is normal as zk06 was in that position previously) | 15:37 |
corvus | looks okay. looks like it took a few attempts to get re-synced, but it seems happy now | 15:40 |
clarkb | alright I'll proceed with zk06 now | 15:40 |
corvus | that one recovered more quickly | 15:42 |
clarkb | all of the active connections appear to have ended up on the leader (none went to zk04) | 15:42 |
corvus | that is not great | 15:42 |
corvus | i think it might be worth restarting some mergers or something to see if they connect to 4 or 6 | 15:43 |
corvus | i'm not comfortable stopping 5 without knowing more | 15:43 |
clarkb | ok | 15:43 |
clarkb | I'll start with zm01 and work my way up from there. We can do graceful stops then restart | 15:44 |
corvus | sounds good | 15:44 |
clarkb | I believe zm01 connected to zk04 | 15:45 |
clarkb | but I'm happy to do a couple more since it is low impact and helps build confidence | 15:46 |
clarkb | I did `sudo docker-compose exec -T merger zuul-merger stop` then `sudo docker-compose start merger` fwiw | 15:46 |
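One hedged way to double check where a merger's new session landed (the debug log path and the TLS client port below are assumptions about this deployment, not verified values):

```shell
# On the merger host: look for an established connection to the zk client
# port (2281 is assumed here for the TLS listener; adjust if different).
sudo ss -tnp | grep ':2281'

# Or check the most recent connection messages in the merger debug log.
sudo grep -i connect /var/log/zuul/merger-debug.log | tail -n 5
```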
corvus | yeah... also i just noticed something in the logs | 15:46 |
corvus | 2023-10-17 15:45:35,043 [myid:] | 15:46 |
corvus | there's no "id" on 4 and 6 | 15:46 |
corvus | 2023-10-17 15:45:39,074 [myid:5] | 15:46 |
clarkb | hrm | 15:47 |
corvus | compare to 5 ^ | 15:47 |
clarkb | that comes out of the config file or should iirc | 15:47 |
clarkb | but let's figure that out before restarting more mergers | 15:47 |
corvus | it's in /var/zookeeper/data/myid | 15:49 |
clarkb | corvus: which does show up in the container as well and contains 4 on zk04. So maybe we're not putting it where the new containers expect it | 15:50 |
clarkb | it is just /data/myid within the container | 15:50 |
opendevreview | Merged opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test https://review.opendev.org/c/opendev/system-config/+/898502 | 15:52 |
clarkb | https://github.com/31z4/zookeeper-docker/blob/master/3.8.3/docker-entrypoint.sh#L45-L48 that's where the upstream image manages the id, but only if we don't set it, and then it sets it to 1 (ours is still 4 implying that didn't fire) | 15:53 |
clarkb | dataDir=/data is set in the config too so we're pointing at the directory containing the file at least | 15:53 |
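A quick sketch for checking the id wiring from both sides (the compose service name "zk" here is a guess; substitute whatever the compose file actually names the service):

```shell
# On the zk host:
cat /var/zookeeper/data/myid

# Inside the container; the upstream image keeps its config in /conf and its
# data in /data, so both of these should line up with the host file.
sudo docker-compose exec zk cat /data/myid
sudo docker-compose exec zk grep dataDir /conf/zoo.cfg
```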
corvus | what 3.7 version are we running? | 15:53 |
clarkb | corvus: 3.7.2 | 15:54 |
clarkb | `echo stat | nc localhost 2181` shows the full version and build info on each host | 15:54 |
corvus | they migrated logging frameworks between those versions | 15:55 |
corvus | log4j->logback | 15:55 |
corvus | maybe they missed something | 15:55 |
clarkb | ya I've been looking for a four letter command that will report the myid value back to us independent of logging and haven't found one yet but I think that is the next thing to sort out | 15:56 |
clarkb | maybe logging is broken | 15:56 |
corvus | oh interesting, in my local test container (3.8.1) i have some lines with myid:1 and some are myid: | 15:56 |
corvus | yeah and i see similar on zk06 | 15:57 |
corvus | if we go back a ways in the log, there are some myid:6 entries | 15:57 |
clarkb | phew | 15:57 |
clarkb | fwiw I think `conf` might be the command we need but it isn't in our whitelist of allowed four letter commands. But I'm much happier with your report that some log lines include the correct value | 15:58 |
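For reference, `conf` does include the server id in its output, but it would first have to be whitelisted; a hedged sketch (the whitelist line is an example, not our current config):

```shell
# zoo.cfg would need something like:
#   4lw.commands.whitelist=srvr, stat, mntr, conf
# (the upstream image can also set this via the ZOO_4LW_COMMANDS_WHITELIST
# environment variable)
echo conf | nc localhost 2181 | grep serverId
```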
* clarkb makes a note to test for myid values in the server log in our ci job | 15:58 | |
corvus | yeah, i think we can call this a red herring, and just proceed with merger restarts and observe the connection distribution | 15:58 |
clarkb | ok I'm proceeding with zm02 now | 15:58 |
corvus | (this is a behavior change; because zk05 is reporting the mntr log entries with myid:5 and the others are not) | 15:59 |
clarkb | zm02 appears to have reconnected to zk05 | 16:00 |
clarkb | I'm going to stop and start it again and see if we can get it to connect elsewhere | 16:01 |
corvus | ++ | 16:02 |
clarkb | now it is connected to zk06 I think | 16:02 |
* clarkb continues with the rest of the mergers since this is easy | 16:02 | |
corvus | agreed | 16:02 |
clarkb | zm03 is now attached to zk06 | 16:04 |
corvus | i verified that /var/lib/zuul/zuul-keys-backup.json is current, btw. | 16:04 |
clarkb | thanks | 16:04 |
clarkb | zm04 is connected to zk06 as well | 16:05 |
corvus | likewise /var/log/nodepool/nodepool-image-backup.json on nb01 is relatively current (from yesterday) | 16:05 |
clarkb | zm05 connected to zk04 | 16:06 |
clarkb | zm06 connected to zk06 | 16:07 |
clarkb | zm07 connected to zk05 so I'm redoing it | 16:07 |
corvus | i also note that the number of watches has grown considerably; i don't know what to make of that. | 16:08 |
clarkb | zm07 really likes zk05... I'll skip it and go to zm08 | 16:08 |
clarkb | zm08 connected to zk04 | 16:10 |
corvus | okay i reckon we restart zk05 now? | 16:10 |
clarkb | corvus: ya I think so. zm07 is still connected to it but the other 7 mergers connected to a different one and seem to be working? | 16:11 |
clarkb | do we want to check the operating logs of a merger first? | 16:11 |
corvus | i think i saw zm01 run jobs; but let's double check | 16:11 |
clarkb | zm01 did work at 16:07 which is after I restarted it at 15:45 or so | 16:12 |
clarkb | a refstate job | 16:12 |
corvus | yeah it's run a lot of jobs since the restart. i think we're good | 16:12 |
clarkb | ok I'll proceed with the upgrade of zk05 (the leader) from 3.7.2 to 3.8.3 now | 16:12 |
corvus | ++ | 16:13 |
clarkb | zk06 is the new leader | 16:13 |
clarkb | zk05 reports it is a follower | 16:14 |
corvus | zm07 reconnected to something and is happy | 16:14 |
corvus | things look reasonable to me | 16:18 |
clarkb | cool | 16:18 |
clarkb | corvus: want to approve https://review.opendev.org/c/opendev/system-config/+/897985 so that our config matches the new reality? I can pull the nodes out of the emergency file once that lands | 16:19 |
corvus | it looks like we're running more builds than before we started, so increased activity may explain the increase in watches | 16:19 |
clarkb | ah | 16:19 |
corvus | +3 | 16:19 |
clarkb | thanks! I'll look into updating our test for zk deployment to check for the myid value in the logs as that seems like a good check | 16:20 |
corvus | sounds cool | 16:20 |
corvus | thanks for driving this! i'll check back a bit later and see if the graphs still look good | 16:21 |
clarkb | #status log Upgraded our Zookeeper cluster to Zookeeper 3.8.3 | 16:21 |
opendevstatus | clarkb: finished logging | 16:21 |
clarkb | corvus: and thank you for being an extra set of eyeballs. I always find that helpful as a different perspective tends to find more things to be cautious with | 16:22 |
fungi | thanks for working on that! sorry i've been out of touch | 16:25 |
clarkb | fungi: it just occurred to me you are in raleigh, maybe you want to do a pit stop at RH HQ and fix their mailservers for them >_< | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add zk test to check myid is set in service https://review.opendev.org/c/opendev/system-config/+/898614 | 16:32 |
JayF | Just wander near the IBM campus in RTP with a suit on and say you're looking for a solution, they'll let you right in ;) | 16:32 |
clarkb | I think ^ that will ensure we've got myid showing up in the logs properly | 16:32 |
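Roughly what that test is meant to assert, expressed as a shell check (the compose service name and reading the log via docker-compose are assumptions):

```shell
# List the distinct myid values appearing in the server log; we expect the
# node's configured id (e.g. [myid:4] on zk04), not only empty [myid:] tags.
sudo docker-compose logs zk 2>&1 | grep -Eo '\[myid:[0-9]+\]' | sort -u
```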
*** ralonsoh is now known as ralonsoh_ooo | 16:33 | |
clarkb | fungi: if you get a chance, another less urgent but easy review is https://review.opendev.org/c/opendev/system-config/+/898479 I want to get that in before we start removing older container image builds | 16:35 |
fungi | clarkb: JayF: yeah, the rh building is just a few blocks from here | 16:41 |
clarkb | fungi: print out a copy of the dns rfc section that covers ttls :) | 16:41 |
clarkb | fwiw it's getting annoying because rh people are replying to the list and other thread members. Then thread members who aren't at rh reply and we get half the email chain | 16:42 |
clarkb | And it isn't any of the involved rh people's fault but their email systems seem to be sad | 16:42 |
fungi | the other possibility is that rackspace's dns servers are sometimes returning old records, i guess? | 16:44 |
clarkb | fungi: that seems unlikely given that only rh seems affected so far? | 16:45 |
clarkb | but maybe cdns or anycast are involved | 16:45 |
JayF | I would believe that is possible. | 16:45 |
JayF | But any personal experiences I have informing that are years old. | 16:45 |
fungi | i checked that both authoritative nameservers are returning correct addresses at least | 16:47 |
clarkb | from my home all three of the major dns forwarders (google, cloudflare, and quad 9) return the correct record. As do dns1 and dns2 at stabletransit | 16:48 |
clarkb | fungi: one thing I'll note is that you used an A record instead of a CNAME, probably because A records act as the fallback MX. Maybe we want explicit MX records? | 16:49 |
clarkb | I think what you did is correct but maybe whatever resolver/mail server out there is having trouble isn't happy with that | 16:49 |
clarkb | also we'll want to bump the ttl up to an hour at some point | 16:51 |
clarkb | maybe after this issue is resolved though in case we have to make changes | 16:51 |
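A hedged sketch of the checks and records under discussion (both names below are placeholders, not the actual records; substitute the real list host and authoritative nameserver):

```shell
# Compare what an authoritative server and a public resolver return.
LISTHOST=lists.example.org
AUTH_NS=ns1.example.org
dig +short "$LISTHOST" A "@$AUTH_NS"
dig +short "$LISTHOST" MX "@$AUTH_NS"
dig +short "$LISTHOST" A @8.8.8.8

# An explicit MX plus a one hour TTL would look roughly like this in the
# zone, instead of relying on the implicit fallback from the A record:
#   lists.example.org.  3600  IN  A   203.0.113.10
#   lists.example.org.  3600  IN  MX  10 lists.example.org.
```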
clarkb | corvus: there is a spike in event processing time. Possibly related to openstack deleting EOL branches? | 17:03 |
clarkb | ya I see zuul02 handling a bunch of ref updated events for stable/stein | 17:06 |
corvus | clarkb: yeah, from what i saw yesterday, the release jobs are trickling a bunch of branch creates (so today, deletes?) which puts the openstack tenant in more or less a continual reconfiguration loop. the events get deduplicated, but if by the time it finishes a reconfig, there's another batch of events to trigger another one, then it starts again | 17:07 |
corvus | that behavior would cause event processing delays | 17:07 |
clarkb | cool, just making sure we're comfortable with it. And that seems to match what I see in the logs | 17:07 |
corvus | (the faster that the release team/jobs can issue branch ops, so they are more closely clustered in time, the better) | 17:07 |
clarkb | elodilles: ^ fyi | 17:08 |
opendevreview | Merged opendev/system-config master: Bump zookeeper from 3.7 to 3.8 https://review.opendev.org/c/opendev/system-config/+/897985 | 17:10 |
clarkb | I'm going to remove zk04,05, and 06 from the emergency file as soon as the hourly run for zuul finishes. This way we get the deploy run for ^ applying and we can check it all looks good after | 17:30 |
clarkb | emergency file is updated. The zookeeper job for 897985 will run and should noop due to matching configs, not due to skipping hosts | 17:35 |
clarkb | zk04 is done being "updated" and it nooped as expected | 17:41 |
clarkb | and the other two look good as well. | 17:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge https://review.opendev.org/c/opendev/system-config/+/898505 | 21:18 |
clarkb | ok that should run many jobs. Maybe too many. But will give good feedback on how ansible 8 does with our existing playbooks and roles | 21:19 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/898505 passed even when running all those extra jobs. I can't think of much else to test before we take the ansible 8 plunge so probably we just go for it when we've got a day to monitor it | 22:51 |