tkajinam | clarkb corvus, fyi: I tried recheck and it works now. | 01:39 |
---|---|---|
*** ramishra_ is now known as ramishra | 04:37 | |
fungi | sorry, left my phone behind when heading out to the conference yesterday and was too exhausted to check in once i got back to the rental last night. i may not be able to catch up on all the scrollback, but let me know if there's still anything urgent needing my attention | 13:30 |
fungi | ran into pleia2 and olaph here so far. pabelanger is apparently around here somewhere too, trying to track him down | 14:17 |
Clark[m] | fungi: mostly just struggles getting the mm3 exim update in place but that has happened. | 14:30 |
Clark[m] | I'm going to follow up on some of the container updates/cleanups today and try to upgrade zk to 3.8 as well cc corvus | 14:31 |
NeilHanlon | fungi: really wish I could've made it to ATO this year :\ hope you're having a good time! | 14:31 |
fungi | Clark[m]: cool, thanks for that! | 14:32 |
fungi | NeilHanlon: it's great, but also changed a lot since the very first one, which was the only other time i was able to make it | 14:33 |
fungi | they started running it while i was still living in raleigh, which was super convenient | 14:34 |
NeilHanlon | that does sound very convenient heh | 14:37 |
fungi | yeah, now it's a ~4hr drive from the beach for me, not terrible but does still require more planning | 14:50 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/898475 and its parent are the main things I tripped over yesterday trying to get the exim update deployed | 15:10 |
clarkb | fungi: landing the parent or something like it is probably the most important fix since without it we can't land some updates to system-config https://review.opendev.org/c/opendev/system-config/+/898502/2 | 15:10 |
clarkb | corvus: zk05 is our zk leader. I think the rough plan is to put zk04, 05, and 06 in the emergency file, manually edit the docker-compose.yaml on zk04 to use the :3.8 label, docker-compose pull, docker-compose down, docker-compose up -d. Check that the node is a follower again. Repeat on zk06, then zk05 and check we have a new leader. Finally land | 15:17 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/897985 and take the nodes out of the emergency file | 15:17 |
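A rough shell sketch of that per-node step, for reference (the compose file location and the exact image tag edit are assumptions for illustration, not copied from the hosts):

```shell
# Run on each zk node in turn (followers first, the leader last), with the
# host in the emergency file so Ansible does not revert the manual edit.
cd /etc/zookeeper-compose   # assumed location of the docker-compose.yaml
sudo sed -i 's|zookeeper:3.7|zookeeper:3.8|' docker-compose.yaml
sudo docker-compose pull
sudo docker-compose down
sudo docker-compose up -d

# Confirm the node rejoined the quorum before moving on; stat is already in
# the four letter command whitelist and reports both version and Mode.
echo stat | nc localhost 2181 | grep -E 'version|Mode'
```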
corvus | clarkb: sounds good | 15:19 |
clarkb | corvus: I'm good to start that now if you think now is a good time for it | 15:20 |
clarkb | zk nodes are in the emergency file | 15:27 |
clarkb | the tripleo jobs are hitting retry limits due to the galaxy api updates... | 15:30 |
corvus | clarkb: sounds good; i'm around | 15:32 |
clarkb | ok I'll proceed with zk04 now | 15:33 |
clarkb | zk04 is now 3.8.3 | 15:35 |
clarkb | as far as I can tell things are still working | 15:36 |
corvus | i'm just reviewing the log now | 15:36 |
clarkb | ack let me know when you are happy for me to do zk06 | 15:36 |
clarkb | grafana graphs look good though zk04 is very idle (I think this is normal as zk06 was in that position previously) | 15:37 |
corvus | looks okay. looks like it took a few attempts to get re-synced, but it seems happy now | 15:40 |
clarkb | alright I'll proceed with zk06 now | 15:40 |
corvus | that one recovered more quickly | 15:42 |
clarkb | all of the active connections appear to have ended up on the leader (none went to zk04) | 15:42 |
corvus | that is not great | 15:42 |
corvus | i think it might be worth restarting some mergers or something to see if they connect to 4 or 6 | 15:43 |
corvus | i'm not comfortable stopping 5 without knowing more | 15:43 |
clarkb | ok | 15:43 |
clarkb | I'll start with zm01 and work my way up from there. We can do graceful stops then restart | 15:44 |
corvus | sounds good | 15:44 |
clarkb | I believe zm01 connected to zk04 | 15:45 |
clarkb | but I'm happy to do a couple more since it is low impact and helps build confidence | 15:46 |
clarkb | I did `sudo docker-compose exec -T merger zuul-merger stop` then `sudo docker-compose start merger` fwiw | 15:46 |
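One hedged way to double check where a merger's new session landed (the debug log path and the TLS client port below are assumptions about this deployment, not verified values):

```shell
# On the merger host: look for an established connection to the zk client
# port (2281 is assumed here for the TLS listener; adjust if different).
sudo ss -tnp | grep ':2281'

# Or check the most recent connection messages in the merger debug log.
sudo grep -i connect /var/log/zuul/merger-debug.log | tail -n 5
```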
corvus | yeah... also i just noticed something in the logs | 15:46 |
corvus | 2023-10-17 15:45:35,043 [myid:] | 15:46 |
corvus | there's no "id" on 4 and 6 | 15:46 |
corvus | 2023-10-17 15:45:39,074 [myid:5] | 15:46 |
clarkb | hrm | 15:47 |
corvus | compare to 5 ^ | 15:47 |
clarkb | that comes out of the config file or should iirc | 15:47 |
clarkb | but let's figure that out before restarting more mergers | 15:47 |
corvus | it's in /var/zookeeper/data/myid | 15:49 |
clarkb | corvus: which does show up in the container as well and contains 4 on zk04. So maybe we're not putting it where the new containers expect it | 15:50 |
clarkb | it is just /data/myid within the container | 15:50 |
opendevreview | Merged opendev/system-config master: Fix the Ansible Galaxy proxy testinfra test https://review.opendev.org/c/opendev/system-config/+/898502 | 15:52 |
clarkb | https://github.com/31z4/zookeeper-docker/blob/master/3.8.3/docker-entrypoint.sh#L45-L48 that's where the upstream image manages the id, but only if we don't set it, and then it sets it to 1 (ours is still 4 implying that didn't fire) | 15:53 |
clarkb | dataDir=/data is set in the config too so we're pointing at the directory containing the file at least | 15:53 |
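A quick sketch for checking the id wiring from both sides (the compose service name "zk" here is a guess; substitute whatever the compose file actually names the service):

```shell
# On the zk host:
cat /var/zookeeper/data/myid

# Inside the container; the upstream image keeps its config in /conf and its
# data in /data, so both of these should line up with the host file.
sudo docker-compose exec zk cat /data/myid
sudo docker-compose exec zk grep dataDir /conf/zoo.cfg
```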
corvus | what 3.7 version are we running? | 15:53 |
clarkb | corvus: 3.7.2 | 15:54 |
clarkb | `echo stat | nc localhost 2181` shows the full version and build info on each host | 15:54 |
corvus | they migrated logging frameworks between those versions | 15:55 |
corvus | log4j->logback | 15:55 |
corvus | maybe they missed something | 15:55 |
clarkb | ya I've been looking for a four letter command that will report the myid value back to us independent of logging and haven't found one yet but I think that is the next thing to sort out | 15:56 |
clarkb | maybe logging is broken | 15:56 |
corvus | oh interesting, in my local test container (3.8.1) i have some lines with myid:1 and some are myid: | 15:56 |
corvus | yeah and i see similar on zk06 | 15:57 |
corvus | if we go back a ways in the log, there are some myid:6 entries | 15:57 |
clarkb | phew | 15:57 |
clarkb | fwiw I think `conf` might be the command we need but it isn't in our whitelist of allowed four letter commands. But I'm much happier with your report that some log lines include the correct value | 15:58 |
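For reference, `conf` does include the server id in its output, but it would first have to be whitelisted; a hedged sketch (the whitelist line is an example, not our current config):

```shell
# zoo.cfg would need something like:
#   4lw.commands.whitelist=srvr, stat, mntr, conf
# (the upstream image can also set this via the ZOO_4LW_COMMANDS_WHITELIST
# environment variable)
echo conf | nc localhost 2181 | grep serverId
```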
* clarkb makes a note to test for myid values in the server log in our ci job | 15:58 | |
corvus | yeah, i think we can call this a red herring, and just proceed with merger restarts and observe the connection distribution | 15:58 |
clarkb | ok I'm proceeding with zm02 now | 15:58 |
corvus | (this is a behavior change; because zk05 is reporting the mntr log entries with myid:5 and the others are not) | 15:59 |
clarkb | zm02 appears to have reconnected to zk05 | 16:00 |
clarkb | I'm going to stop and start it again and see if we can get it to connect elsewhere | 16:01 |
corvus | ++ | 16:02 |
clarkb | now it is connected to zk06 I think | 16:02 |
* clarkb continues with the rest of the mergers since this is easy | 16:02 | |
corvus | agreed | 16:02 |
clarkb | zm03 is now attached to zk06 | 16:04 |
corvus | i verified that /var/lib/zuul/zuul-keys-backup.json is current, btw. | 16:04 |
clarkb | thanks | 16:04 |
clarkb | zm04 is connected to zk06 as well | 16:05 |
corvus | likewise /var/log/nodepool/nodepool-image-backup.json on nb01 is relatively current (from yesterday) | 16:05 |
clarkb | zm05 connected to zk04 | 16:06 |
clarkb | zm06 connected to zk06 | 16:07 |
clarkb | zm07 connected to zk05 so I'm redoing it | 16:07 |
corvus | i also note that the number of watches has grown considerably; i don't know what to make of that. | 16:08 |
clarkb | zm07 really likes zk05... I'll skip it and go to zm08 | 16:08 |
clarkb | zm08 connected to zk04 | 16:10 |
corvus | okay i reckon we restart zk05 now? | 16:10 |
clarkb | corvus: ya I think so. zm07 is still connected to it but the other 7 mergers connected to a different one and seem to be working? | 16:11 |
clarkb | do we want to check the operating logs of a merger first? | 16:11 |
corvus | i think i saw zm01 run jobs; but let's double check | 16:11 |
clarkb | zm01 did work at 16:07 which is after I restarted it at 15:45 or so | 16:12 |
clarkb | a refstate job | 16:12 |
corvus | yeah it's run a lot of jobs since the restart. i think we're good | 16:12 |
clarkb | ok I'll proceed with the upgrade of zk05 (the leader) from 3.7.2 to 3.8.3 now | 16:12 |
corvus | ++ | 16:13 |
clarkb | zk06 is the new leader | 16:13 |
clarkb | zk05 reports it is a follower | 16:14 |
corvus | zm07 reconnected to something and is happy | 16:14 |
corvus | things look reasonable to me | 16:18 |
clarkb | cool | 16:18 |
clarkb | corvus: want to approve https://review.opendev.org/c/opendev/system-config/+/897985 so that our config matches the new reality? I can pull the nodes out of the emergency file once that lands | 16:19 |
corvus | it looks like we're running more builds than before we started, so increased activity may explain the increase in watches | 16:19 |
clarkb | ah | 16:19 |
corvus | +3 | 16:19 |
clarkb | thanks! I'll look into updating our test for zk deployment to check for the myid value in the logs as that seems like a good check | 16:20 |
corvus | sounds cool | 16:20 |
corvus | thanks for driving this! i'll check back a bit later and see if the graphs still look good | 16:21 |
clarkb | #status log Upgraded our Zookeeper cluster to Zookeeper 3.8.3 | 16:21 |
opendevstatus | clarkb: finished logging | 16:21 |
clarkb | corvus: and thank you for being an extra set of eyeballs. I always find that helpful as a different perspective tends to find more things to be cautious with | 16:22 |
fungi | thanks for working on that! sorry i've been out of touch | 16:25 |
clarkb | fungi: it just occurred to me you are in raleigh, maybe you want to do a pit stop at RH HQ and fix their mailservers for them >_< | 16:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add zk test to check myid is set in service https://review.opendev.org/c/opendev/system-config/+/898614 | 16:32 |
JayF | Just wander near the IBM campus in RTP with a suit on and say you're looking for a solution, they'll let you right in ;) | 16:32 |
clarkb | I think ^ that will ensure we've got myid showing up in the logs properly | 16:32 |
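Roughly what that test is meant to assert, expressed as a shell check (the compose service name and reading the log via docker-compose are assumptions):

```shell
# List the distinct myid values appearing in the server log; we expect the
# node's configured id (e.g. [myid:4] on zk04), not only empty [myid:] tags.
sudo docker-compose logs zk 2>&1 | grep -Eo '\[myid:[0-9]+\]' | sort -u
```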
*** ralonsoh is now known as ralonsoh_ooo | 16:33 | |
clarkb | fungi: if you get a chance, another less urgent but easy review is https://review.opendev.org/c/opendev/system-config/+/898479 I want to get that in before we start removing older container image builds | 16:35 |
fungi | clarkb: JayF: yeah, the rh building is just a few blocks from here | 16:41 |
clarkb | fungi: print out a copy of the dns rfc section that covers ttls :) | 16:41 |
clarkb | fwiw it's getting annoying because rh people are replying to the list and other thread members. Then thread members who aren't at rh reply and we get half the email chain | 16:42 |
clarkb | And it isn't any of the involved rh people's fault but their email systems seem to be sad | 16:42 |
fungi | the other possibility is that rackspace's dns servers are sometimes returning old records, i guess? | 16:44 |
clarkb | fungi: that seems unlikely given that only rh seems affected so far? | 16:45 |
clarkb | but maybe cdns or anycast are involved | 16:45 |
JayF | I would believe that is possible. | 16:45 |
JayF | But any personal experiences I have informing that are years old. | 16:45 |
fungi | i checked that both authoritative nameservers are returning correct addresses at least | 16:47 |
clarkb | from my home all three of the major dns forwarders (google, cloudflare, and quad 9) return the correct record. As do dns1 and dns2 at stabletransit | 16:48 |
clarkb | fungi: one thing I'll note is that you used an A record instead of a CNAME, probably because A records act as the fallback MX. Maybe we want explicit MX records? | 16:49 |
clarkb | I think what you did is correct but maybe whatever resolver/mail server out there is having trouble isn't happy with that | 16:49 |
clarkb | also we'll want to bump the ttl up to an hour at some point | 16:51 |
clarkb | maybe after this issue is resolved though in case we have to make changes | 16:51 |
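A hedged sketch of the checks and records under discussion (both names below are placeholders, not the actual records; substitute the real list host and authoritative nameserver):

```shell
# Compare what an authoritative server and a public resolver return.
LISTHOST=lists.example.org
AUTH_NS=ns1.example.org
dig +short "$LISTHOST" A "@$AUTH_NS"
dig +short "$LISTHOST" MX "@$AUTH_NS"
dig +short "$LISTHOST" A @8.8.8.8

# An explicit MX plus a one hour TTL would look roughly like this in the
# zone, instead of relying on the implicit fallback from the A record:
#   lists.example.org.  3600  IN  A   203.0.113.10
#   lists.example.org.  3600  IN  MX  10 lists.example.org.
```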
clarkb | corvus: there is a spike in event processing time. Possibly related to openstack deleting EOL branches? | 17:03 |
clarkb | ya I see zuul02 handling a bunch of ref updated events for stable/stein | 17:06 |
corvus | clarkb: yeah, from what i saw yesterday, the release jobs are trickling a bunch of branch creates (so today, deletes?) which puts the openstack tenant in more or less a continual reconfiguration loop. the events get deduplicated, but if by the time it finishes a reconfig, there's another batch of events to trigger another one, then it starts again | 17:07 |
corvus | that behavior would cause event processing delays | 17:07 |
clarkb | cool, just making sure we're comfortable with it. And that seems to match what I see in the logs | 17:07 |
corvus | (the faster that the release team/jobs can issue branch ops, so they are more closely clustered in time, the better) | 17:07 |
clarkb | elodilles: ^ fyi | 17:08 |
opendevreview | Merged opendev/system-config master: Bump zookeeper from 3.7 to 3.8 https://review.opendev.org/c/opendev/system-config/+/897985 | 17:10 |
clarkb | I'm going to remove zk04,05, and 06 from the emergency file as soon as the hourly run for zuul finishes. This way we get the deploy run for ^ applying and we can check it all looks good after | 17:30 |
clarkb | emergency file is updated. The zookeeper job for 897985 will run and should noop due to matching configs, not due to skipping hosts | 17:35 |
clarkb | zk04 is done being "updated" and it nooped as expected | 17:41 |
clarkb | and the other two look good as well. | 17:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Ansible 8 on bridge https://review.opendev.org/c/opendev/system-config/+/898505 | 21:18 |
clarkb | ok that should run many jobs. Maybe too many. But will give good feedback on how ansible 8 does with our existing playbooks and roles | 21:19 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/898505 passed even when running all those extra jobs. I can't think of much else to test before we take the ansible 8 plunge so probably we just go for it when we've got a day to monitor it | 22:51 |