Wednesday, 2023-12-06

05:44 <frickler> JayF: fwiw I've been thinking to set up a link shortener fed by gerrit reviews, that would be pretty simplistic. for now though, I'm just running one privately for the things I regularly need
07:24 <frickler> seems gmail is still at least delaying mails because we have more recipients than they like. I really wonder if proactively blocking mails towards them wouldn't be a better solution
<frickler> if not, we'll likely have to do further work and maybe deploy ARC
08:37 *** Guest8552 is now known as layer8
08:37 *** layer8 is now known as Guest9391
08:39 *** Guest9391 is now known as layer9
<opendevreview> Merged opendev/ master: Remove old mirror nodes from DNS
<opendevreview> Bartosz Bezak proposed openstack/diskimage-builder master: Add NetworkManager-config-server to rocky-container
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Switch install-docker playbook to include_tasks
<fungi> noticed that in this deploy job failure:
13:59 <fungi> not sure if it was the reason for the failure or just a red herring, but worth cleaning up either way
14:00 <fungi> yeah, it wasn't the cause. found this in the log on bridge:
14:00 <fungi> Recreating jaeger-docker_jaeger_1 ... done
14:00 <fungi> fatal: []: FAILED!
<fungi> Timeout when waiting for
14:01 <fungi> so the container didn't come up, or didn't come up fast enough
14:03 <fungi> the periodic build also failed exactly the same way
14:04 <fungi> the last successful periodic run was on friday
14:18 <fungi> the container log is complaining about subchannel connectivity problems
14:19 <fungi> 2023-12-06T12:40:47.191630938Z grpc@v1.59.0/server.go:964 [core][Server #5] grpc: Server.Serve failed to create ServerTransport: connection error: desc = ServerHandshake("") failed: tls: first record does not look like a TLS handshake
14:20 <fungi> i'm going to try manually downing and upping the container just to see if i get anything else out of it
14:23 <fungi> log is still full of connection errors after the restart
14:24 <fungi> looks like there were updates for the jaegertracing/all-in-one container image 44 hours ago, 4 days ago, and 7 days ago
14:25 <tonyb> can we get the SHA for the last good deploy?
14:26 <fungi> i don't know if the failures are related to the images, but the image from 7 days ago was definitely before the periodic job started failing
14:28 <fungi> the one from 4 days ago falls inside the window between the last successful run on friday and the first failing run on monday
14:29 <fungi> the one from 44 hours ago was after the first failure
14:30 <fungi> docker image inspect of the 7-day-old image says it's jaegertracing/all-in-one@sha256:963fed00648f7e797fa15a71c6e693b7ddace2ba7968207bb14f657914dac65b
14:30 <fungi> "Created": "2023-11-29T06:06:18.566997546Z"
14:32 <fungi> can i replace "latest" with "963fed00648f7e797fa15a71c6e693b7ddace2ba7968207bb14f657914dac65b" in the compose file to test? does that syntax work?
14:33 <fungi> not found: manifest unknown: manifest unknown
14:33 <fungi> guess not
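[For the record, the syntax being reached for here does exist: a compose file can pin an image by digest, but the digest goes after an `@` with the `sha256:` prefix kept, rather than replacing the tag after the `:`. A sketch, using the digest from the inspect output above:]

```yaml
# docker-compose.yaml fragment (sketch): pin by digest instead of tag.
# Note the "@sha256:" form -- substituting the bare digest for "latest"
# produces the "manifest unknown" error seen above.
services:
  jaeger:
    image: jaegertracing/all-in-one@sha256:963fed00648f7e797fa15a71c6e693b7ddace2ba7968207bb14f657914dac65b
```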
14:35 <fungi> i switched from latest to 1.51 after looking at and am seeing similar connection failures
14:35 <tonyb> Try FROM jaegertracing/all-in-one@sha256:963fed00648f7e797fa15a71c6e693b7ddace2ba7968207bb14f657914dac65b
14:36 <tonyb> Ah okay so that probably isn't it
14:37 <fungi> mmm, though these connection failures are info and warn level only
14:37 <fungi> 2023-12-06T14:36:11.773226176Z grpc@v1.59.0/clientconn.go:1521 [core][Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:4317", ServerName: "localhost:4317", }. Err: connection error: desc = transport: Error while dialing: dial tcp connect: connection refused
14:44 <frickler> infra-root: forgot to mention yesterday, we still have three old held nodes that don't show up in autoholds, can someone look into cleaning those up?
14:48 <frickler> there's also node 0035950265 that seems to be stuck in "deleted" state somehow
14:49 <fungi> sometimes `openstack server show ...` will include an error message in those situations
15:00 <frickler> I think that's more for "deleting" rather than "deleted"? anyway: No Server found for 01b73bd3-ad22-46a6-a88e-6e33fc2f4b61
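[For reference, the kind of query meant above, as a sketch; `-c fault` is only populated while the server record still exists on the provider side, which is why it comes back empty-handed here:]

```console
$ openstack server show 01b73bd3-ad22-46a6-a88e-6e33fc2f4b61 -c fault -f value
```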
15:01 <fungi> "stuck in deleted state" according to openstack server list? or according to nodepool list?
15:02 <frickler> according to the "nodes" tab in zuul, is that the same as nodepool list?
15:02 <fungi> i think so, i haven't relied much on the webui for that
15:03 <fungi> so is that where you're also seeing the held nodes you're talking about?
<fungi> nodepool list reports 11 nodes in a hold state, which corresponds to what i see at
15:06 <frickler> just checked, "nodepool list" doesn't have 0035950265, but 31 other very old nodes in "deleted" state
15:07 <frickler> so a) some mismatch in state between zuul and nodepool and b) some cleanup in nodepool's zk being broken I guess
15:08 <fungi> zuul/nodepool switched from numeric node ids to uuids semi-recently
15:10 <fungi> maybe that's the event horizon where it lost track of the old deleted nodes
15:10 <frickler> I'm still seeing numeric node ids both in zuul and nodepool, though. I think the switch was only for image IDs?
15:11 <fungi> oh, maybe
15:11 <frickler> build ids to be more specific
15:12 <frickler> image builds, that is, not zuul builds
15:13 <fungi> looks like all the ones `nodepool list` reports in a "deleted" state are missing pretty much all data besides the node number, state, time in state, lock status, username, port
15:13 <fungi> no provider even
15:13 <fungi> so yes, probably will require some manual cleanup with zkcli
15:14 <fungi> looks like they're all around 8-12 months old
15:59 <JayF> frickler: running one privately is an option, I should do that
16:00 <tonyb> I did a little poking at the ensure-pip role for enabling python3.12. Under the hood both pyenv and stow use the python-build tool from pyenv. It's just a question of when. pyenv would build a python3.12 on every job run, stow would build the python3.12 binary when we build the nodepool image (by using the python-stow-versions DIB element)
16:01 <tonyb> I guess for now, option 1 (pyenv with job builds) is the quickest POC
16:16 <fungi> looks like the issue with the jaeger role may be that the 60-second timeout is too short
16:17 <fungi> it does eventually listen and start responding on port 16686 but takes a while
16:17 <fungi> i'll see if i can get a baseline timeframe
16:20 <clarkb> to figure out the extra held nodes, nodepool can list them with the detail data. I'll take a look shortly
16:20 <fungi> clarkb: the "extra" held nodes already show the comment info in the nodes list (both from the nodepool cli and zuul's webui)
16:21 <fungi> there just isn't a corresponding autohold zuul is tracking any longer
16:21 <clarkb> oh then we just identify if they need to be held any longer and if not delete them
16:22 <clarkb> based on the comment
16:22 <fungi> empirical testing suggests 60 seconds is way too short of a timeout for jaeger to start listening. my initial test took 80 seconds. i'll run a few more restarts to see how consistent that is
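[The role's startup check is essentially a poll-until-ready loop with a deadline. A minimal standalone sketch of that pattern; the function name and the curl probe are illustrative, not the role's actual code:]

```shell
#!/bin/sh
# Poll a check command once per second until it succeeds or a
# deadline passes; report how long readiness took.
wait_for() {
  deadline=$1; shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$deadline" ]; then
      echo "timed out after ${deadline}s" >&2
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "ready after ${elapsed}s"
}

# The real check would be an HTTP probe of the jaeger query port, e.g.:
#   wait_for 160 curl -sf http://localhost:16686/
# Here the loop is demonstrated with a check that succeeds immediately:
wait_for 5 true
```

With an 80-second observed startup, a deadline of roughly double that (160s, as suggested below) leaves headroom for slow days.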
16:24 <tonyb> fungi: so we've just gotten *really* lucky with the 60s timeout to date?
16:24 <fungi> or something has recently caused it to get slower
16:24 <clarkb> frickler: fungi: the quoted text in the commit messages for the spf record updates implied that spf or dkim were sufficient. Maybe they actually want spf and dkim? In which case arc does seem like a next step
16:24 <fungi> i can try rolling back the version again to see if it speeds up
16:25 <fungi> clarkb: the new deferral responses from gmail don't indicate it has anything to do with message authentication, nor do they imply that additional message authentication would help
16:25 <clarkb> fwiw jaeger must sit in front of a db of some sort. I'm not sure how safe rolling backwards will be
16:25 <clarkb> fungi: oh they changed the message?
16:25 <fungi> clarkb: it says there are too many recipients for the same message-id
16:25 <tonyb> I know I had to add SPF and DKIM to my domain to get google to not send messages directly into spam.
16:25 <clarkb> weird that we would do what they ask and then they complain about something else
16:26 <tonyb> but lists are ... different
16:26 <fungi> 80 seconds again on my second restart test
16:26 <tonyb> I guess the MTA just bailed at the first error so there could be more coming
16:27 <tonyb> ... Other communities must have hit this
16:28 <clarkb> tonyb: likely, but these are also new changes on the google side
16:28 <clarkb> so it may be we're all hitting them in the last couple of days and scratching our heads
16:28 <fungi> well, gmail wasn't exactly barfing before. it rate-limited deliveries from the server, saying authenticating by adding either spf or dkim would help (and will become mandatory in a couple months). now it's rate limiting again, but because of the number of people receiving the same post (so basically the number of gmail subscribers to lists)
16:29 <clarkb> and I'm sure they'll be intentionally vague/obtuse in the name of not giving spammers a leg up
16:29 <fungi> this time the implication is that the deferral is per message-id though, so it seems like some gmail subscribers receive the message, but it takes multiple returns back from the deferral queue before everyone does
16:29 <clarkb> the incredibly frustrating thing here is that if you use email you'll be aware of the fact that gmail is the source of a significant portion of spam
16:33 <fungi> and yeah, i expect this is just gmail ratcheting up their rules to try to block spam messages. i half wonder if it's also restricted to posts from people using gmail or gmail-hosted addresses. i've seen insinuation elsewhere that gmail is cracking down on messages that say they're "from" a gmail account but are originating from servers outside gmail's network. this may be the time we turn on
16:33 <fungi> selective dmarc mitigation on openstack-discuss for people posting from and any other domains which seem to be hosted at gmail
16:34 <fungi> supposedly they added that feature in the latest release specifically because of gmail delivery problems
16:36 <fungi> basically, mailman tries to guess how to mitigate messages with existing dkim signatures by evaluating the published dmarc policies for the post author's domain, but gmail doesn't obey its own published dmarc policies so mailman guesses wrong
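[For context, in Mailman 3 this per-list behavior hangs off the list's `dmarc_mitigate_action` attribute; a sketch of flipping it via `mailman shell`, with the attribute and enum names to be double-checked against the installed Mailman version:]

```console
$ mailman shell -l openstack-discuss@lists.openstack.org
>>> from mailman.interfaces.mailinglist import DMARCMitigateAction
>>> m.dmarc_mitigate_action = DMARCMitigateAction.munge_from
>>> commit()
```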
16:36 <tonyb> It seems to be a reasonably long list of dos
16:36 <tonyb> SPF, DKIM and ARC
16:37 <fungi> we don't send 5k messages a day, but maybe they mean multiplied by the number of recipients
16:39 <tonyb> Yeah I suspect 1 list-email to 100 [gmail] accounts counts as 100 in the context of that document
<opendevreview> Brian Rosmaita proposed openstack/project-config master: Implement ironic-unmaintained-core group
16:41 <fungi> all my jaeger restart tests are coming in at 80 or 81 seconds
16:42 <clarkb> maybe set the timeout to 160 seconds then?
<opendevreview> Jeremy Stanley proposed opendev/system-config master: Increase jaeger startup timeout
16:53 <clarkb> infra-root: could use reviews if we still want to restart gerrit and try to use the new key again
16:53 <clarkb> I've just come to a realization that that file is not using any templating anymore. Do you want me to make it a normal file copy and stop jinjaing it?
16:53 <clarkb> that file == .ssh/config
16:55 <fungi> that would probably be better, yes
16:56 <clarkb> ok I'll make that update momentarily. Still haven't sorted out ssh keys this morning
<opendevreview> Clark Boylan proposed opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key"
17:02 <clarkb> fungi: ^ that removes the jinja templating
17:04 <fungi> lgtm, thanks!
17:05 <tonyb> and me.
17:06 <clarkb> we still good to do a restart later today? If so I'll go ahead and approve it now
17:07 <tonyb> Yup. I'll be AFK from 1730-1900 [UTC] but I'm also optional
17:08 <clarkb> frickler: two of the nodes are related to holds I did for testing of the Gerrit bookworm and java 17 update. Those two can be deleted (I'll do this). The third is a kolla octavia debugging hold. I believe all three were leaked because we did a zuul update that included removing the zuul database (but not nodepool's)
17:09 <clarkb> frickler: I'll let you delete 0035154490 when you are done with it
17:09 <tonyb> We could also merge as it doesn't touch prod right?
17:09 <fungi> i expect to step out to a late lunch/early dinner 19:30-20:30 utc
17:09 <fungi> otherwise i'm available
17:10 <frickler> clarkb: how do I delete it? I only know how to delete autoholds
17:10 <clarkb> cool the best time for me is probably after 21:00 anyway (since before that I've got lunch and all the stuff from yesterday to catch up on)
17:12 <clarkb> frickler: on a nodepool launcher node (nl01-nl04) you can run nodepool commands using this incantation `sudo docker exec nodepool-docker_nodepool-launcher_1 nodepool $subcommand`. In this case I ran the `list --detail` subcommand and piped it to `grep hold` to find the nodes nodepool sees as held
17:12 <clarkb> frickler: then you can run the `delete $nodeid` subcommand to delete nodes directly
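[Put together, the incantation described above looks like this; node id taken from the discussion, run on one of the launchers nl01-nl04:]

```console
$ sudo docker exec nodepool-docker_nodepool-launcher_1 \
    nodepool list --detail | grep hold
$ sudo docker exec nodepool-docker_nodepool-launcher_1 \
    nodepool delete 0035154490
```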
17:12 <clarkb> frickler: the number I pasted above is the node id (0035154490)
17:14 <clarkb> it shouldn't generally be necessary but I believe when we cleared out the zuul database entirely (because there was a manual upgrade problem? logs would tell us exactly why) that only cleared out the zuul side of the zk database and nodepool kept its portion of the held records
17:15 <clarkb> generally we don't want to delete the nodepool side of the db because it keeps track of state outside of our systems and we want that to be in sync as much as possible
<opendevreview> Merged opendev/system-config master: Increase jaeger startup timeout
17:29 <frickler> clarkb: ok, thx, that seems to have worked fine, node 0035154490 is gone. doing a delete on one of the deleted nodes has put it back into "deleting"
17:29 <clarkb> ya the delete command puts nodes in a deleting state in the db then normal nodepool runtime loops process that deletion
17:30 <clarkb> when nodes are already deleting and stuck in that state an explicit deletion in nodepool is unlikely to change anything due to this. We can try to delete things manually using the openstack client directly and see if we get any errors back that we can parse and take action on
17:33 <clarkb> looking at the ARC stuff, that is basically fancy DKIM for mailing lists? We wouldn't need to configure separate DKIM records?
17:34 <clarkb> or would we need separate DKIM for the administrative emails that come directly from the server?
17:34 <fungi> we'd also have to stop keeping the original from addresses
17:35 <tonyb> The doc I linked to indicates that we need SPF and DKIM for all "bulk" mail senders, and ARC for list hosts specifically
17:35 <clarkb> hrm I'm confused as to what ARC does then. If we're rewriting the email to say it's from us then we would just do normal DKIM?
17:35 <fungi> oh, i have no idea about arc, by "fancy dkim" i thought you still meant based on the from address
17:36 <tonyb> We're saying it's from us on behalf of "them", and we're good with that
17:36 <clarkb> ya I think my confusion is that ARC is just DKIM
17:36 <fungi> so far what little i've known about arc is that it's yet another attempt by massmail providers to make e-mail impossible for anyone who isn't them
17:36 <clarkb> but we've got another term in play to encapsulate the "remove the source DKIM stuff and replace it with our own"
17:37 <clarkb> the dns records used by arc are dkim records
17:37 <clarkb> so it is just dkim with maybe extra steps
17:37 <tonyb> heading out. I'll be on my phone if needed
17:38 <fungi> but anyway, if people want to get list mail in a timely manner, they can subscribe from a proper mail provider. i've maybe got bandwidth to look at what would be involved in adding our own dkim signing sometime next year
17:42 <clarkb> ya I mean I gave up on gmail for open source mailing lists in ~2015? I don't remember exactly when I jumped ship due to the problems they were creating back then
17:43 <clarkb> Definitely not new problems. What I think has changed is in the intervening period more people (often due to employer choices) have ended up on gmail for work like this and gone the opposite direction
17:44 <fungi> yes, well i also don't subscribe to mailing lists with my employer-supplied e-mail address for similar reasons
19:13 <clarkb> to follow up on the ControlPersist change: I can see ssh processes passing in the new values so it applied properly and doesn't seem to have broken anything. On the process cleanup side of things the main ssh process with the -oControlPersist=180s flag does seem to go away when ansible goes away, but there are ssh processes associated with .ansible/cp/$socketid sockets that appear to hang around for the timeout
19:13 <clarkb> so there is "leakage" but I think three minutes is still short enough to not create too much headache
19:14 <clarkb> those socket paths are the -o ControlPath values so ssh must fork a child or something to manage the socket, because the timeout is meant to last beyond the lifetime of the parent if you start another control persist process with the same path?
19:15 <clarkb> frickler: ^ fyi since there was a question about this
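[For reference, the multiplexing options under discussion correspond to ssh settings like the following, shown in ssh_config form as an illustration; ansible actually passes them as -o flags, and the exact ControlPath pattern is an assumption:]

```
# ssh_config-style sketch of connection multiplexing with a persist timeout:
Host *
    ControlMaster auto
    ControlPath ~/.ansible/cp/%C
    ControlPersist 180s
```

With ControlPersist set, the master process deliberately outlives the client that started it, which matches the observed "leaked" socket-holder processes that linger until the timeout.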
<opendevreview> Merged opendev/system-config master: Reapply "Switch Gerrit replication to a larger RSA key"
19:23 <fungi> oh, as for the jaeger startup timeout change, the deploy job worked when it merged
19:26 <frickler> clarkb: ok, thx for checking
19:26 <clarkb> the gerrit ssh config change appears to have applied as expected
19:27 <clarkb> I need lunch soon and will look at service restarts afterwards
19:31 <tonyb> clarkb: Yeah. there is a master pid for each ControlSocket. ps `pidof ssh` | grep -E mux should show you?
19:34 <fungi> okay, disappearing for food, back in about an hour
20:24 <clarkb> The process for restarting gerrit and using the new key should be something like: docker-compose down; mv id_rsa and to new .bak suffixed files; mv the replication waiting queue aside; docker-compose up -d; trigger replication
20:25 <clarkb> this is basically the same process as on Friday but we expect different results due to the updated ssh config file with the correct path to the new private key
20:26 <clarkb> unfortunately it seems that we only get the "trying key foo" log messages when replication fails and we also get "no more keys"
20:27 <clarkb> so there isn't a positive confirmation of the key being used when it succeeds in the replication log. One alternative if we don't want to mv id_rsa aside is to check the sshd logs on the gitea servers as those should log the hashed pubkey value
20:27 <clarkb> if we don't move id_rsa aside because we're concerned about a repeat of friday that may be good enough for confirmation
20:29 <tonyb> I get that MINA ssh isn't openssh but is it worth doing something like: docker exec -it --user gerrit gerrit_compose_???_1 ssh -vv before the replication step
20:30 <tonyb> that'd verify that the ssh config file is correct and the key is present at each end
20:30 <clarkb> ya I think we can do that before we even restart
20:30 <clarkb> I'll do that now
20:31 <tonyb> is .ssh/config coming from a mounted volume?
20:32 <clarkb> unexpectedly openssh wants me to confirm the ssh host key for gitea09 so I've ^C'd and am trying to sort out why
20:32 <clarkb> oh I know, the port is wrong
20:33 <clarkb> I did `sudo docker exec -it gerrit-compose_gerrit_1 bash` then `ssh -p222` and that returned "Hi there, gerrit! You've successfully authenticated with the key named Gerrit replication key B, but Gitea does not provide shell access."
20:34 <tonyb> if you do include the -vv it'll tell you the Host sections parsed and the key being presented
20:35 <clarkb> tonyb: ah ok I can do that again. Though gitea seems to have confirmed it used the correct key
20:35 <tonyb> Oh okay
20:36 <tonyb> .... Ah I see it there. nevermind
20:36 <clarkb> "/var/gerrit/.ssh/config line 1: Applying options for gitea*" and "identity file /var/gerrit/.ssh/replication_id_rsa_B type 0" are in the debug output
20:37 <tonyb> That's sounding good.
20:37 <clarkb> openssh at least appears to parse this the way we expect
20:38 <clarkb> in that case I'm inclined to move id_rsa aside just so there is no doubt when we restart since we should be pretty confident that the new key can be used and will be used
20:56 <fungi> okay, i've returned from food. need to change back into something more comfortable and will be available for gerrit work
20:56 <clarkb> infra-root: looking at the gitea09 backup failures, today's run did not fail, and the failures that happened appear to have occurred due to mysql being updated at the same time as we try to do mysqldumps
20:57 <clarkb> automated software updates conflicting with automated backups. The good news is we back up to two separate locations and only the one location seems to have conflicted with our mysql updates
20:57 <clarkb> I think this has to do with overlap in periodic job runtimes and updates upstream of us
20:57 <fungi> that sounds plausible
20:58 <clarkb> if it persists we should look at maybe removing the 02:00 to 08:00 window of time from valid hours for automated backups or something like that since that is around when periodic jobs run
20:58 <clarkb> fungi: see above for additional gerrit replication validation. I think we can start planning a time to restart the service
20:59 <clarkb> maybe 21:30 ish?
21:00 <fungi> yeah, already saw the ssh tests, i concur
21:00 <fungi> 30 minutes from now sounds great
21:21 <clarkb> I've started a root screen
21:24 <clarkb> also how does this look? #status notice We are restarting Gerrit again for replication configuration updates after we failed to make them last week. Gerrit may be unavailable for short periods of time in the near future.
21:25 <fungi> a bit wordy. i'm not opposed, but if it were me i'd just repeat the one i sent last week for brevity
21:26 <fungi> attached to the screen session
21:26 * clarkb looks that one up
21:27 <clarkb> here it is: #status notice The Gerrit service on will be offline momentarily to restart it onto an updated replication key
21:27 <clarkb> I'm good with that
21:28 <clarkb> I'm tempted to let the zuul gate clean up since a number of changes are saying they are less than a minute away
21:28 <clarkb> but then send that notice and proceed
21:29 <fungi> yeah, i don't feel like we need to apologize for the previous attempt. it's not like anybody else volunteered to take care of it
21:29 <fungi> takes as many tries as it takes
21:30 <fungi> not everything gets done right the first time around
21:31 <clarkb> the nova job is finishing up
21:31 <clarkb> so ya I can wait a couple minutes
21:32 <fungi> i'm in no hurry
21:35 <clarkb> I think enough things have happened in zuul we can proceed
21:35 <clarkb> I'll send the notice now
21:35 <clarkb> #status notice The Gerrit service on will be offline momentarily to restart it onto an updated replication key
21:35 <opendevstatus> clarkb: sending notice
21:35 -opendevstatus- NOTICE: The Gerrit service on will be offline momentarily to restart it onto an updated replication key
21:38 <opendevstatus> clarkb: finished sending notice
21:38 <clarkb> I will proceed now
21:40 <clarkb> gerrit has been restarted. Now we need someone to push something :)
21:40 <fungi> i have something to push, just a sec
21:41 <fungi> git review took a minute
<fungi> that says patchset 3 which has a timestamp from now:
21:42 <clarkb> you pushed as a wip so it didn't show up in my listing of open changes :)
21:43 <clarkb> the replication log appears to show replication for openstack/election completing successfully though
21:43 <clarkb> now to check the actual gitea content
21:43 <clarkb> fungi: I confirm that the latest patchset which was pushed after the restart shows up in gitea
21:43 <clarkb> I think we are good
21:44 <fungi> yep, lgtm!
21:44 <fungi> second time's a charm
21:44 <clarkb> I'm going to shut down the screen now
21:44 <clarkb> I'll get a change up to remove the old key and rebase the gitea 1.21 upgrade on that
<opendevreview> Clark Boylan proposed opendev/system-config master: Set both replication gitea ssh keys to the same value
<JayF> Internal server error while editing a wiki page,
<opendevreview> Clark Boylan proposed opendev/system-config master: Update gitea to 1.21.1
21:50 <JayF> "Lock wait timeout exceeded"
21:51 <JayF> Perhaps just a DB needs a restart?
21:51 <JayF> I'll note that the page edit did take.
21:52 <clarkb> or there is lock contention for that lock for some reason
21:52 <JayF> No idea, but wanted to make sure it got reported since I was able to screenshot the error. I know wiki is barely supported if at all
21:53 <clarkb> I was able to edit without an error
21:53 <clarkb> so it isn't representative of a general persistent problem (or at least not affecting 100% of edits)
21:55 <fungi> the database is remote (rackspace "trove" instance), so network timeouts for database writes aren't unheard of
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM intentional gitea failure to hold a node
21:57 <clarkb> I've refreshed the autoholds for ^ and I'll clean up the gerrit replication autoholds tomorrow
22:42 <clarkb> I'll approve the gerrit 3.9 image stuff first thing tomorrow
22:42 <clarkb> Will be a good conversation item for the gerrit community meeting if nothing else comes up
22:43 <tonyb> sounds good
22:43 <clarkb> now that EU and NA are both off of DST the meeting should be at 8am pacific time
22:44 <tonyb> that isn't terrible, but could be a pain with any school run
22:45 <clarkb> it also conflicts with some writing show-off thing at school, but it sounds like the kids will come home with their writing and can show it off at home at the end of the day so not a big deal
23:16 <tonyb> that's frustrating. but good that you have a backup

Generated by 2.17.3 by Marius Gedminas - find it at!