19:01:04 <clarkb> #startmeeting infra
19:01:04 <opendevmeet> Meeting started Tue Jul 13 19:01:04 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:05 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-July/000267.html Our Agenda
19:01:10 <diablo_rojo_phone> Juuuust made it to Seattle.
19:01:21 <diablo_rojo_phone> So I may be half paying attention.
19:01:28 <clarkb> diablo_rojo_phone: don't worry about it
19:01:35 <clarkb> #topic Announcements
19:01:58 <ianw> o/
19:02:03 <clarkb> A reminder that the gerrit server will be moving July 18 at 23:00 UTC. We'll talk about that in more depth later in the meeting though
19:02:12 <clarkb> Other than that I dind't have any announcements
19:02:19 <clarkb> I can't type didn't today
19:02:31 <clarkb> #topic Actions from last meeting
19:02:37 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-07-06-19.01.txt minutes from last meeting
19:02:44 <clarkb> #action someone write spec to replace Cacti with Prometheus
19:03:08 <clarkb> That hasn't happened yet. I'm not too worried about it as we've been focused on updates to other systems. But once we can come up for air that would be a good thing to look at next
19:03:15 <clarkb> There were no other actions recorded that I saw
19:03:23 <clarkb> #topic Specs Approval
19:03:33 <clarkb> #link https://review.opendev.org/796156 Supporting communications on our very own Matrix homeserver
19:04:01 <clarkb> I think this is now in a position where people can review it with enough real world information to make informed decisions
19:04:49 <clarkb> We have a Matrix homeserver up for opendev.org. We have a test channel on that server. infra-root can invite themselves to that channel using the admin account (details in the typical location) or you can ask corvus, mordred, fungi, or myself to add you
19:05:09 <clarkb> though it doesn't look like fungi has made it in there yet
19:05:28 <clarkb> given all that, do we think we are in a position to put the spec up for approval now? I'm comfortable with that myself
19:05:35 <corvus> ++
19:05:35 <clarkb> corvus: ^ you may have input
19:06:39 <clarkb> considering the focus on gerrit things this week and that fungi is not currently around, what about asking for reviews before 7/22, then approving it if there are no objections?
19:06:45 <clarkb> (gives people a few days after this week to review it)
19:07:15 <corvus> wfm
19:07:47 <clarkb> Alright infra-root, please review https://review.opendev.org/796156 by 7/22
19:08:12 <clarkb> and feel free to interact with the system that is there to aid your review
19:08:40 <clarkb> #topic Topics
19:08:49 <clarkb> #topic review server upgrade
19:09:05 <clarkb> This is still scheduled for July 18 at 23:00 UTC
19:09:39 <clarkb> we ran into a small hiccup today when it was noticed that Depends-On had stopped working. Turns out this was related to switching Zuul to talk to review01.opendev.org instead of review.opendev.org. Zuul uses that config to line up the Depends-On headers and determine which are valid
19:10:04 <clarkb> A revert of that change is in the gate right now and we'll need to restart Zuul once the deploy job for zuul runs for that
19:10:29 <clarkb> To handle the DNS update during the migration I think we can force merge the DNS change in gerrit on review02, then manually pull and run the dns deploy playbook on bridge
19:10:39 <clarkb> Not as elegant, but it prevents Depends-On from being ignored
19:10:48 <clarkb> ianw: ^ not sure if you had caught up on all that scrollback yet but that is the tldr
19:11:02 <ianw> ++ yep, i will re-write the checklist today to account for that
19:11:14 <clarkb> I also pushed up a change to reduce the TTL on the review.o.o cname record to 300 seconds since updating that will be more important now
19:11:26 <clarkb> we should be able to land that one today to get it out of the way
19:12:03 <ianw> yep, good idea
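
For reference, a quick way to sanity check the lowered TTL once that change lands might look like the following. This is a sketch only; the exact CNAME target and whether ns1.opendev.org is the right authoritative server to query are assumptions, not something stated in the meeting.

    # TTL is the second field of the answer; it should read 300 once the change is deployed.
    dig +noall +answer review.opendev.org CNAME

    # Asking an authoritative server directly avoids any cached value from local resolvers
    # (ns1.opendev.org is assumed here).
    dig +noall +answer review.opendev.org CNAME @ns1.opendev.org
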
19:12:15 <clarkb> I think it would be good to do a resync of review02 today as well. Then we can spin it up with the current gerrit image and make sure everything looks happy
19:12:36 <clarkb> I have a related item on the next topic, but I'll hold off in case there are other upgrade specific things to go over
19:13:02 <clarkb> oh! have reminders gone out yet? we should send those to the mailing list. The meme is people don't read, but we can only do our best :)
19:13:24 <corvus> send our reminder gifs?
19:13:24 <ianw> ahh, i said i would do that and got sidetracked sorry. i'll send one in reply to the original now
19:13:33 <clarkb> ianw: thanks!
19:13:44 <clarkb> corvus: no, reminders that the server will have a longer than typical outage
19:13:59 <clarkb> corvus: but adding gifs is probably a good way to get people to read them :)
19:14:54 <clarkb> Anything else?
19:16:01 <clarkb> #topic Gerrit Account Cleanup
19:16:30 <clarkb> I won't bother with cleaning up the ~176 external ID conflicts that I retired accounts for until after the move
19:16:44 <clarkb> however efoley reached out yesterday after they managed to get themselves into a bad spot with their account.
19:17:17 <clarkb> The root cause has been captured in https://bugs.chromium.org/p/gerrit/issues/detail?id=14776 tldr is deleting emails in the web ui is not safe: if you delete the email for your openid it also deletes your openid externalid
19:17:57 <clarkb> We can't fix this in a simple way because of the conflicts I have been working to clean up. What we can do is take advantage of the downtime to push a fix to the externalid records under gerrit; then we'll reindex anyway and in theory be happy
19:18:40 <clarkb> ianw: ^ I started looking at the testing and staging of this on review02 today. That led me to create a /home/gerrit2/scratch/ dir where I was going to clone All-Users to and then checkout refs/meta/external-ids to make the necessary edits so they are staged and ready to go (and possibly test them?)
19:19:08 <clarkb> ianw: but I ran into a weird thing: I don't want that dir to be backed up because the refs/meta/external-ids checkout has tons of small files and we already back up the source repo
19:19:24 <clarkb> ianw: is there a better location for me to do that? maybe /tmp? we can figure that out after the meeting too
19:20:03 <ianw> hrm, yeah the root disk should be big enough to handle it
19:20:05 <clarkb> But I am hoping to be able to stage that all up, push the fixes back into review_site/git/All-Users.git after we sync up to current state, then maybe have efoley test a login against review02 if we turn it on
19:20:20 <clarkb> I'll coordinate that with ianw and we can edit the outage doc with what we learn
19:20:58 <ianw> otherwise we could do something like add an exclude to ~gerrit2/tmp ... that might be a good idea as even on the old server we've acquired random intermediate bits of little value
19:21:52 <ianw> so working in ~/tmp/<user>/ ... would be good so that we know we can always remove those bits (and a signal, while we're working, to remind us to consider it ephemeral and do things another way if we want it persisted)
19:22:23 <clarkb> not a bad idea. I actually do that on my personal machines because tmp is small
19:23:12 <clarkb> Anyway I think /tmp will work for now and we can coordinate the syncing and testing bits later
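
Roughly, the staging workflow described above could look like the following. This is a sketch under assumptions, not the agreed procedure: the bare repo is assumed to live at /home/gerrit2/review_site/git/All-Users.git (only the review_site/git/All-Users.git suffix is mentioned above), the /tmp/externalid-staging working directory is a made-up placeholder, and the push back would happen during the outage window before the planned reindex.

    # Clone the bare All-Users repo into ephemeral space so the many small files are not backed up.
    git clone /home/gerrit2/review_site/git/All-Users.git /tmp/externalid-staging
    cd /tmp/externalid-staging

    # The external ID records live on a hidden ref rather than a normal branch.
    git fetch origin refs/meta/external-ids
    git checkout FETCH_HEAD

    # ... make the necessary edits to the affected external ID files, then commit them ...
    git commit -a -m "Restore openid external ids lost via email deletion"

    # During the outage, push the staged fix back; the planned reindex picks it up.
    git push origin HEAD:refs/meta/external-ids
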
19:23:38 <clarkb> Another odd thing I noticed when doing that is /home/gerrit2 is root:root ownership
19:23:55 <clarkb> which means gerrit2 can't create dirs or files in its own homedir root. I suspect something related to docker containers with that?
19:24:12 <clarkb> Not critical either, but things like that make me want to turn on Gerrit if we can and ensure it starts up cleanly
19:24:36 <clarkb> #topic gitea backups failing to one backup target
19:24:44 <ianw> hrm, i quite possibly did a mkdir of /home/gerrit2 to get the LVM mounted there
19:24:49 <clarkb> ianw: ah
19:25:03 <clarkb> ianw: re gitea backups, do we still suspect timeout values in mysql configs?
19:25:06 <ianw> so that would be an oversight. i definitely have started it and played with it, so it does minimally work
19:25:11 <clarkb> cool
19:25:42 <ianw> umm, last thing was the ipv6 between gitea01 -> backup seemed to not work
19:25:54 <clarkb> oh right, this is the vexxhost between-regions routing problem
19:25:55 <ianw> i've reported that to mnaser and i believe an issue was raised, but i haven't followed up since
19:26:20 <clarkb> ok. This topic is on here mostly to remind me to catch up if there are any updates to catch up on. Sounds like we are still waiting for vexxhost
19:26:35 <clarkb> maybe we should consider dropping the AAAA record for now?
19:27:16 <ianw> it seems unfortunate but we could
19:27:28 <ianw> also the filesystem component of the backup is working
19:27:33 <ianw> so it must be falling back to ipv4
19:27:54 <clarkb> I wonder if the streaming setup for the db prevents fallback from working
19:27:59 <clarkb> because the stream gets interrupted
19:28:09 <clarkb> vs the fs backup which can simply reconnect and then start doing files
19:29:16 <ianw> afaics borg doesn't log anything of interest relating to that
19:29:32 <ianw> i'll have to fiddle more, i'll put it on the todo list
19:29:34 <clarkb> It seems plausible at least
19:29:43 <ianw> the ipv6 may be a red herring to the actual problem
19:29:52 <clarkb> ya
19:29:55 <clarkb> and thanks
19:29:57 <ianw> it would just be nice to debug one thing at a time :)
19:30:06 <clarkb> ++
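
One rough way to separate the IPv6 question from the db-streaming question would be a quick address-family check run from gitea01, along these lines. The backup hostname below is a placeholder, not the real target, and this assumes the OpenBSD netcat that ships with Ubuntu (which supports -4/-6/-z/-v).

    # Does the backup target publish both A and AAAA records?
    dig +short backup01.example.opendev.org A
    dig +short backup01.example.opendev.org AAAA

    # Can the borg ssh endpoint be reached over each address family?
    nc -4 -zv backup01.example.opendev.org 22
    nc -6 -zv backup01.example.opendev.org 22
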
19:30:14 <clarkb> #topic Gitea 1.14.4 upgrade scheduling
19:30:25 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/800274 Gitea 1.14.4 upgrade
19:30:40 <clarkb> I've got this change passing now. It is one of the larger Gitea upgrade changes that we've had I think
19:30:55 <clarkb> worthy of careful review. There is a link to a held test node running that code too
19:31:17 <clarkb> Given everything else happening I'm happy to defer this to next week assuming things settle down a bit :) But if you have time for review this week that would be helpful
19:31:26 <clarkb> as that way I can address any concerns before we actually do the upgrade
19:31:55 <ianw> ++ i played around and the change overall lgtm
19:32:33 <clarkb> #topic Scheduling Gerrit Project Renames
19:33:40 <clarkb> We said previously we'd do this the week after the server upgrade/move. Does anyone have opinions on a specific day for that? Probably Monday 7/26 or Friday 7/30? (I think I'm supposed to not be around on 7/30)
19:34:26 <clarkb> Any objections to pencilling in 7/26?
19:35:34 <clarkb> Let's pencil that in then, and when fungi returns we can talk about a specific timeframe
19:35:45 <clarkb> I expect the rename outage to be quite short as we can do online reindexing
19:35:53 <clarkb> #topic Open Discussion
19:35:58 <clarkb> Anything else?
19:36:33 <ianw> I got https://paste01.opendev.org/ up
19:36:52 <ianw> i have a minor change to db layout to merge, but will then import the old database
19:37:13 <ianw> if it seems to work, i'm presuming no objections to changing the paste.openstack.org CNAME ?
19:37:34 <clarkb> sounds good to me
19:38:22 <ianw> i don't think the service has a bright future, but it should continue chugging along for a while in its container
19:39:08 <ianw> as with all good web apps, every library it depends on has changed to the point that you basically have to rewrite everything to update it
19:39:30 <clarkb> fun, I think vexxhost was doing some minor maintenance with it though
19:40:45 <ianw> yeah, i got into a "this bit is deprecated from the main framework, use this library -- oh, that library is now unmaintained and has a bug that makes it not work with later versions of the main framework" loop and gave up
19:42:56 <clarkb> Sounds like that may be about it. I'll go ahead and call the meeting here so that we can proceed with the Zuul restart
19:43:11 <clarkb> As always feel free to bring discussion up in #opendev or at service-discuss@lists.opendev.org
19:43:14 <clarkb> Thank you everyone
19:43:16 <clarkb> #endmeeting