19:01:04 <clarkb> #startmeeting infra
19:01:04 <opendevmeet> Meeting started Tue Jul 13 19:01:04 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:05 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:05 <opendevmeet> The meeting name has been set to 'infra'
19:01:08 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-July/000267.html Our Agenda
19:01:10 <diablo_rojo_phone> Juuuust made it to Seattle.
19:01:21 <diablo_rojo_phone> So I may be half paying attention.
19:01:28 <clarkb> diablo_rojo_phone: don't worry about it
19:01:35 <clarkb> #topic Announcements
19:01:58 <ianw> o/
19:02:03 <clarkb> A reminder that the gerrit server will be moving July 18 at 23:00 UTC. We'll talk about that in more depth later in the meeting though
19:02:12 <clarkb> Other than that I dind't have any announcements
19:02:19 <clarkb> I can't type didn't today
19:02:31 <clarkb> #topic Actions from last meeting
19:02:37 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-07-06-19.01.txt minutes from last meeting
19:02:44 <clarkb> #action someone write spec to replace Cacti with Prometheus
19:03:08 <clarkb> That hasn't happened yet. I'm not too worried about it as we've been focused on updates to other systems. But once we can come up for air that would be a good thing to look at next
19:03:15 <clarkb> There were no other actions recorded that I saw
19:03:23 <clarkb> #topic Specs Approval
19:03:33 <clarkb> #link https://review.opendev.org/796156 Supporting communications on our very own Matrix homeserver
19:04:01 <clarkb> I think this is now in a position where people can review it with enough real world information to make informed decisions
19:04:49 <clarkb> We have a Matrix homeserver up for opendev.org. We have a test channel on that server. infra-root can invite themselves to that channel using the admin account (details in the typical location) or you can ask corvus, mordred, fungi, or myself to add you
19:05:09 <clarkb> though it doesn't look like fungi has made it in there yet
19:05:28 <clarkb> given all that, do we think we are in a position to put the spec up for approval now? I'm comfortable with that myself
19:05:35 <corvus> ++
19:05:35 <clarkb> corvus: ^ you may have input
19:06:39 <clarkb> considering the focus on gerrit things this week and that fungi is not currently around, what about asking for reviews before 7/22, then approving it if there are no objections?
19:06:45 <clarkb> (gives people a few days after this week to review it)
19:07:15 <corvus> wfm
19:07:47 <clarkb> Alright infra-root, please review https://review.opendev.org/796156 by 7/22
19:08:12 <clarkb> and feel free to interact with the system that is there to aid your review
19:08:40 <clarkb> #topic Topics
19:08:49 <clarkb> #topic review server upgrade
19:09:05 <clarkb> This is still scheduled for July 18 at 23:00 UTC
19:09:39 <clarkb> we ran into a small hiccup today when it was noticed that Depends-On had stopped working. Turns out this was related to switching Zuul to talk to review01.opendev.org instead of review.opendev.org. Zuul uses that config to line up the Depends-On headers and determine which are valid
19:10:04 <clarkb> A revert of that change is in the gate right now and we'll need to restart Zuul once the deploy job for zuul runs for that
19:10:29 <clarkb> To handle the DNS update during the migration I think we can force merge the DNS change in gerrit on review02, then manually pull and run the dns deploy playbook on bridge
19:10:39 <clarkb> Not as elegant, but it prevents Depends-On from being ignored
19:10:48 <clarkb> ianw: ^ not sure if you had caught up on all that scrollback yet but that is the tldr
19:11:02 <ianw> ++ yep, i will re-write the checklist today to account for that
19:11:14 <clarkb> I also pushed up a change to reduce the TTL on the review.o.o cname record to 300 seconds since updating that will be more important now
19:11:26 <clarkb> we should be able to land that one today to get it out of the way
19:12:03 <ianw> yep, good idea
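
For reference, a quick way to sanity check the lowered TTL once that change lands might look like the following. This is a sketch only; the exact CNAME target and whether ns1.opendev.org is the right authoritative server to query are assumptions, not something stated in the meeting.

    # TTL is the second field of the answer; it should read 300 once the change is deployed.
    dig +noall +answer review.opendev.org CNAME

    # Asking an authoritative server directly avoids any cached value from local resolvers
    # (ns1.opendev.org is assumed here).
    dig +noall +answer review.opendev.org CNAME @ns1.opendev.org
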
19:12:15 <clarkb> I think it would be good to do a resync of review02 today as well. Then we can spin it up with the current gerrit image and make sure everything looks happy
19:12:36 <clarkb> I have a related item on the next topic, but I'll hold off in case there are other upgrade specific things to go over
19:13:02 <clarkb> oh! have reminders gone out yet? we should send those to the mailing list. The meme is people don't read, but we can only do our best :)
19:13:24 <corvus> send our reminder gifs?
19:13:24 <ianw> ahh, i said i would do that and got sidetracked sorry. i'll send one in reply to the original now
19:13:33 <clarkb> ianw: thanks!
19:13:44 <clarkb> corvus: no, reminders that the server will have a longer than typical outage
19:13:59 <clarkb> corvus: but adding gifs is probably a good way to get people to read them :)
19:14:54 <clarkb> Anything else?
19:16:01 <clarkb> #topic Gerrit Account Cleanup
19:16:30 <clarkb> I won't bother with cleaning up the ~176 external ID conflicts that I retired accounts for until after the move
19:16:44 <clarkb> however efoley reached out yesterday after they managed to get themselves into a bad spot with their account.
19:17:17 <clarkb> The root cause has been captured in https://bugs.chromium.org/p/gerrit/issues/detail?id=14776 tldr is deleting emails in the web ui is not safe: if you delete the email for your openid it also deletes your openid externalid
19:17:57 <clarkb> We can't fix this in a simple way because of the conflicts I have been working to clean up. What we can do is take advantage of the downtime to push a fix to the externalid records under gerrit; then we'll reindex anyway and in theory be happy
19:18:40 <clarkb> ianw: ^ I started looking at the testing and staging of this on review02 today. That led me to create a /home/gerrit2/scratch/ dir where I was going to clone All-Users to and then checkout refs/meta/external-ids to make the necessary edits so they are staged and ready to go (and possibly test them?)
19:19:08 <clarkb> ianw: but I ran into a weird thing: I don't want that dir to be backed up because the refs/meta/external-ids checkout has tons of small files and we already back up the source repo
19:19:24 <clarkb> ianw: is there a better location for me to do that? maybe /tmp? we can figure that out after the meeting too
19:20:03 <ianw> hrm, yeah the root disk should be big enough to handle it
19:20:05 <clarkb> But I am hoping to be able to stage that all up, push the fixes back into review_site/git/All-Users.git after we sync up to current state, then maybe have efoley test a login against review02 if we turn it on
19:20:20 <clarkb> I'll coordinate that with ianw and we can edit the outage doc with what we learn
19:20:58 <ianw> otherwise we could do something like add an exclude to ~gerrit2/tmp ... that might be a good idea as even on the old server we've acquired random intermediate bits of little value
19:21:52 <ianw> so working in ~/tmp/<user>/ ... would be good so that we know we can always remove those bits (and a signal, while we're working, to remind us to consider it ephemeral and do things another way if we want it persisted)
19:22:23 <clarkb> not a bad idea. I actually do that on my personal machines because tmp is small
19:23:12 <clarkb> Anyway I think /tmp will work for now and we can coordinate the syncing and testing bits later
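
Roughly, the staging workflow described above could look like the following. This is a sketch under assumptions, not the agreed procedure: the bare repo is assumed to live at /home/gerrit2/review_site/git/All-Users.git (only the review_site/git/All-Users.git suffix is mentioned above), the /tmp/externalid-staging working directory is a made-up placeholder, and the push back would happen during the outage window before the planned reindex.

    # Clone the bare All-Users repo into ephemeral space so the many small files are not backed up.
    git clone /home/gerrit2/review_site/git/All-Users.git /tmp/externalid-staging
    cd /tmp/externalid-staging

    # The external ID records live on a hidden ref rather than a normal branch.
    git fetch origin refs/meta/external-ids
    git checkout FETCH_HEAD

    # ... make the necessary edits to the affected external ID files, then commit them ...
    git commit -a -m "Restore openid external ids lost via email deletion"

    # During the outage, push the staged fix back; the planned reindex picks it up.
    git push origin HEAD:refs/meta/external-ids
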
19:23:38 <clarkb> Another odd thing I noticed when doing that is /home/gerrit2 is root:root ownership
19:23:55 <clarkb> which means gerrit2 can't create dirs or files in its own homedir root. I suspect something related to docker containers with that?
19:24:12 <clarkb> Not critical either, but things like that make me want to turn on Gerrit if we can and ensure it starts up cleanly
19:24:36 <clarkb> #topic gitea backups failing to one backup target
19:24:44 <ianw> hrm, i quite possibly did a mkdir of /home/gerrit2 to get the LVM mounted there
19:24:49 <clarkb> ianw: ah
19:25:03 <clarkb> ianw: re gitea backups, do we still suspect timeout values in mysql configs?
19:25:06 <ianw> so that would be an oversight. i definitely have started it and played with it, so it does minimally work
19:25:11 <clarkb> cool
19:25:42 <ianw> umm, last thing was the ipv6 between gitea01 -> backup seemed to not work
19:25:54 <clarkb> oh right, this is the vexxhost between-regions routing problem
19:25:55 <ianw> i've reported that to mnaser and i believe an issue was raised, but i haven't followed up since
19:26:20 <clarkb> ok. This topic is on here mostly to remind me to catch up if there are any updates to catch up on. Sounds like we are still waiting for vexxhost
19:26:35 <clarkb> maybe we should consider dropping the AAAA record for now?
19:27:16 <ianw> it seems unfortunate but we could
19:27:28 <ianw> also the filesystem component of the backup is working
19:27:33 <ianw> so it must be falling back to ipv4
19:27:54 <clarkb> I wonder if the streaming setup for the db prevents fallback from working
19:27:59 <clarkb> because the stream gets interrupted
19:28:09 <clarkb> vs the fs backup which can simply reconnect and then start doing files
19:29:16 <ianw> afaics borg doesn't log anything of interest relating to that
19:29:32 <ianw> i'll have to fiddle more, i'll put it on the todo list
19:29:34 <clarkb> It seems plausible at least
19:29:43 <ianw> the ipv6 may be a red herring to the actual problem
19:29:52 <clarkb> ya
19:29:55 <clarkb> and thanks
19:29:57 <ianw> it would just be nice to debug one thing at a time :)
19:30:06 <clarkb> ++
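
One rough way to separate the IPv6 question from the db-streaming question would be a quick address-family check run from gitea01, along these lines. The backup hostname below is a placeholder, not the real target, and this assumes the OpenBSD netcat that ships with Ubuntu (which supports -4/-6/-z/-v).

    # Does the backup target publish both A and AAAA records?
    dig +short backup01.example.opendev.org A
    dig +short backup01.example.opendev.org AAAA

    # Can the borg ssh endpoint be reached over each address family?
    nc -4 -zv backup01.example.opendev.org 22
    nc -6 -zv backup01.example.opendev.org 22
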
19:30:14 <clarkb> #topic Gitea 1.14.4 upgrade scheduling
19:30:25 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/800274 Gitea 1.14.4 upgrade
19:30:40 <clarkb> I've got this change passing now. It is one of the larger Gitea upgrade changes that we've had I think
19:30:55 <clarkb> worthy of careful review. There is a link to a held test node running that code too
19:31:17 <clarkb> Given everything else happening I'm happy to defer this to next week assuming things settle down a bit :) But if you have time for review this week that would be helpful
19:31:26 <clarkb> as that way I can address any concerns before we actually do the upgrade
19:31:55 <ianw> ++ i played around and the change overall lgtm
19:32:33 <clarkb> #topic Scheduling Gerrit Project Renames
19:33:40 <clarkb> We said previously we'd do this the week after the server upgrade/move. Does anyone have opinions on a specific day for that? Probably Monday 7/26 or Friday 7/30? (I think I'm supposed to not be around on 7/30)
19:34:26 <clarkb> Any objections to pencilling in 7/26?
19:35:34 <clarkb> Let's pencil that in then, and when fungi returns we can talk about a specific timeframe
19:35:45 <clarkb> I expect the rename outage to be quite short as we can do online reindexing
19:35:53 <clarkb> #topic Open Discussion
19:35:58 <clarkb> Anything else?
19:36:33 <ianw> I got https://paste01.opendev.org/ up
19:36:52 <ianw> i have a minor change to db layout to merge, but will then import the old database
19:37:13 <ianw> if it seems to work, i'm presuming no objections to changing the paste.openstack.org CNAME ?
19:37:34 <clarkb> sounds good to me
19:38:22 <ianw> i don't think the service has a bright future, but it should continue chugging along for a while in its container
19:39:08 <ianw> as with all good web apps, every library it depends on has changed to the point that you basically have to rewrite everything to update it
19:39:30 <clarkb> fun, I think vexxhost was doing some minor maintenance with it though
19:40:45 <ianw> yeah, i got into a "this bit is deprecated from the main framework, use this library -- oh, that library is now unmaintained and has a bug that makes it not work with later versions of the main framework" loop and gave up
19:42:56 <clarkb> Sounds like that may be about it. I'll go ahead and call the meeting here so that we can proceed with the Zuul restart
19:43:11 <clarkb> As always feel free to bring discussion up in #opendev or at service-discuss@lists.opendev.org
19:43:14 <clarkb> Thank you everyone
19:43:16 <clarkb> #endmeeting