Tuesday, 2021-08-10

clarkbanyone else here for the meeting?19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Aug 10 19:01:10 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2021-August/000273.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbI had none. Let's just jump right into the meeting proper19:01
clarkb#topic Actions from last meeting19:01
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-08-03-19.01.txt minutes from last meeting19:01
clarkbI did manage to get around to writing up the start of a prometheus spec yesterday and today19:02
clarkb#link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus spec19:02
clarkbThis is still quite high level as I haven't run one locally but did read a fair bit of documentation yesterday19:02
clarkbI think in this case we don't need to have a bunch of specifics sorted out early because we can run this side by side with cacti while we sort it out and make it do what we want19:02
corvus++19:03
fungiyeah, i'm in favor of feeling it out once it's running19:03
clarkbThen as noted in the spec we can run it for a month or so and compare data between cacti and prometheus before shutting down cacti19:03
corvusi will review the spec asap19:03
clarkbI think it captures the important bits, but I'm happy for feedback and will update it appropriately19:03
fungii need to read it still, was there any treatment of how we might import our historical data, or just keep the old graphs around?19:03
clarkbfungi: no, I think that is well beyond the scope of that spec19:04
fungigot it, thanks19:04
clarkbyou'd need to write an rrd to tsdb conversion tool which may exist?19:04
clarkbhttps://groups.google.com/g/opentsdb/c/H7t-WPY11Ro19:05
fungiyeah, or may be as simple as plugging a coupld or python libraries into one another19:05
clarkbif someone wants to work on that during that side by side period it should definitely be possible19:05
fungis/coupld or/couple of/19:05
clarkbbut I'm not sure it is critical?19:05
fungiright, it's something else we'll want to figure out as a group19:05
corvusi'd vote for just keeping cacti around for many months until we don't care19:05
clarkbcorvus: ya that was sort of what I was thinking19:05
fungicertainly one option19:05
clarkbbasically keep cacti around to ensure the data we have in prometheus is at least as accurate as cacti then when ready delete cacti19:06
clarkbthe spec says we can do that after a month but happy to update that to be more flexible19:06
fungidepends on how much we value being able to compare against older trending (and how much older)19:06
corvusif there's a security reason we can't keep cacti up, we could still keep it around but firewall it19:06
corvusso that if we need to look at old data, it's possible (if not easy)19:06
fungianyway, all things we can hash out later19:07
clarkb#topic Topics19:07
clarkbYa lets hash it out in the spec review :)19:07
clarkb#topic Service Coordinator Election19:07
clarkbThe end of today UTC time is the end of the service coordinator nomination period19:07
clarkbI've not seen anyone volunteer yet :P19:08
clarkbI'll keep doing it if no one else wants to do it, but definitely think someone else should do it19:08
clarkbAnyway this is your reminder of that deadline. Please do volunteer if you are interested19:09
fungii can volunteer if you really need to step down, but i'm not sure another openinfra foundation staff member is a better choice. as things are, it's a struggle to explain that opendev is a community rather than a service run by the foundation19:09
fungi(hard to explain that to the rest of the foundation staff as much as to the public)19:09
clarkbI think for me it would be nice to be able to focus on more of the technical details of upgrading services and running new services, etc. But I agree that is also a struggle19:10
clarkband I think having someone else do it can be good for a shift in approach/perspective19:10
fungifrom a sustainability perspective, it would be nice to have an option other than foundation employees19:10
clarkb#topic Review Upgrades19:12
clarkbI believe old server cleanups have happened. Thank you ianw again for doing a bunch of the work on this19:13
clarkb#link https://review.opendev.org/c/opendev/system-config/+/803374 Clean up old mysql gerrit stuff19:13
ianwyep all done19:13
clarkbThat removes the mysql connector from our images as well as support for h2 and mysql from the gerrit role in system-config19:13
clarkbat this point I think we are good to move forward on landing that as there haven't been problems with prod since the mariadb switch19:13
fungineatly wrapped up!19:13
fungii agree19:13
ianwthe only thing left on the cleanup list is "decide on sshfp records"19:14
clarkbour options are to have no sshfp records or only do port 29418 sshd records on review.o.o and port 22 on review02.o.o ?19:15
ianwpersonally i think we generally want to access ssh on port 22 & 29418  @ review.opendev.org so that is in conflict with choosing one for sshfp records19:15
clarkbfwiw I've been trying to train myself to ssh to the actual host fqdn when using port 22 and use review.o.o for 2941819:15
fungii'm okay leaving it as-is, but it's inconsistent with how we handle sshfp records for admin access to our other servers19:15
clarkbbut ya I'm not doing any sshfp verification from my client as far as I know19:16
clarkbI'm happy to leave it as is with the comment in the zone file about why this host is different19:16
fungion the other hand, if we do have a review02.opendev.org-only sshfp record then it wouldn't directly conflict with anything, we'd just need to separate the address records and not use a cname for that19:16
ianwat the time i was thinking also things like zuul want review02 as the ssh target19:17
ianwbut that turned out to not work so well19:17
ianw(gerrit ssh port target i mean)19:17
fungianother option would be to switch openssh to using the same host key as the gerrit service, it's the only service running there, and so i'm not super concerned that someone might get ahold of the api hostkey and use that to take control of the underlying operating system, if they get that first bit then the whole server is already sunk really19:17
fungiit's not as if there's anything else to protect which the gerrit service doesn't have access to19:18
clarkbthat is an interesting idea. I hadn't considered that before. It would make distinguishing between gerrit hosts a bit more fuzzy, but would simplify sshfp records19:18
fungiyeah, i guess it's the transitional gerrit server replacement period when there are two running which is the real issue19:19
ianwhrm, i'm not sure we have any ansible logic for writing out host keys on base servers though19:19
clarkbI don't feel strongly about any of the options fwiw. I'm happy with the current situation but have also started trying to train myself when ssh'ing to use the actual host fqdn which falls in line with the old sshfp setup19:19
clarkbianw: ya we don't19:19
fungiright, my takeaway is that all the solutions are fairly complex and have their own distinct downsides, so i'm good with the option requiring the least work (that is, to be clear, just leaving it how it's configured now)19:20
ianwi think we're all ok with no records and a comment why, which is the status quo19:20
ianwall right, decided.  i'll cross that off the list and so other than that cleanup change, i think this is done!19:21
fungithe split record solution was elegant enough until we had to reason about server replacements19:21
fungithanks!19:21
clarkbianw: ++ we can always reevaluate if some reason to have the sshfp records pops up19:21
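[Note on the sshfp discussion above: an SSHFP record is attached to a DNS name and has no port field, which is why a single name serving two different host keys (OpenSSH on port 22 and Gerrit on port 29418) cannot be described by one consistent set of records. The following is a minimal sketch, not the tooling OpenDev uses, of how the SHA-256 form of such a record is derived from a public key line. The function name and the example hostname are hypothetical; the algorithm numbers are the ones assigned in RFC 4255, RFC 6594 and RFC 7479.]

```python
import base64
import hashlib

# SSHFP algorithm numbers (RFC 4255, RFC 6594, RFC 7479).
ALGORITHMS = {
    "ssh-rsa": 1,
    "ssh-dss": 2,
    "ecdsa-sha2-nistp256": 3,
    "ecdsa-sha2-nistp384": 3,
    "ecdsa-sha2-nistp521": 3,
    "ssh-ed25519": 4,
}


def sshfp_record(hostname, pubkey_line):
    """Build a zone-file SSHFP line (hypothetical helper).

    pubkey_line is in the usual "<key-type> <base64-blob> [comment]" form
    found in an OpenSSH .pub file. Fingerprint type 2 means SHA-256, and
    the digest is taken over the decoded key blob, not the base64 text.
    """
    key_type, blob_b64 = pubkey_line.split()[:2]
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).hexdigest()
    return f"{hostname}. IN SSHFP {ALGORITHMS[key_type]} 2 {digest}"
```

[Because the record carries only a name, publishing it under a host-specific name such as review02 would describe that host's OpenSSH key without saying anything about the key served on port 29418 under the shared service name.]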
clarkb#topic Project Renames19:21
clarkb#link https://review.opendev.org/c/opendev/system-config/+/803992 Accomodate zuul's new zk key management system19:21
clarkbI've pushed that up with a depends-on to handle the future zuul state where it doesn't implicitly back up things to disk19:22
clarkbThe other thing we had on the todo list was updating the docs to handle the edits we made to the etherpad compared to the documented process19:22
clarkbhas anyone started on that change yet?19:23
fungialso we discovered that accepting the inability to run zuul jobs on rename changes makes it hard to spot when you've caught all the remaining tentacles. we ended up merging two fixes (i think it was two?) where the old project name was referenced19:23
clarkbyup, I think part of the doc updates should be splitting those changes up so that we can review them with more CI testing upfront19:23
fungii agree, but last time this came up we couldn't agree on where/how to split them so we wound up just keeping it all squashed19:24
clarkbya it's a bit of a pain iirc19:24
clarkbI was thinking we could do an "add everything but don't remove old stuff" change for things like acls etc19:25
fungialso no i haven't yet written any process changes based on the notes in the pad19:25
clarkbthen we can safely land that first and then land a cleanup that does the actual rename?19:25
fungi#link https://etherpad.opendev.org/p/project-renames-2021-07-30 The maintenance plan we followed19:25
clarkbfungi: ok, I can probably look at that this week.19:25
clarkbthat == writing the docs update change19:25
fungii may get to it if you don't. i think a lot of it is going to be deletions anyway19:26
clarkbThen we can delete this from the agenda along with the review upgrade topic :)19:26
clarkbfungi: thanks19:26
fungii guess it's step #5 there which will need some consideration19:27
fungiwell, and step #119:27
fungialso is there anything about how zuul handles configuration we can improve to make this easier, or which we can take advantage of (run a config check on the altered config in the check pipeline?)19:28
clarkbfungi: the problem is that zuul in prod is verifying its own config against the config changes19:28
clarkbfungi: we could run a testing zuul to validate things but those jobs won't even run due to the config errors in the proposal19:28
fungiwell, it isn't going to speculatively apply the change anyway, the refusal to enqueue is a safeguard19:29
fungimaybe there's an option we could add to bypass that safety check in a yes-i-know-this-doesn't-make-sense kind of way?19:30
clarkbsomething like that would work for acl verification at least19:30
clarkbbasically where we do out of band validation19:30
fungior post zuul v5 maybe some support for actual repository renames in zuul, where it can reason about such things... but that's likely to be a significant undertaking19:31
clarkbya something to bring up with the zuul maintainers I suspect19:32
clarkbLet's continue on. We can hash out our options while writing and reviewing the docs updates19:32
clarkb#topic Matrix Homeserver and bots19:32
clarkbtristanC's prometheus metrics show that gerritbot loses connectivity to review.opendev.org reliably every hour19:33
clarkbSorting that out is probably a good idea, though possibly not critical to zuul using the service19:33
fungithat's affecting our production gerrit, or a test instance?19:33
clarkbWe also got billed for the homeserver in the expected amount which means that aspect is working without surprises (a very good thing)19:34
corvustristanC: are you working on that?19:34
clarkbfungi: our production gerrit19:34
fungier, production gerritbot (the irc-connected one)?19:34
clarkbfungi: aiui yes19:34
fungineat19:34
clarkboh sorry no19:34
clarkbthe production matrix gerritbot19:34
corvusit's affected the irc one too?19:34
corvusdidn't think so19:34
clarkbI don't have any evidence that it is affecting the irc gerritbot19:34
fungiwell, that's what i'm wondering. if the gerrit connection code is all the same then it could i suppose19:35
ianwis it reliably at the same time every hour, or reliably once an hour?19:35
clarkbianw: same time every hour according to the prometheus graph I saw19:35
clarkbfungi: it's completely different. irc gerritbot uses paramiko iirc and matrix gerritbot uses libssh2 in haskell19:35
ianwi do seem to remember rewriting/fixing the gerritbot reconnect logic at some point19:35
ianwit might be hiding any drops19:35
clarkbI'm calling it out because it may lead to service impacts for zuul to use the matrix bot19:36
corvusclarkb: do you know if tristanC is working on a fix?19:37
corvus(i'm unaware of any previous discussion about this -- it's the first time i'm hearing of it)19:37
clarkbcorvus: I do not know. It was mentioned over the weekend and I don't know if anyone including tristanC is looking into it further19:37
clarkbhttps://matrix-client.matrix.org/_matrix/media/r0/download/matrix.org/TIjNHQWUwHJlwgOpLbQRMYdN was what tristanC shared on Sunday (relative to me)19:38
corvusis there some discussion somewhere?19:39
corvusi can't find anything in #opendev eavesdrop logs19:40
clarkbcorvus: it was in #opendev on oftc from ~2100UTC Sunday to early Monday19:40
clarkbhttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-08.log.html#t2021-08-08T21:29:52 and https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:0919:40
corvushttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-08-09.log.html#t2021-08-09T00:15:09 looks relevant19:41
clarkbI haven't seen mention of it since19:41
corvusokay, well, i was hoping to get the 'all clear' at this meeting to move zuul over, but it doesn't seem like we're there19:41
corvustristanC: can you please provide an update (if you're not here now, maybe over in #opendev when you are around) on the impact of this issue and if you're addressing it?19:42
clarkbI think I'm happy for Zuul to use it as is. It would be up to Zuul if they are ok with the connection error issue being sorted out concurrently with the move19:42
clarkbBilling was my last major concern before moving (I didn't want zuul to move then us get a large unexpected bill and have to move quickly to something else for example)19:43
corvusclarkb: i don't feel like i have enough info to make that decision -- like -- how much stream-events time does gerritbot miss?19:43
clarkbcorvus: ya getting more info makes sense19:43
clarkbWhy don't we follow up with tristanC on that? Then if keepalives fixed it Zuul can proceed, otherwise we dig in more and make a decision. But I think from OpenDev's perspective it's largely up to Zuul's level of comfort with starting to actually use the services19:45
clarkbAnything else to bring up on this subject?19:46
fungithat jibes with my reckoning19:46
corvusthat's it19:46
clarkb#topic Gitea Backups19:46
clarkbWe got an email saying lists failed as well. I was worried that it may be suffering the same issue now but it only happened the once19:47
clarkbI suspect that was "normal" internet flakiness rather than the persistent variety19:47
clarkbianw: did an email get sent about this yet?19:47
ianwahh, no sorry19:49
clarkbAlright, considering the lists issue hasn't persisted I think that is all for this topic19:49
clarkb#topic Gitea 1.15.0 upgrade19:49
clarkbThank you everyone for helping to review and land the prep changes for this work. We are no longer using hacky UI interactions via http and instead use the REST api for all gitea project management updates19:50
clarkbThe latest gitea 1.15.0-rc3 release seems to work fine in testing with the associated template updates and file moves19:50
clarkbUpstream has a milestone setup due on the 18th for the 1.15.0 release and no outstanding bugs are listed. I expect the release will happen soon. Once it happens we can update my change and hold the nodes and do direct verification that stuff works as expected19:51
clarkbThe other gotcha is that the hosting of the logos changes and the paths move19:51
clarkbthis will impact review and paste's theming19:51
clarkbIf anyone has time to host those logos on static or with each service that uses them that might be a good idea19:52
fungiwe haven't merged any project additions to exercise the new api interactions in production, as far as anyone knows?19:52
clarkbthen we aren't updating a bunch of random stuff when our hacked up gitea theming changes19:52
clarkbfungi: ya I don't know of any new project creations since19:52
ianwah i can make a static logo location19:53
fungii think baking the logos into each image/deploying them to each server is probably the safest so we don't have unnecessary cross-site hosting19:53
fungibut keeping them in a single place in system-config (or some repo) would be good so we don't have duplicates in git19:53
clarkbI hadn't considered that concern. It seems to be working now at least, but preventing future problems seems like a good thing19:54
clarkbWe can definitely coordinate the 1.15.0 gitea update around making sure we're happy with logo hosting19:54
clarkbWhile it would be nice to update early we don't need to19:54
clarkbAlmost out of time so lets move on here19:55
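[Note on the REST API point in this topic: the following is a minimal sketch of what creating a project through Gitea's REST API looks like, as opposed to driving the web UI over http. It is not the code in system-config. The helper name, base URL, organization, repository name and token are hypothetical; the endpoint (POST /api/v1/orgs/{org}/repos) and the "token" authorization scheme are Gitea's documented ones. The request is only constructed here, not sent.]

```python
import json
import urllib.request


def build_create_repo_request(base_url, org, name, token, description=""):
    """Build (but do not send) a Gitea 'create repository in org' request.

    Hypothetical helper for illustration only.
    """
    url = f"{base_url.rstrip('/')}/api/v1/orgs/{org}/repos"
    body = json.dumps({"name": name, "description": description}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"token {token}",
        },
    )
```

[Passing the returned object to urllib.request.urlopen would perform the call. A structured endpoint like this is less likely to break across Gitea releases than form posts against UI pages, which fits the motivation given for the prep changes.]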
clarkb#topic Mailman Ansible and Upgrades19:55
clarkbThe newlist fix landed19:55
clarkbI don't know of any new lists being created since, so keep an eye out when that happens19:55
clarkbI have not had time to snapshot the lists.kc.io server yet for server inplace upgrade testing but hope that it will happen this week19:56
clarkb#topic Open Discussion19:56
clarkbAnything else?19:56
fungii've got nothing19:56
clarkbRico Lin reached out to fungi and I about doing a presentation about OpenDev for Open Infra Days Asia 2021. This is happening in a month and we have ~3 weeks to put together a recorded talk. I'd like to give it a go, but am balancing that with everything else19:57
ianwi'm trying to get to the bottom of debian-stable19:57
ianwhttps://review.opendev.org/q/topic:%22debian-stretch-rm%22+(status:open%20OR%20status:merged)19:57
clarkbMentioning it in case anyone is interested in helping put that together. I've been told that one of the easiest ways to do a recording like that is to have a recorded conference call where you present the data either to an empty call or to your copresenters19:57
fungiianw: not sure if you saw, but jrosser was in favor of bypassing ci to merge the removals from murano-dashboard19:57
ianwfungi: oh, no missed that but that seems good19:58
fungiclarkb: yeah, i expect we could talk through some slides on jitsi-meet and then someone could record it locally from their browser19:58
fungithe more the merrier on that19:58
clarkbAnd we are at time20:00
fungithanks clarkb!20:00
clarkbThank you everyone!20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Aug 10 20:00:11 2021 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-10-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-10-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2021/infra.2021-08-10-19.01.log.html20:00
