19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Aug 23 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:17 #link https://lists.opendev.org/pipermail/service-discuss/2022-August/000355.html Our Agenda
19:01:24 #topic Announcements
19:01:26 o/
19:02:01 The service coordinator nomination period ended last week. I saw only one nomination; the one from myself. I guess that means I'm it by default again
19:02:46 \o
19:02:52 #link https://releases.openstack.org/zed/schedule.html OpenStack Feature Freeze begins next week
19:03:19 Heads up that openstack is about to enter the typically most crazy portion of its release cycle
19:03:28 though the last few have been pretty chill comparatively
19:03:56 long live the king^W service-coordinator! :)
19:04:02 And finally I'll be AFK tomorrow. Back thursday
19:04:44 off celebrating your reelection... i mean drowning your sorrows?
19:04:59 looking for salmon swimming up the columbia river
19:05:14 and coincidentally escaping the heat at home
19:05:32 so that's a yes
19:05:43 ha
19:05:48 #topic Bastion Host Updates
19:05:53 Time to dive in
19:06:16 ianw: one thing that occurred to me is that the recent Zuul auto upgrades should've deployed your fixes for the console log file leaks?
19:06:20 I think those changes landed
19:06:49 If that is the case should we go ahead and manually clear those files out of bridge and static as they should leak far less quickly now?
19:07:03 umm, yes, i think this weekend actually should have deployed the file deletion in /tmp
19:07:24 i'll double check, restart the zuul_console on the static nodes and clean up the tmp files, then make a note to go back and check
19:08:12 we're probably testing the backwards compat now, until the daemon is killed
19:08:13 ianw: just be careful not to delete the files that will be automatically deleted. Might need to use an age filter when deleting
19:08:38 last weekend it deployed just the initial changes, which broke xenial/centos-7 because it used python-3-era f-strings
19:08:39 corvus: ianw: oh right, we need to have it start a new one.
19:09:04 in related news
19:09:06 #link https://review.opendev.org/q/topic:stream-2.7-container
19:09:31 but also we block port 19885 so the zuul cluster doesn't succeed at getting the logs. I wonder if we should also look at just not trying to run it at all on bridge (I think we need to be very careful exposing those logs directly)
19:09:32 runs the console streamer in a python 2.7 environment to do ultimate backwards compat testing
19:09:50 sounds like progress though. Thank you for looking at that
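For anyone unfamiliar with the compatibility problem mentioned at 19:08:38: f-strings only exist from python 3.6 on, so module code that has to run on xenial (python 3.5) or centos-7 (python 2.7) nodes cannot even be parsed if it uses them. A minimal illustration; the message text and path are invented, not taken from the zuul_console code:

    # f-strings fail to parse on python 2.7 / 3.5, the interpreters on centos-7 / xenial nodes
    path = "/tmp/console-example.log"          # hypothetical file name
    msg = f"could not open {path}"             # SyntaxError before python 3.6
    # equivalent spellings that parse everywhere the console streamer must run:
    msg = "could not open {}".format(path)
    msg = "could not open %s" % path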
19:10:26 so opendev doesn't actually need / will not use these changes?
19:10:50 corvus: it does use them because the console log stuff runs there and leaks the files. But we don't currently expose the results on the live stream through the zuul finger protocol
19:11:14 i mean, we could have just stopped running the log daemon? or aggressively pruned?
19:12:09 corvus: maybe? I think zuul will continue to try and fetch the data, but I guess the firewall blocking the port and not having anything listening on the port are roughly equivalent from that perspective (sorry, this just occurred to me)
19:12:14 i'm asking partly for curiosity, but also trying to get a handle on whether this is actually getting tested
19:12:21 corvus: it will be tested
19:12:28 it sounds like opendev is not going to be a robust test of this feature?
19:12:37 but ya it may not be a complete test
19:13:22 though also worth revisiting whether we might be comfortable streaming such console logs in the future
19:13:29 fungi: yup that too
19:13:43 thanks, it's good to know the limitations of opendev's production testing of new features like this.
19:13:50 corvus: I think the regular jobs will exercise this pretty well too fwiw. Since it happens for all the jobs
19:14:06 I think where our gap may be is how much we'll leak due to aborted jobs and the like?
19:14:26 there was also some work from several years ago to tunnel the console logs over a unix socket over ssh
19:14:27 if we don't allow connections on 19885 then will anything be deleted from bridge?
19:14:38 we put a lot of belts and suspenders in place early on because we were unsure of the security of some solutions, but now we've had time to evaluate things in a production scenario and could make better (informed by observed data and experience) decisions
19:14:54 oh, the entire protocol happens over 19885? yes I think that is correct. This may not delete anything on bridge I guess :/
19:15:27 hrm, that is a wrinkle, it does now send a message "i've finished with this, remove it"
19:15:28 i suggested that zuul-console should have a periodic deletion as a backstop. did that get implemented?
19:15:47 fungi: we would need to review all of the log files we produce to double check them for leaked sensitive info. Address any such leaks, then remember to not add new ones
19:16:12 what's the link to the review for the feature which got merged?
19:16:17 corvus: I don't think so
19:16:47 fungi: https://review.opendev.org/c/zuul/zuul/+/850270
19:16:51 thanks
19:17:06 corvus: I think the python 2.7 testing has become the next focus there to avoid regressions on the ansible target end
19:17:55 i haven't implemented cleanup, but i have expanded the docs to talk about it explicitly
19:18:07 that is
19:18:08 #link https://review.opendev.org/c/zuul/zuul/+/851942/
19:18:13 which could use a review
19:18:37 so how is bridge going to get cleaned up?
19:18:52 corvus: it won't, but we only just realized that
19:19:09 we'd need to stop running the console streamer on bridge or implement the periodic cleanups.
19:20:00 ianw: do you plan on implementing periodic cleanup in zuul-console, or separately?
19:20:09 (by separately, i mean just a cron job on bridge?)
19:21:10 i don't have immediate plans to work on adding it to zuul-console
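The periodic cleanup being discussed, whether it eventually lives inside zuul-console or runs as a cron job on bridge, could be as simple as an age-filtered sweep along these lines. This is only a sketch: the /tmp/console-*.log glob and the 24 hour cutoff are assumptions and would need to match what the streamer actually writes, with the age filter there to avoid deleting files that running jobs still need:

    #!/usr/bin/env python3
    # Sketch of an age-filtered cleanup for leaked console log files.
    # The glob pattern and cutoff below are assumptions, not the real values.
    import pathlib
    import time

    CUTOFF = 24 * 60 * 60  # seconds; leave recent files alone so active streams survive
    now = time.time()

    for path in pathlib.Path("/tmp").glob("console-*.log"):
        try:
            if now - path.stat().st_mtime > CUTOFF:
                path.unlink()
        except FileNotFoundError:
            pass  # the streamer may have removed it between the glob and the unlink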
19:21:41 did this make it into a zuul release?
19:22:20 looks like no
19:22:30 i don't see a tag which contains df3f9dcd30a13232447b3be67c7845c51cb527a0 in its history
19:22:35 so it could be easily reverted
19:23:04 i'm not sure anything needs to be reverted
19:23:15 okay. i think further discussion can happen in #zuul, but given that the feature won't be used for the intended use case, it's probably worth considering
19:23:52 ya we've got a few more topics to get to here. probably best to pick this back up in the zuul room on matrix
19:24:11 really quickly before the next topic, ianw any venv on bridge changes needing review yet?
19:24:21 yeah, just to confirm, it merged 9 days ago, and the most recent release (6.2.0) was 10 days before that
19:24:26 no sorry, i have some local work i haven't pushed yet
19:24:59 no problem. Just want to make sure I'm not missing any important changes to review
19:25:13 #topic Updating Bionic Servers to Focal or Jammy
19:25:27 I don't think there is anything new on this front
19:25:48 But I think we are generally ready to deploy to jammy for new or replacement things.
19:25:57 Jammy just got its .1 release too
19:26:24 which is when they open up the in-place upgrade path for focal installations. A good indication upstream thinks it is ready too
19:26:38 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:26:53 Feel free to add any additional bits of info to that etherpad as we start to take this on
19:27:28 #topic Mailman 3
19:27:41 #link https://review.opendev.org/c/opendev/system-config/+/851248
19:27:59 I think the deployment is largely there now. Plan is to start testing migration of actual lists ~Thursday
19:28:31 Reviews definitely welcome at this point. I've still got it marked WIP but it has congealed into something that looks mergeable now
19:28:38 i may be able to try out some of the migration tools on the held node tomorrow
19:28:56 thanks for putting the deployment together!
19:28:58 There is also a held node at 198.72.124.71 if anyone wants to poke at it
19:29:12 you're welcome
19:29:31 Importantly I think I've got the native vhosting working
19:29:44 which means we don't have any regressions from our mm2 vhosting behavior
19:30:00 and the rest api seems to be sufficient for the management we need to do.
19:30:31 For downsides mm3 is a significantly more complicated piece of software built on django with a database and all that. But it shouldn't be too bad
19:31:27 Anyway feel free to poke at the held node and leave review comments. I'll do my best to catch up on that after tomorrow. Previous investigation has been helpful in improving the deployment
19:31:35 #topic Gitea 1.17 Upgrade
19:31:42 #link https://review.opendev.org/c/opendev/system-config/+/847204 1.17.1 out, time to schedule the upgrade
19:31:55 Gitea 1.17.1 is out. That change has been updated to deploy 1.17.1.
19:32:13 I think we can upgrade whenever we are comfortable doing so with the openstack release schedule and so on
19:32:31 yeah, the list of changes didn't look too risky for us
19:32:41 Big changes for gitea to pay attention to: main is the default branch for new projects. Testing was updated to ensure we continue to create master by default to match gerrit and jeepyb
19:33:14 also they added a package repos feature that had a bunch of bugs in the .0 release. We intentionally disable it in part due to our distributed cluster not having shared storage, but also because we likely don't have sufficient storage for it
19:33:31 If I can get reviews I'm happy to babysit that upgrade on thursday when I get back
19:33:50 ++ will look
19:34:15 #topic Gerrit Load Issues
19:34:53 Last week a couple of times around 08:00 UTC Gerrit got busy and stopped accepting http requests
19:35:12 I believe that Gerrit itself was still running; it just exhausted its thread pool, which caused it and apache to return 500 errors
19:35:49 In response to that we've bumped up our http thread count above the value for ssh+http git request threads. The idea being we can still have a responsive web ui and rest api if the git side of gerrit is busy
19:36:18 Let's keep an eye on it and evaluate if further changes need to be made based on the new behavior with this config update
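As one way to "keep an eye on it", a small probe against the REST API can show whether the web ui/rest side stays responsive while the git threads are saturated. This is a rough sketch, not something currently deployed; the endpoint is Gerrit's anonymous version query and the timeout is arbitrary:

    #!/usr/bin/env python3
    # Rough sketch: time an anonymous Gerrit REST call to watch http responsiveness.
    import json
    import time
    import urllib.request

    URL = "https://review.opendev.org/config/server/version"

    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read().decode("utf-8")
    elapsed = time.monotonic() - start

    # Gerrit prefixes JSON responses with )]}' to defeat XSSI; drop that first line.
    version = json.loads(body.split("\n", 1)[1])
    print("gerrit %s answered in %.2fs" % (version, elapsed))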
19:36:45 During debugging and response to this a few busy IPs were blackholed using the linux route table.
19:37:12 One class of blockage was jenkins servers using the gerrit trigger plugin because they make a request to /plugins/events-log/ every 2 seconds, which 404s
19:37:18 #link https://github.com/jenkinsci/gerrit-trigger-plugin/pull/470 trying to make the gerrit trigger plugin less noisy
19:37:36 I've made that pull request trying to improve the plugin to be less noisy as Jenkins is still reasonable for third party CI
19:37:55 I did unblock an IP because the users noticed
19:38:12 Also this pointed out that infra-root may use different approaches to block traffic to a server.
19:38:37 The first place I always look for network traffic blockages is the firewall. I was very confused when I couldn't find iptables rules for this.
19:39:17 I don't think we need to solve this in this meeting but it would be good if we are consistent in applying those temporary, ephemeral blocks on our servers. We should decide on a method, either iptables or ip route, and stick to it
19:39:59 Maybe give that some thought over the next week and we can discuss it in next week's meeting
19:40:39 And finally, if we have to make additional changes to Gerrit to address this, let's try to make them as slowly as is reasonable so that we can measure their impact and avoid negatively impacting openstack feature freeze (a time when gerrit tends to be busy)
19:42:03 #topic Jaeger Tracging Server
19:42:08 #undo
19:42:08 Removing item from minutes: #topic Jaeger Tracging Server
19:42:12 #topic Jaeger Tracing Server
19:42:20 #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html Zuul spec for OpenTelemetry support
19:42:31 Zuul will be growing support for opentelemetry tracing
19:42:58 The question is whether or not OpenDev should have a Jaeger server to help test this and take advantage of the functionality
19:43:32 my initial thought was maybe we can colocate this with the prometheus server I have on my todo list. But then I immediately decided keeping them separate to reduce difficulty of OS upgrades etc is better
19:44:09 I think from an operational standpoint zuul's logs have been pretty good and we've been able to debug problems tracing through logs. frickler and I did that recently with that github repo's unexpected behavior
19:44:18 oase os upgrades would matter for containerized prometheus and jaeger deployments?
19:44:24 er, base os upgrades i mean
19:44:40 jaeger would be containerized
19:45:10 fungi: typically we do OS upgrades by spinning up new hosts, and both prometheus and jaeger appear to store data in databases of some sort that would need to be migrated
19:45:32 I must admit I have no idea what jaeger does. what kind of additional data would we get from it?
19:45:50 keeping things separate will likely simplify things and there isn't a strong reason to colocate
19:45:50 opentelemetry tracing data
19:46:11 clarkb: not that different to graphite though? that has a db that needs to be moved if we update
19:46:12 frickler: it's like fancy log data. But instead of raw text it goes into a db and there is tooling to render it nicely with timings and so on
19:46:13 frickler: basically timing and sequencing of events
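To make "timing and sequencing of events" a little more concrete, this is roughly what producing OpenTelemetry spans looks like from Python. The span and attribute names here are invented for illustration and are not what the zuul spec defines; a real deployment would swap the console exporter for one pointed at the Jaeger server:

    #!/usr/bin/env python3
    # Minimal OpenTelemetry tracing sketch; span/attribute names are made up.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("example")

    with tracer.start_as_current_span("buildset") as buildset:
        buildset.set_attribute("event.id", "deadbeef")  # invented attribute
        with tracer.start_as_current_span("build"):
            pass  # each span records its start time, duration, parent and attributes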
19:46:20 i think the utility here would be a marginal improvement to infra-root's ability to debug user-facing issues; potentially a significant improvement in ability for users to self-diagnose; and a benefit to the zuul project in being able to fully demonstrate and collaborate on the reference zuul implementation that opendev runs
19:46:51 corvus: user exposure is a good point since the tracing data is far more sanitized than the raw logs
19:46:52 i plan on adding a sample deployment to the zuul quickstart system, so it's not a big deal for me to port that over to opendev.
19:47:02 #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html
19:47:09 that might be a missing bit of context
19:47:09 clarkb: yep, i expect this to be fully user-safe
19:47:42 er, i guess clarkb already linked the zuul spec
19:47:50 the zuul event id in our logs is a rudimentary form of tracing
19:48:30 you can think of it like grep 'eventid' /var/log/zuul/ across all the zuul machines for curated info
19:48:30 jaeger can store data locally on the filesystem
19:49:10 so i'm imagining a really simple self-contained jaeger server. and it's okay if we lose the data.
19:49:25 no opposition from me.
19:49:46 particularly now that I've realized it can help users debug or at least better understand unexpected zuul behavior
19:49:49 for mine, we have such good pre-production system-config testing, as long as a service is working in with that I don't see any reason not to just bring it up
19:49:53 that's new functionality that would be useful
19:50:07 frickler: anyway, the introduction to that spec outlines the potential benefits of including an interface to that information in our deployment
19:50:13 given the simplicity of that (self-contained, ephemeral, no security issues) i thought maybe i could just propose the change to implement it and we can review it there (i'll be doing the work regardless, so i can accept the risk that we run into a blocker in review)
19:50:40 sounds good to me
19:50:43 ya a spec may be overkill given the low risk of deploying it
19:50:53 worst case we just turn it off and then write a spec :)
19:51:08 clarkb: yep
19:52:39 okay, sounds like once i'm ready, i'll propose an implementation change
19:52:54 corvus: and probably good to target jammy as the base at this point.
19:53:03 ack
19:53:54 Sounds like that may be it for this topic?
19:54:28 #topic Open Discussion
19:54:33 Anything else?
19:54:49 nothing from me
19:56:51 If that is all then thank you everyone. We can end here
19:57:00 We'll be back next week at the same time and location
19:57:07 #endmeeting