19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Aug 23 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:17 #link https://lists.opendev.org/pipermail/service-discuss/2022-August/000355.html Our Agenda
19:01:24 #topic Announcements
19:01:26 o/
19:02:01 The service coordinator nomination period ended last week. I saw only one nomination; the one from myself. I guess that means I'm it by default again
19:02:46 \o
19:02:52 #link https://releases.openstack.org/zed/schedule.html OpenStack Feature Freeze begins next week
19:03:19 Heads up that openstack is about to enter the typically most crazy portion of its release cycle
19:03:28 though the last few have been pretty chill comparatively
19:03:56 long live the king^W service-coordinator! :)
19:04:02 And finally I'll be AFK tomorrow. Back thursday
19:04:44 off celebrating your reelection... i mean drowning your sorrows?
19:04:59 looking for salmon swimming up the columbia river
19:05:14 and coincidentally escaping the heat at home
19:05:32 so that's a yes
19:05:43 ha
19:05:48 #topic Bastion Host Updates
19:05:53 Time to dive in
19:06:16 ianw: one thing that occurred to me is that the recent Zuul auto upgrades should've deployed your fixes for the console log file leaks?
19:06:20 I think those changes landed
19:06:49 If that is the case should we go ahead and manually clear those files out of bridge and static as they should leak far less quickly now?
19:07:03 umm, yes, i think this weekend actually should have deployed the file deletion in /tmp
19:07:24 i'll double check, restart the zuul_console on the static nodes and clean up the tmp files, then make a note to go back and check
19:08:12 we're probably testing the backwards compat now, until the daemon is killed
19:08:13 ianw: just be careful not to delete the files that will be automatically deleted. Might need to use an age filter when deleting
19:08:38 last weekend it deployed just the initial changes, which broke xenial/centos-7 because it used python-3-era f-strings
19:08:39 corvus: ianw: oh right, we need to have it start a new one.
19:09:04 in related news
19:09:06 #link https://review.opendev.org/q/topic:stream-2.7-container
19:09:31 but also we block port 19885 so the zuul cluster doesn't succeed at getting the logs. I wonder if we should also look at just not trying to run it at all on bridge (I think we need to be very careful exposing those logs directly)
19:09:32 runs the console streamer in a python 2.7 environment to do ultimate backwards compat testing
19:09:50 sounds like progress though. Thank you for looking at that
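For anyone unfamiliar with the compatibility problem mentioned at 19:08:38: f-strings only exist from python 3.6 on, so module code that has to run on xenial (python 3.5) or centos-7 (python 2.7) nodes cannot even be parsed if it uses them. A minimal illustration; the message text and path are invented, not taken from the zuul_console code:

    # f-strings fail to parse on python 2.7 / 3.5, the interpreters on centos-7 / xenial nodes
    path = "/tmp/console-example.log"          # hypothetical file name
    msg = f"could not open {path}"             # SyntaxError before python 3.6
    # equivalent spellings that parse everywhere the console streamer must run:
    msg = "could not open {}".format(path)
    msg = "could not open %s" % path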
19:10:26 so opendev doesn't actually need / will not use these changes?
19:10:50 corvus: it does use them because the console log stuff runs there and leaks the files. But we don't currently expose the results on the live stream through the zuul finger protocol
19:11:14 i mean, we could have just stopped running the log daemon? or aggressively pruned?
19:12:09 corvus: maybe? I think zuul will continue to try and fetch the data, but I guess the firewall blocking the port and not having anything listening on the port are roughly equivalent from that perspective (sorry, this just occurred to me)
19:12:14 i'm asking partly for curiosity, but also trying to get a handle on whether this is actually getting tested
19:12:21 corvus: it will be tested
19:12:28 it sounds like opendev is not going to be a robust test of this feature?
19:12:37 but ya it may not be a complete test
19:13:22 though also worth revisiting whether we might be comfortable streaming such console logs in the future
19:13:29 fungi: yup that too
19:13:43 thanks, it's good to know the limitations of opendev's production testing of new features like this.
19:13:50 corvus: I think the regular jobs will exercise this pretty well too fwiw. Since it happens for all the jobs
19:14:06 I think where our gap may be is how much we'll leak due to aborted jobs and the like?
19:14:26 there was also some work from several years ago to tunnel the console logs over a unix socket over ssh
19:14:27 if we don't allow connections on 19885 then will anything be deleted from bridge?
19:14:38 we put a lot of belts and suspenders in place early on because we were unsure of the security of some solutions, but now we've had time to evaluate things in a production scenario and could make better (informed by observed data and experience) decisions
19:14:54 oh, the entire protocol happens over 19885? yes I think that is correct. This may not delete anything on bridge I guess :/
19:15:27 hrm, that is a wrinkle, it does now send a message "i've finished with this, remove it"
19:15:28 i suggested that zuul-console should have a periodic deletion as a backstop. did that get implemented?
19:15:47 fungi: we would need to review all of the log files we produce to double check them for leaked sensitive info. Address any such leaks, then remember to not add new ones
19:16:12 what's the link to the review for the feature which got merged?
19:16:17 corvus: I don't think so
19:16:47 fungi: https://review.opendev.org/c/zuul/zuul/+/850270
19:16:51 thanks
19:17:06 corvus: I think the python 2.7 testing has become the next focus there to avoid regressions on the ansible target end
19:17:55 i haven't implemented cleanup, but i have expanded the docs to talk about it explicitly
19:18:07 that is
19:18:08 #link https://review.opendev.org/c/zuul/zuul/+/851942/
19:18:13 which could use a review
19:18:37 so how is bridge going to get cleaned up?
19:18:52 corvus: it won't, but we only just realized that
19:19:09 we'd need to stop running the console streamer on bridge or implement the periodic cleanups.
19:20:00 ianw: do you plan on implementing periodic cleanup in zuul-console, or separately?
19:20:09 (by separately, i mean just a cron job on bridge?)
19:21:10 i don't have immediate plans to work on adding it to zuul-console
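The periodic cleanup being discussed, whether it eventually lives inside zuul-console or runs as a cron job on bridge, could be as simple as an age-filtered sweep along these lines. This is only a sketch: the /tmp/console-*.log glob and the 24 hour cutoff are assumptions and would need to match what the streamer actually writes, with the age filter there to avoid deleting files that running jobs still need:

    #!/usr/bin/env python3
    # Sketch of an age-filtered cleanup for leaked console log files.
    # The glob pattern and cutoff below are assumptions, not the real values.
    import pathlib
    import time

    CUTOFF = 24 * 60 * 60  # seconds; leave recent files alone so active streams survive
    now = time.time()

    for path in pathlib.Path("/tmp").glob("console-*.log"):
        try:
            if now - path.stat().st_mtime > CUTOFF:
                path.unlink()
        except FileNotFoundError:
            pass  # the streamer may have removed it between the glob and the unlink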
19:21:41 did this make it into a zuul release?
19:22:20 looks like no
19:22:30 i don't see a tag which contains df3f9dcd30a13232447b3be67c7845c51cb527a0 in its history
19:22:35 so it could be easily reverted
19:23:04 i'm not sure anything needs to be reverted
19:23:15 okay. i think further discussion can happen in #zuul, but given that the feature won't be used for the intended use case, it's probably worth considering
19:23:52 ya we've got a few more topics to get to here. probably best to pick this back up in the zuul room on matrix
19:24:11 really quickly before the next topic, ianw any venv on bridge changes needing review yet?
19:24:21 yeah, just to confirm, it merged 9 days ago, and the most recent release (6.2.0) was 10 days before that
19:24:26 no sorry, i have some local work i haven't pushed yet
19:24:59 no problem. Just want to make sure I'm not missing any important changes to review
19:25:13 #topic Updating Bionic Servers to Focal or Jammy
19:25:27 I don't think there is anything new on this front
19:25:48 But I think we are generally ready to deploy to jammy for new or replacement things.
19:25:57 Jammy just got its .1 release too
19:26:24 which is when they open up the in-place upgrade path for focal installations. A good indication upstream thinks it is ready too
19:26:38 #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:26:53 Feel free to add any additional bits of info to that etherpad as we start to take this on
19:27:28 #topic Mailman 3
19:27:41 #link https://review.opendev.org/c/opendev/system-config/+/851248
19:27:59 I think the deployment is largely there now. Plan is to start testing migration of actual lists ~Thursday
19:28:31 Reviews definitely welcome at this point. I've still got it marked WIP but it has congealed into something that looks mergeable now
19:28:38 i may be able to try out some of the migration tools on the held node tomorrow
19:28:56 thanks for putting the deployment together!
19:28:58 There is also a held node at 198.72.124.71 if anyone wants to poke at it
19:29:12 you're welcome
19:29:31 Importantly I think I've got the native vhosting working
19:29:44 which means we don't have any regressions from our mm2 vhosting behavior
19:30:00 and the rest api seems to be sufficient for the management we need to do.
19:30:31 For downsides mm3 is a significantly more complicated piece of software built on django with a database and all that. But it shouldn't be too bad
19:31:27 Anyway feel free to poke at the held node and leave review comments. I'll do my best to catch up on that after tomorrow. Previous investigation has been helpful in improving the deployment
19:31:35 #topic Gitea 1.17 Upgrade
19:31:42 #link https://review.opendev.org/c/opendev/system-config/+/847204 1.17.1 out, time to schedule the upgrade
19:31:55 Gitea 1.17.1 is out. That change has been updated to deploy 1.17.1.
19:32:13 I think we can upgrade whenever we are comfortable doing so with the openstack release schedule and so on
19:32:31 yeah, the list of changes didn't look too risky for us
19:32:41 Big changes for gitea to pay attention to: main is the default branch for new projects. Testing was updated to ensure we continue to create master by default to match gerrit and jeepyb
19:33:14 also they added a package repos feature that had a bunch of bugs in the .0 release. We intentionally disable it in part due to our distributed cluster not having shared storage, but also because we likely don't have sufficient storage for it
19:33:31 If I can get reviews I'm happy to babysit that upgrade on thursday when I get back
19:33:50 ++ will look
19:34:15 #topic Gerrit Load Issues
19:34:53 Last week a couple of times around 08:00 UTC Gerrit got busy and stopped accepting http requests
19:35:12 I believe that Gerrit itself was still running; it just exhausted its thread pool, which caused it and apache to return 500 errors
19:35:49 In response to that we've bumped up our http thread count above the value for ssh+http git request threads. The idea being we can still have a responsive web ui and rest api if the git side of gerrit is busy
19:36:18 Let's keep an eye on it and evaluate if further changes need to be made based on the new behavior with this config update
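As one way to "keep an eye on it", a small probe against the REST API can show whether the web ui/rest side stays responsive while the git threads are saturated. This is a rough sketch, not something currently deployed; the endpoint is Gerrit's anonymous version query and the timeout is arbitrary:

    #!/usr/bin/env python3
    # Rough sketch: time an anonymous Gerrit REST call to watch http responsiveness.
    import json
    import time
    import urllib.request

    URL = "https://review.opendev.org/config/server/version"

    start = time.monotonic()
    with urllib.request.urlopen(URL, timeout=10) as resp:
        body = resp.read().decode("utf-8")
    elapsed = time.monotonic() - start

    # Gerrit prefixes JSON responses with )]}' to defeat XSSI; drop that first line.
    version = json.loads(body.split("\n", 1)[1])
    print("gerrit %s answered in %.2fs" % (version, elapsed))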
19:36:45 During debugging and response to this a few busy IPs were blackholed using the linux route table.
19:37:12 One class of blockage was jenkins servers using the gerrit trigger plugin because they make a request to /plugins/events-log/ every 2 seconds, which 404s
19:37:18 #link https://github.com/jenkinsci/gerrit-trigger-plugin/pull/470 trying to make the gerrit trigger plugin less noisy
19:37:36 I've made that pull request trying to improve the plugin to be less noisy as Jenkins is still reasonable for third party CI
19:37:55 I did unblock an IP because the users noticed
19:38:12 Also this pointed out that infra-root may use different approaches to block traffic to a server.
19:38:37 The first place I always look for network traffic blockages is the firewall. I was very confused when I couldn't find iptables rules for this.
19:39:17 I don't think we need to solve this in this meeting but it would be good if we are consistent in applying those temporary, ephemeral blocks on our servers. We should decide on a method, either iptables or ip route, and stick to it
19:39:59 Maybe give that some thought over the next week and we can discuss it in next week's meeting
19:40:39 And finally, if we have to make additional changes to Gerrit to address this, let's try to make them as slowly as is reasonable so that we can measure their impact and avoid negatively impacting openstack feature freeze (a time when gerrit tends to be busy)
19:42:03 #topic Jaeger Tracging Server
19:42:08 #undo
19:42:08 Removing item from minutes: #topic Jaeger Tracging Server
19:42:12 #topic Jaeger Tracing Server
19:42:20 #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html Zuul spec for OpenTelemetry support
19:42:31 Zuul will be growing support for opentelemetry tracing
19:42:58 The question is whether or not OpenDev should have a Jaeger server to help test this and take advantage of the functionality
19:43:32 my initial thought was maybe we can colocate this with the prometheus server I have on my todo list. But then I immediately decided keeping them separate to reduce difficulty of OS upgrades etc is better
19:44:09 I think from an operational standpoint zuul's logs have been pretty good and we've been able to debug problems tracing through logs. frickler and I did that recently with that github repo's unexpected behavior
19:44:18 oase os upgrades would matter for containerized prometheus and jaeger deployments?
19:44:24 er, base os upgrades i mean
19:44:40 jaeger would be containerized
19:45:10 fungi: typically we do OS upgrades by spinning up new hosts, and both prometheus and jaeger appear to store data in databases of some sort that would need to be migrated
19:45:32 I must admit I have no idea what jaeger does. what kind of additional data would we get from it?
19:45:50 keeping things separate will likely simplify things and there isn't a strong reason to colocate
19:45:50 opentelemetry tracing data
19:46:11 clarkb: not that different to graphite though? that has a db that needs to be moved if we update
19:46:12 frickler: it's like fancy log data. But instead of raw text it goes into a db and there is tooling to render it nicely with timings and so on
19:46:13 frickler: basically timing and sequencing of events
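To make "timing and sequencing of events" a little more concrete, this is roughly what producing OpenTelemetry spans looks like from Python. The span and attribute names here are invented for illustration and are not what the zuul spec defines; a real deployment would swap the console exporter for one pointed at the Jaeger server:

    #!/usr/bin/env python3
    # Minimal OpenTelemetry tracing sketch; span/attribute names are made up.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("example")

    with tracer.start_as_current_span("buildset") as buildset:
        buildset.set_attribute("event.id", "deadbeef")  # invented attribute
        with tracer.start_as_current_span("build"):
            pass  # each span records its start time, duration, parent and attributes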
19:46:20 i think the utility here would be a marginal improvement to infra-root's ability to debug user-facing issues; potentially a significant improvement in ability for users to self-diagnose; and a benefit to the zuul project in being able to fully demonstrate and collaborate on the reference zuul implementation that opendev runs
19:46:51 corvus: user exposure is a good point since the tracing data is far more sanitized than the raw logs
19:46:52 i plan on adding a sample deployment to the zuul quickstart system, so it's not a big deal for me to port that over to opendev.
19:47:02 #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html
19:47:09 that might be a missing bit of context
19:47:09 clarkb: yep, i expect this to be fully user-safe
19:47:42 er, i guess clarkb already linked the zuul spec
19:47:50 the zuul event id in our logs is a rudimentary form of tracing
19:48:30 you can think of it like grep 'eventid' /var/log/zuul/ across all the zuul machines for curated info
19:48:30 jaeger can store data locally on the filesystem
19:49:10 so i'm imagining a really simple self-contained jaeger server. and it's okay if we lose the data.
19:49:25 no opposition from me.
19:49:46 particularly now that I've realized it can help users debug or at least better understand unexpected zuul behavior
19:49:49 for mine, we have such good pre-production system-config testing, as long as a service is working in with that I don't see any reason not to just bring it up
19:49:53 that's new functionality that would be useful
19:50:07 frickler: anyway, the introduction to that spec outlines the potential benefits of including an interface to that information in our deployment
19:50:13 given the simplicity of that (self-contained, ephemeral, no security issues) i thought maybe i could just propose the change to implement it and we can review it there (i'll be doing the work regardless, so i can accept the risk that we run into a blocker in review)
19:50:40 sounds good to me
19:50:43 ya a spec may be overkill given the low risk of deploying it
19:50:53 worst case we just turn it off and then write a spec :)
19:51:08 clarkb: yep
19:52:39 okay, sounds like once i'm ready, i'll propose an implementation change
19:52:54 corvus: and probably good to target jammy as the base at this point.
19:53:03 ack
19:53:54 Sounds like that may be it for this topic?
19:54:28 #topic Open Discussion
19:54:33 Anything else?
19:54:49 nothing from me
19:56:51 If that is all then thank you everyone. We can end here
19:57:00 We'll be back next week at the same time and location
19:57:07 #endmeeting