19:01:20 <clarkb> #startmeeting infra
19:01:21 <openstack> Meeting started Tue Jan 5 19:01:20 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:24 <openstack> The meeting name has been set to 'infra'
19:01:37 <clarkb> hello everyone, welcome to the first meeting of 2021
19:01:56 <clarkb> Others indicated they would be delayed in joining so I'll give it a few minutes before we dive into the agenda I sent out
19:02:06 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000160.html Our Agenda
19:05:32 <clarkb> #topic Announcements
19:05:42 <clarkb> I didn't have any announcements. Were there others to share?
19:05:59 * corvus joins late
19:06:44 <fungi> i've nothing to share
19:06:48 <clarkb> #topic Actions from last meeting
19:06:54 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-12-08-19.01.txt minutes from last meeting
19:07:13 <clarkb> It has been a while since our last meeting. I don't see any actions registered there. I think we can just roll forward into 2021
19:07:22 <clarkb> #topic Priority Efforts
19:07:27 <clarkb> #topic Update Config Management
19:07:48 <clarkb> Over the holidays it appears that rax was doing a number of host migrations. A non-zero number of these failed, leaving servers unreachable
19:08:36 <clarkb> other than services like ethercalc, wiki, and elasticsearch going down as a result, one of the fallouts of this is that our ansible playbooks try to connect to the servers and never time out, piling up a number of stale ansible-playbook processes and their children on bridge
19:08:49 <clarkb> then subsequent runs time out because the server is slow due to load
19:09:06 <clarkb> We do set an ansible ssh connection timeout but it doesn't seem to be sufficient in these cases
19:09:18 <clarkb> fungi: ^ I think you had a theory for why that may be but I can't remember it right now?
19:09:18 <fungi> because ssh doesn't time out connecting
19:09:31 <fungi> ssh authenticates and hangs
19:09:50 <clarkb> I see, it's the next step that isn't being useful
19:10:05 <clarkb> I wonder if we can make that better in ansible or if ansible already has tooling to try and detect that.
19:10:15 <fungi> basically the servers are in a pathological condition which i think ansible's timeout mechanism doesn't take into consideration but happens rather regularly for us
19:10:18 <clarkb> like maybe we can set a task timeout to some value like 2 hours
19:11:06 <clarkb> anyway we don't need to solve it here. I just wanted to call that out since we hit this problem multiple times on bridge over the holidays (and on our return)
19:11:19 <corvus> unsure if this is on/off topic, but i made some changes to the root email alias, and it doesn't seem to have taken effect on many servers; is our periodic ansible run failing due to these issues?
19:11:19 <fungi> it's either hanging the connection indefinitely during or immediately following authentication, i'm not sure which
19:11:42 <clarkb> corvus: base was failing, but should be running as of yesterday evening my local time
19:11:50 <clarkb> correction: base was timing out
19:12:06 <corvus> ok, so i'll see if my inbox is full again tomorrow :)
19:12:08 <fungi> yeah, so servers later in the sequence would have been repeatedly skipped
19:13:01 <clarkb> and if you notice servers are unresponsive, reboots seem to correct their issues
19:13:19 <clarkb> any other config management items to bring up? that was all I had
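
A minimal sketch of the kind of guard rails discussed above, assuming a coreutils timeout wrapper and OpenSSH keepalives; the playbook path, fork count, and durations are illustrative and not what is actually deployed on bridge:

    # Make ssh give up on a server that authenticates and then hangs:
    # send a keepalive every 30s and drop the connection after 5 missed
    # replies.  Note that this overrides Ansible's default ssh_args.
    export ANSIBLE_SSH_ARGS="-o ConnectTimeout=30 -o ServerAliveInterval=30 -o ServerAliveCountMax=5"

    # Put a hard wall-clock cap (the "2 hours" idea above) on the whole
    # playbook run so stale ansible-playbook processes cannot pile up.
    timeout --kill-after=5m 2h ansible-playbook -f 20 playbooks/base.yaml

Newer Ansible releases also offer a per-task timeout keyword, which might be a cleaner fit than an external wrapper if the hang is isolated to a single long-running task.
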
19:14:12 <clarkb> #topic OpenDev
19:14:46 <clarkb> On the Gerrit tuning topic we enabled the git v2 protocol, then updated our zuul images to enable it client side, and that was the last gerrit tuning we did
19:15:11 <clarkb> it seems to be working from a functionality perspective (zuul and git review are happy etc) but probably too early to say if it has helped with the system load issues
19:15:51 <corvus> yeah, we also scheduled holidays ;)
19:16:00 <corvus> if the tuning doesn't work out, let's fall back on scheduling more holidays
19:16:19 <fungi> yeah, i'll be more convinced next week or the week after when everyone's turning it up to 11 again
19:16:20 <clarkb> Other tuning ideas are the strong refs for jgit caches (potentially needs more memory and is scary for that reason), setting up service user and regular user thread counts to better balance CI and humans, and on the upstream mailing list there has been a ton of recent discussion from other users about tuning caches
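
A rough sketch of what poking at those knobs could look like. The option names should be double-checked against the documentation for the Gerrit version actually running; the config file path and the values here are illustrative assumptions, not OpenDev's real configuration:

    # Confirm protocol v2 is actually negotiated from a client.
    GIT_TRACE_PACKET=1 git -c protocol.version=2 \
        ls-remote https://review.opendev.org/opendev/system-config 2>&1 | grep "version 2"

    # Examples of the kinds of gerrit.config settings mentioned above
    # (the path is a placeholder): a separate thread budget for batch /
    # service users vs interactive users, jgit pack cache sizing, and
    # the strong-references cache experiment.
    GERRIT_CONFIG=/var/gerrit/etc/gerrit.config
    git config -f "$GERRIT_CONFIG" sshd.threads 24
    git config -f "$GERRIT_CONFIG" sshd.batchThreads 8
    git config -f "$GERRIT_CONFIG" core.packedGitLimit 8g
    git config -f "$GERRIT_CONFIG" core.useStrongRefs true
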
19:17:16 <clarkb> corvus: do you know where ianw has gotten with the zuul results plugin work? I think you were helping to get that into an upstream plugin?
19:18:25 <clarkb> I expect we will be able to incorporate that into our images soon, but I've not yet caught up on the status of this work
19:18:28 <fungi> i'll readily admit i ended up not finding time to work on the jeepyb fixes for update_bug/update_bp as other problems kept preempting my time
19:18:37 <corvus> um... i haven't checked recently but last i remember is it exists in an upstream repo
19:18:47 <clarkb> corvus: cool so progress :)
19:19:07 <clarkb> the other thing ianw had brought up was using the built-in WIP status for changes. In testing that we have found that Zuul doesn't understand WIP status changes as unmergeable
19:19:16 <corvus> #link https://gerrit.googlesource.com/plugins/zuul-results-summary/
19:19:23 <clarkb> we mentioned this last time we had a meeting but we should discourage users from using that until Zuul does understand that status
19:19:46 <corvus> i can add that feature
19:20:00 <clarkb> the preexisting WIP vote on the workflow label should be used until zuul has been updated
19:20:14 <clarkb> corvus: thanks
19:20:23 <corvus> #action corvus add wip support to zuul
19:20:49 <clarkb> The last Gerrit-related topic I wanted to bring up was the 3.3 upgrade. guillaumec says that 3.3.1 incorporates the fix for zuul
19:21:13 <corvus> this was the comments thing (that would break 'recheck' i think)
19:21:39 <clarkb> I think that means we can start looking at 3.3.1 upgrades if people have time. The upgrade does involve some changes like the Non-Interactive Users group being renamed to Service Users, and I am sure there are other things to consider, so if we do that let's read the release notes and test it (review-test can still be used for this I think)
19:21:44 <clarkb> corvus: yup
19:21:47 <corvus> i haven't checked on what the final status of that is (ie, do we need to enable an option or is it transparently backwards compat)
19:22:13 <clarkb> oh good point, we should also double check this fix doesn't need settings to be effective
19:22:47 <corvus> i think people were leaning towards not requiring that, but it was a suggestion, so we should verify
19:22:53 <clarkb> I don't know that I'll have time to drive a gerrit upgrade at the beginning of the year. I've got all the typical beginning-of-the-year things distracting me. But I can help anyone else who may have time (if they don't also have beginning-of-the-year items)
19:23:27 <clarkb> ianw was also working on improving our testing of gerrit in CI
19:23:49 <clarkb> it might be worth getting those improvements landed, then relying on them to help verify the next upgrade. I don't think we're in a rush so that may be a good idea
19:24:51 <clarkb> The other opendev-related upgrade is Gitea 1.13
19:25:01 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/769226
19:25:18 <clarkb> this upgrade seems to be a bigger leap than previous gitea upgrades. They have added new features like project management kanban boards
19:25:56 <clarkb> our testing is decent for api checking but maybe we should hold the run job for that change now and put a repo or three in it and confirm it is happy from a ui perspective?
19:25:56 <corvus> oO
19:26:30 <clarkb> this version also adds elasticsearch support for indexing. It isn't the default and I think we should upgrade to it first without worrying about elasticsearch, just to sort out the other changes. Then as a follow-on we can work to sort out elasticsearch
19:26:55 <fungi> our manage-projects test loads repos into gitea, can we depends-on or something to just take advantage of that and hold it?
19:27:16 <clarkb> fungi: the gitea test creates all of the projects, but without git content
19:27:24 <clarkb> fungi: all you need to do is push the content in after holding it
19:27:35 <fungi> ahh
19:27:36 <clarkb> we could potentially modify the job to push in content for some small repos too
19:27:51 <clarkb> that may be a good idea
19:27:53 <fungi> or push some ourselves after setting up necessary credentials, yeah
19:29:13 <clarkb> ya, why don't we do that. I'll WIP the change and suggest we hold it and check the ui since the upgrade is a bit more involved than ones we have done previously
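
Seeding a held test node with real git content for that ui check could look roughly like this; the node address, port, admin credentials, and choice of repository are placeholders, since (as noted above) the manage-projects job only creates empty projects:

    # Push an existing small project into the held gitea so the 1.13 web
    # UI can be inspected with real content.  Host, port, and credentials
    # below are placeholders for the held node's actual values.
    git clone https://opendev.org/opendev/bindep
    cd bindep
    git remote add held "https://root:PASSWORD@HELD_NODE_IP:3000/opendev/bindep"
    git push held --all
    git push held --tags
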
19:30:04 <clarkb> Any other opendev topics to discuss or should we move on?
19:30:29 <fungi> annual report?
19:30:39 <clarkb> that's next, though I guess technically it fits under here
19:30:40 <fungi> or did you have a separate topic for that?
19:30:46 <fungi> ahh, no worries
19:30:52 <clarkb> ya I had it in general topics but it is the opendev project update. Let's talk about it here
19:30:54 * fungi should read meeting agendas
19:31:05 <clarkb> We have been asked to put together a project update for opendev in the foundation's annual report
19:31:15 <clarkb> #link https://etherpad.opendev.org/p/opendev-2020-annual-report
19:31:38 <clarkb> I have written a draft. But I'm happy to scrap that if others want to write one. Also happy for edits and suggestions
19:31:56 <clarkb> I believe we have a week from tomorrow to get it together so this isn't a huge rush but is also a near-future item to figure out
19:34:00 <fungi> i'm also putting some polish on our engagement metrics generator: https://review.opendev.org/729293
19:34:03 <clarkb> I've been planning to do periodic rereads and edits myself too. Basically I want to reread it with fresher eyes, then correct things as necessary
19:34:46 <clarkb> #topic General topics
19:34:54 <clarkb> #topic Bup and Borg Backups
19:35:16 <clarkb> I think we may be about ready to drop this entry from our agenda. I'll double check with ianw when holidays end.
19:35:30 <clarkb> tldr aiui is we're using borg now, bup should be disabled at least on some servers
19:35:54 <clarkb> we'll keep the old bup backups around on the old volumes like we've done with previous bup rotations
19:36:25 <clarkb> if you haven't yet had a chance to interact with borg and try out recovery methods that may be a good exercise. Should only take about half an hour I would expect
19:37:29 <clarkb> #topic InMotion Hosted Cloud
19:37:55 <clarkb> The other thing I've been working on this week is getting an account with inmotion bootstrapped so that we can spin up an openstack cloud there for nodepool resources when they are ready
19:38:28 <clarkb> I have created an account and the details for that as well as our contacts are in the usual location. There is no actual cloud yet though. AIUI we are waiting on them to tell us they are ready to try bootstrapping the actual resources
19:39:45 <fungi> this is the experiment where we're sort of on the hook as openstack cloud admins, right?
19:39:52 <fungi> infracloud mk2?
19:40:03 <clarkb> yes, but I think we've decided that we are comfortable with a redeploy strategy using their provided management tools
19:40:14 <clarkb> in theory that means the actual overhead to us is low
19:40:28 <fungi> okay, so basically hands-off and if it breaks we push a button and rebuild it all
19:40:36 <clarkb> exactly
19:40:43 <corvus> so if it breaks or we need to upgrade, ^ that?
19:40:48 <clarkb> yup
19:41:13 <corvus> that happens occasionally with our current providers too
19:41:46 <clarkb> they have also expressed interest in zuul and nodepool so maybe we can get them involved there too
19:41:55 <fungi> openstack as a service. it'll be interesting
19:42:51 <clarkb> #topic Open Discussion
19:43:14 <clarkb> That was about all I had. There are some old agenda items that I should probably clean up after thinking about them for half a second
19:43:47 <clarkb> I've got meetings mon-wed next week that will have me distracted in the mornings (and maybe afternoons? I don't know if that has been sorted out yet)
19:43:57 <clarkb> I should be around for our meeting next week though
19:44:16 <fungi> yeah, same here (same meetings)
19:44:43 <fungi> but they're half-day if memory serves, so shouldn't be entirely distracting
19:46:05 <clarkb> Anything else? Or should we call it here?
19:46:59 * fungi has nothing
19:47:22 <clarkb> sounds like that may be it then. Thanks everyone and we'll see you here next week
19:47:34 <fungi> thanks clarkb!
19:47:38 <clarkb> feel free to bring up discussions in #opendev or on the mailing list and we can pick things up there if they were missed here
19:47:39 <corvus> thanks!
19:47:41 <clarkb> #endmeeting
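
As a closing pointer for the borg recovery exercise mentioned under the backup topic above, the basic commands involved are along these lines; the user, host, repository path, and archive name are placeholders, and the real values live with our backup documentation:

    # List the archives in a repository, list the contents of one archive,
    # then extract a single file to verify restores work end to end.
    borg list ssh://borg-example@backup-server.example.org/opt/backups/borg-example
    borg list ssh://borg-example@backup-server.example.org/opt/backups/borg-example::example-archive
    borg extract ssh://borg-example@backup-server.example.org/opt/backups/borg-example::example-archive etc/hostname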