19:01:11 <clarkb> #startmeeting infra
19:01:12 <openstack> Meeting started Tue Mar 10 19:01:11 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 * mordred has fewer sandwiches than he should
19:01:15 <openstack> The meeting name has been set to 'infra'
19:01:19 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-March/006608.html Our Agenda
19:01:27 <clarkb> #topic Announcements
19:01:47 <clarkb> First up a friendly reminder that for many of us in north america DST switch has happened. Careful with your meeting schedule :)
19:01:58 <clarkb> Europe and others happen at the end of the month so we'll get round two with a different set of people
19:02:15 <clarkb> This meeting is at 1900UTC Tuesdays regardless of DST or timezone
19:02:25 <clarkb> #link http://lists.openstack.org/pipermail/foundation/2020-March/002852.html OSF email on 2020 events
19:02:46 <clarkb> Second we've got an email from the Foundation with an update on 2020 events and what the current status is.
19:02:58 <clarkb> If you've got feedback for that I know mark and others are more than happy to receive it
19:03:14 <clarkb> it went to the foundation list so I wanted to make sure everyone saw it
19:04:19 <clarkb> #topic Actions from last meeting
19:04:26 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-03-03-19.02.txt minutes from last meeting
19:04:34 <clarkb> There were no actions recorded
19:04:42 <clarkb> #topic Specs approval
19:05:27 <clarkb> I approved the python cleanup spec and the website activity stats spec. I think I've seen progress on both since then (I pushed a change today to add a job that runs goaccess and ianw landed the glean reparenting to python3 -m venv in dib)
19:05:41 <clarkb> Please add those review topics to your review lists
19:06:19 <clarkb> topic:website-stats topic:cleanup-test-image-python
19:06:55 <clarkb> We've also got the xwiki spec
19:06:56 <clarkb> #link https://review.opendev.org/#/c/710057/ xwiki for wikis
19:07:02 <clarkb> zbr: thank you for the review on that one.
19:07:25 <clarkb> frickler and corvus I think you had good feedback on IRC, would be great if you could double check that your concerns were addressed (or not) in the spec itself
19:08:27 <clarkb> Any other input on the specs work that has happened recently?
19:09:46 <clarkb> #topic Priority Efforts
19:09:58 <clarkb> #topic OpenDev
19:10:11 <clarkb> #link https://review.opendev.org/#/c/710020/ Split OpenDev out of OpenStack governance.
19:10:20 <clarkb> That change seems to be getting proper review this time around (which is good)
19:10:34 <clarkb> it also appears that we can continue to roll forward and make progress if I'm parsing the feedback there
19:10:43 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-March/006603.html OpenDev as OSF pilot project
19:10:57 <clarkb> I'll try to see if jbryce has time to follow up with our responses on that mailing list thread too
19:11:37 <clarkb> last week we discovered that we had missed replication events on at least one gitea backend as well as a different backend OOMing and causing some jobs that don't use the local git cache to fail
19:11:45 <AJaeger> Infra Manual has been updated and is now the "OpenDev Manual", a few changes still up for review
19:12:10 <clarkb> AJaeger: thanks for the reminder /me needs to make sure he is caught up on those change reviews
19:12:34 <clarkb> on the missed replication events problem mordred and I brainstormed a bit and we think we can force Gerrit to retry by always stopping the ssh daemon in the gitea setup first
19:13:04 <clarkb> that way replication either actually succeeds or will have tcp errors. This is a good thing because the Gerrit replication plugin will retry forever if it notices an error replicating a ref
19:13:11 <mordred> WIP patch for that: https://review.opendev.org/#/c/711130/
19:13:20 <mordred> which is incomplete
19:14:23 <clarkb> for the second thing (OOM) I think we should begin thinking about deploying bigger servers and/or more servers. My concern with more servers is that since we pin by IP any busy IP (NAT perhaps) can still overload a single server
19:15:19 <mordred> clarkb: while we should certainly avoid OOMing, should we also be trying to make those jobs use the git cache more better?
19:15:40 <clarkb> mordred: yes jrosser said they would work on that aspect of it
19:15:54 <fungi> it's not necessarily our jobs which are the bulk of the cause though
19:16:27 <fungi> remember that our nodes don't go through nat to reach our gitea farm
19:16:48 <mordred> nod
19:17:07 <clarkb> fungi: but we do often start many jobs all at once that can by chance hash to the same backend node
19:17:09 <fungi> so in theory if we have a big batch of jobs kick off which all clone nova, we should see that impact spread somewhat evenly across the pool already
19:17:17 <clarkb> (also these jobs were cloning nova which is our biggest repo iirc)
19:17:41 <fungi> the times we've seen this go sideways, it's generally been one backend hit super hard and little or no noticeable spike on the other backends around the same timeframe
19:18:10 <fungi> we talked about some options for using weighting in haproxy based on system resource health checks too, though that was kinda hand-wavey on the details
19:18:22 <corvus> the other approach would be to change the lb algorithm to least busy, but the issue there is that backends are sometimes slightly out of sync? is it the case that the only resolution there is shared storage?
19:19:13 <clarkb> corvus: ya I think shared storage solves that problem properly. It's possible that least busy would mostly work now that our gitea backends are largely in sync with each other and packing on a similar schedule
19:19:30 <fungi> even shared storage doesn't necessarily fix it if every backend's view of that storage can be slightly out of sync
19:19:32 <clarkb> the old cgit setup used least busy and didn't have the problems with git clients moving around. I think possibly because the git repos were similar enough
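For reference, a minimal haproxy sketch of the "least busy" switch being discussed, assuming the gitea pool sits behind haproxy in TCP mode with source-address pinning today; the section name, server names, and ports below are illustrative rather than copied from the real load balancer config:

    listen balance_git_https
        bind *:443
        mode tcp
        # current behaviour: hash on the client address so each IP sticks to one backend
        #balance source
        # proposed: send new connections to the backend with the fewest established connections
        balance leastconn
        server gitea01 gitea01.opendev.org:3081 check
        server gitea02 gitea02.opendev.org:3081 check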
19:19:34 <corvus> clarkb: oh, what changed to get them mostly in sync?
19:20:00 <clarkb> corvus: when we first spun up gitea we didn't do any explicit packing so they were somewhat randomly packed based on usage (git will decide on its own to pack things periodically iirc)
19:20:01 <corvus> fungi: i don't think that is possible with cephfs?
19:20:25 <fungi> right, using an appropriate shared storage can solve that
19:20:36 <fungi> (not all shared storage has those properties)
19:21:34 <clarkb> thinking out loud here, it may be worth flipping the switch back to least busy. Centos7 git in particular was the one unhappy with slightly different backend state iirc and it's possible that as centos7 is used less and we are keeping things more in sync we can largely avoid the problem?
19:21:45 <clarkb> maybe try that for a bit before deciding to make more/bigger servers?
19:22:02 <corvus> sounds like it's worth it since the state may have improved
19:22:18 <fungi> definitely worth a try, sure
19:22:43 <fungi> i expect we can fairly quickly correlate any new failures resulting from that
19:23:05 <clarkb> fungi: ya and we can run git updates and stuff locally as well to look for issues
19:23:05 <fungi> (presumably it'll manifest as 404 or similar errors fetching specific objects)
19:23:52 <clarkb> #link https://review.opendev.org/#/q/project:openstack/infra-manual+topic:opendev+status:open Infra manual opendev updates
19:24:09 <clarkb> AJaeger and I (mostly AJaeger) have been reworking the infra manual to make it more opendev and less openstack
19:24:14 <clarkb> hopefully this will be more welcoming to new users
19:24:23 <clarkb> if you've got time reviews on those changes are much appreciated
19:24:43 <clarkb> Anything else on OpenDev? or should we move on?
19:26:05 <clarkb> #topic Update Config Management
19:26:41 <clarkb> #link https://review.opendev.org/#/q/topic:nodepool-legacy+status:open
19:26:54 <clarkb> ianw has a stack there that will deploy nodepool builders with containers
19:27:07 <mordred> \o/
19:27:18 <clarkb> This works around problems with the builder itself not having things it needs like newer decompression tools because we can get them into the container image
19:27:23 <ianw> yeah just the main change really left, and then we can test
19:28:34 <clarkb> It doesn't totally solve the problems we've had as I'm not sure any single container image could have all the right combo of tools necessary to build all the things, but it's closer than where we were before
19:28:59 <ianw> this makes things better for fedora, but worse for suse, as this image doesn't have the tools to build that
19:29:30 <clarkb> ianw: we could in theory have an image that did do both or use two different images possibly?
19:29:38 <corvus> what do we need for suse?
19:29:41 <ianw> we have discussed longer term moving the minimal builds away from host tools, in the meantime we can have a heterogeneous situation
19:29:46 <clarkb> corvus: `zypper` the package manager
19:29:50 <corvus> ah that
19:30:04 <fungi> which was temporarily dropped from debian, and thus ubuntu
19:30:22 <corvus> is it back in debian?
19:30:40 <fungi> it's been reintroduced, but not long enough ago to appear in ubuntu/bionic
19:30:55 <ianw> oh interesting, because the container isn't ubuntu
19:30:56 <mordred> could we potentially add it via bionic-backports?
19:30:57 <clarkb> fungi: our nodepool-builder images are debian based though so may be able to pull that in
19:31:01 <mordred> yeah.
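As a rough sketch of the "just add it" option, installing zypper into a Debian buster based builder image might look like the following; the base image name and Dockerfile layout here are assumptions, not the actual nodepool-builder Dockerfile:

    FROM docker.io/opendevorg/python-base
    # zypper re-entered Debian in buster, so a buster-based image can install it
    # even though ubuntu/bionic cannot
    RUN apt-get update \
        && apt-get install -y --no-install-recommends zypper \
        && rm -rf /var/lib/apt/lists/*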
19:31:08 <fungi> #link https://packages.debian.org/zypper
19:31:15 <fungi> it made it back in time for the buster release
19:31:24 <mordred> cool. then we should be able to just add it
19:31:26 <ianw> that's somewhat good news, i can look into that avenue
19:31:52 <ianw> it's still a fragile situation, but if it works, it works
19:31:52 <clarkb> yay
19:32:08 <corvus> also we had the 'containerfile' idea for images
19:32:20 <corvus> stalled at https://review.opendev.org/700083
19:32:48 <clarkb> right that is the longer term solution that should fix this more directly for dib itself
19:33:18 <corvus> i don't actually understand dib, so it's going to take me a while to figure out why those tests are failing
19:33:23 <mordred> ianw: I have verified that zypper is installable in python-base
19:33:44 <fungi> basically accepting that no one platform will every have all the right tools available to build all the kinds of images we want to build
19:33:48 <corvus> if anyone else wants to take a peek at that and see if they have any hints, that might help move things along
19:34:05 <fungi> s/every/always/
19:34:47 <ianw> for step 0 though, i think getting a working image out of the extant container, and seeing how we manage its lifespan etc will be the interesting bits for the very near future
19:35:14 <clarkb> ianw: ya I think it will also be good to have someone start using that image too as we've had zuul users be confused about how to get nodepool builder running
19:35:43 <ianw> ++
19:35:52 <clarkb> mordred: any updates on gerrit and jeepyb and docker/podman?
19:36:19 <mordred> nope. last week got consumed - I'm about to turn my attention back to that
19:36:51 <clarkb> ok
19:37:00 <clarkb> Anything else before we move on to the next topic?
19:38:01 <clarkb> #topic General Topics
19:38:34 <clarkb> static.openstack.org is basically gone at this point right? ianw has shut it down to shake out any errors (and we found one for zuul-ci? did we check starlingx too?)
19:38:56 <ianw> the old host is shutdown, along with files02
19:39:20 <fungi> docs.starlingx.io still seems to be working
19:39:24 <ianw> i haven't heard anything about missing sites ...
19:39:26 <clarkb> fungi: thanks for checking
19:39:38 <clarkb> I mentioned starlingx because it has a similar setup to zuul-ci.org
19:39:50 <clarkb> though dns for it is managed completely differently and that was where we had the miss I think
19:40:01 <fungi> (the main starlingx.io site isn't hosted by us, the osf takes care of it directly)
19:40:29 <ianw> yeah, i updated that in RAX IIRC
19:40:30 <clarkb> Anything else we need to do before cleaning up the old resources? do we want to keep any of the old cinder volumes around after deleting servers (as a pseudo backup)?
19:40:58 <fungi> i already deleted all but one of them before a scheduled hypervisor host maintenance
19:41:13 <ianw> https://review.opendev.org/#/c/709639/ would want one more review to remove the old configs
19:41:14 <fungi> pvmove and lvreduce and so on
19:41:35 <clarkb> #link https://review.opendev.org/#/c/709639/ Clean up old static and files servers
19:41:52 <fungi> so at this point the old static.o.o cinder volumes are down to just main01
19:42:25 <clarkb> fungi: gotcha, I suppose we can hang onto that volume for a bit longer once the servers are deleted if we feel that is worthwhile
19:42:46 <clarkb> the servers themselves shouldn't have any other data on them that matters
19:42:58 <ianw> not much left in the story
19:43:51 <clarkb> anything else to bring up on this subject?
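For context, a hedged sketch of the pvmove/lvreduce shuffle fungi mentions above for emptying a cinder volume before deleting it; the device, volume group, and logical volume names are invented for illustration:

    lvreduce -r -L 200G /dev/main/data   # shrink the LV (and its filesystem) so it fits on the remaining PVs
    pvmove /dev/xvdc                     # migrate any extents still on the volume being retired
    vgreduce main /dev/xvdc              # drop the now-empty PV from the volume group
    pvremove /dev/xvdc                   # clear LVM metadata; the cinder volume can then be detached and deleted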
19:43:55 <fungi> i did notice, in looking over our mirroring configs, that we're mirroring tarballs.openstack.org and not tarballs.opendev.org
19:44:13 <clarkb> fungi: in afs?
19:44:32 <clarkb> oh no the in region mirrors
19:44:33 <fungi> apache proxy cache, but yeah i suppose that's now obsolete
19:44:44 <fungi> if we serve them from afs anyway
19:44:57 <clarkb> I didn't realize we did that for tarballs at all. But ya could serve it "directly" or update the cache pointer
19:45:06 <fungi> (though maybe we need to plumb that in the vhost config to make it available)
19:46:22 <ianw> want me to add a task to look into it?
19:46:29 <clarkb> ianw: ++
19:46:31 <clarkb> Ok as a time check we have less than 15 minutes and a few more items to get through so we'll move on
19:46:44 <fungi> it doesn't necessarily have to get lumped into the static.o.o move, but can't hurt i guess
19:46:56 <fungi> it's sort of cleanup
19:47:06 <fungi> cool, next topic
19:47:23 <clarkb> next up I wanted to remind everyone we've started to reorg our IRC channels.
19:47:41 <clarkb> We've dropped openstacky things from #opendev. But before we actually split the channels we need to communicate the switch
19:47:45 <clarkb> #link https://review.opendev.org/#/c/711106/1 if we get quorum on that change we can announce the migration then start holding people to it after an agreed on date.
19:48:00 <clarkb> and before we send out mass comms some agreement on the change at ^ would be great
19:48:15 <clarkb> it's a quick review. Mostly about sanity checking the division
19:48:50 <clarkb> and if we get agreement on that I can send out emails to the various places explaining that we'd like to be more friendly to non openstack users and push more of the opendev discussion into #opendev
19:48:57 <clarkb> then actually make that switch on #date
19:49:25 <clarkb> Next up: FortNebula is no more. Long live Open Edge cloud
19:49:40 <clarkb> effectively what we've done is s/FortNebula/Open Edge/ donnyd is still our contact there
19:49:59 <clarkb> I mentioned I could remove the fn grafana dashboard, but have not done that yet /me makes note on todo list for that
19:50:01 <fungi> though he rebuilt the environment from scratch, new and improved (hopefully)
19:50:20 <fungi> changed up various aspects of storage there
19:50:28 <clarkb> This is largely just a heads up so no one is confused about where this new cloud has come from
19:50:48 <fungi> and the name change signals a shift in funding for it, or something like that
19:51:15 <clarkb> And that takes us to project rename planning
19:51:39 <clarkb> we've got a devstack plugin that the cinder team wants to adopt and move from x/ to openstack/ and the openstack-infra/infra-manual repo should ideally be opendev/infra-manual
19:52:05 <clarkb> mordred: ^ I think I'd somewhat defer to your thoughts on scheduling this since you are also trying to make changes in the gerrit setup
19:52:19 <fungi> i think it's currently openstack/infra-manual actually, not openstack-infra/infra-manual
19:52:37 <clarkb> fungi: correct (my bad)
19:52:58 <fungi> basically we chose wrong (or failed to predict accurately) when deciding which namespace to move it into during the mass migration
19:53:15 <smcginnis> How far away are we from being able to do a gerrit upgrade? (while we're near the subject)
19:53:22 <fungi> so i'm viewing that one more as fixing a misstep from the initial migration
19:53:30 <donnyd> clarkb: fungi you are correct - Open Edge is likely to be around for quite a lot longer than FN was going to be able to be sustained for
19:53:37 <clarkb> smcginnis: mordred ran into an unexpected thing that needs accommodating so I don't think we have a good idea yet
19:53:42 <mordred> well - step one is just doing a restart on the same version but with new deployment
19:53:46 <smcginnis> clarkb: OK, thanks.
19:53:54 <fungi> thanks again donnyd!!! it's been a huge help (and so have you)
19:54:00 <mordred> clarkb: I don't expect manage-projects to take more than a day as long as I can actually work on it
19:54:07 <clarkb> mordred: ok
19:54:20 <clarkb> looking at ussuri scheduling we are kind of right in the middle of all the fun https://releases.openstack.org/ussuri/schedule.html
19:54:41 <clarkb> next week doesn't look bad but then after that we may have to wait until mid april?
19:55:04 <clarkb> (also I'm supposed to be driving to arizona a week from friday, but unsure if those plans are still in play)
19:55:19 <mordred> why don't we schedule for next week - make that a target for also restarting for the container
19:55:35 <mordred> I think that's a highly doable goal gerrit-wise
19:55:52 <clarkb> mordred: I'm fine with that, though depending on how things go I may or may not be driving a car and not much help
19:55:56 <mordred> (to be clear - this isn't the upgrade - it's just the container)
19:55:58 <mordred> clarkb: totes
19:56:01 <clarkb> if I'm not driving the car I'm happy to help
19:56:15 <fungi> yeah, i expect to be on hand, so any time around then wfm
19:56:18 <clarkb> mordred: do you want to say friday the 20th?
19:56:38 <mordred> clarkb: seems reasonable
19:57:08 <clarkb> any suggestions on a time frame (I can send out email that it is our intent to do that soon)
19:57:25 <clarkb> fungi: mordred: what works best for you since it sounds like you may be most around (judging on current volunteering)
19:57:42 <corvus> i'd like to help, but i'm not sure i can commit to the 20th right now, will let mordred know asap
19:58:21 <fungi> fridays work well for me more generally, also any time which works for mordred should be fine for me as well since i live an hour in his future anyway
19:58:31 <clarkb> well we are just about at time. Maybe we can sort out a timeframe between now and this friday and I'll announce stuff
19:58:49 <mordred> yeah - any time on the 20th works for me too
19:58:59 <clarkb> I'll open the floor for any last quick things
19:59:06 <clarkb> #topic Open Discussion
19:59:52 <clarkb> Sounds like that may be it. Thank you everyone and we'll see you here next week
20:00:08 <clarkb> feel free to continue discussion in #openstack-infra or on the mailing list
20:00:11 <clarkb> #endmeeting