19:01:11 <clarkb> #startmeeting infra
19:01:12 <openstack> Meeting started Tue Mar 10 19:01:11 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:14 * mordred has fewer sandwiches than he should
19:01:15 <openstack> The meeting name has been set to 'infra'
19:01:19 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-March/006608.html Our Agenda
19:01:27 <clarkb> #topic Announcements
19:01:47 <clarkb> First up a friendly reminder that for many of us in north america DST switch has happened. Careful with your meeting schedule :)
19:01:58 <clarkb> Europe and others happen at the end of the month so we'll get round two with a different set of people
19:02:15 <clarkb> This meeting is at 1900UTC Tuesdays regardless of DST or timezone
19:02:25 <clarkb> #link http://lists.openstack.org/pipermail/foundation/2020-March/002852.html OSF email on 2020 events
19:02:46 <clarkb> Second we've got an email from the Foundation with an update on 2020 events and what the current status is.
19:02:58 <clarkb> If you've got feedback for that I know mark and others are more than happy to receive it
19:03:14 <clarkb> it went to the foundation list so I wanted to make sure everyone saw it
19:04:19 <clarkb> #topic Actions from last meeting
19:04:26 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-03-03-19.02.txt minutes from last meeting
19:04:34 <clarkb> There were no actions recorded
19:04:42 <clarkb> #topic Specs approval
19:05:27 <clarkb> I approved the python cleanup spec and the website activity stats spec. I think I've seen progress on both since then (I pushed a change today to add a job that runs goaccess and ianw landed the glean reparenting to python3 -m venv in dib)
19:05:41 <clarkb> Please add those review topics to your review lists
19:06:19 <clarkb> topic:website-stats topic:cleanup-test-image-python
19:06:55 <clarkb> We've also got the xwiki spec
19:06:56 <clarkb> #link https://review.opendev.org/#/c/710057/ xwiki for wikis
19:07:02 <clarkb> zbr: thank you for the review on that one.
19:07:25 <clarkb> frickler and corvus I think you had good feedback on IRC, would be great if you could double check that your concerns were addressed (or not) in the spec itself
19:08:27 <clarkb> Any other input on the specs work that has happened recently?
19:09:46 <clarkb> #topic Priority Efforts
19:09:58 <clarkb> #topic OpenDev
19:10:11 <clarkb> #link https://review.opendev.org/#/c/710020/ Split OpenDev out of OpenStack governance.
19:10:20 <clarkb> That change seems to be getting proper review this time around (which is good)
19:10:34 <clarkb> it also appears that we can continue to roll forward and make progress if I'm parsing the feedback there
19:10:43 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-March/006603.html OpenDev as OSF pilot project
19:10:57 <clarkb> I'll try to see if jbryce has time to follow up with our responses on that mailing list thread too
19:11:37 <clarkb> last week we discovered that we had missed replication events on at least one gitea backend as well as a different backend OOMing and causing some jobs that don't use the local git cache to fail
19:11:45 <AJaeger> Infra Manual has been updated and is now the "OpenDev Manual", a few changes still up for review
19:12:10 <clarkb> AJaeger: thanks for the reminder /me needs to make sure he is caught up on those change reviews
19:12:34 <clarkb> on the missed replication events problem mordred and I brainstormed a bit and we think we can force Gerrit to retry by always stopping the ssh daemon in the gitea setup first
19:13:04 <clarkb> that way replication either actually succeeds or will have tcp errors. This is a good thing because the Gerrit replication plugin will retry forever if it notices an error replicating a ref
19:13:11 <mordred> WIP patch for that: https://review.opendev.org/#/c/711130/
19:13:20 <mordred> which is incomplete
19:14:23 <clarkb> for the second thing (OOM) I think we should begin thinking about deploying bigger servers and/or more servers. My concern with more servers is that since we pin by IP any busy IP (NAT perhaps) can still overload a single server
19:15:19 <mordred> clarkb: while we should certainly avoid OOMing, should we also be trying to make those jobs use the git cache more better?
19:15:40 <clarkb> mordred: yes jrosser said they would work on that aspect of it
19:15:54 <fungi> it's not necessarily our jobs which are the bulk of the cause though
19:16:27 <fungi> remember that our nodes don't go through nat to reach our gitea farm
19:16:48 <mordred> nod
19:17:07 <clarkb> fungi: but we do often start many jobs all at once that can by chance hash to the same backend node
19:17:09 <fungi> so in theory if we have a big batch of jobs kick off which all clone nova, we should see that impact spread somewhat evenly across the pool already
19:17:17 <clarkb> (also these jobs were cloning nova which is our biggest repo iirc)
19:17:41 <fungi> the times we've seen this go sideways, it's generally been one backend hit super hard and little or no noticeable spike on the other backends around the same timeframe
19:18:10 <fungi> we talked about some options for using weighting in haproxy based on system resource health checks too, though that was kinda hand-wavey on the details
19:18:22 <corvus> the other approach would be to change the lb algorithm to least busy, but the issue there is that backends are sometimes slightly out of sync? is it the case that the only resolution there is shared storage?
19:19:13 <clarkb> corvus: ya I think shared storage solves that problem properly. It's possible that least busy would mostly work now that our gitea backends are largely in sync with each other and packing on a similar schedule
19:19:30 <fungi> even shared storage doesn't necessarily fix it if every backend's view of that storage can be slightly out of sync
19:19:32 <clarkb> the old cgit setup used least busy and didn't have the problems with git clients moving around. I think possibly because the git repos were similar enough
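For reference, a minimal haproxy sketch of the "least busy" switch being discussed, assuming the gitea pool sits behind haproxy in TCP mode with source-address pinning today; the section name, server names, and ports below are illustrative rather than copied from the real load balancer config:

    listen balance_git_https
        bind *:443
        mode tcp
        # current behaviour: hash on the client address so each IP sticks to one backend
        #balance source
        # proposed: send new connections to the backend with the fewest established connections
        balance leastconn
        server gitea01 gitea01.opendev.org:3081 check
        server gitea02 gitea02.opendev.org:3081 check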
19:19:34 <corvus> clarkb: oh, what changed to get them mostly in sync?
19:20:00 <clarkb> corvus: when we first spun up gitea we didn't do any explicit packing so they were somewhat randomly packed based on usage (git will decide on its own to pack things periodically iirc)
19:20:01 <corvus> fungi: i don't think that is possible with cephfs?
19:20:25 <fungi> right, using an appropriate shared storage can solve that
19:20:36 <fungi> (not all shared storage has those properties)
19:21:34 <clarkb> thinking out loud here, it may be worth flipping the switch back to least busy. Centos7 git in particular was the one unhappy with slightly different backend state iirc and it's possible that as centos7 is used less and we are keeping things more in sync we can largely avoid the problem?
19:21:45 <clarkb> maybe try that for a bit before deciding to make more/bigger servers?
19:22:02 <corvus> sounds like it's worth it since the state may have improved
19:22:18 <fungi> definitely worth a try, sure
19:22:43 <fungi> i expect we can fairly quickly correlate any new failures resulting from that
19:23:05 <clarkb> fungi: ya and we can run git updates and stuff locally as well to look for issues
19:23:05 <fungi> (presumably it'll manifest as 404 or similar errors fetching specific objects)
19:23:52 <clarkb> #link https://review.opendev.org/#/q/project:openstack/infra-manual+topic:opendev+status:open Infra manual opendev updates
19:24:09 <clarkb> AJaeger and I (mostly AJaeger) have been reworking the infra manual to make it more opendev and less openstack
19:24:14 <clarkb> hopefully this will be more welcoming to new users
19:24:23 <clarkb> if you've got time reviews on those changes are much appreciated
19:24:43 <clarkb> Anything else on OpenDev? or should we move on?
19:26:05 <clarkb> #topic Update Config Management
19:26:41 <clarkb> #link https://review.opendev.org/#/q/topic:nodepool-legacy+status:open
19:26:54 <clarkb> ianw has a stack there that will deploy nodepool builders with containers
19:27:07 <mordred> \o/
19:27:18 <clarkb> This works around problems with the builder itself not having things it needs like newer decompression tools because we can get them into the container image
19:27:23 <ianw> yeah just the main change really left, and then we can test
19:28:34 <clarkb> It doesn't totally solve the problems we've had as I'm not sure any single container image could have all the right combo of tools necessary to build all the things, but it's closer than where we were before
19:28:59 <ianw> this makes things better for fedora, but worse for suse, as this image doesn't have the tools to build that
19:29:30 <clarkb> ianw: we could in theory have an image that did do both or use two different images possibly?
19:29:38 <corvus> what do we need for suse?
19:29:41 <ianw> we have discussed longer term moving the minimal builds away from host tools, in the meantime we can have a heterogeneous situation
19:29:46 <clarkb> corvus: `zypper` the package manager
19:29:50 <corvus> ah that
19:30:04 <fungi> which was temporarily dropped from debian, and thus ubuntu
19:30:22 <corvus> is it back in debian?
19:30:40 <fungi> it's been reintroduced, but not long enough ago to appear in ubuntu/bionic
19:30:55 <ianw> oh interesting, because the container isn't ubuntu
19:30:56 <mordred> could we potentially add it via bionic-backports?
19:30:57 <clarkb> fungi: our nodepool-builder images are debian based though so may be able to pull that in
19:31:01 <mordred> yeah.
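As a rough sketch of the "just add it" option, installing zypper into a Debian buster based builder image might look like the following; the base image name and Dockerfile layout here are assumptions, not the actual nodepool-builder Dockerfile:

    FROM docker.io/opendevorg/python-base
    # zypper re-entered Debian in buster, so a buster-based image can install it
    # even though ubuntu/bionic cannot
    RUN apt-get update \
        && apt-get install -y --no-install-recommends zypper \
        && rm -rf /var/lib/apt/lists/*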
19:31:08 <fungi> #link https://packages.debian.org/zypper
19:31:15 <fungi> it made it back in time for the buster release
19:31:24 <mordred> cool. then we should be able to just add it
19:31:26 <ianw> that's somewhat good news, i can look into that avenue
19:31:52 <ianw> it's still a fragile situation, but if it works, it works
19:31:52 <clarkb> yay
19:32:08 <corvus> also we had the 'containerfile' idea for images
19:32:20 <corvus> stalled at https://review.opendev.org/700083
19:32:48 <clarkb> right that is the longer term solution that should fix this more directly for dib itself
19:33:18 <corvus> i don't actually understand dib, so it's going to take me a while to figure out why those tests are failing
19:33:23 <mordred> ianw: I have verified that zypper is installable in python-base
19:33:44 <fungi> basically accepting that no one platform will every have all the right tools available to build all the kinds of images we want to build
19:33:48 <corvus> if anyone else wants to take a peek at that and see if they have any hints, that might help move things along
19:34:05 <fungi> s/every/always/
19:34:47 <ianw> for step 0 though, i think getting a working image out of the extant container, and seeing how we manage its lifespan etc will be the interesting bits for the very near future
19:35:14 <clarkb> ianw: ya I think it will also be good to have someone start using that image too as we've had zuul users be confused about how to get nodepool builder running
19:35:43 <ianw> ++
19:35:52 <clarkb> mordred: any updates on gerrit and jeepyb and docker/podman?
19:36:19 <mordred> nope. last week got consumed - I'm about to turn my attention back to that
19:36:51 <clarkb> ok
19:37:00 <clarkb> Anything else before we move on to the next topic?
19:38:01 <clarkb> #topic General Topics
19:38:34 <clarkb> static.openstack.org is basically gone at this point right? ianw has shut it down to shake out any errors (and we found one for zuul-ci? did we check starlingx too?)
19:38:56 <ianw> the old host is shutdown, along with files02
19:39:20 <fungi> docs.starlingx.io still seems to be working
19:39:24 <ianw> i haven't heard anything about missing sites ...
19:39:26 <clarkb> fungi: thanks for checking
19:39:38 <clarkb> I mentioned starlingx because it has a similar setup to zuul-ci.org
19:39:50 <clarkb> though dns for it is managed completely differently and that was where we had the miss I think
19:40:01 <fungi> (the main starlingx.io site isn't hosted by us, the osf takes care of it directly)
19:40:29 <ianw> yeah, i updated that in RAX IIRC
19:40:30 <clarkb> Anything else we need to do before cleaning up the old resources? do we want to keep any of the old cinder volumes around after deleting servers (as a pseudo backup)?
19:40:58 <fungi> i already deleted all but one of them before a scheduled hypervisor host maintenance
19:41:13 <ianw> https://review.opendev.org/#/c/709639/ would want one more review to remove the old configs
19:41:14 <fungi> pvmove and lvreduce and so on
19:41:35 <clarkb> #link https://review.opendev.org/#/c/709639/ Clean up old static and files servers
19:41:52 <fungi> so at this point the old static.o.o cinder volumes are down to just main01
19:42:25 <clarkb> fungi: gotcha, I suppose we can hang onto that volume for a bit longer once the servers are deleted if we feel that is worthwhile
19:42:46 <clarkb> the servers themselves shouldn't have any other data on them that matters
19:42:58 <ianw> not much left in the story
19:43:51 <clarkb> anything else to bring up on this subject?
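For context, a hedged sketch of the pvmove/lvreduce shuffle fungi mentions above for emptying a cinder volume before deleting it; the device, volume group, and logical volume names are invented for illustration:

    lvreduce -r -L 200G /dev/main/data   # shrink the LV (and its filesystem) so it fits on the remaining PVs
    pvmove /dev/xvdc                     # migrate any extents still on the volume being retired
    vgreduce main /dev/xvdc              # drop the now-empty PV from the volume group
    pvremove /dev/xvdc                   # clear LVM metadata; the cinder volume can then be detached and deleted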
19:43:55 <fungi> i did notice, in looking over our mirroring configs, that we're mirroring tarballs.openstack.org and not tarballs.opendev.org
19:44:13 <clarkb> fungi: in afs?
19:44:32 <clarkb> oh no the in region mirrors
19:44:33 <fungi> apache proxy cache, but yeah i suppose that's now obsolete
19:44:44 <fungi> if we serve them from afs anyway
19:44:57 <clarkb> I didn't realize we did that for tarballs at all. But ya could serve it "directly" or update the cache pointer
19:45:06 <fungi> (though maybe we need to plumb that in the vhost config to make it available)
19:46:22 <ianw> want me to add a task to look into it?
19:46:29 <clarkb> ianw: ++
19:46:31 <clarkb> Ok as a time check we have less than 15 minutes and a few more items to get through so we'll move on
19:46:44 <fungi> it doesn't necessarily have to get lumped into the static.o.o move, but can't hurt i guess
19:46:56 <fungi> it's sort of cleanup
19:47:06 <fungi> cool, next topic
19:47:23 <clarkb> next up I wanted to remind everyone we've started to reorg our IRC channels.
19:47:41 <clarkb> We've dropped openstacky things from #opendev. But before we actually split the channels we need to communicate the switch
19:47:45 <clarkb> #link https://review.opendev.org/#/c/711106/1 if we get quorum on that change we can announce the migration then start holding people to it after an agreed on date.
19:48:00 <clarkb> and before we send out mass comms some agreement on the change at ^ would be great
19:48:15 <clarkb> it's a quick review. Mostly about sanity checking the division
19:48:50 <clarkb> and if we get agreement on that I can send out emails to the various places explaining that we'd like to be more friendly to non openstack users and push more of the opendev discussion into #opendev
19:48:57 <clarkb> then actually make that switch on #date
19:49:25 <clarkb> Next up: FortNebula is no more. Long live Open Edge cloud
19:49:40 <clarkb> effectively what we've done is s/FortNebula/Open Edge/ donnyd is still our contact there
19:49:59 <clarkb> I mentioned I could remove the fn grafana dashboard, but have not done that yet /me makes note on todo list for that
19:50:01 <fungi> though he rebuilt the environment from scratch, new and improved (hopefully)
19:50:20 <fungi> changed up various aspects of storage there
19:50:28 <clarkb> This is largely just a heads up so no one is confused about where this new cloud has come from
19:50:48 <fungi> and the name change signals a shift in funding for it, or something like that
19:51:15 <clarkb> And that takes us to project rename planning
19:51:39 <clarkb> we've got a devstack plugin that the cinder team wants to adopt and move from x/ to openstack/ and the openstack-infra/infra-manual repo should ideally be opendev/infra-manual
19:52:05 <clarkb> mordred: ^ I think I'd somewhat defer to your thoughts on scheduling this since you are also trying to make changes in the gerrit setup
19:52:19 <fungi> i think it's currently openstack/infra-manual actually, not openstack-infra/infra-manual
19:52:37 <clarkb> fungi: correct (my bad)
19:52:58 <fungi> basically we chose wrong (or failed to predict accurately) when deciding which namespace to move it into during the mass migration
19:53:15 <smcginnis> How far away are we from being able to do a gerrit upgrade? (while we're near the subject)
19:53:22 <fungi> so i'm viewing that one more as fixing a misstep from the initial migration
19:53:30 <donnyd> clarkb: fungi you are correct - Open Edge is likely to be around for quite a lot longer than FN was going to be able to be sustained for
19:53:37 <clarkb> smcginnis: mordred ran into an unexpected thing that needs accommodating so I don't think we have a good idea yet
19:53:42 <mordred> well - step one is just doing a restart on the same version but with new deployment
19:53:46 <smcginnis> clarkb: OK, thanks.
19:53:54 <fungi> thanks again donnyd!!! it's been a huge help (and so have you)
19:54:00 <mordred> clarkb: I don't expect manage-projects to take more than a day as long as I can actually work on it
19:54:07 <clarkb> mordred: ok
19:54:20 <clarkb> looking at ussuri scheduling we are kind of right in the middle of all the fun https://releases.openstack.org/ussuri/schedule.html
19:54:41 <clarkb> next week doesn't look bad but then after that we may have to wait until mid april?
19:55:04 <clarkb> (also I'm supposed to be driving to arizona a week from friday, but unsure if those plans are still in play)
19:55:19 <mordred> why don't we schedule for next week - make that a target for also restarting for the container
19:55:35 <mordred> I think that's a highly doable goal gerrit-wise
19:55:52 <clarkb> mordred: I'm fine with that, though depending on how things go I may or may not be driving a car and not much help
19:55:56 <mordred> (to be clear - this isn't the upgrade - it's just the container)
19:55:58 <mordred> clarkb: totes
19:56:01 <clarkb> if I'm not driving the car I'm happy to help
19:56:15 <fungi> yeah, i expect to be on hand, so any time around then wfm
19:56:18 <clarkb> mordred: do you want to say friday the 20th?
19:56:38 <mordred> clarkb: seems reasonable
19:57:08 <clarkb> any suggestions on a time frame (I can send out email that it is our intent to do that soon)
19:57:25 <clarkb> fungi: mordred: what works best for you since it sounds like you may be most around (judging on current volunteering)
19:57:42 <corvus> i'd like to help, but i'm not sure i can commit to the 20th right now, will let mordred know asap
19:58:21 <fungi> fridays work well for me more generally, also any time which works for mordred should be fine for me as well since i live an hour in his future anyway
19:58:31 <clarkb> well we are just about at time. Maybe we can sort out a timeframe between now and this friday and I'll announce stuff
19:58:49 <mordred> yeah - any time on the 20th works for me too
19:58:59 <clarkb> I'll open the floor for any last quick things
19:59:06 <clarkb> #topic Open Discussion
19:59:52 <clarkb> Sounds like that may be it. Thank you everyone and we'll see you here next week
20:00:08 <clarkb> feel free to continue discussion in #openstack-infra or on the mailing list
20:00:11 <clarkb> #endmeeting