19:01:06 <clarkb> #startmeeting infra
19:01:07 <openstack> Meeting started Tue Feb  2 19:01:06 2021 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:19 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000179.html Our Agenda
19:01:26 <kopecmartin> hi o/
19:02:24 <ianw> o/
19:02:36 <clarkb> hello kopecmartin, your agenda item is near the tail end of the meeting, if that is a problem feel free to say something and we can cover it earlier (not sure what meeting timing is like for you)
19:02:57 <kopecmartin> clarkb: it's fine, i'll wait :)
19:03:00 <clarkb> #topic Announcements
19:03:08 <clarkb> I had no announcements
19:03:38 <clarkb> #topic Actions from last meeting
19:03:44 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-01-26-19.01.txt minutes from last meeting
19:03:54 <clarkb> fungi's request for help on config-core duties went out
19:03:57 <clarkb> fungi: thank you for that
19:04:08 <clarkb> #action clarkb begin puppet -> ansible and xenial upgrade audit
19:04:16 <clarkb> I did not manage to find time for ^ so have added it back on
19:04:31 <clarkb> ianw: do we need to keep an action item for wiki backups or are those happy now?
19:04:49 <ianw> not done yet ...
19:05:00 <clarkb> #action ianw figure out borg backups for wiki
19:05:14 <clarkb> Ok lets dive into our topics for today
19:05:16 <clarkb> #topic Priority Efforts
19:05:22 <clarkb> #topic OpenDev
19:05:36 <clarkb> The service coordination nominations period has finished.
19:05:47 <clarkb> I didn't see anyone else volunteer by the weekend so I put my name in
19:05:51 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000161.html Clarkb appears to be the only nomination
19:06:07 <clarkb> I haven't seen any others since I made mine either.
19:06:18 <clarkb> I think that means I'm it again, but if I missed something please call it out :)
19:07:08 <clarkb> Since last weeks meeting I've done a bit of work on the Gerrit account inconsistencies problem
19:07:11 <clarkb> #link https://etherpad.opendev.org/p/gerrit-user-consistency-2021 High level notes.
19:07:23 <clarkb> I've started to try and keep high level notes there while keeping the PII out of the etherpad
19:07:49 <clarkb> Group problems and 81 accounts with preferred emails missing external ids have been fixed.
19:08:01 <clarkb> thank you fungi for being an extra set of eyes while I worked through ^
19:08:10 <fungi> any time!
19:08:13 <clarkb> We have 28 accounts with preferred email addresses that don't have a matching external id
19:08:23 <clarkb> We have ~642 accounts with conflicting emails in their external ids. This needs more investigating to better understand the fix for them.
19:08:31 <clarkb> Need to correct the ~642 external id issues before we can push updates to refs/meta/external-ids with Gerrit online.
19:08:38 <clarkb> Workaround is we can stop gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?)
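The offline workaround above maps roughly to the following (a sketch only; the All-Users path, site directory, and admin account are assumptions, not the exact commands used):
    # with gerrit stopped, work against a local clone of All-Users
    git clone /var/gerrit/git/All-Users.git external-ids-work
    cd external-ids-work
    git fetch origin refs/meta/external-ids && git checkout FETCH_HEAD
    # edit/remove the offending external id notes, commit, then push back
    git push origin HEAD:refs/meta/external-ids
    # offline reindex of accounts (and groups?) before starting gerrit again
    java -jar gerrit.war reindex -d /var/gerrit --index accounts
    # once gerrit is back up, clear the relevant caches
    ssh -p 29418 admin@review.opendev.org gerrit flush-caches --cache accounts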
19:08:58 <clarkb> I've given the next set of steps some thought and I think roughly it is:
19:09:05 <clarkb> Classify users further into situation groups
19:09:10 <clarkb> Decide on next steps for users depending on their situation group
19:09:17 <clarkb> Fix the preferred email issue if possible as this can be done with gerrit online
19:09:25 <clarkb> Start a refs/meta/external-ids checkout in a shared location and begin committing fixes to it. If we can't push all the fixes as separate commits we can squash them together and then push.
19:09:49 <clarkb> that might be broken down further to do all the preferred email issues first as we can correct them online. Then do the external ids
19:09:59 <zbr> clarkb: does this mean manually investigating and patching >600 accounts?
19:10:27 <clarkb> Another upside to doing it ^ that way is I expect some of the external id fixes will result in preferred email issues on the account side. If we fix the existing issues first we won't confuse them with any new ones we introduce
19:10:29 <clarkb> zbr: yes
19:10:42 <fungi> probably semi-scripted at least
19:11:04 <fungi> and with distinct classifications, some of them may be quick to blow through
19:11:07 <clarkb> right the 81 we fixed already were 95% done with a script once we were satisfied with an earlier pass and classification
19:11:57 <clarkb> depending on how the classification for external ids goes I think downtime to correct a portion of them is an option as well
19:12:14 <clarkb> that will help us ensure that we're making changes that are viable once loaded into gerrit
19:12:29 <clarkb> but I don't want to make too strong of a plan for those until we start actually committing changes to that shared checkout
19:12:43 <zbr> while this does not sound like a joy, if splitting the work can speed things up, i may give it a try.
19:13:14 <clarkb> zbr: part of the problem here is that it is all PII so I think we need to be careful who we give access to. Currently it is just gerrit admins
19:13:32 <clarkb> (you need to be admin to access the refs)
19:13:42 <zbr> ah.
19:14:13 <clarkb> anyway, after fixing the first 81 accounts I spent a bit of time doing further classification but then got distracted
19:14:35 <clarkb> I need to pick that back up again. I think that a good chunk of the remaining preferred email issues can be fixed like the first 81
19:14:44 <clarkb> but I need to actually make those lists and then see if others agree
19:14:55 <clarkb> I'll be trying to pick this back up again this week
19:15:07 <clarkb> Other Gerrit items:
19:15:23 <clarkb> We upgraded Gerrit to ~3.2.7 yesterday to patch a security issue
19:15:57 <clarkb> I also tested that Gerrit's workinprogress state is handled by zuul properly when you approve changes. It appears to ignore workinprogress changes properly now
19:16:08 <clarkb> (we expected it to since the fix was deployed, but needed to test with actual gerrit)
19:16:43 <clarkb> ianw and I have made some improvements to the gerrit testing too.
19:17:10 <clarkb> the selenium stuff is a bit better now and I added a test to check that the x/ clone workaround continues to work
19:17:23 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images
19:17:52 <clarkb> I also resurrected my change to build 3.3 images. I don't think we're in a hurry to upgrade, doing the OS upgrade first seems like better prioritization, but having working image builds ready for us would be nice
19:18:16 <clarkb> That was what I had for OpenDev and Gerrit things. Anything else to add before we move on?
19:18:19 <ianw> hrm that doesn't run the system-config job?
19:18:41 <clarkb> ianw: no, because currently the system-config job is 3.2 only
19:18:49 <clarkb> I think in a followup I could add a system-config + 3.3 job
19:19:02 <clarkb> or if people prefer can add it to this existing change
19:19:30 <ianw> oh i see totally new jobs :)  i think it would be great to run it, either way
19:20:09 <clarkb> ok I'll probably start with a followup change then as that is slightly easier and take it from there
19:21:14 <clarkb> #topic Update Config Management
19:21:32 <clarkb> I am not aware of anything to add to this other than maybe the refstack topic which we've got later on in the agenda
19:22:46 <clarkb> Might be worth mentioning that I helped zuul fix an issue in zuul-registry that new buildx exposed. This was affecting our ability to do multiarch builds (things like nodepool builders)
19:23:05 <clarkb> that should be fixed now though. Thank you zbr for calling out the problem
19:24:20 <clarkb> #topic General topics
19:24:27 <clarkb> #topic OpenAFS cluster status
19:24:43 <clarkb> Just a quick status check on the openafs cluster. I think we still need to upgrade the db servers?
19:25:39 <clarkb> ianw: fungi: anything else to note about ^ ?
19:25:50 <fungi> that's still the status afaik
19:26:07 <ianw> yeah, i got distracted with other things.  high on my todo list :)
19:26:07 <fungi> and then we can think through upgrading operating systems/replacing servers
19:26:25 <clarkb> no worries, just making sure I (and others) are up to date
19:26:31 <clarkb> #topic Bup and Borg Backups
19:26:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/773570
19:27:06 <clarkb> I think that is the latest on borg? Would be good if fungi frickler and corvus could review that one
19:27:19 <zbr> clarkb: i just realized why i did not get notifications from Zuul-jobs-failures mailing list... i did not whitelist the user.
19:27:47 <clarkb> ianw: feel free to fill us in on any and all relevant info for this topic though :)
19:28:14 <ianw> yeah, i'm pretty focused on getting our working set to a reasonable level
19:28:40 <ianw> unfortunately i didn't quite fully grok the implications of --append mode and the particular way borg implements that
19:28:47 <clarkb> (I didn't either)
19:29:04 <ianw> all the details are in the changelog of 773570
19:29:46 <fungi> i caught some of it last night before i passed out
19:30:13 <ianw> anyway, a better way would be to do something like a rolling set of LVM snapshots on the server side
19:30:47 <fungi> i guess cow wouldn't help there because of the encryption layer
19:31:36 <fungi> or maybe it would, depends on if borg manages to not update most of the blocks when updating the backup
19:31:47 <ianw> we don't encrypt the backups, i think it would be ok
19:31:47 <clarkb> I think cow would be fine the way we're using borg
19:31:57 <fungi> oh, right then
19:33:01 <ianw> anyway we can discuss in the review, but yeah i would like to get this all sorted and running by itself very soon
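A rolling server-side snapshot scheme like the one ianw describes might look roughly like this (volume group, LV names, snapshot size, and retention are assumptions):
    # nightly: take a dated snapshot of the backup volume
    lvcreate --snapshot --size 10G --name backups-$(date +%Y%m%d) /dev/main/backups
    # prune: drop the oldest snapshot once the retention window is exceeded
    lvremove -y /dev/main/backups-20210101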
19:33:23 <clarkb> ++ to getting this sorted soon. I intend on looking at it much closer this afternoon. I want to catch up on the docs and related issues
19:33:42 <clarkb> Anything else or should we move on to the next item?
19:34:08 <ianw> nope, move on
19:34:19 <clarkb> #topic Deploy a new refstack.openstack.org server
19:34:41 <clarkb> kopecmartin: has updated my old change to make a refstack container
19:34:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/705258
19:35:39 <clarkb> we think it is just about ready to be landed, then we can spin up a new refstack server on bionic/focal (probably focal), make sure it works (kopecmartin has volunteered to help with this step), then migrate the data from the old instance to the new one with a scheduled downtime
19:36:31 <clarkb> I think the main thing we need help with is someone to spin up the new instance, configure dns records, and ensure that LE and ansible and all that are happy
19:36:51 <ianw> happy to help with that
19:37:17 <clarkb> cool. I'm happy to keep helping too, but worried that I'm not in a great spot to drive any single effort right now (as I'm assisting a bunch)
19:37:48 <clarkb> I also expect we may need to learn us a refstack in order to figure out what the migration from old to new server will look like, but I'm 99% sure that can happen once we're happy the new deployment functions at all
19:37:51 <ianw> can it run without DNS pointed to it?
19:38:23 <ianw> i haven't looked but i was imagning it would be a db import?
19:38:25 <clarkb> ianw: you might need to edit your local /etc/hosts to make everything happy but it should
19:38:34 <clarkb> yes I believe it is a db import
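Assuming it really is a straight database import, the migration would presumably amount to something like this (database name and credentials handling are assumptions):
    # on the old refstack server
    mysqldump refstack > refstack.sql
    # on the new server, once the containers are up
    mysql refstack < refstack.sql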
19:39:22 <clarkb> kopecmartin: do you know what kind of testing you think would be appropriate here?
19:39:23 <ianw> speaking of, should it be in our backup rotation?
19:39:33 <clarkb> ianw: probably
19:40:07 <kopecmartin> clarkb: click on every button in the UI , try to upload new results files, register a new user maybe .. this kind of things
19:40:21 <clarkb> kopecmartin: got it, general usage
19:40:23 <clarkb> makes sense
19:40:32 <ianw> ok, we can tackle that separately.  i'll review 705258 and can try starting something
19:40:34 <clarkb> I expect all that will work if you set /etc/hosts
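i.e. before DNS is cut over, pointing a test client at the new server is just an /etc/hosts entry along these lines (the address is a placeholder):
    203.0.113.10  refstack.openstack.org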
19:40:43 <kopecmartin> yeah, we dropped py2 support, so i'd like to exercise every function of the site
19:40:53 <kopecmartin> luckily it's not that complex
19:41:15 <clarkb> Thank you ianw for helping out.
19:41:19 <clarkb> Anything else on this topic?
19:41:37 <kopecmartin> not from my side, just let me know if you need anything
19:41:43 <clarkb> will do!
19:42:10 <clarkb> The next two items are on my plate and have been neglected due to other distractions. This is why I'm wary to dive into something new :/
19:42:15 <clarkb> #topic Picking up steam on Puppet -> Ansible rewrites
19:42:39 <clarkb> I have yet to write this etherpad, but I'm hopeful I'll get to it this week. I think it will give us good perspective and ability to prioritize effort
19:43:14 <clarkb> Not really anything else to add to this. Other than thank you to everyone who has continued to push on migrating us off of puppet
19:43:20 <clarkb> #topic inmotion cloud openstack as a service
19:43:36 <clarkb> I'm hoping that tomorrow I can try turning this on and see what happens
19:44:02 <clarkb> If all goes well hopefully we'll be able to expand nodepool's resource pool
19:44:17 <clarkb> it's been a while since I did one of these though so it should be interesting to see how it goes
19:45:00 <clarkb> I know they are interested in our feedback too, which always makes it easier when things are weird or not working
19:45:13 <clarkb> #topic Open Discussion
19:45:25 <clarkb> Anything else that didn't make it on the agenda that you'd like to bring up?
19:45:51 <fungi> change in vexxhost node memory?
19:46:23 <fungi> something we probably need to keep an eye on, as folks could start merging regressions for memory use more easily
19:46:46 <ianw> i missed that, did it go up or down?
19:46:50 <fungi> or will generally start asking why not all of our nodes have 32gb ram
19:46:54 <clarkb> ianw: it went up to 32GB of memory
19:47:14 <clarkb> the risk is that changes could merge in vexxhost that cannot merge anywhere else
19:47:28 <fungi> #link https://review.opendev.org/773710 Switch to using v3-standard-8 flavors
19:48:22 <ianw> ahh
19:48:24 <clarkb> fungi: piecing together dansmith's question in #openstack-infra and some of what was discussed in #opendev, is this also thought to improve io in vexxhost?
19:48:28 <clarkb> or are those separate concerns?
19:48:50 <ianw> i feel like there was at some point something we did booting nodes with like kernel mem= parameters to keep them all the same
19:49:12 <ianw> but that's probably very silly, to have 32gb allocated but artificially limit to only 8
19:49:17 <fungi> i'm not clear on whether it will improve i/o performance
19:49:50 <fungi> ianw: yeah, that's what i was referring to in my review comment
19:49:55 <clarkb> ianw: ya we did that to avoid the fear that we could merge things in one cloud and then break jobs in all the others
19:50:07 <fungi> also we had to do it in bootloader configuration, which means applying it to all our providers
19:50:13 <clarkb> back then you couldn't reboot with new kernel parameters, you can now so it would be a bit of a bandaid to do it now
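For reference, the old approach amounted to a bootloader tweak roughly like this on the test node images (values illustrative; we no longer do this):
    # /etc/default/grub
    GRUB_CMDLINE_LINUX="... mem=8G"
    # then regenerate the config and reboot for it to take effect
    update-grub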
19:51:52 <ianw> we could run them as static nodes and do a 1:4 reverse split :)
19:52:24 <clarkb> that is an interesting thought, static nodes seem like pain though :)
19:53:07 <clarkb> maybe converting the set to a large k8s cluster and then scheduling into that with nodepool would make sense if we found infinite time somewhere :)
19:53:08 <ianw> we'll see how the bare-metal cloud thing works out :)
19:53:41 <clarkb> definitely worth a brainstorm to think about other ways of slicing them
19:53:50 <clarkb> I'll think it through on tomorrow's bike ride :)
19:53:58 <clarkb> or try to anyway, it's probably going to be cold and my brain won't work
19:54:01 <ianw> yeah the k8s cluster is probably actually a pretty sane thing to think about
19:57:09 <clarkb> sounds like that may be all. Thank you everyone
19:57:12 <clarkb> #endmeeting