19:01:06 <clarkb> #startmeeting infra
19:01:07 <openstack> Meeting started Tue Feb 2 19:01:06 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:19 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-February/000179.html Our Agenda
19:01:26 <kopecmartin> hi o/
19:02:24 <ianw> o/
19:02:36 <clarkb> hello kopecmartin your agenda item is near the tail end of the meeting, if that is a problem feel free to say something and we can cover it earlier (not sure what meeting timing is like for you)
19:02:57 <kopecmartin> clarkb: it's fine, i'll wait :)
19:03:00 <clarkb> #topic Announcements
19:03:08 <clarkb> I had no announcements
19:03:38 <clarkb> #topic Actions from last meeting
19:03:44 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-01-26-19.01.txt minutes from last meeting
19:03:54 <clarkb> fungi's request for help on config-core duties went out
19:03:57 <clarkb> fungi: thank you for that
19:04:08 <clarkb> #action clarkb begin puppet -> ansible and xenial upgrade audit
19:04:16 <clarkb> I did not manage to find time for ^ so have added it back on
19:04:31 <clarkb> ianw: do we need to keep an action item for wiki backups or are those happy now?
19:04:49 <ianw> not done yet ...
19:05:00 <clarkb> #action ianw figure out borg backups for wiki
19:05:14 <clarkb> Ok lets dive into our topics for today
19:05:16 <clarkb> #topic Priority Efforts
19:05:22 <clarkb> #topic OpenDev
19:05:36 <clarkb> The service coordination nominations period has finished.
19:05:47 <clarkb> I didn't see anyone else volunteer by the weekend so I put my name in
19:05:51 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-January/000161.html Clarkb appears to be the only nomination
19:06:07 <clarkb> I haven't seen any other since I made mine either.
19:06:18 <clarkb> I think that means I'm it again, but if I missed something please call it out :)
19:07:08 <clarkb> Since last weeks meeting I've done a bit of work on the Gerrit account inconsistencies problem
19:07:11 <clarkb> #link https://etherpad.opendev.org/p/gerrit-user-consistency-2021 High level notes.
19:07:23 <clarkb> I've started to try and keep high level notes there while keeping the PII out of the etherpad
19:07:49 <clarkb> Group problems and 81 accounts with preferred emails missing external ids have been fixed.
19:08:01 <clarkb> thank you fungi for being an extra set of eyes while I worked through ^
19:08:10 <fungi> any time!
19:08:13 <clarkb> We have 28 accounts with preferred email addresses that don't have a matching external id
19:08:23 <clarkb> We have ~642 accounts with conflicting emails in their external ids. This needs more investigating to better understand the fix for them.
19:08:31 <clarkb> Need to correct the ~642 external id issues before we can push updates to refs/meta/external-ids with Gerrit online.
19:08:38 <clarkb> Workaround is we can stop gerrit, push to external ids directly, reindex accounts (and groups?), start gerrit, then clear accounts caches (and groups caches?)
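A minimal sketch of the offline workaround clarkb outlines above (stop Gerrit, push the corrected external ids directly, reindex, restart, flush caches). The paths, compose file location, and admin account below are assumptions for illustration, not the actual deployment.

```bash
# Assumed compose file and Gerrit site path; adjust for the real review server
docker-compose -f /etc/gerrit-compose/docker-compose.yaml down

# Push the fixed external ids branch straight into the All-Users repo on disk
cd /root/external-id-fixes
git push /home/gerrit2/review_site/git/All-Users.git HEAD:refs/meta/external-ids

# Offline reindex so the rewritten external ids are picked up
java -jar /home/gerrit2/review_site/bin/gerrit.war reindex \
  -d /home/gerrit2/review_site --index accounts
java -jar /home/gerrit2/review_site/bin/gerrit.war reindex \
  -d /home/gerrit2/review_site --index groups

# Bring Gerrit back up, then flush the relevant caches
docker-compose -f /etc/gerrit-compose/docker-compose.yaml up -d
ssh -p 29418 admin@review.opendev.org gerrit flush-caches --cache accounts
ssh -p 29418 admin@review.opendev.org gerrit flush-caches --cache groups
```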
19:08:58 <clarkb> I've given the next set of steps some thought and I think roughly it is:
19:09:05 <clarkb> Classify users further into situation groups
19:09:10 <clarkb> Decide on next steps for users depending on their situation group
19:09:17 <clarkb> Fix the preferred email issue if possible as this can be done with gerrit online
19:09:25 <clarkb> Start a refs/meta/external-ids checkout in a shared location and begin committing fixes to it. If we can't push all the fixes as separate commits we can squash them together and then push.
19:09:49 <clarkb> that might be broken down further to do all the preferred email issues first as we can correct them online. Then do the external ids
19:09:59 <zbr> clarkb: does this mean manually investigating and patching >600 accounts?
19:10:27 <clarkb> Another upside to doing it ^ that way is I expect some of the external id fixes will result in preferred email issues on the account side. If we fix the existing issues first we won't confuse them with any new ones we introduce
19:10:29 <clarkb> zbr: yes
19:10:42 <fungi> probably semi-scripted at least
19:11:04 <fungi> and with distinct classifications, some of them may be quick to blow through
19:11:07 <clarkb> right the 81 we fixed already were 95% done with a script once we were satisfied with an earlier pass and classification
19:11:57 <clarkb> depending on how the classification for external ids goes I think downtime to correct a portion of them is an option as well
19:12:14 <clarkb> that will help us ensure that we're making changes that are viable once loaded into gerrit
19:12:29 <clarkb> but I don't want to make too strong of a plan for those until we start actually committing changes to that shared checkout
19:12:43 <zbr> while this does not sound like a joy, if splitting the work can speed things up, i may give it a try.
19:13:14 <clarkb> zbr: part of the problem here is that it is all PII so I think we need to be careful who we give access to. Currently it is just gerrit admins
19:13:32 <clarkb> (you need to be admin to access the refs)
19:13:42 <zbr> ah.
19:14:13 <clarkb> anyway, after fixing the first 81 accounts I spent a bit of time doing further classification but then got distracted
19:14:35 <clarkb> I need to pick that back up again. I think that a good chunk of the remaining preferred email issues can be fixed like the first 81
19:14:44 <clarkb> but I need to actually make those lists and then see if others agree
19:14:55 <clarkb> I'll be trying to pick this back up again this week
19:15:07 <clarkb> Other Gerrit items:
19:15:23 <clarkb> We upgraded Gerrit to ~3.2.7 yesterday to patch a security issue
19:15:57 <clarkb> I also tested that Gerrit's work-in-progress state is handled by zuul properly when you approve changes. It appears to ignore work-in-progress changes properly now
19:16:08 <clarkb> (we expected it to since the fix was deployed, but needed to test with actual gerrit)
19:16:43 <clarkb> ianw and I have made some improvements to the gerrit testing too.
19:17:10 <clarkb> the selenium stuff is a bit better now and I added a test to check that the x/ clone workaround continues to work
19:17:23 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/765021 Build 3.3 images
19:17:52 <clarkb> I also resurrected my change to build 3.3 images. I don't think we're in a hurry to upgrade, doing the OS upgrade first seems like better prioritization, but having working image builds ready for us would be nice
19:18:16 <clarkb> That was what I had for OpenDev and Gerrit things. Anything else to add before we move on?
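For reference, a rough sketch of the shared refs/meta/external-ids checkout workflow described in the step list above. The clone location, branch name, and commit messages are made up, and admin credentials are required to fetch and push the ref.

```bash
# Hypothetical shared checkout location on the review server
git clone ssh://admin@review.opendev.org:29418/All-Users /root/all-users
cd /root/all-users
git fetch origin refs/meta/external-ids
git checkout -b external-id-fixes FETCH_HEAD

# ...edit or remove the offending external id files, one commit per fix...
git commit -a -m "Drop conflicting external id for account 12345 (example)"

# If Gerrit won't accept the fixes as separate commits, squash before pushing
git reset --soft FETCH_HEAD
git commit -m "Fix conflicting external ids"

# Push online once the conflicts are resolved, or use the offline
# workaround noted earlier in the meeting
git push origin HEAD:refs/meta/external-ids
```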
19:18:19 <ianw> hrm that doesn't run the system-config job?
19:18:41 <clarkb> ianw: no, because currently the system-config job is 3.2 only
19:18:49 <clarkb> I think in a followup I could add a system-config + 3.3 job
19:19:02 <clarkb> or if people prefer can add it to this existing change
19:19:30 <ianw> oh i see totally new jobs :) i think it would be great to run it, either way
19:20:09 <clarkb> ok I'll probably start with a followup change then as that is slightly easier and take it from there
19:21:14 <clarkb> #topic Update Config Management
19:21:32 <clarkb> I am not aware of anything to add to this other than maybe the refstack topic which we've got later on in the agenda
19:22:46 <clarkb> Might be worth mentioning that I helped zuul fix an issue in zuul-registry that new buildx exposed. This was affecting our ability to do multiarch builds (things like nodepool builders)
19:23:05 <clarkb> that should be fixed now though. Thank you zbr for calling out the problem
19:24:20 <clarkb> #topic General topics
19:24:27 <clarkb> #topic OpenAFS cluster status
19:24:43 <clarkb> Just a quick status check on the openafs cluster. I think we still need to upgrade the db servers?
19:25:39 <clarkb> ianw: fungi: anything else to note about ^ ?
19:25:50 <fungi> that's still the status afaik
19:26:07 <ianw> yeah, i got distracted with other things. high on my todo list :)
19:26:07 <fungi> and then we can think through upgrading operating systems/replacing servers
19:26:25 <clarkb> no worries, just making sure I (and others) are up to date
19:26:31 <clarkb> #topic Bup and Borg Backups
19:26:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/773570
19:27:06 <clarkb> I think that is the latest on borg? Would be good if fungi, frickler, and corvus could review that one
19:27:19 <zbr> clarkb: i just realized why i did not get notifications from the Zuul-jobs-failures mailing list... i did not whitelist the user.
19:27:47 <clarkb> ianw: feel free to fill us in on any and all relevant info for this topic though :)
19:28:14 <ianw> yeah, i'm pretty focused on getting our working set to a reasonable level
19:28:40 <ianw> unfortunately i didn't quite fully grok the implications of --append-only mode and the particular way borg implements that
19:28:47 <clarkb> (I didn't either)
19:29:04 <ianw> all the details are in the changelog of 773570
19:29:46 <fungi> i caught some of it last night before i passed out
19:30:13 <ianw> anyway, a better way would be to do something like a rolling set of LVM snapshots on the server side
19:30:47 <fungi> i guess cow wouldn't help there because of the encryption layer
19:31:36 <fungi> or maybe it would, depends on if borg manages to not update most of the blocks when updating the backup
19:31:47 <ianw> we don't encrypt the backups, i think it would be ok
19:31:47 <clarkb> I think cow would be fine the way we're using borg
19:31:57 <fungi> oh, right then
19:33:01 <ianw> anyway we can discuss in the review, but yeah i would like to get this all sorted and running by itself very soon
19:33:23 <clarkb> ++ to getting this sorted soon. I intend on looking at it much closer this afternoon. I want to catch up on the docs and related issues
19:33:42 <clarkb> Anything else or should we move on to the next item?
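Something like the rolling server-side LVM snapshot idea ianw mentions above could look like the sketch below; the volume group and logical volume names, the snapshot size, and the retention count are all assumptions.

```bash
#!/bin/bash
# Keep a rolling window of copy-on-write snapshots of the borg backup volume.
DATE=$(date +%Y%m%d)

# Snapshot today's state of the (assumed) main/backups volume
lvcreate --snapshot --size 50G --name backups-snap-$DATE /dev/main/backups

# Prune everything but the newest 7 snapshots
lvs --noheadings -o lv_name main \
  | awk '/backups-snap-/ {print "main/" $1}' \
  | sort | head -n -7 \
  | xargs -r -n 1 lvremove -y
```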
19:34:08 <ianw> nope, move on
19:34:19 <clarkb> #topic Deploy a new refstack.openstack.org server
19:34:41 <clarkb> kopecmartin: has updated my old change to make a refstack container
19:34:46 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/705258
19:35:39 <clarkb> we think it is just about ready to be landed, then we can spin up a new refstack server on bionic/focal (probably focal), make sure it works ( kopecmartin has volunteered to help with this step ), then migrate the data from the old instance to the new one with a scheduled downtime
19:36:31 <clarkb> I think the main thing we need help with is someone to spin up the new instance, configure dns records, and ensure that LE and ansible and all that are happy
19:36:51 <ianw> happy to help with that
19:37:17 <clarkb> cool. I'm happy to keep helping too, but worried that I'm not in a great spot to drive any single effort right now (as I'm assisting a bunch)
19:37:48 <clarkb> I also expect we may need to learn us a refstack in order to figure out what the migration from old to new server will look like, but I'm 99% sure that can happen once we're happy the new deployment functions at all
19:37:51 <ianw> can it run without DNS pointed to it?
19:38:23 <ianw> i haven't looked but i was imagining it would be a db import?
19:38:25 <clarkb> ianw: you might need to edit your local /etc/hosts to make everything happy but it should
19:38:34 <clarkb> yes I believe it is a db import
19:39:22 <clarkb> kopecmartin: do you know what kind of testing you think would be appropriate here?
19:39:23 <ianw> speaking of, should it be in our backup rotation?
19:39:33 <clarkb> ianw: probably
19:40:07 <kopecmartin> clarkb: click on every button in the UI , try to upload new results files, register a new user maybe .. this kind of thing
19:40:21 <clarkb> kopecmartin: got it, general usage
19:40:23 <clarkb> makes sense
19:40:32 <ianw> ok, we can tackle that separately. i'll review 705258 and can try starting something
19:40:34 <clarkb> I expect all that will work if you set /etc/hosts
19:40:43 <kopecmartin> yeah, we dropped py2 support, so i'd like to exercise every function of the site
19:40:53 <kopecmartin> luckily it's not that complex
19:41:15 <clarkb> Thank you ianw for helping out.
19:41:19 <clarkb> Anything else on this topic?
19:41:37 <kopecmartin> not from my side, just let me know if you need anything
19:41:43 <clarkb> will do!
19:42:10 <clarkb> The next two items are on my plate and have been neglected due to other distractions. This is why I'm wary of diving into something new :/
19:42:15 <clarkb> #topic Picking up steam on Puppet -> Ansible rewrites
19:42:39 <clarkb> I have yet to write this etherpad, but I'm hopeful I'll get to it this week. I think it will give us good perspective and ability to prioritize effort
19:43:14 <clarkb> Not really anything else to add to this. Other than thank you to everyone who has continued to push on migrating us off of puppet
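A brief sketch of how the new refstack server could be exercised before DNS moves, plus the expected data migration step discussed above; the IP address and database name are placeholders, not the real deployment.

```bash
# Point a test client at the new server without touching real DNS
echo "203.0.113.10 refstack.openstack.org" | sudo tee -a /etc/hosts

# Expected migration path: a straight database dump and import
mysqldump --single-transaction refstack > refstack.sql   # on the old server
mysql refstack < refstack.sql                            # on the new server
```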
19:43:20 <clarkb> #topic inmotion cloud openstack as a service
19:43:36 <clarkb> I'm hoping that tomorrow I can try turning this on and see what happens
19:44:02 <clarkb> If all goes well hopefully we'll be able to expand nodepool's resource pool
19:44:17 <clarkb> it's been a while since I did one of these though so should be interesting to see how it goes
19:45:00 <clarkb> I know they are interested in our feedback too, which always makes it easier when things are weird or not working
19:45:13 <clarkb> #topic Open Discussion
19:45:25 <clarkb> Anything else that didn't make it on the agenda that you'd like to bring up?
19:45:51 <fungi> change in vexxhost node memory?
19:46:23 <fungi> something we probably need to keep an eye on, as folks could start merging regressions for memory use more easily
19:46:46 <ianw> i missed that, did it go up or down?
19:46:50 <fungi> or will generally start asking why not all of our nodes have 32gb ram
19:46:54 <clarkb> ianw: it went up to 32GB of memory
19:47:14 <clarkb> the risk is that changes could merge in vexxhost that cannot merge anywhere else
19:47:28 <fungi> #link https://review.opendev.org/773710 Switch to using v3-standard-8 flavors
19:48:22 <ianw> ahh
19:48:24 <clarkb> fungi: piecing together dansmith's question in #openstack-infra and some of what was discussed in #opendev, is this also thought to improve io in vexxhost?
19:48:28 <clarkb> or are those separate concerns?
19:48:50 <ianw> i feel like there was at some point something we did booting nodes with like kernel mem= parameters to keep them all the same
19:49:12 <ianw> but that's probably very silly, to have 32gb allocated but artificially limit to only 8
19:49:17 <fungi> i'm not clear on whether it will improve i/o performance
19:49:50 <fungi> ianw: yeah, that's what i was referring to in my review comment
19:49:55 <clarkb> ianw: ya we did that to avoid the fear that we could merge things in one cloud and then break jobs in all the others
19:50:07 <fungi> also we had to do it in bootloader configuration, which means applying it to all our providers
19:50:13 <clarkb> back then you couldn't reboot with new kernel parameters, you can now so it would be a bit of a bandaid to do it now
19:51:52 <ianw> we could run them as static nodes and do a 1:4 reverse split :)
19:52:24 <clarkb> that is an interesting thought, static nodes seem like pain though :)
19:53:07 <clarkb> maybe converting the set to a large k8s cluster and then scheduling into that with nodepool would make sense if we found infinite time somewhere :)
19:53:08 <ianw> we'll see how the bare-metal cloud thing works out :)
19:53:41 <clarkb> definitely worth a brainstorm to think about other ways of slicing them
19:53:50 <clarkb> I'll think it through on tomorrow's bike ride :)
19:53:58 <clarkb> or try to anyway, it's probably going to be cold and my brain won't work
19:54:01 <ianw> yeah the k8s cluster is probably actually a pretty sane thing to think about
19:57:09 <clarkb> sounds like that may be all. Thank you everyone
19:57:12 <clarkb> #endmeeting
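For context on the older approach ianw and clarkb recall in the open discussion, capping node memory with a kernel mem= argument in the bootloader configuration could look roughly like the sketch below; the 8G value and grub drop-in path are illustrative assumptions only.

```bash
# Ubuntu/Debian style grub drop-in; limits usable RAM regardless of flavor size
echo 'GRUB_CMDLINE_LINUX_DEFAULT="$GRUB_CMDLINE_LINUX_DEFAULT mem=8G"' \
  | sudo tee /etc/default/grub.d/99-mem-cap.cfg
sudo update-grub
sudo reboot
```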