19:01:01 <clarkb> #startmeeting infra
19:01:01 <opendevmeet> Meeting started Tue Dec 7 19:01:01 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:01 <opendevmeet> The meeting name has been set to 'infra'
19:01:05 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000305.html Our Agenda
19:01:27 <clarkb> #topic Announcements
19:02:17 <clarkb> This didn't make it onto the agenda because it didn't occur to me until this morning. We are fast approaching a holiday period for many of us. I'll be unable to make a meeting on the 21st and likely unable to make a meeting on January 4
19:02:40 <fungi> i'm okay with skipping those
19:03:15 <clarkb> ya I think we can go ahead and cancel the 21st and 4th. And I'll try hard to do a check in on the 28th though I expect things will get pretty quiet all around
19:03:46 <clarkb> everyone should enjoy the holidays and their associated time off. I'm going to attempt to do this myself :)
19:04:21 <ianw> ++ won't be regularly around then either
19:05:01 <clarkb> #topic Actions from last meeting
19:05:06 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-30-19.01.txt minutes from last meeting
19:05:13 <clarkb> There weren't any actions recorded
19:05:18 <clarkb> #topic Topics
19:05:26 <clarkb> #topic Improving CD Throughput
19:05:41 <clarkb> We made some progress here and also took a step or two back, but we learned some stuff
19:06:10 <clarkb> When we switched in the "setup source for system-config on bridge at the start of each buildset" change we missed a few important things that we have reverted that change over
19:06:42 <clarkb> We need to make sure that we are using nodeless jobs, that we update system-config on bridge and not on a normal zuul node, we need to honor DISABLE-ANSIBLE, and we need to be sure every buildset has this job run first
19:06:55 <clarkb> The good news is that since we learned all of that we are able to regroup and make a new plan.
19:07:02 <clarkb> link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps
19:07:04 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps
19:07:36 <ianw> on the DISABLE-ANSIBLE ... i was thinking about that
19:07:38 <clarkb> infra-root ^ that email describes a refactor of various things to make it harder to make those previous mistakes again
19:08:06 <ianw> i feel like it makes sense to check that in the base job that sets up the system-config checkout for the buildset ... that will hold all prod jobs
19:08:29 <ianw> but maybe not so much in the prod jobs themselves. that way, if a buildset starts, it completes, but a new one won't
19:09:06 <clarkb> ya I'm somewhat on the fence over that. To me if I disable ansible that means no more ansible even if the buildset is still running
19:09:20 <clarkb> But I can see an argument for allowing an in progress buildset to complete for consistency
19:09:22 <ianw> in parallel operation, it seems unclear if you dropped it in the middle of a deploy buildset what it would catch
19:09:41 <clarkb> What we can do if we really really need to stop the production line is move the authorized keys file aside
19:10:15 <clarkb> and usually we use that toggle when doing stuff like project renames, not as an emergency off switch (ssh authorized_keys seems better for that)
19:10:16 <corvus> or dequeue the job
19:10:27 <clarkb> ya I guess that too
19:10:37 <ianw> with zuul authenticated ui, that would be practical
19:10:50 <clarkb> ianw: we should update the documentation to make that behavior change clear though
19:10:58 <ianw> (currently, pretty sure the jobs would be done before i'd pulled up a login window and figured out :)
19:11:30 <ianw> sure, i can post a doc update and we can discuss there
19:11:52 <clarkb> sounds good, thanks
19:12:07 <fungi> we talked about having the base job abort if the disable-ansible file is present, did i push that change (or has someone)? i can't recall now
19:12:32 <clarkb> fungi: I don't recall seeing a change for that. You did split it out into a separate role if we wanted to consume it in multiple jobs
19:13:25 <fungi> oh, right
19:13:28 <ianw> fungi: abort as in abort, or do the pause thing it does now?
19:13:39 <fungi> i think the pause thing it does now, sorry
19:13:41 <clarkb> infra-root ^ if you can review the changes outlined in that email that would be great. I'm planning on digging in this afternoon myself. I think we're really close to being able to start updating semaphores and getting parallel runs which is exciting
19:14:36 <ianw> fungi: that would be the status quo I believe, as that is checked in the setup-src job
19:15:00 <ianw> currently every prod job runs that; after the changes, only the bootstrap-bridge (that all other jobs depend on) would run it
19:15:34 <clarkb> I think we have to avoid soft dependencies to make that work, but I was already asking for that.
19:15:42 <fungi> yeah, and the problem we ran into was that subsequent jobs didn't check it so proceeded normally when setup-src got skipped
19:16:04 <clarkb> I suspect this because in the current system if you set the disable-ansible file they all run serially failing and retrying in a loop until they have failed 3 times in a row? Or maybe that is only when you pull the ssh keys out
19:16:09 <fungi> as belt-and-braces safety we could check it in the job they all inherit from
19:16:12 <fungi> or in base
19:16:27 <clarkb> fungi: ya I think that is an artifact of the soft dependency
19:16:28 <ianw> right, yes the base job (after proposed changes) is a hard dependency that should always run (no file matchers)
19:16:35 <clarkb> if we make it a hard dependency then they shouldn't proceed
19:17:29 <corvus> if you don't want child jobs to run, you can filter them out of the list
19:17:45 <clarkb> corvus: isn't that what a hard dependency failing to succeed will already do?
19:17:51 <corvus> https://zuul-ci.org/docs/zuul/reference/jobs.html#skipping-dependent-jobs
19:18:19 <ianw> yeah, it was supposed to be a hard dependency in this case -- it has to run to checkout the system-config source for the buildset
19:18:20 <corvus> yeah, but i think you could do "child_jobs: []" to cause 0 child jobs to run regardless of hard/soft
19:18:33 <corvus> so if you want to do it with soft, that could be a way
19:18:40 <corvus> but if it needs to be hard for other reasons, then meh. :)
19:18:44 <clarkb> gotcha, ya in this case I think we need a hard dependency either way
19:18:45 <ianw> it was a bug to not run it, not the intention
19:19:31 <corvus> ack
19:19:45 <clarkb> Let's continue on as we have a few other subjects to cover
19:19:50 <clarkb> #topic User management on our systems
19:20:04 <clarkb> Yesterday we managed to update the matrix-gerritbot image to run under the gerritbot user
19:20:32 <clarkb> I think what we learned from this exercise is that even simple "read only" appearing images can be complicated and that setting users to run a container under is going to be an image by image exercise
19:20:42 <clarkb> that said I still think there is value in this and we should try to pick them off as we can
19:21:18 <clarkb> But beware that we need to be careful about permissions within the image and bind mounts as well as expectations of the running processes. Turns out openssh fails if it is running as a user without an entry in /etc/passwd
19:21:56 <clarkb> At this point I don't think there is anything else to review or cover other than to say, if you've got free time you might look into updating one of our containers :)
19:22:08 <ianw> do we have a list?
19:22:11 <clarkb> IRC bots in particular seem like good targets since they all run on a shared host
19:22:23 <clarkb> ianw: I haven't made a comprehensive one yet as I was mostly going to focus on eavesdrop to start
19:22:41 <clarkb> low impact from our perspective to restart them and debug as we go, but also relatively high ROI since they share a host
19:22:52 <clarkb> most other systems are all dedicated hosts so less returns
19:22:59 <ianw> ok, np. i know i had issues when haproxy switched *to* having a separate user with the LB setup
19:24:33 <clarkb> ianw: looking at my notes hound, lodgeit, refstack, grafana are others. But this isn't a comprehensive list I don't think
19:24:47 <clarkb> but ya I was focusing on ircbots to start since all of ^ are on dedicated hosts
19:25:37 <clarkb> Anyway as mentioned I/we have learned a bit doing this for the gerritbots and there are more irc/matrix bots to address. Also the services above. If you've got time feel free to pick them off. Our testing helps with ensuring it is happy too
19:25:49 <clarkb> #topic Zuul Gearman Going Away
19:26:13 <clarkb> Zuul's gearman tooling is very close to being deleted. This means we can no longer use the zuul gearman commands to enqueue/dequeue etc
19:26:27 <clarkb> Instead we'll need to use Zuul client to talk to the REST API for this which requires a JWT
19:26:54 <clarkb> corvus has changes up to set up a local JWT for administrative tasks on our zuul installation. We should also update our docs and our queue saving scripts to match when that is ready
19:27:08 <corvus> i think they just merged (thanks fungi )
19:27:31 <corvus> with those in place, i'll generate a jwt and set up zuul-client
19:27:32 <fungi> yeah, so we should in theory still be able to run them from a shell on the server without looking up credentials
19:27:55 <corvus> note, zuul-client != zuul. they are very similar, but only zuul-client has the ability to read a jwt from a config file.
19:28:33 <corvus> we will probably remove the admin commands from zuul eventually too since they are redundant and not as useful as zuul-client's implementation
19:28:53 <corvus> so anyway, that'll be "zuul-client enqueue" etc in the future
19:29:06 <fungi> thanks!
19:29:07 <clarkb> yup mostly calling this out so people are aware and that we don't forget to update docs and our queue saving script
19:29:30 <clarkb> #topic keycloak.opendev.org
19:30:04 <clarkb> On the subject of authentication we now have a keycloak server to experiment with
19:30:46 <clarkb> The main thing I wanted to clarify on this is currently the server is in a pre production state right? we shouldn't be relying on this for anything production like and instead use it to figure out how to make keycloak work according to our auth spec that fungi wrote
19:31:05 <clarkb> for example we can integrate keycloak with zuul's new auth stuff but we aren't doing that yet while we learn about keycloak?
19:31:29 <clarkb> or maybe if we do that it will be in a limited capacity and functionality could come and go. We'll continue to rely on local auth for admin stuff
19:31:39 <fungi> i'm willing to be flexible there
19:31:50 <corvus> it is in pre-prod. expect data to disappear at any time.
19:32:08 <corvus> i would like to go ahead and create a realm for use with zuul... i think maybe something simple where a few of us make some accounts manually or something
19:32:15 <fungi> one thing we already learned is it's apparently still easy to accidentally create multiple accounts when you use different ids if you don't link them in advance
19:32:34 <corvus> yeah. that can be resolved, but only if we allow password authentication.
19:33:10 <corvus> (like, you can fix that in a self-service way, but only if password auth is available too)
19:33:12 <clarkb> Seems like we should avoid that if we can to make sure people understand we aren't intending to be an actual auth identity
19:33:41 <fungi> which we had previously wanted to avoid so we could not be in the business of having a database of passwords as a high-profile target, nor deal with frequent password reset requests. it's something we'll need to weigh as the poc moves along
19:34:05 <corvus> i think that's worth revisiting. here's a thought experiment:
19:34:27 <clarkb> ya and I bet it is impossible to disable the password auth for external identity usages because that identity should be the same for any method used to authenticate via keycloak
19:34:35 <corvus> how different is a database of passwords from a database of mappings from a threat POV.
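For context on the realm corvus proposes above: hooking a Keycloak realm into Zuul's web authenticators is a small zuul.conf addition. A hedged sketch only; the realm name, issuer URL, and client id below are placeholders for illustration, not settings actually deployed on opendev's servers:

```ini
# Hypothetical Zuul authenticator pointing at a Keycloak realm.
# All values here are illustrative, not the production config.
[auth keycloak]
default=true
driver=OpenIDConnect
realm=zuul.opendev.org
issuer_id=https://keycloak.opendev.org/auth/realms/zuul
client_id=zuul
```

With a section like this in place, the Zuul web UI can offer a login button backed by the Keycloak realm while local HS256 admin tokens keep working for operator tasks.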
19:34:49 <corvus> sorry, was meant to be a question
19:35:04 <fungi> if users avoid reusing passwords, not terribly different. but users often reuse passwords
19:35:08 <clarkb> if there was a way to run it where password auth only let you run keycloak account tasks and not log in elsewhere I think that would be fine
19:35:29 <clarkb> But I strongly suspect that isn't how things are designed
19:35:35 <corvus> anyway -- not something we need to answer now, but i do think it's worth revisiting that with updated knowledge
19:35:51 <corvus> clarkb: i couldn't say whether that's possible or not
19:36:01 <fungi> also if we add passwords, we probably need to add integrated 2fa
19:36:25 <fungi> which could become its own support burden
19:36:30 <ianw> iiuc, a holdup for gerrit conversion was that keycloak doesn't allow adding launchpad/openid right? but there was a theory that it wouldn't be too hard to add? is that accurate?
19:36:30 <clarkb> ya all stuff to explore. Maybe figure out if 2fa is viable and if we can require it for example to mitigate the concerns with passwords
19:36:35 <corvus> clarkb: there's a lot of workflow-by-form stuff, so maybe something can be created for that. but it's certainly not a "checkbox" :)
19:36:47 <corvus> ianw: yes
19:36:55 <clarkb> ianw: yes there is a php saml tool thing that can translate to other backends and keycloak speaks the saml to that php tool in theory
19:36:57 <corvus> 2fa is available and is a "checkbox" :)
19:37:07 <fungi> ianw: yes, there's a proposal in the spec to create a sort of bridge from keycloak to openid via phpsimplesaml
19:37:32 <fungi> corvus: i expect turning on 2fa is not hard, but helping users reset it every time they lock themselves out might be
19:37:36 <clarkb> writing that bit would be a good next step for someone interested in experimenting with keycloak more
19:37:38 <ianw> cool, well this seems like a great step in having an environment we can test that too. i'd be interested in working on that in the future
19:37:57 <clarkb> ianw: ++ having the actual service up gives us something to look at that is more than theoretical
19:38:18 <clarkb> I'll also need to finish the gerrit user cleanups so that we can update the external ids database in a straightforward manner
19:38:26 <corvus> yeah... and i won't be able to drive this, so having other folks step in and pick it up would be great
19:39:00 <clarkb> Alright tldr is work to be done, feel free to experiment, but this isn't for production use yet
19:39:03 <clarkb> anything else?
19:40:03 <clarkb> #topic Adding a lists.openinfra.dev mailman site
19:40:25 <fungi> i'm still trying to fix things to make our current mailman orchestration go
19:40:28 <clarkb> fungi and I ran into some trouble with newlist when doing this that we thought we had corrected. Long story short newlist is still looking for input to confirm emailing people
19:40:56 <clarkb> seems that redirecting /dev/null into newlist corrects this, but it also exposes that our testing is different than prod
19:41:12 <fungi> i was able to reproduce it with a dnm change, and determined that redirecting stdin from /dev/null in a shell task properly solves it
19:41:12 <clarkb> fungi: the plan is to update our system-config-run jobs to all block port 25 outbound then we can tell newlist to send email right?
19:41:22 <clarkb> oh sorry I'll let fungi fill us in :)
19:41:23 <fungi> setting stdin to a null string in a cmd task does not have the same effect
19:41:33 <fungi> (which is what we had merged previously)
19:42:12 <fungi> and yeah, i have changes up to collect exim logs so we can see what's trying to send e-mail through the mta in tests, as well as blocking 25/tcp egress to prevent our deploy jobs from accidentally sending e-mail
19:42:37 <fungi> and then i'm dropping the test-time addition of the -q option for the newlist command
19:42:55 <clarkb> fungi: are any of those ready for review yet?
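The distinction fungi draws above (a shell task redirecting stdin from /dev/null works, while a null-string stdin on a command task does not) can be sketched in Ansible roughly like this; the task names and list name are illustrative, not the actual system-config playbook:

```yaml
# Sketch of the two approaches discussed; list name is a placeholder.

# What was merged previously: an empty stdin on a command task.
# Per the discussion, newlist still waits for confirmation input here.
- name: Create mailing list (prompt not suppressed)
  command: newlist mylist
  args:
    stdin: ""

# What was found to work: a shell task redirecting stdin from
# /dev/null, so newlist reads EOF instead of blocking on a prompt.
- name: Create mailing list (prompt suppressed)
  shell: newlist mylist < /dev/null
```

The difference is that the shell form hands newlist a real file descriptor opened on /dev/null, whereas the `stdin` argument feeds data through a pipe that newlist apparently does not treat the same way.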
19:43:19 <fungi> probably, though i have a pending update to one of them once i get test results back from the latest revision to the iptables change
19:43:34 <fungi> and haven't rebased the dropping of -q onto that stack yet
19:43:40 <clarkb> gotcha, feel free to ping me when you want reviews and I'll happily take a look
19:43:57 <fungi> should hopefully have it up right after the meeting, and then once those merge we can add new mailing lists again more easily
19:44:06 <clarkb> thanks!
19:44:49 <fungi> topic:mailman-lists
19:44:55 <fungi> in case anyone's looking for them
19:45:12 <clarkb> #topic Gerrit User Summit
19:45:31 <clarkb> Gerrit User Summit happened last week. I found it useful to catch up on some of the gerrit upstream activities
19:45:54 <clarkb> I took notes and they are in my homedir on review02. But I'll try to summarize some of the interesting bits really quickly here
19:46:03 <clarkb> Gerrit 3.2 is EOL. Thank you ianw for helping get us to 3.3
19:46:33 <clarkb> The new Checks UI work relies on a plugin in the Gerrit server that queries CI systems for results/status and then renders them in a consistent way regardless of the CI system
19:46:57 <clarkb> this means that we could probably replace the Zuul summary plugin with a Checks UI plugin using this new system. But I think that is 3.4 and beyond. Not a 3.3 thing
19:47:37 <corvus> and that's a java plugin? or is it a pg plugin?
19:47:54 <clarkb> I think that is a java plugin because you have to interact with gerrit internal state
19:48:16 <clarkb> the plugin acts as a data retrieval and conformance system between your CI system and the checks UI
19:48:29 <clarkb> and I think that requires you make writes somewhere which I suspect the js stuff can't do
19:48:39 <clarkb> however, that wasn't entirely clear to me so I could be wrong
19:49:13 <clarkb> Gerrit is working towards deleting prolog for complex acl rule applications. Instead they are replacing it with "Composable Submit Requirements" which use a simple query language based on Gerrit's existing query language
19:49:31 <clarkb> you essentially write rules that say "if this gerrit query returns a result then this rule applies to this change"
19:49:42 <clarkb> and the rules can say this is required for submitting etc
19:50:22 <clarkb> I don't expect we'll migrate to this quickly for anything though random users may use it for various additional checks. However, we should be careful to ensure we don't accidentally reduce our requirements for submitting via zuul
19:51:08 <clarkb> There is a ChronicleMap libmodule plugin for persistent caches. This apparently improves performance quite a bit since you don't lose cache data when restarting gerrit. Some people suggested it be incorporated directly into Gerrit rather than a plugin
19:51:34 <clarkb> Our performance is pretty good these days and we don't restart Gerrit often but may be worth looking into as some users (those talking to nova stuff) have indicated slowness after restarts
19:52:02 <clarkb> And finally the Gerrit meetings are open to the entire community. You can also put stuff on their agenda if you have something specific you want to discuss
19:52:12 <clarkb> This is something I wasn't clear about since I think they title them the EC meeting or similar
19:52:25 <clarkb> I'll probably try to start attending these once I figure out when they happen
19:53:30 <clarkb> So ya feel free to ask me any questions if you have them though I'm still a gerrit community noob
19:53:46 <clarkb> Overall I think the event went well and I learned some stuff about what to look for for the future
19:54:38 <clarkb> #topic Nodepool Image cleanups
19:54:42 <ianw> thanks for attending/the summary!
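The "Composable Submit Requirements" mechanism summarized above is configured per project in project.config, with the rule expressed in Gerrit's change-query syntax. A minimal hedged sketch; the label name and exact expression are illustrative defaults, not an opendev ACL:

```ini
# Hypothetical submit requirement in a project.config: the change is
# submittable only when the query in submittableIf matches it, i.e. it
# has a maximum Code-Review vote and no veto vote.
[submit-requirement "Code-Review"]
    description = A maximum Code-Review vote without vetoes is required
    submittableIf = label:Code-Review=MAX AND -label:Code-Review=MIN
```

This mirrors the "if this gerrit query returns a result then this rule applies" model described in the meeting: the rule is just a change query evaluated against each change.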
19:55:33 <clarkb> We've got a number of images that are either under maintained or going EOL
19:55:36 <clarkb> #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html
19:55:47 <clarkb> I sent email outlining a rough plan for cleanups to service-announce
19:56:07 <clarkb> This should reduce a lot of pressure on our AFS volumes too
19:57:16 <clarkb> If we do get responses for opensuse and gentoo help it would be good to maybe also try and run periodic jobs on those platforms somewhere
19:57:21 <clarkb> to serve as a signal when they break?
19:57:34 <clarkb> An idea I had that might make maintenance a bit more responsive in the future
19:57:38 <clarkb> #topic Open Discussion
19:57:41 <ianw> we do have a periodic run of zuul-jobs that tries everything
19:57:51 <clarkb> ianw: ah ok so those volunteers could watch that
19:57:53 <ianw> but if a job fails in the woods with nobody listening ... :)
19:58:03 <ianw> speaking of
19:58:04 <clarkb> We are almost out of time, anything else?
19:58:18 <ianw> #link https://review.opendev.org/c/zuul/zuul-jobs/+/818702
19:58:30 <jentoio> I'd like to help/volunteer
19:58:32 <ianw> that will enable f35; which seems to have gone smoothly
19:59:03 <clarkb> jentoio: cool, can you respond to that email so that we can help keep track of it and not miss that there is interest?
19:59:23 <jentoio> sure, I was hoping we can meet for coffee to discuss
19:59:37 <jentoio> since we live near each other - unless you moved ;)
19:59:46 <ianw> i can also lookup the details of the zuul-jobs runs and post
20:00:02 <clarkb> I'm still in the same part of the world though I don't get out much these days
20:00:19 <jentoio> but I'll respond to email as well.
20:00:33 <clarkb> ya I think that helps since others are involved as well and around the world
20:00:43 <clarkb> but I'm happy to discuss further as well. And we are at time
20:00:46 <clarkb> thank you everyone!
20:00:49 <clarkb> #endmeeting
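Returning to the Zuul Gearman topic from earlier in the meeting: the admin tokens that zuul-client reads from its config file are ordinary HS256 JWTs. A minimal sketch of how such a token is minted and verified, using only the standard library; the secret and tenant name below are invented for illustration, and the claim layout follows Zuul's documented admin-token shape rather than any deployed config:

```python
import base64
import hashlib
import hmac
import json
import time


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_jwt(claims: dict, secret: str) -> str:
    """Build a signed HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(sig)}"


def verify_jwt(token: str, secret: str) -> dict:
    """Check the HMAC signature and return the decoded claims."""
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = b64url(
        hmac.new(secret.encode(), signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    padded = payload + "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


# Invented example claims; the nested "zuul" key grants tenant admin.
claims = {"iss": "zuul_operator", "aud": "zuul.example.org",
          "exp": int(time.time()) + 600, "sub": "admin",
          "zuul": {"admin": ["example-tenant"]}}
token = make_jwt(claims, "not-a-real-secret")
print(verify_jwt(token, "not-a-real-secret")["sub"])  # prints: admin
```

In practice the operator would never mint tokens by hand like this; the point is only that the credential zuul-client carries is a short-lived signed claim set, not a password, which is why it can live in a config file on the server.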