19:01:09 <clarkb> #startmeeting infra
19:01:09 <opendevmeet> Meeting started Tue Nov 9 19:01:09 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:09 <opendevmeet> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2021-November/000295.html Our Agenda
19:01:45 <clarkb> #topic Announcements
19:02:05 <clarkb> I was hoping I could link to gerrit user summit stuff but I can't find any details on that yet. They must be running into planning issues. I can sympathize with that
19:02:52 <ianw> o/
19:03:01 <clarkb> #topic Actions from last meeting
19:03:08 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-02-19.01.txt minutes from last meeting
19:03:19 <clarkb> We didn't record any new actions and the prior actions are all done or in progress. \o/
19:03:25 <clarkb> #topic Specs
19:03:56 <clarkb> Just a quick note here that I approved fungi's mailman3 spec after a quick respin to address input from frickler. The updates were minor and didn't seem like anything that needed to be completely rereviewed
19:05:13 <clarkb> #topic Topics
19:05:20 <clarkb> #topic Improving OpenDev's CD throughput
19:05:31 <clarkb> this is still on my todo list. All the zuul and container fun has been distracting me :/
19:05:46 <clarkb> ianw: have you had a chance to look at why the child changes are failing CI? (I'm mostly just curious)
19:06:06 <ianw> no sorry, i've managed to be distracted on other things
19:06:14 <clarkb> I think we've all been in that boat recently
19:06:22 <clarkb> #topic Gerrit Account Cleanups
19:06:52 <clarkb> I was mostly going to skip this over except fungi found a story where someone had trouble with @ in their username. fungi has asked them to clarify if they were trying to use an email address as a username or if their username actually has an @ in it
19:07:07 <clarkb> Noting that here for the possibility there is another sort of cleanup we'll need to do in normalizing usernames
19:07:17 <clarkb> Not really actionable at this point but an interesting possibility
19:08:00 <clarkb> #topic Zuul Multi Scheduler Setup
19:08:41 <corvus> there are 2 schedulers
19:08:45 <clarkb> Zuul has made great progress on supporting multiple schedulers (removing the last remaining spof for a zuul install). Our OpenDev zuul is running two schedulers. One on zuul01.o.o and the other on zuul02.o.o. Zuul02 is the "primary"
19:08:55 <clarkb> What makes zuul02 the primary for us is that all web traffic hits it first
19:09:08 <corvus> and gearman
19:09:13 <corvus> and actually there's no web on zuul01
19:09:35 <clarkb> If things go really sideways I think we can stop zuul01's scheduler and restart the scheduler on zuul02 only
19:09:43 <corvus> (cause we don't have a load balancer for it)
19:09:56 <corvus> ++
19:10:00 <fungi> and, if necessary, clear out zk
19:10:10 <corvus> and if things go really badly, clearing the zk state would be a good idea
19:10:16 <clarkb> It is worth noting that we have run into problems but we've been trying to work through them as they show up. corvus has been a great help with that.
19:10:41 <clarkb> So far we've had issues with retried jobs not being handled properly, Nodepool requests getting stuck in a perpetually waiting state, and config errors not serializing properly
19:10:41 <fungi> i'm thrilled that we haven't needed to downgrade again
19:10:49 <corvus> i think we're at the point where the problems that have been cropping up have been minimal enough we can roll forward
19:11:33 <clarkb> There are a few more new issues showing up today that deserve followup after this meeting. Specifically johnsom's designate change error and elodilles' lack of a zuul.tag var on release jobs. There is also a job runtime estimate problem that corvus has a fix up for
19:11:43 <clarkb> corvus: ^ fyi details for both of those other things are in #opendev
19:13:13 <clarkb> Please do report any weird behavior. So far a lot of weird behavior I have noticed has been tracked back to the multi scheduler setup and reporting those things has been very helpful because now they are fixed :)
19:13:42 <clarkb> And ya the recovery at this point can probably be achieved by simply restarting onto one scheduler using zuul02 since the code seems quite stable in a single scheduler setup
19:14:36 <clarkb> Anything else to add to the zuul topic? or questions about the setup?
19:15:49 <clarkb> #topic User management on our systems
19:16:08 <clarkb> Last week I started pulling on a thread and noticed there were some improvements we could make to how we manage our users and uids
19:16:14 <clarkb> thank you to those who helped me make sense of it all
19:16:47 <clarkb> A number of changes have come out of that. Some potentially impactful if we aren't careful. I'd like to ask infra-root to take a look over these changes so that we can start landing them when we are confident they are safe and are able to watch them go in
19:16:55 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816869/ Be explicit about uid/gid ranges
19:17:24 <clarkb> This first change adjusts our configured ranges in adduser.conf and login.defs which different tools refer to when creating users
19:17:38 <clarkb> The rough layout is: 0-999 system, 1000-1999 unallocated, 2000-2999 for infra-root users, 3000-9999 host level users, 10k - 64k container users that need uids on the host as well for bind mounts.
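[Editor's sketch, not part of the meeting: the rough layout in the 19:17:38 message can be written down as a small lookup table. This is only an illustration under stated assumptions. The names UID_RANGES and classify_uid are hypothetical, and "10k - 64k" is read here as 10000-64000; the exact upper bound is not given in the log.]

```python
# Hypothetical illustration of the uid layout described in the meeting.
# The upper bound 64000 is an assumption; the log only says "64k".
UID_RANGES = [
    (0, 999, "system"),
    (1000, 1999, "unallocated"),
    (2000, 2999, "infra-root users"),
    (3000, 9999, "host level users"),
    (10000, 64000, "container users"),
]


def classify_uid(uid: int) -> str:
    """Return the label of the range a uid falls into."""
    for low, high, label in UID_RANGES:
        if low <= uid <= high:
            return label
    return "outside the described layout"


if __name__ == "__main__":
    # 999 is the mariadb container uid and 10001 the example container uid
    # mentioned later in the meeting.
    for uid in (999, 2500, 10001):
        print(uid, classify_uid(uid))
```

[Under this reading, a container uid such as 10001 lands in the container range, while 999 lands in the system range.]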
19:17:40 <fungi> #link https://review.opendev.org/816869 Lower UID/GID range max to make way for containers
19:17:43 <fungi> also that one
19:17:48 <clarkb> That's the same one :)
19:18:04 <fungi> er, yep sorry. i guess it had a different title
19:18:20 <clarkb> the idea there is we've put a number of container services on high uids like 10001 but then when we create say the letsencrypt group it gets created as gid 10002
19:19:05 <clarkb> fungi and I are thinking it would be better to not assign specific values to stuff that actually belongs to the system like our users and letsencrypt group and so on so we cap those at 9999 then we can be explicit about container uids/gids and ensure those are non overlapping in the >10k space
19:19:35 <fungi> well, our users are already statically assigned uids and gids
19:19:40 <clarkb> yup
19:19:44 <fungi> "our" personal accounts i mean
19:19:59 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816771 Clean up unused bootstrapping users
19:20:11 <clarkb> Is another user related change. This time to cleanup users that I think we don't need.
19:20:36 <clarkb> This is mostly a belts and suspenders spring cleaning move and one that scares me slightly since we might've accidentally used one of those users for something functional but as far as I can tell this isn't the case
19:20:49 <fungi> well, it's also a security concern
19:20:55 <clarkb> right
19:21:05 <clarkb> I think it is important, but one we should take care with and review carefully
19:21:08 <fungi> as on some systems those accounts come with provider-supplied authorized keys
19:21:27 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/816769/ Give gerritbot and matrix-gerritbot a shared user
19:21:51 <clarkb> This is a followon to 816869, and I'd like to have us work our way through giving most of our other containers similar treatment
19:22:04 <clarkb> lodgeit, refstack, hound, some other irc bots, etc
19:22:21 <clarkb> But I figure start slow, make sure we've got things laid out the way we want first before we do a bunch of work that needs updating again later
19:22:31 <clarkb> The gerritbots seemed like a good example case
19:23:03 <clarkb> Then when that has been worked through we should also look into updating the uid for our mariadb containers to something other than 999
19:23:20 <clarkb> The mariadb uid was what sparked this whole thing off in the first place and will likely be the last thing to be addressed :)
19:23:48 <clarkb> I think for mariadb we can probably run it as the user for the services it supports in most cases. For review run it as the gerrit user, for etherpad run it as the etherpad user and so on
19:24:02 <clarkb> Since we don't run shared mariadbs between services we can do that safely
19:24:53 <clarkb> If y'all can review with a very critical eye I would appreciate it. I'm happy to do additional testing (we already did some manual testing before settling on 816869)
19:25:29 <fungi> yeah, it *seems* to work as hoped
19:25:44 <fungi> part of the problem is that adduser and useradd rely on entirely different configs
19:26:05 <fungi> and package maintscripts and ansible roles use who knows which one
19:26:26 <clarkb> I think I found some evidence that package scripts do use both. Or rather one package uses one and another package uses the other
19:26:36 <fungi> right
19:26:43 <clarkb> the evidence for this is that one will do gidmax-1 and the other will do gidmin+1
19:26:47 <fungi> so at least having the two of them in sync should help
19:26:49 <clarkb> and we see evidence of both on our systems
19:28:02 <clarkb> Any other questions or concerns to bring up on this topic?
19:28:27 <ianw> thanks for digging into it and explaining, i'll take a look at the changes too
19:30:11 <clarkb> #topic Open Discussion
19:30:40 <clarkb> I wasn't sure which other topics we would want to discuss so decided to trim the agenda down and let this portion of the meeting cover anything else
19:31:01 <clarkb> Gerrit3.4 upgrade stuff and fedora 35 work seem to be progressing, but not sure there is anything to share yet
19:31:13 <clarkb> I think there is a dib change I should review for containerfile stuff that I haven't been able to get to
19:31:35 <clarkb> #link https://review.opendev.org/c/openstack/diskimage-builder/+/817139 dib handle containerfile errors better
19:31:35 <ianw> yeah, that was working but last night centos-9 mirrors were broken
19:31:41 <ianw> speaking of
19:31:44 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/817136
19:32:13 <fungi> i could use some help on fixing bitrot in the storyboard-webclient builds, if anyone has tips for how to update a yarn.lock
19:32:14 <ianw> adds centos 9-stream mirrors, but i don't think we have space. my plan is to continue to remove debian-stretch
19:32:15 <fungi> #link https://review.opendev.org/814053 [opendev/storyboard-webclient] Bindep cleanup
19:32:57 <clarkb> ianw: just left a comment on the centos-9-stream change
19:33:29 <clarkb> fungi: to update a yarn.lock you remove the lock and then reinstall iirc. That produces a new lock file and if testing succeeds with that you can merge it
19:33:53 <fungi> reinstall what?
19:34:00 <clarkb> reinstall the javascript stuff using yarn
19:34:08 <ianw> i've always just done "yarn upgrade" i think
19:34:17 <clarkb> https://classic.yarnpkg.com/en/docs/cli/install/ I think
19:34:19 <fungi> oh, i can give that a try, thanks
19:34:28 <fungi> oh, also i've had this up for a while, to hopefully make our system-config jobs a little more robust...
19:34:28 <clarkb> ianw: ah that might be the better method then. I guess upgrade ignores the lock and writes a new one?
19:34:30 <fungi> #link https://review.opendev.org/813880 [opendev/system-config] Retry acme.sh cloning
19:34:47 <ianw> but either way, the *real* problem is going to be every javascript library that has maintained the same name but rewritten itself completely (see prior discussions in #opendev yesterday :)
19:35:52 <fungi> yeah, in this case the reason i need to do it is because one of the js packages needs updating to support python3
19:36:08 <fungi> but i assume that will involve updates to a lot of other dependencies
19:36:23 <clarkb> ya you might have to fiddle with the requirements file equivalent to find something that produces a working set
19:36:32 <clarkb> when I've done this for zuul before it is a fun exercise
19:36:42 <fungi> and this for making the rejects in our iptables rules slightly more expressive...
19:36:44 <fungi> #link https://review.opendev.org/810013 [opendev/system-config] Switch IPv4 rejects from host-prohibit to admin
19:38:58 <clarkb> fungi: I've +2'd that one so you can approve it when you are able to watch it. Like the user changes, it has potential for widespread pain if somehow it goes wrong (though again I don't expect any issues)
19:39:25 <fungi> yep
19:39:38 <fungi> i did some spot testing and confirmed it works as expected, at least
19:40:03 <ianw> oh, i approved it in between, but yeah, i'll be around
19:40:08 <clarkb> cool
19:40:17 <clarkb> I'll give this meeting a few more minutes to bring anything else up
19:40:30 <clarkb> But then I need to review some zuul fixes and eat lunch :)
19:40:47 <fungi> i'll be around too, of course
19:40:59 <ianw> #link https://review.opendev.org/c/opendev/system-config/+/816766
19:41:09 <ianw> is a minor one to expose the db in gerrit testing
19:41:24 <ianw> i don't think we were noticing gerrit wasn't actually talking to the db correctly
19:41:26 <fungi> oh, right, i meant to look at that one, thanks
19:41:33 <clarkb> ++ that's a good update to our testing for gerrit
19:42:06 <clarkb> ianw: you might consider toggling the state too since we had the issue with the non unique keys thing in the past that was only hit on the change of an already created row
19:42:47 <clarkb> I approved it as is as this is clearly better than what we had before
19:43:06 <ianw> that's a good idea, can do that, just the same request with a DELETE
19:43:43 <ianw> I guess a loop of PUT DELETE PUT might work
19:44:24 <ianw> i wonder if a template to method: works
19:45:33 <fungi> seems like it should? it's just a string, right?
19:48:06 <clarkb> Sounds like that is it. Thanks everyone!
19:48:09 <clarkb> #endmeeting