19:01:06 <clarkb> #startmeeting infra
19:01:07 <openstack> Meeting started Tue Jun 16 19:01:06 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:11 <openstack> The meeting name has been set to 'infra'
19:01:18 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-June/000039.html Our Agenda
19:01:26 <clarkb> #topic Announcements
19:01:31 <clarkb> I didn't have any announcements
19:02:00 <clarkb> #topic Actions from last meeting
19:02:07 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-09-19.01.txt minutes from last meeting
19:02:31 <clarkb> no actions recorded, but it is feeling like things are returning to normal after the PTG. Oddly it seemed like we still had the quiet week last week even though people didn't need to travel
19:02:45 <clarkb> (maybe that was just my perception)
19:02:46 <clarkb> #topic Specs approval
19:02:55 <clarkb> #link https://review.opendev.org/#/c/731838/ Authentication broker service
19:03:06 <clarkb> This isn't ready for approval yet, but wanted to keep pointing eyeballs towards it
19:03:09 <mordred> o/
19:03:13 <clarkb> fungi: ^ anything else to say about this spec?
19:03:16 <fungi> i haven't updated it yet, had a "quiet week" ;)
19:03:29 <fungi> more comments appreciated though
19:03:39 <corvus> o/
19:03:40 * mordred looks forward to fungi's updates
19:04:56 <clarkb> #topic Priority Efforts
19:05:04 <clarkb> #topic Update Config Management
19:05:45 <clarkb> I'm not aware of a ton of changes here recently. Anyone have topics to bring up?
19:06:07 <mordred> corvus improved the disable-ansible script
19:06:08 <clarkb> (I'm moving somewhat quickly because our general topics list is pretty long this week and want to be sure we get there, but feel free to bring stuff up under priority topics if relevant)
19:06:22 <mordred> nope. /me shuts up
19:06:27 <fungi> i've got a half-baked change to move the rest of our repo mirroring configuration from puppet to ansible
19:06:34 <clarkb> mordred: that's a good call out. We should get into the habit of providing detailed reasons for disabling ansible there
19:06:35 <mordred> \o/
19:07:07 <clarkb> fungi: is that ready for review yet?
19:07:21 <fungi> clarkb: it's ready for suggestions, but no it's not ready to merge yet
19:07:41 <fungi> it's a lot of me learning some ansible and jinja concepts for the first time
19:07:44 <clarkb> #link https://review.opendev.org/#/c/735406/ Ansiblify reprepro configs. Is WIP, comments welcome
19:08:40 <fungi> it will be a pretty massive diff
19:08:47 <fungi> (once complete)
19:09:18 <clarkb> it should be a pretty safe transition too as we can avoid releasing volumes until we are happy with the end results?
19:09:23 <clarkb> thanks for working on that
19:09:35 <fungi> yep, once i rework it following your earlier suggestion
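
(A note on the disable-ansible script mentioned at 19:06:07: the pattern being discussed is a helper that refuses to run unless the operator supplies a reason, and records that reason in the flag file the automation checks before running. The sketch below is illustrative only; the flag path and wording are assumptions, not the script actually deployed on bridge.)

    #!/bin/bash
    # Hedged sketch of a "disable ansible runs" helper that requires a reason.
    # The FLAG path is an assumption for this example, not a production location.
    FLAG=/home/zuul/DISABLE-ANSIBLE
    if [ $# -eq 0 ]; then
        echo "usage: $0 <reason for disabling ansible runs>" >&2
        exit 1
    fi
    # Record who disabled automation, when, and why; the periodic run would
    # check for this file and exit early while it exists.
    printf '%s %s: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$(whoami)" "$*" >> "$FLAG"
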
19:10:30 <clarkb> #topic OpenDev
19:10:51 <clarkb> I don't have much to add here. I've completely failed at sending reminder emails about the advisory board but mnaser has responded. Thank you!
19:10:57 <clarkb> I'll really try to get to that this week
19:10:59 <mnaser> \o/
19:12:12 * mordred hands mnaser an orange
19:12:41 <clarkb> Anything else to add re OpenDev?
19:13:34 <mordred> oh -
19:13:59 <mordred> this is really minor - but I snagged the opendev freenode nick yesterday and put it in our secrets file (thanks for the suggestion mnaser)
19:14:31 <mordred> in case we want to use it for opendev-branded bots
19:14:57 <clarkb> thanks
19:15:35 <clarkb> #topic General Topics
19:15:53 <clarkb> #topic pip-and-virtualenv
19:16:09 <clarkb> This change has landed and we're starting to see more and more fallout from it. Nothing unexpected yet I don't think
19:16:41 <clarkb> possibly even a new case of it in #zuul right now too :)
19:16:58 <clarkb> Keep an eye out for problems and thank you to AJaeger and mordred for jumping on fixes
19:17:30 <clarkb> ianw: where are we with considering the spec completed? and cleanup in the nodepool configs? can we start on that or should we wait a bit longer?
19:17:49 <ianw> i have pushed changes to do cleanup of the old jobs and images
19:18:17 <ianw> i guess it's not too high a priority right now, the changes are there and i'll keep on them as appropriate
19:19:31 <clarkb> thanks. Should I push up a change to mark the spec completed? or wait a bit more on cleanup for that?
19:20:26 <ianw> i guess we can mark it complete, if we consider dealing with the fallout complete :)
19:20:47 <ianw> i thought it was too quiet yesterday, so i'll try to catch up on anything i've missed
19:21:18 <clarkb> ianw: I think a lot of people may have had friday and monday off or something
19:21:23 <clarkb> because ya definitely seems to be picking up now
19:21:56 <corvus> 40% of vacation days are fridays or mondays
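
(On the pip-and-virtualenv fallout above: most of the breakage comes from jobs assuming virtualenv or pip are preinstalled on the test images now that the element is gone. The snippet below is a generic defensive sketch, not any specific fix that landed; the paths are placeholders.)

    # Build an isolated python environment without assuming virtualenv exists.
    VENV=/tmp/example-venv   # placeholder path
    if ! python3 -m venv "$VENV" 2>/dev/null; then
        # Fall back to the virtualenv package if the stdlib venv module is
        # unavailable or incomplete on this image.
        python3 -m pip install --user virtualenv
        python3 -m virtualenv "$VENV"
    fi
    "$VENV/bin/pip" install --upgrade pip setuptools
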
19:23:16 <clarkb> #topic Zookeeper TLS
19:23:30 <clarkb> This is the thing that led to the ansible limbo
19:23:37 <clarkb> corvus: want to walk us through this?
19:23:43 <corvus> this topic was *supposed* to be about scheduling some downtime for friday to switch to zk tls
19:23:56 <corvus> but as it turns out, this morning we switched to tls and switched back already
19:24:13 <corvus> the short version is that yesterday our self-signed gearman certs expired
19:24:32 <corvus> well, technically just the ca cert
19:25:06 <corvus> which means that no zuul component could connect to gearman. so we lost the use of the zuul cli, and if any component were restarted for any reason, it would be unable to connect, so the system would decay
19:25:25 <corvus> correcting that required a full restart, as did the zk tls work, so we decided to combine them
19:25:53 <corvus> unfortunately, shortly after starting the nodepool launchers, we ran into a bug
19:25:58 <corvus> #link kazoo bug https://github.com/python-zk/kazoo/issues/587
19:26:19 <corvus> so we manually reverted the tls change (leaving the new gear certs in place)
19:26:25 <corvus> and everything is running again.
19:26:46 <corvus> next steps: make sure this is merged:
19:26:51 <corvus> #link revert zk tls https://review.opendev.org/735990
19:27:02 <corvus> then when it is, we can clear the disable-ansible file and resume speed
19:27:20 <corvus> after that, i'm going to look into running zk in a mode where it can accept tls and plain connections
19:27:36 <clarkb> and once disable ansible is cleared we'll get updates to a few docker images, apply dns zone backups, and change rsync flags for centos/fedora mirrors
19:27:36 <corvus> if that's possible, i'd like to restart the zk cluster with that, and then try to repro the bug against production
19:27:58 <clarkb> calling that out so others are aware there will be a few changes landing once ansible is reenabled
19:28:05 <clarkb> corvus: ++ I like that plan
19:28:08 <mordred> ++
19:28:11 <corvus> based on info from tobiash, we suspect it may have to do with response size, so it may help to get a repro case out of production data
19:28:25 <clarkb> corvus: we should be able to easily switch over a single builder or launcher without major impact to production to help sort out what is going on
19:28:48 <corvus> clarkb: agreed
19:29:41 <corvus> eot
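
(Related to the expired self-signed gearman CA that forced this restart: certificate expiry is easy to check ahead of time so it does not come as a surprise again. The file path below is purely illustrative, not where the certs actually live.)

    # Show when a certificate (here a CA cert) expires.
    openssl x509 -noout -enddate -in /path/to/ca.pem
    # Exit non-zero if it expires within the next 30 days (2592000 seconds);
    # handy to wire into cron or an existing monitoring check.
    openssl x509 -noout -checkend 2592000 -in /path/to/ca.pem \
        || echo "WARNING: certificate expires within 30 days"
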
19:29:48 <clarkb> #topic DNS Cleanup
19:30:06 <clarkb> The change to implement the recording of zone data has landed and should apply to bridge when ansible starts rerunning
19:30:08 <clarkb> ianw: ^ fyi
19:30:22 <clarkb> probably want to make sure that is working properly once it goes in?
19:30:28 <ianw> yeah, i added the credentials so should be ok
19:31:06 <clarkb> I annotated the etherpad with notes on things I think we can clean up
19:31:23 <clarkb> what are we thinking about for cleanup? wait for backups to run with existing records first so we've got that info recorded then do cleanup?
19:31:30 <clarkb> (that is sort of what I thought would be a good process)
19:31:43 <mordred> yeah
19:32:03 <clarkb> k I can help with the button clicking to clean up records once we're at that point
19:32:23 <clarkb> ianw: anything else worth mentioning on this topic?
19:32:40 <ianw> yeah, we can iterate on it a bit then too, as the list gets shorter it's easier to see what can go :)
19:32:50 <ianw> nope, thanks
19:32:53 <clarkb> sounds good, thanks for putting this together
19:32:56 <clarkb> #topic Etherpad Upgrade to 1.8.4 or 1.8.5
19:33:13 <clarkb> Fungi did some work to get our etherpad server upgraded to 1.8.4 (from 1.8.0)
19:33:21 <clarkb> we then noticed that there was a UI rendering bug when testing that
19:33:29 <clarkb> #link https://review.opendev.org/#/c/729029/ Upgrade Etherpad
19:33:52 <fungi> not dissimilar from some of the weirdness we noticed with author colors when going through meetpad
19:33:54 <clarkb> this change now addresses that with a simple css hack that I came up with. Upstream they've fixed this differently with a fairly large css refactor and we should see that in the next release (1.8.5?)
19:34:17 <fungi> i wonder if that will also resolve it for meetpad uses
19:34:25 <clarkb> the question I've got is do we think we should upgrade with the workaround as 1.8.4 includes a bug fix around db writes? or wait for 1.8.5 to avoid potential UI weirdness
19:35:01 <fungi> i'm in no huge hurry. i'm excited for the potential fix for perpetually "loading..." pads, but other than that there's no urgency
19:35:24 <mordred> there's no urgency, but the workaround isn't super onerous either
19:35:41 <mordred> so I'm ok rolling it out in that form - or with waiting
19:35:57 <mordred> we can remove the sed from our dockerfile when we bump the version
19:36:24 <mordred> (it's not like one of those "in a few months we're not going to be paying attention and our local hack is going to bork us")
19:36:40 <corvus> clarkb: what's the workaround?
19:36:48 <corvus> oh i see it now
19:36:51 <corvus> sorry, buried in the run cmd
19:36:52 <clarkb> corvus: changing the padding between the spans that contain lines
19:37:13 <clarkb> the way the padding was set up before caused the lines to overlap so their colors successively covered each other
19:37:42 <corvus> i agree it seems safe to move forward with 029
19:37:50 <fungi> you tracked that back to a particular change between 1.8.1 and 1.8.3 yeah?
19:38:00 <clarkb> the bug is also purely cosmetic so shouldn't affect content directly, just how we see it
19:38:13 <clarkb> fungi: ya it's in 1.8.3 (there was no 1.8.1 or 1.8.2 iirc)
19:39:09 <fungi> yeah, there was at least no 1.8.2, for sure
19:39:22 <clarkb> I think what I'm taking away from this is if everything else calms down (uwsgi, pip/virtualenv, dns, zk tls, etc) then we can go ahead with this and watch it
19:39:39 <fungi> sounds fine to me
19:39:45 <clarkb> thanks for the feedback
19:40:03 <clarkb> and if 1.8.5 happens before then we can drop the workaround
19:40:36 <clarkb> #topic Getting more stuff off of python2
19:41:15 <clarkb> One of the things that came out of the PTG was it would be useful for those a bit more familiar with our systems to do an audit of where we stand with python2 usage. This way others can dive in and port or switch runtimes
19:41:17 <clarkb> #link https://etherpad.opendev.org/p/opendev-tools-still-running-python2
19:42:06 <clarkb> I've started this audit in that etherpad. It is nowhere near complete. One thing that I have discovered is that a lot of our software is python3 capable but running under python2. We'll want to keep in mind as we update configuration management that a good next step is to switch the runtime too
19:42:38 <clarkb> I have also found a couple cases that are definitely python2 only right now: our use of supybot for meetbot and the logstash data pipeline. For meetbot we have a spec to replace it already which is noted on the etherpad
19:42:58 <clarkb> if you can think of other tools we need to be checking on feel free to add them to the list and I can dig in further as I have time
19:43:08 <clarkb> The goal here isn't really to fix everything as much as to be aware of what needs fixing
19:45:13 <fungi> as we move services and automation to platforms without python 2.7, we can fix things where needed
19:45:27 <fungi> if they don't become urgent before that
19:45:37 <clarkb> yup and it gives people a list of things they can pick off over time if they want to help out
19:45:46 <mordred> yeah - most of the things are easy enough to work on - but it's pretty opaque that they need to be worked on
19:45:49 <mordred> what clarkb said
19:45:54 <fungi> useful to know where we expect the pain points for such moves to be though
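
(For the python2 audit itself, a rough starting point on any given server might look like the following; the paths and patterns are assumptions and the output still needs the human review described above.)

    # Anything currently running under a python2 interpreter?
    ps -eo pid,args | grep -E '[p]ython2(\.[0-9])?'
    # Installed scripts whose shebang still points at python2 (or bare
    # "python", which is python2 on older distros).
    grep -IlE '^#!.*python(2(\.[0-9])?)?$' /usr/local/bin/* /usr/bin/* 2>/dev/null
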
19:46:57 <clarkb> #topic Trusty Upgrades
19:47:29 <clarkb> I don't have much to add on this topic but did want to point out it seems that osf's interop working group is picking up some steam. I'm hoping that may translate into some better interest/support for refstack
19:47:54 <clarkb> and we can maybe channel that into a refstack server upgrade. The docker work I did is actually really close. It's mostly a matter of having someone test it now (which I'm hoping the interop wg can do)
19:48:47 <clarkb> #topic Open Discussion
19:49:02 <clarkb> Anything else to bring up before we end the meeting?
19:51:06 <fungi> it's just come to light in #opendev that the zuul restart switched our ansible default from 2.8 to 2.9, so expect possible additional behavior changes
19:51:36 <clarkb> fungi: have we confirmed that? (I theorized it and wouldn't be surprised)
19:51:59 <fungi> ianw compared logs which showed passing results used 2.8 and the failure observed used 2.9
19:53:02 <corvus> yes, latest zuul master uses 2.9 by default
19:53:10 <corvus> we can pin the openstack tenant to 2.8 if we need to
19:53:34 <clarkb> so far it's only popped up as a single issue which has a fix
19:53:40 <clarkb> I guess if it gets worse we can pin
19:53:41 <fungi> i don't know yet if that's warranted, maybe the issues are small enough we can just fix them
19:53:44 <corvus> though istr we did some testing around this and didn't see a lot of issues
19:54:00 <fungi> apparently match as a filter is one
19:54:02 <clarkb> corvus: ya we tested with a lot of general playbooks in zuul-jobs
19:54:04 <corvus> so yeah, i think it's probably best to roll forward
19:54:13 <frickler> also devstack is broken by a new uwsgi release and still needs fixes
19:54:40 <frickler> and neutron-grenade-multinode seems to suffer from the venv removal
19:55:37 <clarkb> frickler: ya I meant to look into why multinode was sad about that
19:56:40 <clarkb> the last day and a half have been distracting :) After lunch today I'll try to be useful for all the things
19:58:16 <clarkb> sounds like that may be it for our meeting
19:58:18 <clarkb> thank you everyone!
19:58:21 <clarkb> #endmeeting
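
(A note on the "match as a filter" issue raised at 19:54:00: Ansible 2.9 removed the old "tests used as filters" syntax, so conditions written as "foo | match('bar')" need to become "foo is match('bar')". A rough, hedged way to spot remaining uses in a checkout of playbooks and roles:)

    # Flag jinja tests still written with filter syntax ("| match(", "| search(",
    # etc.), which Ansible 2.9 no longer accepts. Extend the list of test names
    # and the paths as needed; this is only a starting point.
    grep -rnE '\|\s*(match|search|version_compare)\s*\(' playbooks/ roles/
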