19:01:28 <clarkb> #startmeeting infra
19:01:29 <openstack> Meeting started Tue Sep 8 19:01:28 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:30 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:32 <openstack> The meeting name has been set to 'infra'
19:01:42 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000082.html Our Agenda
19:01:54 <clarkb> #topic Announcements
19:01:55 <ianw> o/
19:02:47 <clarkb> I didn't have any formal announcements. But yesterday and today Oregon decided to catch on fire so I'm semi distracted by that. We should be ok, though a nearby field decided it wanted to be a fire instead
19:02:59 <clarkb> anyone else have anything to announce?
19:03:15 <clarkb> (oh also power outages have been a problem so I may drop out due to that too, though I haven't lost power yet)
19:03:23 <fungi> nothing which tops that, no ;)
19:03:46 <fungi> :/
19:03:53 <clarkb> really I expect the worst bit will be the smoke when the winds shift again. So I should just be happy right now :)
19:04:14 <clarkb> #topic Actions from last meeting
19:04:23 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-01-19.01.txt minutes from last meeting
19:04:33 <clarkb> There were no actions from last meeting. Let's just dive into this one then
19:04:39 <clarkb> #topic Priority Efforts
19:04:49 <clarkb> #topic Update Config Management
19:05:04 <clarkb> I've booted a new nb03.opendev.org to run nodepool-builder with docker for arm64 image builds
19:05:20 <clarkb> That has been enrolled into our inventory but has a problem installing things because there aren't wheels for arm64 :)
19:05:29 <clarkb> #link https://review.opendev.org/750472 Add build deps for docker-compose on nb03
19:05:39 <clarkb> that should fix it, and once that's done everything should be handled by docker so it should work
19:06:12 <clarkb> one thing that came up as part of this is that we don't seem to have ansible using sshfp records yet? or maybe we do and the issue I had was specific to having a stale known_hosts entry for a reused IP?
19:06:23 <clarkb> ianw: fungi ^ any updates on that?
19:06:49 <fungi> we have usable sshfp records for at least some hosts
19:06:54 <ianw> umm, i think that the stale known_hosts overrides the sshfp
19:07:14 <fungi> yes, if there is an existing known_hosts entry that will be used instead
19:07:22 <clarkb> gotcha, that was likely the issue here then
19:07:27 <ianw> it might be a bit of a corner case with linaro
19:07:29 <clarkb> do we expect sshfp to work otherwise?
19:07:36 <ianw> where we have a) few IPs and b) have rebuilt the mirror a lot
19:08:01 <fungi> though i also don't think bridge.o.o is configured to do VerifyHostKeyDNS=yes, is it?
19:08:14 <clarkb> https://review.opendev.org/#/c/744821/ <- reviewing and landing that would be good if we expect sshfp to work now
19:08:15 <ianw> my understanding is yes, since it is using unbound and the dns records are trusted
19:08:32 <fungi> i thought VerifyHostKeyDNS=ask was the default
19:09:06 <fungi> and i couldn't find anywhere we'd overridden it
19:10:17 <fungi> ahh, the ssh_config manpage on bridge.o.o claims VerifyHostKeyDNS=no is the default actually
19:10:42 <clarkb> ok we don't have to solve this in the meeting but wanted to call it out as a question that came up
19:11:02 <fungi> yeah, i'm not certain we've actually started using sshfp records for ansible runs from bridge yet
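
For context on the sshfp question above: the records themselves can be checked independently of ssh, which makes it easier to tell a missing record apart from a VerifyHostKeyDNS problem. A minimal sketch, assuming the dnspython library is installed; the default host name is only a placeholder:

    # Sketch: print the SSHFP records for a host in zone-file form so they can
    # be compared against what "ssh-keygen -r <host>" reports on the host itself.
    # Assumes the dnspython package; the default host below is only a placeholder.
    import sys

    import dns.resolver

    host = sys.argv[1] if len(sys.argv) > 1 else "mirror.example.org"

    try:
        answers = dns.resolver.resolve(host, "SSHFP")
    except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
        print(f"no SSHFP records published for {host}")
        sys.exit(1)

    for rdata in answers:
        # algorithm: 1=RSA 2=DSA 3=ECDSA 4=Ed25519; fingerprint type: 1=SHA-1 2=SHA-256
        print(f"{host} IN SSHFP {rdata.algorithm} {rdata.fp_type} {rdata.fingerprint.hex()}")

ssh only consults these records when VerifyHostKeyDNS is enabled (and, for =yes, only when the answers arrive DNSSEC-validated), which is where the unbound question above comes in.
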
19:11:35 <clarkb> Are there other config management updates to call out?
19:12:20 <fungi> also worth noting, glibc 2.31 breaks dnssec (there are nondefault workarounds), so we need to be mindful of that when we eventually upgrade bridge.o.o, or for our own systems
19:12:44 <clarkb> fungi: is 2.31 or newer in focal?
19:12:49 <fungi> as that will also prevent openssh from relying on sshfp records
19:13:22 <fungi> yeah, focal
19:13:46 <fungi> 2.31-0ubuntu9
19:14:57 <clarkb> sounds like that may be it for config management and sshfp
19:15:00 <clarkb> #topic OpenDev
19:15:02 <ianw> we could also move back to the patch that just puts the fingerprints into known_hosts
19:15:24 <ianw> as sshfp seems like it is a nice idea, but ... perhaps more trouble than it's worth tbh
19:15:38 <clarkb> ianw: something to consider for sure
19:15:40 <clarkb> #link https://review.opendev.org/#/c/748263/ Update opendev.org front page
19:15:47 <clarkb> Thank you ianw for reviewing this one
19:16:13 <clarkb> Looks like we've got a couple +2s now. corvus do you want to review it before we approve it?
19:16:48 <clarkb> I should rereview it, but in trying to follow the comments it's all made sense to me so far, so I doubt I'll have major concerns
19:17:00 <fungi> i've left some comments there for things i'm happy to address in a follow-up patch
19:17:50 <fungi> so as not to drag this one out unnecessarily
19:18:11 <fungi> it's already a significant improvement over what's on the site now, in my opinion
19:18:14 <clarkb> frickler: ^ you may be interested as well
19:18:37 <clarkb> maybe fungi can approve it first thing tomorrow if there are no further objections between now and then?
19:18:46 <clarkb> because ya I agree it's a big improvement
19:19:05 <fungi> sure, i'll push up my suggestions as a second change when doing so
19:19:37 <clarkb> On the gerrit upgrade testing side of things I've not had time to push on that since my last email to luca. I'm hoping that I'll have time this week for more testing
19:20:03 <clarkb> Any other opendev topics others would like to call out before we move on?
19:20:32 <corvus> clarkb: i will +3 front page
19:20:43 <clarkb> corvus: k
19:20:51 <fungi> i finished the critical volume replacements in rax-dfw last week
19:21:05 <fungi> and have been poking at replacing the less critical ones in the background as time allows
19:21:21 <clarkb> fungi: other than the "sometimes old volumes don't delete" problem, were there issues?
19:21:54 <fungi> ahh, yeah, looks like wiki.o.o will need special attention. i expect it's because it's booted from a snapshot of a legacy flavor instance, but i can't attach a new volume to it
19:22:23 <fungi> may need to rsync its content over to another instance booted from a modern flavor
19:22:34 <clarkb> "fun"
19:22:54 <fungi> the api accepts the volume add, but then the volume immediately returns to available and the instance never sees it
19:23:20 <fungi> oh, and also i discovered that something about osc is causing it not to be able to refer to volumes by name
19:23:48 <fungi> and it gives an empty name column in the volume list output too
19:24:13 <fungi> i've resorted to using cinderclient for now to get a volume listing with names included
19:24:30 <fungi> i suspect it's something to do with using the cinder v1 api, or maybe a rackspace-specific problem
19:24:43 <fungi> just something worth keeping in mind if anybody needs something similar
19:24:59 <fungi> i haven't really had time to take it up with the sdk/cli folks yet
19:25:13 <clarkb> Thank you for taking care of that
19:25:48 <fungi> no problem
19:26:16 <clarkb> #topic General Topics
19:26:30 <clarkb> #topic Vexxhost Mirror IPv6 Problems
19:27:08 <clarkb> With this issue it seems we get rogue router advertisements which add bogus IPs to our instance. When that happens we basically break IPv6 routing on the host
19:27:20 <clarkb> This is likely a neutron bug but needs more cloud side involvement to debug
19:27:59 <fungi> note we've seen it (at least) once in limestone too. based on the prefixes getting added we suspect it's coming from a job node in another tenant
19:28:02 <clarkb> frickler has brought up that we should try and mitigate this better. Perhaps via assigning the IP details statically. I looked at this and it should be possible with the new netplan tooling, but it's a new thing we'll need to figure out
19:28:20 <clarkb> I wrote up an etherpad (that I can't find anymore) with a potential example config
19:28:35 <clarkb> another thought I had was maybe we can filter RAs by origin mac?
19:28:42 <clarkb> is that something iptables can be convinced to do?
19:29:11 <fungi> i'm not absolutely sure iptables can block that
19:29:28 <fungi> if it's handled like arp, the kernel may be listening to a bpf on the interface
19:29:41 <fungi> so will see and act on it before it ever reaches iptables
19:29:59 <fungi> (dhcp has similar challenges in that regard)
19:29:59 <clarkb> my concern with the netplan idea is if we get it wrong we may have to build a new server. At least with iptables we can test the rule and if we get it wrong, reboot
19:30:29 <ianw> clarkb: you could always set a console root password for a bit?
19:30:52 <clarkb> ianw: does remote console access work with vexxhost? (I'm not sure, but if it does that would be a reasonable compromise)
19:31:23 <ianw> oh, i'm assuming it would, yeah
19:31:37 <clarkb> Also totally open to other ideas here :)
19:32:18 <ianw> it seems like this is something you have to stop, like a rogue dhcp server
19:32:41 <fungi> statically configuring ipv6 and configuring the kernel not to do autoconf is probably the safest workaround
19:32:41 <clarkb> ya, it's basically the same issue just with different IP protocols
19:33:05 <clarkb> I'll try harder to dig out the netplan etherpad after the meeting
19:33:10 <ianw> yeah, so i'm wondering what best practice others use is ... ?
19:33:35 <ianw> oh, it's ipv6
19:33:39 <ianw> of course there's an rfc
19:33:42 <ianw> https://tools.ietf.org/html/rfc6104
19:33:51 <fungi> ianw: generally it's to rely on autoconf and hope there's no bug in neutron leaking them between tenants
19:34:11 <clarkb> manual configuration is the first item on that rfc
19:34:17 <ianw> just 15 pages of options
19:34:18 <clarkb> so maybe we start there as frickler suggests
19:34:42 <clarkb> but if any of the other options there look preferable to you I'm happy to try others instead :)
19:35:44 <ianw> is it neutron leaking ra's ... or devstack doing something to the underlying nic maybe?
19:36:11 <clarkb> ianw: we believe it is neutron running in test jobs on the other tenant (we split mirror and test nodes into different tenants)
19:36:27 <fungi> devstack in a vm altering the host's nic would be... even more troubling
19:36:28 <clarkb> and neutron in the base cloud (vexxhost) is expected to block those RAs
19:36:40 <clarkb> per the bug we filed when limestone had this issue
19:36:42 <fungi> in which case it would point to a likely bug in qemu i guess
19:36:43 <ianw> that seems like a DOS attack :/
19:37:02 <clarkb> ianw: yes I originally filed it as a security bug a year ago or whatever it was
19:37:21 <clarkb> but it largely got ignored as cannot reproduce and then disclosed (so now we can talk about it freely)
19:37:23 <fungi> ianw: yep. neutron has protections which are supposed to prevent exactly this, but sometimes those aren't effective apparently
19:37:40 <clarkb> it's possible that because we open up our security groups we're the only ones that notice
19:37:47 <clarkb> (we could try using security groups to block them too maybe?)
19:38:02 <fungi> however we haven't worked out the sequence to reliably recreate the problem, only observed it cropping up with some frequency, so it's hard to pin down the exact circumstances which lead to it
19:38:16 <fungi> the open bug on neutron is still basically a dead end without a reproducer
19:38:22 <clarkb> yup, also we don't run the clouds so we don't really see the underlying network behavior
19:39:48 <clarkb> anyway we don't have to solve this here, let's just not forget to work around it this time :) I can help with this once nb03 is in a good spot
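
As a stopgap while the static configuration gets sorted out, the symptom itself is easy to detect: extra global addresses show up outside the provider's prefix. A minimal monitoring sketch, assuming iproute2's JSON output is available; the interface name and expected prefix are placeholders, not our real values:

    # Sketch: flag global IPv6 addresses on the mirror's interface that fall
    # outside the prefix we expect from the provider, the telltale sign that a
    # rogue RA was accepted. Interface and prefix below are placeholders.
    import ipaddress
    import json
    import subprocess

    IFACE = "ens3"
    EXPECTED = ipaddress.ip_network("2001:db8:1234::/64")

    out = subprocess.run(
        ["ip", "-6", "-json", "addr", "show", "dev", IFACE],
        check=True, capture_output=True, text=True,
    ).stdout

    unexpected = []
    for link in json.loads(out):
        for addr in link.get("addr_info", []):
            if addr.get("family") == "inet6" and addr.get("scope") == "global":
                if ipaddress.ip_address(addr["local"]) not in EXPECTED:
                    unexpected.append(addr["local"])

    if unexpected:
        print(f"unexpected global v6 addresses on {IFACE}: {', '.join(unexpected)}")

That only surfaces the problem faster; the actual mitigation is still one of the options discussed above, whether static addressing via netplan or telling the kernel to ignore RAs (the per-interface accept_ra sysctls).
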
19:40:05 <clarkb> #topic Bup and Borg Backups
19:40:27 <clarkb> ianw anything new on this? and if not should we drop it from the agenda until we start enrolling servers with borg?
19:40:37 <ianw> sorry i've just had my head in rhel and efi stuff
19:40:45 <clarkb> (I've kept it on because I think backups are important but bup seems to be working well enough for now so borg isn't urgent)
19:40:49 <ianw> it is right at the top of my todo list though
19:41:45 <ianw> we can keep it for now, and i'll try to get at least an initial host done asap
19:41:50 <clarkb> ok and thank you
19:41:55 <clarkb> #topic PTG Planning
19:42:08 <clarkb> #topic https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:22 <clarkb> er
19:42:24 <clarkb> #undo
19:42:25 <openstack> Removing item from minutes: #topic https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:35 <clarkb> #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:42:48 <clarkb> October is fast approaching and I really do intend to add some content to that etherpad
19:43:03 <clarkb> as always others should feel free to add their own content
19:43:31 <clarkb> #topic Docker Hub Rate Limits
19:43:56 <clarkb> This wasn't on the agenda I sent out this morning, as it occurred to me that it may be worth talking about after looking at emails on openstack-discuss
19:44:51 <clarkb> Long story short, docker hub is changing/has changed how they apply rate limits to image pulls. In the past limits were applied to layer blobs, which we do cache in our mirrors. Now limits are applied to manifest fetches, not blob layers. We don't cache manifests because getting those requires auth (even as an anonymous user you get an auth token)
19:45:11 <clarkb> This is unfortunate because it means our caching strategy is no longer effective for docker hub
19:45:33 <clarkb> On the plus side, projects like zuul and nodepool and system-config haven't appeared to be affected yet. But others like tripleo have
19:45:54 <clarkb> docker has promised they'll write a blog post with suggestions for CI operators, which I haven't seen published yet /me waits patiently
19:46:26 <clarkb> If our users struggle with this in the meantime I think their best bet may be to stop using our mirrors, because then they will make anonymous requests from IPs that will generally be unique enough to avoid issues
19:47:01 <clarkb> Other ideas I've seen include building images rather than fetching them (tripleo is doing this) as well as using other registries like quay
19:47:27 <fungi> there are certainly multiple solutions available to us, but i've been trying to remind users that dockerhub has promised to publish guidance and we should wait for that
19:47:58 <fungi> at least before we invest effort in building an alternative solution
19:48:16 <clarkb> ++ I mostly want people to be aware there is an issue and workarounds from the source should be published at some point
19:48:33 <clarkb> and there are "easy" workarounds that can be used between now and then, like not using our mirrors
19:48:35 <fungi> (such as running our own proxy registry, or switching to a different web proxy which might be more flexible than apache mod_proxy)
19:49:11 <fungi> there was also some repeated confusion i've tried my best to correct around zuul-registry and its presumed use in proxying docker images for jobs
19:50:14 <clarkb> oh ya a couple people were confused by that
19:50:24 <clarkb> not realizing it's a temporary staging ground, not a canonical source/cache
19:50:26 <ianw> didn't github also announce a competing registry too?
19:50:37 <clarkb> ianw: yes
19:50:43 <clarkb> and google has one
19:50:51 <fungi> yes, but who knows if it will have similar (or worse) rate limits. we've been bitted by github rate limits pretty often as it is
19:51:13 <fungi> man, my typing is atrocious today
19:51:26 <ianw> yeah, just thinking that's sure to become something to mix in as well
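
To make the caching problem above concrete: even an anonymous pull has to fetch a bearer token before it can request a manifest, so the proxy never sees an auth-free, cacheable manifest URL. A rough sketch of the flow, assuming the requests library; library/python is only an example image, and the rate limit header names are taken from docker's announcement rather than anything we've verified:

    # Sketch: anonymous manifest fetch from Docker Hub, showing the token round
    # trip and the advertised rate limit headers. Assumes the requests library;
    # the image/tag are only examples.
    import requests

    image = "library/python"
    tag = "latest"

    # even anonymous clients must get a bearer token scoped to the repository
    token = requests.get(
        "https://auth.docker.io/token",
        params={"service": "registry.docker.io", "scope": f"repository:{image}:pull"},
    ).json()["token"]

    # the manifest GET is what now counts against the pull rate limit
    resp = requests.get(
        f"https://registry-1.docker.io/v2/{image}/manifests/{tag}",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.docker.distribution.manifest.v2+json",
        },
    )
    resp.raise_for_status()

    print("digest:   ", resp.headers.get("docker-content-digest"))
    print("limit:    ", resp.headers.get("ratelimit-limit"))
    print("remaining:", resp.headers.get("ratelimit-remaining"))

The layer blobs referenced by that manifest are content-addressed and immutable, which is why the existing mirror can keep caching them; the manifest requests above are the part it can't absorb, hence waiting on docker's promised guidance.
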
19:53:02 <clarkb> #topic Open Discussion
19:53:08 <clarkb> Anything else to bring up in our last 7 minutes?
19:53:46 <fungi> oh, yeah
19:53:51 <fungi> pynotedb
19:54:13 <fungi> a few years ago, zara started work on a python library to interface with gerrit notedb databases
19:54:18 <fungi> but didn't get all that far with it
19:54:42 <fungi> we have the package name on pypi and a repo in our (opendev's) namespace on opendev, but that's mostly just a cookie-cutter commit
19:54:54 <hashar> :-\
19:55:23 <fungi> more recently softwarefactory needed something to be able to interface with notedb from python and started writing a module for that
19:55:39 <fungi> they (ironically) picked the same name without checking whether it was taken
19:55:56 <fungi> now they're asking if we can hand over the pypi project so they can publish their library under that name
19:56:00 <clarkb> for the name on pypi, was anything released to it?
19:56:24 <clarkb> if yes, then we may want to quickly double check nothing is using it (I think pypi exposes that somehow), but if not I have no objections to that idea
19:56:24 <fungi> a couple of dev releases several years ago, looks like
19:57:19 <fungi> also SotK has confirmed that the original authors are okay with letting it go
19:57:40 <fungi> and probably just using tristanC's thing instead once they're ready
19:58:25 <clarkb> works for me
19:58:32 <clarkb> particularly if the original authors are happy with the plan
19:58:49 <diablo_rojo> Seems reasonable
19:59:24 <fungi> ahh, looks like the "releases" for it on pypi have no files anyway
19:59:58 <fungi> evidenced by the lack of "download files" at https://pypi.org/project/pynotedb/
20:00:13 <hashar> there is no tag in the repo apparently
20:00:15 <fungi> so the two dev releases on there are just empty, no packages
20:00:17 <clarkb> that makes things easy
20:00:24 <diablo_rojo> Nice
20:00:26 <clarkb> and we are at time
20:00:29 <clarkb> Thank you everyone!
20:00:31 <fungi> thanks clarkb!
20:00:31 <clarkb> #endmeeting
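
Following up on the pypi question from the open discussion: whether a project has any uploaded files is exposed by pypi's JSON API, so the "empty dev releases" observation is easy to confirm. A small stdlib-only sketch:

    # Sketch: list each release of a pypi project and whether any files were
    # actually uploaded for it, which is what matters for reusing the name.
    import json
    import urllib.request

    project = "pynotedb"
    url = f"https://pypi.org/pypi/{project}/json"

    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)

    for version, files in sorted(data["releases"].items()):
        status = f"{len(files)} file(s)" if files else "no files uploaded"
        print(f"{project} {version}: {status}")

An empty file list for every version matches what the project page shows under "Download files".
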