19:01:12 <clarkb> #startmeeting infra
19:01:13 <openstack> Meeting started Tue Jan 28 19:01:12 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:16 <openstack> The meeting name has been set to 'infra'
19:01:23 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-January/006583.html Our Agenda
19:01:24 <zbr> o/
19:01:29 <clarkb> #topic Announcements
19:01:37 <clarkb> I did not have any announcements to announce
19:02:07 <corvus> clarkb: nice announcement
19:02:24 <clarkb> #topic Actions from last meeting
19:02:30 <clarkb> corvus: it was a good one
19:02:33 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-01-21-19.01.txt minutes from last meeting
19:02:42 <clarkb> There were no actions recorded in the last meeting
19:03:08 <clarkb> #topic Priority Efforts
19:03:13 <clarkb> Let's dive right in then
19:03:17 <clarkb> #topic OpenDev
19:03:30 <clarkb> link https://review.opendev.org/#/c/703134/ Split OpenDev out of OpenStack Governance
19:03:35 <clarkb> #link https://review.opendev.org/#/c/703134/ Split OpenDev out of OpenStack Governance
19:03:41 <clarkb> #link https://review.opendev.org/#/c/703488/ Update OpenDev docs with new Governance
19:04:02 <clarkb> I think these two changes are just about ready to go in. At least I haven't seen much new feedback recently on the first one
19:04:25 <clarkb> I'll bring it up with the TC to see what the next steps are from their side to keep it moving
19:04:34 <clarkb> but if you've got any input now would be a great time to record it
19:05:48 <clarkb> Are there any questions about this move to bring up here?
19:07:25 <clarkb> The other opendev item I wanted to bring up was that we had been experiencing a ddos from huawei cloud against our gitea servers
19:07:37 <corvus> (likely unintentional)
19:07:53 <clarkb> correct
19:08:15 <clarkb> I ended up emailing our OSF board member from huawei and they said they were customer IPs so couldn't put us directly in touch but did bring it up with the customer
19:08:21 <clarkb> since then I've not noticed similar behavior
19:08:43 <clarkb> If we continue to see similar OOMing behavior though we should likely strongly consider gitea hosts with more memory.
19:09:18 <clarkb> fungi mentioned rate limiting requests, but I think that may make it worse because the git processes would stay around longer as the requests would take more time. And it is the git processes loading a bunch of info into memory that causes the problems
19:09:34 <clarkb> (another option may be more gitea hosts as that will distribute load from lb better)
19:09:57 <clarkb> something to monitor but for now it is no longer an emergency
19:09:58 <corvus> in the long run, if we can get to a single distributed instance, we should be able to handle that better with load balancing
19:10:24 <clarkb> ++
19:10:32 <corvus> but yeah, until that happens, it seems like either more ram or more hosts are the best short term options
19:10:34 <fungi> yes, if the gitea servers all shared a coherent backend filesystem view that would be much easier to absorb and scale
19:11:05 <corvus> (more hosts should work in this case since we're talking about multiple source ips)
19:11:44 <fungi> right now we can't even guarantee that two gitea servers have the exact same commits at any given point in time, due to gerrit scheduling replication to them independently
19:11:55 <ianw> are they "legitimate" requests, or more looking like scripts gone wild?
19:12:14 <clarkb> ianw: it looked like periodic CI jobs that cloned everything from scratch each day
19:12:18 <fungi> ianw: best guess is a ci system which doesn't cache repositories locally and clones them all for every build
19:12:20 <clarkb> from hundreds of hosts
19:13:02 <fungi> timing also could be related to the recent cinder announcement about drivers without third-party ci getting marked as unsupported
19:13:11 <clarkb> that is possible
19:13:44 <ianw> even if we could infinitely scale, perhaps rate limiting that sort of thing is best for everyone anyway
19:13:49 <fungi> huawei representatives did say it wasn't a huawei system though, just some customer in their public cloud
19:14:19 <fungi> the challenge is how to rate-limit it so that it doesn't make the matter worse, as clarkb points out
19:14:21 <clarkb> if we wanted to go the rate limit route we'd have to limit requests before git gets forked
19:14:30 <clarkb> it is doable but a naive approach would likely not work well
19:15:02 <fungi> i suspect the only sane limits would have to be implemented within gitea itself
19:15:44 <fungi> for example, to allow the host to stop serving new requests once system resource utilization reaches some defined thresholds
19:16:05 <fungi> so that the load balancer starts sending subsequent requests to other hosts in the pool
19:16:35 <fungi> could also implement that with a health check agent reporting host info in haproxy, but that's complex to set up
19:19:23 <ianw> can haproxy "count" how many connections have been made, and cut you off? or are the ips spread out enough that it would get under that but still cause problems?
19:20:04 <clarkb> I think the ips were spread out enough in this case
19:20:05 <fungi> you could do that with iptables/conntrack actually
19:20:26 <clarkb> which is why a consumption monitoring system might be necessary
19:20:37 <clarkb> cloning nova requires just over a gig of memory
19:20:45 <clarkb> have enough of those (and not very many) and you run out of memory
19:21:06 <fungi> yeah, it's less about the request volume and more about the impact of specific requests
19:22:20 <fungi> and somewhat, though not directly correlated to, the data transferred
19:23:10 <clarkb> having a proper gitea cluster should largely mitigate this as long as our k8s cluster has enough headroom
19:23:18 <clarkb> as new giteas can be spun up to meet increases in demand
19:23:27 <clarkb> it's possible we may just want to focus on making ^ a reality
19:24:37 <corvus> i haven't checked in on the status of elasticsearch support recently
19:25:52 <corvus> progress is being made https://github.com/go-gitea/gitea/pull/9428
19:26:08 <corvus> but we still need code search for that
19:26:25 <clarkb> exciting, maybe we keep that as the focus as it solves other problems too. And keep the other option in mind if the ddos problem comes back or gets worse
19:27:02 <clarkb> Anything else on this topic or should we move on?
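The memory arithmetic and the resource-threshold idea discussed above can be sketched roughly as follows. This is a minimal illustration, not anything the team deployed: the 1100 MiB per-clone figure stands in for "just over a gig", the host sizes and thresholds are hypothetical, and the reply words "drain" and "ready" follow my reading of HAProxy's agent-check option and should be verified against the HAProxy documentation before use.

```python
def max_concurrent_clones(host_ram_mib, reserved_mib, per_clone_mib=1100):
    """Estimate how many large clones fit in memory before the host risks an OOM.

    per_clone_mib is an assumed stand-in for "just over a gig" per nova clone.
    """
    usable = max(host_ram_mib - reserved_mib, 0)
    return usable // per_clone_mib


def agent_reply(mem_used_fraction, drain_at=0.85, recover_at=0.70, draining=False):
    """Pick the word a health check agent could report to the load balancer.

    Hysteresis (drain_at > recover_at) keeps a host from flapping between
    states when memory use hovers near a single threshold. Both thresholds
    are hypothetical values chosen for illustration.
    """
    if mem_used_fraction >= drain_at:
        return "drain"
    if draining and mem_used_fraction > recover_at:
        return "drain"
    return "ready"


if __name__ == "__main__":
    # Hypothetical 8 GiB host with 2 GiB reserved for gitea and the OS.
    print(max_concurrent_clones(8192, 2048))
    print(agent_reply(0.90))
    print(agent_reply(0.75, draining=True))
```

The point of the sketch is that the limit is driven by per-request memory cost rather than request count, which matches the observation that only a handful of large clones is enough to exhaust a host.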
19:27:37 <ianw> is this the type of thing we could have a story to track
19:28:00 <ianw> i sort of worry that there's a lot of investigation not consolidated on it anywhere
19:28:19 <clarkb> ianw: ++ (too easy to forget when things are broken)
19:28:35 <clarkb> #action clarkb write up story on gitea OOMs and DDoS
19:28:40 <clarkb> I'll fix that
19:29:35 <ianw> thanks, that will be great as we track it over time
19:30:04 <clarkb> #topic Update Config Management
19:30:23 <clarkb> I don't think mordred has made much progress on the dockerification of gerrit
19:30:31 <clarkb> he has been busy with travel and such
19:30:41 <clarkb> Does anyone else have config management updates to bring up?
19:31:18 <ianw> i really want to get back to nodepool builder from containers very soon
19:32:14 <clarkb> ianw: is that running on the nodepool side's testing yet?
19:34:05 <ianw> ummm ... not sure
19:34:47 <clarkb> no worries, was just curious
19:34:54 <clarkb> #topic General Topics
19:35:23 <clarkb> I/we have been semi formally pushing on getting rid of Trusty once and for all the last few days
19:35:26 <ianw> (re: nodepool yes we have container tests merged)
19:36:05 <clarkb> I've been working on a static.openstack.org replacement which needs a new gerritlib release which I'll push after the meeting
19:36:19 <fungi> status not static, right?
19:36:20 <clarkb> once that is in I half expect it to be functional, I'll confirm that then update dns
19:36:24 <clarkb> er yes status.
19:36:49 <clarkb> As part of this I realized we have a lack of testing around jeepyb and gerritlib so I've been working on a new job this morning to do an integration test with a running gerrit
19:37:17 <clarkb> this should give us a lot more confidence in jeepyb and gerritlib changes which are likely to become important as we tool up some of the opendev self serve stuff. If nothing else we can reuse the test platform and build different tools
19:37:40 <clarkb> I'm close to having that stack ready for review and will bring it up after the meeting once I have the ready for review commits pushed
19:37:51 <clarkb> ianw: any progress on static? I think the change to add the host landed yesterday
19:38:19 <ianw> if i could get some eyes on
19:38:28 <ianw> #link https://review.opendev.org/704523
19:38:43 <ianw> hit a small snag deploying openafs
19:39:05 <ianw> but, after that, we should be able to test out governance and security sites with local host updates
19:39:33 <ianw> if we're happy, we can switch dns at some point after that
19:39:58 <ianw> with that POC the other bits should proceed quickly as we get them published to afs
19:40:34 <clarkb> sounds good, I'll review that after the meeting
19:40:50 <clarkb> wiki was the other big host in this list, fungi have you had a chance to push on it yet?
19:41:08 <fungi> unfortunately no. trying to get freed up to do that today
19:41:14 <clarkb> ok let us know if we can help
19:41:26 <zbr> can i add something about our static websites?
19:41:30 <fungi> have to run some errands after the meeting and will then figure out where i last left off
19:41:34 <clarkb> zbr: sure
19:41:59 <zbr> we should enable google site tools in order to know who is linking to us and be able to reindex quickly when we update stuff
19:42:15 <zbr> or find links that are broken
19:42:44 <ianw> that's usually dropping a randomly named html file in the root, right?
19:42:46 <zbr> enabling it requires only some kind of site verification, no JS mess is needed.
19:42:49 <zbr> exactly
19:43:09 <corvus> i don't think that's something that the infra/opendev team needs to do
19:43:14 <zbr> when zuul docs got broken I raised https://review.opendev.org/#/c/702888/ for enabling it.
19:43:19 <corvus> that should be up to the individual projects
19:43:27 <clarkb> individual projects should be able to verify themselves if it's based on content in the site
19:44:08 <fungi> i'd be very uncomfortable forcing a tacit endorsement of some proprietary third-party service on all sites opendev hosts
19:44:14 <corvus> my read is that the zuul project is not thrilled about the idea
19:44:56 <fungi> an alternative is to do what we did for docs.openstack.org and provide 404 reporting scraped from the apache error logs
19:45:04 <zbr> so we provide a worsened experience to the users just for the spite of proprietary 3rd parties?
19:45:41 <clarkb> zbr: I think the goal would be to address the problem without relying on a proprietary service. The preexisting 404 scanner tool is one such method (as fungi points out)
19:45:43 <zbr> clearly the docs were broken for days before we fixed them
19:45:56 <clarkb> and if projects wish to opt into those proprietary tools they can do so without our input aiui
19:46:31 <fungi> for the record, zuul's docs were not broken for days. external search engines were serving stale cached links
19:47:00 <fungi> zuul's documentation provides its own index which is required to be consistent by the tool which builds it
19:47:29 <fungi> i'll grant that the integrated keyword searching for sphinx is not great compared to what external services provide, but that's independent of the documentation index itself
19:47:58 <clarkb> zbr: but yes, one of the explicit goals here is to push viable project hosting via open source tooling
19:48:19 <fungi> and to improve those open tools where necessary
19:48:25 <ianw> is that hash in the file name based on the URL, or based on the user account requesting to add the site?
19:48:33 <zbr> ok, i just wanted to state that we should not ignore the UX
19:48:57 <corvus> ianw: that's for a user account
19:49:17 <corvus> https://review.opendev.org/702888 would give zbr administrative control of the zuul-ci.org domain in google webmaster tools
19:49:23 <ianw> ok, so it's not like you add one file and everyone/anyone could verify it
19:49:49 <zbr> true, but someone has to do it, it does not have to be me.
19:50:03 <corvus> in fact, no one has to do it
19:50:30 <fungi> as evidenced by the fact that no one has so far
19:50:48 <zbr> so it is better to do nothing just because we don't want to give someone the permission to do SEO
19:51:16 <corvus> i feel like the conversation has looped back to 19:45
19:51:25 <Shrews> the fact that folks find it better to use an external search engine to find the docs they need points to the fact that we should improve the layout of our current docs to make things easier to find. the reorg was (hopefully) the beginning of that effort
19:51:31 <clarkb> I think this is a question for the zuul project not opendev, but I also don't think anyone is saying we have to give up. Simply that Zuul would like to use open source tools to address this problem
19:51:49 <clarkb> we only have about 8 minutes left and there are a couple more topics, I think we should move on
19:52:07 <corvus> Shrews, clarkb: ++
19:52:08 <clarkb> I wanted to point out we have deployed a new (arm64) cloud to production in nodepool
19:52:28 <clarkb> Unfortunately there is some weird network behavior between nb03 and the cloud apis so we have been unable to upload images
19:52:49 <ianw> also the mirror has disappeared unexpectedly, and the api hasn't helped determine why
19:52:54 <clarkb> we've reached out to kevinz on this but it is the chinese new year so expect it might be a little while before that gets fixed
19:53:04 <ianw> tracking things in
19:53:06 <ianw> #link https://storyboard.openstack.org/#!/story/2007195
19:53:30 <clarkb> this new cloud will give us like 5 times the capacity for arm64 jobs
19:53:33 <clarkb> it is very exciting
19:54:00 <ianw> yep, and we'll for sure help sort out any stability issues, we always do :)
19:54:36 <clarkb> And finally I set up a followup call with the airship team to talk about any new questions about adding the ericsson cloud to opendev nodepool. That happens tomorrow at 1600UTC on jitsi meet.
19:54:48 <clarkb> #link http://lists.openstack.org/pipermail/openstack-infra/2020-January/006578.html Airship CI Meeting details here
19:55:16 <clarkb> I ended up using jitsi meet for something else recently and it worked really well so wanted to give it a go. It is shaping up to be an open source alternative to google hangouts and zoom etc
19:55:43 <clarkb> y'all are welcome to join. The timing is probably bad for ianw, you should sleep in instead :)
19:56:22 <clarkb> And with that we have ~3 minutes for anything else
19:56:25 <clarkb> #topic Open Discussion
19:57:35 <fungi> zbr: i've been evaluating our various options for open tools to perform analysis of our web activity in socially-conscious ways, producing reports which avoid any use of pii so they can be provided publicly. so far each of the classic tools i've evaluated has had one problem or another, but this one came to my attention last week and i'm curious to try it:
19:57:40 <fungi> #link https://www.goatcounter.com/
19:58:28 <zbr> thanks, still think we are missing the point.
19:58:54 <fungi> i apparently am
19:59:20 <zbr> you cannot force google to reindex your site with 3rd party tools, nor convince them to tell you about incoming links, broken stuff and so on.
19:59:36 <zbr> this is not about analytics stuff
20:00:02 <clarkb> zbr: right, but if we properly add redirects then we don't need to force reindexing
20:00:04 <fungi> you can however find out what links are "broken" by seeing what urls folks request which return errors and add corresponding redirects
20:00:08 <clarkb> we can instead rely on their periodic reindexing
20:00:21 <clarkb> and we are at time
20:00:28 <clarkb> thank you everyone
20:00:30 <clarkb> #endmeeting
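The approach fungi and clarkb describe, finding which requested URLs return errors so that redirects can be added, could look something like the sketch below. It is not the existing 404 scanner tool mentioned in the meeting. It assumes combined-format Apache access log lines rather than the error logs fungi refers to, and the sample lines, addresses and paths are made up for illustration.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a Common/Combined Log
# Format line, e.g.: "GET /some/path HTTP/1.1" 404 196
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3}) ')


def count_404s(lines):
    """Count how often each requested path returned a 404."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("status") == "404":
            counts[match.group("path")] += 1
    return counts


# Hypothetical sample data; real input would be read from the access log.
sample = [
    '203.0.113.5 - - [28/Jan/2020:19:45:00 +0000] "GET /old/page.html HTTP/1.1" 404 196 "-" "curl/7.68.0"',
    '198.51.100.7 - - [28/Jan/2020:19:45:02 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "curl/7.68.0"',
    '203.0.113.9 - - [28/Jan/2020:19:45:09 +0000] "GET /old/page.html HTTP/1.1" 404 196 "-" "curl/7.68.0"',
]

if __name__ == "__main__":
    # The most frequently missed paths are the best candidates for redirects.
    for path, hits in count_404s(sample).most_common():
        print(hits, path)
```

A report like this contains only paths and counts, so it can be published without exposing client addresses.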