*** hamalq has quit IRC | 00:36 | |
*** diablo_rojo has quit IRC | 06:49 | |
*** mordred has quit IRC | 12:04 | |
*** mordred has joined #opendev-meeting | 12:11 | |
*** hamalq has joined #opendev-meeting | 15:50 | |
*** hamalq_ has joined #opendev-meeting | 15:52 | |
*** hamalq has quit IRC | 15:55 | |
-openstackstatus- NOTICE: Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:21 | |
clarkb | anyone else here for the meeting? | 19:01 |
fungi | more or less | 19:01 |
clarkb | It might be a little disorganized due to fires earlier in the day but I'll give it a go | 19:01 |
clarkb | #startmeeting infra | 19:01 |
openstack | Meeting started Tue Jun 30 19:01:32 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
openstack | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
*** openstack changes topic to " (Meeting topic: infra)" | 19:01 | |
openstack | The meeting name has been set to 'infra' | 19:01 |
clarkb | #topic Announcements | 19:01 |
*** openstack changes topic to "Announcements (Meeting topic: infra)" | 19:01 | |
ianw_pto | o/ | 19:02 |
clarkb | If you hadn't noticed, our gitea installation was being DDoS'd. It's under control now, but only because we're blocking all of China Unicom | 19:02 |
clarkb | we can talk more about this shortly | 19:02 |
clarkb | The other thing I wanted to mention is I'm taking next week off and unlike ianw_pto I don't intend to be here for the meeting :) | 19:02 |
clarkb | If we're going to have a meeting next week we'll need someone to volunteer for running it | 19:03 |
clarkb | #topic Actions from last meeting | 19:03 |
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)" | 19:03 | |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-23-19.01.txt minutes from last meeting | 19:03 |
clarkb | There were none | 19:04 |
clarkb | #topic Specs approval | 19:04 |
*** ianw_pto is now known as ianw | 19:04 | |
*** openstack changes topic to "Specs approval (Meeting topic: infra)" | 19:04 | |
clarkb | ianw: oh no did the pto end? | 19:04 |
clarkb | #link https://review.opendev.org/#/c/731838/ Authentication broker service | 19:04 |
clarkb | Going to continue to call this out and we did get a new patchset | 19:05 |
clarkb | I should read it | 19:05 |
ianw | heh yes was just yesterday | 19:05 |
clarkb | #topic Priority Efforts | 19:05 |
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)" | 19:05 | |
clarkb | #topic Opendev | 19:05 |
*** openstack changes topic to "Opendev (Meeting topic: infra)" | 19:06 | |
clarkb | Let's dive right in | 19:06 |
clarkb | Before we talk about the ddos I wanted to remind people that the advisory board will start moving forward at the end of this week | 19:06 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2020-May/000026.html Advisory Board thread. | 19:06 |
clarkb | we've got a number of volunteers which is exciting | 19:06 |
clarkb | Also we had a gitea api issue with the v1.12.0 release | 19:07 |
clarkb | long story short, listing repos requires pagination now, but the way the repos are listed from the db doesn't consistently produce a complete list | 19:07 |
clarkb | we worked around that with https://review.opendev.org/#/c/738109/ and I proposed an upstream change at https://github.com/go-gitea/gitea/pull/12057 which seems to fix it as well | 19:08 |
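For anyone who wants to see the pagination behaviour in question, here is a hedged sketch of walking the repo listing via gitea's v1 API (the endpoint is the documented repos/search one; the page size and jq filter are only illustrative):

```sh
# page through the repo listing; the bug was that successive pages did not
# always add up to a complete, stable list of repositories
page=1
while :; do
  names=$(curl -s "https://opendev.org/api/v1/repos/search?page=${page}&limit=50" | jq -r '.data[].full_name')
  [ -z "$names" ] && break
  printf '%s\n' "$names"
  page=$((page + 1))
done
```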
clarkb | For today's gitea troubles http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is a good illustration of what we saw | 19:08 |
clarkb | basically at ~midnight UTC today we immediately spiked to our haproxy connection limit | 19:08 |
clarkb | after digging around in gitea and haproxy logs it appears that there is a botnet doing a crawl of our gitea installation from many, many IP addresses, most of which belong to Chinese ISPs | 19:09 |
clarkb | while doing that I noticed it appeared we had headroom to accept more connections, so I proposed bumping that limit from 4k to 16k in haproxy (note the cacti number is 2x the haproxy number because haproxy has a connection to the frontend and one to the backend for each logical connection) | 19:10 |
clarkb | unfortunately our backends couldn't handle the new connections (of which we seemed to peak at about 8k logical connections) | 19:10 |
fungi | this may be in part due to specific characteristics of the requests we were being hit with | 19:11 |
clarkb | we went from having slowness and the occasional error to more persistent errors as the giteas ran out of memory. I manually reverted the maxconn change and https://review.opendev.org/#/c/738679/1 is in the gate to revert it properly. Then I restarted all the giteas and things got better. | 19:11 |
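For context, the limit being discussed is haproxy's maxconn; a minimal sketch of where it lives (whether it is set globally, per frontend, or both in our config is an assumption here, as are the section names):

```
global
    maxconn 4000        # the value the revert restores; it was briefly bumped to 16000

frontend git_https
    bind *:443
    mode tcp
    maxconn 4000
    default_backend gitea_https
```

Each logical client connection shows up twice in cacti (one frontend socket plus one socket haproxy opens to the backend), which is why roughly 8k logical connections reads as roughly 16k on the graph.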
clarkb | As part of recovery we also blocked all IPv4 ranges for China Unicom on the haproxy load balancer | 19:11 |
clarkb | if we want to undo those drop rules we can restart the netfilter-persistent service on that host | 19:11 |
clarkb | yes, the requests are looking at specific files and commits and checking them across the different localizations that gitea offers | 19:12 |
clarkb | it's basically doing a proper web crawl, but not throttling itself, and the way it does it causes us problems | 19:12 |
clarkb | We appear to be stable right now even though the crawler seems to still be running from other IPs | 19:13 |
*** diablo_rojo has joined #opendev-meeting | 19:13 | |
* diablo_rojo sneaks in late | 19:13 | |
clarkb | we're under that 4k connection limit and giteas seem happy. | 19:13 |
clarkb | The problem we're now faced with is how to address this more properly so that people who just want to clone nova from China aren't going to be blocked | 19:13 |
ianw | so it's currently manually applied config on haproxy node? | 19:13 |
clarkb | ianw: ya I did a for loop of iptables -I -j DROP -s $prefix | 19:14 |
clarkb | so a reboot or restart of our netfilter-persistent service will reset to our normal iptables ruleset | 19:14 |
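Roughly what that looked like, reconstructed (not the exact commands run; the prefix-list file name is invented and the chain is spelled out explicitly here):

```sh
# drop traffic from every AS4837 prefix at the load balancer
while read -r prefix; do
    iptables -I INPUT -s "$prefix" -j DROP
done < as4837-prefixes.txt

# to undo: reload the managed ruleset (or reboot the host)
systemctl restart netfilter-persistent
```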
ianw | cool; and does this thing have a specific UA string? | 19:14 |
fungi | we have no idea | 19:14 |
clarkb | ianw: good question | 19:14 |
clarkb | unfortunately gitea does not log UAs | 19:15 |
fungi | and haproxy can't see them | 19:15 |
clarkb | one idea I had was to tcpdump and then decrypt on gitea0X and see if we can sort that out | 19:15 |
clarkb | but was just trying to fight the fire earlier and haven't had time to really try ^ | 19:15 |
clarkb | because ya if this is a well-behaved bot maybe we can update/set robots.txt and be on our way | 19:15 |
corvus | i'll look into gitea options to log uas | 19:16 |
ianw | ok, i can probably help | 19:16 |
clarkb | https://docs.gitea.io/en-us/logging-configuration/#the-access_log_template implies we may be able to get that out of gitea actually | 19:16 |
fungi | it's worth checking, but my suspicion is that it's not going to be well-behaved or else it wouldn't be sourced from thousands of addresses across multiple service providers | 19:16 |
clarkb | corvus: thanks | 19:16 |
ianw | the traffic goes directly into gitea doesn't it, not via a reverse proxy? | 19:17 |
fungi | it acts like some crawler implemented on top of a botnet of compromised machines | 19:18 |
clarkb | corvus: reading that really quickly I think we want to change from default logger to access logger | 19:18 |
corvus | clarkb: i agree | 19:18 |
fungi | ianw: it's a layer 4 proxy | 19:18 |
clarkb | ianw: no its all through the load balancer | 19:18 |
fungi | ianw: oh, you mean at the backend... right, gitea's listening on the server's ip address directly, there's no apache handing off those connections via loopback | 19:19 |
clarkb | thinking out loud here: I think that while we're stable we should do the logging switch as that gives us more data | 19:19 |
ianw | sorry i'm thinking that we could put apache in front of gitea on each gitea node, and filter at that level | 19:19 |
corvus | filter how? | 19:19 |
clarkb | corvus: mod rewrite based on UA? | 19:19 |
corvus | (i mean, based on what criteria) | 19:19 |
ianw | via UA, if we find it misbehaving | 19:19 |
clarkb | assuming the UA is discernable | 19:19 |
ianw | yeah, and not obeying robots.txt | 19:19 |
fungi | i've seen discussions about similar crawlers, and if they're not obeying robots.txt they also are quite likely to use a random assortment of popular browser agent strings too | 19:20 |
clarkb | I like that. Basically improve our logging to check whether a robots.txt fix would work. If not, that will tell us if the UA is filterable, and if so we could add an apache to front the giteas | 19:20 |
clarkb | and that is all a reason to not further filter IPs since we're under the limits and happy but still have enough of those requests to be able to debug them further | 19:21 |
clarkb | then make decisions based on whatever that tells us | 19:21 |
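If the UA does turn out to be both discernible and filterable, the apache-in-front-of-gitea idea would look something like this sketch (ports, vhost, and the UA pattern are all assumptions, not a tested config):

```apache
<VirtualHost *:3000>
    # apache terminates the connection and hands off to gitea on a loopback port
    RewriteEngine On
    # reject anything matching the (hypothetical) crawler user-agent signature
    RewriteCond %{HTTP_USER_AGENT} "suspect-crawler-pattern" [NC]
    RewriteRule .* - [F]

    ProxyPass        / http://127.0.0.1:3001/
    ProxyPassReverse / http://127.0.0.1:3001/
</VirtualHost>
```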
corvus | johsom also mentioned we can limit by ip in haproxy | 19:22 |
fungi | and yes, improved logging out of gitea would be lovely. out of haproxy too... if we knew the ephemeral port haproxy sourced each forwarded socket from, we could map those to log entries from gitea | 19:22 |
corvus | so if none of the above works, doing that might be a slightly better alternative to iptables | 19:22 |
ianw | ++ would be good to encode in haproxy config | 19:22 |
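For reference, the haproxy-side limiting johsom mentioned is usually done with a stick table keyed on source address; a hedged sketch (the table size, expiry, and threshold are invented numbers):

```
frontend git_https
    bind *:443
    mode tcp
    # track concurrent connections per source IP and reject the excess
    stick-table type ip size 200k expire 5m store conn_cur
    tcp-request connection track-sc0 src
    tcp-request connection reject if { sc0_conn_cur gt 20 }
    default_backend gitea_https
```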
fungi | currently haproxy doesn't tell us what the source port for its forwarded socket was, just the client's source port, so we've got a blind spot even with improved gitea logging | 19:23 |
ianw | what is our robots.txt situation; i get a 404 for https://opendev.org/robots.txt | 19:23 |
clarkb | fungi: https://www.haproxy.com/blog/haproxy-log-customization/ we can do that too looks like | 19:23 |
clarkb | ianw: I want to say its part of our docker image? | 19:24 |
clarkb | ah I think we can set it in our custom dir and it would serve it, but we must not be doing that | 19:25 |
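If the crawler does respect robots.txt, something like the following served from that custom dir would be the low-effort fix (the paths and delay are illustrative only, not a proposed policy):

```
User-agent: *
Crawl-delay: 2
# keep crawlers out of the deep per-commit links that were being hammered
Disallow: /*/*/commit/
Disallow: /*/*/compare/
Disallow: /*/*/blame/
```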
fungi | %bi provides "backend source IP (HAProxy connects with)" but maybe that includes the source port number | 19:25 |
corvus | #link https://review.opendev.org/738684 Enable access log in gitea | 19:25 |
clarkb | fungi: %bp is the port | 19:25 |
fungi | oh, duh, that was the next line below %bi and i totally missed it | 19:26 |
fungi | thanks | 19:26 |
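A hedged example of the kind of log-format line that would add the backend-side source address and port next to the client's, using the standard TCP log variables (the surrounding config and exact field order are assumptions):

```
defaults
    mode tcp
    # the stock TCP log fields, plus %bi:%bp -- the address:port haproxy
    # connects to the backend from, which is what gitea sees as the client
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc/%sq/%bq %bi:%bp"
```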
clarkb | corvus: note we may need log rotation for those files | 19:27 |
clarkb | corvus: looks like we could have it interleave with the regular log if we want (then journald/dockerd deal with rotation?) | 19:27 |
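For reference, the switch being discussed lives in gitea's [log] section; roughly this (the template shown is approximately the documented default, which already includes the user agent; exact keys and defaults may differ between gitea versions):

```ini
[log]
ENABLE_ACCESS_LOG = true
; send the access log to the console so dockerd/journald handle rotation,
; or to a file if we want it separate from the regular log
ACCESS = console
ACCESS_LOG_TEMPLATE = {{.Ctx.RemoteAddr}} - {{.Identity}} {{.Start.Format "[02/Jan/2006:15:04:05 -0700]" }} "{{.Ctx.Req.Method}} {{.Ctx.Req.URL.RequestURI}} {{.Ctx.Req.Proto}}" {{.ResponseWriter.Status}} {{.ResponseWriter.Size}} "{{.Ctx.Req.Referer}}" "{{.Ctx.Req.UserAgent}}"
```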
fungi | yeah, so with the added logging in gitea and haproxy we'll be able to map any request back to an actual client ip address | 19:27 |
fungi | that will be a huge help | 19:27 |
clarkb | ++ | 19:27 |
clarkb | fungi: would you like to do the haproxy side or should we find another volunteer? | 19:27 |
fungi | i'm already looking into it | 19:28 |
clarkb | awesome, thanks | 19:28 |
clarkb | Anything else we want to bring up on the subject of gitea, haproxy, or opendev? | 19:28 |
clarkb | I think this gives us a number of good next steps but am open to more ideas. Otherwise we can continue the meeting | 19:28 |
fungi | i just want to make it clear that even though we blocked access from china unicom's address space, we don't have any reason to believe they're a responsible party in this situation | 19:29 |
fungi | they're a popular isp who happens to have many customers in a place where pirated operating systems which can never receive security fixes are standard protocol, and so the majority of compromised hosts in large botnets tend to be on ip addresses of such isps | 19:30 |
clarkb | #topic Update Config Management | 19:31 |
*** openstack changes topic to "Update Config Management (Meeting topic: infra)" | 19:31 | |
clarkb | we've been iterating on having ze01 run off of the zuul-executor docker image | 19:32 |
clarkb | frickler turned it off again today for a reason I've yet to fully look into due to the gitea issues | 19:32 |
fungi | i saw some mention of newly discovered problems, yeah, but got sideswiped by other bonfires | 19:32 |
clarkb | looks like it was some sort of iptables issue. We've actually seen that issue before on non-container executor jobs as well I think | 19:33 |
clarkb | but in this case they were all on ze01 so it was thought we should turn it off | 19:33 |
ianw | i had a quick look at that ... it was very weird and an ansible error that "stdout was not available in the dict instance" | 19:33 |
clarkb | we attempt to persist firewall rules on the remote host and do an iptables save for that | 19:33 |
clarkb | ianw: ya we've had that error before then it went away | 19:34 |
frickler | there were "MODULE FAILURE" errors in the job logs | 19:34 |
clarkb | I'm guessing some sort of ansible/iptables bug and maybe the container is able to reproduce it reliably | 19:34 |
ianw | basically a variable made with register: on a command: somehow seemed to not have stdout | 19:34 |
clarkb | (due to a timing issue or set of tooling etc) | 19:34 |
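To make the failure mode concrete, the pattern being described is roughly this (an illustrative sketch, not the actual persistent-firewall role; the destination path and second task are assumptions):

```yaml
# the register:ed result is what later tasks read .stdout from; if the module
# itself fails to run, that key never exists, which produces the
# "stdout was not available" error seen in the logs
- name: List current ipv4 rules
  command: iptables-save
  register: iptables_rules

- name: Persist the rules
  become: true
  copy:
    content: "{{ iptables_rules.stdout }}\n"
    dest: /etc/iptables/rules.v4
```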
frickler | and then I found lots of "ModuleNotFoundError: No module named \'gear\'" on ze01 | 19:34 |
frickler | and assumed some relation, though I didn't dig further | 19:34 |
clarkb | got it. So possibly two separate or related issues that should be looked into | 19:35 |
clarkb | thanks for the notes | 19:35 |
ianw | yeah, it was like it was somehow running in a different process or something | 19:35 |
clarkb | mordred isn't around today otherwise he'd probably have ideas. Maybe we can sit on this for a bit until mordred can debug? | 19:36 |
clarkb | though if someone else would like to feel free | 19:36 |
frickler | the stdout error was just because it was trying to look at the output of the failed module | 19:36 |
corvus | is there a pointer to the error somewhere | 19:36 |
frickler | see the logs in #openstack-infra | 19:37 |
clarkb | https://fa41114c73dc4ffe3f14-2bb0e09cfc1bf1e619272dff8ccf0e99.ssl.cf2.rackcdn.com/738557/2/check/tripleo-ci-centos-8-containers-multinode/7cdd1b2/job-output.txt was linked there | 19:37 |
clarkb | and shows the module failure for iptables saving | 19:37 |
corvus | clarkb: thanks. i do a lot better with "here's a link to a problem we don't understand" | 19:37 |
frickler | and the failure on ze01 appeared very often | 19:39 |
ianw | frickler: oh interesting; "persistent-firewall: List current ipv4 rules" shows up as OK in the console log, but seems like it was not OK | 19:39 |
frickler | ianw: there were two nodes, one passed the other failed | 19:40 |
corvus | why do we think that's related to the executor? | 19:40 |
ianw | to my eye, they both look OK in the output @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json | 19:41 |
frickler | corvus: because of "ModuleNotFoundError: No module named \'gear\'" in the executor log | 19:41 |
frickler | corvus: that may be a different thing, but it looked similar | 19:41 |
ianw | https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#34529 in particular | 19:41 |
clarkb | ya I think we may have two separate issues. The gear thing is probably related to the container image but the iptables thing I'm not sure | 19:42 |
frickler | corvus: together with this seeming a new issue and ze01 being changed yesterday, that was enough hints for me | 19:42 |
corvus | the gear issue didn't cause that job to fail though, right? | 19:42 |
clarkb | corvus: unless that causes post_failure? I'm not sure if the role is set up to fail on that or not | 19:42 |
corvus | clarkb: that was a retry_limit: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6 | 19:43 |
corvus | centos8 | 19:43 |
frickler | ianw: the failure is later: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#76130 | 19:43 |
corvus | it sounds like there's perhaps a non-critical error on the executor with a missing gear package, but i don't think that should cause jobs to fail | 19:44 |
corvus | separately, there are lots of jobs retrying because of the centos8-tripleo issues | 19:44 |
ianw | frickler: yeah @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#37597 | 19:45 |
ianw | but all the expected output is there | 19:45 |
ianw | anyway, we can probably debug outside the meeting | 19:46 |
clarkb | ++ lets continue afterwards | 19:46 |
clarkb | #topic General Topics | 19:46 |
*** openstack changes topic to "General Topics (Meeting topic: infra)" | 19:46 | |
clarkb | #topic DNS Cleanup | 19:46 |
*** openstack changes topic to "DNS Cleanup (Meeting topic: infra)" | 19:46 | |
corvus | https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary | 19:46 |
corvus | that's the task that caused the stdout error | 19:46 |
corvus | before we move on | 19:47 |
corvus | i'd like to understand what are the blockers for the executor | 19:47 |
clarkb | #undo | 19:47 |
openstack | Removing item from minutes: #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary | 19:47 |
clarkb | wait, what? that's not what I expected to be undone | 19:47 |
corvus | is it agreed that the only executor-related error is the (suspected non-fatal) missing gear package? | 19:47 |
clarkb | #undo | 19:47 |
openstack | Removing item from minutes: #topic DNS Cleanup | 19:47 |
clarkb | #undo | 19:47 |
openstack | Removing item from minutes: #topic General Topics | 19:47 |
clarkb | #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary caused iptables failure | 19:48 |
corvus | or am i missing something? | 19:48 |
clarkb | corvus: that is my understanding | 19:48 |
clarkb | gear is what needs addressing then we can turn ze01 back on? | 19:48 |
corvus | i suspect that would just cause us not to submit logstash jobs | 19:48 |
clarkb | that is my understanding as well | 19:48 |
corvus | cool, i'll work on adding that | 19:49 |
clarkb | #topic General Topics | 19:49 |
*** openstack changes topic to "General Topics (Meeting topic: infra)" | 19:49 | |
clarkb | #topic DNS Cleanup | 19:49 |
*** openstack changes topic to "DNS Cleanup (Meeting topic: infra)" | 19:49 | |
clarkb | I kept this on the agenda as a reminder that I meant to do a second pass of record removals and have not done that yet and things have been busy with fires | 19:49 |
clarkb | nothing else to add on this though | 19:49 |
clarkb | #topic Time to retire openstack-infra mailing list? | 19:50 |
*** openstack changes topic to "Time to retire openstack-infra mailing list? (Meeting topic: infra)" | 19:50 | |
clarkb | fungi: this was your topic want to quickly go over it ? | 19:50 |
clarkb | The last email to that list was on june 2 | 19:50 |
fungi | sure, just noting that the infra team has been supplanted by the tact sig, which claims (currently) to use the openstack-discuss ml like other sigs | 19:50 |
clarkb | and was from zbr who we can probably convince to email service-discuss or openstack-discuss depending on the context | 19:51 |
fungi | and as you've observed, communication levels on it are already low | 19:51 |
fungi | we've likely still got the address embedded in various places, like pypi package metadata in older releases at the very least, so if we do decide it's time to close it down i would forward that address to the openstack-discuss ml | 19:51 |
clarkb | I'm good with shutting it down and setting up the forward | 19:52 |
clarkb | it was never a very busy list anyway so unlikely to cause problems with the forward | 19:52 |
fungi | this was mainly an informal addition to the meeting topic just to get a feel for whether there are strong objections, it's not time yet, whatever | 19:52 |
fungi | next step would be for me to post to that ml with a proposed end date (maybe august 1?) and make sure there are no objections from subscribers | 19:53 |
clarkb | fungi: maybe send an email to that list with a proposed date a week or two in the future then just do it? | 19:53 |
clarkb | that way anyone still subbed will get a notification first | 19:53 |
frickler | seems fine for me | 19:53 |
fungi | ahh, okay, sure i could maybe say july 15 | 19:53 |
fungi | if folks don't think that's too quick | 19:54 |
clarkb | works for me | 19:54 |
fungi | anyway, not hearing objections, i'll go forth with the (hopefully final) ml thread | 19:54 |
clarkb | thanks! | 19:54 |
clarkb | #topic Grafana deployments from containers | 19:55 |
*** openstack changes topic to "Grafana deployments from containers (Meeting topic: infra)" | 19:55 | |
diablo_rojo | thanks! | 19:55 |
clarkb | #link https://review.opendev.org/#/q/status:open+topic:grafana-container | 19:55 |
clarkb | ianw: want to quickly update us on this subject? I know you need reviews (sorry too many fires) | 19:55 |
fungi | yes, i stuck it on the top of my review stack when i went to bed last night, and it only got buried as soon as i woke up :/ | 19:55 |
ianw | sorry, yeah basically grafana and graphite containers | 19:56 |
ianw | if people want to review, then i can try deploying them | 19:56 |
ianw | grafana should be fine, graphite i'll have to think about data migration | 19:56 |
clarkb | cool, thanks for working on that. It's on my todo list for when I get out from under my fires backlog | 19:56 |
ianw | (but it can sit as graphite.opendev.org for testing and while we do that, and then just switch dns at an appropriate time) | 19:57 |
clarkb | and that's basically all we had time for. | 19:58 |
clarkb | We didn't manage to get to every item on the agenda but the gitea brainstorm was really useful | 19:58 |
clarkb | Thanks everyone | 19:58 |
clarkb | feel free to bring up anything we missed in #opendev | 19:58 |
clarkb | #endmeeting | 19:58 |
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev" | 19:58 | |
openstack | Meeting ended Tue Jun 30 19:58:31 2020 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:58 |
openstack | Minutes: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.html | 19:58 |
openstack | Minutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.txt | 19:58 |
openstack | Log: http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.log.html | 19:58 |
fungi | thanks clarkb! | 19:59 |
*** tobiash has quit IRC | 20:06 | |
*** tobiash has joined #opendev-meeting | 20:07 | |
*** tobiash has quit IRC | 22:04 | |
*** tobiash has joined #opendev-meeting | 22:06 | |
*** hamalq_ has quit IRC | 23:38 |