Tuesday, 2020-06-30

*** hamalq has quit IRC00:36
*** diablo_rojo has quit IRC06:49
*** mordred has quit IRC12:04
*** mordred has joined #opendev-meeting12:11
*** hamalq has joined #opendev-meeting15:50
*** hamalq_ has joined #opendev-meeting15:52
*** hamalq has quit IRC15:55
-openstackstatus- NOTICE: Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options.18:21
clarkbanyone else here for the meeting?19:01
fungimore or less19:01
clarkbIt might be a little disorganized due to fires earlier in the day but I'll give it a go19:01
clarkb#startmeeting infra19:01
openstackMeeting started Tue Jun 30 19:01:32 2020 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
openstackUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
*** openstack changes topic to " (Meeting topic: infra)"19:01
openstackThe meeting name has been set to 'infra'19:01
clarkb#topic Announcements19:01
*** openstack changes topic to "Announcements (Meeting topic: infra)"19:01
ianw_ptoo/19:02
clarkbIf you hadn't noticed, our gitea installation was being ddos'd; it's under control now but only because we're blocking all of China Unicom19:02
clarkbwe can talk more about this shortly19:02
clarkbThe other thing I wanted to mention is I'm taking next week off and unlike ianw_pto I don't intend to be here for the meeting :)19:02
clarkbIf we're going to have a meeting next week we'll need someone to volunteer for running it19:03
clarkb#topic Actions from last meeting19:03
*** openstack changes topic to "Actions from last meeting (Meeting topic: infra)"19:03
clarkb#link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-23-19.01.txt minutes from last meeting19:03
clarkbThere were none19:04
clarkb#topic Specs approval19:04
*** ianw_pto is now known as ianw19:04
*** openstack changes topic to "Specs approval (Meeting topic: infra)"19:04
clarkbianw: oh no did the pto end?19:04
clarkb#link https://review.opendev.org/#/c/731838/ Authentication broker service19:04
clarkbGoing to continue to call this out and we did get a new patchset19:05
clarkbI should read it19:05
ianwheh yes was just yesterday19:05
clarkb#topic Priority Efforts19:05
*** openstack changes topic to "Priority Efforts (Meeting topic: infra)"19:05
clarkb#topic Opendev19:05
*** openstack changes topic to "Opendev (Meeting topic: infra)"19:06
clarkbLet's dive right in19:06
clarkbBefore we talk about the ddos I wanted to remind people that the advisory board will start moving forward at the end of this week19:06
clarkb#link http://lists.opendev.org/pipermail/service-discuss/2020-May/000026.html Advisory Board thread.19:06
clarkbwe've got a number of volunteers which is exciting19:06
clarkbAlso we had a gitea api issue with the v1.12.0 release19:07
clarkblong story short listing repos requires pagination now but the way the repos are listed from the db doesn't consistently produce a complete list19:07
clarkbwe worked around that with https://review.opendev.org/#/c/738109/ and I proposed an upstream change at https://github.com/go-gitea/gitea/pull/12057 which seems to fix it as well19:08
clarkbFor today's gitea troubles http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is a good illustration of what we saw19:08
clarkbbasically at ~midnight UTC today we immediately spiked to our haproxy connection limit19:08
clarkbafter digging around in gitea and haproxy logs it appears that there is a botnet doing a crawl of our gitea installation from many many many IP addresses, most of which belong to Chinese ISPs19:09
clarkbwhile doing that I noticed it appeared we had headroom to accept more connections so I proposed bumping that limit from 4k to 16k in haproxy (note the cacti number is 2x the haproxy number because haproxy has a connection on both the frontend and the backend for each logical connection)19:10
clarkbunfortunately our backends couldn't handle the new connections (which seemed to peak at about 8k logical connections)19:10
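(For reference, the limit being discussed is haproxy's maxconn setting; a minimal sketch of the change follows, with the section placement and values taken from the discussion rather than from our actual haproxy.cfg:)

    # sketch of the haproxy maxconn tuning discussed above; 4000 was the
    # original limit, 16000 was the bump that overloaded the gitea backends
    defaults
        maxconn 4000    # reverted from 16000 after the backends ran out of memory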
fungithis may be in part due to specific characteristics of the requests we were being hit with19:11
clarkbwe went from having slowness and the occasional error to more persistent errors as the giteas ran out of memory. I manually reverted the maxconn change and https://review.opendev.org/#/c/738679/1 is in the gate to revert it properly. Then I restarted all the giteas and things got better.19:11
clarkbAs part of recovery we also blocked all IPv4 ranges for China Unicom on the haproxy load balancer19:11
clarkbif we want to undo those drop rules we can restart the netfilter-persistent service on that host19:11
clarkbyes, the requests are looking at specific files and commits and checking them across the different localizations that gitea offers19:12
clarkbit's basically doing a proper web crawl, but not throttling itself, and the way it does it causes us problems19:12
clarkbWe appear to be stable right now even though the crawler seems to still be running from other IPs19:13
*** diablo_rojo has joined #opendev-meeting19:13
* diablo_rojo sneaks in late19:13
clarkbwe're under that 4k connection limit and giteas seem happy.19:13
clarkbThe problem we're now faced with is how to address this more properly so that people who just want to clone nova from China aren't blocked19:13
ianwso it's currently manually applied config on haproxy node?19:13
clarkbianw: ya I did a for loop of iptables -I -j DROP -s $prefix19:14
clarkbso a reboot or restart of our netfilter-persistent service will reset to our normal iptables ruleset19:14
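(A rough sketch of that manual block and its reset, as described above; prefixes.txt is a hypothetical file holding the AS4837 CIDR list, one per line:)

    # insert a DROP rule at the top of INPUT for each announced prefix
    while read -r prefix; do
        iptables -I INPUT -s "$prefix" -j DROP
    done < prefixes.txt

    # undo the manual rules later by reloading the persisted ruleset
    systemctl restart netfilter-persistent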
ianwcool; and does this thing have a specific UA string?19:14
fungiwe have no idea19:14
clarkbianw: good question19:14
clarkbunfortunately gitea does not log UAs19:15
fungiand haproxy can't see them19:15
clarkbone idea I had was to tcpdump and then decrypt on gitea0X and see if we can sort that out19:15
clarkbbut was just trying to fight the fire earlier and haven't had time to really try ^19:15
clarkbbecause ya if this is a well behaved bot maybe we can update/set robots.txt and be on our way19:15
corvusi'll look into gitea options to log uas19:16
ianwok, i can probably help19:16
clarkbhttps://docs.gitea.io/en-us/logging-configuration/#the-access_log_template implies we may be able to get that out of gitea actually19:16
fungiit's worth checking, but my suspicion is that it's not going to be well-behaved or else it wouldn't be sourced from thousands of addresses across multiple service providers19:16
clarkbcorvus: thanks19:16
ianwthe traffic goes directly into gitea doesn't it, not via a reverse proxy?19:17
fungiit acts like some crawler implemented on top of a botnet of compromised machines19:18
clarkbcorvus: reading that really quickly I think we want to change from default logger to access logger19:18
corvusclarkb: i agree19:18
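(A sketch of the app.ini change being discussed, based on the gitea logging doc linked above; the template shown follows the documented access log format and is an assumption, not necessarily what gets deployed:)

    [log]
    ; enable the separate access logger so each request, including its
    ; User-Agent, is recorded; the default logger does not include the UA
    ENABLE_ACCESS_LOG = true
    ACCESS_LOG_TEMPLATE = {{.Ctx.RemoteAddr}} - {{.Identity}} {{.Start.Format "[02/Jan/2006:15:04:05 -0700]" }} "{{.Ctx.Req.Method}} {{.Ctx.Req.URL.RequestURI}} {{.Ctx.Req.Proto}}" {{.ResponseWriter.Status}} {{.ResponseWriter.Size}} "{{.Ctx.Req.Referer}}" "{{.Ctx.Req.UserAgent}}"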
fungiianw: it's a layer 4 proxy19:18
clarkbianw: no its all through the load balancer19:18
fungiianw: oh, you mean at the backend... right, gitea's listening on the server's ip address directly, there's no apache handing off those connections via loopback19:19
clarkbthinking out loud here: I think that while we're stable we should do the logging switch as that gives us more data19:19
ianwsorry i'm thinking that we could put apache in front of gitea on each gitea node, and filter at that level19:19
corvusfilter how?19:19
clarkbcorvus: mod rewrite based on UA?19:19
corvus(i mean, based on what criteria)19:19
ianwvia UA, if we find it misbehaving19:19
clarkbassuming the UA is discernible19:19
ianwyeah, and not obeying robots.txt19:19
fungii've seen discussions about similar crawlers, and if they're not obeying robots.txt they also are quite likely to use a random assortment of popular browser agent strings too19:20
clarkbI like that. Basically improve our logging to check whether a robots.txt fix will do it. If not, that will tell us if the UA is filterable, and if so we could add an apache to front the giteas19:20
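(If it comes to that, the apache-in-front idea could look roughly like this on each backend; "BadBot" is a placeholder for whatever UA the new logs reveal, and the gitea port and omitted TLS directives are assumptions:)

    # sketch: reverse proxy on each gitea node that rejects a known-bad
    # crawler UA before the request reaches gitea (TLS config omitted)
    <VirtualHost *:443>
        RewriteEngine On
        RewriteCond %{HTTP_USER_AGENT} "BadBot" [NC]
        RewriteRule .* - [F,L]

        ProxyPass        / http://127.0.0.1:3000/
        ProxyPassReverse / http://127.0.0.1:3000/
    </VirtualHost>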
clarkband that is all a reason to not further filter IPs since we're under the limits and happy but still have enough of those requests to be able to debug them further19:21
clarkbthen make decisions based on whatever that tells us19:21
corvusjohsom also mentioned we can limit by ip in haproxy19:22
fungiand yes, improved logging out of gitea would be lovely. out of haproxy too... if we knew the ephemeral port haproxy sourced each forwarded socket from, we could map those to log entries from gitea19:22
corvusso if none of the above works, doing that might be a slightly better alternative to iptables19:22
ianw++ would be good to encode in haproxy config19:22
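(For reference, per-IP limiting in haproxy along the lines johsom suggested could look roughly like this; the frontend name and thresholds are only examples:)

    # sketch: track per-source connections in a stick-table at the TCP
    # frontend and reject sources above an arbitrary threshold
    frontend balance_git_https
        stick-table type ip size 1m expire 10m store conn_cur,conn_rate(10s)
        tcp-request connection track-sc0 src
        tcp-request connection reject if { sc0_conn_cur gt 100 }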
fungicurrently haproxy doesn't tell us what the source port for its forwarded socket was, just the client's source port, so we've got a blind spot even with improved gitea logging19:23
ianwwhat is our robots.txt situation; i get a 404 for https://opendev.org/robots.txt19:23
clarkbfungi: https://www.haproxy.com/blog/haproxy-log-customization/ we can do that too looks like19:23
clarkbianw: I want to say its part of our docker image?19:24
clarkbah I think we can set it in our custom dir and it would serve it, but we must not be doing that currently19:25
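(If the crawler turns out to be well behaved, a minimal robots.txt served from that custom dir might be enough; the delay and paths below are only illustrative:)

    # ask polite crawlers to slow down and skip the expensive per-commit views
    User-agent: *
    Crawl-delay: 2
    Disallow: /*/*/commit/
    Disallow: /*/*/blame/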
fungi%bi provides "backend source IP (HAProxy connects with)" but maybe that includes the source port number19:25
corvus#link https://review.opendev.org/738684 Enable access log in gitea19:25
clarkbfungi: %bp is the port19:25
fungioh, duh, that was the next line below %bi and i totally missed it19:26
fungithanks19:26
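(A sketch of the haproxy log-format change being discussed, based on the customization doc linked above; this assumes the standard TCP log format with the backend-side source ip:port, %bi:%bp, appended so haproxy entries can be matched against gitea's access log:)

    # sketch: record the address/port haproxy used toward the backend, so
    # each haproxy log line maps to a gitea access log line
    defaults
        log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq %bi:%bp"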
clarkbcorvus: note we may need log rotation for those files19:27
clarkbcorvus: looks like we could have it interleave with the regular log if we want (then journald/dockerd deal with rotation?)19:27
fungiyeah, so with the added logging in gitea and haproxy we'll be able to map any request back to an actual client ip address19:27
fungithat will be a huge help19:27
clarkb++19:27
clarkbfungi: would you like to do the haproxy side or should we find another volunteer?19:27
fungii'm already looking into it19:28
clarkbawesome, thanks19:28
clarkbAnything else we want to bring up on the subject of gitea, haproxy, or opendev?19:28
clarkbI think this gives us a number of good next steps but am open to more ideas. Otherwise we can continue the meeting19:28
fungii just want to make it clear that even though we blocked access from china unicom's address space, we don't have any reason to believe they're a responsible party in this situation19:29
fungithey're a popular isp who happens to have many customers in a place where pirated operating systems which can never receive security fixes are standard protocol, and so the majority of compromised hosts in large botnets tend to be on ip addresses of such isps19:30
clarkb#topic Update Config Management19:31
*** openstack changes topic to "Update Config Management (Meeting topic: infra)"19:31
clarkbwe've been iterating on having ze01 run off of the zuul-executor docker image19:32
clarkbfrickler turned it off again today for a reason I've yet to fully look into due to the gitea issues19:32
fungii saw some mention of newly discovered problems, yeah, but got sideswiped by other bonfires19:32
clarkblooks like it was some sort of iptables issue. We've actually seen that issue before on non container executor jobs as well I think19:33
clarkbbut in this case they were all on ze01 so it was thought we should turn it off19:33
ianwi had a quick look at that ... it was very weird and an ansible error that "stdout was not available in the dict instance"19:33
clarkbwe attempt to persist firewall rules on the remote host and do an iptables save for that19:33
clarkbianw: ya we've had that error before then it went away19:34
fricklerthere were "MODULE FAILURE" errors in the job logs19:34
clarkbI'm guessing some sort of ansible/iptables bug and maybe the container is able to reproduce it reliably19:34
ianwbasically a variable made with register: on a command: somehow seemed to not have stdout19:34
clarkb(due to a timing issue or set of tooling etc)19:34
fricklerand then I found lots of "ModuleNotFoundError: No module named \'gear\'" on ze0119:34
fricklerand assumed some relation, though I didn't dig further19:34
clarkbgot it. So possibly two separate or related issues that should be looked into19:35
clarkbthanks for the notes19:35
ianwyeah, it was like it was somehow running in a different process or something19:35
clarkbmordred isn't around today otherwise he'd probably have ideas. Maybe we can sit on this for a bit until mordred can debug?19:36
clarkbthough if someone else would like to feel free19:36
fricklerthe stdout error was just because it was trying to look at the output of the failed module19:36
corvusis there a pointer to the error somewhere19:36
fricklersee the logs in #openstack-infra19:37
clarkbhttps://fa41114c73dc4ffe3f14-2bb0e09cfc1bf1e619272dff8ccf0e99.ssl.cf2.rackcdn.com/738557/2/check/tripleo-ci-centos-8-containers-multinode/7cdd1b2/job-output.txt was linked there19:37
clarkband shows the module failure for iptables saving19:37
corvusclarkb: thanks.  i do a lot better with "here's a link to a problem we don't understand"19:37
fricklerand the failure on ze01 appeared very often19:39
ianwfrickler: oh interesting; "persistent-firewall: List current ipv4 rules" shows up as OK in the console log, but seems like it was not OK19:39
fricklerianw: there were two nodes, one passed the other failed19:40
corvuswhy do we think that's related to the executor?19:40
ianwto my eye, they both look OK in the output @  https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json19:41
fricklercorvus: because of "ModuleNotFoundError: No module named \'gear\'" in the executor log19:41
fricklercorvus: that may be a different thing, but it looked similar19:41
ianwhttps://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#34529 in particular19:41
clarkbya I think we may have two separate issues. The gear thing is probably related to the container image but the iptables thing I'm not sure19:42
fricklercorvus: together with this seeming a new issue and ze01 being changed yesterday, that was enough hints for me19:42
corvusthe gear issue didn't cause that job to fail though, right?19:42
clarkbcorvus: unless that causes post_failure? I'm not sure if the role is set up to fail on that or not19:42
corvusclarkb: that was a retry_limit: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b619:43
corvuscentos819:43
fricklerianw: the failure is later: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#7613019:43
corvusit sounds like there's perhaps a non-critical error on the executor with a missing gear package, but i don't think that should cause jobs to fail19:44
corvusseparately, there are lots of jobs retrying because of the centos8-tripleo issues19:44
ianwfrickler: yeah @ https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/log/job-output.json#3759719:45
ianwbut all the expected output is there19:45
ianwanyway, we can probably debug outside the meeting19:46
clarkb++ lets continue afterwards19:46
clarkb#topic General Topics19:46
*** openstack changes topic to "General Topics (Meeting topic: infra)"19:46
clarkb#topic DNS Cleanup19:46
*** openstack changes topic to "DNS Cleanup (Meeting topic: infra)"19:46
corvushttps://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary19:46
corvusthat's the task that caused the stdout error19:46
corvusbefore we move on19:47
corvusi'd like to understand what are the blockers for the executor19:47
clarkb#undo19:47
openstackRemoving item from minutes: #link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary19:47
clarkbwait, what? that's not what I expected to be undone19:47
corvusis it agreed that the only executor-related error is the (suspected non-fatal) missing gear package?19:47
clarkb#undo19:47
openstackRemoving item from minutes: #topic DNS Cleanup19:47
clarkb#undo19:47
openstackRemoving item from minutes: #topic General Topics19:47
clarkb#link https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary caused iptables failure19:48
corvusor am i missing something?19:48
clarkbcorvus: that is my understanding19:48
clarkbgear is what needs addressing then we can turn ze01 back on?19:48
corvusi suspect that would just cause us not to submit logstash jobs19:48
clarkbthat is my understanding as well19:48
corvuscool, i'll work on adding that19:49
clarkb#topic General Topics19:49
*** openstack changes topic to "General Topics (Meeting topic: infra)"19:49
clarkb#topic DNS Cleanup19:49
*** openstack changes topic to "DNS Cleanup (Meeting topic: infra)"19:49
clarkbI kept this on the agenda as a reminder that I meant to do a second pass of record removals and have not done that yet and things have been busy with fires19:49
clarkbnothing else to add on this though19:49
clarkb#topic Time to retire openstack-infra mailing list?19:50
*** openstack changes topic to "Time to retire openstack-infra mailing list? (Meeting topic: infra)"19:50
clarkbfungi: this was your topic, want to quickly go over it?19:50
clarkbThe last email to that list was on june 219:50
fungisure, just noting that the infra team has been supplanted by the tact sig, which claims (currently) to use the openstack-discuss ml like other sigs19:50
clarkband was from zbr who we can probably convince to email service-discuss or openstack-discuss depending on the context19:51
fungiand as you've observed, communication levels on it are already low19:51
fungiwe've likely still got the address embedded in various places, like pypi package metadata in older releases at the very least, so if we do decide it's time to close it down i would forward that address to the openstack-discuss ml19:51
clarkbI'm good with shutting it down and setting up the forward19:52
clarkbit was never a very busy list anyway so unlikely to cause problems with the forward19:52
fungithis was mainly an informal addition to the meeting topic just to get a feel for whether there are strong objections, it's not time yet, whatever19:52
funginext step would be for me to post to that ml with a proposed end date (maybe august 1?) and make sure there are no objections from subscribers19:53
clarkbfungi: maybe send an email to that list with a proposed date a week or two in the future then just do it?19:53
clarkbthat way anyone still subbed will get a notification first19:53
fricklerseems fine for me19:53
fungiahh, okay, sure i could maybe say july 1519:53
fungiif folks don't think that's too quick19:54
clarkbworks for me19:54
fungianyway, not hearing objections, i'll go forth with the (hopefully final) ml thread19:54
clarkbthanks!19:54
clarkb#topic Grafana deployments from containers19:55
*** openstack changes topic to "Grafana deployments from containers (Meeting topic: infra)"19:55
diablo_rojothanks!19:55
clarkb#link https://review.opendev.org/#/q/status:open+topic:grafana-container19:55
clarkbianw: want to quickly update us on this subject? I know you need reviews (sorry too many fires)19:55
fungiyes, i stuck it on the top of my review stack when i went to bed last night, and it only got buried as soon as i woke up :/19:55
ianwsorry, yeah basically grafana and graphite containers19:56
ianwif people want to review, then i can try deploying them19:56
ianwgrafana should be fine, graphite i'll have to think about data migration19:56
clarkbcool, thanks for working on that. It's on my todo list for when I get out from under my fires backlog19:56
ianw(but it can sit as graphite.opendev.org for testing and while we do that, and then just switch dns at an appropriate time)19:57
clarkband that's basically all we had time for.19:58
clarkbWe didn't manage to get to every item on the agenda but the gitea brainstorm was really useful19:58
clarkbThanks everyone19:58
clarkbfeel free to bring up anything we missed in #opendev19:58
clarkb#endmeeting19:58
*** openstack changes topic to "Incident management and meetings for the OpenDev sysadmins; normal discussions are in #opendev"19:58
openstackMeeting ended Tue Jun 30 19:58:31 2020 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:58
openstackMinutes:        http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.html19:58
openstackMinutes (text): http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.txt19:58
openstackLog:            http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-06-30-19.01.log.html19:58
fungithanks clarkb!19:59
*** tobiash has quit IRC20:06
*** tobiash has joined #opendev-meeting20:07
*** tobiash has quit IRC22:04
*** tobiash has joined #opendev-meeting22:06
*** hamalq_ has quit IRC23:38

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!