Tuesday, 2020-06-30

00:04 *** ryohayakawa has joined #opendev
00:54 *** factor has quit IRC
00:54 *** factor has joined #opendev
01:12 *** dtantsur has joined #opendev
01:12 *** dtantsur|afk has quit IRC
02:36 *** vblando has quit IRC
02:36 *** mnasiadka has quit IRC
02:37 *** Open10K8S has quit IRC
02:37 *** Open10K8S has joined #opendev
02:38 *** mnasiadka has joined #opendev
02:40 *** vblando has joined #opendev
04:01 *** ysandeep|away is now known as ysandeep
04:19 *** ykarel|away is now known as ykarel
04:47 <openstackgerrit> yatin proposed openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8  https://review.opendev.org/738435
05:41 *** ysandeep is now known as ysandeep|brb
06:07 *** DSpider has joined #opendev
06:11 *** ysandeep|brb is now known as ysandeep
06:49 *** diablo_rojo has quit IRC
07:16 *** bhagyashris|pto is now known as bhagyashris
07:24 *** hashar has joined #opendev
07:28 *** sshnaidm|afk is now known as sshnaidm|ruck
07:28 *** tosky has joined #opendev
07:29 *** bhagyashris is now known as bhagyashris|lunc
07:42 *** iurygregory has quit IRC
08:01 *** moppy has quit IRC
08:01 *** iurygregory has joined #opendev
08:01 *** moppy has joined #opendev
08:25 *** hiep_mq has joined #opendev
08:25 *** ykarel is now known as ykarel|lunch
08:27 *** jangutter has quit IRC
08:37 *** hiep_mq has quit IRC
08:37 *** bhagyashris|lunc is now known as bhagyashris
08:38 *** Eighth_Doctor has quit IRC
08:40 *** rchurch has quit IRC
08:44 *** rchurch has joined #opendev
08:52 *** Eighth_Doctor has joined #opendev
08:55 *** tkajinam has quit IRC
09:08 *** priteau has joined #opendev
09:14 *** sshnaidm|ruck has quit IRC
09:23 *** sshnaidm has joined #opendev
09:42 *** sshnaidm has quit IRC
09:49 *** ykarel|lunch is now known as ykarel
09:54 *** priteau has quit IRC
09:55 *** sshnaidm has joined #opendev
10:07 *** mugsie has quit IRC
10:10 *** mugsie has joined #opendev
10:30 *** ryohayakawa has quit IRC
10:47 *** hashar has quit IRC
10:53 *** ysandeep is now known as ysandeep|afk
11:00 <frickler> infra-root: most cacti graphs for gitea0X look weird starting around 0:30 GMT today
11:00 <frickler> e.g. http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66632&rra_id=all
11:13 <frickler> hmm, the docker log only shows the IP of the lb as client, makes it difficult to track things. not sure if that might be some ddos, even less how to mitigate if it was
11:13 <frickler> except it seems that cloudflare offers free protection for open source projects, but we'd have to discuss whether we'd want that
11:32 *** bhagyashris is now known as bhagyashris|brb
11:39 *** ysandeep|afk is now known as ysandeep
11:49 <openstackgerrit> Merged openstack/diskimage-builder master: Make ipa centos8 job non-voting  https://review.opendev.org/738481
12:04 *** mordred has quit IRC
12:09 *** mordred has joined #opendev
12:10 *** bhagyashris|brb is now known as bhagyashris
12:13 *** dtantsur is now known as dtantsur|brb
12:22 *** hashar has joined #opendev
12:45 *** priteau has joined #opendev
12:47 *** ysandeep is now known as ysandeep|afk
12:55 <ttx> corvus, fungi, clarkb: if you can review https://review.opendev.org/#/c/738187/ today, I'm around in the next 3 hours to watch it go through and fix it in case of weirdness
12:55 *** ysandeep|afk is now known as ysandeep
12:55 <ttx> tl;dr: a retry is better than convoluted conditionals
12:59 <AJaeger> ttx, +2A
13:02 <clarkb> frickler: that may indicate we've got another ddos happening against the service :/
13:07 <ttx> AJaeger: thx!
13:13 <clarkb> There are two vexxhost IPs that show up as particularly busy
13:14 <clarkb> oddly they are ipv4 addrs not ipv6. Neither address shows up in our nodepool logs, but I'm double checking that I'm grepping through rotated logs properly now
13:14 <clarkb> however I expect that our jobs would always use ipv6 not ipv4
13:14 <openstackgerrit> Merged zuul/zuul-jobs master: upload-git-mirror: use retries to avoid races  https://review.opendev.org/738187
13:15 <clarkb> ya I'm grepping rotated logs properly so these aren't any IPs of our own
13:15 <clarkb> mnaser: if you have a moment or can point us at who does I'd happily share IPs and see if we can work backward from there?
13:16 <fungi> the established connections graph for the lb is striking, can probably just see what the socket table looks like
13:17 <mnaser> clarkb: can you post them to me?  I wonder if it’s our Zuul.
13:18 <clarkb> fungi: that's basically what I did, grep by backend on load balancer and sort by unique IP
13:18 <smcginnis> I've noticed some services seem a bit slower this morning. A ddos could certainly explain that.
13:20 <clarkb> `sudo grep balance_git_https/gitea08.opendev.org /var/log/syslog | cut -d' ' -f 6 | sed -ne 's/\(:[0-9]\+\)$//p' | sort | uniq -c | sort | tail`
13:20 <clarkb> if other infra-root ^ want to do similar on the load balancer
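
(A sketch of running that same tally across every backend, assuming the backends are named gitea01..gitea08 and that, as in the one-liner above, field 6 of the haproxy syslog line holds the client ip:port:)

    for n in 01 02 03 04 05 06 07 08; do
      echo "== gitea${n} =="
      sudo grep "balance_git_https/gitea${n}.opendev.org" /var/log/syslog \
        | cut -d' ' -f 6 \
        | sed -ne 's/\(:[0-9]\+\)$//p' \
        | sort | uniq -c | sort -n | tail -5
    done
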
13:20 <fungi> clarkb: the address distribution looks fairly innocuous to me
13:20 <fungi> i was analyzing netstat -ln
13:20 <clarkb> fungi: there are two vexxhost IPs that have an order of magnitude more requests over that syslog
13:20 <clarkb> compared to other IPs
13:20 <clarkb> possible that it finally caught up and now our distribution is more normalized
13:21 <fungi> highest count for open sockets i saw was 66.187.233.202 which seems to be a nat at red hat
13:21 <clarkb> fungi: yes that one also shows up (but it has since we spun up the service since red hat insists on funneling all internets through a single IP)
13:21 <fungi> er, not netstat -ln, netstat -nt
13:22 <fungi> next most connections is from 124.92.132.123 at china unicom
13:23 <fungi> several other china unicom addresses in the current top count of open connections to the lb
13:24 <fungi> so far basically all of the ones i'm checking are, in fact
13:24 <fungi> netstat -nt|awk '{print $5}'|sed 's/:[0-9]*$//'|sort|uniq -c|sort -n
13:25 <fungi> the highest 8 are obviously connections to the backends
13:26 <fungi> the next 5 are china unicom
13:26 <fungi> (at the moment)
13:27 <clarkb> fungi: ya I was looking at it over time since the cacti graphs show it being a long term issue and in the past total number of connections over time when that has happened has correlated strongly with the issue
13:28 <clarkb> its also possible the IPs I identified are just noise as the actual problem is spread out over many more IPs and we're seeing that right now
13:30 <fungi> yeah, the distribution makes me wonder if the connection count is a symptom of something impacting cluster performance, not the cause
13:32 <clarkb> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is fun
13:32 <fungi> looking closer at the cacti graphs, note that the bandwidth utilization didn't really increase, or even spike (except on the loopback)
13:33 <fungi> so it could be something else is causing connections to take far longer to service, and they're all piling up
13:34 <fungi> cpu, load average and loopback connections all went up at the same time across all backends
13:34 <clarkb> fungi: ya thats what we saw before, basically really expensive requests that hang around
13:34 <clarkb> and then after a few hours they finally complete in gitea
13:35 <fungi> makes sense. not expensive operations resulting in large amounts of returned data over the network though
13:36 <clarkb> no its expensive for gitea to produce a result so its sitting in the background spinning its wheels for several hours before saying something
13:38 <fungi> yeah, i wonder how we could identify those
13:39 <clarkb> last time it was via docker logs on the web container and looking for completed requests with large times or started request entries and no corresponding completed line
13:39 <fungi> seems gitea is using threads, so can't really just go by process age like we could with the git http backend
13:40 <clarkb> I'm having a hard time doing that now
13:40 <clarkb> because the volume of logs is significant
13:42 <clarkb> ok finally getting somewhere with that. haven't seen any large requests, but I am noticing a lot of requests for specific commits and under specific languages
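
(The "expensive request" hunt described above, sketched as a one-off; it assumes the gitea-web router log still emits "Started ..." / "Completed ... in <time>" pairs and that the compose directory on a backend is /etc/gitea-docker — both assumptions, not verified here:)

    cd /etc/gitea-docker
    sudo docker-compose logs gitea-web \
      | grep 'Completed' \
      | grep -Ev 'in [0-9.]+(µs|ms)$' \
      | tail -20    # completed requests measured in seconds or minutes
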
13:42 <fungi> funny to hear folks on the conference talking about collecting haproxy telemetry while digging into this
13:42 <clarkb> that seems odd for us and implies maybe its a crawler
13:43 <clarkb> gitea doesn't seem to log user agent unfortunately or that might make it a bit easier to tell what was doing that if it is a bot
13:44 <fungi> and i don't suppose we have sufficient session identifiers to pick it out of the haproxy logs
13:44 <clarkb> and if it isn't closing those open connections we may not log them in the way I was looking in syslog (your live view would be a better indicator of that)
13:46 <clarkb> fungi: ya the gitea logging is maybe the next thing to look at in that space to make it easier to debug this class of problem. I believe it will log a forwarded-for ip if we were doing http(s) balancing on haproxy but we do tcp passthrough instead
13:46 <clarkb> adding user agent would be useful though
13:50 <clarkb> I think if we restart the containers we'll reset the log buffer which may make operating with the gitea logs slightly easier
14:00 *** mlavalle has joined #opendev
14:01 <clarkb> fungi: I'm trying to use my own IP as a bread crumb in the lb and gitea logs to see how we can more effectively correlate them
14:03 <fungi> if haproxy logs the ephemeral port from which it sources the forwarded socket, and gitea logs the "client" port number, we could likely create a mapping based on that, backend and approximate timestamps
14:05 <fungi> though what haproxy's putting in syslog looks like it includes the original client source port, but not the source port for the forwarded connection it creates to the backend
14:10 <openstackgerrit> Riccardo Pittau proposed openstack/diskimage-builder master: Convert multi line if statement to case  https://review.opendev.org/734479
14:14 *** dtantsur|brb is now known as dtantsur
14:17 *** ysandeep is now known as ysandeep|away
14:19 <clarkb> doing rough correlation using timestamps and looking for blocks where significant numbers of requests do the specific commit and file with lang param thing, I'm beginning to suspect the requests from those china unicom ranges
14:19 <clarkb> each one seems to be a new IP so doesn't show up in our busy IPs listings
14:20 *** mlavalle has quit IRC
14:20 <clarkb> this correlation isn't perfect though so not wanting to commit to that just yet, but am starting to think about how we might mitigate such a thing
14:22 <clarkb> what I'm noticing is that we seem to have capped out haproxy to ~8k concurrent connections but we're barely spinning the cpu or using any memory there
14:22 <clarkb> we may be able to mitigate this by allowing haproxy to handle far more connections. However, ulimit is set to a very large number for open files and it isn't clear to me where that limit may be
14:24 <fungi> haproxy is in a docker container now, so could it be the namespace having a separate, lower limit?
14:24 *** mlavalle has joined #opendev
14:28 <openstackgerrit> Clark Boylan proposed opendev/system-config master: Increase allowed number of haproxy connections  https://review.opendev.org/738635
14:28 <clarkb> fungi: nah its ^
14:28 <clarkb> I checked ulimit -a in the container
14:28 <fungi> aha!
14:29 <clarkb> I think looking at gitea01 we have headroom for maybe 6-8x the number of current requests
14:29 <clarkb> so I bump by 4x in that change and we can monitor and tune from there
14:30 <clarkb> this is still just a mitigation though doesn't really address the underlying problem (but maybe thats good enough for now?)
14:30 <fungi> yeah, i agree, system load is topping out around 25% of the cpu count, and cpu utilization is also around 25%
14:30 <clarkb> and we get connection on front and connection on back so 4000 * 2 = 8k total tcp conns
14:30 <fungi> memory pressure is low too
14:30 <clarkb> fungi: ya
14:32 <clarkb> and the lb itself is running super lean
14:32 <clarkb> it could probably do 64k connections on the lb itself
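
(For context, the knob 738635 raises is haproxy's maxconn; roughly, in haproxy.cfg terms — the numbers just restate the 4x bump discussed above, the real values live in the system-config template:)

    global
        maxconn 16000    # process-wide ceiling

    defaults
        maxconn 16000    # per-proxy default; effectively 4000 before the change
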
14:33 *** weshay_ruck has joined #opendev
14:33 <tristanC> Greetings, is there a place where we could download the diskimage used by opendev's zuul?
14:33 <clarkb> tristanC: yes https://nb01.opendev.org, https://nb02.opendev.org, https://nb04.opendev.org
14:33 <clarkb> all three of those build the images so you'll want to check which one has the most recent build of the image you want
14:34 <clarkb> er it might need /images ? /me checks
14:34 <clarkb> yes its the /images path at those hosts
14:34 <tristanC> clarkb: thanks, /images is what i was looking for
14:35 <clarkb> frickler: corvus mordred I think we start at https://review.opendev.org/#/c/738635/1 to see if we can make opendev.org happier
14:36 <clarkb> and separately continue to try and sort out if these requests are from a legit crawler and if so we may be able to set up a robots.txt and ask it to kindly go away
14:43 <clarkb> fungi: I guess we could do a packet capture then figure out decryption of it and then look for user agent headers
14:43 <clarkb> fungi: that seems like an after conference, after breakfast task if I'm going to tackle that though
14:45 <openstackgerrit> Merged zuul/zuul-jobs master: prepare-workspace: Add Role Variable in README.rst  https://review.opendev.org/737352
14:45 <fungi> yeah, with a copy of the ssl server key we could do offline decryption of a pcap
14:46 <fungi> though i'm out of practice, been a while since i did that
14:48 *** ykarel is now known as ykarel|away
14:50 *** sgw1 has quit IRC
14:52 *** sorin-mihai has joined #opendev
14:53 <corvus> clarkb, fungi: oh, i have scrollback to read; i'll do that and expect to be useful after breakfast in maybe 30m?
14:53 *** sorin-mihai__ has quit IRC
14:53 <clarkb> corvus: sounds good
14:54 <fungi> yeah, things are mostly working
14:55 *** sgw1 has joined #opendev
14:58 <corvus> ok caught up on scrollback, +3 738635; breakfast now then back
15:01 <openstackgerrit> Merged zuul/zuul-jobs master: Return upload_results in upload-logs-swift role  https://review.opendev.org/733564
15:10 <fungi> 738635 is going to need an haproxy container restart, right? hopefully that's fast
15:10 <clarkb> fungi: yes and yes its usually pretty painless
15:24 *** hashar is now known as hasharAway
15:47 <openstackgerrit> Merged opendev/puppet-openstackid master: Fixed permissions issues on SpammerProcess  https://review.opendev.org/717359
15:51 *** sshnaidm has quit IRC
15:53 *** sshnaidm has joined #opendev
15:58 *** hasharAway is now known as hashar
16:07 <clarkb> we're about half an hour from applying the haproxy config update. I'm going to find lunch
16:08 <clarkb> I guess its just breakfast at this point
16:08 <clarkb> yay early mornings
16:18 *** sshnaidm is now known as sshnaidm|ruck
16:40 <openstackgerrit> Merged opendev/system-config master: Increase allowed number of haproxy connections  https://review.opendev.org/738635
16:40 <fungi> and now we wait for the deploy to run
16:45 *** dtantsur is now known as dtantsur|afk
16:45 <fungi> looks like the deploy is wrapping up now
16:45 <fungi> and done
16:47 <fungi> once enough of us are around, i guess we can restart the container
16:49 <clarkb> fungi: I think it restarts automatically?
16:51 <clarkb> http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all seems to show that
16:51 <fungi> it didn't seem like it restarted, looking at ps
16:52 <fungi> start timestamp is over two weeks ago
16:52 <fungi> root     12495  0.0  0.1  19848 11176 ?        Ss   Jun11   1:22 haproxy -sf 6 -W -db -f /usr/local/etc/haproxy/haproxy.cfg
16:53 <fungi> weird
16:53 <fungi> maybe it reread its config live?
16:53 <clarkb> I didn't think it did but maybe
16:57 <fungi> no mention of any config reload in syslog
16:58 <clarkb> docker ps also shows the container didn't restart
16:58 <clarkb> maybe it is checking its config on the fly?
17:00 <clarkb> gitea01 connections, memory and cpu have gone up too but all at reasonable levels
17:00 <clarkb> fungi: I'm thinking we should probably restart the haproxy just to be sure, but it definitely seems to be running with that new config anyway
17:00 <clarkb> I'm checking the ansible now to see if we signal the process at all
17:01 <clarkb> we do: cmd: docker-compose kill -s HUP haproxy
17:01 <fungi> ahh, okay
17:01 <clarkb> thats the handler for updating our config so this is all good
17:01 <clarkb> no restart necessary
17:01 <fungi> yep, perfect
17:02 <clarkb> also if we were using all 16k conns on the front end I would expect 32k conns recorded in cacti
17:03 <fungi> agreed
17:03 <clarkb> we're "only" getting ~16k conns in cacti which implies to me we've now got enough capacity in haproxy to address the load
17:03 <fungi> waiting to see if there's fluctuation in that or if it looks capped still
17:03 <clarkb> as long as our backends stay happy this may be sufficient to work around the problem
17:03 <clarkb> and if so we can see if this persists as a low key ddos or maybe it will resolve on its own if the bots get the crawling done
17:04 <fungi> the new top out does seem to have some variation in it compared to before, so i have hopes that's representative of actual demand now
17:04 <clarkb> ya
17:08 <clarkb> gitea04 may not be super happy with the change. I'm wondering if that is where rh nat maps to
17:08 <clarkb> all the others look to be fine according to cacti
17:08 <sgw1> Hi There, was there an issue with opendev.org earlier today? Some folks were seeing:
17:08 <sgw1> fatal: unable to access 'https://opendev.org/starlingx/stx-puppet.git/': SSL received a record that exceeded the maximum permissible length.
17:08 <sgw1> and it was very slow
17:09 <clarkb> sgw1: yes we've been sorting that out today. Basically at about midnight UTC today we've come under what appears to be a ddos (not sure if intentional or not)
17:10 <clarkb> it looks like a web crawler bot out of china fetching all the commits and files out of our repos
17:10 <clarkb> but its doing so from many many many IPs
17:10 <clarkb> anyway what we just did was to bump up the connection limit on the load balancer and it appears that things would be happy with that though maybe one of our backends is not due to how we have to load balance by source IP
17:11 <clarkb> haproxy just decided that gitea04 is down due to response lag
17:12 <sgw1> clarkb: thanks for the info, I forwarded it back to the folks asking.
17:12 <clarkb> sgw1: they can always ask too :)
17:13 <sgw1> clarkb: while I am here, we are getting ready for branching our 4.0 release, I might try a test push to one repo, if I need to undo it for some reason, I will check in.
17:13 <clarkb> looks like haproxy decided gitea04 is back now
17:13 <clarkb> so it may sort of throttle itself ?
17:13 <sgw1> clarkb: not sure how to answer that other than not their style :-(
17:13 <clarkb> I'm not sure if this is better than the situation before where we told people to go away at the haproxy layer
17:14 <clarkb> RH nat is not on gitea04 fwiw
17:16 <clarkb> fungi: it seems like we've stabilized the incoming connection counts but we're slowly starting to get unhealthy backends
17:17 <clarkb> gitea01 is starting to swap now along with 03
17:17 <clarkb> er 04
17:17 <clarkb> ya may need a revert
17:19 <clarkb> we may be stabilizing but we're doing so at relatively unhappy states (not completely dead though)
17:19 <clarkb> maybe lets let it run for 15-30 minutes then see where we've ended up and go from there?
17:20 <clarkb> unfortunately my next best idea is start blocking large chunks of chinese IP space :/ and I'm worried about collateral damage
17:21 <fungi> yeah
17:21 <fungi> or we add more backends
17:22 <fungi> yeah, looks like cpu and load average on the backends is not scaling linearly with the increase in connection count
17:23 <fungi> we're getting 100% cpu with load averages around 16+
17:23 <clarkb> there are certain requests that cost more than others, its possible that the requests we're getting are intentionally expensive
17:23 <fungi> aha, it's memory pressure
17:23 <fungi> i think we're killing them with swapping
17:23 <clarkb> ya
17:23 <fungi> oh, and now i see in scrollback you already said that
17:24 <fungi> sorry, trying to juggle hvac contractor and eating my now cold lunch
17:24 <clarkb> sgw1: fwiw we'd love to have more people from involved projects help build the community resources. If there is any way we can help change the style of interaction that would be great
17:24 <clarkb> sgw1: we can't help if we can't even communicate with each other :/
17:27 <clarkb> fungi: internet tells me that AS is advertising 1236 prefixes
17:28 <clarkb> I suppose we could add all of those to firewall block rules?
17:28 <fungi> that would be... painful
17:28 <clarkb> ya but not having a working service is worse :?
17:29 <fungi> i wonder if we could programmatically aggregate a bunch of those
17:29 <clarkb> gitea01 is going to OOM soon I think
17:29 <clarkb> 04 and 02 are a bit more stable
17:29 <fungi> we should probably revert the connection limit increase for now? we could add more gitea backends
17:30 <clarkb> fungi: ya I'll do that manually really quickly
17:30 <fungi> i'll push up the actual revert change
17:30 <clarkb> manual application is done
17:31 <fungi> apparently even git replication to the gitea servers has been lagging
17:31 <fungi> oh, i'm getting 500 isr errors
17:33 <fungi> so i can't currently remote update my checkout
17:35 <clarkb> you should be able to update from gerrit
17:36 <openstackgerrit> Jeremy Stanley proposed opendev/system-config master: Revert "Increase allowed number of haproxy connections"  https://review.opendev.org/738679
17:36 *** slittle1 has joined #opendev
17:36 <fungi> reverted through gerrit's webui for now
17:36 <fungi> yeah, that's what i wound up doing
17:37 <slittle1> Just joined ...
17:37 <slittle1> is opendev's git server having issues ?
17:37 <clarkb> slittle1: yes we've had a ddos all day and we thought we could alleviate some of the pain for people and it ended up consuming too much memory and making things worse
17:37 <fungi> slittle1: well, it's a cluster of 8 git servers behind a load balancer, but yes, we're under a very large volume of git requests from random addresses in china unicom
17:37 <clarkb> fungi: radb gives me 986 prefixes
17:38 <clarkb> fungi: we could add 986 drop rules easily enough
17:38 <fungi> clarkb: yeah, doing that on gitea-lb01 for now is probably the bandaid we need while we look at better options
17:38 <clarkb> fungi: `whois -h whois.radb.net -- '-i origin AS4837' | grep ^route: | sed -e 's/^route:\s\+//' | sort -u | sort -n | wc -l` fwiw
17:39 <clarkb> drop the wc if you want to see them
17:39 <fungi> it's times like this i miss being able to just add one bgp filter rule to my border routers :/
17:39 <clarkb> I think we may need to restart gitea backends to get them happy again though
17:40 <clarkb> probably start with firewall update then restart backends?
17:40 <fungi> yes, they may not release their memory allocations without restarts
17:40 *** hashar is now known as hasharAway
17:40 <fungi> we ought to be able to rolling restart the backends, can disable them one by one in haproxy if we want
17:41 <clarkb> fungi: there is a more graceful way to do it to make sure that gerrit replication isn't impacted, but reboots may be worthwhile just in case anything got OOMKillered
17:41 <clarkb> but lets figure out the iptables rule changes first
17:41 <fungi> we can certainly just trigger a full replication once the restarts are done
17:44 <clarkb> fungi: `for X in $(cat ip_range_list) ; do sudo iptables -I openstack-INPUT -j DROP -s $X; done` ?
17:44 <clarkb> -s will take x.y.z.a/foo notation right?
17:44 *** xakaitetoia has joined #opendev
17:44 <fungi> none are v6 prefixes?
17:45 <clarkb> fungi: as far as I can tell it was all ipv4
17:45 <fungi> yes, cidr notation works fine with iptables
17:45 <fungi> and that command looks right
17:45 <clarkb> let me generate the list on the lb
17:45 <fungi> no whois installed there
17:45 <clarkb> whois not found
17:46 <clarkb> of course not
17:46 <clarkb> I'll scp it
17:46 <clarkb> fungi: gitea-lb01:/home/clarkb/china_unicom_ranges
17:47 <fungi> looks right
17:47 <fungi> also radb.net does seem to aggregate prefixes where possible, at least skimming the list
17:48 <fungi> i don't see any subsets included
17:49 <fungi> i say go for it. worst case we reboot the server via nova api if we lock ourselves out
17:50 <clarkb> fungi: `for X in $(cat china_unicom_ranges) ; do echo $X ; sudo iptables -I openstack-INPUT -j DROP -s $X ; done` is my command on lb01, does that still look good? made some small edits
17:51 <openstackgerrit> James E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx  https://review.opendev.org/738517
17:51 <fungi> clarkb: yep, that looks right
17:52 <clarkb> fungi: ok I'm running that now
17:52 <fungi> we could get fancy specifying destination ports, but there's little point as long as we don't lock ourselves out
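
(An alternative that keeps the rule count down to one is an ipset-backed rule — a sketch, assuming the prefix list is the china_unicom_ranges file mentioned above:)

    sudo ipset create unicom-block hash:net
    while read -r prefix; do
      sudo ipset add unicom-block "$prefix"
    done < china_unicom_ranges
    sudo iptables -I openstack-INPUT -m set --match-set unicom-block src -j DROP
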
17:52 <clarkb> thats done
17:53 <clarkb> I can create new connections to the server
17:54 <fungi> `sudo iptables -nL` looks like i would expect
17:54 <clarkb> lots of cD connections in haproxy logs from those ranges now
17:54 <clarkb> which is I think side effect of the iptables rules
17:54 <fungi> trust me you don't want to try that without -n
17:55 <clarkb> to reset iptables rules without rebooting we can restart the iptables persistent unit
17:55 <clarkb> I think it may be called netfilter-something now
17:56 <fungi> netfilter-persistent, yes
17:56 <clarkb> but giteas are still unhappy
17:56 <clarkb> I'm going to gracefully restart 01 and see if we need to reboot
17:57 <clarkb> basically we can always reboot after graceful restart
17:57 <smcginnis> Do we have general connectivity issues right now?
17:57 <smcginnis> re: https://review.opendev.org/#/c/738443/
17:57 <fungi> smcginnis: just the git servers
17:57 <smcginnis> Jobs all failing with connection failures.
17:57 <fungi> though they should be starting to recover nowish, we hope
17:58 <fungi> i can finally `git remote update` again
17:58 <clarkb> I just blocked all of china unicom from talking to them
17:58 <fungi> ~1k network prefixes
17:58 <smcginnis> OK, I'll recheck in a bit. At least the one I am looking at now, it was trying to get TC data from the gitea servers.
17:58 <fungi> i'll write up something to service announce and status notice
18:00 <clarkb> graceful restart isn't going so great
18:00 <clarkb> I'm still waiting on it to stop things
18:01 <fungi> infra-root: status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options.
18:01 <fungi> does that look reasonable?
18:01 <clarkb> fungi: yes
18:02 <smcginnis> Looks good to me as infra-nonroot too. :)
18:02 <clarkb> AS4134 may be necessary too
18:02 <clarkb> now that there is less noise in the logs i'm seeing a lot of traffic from there too :/
18:03 <corvus> clarkb, fungi: oh shoot, i missed that more stuff happened, sorry
18:03 <fungi> "chinanet"
18:03 <johnsom> Ah, this must be why my git clone is hanging after I send the TLS client hello
18:03 <clarkb> corvus: basically my read of resource headroom was completely wrong
18:03 <corvus> clarkb: on gitea or haproxy?
18:03 <clarkb> corvus: once we allowed more connections we spiralled out of control with memory on the gitea backends
18:03 <clarkb> corvus: gitea
18:03 <corvus> gotcha
18:04 <fungi> johnsom: i hope it's clearing up now but we're still evaluating how the temporary bandaid is working
18:04 <clarkb> haproxy could've done a lot more :)
18:04 <corvus> clarkb: is haproxy config reverted?
18:04 <clarkb> corvus: yes
18:04 <fungi> corvus: manually reverted and also proposed as https://review.opendev.org/738679
18:04 <clarkb> corvus: https://review.opendev.org/#/c/738679/1 is proposed and I manually reverted
18:05 <johnsom> Let me know if I can consult on haproxy configurations. I might know a thing or two about the subject. grin
18:05 <corvus> clarkb: what about a smaller bump?  6k?
18:05 <fungi> johnsom: at the moment it's just serving as a layer 4 proxy, so not much flexibility
18:05 <clarkb> johnsom: well earlier today we had maxconn set to 4000 which resulted in errors because haproxy wouldn't allow new connections. I evaluated resource use on our backends and thought we had head room to go to ~16k but was wrong
18:06 <clarkb> corvus: based on cacti data we were only doing ~8k connections
18:06 <clarkb> corvus: 6k might be ok but we may be really close to that limit
18:06 <corvus> oh.  so maybe 4500 :)
18:06 <clarkb> but also even at that limit we still have user noticeable outages
18:06 <clarkb> because haproxy won't handshake with them before they timeout
18:07 <corvus> clarkb: that happens at 4k?
18:07 <clarkb> corvus: haproxy won't complete handshakes until earlier connections finish. This means occasionally your client will fail. This is how it was originally reported to us earlier today
18:09 <clarkb> sometimes it would just be slow
18:09 <corvus> clarkb: right, so getting more gitea connections available to haproxy was the expected solution.  but doing that overran gitea servers.
18:09 <clarkb> yes
18:10 <corvus> so do you think we're hitting a gitea limit?  it would be nice to use more than 25% ram and have a higher load average than 2.0.
18:10 <clarkb> I half expect that the requests being made require lots of memory because they are being made across repos, files, and commits
18:10 <corvus> 50% and a load average of 8 would be ideal, i'd think :/
18:11 <clarkb> and ya tuning to a sweet spot is a good thing, but I expect we'll still be degraded if we do that without the firewall rules
18:11 <clarkb> I finally got gitea01 to stop its gitea stuff
18:11 <clarkb> I'll reboot it now?
18:12 <clarkb> it did get OOMkillered ~15 minutes ago so a reboot seems like a good idea
18:12 <clarkb> (then iterate through the list and do that)
18:12 <johnsom> Well, with L4, you should be able to handle ~29,000 connections per gigabyte allocated to haproxy. But, this is very version dependent and data volume may impact the CPU load. As you scale connections you also need to make sure you tune the open files available as well, which with systemd gets to be tricky.
18:12 <clarkb> johnsom: yes haproxy is not the problem
18:13 <clarkb> johnsom: the issue is allowing this many connections to the backends causes them to run out of memory
18:13 <fungi> johnsom: yeah, we're really not seeing any issues with haproxy, it's the backends which are being overrun
18:13 <johnsom> Ok. Would rate limiting at haproxy help, is there a cidr or such that could be used for rate limiting?
18:13 <clarkb> johnsom: yes, thats what the maxconn limit is effectively doing for us
18:14 <clarkb> johnsom: can we rate limit by cidr in haproxy and then remove our iptables rules?
18:14 <fungi> johnsom: we saw connections scattered across probably hundreds of prefixes from the same as
18:14 <clarkb> fungi: and now a second AS
18:14 <clarkb> I don't see any objections. I'm rebooting gitea01 now
18:14 <corvus> clarkb: ++
18:14 <fungi> clarkb: oh, yes, please do reboot it
18:15 <johnsom> Ugh, ok. Yeah, you can do pretty complex rate limiting with haproxy, beyond just maxconn. But it sounds like this may not be easy to identify the bad actors.
18:15 <fungi> it seems to be some sort of indexing spider
18:15 <johnsom> clarkb Here are some nice examples: https://www.haproxy.com/blog/four-examples-of-haproxy-rate-limiting/
18:16 <fungi> but a very distributed one operating on the down-low
18:16 <clarkb> gitea01 is done and up and looks ok
18:16 <clarkb> I'll work through 02-08 in sequence
18:17 <clarkb> other things we may want to look at are AS4134 (double check that against current haproxy logs) and maybe using haproxy to rate limit instead of iptables drop rules
18:17 <fungi> at the volume we've been seeing, the rate limit will be no different than dropping at the firewall, other than it will start allowing connections again once the event passes
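
(For reference, a rough sketch of what per-source limiting at layer 4 could look like in haproxy, along the lines of the article johnsom linked; the frontend name and thresholds are placeholders, not our actual config:)

    frontend balance_git_https
        bind :443
        stick-table type ip size 200k expire 5m store conn_cur,conn_rate(10s)
        tcp-request connection track-sc0 src
        tcp-request connection reject if { sc_conn_cur(0) gt 20 }
        tcp-request connection reject if { sc_conn_rate(0) gt 40 }
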
18:18 <clarkb> corvus: then once all are rebooted I think we can tune the bumped maxcoon
18:18 <clarkb> *maxconn
18:18 <clarkb> right now it won't be useful because most of the giteas are unhappy
18:18 <clarkb> but once I've got them happy again we'll get useful info
18:18 <clarkb> fungi: thats a good point
18:19 <clarkb> 02 is also having trouble with the graceful down. I'll try just doing a reboot on 03 when 02 is done
18:20 <fungi> should i status notice about AS4837 now or wait until we decide whether we need to add AS4134?
18:20 <clarkb> I think you can go ahead and then we can add any other rule changes as subsequent notices/logs
18:21 <fungi> #status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options.
18:21 <openstackstatus> fungi: sending notice
18:21 -openstackstatus- NOTICE: Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options.
18:22 <fungi> i'll put something similar out to the service-announce ml
18:24 <openstackstatus> fungi: finished sending notice
18:24 <clarkb> I got impatient and tried to do 03 but its super bogged down but I got on 04 and have issued a reboot. It is very slow. I assume its waiting for docker to stop and docker is waiting for containers to stop
18:24 <clarkb> tldr just doing a reboot isn't any faster I don't think
18:25 <clarkb> we can do nova reboot --hard
18:25 <clarkb> or be patient. I'm attempting to practice patience
18:30 <fungi> my concern with --hard is that it won't wait for the filesystems to umount
18:31 <johnsom> FYI, I got a successful clone now, so functionality returning
18:31 <fungi> johnsom: that's great, thanks for confirming!
18:32 <clarkb> 01, 02, and 04 are happy now
18:32 <clarkb> 03 and 05 next
18:34 <clarkb> and now those are done
18:34 <clarkb> I think as more become happy the load comes off the sad ones and they restart quicker
18:35 <clarkb> I'm doing graceful stop and reboot to clear out any OOM side effects
18:37 <fungi> so do we want to try to add more backends?
18:38 <clarkb> fungi: I mean we can but then we have to spend the rest of the day replicating
18:38 <fungi> in conjunction with inching up the max connections
18:38 <clarkb> and if its just to serve a ddos I'm not sure thats the right decision
18:39 <fungi> oh, right, speaking of replicating, we need to retrigger a full replication in gerrit once your reboots are complete
18:39 <clarkb> cacti shows the difference between normal and not normal and it's massive overkill to add more hosts
18:40 <fungi> given that this doesn't look like a targeted attack, i have a feeling it's going to come and go as whatever distributed web crawler this is spreads
18:40 <clarkb> all 8 should be properly restarted now
18:40 <clarkb> fungi: is it weird for it to be so distributed though?
18:40 <fungi> so it's not growing the cluster to absorb this incident, but rather growing to accept future similar incidents
18:41 <clarkb> fungi: right, but being idle 99% of the time seems like a poor use of donated resources
18:41 <fungi> i agree, if we had some way to elastically scale this, it would be awesome
18:41 <clarkb> if this is an actual bot that isn't intentionally malicious we could ask it nicely to go away with a robots.txt
18:42 <clarkb> but I think the best way to do that is with tcpdumps and decrypting those streams
18:42 <clarkb> which is not the simplest thing to do iirc
18:42 <fungi> the nature of the traffic though, it seems like someone has implemented some crawler application on top of a botnet of compromised systems. i doubt we're the only site seeing this, we just happen to be hit hard by having a resource-intensive service behind lots and lots of distinct urls
18:42 *** xakaitetoia has quit IRC
18:43 <clarkb> ya that could be
18:43 <clarkb> we're currently operating under our limit so if we want to test bumping the limit we'll need to undo our iptables rules I think
18:44 <fungi> typical botnets range upwards of tens of thousands of compromised systems, and having a majority of them originating from ip addresses in popular chinese isps is not uncommon
18:44 <clarkb> we are at about 60% of the limit
18:44 <clarkb> 2.5/4k connections
18:44 <clarkb> ish
18:50 <clarkb> potential options for next steps: Remove iptables rules for china unicom ranges. This will likely put us back into a semi degraded state with slow connections and connections that fail occasionally. If we do <- we can try raising our maxconn value slowly until we see things get out of control to find where our limit is. We can add backends prior to doing that and try to accommodate the flood. We can do the
18:50 <clarkb> opposite and block more ranges with malicious users. We can try tcpdumps and attempt to determine if this is a bot that will respond to a robots.txt value and if so update robots.txt
18:50 <clarkb> We can do nothing and see if anyone complains from the blocked ranges and see if the bots go away (since we're managing to keep up now)
18:51 <clarkb> and with that I need a break. Our meeting is in 10 minutes and I've been heads down since ~6am
18:57 <fungi> if this really is a widespread spider, we might be able to infer some characteristics by looking at access logs of some of our other services
19:01 *** hasharAway is now known as hashar
19:02 *** sshnaidm|ruck is now known as sshnaidm|bbl
19:04 *** ianw_pto is now known as ianw
19:17 <openstackgerrit> Merged opendev/system-config master: Revert "Increase allowed number of haproxy connections"  https://review.opendev.org/738679
19:24 <openstackgerrit> James E. Blair proposed opendev/system-config master: Enable access log in gitea  https://review.opendev.org/738684
19:29 <ianw> so i'm not seeing that we have a robots.txt for opendev.org or that gitea has a way to set one, although much less sure on that second part
19:30 <clarkb> ianw: https://github.com/go-gitea/gitea/issues/621
19:30 <clarkb> thats a breadcrumb saying its possible
19:33 <ianw> yeah just drop it in public/
19:43 <clarkb> looks like gitea may rotate the logs for us already
19:46 <ianw> https://git.lelux.fi/theel0ja/gitea-robots.txt/src/branch/master/robots.txt looks like a pretty sweet robots.txt
19:46 <openstackgerrit> Jeremy Stanley proposed opendev/system-config master: Add backend source port to haproxy logs  https://review.opendev.org/738685
19:47 <ianw> i note it has
19:47 <ianw> # Language spam
19:47 <ianw> Disallow: /*?lang=
19:47 <fungi> seems to be an already recognized issue
19:51 <corvus> so the error in build 7cdd1b201d0e462680ea7ac71d0777b6 is a follow on from this error: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary
19:51 <corvus> we may be able to get the module failure info from the executor, i'm not sure
19:52 <ianw> corvus: that step shows as green OK in the console log for you right?
19:52 <corvus> ianw: yep
19:54 <clarkb> ianw: I like the idea of adapting that robots.txt
19:54 * clarkb is happy we had a meeting brainstorm so many good ideas
19:54 <corvus> ianw: that task has "failed_when: false"
19:54 <corvus>   # Using shell to try and debug why this task when run sometimes returns -13
19:55 <corvus>                             "rc": -13
19:55 *** hashar is now known as hasharAway
19:55 <corvus> so, um, apparently the author of that role has observed that happening, and expects it to happen, and so ignores it
19:55 <corvus> but the follow-on task assumes it works
19:55 <clarkb> ya we've seen that error before
19:55 <clarkb> I don't recall what debugging setup we had done though
19:55 <openstackgerrit> Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt  https://review.opendev.org/738686
19:56 <corvus> i'm not sure that we got any more information from the shell
19:56 <corvus> anyway, we still need to track down that module error in the executor log
19:58 <corvus> there's nothing in the executor log that isn't already in the job output
19:59 <corvus> so basically, no idea what caused -13 to happen
20:01 <corvus> switching to the gear issue: that's something that needs to be installed in the ansible venvs, and i guess we don't do that in the upstream zuul images.
20:01 <corvus> this is a bit more of a thorny problem though, since there is no usage of gear in zuul-jobs
20:01 <corvus> so our pseudo-policy of "we'll add it to zuul-executor if something in zuul-jobs needs it" doesn't cut it here
20:01 <clarkb> infra-root to summarize gitea situation I think we can land https://review.opendev.org/#/c/738686/1 and https://review.opendev.org/#/c/738684/1 (this one actually likely already does rotation) then land the haproxy logging update fungi is working on. Then basically as each change lands check logs and see if we learn anything new from better logging
20:02 <corvus> clarkb: we don't want the internet archive to crawl us?
20:02 *** noonedeadpunk has quit IRC
20:03 <clarkb> corvus: I mean maybe? the problem seems to be crawling in general poses issues though I guess we've never really noticed this until now so being selective about what we allow is probably as good as it gets?
20:03 <clarkb> in particular the lang stuff as well as specific commit level stuff would be good to clean up based on the logs we had during this situation
20:03 <fungi> well, the filter for localization url variants is probably a big enough help on its own
20:03 *** noonedeadpunk has joined #opendev
20:03 <ianw> that was just a straight copy of that upstream link, i agree maybe we could drop that bit
20:03 *** diablo_rojo has joined #opendev
20:03 <fungi> those essentially multiply the url count dozens of times over
20:03 <corvus> honestly, i'd rather switch from gitea to something that can handle the load rather than disallowing crawling
20:04 <ianw> also crawl-delay as 2, if obeyed, would seem to help
20:04 <corvus> in my view, supporting indexing and archiving is sort of the point of putting something in front of gerrit :/
20:04 <fungi> i concur
20:05 <ianw> i don't think this is disallowing indexing, just making it more useful?
20:05 <clarkb> ya its directing it to the bits that are more appropriate
20:05 <corvus> it disallows indexing commits?
20:05 <clarkb> indexing every file for every commit for every localization lang is expensive
20:06 <clarkb> corvus: yes because the crawlers are crawling every single file in every single commit. If we want to start with simply stopping the lang toggle and see if that is sufficient we can do that
20:06 <corvus> i could be convinced lang is not useful (since the *content* isn't localized)
20:06 *** mlavalle has quit IRC
20:06 <clarkb> then work up from there rather than backwards
20:06 <clarkb> correct only the dashboard template itself is localized not the git repo content
20:06 <corvus> clarkb: right.  i'm going out on a limb and saying that crawlers indexing every file in every commit is desirable.
20:06 *** tobiash has quit IRC
20:06 <corvus> "better them than us" :)
20:06 <clarkb> I agree, but it also causes our service to break
20:06 <corvus> well, no
20:07 <corvus> the botnet ddos causes our service to break
20:07 *** mlavalle has joined #opendev
20:07 <ianw> i note github robots.txt has
20:07 <ianw> User-agent: baidu
20:07 <ianw> crawl-delay: 1
20:07 <corvus> anyway, i'm trying to suggest we don't over-compensate and throw out the benefit of the system.  let's start small with disallowing lang, setting the delay, and whatever else seems minimally useful
20:07 *** tobiash has joined #opendev
20:07 <clarkb> corvus: that works for me
20:08 <clarkb> we should exclude activity as well
20:08 <clarkb> thats the not useful graph data
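
(A minimal sketch of that "start small" set of rules; the actual file is whatever lands via https://review.opendev.org/738686:)

    User-agent: *
    Crawl-delay: 2
    # the dashboard chrome is localized, the repo content is not
    Disallow: /*?lang=
    # per-repo activity graphs are expensive and low value to index
    Disallow: /*/activity
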
20:09 <ianw> much like the sign at the shop like "please do not eat the soap" i'm presuming they put that there in response to an observed problem :)
20:09 <corvus> ianw, clarkb: i left comments
20:09 <fungi> also fungi is not working on the haproxy logging change, he already proposed it during the meeting (738685)
20:10 <fungi> though happy to apply any fixes if there are issues
20:10 <clarkb> corvus: I think the ones from github are url paths that gitea inherited from github
20:10 <clarkb> corvus: I expect that gitea supports those paths that are github specific
20:10 <clarkb> (have not tested that yet)
20:10 <clarkb> fungi: oh I completely missed the change is already up
20:10 <fungi> do we still need to trigger full replication from gerrit? or did that happen already and i missed it
20:10 <corvus> clarkb: oh, like are they aliases for more native gitea paths?
20:10 <clarkb> corvus: ya that was my interpretation
20:10 <corvus> clarkb: if that's the case, then i agree we can filter them because it's duplicative
20:10 <clarkb> fungi: I don't think that has happened yet
20:11 <corvus> ianw: ^ feel free to ignore my comment on that if that's the case
20:11 <fungi> i can work on that next while i wait for gitea to stop hanging on sync
20:11 <ianw> yeah, it appears seeded from github.com/robots.txt
20:12 <fungi> er, i mean, while i wait for gertty to stop hanging
20:12 <fungi> i think something about the gitea issues earlier has confused it
20:14 <fungi> #status log triggered full gerrit re-replication after gitea restarts with `replication start --all` via ssh command-line api
20:14 <openstackstatus> fungi: finished logging
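
(The full form of that trigger is roughly the following, run by a Gerrit admin over the usual SSH command-line API; substitute the appropriate user:)

    ssh -p 29418 admin@review.opendev.org replication start --all
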
20:17 <clarkb> fungi: corvus I reviewed the haproxy change and left some notes. I think we should double check the check logs before approving but I +2'd
20:17 <clarkb> there is one thing about []s being special in haproxy docs that we want to double check doesn't cause problems for us
20:17 <fungi> oh, i'll take a closer look, thanks!
20:18 <clarkb> I expect its fine because the [%t] in the existing format is fine
20:19 <fungi> yeah, i just double-checked that on the production server
20:20 <fungi> i could also drop them, i just didn't want it to become ambiguous if we end up doing any load balancing over ipv6 (not that we do presently)
20:20 <clarkb> I think if the check logs look good we should +A
20:20 <clarkb> its mostly a concern that I don't understand what haproxy means by the []s being different
20:21 <clarkb> but if the behavior is fine ship it :)
20:22 <fungi> and yeah, the rest of that format string was just copy-pasted from what the haproxy doc says the default is for tcp forward logging, though i did spot-check that it seemed to match the things we're logging currently
20:23 <clarkb> https://review.opendev.org/#/c/738684/1 has passed if anyone has time for gitea access log enablement
20:24 <clarkb> though note I have to pop out in about 20 minutes to get my glasses adjusted (they got scratched and apparently warranties cover that)
20:24 <ianw> do we want to disallow */raw/ ?
20:25 <clarkb> I guess it depends if we expect indexers to properly be able to render the intent of formatted source?
20:25 <clarkb> raw may be useful for searching verbatim code snippets?
20:28 *** hasharAway has quit IRC
20:29 *** hashar has joined #opendev
20:30 <clarkb> probably leave raw for now since I'm not sure all the useful indexers can do that
20:33 <ianw> what about blame?
20:34 <ianw> raw and blame are the two i think github disallows that are relevant to us
20:34 <clarkb> blame I can personally live without since it is in the git repos and web index is a poor substitute
20:34 <clarkb> corvus: fungi ^ thoughts?
20:34 *** sshnaidm|bbl has quit IRC
20:35 *** sshnaidm|bbl has joined #opendev
20:36 <corvus> i think we can live without blame.  it is expensive and lower value in indexes/searching
20:36 <corvus> /archiving
20:37 <ianw> yeah i have that in
20:37 <ianw> i think the rest gitea does not do
20:37 <fungi> yeah, i don't think there's a lot of benefit to spidering that
20:37 <ianw> e.g. atom feeds, etc
20:37 <fungi> blame, i mean
20:39 <openstackgerrit> Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt  https://review.opendev.org/738686
20:40 <clarkb> we apparently don't put any traffic through the lb in testing, https://zuul.opendev.org/t/openstack/build/1abcc48f83fe4a1c91db7f45fea09391/log/gitea-lb01.opendev.org/syslog.txt doesn't show it anyway
20:40 <clarkb> I've got to pop out now so won't approve it but I think you can if you'll watch it
20:40 <clarkb> fungi: ^
20:41 <ianw> Disallow: /Explodingstuff/
20:42 <ianw> that has one repo with a ransomware .exe
20:43 <ianw> ... i wonder what the story is with that
20:44 <openstackgerrit> Merged openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8  https://review.opendev.org/738435
20:45 <fungi> ianw: where?
20:46 *** sshnaidm|bbl is now known as sshnaidm|afk
20:46 <ianw> fungi: sorry that's in the github.com/robots.txt which i was comparing and contrasting to
20:46 <fungi> ahh
20:46 <fungi> amusing
20:47 <ianw> of all the possible things on github, it seems odd that this one is so special
20:47 <fungi> it was probably serving a very high-profile ransomware payload to many compromised systems
20:48 <fungi> or something similar they didn't want showing up in web searches
20:49 <ianw> yeah but why not delete it?  although i might believe that script kiddies were using some sort of download thing that unintentionally obeyed robots.txt (because it was designed for good and they didn't patch it out) maybe
20:50 <fungi> well, deleting the file *thoroughly* requires rewriting (at least some) git history
20:52 *** priteau has quit IRC
21:15 *** jbryce has quit IRC
21:15 *** jbryce has joined #opendev
21:19 <clarkb> replacing lenses took surprisingly little time. I think I'll hang around to see what logging tells us but then it was an early start today and I'm tired. Will likely call it there and pick things up in the morning
21:27 <fungi> corvus: are you cool with the updated robots.txt in 738686? looks like it addresses your prior comments now
21:27 <corvus> i'll check
21:27 <corvus> +3
21:27 <fungi> thanks!
21:28 <corvus> ianw: thanks; i like the comments too so we know what to look at next if things are bad
21:28 <ianw> i don't think anyone ever accused me of leaving too *few* comments :)
21:32 <openstackgerrit> Merged opendev/system-config master: Enable access log in gitea  https://review.opendev.org/738684
21:37 <clarkb> re ^ I think we may need to restart gitea processes manually
21:37 <clarkb> because its not a new container image so we don't automatically do it
21:37 <clarkb> if fungi's replication kick is still running we'll want to do it gracefully
21:37 <fungi> even if it's not still running we'd want to do it gracefully, right?
21:38 <clarkb> yes, though its less critical
21:39 <openstackgerrit> Merged opendev/system-config master: Add backend source port to haproxy logs  https://review.opendev.org/738685
21:39 <clarkb> haproxy should autoupdate if the sighup is enough there
21:39 <clarkb> I wonder if we should wait for robots.txt before restarting giteas just in case gitea caches that
21:40 <fungi> we did at least work out the restart for the gitea containers in such a way that we no longer lose replicated refs during a restart, right?
21:41 <clarkb> fungi: yes, though I don't think we've ever managed to fully confirm that. The semi regular "my commit is missing" complaints went away after we did it though and some rough testing of the component pieces shows it should work
21:41 <clarkb> its just really hard to confirm in the running system due to races
21:41 <clarkb> basically its docker-compose down && docker-compose up -d mariadb gitea-web ; #wait here until the gitea web loads then docker-compose up -d gitea-ssh
21:42 <clarkb> and all of that is in the role too if you need to find it later
21:43 <clarkb> what that does is stops the ssh server first (its written that way in the docker compose file) then only starts the ssh server once gitea itself is ready. This way gerrit will fail to push anything until gitea the main process can handle the input
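
(That sequence sketched out with an explicit wait, assuming the compose directory is /etc/gitea-docker and that gitea-web answers locally on its https port 3000 — both assumptions:)

    cd /etc/gitea-docker
    sudo docker-compose down
    sudo docker-compose up -d mariadb gitea-web
    until curl -skf https://localhost:3000/ >/dev/null; do
      sleep 5    # keep gitea-ssh down until the web process can accept pushes
    done
    sudo docker-compose up -d gitea-ssh
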
21:44 <fungi> makes sense
21:44 <clarkb> hrm robots.txt is a bit of a ways out
21:44 <clarkb> maybe we should go ahead with some restarts now. I can do 01 and confirm that its access log works at least
21:45 * clarkb does this
21:46 <clarkb> apparently docker-compose stop != docker-compose down
21:46 <clarkb> down removes the containers, stop does not
21:46 <clarkb> so when I did the reboots down was correct to prevent them restarting in the wrong order
21:46 <clarkb> but without a reboot a stop is fine
21:46 *** rosmaita has joined #opendev
21:47 <clarkb> gitea01 is done
21:48 <clarkb> we have user agents
21:48 <clarkb> and baiduspider shows up
21:48 <clarkb> but so do others
21:51 <rosmaita> i have a (hopefully) quick zuul question when someone has a minute
21:51 <clarkb> rosmaita: go for it, I think the fires are under control and now we're just poking at them as they cool :)
21:51 <rosmaita> ty
21:51 <clarkb> note I think we may need to restart haproxy to get the new log format
21:52 <rosmaita> i'm trying to configure this job: https://review.opendev.org/#/c/738687/4/.zuul.yaml@109
21:52 <clarkb> the config updated but I don't see the new format yet
21:52 <rosmaita> i want it to use a playbook from a different repo
21:52 <rosmaita> but not sure how to specify that
21:53 <clarkb> rosmaita: playbooks can't be shared between repos, but roles can
21:53 <clarkb> rosmaita: usually the process is to repackage the ansible bits into a reusable role that the original source and new consumer can both use
21:54 <fungi> also jobs can be used between repos (obviously)
21:55 <fungi> so another approach is to define a job (maybe an abstract job) which uses the playbook, and then inherit from that in the other repo with a new job parented to it
21:57 <rosmaita> i think there's a tox-cover job defined in zuul-jobs, but i figured i should probably make openstack-tox my parent job
21:59 <clarkb> rosmaita: what is your post playbook going to do?
21:59 <rosmaita> clarkb: the playbook i'm using only references one role -- if i copy the playbook to the cinder repo, will it just find the fetch-coverage-output role?
21:59 <clarkb> rosmaita: yes zuul-jobs is available already. Though really you should probably create an openstack-tox-cover that does that for all jobs?
22:00 <clarkb> I think the difference between openstack-tox and normal tox is use of constraints
22:00 <clarkb> so a tox-cover <- openstack-tox-cover inheritance that is then applied to cinder makes sense to me
22:01 <rosmaita> ok ... i will try copying the playbook first to see if the job does what I want, and if it does, i will propose an openstack-tox-cover job
22:01 <clarkb> sounds good
22:02 <fungi> i've confirmed, `cd /etc/haproxy-docker && sudo docker-compose exec haproxy grep log-format /usr/local/etc/haproxy/haproxy.cfg` does show the updated config, but ps indicates the service has not restarted yet
22:02 <rosmaita> clarkb: ty
22:02 <clarkb> fungi: we just do a sighup
22:02 <clarkb> fungi: I'm guessing that only allows some things to be updated in the config and others are evaluated on start
22:03 <fungi> is the outage from `sudo docker-compose down && sudo docker-compose up -d` going to be brief enough we don't care about scheduling?
22:04 *** tobiash has quit IRC
22:04 <clarkb> I would do s/down/stop/ for the reason I discovered earlier. That outage should last only a few seconds (so may be noticed but only briefly)
22:05 <fungi> so it's stop followed by up -d?
22:05 <clarkb> yup stop && up -d
22:05 <fungi> okay, i'll give that a shot on gitea-lb01 now
22:06 *** tobiash has joined #opendev
22:06 <clarkb> k I'm watching
22:06 <fungi> took ~12 seconds
22:06 <fungi> i can browse content
22:07 <fungi> process start time looks recent now
22:07 <clarkb> logs still missing that info?
22:07 <fungi> yeah, it doesn't seem to be using the new log-format
22:07 *** hashar has quit IRC
22:07 <clarkb> maybe the []s are the problem here?
22:07 <fungi> well, only one of the two parameters added was in []
22:09 *** olaph has quit IRC
22:09 <clarkb> I'm looking at the rest of the log config, maybe we set the new value then override later
22:09 <clarkb> ya I think that may be what is happening
22:10 <clarkb> we set option tcplog on the frontends
22:10 <clarkb> which is a specific log format
22:10 <fungi> oh
22:10 <fungi> yep. i'll work on another fix
22:10 <clarkb> fungi: we can just test it manually first too
22:11 <clarkb> but I think commenting out those lines or removing them then sighup may do it
22:13 <fungi> this is a nice writeup: https://www.haproxy.com/blog/introduction-to-haproxy-logging/
22:15 <clarkb> ianw: fwiw gitea01 does have a fair bit of interesting UA data now in /var/haproxy/log/access.log which we can possibly use to update the robots.txt
22:15 <clarkb> while fungi is doing the haproxy things I'll finish up gitea02-08 restarts
22:15 <clarkb> then they'll all have that fun data
22:17 <fungi> where on the host system are we stashing the haproxy log? running `mount` inside the container isn't much help
22:17 <fungi> er, haproxy config, not log
22:17 <fungi> it's /usr/local/etc/haproxy/haproxy.cfg inside the container, but mount just claims that /dev/vda1 is mounted as /usr/local/etc/haproxy
22:19 <fungi> aha, it's /var/haproxy/etc/haproxy.cfg
22:21 <fungi> clarkb: yep, that did it
22:21 <fungi> as expected. i'll propose removing the two occurrences of "option tcplog"
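
(Roughly the direction 738685 plus that follow-up head in: drop "option tcplog" and keep an explicit log-format that appends the backend-side source address and port, %bi:%bp, to the stock TCP fields — a sketch, not the exact template:)

    frontend balance_git_https
        # no "option tcplog" here; it would override log-format
        log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq %bi:%bp"
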
clarkbfungi: ya check the docker compose config file for the mounts22:21
clarkb01-08 are all restarted and should have access logs now22:22
ianwsorry back, having a look22:23
clarkbheh the access logs don't have ports22:23
clarkbso to map you want to get the url + timestamp from access log for UA you want. Then check against macaron logs in docker logs output then that gives you the port and you can cross check against lb for the actual source22:24
openstackgerritJeremy Stanley proposed opendev/system-config master: Remove the tcplog option from haproxy configs  https://review.opendev.org/73871022:24
clarkbwell progress anyway22:24
rosmaitaclarkb: i am a idiot -- there is already an openstack-tox-cover job22:24
fungirosmaita: happens to me all the time, i consider it a validation of my need when i spend half an hour discovering that what i tried to add is already there22:25
rosmaita:)22:25
clarkbfungi: actually is there port info in the gitea side somewhere?22:25
fungiclarkb: i don't know, i had hoped we'd get that from the access log being added22:26
clarkbwe may still have a missing piece22:26
corvusi didn't look at that, i thought all we wanted from access log was UA22:26
clarkbcorvus: ya I think its 95% of what we want22:27
fungiua is definitely useful and an improvement22:27
corvuswe'll need to unblock the ip ranges to see the UA right?22:27
clarkbif we can also map lb logs to gitea logs that is a separate win22:27
clarkbcorvus: no because we didn't block all the IPs doing the things22:27
fungihad just hoped we'd reach a point where we could actually correlate a request to a source ip address22:27
corvusah cool -- do we have a ua yet?22:27
clarkbcorvus: we have many :)22:27
corvusi mean from the crawler22:27
clarkb"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)"22:28
fungi738710 will get the backend source ports to appear properly (tested manually to confirm)22:28
clarkb\"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" too22:28
clarkbbaidu doesn't seem to have an english faq22:29
ianwthat maps, though, with the crawl-delay: 1 that github sets for the baidu UA22:29
clarkbthere are others too (I haven't managed to do a complete listing but I think if we grep for lang= and then do a sort -u -c sort of deal we should see which are most active)22:31
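A sketch of that tally; uniq -c does the counting, and the awk field index assumes the UA is the last double-quoted field on each line, which depends on the exact log format in use:

    grep 'lang=' /var/gitea/logs/access.log \
      | awk -F'"' '{print $(NF-1)}' \
      | sort | uniq -c | sort -rn | head -20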
corvusare folks thinking that first UA is botnet?22:31
ianwclarkb: sorry where's the UA?  not in the gitea logs?22:31
corvusianw: should be in access.log in /var/gitea/logs22:31
ianwahh ok, was tailing the container22:32
clarkbcorvus: yes I think so based on the url it's hitting22:32
clarkbcorvus: it's a specific commit file with lang set22:32
clarkbthats not concrete evidence unless there is a trend though (single data point not enough to be sure)22:33
fungicorvus: may be compromised in-browser payload. a lot of them work that way, so they'd wind up with the actual browser's ua anyway22:33
fungiwhat with turing-complete language interpreters in modern browsers, you don't need to compromise the whole system any more, just convince the browser to run your program22:34
clarkbwe can set the access log format22:34
clarkbI'm now trying to see if the remote port is available to that logging context22:34
clarkb(having the ability to link the two would be useful)22:35
fungiif gitea uses apache access log format replacements, i think %{c}p gives the client port22:35
funginot that i would have any reason to expect it to use anything like apache's format string language22:36
openstackgerritMerged opendev/system-config master: gitea-image: add a robots.txt  https://review.opendev.org/73868622:36
clarkbit does not22:36
ianwhttp://paste.openstack.org/show/795410/22:36
clarkbhttps://docs.gitea.io/en-us/logging-configuration/#the-access_log_template22:36
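Per that page, the access log and its template live in the [log] section of app.ini; a minimal sketch is below. The template shown is an abbreviated illustration built from the documented .Ctx / .Identity / .Start / .ResponseWriter contexts, not the exact default:

    [log]
    ENABLE_ACCESS_LOG = true
    ACCESS_LOG_TEMPLATE = {{.Ctx.RemoteAddr}} - {{.Identity}} {{.Start.Format "[02/Jan/2006:15:04:05 -0700]"}} "{{.Ctx.Req.Method}} {{.Ctx.Req.RequestURI}} {{.Ctx.Req.Proto}}" {{.ResponseWriter.Status}} {{.ResponseWriter.Size}} "{{.Ctx.Req.UserAgent}}"

The discussion that follows is about which of those context values still carries the client port.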
ianw44 \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"22:37
clarkbsemrush in that listing is the one we had to turn off for lists.o.o because it made mailman unhappy22:37
ianw810 \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"22:37
ianwso relative, i don't think it's that22:37
clarkbya baidu seems pretty well behaved overall22:38
corvusbecause of the iptables block, we're not going to have many ddos entries in that log22:38
fungiclarkb: i wonder then if it's something like {{.Ctx.RemotePort}}22:40
clarkbfungi: there isn't, but we do have the Ctx.Req object, which is a normal golang http request and may have it22:40
clarkbcorvus: we only cut the volume down and did not eliminate it, but yes, undoing iptables would give us more data22:41
clarkbcorvus: there is at least another whole AS that we could block to stop the requests22:41
ianwaccording to https://developers.whatismybrowser.com/useragents/parse/38888-internet-explorer-windows-trident the top hit in that list i provided is windows xp sp2 ie722:42
ianwi.e. i'd say we can rule out that it's a human22:42
ianwhttp://paste.openstack.org/show/795411/ all requests, not just lang22:44
*** mlavalle has quit IRC22:46
*** tkajinam has joined #opendev22:48
openstackgerritClark Boylan proposed opendev/system-config master: Update gitea access log format  https://review.opendev.org/73871422:49
clarkbthat change is not tested but I think if I got the names right it should world22:50
clarkbworld? work22:50
clarkbfungi: ^ basically macaron, which is gitea's web layer, tries to be smart about the value there and I think it ends up trimming that useful info, but the normal http.Request.RemoteAddr value has it and is intended for logging22:50
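To illustrate the distinction (plain net/http behavior, not gitea's actual code): RemoteAddr on a Go http.Request is the raw "ip:port" pair the server saw, so logging it directly keeps the source port:

    package main

    import (
        "fmt"
        "log"
        "net"
        "net/http"
    )

    func main() {
        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            // RemoteAddr is "ip:port"; split it to show the port survives
            host, port, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                host, port = r.RemoteAddr, "?"
            }
            log.Printf("%s port=%s %q ua=%q", host, port, r.RequestURI, r.UserAgent())
            fmt.Fprintln(w, "ok")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }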
*** rosmaita has left #opendev22:51
fungioh cool22:51
clarkbianw: I expect we can filter a lot of those?22:54
clarkbmy modern firefox UA reports itself with its version22:54
clarkbso Firefox 4.0.1 is either a lie or very very old?22:55
fungithose moz/ie versions were commonly used in other more niche browsers which needed to work around sites claiming their browser wasn't supported22:56
clarkbfungi: +2'd https://review.opendev.org/#/c/738710/122:56
fungibut i agree these days it's more likely to be a script22:56
clarkbthat's what the prefix bits are supposed to do these days, right?22:56
clarkbthe last entry should be your actual browser?22:57
fungiin theory22:57
*** tosky has quit IRC22:57
fungithough in reality it can be anything22:57
clarkbhttps://review.opendev.org/#/c/738714/1 should be tested by our testing (I checked earlier jobs and the access log shows up in them)22:58
clarkbcorvus: ianw ^ if you can review that I'll double check the logs in the test runs before approving22:59
ianwclarkb: i mean looking at http://paste.openstack.org/show/795411/ the hits are from things that are saying they look like ie7, some tencent browser thing, and os 10.7 which went eol in ~ 201223:00
clarkbianw: ya, it's a lot of things reporting to be very old browsers23:00
clarkbthe opera entry there is from 201123:00
ianwbut not being nice and putting any sort of "i'm a robot" string in it :/23:00
clarkbfirefox 4.0.1 is also a 2011 release23:01
clarkbwas 2011 the golden year for web browsers?23:02
clarkbI'm leaning towards: we can use robots.txt to block those. Actual browsers will ignore the robots.txt, right? And any well-behaved bot will go away. If the bots are not well behaved then we may need the mod rewrite idea23:03
ianwseems once robots.txt is there we could just hand-edit it for testing23:04
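An illustrative robots.txt along those lines; the named agents are the well-behaved crawlers seen in the logs above, and the numbers are placeholders to tune by hand:

    User-agent: Baiduspider
    Crawl-delay: 1

    User-agent: SemrushBot
    Disallow: /

    User-agent: *
    Crawl-delay: 2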
corvusclarkb: lgtm; do you want a +3 or to check the zuul logs first?23:05
clarkbcorvus: I'll check the logs first and approve if its good23:05
ianwoh it deployed already, cool23:05
clarkbheh that safari release is 2011/2012 too23:05
clarkbit's like whoever built this bot did so in 2012 and listed all the possible UA strings at the time :)23:05
ianwso, is anything actually loading the robots.txt ...23:06
ianwbasically two things that look like well behaved bots http://paste.openstack.org/show/795412/23:07
clarkband I guess the flood isn't going away with the less nice bots23:09
ianwis there like a spamhaus for UA's?23:10
clarkbfungi: ^23:11
funginone that i know of23:11
ianwhttps://perishablepress.com/4g-ultimate-user-agent-blacklist/23:11
ianwi mean, insert grain of salt of course23:12
fungiyeah, given any application can claim whatever ua it wants and masquerade as any other agent, if filtering based on well-known malware ua strings were especially effective the authors would adapt23:12
fungiwe might get away with it for a while, or indefinitely, but it would be a continual game of whack-a-mole adding new entries over time23:13
ianwbut i guess what we're seeing is a bunch of ancient UA's; whatever we're dealing with, it's cutting edge23:13
ianwis *not* cutting edge i mean23:13
corvusclarkb: gitea log change failed test23:14
fungithe ua is useful for identifying well-meaning bots so that you can adjust robots.txt to ask them to be better behaved23:14
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx  https://review.opendev.org/73851723:15
clarkbcorvus: and no access log. Maybe Ctx.Req.RemoteAddr isn't valid there :/23:15
ianwfungi: sure, but if there's some empirical list of "this is a bunch of UA's from known spam/script/ddos utilities that are commonly abusive" that includes all these EOL browsers, that would be great :)23:15
ianwhowever, that is maybe the type of thing kept in google/cloudflare repos and not committed publicly23:16
clarkboh template error writing out that config file23:16
clarkbit's a jinja2 problem with the {{ .stuff23:18
clarkbtrying to figure out if I can do a literal block23:19
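The usual jinja2 answer (a sketch, assuming the file is rendered by ansible's template module) is to wrap the go-template braces in a raw block so jinja2 passes them through untouched:

    {% raw %}
    ACCESS_LOG_TEMPLATE = {{.Ctx.Req.RemoteAddr}} - {{.Identity}} ...
    {% endraw %}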
openstackgerritJames E. Blair proposed zuul/zuul-jobs master: Ignore ansible lint E106  https://review.opendev.org/73871623:19
openstackgerritClark Boylan proposed opendev/system-config master: Update gitea access log format  https://review.opendev.org/73871423:22
clarkbI think ^ that may fix it assuming that is the only issue23:22
*** auristor has quit IRC23:29
openstackgerritMerged zuul/zuul-jobs master: Ignore ansible lint E106  https://review.opendev.org/73871623:33
*** Dmitrii-Sh has quit IRC23:42
*** Dmitrii-Sh has joined #opendev23:43
ianwmnaser: any idea what 38.108.68.124 - - [30/Jun/2020:23:52:28 +0000] "GET /zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py?lang=pt-PT HTTP/1.1" 200 21763 "https://opendev.org/zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py\" \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" is?23:53
ianwmnaser: ohhhh i'm an idiot, that's the loadbalancer23:54
mnaserianw: :)23:54
ianwsorry i got confused in the output of my own script :)23:55
*** auristor has joined #opendev23:56
clarkbianw: ya, that's why we're trying to get the port mappings done23:57
