*** ryohayakawa has joined #opendev | 00:04 | |
*** factor has quit IRC | 00:54 | |
*** factor has joined #opendev | 00:54 | |
*** dtantsur has joined #opendev | 01:12 | |
*** dtantsur|afk has quit IRC | 01:12 | |
*** vblando has quit IRC | 02:36 | |
*** mnasiadka has quit IRC | 02:36 | |
*** Open10K8S has quit IRC | 02:37 | |
*** Open10K8S has joined #opendev | 02:37 | |
*** mnasiadka has joined #opendev | 02:38 | |
*** vblando has joined #opendev | 02:40 | |
*** ysandeep|away is now known as ysandeep | 04:01 | |
*** ykarel|away is now known as ykarel | 04:19 | |
openstackgerrit | yatin proposed openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8 https://review.opendev.org/738435 | 04:47 |
*** ysandeep is now known as ysandeep|brb | 05:41 | |
*** DSpider has joined #opendev | 06:07 | |
*** ysandeep|brb is now known as ysandeep | 06:11 | |
*** diablo_rojo has quit IRC | 06:49 | |
*** bhagyashris|pto is now known as bhagyashris | 07:16 | |
*** hashar has joined #opendev | 07:24 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 07:28 | |
*** tosky has joined #opendev | 07:28 | |
*** bhagyashris is now known as bhagyashris|lunc | 07:29 | |
*** iurygregory has quit IRC | 07:42 | |
*** moppy has quit IRC | 08:01 | |
*** iurygregory has joined #opendev | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** hiep_mq has joined #opendev | 08:25 | |
*** ykarel is now known as ykarel|lunch | 08:25 | |
*** jangutter has quit IRC | 08:27 | |
*** hiep_mq has quit IRC | 08:37 | |
*** bhagyashris|lunc is now known as bhagyashris | 08:37 | |
*** Eighth_Doctor has quit IRC | 08:38 | |
*** rchurch has quit IRC | 08:40 | |
*** rchurch has joined #opendev | 08:44 | |
*** Eighth_Doctor has joined #opendev | 08:52 | |
*** tkajinam has quit IRC | 08:55 | |
*** priteau has joined #opendev | 09:08 | |
*** sshnaidm|ruck has quit IRC | 09:14 | |
*** sshnaidm has joined #opendev | 09:23 | |
*** sshnaidm has quit IRC | 09:42 | |
*** ykarel|lunch is now known as ykarel | 09:49 | |
*** priteau has quit IRC | 09:54 | |
*** sshnaidm has joined #opendev | 09:55 | |
*** mugsie has quit IRC | 10:07 | |
*** mugsie has joined #opendev | 10:10 | |
*** ryohayakawa has quit IRC | 10:30 | |
*** hashar has quit IRC | 10:47 | |
*** ysandeep is now known as ysandeep|afk | 10:53 | |
frickler | infra-root: most cacti graphs for gitea0X look weird starting around 0:30 GMT today | 11:00 |
frickler | e.g. http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66632&rra_id=all | 11:00 |
frickler | hmm, the docker log only shows the IP of the lb as client, makes it difficult to track things. not sure if that might be some ddos, even less how to mitigate if it was | 11:13 |
frickler | except that cloudflare seems to offer free protection for open source projects, but we'd have to discuss whether we'd want that | 11:13
*** bhagyashris is now known as bhagyashris|brb | 11:32 | |
*** ysandeep|afk is now known as ysandeep | 11:39 | |
openstackgerrit | Merged openstack/diskimage-builder master: Make ipa centos8 job non-voting https://review.opendev.org/738481 | 11:49 |
*** mordred has quit IRC | 12:04 | |
*** mordred has joined #opendev | 12:09 | |
*** bhagyashris|brb is now known as bhagyashris | 12:10 | |
*** dtantsur is now known as dtantsur|brb | 12:13 | |
*** hashar has joined #opendev | 12:22 | |
*** priteau has joined #opendev | 12:45 | |
*** ysandeep is now known as ysandeep|afk | 12:47 | |
ttx | corvus, fungi, clarkb: if you can review https://review.opendev.org/#/c/738187/ today, I'm around in the next 3 hours to watch it go through and fix it in case of weirdness | 12:55 |
*** ysandeep|afk is now known as ysandeep | 12:55 | |
ttx | tl;dr: a retry is better than convoluted conditionals | 12:55 |
AJaeger | ttx, +2A | 12:59 |
clarkb | frickler: that may indicate we've got another ddos happening against the service :/ | 13:02 |
ttx | AJaeger: thx! | 13:07 |
clarkb | There are two vexxhost IPs that show up as particularly busy | 13:13 |
clarkb | oddly they are ipv4 addrs not ipv6. Neither address shows up in our nodepool logs, but I'm double checking that I'm grepping through rotated logs properly now | 13:14 |
clarkb | however I expect that our jobs would always use ipv6 not ipv4 | 13:14 |
openstackgerrit | Merged zuul/zuul-jobs master: upload-git-mirror: use retries to avoid races https://review.opendev.org/738187 | 13:14 |
clarkb | ya I'm grepping rotated logs properly so these aren't any IPs of our own | 13:15 |
clarkb | mnaser: if you have a moment or can point us at who does I'd happily share IPs and see if we can work backward from there? | 13:15 |
fungi | the established connections graph for the lb is striking, can probably just see what the socket table looks like | 13:16
mnaser | clarkb: can you post them to me? I wonder if it’s our Zuul. | 13:17 |
clarkb | fungi: that's basically what I did, grep by backend on load balancer and sort by unique IP | 13:18
smcginnis | I've noticed some services seem a bit slower this morning. A ddos could certainly explain that. | 13:18 |
clarkb | `sudo grep balance_git_https/gitea08.opendev.org /var/log/syslog | cut -d' ' -f 6 | sed -ne 's/\(:[0-9]\+\)$//p' | sort | uniq -c | sort | tail` | 13:20 |
clarkb | if other infra-root ^ want to do similar on the load balancer | 13:20 |
fungi | clarkb: the address distribution looks fairly innocuous to me | 13:20 |
fungi | i was analyzing netstat -ln | 13:20 |
clarkb | fungi: there are two vexxhost IPs that have an order of magnitude more requests over that syslog | 13:20
clarkb | compared to other IPs | 13:20 |
clarkb | possible that it finally caught up and now our distribution is more normalized | 13:20 |
fungi | highest count for open sockets i saw was 66.187.233.202 which seems to be a nat at red hat | 13:21 |
clarkb | fungi: yes that one also shows up (but it has since we spun up the service since red hat insists on funneling all internets through a single IP) | 13:21 |
fungi | er, not netstat -ln, netstat -nt | 13:21 |
fungi | next most connections is from 124.92.132.123 at china unicom | 13:22 |
fungi | several other china unicom addresses in the current top count of open connections to the lb | 13:23 |
fungi | so far basically all of the ones i'm checking are, in fact | 13:24 |
fungi | netstat -nt|awk '{print $5}'|sed 's/:[0-9]*$//'|sort|uniq -c|sort -n | 13:24 |
fungi | the highest 8 are obviously connections to the backends | 13:25 |
fungi | the next 5 are china unicom | 13:26 |
fungi | (at the moment) | 13:26 |
clarkb | fungi: ya I was looking at it over time since the cacti graphs show it being a long term issue and in the past total number of connections over time when that has happened has correlated strongly with the issue | 13:27
clarkb | its also possible the IPs I identified are just noise as the actual problem is spread out over many more IPs and we're seeing that right now | 13:28 |
fungi | yeah, the distribution makes me wonder if the connection count is a symptom of something impacting cluster performance, not the cause | 13:30 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is fun | 13:32 |
fungi | closer looks at cacti graphs, note that the bandwidth utilization didn't really increase, or even spike (except on the loopback) | 13:32 |
fungi | so it could be something else is causing connections to take far longer to service, and they're all piling up | 13:33 |
fungi | cpu, load average and loopback connections all went up at the same time across all backends | 13:34 |
clarkb | fungi: ya thats what we saw before, basically really expensive requests that hang around | 13:34 |
clarkb | and then after a few hours they finally complete in gitea | 13:34 |
fungi | makes sense. not expensive operations resulting in large amounts of returned data over the network though | 13:35 |
clarkb | no its expensive for gitea to produce a result so its sitting in the background spinning its wheels for several hours before saying something | 13:36 |
fungi | yeah, i wonder how we could identify those | 13:38 |
clarkb | last time it was via docker logs on the web container and looking for completed requests with large times or started request entries and no corresponding completed line | 13:39 |
fungi | seems gitea is using threads, so can't really just go by process age like we could with the git http backend | 13:39 |
clarkb | I'm having a hard time doing that now | 13:40 |
clarkb | because the volume of logs is significant | 13:40
clarkb | ok finally getting somewhere with that. haven't seen any large requests, but I am noticing a lot of requests for specific commits and under specific languages | 13:42 |
fungi | funny to hear folks on the conference talking about collecting haproxy telemetry while digging into this | 13:42 |
clarkb | that seems odd for us and implies maybe its a crawler | 13:42 |
clarkb | gitea doesn't seem to log user agent unfortunately or that might make it a bit easier to tell what was doing that if it is a bot | 13:43 |
fungi | and i don't suppose we have sufficient session identifiers to pick it out of the haproxy logs | 13:44 |
clarkb | and if it isn't closing those open connections we may not log them in the way I was looking in syslog (your live view would be a better indicator of that) | 13:44 |
clarkb | fungi: ya the gitea logging is maybe the next thing to look at in that space to make it easier to debug this class of problem. I believe it will log a forwarded for ip if we were doing http(s) balancing on haproxy but we do tcp passthrough instead | 13:46 |
clarkb | adding user agent would be useful though | 13:46 |
clarkb | I think if we restart the containers we'll reset the log buffer which may make operating with the gitea logs slightly easier | 13:50 |
*** mlavalle has joined #opendev | 14:00 | |
clarkb | fungi: I'm trying to use my own IP as a bread crumb in the lb and gitea logs to see how we can more effectively correlate them | 14:01 |
fungi | if haproxy logs the ephemeral port from which it sources the forwarded socket, and gitea logs the "client" port number, we could likely create a mapping based on that, backend and approximate timestamps | 14:03 |
fungi | though what haproxy's putting in syslog looks like it includes the original client source port, but not the source port for the forwarded connection it creates to the backend | 14:05 |
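(For reference, haproxy's log-format does have variables for the source ip:port it uses on the forwarded connection toward the backend; a hedged sketch of the stock TCP format with that appended -- the exact line opendev ends up using is whatever lands in the 738685 change:)
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq %bi:%bp"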
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Convert multi line if statement to case https://review.opendev.org/734479 | 14:10 |
*** dtantsur|brb is now known as dtantsur | 14:14 | |
*** ysandeep is now known as ysandeep|away | 14:17 | |
clarkb | doing rough correlation using timestamps and looking for blocks where significant numbers of requests do the specific commit and file with lang param thing I'm beginning to suspect the requests from those china unicom ranges | 14:19 |
clarkb | each one seems to be a new IP so doesn't show up in our busy IPs listings | 14:19 |
*** mlavalle has quit IRC | 14:20 | |
clarkb | this correlation isn't perfect though so not wanting to commit to that just yet, but am starting to think about how we might mitigate such a thing | 14:20
clarkb | what I'm noticing is that we seem to have capped out haproxy to ~8k concurrent connections but we're barely spinning the cpu or using any memory there | 14:22 |
clarkb | we may be able to mitigate this by allowing haproxy to handle far more connections. However, ulimit is set to a very large number for open files and it isn't clear to me where that limit may be | 14:22 |
fungi | haproxy is in a docker container now, so could it be the namespace having a separate, lower limit? | 14:24 |
*** mlavalle has joined #opendev | 14:24 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Increase allowed number of haproxy connections https://review.opendev.org/738635 | 14:28 |
clarkb | fungi: nah its ^ | 14:28 |
clarkb | I checked ulimit -a in the container | 14:28 |
fungi | aha! | 14:28 |
clarkb | I think looking at gitea01 we have headroom for maybe 6-8x the number of current requests | 14:29
clarkb | so I bump by 4x in that change and we can monitor and tune from there | 14:29 |
clarkb | this is still just a mitigation though doesn't really address the underlying problem (but maybe thats good enough for now?) | 14:30 |
fungi | yeah, i agree, system load is topping out around 25% of the cpu count, and cpu utilization is also around 25% | 14:30 |
clarkb | and we get connection on front and connection on back so 4000 * 2 = 8k total tcp conns | 14:30 |
fungi | memory pressure is low too | 14:30 |
clarkb | fungi: ya | 14:30 |
clarkb | and the lb itself is running super lean | 14:32 |
clarkb | it could probably do 64k connections on the lb itself | 14:32 |
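(For context, the knob 738635 raises is haproxy's maxconn; a minimal sketch with illustrative numbers, not the exact opendev template:)
    global
        maxconn 16000    # process-wide cap on concurrent connections
    defaults
        maxconn 16000    # per-frontend cap; beyond this haproxy stops accepting new
                         # connections, which is why clients saw hangs at the old 4000 limit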
*** weshay_ruck has joined #opendev | 14:33 | |
tristanC | Greeting, is there a place where we could download the diskimage used by opendev's zuul? | 14:33 |
clarkb | tristanC: yes https://nb01.opendev.org, https://nb02.opendev.org. https://nb04.opendev.org | 14:33 |
clarkb | all three of those build the images so you'll want to check which one has the most recent build of the image you want | 14:33 |
clarkb | er it might need /images ? /me checks | 14:34
clarkb | yes its the /images path at those hosts | 14:34 |
tristanC | clarkb: thanks, /images is what i was looking for | 14:34 |
clarkb | frickler: corvus mordred I think we start at https://review.opendev.org/#/c/738635/1 to see if we can make opendev.org happier | 14:35 |
clarkb | and separately continue to try and sort out if these requests are from a legit crawler and if so we may be able to set up a robots.txt and ask it to kindly go away | 14:36 |
clarkb | fungi: I guess we could do a packet capture then figure out decryption of it and then look for user agent headers | 14:43 |
clarkb | fungi: that seems like an after conference, after breakfast task if I'm going to tackle that though | 14:43 |
openstackgerrit | Merged zuul/zuul-jobs master: prepare-workspace: Add Role Variable in README.rst https://review.opendev.org/737352 | 14:45 |
fungi | yeah, with a copy of the ssl server key we could do offline decryption of a pcap | 14:45 |
fungi | though i'm out of practice, been a while since i did that | 14:46 |
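(Roughly what that would involve, as a sketch -- the capture interface is an assumption, and offline decryption with only the server key works just for non-PFS/RSA-key-exchange ciphersuites, so with modern TLS this may be a dead end:)
    sudo tcpdump -i eth0 -w /tmp/gitea-https.pcap 'tcp port 443'
    # then load the pcap plus the server private key into wireshark/tshark to attempt decryption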
*** ykarel is now known as ykarel|away | 14:48 | |
*** sgw1 has quit IRC | 14:50 | |
*** sorin-mihai has joined #opendev | 14:52 | |
corvus | clarkb, fungi: oh, i have scrollback to read; i'll do that and expect to be useful after breakfast in maybe 30m? | 14:53 |
*** sorin-mihai__ has quit IRC | 14:53 | |
clarkb | corvus: sounds good | 14:53 |
fungi | yeah, things are mostly working | 14:54 |
*** sgw1 has joined #opendev | 14:55 | |
corvus | ok caught up on scrollback, +3 738635; breakfast now then back | 14:58 |
openstackgerrit | Merged zuul/zuul-jobs master: Return upload_results in upload-logs-swift role https://review.opendev.org/733564 | 15:01 |
fungi | 738635 is going to need an haproxy container restart, right? hopefully that's fast | 15:10 |
clarkb | fungi: yes and yes its usually pretty painless | 15:10 |
*** hashar is now known as hasharAway | 15:24 | |
openstackgerrit | Merged opendev/puppet-openstackid master: Fixed permissions issues on SpammerProcess https://review.opendev.org/717359 | 15:47 |
*** sshnaidm has quit IRC | 15:51 | |
*** sshnaidm has joined #opendev | 15:53 | |
*** hasharAway is now known as hashar | 15:58 | |
clarkb | we're about half an hour from applying the haproxy config update. I'm going to find lunch | 16:07 |
clarkb | I guess its just breakfast at this point | 16:08 |
clarkb | yay early mornings | 16:08 |
*** sshnaidm is now known as sshnaidm|ruck | 16:18 | |
openstackgerrit | Merged opendev/system-config master: Increase allowed number of haproxy connections https://review.opendev.org/738635 | 16:40 |
fungi | and now we wait for the deploy to run | 16:40 |
*** dtantsur is now known as dtantsur|afk | 16:45 | |
fungi | looks like the deploy is wrapping up now | 16:45 |
fungi | and done | 16:45 |
fungi | once enough of us are around, i guess we can restart the container | 16:47 |
clarkb | fungi: I think it restarts automatically? | 16:49 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all seems to show that | 16:51 |
fungi | it didn't seem like it restarted, looking at ps | 16:51 |
fungi | start timestamp is over two weeks ago | 16:52 |
fungi | root 12495 0.0 0.1 19848 11176 ? Ss Jun11 1:22 haproxy -sf 6 -W -db -f /usr/local/etc/haproxy/haproxy.cfg | 16:52 |
fungi | weird | 16:53 |
fungi | maybe it reread its config live? | 16:53 |
clarkb | I didn't think it did but maybe | 16:53
fungi | no mention of any config reload in syslog | 16:57 |
clarkb | docker ps also shows the container didn't restart | 16:58 |
clarkb | maybe it is checking its config on the fly? | 16:58 |
clarkb | gitea01 connections, memory and cpu has gone up too but all at reasonable levels | 17:00 |
clarkb | fungi: I'm thinking we should probably restart the haproxy just to be sure, but it definitely seems to be running with that new config anyway | 17:00
clarkb | I'm checking the ansible now to see if we signal the process at all | 17:00
clarkb | we do: cmd: docker-compose kill -s HUP haproxy | 17:01 |
fungi | ahh, okay | 17:01 |
clarkb | thats the handler for updating our config so this is all good | 17:01 |
clarkb | no restart necessary | 17:01 |
fungi | yep, perfect | 17:01 |
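(A hedged manual equivalent of that handler, using the compose directory and in-container config path that show up later in this log -- double-check both on the host before running anything:)
    cd /etc/haproxy-docker
    sudo docker-compose exec haproxy haproxy -c -f /usr/local/etc/haproxy/haproxy.cfg   # syntax check first
    sudo docker-compose kill -s HUP haproxy                                             # graceful reload in master-worker mode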
clarkb | also if we were using all 16k conns on the front end I would expect 32k conns recorded in cacti | 17:02 |
fungi | agreed | 17:03 |
clarkb | we're "only" getting ~16k conns in cacti which implies to me we've now got enough capacity in haproxy to address the load | 17:03 |
fungi | waiting to see if there's fluctuation in that or if it looks capped still | 17:03
clarkb | as long as our backends stay happy this may be sufficient to work around the problem | 17:03 |
clarkb | and if so we can see if this persists as a low key ddos or maybe it will resolve on its own if the bots get the crawling done | 17:03 |
fungi | the new top out does seem to have some variation in it compared to before, so i have hopes that's representative of actual demand now | 17:04 |
clarkb | ya | 17:04 |
clarkb | gitea04 may not be super happy with the change. I'm wondering if that is where rh nat maps to | 17:08 |
clarkb | all the others look to be fine according to cacti | 17:08 |
sgw1 | Hi There, was there an issue with opendev.org earlier today? Some folks were seeing: | 17:08 |
sgw1 | fatal: unable to access 'https://opendev.org/starlingx/stx-puppet.git/': SSL received a record that exceeded the maximum permissible length. | 17:08 |
sgw1 | and it was very slow | 17:08 |
clarkb | sgw1: yes we've been sorting that out today. Basically at about midnight UTC today we've come under what appears to be a ddos (not sure if intentional or not) | 17:09 |
clarkb | it looks like a web crawler bot out of china fetching all the commits and files out of our repos | 17:10 |
clarkb | but it's doing so from many, many, many IPs | 17:10
clarkb | anyway what we just did was to bump up the connection limit on the load balancer as it appears that things would be happy with that, though maybe one of our backends is not due to how we have to load balance by source IP | 17:10
clarkb | haproxy just decided that gitea04 is down due to response lag | 17:11 |
sgw1 | clarkb: thanks for the info, I forwarded it back to the folks asking. | 17:12 |
clarkb | sgw1: they can always ask too :) | 17:12 |
sgw1 | clarkb: while I am here, we are getting ready for branching our 4.0 release, I might try a test push to one repo, if I need to undo it for some reason, I will check in. | 17:13 |
clarkb | looks like haproxy decided gitea04 is back now | 17:13 |
clarkb | so it may sort of throttle itself ? | 17:13 |
sgw1 | clarkb: not sure how to answer that other than not their style :-( | 17:13 |
clarkb | I'm not sure if this is better than the situation before where we told people to go away at the haproxy layer | 17:13
clarkb | RH nat is not on gitea04 fwiw | 17:14 |
clarkb | fungi: it seems like we've stablized the incoming connection counts but we're slowly starting to get unhealthy backends | 17:16 |
clarkb | gitea01 is starting to swap now along with 03 | 17:17 |
clarkb | er 04 | 17:17 |
clarkb | ya may need a revert | 17:17 |
clarkb | we may be stabilizing but we're doing so at relatively unhappy states (not completely dead though) | 17:19 |
clarkb | maybe lets let it run for 15-30 minutes then see where we've ended up and go from there? | 17:19 |
clarkb | unfortunately my next best idea is start blocking large chunks of chinese IP space :/ and I'm worried about collateral damage | 17:20 |
fungi | yeah | 17:21 |
fungi | or we add more backends | 17:21 |
fungi | yeah, looks like cpu and load average on the backends is not scaling linearly with the increase in connection count | 17:22 |
fungi | we're getting 100% cpu with load averages around 16+ | 17:23
clarkb | there are certain requests that cost more than others, its possible that the requests we're getting are intentionally expensive | 17:23
fungi | aha, it's memory pressure | 17:23 |
fungi | i think we're killing them with swapping | 17:23 |
clarkb | ya | 17:23 |
fungi | oh, and now i see in scrollback you already said that | 17:23 |
fungi | sorry, trying to juggle hvac contractor and eating my now cold lunch | 17:24 |
clarkb | sgw1: fwiw we'd love to have more people from involved projects help build the community resources. If there is any way we can help change the style of interaction that would be great | 17:24
clarkb | sgw1: we can't help if we can't even communicate with each other :/ | 17:24 |
clarkb | fungi: internet tells me that AS is advertising 1236 prefixes | 17:27 |
clarkb | I suppose we could add all of those to firewall block rules? | 17:28 |
fungi | that would be... painful | 17:28 |
clarkb | ya but not having a working service is worse :? | 17:28 |
fungi | i wonder if we could programmatically aggregate a bunch of those | 17:29 |
clarkb | gitea01 is going to OOM soon I think | 17:29 |
clarkb | 04 and 02 are a bit more stable | 17:29 |
fungi | we should probably revert the connection limit increase for now? we could add more gitea backends | 17:29 |
clarkb | fungi: ya I'll do that manually really quickly | 17:30 |
fungi | i'll push up the actual revert change | 17:30 |
clarkb | manual application is done | 17:30 |
fungi | apparently even git replication to the gitea servers has been lagging | 17:31 |
fungi | oh, i'm getting 500 internal server errors | 17:31
fungi | so i can't currently remote update my checkout | 17:33 |
clarkb | you should be able to update from gerrit | 17:35 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Revert "Increase allowed number of haproxy connections" https://review.opendev.org/738679 | 17:36 |
*** slittle1 has joined #opendev | 17:36 | |
fungi | reverted through gerrit's webui for now | 17:36 |
fungi | yeah, that's what i wound up doing | 17:36 |
slittle1 | Just joined ... | 17:37 |
slittle1 | is opendev's git server having issues ? | 17:37 |
clarkb | slittle1: yes we've had a ddos all day and we thought we could alleviate some of the pain for people and it ended up consuming too much memory and making things worse | 17:37
fungi | slittle1: well, it's a cluster of 8 git servers behind a load balancer, but yes, we're under a very large volume of git requests from random addresses in china unicom | 17:37 |
clarkb | fungi: radb gives me 986 prefixes | 17:37 |
clarkb | fungi: we could add 986 drop rules easily enough | 17:38 |
fungi | clarkb: yeah, doing that on gitea-lb01 for now is probably the bandaid we need while we look at better options | 17:38 |
clarkb | fungi: `whois -h whois.radb.net -- '-i origin AS4837' | grep ^route: | sed -e 's/^route:\s\+//' | sort -u | sort -n | wc -l` fwiw | 17:38 |
clarkb | drop the wc if you want to see them | 17:39 |
fungi | it's times like this i miss being able to just add one bgp filter rule to my border routers :/ | 17:39 |
clarkb | I think we may need to restart gitea backends to get them happy again though | 17:39 |
clarkb | probably start with firewall update then restart backends? | 17:40 |
fungi | yes, they may not release their memory allocations without restarts | 17:40 |
*** hashar is now known as hasharAway | 17:40 | |
fungi | we ought to be able to rolling restart the backends, can disable them one by one in haproxy if we want | 17:40 |
clarkb | fungi: there is a more graceful way to do it to make sure that gerrit replication isn't impacted, but reboots may be worthwhile just in case anything got OOMKillered | 17:41 |
clarkb | but lets figure out the iptables rule changes first | 17:41 |
fungi | we can certainly just trigger a full replication once the restarts are done | 17:41 |
clarkb | fungi: `for X in $(cat ip_range_list) ; do sudo iptables -I openstack-INPUT -j DROP -s $X; done` ? | 17:44 |
clarkb | -s will take x.y.z.a/foo notation right? | 17:44 |
*** xakaitetoia has joined #opendev | 17:44 | |
fungi | none are v6 prefixes? | 17:44 |
clarkb | fungi: as far as I can tell it was all ipv4 | 17:45 |
fungi | yes, cidr notation works fine with iptables | 17:45 |
fungi | and that command looks right | 17:45 |
clarkb | let me generate the list on the lb | 17:45 |
fungi | no whois installed there | 17:45 |
clarkb | whois not found | 17:45 |
clarkb | of course not | 17:46 |
clarkb | I'll scp it | 17:46 |
clarkb | fungi: gitea-lb01:/home/clarkb/china_unicom_ranges | 17:46 |
fungi | looks right | 17:47 |
fungi | also radb.net does seem to aggregate prefixes where possible, at least skimming the list | 17:47 |
fungi | i don't see any subsets included | 17:48 |
fungi | i say go for it. worst case we reboot the server via nova api if we lock ourselves out | 17:49 |
clarkb | fungi: `for X in $(cat china_unicom_ranges) ; do echo $X ; sudo iptables -I openstack-INPUT -j DROP -s $X ; done` is my command on lb01 that still look good? made some small edits | 17:50 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx https://review.opendev.org/738517 | 17:51 |
fungi | clarkb: yep, that looks right | 17:51 |
clarkb | fungi: ok I'm running that now | 17:52 |
fungi | we could get fancy specifying destination ports, but there's little point as long as we don't lock ourselves out | 17:52 |
clarkb | thats done | 17:52 |
clarkb | I can create new connections to the server | 17:53
fungi | `sudo iptables -nL` looks like i would expect | 17:54 |
clarkb | lots of cD connections in haproxy logs from those ranges now | 17:54
clarkb | which is I think a side effect of the iptables rules | 17:54
fungi | trust me you don't want to try that without -n | 17:54 |
clarkb | to reset iptables rules without rebooting we can restart the iptables persistent unit | 17:55 |
clarkb | I Think it may be called netfilter-something now | 17:55 |
fungi | netfilter-persistent, yes | 17:56 |
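(For later cleanup, a sketch of backing those drops out without a reboot -- either reload the saved ruleset via the persistence unit, or reverse the insert loop rule by rule using the same range file:)
    sudo systemctl restart netfilter-persistent
    # or:
    for X in $(cat china_unicom_ranges) ; do sudo iptables -D openstack-INPUT -j DROP -s $X ; done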
clarkb | but giteas are still unhappy | 17:56 |
clarkb | I'm going to gracefully restart 01 and see if we need to reboot | 17:56 |
clarkb | basically we can always reboot after graceful restart | 17:57 |
smcginnis | Do we have general connectivity issues right now? | 17:57 |
smcginnis | re: https://review.opendev.org/#/c/738443/ | 17:57 |
fungi | smcginnis: just the git servers | 17:57 |
smcginnis | Jobs all failing with connection failures. | 17:57 |
fungi | though they should be starting to recover nowish, we hope | 17:57 |
fungi | i can finally `git remote update` again | 17:58 |
clarkb | I just blocked all of china unicom from talking to them | 17:58 |
fungi | ~1k network prefixes | 17:58 |
smcginnis | OK, I'll recheck in a bit. At least the one I am looking at now, it was trying to get TC data from the gitea servers. | 17:58 |
fungi | i'll write up something to service announce and status notice | 17:58 |
clarkb | graceful restart isn't going so great | 18:00 |
clarkb | I'm still waiting on it to stop things | 18:00 |
fungi | infra-root: status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:01 |
fungi | does that look reasonable? | 18:01 |
clarkb | fungi: yes | 18:01 |
smcginnis | Looks good to me as infra-nonroot too. :) | 18:02 |
clarkb | AS4134 may be necessary too | 18:02 |
clarkb | now that there is less noise in the logs i'm seeing a lot of traffic from there too :/ | 18:02 |
corvus | clarkb, fungi: oh shoot, i missed that more stuff happened, sorry | 18:03 |
fungi | "chinanet" | 18:03 |
johnsom | Ah, this must be why my git clone is hanging after I send the TLS client hello | 18:03 |
clarkb | corvus: basically my read of resource headroom was completely wrong | 18:03 |
corvus | clarkb: on gitea or haproxy? | 18:03 |
clarkb | corvus: once we allowed more connections we spiralled out of control with memory on the gitea backends | 18:03 |
clarkb | corvus: gitea | 18:03 |
corvus | gotcha | 18:03 |
fungi | johnsom: i hope it's clearing up now but we're still evaluating how the temporarily bandaid is working | 18:04 |
clarkb | haproxy could've done a lot more :) | 18:04 |
corvus | clarkb: is haproxy config reverted? | 18:04 |
clarkb | corvus: yes | 18:04 |
fungi | corvus: manually reverted and also proposed as https://review.opendev.org/738679 | 18:04 |
clarkb | corvus: https://review.opendev.org/#/c/738679/1 is proposed and I manualyl reverted | 18:04 |
johnsom | Let me know if I can consult on haproxy configurations. I might know a thing or two about the subject. grin | 18:05 |
corvus | clarkb: what about a smaller bump? 6k? | 18:05 |
fungi | johnsom: at the moment it's just serving as a layer 4 proxy, so not much flexibility | 18:05 |
clarkb | johnsom: well earlier today we had maxconn set to 4000 which resulted in errors because haproxy wouldn't allow new connections. I evaluated resource use on our backends and thought we had head room to go to ~16k but was wrong | 18:05 |
clarkb | corvus: based on cacti data we were only doing ~8k connections | 18:06 |
clarkb | corvus: 6k might be ok but we may be really close to that limit | 18:06
corvus | oh. so maybe 4500 :) | 18:06 |
clarkb | but also even at that limit we still have user noticeable outages | 18:06 |
clarkb | because haproxy won't handshake with them before they timeout | 18:06 |
corvus | clarkb: that happens at 4k? | 18:07 |
clarkb | corvus: haproxy won't complete handshakes until earlier connections finish. This means occasionally your client will fail. This is how it was originally reported to us earlier today | 18:07
clarkb | sometimes it would just be slow | 18:09 |
corvus | clarkb: right, so getting more gitea connections available to haproxy was the expected solution. but doing that overran gitea servers. | 18:09 |
clarkb | yes | 18:09 |
corvus | so do you think we're hitting a gitea limit? it would be nice to use more than 25% ram and have a higher load average than 2.0. | 18:10
clarkb | I half expect that the requests being made require lots of memory because they are being made across repos, files, and commits | 18:10
corvus | 50% and a load average of 8 would be ideal, i'd think :/ | 18:10
clarkb | and ya tuning to a sweet spot is a good thing, but I expect we'll still be degraded if we do that without the firewall rules | 18:11
clarkb | I finally got gitea01 to stop its gitea stuff | 18:11 |
clarkb | I'll reboot it now? | 18:11 |
clarkb | it did get OOMkillered ~15 minutes ago so a reboot seems like a good idea | 18:12 |
clarkb | (then iterate through the list and do that | 18:12 |
johnsom | Well, with L4, you should be able to handle ~29,000 connections per gigabyte allocated to haproxy. But, this is very version dependent and data volume may impact the CPU load. As you scale connections you also need to make sure you tune the open files available as well, which with systemd gets to be tricky. | 18:12 |
clarkb | johnsom: yes haproxy is not the problem | 18:12 |
clarkb | johnsom: the issue is allowing this many connections to the backends causes them to run out of memory | 18:13 |
fungi | johnsom: yeah, we're really not seeing any issues with haproxy, it's the backends which are being overrun | 18:13 |
johnsom | Ok. Would rate limiting at haproxy help, is there a cidr or such that could be used for rate limiting? | 18:13 |
clarkb | johnsom: yes, thats what the maxconn limit is effectively doing for us | 18:13 |
clarkb | johnsom: can we rate limit by cidr in haproxy and then remove our iptables rules? | 18:14 |
fungi | johnsom: we saw connections scattered across probably hundreds of prefixes from the same as | 18:14 |
clarkb | fungi: and now a second AS | 18:14 |
clarkb | I don't see any objections. I'm rebooting gitea01 now | 18:14 |
corvus | clarkb: ++ | 18:14 |
fungi | clarkb: oh, yes, please do reboot it | 18:14 |
johnsom | Ugh, ok. Yeah, you can do pretty complex rate limiting with haproxy, beyond just maxconn. But it sounds like this may not be easy to identify the bad actors. | 18:15 |
fungi | it seems to be some sort of indexing spider | 18:15 |
johnsom | clarkb Here is some nice examples: https://www.haproxy.com/blog/four-examples-of-haproxy-rate-limiting/ | 18:15 |
fungi | but a very distributed one operating on the down-low | 18:16 |
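(For reference, a minimal sketch of the per-source rate limiting those haproxy.com examples describe; the frontend name, table size, window and threshold here are illustrative assumptions, not values tested against our config:)
    frontend git-https
        stick-table type ip size 100k expire 30s store conn_rate(10s)
        tcp-request connection track-sc0 src
        tcp-request connection reject if { sc_conn_rate(0) gt 20 }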
clarkb | gitea01 is done and up and looks ok | 18:16 |
clarkb | I'll work through 02-08 in sequence | 18:16 |
clarkb | other things we may want to look at is as4134 (double check that against current haproxy logs) and maybe using haproxy to rate limit instead of iptables drop rules | 18:17 |
fungi | at the volume we've been seeing, the rate limit will be no different than dropping at the firewall, other than it will start allowing connections again once the event passes | 18:17 |
clarkb | corvus: then once all are rebooted I think we can tune the bumped maxcoon | 18:18 |
clarkb | *maxconn | 18:18 |
clarkb | right now it won't be useful because most of the giteas are unhappy | 18:18 |
clarkb | but once I've got them happy again we'll get useful info | 18:18
clarkb | fungi: thats a good point | 18:18 |
clarkb | 02 is also having trouble with the graceful down. I'll try just doing a reboot on 03 when 02 is done | 18:19 |
fungi | should i status notice about AS4837 now or wait until we decide whether we need to add AS4134? | 18:20 |
clarkb | I think you can go ahead and then we can add any other rule changes as subsequent notices/logs | 18:20 |
fungi | #status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:21 |
openstackstatus | fungi: sending notice | 18:21 |
-openstackstatus- NOTICE: Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:21 | |
fungi | i'll put something similar out to the service-announce ml | 18:22 |
openstackstatus | fungi: finished sending notice | 18:24 |
clarkb | I got impatient and tried to do 03 but it's super bogged down, so I got on 04 and have issued a reboot. It is very slow. I assume it's waiting for docker to stop and docker is waiting for containers to stop | 18:24
clarkb | tldr just doing a reboot isn't any faster I don't think | 18:24 |
clarkb | we can do nova reboot --hard | 18:25 |
clarkb | or be patient. I'm attempting to practice patience | 18:25 |
fungi | my concern with --hard is that it won't wait for the filesystems to umount | 18:30 |
johnsom | FYI, I got a successful clone now, so functionality returning | 18:31 |
fungi | johnsom: that's great, thanks for confirming! | 18:31 |
clarkb | 01, 02, and 04 are happy now | 18:32 |
clarkb | 03 and 05 next | 18:32 |
clarkb | and now those are done | 18:34 |
clarkb | I think as more become happy the load comes off the sad ones and they restart quicker | 18:34 |
clarkb | I'm doing graceful stop and reboot to clear out any OOM side effects | 18:35 |
fungi | so do we want to try to add more backends? | 18:37 |
clarkb | fungi: I mean we can but then we have to spend the rest of the day replicating | 18:38 |
fungi | in conjunction with inching up the max connections | 18:38 |
clarkb | and if its just to serve a ddos I'm not sure thats the right decision | 18:38 |
fungi | oh, right, speaking of replicating, we need to retrigger a full replication in gerrit once your reboots are complete | 18:39 |
clarkb | cacti shows the difference between normal and not normal and it's massive overkill to add more hosts | 18:39
fungi | given that this doesn't look like a targeted attack, i have a feeling it's going to come and go as whatever distributed web crawler this is spreads | 18:40 |
clarkb | all 8 should be properly restarted now | 18:40 |
clarkb | fungi: is it weird for it to be so distributed though? | 18:40 |
fungi | so it's not growing the cluster to absorb this incident, but rather growing to accept future similar incidents | 18:40
clarkb | fungi: right, but being idle 99% of the time seems like a poor use of donated resources | 18:41 |
fungi | i agree, if we had some way to elastically scale this, it would be awesome | 18:41 |
clarkb | if this is an actual bot that isn't intentionally malicious we could ask it nicely to go away with a robots.txt | 18:41 |
clarkb | but I think the best way to do that is with tcpdumps and decrypting those streams | 18:42 |
clarkb | which is not the simplest thing to do iirc | 18:42 |
fungi | the nature of the traffic though, it seems like someone has implemented some crawler application on top of a botnet of compromised systems. i doubt we're the only site seeing this we just happen to be hit hard by having a resource-intensive service behind lots and lots of distinct urls | 18:42 |
*** xakaitetoia has quit IRC | 18:42 | |
clarkb | ya that could be | 18:43 |
clarkb | we're currently operating under our limit so if we want to test bumping the limit we'll need to undo our iptables rules I think | 18:43 |
fungi | typical botnets range upwards of tens of thousands of compromised systems, and having a majority of them originating from ip addresses in popular chinese isps is not uncommon | 18:44 |
clarkb | we are at about 60% of the limit | 18:44 |
clarkb | 2.5/4k connections | 18:44 |
clarkb | ish | 18:44 |
clarkb | potential options for next steps: Remove iptables rules for china unicom ranges. This will likely put us back into a semi degraded state with slow connections and connections that fail occasionally. If we do <- we can try raising our maxconn value slowly until we see things get out of control to find where our limit is. We can add backends prior to doing that and try to accommodate the flood. We can do the | 18:50
clarkb | opposite and block more ranges with malicious users. We can try tcpdumps and attempt to determine if this is a bot that will respond to a robots.txt value and if so update robots.txt | 18:50 |
clarkb | We can do nothing and see if anyone complains from the blocked ranges and see if the bots go away (since we're managing to keep up now) | 18:50 |
clarkb | and with that I need a break. Our meeting is in 10 minutes and I've been heads down since ~6am | 18:51
fungi | if this really is a widespread spider, we might be able to infer some characteristics by looking at access logs of some of our other services | 18:57 |
*** hasharAway is now known as hashar | 19:01 | |
*** sshnaidm|ruck is now known as sshnaidm|bbl | 19:02 | |
*** ianw_pto is now known as ianw | 19:04 | |
openstackgerrit | Merged opendev/system-config master: Revert "Increase allowed number of haproxy connections" https://review.opendev.org/738679 | 19:17 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Enable access log in gitea https://review.opendev.org/738684 | 19:24 |
ianw | so i'm not seeing that we have a robots.txt for opendev.org or that gitea has a way to set one, although much less sure on that second part | 19:29 |
clarkb | ianw: https://github.com/go-gitea/gitea/issues/621 | 19:30 |
clarkb | thats a breadcrumb saying its possible | 19:30 |
ianw | yeah just drop it in public/ | 19:33 |
clarkb | looks like gitea may rotate the logs for us already | 19:43 |
ianw | https://git.lelux.fi/theel0ja/gitea-robots.txt/src/branch/master/robots.txt looks like a pretty sweet robots.txt | 19:46 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Add backend source port to haproxy logs https://review.opendev.org/738685 | 19:46 |
ianw | i note it has | 19:47 |
ianw | # Language spam | 19:47 |
ianw | Disallow: /*?lang= | 19:47 |
fungi | seems to be an already recognized issue | 19:47 |
corvus | so the error in build 7cdd1b201d0e462680ea7ac71d0777b6 is a follow on from this error: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary | 19:51 |
corvus | we may be able to get the module failure info from the executor, i'm not sure | 19:51 |
ianw | corvus: that step shows as green OK in the console log for you right? | 19:52 |
corvus | ianw: yep | 19:52 |
clarkb | ianw: I like the idea of adapting that robots.txt | 19:54 |
* clarkb is happy we had a meeting brainstorm so many good ideas | 19:54 | |
corvus | ianw: that task has "failed_when: false" | 19:54 |
corvus | # Using shell to try and debug why this task when run sometimes returns -13 | 19:54 |
corvus | "rc": -13 | 19:55 |
*** hashar is now known as hasharAway | 19:55 | |
corvus | so, um, apparently the author of that role has observed that happening, and expects it to happen, and so ignores it | 19:55 |
corvus | but the follow-on task assumes it works | 19:55 |
clarkb | ya we've seen that error before | 19:55 |
clarkb | I don't recall what debugging setup we had done though | 19:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 19:55 |
corvus | i'm not sure that we got any more information from the shell | 19:56 |
corvus | anyway, we still need to track down that module error in the executor log | 19:56 |
corvus | there's nothing in the executor log that isn't already in the job output | 19:58 |
corvus | so basically, no idea what caused -13 to happen | 19:59 |
corvus | switching to the gear issue: that's something that needs to be installed in the ansible venvs, and i guess we don't do that in the upstream zuul images. | 20:01 |
corvus | this is a bit more of a thorny problem though, since there is no usage of gear in zuul-jobs | 20:01 |
corvus | so our pseudo-policy of "we'll add it to zuul-executor if something in zuul-jobs needs it" doesn't cut it here | 20:01 |
clarkb | infra-root to summarize gitea situation I think we can land https://review.opendev.org/#/c/738686/1 and https://review.opendev.org/#/c/738684/1 (this one actually likely already does rotation) then land the haproxy logging update fungi is working on. Then basically as each change lands check logs and see if we learn anything new from better logging | 20:01 |
corvus | clarkb: we don't want the internet archive to crawl us? | 20:02 |
*** noonedeadpunk has quit IRC | 20:02 | |
clarkb | corvus: I mean maybe? the problem seems to be crawling in general poses issues though I guess we've never really noticed this until now so being selective about what we allow is probably as good as it gets? | 20:03 |
clarkb | in particular the lang stuff as well as specific commit level stuff would be good to clean up based on the logs we had during this situation | 20:03
fungi | well, the filter for localization url variants is probably a big enough help on its own | 20:03 |
*** noonedeadpunk has joined #opendev | 20:03 | |
ianw | that was just a straight copy of that upstream link, i agree maybe we could drop that bit | 20:03 |
*** diablo_rojo has joined #opendev | 20:03 | |
fungi | those essentially multiply the url count dozens of times over | 20:03
corvus | honestly, i'd rather switch from gitea to something that can handle the load rather than disallowing crawling | 20:03 |
ianw | also crawl-delay as 2, if obeyed, would seem to help | 20:04 |
corvus | in my view, supporting indexing and archiving is sort of the point of putting something in front of gerrit :/ | 20:04 |
fungi | i concur | 20:04 |
ianw | i don't think this is disallowing indexing, just making it more useful? | 20:05 |
clarkb | ya its directing it to the bits that are more appropriate | 20:05 |
corvus | it disallows indexing commits? | 20:05 |
clarkb | indexing every file for every commit for every localization lang is expensive | 20:05 |
clarkb | corvus: yes because the crawlers are crawling every single file in every single commit. If we want to start with simply stopping the lang toggle and see if that is sufficient we can do that | 20:06 |
corvus | i could be convinced lang is not useful (since the *content* isn't localized) | 20:06 |
*** mlavalle has quit IRC | 20:06 | |
clarkb | then work up from there rather than backwards | 20:06 |
clarkb | correct only the dashboard template itself is localized not the git repo content | 20:06
corvus | clarkb: right. i'm going out on a limb and saying that crawlers indexing every file in every commit is desirable. | 20:06 |
*** tobiash has quit IRC | 20:06 | |
corvus | "better them than us" :) | 20:06 |
clarkb | I agree, but it also causes our service to break | 20:06 |
corvus | well, no | 20:06 |
corvus | the botnet ddos causes our service to break | 20:07 |
*** mlavalle has joined #opendev | 20:07 | |
ianw | i note github robots.txt has | 20:07 |
ianw | User-agent: baidu | 20:07 |
ianw | crawl-delay: 1 | 20:07 |
corvus | anyway, i'm trying to suggest we don't over-compensate and throw out the benefit of the system. let's start small with disallowing lang, setting the delay, and whatever else seems minimally useful | 20:07
*** tobiash has joined #opendev | 20:07 | |
clarkb | corvus: that works for me | 20:07 |
clarkb | we should exclude activity as well | 20:08 |
clarkb | thats the not useful graph data | 20:08 |
ianw | much like the sign at the shop like "please do not eat the soap" i'm presuming they put that there in response to an observed problem :) | 20:09
corvus | ianw, clarkb: i left comments | 20:09 |
fungi | also fungi is not working on the haproxy logging change, he already proposed it during the meeting (738685) | 20:09 |
fungi | though happy to apply any fixes if there are issues | 20:10 |
clarkb | corvus: I think the from github is url paths that gitea inherited from github | 20:10 |
clarkb | corvus: I expect that gitea supports those paths that are github specific | 20:10 |
clarkb | (have not tested that yet) | 20:10 |
clarkb | fungi: oh I completely missed the change is already up | 20:10 |
fungi | do we still need to trigger full replication from gerrit? or did that happen already and i missed it | 20:10 |
corvus | clarkb: oh, like are they aliases for more native gitea paths? | 20:10 |
clarkb | corvus: ya that was my interpretation | 20:10 |
corvus | clarkb: if that's the case, then i agree we can filter them because it's duplicative | 20:10 |
clarkb | fungi: I don't think that has happened yet | 20:10 |
corvus | ianw: ^ feel free to ignore my comment on that if that's the case | 20:11 |
fungi | i can work on that next while i wait for gitea to stop hanging on sync | 20:11 |
ianw | yeah, it appears seeded from github.com/robots.txt | 20:11 |
fungi | er, i mean, while i wait for gertty to stop hanging | 20:12 |
fungi | i think something about the gitea issues earlier has confused it | 20:12
fungi | #status log triggered full gerrit re-replication after gitea restarts with `replication start --all` via ssh command-line api | 20:14 |
openstackstatus | fungi: finished logging | 20:14 |
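(For anyone repeating that later, the invocation is along these lines -- the account name is a placeholder for whatever account holds the right gerrit capability, and 29418 is gerrit's usual ssh API port:)
    ssh -p 29418 <admin-user>@review.opendev.org replication start --all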
clarkb | fungi: corvus I reviewed the haproxy change and left some notes. I think we should double check the check logs before approving but I +2'd | 20:17 |
clarkb | there is one thing about []s being special in haproxy docs that we want to double check doesn't cause problems for us | 20:17
fungi | oh, i'll take a closer look, thanks! | 20:17 |
clarkb | I expect its fine because the [%t] in the existing format is fine | 20:18 |
fungi | yeah, i just double-checked that on the production server | 20:19 |
fungi | i could also drop them, i just didn't want it to become ambiguous if we end up doing any load balancing over ipv6 (not that we do presently) | 20:20 |
clarkb | I think if the check logs look good we should +A | 20:20 |
clarkb | it's mostly a concern that I don't understand what haproxy means by the []s being different | 20:20
clarkb | but if the behavior is fine ship it :) | 20:21 |
fungi | and yeah, the rest of that format string was just copy-pasted from what the haproxy doc says the default is for tcp forward logging, though i did spot-check that it seemed to match the things we're logging currently | 20:22 |
clarkb | https://review.opendev.org/#/c/738684/1 has passed if anyone has time for gitea access log enablement | 20:23 |
clarkb | though note I have to pop out in about 20 minutes to get my glasses adjusted (they got scratched and apparently warranties cover that) | 20:24
ianw | do we want to disallow */raw/ ? | 20:24 |
clarkb | I guess it depends if we expect indexers to properly be able to render the intent of formatted source? | 20:25 |
clarkb | raw may be useful for searching verbatim code snippets? | 20:25 |
*** hasharAway has quit IRC | 20:28 | |
*** hashar has joined #opendev | 20:29 | |
clarkb | probably leave raw for now since I'm not sure all the useful indexers can do that | 20:30 |
ianw | what about blame? | 20:33 |
ianw | raw and blame are the two i thikn github disallows that are relevant to us | 20:34 |
clarkb | blame I can personally live without since it is in the git repos and web index is a poor substitute | 20:34 |
clarkb | corvus: fungi ^ thoughts? | 20:34 |
*** sshnaidm|bbl has quit IRC | 20:34 | |
*** sshnaidm|bbl has joined #opendev | 20:35 | |
corvus | i think we can live without blame. it is expensive and lower value in indexes/searching | 20:36 |
corvus | /archiving | 20:36 |
ianw | yeah i have that in | 20:37 |
ianw | i think the rest gitea does not do | 20:37 |
fungi | yeah, i don't think there's a lot of benefit to spidering that | 20:37 |
ianw | e.g. atom feeds, etc | 20:37 |
fungi | blame, i mean | 20:37 |
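(A rough sketch of the pared-down robots.txt being discussed -- the authoritative content is whatever lands in 738686, and the activity/blame path patterns are assumptions about gitea's URL layout:)
    User-agent: *
    Disallow: /*?lang=
    Disallow: /*/*/activity
    Disallow: /*/*/blame/
    Crawl-delay: 2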
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 20:39 |
clarkb | we apparently don't put any traffic through the lb in testing https://zuul.opendev.org/t/openstack/build/1abcc48f83fe4a1c91db7f45fea09391/log/gitea-lb01.opendev.org/syslog.txt doesn't show it anyway | 20:40 |
clarkb | I've got to pop out now so won't approve it but I think you can if you'll watch it | 20:40
clarkb | fungi: ^ | 20:40 |
ianw | Disallow: /Explodingstuff/ | 20:41 |
ianw | that has one repo with a ransomware .exe | 20:42 |
ianw | ... i wonder what the story is with that | 20:43 |
openstackgerrit | Merged openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8 https://review.opendev.org/738435 | 20:44 |
fungi | ianw: where? | 20:45 |
*** sshnaidm|bbl is now known as sshnaidm|afk | 20:46 | |
ianw | fungi: sorry that's in the github.com/robots.txt which i was comparing and contrasting to | 20:46 |
fungi | ahh | 20:46 |
fungi | amusing | 20:46 |
ianw | of all the possible things on github, it seems odd that this one is so special | 20:47 |
fungi | it was probably serving a very high-profile ransomware payload to many compromised systems | 20:47 |
fungi | or something similar they didn't want showing up in web searches | 20:48 |
ianw | yeah but why not delete it? although i might believe that script kiddies were using some sort of download thing that unintentionally obeyed robots.txt (because it was designed for good and they didn't patch it out) maybe | 20:49 |
fungi | well, deleting the file *thoroughly* requires rewriting (at least some) git history | 20:50 |
*** priteau has quit IRC | 20:52 | |
*** jbryce has quit IRC | 21:15 | |
*** jbryce has joined #opendev | 21:15 | |
clarkb | replacing lenses took surprisingly little time. I think I'll hang around to see what logging tells us but then it was an early start today and I'm tired. Will likely call it there and pick things up in the morning | 21:19 |
fungi | corvus: are you cool with the updated robots.txt in 738686? looks like it addresses your prior comments now | 21:27 |
corvus | i'll check | 21:27 |
corvus | +3 | 21:27 |
fungi | thanks! | 21:27 |
corvus | ianw: thanks; i like the comments too so we know what to look at next if things are bad | 21:28 |
ianw | i don't think anyone ever accused me of leaving too *few* comments :) | 21:28 |
openstackgerrit | Merged opendev/system-config master: Enable access log in gitea https://review.opendev.org/738684 | 21:32 |
clarkb | re ^ I think we may need to restart gitea processes manually | 21:37 |
clarkb | because it's not a new container image so we don't automatically do it | 21:37
clarkb | if fungi's replication kick is still running we'll want to do it gracefully | 21:37 |
fungi | even if it's not still running we'd want to do it gracefully, right? | 21:37 |
clarkb | yes, though its less critical | 21:38 |
openstackgerrit | Merged opendev/system-config master: Add backend source port to haproxy logs https://review.opendev.org/738685 | 21:39 |
clarkb | haproxy should autoupdate if the sighup is enough there | 21:39 |
clarkb | I wonder if we should wait for robots.txt before restarting giteas just in case gitea caches that | 21:39 |
fungi | we did at least work out the restart for the gitea containers in such a way that we no longer lose replicated refs during a restart, right? | 21:40 |
clarkb | fungi: yes, though I don't think we've ever managed to fully confirm that. The semi-regular "my commit is missing" complaints went away after we did it though and some rough testing of the component pieces shows it should work | 21:41
clarkb | its just really hard to confirm in the running system due to races | 21:41 |
fungi | right, okay | 21:41 |
clarkb | basically its docker-compose down && docker-compose up -d mariadb gitea-web ; #wait here until the gitea web loads then docker-compose up -d gitea-ssh | 21:41 |
clarkb | and all of that is in the role too if you need to find it later | 21:42 |
clarkb | what that does is stops the ssh server first (its written that way in the docker compose file) then only starts the ssh server once gitea itself is ready. This way gerrit will fail to push anything until gitea the main process can handle the input | 21:43 |
fungi | makes sense | 21:44 |
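(A sketch of that sequence as a script; the compose directory and the local health-check URL/port are assumptions, and the authoritative ordering lives in the gitea ansible role:)
    cd /etc/gitea-docker        # assumed compose dir on the backends
    sudo docker-compose down
    sudo docker-compose up -d mariadb gitea-web
    # hold gitea-ssh back until the web process answers, so gerrit replication
    # fails fast rather than pushing into a half-started gitea
    until curl -skf https://localhost:3000/ >/dev/null; do sleep 5; done
    sudo docker-compose up -d gitea-ssh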
clarkb | hrm robots.txt is a bit of a ways out | 21:44 |
clarkb | maybe we should go ahead with some restarts now. I can do 01 and confirm that its access log works at least | 21:44 |
* clarkb does this | 21:45 | |
clarkb | apparently docker-compose stop != docker-compose down | 21:46 |
clarkb | down removes the containers, stop does not | 21:46
clarkb | so when I did the reboots down was correct to prevent them restarting in the wrong order | 21:46 |
clarkb | but without a reboot a stop is fine | 21:46 |
*** rosmaita has joined #opendev | 21:46 | |
clarkb | gitea01 is done | 21:47 |
clarkb | we have user agents | 21:48 |
clarkb | and baiduspider shows up | 21:48 |
clarkb | but so do others | 21:48 |
rosmaita | i have a (hopefully) quick zuul question when someone has a minute | 21:51 |
clarkb | rosmaita: go for it, I think the fires are under control and now we're just poking at them as they cool :) | 21:51 |
rosmaita | ty | 21:51 |
clarkb | note I think we may need to restart haproxy to get new log format | 21:51 |
rosmaita | i'm trying to configure this job: https://review.opendev.org/#/c/738687/4/.zuul.yaml@109 | 21:52 |
clarkb | the config updated but I don't see the new format yet | 21:52 |
rosmaita | i want it to use a playbook from a different repo | 21:52 |
rosmaita | but not sure how to specify that | 21:52 |
clarkb | rosmaita: playbooks can't be shared between repos, but roles can | 21:53 |
clarkb | rosmaita: usually the process is to repackage the ansible bits into a reusable role that the original source and new consumer can both use | 21:53 |
fungi | also jobs can be used between repos (obviously) | 21:54 |
fungi | so another approach is to define a job (maybe an abstract job) which uses the playbook, and then inherit from that in the other repo with a new job parented to it | 21:55 |
rosmaita | i think there's a tox-cover job defined in zuul-jobs, but i figured i should probably make openstack-tox my parent job | 21:57 |
clarkb | rosmaita: what is your post playbook going to do? | 21:59 |
rosmaita | clarkb: the playbook i'm using only references one role -- if i copy the playbook to the cinder repo, will it just find the fetch-coverage-output role? | 21:59 |
clarkb | rosmaita: yes zuul-jobs is available already. Though really you should probably create an openstack-tox-cover that does that for all jobs? | 21:59 |
clarkb | I think the difference between openstack-tox and normal tox is use of constraints | 22:00 |
clarkb | so a tox-cover <- openstack-tox-cover inheritance that is then applied to cinder makes sense to me | 22:00 |
rosmaita | ok ... i will try copying the playbook first to see if the job does what I want, and if it does, i will propose an openstack-tox-cover job | 22:01 |
clarkb | sounds good | 22:01 |
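A rough sketch of the inheritance being suggested; everything here is illustrative, and the names beyond tox-cover and openstack-tox are placeholders:

```
# hypothetical .zuul.yaml fragments
- job:
    name: openstack-tox-cover
    parent: tox-cover          # playbooks/roles come from zuul-jobs via the parent
    description: tox cover environment run with openstack constraints applied.

# the cinder repo then consumes the shared job instead of copying a playbook
- project:
    check:
      jobs:
        - openstack-tox-cover
```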
fungi | i've confirmed, `cd /etc/haproxy-docker && sudo docker-compose exec haproxy grep log-format /usr/local/etc/haproxy/haproxy.cfg` does show the updated config, but ps indicates the service has not restarted yet | 22:02 |
rosmaita | clarkb: ty | 22:02 |
clarkb | fungi: we just do a sighup | 22:02 |
clarkb | fungi: I'm guessing that only allows some things to be updated in the config and others are evaluated on start | 22:02 |
fungi | is the outage from `sudo docker-compose down && sudo docker-compose up -d` going to be brief enough we don't care about scheduling? | 22:03 |
*** tobiash has quit IRC | 22:04 | |
clarkb | I would do s/down/stop/ for the reason I discovered earlier. That outage should last only a few seconds (so it may be noticed, but only briefly) | 22:04 |
fungi | so it's stop followed up by -d? | 22:05 |
clarkb | yup stop && up -d | 22:05 |
fungi | okay, i'll give that a shot on gitea-lb01 now | 22:05 |
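Concretely, something like the following on gitea-lb01 (the host path comes from the grep command quoted earlier; expect a few seconds of connection failures while the container is recreated):

```
cd /etc/haproxy-docker
sudo docker-compose stop && sudo docker-compose up -d
```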
*** tobiash has joined #opendev | 22:06 | |
clarkb | k I'm watching | 22:06 |
fungi | took ~12 seconds | 22:06 |
fungi | i can browse content | 22:06 |
fungi | process start time looks recent now | 22:07 |
clarkb | logs still missing that info? | 22:07 |
fungi | yeah, it doesn't seem to be using the new log-format | 22:07 |
*** hashar has quit IRC | 22:07 | |
clarkb | maybe the []s are the problem here? | 22:07 |
fungi | well, only one of the two parameters added were in [] | 22:07 |
*** olaph has quit IRC | 22:09 | |
clarkb | I'm looking at the rest of the log config maybe we set the new value then override later | 22:09 |
clarkb | ya I think that may be what is happening | 22:09 |
clarkb | we set option tcplog on the frontends | 22:10 |
clarkb | which is a specific log format | 22:10 |
fungi | oh | 22:10 |
fungi | yep. i'll work on another fix | 22:10 |
clarkb | fungi: we can just test it manually first too | 22:10 |
clarkb | but I think commenting out those lines or removing them then sighup may do it | 22:11 |
fungi | this is a nice writeup: https://www.haproxy.com/blog/introduction-to-haproxy-logging/ | 22:13 |
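The interaction being described, sketched against a made-up config rather than the real opendev one (section names and the exact log-format string are illustrative; %bi/%bp are haproxy's backend source ip/port variables):

```
defaults
    log global
    mode tcp
    # custom format: client ip:port plus backend source ip:port, so an LB
    # connection can later be matched against the gitea-side logs
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %bi:%bp"

frontend balance_git_https
    bind *:443
    # "option tcplog" swaps in haproxy's built-in TCP log format, overriding
    # the log-format above -- which is why the new fields never appeared
    # until 738710 removed it
    # option tcplog
    default_backend git_https
```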
clarkb | ianw: fwiw gitea01 does have a fair bit of interesting UA data now in /var/haproxy/log/access.log which we can possibly use to update the robots.txt | 22:15 |
clarkb | while fungi is doing the haproxy things I'll finish up gitea02-08 restarts | 22:15 |
clarkb | then they'll all have that fun data | 22:15 |
fungi | where on the host system are we stashing the haproxy config? running `mount` inside the container isn't much help | 22:17 |
fungi | it's /usr/local/etc/haproxy/haproxy.cfg inside the container, but mount just claims that /dev/vda1 is mounted as /usr/local/etc/haproxy | 22:17 |
fungi | aha, it's /var/haproxy/etc/haproxy.cfg | 22:19 |
fungi | clarkb: yep, that did it | 22:21 |
fungi | as expected. i'll propose removing the two occurrences of "option tcplog" | 22:21 |
clarkb | fungi: ya check the docker compose config file for the mounts | 22:21 |
clarkb | 01-08 are all restarted and should have access logs now | 22:22 |
ianw | sorry back, having a look | 22:23 |
clarkb | heh the access logs don't have ports | 22:23 |
clarkb | so to map them you get the url + timestamp from the access log for the UA you want, then check against the macaron logs in the docker logs output; that gives you the port, and you can cross check against the lb log for the actual source | 22:24 |
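Roughly, the cross-referencing that implies (paths and container names here are assumptions, and step 3 only becomes possible once the port actually shows up on both sides):

```
# 1. pick a request (url + timestamp) for the UA of interest
grep 'Baiduspider' /var/gitea/logs/access.log | tail -5

# 2. find the same url/timestamp in the gitea (macaron) container logs,
#    which also record the source port of the connection from the LB
sudo docker logs <gitea-web-container> 2>&1 | grep '<url-from-step-1>'

# 3. on gitea-lb01, look that backend source port up in the haproxy log
#    to recover the real client address
grep ':<port-from-step-2>' <haproxy-log-file>
```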
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Remove the tcplog option from haproxy configs https://review.opendev.org/738710 | 22:24 |
clarkb | well progress anyway | 22:24 |
rosmaita | clarkb: i am an idiot -- there is already an openstack-tox-cover job | 22:24 |
fungi | rosmaita: happens to me all the time, i consider it a validation of my need when i spend half an hour discovering that what i tried to add is already there | 22:25 |
rosmaita | :) | 22:25 |
clarkb | fungi: actually is there port info in the gitea side somewhere? | 22:25 |
fungi | clarkb: i don't know, i had hoped we'd get that from the access log being added | 22:26 |
clarkb | we may still have a missing piece | 22:26 |
corvus | i didn't look at that, i thought all we wanted from access log was UA | 22:26 |
clarkb | corvus: ya I think its 95% of what we want | 22:27 |
fungi | ua is definitely useful and an improvement | 22:27 |
corvus | we'll need to unblock the ip ranges to see the UA right? | 22:27 |
clarkb | if we can also map lb logs to gitea logs that is a separate win | 22:27 |
clarkb | corvus: no because we didn't block all the IPs doing the things | 22:27 |
fungi | had just hoped we'd reach a point where we could actually correlate a request to a source ip address | 22:27 |
corvus | ah cool -- do we have a ua yet? | 22:27 |
clarkb | corvus: we have many :) | 22:27 |
corvus | i mean from the crawler | 22:27 |
clarkb | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" | 22:28 |
fungi | 738710 will get the backend source ports to appear properly (tested manually to confirm) | 22:28 |
clarkb | \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" too | 22:28 |
clarkb | baidu doesn't seem to have an english faq | 22:29 |
ianw | that maps though with the crawl-delay: 1 that github sets for UA baidu | 22:29 |
clarkb | there are others too (I haven't managed to do a complete listing but I think if we grep for lang= and then do a sort|uniq -c sort of deal we should see which are most active) | 22:31 |
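Something along these lines gives a quick ranking, assuming the UA ends up as the last double-quoted field of each access log line:

```
grep 'lang=' /var/gitea/logs/access.log \
  | awk -F'"' '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn | head -20
```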
corvus | are folks thinking that first UA is botnet? | 22:31 |
ianw | clarkb: sorry where's the UA? not in the gitea logs? | 22:31 |
corvus | ianw: should be in access.log in /var/gitea/logs | 22:31 |
ianw | ahh ok, was tailing the container | 22:32 |
clarkb | corvus: yes I think so based on the url its hitting | 22:32 |
clarkb | corvus: it's a specific commit file with lang set | 22:32 |
clarkb | that's not concrete evidence unless there is a trend though (a single data point is not enough to be sure) | 22:33 |
fungi | corvus: may be compromised in-browser payload. a lot of them work that way, so they'd wind up with the actual browser's ua anyway | 22:33 |
fungi | what with turing-complete language interpreters in modern browsers, you don't need to compromise the whole system any more, just convince the browser to run your program | 22:34 |
clarkb | we can set the access log format | 22:34 |
clarkb | I'm now trying to see if the remote port is available to that logging context | 22:34 |
clarkb | (having the ability to link the two would be useful) | 22:35 |
fungi | if gitea uses apache access log format replacements, i think %{remote}p gives the client port | 22:35 |
fungi | not that i would have any reason to expect it to use anything like apache's format string language | 22:36 |
openstackgerrit | Merged opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 22:36 |
clarkb | it does not | 22:36 |
ianw | http://paste.openstack.org/show/795410/ | 22:36 |
clarkb | https://docs.gitea.io/en-us/logging-configuration/#the-access_log_template | 22:36 |
ianw | 44 \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" | 22:37 |
clarkb | semrush in that listing is the one we had to turn off for lists.o.o because it made mailman unhappy | 22:37 |
ianw | 810 \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" | 22:37 |
ianw | so relatively speaking, i don't think it's that | 22:37 |
clarkb | ya baidu seems pretty well behaved overall | 22:38 |
corvus | because of the iptables block, we're not going to have many ddos entries in that log | 22:38 |
fungi | clarkb: i wonder then if it's something like {{.Ctx.RemotePort}} | 22:40 |
clarkb | fungi: there isn't but we do have the Ctx.Req. object which is a normal golang http request which may have it | 22:40 |
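For context, the shape of the change under discussion -- approximately gitea's documented default access log template with the address field swapped out, not the exact opendev patch:

```
[log]
ENABLE_ACCESS_LOG = true
; {{.Ctx.RemoteAddr}} is the value macaron has already trimmed to a bare IP;
; the raw request's RemoteAddr still carries "ip:port", which is what allows
; matching an entry against the haproxy log
ACCESS_LOG_TEMPLATE = {{.Ctx.Req.RemoteAddr}} - {{.Identity}} {{.Start.Format "[02/Jan/2006:15:04:05 -0700]" }} "{{.Ctx.Req.Method}} {{.Ctx.Req.URL.RequestURI}} {{.Ctx.Req.Proto}}" {{.ResponseWriter.Status}} {{.ResponseWriter.Size}} "{{.Ctx.Req.UserAgent}}"
```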
clarkb | corvus: we only cut the volume down and did not eliminate it, but yes, undoing iptables would give us more data | 22:41 |
clarkb | corvus: there is at least another whole AS that we could block to stop the requests | 22:41 |
ianw | according to https://developers.whatismybrowser.com/useragents/parse/38888-internet-explorer-windows-trident the top hit in that list i provided is windows xp sp2 ie7 | 22:42 |
ianw | i.e. i'd say we can rule out that being a human | 22:42 |
ianw | http://paste.openstack.org/show/795411/ all requests, not just lang | 22:44 |
*** mlavalle has quit IRC | 22:46 | |
*** tkajinam has joined #opendev | 22:48 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 22:49 |
clarkb | that change is not tested but I think if I got the names right it should work | 22:50 |
clarkb | fungi: ^ basically macaron, which is gitea's web layer, tries to be smart about the value there and I think it ends up trimming that useful info, but the normal http.Request.RemoteAddr value has it and is intended for logging | 22:50 |
*** rosmaita has left #opendev | 22:51 | |
fungi | oh cool | 22:51 |
clarkb | ianw: I expect we can filter a lot of those? | 22:54 |
clarkb | my modern firefox UA reports itself with its version | 22:54 |
clarkb | so Firefox 4.0.1 is either a lie or very very old? | 22:55 |
fungi | those moz/ie versions were commonly used in other more niche browsers which needed to work around sites claiming their browser wasn't supported | 22:56 |
clarkb | fungi: +2'd https://review.opendev.org/#/c/738710/1 | 22:56 |
fungi | but i agree these days it's more likely to be a script | 22:56 |
clarkb | that's what the prefix bits are supposed to do these days right? | 22:56 |
clarkb | the last entry should be your actual browser? | 22:57 |
fungi | in theory | 22:57 |
*** tosky has quit IRC | 22:57 | |
fungi | though in reality it can be anything | 22:57 |
clarkb | https://review.opendev.org/#/c/738714/1 should be tested by our testing (I checked earlier jobs and the access log shows up in them) | 22:58 |
clarkb | corvus: ianw ^ if you can review that I'll double check the logs in the test runs before approving | 22:59 |
ianw | clarkb: i mean looking at http://paste.openstack.org/show/795411/ the hits are from things that are saying they look like ie7, some tencent browser thing, and os 10.7 which went eol in ~ 2012? | 23:00 |
clarkb | ianw: ya its a lot of things reporting to be very old browsers | 23:00 |
clarkb | the opera entry there is from 2011 | 23:00 |
ianw | but not being nice and putting any sort of "i'm a robot" string in it :/ | 23:00 |
clarkb | firefox 4.0.1 is also a 2011 release | 23:01 |
clarkb | was 2011 the golden year for web browsers ? | 23:02 |
clarkb | I'm leaning towards: we can use robots.txt to block those. Actual browsers will ignore the robots.txt right? And any well-behaved bot will go away. If the bots are not well behaved then we may need the mod rewrite idea | 23:03 |
ianw | seems once robots is there we could just hand edit for testing | 23:04 |
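If the hand-editing route is taken, the kind of entries being floated (the UA names and the wildcard pattern are illustrative; wildcards in Disallow are only honoured by some crawlers, and real browsers ignore robots.txt entirely):

```
# slow down the crawlers that identify themselves
User-agent: Baiduspider
Crawl-delay: 1

User-agent: SemrushBot
Disallow: /

# keep everything else out of the expensive per-commit views
User-agent: *
Disallow: /*/src/commit/
```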
corvus | clarkb: lgtm; do you want a +3 or to check the zuul logs first? | 23:05 |
clarkb | corvus: I'll check the logs first and approve if its good | 23:05 |
ianw | oh it deployed already, cool | 23:05 |
clarkb | heh that safari release is 2011/2012 too | 23:05 |
clarkb | it's like whoever built this bot did so in 2012 and listed all the possible UA strings at the time :) | 23:05 |
ianw | so, is anything actually loading the robots.txt ... | 23:06 |
ianw | basically two things that look like well behaved bots http://paste.openstack.org/show/795412/ | 23:07 |
clarkb | and I guess the flood isn't going away with the less nice bots | 23:09 |
ianw | is there like a spamhaus for UA's? | 23:10 |
clarkb | fungi: ^ | 23:11 |
fungi | none that i know of | 23:11 |
ianw | https://perishablepress.com/4g-ultimate-user-agent-blacklist/ | 23:11 |
ianw | i mean, insert grain of salt of course | 23:12 |
fungi | yeah, given any application can claim whatever ua it wants and masquerade as any other agent, if filtering based on well-known malware ua strings were especially effective the authors would adapt | 23:12 |
fungi | we might get away with it for a while, or indefinitely, but it would be a continual game of whack-a-mole adding new entries over time | 23:13 |
ianw | but i guess given that what we're seeing is a bunch of ancient UA's, whatever we're dealing with is *not* cutting edge | 23:13 |
corvus | clarkb: gitea log change failed test | 23:14 |
fungi | the ua is useful for identifying well-meaning bots so that you can adjust robots.txt to ask them to be better behaved | 23:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx https://review.opendev.org/738517 | 23:15 |
clarkb | corvus: and no access log. Maybe Ctx.Req.RemoteAddr isn't valid there :/ | 23:15 |
ianw | fungi: sure, but if there's some empirical list of "this is a bunch of UA's from known spam/script/ddos utilities that are commonly abusive" that includes all these EOL browsers, that would be great :) | 23:15 |
ianw | however, that is maybe the type of thing kept in google/cloudflare repos and not committed publicly | 23:16 |
clarkb | oh template error writing out that config file | 23:16 |
clarkb | its a jinja2 problem with the {{ .stuff | 23:18 |
clarkb | trying to figure out if I can do a literal block | 23:19 |
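The usual ways around that clash, given the config file is rendered with jinja2 and jinja2 also claims {{ }} (which of these the next patchset actually uses is an assumption):

```
{# option 1: fence the whole go-template value off from jinja2 #}
ACCESS_LOG_TEMPLATE = {% raw %}{{.Ctx.Req.RemoteAddr}} - {{.Identity}} ... "{{.Ctx.Req.UserAgent}}"{% endraw %}

{# option 2: emit the braces from a jinja2 string literal #}
ACCESS_LOG_TEMPLATE = {{ '{{.Ctx.Req.RemoteAddr}}' }} - ...
```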
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Ignore ansible lint E106 https://review.opendev.org/738716 | 23:19 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 23:22 |
clarkb | I think ^ that may fix it assuming that is the only issue | 23:22 |
*** auristor has quit IRC | 23:29 | |
openstackgerrit | Merged zuul/zuul-jobs master: Ignore ansible lint E106 https://review.opendev.org/738716 | 23:33 |
*** Dmitrii-Sh has quit IRC | 23:42 | |
*** Dmitrii-Sh has joined #opendev | 23:43 | |
ianw | mnaser: any idea what 38.108.68.124 - - [30/Jun/2020:23:52:28 +0000] "GET /zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py?lang=pt-PT HTTP/1.1" 200 21763 "https://opendev.org/zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py\" \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" is? | 23:53 |
ianw | mnaser: ohhhh i'm an idiot, that's the loadbalancer | 23:54 |
mnaser | ianw: :) | 23:54 |
ianw | sorry i got confused in the output of my own script :) | 23:55 |
*** auristor has joined #opendev | 23:56 | |
clarkb | ianw: ya that's why we're trying to get the port mappings done | 23:57 |