*** ryohayakawa has joined #opendev | 00:04 | |
*** factor has quit IRC | 00:54 | |
*** factor has joined #opendev | 00:54 | |
*** dtantsur has joined #opendev | 01:12 | |
*** dtantsur|afk has quit IRC | 01:12 | |
*** vblando has quit IRC | 02:36 | |
*** mnasiadka has quit IRC | 02:36 | |
*** Open10K8S has quit IRC | 02:37 | |
*** Open10K8S has joined #opendev | 02:37 | |
*** mnasiadka has joined #opendev | 02:38 | |
*** vblando has joined #opendev | 02:40 | |
*** ysandeep|away is now known as ysandeep | 04:01 | |
*** ykarel|away is now known as ykarel | 04:19 | |
openstackgerrit | yatin proposed openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8 https://review.opendev.org/738435 | 04:47 |
*** ysandeep is now known as ysandeep|brb | 05:41 | |
*** DSpider has joined #opendev | 06:07 | |
*** ysandeep|brb is now known as ysandeep | 06:11 | |
*** diablo_rojo has quit IRC | 06:49 | |
*** bhagyashris|pto is now known as bhagyashris | 07:16 | |
*** hashar has joined #opendev | 07:24 | |
*** sshnaidm|afk is now known as sshnaidm|ruck | 07:28 | |
*** tosky has joined #opendev | 07:28 | |
*** bhagyashris is now known as bhagyashris|lunc | 07:29 | |
*** iurygregory has quit IRC | 07:42 | |
*** moppy has quit IRC | 08:01 | |
*** iurygregory has joined #opendev | 08:01 | |
*** moppy has joined #opendev | 08:01 | |
*** hiep_mq has joined #opendev | 08:25 | |
*** ykarel is now known as ykarel|lunch | 08:25 | |
*** jangutter has quit IRC | 08:27 | |
*** hiep_mq has quit IRC | 08:37 | |
*** bhagyashris|lunc is now known as bhagyashris | 08:37 | |
*** Eighth_Doctor has quit IRC | 08:38 | |
*** rchurch has quit IRC | 08:40 | |
*** rchurch has joined #opendev | 08:44 | |
*** Eighth_Doctor has joined #opendev | 08:52 | |
*** tkajinam has quit IRC | 08:55 | |
*** priteau has joined #opendev | 09:08 | |
*** sshnaidm|ruck has quit IRC | 09:14 | |
*** sshnaidm has joined #opendev | 09:23 | |
*** sshnaidm has quit IRC | 09:42 | |
*** ykarel|lunch is now known as ykarel | 09:49 | |
*** priteau has quit IRC | 09:54 | |
*** sshnaidm has joined #opendev | 09:55 | |
*** mugsie has quit IRC | 10:07 | |
*** mugsie has joined #opendev | 10:10 | |
*** ryohayakawa has quit IRC | 10:30 | |
*** hashar has quit IRC | 10:47 | |
*** ysandeep is now known as ysandeep|afk | 10:53 | |
frickler | infra-root: most cacti graphs for gitea0X look weird starting around 0:30 GMT today | 11:00 |
frickler | e.g. http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66632&rra_id=all | 11:00 |
frickler | hmm, the docker log only shows the IP of the lb as client, makes it difficult to track things. not sure if that might be some ddos, even less how to mitigate if it was | 11:13 |
frickler | except that cloudflare seems to offer free protection for open source projects, but we'd have to discuss whether we'd want that | 11:13
*** bhagyashris is now known as bhagyashris|brb | 11:32 | |
*** ysandeep|afk is now known as ysandeep | 11:39 | |
openstackgerrit | Merged openstack/diskimage-builder master: Make ipa centos8 job non-voting https://review.opendev.org/738481 | 11:49 |
*** mordred has quit IRC | 12:04 | |
*** mordred has joined #opendev | 12:09 | |
*** bhagyashris|brb is now known as bhagyashris | 12:10 | |
*** dtantsur is now known as dtantsur|brb | 12:13 | |
*** hashar has joined #opendev | 12:22 | |
*** priteau has joined #opendev | 12:45 | |
*** ysandeep is now known as ysandeep|afk | 12:47 | |
ttx | corvus, fungi, clarkb: if you can review https://review.opendev.org/#/c/738187/ today, I'm around in the next 3 hours to watch it go through and fix it in case of weirdness | 12:55 |
*** ysandeep|afk is now known as ysandeep | 12:55 | |
ttx | tl;dr: a retry is better than convoluted conditionals | 12:55 |
AJaeger | ttx, +2A | 12:59 |
clarkb | frickler: that may indicate we've got another ddos happening against the service :/ | 13:02 |
ttx | AJaeger: thx! | 13:07 |
clarkb | There are two vexxhost IPs that show up as particularly busy | 13:13 |
clarkb | oddly they are ipv4 addrs not ipv6. Neither address shows up in our nodepool logs, but I'm double checking that I'm grepping through rotated logs properly now | 13:14 |
clarkb | however I expect that our jobs would always use ipv6 not ipv4 | 13:14 |
openstackgerrit | Merged zuul/zuul-jobs master: upload-git-mirror: use retries to avoid races https://review.opendev.org/738187 | 13:14 |
clarkb | ya I'm grepping rotated logs properly so these aren't any IPs of our own | 13:15 |
clarkb | mnaser: if you have a moment or can point us at who does I'd happily share IPs and see if we can work backward from there? | 13:15 |
fungi | the established connections graph for the lb is striking, can probably just see what the socket table looks like | 13:16
mnaser | clarkb: can you post them to me? I wonder if it’s our Zuul. | 13:17 |
clarkb | fungi: that's basically what I did, grep by backend on load balancer and sort by unique IP | 13:18
smcginnis | I've noticed some services seem a bit slower this morning. A ddos could certainly explain that. | 13:18 |
clarkb | `sudo grep balance_git_https/gitea08.opendev.org /var/log/syslog | cut -d' ' -f 6 | sed -ne 's/\(:[0-9]\+\)$//p' | sort | uniq -c | sort | tail` | 13:20 |
clarkb | if other infra-root ^ want to do similar on the load balancer | 13:20 |
fungi | clarkb: the address distribution looks fairly innocuous to me | 13:20 |
fungi | i was analyzing netstat -ln | 13:20 |
clarkb | fungi: there are two vexxhost IPs that have an order of magnitude more requests over that syslog | 13:20
clarkb | compared to other IPs | 13:20 |
clarkb | possible that it finally caught up and now our distribution is more normalized | 13:20 |
fungi | highest count for open sockets i saw was 66.187.233.202 which seems to be a nat at red hat | 13:21 |
clarkb | fungi: yes that one also shows up (but it has since we spun up the service since red hat insists on funneling all internets through a single IP) | 13:21 |
fungi | er, not netstat -ln, netstat -nt | 13:21 |
fungi | next most connections is from 124.92.132.123 at china unicom | 13:22 |
fungi | several other china unicom addresses in the current top count of open connections to the lb | 13:23 |
fungi | so far basically all of the ones i'm checking are, in fact | 13:24 |
fungi | netstat -nt|awk '{print $5}'|sed 's/:[0-9]*$//'|sort|uniq -c|sort -n | 13:24 |
fungi | the highest 8 are obviously connections to the backends | 13:25 |
fungi | the next 5 are china unicom | 13:26 |
fungi | (at the moment) | 13:26 |
clarkb | fungi: ya I was looking at it over time since the cacti graphs show it being a long term issue and in the past total number of connections over time when that has happened has correlated strongly with the issue | 13:27
clarkb | its also possible the IPs I identified are just noise as the actual problem is spread out over many more IPs and we're seeing that right now | 13:28 |
fungi | yeah, the distribution makes me wonder if the connection count is a symptom of something impacting cluster performance, not the cause | 13:30 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all is fun | 13:32 |
fungi | closer looks at cacti graphs, note that the bandwidth utilization didn't really increase, or even spike (except on the loopback) | 13:32 |
fungi | so it could be something else is causing connections to take far longer to service, and they're all piling up | 13:33 |
fungi | cpu, load average and loopback connections all went up at the same time across all backends | 13:34 |
clarkb | fungi: ya thats what we saw before, basically really expensive requests that hang around | 13:34 |
clarkb | and then after a few hours they finally complete in gitea | 13:34 |
fungi | makes sense. not expensive operations resulting in large amounts of returned data over the network though | 13:35 |
clarkb | no its expensive for gitea to produce a result so its sitting in the background spinning its wheels for several hours before saying something | 13:36 |
fungi | yeah, i wonder how we could identify those | 13:38 |
clarkb | last time it was via docker logs on the web container and looking for completed requests with large times or started request entries and no corresponding completed line | 13:39 |
fungi | seems gitea is using threads, so can't really just go by process age like we could with the git http backend | 13:39 |
clarkb | I'm having a hard time doing that now | 13:40 |
clarkb | because the volume of logs is significant | 13:40
clarkb | ok finally getting somewhere with that. haven't seen any large requests, but I am noticing a lot of requests for specific commits and under specific languages | 13:42 |
fungi | funny to hear folks on the conference talking about collecting haproxy telemetry while digging into this | 13:42 |
clarkb | that seems odd for us and implies maybe its a crawler | 13:42 |
clarkb | gitea doesn't seem to log user agent unfortunately or that might make it a bit easier to tell what was doing that if it is a bot | 13:43 |
fungi | and i don't suppose we have sufficient session identifiers to pick it out of the haproxy logs | 13:44 |
clarkb | and if it isn't closing those open connections we may not log them in the way I was looking in syslog (your live view would be a better indicator of that) | 13:44 |
clarkb | fungi: ya the gitea logging is maybe the next thing to look at in that space to make it easier to debug this class of problem. I believe it will log a forwarded for ip if we were doing http(s) balancing on haproxy but we do tcp passthrough instead | 13:46 |
clarkb | adding user agent would be useful though | 13:46 |
clarkb | I think if we restart the containers we'll reset the log buffer which may make operating with the gitea logs slightly easier | 13:50 |
*** mlavalle has joined #opendev | 14:00 | |
clarkb | fungi: I'm trying to use my own IP as a bread crumb in the lb and gitea logs to see how we can more effectively correlate them | 14:01 |
fungi | if haproxy logs the ephemeral port from which it sources the forwarded socket, and gitea logs the "client" port number, we could likely create a mapping based on that, backend and approximate timestamps | 14:03 |
fungi | though what haproxy's putting in syslog looks like it includes the original client source port, but not the source port for the forwarded connection it creates to the backend | 14:05 |
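(For reference, haproxy's log-format does have variables for the source ip:port it uses on the forwarded connection toward the backend; a hedged sketch of the stock TCP format with that appended -- the exact line opendev ends up using is whatever lands in the 738685 change:)
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %ac/%fc/%bc/%sc/%rc %sq/%bq %bi:%bp"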
openstackgerrit | Riccardo Pittau proposed openstack/diskimage-builder master: Convert multi line if statement to case https://review.opendev.org/734479 | 14:10 |
*** dtantsur|brb is now known as dtantsur | 14:14 | |
*** ysandeep is now known as ysandeep|away | 14:17 | |
clarkb | doing rough correlation using timestamps and looking for blocks where significant numbers of requests do the specific commit and file with lang param thing I'm beginning to suspect the requests from those china unicom ranges | 14:19 |
clarkb | each one seems to be a new IP so doesn't show up in our busy IPs listings | 14:19 |
*** mlavalle has quit IRC | 14:20 | |
clarkb | this correlation isn't perfect though so not wanting to commit to that just yet, but am starting to think about how we might mitigate such a thing | 14:20
clarkb | what I'm noticing is that we seem to have capped out haproxy to ~8k concurrent connections but we're barely spinning the cpu or using any memory there | 14:22 |
clarkb | we may be able to mitigate this by allowing haproxy to handle far more connections. However, ulimit is set to a very large number for open files and it isn't clear to me where that limit may be | 14:22 |
fungi | haproxy is in a docker container now, so could it be the namespace having a separate, lower limit? | 14:24 |
*** mlavalle has joined #opendev | 14:24 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Increase allowed number of haproxy connections https://review.opendev.org/738635 | 14:28 |
clarkb | fungi: nah its ^ | 14:28 |
clarkb | I checked ulimit -a in the container | 14:28 |
fungi | aha! | 14:28 |
clarkb | I think looking at gitea01 we have headroom for maybe 6-8x the number of current requests | 14:29
clarkb | so I bump by 4x in that change and we can monitor and tune from there | 14:29 |
clarkb | this is still just a mitigation though doesn't really address the underlying problem (but maybe thats good enough for now?) | 14:30 |
fungi | yeah, i agree, system load is topping out around 25% of the cpu count, and cpu utilization is also around 25% | 14:30 |
clarkb | and we get connection on front and connection on back so 4000 * 2 = 8k total tcp conns | 14:30 |
fungi | memory pressure is low too | 14:30 |
clarkb | fungi: ya | 14:30 |
clarkb | and the lb itself is running super lean | 14:32 |
clarkb | it could probably do 64k connections on the lb itself | 14:32 |
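(For context, the knob 738635 raises is haproxy's maxconn; a minimal sketch with illustrative numbers, not the exact opendev template:)
    global
        maxconn 16000    # process-wide cap on concurrent connections
    defaults
        maxconn 16000    # per-frontend cap; beyond this haproxy stops accepting new
                         # connections, which is why clients saw hangs at the old 4000 limit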
*** weshay_ruck has joined #opendev | 14:33 | |
tristanC | Greeting, is there a place where we could download the diskimage used by opendev's zuul? | 14:33 |
clarkb | tristanC: yes https://nb01.opendev.org, https://nb02.opendev.org. https://nb04.opendev.org | 14:33 |
clarkb | all three of those build the images so you'll want to check which one has the most recent build of the image you want | 14:33 |
clarkb | er it might need /images ? /me checks | 14:34
clarkb | yes its the /images path at those hosts | 14:34 |
tristanC | clarkb: thanks, /images is what i was looking for | 14:34 |
clarkb | frickler: corvus mordred I think we start at https://review.opendev.org/#/c/738635/1 to see if we can make opendev.org happier | 14:35 |
clarkb | and separately continue to try and sort out if these requests are from a legit crawler and if so we may be able to set up a robots.txt and ask it to kindly go away | 14:36 |
clarkb | fungi: I guess we could do a packet capture then figure out decryption of it and then look for user agent headers | 14:43 |
clarkb | fungi: that seems like an after conference, after breakfast task if I'm going to tackle that though | 14:43 |
openstackgerrit | Merged zuul/zuul-jobs master: prepare-workspace: Add Role Variable in README.rst https://review.opendev.org/737352 | 14:45 |
fungi | yeah, with a copy of the ssl server key we could do offline decryption of a pcap | 14:45 |
fungi | though i'm out of practice, been a while since i did that | 14:46 |
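(Roughly what that would involve, as a sketch -- the capture interface is an assumption, and offline decryption with only the server key works just for non-PFS/RSA-key-exchange ciphersuites, so with modern TLS this may be a dead end:)
    sudo tcpdump -i eth0 -w /tmp/gitea-https.pcap 'tcp port 443'
    # then load the pcap plus the server private key into wireshark/tshark to attempt decryption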
*** ykarel is now known as ykarel|away | 14:48 | |
*** sgw1 has quit IRC | 14:50 | |
*** sorin-mihai has joined #opendev | 14:52 | |
corvus | clarkb, fungi: oh, i have scrollback to read; i'll do that and expect to be useful after breakfast in maybe 30m? | 14:53 |
*** sorin-mihai__ has quit IRC | 14:53 | |
clarkb | corvus: sounds good | 14:53 |
fungi | yeah, things are mostly working | 14:54 |
*** sgw1 has joined #opendev | 14:55 | |
corvus | ok caught up on scrollback, +3 738635; breakfast now then back | 14:58 |
openstackgerrit | Merged zuul/zuul-jobs master: Return upload_results in upload-logs-swift role https://review.opendev.org/733564 | 15:01 |
fungi | 738635 is going to need an haproxy container restart, right? hopefully that's fast | 15:10 |
clarkb | fungi: yes and yes its usually pretty painless | 15:10 |
*** hashar is now known as hasharAway | 15:24 | |
openstackgerrit | Merged opendev/puppet-openstackid master: Fixed permissions issues on SpammerProcess https://review.opendev.org/717359 | 15:47 |
*** sshnaidm has quit IRC | 15:51 | |
*** sshnaidm has joined #opendev | 15:53 | |
*** hasharAway is now known as hashar | 15:58 | |
clarkb | we're about half an hour from applying the haproxy config update. I'm going to find lunch | 16:07 |
clarkb | I guess its just breakfast at this point | 16:08 |
clarkb | yay early mornings | 16:08 |
*** sshnaidm is now known as sshnaidm|ruck | 16:18 | |
openstackgerrit | Merged opendev/system-config master: Increase allowed number of haproxy connections https://review.opendev.org/738635 | 16:40 |
fungi | and now we wait for the deploy to run | 16:40 |
*** dtantsur is now known as dtantsur|afk | 16:45 | |
fungi | looks like the deploy is wrapping up now | 16:45 |
fungi | and done | 16:45 |
fungi | once enough of us are around, i guess we can restart the container | 16:47 |
clarkb | fungi: I think it restarts automatically? | 16:49 |
clarkb | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=66611&rra_id=all seems to show that | 16:51 |
fungi | it didn't seem like it restarted, looking at ps | 16:51 |
fungi | start timestamp is over two weeks ago | 16:52 |
fungi | root 12495 0.0 0.1 19848 11176 ? Ss Jun11 1:22 haproxy -sf 6 -W -db -f /usr/local/etc/haproxy/haproxy.cfg | 16:52 |
fungi | weird | 16:53 |
fungi | maybe it reread its config live? | 16:53 |
clarkb | I didn't think it did but maybe | 16:53
fungi | no mention of any config reload in syslog | 16:57 |
clarkb | docker ps also shows the container didn't restart | 16:58 |
clarkb | maybe it is checking its config on the fly? | 16:58 |
clarkb | gitea01 connections, memory and cpu has gone up too but all at reasonable levels | 17:00 |
clarkb | fungi: I'm thinking we should probably restart the haproxy just to be sure, but it definitely seems to be running with that new config anyway | 17:00
clarkb | I'm checking the ansible now to see if we signal the process at all | 17:00
clarkb | we do: cmd: docker-compose kill -s HUP haproxy | 17:01 |
fungi | ahh, okay | 17:01 |
clarkb | thats the handler for updating our config so this is all good | 17:01 |
clarkb | no restart necessary | 17:01 |
fungi | yep, perfect | 17:01 |
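(A hedged manual equivalent of that handler, using the compose directory and in-container config path that show up later in this log -- double-check both on the host before running anything:)
    cd /etc/haproxy-docker
    sudo docker-compose exec haproxy haproxy -c -f /usr/local/etc/haproxy/haproxy.cfg   # syntax check first
    sudo docker-compose kill -s HUP haproxy                                             # graceful reload in master-worker mode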
clarkb | also if we were using all 16k conns on the front end I would expect 32k conns recorded in cacti | 17:02 |
fungi | agreed | 17:03 |
clarkb | we're "only" getting ~16k conns in cacti which implies to me we've now got enough capacity in haproxy to address the load | 17:03 |
fungi | waiting to see if there's fluctuation in that or if it looks capped still | 17:03
clarkb | as long as our backends stay happy this may be sufficient to work around the problem | 17:03 |
clarkb | and if so we can see if this persists as a low key ddos or maybe it will resolve on its own if the bots get the crawling done | 17:03 |
fungi | the new top out does seem to have some variation in it compared to before, so i have hopes that's representative of actual demand now | 17:04 |
clarkb | ya | 17:04 |
clarkb | gitea04 may not be super happy with the change. I'm wondering if that is where rh nat maps to | 17:08 |
clarkb | all the others look to be fine according to cacti | 17:08 |
sgw1 | Hi There, was there an issue with opendev.org earlier today? Some folks were seeing: | 17:08 |
sgw1 | fatal: unable to access 'https://opendev.org/starlingx/stx-puppet.git/': SSL received a record that exceeded the maximum permissible length. | 17:08 |
sgw1 | and it was very slow | 17:08 |
clarkb | sgw1: yes we've been sorting that out today. Basically at about midnight UTC today we've come under what appears to be a ddos (not sure if intentional or not) | 17:09 |
clarkb | it looks like a web crawler bot out of china fetching all the commits and files out of our repos | 17:10 |
clarkb | but it's doing so from many, many, many IPs | 17:10
clarkb | anyway what we just did was to bump up the connection limit on the load balancer as it appears that things would be happy with that, though maybe one of our backends is not due to how we have to load balance by source IP | 17:10
clarkb | haproxy just decided that gitea04 is down due to response lag | 17:11 |
sgw1 | clarkb: thanks for the info, I forwarded it back to the folks asking. | 17:12 |
clarkb | sgw1: they can always ask too :) | 17:12 |
sgw1 | clarkb: while I am here, we are getting ready for branching our 4.0 release, I might try a test push to one repo, if I need to undo it for some reason, I will check in. | 17:13 |
clarkb | looks like haproxy decided gitea04 is back now | 17:13 |
clarkb | so it may sort of throttle itself ? | 17:13 |
sgw1 | clarkb: not sure how to answer that other than not their style :-( | 17:13 |
clarkb | I'm not sure if this is better than the situation before where we told people to go away at the haproxy layer | 17:13
clarkb | RH nat is not on gitea04 fwiw | 17:14 |
clarkb | fungi: it seems like we've stablized the incoming connection counts but we're slowly starting to get unhealthy backends | 17:16 |
clarkb | gitea01 is starting to swap now along with 03 | 17:17 |
clarkb | er 04 | 17:17 |
clarkb | ya may need a revert | 17:17 |
clarkb | we may be stabilizing but we're doing so at relatively unhappy states (not completely dead though) | 17:19 |
clarkb | maybe lets let it run for 15-30 minutes then see where we've ended up and go from there? | 17:19 |
clarkb | unfortunately my next best idea is start blocking large chunks of chinese IP space :/ and I'm worried about collateral damage | 17:20 |
fungi | yeah | 17:21 |
fungi | or we add more backends | 17:21 |
fungi | yeah, looks like cpu and load average on the backends is not scaling linearly with the increase in connection count | 17:22 |
fungi | we're getting 100% cpu with load averages around 16+ | 17:23
clarkb | there are certain requests that cost more than others, its possible that the requests we're getting are intentionally expensive | 17:23
fungi | aha, it's memory pressure | 17:23 |
fungi | i think we're killing them with swapping | 17:23 |
clarkb | ya | 17:23 |
fungi | oh, and now i see in scrollback you already said that | 17:23 |
fungi | sorry, trying to juggle hvac contractor and eating my now cold lunch | 17:24 |
clarkb | sgw1: fwiw we'd love to have more people from involved projects help build the community resources. If there is any way we can help change the style of interaction that would be great | 17:24
clarkb | sgw1: we can't help if we can't even communicate with each other :/ | 17:24 |
clarkb | fungi: internet tells me that AS is advertising 1236 prefixes | 17:27 |
clarkb | I suppose we could add all of those to firewall block rules? | 17:28 |
fungi | that would be... painful | 17:28 |
clarkb | ya but not having a working service is worse :? | 17:28 |
fungi | i wonder if we could programmatically aggregate a bunch of those | 17:29 |
clarkb | gitea01 is going to OOM soon I think | 17:29 |
clarkb | 04 and 02 are a bit more stable | 17:29 |
fungi | we should probably revert the connection limit increase for now? we could add more gitea backends | 17:29 |
clarkb | fungi: ya I'll do that manually really quickly | 17:30 |
fungi | i'll push up the actual revert change | 17:30 |
clarkb | manual application is done | 17:30 |
fungi | apparently even git replication to the gitea servers has been lagging | 17:31 |
fungi | oh, i'm getting 500 internal server errors | 17:31
fungi | so i can't currently remote update my checkout | 17:33 |
clarkb | you should be able to update from gerrit | 17:35 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Revert "Increase allowed number of haproxy connections" https://review.opendev.org/738679 | 17:36 |
*** slittle1 has joined #opendev | 17:36 | |
fungi | reverted through gerrit's webui for now | 17:36 |
fungi | yeah, that's what i wound up doing | 17:36 |
slittle1 | Just joined ... | 17:37 |
slittle1 | is opendev's git server having issues ? | 17:37 |
clarkb | slittle1: yes we've had a ddos all day and we thought we could alleviate some of the pain for people and it ended up consuming too much memory and making things worse | 17:37
fungi | slittle1: well, it's a cluster of 8 git servers behind a load balancer, but yes, we're under a very large volume of git requests from random addresses in china unicom | 17:37 |
clarkb | fungi: radb gives me 986 prefixes | 17:37 |
clarkb | fungi: we could add 986 drop rules easily enough | 17:38 |
fungi | clarkb: yeah, doing that on gitea-lb01 for now is probably the bandaid we need while we look at better options | 17:38 |
clarkb | fungi: `whois -h whois.radb.net -- '-i origin AS4837' | grep ^route: | sed -e 's/^route:\s\+//' | sort -u | sort -n | wc -l` fwiw | 17:38 |
clarkb | drop the wc if you want to see them | 17:39 |
fungi | it's times like this i miss being able to just add one bgp filter rule to my border routers :/ | 17:39 |
clarkb | I think we may need to restart gitea backends to get them happy again though | 17:39 |
clarkb | probably start with firewall update then restart backends? | 17:40 |
fungi | yes, they may not release their memory allocations without restarts | 17:40 |
*** hashar is now known as hasharAway | 17:40 | |
fungi | we ought to be able to rolling restart the backends, can disable them one by one in haproxy if we want | 17:40 |
clarkb | fungi: there is a more graceful way to do it to make sure that gerrit replication isn't impacted, but reboots may be worthwhile just in case anything got OOMKillered | 17:41 |
clarkb | but lets figure out the iptables rule changes first | 17:41 |
fungi | we can certainly just trigger a full replication once the restarts are done | 17:41 |
clarkb | fungi: `for X in $(cat ip_range_list) ; do sudo iptables -I openstack-INPUT -j DROP -s $X; done` ? | 17:44 |
clarkb | -s will take x.y.z.a/foo notation right? | 17:44 |
*** xakaitetoia has joined #opendev | 17:44 | |
fungi | none are v6 prefixes? | 17:44 |
clarkb | fungi: as far as I can tell it was all ipv4 | 17:45 |
fungi | yes, cidr notation works fine with iptables | 17:45 |
fungi | and that command looks right | 17:45 |
clarkb | let me generate the list on the lb | 17:45 |
fungi | no whois installed there | 17:45 |
clarkb | whois not found | 17:45 |
clarkb | of course not | 17:46 |
clarkb | I'll scp it | 17:46 |
clarkb | fungi: gitea-lb01:/home/clarkb/china_unicom_ranges | 17:46 |
fungi | looks right | 17:47 |
fungi | also radb.net does seem to aggregate prefixes where possible, at least skimming the list | 17:47 |
fungi | i don't see any subsets included | 17:48 |
fungi | i say go for it. worst case we reboot the server via nova api if we lock ourselves out | 17:49 |
clarkb | fungi: `for X in $(cat china_unicom_ranges) ; do echo $X ; sudo iptables -I openstack-INPUT -j DROP -s $X ; done` is my command on lb01 that still look good? made some small edits | 17:50 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx https://review.opendev.org/738517 | 17:51 |
fungi | clarkb: yep, that looks right | 17:51 |
clarkb | fungi: ok I'm running that now | 17:52 |
fungi | we could get fancy specifying destination ports, but there's little point as long as we don't lock ourselves out | 17:52 |
clarkb | thats done | 17:52 |
clarkb | I can create new connections to the server | 17:53
fungi | `sudo iptables -nL` looks like i would expect | 17:54 |
clarkb | lots of cD connections in haproxy logs from those ranges now | 17:54
clarkb | which is I think a side effect of the iptables rules | 17:54
fungi | trust me you don't want to try that without -n | 17:54 |
clarkb | to reset iptables rules without rebooting we can restart the iptables persistent unit | 17:55 |
clarkb | I Think it may be called netfilter-something now | 17:55 |
fungi | netfilter-persistent, yes | 17:56 |
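(For later cleanup, a sketch of backing those drops out without a reboot -- either reload the saved ruleset via the persistence unit, or reverse the insert loop rule by rule using the same range file:)
    sudo systemctl restart netfilter-persistent
    # or:
    for X in $(cat china_unicom_ranges) ; do sudo iptables -D openstack-INPUT -j DROP -s $X ; done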
clarkb | but giteas are still unhappy | 17:56 |
clarkb | I'm going to gracefully restart 01 and see if we need to reboot | 17:56 |
clarkb | basically we can always reboot after graceful restart | 17:57 |
smcginnis | Do we have general connectivity issues right now? | 17:57 |
smcginnis | re: https://review.opendev.org/#/c/738443/ | 17:57 |
fungi | smcginnis: just the git servers | 17:57 |
smcginnis | Jobs all failing with connection failures. | 17:57 |
fungi | though they should be starting to recover nowish, we hope | 17:57 |
fungi | i can finally `git remote update` again | 17:58 |
clarkb | I just blocked all of china unicom from talking to them | 17:58 |
fungi | ~1k network prefixes | 17:58 |
smcginnis | OK, I'll recheck in a bit. At least the one I am looking at now, it was trying to get TC data from the gitea servers. | 17:58 |
fungi | i'll write up something to service announce and status notice | 17:58 |
clarkb | graceful restart isn't going so great | 18:00 |
clarkb | I'm still waiting on it to stop things | 18:00 |
fungi | infra-root: status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:01 |
fungi | does that look reasonable? | 18:01 |
clarkb | fungi: yes | 18:01 |
smcginnis | Looks good to me as infra-nonroot too. :) | 18:02 |
clarkb | AS4134 may be necessary too | 18:02 |
clarkb | now that there is less noise in the logs i'm seeing a lot of traffic from there too :/ | 18:02 |
corvus | clarkb, fungi: oh shoot, i missed that more stuff happened, sorry | 18:03 |
fungi | "chinanet" | 18:03 |
johnsom | Ah, this must be why my git clone is hanging after I send the TLS client hello | 18:03 |
clarkb | corvus: basically my read of resource headroom was completely wrong | 18:03 |
corvus | clarkb: on gitea or haproxy? | 18:03 |
clarkb | corvus: once we allowed more connections we spiralled out of control with memory on the gitea backends | 18:03 |
clarkb | corvus: gitea | 18:03 |
corvus | gotcha | 18:03 |
fungi | johnsom: i hope it's clearing up now but we're still evaluating how the temporarily bandaid is working | 18:04 |
clarkb | haproxy could've done a lot more :) | 18:04 |
corvus | clarkb: is haproxy config reverted? | 18:04 |
clarkb | corvus: yes | 18:04 |
fungi | corvus: manually reverted and also proposed as https://review.opendev.org/738679 | 18:04 |
clarkb | corvus: https://review.opendev.org/#/c/738679/1 is proposed and I manualyl reverted | 18:04 |
johnsom | Let me know if I can consult on haproxy configurations. I might know a thing or two about the subject. grin | 18:05 |
corvus | clarkb: what about a smaller bump? 6k? | 18:05 |
fungi | johnsom: at the moment it's just serving as a layer 4 proxy, so not much flexibility | 18:05 |
clarkb | johnsom: well earlier today we had maxconn set to 4000 which resulted in errors because haproxy wouldn't allow new connections. I evaluated resource use on our backends and thought we had head room to go to ~16k but was wrong | 18:05 |
clarkb | corvus: based on cacti data we were only doing ~8k connections | 18:06 |
clarkb | corvus: 6k might be ok but we may be really close to that limit | 18:06
corvus | oh. so maybe 4500 :) | 18:06 |
clarkb | but also even at that limit we still have user noticeable outages | 18:06 |
clarkb | because haproxy won't handshake with them before they timeout | 18:06 |
corvus | clarkb: that happens at 4k? | 18:07 |
clarkb | corvus: haproxy won't complete handshakes until earlier connections finish. This means occasionally your client will fail. This is how it was originally reported to us earlier today | 18:07
clarkb | sometimes it would just be slow | 18:09 |
corvus | clarkb: right, so getting more gitea connections available to haproxy was the expected solution. but doing that overran gitea servers. | 18:09 |
clarkb | yes | 18:09 |
corvus | so do you think we're hitting a gitea limit? it would be nice to use more than 25% ram and have a higher load average than 2.0. | 18:10
clarkb | I half expect that the requests being made require lots of memory because they are being made across repos, files, and commits | 18:10
corvus | 50% and a load average of 8 would be ideal, i'd think :/ | 18:10
clarkb | and ya tuning to a sweet spot is a good thing, but I expect we'll still be degraded if we do that without the firewall rules | 18:11
clarkb | I finally got gitea01 to stop its gitea stuff | 18:11 |
clarkb | I'll reboot it now? | 18:11 |
clarkb | it did get OOMkillered ~15 minutes ago so a reboot seems like a good idea | 18:12 |
clarkb | (then iterate through the list and do that | 18:12 |
johnsom | Well, with L4, you should be able to handle ~29,000 connections per gigabyte allocated to haproxy. But, this is very version dependent and data volume may impact the CPU load. As you scale connections you also need to make sure you tune the open files available as well, which with systemd gets to be tricky. | 18:12 |
clarkb | johnsom: yes haproxy is not the problem | 18:12 |
clarkb | johnsom: the issue is allowing this many connections to the backends causes them to run out of memory | 18:13 |
fungi | johnsom: yeah, we're really not seeing any issues with haproxy, it's the backends which are being overrun | 18:13 |
johnsom | Ok. Would rate limiting at haproxy help, is there a cidr or such that could be used for rate limiting? | 18:13 |
clarkb | johnsom: yes, thats what the maxconn limit is effectively doing for us | 18:13 |
clarkb | johnsom: can we rate limit by cidr in haproxy and then remove our iptables rules? | 18:14 |
fungi | johnsom: we saw connections scattered across probably hundreds of prefixes from the same as | 18:14 |
clarkb | fungi: and now a second AS | 18:14 |
clarkb | I don't see any objections. I'm rebooting gitea01 now | 18:14 |
corvus | clarkb: ++ | 18:14 |
fungi | clarkb: oh, yes, please do reboot it | 18:14 |
johnsom | Ugh, ok. Yeah, you can do pretty complex rate limiting with haproxy, beyond just maxconn. But it sounds like this may not be easy to identify the bad actors. | 18:15 |
fungi | it seems to be some sort of indexing spider | 18:15 |
johnsom | clarkb Here is some nice examples: https://www.haproxy.com/blog/four-examples-of-haproxy-rate-limiting/ | 18:15 |
fungi | but a very distributed one operating on the down-low | 18:16 |
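(For reference, a minimal sketch of the per-source rate limiting those haproxy.com examples describe; the frontend name, table size, window and threshold here are illustrative assumptions, not values tested against our config:)
    frontend git-https
        stick-table type ip size 100k expire 30s store conn_rate(10s)
        tcp-request connection track-sc0 src
        tcp-request connection reject if { sc_conn_rate(0) gt 20 }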
clarkb | gitea01 is done and up and looks ok | 18:16 |
clarkb | I'll work through 02-08 in sequence | 18:16 |
clarkb | other things we may want to look at is as4134 (double check that against current haproxy logs) and maybe using haproxy to rate limit instead of iptables drop rules | 18:17 |
fungi | at the volume we've been seeing, the rate limit will be no different than dropping at the firewall, other than it will start allowing connections again once the event passes | 18:17 |
clarkb | corvus: then once all are rebooted I think we can tune the bumped maxcoon | 18:18 |
clarkb | *maxconn | 18:18 |
clarkb | right now it won't be useful because most of the giteas are unhappy | 18:18 |
clarkb | but once I've got them happy again we'll get useful info | 18:18
clarkb | fungi: thats a good point | 18:18 |
clarkb | 02 is also having trouble with the graceful down. I'll try just doing a reboot on 03 when 02 is done | 18:19 |
fungi | should i status notice about AS4837 now or wait until we decide whether we need to add AS4134? | 18:20 |
clarkb | I think you can go ahead and then we can add any other rule changes as subsequent notices/logs | 18:20 |
fungi | #status notice Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:21 |
openstackstatus | fungi: sending notice | 18:21 |
-openstackstatus- NOTICE: Due to a flood of connections from random prefixes, we have temporarily blocked all AS4837 (China Unicom) source addresses from access to the Git service at opendev.org while we investigate further options. | 18:21 | |
fungi | i'll put something similar out to the service-announce ml | 18:22 |
openstackstatus | fungi: finished sending notice | 18:24 |
clarkb | I got impatient and tried to do 03 but it's super bogged down, so I got on 04 and have issued a reboot. It is very slow. I assume it's waiting for docker to stop and docker is waiting for containers to stop | 18:24
clarkb | tldr just doing a reboot isn't any faster I don't think | 18:24 |
clarkb | we can do nova reboot --hard | 18:25 |
clarkb | or be patient. I'm attempting to practice patience | 18:25 |
fungi | my concern with --hard is that it won't wait for the filesystems to umount | 18:30 |
johnsom | FYI, I got a successful clone now, so functionality returning | 18:31 |
fungi | johnsom: that's great, thanks for confirming! | 18:31 |
clarkb | 01, 02, and 04 are happy now | 18:32 |
clarkb | 03 and 05 next | 18:32 |
clarkb | and now those are done | 18:34 |
clarkb | I think as more become happy the load comes off the sad ones and they restart quicker | 18:34 |
clarkb | I'm doing graceful stop and reboot to clear out any OOM side effects | 18:35 |
fungi | so do we want to try to add more backends? | 18:37 |
clarkb | fungi: I mean we can but then we have to spend the rest of the day replicating | 18:38 |
fungi | in conjunction with inching up the max connections | 18:38 |
clarkb | and if its just to serve a ddos I'm not sure thats the right decision | 18:38 |
fungi | oh, right, speaking of replicating, we need to retrigger a full replication in gerrit once your reboots are complete | 18:39 |
clarkb | cacti shows the difference between normal and not normal and it's massive overkill to add more hosts | 18:39
fungi | given that this doesn't look like a targeted attack, i have a feeling it's going to come and go as whatever distributed web crawler this is spreads | 18:40 |
clarkb | all 8 should be properly restarted now | 18:40 |
clarkb | fungi: is it weird for it to be so distributed though? | 18:40 |
fungi | so it's not growing the cluster to absorb this incident, but rather growing to accept future similar incidents | 18:40
clarkb | fungi: right, but being idle 99% of the time seems like a poor use of donated resources | 18:41 |
fungi | i agree, if we had some way to elastically scale this, it would be awesome | 18:41 |
clarkb | if this is an actual bot that isn't intentionally malicious we could ask it nicely to go away with a robots.txt | 18:41 |
clarkb | but I think the best way to do that is with tcpdumps and decrypting those streams | 18:42 |
clarkb | which is not the simplest thing to do iirc | 18:42 |
fungi | the nature of the traffic though, it seems like someone has implemented some crawler application on top of a botnet of compromised systems. i doubt we're the only site seeing this we just happen to be hit hard by having a resource-intensive service behind lots and lots of distinct urls | 18:42 |
*** xakaitetoia has quit IRC | 18:42 | |
clarkb | ya that could be | 18:43 |
clarkb | we're currently operating under our limit so if we want to test bumping the limit we'll need to undo our iptables rules I think | 18:43 |
fungi | typical botnets range upwards of tens of thousands of compromised systems, and having a majority of them originating from ip addresses in popular chinese isps is not uncommon | 18:44 |
clarkb | we are at about 60% of the limit | 18:44 |
clarkb | 2.5/4k connections | 18:44 |
clarkb | ish | 18:44 |
clarkb | potential options for next steps: Remove iptables rules for china unicom ranges. This will likely put us back into a semi degraded state with slow connections and connections that fail occasionally. If we do <- we can try raising our maxconn value slowly until we see things get out of control to find where our limit is. We can add backends prior to doing that and try to accommodate the flood. We can do the | 18:50
clarkb | opposite and block more ranges with malicious users. We can try tcpdumps and attempt to determine if this is a bot that will respond to a robots.txt value and if so update robots.txt | 18:50 |
clarkb | We can do nothing and see if anyone complains from the blocked ranges and see if the bots go away (since we're managing to keep up now) | 18:50 |
clarkb | and with that I need a break. Our meeting is in 10 minutes and I've been heads down since ~6am | 18:51
fungi | if this really is a widespread spider, we might be able to infer some characteristics by looking at access logs of some of our other services | 18:57 |
*** hasharAway is now known as hashar | 19:01 | |
*** sshnaidm|ruck is now known as sshnaidm|bbl | 19:02 | |
*** ianw_pto is now known as ianw | 19:04 | |
openstackgerrit | Merged opendev/system-config master: Revert "Increase allowed number of haproxy connections" https://review.opendev.org/738679 | 19:17 |
openstackgerrit | James E. Blair proposed opendev/system-config master: Enable access log in gitea https://review.opendev.org/738684 | 19:24 |
ianw | so i'm not seeing that we have a robots.txt for opendev.org or that gitea has a way to set one, although much less sure on that second part | 19:29 |
clarkb | ianw: https://github.com/go-gitea/gitea/issues/621 | 19:30 |
clarkb | thats a breadcrumb saying its possible | 19:30 |
ianw | yeah just drop it in public/ | 19:33 |
clarkb | looks like gitea may rotate the logs for us already | 19:43 |
ianw | https://git.lelux.fi/theel0ja/gitea-robots.txt/src/branch/master/robots.txt looks like a pretty sweet robots.txt | 19:46 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Add backend source port to haproxy logs https://review.opendev.org/738685 | 19:46 |
ianw | i note it has | 19:47 |
ianw | # Language spam | 19:47 |
ianw | Disallow: /*?lang= | 19:47 |
fungi | seems to be an already recognized issue | 19:47 |
corvus | so the error in build 7cdd1b201d0e462680ea7ac71d0777b6 is a follow on from this error: https://zuul.opendev.org/t/openstack/build/7cdd1b201d0e462680ea7ac71d0777b6/console#1/1/33/primary | 19:51 |
corvus | we may be able to get the module failure info from the executor, i'm not sure | 19:51 |
ianw | corvus: that step shows as green OK in the console log for you right? | 19:52 |
corvus | ianw: yep | 19:52 |
clarkb | ianw: I like the idea of adapting that robots.txt | 19:54 |
* clarkb is happy we had a meeting brainstorm so many good ideas | 19:54 | |
corvus | ianw: that task has "failed_when: false" | 19:54 |
corvus | # Using shell to try and debug why this task when run sometimes returns -13 | 19:54 |
corvus | "rc": -13 | 19:55 |
*** hashar is now known as hasharAway | 19:55 | |
corvus | so, um, apparently the author of that role has observed that happening, and expects it to happen, and so ignores it | 19:55 |
corvus | but the follow-on task assumes it works | 19:55 |
clarkb | ya we've seen that error before | 19:55 |
clarkb | I don't recall what debugging setup we had done though | 19:55 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 19:55 |
corvus | i'm not sure that we got any more information from the shell | 19:56 |
corvus | anyway, we still need to track down that module error in the executor log | 19:56 |
corvus | there's nothing in the executor log that isn't already in the job output | 19:58 |
corvus | so basically, no idea what caused -13 to happen | 19:59 |
corvus | switching to the gear issue: that's something that needs to be installed in the ansible venvs, and i guess we don't do that in the upstream zuul images. | 20:01 |
corvus | this is a bit more of a thorny problem though, since there is no usage of gear in zuul-jobs | 20:01 |
corvus | so our pseudo-policy of "we'll add it to zuul-executor if something in zuul-jobs needs it" doesn't cut it here | 20:01 |
clarkb | infra-root to summarize gitea situation I think we can land https://review.opendev.org/#/c/738686/1 and https://review.opendev.org/#/c/738684/1 (this one actually likely already does rotation) then land the haproxy logging update fungi is working on. Then basically as each change lands check logs and see if we learn anything new from better logging | 20:01 |
corvus | clarkb: we don't want the internet archive to crawl us? | 20:02 |
*** noonedeadpunk has quit IRC | 20:02 | |
clarkb | corvus: I mean maybe? the problem seems to be crawling in general poses issues though I guess we've never really noticed this until now so being selective about what we allow is probably as good as it gets? | 20:03 |
clarkb | in particular the lang stuff as well as specific commit level stuff would be good to clean up based on the logs we had during this situation | 20:03
fungi | well, the filter for localization url variants is probably a big enough help on its own | 20:03 |
*** noonedeadpunk has joined #opendev | 20:03 | |
ianw | that was just a straight copy of that upstream link, i agree maybe we could drop that bit | 20:03 |
*** diablo_rojo has joined #opendev | 20:03 | |
fungi | those essentially multiply the url count dozens of times over | 20:03
corvus | honestly, i'd rather switch from gitea to something that can handle the load rather than disallowing crawling | 20:03 |
ianw | also crawl-delay as 2, if obeyed, would seem to help | 20:04 |
corvus | in my view, supporting indexing and archiving is sort of the point of putting something in front of gerrit :/ | 20:04 |
fungi | i concur | 20:04 |
ianw | i don't think this is disallowing indexing, just making it more useful? | 20:05 |
clarkb | ya its directing it to the bits that are more appropriate | 20:05 |
corvus | it disallows indexing commits? | 20:05 |
clarkb | indexing every file for every commit for every localization lang is expensive | 20:05 |
clarkb | corvus: yes because the crawlers are crawling every single file in every single commit. If we want to start with simply stopping the lang toggle and see if that is sufficient we can do that | 20:06 |
corvus | i could be convinced lang is not useful (since the *content* isn't localized) | 20:06 |
*** mlavalle has quit IRC | 20:06 | |
clarkb | then work up from there rather than backwards | 20:06 |
clarkb | correct only the dashboard template itself is localized not the git repo content | 20:06
corvus | clarkb: right. i'm going out on a limb and saying that crawlers indexing every file in every commit is desirable. | 20:06 |
*** tobiash has quit IRC | 20:06 | |
corvus | "better them than us" :) | 20:06 |
clarkb | I agree, but it also causes our service to break | 20:06 |
corvus | well, no | 20:06 |
corvus | the botnet ddos causes our service to break | 20:07 |
*** mlavalle has joined #opendev | 20:07 | |
ianw | i note github robots.txt has | 20:07 |
ianw | User-agent: baidu | 20:07 |
ianw | crawl-delay: 1 | 20:07 |
corvus | anyway, i'm trying to suggest we don't over-compensate and throw out the benefit of the system. let's start small with disallowing lang, setting the delay, and whatever else seems minimally useful | 20:07
*** tobiash has joined #opendev | 20:07 | |
clarkb | corvus: that works for me | 20:07 |
clarkb | we should exclude activity as well | 20:08 |
clarkb | thats the not useful graph data | 20:08 |
ianw | much like the sign at the shop like "please do not eat the soap" i'm presuming they put that there in response to an observed problem :) | 20:09
corvus | ianw, clarkb: i left comments | 20:09 |
fungi | also fungi is not working on the haproxy logging change, he already proposed it during the meeting (738685) | 20:09 |
fungi | though happy to apply any fixes if there are issues | 20:10 |
clarkb | corvus: I think the from github is url paths that gitea inherited from github | 20:10 |
clarkb | corvus: I expect that gitea supports those paths that are github specific | 20:10 |
clarkb | (have not tested that yet) | 20:10 |
clarkb | fungi: oh I completely missed the change is already up | 20:10 |
fungi | do we still need to trigger full replication from gerrit? or did that happen already and i missed it | 20:10 |
corvus | clarkb: oh, like are they aliases for more native gitea paths? | 20:10 |
clarkb | corvus: ya that was my interpretation | 20:10 |
corvus | clarkb: if that's the case, then i agree we can filter them because it's duplicative | 20:10 |
clarkb | fungi: I don't think that has happened yet | 20:10 |
corvus | ianw: ^ feel free to ignore my comment on that if that's the case | 20:11 |
fungi | i can work on that next while i wait for gitea to stop hanging on sync | 20:11 |
ianw | yeah, it appears seeded from github.com/robots.txt | 20:11 |
fungi | er, i mean, while i wait for gertty to stop hanging | 20:12 |
fungi | i think something about the gitea issues earlier has confused it | 20:12
fungi | #status log triggered full gerrit re-replication after gitea restarts with `replication start --all` via ssh command-line api | 20:14 |
openstackstatus | fungi: finished logging | 20:14 |
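(For anyone repeating that later, the invocation is along these lines -- the account name is a placeholder for whatever account holds the right gerrit capability, and 29418 is gerrit's usual ssh API port:)
    ssh -p 29418 <admin-user>@review.opendev.org replication start --all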
clarkb | fungi: corvus I reviewed the haproxy change and left some notes. I think we should double check the check logs before approving but I +2'd | 20:17 |
clarkb | there is one thing about []s being special in haproxy docs that we want to double check doesn't cause problems for us | 20:17
fungi | oh, i'll take a closer look, thanks! | 20:17 |
clarkb | I expect its fine because the [%t] in the existing format is fine | 20:18 |
fungi | yeah, i just double-checked that on the production server | 20:19 |
fungi | i could also drop them, i just didn't want it to become ambiguous if we end up doing any load balancing over ipv6 (not that we do presently) | 20:20 |
clarkb | I think if the check logs look good we should +A | 20:20 |
clarkb | it's mostly a concern that I don't understand what haproxy means by the []s being different | 20:20
clarkb | but if the behavior is fine ship it :) | 20:21 |
fungi | and yeah, the rest of that format string was just copy-pasted from what the haproxy doc says the default is for tcp forward logging, though i did spot-check that it seemed to match the things we're logging currently | 20:22 |
clarkb | https://review.opendev.org/#/c/738684/1 has passed if anyone has time for gitea access log enablement | 20:23 |
clarkb | though note I have to pop out in about 20 minutes to get my glasses adjusted (they got scratched and apparently warranties cover that) | 20:24
ianw | do we want to disallow */raw/ ? | 20:24 |
clarkb | I guess it depends if we expect indexers to properly be able to render the intent of formatted source? | 20:25 |
clarkb | raw may be useful for searching verbatim code snippets? | 20:25 |
*** hasharAway has quit IRC | 20:28 | |
*** hashar has joined #opendev | 20:29 | |
clarkb | probably leave raw for now since I'm not sure all the useful indexers can do that | 20:30 |
ianw | what about blame? | 20:33 |
ianw | raw and blame are the two i thikn github disallows that are relevant to us | 20:34 |
clarkb | blame I can personally live without since it is in the git repos and web index is a poor substitute | 20:34 |
clarkb | corvus: fungi ^ thoughts? | 20:34 |
*** sshnaidm|bbl has quit IRC | 20:34 | |
*** sshnaidm|bbl has joined #opendev | 20:35 | |
corvus | i think we can live without blame. it is expensive and lower value in indexes/searching | 20:36 |
corvus | /archiving | 20:36 |
ianw | yeah i have that in | 20:37 |
ianw | i think the rest gitea does not do | 20:37 |
fungi | yeah, i don't think there's a lot of benefit to spidering that | 20:37 |
ianw | e.g. atom feeds, etc | 20:37 |
fungi | blame, i mean | 20:37 |
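(A rough sketch of the pared-down robots.txt being discussed -- the authoritative content is whatever lands in 738686, and the activity/blame path patterns are assumptions about gitea's URL layout:)
    User-agent: *
    Disallow: /*?lang=
    Disallow: /*/*/activity
    Disallow: /*/*/blame/
    Crawl-delay: 2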
openstackgerrit | Ian Wienand proposed opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 20:39 |
clarkb | we apparently don't put any traffic through the lb in testing https://zuul.opendev.org/t/openstack/build/1abcc48f83fe4a1c91db7f45fea09391/log/gitea-lb01.opendev.org/syslog.txt doesn't show it anyway | 20:40 |
clarkb | I've got to pop out now so won't approve it but I think you can if you'll watch it | 20:40
clarkb | fungi: ^ | 20:40 |
ianw | Disallow: /Explodingstuff/ | 20:41 |
ianw | that has one repo with a ransomware .exe | 20:42 |
ianw | ... i wonder what the story is with that | 20:43 |
openstackgerrit | Merged openstack/diskimage-builder master: Disable all enabled epel repos in CentOS8 https://review.opendev.org/738435 | 20:44 |
fungi | ianw: where? | 20:45 |
*** sshnaidm|bbl is now known as sshnaidm|afk | 20:46 | |
ianw | fungi: sorry that's in the github.com/robots.txt which i was comparing and contrasting to | 20:46 |
fungi | ahh | 20:46 |
fungi | amusing | 20:46 |
ianw | of all the possible things on github, it seems odd that this one is so special | 20:47 |
fungi | it was probably serving a very high-profile ransomware payload to many compromised systems | 20:47 |
fungi | or something similar they didn't want showing up in web searches | 20:48 |
ianw | yeah but why not delete it? although i might believe that script kiddies were using some sort of download thing that unintentionally obeyed robots.txt (because it was designed for good and they didn't patch it out) maybe | 20:49 |
fungi | well, deleting the file *thoroughly* requires rewriting (at least some) git history | 20:50 |
*** priteau has quit IRC | 20:52 | |
*** jbryce has quit IRC | 21:15 | |
*** jbryce has joined #opendev | 21:15 | |
clarkb | replacing lenses took surprisingly little time. I think I'll hang around to see what logging tells us but then it was an early start today and I'm tired. Will likely call it there and pick things up in the morning | 21:19 |
fungi | corvus: are you cool with the updated robots.txt in 738686? looks like it addresses your prior comments now | 21:27 |
corvus | i'll check | 21:27 |
corvus | +3 | 21:27 |
fungi | thanks! | 21:27 |
corvus | ianw: thanks; i like the comments too so we know what to look at next if things are bad | 21:28 |
ianw | i don't think anyone ever accused me of leaving too *few* comments :) | 21:28 |
openstackgerrit | Merged opendev/system-config master: Enable access log in gitea https://review.opendev.org/738684 | 21:32 |
clarkb | re ^ I think we may need to restart gitea processes manually | 21:37 |
clarkb | because it's not a new container image so we don't automatically do it | 21:37
clarkb | if fungi's replication kick is still running we'll want to do it gracefully | 21:37 |
fungi | even if it's not still running we'd want to do it gracefully, right? | 21:37 |
clarkb | yes, though its less critical | 21:38 |
openstackgerrit | Merged opendev/system-config master: Add backend source port to haproxy logs https://review.opendev.org/738685 | 21:39 |
clarkb | haproxy should autoupdate if the sighup is enough there | 21:39 |
clarkb | I wonder if we should wait for robots.txt before restarting giteas just in case gitea caches that | 21:39 |
fungi | we did at least work out the restart for the gitea containers in such a way that we no longer lose replicated refs during a restart, right? | 21:40 |
clarkb | fungi: yes, though I don't think we've ever managed to fully confirm that. The semi-regular "my commit is missing" complaints went away after we did it though and some rough testing of the component pieces shows it should work | 21:41
clarkb | its just really hard to confirm in the running system due to races | 21:41 |
fungi | right, okay | 21:41 |
clarkb | basically its docker-compose down && docker-compose up -d mariadb gitea-web ; #wait here until the gitea web loads then docker-compose up -d gitea-ssh | 21:41 |
clarkb | and all of that is in the role too if you need to find it later | 21:42 |
clarkb | what that does is stops the ssh server first (its written that way in the docker compose file) then only starts the ssh server once gitea itself is ready. This way gerrit will fail to push anything until gitea the main process can handle the input | 21:43 |
fungi | makes sense | 21:44 |
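(A sketch of that sequence as a script; the compose directory and the local health-check URL/port are assumptions, and the authoritative ordering lives in the gitea ansible role:)
    cd /etc/gitea-docker        # assumed compose dir on the backends
    sudo docker-compose down
    sudo docker-compose up -d mariadb gitea-web
    # hold gitea-ssh back until the web process answers, so gerrit replication
    # fails fast rather than pushing into a half-started gitea
    until curl -skf https://localhost:3000/ >/dev/null; do sleep 5; done
    sudo docker-compose up -d gitea-ssh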
clarkb | hrm robots.txt is a bit of a ways out | 21:44 |
clarkb | maybe we should go ahead with some restarts now. I can do 01 and confirm that its access log works at least | 21:44 |
* clarkb does this | 21:45 | |
clarkb | apparently docker-compose stop != docker-compose down | 21:46 |
clarkb | down removes the containers, stop does not | 21:46
clarkb | so when I did the reboots down was correct to prevent them restarting in the wrong order | 21:46 |
clarkb | but without a reboot a stop is fine | 21:46 |
*** rosmaita has joined #opendev | 21:46 | |
clarkb | gitea01 is done | 21:47 |
clarkb | we have user agents | 21:48 |
clarkb | and baiduspider shows up | 21:48 |
clarkb | but so do others | 21:48 |
rosmaita | i have a (hopefully) quick zuul question when someone has a minute | 21:51 |
clarkb | rosmaita: go for it, I think the fires are under control and now we're just poking at them as they cool :) | 21:51 |
rosmaita | ty | 21:51 |
clarkb | note I think we may need to restart haproxy to get new log format | 21:51 |
rosmaita | i'm trying to configure this job: https://review.opendev.org/#/c/738687/4/.zuul.yaml@109 | 21:52 |
clarkb | the config updated but I don't see the new format yet | 21:52 |
rosmaita | i want it to use a playbook from a different repo | 21:52 |
rosmaita | but not sure how to specify that | 21:52 |
clarkb | rosmaita: playbooks can't be shared between repos, but roles can | 21:53 |
clarkb | rosmaita: usually the process is to repackage the ansible bits into a reusable role that the original source and new consumer can both use | 21:53 |
fungi | also jobs can be used between repos (obviously) | 21:54 |
fungi | so another approach is to define a job (maybe an abstract job) which uses the playbook, and then inherit from that in the other repo with a new job parented to it | 21:55 |
rosmaita | i think there's a tox-cover job defined in zuul-jobs, but i figured i should probably make openstack-tox my parent job | 21:57 |
clarkb | rosmaita: what is your post playbook going to do? | 21:59 |
rosmaita | clarkb: the playbook i'm using only references one role -- if i copy the playbook to the cinder repo, will it just find the fetch-coverage-output role? | 21:59 |
clarkb | rosmaita: yes zuul-jobs is available already. Though really you should probably create an openstack-tox-cover that does that for all jobs? | 21:59 |
clarkb | I think the difference between openstack-tox and normal tox is use of constraints | 22:00 |
clarkb | so a tox-cover <- openstack-tox-cover inheritance that is then applied to cinder makes sense to me | 22:00 |
rosmaita | ok ... i will try copying the playbook first to see if the job does what I want, and if it does, i will propose an openstack-tox-cover job | 22:01 |
clarkb | sounds good | 22:01 |
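A rough sketch of the inheritance being suggested; everything here is illustrative, and the names beyond tox-cover and openstack-tox are placeholders:

```
# hypothetical .zuul.yaml fragments
- job:
    name: openstack-tox-cover
    parent: tox-cover          # playbooks/roles come from zuul-jobs via the parent
    description: tox cover environment run with openstack constraints applied.

# the cinder repo then consumes the shared job instead of copying a playbook
- project:
    check:
      jobs:
        - openstack-tox-cover
```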
fungi | i've confirmed, `cd /etc/haproxy-docker && sudo docker-compose exec haproxy grep log-format /usr/local/etc/haproxy/haproxy.cfg` does show the updated config, but ps indicates the service has not restarted yet | 22:02 |
rosmaita | clarkb: ty | 22:02 |
clarkb | fungi: we just do a sighup | 22:02 |
clarkb | fungi: I'm guessing that only allows some things to be updated in the config and others are evaluated on start | 22:02 |
fungi | is the outage from `sudo docker-compose down && sudo docker-compose up -d` going to be brief enough we don't care about scheduling? | 22:03 |
*** tobiash has quit IRC | 22:04 | |
clarkb | I would do s/down/stop/ for the reason I discovered earlier. That outage should last only a few seconds (so it may be noticed, but only briefly) | 22:04 |
fungi | so it's stop followed up by -d? | 22:05 |
clarkb | yup stop && up -d | 22:05 |
fungi | okay, i'll give that a shot on gitea-lb01 now | 22:05 |
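Concretely, something like the following on gitea-lb01 (the host path comes from the grep command quoted earlier; expect a few seconds of connection failures while the container is recreated):

```
cd /etc/haproxy-docker
sudo docker-compose stop && sudo docker-compose up -d
```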
*** tobiash has joined #opendev | 22:06 | |
clarkb | k I'm watching | 22:06 |
fungi | took ~12 seconds | 22:06 |
fungi | i can browse content | 22:06 |
fungi | process start time looks recent now | 22:07 |
clarkb | logs still missing that info? | 22:07 |
fungi | yeah, it doesn't seem to be using the new log-format | 22:07 |
*** hashar has quit IRC | 22:07 | |
clarkb | maybe the []s are the problem here? | 22:07 |
fungi | well, only one of the two parameters added were in [] | 22:07 |
*** olaph has quit IRC | 22:09 | |
clarkb | I'm looking at the rest of the log config maybe we set the new value then override later | 22:09 |
clarkb | ya I think that may be what is happening | 22:09 |
clarkb | we set option tcplog on the frontends | 22:10 |
clarkb | which is a specific log format | 22:10 |
fungi | oh | 22:10 |
fungi | yep. i'll work on another fix | 22:10 |
clarkb | fungi: we can just test it manually first too | 22:10 |
clarkb | but I think commenting out those lines or removing them then sighup may do it | 22:11 |
fungi | this is a nice writeup: https://www.haproxy.com/blog/introduction-to-haproxy-logging/ | 22:13 |
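The interaction being described, sketched against a made-up config rather than the real opendev one (section names and the exact log-format string are illustrative; %bi/%bp are haproxy's backend source ip/port variables):

```
defaults
    log global
    mode tcp
    # custom format: client ip:port plus backend source ip:port, so an LB
    # connection can later be matched against the gitea-side logs
    log-format "%ci:%cp [%t] %ft %b/%s %Tw/%Tc/%Tt %B %ts %bi:%bp"

frontend balance_git_https
    bind *:443
    # "option tcplog" swaps in haproxy's built-in TCP log format, overriding
    # the log-format above -- which is why the new fields never appeared
    # until 738710 removed it
    # option tcplog
    default_backend git_https
```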
clarkb | ianw: fwiw gitea01 does have a fair bit of interesting UA data now in /var/haproxy/log/access.log which we can possibly use to update the robots.txt | 22:15 |
clarkb | while fungi is doing the haproxy things I'll finish up gitea02-08 restarts | 22:15 |
clarkb | then they'll all have that fun data | 22:15 |
fungi | where on the host system are we stashing the haproxy config? running `mount` inside the container isn't much help | 22:17 |
fungi | it's /usr/local/etc/haproxy/haproxy.cfg inside the container, but mount just claims that /dev/vda1 is mounted as /usr/local/etc/haproxy | 22:17 |
fungi | aha, it's /var/haproxy/etc/haproxy.cfg | 22:19 |
fungi | clarkb: yep, that did it | 22:21 |
fungi | as expected. i'll propose removing the two occurrences of "option tcplog" | 22:21 |
clarkb | fungi: ya check the docker compose config file for the mounts | 22:21 |
clarkb | 01-08 are all restarted and should have access logs now | 22:22 |
ianw | sorry back, having a look | 22:23 |
clarkb | heh the access logs don't have ports | 22:23 |
clarkb | so to map them you get the url + timestamp from the access log for the UA you want, then check against the macaron logs in the docker logs output; that gives you the port, and you can cross check against the lb log for the actual source | 22:24 |
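Roughly, the cross-referencing that implies (paths and container names here are assumptions, and step 3 only becomes possible once the port actually shows up on both sides):

```
# 1. pick a request (url + timestamp) for the UA of interest
grep 'Baiduspider' /var/gitea/logs/access.log | tail -5

# 2. find the same url/timestamp in the gitea (macaron) container logs,
#    which also record the source port of the connection from the LB
sudo docker logs <gitea-web-container> 2>&1 | grep '<url-from-step-1>'

# 3. on gitea-lb01, look that backend source port up in the haproxy log
#    to recover the real client address
grep ':<port-from-step-2>' <haproxy-log-file>
```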
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Remove the tcplog option from haproxy configs https://review.opendev.org/738710 | 22:24 |
clarkb | well progress anyway | 22:24 |
rosmaita | clarkb: i am an idiot -- there is already an openstack-tox-cover job | 22:24 |
fungi | rosmaita: happens to me all the time, i consider it a validation of my need when i spend half an hour discovering that what i tried to add is already there | 22:25 |
rosmaita | :) | 22:25 |
clarkb | fungi: actually is there port info in the gitea side somewhere? | 22:25 |
fungi | clarkb: i don't know, i had hoped we'd get that from the access log being added | 22:26 |
clarkb | we may still have a missing piece | 22:26 |
corvus | i didn't look at that, i thought all we wanted from access log was UA | 22:26 |
clarkb | corvus: ya I think its 95% of what we want | 22:27 |
fungi | ua is definitely useful and an improvement | 22:27 |
corvus | we'll need to unblock the ip ranges to see the UA right? | 22:27 |
clarkb | if we can also map lb logs to gitea logs that is a separate win | 22:27 |
clarkb | corvus: no because we didn't block all the IPs doing the things | 22:27 |
fungi | had just hoped we'd reach a point where we could actually correlate a request to a source ip address | 22:27 |
corvus | ah cool -- do we have a ua yet? | 22:27 |
clarkb | corvus: we have many :) | 22:27 |
corvus | i mean from the crawler | 22:27 |
clarkb | "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)" | 22:28 |
fungi | 738710 will get the backend source ports to appear properly (tested manually to confirm) | 22:28 |
clarkb | \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" too | 22:28 |
clarkb | baidu doesn't seem to have an english faq | 22:29 |
ianw | that maps though with the crawl-delay: 1 that github sets for UA baidu | 22:29 |
clarkb | there are others too (I haven't managed to do a complete listing but I think if we grep for lang= and then do a sort|uniq -c sort of deal we should see which are most active) | 22:31 |
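Something along these lines gives a quick ranking, assuming the UA ends up as the last double-quoted field of each access log line:

```
grep 'lang=' /var/gitea/logs/access.log \
  | awk -F'"' '{print $(NF-1)}' \
  | sort | uniq -c | sort -rn | head -20
```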
corvus | are folks thinking that first UA is botnet? | 22:31 |
ianw | clarkb: sorry where's the UA? not in the gitea logs? | 22:31 |
corvus | ianw: should be in access.log in /var/gitea/logs | 22:31 |
ianw | ahh ok, was tailing the container | 22:32 |
clarkb | corvus: yes I think so based on the url its hitting | 22:32 |
clarkb | corvus: it's a specific commit file with lang set | 22:32 |
clarkb | that's not concrete evidence unless there is a trend though (a single data point is not enough to be sure) | 22:33 |
fungi | corvus: may be compromised in-browser payload. a lot of them work that way, so they'd wind up with the actual browser's ua anyway | 22:33 |
fungi | what with turing-complete language interpreters in modern browsers, you don't need to compromise the whole system any more, just convince the browser to run your program | 22:34 |
clarkb | we can set the access log format | 22:34 |
clarkb | I'm now trying to see if the remote port is available to that logging context | 22:34 |
clarkb | (having the ability to link the two would be useful) | 22:35 |
fungi | if gitea uses apache access log format replacements, i think %{remote}p gives the client port | 22:35 |
fungi | not that i would have any reason to expect it to use anything like apache's format string language | 22:36 |
openstackgerrit | Merged opendev/system-config master: gitea-image: add a robots.txt https://review.opendev.org/738686 | 22:36 |
clarkb | it does not | 22:36 |
ianw | http://paste.openstack.org/show/795410/ | 22:36 |
clarkb | https://docs.gitea.io/en-us/logging-configuration/#the-access_log_template | 22:36 |
ianw | 44 \"Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" | 22:37 |
clarkb | semrush in that listing is the one we had to turn off for lists.o.o because it made mailman unhappy | 22:37 |
ianw | 810 \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" | 22:37 |
ianw | so relatively speaking, i don't think it's that | 22:37 |
clarkb | ya baidu seems pretty well behaved overall | 22:38 |
corvus | because of the iptables block, we're not going to have many ddos entries in that log | 22:38 |
fungi | clarkb: i wonder then if it's something like {{.Ctx.RemotePort}} | 22:40 |
clarkb | fungi: there isn't but we do have the Ctx.Req. object which is a normal golang http request which may have it | 22:40 |
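For context, the shape of the change under discussion -- approximately gitea's documented default access log template with the address field swapped out, not the exact opendev patch:

```
[log]
ENABLE_ACCESS_LOG = true
; {{.Ctx.RemoteAddr}} is the value macaron has already trimmed to a bare IP;
; the raw request's RemoteAddr still carries "ip:port", which is what allows
; matching an entry against the haproxy log
ACCESS_LOG_TEMPLATE = {{.Ctx.Req.RemoteAddr}} - {{.Identity}} {{.Start.Format "[02/Jan/2006:15:04:05 -0700]" }} "{{.Ctx.Req.Method}} {{.Ctx.Req.URL.RequestURI}} {{.Ctx.Req.Proto}}" {{.ResponseWriter.Status}} {{.ResponseWriter.Size}} "{{.Ctx.Req.UserAgent}}"
```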
clarkb | corvus: we only cut the volume down and did not eliminate it, but yes, undoing iptables would give us more data | 22:41 |
clarkb | corvus: there is at least another whole AS that we could block to stop the requests | 22:41 |
ianw | according to https://developers.whatismybrowser.com/useragents/parse/38888-internet-explorer-windows-trident the top hit in that list i provided is windows xp sp2 ie7 | 22:42 |
ianw | i.e. i'd say we can rule out that being a human | 22:42 |
ianw | http://paste.openstack.org/show/795411/ all requests, not just lang | 22:44 |
*** mlavalle has quit IRC | 22:46 | |
*** tkajinam has joined #opendev | 22:48 | |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 22:49 |
clarkb | that change is not tested but I think if I got the names right it should work | 22:50 |
clarkb | fungi: ^ basically macaron, which is gitea's web layer, tries to be smart about the value there and I think it ends up trimming that useful info, but the normal http.Request.RemoteAddr value has it and is intended for logging | 22:50 |
*** rosmaita has left #opendev | 22:51 | |
fungi | oh cool | 22:51 |
clarkb | ianw: I expect we can filter a lot of those? | 22:54 |
clarkb | my modern firefox UA reports itself with its version | 22:54 |
clarkb | so Firefox 4.0.1 is either a lie or very very old? | 22:55 |
fungi | those moz/ie versions were commonly used in other more niche browsers which needed to work around sites claiming their browser wasn't supported | 22:56 |
clarkb | fungi: +2'd https://review.opendev.org/#/c/738710/1 | 22:56 |
fungi | but i agree these days it's more likely to be a script | 22:56 |
clarkb | that's what the prefix bits are supposed to do these days right? | 22:56 |
clarkb | the last entry should be your actual browser? | 22:57 |
fungi | in theory | 22:57 |
*** tosky has quit IRC | 22:57 | |
fungi | though in reality it can be anything | 22:57 |
clarkb | https://review.opendev.org/#/c/738714/1 should be tested by our testing (I checked earlier jobs and the access log shows up in them) | 22:58 |
clarkb | corvus: ianw ^ if you can review that I'll double check the logs in the test runs before approving | 22:59 |
ianw | clarkb: i mean looking at http://paste.openstack.org/show/795411/ the hits are from things that are saying they look like ie7, some tencent browser thing, and os 10.7 which went eol in ~ 2012? | 23:00 |
clarkb | ianw: ya its a lot of things reporting to be very old browsers | 23:00 |
clarkb | the opera entry there is from 2011 | 23:00 |
ianw | but not being nice and putting any sort of "i'm a robot" string in it :/ | 23:00 |
clarkb | firefox 4.0.1 is also a 2011 release | 23:01 |
clarkb | was 2011 the golden year for web browsers ? | 23:02 |
clarkb | I'm leaning towards: we can use robots.txt to block those. Actual browsers will ignore the robots.txt right? And any well-behaved bot will go away. If the bots are not well behaved then we may need the mod rewrite idea | 23:03 |
ianw | seems once robots is there we could just hand edit for testing | 23:04 |
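If the hand-editing route is taken, the kind of entries being floated (the UA names and the wildcard pattern are illustrative; wildcards in Disallow are only honoured by some crawlers, and real browsers ignore robots.txt entirely):

```
# slow down the crawlers that identify themselves
User-agent: Baiduspider
Crawl-delay: 1

User-agent: SemrushBot
Disallow: /

# keep everything else out of the expensive per-commit views
User-agent: *
Disallow: /*/src/commit/
```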
corvus | clarkb: lgtm; do you want a +3 or to check the zuul logs first? | 23:05 |
clarkb | corvus: I'll check the logs first and approve if its good | 23:05 |
ianw | oh it deployed already, cool | 23:05 |
clarkb | heh that safari release is 2011/2012 too | 23:05 |
clarkb | it's like whoever built this bot did so in 2012 and listed all the possible UA strings at the time :) | 23:05 |
ianw | so, is anything actually loading the robots.txt ... | 23:06 |
ianw | basically two things that look like well behaved bots http://paste.openstack.org/show/795412/ | 23:07 |
clarkb | and I guess the flood isn't going away with the less nice bots | 23:09 |
ianw | is there like a spamhaus for UA's? | 23:10 |
clarkb | fungi: ^ | 23:11 |
fungi | none that i know of | 23:11 |
ianw | https://perishablepress.com/4g-ultimate-user-agent-blacklist/ | 23:11 |
ianw | i mean, insert grain of salt of course | 23:12 |
fungi | yeah, given any application can claim whatever ua it wants and masquerade as any other agent, if filtering based on well-known malware ua strings were especially effective the authors would adapt | 23:12 |
fungi | we might get away with it for a while, or indefinitely, but it would be a continual game of whack-a-mole adding new entries over time | 23:13 |
ianw | but i guess given that what we're seeing is a bunch of ancient UA's, whatever we're dealing with is *not* cutting edge | 23:13 |
corvus | clarkb: gitea log change failed test | 23:14 |
fungi | the ua is useful for identifying well-meaning bots so that you can adjust robots.txt to ask them to be better behaved | 23:14 |
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Use a temporary registry with buildx https://review.opendev.org/738517 | 23:15 |
clarkb | corvus: and no access log. Maybe Ctx.Req.RemoteAddr isn't valid there :/ | 23:15 |
ianw | fungi: sure, but if there's some empirical list of "this is a bunch of UA's from known spam/script/ddos utilities that are commonly abusive" that includes all these EOL browsers, that would be great :) | 23:15 |
ianw | however, that is maybe the type of thing kept in google/cloudflare repos and not committed publicly | 23:16 |
clarkb | oh template error writing out that config file | 23:16 |
clarkb | its a jinja2 problem with the {{ .stuff | 23:18 |
clarkb | trying to figure out if I can do a literal block | 23:19 |
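The usual ways around that clash, given the config file is rendered with jinja2 and jinja2 also claims {{ }} (which of these the next patchset actually uses is an assumption):

```
{# option 1: fence the whole go-template value off from jinja2 #}
ACCESS_LOG_TEMPLATE = {% raw %}{{.Ctx.Req.RemoteAddr}} - {{.Identity}} ... "{{.Ctx.Req.UserAgent}}"{% endraw %}

{# option 2: emit the braces from a jinja2 string literal #}
ACCESS_LOG_TEMPLATE = {{ '{{.Ctx.Req.RemoteAddr}}' }} - ...
```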
openstackgerrit | James E. Blair proposed zuul/zuul-jobs master: Ignore ansible lint E106 https://review.opendev.org/738716 | 23:19 |
openstackgerrit | Clark Boylan proposed opendev/system-config master: Update gitea access log format https://review.opendev.org/738714 | 23:22 |
clarkb | I think ^ that may fix it assuming that is the only issue | 23:22 |
*** auristor has quit IRC | 23:29 | |
openstackgerrit | Merged zuul/zuul-jobs master: Ignore ansible lint E106 https://review.opendev.org/738716 | 23:33 |
*** Dmitrii-Sh has quit IRC | 23:42 | |
*** Dmitrii-Sh has joined #opendev | 23:43 | |
ianw | mnaser: any idea what 38.108.68.124 - - [30/Jun/2020:23:52:28 +0000] "GET /zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py?lang=pt-PT HTTP/1.1" 200 21763 "https://opendev.org/zuul/zuul-registry/src/commit/a2dcc167a292ce8ad83a6890a749004b5b298c64/setup.py\" \"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)" is? | 23:53 |
ianw | mnaser: ohhhh i'm an idiot, that's the loadbalancer | 23:54 |
mnaser | ianw: :) | 23:54 |
ianw | sorry i got confused in the output of my own script :) | 23:55 |
*** auristor has joined #opendev | 23:56 | |
clarkb | ianw: ya that's why we're trying to get the port mappings done | 23:57 |