ianw | i can check 02:00 :) | 00:00 |
---|---|---|
clarkb | the hourly service-nodepool failed and I suspect it is due to a full disk on nb04 | 00:17 |
clarkb | ya /opt is full again arg. I'm going to start a screen on that host, stop the service, then start a long running rm for those files | 00:17 |
clarkb | ok that is in progress on nb04 (just want to avoid any noise in the signal later if possible) | 00:20 |
Clark[m] | I think the hourly jobs enqueued ahead of the daily jobs | 02:01 |
Clark[m] | Hrm I don't see system config in periodic at all | 02:06 |
Clark[m] | Oh wow there they are. A whole flood of projects after the initial 15 | 02:07 |
Clark[m] | Bootstrap bridge is starting for the daily jobs now | 02:14 |
Clark[m] | Looking ok from the status page so far | 02:29 |
ianw | still all progressing in twos ... bridge utilisation seems sane | 02:56 |
ianw | i guess parallel operation makes `/var/log/ansible.log` fairly useless as it all gets mixed up | 02:59 |
ianw | perhaps instead of > output of runs we should set ANSIBLE_LOG_PATH? | 03:01 |
Clark[m] | I thought we output to playbook specific log files | 03:01 |
ianw | we do but with a >> | 03:02 |
Clark[m] | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-playbook.yaml#L21 | 03:03 |
Clark[m] | Ya I guess I'm not understanding where the conflict is if it's playbook name specific? | 03:03 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 03:06 |
ianw | oh just because it _also_ logs to /var/log/ansible.log via the global config. I think that if we set ANSIBLE_LOG_PATH it will override that, so we don't have all the prod playbooks writing to the same file | 03:07 |
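To illustrate the idea under discussion, here is a minimal shell sketch of how ANSIBLE_LOG_PATH would steer Ansible's own logging to a per-playbook file instead of the shared /var/log/ansible.log from ansible.cfg; the wrapper, playbook name, and paths are illustrative, not the actual run-production-playbook.yaml contents:

```shell
# Illustrative only -- not the real production playbook task.
PLAYBOOK=service-nodepool.yaml

# Current approach: stdout/stderr appended to a per-playbook file, while the
# log_path in ansible.cfg still interleaves everything into /var/log/ansible.log.
ansible-playbook "playbooks/${PLAYBOOK}" >> "/var/log/ansible/${PLAYBOOK}.log" 2>&1

# Proposed approach: point Ansible's own logger at the per-playbook file, which
# overrides the global log_path so nothing lands in the shared ansible.log.
ANSIBLE_LOG_PATH="/var/log/ansible/${PLAYBOOK}.log" ansible-playbook "playbooks/${PLAYBOOK}"
```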
Clark[m] | Oh does it log to a global file too? One thing we would need to check with that change is if it is safe to do so for jobs whose logs get published publicly | 03:07 |
ianw | haha i was about to say, now i think about it, we want to make sure we put anything that comes out into a local file too :) | 03:08 |
Clark[m] | Since Ansible log output may not be the same as stdout/stderr | 03:08 |
Clark[m] | So probably need to understand any potential behavior differences between the two first? | 03:09 |
ianw | i guess the best thing would be to >> to a .stdout.log | 03:09 |
ianw | but then we have another log file to deal with on the encryption path | 03:09 |
ianw | (not that i think anyone else's key but mine is in there for that :) | 03:10 |
Clark[m] | Ya and regular capture path for those that do it | 03:10 |
ianw | i wonder if ANSIBLE_LOG_PATH _and_ >> to the same file works ok | 03:10 |
Clark[m] | Since we won't want to overexpose to start. One option may be to do both like you say then disable public publishing for everything. Then recheck the two files for anything that published before and add it back in | 03:10 |
ianw | the ordering may be completely out, but perhaps it doesn't matter that much | 03:10 |
Clark[m] | If things look safe | 03:10 |
Clark[m] | What is in the Ansible log file? Is it different? | 03:11 |
Clark[m] | Sorry I'm not currently able to check that easily but can if it becomes urgent | 03:11 |
Clark[m] | Going back to general behavior here I think this is continuing to look good | 03:12 |
ianw | it looks to me that what is in ansible.log as written out by ansible is the same as what is captured into each <service>.yaml.log | 03:14 |
ianw | i don't think this is a problem, as such ... just that the ansible.log file ends up as a rather messy, ever-growing jumble of interleaved output | 03:15 |
Clark[m] | Got it. Less about ensuring correct behavior for the remote node config management and more about making debugging more straightforward | 03:15 |
Clark[m] | https://zuul.opendev.org/t/openstack/buildset/1c24d0f003e9427ea84e393f48120397 success! | 03:19 |
Clark[m] | ianw: we use >> because we start the file with a header | 03:27 |
Clark[m] | I suspect that is still fine with the proposed change though, as it sounds like Ansible should append too | 03:27 |
Clark[m] | So probably the main thing to check is just that the content isn't more dangerous | 03:28 |
ianw | yeah i don't think it's right because it will echo the output. have to think about it :/ | 03:43 |
Clark[m] | It being the change to use the log env var? | 03:47 |
ianw | yep | 04:47 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 04:59 |
ianw | ^ thought two - keep the log file capture as we have now, but burn ansible's own logging to /dev/null for Zuul runs. but leave the default there so that if you run by hand, you still get logs in /var/log/ansible.log. this is predicated on the idea that "log_path" in Ansible doesn't capture anything that stdout/stderr won't ... which i think is correct | 05:02 |
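A rough sketch of that second approach, with illustrative paths (the real change is in 943999):

```shell
PLAYBOOK=service-nodepool.yaml  # illustrative playbook name

# Zuul-driven runs: keep the existing stdout/stderr capture, but send Ansible's
# own log_path output to /dev/null so nothing hits the shared /var/log/ansible.log.
ANSIBLE_LOG_PATH=/dev/null ansible-playbook "playbooks/${PLAYBOOK}" \
  >> "/var/log/ansible/${PLAYBOOK}.log" 2>&1

# Manual runs: leave ANSIBLE_LOG_PATH unset so the log_path default in ansible.cfg
# still records to /var/log/ansible.log as it does today.
ansible-playbook "playbooks/${PLAYBOOK}"
```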
*** mrunge_ is now known as mrunge | 06:35 | |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 07:16 |
*** jroll02 is now known as jroll0 | 08:45 | |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 09:23 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 10:35 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop auxiliary requirements files https://review.opendev.org/c/opendev/bindep/+/940711 | 14:38 |
fungi | slight efficiency improvement in the latest revision of ^ | 14:39 |
tkajinam | hmm I just noticed a bit strange behavior in https://review.opendev.org/c/openstack/puppet-horizon/+/943232 | 14:47 |
clarkb | looking at periodic buildset runtimes for system-config typical runtime was 2 to 2.5 hours and last nights was 1 hour and 8 minutes | 14:47 |
tkajinam | a strange behavior of gerrit, I mean | 14:47 |
clarkb | a definite measurable improvement | 14:47 |
clarkb | tkajinam: can you be more specific? what do you observe that is strange? | 14:47 |
tkajinam | if you look at my latest comment it is duplicated. | 14:48 |
tkajinam | hmmm it might be a problem caused by something on my end. ignore it for now | 14:48 |
clarkb | I think I've seen similar with gertty users | 14:49 |
clarkb | not sure if you use gertty | 14:49 |
tkajinam | no I posted it from web interface | 14:49 |
fungi | i've accidentally done it with gertty's experimental inline comment threads patch by selecting the reply button on a comment after composing a reply to it already | 14:50 |
clarkb | https://review.opendev.org/c/opendev/bindep/+/940711/ fungi's patchset 5 is the example I'm thinking of | 14:51 |
fungi | yeah, that's where i saw i'd done it | 14:51 |
tkajinam | ok | 14:51 |
fungi | but since it's all just rest api calls back to the gerrit server, i expect the webclient might retry to post a comment if it received an error or disconnect, and if the server had correctly processed the comment anyway then maybe you end up with two | 14:52 |
tkajinam | yeah that's possible | 14:52 |
tkajinam | I'll come back here in case I observe the same behavior frequently. | 14:52 |
tkajinam | sorry for the noise ! | 14:53 |
clarkb | tkajinam: thank you for the heads up and ya I think this is probably fine unless it becomes persistent. If that happens definitely let us know | 14:53 |
fungi | please do, maybe we can correlate them and figure out the commonalities | 14:53 |
fungi | definitely not noise | 14:53 |
tkajinam | :-) | 14:54 |
tkajinam | I've never seen Alex using gertty so I was wondering why his comment in patchset 2 is the legacy style (a non-inline comment, I mean). | 14:55 |
tkajinam | that's why I suspected something strange might be happening in comment feature. | 14:55 |
clarkb | that is also another hallmark of gertty usage. But maybe there is another client or using the ssh review feature? | 14:56 |
fungi | gertty currently doesn't support the newer thread-style comments (there's an experimental change to add support but it's still incomplete), so most gertty users end up leaving legacy comments | 14:56 |
tkajinam | yeah | 14:56 |
clarkb | I think if you do `ssh -p 29418 user@review.opendev.org gerrit review -m "message here"` you also get that behavior | 14:57 |
tkajinam | I'll talk with him to know how it was posted. | 14:57 |
clarkb | you have to use the --json input to get the more modern inline commenting stuff (which the modern top level comment is a special case of) | 14:57 |
tkajinam | ah. ok | 14:59 |
tkajinam | there are still a lot of mysteries in gerrit :-P | 14:59 |
clarkb | ya there is a special meta file name that, if you comment on it, becomes a top level comment. That is the new behavior. The old behavior is you set a comment on the patchset and don't specify a file | 15:01 |
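To make the distinction concrete, a hedged example of both forms; the --json payload shape is an assumption based on Gerrit's ReviewInput format rather than something confirmed above:

```shell
# Legacy-style comment attached to the patchset, with no file context:
ssh -p 29418 user@review.opendev.org gerrit review -m "message here" 12345,1  # change,patchset (example values)

# Modern inline/top-level comments go through --json, which reads review input
# from stdin (field names here are assumed from the REST ReviewInput format):
echo '{"message": "top level comment", "comments": {"path/to/file": [{"line": 10, "message": "inline comment"}]}}' |
  ssh -p 29418 user@review.opendev.org gerrit review --json 12345,1
```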
fungi | i'd say there's less mystery in gerrit than in proprietary code review platforms without available source code ;) | 15:02 |
clarkb | fungi: can you check my comment on https://review.opendev.org/c/opendev/system-config/+/943999 as ianw won't be awake this time of day? I'm just trying to reason about the safety of capturing stderr like that for jobs that publish public logs | 15:04 |
clarkb | I'm really happy the periodic buildset ran in just over half the time it took previously | 15:06 |
clarkb | even if we don't increase the semaphore limit that is a great roi and I think we can safely bump the limit up too | 15:07 |
fungi | answered | 15:08 |
clarkb | thanks! Wanted to make sure my DST jet-lagged brain is keeping up | 15:09 |
clarkb | jamesdenton: wanted to let you know that the new dfw3 region seems to be working well. We've also managed to switch sjc3 over to the new tenant/project that matches what is in dfw3. Thanks for the help getting that done. | 15:13 |
clarkb | jamesdenton: also did you know that nova ssh keys are not project/tenant specific (we learned that the hard way when we deleted them using the old project/tenant in sjc3) | 15:13 |
jamesdenton | Glad to hear about DFW3! But if you could elaborate a little more on the nova keys... | 15:14 |
fungi | we had some keypairs defined which we'd been using with the old project in sjc3 | 15:15 |
fungi | we didn't realize that they were the same objects being used for the new project in sjc3 under the same account | 15:15 |
fungi | i deleted the keypairs from the account thinking they were only being deleted from the old project, but it of course led to them being unavailable in the new project as well | 15:15 |
clarkb | and likely to be some ancient nova thing that we're just going to have to live with for backward compatibility reasons | 15:16 |
clarkb | so nothing for you to change/address. Just an interesting behavior we discovered the hard way | 15:16 |
fungi | i didn't think to check that they were still there for the new project, so we were erroring for a little while with nodepool telling nova to use those keypair objects which no longer existed | 15:17 |
frickler | keypairs are a user resource in nova | 15:17 |
fungi | right, i learned that the hard way ;) | 15:17 |
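A quick illustration of that scoping with the openstack CLI; the cloud entries and keypair name are made up for the example, but the behavior is the point:

```shell
# Same user credentials scoped to two different projects: the keypair list is
# identical in both, because nova keypairs hang off the user, not the project.
openstack --os-cloud rax-flex-old-project keypair list
openstack --os-cloud rax-flex-new-project keypair list

# Which means a delete issued against the "old" project removes the keypair
# everywhere that user relies on it:
openstack --os-cloud rax-flex-old-project keypair delete example-keypair
```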
clarkb | our config management sorted it out automatically when it ran again to update cloud things so not a big deal either | 15:18 |
fungi | arguably, we're abusing the keypair feature somewhat to do things it's not strictly intended for | 15:18 |
fungi | we could just inject all the ssh keys into our images instead | 15:19 |
clarkb | there are reasons to not do that though. Including people reusing our images | 15:19 |
clarkb | (though we still bake in zuul's key) | 15:19 |
clarkb | so we're only half addressing that problem | 15:19 |
clarkb | jamesdenton: the other thing to note is in each region we have quota sufficient for 50 instances except for the memory limit. We can only fit 32 of our 8GB RAM instances into the memory quota we have currently. I'm not sure what capacity looks like in the new deployments but we're always happy to make use of more quota if possible. I think cloudnull mentioned a third region may come | 15:21 |
clarkb | online which is the other direction we can expand too | 15:21 |
jamesdenton | frickler thanks for the clarification on that! | 15:25 |
jamesdenton | clarkb we can help with the quota, i think. | 15:25 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 15:27 |
clarkb | fungi: I posted some additional followup to https://review.opendev.org/c/opendev/system-config/+/943999 with some further investigating and info gathering if you are curious | 15:46 |
clarkb | tl;dr is yes stderr is going to zuul and can be viewed. Seems to be innocuous | 15:47 |
fungi | cool, thanks for confirming. that matches what i expected | 15:48 |
clarkb | I didn't approve the change because I'm still not sure I trust my early morning brain and figure we can wait for ianw to review the comments before we land it | 15:49 |
fungi | yeah, that one's not at all urgent | 15:49 |
clarkb | should we proceed wtih https://review.opendev.org/c/opendev/system-config/+/940928 to exercise parallel infra-prod more with merging changes? | 16:29 |
clarkb | that switches the haproxy and zookeeper statsd container "sidecars" to python3.12 | 16:29 |
clarkb | now that the board meeting is over I should eat something too | 16:30 |
fungi | i've approved it | 16:35 |
clarkb | it should merge shortly. Will probably end up behind the hourly jobs | 18:02 |
opendevreview | Merged opendev/system-config master: Start using python3.12 https://review.opendev.org/c/opendev/system-config/+/940928 | 18:03 |
fungi | there we go | 18:04 |
fungi | and yes | 18:04 |
clarkb | oh those image updates don't end up triggering jobs for gitea-lb, zuul-lb, or zookeeper | 18:05 |
clarkb | I guess I can write a change to fix that | 18:05 |
fungi | good point | 18:05 |
opendevreview | Clark Boylan proposed opendev/system-config master: Trigger related jobs when statsd images update https://review.opendev.org/c/opendev/system-config/+/944063 | 18:10 |
clarkb | something like that should do it I think | 18:10 |
tonyb | Any ideas how to debug: ERROR: failed to solve: docker.io/opendevorg/python-builder:3.11-bookworm: failed to resolve source metadata for docker.io/opendevorg/python-builder:3.11-bookworm: failed to copy: httpReadSeeker: failed open: content at https://zuul-jobs.buildset-registry:5000/v2/opendevorg/python-builder/manifests/sha256:9dd6363ddd47c9093f0a14127cf73612b7b7e7ef39db50ab9b7e617d5b1a8e15?ns=docker.io not found: not found | 18:36 |
tonyb | from: https://zuul.opendev.org/t/zuul/build/f72e1c199d8e44fe9f0e944be74453a0/log/job-output.txt?severity=0#1308 | 18:36 |
clarkb | tonyb: I suspect that is the buildset registry getting hit by the docker rate limit | 18:37 |
clarkb | if you look in the buildset registry job logs for the registry itself you may be able to confirm | 18:37 |
tonyb | clarkb: Thanks. I forgot to look there. It just has 'Not found' and returns a 404 | 18:42 |
clarkb | tonyb: oh right I'm remembering now. we theorize that what happens is the buildset registry says 404 I don't have that image. Then docker falls back to talking to docker.io directly and gets the rate limit error. but when it reports the errors it only reports the first of the two errors it received | 18:43 |
clarkb | we haven't confirmed that via code review or profiling but it seems to match up with infrequent occurrences due to hitting rate limits | 18:44 |
tonyb | Hmm okay, I'll see what I can find. | 18:45 |
clarkb | essentially we think this is a failure of docker to report errors sanely, and we're getting the error that isn't really an error masking the actual problem (which we think is likely the rate limit throttling) | 18:45 |
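When chasing the rate limit theory, one check that can be run from the affected host is Docker Hub's documented rate-limit headers (a sketch; assumes curl and jq are available):

```shell
# Fetch an anonymous pull token for the rate-limit preview repo, then read the
# ratelimit-limit / ratelimit-remaining headers Docker Hub returns on a HEAD request.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer ${TOKEN}" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
```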
clarkb | as a side note: I think nodepool can switch over to using the mirrored python base images | 18:45 |
clarkb | and sidestep the whole issue because quay should be less problematic | 18:46 |
tonyb | Yeah that all makes sense. | 18:46 |
tonyb | It'd be nice if the ContainerFile "FROM" could support a list like FROM [docker.io/..., quay.io/...] AS Builder but that's a pipe dream and probably wouldn't work because of SHASUMs or something else I'm not considering | 18:48 |
clarkb | and docker just generally trying to lock down their walled garden | 18:48 |
tonyb | Yeah | 18:48 |
corvus | re zuul launcher and rax flex -- i suspect we need new image uploads for the new project; so i've triggered image builds | 19:16 |
clarkb | oh that makes sense since the old tenant/project was cleaned up and its images wouldn't be available in the new tenant/project | 19:17 |
corvus | yeah, and i think zl isn't smart enough to know that changed (since the connection looks the same) | 19:17 |
fungi | i wasn't sure if anything needed to be restarted there | 19:23 |
clarkb | triggering those rebuilds is currently through the api directly or the web ui right? | 19:24 |
fungi | it did at least delete its prior images in the old sjc3 project | 19:24 |
clarkb | (just so everyone else is aware of how to do that should they need to) | 19:24 |
clarkb | the statsd image update respin hit docker rate limits. I have rechecked it | 19:37 |
clarkb | I'm going to pop out on a bike ride soon to get out before the rain arrives. But I'll be back and can help shepherd that in (it is, I think, a decent candidate to land a system-config change and see that deploy is happy with parallel jobs in that pipeline) | 19:38 |
tonyb | clarkb: FWIW I added a -1 and asked a question on https://review.opendev.org/c/opendev/system-config/+/943999 ... feel free to ignore if I'm wrong | 19:48 |
tonyb | clarkb: Also enjoy your ride | 19:49 |
clarkb | tonyb: oh good catch I think you are right looking at the old side | 19:49 |
tonyb | huzzah | 19:51 |
clarkb | ianw: ^ fyi that should be fixed. We can get to it if you prefer too, but wanted to give you the opportunity to weigh in on the comments as a whole | 19:54 |
tonyb | What's needed to +A 943216: Add option to force docker.io addresses to IPv4 | https://review.opendev.org/c/opendev/system-config/+/943216 ? | 21:02 |
fungi | probably just someone needs to be around to spot issues so we can emergency revert it | 21:03 |
fungi | i'm happy to go ahead and approve it now | 21:03 |
tonyb | fungi: Thanks, Assuming you're also happy to keep an eye on things. If not I'll be around tomorrow | 21:06 |
fungi | yeah, i can | 21:06 |
fungi | approved now | 21:06 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags and QEMU version https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 21:13 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags and QEMU version https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 21:16 |
Clark[m] | tonyb: fungi: and we want to check prod remains unaffected as expected when that lands | 22:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 22:17 |
ianw | tonyb: thanks for checking that. now i'm worried about the testing :) | 22:18 |
clarkb | ianw: I don't think that playbook is tested at all :( | 22:18 |
clarkb | that change will be a good one to exercise infra-prod parallel stuff and may actually cause the load balancer statsd stuff ot update too | 22:19 |
ianw | i'm actually wavering on it now looking at the testing we do do | 22:20 |
ianw | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c1c/943999/2/check/system-config-run-base/c1c2ea4/bridge99.opendev.org/ansible/ | 22:20 |
ianw | one thing that "ansible.log" has that the stdout capture doesn't have is timestamps | 22:20 |
clarkb | that's a good call out. That said we do put the header in the file so we have rough timing. But also on the remote side syslog records those actions too | 22:21 |
clarkb | so its not impossible to find precise timing but is definitely more annoying | 22:21 |
ianw | if we set ANSIBLE_LOG_PATH to /var/log/ansible/<service>.yaml.log (and get the timestamps) what do we do with the stdout output? | 22:22 |
ianw | it then becomes redundant | 22:22 |
clarkb | if they are equivalent we probably want to redirect stdout to /dev/null? | 22:22 |
clarkb | just to avoid future confusion over all this when we wonder why output is going to two places | 22:23 |
ianw | then i worry "does ansible put out anything on stdout that might not be in the .log file it writes"? | 22:23 |
ianw | :) | 22:23 |
clarkb | maybe someone from the ansible world can answer that question for us. Is bcoca still around? | 22:24 |
ianw | we have stdout_callback=debug which is why i think we get more info coming out of stdout | 22:24 |
clarkb | fwiw I never realized we had /var/log/ansible.log recording anything and always relied on the redirected stdout file content and never had an issue with it | 22:26 |
clarkb | or at least not one that I believe the log output would've addressed | 22:26 |
clarkb | so if we want to stick with it due to expected increase in verbosity I think that is fine | 22:26 |
opendevreview | Merged opendev/system-config master: Add option to force docker.io addresses to IPv4 https://review.opendev.org/c/opendev/system-config/+/943216 | 22:34 |
ianw | yeah, if we add a '<service>.yaml.stdout.log' it requires quite a lot of extra stuff in the post-production playbooks | 22:34 |
clarkb | confirmed that 943216 just enqueued jobs that should update the statsd containers for us | 22:35 |
clarkb | that's good, two birds with one stone here | 22:35 |
clarkb | gitea's statsd has restarted | 22:37 |
clarkb | oh zuul-db ran and finished not zuul-lb. That is next so zuul's statsd should update shortly | 22:37 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc shows a brief blip in stats but we are getting data again so I'm happy with that for now | 22:38 |
clarkb | and checking /etc/hosts on gitea-lb02 and zuul-lb02 it looks untouched as expected so thats great. Thank you tonyb for that update. I think that should make our CI jobs a lot more reliable when interacting with docker hub | 22:39 |
clarkb | https://grafana.opendev.org/d/39b50608de/zuul-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc zuul-lb stats lgtm too. Short blip and then back to normal | 22:39 |
clarkb | and as an exercise of infra-prod jobs runnign in parallel in the deploy pipeline this is looking great | 22:40 |
clarkb | the gitea job failed because gitea09 returned a 500 error trying to check if the gerrit user is present | 22:43 |
clarkb | loading /opendev/system-config also produces a 500 error for me | 22:44 |
clarkb | [E] GetUserByName: Error 1040 (08004): Too many connections | 22:45 |
clarkb | I think that is the database complaining about too many connections but I'm still trying to run it down | 22:45 |
clarkb | yes mariadb log confirms it is closing connections prior to auth completing due to too many connections | 22:46 |
clarkb | it does look like gitea has a lot of connections open | 22:48 |
clarkb | we have local config to bump the connection limit up to 200 | 22:49 |
clarkb | currently only gitea09 seems to be in this state. I'm going to manually shut it and its db down then start it up again so that gerrit replication doesn't fall too far behind | 22:50 |
clarkb | but then I'll follow that up with a change to increase the connection limit | 22:50 |
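For context, a sketch of how the connection ceiling can be inspected and bumped on the running container; the container name and the new value are assumptions, and the persistent fix belongs in the managed server config rather than a live SET:

```shell
# Check how close the gitea mariadb container is to its configured ceiling
# (container name assumed for illustration):
docker exec mariadb mysql -u root -p -e \
  "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"

# Temporarily raise the ceiling on the running server; this is lost on restart,
# so the real change still needs to land in the deployed configuration.
docker exec mariadb mysql -u root -p -e "SET GLOBAL max_connections = 300;"
```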
fungi | yeah, the deploy run lgtm | 22:51 |
fungi | i'll keep an eye out for any more system-config job failures that look like dockerhub rate limit issues | 22:53 |
fungi | hopefully now there will be fewer | 22:53 |
clarkb | gitea09 immediately reentered the same state | 22:55 |
clarkb | looking at access logs it looks like things may be hitting :3000 directly now | 22:55 |
clarkb | so are bypassing the apache filters | 22:55 |
clarkb | though I'm not sure if the filters would've been effective for the set of requests | 22:55 |
clarkb | I half suspect that we're getting a mariadb connection per request | 22:56 |
fungi | huh | 22:56 |
clarkb | not sure what the best option is here. Can shut gitea09 services down. Then maybe bump mariadb connection limits and block port 3000 direct access? | 22:56 |
fungi | this is random clients hitting 3000/tcp? | 22:57 |
clarkb | whois says it is alibaba cloud ips but yes | 22:57 |
fungi | but yeah, we could definitely just limit it to listening on the loopbck | 22:57 |
fungi | loopback | 22:57 |
clarkb | and they don't seem to be taking the 500 error as a clue to go away | 22:58 |
clarkb | it is almost certainly an AI crawler bot as it appears to be going file by file and commit by commit through everything | 22:58 |
clarkb | I've just double checked and haproxy is using the apache ports so we can safely block :3000 from the world | 22:59 |
clarkb | lets start there before we worry about mariadb connection limits | 23:00 |
fungi | agreed | 23:00 |
fungi | the only reason we have to leave 3000 open is for bypassing/ruling out apache issues, but we could also limit access to it with iptables and allow haproxy to reach it if we need that for some reason | 23:01 |
clarkb | we also use port 3000 for management but I think we do that from localhost so this should actually improve things for us as we get direct access for management | 23:01 |
clarkb | and then everyone else has to go through the proxy | 23:01 |
fungi | oh, right, authenticated admin access, but yeah ssh port forward wfm | 23:02 |
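For reference, the port forward for the authenticated admin UI would look roughly like this (host name used as an example):

```shell
# Forward local port 3000 to gitea09's now-firewalled web port, then browse to
# http://localhost:3000 for the admin interface.
ssh -L 3000:localhost:3000 gitea09.opendev.org
```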
clarkb | the LE certs specify a port of :3000 for the host specific name | 23:02 |
clarkb | fungi: in this case its ansible doing management stuff and it just runs against localhost via ansible which is kinda like a port forward | 23:02 |
fungi | oh, that too | 23:02 |
clarkb | I don't think the :3000 in the certs is a big deal though as my browser doesn't complain when I use :3081 | 23:02 |
clarkb | but I'm not sure | 23:02 |
fungi | i don't think browsers care, no | 23:02 |
ianw | (just confirming that ANSIBLE_LOG_PATH=file ANSIBLE_STDOUT_CALLBACK=... means that the log file gets the output of the selected stdout callback. or to say that another way, the log file captures the ansible stdout. it's not like we can have dense output on the command line but have it logging debug-level info in the background) | 23:03 |
fungi | i use le issued https certs for smtp, imap, pop3 and irc on some servers without any trouble too, for that matter | 23:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop public port 3000 access for Gitea https://review.opendev.org/c/opendev/system-config/+/944081 | 23:04 |
clarkb | I'm thinking maybe we manually apply ^ to gitea09 and then if there are no problems with that by tomorrow morning land 944081? | 23:05 |
fungi | yeah, that's what i was just looking at as the easiest option, no need to restart anything | 23:06 |
clarkb | iptables -I openstack-INPUT -p tcp --dport 3000 -j DROP ? | 23:07 |
clarkb | and ip6tables | 23:08 |
clarkb | need to use -I to get it ahead of our accept rule | 23:08 |
clarkb | -A would put it behind and it wouldn't take effect | 23:08 |
clarkb | I'm going to do that | 23:10 |
fungi | yeah | 23:10 |
clarkb | the access log went quiet and load is dropping. I still don't get a useful response from the service yet | 23:12 |
clarkb | I may try restarting things again if that persists | 23:12 |
clarkb | I think my rule is actually problematic; you can't hit :3000 via localhost anymore either with that rule? | 23:14 |
clarkb | ya source is 0.0.0.0/0 | 23:14 |
clarkb | it needs to go in rule slot 2 or 3 depending on whether or not iptables 0 indexes | 23:15 |
clarkb | give me a minute and I'll delete that rule and apply it after the localhost accept rules | 23:15 |
clarkb | that seems to be happier now | 23:18 |
clarkb | but system-config also appears to be behind as anticipated | 23:19 |
clarkb | I'm going to trigger gerrit replication to gitea09 now | 23:19 |
clarkb | fungi: can you double check the iptables rules on gitea09 look correct to you? | 23:20 |
clarkb | I ended up doing -I openstack-INPUT 5 that rule from above then -D openstack-INPUT 1 with both iptables and ip6tables | 23:20 |
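Putting that sequence together (chain name and commands as described above; the rule positions depend on the existing ruleset):

```shell
# First attempt: with no rule number, -I inserts at the top of the chain, ahead
# of the loopback ACCEPT rules, so localhost access to :3000 broke too.
iptables -I openstack-INPUT -p tcp --dport 3000 -j DROP

# Fix: insert the DROP after the loopback/established ACCEPT rules (position 5
# in this ruleset), then delete the too-early copy sitting at position 1.
iptables -I openstack-INPUT 5 -p tcp --dport 3000 -j DROP
iptables -D openstack-INPUT 1

# The same pair of commands is repeated with ip6tables for the v6 ruleset.
```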
fungi | looking | 23:22 |
clarkb | alibaba's abuse reporting portal doesn't seem to have an option for "someone is being bad but maybe not strictly illegal" | 23:24 |
clarkb | replication is in progress against gitea09 | 23:24 |
fungi | yeah, that looks fine, though note that the actual desired state from 944081 is going to be deleting the existing port 3000 allow rules rather than a separate block rule | 23:24 |
clarkb | fungi: yup I guess I could've just -D'd that specific rule | 23:25 |
clarkb | but I think this is fine. we should update the config and then it will update iptables in place and the new setup should be roughly equivalent? | 23:25 |
fungi | anyway, my guess is that a crawler stumbled across a http://gitea09...:3000 url mentioned in our irc channel logs and just kept spidering from there | 23:25 |
clarkb | or some public cert registry | 23:26 |
clarkb | a followup to this probably wants to edit our LE certs to drop the :3000 | 23:26 |
fungi | yes, i think this is a sufficient test | 23:26 |
fungi | and agreed, the port specification is unnecessary for the certs, i expect | 23:26 |
clarkb | these crawlers are such a nuisance though. Look at robots.txt respect the crawl delay. If you get massive quantities of 500 errors maybe you should look at what you are doing etc | 23:27 |
clarkb | replication is almost half done | 23:29 |
clarkb | fungi: 944081 is also a good sanity check that blocking external port :3000 doesn't break our automation. I don't think it will but good to double check | 23:30 |
fungi | yep | 23:31 |
fungi | as for irc logs being a likely entrypoint for the crawlers, we frequently test gitea09 and so mention it a lot more often. doesn't seem like they're hitting any of the other backends | 23:32 |
clarkb | and for :3081 I suspect apache would melt down before gitea did. Not ideal but at least our automation would keep working | 23:33 |
clarkb | the go webserver built into gitea is just too good at accepting all the connections | 23:33 |
clarkb | more than halfway done now. About 1k tasks remaining in the gerrit queue | 23:34 |
clarkb | if anyone wants to look at the historical logs for this /var/gitea/logs/access.log on gitea09 | 23:35 |
clarkb | 'HTTP/1.1" 500' is a search string that should work | 23:35 |
clarkb | system-config master is up to date on gitea09 now | 23:38 |
clarkb | but still about 500 tasks remaining | 23:38 |
clarkb | it is replicating one nova change meta ref. Everything else has replicated | 23:42 |
clarkb | and now that is done too | 23:42 |
clarkb | so ya I think once we are satisfied blocking port 3000 isn't a problem we apply that globally and move on. Gitea09 should be good for now | 23:43 |
fungi | any idea if the user agent(s) for the offenders were already in our filter list? | 23:45 |
clarkb | no, but that is a good thing to check; doing so now | 23:47 |
clarkb | fungi: it is in the filter list | 23:49 |
clarkb | so we've probably discovered this one before and blocked it at the apache level, then they discovered a backdoor | 23:49 |
clarkb | I'm being asked questions about dinner now. I think we're stable and can follow up in the morning with the port block and anything else we decide we need to do | 23:51 |
opendevreview | Ian Wienand proposed opendev/system-config master: install-root-key : run on localhost https://review.opendev.org/c/opendev/system-config/+/944084 | 23:58 |
ianw | ^ one for tomorrow :) | 23:59 |