ianw | i can check 02:00 :) | 00:00 |
---|---|---|
clarkb | the hourly service-nodepool failed and I suspect it is due to a full disk on nb04 | 00:17 |
clarkb | ya /opt is full again arg. I'm going to start a screen on that host, stop the service, then start a long running rm for those files | 00:17 |
clarkb | ok that is in progress on nb04 (just want to avoid any noise in the signal later if possible) | 00:20 |
Clark[m] | I think the hourly jobs enqueued ahead of the daily jobs | 02:01 |
Clark[m] | Hrm I don't see system config in periodic at all | 02:06 |
Clark[m] | Oh wow there they are. A whole flood of projects after the initial 15 | 02:07 |
Clark[m] | Bootstrap bridge is starting for the daily jobs now | 02:14 |
Clark[m] | Looking ok from the status page so far | 02:29 |
ianw | still all progressing in twos ... bridge utilisation seems sane | 02:56 |
ianw | i guess parallel operation makes `/var/log/ansible.log` fairly useless as it all gets mixed up | 02:59 |
ianw | perhaps instead of > output of runs we should set ANSIBLE_LOG_PATH? | 03:01 |
Clark[m] | I thought we output to playbook specific log files | 03:01 |
ianw | we do but with a >> | 03:02 |
Clark[m] | https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-playbook.yaml#L21 | 03:03 |
Clark[m] | Ya I guess I'm not understanding where the conflict is if it's playbook name specific? | 03:03 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 03:06 |
ianw | oh just because it _also_ logs to /var/log/ansible.log via the global config. I think that if we set ANSIBLE_LOG_PATH it will override that, so we don't have all the prod playbooks writing to the same file | 03:07 |
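To illustrate the idea under discussion, here is a minimal shell sketch of how ANSIBLE_LOG_PATH would steer Ansible's own logging to a per-playbook file instead of the shared /var/log/ansible.log from ansible.cfg; the wrapper, playbook name, and paths are illustrative, not the actual run-production-playbook.yaml contents:

```shell
# Illustrative only -- not the real production playbook task.
PLAYBOOK=service-nodepool.yaml

# Current approach: stdout/stderr appended to a per-playbook file, while the
# log_path in ansible.cfg still interleaves everything into /var/log/ansible.log.
ansible-playbook "playbooks/${PLAYBOOK}" >> "/var/log/ansible/${PLAYBOOK}.log" 2>&1

# Proposed approach: point Ansible's own logger at the per-playbook file, which
# overrides the global log_path so nothing lands in the shared ansible.log.
ANSIBLE_LOG_PATH="/var/log/ansible/${PLAYBOOK}.log" ansible-playbook "playbooks/${PLAYBOOK}"
```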
Clark[m] | Oh does it log to a global file too? One thing we would need to check with that change is if it is safe to do so for jobs whose logs get published publicly | 03:07 |
ianw | haha i was about to say, now i think about it, we want to make sure we put anything that comes out into a local file too :) | 03:08 |
Clark[m] | Since Ansible log output may not be the same as stdout/stderr | 03:08 |
Clark[m] | So probably need to understand any potential behavior differences between the two first? | 03:09 |
ianw | i guess the best thing would be to >> to a .stdout.log | 03:09 |
ianw | but then we have another log file to deal with on the encryption path | 03:09 |
ianw | (not that i think anyone else's key but mine is in there for that :) | 03:10 |
Clark[m] | Ya and regular capture path for those that do it | 03:10 |
ianw | i wonder if ANSIBLE_LOG_PATH _and_ >> to the same file works ok | 03:10 |
Clark[m] | Since we won't want to overexpose to start. One option may be to do both like you say then disable public publishing for everything. Then recheck the two files for anything that published before and add it back in | 03:10 |
ianw | the ordering may be completely out, but perhaps it doesn't matter that much | 03:10 |
Clark[m] | If things look safe | 03:10 |
Clark[m] | What is in the Ansible log file? Is it different? | 03:11 |
Clark[m] | Sorry I'm not currently able to check that easily but can if it becomes urgent | 03:11 |
Clark[m] | Going back to general behavior here I think this is continuing to look good | 03:12 |
ianw | it looks to me that what is in ansible.log as written out by ansible is the same as what is captured into each <service>.yaml.log | 03:14 |
ianw | i don't think this is a problem, as such ... just that the ansible.log file ends up as a rather messy, ever-growing jumble of interleaved output | 03:15 |
Clark[m] | Got it. Less about ensuring correct behavior for the remote node config management and more about making debugging more straightforward | 03:15 |
Clark[m] | https://zuul.opendev.org/t/openstack/buildset/1c24d0f003e9427ea84e393f48120397 success! | 03:19 |
Clark[m] | ianw: we use >> because we start the file with a header | 03:27 |
Clark[m] | I suspect that is still fine with the proposed change though, as it sounds like Ansible should append too | 03:27 |
Clark[m] | So probably the main thing to check is just that the content isn't more dangerous | 03:28 |
ianw | yeah i don't think it's right because it will echo the output. have to think about it :/ | 03:43 |
Clark[m] | It being the change to use the log env var? | 03:47 |
ianw | yep | 04:47 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 04:59 |
ianw | ^ thought two - keep the log file capture as we have now, but burn ansible's own logging to /dev/null for Zuul runs. but leave the default there so that if you run by hand, you still get logs in /var/log/ansible.log. this is predicated on the idea that "log_path" in Ansible doesn't capture anything that stdout/stderr won't ... which i think is correct | 05:02 |
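A rough sketch of that second approach, with illustrative paths (the real change is in 943999):

```shell
PLAYBOOK=service-nodepool.yaml  # illustrative playbook name

# Zuul-driven runs: keep the existing stdout/stderr capture, but send Ansible's
# own log_path output to /dev/null so nothing hits the shared /var/log/ansible.log.
ANSIBLE_LOG_PATH=/dev/null ansible-playbook "playbooks/${PLAYBOOK}" \
  >> "/var/log/ansible/${PLAYBOOK}.log" 2>&1

# Manual runs: leave ANSIBLE_LOG_PATH unset so the log_path default in ansible.cfg
# still records to /var/log/ansible.log as it does today.
ansible-playbook "playbooks/${PLAYBOOK}"
```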
*** mrunge_ is now known as mrunge | 06:35 | |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 07:16 |
*** jroll02 is now known as jroll0 | 08:45 | |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 09:23 |
opendevreview | Karolina Kula proposed openstack/diskimage-builder master: WIP: Add support for CentOS Stream 10 https://review.opendev.org/c/openstack/diskimage-builder/+/934045 | 10:35 |
opendevreview | Jeremy Stanley proposed opendev/bindep master: Drop auxiliary requirements files https://review.opendev.org/c/opendev/bindep/+/940711 | 14:38 |
fungi | slight efficiency improvement in the latest revision of ^ | 14:39 |
tkajinam | hmm I just noticed a bit strange behavior in https://review.opendev.org/c/openstack/puppet-horizon/+/943232 | 14:47 |
clarkb | looking at periodic buildset runtimes for system-config typical runtime was 2 to 2.5 hours and last nights was 1 hour and 8 minutes | 14:47 |
tkajinam | a strange behavior of gerrit, I mean | 14:47 |
clarkb | a definite measurable improvement | 14:47 |
clarkb | tkajinam: can you be more specific? what do you observe that is strange? | 14:47 |
tkajinam | if you look at my latest comment it is duplicated. | 14:48 |
tkajinam | hmmm it might be a problem caused by something on my end. ignore it for now | 14:48 |
clarkb | I think I've seen similar with gertty users | 14:49 |
clarkb | not sure if you use gertty | 14:49 |
tkajinam | no I posted it from web interface | 14:49 |
fungi | i've accidentally done it with gertty's experimental inline comment threads patch by selecting the reply button on a comment after composing a reply to it already | 14:50 |
clarkb | https://review.opendev.org/c/opendev/bindep/+/940711/ fungi's patchset 5 is the example I'm thinking of | 14:51 |
fungi | yeah, that's where i saw i'd done it | 14:51 |
tkajinam | ok | 14:51 |
fungi | but since it's all just rest api calls back to the gerrit server, i expect the webclient might retry to post a comment if it received an error or disconnect, and if the server had correctly processed the comment anyway then maybe you end up with two | 14:52 |
tkajinam | yeah that's possible | 14:52 |
tkajinam | I'll come back here in case I observe the same behavior frequently. | 14:52 |
tkajinam | sorry for the noise ! | 14:53 |
clarkb | tkajinam: thank you for the heads up and ya I think this is probably fine unless it becomes persistent. If that happens definitely let us know | 14:53 |
fungi | please do, maybe we can correlate them and figure out the commonalities | 14:53 |
fungi | definitely not noise | 14:53 |
tkajinam | :-) | 14:54 |
tkajinam | I've never seen Alex using gertty so I was wondering why his comment in patchset 2 is the legacy style (a non-inline comment, I mean). | 14:55 |
tkajinam | that's why I suspected something strange might be happening in comment feature. | 14:55 |
clarkb | that is also another hallmark of gertty usage. But maybe there is another client or using the ssh review feature? | 14:56 |
fungi | gertty currently doesn't support the newer thread-style comments (there's an experimental change to add support but it's still incomplete), so most gertty users end up leaving legacy comments | 14:56 |
tkajinam | yeah | 14:56 |
clarkb | I think if you do `ssh -p 29418 user@review.opendev.org gerrit review -m "message here"` you also get that behavior | 14:57 |
tkajinam | I'll talk with him to know how it was posted. | 14:57 |
clarkb | you have to use the --json input to get the more modern inline commenting stuff (which the modern top level comment is a special case of) | 14:57 |
tkajinam | ah. ok | 14:59 |
tkajinam | there are still a lot of mysteries in gerrit :-P | 14:59 |
clarkb | ya there is a special meta file name that, if you comment on it, becomes a top level comment. That is the new behavior. The old behavior is you set a comment on the patchset and don't specify a file | 15:01 |
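To make the distinction concrete, a hedged example of both forms; the --json payload shape is an assumption based on Gerrit's ReviewInput format rather than something confirmed above:

```shell
# Legacy-style comment attached to the patchset, with no file context:
ssh -p 29418 user@review.opendev.org gerrit review -m "message here" 12345,1  # change,patchset (example values)

# Modern inline/top-level comments go through --json, which reads review input
# from stdin (field names here are assumed from the REST ReviewInput format):
echo '{"message": "top level comment", "comments": {"path/to/file": [{"line": 10, "message": "inline comment"}]}}' |
  ssh -p 29418 user@review.opendev.org gerrit review --json 12345,1
```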
fungi | i'd say there's less mystery in gerrit than in proprietary code review platforms without available source code ;) | 15:02 |
clarkb | fungi: can you check my comment on https://review.opendev.org/c/opendev/system-config/+/943999 as ianw won't be awake this time of day? I'm just trying to reason about the safety of capturing stderr like that for jobs that publish public logs | 15:04 |
clarkb | I'm really happy the periodic buildset ran in just over half the time it took previously | 15:06 |
clarkb | even if we don't increase the semaphore limit that is a great roi and I think we can safely bump the limit up too | 15:07 |
fungi | answered | 15:08 |
clarkb | thanks! Wanted to make sure my DST jet-lagged brain is keeping up | 15:09 |
clarkb | jamesdenton: wanted to let you know that the new dfw3 region seems to be working well. We've also managed to switch sjc3 over to the new tenant/project that matches what is in dfw3. Thanks for the help getting that done. | 15:13 |
clarkb | jamesdenton: also did you know that nova ssh keys are not project/tenant specific (we learned that the hard way when we deleted them using the old project/tenant in sjc3) | 15:13 |
jamesdenton | Glad to hear about DFW3! But if you could elaborate a little more on the nova keys... | 15:14 |
fungi | we had some keypairs defined which we'd been using with the old project in sjc3 | 15:15 |
fungi | we didn't realize that they were the same objects being used for the new project in sjc3 under the same account | 15:15 |
fungi | i deleted the keypairs from the account thinking they were only being deleted from the old project, but it of course led to them being unavailable in the new project as well | 15:15 |
clarkb | and likely to be some ancient nova thing that we're just going to have to live with for backward compatibility reasons | 15:16 |
clarkb | so nothing for you to change/address. Just an interesting behavior we discovered the hard way | 15:16 |
fungi | i didn't think to check that they were still there for the new project, so we were erroring for a little while with nodepool telling nova to use those keypair objects which no longer existed | 15:17 |
frickler | keypairs are a user resource in nova | 15:17 |
fungi | right, i learned that the hard way ;) | 15:17 |
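A quick illustration of that scoping with the openstack CLI; the cloud entries and keypair name are made up for the example, but the behavior is the point:

```shell
# Same user credentials scoped to two different projects: the keypair list is
# identical in both, because nova keypairs hang off the user, not the project.
openstack --os-cloud rax-flex-old-project keypair list
openstack --os-cloud rax-flex-new-project keypair list

# Which means a delete issued against the "old" project removes the keypair
# everywhere that user relies on it:
openstack --os-cloud rax-flex-old-project keypair delete example-keypair
```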
clarkb | our config management sorted it out automatically when it ran again to update cloud things so not a big deal either | 15:18 |
fungi | arguably, we're abusing the keypair feature somewhat to do things it's not strictly intended for | 15:18 |
fungi | we could just inject all the ssh keys into our images instead | 15:19 |
clarkb | there are reasons to not do that though. Including people reusing our images | 15:19 |
clarkb | (though we still bake in zuul's key) | 15:19 |
clarkb | so we're only half addressing that problem | 15:19 |
clarkb | jamesdenton: the other thing to note is in each region we have quota sufficient for 50 instances except for the memory limit. We can only fit 32 of our 8GB RAM instances into the memory quota we have currently. I'm not sure what capacity looks like in the new deployments but we're always happy to make use of more quota if possible. I think cloudnull mentioned a third region may come | 15:21 |
clarkb | online which is the other direction we can expand too | 15:21 |
jamesdenton | frickler thanks for the clarification on that! | 15:25 |
jamesdenton | clarkb we can help with the quota, i think. | 15:25 |
opendevreview | Karolina Kula proposed opendev/glean master: WIP: Add support for CentOS 10 keyfiles https://review.opendev.org/c/opendev/glean/+/941672 | 15:27 |
clarkb | fungi: I posted some additional followup to https://review.opendev.org/c/opendev/system-config/+/943999 with some further investigating and info gathering if you are curious | 15:46 |
clarkb | tl;dr is yes stderr is going to zuul and can be viewed. Seems to be innocuous | 15:47 |
fungi | cool, thanks for confirming. that matches what i expected | 15:48 |
clarkb | I didn't approve the change because I'm still not sure I trust my early morning brain and figure we can wait for ianw to review the comments before we land it | 15:49 |
fungi | yeah, that one's not at all urgent | 15:49 |
clarkb | should we proceed wtih https://review.opendev.org/c/opendev/system-config/+/940928 to exercise parallel infra-prod more with merging changes? | 16:29 |
clarkb | that switches the haproxy and zookeeper statsd container "sidecars" to python3.12 | 16:29 |
clarkb | now that the board meeting is over I should eat something too | 16:30 |
fungi | i've approved it | 16:35 |
clarkb | it should merge shortly. Will probably end up behind the hourly jobs | 18:02 |
opendevreview | Merged opendev/system-config master: Start using python3.12 https://review.opendev.org/c/opendev/system-config/+/940928 | 18:03 |
fungi | there we go | 18:04 |
fungi | and yes | 18:04 |
clarkb | oh those image updates don't end up triggering jobs for gitea-lb, zuul-lb, or zookeeper | 18:05 |
clarkb | I guess I can write a change to fix that | 18:05 |
fungi | good point | 18:05 |
opendevreview | Clark Boylan proposed opendev/system-config master: Trigger related jobs when statsd images update https://review.opendev.org/c/opendev/system-config/+/944063 | 18:10 |
clarkb | something like that should do it I think | 18:10 |
tonyb | Any ideas how to debug: ERROR: failed to solve: docker.io/opendevorg/python-builder:3.11-bookworm: failed to resolve source metadata for docker.io/opendevorg/python-builder:3.11-bookworm: failed to copy: httpReadSeeker: failed open: content at https://zuul-jobs.buildset-registry:5000/v2/opendevorg/python-builder/manifests/sha256:9dd6363ddd47c9093f0a14127cf73612b7b7e7ef39db50ab9b7e617d5b1a8e15?ns=docker.io not found: not found | 18:36 |
tonyb | from: https://zuul.opendev.org/t/zuul/build/f72e1c199d8e44fe9f0e944be74453a0/log/job-output.txt?severity=0#1308 | 18:36 |
clarkb | tonyb: I suspect that is the buildset registry getting hit by the docker rate limit | 18:37 |
clarkb | if you look in the buildset registry job logs for the registry itself you may be able to confirm | 18:37 |
tonyb | clarkb: Thanks. I forgot to look there. It just has 'Not found' and returns a 404 | 18:42 |
clarkb | tonyb: oh right I'm remembering now. we theorize that what happens is the buildset registry says 404 I don't have that image. Then docker falls back to talking to docker.io directly and gets the rate limit error. but when it reports the errors it only reports the first of the two errors it received | 18:43 |
clarkb | we haven't confirmed that via code review or profiling but it seems to match up with infrequent occurrences due to hitting rate limits | 18:44 |
tonyb | Hmm okay, I'll see what I can find. | 18:45 |
clarkb | essentially we think this is a failure of docker to report errors sanely, and we're getting the error that isn't really an error masking the actual problem (which we think is likely the rate limit throttling) | 18:45 |
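When chasing the rate limit theory, one check that can be run from the affected host is Docker Hub's documented rate-limit headers (a sketch; assumes curl and jq are available):

```shell
# Fetch an anonymous pull token for the rate-limit preview repo, then read the
# ratelimit-limit / ratelimit-remaining headers Docker Hub returns on a HEAD request.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer ${TOKEN}" \
  https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest | grep -i ratelimit
```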
clarkb | as a side note: I think nodepool can switch over to using the mirrored python base images | 18:45 |
clarkb | and sidestep the whole issue because quay should be less problematic | 18:46 |
tonyb | Yeah that all makes sense. | 18:46 |
tonyb | It'd be nice if the ContainerFile "FROM" could support a list like FROM [docker.io/..., quay.io/...] AS Builder but that's a pipe dream and probably wouldn't work because of SHASUMs or something else I'm not considering | 18:48 |
clarkb | and docker just generally trying to lock down their walled garden | 18:48 |
tonyb | Yeah | 18:48 |
corvus | re zuul launcher and rax flex -- i suspect we need new image uploads for the new project; so i've triggered image builds | 19:16 |
clarkb | oh that makes sense since the old tenant/project was cleaned up and its images wouldn't be available in the new tenant/project | 19:17 |
corvus | yeah, and i think zl isn't smart enough to know that changed (since the connection looks the same) | 19:17 |
fungi | i wasn't sure if anything needed to be restarted there | 19:23 |
clarkb | triggering those rebuilds is currently through the api directly or the web ui right? | 19:24 |
fungi | it did at least delete its prior images in the old sjc3 project | 19:24 |
clarkb | (just so everyone else is aware of how to do that should they need to) | 19:24 |
clarkb | the statsd image update respin hit docker rate limits. I have rechecked it | 19:37 |
clarkb | I'm going to pop out on a bike ride soon to get out before the rain arrives. But I'll be back and can help shepherd that in (it is, I think, a decent candidate to land a system-config change and see that deploy is happy with parallel jobs in that pipeline) | 19:38 |
tonyb | clarkb: FWIW I added a -1 and asked a question on https://review.opendev.org/c/opendev/system-config/+/943999 ... feel free to ignore if I'm wrong | 19:48 |
tonyb | clarkb: Also enjoy your ride | 19:49 |
clarkb | tonyb: oh good catch I think you are right looking at the old side | 19:49 |
tonyb | huzzah | 19:51 |
clarkb | ianw: ^ fyi that should be fixed. We can get to it if you prefer too, but wanted to give you the opportunity to weigh in on the comments as a whole | 19:54 |
tonyb | What's needed to +A 943216: Add option to force docker.io addresses to IPv4 | https://review.opendev.org/c/opendev/system-config/+/943216 ? | 21:02 |
fungi | probably just someone needs to be around to spot issues so we can emergency revert it | 21:03 |
fungi | i'm happy to go ahead and approve it now | 21:03 |
tonyb | fungi: Thanks, Assuming you're also happy to keep an eye on things. If not I'll be around tomorrow | 21:06 |
fungi | yeah, i can | 21:06 |
fungi | approved now | 21:06 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags and QEMU version https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 21:13 |
opendevreview | Tony Breeds proposed openstack/diskimage-builder master: Add a tool for displaying CPU flags and QEMU version https://review.opendev.org/c/openstack/diskimage-builder/+/937836 | 21:16 |
Clark[m] | tonyb: fungi: and we want to check prod remains unaffected as expected when that lands | 22:15 |
opendevreview | Ian Wienand proposed opendev/system-config master: run-production-playbook: redirect via ansible logger https://review.opendev.org/c/opendev/system-config/+/943999 | 22:17 |
ianw | tonyb: thanks for checking that. now i'm worried about the testing :) | 22:18 |
clarkb | ianw: I don't think that playbook is tested at all :( | 22:18 |
clarkb | that change will be a good one to exercise infra-prod parallel stuff and may actually cause the load balancer statsd stuff ot update too | 22:19 |
ianw | i'm actually wavering on it now looking at the testing we do do | 22:20 |
ianw | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_c1c/943999/2/check/system-config-run-base/c1c2ea4/bridge99.opendev.org/ansible/ | 22:20 |
ianw | one thing that "ansible.log" has that the stdout capture doesn't have is timestamps | 22:20 |
clarkb | that's a good call out. That said we do put the header in the file so we have rough timing. But also on the remote side syslog records those actions too | 22:21 |
clarkb | so its not impossible to find precise timing but is definitely more annoying | 22:21 |
ianw | if we set ANSIBLE_LOG_PATH to /var/log/ansible/<service>.yaml.log (and get the timestamps) what do we do with the stdout output? | 22:22 |
ianw | it then becomes redundant | 22:22 |
clarkb | if they are equivalent we probably want to redirect stdout to /dev/null? | 22:22 |
clarkb | just to avoid future confusion over all this when we wonder why output is going to two places | 22:23 |
ianw | then i worry "does ansible put out anything on stdout that might not be in the .log file it writes"? | 22:23 |
ianw | :) | 22:23 |
clarkb | maybe someone from the ansible world can answer that question for us. Is bcoca still around? | 22:24 |
ianw | we have stdout_callback=debug which is why i think we get more info coming out of stdout | 22:24 |
clarkb | fwiw I never realized we had /var/log/ansible.log recording anything and always relied on the redirected stdout file content and never had an issue with it | 22:26 |
clarkb | or at least not one that I believe the log output would've addressed | 22:26 |
clarkb | so if we want to stick with it due to expected increase in verbosity I think that is fine | 22:26 |
opendevreview | Merged opendev/system-config master: Add option to force docker.io addresses to IPv4 https://review.opendev.org/c/opendev/system-config/+/943216 | 22:34 |
ianw | yeah, if we add a '<service>.yaml.stdout.log' it requires quite a lot of extra stuff in the post-production playbooks | 22:34 |
clarkb | confirmed that 943216 just enqueued jobs that should update the statsd containers for us | 22:35 |
clarkb | that's good, two birds with one stone here | 22:35 |
clarkb | gitea's statsd has restarted | 22:37 |
clarkb | oh zuul-db ran and finished not zuul-lb. That is next so zuul's statsd should update shortly | 22:37 |
clarkb | https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc shows a brief blip in stats but we are getting data again so I'm happy with that for now | 22:38 |
clarkb | and checking /etc/hosts on gitea-lb02 and zuul-lb02 it looks untouched as expected so thats great. Thank you tonyb for that update. I think that should make our CI jobs a lot more reliable when interacting with docker hub | 22:39 |
clarkb | https://grafana.opendev.org/d/39b50608de/zuul-load-balancer?orgId=1&from=now-5m&to=now&timezone=utc zuul-lb stats lgtm too. Short blip and then back to normal | 22:39 |
clarkb | and as an exercise of infra-prod jobs runnign in parallel in the deploy pipeline this is looking great | 22:40 |
clarkb | the gitea job failed because gitea09 returned a 500 error trying to check if the gerrit user is present | 22:43 |
clarkb | loading /opendev/system-config also produces a 500 error for me | 22:44 |
clarkb | [E] GetUserByName: Error 1040 (08004): Too many connections | 22:45 |
clarkb | I think that is the database complaining about too many connections but I'm still trying to run it down | 22:45 |
clarkb | yes mariadb log confirms it is closing connections prior to auth completing due to too many connections | 22:46 |
clarkb | it does look like gitea has a lot of connections open | 22:48 |
clarkb | we have local config to bump the connection limit up to 200 | 22:49 |
clarkb | currently only gitea09 seems to be in this state. I'm going to manually shut it and its db down then start it up again so that gerrit replication doesn't fall too far behind | 22:50 |
clarkb | but then I'll follow that up with a change to increase the connection limit | 22:50 |
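For context, a sketch of how the connection ceiling can be inspected and bumped on the running container; the container name and the new value are assumptions, and the persistent fix belongs in the managed server config rather than a live SET:

```shell
# Check how close the gitea mariadb container is to its configured ceiling
# (container name assumed for illustration):
docker exec mariadb mysql -u root -p -e \
  "SHOW STATUS LIKE 'Threads_connected'; SHOW VARIABLES LIKE 'max_connections';"

# Temporarily raise the ceiling on the running server; this is lost on restart,
# so the real change still needs to land in the deployed configuration.
docker exec mariadb mysql -u root -p -e "SET GLOBAL max_connections = 300;"
```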
fungi | yeah, the deploy run lgtm | 22:51 |
fungi | i'll keep an eye out for any more system-config job failures that look like dockerhub rate limit issues | 22:53 |
fungi | hopefully now there will be fewer | 22:53 |
clarkb | gitea09 immediately reentered the same state | 22:55 |
clarkb | looking at access logs it looks like things may be hitting :3000 directly now | 22:55 |
clarkb | so are bypassing the apache filters | 22:55 |
clarkb | though I'm not sure if the filters would've been effective for the set of requests | 22:55 |
clarkb | I half suspect that we're getting a mariadb connection per request | 22:56 |
fungi | huh | 22:56 |
clarkb | not sure what the best option is here. Can shut gitea09 services down. Then maybe bump mariadb connection limits and block port 3000 direct access? | 22:56 |
fungi | this is random clients hitting 3000/tcp? | 22:57 |
clarkb | whois says it is alibaba cloud ips but yes | 22:57 |
fungi | but yeah, we could definitely just limit it to listening on the loopbck | 22:57 |
fungi | loopback | 22:57 |
clarkb | and they don't seem to be taking the 500 error as a clue to go away | 22:58 |
clarkb | it is almost certainly an AI crawler bot as it appears to be going file by file and commit by commit through everything | 22:58 |
clarkb | I've just double checked and haproxy is using the apache ports so we can safely block :3000 from the world | 22:59 |
clarkb | lets start there before we worry about mariadb connection limits | 23:00 |
fungi | agreed | 23:00 |
fungi | the only reason we have to leave 3000 open is for bypassing/ruling out apache issues, but we could also limit access to it with iptables and allow haproxy to reach it if we need that for some reason | 23:01 |
clarkb | we also use port 3000 for management but I think we do that from localhost so this should actually improve things for us as we get direct access for management | 23:01 |
clarkb | and then everyone else has to go through the proxy | 23:01 |
fungi | oh, right, authenticated admin access, but yeah ssh port forward wfm | 23:02 |
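For reference, the port forward for the authenticated admin UI would look roughly like this (host name used as an example):

```shell
# Forward local port 3000 to gitea09's now-firewalled web port, then browse to
# http://localhost:3000 for the admin interface.
ssh -L 3000:localhost:3000 gitea09.opendev.org
```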
clarkb | the LE certs specify a port of :3000 for the host specific name | 23:02 |
clarkb | fungi: in this case its ansible doing management stuff and it just runs against localhost via ansible which is kinda like a port forward | 23:02 |
fungi | oh, that too | 23:02 |
clarkb | I don't think the :3000 in the certs is a big deal though as my browser doesn't complain when I use :3081 | 23:02 |
clarkb | but I'm not sure | 23:02 |
fungi | i don't think browsers care, no | 23:02 |
ianw | (just confirming that ANSIBLE_LOG_PATH=file ANSIBLE_STDOUT_CALLBACK=... means that the log file gets the output of the selected stdout callback. or to say that another way, the log file captures the ansible stdout. it's not like we can have dense output on the command line but have it logging debug-level info in the background) | 23:03 |
fungi | i use le issued https certs for smtp, imap, pop3 and irc on some servers without any trouble too, for that matter | 23:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Drop public port 3000 access for Gitea https://review.opendev.org/c/opendev/system-config/+/944081 | 23:04 |
clarkb | I'm thinking maybe we manually apply ^ to gitea09 and then if there are no problems with that by tomorrow morning land 944081? | 23:05 |
fungi | yeah, that's what i was just looking at as the easiest option, no need to restart anything | 23:06 |
clarkb | iptables -I openstack-INPUT -p tcp --dport 3000 -j DROP ? | 23:07 |
clarkb | and ip6tables | 23:08 |
clarkb | need to use -I to get it ahead of our accept rule | 23:08 |
clarkb | -A would put it behind and it wouldn't take effect | 23:08 |
clarkb | I'm going to do that | 23:10 |
fungi | yeah | 23:10 |
clarkb | the access log went quiet and load is dropping. I still don't get a useful response from the service yet | 23:12 |
clarkb | I may try restarting things again if that persists | 23:12 |
clarkb | I think my rule is actually problematic; you can't hit :3000 via localhost anymore either with that rule? | 23:14 |
clarkb | ya source is 0.0.0.0/0 | 23:14 |
clarkb | it needs to go in rule slot 2 or 3 depending on whether or not iptables 0 indexes | 23:15 |
clarkb | give me a minute and I'll delete that rule and apply it after the localhost accept rules | 23:15 |
clarkb | that seems to be happier now | 23:18 |
clarkb | but system-config also appears to be behind as anticipated | 23:19 |
clarkb | I'm going to trigger gerrit replication to gitea09 now | 23:19 |
clarkb | fungi: can you double check the iptables rules on gitea09 look correct to you? | 23:20 |
clarkb | I ended up doing -I openstack-INPUT 5 that rule from above then -D openstack-INPUT 1 with both iptables and ip6tables | 23:20 |
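Putting that sequence together (chain name and commands as described above; the rule positions depend on the existing ruleset):

```shell
# First attempt: with no rule number, -I inserts at the top of the chain, ahead
# of the loopback ACCEPT rules, so localhost access to :3000 broke too.
iptables -I openstack-INPUT -p tcp --dport 3000 -j DROP

# Fix: insert the DROP after the loopback/established ACCEPT rules (position 5
# in this ruleset), then delete the too-early copy sitting at position 1.
iptables -I openstack-INPUT 5 -p tcp --dport 3000 -j DROP
iptables -D openstack-INPUT 1

# The same pair of commands is repeated with ip6tables for the v6 ruleset.
```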
fungi | looking | 23:22 |
clarkb | alibaba's abuse reporting portal doesn't seem to have an option for "someone is being bad but maybe not strictly illegal" | 23:24 |
clarkb | replication is in progress against gitea09 | 23:24 |
fungi | yeah, that looks fine, though note that the actual desired state from 944081 is going to be deleting the existing port 3000 allow rules rather than a separate block rule | 23:24 |
clarkb | fungi: yup I guess I could've just -D'd that specific rule | 23:25 |
clarkb | but I think this is fine. we should update the config and then it will update iptables in place and the new setup should be roughly equivalent? | 23:25 |
fungi | anyway, my guess is that a crawler stumbled across a http://gitea09...:3000 url mentioned in our irc channel logs and just kept spidering from there | 23:25 |
clarkb | or some public cert registry | 23:26 |
clarkb | a followup to this probably wants to edit our LE certs to drop the :3000 | 23:26 |
fungi | yes, i think this is a sufficient test | 23:26 |
fungi | and agreed, the port specification is unnecessary for the certs, i expect | 23:26 |
clarkb | these crawlers are such a nuisance though. Look at robots.txt respect the crawl delay. If you get massive quantities of 500 errors maybe you should look at what you are doing etc | 23:27 |
clarkb | replication is almost half done | 23:29 |
clarkb | fungi: 944081 is also a good sanity check that blocking external port :3000 doesn't break our automation. I don't think it will but good to double check | 23:30 |
fungi | yep | 23:31 |
fungi | as for irc logs being a likely entrypoint for the crawlers, we frequently test gitea09 and so mention it a lot more often. doesn't seem like they're hitting any of the other backends | 23:32 |
clarkb | and for :3081 I suspect apache would melt down before gitea did. Not ideal but at least our automation would keep working | 23:33 |
clarkb | the go webserver built into gitea is just too good at accepting all the connections | 23:33 |
clarkb | more than halfway done now. About 1k tasks remaining in the gerrit queue | 23:34 |
clarkb | if anyone wants to look at the historical logs for this /var/gitea/logs/access.log on gitea09 | 23:35 |
clarkb | 'HTTP/1.1" 500' is a search string that should work | 23:35 |
clarkb | system-config master is up to date on gitea09 now | 23:38 |
clarkb | but still about 500 tasks remaining | 23:38 |
clarkb | it is replicating one nova change meta ref. Everything else has replicated | 23:42 |
clarkb | and now that is done too | 23:42 |
clarkb | so ya I think once we are satisfied blocking port 3000 isn't a problem we apply that globally and move on. Gitea09 should be good for now | 23:43 |
fungi | any idea if the user agent(s) for the offenders were already in our filter list? | 23:45 |
clarkb | no, but that is a good thing to check; doing so now | 23:47 |
clarkb | fungi: it is in the filter list | 23:49 |
clarkb | so we've probably discovered this one before and blocked it at the apache level, then they discovered a backdoor | 23:49 |
clarkb | I'm being asked questions about dinner now. I think we're stable and can follow up in the morning with the port block and anything else we decide we need to do | 23:51 |
opendevreview | Ian Wienand proposed opendev/system-config master: install-root-key : run on localhost https://review.opendev.org/c/opendev/system-config/+/944084 | 23:58 |
ianw | ^ one for tomorrow :) | 23:59 |