Wednesday, 2024-10-30

fungi	sgtm, yep!	01:49
*** elibrokeit_ is now known as elibrokeit		04:14
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	06:13
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	06:19
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	07:34
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	08:20
ianw	have we noticed all the tzdata errors in base.yaml in the testing -> https://zuul.opendev.org/t/openstack/build/b136610606d54c3e819bfe0562ce6170/log/job-output.txt#3969 ?	08:42
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	08:58
frickler	I don't think I've seen that before. sounds like we should just install tzdata, then https://code.djangoproject.com/ticket/33814	09:19
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	09:22
opendevreview	Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Fedora 40 to the CI tests https://review.opendev.org/c/openstack/diskimage-builder/+/933664	09:44
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	10:21
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	11:36
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	12:29
opendevreview	Slawek Kaplonski proposed zuul/zuul-jobs master: Drop support for user/password authentication to the readthedocs.org https://review.opendev.org/c/zuul/zuul-jobs/+/933395	13:23
opendevreview	Slawek Kaplonski proposed openstack/project-config master: Remove rtd_secret from the trigger-readthedocs-webhook job https://review.opendev.org/c/openstack/project-config/+/933396	13:31
clarkb	ianw: frickler yes I think tonyb acutally has a change up for that	14:49
clarkb	https://review.opendev.org/c/opendev/system-config/+/923684 looks like I need to reapply my review votes	14:50
clarkb	infra-root I'm around now to start my day if we want to upgrade etherpad, update gerrit caches, or both	14:57
clarkb	do we want to start with etherpad and do gerrit later in the day when in theory it will be less busy or get gerrit done first because it is the "big" one	14:57
clarkb	I'm running gerrit show-caches --show-jvm against the server right now to try and capture some baseline type data and it is not quick	15:15
clarkb	ok so that took about 5-6 minutes? but I have the data and it doesn't seem to have really impacted the running service at all	15:21
clarkb	I'll put the output in my homedir on the server. It looks like we have a fair bit of headroom memory wise though Mem: 93.94g total = 65.73g used + 27.82g free + 400.00m buffers 96.00g max which should be plenty of room for those diff caches if the limits are respected	15:22
clarkb	and ya the hit rates on those caches are also lower than ohters which would imply to me that they need more room	15:24
fungi	clarkb: plan sounds good to me, i've got to run some errands in a bit but won't be away long	15:27
clarkb	fungi: do you think we should do etherpad or gerrit first?	15:28
fungi	etherpad first, for the reason you outlined	15:28
clarkb	ack I'll go ahead and apoprove that change now then	15:28
clarkb	and ya gerrit si busy this morning	15:28
fungi	perfect	15:29
fungi	i'm happy to drive the gerrit restart later too	15:29
clarkb	approved so we've got like half an hour or so I think before that lands	15:29
clarkb	and thank you for offering that would be great	15:29
clarkb	I think the intermediate registry may be doing that thing again. But I'm not positive as its only been a few minutes since the job for etherpad got stuck in that state	16:09
clarkb	ss -np shows no connections active from the test node	16:11
clarkb	I'm running a tcpdump capturing host $thatipaddress now	16:12
clarkb	just to see if we can any SYNs or similar	16:12
clarkb	ping is pinging and shows up in the tcpdump as a sanity check	16:12
fungi	probably won't unless it's retrying the connection	16:13
clarkb	ping works in both directions	16:13
clarkb	and this is all over ipv4 because the test node in question is a raxflex node	16:13
fungi	though it's also going through address translation, if that makes any difference	16:14
clarkb	I suspect not since we saw this with other clouds when it happened previously	16:15
clarkb	its just coincidence that the current example has a floating IP I think	16:15
clarkb	trying to get service logs now but docker logs is slow	16:15
clarkb	anyway it seems like on the test node side it thinks it is pushing to the insecure-ci-registry and on the insecure-ci-registry node we have no connection	16:16
clarkb	ping works. I guess we can try a telnet/nc to port 5000 and see if that succeeds	16:16
fungi	no open socket?	16:16
clarkb	nope ss -np doesn't show it	16:17
clarkb	but the logs are helpful we have the ssl errors	16:17
clarkb	I'm going to restart the container	16:17
clarkb	https://paste.opendev.org/show/b2jNAsE62IZa7I0HJyxJ/	16:17
clarkb	I wonder if this is a bitflip type of situation and the server we're one has bad memory?	16:17
fungi	ssl.SSLError: [SSL: BAD_KEY_SHARE] bad key share (_ssl.c:1006)	16:18
fungi	i guess that's not the ssl error corvus said he'd observed elsewhere?	16:19
clarkb	ya	16:19
clarkb	its been a thing that has happend sporadically on this server for a year or two I think	16:19
clarkb	restarting the container got the job moving again	16:19
clarkb	it could be a code bug somewhere too, but we haven't been able to track that down if so	16:19
clarkb	I wonder if cheroot/cherrypy have had updates since we last built this container	16:20
clarkb	maybe we should just rebuild he container to pick up dependency updatse and see if that helps cc corvus	16:21
clarkb	there is a newer cheroot than the zuul-registry container	16:21
clarkb	I'll push up a "please rebuild this container" change to zuul-registry	16:22
clarkb	as a side note my tcpdump didn't capture the bit where the job got unstuck...	16:22
clarkb	it got all my icmp requests for ping so maybe I got the tcpdump wrong	16:23
fungi	success!	16:26
clarkb	remote: https://review.opendev.org/c/zuul/zuul-registry/+/933759 Rebuild the zuul-registry container image	16:26
fungi	i wonder why system-config-run-etherpad is taking so long to get a node	16:37
clarkb	the job ran in raxflex and since its using the paused registry build all nodes need to come from there and we've got less quota there	16:38
clarkb	I suspect its just due to waiting for resources to free up	16:39
fungi	oh, yep	16:44
fungi	it eventually started anyway	16:44
clarkb	looks like cheroot/cherrypy use matrix we could jump on there maybe to see if they have any ideas if updating the installation doens't help	16:51
fungi	though also zuul-registry-build-image is failing for the image update change	16:52
clarkb	its the client too old problem again but these versions look different (from memory anyway)	16:54
opendevreview	Merged opendev/system-config master: Update etherpad to v2.2.6 https://review.opendev.org/c/opendev/system-config/+/933618	16:54
clarkb	I need a better index into my brain so that I can pull this stuff up and remember how/why/what when its these recurring issues	16:54
clarkb	hrm did matrix remove the ability to search scrollback?	16:57
fungi	i think there's a channel-level config setting that can make it so people who join a channel can't see messages from before they joined it	16:58
clarkb	I can scrollbackwards but there used to be a magnifying glass in element to search	16:59
clarkb	anyway https://review.opendev.org/c/zuul/zuul-jobs/+/913808 is the breadcrumb I was looking for (thankfully it was in weechat logs and I can search those)	16:59
clarkb	and https://review.opendev.org/c/zuul/zuul-jobs/+/913902 is the solution that I think I need to port to zuul-registry	17:00
fungi	clarkb: click the "room info" icon then, and there should be a "search messages" field that appears at the top of the new sidebar that opens	17:00
clarkb	aha thanks	17:00
fungi	at least in element client in my browser	17:00
fungi	the irony of troubleshooting matrix's ui in an irc channel is not lost on me	17:01
fungi	deploy job just finished	17:02
fungi	i'm able to load a pad fine	17:02
noonedeadpunk	hey! Issues we've experiencing with reaching https://releases.openstack.org/constraints/ from nodepool vms are getting more annoying. So I've gathered couple of failed jobs for the same patch wich occured during last couple of days. and a different providers are being used in them	17:03
noonedeadpunk	rax https://zuul.opendev.org/t/openstack/build/87cb2dbe3f554df6aabd16ecf04f829d/	17:03
noonedeadpunk	open metal https://zuul.opendev.org/t/openstack/build/c51165d4b0ee4bf6ad4d6a6eb8e64237	17:03
noonedeadpunk	ovh: https://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e	17:04
noonedeadpunk	are we sure there's nothing wrong with some of our backends/frontends?	17:04
fungi	i need to run a couple of errands real fast, but can dig into that shortly when i return	17:04
clarkb	fungi: oh let me test etherpad I got sidetracked by zuul-registry so fast	17:04
fungi	noonedeadpunk: there's only one frontend for that site by the way, it's the same server as static.opendev.org, it's just apache sitting in front of a network filesystem (afs)	17:05
noonedeadpunk	I know you said we should not use external connections, but well. failure rate is quite annoying right now	17:05
clarkb	fungi: noonedeadpunk: its just a redirect to gitea though	17:05
fungi	aha, i forgot about that	17:05
noonedeadpunk	yeah, it could be gitea as well	17:05
fungi	so it could be at the gitea level the problem is occurring	17:06
clarkb	when this first came up it was prior to the most recent gitea upgrade. There were some bugfixes that we had hoped would improve things which I guess did not help	17:06
clarkb	back then I did notice there were sporadic OOMs of the gitea service but I dind't think they were often enough to explain the problem	17:06
fungi	anyway, i'll brb	17:06
clarkb	etherpad looks good but I had to hard refresh the one pad I loaded	17:07
clarkb	css must've changed in incompatible ways	17:07
clarkb	I will repeat what I've said though that we expect our CI jobs to consume git repos through the zuul cache to avoid internet and server problems like this	17:08
noonedeadpunk	there're just 2 jobs out of all that would try to reach https://releases.openstack.org and only they do fail, just in case. So it's not that we should be issueing too much connections from CI	17:08
clarkb	that doesn't explain why gitea is sad, but its also against our prescribed job design	17:08
noonedeadpunk	I'm just guessing that end users might easily get issues as well?	17:09
noonedeadpunk	As given these are different backends all having random troubles	17:09
noonedeadpunk	*different nodepool providers	17:09
clarkb	if the issue is in the gitea service itself and not the networking then sure. If it is networking then it could be mroe specific to various clouds and their peers	17:10
noonedeadpunk	I kinda somehow suspicious about ovh, open metal and 2 rax would see same networking troubles... could be, but somehow doubt...	17:10
clarkb	re static vs gitea I guess one thing to check is if we're redirecting (I think we are) or proxying	17:11
clarkb	if we're proxying it could be a problem in the chain of proxies	17:11
noonedeadpunk	I think it was just a redirect	17:11
noonedeadpunk	with quite some rules and parsing	17:11
noonedeadpunk	iirc this was producing apache template https://opendev.org/openstack/releases/src/branch/master/openstack_releases/_redirections.py	17:12
noonedeadpunk	and it's even htaccess: https://opendev.org/openstack/releases/src/branch/master/doc/source/_templates/htaccess	17:14
clarkb	ya you can just make the request too and see what the browser does. I just wanted to consider that possibility. Seems we can rule it out	17:14
clarkb	noonedeadpunk: do your jobs log anythng more than read operation timed out? (other info could be http return codes if there is one, the gitea server backend that was reached (its in the ssl cert info), etc)	17:19
noonedeadpunk	no, nothing else I found so far	17:20
clarkb	looking at giteas there are 3 OOMs in the last week	17:21
noonedeadpunk	it's a get_url ansible task	17:21
clarkb	one of which (gitea09) is less than an hour after the recorded read timeout	17:21
clarkb	sorry this one https://zuul.opendev.org/t/openstack/build/87cb2dbe3f554df6aabd16ecf04f829d/log/job-output.txt#11500	17:22
noonedeadpunk	only for that patch I did 5 rechecks since morining of Oct 29	17:23
clarkb	the other two read timeouts occur well after all of the OOMs	17:23
clarkb	ya so I'm not sure the OOMs are the source of the problem	17:23
clarkb	I would've expected closer correlation rather than hours of delta between events	17:23
noonedeadpunk	true	17:24
clarkb	another possibility is we're queueing up at the haproxy frontend due to total numbers of requests and your 10 second timeout is too short depending on the size of the queue	17:24
clarkb	https://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e/log/job-output.txt#10722 this is the most recent example from your list which may be the easiest to debug since the logs will be newer	17:25
noonedeadpunk	default timeout for the module is 10 sec indeed	17:25
noonedeadpunk	it's actually a good point for extending it	17:26
noonedeadpunk	and easy to do	17:26
clarkb	https://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=1730286000000&to=1730289600000 this is the hour block for that most recent occurence in our haproxy stats	17:27
clarkb	there is a frontend connection spike but nothing that seems like an outlier around 11:40	17:28
clarkb	noonedeadpunk: in our ansible http requests we tend to retry a few times too (just generally regardless of what we are talking to)	17:28
noonedeadpunk	++	17:28
noonedeadpunk	seems repsonse time is not being gathered?	17:29
amorin_	hey team, I´d like to merge the full relation chain here: https://review.opendev.org/c/openstack/mistral-tempest-plugin/+/933692	17:30
clarkb	noonedeadpunk: no but I don't think that would be helpful here as many responses take a long time	17:30
clarkb	noonedeadpunk: if you git cloen nova graphing the packfile isn't fast and that is expected so unfortunately a 10 second window is just noise in that	17:31
noonedeadpunk	fair	17:31
amorin_	I can't v+1 on my side, is there anyone available to do that for me? I'd like to keep the commits separated because the last one will be reverted in a near future, but I need to unlock mistral CI first	17:31
clarkb	amorin_: to be mergable you need to make CI happy. YOu can do this by disabling the jobs (either not running them or marking them non voting) or you can squash the changes together	17:32
amorin_	ok, so you mean I should add a commit in the middle of them disable / enable mistral-devstack, right?	17:33
noonedeadpunk	clarkb: eventually.. I just recalled that we're doing fetch twice for this job. First we do wget to the host as a cache (for N, not N+1): https://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e/log/job-output.txt#930-944	17:33
noonedeadpunk	and I never seen it to fail	17:33
clarkb	noonedeadpunk: that could point at a possible client problem then. Like maybe get url doesn't handle redirects properly in all cases?	17:34
*** amorin_ is now known as amorin		17:35
noonedeadpunk	it works each second time, and I'd expect it to behave same way...	17:35
jrosser	noonedeadpunk: that wget you show is also ipv6 - it never will be in the place that fails	17:35
noonedeadpunk	oh? are you sure?	17:35
noonedeadpunk	as I'd assume it would use ipv6 as well	17:36
jrosser	oh, right you're correct for metal style job i think?	17:36
noonedeadpunk	oh, yes, true	17:37
noonedeadpunk	I somehow thought of upgrade jobs as metal only	17:37
noonedeadpunk	so basically we can be having smth off with external connectivity inside lxc on our side. huh	17:37
clarkb	another possibility is that 10 second timeout is just too short and wget doesn't timeout so quickly	17:38
clarkb	but that may be due to lxc container networking	17:38
clarkb	or queing on the haproxy side etc	17:38
noonedeadpunk	yeah, I'm prepping patch to increase timeout and add retry	17:38
clarkb	amorin: yes adding a change or updating existing ones to disable the enable the jobs is probably the easiest option	17:39
clarkb	amorin: and then as long as you merge the whole stack at once the risk for that is low	17:39
amorin	ok thanks, will try that!	17:39
clarkb	https://review.opendev.org/c/zuul/zuul-registry/+/933764 fixes zuul-registry CI jobs and then the change I pushed earlier does an explicit image rebuild	17:57
clarkb	I think that explicit rebuild may be unnecessary though with that first change.	17:57
fungi	i'm back for the rest of the day, so when gerrit activity wanes a bit more i'll work on restarting it (we're already well into europe's evening)	18:03
clarkb	fungi: we need to approve the change for taht one still I think (I only approved etherpad)	18:04
clarkb	and then got distracted by all the things	18:04
clarkb	fungi: https://review.opendev.org/c/opendev/system-config/+/932763 ya its still unapproved should we approve it now then plan to restart probably after my lunch?	18:08
fungi	yep, approved now	18:15
opendevreview	Merged opendev/system-config master: Increase size limits for some Gerrit caches https://review.opendev.org/c/opendev/system-config/+/932763	19:15
fungi	deploy is waiting behind hourlies, but they're nearly done	19:19
clarkb	the deploy job succeeded checking the server now	19:33
clarkb	config update is in place on the server as expected	19:34
fungi	yep, confirmed	19:34
clarkb	I'm not quite done with lunch but maybe plan for restarting in say an hour?	19:34
fungi	sgtm	19:34
clarkb	the process is pretty straightforward should just be downing the container, moving the waiting queue content aside, then starting containers again. We don't have a new image to pull	19:35
fungi	for when we're ready:	19:35
fungi	status notice The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change	19:35
clarkb	and then check if diffs remain available post restart and tomorrow run the show-caches --show-jvm command to see what they look like after ~24 hours of use	19:35
clarkb	ya that notice looks good	19:35
fungi	i have the command series queued up in a root screen session on review02	19:40
fungi	for when the time comes	19:40
fungi	i'll send an early heads up too	20:00
fungi	status notice The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change	20:00
fungi	#status notice The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change	20:01
opendevstatus	fungi: sending notice	20:01
clarkb	I just pulled up the command in screen	20:01
fungi	helps if i add the # ;)	20:01
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change		20:01
clarkb	the only comment I have is you remove your whole tmp dir rather than just its contents	20:01
clarkb	but I guess your tmpdir only has replication queue stuff in it?	20:01
fungi	yeah	20:01
fungi	also the mv would otherwise try to replace it	20:02
clarkb	and personally I always run docker-compose comamnds in the docker compose dir to avoid issues since I don't want to remembe which ones use env files and which don't	20:02
clarkb	pretty sure gerrit doesnt' so using -f is fine	20:02
opendevstatus	fungi: finished sending notice	20:04
fungi	gerrit doesn't, yeah, otherwise i also pass the option to specify the env file	20:04
* fungi feels "weird" about system-level commands that infer context from pwd		20:06
clarkb	ya when I discovered that was happening instead of inferring the context from the compose file I had a sad	20:07
clarkb	I guess we might want to do a show-caches today just to see that we haven't aggressively pruned the larger caches	20:16
clarkb	but the logs should also help confirm that (as it would record its pruning of them)	20:16
fungi	#status notice The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change	20:29
opendevstatus	fungi: sending notice	20:29
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change		20:30
fungi	in progress	20:30
clarkb	I'm following along	20:30
fungi	should be starting back up now	20:30
clarkb	logs say it is ready	20:31
fungi	webui is responding	20:31
clarkb	and diffs load for me	20:31
clarkb	but not on all changes maybe	20:31
fungi	a little sluggish pulling up random old changes, but yes	20:31
fungi	though https://review.opendev.org/c/opendev/bindep/+/932191/3/bindep/depends.py doesn't show me a diff yet	20:32
clarkb	so maybe we improved diff loading post restart a little bit but not universally	20:32
clarkb	that could be the difference between the memory and disk cacehes too	20:32
fungi	there it goes, just took a moment to actually load it	20:32
fungi	didn't error though	20:32
opendevstatus	fungi: finished sending notice	20:32
clarkb	ya the ones that weren't loading for me load now too	20:33
fungi	~immediate	20:33
clarkb	probably let it run for a bit longer then we can capture a show-caches output baseline post restart and do so again tomorrow and then compare the numbers	20:33
fungi	for a 2-3 minute after restart definition of immediate	20:33
clarkb	I'll run the show caches at 20:45 and record that in another file in my homedir on the server	20:37
fungi	sounds good	20:37
fungi	i'm closing out the screen session	20:37
clarkb	++	20:37
clarkb	fwiw caches were pruned but they were all much closer in total size to their configured limits which makes me happy	20:39
fungi	excellent	20:39
clarkb	the only exception I see is this one Cache jdbc:h2:file:///var/gerrit/cache/diff_intraline size (260.44m) is greater than maxSize (128.00m), pruning	20:39
clarkb	but in terms of magnitude thati s still far less of a delta than we had previously with the other diff related caches	20:39
fungi	yep	20:41
fungi	we can always do a second (or third, or fourth) iteration for the sake of incremental performance improvements	20:42
clarkb	heh i queud up the show caches command but was a couple mintues early then got distracted anyway it is running now	20:48
clarkb	and posted	20:50
clarkb	gerrit_file_diff in particular has many more entries in memory already and its hit rate is higher	20:52
clarkb	which I think is ag ood thing	20:53
clarkb	infra-root https://review.opendev.org/c/opendev/system-config/+/933354 is the backup cleanup documentation stuff. ianw has some good thoughts in there if we want to just prune the server normally for now and continue to iterate on that or we can land that as is, do cleanups (and maybe regular pruning) and then implement ianw ideas as a followup	20:54
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	21:07
fungi	we're still a few days away from needing to prune the problem server	21:10
fungi	but i'm happy to do that as a stop-gap if it looms nearer	21:10
clarkb	the next hourly deployment job set should update the zuul registry on insecure-ci-registry	21:26
clarkb	my change to update zuul-registry jobs landed and that built a new image	21:26
fungi	here's hoping that solves the random breakage	21:30
clarkb	registry updated ~15 seconds ago	22:12
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	22:17
opendevreview	Ian Wienand proposed opendev/system-config master: [dnm] testing backup purge idea https://review.opendev.org/c/opendev/system-config/+/933700	23:06

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!