Wednesday, 2024-10-30

fungisgtm, yep!01:49
*** elibrokeit_ is now known as elibrokeit04:14
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370006:13
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370006:19
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370007:34
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370008:20
ianwhave we noticed all the tzdata errors in base.yaml in testing -> https://zuul.opendev.org/t/openstack/build/b136610606d54c3e819bfe0562ce6170/log/job-output.txt#3969 ?08:42
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370008:58
fricklerI don't think I've seen that before. sounds like we should just install tzdata, then https://code.djangoproject.com/ticket/3381409:19
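A minimal sketch of the fix frickler is suggesting, assuming a Debian/Ubuntu test node (the noninteractive frontend setting is just to keep the install from prompting):

    # ensure tzdata is present so zoneinfo lookups stop erroring in the job
    sudo DEBIAN_FRONTEND=noninteractive apt-get update
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata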
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370009:22
opendevreviewDmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Fedora 40 to the CI tests  https://review.opendev.org/c/openstack/diskimage-builder/+/93366409:44
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370010:21
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370011:36
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370012:29
opendevreviewSlawek Kaplonski proposed zuul/zuul-jobs master: Drop support for user/password authentication to the readthedocs.org  https://review.opendev.org/c/zuul/zuul-jobs/+/93339513:23
opendevreviewSlawek Kaplonski proposed openstack/project-config master: Remove rtd_secret from the trigger-readthedocs-webhook job  https://review.opendev.org/c/openstack/project-config/+/93339613:31
clarkbianw: frickler yes I think tonyb actually has a change up for that14:49
clarkbhttps://review.opendev.org/c/opendev/system-config/+/923684 looks like I need to reapply my review votes14:50
clarkbinfra-root I'm around now to start my day if we want to upgrade etherpad, update gerrit caches, or both14:57
clarkbdo we want to start with etherpad and do gerrit later in the day when in theory it will be less busy, or get gerrit done first because it is the "big" one?14:57
clarkbI'm running gerrit show-caches --show-jvm against the server right now to try and capture some baseline type data and it is not quick15:15
clarkbok so that took about 5-6 minutes? but I have the data and it doesn't seem to have really impacted the running service at all15:21
clarkbI'll put the output in my homedir on the server. It looks like we have a fair bit of headroom memory-wise though (Mem: 93.94g total = 65.73g used + 27.82g free + 400.00m buffers, 96.00g max), which should be plenty of room for those diff caches if the limits are respected15:22
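For reference, a sketch of how that baseline can be gathered over the Gerrit SSH API (assumes an account with the ViewCaches/admin capability; the output file name is just illustrative):

    # capture cache and JVM stats without touching the running service
    ssh -p 29418 review.opendev.org gerrit show-caches --show-jvm > ~/show-caches-baseline.txt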
clarkband ya the hit rates on those caches are also lower than others which would imply to me that they need more room15:24
fungiclarkb: plan sounds good to me, i've got to run some errands in a bit but won't be away long15:27
clarkbfungi: do you think we should do etherpad or gerrit first?15:28
fungietherpad first, for the reason you outlined15:28
clarkback I'll go ahead and approve that change now then15:28
clarkband ya gerrit is busy this morning15:28
fungiperfect15:29
fungii'm happy to drive the gerrit restart later too15:29
clarkbapproved so we've got like half an hour or so I think before that lands15:29
clarkband thank you for offering that would be great15:29
clarkbI think the intermediate registry may be doing that thing again. But I'm not positive as it's only been a few minutes since the job for etherpad got stuck in that state16:09
clarkbss -np shows no connections active from the test node16:11
clarkbI'm running a tcpdump capturing host $thatipaddress now16:12
clarkbjust to see if we can see any SYNs or similar16:12
clarkbping is pinging and shows up in the tcpdump as a sanity check16:12
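Roughly the checks being described, with the interface name and the test node address as placeholders:

    # any sockets open from the test node on the registry host?
    ss -np | grep 203.0.113.10
    # watch for incoming SYNs (or anything else) from it; the ICMP from the ping shows up here too
    sudo tcpdump -ni eth0 host 203.0.113.10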
fungiprobably won't unless it's retrying the connection16:13
clarkbping works in both directions16:13
clarkband this is all over ipv4 because the test node in question is a raxflex node16:13
fungithough it's also going through address translation, if that makes any difference16:14
clarkbI suspect not since we saw this with other clouds when it happened previously16:15
clarkbit's just a coincidence that the current example has a floating IP I think16:15
clarkbtrying to get service logs now but docker logs is slow16:15
clarkbanyway it seems like on the test node side it thinks it is pushing to the insecure-ci-registry and on the insecure-ci-registry node we have no connection16:16
clarkbping works. I guess we can try a telnet/nc to port 5000 and see if that succeeds16:16
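A quick version of that check from the test node side; the registry hostname is the one discussed above and the port is the one just mentioned:

    # does a plain TCP connection to the registry port even open?
    nc -vz insecure-ci-registry.opendev.org 5000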
fungino open socket?16:16
clarkbnope ss -np doesn't show it16:17
clarkbbut the logs are helpful we have the ssl errors16:17
clarkbI'm going to restart the container16:17
clarkbhttps://paste.opendev.org/show/b2jNAsE62IZa7I0HJyxJ/16:17
clarkbI wonder if this is a bitflip type of situation and the server we're on has bad memory?16:17
fungissl.SSLError: [SSL: BAD_KEY_SHARE] bad key share (_ssl.c:1006)16:18
fungii guess that's not the ssl error corvus said he'd observed elsewhere?16:19
clarkbya16:19
clarkbit's been a thing that has happened sporadically on this server for a year or two I think16:19
clarkbrestarting the container got the job moving again16:19
clarkbit could be a code bug somewhere too, but we haven't been able to track that down if so16:19
clarkbI wonder if cheroot/cherrypy have had updates since we last built this container16:20
clarkbmaybe we should just rebuild the container to pick up dependency updates and see if that helps cc corvus 16:21
clarkbthere is a newer cheroot than what's in the zuul-registry container16:21
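One way to compare the two; the image name here is an assumption and the index subcommand needs a reasonably new pip (>= 21.2):

    # cheroot version baked into the current image vs the latest release on PyPI
    docker run --rm --entrypoint pip quay.io/zuul-ci/zuul-registry show cheroot
    pip index versions cheroot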
clarkbI'll push up a "please rebuild this container" change to zuul-registry16:22
clarkbas a side note my tcpdump didn't capture the bit where the job got unstuck...16:22
clarkbit got all my icmp requests for ping so maybe I got the tcpdump wrong16:23
fungisuccess!16:26
clarkbremote:   https://review.opendev.org/c/zuul/zuul-registry/+/933759 Rebuild the zuul-registry container image16:26
fungii wonder why system-config-run-etherpad is taking so long to get a node16:37
clarkbthe job ran in raxflex and since it's using the paused registry build all nodes need to come from there and we've got less quota there16:38
clarkbI suspect it's just due to waiting for resources to free up16:39
fungioh, yep16:44
fungiit eventually started anyway16:44
clarkblooks like cheroot/cherrypy use matrix; we could jump on there maybe to see if they have any ideas if updating the installation doesn't help16:51
fungithough also zuul-registry-build-image is failing for the image update change16:52
clarkbit's the client-too-old problem again but these versions look different (from memory anyway)16:54
opendevreviewMerged opendev/system-config master: Update etherpad to v2.2.6  https://review.opendev.org/c/opendev/system-config/+/93361816:54
clarkbI need a better index into my brain so that I can pull this stuff up and remember how/why/what when it's these recurring issues16:54
clarkbhrm did matrix remove the ability to search scrollback?16:57
fungii think there's a channel-level config setting that can make it so people who join a channel can't see messages from before they joined it16:58
clarkbI can scroll backwards but there used to be a magnifying glass in element to search16:59
clarkbanyway https://review.opendev.org/c/zuul/zuul-jobs/+/913808 is the breadcrumb I was looking for (thankfully it was in weechat logs and I can search those)16:59
clarkband https://review.opendev.org/c/zuul/zuul-jobs/+/913902 is the solution that I think I need to port to zuul-registry17:00
fungiclarkb: click the "room info" icon then, and there should be a "search messages" field that appears at the top of the new sidebar that opens17:00
clarkbaha thanks17:00
fungiat least in element client in my browser17:00
fungithe irony of troubleshooting matrix's ui in an irc channel is not lost on me17:01
fungideploy job just finished17:02
fungii'm able to load a pad fine17:02
noonedeadpunkhey! Issues we've been experiencing with reaching https://releases.openstack.org/constraints/ from nodepool vms are getting more annoying. So I've gathered a couple of failed jobs for the same patch which occurred during the last couple of days, and different providers are being used in them17:03
noonedeadpunkrax https://zuul.opendev.org/t/openstack/build/87cb2dbe3f554df6aabd16ecf04f829d/17:03
noonedeadpunkopen metal https://zuul.opendev.org/t/openstack/build/c51165d4b0ee4bf6ad4d6a6eb8e6423717:03
noonedeadpunkovh: https://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e17:04
noonedeadpunkare we sure there's nothing wrong with some of our backends/frontends?17:04
fungii need to run a couple of errands real fast, but can dig into that shortly when i return17:04
clarkbfungi: oh let me test etherpad I got sidetracked by zuul-registry so fast17:04
funginoonedeadpunk: there's only one frontend for that site by the way, it's the same server as static.opendev.org, it's just apache sitting in front of a network filesystem (afs)17:05
noonedeadpunkI know you said we should not use external connections, but well, the failure rate is quite annoying right now17:05
clarkbfungi: noonedeadpunk: it's just a redirect to gitea though17:05
fungiaha, i forgot about that17:05
noonedeadpunkyeah, it could be gitea as well17:05
fungiso it could be at the gitea level the problem is occurring17:06
clarkbwhen this first came up it was prior to the most recent gitea upgrade. There were some bugfixes that we had hoped would improve things which I guess did not help17:06
clarkbback then I did notice there were sporadic OOMs of the gitea service but I didn't think they were often enough to explain the problem17:06
fungianyway, i'll brb17:06
clarkbetherpad looks good but I had to hard refresh the one pad I loaded17:07
clarkbcss must've changed in incompatible ways17:07
clarkbI will repeat what I've said though that we expect our CI jobs to consume git repos through the zuul cache to avoid internet and server problems like this17:08
noonedeadpunkthere are just 2 jobs out of all of them that would try to reach https://releases.openstack.org and only they fail, just in case. So it's not that we should be issuing too many connections from CI17:08
clarkbthat doesn't explain why gitea is sad, but it's also against our prescribed job design17:08
noonedeadpunkI'm just guessing that end users might easily get issues as well?17:09
noonedeadpunkAs given these are different backends all having random troubles17:09
noonedeadpunk*different nodepool providers17:09
clarkbif the issue is in the gitea service itself and not the networking then sure. If it is networking then it could be more specific to various clouds and their peers17:10
noonedeadpunkI'm kinda suspicious that ovh, open metal and 2 rax would all see the same networking troubles... could be, but I somehow doubt it...17:10
clarkbre static vs gitea I guess one thing to check is if we're redirecting (I think we are) or proxying17:11
clarkbif we're proxying it could be a problem in the chain of proxies17:11
noonedeadpunkI think it was just a redirect17:11
noonedeadpunkwith quite some rules and parsing17:11
noonedeadpunkiirc this is what was producing the apache template https://opendev.org/openstack/releases/src/branch/master/openstack_releases/_redirections.py17:12
noonedeadpunkand it's even an htaccess file: https://opendev.org/openstack/releases/src/branch/master/doc/source/_templates/htaccess17:14
clarkbya you can just make the request too and see what the browser does. I just wanted to consider that possibility. Seems we can rule it out17:14
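The same check from the command line, as a sketch (the path is just an example constraints URL):

    # -I: headers only, so the Location header shows whether this is a redirect or a proxy
    curl -sI https://releases.openstack.org/constraints/upper/master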
clarkbnoonedeadpunk: do your jobs log anything more than read operation timed out? (other info could be http return codes if there is one, the gitea server backend that was reached (it's in the ssl cert info), etc)17:19
noonedeadpunkno, nothing else I found so far17:20
clarkblooking at giteas there are 3 OOMs in the last week17:21
noonedeadpunkit's a get_url ansible task17:21
clarkbone of which (gitea09) is less than an hour after the recorded read timeout17:21
clarkbsorry this one https://zuul.opendev.org/t/openstack/build/87cb2dbe3f554df6aabd16ecf04f829d/log/job-output.txt#1150017:22
noonedeadpunkonly for that patch I did 5 rechecks since the morning of Oct 2917:23
clarkbthe other two read timeouts occur well after all of the OOMs17:23
clarkbya so I'm not sure the OOMs are the source of the problem17:23
clarkbI would've expected closer correlation rather than hours of delta between events17:23
noonedeadpunktrue17:24
clarkbanother possibility is we're queueing up at the haproxy frontend due to total numbers of requests and your 10 second timeout is too short depending on the size of the queue17:24
clarkbhttps://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e/log/job-output.txt#10722 this is the most recent example from your list which may be the easiest to debug since the logs will be newer17:25
noonedeadpunkdefault timeout for the module is 10 sec indeed17:25
noonedeadpunkit's actually a good point for extending it17:26
noonedeadpunkand easy to do17:26
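In shell terms, roughly the behaviour that patch would aim for; the URL, output filename and retry numbers here are illustrative, not what the job actually uses:

    # retry transient failures and give the whole request up to 60 seconds instead of 10
    curl --fail --retry 3 --retry-delay 10 --connect-timeout 10 --max-time 60 \
        -o upper-constraints.txt https://releases.openstack.org/constraints/upper/master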
clarkbhttps://grafana.opendev.org/d/1f6dfd6769/opendev-load-balancer?orgId=1&from=1730286000000&to=1730289600000 this is the hour block for that most recent occurence in our haproxy stats17:27
clarkbthere is a frontend connection spike but nothing that seems like an outlier around 11:4017:28
clarkbnoonedeadpunk: in our ansible http requests we tend to retry a few times too (just generally regardless of what we are talking to)17:28
noonedeadpunk++17:28
noonedeadpunkseems response time is not being gathered?17:29
amorin_hey team, I'd like to merge the full relation chain here: https://review.opendev.org/c/openstack/mistral-tempest-plugin/+/93369217:30
clarkbnoonedeadpunk: no but I don't think that would be helpful here as many responses take a long time17:30
clarkbnoonedeadpunk: if you git clone nova, generating the packfile isn't fast and that is expected, so unfortunately a 10 second window is just noise in that17:31
noonedeadpunkfair17:31
amorin_I can't v+1 on my side, is there anyone available to do that for me? I'd like to keep the commits separated because the last one will be reverted in the near future, but I need to unlock mistral CI first17:31
clarkbamorin_: to be mergeable you need to make CI happy. You can do this by disabling the jobs (either not running them or marking them non-voting) or you can squash the changes together17:32
amorin_ok, so you mean I should add a commit in the middle of them to disable / enable mistral-devstack, right?17:33
noonedeadpunkclarkb: eventually.. I just recalled that we're doing fetch twice for this job. First we do wget to the host as a cache (for N, not N+1): https://zuul.opendev.org/t/openstack/build/47f47a5683104c71a62ce2efdad45d0e/log/job-output.txt#930-94417:33
noonedeadpunkand I've never seen it fail17:33
clarkbnoonedeadpunk: that could point at a possible client problem then. Like maybe get_url doesn't handle redirects properly in all cases?17:34
*** amorin_ is now known as amorin17:35
noonedeadpunkit works each second time, and I'd expect it to behave the same way...17:35
jrossernoonedeadpunk: that wget you show is also ipv6 - it never will be in the place that fails17:35
noonedeadpunkoh? are you sure?17:35
noonedeadpunkas I'd assume it would use ipv6 as well17:36
jrosseroh, right, you're correct for metal-style jobs I think?17:36
noonedeadpunkoh, yes, true17:37
noonedeadpunkI somehow thought of upgrade jobs as metal only17:37
noonedeadpunkso basically we could be having something off with external connectivity inside lxc on our side. huh17:37
clarkbanother possibility is that the 10 second timeout is just too short and wget doesn't time out so quickly17:38
clarkbbut that may be due to lxc container networking17:38
clarkbor queing on the haproxy side etc17:38
noonedeadpunkyeah, I'm prepping a patch to increase the timeout and add a retry17:38
clarkbamorin: yes adding a change or updating existing ones to disable then enable the jobs is probably the easiest option17:39
clarkbamorin: and then as long as you merge the whole stack at once the risk for that is low17:39
amorinok thanks, will try that!17:39
clarkbhttps://review.opendev.org/c/zuul/zuul-registry/+/933764 fixes zuul-registry CI jobs and then the change I pushed earlier does an explicit image rebuild17:57
clarkbI think that explicit rebuild may be unnecessary though with that first change.17:57
fungii'm back for the rest of the day, so when gerrit activity wanes a bit more i'll work on restarting it (we're already well into europe's evening)18:03
clarkbfungi: we need to approve the change for that one still I think (I only approved etherpad)18:04
clarkband then got distracted by all the things18:04
clarkbfungi: https://review.opendev.org/c/opendev/system-config/+/932763 ya it's still unapproved, should we approve it now then plan to restart probably after my lunch?18:08
fungiyep, approved now18:15
opendevreviewMerged opendev/system-config master: Increase size limits for some Gerrit caches  https://review.opendev.org/c/opendev/system-config/+/93276319:15
fungideploy is waiting behind hourlies, but they're nearly done19:19
clarkbthe deploy job succeeded checking the server now19:33
clarkbconfig update is in place on the server as expected19:34
fungiyep, confirmed19:34
clarkbI'm not quite done with lunch but maybe plan for restarting in say an hour?19:34
fungisgtm19:34
clarkbthe process is pretty straightforward: it should just be downing the container, moving the waiting queue content aside, then starting containers again. We don't have a new image to pull19:35
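A sketch of that sequence; the compose directory and replication queue paths below are assumptions from memory, not the exact values on the server:

    cd /etc/gerrit-compose
    docker-compose down
    # set the replication waiting queue aside so stale events aren't replayed on startup
    mv /home/gerrit2/review_site/data/replication/ref-updates/waiting /home/gerrit2/tmp/
    docker-compose up -d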
fungifor when we're ready:19:35
fungistatus notice The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change19:35
clarkband then check if diffs remain available post restart and tomorrow run the show-caches --show-jvm command to see what they look like after ~24 hours of use19:35
clarkbya that notice looks good19:35
fungii have the command series queued up in a root screen session on review0219:40
fungifor when the time comes19:40
fungii'll send an early heads up too20:00
fungistatus notice The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change20:00
fungi#status notice The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change20:01
opendevstatusfungi: sending notice20:01
clarkbI just pulled up the command in screen20:01
fungihelps if i add the # ;)20:01
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily at 20:30 utc (half an hour from now) to apply a configuration change20:01
clarkbthe only comment I have is you remove your whole tmp dir rather than just its contents20:01
clarkbbut I guess your tmpdir only has replication queue stuff in it?20:01
fungiyeah20:01
fungialso the mv would otherwise try to replace it20:02
clarkband personally I always run docker-compose commands in the docker compose dir to avoid issues since I don't want to remember which ones use env files and which don't20:02
clarkbpretty sure gerrit doesn't so using -f is fine20:02
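The two invocation styles being discussed, with the compose path assumed:

    # run from the compose directory, so any .env file sitting next to it is picked up automatically
    cd /etc/gerrit-compose && docker-compose ps
    # or point at the file explicitly; services that need an env file then also need --env-file
    docker-compose -f /etc/gerrit-compose/docker-compose.yaml ps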
opendevstatusfungi: finished sending notice20:04
fungigerrit doesn't, yeah, otherwise i also pass the option to specify the env file20:04
* fungi feels "weird" about system-level commands that infer context from pwd20:06
clarkbya when I discovered that was happening instead of inferring the context from the compose file I had a sad20:07
clarkbI guess we might want to do a show-caches today just to see that we haven't aggressively pruned the larger caches20:16
clarkbbut the logs should also help confirm that (as it would record its pruning of them)20:16
fungi#status notice The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change20:29
opendevstatusfungi: sending notice20:29
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily to apply a configuration change20:30
fungiin progress20:30
clarkbI'm following along20:30
fungishould be starting back up now20:30
clarkblogs say it is ready20:31
fungiwebui is responding20:31
clarkband diffs load for me20:31
clarkbbut not on all changes maybe20:31
fungia little sluggish pulling up random old changes, but yes20:31
fungithough https://review.opendev.org/c/opendev/bindep/+/932191/3/bindep/depends.py doesn't show me a diff yet20:32
clarkbso maybe we improved diff loading post restart a little bit but not universally20:32
clarkbthat could be the difference between the memory and disk caches too20:32
fungithere it goes, just took a moment to actually load it20:32
fungididn't error though20:32
opendevstatusfungi: finished sending notice20:32
clarkbya the ones that weren't loading for me load now too20:33
fungi~immediate20:33
clarkbprobably let it run for a bit longer then we can capture a show-caches output baseline post restart and do so again tomorrow and then compare the numbers20:33
fungifor a 2-3 minute after restart definition of immediate20:33
clarkbI'll run the show caches at 20:45 and record that in another file in my homedir on the server20:37
fungisounds good20:37
fungii'm closing out the screen session20:37
clarkb++20:37
clarkbfwiw caches were pruned but they were all much closer in total size to their configured limits which makes me happy20:39
fungiexcellent20:39
clarkbthe only exception I see is this one: Cache jdbc:h2:file:///var/gerrit/cache/diff_intraline size (260.44m) is greater than maxSize (128.00m), pruning20:39
clarkbbut in terms of magnitude that is still far less of a delta than we had previously with the other diff-related caches20:39
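Those pruning messages come from Gerrit's disk cache pruner and land in the regular server log, so they're easy to pull out; the log path here is an assumption:

    # see which persistent caches got pruned and by how much
    grep -i pruning /home/gerrit2/review_site/logs/error_log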
fungiyep20:41
fungiwe can always do a second (or third, or fourth) iteration for the sake of incremental performance improvements20:42
clarkbheh i queued up the show-caches command but was a couple minutes early, then got distracted; anyway it is running now20:48
clarkband posted20:50
clarkbgerrit_file_diff in particular has many more entries in memory already and its hit rate is higher20:52
clarkbwhich I think is a good thing20:53
clarkbinfra-root https://review.opendev.org/c/opendev/system-config/+/933354 is the backup cleanup documentation stuff. ianw has some good thoughts in there; we could just prune the server normally for now and continue to iterate on that, or we can land that as is, do cleanups (and maybe regular pruning) and then implement ianw's ideas as a followup20:54
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370021:07
fungiwe're still a few days away from needing to prune the problem server21:10
fungibut i'm happy to do that as a stop-gap if it looms nearer21:10
clarkbthe next hourly deployment job set should update the zuul registry on insecure-ci-registry21:26
clarkbmy change to update zuul-registry jobs landed and that built a new image21:26
fungihere's hoping that solves the random breakage21:30
clarkbregistry updated ~15 seconds ago22:12
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370022:17
opendevreviewIan Wienand proposed opendev/system-config master: [dnm] testing backup purge idea  https://review.opendev.org/c/opendev/system-config/+/93370023:06

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!