Thursday, 2025-01-23

opendevreviewRodolfo Alonso proposed zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job  https://review.opendev.org/c/zuul/zuul-jobs/+/93993606:52
opendevreviewElod Illes proposed openstack/project-config master: Use ubuntu-noble for test-release-openstack  https://review.opendev.org/c/openstack/project-config/+/93994710:46
opendevreviewchandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995211:26
opendevreviewchandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995211:33
opendevreviewchandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995211:55
opendevreviewchandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995212:39
opendevreviewchandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995212:53
opendevreviewMerged zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job  https://review.opendev.org/c/zuul/zuul-jobs/+/93993614:12
opendevreviewMerged openstack/project-config master: Use ubuntu-noble for test-release-openstack  https://review.opendev.org/c/openstack/project-config/+/93994714:39
opendevreviewMerged openstack/project-config master: Add repository for devstack-plugin-prometheus  https://review.opendev.org/c/openstack/project-config/+/93995214:58
opendevreviewMerged openstack/project-config master: Fix release ACL for whitebox-tempest-plugin  https://review.opendev.org/c/openstack/project-config/+/93888714:59
clarkbthere is a held gerrit here: https://200.225.47.41/q/status:open+-is:wip for testing h2 cache stuff. I don't think I'm going to dive right into that though as I'd like to clean up some of the remaining paste/lodgeit effort first15:50
clarkbto that end I've approved https://review.opendev.org/c/opendev/lodgeit/+/939385 to publish lodgeit images to quay and if that is happy I'll approve the change to pull from quay as well15:51
clarkbfungi: did you want to push a bindep release or do we think the twine problems might impact that?15:52
clarkbfungi: to look at podman package update restart hook behavior I'm at https://packages.ubuntu.com/noble/podman on the right pane there is a ...debian.tar.xz that is where I should look for package control and hook stuff right?15:54
fungiclarkb: i'm hoping we'll get the twine situation sorted out first, but that could be within the next few hours15:56
fungiclarkb: for podman packaging, yes but also see the further analysis and links i posted in the etherpad where we were discussing it. i don't see any indication it would try to restart running containers15:57
clarkbah ok you've looked already thats great.15:57
clarkbfungi: in debian/rules I see stuff doing dh_installsystemd --name=podman-restart but that is all I can find so far15:58
clarkbbut I think that is installing systemd unit files?15:59
clarkband podman-restart is a utility to restart containers15:59
fungiyeah, looks like it installs /etc/systemd/system/default.target.wants/podman-restart.service (you can check the one on paste)15:59
clarkbhttps://docs.podman.io/en/v5.1.0/markdown/podman-restart.1.html15:59
clarkbya so its a tool we could run but it doesn't appear tied into the packaging itself so this is great15:59
fungion stop it does `podman stop $(/usr/bin/podman container ls --filter restart-policy=always -q)`16:00
fungiso only affects containers with a restart-policy of "always"16:00
clarkbfungi: that is most of our containers fwiw16:00
clarkbwhat do you mean by `on stop`?16:01
fungiService.ExecStop=16:01
clarkbgotcha systemd service stop. For which service?16:01
fungipodman-restart.service16:01
clarkbgot it.16:01
fungiso i guess we need to find whether podman-restart gets stopped/started/restarted on package updates16:02
clarkba naive grep -r podman-restart * inside the debian xz tar contents doesn't show anything obviously doing that16:02
fungiits purpose seems to be more for making sure containers get started on boot16:02
clarkband ya I think that is what is ensuring things come up on boot for us16:03
fungiso i'm not overly concerned, but i guess we should pay attention around podman package upgrades just to be sure16:03
clarkbsounds good16:03
clarkband thanks again for taking an early look16:03
clarkbonce I feel a bit more awake I'm also going to spot check backups for paste02 just to be more confident in them. Then I think we can probably land the change to retire paste01 backups16:06
clarkbinfra-root for general container image reliability I think we have two broad actions we can do: the first is updating mariadb to fetch from quay in all of our services (paste and gitea are done). This does restart the database so care needs to be taken. Then separately updating our Dockerfiles to pull dependencies for images we don't build (because we don't use them speculatively) as well16:09
clarkbat first I wanted to bulk move to our python-builder and python-base images on quay but realized we would lose speculative testing of updates to those images if we did so before moving to podman as the runtime. I do update lodgeit but we're on podman for paste now so I think that is fine16:10
fungiand, to be clear, switching to podman requires updating to noble first yeah?16:11
clarkbyes16:11
clarkbor at least it does currently. It may be possible to get podman running with docker compose on older platforms but every time we've tried in the past it hasn't been workable for one reason or another16:12
clarkbnoble seems to be the first case where the debuntu world and the podman world have caught up to each other in a way that makes them work nicely16:12
clarkbwe might also decide we're ok with losing speculative testing of python-base and python-builder if we have some speculative coverage of them (for example via lodgeit or something else)16:13
clarkbzuul uses them too16:13
clarkbso maybe its ok to accept a small amount of risk in updating those without speculative testing for say gerrit and whatever else as long as we lean on zuul and lodgeit for coverage16:14
clarkboh but zuul is in a different tenant so we don't get speculative testing there either?16:14
slittlestill can't get zuul to run on https://review.opendev.org/c/starlingx/utilities/+/938743 and https://review.opendev.org/c/starlingx/vault-armada-app/+/93874416:21
clarkbslittle: did you try my suggestion of pushing an update to the zuul config to force zuul to evaluate the config on that branch and report back errors?16:22
clarkbI don't see evidence of that in the changes but maybe there was a different change pushed for that16:22
fungiremote:   https://review.opendev.org/c/starlingx/utilities/+/940048 DNM: See what happens when Zuul config is modified [WIP] [NEW]16:26
clarkblooks like we're still not getting complaints from zuul and I'm not seeing it enqueue jobs either. I guess that hack to try and get zuul to post a response isn't valid16:28
slittleA whitespace change to .zuul.yaml in https://review.opendev.org/c/starlingx/utilities/+/938743 had no effect16:30
clarkbzuul02 reports the same no jobs for queue item in check that we saw with the kolla change overriding config (which meant there were no jobs there still not sure why there are no jobs here)16:30
clarkbthe list of sources still doesn't seem to contain r/stx.10.0 though16:31
clarkbcould the problem be that we are ignoring the branch for some reason?16:31
clarkbhere we go16:32
clarkbConfiguration syntax error not related to change context. Error won't be reported.16:32
fungiso probably still related to one or more of the remaining starlingx/zuul-jobs errors?16:33
clarkbthat is my best guess right now16:34
clarkbthere doesn't seem to be an associated traceback in the log or anything indicating what the error is16:34
slittlei find it strange that a couple dozen other starlingx gits got past this for .gitreview update on r/stx.10.0 branch16:35
clarkbslittle: they may not depend on the broken configuration in zuul so their zuul configs for the new branch loaded16:35
clarkbthe problem appears to be that because this is a new branch there is no existing config in zuul for it. When zuul goes to load the configs for this branch it cannot do so because there are errors.16:36
clarkbI'm still trying to sort out what the errors are16:38
fungiyou mean beyond just looking at the list of errors at https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=016:38
clarkbya it would be nice to see a concrete link between starlingx/utilities and starlingx/zuul-jobs errors for example (like use of the bad nodeset or something)16:39
clarkbone option may be to trim the .zuul.yaml down to just the linters job then build up from there until it breaks16:39
clarkbmaking additional guesses the problem could be with the secret16:41
clarkbzuul requires that secrets not be changed across branches and maybe this definition is different16:41
slittledoes zuul.conf support commenting out ?16:42
fungiit does16:42
fungi940048,1 is trimmed down to just the linters job and zuul still didn't enqueue or report errors on the change16:42
clarkbfungi: 940048 has all the jobs in it16:43
fungisorry, meant ,216:43
clarkbah I need to f5 then16:43
fungii revised it16:43
clarkbfungi: it needs a rebase on the latest parent patchset16:43
fungii'm going to try setting it to just the noop job for check next16:43
fungi940048,3 just uses the noop job in check, and no response from zuul16:48
clarkband we appear to still get the configuration error unrelated to the change16:49
clarkbso the error must not be in the branch (and maybe not the project?) itself16:49
opendevreviewMerged opendev/lodgeit master: Reapply "Move lodgeit image publication to quay.io"  https://review.opendev.org/c/opendev/lodgeit/+/93938516:50
clarkbhowever the error list only shows starlingx/zuul-jobs so maybe that is the source of the problem16:52
clarkbstarlingx/zuul-jobs defines starlingx-common-tox-linters starlingx-common-tox-pep8 and starlingx-common-tox-pylint which are suspiciously similar to the jobs in utilities (but they have different names). Mostly just calling this out because why16:54
fungi940048,4 is just the noop job and no parent16:54
clarkbinfra-root any objection to approving https://review.opendev.org/c/opendev/system-config/+/939767 now that lodgeit is being updated in quay: https://quay.io/repository/opendevorg/lodgeit?tab=tags&tag=latest ?16:56
fungii've gone ahead and approved that, but if anyone disagrees with you or the 3 existing +2 votes they have time to -2 or wip it16:57
frickleradded another +2 just in case ;)16:58
clarkbfungi: your latest ps still has the same issue according to zuul02's debug log: 2025-01-23 16:54:56,092 INFO zuul.Pipeline.openstack.check: [e: 8221001fed414122b8e0fe1cdea30352] Configuration syntax error not related to change context. Error won't be reported.16:59
clarkbwhich is really odd because what in that config can be wrong16:59
fungi940048,6 is just the noop job with no parent change and no topic17:00
fungion the wild theory that same-topic functionality is related to this17:00
clarkb2025-01-23 17:00:42,923 INFO zuul.Pipeline.openstack.check: [e: 33910d2518e74d8abb72886dbd146143] Configuration syntax error not related to change context. Error won't be reported.17:01
clarkbsame thing17:01
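A minimal sketch of how the repeated scheduler message could be pulled out of a debug log together with its event ID, so the ID can be matched against other log lines. The line layout is assumed from the two log lines quoted above; the function name and regular expression are illustrative and not part of Zuul.

```python
import re

# Message text as it appears in the zuul02 debug log lines quoted above.
UNRELATED = "Configuration syntax error not related to change context"

# Assumed layout: "<date> <time>,<ms> <LEVEL> <logger>: [e: <event id>] <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) (?P<logger>\S+): "
    r"\[e: (?P<event>[0-9a-f]+)\] (?P<msg>.*)$"
)


def unrelated_error_events(lines):
    """Yield (timestamp, event_id) for each 'unrelated config error' line."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and UNRELATED in match.group("msg"):
            yield match.group("ts"), match.group("event")
```

Feeding it an open log file (for example `unrelated_error_events(open(path))`, with `path` being whichever debug log is under inspection) gives the event IDs to grep for elsewhere in the same log.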
clarkbI think that really points to project config elsewhere?17:01
fungiyeah, except starlingx/utilities doesn't have any jobs added by project-config either (i just checked)17:02
clarkbme too and I concur17:02
clarkbit could be config in another branch in utilities that uses a branch matcher to apply to this branch maybe17:03
clarkbor branch wide problems like the secret being redefined or something17:04
clarkbI think that is the problem its the secret breaking the project config project wide17:04
clarkbmaybe17:04
clarkbhttps://opendev.org/starlingx/utilities/src/branch/r/stx.5.0/.zuul.yaml#L44 != https://opendev.org/starlingx/utilities/src/branch/master/.zuul.yaml#L15917:05
clarkboh but the secret has different names so that should be ok17:05
clarkbcorvus: is there a trick for finding unrelated errors in the zuul debug log? I'm looking at the source and wondering if we even log them at all (that might explain why I'm not seeing anything in the logs)17:11
clarkbcorvus: tldr is fungi pushed a very minimal zuul config: https://review.opendev.org/c/starlingx/utilities/+/940048 and zuul still reports there are unrelated config errors so they won't be reported17:12
clarkbthe project doesn't have any config in openstack/project-config which would imply the only config is in the project itself. Which has me at "its probably a problem in another branch impacting this new branch"17:14
clarkblooks like adding f/caracal branch worked ~2 months ago17:15
clarkband this is the first new branch since then17:16
clarkbhttps://review.opendev.org/c/starlingx/utilities/+/93489617:16
fungi940048,7 sets pipeline.debug just to see if that can coax any details out17:17
fungi(hopefully i got that right, i do it so rarely)17:17
clarkbusing that timeframe I'm looking at https://review.opendev.org/q/project:starlingx/utilities+status:merged to see what has gone into the project since November 1217:17
corvusclarkb: i'm going through 103 lines of scrollback, give me a minute.17:17
clarkbcorvus: ack thanks17:17
clarkbnone of the merged changes to the project since November 12 touch zuul config17:18
corvusclarkb: if i'm following correctly, the problem is that we expect starlingx/zuul-jobs to have a project config on branch r/stx.10.0 -- that can be seen here: https://opendev.org/starlingx/zuul-jobs/src/branch/r/stx.10.0/zuul.d/project.yaml  but there is no configuration visible in zuul at https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/zuul-jobs for that branch and there are no errors in the config page17:23
corvusconfig error page17:23
corvusclarkb: that sounds like zuul is unaware of the branch.  presumably we're not excluding the branch in main.yaml, which suggests that zuul may have just missed the branch creation.  we can fix that with a reconfiguration.17:24
corvusi'll trigger a reconfiguration of the openstack tenant17:25
clarkbcorvus: the project is starlingx/utilities but the rest of it makes sense to me17:26
clarkbzuul-jobs may be in the same boat too17:26
clarkbya that repo has a r/stx.10.0 branch too so likely in the same boat17:26
corvusit is surprising for it to have missed two.17:26
corvusis there something unusual about how those branches were created.17:26
corvus?17:26
clarkbslittle: ^ how are you creating the branches?17:27
fungimaybe the reason https://review.opendev.org/c/starlingx/utilities/+/940048 isn't working though is that the as-created state of the r/stx.10.0 branch in starlingx/utilities refers back to starlingx/zuul-jobs which has errors, even though the proposed change would remove all association with it?17:27
clarkbcorvus: starlingx-release has gerrit create perms but not force push as far as I can tell so they should be using either the gerrit web ui to create branches or the rest api17:28
corvus(i have not reconfigured zuul yet -- i have halted work on that because this additional info about multiple projects being affected is weird)17:28
corvusfungi: any existing errors should show up in the config-errors page17:28
clarkbstarlingx/vault-armada-app too17:28
clarkbso at least three?17:28
fungiyeah, and https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/utilities doesn't show a r/stx.10.0 tab either17:28
clarkbI wonder if they are scripting the branch creation and they are all being created in a very short window and zuul is missing them all due to something else happening in that time period?17:29
fungiwhich i'd at least expect if zuul were going to try to load configuration from it (but maybe only branches it successfully loads configuration from show up there?)17:29
corvusfungi: it shows live config, so if it's not there it's not loaded17:29
corvuswhen were the branches created?17:30
fungiclarkb: we have in the past seen bulk branch creation events from openstack projects also end up with missed events until zuul gets told to reload from the repository states17:30
clarkbhttps://opendev.org/starlingx/tools/src/branch/master/release/branch-repo.sh looks suspicious17:30
clarkbI don't see the use of the api there but maybe git push --tags means you don't need that?17:32
clarkbslittle: ^ is that how you create the branches?17:33
clarkbcorvus: if that script was used it also pushes the gitreview update so would've been created January 817:34
clarkbcorvus: per https://review.opendev.org/c/starlingx/utilities/+/938743 for starlingx/utilities17:34
fungii have a feeling it may be cruft, that script was last touched almost 5 years ago17:34
clarkbcould be17:35
fungior, otherwise, it's been doing it this way for years17:35
corvusdebug.log.15.gz:2025-01-08 23:57:52,072 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Submitting tenant reconfiguration event for openstack due to event <GerritTriggerEvent ref-updated opendev.org/starlingx/utilities r/stx.10.0> in project starlingx/utilities, ltime 189041797710717:36
corvusoh sorry; the adjacent lines indicate that was the creation event17:37
clarkb2025-01-08 23:59:13,377 ERROR zuul.TenantParser:   KeyError: 'r/stx.10.0'17:37
clarkbthere are also errors like ^17:37
clarkbperhaps in recursive lookups for that branch in parent jobs etc?17:38
clarkb(just thinking out loud that maybe branch creation failed because the order of the projects matters and it was wrong here?)17:38
priteauIs docs.openstack.org being hammered by bots like Git was the other day? It feels much slower than usual17:40
clarkbpriteau: no system load looks fine. It could be something with afs I suppose but dmesg doesn't show any recent afs complaints17:42
opendevreviewMerged opendev/system-config master: Reapply "Pull lodgeit from quay.io"  https://review.opendev.org/c/opendev/system-config/+/93976717:44
clarkb/var/cache/openafs and /var/cache/openafs-client seem empty17:44
clarkbmaybe those paths don't do what I think they do17:44
clarkb/var/cache/openafs is what the mirrors use for caching openafs17:45
clarkb/opt/cache/openafs is the cache path on status17:46
clarkb*static17:46
clarkband it is not empty. So we do have a cache that appears to be working on the surface17:46
clarkband docs.openstack.org does seem responsive to me now.17:47
priteauyes, it's better now17:48
fungii can cat html files in /afs/openstack.org/docs/ from static.o.o just fine too17:48
fungino lag/delay17:49
slittlehttps://review.opendev.org/c/starlingx/utilities/+/938743 is pretty minimal now.  zuul still failing silently17:51
slittleI guess I can comment out EVERYTHING17:52
clarkbslittle: see the discussion above17:52
slittlewill do ... jumping between threads17:52
fungislittle: i did basically the same in various iterations already in https://review.opendev.org/c/starlingx/utilities/+/940048 and since then we've noticed that zuul doesn't seem to be aware of the branch itself in that repo (and several others created around the same time)17:53
clarkbcorvus: against the wall clock zuul-jobs, utilities and vault-armada-app seem to be showing up in zuul on the 8th very close to each other17:53
clarkbcorvus: and for other projects I see Loading configuration from starlingx/ptp-notification-armada-app/.zuul.yaml@r/stx.10.0 logs but not for utilities at least17:54
slittlegit push ${review_remote} ${branch}:${branch}17:55
slittlegit push ${review_remote} ${tag}:${tag}17:55
fungislittle: are you doing it in quick succession for many repos in a batch, or ad hoc at different times?17:56
clarkbhrm for vault-armada-app I don't see a log showing the branch creation just the tag creation17:56
slittleit's scripted here ... https://opendev.org/starlingx/root/src/branch/master/build-tools/branching17:57
clarkb(it is possible I'm just not looking for the correct log messages)17:58
slittlepush_branches_tags.sh will iterate over all our repos creating the branches, as well as tags marking the point of divergence from master17:59
clarkboh yup zuul01 seems to have handled them and i was looking at zuul02. ok less confused about those missing right now18:00
slittlenote the with_retries function, as the failure rate is rather high18:01
clarkbslittle: do you know what fails?18:01
clarkbbecause one thing I notice is that I actually see ref-updated events for the same refs several times over the course of just over an hour18:02
clarkbmakes me wonder if you're actually succeeding and repushing unnecessarily18:02
clarkb2025-01-08 23:58:03,362 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Trigger event minimum reconfigure ltime of 1890417977107 newer than current reconfigure ltime of 1890417970413, aborting early18:05
clarkbcorvus: ^ could that explain it maybe?18:05
corvusnope18:05
clarkbthat event id maps to the ref-updated for r/stx.10.0 branch creation on zuul0118:05
corvusi'm digging through some stuff, give me a few18:05
clarkback18:06
clarkbfyi paste did update to pull from quay and this test paste worked for me: https://paste.opendev.org/show/bcxk1PowxB9ueNuJ8U2j/18:07
clarkbI was able to open an old paste too, picked randomly out of my browser history18:08
clarkbinfra-root I have removed the WIP from https://review.opendev.org/c/opendev/system-config/+/939128 after checking paste02 backups against the vexxhost backup server using borg mount18:14
clarkbhead -4 against the mariadb dump shows a mariadb dump header and /etc of the filesystem dump appeared to have things I expected in it (like etc/hosts matched the paste02 content)18:14
clarkbI have unmounted things but others can feel free to remount and investigate as part of their reviews of that change if they wish18:15
opendevreviewyatin proposed openstack/project-config master: [Neutron] Update dashboard with latest job renames  https://review.opendev.org/c/openstack/project-config/+/94006518:24
clarkbdepending on how zuul debugging goes I'll plan to approve the gerrit 3.10.4 update before lunch so that we can update to it after our meetup18:25
clarkbif we continue to have problems with ^ today I'll probably shift gears tomorrow and start trying to reduce our reliance on docker as much as possible18:25
clarkboh also I think tonyb's use ipv4 for docker hack might be a good idea in our mirror jobs as I have seen some of them fail and I'm pretty sure that is due to the quota limits which are made worse by ipv6 block handling18:26
slittleit's the branch push that fails.  I don't have logs any longer.18:38
corvusslittle: do you have any record of any push failures from that branch creation?18:39
slittleAll I know is that the first half dozen or so go through fine, then I'll see several failures in a row as if something is either overloaded, or I've triggered a DOS protection mechanism.18:40
fungiusually if i'm doing bulk operations pushing to gerrit, i add a delay between each push (something like `sleep 3` in my loop)18:40
corvusstarlingx/utilities was one of the last events i see, so there is a correlation there.  one hypothesis i have is that gerrit did not return the new branch in the api call that zuul uses to list branches after the push event.18:41
corvusi wonder if there's a possibility the branch listing api call uses the result of an async operation which is backlogged18:42
slittlemight be in luck ... I still have terminal history18:42
slittleRunning: git review -s18:43
slittlessh://slittle1@review.opendev.org:29418/starlingx/app-gen-tool.git did not work. Description: ssh_exchange_identification: read: Connection reset by peer18:43
slittlefatal: Could not read from remote repository.18:43
slittlesucceeded on the next retry after 45 sec18:44
fungiwe *do* actually have some "ddos protection" on and in gerrit, in particular a limit on the number of concurrent tcp connections allowed to the ssh api socket from the same ip address, and a limit on the number of concurrent open sessions for the same gerrit account, but they're in the 100+ range18:44
clarkbwhich could be hit more easily if behind NAT18:45
clarkbor if running things in parallel18:45
slittleI have a second failure from 'git review -s' ... slightly different18:46
slittleRunning: git review -s18:46
slittleProblem running 'git remote update gerrit'18:46
slittleFetching gerrit18:46
slittlessh_exchange_identification: read: Connection reset by peer18:46
slittlefatal: Could not read from remote repository.18:46
slittlePlease make sure you have the correct access rights18:46
slittleand the repository exists.18:46
slittleerror: Could not fetch gerrit18:46
fungithose could be due to our overload protections, or gerrit getting overwhelmed, but could also indicate issues with the network anywhere between the client and server (e.g. a middlebox closing the session due to a full state tracking table or packet shaper limiting bandwidth utilization or security appliance freaking out over a false positive in the stream)18:49
slittlesorry, these errors appear to be from the create_branches_and_tags.sh script, not push_branches_tags.sh18:50
corvus[2025-01-08T23:57:08.644Z] [SSH git-receive-pack /starlingx/utilities.git (slittle1)] INFO  com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 17]" ]18:51
fungiintroducing a delay between passes through the loop would be good to help even out the load it puts on the gerrit server anyway18:51
corvusi'm not sure what to make of that log entry.  that is at the moment that zuul received the branch create event (zuul saw it 1 second later)18:51
slittletag push error ...18:53
slittlegit push gerrit vr/stx.10.018:53
slittlessh_exchange_identification: read: Connection reset by peer18:53
slittlefatal: Could not read from remote repository.18:53
slittlePlease make sure you have the correct access rights18:53
slittleand the repository exists.18:53
corvuswe are one day too late to look at the gerrit http access logs.... unless we back them up?18:53
clarkbcorvus: we should. https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#restore-from-backup18:54
clarkbI think you can borg mount from a prior day on the host itself and navigate to the log file via fuse and view it that way. Then when you are done don't forget to run the borg umount command (it is emitted by the borg-mount script when you run it)18:55
fungilooking at example MultiProgressMonitor log examples including a ratelimit_period context, it looks like that may be how it tries to avoid spamming the service log18:55
corvusmost of the work gerrit seems to be doing is for ai bon crawlers.18:55
corvusbot.  not bon.  very much not bon.18:55
fungitres mal18:55
clarkbcorvus: I suspect that is true for the majority of the internet now18:55
slittleroughly 10 git push errors on tags18:55
corvuswe should maybe keep the same number of days of logs for gerrit and zuul :)18:56
slittlehere is one trying to set up a gerrit review of a .gitreview file .... git review --yes --topic=r.stx.10.018:57
slittleProblem running 'git remote update gerrit'18:57
slittleFetching gerrit18:57
slittlessh_exchange_identification: read: Connection reset by peer18:57
slittlefatal: Could not read from remote repository.18:57
clarkbcorvus: look in review_site/logs18:57
fungiit appears we've asked this question before too: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-07-23.log.html#t2024-07-23T12:35:3518:57
clarkbcorvus: we have http logs there for the gerrit http server18:57
corvusclarkb: oh hey cool, i was looking at apache18:58
clarkbya its possible we may still need apache side logs but maybe this is sufficient18:58
corvusit has the byte count...18:58
corvusi'm going to try to deduce the contents of the api request from the byte count.  it's a long shot, but it's the only echo of what happened.18:59
slittleroughly 5 identical 'git review' failures on that same run18:59
clarkbslittle: fungi I suspect that connection reset by peer as part of exchange identification would be the user limit19:00
clarkbthe tcp connection limit would occur earlier in the ssh connection setup and exchanging identification is probably the earliest point that gerrit can determine who you are to apply the user limit?19:01
fungitrue, if it were the connection limit it would be something like an icmp admin-prohibited response19:01
clarkbso either you've got 96 concurrent connections or they are not shutting down cleanly (we've seen this with some clients but I want to say it was really old paramiko not openssh) or maybe your firewalls are not cleaning things up?19:02
clarkbonce upon a time the stackalytics crew would just create a new account when they hit the limit rather than figure out how to clean up after themselves19:03
slittle60 gits... for each at least four transactions ... git review -s, git push <branch>, git push <tag>, git review19:03
fungior the socket shutdown takes too long and is happening in parallel with new connections accumulating19:03
clarkbslittle: and git review -s is multiple transactions that are papered over for you. Including a test probe, a fetch of the commit message hook if running an older version, and a login check19:04
fungiit's possible the tcp/ip stack tries to handle socket shutdown asynchronously19:04
clarkbfungi: ya that could be19:04
clarkbso ya I suspect fungi's simple idea of slowing things down a bit may be helpful here19:04
clarkbprobably enough to wait between repos and not between every operation19:05
fungiright, if you're performing several operations in the loop, then maybe throw a `sleep 10` in there and see if the errors mostly/entirely disappear19:06
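The pacing idea above can be sketched roughly as follows: push the branch and the tag for each repository, retry a failed push after a wait, and pause once between repositories rather than between every operation. The `git push <remote> <ref>:<ref>` form is taken from the commands quoted earlier in the discussion; the function names, retry count, and delay values are assumptions for illustration and are not the contents of the actual StarlingX scripts.

```python
import subprocess
import time


def run_with_retries(cmd, cwd, attempts=3, retry_delay=45,
                     run=subprocess.run, sleep=time.sleep):
    """Run cmd in cwd, retrying on a non-zero exit code. Returns True on success."""
    for attempt in range(1, attempts + 1):
        result = run(cmd, cwd=cwd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            sleep(retry_delay)  # back off before the next attempt
    return False


def push_all(repos, remote, branch, tag, pause=10,
             run=subprocess.run, sleep=time.sleep):
    """Push branch and tag for each repo; return the (repo, ref) pairs that failed."""
    failed = []
    for index, repo in enumerate(repos):
        for ref in (branch, tag):
            cmd = ["git", "push", remote, f"{ref}:{ref}"]
            if not run_with_retries(cmd, repo, run=run, sleep=sleep):
                failed.append((repo, ref))
        if index < len(repos) - 1:
            sleep(pause)  # one pause per repository, not per operation
    return failed
```

The `run` and `sleep` parameters exist only so the loop can be exercised without touching a real remote.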
slittleI believe we tried that previously, with a 10-15s delay between repos.  It made little difference19:10
fungiin that case the problem may not be originating from the gerrit server itself. where does the script run?19:11
slittlea workstation within the WindRiver corp network19:11
fungiokay, so could in theory be going through security proxies, overtaxed firewalls, oversubscribed network links, who knows what else19:12
fungiany of which have the potential to result in those exact errors19:12
slittlethe thing is ... 'repo sync' iterates over those same 60 gits without issue.   We only see issues when it's gerrit that we are interacting with in a loop19:13
clarkblooking at gerrit sshd logs you are using an old git review and it is fetching the commit message hook each time19:14
clarkbso that would be one optimization to try19:14
fungiis 'repo sync' doing it over ssh on a nonstandard tcp port, or something more normal like https?19:14
clarkbI also see 523 logins from you over the course of that day in two blocks of time19:15
clarkbwhich is well above our 96 limit so ya if anything is slow to close connections (regardless of where that is happening) then you could hit the limit19:15
fungimight be interesting to try running it from another network some time and see if the error rate is any better19:15
clarkbah the later block of time is spread over the 9th too19:16
slittle'repo' is a google tool for managing multi-git projects.   it uses git commands under the hood.  Our manifest is configured to pull from https://opendev.org/starlingx/*19:17
clarkblooks like 425 logins over the course of the hour that straddles the 8th and 9th19:17
clarkboh you use repo19:17
clarkbit is likely that repo is reusing connections19:17
clarkbso you end up with a single login19:17
clarkbor at least far fewer than 42519:17
clarkbhttps://gerrit.googlesource.com/git-repo/+/refs/tags/v2.51/git_ssh confirmed it uses ssh control persistence19:19
fungiwell, `repo` is using an https remote to our gitea haproxy, not ssh over 29418/tcp to our gerrit server, sounds like19:19
clarkbthat would be another option open to you for your script19:19
clarkbfungi: oh ya the url above is gitea19:19
clarkbanyway repo will reuse ssh connections too19:19
fungihaving managed lots of corporate security and network hardware in a past life, i can say without hesitation that there's plenty of stuff in the typical corporate network that will treat/handle those differently19:20
fungithese days, https connections are heavily optimized, for example19:21
clarkbmy suggestions would be to 1) update git review so that you don't need to fetch the commit message hook and 2) try ssh control persistence19:21
clarkbslittle: ^19:21
clarkbalso I liked fungi's idea of trying from a different network source19:21
fungialso 3. try to do more over https instead of ssh19:22
corvusclarkb: fungi slittle i don't quite have enough info to prove what happened; i think there are too many other changes to info/refs for me to be able to assume the contents based on length alone (or, at least, to fully reconstruct it may take a very long time).19:29
corvushowever, we do know that zuul did query gerrit to get the list of branches (we see that in the logs), and i thought if there was an error in zuul, it was most likely to be that it didn't query at all.  so at this point, we're looking at these choices:19:29
corvus1) gerrit did not include the new branch in info/refs when zuul queried it; or 2) something completely unexpected caused zuul to use the wrong data or fail to update its cache.19:30
corvuswe can't prove either at this point, but i'm leaning toward 119:31
fungicorvus: this seems, at least on the surface, similar to past incidents where the openstack release team has performed bulk stable branch creation around release time and some branches have ended up not getting jobs run until a full reconfig of zuul (which we used to do more often)19:31
corvusi'm going to add a log line to zuul that would disambiguate this in the future, so if it is #2, we have enough confidence to start looking for weird things.19:31
corvusfungi: ++19:31
corvusand for now, i think we should just reconfigure :)19:32
fungii agree it seems possible that gerrit's state is "eventually consistent" and that querying it for a branch may not work immediately after creation-related events are streamed, especially if it's been asked to create a lot of them in a short span of time and/or is under unrelated load19:32
fungithat is to say, it would not surprise me to learn that emission of ref-updated events isn't held back waiting for branch creation to make it all the way down to the repositories19:34
clarkbcorvus: sounds good and thank you for digging into it19:34
clarkbfwiw I find only one killed ssh connection logged from slittle on the 8th19:34
clarkbwas for git-receive-pack./starlingx/portieris-armada-app.git19:34
clarkbmakes me wonder if we're not logging the early connection stop for hitting the limits19:35
clarkbbut I've been digging around in gerrit source to try and figure out how that is implemented and haven't found it yet19:35
corvusyeah, the concerning part of my theory is that it would imply that one git operation (push) would not be reflected in a subsequent git operation (info/refs).  that seems super unlikely.  so one of two unlikely bugs.  we need to see if it has stripes to know whether it's a zebra.  :)19:35
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/940071 Add debug log when fetching Gerrit branches [NEW]19:37
corvus#status log issued zuul tenant-reconfigure for openstack to pick up missing starlingx branches19:38
opendevstatuscorvus: finished logging19:38
corvusit might be a while before ^ takes effect19:38
clarkbok gerrit does have an explicit log at warning for max connections reached19:39
corvus(it will have finished when the status page says the last reconfigure is after 19:39 utc)19:39
clarkbI don't see that in error_log or sshd_log for the 8th or 9th but gerrit logging is sufficiently complicated I'm not convinced that didn't happen19:40
clarkbwith that "resolved" any objection to me trying to land the gerrit 3.10.4 update again?19:40
corvusnone here; i'm mostly afk until meetup tho19:41
clarkbya me too. I need lunch and don't think we'll try to restart until after the meetup anyway19:41
clarkbI'll hit +A19:42
clarkbor actually reenqueue it again since that avoids needing clean check19:42
fungii'm around and can keep an eye on it19:44
clarkbI think it is in the trigger queue which is waiting on the reconfigure19:44
clarkbbut I did run the command to enqueue it19:44
fungicool19:44
fungithanks!19:44
tonybfrickler: Answering your question from openstack-dev here so it's more generally visible.  For issues with "Software Factory CI" 3rd-party CI you can ping myself [UTC+1000] or dpawlik [UTC+0100], with the latter being my preference as they're more likely to deal with it quickly19:56
slittlefor ssh control persistence to review.openstack.org ... do I want the 'user' to be 'git' or my user name?20:01
clarkbslittle: your user name. Gerrit doesn't use a shared account like github20:16
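A minimal sketch of the ssh control persistence suggestion above, assuming OpenSSH on the client side. The helper name `gerrit_ssh_argv`, the socket location under `~/.ssh`, the 10 minute persist time, and the placeholder username are illustrative assumptions, not anything taken from slittle's script. The host and port are the Gerrit ssh endpoint discussed in this channel (29418/tcp), and the login is the personal Gerrit username, per clarkb's answer.

```python
import os
import shlex


def gerrit_ssh_argv(username, command, host="review.opendev.org", port=29418,
                    persist="10m", control_dir="~/.ssh"):
    """Build an ssh argv that shares one master connection across calls.

    ControlMaster=auto makes the first call open a master connection and
    later calls reuse it; ControlPersist keeps the master alive for a while
    after the last session exits. Defaults here are assumptions.
    """
    # %r, %h and %p are expanded by ssh to remote user, host and port.
    control_path = os.path.join(os.path.expanduser(control_dir), "cm-%r@%h:%p")
    return [
        "ssh",
        "-p", str(port),
        "-o", "ControlMaster=auto",
        "-o", f"ControlPath={control_path}",
        "-o", f"ControlPersist={persist}",
        f"{username}@{host}",
        *shlex.split(command),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(gerrit_ssh_argv("your-gerrit-username", "gerrit version")))
```

For git and git-review traffic the same three options (`ControlMaster`, `ControlPath`, `ControlPersist`) can instead go in a `Host` block in `~/.ssh/config`, so that every push in a loop over many repositories shares one login rather than opening a new connection each time.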
clarkbstill no r/stx.10.0 branch in https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/vault-armada-app and a recheck doesn't seem to have worked20:22
clarkbthe reconfiguration says 22 minutes ago so I think it completed20:22
clarkbso maybe this is still reproducible cc corvus (but finish lunch)20:22
clarkboh wait maybe I'm just impatient?20:23
clarkbjobs are enqueued now20:23
clarkband the branch is there, nevermind. this was me expecting things to run quicker and I just needed to wait another thirty seconds20:23
clarkbslittle: ^ fyi I think you can restore the starlingx/utilities change to its original state and it should hopefully work now too20:24
slittlewill do20:39
clarkbmeetup time21:01
tonybI'll be ~5mins late I lost track of time and need coffee21:02
clarkback21:03
opendevreviewClark Boylan proposed opendev/system-config master: Update grafana to 10.4.14  https://review.opendev.org/c/opendev/system-config/+/94007321:30
opendevreviewMerged opendev/system-config master: Update Gerrit images to 3.10.4 and 3.11.1  https://review.opendev.org/c/opendev/system-config/+/93916721:35
opendevreviewBrian Haley proposed zuul/zuul-jobs master: Update ensure-twine role  https://review.opendev.org/c/zuul/zuul-jobs/+/94007421:37
*** tosky_ is now known as tosky21:41
fungi#status notice The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes22:44
opendevstatusfungi: sending notice22:44
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes22:44
opendevstatusfungi: finished sending notice22:46
clarkbthe change to bump grafana to 10.4.14 did pass CI so now we need to check screenshots I guess23:09
clarkbfungi: one thing I still have on my todo list for today is a bindep release. Do you know if 940074 is needed for that?23:13
clarkbI'm not sure if the ensure-twine role proposal is related to the twine problems earlier23:13
clarkbthey linked to the issue but I'm not sure I understand why we need to ensure-twine in zuul-jobs if we are already installing it23:16
*** promethe- is now known as prometheanfire23:25
clarkbok confirmed that we can't release things using openstack release tooling right now23:25
clarkbthe latest issue is ensure-twine uses pip install --user which runs afoul of the no global installs under python3.12 on noble and test-release-openstack at least moved to noble23:26
clarkbthe change above updates things to use a virtualenv but now the problem is with testing.23:26
clarkbtest-release-openstack is defined in openstack/project-config so can't speculatively load the proposed zuul-jobs ensure-twine update23:26
clarkbI think the change itself may be fine though just having a hard time verifying it against test-release-openstack23:27
fungiclarkb: correct (sorry, was nabbing dinner)23:39
clarkbno problem. I reviewed bhaley's change and caught up on all the reasons for it and tried to provide that information in the review23:47
clarkber haleyb23:48
