opendevreview | Rodolfo Alonso proposed zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job https://review.opendev.org/c/zuul/zuul-jobs/+/939936 | 06:52 |
---|---|---|
opendevreview | Elod Illes proposed openstack/project-config master: Use ubuntu-noble for test-release-openstack https://review.opendev.org/c/openstack/project-config/+/939947 | 10:46 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:26 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:33 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 11:55 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 12:39 |
opendevreview | chandan kumar proposed openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 12:53 |
opendevreview | Merged zuul/zuul-jobs master: Block twine 6.1.0, breaking ``test-release-openstack`` CI job https://review.opendev.org/c/zuul/zuul-jobs/+/939936 | 14:12 |
opendevreview | Merged openstack/project-config master: Use ubuntu-noble for test-release-openstack https://review.opendev.org/c/openstack/project-config/+/939947 | 14:39 |
opendevreview | Merged openstack/project-config master: Add repository for devstack-plugin-prometheus https://review.opendev.org/c/openstack/project-config/+/939952 | 14:58 |
opendevreview | Merged openstack/project-config master: Fix release ACL for whitebox-tempest-plugin https://review.opendev.org/c/openstack/project-config/+/938887 | 14:59 |
clarkb | there is a held gerrit here: https://200.225.47.41/q/status:open+-is:wip for testing h2 cache stuff. I don't think I'm going to dive right into that though as I'd like to clean up some of the remaining paste/lodgeit effort first | 15:50 |
clarkb | to that end I've approved https://review.opendev.org/c/opendev/lodgeit/+/939385 to publish lodgeit images to quay and if that is happy I'll approve the change to pull from quay as well | 15:51 |
clarkb | fungi: did you want to push a bindep release or do we think the twine problems might impact that? | 15:52 |
clarkb | fungi: to look at podman package update restart hook behavior I'm at https://packages.ubuntu.com/noble/podman on the right pane there is a ...debian.tar.xz that is where I should look for package control and hook stuff right? | 15:54 |
fungi | clarkb: i'm hoping we'll get the twine situation sorted out first, but that could be within the next few hours | 15:56 |
fungi | clarkb: for podman packaging, yes but also see the further analysis and links i posted in the etherpad where we were discussing it. i don't see any indication it would try to restart running containers | 15:57 |
clarkb | ah ok, you've looked already, that's great. | 15:57 |
clarkb | fungi: in debian/rules I see stuff doing dh_installsystemd --name=podman-restart but that is all I can find so far | 15:58 |
clarkb | but I think that is installing systemd unit files? | 15:59 |
clarkb | and podman-restart is a utility to restart containers | 15:59 |
fungi | yeah, looks like it installs /etc/systemd/system/default.target.wants/podman-restart.service (you can check the one on paste) | 15:59 |
clarkb | https://docs.podman.io/en/v5.1.0/markdown/podman-restart.1.html | 15:59 |
clarkb | ya so it's a tool we could run, but it doesn't appear tied into the packaging itself, so this is great | 15:59 |
fungi | on stop it does `podman stop $(/usr/bin/podman container ls --filter restart-policy=always -q)` | 16:00 |
fungi | so it only affects containers with a restart-policy of "always" | 16:00 |
clarkb | fungi: that is most of our containers fwiw | 16:00 |
clarkb | what do you mean by `on stop`? | 16:01 |
fungi | Service.ExecStop= | 16:01 |
clarkb | gotcha systemd service stop. For which service? | 16:01 |
fungi | podman-restart.service | 16:01 |
clarkb | got it. | 16:01 |
fungi | so i guess we need to find whether podman-restart gets stopped/started/restarted on package updates | 16:02 |
clarkb | a naive grep -r podman-restart * inside the debian xz tar contents doesn't show anything obviously doing that | 16:02 |
fungi | its purpose seems to be more for making sure containers get started on boot | 16:02 |
clarkb | and ya I think that is what is ensuring things come up on boot for us | 16:03 |
fungi | so i'm not overly concerned, but i guess we should pay attention around podman package upgrades just to be sure | 16:03 |
clarkb | sounds good | 16:03 |
clarkb | and thanks again for taking an early look | 16:03 |
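For reference, a couple of hedged spot checks that could be run on a noble host to confirm the behavior discussed above; the unit name comes from the discussion, while the dpkg maintainer-script paths are standard locations and not something verified here:

```shell
# Show the installed unit, including the ExecStop command quoted above.
systemctl cat podman-restart.service

# See whether any of podman's maintainer scripts reference the unit
# (a rough check for whether package upgrades would stop/restart it).
grep -l podman-restart /var/lib/dpkg/info/podman.* 2>/dev/null
```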
clarkb | once I feel a bit more awake I'm also going to spot check backups for paste02 just to be more confident in them. Then I think we can probably land the change to retire paste01 backups | 16:06 |
clarkb | infra-root for general container image reliability I think we have two broad actions we can take: the first is updating mariadb to fetch from quay in all of our services (paste and gitea are done). This does restart the database so care needs to be taken. Then separately updating our Dockerfiles to pull dependencies for images we don't build (because we don't use them speculatively) as | 16:09 |
clarkb | well | 16:09 |
clarkb | at first I wanted to bulk move to our python-builder and python-base images on quay but realized we would lose speculative testing of updates to those images if we did so before moving to podman as the runtime. I do update lodgeit but we're on podman for paste now so I think that is fine | 16:10 |
fungi | and, to be clear, switching to podman requires updating to noble first yeah? | 16:11 |
clarkb | yes | 16:11 |
clarkb | or at least it does currently. It may be possible to get podman running with docker compose on older platforms but every time we've tried in the past it hasn't been workable for one reason or another | 16:12 |
clarkb | noble seems to be the first case where the debuntu world and the podman world have caught up to each other in a way that makes them work nicely | 16:12 |
clarkb | we might also decide we're ok with losing speculative testing of python-base and python-builder if we have some speculative coverage of them (for example via lodgeit or something else) | 16:13 |
clarkb | zuul uses them too | 16:13 |
clarkb | so maybe it's ok to accept a small amount of risk in updating those without speculative testing for say gerrit and whatever else as long as we lean on zuul and lodgeit for coverage | 16:14 |
clarkb | oh but zuul is in a different tenant so we don't get speculative testing there either? | 16:14 |
slittle | still can't get zuul to run on https://review.opendev.org/c/starlingx/utilities/+/938743 and https://review.opendev.org/c/starlingx/vault-armada-app/+/938744 | 16:21 |
clarkb | slittle: did you try my suggestion of pushing an update to the zuul config to force zuul to evaluate the config on that branch and report back errors? | 16:22 |
clarkb | I don't see evidence of that in the changes but maybe there was a different change pushed for that | 16:22 |
fungi | remote: https://review.opendev.org/c/starlingx/utilities/+/940048 DNM: See what happens when Zuul config is modified [WIP] [NEW] | 16:26 |
clarkb | looks like we're still not getting complaints from zuul and I'm not seeing it enqueue jobs either. I guess that hack to try and get zuul to post a response isn't valid | 16:28 |
slittle | A whitespace change to .zuul.yaml in https://review.opendev.org/c/starlingx/utilities/+/938743 had no effect | 16:30 |
clarkb | zuul02 reports the same no jobs for queue item in check that we saw with the kolla change overriding config (which meant there were no jobs there; still not sure why there are no jobs here) | 16:30 |
clarkb | the list of sources still doesn't seem to contain r/stx.10.0 though | 16:31 |
clarkb | could the problem be that we are ignoring the branch for some reason? | 16:31 |
clarkb | here we go | 16:32 |
clarkb | Configuration syntax error not related to change context. Error won't be reported. | 16:32 |
fungi | so probably still related to one or more of the remaining starlingx/zuul-jobs errors? | 16:33 |
clarkb | that is my best guess right now | 16:34 |
clarkb | there doesn't seem to be an associated traceback in the log or anything indicating what the error is | 16:34 |
slittle | i find it strange that a couple dozen other starlingx gits got past this for .gitreview update on r/stx.10.0 branch | 16:35 |
clarkb | slittle: they may not depend on the broken configuration in zuul so their zuul configs for the new branch loaded | 16:35 |
clarkb | the problem appears to be that because this is a new branch there is no existing config in zuul for it. When zuul goes to load the configs for this branch it cannot do so because there are errors. | 16:36 |
clarkb | I'm still trying to sort out what the errors are | 16:38 |
fungi | you mean beyond just looking at the list of errors at https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 | 16:38 |
clarkb | ya it would be nice to see a concrete link between starlingx/utilities and starlingx/zuul-jobs errors for example (like use of the bad nodeset or something) | 16:39 |
clarkb | one option may be to trim the .zuul.yaml down to just the linters job then build up from there until it breaks | 16:39 |
clarkb | making additional guesses the problem could be with the secret | 16:41 |
clarkb | zuul requires that secrets not be changed across branches and maybe this definition is different | 16:41 |
slittle | does zuul.conf support commenting out? | 16:42 |
fungi | it does | 16:42 |
fungi | 940048,1 is trimmed down to just the linters job and zuul still didn't enqueue or report errors on the change | 16:42 |
clarkb | fungi: 940048 has all the jobs in it | 16:43 |
fungi | sorry, meant ,2 | 16:43 |
clarkb | ah I need to f5 then | 16:43 |
fungi | i revised it | 16:43 |
clarkb | fungi: it needs a rebase on the latest parent patchset | 16:43 |
fungi | i'm going to try setting it to just the noop job for check next | 16:43 |
fungi | 940048,3 just uses the noop job in check, and no response from zuul | 16:48 |
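(For context, a hedged guess at what such a trimmed-down .zuul.yaml might look like; the actual contents of 940048 are not reproduced here:)

```yaml
- project:
    check:
      jobs:
        - noop
```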
clarkb | and we appear to still get the configuration error unrelated to the change | 16:49 |
clarkb | so the error must not be in the branch (and maybe not the project?) itself | 16:49 |
opendevreview | Merged opendev/lodgeit master: Reapply "Move lodgeit image publication to quay.io" https://review.opendev.org/c/opendev/lodgeit/+/939385 | 16:50 |
clarkb | however the error list only shows starlingx/zuul-jobs so maybe that is the source of the problem | 16:52 |
clarkb | starlingx/zuul-jobs defines starlingx-common-tox-linters starlingx-common-tox-pep8 and starlingx-common-tox-pylint which are suspiciously similar to the jobs in utilities (but they have different names). Mostly just calling this out because: why? | 16:54 |
fungi | 940048,4 is just the noop job and no parent | 16:54 |
clarkb | infra-root any objection to approving https://review.opendev.org/c/opendev/system-config/+/939767 now that lodgeit is being updated in quay: https://quay.io/repository/opendevorg/lodgeit?tab=tags&tag=latest ? | 16:56 |
fungi | i've gone ahead and approved that, but if anyone disagrees with you or the 3 existing +2 votes they have time to -2 or wip it | 16:57 |
frickler | added another +2 just in case ;) | 16:58 |
clarkb | fungi: your latest ps still has the same issue according to zuul02's debug log: 2025-01-23 16:54:56,092 INFO zuul.Pipeline.openstack.check: [e: 8221001fed414122b8e0fe1cdea30352] Configuration syntax error not related to change context. Error won't be reported. | 16:59 |
clarkb | which is really odd because what in that config can be wrong | 16:59 |
fungi | 940048,6 is just the noop job with no parent change and no topic | 17:00 |
fungi | on the wild theory that same-topic functionality is related to this | 17:00 |
clarkb | 2025-01-23 17:00:42,923 INFO zuul.Pipeline.openstack.check: [e: 33910d2518e74d8abb72886dbd146143] Configuration syntax error not related to change context. Error won't be reported. | 17:01 |
clarkb | same thing | 17:01 |
clarkb | I think that really points to project config elsewhere? | 17:01 |
fungi | yeah, except starlingx/utilities doesn't have any jobs added by project-config either (i just checked) | 17:02 |
clarkb | me too and I concur | 17:02 |
clarkb | it could be config in another branch in utilities that uses a branch matcher to apply to this branch maybe | 17:03 |
clarkb | or branch wide problems like the secret being redefined or something | 17:04 |
clarkb | I think that is the problem its the secret breaking the project config project wide | 17:04 |
clarkb | maybe | 17:04 |
clarkb | https://opendev.org/starlingx/utilities/src/branch/r/stx.5.0/.zuul.yaml#L44 != https://opendev.org/starlingx/utilities/src/branch/master/.zuul.yaml#L159 | 17:05 |
clarkb | oh but the secret has different names so that should be ok | 17:05 |
clarkb | corvus: is there a trick for finding unrelated errors in the zuul debug log? I'm looking at the source and wondering if we even log them at all (that might explain why I'm not seeing anything in the logs) | 17:11 |
clarkb | corvus: tldr is fungi pushed a very minimal zuul config: https://review.opendev.org/c/starlingx/utilities/+/940048 and zuul still reports there are unrelated config errors so they won't be reported | 17:12 |
clarkb | the project doesn't have any config in openstack/project-config which would imply the only config is in the project itself. Which has me at "its probably a problem in another branch impacting this new branch" | 17:14 |
clarkb | looks like adding f/caracal branch worked ~2 months ago | 17:15 |
clarkb | and this is the first new branch since then | 17:16 |
clarkb | https://review.opendev.org/c/starlingx/utilities/+/934896 | 17:16 |
fungi | 940048,7 sets pipeline.debug just to see if that can coax any details out | 17:17 |
fungi | (hopefully i got that right, i do it so rarely) | 17:17 |
clarkb | using that timeframe I'm looking at https://review.opendev.org/q/project:starlingx/utilities+status:merged to see what has gone into the project since November 12 | 17:17 |
corvus | clarkb: i'm going through 103 lines of scrollback, give me a minute. | 17:17 |
clarkb | corvus: ack thanks | 17:17 |
clarkb | none of the merged changes to the project since November 12 touch zuul config | 17:18 |
corvus | clarkb: if i'm following correctly, the problem is that we expect starlingx/zuul-jobs to have a project config on branch r/stx.10.0 -- that can be seen here: https://opendev.org/starlingx/zuul-jobs/src/branch/r/stx.10.0/zuul.d/project.yaml but there is no configuration visible in zuul at https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/zuul-jobs for that branch and there are no errors in the config page | 17:23 |
corvus | config error page | 17:23 |
corvus | clarkb: that sounds like zuul is unaware of the branch. presumably we're not excluding the branch in main.yaml, which suggests that zuul may have just missed the branch creation. we can fix that with a reconfiguration. | 17:24 |
corvus | i'll trigger a reconfiguration of the openstack tenant | 17:25 |
clarkb | corvus: the project is starlingx/utilities but the rest of it makes sense to me | 17:26 |
clarkb | zuul-jobs may be in the same boat too | 17:26 |
clarkb | ya that repo has a r/stx.10.0 branch too so likely in the same boat | 17:26 |
corvus | it is surprising for it to have missed two. | 17:26 |
corvus | is there something unusual about how those branches were created. | 17:26 |
corvus | ? | 17:26 |
clarkb | slittle: ^ how are you creating the branches? | 17:27 |
fungi | maybe the reason https://review.opendev.org/c/starlingx/utilities/+/940048 isn't working though is that the as-created state of the r/stx.10.0 branch in starlingx/utilities refers back to starlingx/zuul-jobs which has errors, even though the proposed change would remove all association with it? | 17:27 |
clarkb | corvus: starlingx-release has gerrit create perms but not force push as far as I can tell so they should be using either the gerrit web ui to create branches or the rest api | 17:28 |
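As an aside, a hedged example of the Gerrit REST call for branch creation that clarkb mentions; the project, branch, and credential shown are illustrative only:

```shell
curl -u slittle1:HTTP_PASSWORD -X PUT \
  -H 'Content-Type: application/json' \
  -d '{"revision": "master"}' \
  'https://review.opendev.org/a/projects/starlingx%2Futilities/branches/r%2Fstx.10.0'
```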
corvus | (i have not reconfigured zuul yet -- i have halted work on that because this additional info about multiple projects being affected is weird) | 17:28 |
corvus | fungi: any existing errors should show up in the config-errors page | 17:28 |
clarkb | starlingx/vault-armada-app too | 17:28 |
clarkb | so at least three? | 17:28 |
fungi | yeah, and https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/utilities doesn't show a r/stx.10.0 tab either | 17:28 |
clarkb | I wonder if they are scripting the branch creation and they are all being created in a very short window and zuul is missing them all due to something else happening in that time period? | 17:29 |
fungi | which i'd at least expect if zuul were going to try to load configuration from it (but maybe only branches it successfully loads configuration from show up there?) | 17:29 |
corvus | fungi: it shows live config, so if it's not there it's not loaded | 17:29 |
corvus | when were the branches created? | 17:30 |
fungi | clarkb: we have in the past seen bulk branch creation events from openstack projects also end up with missed events until zuul gets told to reload from the repository states | 17:30 |
clarkb | https://opendev.org/starlingx/tools/src/branch/master/release/branch-repo.sh looks suspicious | 17:30 |
clarkb | I don't see the use of the api there but maybe git push --tags means you don't need that? | 17:32 |
clarkb | slittle: ^ is that how you create the branches? | 17:33 |
clarkb | corvus: if that script was used it also pushes the gitreview update so would've been created January 8 | 17:34 |
clarkb | corvus: per https://review.opendev.org/c/starlingx/utilities/+/938743 for starlingx/utilities | 17:34 |
fungi | i have a feeling it may be cruft, that script was last touched almost 5 years ago | 17:34 |
clarkb | could be | 17:35 |
fungi | or, otherwise, it's been doing it this way for years | 17:35 |
corvus | debug.log.15.gz:2025-01-08 23:57:52,072 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Submitting tenant reconfiguration event for openstack due to event <GerritTriggerEvent ref-updated opendev.org/starlingx/utilities r/stx.10.0> in project starlingx/utilities, ltime 1890417977107 | 17:36 |
corvus | oh sorry; the adjacent lines indicate that was the creation event | 17:37 |
clarkb | 2025-01-08 23:59:13,377 ERROR zuul.TenantParser: KeyError: 'r/stx.10.0' | 17:37 |
clarkb | there are also errors like ^ | 17:37 |
clarkb | perhaps in recursive lookups for that branch in parent jobs etc? | 17:38 |
clarkb | (just thinking out loud that maybe branch creation failed because the order of the projects matters and it was wrong here?) | 17:38 |
priteau | Is docs.openstack.org being hammered by bots like Git was the other day? It feels much slower than usual | 17:40 |
clarkb | priteau: no system load looks fine. It could be something with afs I suppose but dmesg doesn't show any recent afs complaints | 17:42 |
opendevreview | Merged opendev/system-config master: Reapply "Pull lodgeit from quay.io" https://review.opendev.org/c/opendev/system-config/+/939767 | 17:44 |
clarkb | /var/cache/openafs and /var/cache/openafs-client seem empty | 17:44 |
clarkb | maybe those paths don't do what I think they do | 17:44 |
clarkb | /var/cache/openafs is what the mirrors use for caching openafs | 17:45 |
clarkb | /opt/cache/openafs is the cache path on status | 17:46 |
clarkb | *static | 17:46 |
clarkb | and it is not empty. So we do have a cache that appears to be working on the surface | 17:46 |
clarkb | and docs.openstack.org does seem responsive to me now. | 17:47 |
priteau | yes, it's better now | 17:48 |
fungi | i can cat html files in /afs/openstack.org/docs/ from static.o.o just fine too | 17:48 |
fungi | no lag/delay | 17:49 |
slittle | https://review.opendev.org/c/starlingx/utilities/+/938743 is pretty minimal now. zuul still failing silently | 17:51 |
slittle | I guess I can comment out EVERYTHING | 17:52 |
clarkb | slittle: see the discussion above | 17:52 |
slittle | will do ... jumping between threads | 17:52 |
fungi | slittle: i did basically the same in various iterations already in https://review.opendev.org/c/starlingx/utilities/+/940048 and since then we've noticed that zuul doesn't seem to be aware of the branch itself in that repo (and several others created around the same time) | 17:53 |
clarkb | corvus: against the wall clock zuul-jobs, utilities and vault-armada-app seem to be showing up in zuul on the 8th very close to each other | 17:53 |
clarkb | corvus: and for other projects I see Loading configuration from starlingx/ptp-notification-armada-app/.zuul.yaml@r/stx.10.0 logs but not for utilities at least | 17:54 |
slittle | git push ${review_remote} ${branch}:${branch} | 17:55 |
slittle | git push ${review_remote} ${tag}:${tag} | 17:55 |
fungi | slittle: are you doing it in quick succession for many repos in a batch, or ad hoc at different times? | 17:56 |
clarkb | hrm for vault-armada-app I don't see a log showing the branch creation just the tag creation | 17:56 |
slittle | it's scripted here ... https://opendev.org/starlingx/root/src/branch/master/build-tools/branching | 17:57 |
clarkb | (it is possible I'm just not looking for the correct log messages) | 17:58 |
slittle | push_branches_tags.sh will iterate over all our repos creating the branches, as well as tags marking the point of divergence from master | 17:59 |
clarkb | oh yup zuul01 seems to have handled them and i was looking at zuul02. ok less confused about those missing right now | 18:00 |
slittle | note the with_retries function, as the failure rate was rather high | 18:01 |
clarkb | slittle: do you know what fails? | 18:01 |
clarkb | because one thing I notice is that I actually see ref-updated events for the same refs several times over the course of just over an hour | 18:02 |
clarkb | makes me wonder if you're actually succeeding and repushing unnecessarily | 18:02 |
clarkb | 2025-01-08 23:58:03,362 DEBUG zuul.Scheduler: [e: be11c47d033f4bcdb45b54ede64d8d23] Trigger event minimum reconfigure ltime of 1890417977107 newer than current reconfigure ltime of 1890417970413, aborting early | 18:05 |
clarkb | corvus: ^ could that explain it maybe? | 18:05 |
corvus | nope | 18:05 |
clarkb | that event id maps to the ref-updated for r/stx.10.0 branch creation on zuul01 | 18:05 |
corvus | i'm digging through some stuff, give me a few | 18:05 |
clarkb | ack | 18:06 |
clarkb | fyi paste did update to pull from quay and this test paste worked for me: https://paste.opendev.org/show/bcxk1PowxB9ueNuJ8U2j/ | 18:07 |
clarkb | I was able to open an old paste too, picked randomly out of my browser history | 18:08 |
clarkb | infra-root I have removed the WIP from https://review.opendev.org/c/opendev/system-config/+/939128 after checking paste02 backups against the vexxhost backup server using borg mount | 18:14 |
clarkb | head -4 against the mariadb dump shows a mariadb dump header and /etc of the filesystem dump appeared to have things I expected in it (like etc/hosts matched the paste02 content) | 18:14 |
clarkb | I have unmounted things but others can feel free to remount and investigate as part of their reviews of that change if they wish | 18:15 |
opendevreview | yatin proposed openstack/project-config master: [Neutron] Update dashboard with latest job renames https://review.opendev.org/c/openstack/project-config/+/940065 | 18:24 |
clarkb | depending on how zuul debugging goes I'll plan to approve the gerrit 3.10.4 update before lunch so that we can update to it after our meetup | 18:25 |
clarkb | if we continue to have problems with ^ today I'll probably shift gears tomorrow and start trying to reduce our reliance on docker as much as possible | 18:25 |
clarkb | oh also I think tonyb's use ipv4 for docker hack might be a good idea in our mirror jobs as I have seen some of them fail and I'm pretty sure that is due to the quota limits which are made worse by ipv6 block handling | 18:26 |
slittle | it's the branch push that fails. I don't have logs any longer. | 18:38 |
corvus | slittle: do you have any record of any push failures from that branch creation? | 18:39 |
slittle | All I know is that the first half dozen or so go through fine, then I'll see several failures in a row as if something is either overloaded, or I've triggered a DOS protection mechanism. | 18:40 |
fungi | usually if i'm doing bulk operations pushing to gerrit, i add a delay between each push (something like `sleep 3` in my loop) | 18:40 |
corvus | starlingx/utilities was one of the last events i see, so there is a correlation there. one hypothesis i have is that gerrit did not return the new branch in the api call that zuul uses to list branches after the push event. | 18:41 |
corvus | i wonder if there's a possibility the branch listing api call uses the result of an async operation which is backlogged | 18:42 |
slittle | might be in luck ... I still have terminal history | 18:42 |
slittle | Running: git review -s | 18:43 |
slittle | ssh://slittle1@review.opendev.org:29418/starlingx/app-gen-tool.git did not work. Description: ssh_exchange_identification: read: Connection reset by peer | 18:43 |
slittle | fatal: Could not read from remote repository. | 18:43 |
slittle | succeeded on the next retry after 45 sec | 18:44 |
fungi | we *do* actually have some "ddos protection" on and in gerrit, in particular a limit on the number of concurrent tcp connections allowed to the ssh api socket from the same ip address, and a limit on the number of concurrent open sessions for the same gerrit account, but they're in the 100+ range | 18:44 |
clarkb | which could be hit more easily if behind NAT | 18:45 |
clarkb | or if running things in parallel | 18:45 |
slittle | I have a second failure from 'git review -s' ... slightly different | 18:46 |
slittle | Running: git review -s | 18:46 |
slittle | Problem running 'git remote update gerrit' | 18:46 |
slittle | Fetching gerrit | 18:46 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:46 |
slittle | fatal: Could not read from remote repository. | 18:46 |
slittle | Please make sure you have the correct access rights | 18:46 |
slittle | and the repository exists. | 18:46 |
slittle | error: Could not fetch gerrit | 18:46 |
fungi | those could be due to our overload protections, or gerrit getting overwhelmed, but could also indicate issues with the network anywhere between the client and server (e.g. a middlebox closing the session due to a full state tracking table or packet shaper limiting bandwidth utilization or security appliance freaking out over a false positive in the stream) | 18:49 |
slittle | sorry, these errors appear to be from the create_branches_and_tags.sh script, not push_branches_tags.sh | 18:50 |
corvus | [2025-01-08T23:57:08.644Z] [SSH git-receive-pack /starlingx/utilities.git (slittle1)] INFO com.google.gerrit.server.git.MultiProgressMonitor : Processing changes: refs: 1 [CONTEXT ratelimit_period="1 MINUTES [skipped: 17]" ] | 18:51 |
fungi | introducing a delay between passes through the loop would be good to help even out the load it puts on the gerrit server anyway | 18:51 |
corvus | i'm not sure what to make of that log entry. that is at the moment that zuul received the branch create event (zuul saw it 1 second later) | 18:51 |
slittle | tag push error ... | 18:53 |
slittle | git push gerrit vr/stx.10.0 | 18:53 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:53 |
slittle | fatal: Could not read from remote repository. | 18:53 |
slittle | Please make sure you have the correct access rights | 18:53 |
slittle | and the repository exists. | 18:53 |
corvus | we are one day too late to look at the gerrit http access logs.... unless we back them up? | 18:53 |
clarkb | corvus: we should. https://docs.opendev.org/opendev/system-config/latest/sysadmin.html#restore-from-backup | 18:54 |
clarkb | I think you can borg mount from a prior day on the host itself and navigate to the log file via fuse and view it that way. Then when you are done don't forget to run the borg umount command (it is emitted by the borg-mount script when you run it) | 18:55 |
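For anyone following along, a minimal sketch of that restore workflow using plain borg; the repository path, archive name, and log path are placeholders rather than the real values:

```shell
# List archives, then FUSE-mount the one covering the day in question.
borg list /path/to/backup/repo
mkdir -p /tmp/borg-restore
borg mount /path/to/backup/repo::example-archive-2025-01-22 /tmp/borg-restore

# Browse to the rotated Gerrit logs inside the mounted archive, then clean up.
less /tmp/borg-restore/<path-to-backup-root>/review_site/logs/httpd_log
borg umount /tmp/borg-restore
```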
fungi | looking at example MultiProgressMonitor log examples including a ratelimit_period context, it looks like that may be how it tries to avoid spamming the service log | 18:55 |
corvus | most of the work gerrit seems to be doing is for ai bon crawlers. | 18:55 |
corvus | bot. not bon. very much not bon. | 18:55 |
fungi | tres mal | 18:55 |
clarkb | corvus: I suspect that is true for the majority of the internet now | 18:55 |
slittle | roughly 10 git push errors on tags | 18:55 |
corvus | we should maybe keep the same number of days of logs for gerrit and zuul :) | 18:56 |
slittle | here is one trying to set up a gerrit review of a .gitreview file .... git review --yes --topic=r.stx.10.0 | 18:57 |
slittle | Problem running 'git remote update gerrit' | 18:57 |
slittle | Fetching gerrit | 18:57 |
slittle | ssh_exchange_identification: read: Connection reset by peer | 18:57 |
slittle | fatal: Could not read from remote repository. | 18:57 |
clarkb | corvus: look in review_site/logs | 18:57 |
fungi | it appears we've asked this question before too: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-07-23.log.html#t2024-07-23T12:35:35 | 18:57 |
clarkb | corvus: we have http logs there for the gerrit http server | 18:57 |
corvus | clarkb: oh hey cool, i was looking at apache | 18:58 |
clarkb | ya its possible we may still need apache side logs but maybe this is sufficient | 18:58 |
corvus | it has the byte count... | 18:58 |
corvus | i'm going to try to deduce the contents of the api request from the byte count. it's a long shot, but it's the only echo of what happened. | 18:59 |
slittle | roughly 5 identical 'git review' failures on that same run | 18:59 |
clarkb | slittle: fungi I suspect that connection reset by peer as part of exchange identification would be the user limit | 19:00 |
clarkb | the tcp connection limit would occur earlier in the ssh connection setup and exchanging identification is probably the earliest point that gerrit can determine who you are to apply the user limit? | 19:01 |
fungi | true, if it were the connection limit it would be something like an icmp admin-prohibited response | 19:01 |
clarkb | so either you've got 96 concurrent connections or they are not shutting down cleanly (we've seen this with some clients but I want to say it was really old paramiko not openssh) or maybe your firewalls are not cleaning things up? | 19:02 |
clarkb | once upon a time the stackalytics crew would just create a new account when they hit the limit rather than figure out how to clean up after themselves | 19:03 |
slittle | 60 gits... for each at least four transactions ... git review -s, git push <branch>, git push <tag>, git review | 19:03 |
fungi | or the socket shutdown takes too long and is happening in parallel with new connections accumulating | 19:03 |
clarkb | slittle: and git review -s is multiple transactions that are papered over for you. Including a test probe, a fetch of the commit message hook if running an older version, and a login check | 19:04 |
fungi | it's possible the tcp/ip stack tries to handle socket shutdown asynchronously | 19:04 |
clarkb | fungi: ya that could be | 19:04 |
clarkb | so ya I suspect fungi's simple idea of slowing things down a bit may be helpful here | 19:04 |
clarkb | probably enough to wait between repos and not between every operation | 19:05 |
fungi | right, if you're performing several operations in the loop, then maybe throw a `sleep 10` in there and see if the errors mostly/entirely disappear | 19:06 |
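Roughly the shape of loop being suggested here, as a sketch only; the repo list file, remote name, and variables are assumptions based on the snippets pasted earlier:

```shell
for repo in $(cat repo-list.txt); do
    (
        cd "$repo" || exit 1
        git push gerrit "${branch}:${branch}"
        git push gerrit "${tag}:${tag}"
    )
    # Pause between repos so per-account connection limits have time to drain.
    sleep 10
done
```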
slittle | I believe we tried that previously, with a 10-15s delay between repos. It made little difference | 19:10 |
fungi | in that case the problem may not be originating from the gerrit server itself. where does the script run? | 19:11 |
slittle | a workstation within the WindRiver corp network | 19:11 |
fungi | okay, so could in theory be going through security proxies, overtaxed firewalls, oversubscribed network links, who knows what else | 19:12 |
fungi | any of which have the potential to result in those exact errors | 19:12 |
slittle | the thing is ... 'repo sync' iterates over those same 60 gits without issue. We only see issues when it's gerrit that we are interacting with in a loop | 19:13 |
clarkb | looking at gerrit sshd logs you are using an old git review and it is fetching the commit message hook each time | 19:14 |
clarkb | so that one would be one optimization to try | 19:14 |
fungi | is 'repo sync' doing it over ssh on a nonstandard tcp port, or something more normal like https? | 19:14 |
clarkb | I also see 523 logins from you over the course of that day in two blocks of time | 19:15 |
clarkb | which is well above our 96 limit so ya if anything is slow to close connections (regardless of where that is happening) then you could hit the limit | 19:15 |
fungi | might be interesting to try running it from another network some time and see if the error rate is any better | 19:15 |
clarkb | ah the later block of time is spread over the 9th too | 19:16 |
slittle | 'repo' is a google tool for managing multi-git projects. it uses git commands under the hood. Our manifest is configured to pull from https://opendev.org/starlingx/* | 19:17 |
clarkb | looks like 425 logins over the course of the hour that straddles the 8th and 9th | 19:17 |
clarkb | oh you use repo | 19:17 |
clarkb | it is likely that repo is reusing connections | 19:17 |
clarkb | so you end up with a single login | 19:17 |
clarkb | or at least far fewer than 425 | 19:17 |
clarkb | https://gerrit.googlesource.com/git-repo/+/refs/tags/v2.51/git_ssh confirmed it uses ssh control persistence | 19:19 |
fungi | well, `repo` is using an https remote to our gitea haproxy, not ssh over 29418/tcp to our gerrit server, sounds like | 19:19 |
clarkb | that would be another option open to you for your script | 19:19 |
clarkb | fungi: oh ya the url above is gitea | 19:19 |
clarkb | anyway repo will reuse ssh connections too | 19:19 |
fungi | having managed lots of corporate security and network hardware in a past life, i can say without hesitation that there's plenty of stuff in the typical corporate network that will treat/handle those differently | 19:20 |
fungi | these days, https connections are heavily optimized, for example | 19:21 |
clarkb | my suggestions would be to 1) update git review so that you don't need to fetch the commit message hook and 2) try ssh control persistence | 19:21 |
clarkb | slittle: ^ | 19:21 |
clarkb | also I liked fungi's idea of trying from a different network source | 19:21 |
fungi | also 3. try to do more over https instead of ssh | 19:22 |
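A minimal ~/.ssh/config sketch of the ControlPersist suggestion; the values are illustrative, not a verified recommendation:

```
Host review.opendev.org
    Port 29418
    User slittle1
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
    ControlPersist 5m
```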
corvus | clarkb: fungi slittle i don't quite have enough info to prove what happened; i think there are too many other changes to info/refs for me to be able to assume the contents based on length alone (or, at least, to fully reconstruct it may take a very long time). | 19:29 |
corvus | however, we do know that zuul did query gerrit to get the list of branches (we see that in the logs), and i thought if there was an error in zuul, it was most likely to be that it didn't query at all. so at this point, we're looking at these choices: | 19:29 |
corvus | 1) gerrit did not include the new branch in info/refs when zuul queried it; or 2) something completely unexpected caused zuul to use the wrong data or fail to update its cache. | 19:30 |
corvus | we can't prove either at this point, but i'm leaning toward 1 | 19:31 |
fungi | corvus: this seems, at least on the surface, similar to past incidents where the openstack release team has performed bulk stable branch creation around release time and some branches have ended up not getting jobs run until a full reconfig of zuul (which we used to do more often) | 19:31 |
corvus | i'm going to add a log line to zuul that would disambiguate this in the future, so if it is #2, we have enough confidence to start looknig for weird things. | 19:31 |
corvus | fungi: ++ | 19:31 |
corvus | and for now, i think we should just reconfigure :) | 19:32 |
fungi | i agree it seems possible that gerrit's state is "eventually consistent" and that querying it for a branch may not work immediately after creation-related events are streamed, especially if it's been asked to create a lot of them in a short span of time and/or is under unrelated load | 19:32 |
fungi | that is to say, it would not surprise me to learn that emission of ref-updated events isn't held back waiting for branch creation to make it all the way down to the repositories | 19:34 |
clarkb | corvus: sounds good and thank you for digging into it | 19:34 |
clarkb | fwiw I find only one killed ssh connection logged from slittle on the 8th | 19:34 |
clarkb | was for git-receive-pack./starlingx/portieris-armada-app.git | 19:34 |
clarkb | makes me wonder if we're not logging the early connection stop for hitting the limits | 19:35 |
clarkb | but I've been digging around in gerrit source to try and figure out how that is implemented and haven't found it yet | 19:35 |
corvus | yeah, the concerning part of my theory is that it would imply that one git operation (push) would not be reflected in a subsequent git operation (info/refs). that seems super unlikely. so one of two unlikely bugs. we need to see if it has stripes to know whether it's a zebra. :) | 19:35 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/940071 Add debug log when fetching Gerrit branches [NEW] | 19:37 |
corvus | #status log issued zuul tenant-reconfigure for openstack to pick up missing starlingx branches | 19:38 |
opendevstatus | corvus: finished logging | 19:38 |
corvus | it might be a while before ^ takes effect | 19:38 |
clarkb | ok gerrit does have an explicit log at warning for max connections reached | 19:39 |
corvus | (it will have finished when the status page says the last reconfigure is after 19:39 utc) | 19:39 |
clarkb | I don't see that in error_log or sshd_log for the 8th or 9th but gerrit logging is sufficiently complicated I'm not convinced that didn't happen | 19:40 |
clarkb | with that "resolved" any objection to me trying to land the gerrit 3.10.4 update again? | 19:40 |
corvus | none here; i'm mostly afk until meetup tho | 19:41 |
clarkb | ya me too. I need lunch and don't think we'll try to restart until after the meetup anyway | 19:41 |
clarkb | I'll hit +A | 19:42 |
clarkb | or actually reenqueue it again since that avoids needing clean check | 19:42 |
fungi | i'm around and can keep an eye on it | 19:44 |
clarkb | I think it is in the trigger queue which is waiting on the reconfigure | 19:44 |
clarkb | but I did run the command to enqueue it | 19:44 |
fungi | cool | 19:44 |
fungi | thanks! | 19:44 |
tonyb | frickler: Answering your question from openstack-dev here so it's more generally visible. For issues with "Software Factory CI" 3rd-party CI you can ping myself [UTC+1000] or dpawlik [UTC+0100], with the latter being my preference as they're more likely to deal with it quickly | 19:56 |
slittle | for ssh control persistence to review.openstack.org ... do I want the 'user' to be 'git' or my user name? | 20:01 |
clarkb | slittle: your user name. Gerrit doesn't use a shared account like github | 20:16 |
clarkb | still no r/stx.10.0 branch in https://zuul.opendev.org/t/openstack/project/opendev.org/starlingx/vault-armada-app and a recheck doesn't seem to have worked | 20:22 |
clarkb | the reconfiguration says 22 minutes ago so I think it completed | 20:22 |
clarkb | so maybe this is still reproducible cc corvus (but finish lunch) | 20:22 |
clarkb | oh wait maybe I'm just impatient? | 20:23 |
clarkb | jobs are enqueued now | 20:23 |
clarkb | and the branch is there nevermind this was me expecting things to run quicker and I just needed to wait another thirty seconds | 20:23 |
clarkb | slittle: ^ fyi I think you can restore the starlingx/utilities change to its original state and it should hopefully work now too | 20:24 |
slittle | will do | 20:39 |
clarkb | meetup time | 21:01 |
tonyb | I'll be ~5mins late I lost track of time and need coffee | 21:02 |
clarkb | ack | 21:03 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update grafana to 10.4.14 https://review.opendev.org/c/opendev/system-config/+/940073 | 21:30 |
opendevreview | Merged opendev/system-config master: Update Gerrit images to 3.10.4 and 3.11.1 https://review.opendev.org/c/opendev/system-config/+/939167 | 21:35 |
opendevreview | Brian Haley proposed zuul/zuul-jobs master: Update ensure-twine role https://review.opendev.org/c/zuul/zuul-jobs/+/940074 | 21:37 |
*** tosky_ is now known as tosky | 21:41 | |
fungi | #status notice The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes | 22:44 |
opendevstatus | fungi: sending notice | 22:44 |
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily while we reboot for a patch version upgrade of the software, but should return again within a few minutes | 22:44 | |
opendevstatus | fungi: finished sending notice | 22:46 |
clarkb | the change to bump grafana to 10.4.14 did pass CI so now we need to check screenshots I guess | 23:09 |
clarkb | fungi: one thing I still have on my todo list for today is a bindep release. Do you know if 940074 is needed for that? | 23:13 |
clarkb | I'm not sure if the ensure-twine role proposal is related to the twine problems earlier | 23:13 |
clarkb | they linked to the issue but I'm not sure I understand why we need to ensure-twine in zuul-jobs if we are already installing it | 23:16 |
*** promethe- is now known as prometheanfire | 23:25 | |
clarkb | ok confirmed that we can't release things using openstack release tooling right now | 23:25 |
clarkb | the latest issue is ensure-twine uses pip install --user which runs afoul of the no global installs under python3.12 on noble and test-release-openstack at least moved to noble | 23:26 |
clarkb | the change above updates things to use a virtualenv but now the problem is with testing. | 23:26 |
clarkb | test-release-openstack is defined in openstack/project-config so can't speculatively load the proposed zuul-jobs ensure-twine update | 23:26 |
clarkb | I think the change itself may be fine though just having a hard time verifying it against test-release-openstack | 23:27 |
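For the record, a hedged sketch of the venv-based install pattern the proposed role update moves to; the paths and the version pin mirror the 6.1.0 block mentioned earlier and are not the role's actual variables:

```shell
python3 -m venv /tmp/twine-venv
/tmp/twine-venv/bin/pip install 'twine!=6.1.0'
/tmp/twine-venv/bin/twine check dist/*
```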
fungi | clarkb: correct (sorry, was nabbing dinner) | 23:39 |
clarkb | no problem. I reviewed bhaley's change and caught up on all the reasons for it and tried to provide that information in the review | 23:47 |
clarkb | er haleyb | 23:48 |
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!