Monday, 2024-07-22

07:19 *** liuxie is now known as liushy
11:18 *** iurygregory__ is now known as iurygregory
12:48 <jrosser> is it possible that one gitea backend is returning different content for openstack-ansible than the others?
12:53 <fungi> jrosser: yes, see https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-07-20.log.html
12:53 <fungi> i'll go ahead and take gitea12 out of the lb pools and start working through the plan outlined there
12:54 <jrosser> ah ok - thanks! i think i get unlucky and hit that one consistently from my work network
12:54 <fungi> working theory is an unplanned host reboot in vexxhost sjc1 resulted in a truncated write
12:54 <fungi> jrosser: the load balancer does source persistence, so yes if you hit it you're likely to keep hitting it. sorry about that!
12:56 <fungi> jrosser: it's downed in haproxy now. can you check whether everything's fine for your remote now?
12:59 <jrosser> fungi: yes that looks good now - git fetch grabbed some new objects and master looks like i would expect now
13:00 <fungi> thanks for confirming
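
A minimal sketch of why source persistence kept jrosser on the broken backend, as fungi describes at 12:54. It assumes a simple hash of the client address modulo the pool size; this is not haproxy's actual hashing algorithm. The pool membership and the client address are hypothetical (only gitea12 and gitea13 are named in this log).

```python
import hashlib


def pick_backend(client_ip: str, backends: list[str]) -> str:
    """Deterministically map a client address to one backend in the pool.

    Illustration only: a real load balancer uses its own hash function.
    """
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical pool and client address, for illustration only.
pool = ["gitea11", "gitea12", "gitea13"]
client = "203.0.113.7"

first = pick_backend(client, pool)
# The same source address always lands on the same backend.
assert first == pick_backend(client, pool)

# Taking that backend out of the pool forces the client onto another one.
remaining = [b for b in pool if b != first]
print(pick_backend(client, remaining) in remaining)
```

With a mapping like this, a client that happens to land on a corrupted backend keeps seeing the same stale content until that backend is removed from the pool, which matches what jrosser saw from one network.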
13:01 *** iurygregory__ is now known as iurygregory
13:16 <fungi> after trying a few options, i ended up transplanting a copy of openstack/openstack-ansible.git from gitea13 to replace the broken one on gitea12, triggered gerrit's replication of that repository to the backend which completed successfully, and am now rerunning the git gc cron manually to see if it raises any new errors
13:22 <fungi> seemingly unrelated, the gc errors on "fatal: bad object refs/changes/43/773243/8" (a non-current patchset for an openstack/kolla-ansible change)
13:28 <fungi> that ref was created in gerrit 2024-07-15 13:55-45 utc
13:30 <fungi> er, i meant 13:55:45 utc
13:31 <fungi> our guess on an outage that could have been responsible for the problem commit in openstack/openstack-ansible was between 13:56-14:20 utc on 2024-07-15 so, assuming a few seconds of delay for the replication from gerrit that fits (it's certainly suspiciously closely-timed at any rate)
13:31 <fungi> i'll perform similar transplantation for that repo in a bit
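
The timing argument fungi makes at 13:31 can be checked with a little arithmetic. This is a sketch only: the outage window is a guess given to the minute, so its edges are approximate, and the replication delays tried below are assumed values rather than measurements.

```python
from datetime import datetime, timedelta, timezone

# Times quoted in the log (UTC, 2024-07-15).
ref_created = datetime(2024, 7, 15, 13, 55, 45, tzinfo=timezone.utc)
outage_start = datetime(2024, 7, 15, 13, 56, tzinfo=timezone.utc)
outage_end = datetime(2024, 7, 15, 14, 20, tzinfo=timezone.utc)


def lands_in_window(event: datetime, delay: timedelta,
                    start: datetime, end: datetime) -> bool:
    """True if the event plus a replication delay falls inside the window."""
    return start <= event + delay <= end


# Gap between the ref being created and the estimated start of the outage.
gap = outage_start - ref_created
print(gap.total_seconds())  # 15.0

# Assumed delays: 15 seconds reaches the window start, 5 seconds does not.
print(lands_in_window(ref_created, timedelta(seconds=15), outage_start, outage_end))  # True
print(lands_in_window(ref_created, timedelta(seconds=5), outage_start, outage_end))   # False
```

The ref was created 15 seconds before the estimated start of the window, so a short replication delay plus some slack in the estimate is enough to make the two events line up.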
13:47 <fungi> okay, done and running the gc yet again to see if it comes back clean this time
14:31 <fungi> all good, no stdout/stderr and exited 0
14:31 <fungi> i've put gitea12 back into the lb pools now
14:32 <fungi> #status log Repaired data corruption for two repositories on the gitea12 backend, root cause seems to be from an unexpected hypervisor host outage
14:32 <opendevstatus> fungi: finished logging
14:40 <clarkb> fungi: thank you for taking care of that
14:43 <fungi> np
15:13 <clarkb> from my notes `docker-compose run -T --rm gitea-web gitea doctor convert` is the command I ran on the held test node to run the doctor command. I think the rough process is to remove node from the lb, shutdown gitea-web and gitea-ssh, perform a db backup dump, run the doctor command, check exit code/logs/db table descriptions, start gitea and check the service, add back to the lb
15:14 <clarkb> we have a newer held node from the upgrade testing that I haven't cleaned up yet. We can do a more complete test run of that process there first
15:30 <clarkb> https://review.opendev.org/c/opendev/system-config/+/924536 is a followup to the gitea upgrade to reflect the secret format in the test node to match reality better
15:31 <fungi> plan sounds good to me
15:32 <clarkb> ok when I'm a bit more awake and I've caught up on other things I'll take a look at doing that on the test node
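
The rough process clarkb outlines at 15:13 could be written down as an ordered dry-run checklist. This is a sketch, not the actual runbook: only the doctor command is quoted from the log. The `docker-compose stop` and `docker-compose up -d` lines assume that gitea-web and gitea-ssh are compose service names, and steps with no command shown are left as manual because the log does not give one.

```python
from dataclasses import dataclass
from typing import Optional

# Quoted from the log; everything else below is an assumption.
DOCTOR_COMMAND = "docker-compose run -T --rm gitea-web gitea doctor convert"


@dataclass(frozen=True)
class Step:
    description: str
    command: Optional[str] = None  # None means a manual step


PLAN = [
    Step("remove node from the lb"),
    # Assumed command: service names taken from the log's wording.
    Step("shut down gitea-web and gitea-ssh", "docker-compose stop gitea-web gitea-ssh"),
    Step("perform a db backup dump"),
    Step("run the doctor command", DOCTOR_COMMAND),
    Step("check exit code, logs, and db table descriptions"),
    # Assumed command for bringing the services back.
    Step("start gitea and check the service", "docker-compose up -d"),
    Step("add node back to the lb"),
]


def render_plan(steps: list[Step]) -> list[str]:
    """Return numbered, human-readable lines; nothing is executed."""
    lines = []
    for number, step in enumerate(steps, start=1):
        action = step.command if step.command is not None else "(manual)"
        lines.append(f"{number}. {step.description}: {action}")
    return lines


for line in render_plan(PLAN):
    print(line)
```

Keeping the plan as data makes the ordering easy to review: the backup comes before the doctor run, and the node returns to the lb only after the service has been checked.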
16:54 <clarkb> https://etherpad.opendev.org/p/hcz4yMxUIsAWGgyoHKeZ put the gitea db doctoring steps in here. Still need to do the test pass on the test node and take additional notes
16:57 <opendevreview> Merged opendev/system-config master: Update test gitea's JWT_SECRET  https://review.opendev.org/c/opendev/system-config/+/924536
16:57 <opendevreview> Merged opendev/system-config master: Add vmware migration list to lists.openinfra.dev  https://review.opendev.org/c/opendev/system-config/+/924432
17:15 <clarkb> ok I'm reasonably happy with what is in the etherpad now. I'm not sure I've got enough notes in there about checking the results so definitely add content to that section if you've got additional ideas
17:17 <clarkb> and feel free to connect to the db on the test node and poke around. That is what I was doing
17:24 <clarkb> added a note about a docker-compose behavior I just discovered too. I don't think it is a problem for us but it was unexpected for me
18:08 <clarkb> I'm looking at the meeting agenda to prepare for tomorrow and realize we didn't clean up those x/* projects from the zuul tenant config yet did we?
18:08 <clarkb> I think we're past the day we said we would do that and we can proceed now cc frickler
18:14 <clarkb> infra-root do we want to keep the zuul db performance agenda item on the meeting agenda? I know a bunch of work went on while I was out and I think it's resolved at this point. Maybe there is value in a recap/summary?
18:15 <clarkb> frickler: and not sure if you want to keep the queue window size item? Maybe we can rewrite it as implementing early fail detection for openstack jobs instead?
18:20 <frickler> clarkb: I can check if the x/* patch is still current tomorrow
18:22 <frickler> clarkb: I would still like to reduce the queue window for openstack gate, at least temporarily to see at what value it will end settling on. not sure if there is a way to track that in grafana?
18:22 <frickler> I don't see anyone having time or interest to work on early fail detection
18:23 <frickler> so that doesn't sound like a viable alternative to me for now
18:24 <fungi> i.e. no interest from those maintaining the tempest/grenade jobs in signalling early failure from individual test results?
18:26 <fungi> clarkb: the plan in the etherpad looks good to me. i'm not thinking of anything else useful to test post-doctor
18:27 <clarkb> I don't see any grafana graphs for the window size, but I do think there is statsd reporting for that value so we could have a dashboard for it (the tricky thing is it is per queue now and we have lots of queues so not sure how best to visualize that)
18:27 <clarkb> fungi: thank you for looking
20:15 <clarkb> ok I've made edits to the agenda. Anything else we'd like to add/edit/remove?
20:26 <fungi> nothing comes to mind
20:32 <clarkb> cool I'll send that out at the end of my day to give tonyb a chance to chime in. re the gitea stuff it is looking like I'll start on actually running through that either tomorrow afternoon (after meetings) or wednesday morning. I've got at least one errand this afternoon and would prefer to have enough time to get all of the giteas done together
