*** hamalq has quit IRC | 00:02 | |
*** mlavalle has quit IRC | 00:10 | |
TheJulia | corvus: is it down? Looks like connections are timing out and I got a blank page load right before that | 00:22 |
corvus | TheJulia: it's extremely degraded, but hasn't lost queue state or events. it should eventually recover and restart all the jobs. | 00:23 |
TheJulia | okay | 00:25 |
TheJulia | thanks corvus! | 00:25 |
corvus | TheJulia: thanks for not throwing vegetables at me :) | 00:26 |
TheJulia | corvus: I've been there myself long ago | 00:26 |
*** ysandeep|away is now known as ysandeep | 00:41 | |
fungi | zuul-scheduler process still has a cpu completely pegged and the rest api is unresponsive, but the debug log does indicate it's still dispatching builds to executors (albeit slowly) | 00:47 |
*** diablo_rojo has quit IRC | 00:52 | |
fungi | WARNING kazoo.client: Connection dropped: socket connection error: EOF occurred in violation of protocol (_ssl.c:1125) | 00:57 |
fungi | is that what the zk connection timeouts look like? | 00:57 |
fungi | seeing them go by in the debug log every few minutes | 00:57 |
fungi | roughly 2-4 minutes apart for a while | 00:58 |
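The WARNING above comes from kazoo, the ZooKeeper client library Zuul uses, and indicates its TLS session to ZooKeeper dropped mid-protocol. Below is a minimal sketch (not Zuul's actual handler; the host name is hypothetical) of how kazoo surfaces those connection state changes to an application:

```python
import logging

from kazoo.client import KazooClient, KazooState

logging.basicConfig(level=logging.WARNING)


def zk_listener(state):
    # Called from kazoo's connection thread on every state transition.
    if state == KazooState.SUSPENDED:
        logging.warning("zk connection suspended; holding off on writes")
    elif state == KazooState.LOST:
        logging.warning("zk session lost; ephemeral nodes and locks are gone")
    else:  # KazooState.CONNECTED
        logging.warning("zk (re)connected; safe to resume")


# Hypothetical host; opendev's cluster details differ.
client = KazooClient(hosts="zk.example.org:2181")
client.add_listener(zk_listener)
client.start()
```

A steady trickle of SUSPENDED/LOST transitions every few minutes, as fungi describes, usually points to the client being too starved for CPU to answer session heartbeats in time rather than to a network problem.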
fungi | corvus: are we likely generating retry events because of zookeeper disconnects faster than we can process them, or do you still expect it to recover on its own without restarting? | 01:01 |
fungi | i'm happy to work on a scheduler restart to get things moving again and try to reenqueue everything from the periodic queue backups. looks like we have one from 23:41 utc | 01:06 |
fungi | which is roughly the time everything seems to have ground to an almost-halt | 01:06 |
corvus | fungi: i have a very large query running | 01:08 |
corvus | i'd like to let it finish | 01:08 |
fungi | no worries, wasn't sure if you were done yet | 01:08 |
corvus | i'll make sure it's running before i go to bed | 01:09 |
fungi | thanks! | 01:10 |
weshay|ruck | something going on.. seeing a ton of retry_limits on centos-8 jobs | 01:23 |
weshay|ruck | ? | 01:24 |
* weshay|ruck reads | 01:24 | |
fungi | weshay|ruck: yeah, we're trying to get to the bottom of a recent memory leak in zuul | 01:44 |
*** mfixtex has quit IRC | 01:46 | |
*** brinzhang_ is now known as brinzhang | 02:04 | |
johnsom | Is there an ETA on zuul coming back? | 03:07 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: Planet OPML file https://review.opendev.org/c/opendev/system-config/+/784191 | 03:09 |
corvus | i'm going to restart it now | 03:16 |
corvus | #status log restarted zuul after freeze while debugging memleak | 03:23 |
corvus | should be up now | 03:24 |
TheJulia | \o/ | 03:46 |
*** gothicserpent has quit IRC | 03:49 | |
*** ykarel|away has joined #opendev | 03:50 | |
*** tkajinam has quit IRC | 03:51 | |
*** tkajinam has joined #opendev | 03:51 | |
*** tkajinam has quit IRC | 03:52 | |
*** tkajinam has joined #opendev | 03:53 | |
*** ykarel|away is now known as ykarel | 03:54 | |
*** whoami-rajat has joined #opendev | 04:17 | |
*** marios has joined #opendev | 05:25 | |
*** rosmaita has joined #opendev | 05:47 | |
*** sboyron has joined #opendev | 05:47 | |
*** tkajinam has quit IRC | 06:02 | |
*** tkajinam has joined #opendev | 06:03 | |
*** tkajinam has quit IRC | 06:03 | |
*** tkajinam has joined #opendev | 06:03 | |
*** bandini has joined #opendev | 06:10 | |
*** lpetrut has joined #opendev | 06:30 | |
*** hashar has joined #opendev | 06:39 | |
*** gibi_away is now known as gibi | 06:58 | |
openstackgerrit | Merged opendev/system-config master: Explicitly create empty reprepro dists https://review.opendev.org/c/opendev/system-config/+/784158 | 07:24 |
*** CeeMac has quit IRC | 07:24 | |
openstackgerrit | Merged opendev/system-config master: Correct debian-security repo codename for bullseye https://review.opendev.org/c/opendev/system-config/+/784169 | 07:24 |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Fix generate two grub.cfg files https://review.opendev.org/c/openstack/diskimage-builder/+/784203 | 07:39 |
*** tosky has joined #opendev | 07:45 | |
openstackgerrit | Daniel Blixt proposed zuul/zuul-jobs master: WIP: Make build-sshkey handling windows compatible https://review.opendev.org/c/zuul/zuul-jobs/+/780662 | 07:47 |
Tengu | hello there! is the "job retry_limit/pause" issue solved? or may I help on it if it's still relevant? | 07:55 |
*** ykarel has quit IRC | 08:03 | |
*** ykarel has joined #opendev | 08:05 | |
*** ysandeep is now known as ysandeep|lunch | 08:27 | |
*** jaicaa has quit IRC | 08:33 | |
*** jaicaa has joined #opendev | 08:36 | |
*** dtantsur|afk is now known as dtantsur | 08:44 | |
*** ykarel is now known as ykarel|lunch | 08:58 | |
openstackgerrit | Irene Calderón proposed opendev/storyboard master: Esto es una prueba https://review.opendev.org/c/opendev/storyboard/+/784329 | 09:37 |
*** elod is now known as elod_afk | 10:01 | |
*** ysandeep|lunch is now known as ysandeep | 10:05 | |
*** ykarel|lunch is now known as ykarel | 10:10 | |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro https://review.opendev.org/c/openstack/diskimage-builder/+/784363 | 10:13 |
zbr|rover | do we use links to logs instead of the zuul build page in zuul comments on purpose or by accident? I kinda prefer being sent to the zuul page instead of the logs page. | 10:27 |
zbr|rover | i would personally find it more convenient if the links in comments were the same as the ones inside the new "zuul summary" tab. | 10:28 |
zbr|rover | funny, trying to load https://zuul.opendev.org/t/openstack/build/e14185c56a0f495ca21c3afd0c67a7aa managed to crash chrome. | 10:30 |
openstackgerrit | xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro https://review.opendev.org/c/openstack/diskimage-builder/+/784363 | 10:36 |
*** hashar is now known as hasharLunch | 11:02 | |
*** ysandeep is now known as ysandeep|afk | 11:35 | |
*** hasharLunch is now known as hashar | 11:58 | |
*** bhagyashris has quit IRC | 12:28 | |
*** bhagyashris has joined #opendev | 12:29 | |
*** hrw has quit IRC | 12:29 | |
*** ysandeep|afk is now known as ysandeep | 12:37 | |
*** stand has joined #opendev | 12:55 | |
weshay|ruck | fungi, and all.. thanks!! | 12:57 |
*** gothicserpent has joined #opendev | 13:21 | |
*** roman_g has joined #opendev | 13:24 | |
*** gothicserpent has quit IRC | 13:25 | |
*** mailingsam has joined #opendev | 13:46 | |
fungi | zbr|rover: which comments are you talking about? | 13:50 |
zbr|rover | fungi: nevermind. i think it was PEBKAC on that, when i checked the urls manually they were identical. | 13:52 |
zbr|rover | the "lost between browser tabs" would describe it better | 13:53 |
*** gothicserpent has joined #opendev | 14:04 | |
*** gothicserpent has quit IRC | 14:04 | |
fungi | happens to me too, sure | 14:07 |
fungi | Tengu: solved (or at least gone for now) | 14:07 |
fungi | we've been trying to get to the bottom of a new memory leak in the zuul scheduler, but interactively debugging the live process was slowing it down considerably and causing side effects like spurious mass job retries | 14:08 |
fungi | the memory leak is not gone yet, we're still collecting data | 14:08 |
Tengu | fungi: ah, thanks for the info! | 14:12 |
openstackgerrit | Jeremy Stanley proposed opendev/system-config master: Temporarily serve tarballs site from AFS R+W vols https://review.opendev.org/c/opendev/system-config/+/784424 | 14:14 |
fungi | infra-root: expedited approval of that ^ is appreciated so we can get back to serving current content on the tarballs site until the ord replication is finished | 14:14 |
*** elod_afk is now known as elod | 14:29 | |
*** ysandeep is now known as ysandeep|away | 14:30 | |
corvus | fungi: i'd like to try to do another data collection pass; hopefully not as terrible as last night, but still almost certainly disruptive | 14:32 |
*** lbragstad has quit IRC | 14:36 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: ensure-kubernetes: remove dns resolvers hack https://review.opendev.org/c/zuul/zuul-jobs/+/784427 | 14:37 |
fungi | corvus: probably the earlier the better | 14:38 |
fungi | lots of openstack teams are under a lot of stress since next week is final release candidates for wallaby | 14:40 |
fungi | so there's been quite a bit of scrambling to get final fixes merged, as usual | 14:40 |
*** roman_g has quit IRC | 14:45 | |
*** lpetrut has quit IRC | 14:45 | |
openstackgerrit | Sorin Sbârnea proposed openstack/diskimage-builder master: WIP: Add freebash disk image https://review.opendev.org/c/openstack/diskimage-builder/+/784432 | 14:56 |
openstackgerrit | Sorin Sbârnea proposed openstack/diskimage-builder master: WIP: Add freebsd disk image https://review.opendev.org/c/openstack/diskimage-builder/+/784432 | 14:56 |
*** chkumar|ruck is now known as raukadah | 15:01 | |
*** tkajinam has quit IRC | 15:02 | |
*** dtantsur is now known as dtantsur|afk | 15:06 | |
corvus | this current query is proving to be quite disruptive; i have a copy of the queues saved from before i started it though; so if we decide to abort it, i can re-enqueue | 15:11 |
corvus | i believe i have thought of a way to make objgraph nicer though; if we do abort/restart, i'll work on that | 15:12 |
corvus | fungi: no result yet; i think we should restart :( | 15:27 |
*** zbr|rover is now known as zbr | 15:29 | |
openstackgerrit | Jeremy Stanley proposed zuul/zuul-jobs master: Document algorithm var for remove-build-sshkey https://review.opendev.org/c/zuul/zuul-jobs/+/783988 | 15:30 |
fungi | corvus: okay, do you need help with the restart or want me to do it? | 15:30 |
corvus | fungi: i won't do any more debugging today; i'll resume tonight or tomorrow, and do so with a process which is hopefully nicer and can be aborted. | 15:30 |
corvus | fungi: nah, i got it | 15:30 |
fungi | thanks! | 15:30 |
*** mlavalle has joined #opendev | 15:33 | |
corvus | fungi, clarkb: i wonder if running the objgraph query in a fork would be effective? | 15:36 |
corvus | tobiash: ^ | 15:36 |
fungi | corvus: that's an interesting idea | 15:36 |
fungi | it would get its own copy of memory i guess | 15:36 |
corvus | yeah, i'm assuming all the objects would be there and leaked; we'd want to make sure all the tcp connections are closed. | 15:37 |
tobiash | corvus: a fork should work | 15:38 |
corvus | my first idea is to just modify the objgraph methods to add in a sleep between each call to gc.get_referrers, and to check for a stop flag; but if we can do the work in a forked process, we would have an entire cpu available. | 15:38 |
corvus | cool; i'll prototype the fork idea on a local zuul and if it works, try that out in the next debug session tonight/tomorrow. | 15:39 |
fungi | that does seem like it's worth trying anyway | 15:39 |
fungi | and the fork is still using all the same pointers, so shouldn't increase actual memory utilization significantly, right? | 15:40 |
fungi | assuming we don't memcpy everything | 15:40 |
corvus | fungi: yeah, i think mem usage would increase moderately slowly as pages get cow'd | 15:40 |
fungi | cool, that's what i was hoping | 15:41 |
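A minimal sketch of the fork idea being discussed here, assuming the query is kicked off from the REPL: the child gets a copy-on-write view of the scheduler's heap and burns its own CPU while the parent keeps running. The function name and output path are illustrative, not actual Zuul code:

```python
import os

import objgraph


def dump_leak_report(path="/tmp/objgraph-report.txt"):  # hypothetical path
    pid = os.fork()
    if pid > 0:
        # Parent (the real scheduler): note the child pid and carry on.
        return pid
    # Child: only the thread that called fork() survives, so the scheduler
    # and ZooKeeper threads do not run here and nothing fights over work.
    # A real version would also close inherited sockets before doing anything.
    try:
        with open(path, "w") as f:
            # Cheap overview first, then the expensive growth query.
            objgraph.show_most_common_types(limit=50, file=f)
            objgraph.show_growth(limit=50, file=f)
    finally:
        # Bypass atexit/cleanup handlers inherited from the parent.
        os._exit(0)
```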
*** diablo_rojo has joined #opendev | 15:41 | |
diablo_rojo | fungi, clarkb I assume you're already aware the zuul status site is not loading? | 15:43 |
*** ykarel is now known as ykarel|away | 15:43 | |
fungi | diablo_rojo: yeah, was just talking about that in #openstack-infra with some other folks, probably we need to restart zuul-web now that zuul-scheduler has been restarted | 15:44 |
fungi | corvus: shall i? or do you think it will recover on its own? | 15:44 |
corvus | fungi: it's up | 15:44 |
diablo_rojo | fungi, ah okay cool. Thanks! | 15:44 |
fungi | oh, perfect. thanks! | 15:44 |
diablo_rojo | Way ahead of me :) | 15:45 |
diablo_rojo | Thanks fungi and corvus! | 15:45 |
corvus | #status log restarted zuul after going unresponsive during debugging | 15:47 |
*** whoami-rajat has quit IRC | 15:47 | |
corvus | fungi: restart and re-enqueue is complete | 15:47 |
corvus | fungi: i'm done debugging for the day | 15:48 |
fungi | thanks again! | 15:48 |
fungi | i'll keep an eye on the memory graph | 15:48 |
*** bandini has quit IRC | 15:49 | |
*** ykarel|away has quit IRC | 15:53 | |
*** hashar has quit IRC | 15:58 | |
*** fressi has joined #opendev | 15:59 | |
*** fressi has left #opendev | 15:59 | |
*** sshnaidm is now known as sshnaidm|afk | 16:20 | |
*** marios is now known as marios|out | 16:31 | |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/765177 | 16:37 |
*** hamalq has joined #opendev | 16:40 | |
*** ysandeep|away is now known as ysandeep | 16:41 | |
openstackgerrit | Merged opendev/system-config master: Temporarily serve tarballs site from AFS R+W vols https://review.opendev.org/c/opendev/system-config/+/784424 | 16:59 |
*** marios|out has quit IRC | 17:20 | |
clarkb | corvus: re a fork, the risk there is we'll have two schedulers fighting to do the same work? I guess the thing you'll be POCing is convincing the child to be inactive while running? | 17:38 |
fungi | yeah, i took that as a given. the fork needs to explicitly do nothing, i think | 17:38 |
fungi | close all inherited file descriptors, maybe even just go into a busywait | 17:39 |
corvus | clarkb: the only thread in the fork should be the one where fork is called, yeah? so that would be the repl server thread becoming the main thread of the new process, i would think. | 17:39 |
clarkb | infra-root I'm going to try and dig into the vexxhost ipv6 stuff after lunch today, as I suspect that is impacting job runtimes in that cloud as well as our new review server. I think a good next step there will be to jump on some in-use test nodes and check their ipv6 networking configs to see if any patterns emerge and go from there, since ianw seems to have the tcpdumping covered on review | 17:39 |
corvus | (ie, the scheduler thread would not exist) | 17:39 |
fungi | ahh, yeah the repl as the only thread would solve it | 17:40 |
fungi | as long as you don't execute functions in the repl to make that no longer true, but the answer there is to just not do that | 17:41 |
clarkb | ah yup | 17:42 |
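For the point being settled here, a standalone demonstration (not Zuul code) that a forked child inherits only the thread that called fork(), so no scheduler-style worker thread can run in it:

```python
import os
import threading
import time


def worker():
    while True:
        time.sleep(1)


threading.Thread(target=worker, name="scheduler-ish", daemon=True).start()
print("parent threads:", [t.name for t in threading.enumerate()])

pid = os.fork()
if pid == 0:
    # Child: only the forking thread remains; "scheduler-ish" is gone.
    print("child threads:", [t.name for t in threading.enumerate()])
    os._exit(0)
os.waitpid(pid, 0)
```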
clarkb | fungi: have we landed the second pair of openedge cleanups yet? that was next on my list to check on from yesterday | 17:43 |
* clarkb finds change links | 17:44 | |
clarkb | https://review.opendev.org/c/opendev/system-config/+/783991 looks like that hasn't merged yet. Any reason to not do that now (sounds like zuul things are settling for the moment?) | 17:45 |
fungi | clarkb: no, haven't yet | 17:46 |
fungi | but should be safe now | 17:46 |
clarkb | ok I'll +A it now | 17:46 |
fungi | thanks! | 17:53 |
*** ysandeep is now known as ysandeep|away | 17:57 | |
*** ykarel|away has joined #opendev | 18:13 | |
*** ykarel|away has quit IRC | 18:25 | |
openstackgerrit | Merged opendev/system-config master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/system-config/+/783991 | 18:43 |
fungi | clarkb: i suppose 784086 can go in now too since that's merged | 18:48 |
clarkb | fungi: ++ do you want to +A or should I? | 18:52 |
fungi | feel free | 18:52 |
*** gothicserpent has joined #opendev | 18:59 | |
*** gothicserpent has quit IRC | 19:01 | |
openstackgerrit | Merged opendev/zone-opendev.org master: Clean up OpenEdge configuration https://review.opendev.org/c/opendev/zone-opendev.org/+/784086 | 19:01 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/765177 | 19:07 |
*** mailingsam has quit IRC | 19:12 | |
clarkb | I've started looking at a vexxhost test node to see what is going on with its networking and try to work from that. One thing I checked while I was there is whether the svm cpu flag is present, and it is (this means amd nested virt is a possibility) | 19:17 |
clarkb | dmesg also confirms nested virt is enabled. What i am not seeing is glean or network config at first boot at all | 19:22 |
clarkb | checking some of the older hosts that exist in nodepool's list, none of them seem to have more than one globally routable address, and they have 2 default routes (2 default routes are expected on the public ipv6 interface iirc) | 19:30 |
clarkb | so that is all looking good from the test node side. Makes me wonder if they aren't typically sticking around long enough to have trouble | 19:30 |
*** gothicserpent has joined #opendev | 19:36 | |
fungi | which test node? | 19:37 |
fungi | oh, "a vexxhost test node" i see | 19:38 |
fungi | trying to find an explanation for the replacement gerrit server's ipv6 madness? | 19:38 |
clarkb | yup, and also the pip installation slowness in johnsom's example from yesterday which I suspect is also related | 19:39 |
clarkb | mirror.ca-ymq-1.vexxhost.opendev.org:/etc/netplan/50-cloud-init.yaml is the modified file that fixed this problem there | 19:39 |
clarkb | we set dhcp6 and accept-ra to false then manually set routes and addr based on the values that we had previously accepted via RA which mnaser confirmed should be stable | 19:40 |
clarkb | however we never really got any more info from the cloud side why this was happening | 19:40 |
clarkb | assuming things are expected to continue to be stable cloud side we could do similar for review02 but that seems really clunky and we should consider documenting/automating that configuration for vexxhost nodes if we do | 19:40 |
*** rosmaita has left #opendev | 19:41 | |
fungi | yeah, and bug 1844712 is still basically getting no traction | 19:41 |
openstack | bug 1844712 in OpenStack Security Advisory "RA Leak on tenant network" [Undecided,Incomplete] https://launchpad.net/bugs/1844712 | 19:41 |
johnsom | Just throwing a wild idea out, are you accidentally over restricting icmpv6? | 19:41 |
johnsom | If routers don't get all of the neighbor discovery goodness they can prune routes? | 19:42 |
clarkb | johnsom: the problem is we're getting RAs for networks we aren't on | 19:42 |
johnsom | oops, ?->. I see this on my comcast IPv6 | 19:42 |
johnsom | Ah, well, that is a whole different issue. lol | 19:42 |
fungi | johnsom: not filtering icmpv6, no. we're wondering if it's that we're getting announcements from gateways which aren't really valid gateways in addition to the correct ones | 19:42 |
clarkb | so then when the host tries to talk from that source addr over that route the packets end up in the bitbucket | 19:42 |
clarkb | we solved that on the mirror node by disabling dynamic configuration which is less than ideal when nova says use ipv6_slaac | 19:43 |
clarkb | johnsom: but I suspect that may explain some of the pip installation slowness in your timeout-on-vexxhost example, as pip may wait for ipv6 to time out then fall back to ipv4 (particularly notable is that the time spent is consistently ~60 seconds every time) | 19:44 |
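A rough sketch of why a blackholed v6 route shows up as a consistent stall: a naive client walks getaddrinfo() results in order, so the unreachable IPv6 address has to eat the whole connect timeout before the IPv4 address is even tried. Illustrative only; pip/urllib3's real connection logic is more involved, and the ~60 second figure depends on the configured timeout.

```python
import socket
import time


def connect_first_working(host, port, timeout=60):
    """Try each resolved address in order; a dead AAAA record costs the
    full timeout before the A record is attempted."""
    last_err = None
    for family, socktype, proto, _, addr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.connect(addr)
            print(f"connected to {addr} after {time.monotonic() - start:.1f}s")
            return sock
        except OSError as exc:
            print(f"{addr} failed after {time.monotonic() - start:.1f}s: {exc}")
            last_err = exc
            sock.close()
    raise last_err


# e.g. connect_first_working("pypi.org", 443)
```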
fungi | johnsom: the bug report above, we've seen it happen both in vexxhost and limestone, where it's leaking between tenants even, but i can imagine it's even more likely to occur within a tenant (some job sets up routing on a vnic, begins spewing ra packets onto the network, other nodes see those and add prefixes/routes) | 19:44 |
fungi | in theory neutron filters that, but it seems that sometimes that doesn't actually happen | 19:45 |
fungi | and as of yet, nobody's come up with a sound theory on why | 19:45 |
clarkb | and we set up a bunch of nodes once and tried to inject RAs ourselves and they never showed up on other hosts (as expected) | 19:46 |
fungi | yeah, in the past there have been races around things like port creation/deletion, et cetera, where filtering had gaps | 19:46 |
clarkb | fungi: I think my next step is to boot a vexxhost test node manually and see if I can reproduce there if the node hangs around long enough (say, check it tomorrow) | 19:47 |
clarkb | but otherwise on the test node side I didn't see anything amiss after checking about 10 instances | 19:47 |
fungi | i have a feeling it could happen in bursts, and relies on some specific set of circumstances | 19:47 |
clarkb | and maybe we set up review02 to mimic mirror01 when ianw returns | 19:47 |
fungi | you have to catch it when the right job has run there recently and misbehaved in that way while the other node was up and running | 19:48 |
clarkb | ya | 19:48 |
openstackgerrit | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Add Debian Bullseye Zuul job https://review.opendev.org/c/openstack/diskimage-builder/+/783790 | 19:50 |
*** slaweq_ has joined #opendev | 19:51 | |
*** slaweq has quit IRC | 19:52 | |
*** CeeMac has joined #opendev | 19:54 | |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/765177 | 20:03 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/765177 | 20:20 |
openstackgerrit | Paul Belanger proposed zuul/zuul-jobs master: ensure-podman: Use official podman repos for ubuntu https://review.opendev.org/c/zuul/zuul-jobs/+/765177 | 20:39 |
*** gothicserpent has quit IRC | 21:07 | |
clarkb | fungi: have you had a chance to look at those gerrit account cleanup proposals? (I know it's been a busy week or longer of fires) if possible it would be nice to go through those tomorrow | 21:36 |
*** diablo_rojo has quit IRC | 21:38 | |
*** osmanlicilegi has quit IRC | 21:44 | |
fungi | not yet, but i can take a look now | 21:44 |
clarkb | cool and thank you | 21:44 |
fungi | they're in your homedir on review? | 21:44 |
clarkb | yes let me find the exact path for you | 21:45 |
clarkb | fungi: ~/gerrit_user_cleanups/notes.20210315 | 21:45 |
fungi | 784424 merged almost 5 hours ago and still hasn't deployed | 21:45 |
fungi | and the tarballs release is still running | 21:46 |
clarkb | are we waiting on available executor slots? | 21:46 |
clarkb | those don't use normal nodes so we shouldn't be queued up behind nodepool usage | 21:46 |
fungi | checking to see if it's still in the queue | 21:47 |
clarkb | ya they are all in the queues still | 21:47 |
clarkb | as waiting | 21:47 |
fungi | ahh, yep | 21:48 |
fungi | probably blocked on the periodics? | 21:48 |
clarkb | as well as a large number of tag jobs :/ | 21:48 |
clarkb | well but periodic is doing the same thing it is just waiting as well | 21:48 |
clarkb | is this possibly a side effect of corvus' debugging? | 21:48 |
fungi | yeah, nothing's actually running for those items | 21:49 |
clarkb | the infra-prod jobs do have a semaphore | 21:50 |
clarkb | do you know if those tag jobs do too? (just wondering if this is more semaphore weirdness) | 21:50 |
fungi | the times on all those items line up with the reenqueue | 21:50 |
fungi | so maybe something is weird about how they were reenqueued | 21:50 |
corvus | clarkb, fungi: i had a fleeting thought that we may have leaked actual semaphores in the crash | 21:51 |
corvus | we have semaphore cleanup as a todo item | 21:51 |
fungi | oh! right, so maybe leaked semaphores still sitting in znodes? | 21:51 |
corvus | yep; is this only affecting infra, or is it wider? how urgent? | 21:52 |
clarkb | corvus: it is affecting a bunch of openstack tag jobs | 21:52 |
clarkb | infra + those tag jobs are the only ones I've seen so far | 21:53 |
fungi | those tag jobs are all release note publication though | 21:53 |
fungi | so maybe not urgent | 21:53 |
clarkb | ah yup | 21:53 |
fungi | and the rest, yeah, just opendev infra deployment and config management runs | 21:53 |
fungi | so not a huge deal, i can manually apply 784424 in production, that's the only thing really causing lingering pain as far as i know | 21:54 |
corvus | how does "resolve within 4 hours" sound for priority on this? | 21:54 |
fungi | oh that's plenty soon as far as i'm concerned | 21:54 |
clarkb | wfm | 21:55 |
corvus | ok. if it's more urgent, i can increase that; but all things being equal, that's convenient for me. | 21:55 |
fungi | i've manually applied 784424 in production now, so nothing else urgent i know about | 21:55 |
fungi | also happy to try manual znode surgery, betting it's safe to delete semaphore znodes older than the restart | 21:55 |
corvus | fungi: i think it's actually a znode edit | 22:00 |
fungi | ahh, okay | 22:00 |
corvus | i think a semaphore is now a json list of jobs which hold it | 22:00 |
corvus | in the case of a semaphore max of 1, however, a delete should be ok | 22:01 |
fungi | so the semaphore itself is persistent, but may be empty | 22:01 |
corvus | my guess is we have a semaphore that looks like "/path/to/semaphores/infra-prod-something" and its contents are "['build-uuid-from-before-restart']" | 22:02 |
corvus | if that's the case, and the max is 1, we can delete that znode | 22:02 |
fungi | clarkb: on the account cleanup topic, i'm in the process of adapting openstack's election tooling to work around the lack of anonymous access to the emails method in the rest api, and i'm finding there are at least some accounts which have contributed changes recently but have no preferred e-mail. wonder if we can (or should) do anything about those | 22:02 |
corvus | but if it were "['build-uuid-from-before-restart', 'build-uuid-from-after-restart']" with a max of 2, then editing would be required. | 22:02 |
fungi | corvus: right, and i have a feeling we have the latter because in this case there wouldn't be builds waiting otherwise | 22:03 |
clarkb | fungi: those users can simply go in via the web ui and set up a preferred email | 22:03 |
corvus | fungi: i think we have the former because i think our max is 1? | 22:03 |
fungi | clarkb: if we can figure out how to contact them ;) | 22:03 |
clarkb | fungi: they likely have external ids with emails in them | 22:03 |
clarkb | or the git commits they have pushed | 22:03 |
fungi | corvus: oh, okay, i must have misunderstood. so waiting builds don't get added to the data structure in the semaphore znode | 22:04 |
corvus | fungi: correct, only builds which hold the lock | 22:04 |
fungi | anyway, i'll stop distracting you | 22:04 |
fungi | clarkb: excellent point, they will probably have a committer address on the change even if the owner account has no preferred address | 22:05 |
fungi | i can probably use that as a fallback even | 22:05 |
fungi | clarkb: looking at your list, i wonder if we can also identify accounts with invalid openids and no ssh keys and no password set (regardless of whether they have a username)? | 22:08 |
clarkb | fungi: checking if password is set is hard because you have to dig into the git repo directly | 22:09 |
clarkb | it is doable though | 22:09 |
clarkb | fungi: maybe take the set that meet the other criteria as a sublist then check the git repo directly for that? | 22:09 |
clarkb | (no apis expose that essentially) | 22:09 |
fungi | ahh, nevermind. they're not usable, though may have been used previously since we did wipe all the passwords after the incident | 22:09 |
fungi | yeah, i'm good with the stuff in your proposed list. i spot-checked some from each category | 22:10 |
fungi | also i'll be around tomorrow to help with the cleanup on these if you want | 22:10 |
fungi | aha! i just realized most of these changes owned by an account with no preferred e-mail are from "OpenStack Proposal Bot" | 22:12 |
clarkb | silly bot | 22:13 |
clarkb | thank you for checking and I'll need to get back up to speed on running my scripts again :) | 22:13 |
clarkb | ok I've reviewed the gerrit db change | 22:39 |
*** sboyron has quit IRC | 22:50 | |
*** eharney has quit IRC | 22:54 | |
clarkb | fungi: should we warn the release team about the tag jobs? I assume those tags were pushed by them? but I guess they could be independent? | 22:54 |
*** tkajinam has joined #opendev | 22:57 | |
*** tkajinam has quit IRC | 22:57 | |
*** tkajinam has joined #opendev | 22:58 | |
*** auristor has quit IRC | 23:04 | |
corvus | (CONNECTED [localhost:2181]) /zuul/semaphores/openstack> get publish-releasenotes | 23:17 |
corvus | ["baaab4cfbc074796b5be235775754aaf-publish-openstack-releasenotes-python3"] | 23:17 |
corvus | (CONNECTED [localhost:2181]) /zuul/semaphores/openstack> get infra-prod-playbook | 23:17 |
corvus | ["d573f07ba3094f52bf6a69cf7a0f02a7-infra-prod-service-registry"] | 23:17 |
corvus | those are the 2 semaphores currently held | 23:17 |
corvus | this is pretty cool; i like this level of visibility :) | 23:18 |
corvus | that's uuid-jobname | 23:18 |
corvus | oh, those are queue item uuids | 23:19 |
corvus | (thus the job name addition to make it unique) | 23:20 |
corvus | i think that's so that if the build uuid changes, we keep the semaphore | 23:20 |
corvus | last restart was at 20:32 | 23:20 |
corvus | baaab4cfbc074796b5be235775754aaf last appeared in the log at 14:47 | 23:21 |
corvus | wait that restart time doesn't look right | 23:21 |
corvus | 15:47 was last restart | 23:22 |
corvus | looks like my last log entry didn't make it to the wiki :/ | 23:22 |
corvus | anyway, that entry is confirmed as stale | 23:22 |
corvus | https://codesearch.opendev.org/?q=publish-releasenotes&i=nope&files=&excludeFiles=&repos= says max is 1 | 23:23 |
corvus | so i will remove the entry | 23:23 |
corvus | the same is true for infra-prod-playbook, but it's even older. removed | 23:26 |
corvus | top releasenotes job is queued now; top infra-prod job is running | 23:27 |
clarkb | infra-prod job is in the periodic queue if anyone has trouble finding it | 23:27 |
clarkb | corvus: when you say remove the entry you removed the znode entirely or made the znode content [] ? | 23:28 |
corvus | clarkb: removed entirely | 23:28 |
corvus | shortcut valid for max=1 semaphores only | 23:28 |
clarkb | for max>1 you would edit the json to remove invalid job entries? | 23:28 |
corvus | yep | 23:29 |
corvus | but i'm going to write code so no one ever has to do that :) | 23:29 |
clarkb | ++ | 23:31 |
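A minimal sketch of what that automated cleanup could look like, using kazoo directly. The znode layout and the delete-when-empty shortcut for max=1 semaphores follow the zkCli session above; the path handling, the set of live queue item uuids, and the function name are illustrative, not the code corvus ends up writing:

```python
import json

from kazoo.client import KazooClient


def cleanup_stale_semaphores(zk, tenant, live_item_uuids, sem_max=1):
    """Drop semaphore holders whose queue items no longer exist.

    Assumes each semaphore znode holds a JSON list of
    '<queue-item-uuid>-<job-name>' strings, as in the zkCli output above.
    """
    base = f"/zuul/semaphores/{tenant}"
    for name in zk.get_children(base):
        path = f"{base}/{name}"
        data, stat = zk.get(path)
        holders = json.loads(data.decode("utf8")) if data else []
        live = [h for h in holders if h.split("-", 1)[0] in live_item_uuids]
        if live == holders:
            continue  # nothing stale here
        if not live and sem_max == 1:
            # No valid holder remains; the max=1 shortcut is to drop the znode.
            zk.delete(path, version=stat.version)
        else:
            # Otherwise rewrite the holder list in place (the max>1 case).
            zk.set(path, json.dumps(live).encode("utf8"), version=stat.version)


# zk = KazooClient(hosts="localhost:2181"); zk.start()
# cleanup_stale_semaphores(zk, "openstack", running_item_uuids)
```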
*** tosky has quit IRC | 23:40 |