corvus | i'm going to restart zuul now | 01:18 |
---|---|---|
corvus | with just one scheduler | 01:18 |
Alex_Gaynor | Is zuul experiencing some sort of known outage ATM? | 01:44 |
corvus | Alex_Gaynor: i'm restarting it | 01:45 |
Alex_Gaynor | corvus: 👍 | 01:45 |
Alex_Gaynor | How long does that typically take? | 01:46 |
fungi | up to now, something around 15-20 minutes, longer depending on what else we might be doing as part of the restart | 01:47 |
corvus | Alex_Gaynor: something like 20 minutes or so for a full restart | 01:47 |
fungi | soon though, restarts should go unnoticed | 01:47 |
corvus | (i think i'm going to have to clear state and start again) | 01:47 |
Alex_Gaynor | Got it. Thanks. (Is there a better place to follow along than here?) | 01:48 |
corvus | this is the place | 01:48 |
corvus | Alex_Gaynor: did you notice an issue before about 20m ago? | 01:48 |
corvus | (i'm wondering if this is prompted by the restart, or if there was another issue before the restart that i didn't notice) | 01:49 |
Alex_Gaynor | Looks like I started seeing issues about 30 minutes ago | 01:51 |
Alex_Gaynor | In the form of various errors loading https://zuul.opendev.org/t/pyca/status/ | 01:51 |
corvus | okay. that's as expected then :) | 01:52 |
Alex_Gaynor | 👍 | 01:52 |
corvus | i've cleared state and am starting the scheduler again | 01:59 |
ianw_pto | Alex_Gaynor: while talking about zuul and pyca things, let me know if you have any thoughts on https://github.com/pyca/pynacl/issues/601 for arm64 wheels for pynacl; i imagine it could be very similar to what we have | 02:09 |
ianw_pto | what prompted me to think about it again was recent work we were doing to upgrade our containers to bullseye; the buildx process cross-compiles the dependencies and pynacl was one of the more painful bits | 02:11 |
Alex_Gaynor | ianw_pto: 👍 I ping'd reaperhulk, pynacl makes me sad these days so I don't think about it much. | 02:11 |
ianw_pto | ok, can do, thanks :) don't want to make anyone sad | 02:11 |
Alex_Gaynor | Hehe, not remotely your fault. (Once upon a time pynacl was a library with misuse resistant cryptography, and now it's mostly cryptography that's too hipster to be in openssl) | 02:13 |
corvus | apparently we're waiting on github rate limits again | 02:22 |
corvus | i think that may mean it could be a few hours before zuul is able to start | 02:23 |
corvus | ah, i think the timeout may have expired, it's moving along now | 02:28 |
Alex_Gaynor | Is zuul having a problem ATM? I'm seeing the events queue not going down | 13:25 |
fungi | Alex_Gaynor: i'm taking a look | 13:41 |
Alex_Gaynor | 🙇‍♂️ | 13:41 |
fungi | it's at 0 for most tenants, but yes it looks like the pyca tenant is reporting an event queue length of 20 at the moment | 13:43 |
fungi | the vexxhost queue length is also 20 | 13:43 |
fungi | event queue length | 13:44 |
fungi | and the zuul tenant's event queue is at 9 | 13:44 |
fungi | all the others are at 0 | 13:44 |
fungi | i'll see if there's any clues in the scheduler logs | 13:45 |
Alex_Gaynor | FWIW the pyca one has been at 20 for many minutes, it's not just transitory | 13:45 |
fungi | the vexxhost tenant only has a gerrit source connection, so seems unlikely to be limited to github events | 13:48 |
fungi | the most recent event logged by the scheduler for the pyca tenant seems to be this one: | 13:50 |
fungi | 2021-10-31 13:17:38,143 DEBUG zuul.GithubConnection: [e: ea4f4296-3a4c-11ec-9cac-f345a58a0adc] Scheduling event from github: <GithubTriggerEvent 0x7fd7d80c0c70 pull_request pyca/cryptography refs/pull/6504/head status github.com/pyca/cryptography 6504,6f45c6d2e0978d1521718ed5e97eda6a4d97d763 delivery: ea4f4296-3a4c-11ec-9cac-f345a58a0adc> | 13:50 |
fungi | and no mention of that event id past that log entry | 13:51 |
fungi | so that does seem to support the impression that it's still hanging out in the event queue | 13:52 |
fungi | the most recent builds to start in that tenant were at 09:04:37 utc, so it was clearly processing the event queue at least up to that time | 13:53 |
fungi | implying it wasn't immediately stuck when the scheduler was restarted | 13:54 |
fungi | looks like the vexxhost tenant processed a potential trigger event as recently as 10:09:06 utc | 13:58 |
corvus | fungi: i see the same change cache error. i think we should revert to 4.10.4 | 14:10 |
fungi | okay, i saw it too but it's not the only exception so i was trying to perform some rudimentary statistical analysis to see which ones were more common before and after updating | 14:11 |
corvus | that one will stop queue processing at least | 14:11 |
fungi | AttributeError: 'NoneType' object has no attribute 'cache_key' | 14:12 |
fungi | that one? | 14:12 |
corvus | yep | 14:12 |
fungi | yeah, 5300 of those today since the debug log rotated | 14:12 |
fungi | 3135 in yesterday's log | 14:12 |
fungi | what was the trick you did for the mass downgrade last time? ansible playbook to locally tag 4.10.4 as "latest" on all the servers? | 14:14 |
corvus | i'll work on a manual revert | 14:14 |
corvus | yep | 14:14 |
corvus | stopping zuul | 14:17 |
fungi | confirmed, before yesterday we did not seem to log that exception | 14:17 |
fungi | the other one i see starting yesterday and continuing today is... | 14:18 |
fungi | AttributeError: 'str' object has no attribute 'change' | 14:18 |
corvus | i'm deleting the zk state | 14:18 |
fungi | from the rpc listener | 14:18 |
fungi | though that may have been related to manual reenqueuing of changes | 14:19 |
corvus | starting zuul | 14:20 |
corvus | https://zuul.opendev.org/api/components looks like good versions | 14:21 |
fungi | nevermind, those exceptions don't look like they were clustered at times when reenqueuing was underway | 14:21 |
Alex_Gaynor | Question: Does the restart mean the even queue was lost, or will those jobs still happen? | 14:23 |
fungi | since the state in zookeeper was cleared, the queued trigger events will be lost i think | 14:24 |
corvus | right (though items already processed and running jobs are saved, but there weren't any in pyca) | 14:24 |
fungi | note these upgrades/restarts are working toward persistent state for purposes of being able to run multiple schedulers, so restarts in the (hopefully near) future will be hitless | 14:25 |
corvus | we're almost there. unfortunately this was a bug in our state persistence :/ | 14:26 |
fungi | i need to run out on a couple of quick errands, but should return within the hour hopefully | 14:37 |
corvus | fungi: i'll be leaving for the day soon | 14:37 |
fungi | thanks, i'll keep an eye on things once i'm back, but we ran smoothly enough on that release so i don't expect it will give us trouble | 14:38 |
corvus | ++ | 14:38 |
corvus | it's back up; i'm re-enqueing items | 14:41 |
corvus | Alex_Gaynor: and i see something running in pyca | 14:41 |
Alex_Gaynor | corvus: Yeah, I kicked the job | 14:42 |
corvus | re-enqueue is done | 14:42 |
corvus | #status log restarted zuul on 4.10.4 due to bugs in master | 14:43 |
opendevstatus | corvus: finished logging | 14:43 |
fungi | i'm around again and will keep an eye to irc/mailing lists in case anyone notices something still awry | 15:49 |
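The per-tenant event queue lengths fungi reads off at 13:43 come from Zuul's web API. A minimal sketch of that check is below, assuming the status JSON for this Zuul version exposes a `trigger_event_queue` object with a `length` field; the endpoint paths are the public zuul.opendev.org API.

```python
#!/usr/bin/env python3
# Rough sketch of the event-queue check done by hand in the log: poll each
# tenant's status endpoint and print its trigger event queue length.
# Assumes the status JSON exposes "trigger_event_queue": {"length": N},
# as the Zuul version discussed here did.
import requests

ZUUL = "https://zuul.opendev.org/api"


def queue_lengths():
    tenants = requests.get(f"{ZUUL}/tenants", timeout=30).json()
    for tenant in tenants:
        name = tenant["name"]
        status = requests.get(f"{ZUUL}/tenant/{name}/status", timeout=30).json()
        length = status.get("trigger_event_queue", {}).get("length", 0)
        print(f"{name}: {length} queued trigger events")


if __name__ == "__main__":
    queue_lengths()
```

A queue that stays non-zero across several polls, as the pyca tenant did, is the signal that event processing has stalled rather than just lagging.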
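The "rudimentary statistical analysis" fungi mentions at 14:11 amounts to counting how often each suspect traceback appears in the scheduler debug log before and after the upgrade. A small sketch follows; the log path is an assumption and should be adjusted to wherever the scheduler actually writes its debug log.

```python
#!/usr/bin/env python3
# Tally occurrences of the two suspect exceptions seen after the upgrade.
# The log path is an assumption; point it at the scheduler's debug log
# (and at the rotated copy to compare yesterday vs. today).
from collections import Counter

SUSPECTS = (
    "AttributeError: 'NoneType' object has no attribute 'cache_key'",
    "AttributeError: 'str' object has no attribute 'change'",
)


def count_exceptions(path="/var/log/zuul/debug.log"):
    counts = Counter()
    with open(path, errors="replace") as log:
        for line in log:
            for suspect in SUSPECTS:
                if suspect in line:
                    counts[suspect] += 1
    return counts


if __name__ == "__main__":
    for exc, n in count_exceptions().items():
        print(f"{n:6d}  {exc}")
```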
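The "mass downgrade" trick fungi recalls at 14:14 is to pull the known-good image tag on every host and retag it locally as `latest`, so the existing deployment keeps starting the pinned version. OpenDev drove this with an Ansible playbook; the sketch below shows the per-host equivalent, with the image names and tag as assumptions rather than a record of exactly what was run.

```python
#!/usr/bin/env python3
# Sketch of the "pin by retag" rollback: pull the known-good tag and retag it
# as "latest" locally so the normal container startup picks it up.
# Image names and tag are assumptions for illustration; the real rollback was
# an Ansible playbook doing the equivalent on every Zuul host.
import subprocess

GOOD_TAG = "4.10.4"
IMAGES = [
    "zuul/zuul-scheduler",
    "zuul/zuul-web",
    "zuul/zuul-executor",
    "zuul/zuul-merger",
]


def pin_to_known_good():
    for image in IMAGES:
        subprocess.run(["docker", "pull", f"{image}:{GOOD_TAG}"], check=True)
        subprocess.run(
            ["docker", "tag", f"{image}:{GOOD_TAG}", f"{image}:latest"], check=True
        )


if __name__ == "__main__":
    pin_to_known_good()
```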