opendevreview | Merged opendev/system-config master: Add LE config for zuul01 https://review.opendev.org/c/opendev/system-config/+/816903 | 00:51 |
---|---|---|
corvus | zuul is still looking good; i think we should just continue to run it on master with both schedulers | 00:56 |
corvus | note that there is a minor consideration to be aware of re zuul: it is possible for the status json to be slightly out-of-date or flap between two states. it's random which scheduler formats the json, and they can be slightly out of sync with each other. i don't expect users to notice most of the time (i haven't even noticed so far), so i think it's okay to run in this configuration, and the benefits of early testing of 2 schedulers outweigh | 01:11 |
corvus | the downsides. but i wanted to make sure folks are aware of that in case it is noticed. (that's the next thing that's being worked on, so this isn't a long-term issue). | 01:11 |
opendevreview | Merged opendev/system-config master: refstack: don't chown db directory https://review.opendev.org/c/opendev/system-config/+/816761 | 02:00 |
ianw | Digest:sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac is the build from 816618 | 02:03 |
ianw | "opendevorg/gerrit@sha256:bc67ba409c0f02a4ee36bf855a40f511228aa4286009cd99cb3595cdee9cf3ac" | 02:04 |
ianw | so we've got the right image on the server | 02:04 |
ianw | i'm going to restart now to get the new plugin | 02:06 |
ianw | #status log restarted gerrit to pick up plugin changes from #816618 | 02:09 |
opendevstatus | ianw: finished logging | 02:10 |
opendevreview | Merged opendev/system-config master: Adjust RefStack build for osf->openinfra rename https://review.opendev.org/c/opendev/system-config/+/808480 | 02:20 |
fungi | i'm back, but looks like everything's okay | 02:51 |
*** diablo_rojo_phone_ is now known as diablo_rojo_phone | 04:06 | |
*** diablo_rojo_phone is now known as Guest5311 | 04:06 | |
*** gouthamr_ is now known as gouthamr | 04:07 | |
*** Guest5311 is now known as diablo_rojo_phone | 04:08 | |
opendevreview | Ian Wienand proposed openstack/project-config master: Add Fedora 35 https://review.opendev.org/c/openstack/project-config/+/816931 | 04:15 |
opendevreview | Ian Wienand proposed openstack/project-config master: Add Fedora 35 https://review.opendev.org/c/openstack/project-config/+/816931 | 04:20 |
opendevreview | Ian Wienand proposed openstack/project-config master: Set Fedora 34 min-ready to zero https://review.opendev.org/c/openstack/project-config/+/816932 | 04:20 |
opendevreview | Ian Wienand proposed openstack/project-config master: Remove Fedora 34 https://review.opendev.org/c/openstack/project-config/+/816933 | 04:20 |
ianw | 2021-11-08 04:28:04.208 | ++ /opt/dib_tmp/dib_build.VHZvuUPn/hooks/root.d/08-containerfile:main:62 : podman run -d dib-work-image /bin/sh | 04:29 |
ianw | 2021-11-08 04:28:04.487 | Error: error allocating lock for new container: no space left on device | 04:29 |
ianw | this is weird because there is space | 04:29 |
ianw | so it must be talking about something else ... | 04:30 |
ianw | nb02 | 04:30 |
ianw | same thing on nb01 | 04:31 |
opendevreview | Merged openstack/diskimage-builder master: tests: remove debootstrap install https://review.opendev.org/c/openstack/diskimage-builder/+/815571 | 04:59 |
ianw | $ podman ps -a | wc -l | 06:33 |
ianw | 2049 | 06:33 |
ianw | i think this has to do with it | 06:33 |
*** ysandeep|pto is now known as ysandeep | 06:36 | |
ianw | $ podman ps -a | wc -l | 06:37 |
ianw | 2049 | 06:37 |
ianw | same number on nb01 as nb02 ... that can not be a coincidence | 06:37 |
ianw | I've run "podman system reset"; the builders have daily queues going now, and i suspect that will fix it for now | 06:38 |
ianw | i'd say we're leaking the dib-work-image. i'll look into it more tomorrow | 06:38 |
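The "no space left on device" here is most likely podman's per-container lock pool rather than filesystem space: podman allocates a fixed number of SHM locks (num_locks in containers.conf, 2048 by default if memory serves), and ~2049 lines of `podman ps -a` output (one header plus roughly 2048 containers) sits right at that ceiling. A rough sketch of the kind of inspection and cleanup involved, assuming the leaked containers all descend from the dib-work-image built by the containerfile element:

```shell
# how many containers exist (the header line is included in the count)
podman ps -a | wc -l

# list the exited containers created from the dib work image
podman ps -a --filter status=exited --filter ancestor=dib-work-image

# remove just those leaked containers ...
podman ps -a -q --filter ancestor=dib-work-image | xargs -r podman rm

# ... or, as was done here, wipe all podman state on the builder
podman system reset --force
```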
*** akahat is now known as akahat|rover | 07:00 | |
*** pojadhav is now known as pojadhav|lunch | 07:47 | |
*** sshnaidm|off is now known as sshnaidm | 08:04 | |
*** pojadhav|lunch is now known as pojadhav | 08:19 | |
*** ykarel is now known as ykarel|lunch | 08:53 | |
*** ysandeep is now known as ysandeep|afk | 09:18 | |
*** ysandeep|afk is now known as ysandeep | 09:51 | |
*** ykarel|lunch is now known as ykarel | 10:21 | |
opendevreview | Martin Kopec proposed opendev/system-config master: Update Interop doc https://review.opendev.org/c/opendev/system-config/+/816982 | 10:33 |
tobias-urdin | ianw: thanks for fixing the fedora image patches when it was removed from mirrors :) | 10:33 |
*** pojadhav is now known as pojadhav|afk | 10:51 | |
*** tobias-urdin5 is now known as tobias-urdin | 11:06 | |
*** lbragstad7 is now known as lbragstad | 11:41 | |
*** ysandeep is now known as ysandeep|brb | 11:45 | |
*** ysandeep|brb is now known as ysandeep | 11:58 | |
*** jpena|off is now known as jpena | 13:24 | |
fungi | in the openstack tenant i see we've got some queue items waiting hours with a handful of their builds queued. no idea yet if it's related to the zuul updates over the weekend and dual schedulers, or maybe a recurrence of the problem we observed with dependent job sets stuck trying to get nodes from very constrained providers like airship-kna1. i'll start looking once my coffee kicks in a little | 13:36 |
fungi | also i get the impression the periodic-stable pipeline e-mails did not get distributed to the stable-maint ml. i'll double-check it, but we may be sending e-mail differently (or not at all?) from the second scheduler | 13:42 |
*** ykarel is now known as ykarel|away | 14:36 | |
corvus | fungi: thanks, i'll start looking into those | 14:38 |
fungi | stable-maint ml may just be me imagining things. i see at least some failures posted there as recently as Mon Nov 8 03:32:26 UTC 2021 | 14:43 |
corvus | fungi: oh, i believe i did not re-enqueue a batch of periodic jobs over the weekend; so they may not have gotten saturday/sunday but should have gotten sunday/monday | 14:44 |
fungi | yep, seems like that's working as expected, thanks | 14:45 |
corvus | oops, one of the log lines has accidentally changed to be less helpful, which will make tracking down this node issue a little harder | 14:46 |
fungi | i think check queue item 816807,3 in the zuul tenant may be exhibiting the same situation i'm seeing for a number of changes in the openstack tenant | 14:47 |
corvus | "for job <zuul.model.FrozenJob object at 0x7ff3fe56dbb0>" that used to be a job name :/ | 14:47 |
*** pojadhav|afk is now known as pojadhav | 14:50 | |
corvus | i used the repl to get the node request id for the top check queued job: 'neutron-tempest-plugin-scenario-openvswitch': '300-0016042207' | 14:51 |
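For reference, a node request like that can be cross-checked from the nodepool side as well; a minimal sketch, assuming shell access to a launcher with the nodepool CLI available:

```shell
# is request 300-0016042207 still present in zookeeper, and in what state?
nodepool request-list | grep 300-0016042207
```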
fungi | as for general exception roundup, i'm now spotting a couple of new ones in the scheduler debug logs aside from the previously observed "TypeError: 'NoneType' object is not subscriptable" | 14:51 |
fungi | this hit a couple of times on zuul02: https://paste.opendev.org/show/810851/ | 14:53 |
corvus | that could definitely cause problems; maybe even the one we're seeing now with unhandled node requests | 14:54 |
fungi | and this on zuul01: https://paste.opendev.org/show/810852/ | 14:54 |
fungi | also zuul01 logged 13 occurrences of kazoo.exceptions.ConnectionLoss and zuul02 logged 54 of them | 14:55 |
fungi | no idea if any of that is related | 14:55 |
corvus | that one is probably harmless; i think it's a delete of an already deleted node, but could be a follow-on from the first error | 14:55 |
corvus | that's a lot of connection losses | 14:56 |
fungi | trying to see if they're interspersed randomly or clustered | 14:56 |
corvus | would also be good to see what the rate was last week | 14:56 |
fungi | /var/log/zuul/debug.log.6.gz on zuul02 contains none as far as i can see | 14:57 |
fungi | checking a few other days | 14:58 |
fungi | as for clustering, it looks like the entries today on zuul02 came in bunches, around 11:38, 12:03-12:06, 12:10, 12:16, 12:27, 13:18, 13:36, 13:55, 14:16, 14:26, 14:29, 14:34, 14:58 | 15:00 |
fungi | some clusters were larger, some as few as one exception | 15:00 |
fungi | there could be some ongoing/intermittent network disruption | 15:01 |
fungi | did we maybe only start logging kazoo.exceptions.ConnectionLoss in some very recent commit? i've scanned back more than two weeks in the zuul02 debug log and found no hits before today | 15:04 |
corvus | wow, zuul01 saw a multi-hour period it couldn't connect to zk: from 06:56:17 to 08:09:34 | 15:05 |
corvus | fungi: no that should be unchanged; something is likely very different with our zk connection, whether that's zuul or network issues is unclear | 15:05 |
fungi | yeah, i'm continuing to grep farther and farther back, finally found one from 2021-10-22 07:55:08,906 | 15:06 |
fungi | so it's been almost 3 weeks | 15:07 |
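A sketch of the kind of sweep being described, using the scheduler debug log paths mentioned above:

```shell
# count ConnectionLoss hits per rotated scheduler debug log
for f in /var/log/zuul/debug.log*; do
  printf '%s ' "$f"
  zgrep -c 'kazoo.exceptions.ConnectionLoss' "$f" || true
done
```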
fungi | corvus: this is likely related? http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70037&rra_id=all | 15:07 |
fungi | the other two zk servers are only using 42% of their rootfs | 15:08 |
corvus | that is entirely possible | 15:09 |
fungi | /var/zookeeper/datalog/version-2 contains 28gb of data | 15:10 |
fungi | and that's around 9gb on the other two servers | 15:11 |
fungi | so 3x larger on 04 | 15:11 |
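Roughly the checks behind those numbers, run on each zk server:

```shell
# rootfs usage, then the size of the zookeeper transaction logs and snapshots
df -h /
du -sh /var/zookeeper/datalog/version-2
```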
corvus | starting at 02:00 the amount of data in zk spiked according to https://grafana.opendev.org/d/5Imot6EMk/zuul-status?orgId=1&from=now-24h&to=now | 15:11 |
fungi | periodic-stable jobs were going on around then | 15:11 |
fungi | for the openstack tenant | 15:12 |
fungi | periodic-stable triggers at 02:01 utc daily | 15:12 |
fungi | and periodic at 02:00 utc | 15:13 |
fungi | so that could be related to queuing hundreds/thousands of builds in a matter of minutes | 15:14 |
corvus | makes sense | 15:17 |
corvus | it looks like the other zk servers had their storage fill up as well, but they recovered? | 15:18 |
fungi | looks like zk may have crashed on 04? i'm trying to see if there are clues in the docker-compose logs | 15:21 |
fungi | 2021-11-08 06:53:42,330 [myid:4] - ERROR [Snapshot Thread:ZooKeeperServer@329] - Severe unrecoverable error, exiting | 15:22 |
fungi | java.io.IOException: No space left on device | 15:22 |
fungi | seems like filling up the rootfs crashed it when trying to snapshot | 15:22 |
fungi | i wonder if it would recover if we freed up enough space to start it again | 15:23 |
fungi | also i wonder if it should really be consuming that much disk, and if so we should probably attach dedicated volumes for them | 15:24 |
*** ysandeep is now known as ysandeep|dinner | 15:25 | |
fungi | happy to get started on that if it's the direction we want to go | 15:25 |
corvus | we have zk purge snapshots every 6 hours; that corresponds with the drop in this graph: http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70101&rra_id=all | 15:25 |
corvus | i wonder if we're keeping too many snapshots and should reduce that | 15:26 |
corvus | aiui, we retain 3 snapshots, but only delete the ones >3 every 6 hours | 15:26 |
fungi | the disk utilization graph suggests that it doesn't normally use anywhere close to this much on previous days | 15:27 |
corvus | on zk04 (which is helpfully frozen in time at max usage), we have around 75 snapshot files | 15:28 |
corvus | and around 25 on zk05 | 15:28 |
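A hedged sketch of how those counts can be taken; ZooKeeper keeps snapshot files under the data dir's version-2 directory, assumed here to be /var/zookeeper/data (only the datalog path is confirmed above):

```shell
# count retained snapshot files on a zk server (data dir path assumed)
ls /var/zookeeper/data/version-2/snapshot.* | wc -l
```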
fungi | growth on all three zk servers seems to have increased by close to an order of magnitude | 15:29 |
corvus | fungi: maybe before looking at expanding the disk, we should increase the snapshot cleanup frequency? | 15:29 |
corvus | maybe every 2 hours? | 15:30 |
fungi | yeah, that seems worthwhile, though it also appears that letting zk fill up the rootfs destabilizes it; dedicating a filesystem would shield the system from that impact | 15:30 |
corvus | yes, that's true. want to do both then? | 15:30 |
fungi | i'm fine with focusing on cleanup improvements first, and then working on adding cinder volumes after we've stabilized the cluster | 15:31 |
corvus | ok. i'll propose a patch to change the interval | 15:32 |
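The knobs in question are ZooKeeper's autopurge settings in zoo.cfg; something along these lines is presumably what the proposed change adjusts (values illustrative):

```
# zoo.cfg
autopurge.snapRetainCount=3   # keep the 3 most recent snapshots (ZooKeeper's minimum)
autopurge.purgeInterval=2     # run the purge task every 2 hours instead of every 6
```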
fungi | i'm curious what's come in after 4.10.4 though which causes it to create so much larger snapshots | 15:32 |
corvus | queue state :) | 15:32 |
fungi | ahh, okay, so that's the churn from moving queue state in | 15:32 |
corvus | i have some thoughts on reducing the data size, but they depend on having the queue state in zk to start with | 15:33 |
corvus | basically, we should be able to share some data which we currently duplicate in the executor queue | 15:33 |
corvus | (global repo state, job vars, etc; with queue state in zk now, executors can reach all of that data directly) | 15:34 |
fungi | what's the safest way to get zk going again on 04? i could free up ~1gb by clearing /var/log/journal for instance | 15:34 |
fungi | or can we just start it as-is and expect it will clean up old snapshots before it tries to write anything? | 15:35 |
corvus | fungi: i'm unsure; it might be okay to manually delete the snapshots | 15:36 |
Clark[m] | Not really here yet but there might be old docker images to prune too if we aren't pruning them regularly | 15:36 |
corvus | i think we should get the new interval in place first though | 15:36 |
corvus | (that way we can do just one rolling restart to get it in use) | 15:37 |
fungi | there's only one zookeeper image and two opendevorg/zookeeper-statsd images according to `docker image list` | 15:37 |
fungi | i could delete the older opendevorg/zookeeper-statsd which would free up 122mb | 15:38 |
opendevreview | James E. Blair proposed opendev/system-config master: Purge ZK snapshots more frequently https://review.opendev.org/c/opendev/system-config/+/817038 | 15:38 |
corvus | i'm going to work on breakfast, then resume zk work | 15:55 |
fungi | i see what you mean about the status json being inconsistent, i was watching that change in the status page and saw a couple of builds transition from starting back to queued | 15:55 |
*** ysandeep|dinner is now known as ysandeep | 15:56 | |
yuriys | I noticed some errors for launch attempts on the inmotion cloud, and all of them were due to (raise exception.NoMoreFixedIps(net=network_id) , which shouldn't be happening, because that's static configuration, unless we're not 'releasing' ips fast enough during destroy+provision requests. | 15:57 |
yuriys | although it did go from 0 to 100 at 13:58 UTC... hmm | 16:00 |
fungi | we have at times in the past seen floating ips leak due to pathological conditions in clouds preventing us from deleting them when we free up the ports/servers to which they were bound | 16:02 |
yuriys | Yeah looks like at 13:58 a call to destroy nearly all instances went out, shortly followed by rebuilding all of them; not quite sure how to combat that for the ip release delay vs the reprovision | 16:02 |
fungi | nodepool expects to occasionally see errors for such things, it's fine as long as they don't persist and it's able to eventually start booting new server instances there | 16:03 |
fungi | it's possible we could teach the launchers to track floating ip capacity similar to how it does other limits and refrain from making boot calls until there's capacity to assign new floating ips, i'm not quite sure what would be entailed in implementing that however | 16:05 |
yuriys | yeah it looks good now, just need a delay between destroy and create calls | 16:06 |
Clark[m] | Note this wasn't floating ips just normal IPs aiui | 16:06 |
fungi | oh, yep sorry i see now from the pasted error it was fixed ips | 16:07 |
fungi | so i guess the "ip release delay" is something neutron does? | 16:08 |
fungi | and likely doesn't expose it via the api such that we could query | 16:08 |
clarkb | as soon as my glasses are defogged I can ACK the zk purge interval change | 16:18 |
clarkb | Let me know if I can help with zk otherwise I'll stay out of the way as I've not been on top of it with y'all | 16:20 |
fungi | there's not much to catch up on. basically latest zuul moved queue state into zk which creates a ton of new churn resulting in much larger zk snapshots, so much so that it's easy for them to overrun the available space on the rootfs within the 6-hour cleanup interval we'd configured | 16:21 |
fungi | and that can also lead to crashing zk entirely if it runs out of space while writing a snapshot | 16:22 |
clarkb | I think another option available to us is to increase the snapshot interval | 16:22 |
clarkb | so that we produce snapshots less often | 16:22 |
clarkb | that is based on numbers of transactions iirc | 16:23 |
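The transaction-count setting being referred to is ZooKeeper's snapCount; raising it spaces snapshots further apart, at the cost of more transaction log data between them:

```
# zoo.cfg
snapCount=100000   # ZooKeeper's default; a larger value makes snapshots less frequent
```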
fungi | either way, yes the goal is to keep the snapshot count to a manageable level, but also probably to relocate /var/zookeeper or at least /var/zookeeper/datalog to a dedicated filesystem/volume | 16:23 |
clarkb | apparently I've got a kernel update waiting so first reboots, then acking that change | 16:25 |
fungi | also we may need some manual intervention on zk04 as the zk service on it crashed ~10 hours ago leaving its rootfs full | 16:25 |
fungi | but holding off there until the cleanup interval change deploys | 16:26 |
opendevreview | Clark Boylan proposed opendev/bindep master: Try out PBR pep 517 support https://review.opendev.org/c/opendev/bindep/+/816741 | 16:43 |
clarkb | through local testing I think maybe all we need to do is keep the setup.py script and then it will work? we'll find out | 16:43 |
clarkb | if that does work I have a docs update that I need to push to pbr | 16:44 |
fungi | if it winds up calling into setup.py even with latest build/pip/wheel, then that probably doesn't avoid the deprecation warning | 16:45 |
clarkb | fungi: ya I'm not sure on that point yet. You'd think that the official method of doing this via setuptools wouldn't trip the deprecation warning. But I don't know yet | 16:46 |
fungi | though i guess it's just calls to `setup.py install` and setup_requires/easy_install which trigger the warnings | 16:47 |
fungi | so calling into setup.py on its own may not be a problem | 16:47 |
clarkb | fungi: https://github.com/pypa/setuptools/blob/main/setuptools/build_meta.py#L158 ultimately that is what it does | 16:47 |
clarkb | oh ya and its doing setup.py sdist and setup.py bdist_wheel equivalents | 16:48 |
clarkb | and then those artifacts are handled directly by pip or poetry or whatever | 16:48 |
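For context, the pattern under discussion looks roughly like this: declare the setuptools PEP 517 backend in pyproject.toml while keeping a minimal setup.py so pbr can hook in. A sketch of what the bindep change is presumably trying, not a copy of it (version pins illustrative):

```toml
# pyproject.toml
[build-system]
requires = ["pbr>=5.7.0", "setuptools>=36.6.0", "wheel"]
build-backend = "setuptools.build_meta"
# setup.py is kept, containing little more than setuptools.setup(pbr=True)
```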
*** marios is now known as marios|out | 17:03 | |
*** ysandeep is now known as ysandeep|out | 17:04 | |
corvus | i think i have identified the issue that's caused some stuck 'queued' jobs. details and a fix in #zuul. i'd like to merge the fix, then do what i guess would be our first ever rolling scheduler restart to deploy it. | 17:08 |
fungi | woah, i'm game! | 17:14 |
corvus | looks like we're about 5m or so from the zk interval change landing. perhaps once it does, we can go ahead and manually update it and roll zk | 17:15 |
fungi | should we expect the rolling restart to get those stuck queue items moving again? (for whatever definition of "expect" we have at this point in regards to rolling scheduler restarts) | 17:19 |
fungi | looking at the fix, i assume we'll need to manually dequeue/enqueue them | 17:20 |
clarkb | ya I think we would have to reenqueue so they get new node requests | 17:20 |
fungi | as they presumably correspond to incorrectly cleaned-up node requests | 17:20 |
fungi | or we'd have to wait for the scheduler's node request timeout, which if memory serves is fairly long (8 hours? 10?) | 17:21 |
corvus | we may not have a node request timeout. and yeah, dequeue/enqueue. we could go ahead and do that now. | 17:23 |
corvus | i'll get started on that | 17:23 |
opendevreview | Merged opendev/system-config master: Purge ZK snapshots more frequently https://review.opendev.org/c/opendev/system-config/+/817038 | 17:24 |
fungi | there are a couple of builds to go in the opendev-prod-hourly buildset, before it will be able to start infra-prod-service-zookeeper in deploy | 17:26 |
fungi | huh, infra-prod-service-zuul just failed in opendev-prod-hourly | 17:28 |
fungi | oh, it's the lack of letsencrypting breaking apache reloads | 17:30 |
fungi | on zuul01 | 17:30 |
clarkb | yes, known problem with a proposed fix | 17:31 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/816903 hrm, the fix is merged; maybe it was an insufficient fix? | 17:31 |
clarkb | ACME records I bet. One sec | 17:31 |
fungi | clarkb: following https://discuss.python.org/t/11810 it seems like we may still be forced to call into deprecated setuptools functions when doing editable installs | 17:31 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add acme record for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/817061 | 17:32 |
clarkb | fungi: corvus ^ re zuul01 le | 17:33 |
fungi | and yeah, i agree with 817061 | 17:33 |
fungi | [Mon Nov 8 03:41:12 UTC 2021] zuul01.opendev.org:Verify error:DNS problem: NXDOMAIN looking up TXT for _acme-challenge.zuul01.opendev.org - check that a DNS record exists for this domain | 17:33 |
fungi | rather, acme.sh agrees ;) | 17:33 |
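The failure is visible directly with a DNS query; once 817061 lands and the zone is published, this should stop returning NXDOMAIN (the added record is presumably a CNAME pointing the challenge name at the acme zone, as is the usual pattern here):

```shell
# what acme.sh is looking up, and the delegation it likely rides on
dig +short TXT _acme-challenge.zuul01.opendev.org
dig +short CNAME _acme-challenge.zuul01.opendev.org
```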
clarkb | fungi: re editable installs sounds like setuptools is hoping to address that at some point | 17:34 |
fungi | right, they simply haven't yet | 17:34 |
fungi | i should have said "...for now" | 17:34 |
corvus | clarkb: thx, sorry | 17:35 |
corvus | there are many items that need re-enqueueing; i'm having emacs help me with the commands | 17:35 |
fungi | elisp to the rescue? | 17:36 |
corvus | okay, it's 122 commands (half dequeue half enqueue) | 17:36 |
corvus | that's running. it will take a long time. | 17:37 |
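The generated commands are pairs along these lines, run with the zuul RPC client on a scheduler (project and change values here are hypothetical):

```shell
# drop and re-add one affected change so it gets fresh node requests
zuul dequeue --tenant openstack --pipeline check \
    --project opendev.org/openstack/nova --change 816999,1
zuul enqueue --tenant openstack --pipeline check \
    --project opendev.org/openstack/nova --change 816999,1
```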
corvus | it should be safe to manually edit the zoo.cfg files now, right? | 17:38 |
fungi | infra-prod-service-zookeeper failed in deploy | 17:38 |
*** jpena is now known as jpena|off | 17:38 | |
clarkb | likely due to the lack of disk space to ansible on zk04? | 17:38 |
corvus | probably because zk04 is out of space | 17:38 |
corvus | yeah | 17:38 |
fungi | mkdir: cannot create directory ‘/root/.ansible/tmp/ansible-tmp-1636392894.8472807-6361-108486208652702’: No space left on device | 17:39 |
fungi | indeed | 17:39 |
corvus | i'd like to manually make those edits on all 3 servers; delete the snapshots from under zk04, then restart zk04. how does that sound? | 17:40 |
clarkb | that seems reasonable. | 17:40 |
clarkb | (I don't know how zk04 will handle the loss of snapshots but in theory they are intended to be moved around for restoring stuff so should be safe to move them away/delete them) | 17:40 |
corvus | yeah; i'm expecting it to fully resync from the other servers; we'll see | 17:41 |
fungi | sounds great to me, thanks | 17:41 |
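The plan, as a rough shell outline; the compose directory and data dir paths are guesses, and the key points are to leave the node stopped while clearing the version-2 contents (keeping myid intact), then let it resync from the quorum on startup:

```shell
# on zk04, with the zookeeper container already down
cd /etc/zookeeper-compose                              # hypothetical path
docker-compose ps                                      # confirm zk is stopped
sudo rm /var/zookeeper/datalog/version-2/log.*         # transaction logs
sudo rm /var/zookeeper/data/version-2/snapshot.*       # snapshots (data dir path assumed)
docker-compose up -d                                   # rejoin and resync from the leader
```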
corvus | i deleted the logs and snapshots. that appears to have caused a zk outage :/ | 17:47 |
corvus | zuul is...recovering now? | 17:49 |
fungi | was that by deleting from 05 and/or 06 with 04 still down, or was 04 running again? | 17:50 |
corvus | only restarting 04 after deleting files from disk | 17:50 |
fungi | maybe the cluster members got really busy trying to restore missing snapshots so stopped handling requests for a while? | 17:51 |
fungi | i guess zk doesn't like to answer requests if it has reason to believe the cluster is incoherent | 17:51 |
clarkb | I think snapshots are per node and sort of forgotten about once written, aside from purging | 17:51 |
fungi | ahh, so maybe not then | 17:51 |
clarkb | but zk04 reentering the cluster will cause the cluster to bring it up to speed for the current dataset | 17:52 |
corvus | 05 crashed when i brought 04 up | 17:52 |
fungi | ouch | 17:52 |
corvus | i think that explains the outage (but i don't know why 05 crashed) | 17:52 |
corvus | 06 stayed up | 17:53 |
fungi | something was clearly going on with zk05 around 17:39 | 17:54 |
corvus | my dequeue command is stuck. i wonder if the rpc handler thread is stuck | 17:54 |
corvus | however, both schedulers are still processing pipelines | 17:56 |
corvus | i've restarted the dequeue/enqueue script | 17:56 |
corvus | and it is proceeding (i'm unsure if only one scheduler is handling those commands now or both; it doesn't really matter since we're going to restart them) | 17:57 |
fungi | zk05 logged quite a few "java.io.IOException: ZK down" and immediately prior to those appearing there was this: | 17:57 |
fungi | 2021-11-08 17:39:42,051 [myid:5] - WARN [QuorumPeer[myid=5](plain=[0:0:0:0:0:0:0:0]:2181)(secure=[0:0:0:0:0:0:0:0]:2281):Follower@96] - Exception when following the leader | 17:57 |
corvus | the zk graphs look reasonable now; i think it's time to update the config for 05 and restart it now | 17:57 |
corvus | oh, zk05 disk is full | 17:58 |
fungi | d'oh! | 17:58 |
fungi | you'd think i would have checked that FIRST | 17:58 |
* fungi sighs | 17:58 | |
corvus | now we know why it crashed | 17:58 |
corvus | i did check it first, but i clicked on ze05 and not zk05 :( | 17:59 |
* corvus sighs | 17:59 | |
corvus | okay, so i'm going to repeat the process i did on zk04 | 17:59 |
clarkb | should we check zk06 state first? | 17:59 |
clarkb | then see if restarting zk05 causes it to fill its disk? | 18:00 |
corvus | there's lots of space on zk06 right now | 18:00 |
fungi | /dev/root 39G 7.9G 31G 21% / | 18:00 |
clarkb | like becoming the leader causes an explosion of disk use maybe? | 18:00 |
corvus | so i think we have good reason to think it might work this time | 18:00 |
clarkb | cool | 18:00 |
corvus | starting 05 now | 18:02 |
fungi | disk usage is climbing on 06, but not super rapidly | 18:03 |
opendevreview | Merged opendev/zone-opendev.org master: Add acme record for zuul01 https://review.opendev.org/c/opendev/zone-opendev.org/+/817061 | 18:05 |
corvus | i don't know if we get a confirmation that zk05 is fully synced or not. it looks like it has the right amount of data, but there's no warm-fuzzy log message saying everything is okay. | 18:06 |
clarkb | corvus: I think with zookeeper it can't be part of the cluster without having synced fully | 18:06 |
clarkb | basically once it shows up as a follower to one of the others it is good. But ya no explicit indication of that | 18:06 |
corvus | i think i'm ready to assume that everything is okay and we can proceed to zk06 now | 18:06 |
corvus | ++ | 18:07 |
corvus | i will shut down zk06 now. it's the leader, so this will cause some churn | 18:07 |
corvus | starting 06 | 18:08 |
corvus | it's back up; all 3 servers are up; 05 is the leader now | 18:09 |
corvus | i think that concludes zk work. dequeue/enqueue script is still crawling along. | 18:10 |
clarkb | then next up is the update to fix the identified zuul bugs? | 18:11 |
corvus | incidentally, i just observed the expected status page flap for the first time on the zuul tenant status page | 18:11 |
corvus | yep | 18:11 |
fungi | i observed it a few times earlier today watching system-config changes in the status page for the openstack tenant | 18:12 |
corvus | once those changes merge, we'll do the rolling restart. | 18:12 |
clarkb | corvus: for the status page flap that happens because zuul-web can load the state from the non active scheduler and get stale info? | 18:12 |
corvus | clarkb: yep (though i wouldn't characterize it as a non-active scheduler, but rather, a scheduler who may not have processed the pipeline most recently) | 18:13 |
clarkb | the one that won the election least recently :) | 18:14 |
corvus | (since it's sort of random which scheduler processes which pipeline(s) at any given time); odds are they should eventually converge on the same data relatively quickly | 18:14 |
corvus | they're both active at the same time. they just both race through the list of tenants and pipelines and try to lock and process each one individually | 18:15 |
corvus | if they start at the same time, and one of them locks openstack/check the other will skip that one and lock openstack/gate | 18:15 |
clarkb | got it | 18:15 |
corvus | then when the first finishes openstack/check, it'll try to lock openstack/gate. if the second has finished, it will succeed and process it again (which is wasteful but harmless and quick); if the second still has the lock, it'll leapfrog and go to openstack/post. | 18:16 |
fungi | keeping an eye on the filesystem utilization on the zk servers... based on the fairly linear trending i anticipate we'll keep them under 60% used when they prune snapshots | 19:02 |
fungi | s/when/just before/ | 19:02 |
clarkb | note I'm trying to dequeue/enqueue things to address the fight between the schedulers over builds with retry results | 19:02 |
fungi | this should also give us some idea on sizing for cinder volumes | 19:02 |
clarkb | it is slow going. | 19:03 |
fungi | though i think rackspace requires cinder volumes be a minimum of 100gb, which will probably be 4x what we end up using with the current churn and pruning | 19:04 |
clarkb | I am unable to tell if the tripleo gate queue is thrashing in this way or not. It is hard for me to see with the flapping status page. It seems like tripleo might be ok? | 19:06 |
corvus | i monkeypatched in a fix for the retry thrashing bug; actual fix is https://review.opendev.org/817067 (and that's == what i patched) | 19:26 |
clarkb | I've approved it just now | 19:28 |
clarkb | I have ~7 changes to reenqueue that were dequeued and that is in progress | 19:28 |
corvus | clarkb: yeah, i don't see a way to dequeue that periodic item, but with the immediate retry thrashing issue resolved i think we can probably just leave it? | 19:28 |
corvus | (if we really wanted to dequeue it i could probably come up with something in the repl, but danger seems to outweigh benefit) | 19:29 |
clarkb | ok | 19:29 |
corvus | incidentally, my de/re-enqueue of nodepool-bug-affected changes is continuing | 19:30 |
fungi | it will eventually time out the node requests, right? | 19:30 |
clarkb | a number of the jobs are actually running I think | 19:30 |
clarkb | it should largely be harmless other than node consumption | 19:30 |
corvus | fungi: at this point i don't think the periodic item is suffering from any bugs | 19:30 |
corvus | (and i don't believe node requests have a timeout) | 19:31 |
corvus | so i think the current state is: retry bug hotpatched; nodepool bug still present but rarely triggered; remediation of earlier nodepool bug fixes in progress; waiting for zuul changes to merge in order to rolling-restart with all known bugs fixed | 19:32 |
fungi | i thought the scheduler would resubmit a node request if not satisfied in something like 8 hours? | 19:32 |
fungi | (i vaguely recall reviewing a change for that, at least) | 19:33 |
SpamapS | Ohai, trying to add an email identity on my opendev account.. just wondering if the email loop verification emails are working. I waited a whole 5 minutes. ;) | 19:33 |
clarkb | down to two reenqueues in progress and then I should be done recovering | 19:33 |
fungi | SpamapS: i can check the mta on the server | 19:33 |
corvus | i'm going to afk for a bit | 19:34 |
SpamapS | fungi: TY, I don't mean to be urgent I'd just like to submit the patch before I have to task switch back to not-zuul things. :) | 19:35 |
fungi | SpamapS: i see a message for what i assume to be your new work address accepted by aspmx.l.google.com at 19:29:21 utc with gsmtp queue id e18si38289043vsu.306 | 19:35 |
fungi | maybe check your junk mail? | 19:35 |
fungi | definitely wasn't rejected, at least | 19:36 |
SpamapS | Oh boo it was counted as spam wow that's rare. | 19:36 |
SpamapS | TY, reported as not-spam :) | 19:36 |
fungi | SpamapS: if you can help us determine why it was counted as spam we could try to remedy it or pass the reasons along to the gerrit upstream maintainers | 19:37 |
clarkb | reenqueue complete for the things I dequeued. I think we're in a good spot now. But will keep an eye on the openstack check queue in particular to make sure things are rotating out of there | 19:38 |
SpamapS | fungi: DM'd you the full text with headers. I don't see many clues in there. It's possible some creepy AI made a few assumptions. | 19:39 |
clarkb | I want to say in the past we've had to force smtp over ipv4 for gmail addresses. Maybe we need to find a way to send email via ipv4 for any google hosted domain? | 19:40 |
fungi | clarkb: in this case it went over ipv4 anyway from the looks of the headers and our logs | 19:46 |
clarkb | gotcha likely unrelated to google assuming ipv6 mail is spam then | 19:47 |
fungi | i like that my personal mailserver's spamassassin is configured to inject a header line indicating which sa rules a given message matched so i can have some idea of why it was or wasn't considered spam | 19:47 |
SpamapS | It's also possible that it's just assuming nobody has ever sent @spotify.com an email from opendev and so it made me +1 the first one. | 19:47 |
fungi | wish other mailservers did that | 19:47 |
clarkb | I think the tripleo gate queue may have reset without a failing job. Indication of another bug? I don't have hard evidence of that though; it's more that I was skimming and noticed jobs seemed to go away and have now restarted, but I don't see a failed change | 19:52 |
fungi | zk04 pruned its snapshots around 50% utilization on the fs | 19:54 |
clarkb | tooz runs a ridiculous number of jobs | 19:54 |
fungi | i expect 05 and 06 to also do so shortly, and they're at similar utilization levels | 19:54 |
fungi | yep, they all cleaned up at around 20-21gb used on the rootfs | 20:28 |
fungi | hopefully that's enough to survive through tomorrow's periodic jobs | 20:29 |
clarkb | we are able to reduce the cleanup period to an hour too | 20:29 |
clarkb | a few more options available to us to trim it back further if we feel that is necessary | 20:30 |
fungi | yep | 20:30 |
clarkb | the zuul fixes are landing now and over the next half hour should hopefully all be in | 20:31 |
clarkb | I'm going to have to grab lunch soon but will be back after and can help with watching it post restart | 20:31 |
fungi | yeah, i'll be context switching to dinner prep in roughly an hour, but after i've eaten i can help with restarts or whatever | 20:31 |
clarkb | the tripleo queue has reset but this time clearly due to a failing job so no concern there | 21:09 |
*** dviroel is now known as dviroel|out | 21:11 | |
ianw | anyone mind if i supervise f35 builds today? https://review.opendev.org/c/openstack/project-config/+/816931 | 21:13 |
ianw | i don't want to spend time debugging f34 any more; if f35 still has issues booting i guess we/i can debug that | 21:13 |
ianw | we also have some sort of leak with containerfile builds + podman images; probably on an error path, but i'm not sure yet. i'll look into that | 21:14 |
clarkb | ianw: is there a corresponding grafyaml update we need to do? But +2 from me | 21:15 |
ianw | clarkb: hrm, looks like we haven't updated that in a while; i'll do that today | 21:17 |
ianw | on the gmail spam thing -- prompted me to look as all my @redhat.com is gmail and i have 30 days history -- nov 3, 5 & 6 i got one mail, and today every mail in the stack https://review.opendev.org/c/openstack/heat/+/816604 got put in there | 21:19 |
clarkb | fungi: corvus: looks like we had a kazoo connection loss at ~20:03 UTC from zuul02 | 21:20 |
ianw | so, that would be another data point that perhaps in early nov something did change to make gmail think gerrit might be more spammy | 21:20 |
ianw | for all their machine learning skills, why can't they realise i've probably got tens of thousands of emails from this address and never marked one as spam, and have filters setup to apply labels ... maybe that's a pretty big flag i do want to see these mails | 21:21 |
clarkb | we seem to have a number of these "connection closed due to Len error" messages in the recent zk05 docker container log | 21:26 |
clarkb | side note we seem to write a new snapshot every 5 minutes or so | 21:26 |
ianw | tobias-urdin: yeah, sorry i missed that use of the image. although i'd strongly suggest that with the stable branches, fedora is not really a good choice. i presume it's because something recent is required -- but perhaps 8-stream, or even 9-stream now, would work better? | 21:27 |
corvus | my de/re-enqueues from earlier are done | 21:28 |
corvus | i think we may be writing a too-large object into zk which is triggering a connection loss | 21:31 |
clarkb | ah ok | 21:31 |
clarkb | the size is ~1.8MB and our limit is 1MB? | 21:31 |
clarkb | the size from the len error I mean is ~1.8MB | 21:32 |
corvus | https://paste.opendev.org/show/810857/ | 21:32 |
corvus | what's the len error? | 21:32 |
corvus | oh that's the error from the zk server? | 21:32 |
clarkb | yes if you docker logs the zk container you'll see the len error tracebacks | 21:32 |
clarkb | with what I presume is a byte length | 21:32 |
corvus | that sounds plausible | 21:33 |
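For the record, the "Len error" is ZooKeeper refusing a request larger than jute.maxbuffer (roughly 1MB by default), which is why a ~1.8MB write kills the client connection. The server-side evidence was pulled from the container logs along these lines:

```shell
# on a zk server; the container name is whatever `docker ps` shows for zookeeper
docker logs <zookeeper-container> 2>&1 | grep -i 'len error'
```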
clarkb | I noticed it when looking into the "Unknown Configuration Error" messages for https://review.opendev.org/c/openstack/octavia/+/798151 the latest recheck seems to have enqueued though | 21:34 |
corvus | it looks like we're writing a big serialized config error to zk | 21:34 |
corvus | i don't quite understand why we're doing that in the format status json call | 21:35 |
corvus | i think i want to assume the second traceback is somehow corrupt, and i only need to look at the first one. it makes sense. | 21:36 |
clarkb | looking at it I think the second one may be side effect of the connection loss but not due to the specific write | 21:37 |
clarkb | basically first one explains the connection loss then second one is explained by the connection loss | 21:37 |
corvus | yeah, just not sure why the second one is doing any writes; like, _loadDynamicLayout shouldn't be in the call stack for a status json get | 21:38 |
clarkb | oh I see, could be interleaved writes maybe? | 21:40 |
corvus | i don't think we get that with the logging module | 21:41 |
clarkb | looks like it is trying to get the event queue length to report in the status json. Maybe that loads more info than anticipated? | 21:43 |
corvus | how do we go from "raise self._exception" to "item.setConfigErrors" in the next frame? | 21:44 |
corvus | maybe it has something to do with kazoo's internal async processing and retry handling | 21:45 |
corvus | and the second thread has an async retry of the original exception in its call stack. i dunno. it's weird. i'm still just going to try to fix the first error. :) | 21:46 |
corvus | unfortunately, i don't know which attribute of the ConfigurationError we're trying to serialize which is too big | 21:46 |
clarkb | my hunch would be the message content since that can be quite large iirc | 21:47 |
corvus | right, but "error" or "short_error" ? | 21:48 |
corvus | or is it that there are a large number of ConfigurationError objects | 21:48 |
clarkb | gotcha | 21:49 |
clarkb | corvus: should we shard all the error objects? since they can easily get large? or are you still trying to track down the specific attribute(s) | 22:33 |
corvus | clarkb: i'm working on a change to shard all the error objects | 22:34 |
clarkb | ok I look forward to reviewing it | 22:34 |
fungi | okay, i'm back from cooking and eating foods | 22:49 |
clarkb | infra-root I've updated https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting with notes about multi zuul scheduler and the container user stuff I started looking at. I dropped fips stuff and fedora 34. Anything else we should add? | 22:50 |
fungi | and caught up on the zk disconnect sitch | 22:50 |
clarkb | ianw: btw the new js theme looks great | 22:50 |
ianw | thanks, i think that's it. will be working on the 3.4 upgrade prep | 22:52 |
ianw | i don't know if i just hadn't noticed before, but it was great to see on the polymer docs page i was looking at that it's all deprecated and now you should use "lit" | 22:56 |
clarkb | hahaha | 22:56 |
ianw | so just as we get polygerrit and i have some vague understanding of it ... it's already old news | 22:56 |
fungi | web frameworks will stabilize some day | 23:16 |
fungi | and on that day, i'll begin to understand them | 23:16 |
clarkb | corvus: fungi there are ~8 changes in openstack check that look like they might be stuck. They are the first 8 | 23:24 |
clarkb | is this possibly related to the lost noderequests somehow? | 23:24 |
corvus | clarkb: yes, likely so. | 23:24 |
clarkb | ok I can dequeue enqueue them if we think that is the best step for them | 23:25 |
corvus | i think all the bugfixes so far have landed, except the one i just uploaded for the configerrors | 23:25 |
corvus | clarkb: yeah, i'm leaning toward dequeue/enqueue and wait for the next bugfix to land before doing the rolling restart | 23:26 |
clarkb | ok I'll start on those now | 23:26 |
corvus | ok. i'm going to afk so i can be around when it lands | 23:27 |
clarkb | alright dequeue enqueue complete for the changes that looked suspect to me | 23:39 |
clarkb | corvus: when you return 815696,4 seems to be stuck but is also only just 2 hours old which means it should be more recent than the monkey patches. I'll hold off on dequeue/enqueueing that one in case that is a good one to examine for further debugging | 23:43 |
corvus | clarkb: i only monkeypatched the retry fix, not nodepool | 23:47 |
corvus | so i think you should just go ahead and re-enqueue that one | 23:48 |
clarkb | oh got it | 23:48 |
clarkb | proceeding | 23:48 |
corvus | whew, i saw zuul tenant gate head with a queued job, but it just started :) | 23:54 |
clarkb | I notice there are a few more changes that need to be dequeue/enqueued. Working on that now | 23:59 |
clarkb | they weren't all at the top of the queue | 23:59 |