| Nick | Message | Time |
|---|---|---|
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 962238: Fix defaults for upload-image-swift and -s3 https://review.opendev.org/c/zuul/zuul-jobs/+/962238 | 00:34 |
| @tristanc_:matrix.org | Hello folks, we are trying to diagnose a weird issue that occurs when we perform a ZooKeeper upgrade. We presently have only one replica, and when the ZK service is restarted, we observe that Nodepool deletes in-use nodes, which causes running Zuul jobs to fail weirdly. This apparently happens for the ZKNodes that are unlocked, but looking at the metastatic adapter, the IN_USE node state should be locked with a non-ephemeral lock. So the question is: is it expected to lose in-use nodes when ZooKeeper is fully restarted? And is the only way to upgrade ZooKeeper without losing IN_USE nodes to set up more than one replica? | 09:43 |
| @jim:acmegating.com | tristanC: a zk quorum with an even number of nodes could split-brain and cause issues. running an odd number is recommended. | 13:55 |
| @fungicide:matrix.org | (this is why opendev has 3 zk servers for its zuul) | 13:59 |
| @jangutter:matrix.org | I have a question about https://review.opendev.org/c/zuul/zuul-jobs/+/962194/1/roles/ensure-python/tasks/main.yaml <--- I think that general section is making the assumption that Fedora and CentOS python packages are using dotless convention by default. I'm reasonably sure the regex on line 44 can be removed. | 14:31 |
| | I don't know much about the users of this, for example whether python_version: "311" is considered a valid input to the role. | |
| @jangutter:matrix.org | Argh, I meant to say "making the incorrect assumption" | 14:32 |
| @tristanc_:matrix.org | corvus: fungi Thanks! And one is not enough I guess... It's a bit surprising though, I thought that restarting zookeeper would not kill running jobs | 14:33 |
| @fungicide:matrix.org | jangutter: i have a feeling some of that came about due to using the role in tox jobs where "py311" is tox's built-in selector for a python 3.11 test environment | 14:35 |
| @fungicide:matrix.org | and so we wanted a way to map a test environment back to a python interpreter install mechanism | 14:35 |
| @fungicide:matrix.org | but yeah, it looks like this is in reverse? we pass in "3.11" as python_version and then it expects fedora to supply a python311-devel package | 14:37 |
| @jangutter:matrix.org | I'm thinking the behaviour for fedora/rhel packages should map "311" or "3.11" to "python3.11-devel" (or "39"/"3.9" to "python3.9-devel"). Does that sound reasonable? | 14:38 |
| @fungicide:matrix.org | jangutter: hunting around, it looks like at some time in the past there was a dotless package naming scheme in centos/epel | 14:38 |
| @jangutter:matrix.org | I'm shocked, I tell you. | 14:39 |
| @fungicide:matrix.org | yeah, rpmfind has a lot of python39-devel package hits, for example | 14:39 |
| @fungicide:matrix.org | opensuse still does python311-devel apparently | 14:40 |
| @fungicide:matrix.org | so it has now diverged depending on the red hat sub-flavor | 14:40 |
| @jangutter:matrix.org | _but_ I wonder if there's an alias. | 14:40 |
| @fungicide:matrix.org | oh maybe | 14:41 |
| @jangutter:matrix.org | yeah, python39 + python3.9 works on c9s | 14:41 |
| @jangutter:matrix.org | Oh my giddy aunt.... python312 works on c10s too | 14:42 |
| @jangutter:matrix.org | OK, so please ignore my rant - it turns out that even though the package might be named python3.9 or python39, the aliases seem to work and last across many versions. | 14:43 |
| @jangutter:matrix.org | Aha. | 14:45 |
| @jangutter:matrix.org | There is a difference though: python3.12-devel works, but python312-devel does not. | 14:45 |
| @fungicide:matrix.org | so depending on the vintage we'll need to use one or the other i guess? | 14:46 |
| @jangutter:matrix.org | so for the base packages the aliases remain, but the devel package seems to have dropped it. That's new in both c9s and c10s | 14:46 |
| @jangutter:matrix.org | For the devel package (with rpmfind) it looks like OpenSuse diverged from RH-derived distros. | 14:48 |
| @fungicide:matrix.org | right, that's what i was saying earlier | 14:48 |
| @jangutter:matrix.org | (just the devel package mind you!) | 14:48 |
| @fungicide:matrix.org | what a mess | 14:48 |
| @fungicide:matrix.org | i tried to express this complexity in the pep 725/804 draft discussions, but i'm not sure my concern was heard | 14:49 |
| @fungicide:matrix.org | (not about python.*-devel packages specifically, but the problem of package names changing over time in various distros) | 14:50 |
| @jangutter:matrix.org | That's why you shouldn't download the packages yourself: give it to an LLM agent to do it for you! | 14:52 |
| @fungicide:matrix.org | but what if you already *are* an llm agent? | 14:52 |
| @jangutter:matrix.org | I'm waay ahead of you never having had a soul. | 14:53 |
| @jangutter:matrix.org | Never should have gone to 64 bit... it all went downhill from there. | 14:54 |
| @jangutter:matrix.org | Looking at the ci coverage for that job, there are precious few rpm-based distros voting on it. | 14:56 |
| @fungicide:matrix.org | at this point opendev only has centos stream and rocky available, someone is currently working on alma. we dropped fedora and opensuse due to lack of interest (in maintaining support for them) | 14:58 |
| @jim:acmegating.com | (if people are interested in maintaining those, volunteering to do so in opendev is welcome :) | 14:59 |
| @fungicide:matrix.org | yes, exactly | 14:59 |
| @jangutter:matrix.org | ah, to be young and have the fun of keeping a distro integrated! (decades ago, I had a Gentoo desktop) | 15:00 |
| @jangutter:matrix.org | So, two choices: keep the current logic in-place in `ensure-python` (just amend it for c10), or I can propose a fix with something that swings the logic around? It's niche, but it means one less thing we keep downstream. | 15:02 |
| @fungicide:matrix.org | yeah, i think if the only user who's reported running into it only needs it to work for centos, then take the simple approach for now and don't prematurely engineer support for other users who haven't brought it up and likely don't exist | 15:08 |
| @fungicide:matrix.org | reworking it so the "old" package name format is the exception might help avoid future updates to the logic, but would also potentially be backward-incompatible for platforms we don't know about | 15:12 |
| @fungicide:matrix.org | basically, prioritize fixing what we know isn't working for actual users who have run into a problem, but avoid breaking what may be currently working for users we're not aware of | 15:13 |
| @jangutter:matrix.org | Agree... at best I think I need to add a note in the code though. | 15:17 |
| @jangutter:matrix.org | (for future archaeologists, in some distant age....) | 15:18 |
| @fungicide:matrix.org | yes, absolutely, a comment about it would be grand | 15:19 |
| @fungicide:matrix.org | and then later if we end up with an unwieldy list of exceptions or get annoyed by constantly updating it, that's the time to think about potentially-backward-incompatible refactoring of the logic | 15:20 |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 15:27 |
| | - [zuul/zuul] 961292: Launcher: handle reused node failure https://review.opendev.org/c/zuul/zuul/+/961292 | |
| | - [zuul/zuul] 961557: Assign unassigned building nodes to requests https://review.opendev.org/c/zuul/zuul/+/961557 | |
| | - [zuul/zuul] 962145: Use a subnode for request assignment https://review.opendev.org/c/zuul/zuul/+/962145 | |
| -@gerrit:opendev.org- | James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 962291: Update packer job/role documentation https://review.opendev.org/c/zuul/zuul-jobs/+/962291 | 15:43 |
| -@gerrit:opendev.org- | Zuul merged on behalf of Andy Ladjadj: [zuul/zuul-jobs] 836744: fix(packer): prevent task failure when packer_variables is not defined https://review.opendev.org/c/zuul/zuul-jobs/+/836744 | 15:57 |
| -@gerrit:opendev.org- | Jan Gutter proposed: [zuul/zuul-jobs] 962194: Fix up some EL10 compatibility https://review.opendev.org/c/zuul/zuul-jobs/+/962194 | 16:52 |
| @gordonmessmer:fedora.im | https://opendev.org/zuul/zuul/src/branch/master/tools/docker-compose.yaml refers to a container named zuul-test-zookeeper, but I can't find that container or a definition for it | 18:09 |
| @fungicide:matrix.org | Gordon Messmer: that... is the definition for it? | 18:23 |
| @fungicide:matrix.org | docker-compose reads the definition in that file to create the container | 18:23 |
| @fungicide:matrix.org | e.g. if you run https://opendev.org/zuul/zuul/src/branch/master/tools/test-setup-docker.sh i think? | 18:24 |
| @gordonmessmer:fedora.im | oh hell... I misread the error message I get when trying to start podman-compose. it's failing to *start* that container because of an "invalid mount option" -_- | 18:25 |
| @gordonmessmer:fedora.im | podman-compose seems not to understand the tmpfs directions | 18:27 |
| @fungicide:matrix.org | i wonder if it's thrown by the uid= option | 18:27 |
| @gordonmessmer:fedora.im | yes | 18:28 |
| @fungicide:matrix.org | at least that's not set in the mysql container's tmpfs list | 18:28 |
| @fungicide:matrix.org | so if it's starting the mysql container and not zookeeper then that would stand to reason | 18:28 |
| @fungicide:matrix.org | looks like that option was added to the docker-compose.yaml file by https://review.opendev.org/c/zuul/zuul/+/835019 ~3.5 years ago, for reference | 18:30 |
| @fungicide:matrix.org | i would be surprised if nobody's tried running the setup with podman-compose in that long, but i suppose it's possible | 18:31 |
| @gordonmessmer:fedora.im | perhaps they hit an error and simply revert to using docker. :) | 18:31 |
| @fungicide:matrix.org | "The :U suffix tells Podman to use the correct host UID and GID based on the UID and GID within the <<container|pod>>, to change recursively the owner and group of the source volume. Chowning walks the file system under the volume and changes the UID/GID on each file, it the volume has thousands of inodes, this process will take a long time, delaying the start of the <<container|pod>>." | 18:34 |
| @fungicide:matrix.org | from the "Chowning Volume Mounts" section of https://docs.podman.io/en/v4.4/markdown/options/volume.html | 18:35 |
| @fungicide:matrix.org | i wonder if that's used directly by podman-compose | 18:35 |
| @gordonmessmer:fedora.im | I've simply removed the uid mount option, and the USER spec for the container. that allows podman-compose to start the set, at least. | 18:35 |
| @gordonmessmer:fedora.im | perhaps also notable, tox 4.30 no longer supports the "whitelist_externals" directive, so tox fails. | 18:39 |
| @fungicide:matrix.org | ah, yeah zuul switched to nox around the time that tox v4 happened, so probably never got fixed to work with it | 18:41 |
| @fungicide:matrix.org | i bet TESTING.rst got overlooked for updating | 18:42 |
| @gordonmessmer:fedora.im | yes, that would make sense. it still refers to tox | 18:42 |
| @fungicide:matrix.org | well, we also never removed the old tox.ini | 18:42 |
| @gordonmessmer:fedora.im | OK, so do I need to know anything other than "run nox"? | 18:44 |
| @fungicide:matrix.org | aha, there are still some uses of tox mixed around in tool scripts looks like, which is i guess why tox.ini wasn't removed, though i see patterns like `ensure_tox_version: "<4"` | 18:45 |
| @fungicide:matrix.org | Gordon Messmer: basically yes, though the equivalent of `tox -e myenv` is `nox -s myenv` | 18:45 |
| @fungicide:matrix.org | and the environments are defined in noxfile.py instead of tox.ini | 18:46 |
| @gordonmessmer:fedora.im | thanks. let's see what happens... | 18:48 |
| @gordonmessmer:fedora.im | my goal is to rebase and update https://review.opendev.org/c/zuul/zuul/+/859939 | 18:48 |
| @fungicide:matrix.org | cool, this is helpful discussion regardless, i'm working on a patch now to get some of the stuff you've identified cleaned up | 18:49 |
| @fungicide:matrix.org | though not sure what we should do about podman-compose at the moment | 18:49 |
| @fungicide:matrix.org | that'll need a bit more digging | 18:49 |
| @gordonmessmer:fedora.im | I'm seeing a lot of kazoo client errors, "Connection time-out" | 18:51 |
| @fungicide:matrix.org | could be that zookeeper didn't start | 18:52 |
| @fungicide:matrix.org | which might be due to the dropped uid mapping | 18:52 |
| @fungicide:matrix.org | kazoo is the zk client lib | 18:52 |
| @gordonmessmer:fedora.im | it looks like the container is *running*, at least | 18:53 |
| @gordonmessmer:fedora.im | I'll see if I can get logs out of it | 18:53 |
| -@gerrit:opendev.org- | Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/zuul] 962304: Clean up tox remnants https://review.opendev.org/c/zuul/zuul/+/962304 | 19:04 |
| @fungicide:matrix.org | now to see if that ^ passes testing | 19:04 |
| @gordonmessmer:fedora.im | found that zookeeper did not have read access to its certificates. I simply made them globally readable. I may try to figure out the rootless container permissions later, but this should be good enough for now | 19:08 |
| @gordonmessmer:fedora.im | unit tests are running. thanks. | 19:09 |
| @gordonmessmer:fedora.im | if/when the gitea tests pass, I'll open a new review | 19:10 |
| @fungicide:matrix.org | interesting, i guess the lack of user mapping could have affected the perms on the certs volume | 19:11 |
| -@gerrit:opendev.org- | Jeremy Stanley https://matrix.to/#/@fungicide:matrix.org proposed: [zuul/zuul] 962304: Clean up tox remnants https://review.opendev.org/c/zuul/zuul/+/962304 | 19:33 |
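
The 13:55 advice to run an odd number of ZooKeeper servers (and the note that opendev runs 3) follows from majority-quorum arithmetic. The sketch below is only an illustration of that arithmetic, assuming the usual majority rule for a ZooKeeper ensemble; `quorum_size` and `tolerated_failures` are hypothetical helper names, not part of ZooKeeper, Nodepool, or Zuul.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare a few ensemble sizes side by side.
    for size in (1, 2, 3, 4, 5):
        print(
            f"{size} server(s): quorum={quorum_size(size)}, "
            f"tolerated failures={tolerated_failures(size)}"
        )
```

Under this rule a single server tolerates no failures, so restarting it takes the whole service down, and going from 3 to 4 servers adds no extra failure tolerance.
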
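
The 14:38 proposal to map "311" or "3.11" to "python3.11-devel" can be sketched as a small normalization step. This is a hedged illustration only, not the logic of the `ensure-python` role: `normalize_python_version` and `devel_package_name` are hypothetical names, and the sketch assumes a single-digit major version when the input has no dot. The `dotted=False` option covers the dotless scheme that the discussion says openSUSE still appears to use.

```python
import re


def normalize_python_version(version: str) -> str:
    """Return a dotted 'major.minor' string for inputs like '3.11' or '311'.

    Assumption: the major version is a single digit, so '311' means 3.11.
    """
    match = re.fullmatch(r"(\d)\.?(\d+)", str(version).strip())
    if match is None:
        raise ValueError(f"unrecognized python version: {version!r}")
    major, minor = match.groups()
    return f"{major}.{minor}"


def devel_package_name(version: str, dotted: bool = True) -> str:
    """Build a devel package name in the dotted or the dotless scheme."""
    normalized = normalize_python_version(version)
    if not dotted:
        # Dotless scheme, e.g. python311-devel.
        normalized = normalized.replace(".", "")
    return f"python{normalized}-devel"


if __name__ == "__main__":
    print(devel_package_name("311"))
    print(devel_package_name("3.9"))
    print(devel_package_name("3.11", dotted=False))
```

Which scheme a given distro release accepts still has to be checked against that distro, as the 14:45 and 14:46 messages note for c9s and c10s.
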