*** tosky has quit IRC | 00:00 | |
*** mattw4 has quit IRC | 00:02 | |
SpamapS | mnaser: v3 is pretty nice.. I'd be down for testing. :) | 00:03 |
mnaser | SpamapS: yeah, funnily enough tho, we've historically never used tiller ever :p | 00:03 |
mnaser | so helm v3 was a "hey glad y'all finally caught on" :P | 00:03 |
*** saneax has quit IRC | 00:20 | |
corvus | mnaser: sweet -- maybe we can use those in the gerrit-review deployment | 00:21 |
mnaser | corvus: ill push it up at some point, probably later today or tomorrow and then we can look about adding it to opendev | 00:22 |
corvus | ++ | 00:22 |
mnaser | the only help i'd appreciate is (i tried this and failed a while back) is tagged releases on dockerhub would be nice | 00:25 |
*** rlandy has quit IRC | 00:29 | |
corvus | agreed; i'll try to get to that if someone else doesn't soon, but i've a bit larger backlog than normal right now | 00:35 |
*** sgw has joined #zuul | 01:01 | |
SpamapS | TBH, Helm is a far far simpler approach than the operator. | 01:09 |
SpamapS | It can't do as much, but it's definitely intended as "package management for k8s apps" rather than "do all the magic" | 01:09 |
mnaser | SpamapS: yep, its a good intermediate step for now | 01:12 |
SpamapS | We've actually shifted over to managing kubernetes objects with Terraform of late. The lifecycle and dependency management is particularly nice. | 01:20 |
SpamapS | But we have a bunch of helm stuff that will stay in helm for a while. | 01:21 |
*** swest has joined #zuul | 01:30 | |
*** swest has quit IRC | 01:35 | |
*** swest has joined #zuul | 01:50 | |
*** bhavikdbavishi has joined #zuul | 02:47 | |
*** gouthamr has quit IRC | 03:14 | |
*** gouthamr has joined #zuul | 03:24 | |
*** jamesmcarthur has joined #zuul | 03:29 | |
*** jamesmcarthur has quit IRC | 03:44 | |
*** pcaruana has joined #zuul | 05:24 | |
*** raukadah is now known as chkumar|rover | 05:58 | |
*** saneax has joined #zuul | 06:24 | |
*** jamesmcarthur has joined #zuul | 06:46 | |
*** jamesmcarthur has quit IRC | 06:51 | |
*** jcapitao|afk has joined #zuul | 07:27 | |
*** jcapitao|afk is now known as jcapitao | 07:28 | |
*** gtema_ has joined #zuul | 07:58 | |
*** tosky has joined #zuul | 08:18 | |
*** jpena|off is now known as jpena | 08:22 | |
*** gtema_ has quit IRC | 08:26 | |
*** tosky has quit IRC | 08:32 | |
*** avass has joined #zuul | 08:44 | |
*** themroc has joined #zuul | 08:49 | |
*** tosky has joined #zuul | 09:16 | |
*** mhu has joined #zuul | 09:33 | |
*** bhavikdbavishi has quit IRC | 09:58 | |
*** tosky has quit IRC | 10:02 | |
*** tosky has joined #zuul | 10:03 | |
*** jcapitao is now known as jcapitao|afk | 11:29 | |
*** sshnaidm has quit IRC | 12:04 | |
*** bhavikdbavishi has joined #zuul | 12:04 | |
tristanC | fwiw i'm still exploring dhall-lang, and for application deployment here is what i wrote for zuul: https://github.com/TristanCacqueray/dhall-operator/blob/master/applications/Zuul.dhall | 12:07 |
*** avass has quit IRC | 12:11 | |
*** mgoddard has quit IRC | 12:31 | |
*** jpena is now known as jpena|lunch | 12:41 | |
*** armstrongs has joined #zuul | 12:41 | |
*** mgoddard has joined #zuul | 12:49 | |
*** armstrongs has quit IRC | 12:50 | |
*** mgoddard has quit IRC | 12:54 | |
*** rlandy has joined #zuul | 12:57 | |
*** sshnaidm has joined #zuul | 12:58 | |
*** Goneri has quit IRC | 12:58 | |
*** sshnaidm has quit IRC | 13:02 | |
*** sshnaidm has joined #zuul | 13:03 | |
*** AshBullock has joined #zuul | 13:04 | |
*** electrofelix has joined #zuul | 13:05 | |
*** jamesmcarthur has joined #zuul | 13:05 | |
*** mgoddard has joined #zuul | 13:09 | |
AshBullock | Hey all, I have a question on the nodepool kubernetes driver, I'm seeing some jobs running on our eks cluster hitting RETRY_LIMIT intermittently, I was wondering if there is an undocumented max-pods setting for kubernetes similar to the openshift driver ? https://zuul-ci.org/docs/nodepool/configuration.html#attr-providers.[openshiftpods].max-pods | 13:09 |
*** jamesmcarthur has quit IRC | 13:12 | |
*** jamesmcarthur has joined #zuul | 13:13 | |
*** jcapitao|afk is now known as jcapitao | 13:16 | |
*** ssbarnea has quit IRC | 13:22 | |
*** jamesmcarthur has quit IRC | 13:29 | |
Shrews | AshBullock: no, there is no such setting in the kubernetes driver | 13:33 |
Shrews | AshBullock: that being said, quota issues (such as max-pods) should not cause RETRY_LIMIT errors. You would see messages in nodepool about "not enough quota to satisfy" the request, and it would simply not handle the request until quota freed up. You are likely hitting some sort of communication issue, I'm guessing (not a k8s expert). | 13:39 |
Shrews | first place you may want to look is the zuul executor logs for the builds encountering the retry. might be more info there | 13:43 |
*** jpena|lunch is now known as jpena | 13:45 | |
*** Goneri has joined #zuul | 13:46 | |
*** bhavikdbavishi has quit IRC | 13:48 | |
*** jamesmcarthur has joined #zuul | 13:55 | |
*** jamesmcarthur has quit IRC | 13:59 | |
AshBullock | Thanks for the help Shrews, I'll take a look at the executor logs to see if I can find anything | 14:00 |
*** sshnaidm has quit IRC | 14:02 | |
*** jamesmcarthur has joined #zuul | 14:03 | |
Shrews | corvus: the etherpad lgtm. who is going to run/operate their zuul? | 14:04 |
*** chkumar|rover is now known as chandankumar | 14:07 | |
clarkb | NODE_FAILURE is what you get if you can't provision with nodepool | 14:14 |
clarkb | retry limit happens when the job runs but zuul detects failures in pre-run or ansible returns an exit code indicating network problems | 14:15 |
clarkb | zuul by default tries 3 times to run the job when it hits this | 14:15 |
clarkb | failing 3 times results in retry limit | 14:15 |
clarkb | AshBullock: ^ | 14:15 |
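[Editor's note: clarkb's description of the RETRY_LIMIT semantics can be sketched as a toy loop. This is an illustrative model only, not Zuul's actual implementation; `run_with_retries` and its names are invented for this sketch.]

```python
# Toy model (NOT Zuul's real code) of the retry semantics described above:
# a failure in pre-run (or an ansible exit code that looks like network
# trouble) causes the job to be re-run, and exhausting the attempt limit
# (3 by default) reports RETRY_LIMIT instead of FAILURE.
def run_with_retries(job, max_attempts=3):
    """job() returns True on success, False on a retryable failure."""
    for _ in range(max_attempts):
        if job():
            return "SUCCESS"
    return "RETRY_LIMIT"
```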
mordred | Shrews: they/we are | 14:20 |
*** themroc has quit IRC | 14:21 | |
mordred | Shrews: idea being having a community repo/repos sort of similar to opendev where we have a gitops repo for running it | 14:23 |
*** bhavikdbavishi has joined #zuul | 14:48 | |
*** sshnaidm has joined #zuul | 14:53 | |
*** AshBullock has quit IRC | 15:04 | |
*** ssbarnea has joined #zuul | 15:12 | |
mnaser | hrm, the latest nodepool-builder image on dockerhub doesnt have debootstrap | 15:28 |
mnaser | https://opendev.org/zuul/nodepool/commit/46d0ce248326127c2d883a415af98fea66af889d this commit implies "Note the sibling build will have installed many of these from the bindep.txt file from diskimage-builder itself." but adding "However, when using releases this is not done." | 15:28 |
mnaser | does that mean i have to build my own images and add these bits on top of it (i'm ok with that, it just seems like uh, work that new users might struggle with) | 15:29 |
mordred | hrm | 15:29 |
mordred | I do not believe that's the intent | 15:29 |
mordred | mnaser: let's loop corvus in when he gets up so we can talk through it | 15:30 |
mnaser | ya looking at the /var/log/apt/history.log in the container, its not htere | 15:31 |
mnaser | *there | 15:31 |
mordred | mnaser: (the ultimate solution here is the finishing of the docker-base-image patches for dib so that debootstrap is not needed anymore) | 15:31 |
mnaser | mordred: yeah i was thinking about that too! i was wondering about the feasibility of running docker-base-image inside a container | 15:31 |
*** panda has quit IRC | 15:31 | |
mordred | mnaser: should be fine actually - it just does a podman export | 15:32 |
mordred | so it doesn't actually _run_ a container build or anything - just fetches and then exports the filesystem | 15:32 |
mnaser | mordred: i guess we'd need podman as a runtime dependency in that case but that's fine by me | 15:32 |
mordred | yah | 15:32 |
mordred | (it's actually written to work with podman or docker - but I think podman is the nicer runtime dep) | 15:33 |
mnaser | yeah its probably not gonna try and mess around with trying to get a systemd service/etc done | 15:33 |
mordred | mnaser: https://review.opendev.org/#/c/693619/ is what I've got so far | 15:33 |
*** panda has joined #zuul | 15:33 | |
mnaser | mordred: i actually searched for that patch hoping it'd have merged | 15:34 |
mordred | mnaser: I think that part is solid enough - the hard bits are going to be https://review.opendev.org/#/c/693642/ and similar for the other base os images | 15:34 |
mnaser | and then i would have started using it =P | 15:34 |
clarkb | mordred: mnaser that commit merged yesterday, maybe it failed to upload the updated image? | 15:34 |
mordred | mnaser: honestly - I think it's all likely not too hard to get moving | 15:34 |
clarkb | or your pull is from pre merge? | 15:34 |
mnaser | clarkb: the last merge was 18 hours ago and the most recent image in dockerhub was 18 hours ago (and also the one i have) | 15:34 |
mordred | clarkb: I don't think that commit is sufficient | 15:34 |
clarkb | oh I thought debootstrap was specifically listed | 15:35 |
mordred | clarkb: somewhere we missed that we need some of the siblings behavior in production builds too | 15:35 |
mordred | it's not | 15:35 |
clarkb | ah | 15:35 |
mordred | I'm honestly not sure what the *right* solution is - think it's worth a quick discussion - I'm pretty sure implementing the right solution won't be as hard as figuring out what it is :) | 15:36 |
mnaser | yep, agreed | 15:36 |
mnaser | i think for now ill kinda just uh, have an image with a few extra packages (based on the most recent tagged release) | 15:36 |
mnaser | just to unblock the helm charts work im at | 15:37 |
mordred | mnaser: I betcha those two dib patches would get you a bootable ubuntu image if you did container-base-image vm ubuntu-kernel DIB_CONTAINER_IMAGE=docker.io/library/ubuntu | 15:37 |
mnaser | so far launcher works well (tested with cloudimages) and working on builder now | 15:37 |
mordred | mnaser: woot | 15:37 |
clarkb | I think if we are installing vhd-utils and debian-keyring and yum etc we may as well install debootstrap | 15:38 |
clarkb | yum is the equivalent of debootstrap there for red hat distros | 15:39 |
mordred | I agree | 15:39 |
openstackgerrit | Monty Taylor proposed zuul/nodepool master: Add debootstrap to builder package list https://review.opendev.org/699707 | 15:40 |
clarkb | I know ianw intends to get this into production after PTO | 15:40 |
mordred | clarkb, mnaser: ^^ | 15:40 |
clarkb | I expect things will work a bit more happily out of the box once we dogfood it | 15:41 |
mnaser | clarkb: thats a very reasonable argument IMHO | 15:41 |
*** ssbarnea has quit IRC | 15:46 | |
mnaser | fwiw mnaser/nodepool-builder:latest is running with mordred patch, so ill test that and see if any other things pop up missing | 15:50 |
tristanC | Software Factory 3.4 has been released, among other things it removes SCL for python3 and the zuul rpm doesn't have patches anymore: https://www.softwarefactory-project.io/releases/3.4/ | 15:51 |
corvus | ohai | 15:53 |
clarkb | tristanC: removes SCL because centos/rhel 7 provide python 3 directly? | 15:54 |
mnaser | challenge #2: sudo mount --bind /opt/cache/apt/debian /tmp/dib_build.8Jsgxogy/mnt/var/cache/apt/archives => mount: /tmp/dib_build.8Jsgxogy/mnt/var/cache/apt/archives: permission denied. -- gonna guess i have to find the right capability to add to this container | 15:55 |
tristanC | clarkb: yes, we rebuilt every python3 components using the python-3.6 provided by el7 | 15:55 |
* mnaser goes back to research | 15:55 | |
clarkb | mnaser: ya you'll need privileges | 15:55 |
corvus | mordred: ++ 699707 | 15:55 |
mnaser | clarkb: ya im trying to avoid privileged: true and finding the right caps to add.. | 15:55 |
corvus | mnaser: you are my hero | 15:56 |
clarkb | I think mount is its own cap? | 15:56 |
mordred | mnaser: we've been wanting a human to do that for like 2 years now | 15:56 |
tristanC | clarkb: we still enable the rh-git-218 SCL because zuul needs a more recent git | 15:56 |
mnaser | corvus, mordred \o/ | 15:56 |
clarkb | tristanC: hrm what aspect of zuul requires newer git? | 15:56 |
mnaser | mount seems to require CAP_SYS_ADMIN:X | 15:57 |
tristanC | clarkb: iirc GIT_SSH_COMMAND doesn't work on el7 | 15:57 |
mordred | of course it does | 15:57 |
mnaser | super unrelated and old school but http://linux-vserver.org/Capabilities_and_Flags -- seems like there are contexts | 15:57 |
mnaser | and we can build caps based on contexts, SECURE_MOUNT which is allowing to mount | 15:57 |
mnaser | i wonder if k8s can support these | 15:57 |
mnaser | seems like a vserver construct tho :< | 15:58 |
clarkb | there are ways around that if people want to hack on dib. One method is FUSE (might make builds slower?) another is mkfs.* for certain filesystems can take an existing fs tree and write it into the new fs on a file without mounting it | 15:59 |
clarkb | ext4 can do that but I don't think xfs or btrfs can | 16:00 |
mnaser | yeah it's gonna have to be CAP_SYS_ADMIN because thats the only way you can get `mount` :( | 16:03 |
corvus | i wonder if it's because of things like proc and sysfs | 16:04 |
clarkb | corvus: that and the way it writes the file image out is to mount a file as block device | 16:04 |
mnaser | actually proc and sysfs seem to be ok to mount by default | 16:04 |
clarkb | it then unmounts that file and you get the .raw image. This is then converted to other formats | 16:04 |
clarkb | in theory that could be fuse mounted | 16:05 |
clarkb | it could also have its contents written directly by mkfs if the fs types support that | 16:05 |
mnaser | apparently certain file systems have a `FS_USERNS_MOUNT` flag (procfs, tmpfs, sysfs) which makes them ok, but mounting ext4/nfs/btrfs/overlayfs etc is no bueno | 16:06 |
mordred | mnaser: how does img work? (or the other "build images in unprivileged containers using user namespaces") | 16:08 |
clarkb | mordred: they don't have to create a proper filesystem aiui | 16:09 |
mnaser | mordred: i mean i think i could have avoided that error by disabling the apt cache | 16:09 |
mnaser | cause it was trying to bind mount /var/cache/apt/archives | 16:09 |
clarkb | mnaser: I don't think so. the end of dib runs is to mkfs on a file, mount it, write the fs out, unmount it, and convert from raw to $format | 16:10 |
clarkb | that mount will need perms | 16:10 |
mnaser | yeah so i would have failed later if i disabled it | 16:10 |
clarkb | yes | 16:10 |
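[Editor's note: clarkb's mount-free alternative above refers to the real `mkfs.ext4 -d` option (from e2fsprogs), which populates a new filesystem from an existing directory tree without ever loop-mounting it. The helper below is a hypothetical sketch that only assembles the command lines; actually running them still requires `truncate` and `mkfs.ext4` on the host.]

```python
# Hedged sketch of the unprivileged image-build path discussed above:
# create a sparse raw image file, then let mkfs.ext4 -d copy the chroot
# tree into the fresh filesystem -- no loop device, no CAP_SYS_ADMIN.
# build_image_cmds is an invented name; it returns the commands rather
# than executing them so the sketch stays side-effect free.
def build_image_cmds(tree_dir, image_path, size_mb):
    return [
        ["truncate", "-s", f"{size_mb}M", image_path],  # sparse raw image file
        ["mkfs.ext4", "-d", tree_dir, image_path],      # write tree without mounting
    ]
```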
*** chandankumar is now known as raukadah | 16:10 | |
mnaser | btw lol, the next nugget | 16:10 |
mnaser | "ps: command not found" | 16:10 |
mnaser | :p | 16:10 |
clarkb | dib is running ps or you are? | 16:11 |
mnaser | /usr/local/lib/python3.7/site-packages/diskimage_builder/lib/common-functions: line 177: ps: command not found | 16:11 |
mnaser | dib is | 16:11 |
mnaser | https://github.com/openstack/diskimage-builder/blob/master/diskimage_builder/lib/common-functions#L172-L189 | 16:11 |
clarkb | do we need to install coreutils/sysutils because the container images strip that out | 16:11 |
clarkb | also ^ would make the container image based bootstrap weird I expect | 16:12 |
mnaser | im gonna keep iterating and push up a patch with all the things ill find.. | 16:13 |
mnaser | mkdir: cannot create directory '/etc/modprobe.d': Permission denied | 16:19 |
mnaser | hmmm | 16:19 |
mnaser | this feels like a bug | 16:20 |
mnaser | https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/modprobe/extra-data.d/50-modprobe-blacklist -- it should be `$TMP_MOUNT_PATH/etc/modprobe.d` no ? | 16:21 |
mnaser | or maybe it should be prefixed with a sudo | 16:21 |
clarkb | mnaser: I think it should be $TMP_MOUNT_PATH prefixed | 16:24 |
mnaser | right, because we're not actually trying to check if the host has kmod or not.. | 16:24 |
clarkb | exactly | 16:24 |
mnaser | ok ill push up a patch now | 16:24 |
*** ssbarnea has joined #zuul | 16:26 | |
openstackgerrit | Merged zuul/nodepool master: Add debootstrap to builder package list https://review.opendev.org/699707 | 16:27 |
mnaser | https://review.opendev.org/699722 modprobe.d: use $TMP_MOUNT_PATH | 16:27 |
openstackgerrit | Mohammed Naser proposed zuul/nodepool master: Add procps to packages in Dockerfile https://review.opendev.org/699725 | 16:32 |
mnaser | hmm | 16:33 |
mnaser | mkdir: cannot create directory '/tmp/dib_build.qfvFRgnB/mnt/etc/modprobe.d': Permission denied | 16:33 |
mnaser | i guess we gotta sudo that? | 16:33 |
clarkb | possibly. The relative permissions get mind bendy there because that is in the nested fs and ya /etc there is probably owned by uid 0 which is root | 16:34 |
clarkb | and sudo will reconcile that | 16:34 |
mnaser | ok great, got past that | 16:46 |
mnaser | im gonna keep putting things inside https://review.opendev.org/#/q/topic:nodepool-in-k8s | 16:47 |
*** jcapitao is now known as jcapitao|afk | 16:48 | |
*** rlandy is now known as rlandy|brb | 17:17 | |
*** hashar has joined #zuul | 17:21 | |
*** bhavikdbavishi has quit IRC | 17:23 | |
*** mattw4 has joined #zuul | 17:35 | |
*** jpena is now known as jpena|off | 17:36 | |
*** panda has quit IRC | 17:36 | |
*** panda has joined #zuul | 17:39 | |
mnaser | exec_sudo: losetup: cannot find an unused loop device | 17:40 |
mnaser | i've wrestled this enough for a while with no success, added CAP_MKNOD and no bueno | 17:41 |
mnaser | apparently we do some mknods but that feels wrong https://serverfault.com/a/720496 | 17:42 |
mnaser | and seems to imply that they are shared with the host | 17:42 |
tristanC | mnaser: perhaps try to bind mount /dev/loop-control or authorize the c:10:237 device ? | 17:42 |
mnaser | tristanC: ok, i'lll dig from there | 17:45 |
*** rlandy|brb is now known as rlandy | 17:46 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul master: spec: add a zuul-runner cli https://review.opendev.org/681277 | 17:59 |
mnaser | hmm | 18:04 |
mnaser | we can look into this at some point: https://github.com/braincorp/partfs | 18:04 |
*** rfolco has quit IRC | 18:07 | |
*** electrofelix has quit IRC | 18:09 | |
tristanC | mnaser: you'd still need a device access, e.g. /dev/fuse | 18:09 |
clarkb | tristanC: ya but anyone can read and write to that device | 18:10 |
clarkb | at least on my machine | 18:10 |
clarkb | you won't need additioanl permissions/capabilities past that aiui | 18:10 |
mnaser | tristanC, clarkb: but apparently also fuse can be mounted inside containers | 18:10 |
mnaser | according to some article i remember reading at some point | 18:10 |
pabelanger | mnaser: https://review.opendev.org/415927/ might be helpful, that was my last attempt at DIB inside docker, back in 2017 | 18:13 |
pabelanger | there was even work to use docker for dib matrix of tests | 18:14 |
*** yolanda__ is now known as yolanda | 18:18 | |
*** tosky has quit IRC | 18:23 | |
*** Goneri has quit IRC | 18:28 | |
mnaser | pabelanger: neat | 18:35 |
mnaser | im so tempted to just say f'it and add "privileged: true" :( | 18:35 |
mnaser | we're at: "failed to set up loop device: Operation not permitted" | 18:36 |
tobiash | mnaser: there was a time when containerized dib leaked loop devices | 18:36 |
mnaser | tobiash: thats what im worried about too | 18:36 |
mnaser | i exposed a single loop device only (loop0) | 18:36 |
tobiash | mnaser: you'll need to use privileged | 18:36 |
pabelanger | mnaser: yah, needs to be privileged right now | 18:36 |
mnaser | tobiash: ive been adding manual CAPS as needed.. | 18:37 |
mnaser | im at CAP_MKNOD and CAP_SYS_ADMIN .. | 18:37 |
tobiash | ok, that'll take some time to find all needed privs ;) | 18:37 |
mnaser | i think ill cutover to privileged for now to make sure it works then i will scale back | 18:38 |
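[Editor's note: the capability experiments above map onto a Kubernetes container `securityContext`. The sketch below is an assumption-laden illustration of the two options discussed (full `privileged: true` versus only the capabilities mnaser found necessary); `builder_security_context` is an invented helper, and note that Kubernetes capability names drop the `CAP_` prefix.]

```python
# Hypothetical helper producing the securityContext fragment for a
# nodepool-builder pod spec, as a plain dict. Either fully privileged
# (what the chat falls back to), or just the capabilities that were
# found necessary for dib's mount/mknod calls.
def builder_security_context(privileged=False):
    if privileged:
        return {"privileged": True}
    # Kubernetes spells CAP_SYS_ADMIN / CAP_MKNOD without the CAP_ prefix.
    return {"capabilities": {"add": ["SYS_ADMIN", "MKNOD"]}}
```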
tobiash | mnaser: for the loopdev leak, we have this in the root.d phase: http://paste.openstack.org/show/787742/ | 18:41 |
tobiash | but no idea if that's still required | 18:41 |
tobiash | we needed this back in zuulv2 days and stuck with it | 18:42 |
mnaser | tobiash: ya i remember running into similar issues a long time ago | 18:43 |
*** openstackgerrit has quit IRC | 18:43 | |
mnaser | ya privileged just uh, fixed it all, but we'll see. | 18:44 |
clarkb | tobiash: mnaser re the loop leak I would expect that would affect containerized and not containerized dib the same and i don't believe that is something we see leaking on our builders | 18:44 |
clarkb | heh but now that I check I think maybe we do | 18:45 |
mnaser | :P | 18:45 |
clarkb | what is weird about that is we don't seem to hit the node limit | 18:45 |
clarkb | so we don't leak them quickly? | 18:45 |
clarkb | in any case that isn't container specific | 18:45 |
mnaser | i think im going to make builders a statefulset | 18:46 |
tobiash | clarkb: back then we had this issue only in dockerized envs | 18:46 |
mnaser | because the builder hostname will be changing often during redeploys and the builder ids are constantly changing | 18:47 |
clarkb | mnaser: that shouldn't matter? the biggest reason to make it stateful will be to keep the cache around so that your builds are faster | 18:47 |
tobiash | mnaser: yes, builders need to be a statefulset | 18:47 |
mnaser | https://www.irccloud.com/pastebin/VSmF0iC0/ | 18:47 |
clarkb | tobiash: oh? | 18:47 |
tobiash | As well as the executors | 18:47 |
tobiash | clarkb: but I don't remember the reasons | 18:48 |
mnaser | every time you redeploy, it'll be a different hostname | 18:48 |
mnaser | the executors might make sense bc of the cache | 18:48 |
clarkb | mnaser: hrm and I guess we use the hostnames to identify deleting images? | 18:48 |
mnaser | yep | 18:48 |
mnaser | so they're not deleting cause those "nodes" arent responding | 18:48 |
clarkb | seems like we could make that better in nodepool, but probably also low priority | 18:49 |
clarkb | (the image is already deleted on the nodepool side if the host is gone so it should noop and be happy there) | 18:49 |
tobiash | mnaser: the executors also need a stable identity because of live streaming | 18:49 |
*** openstackgerrit has joined #zuul | 18:50 | |
openstackgerrit | Merged zuul/nodepool master: Add procps to packages in Dockerfile https://review.opendev.org/699725 | 18:50 |
openstackgerrit | Merged zuul/nodepool master: Functional tests - use common verification script https://review.opendev.org/698834 | 18:50 |
*** rfolco has joined #zuul | 18:52 | |
*** sshnaidm is now known as sshnaidm|afk | 18:58 | |
*** jamesmcarthur has quit IRC | 19:03 | |
mnaser | ok so https://review.opendev.org/#/c/699722/ helped me build images locally if anyone wants to help push that through | 19:25 |
clarkb | mnaser: +2. ianw is out on pto so may not get to it soon. If another infra reviewer can ack it though I think we can merge it without ianw | 19:26 |
*** jamesmcarthur has joined #zuul | 19:26 | |
* mnaser is trying to avoid having a local build as much as possible | 19:27 | |
*** mgoddard has quit IRC | 19:33 | |
*** mgoddard has joined #zuul | 19:34 | |
*** Goneri has joined #zuul | 19:37 | |
*** jamesmcarthur has quit IRC | 19:38 | |
*** mhu has quit IRC | 19:41 | |
SpamapS | I hit an interesting problem today | 19:48 |
SpamapS | we have a job in our gate that creates a terraform plan... that's a diff against the infrastructure that it saves as an artifact... | 19:48 |
SpamapS | but we don't apply until promote, post-merge. The promote job goes and digs out the artifact, and applies that diff, or complains if the infrastructure changed and the diff is stale. | 19:49 |
SpamapS | We approved two changes in rapid succession, and zuul went [gate changeA = plan1][gate changeB = plan1][merge changeA][promote changeA+plan1 == SUCCESS][merge changeB][promote changeB+plan1 == stale FAIL] | 19:50 |
*** decimuscorvinus has quit IRC | 19:51 | |
*** decimuscorvinus has joined #zuul | 19:52 | |
SpamapS | Now, if we semaphore the gate job and the promote job, we can shrink the window for duplicate plans, but we can't eliminate it. There's a window where the semaphore is unlocked, and changeA is merged, and gating changeB wins, and makes a duplicate plan, and then the same scenario happens... | 19:52 |
SpamapS | Any ideas? At this point, we're thinking block the plan creation gate job until any unapplied plans are applied.. but that's also going to make things extremely serialized (maybe that's what we want?) | 19:53 |
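[Editor's note: the race SpamapS describes can be captured in a toy model. This is not Terraform; the names below are invented. Each gate job snapshots the current infrastructure "state serial" into its plan artifact, and promote refuses any plan whose snapshot no longer matches production -- so two changes gated back-to-back against the same state produce one SUCCESS and one stale FAIL.]

```python
# Toy reproduction of the [gate A=plan1][gate B=plan1][promote A ok]
# [promote B stale] sequence from the chat. Applying a plan mutates the
# shared production state, invalidating any sibling plan made earlier.
class Infra:
    def __init__(self):
        self.serial = 0  # stand-in for Terraform's state serial

def gate_plan(infra):
    """Gate job: record the state the plan was computed against."""
    return {"base_serial": infra.serial}

def promote(infra, plan):
    """Promote job: apply the plan, or complain if it is stale."""
    if plan["base_serial"] != infra.serial:
        return "stale FAIL"
    infra.serial += 1  # the apply changes production state
    return "SUCCESS"
```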
corvus | SpamapS: does the plan for B not include the changes that A makes? | 19:54 |
mordred | SpamapS: it seems like the gate job for changeB needs to be making a plan/diff that would be the result of changeB being applied if changeA was already applied | 19:54 |
corvus | (just wondering why the promote of B didn't see the A plan component as existing/noop) | 19:54 |
corvus | i think mordred and i are saying similar things | 19:54 |
mordred | which - if it's diffing against production, is kind of hard to simulate in the gate job, since changeA hasn't been applied yet | 19:54 |
mordred | corvus: I agree :_) | 19:55 |
mordred | SpamapS: are corvus and I tracking the problem correctly at least? | 19:55 |
SpamapS | Terraform doesn't give us the option to assume a plan has already been applied. It's not as smart as git. | 19:58 |
SpamapS | mordred: yes you're right that the gate job for change B needs to make a plan that includes the results of changeA's plan. In order to do that, one must apply change A's plan. There's no stacking. | 19:59 |
corvus | yeah, from a high level, it seems like "the production system" is a part of the gate environment. i think that means it either needs to be able to be modeled serially (so that changes are stacked correctly), or mutexed into the singleton that it is | 20:01 |
SpamapS | corvus: right, basically we're going to end up passing a lock through as an artifact that promote will unlock by applying or failing, and until that happens, no plans can be created. | 20:02 |
corvus | (i wonder if it's feasible to make a tool which manipulates terraform plans that way -- subtraction, addition, etc) | 20:02 |
SpamapS | I'm not sure it would be valid unfortunately. Cloud APIs often have emergent effects. | 20:03 |
SpamapS | The plan may be "create a foo" and that creation will get an ID that is now part of the state of the system. | 20:03 |
corvus | SpamapS: ack | 20:03 |
SpamapS | I'm also not sure this is how we need the system to work. | 20:04 |
SpamapS | We did this to lock in the handoff from gate tests -> promote applies. | 20:04 |
corvus | this seems like an interesting consideration in deployment systems -- how stateful vs stateless they are | 20:04 |
SpamapS | Agreed, this is a particularly sticky wicket. | 20:05 |
SpamapS | Up until yesterday, we'd just let the promote job apply whatever it needed to based on the code in the repo. | 20:05 |
corvus | SpamapS: incidentally, if it's not too boring for you to explain it to me, why make a plan in the gate? why isn't that just something that happens post-merge? | 20:05 |
corvus | oh, heh, i think your last sentence is getting at my question :) | 20:06 |
corvus | so yeah, what changed? | 20:06 |
SpamapS | That's a good question. | 20:06 |
SpamapS | We wanted to make sure that what we gated is the *only* change that happens. I'm not sure it's as valuable as we thought, especially if it makes our deployment pipeline serialize with the gate. | 20:06 |
corvus | oh interesting | 20:07 |
SpamapS | The minor problem we were solving is that sometimes there are in-process manual changes that may get overridden by stuff landing in the gate. | 20:07 |
SpamapS | TBH I'm struggling to come up with reasons of value. | 20:08 |
corvus | heh, it's always those "manual" edge cases that mess up this whole gitops thing | 20:08 |
SpamapS | We may just want to drop it and be a bit more forceful with "apply what's in the code base" | 20:08 |
* corvus looks at opendev's "emergency" file | 20:08 | |
SpamapS | There was also some talk about vetting the plans in the gate, so like, make sure it never deletes an RDS or something, but one can't really inspect them in the current state of terraform, so that's just a fantasy. | 20:09 |
*** jamesmcarthur has joined #zuul | 20:09 | |
SpamapS | I actually think the thing we want is not plan-in-the-gate but plan-in-check. So.. inform the user of what this change would do, and then maybe give them some kind of option to say "yes this is approved, but only with that plan" | 20:11 |
clarkb | corvus: re gerrit and zuul. I don't think you can do blue/green deployments of zuul if gating. We can do that for subcomponents that scale out like the executor though | 20:12 |
corvus | clarkb: is that re the original msg or the reply i just sent? | 20:13 |
clarkb | corvus: the one you just sent and the one from thomas | 20:13 |
clarkb | but as you say its well tested | 20:13 |
corvus | i meant to say that blue/green wasn't necessary because of gating, but maybe i could have been more explicit | 20:13 |
corvus | i wrote too many words | 20:14 |
clarkb | corvus: ya I got that but it goes the other way too. If the zuul is gating a project you can't blue/green that install due to shared state | 20:14 |
clarkb | but ya I think once you get to gated state a lot of those concerns go away | 20:15 |
corvus | ah yes. the flip side of SpamapS's coin | 20:15 |
SpamapS | corvus: mordred thanks for your wisdom. I think we're going to move the plan generation into promote. | 20:15 |
*** rf0lc0 has joined #zuul | 20:15 | |
corvus | SpamapS: sounds good; thanks for the brain food. i love hearing about use cases like this | 20:16 |
SpamapS | That way we still get a plan artifact of what happened, but we don't block gate jobs from running. | 20:16 |
*** hashar has quit IRC | 20:16 | |
SpamapS | (terraform plans tied to git commits are extremely useful for audits / RCA's) | 20:16 |
*** rfolco has quit IRC | 20:17 | |
*** hashar has joined #zuul | 20:24 | |
SpamapS | corvus: if we find that we do want to have this serialization between gate and promote, I wonder if there's room for a new type of mutex-ish object where it follows the artifacts across pipelines. As annoying as serializing the plan generation would be.. as long as everything else could run in the gate.. and there's just this big queue of "generate plan and upload artifact" waiting.. that's a fast process.. | 20:37 |
SpamapS | this might still be useful in other contexts. | 20:37 |
SpamapS | but, yeah, let's wait for a second attempt at it before we go beyond noodling | 20:37 |
*** hashar has quit IRC | 20:38 | |
*** rf0lc0 has quit IRC | 20:41 | |
mnaser | does paramiko have any sort of uh, run-time dependencies like the actual ssh binaries? | 20:42 |
corvus | i don't think so | 20:42 |
mnaser | ssh-keyscan to the IP of this machine works perfectly, but nodepool is timing out | 20:43 |
corvus | mnaser: is it using the right ip? it should be in the error log if it's timing out | 20:55 |
mnaser | corvus: yeah, its the ipv4 one (eliminating any ipv6 shenanigans) | 20:56 |
pabelanger | what timeout do you have set up for boot? | 20:56 |
pabelanger | maybe keyscan happening too fast? | 20:56 |
corvus | mnaser: this is the method, if you want to try manual debug: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/nodeutils.py#L58 | 20:57 |
corvus | mnaser: you can actually just import that and run it from the repl; no special objects/classes needed | 20:57 |
mnaser | yeah im running it manually on my local machine vs that container to see what is different | 20:57 |
mnaser | it def times out running it separately too | 20:57 |
* mnaser hmms | 20:57 | |
mnaser | the nodepool-builder container has a bit more stuff in there (ssh tool is there) and it scans successfully | 20:59 |
mnaser | let me see if it actually runs there | 20:59 |
mnaser | ok, extracting the method out, it actually.. works? | 21:06 |
mnaser | it is returning a `ssh-ed25519` key | 21:06 |
*** rf0lc0 has joined #zuul | 21:26 | |
*** rf0lc0 has quit IRC | 21:26 | |
*** jcapitao|afk has quit IRC | 21:38 | |
corvus | clarkb: do you think we should restart opendev before releasing 3.14? | 21:44 |
clarkb | corvus: let me look at the git log | 21:44 |
corvus | it's the 2.5 removal, smart-reconfig, and a couple of bugfixes. | 21:45 |
clarkb | ya those bugfixes are maybe worth restarting for since they affect pipeline behavior? | 21:46 |
clarkb | I'm not too concerned about the ansible version removal | 21:46 |
corvus | k, i'll get that started then | 21:46 |
fungi | i'm only on for a moment from tonight's hotel, but have to cheer for the "pi release" | 21:48 |
corvus | fungi: wait till the bugfix releases.... 3.14.15926535 | 21:49 |
fungi | yass | 21:50 |
clarkb | they put pi on the wall in our only underground MAX station here. And got it wrong | 21:50 |
corvus | knuths christmas lecture this year was on pi | 21:50 |
clarkb | apparently they had taken the value as printed in some textbook which also got it wrong | 21:50 |
corvus | clarkb: wow | 21:50 |
clarkb | and its carved into the stone wall | 21:50 |
clarkb | so they never changed it :) | 21:51 |
fungi | should i be worried about how well the max isn't engineered, if they can't get pi right? | 21:51 |
clarkb | fungi: I think siemens makes the trains and not local construction company so probably ok | 21:51 |
corvus | https://www.roadsideamerica.com/tip/20814 | 21:51 |
fungi | i bet siemens knows pi | 21:51 |
*** dtroyer has joined #zuul | 21:52 | |
*** pcaruana has quit IRC | 21:54 | |
*** saneax has quit IRC | 21:55 | |
corvus | also letterspacing lining numbers is a bit of a typographic blunder. | 21:57 |
clarkb | corvus: if only they had you to help them do the layout :) | 21:58 |
corvus | yes, they could have had the wrong numbers in better style! | 21:58 |
clarkb | one of the really neat things about that station is they have the vertical cores they took laid out horizontally, with geological timeline tidbits applied along them | 21:59 |
corvus | ok, opendev restarted; we'll watch that a bit and then cut a release | 22:01 |
*** jamesmcarthur has quit IRC | 22:04 | |
*** mattw4 has quit IRC | 22:07 | |
*** mattw4 has joined #zuul | 22:10 | |
*** mattw4 has quit IRC | 22:51 | |
*** saneax has joined #zuul | 22:54 | |
*** mattw4 has joined #zuul | 22:59 | |
*** rlandy is now known as rlandy|bbl | 23:11 | |
*** saneax has quit IRC | 23:24 | |
*** mattw4 has quit IRC | 23:40 | |
*** mattw4 has joined #zuul | 23:40 | |
*** mattw4 has quit IRC | 23:45 | |
*** mattw4 has joined #zuul | 23:46 |
Generated by irclog2html.py 2.15.3 by Marius Gedminas - find it at mg.pov.lt!