Wednesday, 2023-05-24

corvusclarkb: do you grok
corvusi'm not sure why we put xenial in there on our current focal hosts00:15
corvusoh i see it, nm00:16
opendevreviewJames E. Blair proposed opendev/system-config master: WIP: Test zuul on jammy
*** amoralej_ is now known as amoralej09:16
fricklerinfra-root: I'm seeing a lot of (zombie?) "git cat-file --batch-check" processes on our executors, is that a known issue? (first discovered this on our local zuul, which is restarted less often)10:29
*** amoralej is now known as amoralej|lunch10:55
fungii've definitely seen them for... years11:39
fricklerI'm not sure if that's meant to comfort me or rather not ... ;)11:40
fungii don't think they're zombies, they're still parented to the executor process and seem to only be as old as the last executor container restart11:41
frickleryes, not zombies in the unix meaning of the word. can still be quite a lot for deployments that don't do automated regular restarts11:42
fungimaybe due to a bug in GitPython not cleaning up subprocesses it spawns?11:59
fungiseems similar12:00
fungihere's a workaround in another project using gitpython:
SvenKieskemhm, you can easily DoS yourself with such behaviour depending how fast this process list grows - apparently not fast enough, yet.12:07
fungiprobably part of why we don't end up with lots is that we perform rolling restarts of all the zuul services to update their containers at least once a week12:09
fungiwe're averaging 170 git cat-file processes per executor at the moment. i'll check again in a bit to see if it increases12:16
fungiwe're about 4 days since the last restarts12:16
SvenKieskeI think it would be worthwhile nevertheless to fix these, how much ram do these consume?12:22
*** amoralej|lunch is now known as amoralej12:33
fungihard to say in aggregate because they might share pages, but we track memory utilization on the executors and throttle builds automatically when it gets too high:
fungithe memory used by the git cat-file processes is a tiny fraction of what python/ansible consumes12:36
fungiand yes, it would be worthwhile to fix. probably the best way to do that is upstream in gitpython12:37
fungithey've even marked the issue about it as "help wanted"12:37
fungiif you look at the executor memory utilization graphs, they don't seem to get worse as the week goes on and approaches the next scheduled restart, so it doesn't look like those leftover processes are contributing significantly to memory utilization12:42
fungialso the average of 170 git cat-file processes i calculated across the executors was not to imply that all of those are left over, i'd probably need to filter them on start time or some similar heuristic to better measure that12:47
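The start-time filter fungi mentions could be sketched roughly as below. This is only an illustration, not something run in the channel: it assumes a procps-style `ps` that supports the `etimes` (elapsed seconds) column, and the function names and the age threshold are hypothetical.

```python
import subprocess

# Command line fragment discussed in the channel.
PATTERN = "git cat-file --batch-check"


def parse_ps(text):
    """Parse headerless `ps` output with columns pid, ppid, etimes, args."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue  # blank or truncated line
        pid, ppid, etimes, args = parts
        rows.append((int(pid), int(ppid), int(etimes), args))
    return rows


def leftover_candidates(rows, min_age_seconds):
    """Keep cat-file processes that have been alive at least min_age_seconds."""
    return [r for r in rows if PATTERN in r[3] and r[2] >= min_age_seconds]


def snapshot():
    """Collect the process table; assumes procps `ps` with an `etimes` column."""
    cmd = ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "etimes=", "-o", "args="]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_ps(out)
```

Calling `leftover_candidates(snapshot(), 3600)` on an executor would list the cat-file processes older than an hour. Age alone is only a heuristic, since a long-running build could legitimately hold one open.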
SvenKieskelooking at the upstream bug I don't think they are aware that this happens on many more systems than just the ones reported12:52
SvenKieskeI don't really know how GitPython interacts with zuul here, but the suggestion seems clear: use libgit2 via its python bindings instead12:53
SvenKieskethat's at least my take when reading
fungizuul relies on gitpython in this particular case to attempt git merges and then extract (possibly speculative) configuration files from all the branches in them12:54
fungiand to construct the speculative repository states which are exposed to the build workspaces and replicated to the job nodes assigned to a given build12:55
fungianyway, this is not the zuul project's discussion channel, we just happen to be running a deployment of zuul. the channel or mailing list are where potential redesigns like this should be brought up12:57
SvenKieskewell, Byron is the primary author of GitPython and afaik is currently working full time on gitoxide, the rust reimplementation of git, so I wouldn't bet on much upstream work. he also suggested using libgit2 instead12:57
fungier, matrix channel i meant12:58
SvenKieskere: zuul dev; sure12:58
SvenKieskedidn't know they are on matrix, good to know12:58
SvenKiesketo end this discussion: is anyone aware of a bug tracking this on the zuul side?12:59
fungiyeah, opendev has a dedicated matrix homeserver and zuul's discussion channel is hosted there12:59
fungiSvenKieske: i am not aware of one, no, but since it's a bug in a library used by zuul it's sort of a grey area as to whether it could be considered a zuul bug13:00
fungibut if the situation really is "gitpython is abandoned by its author who suggests everyone should switch to libgit2" then maybe that's worth bringing up13:01
SvenKieskeyeah, and given that a fix doesn't seem to materialize, maybe they/someone can implement a workaround in zuul, besides restarting :)13:03
fungihuh, i can't find libgit2 on pypi13:04
fungiaha, it's pygit213:04
fungii assume most distros have libgit2 by now, but it's relatively recent in zuul years (seems it only landed in debian as far back as bullseye)13:08
fungithat would probably explain why it was designed around gitpython instead of pygit2 bindings13:09
clarkbI think licensing was an issue in the way back times too15:16
fungilooks like it's currently "GPLv2 with a Linking Exception"15:22
fungiwhile gitpython is bsd15:22
clarkbya it was just GPL before iirc and the last time it came up zuul was part of openstack and that was something to be avoided15:23
clarkbI don't think it would be an issue today15:23
fungiit's possible that also addresses apache/gpl2 compatibility issues, but i'd want a lawyer to confirm that15:26
clarkbfungi: well the linking exception is what fixes it now15:28
clarkbI want to say that exception didn't exist before15:28
clarkbthe alternative in the past was to relicense zuul to something gplv2 compat15:29
clarkbcorvus: fungi frickler I'm starting to put a plan together there. Once I've got the list of steps outlined I'll start filling in specifics and we can move forward15:29
fungiright, i mean i'm not 100% confident in my reading of the exception clause15:29
fungiand would prefer openinfra's legal counsel weigh in on it before we said sure it's compatible with zuul's license15:30
fungibut, again, the zuul matrix channel or zuul mailing list is the place for the discussion15:31
fungiopenssl security advisory coming on tuesday. highest severity ranked "moderate" by their security folks though15:33
clarkbif any infra-root have time to skim over to make sure that looks sane then give the sync back to script paste a close look for accuracy that would be appreciated. I'm at a point where I'd like to get that done as well as the announcement before taking further steps (just to avoid getting ahead of myself)15:44
clarkband while I wait for that I will do local software updates15:44
corvusfwiw, we evaluated using libgit2 in zuul, and there are significant performance penalties.  we can use it for some things, but for others, we would just need to shell out to a git process.16:03
dmendiza[m]Hi friends!  I need some help debugging a Centos 9 devstack gate.  Looks like things are failing because liberasurecode-devel can't be found:
dmendiza[m]Seems to be a swift requirement:
fungidmendiza[m]: yep, any idea where it comes from in centos stream 9?16:10
fungiepel? rdo?16:11
fungithere's a proposed change to switch the rdo version used:
fungicalls out ovs versions as the reason, but i wonder if ec is in a similar situation16:12
dmendiza[m]looks like the repo is delorean-master-testing  ... at least in a local devstack vm I spun up a couple of days ago16:13
fungier, i guess that change i linked is for /SIGs16:14
fungithere's a code comment in devstack that says "Some necessary packages are in openstack repo, for example liberasurecode-devel"16:15
fungiright before it does `install_package hostname openstack-release-wallaby`16:16
fungioh, that's for openEuler-22.03 though not centos16:16
fungiso i'm guessing it normally gets handled by the _install_rdo function16:17
fungiand for centos stream 9 it grabs and updates from that16:18
fungidmendiza[m]: are you sure older runs didn't have a similar error? it looks like that was nonfatal and the job died a few seconds later because it couldn't find pip:
fungi2023-05-23 12:49:14.139 | /usr/bin/python3.9: No module named pip16:25
dmendiza[m]fungi: so, what I think is happening, is that the dnf install dies because it can't find the one package, which results with python3-pip not being installed16:25
fungiaha, okay that would make sense16:26
fungiso in that case the main question i have is whether it's trying to download that package from our centos stream 9 mirror and not finding it, or trying to install it from elsewhere16:27
fungidmendiza[m]: we have it here in our mirrors:
dmendiza[m]hmm...  🤔 ...  only thing I can think of is maybe a signature check failure because it's only happening on our FIPS enabled gate16:31
* dmendiza[m] tries to test that theory16:31
fungidmendiza[m]: also make sure it's referencing an rdo package collection which actually has packages. looks like yoga, zed and antelope have liberasurecode-devel16:33
opendevreviewEduardo Alberti proposed openstack/project-config master: Add Kubernetes Power Manager app to StarlingX
fungidmendiza[m]: wallaby, xena and bobcat have no packages directory at all16:33
dmendiza[m]yeah, this is a patch to bobcat16:33
fungipresumably you'll want to use rdo antelope packages for now until rdo bobcat exists16:34
fungii mean, it exists it's just empty (no packages directory yet)
fungior you might need to grab them from (i don't think we mirror that anywhere?)16:36
clarkbinfra-root I did a test run of my sync script to cboylan/python-builder: spot checking against I think this looks good? both arm and x86 images made it across and hashes seem to match spot checking 3.10-bullseye and 3.11-bullseye16:36
clarkbplease check that for sanity, but I'll plan to proceed with syncing from to in opendevorg soon16:37
dmendiza[m]fungi: yeah, looks like the package installs just fine without fips, but fails like so when fips is enabled:
clarkbI think this is a distro bug? curl not working under fips isn't something we can fix for you16:47
johnsomclarkb I just saw this patch too:
fungiyep, having arrived at that, it's probably something to discuss in #openstack-qa or #rdo16:48
johnsomIt seems like that RDO host has old ciphers/protocols maybe?16:48
fungiif any of you know someone at red hat, that's where i'd start16:49
johnsomfungi lol16:49
dmendiza[m]heh ... yeah looks like the TLS handshake fails under fips:
clarkbinfra-root I manually checked `sudo docker run -it --rm bash` on my local machine and on nb04 to double check that the arm and amd manifest stuff seems to be working and it appears to be fine16:56
* dmendiza[m] takes his broken toy to 16:58
fungidmendiza[m]: johnsom: back on topic for this channel though, if the delorean/rdo packages can be mirrored the same way we rsync normal centos packages, that would mask the problem (not solve it because users doing their own fips compliance testing with devstack would still hit it) but would cut down on devstack centos jobs connecting out across the wilds of the internet to install stuff16:59
clarkbfungi: I don't think that is true16:59
clarkbrdo/delorean are in a ton of flux at all times16:59
clarkbthe delorean content in particular is basically development packages that get updated all the time randomly17:00
fungiyeah, if the way their package repositories are laid out isn't stable for mirroring, then it would probably be unworkable17:00
fungior if they don't update packages and indices in the right order (avoiding removing old packages for a while after new versions are added)17:00
fungigranted, that would be a problem for direct consumers of those packages too, even without us mirroring them17:01
clarkbfungi: did you want to look at the container image sync stuff before I proceed?17:08
fungii'm looking through it now, yep17:08
fungii think it makes sense, i'm just going back over it to be sure17:09
clarkbthanks! I'll wait then17:09
fungii haven't ever uploaded images to dockerhub or made tags there, so i'm taking for granted that part is correct17:10
dmendiza[m]I'm thinking this might be an OpenSSL issue and not the cert that RDO is using
clarkback. I think if you cross check the content between and that is probably good enough17:11
clarkbfungi: I used the script with a small modification of the new side namespace to cboylan in order to do that17:12
fungiclarkb: corrected a minor typo on the announcement17:12
fungidmendiza[m]: so the server is using tls 1.2 and expecting ems negotiation but openssl hasn't enabled it in fips mode?17:15
dmendiza[m]I think turning on FIPS in the client requires EMS, but for some reason it is not enabled ... at least that's what I think this error is trying to tell us:17:17
dmendiza[m]error:1C8000E9:Provider routines:kdf_tls1_prf_derive:ems not enabled17:17
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Move python builder/base images to"
fungiclarkb: the todo list makes references to "step 0" but the list is unnumbered. is that referring to the first bullet?17:18
clarkbfungi: yes17:18
fungik, thx17:19
fungiplan lgtm, i'll review that first change17:19
fungii take it that's a straight up revert of the original, hence going back to relative image names17:21
clarkbfungi: I changed the rebuild string in the Dockerfiles but otherwise it is a straight revert17:22
clarkb(I didn't want confusing timestamps that go backwards in those files)17:22
clarkbfungi: ok I'm going to push one more change then switch over to doing the sync step. We shouldn't land any of these until after that sync step completes17:23
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Move assets image to"
clarkbalright now to get logged in for the sync step and sort through that.17:24
clarkbI'm going to do just a few at a time. I started with assets because its content hasn't actually changed so is safe. and seem to be in agreement on that 92f... value17:29
clarkbI'm also only logged into the docker side so I can't accidentally push the other direction17:31
clarkbok all 14 images have been synced back. I've checked all the images (only spot checked tags on each image though) and they all seem to line up between docker and quay17:47
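The cross-check clarkb describes (do tags line up between the two registries?) amounts to comparing two tag-to-digest mappings. A minimal sketch, assuming the digests have already been collected by whatever registry tooling is at hand; the function name and the example digests are made up for illustration and are not the script used here.

```python
def compare_tags(source, target):
    """Compare two {tag: digest} mappings.

    Returns (missing, mismatched): tags absent from target, and tags
    present on both sides whose digests differ.
    """
    missing = sorted(tag for tag in source if tag not in target)
    mismatched = sorted(
        tag for tag in source if tag in target and source[tag] != target[tag]
    )
    return missing, mismatched


# Hypothetical digests, for illustration only.
quay_side = {"3.10-bullseye": "sha256:aaa", "3.11-bullseye": "sha256:bbb"}
docker_side = {"3.10-bullseye": "sha256:aaa", "3.11-bullseye": "sha256:ccc"}
missing, mismatched = compare_tags(quay_side, docker_side)
print("missing:", missing, "mismatched:", mismatched)
```

An empty result on both lists for every image would correspond to "they all seem to line up"; checking every tag rather than a spot sample is what the sketch adds.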
clarkbI'll wait for a +1 from zuul on the two changes I've pushed so far then I'll send the announcement email17:48
clarkbthen we can hopefully approve those two and continue on to the others17:49
fungithe assets revert already came back +117:49
fungithe first change hasn't reported yet17:49
clarkbya the assets build is pretty trivial it just copies a few files into a container image17:49
clarkbI'll wait for a +1 on the other one too17:49
clarkbcorvus: do you want to review these changes before they get approved?17:50
corvuscan do17:50
corvuslgtm; i can't think of a reason not to do it that way17:51
corvuson another subject, i was doing some digging into performance metrics for the nodepool zk persistent watches change, and i'm pretty sure i mentioned here already that you can clearly see the improvement in zk latency when we made the change (on april 11):
corvusbut what i just noticed is that i think we can actually see an improvement in zuul performance too:
corvusthe pipeline refresh time for the periodic pipeline drops from 5.0 seconds to 3.6 seconds after that change; likely because zk has more bandwidth to handle the zuul requests.17:53
clarkblooking at the jobs for the base python images change I will need to resync accessbot, haproxy-statsd and zookeeper-statsd after this lands. Not a big deal and that is why I've got a note to check if anything needs to be resynced again17:55
clarkbcorvus: is the periodic pipeline the most expensive in terms of time due to the thundering herd?17:59
corvusclarkb: yeah, i think that's what that spike is18:02
clarkbcorvus: fungi: I take it you are both good with me sending the email and approving those two changes?18:07
clarkbzuul just reported a +1 back18:07
corvusclarkb: yes18:07
opendevreviewEduardo Alberti proposed openstack/project-config master: Add Kubernetes Power Manager app to StarlingX
clarkbemail sent. Approving the changes next.18:09
corvusclarkbfungi the "test zuul/nodepool" on jammy changes are +1 from zuul; what's next for upgrading those servers?  land those changes, then delete, relaunch, and update dns one at a time?18:32
fungithat's probably safest. we could launch multiple replacements in parallel if we have the quota for it instead of deleting and replacing one by one18:34
clarkbin the past I've updated the testing at the same time that we update the inventory with the new nodes. But I don't think that is super important. Otherwise yup, basically that18:34
corvusin this case there's a 1:N where N is like 25 relationship with testing and real nodes, so exactly when to update is a grey area :)18:35
corvusso maybe i'll start by spinning up as many new mergers as possible, then swapping them out; then executors, then schedulers, then nodepool18:36
fungisgtm, thanks!18:38
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Migrate statsd sidecar container images to"
corvusi'm launching 6 merger nodes18:53
opendevreviewMerged opendev/system-config master: Revert "Move python builder/base images to"
opendevreviewMerged opendev/system-config master: Revert "Move assets image to"
clarkbthe promotion jobs for the base images have run in deploy successfully. I now need to resync haproxy-statsd, zookeeper-statsd, and accessbot. I'm going to do that now then I'm going to pop out for lunch18:59
clarkbthe pushes to docker seem to have worked for base images too19:02
clarkbResync for haproxy-statsd, zookeeper-statsd and accessbot is done19:08
clarkbI think that is ready when we are assuming it passes CI and reviewers are happy with it. Definitely look closely at that one as I had to do a partial revert of the second change. cc fungi and corvus19:09
clarkbI'm going to pause here and find lunch. I'll continue to push up changes when I return19:09
corvusi'm going to shut down all of the trash compactors on the detention level20:05
corvuser, i mean, all of the zuul mergers20:05
clarkb884274 is about to report success. I think it is good to merge if yall are happy with it20:14
corvusclarkb: lgtm20:15
opendevreviewJames E. Blair proposed opendev/ master: Replace all zuul mergers
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Move system-config irc bots into"
opendevreviewJames E. Blair proposed opendev/system-config master: Upgrade zuul mergers to jammy
corvusclarkb: fungi ^ those 2 changes should accomplish the jammy upgrade for zm*.  i have shut all current mergers down (relying on executors for now), so should be safe to apply at will.20:18
corvusclarkb: and i included the jammy test upgrade for only the merger in that change :)20:19
clarkbcorvus: replacing them in place like that may require you to clear out ansible caches. I can't remember if that was a problem the last time around20:20
clarkbsince ansible will cache the wrong host facts20:20
corvushrm, something weird happened with the tabs in the dns change... want me to fix that?20:20
corvusclarkb: ack.  when do you think is the best time for that?20:20
clarkbcorvus: we run zuul hourly so you'd probably have to put hosts in the emergency file to avoid that. Maybe better to see if it breaks when the change lands and if so then clear it out and let the hourly job correct things20:21
clarkbre tabs I haven't been too picky about it20:21
corvuswfm.  just rm -fr /var/cache/ansible/facts/zm*20:21
clarkbI try to make things consistent when I make changes but otherwise its meh20:21
clarkband ya I think that should do it20:21
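For anyone wary of running `rm -fr` with a glob, the same cleanup can be previewed first. A small sketch with a dry-run default; the helper name is hypothetical, and the cache path and `zm*` pattern are simply the ones corvus quotes above.

```python
import glob
import os
import shutil


def clear_cached_facts(cache_dir, pattern, dry_run=True):
    """Return cache entries matching pattern; delete them unless dry_run."""
    matches = sorted(glob.glob(os.path.join(cache_dir, pattern)))
    if not dry_run:
        for path in matches:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)  # directory entry
            else:
                os.remove(path)  # regular file or symlink
    return matches


# Preview only (nothing is deleted with the default dry_run=True):
#   clear_cached_facts("/var/cache/ansible/facts", "zm*")
# Actually delete:
#   clear_cached_facts("/var/cache/ansible/facts", "zm*", dry_run=False)
```

The dry-run default is a design choice: the first call shows exactly which hosts' facts would be dropped, which helps when the glob could match more than the merger nodes.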
corvusk.  happy to fix the tabs if desired; just don't want to churn under you if no one cares.20:22
clarkb+2 on both changes. I didn't go and verify every ip address and host key :)20:22
clarkbbut the changes look like what I would expect them to look20:22
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Extend the checksum files generation procedure
opendevreviewClark Boylan proposed opendev/gerritbot master: Revert "Move gerritbot to"
opendevreviewClark Boylan proposed opendev/statusbot master: Revert "Move statusbot to"
opendevreviewMaksim Malchuk proposed openstack/diskimage-builder master: Extend the checksum files generation procedure
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Move pull external IRC bot images from"
opendevreviewClark Boylan proposed opendev/grafyaml master: Revert "Migrate grafyaml container images to"
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Pull grafyaml from"
dmendiza[m]fungi re: FIPS jobs failing, turning on FIPS mode in the client requires that TLS 1.2 use EMS, but some repos have EMS disabled in openssl, so the negotiation fails.  Tracking a workaround here:
opendevreviewClark Boylan proposed openstack/project-config master: Revert "Pull grafyaml from"
opendevreviewClark Boylan proposed opendev/lodgeit master: Revert "Move lodgeit image publication to"
opendevreviewClark Boylan proposed opendev/system-config master: Revert "Pull lodgeit from"
clarkbok all of the required changes for this rollback should be pushed now and recorded in I think I managed to set the proper tag on all of them too21:03
clarkbI would like to approve them one by one because sometimes there are unexpected side effects that cause an image you don't expect to update to update21:03
clarkband doing them one by one allows us to check and resync to docker hub if necessary21:03
opendevreviewMerged opendev/ master: Replace all zuul mergers
clarkb is the next change to land and should be safe to do so. fungi I'll give you a bit of time to review that one since I ended up doing a partial revert and would like eyeballs on it if possible21:04
clarkband then its just a matter of landing the changes, verifying results and moving on to the next one21:05
fungistepped away to make/eat dinner but checking merger changes now21:05
fungialready merged! good deal21:05
fungioh, dns change already merged, inventory change already gating but lgtm21:06
fungii'll review the rest of the quay migration reverts in that case21:07
clarkboops I swapped two numbers in the next up change number it should be
clarkband fungi has +2'd it so I will approve it now21:18
clarkbcorvus: if you have time for that is the next one up. I don't expect I'll get much beyond these two today just with the time involved to test, merge, deploy21:23
opendevreviewMerged opendev/system-config master: Upgrade zuul mergers to jammy
corvusthings have landed, dns is updated, and we're between zuul service playbook runs, so i'm going to go ahead and clear the fact cache for zm*22:40
corvushopefully next run at approx 23:11 should deploy22:41
opendevreviewMerged opendev/system-config master: Revert "Migrate statsd sidecar container images to"
clarkbcorvus: it is running now23:42
corvuslooks like it finished23:50
clarkbcorvus: I don't think it auto started the services though but ya seems to have deployed successfully23:50
corvusi'll start up zm01 and watch it23:50
corvusit's up...and idle :)23:51
clarkbthe statsd containers promoted successfully. They should roll out soon enough23:51
clarkbI'll continue the revert work tomorrow23:51
corvusclarkb: before you head out; do you reckon i should delete the old mergers now?23:52
corvusis the best way to do that to open an openstack client shell on bridge and delete from there?23:52
corvus(zm01 merged something, i will start the others)23:53
clarkbcorvus: in the past I've left things disabled on the old host for a short time just in case a revert is necessary. That seems less likely for mergers though23:54
corvusyeah, we can live without them, so i'll just remove them now23:54
clarkbcorvus: and ya `openstack server list | grep` then find the uuid for the old one using `openstack server show $uuid` then `openstack server delete $uuid`23:54
clarkbyou can't delete by name since there are duplicates23:55
corvusgood point; maybe we should put the uuid in inventory files to make this easier23:56
clarkbI tend to do a server show on whatever I'm going to delete first anyway like doing a select before a delete. Just makes me feel better :)23:58
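Because old and new servers share a name during a replacement, the first step of the list / show / delete routine above is spotting which names map to more than one UUID. A rough sketch, assuming the `openstack` CLI is installed and that its JSON output carries `ID` and `Name` keys; the function names are hypothetical and nothing here deletes anything.

```python
import json
import subprocess
from collections import defaultdict


def list_servers():
    """Fetch the server list as JSON; assumes the `openstack` CLI and credentials."""
    cmd = ["openstack", "server", "list", "-f", "json"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def duplicates_by_name(servers):
    """Map each name used by more than one server to the list of its UUIDs."""
    ids = defaultdict(list)
    for server in servers:
        ids[server["Name"]].append(server["ID"])
    return {name: uuids for name, uuids in ids.items() if len(uuids) > 1}
```

Each UUID returned by `duplicates_by_name(list_servers())` can then be inspected with `openstack server show $uuid` before running `openstack server delete $uuid` on the old one, which keeps the "select before delete" habit clarkb describes.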
