Tuesday, 2025-02-11

<clarkb> the zuul-lb02 deployment "failed" because the handler that sends SIGHUP to haproxy when the haproxy config is updated failed  00:06
<clarkb> this seems to be some problem with podman  00:06
<clarkb> zuul-lb02 podman[17437]: time="2025-02-11T00:04:23Z" level=error msg="unable to signal init: permission denied"  00:06
<clarkb> I've been able to reproduce it manually on the server  00:06
<clarkb> the service itself is up and running, it's just our ability to request graceful restarts that appears to be impacted right now  00:07
<clarkb> I think that is worth investigating further before proceeding though  00:07
<clarkb> the error occurs when I use `sudo podman kill` too, so this isn't related to having docker compose sit in front of podman at least  00:09
<clarkb> github indicates it may be an apparmor problem  00:10
<clarkb> kern.log seems to confirm  00:11
<clarkb> It is odd to me that we don't appear to have hit this in CI either  00:13
<clarkb> do we disable apparmor there?  00:14
<clarkb> https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483 is the issue. I'm still skimming it but I think they only allow a subset of signals and hup is not one of them :/  00:21
<clarkb> https://github.com/NeilW/common/commit/b924b9d63305c94cbc8e46154bc6c1ebc08afd09 implies any signal should be allowed, but I don't see that rule in place on the server. Trying to figure out if we're even patched  00:29
<clarkb> I suspect the upstream fixes that went into podman ^ in that commit differed from what ubuntu actually deployed in their packaging :/  00:29
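
For illustration, a minimal AppArmor sketch of the distinction being discussed; the profile name, flags, and signal set below are hypothetical and are not the exact rules Ubuntu ships:

    # illustrative fragment only, not the Ubuntu-shipped containers profile
    profile containers-default flags=(attach_disconnected,mediate_deleted) {
      # an unrestricted receive rule lets an unconfined sender (e.g. podman on the host)
      # deliver any signal, including SIGHUP:
      signal (receive) peer=unconfined,
      # a restricted variant like the one suspected here would break graceful reloads,
      # because hup is missing from the allowed set:
      # signal (receive) set=(int, quit, term, kill) peer=unconfined,
    }
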
<corvus> i've seen that bug before, 1 sec  00:36
<corvus> clarkb: https://review.opendev.org/c/zuul/zuul/+/923084/14..15  00:38
<corvus> that's the point at which a workaround for that bug was no longer needed in our canary testing  00:38
<corvus> i think stopping, or maybe restarting, the container was the problem then  00:39
<corvus> i think that is compatible with your hypothesis that they did something different (ie, neglected to handle SIGHUP)  00:40
<clarkb> corvus: ya I'm like 99% sure that they are allowing that list of signals in the diff I linked, because regular management works. But then hup isn't in that list, and the audit log shows hup specifically as the signal and podman as the peer  00:40
<clarkb> I've tried making a local override, but I think podman is loading its own apparmor profiles when running? which may override things?  00:41
<corvus> yeah, take a look at the diff i linked  00:41
<clarkb> however the audit log doesn't support that hypothesis. I see the initial profile load before any podman stuff runs, then the manual reload I did  00:41
<clarkb> corvus: ya your diff appears to tell podman to select a specific profile?  00:41
<corvus> i think that means we'd need a new profile, and then we tell podman to use that profile in the config file  00:41
<corvus> yep  00:41
<clarkb> ya that could be  00:42
<clarkb> I'm going to first check if the issue is that I'm trying to modify a running process that already selected its profile, and I need a new process to see the newly loaded profile  00:42
<clarkb> I'm still not understanding why this didn't fail in CI, but one step at a time  00:42
<corvus> ack.  i agree the ci failure is worth looking into/fixing.  00:43
<clarkb> down then up'ing containers doesn't produce a different result  00:43
<clarkb> I guess I could try a reboot just to make sure there isn't some state I'm missing. Maybe apparmor_parser -r /etc/apparmor.d/podman isn't enough  00:44
<clarkb> oh actually the audit log says the profile is containers-default-0.57.4-apparmor1  00:44
<clarkb> that probably explains why editing the podman profile doesn't help. But I don't see that profile on disk, so this might be something that is dynamically loaded  00:45
<corvus> i believe i learned during my initial investigation that apparmor uses the profile specified in the config (or its default) when setting up a container; so editing containers.conf is almost certainly necessary  00:45
<clarkb> ya, and containers-default-0.57.4-apparmor1 must be the built-in default because I'm not finding it in /etc/apparmor.d  00:46
<corvus> ++  00:46
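
A minimal sketch of the containers.conf override corvus describes, assuming a custom profile had been written and loaded first; the profile name here is hypothetical (the option itself is documented in containers.conf(5)):

    # /etc/containers/containers.conf (hypothetical local override)
    [containers]
    # point podman at a locally maintained AppArmor profile instead of its generated default
    apparmor_profile = "opendev-haproxy-custom"
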
<clarkb> fwiw sudo kill -HUP $HAPROXY_CONTROL_PID seems to work  00:47
<clarkb> I get a new haproxy subprocess (though I didn't confirm it changes any configuration)  00:48
<corvus> we could maybe move to use that and ignore the apparmor bug  00:48
<clarkb> ya that's what I'm wondering  00:48
<corvus> my gut says that will be less work and will make us less unhappy :)  00:48
<clarkb> ok I'll pick that up tomorrow then. I need to get this meeting agenda out and then figure out dinner  00:50
<clarkb> let me remove the apparmor edits I made and reload the profile again so that we're back to the expected configuration on the server  00:50
<clarkb> ok I think I've restored everything to the way configuration management is set up to deploy it.  00:53
<clarkb> for my own note taking tomorrow: `sudo docker inspect --format '{{.State.Pid}}' haproxy-docker-haproxy-1` is the magic to get back the pid of the process in the container. Seems to work for both podman and docker  01:01
<clarkb> but ya I think we look up the pid that way and issue a hup directly, and drop the use of podman kill :/ disappointing that we continue to have so many problems. Part of the issue seems to be that upstream will say these are distro issues even though they seem to load an apparmor profile out of a template in their own code base  01:01
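
Putting the two pieces together, a minimal shell sketch of the workaround being discussed, run on the load balancer host; the container name matches the one used above:

    # look up the host PID of the container's init process, then signal it directly,
    # bypassing podman kill and the apparmor signal mediation
    PID=$(sudo docker inspect --format '{{.State.Pid}}' haproxy-docker-haproxy-1)
    sudo kill -HUP "$PID"
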
<clarkb> oh btw codesearch restarted automatically, logging seems to still work, and the service is functional for me  01:03
<clarkb> I'll also update the launchpad issue and the github issues tomorrow I guess  01:04
<clarkb> it looks like there may be usr1 signals getting ignored too  01:04
<clarkb> agenda is out and I'm off in search of food. Thanks!  01:09
*** clarkb is now known as Guest8709  02:38
<opendevreview> Takashi Kajinami proposed openstack/diskimage-builder master: tox: Drop functest target from tox  https://review.opendev.org/c/openstack/diskimage-builder/+/932266  04:23
*** jroll09 is now known as jroll0  08:45
*** ralonsoh_ is now known as ralonsoh  10:21
*** ralonsoh_ is now known as ralonsoh  14:00
<Guest8709> I've apparently become a guest  15:22
<fungi> welcome, honored guest!  15:22
*** Guest8709 is now known as clarkb  15:23
<clarkb> does anyone know the correct way to escape the {{ .State.Pid }} in this ansible: command: docker inspect --format '{{.State.Pid}}' {{ docker_ps.stdout }}  15:42
<clarkb> maybe the raw tags are sufficient here?  15:42
<clarkb> I guess {{ '{{' }} produces the {{ literal  15:43
<clarkb> command: docker inspect --format "{{ '{{' }} .State.Pid {{ '}}' }}" {{ docker_ps.stdout }}  15:45
<jrosser> i think you can use !unsafe for this  15:46
<jrosser> possibly something like command: !unsafe "stuff-that-should-not-be-templated"  15:47
<jrosser> that would apply to the whole thing though  15:47
<clarkb> right, I need the variable at the end to be interpolated  15:52
<clarkb> I could probably define {{ .State.Pid }} as an unsafe variable value and then interpolate that variable in as an alternative  15:53
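
A minimal sketch of that alternative, assuming hypothetical names (pid_format, docker_ps, haproxy_pid); !unsafe keeps Jinja from templating the Go template while the rest of the command string is still interpolated:

    - name: Look up the haproxy container PID
      vars:
        # marked unsafe so Jinja passes the Go template through to docker inspect untouched
        pid_format: !unsafe '{{ .State.Pid }}'
      command: "docker inspect --format '{{ pid_format }}' {{ docker_ps.stdout }}"
      register: haproxy_pid
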
<corvus> clarkb: since it's at the end .... consider concatenation?  (or maybe supplying a YAML list of arguments to command)  15:54
<clarkb> corvus: I'm not sure I understand how concatenation would work. Yaml won't let me define an unsafe variable then concatenate that to a safe variable, will it?  15:55
<corvus> clarkb: https://paste.openstack.org/show/bYxAKCu9cNjQEbokn5LG/  15:59
<clarkb> TIL  16:00
<clarkb> though I think I like the suggestion of using yaml arguments instead  16:00
<clarkb> I'll get a patch up shortly  16:00
<corvus> yeah i like that one better  16:00
<corvus> something like https://paste.openstack.org/show/biOTZ61EiPvjzWZnMTNv/  16:02
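
The paste contents aren't preserved in the log; a hedged guess at what the list-of-arguments form might look like, combining argv with !unsafe on just the one element that must not be templated (names are hypothetical, and the actual change at 941256 may differ):

    - name: Look up the haproxy container PID
      command:
        argv:
          - docker
          - inspect
          - --format
          # only this element needs to escape Jinja templating
          - !unsafe '{{ .State.Pid }}'
          - '{{ docker_ps.stdout }}'
      register: haproxy_pid
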
<clarkb> oh I came up with something slightly different. One sec and gerrit will have it  16:04
<opendevreview> Clark Boylan proposed opendev/system-config master: Perform haproxy HUP signals with kill  https://review.opendev.org/c/opendev/system-config/+/941256  16:05
<clarkb> corvus: ^  16:05
<clarkb> I'll try to file new upstream bugs after meetings today  16:06
*** elodilles_pto is now known as elodilles  16:21
<clarkb> the haproxy hup updates failed due to docker rate limits (arg) but the zuul job at least seems to show that the hup changes are working  17:30
<clarkb> https://746cc4bed29e8f00c6ad-bb8dadf314ca143f13ef83e8dbc65d1a.ssl.cf2.rackcdn.com/941256/1/check/system-config-run-zuul/b4e1184/bridge99.opendev.org/ara-report/playbooks/7.html  17:30
<clarkb> it is the last three tasks in that playbook doing the hup  17:30
<clarkb> so I think that is ready for review now, and we'll recheck later today when docker is less likely to be a problem  17:30
<clarkb> but if we get that in I think we can land the change to switch zuul.o.o over to zuul-lb02 in dns  17:31
<clarkb> https://github.com/containers/common/issues/2321 has been reported. I'll follow up on the launchpad side of things too  18:05
<clarkb> just one step at a time  18:05
<clarkb> and https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2089664 was already filed against ubuntu, and that points to a PR fix that is waiting on someone with apparmor knowledge to weigh in on. Hopefully my involvement there gets things moving again  18:22
<clarkb> in the meantime the workaround I proposed above should work, I think  18:22
<clarkb> after lunch I may try to ride my bike a bit. It is cold though so probably not for very long  19:48
<clarkb> moving faster than you can run is great when the weather is warmer than 10C  19:48
<clarkb> pretty miserable when colder than that  19:49
<tonyb> It certainly can be.  Good luck  19:50
<fungi> python 3.14.0a5 is tagged  20:02
<clarkb> https://review.opendev.org/c/opendev/system-config/+/941256 passes now. I think if people are happy with that workaround and approve the change, we can go ahead and approve the dns update for zuul.o.o after it lands  20:28
<clarkb> https://zuul-lb02.opendev.org is up and running if you want to test general functionality  20:29
<opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image  https://review.opendev.org/c/zuul/zuul-jobs/+/939823  21:16
<clarkb> reviewing gerrit esc meeting notes, it looks like they have a plan for improving git diff caching. The proposal is to use a diff cache directly in the git repo and stop using h2 for that  21:19
<clarkb> I wonder if that is something that git already has built in  21:20
<clarkb> is anyone willing and able to review https://review.opendev.org/c/opendev/system-config/+/941256 ? If this wasn't a "podman has a bug and I'm working around it" situation I'd probably proceed to keep the noble node rollout moving, but this is worth having at least another set of eyes on I think  21:24
<clarkb> I think I've decided it is too cold outside and will exercise indoors at some point, so happy to monitor and approve etc  21:25
<opendevreview> Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image  https://review.opendev.org/c/zuul/zuul-jobs/+/939823  21:46
<corvus> clarkb: +2 on 256  23:11
<corvus> i almost thought about suggesting a shell script, but we might appreciate the extra debuggability of individual tasks there. lgtm.  23:12
<clarkb> ok I'm going to approve that now then and cross my fingers docker is happy  23:15
<fungi> ah, you meant 941256  23:24
<fungi> i guess i double-approved it  23:24
<clarkb> arg it already bounced off of docker rate limits. It's frustrating that the process to get off docker hub relies so heavily on docker hub  23:40
<clarkb> oh wait no, it failed trying to fetch some package  23:41
<clarkb> Temporary failure resolving 'us.archive.ubuntu.com', and the other job failed to update the apt cache for likely the same reason  23:41
<clarkb> I think we don't use our mirrors there because system-config-run jobs put in the prod config for apt, which doesn't use the mirrors  23:42
<clarkb> those names resolve for me so I'll try again  23:42
<fungi> did it run in flex?  23:46
<fungi> for the past week or so i've been noticing outages of 10-15 minutes for network connectivity to sjc3, both vm connectivity and to the api  23:46
<clarkb> fungi: yes  23:47
<fungi> on the order of a few times a day  23:47
<clarkb> https://zuul.opendev.org/t/openstack/build/3f0de8bfa0ae4dc084f2cf46ccc26cc1/log/job-output.txt and https://zuul.opendev.org/t/openstack/build/11dbd797b7404b7d99d39821057deb90/log/job-output.txt both show raxflex nodes  23:47
<clarkb> fungi: where is it manifesting for the api? in nodepool logs?  23:50
<clarkb> I'm wondering if there is something concrete enough, like api request timestamps, that we can bring up with them  23:51
<fungi> not sure, the reason i asked is that i have a personal vm in my own rackspace account i'm using as a shell server there, and when my connectivity to it starts timing out i also check whether i can get a response from the nova api, which is also typically down at the same time. in this case i had just regained access after about 10 minutes of being unreachable, and you mentioned a  23:54
<fungi> somewhat unlikely network connection failure right after  23:54
<clarkb> interesting. It does seem like something that might be worth mentioning. But maybe only if it persists? Though you say it's been ongoing for a week, which seems pretty persistent  23:55
<fungi> cloudnull doesn't seem to be hanging out in here lately, but i'll check with cardoe tomorrow to see if he knows of anything going on there for the past week or so  23:55
<fungi> maybe they're redoing switch cabling or something  23:55
<cardoe> He would. Joins the board and suddenly irc is too much for his corporate self.  23:57
<fungi> hah  23:57
