clarkb | the zuul-lb02 deployment "failed" because the handler to run sighup against haproxy when the haproxy config updated failed | 00:06 |
clarkb | this seems to be some problem with podman | 00:06 |
clarkb | zuul-lb02 podman[17437]: time="2025-02-11T00:04:23Z" level=error msg="unable to signal init: permission denied" | 00:06 |
clarkb | I've been able to reproduce it manually on the server | 00:06 |
clarkb | the service itself is up and running, it's just our ability to request graceful restarts that appears to be impacted right now | 00:07 |
clarkb | I think that is worth investigating further before proceeding though | 00:07 |
clarkb | the error occurs when I use `sudo podman kill` too so this isn't related to having docker compose sit in front of podman at least | 00:09 |
clarkb | github indicates it may be an apparmor problem | 00:10 |
clarkb | kern.log seems to confirm | 00:11 |
clarkb | It is odd to me that we don't appear to have hit this in CI either | 00:13 |
clarkb | do we disable apparmor there? | 00:14 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483 is the issue. I'm still skimming it but I think they only allow a subset of signals and hup is not one of them :/ | 00:21 |
clarkb | https://github.com/NeilW/common/commit/b924b9d63305c94cbc8e46154bc6c1ebc08afd09 this implies any signal should be allowed, but I don't see that rule in place on the server. Trying to figure out if we're even patched | 00:29 |
clarkb | I suspect that maybe upstream fixes that went to podman ^ in that commit differed from what ubuntu actually deployed in their packaging :/ | 00:29 |
corvus | i've seen that bug before, 1 sec | 00:36 |
corvus | clarkb: https://review.opendev.org/c/zuul/zuul/+/923084/14..15 | 00:38 |
corvus | that's the point at which a workaround for that bug was no longer needed in our canary testing | 00:38 |
corvus | i think stopping, or maybe restarting, the container was the problem then | 00:39 |
corvus | i think that is compatible with your hypothesis that they did something different (ie, neglected to handle SIGHUP) | 00:40 |
clarkb | corvus: ya I'm like 99% sure that they are allowing that list of signals in the diff I linked because regular management works. But then hup isn't in that list and the audit log shows hup specifically as the signal and podman as the peer | 00:40 |
clarkb | I've tried making a local override but I think podman is loading its own apparmor profiles when running? Which may override things? | 00:41 |
corvus | yeah, take a look at the diff i linked | 00:41 |
clarkb | however the audit log doesn't support that hypothesis. I see the initial profile load before any podman stuff runs then the manual reload I did | 00:41 |
clarkb | corvus: ya your diff appears to tell podman to select a specific profile? | 00:41 |
corvus | i think that means we'd need a new profile, and then we tell podman to use that profile in the config file | 00:41 |
corvus | yep | 00:41 |
clarkb | ya that could be | 00:42 |
clarkb | I'm going to first check if the issue is I'm trying to modify a running process that already selected its profile and I need a new process to see the newly loaded profile | 00:42 |
clarkb | I'm still not understanding why this didn't fail in CI but one step at a time | 00:42 |
corvus | ack. i agree the ci failure is worth looking into/fixing. | 00:43 |
clarkb | down then up'ing containers doesn't produce a different result | 00:43 |
clarkb | I guess I could try a reboot just to make sure there isn't some state I'm missing. Maybe apparmor_parser -r /etc/apparmor.d/podman isn't enough | 00:44 |
clarkb | oh actually the audit log says the profile is containers-default-0.57.4-apparmor1 | 00:44 |
clarkb | that probably explains why editing the podman profile doesn't help. But I don't see that profile on disk so this might be something that is dynamically loaded | 00:45 |
corvus | i believe i learned during my initial investigation that apparmor uses the profile specified in the config (or its default) when setting up a container; so editing containers.conf is almost certainly necessary | 00:45 |
clarkb | ya and containers-default-0.57.4-apparmor1 must be the built in default because I'm not finding it in /etc/apparmor.d | 00:46 |
corvus | ++ | 00:46 |
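As context for the "new profile plus containers.conf" route corvus and clarkb are weighing here, a minimal Ansible sketch of that approach might look like the following. This only illustrates the alternative that was discussed (and not ultimately taken); the drop-in path, profile name, and the assumption that this version of containers.conf honors containers.conf.d drop-ins are all mine, not details from the actual servers or change.

```yaml
# Hedged sketch: point podman at a custom AppArmor profile instead of the
# built-in containers-default one. The profile name "opendev-haproxy" and the
# containers.conf.d drop-in path are illustrative assumptions; the custom
# profile itself would still need to be written and loaded separately
# (e.g. with apparmor_parser).
- name: Point podman at a custom AppArmor profile
  ansible.builtin.copy:
    dest: /etc/containers/containers.conf.d/99-apparmor.conf
    content: |
      [containers]
      apparmor_profile = "opendev-haproxy"
    owner: root
    group: root
    mode: "0644"
```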
clarkb | fwiw sudo kill -HUP $HAPROXY_CONTROL_PID seems to work | 00:47 |
clarkb | I get a new haproxy subprocess (though I didn't confirm it changes any configuration) | 00:48 |
corvus | we could maybe move to use that and ignore the apparmor bug | 00:48 |
clarkb | ya that's what I'm wondering | 00:48 |
corvus | my gut says that will be less work and will make us less unhappy :) | 00:48 |
clarkb | ok I'll pick that up tomorrow then. I need to get this meeting agenda out and then figure out dinner | 00:50 |
clarkb | let me remove the apparmor edits I made and reload the profile again so that we're back to expected configuration on the server | 00:50 |
clarkb | ok I think I've restored everything to the way configuration management is setup to deploy it. | 00:53 |
clarkb | for my own note taking tomorrow: `sudo docker inspect --format '{{.State.Pid}}' haproxy-docker-haproxy-1` is the magic to get back the pid of the process in the container. Seems to work for both podman and docker | 01:01 |
clarkb | but ya I think we look up the pid that way and issue a hup directly and drop the use of podman kill :/ disappointing that we continue to have so many problems. Part of the issue seems to be that upstream will say these are distro issues even though they seem to load an apparmor profile out of a template in their own code base | 01:01 |
clarkb | oh btw codesearch restarted automatically and logging seems to still work and the service is functional for me | 01:03 |
clarkb | I'll also update the launchpad issue and the github issues tomorrow I guess | 01:04 |
clarkb | it looks like there may be usr1 signals getting ignored too | 01:04 |
clarkb | agenda is out and I'm off in search of food. Thanks! | 01:09 |
*** clarkb is now known as Guest8709 | 02:38 | |
opendevreview | Takashi Kajinami proposed openstack/diskimage-builder master: tox: Drop functest target from tox https://review.opendev.org/c/openstack/diskimage-builder/+/932266 | 04:23 |
*** jroll09 is now known as jroll0 | 08:45 | |
*** ralonsoh_ is now known as ralonsoh | 10:21 | |
*** ralonsoh_ is now known as ralonsoh | 14:00 | |
Guest8709 | I've apparently become a guest | 15:22 |
fungi | welcome, honored guest! | 15:22 |
*** Guest8709 is now known as clarkb | 15:23 | |
clarkb | does anyone know the correct way to escape the {{ .State.Pid }} in this ansible: command: docker inspect --format '{{.State.Pid}}' {{ docker_ps.stdout }} | 15:42 |
clarkb | maybe the raw tags are sufficient here? | 15:42 |
clarkb | I guess {{ '{{' }} produces the {{ literal | 15:43 |
clarkb | command: docker inspect --format "{{ '{{' }} .State.Pid {{ '}}' }}" {{ docker_ps.stdout }} | 15:45 |
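Spelled out as a full task, the escaping approach from the line above could look roughly like this; the task name and registered variable are illustrative rather than taken from the eventual change.

```yaml
# {{ '{{' }} and {{ '}}' }} render to literal braces, so Jinja passes the Go
# template through to docker inspect while still interpolating docker_ps.stdout.
- name: Get haproxy container PID
  command: >-
    docker inspect --format
    "{{ '{{' }} .State.Pid {{ '}}' }}"
    {{ docker_ps.stdout }}
  register: haproxy_pid
```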
jrosser | i think you can use !unsafe for this | 15:46 |
jrosser | possibly something like command: !unsafe "stuff-that-should-not-be-templated" | 15:47 |
jrosser | that would apply to the whole thing though | 15:47 |
clarkb | right I need the variable at the end to be interpolated | 15:52 |
clarkb | I could probably define {{ .State.Pid }} as an unsafe variable value then interpolate that variable in as an alternative | 15:53 |
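A sketch of that !unsafe-variable alternative, assuming a hypothetical pid_format variable name; clarkb only floats the idea here, so this is not necessarily what any real change does.

```yaml
# !unsafe marks the value so Jinja never templates it; interpolating the
# variable then drops the literal Go template into the rendered command.
- name: Get haproxy container PID
  vars:
    pid_format: !unsafe "{{ .State.Pid }}"
  command: "docker inspect --format '{{ pid_format }}' {{ docker_ps.stdout }}"
  register: haproxy_pid
```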
corvus | clarkb: since it's at the end .... consider concatenation? (or maybe supplying a YAML list of arguments to command) | 15:54 |
clarkb | corvus: I'm not sure I understand how concatenation would work. Yaml won't let me define an unsafe variable then concatenate that to a safe variable will it? | 15:55 |
corvus | clarkb: https://paste.openstack.org/show/bYxAKCu9cNjQEbokn5LG/ | 15:59 |
clarkb | TIL | 16:00 |
clarkb | though I think I like the suggestion of using yaml arguments instead | 16:00 |
clarkb | I'll get a patch up shortly | 16:00 |
corvus | yeah i like that one better | 16:00 |
corvus | something like https://paste.openstack.org/show/biOTZ61EiPvjzWZnMTNv/ | 16:02 |
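The paste itself isn't reproduced in the log, but the argv form corvus is suggesting could plausibly look like the sketch below; treat it as a guess at the idea, not a copy of the paste. The appeal is that only the single list item holding the Go template needs protecting from Jinja.

```yaml
# Each argument is its own YAML item, so !unsafe (or escaping) only has to
# cover the format string; the container reference is templated as usual.
- name: Get haproxy container PID
  command:
    argv:
      - docker
      - inspect
      - --format
      - !unsafe "{{ .State.Pid }}"
      - "{{ docker_ps.stdout }}"
  register: haproxy_pid
```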
clarkb | oh I came up with something slightly different. One sec and gerrit will have it | 16:04 |
opendevreview | Clark Boylan proposed opendev/system-config master: Perform haproxy HUP signals with kill https://review.opendev.org/c/opendev/system-config/+/941256 | 16:05 |
clarkb | corvus: ^ | 16:05 |
clarkb | I'll try to file new upstream bugs after meetings today | 16:06 |
*** elodilles_pto is now known as elodilles | 16:21 | |
clarkb | the haproxy hup updates failed due to docker rate limits (arg) but the zuul job at least seems to show that the hup changes are working | 17:30 |
clarkb | https://746cc4bed29e8f00c6ad-bb8dadf314ca143f13ef83e8dbc65d1a.ssl.cf2.rackcdn.com/941256/1/check/system-config-run-zuul/b4e1184/bridge99.opendev.org/ara-report/playbooks/7.html | 17:30 |
clarkb | it is the last three tasks in that playbook doing the hup | 17:30 |
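For anyone reading along without the change open, those three tasks presumably amount to something in this direction. This is a hedged reconstruction rather than the actual content of 941256; the container filter, task names, and variable names are guesses.

```yaml
# Sketch of a HUP-haproxy-via-kill handler chain: find the container, resolve
# the PID of its init process, then signal that PID directly so the AppArmor
# restriction on podman kill never comes into play.
- name: Find the haproxy container
  command: docker ps --quiet --filter name=haproxy
  register: docker_ps

- name: Get haproxy container PID
  command:
    argv:
      - docker
      - inspect
      - --format
      - !unsafe "{{ .State.Pid }}"
      - "{{ docker_ps.stdout }}"
  register: haproxy_pid

- name: Send SIGHUP to haproxy for a graceful reload
  command: "kill -HUP {{ haproxy_pid.stdout }}"
  become: true
```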
clarkb | so I think that is ready for review now and we'll recheck later today when docker is less likely to be a problem | 17:30 |
clarkb | but if we get that in I think we can land the change to switch zuul.o.o over to zuul-lb02 in dns | 17:31 |
clarkb | https://github.com/containers/common/issues/2321 has been reported. I'll followup on the launchpad side of things too | 18:05 |
clarkb | just one step at a time | 18:05 |
clarkb | and https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2089664 was already filed against ubuntu and that points to a PR fix that is waiting on someone with apparmor knowledge to weigh in on. Hopefully my involvement there gets things moving again | 18:22 |
clarkb | in the meantime the workaround I proposed above should work I think | 18:22 |
clarkb | after lunch I may try to ride my bike a bit. It is cold though so probably not for very long | 19:48 |
clarkb | moving faster than you can run is great when the weather is warmer than 10C | 19:48 |
clarkb | pretty miserable when colder than that | 19:49 |
tonyb | It certainly can be. Good luck | 19:50 |
fungi | python 3.14.0a5 is tagged | 20:02 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/941256 passes now. I think if people are happy with that workaround and approve the change we can go ahead and approve the dns update for zuul.o.o after it lands | 20:28 |
clarkb | https://zuul-lb02.opendev.org is up and running if you want to test general functionality | 20:29 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 21:16 |
clarkb | reviewing gerrit ESC meeting notes it looks like they have a plan for improving git diff caching. The proposal is to use a diff cache directly in the git repo and stop using h2 for that | 21:19 |
clarkb | I wonder if that is something that git already has built in | 21:20 |
clarkb | is anyone willing and able to review https://review.opendev.org/c/opendev/system-config/+/941256 ? If this wasn't a "podman has a bug and I'm working around it" situation I'd probably proceed to keep the noble node rollout moving but this is worth having at least another set of eyes on I think | 21:24 |
clarkb | I think I've decided it is too cold outside and will exercise indoors at some point so happy to monitor and approve etc | 21:25 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 21:46 |
corvus | clarkb: +2 on 256 | 23:11 |
corvus | i almost thought about suggesting a shell script, but we might appreciate the extra debuggability of individual tasks there. lgtm. | 23:12 |
clarkb | ok I'm going to approve that now then and cross my fingers docker is happy | 23:15 |
fungi | ah, you meant 941256 | 23:24 |
fungi | i guess i double-approved it | 23:24 |
clarkb | arg it already bounced off of docker rate limits. It's frustrating that the process to get off docker hub relies so heavily on docker hub | 23:40 |
clarkb | oh wait no it failed trying to fetch some package | 23:41 |
clarkb | Temporary failure resolving 'us.archive.ubuntu.com' and the other job failed to update apt cache for likely the same reason | 23:41 |
clarkb | I think we don't use our mirrors there because system-config-run jobs put in the prod config for apt which doesn't use the mirrors | 23:42 |
clarkb | those names resolve for me so I'll try again | 23:42 |
fungi | did it run in flex? | 23:46 |
fungi | for the past week or so i've been noticing outages of 10-15 minutes for network connectivity to sjc3, both vm connectivity and to the api | 23:46 |
clarkb | fungi: yes | 23:47 |
fungi | so on the order of a few times a day | 23:47 |
clarkb | https://zuul.opendev.org/t/openstack/build/3f0de8bfa0ae4dc084f2cf46ccc26cc1/log/job-output.txt and https://zuul.opendev.org/t/openstack/build/11dbd797b7404b7d99d39821057deb90/log/job-output.txt both show raxflex nodes | 23:47 |
clarkb | fungi: where is it manifesting for the api? in nodepool logs? | 23:50 |
clarkb | I'm wondering if there is something concrete enough like api request timestamps we can bring up with them | 23:51 |
fungi | not sure, the reason i asked is that i have a personal vm in my own rackspace account i'm using as a shell server there, and when my connectivity to it starts timing out i also check whether i can get a response from the nova api, which is also typically down at the same time. in this case i had just regained access after about 10 minutes of being unreachable, and you mentioned a | 23:54 |
fungi | somewhat unlikely network connection failure right after | 23:54 |
clarkb | interesting. It does seem like something that might be worth mentioning. But maybe only if it persists? Though you say its been ongoing for a week which seems pretty persistent | 23:55 |
fungi | cloudnull doesn't seem to be hanging out in here lately, but i'll check with cardoe tomorrow to see if he knows of anything going on there for the past week or so | 23:55 |
fungi | maybe they're redoing switch cabling or something | 23:55 |
cardoe | He would. Joins the board and suddenly irc is too much for his corporate self. | 23:57 |
fungi | hah | 23:57 |