clarkb | the zuul-lb02 deployment "failed" because the handler to run sighup against haproxy when the haproxy config updated failed | 00:06 |
clarkb | this seems to be some problem with podman | 00:06 |
clarkb | zuul-lb02 podman[17437]: time="2025-02-11T00:04:23Z" level=error msg="unable to signal init: permission denied" | 00:06 |
clarkb | I've been able to reproduce it manually on the server | 00:06 |
clarkb | the service itself is up and running, it's just our ability to request graceful restarts that appears to be impacted right now | 00:07 |
clarkb | I think that is worth investigating further before proceeding though | 00:07 |
clarkb | the error occurs when I use `sudo podman kill` too so this isn't related to having docker compose sit in front of podman at least | 00:09 |
clarkb | github indicates it may be an apparmor problem | 00:10 |
clarkb | kern.log seems to confirm | 00:11 |
clarkb | It is odd to me that we don't appear to have hit this in CI either | 00:13 |
clarkb | do we disable apparmor there? | 00:14 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2040483 is the issue. I'm still skimming it but I think they only allow a subset of signals and hup is not one of them :/ | 00:21 |
clarkb | https://github.com/NeilW/common/commit/b924b9d63305c94cbc8e46154bc6c1ebc08afd09 this implies any signal should be allowed, but I don't see that rule in place on the server. Trying to figure out if we're even patched | 00:29 |
clarkb | I suspect that maybe upstream fixes that went to podman ^ in that commit differed from what ubuntu actually deployed in their packaging :/ | 00:29 |
corvus | i've seen that bug before, 1 sec | 00:36 |
corvus | clarkb: https://review.opendev.org/c/zuul/zuul/+/923084/14..15 | 00:38 |
corvus | that's the point at which a workaround for that bug was no longer needed in our canary testing | 00:38 |
corvus | i think stopping, or maybe restarting, the container was the problem then | 00:39 |
corvus | i think that is compatible with your hypothesis that they did something different (ie, neglected to handle SIGHUP) | 00:40 |
clarkb | corvus: ya I'm like 99% sure that they are allowing that list of signals in the diff I linked because regular management works. But then hup isn't in that list and the audit log shows hup specifically as the signal and podman as the peer | 00:40 |
clarkb | I've tried making a local override but I think podman is loading its own apparmor profiles when running? Which may override things? | 00:41 |
corvus | yeah, take a look at the diff i linked | 00:41 |
clarkb | however the audit log doesn't support that hypothesis. I see the initial profile load before any podman stuff runs then the manual reload I did | 00:41 |
clarkb | corvus: ya your diff appears to tell podman to select a specific profile? | 00:41 |
corvus | i think that means we'd need a new profile, and then we tell podman to use that profile in the config file | 00:41 |
corvus | yep | 00:41 |
clarkb | ya that could be | 00:42 |
clarkb | I'm going to first check if the issue is I'm trying to modify a running process that already selected its profile and I need a new process to see the newly loaded profile | 00:42 |
clarkb | I'm still not understanding why this didn't fail in CI but one step at a time | 00:42 |
corvus | ack. i agree the ci failure is worth looking into/fixing. | 00:43 |
clarkb | down then up'ing containers doesn't produce a different result | 00:43 |
clarkb | I guess I could try a reboot just to make sure there isn't some state I'm missing. Maybe apparmor_parser -r /etc/apparmor.d/podman isn't enough | 00:44 |
clarkb | oh actually the audit log says the profile is containers-default-0.57.4-apparmor1 | 00:44 |
clarkb | that probably explains why editing the podman profile doesn't help. But I don't see that profile on disk so this might be something that is dynamically loaded | 00:45 |
corvus | i believe i learned during my initial investigation that apparmor uses the profile specified in the config (or its default) when setting up a container; so editing containers.conf is almost certainly necessary | 00:45 |
clarkb | ya and containers-default-0.57.4-apparmor1 must be the built in default because I'm not finding it in /etc/apparmor.d | 00:46 |
corvus | ++ | 00:46 |
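As context for the "new profile plus containers.conf" route corvus and clarkb are weighing here, a minimal Ansible sketch of that approach might look like the following. This only illustrates the alternative that was discussed (and not ultimately taken); the drop-in path, profile name, and the assumption that this version of containers.conf honors containers.conf.d drop-ins are all mine, not details from the actual servers or change.

```yaml
# Hedged sketch: point podman at a custom AppArmor profile instead of the
# built-in containers-default one. The profile name "opendev-haproxy" and the
# containers.conf.d drop-in path are illustrative assumptions; the custom
# profile itself would still need to be written and loaded separately
# (e.g. with apparmor_parser).
- name: Point podman at a custom AppArmor profile
  ansible.builtin.copy:
    dest: /etc/containers/containers.conf.d/99-apparmor.conf
    content: |
      [containers]
      apparmor_profile = "opendev-haproxy"
    owner: root
    group: root
    mode: "0644"
```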
clarkb | fwiw sudo kill -HUP $HAPROXY_CONTROL_PID seems to work | 00:47 |
clarkb | I get a new haproxy subprocess (though I didn't confirm it changes any configuration) | 00:48 |
corvus | we could maybe move to use that and ignore the apparmor bug | 00:48 |
clarkb | ya that's what I'm wondering | 00:48 |
corvus | my gut says that will be less work and will make us less unhappy :) | 00:48 |
clarkb | ok I'll pick that up tomorrow then. I need to get this meeting agenda out and then figure out dinner | 00:50 |
clarkb | let me remove the apparmor edits I made and reload the profile again so that we're back to expected configuration on the server | 00:50 |
clarkb | ok I think I've restored everything to the way configuration management is setup to deploy it. | 00:53 |
clarkb | for my own note taking tomorrow: `sudo docker inspect --format '{{.State.Pid}}' haproxy-docker-haproxy-1` is the magic to get back the pid of the process in the container. Seems to work for both podman and docker | 01:01 |
clarkb | but ya I think we look up the pid that way and issue a hup directly and drop the use of podman kill :/ disappointing that we continue to have so many problems. Part of the issue seems to be that upstream will say these are distro issues even though they seem to load an apparmor profile out of a template in their own code base | 01:01 |
clarkb | oh btw codesearch restarted automatically and logging seems to still work and the service is functional for me | 01:03 |
clarkb | I'll also update the launchpad issue and the github issues tomorrow I guess | 01:04 |
clarkb | it looks like there may be usr1 signals getting ignored too | 01:04 |
clarkb | agenda is out and I'm off in search of food. Thanks! | 01:09 |
*** clarkb is now known as Guest8709 | 02:38 | |
opendevreview | Takashi Kajinami proposed openstack/diskimage-builder master: tox: Drop functest target from tox https://review.opendev.org/c/openstack/diskimage-builder/+/932266 | 04:23 |
*** jroll09 is now known as jroll0 | 08:45 | |
*** ralonsoh_ is now known as ralonsoh | 10:21 | |
*** ralonsoh_ is now known as ralonsoh | 14:00 | |
Guest8709 | I've apparently become a guest | 15:22 |
fungi | welcome, honored guest! | 15:22 |
*** Guest8709 is now known as clarkb | 15:23 | |
clarkb | does anyone know the correct way to escape the {{ .State.Pid }} in this ansible: command: docker inspect --format '{{.State.Pid}}' {{ docker_ps.stdout }} | 15:42 |
clarkb | maybe the raw tags are sufficient here? | 15:42 |
clarkb | I guess {{ '{{' }} produces the {{ literal | 15:43 |
clarkb | command: docker inspect --format "{{ '{{' }} .State.Pid {{ '}}' }}" {{ docker_ps.stdout }} | 15:45 |
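Spelled out as a full task, the escaping approach from the line above could look roughly like this; the task name and registered variable are illustrative rather than taken from the eventual change.

```yaml
# {{ '{{' }} and {{ '}}' }} render to literal braces, so Jinja passes the Go
# template through to docker inspect while still interpolating docker_ps.stdout.
- name: Get haproxy container PID
  command: >-
    docker inspect --format
    "{{ '{{' }} .State.Pid {{ '}}' }}"
    {{ docker_ps.stdout }}
  register: haproxy_pid
```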
jrosser | i think you can use !unsafe for this | 15:46 |
jrosser | possibly something like command: !unsafe "stuff-that-should-not-be-templated" | 15:47 |
jrosser | that would apply to the whole thing though | 15:47 |
clarkb | right I need the variable at the end to be interpolated | 15:52 |
clarkb | I could probably define {{ .State.Pid }} as an unsafe variable value then interpolate that variable in as an alternative | 15:53 |
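A sketch of that !unsafe-variable alternative, assuming a hypothetical pid_format variable name; clarkb only floats the idea here, so this is not necessarily what any real change does.

```yaml
# !unsafe marks the value so Jinja never templates it; interpolating the
# variable then drops the literal Go template into the rendered command.
- name: Get haproxy container PID
  vars:
    pid_format: !unsafe "{{ .State.Pid }}"
  command: "docker inspect --format '{{ pid_format }}' {{ docker_ps.stdout }}"
  register: haproxy_pid
```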
corvus | clarkb: since it's at the end .... consider concatenation? (or maybe supplying a YAML list of arguments to command) | 15:54 |
clarkb | corvus: I'm not sure I understand how concatenation would work. Yaml won't let me define an unsafe variable then concatenate that to a safe variable will it? | 15:55 |
corvus | clarkb: https://paste.openstack.org/show/bYxAKCu9cNjQEbokn5LG/ | 15:59 |
clarkb | TIL | 16:00 |
clarkb | though I think I like the suggestion of using yaml arguments instead | 16:00 |
clarkb | I'll get a patch up shortly | 16:00 |
corvus | yeah i like that one better | 16:00 |
corvus | something like https://paste.openstack.org/show/biOTZ61EiPvjzWZnMTNv/ | 16:02 |
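The paste itself isn't reproduced in the log, but the argv form corvus is suggesting could plausibly look like the sketch below; treat it as a guess at the idea, not a copy of the paste. The appeal is that only the single list item holding the Go template needs protecting from Jinja.

```yaml
# Each argument is its own YAML item, so !unsafe (or escaping) only has to
# cover the format string; the container reference is templated as usual.
- name: Get haproxy container PID
  command:
    argv:
      - docker
      - inspect
      - --format
      - !unsafe "{{ .State.Pid }}"
      - "{{ docker_ps.stdout }}"
  register: haproxy_pid
```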
clarkb | oh I came up with something slightly different. One sec and gerrit will have it | 16:04 |
opendevreview | Clark Boylan proposed opendev/system-config master: Perform haproxy HUP signals with kill https://review.opendev.org/c/opendev/system-config/+/941256 | 16:05 |
clarkb | corvus: ^ | 16:05 |
clarkb | I'll try to file new upstream bugs after meetings today | 16:06 |
*** elodilles_pto is now known as elodilles | 16:21 | |
clarkb | the haproxy hup updates failed due to docker rate limits (arg) but the zuul job at least seems to show that the hup changes are working | 17:30 |
clarkb | https://746cc4bed29e8f00c6ad-bb8dadf314ca143f13ef83e8dbc65d1a.ssl.cf2.rackcdn.com/941256/1/check/system-config-run-zuul/b4e1184/bridge99.opendev.org/ara-report/playbooks/7.html | 17:30 |
clarkb | it is the last three tasks in that playbook doing the hup | 17:30 |
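For anyone reading along without the change open, those three tasks presumably amount to something in this direction. This is a hedged reconstruction rather than the actual content of 941256; the container filter, task names, and variable names are guesses.

```yaml
# Sketch of a HUP-haproxy-via-kill handler chain: find the container, resolve
# the PID of its init process, then signal that PID directly so the AppArmor
# restriction on podman kill never comes into play.
- name: Find the haproxy container
  command: docker ps --quiet --filter name=haproxy
  register: docker_ps

- name: Get haproxy container PID
  command:
    argv:
      - docker
      - inspect
      - --format
      - !unsafe "{{ .State.Pid }}"
      - "{{ docker_ps.stdout }}"
  register: haproxy_pid

- name: Send SIGHUP to haproxy for a graceful reload
  command: "kill -HUP {{ haproxy_pid.stdout }}"
  become: true
```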
clarkb | so I think that is ready for review now and we'll recheck later today when docker is less likely to be a problem | 17:30 |
clarkb | but if we get that in I think we can land the change to switch zuul.o.o over to zuul-lb02 in dns | 17:31 |
clarkb | https://github.com/containers/common/issues/2321 has been reported. I'll followup on the launchpad side of things too | 18:05 |
clarkb | just one step at a time | 18:05 |
clarkb | and https://bugs.launchpad.net/ubuntu/+source/libpod/+bug/2089664 was already filed against ubuntu and that points to a PR fix that is waiting on someone with apparmor knowledge to weigh in on. Hopefully my involvement there gets things moving again | 18:22 |
clarkb | in the meantime the workaround I proposed above should work I think | 18:22 |
clarkb | after lunch I may try to ride my bike a bit. It is cold though so probably not for very long | 19:48 |
clarkb | moving faster than you can run is great when the weather is warmer than 10C | 19:48 |
clarkb | pretty miserable when colder than that | 19:49 |
tonyb | It certainly can be. Good luck | 19:50 |
fungi | python 3.14.0a5 is tagged | 20:02 |
clarkb | https://review.opendev.org/c/opendev/system-config/+/941256 passes now. I think if people are happy with that workaround and approve the change we can go ahead and approve the dns update for zuul.o.o after it lands | 20:28 |
clarkb | https://zuul-lb02.opendev.org is up and running if you want to test general functionality | 20:29 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 21:16 |
clarkb | reviewing gerrit ESC meeting notes it looks like they have a plan for improving git diff caching. The proposal is to use a diff cache directly in the git repo and stop using h2 for that | 21:19 |
clarkb | I wonder if that is something that git already has built in | 21:20 |
clarkb | is anyone willing and able to review https://review.opendev.org/c/opendev/system-config/+/941256 ? If this wasn't a "podman has a bug and I'm working around it" situation I'd probably proceed to keep the noble node rollout moving but this is worth having at least another set of eyes on I think | 21:24 |
clarkb | I think I've decided it is too cold outside and will exercise indoors at some point so happy to monitor and approve etc | 21:25 |
opendevreview | Clark Boylan proposed zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 21:46 |
corvus | clarkb: +2 on 256 | 23:11 |
corvus | i almost thought about suggesting a shell script, but we might appreciate the extra debuggability of individual tasks there. lgtm. | 23:12 |
clarkb | ok I'm going to approve that now then and cross my fingers docker is happy | 23:15 |
fungi | ah, you meant 941256 | 23:24 |
fungi | i guess i double-approved it | 23:24 |
clarkb | arg it already bounced off of docker rate limits. It's frustrating that the process to get off docker hub relies so heavily on docker hub | 23:40 |
clarkb | oh wait no it failed trying to fetch some package | 23:41 |
clarkb | Temporary failure resolving 'us.archive.ubuntu.com' and the other job failed to update apt cache for likely the same reason | 23:41 |
clarkb | I think we don't use our mirrors there because system-config-run jobs put in the prod config for apt which doesn't use the mirrors | 23:42 |
clarkb | those names resolve for me so I'll try again | 23:42 |
fungi | did it run in flex? | 23:46 |
fungi | for the past week or so i've been noticing outages of 10-15 minutes for network connectivity to sjc3, both vm connectivity and to the api | 23:46 |
clarkb | fungi: yes | 23:47 |
fungi | so on the order of a few times a day | 23:47 |
clarkb | https://zuul.opendev.org/t/openstack/build/3f0de8bfa0ae4dc084f2cf46ccc26cc1/log/job-output.txt and https://zuul.opendev.org/t/openstack/build/11dbd797b7404b7d99d39821057deb90/log/job-output.txt both show raxflex nodes | 23:47 |
clarkb | fungi: where is it manifesting for the api? in nodepool logs? | 23:50 |
clarkb | I'm wondering if there is something concrete enough like api request timestamps we can bring up with them | 23:51 |
fungi | not sure, the reason i asked is that i have a personal vm in my own rackspace account i'm using as a shell server there, and when my connectivity to it starts timing out i also check whether i can get a response from the nova api, which is also typically down at the same time. in this case i had just regained access after about 10 minutes of being unreachable, and you mentioned a | 23:54 |
fungi | somewhat unlikely network connection failure right after | 23:54 |
clarkb | interesting. It does seem like something that might be worth mentioning. But maybe only if it persists? Though you say its been ongoing for a week which seems pretty persistent | 23:55 |
fungi | cloudnull doesn't seem to be hanging out in here lately, but i'll check with cardoe tomorrow to see if he knows of anything going on there for the past week or so | 23:55 |
fungi | maybe they're redoing switch cabling or something | 23:55 |
cardoe | He would. Joins the board and suddenly irc is too much for his corporate self. | 23:57 |
fungi | hah | 23:57 |