cardoe | I’ve raised it to them. | 00:18 |
---|---|---|
cardoe | My dev boxes are behaving weird too | 00:18 |
fungi | thanks. we can probably extract some exact timestamps since we have nodepool interacting with the api fairly continuously | 00:20 |
fungi | if they want anything to correlate | 00:20 |
cardoe | Yeah that’s what they are bugging me for. | 00:29 |
cardoe | I’m a horrible user cause I just said I dunno. You figure it out. | 00:29 |
fungi | yeah, i worked support at a service provider for many years... you're "that guy" ;) | 00:31 |
fungi | lemme see what i can dig up | 00:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Mirror docker tool images https://review.opendev.org/c/opendev/system-config/+/941304 | 00:35 |
clarkb | corvus: ^ fyi. I didn't do debian:testing because that seems like one that we might be able to find a different image for or at least standardize on the "base image for testing random container things" | 00:35 |
clarkb | I didn't want to assume that debian:testing is the image that makes the most sense for that off the bat | 00:35 |
clarkb | please double check I got all of the names and tag values right | 00:36 |
*** dan_ is now known as dan_with | 00:37 | |
clarkb | I got distracted but I'm realizing I should've enqueued the sighup fix directly to the gate. Oh well | 00:37 |
clarkb | at this point it's mostly done in check I think | 00:37 |
fungi | cardoe: these are utc timestamps where nodepool was unable to get a server list from the nova api in flex sjc3: https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/ | 01:04 |
clarkb | if the sighup fixup lands I'll have to make some noop changes to docker-compose.yaml and/or the haproxy config on the servers by hand then see if tomorrow's daily run triggers the handler successfully (though it should be covered in CI so I'm not too worried about it) | 01:05 |
corvus | clarkb: i agree. maybe apache httpd or something? | 01:30 |
Clark[m] | ++ | 02:02 |
opendevreview | Merged opendev/system-config master: Perform haproxy HUP signals with kill https://review.opendev.org/c/opendev/system-config/+/941256 | 02:15 |
fungi | yay | 02:15 |
Clark[m] | The deploy jobs were successful though they should noop. Anyway tomorrow we can update DNS and then I'll edit the docker compose files so that periodic jobs flip them back and emit the signals | 03:35 |
cloudnull | o/ | 04:02 |
clarkb | welcome back | 04:18 |
clarkb | I'm not sure I have any real useful debugging info. fungi had that and provided https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/ tl;dr seems to be odd network connectivity issues to cloud resources, from cloud resources and to the cloud api in rax flex | 04:19 |
cloudnull | hey clarkb - do you or fungi have anything on what was happening when the issue was observed? | 04:29 |
cloudnull | I also have a new region for you all to beat up if interested? | 04:30 |
clarkb | cloudnull: I mentioned that two test jobs failed because they couldn't look up ubuntu package repo servers in DNS. Fungi then asked if it happened with raxflex because he apparently saw connectivity issues to a personal server in the region as well as problems getting to the nova api | 04:30 |
clarkb | definitely interested | 04:30 |
clarkb | so basically I saw outbound connectivity have a disruption and fungi noticed inbound connectivity issues | 04:31 |
clarkb | the dns requests would've been over port 53 (likely udp) and not using dns over tls or dns over https | 04:31 |
cloudnull | was that all today? | 04:31 |
clarkb | cloudnull: yes | 04:32 |
clarkb | we configure local unbound to resolve against 1.1.1.1 and 8.8.8.8 | 04:32 |
cloudnull | sounds good, we'll look into it. | 04:32 |
clarkb | (we use the ipv6 equivalents in clouds with ipv6 but rax flex doesn't yet iirc) | 04:32 |
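For reference, a minimal sketch of that kind of unbound forwarding setup, assuming a conf.d-style drop-in (the file path is a guess; the resolver addresses are the ones mentioned above):

```
# hypothetical /etc/unbound/unbound.conf.d/forwarders.conf
forward-zone:
    name: "."
    forward-addr: 1.1.1.1
    forward-addr: 8.8.8.8
```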
cloudnull | I have no idea what might have happened, but we'll find out :D | 04:32 |
cloudnull | yeah, no IPv6 yet, but soon-tm | 04:32 |
clarkb | cloudnull: the timestamps in that paste would've been from nodepool logs accessing cloud apis for that cloud region | 04:33 |
cloudnull | I wanted to get it done sooner, but it just didn't happen ... | 04:33 |
cloudnull | +1 | 04:33 |
clarkb | and I think those are timestamps for failures to reach the api from nl01.opendev.org (hosted in rax classic dfw) | 04:33 |
cloudnull | https://keystone.api.dfw3.rackspacecloud.com is the new region, you should be able to get access to it exactly the same as SJC | 04:35 |
clarkb | that should be straightforward on our end. Just add a new region to clouds.yaml and then spin things up | 04:35 |
cloudnull | 👍 | 04:36 |
cloudnull | Is there a nodepool PR I need to push for that to happen? | 04:37 |
clarkb | nope I think we should be able to do it from our end | 04:38 |
clarkb | we need to update clouds.yaml contents, then add the region to our "cloud launcher" which sets up keys and security groups and so on, then we boot a mirror and then finally add the region to nodepool. We can ask questions if we run into anything unexpected in that process but I suspect it will be similar to sjc | 04:39 |
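A hedged sketch of the clouds.yaml side of that; the cloud name, region name, and credentials below are illustrative placeholders, not opendev's actual config:

```yaml
# hypothetical clouds.yaml fragment for the new region
clouds:
  raxflex-dfw3:
    region_name: DFW3          # assumed region identifier
    auth:
      auth_url: https://keystone.api.dfw3.rackspacecloud.com
      username: example-user   # placeholder credentials
      password: example-secret
      project_name: example-project
```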
cloudnull | sounds good. | 04:41 |
cloudnull | RE: networking tomfoolery, I've got some threads going - likely won't have an answer this evening but I'll keep you posted as we work through it. | 04:41 |
Clark[m] | No rush it's getting late | 05:45 |
opendevreview | QianBiao Ng proposed openstack/diskimage-builder master: Fix: ramdisk image build issues https://review.opendev.org/c/openstack/diskimage-builder/+/755056 | 10:58 |
karolinku | Hello folks. Question about glean and startup environment. I'm working on keyfiles support for C10. I noticed that most distros manually populate their networking files (like ifcfg) based on the config drive, but in the case of keyfiles, the NetworkManager documentation says it is not recommended to edit them manually. So can nmcli be used for setting up the config? | 12:17 |
fungi | karolinku[m]: i guess my only concern would be getting the sequencing correct, since glean can write config files at any time but nmcli may not be able to interact with interfaces until later during the boot process | 14:05 |
Clark[m] | I think there have been issues with network manager starting before config is in place too. I want to say that was the genesis of problems around failed DHCP (or was it RAs?) leading to the interface being improperly configured. Git logs in glean or dib probably have details | 14:15 |
fungi | yeah, i guess the trick would be convincing nm to "do nothing" until instructed by glean | 14:18 |
Clark[m] | I think it had to do with network manager marking interfaces down if default DHCP failed on startup (and some clouds don't do DHCP) | 14:19 |
Clark[m] | Ya either you configure it first or configure it to idle | 14:19 |
karolinku[m] | so I understand that both options are possible here - glean writing config prior to and after nm starting | 14:25 |
Clark[m] | Well previously it caused bugs if glean ran after | 14:26 |
Clark[m] | That may not be an issue on centos 10 I'm not sure. | 14:26 |
karolinku[m] | but nmcli can modify config dynamically | 14:30 |
karolinku[m] | so even if it starts with default config, glean can overwrite it later. unless it puts the interface down... | 14:32 |
fungi | problem was nm would automatically start trying to dhcp and slaac configure interfaces before glean was able to tell it not to | 14:33 |
fungi | ianw likely remembers the gritty details, i think he did a bunch of the troubleshooting for that when fedora first switched to nm | 14:35 |
karolinku[m] | the other dirty hack which came to my mind is to create an ifcfg file in c10, then use "nmcli migrate" to transform it into keyfile format | 14:36 |
clarkb | can we just write a keyfile? is the format stable and easy to work with? | 14:37 |
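For reference, the keyfile format is a small ini-style file; a minimal static-IPv4 sketch (interface name and addresses are illustrative placeholders):

```ini
# hypothetical /etc/NetworkManager/system-connections/ens3.nmconnection
# (must be owned by root with mode 0600 or NetworkManager ignores it)
[connection]
id=ens3
type=ethernet
interface-name=ens3

[ipv4]
method=manual
address1=203.0.113.10/24,203.0.113.1
dns=1.1.1.1;8.8.8.8;

[ipv6]
method=ignore
```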
clarkb | https://review.opendev.org/c/opendev/glean/+/688031 I think this is related to the issue I mentioned | 14:39 |
clarkb | in any case you'll have to figure out what works and makes sense | 14:39 |
clarkb | ultimately the actual order of things and the tools used don't matter too much as long as they work consistently | 14:40 |
karolinku[m] | doable, but not recommended as they wrote here https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10-beta/html/configuring_and_managing_networking/networkmanager-connection-profiles-in-keyfile-format#networkmanager-connection-profiles-in-keyfile-format | 14:41 |
clarkb | infra-root if https://zuul-lb02.opendev.org/ looks good to you I'd like to proceed with https://review.opendev.org/c/opendev/zone-opendev.org/+/941167 I'll probably approve the dns change myself after breakfast and morning things regardless as the lb seems to work for me | 14:41 |
clarkb | karolinku[m]: note that warning is for manual editing by hand | 14:42 |
clarkb | karolinku[m]: it's the same reason you shouldn't edit sudoers manually but instead use visudo | 14:42 |
clarkb | in the case of glean it would be a programmatic tool that always emits the same content that can be tested too | 14:42 |
clarkb | (thus avoiding inadvertent typos) | 14:42 |
clarkb | basically I don't read that warning to say you shouldn't edit the files directly. I read it as saying you shouldn't just edit them manually by hand in a one off as it is easy to introduce bugs. That is different to editing them programmatically | 14:43 |
karolinku[m] | I understand it as those keyfiles are just a "store" of config, and the default way of modifying it is nmcli | 14:45 |
karolinku[m] | using nmcli the user just doesn't have to think about what section should be used or option chosen | 14:46 |
clarkb | "Typos or incorrect placements of parameters can lead to unexpected behavior. Therefore, do not manually edit or create NetworkManager profiles." this can be avoided entirely by using a programmatic tool like nmcli or glean | 14:46 |
clarkb | yes that is true with glean too. With glean the only input should be the config drive | 14:46 |
karolinku[m] | I'm just not a networking expert, so I don't even know if I'd be able to write a proper config in fact | 14:46 |
karolinku[m] | because I'm the human parser between config drive and keyfile 😅 | 14:48 |
clarkb | I'm not saying don't use nmcli. I'm more saying I don't think that networkmanager is actually saying you must use nmcli in this case | 14:49 |
fungi | well, i expect the same could be said for writing a proper nm-interfacing script. you need the same degree of network knowledge either way, it's just a question of how you express what you want | 14:49 |
clarkb | but yes you have to understand how this works because you have to teach glean how it works | 14:49 |
clarkb | the degree of understanding may vary based on additional abstractions and tools employed to simplify things | 14:49 |
fungi | though maybe my years as a network engineer have made me cynical of the people who write network tools | 14:49 |
karolinku[m] | wrt config drive - for learning purposes I used https://opendev.org/opendev/glean/src/branch/master/glean/tests/fixtures/vexxhost/mnt/config/openstack/latest/network_data.json | 14:52 |
karolinku[m] | how is it generated for vms? | 14:52 |
clarkb | karolinku[m]: openstack nova generates the data and attaches it to the VMs either as an iso or fat32 filesystem iirc | 14:53 |
clarkb | the filesystem has a specific label that glean looks for and mounts to /mount/config or something like that. Then it loads the json data off of it. The example data you link to is sanitized from a real cloud instance in that cloud provider | 14:54 |
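The general shape of that network_data.json follows the OpenStack network metadata format; a trimmed, illustrative sketch (all values made up, see the linked fixture for a real example):

```json
{
  "links": [
    {"id": "tap0", "type": "phy",
     "ethernet_mac_address": "fa:16:3e:00:00:01", "mtu": 1500}
  ],
  "networks": [
    {"id": "network0", "link": "tap0", "type": "ipv4",
     "ip_address": "203.0.113.10", "netmask": "255.255.255.0",
     "routes": [{"network": "0.0.0.0", "netmask": "0.0.0.0",
                 "gateway": "203.0.113.1"}]}
  ],
  "services": [{"type": "dns", "address": "8.8.8.8"}]
}
```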
fungi | note that we interface with cloud providers running varying vintages of openstack, and what you can expect from configdrive's network metadata has changed over time | 14:56 |
clarkb | I think for the stuff opendev cares about it's been fairly stable compared to what is in our example test data | 14:56 |
clarkb | this may not be true for ironic's use cases as they are more complicated with vlans and bonds | 14:57 |
karolinku[m] | ok, thanks for the tips! | 15:15 |
mnaser | lol, what are the crazy odds two different changes get the same change id.. https://review.opendev.org/q/Ic8aaa0728a43936cd4c6e1ed590e01ba8f0fbf5b | 15:16 |
clarkb | it's common for backports because people cherry pick them and carry the change id over | 15:17 |
mnaser | yeah but in that case it's two separate changes + their backports lol | 15:17 |
fungi | yeah, change ids are only required to be unique for a specific project+branch combo | 15:17 |
clarkb | oh neat I see its one change to nova + backports and a different one to manila + backports | 15:18 |
fungi | i have a feeling that's not random chance, and rather that one of them copied the change id from the other | 15:18 |
clarkb | ya random=$({ git var GIT_COMMITTER_IDENT ; echo "$refhash" ; cat "$1"; } | git hash-object --stdin) is how change id gets generated. $refhash is git rev-parse HEAD | 15:19 |
clarkb | just GIT_COMMITTER_IDENT should be sufficient to produce different hashes here since these are different committers | 15:20 |
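Cleaned up, that recipe is roughly the following sketch of gerrit's commit-msg hook logic (not a verbatim copy of the hook):

```shell
# "$1" is the commit message file the commit-msg hook receives
refhash=$(git rev-parse HEAD)
{
  git var GIT_COMMITTER_IDENT  # differs per committer, so a collision implies copying
  echo "$refhash"
  cat "$1"
} | git hash-object --stdin    # the resulting sha-1 becomes the Change-Id
```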
mnaser | i guess technically with that we can find out who's the real holder of the change id :P | 15:23 |
fungi | sha-1 is not immune to collisions, but the chances are more remote than the chance that some human copied a commit message and started editing it but didn't wipe the old id footer | 15:23 |
mnaser | would have been a fun anomaly if that was the case :) | 15:24 |
fungi | cloudnull: back to yesterday's discussion, i've been observing network outages in sjc3 that last around 10-15 minutes almost every day (sometimes more than once a day) for the past week or so, and when it happens both virtual machines and the nova api are unreachable from everywhere i try, so guessing it's something upstream of the entire cloud checking out at random | 15:26 |
fungi | routes falling out of the igp? overloaded state tracking in some middlebox? whatever | 15:27 |
clarkb | ok zuul-lb02 continues to work for me. I'm going to approve the change to update dns now | 15:56 |
clarkb | worst case only the zuul dashboard becomes unreachable and we have to hit them directly with /etc/hosts overrides while we revert | 15:57 |
clarkb | so impact should be small if anything does go wrong | 15:57 |
clarkb | then once that is done I'll figure out loading extra permissions so that I can force merge the buildx fix in zuul-jobs so that we can fix nodepool testing so that we can fix dib testing | 15:58 |
clarkb | oh and then try again to deploy codesearch02 I guess | 15:58 |
clarkb | I need to rebase the codesearch dns update after the zuul.o.o dns update | 15:59 |
opendevreview | Merged opendev/zone-opendev.org master: Switch zuul.o.o to zuul-lb02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941167 | 16:00 |
clarkb | ok already found a minor-ish problem :/ haproxy logging is set up with `log /dev/log local0` in the haproxy config. I think this sets up a socket for syslog logging. It appears that there may have been permissions issues for that under podman and/or noble so we aren't logging like we expect to | 16:08 |
clarkb | the service itself is working though | 16:08 |
clarkb | ok rsyslog is configured to set up a listen socket at /var/haproxy/dev/log which is a socket on lb01 but is a regular dir on lb02. I suspect there may be some startup race | 16:14 |
clarkb | rsyslogd: cannot create '/var/haproxy/dev/log': Address already in use <- that seems to be the case here. Looking at the ansible it isn't clear to me how this happened as we should restart rsyslog with the new config before ever creating that path | 16:17 |
clarkb | anyway what I'd like to do is stop the haproxy service (so the bind mount is stopped), move that log dir aside, restart rsyslogd then start haproxy again | 16:18 |
clarkb | this will be noticeable as a blip in zuul web service. Any concern with that cc infra-root corvus | 16:18 |
clarkb | if we prefer we can revert the dns update and do this surgery then update dns again. Maybe that is better | 16:18 |
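Spelled out, the surgery described above would be roughly the following (the compose file location is an assumption):

```shell
# hedged sketch of the recovery steps on zuul-lb02
docker-compose -f /etc/haproxy-docker/docker-compose.yaml down   # stop haproxy, releasing the bind mount
mv /var/haproxy/dev/log /var/haproxy/dev/log.bak                 # set the bogus directory aside
systemctl restart rsyslog                                        # let rsyslog recreate the socket
docker-compose -f /etc/haproxy-docker/docker-compose.yaml up -d  # bring haproxy back up
```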
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02" https://review.opendev.org/c/opendev/zone-opendev.org/+/941465 | 16:22 |
fungi | seems like it should be pretty quick | 16:22 |
fungi | no objection from me | 16:22 |
fungi | i doubt anyone will spot the blip, worst case they'll click reload in their browsers | 16:22 |
clarkb | fungi: ya my concern is that the bind mount for /var/haproxy/dev/log:/dev/log is what is converting the socket device to a directory | 16:23 |
clarkb | so mostly concerned I'll do all that work and be right back where I started so reverting to lb01 for now seems better | 16:23 |
fungi | it's a reasonable theory, i agree | 16:24 |
clarkb | if that is the case I feel like that is a major bug in podman. At the very least it should error rather than silently deleting content in your filesystem and replacing it with something different | 16:24 |
clarkb | I found the logs from the original rsyslog config update and restart and there were no complaints there. But I did a restart just now and it complains about not being able to change an existent file system entry | 16:24 |
clarkb | that implies it worked at first and then podman broke it (or haproxy I suppose) | 16:25 |
clarkb | anyway dns update is approved | 16:25 |
fungi | it's a fifo we pre-create right? | 16:25 |
clarkb | $AddUnixListenSocket /var/haproxy/dev/log | 16:25 |
clarkb | its a unix domain socket | 16:25 |
fungi | ah okay | 16:25 |
fungi | not a named pipe | 16:25 |
clarkb | anyway one step at a time. First undo dns update, then stop haproxy, clean up the filesystem and try again gathering data as we go | 16:27 |
opendevreview | Merged opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02" https://review.opendev.org/c/opendev/zone-opendev.org/+/941465 | 16:28 |
clarkb | ok my local resolution flipped over to lb01 and the ttl has reset back to 300 after the first lb01 lookup. I think its safe to start operating on lb02 so will do so now | 16:36 |
corvus | just acking that i'm caught up and agree. | 16:37 |
clarkb | rsyslogd[56508]: rsyslogd: cannot create '/var/haproxy/dev/log': Permission denied | 16:38 |
clarkb | and now that I see ^ in the log I see that this appears to have happened when we deployed previously. The logs I saw earlier that looked clean must've been from earlier node setup | 16:39 |
fungi | so rsyslog was started when the permissions didn't allow creating the socket, then podman saw the thing being bindmounted didn't exist so helpfully created a directory there, then later when permissions got fixed haproxy couldn't create the socket because there was already a directory with the same name? | 16:40 |
clarkb | that's half right. permissions haven't gotten fixed and it is still failing on this. And drumroll please the culprit is apparmor | 16:41 |
* fungi is shocked. SHOCKED | 16:41 | |
clarkb | apparmor="DENIED" operation="mknod" class="file" profile="rsyslogd" name="/var/haproxy/dev/log" | 16:41 |
clarkb | I'm off to figure out if I can add local config to apparmor to allow this to happen | 16:41 |
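On Ubuntu the shipped profiles source a local override file, so one hedged possibility would be something like:

```
# hypothetical addition to /etc/apparmor.d/local/usr.sbin.rsyslogd
/var/haproxy/dev/log rwl,

# then reload the profile:
#   apparmor_parser -r /etc/apparmor.d/usr.sbin.rsyslogd
```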
* fungi sighs | 16:41 | |
corvus | the "creating missing directory" part is standard docker behavior | 16:41 |
corvus | (as in, dockerd would have done that too, not just podman) | 16:42 |
clarkb | ya in "positive" news this means podman isn't deleting anything it was never there in the first place so I'm happier about that | 16:42 |
clarkb | corvus: ack | 16:42 |
clarkb | and also this is a generic noble problem and doesn't have to do with podman or docker | 16:42 |
corvus | yeah i guess this is a "noble rsyslog upgrade" problem? | 16:42 |
fungi | because nobody uses rsyslog any more now that systemd-journald is a thing, i guess | 16:43 |
clarkb | corvus: yup | 16:43 |
corvus | yeah, totally our fault | 16:44 |
fungi | https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/1826294 maybe? | 16:45 |
clarkb | /var/lib/*/dev/log rwl, <- this rule already exists. Do we have a preference to move the haproxy syslog device into /var/lib/haproxy/dev/log or do we want a local rule to allow rwl against /var/haproxy/dev/log ? | 16:46 |
corvus | i'm fine bending to the will of the fhs | 16:47 |
fungi | were we explicitly creating /var/haproxy/dev/log previously? | 16:48 |
fungi | s/creating/configuring/ | 16:48 |
clarkb | fungi: yes that is what rsyslog configuration is set to create and is what we bind mount | 16:48 |
fungi | okay, but it's our own configuration choice sounds like, not a package-supplied default | 16:48 |
clarkb | yes this is all containers and our choice | 16:48 |
fungi | so yes, i agree moving the socket to where the shipped apparmor rules expect it makes sense | 16:49 |
corvus | yeah, we were trying to keep all the docker references together rather than bind-mounting things from all over the filesystem -- to try to make it more clear what's going on | 16:49 |
clarkb | corvus: that is my next question should we move everything? I think that is fairly easy for haproxy because it is so stateless | 16:49 |
fungi | unless we want to end up carting around an ever increasing amount of apparmor configuration too, i guess | 16:49 |
corvus | but i think that's not an important enough goal to push back against existing config | 16:49 |
clarkb | I think we can have ansible write out everything to /var/lib/haproxy and then we just go and clean up /var/haproxy when done | 16:49 |
corvus | clarkb: that works for me | 16:50 |
corvus | so we'd have /var/lib/haproxy/run and .../etc | 16:50 |
clarkb | yes | 16:50 |
fungi | wfm | 16:50 |
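As a sketch, the compose bind mounts after the move might look like this (the image tag and container-side paths are assumptions):

```yaml
# hypothetical docker-compose fragment after moving under /var/lib/haproxy
services:
  haproxy:
    image: docker.io/library/haproxy:lts   # assumed tag
    network_mode: host
    volumes:
      - /var/lib/haproxy/etc:/usr/local/etc/haproxy
      - /var/lib/haproxy/run:/var/lib/haproxy/run
      - /var/lib/haproxy/dev/log:/dev/log  # the rsyslog-created socket
```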
clarkb | I'll work on that and adding tests to ensure we're creating things. Though it still isn't clear to me if our test nodes have apparmor in an enforcing state | 16:51 |
corvus | signs point to no on that one, eh? | 16:51 |
clarkb | corvus: ya based on docker-compose kill / podman kill working on noble in CI | 16:51 |
fungi | well, also based on this not getting caught in testing | 16:52 |
clarkb | we don't explicitly test the haproxy logging I don't think but maybe we do and if so then ya | 16:52 |
clarkb | anyway I'll pause effort on lb02 and start working on a change to see how ugly this is. I don't expect it to be too bad even for the existing gitea and zuul proxies | 16:53 |
clarkb | I guess I should update the rsyslog config and double check it can create the socket at that location after all | 16:54 |
clarkb | I'll do that too | 16:54 |
clarkb | yup that works | 16:55 |
opendevreview | Merged zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 17:01 |
fungi | gonna run some lunch errands, but should be back in about an hour | 17:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move haproxy config into /var/lib/haproxy https://review.opendev.org/c/opendev/system-config/+/941468 | 17:15 |
clarkb | the more I think about ^ the more I'm concerned that we might have unexpected fallout from the existing servers like with the old rsyslog socket path no longer being valid in the time between restarting rsyslog and haproxy | 17:15 |
clarkb | but I figure this is good for a first pass and others can review it and add concerns if they have them. I think absolute worst case we'll be able to manually undo bits and pieces here and there to get it up and running again if necessary | 17:16 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add codesearch02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941140 | 17:18 |
clarkb | if CI is happy with ^ I'm going to go ahead and approve it in order to reduce the amount of rebase conflicts we have. The change to add codesearch02 is passing now so I think having it in dns is a good idea | 17:19 |
opendevreview | Merged opendev/zone-opendev.org master: Add codesearch02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941140 | 17:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Collect apparmor status in ci jobs https://review.opendev.org/c/opendev/system-config/+/941471 | 17:35 |
clarkb | debugging the apparmor behavior seems like it may be important as we move more services to noble ^ that is a starting point | 17:36 |
clarkb | arg but editing that file runs all of our jobs... | 17:36 |
corvus | is that bad? :) | 17:37 |
clarkb | well it will eat docker hub quota but it won't prevent us from getting useful apparmor status info back so I guess its good in that regard | 17:37 |
corvus | clarkb: re restarting: seems like worst case is a brief outage that is just corrected by restarting haproxy? | 17:39 |
clarkb | corvus: yes I think so especially since haproxy seems to continue (logs a warning to stdout) if its main logging location isn't available as we saw on zuul-lb02 | 17:40 |
clarkb | basically I think worst case we identify whatever it was that didn't get transitioned properly then manually restart. Shouldn't be a huge impact if it happens and CI should confirm that things generally work | 17:40 |
clarkb | its just the transition that we don't test | 17:40 |
clarkb | I'm going to take a break while we wait on results from all the things | 17:42 |
clarkb | looks like the zuul job passed with the haproxy refactor which is a good sign that it works with the shift (again the main concern is the transition as we don't test that). Let me know what y'all think | 18:11 |
clarkb | we're also starting to get back apparmor_status info but none of the noble host jobs have run yet so don't have data to compare there yet | 18:11 |
corvus | refactor has a +2 from me | 18:12 |
clarkb | fungi: maybe you can take a look after lunch too? I should be able to be around all day today to help monitor and debug/fix should the transition be problematic | 18:27 |
fungi | yeah, looking now | 18:32 |
clarkb | system-config-run-base has empty apparmor_status output and errors that the apparmor_status command isn't found. But bridge does have apparmor status. This implies to me that we start out with no apparmor then as we deploy things we install apparmor. I'm wondering if we need to reboot for apparmor enforcement to take effect. Should know more once paste/zuul/grafana jobs complete | 18:34 |
fungi | `apt-file search bin/apparmor_status` tells me that the apparmor package ships a /usr/sbin/apparmor_status file | 18:35 |
fungi | so sounds like something is pulling in apparmor automatically as a dependency | 18:36 |
clarkb | looks like it comes from install-docker via upstream.yaml on !noble and on noble it must be a podman dependency? | 18:37 |
clarkb | but maybe it isn't a podman dependency so we don't actually get it on noble nodes in ci. If that is the case we can add apparmor as a dep to better align with the old docker system and what we appear to get in prod images | 18:38 |
clarkb | https://zuul.opendev.org/t/openstack/build/9bb4f267b2934e9fae08277c73f209a3/log/paste99.opendev.org/apparmor_status.txt ok I suspect that is what is going on. For all the old docker stuff we installed apparmor but now we don't for podman | 18:40 |
clarkb | I'm going to look at deps for podman packaging but I'll propose an update to install-docker to install apparmor for podman too | 18:40 |
clarkb | ya there is no direct dependency. I'm guessing that the ubuntu stance is that they just install apparmor by default so things don't have to strictly depend on it even when they have integration? | 18:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Install apparmor when installing podman https://review.opendev.org/c/opendev/system-config/+/941471 | 18:46 |
clarkb | that should fail on the zuul-lb thing | 18:46 |
clarkb | depending on how we want to do stuff we could put the haproxy refactor on top of ^ to confirm things are happy afterwards then reverse the order to make things mergable | 18:47 |
fungi | interestingly, apparmor has "optional" priority, not essential or important | 18:47 |
fungi | so i suspect it's just included as part of their default base package selection at image build time | 18:48 |
clarkb | ya | 18:48 |
clarkb | it honestly might not be a terrible idea for us to ensure we include it in our images too since that is what ubuntu in the wild operates with | 18:48 |
clarkb | that might break a lot of stuff but maybe that is for the best? could possibly make the change only in the zuul-launcher images and then that can be part of migrating to the new images | 18:48 |
fungi | agreed, though it could be disruptive for jobs that aren't already dragging it in | 18:48 |
clarkb | corvus: ^ not sure if you want to include that in the list of things to test | 18:48 |
clarkb | noble in particular pulls in apparmor 4.0 which includes rules for a lot more stuff (this is what got podman for example) | 18:49 |
clarkb | rsyslog seems to be new rules too | 18:49 |
clarkb | feels a bit like I'm spinning my wheels on this stuff but this has all been good to learn and that was the primary goal. Ensuring we're confident in our podman deployment | 18:50 |
clarkb | less of a success replacing services more of a success making our podman better | 18:50 |
clarkb | one thing to note: landing 941471 might be difficult with the way docker rate limits work | 18:52 |
clarkb | we may need to do our best effort to ensure we get coverage across as many things as possible then reduce the testing we run before merging it | 18:52 |
corvus | clarkb: let's exercise how easy it is to make new images and just make a dedicated apparmor image; i think i'd like to keep our current images so that we're still able to test. if the apparmor image works, we can switch over to it. | 18:58 |
clarkb | corvus: that seems reasonable | 18:59 |
clarkb | currently we add packages to the infra package needs list but that would be shared by the two images so we need another mechanism. maybe just another small element with additional package deps | 18:59 |
corvus | (if you wanted to go crazy and define an image-validation pipeline with a job for an apparmor image... technically that should work but it's untested in prod :) | 19:01 |
corvus | clarkb: maybe an element that gets packages from an env var, and we set that env var in the job? that would be nice and general purpose for easy changes like this | 19:01 |
clarkb | ++ | 19:02 |
clarkb | it's possible dib already has something like that too | 19:02 |
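A minimal sketch of such an element; the element name, hook path, and DIB_EXTRA_PACKAGES variable are all assumptions:

```shell
#!/bin/bash
# hypothetical elements/extra-packages/install.d/10-extra-packages
set -eu
if [ -n "${DIB_EXTRA_PACKAGES:-}" ]; then
    # install-packages is dib's distro-agnostic package install helper
    install-packages ${DIB_EXTRA_PACKAGES}
fi
```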
clarkb | https://zuul.opendev.org/t/openstack/build/5d7a88d99c9a4c09bc9ffea273bd8d23/log/paste99.opendev.org/apparmor_status.txt shows that installing apparmor on noble as part of podman installation looks like production | 19:23 |
clarkb | so ya now we have a decision to make on whether or not we take clarkb's manual testing as sufficient and move forward with haproxy updates or try to land 941471 first | 19:23 |
clarkb | just note that 941471 is likely to be difficult to land normally due to how many docker consuming jobs it runs | 19:24 |
dpanech | Hi, zuul jobs are not starting on this review, could someone help? https://review.opendev.org/c/starlingx/helm-charts/+/940993 | 19:32 |
clarkb | dpanech: hit refresh | 19:36 |
fungi | fastest fix ever ;) | 19:36 |
dpanech | huh awesome, did you do something or was it just taking longer than usual? | 19:49 |
clarkb | I didn't do anything | 19:49 |
clarkb | dpanech: did you check the zuul status page? | 19:49 |
fungi | my guess is that if you went to the zuul status dashboard you'd have seen it in a "queued" state for about an hour while some provider was timing out trying to satisfy a node request for it | 19:50 |
fungi | since it doesn't look from the usage graphs like we were maxxed out on available quota | 19:51 |
dpanech | you mean this status page? https://zuul.opendev.org/t/openstack/builds | 19:52 |
fungi | the build result reports that the job itself only took a couple of minutes to run, but didn't start running for about an hour after gerrit would have sent the event | 19:52 |
clarkb | dpanech: https://zuul.opendev.org/t/openstack/status this one | 19:52 |
fungi | dpanech: https://zuul.opendev.org/t/openstack/status | 19:52 |
clarkb | builds shows you a historical view on what has run and completed. status shows you the live status | 19:52 |
dpanech | ok good to know, thanks | 19:53 |
clarkb | I'm going to eat lunch now. fungi I guess let me know what you think of https://review.opendev.org/c/opendev/system-config/+/941468 | 19:53 |
fungi | clarkb: huh, that change ran fewer jobs than i anticipated | 19:53 |
clarkb | fungi: it ran jobs for the two services that use haproxy | 19:55 |
clarkb | the change that installs apparmor and collects apparmor_status info runs against everything but is separate | 19:55 |
fungi | yeah, for some reason i thought we had more | 19:55 |
clarkb | ah | 19:55 |
frickler | fyi https://blog.codeberg.org/we-stay-strong-against-hate-and-hatred.html | 19:59 |
fungi | somehow 941471 only failed 2 jobs | 20:26 |
fungi | one on a docker rate limit, the other looks like the executor got disconnected from a test node mid-test | 20:27 |
clarkb | amazing. But also thats a good indication it doesn't break anything too bad | 20:28 |
fungi | yes, my takeaway as well basically | 20:28 |
clarkb | might be worth a recheck to ensure those two jobs succeed then we can think about direct enqueuing to the gate or even force merging if we want to get it in | 20:28 |
clarkb | it occurred to me that the load balancer apparmor problems may not cause that change to fail because we don't actually assert anything about those logs. My change to move things in haproxy does add those asserts though | 20:29 |
fungi | yes, that's helpful | 20:29 |
fungi | i've rechecked 941471 for more results | 20:30 |
clarkb | re https://review.opendev.org/c/opendev/system-config/+/941468 I can make time all afternoon to monitor and fixup things if we want to approve it now | 20:55 |
clarkb | I'm a bit worried about tomorrow and Friday as we're supposed to get badish weather which always comes with the risk of power outage, but if we think waiting is best I'm fine with that too | 20:56 |
fungi | yeah, i can approve it now, just didn't want to have it land while you were lunching | 20:59 |
clarkb | a tuna sandwich with spicy pickles has been consumed | 21:00 |
fungi | there's nothing in that description i dislike | 21:00 |
fungi | sounds amazing | 21:00 |
fungi | my sandwich too had pickles, but it was a cuban | 21:00 |
clarkb | the other day I had a grilled cheese and ham sandwich and only after i had eaten it did I remember I had pickles to make it an almost cubano. | 21:03 |
Clark[m] | fungi: the gitea job failed for haproxy refactor. Do we want to dequeue and reenqueue to the gate to try again asap or let it finish then recheck to go back through check first? | 21:34 |
fungi | i can reenqueue it | 21:35 |
Clark[m] | It failed pulling haproxy-statsd from docker hub | 21:35 |
opendevreview | Aurelio Jargas proposed zuul/zuul-jobs master: Add role: `ensure-python-command`, refactor similar roles https://review.opendev.org/c/zuul/zuul-jobs/+/941490 | 21:54 |
fungi | system-config-run-tracing hit a dockerhub pull rate limit on 941471 this time, everything else passed, but we still haven't seen that job succeed for sure. i'll recheck again | 22:33 |
clarkb | I've ssh'd into gitea-lb02, zuul-lb01 and zuul-lb02 | 22:45 |
clarkb | will keep an eye on them when the change merges which should happen shortly | 22:45 |
clarkb | at this rate the jobs will run after hourly jobs. Oh well eventually we will get there | 22:59 |
opendevreview | Merged opendev/system-config master: Move haproxy config into /var/lib/haproxy https://review.opendev.org/c/opendev/system-config/+/941468 | 23:00 |
clarkb | it snuck in just ahead of hourly jobs | 23:02 |
clarkb | gitea is running first | 23:02 |
clarkb | containers restarted and /var/lib/haproxy exists as does the syslog socket. /var/log/haproxy is being written to | 23:04 |
clarkb | and the job succeeded. I am able to hit https://opendev.org/ if there was a blip it must've been a very short one | 23:04 |
corvus | nice! | 23:04 |
clarkb | zuul-lb01 and zuul-lb02 seem to have experienced similar. Things shifted over and containers restarted | 23:05 |
clarkb | I don't see a /var/log/haproxy on zuul-lb02 yet | 23:06 |
clarkb | so trying to figure out if things are sad on the next thing :/ | 23:07 |
fungi | deploy reported success for both | 23:07 |
clarkb | as far as I can tell both production load balancers are happy with their new "homes" | 23:07 |
clarkb | still getting [ALERT] (11) : sendmsg()/writev() failed in logger #1: Permission denied (errno=13) on zuul-lb02 | 23:08 |
clarkb | srw-rw-rw- 1 root root 0 Feb 12 23:04 log which should allow anything to write to it? | 23:08 |
fungi | yeah, unless... apparmor again? | 23:09 |
clarkb | apparmor="DENIED" operation="sendmsg" class="file" info="Failed name lookup - disconnected path" error=-13 profile="rsyslogd" name="var/lib/haproxy/dev/log" pid=60555 comm="haproxy" requested_mask="r" denied_mask="r" | 23:09 |
clarkb | is the problem that the container side sees the path as var/lib/haproxy/dev/log and not /var/lib/haproxy/dev/log? | 23:12 |
clarkb | yes it seems to be related to chroots/namespaces | 23:13 |
clarkb | "attach_disconnected This forces AppArmor to attach disconnected objects to the task's namespace and mediate them as though they are part of the namespace. WARNING this mode is unsafe and can result in aliasing and access to objects that should not be allowed. Its intent is a debug and policy development tool" | 23:14 |
clarkb | I can't help but feel these apparmor rules are highly restrictive but also often useless. For example podman kill fails but I can just use kill(1) | 23:18 |
clarkb | now I can't allow access to /var/lib/haproxy/dev/log because I'm bind mounting it into a container with a different filesystem namespace. I could instead talk to a udp port listening on localhost? | 23:19 |
clarkb | what have we accomplished here? | 23:19 |
clarkb | https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-apparmor-profiles.html#sec-apparmor-profiles-glob-variables suse documents doing magic with variables for chroot prefixes here which isn't quite what we need | 23:20 |
clarkb | we should be able to test this via https://review.opendev.org/c/opendev/system-config/+/941471 at least (which will now fail due to missing /var/log/haproxy.log) | 23:23 |
clarkb | does anyone know anyone that understands apparmor? I feel like this is the sort of thing that should be doable. Another option would be to change how haproxy logs https://docs.haproxy.org/2.2/configuration.html#4.2-log | 23:24 |
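For reference, the localhost UDP variant would look roughly like this on both sides (the port choice is arbitrary):

```
# haproxy global section: log to a localhost UDP listener instead of the socket
global
    log 127.0.0.1:514 local0

# rsyslog side: enable the UDP input (hypothetical drop-in)
module(load="imudp")
input(type="imudp" address="127.0.0.1" port="514")
```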
mordred | are there any people who understand apparmor? | 23:31 |
fungi | apparmorsmiths | 23:34 |
fungi | they're the only ones who can # it into shape | 23:34 |
clarkb | thinking about this more the old problem was that rsyslog was not allowed to write to /var/haproxy/dev/log to create the socket. So we moved it to /var/lib/haproxy/dev/log and now rsyslog can create the socket | 23:35 |
clarkb | the problem now is that haproxy running in a different namespace is trying to write to that socket and getting denied | 23:35 |
clarkb | looking in /etc/apparmor.d/abstractions/base there seems to be a global rule to allow things to write to /dev/log | 23:35 |
clarkb | I wonder if the problem now is more that we need to have a rule that says haproxy in a container can write to what is actually /var/lib/haproxy/dev/log. Though it is odd that profile="rsyslogd" appears in the DENIED message | 23:36 |
clarkb | since that profile shouldn't apply to our haproxy process? | 23:36 |
fungi | unless apparmor somehow tracks what profile was associated with the process that created that resource? | 23:38 |
clarkb | in the beginning of the file we have `profile rsyslogd /usr/sbin/rsyslogd {` I believe rsyslogd there is a logical name and can be any string (maybe with limitations on characters) then /usr/sbin/rsyslogd is an attachment conditional that is supposed to restrict where the profile appears | 23:40 |
clarkb | but now that I think of it more it's a socket so both rsyslog and haproxy are cooperating over it and maybe that is how we're tripping over this | 23:40 |
clarkb | I'm going to try attach_disconnected just to see if that fixes it | 23:42 |
clarkb | I don't like it because it applies to the whole profile and we really just need it for this one resource | 23:42 |
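For context, the flag goes in the profile header, i.e. something like this abbreviated sketch of the shipped profile:

```
profile rsyslogd /usr/sbin/rsyslogd flags=(attach_disconnected) {
  #include <abstractions/base>
  # ... rest of the shipped rsyslogd profile unchanged ...
}
```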
corvus | as much as i hate to say it; if they are going to start including apparmor profiles for docker stuff, we may want to resign ourselves to just needing to write the occasional supplemental profile. it might be worth giving in and going ahead and doing that right now rather than continuing to fight it. bonus: moar sekurity? | 23:43 |
corvus | er by "right now" i did not mean "immediately" i meant "correctly, soon" | 23:44 |
corvus | (basically, i'm wondering if we're at the beginning of an apparmor journey on noble) | 23:44 |
corvus | (rather than haproxy being the beginning and the end; i dunno honestly) | 23:45 |
clarkb | ya I think that may be the case | 23:47 |
clarkb | apparmor is no longer a little toy in the corner but something important to the system | 23:47 |
clarkb | and ya I'm mostly trying to figure out what it wants from me to make this work. Adding that attach_disconnected flag did not help | 23:47 |
clarkb | https://lists.ubuntu.com/archives/apparmor/2019-August/012014.html I think this is related also "Hopefully we'll have something more pleasing in the future, but this is where it's at today." from circa 2019 | 23:52 |
clarkb | but I recognize that person as someone in the ubuntu security irc channel maybe we can ask there | 23:52 |