Wednesday, 2025-02-12

cardoeI’ve raised it to them.00:18
cardoeMy dev boxes are behaving weird too00:18
fungithanks. we can probably extract some exact timestamps since we have nodepool interacting with the api fairly continuously00:20
fungiif they want anything to correlate00:20
cardoeYeah that’s what they are bugging me for.00:29
cardoeI’m a horrible user cause I just said I dunno. You figure it out.00:29
fungiyeah, i worked support at a service provider for many years... you're "that guy" ;)00:31
fungilemme see what i can dig up00:31
opendevreviewClark Boylan proposed opendev/system-config master: Mirror docker tool images  https://review.opendev.org/c/opendev/system-config/+/94130400:35
clarkbcorvus: ^ fyi. I didn't do debian:testing because that seems like one that we might be able to find a different image for or at least standardize on the "base image for testing random container things"00:35
clarkbI didn't want to assume that debian:testing is the image that makes the most sense for that off the bat00:35
clarkbplease double check I got all of the names and tag values right00:36
*** dan_ is now known as dan_with00:37
clarkbI got distracted but I'm realizing I should've enqueued the sighup fix directly to the gate. Oh well00:37
clarkbat this point its mostly done in check I think00:37
fungicardoe: these are utc timestamps where nodepool was unable to get a server list from the nova api in flex sjc3: https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/01:04
clarkbif the sighup fixup lands I'll have to make some noop changes to docker-compose.yaml and/or the haproxy config on the servers by hand then see if tomorrow's daily run triggers the handler successfully (though it should be covered in CI so I'm not too worried about it)01:05
corvusclarkb: i agree.  maybe apache httpd or something?01:30
Clark[m]++02:02
opendevreviewMerged opendev/system-config master: Perform haproxy HUP signals with kill  https://review.opendev.org/c/opendev/system-config/+/94125602:15
fungiyay02:15
Clark[m]The deploy jobs were successful though they should noop. Anyway tomorrow we can update DNS and then I'll edit the docker compose files so that periodic jobs flip them back and emit the signals03:35
cloudnullo/04:02
clarkbwelcome back04:18
clarkbI'm not sure I have any real useful debugging info. fungi had that and provided https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/ tl;dr seems to be odd network connectivity issues to cloud resources, from cloud resources, and to the cloud api in rax flex04:19
cloudnullhey clarkb - do you or fungi have anything on what was happening when the issue was observed?04:29
cloudnullI also have a new region for you all to beat up if interested?04:30
clarkbcloudnull: I mentioned that two test jobs failed because they couldn't look up ubuntu package repo servers in DNS. Fungi then asked if it happened with raxflex because he apparently saw connectivity issues to a personal server in the region as well as problems getting to the nova api04:30
clarkbdefinitely interested04:30
clarkbso basically I saw outbound connectivity have a disruption and fungi noticed inbound connectivity issues04:31
clarkbthe dns requests would've been over port 53 (likely udp) and not using dns over tls or dns over https04:31
cloudnullwas that all today? 04:31
clarkbcloudnull: yes04:32
clarkbwe configure local unbound to resolve against 1.1.1.1 and 8.8.8.804:32
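(For reference, a minimal sketch of that unbound forwarding setup, assuming a drop-in config file; the path and stanza are illustrative, not the actual opendev configuration:)

    # /etc/unbound/unbound.conf.d/forward.conf (hypothetical path)
    forward-zone:
        name: "."
        forward-addr: 1.1.1.1
        forward-addr: 8.8.8.8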
cloudnullsounds good, we'll look into it.  04:32
clarkb(we use the ipv6 equivalents in clouds with ipv6 but rax flex doesn't yet iirc)04:32
cloudnullI have no idea what might have happened, but we'll find out :D04:32
cloudnullyeah, no IPv6 yet, but soon-tm 04:32
clarkbcloudnull: the timestamps in that paste would've been from nodepool logs accessing cloud apis for that cloud region04:33
cloudnullI wanted to get it done sooner, but it just didn't happen ... 04:33
cloudnull+1 04:33
clarkband I think those are timestamps for failures to reach the api from nl01.opendev.org (hosted in rax classic dfw)04:33
cloudnullhttps://keystone.api.dfw3.rackspacecloud.com is the new region, you should be able to get access to it exactly the same as SJC 04:35
clarkbthat should be straightforward on our end. Just add a new region to clouds.yaml and then spin things up04:35
cloudnull👍04:36
cloudnullIs there a nodepool PR I need to push for that to happen? 04:37
clarkbnope I think we should be able to do it from our end04:38
clarkbwe need to update clouds.yaml contents, then add the region to our "cloud launcher" which sets up keys and security groups and so on, then we boot a mirror and then finally add the region to nodepool. We can ask questions if we run into anything unexpected in that process but I suspect it will be similar to sjc04:39
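(A rough sketch of what adding the new region to clouds.yaml could look like; the cloud name, auth method, and values are placeholders, not the real opendev credentials or layout:)

    clouds:
      rackspace-flex:
        auth:
          auth_url: https://keystone.api.dfw3.rackspacecloud.com
          username: <username>
          password: <secret>
          project_name: <project>
        regions:
          - SJC3
          - DFW3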
cloudnullsounds good. 04:41
cloudnullRE: networking tomfoolery, I've got some threads going - likely wont have an answer this evening but I'll keep you posted as we work through it. 04:41
Clark[m]No rush it's getting late05:45
opendevreviewQianBiao Ng proposed openstack/diskimage-builder master: Fix: ramdisk image build issues  https://review.opendev.org/c/openstack/diskimage-builder/+/75505610:58
karolinkuHello folks. Question about glean and the startup environment. I'm working on keyfile support for C10. I noticed that most distros manually populate their networking files (like ifcfg) based on the config drive, but in the case of keyfiles the NetworkManager documentation says it is not recommended to edit them manually. So can nmcli be used for setting up the config?12:17
fungikarolinku[m]: i guess my only concern would be getting the sequencing correct, since glean can write config files at any time but nmcli may not be able to interact with interfaces until later during the boot process14:05
Clark[m]I think there have been issues with network manager starting before config is in place too. I want to say that was the genesis of problems around failed DHCP (or was it RAs?) leading to the interface being improperly configured. Git logs in glean or dib probably have details14:15
fungiyeah, i guess the trick would be convincing nm to "do nothing" until instructed by glean14:18
Clark[m]I think it had to do with network manager marking interfaces down if default DHCP failed on startup (and some clouds don't do DHCP)14:19
Clark[m]Ya either you configure it first or configure it to idle14:19
karolinku[m]so I understand that both options are possible here - glean writing config before or after nm starts14:25
Clark[m]Well previously it caused bugs if glean ran after14:26
Clark[m]That may not be an issue on centos 10 I'm not sure.14:26
karolinku[m]but nmcli can modify config dynamically14:30
karolinku[m]so even if it starts with a default config, glean can overwrite it later. unless it puts the interface down...14:32
fungiproblem was nm would automatically start trying to dhcp and slaac configure interfaces before glean was able to tell it not to14:33
fungiianw likely remembers the gritty details, i think he did a bunch of the troubleshooting for that when fedora first switched to nm14:35
karolinku[m]the other dirty hack that came to my mind is to create an ifcfg file in c10, then use "nmcli migrate" to transform it into keyfile format14:36
clarkbcan we just write a keyfile? is the format stable and easy to work with?14:37
clarkbhttps://review.opendev.org/c/opendev/glean/+/688031 I think this is related to the issue I mentioned14:39
clarkbin any case you'll have to figure out what works and makes sense14:39
clarkbultimately the actual order of things and the tools used don't matter too much as long as they work consistently14:40
karolinku[m]doable, but not recommended as they wrote here https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10-beta/html/configuring_and_managing_networking/networkmanager-connection-profiles-in-keyfile-format#networkmanager-connection-profiles-in-keyfile-format14:41
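(For context, a minimal NetworkManager keyfile for a statically configured interface looks roughly like the following; the interface name and addresses are made up, and this only illustrates the format glean would have to emit:)

    # /etc/NetworkManager/system-connections/eth0.nmconnection (illustrative)
    [connection]
    id=eth0
    type=ethernet
    interface-name=eth0

    [ipv4]
    method=manual
    address1=203.0.113.10/24,203.0.113.1
    dns=1.1.1.1;8.8.8.8;

    [ipv6]
    method=ignore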
clarkbinfra-root if https://zuul-lb02.opendev.org/ looks good to you I'd like to proceed with https://review.opendev.org/c/opendev/zone-opendev.org/+/941167 I'll probably approve the dns change myself after breakfast and morning things regardless as the lb seems to work for me14:41
clarkbkarolinku[m]: note that warning is for manual editing by hand14:42
clarkbkarolinku[m]: its the same reason you shouldn't edit sudoers manually but instead use visudo14:42
clarkbin the case of glean it would be a programmatic tool that always emits the same content that can be tested too14:42
clarkb(thus avoiding inadverdent typos)14:42
clarkbbasically I don't read that warning to say you shouldn't edit the files directly. I read it as saying you shouldn't just edit them manually by hand in a one-off as it is easy to introduce bugs. That is different from editing them programmatically14:43
karolinku[m]I understand it as: those keyfiles are just a "store" of config, and the default way of modifying them is nmcli14:45
karolinku[m]using nmcli the user just doesn't have to think about which section should be used or which option to choose14:46
clarkb"Typos or incorrect placements of parameters can lead to unexpected behavior. Therefore, do not manually edit or create NetworkManager profiles." this can be avoided entirely by using a programmatic tool like nmcli or glean14:46
clarkbyes that is true with glean too. With glean the only input should be the config drive14:46
karolinku[m]I'm just not a networking expert, so I don't even know if I'd be able to write a proper config14:46
karolinku[m]because I'm the human parser between config drive and keyfile 😅14:48
clarkbI'm not saying don't use nmcli. I'm more saying I don't think that networkmanager is actually saying you must use nmcli in this case14:49
fungiwell, i expect the same could be said for writing a proper nm-interfacing script. you need the same degree of network knowledge either way, it's just a question of how you express what you want14:49
clarkbbut yes you have to understand how this works because you have to teach glean how it works14:49
clarkbthe degree of understanding may vary based on additional abstractions and tools employed to simplify things14:49
fungithough maybe my years as a network engineer have made me cynical of the people who write network tools14:49
karolinku[m]wrt config drive - for learning purposes I used https://opendev.org/opendev/glean/src/branch/master/glean/tests/fixtures/vexxhost/mnt/config/openstack/latest/network_data.json14:52
karolinku[m]how is it generated for vms?14:52
clarkbkarolinku[m]: openstack nova generates the data and attaches it to the VMs either as an iso or fat32 filesystem iirc14:53
clarkbthe filesystem has a specific label that glean looks for and mounts to /mount/config or something like that. Then it loads the json data off of it. The example data you link to is sanitized from a real cloud instance in that cloud provider14:54
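(An abbreviated sketch of the network_data.json shape, with values changed; see the linked fixture for a real sample:)

    {
      "links": [
        {"id": "eth0", "type": "phy", "ethernet_mac_address": "fa:16:3e:00:00:01"}
      ],
      "networks": [
        {"id": "network0", "link": "eth0", "type": "ipv4",
         "ip_address": "203.0.113.10", "netmask": "255.255.255.0",
         "routes": [{"network": "0.0.0.0", "netmask": "0.0.0.0", "gateway": "203.0.113.1"}]}
      ],
      "services": [{"type": "dns", "address": "1.1.1.1"}]
    }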
funginote that we interface with cloud providers running varying vintages of openstack, and what you can expect from configdrive's network metadata has changed over time14:56
clarkbI think for the stuff opendev cares about its been fairly stable compared to what is in our example test data14:56
clarkbthis may not be true for ironic's use cases as they are more complicated with vlans and bonds14:57
karolinku[m]ok, thanks for the tips! 15:15
mnaserlol, what are the crazy odds two different changes get the same change id.. https://review.opendev.org/q/Ic8aaa0728a43936cd4c6e1ed590e01ba8f0fbf5b15:16
clarkbits common for backports because people cherry pick them and carry the change id over15:17
mnaseryeah but in that case it's two separate changes + their backports lol15:17
fungiyeah, change ids are only required to be unique for a specific project+branch combo15:17
clarkboh neat I see its one change to nova + backports and a different one to manila + backports15:18
fungii have a feeling that's not random chance, and rather that one of them copied the change id from the other15:18
clarkbya random=$({ git var GIT_COMMITTER_IDENT ; echo "$refhash" ; cat "$1"; } | git hash-object --stdin) is how change id gets generated. $refhash is git rev-parse HEAD15:19
clarkbjust GIT_COMMITTER_IDENT should be sufficient to produce different hashes here since these are different committers15:20
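(Expanded into a standalone shell sketch that mirrors the commit-msg hook logic described above; "$1" would be the commit message file the hook receives, and this is illustrative rather than the exact hook:)

    #!/bin/sh
    # roughly how a Change-Id gets derived from committer identity, HEAD, and the message
    refhash=$(git rev-parse HEAD)
    random=$({ git var GIT_COMMITTER_IDENT; echo "$refhash"; cat "$1"; } | git hash-object --stdin)
    echo "Change-Id: I$random"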
mnaseri guess technically with that we can find out who's the real holder of the change id :P15:23
fungisha-1 is not immune to collisions, but the chances are more remote than the chance that some human copied a commit message and started editing it but didn't wipe the old id footer15:23
mnaserwould have been a fun anomaly if that was the case :)15:24
fungicloudnull: back to yesterday's discussion, i've been observing network outages in sjc3 that last around 10-15 minutes almost every day (sometimes more than once a day) for the past week or so, and when it happens both virtual machines and the nova api are unreachable from everywhere i try, so guessing it's something upstream of the entire cloud checking out at random15:26
fungiroutes falling out of the igp? overloaded state tracking in some middlebox? whatever15:27
clarkbok zuul-lb02 continues to work for me. I'm going to approve the change to update dns now15:56
clarkbworst case only the zuul dashboard becomes unreachable and we have to hit them directly with /etc/hosts overrides while we revert15:57
clarkbso impact should be small if anything does go wrong15:57
clarkbthen once that is done I'll figure out loading extra permissions so that I can force merge the buildx fix in zuul-jobs so that we can fix nodepool testing so that we can fix dib testing15:58
clarkboh and then try again to deploy codesearch02 I guess15:58
clarkbI need to rebase the codesearch dns update after the zuul.o.o dns update15:59
opendevreviewMerged opendev/zone-opendev.org master: Switch zuul.o.o to zuul-lb02  https://review.opendev.org/c/opendev/zone-opendev.org/+/94116716:00
clarkbok already found a minor-ish problem :/ haproxy logging is set up with `log /dev/log local0` in the haproxy config. I think this sets up a socket for syslog logging. It appears that there may have been permissions issues for that under podman and/or noble so we aren't logging like we expect to16:08
clarkbthe service itself is working though16:08
clarkbok rsyslog config is configured to set up a listen socket at /var/haproxy/dev/log which is a socket on lb01 but is a regular dir on lb02. I suspect there may be some startup race16:14
clarkbrsyslogd: cannot create '/var/haproxy/dev/log': Address already in use <- that seems to be the case here. Looking at the ansible it isn't clear to me how this happened as we should restart rsyslog with the new config before ever creating that path16:17
clarkbanyway what I'd like to do is stop the haproxy service (so the bind mount is stopped), move that log dir aside, restart rsyslogd then start haproxy again16:18
clarkbthis will be noticeable as a blip in zuul web service. Any concern with that cc infra-root corvus 16:18
clarkbif we prefer we can revert the dns update and do this surgery then update dns again. Maybe that is better16:18
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02"  https://review.opendev.org/c/opendev/zone-opendev.org/+/94146516:22
fungiseems like it should be pretty quick16:22
fungino objection from me16:22
fungii doubt anyone will spot the blip, worst case they'll click reload in their browsers16:22
clarkbfungi: ya my concern is that the bind mount for /var/haproxy/dev/log:/dev/log is what is converting the socket device to a directory16:23
clarkbso mostly concerned I'll do all that work and be right back where I started so reverting to lb01 for now seems better16:23
fungiit's a reasonable theory, i agree16:24
clarkbif that is the case I feel like that is a major bug in podman. At the very least it should error rather than silently deleting content in your filesystem and replacing it with something different16:24
clarkbI found the logs from the original rsyslog config update and restart and there were no complaints there. But I did a restart just now and it complains about not being able to change an existent file system entry16:24
clarkbthat implies it worked at first and then podman broke it (or haproxy I suppose)16:25
clarkbanyway dns update is approved16:25
fungiit's a fifo we pre-create right?16:25
clarkb$AddUnixListenSocket /var/haproxy/dev/log16:25
clarkbits a unix domain socket16:25
fungiah okay16:25
funginot a named pipe16:25
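(Putting the two halves together, the intended wiring is roughly the following; the stanzas are a sketch rather than a copy of the deployed config:)

    # rsyslog on the host: create a unix datagram socket for haproxy to log to
    module(load="imuxsock")
    $AddUnixListenSocket /var/haproxy/dev/log

    # haproxy.cfg inside the container: /var/haproxy/dev/log is bind-mounted in as /dev/log
    global
        log /dev/log local0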
clarkbanyway one step at a time. First undo dns update, then stop haproxy, clean up the filesystem and try again gathering data as we go16:27
opendevreviewMerged opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02"  https://review.opendev.org/c/opendev/zone-opendev.org/+/94146516:28
clarkbok my local resolution flipped over to lb01 and the ttl has reset back to 300 after the first lb01 lookup. I think its safe to start operating on lb02 so will do so now16:36
corvusjust acking that i'm caught up and agree.16:37
clarkbrsyslogd[56508]: rsyslogd: cannot create '/var/haproxy/dev/log': Permission denied16:38
clarkband now that I see ^ in the log I see that this appears to have happened when we deployed previously. The logs I saw earlier that looked clean must've been from earlier node setup16:39
fungiso rsyslog was started when the permissions didn't allow creating the socket, then podman saw the thing being bindmounted didn't exist so helpfully created a directory there, then later when permissions got fixed haproxy couldn't create the socket because there was already a directory with the same name?16:40
clarkbthats half right. permissions haven't gotten fixed and it is still failing on this. And drumroll please the culprit is apparmor16:41
* fungi is shocked. SHOCKED16:41
clarkbapparmor="DENIED" operation="mknod" class="file" profile="rsyslogd" name="/var/haproxy/dev/log"16:41
clarkbI'm off to figure out if I can add local config to apparmor to allow this to happen16:41
* fungi sighs16:41
corvusthe "creating missing directory" part is standard docker behavior16:41
corvus(as in, dockerd would have done that too, not just podman)16:42
clarkbya in "positive" news this means podman isn't deleting anything it was never there in the first place so I'm happier about that16:42
clarkbcorvus: ack16:42
clarkband also this is a generic noble problem and doesn't have to do with podman or docker16:42
corvusyeah i guess this is a "noble rsyslog upgrade" problem?16:42
fungibecause nobody uses rsyslog any more now that systemd-journald is a thing, i guess16:43
clarkbcorvus: yup16:43
corvusyeah, totally our fault16:44
fungihttps://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/1826294 maybe?16:45
clarkb/var/lib/*/dev/log            rwl, <- this rule already exists. Do we have a preference to move the haproxy syslog device into /var/lib/haproxy/dev/log or do we want a local rule to allow rwl against /var/haproxy/dev/log ?16:46
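(The local-rule option would be a drop-in like the following; the include path is how Ubuntu's shipped rsyslogd profile sources local overrides, and the rule itself is illustrative:)

    # /etc/apparmor.d/local/usr.sbin.rsyslogd
    # allow rsyslog to create its listen socket at the non-standard path
    /var/haproxy/dev/log rwl,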
corvusi'm fine bending to the will of the fhs16:47
fungiwere we explicitly creating /var/haproxy/dev/log previously?16:48
fungis/creating/configuring/16:48
clarkbfungi: yes that is what rsyslog configuration is set to create and is what we bind mount16:48
fungiokay, but it's our own configuration choice sounds like, not a package-supplied default16:48
clarkbyes this is all containers and our choice16:48
fungiso yes, i agree moving the socket to where the shipped apparmor rules expect it makes sense16:49
corvusyeah, we were trying to keep all the docker references together rather than bind-mounting things from all over the filesystem -- to try to make it more clear what's going on16:49
clarkbcorvus: that is my next question should we move everything? I think that is fairly easy for haproxy because it is so stateless16:49
fungiunless we want to end up carting around an ever increasing amount of apparmor configuration too, i guess16:49
corvusbut i think that's not an important enough goal to push back against existing config16:49
clarkbI think we can have ansible write out everything to /var/lib/haproxy and then we just go and clean up /var/haproxy when done16:49
corvusclarkb: that works for me16:50
corvusso we'd have /var/lib/haproxy/run and .../etc16:50
clarkbyes16:50
fungiwfm16:50
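(A sketch of the compose layout that direction implies; the image name and in-container paths are assumptions, not the merged change:)

    services:
      haproxy:
        image: docker.io/library/haproxy:lts
        network_mode: host
        volumes:
          - /var/lib/haproxy/etc:/usr/local/etc/haproxy:ro
          - /var/lib/haproxy/run:/var/lib/haproxy/run
          - /var/lib/haproxy/dev/log:/dev/log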
clarkbI'll work on that and adding tests to ensure we're creating things. Though it still isn't clear to me if our test nodes have apparmor in an enforcing state16:51
corvussigns point to no on that one, eh?16:51
clarkbcorvus: ya based on docker-compose kill / podman kill working on noble in CI16:51
fungiwell, also based on this not getting caught in testing16:52
clarkbwe don't explicitly test the haproxy logging I don't think but maybe we do and if so then ya16:52
clarkbanyway I'll pause effort on lb02 and start working on a change to see how ugly this is. I don't expect it to be too bad even for the existing gitea and zuul proxies16:53
clarkbI guess I should update rsyslog config and double check it can create at that location after all16:54
clarkbI'll do that too16:54
clarkbyup that works16:55
opendevreviewMerged zuul/zuul-jobs master: Install ca-certificates in the buildx image  https://review.opendev.org/c/zuul/zuul-jobs/+/93982317:01
fungigonna run some lunch errands, but should be back in about an hour17:10
opendevreviewClark Boylan proposed opendev/system-config master: Move haproxy config into /var/lib/haproxy  https://review.opendev.org/c/opendev/system-config/+/94146817:15
clarkbthe more I think about ^ the more I'm concerned that we might have unexpected fallout on the existing servers, like the old rsyslog socket path no longer being valid in the time between restarting rsyslog and haproxy17:15
clarkbbut I figure this is good for a first pass and others can review it and add concerns if they have them. I think absolute worst case we'll be able to manually undo bits and pieces here and there to get it up and running again if necessary17:16
opendevreviewClark Boylan proposed opendev/zone-opendev.org master: Add codesearch02 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94114017:18
clarkbif CI is happy with ^ I'm going to go ahead and approve it in order to reduce the amount of rebase conflicts we have. The change to add codesearch02 is passing now so I think having it in dns is a good idea17:19
opendevreviewMerged opendev/zone-opendev.org master: Add codesearch02 to DNS  https://review.opendev.org/c/opendev/zone-opendev.org/+/94114017:30
opendevreviewClark Boylan proposed opendev/system-config master: Collect apparmor status in ci jobs  https://review.opendev.org/c/opendev/system-config/+/94147117:35
clarkbdebugging the apparmor behavior seems like it may be important as we move more services to noble ^ that is a starting point17:36
clarkbarg but editing that file runs all of our jobs...17:36
corvusis that bad? :)17:37
clarkbwell it will eat docker hub quota but it won't prevent us from getting useful apparmor status info back so I guess its good in that regard17:37
corvusclarkb: re restarting: seems like worst case is a brief outage that is just corrected by restarting haproxy?17:39
clarkbcorvus: yes I think so especially since haproxy seems to continue (logs a warning to stdout) if its main logging location isn't available as we saw on zuul-lb0217:40
clarkbbasically I think worst case we identify whatever it was that didn't get transitioned properly then manually restart. Shouldn't be a huge impact if it happens and CI should confirm that things generally work17:40
clarkbits just the transition that we don't test17:40
clarkbI'm going to take a break while we wait on results from all the things17:42
clarkblooks like the zuul job passed with the haproxy refactor which is a good sign that it works with the shift (again the main concern is the transition as we don't test that). Let me know what ya'll think18:11
clarkbwe're also starting to get back apparmor_status info but none of the noble host jobs have run yet so don't have data to compare there yet18:11
corvusrefactor has a +2 from me18:12
clarkbfungi: maybe you can take a look after lunch too? I should be able to be around all day today to help monitor and debug/fix should the transition be problematic18:27
fungiyeah, looking now18:32
clarkbsystem-config-run-base has empty apparmor_status output and errors that apparmor_status command isn't found. But bridge does have apparmor status. This implies to me that we start out with no apparmor then as we deploy things we install apparmor. I'm wondering if we need to reboot for apparmor enforcement to take effect. Should know more once paste/zuul/grafana jobs complete18:34
fungi`apt-file search bin/apparmor_status` tells me that the apparmor package ships a /usr/sbin/apparmor_status file18:35
fungiso sounds like something is pulling in apparmor automatically as a dependency18:36
clarkblooks like it comes from install-docker via upstream.yaml on !noble and on noble it must be a podman dependency?18:37
clarkbbut maybe it isn't a podman dependency so we don't actually get it on noble nodes in ci. If that is the case we can add apparmor as a dep to better align with the old docker system and what we appear to get in prod images18:38
clarkbhttps://zuul.opendev.org/t/openstack/build/9bb4f267b2934e9fae08277c73f209a3/log/paste99.opendev.org/apparmor_status.txt ok I suspect that is what is going on. For all the old docker stuff we installed apparmor but now we don't for podman18:40
clarkbI'm going to look at deps for podman packaging but I'll propose an update to install-docker to install apparmor for podman too18:40
clarkbya there is no direct dependency. I'm guessing that the ubuntu stance is that they just install apparmor by default so things don't have to strictly depend on it even when they have integration?18:43
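(The fix amounts to an extra package task in the podman branch of install-docker, something like this; the surrounding role structure is assumed:)

    - name: Install apparmor alongside podman
      package:
        name: apparmor
        state: present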
opendevreviewClark Boylan proposed opendev/system-config master: Install apparmor when installing podman  https://review.opendev.org/c/opendev/system-config/+/94147118:46
clarkbthat should fail on the zuul-lb thing18:46
clarkbdepending on how we want to do stuff we could put the haproxy refactor on top of ^ to confirm things are happy afterwards then reverse the order to make things mergable18:47
fungiinterestingly, apparmor has "optional" priority, not essential or important18:47
fungiso i suspect it's just included as part of their default base package selection at image build time18:48
clarkbya18:48
clarkbit honestly might not be a terrible idea for us to ensure we include it in our images too since that is what ubuntu in the wild operates with18:48
clarkbthat might break a lot of stuff but maybe that is for the best? could possibly make the change only in the zuul-launcher images and then that can be part of migrating to the new images18:48
fungiagreed, though it could be disruptive for jobs that aren't already dragging it in18:48
clarkbcorvus: ^ not sure if you want to include that in the list of things to test18:48
clarkbnoble in particular pulls in apparmor 4.0 which includes rules for a lot more stuff (this is what got podman for example)18:49
clarkbrsyslog seems to be new rules too18:49
clarkbfeels a bit like I'm spinning my wheels on this stuff but this has all been good to learn and that was the primary goal. Ensuring we're confident in our podman deployment18:50
clarkbless of a success replacing services, more of a success making our podman better18:50
clarkbone thing to note landing 941471 might be difficult with the way docker rate limits work18:52
clarkbwe may need to do our best effort to ensure we get coverage across as many things as possible then reduce the testing we run before merging it18:52
corvusclarkb: let's exercise how easy it is to make new images and just make a dedicated apparmor image; i think i'd like to keep our current images so that we're still able to test.  if the apparmor image works, we can switch over to it.18:58
clarkbcorvus: that seems reasonable18:59
clarkbcurrently we add packages to the infra package needs list but that would be shared by the two images so we need another mechanism. maybe just another small element with additional package deps18:59
corvus(if you wanted to go crazy and define an image-validation pipeline with a job for an apparmor image... technically that should work but it's untested in prod :)19:01
corvusclarkb: maybe an element that gets packages from an env var, and we set that env var in the job?  that would be nice and general purpose for easy changes like this19:01
clarkb++19:02
clarkbits possible dib already has something like that too19:02
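(A minimal sketch of such an element under the usual diskimage-builder layout; the element name and environment variable are made up:)

    # elements/extra-packages/install.d/10-extra-packages (illustrative)
    #!/bin/bash
    # install any packages listed in DIB_EXTRA_PACKAGES, which the job would set
    set -eu
    if [ -n "${DIB_EXTRA_PACKAGES:-}" ]; then
        install-packages ${DIB_EXTRA_PACKAGES}
    fi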
clarkbhttps://zuul.opendev.org/t/openstack/build/5d7a88d99c9a4c09bc9ffea273bd8d23/log/paste99.opendev.org/apparmor_status.txt shows that installing apparmor on noble as part of podman installation looks like production19:23
clarkbso ya now we have a decision to make on whether or not we take clarkb's manual testing as sufficient and move forward with haproxy updates or try to land 941471 first19:23
clarkbjust note that 941471 is likely to be difficult to land normally due to how many docker consuming jobs it runs19:24
dpanechHi, zuul jobs are not starting on this review, could someone help? https://review.opendev.org/c/starlingx/helm-charts/+/94099319:32
clarkbdpanech: hit refresh19:36
fungifastest fix ever ;)19:36
dpanechhuh awesome, did you do something or was it just taking longer than usual?19:49
clarkbI didn't do anything19:49
clarkbdpanech: did you check the zuul status page?19:49
fungimy guess is that if you went to the zuul status dashboard you'd have seen it in a "queued" state for about an hour while some provider was timing out trying to satisfy a node request for it19:50
fungisince it doesn't look from the usage graphs like we were maxxed out on available quota19:51
dpanechyou mean this status page? https://zuul.opendev.org/t/openstack/builds19:52
fungithe build result reports that the job itself only took a couple of minutes to run, but didn't start running for about an hour after gerrit would have sent the event19:52
clarkbdpanech: https://zuul.opendev.org/t/openstack/status this one19:52
fungidpanech: https://zuul.opendev.org/t/openstack/status19:52
clarkbbuilds shows you a historical view on what has run and completed. status shows you the live status19:52
dpanechok good to know, thanks19:53
clarkbI'm going to eat lunch now. fungi  I guess let me know what you think of https://review.opendev.org/c/opendev/system-config/+/94146819:53
fungiclarkb: huh, that change ran fewer jobs than i anticipated19:53
clarkbfungi: it ran jobs for the two services that use haproxy19:55
clarkbthe change that installs apparmor and collects apparmor_status info runs against everything but is separate19:55
fungiyeah, for some reason i thought we had more19:55
clarkbah19:55
fricklerfyi https://blog.codeberg.org/we-stay-strong-against-hate-and-hatred.html19:59
fungisomehow 941471 only failed 2 jobs20:26
fungione on a docker rate limit, the other looks like the executor got disconnected from a test node mid-test20:27
clarkbamazing. But also thats a good indication it doesn't break anything too bad20:28
fungiyes, my takeaway as well basically20:28
clarkbmight be worth a recheck to ensure those two jobs succeed then we can think about direct enqueuing to the gate or even force merging if we want to get it in20:28
clarkbit occurred to me that the load balancer apparmor problems may not cause that change to fail because we don't actually assert anything about those logs. My change to move things in haproxy does add those asserts though20:29
fungiyes, that's helpful20:29
fungii've rechecked 941471 for more results20:30
clarkbre https://review.opendev.org/c/opendev/system-config/+/941468 I can make time all afternoon to monitor and fixup things if we want to approve it now20:55
clarkbI'm a bit worried about tomorrow and Friday as we're supposed to get badish weather which always comes with the risk of power outage, but if we think waiting is best I'm fine with that too20:56
fungiyeah, i can approve it now, just didn't want to have it land while you were lunching20:59
clarkba tuna sandwich with spicy pickles has been consumed21:00
fungithere's nothing in that description i dislike21:00
fungisounds amazing21:00
fungimy sandwich too had pickles, but it was a cuban21:00
clarkbthe other day I had a grilled cheese and ham sandwich and only after i had eaten it did I remember I had pickles to make it an almost cubano.21:03
Clark[m]fungi: the gitea job failed for haproxy refactor. Do we want to dequeue and reenqueue to the gate to try again asap or let it finish then recheck to go back through check first?21:34
fungii can reenqueue it21:35
Clark[m]It failed pulling haproxy-statsd from docker hub21:35
opendevreviewAurelio Jargas proposed zuul/zuul-jobs master: Add role: `ensure-python-command`, refactor similar roles  https://review.opendev.org/c/zuul/zuul-jobs/+/94149021:54
fungisystem-config-run-tracing hit a dockerhub pull rate limit on 941471 this time, everything else passed, but we still haven't seen that job succeed for sure. i'll recheck again22:33
clarkbI've ssh'd into gitea-lb02, zuul-lb01 and zuul-lb0222:45
clarkbwill keep an eye on them when the change merges which should happen shortly22:45
clarkbat this rate the jobs will run after hourly jobs. Oh well eventually we will get there22:59
opendevreviewMerged opendev/system-config master: Move haproxy config into /var/lib/haproxy  https://review.opendev.org/c/opendev/system-config/+/94146823:00
clarkbit snuck in just ahead of hourly jobs23:02
clarkbgitea is running first23:02
clarkbcontainers restarted and /var/lib/haproxy exists as does the syslog socket. /var/log/haproxy is being written to23:04
clarkband the job succeeded. I am able to hit https://opendev.org/ if there was a blip it must've been a very short one23:04
corvusnice!23:04
clarkbzuul-lb01 and zuul-lb02 seem to have experienced similar. Things shifted over and containers restarted23:05
clarkbI don't see a /var/log/haproxy on zuul-lb02 yet23:06
clarkbso trying to figure out if things are sad on the next thing :/23:07
fungideploy reported success for both23:07
clarkbas far as I can tell both production load balancers are happy with their new "homes"23:07
clarkbstill getting [ALERT]    (11) : sendmsg()/writev() failed in logger #1: Permission denied (errno=13) on zuul-lb0223:08
clarkbsrw-rw-rw- 1 root root 0 Feb 12 23:04 log which should allow anything to write to it?23:08
fungiyeah, unless... apparmor again?23:09
clarkbapparmor="DENIED" operation="sendmsg" class="file" info="Failed name lookup - disconnected path" error=-13 profile="rsyslogd" name="var/lib/haproxy/dev/log" pid=60555 comm="haproxy" requested_mask="r" denied_mask="r"23:09
clarkbis the problem that the container side sees the path as var/lib/haproxy/dev/log and not /var/lib/haproxy/dev/log?23:12
clarkbyes it seems to be related to chroots/namespaces23:13
clarkb"attach_disconnected This forces AppArmor to attach disconnected objects to the task's namespace and mediate them as though they are part of the namespace. WARNING this mode is unsafe and can result in aliasing and access to objects that should not be allowed. Its intent is a debug and policy development tool"23:14
clarkbI can't help but feel these apparmor rules are highly restrictive but also often useless. For example podman kill fails but I can just use kill(1)23:18
clarkbnow I can't allow access to /var/lib/haproxy/dev/log because I'm bind mounting it into a container with a different filesystem namespace. I could instead talk to a udp port listening on localhost?23:19
clarkbwhat have we accomplished here?23:19
clarkbhttps://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-apparmor-profiles.html#sec-apparmor-profiles-glob-variables suse documents doing magic with variables for chroot prefixes here which isn't quite what we need23:20
clarkbwe should be able to test this via https://review.opendev.org/c/opendev/system-config/+/941471 at least (which will now fail due to missing /var/log/haproxy.log)23:23
clarkbdoes anyone know anyone that understands apparmor? I feel like this is the sort of thing that should be doable. Another option would be to change how haproxy logs https://docs.haproxy.org/2.2/configuration.html#4.2-log23:24
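(The UDP alternative would look roughly like this on both ends; the address and port are illustrative, and it sidesteps the bind-mounted socket entirely:)

    # haproxy.cfg: log to a syslog UDP listener on the host instead of a socket
    global
        log 127.0.0.1:514 local0

    # rsyslog: accept UDP syslog on localhost only
    module(load="imudp")
    input(type="imudp" address="127.0.0.1" port="514")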
mordredare there any people who understand apparmor?23:31
fungiapparmorsmiths23:34
fungithey're the only ones who can # it into shape23:34
clarkbthinking about this more the old problem was that rsyslog was not allowed to write to /var/haproxy/dev/log to create the socket. So we moved it to /var/lib/haproxy/dev/log and now rsyslog can create the socket23:35
clarkbthe problem now is that haproxy running in a different namespace is trying to write to that socket and getting denied23:35
clarkblooking in /etc/apparmor.d/abstractions/base there seems to be a global rule to allow things to write to /dev/log23:35
clarkbI wonder if the problem now is more that we need to have a rule that says haproxy in a container can write to what is actually /var/lib/haproxy/dev/log. Though it is odd that profile="rsyslogd" appears in the DENIED message23:36
clarkbsince that profile shouldn't apply to our haproxy process?23:36
fungiunless apparmor somehow tracks what profile was associated with the process that created that resource?23:38
clarkbin the beginning of the file we have `profile rsyslogd /usr/sbin/rsyslogd {` I believe rsyslogd there is a logical name and can be any string (maybe with limitations on characters) then /usr/sbin/rsyslogd is an attachment conditional that is supposed to restrict where the profile appears23:40
clarkbbut now that I think of it more its a socket so both rsyslog and haproxy are cooperating over it and maybe that is how we're tripping over this23:40
clarkbI'm going to try attach_disconnected just to see if that fixes it23:42
clarkbI don't like it because it applies to the whole profile and we really just need it for this one resource23:42
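(For reference, enabling that flag means editing the profile header, roughly as below; shown only to illustrate why it applies to everything the profile covers rather than one path:)

    # /etc/apparmor.d/usr.sbin.rsyslogd header with the flag added
    profile rsyslogd /usr/sbin/rsyslogd flags=(attach_disconnected) {
        # ... existing rules unchanged ...
    }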
corvusas much as i hate to say it; if they are going to start including apparmor profiles for docker stuff, we may want to resign ourselves to just needing to write the occasional supplemental profile.  it might be worth giving in and going ahead and doing that right now than continuing to fight it.  bonus: moar sekurity?23:43
corvuser by "right now" i did not mean "immediately" i meant "correctly, soon"23:44
corvus(basically, i'm wondering if we're at the beginning of an apparmor journey on noble)23:44
corvus(rather than haproxy being the beginning and the end; i dunno honestly)23:45
clarkbya I think that may be the case23:47
clarkbapparmor is no longer a little toy in the corner but something important to the system23:47
clarkband ya I'm mostly trying to figure out what it wants from me to make this work. Adding that attach_disconnected flag did not help23:47
clarkbhttps://lists.ubuntu.com/archives/apparmor/2019-August/012014.html I think this is related also "Hopefully we'll have something more pleasing in the future, but this is where it's at today." from circa 201923:52
clarkbbut I recognize that person as someone in the ubuntu security irc channel maybe we can ask there23:52
