cardoe | I’ve raised it to them. | 00:18 |
---|---|---|
cardoe | My dev boxes are behaving weird too | 00:18 |
fungi | thanks. we can probably extract some exact timestamps since we have nodepool interacting with the api fairly continuously | 00:20 |
fungi | if they want anything to correlate | 00:20 |
cardoe | Yeah that’s what they are bugging me for. | 00:29 |
cardoe | I’m a horrible user cause I just said I dunno. You figure it out. | 00:29 |
fungi | yeah, i worked support at a service provider for many years... you're "that guy" ;) | 00:31 |
fungi | lemme see what i can dig up | 00:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Mirror docker tool images https://review.opendev.org/c/opendev/system-config/+/941304 | 00:35 |
clarkb | corvus: ^ fyi. I didn't do debian:testing because that seems like one that we might be able to find a different image for or at least standardize on the "base image for testing random container things" | 00:35 |
clarkb | I didn't want to assume that debian:testing is the image that makes the most sense for that off the bat | 00:35 |
clarkb | please double check I got all of the names and tag values right | 00:36 |
*** dan_ is now known as dan_with | 00:37 | |
clarkb | I got distracted but I'm realizing I should've enqueued the sighup fix directly to the gate. Oh well | 00:37 |
clarkb | at this point it's mostly done in check I think | 00:37 |
fungi | cardoe: these are utc timestamps where nodepool was unable to get a server list from the nova api in flex sjc3: https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/ | 01:04 |
clarkb | if the sighup fixup lands I'll have to make some noop changes to docker-compose.yaml and/or the haproxy config on the servers by hand then see if tomorrow's daily run triggers the handler successfully (though it should be covered in CI so I'm not too worried about it) | 01:05 |
corvus | clarkb: i agree. maybe apache httpd or something? | 01:30 |
Clark[m] | ++ | 02:02 |
opendevreview | Merged opendev/system-config master: Perform haproxy HUP signals with kill https://review.opendev.org/c/opendev/system-config/+/941256 | 02:15 |
fungi | yay | 02:15 |
Clark[m] | The deploy jobs were successful though they should noop. Anyway tomorrow we can update DNS and then I'll edit the docker compose files so that periodic jobs flip them back and emit the signals | 03:35 |
cloudnull | o/ | 04:02 |
clarkb | welcome back | 04:18 |
clarkb | I'm not sure I have any real useful debugging info. fungi had that and provided https://paste.opendev.org/show/bvu1Nl9ImtdvqjHvWkfb/ tl;dr seems to be odd network connectivity issues to cloud resources, from cloud resources and to the cloud api in rax flex | 04:19 |
cloudnull | hey clarkb - do you or fungi have anything on what was happening when the issue was observed? | 04:29 |
cloudnull | I also have a new region for you all to beat up if interested? | 04:30 |
clarkb | cloudnull: I mentioned that two test jobs failed because they couldn't look up ubuntu package repo servers in DNS. Fungi then asked if it happened with raxflex because he apparently saw connectivity issues to a personal server in the region as well as problems getting to the nova api | 04:30 |
clarkb | definitely interested | 04:30 |
clarkb | so basically I saw outbound connectivity have a disruption and fungi noticed inbound connectivity issues | 04:31 |
clarkb | the dns requests would've been over port 53 (likely udp) and not using dns over tls or dns over https | 04:31 |
cloudnull | was that all today? | 04:31 |
clarkb | cloudnull: yes | 04:32 |
clarkb | we configure local unbound to resolve against 1.1.1.1 and 8.8.8.8 | 04:32 |
cloudnull | sounds good, we'll look into it. | 04:32 |
clarkb | (we use the ipv6 equivalents in clouds with ipv6 but rax flex doesn't yet iirc) | 04:32 |
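For reference, a minimal sketch of that kind of unbound forwarding setup, assuming a conf.d-style drop-in (the file path is a guess; the resolver addresses are the ones mentioned above):

```
# hypothetical /etc/unbound/unbound.conf.d/forwarders.conf
forward-zone:
    name: "."
    forward-addr: 1.1.1.1
    forward-addr: 8.8.8.8
```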
cloudnull | I have no idea what might have happened, but we'll find out :D | 04:32 |
cloudnull | yeah, no IPv6 yet, but soon-tm | 04:32 |
clarkb | cloudnull: the timestamps in that paste would've been from nodepool logs accessing cloud apis for that cloud region | 04:33 |
cloudnull | I wanted to get it done sooner, but it just didn't happen ... | 04:33 |
cloudnull | +1 | 04:33 |
clarkb | and I think those are timestamps for failures to reach the api from nl01.opendev.org (hosted in rax classic dfw) | 04:33 |
cloudnull | https://keystone.api.dfw3.rackspacecloud.com is the new region, you should be able to get access to it exactly the same as SJC | 04:35 |
clarkb | that should be straightforward on our end. Just add a new region to clouds.yaml and then spin things up | 04:35 |
cloudnull | 👍 | 04:36 |
cloudnull | Is there a nodepool PR I need to push for that to happen? | 04:37 |
clarkb | nope I think we should be able to do it from our end | 04:38 |
clarkb | we need to update clouds.yaml contents, then add the region to our "cloud launcher" which sets up keys and security groups and so on, then we boot a mirror and then finally add the region to nodepool. We can ask questions if we run into anything unexpected in that process but I suspect it will be similar to sjc | 04:39 |
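A hedged sketch of the clouds.yaml side of that; the cloud name, region name, and credentials below are illustrative placeholders, not opendev's actual config:

```yaml
# hypothetical clouds.yaml fragment for the new region
clouds:
  raxflex-dfw3:
    region_name: DFW3          # assumed region identifier
    auth:
      auth_url: https://keystone.api.dfw3.rackspacecloud.com
      username: example-user   # placeholder credentials
      password: example-secret
      project_name: example-project
```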
cloudnull | sounds good. | 04:41 |
cloudnull | RE: networking tomfoolery, I've got some threads going - likely won't have an answer this evening but I'll keep you posted as we work through it. | 04:41 |
Clark[m] | No rush it's getting late | 05:45 |
opendevreview | QianBiao Ng proposed openstack/diskimage-builder master: Fix: ramdisk image build issues https://review.opendev.org/c/openstack/diskimage-builder/+/755056 | 10:58 |
karolinku | Hello folks. Question about glean and startup environment. I'm working on keyfiles support for C10. I noticed that most distros manually populate their networking files (like ifcfg) based on the config drive, but in the case of keyfiles, the NetworkManager documentation says it is not recommended to edit them manually. So can nmcli be used for setting up the config? | 12:17 |
fungi | karolinku[m]: i guess my only concern would be getting the sequencing correct, since glean can write config files at any time but nmcli may not be able to interact with interfaces until later during the boot process | 14:05 |
Clark[m] | I think there have been issues with network manager starting before config is in place too. I want to say that was the genesis of problems around failed DHCP (or was it RAs?) leading to the interface being improperly configured. Git logs in glean or dib probably have details | 14:15 |
fungi | yeah, i guess the trick would be convincing nm to "do nothing" until instructed by glean | 14:18 |
Clark[m] | I think it had to do with network manager marking interfaces down if default DHCP failed on startup (and some clouds don't do DHCP) | 14:19 |
Clark[m] | Ya either you configure it first or configure it to idle | 14:19 |
karolinku[m] | so I understand that both options are possible here - glean writing config prior to and after nm starting | 14:25 |
Clark[m] | Well previously it caused bugs if glean ran after | 14:26 |
Clark[m] | That may not be an issue on centos 10 I'm not sure. | 14:26 |
karolinku[m] | but nmcli can modify config dynamically | 14:30 |
karolinku[m] | so even if it starts with default config, glean can overwrite it later. unless it puts the interface down... | 14:32 |
fungi | problem was nm would automatically start trying to dhcp and slaac configure interfaces before glean was able to tell it not to | 14:33 |
fungi | ianw likely remembers the gritty details, i think he did a bunch of the troubleshooting for that when fedora first switched to nm | 14:35 |
karolinku[m] | the other dirty hack which came to my mind is to create an ifcfg file in c10, then use "nmcli migrate" to transform it into keyfile format | 14:36 |
clarkb | can we just write a keyfile? is the format stable and easy to work with? | 14:37 |
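For reference, the keyfile format is a small ini-style file; a minimal static-IPv4 sketch (interface name and addresses are illustrative placeholders):

```ini
# hypothetical /etc/NetworkManager/system-connections/ens3.nmconnection
# (must be owned by root with mode 0600 or NetworkManager ignores it)
[connection]
id=ens3
type=ethernet
interface-name=ens3

[ipv4]
method=manual
address1=203.0.113.10/24,203.0.113.1
dns=1.1.1.1;8.8.8.8;

[ipv6]
method=ignore
```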
clarkb | https://review.opendev.org/c/opendev/glean/+/688031 I think this is related to the issue I mentioned | 14:39 |
clarkb | in any case you'll have to figure out what works and makes sense | 14:39 |
clarkb | ultimately the actual order of things and the tools used don't matter too much as long as they work consistently | 14:40 |
karolinku[m] | doable, but not recommended as they wrote here https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/10-beta/html/configuring_and_managing_networking/networkmanager-connection-profiles-in-keyfile-format#networkmanager-connection-profiles-in-keyfile-format | 14:41 |
clarkb | infra-root if https://zuul-lb02.opendev.org/ looks good to you I'd like to proceed with https://review.opendev.org/c/opendev/zone-opendev.org/+/941167 I'll probably approve the dns change myself after breakfast and morning things regardless as the lb seems to work for me | 14:41 |
clarkb | karolinku[m]: note that warning is for manual editing by hand | 14:42 |
clarkb | karolinku[m]: it's the same reason you shouldn't edit sudoers manually but instead use visudo | 14:42 |
clarkb | in the case of glean it would be a programmatic tool that always emits the same content that can be tested too | 14:42 |
clarkb | (thus avoiding inadvertent typos) | 14:42 |
clarkb | basically I don't read that warning to say you shouldn't edit the files directly. I read it as saying you shouldn't just edit them manually by hand in a one off as it is easy to introduce bugs. That is different to editing them programmatically | 14:43 |
karolinku[m] | I understand it as those keyfiles are just a "store" of config, and the default way of modifying it is nmcli | 14:45 |
karolinku[m] | using nmcli the user just doesn't have to think about what section should be used or option chosen | 14:46 |
clarkb | "Typos or incorrect placements of parameters can lead to unexpected behavior. Therefore, do not manually edit or create NetworkManager profiles." this can be avoided entirely by using a programmatic tool like nmcli or glean | 14:46 |
clarkb | yes that is true with glean too. With glean the only input should be the config drive | 14:46 |
karolinku[m] | I'm just not a networking expert, so I don't even know if I'd be able to write a proper config in fact | 14:46 |
karolinku[m] | because I'm the human parser between config drive and keyfile 😅 | 14:48 |
clarkb | I'm not saying don't use nmcli. I'm more saying I don't think that networkmanager is actually saying you must use nmcli in this case | 14:49 |
fungi | well, i expect the same could be said for writing a proper nm-interfacing script. you need the same degree of network knowledge either way, it's just a question of how you express what you want | 14:49 |
clarkb | but yes you have to understand how this works because you have to teach glean how it works | 14:49 |
clarkb | the degree of understanding may vary based on additional abstractions and tools employed to simplify things | 14:49 |
fungi | though maybe my years as a network engineer have made me cynical of the people who write network tools | 14:49 |
karolinku[m] | wrt config drive - for learning purposes I used https://opendev.org/opendev/glean/src/branch/master/glean/tests/fixtures/vexxhost/mnt/config/openstack/latest/network_data.json | 14:52 |
karolinku[m] | how is it generated for vms? | 14:52 |
clarkb | karolinku[m]: openstack nova generates the data and attaches it to the VMs either as an iso or fat32 filesystem iirc | 14:53 |
clarkb | the filesystem has a specific label that glean looks for and mounts to /mount/config or something like that. Then it loads the json data off of it. The example data you link to is sanitized from a real cloud instance in that cloud provider | 14:54 |
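The general shape of that network_data.json follows the OpenStack network metadata format; a trimmed, illustrative sketch (all values made up, see the linked fixture for a real example):

```json
{
  "links": [
    {"id": "tap0", "type": "phy",
     "ethernet_mac_address": "fa:16:3e:00:00:01", "mtu": 1500}
  ],
  "networks": [
    {"id": "network0", "link": "tap0", "type": "ipv4",
     "ip_address": "203.0.113.10", "netmask": "255.255.255.0",
     "routes": [{"network": "0.0.0.0", "netmask": "0.0.0.0",
                 "gateway": "203.0.113.1"}]}
  ],
  "services": [{"type": "dns", "address": "8.8.8.8"}]
}
```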
fungi | note that we interface with cloud providers running varying vintages of openstack, and what you can expect from configdrive's network metadata has changed over time | 14:56 |
clarkb | I think for the stuff opendev cares about it's been fairly stable compared to what is in our example test data | 14:56 |
clarkb | this may not be true for ironic's use cases as they are more complicated with vlans and bonds | 14:57 |
karolinku[m] | ok, thanks for the tips! | 15:15 |
mnaser | lol, what are the crazy odds two different changes get the same change id.. https://review.opendev.org/q/Ic8aaa0728a43936cd4c6e1ed590e01ba8f0fbf5b | 15:16 |
clarkb | it's common for backports because people cherry pick them and carry the change id over | 15:17 |
mnaser | yeah but in that case it's two separate changes + their backports lol | 15:17 |
fungi | yeah, change ids are only required to be unique for a specific project+branch combo | 15:17 |
clarkb | oh neat I see its one change to nova + backports and a different one to manila + backports | 15:18 |
fungi | i have a feeling that's not random chance, and rather that one of them copied the change id from the other | 15:18 |
clarkb | ya random=$({ git var GIT_COMMITTER_IDENT ; echo "$refhash" ; cat "$1"; } | git hash-object --stdin) is how change id gets generated. $refhash is git rev-parse HEAD | 15:19 |
clarkb | just GIT_COMMITTER_IDENT should be sufficient to produce different hashes here since these are different committers | 15:20 |
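Cleaned up, that recipe is roughly the following sketch of gerrit's commit-msg hook logic (not a verbatim copy of the hook):

```shell
# "$1" is the commit message file the commit-msg hook receives
refhash=$(git rev-parse HEAD)
{
  git var GIT_COMMITTER_IDENT  # differs per committer, so a collision implies copying
  echo "$refhash"
  cat "$1"
} | git hash-object --stdin    # the resulting sha-1 becomes the Change-Id
```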
mnaser | i guess technically with that we can find out who's the real holder of the change id :P | 15:23 |
fungi | sha-1 is not immune to collisions, but the chances are more remote than the chance that some human copied a commit message and started editing it but didn't wipe the old id footer | 15:23 |
mnaser | would have been a fun anomaly if that was the case :) | 15:24 |
fungi | cloudnull: back to yesterday's discussion, i've been observing network outages in sjc3 that last around 10-15 minutes almost every day (sometimes more than once a day) for the past week or so, and when it happens both virtual machines and the nova api are unreachable from everywhere i try, so guessing it's something upstream of the entire cloud checking out at random | 15:26 |
fungi | routes falling out of the igp? overloaded state tracking in some middlebox? whatever | 15:27 |
clarkb | ok zuul-lb02 continues to work for me. I'm going to approve the change to update dns now | 15:56 |
clarkb | worst case only the zuul dashboard becomes unreachable and we have to hit them directly with /etc/hosts overrides while we revert | 15:57 |
clarkb | so impact should be small if anything does go wrong | 15:57 |
clarkb | then once that is done I'll figure out loading extra permissions so that I can force merge the buildx fix in zuul-jobs so that we can fix nodepool testing so that we can fix dib testing | 15:58 |
clarkb | oh and then try again to deploy codesearch02 I guess | 15:58 |
clarkb | I need to rebase the codesearch dns update after the zuul.o.o dns update | 15:59 |
opendevreview | Merged opendev/zone-opendev.org master: Switch zuul.o.o to zuul-lb02 https://review.opendev.org/c/opendev/zone-opendev.org/+/941167 | 16:00 |
clarkb | ok already found a minor-ish problem :/ haproxy logging is set up with `log /dev/log local0` in the haproxy config. I think this sets up a socket for syslog logging. It appears that there may have been permissions issues for that under podman and/or noble so we aren't logging like we expect to | 16:08 |
clarkb | the service itself is working though | 16:08 |
clarkb | ok rsyslog is configured to set up a listen socket at /var/haproxy/dev/log which is a socket on lb01 but is a regular dir on lb02. I suspect there may be some startup race | 16:14 |
clarkb | rsyslogd: cannot create '/var/haproxy/dev/log': Address already in use <- that seems to be the case here. Looking at the ansible it isn't clear to me how this happened as we should restart rsyslog with the new config before ever creating that path | 16:17 |
clarkb | anyway what I'd like to do is stop the haproxy service (so the bind mount is stopped), move that log dir aside, restart rsyslogd then start haproxy again | 16:18 |
clarkb | this will be noticeable as a blip in zuul web service. Any concern with that cc infra-root corvus | 16:18 |
clarkb | if we prefer we can revert the dns update and do this surgery then update dns again. Maybe that is better | 16:18 |
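Spelled out, the surgery described above would be roughly the following (the compose file location is an assumption):

```shell
# hedged sketch of the recovery steps on zuul-lb02
docker-compose -f /etc/haproxy-docker/docker-compose.yaml down   # stop haproxy, releasing the bind mount
mv /var/haproxy/dev/log /var/haproxy/dev/log.bak                 # set the bogus directory aside
systemctl restart rsyslog                                        # let rsyslog recreate the socket
docker-compose -f /etc/haproxy-docker/docker-compose.yaml up -d  # bring haproxy back up
```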
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02" https://review.opendev.org/c/opendev/zone-opendev.org/+/941465 | 16:22 |
fungi | seems like it should be pretty quick | 16:22 |
fungi | no objection from me | 16:22 |
fungi | i doubt anyone will spot the blip, worst case they'll click reload in their browsers | 16:22 |
clarkb | fungi: ya my concern is that the bind mount for /var/haproxy/dev/log:/dev/log is what is converting the socket device to a directory | 16:23 |
clarkb | so mostly concerned I'll do all that work and be right back where I started so reverting to lb01 for now seems better | 16:23 |
fungi | it's a reasonable theory, i agree | 16:24 |
clarkb | if that is the case I feel like that is a major bug in podman. At the very least it should error rather than silently deleting content in your filesystem and replacing it with something different | 16:24 |
clarkb | I found the logs from the original rsyslog config update and restart and there were no complaints there. But I did a restart just now and it complains about not being able to change an existent file system entry | 16:24 |
clarkb | that implies it worked at first and then podman broke it (or haproxy I suppose) | 16:25 |
clarkb | anyway dns update is approved | 16:25 |
fungi | it's a fifo we pre-create right? | 16:25 |
clarkb | $AddUnixListenSocket /var/haproxy/dev/log | 16:25 |
clarkb | its a unix domain socket | 16:25 |
fungi | ah okay | 16:25 |
fungi | not a named pipe | 16:25 |
clarkb | anyway one step at a time. First undo dns update, then stop haproxy, clean up the filesystem and try again gathering data as we go | 16:27 |
opendevreview | Merged opendev/zone-opendev.org master: Revert "Switch zuul.o.o to zuul-lb02" https://review.opendev.org/c/opendev/zone-opendev.org/+/941465 | 16:28 |
clarkb | ok my local resolution flipped over to lb01 and the ttl has reset back to 300 after the first lb01 lookup. I think its safe to start operating on lb02 so will do so now | 16:36 |
corvus | just acking that i'm caught up and agree. | 16:37 |
clarkb | rsyslogd[56508]: rsyslogd: cannot create '/var/haproxy/dev/log': Permission denied | 16:38 |
clarkb | and now that I see ^ in the log I see that this appears to have happened when we deployed previously. The logs I saw earlier that looked clean must've been from earlier node setup | 16:39 |
fungi | so rsyslog was started when the permissions didn't allow creating the socket, then podman saw the thing being bindmounted didn't exist so helpfully created a directory there, then later when permissions got fixed haproxy couldn't create the socket because there was already a directory with the same name? | 16:40 |
clarkb | that's half right. permissions haven't gotten fixed and it is still failing on this. And drumroll please the culprit is apparmor | 16:41 |
* fungi is shocked. SHOCKED | 16:41 | |
clarkb | apparmor="DENIED" operation="mknod" class="file" profile="rsyslogd" name="/var/haproxy/dev/log" | 16:41 |
clarkb | I'm off to figure out if I can add local config to apparmor to allow this to happen | 16:41 |
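On Ubuntu the shipped profiles source a local override file, so one hedged possibility would be something like:

```
# hypothetical addition to /etc/apparmor.d/local/usr.sbin.rsyslogd
/var/haproxy/dev/log rwl,

# then reload the profile:
#   apparmor_parser -r /etc/apparmor.d/usr.sbin.rsyslogd
```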
* fungi sighs | 16:41 | |
corvus | the "creating missing directory" part is standard docker behavior | 16:41 |
corvus | (as in, dockerd would have done that too, not just podman) | 16:42 |
clarkb | ya in "positive" news this means podman isn't deleting anything it was never there in the first place so I'm happier about that | 16:42 |
clarkb | corvus: ack | 16:42 |
clarkb | and also this is a generic noble problem and doesn't have to do with podman or docker | 16:42 |
corvus | yeah i guess this is a "noble rsyslog upgrade" problem? | 16:42 |
fungi | because nobody uses rsyslog any more now that systemd-journald is a thing, i guess | 16:43 |
clarkb | corvus: yup | 16:43 |
corvus | yeah, totally our fault | 16:44 |
fungi | https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/1826294 maybe? | 16:45 |
clarkb | /var/lib/*/dev/log rwl, <- this rule already exists. Do we have a preference to move the haproxy syslog device into /var/lib/haproxy/dev/log or do we want a local rule to allow rwl against /var/haproxy/dev/log ? | 16:46 |
corvus | i'm fine bending to the will of the fhs | 16:47 |
fungi | were we explicitly creating /var/haproxy/dev/log previously? | 16:48 |
fungi | s/creating/configuring/ | 16:48 |
clarkb | fungi: yes that is what rsyslog configuration is set to create and is what we bind mount | 16:48 |
fungi | okay, but it's our own configuration choice sounds like, not a package-supplied default | 16:48 |
clarkb | yes this is all containers and our choice | 16:48 |
fungi | so yes, i agree moving the socket to where the shipped apparmor rules expect it makes sense | 16:49 |
corvus | yeah, we were trying to keep all the docker references together rather than bind-mounting things from all over the filesystem -- to try to make it more clear what's going on | 16:49 |
clarkb | corvus: that is my next question should we move everything? I think that is fairly easy for haproxy because it is so stateless | 16:49 |
fungi | unless we want to end up carting around an ever increasing amount of apparmor configuration too, i guess | 16:49 |
corvus | but i think that's not an important enough goal to push back against existing config | 16:49 |
clarkb | I think we can have ansible write out everything to /var/lib/haproxy and then we just go and clean up /var/haproxy when done | 16:49 |
corvus | clarkb: that works for me | 16:50 |
corvus | so we'd have /var/lib/haproxy/run and .../etc | 16:50 |
clarkb | yes | 16:50 |
fungi | wfm | 16:50 |
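As a sketch, the compose bind mounts after the move might look like this (the image tag and container-side paths are assumptions):

```yaml
# hypothetical docker-compose fragment after moving under /var/lib/haproxy
services:
  haproxy:
    image: docker.io/library/haproxy:lts   # assumed tag
    network_mode: host
    volumes:
      - /var/lib/haproxy/etc:/usr/local/etc/haproxy
      - /var/lib/haproxy/run:/var/lib/haproxy/run
      - /var/lib/haproxy/dev/log:/dev/log  # the rsyslog-created socket
```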
clarkb | I'll work on that and adding tests to ensure we're creating things. Though it still isn't clear to me if our test nodes have apparmor in an enforcing state | 16:51 |
corvus | signs point to no on that one, eh? | 16:51 |
clarkb | corvus: ya based on docker-compose kill / podman kill working on noble in CI | 16:51 |
fungi | well, also based on this not getting caught in testing | 16:52 |
clarkb | we don't explicitly test the haproxy logging I don't think but maybe we do and if so then ya | 16:52 |
clarkb | anyway I'll pause effort on lb02 and start working on a change to see how ugly this is. I don't expect it to be too bad even for the existing gitea and zuul proxies | 16:53 |
clarkb | I guess I should update the rsyslog config and double check it can create the socket at that location after all | 16:54 |
clarkb | I'll do that too | 16:54 |
clarkb | yup that works | 16:55 |
opendevreview | Merged zuul/zuul-jobs master: Install ca-certificates in the buildx image https://review.opendev.org/c/zuul/zuul-jobs/+/939823 | 17:01 |
fungi | gonna run some lunch errands, but should be back in about an hour | 17:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: Move haproxy config into /var/lib/haproxy https://review.opendev.org/c/opendev/system-config/+/941468 | 17:15 |
clarkb | the more I think about ^ the more I'm concerned that we might have unexpected fallout from the existing servers like with the old rsyslog socket path no longer being valid in the time between restarting rsyslog and haproxy | 17:15 |
clarkb | but I figure this is good for a first pass and others can review it and add concerns if they have them. I think absolute worst case we'll be able to manually undo bits and pieces here and there to get it up and running again if necessary | 17:16 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Add codesearch02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941140 | 17:18 |
clarkb | if CI is happy with ^ I'm going to go ahead and approve it in order to reduce the amount of rebase conflicts we have. The change to add codesearch02 is passing now so I think having it in dns is a good idea | 17:19 |
opendevreview | Merged opendev/zone-opendev.org master: Add codesearch02 to DNS https://review.opendev.org/c/opendev/zone-opendev.org/+/941140 | 17:30 |
opendevreview | Clark Boylan proposed opendev/system-config master: Collect apparmor status in ci jobs https://review.opendev.org/c/opendev/system-config/+/941471 | 17:35 |
clarkb | debugging the apparmor behavior seems like it may be important as we move more services to noble ^ that is a starting point | 17:36 |
clarkb | arg but editing that file runs all of our jobs... | 17:36 |
corvus | is that bad? :) | 17:37 |
clarkb | well it will eat docker hub quota but it won't prevent us from getting useful apparmor status info back so I guess its good in that regard | 17:37 |
corvus | clarkb: re restarting: seems like worst case is a brief outage that is just corrected by restarting haproxy? | 17:39 |
clarkb | corvus: yes I think so especially since haproxy seems to continue (logs a warning to stdout) if its main logging location isn't available as we saw on zuul-lb02 | 17:40 |
clarkb | basically I think worst case we identify whatever it was that didn't get transitioned properly then manually restart. Shouldn't be a huge impact if it happens and CI should confirm that things generally work | 17:40 |
clarkb | its just the transition that we don't test | 17:40 |
clarkb | I'm going to take a break while we wait on results from all the things | 17:42 |
clarkb | looks like the zuul job passed with the haproxy refactor which is a good sign that it works with the shift (again the main concern is the transition as we don't test that). Let me know what y'all think | 18:11 |
clarkb | we're also starting to get back apparmor_status info but none of the noble host jobs have run yet so don't have data to compare there yet | 18:11 |
corvus | refactor has a +2 from me | 18:12 |
clarkb | fungi: maybe you can take a look after lunch too? I should be able to be around all day today to help monitor and debug/fix should the transition be problematic | 18:27 |
fungi | yeah, looking now | 18:32 |
clarkb | system-config-run-base has empty apparmor_status output and errors that the apparmor_status command isn't found. But bridge does have apparmor status. This implies to me that we start out with no apparmor then as we deploy things we install apparmor. I'm wondering if we need to reboot for apparmor enforcement to take effect. Should know more once paste/zuul/grafana jobs complete | 18:34 |
fungi | `apt-file search bin/apparmor_status` tells me that the apparmor package ships a /usr/sbin/apparmor_status file | 18:35 |
fungi | so sounds like something is pulling in apparmor automatically as a dependency | 18:36 |
clarkb | looks like it comes from install-docker via upstream.yaml on !noble and on noble it must be a podman dependency? | 18:37 |
clarkb | but maybe it isn't a podman dependency so we don't actually get it on noble nodes in ci. If that is the case we can add apparmor as a dep to better align with the old docker system and what we appear to get in prod images | 18:38 |
clarkb | https://zuul.opendev.org/t/openstack/build/9bb4f267b2934e9fae08277c73f209a3/log/paste99.opendev.org/apparmor_status.txt ok I suspect that is what is going on. For all the old docker stuff we installed apparmor but now we don't for podman | 18:40 |
clarkb | I'm going to look at deps for podman packaging but I'll propose an update to install-docker to install apparmor for podman too | 18:40 |
clarkb | ya there is no direct dependency. I'm guessing that the ubuntu stance is that they just install apparmor by default so things don't have to strictly depend on it even when they have integration? | 18:43 |
opendevreview | Clark Boylan proposed opendev/system-config master: Install apparmor when installing podman https://review.opendev.org/c/opendev/system-config/+/941471 | 18:46 |
clarkb | that should fail on the zuul-lb thing | 18:46 |
clarkb | depending on how we want to do stuff we could put the haproxy refactor on top of ^ to confirm things are happy afterwards then reverse the order to make things mergable | 18:47 |
fungi | interestingly, apparmor has "optional" priority, not essential or important | 18:47 |
fungi | so i suspect it's just included as part of their default base package selection at image build time | 18:48 |
clarkb | ya | 18:48 |
clarkb | it honestly might not be a terrible idea for us to ensure we include it in our images too since that is what ubuntu in the wild operates with | 18:48 |
clarkb | that might break a lot of stuff but maybe that is for the best? could possibly make the change only in the zuul-launcher images and then that can be part of migrating to the new images | 18:48 |
fungi | agreed, though it could be disruptive for jobs that aren't already dragging it in | 18:48 |
clarkb | corvus: ^ not sure if you want to include that in the list of things to test | 18:48 |
clarkb | noble in particular pulls in apparmor 4.0 which includes rules for a lot more stuff (this is what got podman for example) | 18:49 |
clarkb | rsyslog seems to be new rules too | 18:49 |
clarkb | feels a bit like I'm spinning my wheels on this stuff but this has all been good to learn and that was the primary goal. Ensuring we're confident in our podman deployment | 18:50 |
clarkb | less of a success replacing services more of a success making our podman better | 18:50 |
clarkb | one thing to note: landing 941471 might be difficult with the way docker rate limits work | 18:52 |
clarkb | we may need to do our best effort to ensure we get coverage across as many things as possible then reduce the testing we run before merging it | 18:52 |
corvus | clarkb: let's exercise how easy it is to make new images and just make a dedicated apparmor image; i think i'd like to keep our current images so that we're still able to test. if the apparmor image works, we can switch over to it. | 18:58 |
clarkb | corvus: that seems reasonable | 18:59 |
clarkb | currently we add packages to the infra package needs list but that would be shared by the two images so we need another mechanism. maybe just another small element with additional package deps | 18:59 |
corvus | (if you wanted to go crazy and define an image-validation pipeline with a job for an apparmor image... technically that should work but it's untested in prod :) | 19:01 |
corvus | clarkb: maybe an element that gets packages from an env var, and we set that env var in the job? that would be nice and general purpose for easy changes like this | 19:01 |
clarkb | ++ | 19:02 |
clarkb | it's possible dib already has something like that too | 19:02 |
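A minimal sketch of such an element; the element name, hook path, and DIB_EXTRA_PACKAGES variable are all assumptions:

```shell
#!/bin/bash
# hypothetical elements/extra-packages/install.d/10-extra-packages
set -eu
if [ -n "${DIB_EXTRA_PACKAGES:-}" ]; then
    # install-packages is dib's distro-agnostic package install helper
    install-packages ${DIB_EXTRA_PACKAGES}
fi
```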
clarkb | https://zuul.opendev.org/t/openstack/build/5d7a88d99c9a4c09bc9ffea273bd8d23/log/paste99.opendev.org/apparmor_status.txt shows that installing apparmor on noble as part of podman installation looks like production | 19:23 |
clarkb | so ya now we have a decision to make on whether or not we take clarkb's manual testing as sufficient and move forward with haproxy updates or try to land 941471 first | 19:23 |
clarkb | just note that 941471 is likely to be difficult to land normally due to how many docker consuming jobs it runs | 19:24 |
dpanech | Hi, zuul jobs are not starting on this review, could someone help? https://review.opendev.org/c/starlingx/helm-charts/+/940993 | 19:32 |
clarkb | dpanech: hit refresh | 19:36 |
fungi | fastest fix ever ;) | 19:36 |
dpanech | huh awesome, did you do something or was it just taking longer than usual? | 19:49 |
clarkb | I didn't do anything | 19:49 |
clarkb | dpanech: did you check the zuul status page? | 19:49 |
fungi | my guess is that if you went to the zuul status dashboard you'd have seen it in a "queued" state for about an hour while some provider was timing out trying to satisfy a node request for it | 19:50 |
fungi | since it doesn't look from the usage graphs like we were maxxed out on available quota | 19:51 |
dpanech | you mean this status page? https://zuul.opendev.org/t/openstack/builds | 19:52 |
fungi | the build result reports that the job itself only took a couple of minutes to run, but didn't start running for about an hour after gerrit would have sent the event | 19:52 |
clarkb | dpanech: https://zuul.opendev.org/t/openstack/status this one | 19:52 |
fungi | dpanech: https://zuul.opendev.org/t/openstack/status | 19:52 |
clarkb | builds shows you a historical view on what has run and completed. status shows you the live status | 19:52 |
dpanech | ok good to know, thanks | 19:53 |
clarkb | I'm going to eat lunch now. fungi I guess let me know what you think of https://review.opendev.org/c/opendev/system-config/+/941468 | 19:53 |
fungi | clarkb: huh, that change ran fewer jobs than i anticipated | 19:53 |
clarkb | fungi: it ran jobs for the two services that use haproxy | 19:55 |
clarkb | the change that installs apparmor and collects apparmor_status info runs against everything but is separate | 19:55 |
fungi | yeah, for some reason i thought we had more | 19:55 |
clarkb | ah | 19:55 |
frickler | fyi https://blog.codeberg.org/we-stay-strong-against-hate-and-hatred.html | 19:59 |
fungi | somehow 941471 only failed 2 jobs | 20:26 |
fungi | one on a docker rate limit, the other looks like the executor got disconnected from a test node mid-test | 20:27 |
clarkb | amazing. But also thats a good indication it doesn't break anything too bad | 20:28 |
fungi | yes, my takeaway as well basically | 20:28 |
clarkb | might be worth a recheck to ensure those two jobs succeed then we can think about direct enqueuing to the gate or even force merging if we want to get it in | 20:28 |
clarkb | it occurred to me that the load balancer apparmor problems may not cause that change to fail because we don't actually assert anything about those logs. My change to move things in haproxy does add those asserts though | 20:29 |
fungi | yes, that's helpful | 20:29 |
fungi | i've rechecked 941471 for more results | 20:30 |
clarkb | re https://review.opendev.org/c/opendev/system-config/+/941468 I can make time all afternoon to monitor and fixup things if we want to approve it now | 20:55 |
clarkb | I'm a bit worried about tomorrow and Friday as we're supposed to get badish weather which always comes with the risk of power outage, but if we think waiting is best I'm fine with that too | 20:56 |
fungi | yeah, i can approve it now, just didn't want to have it land while you were lunching | 20:59 |
clarkb | a tuna sandwich with spicy pickles has been consumed | 21:00 |
fungi | there's nothing in that description i dislike | 21:00 |
fungi | sounds amazing | 21:00 |
fungi | my sandwich too had pickles, but it was a cuban | 21:00 |
clarkb | the other day I had a grilled cheese and ham sandwich and only after i had eaten it did I remember I had pickles to make it an almost cubano. | 21:03 |
Clark[m] | fungi: the gitea job failed for haproxy refactor. Do we want to dequeue and reenqueue to the gate to try again asap or let it finish then recheck to go back through check first? | 21:34 |
fungi | i can reenqueue it | 21:35 |
Clark[m] | It failed pulling haproxy-statsd from docker hub | 21:35 |
opendevreview | Aurelio Jargas proposed zuul/zuul-jobs master: Add role: `ensure-python-command`, refactor similar roles https://review.opendev.org/c/zuul/zuul-jobs/+/941490 | 21:54 |
fungi | system-config-run-tracing hit a dockerhub pull rate limit on 941471 this time, everything else passed, but we still haven't seen that job succeed for sure. i'll recheck again | 22:33 |
clarkb | I've ssh'd into gitea-lb02, zuul-lb01 and zuul-lb02 | 22:45 |
clarkb | will keep an eye on them when the change merges which should happen shortly | 22:45 |
clarkb | at this rate the jobs will run after hourly jobs. Oh well eventually we will get there | 22:59 |
opendevreview | Merged opendev/system-config master: Move haproxy config into /var/lib/haproxy https://review.opendev.org/c/opendev/system-config/+/941468 | 23:00 |
clarkb | it snuck in just ahead of hourly jobs | 23:02 |
clarkb | gitea is running first | 23:02 |
clarkb | containers restarted and /var/lib/haproxy exists as does the syslog socket. /var/log/haproxy is being written to | 23:04 |
clarkb | and the job succeeded. I am able to hit https://opendev.org/ if there was a blip it must've been a very short one | 23:04 |
corvus | nice! | 23:04 |
clarkb | zuul-lb01 and zuul-lb02 seem to have experienced similar. Things shifted over and containers restarted | 23:05 |
clarkb | I don't see a /var/log/haproxy on zuul-lb02 yet | 23:06 |
clarkb | so trying to figure out if things are sad on the next thing :/ | 23:07 |
fungi | deploy reported success for both | 23:07 |
clarkb | as far as I can tell both production load balancers are happy with their new "homes" | 23:07 |
clarkb | still getting [ALERT] (11) : sendmsg()/writev() failed in logger #1: Permission denied (errno=13) on zuul-lb02 | 23:08 |
clarkb | srw-rw-rw- 1 root root 0 Feb 12 23:04 log which should allow anything to write to it? | 23:08 |
fungi | yeah, unless... apparmor again? | 23:09 |
clarkb | apparmor="DENIED" operation="sendmsg" class="file" info="Failed name lookup - disconnected path" error=-13 profile="rsyslogd" name="var/lib/haproxy/dev/log" pid=60555 comm="haproxy" requested_mask="r" denied_mask="r" | 23:09 |
clarkb | is the problem that the container side sees the path as var/lib/haproxy/dev/log and not /var/lib/haproxy/dev/log? | 23:12 |
clarkb | yes it seems to be related to chroots/namespaces | 23:13 |
clarkb | "attach_disconnected This forces AppArmor to attach disconnected objects to the task's namespace and mediate them as though they are part of the namespace. WARNING this mode is unsafe and can result in aliasing and access to objects that should not be allowed. Its intent is a debug and policy development tool" | 23:14 |
clarkb | I can't help but feel these apparmor rules are highly restrictive but also often useless. For example podman kill fails but I can just use kill(1) | 23:18 |
clarkb | now I can't allow access to /var/lib/haproxy/dev/log because I'm bind mounting it into a container with a different filesystem namespace. I could instead talk to a udp port listening on localhost? | 23:19 |
clarkb | what have we accomplished here? | 23:19 |
clarkb | https://documentation.suse.com/sles/12-SP5/html/SLES-all/cha-apparmor-profiles.html#sec-apparmor-profiles-glob-variables suse documents doing magic with variables for chroot prefixes here which isn't quite what we need | 23:20 |
clarkb | we should be able to test this via https://review.opendev.org/c/opendev/system-config/+/941471 at least (which will now fail due to missing /var/log/haproxy.log) | 23:23 |
clarkb | does anyone know anyone that understands apparmor? I feel like this is the sort of thing that should be doable. Another option would be to change how haproxy logs https://docs.haproxy.org/2.2/configuration.html#4.2-log | 23:24 |
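For reference, the localhost UDP variant would look roughly like this on both sides (the port choice is arbitrary):

```
# haproxy global section: log to a localhost UDP listener instead of the socket
global
    log 127.0.0.1:514 local0

# rsyslog side: enable the UDP input (hypothetical drop-in)
module(load="imudp")
input(type="imudp" address="127.0.0.1" port="514")
```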
mordred | are there any people who understand apparmor? | 23:31 |
fungi | apparmorsmiths | 23:34 |
fungi | they're the only ones who can # it into shape | 23:34 |
clarkb | thinking about this more the old problem was that rsyslog was not allowed to write to /var/haproxy/dev/log to create the socket. So we moved it to /var/lib/haproxy/dev/log and now rsyslog can create the socket | 23:35 |
clarkb | the problem now is that haproxy running in a different namespace is trying to write to that socket and getting denied | 23:35 |
clarkb | looking in /etc/apparmor.d/abstractions/base there seems to be a global rule to allow things to write to /dev/log | 23:35 |
clarkb | I wonder if the problem now is more that we need to have a rule that says haproxy in a container can write to what is actually /var/lib/haproxy/dev/log. Though it is odd that profile="rsyslogd" appears in the DENIED message | 23:36 |
clarkb | since that profile shouldn't apply to our haproxy process? | 23:36 |
fungi | unless apparmor somehow tracks what profile was associated with the process that created that resource? | 23:38 |
clarkb | in the beginning of the file we have `profile rsyslogd /usr/sbin/rsyslogd {` I believe rsyslogd there is a logical name and can be any string (maybe with limitations on characters) then /usr/sbin/rsyslogd is an attachment conditional that is supposed to restrict where the profile appears | 23:40 |
clarkb | but now that I think of it more it's a socket so both rsyslog and haproxy are cooperating over it and maybe that is how we're tripping over this | 23:40 |
clarkb | I'm going to try attach_disconnected just to see if that fixes it | 23:42 |
clarkb | I don't like it because it applies to the whole profile and we really just need it for this one resource | 23:42 |
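For context, the flag goes in the profile header, i.e. something like this abbreviated sketch of the shipped profile:

```
profile rsyslogd /usr/sbin/rsyslogd flags=(attach_disconnected) {
  #include <abstractions/base>
  # ... rest of the shipped rsyslogd profile unchanged ...
}
```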
corvus | as much as i hate to say it; if they are going to start including apparmor profiles for docker stuff, we may want to resign ourselves to just needing to write the occasional supplemental profile. it might be worth giving in and going ahead and doing that right now rather than continuing to fight it. bonus: moar sekurity? | 23:43 |
corvus | er by "right now" i did not mean "immediately" i meant "correctly, soon" | 23:44 |
corvus | (basically, i'm wondering if we're at the beginning of an apparmor journey on noble) | 23:44 |
corvus | (rather than haproxy being the beginning and the end; i dunno honestly) | 23:45 |
clarkb | ya I think that may be the case | 23:47 |
clarkb | apparmor is no longer a little toy in the corner but something important to the system | 23:47 |
clarkb | and ya I'm mostly trying to figure out what it wants from me to make this work. Adding that attach_disconnected flag did not help | 23:47 |
clarkb | https://lists.ubuntu.com/archives/apparmor/2019-August/012014.html I think this is related also "Hopefully we'll have something more pleasing in the future, but this is where it's at today." from circa 2019 | 23:52 |
clarkb | but I recognize that person as someone in the ubuntu security irc channel maybe we can ask there | 23:52 |