*** dviroel|out is now known as dviroel | 00:05 | |
*** dviroel is now known as dviroel|out | 00:20 | |
*** rlandy|bbl is now known as rlandy|out | 01:17 | |
*** mazzy5098812929580859404 is now known as mazzy50988129295808594 | 03:14 | |
*** soniya29 is now known as soniya29|rover | 04:19 | |
*** marios is now known as marios|ruck | 04:58 | |
*** ysandeep|out is now known as ysandeep | 05:05 | |
*** pojadhav|afk is now known as pojadhav | 05:11 | |
marios|ruck | o/ anyone know what's up with NODE_FAILURE today https://zuul.opendev.org/t/openstack/builds?result=NODE_FAILURE | 06:48 |
---|---|---|
marios|ruck | ysandeep: soniya29|rover: ^^ fyi | 06:48 |
soniya29|rover | marios|ruck, ack | 06:48 |
frickler | maybe some issue with recently built c9s images? we failed to build new images for a couple of days because of a full disk situation, that was resolved yesterday, so today you are running with fresh images for the first time in a week or so | 06:55 |
marios|ruck | thanks frickler might be... | 07:17 |
marios|ruck | unfortunately there are no logs available; can't even see where it ran or anything else | 07:18 |
frickler | marios|ruck: in the nodepool logs there is only: Timeout waiting for connection to ... on port 22 | 07:19 |
frickler | so likely some IP setup/firewall/sshd change | 07:20 |
marios|ruck | thanks | 07:22 |
*** ysandeep is now known as ysandeep|lunch | 07:23 | |
opendevreview | chzhang8 proposed openstack/project-config master: register and bring back tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/800442 | 07:34 |
frickler | might also be provider specific or not 100% failing, since I do see some working nodes being spawned | 07:34 |
*** jpena|off is now known as jpena | 07:34 | |
opendevreview | chzhang8 proposed openstack/project-config master: register and bring back tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/800442 | 07:55 |
*** gthiemon1e is now known as gthiemonge | 08:10 | |
opendevreview | chzhang8 proposed openstack/project-config master: create tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/800442 | 08:28 |
opendevreview | chzhang8 proposed openstack/project-config master: create tricircle under x namespaces https://review.opendev.org/c/openstack/project-config/+/800442 | 09:02 |
*** ysandeep|lunch is now known as ysandeep | 10:07 | |
*** soniya29|rover is now known as soniya29|rover|lunch | 10:08 | |
*** rlandy|out is now known as rlandy | 10:25 | |
*** soniya29|rover|lunch is now known as soniya29|rover | 10:27 | |
*** marios|ruck is now known as marios|ruck|lunch | 11:09 | |
*** bhagyashris_ is now known as bhagyashris | 11:09 | |
*** marios|ruck|lunch is now known as marios|ruck | 11:15 | |
*** pojadhav is now known as pojadhav|afk | 11:17 | |
*** soniya29|rover is now known as soniya29|rover|afk | 11:18 | |
*** dviroel_ is now known as dviroel | 11:22 | |
*** ysandeep is now known as ysandeep|brb | 11:49 | |
*** ysandeep|brb is now known as ysandeep|afk | 12:05 | |
*** soniya29 is now known as soniya29|rover | 12:07 | |
*** pojadhav|afk is now known as pojadhav | 12:33 | |
*** ysandeep|afk is now known as ysandeep | 13:14 | |
fungi | frickler: do the ssh timeouts seem to be for the same providers most of the time? | 13:16 |
frickler | fungi: I didn't look systematically enough to tell. I did check some freshly created node on ovh where I could login | 13:18 |
frickler | quick "grep Timeout /var/log/nodepool/launcher-debug.log|wc" finds 2200 on nl01, 0 on nl02, 158 on nl04, so very likely only some providers affected | 13:21 |
frickler | might be similar to the fedora 35 issues we had? maybe ianw can check later | 13:22 |
fungi | or may just be one of our larger providers having significant network disruption | 13:23 |
frickler | but it seemed specific to centos, I haven't seen other node types affected | 13:23 |
fungi | and it's not the nested-virt ones specifically like we saw with octavia, right? | 13:24 |
fungi | the normal centos-9-stream label? | 13:25 |
frickler | yes, the latter | 13:25 |
fungi | agreed, more likely an image problem in that case | 13:26 |
fungi | okay, i'm disappearing now, but hope to be back around 21:00 utc | 13:59 |
*** marios|ruck is now known as marios | 15:00 | |
*** marios is now known as marios|ruck | 15:01 | |
clarkb | re NODE_FAILURE that should only happen if all the clouds fail to boot the instance | 15:29 |
clarkb | which implies we have problems on all the clouds? though frickler's greps indicate nl03 is probably fine? | 15:30 |
clarkb | but ya considering that fedora did that to us recently I wouldn't be surprised if this is the same issue over again | 15:30 |
clarkb | is there a changelog for stream? | 15:31 |
clarkb | I've just confirmed that the centos-9 image updated 12 hours ago and the prior image is from 12 days ago | 15:36 |
clarkb | file sizes haven't differed dramatically in a way that would indicate some major disk writing problem. Likely due to centos updates itself | 15:38 |
*** ysandeep is now known as ysandeep|out | 15:40 | |
clarkb | we don't seem to have the log for the build from 12 days ago so hard to compare package lists. I guess I can manually boot off of that image and find the equivalent of dpkg -l | 15:41 |
clarkb | ovh bhs1 has no valid host found errors | 15:48 |
clarkb | and gra1 is offline for maintenance. amorin is that maintenance still ongoing or can we return gra1 to service? | 15:49 |
amorin | gra1 is done! | 15:50 |
amorin | let me remove WIP from my change | 15:50 |
clarkb | thanks! I'll approve it when that is done | 15:50 |
clarkb | any idea why bhs1 is complaining about no valid hosts found? Probably not urgent but I noticed it when trying to find logs for this centos-9 problem | 15:50 |
amorin | https://review.opendev.org/c/openstack/project-config/+/835422 | 15:50 |
amorin | clarkb will check about BHS1 | 15:51 |
clarkb | approved thanks again | 15:51 |
clarkb | #status log Paused centos-9-stream image building to ensure it doesn't rebuild and delete our prior image before we can properly debug related NODE_FAILURES | 15:53 |
opendevstatus | clarkb: finished logging | 15:53 |
clarkb | infra-root ^ fyi I did that just now to keep the option of deleting the current image and using the 12 day old one open while we sort through this | 15:53 |
*** dviroel_ is now known as dviroel | 15:54 | |
clarkb | ok I think I've confirmed they fail to boot in the iweb cloud | 15:57 |
clarkb | but it looks like a lot of stuff is failing to boot there? | 15:58 |
clarkb | which may be the expected shutdown beginning? | 15:59 |
opendevreview | Merged openstack/project-config master: Revert "[OVH/GRA1] Disable nodepool temporarily" https://review.opendev.org/c/openstack/project-config/+/835422 | 15:59 |
amorin | clarkb: I found the issue, fix in progress | 16:03 |
amorin | about bhs1 | 16:03 |
clarkb | amorin: thank you ! | 16:04 |
clarkb | re iweb it seems that we might have a problem deleting things more than launching. Then centos-9 is specifically a launching problem? | 16:04 |
clarkb | I'm going to look at another cloud for centos-9 info | 16:04 |
*** rlandy is now known as rlandy|rover | 16:09 | |
clarkb | ok I think I may understand a high level of what is happening. It appears that centos-9-stream can boot as well as any other image in other clouds. Except it isn't booting successfully in rax due to being unable to gather host keys | 16:10 |
clarkb | there is some instability in other clouds that results in boot failures but in general they can boot the image I think. Things like the no valid host found in bhs1 and that seems to affect inmotion a bit as well (though inmotion is using most of its quota) then gra1 is down. Also iweb cloud had problems booting a few hours ago which seems to be fine now but it can't delete nodes so | 16:11 |
clarkb | we don't have much capacity there | 16:11 |
clarkb | Anyway being unable to boot centos-9-stream in rax means it is far more likely to hit NODE_FAILURE if it wins the lottery in the other clouds | 16:12 |
*** marios|ruck is now known as marios|out | 16:12 | |
clarkb | And rax is xen and I wouldn't be surprised if centos-9-stream broke that somehow. It also requires static network config which maybe glean + networking manager + centos-9 broke | 16:13 |
*** dviroel is now known as dviroel|lunch | 16:14 | |
clarkb | I'm manually booting an instance in ord to see if it ever pings and if the console shows anything helpful | 16:18 |
clarkb | so far the console boot output stops at resizing the root disk | 16:20 |
clarkb | oh there it goes but how do I see the history heh | 16:20 |
clarkb | I can't page up past the login prompt now ugh | 16:20 |
clarkb | it does not ping yet but I do get a login prompt so this isn't a completely failed boot | 16:21 |
clarkb | this appears to be a failure to configure networking. Likely rax is affected as the only cloud that a fallback to dhcp doesn't work on | 16:23 |
clarkb | I'll try to boot a recovery instance next and look at the network config I guess. | 16:24 |
clarkb | I guess rescuing uses the same image you are trying to rescue? I feel like I learn this lesson every time I do this | 16:32 |
clarkb | I've set the rescue image to debian bullseye and it still doesn't ping | 16:38 |
clarkb | no ssh connectivity yet either. I'll see if this changes in a minute or two but this may be a dead end | 16:38 |
*** jpena is now known as jpena|off | 16:39 | |
clarkb | There appear to have been multiple kernel updates in the period of time where we were unable to build images | 16:44 |
clarkb | And my rescued instance with a bullseye image isn't any better than the one booting off of itself? I'm so confused | 16:45 |
clarkb | I'm able to get into the rescue instance via the webconsole | 16:50 |
clarkb | but no networking confirmed there | 16:50 |
clarkb | maybe because I'm booting with config drive set to true? | 16:50 |
clarkb | sysconfig/network doesn't appear to have any glean content in it | 16:52 |
clarkb | systemctl list-units -a doesn't show a glean unit | 16:54 |
clarkb | oh that is because it's systemctl for the rescue node, this is fun | 16:55 |
clarkb | If I'm reading /var/log/messages correctly it never triggers glean for the actual network interface? | 17:00 |
clarkb | This is interesting the rescue node also only has a lo interface | 17:00 |
clarkb | /sys/class/net reports an eth0 and eth1 as expected | 17:02 |
clarkb | I can see in /var/log/messages that eth0 is renamed to enX0 and eth1 is renamed to enX1. Then later NetworkManager tries to configure them and fails because NetworkManager doesn't grok static network config using config drive | 17:09 |
clarkb | It isn't clear to me why our udev rule isn't firing the unit for those interfaces though. At least under debian they should show up with the right subsystem and addr_assign_type. I suppose that value may have changed in the new centos kernel though and we don't match anymore? | 17:10 |
clarkb | it is curious that the Debian image in rax that I used to rescue didn't configure networking either | 17:11 |
clarkb | it should use cloud init | 17:12 |
*** dviroel|lunch is now known as dviroel | 17:18 | |
clarkb | I'm fairly certain the problem is due to udev not firing for the network devices | 17:19 |
clarkb | or maybe more accurately udev isn't triggering our unit based on our udev rule | 17:19 |
clarkb | it isn't clear to me if the rule is being ignored or evaluated and not matching | 17:20 |
clarkb | I set a passwd in the mounted image via the rescue image and have unrescued. I think I need to inspect things like the attributes of the network devices from the context of the centos kernel | 17:30 |
clarkb | and apparently I can't retype the password I set | 17:32 |
clarkb | operating in the web console is the worst thing ever | 17:33 |
clarkb | I'm doing something wrong because I've tried resetting the password multiple times now and when I attempt to login it's the wrong password | 17:46 |
clarkb | I think that is my cue to take a break. TL;DR is that it appears the glean unit for eth0/enX0 and eth1/enX1 is not firing which prevents any static network config from happening | 17:47 |
clarkb | I'm going to delete my test node. Easy enough for someone else to make a new one | 17:47 |
clarkb | something I should've checked earlier: this same udev problem occurs on ovh booted instances and I assume elsewhere as well. We boot successfully using dhcp though | 17:51 |
clarkb | that should make this a lot easier to debug if we can use a cloud we can ssh into | 17:51 |
clarkb | I've confirmed addr_assign_type is 0 which is necessary for our udev trigger to fire | 17:52 |
clarkb | udevadm shows the systemd wants value for lo but not ens3 on bhs1. More clues that it really isn't doing what our udev rule attempts to do | 18:00 |
clarkb | ok on a bhs1 node if I run `udevadm test /sys/class/net/ens3` that results in a SYSTEMD_WANTS value for glean@ens3.service. | 18:07 |
clarkb | That implies to me that either an event isn't firing on boot or the attributes are not correct when the event does fire | 18:07 |
rezabojnordi | Hi guys | 18:43 |
clarkb | hello | 18:45 |
clarkb | ok removing the ACTION=="add" from the udev rule makes it work | 18:57 |
clarkb | so for some reason we aren't getting added with an action called add? | 18:57 |
clarkb | I have been unsuccessful in debugging udev events at boot to determine what the actual event attributes are | 18:58 |
clarkb | I think I may understand this. The device starts life as eth0 but is moved to ens3. This means the udev action for ens3 is a move not an add. It looks like eth0 is add'd then I guess the move process short circuits the udev rule processing or we're looking for eth0 paths that don't exist anymore to do things like evaluate the attrs that we check | 19:08 |
clarkb | I suspect that if I change this to ACTION=="add|move" this may work | 19:08 |
clarkb | will test that and if confirmed will likely leave it up to red hat folks that understand this move business to give further guidance | 19:09 |
rezabojnordi | hi guys, I have question | 19:09 |
clarkb | rezabojnordi: feel free to ask | 19:10 |
rezabojnordi | I have problem on my architecture? can you help me? | 19:11 |
clarkb | infra-root yup I think that is it. Switching ACTION=="add" to ACTION=="add|move" seems to make it work. I'm not sure what other fallout that would have and it seems to do with centos-9's particular udev rules for renaming things? Will have to defer to people that understand that distro better | 19:11 |
clarkb | rezabojnordi: architecture of what? We help build the developer tools for openstack and other projects but don't really know much about operating them necessarily | 19:12 |
rezabojnordi | How to contribute on OpenStack? | 19:13 |
rezabojnordi | I like to contribute on openstack | 19:13 |
clarkb | ah we can probably help with that | 19:14 |
clarkb | is there a specific problem? | 19:14 |
rezabojnordi | yes, how can I start? | 19:17 |
clarkb | https://docs.openstack.org/contributors/ is probably a good place to start. That has specific starting points depending on what sort of contributions you want to make | 19:18 |
rezabojnordi | Thank you so much my bro | 19:21 |
*** dviroel is now known as dviroel|afk | 20:24 | |
ianw | clarkb: ... great detective work! | 20:58 |
Clark[m] | ianw: ya I'm not sure if we should just add |move. There is a specific log entry for when things get renamed and it is related to some rhel 9 ruleset. If you want to look the instance is in ovh bhs1 and named clarkb-test-centos-9. You can log in as root with your key like any other test node | 21:03 |
Clark[m] | I'm finishing lunch and then doing school run so not at a keyboard for a bit | 21:04 |
ianw | yeah i need to do the opposite school run and will be back in about 20m too, will look | 21:27 |
fungi | okay, i'm back finally but have some paperwork i need to get done. looks like there are no burning fires, just a fun investigative series on why centos suddenly broke glean | 21:46 |
clarkb | ya I think that was mostly it | 21:55 |
clarkb | systemd-udevd[588]: Using default interface naming scheme 'rhel-9.0'. <- that triggers the renames I think | 22:05 |
clarkb | what is interesting is that rhel 7 and 8 did that too I think so something about the mechanism is maybe at fault? | 22:06 |
clarkb | I did manage to get the udev logs to be more verbose via rebuilding the initramfs | 22:12 |
clarkb | eth0: Processing device (SEQNUM=2129, ACTION=add) happens then ens3: Processing device (SEQNUM=2149, ACTION=move) happens | 22:13 |
ianw | https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/9-beta/html/configuring_and_managing_networking/consistent-network-interface-device-naming_configuring-and-managing-networking | 22:13 |
ianw | what a read ... | 22:14 |
clarkb | what is weird to me is that we don't end up adding eth0, but that seems to be what we want anyway so not a problem just odd to not see glean fire for eth0. It must short circuit somehow when it does the rename to trigger actions on the new device name | 22:14 |
clarkb | I suspect that what happened here is centos updated to stop treating that as a device add and instead a device move | 22:14 |
clarkb | if we think that it is ok for glean to refire on a move action (I suspect it is, it will write out the config for the new device name and that might orphan the old config under the old device name but oh well?) | 22:17 |
clarkb | er if we think that is ok it is a simple fix in the glean udev rule I think | 22:17 |
ianw | i'm just wondering if it's an ordering issue, again? should our udev match run later? | 22:20 |
ianw | i guess we install as /etc/udev/rules.d/99-glean.rules | 22:21 |
clarkb | ianw: I don't think it is an ordering issue because the problem is the event just never matches | 22:21 |
clarkb | it runs successfully for lo before NM executes | 22:22 |
clarkb | the problem is we never trigger the glean unit for ens3 but as far as I can tell the event is emitted just with changed details | 22:22 |
clarkb | looking at the systemd package for centos 9 it seems they removed a ton of patches that were carried in the package related to udev | 22:22 |
clarkb | I wonder if that changed the behavior or maybe they updated the rules or the rename helper | 22:23 |
ianw | yeah, i wonder if ens3 should be getting an "add" too and that's a bug | 22:25 |
clarkb | yes I think that is what must've changed here | 22:25 |
ianw | like eth0 issues a "move" to indicate it's disappearing. but you'd think that ens3 would issue an "add" to indicate it's appearing | 22:25 |
clarkb | it was likely doing that before (though I haven't confirmed it) | 22:26 |
clarkb | https://github.com/redhat-plumbers/systemd-rhel9/tree/main/rules.d seems to not include 60-net.rules which is what does the rename | 22:30 |
fungi | any chance you've not yet thought to log into a node booted from the working image to compare boot-time logs? | 22:31 |
clarkb | fungi: 90% of this has been from a "working" node since the rax console is really painful | 22:31 |
clarkb | fungi: the "good" news is that on not rax we work because we fallback to dhcp | 22:31 |
clarkb | but that means you get to investigate all of the same behaviors that cause rax to fail from a proper shell | 22:32 |
clarkb | but I haven't booted the old image yet to see if it is different. I guess I can do that really quickly | 22:32 |
fungi | i meant a node booted from the 12-day-old image | 22:33 |
fungi | ahh, yeah that one | 22:33 |
fungi | just didn't know if that would eliminate the guessing as to what might have changed in 12 days | 22:33 |
clarkb | ya it might help. We can get a package list too. Looks like that set of udev rules are provided by the initscripts-rename-device package and not by systemd | 22:33 |
ianw | i wonder if this somehow relates to having the rules in the initramfs | 22:37 |
clarkb | ianw: I think they've always been in there though? since I had to dracut -f to get my udev logging config to stick and I found a blog post from loic dachary from 6 years ago explaining that | 22:37 |
ianw | yeah, but the glean rules are not? | 22:38 |
ianw | (i am clutching at straws :) | 22:39 |
clarkb | hrm the docs say your rules should go in /etc/udev/rules.d I would've expected them to be included if necessary but maybe not | 22:39 |
clarkb | looks like the 13 day old image also does not have SYSTEMD_WANTS in its udevadm info /sys/class/net/ens3 output | 22:39 |
clarkb | I wonder if this was always broken and we just didn't notice that rax couldn't boot them | 22:39 |
clarkb | until the other clouds fail rate got high enough | 22:39 |
clarkb | I do notice that other rules files for network things check ACTION=="add|move|change" | 22:40 |
fungi | if it was working in rax with the old image, in theory the launcher logs from before today would have some successfully booted nodes mentioned | 22:40 |
clarkb | skimming the logs I'm not sure if it ever worked | 22:42 |
clarkb | I really suspect that they updated the interface renaming process and invalidated our udev rules | 22:42 |
clarkb | I don't think that there is a race to address here just centos changing things | 22:42 |
fungi | i would believe that | 22:43 |
clarkb | and I agree that when you rename an interface that should be treated as a removal of the old name and an add of the new name | 22:44 |
clarkb | not a move | 22:44 |
clarkb | because that appears to be how udev events are structured | 22:44 |
fungi | we also stepped up centos 9 testing a lot in the last month or two, so not having it working in 30-40% of our provider capacity might have just gone unnoticed for a whiel | 22:44 |
fungi | while | 22:44 |
clarkb | you can add and remove things (and sometimes "change" them too) | 22:44 |
clarkb | I'm going to delete my old image booted instance since I don't think it is telling us anything new | 22:46 |
clarkb | the dhcp fallback is both really great and annoying at the same time. I think debugging this without the dhcp fallback is much more painful but if we didn't have it we would notice these things much sooner | 22:47 |
clarkb | since it would fail in all clouds I think | 22:47 |
ianw | https://kubic.opensuse.org/blog/2020-02-04-quickboot/ | 22:49 |
ianw | Checking the journal, this is because the interface got renamed from eth0 to enp1s0 during boot. Usually this happens when udev runs in the initrd already, which means that after switch-root there’s an add event with the new name already, ... | 22:49 |
ianw | that is what i'd expect to happen. the interface gets renamed by initramfs, and we have an add event waiting for glean | 22:50 |
opendevreview | Clark Boylan proposed opendev/glean master: Handle udev move events in addition to add events https://review.opendev.org/c/opendev/glean/+/836103 | 22:51 |
clarkb | ianw: there is no add event though | 22:51 |
clarkb | my test instance has verbose udev logging in journalctl and it shows the add for eth0 and not ens3 | 22:52 |
clarkb | ianw: oh interesting their code triggers an add from the move action https://github.com/thkukuk/issue-generator/pull/3/files | 22:54 |
clarkb | so my change would be roughly equivalent to generating a new add event. There might be an argument that the distro should be doing that I guess | 22:55 |
clarkb | however we have an initrd so I don't think this is related to that specific problem? Something else must be happening with centos9 | 22:55 |
fungi | or the root pivot in centos doesn't do the same things as suse's | 22:57 |
clarkb | basically if move is equivalent to add and some installs proxy an add event then we should be able to listen for the move instead | 22:58 |
ianw | https://github.com/fedora-sysv/initscripts/blob/master/src/rename_device.c is the source for rename_device. not sure how that all hangs together with systemd. i guess it leaves the renaming up to the distro | 23:04 |
clarkb | ianw: it is invoked by https://github.com/fedora-sysv/initscripts/blob/master/usr/lib/udev/rules.d/60-net.rules | 23:06 |
clarkb | so the eth0 add comes in then rename_device is run | 23:06 |
clarkb | I don't know how that injects the move ACTION that we see for ens3 | 23:07 |
ianw | rename_device seems to just output a value | 23:07 |
clarkb | maybe udev detects it via sysfs | 23:07 |
clarkb | I wonder if that is it. rename_device modifies sysfs in a way that udev can see and then udev generates the new event from that | 23:07 |
*** rlandy|rover is now known as rlandy|out | 23:08 | |
clarkb | also side note less complains about a bad terminal on centos 9 | 23:08 |
ianw | afaict it just prints out if the device should be renamed? https://github.com/fedora-sysv/initscripts/blob/master/src/rename_device.c#L392 | 23:09 |
clarkb | ianw: I think the NAME setting in the udev rule does it | 23:11 |
clarkb | man udev says NAME is for setting a network device name | 23:11 |
ianw | yeah, in combo with subsystem=net i guess | 23:11 |
ianw | so that's a bit of a red herring, *that* must be handled by systemd-udevd i guess? | 23:11 |
clarkb | ya I think 60-net.rules generates the biosdevname or similar and returns it to systemd-udevd. Then the NAME specification is set to that result and instructs systemd-udevd to set the new name? That then short circuits add processing for eth0 and generates a move event for ens3 | 23:13 |
clarkb | the change here must be how systemd-udevd emits move instead of add for that? centos 8 does boot in rax iirc but the change may be anywhere between? | 23:13 |
clarkb | grepping for NAME in the ruleset on my test node I don't see anything obvious that shows the renaming happening via the rules executing anything, it must be internal to udev | 23:16 |
clarkb | https://github.com/redhat-plumbers/systemd-rhel9/tree/main/src/udev and that is a much bigger codebase | 23:17 |
ianw | https://github.com/systemd/systemd/blob/main/src/udev/udev-event.c#L1093 | 23:28 |
ianw | it must be triggered from here | 23:28 |
ianw | i was hoping to see something obvious like it triggering a move event | 23:31 |
clarkb | ianw: and line 1079 checks a constant that is defined as "move" | 23:31 |
clarkb | well SD_DEVICE_MOVE is integer type but it indexes into an array of event actions and "move" is at that index | 23:31 |
clarkb | SD_DEVICE_MOVE is from february of last year and before that it was DEVICE_ACTION_MOVE | 23:33 |
ianw | the actual rename happens in rtnl_set_link_name | 23:35 |
clarkb | ianw: also looks like a lot of that code is there to skip renaming things with NAME= if the action is a move | 23:36 |
clarkb | because setting NAME triggers the move | 23:36 |
ianw | that sends a netlink message RTM_SETLINK and then IFLA_IFNAME | 23:36 |
ianw | this is probably not surprising | 23:36 |
clarkb | DEVICE_ACTION_MOVE has been around since 2015 so that seems to have been the way it worked for a while. I wonder then if the problem is merely in not getting an add action emitted too as in that blog you linked | 23:38 |
clarkb | it is interesting that the only place I see SD_DEVICE_ADD assigned to a var is in the tests. I wonder if they are incrementing that value as the events are handled instead | 23:42 |
clarkb | udev_device_new_from_environment() is what would create the new device and event if the rules used IMPORT instead of doing NAME= explicitly on the result | 23:53 |
ianw | https://github.com/torvalds/linux/blob/master/drivers/base/core.c#L4175 | 23:56 |
ianw | is where the rename ends up; afaict the event triggered up to udev comes from the kobject_rename() call in there. and that's just a "move" | 23:57 |
ianw | https://github.com/torvalds/linux/blob/81ff0be4b9e3bcfee022d71cf89d72f7e2ed41ba/lib/kobject.c#L481 | 23:57 |
clarkb | oh so udev isn't generating that itself when it processes the rename, it is relying on a side effect from the kernel side to cause the move event to happen? | 23:57 |
clarkb | ianw: kobject_uevent_env() triggering the move event back in userspace? | 23:59 |
clarkb | it's called with KOBJ_MOVE so I assume so | 23:59 |
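
The ACTION matching discussed in this log can be sketched roughly as follows. This is a minimal Python model, not udev's actual implementation and not glean's real rule: it assumes a udev match value behaves like `|`-separated shell-style globs, and it looks only at the ACTION key (the real rule also checks things like the subsystem and addr_assign_type). The two log lines are the ones quoted above from the verbose udev logging; the helper names (`parse_events`, `action_matches`, `devices_triggered`) are hypothetical and exist only for illustration.

```python
import re
from fnmatch import fnmatchcase

# Verbose udev log lines in the shape quoted in the discussion above.
LOG_LINES = [
    "eth0: Processing device (SEQNUM=2129, ACTION=add)",
    "ens3: Processing device (SEQNUM=2149, ACTION=move)",
]

EVENT_RE = re.compile(
    r"^(?P<dev>\S+): Processing device "
    r"\(SEQNUM=(?P<seq>\d+), ACTION=(?P<action>\w+)\)$"
)


def parse_events(lines):
    """Extract (device, seqnum, action) tuples from verbose udev log lines."""
    events = []
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            events.append((m["dev"], int(m["seq"]), m["action"]))
    return events


def action_matches(pattern, action):
    """Approximate a udev match value: '|'-separated shell-style globs."""
    return any(fnmatchcase(action, alt) for alt in pattern.split("|"))


def devices_triggered(pattern, events):
    """Return device names whose event ACTION satisfies the match pattern."""
    return [dev for dev, _seq, action in events if action_matches(pattern, action)]


events = parse_events(LOG_LINES)
print(devices_triggered("add", events))       # only the eth0 add event matches
print(devices_triggered("add|move", events))  # the ens3 move event matches too
```

In this model the renamed interface (ens3) is only ever seen with a move action, so a rule restricted to ACTION=="add" never selects it, while ACTION=="add|move" does. The model also reports a match for the eth0 add event, whereas the log above suggests glean did not actually fire for eth0, presumably because the rename cut that processing short; the sketch does not attempt to reproduce that part.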