Thursday, 2022-03-31

*** dviroel|out is now known as dviroel00:05
*** dviroel is now known as dviroel|out00:20
*** rlandy|bbl is now known as rlandy|out01:17
*** mazzy5098812929580859404 is now known as mazzy5098812929580859403:14
*** soniya29 is now known as soniya29|rover04:19
*** marios is now known as marios|ruck04:58
*** ysandeep|out is now known as ysandeep05:05
*** pojadhav|afk is now known as pojadhav05:11
marios|rucko/ anyone know what's up with NODE_FAILURE today 06:48
marios|ruckysandeep: soniya29|rover: ^^ fyi06:48
soniya29|rovermarios|ruck, ack06:48
fricklermaybe some issue with recently built c9s images? we failed to build new images for a couple of days because of a full disk situation, that was resolved yesterday, so today you are running with fresh images for the first time in a week or so06:55
marios|ruckthanks frickler might be... 07:17
marios|ruckunfortunately there are no logs available cant even see where it ran or anything else07:18
fricklermarios|ruck: in the nodepool logs there is only: Timeout waiting for connection to ... on port 2207:19
fricklerso likely some IP setup/firewall/sshd change07:20
*** ysandeep is now known as ysandeep|lunch07:23
opendevreviewchzhang8 proposed openstack/project-config master: register and bring back tricircle under x namespaces
fricklermight also be provider specific or not 100% failing, since I do see some working nodes being spawned07:34
*** jpena|off is now known as jpena07:34
opendevreviewchzhang8 proposed openstack/project-config master: register and bring back tricircle under x namespaces
*** gthiemon1e is now known as gthiemonge08:10
opendevreviewchzhang8 proposed openstack/project-config master: create tricircle under x namespaces
opendevreviewchzhang8 proposed openstack/project-config master: create tricircle under x namespaces
*** ysandeep|lunch is now known as ysandeep10:07
*** soniya29|rover is now known as soniya29|rover|lunch10:08
*** rlandy|out is now known as rlandy10:25
*** soniya29|rover|lunch is now known as soniya29|rover10:27
*** marios|ruck is now known as marios|ruck|lunch11:09
*** bhagyashris_ is now known as bhagyashris11:09
*** marios|ruck|lunch is now known as marios|ruck11:15
*** pojadhav is now known as pojadhav|afk11:17
*** soniya29|rover is now known as soniya29|rover|afk11:18
*** dviroel_ is now known as dviroel11:22
*** ysandeep is now known as ysandeep|brb11:49
*** ysandeep|brb is now known as ysandeep|afk12:05
*** soniya29 is now known as soniya29|rover12:07
*** pojadhav|afk is now known as pojadhav12:33
*** ysandeep|afk is now known as ysandeep13:14
fungifrickler: do the ssh timeouts seem to be for the same providers most of the time?13:16
fricklerfungi: I didn't look systematically enough to tell. I did check some freshly created node on ovh where I could login13:18
fricklerquick "grep Timeout /var/log/nodepool/launcher-debug.log|wc" finds 2200 on nl01, 0 on nl02, 158 on nl04, so very likely only some providers affected13:21
fricklermight be similar to the fedora 35 issues we had? maybe ianw can check later13:22
fungior may just be one of our larger providers having significant network disruption13:23
fricklerbut it seemed specific to centos, I haven't seen other node types affected13:23
fungiand it's not the nested-virt ones specifically like we saw with octavia, right?13:24
fungithe normal centos-9-stream label?13:25
frickleryes, the latter13:25
fungiagreed, more likely an image problem in that case13:26
fungiokay, i'm disappearing now, but hope to be back around 21:00 utc13:59
*** marios|ruck is now known as marios15:00
*** marios is now known as marios|ruck15:01
clarkbre NODE_FAILURE that should only happen if all the cloud sfail to boot the instance15:29
clarkbwhich implies we have problems on all the clouds? though frickler's greps indicate nl03 is probably fine?15:30
clarkbbuy ya considering that fedora did that to us recently I wouldn't be surprised if this is the same issue over again15:30
clarkbis there a changlog for stream?15:31
clarkbI've just confirmed that the centos-9 image updated 12 hours ago and the prior imgae is from 12 days ago15:36
clarkbfile sizes haven't drifered dramatically indicating some major disk writing problem. Likely due to centos updates itself15:38
*** ysandeep is now known as ysandeep|out15:40
clarkbwe don't seem to have the log for the build from 12 days ago so hard to compare package lists. I guess I can manually boot off of that image and find the equivalent of dpkg -l15:41
clarkbovh bhs1 has no valid host found errors15:48
clarkband gra1 is offline for maintenance. amorin is that maintenance still ongoing or can we return gra1 to service?15:49
amoringra1 os done!15:50
amorinlet me remove WIP from my change15:50
clarkbthanks! I'll approve it when that is done15:50
clarkbany idea why bhs1 is complaining about no valid hosts found? Probably not urgent but I noticed it when trying to find logs for this centos-9 problem15:50
amorinclarkb will check about BHS115:51
clarkbapproved thanks again15:51
clarkb#status log Paused centos-9-stream image building to ensure it doesn't rebuild and delete our prior image before we can properly debug related NODE_FAILURES15:53
opendevstatusclarkb: finished logging15:53
clarkbinfra-root ^ fyi I did that just now to keep the option of deleting the current image and using the 12 day old one open while we sort through this15:53
*** dviroel_ is now known as dviroel15:54
clarkbok I think I've onfirmed they fail to boot in the iweb cloud15:57
clarkbbut it looks like a lot of stuff is failing to boot there?15:58
clarkbwhich may be the expected shutdown beginning?15:59
opendevreviewMerged openstack/project-config master: Revert "[OVH/GRA1] Disable nodepool temporarily"
amorinclarkb: I found the issue, fix in progress16:03
amorinabout bhs116:03
clarkbamorin: thank you !16:04
clarkbre iweb it seems that we might have a problem deleting things mor ethan launching. THen centos-9 is specifically a launching problem?16:04
clarkbI'm going to look at another cloud for centos-9 info16:04
*** rlandy is now known as rlandy|rover16:09
clarkbok I think I may understand a high level of what is happening. It appears that cnetos-9-stream can boot as well as any other image in other clouds. Except it isn't booting successfully in rax due to being unable to gather host keys16:10
clarkbthere is some instability in othe rcloud sthat results in boot failures but in general they can boot the image I think. Things like the no valid host found in bhs1 and that seems to affect inmotions a bit as well (though inmotion is using most of its quota) then gra1 is down. Also iweb cloud had problems booting a few hours ago which seems to be fine now but it can't delete nodes so16:11
clarkbwe don't have much capacity there16:11
clarkbAnyay being unable to boot cnetos-9-stream in rax means it is far more likely to hit NODE_FAILURE if it wins the lottery in the othre clouds16:12
*** marios|ruck is now known as marios|out16:12
clarkbAnd rax is xen and I wouldn't be surprised if centos-9-stream broke that somehow. It also requires static network config which maybe glean + networking manager + centos-9 broke16:13
*** dviroel is now known as dviroel|lunch16:14
clarkbI'm manually booting an instance in ord to see if it ever pings and if the console shows anything helpful16:18
clarkbso far the console boot output stops at resizing the root disk16:20
clarkboh there it goes but how do I see the history heh16:20
clarkbI can't page up paste the login prompt now ugh16:20
clarkbit does not ping yet but I do get a login prompt so this isn't a completely failed boot16:21
clarkbthis appears to be a failure to configure networking. Likely rax is affected as the only cloud that a fallback to dhcp doesn't work on16:23
clarkbI'll try to boot a recovery instance next and look at the network config I Guess.16:24
clarkbI guess rescuing uses the same image yo u are trying to resuce? I feel like I leran this lesson every time I do this16:32
clarkbI've set the rescue image to debian bullseye and it still doesn't ping16:38
clarkbno ssh connectivity yet either. I'll see if this changes in a minute or two but this may be a dead end16:38
*** jpena is now known as jpena|off16:39
clarkbThere appear to have been multiple kernel updates in the period of time where we were unable to build images16:44
clarkbAnd my rescued instance with a bullseye image isn't any better than the one booting off of itself? I'm so confused16:45
clarkbI'm able to get into the rescue instance via the webconsole16:50
clarkbbut no networking confirmed there16:50
clarkbmaybe because I'm booting with config drive set to true?16:50
clarkbsysconfig/network doens't appear to have any glean content in it16:52
clarkbsystemctl list-units -a doesn't show a glean unit16:54
clarkboh that is because its systemctl for the rescue node this is fun16:55
clarkbIf I'm reading /var/log/messages correctly it never triggers glean for the actual network interface?17:00
clarkbThis is interesting the rescure node also only has a lo interface17:00
clarkb/sys/class/net reports an eth0 and eth1 as expected17:02
clarkbI can see in /var/log/mesages that eth0 is renamed to enX0 and eth1 is renamed to enX1. Then later NetowrkManager tries to configure them and fials because NetworkManager doesn't grok static network config using config drive17:09
clarkbIt isn't clear to me why our udev rule isn't firing the unit for those interfaces though. At least under debian they should up with the right subsystem and addr_assign_type. I suppose that value may have changed in the new centos kernel though and we don't match anymore?17:10
clarkbit is curious that the Debian image in rax that I used to rescue didn't configure networking either17:11
clarkbit should use cloud init17:12
*** dviroel|lunch is now known as dviroel17:18
clarkbI'm fairly certain the problem is due to udev not firing for the network devices17:19
clarkbor maybe more accurately udev isn't triggering our unit based on our udev rule17:19
clarkbit isn't clear t ome if the rule is being ignored or evaluated and not matching17:20
clarkbI set a passwd in the mounted image via the rescue image and have unrescued. I think I need to inspect things like the attributes of the network devices from the context of the centos kernel17:30
clarkband apparently I can't retype the password I set17:32
clarkboperating in the web console is the worst thing ever17:33
clarkbI'm doing something wrong because I've tried resetting the password multiple times now and when I attempt to login its the wrong password17:46
clarkbI think that is my cue to take a break. TL;DR is that it appears the glean unit for eth0/enX0 and eth1/enX1 is not firing which prevents and static network config from happening17:47
clarkbI'm going t odelete my test node. Easy wnough for someone else to make a new one17:47
clarkbsomething I should've checked earlier: this same udev problem occurs on ovh booted instances and I assume elsewhere as well. We boot successfully using dhcp though17:51
clarkbthat should mmake this a lot easier to debug if we can use a cloud we can ssh into17:51
clarkbI've confirmed addr_assign_type is 0 which is necessary for our udev trigger to fire17:52
clarkbudevadmin shows the systemd wants value for lo but not ens3 on bhs1. More clues that irt really isn't doing what our udev rules attempts to do18:00
clarkbok on a bhs1 node if I run `udevadm test /sys/class/net/esn3` that results in a SYSTEMD_WANTS value for glean@ens3.service.18:07
clarkbThat implies to me that either an event isn't firing on boot or the attributes are not correct when the event does fire18:07
rezabojnordiHi guys18:43
clarkbok removing the ACTION=="add" from the udev rule makes it work18:57
clarkbso for some reason we aren't getting added with an action called add?18:57
clarkbI have been unsuccessful in debugging udev events at boot to determine what the actual event attributes are18:58
clarkbI think I may understand this. The device starts live as eth0 but is moved to ens3. This means the udev action for ens3 is a move not an add. It looks like eth0 is add'd then I guess the move process short cicuits the udev rule processing or we're looking for eth0 paths that don't exist anymore to do things like avaluate the attrs that we check19:08
clarkbI suspect that if I change this to ACTION=="add|move" this may work19:08
clarkbwill test that and if confirmed will likely leave it up to red hat folks that undersatnd this move business to give further guidance19:09
rezabojnordihi guys, I have question19:09
clarkbrezabojnordi: feel free to ask19:10
rezabojnordiI have problem on my architecture? can you help me?19:11
clarkbinfra-root yup I think that is it. Switching ACTION=="add" to ACTION=="add|move" seems to make it work. I'm not sure what other fallout that would have and it seems to do with centos-9's particular udev rule sfor renaming things? Will have to defer to people that understrand that distro better19:11
clarkbrezabojnordi: architecture of what? We help build the developer tools for openstack and other projects but don't really know much about operating them necessarily19:12
rezabojnordiHow to contribute on OpenStack?19:13
rezabojnordiI like to contribute on openstack19:13
clarkbah we can probably help with that19:14
clarkbis there a specific problem?19:14
rezabojnordiyes, how can I start?19:17
clarkb is probably a good plcae to start. That has specific starting points depending on what sort of contributions you want to make19:18
rezabojnordiThank you so much my bro19:21
*** dviroel is now known as dviroel|afk20:24
ianwclarkb: ... great detective work!20:58
Clark[m]ianw: ya I'm not sure if we should just add |move. There is a specific log entry for when things get renamed and it is related to some rhel 9 ruleset. If you want to look the instance is in ovh bhs1 and named clarkb-test-centos-9. You can log in as root with your key like any other test node21:03
Clark[m]I'm finishing lunch and then doing school run so not at a keyboard for a bit21:04
ianwyeah i need to do the opposite school run and will be back in about 20m too, will look21:27
fungiokay, i'm back finally but have some paperwork i need to get done. looks like there are no burning fires, just a fun investigative series on why centos suddenly broke glean21:46
clarkbya I think that was mostly it21:55
clarkbsystemd-udevd[588]: Using default interface naming scheme 'rhel-9.0'. <- that triggers the renames I think22:05
clarkbwhat is interesting is that rhel 7 and 8 did that too I think so something about the mechansim is maybe at fault?22:06
clarkbI did manage to get the udev lgos to be more verbose via rebuilding the initramfs22:12
clarkbeth0: Processing device (SEQNUM=2129, ACTION=add) happens then ens3: Processing device (SEQNUM=2149, ACTION=move) happens22:13
ianwwhat a read ...22:14
clarkbwhat is weird to me is that we don' tend up adding eth0, but that seems to be what we want anyway so not a problem just odd to not see gleanfire for eth0. It must short circuit somehow when it does the rename to trigger actions on the new device name22:14
clarkbI suspect that what happened here is centos updated to stop treating that as a device add and instead a device move22:14
clarkbif we think that it is ok for glean to refire on a move action (I suspect it is, it will write out the config for the new device name and that might orphan the old config under the old device name but oh well?)22:17
clarkber if we think that is ok it is a simple fix in the glean udev rule I think22:17
ianwi'm just wondering if it's an ordering issue, again?  should our udev match run later?22:20
ianwi guess we install as /etc/udev/rules.d/99-glean.rules22:21
clarkbianw: I don't think it is an ordering issue because the problem is the event just never matches22:21
clarkbit runs successfully for lo before NM executes22:22
clarkbthe problem is we never trigger the glean unit for ens3 but as far as I can tell the event is emited just with changed details22:22
clarkblooking at the systemd pcakgae for centos 9 it seems they aremoved a ton of patches that were carried in the pcakge related to udev22:22
clarkbI wonder if that changed the behavior or maybe they updated the rules or the rename helper22:23
ianwyeah, i wonder if ens3 should be getting an "add" too and that's a bug22:25
clarkbyes I think that is what must've changed here22:25
ianwlike eth0 issues a "move" to indicate it's disappearing.  but you'd think that ens3 would issue an "add" to indicate it's appearing22:25
clarkbit was likely doing that before (though I haven't confirmed it)22:26
clarkb seems to not include 60-net.rules which is what does the rename22:30
fungiany chance you've not yet thought to log into a node booted from the working image to compare boot-time logs?22:31
clarkbfungi: 90% of this has been from a "working" node since the rax console is really painful22:31
clarkbfungi: the "good" news is that on not rax we work because we fallback to dhcp22:31
clarkbbut that means you get to inestigate all of the same behaviors that cause rax to fail from a proper shell22:32
clarkbbut I havne't booted the old image yet to see if it is different. I guess I can do that really quickly22:32
fungii meant a node booted from the 12-day-old image22:33
fungiahh, yeah that one22:33
fungijust didn't know if that would eliminate the guessing as to what might have changed in 12 days22:33
clarkbya it might help. We can get a package list too. Looks like that set of udev rules are provided by the initscripts-rename-device package and not by systemd22:33
ianwi wonder if this somehow relates to having the rules in the initramfs22:37
clarkbianw: I think they've always been in there though? since I had to dracut -f to get my udev logging config to stick and I found a blog post from loic dachary from 6 years ago explaing that22:37
ianwyeah, but the glean rules are not?22:38
ianw(i am clutching at straws :)22:39
clarkbhrm the docs say your rules should go in /etc/udev/rules.d I would've expected them to be included if necessary but maybe not22:39
clarkblooks like the 13 day old image also does not have SYSTEMD_WANTS in its udevadm info /sys/class/net/ens3 output22:39
clarkbI wonder if this was always broken and we just didn't notice that rax couldn't boot them22:39
clarkbuntil the other clouds fail rate got high enough22:39
clarkbI do notice that other rules files for network things check ACTION="add|move|change"22:40
fungiif it was working in rax with the old image, in theory the launcher logs from before today would have some successfully booted nodes mentioned22:40
clarkbskimming the logs I'm not sure if it ever worked22:42
clarkbI really suspect that they updated the interface renaming process and invalidated our udev rules22:42
clarkbI don't think that there is a race to address here just centos changing things22:42
fungii would believe that22:43
clarkband I agree that when you rename an interface that should be treated as a removal of the old name and an add of the new name22:44
clarkbnot a move22:44
clarkbbecause that appears to be how udev events are structured22:44
fungiwe also stepped up centos 9 testing a lot in the last month or two, so not having it working in 30-40% of our provider capacity might have just gone unnoticed for a whiel22:44
clarkbyou can add and remove things (and sometimes "change" them too)22:44
clarkbI'm going to delete my old image booted instance since I don't think it is telling us anything new22:46
clarkbthe dhcp fallback is both really great and annoying at the same time. I think debugging this without the dhcp fallback is much more painful but if we didn't have it we would notice these things much sooner22:47
clarkbsince it would fail in all clouds I think22:47
ianwChecking the journal, this is because the interface got renamed from eth0 to enp1s0 during boot. Usually this happens when udev runs in the initrd already, which means that after switch-root there’s an add event with the new name already, ...22:49
ianwthat is what i'd expect to happen.  the interface gets renamed by initramfs, and we have an add event waiting for glean22:50
opendevreviewClark Boylan proposed opendev/glean master: Handle udev move events in addition to add events
clarkbianw: there is no add even though22:51
clarkbmy test instance has verbose udev logging in journalctl and it shows the add for eth0 and not ens322:52
clarkbianw: oh interesting their code triggers an add fro mthe mova action
clarkbso my change would be roughly equivalent to generating a ne wadd event. Ther emight be an argument that the distro should be doing that I guess22:55
clarkbhowever we have an initrd so I don't think this is related to that specific problem? Something else must be happening with centos922:55
fungior the root pivot in centos doesn't do the same things as suse's22:57
clarkbbasically if move is equivalent to add and some installs proxy an add event then we should be able to listen for the move instead22:58
ianw is the source for rename_device.  not sure how that all hangs together with systemd.  i guess it leaves the renaming up to the distro23:04
clarkbianw: it is invoked by
clarkbso the eth0 add comes in then rename_device is run23:06
clarkbI don't know how that injects the move ACTION that we see for ens323:07
ianwrename_device seems to just output a value23:07
clarkbmaybe udev detects it via sysfs23:07
clarkbI wonder if that is it. rename_device modifies sysfs in a way that udev can see and then udev generates the new event from that23:07
*** rlandy|rover is now known as rlandy|out23:08
clarkbalso side note less complains about a bad terminal on centos 923:08
ianwafaict it just prints out if the device should be renamed?
clarkbianw: I think the NAME setting in the udev rule does it23:11
clarkbman udev says NAME is for setting a network device name23:11
ianwyeah, in combo with subsystem=net i guess23:11
ianwso that's a bit of a red herring, *that* must be handled by systemd-udevd i guess?23:11
clarkbya I think 60-net.rules generates the biosdevname or similar and returns it to system-udevd. Then the NAME specification is set to that result and instraucts systemd-udevd to set the new name? That then short circuits add processing for eth0 and generates a move event for ens323:13
clarkbthe change here must be how systemd-udevd emits move instead of add for that? centos 8 does boot in rax iircbut the change may be anywhere between?23:13
clarkbgrepping for NAME in the ruleset on my test node I don't see anything obious that shows the renaming happening via the rules execiting anything it must be internal to udev23:16
clarkb and that is a much bigger codebase23:17
ianwit must be triggered from here23:28
ianwi was hoping to see something obvious like it triggering a move event23:31
clarkbianw: and line 1079 checks a constant that is defined as "move"23:31
clarkbwell SD_DEVICE_MOVE is integer type but it indexes into an array of event actions and "move" is at that index23:31
clarkbSD_DEVICE_MOVE is from february of last yera and before that it was DEVICE_ACTION_MOVE23:33
ianwthe actual rename happens in rtnl_set_link_name23:35
clarkbianw: also looks like a lot of that code is there to skip rnaming thingswith NAME= if the action is a move23:36
clarkbbecause setting NAME triggers the move23:36
ianwthat sends a netlink message RTM_SETLINK and then IFLA_IFNAME23:36
ianwthis is probably not surprising23:36
clarkbDEVICE_ACTION_MOVE has been around since 2015 so that seems to have been the way it worked for a while. I wonder then if the problem is merely in not getting an add action emited too as in that blog you linked23:38
clarkbit is interesting that the only place I see SD_DEVICE_ADD assigned to a var is in the tests. I wonder if they are incrememnting that value as the events are handled instead23:42
clarkbudev_device_new_from_environment() is what would create the new device and event if the rules used IMPORT instead of doing NAME= explicitly on the reesult23:53
ianwis where the rename ends up; afaict the event triggered up to udev comes from the kojbect_rename() call in there.  and that's just a "move"23:57
clarkboh so udev isn't generating that itself when it processe sth rename it is relying on side effect from the kernel side to cause the move event to happen?23:57
clarkbianw: kobject_uevent_env() triggering the move event back in userspace?23:59
clarkbits called with KOBJ_MOVE so I assume so23:59

Generated by 2.17.3 by Marius Gedminas - find it at!