Monday, 2025-09-22

01:25 <opendevreview> Merged openstack/reviewstats master: Various updates to pass CI  https://review.opendev.org/c/openstack/reviewstats/+/961892
01:25 <opendevreview> Merged openstack/reviewstats master: Various cleanup and improvements  https://review.opendev.org/c/openstack/reviewstats/+/961897
01:28 *** jph_ is now known as jph
12:03 <opendevreview> Sean McGinnis proposed openstack/reviewstats master: remove unicode from code  https://review.opendev.org/c/openstack/reviewstats/+/851171
12:06 <opendevreview> Sean McGinnis proposed openstack/reviewstats master: remove unicode from code  https://review.opendev.org/c/openstack/reviewstats/+/851171
12:47 <ykarel> Hi, is there any known issue with multinode jobs? Any recent changes in the infra in the last 2 weeks?
12:49 <ykarel> "No route to controller vm" seen since 18th Sept: https://6f670868c142d4e8c5b3-0595cf8ce23c3a62b4c5e95f9113455a.ssl.cf5.rackcdn.com/openstack/9f840e9774404954bbce055b15c872d1/job-output.txt
12:51 <ykarel> identity timed out, one example: https://39d563aa6881906ba38a-c46c6711d422731404e9aacada362597.ssl.cf2.rackcdn.com/openstack/36407411f7b143f0bb60eb0011a5d886/job-output.txt ; can see it even before, earliest 9th Sept, but that's likely due to older logs not being in opensearch
13:02 <fungi> ykarel: if it was in rackspace flex, something changed in those regions that started causing us to need to specify a network, and then when we added it to clouds.yaml we began ending up with two network interfaces in those instances instead of just one (both on the default network). is it still happening today?
13:04 <ykarel> fungi, yes, one of these is in raxflex-DFW3 and raxflex-SJC3, and yes, seeing it today as well
13:04 <ykarel> the other one is more generic, i.e. not specific to raxflex
13:06 <fungi> it presumably started when https://review.opendev.org/c/opendev/system-config/+/961537 deployed on the 17th, but should have ceased after https://review.opendev.org/c/opendev/system-config/+/961815 landed on the 19th
13:10 <fungi> i wasn't following closely, but there were a couple of subsequent changes to opendev/zuul-providers trying to work out what's going on with the extra interfaces in flex
13:10 <fungi> clarkb probably has a more current understanding of the state of things once he's awake/around
13:23 <ykarel> ok thx fungi, will wait for clarkb
14:21 <clarkb> fungi: ykarel the underlying issue last week was instances booting with multiple interfaces. We think we tracked that down to configuring interfaces in both clouds.yaml and zuul launcher config, causing openstacksdk to request two interfaces on boot. We added the clouds.yaml config (the zuul launcher config was preexisting) as we thought that there was a problem with cloud side
14:21 <clarkb> updates or openstacksdk updates, but on further inspection it seems that maybe a cloud error had us cache a bad network id?
14:21 <clarkb> anyway on friday we undid all of this extra config (so now only zuul-launcher should have network config, no more clouds.yaml config) and then weekly upgrades should've restarted the launchers and reset us back to where we were before
14:21 <clarkb> anyway the things to check are "are we still booting with multiple interfaces" and "did the config reset get applied as expected by the weekly restarts"
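For illustration, the duplicated network configuration being described could look roughly like the sketch below. The cloud and network names are placeholders and the zuul-launcher attribute is assumed for the sketch; only the clouds.yaml networks key is a documented openstacksdk option.

    # Sketch only -- placeholder names, not the real opendev configuration.
    # clouds.yaml side (the entry added on the 17th and reverted on the 19th):
    clouds:
      raxflex-sjc3:
        networks:
          - name: opendev-default        # placeholder network name
            default_interface: true
    # zuul-launcher provider side (attribute name assumed for illustration):
    #   networks:
    #     - opendev-default
    # The hypothesis above is that with the network named in both layers,
    # openstacksdk requests a port for each, so instances boot with two
    # interfaces on the same default network.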
14:23 <clarkb> ok, server list shows two interfaces in sjc3, so either the config didn't roll back as expected or something else is going on
14:24 <fungi> interesting that we could end up caching a bad network id in more than one cloud region
14:25 <clarkb> fungi: yes, though they are all in the same provider so not necessarily impossible
14:25 <fungi> right, so maybe not a random error but rather a temporarily config got rolled out to multiple regions there
14:26 <fungi> er, temporarily broken config
14:26 <clarkb> the /etc/openstack/clouds.yaml file on zl01 updated on september 19 at 17:05 and I don't see the extra network config there anymore. The zuul launcher process there is from september 20 so I think we did restart onto the correct config
14:27 <clarkb> so it seems that we're getting the multiple interfaces without extra config in clouds.yaml. This is unexpected
15:40 <sfernand> hey @clarkb I added a few debug commands in a post playbook and figured out the issue is probably occurring due to "df -h" getting stuck. The job I've been working on tests the cinder nfs driver, so I guess some stale nfs mount is causing the df command to stall. I proposed this patch to time out the df command
15:40 <sfernand> https://review.opendev.org/c/openstack/devstack/+/961814/3/roles/capture-system-logs/tasks/main.yaml
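In other words, the idea is to bound how long df can hang on a stale NFS mount. A minimal sketch of that kind of task, not the actual content of the review above; the 60-second bound is an arbitrary example:

    - name: Capture filesystem usage without hanging on stale NFS mounts
      shell: timeout 60 df -h
      failed_when: false   # log collection should not fail the job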
15:41 <clarkb> sfernand: interesting. I wonder if a mount -a or similar before doing the diagnostics would help
15:41 <clarkb> but a timeout seems like a good idea there if this is an issue
15:46 <sfernand> hmm I think the mount -a might fix the df getting stuck, but that also changes the environment after the test completes
15:46 <clarkb> ya it would be a tradeoff between what we think is more important info to have I guess
15:47 <clarkb> (is the df capacity report more or less important than potentially tracking down bad mounts)
15:47 <sfernand> you're right, it makes sense
15:49 <sfernand> maybe running df with a timeout -> run the mount command to list what was expected to be mounted -> run mount -a to try fixing the issue -> run df with a timeout again
15:51 <sfernand> or just run df -l first
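Sketched as tasks, that ordering might look like the following; this is illustrative only, with arbitrary timeouts and task names rather than a proposed patch:

    - name: df with a timeout (plain df may hang on a stale NFS mount)
      shell: timeout 60 df -h
      failed_when: false

    - name: List what is currently mounted
      command: mount
      failed_when: false

    - name: mount -a to try re-establishing the expected mounts
      command: mount -a
      become: true
      failed_when: false

    - name: df again after the repair attempt
      shell: timeout 60 df -h
      failed_when: false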
15:54 <clarkb> for now I think the timeout is probably ok
15:54 <clarkb> and if we don't see a fix for the mount issue then we can think about optimizing the debug output capturing
15:55 <sfernand> cool I agree
16:39 <kozhukalov> Hi team. Before switching to ansible 11 all the shell tasks were streaming to the zuul console. I was able to see the bash scripts' xtrace while they were running. Now I only see the name of the task. How can I make it behave as before? I run the zuul_console role at the very beginning of every job and it used to work.
16:40 <clarkb> kozhukalov: I don't think we expected any changes around that. As far as I know jobs generally still work. Also you shouldn't need to manage the zuul_console stuff yourself unless you are rebooting the nodes
16:41 <clarkb> kozhukalov: https://zuul.opendev.org/t/openstack/build/9e6b5c09309b42cdb60ffa8e79997da4/log/job-output.txt#21679 here is an example of what I think is a shell/command task with streaming output
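For comparison, the streaming case is just a plain shell task written directly in a play, something like the sketch below (the script name is a placeholder):

    - hosts: all
      tasks:
        - name: Run a script with output streamed to the zuul console
          shell: |
            set -x
            ./run-tests.sh            # placeholder script
          args:
            chdir: "{{ zuul.project.src_dir }}"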
16:42 <clarkb> it might help if you link to specific examples and the ansible source code that produces the output?
16:42 <kozhukalov> Ok, I'll try to remove the zuul_console role. Maybe this somehow affects the behavior. Jobs work well, I just don't see the stderr while the shell task is running. Thanks for the example. Will try to use it.
16:43 <clarkb> kozhukalov: by default stdout and stderr should be combined
16:43 <clarkb> if the issue is specific to stderr then maybe it is just hidden amongst the stdout?
19:41 <kozhukalov> Spent some time debugging this. Turned out that if I run a shell task directly as part of a playbook then stdout/stderr streaming works well. But if I include a role with a shell task then I don't see stdout/stderr, only the task name.
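The non-streaming case described here would be the same kind of task, just reached through a role rather than written directly in the play; a sketch with a placeholder role name:

    # playbook
    - hosts: all
      roles:
        - run-tests                   # placeholder role name

    # roles/run-tests/tasks/main.yaml
    - name: Run the same script from inside a role
      shell: |
        set -x
        ./run-tests.sh                # same placeholder script as above
      args:
        chdir: "{{ zuul.project.src_dir }}"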
19:43 <fungi> interesting. i know ansible sometimes likes to output streams on alternative file descriptors too, i wonder if that's what's happening there, like the included role's stderr is remapped to fd#3 or higher and the callback plugin doesn't pick that up
21:52 <clarkb> your best bet is probably to bring that difference of behavior up in the zuul matrix room and see if anyone knows why that might be happening
