| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/962557 | 02:15 |
|---|---|---|
| hemanth | Hey, regarding the RETRY_LIMIT "Host Unreachable" failures.. I have seen the error multiple times yesterday and today. 1. I increased the verbosity of the charmcraft command; charmcraft creates a lxc container, sets up networking, tz etc. inside the container, and when it tries to stop the container the log stream disconnects. 2. I modified the lxc network from 10.x.x.x to 172.28.x.x but the problem still exists ... the pattern I observed is that this is | 03:41 |
| hemanth | happening on instances where more than one interface exists (primary interfaces: 104.x.x.x and 23.x.x.x).. is there a nodeset I can use so that the test machine gets a 104.x.x.x/23.x.x.x address? (I do not want to recheck, which runs 40+ jobs, just to reproduce the error) | 03:41 |
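For context on the bridge change described above, here is a minimal sketch of what the reconfigured LXD bridge might look like, roughly in the shape `lxc network show lxdbr0` prints; the exact keys and the 172.28.0.1/24 address are assumptions, not taken from the job logs.

```yaml
# Hypothetical: approximate "lxc network show lxdbr0" output after moving
# the bridge off 10.x, as described above. Keys and address are assumed.
name: lxdbr0
type: bridge
config:
  ipv4.address: 172.28.0.1/24   # assumed CIDR in the 172.28.x.x range mentioned
  ipv4.nat: "true"
  ipv6.address: none
```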
| hemanth | I see 3 attempts on some of the jobs that are failing but there is no pre-run defined for the job.. is this expected? https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.attempts | 03:42 |
| hemanth | Job definition: https://opendev.org/openstack/sunbeam-charms/src/branch/main/zuul.d/jobs.yaml#L1 (or any charm-build-* in this file) | 03:44 |
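For reference on the retry question: Zuul also retries a build when Ansible reports the node unreachable during any phase, not only on pre-run failures, and the `attempts` attribute caps those retries. Below is a minimal sketch of how a job could lower that cap; the `charm-build` parent and the placement in zuul.d/jobs.yaml are assumptions, not copied from sunbeam-charms.

```yaml
# Hypothetical sketch: cap retries so a "Host Unreachable" surfaces as a
# single failure instead of three attempts. The parent name is assumed.
- job:
    name: charm-build-glance-k8s
    parent: charm-build
    attempts: 1
```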
| tonyb | hemanth: I'm AFK right now (on my phone). Thanks for the updates. I'll process what you've written and see what suggestions I have | 04:16 |
| tonyb | hemanth: A very non-scientific look at: https://zuul.opendev.org/t/openstack/builds?job_name=charm-build-glance-k8s&project=openstack%2Fsunbeam*&change=963705&skip=0 seems to show that all the failing jobs are running on the 'RAX Classic' cloud (in various regions). | 06:10 |
| tonyb | Some (https://4ca4d8620dc864739feb-f83d06667d580e000031601b82c71a43.ssl.cf5.rackcdn.com/openstack/acfdfa3dcb224b04b45750039d5298c0/zuul-info/inventory.yaml) run (successfully) on "RAX Flex" which has multiple interfaces. | 06:10 |
| tonyb | I don't think there is anyway to explicitly request a node on that (RAX Classic) provider. | 06:15 |
| tonyb | hemanth: I suggest creating a modified version of your change that removes all jobs from: https://opendev.org/openstack/sunbeam-charms/src/branch/main/zuul.d/project-templates.yaml#L40 apart from charm-build-glance-k8s. Then you can recheck on that. | 06:18 |
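A minimal sketch of the kind of temporary edit being suggested: trim the template in the test change so only the failing job runs on recheck. The template name and surrounding structure here are assumptions; only the job name comes from the discussion.

```yaml
# Hypothetical: zuul.d/project-templates.yaml reduced for debugging only.
# "sunbeam-charms-check" is a made-up template name.
- project-template:
    name: sunbeam-charms-check
    check:
      jobs:
        - charm-build-glance-k8s
```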
| tonyb | Once you see it's running on RAX Classic (near the top of the job-output/console stream) look for the "Print node information" task. Assuming an admin is online we can add an autohold for you | 06:20 |
| hemanth | I moved the builds to ubuntu-noble and I do not see any issue .. https://zuul.openstack.org/status?change=964205 | 06:27 |
| hemanth | https://review.opendev.org/c/openstack/sunbeam-charms/+/964205/2#message-8f3e9d694cff4b7e7badd61f615e41e2565e0f45 | 06:31 |
| tonyb | You got a little lucky in that you landed on RAX Flex | 06:32 |
| hemanth | ok.. let me start one just with charm-build-glance-k8s | 06:34 |
| tonyb | Okay, I'll discard mine | 06:37 |
| hemanth | I triggered one - change 964208 | 06:37 |
| tonyb | hemanth: What's the change number? | 06:38 |
| hemanth | The job started on a rax classic node | 06:43 |
| hemanth | Can you autohold and is it possible for me to ssh into the node? | 06:45 |
| fungi | tonyb: ^ (in case you weren't watching closely) | 06:46 |
| tonyb | Added, sorry I was trying to get the hold in place while making a stupid amount of typos | 06:47 |
| tonyb | https://zuul.opendev.org/t/openstack/autoholds | 06:47 |
| tonyb | hemanth: ^^ | 06:47 |
| tonyb | hemanth: Can you send me your ssh public key and I'll add it to the node | 06:48 |
| hemanth | tonyb: https://launchpad.net/~hemanth-n/+sshkeys .. can you add the last one? | 06:50 |
| tonyb | Ah and of course the networking is b0rked | 06:51 |
| hemanth | :-( I will ping you back when I see it run again on rax classic | 06:52 |
| fungi | well, that's sort of to be expected since that's what the job failures were suggesting | 06:52 |
| tonyb | Yeah :/ | 06:52 |
| hemanth | i am just thinking how to debug further | 06:52 |
| tonyb | You could add an explicit "fail" into the playbooks before the charmcraft | 06:53 |
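A minimal sketch of that "explicit fail" idea as an Ansible task dropped into the build playbook right before the charmcraft step; the task wording and its placement in the play are assumptions.

```yaml
# Hypothetical debugging task: abort the run before charmcraft touches the
# network so an autohold captures the node in a known-good state.
- name: Intentional stop for debugging (remove before merging)
  ansible.builtin.fail:
    msg: "Failing on purpose so the held node keeps its pre-charmcraft state"
```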
| tonyb | gimme 3 to think ... | 06:54 |
| fungi | booting the node into recovery mode through the rackspace dashboard should be possible, but that may destroy whatever state the playbook has gotten it into | 06:54 |
| fungi | though could still be enough to collect logs | 06:54 |
| tonyb | fungi: Do you think I could *maybe* access it via the service-net in that region ... from the mirror node | 06:56 |
| tonyb | (guessing that only eth0 is b0rked) | 06:56 |
| fungi | oh, entirely possible but you'd need to copy your keys over | 06:56 |
| fungi | or do agent forwarding or something | 06:56 |
| fungi | not really the safest idea | 06:57 |
| tonyb | Yeah I'd probably do agent forwarding | 06:57 |
| fungi | though i suppose you could proxy the socket | 06:57 |
| tonyb | Let me see if I can get the Service-IP, and if it's listening on 22 | 06:59 |
| tonyb | The first node has been recycled | 07:01 |
| hemanth | tonyb: 2nd attempt landed on rax-ord-main | 07:06 |
| tonyb | Okay I got in and killed /usr/bin/snap install --channel latest/stable lxd | 07:08 |
| tonyb | so the job failed and hopefully we get the autohold | 07:08 |
| tonyb | I'll see if I can get in via service-net, and then add hemanth's ssh key. | 07:09 |
| tonyb | hemanth: You'll need to manually walk through the playbook | 07:09 |
| hemanth | tonyb: ack | 07:09 |
| tonyb | Okay service-net does listen on 22 | 07:11 |
| hemanth | tonyb: let me know when i should be good to ssh | 07:18 |
| tonyb | Will do, sorry it's taking a little longer to get in than I thought | 07:18 |
| tonyb | hemanth: root@23.253.164.133 | 07:21 |
| hemanth | tonyb: thanks, i am in | 07:22 |
| tonyb | Okay | 07:22 |
| tonyb | you should be able to sudo -i -u zuul su - (or similar) | 07:22 |
| tonyb | and then work through the playbook | 07:22 |
| hemanth | ack | 07:22 |
| tonyb | *hopefully*, if the network dies again it will only impact the public network | 07:23 |
| tonyb | I'm running a mix of ip/iptables commands (with local logging) "just in case" | 07:25 |
| hemanth | ack | 07:26 |
| tonyb | the lxdbr0 is still on 10.x, is that expected? | 07:27 |
| hemanth | Yes, with the current one, yes.. | 07:28 |
| tonyb | ++ | 07:28 |
| hemanth | I can change it manually.. lemme do that | 07:28 |
| tonyb | No it's fine | 07:28 |
| tonyb | I just wanted to check what I'm seeing | 07:28 |
| tonyb | the routes look okay | 07:28 |
| hemanth | started the charmcraft pack command | 07:32 |
| tonyb | ++ | 07:33 |
| hemanth | is the connection gone for you? | 07:34 |
| tonyb | I still have access on service-net | 07:35 |
| hemanth | I can no longer access it | 07:36 |
| tonyb | Okay it's lost the default route | 07:36 |
| hemanth | I added the following rule before running charmcraft pack: nft insert rule filter openstack-INPUT iif lxdbr0 accept | 07:36 |
| hemanth | as per the playbook.. to give lxd network access | 07:37 |
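For readability, the rule quoted above written the way a playbook might apply it; the nft command itself is verbatim from the discussion, while the module choice, task name, and placement are assumptions rather than the sunbeam-charms playbook.

```yaml
# Hypothetical wrapper around the quoted rule; everything except the nft
# command is assumed.
- name: Allow traffic from lxdbr0 so the build container has network access
  become: true
  ansible.builtin.command:
    cmd: nft insert rule filter openstack-INPUT iif lxdbr0 accept
```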
| tonyb | eth0 no longer has an IP and is marked as DOWN | 07:37 |
| tonyb | Good: https://paste.opendev.org/show/buP2Y3YFx7fbLhAfd2e3/ | 07:38 |
| tonyb | Now: https://paste.opendev.org/show/bFGMj25zQox3PexkP6o6/ | 07:39 |
| hemanth | Anything in syslog that tells us why it happened? | 07:39 |
| tonyb | Anything in the charmcraft pack that would down the host's network? | 07:39 |
| *** ykarel_ is now known as ykarel | 07:40 | |
| tonyb | https://paste.opendev.org/show/bk4ZjLzFAvs00bGZU0BM/ | 07:40 |
| hemanth | Ideally it shouldn't touch the host... the only thing that puzzles me is that the interface name defined within the container is eth0, which is set in the lxc profile.. sudo lxc profile show default | 07:40 |
| tonyb | Okay, I have run ifup eth0; you should be able to get back in | 07:42 |
| hemanth | ok | 07:42 |
| hemanth | I will run some experiments if you don't mind... | 07:43 |
| tonyb | https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ | 07:44 |
| tonyb | Looks like the snap shut down eth0 | 07:44 |
| hemanth | Yeah, I want to change the lxd interface name to something other than eth0 | 07:44 |
| tonyb | Okay | 07:44 |
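A minimal sketch of what renaming the container-side NIC might look like in the default profile, roughly the shape `sudo lxc profile show default` prints; the device properties and the replacement name `ceth0` are assumptions, not taken from the held node.

```yaml
# Hypothetical "lxc profile show default" excerpt with the container-side
# interface renamed away from eth0. Property names and values are assumed.
config: {}
devices:
  eth0:              # device key in the profile
    name: ceth0      # hypothetical in-container interface name (was eth0)
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
```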
| hemanth | I tried to remove lxd and my connection is lost again | 07:45 |
| hemanth | Can you ifup eth0? | 07:45 |
| tonyb | done | 07:45 |
| hemanth | I will get back to you in 15 minutes | 07:46 |
| tonyb | Okay | 07:46 |
| hemanth | tonyb: I think it's an issue with charmcraft when the host interface name is eth0 ... I will debug by creating an environment with that combination.. I do not have a good workaround yet.. but it seems this is not a problem with ubuntu-noble, where I see the tests ran on rax-iad-main but the interface names are enx0, enx1 ... please release the node | 08:29 |
| hemanth | and thank you for all the support | 08:30 |
| tonyb | np Good luck | 08:30 |
| tonyb | you know where I am/we are if you need more help | 08:30 |
| hemanth | yeah sure and thanks for the patience | 08:30 |
| tonyb | All good. | 08:32 |
| hemanth | tonyb: did you happen to save the iptables rules? If you have them, can you provide that information as well? | 09:04 |
| tonyb | Sorry, I cleaned them up, thinking they weren't needed anymore | 09:05 |
| hemanth | ack.. that's fine | 09:05 |
| tonyb | hemanth: Let me check something | 09:06 |
| tonyb | I thought they might be in: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a15/openstack/a15b8e30ca57484c8f33eac2cfc352f4/zuul-info/zuul-info.ubuntu-jammy.txt | 09:07 |
| tonyb | but it seems not | 09:07 |
| hemanth | I tried to replicate the eth0 interface on my VM and this time the build is fine.. the only difference in my environment is that there are no specific iptables/nft rules | 09:09 |
| hemanth | I have similar logs to the upstream one (which you provided) https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ .. except for the iptables drop .. the eth0 interface referred to is the container's eth0 interface | 09:10 |
| tonyb | Well you can use the dummy change to dump them on a new run. You could also add a hard fail right before the charmcraft pack, and then try again | 09:11 |
| hemanth | I moved the jobs to ubuntu-noble and triggered those jobs in change 962366 .. seems fine for now.. (mostly due to the different interface name on the host.. cross-checked a couple of instances on rax classic) | 09:12 |
| tonyb | now that we know roughly where/why it happens | 09:12 |
| tonyb | or if it makes no real difference to the output you could just switch to noble, as you say that changes the interface name on the host | 09:13 |
| tonyb | I guess at this stage it's a matter of how much you want to debug this/understand the underlying problem | 09:13 |
| hemanth | Yeah, I will watch some builds today but will come back to jammy tomorrow since I'm curious what causes the iptables drop | 09:13 |
| tonyb | Well the DROPS in https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ are just random checks to see if there is a webserver on the node | 09:18 |
| hemanth | ack | 09:19 |
| tonyb | I don't think they're the cause | 09:19 |
| tonyb | It's the 'Oct 16 07:32:55 ubuntu kernel: [ 1680.186759] eth0: renamed from physyLxJCZ' that I thought was more interesting | 09:20 |
| hemanth | I am retesting again locally with a few settings.. | 09:21 |
| tonyb | Okay | 09:21 |
| tonyb | I'm probably going to sign off soon | 09:22 |
| hemanth | \o | 09:23 |
| *** bauzas9 is now known as bauzas | 19:35 | |
| opendevreview | Tony Breeds proposed openstack/project-config master: [pti-python-tarball] Add compatibility for older wheels https://review.opendev.org/c/openstack/project-config/+/964251 | 20:12 |
| opendevreview | Tony Breeds proposed openstack/project-config master: [pti-python-tarball] Add compatibility for older wheels https://review.opendev.org/c/openstack/project-config/+/964251 | 21:07 |