| opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/962557 | 02:15 |
|---|---|---|
| hemanth | Hey, regarding the RETRY_LIMIT "Host Unreachable" failures.. I have seen the error multiple times yesterday and today. 1. I increased the verbosity of the charmcraft command; charmcraft creates a lxc container, sets up networking, tz etc. inside the container, and when it tries to stop the container the log stream disconnects. 2. I modified the lxc network from 10.x.x.x to 172.28.x.x but the problem still exists ... the pattern I observed is that this is | 03:41 |
| hemanth | happening on instances where more than one interface exists (primary interfaces: 104.x.x.x and 23.x.x.x).. is there a nodeset I can use so that the test machine gets a 104.x.x.x/23.x.x.x address? (I do not want to recheck, which runs 40+ jobs, just to reproduce the error) | 03:41 |
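For context on the bridge change described above, here is a minimal sketch of what the reconfigured LXD bridge might look like, roughly in the shape `lxc network show lxdbr0` prints; the exact keys and the 172.28.0.1/24 address are assumptions, not taken from the job logs.

```yaml
# Hypothetical: approximate "lxc network show lxdbr0" output after moving
# the bridge off 10.x, as described above. Keys and address are assumed.
name: lxdbr0
type: bridge
config:
  ipv4.address: 172.28.0.1/24   # assumed CIDR in the 172.28.x.x range mentioned
  ipv4.nat: "true"
  ipv6.address: none
```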
| hemanth | I see 3 attempts on some of the jobs that are failing but there is no pre-run defined for the job.. is this expected? https://zuul-ci.org/docs/zuul/latest/config/job.html#attr-job.attempts | 03:42 |
| hemanth | Job definition: https://opendev.org/openstack/sunbeam-charms/src/branch/main/zuul.d/jobs.yaml#L1 (or any charm-build-* in this file) | 03:44 |
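For reference on the retry question: Zuul also retries a build when Ansible reports the node unreachable during any phase, not only on pre-run failures, and the `attempts` attribute caps those retries. Below is a minimal sketch of how a job could lower that cap; the `charm-build` parent and the placement in zuul.d/jobs.yaml are assumptions, not copied from sunbeam-charms.

```yaml
# Hypothetical sketch: cap retries so a "Host Unreachable" surfaces as a
# single failure instead of three attempts. The parent name is assumed.
- job:
    name: charm-build-glance-k8s
    parent: charm-build
    attempts: 1
```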
| tonyb | hemanth: I'm AFK right now (on my phone). Thanks for the updates. I'll process what you've written and see what suggestions I have | 04:16 |
| tonyb | hemanth: A very non-scientific look at: https://zuul.opendev.org/t/openstack/builds?job_name=charm-build-glance-k8s&project=openstack%2Fsunbeam*&change=963705&skip=0 seems to show that all the failing jobs are running on the 'RAX Classic' cloud (in various regions). | 06:10 |
| tonyb | Some (https://4ca4d8620dc864739feb-f83d06667d580e000031601b82c71a43.ssl.cf5.rackcdn.com/openstack/acfdfa3dcb224b04b45750039d5298c0/zuul-info/inventory.yaml) run (successfully) on "RAX Flex" which has multiple interfaces. | 06:10 |
| tonyb | I don't think there is anyway to explicitly request a node on that (RAX Classic) provider. | 06:15 |
| tonyb | hemanth: I suggest creating a modified version of your change that removes all jobs from: https://opendev.org/openstack/sunbeam-charms/src/branch/main/zuul.d/project-templates.yaml#L40 apart from charm-build-glance-k8s. Then you can recheck on that. | 06:18 |
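A minimal sketch of the kind of temporary edit being suggested: trim the template in the test change so only the failing job runs on recheck. The template name and surrounding structure here are assumptions; only the job name comes from the discussion.

```yaml
# Hypothetical: zuul.d/project-templates.yaml reduced for debugging only.
# "sunbeam-charms-check" is a made-up template name.
- project-template:
    name: sunbeam-charms-check
    check:
      jobs:
        - charm-build-glance-k8s
```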
| tonyb | Once you see it's running on RAX Classic (near the top of the job-output/console stream) look for the "Print node information" task. Assuming an admin is online we can add an autohold for you | 06:20 |
| hemanth | I moved the builds to ubuntu-noble and I do not see any issue .. https://zuul.openstack.org/status?change=964205 | 06:27 |
| hemanth | https://review.opendev.org/c/openstack/sunbeam-charms/+/964205/2#message-8f3e9d694cff4b7e7badd61f615e41e2565e0f45 | 06:31 |
| tonyb | You got a little lucky in that you landed on RAX Flex | 06:32 |
| hemanth | ok.. let me start one just with charm-build-glance-k8s | 06:34 |
| tonyb | Okay, I'll discard mine | 06:37 |
| hemanth | I triggered one - change 964208 | 06:37 |
| tonyb | hemanth: What's the change number? | 06:38 |
| hemanth | The job started on a rax classic node | 06:43 |
| hemanth | Can you autohold and is it possible for me to ssh into the node? | 06:45 |
| fungi | tonyb: ^ (in case you weren't watching closely) | 06:46 |
| tonyb | Added, sorry I was trying to get the hold in place while making a stupid amount of typos | 06:47 |
| tonyb | https://zuul.opendev.org/t/openstack/autoholds | 06:47 |
| tonyb | hemanth: ^^ | 06:47 |
| tonyb | hemanth: Can you send me your ssh public key and I'll add it to the node | 06:48 |
| hemanth | tonyb: https://launchpad.net/~hemanth-n/+sshkeys .. can you add the last one? | 06:50 |
| tonyb | Ah and of course the networking is b0rked | 06:51 |
| hemanth | :-( I will ping you back when I see it run again on rax classic | 06:52 |
| fungi | well, that's sort of to be expected since that's what the job failures were suggesting | 06:52 |
| tonyb | Yeah :/ | 06:52 |
| hemanth | i am just thinking how to debug further | 06:52 |
| tonyb | You could add an explicit "fail" into the playbooks before the charmcraft | 06:53 |
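A minimal sketch of that "explicit fail" idea as an Ansible task dropped into the build playbook right before the charmcraft step; the task wording and its placement in the play are assumptions.

```yaml
# Hypothetical debugging task: abort the run before charmcraft touches the
# network so an autohold captures the node in a known-good state.
- name: Intentional stop for debugging (remove before merging)
  ansible.builtin.fail:
    msg: "Failing on purpose so the held node keeps its pre-charmcraft state"
```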
| tonyb | gimme 3 to think ... | 06:54 |
| fungi | booting the node into recovery mode through the rackspace dashboard should be possible, but that may destroy whatever state the playbook has gotten it into | 06:54 |
| fungi | though could still be enough to collect logs | 06:54 |
| tonyb | fungi: Do you think I could *maybe* access it via the service-net in that region ... from the mirror node | 06:56 |
| tonyb | (guessing that only eth0 is b0rked) | 06:56 |
| fungi | oh, entirely possible but you'd need to copy your keys over | 06:56 |
| fungi | or do agent forwarding or something | 06:56 |
| fungi | not really the safest idea | 06:57 |
| tonyb | Yeah I'd probably do agent forwarding | 06:57 |
| fungi | though i suppose you could proxy the socket | 06:57 |
| tonyb | Let me see if I can get the Service-IP, and if it's listening on 22 | 06:59 |
| tonyb | The first node has been recycled | 07:01 |
| hemanth | tonyb: 2nd attempt landed on rax-ord-main | 07:06 |
| tonyb | Okay I got in and killed /usr/bin/snap install --channel latest/stable lxd | 07:08 |
| tonyb | so the job failed and hopefully we get the autohold | 07:08 |
| tonyb | I'll see if I can get in via service-net, and then add hemanth's ssh key. | 07:09 |
| tonyb | hemanth: You'll need to manually walk through the playbook | 07:09 |
| hemanth | tonyb: ack | 07:09 |
| tonyb | Okay service-net does listen on 22 | 07:11 |
| hemanth | tonyb: let me know when i should be good to ssh | 07:18 |
| tonyb | Will do, sorry it's taking a little longer to get in than I thought | 07:18 |
| tonyb | hemanth: root@23.253.164.133 | 07:21 |
| hemanth | tonyb: thanks, i am in | 07:22 |
| tonyb | Okay | 07:22 |
| tonyb | you should be able to sudo -i -u zuul su - (or similar) | 07:22 |
| tonyb | and then work through the playbook | 07:22 |
| hemanth | ack | 07:22 |
| tonyb | *hopefully*, if the network dies again it will only impact the public network | 07:23 |
| tonyb | I'm running a mix of ip/iptables commands (with local logging) "just in case" | 07:25 |
| hemanth | ack | 07:26 |
| tonyb | the lxdbr0 is still on 10.x, is that expected? | 07:27 |
| hemanth | Yes, with the current one, yes.. | 07:28 |
| tonyb | ++ | 07:28 |
| hemanth | I can change it manually.. lemme do that | 07:28 |
| tonyb | No it's fine | 07:28 |
| tonyb | I just wanted to check what I'm seeing | 07:28 |
| tonyb | the routes look okay | 07:28 |
| hemanth | started the charmcraft pack command | 07:32 |
| tonyb | ++ | 07:33 |
| hemanth | is the connection gone for you? | 07:34 |
| tonyb | I still have access on service-net | 07:35 |
| hemanth | I can no longer access it | 07:36 |
| tonyb | Okay it's lost the default route | 07:36 |
| hemanth | I added the following rule before running charmcraft pack: nft insert rule filter openstack-INPUT iif lxdbr0 accept | 07:36 |
| hemanth | as per the playbook.. to give lxd network access | 07:37 |
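For readability, the rule quoted above written the way a playbook might apply it; the nft command itself is verbatim from the discussion, while the module choice, task name, and placement are assumptions rather than the sunbeam-charms playbook.

```yaml
# Hypothetical wrapper around the quoted rule; everything except the nft
# command is assumed.
- name: Allow traffic from lxdbr0 so the build container has network access
  become: true
  ansible.builtin.command:
    cmd: nft insert rule filter openstack-INPUT iif lxdbr0 accept
```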
| tonyb | eth0 no longer has an IP and is marked as DOWN | 07:37 |
| tonyb | Good: https://paste.opendev.org/show/buP2Y3YFx7fbLhAfd2e3/ | 07:38 |
| tonyb | Now: https://paste.opendev.org/show/bFGMj25zQox3PexkP6o6/ | 07:39 |
| hemanth | Anything in syslog that tells us why it happened? | 07:39 |
| tonyb | Anything in the charmcraft pack that would down the host's network? | 07:39 |
| *** ykarel_ is now known as ykarel | 07:40 | |
| tonyb | https://paste.opendev.org/show/bk4ZjLzFAvs00bGZU0BM/ | 07:40 |
| hemanth | Ideally it shouldn't touch the host... the only thing that puzzles me is that the interface name defined within the container is eth0, which is set in the lxc profile.. sudo lxc profile show default | 07:40 |
| tonyb | Okay, I have run ifup eth0; you should be able to get back in | 07:42 |
| hemanth | ok | 07:42 |
| hemanth | I will run some experiments if you don't mind... | 07:43 |
| tonyb | https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ | 07:44 |
| tonyb | Looks like the snap shut down eth0 | 07:44 |
| hemanth | Yeah, I want to change the lxd interface name to something other than eth0 | 07:44 |
| tonyb | Okay | 07:44 |
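A minimal sketch of what renaming the container-side NIC might look like in the default profile, roughly the shape `sudo lxc profile show default` prints; the device properties and the replacement name `ceth0` are assumptions, not taken from the held node.

```yaml
# Hypothetical "lxc profile show default" excerpt with the container-side
# interface renamed away from eth0. Property names and values are assumed.
config: {}
devices:
  eth0:              # device key in the profile
    name: ceth0      # hypothetical in-container interface name (was eth0)
    network: lxdbr0
    type: nic
  root:
    path: /
    pool: default
    type: disk
name: default
```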
| hemanth | I tried to remove lxd and my connection is lost again | 07:45 |
| hemanth | Can you ifup eth0? | 07:45 |
| tonyb | done | 07:45 |
| hemanth | I will get back to you in 15 minutes | 07:46 |
| tonyb | Okay | 07:46 |
| hemanth | tonyb: I think it's an issue with charmcraft when the host interface name is eth0 ... I will debug by creating an environment with that combination.. I do not have a good workaround yet.. but it seems this is not a problem with ubuntu-noble, where I see the tests ran on rax-iad-main but the interface names are enx0, enx1 ... please release the node | 08:29 |
| hemanth | and thank you for all the support | 08:30 |
| tonyb | np Good luck | 08:30 |
| tonyb | you know where I am/we are if you need more help | 08:30 |
| hemanth | yeah sure and thanks for the patience | 08:30 |
| tonyb | All good. | 08:32 |
| hemanth | tonyb: did you happen to save the iptables rules? If you have them, can you provide that information as well? | 09:04 |
| tonyb | Sorry, I cleaned them up, thinking they weren't needed anymore | 09:05 |
| hemanth | ack.. that's fine | 09:05 |
| tonyb | hemanth: Let me check something | 09:06 |
| tonyb | I thought they might be in: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a15/openstack/a15b8e30ca57484c8f33eac2cfc352f4/zuul-info/zuul-info.ubuntu-jammy.txt | 09:07 |
| tonyb | but it seems not | 09:07 |
| hemanth | I tried to replicate the eth0 interface on my VM and this time the build is fine.. the only difference in my environment is that there are no specific iptables/nft rules | 09:09 |
| hemanth | I have similar logs to the upstream one (which you provided) https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ .. except for the iptables drop .. the eth0 interface referred to is the container's eth0 interface | 09:10 |
| tonyb | Well you can use the dummy change to dump them on a new run. You could also add a hard fail right before the charmcraft pack, and then try again | 09:11 |
| hemanth | I moved the jobs to ubuntu-noble and triggered those jobs in change 962366 .. seems fine for now.. (mostly due to the different interface name on the host.. cross-checked a couple of instances on rax classic) | 09:12 |
| tonyb | now that we know roughly where/why it happens | 09:12 |
| tonyb | or if it makes no real difference to the output you could just switch to noble, as you say that changes the interface name on the host | 09:13 |
| tonyb | I guess at this stage it's a matter of how much you want to debug this/understand the underlying problem | 09:13 |
| hemanth | Yeah, I will watch some builds today but will come back to jammy tomorrow since I'm curious what causes the iptables drop | 09:13 |
| tonyb | Well the DROPS in https://paste.opendev.org/show/bTCmjYHDdgYcoNUjOe4Q/ are just random checks to see if there is a webserver on the node | 09:18 |
| hemanth | ack | 09:19 |
| tonyb | I don't think they're the cause | 09:19 |
| tonyb | It's the 'Oct 16 07:32:55 ubuntu kernel: [ 1680.186759] eth0: renamed from physyLxJCZ' that I thought was more interesting | 09:20 |
| hemanth | I am retesting again locally with a few settings.. | 09:21 |
| tonyb | Okay | 09:21 |
| tonyb | I'm probably going to sign off soon | 09:22 |
| hemanth | \o | 09:23 |
| *** bauzas9 is now known as bauzas | 19:35 | |
| opendevreview | Tony Breeds proposed openstack/project-config master: [pti-python-tarball] Add compatibility for older wheels https://review.opendev.org/c/openstack/project-config/+/964251 | 20:12 |
| opendevreview | Tony Breeds proposed openstack/project-config master: [pti-python-tarball] Add compatibility for older wheels https://review.opendev.org/c/openstack/project-config/+/964251 | 21:07 |