Friday, 2024-06-28

opendevreviewGhanshyam proposed openstack/devstack master: Do not use 'admin' role in service users  https://review.opendev.org/c/openstack/devstack/+/92301301:44
opendevreviewGhanshyam proposed openstack/devstack master: Do not use 'admin' role in service users  https://review.opendev.org/c/openstack/devstack/+/92301301:45
opendevreviewGhanshyam proposed openstack/tempest master: New job to test the service-to-service interaction using service role  https://review.opendev.org/c/openstack/tempest/+/92301401:57
opendevreviewGhanshyam proposed openstack/tempest master: New job to test the service-to-service interaction using service role  https://review.opendev.org/c/openstack/tempest/+/92301401:58
opendevreviewAmit Uniyal proposed openstack/whitebox-tempest-plugin master: Updated watchdog tests with machine type  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92303108:52
*** whoami-rajat_ is now known as whoami-rajat11:25
*** dmellado07553 is now known as dmellado075514:51
opendevreviewArtom Lifshitz proposed openstack/whitebox-tempest-plugin master: Switch to Ubuntu Noble image  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92306316:49
sean-k-mooneyhi folks, was there a zuul restart today?17:14
sean-k-mooneyI'm not sure I have seen 'cancelled' before as a build status, other than that17:14
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/buildset/e86ea5bb01814931af504815752f0a2817:14
clarkbsean-k-mooney: no, zuul restarts early Saturday UTC time, but it does so in a rolling fashion and shouldn't cancel any jobs to do that17:16
clarkbit is exceptionally rare these days for us to hard stop all of zuul17:17
clarkbsean-k-mooney: I don't know yet why that happened but it may have to do with parent change state17:18
sean-k-mooneywell the parent change merged17:18
sean-k-mooneythere has been a behavior change in zuul at some point17:19
sean-k-mooneywhich seems to abort or cancel jobs when that happens17:19
sean-k-mooneywe often stack up 5+ changes with +2/+W, holding the bottom change, and then +W the bottom change to merge the set17:20
clarkb[<Change 0x7f5f25ae1c90 openstack/nova 915735,4>] in gate> is a failing item because ['at least one job failed']17:20
sean-k-mooneyI have seen issues with that over the last few months17:20
clarkbso I think it had a failing job and wasn't going to merge. And the bug here is that zuul is changing the result from failed to cancelled, so it isn't clear where the failure is17:20
sean-k-mooneythey were all cancelled according to the result17:21
clarkbsean-k-mooney: yes, but I think that is an after-the-fact thing (e.g. after success or failure, since it is reporting at least one failed)17:21
clarkbbasically it's a UI bug, not a didn't-merge-properly bug, but I'm still looking17:21
sean-k-mooneymaybe; it looks like none of the jobs ran for more than a minute or so17:21
sean-k-mooneywe don't have any logs either, which makes this hard to triage17:22
clarkbI half suspect that there was a buildset that failed and then, as part of redoing the queue, a new buildset was created which then got cancelled17:23
clarkbif that is the case the db doesn't confirm it, though17:24
sean-k-mooneyright, but could that have happened because the parent patch merged?17:24
sean-k-mooneythe parent patch merged at Friday, 2024-06-28, 17:21:16 UTC+01:0017:25
clarkbthe cancellations happened well before that17:26
sean-k-mooneydo they when you account for time zones?17:27
sean-k-mooney2024-06-28 14:22:57 is what it displays, but I don't know what time zone that is17:27
sean-k-mooneyit was cancelled 1 hour and 6 minutes ago and the parent merged 1 hour and 7 minutes ago, looking at gerrit17:28
clarkbwe only use UTC17:29
sean-k-mooneyso just looking at the gerrit comments, it looks like they are within a few seconds of each other17:29
clarkbthe builds were cancelled because the parent change had a failing job17:29
sean-k-mooneythat does not fit with what I'm seeing17:29
sean-k-mooneythe parent patch was successfully merged17:30
clarkb2024-06-28 14:23:09,832 DEBUG zuul.Pipeline.openstack.gate: [e: 967b8cbde1cb4649bc4b7f5d7ded6be8] <QueueItem 664fbea31e5343a7a5a8f7178c31e799 live for [<Change 0x7f9c2c6bb7d0 openstack/nova 915734,4>] in gate> is a failing item because ['at least one job failed']17:30
clarkbsean-k-mooney: yes that is possible in the gate due to resets17:30
clarkbso it seems this child change was caught in a reset and got cancelled. Why it didn't get re-enqueued with new jobs and no association to the parent is still a mystery17:30
sean-k-mooneyok so you think there was a patch in between that had a failure17:30
clarkbthe parent seems to have gotten re-enqueued properly in the reset17:30
sean-k-mooneyand then it did not requeue properly17:31
clarkbyes, it's looking like there was at least one more change involved, so that things could be reset multiple times17:31
clarkband this one didn't get reset properly17:31
sean-k-mooneyack, so this isn't the first time I have seen this17:31
sean-k-mooneyit's rare, but we have had a parent merge and all of its child patches aborted infrequently for 6-12 months now17:32
sean-k-mooneywhich has been frustrating since it requires a recheck to fix, which means 2 more full CI runs per patch17:33
clarkbok this is the first I've heard of or seen it17:33
clarkbbut I'm trying to figure out from logs where things went wrong and have brought it up with corvus too17:34
sean-k-mooneyack, thank you. I probably should hop on Matrix more when I see things like this17:34
clarkbhttps://zuul.opendev.org/t/openstack/builds?change=915734&skip=0 may be the clue we need17:36
clarkbnote that none of the gate jobs have a failure state but one is in a retry state. My hunch now is that this corner case isn't handled properly: zuul thinks it requires a gate reset, but then the code that actually re-enqueues stuff realizes that no, a retry isn't a failing state yet, so it doesn't fully reset17:37
sean-k-mooneyyou think maybe https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f17:37
sean-k-mooneyit did pass after the retry17:38
clarkbya I think that's tripping the reset for 915735 inappropriately (this would be a bug), then when zuul goes to re-enqueue 915735 it realizes it doesn't need to do anything because 915734 isn't actually failing, which effectively orphans the cancelled state in the queue17:38
clarkbthen 915734's job reruns and succeeds and the change can merge17:38
sean-k-mooneysounds plausible17:39
sean-k-mooneyI wonder if they changed the logic at some point to now wait for the entire buildset to be completed before dequeuing the later patches17:39
sean-k-mooneyalthough no, it used to reorder on the first job failure17:40
sean-k-mooneybut obviously it's not handling a retry properly17:40
sean-k-mooneyif this is in fact the cause17:40
sean-k-mooneythat url actually shows something that is kind of interesting from a data perspective.17:43
sean-k-mooneyyou can see the total number of jobs that ran on that change17:44
sean-k-mooneyover all the revisions and retries17:44
sean-k-mooneyin that case, __only__ 137 executions17:45
clarkbhttps://zuul.opendev.org/t/openstack/builds?change=915734&change=915735&skip=0 is interesting too because you can see the cancels coincide with the retry17:45
sean-k-mooneyoh you filtered on both changes ya17:47
sean-k-mooney2024-06-28 14:21:46 +1 min 29 secs is 14:22:15 ish17:48
sean-k-mooneyalthough that's the total runtime17:49
sean-k-mooneythe cancels probably got triggered early when it started to fail17:49
clarkbI think this might be a side effect of an optimization to reconfigure queues as soon as zuul detects the first ansible task failure and not waiting for the job to fully complete17:51
clarkbthe reason is if pre_fail is True we short circuit and consider the job failed and don't check if it is retry or not17:52
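(To make the hypothesis above concrete: a minimal Python sketch of the ordering clarkb is describing. This is not Zuul's actual code; the Build class and counts_as_failing name are illustrative only, assuming the pre_fail short circuit is consulted before the retry state.)

    from dataclasses import dataclass

    @dataclass
    class Build:
        result: str | None   # e.g. "SUCCESS", "FAILURE", "RETRY", or None while running
        pre_fail: bool       # set as soon as a task failure is detected mid-run

    def counts_as_failing(build: Build) -> bool:
        if build.pre_fail:            # short circuit: the retry state is never examined
            return True
        if build.result == "RETRY":   # a retried build is not (yet) a real failure
            return False
        return build.result == "FAILURE"

    # A build that hit a task failure but is being retried would still be reported
    # as failing, triggering the gate reset and orphaned cancelled builds seen here.
    print(counts_as_failing(Build(result="RETRY", pre_fail=True)))  # True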
sean-k-mooneythat sounds likely17:52
sean-k-mooneydoes the short circuit take account of ansible recovery?17:52
sean-k-mooneyi.e. if a playbook is written to try something and recover if it fails and try something else17:53
sean-k-mooneythat's not what happens in this case https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#20217:53
sean-k-mooneybut just wondering how smart the short circuit is17:54
clarkbI think so; it has to wait for the block to fail to handle that case17:54
clarkbbut zuul also has test cases for handling retry and pre_fail together so this must be some edge case17:54
sean-k-mooneydo you know roughly when that optimisation was added/enabled17:55
sean-k-mooneyhttps://zuul-ci.org/docs/zuul/latest/developer/model-changelog.html#version-1417:56
sean-k-mooneyso I guess 9.0.017:56
clarkbya which would fit into your timeline I think17:57
sean-k-mooney10 months ago https://opendev.org/zuul/zuul/releases/tag/9.0.017:58
sean-k-mooneyI'm not seeing a scheduler config option or pipeline var for pre_fail18:03
sean-k-mooneyis this hardcoded to true or perhaps tenant-level config?18:03
clarkbya I'm not sure it is configurable18:07
melwitthas anyone seen a failure like this before where ssh fails public key authentication? I have seen it rarely and I couldn't figure out how it's possible to happen. or maybe I am interpreting the failure wrong? https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_43e/922918/1/check/nova-multi-cell/43ef3c3/testr_results.html18:08
sean-k-mooneyhttps://opendev.org/zuul/zuul/commit/1170b91bd893741dd0a664b28b56177998aa067518:08
sean-k-mooneylooking at the change that added it18:08
sean-k-mooneyI think it was not18:08
clarkbsean-k-mooney: the theory that is forming in the zuul matrix room is that this is due to post-run also failing. Basically everything is handling the pre-run failure just fine, but then your post-run playbook also fails and then it flips the pre_fail flag18:09
sean-k-mooneythe post-run playbook fails because of log collection, but it's ignored/recovers I think18:10
clarkbthere is no explicit recovery in the post run playbook18:14
sean-k-mooneysure but that should not be required either18:16
clarkbright I'm working on a patch to zuul18:16
clarkbjust pointing out the behaviors and why we end up with that end result18:16
sean-k-mooneycool glad we could find the issue so quickly18:16
sean-k-mooneyI meant to ask about this a few months ago and just never got the time18:17
sean-k-mooneyi.e. just after it had happened18:17
sean-k-mooneythis always cropped up while I was asleep after approving a set of changes before18:17
clarkbsean-k-mooney: fwiw fail-fast is the configuration to opt into the fast failing; however, this is buggy even if you don't set it18:30
clarkbwhich is the case here (it isn't set as far as I can tell but we're hitting the buggy code path anyway)18:31
sean-k-mooneyah ok18:40
sean-k-mooneyI wasn't getting any hits in the web search but didn't try too hard to find it when it was not obvious18:41
sean-k-mooneymelwitt: I'm just going to drop for the week, but yes, "Failed to establish authenticated ssh connection to cirros@172.24.5.195 after 15 attempts. Proxy client: no proxy client"18:42
sean-k-mooneyis the error18:42
sean-k-mooneyif we look at the cirros banner, it prints the login logo at === cirros: current=0.6.2 uptime=47.47 ===18:42
sean-k-mooneywe didn't get metadata until "successful after 6/20 tries: up 33.65. iid=i-000000"18:43
sean-k-mooneyso while we had an IP address after about 12 seconds, we did not pull the ssh key for another 20 or so18:43
sean-k-mooneythis is either the bug in ml2/ovn where it does not set up the metadata proxy until after the vm boots18:45
sean-k-mooneyor it's just a slow node18:45
clarkbsean-k-mooney: thank you for pointing that out. And definitely say something in the future if you see other weird behaviors18:45
sean-k-mooneyclarkb: sure, I generally keep an eye out for changes in behavior like this but I don't always have a concrete example to point to18:48
sean-k-mooneymelwitt: my general inclination is that dropbear started fine18:50
melwittsean-k-mooney: cool thanks for the insight. have a nice weekend18:50
sean-k-mooneywe see "Starting dropbear sshd: remove-dropbear-host-keys already run per instance" followed by OK18:50
melwittyeah it's like, it connects but then the public key is rejected. I didn't understand how it can have a wrong key, or does it mean something else18:50
sean-k-mooneybut it just took a little too long to get to the login prompt18:50
sean-k-mooneynormally you can tune this in tempest with a config option here https://github.com/openstack/tempest/blob/master/tempest/config.py#L844-L92918:51
sean-k-mooneyI don't see a way to alter the ssh retry count or interval, however18:52
sean-k-mooneywe are seeing 18:53
sean-k-mooney2024-06-27 17:52:33,047 99236 INFO     [paramiko.transport] Connected (version 2.0, client dropbear_2020.81)18:53
sean-k-mooney2024-06-27 17:52:33,151 99236 INFO     [paramiko.transport] Authentication (publickey) failed.18:53
sean-k-mooneyso we know it's not a network issue18:53
sean-k-mooneybut I would assume we have not pulled the ssh key and added it to the authorized_keys file in time, or the login daemon has not booted yet so pam is rejecting the login attempt18:54
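(For readers following along: a rough paramiko sketch of why that log line points at the key rather than the network. Tempest drives SSH through paramiko; the host and user are from this job, and the key path is a placeholder. "Authentication (publickey) failed" means the TCP and SSH handshakes completed and only the offered key was rejected, which fits authorized_keys not being populated yet.)

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect("172.24.5.195", username="cirros",
                       key_filename="/path/to/test_key", timeout=10)
    except paramiko.ssh_exception.NoValidConnectionsError:
        # sshd not reachable at all: a networking or boot problem
        print("could not connect")
    except paramiko.ssh_exception.AuthenticationException:
        # connected fine, but the public key was refused, e.g. because
        # cloud-init has not written authorized_keys yet
        print("connected but key rejected")
    finally:
        client.close()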
melwittyeah. doesn't that mean the key presented was rejected? basically I wondered if there was any possibility of a bug in tempest. hm ok18:56
melwittor rather corner case18:56
sean-k-mooneyya so normally the ssh daemon being up and you being able to log in are very close in time18:57
sean-k-mooneybut if you have ever tried to ssh into a real server when you have rebooted it18:57
sean-k-mooneyyou sometimes get a reject/auth issue because the server is not ready18:57
sean-k-mooneyopenssh gives you a different message when that happens but I don't know if dropbear does18:58
melwittno I have, but I thought the thing it tells you is different than "public key failed"18:58
melwittmaybe I didn't notice18:58
sean-k-mooneywell that's why I think cloud-init has not actually added the key yet18:59
sean-k-mooneyI think the authorized_keys file is empty18:59
sean-k-mooneyif we look at the cloud-init output18:59
sean-k-mooneychecking http://169.254.169.254/2009-04-04/instance-id18:59
sean-k-mooneyfailed 1/20: up 22.16. request failed18:59
sean-k-mooneyfailed 2/20: up 24.55. request failed18:59
sean-k-mooneyfailed 3/20: up 26.82. request failed18:59
sean-k-mooneyfailed 4/20: up 29.10. request failed19:00
sean-k-mooneyfailed 5/20: up 31.39. request failed19:00
sean-k-mooneysuccessful after 6/20 tries: up 33.65. iid=i-0000006319:00
sean-k-mooneywe see that, which is after dropbear was started19:00
melwittok, I see19:00
melwittwish we could dump that too when ssh fails19:00
sean-k-mooneyso that's a good 13+ seconds before we were able to pull the first file19:00
sean-k-mooneyI think this is probably the ovn bug19:00
sean-k-mooneyif we were to check the ovn metadata logs when this was booting, or the ovs logs19:01
sean-k-mooneyI bet the ovn metadata proxy was not ready to receive the requests19:01
melwittok I'll check that, just for curiosity19:02
sean-k-mooneyhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_43e/922918/1/check/nova-multi-cell/43ef3c3/compute1/logs/screen-q-ovn-metadata-agent.txt19:04
sean-k-mooneyat Jun 27 17:54:52.82640419:05
sean-k-mooneyit looks like it started setting it up at Jun 27 17:51:1819:08
melwittoh interesting19:09
sean-k-mooneybut I don't actually know if/when it finished19:09
sean-k-mooneywe see Jun 27 17:51:18.377087 np0037827180 neutron-ovn-metadata-agent[40160]: INFO neutron.agent.ovn.metadata.agent [-] Port 76f5dd84-8e96-4b52-9866-bca8ff8ba523 in datapath 8f767a47-1afd-49b4-b8a4-dcd3bf05091c bound to our chassis19:10
sean-k-mooneythen Provisioning metadata for network 8f767a47-1afd-49b4-b8a4-dcd3bf05091c19:10
sean-k-mooneyI think it starts at Jun 27 17:51:20.72517619:13
sean-k-mooneyand starts handling requests at 17:52:0219:14
sean-k-mooneytempest starts trying to ssh at 17:51:2619:15
sean-k-mooneyand gives up at 17:54:5119:16
sean-k-mooneyhttps://paste.opendev.org/show/bL4YosAN7DorXXdMKtZy/19:22
sean-k-mooneythe vm pulled the ssh key for the first time at 17:52:0319:23
sean-k-mooneytempest stopped retrying at 17:54:51,505 99236 ERROR    [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.195 after 15 attempts. Proxy client: no proxy client19:24
sean-k-mooneymelwitt: so ya the ovn metadata logs confirm it19:24
sean-k-mooneywell actually no, numbers are hard19:25
sean-k-mooneyit may have downloaded the key at 17:52:0319:26
melwittsean-k-mooney: and you can tell because the time of the port binding in the ovn metadata log is slightly after tempest gave up?19:26
sean-k-mooneybut we don't know how long it took for cloud-init to actually install it19:27
sean-k-mooneymelwitt: no, I misread one of the numbers19:27
sean-k-mooneytempest is trying to ssh in before cloud-init has even tried to pull the key19:27
sean-k-mooneybut cloud-init had some amount of time to set it up19:28
sean-k-mooneybut there it looks like a race to me19:28
melwittthe port binding message shows at Jun 27 17:54:52.826404 and tempest gave up at 2024-06-27 17:54:51,50519:29
sean-k-mooneyya if you look closely, that's when the port is unbound19:29
sean-k-mooneythat's when we are deleting the vm during cleanup19:30
melwittoh unbound. ok19:30
sean-k-mooneyovnmeta-8f767a47-1afd-49b4-b8a4-dcd3bf05091c is the ovn network namespace for the proxy19:31
sean-k-mooneyso we can see it get created and then all the proxied calls19:31
sean-k-mooneythe vm's ip is 10.1.0.319:31
sean-k-mooneyyou can see all the metadata lines where it's querying19:33
sean-k-mooneyif you search for neutron-ovn-metadata-agent[42136]: INFO eventlet.wsgi.server [-] 10.1.0.3,<local> "GET 19:34
sean-k-mooneyaround the right time19:34
sean-k-mooneyyou can confirm it's the correct request by also looking at the network id19:34
sean-k-mooneyX-Ovn-Network-Id: 8f767a47-1afd-49b4-b8a4-dcd3bf05091c19:34
melwitt@_@19:35
melwittit'll take a little while for my brain to absorb this19:36
sean-k-mooneythat's when the proxy started: https://paste.opendev.org/show/bY95LSHfDNbIrfYdfJ9W/19:36
sean-k-mooneyno worries19:36
sean-k-mooneyit's not the simplest thing to follow19:36
sean-k-mooneythat got created at 17:51:2019:37
sean-k-mooneyhum19:41
sean-k-mooneythat is not what I was expecting19:41
sean-k-mooneyJun 27 17:51:23.727070 np0037827180 nova-compute[42216]: DEBUG nova.compute.manager [req-308d2a9f-aafa-4104-9efe-b24a6eddd408 req-41d56dd8-7c93-410d-b00c-deeb9098f0b7 service nova] [instance: c31f0f22-1d32-4617-93be-28e3fb12589d] Received event network-vif-plugged-76f5dd84-8e96-4b52-9866-bca8ff8ba523 {{(pid=42216) external_instance_event19:41
sean-k-mooney/opt/stack/nova/nova/compute/manager.py:11129}}19:41
sean-k-mooneywe actually didn't receive the network-vif-plugged event until after neutron set up metadata19:42
sean-k-mooneythe nova compute log says19:43
sean-k-mooneyJun 27 17:51:23.737579 np0037827180 nova-compute[42216]: INFO nova.virt.libvirt.driver [-] [instance: c31f0f22-1d32-4617-93be-28e3fb12589d] Instance spawned successfully.19:43
sean-k-mooneyand tempest waits for it to go active until Thu, 27 Jun 2024 17:51:2519:44
sean-k-mooneymelwitt: so all that looks like it's correct19:44
sean-k-mooneytempest started sshing almost immediately, at 17:51:26,19:44
sean-k-mooneybut it just took too long for cloud-init to set everything up19:45
sean-k-mooneywe can see in the console that from the time the vm is spawned, it's over 30 seconds before networking is up19:46
sean-k-mooneyand we can also see that tempest is not trying to ping before it's sshing19:47
melwittoh huh. I thought all those got updated to wait for ping and ssh19:48
sean-k-mooneyI think it might be either/or rather than doing one then the other19:48
sean-k-mooneybut I think that would prevent the issue19:48
sean-k-mooneywait for it to be pingable, then wait for it to be sshable19:48
melwittyeah. I had thought those all got changed to do that tho. I mean I guess apparently not but I remember there being some changes19:49
sean-k-mooneywe are definitely waiting for it to be sshable now for all the volume stuff19:50
melwittyeah19:50
sean-k-mooneyvalidation is also properly configured in the tempest.conf too19:50
sean-k-mooneyhttps://github.com/openstack/tempest/blob/6618aa253e04b8879ae6d721a48ee4851543ba4a/tempest/common/compute.py#L133-L15119:53
sean-k-mooneyI thought the SSHABLE section checked for pings first but apparently not19:54
sean-k-mooneymaybe we could change that19:54
sean-k-mooneythat would still be racy I guess19:55
sean-k-mooneywe are able to connect to the ssh server19:55
sean-k-mooneybut it just is not ready for connections or the authorized key got corrupted19:55
melwittyeah I had thought so too19:56
melwittoh yeah, that's true, it did connect so it was already pingable19:56
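(A minimal, self-contained Python sketch of the ordering suggested above: wait for ping first, then for the SSH port. The helper names are illustrative, not tempest's actual validation code, and as noted this would still be racy because sshd can be up before cloud-init installs the key.)

    import socket
    import subprocess
    import time

    def wait_for_ping(ip: str, timeout: float = 120.0, interval: float = 2.0) -> None:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                               stdout=subprocess.DEVNULL) == 0:
                return
            time.sleep(interval)
        raise TimeoutError(f"{ip} never became pingable")

    def wait_for_ssh_port(ip: str, timeout: float = 120.0, interval: float = 2.0) -> None:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((ip, 22), timeout=5):
                    return
            except OSError:
                time.sleep(interval)
        raise TimeoutError(f"{ip}:22 never accepted a connection")

    # e.g. wait_for_ping("172.24.5.195"); wait_for_ssh_port("172.24.5.195")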
sean-k-mooneyok my stomach says it's time for mead and other tasty things, so I'm going to just call it a slow node and call it a week19:57
sean-k-mooneyif we see this happening more frequently, we can try and debug more19:57
melwitthaha ok. enjoy the mead :) o/19:58
opendevreviewArtom Lifshitz proposed openstack/whitebox-tempest-plugin master: Switch to Ubuntu Noble image  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92306320:04
opendevreviewAshish Gupta proposed openstack/tempest master: [WIP]Add test to verify hostname allows FQDN  https://review.opendev.org/c/openstack/tempest/+/92234223:46
