Friday, 2024-06-28

opendevreviewGhanshyam proposed openstack/devstack master: Do not use 'admin' role in service users  https://review.opendev.org/c/openstack/devstack/+/92301301:44
opendevreviewGhanshyam proposed openstack/devstack master: Do not use 'admin' role in service users  https://review.opendev.org/c/openstack/devstack/+/92301301:45
opendevreviewGhanshyam proposed openstack/tempest master: New job to test the service-to-service interaction using service role  https://review.opendev.org/c/openstack/tempest/+/92301401:57
opendevreviewGhanshyam proposed openstack/tempest master: New job to test the service-to-service interaction using service role  https://review.opendev.org/c/openstack/tempest/+/92301401:58
opendevreviewAmit Uniyal proposed openstack/whitebox-tempest-plugin master: Updated watchdog tests with machine type  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92303108:52
*** whoami-rajat_ is now known as whoami-rajat11:25
*** dmellado07553 is now known as dmellado075514:51
opendevreviewArtom Lifshitz proposed openstack/whitebox-tempest-plugin master: Switch to Ubuntu Noble image  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92306316:49
sean-k-mooneyhi folks, was there a zuul restart today?17:14
sean-k-mooneyI'm not sure I have seen 'cancelled' before as a build status, other than that17:14
sean-k-mooneyhttps://zuul.opendev.org/t/openstack/buildset/e86ea5bb01814931af504815752f0a2817:14
clarkbsean-k-mooney: no, zuul restarts early Saturday UTC time, but it does so in a rolling fashion and shouldn't cancel any jobs to do that17:16
clarkbit is exceptionally rare these days for us to hard stop all of zuul17:17
clarkbsean-k-mooney: I don't know yet why that happened but it may have to do with parent change state17:18
sean-k-mooneywell the parent change merged17:18
sean-k-mooneythere has been a behavior change in zuul at some point17:19
sean-k-mooneywhich seems to abort or cancel jobs when that happens17:19
sean-k-mooneywe often stack up 5+ changes with +2/+W, holding the bottom change, and then +W the bottom change to merge the set17:20
clarkb[<Change 0x7f5f25ae1c90 openstack/nova 915735,4>] in gate> is a failing item because ['at least one job failed']17:20
sean-k-mooneyI have seen issues with that over the last few months17:20
clarkbso I think it had a failing job and wasn't going to merge. And the bug here is that zuul is changing the result from failed to cancelled, so it isn't clear where the failure is17:20
sean-k-mooneythey were all cancelled according to the result17:21
clarkbsean-k-mooney: yes, but I think that is an after-the-fact thing (e.g. after success or failure, since it is reporting at least one failed)17:21
clarkbbasically it's a UI bug, not a didn't-merge-properly bug, but I'm still looking17:21
sean-k-mooneymaybe; it looks like none of the jobs ran for more than a minute or so17:21
sean-k-mooneywe don't have any logs either, which makes this hard to triage17:22
clarkbI half suspect that there was a buildset that failed and then, as part of redoing the queue, a new buildset was created which then got cancelled17:23
clarkbif that is the case the db doesn't confirm it, though17:24
sean-k-mooneyright, but could that have happened because the parent patch merged?17:24
sean-k-mooneythe parent patch merged at Friday, 2024-06-28, 17:21:16 UTC+01:0017:25
clarkbthe cancellations happened well before that17:26
sean-k-mooneydo they when you account for time zones?17:27
sean-k-mooney2024-06-28 14:22:57 is what it displays, but I don't know what time zone that is17:27
sean-k-mooneyit was cancelled 1 hour and 6 minutes ago and the parent merged 1 hour and 7 minutes ago, looking at gerrit17:28
clarkbwe only use UTC17:29
sean-k-mooneyso just looking at the gerrit comments, it looks like they are within a few seconds of each other17:29
clarkbthe builds were cancelled because the parent change had a failing job17:29
sean-k-mooneythat does not fit with what I'm seeing17:29
sean-k-mooneythe parent patch was successfully merged17:30
clarkb2024-06-28 14:23:09,832 DEBUG zuul.Pipeline.openstack.gate: [e: 967b8cbde1cb4649bc4b7f5d7ded6be8] <QueueItem 664fbea31e5343a7a5a8f7178c31e799 live for [<Change 0x7f9c2c6bb7d0 openstack/nova 915734,4>] in gate> is a failing item because ['at least one job failed']17:30
clarkbsean-k-mooney: yes that is possible in the gate due to resets17:30
clarkbso it seems this child change was caught in a reset and got cancelled. Why it didn't get re-enqueued with new jobs and no association to the parent is still a mystery17:30
sean-k-mooneyok so you think there was a patch in between that had a failure17:30
clarkbthe parent seems to have gotten re-enqueued properly in the reset17:30
sean-k-mooneyand then it did not requeue properly17:31
clarkbyes, it's looking like there was at least one more change involved, so that things could be reset multiple times17:31
clarkband this one didn't get reset properly17:31
sean-k-mooneyack, so this isn't the first time I have seen this17:31
sean-k-mooneyit's rare, but we have had a parent merge and all of its child patches aborted infrequently for 6-12 months now17:32
sean-k-mooneywhich has been frustrating since it requires a recheck to fix, which means 2 more full CI runs per patch17:33
clarkbok this is the first I've heard of or seen it17:33
clarkbbut I'm trying to figure out from logs where things went wrong and have brought it up with corvus too17:34
sean-k-mooneyack, thank you. I probably should hop on Matrix more when I see things like this17:34
clarkbhttps://zuul.opendev.org/t/openstack/builds?change=915734&skip=0 may be the clue we need17:36
clarkbnote that none of the gate jobs have a failure state but one is in a retry state. My hunch now is that this corner case isn't handled properly: zuul thinks it requires a gate reset, but then the code that actually re-enqueues stuff realizes that no, a retry isn't a failing state yet, so it doesn't fully reset17:37
sean-k-mooneyyou think maybe https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f17:37
sean-k-mooneyit did pass after the retry17:38
clarkbya I think that's tripping the reset for 915735 inappropriately (this would be a bug), then when zuul goes to re-enqueue 915735 it realizes it doesn't need to do anything because 915734 isn't actually failing, which effectively orphans the cancelled state in the queue17:38
clarkbthen 915734's job reruns and succeeds and the change can merge17:38
sean-k-mooneysounds plausible17:39
sean-k-mooneyI wonder if they changed the logic at some point to now wait for the entire buildset to be completed before dequeuing the later patches17:39
sean-k-mooneyalthough no, it used to reorder on the first job failure17:40
sean-k-mooneybut obviously it's not handling a retry properly17:40
sean-k-mooneyif this is in fact the cause17:40
sean-k-mooneythat url actually shows something that is kind of interesting from a data perspective.17:43
sean-k-mooneyyou can see the total number of jobs that ran on that change17:44
sean-k-mooneyover all the revisions and retries17:44
sean-k-mooneyin that case, __only__ 137 executions17:45
clarkbhttps://zuul.opendev.org/t/openstack/builds?change=915734&change=915735&skip=0 is interesting too because you can see the cancels coincide with the retry17:45
sean-k-mooneyoh you filtered on both changes ya17:47
sean-k-mooney2024-06-28 14:21:46 +1 min 29 secs is 14:22:15 ish17:48
sean-k-mooneyalthough that's the total runtime17:49
sean-k-mooneythe cancels probably got triggered early when it started to fail17:49
clarkbI think this might be a side effect of an optimization to reconfigure queues as soon as zuul detects the first ansible task failure and not waiting for the job to fully complete17:51
clarkbthe reason is if pre_fail is True we short circuit and consider the job failed and don't check if it is retry or not17:52
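(To make the hypothesis above concrete: a minimal Python sketch of the ordering clarkb is describing. This is not Zuul's actual code; the Build class and counts_as_failing name are illustrative only, assuming the pre_fail short circuit is consulted before the retry state.)

    from dataclasses import dataclass

    @dataclass
    class Build:
        result: str | None   # e.g. "SUCCESS", "FAILURE", "RETRY", or None while running
        pre_fail: bool       # set as soon as a task failure is detected mid-run

    def counts_as_failing(build: Build) -> bool:
        if build.pre_fail:            # short circuit: the retry state is never examined
            return True
        if build.result == "RETRY":   # a retried build is not (yet) a real failure
            return False
        return build.result == "FAILURE"

    # A build that hit a task failure but is being retried would still be reported
    # as failing, triggering the gate reset and orphaned cancelled builds seen here.
    print(counts_as_failing(Build(result="RETRY", pre_fail=True)))  # True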
sean-k-mooneythat sounds likely17:52
sean-k-mooneydoes the short circuit take account of ansible recovery?17:52
sean-k-mooneyi.e. if a playbook is written to try something and recover if it fails and try something else17:53
sean-k-mooneythat's not what happens in this case https://zuul.opendev.org/t/openstack/build/a51b05d89221490cb51d6c777dd42f8f/log/job-output.txt#20217:53
sean-k-mooneybut just wondering how smart the short circuit is17:54
clarkbI think so; it has to wait for the block to fail to handle that case17:54
clarkbbut zuul also has test cases for handling retry and pre_fail together so this must be some edge case17:54
sean-k-mooneydo you know roughly when that optimisation was added/enabled17:55
sean-k-mooneyhttps://zuul-ci.org/docs/zuul/latest/developer/model-changelog.html#version-1417:56
sean-k-mooneyso I guess 9.0.017:56
clarkbya which would fit into your timeline I think17:57
sean-k-mooney10 months ago https://opendev.org/zuul/zuul/releases/tag/9.0.017:58
sean-k-mooneyI'm not seeing a scheduler config option or pipeline var for pre_fail18:03
sean-k-mooneyis this hardcoded to true or perhaps tenant-level config?18:03
clarkbya I'm not sure it is configurable18:07
melwitthas anyone seen a failure like this before where ssh fails public key authentication? I have seen it rarely and I couldn't figure out how it's possible to happen. or maybe I am interpreting the failure wrong? https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_43e/922918/1/check/nova-multi-cell/43ef3c3/testr_results.html18:08
sean-k-mooneyhttps://opendev.org/zuul/zuul/commit/1170b91bd893741dd0a664b28b56177998aa067518:08
sean-k-mooneylooking at the change that added it18:08
sean-k-mooneyI think it was not18:08
clarkbsean-k-mooney: the theory that is forming in the zuul matrix room is that this is due to post-run also failing. Basically everything is handling the pre-run failure just fine, but then your post-run playbook also fails and then it flips the pre_fail flag18:09
sean-k-mooneythe post-run playbook fails because of log collection, but it's ignored/recovers I think18:10
clarkbthere is no explicit recovery in the post run playbook18:14
sean-k-mooneysure but that should not be required either18:16
clarkbright I'm working on a patch to zuul18:16
clarkbjust pointing out the behaviors and why we end up with that end result18:16
sean-k-mooneycool glad we could find the issue so quickly18:16
sean-k-mooneyI meant to ask about this a few months ago and just never got the time18:17
sean-k-mooneyi.e. just after it had happened18:17
sean-k-mooneythis always cropped up while I was asleep after approving a set of changes before18:17
clarkbsean-k-mooney: fwiw fail-fast is the configuration to opt into the fast failing; however, this is buggy even if you don't set it18:30
clarkbwhich is the case here (it isn't set as far as I can tell but we're hitting the buggy code path anyway)18:31
sean-k-mooneyah ok18:40
sean-k-mooneyI wasn't getting any hits in the web search but didn't try too hard to find it when it was not obvious18:41
sean-k-mooneymelwitt: I'm just going to drop for the week, but yes, "Failed to establish authenticated ssh connection to cirros@172.24.5.195 after 15 attempts. Proxy client: no proxy client"18:42
sean-k-mooneyis the error18:42
sean-k-mooneyif we look at the cirros banner, it prints the login logo at === cirros: current=0.6.2 uptime=47.47 ===18:42
sean-k-mooneywe didn't get metadata until "successful after 6/20 tries: up 33.65. iid=i-000000"18:43
sean-k-mooneyso while we had an IP address after about 12 seconds, we did not pull the ssh key for another 20 or so18:43
sean-k-mooneythis is either the bug in ml2/ovn where it does not set up the metadata proxy until after the vm boots18:45
sean-k-mooneyor it's just a slow node18:45
clarkbsean-k-mooney: thank you for pointing that out. And definitely say something in the future if you see other weird behaviors18:45
sean-k-mooneyclarkb: sure, I generally keep an eye out for changes in behavior like this but I don't always have a concrete example to point to18:48
sean-k-mooneymelwitt: my general inclination is that dropbear started fine18:50
melwittsean-k-mooney: cool thanks for the insight. have a nice weekend18:50
sean-k-mooneywe see "Starting dropbear sshd: remove-dropbear-host-keys already run per instance" followed by OK18:50
melwittyeah it's like, it connects but then the public key is rejected. I didn't understand how it can have a wrong key, or does it mean something else18:50
sean-k-mooneybut it just took a little too long to get to the login prompt18:50
sean-k-mooneynormally you can tune this in tempest with a config option here https://github.com/openstack/tempest/blob/master/tempest/config.py#L844-L92918:51
sean-k-mooneyI don't see a way to alter the ssh retry count or interval, however18:52
sean-k-mooneywe are seeing 18:53
sean-k-mooney2024-06-27 17:52:33,047 99236 INFO     [paramiko.transport] Connected (version 2.0, client dropbear_2020.81)18:53
sean-k-mooney2024-06-27 17:52:33,151 99236 INFO     [paramiko.transport] Authentication (publickey) failed.18:53
sean-k-mooneyso we know it's not a network issue18:53
sean-k-mooneybut I would assume we have not pulled the ssh key and added it to the authorized_keys file in time, or the login daemon has not booted yet so pam is rejecting the login attempt18:54
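(For readers following along: a rough paramiko sketch of why that log line points at the key rather than the network. Tempest drives SSH through paramiko; the host and user are from this job, and the key path is a placeholder. "Authentication (publickey) failed" means the TCP and SSH handshakes completed and only the offered key was rejected, which fits authorized_keys not being populated yet.)

    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect("172.24.5.195", username="cirros",
                       key_filename="/path/to/test_key", timeout=10)
    except paramiko.ssh_exception.NoValidConnectionsError:
        # sshd not reachable at all: a networking or boot problem
        print("could not connect")
    except paramiko.ssh_exception.AuthenticationException:
        # connected fine, but the public key was refused, e.g. because
        # cloud-init has not written authorized_keys yet
        print("connected but key rejected")
    finally:
        client.close()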
melwittyeah. doesn't that mean the key presented was rejected? basically I wondered if there was any possibility of a bug in tempest. hm ok18:56
melwittor rather corner case18:56
sean-k-mooneyya so normally the ssh daemon being up and you being able to log in are very close in time18:57
sean-k-mooneybut if you have ever tried to ssh into a real server when you have rebooted it18:57
sean-k-mooneyyou sometimes get a reject/auth issue because the server is not ready18:57
sean-k-mooneyopenssh gives you a different message when that happens but I don't know if dropbear does18:58
melwittno I have, but I thought the thing it tells you is different than "public key failed"18:58
melwittmaybe I didn't notice18:58
sean-k-mooneywell that's why I think cloud-init has not actually added the key yet18:59
sean-k-mooneyI think the authorized_keys file is empty18:59
sean-k-mooneyif we look at the cloud-init output18:59
sean-k-mooneychecking http://169.254.169.254/2009-04-04/instance-id18:59
sean-k-mooneyfailed 1/20: up 22.16. request failed18:59
sean-k-mooneyfailed 2/20: up 24.55. request failed18:59
sean-k-mooneyfailed 3/20: up 26.82. request failed18:59
sean-k-mooneyfailed 4/20: up 29.10. request failed19:00
sean-k-mooneyfailed 5/20: up 31.39. request failed19:00
sean-k-mooneysuccessful after 6/20 tries: up 33.65. iid=i-0000006319:00
sean-k-mooneywe see that, which is after dropbear was started19:00
melwittok, I see19:00
melwittwish we could dump that too when ssh fails19:00
sean-k-mooneyso that's a good 13+ seconds before we were able to pull the first file19:00
sean-k-mooneyI think this is probably the ovn bug19:00
sean-k-mooneyif we were to check the ovn metadata logs when this was booting, or the ovs logs19:01
sean-k-mooneyI bet the ovn metadata proxy was not ready to receive the requests19:01
melwittok I'll check that, just for curiosity19:02
sean-k-mooneyhttps://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_43e/922918/1/check/nova-multi-cell/43ef3c3/compute1/logs/screen-q-ovn-metadata-agent.txt19:04
sean-k-mooneyat Jun 27 17:54:52.82640419:05
sean-k-mooneyit looks like it started setting it up at Jun 27 17:51:1819:08
melwittoh interesting19:09
sean-k-mooneybut I don't actually know if/when it finished19:09
sean-k-mooneywe see Jun 27 17:51:18.377087 np0037827180 neutron-ovn-metadata-agent[40160]: INFO neutron.agent.ovn.metadata.agent [-] Port 76f5dd84-8e96-4b52-9866-bca8ff8ba523 in datapath 8f767a47-1afd-49b4-b8a4-dcd3bf05091c bound to our chassis19:10
sean-k-mooneythen Provisioning metadata for network 8f767a47-1afd-49b4-b8a4-dcd3bf05091c19:10
sean-k-mooneyI think it starts at Jun 27 17:51:20.72517619:13
sean-k-mooneyand starts handling requests at 17:52:0219:14
sean-k-mooneytempest starts trying to ssh at 17:51:2619:15
sean-k-mooneyand gives up at 17:54:5119:16
sean-k-mooneyhttps://paste.opendev.org/show/bL4YosAN7DorXXdMKtZy/19:22
sean-k-mooneythe vm pulled the ssh key for the first time at 17:52:0319:23
sean-k-mooneytempest stopped retrying at 17:54:51,505 99236 ERROR    [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@172.24.5.195 after 15 attempts. Proxy client: no proxy client19:24
sean-k-mooneymelwitt: so ya the ovn metadata logs confirm it19:24
sean-k-mooneywell actually no, numbers are hard19:25
sean-k-mooneyit may have downloaded the key at 17:52:0319:26
melwittsean-k-mooney: and you can tell because the time of the port binding in the ovn metadata log is slightly after tempest gave up?19:26
sean-k-mooneybut we don't know how long it took for cloud-init to actually install it19:27
sean-k-mooneymelwitt: no, I misread one of the numbers19:27
sean-k-mooneytempest is trying to ssh in before cloud-init has even tried to pull the key19:27
sean-k-mooneybut cloud-init had some amount of time to set it up19:28
sean-k-mooneybut there it looks like a race to me19:28
melwittthe port binding message shows at Jun 27 17:54:52.826404 and tempest gave up at 2024-06-27 17:54:51,50519:29
sean-k-mooneyya if you look closely, that's when the port is unbound19:29
sean-k-mooneythat's when we are deleting the vm during cleanup19:30
melwittoh unbound. ok19:30
sean-k-mooneyovnmeta-8f767a47-1afd-49b4-b8a4-dcd3bf05091c is the ovn network namespace for the proxy19:31
sean-k-mooneyso we can see it get created and then all the proxied calls19:31
sean-k-mooneythe vm's ip is 10.1.0.319:31
sean-k-mooneyyou can see all the metadata lines where it's querying19:33
sean-k-mooneyif you search for neutron-ovn-metadata-agent[42136]: INFO eventlet.wsgi.server [-] 10.1.0.3,<local> "GET 19:34
sean-k-mooneyaround the right time19:34
sean-k-mooneyyou can confirm it's the correct request by also looking at the network id19:34
sean-k-mooneyX-Ovn-Network-Id: 8f767a47-1afd-49b4-b8a4-dcd3bf05091c19:34
melwitt@_@19:35
melwittit'll take a little while for my brain to absorb this19:36
sean-k-mooneythat's when the proxy started: https://paste.opendev.org/show/bY95LSHfDNbIrfYdfJ9W/19:36
sean-k-mooneyno worries19:36
sean-k-mooneyit's not the simplest thing to follow19:36
sean-k-mooneythat got created at 17:51:2019:37
sean-k-mooneyhum19:41
sean-k-mooneythat is not what I was expecting19:41
sean-k-mooneyJun 27 17:51:23.727070 np0037827180 nova-compute[42216]: DEBUG nova.compute.manager [req-308d2a9f-aafa-4104-9efe-b24a6eddd408 req-41d56dd8-7c93-410d-b00c-deeb9098f0b7 service nova] [instance: c31f0f22-1d32-4617-93be-28e3fb12589d] Received event network-vif-plugged-76f5dd84-8e96-4b52-9866-bca8ff8ba523 {{(pid=42216) external_instance_event19:41
sean-k-mooney/opt/stack/nova/nova/compute/manager.py:11129}}19:41
sean-k-mooneywe actually didn't receive the network-vif-plugged event until after neutron set up metadata19:42
sean-k-mooneythe nova compute log says19:43
sean-k-mooneyJun 27 17:51:23.737579 np0037827180 nova-compute[42216]: INFO nova.virt.libvirt.driver [-] [instance: c31f0f22-1d32-4617-93be-28e3fb12589d] Instance spawned successfully.19:43
sean-k-mooneyand tempest waits for it to go active until Thu, 27 Jun 2024 17:51:2519:44
sean-k-mooneymelwitt: so all that looks like it's correct19:44
sean-k-mooneytempest started sshing almost immediately, at 17:51:26,19:44
sean-k-mooneybut it just took too long for cloud-init to set everything up19:45
sean-k-mooneywe can see in the console that from the time the vm is spawned, it's over 30 seconds before networking is up19:46
sean-k-mooneyand we can also see that tempest is not trying to ping before it's sshing19:47
melwittoh huh. I thought all those got updated to wait for ping and ssh19:48
sean-k-mooneyI think it might be either/or rather than doing one then the other19:48
sean-k-mooneybut I think that would prevent the issue19:48
sean-k-mooneywait for it to be pingable, then wait for it to be sshable19:48
melwittyeah. I had thought those all got changed to do that tho. I mean I guess apparently not but I remember there being some changes19:49
sean-k-mooneywe are definitely waiting for it to be sshable now for all the volume stuff19:50
melwittyeah19:50
sean-k-mooneyvalidation is also properly configured in the tempest.conf too19:50
sean-k-mooneyhttps://github.com/openstack/tempest/blob/6618aa253e04b8879ae6d721a48ee4851543ba4a/tempest/common/compute.py#L133-L15119:53
sean-k-mooneyI thought the SSHABLE section checked for pings first but apparently not19:54
sean-k-mooneymaybe we could change that19:54
sean-k-mooneythat would still be racy I guess19:55
sean-k-mooneywe are able to connect to the ssh server19:55
sean-k-mooneybut it just is not ready for connections or the authorized key got corrupted19:55
melwittyeah I had thought so too19:56
melwittoh yeah, that's true, it did connect so it was already pingable19:56
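(A minimal, self-contained Python sketch of the ordering suggested above: wait for ping first, then for the SSH port. The helper names are illustrative, not tempest's actual validation code, and as noted this would still be racy because sshd can be up before cloud-init installs the key.)

    import socket
    import subprocess
    import time

    def wait_for_ping(ip: str, timeout: float = 120.0, interval: float = 2.0) -> None:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if subprocess.call(["ping", "-c", "1", "-W", "1", ip],
                               stdout=subprocess.DEVNULL) == 0:
                return
            time.sleep(interval)
        raise TimeoutError(f"{ip} never became pingable")

    def wait_for_ssh_port(ip: str, timeout: float = 120.0, interval: float = 2.0) -> None:
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((ip, 22), timeout=5):
                    return
            except OSError:
                time.sleep(interval)
        raise TimeoutError(f"{ip}:22 never accepted a connection")

    # e.g. wait_for_ping("172.24.5.195"); wait_for_ssh_port("172.24.5.195")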
sean-k-mooneyok my stomach says it's time for mead and other tasty things, so I'm going to just call it a slow node and call it a week19:57
sean-k-mooneyif we see this happening more frequently, we can try and debug more19:57
melwitthaha ok. enjoy the mead :) o/19:58
opendevreviewArtom Lifshitz proposed openstack/whitebox-tempest-plugin master: Switch to Ubuntu Noble image  https://review.opendev.org/c/openstack/whitebox-tempest-plugin/+/92306320:04
opendevreviewAshish Gupta proposed openstack/tempest master: [WIP]Add test to verify hostname allows FQDN  https://review.opendev.org/c/openstack/tempest/+/92234223:46
