opendevreview | Amit Uniyal proposed openstack/tempest master: Adds test for resize server swap to 0 https://review.opendev.org/c/openstack/tempest/+/858885 | 06:01 |
*** ralonsoh_ooo is now known as ralonsoh | 07:32 | |
*** jpena|off is now known as jpena | 08:29 | |
ykarel | gmann, kopecmartin https://review.opendev.org/c/openstack/devstack/+/859773 broke at least tempest slow jobs | 10:19 |
kopecmartin | oh, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps fails with - cat: can't open '/var/run/udhcpc.eth0.pid': No such file or directory | 10:23 |
kopecmartin | https://2ba7f10ac23ddac3b9f6-1e843e6e8b4b324e302975788622dfa4.ssl.cf2.rackcdn.com/874232/1/check/tempest-slow-py3/35d32f7/testr_results.html | 10:23 |
kopecmartin | renew_lease method fails :/ | 10:25 |
kopecmartin | but if that failed due to cirros bump, it could have failed with any other custom image a user might use | 10:26 |
ykarel | yes if those images don't use udhcpc | 10:27 |
ykarel | from what i see only that test uses that config option and is marked as slow | 10:29 |
ykarel | so only jobs running slow tests are impacted | 10:29 |
kopecmartin | ykarel: right, i see that udhcpc is the default client .. anyway, does this mean that they changed cirros in 0.6.1 not to include this client? or maybe use a different one, i'm trying to find a change log or smth | 12:46 |
kopecmartin | oh | 12:47 |
kopecmartin | https://github.com/cirros-dev/cirros/blob/0.6.1/ChangeLog#L20 | 12:48 |
kopecmartin | they switched to dhcpcd | 12:48 |
ykarel | kopecmartin, yeap | 12:48 |
ykarel | https://github.com/cirros-dev/cirros/commit/ded54d3524d1dda485b095ed8a0f934695200c65 | 12:48 |
ykarel | https://github.com/cirros-dev/cirros/commit/e59406d14c857a949d6eeb400d67c2ed8f545390 | 12:48 |
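The failing scenario test renews the guest's DHCP lease over SSH, and the command it runs depends on which client the image ships. A minimal sketch of that kind of per-client dispatch is below; the helper name and exact commands are illustrative assumptions, not tempest's actual implementation.

```python
# Illustrative sketch only: dispatch the lease renewal command based on the
# configured DHCP client. Helper names and exact commands are assumptions.
def renew_lease(ssh_client, dhcp_client, interface="eth0"):
    if dhcp_client == "udhcpc":
        # udhcpc renews its lease on SIGUSR1; older cirros wrote this pid
        # file, which is what produced the "can't open
        # '/var/run/udhcpc.eth0.pid'" failure above once the client changed.
        pid = ssh_client.exec_command(
            "cat /var/run/udhcpc.%s.pid" % interface).strip()
        ssh_client.exec_command("sudo kill -USR1 %s" % pid)
    elif dhcp_client == "dhcpcd":
        # cirros 0.6.1 ships dhcpcd instead, which exposes a rebind option.
        ssh_client.exec_command("sudo dhcpcd --rebind %s" % interface)
    else:
        raise ValueError("unsupported dhcp client: %s" % dhcp_client)
```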
kopecmartin | i'm gonna try to override the default client to dhcpcd in the slow job | 12:49 |
*** ralonsoh is now known as ralonsoh_lunch | 12:51 | |
ykarel | +1 | 12:52 |
opendevreview | Milana Levy proposed openstack/tempest master: This change was written so that a new volume could be created by another client other than the primary admin https://review.opendev.org/c/openstack/tempest/+/874577 | 12:58 |
*** ralonsoh_lunch is now known as ralonsoh | 13:31 | |
opendevreview | Martin Kopec proposed openstack/tempest master: Change dhcp client to dhcpcd in slow jobs https://review.opendev.org/c/openstack/tempest/+/874586 | 13:57 |
opendevreview | yatin proposed openstack/grenade master: Dump Console log if ping fails https://review.opendev.org/c/openstack/grenade/+/874417 | 14:12 |
kopecmartin | #startmeeting qa | 15:01 |
opendevmeet | Meeting started Tue Feb 21 15:01:04 2023 UTC and is due to finish in 60 minutes. The chair is kopecmartin. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:01 |
opendevmeet | The meeting name has been set to 'qa' | 15:01 |
mnaser | was there previously an invisible_to_admin user or something like that in the past in devstack? | 15:01 |
mnaser | oops bad timing | 15:01 |
mnaser | ignore that :) | 15:01 |
lpiwowar | o/ | 15:01 |
kopecmartin | mnaser: i don't know, let me get back to that at the end of the meeting in the Open Discussion | 15:01 |
*** yadnesh_ is now known as yadnesh|away | 15:02 | |
kopecmartin | #topic Announcement and Action Item (Optional) | 15:04 |
kopecmartin | OpenStack Elections | 15:04 |
kopecmartin | the current status at | 15:04 |
kopecmartin | #link https://governance.openstack.org/election/ | 15:04 |
kopecmartin | #topic Antelope Priority Items progress | 15:05 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-antelope-priority | 15:05 |
* kopecmartin checks the status there | 15:05 | |
frickler | no updates on ceph plugin I guess? | 15:07 |
kopecmartin | doesn't look like it | 15:08 |
kopecmartin | anyone working on that? | 15:08 |
frickler | I thought that at some time you wanted to take a look at the tempest issue. or was it gmann? | 15:08 |
kopecmartin | i think it was me and i lost it in the pile of tabs :/ | 15:09 |
kopecmartin | i rechecked that to get fresh logs | 15:09 |
kopecmartin | i'm gonna try to get to that | 15:09 |
frickler | cool | 15:10 |
kopecmartin | so the goal is to fix whatever is failing here now, right? | 15:10 |
kopecmartin | #link https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/865315 | 15:10 |
kopecmartin | so that we can merge that | 15:10 |
frickler | yes | 15:10 |
kopecmartin | okey | 15:12 |
kopecmartin | there's been some progress on FIPS | 15:12 |
kopecmartin | #link https://review.opendev.org/c/openstack/devstack/+/871606 | 15:12 |
kopecmartin | but that depends on a patch in zuul-jobs | 15:13 |
kopecmartin | i wonder whether there are patches which depend on the devstack one - because the patch ^^ doesn't change anything, just allows the consumers to enable fips | 15:13 |
kopecmartin | I'll check with Ade | 15:13 |
kopecmartin | oh, one thing i forgot to mention in the announcements section | 15:16 |
kopecmartin | we're about to release a new tempest tag | 15:16 |
kopecmartin | #link https://review.opendev.org/c/openstack/tempest/+/871018 | 15:16 |
kopecmartin | the patches are in the queue depending on that one ^^ | 15:16 |
kopecmartin | which is currently blocked by the cirros bump, but i'll get to that later | 15:16 |
kopecmartin | #topic OpenStack Events Updates and Planning | 15:17 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-bobcat-ptg | 15:17 |
kopecmartin | if you have any ideas for the topics to discuss over the ptg, then ^^ | 15:17 |
kopecmartin | i'll need to reserve some time and think about the topics we might wanna cover during the PTG | 15:18 |
kopecmartin | #topic Gate Status Checks | 15:19 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B2+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:19 |
kopecmartin | 2 reviews one is blocked by the other | 15:19 |
kopecmartin | the cirros version bump caused an issue with the dhcp client .. apparently the new cirros uses a different dhcp client by default | 15:20 |
opendevreview | Jorge San Emeterio proposed openstack/tempest master: Create a tempest test to verify bz#2118968 https://review.opendev.org/c/openstack/tempest/+/873706 | 15:20 |
kopecmartin | more info here | 15:20 |
kopecmartin | #link https://review.opendev.org/c/openstack/tempest/+/874586 | 15:20 |
kopecmartin | and in the associated bug report | 15:20 |
kopecmartin | anything urgent to review? | 15:22 |
kopecmartin | #topic Bare rechecks | 15:23 |
kopecmartin | #link https://etherpad.opendev.org/p/recheck-weekly-summary | 15:23 |
kopecmartin | we're doing quite good here | 15:23 |
kopecmartin | #topic Periodic jobs Status Checks | 15:23 |
kopecmartin | stable | 15:23 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-yoga&job_name=tempest-full-xena&job_name=tempest-full-wallaby-py3&job_name=tempest-full-victoria-py3&job_name=tempest-full-ussuri-py3&job_name=tempest-full-zed&pipeline=periodic-stable | 15:23 |
kopecmartin | master | 15:23 |
kopecmartin | #link https://zuul.openstack.org/builds?project=openstack%2Ftempest&project=openstack%2Fdevstack&pipeline=periodic | 15:23 |
kopecmartin | master got hit by the dhcp client issue | 15:24 |
kopecmartin | i'm checking whether those jobs would be fixed by the patch i proposed earlier | 15:26 |
frickler | yes, I changed the dhcp client in order to better support different IPv6 scenarios | 15:27 |
kopecmartin | ack, it requires a small change in a few jobs because tempest uses the previous dhcp client by default | 15:29 |
kopecmartin | #topic Distros check | 15:30 |
kopecmartin | cs-9 | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-centos-9-stream&job_name=devstack-platform-centos-9-stream&skip=0 | 15:31 |
kopecmartin | fedora | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-fedora-latest&skip=0 | 15:31 |
kopecmartin | debian | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-debian-bullseye&skip=0 | 15:31 |
kopecmartin | focal | 15:31 |
kopecmartin | #link https://zuul.opendev.org/t/openstack/builds?job_name=devstack-platform-ubuntu-focal&skip=0 | 15:31 |
kopecmartin | rocky | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-rocky-blue-onyx | 15:31 |
kopecmartin | openEuler | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-openEuler-22.03-ovn-source&job_name=devstack-platform-openEuler-22.03-ovs&skip=0 | 15:31 |
kopecmartin | all good, all passing, note that we merged the fix for rocky only a day or 2 ago | 15:32 |
kopecmartin | #topic Sub Teams highlights | 15:33 |
kopecmartin | Changes with Review-Priority == +1 | 15:33 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B1+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:33 |
kopecmartin | no reviews there | 15:33 |
kopecmartin | #topic Open Discussion | 15:33 |
kopecmartin | (gmann) PyPi additional maintainers audit for QA repo | 15:34 |
kopecmartin | regarding this | 15:34 |
kopecmartin | we have reached out to everyone we could find | 15:34 |
kopecmartin | i think we can consider this done | 15:34 |
kopecmartin | .. i made a note here that we are ok with the removal of additional maintainers | 15:35 |
kopecmartin | #link https://etherpad.opendev.org/p/openstack-pypi-maintainers-cleanup | 15:35 |
kopecmartin | anything for the open discussion? | 15:35 |
tkajinam | o/ | 15:36 |
tkajinam | May I bring one topic ? | 15:36 |
kopecmartin | sure | 15:36 |
tkajinam | https://github.com/unbit/uwsgi/commit/5838086dd4490b8a55ff58fc0bf0f108caa4e079 | 15:37 |
tkajinam | I happened to notice uwsgi announced maintenance mode last year. is anybody aware of this ? | 15:37 |
tkajinam | this might be concerning for us because we are now extensively using uwsgi in devstack afaik | 15:38 |
kopecmartin | isn't the maintenance mode enough for us? | 15:40 |
tkajinam | if they will still maintain it well. but it's not a good sign imho. | 15:40 |
kopecmartin | yes, that's true | 15:41 |
kopecmartin | how can we mitigate that? | 15:41 |
kopecmartin | should we plan replacing that with something else? | 15:41 |
kopecmartin | (seems we have a topic for the upcoming virtual ptg) | 15:41 |
tkajinam | I noticed this 30 minutes ago and am just sharing it here, so I don't have clear ideas yet. we probably have to check the reason behind that shift and prepare a replacement plan in case it becomes unmaintained. | 15:42 |
kopecmartin | tkajinam: i'm just thinking out loud .. thanks for sharing, it's very appreciated | 15:43 |
kopecmartin | gmann: ^ did it come up in TC? | 15:43 |
kopecmartin | let's gather more info and get back to this | 15:43 |
tkajinam | I'll send an email to openstack-discuss. probably that would be a good way to initiate discussion around this. | 15:44 |
kopecmartin | tkajinam: very good idea | 15:44 |
kopecmartin | +1 | 15:44 |
kopecmartin | searching the ML to see whether it has come up already and i don't see anything specific | 15:45 |
frickler | doesn't ring a bell for me, either, but certainly worth discussing | 15:46 |
kopecmartin | yeah, this is interesting, seems like very important info and it didn't come up for a year o.O .. thanks again tkajinam | 15:47 |
frickler | regarding mnaser's question, I only know about the project of that name, not a user https://opendev.org/openstack/devstack/src/branch/master/lib/keystone#L343-L345 | 15:47 |
tkajinam | kopecmartin frickler, thanks ! | 15:48 |
mnaser | im trying to fix ospurge gate and its failing because of that | 15:48 |
mnaser | https://opendev.org/x/ospurge/src/branch/master/tools/func-tests.sh#L32 | 15:48 |
kopecmartin | it doesn't look like it's used anywhere else but there | 15:49 |
kopecmartin | #link https://codesearch.opendev.org/?q=invisible_to_admin_demo_pass&i=nope&literal=nope&files=&excludeFiles=&repos= | 15:49 |
frickler | mnaser: iiuc "demo" is the username and invisible_to_admin the project name | 15:50 |
frickler | do you have a link to a failure? | 15:50 |
mnaser | https://zuul.opendev.org/t/openstack/build/ef954eefbef2439da35829b2f99d8ef5 | 15:50 |
mnaser | yeah so i wonder if it's bitrot since its no longer used, since i checked codesearch too | 15:50 |
kopecmartin | it seds files under DEVSTACK_DIR which aren't there, i wonder whether they were there at some point or they were just generated by someone on the fly | 15:53 |
frickler | it seems accrc is no longer being created at all | 15:54 |
kopecmartin | the last commit in x/ospurge was done 3 years ago | 15:54 |
mnaser | yeah theres a lot of bitrot there | 15:54 |
mnaser | but ah ok if accrc is not a thing at all | 15:55 |
frickler | might be related to our general move to clouds.yaml, does ospurge support that? | 15:55 |
mnaser | i think it uses openstacksdk client in the backend | 15:55 |
mnaser | so i could update the tests to use --os-cloud | 15:55 |
frickler | I think that would be the best path looking forward | 15:56 |
mnaser | ok ill try to see what the different options are and how the clouds yaml file is generated and clean up that file | 15:57 |
frickler | we could add a cloud definition for the invisible project if needed | 15:57 |
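Since ospurge drives the cloud through openstacksdk anyway, switching its functional test to a named clouds.yaml entry (as suggested above) could look roughly like this; the cloud name used here is only a placeholder, not something devstack is guaranteed to generate.

```python
# Minimal sketch: authenticate via a named clouds.yaml entry instead of
# sourcing generated accrc files. "devstack-invisible" is a placeholder name.
import openstack

conn = openstack.connect(cloud="devstack-invisible")

# e.g. list what is left in the project before/after a purge run
for server in conn.compute.servers():
    print(server.name)
```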
frickler | anyway, I think we can also continue this discussion after the meeting | 15:58 |
kopecmartin | ack , last quick note about the bug triage | 15:59 |
kopecmartin | #topic Bug Triage | 15:59 |
kopecmartin | #link https://etherpad.openstack.org/p/qa-bug-triage-antelope | 15:59 |
kopecmartin | numbers recorded as always | 15:59 |
kopecmartin | and we're out of time | 15:59 |
kopecmartin | thank you everyone for joining | 15:59 |
kopecmartin | see you online | 15:59 |
kopecmartin | #endmeeting | 16:00 |
opendevmeet | Meeting ended Tue Feb 21 16:00:04 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.html | 16:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.txt | 16:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.log.html | 16:00 |
frickler | thx kopecmartin | 16:00 |
*** artom_ is now known as artom | 16:01 | |
lpiwowar | thanks o/ | 16:01 |
*** sean-k-mooney1 is now known as sean-k-mooney | 16:25 | |
opendevreview | Merged openstack/grenade master: Dump Console log if ping fails https://review.opendev.org/c/openstack/grenade/+/874417 | 18:00 |
gmann | tkajinam: thanks for bringing it, I will check mail | 18:43 |
dansmith | gmann: can you tell if this failure is during tearDown() or part of the test itself? https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_137/874664/1/check/tempest-integrated-compute-ubuntu-focal/1376f12/testr_results.html | 19:32 |
dansmith | because it doesn't show me a trace in the actual test, I'm guessing this is just tearDown? | 19:32 |
gmann | k, checking | 19:37 |
gmann | dansmith: yes, it is during tearDown from here https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L127 | 19:39 |
dansmith | gmann: okay, so I think what's going on there is that we've disturbed the guest a lot by doing the snapshot while it's running, and it is stuck to the point where it will not detach the volume | 19:40 |
dansmith | so we sit there and wait for 8*20s trying to detach it, it never lets go, so we never finish detaching and we fail there | 19:40 |
dansmith | I'm not sure if that really means the test failed or not, because the snapshots succeeded | 19:41 |
dansmith | but we could (a) get the console of the guest to see if it has a kernel panic or something (unlikely I think) | 19:41 |
dansmith | or (b) we could try a force reboot of the guest during teardown before we try to clean up or something | 19:42 |
dansmith | we see this failure pattern a lot | 19:42 |
dansmith | so I'm trying to think of how we can either be more forgiving here, or debug further what is going on | 19:42 |
dansmith | I guess I don't really know what happens during a volume snapshot with the guest running, but it seems to clearly destabilize the guest | 19:43 |
gmann | but this test waits for the snapshot to be deleted, and that happens before delete_volume, right? https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L184 | 19:43 |
dansmith | you're saying it _does_ delete the snaps before the volume right? | 19:44 |
gmann | yes | 19:44 |
gmann | so snapshot things should be cleaned up by the time volumes are deleted | 19:44 |
dansmith | right, but the thing it's failing on is (AFAICT) a detach operation in nova, which tries 8 times and fails because the guest never releases the block device | 19:44 |
dansmith | gmann: right but that's what I'm saying I think the test has finished already | 19:44 |
dansmith | well, | 19:45 |
dansmith | I guess maybe I'm mistaken about what happens during a snapshot | 19:45 |
dansmith | gmann: if the test failed in the middle of the meat of the test, wouldn't we see a failure specific to that, in addition to the failure during teardown? | 19:45 |
rosmaita | o/ | 19:46 |
dansmith | rosmaita: looking at that test that failed.. when that volume test does a force snapshot with the guest running - what is happening? is it snapshotting the volume underneath, or does it try to detach, snapshot, reattach? | 19:46 |
dansmith | I had assumed the former | 19:46 |
rosmaita | i think it depends on the driver, but for lvm, i believe it's the former | 19:47 |
dansmith | rosmaita: this: https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/test_volumes_snapshots.py#L59 | 19:48 |
dansmith | okay, so what I'm trying to determine is if we are hanging on the detach as part of the snapshot, or just the test cleanup | 19:48 |
dansmith | and based on that, I'm thinking it's the latter.. we've disturbed the guest by doing the snapshot underneath and it's wedged such that we just fail to cleanup | 19:48 |
rosmaita | looks to me like the cleanup | 19:48 |
gmann | dansmith: yeah, I am ok on 'more forgiving here' in cleanup as we do check snapshot deletion happing fine in the test itself https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L184 | 19:49 |
dansmith | we hit this sort of "volume fails to detach" thing so *very* often, that I think we need to do something here | 19:49 |
gmann | snapshot deletion completes the operation this test is testing | 19:49 |
dansmith | gmann: right, okay | 19:50 |
rosmaita | what i'm seeing in the c-vol log is the volume is reported as available, and then a series of lvcreate --snapshot commands | 19:50 |
gmann | this is the same case in many other test cleanups too, where detaching gets stuck in cleanup when the test tries many operations | 19:50 |
dansmith | rosmaita: all that happens underneath without really disturbing iscsi or the guest, I would think, so I'm not sure what the problem is | 19:50 |
dansmith | gmann: yes, but volume detach is a good portion of those I think | 19:51 |
gmann | yeah | 19:52 |
dansmith | gmann: so, hmm.. should we have already run the delete server and wait for termination part of the cleanup? | 19:54 |
dansmith | https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L211 | 19:55 |
dansmith | or do those run in reverse order so we're trying to delete the volume first? | 19:55 |
gmann | dansmith: cleanup is in reverse order, but here in this test the server is cleaned up after the test as it is added via addCleanup, and the delete volume cleanup happens at the test class level as it is added via addClassResourceCleanup https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L127 | 19:58 |
gmann | so delete volume happens later | 19:58 |
dansmith | gmann: the instance is still very clearly running, but you think it should have already been deleted? | 19:58 |
gmann | is it? | 19:59 |
dansmith | yes | 20:00 |
dansmith | well, | 20:00 |
dansmith | let me say the instance is still running when the volume fails to detach | 20:01 |
gmann | I see volume detach request here 2023-02-21 17:34:25.163 99115 INFO tempest.lib.common.rest_client [req-b137b6c4-ba62-411a-b2c4-bd5637202770 req-b137b6c4-ba62-411a-b2c4-bd5637202770 ] Request (VolumesSnapshotTestJSON:_run_cleanups): 202 DELETE https://10.176.196.163/compute/v2.1/servers/91b2ff57-588c-454c-aad1-3e67749420ee/os-volume_attachments/6ec5c1f8-6f4c-430f-94c2-6e08f0ce78f9 0.186s | 20:05 |
gmann | this is from tempest.log | 20:05 |
gmann | and server deletion request was not done yet | 20:05 |
dansmith | okay, so, | 20:06 |
dansmith | I think maybe we're actually stuck trying to delete the server | 20:06 |
dansmith | and it's stuck because it's waiting for the volume to be detached gracefully | 20:06 |
dansmith | I'm thinking the server becomes "deleted" immediately from the view of tempest, | 20:08 |
dansmith | so it moves on to delete the volume, | 20:08 |
gmann | it waits for server termination | 20:08 |
gmann | https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L211 | 20:08 |
dansmith | which it can't do because the instance is still kinda stuck detaching in its attempt to be deleted | 20:08 |
dansmith | gmann: right but it just waits for it to go 404, which happens basically immediately I think | 20:09 |
dansmith | but n-cpu continues to gracefully detach the volume before it deletes the server | 20:09 |
gmann | ah right | 20:09 |
dansmith | so, here's the thing | 20:10 |
gmann | if detach is stuck it should stuck here https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193 | 20:10 |
dansmith | melwitt was working on moving us to force-detach with brick for another unrelated thing | 20:10 |
dansmith | which is actually what we should be doing on delete server | 20:10 |
dansmith | so maybe we could try applying that and see if some/all of these go away | 20:10 |
dansmith | force-detach volumes with brick only on server delete, I mean | 20:11 |
dansmith | gmann: because on server delete, we try to do a graceful shutdown, but with limited patience before we cut and actually delete | 20:12 |
dansmith | and that's what brick's force detach _does_ | 20:12 |
dansmith | but this volume failure to detach can get in the way of that | 20:12 |
gmann | yeah, i can see detach stuck but because it is cleanup delete server still run Body: b'{"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots, awaiting a transfer, or be disassociated from snapshots after volume transfer."}}' _log_request_full | 20:12 |
gmann | /opt/stack/tempest/tempest/lib/common/rest_client.py:464 | 20:12 |
gmann | this is right before the server delete request | 20:12 |
dansmith | gmann: exactly, it gets that immediately, even before it has tried to do anything with the volume | 20:13 |
dansmith | oh wait, no | 20:13 |
dansmith | it does try to delete the attachment | 20:13 |
dansmith | dang | 20:13 |
dansmith | I misread that call, I thought it was trying to delete the server, but it's actually trying to delete the *attachment* is that right? | 20:14 |
dansmith | this, is what I didn't have scrolled far enough to the right: 2023-02-21 17:34:25,163 99115 INFO [tempest.lib.common.rest_client] Request (VolumesSnapshotTestJSON:_run_cleanups): 202 DELETE https://10.176.196.163/compute/v2.1/servers/91b2ff57-588c-454c-aad1-3e67749420ee/os-volume_attachments/6ec5c1f8-6f4c-430f-94c2-6e08f0ce78f9 0.186s | 20:14 |
gmann | yes this happen before delete server | 20:15 |
dansmith | okay, my bad | 20:16 |
dansmith | the other thing that I thought supported this, is that immediately after we see the final detach attempt fail in the n-cpu log, the instance is deleted | 20:16 |
dansmith | so I thought it was stuck in that wait process | 20:16 |
dansmith | gmann: so where is the tempest code that tries to delete the attachment? | 20:17 |
gmann | from here, and it does wait for attachment to be deleted https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193-L195 | 20:18 |
gmann | it is from attach_volume cleanup | 20:18 |
dansmith | ah, in the attach I see | 20:18 |
dansmith | I never can wrap my head around how all the positive actions do their own cleanup scheduling | 20:19 |
dansmith | okay, so I don't think we have any way to do the force detach from the API | 20:19 |
dansmith | gmann: so maybe a force reboot of the affected instance before we go to do the detach? it's a little messy, but it might shake it loose | 20:20 |
gmann | humm that makes tests more lengthy | 20:21 |
gmann | can we go for ignoring detach completion in such non-detach tests that do a lot of other operations on the guest? | 20:22 |
rosmaita | i don't know what this means, but in c-vol log, that volume is last mentioned when the 3rd snapshot is created at Feb 21 17:34:17.577277, and then not again until Feb 21 17:37:47.596555 when the initiator is deleted ... which seems a long time after that delete-attachment call gmann posted earlier | 20:22 |
dansmith | gmann: meaning don't do the wait_for_volume_resource_status==available step? | 20:23 |
dansmith | rosmaita: right because it's waiting for the guest to let go before it does | 20:23 |
gmann | dansmith: yes but delete server will get stuck, right | 20:23 |
dansmith | rosmaita: 8 attempts at 20s each | 20:23 |
dansmith | gmann: well, if we make delete server (in nova) properly do a force detach of the volume because it's being deleted, that would actually improve | 20:24 |
dansmith | gmann: so (1) do not wait for delete attachment to complete (2) go straight to delete server (3) make nova do force detach in delete server (which we need to do anyway) | 20:24 |
gmann | dansmith: but does the volume get deleted while in the in-use state? | 20:24 |
dansmith | gmann: it will still go back to available i think once the server is deleted | 20:25 |
gmann | dansmith: i see. I think that is the right way: as the server is anyway going to be deleted, clean up the attachment forcefully and tell cinder the same so that they can make the volume available | 20:26 |
gmann | hope volume will be ok to be reused again? if no then force volume delete also needed? | 20:27 |
dansmith | gmann: so we would need a flag to attach volume that says "don't schedule a wait_for_volume_resource_status because I'm going to delete this server" ? | 20:27 |
dansmith | gmann: ah, because this volume is shared among other tests in this class? | 20:27 |
gmann | no, not in this test. I am thinking about the general user scenario where nova forcefully deletes the attachment but the volume is not reusable because of that | 20:28 |
dansmith | gmann: oh yeah, it has to be reusable for sure | 20:28 |
gmann | ok, then it is fine | 20:29 |
dansmith | gmann: force detach still tries to do it gracefully first, it just forces if it doesn't go easily | 20:29 |
dansmith | melwitt: right? | 20:29 |
gmann | dansmith: yeah, in the latter case, what will we do with the volume (i mean, tell cinder) | 20:29 |
dansmith | I guess I'm echoing what I heard about brick's detach | 20:30 |
dansmith | I haven't chased the process that nova goes through on delete, but it *has* to delete the attachment with cinder | 20:30 |
gmann | because in the case where a user wants to reuse the volume after server delete, I think stuck in server delete is better than delete-server-with-force-detach-but-make-volume-unusable | 20:32 |
dansmith | gmann: do you mean unusable because of the state of the volume, or "unclean unmount from the guest" ? | 20:32 |
gmann | "unclean unmount from the guest" state is fine which can be modified forcefully | 20:33 |
dansmith | the former is definitely required, and I'm sure we're doing that now, or we'd already be locking volumes when you delete a server and it happens gracefully | 20:33 |
gmann | k | 20:33 |
melwitt | dansmith: os-brick force detach? yes it does a graceful detach first but if it doesn't complete it will force detach it | 20:34 |
dansmith | gmann: no, delete of an active server is effectively pulling the plug out, just like hard reboot, so if you leave the volume unclean after that, we did what you asked | 20:34 |
gmann | and this can be a nice test to reuse volume after the delete-server-with-force-delete-attachment | 20:34 |
dansmith | melwitt: yeah, I'm more talking about the nova part.. surely if you delete a server with a volume attached, nova deletes the attachment in cinder | 20:34 |
dansmith | otherwise even in the everything-worked case, we'd leave the volume unattachable if we didn't delete the attachment record and put it back to "available" | 20:35 |
dansmith | gmann: for hard reboot from the docs: "The HARD reboot corresponds to the power cycles of the server." | 20:36 |
melwitt | yes it deletes the attachment in cinder as part of an instance delete in nova | 20:36 |
melwitt | it does that after detaching with os-brick and it ignores errors from os-brick and deletes the attachment regardless only for instance delete | 20:36 |
dansmith | melwitt: yeah, cool, so if we make tempest *not* do (or wait for) the attachment delete, | 20:36 |
dansmith | then just deleting the server will (a) clean up cinder, (b) force-disconnect with brick and not hang and (c) delete the server | 20:36 |
melwitt | yeah if we were to add force=True to our os-brick detach call for server delete, it would do the steps as you describe | 20:38 |
dansmith | yeah | 20:39 |
gmann | dansmith: melwitt: will that be done by default internally in the delete server flow if detach does not happen in the normal way, or will it be based on a new 'force-detach' flag in the delete server nova API? | 20:42 |
dansmith | gmann: always, during server delete.. but brick's force detach *tries* graceful first | 20:43 |
gmann | ok | 20:43 |
dansmith | gmann: just like we do without a volume now.. we ask the server via acpi, but if it doesn't shut down in time, we nuke it from orbit | 20:43 |
gmann | ok | 20:44 |
melwitt | gmann: yeah, volume detach is kind of confusing bc there are multiple steps: 1) detach vol from guest 2) detach vol from host (currently we do not force this) 3) delete attachment in cinder | 20:44 |
gmann | k, so tempest tests just need to modify the cleanup not to wait for detach things and rely on delete server to do everything | 20:46 |
melwitt | we could use the force feature in os-brick at step 2) to force the detach if it doesn't succeed gracefully | 20:46 |
gmann | i see | 20:46 |
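For illustration, a hedged sketch of melwitt's step 2) on the nova side: ask os-brick to disconnect and let it fall back to forcing the detach. The surrounding plumbing is heavily simplified and the call site is an assumption; only the force/ignore_errors idea is taken from the discussion above.

```python
# Simplified sketch of step 2) above, not nova's actual code: disconnect the
# volume via os-brick and force the detach if the graceful path stalls.
from os_brick.initiator import connector

conn = connector.InitiatorConnector.factory(
    'ISCSI', root_helper='sudo', use_multipath=False)

def disconnect_for_instance_delete(connection_info, device_info):
    # force=True still attempts a graceful teardown first and only then
    # forces; ignore_errors keeps the instance delete moving regardless.
    conn.disconnect_volume(connection_info['data'], device_info,
                           force=True, ignore_errors=True)
```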
dansmith | gmann: yeah, I'll put something up in a sec | 20:47 |
gmann | thanks | 20:47 |
opendevreview | Dan Smith proposed openstack/tempest master: Avoid long wait for volume detach in some tests https://review.opendev.org/c/openstack/tempest/+/874700 | 20:52 |
dansmith | gmann: is that what you had in mind? ^ | 20:53 |
dansmith | cc melwitt | 20:53 |
gmann | dansmith: yes. that way | 20:53 |
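The shape of that change, roughly: give the attach helper an opt-out so tests that are about to delete the server don't schedule the detach-and-wait cleanup at all. The flag name and helper body below are a hedged sketch, not the exact content of the tempest patch.

```python
# Hypothetical sketch of the cleanup opt-out discussed above; the
# wait_for_detach flag and body are assumptions, not tempest's exact change.
def attach_volume(self, server, volume, wait_for_detach=True):
    attachment = self.servers_client.attach_volume(
        server['id'], volumeId=volume['id'])['volumeAttachment']
    waiters.wait_for_volume_resource_status(
        self.volumes_client, volume['id'], 'in-use')
    if wait_for_detach:
        # Default behaviour: detach in cleanup and wait for 'available',
        # which is where the 8 x 20s detach timeout above gets spent.
        # (Cleanups run in reverse order, so detach runs before the wait.)
        self.addCleanup(waiters.wait_for_volume_resource_status,
                        self.volumes_client, volume['id'], 'available')
        self.addCleanup(self.servers_client.detach_volume,
                        server['id'], volume['id'])
    # else: rely on server delete to tear down the attachment for us
    return attachment
```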
dansmith | gmann: so for test cases other than these which might hit the same thing, | 20:56 |
dansmith | is it easy to add the "log the guest console" thing? | 20:56 |
dansmith | because there's something causing the guest to not release the volume.. that may not manifest on the console, but .. it might | 20:56 |
dansmith | the referenced bug already tried to address some APIC-related reason for this happening (on live migrate I think) but it might be useful to get the console if we fail to wait for deleting the attachment | 20:57 |
dansmith | er, if we hit the timeout waiting for the attach delete, which has failed, I mean | 20:57 |
gmann | you mean during detach_volume call itself? | 20:58 |
dansmith | gmann: no, I mean for any other volume test that may be doing a detach and then wait (i.e. not passing this flag).. if we timeout waiting for the detach, we should log the console | 20:59 |
gmann | dansmith: yeah we can put that in waiter method itself which can be helpful for other volume state error also | 21:02 |
dansmith | gmann: can you lazy internet me a link to what to shove in there? :D | 21:02 |
dansmith | gmann: as you know, I'm very lazy | 21:02 |
gmann | dansmith: but we do two types of wait for detach to confirm: 1. volume status - https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193 2. wait_for_volume_attachment_remove_from_server as in the compute test base class https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/compute/base.py#L612-L615 | 21:03 |
dansmith | gmann: then ack, should be in both places | 21:04 |
gmann | dansmith: so we can do log console here https://github.com/openstack/tempest/blob/1569290be06e61d63061ae35a997aff0ebad68f1/tempest/common/waiters.py#L337 | 21:04 |
gmann | dansmith: and in 2nd place we already do https://github.com/openstack/tempest/blob/1569290be06e61d63061ae35a997aff0ebad68f1/tempest/common/waiters.py#L405 | 21:05 |
dansmith | gmann: aha, cool, I'll add it to the former then | 21:06 |
gmann | dansmith: ok, the former one needs to pass the server id too, which it does not have currently | 21:06 |
dansmith | gmann: ack | 21:07 |
gmann | but as this is a generic method for other volume tests, we can output the console based on whether server_id is None or not | 21:07 |
dansmith | gmann: ah, we'd need servers_client to do that right? | 21:17 |
gmann | dansmith: right | 21:19 |
dansmith | gmann: so are you okay passing both of those (optionally) in there? | 21:19 |
gmann | dansmith: yeah | 21:19 |
dansmith | okay | 21:20 |
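A rough sketch of the waiter tweak being discussed: accept an optional server_id/servers_client pair and dump the guest console before raising on timeout. It mirrors the waiter gmann links to only loosely; the exact signature and error handling are assumptions.

```python
# Hedged sketch of the discussed change, not tempest's actual waiter: log the
# guest console when a volume status wait times out, if server details were
# passed in by the caller.
import time

from oslo_log import log as logging
from tempest.lib import exceptions as lib_exc

LOG = logging.getLogger(__name__)

def wait_for_volume_resource_status(client, volume_id, status,
                                    server_id=None, servers_client=None):
    start = int(time.time())
    while True:
        volume = client.show_volume(volume_id)['volume']
        if volume['status'] == status:
            return
        if int(time.time()) - start >= client.build_timeout:
            if server_id and servers_client:
                # Help debug "volume never detaches" failures by capturing
                # what the guest was doing when we gave up.
                output = servers_client.get_console_output(
                    server_id)['output']
                LOG.debug('Console output for %s:\n%s', server_id, output)
            raise lib_exc.TimeoutException(
                'Volume %s failed to reach %s status within %s seconds' %
                (volume_id, status, client.build_timeout))
        time.sleep(client.build_interval)
```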
dansmith | gmann: also, further investigation reveals that we already do the right thing in nova here | 21:20 |
gmann | you mean for force detach thing? | 21:21 |
dansmith | gmann: if you look at the log, we already force delete the instance with fire first and then deal with the volumes afterwards | 21:21 |
dansmith | gmann: don't even need force detach for this case | 21:21 |
dansmith | gmann: right after we fail to wait for the detach in tempest, this happens: | 21:21 |
dansmith | [instance: 91b2ff57-588c-454c-aad1-3e67749420ee] Instance destroyed successfully. | 21:21 |
dansmith | then this: | 21:21 |
dansmith | [instance: 91b2ff57-588c-454c-aad1-3e67749420ee] calling os-brick to detach iSCSI Volume | 21:22 |
dansmith | which succeeds | 21:22 |
dansmith | so nova doesn't even try to detach the volume before it deletes the instance, it waits until the instance can't possibly be using it anymore and then does the disconnect | 21:22 |
dansmith | so I think just this tempest patch will likely improve gate things | 21:23 |
*** jpena is now known as jpena|off | 21:25 | |
gmann | but in this test detach is happening before delete server and failing. so you are saying leaving detach things to delete server will clean it up correctly ? | 21:25 |
dansmith | gmann: yes, I'm saying just this tempest change, and no nova side change is required | 21:26 |
gmann | dansmith: ok | 21:26 |
opendevreview | Ghanshyam proposed openstack/tempest master: Fix tempest-full-py3 for stable/ussuri to wallaby https://review.opendev.org/c/openstack/tempest/+/874704 | 21:31 |
gmann | ykarel: ^^ this will fix the stable/wallaby and older job | 21:33 |
kopecmartin | gmann: i'm trying to figure out the fix for this bug - https://bugs.launchpad.net/tempest/+bug/2007973 .. it affects all slow jobs, there are quite a lot of them and on different branches .. wouldn't it be easier to make the fix in devstack? something like if new cirros image, set the dhcp_client in tempest.conf accordingly | 21:38 |
kopecmartin | wdyt, would it work? | 21:38 |
gmann | kopecmartin: is this same as what ykarel reported https://review.opendev.org/c/openstack/devstack/+/859773?tab=comments | 21:39 |
gmann | I am checking the same and it seems we need to revert the cirros bump to 0.6.1 to unblock gate first and then we can debug ? | 21:40 |
opendevreview | Ghanshyam proposed openstack/devstack master: Revert "Bump cirros version to 0.6.1" https://review.opendev.org/c/openstack/devstack/+/874625 | 21:41 |
gmann | kopecmartin: ^^ | 21:41 |
kopecmartin | gmann: yup, i opened that bug based on ykarel's feedback | 21:42 |
kopecmartin | gmann: probably easier to revert and figure it out .. although we know what's wrong | 21:42 |
kopecmartin | i just don't know how to set it effectively in the jobs | 21:42 |
kopecmartin | any job which will use the newer cirros version needs to set scenario.dhcp_client to dhcpcd in tempest.conf | 21:43 |
kopecmartin | it's impossible to go this way - https://review.opendev.org/c/openstack/tempest/+/874586/1/zuul.d/integrated-gate.yaml - too many job variants | 21:44 |
gmann | kopecmartin: then you need to do it via a config option in tempest.conf and set that from devstack so that it will be set in all jobs on master using the new cirros and jobs on stable using devstack with the old cirros | 21:44 |
kopecmartin | so maybe if we added a condition to devstack like - if cirros >=0.6.1 than set the opt | 21:44 |
kopecmartin | exactly , good | 21:45 |
gmann | because devstack master configures the new cirros, so setting it there without a condition can be added, and in the tempest config option we can keep the old dhcp client as the default so we do not need to change devstack | 21:46 |
gmann | but to merge those we need to revert devstack change first | 21:47 |
kopecmartin | omg :D | 21:47 |
kopecmartin | it's really easy to get locked out | 21:48 |
gmann | kopecmartin its release time so expect everything :) | 21:49 |
kopecmartin | gmann: wait, do we need to revert that? instead of the revert can't we just set the proper dhcp client here https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L578 | 21:59 |
gmann | kopecmartin: that will break stable branch job. that is why we need config option and set that from devstack | 22:01 |
gmann | devstack master sets that as it will use the new cirros version, and the devstack stable branches will not set it, so the default will work | 22:01 |
kopecmartin | i got lost in it, i don't understand how a change in master can break stable jobs when devstack is branched | 22:03 |
gmann | kopecmartin: tempest jobs from master are used to run on stable too, right, so any change in the job configuration will impact stable | 22:04 |
gmann | unless you are adding it as a condition, but that only solves the jobs, not a tempest run with the new cirros in production | 22:05 |
gmann | kopecmartin: that is why we need to set that new config from devstack which is branched and will take care of old and new things automatically | 22:06 |
gmann | like any other feature flag | 22:06 |
gmann | kopecmartin: ohk, you are saying to set it via the devstack job. that will work, but this is not job specific, right; we should set it in lib/tempest so that any local installation also works fine | 22:29 |
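On the tempest side this is just a feature-flag style option with the old client as the default, so stable devstack branches that never touch it keep working and devstack master only has to set the new value in tempest.conf. A hedged oslo.config sketch (the real option definition in tempest may differ in choices and help text):

```python
# Sketch of the scenario.dhcp_client feature flag discussed above. The default
# stays on the old client so untouched stable setups keep working; devstack
# master would override it to dhcpcd. Choices/help text here are assumptions.
from oslo_config import cfg

scenario_group = cfg.OptGroup(name='scenario', title='Scenario test options')

ScenarioGroup = [
    cfg.StrOpt('dhcp_client',
               default='udhcpc',
               choices=['udhcpc', 'dhclient', 'dhcpcd'],
               help='DHCP client used by the guest image to renew its '
                    'DHCP lease in network scenario tests.'),
]
```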