opendevreview | Amit Uniyal proposed openstack/tempest master: Adds test for resize server swap to 0 https://review.opendev.org/c/openstack/tempest/+/858885 | 06:01 |
*** ralonsoh_ooo is now known as ralonsoh | 07:32 | |
*** jpena|off is now known as jpena | 08:29 | |
ykarel | gmann, kopecmartin https://review.opendev.org/c/openstack/devstack/+/859773 broke at least tempest slow jobs | 10:19 |
kopecmartin | oh, tempest.scenario.test_network_basic_ops.TestNetworkBasicOps fails with - cat: can't open '/var/run/udhcpc.eth0.pid': No such file or directory | 10:23 |
kopecmartin | https://2ba7f10ac23ddac3b9f6-1e843e6e8b4b324e302975788622dfa4.ssl.cf2.rackcdn.com/874232/1/check/tempest-slow-py3/35d32f7/testr_results.html | 10:23 |
kopecmartin | renew_lease method fails :/ | 10:25 |
kopecmartin | but if that failed due to cirros bump, it could have failed with any other custom image a user might use | 10:26 |
ykarel | yes if those images don't use udhcpc | 10:27 |
ykarel | from what i see only that test uses that config option and is marked as slow | 10:29 |
ykarel | so only jobs running slow tests are impacted | 10:29 |
kopecmartin | ykarel: right, i see that udhcpc is the default client .. anyway, does this mean that they changed cirros in 0.6.1 not to include this client? or maybe use a different one, i'm trying to find a change log or smth | 12:46 |
kopecmartin | oh | 12:47 |
kopecmartin | https://github.com/cirros-dev/cirros/blob/0.6.1/ChangeLog#L20 | 12:48 |
kopecmartin | they switched to dhcpcd | 12:48 |
ykarel | kopecmartin, yeap | 12:48 |
ykarel | https://github.com/cirros-dev/cirros/commit/ded54d3524d1dda485b095ed8a0f934695200c65 | 12:48 |
ykarel | https://github.com/cirros-dev/cirros/commit/e59406d14c857a949d6eeb400d67c2ed8f545390 | 12:48 |
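The failing scenario test renews the guest's DHCP lease over SSH, and the command it runs depends on which client the image ships. A minimal sketch of that kind of per-client dispatch is below; the helper name and exact commands are illustrative assumptions, not tempest's actual implementation.

```python
# Illustrative sketch only: dispatch the lease renewal command based on the
# configured DHCP client. Helper names and exact commands are assumptions.
def renew_lease(ssh_client, dhcp_client, interface="eth0"):
    if dhcp_client == "udhcpc":
        # udhcpc renews its lease on SIGUSR1; older cirros wrote this pid
        # file, which is what produced the "can't open
        # '/var/run/udhcpc.eth0.pid'" failure above once the client changed.
        pid = ssh_client.exec_command(
            "cat /var/run/udhcpc.%s.pid" % interface).strip()
        ssh_client.exec_command("sudo kill -USR1 %s" % pid)
    elif dhcp_client == "dhcpcd":
        # cirros 0.6.1 ships dhcpcd instead, which exposes a rebind option.
        ssh_client.exec_command("sudo dhcpcd --rebind %s" % interface)
    else:
        raise ValueError("unsupported dhcp client: %s" % dhcp_client)
```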
kopecmartin | i'm gonna try to override the default client to dhcpcd in the slow job | 12:49 |
*** ralonsoh is now known as ralonsoh_lunch | 12:51 | |
ykarel | +1 | 12:52 |
opendevreview | Milana Levy proposed openstack/tempest master: This change was written so that a new volume could be created by another client other than the primary admin https://review.opendev.org/c/openstack/tempest/+/874577 | 12:58 |
*** ralonsoh_lunch is now known as ralonsoh | 13:31 | |
opendevreview | Martin Kopec proposed openstack/tempest master: Change dhcp client to dhcpcd in slow jobs https://review.opendev.org/c/openstack/tempest/+/874586 | 13:57 |
opendevreview | yatin proposed openstack/grenade master: Dump Console log if ping fails https://review.opendev.org/c/openstack/grenade/+/874417 | 14:12 |
kopecmartin | #startmeeting qa | 15:01 |
opendevmeet | Meeting started Tue Feb 21 15:01:04 2023 UTC and is due to finish in 60 minutes. The chair is kopecmartin. Information about MeetBot at http://wiki.debian.org/MeetBot. | 15:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 15:01 |
opendevmeet | The meeting name has been set to 'qa' | 15:01 |
mnaser | was there previously an invisible_to_admin user or something like that in the past in devstack? | 15:01 |
mnaser | oops bad timing | 15:01 |
mnaser | ignore that :) | 15:01 |
lpiwowar | o/ | 15:01 |
kopecmartin | mnaser: i don't know, let me get back to that at the end of the meeting in the Open Discussion | 15:01 |
*** yadnesh_ is now known as yadnesh|away | 15:02 | |
kopecmartin | #topic Announcement and Action Item (Optional) | 15:04 |
kopecmartin | OpenStack Elections | 15:04 |
kopecmartin | the current status at | 15:04 |
kopecmartin | #link https://governance.openstack.org/election/ | 15:04 |
kopecmartin | #topic Antelope Priority Items progress | 15:05 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-antelope-priority | 15:05 |
* kopecmartin checks the status there | 15:05 | |
frickler | no updates on ceph plugin I guess? | 15:07 |
kopecmartin | doesn't look like it | 15:08 |
kopecmartin | anyone working on that? | 15:08 |
frickler | I thought that at some time you wanted to take a look at the tempest issue. or was it gmann? | 15:08 |
kopecmartin | i think it was me and i lost it in the pile of tabs :/ | 15:09 |
kopecmartin | i rechecked that to get fresh logs | 15:09 |
kopecmartin | i'm gonna try to get to that | 15:09 |
frickler | cool | 15:10 |
kopecmartin | so the goal is to fix whatever is failing here now, right? | 15:10 |
kopecmartin | #link https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/865315 | 15:10 |
kopecmartin | so that we can merge that | 15:10 |
frickler | yes | 15:10 |
kopecmartin | okey | 15:12 |
kopecmartin | there's been some progress on FIPS | 15:12 |
kopecmartin | #link https://review.opendev.org/c/openstack/devstack/+/871606 | 15:12 |
kopecmartin | but that depends on a patch in zuul-jobs | 15:13 |
kopecmartin | i wonder whether there are patches which depend on the devstack one - because the patch ^^ doesn't change anything, just allows the consumers to enable fips | 15:13 |
kopecmartin | I'll check with Ade | 15:13 |
kopecmartin | oh, one thing i forgot to mention in the announcements section | 15:16 |
kopecmartin | we're about to release a new tempest tag | 15:16 |
kopecmartin | #link https://review.opendev.org/c/openstack/tempest/+/871018 | 15:16 |
kopecmartin | the patches are in the queue depending on that one ^^ | 15:16 |
kopecmartin | which is currently blocked by the cirros bump, but i'll get to that later | 15:16 |
kopecmartin | #topic OpenStack Events Updates and Planning | 15:17 |
kopecmartin | #link https://etherpad.opendev.org/p/qa-bobcat-ptg | 15:17 |
kopecmartin | if you have any ideas for the topics to discuss over the ptg, then ^^ | 15:17 |
kopecmartin | i'll need to reserve some time and think about the topics we might wanna cover during the PTG | 15:18 |
kopecmartin | #topic Gate Status Checks | 15:19 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B2+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:19 |
kopecmartin | 2 reviews one is blocked by the other | 15:19 |
kopecmartin | the cirros version bump caused an issue with the dhcp client .. apparently the new cirros uses a different dhcp client by default | 15:20 |
opendevreview | Jorge San Emeterio proposed openstack/tempest master: Create a tempest test to verify bz#2118968 https://review.opendev.org/c/openstack/tempest/+/873706 | 15:20 |
kopecmartin | more info here | 15:20 |
kopecmartin | #link https://review.opendev.org/c/openstack/tempest/+/874586 | 15:20 |
kopecmartin | and in the associated bug report | 15:20 |
kopecmartin | anything urgent to review? | 15:22 |
kopecmartin | #topic Bare rechecks | 15:23 |
kopecmartin | #link https://etherpad.opendev.org/p/recheck-weekly-summary | 15:23 |
kopecmartin | we're doing quite good here | 15:23 |
kopecmartin | #topic Periodic jobs Status Checks | 15:23 |
kopecmartin | stable | 15:23 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-yoga&job_name=tempest-full-xena&job_name=tempest-full-wallaby-py3&job_name=tempest-full-victoria-py3&job_name=tempest-full-ussuri-py3&job_name=tempest-full-zed&pipeline=periodic-stable | 15:23 |
kopecmartin | master | 15:23 |
kopecmartin | #link https://zuul.openstack.org/builds?project=openstack%2Ftempest&project=openstack%2Fdevstack&pipeline=periodic | 15:23 |
kopecmartin | master got hit by the dhcp client issue | 15:24 |
kopecmartin | i'm checking whether those jobs would be fixed by the patch i proposed earlier | 15:26 |
frickler | yes, I changed the dhcp client in order to better support different IPv6 scenarios | 15:27 |
kopecmartin | ack, it requires a small change in a few jobs because tempest uses the previous dhcp client by default | 15:29 |
kopecmartin | #topic Distros check | 15:30 |
kopecmartin | cs-9 | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=tempest-full-centos-9-stream&job_name=devstack-platform-centos-9-stream&skip=0 | 15:31 |
kopecmartin | fedora | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-fedora-latest&skip=0 | 15:31 |
kopecmartin | debian | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-debian-bullseye&skip=0 | 15:31 |
kopecmartin | focal | 15:31 |
kopecmartin | #link https://zuul.opendev.org/t/openstack/builds?job_name=devstack-platform-ubuntu-focal&skip=0 | 15:31 |
kopecmartin | rocky | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-rocky-blue-onyx | 15:31 |
kopecmartin | openEuler | 15:31 |
kopecmartin | #link https://zuul.openstack.org/builds?job_name=devstack-platform-openEuler-22.03-ovn-source&job_name=devstack-platform-openEuler-22.03-ovs&skip=0 | 15:31 |
kopecmartin | all good, all passing, note that we merged the fix for rocky only a day or 2 ago | 15:32 |
kopecmartin | #topic Sub Teams highlights | 15:33 |
kopecmartin | Changes with Review-Priority == +1 | 15:33 |
kopecmartin | #link https://review.opendev.org/q/label:Review-Priority%253D%252B1+status:open+(project:openstack/tempest+OR+project:openstack/patrole+OR+project:openstack/devstack+OR+project:openstack/grenade) | 15:33 |
kopecmartin | no reviews there | 15:33 |
kopecmartin | #topic Open Discussion | 15:33 |
kopecmartin | (gmann) PyPi additional maintainers audit for QA repo | 15:34 |
kopecmartin | regarding this | 15:34 |
kopecmartin | we have reached out to everyone we could find | 15:34 |
kopecmartin | i think we can consider this done | 15:34 |
kopecmartin | .. i made a note here that we are ok with the removal of additional maintainers | 15:35 |
kopecmartin | #link https://etherpad.opendev.org/p/openstack-pypi-maintainers-cleanup | 15:35 |
kopecmartin | anything for the open discussion? | 15:35 |
tkajinam | o/ | 15:36 |
tkajinam | May I bring one topic ? | 15:36 |
kopecmartin | sure | 15:36 |
tkajinam | https://github.com/unbit/uwsgi/commit/5838086dd4490b8a55ff58fc0bf0f108caa4e079 | 15:37 |
tkajinam | I happened to notice uwsgi announced maintenance mode last year. is anybody aware of this ? | 15:37 |
tkajinam | this might be concerning for us because we are now extensively using uwsgi in devstack afaik | 15:38 |
kopecmartin | isn't the maintenance mode enough for us? | 15:40 |
tkajinam | if they will still maintain it well. but it's not a good sign imho. | 15:40 |
kopecmartin | yes, that's true | 15:41 |
kopecmartin | how can we mitigate that? | 15:41 |
kopecmartin | should we plan replacing that with something else? | 15:41 |
kopecmartin | (seems we have a topic for the upcoming virtual ptg) | 15:41 |
tkajinam | I noticed this 30 minutes ago and am just sharing it here, so I don't have clear ideas yet. we probably have to check the reason behind that shift and prepare a replacement plan in case it becomes unmaintained. | 15:42 |
kopecmartin | tkajinam: i'm just thinking out loud .. thanks for sharing, it's very appreciated | 15:43 |
kopecmartin | gmann: ^ did it come up in TC? | 15:43 |
kopecmartin | let's gather more info and get back to this | 15:43 |
tkajinam | I'll send an email to openstack-discuss. probably that would be a good way to initiate discussion around this. | 15:44 |
kopecmartin | tkajinam: very good idea | 15:44 |
kopecmartin | +1 | 15:44 |
kopecmartin | searching the ML to see whether it has come up already and i don't see anything specific | 15:45 |
frickler | doesn't ring a bell for me, either, but certainly worth discussing | 15:46 |
kopecmartin | yeah, this is interesting, seems like very important info and it didn't come up for a year o.O .. thanks again tkajinam | 15:47 |
frickler | regarding mnaser's question, I only know about the project of that name, not a user https://opendev.org/openstack/devstack/src/branch/master/lib/keystone#L343-L345 | 15:47 |
tkajinam | kopecmartin frickler, thanks ! | 15:48 |
mnaser | im trying to fix ospurge gate and its failing because of that | 15:48 |
mnaser | https://opendev.org/x/ospurge/src/branch/master/tools/func-tests.sh#L32 | 15:48 |
kopecmartin | it doesn't look like it's used anywhere else but there | 15:49 |
kopecmartin | #link https://codesearch.opendev.org/?q=invisible_to_admin_demo_pass&i=nope&literal=nope&files=&excludeFiles=&repos= | 15:49 |
frickler | mnaser: iiuc "demo" is the username and invisible_to_admin the project name | 15:50 |
frickler | do you have a link to a failure? | 15:50 |
mnaser | https://zuul.opendev.org/t/openstack/build/ef954eefbef2439da35829b2f99d8ef5 | 15:50 |
mnaser | yeah so i wonder if it's bitrot since its no longer used, since i checked codesearch too | 15:50 |
kopecmartin | it seds files under DEVSTACK_DIR which aren't there, i wonder whether they were there at some point or they were just generated by someone on the fly | 15:53 |
frickler | it seems accrc is no longer being created at all | 15:54 |
kopecmartin | the last commit in x/ospurge was done 3 years ago | 15:54 |
mnaser | yeah theres a lot of bitrot there | 15:54 |
mnaser | but ah ok if accrc is not a thing at all | 15:55 |
frickler | might be related to our general move to clouds.yaml, does ospurge support that? | 15:55 |
mnaser | i think it uses openstacksdk client in the backend | 15:55 |
mnaser | so i could update the tests to use --os-cloud | 15:55 |
frickler | I think that would be the best path looking forward | 15:56 |
mnaser | ok ill try to see what the different options are and how the clouds yaml file is generated and clean up that file | 15:57 |
frickler | we could add a cloud definition for the invisible project if needed | 15:57 |
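Since ospurge drives the cloud through openstacksdk anyway, switching its functional test to a named clouds.yaml entry (as suggested above) could look roughly like this; the cloud name used here is only a placeholder, not something devstack is guaranteed to generate.

```python
# Minimal sketch: authenticate via a named clouds.yaml entry instead of
# sourcing generated accrc files. "devstack-invisible" is a placeholder name.
import openstack

conn = openstack.connect(cloud="devstack-invisible")

# e.g. list what is left in the project before/after a purge run
for server in conn.compute.servers():
    print(server.name)
```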
frickler | anyway, I think we can also continue this discussion after the meeting | 15:58 |
kopecmartin | ack , last quick note about the bug triage | 15:59 |
kopecmartin | #topic Bug Triage | 15:59 |
kopecmartin | #link https://etherpad.openstack.org/p/qa-bug-triage-antelope | 15:59 |
kopecmartin | numbers recorded as always | 15:59 |
kopecmartin | and we're out of time | 15:59 |
kopecmartin | thank you everyone for joining | 15:59 |
kopecmartin | see you online | 15:59 |
kopecmartin | #endmeeting | 16:00 |
opendevmeet | Meeting ended Tue Feb 21 16:00:04 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 16:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.html | 16:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.txt | 16:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/qa/2023/qa.2023-02-21-15.01.log.html | 16:00 |
frickler | thx kopecmartin | 16:00 |
*** artom_ is now known as artom | 16:01 | |
lpiwowar | thanks o/ | 16:01 |
*** sean-k-mooney1 is now known as sean-k-mooney | 16:25 | |
opendevreview | Merged openstack/grenade master: Dump Console log if ping fails https://review.opendev.org/c/openstack/grenade/+/874417 | 18:00 |
gmann | tkajinam: thanks for bringing it, I will check mail | 18:43 |
dansmith | gmann: can you tell if this failure is during tearDown() or part of the test itself? https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_137/874664/1/check/tempest-integrated-compute-ubuntu-focal/1376f12/testr_results.html | 19:32 |
dansmith | because it doesn't show me a trace in the actual test, I'm guessing this is just tearDown? | 19:32 |
gmann | k, checking | 19:37 |
gmann | dansmith: yes, it is during tearDown from here https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L127 | 19:39 |
dansmith | gmann: okay, so I think what's going on there is that we've disturbed the guest a lot by doing the snapshot while it's running, and it is stuck to the point where it will not detach the volume | 19:40 |
dansmith | so we sit there and wait for 8*20s trying to detach it, it never lets go, so we never finish detaching and we fail there | 19:40 |
dansmith | I'm not sure if that really means the test failed or not, because the snapshots succeeded | 19:41 |
dansmith | but we could (a) get the console of the guest to see if it has a kernel panic or something (unlikely I think) | 19:41 |
dansmith | or (b) we could try a force reboot of the guest during teardown before we try to clean up or something | 19:42 |
dansmith | we see this failure pattern a lot | 19:42 |
dansmith | so I'm trying to think of how we can either be more forgiving here, or debug further what is going on | 19:42 |
dansmith | I guess I don't really know what happens during a volume snapshot with the guest running, but it seems to clearly destabilize the guest | 19:43 |
gmann | but this test waits for the snapshot to be deleted, and that happens before delete_volume, right? https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L184 | 19:43 |
dansmith | you're saying it _does_ delete the snaps before the volume right? | 19:44 |
gmann | yes | 19:44 |
gmann | so snapshot things should be cleaned up by the time volumes are deleted | 19:44 |
dansmith | right, but the thing it's failing on is (AFAICT) a detach operation in nova, which tries 8 times and fails because the guest never releases the block device | 19:44 |
dansmith | gmann: right but that's what I'm saying I think the test has finished already | 19:44 |
dansmith | well, | 19:45 |
dansmith | I guess maybe I'm mistaken about what happens during a snapshot | 19:45 |
dansmith | gmann: if the test failed in the middle of the meat of the test, wouldn't we see a failure specific to that, in addition to the failure during teardown? | 19:45 |
rosmaita | o/ | 19:46 |
dansmith | rosmaita: looking at that test that failed.. when that volume test does a force snapshot with the guest running - what is happening? is it snapshotting the volume underneath, or does it try to detach, snapshot, reattach? | 19:46 |
dansmith | I had assumed the former | 19:46 |
rosmaita | i think it depends on the driver, but for lvm, i believe it's the former | 19:47 |
dansmith | rosmaita: this: https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/test_volumes_snapshots.py#L59 | 19:48 |
dansmith | okay, so what I'm trying to determine is if we are hanging on the detach as part of the snapshot, or just the test cleanup | 19:48 |
dansmith | and based on that, I'm thinking it's the latter.. we've disturbed the guest by doing the snapshot underneath and it's wedged such that we just fail to cleanup | 19:48 |
rosmaita | looks to me like the cleanup | 19:48 |
gmann | dansmith: yeah, I am ok on 'more forgiving here' in cleanup as we do check snapshot deletion happing fine in the test itself https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L184 | 19:49 |
dansmith | we hit this sort of "volume fails to detach" thing so *very* often, that I think we need to do something here | 19:49 |
gmann | snapshot deletion completes the operation this test is testing | 19:49 |
dansmith | gmann: right, okay | 19:50 |
rosmaita | what i'm seeing in the c-vol log is the volume is reported as available, and then a series of lvcreate --snapshot commands | 19:50 |
gmann | this is the same case in many other test cleanups too, where detaching gets stuck in cleanup when the test tries many operations | 19:50 |
dansmith | rosmaita: all that happens underneath without really disturbing iscsi or the guest, I would think, so I'm not sure what the problem is | 19:50 |
dansmith | gmann: yes, but volume detach is a good portion of those I think | 19:51 |
gmann | yeah | 19:52 |
dansmith | gmann: so, hmm.. should we have already run the delete server and wait for termination part of the cleanup? | 19:54 |
dansmith | https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L211 | 19:55 |
dansmith | or do those run in reverse order so we're trying to delete the volume first? | 19:55 |
gmann | dansmith: cleanup is in reverse order, but here in this test the server is cleaned up after the test as it is added via addCleanup, and the delete volume cleanup happens at the test class level as it is added via addClassResourceCleanup https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L127 | 19:58 |
gmann | so delete volume happens later | 19:58 |
dansmith | gmann: the instance is still very clearly running, but you think it should have already been deleted? | 19:58 |
gmann | is it? | 19:59 |
dansmith | yes | 20:00 |
dansmith | well, | 20:00 |
dansmith | let me say the instance is still running when the volume fails to detach | 20:01 |
gmann | I see volume detach request here 2023-02-21 17:34:25.163 99115 INFO tempest.lib.common.rest_client [req-b137b6c4-ba62-411a-b2c4-bd5637202770 req-b137b6c4-ba62-411a-b2c4-bd5637202770 ] Request (VolumesSnapshotTestJSON:_run_cleanups): 202 DELETE https://10.176.196.163/compute/v2.1/servers/91b2ff57-588c-454c-aad1-3e67749420ee/os-volume_attachments/6ec5c1f8-6f4c-430f-94c2-6e08f0ce78f9 0.186s | 20:05 |
gmann | this is from tempest.log | 20:05 |
gmann | and server deletion request was not done yet | 20:05 |
dansmith | okay, so, | 20:06 |
dansmith | I think maybe we're actually stuck trying to delete the server | 20:06 |
dansmith | and it's stuck because it's waiting for the volume to be detached gracefully | 20:06 |
dansmith | I'm thinking the server becomes "deleted" immediately from the view of tempest, | 20:08 |
dansmith | so it moves on to delete the volume, | 20:08 |
gmann | it waits for server termination | 20:08 |
gmann | https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L211 | 20:08 |
dansmith | which it can't do because the instance is still kinda stuck detaching in its attempt to be deleted | 20:08 |
dansmith | gmann: right but it just waits for it to go 404, which happens basically immediately I think | 20:09 |
dansmith | but n-cpu continues to gracefully detach the volume before it deletes the server | 20:09 |
gmann | ah right | 20:09 |
dansmith | so, here's the thing | 20:10 |
gmann | if detach is stuck it should stuck here https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193 | 20:10 |
dansmith | melwitt was working on moving us to force-detach with brick for another unrelated thing | 20:10 |
dansmith | which is actually what we should be doing on delete server | 20:10 |
dansmith | so maybe we could try applying that and see if some/all of these go away | 20:10 |
dansmith | force-detach volumes with brick only on server delete, I mean | 20:11 |
dansmith | gmann: because on server delete, we try to do a graceful shutdown, but with limited patience before we cut and actually delete | 20:12 |
dansmith | and that's what brick's force detach _does_ | 20:12 |
dansmith | but this volume failure to detach can get in the way of that | 20:12 |
gmann | yeah, i can see detach stuck but because it is cleanup delete server still run Body: b'{"badRequest": {"code": 400, "message": "Invalid volume: Volume status must be available or error or error_restoring or error_extending or error_managing and must not be migrating, attached, belong to a group, have snapshots, awaiting a transfer, or be disassociated from snapshots after volume transfer."}}' _log_request_full | 20:12 |
gmann | /opt/stack/tempest/tempest/lib/common/rest_client.py:464 | 20:12 |
gmann | this is right before the server delete request | 20:12 |
dansmith | gmann: exactly, it gets that immediately, even before it has tried to do anything with the volume | 20:13 |
dansmith | oh wait, no | 20:13 |
dansmith | it does try to delete the attachment | 20:13 |
dansmith | dang | 20:13 |
dansmith | I misread that call, I thought it was trying to delete the server, but it's actually trying to delete the *attachment* is that right? | 20:14 |
dansmith | this, is what I didn't have scrolled far enough to the right: 2023-02-21 17:34:25,163 99115 INFO [tempest.lib.common.rest_client] Request (VolumesSnapshotTestJSON:_run_cleanups): 202 DELETE https://10.176.196.163/compute/v2.1/servers/91b2ff57-588c-454c-aad1-3e67749420ee/os-volume_attachments/6ec5c1f8-6f4c-430f-94c2-6e08f0ce78f9 0.186s | 20:14 |
gmann | yes this happen before delete server | 20:15 |
dansmith | okay, my bad | 20:16 |
dansmith | the other thing that I thought supported this, is that immediately after we see the final detach attempt fail in the n-cpu log, the instance is deleted | 20:16 |
dansmith | so I thought it was stuck in that wait process | 20:16 |
dansmith | gmann: so where is the tempest code that tries to delete the attachment? | 20:17 |
gmann | from here, and it does wait for attachment to be deleted https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193-L195 | 20:18 |
gmann | it is from attach_volume cleanup | 20:18 |
dansmith | ah, in the attach I see | 20:18 |
dansmith | I never can wrap my head around how all the positive actions do their own cleanup scheduling | 20:19 |
dansmith | okay, so I don't think we have any way to do the force detach from the API | 20:19 |
dansmith | gmann: so maybe a force reboot of the affected instance before we go to do the detach? it's a little messy, but it might shake it loose | 20:20 |
gmann | humm that makes tests more lengthy | 20:21 |
gmann | can we go for ignoring detach completion in such non-detach tests that do a lot of other operations on the guest? | 20:22 |
rosmaita | i don't know what this means, but in c-vol log, that volume is last mentioned when the 3rd snapshot is created at Feb 21 17:34:17.577277, and then not again until Feb 21 17:37:47.596555 when the initiator is deleted ... which seems a long time after that delete-attachment call gmann posted earlier | 20:22 |
dansmith | gmann: meaning don't do the wait_for_volume_resource_status==available step? | 20:23 |
dansmith | rosmaita: right because it's waiting for the guest to let go before it does | 20:23 |
gmann | dansmith: yes but delete server will get stuck, right | 20:23 |
dansmith | rosmaita: 8 attempts at 20s each | 20:23 |
dansmith | gmann: well, if we make delete server (in nova) properly do a force detach of the volume because it's being deleted, that would actually improve | 20:24 |
dansmith | gmann: so (1) do not wait for delete attachment to complete (2) go straight to delete server (3) make nova do force detach in delete server (which we need to do anyway) | 20:24 |
gmann | dansmith: but does the volume get deleted while in the in-use state? | 20:24 |
dansmith | gmann: it will still go back to available i think once the server is deleted | 20:25 |
gmann | dansmith: i see. I think that is the right way: as the server is anyway going to be deleted, clean up the attachment forcefully and tell cinder the same so that they can make the volume available | 20:26 |
gmann | hope volume will be ok to be reused again? if no then force volume delete also needed? | 20:27 |
dansmith | gmann: so we would need a flag to attach volume that says "don't schedule a wait_for_volume_resource_status because I'm going to delete this server" ? | 20:27 |
dansmith | gmann: ah, because this volume is shared among other tests in this class? | 20:27 |
gmann | no, not in this test. I am thinking about the general user scenario where nova forcefully deletes the attachment but the volume is not reusable because of that | 20:28 |
dansmith | gmann: oh yeah, it has to be reusable for sure | 20:28 |
gmann | ok, then it is fine | 20:29 |
dansmith | gmann: force detach still tries to do it gracefully first, it just forces if it doesn't go easily | 20:29 |
dansmith | melwitt: right? | 20:29 |
gmann | dansmith: yeah, in the latter case, what will we do with the volume (i mean, tell cinder) | 20:29 |
dansmith | I guess I'm echoing what I heard about brick's detach | 20:30 |
dansmith | I haven't chased the process that nova goes through on delete, but it *has* to delete the attachment with cinder | 20:30 |
gmann | because in the case where a user wants to reuse the volume after server delete, I think stuck in server delete is better than delete-server-with-force-detach-but-make-volume-unusable | 20:32 |
dansmith | gmann: do you mean unusable because of the state of the volume, or "unclean unmount from the guest" ? | 20:32 |
gmann | "unclean unmount from the guest" state is fine which can be modified forcefully | 20:33 |
dansmith | the former is definitely required, and I'm sure we're doing that now, or we'd already be locking volumes when you delete a server and it happens gracefully | 20:33 |
gmann | k | 20:33 |
melwitt | dansmith: os-brick force detach? yes it does a graceful detach first but if it doesn't complete it will force detach it | 20:34 |
dansmith | gmann: no, delete of an active server is effectively pulling the plug out, just like hard reboot, so if you leave the volume unclean after that, we did what you asked | 20:34 |
gmann | and this can be a nice test to reuse volume after the delete-server-with-force-delete-attachment | 20:34 |
dansmith | melwitt: yeah, I'm more talking about the nova part.. surely if you delete a server with a volume attached, nova deletes the attachment in cinder | 20:34 |
dansmith | otherwise even in the everything-worked case, we'd leave the volume unattachable if we didn't delete the attachment record and put it back to "available" | 20:35 |
dansmith | gmann: for hard reboot from the docs: "The HARD reboot corresponds to the power cycles of the server." | 20:36 |
melwitt | yes it deletes the attachment in cinder as part of an instance delete in nova | 20:36 |
melwitt | it does that after detaching with os-brick and it ignores errors from os-brick and deletes the attachment regardless only for instance delete | 20:36 |
dansmith | melwitt: yeah, cool, so if we make tempest *not* do (or wait for) the attachment delete, | 20:36 |
dansmith | then just deleting the server will (a) clean up cinder, (b) force-disconnect with brick and not hang and (c) delete the server | 20:36 |
melwitt | yeah if we were to add force=True to our os-brick detach call for server delete, it would do the steps as you describe | 20:38 |
dansmith | yeah | 20:39 |
gmann | dansmith: melwitt: will that be done by default internally in the delete server flow if detach does not happen in the normal way, or will it be based on a new 'force-detach' flag in the delete server nova API? | 20:42 |
dansmith | gmann: always, during server delete.. but brick's force detach *tries* graceful first | 20:43 |
gmann | ok | 20:43 |
dansmith | gmann: just like we do without a volume now.. we ask the server via acpi, but if it doesn't shut down in time, we nuke it from orbit | 20:43 |
gmann | ok | 20:44 |
melwitt | gmann: yeah, volume detach is kind of confusing bc there are multiple steps: 1) detach vol from guest 2) detach vol from host (currently we do not force this) 3) delete attachment in cinder | 20:44 |
gmann | k, so tempest tests just need to modify the cleanup not to wait for detach things and rely on delete server to do everything | 20:46 |
melwitt | we could use the force feature in os-brick at step 2) to force the detach if it doesn't succeed gracefully | 20:46 |
gmann | i see | 20:46 |
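For illustration, a hedged sketch of melwitt's step 2) on the nova side: ask os-brick to disconnect and let it fall back to forcing the detach. The surrounding plumbing is heavily simplified and the call site is an assumption; only the force/ignore_errors idea is taken from the discussion above.

```python
# Simplified sketch of step 2) above, not nova's actual code: disconnect the
# volume via os-brick and force the detach if the graceful path stalls.
from os_brick.initiator import connector

conn = connector.InitiatorConnector.factory(
    'ISCSI', root_helper='sudo', use_multipath=False)

def disconnect_for_instance_delete(connection_info, device_info):
    # force=True still attempts a graceful teardown first and only then
    # forces; ignore_errors keeps the instance delete moving regardless.
    conn.disconnect_volume(connection_info['data'], device_info,
                           force=True, ignore_errors=True)
```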
dansmith | gmann: yeah, I'll put something up in a sec | 20:47 |
gmann | thanks | 20:47 |
opendevreview | Dan Smith proposed openstack/tempest master: Avoid long wait for volume detach in some tests https://review.opendev.org/c/openstack/tempest/+/874700 | 20:52 |
dansmith | gmann: is that what you had in mind? ^ | 20:53 |
dansmith | cc melwitt | 20:53 |
gmann | dansmith: yes. that way | 20:53 |
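The shape of that change, roughly: give the attach helper an opt-out so tests that are about to delete the server don't schedule the detach-and-wait cleanup at all. The flag name and helper body below are a hedged sketch, not the exact content of the tempest patch.

```python
# Hypothetical sketch of the cleanup opt-out discussed above; the
# wait_for_detach flag and body are assumptions, not tempest's exact change.
def attach_volume(self, server, volume, wait_for_detach=True):
    attachment = self.servers_client.attach_volume(
        server['id'], volumeId=volume['id'])['volumeAttachment']
    waiters.wait_for_volume_resource_status(
        self.volumes_client, volume['id'], 'in-use')
    if wait_for_detach:
        # Default behaviour: detach in cleanup and wait for 'available',
        # which is where the 8 x 20s detach timeout above gets spent.
        # (Cleanups run in reverse order, so detach runs before the wait.)
        self.addCleanup(waiters.wait_for_volume_resource_status,
                        self.volumes_client, volume['id'], 'available')
        self.addCleanup(self.servers_client.detach_volume,
                        server['id'], volume['id'])
    # else: rely on server delete to tear down the attachment for us
    return attachment
```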
dansmith | gmann: so for test cases other than these which might hit the same thing, | 20:56 |
dansmith | is it easy to add the "log the guest console" thing? | 20:56 |
dansmith | because there's something causing the guest to not release the volume.. that may not manifest on the console, but .. it might | 20:56 |
dansmith | the referenced bug already tried to address some APIC-related reason for this happening (on live migrate I think) but it might be useful to get the console if we fail to wait for deleting the attachment | 20:57 |
dansmith | er, if we hit the timeout waiting for the attach delete, which has failed, I mean | 20:57 |
gmann | you mean during detach_volume call itself? | 20:58 |
dansmith | gmann: no, I mean for any other volume test that may be doing a detach and then wait (i.e. not passing this flag).. if we timeout waiting for the detach, we should log the console | 20:59 |
gmann | dansmith: yeah we can put that in waiter method itself which can be helpful for other volume state error also | 21:02 |
dansmith | gmann: can you lazy internet me a link to what to shove in there? :D | 21:02 |
dansmith | gmann: as you know, I'm very lazy | 21:02 |
gmann | dansmith: but we do two types of wait for detach to confirm: 1. volume status - https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/volume/base.py#L193 2. wait_for_volume_attachment_remove_from_server as in the compute test base class https://github.com/openstack/tempest/blob/3f9ae1349768ee7ad7f163a302dd387847ebce7a/tempest/api/compute/base.py#L612-L615 | 21:03 |
dansmith | gmann: then ack, should be in both places | 21:04 |
gmann | dansmith: so we can do log console here https://github.com/openstack/tempest/blob/1569290be06e61d63061ae35a997aff0ebad68f1/tempest/common/waiters.py#L337 | 21:04 |
gmann | dansmith: and in 2nd place we already do https://github.com/openstack/tempest/blob/1569290be06e61d63061ae35a997aff0ebad68f1/tempest/common/waiters.py#L405 | 21:05 |
dansmith | gmann: aha, cool, I'll add it to the former then | 21:06 |
gmann | dansmith: ok, the former one needs to pass the server id too, which it does not have currently | 21:06 |
dansmith | gmann: ack | 21:07 |
gmann | but as this is a generic method for other volume tests, we can output the console based on whether server_id is None or not | 21:07 |
dansmith | gmann: ah, we'd need servers_client to do that right? | 21:17 |
gmann | dansmith: right | 21:19 |
dansmith | gmann: so are you okay passing both of those (optionally) in there? | 21:19 |
gmann | dansmith: yeah | 21:19 |
dansmith | okay | 21:20 |
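A rough sketch of the waiter tweak being discussed: accept an optional server_id/servers_client pair and dump the guest console before raising on timeout. It mirrors the waiter gmann links to only loosely; the exact signature and error handling are assumptions.

```python
# Hedged sketch of the discussed change, not tempest's actual waiter: log the
# guest console when a volume status wait times out, if server details were
# passed in by the caller.
import time

from oslo_log import log as logging
from tempest.lib import exceptions as lib_exc

LOG = logging.getLogger(__name__)

def wait_for_volume_resource_status(client, volume_id, status,
                                    server_id=None, servers_client=None):
    start = int(time.time())
    while True:
        volume = client.show_volume(volume_id)['volume']
        if volume['status'] == status:
            return
        if int(time.time()) - start >= client.build_timeout:
            if server_id and servers_client:
                # Help debug "volume never detaches" failures by capturing
                # what the guest was doing when we gave up.
                output = servers_client.get_console_output(
                    server_id)['output']
                LOG.debug('Console output for %s:\n%s', server_id, output)
            raise lib_exc.TimeoutException(
                'Volume %s failed to reach %s status within %s seconds' %
                (volume_id, status, client.build_timeout))
        time.sleep(client.build_interval)
```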
dansmith | gmann: also, further investigation reveals that we already do the right thing in nova here | 21:20 |
gmann | you mean for force detach thing? | 21:21 |
dansmith | gmann: if you look at the log, we already force delete the instance with fire first and then deal with the volumes afterwards | 21:21 |
dansmith | gmann: don't even need force detach for this case | 21:21 |
dansmith | gmann: right after we fail to wait for the detach in tempest, this happens: | 21:21 |
dansmith | [instance: 91b2ff57-588c-454c-aad1-3e67749420ee] Instance destroyed successfully. | 21:21 |
dansmith | then this: | 21:21 |
dansmith | [instance: 91b2ff57-588c-454c-aad1-3e67749420ee] calling os-brick to detach iSCSI Volume | 21:22 |
dansmith | which succeeds | 21:22 |
dansmith | so nova doesn't even try to detach the volume before it deletes the instance, it waits until the instance can't possibly be using it anymore and then does the disconnect | 21:22 |
dansmith | so I think just this tempest patch will likely improve gate things | 21:23 |
*** jpena is now known as jpena|off | 21:25 | |
gmann | but in this test detach is happening before delete server and failing. so you are saying leaving detach things to delete server will clean it up correctly ? | 21:25 |
dansmith | gmann: yes, I'm saying just this tempest change, and no nova side change is required | 21:26 |
gmann | dansmith: ok | 21:26 |
opendevreview | Ghanshyam proposed openstack/tempest master: Fix tempest-full-py3 for stable/ussuri to wallaby https://review.opendev.org/c/openstack/tempest/+/874704 | 21:31 |
gmann | ykarel: ^^ this will fix the stable/wallaby and older job | 21:33 |
kopecmartin | gmann: i'm trying to figure out the fix for this bug - https://bugs.launchpad.net/tempest/+bug/2007973 .. it affects all slow jobs, there are quite a lot of them and on different branches .. wouldn't it be easier to make the fix in devstack? something like if new cirros image, set the dhcp_client in tempest.conf accordingly | 21:38 |
kopecmartin | wdyt, would it work? | 21:38 |
gmann | kopecmartin: is this same as what ykarel reported https://review.opendev.org/c/openstack/devstack/+/859773?tab=comments | 21:39 |
gmann | I am checking the same and it seems we need to revert the cirros bump to 0.6.1 to unblock gate first and then we can debug ? | 21:40 |
opendevreview | Ghanshyam proposed openstack/devstack master: Revert "Bump cirros version to 0.6.1" https://review.opendev.org/c/openstack/devstack/+/874625 | 21:41 |
gmann | kopecmartin: ^^ | 21:41 |
kopecmartin | gmann: yup, i opened that bug based on ykarel's feedback | 21:42 |
kopecmartin | gmann: probably easier to revert and figure it out .. although we know what's wrong | 21:42 |
kopecmartin | i just don't know how to set it effectively in the jobs | 21:42 |
kopecmartin | any job which will use the newer cirros version needs to set scenario.dhcp_client to dhcpcd in tempest.conf | 21:43 |
kopecmartin | it's impossible to go this way - https://review.opendev.org/c/openstack/tempest/+/874586/1/zuul.d/integrated-gate.yaml - too many job variants | 21:44 |
gmann | kopecmartin: then you need to do it via a config option in tempest.conf and set that from devstack so that it will be set in all jobs on master using the new cirros and jobs on stable using devstack with the old cirros | 21:44 |
kopecmartin | so maybe if we added a condition to devstack like - if cirros >=0.6.1 than set the opt | 21:44 |
kopecmartin | exactly , good | 21:45 |
gmann | because devstack master configures the new cirros, so setting it there without a condition can be added, and in the tempest config option we can keep the old dhcp client as the default so we do not need to change devstack | 21:46 |
gmann | but to merge those we need to revert devstack change first | 21:47 |
kopecmartin | omg :D | 21:47 |
kopecmartin | it's really easy to get locked out | 21:48 |
gmann | kopecmartin its release time so expect everything :) | 21:49 |
kopecmartin | gmann: wait, do we need to revert that? instead of the revert can't we just set the proper dhcp client here https://opendev.org/openstack/devstack/src/branch/master/.zuul.yaml#L578 | 21:59 |
gmann | kopecmartin: that will break stable branch job. that is why we need config option and set that from devstack | 22:01 |
gmann | devstack master sets that as it will use the new cirros version, and the devstack stable branches will not set it, so the default will work | 22:01 |
kopecmartin | i got lost in it, i don't understand how a change in master can break stable jobs when devstack is branched | 22:03 |
gmann | kopecmartin: tempest jobs from master are used to run on stable too, right, so any change in the job configuration will impact stable | 22:04 |
gmann | unless you are adding it as a condition, but that only solves the jobs, not a tempest run with the new cirros in production | 22:05 |
gmann | kopecmartin: that is why we need to set that new config from devstack which is branched and will take care of old and new things automatically | 22:06 |
gmann | like any other feature flag | 22:06 |
gmann | kopecmartin: ohk, you are saying to set it via the devstack job. that will work, but this is not job specific, right; we should set it in lib/tempest so that any local installation also works fine | 22:29 |
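On the tempest side this is just a feature-flag style option with the old client as the default, so stable devstack branches that never touch it keep working and devstack master only has to set the new value in tempest.conf. A hedged oslo.config sketch (the real option definition in tempest may differ in choices and help text):

```python
# Sketch of the scenario.dhcp_client feature flag discussed above. The default
# stays on the old client so untouched stable setups keep working; devstack
# master would override it to dhcpcd. Choices/help text here are assumptions.
from oslo_config import cfg

scenario_group = cfg.OptGroup(name='scenario', title='Scenario test options')

ScenarioGroup = [
    cfg.StrOpt('dhcp_client',
               default='udhcpc',
               choices=['udhcpc', 'dhclient', 'dhcpcd'],
               help='DHCP client used by the guest image to renew its '
                    'DHCP lease in network scenario tests.'),
]
```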