opendevreview | Ghanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix https://review.opendev.org/c/openstack/nova/+/811118 | 00:51 |
---|---|---|
opendevreview | Ghanshyam proposed openstack/nova stable/xena: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811491 | 01:02 |
opendevreview | Ghanshyam proposed openstack/nova stable/wallaby: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811513 | 01:03 |
opendevreview | Hang Yang proposed openstack/nova master: Support creating servers with RBAC SGs https://review.opendev.org/c/openstack/nova/+/811521 | 01:34 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 01:48 |
opendevreview | Federico Ressi proposed openstack/nova master: Check Nova project changes with Tobiko scenario test cases https://review.opendev.org/c/openstack/nova/+/806853 | 01:52 |
opendevreview | Federico Ressi proposed openstack/nova master: Debug Nova APIs call failures https://review.opendev.org/c/openstack/nova/+/806683 | 01:56 |
opendevreview | Ghanshyam proposed openstack/nova stable/victoria: DNM: Testing nova-grenade-multinode with neutron-trunk https://review.opendev.org/c/openstack/nova/+/811540 | 04:10 |
bauzas | good morning Nova | 06:43 |
deke | Hi | 08:16 |
deke | Has there been any discussion about implementing a MAAS driver for nova to enable ironic-like functionality for users who have deployed openstack with juju on MAAS? | 08:16 |
bauzas | deke: I haven't heard anything about this | 08:36 |
bauzas | deke: tbh, it would be a laaaarge discussion, right? | 08:36 |
deke | yea it would | 08:36 |
deke | it was just a thought I had | 08:36 |
deke | if we are already deploying on top of a baremetal provisioning service | 08:37 |
deke | then why deploy another baremetal provisioning service | 08:37 |
bauzas | we are not saying "no" for a new virt driver | 08:39 |
bauzas | but for having a new virt driver upstream, that means we need to discuss how to use it and how we could verify it by the CI | 08:40 |
bauzas | that means large discussions during a lot of PTGs + making sure we at least have a third-party CI | 08:40 |
lyarwood | dansmith: https://review.opendev.org/c/openstack/grenade/+/811117 - looks like you're the only remaining active core on grenade, can you ack this when you get online to unblock Nova's gate? | 08:42 |
gibi | gmann: thanks for the patches! | 09:03 |
bauzas | gmann: what gibi said, thanks for having worked on them | 09:05 |
bauzas | so the xena and master changes for fixing the placement API endpoints are now merged, we only have the grenade issue left, right? | 09:06 |
lyarwood | Yup I believe so | 09:07 |
lyarwood | and I think we only need it in master | 09:07 |
gibi | yupp that's my view too | 09:07 |
lyarwood | I don't know why we've backported it tbh | 09:07 |
gibi | lyarwood: you mean why we are enabling trunk testing on wallaby and back? | 09:07 |
lyarwood | on stable/xena and backwards, I thought grenade used the current branch of itself and the previous branch for everything else for the initial deploy | 09:08 |
lyarwood | so master grenade deploys a stable/xena env and then upgrades that to master | 09:08 |
bauzas | gibi: I stupidly forgot to +1 the RC2 patches yesterday before leaving | 09:10 |
bauzas | :facepalm: | 09:10 |
bauzas | now we need elodilles_pto but as his nick says, he's on PTO :) | 09:10 |
* bauzas hides | 09:11 |
gibi | lyarwood: if we do this https://review.opendev.org/c/openstack/devstack/+/811518/2/lib/tempest then we need to do this as well https://review.opendev.org/c/openstack/grenade/+/811542/1/.zuul.yaml on stable. But I agree that we don't have to enable trunk on stable, skipping is OK with me too. | 09:12 |
lyarwood | yeah I agree with the comment on the stable/wallaby change that we shouldn't be adding tests to stable this late on | 09:12 |
gibi | bauzas: I think other release cores can approve the RC2, we don't necessarily need to wait for elodilles_pto. | 09:13 |
gibi | anyhow he is back tomorrow so no worries | 09:13 |
gibi | lyarwood: ack, we can discuss this with gmann when he is up | 09:14 |
lyarwood | kk | 09:20 |
opendevreview | Lee Yarwood proposed openstack/nova master: nova-manage: Ensure mountpoint is passed when updating attachment https://review.opendev.org/c/openstack/nova/+/811713 | 10:56 |
opendevreview | Lee Yarwood proposed openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance https://review.opendev.org/c/openstack/nova/+/811716 | 11:17 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/811118 - cool the check queue is passing with the grenade fix | 11:26 |
gibi | \o/ | 11:27 |
mdbooth | 👋 Are we confident that https://review.opendev.org/c/openstack/devstack/+/811399 fixed the devstack issue on Ubuntu? My devstack-using CI job still seems to be failing: | 12:00 |
mdbooth | https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1009/pull-cluster-api-provider-openstack-e2e-test/1443169418262614016/artifacts/logs/cloud-final.log | 12:00 |
mdbooth | I can see the following in the devstack output: "ProxyPass "/placement" "unix:/var/run/uwsgi/placement-api.socket|uwsgi://uwsgi-uds-placement-api" retry=0", which looks like it contains the fix. However I'm still seeing the borked GETs to placement | 12:02 |
mdbooth | I'm far from discounting some other issue in the environment, btw, just looking for confirmation or otherwise that we've seen this fix the issue. | 12:04 |
*** swp20 is now known as songwenping | 12:04 |
gibi | mdbooth: your logs show the original error, I can confirm, as placement logs "GET /placemen//resource_providers?in_" | 12:05 |
mdbooth | gibi: Right | 12:05 |
gibi | and I do see the ProxyPass you mentioned, and that is the correct one | 12:06 |
gibi | are you sure that your env is using the ProxyPass that is logged? | 12:06 |
mdbooth | gibi: There's a git pull; git show at the top of that output which suggests the working directory is at "7f16f6d4 Fix uwsgi config for trailing slashes" | 12:06 |
mdbooth | gibi: i.e. Can I confirm that the apache config actually contains that ProxyPass? | 12:07 |
gibi | gibi: or that apache2 was restarted / reloaded after the setting was applied | 12:07 |
gibi | hm, I see an apache2 reload later | 12:08 |
mdbooth | This is running in cloud-init on initial boot. *However* it is running from an image that already contains devstack bits, so it's entirely possible there's dirty state there. | 12:08 |
mdbooth | Let me try to confirm that. I hadn't considered that it might not actually be updating the apache config. | 12:09 |
gibi | hm, sorry I only see logs that suggest the reload like | 12:09 |
gibi | To activate the new configuration, you need to run: | 12:09 |
gibi | systemctl reload apache2 | 12:09 |
gibi | but I don't see the reload itself | 12:10 |
gibi | so if you can log into the system, try accessing placement; if it fails, check the ProxyPass config, then reload apache and check that placement is accessible | 12:11 |
gibi | (I did this manually in my local devstack so I do believe the fix helps, at least in devstack) | 12:11 |
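A minimal sketch of the manual check gibi describes above, assuming a stock devstack where placement is served by apache2 under /placement; hostnames, paths and socket names may differ per deployment:

```bash
# Does placement answer on the unmangled path? (localhost endpoint is an assumption)
curl -s http://localhost/placement/ | head -n 3

# Inspect which ProxyPass directives apache2 has on disk for placement
grep -R "ProxyPass" /etc/apache2/sites-enabled/ | grep -i placement

# If the fixed directive is present but not yet active, reload apache2 and re-test
sudo systemctl reload apache2
curl -s http://localhost/placement/ | head -n 3
```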
mdbooth | Unfortunately this is running in a CI system I don't have access to :( I have to debug via modifying the job and re-executing 😬 | 12:12 |
mdbooth | I'll see what I can confirm. Thanks! | 12:12 |
gibi | mdbooth: no problem, let us know if this still fails for you | 12:13 |
mdbooth | Ok, I can confirm that the image the machine is created from already contains dirty devstack state | 12:18 |
mdbooth | Including the broken ProxyPass directives | 12:18 |
mdbooth | So unless we consider it a bug that devstack doesn't update that, I'm guessing this is my problem? | 12:19 |
gibi | mdbooth: so you have both wrong and bad ProxyPass lines in the config file? | 12:20 |
gibi | I mean wrong and good | 12:20 |
gibi | :D | 12:20 |
mdbooth | So *before* pulling and executing a patched devstack I already have bad config present | 12:21 |
mdbooth | And running the patched devstack doesn't fix it | 12:21 |
gibi | I do see multiple ProxyPass lines locally too, I guess those are from multiple unstack.sh / stack.sh runs in my case | 12:21 |
lyarwood | yeah that isn't a devstack bug | 12:22 |
lyarwood | ./clean.sh first | 12:22 |
gibi | hm, if clean.sh removes it then I agree it is not a bug (I'm too lazy to always run clean.sh but I should) | 12:22 |
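For reference, a minimal sketch of the clean re-stack lyarwood is suggesting, assuming the conventional /opt/stack/devstack checkout (in mdbooth's CI the equivalent would be rebuilding the preinstalled image):

```bash
cd /opt/stack/devstack   # assumed checkout location
./clean.sh               # removes stale devstack state, including old apache vhost configs
./stack.sh               # redeploys with the fixed ProxyPass directives
```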
mdbooth | lyarwood: Or more likely summon the arcane wizards of CI and update the 'preinstalled' image! | 12:23 |
lyarwood | yeah or that if it's already baked into the image | 12:23 |
gibi | mdbooth: yeah, for CI I suggest to always start from a clean image :) | 12:23 |
lyarwood | that's odd tbh | 12:23 |
mdbooth | Apparently it saves a ton of time, although I haven't personally measured it. I'm about to measure it, though :) | 12:24 |
gibi | not too long ago dansmith added parallelism to the devstack stack.sh run, which helped a lot with runtime | 12:24 |
lyarwood | yeah that we use across most jobs now AFAIK | 12:27 |
mdbooth | lyarwood: Is there a flag? | 12:28 |
gibi | there should be | 12:29 |
gibi | but it is on by default | 12:29 |
mdbooth | Ok, cool | 12:29 |
lyarwood | yeah it's there by default in master and xena | 12:29 |
gibi | btw, ./clean.sh deletes all the config from /etc/apache2/sites-enabled/ except glance :D | 12:29 |
mdbooth | Ah, looks like it's DEVSTACK_PARALLEL, and it's not in victoria | 12:30 |
mdbooth | Perhaps I'll update | 12:30 |
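A minimal local.conf sketch for the flag named above, on branches where it exists (reportedly master and xena, where it is on by default); the section placement is an assumption about a typical devstack local.conf:

```
# local.conf snippet; merge into your existing localrc section
[[local|localrc]]
DEVSTACK_PARALLEL=True
```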
lyarwood | why are you deploying victoria btw? | 12:30 |
mdbooth | lyarwood: Because nobody changed it | 12:31 |
lyarwood | oh fun | 12:31 |
gibi | :) | 12:31 |
mdbooth | Honestly this CI system is awesome, I'm not complaining. It installs devstack in a VM in GCE and runs tests against it, and until we hit this it was rock solid. | 12:32 |
mdbooth | Ok, I'm going to run again from a clean image instead of the preinstalled image to see how long that takes. Then I'm going to do the same again but against Xena to measure again. | 12:33 |
mdbooth | Thanks for all the help! | 12:33 |
gibi | happy to help | 12:34 |
gmann | bauzas: other than neutron-grenade, nova-ceph-multistore is broken on stable (https://bugs.launchpad.net/devstack-plugin-ceph/+bug/1945358); fixes are ready to merge - https://review.opendev.org/q/topic:%22bug%252F1945358%22+(status:open%20OR%20status:merged) | 12:36 |
gmann | gibi: lyarwood checking comments on grenade fixes, | 12:36 |
gibi | gmann: in short we are questioning whether we need to turn on trunk testing on stable branches | 12:39 |
gmann | gibi: lyarwood ok, I would say we should, as Tempest master is used to test stable/ussuri -> master, and if we do not enable it then we skip the trunk tests for no reason. It was just a miss that the extension wasn't enabled on stable branches. | 12:41 |
lyarwood | gm | 12:41 |
lyarwood | ops sorry | 12:41 |
lyarwood | gmann: Yeah I appreciate that but if it wasn't tested until now it seems a little odd to add it in for non-master branches | 12:42 |
lyarwood | gmann: the missing coverage was mostly around live migration right? | 12:42 |
gmann | and we can see it is passing in nova grenade jobs on stable so we do not need anything extra beyond the grenade fix | 12:42 |
gmann | lyarwood: yeah, live migration trunk tests. those are the only trunk tests currently in tempest | 12:43 |
lyarwood | gmann: okay well if it's passing and the neutron folks are okay with helping with any fallout that might appear in the coming days then I guess we can go ahead | 12:43 |
gmann | lyarwood: yeah, if neutron team is not comfortable or confident then we can just keep it like that. | 12:43 |
gmann | lyarwood: whether we need it for Xena or not, I remember the master testing patch passed only when I added the xena fix as a depends-on here https://review.opendev.org/c/openstack/nova/+/811118 | 12:53 |
gmann | I think I added it due to the zuul job inventory but let me debug what exactly is happening here | 12:54 |
lyarwood | okay if that's the case let's just merge both | 12:54 |
gmann | lyarwood: zuul inventory seems to take job definition from master only https://zuul.opendev.org/t/openstack/build/01f10e2cb162450e912551026dde8f85/log/zuul-info/inventory.yaml#294-304 | 13:08 |
gmann | let me remove the xena fix as depends-on and then we can get more clarity | 13:09 |
opendevreview | Ghanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix https://review.opendev.org/c/openstack/nova/+/811118 | 13:09 |
lyarwood | gmann: kk | 13:16 |
bauzas | sorry folks, I had issues with my computer | 13:55 |
bauzas | gmann: thanks, unfortunately gtk | 13:56 |
bauzas | looks like it's limbo | 13:57 |
gmann | nova grenade master fix is in gate https://review.opendev.org/c/openstack/grenade/+/811117 | 13:58 |
gmann | whether to enable testing on stable or not is a different thing and we can continue that discussion. | 13:58 |
dansmith | bauzas: gibi looks like the nova-yoga-ptg etherpad has been emptied | 14:01 |
bauzas | ORLY ? | 14:01 |
gibi | shit, I see | 14:02 |
bauzas | dansmith: this was really done 3 mins ago | 14:02 |
gibi | this is the last known state https://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#8272 | 14:02 |
gibi | last known good state | 14:03 |
bauzas | yup | 14:03 |
dansmith | bauzas: okay I just loaded it and saw it empty | 14:03 |
bauzas | dansmith: me too | 14:03 |
bauzas | it's just, something happened between 8272 and 8273 | 14:03 |
dansmith | certainly you can restore a rev right? | 14:03 |
gibi | the content yes, the coloring I think no | 14:03 |
bauzas | I wonder | 14:04 |
bauzas | lemme ask infra | 14:04 |
gibi | sure | 14:04 |
dansmith | hmm, I thought there was some way | 14:04 |
gibi | what I did before is that I exported the old state and copied that back | 14:04 |
bauzas | fungi is looking at restoring | 14:05 |
gibi | ack | 14:05 |
bauzas | folks, don't touch now | 14:05 |
dansmith | sweet | 14:05 |
bauzas | dansmith: thanks for having identified this | 14:06 |
lyarwood | ah ffs I think that was me | 14:06 |
lyarwood | I was editing and my dock crashed | 14:06 |
lyarwood | and then I couldn't reconnect to the pad | 14:07 |
lyarwood | yeah my items were the last on there | 14:08 |
lyarwood | https://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#8272 | 14:08 |
lyarwood | not entirely sure how my laptop dock (and thus network) crashing caused this tbh | 14:09 |
gibi | an interesting untested edge case in the etherpad code :) | 14:09 |
fungi | bauzas: gibi: dansmith: i've rolled it back to revision 8272 | 14:13 |
fungi | lyarwood: ^ | 14:13 |
dansmith | looks good, thanks fungi ! | 14:13 |
bauzas | fungi: with all my love | 14:13 |
fungi | cool, just making sure it's looking like you needed | 14:13 |
lyarwood | thanks for that and apologies all | 14:13 |
fungi | we do also back up the db behind it daily, worst case | 14:13 |
bauzas | fungi: it is, all the colors | 14:14 |
gibi | fungi: awesome thanks | 14:14 |
fungi | no problem, glad we could recover it | 14:14 |
bauzas | and yeah, interesting edge case | 14:14 |
bauzas | a docking issue with a network problem swallows an etherpad | 14:14 |
fungi | those aren't so bad. in the past there have also been bugs which clients somehow tickled to make the pads completely unusable | 14:16 |
fungi | and the most we can do in those cases is dump an earlier text copy with the api and stick that into a new pad, but it loses all the attribution and history | 14:16 |
fungi | and formatting | 14:17 |
bauzas | fungi: thanks, gtk | 14:18 |
bauzas | gmann: i know you're busy with all those devstack and grenade stuff | 14:19 |
bauzas | gmann: but tell me when you think it would be a good opportunity for the x-p PTG session between oslo and nova re: policy | 14:19 |
gmann | bauzas: give me 10 min, currently internal meeting | 14:22 |
bauzas | gmann: nah, no rush | 14:22 |
lyarwood | https://review.opendev.org/c/openstack/nova/+/811118/ is passing with just the master grenade fix that's in the gate FWIW | 14:34 |
* lyarwood does a little dance | 14:34 |
gibi | \o/ | 14:35 |
bauzas | lyarwood: https://media.giphy.com/media/HTjcWZwMtHpyhuCGKZ/giphy-downsized-large.gif?cid=ecf05e47cwkexzrwaxwlkbicd2ppy22bb1l52sg7yoa8sk9x&rid=giphy-downsized-large.gif&ct=g | 14:52 |
* bauzas is sometimes regretting we can't add media in IRC | 14:52 |
bauzas | but that's lasting 0.1sec and then I say "meh" | 14:52 |
fungi | if you used a matrix client with the matrix-oftc bridge to join this channel, you and other matrix users could share inline media while irc users would just see a url to it in-channel | 14:53 |
bauzas | fungi: what I said, "meh" :p | 15:04 |
fungi | heh | 15:05 |
gmann | lyarwood: perfect, then I will say it was a late night thing that made me think the Xena fix was needed :) | 15:05 |
gmann | bauzas: for the PTG oslo session, is it possible on Tuesday Oct 19th between 13:00-15:00 UTC? | 15:07 |
opendevreview | melanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy https://review.opendev.org/c/openstack/nova/+/791807 | 15:14 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 15:14 |
gmann | bauzas: and as per the current topic I need only 30 min unless there are more things to discuss from anyone else | 15:19 |
bauzas | gmann: sorry, I'm in a meeting but I guess we could run the session after 14:00 UTC as we have a cyborg x-p session first | 15:20 |
gmann | bauzas: on tuesday right? | 15:22 |
bauzas | gmann: on the 19th of October, yes | 15:22 |
gmann | bauzas: perfect, sounds good. thanks | 15:22 |
bauzas | https://etherpad.opendev.org/p/nova-yoga-ptg L47 tells me the 14-15:00UTC slot is already taken | 15:23 |
gmann | bauzas: that is 'Cyborg-Nova: Tuesday (19th Oct) 13:00 UTC - 14: 00 UTC:' | 15:24 |
bauzas | my bad, yeah | 15:24 |
bauzas | again, wfm for an oslo x-p session on Oct 19 at 14:00 UTC | 15:24 |
gmann | +1, thanks again | 15:24 |
lyarwood | gmann: ^_^ no issues thanks for working on it so late | 15:24 |
bauzas | gmann: all good | 15:25 |
gmann | lyarwood: np! and for the stable backports we can wait for the neutron team's opinion, so I agree on 'no hurry for those'. | 15:25 |
lyarwood | awesome | 15:25 |
bauzas | artom: honestly, I'm torn with https://review.opendev.org/c/openstack/nova/+/808474 | 16:12 |
bauzas | that's a behavioural change | 16:13 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next https://review.opendev.org/c/openstack/nova/+/811748 | 16:13 |
bauzas | operators assume a plug-off when deleting | 16:13 |
bauzas | now, we'll first try to shutdown the guest for every instance | 16:13 |
bauzas | including drivers other than libvirt | 16:13 |
melwitt | that is my concern as well. maybe it could be conditional on bfv that is not "delete on termination"? | 16:14 |
melwitt | is that the only case where this would be desired? | 16:15 |
melwitt | or rather, is there any gain in doing it for non bfv non delete on termination? | 16:15 |
dansmith | generally you don't want that :) | 16:15 |
melwitt | maybe also shared storage is another case | 16:16 |
dansmith | the only real case for non-bfv volumes is for precious data | 16:16 |
melwitt | but precious data on a local disk that's going to be deleted anyway? I must be missing something | 16:16 |
dansmith | sorry, thought you were talking about delete-on-termination for non-bfv cinder volumes | 16:17 |
artom | melwitt, "is there any gain in doing it for non bfv non delete on termination?" None that I can see | 16:17 |
artom | Well, no, any volume, really | 16:17 |
artom | Doesn't have to be bfv | 16:17 |
melwitt | oh, yeah attached volumes. I wasn't thinking of that. yeah | 16:17 |
dansmith | that's my point, delete-on-termination should only be useful for bfv volumes we created with non-precious data from an image | 16:18 |
artom | It's still attached and mounted in the guest, and would ideally be flushed correctly if it's not delete_on_termination=True | 16:18 |
artom | I need to run an errand quickly, can this be carried over to the gerrit review? | 16:18 |
artom | And thanks for looking into it :) | 16:18 |
artom | And yeah, to bauzas's point, the compute manager/driver division of labour here is pretty muddy | 16:19 |
melwitt | my bad for not looking at the change yet. if it's targeted at only the instance-with-volume(s) cases I think that makes a lot more sense | 16:19 |
gibi | is it really a graceful shutdown via openstack server stop and then an openstack server delete? | 16:19 |
dansmith | gibi: stop + delete should be graceful | 16:19 |
gibi | so my point is this can already be done with our APIs | 16:20 |
melwitt | just saying I don't think we should be doing it for everything, for things where the data is going to be blown away anyway | 16:20 |
dansmith | gibi: for sure. I assume the goal is to make nova do the graceful behavior if volumes are attached, but to do it properly really requires some higher-level orch, like a stop...timeout...destroy kind of thing | 16:21 |
dansmith | "do the graceful behavior *automatically*" I should have said | 16:21 |
gibi | OK I see | 16:21 |
melwitt | yeah that is my understanding as well | 16:21 |
dansmith | I'm a bit torn, because unless you're running with unsafe cache, I would think that fast destroy is fine.. might have a journal to replay when you use the volume later, but... | 16:22 |
gibi | it makes sense for data consistency but it also makes delete slower so I think this should be opt-in | 16:22 |
melwitt | dansmith: yeah it's weird, the user is seeing the volume get corrupted and become unusable when they delete without stopping first | 16:23 |
melwitt | we had thought just deleting should be fine but it's behaving in a way we didn't expect | 16:23 |
melwitt | not sure why | 16:23 |
dansmith | destroy of a running vm is the same as pulling the plug.. if you're using a precious volume, you wouldn't do that to a physical server, so... | 16:23 |
artom | dansmith, so the "real" problem is https://bugzilla.redhat.com/show_bug.cgi?id=1965081 | 16:24 |
melwitt | yeah but re: "I would think that fast destroy is fine"? | 16:24 |
artom | Apparently in some cases stop+delete causes races | 16:24 |
melwitt | yeah I think you have to poll and wait for it to be stopped no? | 16:25 |
dansmith | melwitt: to do the graceful shutdown, you'd need some long-running task, yeah, what I said above | 16:25 |
artom | So we can either make delete safer, or require that any orchestration/automation on top of Nova does stop + delete, but then we would need to fix that race | 16:25 |
dansmith | artom: so this has nothing to do with volume safety? | 16:25 |
artom | dansmith, it does, because the reason for doing stop + delete (which can cause this deadlock) is volume safety | 16:26 |
melwitt | it does. it began with the racing problem and then we said "try a delete without the stop" and then their volumes got messed up | 16:26 |
dansmith | oh the dbdeadlock came from stop? | 16:26 |
lyarwood | doesn't delete take an instance.uuid lock on the compute? | 16:26 |
artom | dansmith, stop immediately followed by delete, apparently | 16:26 |
melwitt | it came from doing a delete right after a stop | 16:27 |
melwitt | without waiting for the stop to complete | 16:27 |
melwitt | lyarwood: good question | 16:27 |
dansmith | okay, so, fix that | 16:27 |
dansmith | don't engineer an orchestrated graceful delete, IMHO | 16:27 |
dansmith | I'm guessing task_state doesn't protect the delete from running, | 16:27 |
artom | I dunno, I don't necessarily think expecting an attached volume to not be corrupted after a delete is invalid | 16:28 |
dansmith | and the api might start the delete process while the stop is still running on the compute host or something | 16:28 |
melwitt | hm, stop and delete are both locked with instance.uuid | 16:28 |
dansmith | artom: I agree, delete could leave a volume unhappy, but I'm thinking if you issued a stop and then delete you're assuming they're queued | 16:28 |
melwitt | I just checked | 16:28 |
lyarwood | https://github.com/openstack/nova/blob/e07bb310b674fb471a92edf3258e564f05534595/nova/compute/manager.py#L3237-L3255 looks like soft delete doesn't | 16:28 |
dansmith | artom: in reality it's probably about like hitting shutdown on your server and then pulling the plug before it finishes, but... | 16:29 |
melwitt | dansmith: that's exactly what they want, the queuing | 16:29 |
dansmith | I think relying on the instance lock is probably too fragile here, | 16:30 |
dansmith | since it's just on the compute node, but I'd have to go look at the (many) delete path(s) we have | 16:30 |
lyarwood | I wouldn't say so for this case where the instance was running on a host | 16:31 |
dansmith | delete is and always has been pretty much "I want this to complete and stop charging me immediately", so... we're really not wrong here, IMHO | 16:31 |
lyarwood | that makes perfect sense | 16:31 |
opendevreview | Balazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next https://review.opendev.org/c/openstack/nova/+/811748 | 16:31 |
dansmith | lyarwood: wouldn't say what, that relying on the lock is unsafe? we can do stuff in the api to delete things (in the local case) which has nothing to do with the instance lock, | 16:32 |
dansmith | and I'm not sure that relying on the ordering of two calls is really safe either unless we have perfectly fair locks | 16:32 |
bauzas | sorry, I was afk | 16:32 |
* bauzas scrolling back to the convo I started | 16:33 | |
lyarwood | dansmith: we can't force the delete in the API of an instance still associated with a host can we? | 16:33 |
dansmith | like, if you issue three calls on that instance and the stop is #2, delete is #3, the lock has to be fair to make sure we don't happen to choose the waiting #3 thread when we stop doing #1 | 16:33 |
dansmith | lyarwood: of course we can | 16:33 |
dansmith | if we think the host is down, or we missed the last service update, etc | 16:33 |
lyarwood | right I'm stuck in the happy path here | 16:34 |
melwitt | we have the ability to add fair=True to the locks, I wonder if that would work? | 16:34 |
bauzas | my biggest concern here is that we would make destroy wait for a graceful shutdown | 16:34 |
dansmith | bauzas: yeah I don't think we should do that for sure | 16:35 |
dansmith | bauzas: I think at most, we should try to make sure a delete doesn't preempt a stop operation in progress | 16:35 |
bauzas | yup, agreed | 16:35 |
bauzas | if a stop is occurring, destroy should wait | 16:35 |
melwitt | how can we make it wait without rejecting it? | 16:35 |
bauzas | excellent questioin | 16:36 |
melwitt | I do wonder if the fair=True would help | 16:36 |
bauzas | I'd say we would reject synchronously by looking at the VM state | 16:36 |
melwitt | I'll try it in a devstack if I can repro the problem | 16:36 |
bauzas | destroy is synchronous, right? | 16:37 |
bauzas | even if the host is down, destroy will occur | 16:37 |
dansmith | melwitt: I'm just suggesting that lock inversion is *one* possible way that we might not be able to depend on the ordering | 16:37 |
bauzas | that's the guest destroy which is async, right? | 16:37 |
dansmith | anything else that happens on the compute host before we block on the lock could cause it, as well as if conductor is involved or something like that | 16:38 |
bauzas | dansmith: I missed your proposal | 16:38 |
bauzas | you're telling we could lock on stop ? | 16:38 |
bauzas | hence preventing the delete ? | 16:38 |
melwitt | we do lock on stop on the compute node | 16:39 |
melwitt | but lock waiters would not be in order because they are not fair locks | 16:39 |
melwitt | *would not necessarily be in order | 16:39 |
dansmith | bauzas: I haven't proposed anything | 16:39 |
bauzas | ok, melwitt explained | 16:40 |
dansmith | yeah, so when we go to do delete, | 16:41 |
melwitt | dansmith: ack, I was just thinking that would be such a small and simple way to fix it if the fair locks would do that. but like you said if we have an issue with two async things arriving on the compute node in the "wrong order" then it wouldn't help | 16:41 |
dansmith | we create an instance event before we run terminate_instance and grab the lock, | 16:41 |
dansmith | which means we're doing network IO to conductor | 16:41 |
dansmith | so even fair locks won't prevent order inversion there | 16:41 |
melwitt | yeah... | 16:41 |
dansmith | so if stop and delete arrive in the right order, but each call out to conductor, we hit any conductor in the cluster, each one tries to create records in the db, who knows which one will finish first, grab the lock, etc | 16:42 |
dansmith | that's what I mean by assuming this sort of super tight ordering is unsafe, even though we *think* we're on the compute and largely single-threaded | 16:42 |
bauzas | I see | 16:42 |
bauzas | stupid idea, can't we rely on the state of the instance ? | 16:43 |
dansmith | delete is anything-goes I think | 16:43 |
bauzas | yeah, that's the original problem | 16:43 |
dansmith | because that's what we want... | 16:43 |
bauzas | yup... | 16:43 |
dansmith | and anything else we build into there is going to be pretty obscure.. "delete always deletes, except stop but not shelve... and either waits or refuses or ..." | 16:44 |
dansmith | makes the "delete always works" contract a little less clear | 16:44 |
bauzas | honestly I don't know how to move on with this ask | 16:44 |
dansmith | tell them that delete means pulling the plug, | 16:44 |
bauzas | "ask your orchestration to be smarter ?" | 16:44 |
lyarwood | how about making stop more graceful just for the libvirt driver? | 16:44 |
gibi | can we add a flag to the delete api saying I-want-a-shutdown-first? | 16:44 |
dansmith | and you wouldn't do that while waiting for start->shutdown on a physical server, so they should wait for vm_state=STOPPED before delete | 16:45 |
dansmith | gibi: that makes it a little less obscure, but doesn't eliminate the need to make that orchestration bit work, of course | 16:45 |
bauzas | yeah | 16:45 |
lyarwood | artom: the main issue with stop was that it eventually destroys the instance right? | 16:45 |
bauzas | that doesn't solve the ordering problem | 16:45 |
lyarwood | artom: and in your change you've suggested that we call shutdown on the domain as an initial step to make this more graceful | 16:46 |
dansmith | lyarwood: can we call quiesce or something else constant-time before libvirt destroy? | 16:46 |
dansmith | instead of shutdown, which the guest can block or ignore? | 16:46 |
bauzas | the root problem is the I/O flushes, right? | 16:47 |
dansmith | novaclient has a --poll option for some things. "nova stop server --poll; nova delete server" would solve this pretty easy :) | 16:47 |
melwitt | they're using tripleo/heat but I'm pretty sure heat has dependency or waiting ability | 16:49 |
dansmith | surely | 16:49 |
lyarwood | dansmith: quiesce before $domain.shutdown() might help flush things if that's where the guestOS is getting hung up | 16:49 |
melwitt | their argument has been that stop + insta delete should not break the volume, IIUC | 16:49 |
lyarwood | waiting assumes we don't kill it before it's finished shutting down | 16:50 |
dansmith | lyarwood: I meant to make sure journal buffers are written before we nuke the guest | 16:50 |
dansmith | melwitt: that argument is fine as long as you wait between stop and delete :) | 16:50 |
dansmith | melwitt: because again, start->shutdown and then pulling the plug before it finishes yields a corrupted disk :) | 16:51 |
melwitt | dansmith: that was the first thing I said on the bug report but it's gone on for a long time now and gotten into the weeds | 16:51 |
* dansmith nods | 16:51 |
bauzas | melwitt: I haven't seen any bug report against artom's patch | 16:52 |
bauzas | I guess you're talking internally | 16:52 |
melwitt | bauzas: yes internally | 16:52 |
bauzas | yeah, because that's what worried me originally | 16:52 |
bauzas | technically, destroy works as expected | 16:52 |
dansmith | we also really need to do a better job of making this a sanitized bug externally if we're going to claim this is a bug in nova | 16:52 |
bauzas | I'm not happy with claiming this as an upstream bug | 16:53 |
dansmith | me either, fwiw :) | 16:53 |
bauzas | a blueprint or a wishlist bug | 16:53 |
dansmith | we could add a feature as gibi said, but destroy is doing the right thing here | 16:54 |
bauzas | (18:52:51) bauzas: technically, destroy works as expected | 16:54 |
melwitt | +1 to all of that | 16:54 |
dansmith | an alternative to gibi's idea, would be a delete flag that says "assuming task_state=None" meaning "delete this if nothing else is going on" | 16:54 |
bauzas | strong agreement here | 16:54 |
dansmith | but it would require the client to retry, which they could currently do by just waiting for the stop to finish | 16:54 |
dansmith | so, meh | 16:54 |
bauzas | dansmith: yeah, that's why I was considering the vm state or the task state | 16:55 |
bauzas | if we really want to do *something* | 16:55 |
bauzas | but, | 16:55 |
bauzas | this can't be done with the current destroy API | 16:55 |
dansmith | it doesn't make it do what they want, and honestly it's kinda weird since they could just wait for the stop just as well, but it's less new orchestration stuff | 16:55 |
dansmith | [09:47:37] <dansmith> novaclient has a --poll option for some things. "nova stop server --poll; nova delete server" would solve this pretty easy :) | 16:56 |
dansmith | ^ :) | 16:56 |
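A minimal client-side sketch of the stop-then-wait-then-delete flow dansmith keeps pointing at, using python-openstackclient; the server name is a placeholder and the poll interval is arbitrary:

```bash
openstack server stop my-server        # my-server is a hypothetical name

# wait until nova reports the guest has actually shut down
until [ "$(openstack server show my-server -f value -c status)" = "SHUTOFF" ]; do
    sleep 5
done

openstack server delete my-server      # deleting an already-stopped guest, so no in-flight I/O to lose
```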
bauzas | it would be a 'destroy++" API | 16:56 |
bauzas | heh | 16:56 |
dansmith | also, if melwitt is right and they're using heat, then FFS, get heat to wait or something | 16:56 |
bauzas | (18:44:41) bauzas: "ask your orchestration to be smarter ?" | 16:56 |
dansmith | lol | 16:56 |
lyarwood | we keep saying they could wait for the stop to finish but wasn't that part of the issue here? Even if they did wait the libvirt driver would kill the instance prematurely before it had finished shutting down? | 16:56 |
dansmith | if we're just re-quoting ourselves, are we done here? :P | 16:57 |
gibi | if currently stop + wait for STOPPED + delete works, then a new delete-with-graceful-shutdown could also be orchestrated from the conductor but it is obviously an orchestration and adds complexity to the already complex delete codepaths | 16:57 |
bauzas | gibi: and again, this can't be the straight delete | 16:57 |
dansmith | gibi: we don't go through conductor directly for stop or delete right now, AFAIK, so it would really get confusing to add another whole path like that, IMHO | 16:57 |
gibi | bauzas: yepp this is delete++ :D | 16:57 |
gibi | dansmith: then we either need to make the stop RCP sync or do polling from the nova-api | 16:58 |
gibi | RPC | 16:58 |
melwitt | lyarwood: I didn't think so? | 16:58 |
bauzas | can't we just have as dansmith suggested a "destroy++" API return some 40x when the task state is not None ? | 16:58 |
lyarwood | that's fine if they aren't waiting | 16:59 |
melwitt | if that's the case I missed it | 16:59 |
dansmith | gibi: we'd need a cast to conductor, a new method there, and a sync stop operation to compute from conductor followed by delete.. waiting from the api is not reasonable because stop can take a *long* time | 16:59 |
gibi | yeah | 16:59 |
lyarwood | melwitt: I was sure we had talked about them waiting and libvirt still killing the domain before things had shut down, sorry | 16:59 |
lyarwood | I'll look at the bug again | 16:59 |
dansmith | in addition to the other four ways we can delete things :) | 16:59 |
melwitt | lyarwood: my understanding (and I could be wrong) is that they have never tried waiting to stopped | 16:59 |
lyarwood | I was sure we suggested that early on | 17:00 |
dansmith | melwitt: right that seems like it to me | 17:00 |
melwitt | lyarwood: ok same, maybe I'm way off | 17:00 |
bauzas | I'd prefer this destroy++ API to fail synchronously at the API level if some conditions aren't met | 17:00 |
bauzas | rather than us pursuing the idea we could achieve some distributed locking mechanism | 17:00 |
dansmith | gibi: also delete returns "I will eventually do this" to the client, which it would no longer be able to guarantee | 17:00 |
dansmith | bauzas: fwiw, I don't think that is really going to do what this person wants, so we should make sure they think it would help before we do that work | 17:01 |
gibi | dansmith: ohh, that is correct and bad :/ | 17:01 |
gibi | delete++ is just too complex | 17:02 |
bauzas | dansmith: welp, good point | 17:02 |
bauzas | either way, I think we need proper tracking for this | 17:07 |
bauzas | artom: I guess you need to file a blueprint and honestly, given the brainstorm efforts we made over the last hour, you need to write a bit of a spec | 17:07 |
bauzas | mostly because the current destroy action can't be used for this and we need to consider a new action parameter (or any change in our API) | 17:08 |
bauzas | artom: also feel free to write a PTG proposal for this one, we could continue the talk there | 17:08 |
* bauzas stops for the day | 17:22 | |
* gibi too | 17:23 | |
*** beekneemech is now known as bnemec | 17:41 | |
spatel | Any idea how to do VM nic bonding using two VFs with SR-IOV? | 17:57 |
spatel | I am looking for redundancy with the SR-IOV implementation and the only solution is to do bonding inside the VM | 17:58 |
opendevreview | Ivan Kolodyazhny proposed openstack/nova master: Add release note which descrube NVMe attach issue https://review.opendev.org/c/openstack/nova/+/811447 | 18:04 |
opendevreview | melanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy https://review.opendev.org/c/openstack/nova/+/791807 | 19:15 |
opendevreview | melanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes https://review.opendev.org/c/openstack/nova/+/806629 | 19:15 |
opendevreview | melanie witt proposed openstack/nova stable/train: [stable-only] Set lower-constraints job as non-voting https://review.opendev.org/c/openstack/nova/+/811762 | 19:15 |
opendevreview | David Vallee Delisle proposed openstack/nova master: rephrasing config description for num_pcie_ports in libvirt https://review.opendev.org/c/openstack/nova/+/811173 | 21:59 |
opendevreview | Goutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:19 |
melwitt | gouthamr: thanks for that! ^ I was going to do it and forgot | 22:33 |
gouthamr | o/ melwitt - i may have hit a circular dependency of sorts | 22:34 |
melwitt | oh hm.. (looking at it now) | 22:35 |
melwitt | gouthamr: I wonder if you need to use the url instead of the change-id? not sure how it would know which of the two (master or stable/wallaby) to use | 22:36 |
melwitt | for the depends-on | 22:37 |
gouthamr | +1 can try that | 22:37 |
opendevreview | Goutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:38 |
clarkb | the url is preferred now because you can depend on things in other code review systems | 22:38 |
clarkb | change id is gerrit specific | 22:38 |
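For reference, a hedged example of the commit message footer clarkb is describing, using the dependency URLs mentioned later in this exchange; the trailers go at the end of the change's commit message, one per dependency:

```
DNM: Test wallaby with ceph pacific

Depends-On: https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/810059
Depends-On: https://review.opendev.org/c/openstack/devstack/+/810202
```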
gmann | gouthamr: i do not see circular deps, it should work fine. nova->plugin->devstack | 22:39 |
melwitt | good to know | 22:39 |
gouthamr | thanks, but zuul still tells me "Unable to freeze job graph: 0" | 22:40 |
clarkb | gouthamr: that implies you have a bug in your zuul config I think | 22:40 |
clarkb | gouthamr: can you give me links to all the changes involved? | 22:41 |
gouthamr | clarkb: yep: https://review.opendev.org/c/openstack/nova/+/811802/ | 22:41 |
gouthamr | https://review.opendev.org/c/openstack/devstack-plugin-ceph/+/810059 and https://review.opendev.org/c/openstack/devstack/+/810202 | 22:42 |
opendevreview | Ghanshyam proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific https://review.opendev.org/c/openstack/nova/+/811802 | 22:42 |
gmann | gouthamr: I think running now ^^ ? | 22:43 |
gouthamr | oh! | 22:43 |
gmann | yeah | 22:43 |
gouthamr | gmann: thanks! i am used to pushing empty changes to trigger the CI elsewhere :) | 22:43 |
gmann | with right deps too https://zuul.openstack.org/status#nova | 22:44 |
melwitt | gmann++ | 22:44 |
clarkb | the issue is you have no files in the commit | 22:47 |
clarkb | seems like you figured that out | 22:47 |
clarkb | I don't know that that is a use case we should support. If you aren't changing anything then why bother | 22:48 |
melwitt | I noticed that but didn't know that it would cause problems | 22:48 |
gmann | clarkb: yeah, the error was confusing though | 22:48 |
clarkb | gmann: yes, https://paste.opendev.org/show/809680/ is the internal handling of it. It is an exceptional case currently | 22:49 |
clarkb | I've brought it up in the zuul matrix room to see if that is something we can handle better | 22:49 |
gmann | clarkb: +1 | 22:49 |
gmann | thanks | 22:49 |
gouthamr | awesome thanks clarkb! | 22:51 |
clarkb | the other issue you'll run into with that, even if zuul didn't explode on it, is that so many jobs match on files that are modified. If no files are modified then none of those jobs will run | 22:52 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Add functional regression test for bug 1853009 https://review.opendev.org/c/openstack/nova/+/811805 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Clear rebalanced compute nodes from resource tracker https://review.opendev.org/c/openstack/nova/+/811806 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Invalidate provider tree when compute node disappears https://review.opendev.org/c/openstack/nova/+/811807 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Prevent deletion of a compute node belonging to another host https://review.opendev.org/c/openstack/nova/+/811808 | 23:51 |
opendevreview | melanie witt proposed openstack/nova stable/wallaby: Fix inactive session error in compute node creation https://review.opendev.org/c/openstack/nova/+/811809 | 23:51 |