16:00:06 <gibi> #startmeeting nova
16:00:06 <opendevmeet> Meeting started Tue Jun 29 16:00:06 2021 UTC and is due to finish in 60 minutes.  The chair is gibi. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:06 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:06 <opendevmeet> The meeting name has been set to 'nova'
16:00:14 <stephenfin> o/
16:00:21 * kashyap waves
16:00:32 <gibi> kashyap: OK, then I will do the normal agenda and your topic will be part of the Open Discussion
16:00:38 <kashyap> Yes, that's fine.
16:01:06 <bauzas> \o
16:01:10 <gmann> o/
16:01:24 <gibi> bauzas asked for a quick meeting so let's start
16:01:41 <gibi> #topic Bugs (stuck/critical)
16:01:44 <gibi> no critical bug
16:01:55 <gibi> 22 new untriaged bugs (+1 since the last meeting): #link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New
16:02:14 <gibi> any bug we need to talk about today?
16:02:47 <stephenfin> lyarwood spotted a bug earlier this morning. We think it's basically fixed now. Just waiting for the neutron patch to merge (thanks slaweq)
16:03:13 <stephenfin> until that merges though, the gate is stuck (the live-migration job will fail)
16:03:45 <stephenfin> (https://review.opendev.org/c/openstack/neutron/+/798634/ is the fix, btw)
16:04:23 <gibi> stephenfin: thanks
16:04:49 <gibi> any other bug to mention?
16:05:09 <sean-k-mooney> stephenfin: that job should not fail
16:05:27 <sean-k-mooney> we should fall back to the old way without multiple port bindings
16:05:37 <sean-k-mooney> it might fail on stable branches if we have not backported the fix
16:05:45 <stephenfin> maybe not, but it does and I haven't had time to figure out why 🤷
16:06:00 <sean-k-mooney> well we should since that means contrail is broken
16:06:08 <gmann> one test failing is this which is port status gibi mentioned before meeting https://be2e92e10ead782aa651-35e07a4cf42cfaed2fcffa4bf0b16f1b.ssl.cf1.rackcdn.com/794757/9/check/tempest-multinode-full-py3/94d92ea/testr_results.html
16:06:10 <sean-k-mooney> they do not support it.
16:06:20 <gmann> is that same?
16:06:36 <stephenfin> yes, I think so
16:06:52 <gmann> ok
16:07:25 <sean-k-mooney> so we probably want to hold the neutron patch till we figure this out or propose a revert so we can debug
16:07:43 <sean-k-mooney> well the neutron patch is correct
16:07:57 <gibi> if the neutron fix is needed anyhow then I vote for merging it and troubleshoot on a revert if needed
16:07:59 <sean-k-mooney> so probably the latter: propose a revert so we can figure out why it failed
16:08:25 <stephenfin> yeah, what gibi said
16:08:54 <gibi> as we anyhow moved to the gate status lets move there by topic as well
16:08:56 <gibi> #topic Gate status
16:09:00 <gibi> Nova gate bugs #link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure
16:09:02 <sean-k-mooney> a regression of this is effectively a critical nova bug just an fyi
16:09:53 <gibi> please tag bugs with gate-failure so that we can follow them there
16:10:15 <gibi> placement weekly jobs are green #link https://zuul.openstack.org/builds?project=openstack%2Fplacement&pipeline=periodic-weekly
16:10:33 <gibi> anything else on the gate status?
16:11:27 <gibi> #topic Release Planning
16:11:31 <gibi> Milestone 2 is in 3 weeks (15 of July) which is spec freeze
16:11:36 <gibi> Spec review day is 6th of July #link http://lists.openstack.org/pipermail/openstack-discuss/2021-June/023083.html
16:11:58 <gibi> and as you saw on the ML the next PTG time is set to October 18-22
16:12:05 <gibi> and it will be virtual still
16:12:39 <gibi> any other news on the incoming release?
16:12:58 <sean-k-mooney> :( i dont think so
16:13:15 <dansmith> sean-k-mooney: what is the sadface? virtual ptg?
16:13:20 <sean-k-mooney> yes
16:13:21 <bauzas> yup
16:13:24 <bauzas> :( :( :( even
16:13:31 <dansmith> heh, okay
16:14:11 <gibi> we should have a therapeutic session in the PTG
16:14:15 <gibi> anyhow moving on
16:14:17 <gibi> #topic Stable Branches
16:14:22 <gibi> stable gates should be OK (though 'wait_for_volume_resource_status' intermittently fails)
16:14:26 <gibi> EOM (from elodilles )
16:14:42 <gibi> any other stable news?
16:15:04 <elodilles> nothing from me for now
16:15:17 <dansmith> same failure on master a bunch too I think
16:15:37 <dansmith> the cinder peeps were working on it a couple weeks ago
16:15:51 <gibi> yeah I think lyarwood is still trying to see what happens in the guest that prevents detaching a volume
16:16:18 <dansmith> well, the cinder peeps were thinking it was an lvm segv or something
16:16:29 <dansmith> (on the host)
16:16:46 <sean-k-mooney> have we checked that the flavor has at least 2 cores? it's really just a workaround but i think that helped downstream at one point with the guest not responding to the detach
16:17:34 <gibi> dansmith: could be multiple independent failures. I only hit the detach one last week but I have not looked at CI results recently
16:17:37 <sean-k-mooney> i mean 1 should really be enough but sometimes if the guest has 2 cores it will still be able to respond if it's hung on other things
16:18:05 <dansmith> gibi: yeah, they already fixed one thing that manifested in the same way I think, which was specifically timeout related IIRC
16:18:16 <dansmith> but the latest was lvm crashing I think
16:18:16 <dansmith> anyway
16:18:27 <gibi> dansmith: thanks that is good info
16:18:30 <gibi> sean-k-mooney: good idea
16:18:36 <gibi> moving on
16:18:38 <gibi> #topic Sub/related team Highlights
16:18:43 <gibi> bauzas: are you still with us?
16:18:55 <bauzas> yup
16:19:07 <bauzas> nothing to report, sir.
16:19:09 <gibi> then
16:19:10 <gibi> Libvirt (bauzas)
16:19:12 <gibi> ack
16:19:13 <gibi> thanks
16:19:13 <bauzas> this^
16:19:15 <gibi> :)
16:19:23 <gibi> moving on
16:19:23 <gibi> #topic Open discussion
16:19:27 <gibi> (kashyap) seeking approval for the specless bp https://blueprints.launchpad.net/nova/+spec/virtio-as-default-display-device
16:19:34 <kashyap> gibi: So on that:
16:19:43 <gibi> I think this was discussed today on the channel
16:19:47 <kashyap> I was reminded that we can't do the switch in the current devel cycle
16:20:03 <kashyap> But we need some preparatory work for Y release
16:20:22 <kashyap> E.g. recording the video model in system_metadata.  Get the tests sorted, and then do the switch.
16:20:42 <gibi> OK, so then for X you only aim for the recording and testing then switch in Y
16:20:58 <dansmith> why do we need to record the video model
16:20:59 <dansmith> ?
16:21:17 <sean-k-mooney> to prevent it changing for existing vms after upgrade and hard reboot
16:21:18 <gibi> I think it is to avoid changing the ABI exposed to the guest during hard reboot
16:21:28 <kashyap> dansmith: Upthread, sean-k-mooney was saying it might
16:21:29 <sean-k-mooney> yes that ^
16:21:32 <kashyap> But:
16:21:40 <dansmith> is that because we're changing our default?
16:21:44 <dansmith> oh
16:21:45 <kashyap> There won't be any _visible_ breakage here:
16:21:49 <kashyap> dansmith: Yep
16:21:54 <dansmith> I thought the spec was adding it as an option, this is for changing the default, I see
16:21:56 <kashyap> Sorry, I should've given a summary here.
16:22:10 <kashyap> dansmith: The first sentence of the BP says: "Change Nova's default video display from 'cirrus' to 'virtio'." :-)
16:22:11 <sean-k-mooney> dansmith: ya we added virtio i think in train
16:22:47 <sean-k-mooney> so in this specific case it actually might be safe to just make the change
16:23:00 <sean-k-mooney> because of vga fallback mode
16:23:07 <kashyap> Yeah
16:23:08 <dansmith> well, I was going to say, i think we've made such changes in the past after some interval
16:23:09 <sean-k-mooney> but in general for device model default changes
16:23:14 <sean-k-mooney> we would have to record then change
16:23:15 <kashyap> dansmith: To summarize the above:
16:23:44 <kashyap> If your guest has the kernel driver, then "virtio" display dev will make use of it; or else, it'll gracefully fall back to VGA
16:23:49 <sean-k-mooney> dansmith: the only one i can think of was enabling the RNG by default
16:23:55 <kashyap> So that's the recommended option from the QEMU graphics maints
16:24:10 <kashyap> sean-k-mooney: Yep
16:24:14 <dansmith> kashyap: fall back to cirrus?
16:24:33 <kashyap> dansmith: No, no; fall back to "VGA compatibility mode", which is still better than "cirrus"
16:24:34 <sean-k-mooney> dansmith: no the virtio-gpu device support a vga hardware interface
16:24:55 <dansmith> okay
16:24:56 <sean-k-mooney> dansmith: you just won't get all the features but it should function similar to cirrus in the guest
16:24:58 <kashyap> I.e. standard VGA.
16:25:02 <kashyap> Yes
16:25:10 <bauzas> would operators opt into it?
16:25:14 <dansmith> yeah, but that might freak out a windows machine if your display adapter suddenly changes I guess
16:25:39 <bauzas> or would we need to change the default automatically ?
16:25:40 <kashyap> bauzas: No; they can opt out of it here.
16:25:40 <sean-k-mooney> dansmith: on reboot it should be ok but that was the upgrade concern that would prompt record then change next cycle
16:25:58 <bauzas> I understand dansmith's concern about freaking out if done automatically
16:25:59 <kashyap> bauzas: Yes, we should do the right thing here by changing the default.
16:26:06 <sean-k-mooney> kashyap: not for existing instances, you need to use hw_video_model in the image
16:26:10 <kashyap> bauzas: dansmith's good point is for Windows
16:26:31 <kashyap> sean-k-mooney: Right; obviously, the default implies only for the new ones.
16:26:35 <sean-k-mooney> so tl;dr record in X, change in Y?
16:26:53 <dansmith> so the problem is for people who don't have hw_video_model in their image meta right?
16:26:54 <kashyap> sean-k-mooney: But _do_ we need to record at all?  As there's no breakage here
16:27:03 <sean-k-mooney> dansmith: correct
16:27:08 <dansmith> can we just create all new instances with that set to the default if they don't have it in their image?
16:27:17 <dansmith> then we're good for next time too when we switch to whizbang32 video
16:27:28 <bauzas> can't wait for it
16:27:53 <dansmith> compute assumes cirrus if unset forever, otherwise honors what it's set to, and then we can make the switch now for any new instances
16:27:57 <sean-k-mooney> dansmith: not really but we can store our default in the instance_system_metadata
16:28:16 <sean-k-mooney> which is what we are now doing for machine_type as if it was set in the image
16:28:20 <dansmith> sean-k-mooney: not really? we mirror image meta in sysmeta already right? so we'd just be using that instead of a bespoke key?
16:28:44 <sean-k-mooney> dansmith: ya so we can set it in our copy which is what kashyap was going to do
16:28:50 <dansmith> mirror *some* of image_meta I mean
16:29:09 <sean-k-mooney> we just can't set it in glance unless we document using the glance import plugin to set it on all uploaded images
16:29:13 <dansmith> ack, okay, then we don't need a warning cycle to switch the default if we do it that way
16:29:21 <dansmith> sean-k-mooney: right I'm talking about our local copy (of course)
16:29:41 <sean-k-mooney> dansmith: so unless we backport the recording of the current value we would still need one cycle
16:29:53 <dansmith> why?
16:30:05 <sean-k-mooney> to populate the instance_metadata_table for existing instances
16:30:08 <kashyap> sean-k-mooney: Yeah, why?  I still don't see it.
16:30:17 <dansmith> no, we just assume cirrus forever if unset
16:30:30 <sean-k-mooney> oh
16:30:43 <sean-k-mooney> that just means dont change the default
16:30:51 <dansmith> past the virtio default, it'll always be set to something, so if set, honor that, else cirrus (but just on the compute).. new instances always get virtio set explicitly by default on create
16:30:54 <kashyap> dansmith: So we can even directly change w/o even recording it in system_metadata, as we did for virtio-rng (I'll get the commit later for you to read)
16:31:31 <sean-k-mooney> dansmith: that will complicate the initial spawn logic and possibly hard reboot
16:31:31 <dansmith> kashyap: okay not sure how, but happy to look
16:31:49 <sean-k-mooney> it might be doable but we reuse spawn in hard reboot
16:31:54 * bauzas needs to disappear
16:31:58 <dansmith> sean-k-mooney: just spawn, AFAIK, which seems fine as we record other such things IIRC, but whatever
16:32:03 <kashyap> dansmith: https://opendev.org/openstack/nova/commit/de512f2c025
16:32:04 <sean-k-mooney> so we will need to tell the difference between first boot and subsequent ones
16:32:08 <dansmith> just trying to avoid needing a cycle to change *and* annotate all existing instances
16:32:14 <kashyap> (It's slow to load)
16:32:41 <sean-k-mooney> dansmith: i guess we could try and implement that and see what it looks like
16:32:54 <dansmith> we can talk outside the meeting about it
16:32:57 <kashyap> Yeah
16:33:06 <kashyap> Thanks for the design discussion so far!
16:33:17 <kashyap> gibi: Any other topics?  We can hash it outside of the meeting
16:33:17 <gibi> OK. then I hold on approving the bp until you agree on the way forward
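The "honor if set, else cirrus" idea from the discussion above can be sketched roughly as follows. This is only an illustration of the logic dansmith describes, not Nova code: the key name `image_hw_video_model` and both helper functions are assumptions made for the example.

```python
# Hypothetical sketch of the proposal discussed above; not actual Nova code.
# The key name and both helpers are assumptions made for illustration.
LEGACY_DEFAULT = "cirrus"  # assumed for instances created before the switch
NEW_DEFAULT = "virtio"     # recorded explicitly for instances created after it
KEY = "image_hw_video_model"  # assumed key in the instance's local image-meta copy


def record_on_create(system_metadata, image_properties):
    """Return a copy of system_metadata with the video model pinned at create time.

    If the image sets hw_video_model, that value is kept; otherwise the new
    default is written explicitly so later default changes do not affect
    this instance.
    """
    pinned = dict(system_metadata)
    pinned[KEY] = image_properties.get("hw_video_model", NEW_DEFAULT)
    return pinned


def effective_video_model(system_metadata):
    """Pick the video model on spawn or hard reboot.

    A recorded value is honored; an instance with nothing recorded predates
    the switch and is assumed to have been running with the legacy default.
    """
    return system_metadata.get(KEY, LEGACY_DEFAULT)
```

With this shape, an existing instance that has nothing recorded keeps cirrus across a hard reboot, while new instances get virtio pinned on create, which is why no separate recording cycle would be needed.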
16:33:38 <gibi> sean-k-mooney has one more headsup I think
16:33:44 <gibi> so moving on to that
16:33:45 <gibi> sean-k-mooney:
16:33:56 <sean-k-mooney> yes so ovn migration...
16:34:19 <sean-k-mooney> tl;dr: architecturally there is always a race when doing live migration with ovn
16:34:44 <sean-k-mooney> effectively ovn can only start installing rules when the tap is created on the dest
16:35:00 <sean-k-mooney> and at that point we have called libvirt to do the migration and it's in control
16:35:16 <sean-k-mooney> so to avoid that and create the port in pre-live migration i'm proposing an os-vif change
16:35:51 <sean-k-mooney> basically reintroduce hybrid-plug but with ovs bridges and patch ports instead of linux bridges and veth pairs
16:36:07 <sean-k-mooney> that will not have any performance impact on the vm
16:36:19 <sean-k-mooney> but will allow ovn to install the rules in pre-live migrate
16:36:35 <sean-k-mooney> i was wondering how people felt about that
16:37:13 <gibi> honestly it is too deep networking for me. I assume the impact is mostly in os-vif. Does nova need to be adapted?
16:37:13 <stephenfin> so previously, we had
16:37:22 <stephenfin> (ovs bridge)  veth | <---> | veth (linux bridge) tap | <---> | VM
16:37:32 <stephenfin> and now we'll have
16:37:39 <stephenfin> (ovs bridge)  patch | <---> | patch (ovs bridge) tap | <---> | VM
16:37:56 <stephenfin> so everything stays in OVS but there's an additional (on top of br-int) bridge?
16:38:07 <sean-k-mooney> more like (ovs bridge) tap | <---> | VM originally to (ovs bridge)  patch | <---> | patch (ovs bridge) tap | <---> | VM
16:38:18 <sean-k-mooney> yes
16:38:28 <sean-k-mooney> this is the poc but it has a bug (ovs bridge)  patch | <---> | patch (ovs bridge) tap | <---> | VM
16:38:34 <sean-k-mooney> https://review.opendev.org/c/openstack/os-vif/+/798055
16:39:00 <sean-k-mooney> currently it's configurable and defaulting to true for development
16:39:08 <stephenfin> do we need to worry about flows getting added for the patch <-> tap in the second (new) bridge?
16:39:13 <stephenfin> or does that happen automatically?
16:39:24 <sean-k-mooney> stephenfin: just the normal action
16:39:33 <sean-k-mooney> so no rules required
16:39:50 <sean-k-mooney> on the neutron side if we wanted to proceed there would need to be some qos changes for ovn
16:39:57 <sean-k-mooney> so that will be covered by a spec
16:40:24 <sean-k-mooney> if we are ok with this on the nova side i would like to track the capablity as a bug against os-vif
16:40:48 <sean-k-mooney> so we can backport the ability to opt in to this behavior but not use it by default for stable branches
16:40:59 <stephenfin> excellent, so we'll pre-populate a flow in the br-int for the new patch port, and then the comms from the other side of the patch port to the VM don't need anything explicit bar the normal action
16:41:07 <stephenfin> that wfm, personally
16:41:26 <stephenfin> certainly seems better than re-adding hybrid plug with the OVS -> linux bridge -> VM dance
16:41:47 <sean-k-mooney> i guess my main question is bug, blueprint or spec for this
16:43:01 <stephenfin> I would like to see some high level docs on this _somewhere_
16:43:02 <gibi> hm, if this requires a neutron spec, then why do you need to backport the os-vif change to stable?
16:43:12 <sean-k-mooney> personally i would prefer to let this bake for a cycle and enable it by default next cycle
16:43:23 <sean-k-mooney> gibi: the neutron spec is to fix QOS support
16:43:29 <stephenfin> it could be a blueprint but docs in the neutron tree might be better
16:43:31 * gibi is slow
16:43:37 <gibi> sean-k-mooney: ahh OK I see
16:43:40 <sean-k-mooney> it would be useful for those that dont need qos without that
16:43:48 <gibi> yepp now I got it
16:44:07 <gibi> this is a bugfix for os-vif to support live migration with OVN
16:44:23 <gibi> or more precisely fix a race in live migration
16:44:28 <gibi> I can live with this as a bugfix
16:44:37 <sean-k-mooney> yes basically
16:45:13 <sean-k-mooney> and that's also why we would default this to off initially and then enable it by default in the future
16:45:30 <gibi> any objection?
16:45:39 <sean-k-mooney> operators can opt in early if they want but not change any behavior by default
16:46:16 <stephenfin> I'm good. Can't speak for others tho
16:46:27 <gibi> I don't see any hands raised :)
16:46:33 <sean-k-mooney> we can defer if people want to think about it more
16:46:38 <sean-k-mooney> i'm still working on the poc
16:47:09 <sean-k-mooney> my main concern is m2 and spec freeze
16:47:14 <gibi> it is accepted as a bug now, here. If somebody later has an objection then we can rediscuss but until then this is a bug
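For readers unfamiliar with the topology discussed above, here is a rough sketch of the commands that would wire an extra OVS bridge to the integration bridge with a patch-port pair. It is not the os-vif proof of concept: the bridge name, the patch-port naming scheme and the helper function are all hypothetical, and the function only builds `ovs-vsctl` argument lists without running them.

```python
# Hypothetical sketch; this is NOT the os-vif PoC. The per-VM bridge name,
# the patch-port names and this helper are assumptions for illustration.
def patch_port_plug_commands(vm_bridge, tap, integration_bridge="br-int"):
    """Build ovs-vsctl commands that link vm_bridge to the integration bridge.

    Resulting layout, matching the diagram in the discussion:
    (ovs bridge) patch | <---> | patch (ovs bridge) tap | <---> | VM
    The tap itself is not created here; it would be attached to vm_bridge
    later, when the guest is started on this host.
    """
    int_side = f"patch-int-{tap}"  # assumed name of the port on the integration bridge
    vm_side = f"patch-vm-{tap}"    # assumed name of the port on the per-VM bridge
    return [
        ["ovs-vsctl", "--may-exist", "add-br", vm_bridge],
        ["ovs-vsctl", "--may-exist", "add-port", integration_bridge, int_side,
         "--", "set", "Interface", int_side,
         "type=patch", f"options:peer={vm_side}"],
        ["ovs-vsctl", "--may-exist", "add-port", vm_bridge, vm_side,
         "--", "set", "Interface", vm_side,
         "type=patch", f"options:peer={int_side}"],
    ]
```

Because the port on the integration bridge can exist before the tap does, the rules for it could in principle be installed during pre-live-migration, which is the point of the proposal as described in the meeting.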
16:47:34 <gibi> Is there any other topic for today?
16:48:31 <sean-k-mooney> not from me
16:49:00 <gibi> then let's close this
16:49:04 <gibi> thanks for joining
16:49:07 <gibi> #endmeeting