14:00:05 <mriedem> #startmeeting nova
14:00:06 <openstack> Meeting started Thu May 18 14:00:05 2017 UTC and is due to finish in 60 minutes.  The chair is mriedem. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:09 <openstack> The meeting name has been set to 'nova'
14:00:28 <efried> o/
14:00:29 <takashin> o/
14:00:31 <lyarwood> o/
14:00:33 <mlakat> o/
14:00:35 <dgonzalez> o/
14:00:39 <dansmith> o/
14:00:41 <mdbooth> o/
14:00:56 <gibi> o/
14:01:21 <mriedem> alright then
14:01:26 <mriedem> #link agenda https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting
14:01:35 <mriedem> #topic release news
14:01:39 <bauzas> \o
14:01:44 <mriedem> #link Pike release schedule: https://wiki.openstack.org/wiki/Nova/Pike_Release_Schedule
14:02:01 <mriedem> #info Next upcoming milestone: Jun 8: p-2 milestone (3 weeks)
14:02:10 * johnthetubaguy sneaks in late
14:02:12 <mriedem> #info Blueprints: 70 targeted, 66 approved, 12 completed, 6 not started
14:02:33 <mriedem> anything not started by p-3 we'll likely defer
14:02:48 <mriedem> started or blocked i should say
14:03:06 <mriedem> as there are some blueprints that are blocked on needing changes / owners in other projects, like ironic
14:03:16 <mriedem> questions about the release?
14:03:41 <mriedem> #topic bugs
14:03:52 <mriedem> no critical bugs
14:04:08 <mriedem> #help Need help with bug triage; there are 94 new untriaged bugs as of today (May 18)
14:04:21 <mriedem> #link check queue gate status http://status.openstack.org/elastic-recheck/index.html
14:04:27 <mriedem> things are ok'ish
14:04:40 <mriedem> the gate-grenade-dsvm-neutron-multinode-live-migration-nv job on master is 100% fail
14:05:16 <mriedem> i think it's because the ocata side is not running systemd and the new side is trying to stop services the systemd way, but those units aren't running under systemd so it fails
14:05:49 <mriedem> sdague: it was likely due to https://review.openstack.org/#/c/465766/
14:05:50 <patchbot> patch 465766 - nova (stable/ocata) - Use systemctl to restart services
14:05:53 <mriedem> which didn't take grenade into account
14:06:02 <mriedem> oops i mean https://review.openstack.org/#/c/461803/
14:06:03 <patchbot> patch 461803 - nova - Use systemctl to restart services (MERGED)
14:06:26 <mriedem> i don't have any news for 3rd party ci
14:06:35 <mriedem> #topic reminders
14:06:41 <mriedem> #link Pike Review Priorities etherpad: https://etherpad.openstack.org/p/pike-nova-priorities-tracking
14:06:45 <johnthetubaguy> oh, right, we don't run nova grenade on the devstack gate I guess?
14:07:01 <mriedem> does not compute
14:07:11 <mriedem> the grenade live migration job is non-voting
14:07:18 <mriedem> so it doesn't run in the gate queue
14:07:28 <mriedem> and is restricted to i think nova, and maybe tempest experimental, not sure
14:07:42 <sdague> mriedem: it was, mostly because I didn't realize the script was used that way
14:08:00 <mriedem> sdague: i forgot about it too
14:08:11 <mriedem> one more reminder,
14:08:13 <mriedem> If you led sessions at the Forum, it would be good to provide summaries in the mailing list.
14:08:36 <mriedem> #topic stable branch status
14:08:40 <mriedem> stable/ocata: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/ocata,n,z
14:09:06 <mriedem> stable/newton: https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:stable/newton,n,z
14:09:11 <mriedem> #link We have a few bugs which were regressions in Newton that we need to get fixed on master and backported: https://review.openstack.org/#/c/465042/ https://bugs.launchpad.net/nova/+bug/1658070 https://review.openstack.org/#/c/464088/
14:09:13 <openstack> Launchpad bug 1658070 in OpenStack Compute (nova) "Failed SR_IOV evacuation with host" [High,In progress] - Assigned to Eli Qiao (taget-9)
14:09:13 <patchbot> patch 465042 - nova - Cache database and message queue connection objects
14:09:13 <patchbot> patch 464088 - nova - Handle special characters in database connection U...
14:09:19 <lyarwood> ^ core reviews on stable/ocata would be helpful if anyone has time
14:09:51 <mriedem> we're starting to see newton bugs rolling in,
14:09:55 <mriedem> because people are just upgrading to newton now
14:10:37 <mriedem> #topic subteam highlights
14:10:43 <mriedem> Cells v2 (dansmith)
14:10:48 <dansmith> so,
14:11:16 <dansmith> we talked about the outstanding patch sets we have up. things like the quotas set
14:11:20 <dansmith> and we covered some recent bugs that cropped up, which I think we mostly have fixes for
14:11:39 <dansmith> including one where we, uh, press the limits of the network stack by trying to, uh, use all the connections
14:12:03 <dansmith> and I said I would send a summary of the cellsv2 session
14:12:08 <dansmith> which I totally might do
14:12:19 <dansmith> I think that's it. maybe/
14:12:28 <mriedem> and i'm back
14:12:45 <dansmith> mriedem: I'm done. I said amazing things. they were yuuge.
14:12:50 <mriedem> ok so we need https://review.openstack.org/#/c/465042/
14:12:51 <patchbot> patch 465042 - nova - Cache database and message queue connection objects
14:12:54 <mriedem> and start working on backports
14:13:16 <dansmith> yup
14:13:26 <mriedem> bauzas: scheduler
14:13:49 <johnthetubaguy> oh, whoops, I could see that bug being an issue
14:13:53 <bauzas> no huge discussion for that last meeting, apart from knowing which specs to review
14:14:19 <bauzas> I explained to edleafe that I was working on my series for the scheduler claims
14:14:39 <bauzas> and we did discuss how Jay is very good at presentations :p
14:14:45 <bauzas> that's it, small one.
14:15:03 <mriedem> ok, i don't know if there was an api meeting yesterday
14:15:08 <mriedem> since alex_xu is at the bug smash
14:15:20 <mriedem> johnthetubaguy: did you attend the api meeting?
14:15:29 <johnthetubaguy> ah, I did
14:15:37 <johnthetubaguy> we had a nice chat about the policy docs that are up for review
14:15:46 <johnthetubaguy> and the api-extension tidy up stuff
14:16:10 <johnthetubaguy> the keystone policy meeting is starting to discuss how to make wider progress on the per-project admin thingy
14:16:26 <johnthetubaguy> #link https://review.openstack.org/#/q/topic:bp/policy-docs+status:open+project:+openstack/nova
14:16:46 <johnthetubaguy> #link https://review.openstack.org/#/q/status:open+project:openstack/nova+branch:master+topic:bp/api-no-more-extensions-pike
14:16:50 <johnthetubaguy> that's all folks
14:16:57 <mriedem> ok thanks
14:17:04 <mriedem> gibi: notifications
14:17:09 <gibi> focus is on searchlight notification additions:
14:17:16 <gibi> #link https://review.openstack.org/#/q/topic:bp/additional-notification-fields-for-searchlight+status:open
14:17:22 <gibi> and on 3 selected notification transformation patches:
14:17:28 <gibi> #link https://review.openstack.org/#/c/396225/ and the series starting at #link https://review.openstack.org/#/c/396210
14:17:28 <patchbot> patch 396225 - nova - Transform instance.trigger_crash_dump notification
14:17:30 <patchbot> patch 396210 - nova - Transform aggregate.add_host notification
14:17:37 <gibi> the next subteam meeting will be held on 6th of June as I will be on vacation for the next two weeks
14:17:46 <gibi> that is all
14:18:17 <mriedem> k
14:18:20 <mriedem> efried: powervm
14:18:27 <efried> No change since last time.
14:18:33 <efried> We're really at a point where we just need the reviews.
14:18:39 <efried> So removing powervm from "subteams" section.
14:18:43 <mriedem> ok
14:19:05 <mriedem> the cinder new volume attach flow stuff is in the same boat
14:19:12 <mriedem> there is a series starting at https://review.openstack.org/#/c/456896/ with +2s
14:19:12 <patchbot> patch 456896 - nova - Add Cinder v3 detach call to _terminate_volume_con...
14:19:32 <mriedem> john and i have been helping on those but since i'm pushing some of the changes we need a 3rd core
14:19:51 <mriedem> that's it for subteam stuff
14:19:56 <mriedem> #topic stuck reviews
14:20:00 <mriedem> nothing in the agenda
14:20:05 <mriedem> anyone have anything they want to bring up here?
14:20:25 <mriedem> #topic open discussion
14:20:29 <mriedem> there is one item
14:20:34 <mlakat> Yep.
14:20:40 <mriedem> (mlakat) We'd like to get some advice on: What is the standard OpenStack way of running something (in this case an os.path.getsize()) with a timeout? This came up as part of: https://bugs.launchpad.net/nova/+bug/1691131 Basically what happened is that a broken NFS connection might block os.path.getsize(path) forever. Approaches tried:
14:20:41 <openstack> Launchpad bug 1691131 in OpenStack Compute (nova) "IO stuck causes nova compute agent outage" [Undecided,In progress] - Assigned to Daniel Gonzalez Nothnagel (dgonzalez)
14:20:47 <mriedem> 1) putting it on a separate green thread: Still blocks the main green thread forever
14:20:53 <mriedem> 2) putting it on a separate Thread: We have no way of killing threads nicely in python
14:21:02 <mriedem> 3) use multiprocess module: seems to work, but I am a bit worried about the overhead of it
14:21:13 <mriedem> * Someone must have solved this already in OpenStack
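For reference, a minimal sketch of approach 3 above (the multiprocess module), assuming a plain multiprocessing child is acceptable; the helper name and timeout are illustrative and this is not the patch under review:

    import multiprocessing
    import os


    def _getsize(path, queue):
        # Runs in a child process; blocks in uninterruptible sleep if the
        # NFS server behind 'path' has gone away.
        queue.put(os.path.getsize(path))


    def getsize_with_timeout(path, timeout=10):
        # Hypothetical helper: confine the potentially hanging stat to a
        # child process so only the disposable child gets wedged, not the
        # calling service.
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=_getsize, args=(path, queue))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # The child is probably stuck in D-state; terminate() may not
            # reap it, but the parent can carry on and report the failure.
            proc.terminate()
            raise IOError('getsize(%s) timed out after %ss' % (path, timeout))
        return queue.get(timeout=1)

The overhead mlakat worries about is roughly one child process per call, which is part of why the discussion below leans towards handling the hang outside nova instead.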
14:21:22 <mdbooth> mlakat: I commented in the review: I think you need to address that problem at a different level.
14:21:29 <johnthetubaguy> don't we dispatch in the libvirt driver already?
14:21:52 <mdbooth> Given that literally everything may be on NFS, if we've got fundamental issues like that, we can't use that technique everywhere.
14:21:59 <mriedem> right, i was going to say,
14:22:04 <mriedem> this seems like a giant game of whack a mole
14:22:11 <dgonzalez> mdbooth: That's why we put the issue on the agenda; it would be nice to get feedback and maybe some ideas on how this can be solved the proper way
14:22:15 <johnthetubaguy> mdbooth: that came up in channel yesterday, a related thingy
14:22:20 <mkoderer> for me it's a design issue - the heartbeat should never stop even if there is a hanging fs
14:22:43 <johnthetubaguy> but if the thread is hanging, nothing will happen, so the heartbeat dying is actually useful?
14:22:48 <mdbooth> FWIW, NFS in Linux has been a pita as long as I've been using it
14:23:12 <mriedem> i'm not sure what 'heartbeat' we're talking about
14:23:17 <mdbooth> If the server goes away, you end up with a process stuck in uninterruptible sleep
14:23:25 <mdbooth> And that's, well, uninterruptible
14:23:25 <mriedem> the update_available_resource periodic task in the compute?
14:23:25 <mkoderer> the thing was we had an overload in the cloud and weren't able to stop any vm in this case
14:23:27 <lyarwood> D state ftw
14:23:28 <johnthetubaguy> heartbeat = service up check (in DB)?
14:23:32 <mdbooth> lyarwood: Indeed
14:23:45 <mdbooth> I think this is best solved with monitoring tools
14:24:03 <mlakat> So how shall we handle/recover situations where we have an I/O blocked NFS share?
14:24:07 <mdbooth> We should blow the mount out of the water (is that possible now, never used to be) if it hangs
14:24:09 <mkoderer> mdbooth: we have it in the monitoring
14:24:16 <dgonzalez> mriedem: yes its the update_available_resource task
14:24:19 <johnthetubaguy> did os-privsep make this worse by accident, out of interest?
14:24:21 <mkoderer> but we aren't able to stop the vms producing the overload
14:24:31 <mriedem> johnthetubaguy: i don't think it has anything to do with that
14:24:43 <johnthetubaguy> oh, this is a regular python call, just on an NFS filesystem
14:24:47 <mriedem> yes
14:24:50 <mlakat> yes, a stat call
14:24:52 <mlakat> in the end
14:24:55 <mdbooth> uninterruptible sleep is a Linux kernel process state
14:24:56 <mlakat> which blocks.
14:25:00 <mdbooth> This isn't a python issue
14:25:02 <dgonzalez> the moment os.path.getsize is called on the hanging nfs, the task gets stuck
14:25:07 * johnthetubaguy nodes at mlakat in a yes, and a hello nice to see you sense
14:25:17 <johnthetubaguy> heh nods
14:25:29 * mlakat nods back :-)
14:25:44 <johnthetubaguy> mdbooth: agreed
14:26:02 <mkoderer> it's a deployment issue
14:26:19 <lyarwood> so if this stat was threaded and timed out what then? the instances are still marked as running while they are actually in D-state right?
14:26:19 <mriedem> mkoderer: when you say you can't stop any vm in the cloud, you mean just any vm on this compute host right?
14:26:20 <johnthetubaguy> I could have sworn we fixed this somewhere already, for something... like in disk cache or something
14:26:33 <mriedem> or does this lock every vm on every compute running on the same nfs share?
14:26:46 <mkoderer> mriedem: the NFS is mounted on several compute hosts
14:27:07 <mkoderer> so basically we lost control of a big portion of the cloud
14:27:14 <mriedem> i think i would reconsider using NFS
14:27:20 <mdbooth> mriedem: +1
14:27:42 <mkoderer> so the message is: don't use NFS?
14:27:47 <mkoderer> and don't do any live migration?
14:27:50 <mkoderer> really?
14:27:58 <johnthetubaguy> use iSCSI or something like that instead, or ceph
14:27:59 <mriedem> we can't thread out every os path python call in the libvirt driver
14:28:14 <mriedem> just because NFS can lock up your entire cloud
14:28:29 <mdbooth> mriedem: It's not just os.path
14:28:38 <mriedem> the message is, use monitoring tools to detect the issue, and then blow up the stuck mount
14:28:44 <mdbooth> It's presumably literally anything which touches the filesystem
14:28:50 <mdbooth> And they're just hitting it here first
14:29:18 <mriedem> ok
14:29:22 <johnthetubaguy> just to get my head straight, the problem is the system is broken, NFS has locked up, but we look bad because we also lock up and the computes appear dead?
14:29:50 <mdbooth> johnthetubaguy: I'd say so
14:29:54 <mkoderer> yeah
14:29:56 <dgonzalez> yes
14:29:56 <mlakat> johnthetubaguy, yes, and we have no way of stopping instances causing the high workload
14:30:01 <johnthetubaguy> sounds like we are doing the correct thing
14:30:16 <mriedem> mlakat: you can't kill them via virsh?
14:30:30 <mkoderer> no I would like a design where deletion is also possible in case of an overload
14:30:54 <johnthetubaguy> its a good question, does virsh still work for the delete?
14:31:06 <mlakat> mriedem, we are still looking at what virsh's reaction is in this situation; we are in the process of testing it.
14:31:25 <dgonzalez> I am currently testing this, and it seems to be hanging too...
14:31:41 <mlakat> Assume virsh is not cooperating, how would you then "kill" the busy domain?
14:31:45 <mkoderer> ok that's tricky
14:32:01 <johnthetubaguy> we only really talk via libvirt, so game over I think
14:32:04 <mlakat> mriedem, you said to blow the mountpoint?
14:32:15 <bswartz> does qemu have a builtin userspace NFS client?
14:32:17 <johnthetubaguy> this NFS isn't mounted by Nova I guess?
14:32:33 <mdbooth> johnthetubaguy: It would have been, actually
14:32:33 <johnthetubaguy> like you can't do a force cinder detach at this point?
14:32:38 <mdbooth> It's a volume
14:32:48 <johnthetubaguy> ah, I was assuming this was the whole instances dir
14:32:55 <mdbooth> Although it could equally be shared instance dir
14:33:17 <mdbooth> However, force unmount of nfs is apparently *still* not possible robustly in Linux
14:33:31 <johnthetubaguy> dang
14:33:40 <dgonzalez> for us it's currently only volumes that are affected. But all VMs that have a volume from the affected NFS can't be deleted
14:34:54 <bswartz> if things getting stuck on NFS is a big problem then it's worth considering NFS soft mounts instead of hard mounts, but I worry that soft mounts create other problems which might be worse
14:35:15 <mdbooth> Yep, soft mounts prevent hangs
14:35:26 <mdbooth> However, they cause IO errors during transient load spikes
14:35:35 <mdbooth> So...
14:35:39 <bswartz> yeah and that's probably an even worse situation
14:35:45 <mdbooth> NFS is what it is
14:35:45 <mkoderer> yep
14:36:11 <mkoderer> ok seems it's not solvable in nova then.. :(
14:36:22 <mriedem> so i'm not sure we're going to come to any conclusion here - i don't really think this is nova's problem to solve, honestly
14:36:23 <mdbooth> I think Nova is the wrong place to address it, unfortunately
14:36:38 <dgonzalez> yeah I agree with you...
14:37:11 <mriedem> even if you thread things out and log an error on timeout, you have to address and resolve the underlying issue at some point outside of nova most likely
14:37:20 <mriedem> by cleaning up the locked mount i guess
14:37:45 <mlakat> sounds like a monitoring agent, or something like that?
14:37:47 <mdbooth> mriedem: Which can sometimes only be achieved with a reboot
14:37:48 <mriedem> threading out anything that touches the filesystem is going to be a mess, unless you're talking about monkey patching the os module or something
14:38:07 <johnthetubaguy> sounds like some stat call to each NFS mount in some monitoring loop would catch this OK?
14:38:10 <mlakat> mriedem, threads are not an option
14:38:33 <mlakat> johnthetubaguy, the action is still not clear in that case.
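A rough sketch of the kind of out-of-band check johnthetubaguy describes, assuming a standalone watchdog outside nova; the mount list and the alerting hook are hypothetical placeholders:

    import multiprocessing
    import os


    def _probe(path):
        # os.stat() against a dead NFS server hangs in uninterruptible sleep.
        os.stat(path)


    def mount_is_responsive(path, timeout=10):
        proc = multiprocessing.Process(target=_probe, args=(path,))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            # The probe child may stay stuck in D-state, but the watchdog
            # itself keeps running and can raise an alert.
            proc.terminate()
            return False
        return proc.exitcode == 0


    if __name__ == '__main__':
        for mount in ['/var/lib/nova/mnt/example']:  # hypothetical mount list
            if not mount_is_responsive(mount):
                print('NFS mount appears hung: %s' % mount)  # alert/escalate here

As mlakat notes, this only detects the hang; what action to take on the stuck mount is still up to the operator.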
14:39:02 <mriedem> anyway, i'm -1 to what's proposed in the change, and i think we've spent enough time on it in this meeting
14:39:12 <mriedem> we can move to -nova if you want to discuss further
14:39:16 <mriedem> but let's end this meeting
14:39:17 <mlakat> Thank you for your time.
14:39:20 <mriedem> np
14:39:22 <johnthetubaguy> mlakat: true
14:39:22 <mkoderer> thx
14:39:30 <mriedem> #endmeeting