17:01:43 #startmeeting Octavia
17:01:44 Meeting started Wed Sep 27 17:01:43 2017 UTC and is due to finish in 60 minutes. The chair is johnsom. Information about MeetBot at http://wiki.debian.org/MeetBot.
17:01:46 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
17:01:48 The meeting name has been set to 'octavia'
17:01:52 Hi folks
17:01:52 There we go
17:02:21 Hi johnsom
17:02:28 Sorry, talking with the nova guys about the 404 I just reproduced locally. The NIC is not showing up in the instance
17:02:44 o/
17:02:47 #topic Announcements
17:03:02 Just a heads up, Zuul v3 is rolling out.
17:03:25 This will likely mean some turbulence in our gates for a bit.
17:03:50 #link https://docs.openstack.org/infra/manual/zuulv3.html
17:04:04 I thought it did already
17:04:06 That is the link for more details on Zuul v3.
17:04:29 It's been dragging out as they keep finding bugs. As of last night it was still not fully deployed
17:04:32 Seemed like the pep8 job I looked at the other day was run with v3
17:05:06 Quote: "We've pretty much run out of daylight though for the majority of the team and there is a tricky zuul-cloner related issue to deal with, so we're not going to push things further tonight. We're leaving most of today's work in place, having gotten far enough that we feel comfortable not rolling back."
17:05:29 From Monty's email last night
17:05:53 Anyway, just a heads up that this is happening and may impact us.
17:06:12 In the end it will be a good thing, as it will be much easier to update our gates
17:06:26 Any other announcements today?
17:06:56 #topic Meeting time revisit
17:07:15 I have gotten a lot of feedback that the new meeting time is not working out.
17:07:31 I'm dying right now
17:07:42 Some active contributors cannot make this time, plus that ^^^ ha
17:07:57 It's too early for me to read words
17:08:16 So, due to multiple requests, I have put up another doodle to re-evaluate the time for the meetings
17:08:24 #link https://doodle.com/poll/p65x9xxkec52ecaw
17:08:52 I have put two times in, but we can add others. I also picked the same day, but we have flexibility there too
17:09:30 Recently the TC approved that we can host our meetings in the lbaas channel, so we are no longer stuck with what slots are available on the main meeting channels
17:10:32 Any questions/comments about meeting times? Other proposals?
17:10:33 When are we starting to use the lbaas channel?
17:11:05 I will send out an e-mail, and we will have at least one more meeting at this time/channel where I will announce it.
17:11:24 Sounds good
17:12:11 #topic Brief progress reports / bugs needing review
17:12:17 Ok, how are things going?
17:12:49 tongl Is this patch still being worked on? https://review.openstack.org/#/c/323645/
17:12:49 patch 323645 - neutron-lbaas - Add status in VMware driver
17:12:52 Couple of patches waiting to go in that are the results of the PTG
17:12:54 It has a -1 comment
17:13:35 Yeah, we have a bunch of patches with one +2 on them
17:13:42 https://review.openstack.org/#/q/(project:openstack/octavia+OR+project:openstack/octavia-dashboard+OR+project:openstack/python-octaviaclient+OR+project:openstack/octavia-tempest-plugin)+AND+status:open+AND+NOT+label:Code-Review%253C0+AND+NOT+label:Verified%253C%253D0+AND+NOT+label:Workflow%253C0
17:13:46 Oops
17:13:47 #link https://review.openstack.org/#/q/(project:openstack/octavia+OR+project:openstack/octavia-dashboard+OR+project:openstack/python-octaviaclient+OR+project:openstack/octavia-tempest-plugin)+AND+status:open+AND+NOT+label:Code-Review%253C0+AND+NOT+label:Verified%253C%253D0+AND+NOT+label:Workflow%253C0
17:13:58 johnsom: Let me have a look at it to resolve the comments. It is the nsxv driver.
17:14:03 Some great stuff coming in fixing octavia-dashboard issues
17:14:39 Nice
17:14:56 I have been trying to catch up on patch reviews. A number have failed when I go to test them out.
17:15:06 If you have open patches, check to see if I have commented.
17:15:45 Any other patches/bugs to discuss today?
17:16:17 https://review.openstack.org/#/c/486499
17:16:18 patch 486499 - octavia - Add flavor, flavor_profile table and their APIs
17:16:56 I recently submitted my first patch to octavia; one gate job is failing, but it seems unrelated to the code
17:17:08 Please submit your reviews
17:17:14 Looks like that gate failure was the OVH bug with qemu crashing
17:17:32 Yeah, that one is an infra host issue and not your code.
17:17:39 #link http://logs.openstack.org/99/486499/13/check/gate-octavia-v1-dsvm-py3x-scenario-multinode/68dec49/logs/libvirt/qemu/instance-00000002.txt.gz
17:17:46 cirros doesn't even boot there
17:18:04 Ok, thanks :)
17:18:21 Cool, glad to see that is ready for review!
17:18:23 Thanks
17:18:25 Is this the flavor support we discussed during the PTG?
17:19:02 Implementation of https://review.openstack.org/#/c/392485/
17:19:02 patch 392485 - octavia - Spec detailing Octavia service flavors support (MERGED)
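A rough illustration of the flavor model described in the spec above: a flavor points at a flavor_profile, which names a provider and carries an opaque metadata blob for that provider to validate. The field names and values here are illustrative assumptions only, not the final schema.

    # Hypothetical example data for the flavor / flavor_profile concept.
    flavor_profile = {
        'id': 'd21bf20d-0000-0000-0000-000000000000',  # placeholder UUID
        'name': 'single-small',
        'provider_name': 'octavia',  # the driver that understands the metadata
        'flavor_data': '{"topology": "SINGLE", "compute_flavor": "m1.small"}',
    }

    flavor = {
        'id': '5e9a8a1c-0000-0000-0000-000000000000',  # placeholder UUID
        'name': 'basic',
        'description': 'A single, small load balancer',
        'enabled': True,
        'flavor_profile_id': flavor_profile['id'],
    }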
17:20:34 I heard from jniesz that someone is working on provider support
17:20:52 I would like to help there too if any help is needed
17:21:01 I think some folks are working on writing up the spec.
17:21:25 longstaff Do you have an update on how that is going?
17:21:45 We've been delayed a bit but will be working on it next week
17:22:37 johnsom: should we wait for that spec to come up, or should we add flavor_profile metadata to the current octavia handler?
17:22:48 Ok, feel free to post what you have and let some of us hack on it too....
17:23:12 I did the same in https://review.openstack.org/#/c/484325/
17:23:12 patch 484325 - octavia - [WIP] Add provider Implementation
17:23:33 pksingh Well, we know it will change, but that's fine as long as we are ok with re-working it
17:23:45 It might flush out any issues we missed, etc.
17:24:24 Ok
17:24:33 The spec says it's dependent on providers, is that not really the case?
17:25:18 Yes, it depends on the providers for validating the metadata part of the flavor_profile
17:25:25 I have left that step out
17:25:42 It is, but the octavia driver handler is kind of like a provider (it will need to be moved over)
17:26:29 johnsom: can I move ahead with treating the handler as a provider?
17:26:50 Well, the interface is totally going to change when we do providers
17:28:20 Ok, then I will wait for the provider spec to be merged
17:28:28 I would not spend too much time on the octavia handlers until we get farther with the provider spec
17:29:39 Ok, so I will work on reviewing the flavors work. Thanks!
17:29:49 Thanks
17:29:52 #topic Open Discussion
17:29:56 Other topics today?
17:30:42 Admin API stuff
17:31:22 I have a couple of Admin API type things that I'm going to look at tackling very soon (like, starting today or tomorrow probably)
17:31:23 Not sure if we want specs or if I should just show up with code
17:31:43 The Amphora Info endpoint is up and ready to merge: https://review.openstack.org/#/c/505404/
17:31:44 patch 505404 - octavia - Add admin endpoint for amphora info
17:31:49 The next couple I want to do are:
17:32:24 1) A patch to clear out the spares pool, so when we push a new image we can get rid of the old spares quickly / easily
17:32:50 Spares pool is pretty straightforward; just an RFE is probably good for that
17:33:37 2) Something to SYNC / retry LBs that are in bad states (ERROR, and possibly PENDING), because I have seen a number of LBs go to these states recently and it is ridiculous that there is no way to get an LB out of ERROR once it goes there
17:33:46 I think there are some things we could at least *try*
17:34:01 Let's talk about that one
17:35:05 Others?
17:35:21 Also, I'm tempted to have Housekeeping pop things from PENDING to ERROR if they've exceeded some timeout since the last update
17:35:48 Because when an LB has been in PENDING_UPDATE for 30 minutes, it's obviously stuck
17:35:56 And that state is immutable
17:35:56 We have been hit with a couple of LBs going into ERROR, like when an amphora fails to get DHCP on boot
17:36:18 So, I take it there are no others.
17:36:22 Yeah, I think possibly the correct approach for that is just to trigger failovers
17:36:35 johnsom: not off the top of my head yet
17:37:06 Also, when an LB is in ERROR, how can we be sure it cleans up all resources?
17:37:13 So, have you run to ground how these are getting stuck in PENDING_UPDATE? That should not be happening. Is it the controller process being restarted?
17:37:21 So in that case, should the "failover API call" just ... be the "sync", or should there be more logic?
17:37:37 johnsom: in at least one case I've seen, yeah, the worker restarts and leaves it hanging
17:38:22 Yeah, ok, so the whole job board thing. Did we break the graceful shutdown of the process, or is this a host-failure situation?
17:39:22 jniesz: yeah, what I am thinking is we add an "attempt cleanup" method that tries to intelligently remove every piece (starting with VMs, then ports, then SGs), assuming it'll see a lot of 404s, and when it seems like everything is good, then do the failover path
17:39:23 Or just fix the failover path to accept 404s in more places
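A minimal sketch of the "attempt cleanup" idea just described: sweep the pieces in order (VMs, then ports, then security groups) and treat 404s as success before running the failover or final delete path. The client helpers and attribute names are hypothetical stand-ins, not actual Octavia code.

    class NotFound(Exception):
        """Stand-in for the 404 exception a real client would raise."""


    def attempt_cleanup(lb, compute, network):
        """Best-effort removal of everything attached to a load balancer."""
        # Compute instances first, then the network pieces they were using.
        for amp in lb.amphorae:
            try:
                compute.delete_server(amp.compute_id)
            except NotFound:
                pass  # Already gone, which is expected most of the time.
        for port in lb.ports:
            try:
                network.delete_port(port.id)
            except NotFound:
                pass
        for sg in lb.security_groups:
            try:
                network.delete_security_group(sg.id)
            except NotFound:
                pass
        # Only after everything above has been swept should the failover
        # (or final DB cleanup) step run.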
17:39:24 johnsom: I think the graceful shutdown is borked
17:39:48 rm_work agreed, that would make cleanup much better
17:39:49 jniesz That is exactly why the current state machine is ERROR->DELETED
17:40:05 Yeah, because it was "easy" to start
17:40:25 But telling users "ah, I see your LB randomly went into ERROR for you... time to delete it and start over!" is entirely unacceptable
17:40:58 The thing I've seen cause ERROR states most often is actually failovers
17:41:12 Yeah, I agree with that. I would really like to run to ground WHY they are going into provisioning state ERROR (we are not talking operating status here)
17:41:21 Usually when something dumb happens, like a network blip
17:41:34 Or some other OpenStack component is having issues: glance, neutron, etc.
17:41:38 Right, yes
17:41:55 I'm hearing two things:
17:42:08 It's "access to external services" mostly
17:42:21 So again, is the answer to this maybe "to sync, trigger a failover"?
17:42:26 1. We need to figure out why the worker is not gracefully shutting down (not finishing a workflow before exiting)
17:42:28 Because we do have the failover API
17:42:45 2. We need to evaluate adding more retry logic to the flows/tasks
17:42:51 Yes, but also:
17:43:22 3. We need to provide some way to clean up stuff that still manages to get into an undefined state
17:43:38 Because we are awesome, but not so awesome that I can predict everything will always be bug-free
17:43:51 Yeah. Sadly, failing over an ERROR object could mean you get into a worse state
17:44:02 And us saying "well, we really should figure out *why*" when an operator has stuck LBs is not useful
17:44:42 So the ability to clean up PENDING state stuff (as an operator, like, FORCE-DELETE) would be nice
17:44:50 Like losing the VIP IP or having half an LB updated (one amp failover)
17:45:22 Yes, these things are bad
17:45:23 And sometimes they happen
17:45:39 And our delete flows also need to be improved a bit, I think, because about 50% of the time when something goes to ERROR, a delete is just going to ERROR-loop
17:46:05 That is bad. Delete should be able to clean up ERROR cleanly
17:46:08 Things like the "security group in use" issue still bite us, even in the gate sometimes
17:46:21 I've tried to fix that in my own driver
17:46:25 But it's tricky
17:46:53 Really? I have ONLY seen that in the gate when the main process failed due to coding bugs
17:46:54 I once had that ERROR-loop issue and ended up cleaning up the data to remove the LB resources.
17:47:06 database
17:47:21 Yeah, I have to dig into the DB
17:47:24 Quite often
17:47:32 Same here
17:47:49 Yeah, so what I hope we can do is capture these as bugs, with logs, so we can understand the failure mechanism and work on good mitigation options.
17:48:37 So, my plan would be a call that: A) ignores the state, so you can do a delete in any state; B) tries to first catalogue every possible object that we need to clean up; C) attempts to carefully clean up all of them
17:48:41 It also lets us discuss the pros/cons, as some of these solutions have some really dangerous side effects
17:50:08 Like this one: if it's used while a controller is still working on the objects (still has the lock), you get into cases where your force-delete command cleans up parts, but the controller goes and creates more orphaned objects.
17:50:43 It's like it would need to check with the controllers to see if they are still "active" with that object
17:51:29 Then what about the suggestion that HK flips things to ERROR after a configured timeout?
17:51:40 Our original plan for this was the job board implementation, where it passes the flow token to different controllers if one fails to check in and move it forward.
17:51:43 And we just assume you need to wait until then, and at that point anything in progress is done
17:51:51 Hmm
17:52:04 Yeah, I mean, jobboard would be great; we've been talking about it for 3 years
17:52:22 True. Act/Act for 2ish
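A minimal sketch of the housekeeping idea raised above: flip load balancers that have sat in a PENDING_* provisioning state past a configured timeout over to ERROR so they stop being immutable. The repository helpers and the 30-minute default are assumptions for illustration, not Octavia's actual API.

    import datetime

    PENDING_STATES = ('PENDING_CREATE', 'PENDING_UPDATE', 'PENDING_DELETE')
    STUCK_TIMEOUT = datetime.timedelta(minutes=30)  # assumed configurable default


    def expire_stuck_load_balancers(session, lb_repo):
        """Move load balancers stuck in PENDING_* for too long to ERROR."""
        now = datetime.datetime.utcnow()
        for lb in lb_repo.get_all(session, provisioning_status=PENDING_STATES):
            if now - lb.updated_at > STUCK_TIMEOUT:
                lb_repo.update(session, lb.id, provisioning_status='ERROR')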
17:53:29 Are we ok with starting by capturing these scenarios in bugs (stories)?
17:53:47 I think we need to be capturing these and discussing solutions.
17:53:48 I am ok with that
17:53:58 Yes, I think looking at the specific use cases is a good start
17:54:24 Specific issues
17:55:07 rm_work?
17:55:22 Ok, probably I will make this API in the meantime, but only use it downstream
17:55:23 And when we figure out what we need, I can tweak it and push it up
17:55:54 Yeah, I just think it's going to get abused by folks not thinking about the situation enough.
17:56:21 That is the concern.
17:56:23 Probably, but meanwhile I need runbooks to give to people who don't know octavia much, and that will keep me out of 24/7 on-call
17:56:31 Big red buttons are shiny
17:56:49 And as long as I design the big red button, I'd rather have them press that than the "call me at 2am on a weekend" button
17:57:25 Yeah, I just don't like spending days deleting orphaned objects and fixing corrupt DBs
17:57:33 I'm already DOING that
17:57:39 But I think I can find the orphaned stuff
17:57:43 with code
17:58:02 Ok, so I am looking forward to some bugs so we can understand the problems
17:58:10 Grin
17:58:30 We have two minutes; any other quick topics?
17:59:23 Is the plan to use the failover API for changing flavors?
17:59:39 Oye, that is a topic, isn't it
17:59:51 Can I put it on the agenda for next week?
17:59:57 Or in the channel.
18:00:00 We are out of time.
18:00:03 Ok
18:00:13 bh526r: Error: Can't start another meeting, one is in progress. Use #endmeeting first.
18:00:15 #endmeeting