15:03:54 <ihrachys> #startmeeting neutron_upgrades
15:03:55 <openstack> Meeting started Mon Nov 23 15:03:54 2015 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:03:58 <openstack> The meeting name has been set to 'neutron_upgrades'
15:04:15 <ajo> o/
15:04:15 <ihrachys> welcome all to the 1st meeting of the subteam :)
15:04:27 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/Neutron-Upgrades-Subteam
15:04:50 <ihrachys> that's the agenda we'll follow during our meetings. feel free to update. :)
15:05:28 <ihrachys> some history for those not informed: the subteam is formed at least for Mitaka and will look into the upgrade story for Neutron.
15:06:03 <ihrachys> it covers cold upgrades as well as rolling ones, looking at stability, speed and whatnot, all related to upgrades
15:06:30 <ihrachys> I want to clarify the channel mess first
15:07:03 <ihrachys> initially we planned to handle the meeting in #openstack-meeting-2 but it was pointed out on Fri that the channel is not an official one, so no logging, no bots etc.
15:07:16 <sc68cal> o/
15:07:25 <Sam-I-Am> howdy
15:07:39 <ihrachys> hence we switched to #openstack-meeting-3 this time since it's free today, and we'll work on getting an official channel booking this week
15:07:58 <ihrachys> there is a patch to book -3 bi-weekly for now
15:07:59 <ihrachys> #link https://review.openstack.org/247464
15:08:12 <rossella_s> thanks ihrachys
15:08:20 <ihrachys> but we'll probably look into getting a weekly slot, unless you think it's not worth it
15:08:22 <mestery> yes, thanks for taking care of this ihrachys
15:08:25 <ihrachys> comments on frequency?
15:08:39 <mhickey> ihrachys: I am easy
15:09:23 <ihrachys> ok, let's assume everyone agrees with the goal to have it weekly and move on. if not, speak up later.
15:09:33 <rossella_s> ok
15:09:39 <ihrachys> #topic Announcements
15:09:51 <ajo> let's try weekly, and step back if it's too much
15:10:14 <ihrachys> as you probably know, openstack is now a big tent, and we use tags to claim common features for projects
15:10:39 <ihrachys> lately two project tags were merged in the governance repo: assert:supports-upgrade https://review.openstack.org/239771 and assert:supports-rolling-upgrade: https://review.openstack.org/239778
15:11:21 <ihrachys> I believe the first thing we'll track as a team is those tags for neutron. it seems we are there, or almost there, to claim those
15:11:39 <ihrachys> there is actually a patch up for review to add supports-upgrade for neutron
15:11:39 <ihrachys> #link https://review.openstack.org/246242
15:12:05 <ihrachys> we already had it covered with grenade jobs before, so there are no action items beyond that patch
15:12:23 * ihrachys .oO (I wish all victories were that easy)
15:12:28 <sc68cal> woo
15:12:58 <ihrachys> for supports-rolling-upgrade, we have some stuff to do, but not much either, thanks to grenade folks
15:13:22 <ihrachys> sc68cal put up a patch for infra to add a partial job to the neutron experimental queue
15:13:23 <ihrachys> #link https://review.openstack.org/245862
15:13:49 <ihrachys> and it's merged. sc68cal, wanna update on the status of the job?
15:14:12 <sc68cal> I made a bit of a mistake, the job needs a different node type to run
15:14:23 <sc68cal> https://review.openstack.org/#/c/248737/
15:14:39 <sc68cal> so, it should be able to be run in the next hour
15:14:45 <ihrachys> sc68cal: aha. cool, and it's already in gate :)
15:15:11 <ajo> niice
15:15:25 <ihrachys> sc68cal: what's the plan for the job after we make sure it works in experimental?
15:15:44 <njohnston> o/
15:15:56 <sc68cal> ihrachys: run it, see what explodes, fix
15:16:21 <ihrachys> sc68cal: ok. once we are there, do we go thru a non-voting state or go straight to enabling it in gate?
15:16:40 <sc68cal> non-voting probably
15:16:51 <sc68cal> if it consistently passes, then make it voting
15:16:54 <amotoki> sc68cal: is there a good pointer to understand how multinode jobs are run?
15:17:00 <ihrachys> yeah, it may be worth collecting some stats.
15:17:06 <sc68cal> amotoki: not..... really :(
15:17:16 <ajo> sc68cal, how healthy is the normal multinode now?
15:17:28 <ihrachys> amotoki: I believe there was an email on the matter from sdague, I will try to find it now.
15:17:29 <amotoki> sc68cal: no problem. it seems better to collect our knowledge somewhere.
15:17:32 <ajo> I saw it had issues last week, but it seems to be running fine now?
15:17:46 <korzen> I assume that the multinode job should be able to run VMs on the not-yet-upgraded node, is that correct? do the tempest smoke tests cover that?
15:17:48 <sc68cal> amotoki: I badgered clarkb to write some docs - this was what we got https://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/multinode_setup_info.txt
15:18:08 <ihrachys> amotoki: http://lists.openstack.org/pipermail/openstack-dev/2015-November/079397.html
15:18:19 <ihrachys> #link http://lists.openstack.org/pipermail/openstack-dev/2015-November/079397.html details on grenade partial multinode jobs
15:18:22 <sc68cal> ajo: non-dvr multinode I think is pretty healthy
15:18:38 <sc68cal> ajo: DVR multinode has been good, then not so good, for a while
15:18:47 <sc68cal> I think the new DVR subteam has been working on fixing it
15:18:59 <ajo> ack, I see a few patches passing DVR too
15:19:03 <ajo> multinode-dvr I mean
15:19:26 <ihrachys> korzen: I presume for partial we run the whole suite with an old compute node
15:20:05 <ihrachys> ok, so we have a plan for partial, and we have sc68cal on it, so there is no place for failure :)
15:20:19 <rossella_s> :)
15:20:28 <ihrachys> sc68cal: thanks for hopping onto the job quickly
15:21:08 <ihrachys> #topic Patches
15:21:19 <korzen> I have spotted that the grenade upgrade code is not using the expand/contract schema migration
15:21:20 <ihrachys> I guess I should have switched the topic before, but meh
15:21:54 <korzen> should we work on adding the online schema migration to the grenade upgrade scripts?
15:22:01 <ihrachys> korzen: yeah, that's something we should look into. but that would be a separate job that would run all tests/code from stable EXCEPT applying expand migrations from master before that
15:22:31 <ihrachys> korzen: since the whole idea is that you can execute those on previous releases and continue running with no issues.
15:23:31 <ihrachys> maybe it would also require an actual upgrade and another tempest run to validate that we didn't break anything while running the old code
15:24:05 <ihrachys> ok, moving to the next patches up for review
15:24:17 <ihrachys> we have two upgrade related patches for devref
15:24:29 <ihrachys> #link https://review.openstack.org/241687 devref page for upgrades
15:25:26 <ihrachys> ^ is a page with general info on scenarios, struggles, plans, whatnot. I believe it's mostly for cultural changes outside those explicitly interested in upgrades, though if you find smth interesting for you too, cool.
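[Editor's note] The expand/contract idea discussed above can be sketched as follows. This is an illustrative toy, not the actual neutron-db-manage or grenade code: real migrations are alembic revisions, and the migration names below are made up.

```python
# Illustrative sketch of the expand/contract split behind rolling upgrades:
# additive ("expand") schema migrations are safe to apply while the old code
# is still running, whereas destructive ("contract") migrations require
# neutron-server to be offline.

EXPAND = "expand"      # additive: new tables/columns, safe to run online
CONTRACT = "contract"  # destructive: drops/renames, needs server offline

def plan_rolling_upgrade(migrations):
    """Split migrations into the phases of a rolling upgrade.

    `migrations` is a list of (name, phase) tuples in revision order.
    Returns (safe_before_code_upgrade, deferred_until_offline_window).
    """
    before = [name for name, phase in migrations if phase == EXPAND]
    after = [name for name, phase in migrations if phase == CONTRACT]
    return before, after

# Hypothetical migration names, for illustration only.
migs = [
    ("add_qos_policies_table", EXPAND),
    ("drop_legacy_quota_column", CONTRACT),
    ("add_port_dns_column", EXPAND),
]
before, after = plan_rolling_upgrade(migs)
```

In the grenade-job idea above, the `before` set would be applied from master while the stable code still runs, and the job would then verify nothing broke.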
15:25:52 <ihrachys> we also have another devref change that puts up the plan for RPC callbacks mechanism upgrades, from ajo
15:26:00 <ihrachys> #link https://review.openstack.org/241154 devref update for rpc callbacks upgrades
15:26:11 <ajo> :-)
15:26:28 <ihrachys> if someone is not aware, rpc callbacks is a mechanism used by the QoS feature to propagate updates to QoS resources into agents
15:27:05 <ihrachys> and it changes the usage pattern somewhat: instead of asking agents to pull data from neutron-server, it makes neutron-server broadcast updates to agents
15:27:21 <ihrachys> that's similar to how we do notifications, but with actual data payload attached
15:27:31 <ajo> Also plans to use them to sync agents more reliably by kevinbenton
15:27:42 <ajo> #link : https://review.openstack.org/#/c/225995/ agent push notification refactor
15:27:48 <ihrachys> there are plans to reuse the mechanism for other features, like port_details updates ^
15:28:13 <ihrachys> so the mechanism, and rolling upgrades for it, become more important than before :)
15:29:44 <ihrachys> note that since we need versioned objects to make use of the mechanism, we would need to provide those to kevinbenton asap. rossella_s was kind enough to take the PoC for port objects.
15:30:01 <ihrachys> rossella_s: I guess you had no time for that till now, correct?
15:30:18 <rossella_s> ihrachys, yes
15:30:23 <ajo> The new testing jobs will also help validate all this work
15:30:39 <ihrachys> ajo: yes, I believe partial is explicitly mentioned in the spec as a blocker
15:30:43 <rossella_s> ihrachys, I hope to give an update regarding this in the following days... my plan is to start this week
15:31:02 <ajo> rossella_s, ihrachys, keep me looped in on that work
15:31:16 <rossella_s> ajo sure
15:31:26 <korzen> rossella_s please include me also
15:31:33 <ihrachys> rossella_s: thanks a lot! reach me whenever you have questions on objects. I put that all in the tree initially, so hopefully I may be of some help
15:32:05 <rossella_s> korzen, will do
15:32:11 <rossella_s> ihrachys, thanks a lot, I will
15:32:39 <ihrachys> rossella_s: btw to make it clear, the idea for a start is merely providing objects to allow kevinbenton to switch the notification code to use them. Other code that currently touches tables may still use the old way. This is to unblock further work.
15:33:06 <rossella_s> ihrachys, got it
15:33:17 <ihrachys> ok cool on that one :)
15:33:24 <ajo> Yes, the complication is in the extension of ports and networks by mixins :)
15:33:30 <korzen> is there an rfe or bp for the OVO implementation?
15:33:39 <ihrachys> #action rossella_s will start on PoC for port objects
15:34:08 <ajo> korzen, not sure if an RFE is needed, it's not an external feature but more of an internal design, maybe as part of any dependent work?
15:34:13 <ihrachys> do we think it needs a separate rfe?
15:34:39 <ajo> ihrachys, I don't know, maybe it's worth checking with armax, mestery, any idea ^ ?
15:34:40 <sc68cal> ajo: +1
15:34:41 <ihrachys> ajo: it may be worth checking with the drivers team on that one. maybe they will indeed want it.
15:35:04 <ajo> ihrachys : +1
15:35:08 <ihrachys> yeah, let's better check instead of hoping for the best and being blocked later
15:35:16 <mestery> ajo ihrachys: I don't think an RFE is needed, we have to do this work, so my opinion is let's file a bug (or write a quick spec) and move forward.
15:35:20 <mestery> Although
15:35:24 <mestery> I guess the bug could be an RFE bug :)
15:35:39 <ihrachys> mestery: we don't have specs without bps, and bps without rfes, so...
15:35:44 <mestery> lol
15:35:47 <ihrachys> (unless I miss smth)
15:35:54 <ihrachys> that process thing is hard!
15:36:01 <mestery> :)
15:36:09 <ajo> lol
15:36:21 <korzen> https://review.openstack.org/#/c/168955/6/specs/liberty/versioned-objects.rst
15:36:22 <ajo> a devref? :)
15:36:30 <korzen> old spec from Liberty
15:36:57 <ihrachys> korzen: cool, thanks for the pointer!
15:37:05 <ajo> korzen++
15:37:15 <ihrachys> ok, let's check with drivers tomorrow (I believe we have the drivers meeting on Tue)
15:37:24 <mestery> ihrachys: Yup, 1500 UTC I believe
15:37:41 <amotoki> devref sounds like a good idea for this kind of thing. we can discuss design and code in parallel.
15:37:56 <ihrachys> #action ihrachys will follow up with drivers team on the need for RFE/BP/spec/bribe to proceed with OVO adoption
15:38:08 <amotoki> but it seems good to ask armando.
15:38:22 <ihrachys> amotoki: yeah, he is usually concerned about overcommitment
15:38:58 <ihrachys> ok, we don't have anything more on the agenda, so switching the topic
15:39:02 <ihrachys> #topic Open Agenda
15:39:16 <ihrachys> one thing I wanted to clarify is whether we need an LP tag for the team
15:39:16 <ajo> open is good, we love open
15:39:20 <ajo> :)
15:39:21 <amotoki> one question on the upgrade strategy is whether we can assume neutron server is upgraded first.
15:39:39 <sc68cal> amotoki: I think that's a given
15:39:48 <amotoki> it is reasonable to me but I think no other projects are doing so now. is it better to hear opinions from operators or others?
15:39:51 <sc68cal> amotoki: I believe that's the way Nova does upgrades
15:39:54 <ajo> I guess that's good as long as it is well documented
15:39:55 <ihrachys> amotoki: if we don't specify it, we complicate our lives for no reason
15:40:12 <ihrachys> amotoki: I *think* nova does? isn't conductor the first one to upgrade?
15:40:18 <amotoki> ihrachys: I totally agree.
15:40:23 <Sam-I-Am> the server (read: db schema) has to be upgraded first
15:40:47 <ihrachys> Sam-I-Am: in nova, it's conductor that touches the db, right?
15:40:58 <sc68cal> ihrachys: yep
15:41:01 <Sam-I-Am> ihrachys: yeah
15:41:09 <amotoki> ihrachys: I might be wrong. I will check again.
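[Editor's note] The rolling-upgrade constraint behind the versioned-objects work discussed above is that an upgraded neutron-server must broadcast payloads that not-yet-upgraded agents can still consume. A toy sketch of that idea follows; it is not the oslo.versionedobjects API, and the Port fields and version numbers are hypothetical.

```python
# Toy illustration of version-tolerant payloads for RPC callbacks: each
# broadcast is tagged with a version, and the server can "backlevel" a payload
# so older agents only see fields they understand.

class PortPayload:
    VERSION = "1.1"
    # field name -> version in which the field was introduced (hypothetical)
    FIELDS = {"id": "1.0", "name": "1.0", "qos_policy_id": "1.1"}

    def __init__(self, **values):
        self.values = values

    def to_primitive(self, target_version):
        """Serialize the object, dropping fields newer than target_version."""
        def ver(v):
            return tuple(int(part) for part in v.split("."))
        data = {
            field: self.values[field]
            for field, introduced in self.FIELDS.items()
            if field in self.values and ver(introduced) <= ver(target_version)
        }
        return {"version": target_version, "data": data}

port = PortPayload(id="p1", name="web", qos_policy_id="q1")
new = port.to_primitive("1.1")  # full payload for upgraded agents
old = port.to_primitive("1.0")  # backleveled for agents still on the old code
```

The real mechanism negotiates the version each agent supports; here the target version is simply passed in.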
15:41:19 <Sam-I-Am> the general upgrade process is upgrade source, update db, restart services
15:41:33 <Sam-I-Am> if any config options require changing, do that prior to the service restart
15:41:42 <Sam-I-Am> or if they pertain to db stuff, before the db update
15:41:46 <ihrachys> http://superuser.openstack.org/articles/upgrading-nova-to-kilo-with-minimal-downtime , step 3
15:42:43 <ihrachys> note that the order is to be documented for neutron with https://review.openstack.org/241687
15:43:15 <Sam-I-Am> another thing that should be tested during upgrade is connectivity to vms
15:43:21 <Sam-I-Am> it should not break
15:43:49 <ihrachys> Sam-I-Am: yeah. I believe it's not supposed to break, but it's indeed not tested.
15:43:58 <ajo> Sam-I-Am, Isn't that checked by tempest in the grenade job?
15:44:03 <ihrachys> it was fixed in L by akamyshnikova
15:44:29 <ihrachys> ajo: I don't think so. during the upgrade no actual tests are executed that could catch a connectivity drop
15:44:41 <Sam-I-Am> also vm connectivity outbound if possible
15:44:45 <ajo> ha
15:44:48 <Sam-I-Am> but inbound usually implies outbound works
15:44:54 <ajo> we should at least run the smoke tests
15:45:16 <ihrachys> ajo: we do run them, but not while we restart services.
15:45:20 <ajo> or the network related ones
15:45:24 <ihrachys> ajo: it's like 'run smoke, upgrade, run tests'
15:45:29 <ihrachys> in sequence
15:45:36 <ajo> ah, you mean while upgrading?
15:45:39 <korzen> the idea of vm connectivity is to have a constant ping to the VM during upgrade, it is not covered by the smoke tests, right?
15:45:44 <ajo> ok, not sure how that would be possible.
15:45:59 <ajo> That's assumed by the idea that agents going down don't remove the dataplane settings
15:46:05 <ajo> or bring down any real service
15:46:11 <Sam-I-Am> some sort of basic check to make sure an upgrade doesn't break connectivity, and if it does, accurately determine the downtime
15:46:14 <ajo> so the only issues to arise are when new agents boot up
15:46:20 <ihrachys> ajo: they don't, since L (at least OVS)
15:47:01 <ajo> ah, for OVS we had some before, but it was momentary (a few seconds) and not related to the upgrade itself
15:47:05 <ajo> just because of ovs agent restart
15:47:07 <ajo> (any restart)
15:47:10 <ihrachys> ajo: it may be some background service in devstack that runs while the upgrade is ongoing that provides data on connectivity into some file that we can parse later.
15:47:17 <Sam-I-Am> question is, would any sort of upgrade cause agents to 'rework' the dataplane structure? for example, an architectural change to dvr.
15:47:39 <ihrachys> ajo: yeah, but it's kinda related to upgrades because you can't hot swap python code ;)
15:47:39 <ajo> Sam-I-Am : eventually..
15:48:05 <ajo> So, maybe a background pinger
15:48:15 <ajo> that provides a report by the end of the upgrade
15:48:21 <ajo> ok, you convinced me
15:48:24 <Sam-I-Am> yeah. it just needs to glean the instance info
15:48:26 <ihrachys> Sam-I-Am: I agree it's worth validating. I believe we had a patch merged in master lately that changed names for LB devices for no big reason, and we haven't caught it
15:49:08 <ajo> Sam-I-Am, ihrachys, when debugging stuff like that, I found it especially useful to run a pinger in the openstack log format
15:49:13 <Sam-I-Am> it would need to test all common cases... floating ip, private ip, public ip, router ip, etc.
15:49:13 <ihrachys> I believe the more visible the subteam's work is in gerrit, the more aware the community will be about where to look
15:49:23 <ajo> that way you could sync with the agent reboot logs, etc, to see when disruptions happen
15:49:39 <Sam-I-Am> probably dhcp too
15:50:07 <ihrachys> ajo: ack. as long as we have data points with timestamps, it's not an issue to sync logs with those
15:50:14 <ajo> correct
15:50:33 <Sam-I-Am> will these upgrades test both ovs and lb?
15:50:41 <ihrachys> Sam-I-Am: dhcp disruption may be fine for a tiny bit of time since we can have multiple agents serving a network
15:50:52 <ihrachys> Sam-I-Am: I don't think we will look into LB before OVS.
15:50:56 <Sam-I-Am> ihrachys: yeah, dhcp isn't as big of an issue since access is intermittent
15:51:17 <ihrachys> Sam-I-Am: I even suspect LB may still have connectivity disruption since akamyshnikova's patch was OVS specific
15:51:21 <Sam-I-Am> ihrachys: but we should at least check for breaking it because instances do crazy things when they can't get a lease
15:51:39 <ihrachys> ack
15:51:39 <ajo> ihrachys : I think the disruption was ovs specific too
15:51:55 <ihrachys> ajo: ok then I get why operators lean towards LB ;)
15:52:03 <Sam-I-Am> iirc, ovs was the only thing that had disruption during a restart of ovs-agent (not specific to upgrades)
15:52:17 <Sam-I-Am> because it re-created all of the magic
15:52:18 <ihrachys> btw one more patch that I want to point out related to upgrades is https://review.openstack.org/248190
15:52:22 <akamyshnikova> ihrachys, yes I agree with ajo, I'm not sure that LB has problems
15:52:28 <Sam-I-Am> linuxbridge has been resilient for a while now
15:52:47 <ihrachys> the patch ^ adds a command to neutron-db-manage to detect whether there are pending contract migrations that would require taking neutron-server instances offline before execution
15:52:49 <Sam-I-Am> you can kill off all the agents and the bridges stick around (sometimes too many of them stick around)
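[Editor's note] The background-pinger idea discussed above boils down to: sample connectivity with timestamps during the upgrade (so results can be lined up with agent logs), then report disruption windows afterwards. A minimal sketch of the reporting half follows; the actual probe (e.g. pinging a VM's floating IP) is stubbed out, and the sample data is made up.

```python
# Sketch of the downtime report a background pinger could produce: given
# timestamped connectivity samples collected during an upgrade, find each
# contiguous run of failures and report it as a disruption window.

def disruption_windows(samples):
    """samples: list of (timestamp_seconds, ok_bool) in time order.

    Returns [(start, end), ...], one tuple per run of failed samples;
    `end` is the timestamp at which connectivity was next observed (or the
    last sample if the run never recovered).
    """
    windows, start = [], None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts              # disruption begins
        elif ok and start is not None:
            windows.append((start, ts))  # connectivity restored
            start = None
    if start is not None:           # still down at the end of sampling
        windows.append((start, samples[-1][0]))
    return windows

# Made-up samples: connectivity drops between t=12 and t=15, e.g. around an
# ovs agent restart during the upgrade.
samples = [(10, True), (11, True), (12, False), (13, False), (15, True), (16, True)]
report = disruption_windows(samples)
```

Emitting each sample as a timestamped log line (openstack log format, as ajo suggests) would then let the windows be correlated with agent restart logs.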
15:53:16 <ihrachys> I know puppet folks already have some code to determine whether a full shutdown is needed, but they rely on ugly hacks like filenames
15:53:28 <ihrachys> the idea of the patch is to make it a supported command they can rely on
15:53:50 <ihrachys> note it's WIP and I need to play with alembic a bit before it's ready. so just a heads-up.
15:54:12 <ihrachys> Sam-I-Am: good to hear :)
15:54:49 <ihrachys> ok, anything else to point out upgrades related?
15:55:12 <Sam-I-Am> i'm good for now
15:55:27 <korzen> can we slit the OVO implementation?
15:55:28 <ihrachys> Sam-I-Am: thanks for joining btw, I somehow missed you :)
15:55:38 <Sam-I-Am> ihrachys: np. somanymeetings :)
15:55:39 <ihrachys> korzen: you mean per object type?
15:55:42 <korzen> yes
15:55:50 <ihrachys> korzen: I don't see any reason not to if we have interested parties!
15:55:57 <Sam-I-Am> ihrachys: i have a spec to improve the general openstack upgrade procedures
15:56:08 <ihrachys> korzen: I presume networks/subnets could be a separate piece to bite off
15:56:13 <ihrachys> rossella_s: comments ^ ?
15:56:30 <korzen> yes, my idea is to touch the network/subnets
15:56:31 <ihrachys> Sam-I-Am: nice. link?
15:56:35 <Sam-I-Am> each release will include a list of potential issues including downtime, so having an automated upgrade test process is important
15:56:43 <rossella_s> ihrachys, I agree with you
15:57:03 <Sam-I-Am> ihrachys: http://specs.openstack.org/openstack/docs-specs/specs/mitaka/upgrades.html
15:57:27 <ihrachys> korzen: so just go ahead ^ :)
15:57:41 <korzen> ok, thx
15:57:47 <ihrachys> #link http://specs.openstack.org/openstack/docs-specs/specs/mitaka/upgrades.html
15:57:54 <ihrachys> Sam-I-Am: thanks, added to my reading list :)
15:57:59 <ajo> korzen, what do you mean by slit?
15:58:04 <korzen> split*
15:58:07 <korzen> ;)
15:58:11 <ajo> upps, sorry I was reading an old log somehow
15:58:12 <ajo> :)
15:58:18 <Sam-I-Am> thx. i added your devref patch to my reading list
15:58:50 <ihrachys> ok, I believe we are near the end of the meeting. if anything, we may proceed in the #openstack-neutron channel. thanks for joining and looking forward to your patches and reviews. :)
15:59:08 <ajo> ihrachys: thanks for driving this
15:59:26 <rossella_s> ihrachys, thanks a lot!
15:59:34 <mhickey> ihrachys, thanks
15:59:39 <ihrachys> and keep attention to meeting time updates in openstack-dev@, we may need to switch if we don't get the -2 channel as the official one.
15:59:46 <ihrachys> adieu! o/
15:59:48 <amotoki> thanks!
15:59:53 <Sam-I-Am> seeya, great meeting
15:59:54 <ihrachys> #endmeeting