15:03:54 <ihrachys> #startmeeting neutron_upgrades
15:03:55 <openstack> Meeting started Mon Nov 23 15:03:54 2015 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:03:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:03:58 <openstack> The meeting name has been set to 'neutron_upgrades'
15:04:15 <ajo> o/
15:04:15 <ihrachys> welcome all to the 1st meeting of the subteam :)
15:04:27 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/Neutron-Upgrades-Subteam
15:04:50 <ihrachys> that's the agenda we'll follow during our meetings. feel free to update. :)
15:05:28 <ihrachys> some history for those not informed: the subteam is formed for Mitaka at least and will look into the upgrade story for Neutron.
15:06:03 <ihrachys> it covers cold upgrades as well as rolling ones, looking at stability, speed and whatnot all related to upgrades
15:06:30 <ihrachys> I want to clarify on the channel mess first
15:07:03 <ihrachys> initially we planned to handle the meeting in #openstack-meeting-2 but it was pointed out on Fri that the channel is not an official one, so no logging, no bots etc.
15:07:16 <sc68cal> o/
15:07:25 <Sam-I-Am> howdy
15:07:39 <ihrachys> hence we switched to #openstack-meeting-3 this time since it's free today, and we'll work on getting official channel booking this week
15:07:58 <ihrachys> there is a patch to book -3 bi-weekly for now
15:07:59 <ihrachys> #link https://review.openstack.org/247464
15:08:12 <rossella_s> thanks ihrachys
15:08:20 <ihrachys> but we'll probably look into getting a slot weekly, unless you think it's not worth it
15:08:22 <mestery> yes, thanks for taking care of this ihrachys
15:08:25 <ihrachys> comments on frequency?
15:08:39 <mhickey> ihrachys: I am easy
15:09:23 <ihrachys> ok, let's assume everyone agrees with the goal to have it weekly and move on. if not, speak up later.
15:09:33 <rossella_s> ok
15:09:39 <ihrachys> #topic Announcements
15:09:51 <ajo> let's try weekly, and step back if it's too much
15:10:14 <ihrachys> as you probably know, openstack is now a big tent, and we use tags to claim common features for projects
15:10:39 <ihrachys> lately two project tags were merged in governance repo: assert:supports-upgrade https://review.openstack.org/239771 and assert:supports-rolling-upgrade: https://review.openstack.org/239778
15:11:21 <ihrachys> I believe first thing we'll track as a team is those tags for neutron. it seems we are there or almost there to claim those
15:11:39 <ihrachys> there is actually a patch up for review to add supports-upgrade for neutron
15:11:39 <ihrachys> #link https://review.openstack.org/246242
15:12:05 <ihrachys> we already had it covered with grenade jobs before, so there are no action items beyond that patch
15:12:23 * ihrachys .oO (I wish all victories are that easy)
15:12:28 <sc68cal> woo
15:12:58 <ihrachys> for supports-rolling-upgrade, we have some stuff to do, but not much either, thanks to grenade folks
15:13:22 <ihrachys> sc68cal put a patch for infra to add partial job for neutron experimental queue
15:13:23 <ihrachys> #link https://review.openstack.org/245862
15:13:49 <ihrachys> and it's merged. sc68cal, wanna update on what's the status for the job?
15:14:12 <sc68cal> I made a bit of a mistake, the job needs a different node type to run
15:14:23 <sc68cal> https://review.openstack.org/#/c/248737/
15:14:39 <sc68cal> so, it should be able to be run in the next hour
15:14:45 <ihrachys> sc68cal: aha. cool, and it's already in gate :)
15:15:11 <ajo> niice
15:15:25 <ihrachys> sc68cal: what's the plan for the job after we make sure it works in experimental?
15:15:44 <njohnston> o/
15:15:56 <sc68cal> ihrachys: run it, see what explodes, fix
15:16:21 <ihrachys> sc68cal: ok. once we are there, do we go thru non-voting state or go straight to enable it in gate?
15:16:40 <sc68cal> non-voting probably
15:16:51 <sc68cal> if it consistently passes, then make it voting
15:16:54 <amotoki> sc68cal: is there a good pointer to understand how multinode jobs are run?
15:17:00 <ihrachys> yeah, it may be worth collecting some stats.
15:17:06 <sc68cal> amotoki: not..... really :(
15:17:16 <ajo> sc68cal , how healthy is the normal multinode now?
15:17:28 <ihrachys> amotoki: I believe there was an email on the matter from sdague, I will try to find it now.
15:17:29 <amotoki> sc68cal: no problem. it seems better to collect our knowledge somewhere.
15:17:32 <ajo> I saw it had issues last week, but seems to be running fine now?
15:17:46 <korzen> I assume that the multinode job should be able to run VMs on the not-yet-upgraded node, is that correct? do the tempest smoke tests cover that?
15:17:48 <sc68cal> amotoki: I badgered clarkb to write some docs - this was what we got https://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/multinode_setup_info.txt
15:18:08 <ihrachys> amotoki: http://lists.openstack.org/pipermail/openstack-dev/2015-November/079397.html
15:18:19 <ihrachys> #link http://lists.openstack.org/pipermail/openstack-dev/2015-November/079397.html details on grenade partial multinode jobs
15:18:22 <sc68cal> ajo: non-dvr multinode I think is pretty healthy
15:18:38 <sc68cal> ajo: DVR multinode has been good, then not so good, for a while
15:18:47 <sc68cal> I think the new DVR subteam has been working on fixing
15:18:59 <ajo> ack, I see a few patches passing DVR too
15:19:03 <ajo> multinode-dvr I mean
15:19:26 <ihrachys> korzen: I presume for partial we run the whole suite with an old compute node
15:20:05 <ihrachys> ok, so we have plan for partial, and we have sc68cal on it, so there is no place for failure :)
15:20:19 <rossella_s> :)
15:20:28 <ihrachys> sc68cal: thanks for hopping onto the job quickly
15:21:08 <ihrachys> #topic Patches
15:21:19 <korzen> I have spotted that grenade upgrade code is not using the expand/contract schema migration
15:21:20 <ihrachys> I guess I should have switched the topic before, but meh
15:21:54 <korzen> should we work on adding the online schema migration to grenade upgrade scripts?
15:22:01 <ihrachys> korzen: yeah, that's something we should look into. but that would be a separate job that would run all tests/code from stable, except that it would apply expand migrations from master first
15:22:31 <ihrachys> korzen: since the whole idea is that you can execute those on previous releases and continue running with no issues.
15:23:31 <ihrachys> maybe it would also require actual upgrade and another tempest run to validate that we didn't break anything while running the old code
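The expand/contract split discussed above can be illustrated with a small sketch: expand operations are additive and safe to apply while old code is still serving traffic, while contract operations remove or alter things the old code may still rely on. This is illustrative only (hypothetical operation names, not actual Neutron/alembic code):

```python
# Hedged sketch: classifying schema operations into "expand" (additive,
# safe while old services run) vs "contract" (destructive, requires old
# services to be stopped first). Illustrative only, not Neutron code.

EXPAND_OPS = {"add_column", "create_table", "create_index"}
CONTRACT_OPS = {"drop_column", "drop_table", "alter_column", "drop_index"}

def classify_migration(operations):
    """Return 'expand' if every operation is additive, else 'contract'."""
    if all(op in EXPAND_OPS for op in operations):
        return "expand"
    return "contract"

def split_pending(migrations):
    """Split pending migrations into the two phases: expand can run
    online before the upgrade, contract runs after services stop."""
    expand = [name for name, ops in migrations
              if classify_migration(ops) == "expand"]
    contract = [name for name, ops in migrations
                if classify_migration(ops) == "contract"]
    return expand, contract
```

A grenade-style job built on this idea would apply the expand phase from master, run the stable code and tests against the expanded schema, then upgrade and contract.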
15:24:05 <ihrachys> ok, moving to next patches up for review
15:24:17 <ihrachys> we have two upgrade related patches for devref
15:24:29 <ihrachys> #link https://review.openstack.org/241687 devref page for upgrades
15:25:26 <ihrachys> ^ is a page with general info on scenarios, struggles, plans, whatnot. I believe it's mostly for cultural changes outside those explicitly interested in upgrades, though if you find smth interesting for you too, cool.
15:25:52 <ihrachys> we also have another devref change that puts plan for RPC callbacks mechanism upgrade from ajo
15:26:00 <ihrachys> #link https://review.openstack.org/241154 devref update for rpc callbacks upgrades
15:26:11 <ajo> :-)
15:26:28 <ihrachys> if someone is not aware, rpc callbacks is a mechanism used for QoS feature to propagate updates to QoS resources into agents
15:27:05 <ihrachys> and it changes usage pattern somewhat: instead of asking agents to pull data from neutron-server, it makes neutron-server broadcast updates for agents
15:27:21 <ihrachys> that's similar to how we do notifications, but with actual data payload attached
15:27:31 <ajo> Also plans to use them to sync agents more reliably by kevinbenton
15:27:42 <ajo> #link : https://review.openstack.org/#/c/225995/ agent push notification refactor
15:27:48 <ihrachys> there are plans to reuse the mechanism for other features, like port_details updates ^
15:28:13 <ihrachys> so the mechanism and rolling upgrades for it become more important than before :)
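Conceptually, the push pattern described above is a small publish/subscribe registry: neutron-server broadcasts the updated resource with its payload attached, and agents register callbacks per resource type instead of pulling data back. The names and structure below are hypothetical, not the actual neutron rpc callbacks implementation:

```python
# Hedged sketch of the push-notification pattern: the server broadcasts
# updated resources (payload included) and agents subscribe callbacks
# per resource type. Illustrative names, not the real Neutron API.
from collections import defaultdict

class ResourcePushRegistry:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, resource_type, callback):
        # agent side: register interest, e.g. in 'qos_policy' updates
        self._subscribers[resource_type].append(callback)

    def push(self, resource_type, resource):
        # server side: broadcast the updated resource to every
        # subscriber, instead of having agents pull the data
        for callback in self._subscribers[resource_type]:
            callback(resource)

received = []
registry = ResourcePushRegistry()
registry.subscribe("qos_policy", received.append)
registry.push("qos_policy", {"id": "p1", "max_kbps": 1000})
```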
15:29:44 <ihrachys> note that since we need versioned objects to make use of the mechanism, we would need to provide those to kevinbenton asap. rossella_s was kind to take the PoC for port objects.
15:30:01 <ihrachys> rossella_s: I guess you had no time for that till now, correct?
15:30:18 <rossella_s> ihrachys, yes
15:30:23 <ajo> The new testing jobs will also help validate all this work
15:30:39 <ihrachys> ajo: yes, I believe partial is explicitly mentioned in the spec as a blocker
15:30:43 <rossella_s> ihrachys, I hope to give an update regarding this in the following days...my plan is to start this week
15:31:02 <ajo> rossella_s , ihrachys , keep me looped in into that work
15:31:16 <rossella_s> ajo sure
15:31:26 <korzen> rossella_s please include me also
15:31:33 <ihrachys> rossella_s: thanks a lot! reach me whenever you have questions on objects. I put that all in the tree initially, so hopefully I may be of some help
15:32:05 <rossella_s> korzen, will do
15:32:11 <rossella_s> ihrachys, thanks a lot, I will
15:32:39 <ihrachys> rossella_s: btw to make it clear, the idea to start with is merely providing objects to allow kevinbenton to switch notification code to use them. Other code that currently touches tables may still use the old way. This is to unblock further work.
15:33:06 <rossella_s> ihrachys, got it
15:33:17 <ihrachys> ok cool on that one :)
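The versioned-objects idea behind the port PoC is that each object carries a version and can downgrade its payload for older consumers during a rolling upgrade. Below is a pure-Python sketch of that concept only; the real work would use oslo.versionedobjects, and the field names and versions here are hypothetical:

```python
# Hedged, pure-Python sketch of the versioned-object concept: the
# object carries a version, and the sender backports the payload when
# the receiver only understands an older version. The real code would
# use oslo.versionedobjects; fields/versions here are made up.

class PortObject:
    VERSION = "1.1"  # assume 1.1 added 'qos_policy_id' on top of 1.0

    def __init__(self, id, mac_address, qos_policy_id=None):
        self.fields = {"id": id, "mac_address": mac_address,
                       "qos_policy_id": qos_policy_id}

    def obj_to_primitive(self, target_version=None):
        primitive = dict(self.fields)
        if target_version == "1.0":
            # a not-yet-upgraded agent doesn't know the new field
            primitive.pop("qos_policy_id")
        return primitive

port = PortObject("p1", "fa:16:3e:00:00:01", qos_policy_id="q1")
new_payload = port.obj_to_primitive()        # for upgraded agents
old_payload = port.obj_to_primitive("1.0")   # for old agents
```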
15:33:24 <ajo> Yes, the complication is on extension of ports and networks by mixins :)
15:33:30 <korzen> is there rfe or bp for OVO implementation?
15:33:39 <ihrachys> #action rossella_s will start on PoC for port objects
15:34:08 <ajo> korzen, not sure an RFE is needed; it's not an external feature but more of an internal design. maybe as part of any dependent work?
15:34:13 <ihrachys> do we think it needs a separate rfe?
15:34:39 <ajo> ihrachys, i don't know, may be it's worth checking with armax, mestery, any idea ^ ?
15:34:40 <sc68cal> ajo: +1
15:34:41 <ihrachys> ajo: it may be worth checking with the drivers team on that one. maybe they will indeed want it.
15:35:04 <ajo> ihrachys : +1
15:35:08 <ihrachys> yeah, let's better check instead of hoping for the best and being blocked later
15:35:16 <mestery> ajo ihrachys: I don't think an RFE is needed, we have to do this work, so my opinion is lets file a bug (or write a quick spec) and move forward.
15:35:20 <mestery> Although
15:35:24 <mestery> I guess the bug could be an RFE bug :)
15:35:39 <ihrachys> mestery: we don't have specs without bps, and bps without rfes, so...
15:35:44 <mestery> lol
15:35:47 <ihrachys> (unless I miss smth)
15:35:54 <ihrachys> that process thing is hard!
15:36:01 <mestery> :)
15:36:09 <ajo> lol
15:36:21 <korzen> https://review.openstack.org/#/c/168955/6/specs/liberty/versioned-objects.rst
15:36:22 <ajo> a devref? :)
15:36:30 <korzen> old spec from Liberty
15:36:57 <ihrachys> korzen: cool, thanks for the pointer!
15:37:05 <ajo> korzen++
15:37:15 <ihrachys> ok, let's check with drivers tomorrow (I believe we have drivers meeting on Tue)
15:37:24 <mestery> ihrachys: Yup, 1500 UTC I believe
15:37:41 <amotoki> devref sounds like a good idea for this kind of thing. we can discuss design and code in parallel.
15:37:56 <ihrachys> #action ihrachys will follow up with drivers team on the need for RFE/BP/spec/bribe to proceed with OVO adoption
15:38:08 <amotoki> but it seems good to ask armando.
15:38:22 <ihrachys> amotoki: yeah, he is usually concerned about over-committing
15:38:58 <ihrachys> ok, we don't have anything more on agenda, so switching the topic
15:39:02 <ihrachys> #topic Open Agenda
15:39:16 <ihrachys> one thing I wanted to clarify is whether we need a LP tag for the team
15:39:16 <ajo> open is good, we love open
15:39:20 <ajo> :)
15:39:21 <amotoki> one question on the upgrade strategy is whether we can assume neutron server is upgraded first.
15:39:39 <sc68cal> amotoki: I think that's a given
15:39:48 <amotoki> it is reasonable to me but I think no other projects are doing so now. is it better to hear opinions from operators or others?
15:39:51 <sc68cal> amotoki: I believe that's the way Nova does upgrades
15:39:54 <ajo> I guess that's good as long as it is well documented
15:39:55 <ihrachys> amotoki: if we don't specify it, we complicate our lives for no reason
15:40:12 <ihrachys> amotoki: I *think* nova does? isn't conductor the first one to upgrade?
15:40:18 <amotoki> ihrachys: I totally agree.
15:40:23 <Sam-I-Am> the server (read: db schema) has to be upgraded first
15:40:47 <ihrachys> Sam-I-Am: in nova, it's conductor that touches db, right?
15:40:58 <sc68cal> ihrachys: yep
15:41:01 <Sam-I-Am> ihrachys: yeah
15:41:09 <amotoki> ihrachys: I might be wrong. I will check again.
15:41:19 <Sam-I-Am> the general upgrade process is upgrade source, update db, restart services
15:41:33 <Sam-I-Am> if any config options require changing, do that prior to the service restart
15:41:42 <Sam-I-Am> or if they pertain to db stuff, before db update
15:41:46 <ihrachys> http://superuser.openstack.org/articles/upgrading-nova-to-kilo-with-minimal-downtime , step 3
15:42:43 <ihrachys> note that the order is to be documented for neutron with https://review.openstack.org/241687
15:43:15 <Sam-I-Am> another thing that should be tested during upgrade is connectivity to vms
15:43:21 <Sam-I-Am> it should not break
15:43:49 <ihrachys> Sam-I-Am: yeah. I believe it's not supposed to break, but it's indeed not tested.
15:43:58 <ajo> Sam-I-Am , Isn't that checked by tempest in grenade job?
15:44:03 <ihrachys> it was fixed in L by akamyshnikova
15:44:29 <ihrachys> ajo: I don't think so. during upgrade no actual tests are executed that could catch connectivity drop
15:44:41 <Sam-I-Am> also vm connectivity outbound if possible
15:44:45 <ajo> ha
15:44:48 <Sam-I-Am> but inbound usually implies outbound works
15:44:54 <ajo> we should at least run the smoke tests
15:45:16 <ihrachys> ajo: we do run them, but not while we restart services.
15:45:20 <ajo> or the network related
15:45:24 <ihrachys> ajo: it's like 'run smoke, upgrade, run tests'
15:45:29 <ihrachys> in sequence
15:45:36 <ajo> ah, you mean while upgrade?
15:45:39 <korzen> the idea of vm connectivity is to have a constant ping to the VM during upgrade; it is not covered by the smoke tests, right?
15:45:44 <ajo> ok, not sure how that would be possible.
15:45:59 <ajo> That's assumed by the idea that agents going down don't remove the dataplane settings
15:46:05 <ajo> or bring down any real service
15:46:11 <Sam-I-Am> some sort of basic check to make sure an upgrade doesnt break connectivity, and if it does, accurately determine the downtime
15:46:14 <ajo> so the only issues to arise, are when new agents boot up
15:46:20 <ihrachys> ajo: they don't, since L (at least OVS)
15:47:01 <ajo> ah, for OVS we had some before, but it was momentary ( a few seconds) and not related to upgrade itself
15:47:05 <ajo> just because of ovs agent restart
15:47:07 <ajo> (any restart)
15:47:10 <ihrachys> ajo: it may be some background service in devstack that runs while the upgrade is ongoing and writes connectivity data into some file that we can parse later.
15:47:17 <Sam-I-Am> question is, would any sort of upgrades cause agents to 'rework' the dataplane structure? for example, an architectural change to dvr.
15:47:39 <ihrachys> ajo: yeah, but it's kinda related to upgrades because you can't hot swap python code ;)
15:47:39 <ajo> Sam-I-Am : eventually..
15:48:05 <ajo> So, may be a background pinger
15:48:15 <ajo> that provides a report by the end of the upgrade
15:48:21 <ajo> ok, you convinced me
15:48:24 <Sam-I-Am> yeah. it just needs to glean the instance info
15:48:26 <ihrachys> Sam-I-Am: I agree it's worth validating. I believe we had a patch merged in master lately that changed names for LB devices for no big reason, and we haven't caught it
15:49:08 <ajo> Sam-I-Am , ihrachys , when debugging stuff like that, I found it especially useful to run a pinger in the openstack log format
15:49:13 <Sam-I-Am> it would need to test all common cases... floating ip, private ip, public ip, router ip, etc.
15:49:13 <ihrachys> I believe the more visible the subteam work is in gerrit, the more aware community will be about where to look at
15:49:23 <ajo> that way you could sync with the agent reboot logs, etc, to see when disruptions happen
15:49:39 <Sam-I-Am> probably dhcp too
15:50:07 <ihrachys> ajo: ack. as long as we have data points with time stamps, it's not an issue to sync logs with those
15:50:14 <ajo> correct
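The background pinger discussed above can be sketched as a small script that pings a VM address throughout the upgrade and logs each result with a timestamp in an openstack-like log format, so gaps can later be correlated with agent restart logs. The target address, interval, and log format below are illustrative assumptions, not an agreed design:

```python
# Hedged sketch of the "background pinger" idea: ping a VM address for
# the duration of the upgrade and log timestamped results in an
# openstack-like format for correlation with agent logs. Target,
# interval, and format are illustrative only.
import datetime
import subprocess
import time

def format_log_line(reachable, target):
    # loosely mirrors the oslo.log default: timestamp, level, message
    ts = datetime.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]
    level = "INFO" if reachable else "ERROR"
    state = "reachable" if reachable else "UNREACHABLE"
    return "%s %s pinger [-] %s is %s" % (ts, level, target, state)

def ping_once(target):
    # one ping with a 1-second timeout; True means the VM answered
    return subprocess.call(
        ["ping", "-c", "1", "-W", "1", target],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0

def run_pinger(target, logfile, interval=1.0, iterations=60):
    # append one timestamped data point per interval; the report is
    # just the resulting log file, parsed after the upgrade finishes
    with open(logfile, "a") as fh:
        for _ in range(iterations):
            fh.write(format_log_line(ping_once(target), target) + "\n")
            fh.flush()
            time.sleep(interval)
```

With timestamps on every data point, syncing the pinger output against agent restart logs is just a matter of lining up the clocks, as noted above.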
15:50:33 <Sam-I-Am> will these upgrades test both ovs and lb?
15:50:41 <ihrachys> Sam-I-Am: dhcp disruption may be fine for a tiny bit of time since we can have multiple agents to serve a network
15:50:52 <ihrachys> Sam-I-Am: I don't think we will look into LB before OVS.
15:50:56 <Sam-I-Am> ihrachys: yeah, dhcp isnt as big of an issue since access is intermittent
15:51:17 <ihrachys> Sam-I-Am: I even suspect LB may still have connectivity disruption since akamyshnikova's patch was OVS specific
15:51:21 <Sam-I-Am> ihrachys: but we should at least check for breaking it because instances do crazy things when they cant get a lease
15:51:39 <ihrachys> ack
15:51:39 <ajo> ihrachys : I think disruption was ovs specific too
15:51:55 <ihrachys> ajo: ok then I get why operators lean towards LB ;)
15:52:03 <Sam-I-Am> iirc, ovs was the only thing that had disruption during a restart of ovs-agent (not specific to upgrades)
15:52:17 <Sam-I-Am> because it re-created all of the magic
15:52:18 <ihrachys> btw one more patch that I want to point out related to upgrades is https://review.openstack.org/248190
15:52:22 <akamyshnikova> ihrachys, yes I agree with ajo, I'm not sure that LB has problems
15:52:28 <Sam-I-Am> linuxbridge has been resilient for a while now
15:52:47 <ihrachys> the patch ^ adds a command to neutron-db-manage to detect whether there are pending contract migrations that would require neutron-server instances to be offline before execution
15:52:49 <Sam-I-Am> you can kill off all the agents and the bridges stick around (sometimes too many of them stick around)
15:53:16 <ihrachys> I know puppet folks already have some code to determine whether full shutdown is needed, but they rely on ugly hacks like filenames
15:53:28 <ihrachys> the idea of the patch is to make it a supported command they can rely on
15:53:50 <ihrachys> note it's WIP and I need to play with alembic a bit before it's ready. so just a heads-up.
15:54:12 <ihrachys> Sam-I-Am: good to hear :)
15:54:49 <ihrachys> ok anything else to point out upgrades related?
15:55:12 <Sam-I-Am> i'm good for now
15:55:27 <korzen> can we slit the OVO implementation?
15:55:28 <ihrachys> Sam-I-Am: thanks for joining btw, I somehow missed you :)
15:55:38 <Sam-I-Am> ihrachys: np. somanymeetings :)
15:55:39 <ihrachys> korzen: you mean per object type?
15:55:42 <korzen> yes
15:55:50 <ihrachys> korzen: I don't see any reasons not to if we have interested parties!
15:55:57 <Sam-I-Am> ihrachys: i have a spec to improve the general openstack upgrade procedures
15:56:08 <ihrachys> korzen: I presume networks/subnets could be a separate piece to bite
15:56:13 <ihrachys> rossella_s: comments ^ ?
15:56:30 <korzen> yes, my idea is to touch the network/subnets
15:56:31 <ihrachys> Sam-I-Am: nice. link?
15:56:35 <Sam-I-Am> each release will include a list of potential issues including downtime, so having an automated upgrade test process is important
15:56:43 <rossella_s> ihrachys, I agree with you
15:57:03 <Sam-I-Am> ihrachys: http://specs.openstack.org/openstack/docs-specs/specs/mitaka/upgrades.html
15:57:27 <ihrachys> korzen: so just go ahead ^ :)
15:57:41 <korzen> ok, thx
15:57:47 <ihrachys> #link http://specs.openstack.org/openstack/docs-specs/specs/mitaka/upgrades.html
15:57:54 <ihrachys> Sam-I-Am: thanks, added to reading list :)
15:57:59 <ajo> korzen , what do you mean by slit?
15:58:04 <korzen> split*
15:58:07 <korzen> ;)
15:58:11 <ajo> upps, sorry I was reading an old log somehow
15:58:12 <ajo> :)
15:58:18 <Sam-I-Am> thx. i added your devref patch to my reading list
15:58:50 <ihrachys> ok, I believe we are near the end of the meeting. if anything, we may proceed in #openstack-neutron channel. thanks for joining and looking forward to your patches and reviews. :)
15:59:08 <ajo> ihrachys: thanks for driving this
15:59:26 <rossella_s> ihrachys, thanks a lot!
15:59:34 <mhickey> ihrachys, thanks
15:59:39 <ihrachys> and keep an eye on meeting time updates in openstack-dev@, we may need to switch if we don't get the -2 channel as the official one.
15:59:46 <ihrachys> adieu! o/
15:59:48 <amotoki> thanks!
15:59:53 <Sam-I-Am> seeya, great meeting
15:59:54 <ihrachys> #endmeeting