14:00:15 #startmeeting tripleo
14:00:16 Meeting started Tue Aug 2 14:00:15 2016 UTC and is due to finish in 60 minutes. The chair is shardy. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:17 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:20 The meeting name has been set to 'tripleo'
14:00:21 #topic rollcall
14:00:25 hi
14:00:26 o/
14:00:28 hey
14:00:28 o/
14:00:29 hello
14:00:30 Hi all, who's around?
14:00:35 o/ here for first half, have a call later
14:00:47 -o/
14:00:47 Hello guys! o/
14:00:48 o/
14:00:50 o/
14:01:13 morning
14:01:16 o/
14:01:19 o/
14:01:21 o/
14:01:24 o/
14:01:26 o/
14:01:31 o/
14:01:34 o/
14:02:02 Ok then, let's get started :)
14:02:03 ¬_¬
14:02:07 #topic agenda
14:02:09 \o
14:02:16 #link https://etherpad.openstack.org/p/tripleo-meeting-items
14:02:28 * one-off agenda items
14:02:28 * bugs
14:02:28 * Projects releases or stable backports
14:02:28 * CI
14:02:28 * Specs
14:02:30 * open discussion
14:02:38 Anyone have anything to add to the one-off items?
14:03:15 Ok, we now have two (thanks bandini!), feel free to add more and we can start
14:03:22 :)
14:03:26 #info deprecating ServiceNetMap keys?
14:03:40 shardy, added one about the promotion script as well
14:03:56 So, I've been experimenting with ways to generate things like the $service_ips list, ref https://review.openstack.org/#/c/348974/
14:04:01 sshnaidm: ack, thanks
14:04:26 And one of the barriers is that the keys for ServiceNetMap do not align with the service_name in the service templates
14:04:59 I've got a scary yaql hack in there atm which semi works around it, but ideally I'd like to deprecate the current keys, e.g. RabbitMqNetwork
14:05:04 and instead have rabbitmq_network
14:05:14 where the key must align with service_name
14:05:22 What are folks' thoughts on that?
14:05:38 o/
14:05:44 clearly, we'll need to maintain a translation (either in mistral or t-h-t) to support the old keys for a transitional period
14:05:50 since this is a very user-visible interface
14:06:24 I would prefer to do the translation in mistral to simplify t-h-t, but there are upgrade implications we'll need to address there
14:06:50 shardy: I'm okay with the approach. When we named the ServiceNetMap keys we didn't realize these would need to be standardized
14:07:36 dprince: Ok, cool - yeah, clearly it was unforeseen, I'm just trying to figure out the best path forward
14:07:47 o/
14:08:24 Ok, well unless folks have strong objections, I'll implement a mistral patch and we can continue discussion on the review
14:08:31 shardy: sounds good to me
14:08:35 +1
14:08:43 it will depend on getting the mistral-based deployments working, which I've been debugging today
14:09:01 shardy: thanks for highlighting that review, i haven't seen it before so it's hard to comment immediately, but i'd like to look at it tomorrow. i'm not clear where service_ips is coming from now but haven't looked long there
14:09:12 #info (bandini) Will just give a short update on HA NG
14:09:19 shardy: from a high-level POV do we plan to implement the upgrades via a mistral workflow (starting from M->N)?
14:09:48 bandini: you mean overcloud upgrade?
14:09:53 marios: it's a way to auto-generate this mapping https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud.yaml#L911
14:09:55 marios: yes
14:10:22 bandini: not that i've heard of, at least unless i missed it. the major expected change is rework for composable upgrades but that is newton to ocata
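
(Context for the ServiceNetMap discussion above: the user-visible rename would look roughly like the environment-file fragment below. This is a sketch, not confirmed by the meeting; the network assignment shown is only illustrative.)

```yaml
# Hypothetical sketch of the proposed ServiceNetMap rename; only one
# service is shown, and the internal_api assignment is illustrative.
parameter_defaults:
  ServiceNetMap:
    # Deprecated CamelCase key, kept working via a translation layer
    # (in mistral or t-h-t) for a transitional period:
    # RabbitMqNetwork: internal_api
    # New-style key, aligned with the service template's service_name:
    rabbitmq_network: internal_api
```
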
14:10:26 bandini: I would like to see us consider this as an option and make it composable
14:10:32 or are we keeping the current multiple manual steps with specialized heat templates around?
14:10:50 I think we're likely to run out of time for composable upgrades this cycle
14:10:50 dprince, marios: ack, was just curious if anything was decided there already. thanks
14:10:57 my assumption is we'll do it during mitaka
14:11:06 sorry, ocata :)
14:11:07 bandini: right, my understanding is: mitaka to newton, current workflow. newton is where all the things are composable. so newton to ocata is when we have composable upgrades
14:11:09 ocata ;)
14:11:15 -ENOCOFFEE
14:11:15 shardy: heh...
14:11:20 ack, got it, thanks
14:11:40 bandini: so, NG HA is pretty much ready based on your ML thread?
14:11:47 we just need reviews and testing?
14:11:55 shardy: i think that gives us summit as a good place to discuss that.. finally have that discussion i mean, about the composable and how/what/ansible/mistral etc
14:11:59 but i digress
14:12:04 so I sent a short HA NG update the other night. Basically after aodh goes in we should be good to go (the patch is extremely tiny at this point)
14:12:18 marios: Yeah, it'll give us some more time to do prototyping and consider the best approach
14:12:35 we had some help with testing from the QE HA folks, so in terms of failovers things look fairly solid so far
14:12:46 bandini: Ok, sounds good, thanks for the update!
14:13:02 what state is the aodh patch in?
14:13:03 so yeah, I'd love to merge it sooner rather than later so we have time to work out any kinks we have not seen
14:13:27 the aodh stuff is looking better, we still need to fix one last issue
14:13:34 pradk and I are working on that
14:13:40 that's all from my side ;)
14:13:52 bandini: Ok, thanks, well ping us when it's ready and passing CI, then we can hopefully get it landed
14:14:02 shardy: yes, sir! thanks
14:14:15 #info does the promotion script work? (sshnaidm)
14:14:27 so, slagle or dprince, maybe you can help with this
14:14:41 it seems the periodic jobs passed yesterday, but no promote happened for current-tripleo
14:14:53 can we see the cron job logs to understand why?
14:15:00 yes, i'll check
14:15:05 this happened last week too
14:15:19 turned out the mirror server had run out of space, so derekh had to clean some things out
14:15:32 we may have forgotten to re-enable the cron job after that
14:15:33 Ok, hopefully it's another similar issue
14:15:38 lol :)
14:16:00 Ok, thanks, well we can follow up after the meeting on that
14:16:01 slagle, where does it run?
14:16:12 sshnaidm: thanks for highlighting the issue
14:16:16 sshnaidm: the mirror server in rh2
14:16:33 Looks like it's commented out in crontab
14:16:42 if you do a nova list, you will see a server called mirror-server
14:16:52 bnemec: ok, so yea, derekh probably forgot to remove that :)
14:17:03 k, sounds like an easy fix :)
14:17:24 Any other one-off items before we move on?
14:18:21 #topic bugs
14:18:27 #link https://bugs.launchpad.net/tripleo/+milestone/newton-3
14:18:40 So, we've got under a month until n-3
14:19:00 so I wanted to ask: can everyone please prioritize reviews to help burn down the 40+ open bugs we've got targeted
14:19:05 as well as the features, obviously
14:19:49 I was going to ask gfidente about the CI-impacting ceph bug, but it looks like he's not here
14:19:58 anyone else have an update on that?
14:20:46 shardy: perhaps we could mark some additional bp's Essential, if they truly are
14:20:49 to help prioritize
14:21:45 slagle: Yeah, perhaps - tbh I think we really need to deliver all of the high ones
14:22:20 marios: we discussed breaking down the upgrades one into bugs, vs that one essential blueprint
14:22:23 and tagging the bugs
14:22:26 did that happen?
14:23:52 Ok, well let's follow up on that later then, I guess
14:24:08 shardy: sry, too many windows
14:24:45 shardy: i don't recall the tagging bit though, for the main n3 blueprint. there have been some delays
14:25:04 marios: that's the only BP we have marked essential, which means we can't release without it
14:25:06 with people pulled for mitaka work
14:25:23 ok, i will revisit and discuss with you offline
14:25:25 so we have to figure out the status of that, and either break it into bugs, or deliver it
14:25:30 marios: ack, thanks!
14:25:33 #topic Projects releases or stable backports
14:25:44 So, we already discussed n3
14:26:17 It was noticed that we've not done any stable branch releases for a while, so I was going to do that this week
14:26:25 based on a recent periodic job pass
14:26:45 does anyone have any suggestions re stable release cadence, should we be looking to release more often?
14:27:04 I think that may help folks consuming TripleO, e.g. via RDO
14:27:19 * beagles nods
14:27:39 as to what is a reasonable cadence... err
14:27:57 * beagles concedes not at all helpful
14:27:59 I'll write a script, then we can do it more regularly with minimal overhead
14:28:11 Ok, anything re releases or backports to raise?
14:28:25 shardy: in puppet modules, we try to do stable releases every month or so
14:28:29 shardy: I personally like the general idea that every patch worth backporting should be worth releasing. So stable releases should be done quite fast after we backport fixes there (obviously if there are multiple things in flight it makes sense not to release for each of them)
14:29:15 Jokke_: That would be reasonable if we automate it, but if it's human-driven a periodic release will probably be easier
14:29:27 EmilienM: ack, maybe we can start doing the same
14:29:41 #topic CI
14:29:58 We talked about the promote problems, does anyone have an update re RH1?
14:29:59 shardy: once you get the hang of the process, requesting a release is really not that big of a job ;)
14:30:04 even manually
14:30:20 I know derekh is on PTO, can anyone else give a status update?
14:30:42 Jokke_: Ya, it's still not something I want to do every day by hand ;)
14:30:51 Jobs are running on rh1
14:31:03 oh, I thought that was Jokke_ volunteering
14:31:07 With the Ceph issue I don't imagine any have passed though.
14:31:14 yes
14:31:18 tripleo-test-cloud-rh1 is online
14:31:26 with 2 max-servers at the moment
14:32:03 So what issues remain before we can increase the capacity?
14:32:10 So, if jobs are passing under tripleo-test-cloud-rh1, I increase the number of servers and start to shut off tripleo-test-cloud-rh2
14:32:21 I just need somebody to say move the jobs
14:32:48 sounds like we need to fix the ceph issue and get some green jobs before moving anything
14:32:51 Can we leave rh2 online? Based on some discussion yesterday it sounds like we may be keeping that a little longer.
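
(The "2 max-servers" above refers to nodepool's per-provider capacity setting. A minimal sketch of the relevant fragment, assuming the nodepool config layout of the time; only the capacity keys are shown, and the rh2 value is illustrative.)

```yaml
# Sketch of the nodepool provider capacity under discussion; credentials,
# images and networks are omitted. rh1's value was later raised to 50,
# per the CI discussion that follows.
providers:
  - name: tripleo-test-cloud-rh1
    max-servers: 2
  - name: tripleo-test-cloud-rh2
    max-servers: 2   # illustrative; to be drained once rh1 is proven
```
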
14:33:16 also wanted to mention it looks like my fix for https://bugs.launchpad.net/tripleo/+bug/1608226 did not work in rh2
14:33:16 Launchpad bug 1608226 in tripleo "Ci: new ntp-wait command in devstack-gate hangs for 100 minutes which breaks jobs" [Critical,New] - Assigned to Sagi (Sergey) Shnaidman (sshnaidm)
14:33:20 shardy: the ceph issue should be sorted, we had the master branch passing in a previous patchset but were hitting a different issue for the liberty/mitaka branches
14:33:34 looking at the stuff derekh added, we upped the limit to 50 servers in tripleo-test-cloud-rh1
14:33:36 so all CI jobs on rh2 are stalling for 10 minutes waiting for ntp
14:33:38 so I've updated one of the submissions and we're now waiting for the new results
14:33:46 slagle: ouch :(
14:34:07 gfidente: ack, thanks for the update
14:34:14 right, tripleo-test-cloud-rh1 doesn't have the ntp-wait issue
14:34:29 appears we don't have something upstream filtering firewalls
14:35:55 Ok, so more investigation required then
14:35:58 thanks for the update all
14:36:20 Anything else re CI - weshay, got anything to mention re the 3rd party jobs?
14:37:26 shardy, I sent an update.. we have upgrade 3rd party and basic jobs enabled, we've been watching those to make sure they are running consistently.
14:37:53 I believe apetrich is hitting issues w/ the liberty -> mitaka upgrade and has been pushing issues along
14:37:53 weshay: Ok, what's the status of the upgrade job, did we get it passing?
14:38:24 Ok, can we run mitaka->newton upgrades on upstream patches via an experimental job?
14:38:35 liberty->mitaka fails, apetrich have a bug? mitaka -> mitaka is working, mitaka -> master is failing on the overcloud
14:38:36 shardy, nope.
14:38:55 apetrich: I mean, can we run it, let it fail and see why
14:39:01 adarazs has not yet started on a RHEL-based third party job afaik
14:39:15 shardy, https://bugs.launchpad.net/tripleo/+bug/1608867
14:39:15 Launchpad bug 1608867 in tripleo "Upgrade liberty to mitaka fails" [Undecided,New]
14:39:57 shardy, bandini was helping me to see if we can find anything.
14:39:59 Ok, the reason for my question is we just said mitaka->master upgrades were our #1 priority for this cycle
14:40:01 weshay: no, not done anything to the RHEL job yet, we need to solve the log collection.
14:40:06 so we need a way to test it asap :)
14:40:27 just a note about CI, I'm working on a CI job to test undercloud upgrades from mitaka -> newton
14:40:35 you can follow the work here: https://review.openstack.org/346995
14:40:45 EmilienM: ack, that will be useful as a first step, thanks
14:40:52 On that note: https://review.openstack.org/349737
14:41:02 We need to decide _how_ we're going to handle undercloud upgrades.
14:41:12 bnemec: oh wow
14:41:29 bnemec: I'll review it today
14:41:34 #topic specs
14:41:42 * bnemec got started early
14:41:44 since we're now talking about specs :)
14:41:55 I wanted to point out, we've only landed one spec for newton
14:42:00 and it's nearly the end of the cycle
14:42:18 how can we make the spec process work better for us and encourage more reviews?
14:42:30 Or, should we instead adopt e.g. spec-lite bugs from ocata?
14:43:05 Is the problem the spec process, or the fact that we had some huge pieces of work this cycle that are taking basically the whole six months and are eating up a ton of core reviewer time?
14:43:08 I'd like to see us iterate much faster and not get bogged down for months in detailed review, when a spec is supposed to be direction, not a detailed design document
14:44:01 bnemec: the entire team has been overloaded I think, I'm not complaining, just recognising the current process isn't working
14:44:33 is it realistic to expect more core reviewer time for specs next cycle, or is this a recurring problem?
14:45:36 I would think next cycle will be somewhat less of a scramble, with the composable stuff and the API implemented.
14:45:43 Of course, I say this every cycle. :-)
14:45:46 lol
14:46:00 * shardy will believe it when it happens :D
14:46:05 Ok, well let's see how it goes
14:46:12 At least this cycle everyone was swamped with new things, not backporting to old releases.
14:46:27 bnemec: oh, we've got other ideas that'll keep us busy I think
14:46:32 i'm not sure i fully grasp the spec-lite process
14:46:34 for now, please everybody review specs so we can land those we expect to deliver before the release branches :)
14:46:39 it seems it's just specs with less detail
14:46:44 which i'd be fine with
14:46:56 just put less implementation detail into specs... just define a direction
14:47:04 slagle: basically you just raise a bug, with a description of the change, then tag it spec-lite
14:47:22 And basically we don't review it?
14:47:31 Bugs are not a good medium for design collaboration.
14:47:43 slagle: Yeah, I think it's basically a spec without the chance for 4 months of review iterations, but with a means to comment
14:47:50 shardy: ok, i thought glance was who had defined this. and you also still have to propose a "lite" spec to glance-specs
14:48:17 #link http://specs.openstack.org/openstack/glance-specs/specs/lite-specs.html
14:48:29 slagle: Ah, that's true, I evidently misinterpreted it somewhat
14:48:46 Personally I'd rather use full specs and just call people out if they're stuck in the implementation-detail weeds.
14:49:01 http://docs.openstack.org/developer/glance/contributing/blueprints.html
14:49:19 bnemec: Ok, I'm fine with sticking with them, but we must iterate faster
14:49:42 e.g. it's pointless if we're still in the nitpick review phase when the implementation is landing or landed
14:50:02 ok, regarding the conversation here, it seems that there is no chance the avail. monitoring spec will get some attention, right?
14:50:17 so we moved away from spec bugs in glance
14:50:22 paramite: i've given it attention
14:50:28 might be lack of doc update
14:50:40 paramite: well, we're trying to figure out how to give it (and others) attention
14:50:44 slagle, ah nice, thanks!
14:51:08 Jokke_: Ok, what was the reason, they didn't work out?
14:51:38 I personally would prefer all features to be blueprints, but not if we always stall at the blueprint review phase
14:51:50 makes it easier from a tracking perspective
14:52:09 To a large extent I think this is a case of "software design is hard".
14:52:20 And knowing how detailed to make your design is also hard.
14:52:23 yea, i definitely like seeing a bp for all features
14:52:35 but for some reason you can't really review bp's
14:52:38 no way to comment, etc
14:52:48 That's why the spec process came about.
14:52:52 bnemec: Yeah, but I guess the expectation I'm trying to set is that we don't need to do detailed design in specs
14:53:01 that's what code review iterations are for
14:53:08 yes, so let's just have less detail in specs :)
14:53:20 Ok then, let's do that :)
14:53:55 Maybe we could do a spec retrospective?
14:53:57 the composable services spec was a good example. A too-large blueprint that never landed, because it never made everyone happy about the details
14:54:18 Look back at the specs we worked on this cycle and consider whether the scope was too large, too small, or just right.
14:54:44 bnemec: that's a good idea, and during next cycle we could have spec review sprints (e.g. of a couple of hours) periodically?
14:54:49 It would be good if we had some examples of specs that were written at the appropriate level to point people at as an example.
14:55:07 shardy: we effectively wanted all the new features documented in a single place
14:55:10 #topic open discussion
14:55:15 shardy: Yeah, review sprints can be helpful if you can get people to participate.
14:55:33 Jokke_: Ok, that's interesting - similar to the preferences expressed here for all features to be blueprints
14:55:53 sounds like we have some good ideas for optimizations going into next cycle, thanks all
14:56:08 bnemec: maybe I should be offering free beer ;)
14:56:33 Anyone have anything else to raise this week?
14:56:44 We set the topic for the deep dive:
14:56:54 https://etherpad.openstack.org/p/tripleo-deep-dive-topics
14:56:57 shardy: having bugs as blueprints did not work out well for doing reviews, release documentation etc. It was just too confusing
14:57:05 slagle and I will cover some "undercloud under the hood" content
14:57:19 * Jokke_ has horrible lag on the connection atm. sorry for that
14:57:19 Jokke_: ack, thanks
14:57:36 Actually that reminds me, we should consider using reno
14:57:42 +1
14:57:48 It's been on my todo list for a while to set that up.
14:57:48 shardy: +1
14:58:16 Ok, we could add the reno output into t-h-t as a starting point
14:58:21 again, experience from the Glance world. It needs proactivity, a lot of it
14:58:53 and if reviewers don't drop their -1s there, those renos will just not be made
14:59:05 I just pushed up a patch to start removing private infrastructure from tripleo-ci, moving image uploads to tarballs.o.o: https://review.openstack.org/#/c/350061/
14:59:10 I'm game for reno
14:59:17 wouldn't mind some help in that effort
14:59:19 * bnemec is always happy to -1 ;-)
15:00:07 Ok, out of time, thanks all!
15:00:10 #endmeeting
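
(On the reno suggestion above: reno stores release notes as small YAML files under releasenotes/notes/, created with `reno new <slug>`. A sketch of what a note for the ServiceNetMap deprecation discussed earlier in the meeting might look like; the section names are standard reno keys, but the wording is illustrative, not an actual note from the project.)

```yaml
---
# Hypothetical reno note for the ServiceNetMap rename discussed in this
# meeting; deprecations/upgrade are standard reno sections.
deprecations:
  - |
    CamelCase ServiceNetMap keys such as RabbitMqNetwork are deprecated
    in favour of keys matching each service_name, e.g. rabbitmq_network.
upgrade:
  - |
    Environments using the old key names continue to work for a
    transitional period via a translation maintained in mistral or t-h-t.
```
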