14:00:43 <dprince> #startmeeting tripleo
14:00:44 <openstack> Meeting started Tue Apr 5 14:00:43 2016 UTC and is due to finish in 60 minutes. The chair is dprince. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:48 <openstack> The meeting name has been set to 'tripleo'
14:00:54 <derekh> howdy
14:00:59 <rdopiera> hi
14:01:05 <dprince> hi everyone
14:01:15 <matbu> \o
14:01:22 <shardy> o/
14:01:26 <beagles> o/
14:02:51 <pradk> o/
14:02:55 <qasims> o/
14:02:58 <dprince> #topic agenda
14:02:58 <dprince> * quickstart (one-off questions)
14:02:58 <dprince> * bugs
14:02:58 <dprince> * Projects releases or stable backports
14:02:58 <dprince> * CI
14:03:00 <dprince> * Specs
14:03:03 <dprince> * open discussion
14:03:22 <dprince> trown: I've added your quickstart one-off agenda item to the front of the agenda since we didn't get to it last week
14:03:33 <shardy> I also just added a topic re summit sessions
14:03:48 <trown> dprince: k, I mostly resolved it :)
14:03:50 <trown> o/
14:03:57 <dprince> trown: okay, so skip it?
14:04:15 <dprince> trown: meeting times have been tight so if it is handled we can swap it out for the schedule
14:04:15 <trown> but I have some other questions about documentation and image building
14:04:18 <dprince> shardy: ack
14:04:34 <trown> I will send to ML first and we can discuss in a later meeting
14:05:00 <dprince> okay, so this
14:05:02 <dprince> #topic agenda
14:05:02 <dprince> * bugs
14:05:02 <dprince> * Projects releases or stable backports
14:05:02 <dprince> * CI
14:05:05 <dprince> * Specs
14:05:07 <dprince> * summit sessions
14:05:10 <dprince> * open discussion
14:05:33 <dprince> oops. I meant to call that agenda2. Oh well
14:06:00 <dprince> shardy: will talk about the summit sessions right after specs
14:06:04 <dprince> lets get started
14:06:09 <shardy> dprince: thanks
14:06:11 <dprince> #topic bugs
14:06:25 <dprince> lots of bugs last week
14:06:42 <dprince> https://bugs.launchpad.net/tripleo/+bug/1566259
14:06:43 <openstack> Launchpad bug 1566259 in tripleo "Upgrades job failing with ControllerServiceChain config_settings error" [Critical,Triaged]
14:06:56 <dprince> this is currently breaking our CI the most we think
14:06:59 <shardy> So, as discussed in #tripleo, this is already fixed by a heat revert
14:07:14 <shardy> we just have to figure out how to consume a newer heat I think
14:07:14 <dprince> shardy: yep, just wanted to mention in case others weren't following
14:07:33 <shardy> yup my response also just for info if folks weren't following :)
14:08:14 <dprince> any other bugs to mention
14:08:26 <slagle> https://bugs.launchpad.net/tripleo/+bug/1537898
14:08:28 <openstack> Launchpad bug 1537898 in tripleo "HA job overcloud pingtest timeouts" [High,Triaged] - Assigned to James Slagle (james-slagle)
14:08:37 <slagle> i wanted to mention we merged a patch to address that, hopefully
14:08:48 <derekh> shardy: so options are promote, temprevert or whitelist? /me wasn't following which are we going for?
14:09:05 <slagle> but, now i'm seeing timeouts from the heat stack timeout, which defaults to 360s
14:09:12 <slagle> so we likely need another patch to bump that as well
14:09:34 <shardy> derekh: Yeah, we need to decide which - my whitelist patch didn't pass all the jobs, so all options are still on the table I think
14:09:41 <slagle> that error killed 81 jobs last week, fwiw
14:09:46 <dprince> slagle: does this not fix it https://review.openstack.org/#/c/298941/
14:09:53 <derekh> 81 ouch
14:09:58 <slagle> dprince: partially, that's what i'm saying
14:10:10 <slagle> we bumped rpc timeout to 10 minutes, but the heat stack has a timeout of 6 mins
14:10:11 <weshay> sshnaidm, ^
14:10:34 <slagle> so i was going to do another patch to bump the 6 minutes to 20 minutes
14:10:37 <jistr> also one more CI related fix is here https://review.openstack.org/#/c/301624/ -- we weren't getting logs for jobs where crm_resource --wait hung, this should fix it and perhaps allow us to debug stuck HA jobs
14:10:47 <dprince> slagle: yes. lets do all these
14:11:00 <dprince> slagle: but seriously, I think our pingtest is taking way too long
14:11:24 <sshnaidm> weshay, it's about pingtest timeouts, I meant overall timeout of the whole job
14:11:28 <slagle> dprince: indeed. only thing i've been able to deduce is that we are completely cpu bound in the testenvs
14:11:37 <shardy> wow, so that's 6 minutes to create a few neutron resources and have the server go ACTIVE
14:11:44 <shardy> it's not even including the time to boot it
14:11:59 <dprince> slagle: yeah
14:12:20 <dprince> slagle: will you push the patch to tune/bump the heat stack timeout then?
14:12:25 <slagle> yes, will do
14:12:43 <dprince> slagle: thanks
14:13:02 <dprince> okay, so with shardy's and slagle's incoming patches I think we should be in better shape
14:13:08 <dprince> any other issues this week for bugs?
14:13:15 <sshnaidm> dprince, I'm not sure if it could be considered a bug, but recently a lot of jobs are stopped in the middle and failed because of timeout, they take more than 3 hours, especially ha and upgrades. Should it be reported as a bug?
14:13:34 <trown> liberty upgrades job is broken too, but I have not looked into exactly why... more in the CI topic
14:13:43 <dprince> sshnaidm: that can be a bug. We need to stop adding to the wall time of our CI jobs.
14:14:06 <derekh> +1 to reversing our trend of adding to wall time
14:14:07 <sshnaidm> dprince, ok, then I will report it in launchpad
14:14:16 <dprince> sshnaidm: I think perhaps we even need to consider cutting from the jobs. The HA jobs are just too resource intensive
14:14:23 <shardy> well we're also getting poor performance from the CI nodes which gives rise to spurious timeouts
14:14:57 <sshnaidm> dprince, maybe it could run in periodic jobs or experimental..
14:15:20 <dprince> sshnaidm: yep, we'd need a team focussed on keeping those passing though
14:15:29 <weshay> adarazs, ^ ya we can try to back fill the coverage on ha if needed
14:15:30 <dprince> sshnaidm: but we could go at it that way
14:15:32 <shardy> both options result in folks ignoring failures unfortunately
14:15:46 <shardy> which is a topic for the CI section ;)
14:15:54 <dprince> yep, lets move on
14:16:04 <dprince> #topic Projects releases or stable backports
14:16:24 <dprince> anything needing to be discussed here this week
14:16:29 <dprince> the Mitaka releases are cut
14:16:36 <shardy> dprince: are you planning to publish a "mitaka" release around the GA of the coordinated release?
14:16:40 <dprince> branches are cut rather
14:17:02 <trown> ya there have been some key fixes to land on mitaka after the branch is cut
14:17:07 <shardy> Yeah we've cut the branches, I wasn't sure if we were going to attempt to advertise the released versions via openstack/releases
14:17:17 <dprince> shardy: sure, we can.
14:17:24 <shardy> there was some discussion re the generated docs, e.g for independent projects on openstack-dev
14:17:29 <trown> thinking of the mariadb10 compatibility patch for tht, though I think there are others
14:17:35 <shardy> in the context of kolla, but same applies to us atm
14:17:47 <shardy> http://releases.openstack.org/independent.html
14:18:01 <dprince> shardy: thinking about releases, EmilienM made me use reno for one of my puppet patches.
14:18:18 <shardy> It'd be good if we had a TripleO mitaka "release" on there, even if it's just the various repos
14:18:25 <dprince> shardy: we should consider adopting it as it would really help show what went into the release that is being cut
14:18:37 <shardy> dprince: Yeah, agreed
14:18:40 <trown> +1 for reno
14:18:49 <trown> ironic has been using it very successfully
14:18:49 <shardy> we need reno and probably either spec-lite bugs or blueprints
14:19:06 <dprince> shardy: if we adopt it early in Newton that would be most ideal
14:19:08 <shardy> so we can track both the deliverables/completed stuff and the roadmap
14:19:55 <dprince> shardy: getting the whole core team to buy into using it is important though as we'd need to enforce it during reviews
14:20:01 <dprince> but it is fairly painless I think
14:20:09 <shardy> dprince: +1 - I'll try to take a first cut at describing the process and ping the list for feedback
14:20:26 <dprince> shardy: sounds good :)
14:20:33 <shardy> agree, it shouldn't be much work at all for folks, but we need to get buy-in for sure
14:21:29 <dprince> okay, any other stable branch topics. I think we can revisit the "when to tag" topic again next week
14:21:39 <shardy> dprince: do we need a non-reno releasenotes document for Mitaka?
14:21:46 <shardy> e.g a wiki or page in TripleO docs?
14:22:01 <dprince> shardy: so I sent out a wiki patch 1 month ago to the list and asked people to fill it in
14:22:09 <dprince> shardy: that could be a start...
14:22:21 <shardy> dprince: ack, I need to revisit that ;)
14:22:52 <jistr> https://etherpad.openstack.org/p/tripleo-mitaka
14:23:03 <shardy> #link https://etherpad.openstack.org/p/tripleo-mitaka
14:23:06 <shardy> thanks jistr
14:23:21 <dprince> cool, I was looking for that as well but email was slow
14:23:25 <dprince> okay, lets move on
14:23:28 <dprince> #topic CI
14:24:07 <trown> I started working on some aggregate reporting we could integrate into the ci status page
14:24:12 <shardy> Two topics to mention, capacity (or lack thereof) and state of the periodic job and promotion of tripleo-common
14:24:17 <trown> http://chunk.io/f/9b65dfaa09dd415d97859ea16bc117a2 is a really raw form of the output
14:24:51 <matbu> I wanted to mention that i started to work on upgrade liberty->mitaka, only UC upgrades are passing atm
14:25:01 <weshay> hrybacki, ^
14:25:07 <shardy> trown: from RDO testing do you have any list we can use as a basis for tripleo-current promotion blockers atm?
14:25:12 <dprince> trown: cool. I wanted some general wall time metric graphs too https://review.openstack.org/#/c/291393/
14:25:15 <hrybacki> o/
14:25:27 <dprince> trown: if I can get that patch landed and start getting data then we could have graphs...
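A minimal sketch of the kind of aggregate report trown and dprince are discussing above, assuming job results are available as (job, passed, wall-time) records; the job names and record layout are illustrative assumptions, and real data would come from the CI status page or the job logs rather than a hard-coded list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical job records: (job name, passed?, wall time in minutes).
# In practice these would be scraped from the CI status page or job logs.
results = [
    ("gate-tripleo-ci-f22-nonha", True, 95),
    ("gate-tripleo-ci-f22-ha", False, 172),
    ("gate-tripleo-ci-f22-upgrades", False, 168),
    ("gate-tripleo-ci-f22-nonha", True, 102),
    ("gate-tripleo-ci-f22-ha", True, 159),
]

by_job = defaultdict(list)
for job, passed, minutes in results:
    by_job[job].append((passed, minutes))

for job, runs in sorted(by_job.items()):
    passes = [p for p, _ in runs]
    times = [t for _, t in runs]
    print("%-32s pass rate %3d%%  mean wall time %3d min"
          % (job, 100 * sum(passes) // len(passes), mean(times)))
```

Conditioning the HA/upgrades pass rates on changes where the nonha job passed, as trown suggests later, would simply mean filtering the records to those change sets before aggregating.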
14:25:40 <trown> shardy: RDO does not have a promote job setup for newton yet
14:25:53 <shardy> trown: Ok, thanks
14:26:20 <trown> I think the report I linked above shows we have some pretty serious issues with the upgrades and ha jobs
14:26:28 <slagle> trown: what's the conclusion from your data? everything fails the majority of the time? :)
14:26:50 <shardy> Yeah, I just wondered if anyone had made any progress on sifting through them yet
14:26:54 <trown> not that anyone didnt know that, but conditioning the success rate on the nonha job gives a pretty good approximation of real pass rate
14:26:57 <dprince> trown: yeah, the incoming patches from shardy and slagle should fix a lot of those this week we think
14:27:27 <shardy> Does anyone have any thoughts on how we improve the promotion latency in future?
14:27:30 <trown> dprince: ya it would be nice to have some metrics to back that up
14:27:35 <dprince> trown: yeah, our existing CI reports already give you a feel on this
14:27:47 <shardy> atm it appears we need a ton of folks working on fixes every time the periodic job breaks
14:28:00 <derekh> The problem is that transient bugs will continue to creep in, until we have somebody keeping on top of it all the time, we'll keep getting back to a bad state
14:28:00 <shardy> but I get the impression we don't even have the feedback for folks to know it's failed atm
14:28:07 <dprince> trown: I wanted wall times because jobs that take hours are unacceptable. And as we tune things I wanted to see clear improvements
14:28:13 <dprince> caching, etc.
14:28:22 <trown> dprince: it is pretty fuzzy though... there is a big difference between 20% success and 60% success but I dont know that I can look at the status page and see that
14:28:24 <shardy> derekh: would a nag-bot in #tripleo help at least raise visibility?
14:29:08 <slagle> shardy: that would raise visibility, i dont know that it would encourage anyone to do anything about it
14:29:31 <sshnaidm> I think in addition to statistics we also need a breakdown by reason, like: timeouts - 20%, overcloud fails - 40%, infrastructure - 30%, etc
14:29:31 <shardy> slagle: I guess that's my question, how can we, and does anyone have time to help
14:29:32 <trown> from experience it requires a ton of work
14:29:43 <derekh> shardy: we should aim for a success rate of 80%+ on all the jobs, and a nag bot might work that starts nagging if we drop below that
14:29:44 <dprince> shardy: we could just have someone (say bnemec) go and -2 everyone's patches if CI is broken
14:29:49 <dprince> shardy: that would get attention I think
14:29:55 <bnemec> Hey now.
14:30:07 <shardy> like, it's really bad that we fixed a heat issue ~2weeks ago after it was reported by folks here, yet it's still biting us now :(
14:30:09 <dprince> bnemec: I choose you randomly sir :)
14:30:10 <bnemec> It's as if you think I'm a -2 bot. :-P
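A minimal sketch of the nag-bot idea shardy and derekh float above, assuming a recent pass/fail history per job is already available from somewhere (Zuul, Jenkins, or the CI status page); the 80% threshold is the target derekh mentions, while the function name and message wording are made up for illustration, and a real bot would need an IRC client on top of this.

```python
NAG_THRESHOLD = 0.80  # derekh's suggested target pass rate


def nag_message(job_name, recent_results, threshold=NAG_THRESHOLD):
    """Return a nag line for #tripleo if the job's pass rate is too low.

    recent_results is a list of booleans (True = the job passed); where
    that history comes from (Zuul, Jenkins, the CI status page) is left
    open in this sketch.
    """
    if not recent_results:
        return None
    rate = sum(recent_results) / len(recent_results)
    if rate < threshold:
        return ("%s pass rate is %.0f%% over the last %d runs (target %.0f%%)"
                " -- please stop rechecking and help find the cause"
                % (job_name, 100 * rate, len(recent_results), 100 * threshold))
    return None


# Example: 3 passes in the last 10 runs trips the nag.
print(nag_message("gate-tripleo-ci-f22-ha", [True] * 3 + [False] * 7))
```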
14:30:22 <dprince> bnemec: but you do have the highest -2's I think too
14:30:58 <derekh> shardy: yup, its been 2 weeks since we last promoted, we need to be spending time keeping that below 2 days
14:31:19 <trown> I do think a nag bot that just says 'stop rechecking' if we fall below a certain success rate is a good idea
14:31:22 <shardy> I guess if we can't keep the latency for promotion very low, then we have to figure out reinstating tmprevert functionality
14:31:34 <shardy> as otherwise we've got no way to consume e.g reverts from heat or wherever
14:31:39 <trown> though 80% is quite ambitious given the status quo
14:32:12 <shardy> derekh: A lot of our failures appear capacity/performance related - what's the current plan re upgrading of hardware?
14:32:51 <shardy> Or is it a question of reducing coverage to align with capacity?
14:33:11 <derekh> trown: back in the day, when we used to keep on top of this, I used to start looking for problems if it fell below 90% (although admittedly it never stayed above 90% for long)
14:33:13 <shardy> if we do that, I guess we'll need to rely on third-party CI for some tests
14:33:59 <derekh> shardy: we're in the process of getting more memory for the machines but I wouldn't expect it until June'ish
14:34:04 <trown> derekh: also, I see we have a bunch of incremental patches to cache artifacts... what if we just built an undercloud.qcow2 using tripleo-quickstart (with upstream repo setup) in the periodic job
14:34:12 <trown> that would dramatically lower wall time
14:34:39 <trown> and could be completed much faster than redoing all of that work incrementally
14:34:40 <dprince> derekh: best effort is to go after the caching. Speeding up the test time helps us on all fronts I think
14:35:13 <trown> it also solves a problem with the approved spec... how do I document using tripleo-quickstart if the only image is in RDO
14:35:21 <dprince> derekh: i.e. you are already working on the most important things I think
14:35:51 <slagle> keep in mind that as we keep bumping up timeouts, the job times are going to go up
14:35:56 <slagle> hopefully they will pass though :)
14:36:09 <derekh> trown: will take a look at tripleo-quickstart and see what could be done
14:36:14 <trown> right... band aids on band aids
14:36:15 <dprince> derekh: regarding adding the extra memory... I'm not sure that helps all cases as some things seem to be CPU bound
14:36:18 <bnemec> Cutting down image build time may actually make the CPU situation worse though.
14:36:50 <dprince> slagle: that is why until we get some caching in place to significantly reduce the wall time I think we might need to disable the delete and/or extra ping tests
14:36:51 <shardy> slagle: that's the problem tho, we're bumping individual timeouts, only to run into the infra global timeout
14:36:55 <bnemec> Not saying we shouldn't do it, but it may not be the silver bullet that makes everything pass all the time again.
14:36:55 <trown> bnemec: I see where you are going, but less wall time means less load overall
14:37:03 <jistr> +1 on caching. Jobs that don't test a DIB or image elements change don't need to build their own images IIUC. That would save a lot of wall time.
14:37:07 <slagle> shardy: yea indeed
14:37:10 <derekh> dprince: yup, we seem to be CPU bound, but we haven't always been, I think something has happened at some point demanding more CPU in general
14:37:21 <dprince> bnemec: if it doesn't work. At least we'd know more quickly
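To illustrate the caching point jistr raises (jobs that don't touch image-building code shouldn't need to rebuild images), here is a hypothetical sketch of a check a job could run against the list of files a change touches; the path prefixes are guesses for illustration, not the actual layout tripleo-ci uses, and where cached artifacts would be published is left open.

```python
# Path prefixes that plausibly invalidate cached images; everything else
# could reuse a previously built overcloud/undercloud image. The prefixes
# are illustrative guesses, not the actual repo layout used by tripleo-ci.
IMAGE_SENSITIVE_PREFIXES = (
    "elements/",            # image elements
    "diskimage_builder/",   # changes to DIB itself
    "image-yaml/",
)


def can_reuse_cached_images(changed_files):
    """Return True if none of the changed files should trigger a rebuild."""
    return not any(
        path.startswith(IMAGE_SENSITIVE_PREFIXES) for path in changed_files
    )


# Example: a templates-only change could skip the image build entirely.
print(can_reuse_cached_images(["puppet/controller.yaml"]))       # True
print(can_reuse_cached_images(["elements/overcloud-full/foo"]))  # False
```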
14:37:31 <slagle> derekh: did the disabling of turbo mode help with the throttling at all?
14:37:39 <slagle> i haven't checked the nodes myself
14:38:03 <dprince> derekh: both the HA and the upgrades jobs are running 3 controllers w/ pacemaker right?
14:38:12 <derekh> slagle: I checked yesterday, was still seeing throttling warnings
14:38:14 <dprince> derekh: more HA, controller nodes in the mix now...
14:38:35 <derekh> dprince: yup
14:38:36 <trown> hmm... do we need both of those jobs then?
14:39:32 <jistr> just regarding the differences between the two -- one is ipv4, the other is ipv6 i think. And upgrades job will eventually really test upgrades.
14:39:39 <bnemec> What about a one node pacemaker cluster for the upgrades job?
14:39:43 <dprince> trown: we could consider cutting one of them.
14:40:07 <sshnaidm> upgrades job has 3 nodes in total, HA has 4
14:40:17 <trown> dprince: we are in a pretty bad spot atm, so I think we should consider drastic measures
14:40:18 <dprince> trown: and swap in a periodic job. but it'd probably fail a lot more then
14:40:33 <dprince> trown: I'm not arguing :/
14:40:48 <trown> :)
14:41:04 <dprince> trown: we've overstepped our bounds for sure. Even thinking about adding more to CI at this point (like tempest) is not where we are at
14:41:09 <derekh> before we go zapping jobs and moving stuff around I'd love if we could first find out if something in particular is hogging the resources
14:41:41 <dprince> derekh: agree, we are sort of poking about with our eyes closed here
14:41:42 <jistr> +1
14:42:41 <dprince> #action derekh is going to turn the lights on in the CI cloud so we can see what is happening
14:42:44 <dprince> :)
14:43:01 <derekh> looks like upgrades is 1 node
14:43:05 <derekh> dprince: thanks
14:43:13 <derekh> *1 node pacemaker
14:44:00 <dprince> derekh: so are you suggesting we switch upgrades to 1 node?
14:44:29 <derekh> dprince: Nope, I was just pointing out that it already is a one node pacemaker cluster
14:44:40 <rdopiera> by the way, just a random idea, I'm not sure if it's done or not, but maybe before the actual installation and tests are even started, there could be a quick sanity check for the environment?
14:44:43 <derekh> dprince: bnemec suggested we try it
14:44:53 <dprince> derekh: oh, I see.
14:45:04 <dprince> derekh: I had forgotten that as well
14:45:08 <rdopiera> sorry if that's not what you are talking about
14:45:44 <bnemec> Must be a good idea if we're already doing it. :-)
14:45:47 <trown> we have 2 more topics, though it seems we could have a weekly CI meeting that is an hour on its own
14:46:00 <dprince> rdopiera: the environments are functionally fine. Just under load
14:46:07 <derekh> rdopiera: I'm not sure what we would sanity test
14:46:10 <dprince> rdopiera: or so we think
14:46:11 <trown> not an awful idea really... to have a CI subteam that just reports back to this meeting
14:46:22 <rdopiera> dprince: I see, forget it then
14:46:57 <rdopiera> dprince: my thinking was that then you would have fewer jobs even starting -- because the ones that would fail anyways wouldn't even start
14:46:58 <dprince> trown: right, we can cut it off soon. I give CI the bulk of the time I think because it is blocking all of us. And we should all probably look at it more
14:47:31 <trown> dprince: yep I agree, and think it probably needs even more time than a half hour a week
14:47:33 <dprince> rdopiera: yeah, well we could just scale back the number of concurrent jobs we allow to be running
14:47:44 <sshnaidm> +1 for separate CI meeting
14:48:16 <derekh> trown: I think this would be their brief, 1. the periodic job failed last night, find out why and fix it. 2. bug X is causing a lot of failures, find out why and fix. 3. find places we're wasting time and make things run faster
14:48:48 <dprince> derekh: yep, that is a good start
14:50:04 <dprince> okay, lets move on to specs then. We can continue CI talk in #tripleo afterwards....
14:50:11 <dprince> #topic Specs
14:50:45 <dprince> still very close to landing https://review.openstack.org/#/c/280407/
14:51:34 <shardy> dprince: I'm +2 but it'd be good to see the nits mentioned fixed
14:51:50 <dprince> shardy: yep, not rushing it or anything
14:52:25 <dprince> any other specs this week?
14:52:56 <dprince> slagle: shardy do you guys want me to abandon the composable services spec? Or make another pass at updating it again?
14:53:52 <shardy> dprince: Your call, I'm happy to focus on the patches at this point
14:54:11 <dprince> shardy: yeah, that is what I've been doing anyways.
14:54:15 <slagle> yea, i'm fine either way
14:54:32 <slagle> i would like to see us agree to get updating to the patches being tested in CI first
14:54:46 <slagle> but i dont know about the feasibility of that given all the CI issues
14:55:11 <dprince> so if specs aren't helpful (which it seems like this one provoked a lot of questions)
14:55:16 <slagle> i feel these patches have a potential to break stack-updates, and we won't know until we try towards the end of the cycle
14:55:18 <dprince> then we don't have to do them
14:55:22 <slagle> then we're faced with a mound of problems
14:55:32 <dprince> but having a Spec to point to for people asking about high level ideas is useful
14:56:05 <dprince> For this particular thing... the documentation can go in tripleo-docs however. Because it creates a sort of "interface" that service developers can plug into.
14:56:08 <trown> I think a spec for something as big as composable roles makes a lot of sense
14:56:09 <dprince> so I can document it there
14:56:33 <shardy> slagle: agree - hopefully we can get the upgrades job providing some of that coverage asap
14:56:36 <dprince> trown: the issue with specs is people want to get into implementation details. and the core team doesn't (often) agree on those :/
14:56:50 <dprince> trown: and a review is a better place to hash out implementation details...
14:57:01 <dprince> anyways.
14:57:11 <dprince> #topic open discussion
14:57:24 <dprince> shardy: do you want to mention the summit sessions etherpad?
14:57:37 <shardy> Yeah can folks please add summit topics to here:
14:57:39 <shardy> https://etherpad.openstack.org/p/newton-tripleo-sessions
14:57:54 <shardy> we have 5 sessions in total, plus a half-day meetup
14:58:22 <shardy> we need to decide the top-5 topics, and I'll report those so the official agenda can be updated
14:58:30 <dprince> shardy: looks almost like what we talked about last time :)
14:58:37 <shardy> Yeah :(
14:59:08 <shardy> I'll ping the ML too, I'd suggest we work out consensus via the ML given we always run out of time in these meetings
14:59:20 <dprince> yep, sounds good
14:59:29 <shardy> for newton I'd like to consider fewer repeated agenda items to enable this as a better sync-point
14:59:37 <dprince> thanks everyone, and lets continue the CI discussions in #tripleo to get it fixed ASAP
15:00:06 <dprince> #endmeeting