14:00:43 <dprince> #startmeeting tripleo
14:00:44 <openstack> Meeting started Tue Apr  5 14:00:43 2016 UTC and is due to finish in 60 minutes.  The chair is dprince. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:45 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:48 <openstack> The meeting name has been set to 'tripleo'
14:00:54 <derekh> howdy
14:00:59 <rdopiera> hi
14:01:05 <dprince> hi everyone
14:01:15 <matbu> \o
14:01:22 <shardy> o/
14:01:26 <beagles> o/
14:02:51 <pradk> o/
14:02:55 <qasims> o/
14:02:58 <dprince> #topic agenda
14:02:58 <dprince> * quickstart (one-off questions)
14:02:58 <dprince> * bugs
14:02:58 <dprince> * Projects releases or stable backports
14:02:58 <dprince> * CI
14:03:00 <dprince> * Specs
14:03:03 <dprince> * open discussion
14:03:22 <dprince> trown: I've added your quickstart one-off agenda item to the front of the agenda since we didn't get to it last week
14:03:33 <shardy> I also just added a topic re summit sessions
14:03:48 <trown> dprince: k, I mostly resolved it :)
14:03:50 <trown> o/
14:03:57 <dprince> trown: okay, so skip it?
14:04:15 <dprince> trown: meeting times have been tight so if it is handled we can swap it out of the schedule
14:04:15 <trown> but I have some other questions about documentation and image building
14:04:18 <dprince> shardy: ack
14:04:34 <trown> I will send to ML first and we can discuss in a later meeting
14:05:00 <dprince> okay, so this
14:05:02 <dprince> #topic agenda
14:05:02 <dprince> * bugs
14:05:02 <dprince> * Projects releases or stable backports
14:05:02 <dprince> * CI
14:05:05 <dprince> * Specs
14:05:07 <dprince> * summit sessions
14:05:10 <dprince> * open discussion
14:05:33 <dprince> oops. I meant to call that agenda2. Oh well
14:06:00 <dprince> shardy: we'll talk about the summit sessions right after specs
14:06:04 <dprince> lets get started
14:06:09 <shardy> dprince: thanks
14:06:11 <dprince> #topic bugs
14:06:25 <dprince> lots of bugs last week
14:06:42 <dprince> https://bugs.launchpad.net/tripleo/+bug/1566259
14:06:43 <openstack> Launchpad bug 1566259 in tripleo "Upgrades job failing with ControllerServiceChain config_settings error" [Critical,Triaged]
14:06:56 <dprince> this is currently breaking our CI the most we think
14:06:59 <shardy> So, as discussed in #tripleo, this is already fixed by a heat revert
14:07:14 <shardy> we just have to figure out how to consume a newer heat I think
14:07:14 <dprince> shardy: yep, just wanted to mention in case others weren't following
14:07:33 <shardy> yup my response also just for info if folks weren't following :)
14:08:14 <dprince> any other bugs to mention
14:08:26 <slagle> https://bugs.launchpad.net/tripleo/+bug/1537898
14:08:28 <openstack> Launchpad bug 1537898 in tripleo "HA job overcloud pingtest timeouts" [High,Triaged] - Assigned to James Slagle (james-slagle)
14:08:37 <slagle> i wanted to mention we merged a patch to address that, hopefully
14:08:48 <derekh> shardy: so options are promote, temprevert or whitelist? /me wasn't following which are we going for?
14:09:05 <slagle> but, now i'm seeing timeouts from the heat stack timeout, which defaults to 360s
14:09:12 <slagle> so we likely need another patch to bump that as well
14:09:34 <shardy> derekh: Yeah, we need to decide which - my whitelist patch didn't pass all the jobs, so all options are still on the table I think
14:09:41 <slagle> that error killed 81 jobs last week, fwiw
14:09:46 <dprince> slagle: does this not fix it https://review.openstack.org/#/c/298941/
14:09:53 <derekh> 81 ouch
14:09:58 <slagle> dprince: partially, that's what i'm saying
14:10:10 <slagle> we bumped rpc timeout to 10 minutes, but the heat stack has a timeout of 6 mins
14:10:11 <weshay> sshnaidm, ^
14:10:34 <slagle> so i was going to do another patch to bump the 6 minutes to 20 minutes
14:10:37 <jistr> also one more CI related fix is here https://review.openstack.org/#/c/301624/ -- we weren't getting logs for jobs where crm_resource --wait hung; this should fix it and perhaps allow us to debug stuck HA jobs
14:10:47 <dprince> slagle: yes. lets do all these
14:11:00 <dprince> slagle: but seriously, I think our pingtest is taking way too long
14:11:24 <sshnaidm> weshay, it's about pingtest timeouts, I meant overall timeout of the whole job
14:11:28 <slagle> dprince: indeed. only thing i've been able to deduce is that we are completely cpu bound in the testenvs
14:11:37 <shardy> wow, so that's 6 minutes to create a few neutron resources and have the server go ACTIVE
14:11:44 <shardy> it's not even including the time to boot it
14:11:59 <dprince> slagle: yeah
14:12:20 <dprince> slagle: will you push the patch to tune/bump the heat stack timeout then?
14:12:25 <slagle> yes, will do
14:12:43 <dprince> slagle: thanks
14:13:02 <dprince> okay, so with shardy's and slagle's incoming patches I think we should be in better shape
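A minimal sketch of the two timeout knobs under discussion, for context: rpc_response_timeout and the --timeout flag are the standard names, but the template name, stack name, exact values, and whether the CI patches touch exactly these are assumptions rather than details taken from the reviews above.

    # "rpc timeout" (an oslo.messaging option) in heat.conf on the node running heat-engine:
    #   [DEFAULT]
    #   rpc_response_timeout = 600
    #
    # Per-stack creation timeout (in minutes) for the pingtest tenant stack;
    # without raising it, the create can hit the 6 minute stack timeout
    # mentioned above before the longer RPC timeout ever matters:
    openstack stack create --template pingtest.yaml --timeout 20 pingtest_stack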
14:13:08 <dprince> any other issues this week for bugs?
14:13:15 <sshnaidm> dprince, I'm not sure if it could be considered a bug, but recently a lot of jobs are stopped in the middle and fail because of a timeout; they take more than 3 hours, especially ha and upgrades. Should it be reported as a bug?
14:13:34 <trown> liberty upgrades job is broken too, but I have not looked into exactly why... more in the CI topic
14:13:43 <dprince> sshnaidm: that can be a bug. We need to stop adding to the wall time of our CI jobs.
14:14:06 <derekh> +1 to reversing our trend of adding to wall time
14:14:07 <sshnaidm> dprince, ok, then I will report it in launchpad
14:14:16 <dprince> sshnaidm: I think perhaps we even need to consider cutting from the jobs. The HA jobs are just too resource intensive
14:14:23 <shardy> well we're also getting poor performance from the CI nodes which gives rise to spurious timeouts
14:14:57 <sshnaidm> dprince, maybe it could run in periodic jobs or experimental..
14:15:20 <dprince> sshnaidm: yep, we'd need a team focussed on keeping those passing though
14:15:29 <weshay> adarazs, ^  ya we can try to back fill the coverage on ha if needed
14:15:30 <dprince> sshnaidm: but we could go at it that way
14:15:32 <shardy> both options result in folks ignoring failures unfortunately
14:15:46 <shardy> which is a topic for the CI section ;)
14:15:54 <dprince> yep, lets move on
14:16:04 <dprince> #topic Projects releases or stable backports
14:16:24 <dprince> anything needing to be discussed here this week
14:16:29 <dprince> the Mitaka releases are cut
14:16:36 <shardy> dprince: are you planning to publish a "mitaka" release around the GA of the coordinated release?
14:16:40 <dprince> branches are cut rather
14:17:02 <trown> ya there have been some key fixes to land on mitaka after the branch is cut
14:17:07 <shardy> Yeah we've cut the branches, I wasn't sure if we were going to attempt to advertise the released versions via openstack/releases
14:17:17 <dprince> shardy: sure, we can.
14:17:24 <shardy> there was some discussion re the generated docs, e.g. for independent projects, on openstack-dev
14:17:29 <trown> thinking of the mariadb10 compatibility patch for tht, though I think there are others
14:17:35 <shardy> in the context of kolla, but same applies to us atm
14:17:47 <shardy> http://releases.openstack.org/independent.html
14:18:01 <dprince> shardy: thinking about releases, EmilienM made me use reno for one of my puppet patches.
14:18:18 <shardy> It'd be good if we had a TripleO mitaka "release" on there, even if it's just the various repos
14:18:25 <dprince> shardy: we should consider adopting it as it would really help show what went into the release that is being cut
14:18:37 <shardy> dprince: Yeah, agreed
14:18:40 <trown> +1 for reno
14:18:49 <trown> ironic has been using it very successfully
14:18:49 <shardy> we need reno and probably either spec-lite bugs or blueprints
14:19:06 <dprince> shardy: if we adopt it early in Newton that would be most ideal
14:19:08 <shardy> so we can track both the deliverables/completed stuff and the roadmap
14:19:55 <dprince> shardy: getting the whole core team to buy into using it is important though, as we'd need to enforce it during reviews
14:20:01 <dprince> but it is fairly painless I think
14:20:09 <shardy> dprince: +1 - I'll try to take a first-cut at describing the process and ping the list for feedback
14:20:26 <dprince> shardy: sounds good :)
14:20:33 <shardy> agree, it shouldn't be much work at all for folks, but we need to get buy-in for sure
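For readers unfamiliar with reno, the workflow being proposed is roughly the following; the slug and the note text are made up for illustration:

    # add reno to test-requirements.txt, then for each change worth noting:
    reno new bump-pingtest-timeout
    # this creates a file such as
    # releasenotes/notes/bump-pingtest-timeout-<random>.yaml, which you edit:
    #   ---
    #   fixes:
    #     - |
    #       The pingtest stack timeout was raised so CI runs on loaded test
    #       environments no longer fail spuriously.
    # release notes for the current branch can then be rendered with:
    reno report .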
14:21:29 <dprince> okay, any other stable branch topics. I think we can revisit the "when to tag" topic again next week
14:21:39 <shardy> dprince: do we need a non-reno releasenotes document for Mitaka?
14:21:46 <shardy> e.g a wiki or page in TripleO docs?
14:22:01 <dprince> shardy: so I sent out a wiki patch 1 month ago to the list and asked people to fill it in
14:22:09 <dprince> shardy: that could be a start...
14:22:21 <shardy> dprince: ack, I need to revisit that ;)
14:22:52 <jistr> https://etherpad.openstack.org/p/tripleo-mitaka
14:23:03 <shardy> #link https://etherpad.openstack.org/p/tripleo-mitaka
14:23:06 <shardy> thanks jistr
14:23:21 <dprince> cool, I was looking for that as well but email was slow
14:23:25 <dprince> okay, lets move on
14:23:28 <dprince> #topic CI
14:24:07 <trown> I started working on some aggregate reporting we could integrate into the ci status page
14:24:12 <shardy> Two topics to mention: capacity (or lack thereof), and the state of the periodic job and promotion of tripleo-common
14:24:17 <trown> http://chunk.io/f/9b65dfaa09dd415d97859ea16bc117a2 is a really raw form of the output
14:24:51 <matbu> I wanted to mention that I started to work on the liberty->mitaka upgrade; only the UC upgrade is passing atm
14:25:01 <weshay> hrybacki, ^
14:25:07 <shardy> trown: from RDO testing do you have any list we can use as a basis for tripleo-current promotion blockers atm?
14:25:12 <dprince> trown: cool. I wanted some general wall time metric graphs too https://review.openstack.org/#/c/291393/
14:25:15 <hrybacki> o/
14:25:27 <dprince> trown: if I can get that patch landed and start getting data then we could have graphs...
14:25:40 <trown> shardy: RDO does not have a promote job setup for newton yet
14:25:53 <shardy> trown: Ok, thanks
14:26:20 <trown> I think the report I linked above shows we have some pretty serious issues with the upgrades and ha jobs
14:26:28 <slagle> trown: what's the conclusion from your data? everything fails the majority of the time? :)
14:26:50 <shardy> Yeah, I just wondered if anyone had made any progress on sifting through them yet
14:26:54 <trown> not that anyone didnt know that, but conditioning the success rate on the nonha job gives a pretty good approximation of real pass rate
14:26:57 <dprince> trown: yeah, the incoming patches from shardy and slagle should fix a lot of those this week we think
14:27:27 <shardy> Does anyone have any thoughts on how we improve the promotion latency in future?
14:27:30 <trown> dprince: ya it would be nice to have some metrics to back that up
14:27:35 <dprince> trown: yeah, our existing CI reports already give you a feel for this
14:27:47 <shardy> atm it appears we need a ton of folks working on fixes every time the periodic job breaks
14:28:00 <derekh> The problem is that transient bugs will continue to creep in; until we have somebody keeping on top of it all the time, we'll keep getting back to a bad state
14:28:00 <shardy> but I get the impression we don't even have the feedback for folks to know it's failed atm
14:28:07 <dprince> trown: I wanted wall times because jobs that take hours are unacceptable. And as we tune things I wanted to see clear improvements
14:28:13 <dprince> caching, etc.
14:28:22 <trown> dprince: it is pretty fuzzy though... there is a big difference between 20% success and 60% success but I don't know that I can look at the status page and see that
14:28:24 <shardy> derekh: would a nag-bot in #tripleo help at least raise visibility?
14:29:08 <slagle> shardy: that would raise visibility, i dont know that it would encourage anyone to do anything about it
14:29:31 <sshnaidm> I think in addition to the statistics we also need failure-reason statistics, like: timeouts - 20%, overcloud fails - 40%, infrastructure - 30%, etc.
14:29:31 <shardy> slagle: I guess that's my question, how can we, and does anyone have time to help
14:29:32 <trown> from experience it requires a ton of work
14:29:43 <derekh> shardy: we should aim for a success rate of 80%+ on all the jobs, and a nag bot that starts nagging if we drop below that might work
14:29:44 <dprince> shardy: we could just have someone (say bnemec) go and -2 everyones patches if CI is broken
14:29:49 <dprince> shardy: that would get attention I think
14:29:55 <bnemec> Hey now.
14:30:07 <shardy> like, it's really bad that we fixed a heat issue ~2weeks ago after it was reported by folks here, yet it's still biting us now :(
14:30:09 <dprince> bnemec: I choose you randomly sir :)
14:30:10 <bnemec> It's as if you think I'm a -2 bot. :-P
14:30:22 <dprince> bnemec: but you do have the highest -2's I think too
14:30:58 <derekh> shardy: yup, it's been 2 weeks since we last promoted; we need to be spending time keeping that below 2 days
14:31:19 <trown> I do think a nag bot that just says 'stop rechecking' if we fall below a certain success rate is a good idea
14:31:22 <shardy> I guess if we can't keep the latency for promotion very low, then we have to figure out reinstating tmprevert functionality
14:31:34 <shardy> as otherwise we've got no way to consume e.g reverts from heat or wherever
14:31:39 <trown> though 80% is quite ambitious given the status quo
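A rough sketch of what such a nag bot could check; the results file, its format and the notification hook are all hypothetical, this only shows the shape of the idea:

    # pass rate over the last 50 HA job runs, from an imaginary "job,status" CSV
    rate=$(grep '^ha,' results.csv | tail -n 50 |
           awk -F, '{ n++; if ($2 == "SUCCESS") ok++ } END { if (n) print int(ok * 100 / n); else print 0 }')
    if [ "$rate" -lt 80 ]; then
        # in practice this message would be delivered to #tripleo by a bot
        echo "HA job pass rate is ${rate}% - please stop rechecking and help debug"
    fi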
14:32:12 <shardy> derekh: A lot of our failures appear capacity/performance related - what's the current plan re upgrading of hardware?
14:32:51 <shardy> Or is it a question of reducing coverage to align with capacity?
14:33:11 <derekh> trown: back in the day, when we used to keep on top of this, I used to start looking for problems if it fell below 90% (although admittedly it never stayed above 90% for long)
14:33:13 <shardy> if we do that, I guess we'll need to rely on third-party CI for some tests
14:33:59 <derekh> shardy: we're in the process of getting more memory for the machine but I wouldn't expect it until June'ish
14:34:04 <trown> derekh: also, I see we have a bunch of incremental patches to cache artifacts... what if we just built an undercloud.qcow2 using tripleo-quickstart (with upstream repo setup) in the periodic job
14:34:12 <trown> that would dramatically lower wall time
14:34:39 <trown> and could be completed much faster than redoing all of that work incrementally
14:34:40 <dprince> derekh: best effort is to go after the caching. Speeding up the test time helps us on all fronts I think
14:35:13 <trown> it also solves a problem with the approved spec... how do I document using tripleo-quickstart if the only image is in RDO
14:35:21 <dprince> derekh: i.e. you are already working on the most important things I think
14:35:51 <slagle> keep in mind that as we keep bumping up timeouts, the job times are going to go up
14:35:56 <slagle> hopefully they will pass though :)
14:36:09 <derekh> trown: will take a look at tripleo-quickstart and see what could be done
14:36:14 <trown> right... band aids on band aids
14:36:15 <dprince> derekh: regarding adding the extra memory... I'm not sure that helps all cases as some things seem to be CPU bound
14:36:18 <bnemec> Cutting down image build time may actually make the CPU situation worse though.
14:36:50 <dprince> slagle: that is why, until we get some caching in place to significantly reduce the wall time, I think we might need to disable the delete and/or extra ping tests
14:36:51 <shardy> slagle: that's the problem tho, we're bumping individual timeouts, only to run into the infra global timeout
14:36:55 <bnemec> Not saying we shouldn't do it, but it may not be the silver bullet that makes everything pass all the time again.
14:36:55 <trown> bnemec: I see where you are going, but less wall time, means less load overall
14:37:03 <jistr> +1 on caching. Jobs that don't test a DIB or image-elements change don't need to build their own images IIUC. That would save a lot of wall time.
14:37:07 <slagle> shardy: yea indeed
14:37:10 <derekh> dprince: yup, we seem to be CPU bound, but we haven't always been; I think something has happened at some point that's demanding more CPU in general
14:37:21 <dprince> bnemec: if it doesn't work, at least we'd know more quickly
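A very rough sketch of the image-caching idea trown floated above; every host, path and variable here is hypothetical, it is only meant to show the split between the periodic job and the per-patch jobs:

    # periodic job: after a successful run, publish the undercloud image it
    # built so other jobs can reuse it
    virt-sparsify undercloud.qcow2 undercloud-sparse.qcow2
    scp undercloud-sparse.qcow2 cache.example.org:/srv/images/undercloud-${DLRN_HASH}.qcow2

    # per-patch jobs that don't touch DIB or the image elements: skip the
    # image build entirely and start from the cached artifact
    curl -o undercloud.qcow2 http://cache.example.org/images/undercloud-${DLRN_HASH}.qcow2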
14:37:31 <slagle> derekh: did the disabling of turbo mode help with the throttling at all?
14:37:39 <slagle> i haven't checked the nodes myself
14:38:03 <dprince> derekh: both the HA and the upgrades jobs are running 3 controllers w/ pacemaker right?
14:38:12 <derekh> slagle: I checked yesterday, was still seeing throttling warnings
14:38:14 <dprince> derekh: more HA controller nodes in the mix now...
14:38:35 <derekh> dprince: yup
14:38:36 <trown> hmm... do we need both of those jobs then?
14:39:32 <jistr> just regarding the differences between the two -- one is ipv4, the other is ipv6 i think. And upgrades job will eventually really test upgrades.
14:39:39 <bnemec> What about a one node pacemaker cluster for the upgrades job?
14:39:43 <dprince> trown: we could consider cutting one of them.
14:40:07 <sshnaidm> the upgrades job has 3 nodes in total, HA has 4
14:40:17 <trown> dprince: we are in a pretty bad spot atm, so I think we should consider drastic measures
14:40:18 <dprince> trown: and swap in a periodic job. but it'd probably fail a lot more then
14:40:33 <dprince> trown: I'm not arguing :/
14:40:48 <trown> :)
14:41:04 <dprince> trown: we've overstepped our bounds for sure. Even thinking about adding more to CI at this point (like tempest) is not where we are at
14:41:09 <derekh> before we go zapping jobs and moving stuff around I'd love if we could first find out if something in particular is hogging the resources
14:41:41 <dprince> derekh: agree, we are sort of poking about with our eyes closed here
14:41:42 <jistr> +1
14:42:41 <dprince> #action derekh is going to turn the lights on in the CI cloud so we can see what is happening
14:42:44 <dprince> :)
14:43:01 <derekh> looks like upgrades is 1 node
14:43:05 <derekh> dprince: thanks
14:43:13 <derekh> *1 node pacemaker
14:44:00 <dprince> derekh: so are you suggesting we switch upgrades to 1 node?
14:44:29 <derekh> dprince: Nope, I was just pointing out that it already is a one node pacemaker cluster
14:44:40 <rdopiera> by the way, just a random idea, I'm not sure if it's done or not, but maybe before the actual installation and tests are even started, there could be a quick sanity check for the environment?
14:44:43 <derekh> dprince: bnemec suggested we try it
14:44:53 <dprince> derekh: oh, I see.
14:45:04 <dprince> derekh: I had forgotten that as well
14:45:08 <rdopiera> sorry if that's not what you are talking about
14:45:44 <bnemec> Must be a good idea if we're already doing it. :-)
14:45:47 <trown> we have 2 more topics, though it seems we could have a weekly CI meeting that is an hour on its own
14:46:00 <dprince> rdopiera: the environments are functionally fine. Just under load
14:46:07 <derekh> rdopiera: I'm not sure what we would sanity test
14:46:10 <dprince> rdopiera: or so we think
14:46:11 <trown> not an awful idea really... to have a CI subteam that just reports back to this meeting
14:46:22 <rdopiera> dprince: I see, forget it then
14:46:57 <rdopiera> dprince: my thinking was that then you would have fewer jobs even starting -- because the ones that would fail anyways wouldn't even start
14:46:58 <dprince> trown: right, we can cut it off soon. I give CI the bulk of the time I think because it is blocking all of us. And we should all probably look at it more
14:47:31 <trown> dprince: yep I agree, and think it probably needs even more time than half hour a week
14:47:33 <dprince> rdopiera: yeah, well we could just scale back the number of concurrent jobs we allow to be running
14:47:44 <sshnaidm> +1 for separate CI meeting
14:48:16 <derekh> trown: I think this would be their brief: 1. the periodic job failed last night, find out why and fix it. 2. bug X is causing a lot of failures, find out why and fix it. 3. find places where we're wasting time and make things run faster
14:48:48 <dprince> derekh: yep, that is a good start
14:50:04 <dprince> okay, let's move on to specs then. We can continue the CI talk in #tripleo afterwards....
14:50:11 <dprince> #topic Specs
14:50:45 <dprince> still very close to landing https://review.openstack.org/#/c/280407/
14:51:34 <shardy> dprince: I'm +2 but it'd be good to see the nits mentioned fixed
14:51:50 <dprince> shardy: yep, not rushing it or anything
14:52:25 <dprince> any other specs this week?
14:52:56 <dprince> slagle: shardy do you guys want me to abandon the composable services spec? Or make another pass at updating it again?
14:53:52 <shardy> dprince: Your call, I'm happy to focus on the patches at this point
14:54:11 <dprince> shardy: yeah, that is what I've been doing anyways.
14:54:15 <slagle> yea, i'm fine either way
14:54:32 <slagle> i would like to see us agree to get testing of updates to these patches in place in CI first
14:54:46 <slagle> but i dont know about the feasibility of that given all the CI issues
14:55:11 <dprince> so if specs aren't helpful (and it seems like this one provoked a lot of questions)
14:55:16 <slagle> i feel these patches have the potential to break stack-updates, and we won't know until we try towards the end of the cycle
14:55:18 <dprince> then we don't have to do them
14:55:22 <slagle> then we're faced with a mound of problems
14:55:32 <dprince> but having a Spec to point to for people asking about high level ideas is useful
14:56:05 <dprince> For this particular thing... the documentation can go into tripleo-docs however. Because it creates a sort of "interface" that service developers can plug into.
14:56:08 <trown> I think a spec for something as big as composable roles makes a lot of sense
14:56:09 <dprince> so I can document it there
14:56:33 <shardy> slagle: agree - hopefully we can get the upgrades job providing some of that coverage asap
14:56:36 <dprince> trown: the issue with specs is people want to get into implementation details, and the core team doesn't (often) agree on those :/
14:56:50 <dprince> trown: and a review is a better place to hash out implementation details...
14:57:01 <dprince> anyways.
14:57:11 <dprince> #topic open discussion
14:57:24 <dprince> shardy: do you want to mention the summit sessions etherpad?
14:57:37 <shardy> Yeah can folks please add summit topics to here:
14:57:39 <shardy> https://etherpad.openstack.org/p/newton-tripleo-sessions
14:57:54 <shardy> we have 5 sessions in total, plus a half-day meetup
14:58:22 <shardy> we need to decide the top-5 topics, and I'll report those so the official agenda can be updated
14:58:30 <dprince> shardy: looks almost like what we talked about last time :)
14:58:37 <shardy> Yeah :(
14:59:08 <shardy> I'll ping the ML too, I'd suggest we work out consensus via the ML given we always run out of time in these meetings
14:59:20 <dprince> yep, sounds good
14:59:29 <shardy> for newton I'd like to consider fewer repeated agenda items to enable this as a better sync-point
14:59:37 <dprince> thanks everyone, and lets continue the CI discussions in #tripleo to get it fixed ASAP
15:00:06 <dprince> #endmeeting