14:00:44 #startmeeting tripleo
14:00:44 Meeting started Tue May 17 14:00:44 2016 UTC and is due to finish in 60 minutes. The chair is shardy. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:45 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:47 o/
14:00:48 o/
14:00:48 The meeting name has been set to 'tripleo'
14:00:49 #topic rollcall
14:00:53 hey
14:00:55 o/
14:00:55 Hi all, who's around?
14:00:58 hi
14:01:02 o/
14:01:10 o/
14:01:15 o/
14:01:19 o/
14:01:40 #link https://wiki.openstack.org/wiki/Meetings/TripleO
14:01:46 o/
14:01:54 * myoung waves
14:02:08 #topic agenda
14:02:08 * one off agenda items
14:02:08 * bugs
14:02:08 * Projects releases or stable backports
14:02:08 * CI
14:02:10 * Specs
14:02:13 * open discussion
14:02:28 Does anyone have any more one-off items to add today?
14:02:37 shardy, please add to items: "tempest run for nonha periodic jobs"
14:02:50 sshnaidm: ack, thanks
14:03:04 * beagles wanders in a couple of min late
14:03:09 o/
14:03:16 #topic one off agenda items
14:03:34 #info should we remove the containers job (slagle)
14:03:51 slagle: did you chat with rhallisey or Slower about that?
14:03:54 \o
14:04:09 o/
14:04:12 hi
14:04:13 we should rather bring it back to green, shouldn't we?
14:04:15 yo
14:04:18 it's been broken for a long time, so I agree we either need to fix it or remove it
14:04:24 shardy: no, i wasn't able to last week, that's why i added it to the meeting
14:04:34 EmilienM: well, yes, but folks said they were doing that and it's not happened for $months
14:04:45 :-(
14:04:49 o/
14:04:52 i would just like to know if people are working on fixing it
14:05:03 he has a bunch of patches in progress for that
14:05:20 they need to be rebased / passing CI / merged and things could be better
14:05:34 hey
14:05:40 #info Nikolas Hermanns
14:05:41 hey
14:05:41 o/
14:05:46 sorry I'm late
14:05:50 o/
14:05:53 rhallisey: Hey, we're discussing the containers CI job
14:05:55 EmilienM: right, there was a lot of discussion at the summit though about not using docker-compose any longer, etc
14:06:04 what's the status there, do we have a plan to get it working again?
14:06:22 o/
14:06:51 so the job worked for a while. I can rebase it to get it back working, but with composable roles the gate will be flaky
14:06:59 shardy: we will need the job again at some point
14:07:03 o/
14:07:06 o/
14:07:14 rhallisey: Ok, so we're trying to decide if we can get it working, or if it should be temporarily removed
14:07:16 of course, we can always add it back
14:07:31 obviously getting it working is preferable, but we need folks willing to keep it green
14:07:40 I considered the current state as temporarily removed
14:07:50 which I think is appropriate
14:08:00 maybe keep it only for tripleo-ci project jobs so rhallisey can still run some tests, or move it to the experimental pipeline
14:08:09 so he can "check experimental" when needed
14:08:13 rhallisey: Ok, well it's disabled, but we're discussing removing it from the default pipeline
14:08:22 EmilienM: +1 that's perhaps a reasonable compromise
14:08:35 shardy, ya I agree with EmilienM
14:08:40 that would be best
14:08:46 slagle: does that work for you?
14:08:52 sure
14:08:53 yup +1 to what EmilienM said, that way it won't be using up as many jenkins slaves
14:09:01 rhallisey: if you're not familiar with infra, I can make the patch, let me know.
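For reference, a minimal sketch of roughly what the patch discussed above might look like in openstack-infra/project-config's zuul layout (zuul v2 era, YAML). This is an assumption for illustration only: the job and project names below are hypothetical placeholders, not the exact entries, and a real change would need to cover every TripleO project currently running the containers job.

```yaml
# Hypothetical excerpt from zuul/layout.yaml: move the containers job out
# of the check pipeline and into experimental, so it only runs when a
# reviewer comments "check experimental" on a change.
projects:
  - name: openstack/tripleo-heat-templates
    check:
      - gate-tripleo-ci-nonha          # existing jobs stay in check (illustrative name)
    experimental:
      - gate-tripleo-ci-containers     # containers job now opt-in only (illustrative name)
```

The effect is that the job stops consuming jenkins slaves on every patch set, while rhallisey can still trigger it on demand while fixing it.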
14:09:14 #info agreement to make containers job temporarily only run via experimental pipeline
14:09:39 #info tempest run for nonha periodic jobs
14:09:44 sshnaidm: ^^
14:09:47 EmilienM, I may ask some questions. I've done it once :)
14:10:04 rhallisey: cool, we'll catch up after the meeting
14:10:12 I think the timeout is not an issue now after the HW upgrade, as I see in the logs
14:10:34 so we can try to run tempest tests on the periodic nonha jobs as the shortest ones
14:10:41 shardy, sorry about being a tad late. I'm in a large meeting in westford
14:10:54 rhallisey: no problem
14:10:59 https://review.openstack.org/#/c/297038/
14:11:11 container stuff
14:11:31 sshnaidm: ya, I think we have the time available to do this now
14:11:52 I'd like to hold on Tempest
14:12:00 It really isn't our biggest problem ATM
14:12:05 +1
14:12:13 our CI is broken almost every day
14:12:17 can we stabilize it first?
14:12:28 like seriously... could someone point out what, if anything, Tempest is going to catch that our existing CI isn't already telling us
14:12:30 sshnaidm: before we merge that though, can we just do one thing: submit a WIP patch to run a fake periodic job to make sure they're working, as that patch isn't tested in CI
14:12:52 derekh: I was just about to ask how we see that patch actually working :)
14:12:53 Our existing ping test is working quite well
14:12:57 dprince, we already caught 2 bugs with tempest runs
14:13:09 derekh, sure
14:13:09 sshnaidm: what bugs?
14:13:26 sshnaidm: that is great. The question is does the extra time Tempest costs us warrant running it on every single test
14:13:29 on the other hand running tempest on the periodic job does not hurt as much as running it in our gate.
14:13:31 trown, I don't remember the numbers, one is about sahara being unavailable and another one I can find later
14:13:35 I have seen it catch packaging bugs where out-of-tree tempest plugins weren't packaged right
14:13:44 dprince: I think the proposal is only to run it once per day?
14:13:47 sshnaidm: If those bugs are like constantly regressing you may have a case
14:13:56 trown: right, puppet CI catches it very often.
14:13:57 shardy: we've been over this *so* many times
14:13:58 dprince: it's for the periodic job, it doesn't add anything to normal ci runs
14:14:06 nobody is against running tempest in the periodic jobs
14:14:16 okay, that is fine then
14:14:33 dprince, EmilienM I don't think tempest will hurt CI, the current fails have their own reasons, not sure tempest tests will break something
14:14:40 dprince: we went over it a couple of weeks ago, and the consensus was OK, let's add it to the periodic job if we have enough walltime
14:14:50 periodic is fine
14:15:03 if we merge that, our promotes, which are also done by periodic jobs, will be dependent on tempest
14:15:04 so we're all good
14:15:15 slagle: yes they will
14:15:19 slagle: that is a good point
14:15:22 slagle, right
14:15:29 i don't think we really need that tbh
14:15:36 but... my main concern is adding wall time to the existing jobs
14:15:51 what about a separate periodic job for tempest that does not influence the promote?
14:16:16 dprince, as I saw in all the last nonha jobs it's not an issue, we have the time now
14:16:34 slagle, the point here is not promote bugs, afaiu..
14:16:51 sshnaidm: even so. I think what you are hearing is basically Tempest doesn't add much to our existing CI
14:16:55 I raised a question on the ML about the CI pipeline recently: http://lists.openstack.org/pipermail/openstack-dev/2016-May/095143.html and have some ideas how to reduce CI times. Would be interested in feedback on it after the meeting
14:17:15 sshnaidm: there is value in having our promotion job match (as closely as possible) what our Trunk jobs do as well
14:17:18 dprince, if it already caught a few bugs, why not?
14:18:01 sshnaidm: list the bugs and make the case. I'm more interested in whether those bugs are perhaps corner cases... that may not often cause regressions in TripleO
14:18:08 Ok, we're going to have to time-box this discussion and move on I think
14:18:21 we keep having the same discussion every time, so let's all vote on https://review.openstack.org/#/c/297038
14:18:23 i think the issue is that not every bug out there needs to block tripleo forward progress
14:18:45 sshnaidm: please link the bugs found there, and post a link to where we can see results of it running
14:18:49 slagle: that's true, but if it doesn't block something we'll possibly ignore it
14:19:24 slagle: every bug out there typically does block forward progress anyway tho ;)
14:19:33 derekh: agreed, and i think i'm leaning towards ignoring some things is ok :)
14:19:45 #topic bugs
14:19:46 obviously we want to find and report all bugs that we can
14:20:07 slagle: then a valid bug could make it the whole way to the end of a cycle until we start panicking about it
14:20:24 #link https://bugs.launchpad.net/tripleo/?orderby=-id&start=0
14:20:38 dprince, shardy one of them: https://review.openstack.org/#/c/309042/
14:20:50 shardy: I'm -2 on the Tempest patch as is I think
14:20:58 So there's a packaging issue wrt mistral (caught by the periodic job), it now needs designateclient
14:22:03 I'm trying to get a packaging fix in for that, and I confirmed the long term plan is for mistral to have soft dependencies so we won't have to install all-the-clients
14:22:14 Any other bugs folks want to discuss?
14:22:26 shardy: so rbrady filed a bug on the Mistral timeouts issue
14:22:30 shardy: https://bugs.launchpad.net/mistral/+bug/1581649
14:22:31 Launchpad bug 1581649 in Mistral "Action Execution Response Timeout" [High,New]
14:22:59 dprince: I'm still working on that bug
14:23:00 which I don't actually see myself. But apparently it could be related to eventlet and the fact that we run the Mistral API in Apache WSGI
14:23:15 I am seeing that bug too
14:23:19 rbrady: yep, just highlighting it so people know where to look if they hit that issue
14:23:37 shardy: that is all from me on bugs
14:23:49 dprince: interesting, thanks! I hit that myself quite a few times
14:24:04 +1 me too, i thought it was an underpowered undercloud
14:24:18 it would be helpful if anyone else seeing that bug notes that in launchpad
14:24:23 i often had to just re-run the execution
14:24:40 rbrady: will do - I hit it on pretty much 50% of my executions
14:24:48 shardy: same
14:24:56 which sounds like a blocker to switching the client over to mistral
14:25:13 +1
14:25:20 #link https://launchpad.net/tripleo/+milestone/newton-1
14:25:36 shardy: yes, this needs to be fixed
14:26:04 So, the newton-1 milestone is in less than two weeks - I'm hoping we can cut some releases around the end of next week (or during the week after, but I'll be on PTO)
14:26:38 can folks review that list, and ensure anything you consider a release-blocker gets on there, and anything else gets moved to n-2?
14:26:49 shardy: well, https://blueprints.launchpad.net/tripleo/+spec/overcloud-upgrades
14:26:57 shardy: i was wondering about that one
14:27:12 marios: Yup, that was going to be my next question - can we break that down into some smaller pieces?
14:27:25 that's a catch-all BP and I'm guessing we can't declare it implemented for n-1?
14:27:42 shardy: right... i wasn't even clear what it was about exactly
14:28:16 yeah i think we can break it down into smaller BPs
14:28:40 shardy: i can work with jistr to define something for next week's meeting?
14:28:42 marios: any remaining pieces (including CI coverage) we need in order to be able to declare mitaka->newton upgrades fully supported in upstream builds
14:28:54 marios: +1
14:28:58 shardy: or is that too late already?
14:28:59 marios: sure, sounds good, thanks!
14:29:22 marios: we can chat about it between now and next week - please enumerate the todo list somewhere, e.g. an etherpad
14:29:32 shardy: ack
14:29:34 ack
14:29:35 +1 feel free to drag me in if needed
14:29:39 trown: what's the status wrt quickstart - can we declare that implemented now?
14:30:10 shardy: we are still a bit blocked on the image front, but I am meeting with derekh later today to sync on that
14:30:25 shardy: since the only consumable image atm is from RDO
14:31:00 trown: Ok, looks like we may have to bump to n-2 then, can you perhaps link an etherpad from the BP with details of what remains, or link bugs in the tripleo-quickstart launchpad?
14:31:15 shardy: will do
14:31:17 zero blueprints for n-1
14:31:42 Ok, sorry we kinda side-tracked off bugs a bit there
14:31:49 #topic Projects releases or stable backports
14:32:35 on the aodh backport topic, i did see what deps get pulled in if you try and install aodh from mitaka on a liberty controller
14:32:40 So, pradk posted http://lists.openstack.org/pipermail/openstack-dev/2016-May/095097.html
14:32:42 and it was only aodh related packages
14:32:50 afaict
14:33:10 so i think we might be able to migrate to aodh as the first step in a mitaka upgrade
14:33:13 slagle: interesting, so we could potentially do that as part of the upgrade vs backport?
14:33:22 shardy: yes, that's my hope
14:33:24 I like that better
14:33:24 and avoid the backport
14:33:29 +1
14:33:30 I think that would be better if possible
14:33:54 If folks have other opinions please reply to the thread
14:34:14 i'll reply with what i found in my test
14:34:19 I guess this is a somewhat common requirement, so it'd be good to solve it in a general way
14:34:35 since for stable/mitaka we will not even be considering such backports
14:35:06 I already had to abandon our application for the stable:follows_policy governance tag due to stable/liberty
14:35:39 we're now blocked from applying for that until liberty is EOL, so a non-backport solution would be preferable from that angle also
14:35:48 slagle: thanks
14:35:50 the general way we had in mind previously involved doing it as a last step of the upgrade
14:36:04 together with the "convergence via puppet"
14:36:11 but that implies not missing the deprecation period
14:36:28 (we need to be able to use the old service/config/whatever with the new packages)
14:36:36 jistr: AIUI the problem is aodh obsoletes the ceilometer-alarm packages, so then we'd have no alarm service for the duration of the upgrade?
14:36:48 (which could be days while all the computes are upgraded?)
14:36:51 yea
14:37:00 but it wouldn't have to obsolete them necessarily i think
14:37:12 but the issue is
14:37:28 that ceilometer-alarm is not present in the upstream release of mitaka IIUC
14:38:05 jistr: ack - well can you, slagle and pradk talk, and post an update to the thread with your preferred approach?
14:38:47 sure thing. If we can find a better solution than what we've found out so far, that would be great.
14:38:58 +1
14:39:06 Ok, anything else re releases or backports to mention?
14:39:36 #topic CI
14:39:52 derekh: care to give an update re the new super-speedy CI?
14:40:21 Main CI story is that we (eventually) got HW upgrades done for the machines doing CI
14:40:33 all of them now have 128G of RAM (was 64)
14:40:42 and 200G SSDs
14:41:10 AFAICT the performance looks quite a bit better?
14:41:15 but we're only currently using the SSDs for the testenvs, jenkins slaves aren't using them yet
14:41:34 shardy: well, we haven't had much time to evaluate, CI was broken a lot of the time since Friday.
14:41:37 testenvs is where it was most important though
14:41:54 but yeah, upgrades jobs ~1h45 I've seen
14:42:12 shardy: ya things look better but I'm reluctant to say anything concrete until we have a few days where trunk is actually working
14:42:17 some nonha take 1h23
14:42:30 performance is definitely improved. Would be nice to try and quantify it a bit further though
14:42:45 the upgrade took a day longer than I had hoped, due to our bastion host refusing to reboot
14:42:58 Yeah, definitely moving in the right direction, but would be good to figure out if we can get the runtimes a little more consistent
14:43:16 dprince: yup, once we get a few days of results I'll try and do a comparison
14:43:25 might even generate a graph or something
14:43:27 e.g. I've seen nonha jobs vary between 1hr11 and 1hr40+
14:43:28 derekh: thanks for driving this
14:43:55 pabelanger: did you have anything to add re your CI work?
14:43:57 and thanks for the teamwork in configuring the new drives in the BIOS (slagle, bnemec, EmilienM)
14:44:02 sure
14:44:13 I'd like to see if we can merge: https://review.openstack.org/#/c/316973/
14:44:15 +1 thanks to everyone involved in the hardware upgrade and subsequent cleanup
14:44:22 ya, thanks all with the bios help, that would have been a PITA alone
14:44:30 also big kudos to pabelanger for staying very late on Thursday evening
14:44:34 then find a time to migrate to centos-7 jenkins slaves: https://review.openstack.org/#/c/316856/
14:44:52 Then the last step is launching an AFS mirror, which derekh is going to be commenting on via the ML
14:44:55 pabelanger++
14:46:13 nice move :-)
14:46:56 pabelanger: that patch looks good to me
14:47:11 thanks pabelanger, we'll check out those patches
14:47:20 #topic Specs
14:47:29 https://review.openstack.org/#/q/project:openstack/tripleo-specs+status:open
14:47:35 #link https://review.openstack.org/#/q/project:openstack/tripleo-specs+status:open
14:47:59 beagles: thanks for your feedback on the dpdk and sriov ones, still need others to check those out
14:48:30 yup
14:48:32 we also need to confirm direction wrt the lightweight HA one
14:49:01 e.g. if we can switch to only that model, or if we must support the old model too
14:49:26 Any other updates re specs/features?
14:50:18 dprince: other than needing more reviews, any updates re composable services?
14:50:41 hopefully we can get a clear few days in CI and start landing those
14:51:35 I posted a further example of how the fully composable/custom roles might work in https://review.openstack.org/#/c/315679/
14:51:36 shardy: no, reviews are good
14:51:40 shardy: working CI will help too
14:51:47 Not yet functional, but feedback re the general approach is welcome
14:52:10 Ok then
14:52:15 #topic Open Discussion
14:52:30 Anyone have anything else to raise?
14:52:30 shardy: do you think we'd drive the templating via mistral?
14:52:39 slagle: yes, that was my plan
14:53:01 for now I've put the script in t-h-t, but I'd expect it to move to tripleo-common and be driven via mistral as part of the deployment workflow
14:53:07 hey guys I wanted to ask if anybody has thoughts about the problem raised by EmilienM on the mailing list?
14:53:09 cool, will have a look
14:53:18 on sharing data across roles
14:53:43 gfidente: I think the allnodes deployment is probably the place to do it
14:53:49 and if there are opinions about my reply to have an 'allNodesConfig' output from the different roles?
14:54:05 we've done that in various templates already
14:54:10 let's try that
14:54:25 ack
14:54:36 gfidente: Yeah, figuring out how that may wire in to composable services is tricky, but the only way to pass data between all the roles will be in allNodesConfig I think
14:54:50 gfidente: do you have a PoC already?
14:54:58 or any example of existing code?
14:55:03 EmilienM no, so we can start by using the *existing* allNodesConfig
14:55:14 until we figure out how to 'compose' that from the roles
14:55:33 gfidente: the composable role bits we have today should be thought of as global config blobs
14:55:55 gfidente: any per-node settings are still being handled in the server templates themselves (compute.yaml, controller.yaml, etc.)
14:56:25 and then the all nodes configs roll up that information into lists, etc. when needed
14:56:45 dprince yes, but it's only distributed to nodes of the same 'type'
14:56:46 if we need to access data between the two I'd just suggest using hiera I think
14:57:11 so yes, for networking I have a submission which dumps IPs into hieradata
14:57:28 but for inter-role data we can't assume hiera is available
14:57:45 because we could be writing hiera on a different node than the one we need to read the data from
14:57:51 gfidente: it may not be available, but it eventually would be, right?
14:57:57 gfidente: hiera lookup with a default?
14:58:08 let's continue on #tripleo
14:58:15 +1
14:58:16 it's probably different cases we need to cover
14:58:35 gfidente: sure, there are some corner cases to cover yet
14:58:50 hiera seems okay for networking though
14:59:00 cause we dump it from controller and consume from service
14:59:05 Ok, we're pretty much out of time, thanks everyone!
14:59:07 on same node
14:59:11 #endmeeting
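To make the cross-role data discussion at the end a little more concrete, here is a minimal sketch of the kind of pattern gfidente and dprince describe: an all-nodes style config publishes data gathered from one role as hiera data that every node can consume. This is not the actual tripleo-heat-templates code; the resource and parameter names are illustrative assumptions only.

```yaml
# Illustrative sketch of an all-nodes hieradata blob in a Heat template:
# data collected from one role (e.g. controller IPs) is written as hiera
# on every node, so other roles can read it locally.
resources:
  AllNodesExtraConfig:
    type: OS::Heat::StructuredConfig
    properties:
      group: os-apply-config
      config:
        hiera:
          datafiles:
            all_nodes_extra:            # hypothetical datafile name
              mapped_data:
                controller_node_ips: {get_param: controller_ips}
```

On the consuming side, the "hiera lookup with a default" dprince suggests would be something like `hiera('controller_node_ips', [])` in a manifest, so a node where the data has not landed yet falls back to an empty list instead of failing.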