20:00:47 #startmeeting tripleo
20:00:48 Meeting started Mon Jun 17 20:00:47 2013 UTC. The chair is lifeless. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:49 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:50 hi everyone
20:00:51 The meeting name has been set to 'tripleo'
20:00:59 lifeless: hi
20:01:00 hi
20:01:05 morning lifeless
20:01:07 hi all
20:01:49 #topic agenda
20:01:50 bugs
20:01:51 Grizzly test rack status
20:01:51 CI virtualized testing progress
20:01:51 open discussion
20:01:55 #topic bugs
20:02:07 https://bugs.launchpad.net/tripleo/ as usual
20:02:19 we're down to 7 crits
20:02:47 SpamapS: you have 3 of them
20:02:57 SpamapS: care to give a brief status on them?
20:03:56 lifeless: I think we might be seeing a lot of this w/ TOCI https://bugs.launchpad.net/tripleo/+bug/1166838
20:03:57 sure let me catch up
20:03:58 Launchpad bug 1166838 in tripleo "rabbitmq does not start correctly on boot" [High,Triaged]
20:04:14 ok while SpamapS catches up
20:04:20 https://bugs.launchpad.net/heat/+bug/1191931
20:04:21 Launchpad bug 1191931 in heat "AssertionError when creating a stack." [Critical,New]
20:04:23 I should get https://bugs.launchpad.net/tripleo/+bug/1191714 done today
20:04:27 Launchpad bug 1191714 in tripleo "400.Bad.Request..X-Instance-ID.header.is.mising.from.reque " [Critical,Triaged]
20:04:27 btw, I believe this is breaking heat right now
20:04:28 oh, he's up :)
20:04:37 * lifeless hands the mike to SpamapS
20:05:16 Hm, I haven't checked, all 3 of those might be already merged, just needing better docs.
20:05:22 SpamapS: ok, and thus tripleo ?
20:05:49 ok no https://bugs.launchpad.net/tripleo/+bug/1182249 is still ongoing
20:05:50 Launchpad bug 1182249 in tripleo "quantum configuration is overly hardcoded" [Critical,In progress]
20:06:04 that one needs os-apply-config and/or os-refresh-config to have access to the ec2 metadata
20:06:20 dprince: so, several things we can do - we should move stuff out of first-boot and into orc calls
20:06:40 dprince: which we should *anyway* as rabbit is a service we may reconfigure.
20:06:40 https://bugs.launchpad.net/tripleo/+bug/1183223 is a bit vague and I may need to work on the wording/split it out
20:06:43 Launchpad bug 1183223 in tripleo "nova-compute.yaml missing parameters" [Critical,In progress]
20:07:02 https://bugs.launchpad.net/tripleo/+bug/1183442
20:07:04 Launchpad bug 1183442 in tripleo "Heat metadata updates do not work" [Critical,In progress]
20:07:06 dprince: that would ameliorate the toci issue even if we don't diagnose the root issue in Ubunty.
20:07:09 lifeless: sure. Just wanted to get that marked as high priority (and potentially being worked on by Derek)
20:07:11 Ubuntu.
20:07:17 dprince: ack
20:07:28 I think that one is fixed actually
20:07:50 will need to test and verify, but all of the code is in place to fix it theoretically
20:08:21 SpamapS: reference commit?
20:08:26 SpamapS: if that heat bug is hurting us, care to add a tripleo task?
20:10:25 dprince: there are several, it will take me a while to dig it out
20:10:42 lifeless: yes I'm doing so. It's basically stopping us dead in the water (might be stopping all users dead)
20:10:54 SpamapS: no worries. I can review history myself. Thanks for the update.
20:11:22 dprince: it was worked around in t-i-e and recently fixed in keystoneclient for good
20:12:14 dprince: a8c2ae7e1506defaa36f035377af2b7b04aaed87
20:12:36 lifeless: thanks.
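
A minimal sketch of the "move it out of first-boot and into orc calls" approach lifeless suggests above for bug 1166838; the os-refresh-config phase directory, script name and service handling here are illustrative assumptions, not the actual fix:

    #!/bin/bash
    # Hypothetical hook, e.g. dropped into os-refresh-config's post-configure.d
    # phase so it runs on every configuration pass, not just first boot.
    set -eu
    # (Re)start rabbitmq if it is not running; a later reconfigure pass would
    # then also recover from the "does not start correctly on boot" case.
    if ! service rabbitmq-server status >/dev/null 2>&1; then
        service rabbitmq-server start
    fi
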
20:12:48 we had that listed as fixing 1183732
20:12:51 but
20:12:53 I think it's the same
20:13:46 ok, onto the others
20:13:57 bug 1184484 has provisional patches from quantum devs
20:13:58 Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged] https://launchpad.net/bugs/1184484
20:14:09 I tried to apply them to the HP POC environment
20:14:23 but it made all quantum APIs return empty responses.
20:14:27 Which was undesirable
20:15:06 so I reverted it. My plan is to get the current arc of 'get it up and working without fiddling' going, and then find a couple of spare machines in that environment and do a fresh build.
20:15:32 o/
20:15:34 quantum folk have marked https://bugs.launchpad.net/tripleo/+bug/1189385 incomplete.
20:15:36 Launchpad bug 1189385 in tripleo "quantum-server hung up it's listening port" [Critical,Triaged]
20:15:43 I'm going to ping them asking what they are missing
20:15:58 #action lifeless to chase 1189385 diagnostics for quantum devs.
20:16:17 bug 1188301 - I think clint was tracking it?
20:16:18 Launchpad bug 1188301 in tripleo "keystone kvs driver causes process to grow indefinitely and spin on CPU with thousands of keys in a single python dict" [Critical,Triaged] https://launchpad.net/bugs/1188301
20:16:44 lifeless: did you see the respond on bug 1184484 ?
20:16:45 Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged] https://launchpad.net/bugs/1184484
20:16:46 response rather ?
20:16:58 lifeless: it's the same old upper/lower problem again
20:17:15 lifeless: oh, bug 1188301 is fixed, keystone defaults to sql now! \o/
20:17:49 https://review.openstack.org/#/c/32970/
20:18:20 I linked that to bug 1188378 though
20:18:21 Launchpad bug 1188378 in keystone "keystone.token.backends.sql uses a single delete command to flush expired tokens causing replication lag and potential deadlocks" [Medium,In progress] https://launchpad.net/bugs/1188378
20:18:35 oh actually no I linked both
20:18:37 SpamapS: I'm confused.
20:18:44 but the bot picked the other bug which I mentioned first
20:18:44 SpamapS: is it fixed, or is it pending review to be fixed?
20:18:53 lifeless: keystone upstream defaults to sql
20:19:02 lifeless: we have a config file that still says kvs though
20:19:15 ok, so not fixed for us, because we copied the files.
20:19:18 lifeless: https://review.openstack.org/#/c/32970/ fixes that
20:19:31 yup
20:19:36 dprince: re percona toolkit
20:19:37 Just need to work on that
20:19:45 dprince: what other options are there?
20:19:46 lifeless: this is why I wanted to shrink the size of the config file
20:20:06 lifeless: well... could we fix this in keystone?
20:20:06 jog0: yes, and as I said I'm with you in principle, we just need some care.
20:20:13 lifeless: ++
20:20:25 lifeless: I have a WIP fix in keystone but MySQL doesn't support LIMIT in the IN() sub-query clauses .. so I have to do something mysql specific.
20:20:46 dprince: right now we're broken by default. I'd like to unbreak us first, and get more portable, long term fixes second.
20:20:49 pt-archiver has been cleaning out our PoC table for a week now
20:20:50 dprince: what do you think of that ?
20:21:27 lifeless: this will certainly break our Fedora efforts. That is my main objection.
20:21:42 dprince: what if SpamapS puts a 'if ubuntu' thing around the pt cleaner.
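
For context on the pt-archiver cleanup SpamapS mentions: a batched purge of expired keystone tokens might look roughly like the following (Percona Toolkit's pt-archiver; the host, credentials handling, table and column names, and batch sizes are assumptions for illustration, not the PoC's actual command):

    # Delete expired rows from the keystone token table in small transactions,
    # avoiding the single huge DELETE that causes replication lag (bug 1188378).
    pt-archiver --source h=localhost,D=keystone,t=token \
        --purge --where "expires < NOW()" \
        --limit 500 --txn-size 500 --sleep 1
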
20:21:45 Frankly I think we should move to memcached eventually, but that is yet another 3rd party service to scale :p
20:21:50 dprince: fedora will be no worse off - broken is broken.
20:22:22 dprince: but it will build and install and run until you get too much contention with the upstream gc code.
20:22:34 lifeless: go for it. I suppose I'd just like to see keystone support this with its database design.
20:22:40 dprince: Me too!
20:22:48 dprince: I just don't want to be hostage to educating them
20:22:50 dprince: If you have some guidance on how to ask sqlalchemy if I'm using mysql, and then do a specially crafted query in sqlalchemy.. https://review.openstack.org/#/c/32044/ needs your comments :)
20:23:07 lifeless: we should probably put a comment in to make note of this as well so that it doesn't confuse people, etc.
20:23:17 dprince: I'm not suggesting we stop caring about it, just that we get the move away from kvs in place
20:23:32 ok, so
20:24:15 #action spamaps to: - make the kvs->sql change still build and run on fedora; ensure there is a bug upstream in keystone about the bad sql behaviour, with medium priority task on tripleo.
20:24:30 dprince: ^ I think that meets all your concerns; if not please feel free to tweak it so that it does.
20:25:01 ok and bug 1191714 I am working on
20:25:02 Launchpad bug 1191714 in tripleo "400.Bad.Request..X-Instance-ID.header.is.mising.from.reque " [Critical,Triaged] https://launchpad.net/bugs/1191714
20:25:16 this is fallout from the overcloud changes : it's a setting that has to be different in undercloud and overcloud.
20:25:36 right now any seed cloud/bootstrap cloud built with tripleo will fail metadata access from instances.
20:26:08 Any other bug stuff to discuss?
20:26:36 #topic grizzly rack status
20:26:46 so our POC has been getting hammered by some test users
20:26:50 which is great.
20:26:58 the only issue they have had so far is the quantum poolsize one.
20:27:25 With the comment spamaps pointed out, we can try switching to that quantum version again
20:27:46 #action lifeless to test the quantum deadlock fix on the POC again.
20:28:09 So, this is pretty good news, the path to near-production was a lot smoother than it might have been :)
20:28:23 anything else on the POC environment ?
20:28:51 moar PoC racks plz
20:30:25 I have 2 requests to bring the tripleo love to prod racks
20:30:38 they know we're not finished, and caveats etc.
20:30:47 but they want to see how it flies.
20:31:07 so - that's pending other folks' bandwidth. Will keep everyone apprised as things eventuate.
20:31:26 #topic CI virtual testing progress
20:31:30 pleia2: tag
20:31:34 hello!
20:31:48 so testing on lxc is moving along
20:32:19 got most networking stuff sorted last week and openstack using boot-stack is mostly running within lxc, just working out some launching issues with some of the services
20:32:40 sweet
20:32:42 at this point I don't foresee us hitting any major blockers
20:32:47 do you need more eyeballs ?
20:33:12 not right at this moment
20:33:16 but soon
20:33:41 will the CI virt testing just test the undercloud/boot-stack in LXC
20:33:52 jog0: that's the plan
20:34:12 so no overcloud testing in first pass
20:34:34 jog0: the primary goal is to get baremetal code path test coverage.
20:34:58 jog0: with a little bit of 'image builds properly' and 'tripleo config seems legit' built in.
20:35:07 jog0: -> baby steps.
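
As a concrete illustration of the "image builds properly" coverage lifeless describes, the first step of such a CI job might be no more than the following (element names and output path are assumptions, not the actual toci or CI code):

    set -eu
    # Fail the job immediately if the boot-stack image does not build.
    disk-image-create -o /tmp/ci-bootstack ubuntu boot-stack
    # disk-image-create writes a qcow2 by default; check it is non-empty before
    # going on to boot it under LXC / virtualization.
    test -s /tmp/ci-bootstack.qcow2
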
20:35:20 * jog0 *nod*
20:35:31 yeah, and we're starting off simple just so we have a basic setup (this whole thing has somewhat stalled partially because I've been trying to do all-the-things)
20:35:40 pleia2: ok, please shout in #tripleo when you need someone to eyeball logs or whatever to help diagnose failure-to-startup.
20:35:50 lifeless: great, thanks :)
20:35:52 #topic open discussion
20:35:53 pleia2: is this still using TOCI?
20:36:13 dprince: it's diverged quite a bit, but I hope to pull it back and submit some patches to TOCI
20:36:47 pleia2: cool. I sort of went the other way... and we are close to having TOCI driving real bare metal.
20:37:05 dprince: it's doing a lot of the same things, so if we could have some switches built in to handle virtual+lxc it would be great
20:37:20 right now I'm running everything by hand though
20:37:21 pleia2: we'll be more resource strapped there... but we are finding good things.
20:37:27 * pleia2 nods
20:37:50 lifeless: I've got a couple things I'd like to run past you all
20:38:36 dprince: shoot!
20:39:30 lifeless: Okay. First thing: this troubleshooting thing.
20:40:09 lifeless: I have a review up to make it so that we don't always have to hang the deployment process if something bad happens in the deploy ramdisk.
20:40:40 lifeless: we *can't* hang the deploy process for CI. it will kill my resource pools and the failure rate is still way too high.
20:40:54 So that is step one. (don't hang it)
20:41:12 https://review.openstack.org/#/c/33076/
20:41:49 hanging bad
20:41:58 there is a timeout mechanism for nova-bm
20:42:02 it's off by default...
20:42:02 Step two would be to have a simple err message tracking capability. I understand we are working on a proper agent... but in the meantime for the "H" release we need something. So maybe something like this: https://review.openstack.org/#/c/33341/
20:42:40 lifeless: A timeout would be good as well. But it is hanging because we call a 'bash' shell inline. That is just plain bad IMO.
20:43:20 lifeless: So with the second branch above ^^ we'd essentially just add a small blip to the nova-bare-metal-deploy helper so we can get and log the message.
20:43:32 +500
20:43:58 lifeless: I feel like this is a bit home brew... but I gotta say I can't do much about automating this without these things.
20:44:08 I am totally in favour of this sort of thing; devananda has some reasonable concerns about not changing nova baremetal, but IMO leaving it totally broken is not feasible.
20:44:19 lifeless: lastly, we need to get devananda's Nova branch to improve the IPMI power commands in.
20:44:46 lifeless: the Nova change is really small. and totally backwards compatible. I'll push it by the end of the day too.
20:44:48 I don't think a super duper agent is needed in the short term : it's a nice thing to have, but not a necessary condition for any of this.
20:45:05 lifeless: okay. I think we are on the same page.
20:45:24 lifeless: Okay. Slightly different topic.
20:45:24 yes, ack on that - I'll dig up that review and see where we are at.
20:45:31 lifeless does dib now default to using tmpfs for its building magic?
20:45:32 backticks. Can they go away.
20:45:44 dprince: `` -> $() ?
20:45:47 I'd much prefer we use the more formal $()
20:45:56 lifeless: it's a style thing... but yes.
20:46:04 fine by me; add it to HACKING or README or something so it's discoverable.
20:46:21 sdake: yes
20:46:28 lifeless: Cool. We don't have a bash HACKING that I know of but I'll take a shot at that.
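
The backtick/$() point is purely about readability, but a tiny example shows why $() tends to be preferred in new bash (the variable name here is made up):

    # Backticks need escaping as soon as substitutions nest, which is easy to misread:
    here=`basename \`pwd\``
    # $() nests cleanly and is easier to spot inside a long command line:
    here=$(basename $(pwd))
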
20:46:41 lifeless I guess I'm a dummy but the command line option to run a fedora dib doesn't immediately stick out at me from the -h or readme.md
20:46:42 sdake: see README.md - Requirements.
20:46:56 sdake: disk-image-create fedora
20:47:08 thanks I'll try that lifeless ;)
20:47:23 sure is fast
20:47:33 sdake: :>
20:47:49 will be nice with the official F19 and later images too
20:48:00 be sweet if it had an api to go with it :)
20:48:19 ya f17 lost cause at this point ;)
20:48:27 sdake: actually I think it's still too slow, need to add some parallel in there, as well as make it trivial to set up local pypi and openstack git mirrors; but folk like derekh (not in this channel atm) have that in-progress
20:48:32 I plan to change all the heat instances to default to f19 when it comes out
20:49:06 sdake: an API would be nice; I think structurally we should layer that on top - separate concern.
20:49:14 lifeless agree
20:51:25 sounds like we're done :)
20:52:33 agreed
20:52:34 #endmeeting
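
For reference, the invocation lifeless points sdake at above (disk-image-create fedora) in a slightly fuller form; the output names and the extra element on the second line are assumptions for illustration:

    # Plain Fedora image; 'fedora' is the distro element named in the discussion.
    disk-image-create -o fedora-image fedora
    # Other elements combine the same way, e.g. a Fedora-based boot-stack image:
    disk-image-create -o fedora-bootstack fedora boot-stack
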