20:00:21 #startmeeting tripleo
20:00:22 Meeting started Mon Jun 3 20:00:21 2013 UTC. The chair is lifeless. Information about MeetBot at http://wiki.debian.org/MeetBot.
20:00:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
20:00:25 The meeting name has been set to 'tripleo'
20:00:48 #agenda
20:00:51 bugs
20:00:51 Grizzly test rack progress
20:00:51 CI virtualized testing progress
20:00:51 open discussion
20:00:52 bah
20:00:57 #topic agenda
20:01:01 bugs
20:01:01 Grizzly test rack progress
20:01:01 CI virtualized testing progress
20:01:01 open discussion
20:01:10 #topic bugs
20:01:26 https://bugs.launchpad.net/tripleo/
20:01:30 sigh.
20:01:32 #link https://bugs.launchpad.net/tripleo/
20:02:15 o/
20:02:20 10 criticals
20:02:37 4 in progress
20:02:53 SpamapS: am I wrong, or do you have 1182249 too ?
20:03:11 and 1182732 and 1182737 ?
20:03:24 checking
20:03:35 and 1183442 ? :)
20:03:45 I believe 1182249 yes
20:04:08 lifeless: which bug? :)
20:04:19 #action lifeless https://bugs.launchpad.net/tripleo/+bug/1184484 I will add it to the discussion about defaults on the -dev list.
20:04:20 Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged]
20:04:23 Ng: iLO
20:04:24 Not sure if my patches in review handle 1182732
20:04:32 Ng: https://bugs.launchpad.net/tripleo/
20:05:23 I do have 1183442
20:06:22 is there some reason gerrit doesn't manage LP projects for stackforge?
20:07:15 unlinking https://bugs.launchpad.net/tripleo/+bug/1182732 - we have a separate workaround task.
20:07:16 Launchpad bug 1182732 in quantum "bad dependency on quantumclient breaks metadata agent" [High,Confirmed]
20:07:55 SpamapS: it will
20:07:58 SpamapS: if it's configured correctly
20:08:12 SpamapS: we should be configured correctly now; clarkb gave us a hand last week to sort it out
20:09:16 good, will cross my fingers :)
20:09:48 Ng: https://bugs.launchpad.net/tripleo/+bug/1178112 specifically
20:09:49 Launchpad bug 1178112 in tripleo "baremetal kernel boot options make console inaccessible on ILO environments" [Critical,Triaged]
20:11:14 so that leaves two
20:11:21 one is a workaround issue
20:11:34 not a lot we can do; clearly quantum hasn't been used at moderate scale
20:11:43 that's bug
20:11:48 bug 1184484
20:11:49 Launchpad bug 1184484 in tripleo "Quantum default settings will cause deadlocks due to overflow of sqlalchemy_pool" [Critical,Triaged] https://launchpad.net/bugs/1184484
20:12:10 and I'm fairly sure we have to have 1182737 fixed to bring up an automated overcloud
20:12:16 lifeless: repointed at the commit that landed in dib and marked as fix committed
20:12:36 Ng: \o/ - as dib isn't doing releases yet, just fix released please.
20:12:49 k
20:12:53 SpamapS: are you sure you're not installing git trunk of quantumclient yet ?
20:13:07 lifeless: was just looking
20:13:08 SpamapS: I thought you brought up an overcloud in fully automated fashion and it worked?
20:13:57 lifeless: 99% automated.. still getting stuck at booting an instance and having metadata because of a lack of routers..
20:14:11 lifeless: in my notes I have "install quantumclient from trunk in quantum venv"
20:14:23 kk
20:14:33 SpamapS: I will debug that with you later today?
20:14:51 so, bugs done to death I think; lots of high, but let's get the fire drill sorted before we worry about that.
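For context on bug 1184484 above: the failure mode is SQLAlchemy's connection pool being exhausted under load. A minimal sketch of the knobs involved, written against SQLAlchemy directly; the DSN and the numbers are illustrative assumptions, not quantum's actual option names or defaults.

```python
# Minimal sketch of the SQLAlchemy pooling behaviour behind bug 1184484.
# The DSN, pool_size, max_overflow and pool_timeout values are
# illustrative assumptions, not quantum's actual defaults.
from sqlalchemy import create_engine

engine = create_engine(
    "mysql://quantum:secret@localhost/quantum",  # hypothetical DSN
    pool_size=5,       # connections kept open in the pool
    max_overflow=10,   # extra connections allowed beyond pool_size
    pool_timeout=30,   # seconds to wait for a free connection
)

# Once every API worker holds a connection and pool_size + max_overflow
# is exhausted, further checkouts block for pool_timeout seconds and then
# raise TimeoutError -- the "deadlock" seen at moderate scale.
```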
20:15:07 lifeless: yes I've got it working well with very straightforward manual steps
20:15:08 #topic Grizzly test rack POC
20:15:21 lifeless: also we should lean on quantumclient maintainers maybe?
20:15:25 So, we have a live working grizzly cloud.
20:15:42 SpamapS: we should.
20:16:28 but we have no monitoring in place.
20:16:33 lifeless: I need to lean on the keystoneclient maintainers for similar reasons. :)
20:16:47 Ng: GheRivero: perhaps that's something you guys have lots of experience with and would like to take up the mantle for?
20:16:48 lifeless: sure we do, our POC users will phone us if it breaks. ;)
20:16:49 * SpamapS hides
20:17:50 lifeless: monitoring? sure. do we have any ideas about what we want?
20:18:01 lifeless: yeah, sure (don't know what, but if you say so... :)
20:18:13 well
20:18:38 icinga or nagios perhaps? NobodyCam had the start of an element, but it's stalled AFAICT
20:18:58 perhaps work with him on that, and on the heat template for same
20:19:07 I assume alerting is included in monitoring?
20:19:10 I can also give a hand there.
20:19:20 cody-somerville: cool
20:19:25 I'd love to have an icinga-heat bit that, given heat read credentials, can interpret a heat stack and generate all of the monitoring.
20:19:40 SpamapS: +1, and I welcome our strong AI overlords.
20:19:49 I meant to get back to the nagios ... but got sidetracked on ironic
20:20:17 so, we've three weeks of POC to go; it would be really good to have monitoring sooner rather than later.
20:20:20 jog0: oh yeah, HI
20:20:21 !
20:20:26 what do we want?
20:20:34 I think we want base level hardware/OS monitoring.
20:20:49 We want cloud health - have we maxed out any resource
20:20:50 * jog0 waves to lifeless and the rest of the room
20:21:05 We want API health: are all the API endpoints answering, and doing so in a reasonable timeframe.
20:21:27 We want functional monitoring - is spinning up/down instances working, is networking working.
20:21:43 it's likely we want more than one tool; but a consolidated view of their data.
20:21:58 should I turn this into a blueprint/etherpad?
20:22:34 we should probably have this captured somewhere
20:22:39 I'd love it if someone can pick it up and run with it; dragging other folk in as needed.
20:22:45 lifeless: and cloud health also detects if a box dies?
20:23:00 jog0: the base hardware/os layer stuff would capture that
20:23:25 jog0: I'm inclined to worry about automated remedial actions at a later date
20:23:40 ah, perhaps we should sync up offline, so I can get up to speed
20:24:00 jog0: sure; though we have some time here.
20:24:24 Note that Heat wants to be able to do some of that remedial action stuff too.
20:24:27 Basically, tripleo aims to deliver a production ready cloud; having *an* answer for monitoring is an important thing.
20:24:32 SpamapS: right
20:24:41 SpamapS: I'm thinking an icinga endpoint can be the canary check.
20:24:52 yes indeed
20:24:55 SpamapS: at 10000ft view
20:25:16 jog0: the other thing, as SpamapS just brought up, is that solid service monitoring is a key part of safe deployment automation.
20:25:28 jog0: so you can stop a deploy mid-way if things go pear-shaped.
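As a concrete example of the "API health" item above, an icinga/nagios-style check can be as small as timing a request against each endpoint and mapping the result onto the usual plugin exit codes. A minimal sketch, assuming a hypothetical endpoint URL and thresholds:

```python
#!/usr/bin/env python
# Icinga/nagios-style "API health" check: is the endpoint answering, and
# within a reasonable time?  The endpoint URL and thresholds are
# illustrative assumptions.
import sys
import time
import urllib.error
import urllib.request

ENDPOINT = "http://keystone.example.com:5000/v2.0/"  # hypothetical
WARN_SECONDS = 2.0
CRIT_SECONDS = 5.0

start = time.time()
try:
    urllib.request.urlopen(ENDPOINT, timeout=CRIT_SECONDS)
except urllib.error.HTTPError:
    pass  # any HTTP response at all counts as "answering"
except Exception as exc:
    print("CRITICAL: %s unreachable: %s" % (ENDPOINT, exc))
    sys.exit(2)
elapsed = time.time() - start

if elapsed > WARN_SECONDS:
    print("WARNING: %s answered in %.1fs" % (ENDPOINT, elapsed))
    sys.exit(1)
print("OK: %s answered in %.1fs" % (ENDPOINT, elapsed))
sys.exit(0)
```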
20:26:24 right
20:26:33 right now we have nothing; so we need something
20:26:47 #action lifeless to capture 10000ft view of monitoring needs in a blueprint
20:26:58 #action someone to take point on monitoring
20:27:44 I have an open todo to track down the missing machines; echohead gave me a spreadsheet, but I don't know [yet] the network topology
20:27:51 anything else about the test rack?
20:28:53 ok
20:29:02 #topic
20:29:02 next topic: 2nd test rack? :)
20:29:07 #topic CI virtualized testing progress
20:29:15 so, this one is lots of fun
20:29:24 SpamapS: once we can bring up and pull down parallel clouds in this rack I'll ask for a test row.
20:29:35 a couple months ago I was tasked with testing nova-baremetal https://bugs.launchpad.net/openstack-ci/+bug/1082795
20:29:36 pleia2: tag, you're it
20:29:36 Launchpad bug 1082795 in openstack-ci "Add baremetal testing" [High,Triaged]
20:29:55 as we all know, a ton has changed since then, ironic and all
20:30:19 but I've still been focusing on tripleo to do virtualized testing of the soundness of launching these test bmnodes
20:30:30 so two things
20:30:59 1. This is difficult, I tried just straight toci that dprince worked on, but our virtualized environments don't really allow for this (they don't have kvm, and qemu is way way too slow)
20:31:18 how slow?
20:31:26 Like, can we get working-but-slow, and then iterate?
20:31:30 openstack starts exhibiting strange timeout bugs slow, not usable
20:31:40 2 minutes to ssh in
20:32:19 ok, that's pretty messed up.
20:32:29 2. Am I still on the right track here at all by using tripleo? If I use lifeless' takeover node I end up pulling out so much virtualization that I'm really just testing dib and launching of nodes (and haven't quite figured out networking on that)
20:32:57 which isn't really tripleo anymore, but probably is where I want to be testing baremetal-wise (I think)
20:33:01 pleia2: I have working nested kvm on my i7 laptop
20:33:05 pleia2: so, what code path do you want to test ?
20:33:07 pleia2: using boot-stack
20:33:14 SpamapS: cloud test environments are rackspace/HPCS.
20:33:21 SpamapS: so that's interesting but irrelevant
20:33:22 SpamapS: me too, but not on hpcloud
20:33:27 gah
20:33:32 can't even load kvm module on hpcloud
20:33:42 yeah didn't realize that's what we were talking about
20:33:45 pleia2: what codepath are you aiming to test?
20:34:20 lifeless: so that's what I realized this morning - I don't know, nova-baremetal is now ironic (and doesn't yet have a nova driver afaik)
20:34:30 nova-baremetal still exists.
20:34:36 ironic is coming together.
20:34:47 right, and it's a goal for ironic to have a driver which I assume will behave the same way
20:34:53 +for nova
20:34:58 once ironic is integrated, we'll still want to know that the use case of 'nova boot baremetal' still works.
20:35:05 so, let's ignore ironic.
20:35:28 if we do that, what codepath do you want to test?
20:35:51 nova
20:35:58 ok
20:36:04 so the minimum you need for that is
20:36:07 the nova code
20:36:10 configured for baremetal
20:36:15 * pleia2 nods
20:36:19 you need a dedicated network
20:36:37 with 'physical' machines on it w/PXE boot configured
20:36:38 yes, this is my current challenge when trying to do this virtualized without nesting
20:36:52 and you need a power driver capable of turning them on / off.
20:37:26 I suggest that a solid 'it worked' test is to boot a vanilla ubuntu image, ssh in with the metadata-supplied ssh key.
20:37:48 tripleo's boot-stack is neither here nor there w.r.t. testing this specific code path.
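The "it worked" test suggested above (boot a vanilla image, ssh in with the metadata-supplied key) might look roughly like the following sketch; the image, flavor, keypair and instance IP are assumptions, and the nova CLI is driven via subprocess rather than any particular client library.

```python
#!/usr/bin/env python
# Rough sketch of the suggested smoke test: 'nova boot' a vanilla image,
# then prove the metadata-supplied ssh key works by logging in.
# The image, flavor, keypair and instance IP are assumptions.
import socket
import subprocess
import time

NAME = "bm-smoke-test"

subprocess.check_call([
    "nova", "boot", NAME,
    "--image", "ubuntu-precise",     # hypothetical image name
    "--flavor", "baremetal",         # hypothetical baremetal flavor
    "--key-name", "smoke-test-key",  # keypair delivered via metadata
    "--poll",
])

# The instance address would normally be scraped from 'nova show';
# assumed here for brevity.
ip = "192.0.2.10"

# Wait for sshd, then confirm the injected key actually works.
deadline = time.time() + 600
while time.time() < deadline:
    try:
        socket.create_connection((ip, 22), timeout=5).close()
        break
    except socket.error:
        time.sleep(10)
else:
    raise SystemExit("CRITICAL: ssh never came up on %s" % ip)

subprocess.check_call([
    "ssh", "-o", "BatchMode=yes", "-o", "StrictHostKeyChecking=no",
    "ubuntu@%s" % ip, "true",
])
print("OK: booted and logged in with the metadata-supplied key")
```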
20:37:55 ok
20:37:57 Just a thought. Have we ever tried lxc as a way around the nesting problems?
20:38:05 SpamapS: nope
20:38:10 SpamapS: lxc container set to pxe boot ?
20:38:19 SpamapS: with a different kernel....
20:38:20 just qemu (drop-in for kvm, easy to test)
20:38:26 SpamapS: I don't think it's a fit.
20:38:37 SpamapS: though it's an interesting idea
20:38:40 lifeless: oh well if we're testing _that_ part yeah there's no point.
20:38:50 so I've had a few ideas, but they end up being so weird that we don't end up testing what we think we're testing (and the tests could break in weird ways)
20:39:06 pleia2: so, *tripleo* wants to test the full path.
20:39:26 pleia2: one reason you've been steered at tripleo, I think, is so that you can kill two birds with one stone.
20:39:47 pleia2: a) nova baremetal functional/integration test. b) tripleo boot-stack functional/integration test.
20:40:01 pleia2: I'll let -infra folk weigh in on the relative importance of that, but...
20:40:03 yeah, and it also tests dib
20:40:16 pleia2: for my part, I think 'let's get /a/ test in place and upgrade it later'
20:40:18 (or, is potentially broken by dib :))
20:40:59 now, in the absence of a test cloud with nested vm enabled.... which btw the grizzly POC rack could be set up as
20:41:06 or a bare metal test cloud
20:41:15 we're going to have nested KVM for the baremetal node you boot
20:41:27 we don't have to have nested KVM for the boot-stack node.
20:41:34 SpamapS: oh, I may have misinterpreted you....
20:41:41 so my thought was to spin up 3 hpcloud/rackspace instances
20:41:43 pleia2: we could run the boot-stack image in lxc perhaps.
20:42:08 SpamapS: ^ is that what you meant ?
20:42:28 one would be what usually is physical hardware, then boot-stack, then the baremetal node, but those are all public machines, not a private lan where they all talk
20:43:06 I don't think it will buy you anything
20:43:16 yeah, it's a mess
20:43:20 as they'll have to run a VPN to get the layer 2 network to do PXE
20:43:27 and that implies nested KVM on each machine
20:43:50 except the boot-stack one; but - see lxc.
20:43:55 so, SpamapS is afk ;). I'll riff
20:44:23 use dib to build a boot-stack image. loopback mount it and lxc boot it - no nested kvm
20:44:29 hah, so lxc container inside the hpcloud instance?
20:44:33 we document 'use kvm to boot the seed cloud'
20:44:46 we can also document 'use lxc to boot the seed cloud', just as well
20:44:56 bmnodes will still be nested kvm
20:45:14 and you'll still have a br99 or whatever between the bm nodes and eth1 in the boot-stack container.
20:45:14 sorry yeah had local interrupt
20:45:17 well, qemu, right?
20:45:23 since we can't do nested kvm
20:45:23 pleia2: ack
20:45:35 lifeless: and yes I meant run boot-stack in lxc
20:45:49 SpamapS: so yeah, I misinterpreted you :(.
20:45:53 SpamapS: argue more, dammit!
20:46:12 lifeless: I was mid-argument when wife needed muscles
20:47:00 doh!
20:47:03 pleia2: what do you think ?
20:47:05 ok, so instead of booting boot-stack as a kvm instance, we make it lxc (can lxc boot qcow2?), right?
20:47:36 then we just create the bmnode as usual (except with qemu rather than kvm)
20:48:12 just I run boot-stack setup on three virtualbox vms, dib, boot-stack, bm-node.... no nested vms at all
20:48:25 *justFYI*
20:48:34 NobodyCam: those are nested when your host is a cloud instance
20:48:38 NobodyCam: that's the issue
20:48:44 yeah, we're doing this on a public cloud
20:48:47 pleia2: yes. And qemu-nbd can loopback mount qcow2.
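A sketch of the loopback-mount idea just mentioned, using qemu-nbd to expose the dib-built qcow2 so its root filesystem can back an lxc container instead of a nested VM; the device, partition layout and paths are assumptions.

```python
#!/usr/bin/env python
# Sketch of loopback-mounting a dib-built qcow2 with qemu-nbd so the
# boot-stack root filesystem can back an lxc container instead of a
# nested VM.  /dev/nbd0, the partition number and the paths are
# assumptions.
import subprocess

IMAGE = "boot-stack.qcow2"      # hypothetical dib output
MOUNTPOINT = "/mnt/boot-stack"


def sh(*cmd):
    subprocess.check_call(list(cmd))


sh("modprobe", "nbd", "max_part=8")
sh("qemu-nbd", "--connect=/dev/nbd0", IMAGE)
sh("mkdir", "-p", MOUNTPOINT)
sh("mount", "/dev/nbd0p1", MOUNTPOINT)  # assumes a single root partition

# MOUNTPOINT now holds the image's root filesystem and could be pointed
# at by an lxc config as the container rootfs.  Teardown is the reverse:
# umount, then 'qemu-nbd --disconnect /dev/nbd0'.
```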
20:48:57 lifeless: ok, cool
20:49:15 ok, I have a plan, thanks lifeless and SpamapS
20:49:19 cool
20:49:23 (now to learn more about lxc :))
20:49:32 #topic open discussion
20:50:15 anything?
20:50:17 so many bugs
20:50:20 so little time :)
20:50:32 (I think that may mean tripleo is healthy)
20:51:44 oh
20:51:46 MORE PEOPLEZ PLEAHSE
20:51:51 os-config-applier is now os-apply-config
20:52:24 also I was thinking o-a-c should have a way to reference instance metadata the same way it references heat metadata.
20:52:43 mmm
20:52:55 what about a thing to suck instance metadata down to disk as json
20:53:13 and oac unions multiple json files in some well-defined manner?
20:53:26 yeah that's the way I was thinking of doing it actually.
20:53:58 local_ip = {{instance_metadata.private_ip}} or something like that.
20:54:16 sob, you want to kill my sed ?:)
20:54:39 should we namespace the heat variables too ?
20:54:43 {{heat.goo}} ?
20:54:49 Yeah that's what pops into my head as well
20:55:01 so
20:55:04 though another thought is to just reserve some namespaces
20:55:07 what I was thinking was that neither was namespaced
20:55:22 and we define what happens on conflicts in a formal predictable manner
20:55:25 hey guys
20:55:35 so that you can locally override something
20:55:41 sthakkar: hi ?
20:56:04 * mestery thinks sthakkar is early for the next meeting. :)
20:56:20 mestery is right. sorry guys :)
20:56:55 lifeless: well I will put together a bug about the need for access to metadata.. the design can come later.
20:57:16 ok, so I think that's a wrap then.
20:57:19 last call
20:57:57 #endmeeting
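As a footnote to the os-apply-config discussion above: the "unions multiple json files" idea amounts to a deep merge of several metadata documents with a defined precedence on conflicts. A minimal sketch of one possible merge, assuming later sources win; this is illustrative only and not what os-apply-config implements.

```python
# Sketch of the "union multiple json files" idea: deep-merge several
# metadata documents with a defined precedence (later sources win).
# Purely illustrative; not os-apply-config's actual behaviour.
import json


def deep_merge(base, override):
    """Return base updated with override; nested dicts merge recursively."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result


def load_metadata(paths):
    """Merge e.g. heat metadata first, then instance metadata on top."""
    merged = {}
    for path in paths:
        with open(path) as f:
            merged = deep_merge(merged, json.load(f))
    return merged
```

Under this scheme a key supplied by instance metadata would override the same key from heat, which is the "locally override something" behaviour discussed above.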