19:00:30 #startmeeting ironic
19:00:31 Meeting started Mon Dec 2 19:00:30 2013 UTC and is due to finish in 60 minutes. The chair is devananda. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00:32 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00:35 The meeting name has been set to 'ironic'
19:00:37 #chair NobodyCam
19:00:38 Current chairs: NobodyCam devananda
19:00:45 :)
19:00:52 #topic greetings & roll call
19:00:55 hi all! who's here?
19:00:57 o/
19:00:58 o/
19:01:03 o/
19:01:04 * NobodyCam o/
19:01:08 o/
19:01:11 o/
19:01:29 great!
19:01:33 for reference, here's the agenda
19:01:34 #link https://wiki.openstack.org/wiki/Meetings/Ironic#Agenda_for_next_meeting
19:02:02 #topic announcements / updates
19:02:13 just one today
19:02:42 i started writing a consistent hash ring for the conductor service
19:03:03 to solve several problems that we ran into over the last two weeks
19:03:10 nice, any idea when ur going to put a review up?
19:03:29 around routing RPC messages, knowing which conductor is responsible for which instance, and handling failures
19:03:32 yes
19:03:34 one already landed
19:03:40 and btw, do we need it in order to get out of the incubation process?
19:04:24 lucasagomes: I think we do. or we'll have to do a lot of re-engineering
19:04:35 * devananda looks for the commit
19:04:35 NobodyCam, right
19:04:54 right yea def it's important
19:05:03 #link https://review.openstack.org/#/c/58607/
19:05:14 just asking cause in our last conversation it was unclear whether we would need it or not
19:05:17 yea
19:05:40 #link https://blueprints.launchpad.net/openstack/?searchtext=instance-mapping-by-consistent-hash <- fyi
19:05:43 so i did the mental exercise of: what happens if we don't have this? --> we can only run one conductor instance --> can we exit incubation like that? --> probably not.
19:06:05 and decided to dig in and get'er'done
19:06:19 great!
19:06:32 also talked with the nova folks briefly about that, and got the same impression from them
19:07:05 o/
19:07:24 I know tripleo might not need it at the moment, but yea having it done is def a big/great step
19:08:21 the lack of this in nova-bm can be worked around with eg. pacemaker + drbd
19:09:17 in theory that could also provide some HA for ironic. but we would still need to restrict it to a single conductor instance (until we add this hash ring)
19:09:26 so. bleh. more to do.
19:09:46 ok, moving on (we can come back to this in open discussion)
19:09:55 #topic action items from last week
19:10:06 #link http://eavesdrop.openstack.org/meetings/ironic/2013/ironic.2013-11-25-19.01.html
19:10:20 i think 2 of those were romcheg's, and he's not here today
19:10:30 I ported the last bug on the list at the whiteboard today
19:10:42 gotta find some more things that might need to be ported from nova-bm havana to ironic
19:10:54 dkehn and NobodyCam and I talked about the nova-network api bits, and I think it's clear. dkehn, any updates?
19:10:56 #link https://review.openstack.org/#/c/59493/
19:11:04 lucasagomes: awesome, thanks!
19:11:12 nothing that I've heard
19:11:14 lucasagomes: great!
19:11:45 I think everyone is watching the neutron stabilization progress
19:12:09 dkehn: when do you think you'll have some code for the pxe driver -> neutron integration?
19:12:37 working on it presently, ran into issues with bringing up the dev env
19:12:47 working with NobodyCam to resolve
19:12:53 ack
19:13:21 but the 1st stage is just the PXE data, then will work on the rest assuming no issues with the env
19:14:24 dkehn: I will be afk a good chunk of today, hit me on gtalk if you have questions
19:14:35 as far as oslo/sqla-migrate vs. alembic, folks seem to still like alembic in principle, but i don't have any concrete "yes we should move" answer yet
19:14:36 k
19:15:20 #topic integration and testing
19:15:53 romcheg isn't here today, and he's been doing most of the work in this area
19:16:19 yuriyz: don't suppose you have any updates from him?
19:16:26 I have been working on the nova integration
19:16:42 we are making progress.
19:16:51 no updates from romcheg
19:17:02 we are going to make a proof of concept scheme for integration testing
19:17:06 Exposing node deploy() will be a biggie
19:17:33 vkozhukalov: are you working with -infra on that?
19:17:59 something like launching one VM, installing ironic on it, then launching another VM and booting it from the first one
19:18:23 devananda: no, we just started to do that
19:18:44 vkozhukalov: there's a lot of work that has been / is being done around testing tripleo in -infra, which means using nova-baremetal. much of that work can probably be used in the same way for testing ironic
19:19:16 vkozhukalov: will you be around after this meeting? we should chat with the infra team :)
19:19:31 devananda: ok, we can
19:19:36 vkozhukalov: great, thanks!
19:19:43 #topic nova driver
19:19:53 oh thats me
19:19:55 (skipping the client because NobodyCam has to leave soon -- will come back to it)
19:20:03 NobodyCam: hi! how goes it?
19:20:12 we are making progress. :-p
19:20:22 can we jump to api
19:20:29 oh. sure
19:20:39 #topic python-ironicclient & API service
19:20:43 lucasagomes: that's you!
19:20:49 lucasagomes: you have thoughts on deploy?
19:20:49 vkozhukalov, might be worth taking a look at https://github.com/openstack-infra/tripleo-ci
19:20:56 oh right, so as NobodyCam mentioned
19:21:11 we need to expose a way to trigger the node deploy from the API/client libraries
19:21:16 I thought about something like
19:21:48 POST /nodes/<uuid>/deploy returning 202 with the Location header field pointing to /nodes/<uuid>/state in case the request gets accepted
19:22:00 403 in case the deployment was already triggered and not completed
19:22:06 also we need a way to abort
19:22:19 so folks could do a DELETE /nodes/<uuid>/deploy
19:22:26 lucasagomes: will deploy be sync or async?
19:22:28 to abort the operation
19:22:31 async
19:22:36 that's why 202 + location
19:22:55 location = you can look at the state resource to see which state the node is currently in
19:23:01 lucasagomes: what will be in the POST body?
19:23:02 + the target state
19:23:26 devananda, didn't think about it yet
19:23:35 just got to thinking about how it would work at the end of today
19:23:42 NobodyCam: you'll need to have a while loop in the nova driver, polling /nodes/<uuid>/state, to see when it reaches "done", or if it errors, and also tracking some timeout in Nova
19:23:42 so there are some gaps, just the initial idea
19:23:44 so the nova driver will have to poll
19:23:50 yep
19:24:35 if nova times out (i.e. very long deploy) we will be able to roll back / delete
19:24:45 does the nova driver use the ironicclient, or issue a POST directly?
19:24:46 NobodyCam: eg, https://github.com/openstack/nova/blob/master/nova/virt/baremetal/pxe.py#L455
19:24:53 rloo: ironicclient
19:24:58 rloo, it uses the ironic client libs
19:24:59 currently ironic client
19:25:04 :)
19:25:04 hmm, what about an eg --poll option, like nova boot has?
19:25:21 rloo: that's a CLI thing
19:25:40 rloo: CLI and API are distinct, even though they're packaged together
19:26:01 ya that loop really needs to be in nova
19:26:05 oops.
19:26:08 *nova driver
19:26:20 nova driver wraps the client API. the CLI also wraps the client API... BUT the CLI shouldn't include any "deploy" method
19:26:22 Is the Ironic meeting still here?
19:26:27 romcheg_: hi! yes
19:26:34 yea
19:26:39 hi romcheg_ :)
19:26:46 yea just the lib will contain the method to trigger the deployment
19:26:57 cli won't expose it
19:26:58 Hi, sorry for being late. Street meeting took more time :)
19:27:00 romcheg_: we can come back to your things in a few minutes
19:27:24 lucasagomes: I think a POST with 202 is fine in principle
19:27:42 I'm good with polling
19:27:44 right, it's also good to say that
19:28:01 WSME right now doesn't support returning a Location in the HTTP header
19:28:03 so long as we can "break/stop" the deploy
19:28:10 #link https://bugs.launchpad.net/wsme/+bug/1233687
19:28:12 Launchpad bug 1233687 in wsme "Return Location with POST 201 return code" [Wishlist,New]
19:28:24 same prob for 202 ^
19:28:25 hah
19:28:27 ok
19:28:41 so people go there and click the "affects me" button :P
19:28:44 so we can just do that in the nova driver anyway
19:28:58 * devananda clicks "affects me"
19:29:01 lucasagomes: I could build the link
19:29:08 ya
19:29:19 NobodyCam, yes you can build it, np
19:29:40 i don't think you need to build it, really
19:29:49 yea build it = call a method in the lib
19:29:51 * NobodyCam is running short on time
19:29:51 the client lib already has a node.state object, ya?
19:29:53 right
19:30:24 NobodyCam: go if/when you need to. i can fill you in later
19:30:46 devananda: TY ... sorry for running out 1/2 way thru..
19:31:09 NobodyCam_afk, see ya later
19:31:26 lucasagomes: have more to discuss on the API / client libs?
19:31:34 devananda, not from me
19:31:41 if there are no objections I will start working on that tomorrow
19:31:48 lucasagomes: ++
19:31:52 so NobodyCam_afk can start using it asap
19:31:58 anyone else, questions on API / client?
19:32:12 what about a way to interrupt it?
19:32:30 rloo, it will use DELETE to abort the operation
19:32:41 so the same way you POST to that resource to trigger the deploy
19:32:47 you can DELETE to abort
19:32:55 Ok. (I have to admit, i don't know what already exists.)
19:32:58 as far as an API goes, I think that's reasonable
19:33:18 i'm not sure how easily we can get the plumbing to actually interrupt an in-progress deploy
19:33:38 yea, that will be another challenge :)
19:33:50 probably solved by the way the ramdisk will do things
19:33:53 and we probably shouldn't implement DELETE /nodes/<uuid>/deploy until we can actually satisfy that request
19:33:57 like asking for the next steps
19:34:10 perhaps
19:34:16 my concern is more around the node locking
19:34:57 like, aborting not releasing the node?
19:35:13 whether DELETE / interrupt is async or sync, we'll still have the problem that the node resource is locked by the greenthread which is doing the deploy
19:35:42 we can't just go update the DB record while that's going on and expect deploy() to behave reasonably
19:36:15 oh yea, aborting will need effort in a couple of areas
19:36:24 most important maybe is the ramdisk
19:36:30 how it will know it has to abort etc
19:36:39 right. I think DELETE of an in-progress deploy should wait until we can look more into those areas, and it's not needed for coming out of incubation
19:37:00 cool, so do not expose DELETE for now?
19:37:13 or expose it and raise a NotImplemented error?
19:37:28 I would not expose it yet
19:37:41 right
19:37:47 +1 for not exposing DELETE
19:37:55 NotImplemented vs NotFound. I prefer the latter
19:38:00 I was going to suggest exposing it / raising an error, and adding a comment why.
19:38:45 or have some place so someone knows why DELETE doesn't exist yet.
19:39:03 rloo, I think the main thing is that we don't need it for coming out of incubation, so we can add it after (and also docs about it)
19:39:09 we can certainly add inline documentation in the API code about it
19:39:16 and it may be worth adding a BP to track the intent for it
19:39:17 yea like a TODO there
19:39:29 yeah, i understand. but it is hard to know, during all this progress, what is avail, not avail, and why.
19:39:31 devananda, if u want to give me one action to write a bp
19:39:49 #action lucasagomes to file a BP for DELETE /nodes/<uuid>/deploy
19:40:03 rloo, do you think a TODO in the code explaining our intentions and why it's not implemented at the moment would be enough?
19:40:16 yup. enough for me anyway :-) thx.
19:40:25 ok will do that :)
19:41:19 ok, moving on
19:41:23 romcheg_: still around?
19:41:48 devananda: yup
19:42:20 #topic integration and testing
19:42:22 Actually I do not have a lot of updates.
19:42:52 romcheg_: give us what you've got :)
19:43:16 I rebased my patch to infra-config onto clarkb's and am waiting until that refactoring is finished
19:43:54 romcheg_: my change just got approved, it needs a little babysitting, but once we are happy with it, your change will be reviewable
19:43:55 The tempest patch does not attract a lot of people unfortunately
19:44:20 clarkb: Cool. Will take a look at that in the morning
19:44:28 romcheg_: if i understand correctly, after your infra/config patch lands, we should have some devstack tests in ironic's pipeline, yes?
19:45:31 devananda: as we discussed previously, we will add tempest tests for Ironic to the gate and check pipelines in Ironic, and to the experimental pipeline for tempest
19:45:48 romcheg_: right
19:46:32 romcheg_: i'm wondering if there are any other dependencies, besides https://review.openstack.org/#/c/53917, to get it working in the ironic pipeline
19:47:25 devananda: No, only this configuration change and the tests
19:47:39 great
19:48:09 #topic open discussion
19:48:15 aight
19:48:24 look! a whole 12 minutes for open discussion today :)
19:48:29 devananda, is it part of Ironic's plans to get metrics from other devices (e.g. storage arrays) just like we will be getting metrics for servers (via IPMI)?
19:48:36 I continuously check the tests against the latest Ironic to detect any changes that break Ironic
19:48:52 lucasagomes: only devices which ironic is managing/deploying to
19:49:03 right
19:49:16 Hopefully everything works now, so as soon as those two patches land, we will have tempest tests for Ironic
19:49:28 lucasagomes: there was some interest in having ironic able to deploy firmware // small OS images to network switches (eg, open compute / ODCA stuff)
19:49:46 lucasagomes: which I generally don't think we're ready for. but that kinda touches on the same space as your question
19:49:47 so we plan to do things on switches? e.g. that would be one of the kinds of devices we would be able to control using ironic?
19:49:49 right
19:50:18 that makes sense to me
19:50:45 lucasagomes: the whole "configure the hardware" bit gets very weird if ironic starts configuring switches and SANs
19:51:04 lucasagomes: i really don't think it should do that. we have other services for that
19:51:45 right yea we should not start squashing a lot of things into ironic for sure
19:51:49 focused tools
19:52:09 lucasagomes: right. OTOH, if someone wants to install a new OS on their switch, well, _that_ is Ironic's domain
19:52:28 but it exposes some really weird questions
19:53:03 talking to Nova's API to have Ironic deploy an image from Glance onto their hardware switch, then using Neutron to configure that switch
19:53:15 yea, it's not something for icehouse for sure, but in the future we might need to start discussing things like it
19:53:17 but we needed some networking in order to do the deploy in the first place ....
19:53:37 definitely worth longer discussions
19:54:20 devananda, another question re consistent hashing... I saw ur implementing the hashring class, we r not going to use any lib that does it already? any reasons for that, lack of py3 support?
19:54:44 devananda, +1 for discussions
19:55:30 lucasagomes: i found 2 py libs out there, one of which was unmaintained, and none within openstack yet
19:55:40 i looked at swift's hash ring
19:55:55 and talked with notmyname to get some ideas, but that code is too tightly coupled to swift's particular needs
19:56:17 so this is coupled to our needs, and the ring code itself is pretty small
19:56:30 right
19:56:34 the complexity is going to be in the routing and rebalancing code that I'm working on now
19:56:54 that has to do with the list of dead conductors?
19:56:57 yes
19:57:05 right yea I understand
19:57:18 good stuff :)
19:57:21 eg, a conductor misses a few heartbeats -- don't take it out and rebalance everything. just skip it and talk to the next replica
19:57:58 i think we only need to do a full rebalance in two cases: a new conductor joins the ring; the admin removes a conductor from the ring
19:57:59 anything about the number of replicas?
19:58:08 I saw that they tend to use like loads of replicas to make it faster
19:58:31 lucasagomes: in swift, sure. in ironic, more replicas won't make deploys faster or anything
19:58:35 devananda, ahh that's interesting yea, if someone joins
19:58:36 just means more resilience to temporary failures
19:58:53 we would need to rebalance and set the nodes to be controlled by specific conductors
19:59:22 rebalance will redistribute the nodes across conductors (with the appropriate drivers)
19:59:39 it's not a manual admin-has-to-move-nodes thing
20:00:08 and the conductor<->node relationship isn't stored in the DB (apart from TaskManager locks)
20:00:38 i think we'll need some good docs for the hash ring stuff
20:00:39 oh yea otherwise it would be even more complicated to do a takeover
20:00:47 so i'm going to work on diagrams today to explain them
20:00:53 cool
20:00:57 looking fwd to seeing some patches coming
20:01:19 i'll un-draft the patch i have once i've cleaned it up a bit
20:01:45 great :)
20:02:19 anything else? we're a tad over time now
20:02:49 ok, thanks all!
20:02:58 #endmeeting
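
A minimal sketch of the deploy flow discussed above (19:21-19:24), assuming the proposed POST /nodes/<uuid>/deploy -> 202 + Location: /nodes/<uuid>/state shape: the nova driver triggers an async deploy, then polls the node's state until it finishes, errors, or times out. The client methods, state names, and timeout values here are illustrative assumptions, not the real python-ironicclient API, which did not yet expose a deploy call at the time of this meeting.

```python
import time

# Hypothetical values -- in practice Nova would own the timeout policy.
DEPLOY_TIMEOUT = 1800   # seconds
POLL_INTERVAL = 10      # seconds


def deploy_and_wait(client, node_uuid):
    """Trigger an async deploy and poll the node's state until it settles.

    client.node.deploy() and client.node.get_state() are hypothetical
    stand-ins for whatever the client library ends up exposing.
    """
    # Proposed API: POST /nodes/<uuid>/deploy -> 202 Accepted,
    # Location: /nodes/<uuid>/state
    client.node.deploy(node_uuid)

    deadline = time.time() + DEPLOY_TIMEOUT
    while time.time() < deadline:
        state = client.node.get_state(node_uuid)   # GET /nodes/<uuid>/state
        if state == 'deployed':
            return
        if state == 'error':
            raise RuntimeError('deploy failed for node %s' % node_uuid)
        time.sleep(POLL_INTERVAL)

    # DELETE /nodes/<uuid>/deploy (abort) is deliberately not exposed yet,
    # so on timeout the driver can only report failure and let Nova
    # roll back / delete the instance.
    raise RuntimeError('deploy timed out for node %s' % node_uuid)
```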
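
A toy consistent hash ring in the spirit of what devananda describes (19:02-19:06 and 19:54-19:58): conductors are hashed onto a ring at several virtual points, each node UUID maps to the next few distinct conductors, and extra replicas buy resilience to temporary failures rather than speed. This is an illustrative sketch, not the code under review in https://review.openstack.org/#/c/58607/; the class and parameter names are made up.

```python
import bisect
import hashlib


class HashRing(object):
    """Toy consistent hash ring mapping node UUIDs onto conductor hostnames."""

    def __init__(self, hosts, replicas=2, partitions_per_host=32):
        self.hosts = list(hosts)
        self.replicas = replicas
        self._ring = {}          # hash value -> conductor hostname
        self._sorted_keys = []   # sorted hash values
        for host in self.hosts:
            # Several virtual points per conductor smooth out the distribution.
            for i in range(partitions_per_host):
                key = self._hash('%s-%d' % (host, i))
                self._ring[key] = host
                bisect.insort(self._sorted_keys, key)

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode('utf-8')).hexdigest(), 16)

    def get_hosts(self, node_uuid):
        """Return the `replicas` distinct conductors responsible for a node."""
        wanted = min(self.replicas, len(self.hosts))
        idx = bisect.bisect(self._sorted_keys,
                            self._hash(node_uuid)) % len(self._sorted_keys)
        hosts = []
        while len(hosts) < wanted:
            host = self._ring[self._sorted_keys[idx]]
            if host not in hosts:
                hosts.append(host)
            idx = (idx + 1) % len(self._sorted_keys)
        return hosts
```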
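
Hypothetical usage of that sketch, matching the failure handling described at 19:57: a conductor that misses a few heartbeats is simply skipped in favour of the next replica, and a full rebalance happens only when a conductor joins the ring or the admin removes one. The `alive` set stands in for whatever heartbeat bookkeeping the conductor service keeps; the hostnames are invented.

```python
ring = HashRing(['cond-1.example.org', 'cond-2.example.org', 'cond-3.example.org'],
                replicas=2)


def pick_conductor(ring, node_uuid, alive):
    """Return the first responsible conductor that is still heartbeating."""
    for host in ring.get_hosts(node_uuid):
        if host in alive:
            return host
    # Every replica is down; only now would a rebalance (or operator
    # intervention) be required.
    raise RuntimeError('no live conductor found for node %s' % node_uuid)
```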