21:00:26 <oneswig> #startmeeting scientific-sig
21:00:27 <openstack> Meeting started Tue Sep 18 21:00:26 2018 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:27 <janders_> g'day everyone
21:00:28 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:30 <openstack> The meeting name has been set to 'scientific_sig'
21:00:37 <oneswig> greetings janders_ and all
21:00:50 <oneswig> what's new?
21:01:20 <oneswig> #link agenda for today is https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_September_18th_2018
21:02:08 <oneswig> Tomorrow is upgrade day over here... but this specific time it's Pike->Queens
21:03:17 <oneswig> We've been doing the drill on the staging environment but there's nothing quite like the real thing ...
21:03:34 <janders_> oneswig: what are the main challenges?
21:04:20 <oneswig> In this case, not too many. One concern is correctly managing resource classes in Ironic
21:04:42 <janders_> right! are you doing BIOS/firmware upgrades as well?
21:05:07 <oneswig> oh no. That's not in the plan (should it be I wonder?)
21:05:28 <b1air> o/
21:05:44 <oneswig> G'day b1air, which airport are you in today? :-)
21:05:44 <b1air> Do all the changes all at once!!
21:05:56 <janders_> if you were to, would you use something like lifecycle manager, or would you temporarily boot ironic nodes into a "service image" with all the tools?
21:05:58 <b1air> Very near AKL as it happens
21:06:00 <oneswig> Fighting talk from a safe distance, that
21:06:07 <b1air> ;-)
21:06:34 <oneswig> janders_: last time we did this, it was the latter - a heat stack for all compute instances with a service image in it.
21:07:16 <janders_> right! in a KVM-centric world, it's easy - just incorporate all the BIOS/FW management tools in the image. Ironic changes this paradigm so I was wondering how do you go about it. Might be an interesting forum topic.
21:07:26 <martial_> (difficulty joining on the phone)
21:07:30 <oneswig> Have you seen an Ansible playbook for doing firmware upgrades via the dell idrac?
21:07:33 <oneswig> Hello martial_
21:07:40 <oneswig> #chair b1air martial_
21:07:41 <openstack> Current chairs: b1air martial_ oneswig
21:07:45 <oneswig> (remiss of me)
21:07:47 <janders_> do you pxeboot the service image via ironic or outside ironic?
21:08:18 <oneswig> In that case we booted it like a standard compute instance, via Ironic
21:08:24 <b1air> KVM world easy? Pull the other one @janders_ ! :-)
21:08:54 <janders_> no.. I looked at the playbooks managing the settings but not the BIOS/FW versions. If it works (and I'm not worried about the playbooks, I'm worried about the Dell hardware side :) it'd be gold
21:09:14 <janders_> oneswig: does this mean you had to delete all the ironic instances first?
21:09:31 <janders_> b1air: KVM world is easy in this one sense :)
21:09:42 <oneswig> In that case, yes - I guess the lifecycle manager could have avoided that, do you think?
21:10:01 <janders_> oneswig: yes - it will do all of this in the pre-boot environment (if it works..)
21:10:41 <janders_> when I say "if it works" - on our few hundred nodes of HPC it definitely works for 70-95% nodes. Success rates vary. The ones that failed usually just need more attempts.. (thanks, Dell)
21:11:10 <b1air> Power drain?
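For reference on the out-of-band route discussed above, a minimal sketch of reading an iDRAC's firmware inventory over its Redfish API, which reports the same component versions the Lifecycle Controller acts on. The iDRAC address and credentials are placeholders, and the exact inventory entries vary by model and generation:

    # Read the firmware inventory from an iDRAC over Redfish (13G-era and later).
    # Address and credentials are placeholders; verify=False assumes the default
    # self-signed iDRAC certificate.
    import requests

    IDRAC = "https://192.0.2.10"    # placeholder iDRAC address
    AUTH = ("root", "calvin")       # placeholder credentials

    inventory = requests.get(
        IDRAC + "/redfish/v1/UpdateService/FirmwareInventory",
        auth=AUTH, verify=False, timeout=30,
    ).json()

    for member in inventory.get("Members", []):
        item = requests.get(IDRAC + member["@odata.id"],
                            auth=AUTH, verify=False, timeout=30).json()
        # Each entry reports a component (BIOS, iDRAC, NIC, ...) and its version.
        print(item.get("Name"), item.get("Version"))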
21:11:30 <janders_> however I am unsure if Mellanox firmware can be done via Lifecycle Controller (we usually do this part from the compute OS)
21:11:33 <oneswig> janders_: is this the playbooks at https://github.com/dell/Dell-EMC-Ansible-Modules-for-iDRAC ?
21:12:10 <b1air> janders_: only if it is a Dell OEM Mellanox part - that's the value add
21:13:28 <janders_> b1air: most of our HCAs are indeed OEM - I need to revisit this (I guess the guys have always done this with mft & flint, cause it works 99/100) - in the ironic world doing everything from LC could simplify things
21:14:27 <janders_> closer to the main topics - from your experience, how big do the forum sessions typically get?
21:14:49 <oneswig> janders_: there has also been talk previously of performing these actions as a manual cleaning step - less obtrusive but without out-of-band dependencies on idrac
21:15:00 <b1air> At Monash we found the LCs to be ok reliability-wise from 13G
21:15:27 <oneswig> janders_: perhaps we should, indeed, look at the agenda..
21:15:37 <oneswig> #topic Forum sessions
21:16:18 <oneswig> Forum sessions I've been in have ranged in size from ~8 people to ~50 (but about 12 holding court)
21:16:29 <janders_> oneswig: this is a neat way to do it in a rolling fashion - however the drawback is having a mix of versions for quite a while as users delete/reprovision the nodes. I'm trying to come up with an option of doing it all in a defined downtime window, without affecting existing ironic instances.
21:16:36 <janders_> b1air: that is great to hear! :)
21:16:55 <janders_> oneswig: that is good - it shouldn't be impossible to get some bandwidth in these sessions! :)
21:17:32 <oneswig> I get the feeling one on Ironic and BIOS firmware management could be interesting!
21:17:46 <oneswig> Facilitating it but also, conversely, preventing it
21:19:30 <priteau> janders_: I think at CERN they have a way of letting the instance owner select their downtime period
21:19:52 <priteau> I am trying to find where I saw it described
21:20:14 <oneswig> Good evening priteau!
21:20:34 <janders_> wow - very cool idea.. I wonder if it's leveraging AZs (which might have different downtime windows) or something else
21:20:35 <priteau> Hi everyone by the way :-)
21:20:54 <priteau> janders_: it may even be per-host
21:21:26 <b1air> Sounds a bit like AWS' reboot/downtime scheduling API
21:22:42 <janders_> thinking about it - if it's just the instance that's supposed to be up and it has no volumes etc attached it can be quite fine grained
21:23:13 <janders_> however if the instance is leveraging any services coming off the control plane, it might be tricky to go below AZ-level downtime
21:23:28 <janders_> or at least that's my quick high level thought without looking into details
21:23:51 <janders_> very interesting topic though! :)
21:24:39 <oneswig> question of procedure - do we add a proposal like this to the Ironic forum etherpad, or mint our own SIG etherpad and add it to the list?
21:25:42 <priteau> I found http://openstack-in-production.blogspot.com/2018/01/keep-calm-and-reboot-patching-recent.html, but it's not how I remember it
21:26:54 <oneswig> Another area I am interested in pursuing is support for the recent features introduced to Ironic for alternative boot methods (boot from volume, boot to ramdisk) - is there scope for getting these working with multi-tenant networking?
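On the manual-cleaning idea raised above, a minimal sketch of how a firmware update could be driven as a manual cleaning step through the Ironic API using openstacksdk. The cloud and node names are placeholders, and the step name and arguments shown are purely illustrative stand-ins; the real steps available depend on what the node's hardware type and agent hardware managers advertise:

    # Run a manual cleaning step on an Ironic node via openstacksdk.
    # The clean step below is illustrative only; check the driver's documented
    # clean steps for real names and arguments.
    import openstack

    conn = openstack.connect(cloud="mycloud")        # placeholder clouds.yaml entry
    node = conn.baremetal.find_node("compute-001")   # placeholder node name

    # Manual cleaning requires the node to be in the 'manageable' state first.
    conn.baremetal.set_node_provision_state(node, "manage", wait=True)

    conn.baremetal.set_node_provision_state(
        node,
        "clean",
        clean_steps=[{
            "interface": "deploy",
            "step": "update_firmware",        # illustrative / hypothetical step name
            "args": {"firmware_images": [{"url": "http://example.com/bios.img"}]},
        }],
        wait=True,
    )

    # Return the node to the available pool once cleaning finishes.
    conn.baremetal.set_node_provision_state(node, "provide", wait=True)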
21:26:55 <priteau> Maybe there is another procedure for the less critical upgrades
21:30:19 <janders_> oneswig: alternative boot methods would definitely be of interest. Looking at the PTG notes there are some good ideas so it looks like the next step would be to find out if/when these ideas can be implemented
21:31:01 <janders_> something from my side (across all the storage-related components) would be BeeGFS support/integration in OpenStack
21:31:30 <oneswig> Ooh, interesting.
21:31:35 <janders_> would you guys be interested in this, too?
21:31:37 <oneswig> Like, in Manila?
21:31:48 <janders_> yes, that's the most powerful scenario
21:32:05 <oneswig> Absolutely! We've got playbooks for it, but nothing "integrated"
21:32:12 <oneswig> (but does it need to be?)
21:32:16 <janders_> but running VM instances (for those who still need VMs) and cinder volumes off BeeGFS would be of value as well
21:32:56 <oneswig> That follows quite closely what IBM was up to with SpectrumScale
21:33:00 <janders_> given no kerberos support in BeeGFS for the time being I think it would be very useful to have some smarts there
21:33:22 <oneswig> OK, let's get these down...
21:33:29 <janders_> haha! you found the logic behind my thinking
21:33:38 <oneswig> #link SIG brainstorming ideas https://etherpad.openstack.org/p/BER-stein-forum-scientific-sig
21:33:58 <janders_> I liked what IBM have done with GPFS/Spectrum however I find deploying and maintaining this solution more and more painful as time goes
21:34:13 <janders_> I see the same sentiment on the storage side
21:34:22 <janders_> "it's good, but..."
21:34:41 <janders_> I'll add some points to the etherpad now
21:35:11 <janders_> ok, you already have - thank you! :)
21:36:10 <janders_> another storage related idea
21:36:27 <janders_> would you find it useful to be able to separate storage backends for instance boot drives and ephemeral drives?
21:36:43 <janders_> I like the raw performance of node-local SSD/NVMe
21:37:10 <janders_> however having something more resilient (and possibly shared) for the boot drive is good, too
21:37:34 <janders_> I would happily see support for splitting the two up (I do not think this is possible today, please correct me if I am wrong)
21:37:41 <goldenfri> I was just thinking about that today, so I 2nd that
21:38:17 <janders_> in this case, we could even wipe ephemeral on live migration (this would have to be configurable) so only the boot drive needs to persist
21:38:53 <oneswig> It seems like a good idea to me, certainly worth suggesting
21:38:58 <janders_> ok!
21:38:59 <oneswig> hello goldenfri!
21:39:17 <goldenfri> o/
21:40:07 <priteau> janders_: if the ephemeral storage is mounted while live migrating, wouldn't the guest OS complain if data gets wiped out?
21:42:02 <janders_> good point, there would have to be some smarts around it. I don't have this fully thought through yet, but I think the capability would be useful. Perhaps cloud-* services could help facilitate this?
21:42:10 <oneswig> OK we are linked up to https://wiki.openstack.org/wiki/Forum/Berlin2018#Etherpads_from_Teams_and_Working_Groups
21:42:31 <janders_> but obviously if there's heavy IO hitting ephemeral, some service trying to umount /dev/sdb won't have a lot of luck..
21:42:56 <b1air> +1 to janders_ ephemeral separation feature request
21:43:28 <priteau> janders_: VM-aware live migration?
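As context for the boot/ephemeral split requested above: flavors can already size the root and ephemeral disks separately, but with the libvirt driver both disks land on the same image backend, which is the gap being described. A minimal sketch, assuming openstacksdk and placeholder names and sizes:

    # A flavor can size root and ephemeral disks independently, but Nova's
    # libvirt driver places both on the same backend, which is the limitation
    # discussed above. Flavor name and sizes are placeholders.
    import openstack

    conn = openstack.connect(cloud="mycloud")   # placeholder clouds.yaml entry

    flavor = conn.compute.create_flavor(
        name="hpc.ephemeral-nvme",   # placeholder flavor name
        ram=65536,                   # MB
        vcpus=16,
        disk=40,                     # GB root disk (ideally on resilient/shared storage)
        ephemeral=800,               # GB ephemeral disk (ideally on node-local NVMe)
    )
    print(flavor.id)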
21:43:31 <b1air> I see it more likely to be used with cold migration
21:44:07 <b1air> Where you have a fleet of long lived instances that you want to move around due to underlying maintenance etc
21:46:16 <janders_> another thing I'm looking at is using trim/discard like features for node cleaning - however bits of this might be already implemented, looking at ironic and pxe_idrac/pxe_ilo bits
21:46:23 <janders_> have any of you used this with success?
21:46:52 <janders_> (I might have asked this question here already, not sure)
21:47:22 <b1air> Yes I recall discussing this before, but don't think anything came of it yet
21:47:24 <oneswig> Did we cover this last week? I think there's an Ironic config parameter for key rotation
21:47:57 <b1air> With hardware encrypted storage?
21:48:08 <oneswig> We use it, and when I checked up I believe it was as simple as that - with the caveat that some of the drives needed a firmware update (of course!)
21:48:29 <priteau> janders_: you asked last week ;-) http://eavesdrop.openstack.org/meetings/scientific_sig/2018/scientific_sig.2018-09-12-11.00.log.html#l-139
21:48:40 <oneswig> b1air: hardware encryption as I understand it but with an empty secret.
21:49:01 <oneswig> So not really encryption...
21:49:55 <b1air> Cunning - the baddies will never suspect an empty password!
21:50:04 <janders_> oneswig: :) I've discussed this with too many parties and lost track (scientific-sig, RHAT, Dell, ... )
21:50:39 <oneswig> janders_: your comrades here are the source of truth, you can't trust those other guys :-)
21:51:11 <janders_> that's right :) can't trust those sales organisations
21:51:31 <oneswig> There was one other matter to cover today, before I forget
21:51:38 <priteau> keycloack
21:51:43 <oneswig> #topic SIG event space at Berlin
21:52:09 <oneswig> priteau: I think we have that on the agenda for next week
21:52:18 <priteau> Oh, I looked at the wrong week :-)
21:52:47 <oneswig> I know - it's a handy aide memoire for me, probably confusing for anyone else!
21:53:13 <oneswig> Anyway - We have the option of 1 working group session + 1 bof session (ie, what we've had at previous summits).
21:53:43 <oneswig> I think this works well enough, unless anyone prefers to shorten it?
21:53:59 <oneswig> b1air? martial_? Thoughts on that?
21:56:56 <janders_> I have couple more forum ideas - given we're running low on time I will fire these away now
21:57:17 <oneswig> Please do.
21:57:23 <janders_> 1) being able to schedule a bare-metal instance to a specific piece of hardware (I don't think this is supported today) - would this be useful to you?
21:57:43 <janders_> think --availability-zone host:x.y.z equivalent for Ironic
21:57:44 <oneswig> On the SIG events - looks like Wednesday morning is clear for the AI-HPC-GPU track
21:58:22 <oneswig> janders_: I believe that exists, in the form of a three-tuple delimited by colons
21:58:34 <janders_> 2) I don't think "nova rebuild" works with baremetal instances - I think it would be something useful
21:58:43 <oneswig> The form might be nova::<Ironic uuid of the node>
21:59:24 <oneswig> On 2, are you sure? I think I've rebuilt Ironic instances before
21:59:43 <oneswig> Let's follow up on that...
21:59:48 <janders_> in this case, I will retest both and update the etherpad as required
21:59:58 <oneswig> good plan, let us know!
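On idea 1 above, a minimal sketch of the colon-delimited three-tuple oneswig refers to, i.e. an availability-zone hint of the form zone:host:node with the host element left empty for Ironic. This needs admin rights, and the cloud name, node, image, flavor and network are placeholders:

    # Target one specific Ironic node with the zone:host:node availability-zone
    # hint (host omitted). Requires admin; all names/IDs below are placeholders.
    import openstack
    from novaclient import client as nova_client

    conn = openstack.connect(cloud="mycloud")
    nova = nova_client.Client("2.1", session=conn.session)

    node = conn.baremetal.find_node("rack1-node-07")    # Ironic node name or UUID
    image = conn.compute.find_image("centos7-baremetal")
    flavor = conn.compute.find_flavor("bm.gold")

    server = nova.servers.create(
        name="bm-test",
        image=image.id,
        flavor=flavor.id,
        nics=[{"net-id": "NETWORK-UUID-GOES-HERE"}],    # placeholder network UUID
        availability_zone="nova::{}".format(node.id),   # zone::<ironic node uuid>
    )
    print(server.id)

The equivalent CLI form would be openstack server create --availability-zone nova::<node-uuid>, as suggested in the discussion.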
22:00:07 <oneswig> OK, we are out of time
22:00:14 <oneswig> Thanks everyone
22:00:32 <oneswig> keep adding to that etherpad if you get more ideas we should advocate
22:00:56 <oneswig> https://etherpad.openstack.org/p/BER-stein-forum-scientific-sig
22:00:59 <janders_> thanks guys!
22:01:02 <oneswig> #endmeeting