21:00:08 #startmeeting scientific-sig
21:00:09 Meeting started Tue Nov 26 21:00:08 2019 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:12 The meeting name has been set to 'scientific_sig'
21:00:24 I remembered the runes this week...
21:00:36 #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_November_29th_2019
21:00:42 hey Stig!
21:00:52 Hello all
21:00:57 Hi janders trandles
21:00:59 hi
21:01:09 I understand we'll be missing a few due to Thanksgiving
21:01:14 Hi slaweq, welcome!
21:01:17 Thanks for coming
21:01:40 thx for invite me :)
21:02:00 np, I appreciate it.
21:02:26 Let's go straight to that - hopefully jomlowe will catch us up
21:03:17 #topic Helping supporting Linuxbridge
21:03:23 ok
21:03:37 Hi slaweq, so what's the context?
21:03:39 so, first of all sorry if my message after PTG was scary for anyone
21:04:15 basically in neutron developers team we had feeling that linuxbridge agent is almost not used as there was almost no development of this driver
21:04:31 and as we want not to include ovn driver as one of in-tree drivers
21:04:51 our idea was to maybe start thinking slowly about deprecating linuxbridge agent
21:05:04 but apparently it is used by many deployments
21:05:12 so we will for sure not deprecate it
21:05:20 You don't need to worry about it
21:05:26 I think it is popular in this group because it's quite simple and faster
21:05:35 slaweq: thanks!
21:05:35 hello
21:05:49 Is it causing overhead to keep linuxbridge in tree?
21:05:51 as we have clear signal that this driver is used by people who needs simple solution and don't want other, more advanced features
21:05:51 hi rbudden
21:06:18 but we don't have almost nobody in our devs team who would take care of LB agent and mech driver
21:06:40 does it need much work?
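[Editor's note: for readers new to the topic, the linuxbridge agent discussed here is enabled through neutron's ML2 plugin. A minimal illustrative sketch follows; the interface name, network types and IP are assumptions for the example, not values from this meeting.]

```ini
# /etc/neutron/plugins/ml2/ml2_conf.ini (illustrative)
[ml2]
type_drivers = flat,vlan,vxlan
tenant_network_types = vxlan
mechanism_drivers = linuxbridge

# /etc/neutron/plugins/ml2/linuxbridge_agent.ini (illustrative)
[linux_bridge]
# map the provider physnet to a host NIC (interface name assumed)
physical_interface_mappings = provider:eth1

[vxlan]
enable_vxlan = true
local_ip = 10.0.0.11

[securitygroup]
firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver
```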
21:06:57 apologies for a dumb question - we're only talking about the mech driver here, not the LB between the instance and the hypervisor where secgroups are applied right?
21:06:58 i was going to ask, what’s the overhead to continue support? what’s needed?
21:07:04 so if You are using it, it would be great if someone of You could be like point of contact in case when there are e.g. LB related gate issues or things like that
21:07:16 janders: that's correct
21:07:33 oneswig: no, I don't think this would require a lot of work
21:07:50 Tim joining
21:07:54 slaweq: does that mean joining openstack-neutron etc?
21:08:00 Hi noggin143, evening
21:08:39 oneswig: yes, basically that's what is needed
21:08:48 and I would like to put someone on the list in https://docs.openstack.org/neutron/latest/contributor/policies/bugs.html
21:08:51 o/
21:08:56 as point of contact for linuxbridge stuff
21:08:58 Hi b1airo
21:09:02 #chair b1airo
21:09:03 Current chairs: b1airo oneswig
21:09:10 morning
21:09:19 so sometimes we can ping that person to triage some LB related bug or gate issue etc.
21:09:51 Is there a backlog of LB bugs somewhere that would be good examples?
21:09:52 than I would be sure that it's really "maintained"
21:09:54 oneswig: sorry, could you repeat your question to slaweq - i assume it was something about what is needed from neutron team's perspective to avoid deprecation of linuxbridge ?
21:10:38 b1airo: that was it, only it has been decided not to deprecate. It would still be good to help out with its maintenance though.
21:10:40 oneswig: list of bugs with linuxbridge tag is here:
21:10:42 https://bugs.launchpad.net/neutron/?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=linuxbridge&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on
21:11:00 what's the performance advantage of LB over ovs, roughly? are we talking 20% or 500%?
21:11:10 b1airo: yes, we now have clear signal that this driver is widely used so we will not plan to deprecate it
21:11:28 good to hear!
21:11:57 it is good to have a simple driver, even if it is not so functional for those who liked nova-network
21:12:07 so basically that's all from my side
21:12:15 performance data I have
21:12:24 #link presentation from 2017 https://docs.google.com/presentation/d/1MSTquzyPaDUyqW2pAmYouP6qdYAi6G1n7IJqBzqPCr8/edit?usp=sharing
21:12:40 if someone of You would like to be such linuxbridge contact person, please reach out to me on irc or email, or speak now if You want :)
21:12:43 Slide 30
21:12:51 woohoo!
21:13:22 Thanks slaweq, appreciated.
21:13:32 +1
21:14:06 i suspect someone from Nectar might be happy to be a contact - LB is still widely used there as the default plugin with more advanced networking provided by Midonet and more performant local networking done with passthrough
21:14:32 Midonet still going? Forgive my ignorance.
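[Editor's note: the long web query above can also be expressed against Launchpad's REST web-service API (searchTasks with a tags filter). A sketch that only builds the URL; actually fetching it needs network access and should be checked against the current Launchpad API docs.]

```python
# Sketch: the Launchpad API equivalent of the bug-list query above,
# for neutron bugs tagged "linuxbridge". Only URL construction is shown;
# a caller would fetch it and read the "entries" key of the JSON reply.
from urllib.parse import urlencode

API_ROOT = "https://api.launchpad.net/devel"

def bug_search_url(project: str, tag: str) -> str:
    """Return the anonymous-API URL listing bug tasks carrying a tag."""
    params = urlencode({"ws.op": "searchTasks", "tags": tag})
    return f"{API_ROOT}/{project}?{params}"

url = bug_search_url("neutron", "linuxbridge")
# e.g. fetch with: json.load(urllib.request.urlopen(url))["entries"]
```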
21:14:44 no sorrison online to ping at the moment though...
21:15:17 b1airo: seconded :-)
21:15:38 I think Midonet looked to be in trouble but then got bought by Ericsson, who are presumably using Midonet in their deployments
21:16:02 b1airo: good to hear it found a safe harbour of sorts.
21:16:10 so possibly an 11th hour reprieve
21:16:26 it's a really nice platform, despite java
21:18:21 oof, documentation stops at Newton
21:18:42 Hi jmlowe
21:18:47 Hi
21:19:06 Perhaps it was just, you know, completed with nothing more to add?
21:19:34 Could be, always a concern though
21:20:30 Just kidding :-)
21:20:48 OK, shall we move on?
21:21:03 #topic Supercomputing round-up
21:21:42 13300 people apparently - it was big and busy!
21:22:51 We met a few new people who may find their way to this SIG
21:23:20 great! where from?
21:23:24 fingers crossed - did anyone get the name of that guy asking about performance issues?
21:23:49 No, he didn't hang around afterwards unfortunately, afaict
21:23:52 wow, I guess my dreams of going back to Austin are forever dashed with those kinds of attendance numbers
21:24:45 sounds worse than kubecon
21:24:46 Don't worry b1airo he'll be back next year, he was there last year
21:24:55 It was interesting to see where cloud and cloud-like tech is starting to show up in product lines. In many cases, cloud is creeping in quietly. It's only when you ask about specifics that vendors tell you things like k8s and some SDN is being used as part of their plumbing.
21:25:00 yeah, i remember :-)
21:26:00 noggin143: by my calculation, it was as big as kubecon + openstack shanghai combined!
21:26:12 what would be the top3 most interesting k8s use cases?
21:26:50 janders: most "interesting" or "scariest?"
21:27:04 let's look at one of each? :)
21:27:15 janders: the Cray Shasta control plane is hosted on K8S, for example...
21:27:30 file in whichever category you like!
21:27:30 does Bright Computing have anything to do with that?
21:27:32 That covers the scary use case, oneswig ;)
21:28:16 janders: perhaps they use that to provision the controller nodes but it wasn't disclosed.
21:28:41 AFAIK Bright doesn't have anything to do with it...but I'm not 100% sure about that
21:28:44 Shasta is the scariest
21:28:56 off the record don't be surprised if you see similar arch by Brigth
21:29:14 this came up in a discussion I had with them a while back after we had a really bad scaling problem with our BCM
21:29:53 how about the interesting bits?
21:30:00 (now that we have scary ticked off)
21:30:22 So i was interested in these very loosely coherent filesystems developed for Summit
21:30:23 Interesting use case: someone had a poster talking about k8s + OpenMPI
21:30:33 trandles: not using ORTE again?
21:30:49 trandles: do you recall how did they approach the comms challange?
21:30:53 oneswig: any references ?
21:31:11 looking...
21:31:21 oneswig: that I'm not sure about...the poster was vague and the author standing there was getting picked apart by someone who seemed to know all of the gory comms internals
21:31:26 oneswig: for loosely coherent filesystems, that is.
21:31:49 I didn't stay to hear details, I just kept wandering off thinking "wow, look for more on that at sc20"
21:32:06 for sure
21:32:31 shameless plug: https://www.youtube.com/watch?v=sPfZGHWnKNM
21:33:08 janders: thx for the link ;)
21:33:17 I can confirm there's a lot of interesting work on K8s-RDMA and there's more coming
21:33:37 noggin143: it was this session - two approaches presented https://sc19.supercomputing.org/presentation/?id=pap191&sess=sess167
21:33:51 There's a link to a paper on the research.
21:34:40 whoa... these numbers make io500 look like a sandpit :)
21:35:06 It was interesting because a lot of talk to this point has been on burst buffers, but this work was all about writing data to local NVME and then draining it to GPFS afterwards. The coherency is loose to the point of non-existence but it may work if the application knows it's not there.
21:35:41 janders: re Bright, I spoke to them (including CTO - Martjian?), and got the impression they had not done any significant work on containerising control plane with k8s or anything else - they had looked at it a couple of years ago but decided it was too fragile at the time... which is kinda laughable when in the next breath they tell you their upgrade strategy for Bright OpenStack is a complete reinstall and migration
21:35:45 janders: quite an expensive alternative to the sandpit :-)
21:36:16 indeed :)
21:36:30 b1airo: when did that conversation take place?
21:36:56 I had two myself, one similar to what you described and another one some time later where I had an impression of a 180deg turn
21:37:28 last week :-)
21:37:36 oh dear... another 180?
21:37:53 someone might be going in circles
21:38:24 On a related matter, I went to the OpenHPC BoF. There was a guy from ARM/Linaro who was interested in cloud-native deployments of OpenHPC but I think the steering committee are lukewarm at best on this.
21:38:25 they don't seem to have moved past saying "OpenStack is too hard to upgrade" as an excuse
21:38:57 b1airo: you're in that boat with Bright aren't you?
21:39:16 oneswig: +1 on the lukewarm for ARM in HPC/HTC here too
21:39:40 repo repoint; yum -y update; db-sync; service restart - I wonder what's so hard about that...
21:39:40 last i recall they used warewulf for provisioning bare metal
21:39:50 and with containers it should be even easier or so I hear
21:39:54 i forget if they still maintain an openstack/cloud release tool
21:39:54 noggin143: au contraire, on the ARM front there appears to be much interest around the new Fujitsu A64FX
21:39:56 they did at one point
21:40:26 oneswig: is the Fujitsu thing real ? I'd heard 202[2-9]
21:40:27 yep, it's really ugly.
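[Editor's note: the upgrade loop janders quips about above ("repo repoint; yum -y update; db-sync; service restart") expands to roughly the following for neutron on an RPM-based distro. This is a dry-run sketch; the repo id, package glob and service name are assumptions, not a tested procedure.]

```shell
#!/bin/sh
# Dry-run sketch of the per-service upgrade steps named in the chat.
# With DRY_RUN=1 (the default here) each step is printed, not executed.
set -eu

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "would run: $*"; else "$@"; fi
}

upgrade_neutron() {
    run yum-config-manager --enable centos-openstack-train   # repo repoint (repo id assumed)
    run yum -y update "openstack-neutron*"                   # update packages
    run neutron-db-manage upgrade heads                      # db-sync
    run systemctl restart neutron-server                     # service restart
}

upgrade_neutron
```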
i haven't dived depth-first into the details yet, but they are basically saying that to get from 8.1 to 9.0 we should completely reinstall
21:40:40 +1
21:40:46 rbudden: All kinds of deployment covered - both warewulf and xcat :-)
21:40:48 Looks like the 2020 KubeCon North America event once again overlaps SC perfectly. I heard twice from vendors at SC19 who said "I don't know, my kubernetes guy is at KubeCon this week." ...sigh...
21:41:37 trandles: thank you for the early heads-up... Boston this time I see
21:41:46 nice to see Open Infra back in Vancouver too
21:41:54 noggin143: I spoke to some people at Riken, they are expecting samples imminently, deployment through next year. It might have been lost in translation but that was the gist afaict
21:41:58 looks like I've got my base conference schedule set :)
21:42:03 trandles: better than "Hi, did you know Kubernetes is replacing traditional HPC..."
21:42:22 Kubecon had an interesting launch of the "Research SIG" which was kind of a k8s version of the Scientific WG
21:42:38 noggin143: were you there?
21:42:43 b1airo: HA! I didn't see anyone from Trilio at SC...
21:42:48 noggin143: unfortunately I wasn't able to attend this one - would you be happy to tell us about it?
21:42:55 quite interested
21:43:10 oneswig: no, but Ricardo was (https://www.youtube.com/watch?v=g9FQxzK9E_M&list=PLj6h78yzYM2NDs-iu8WU5fMxINxHXlien&index=108&t=0s)
21:43:20 trandles: they were there actually, but just a couple of guys lurking - all the sales folks were at kubecon
21:44:08 I wonder how Kubecon compared to OpenStack Paris, ah those heady days
21:44:23 I wasn't in Paris but San Diego was massive
21:44:24 10k ppl
21:44:25 kubecon had lots of Helm v3, Machine Learning, GPUs, binder etc. Parties were up to the Austin level
21:44:37 We are still trying to use Trilio...
21:44:55 interesting choice of words jonmills_nasa ...
21:45:18 We'll be going to the Kubecon Amsterdam one in 2020
21:46:02 We should probably pay a bit more lip-service to the agenda... any more on SC / Kubecon?
21:46:21 Kubecon had a significant ML footprint
21:46:24 quite interesting
21:46:44 the notebook area seems very active too
21:46:52 janders: does seem a good environment for rapid development in ML.
21:46:56 both from Web Companies side (Ubers of the World)
21:47:06 and more traditional HPC (Erez gave a good talk on GPUs)
21:47:20 OK let's move on
21:47:30 I was happy to add RDMA support into the picture with my talk
21:47:31 #topic issues mixing baremetal and virt
21:47:49 janders: good to hear that.
21:47:50 it's good/interesting to hear the phrase "K8s as de-facto standard for ML workloads"
21:48:15 I feel this is driving K8s away from being just-a-webhosting-platform
21:48:27 and there's now buy-in for that - good for our group I suppose
21:49:00 quick one this - back at base the team hit and fixed some issues with nova-compute-ironic last week while we were off gallivanting with the conference crowd
21:49:06 janders: yup, tensorflow, keras, pytorch - lots of opportunities for data scientists
21:49:46 As I understand it, if you're running multiple instances of nova-compute-ironic using a hash ring to share the work, you can sometimes lose resources.
21:49:58 I'm somewhat interested in mixing baremetal and virt, anybody doing this right now?
21:50:12 oneswig: do you have a pointer ? We're looking to start sharding for scale with Ironic
21:50:16 +1 is this production ready?
21:50:25 jmlowe: we have some clients doing this...
21:50:30 jmlowe: you are not alone
21:50:35 +1, I am very interested, too
21:50:39 #link brace of patches available here https://review.opendev.org/#/c/694802/
21:50:40 * trandles too
21:50:59 I think there were 4-5 distinct patches in the end.
21:51:03 oneswig: can you elaborate more on "losing resources"?
21:51:08 how does this manifest itself?
21:51:15 nodes never getting scheduled to?
21:51:29 what scale are we thinking?
21:51:47 oneswig: BTW, we're working with upstream in several ironic areas such as incremental inspection (good for firmware level checks) and benchmarking/burn in as part of the standard cleaning process
21:52:14 janders: As I understand it, nodes would become lost from the resource tracker. But I only have the sketchiest details.
21:52:18 I have let's say 400 ironic instances in mind, 20-30 sharing with virt
21:53:02 jmlowe: sounds like you're progressing with the nova-compute-as-a-workload architecture? :)
21:53:03 It only occurs when using multiple instances of nova-compute-ironic, afaik
21:53:08 jmlowe: we're running about 4000 at the moment - Arne's presentation from Shanghai at https://twiki.cern.ch/twiki/pub/CMgroup/CmPresentationList/From_hire_to_retire!_Server_Life-cycle_management_with_Ironic_at_CERN_-_OpenInfraSummit_SHANGHAI_06NOV2019.pptx
21:53:30 janders: that's what I have in mind
21:53:42 jmlowe: great to hear! :)
21:53:59 janders: what do you do for tagged interfaces - only 1 network to the hypervisor, or something crafty with the IB?
21:54:14 1 network per port
21:54:28 I had dual and quad port nodes so that was enough
21:54:42 but trunking vlans/pkeys is definitely something I want going forward
21:55:08 I thought it was something like that. A good use case for trunking Ironic ports (if that's ever a thing).
21:55:15 (if you're wondering why/how I had quad port IB in servers - the answer is Dell blades)
21:55:44 We've let the slug of time crawl away from us again.
21:55:59 5 mins on Gnocchi?
21:56:11 oneswig: hah!
21:56:14 ok!
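[Editor's note: to make oneswig's "losing resources" point concrete - nova's ironic driver spreads nodes across compute services with a consistent hash ring, and if two services briefly disagree about ring membership, some nodes can end up tracked by neither. A toy illustration, not nova's actual implementation; all names are invented.]

```python
# Toy consistent hash ring: maps ironic nodes onto nova-compute services.
# Shows how a transient membership disagreement orphans some nodes.
import bisect
import hashlib

class HashRing:
    def __init__(self, members, replicas=64):
        # Place `replicas` virtual points per member on the ring.
        self.ring = sorted(
            (int(hashlib.md5(f"{m}-{i}".encode()).hexdigest(), 16), m)
            for m in members for i in range(replicas)
        )
        self.keys = [k for k, _ in self.ring]

    def owner(self, node):
        # A node belongs to the first virtual point at or after its hash.
        h = int(hashlib.md5(node.encode()).hexdigest(), 16)
        idx = bisect.bisect(self.keys, h) % len(self.ring)
        return self.ring[idx][1]

nodes = [f"ironic-node-{i}" for i in range(400)]
ring_a = HashRing(["compute-1", "compute-2"])   # one service's view
ring_b = HashRing(["compute-1"])                # a stale view missing compute-2
# Nodes the two views assign differently may be resource-tracked by no one.
orphaned = [n for n in nodes if ring_a.owner(n) != ring_b.owner(n)]
```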
21:56:16 #topic Gnocchi
21:56:18 mmm soft fluffy potato
21:56:29 and I propose we re-add the virt+bm again to next week's agenda
21:56:30 OK, wasn't expecting that
21:56:40 janders: happy to do so
21:56:50 great, thanks oneswig
21:57:02 i suggest shunting Gnocchi to next week
21:57:16 good idea, but maybe let's quickly draw context?
21:57:16 when paired with the carbonara backend it was delicious
21:57:17 fair enough
21:57:18 b1airo: makes sense to me.
21:57:22 yep thanks oneswig
21:57:23 i could ask Sam or someone else from Nectar to come talk about that and LB + Mido...
21:57:35 b1airo: would be great, thanks.
21:57:55 gnocchi isn't getting any more dead, so second to pushing
21:58:16 :'-D
21:58:23 Next week I'm also hoping johnthetubaguy will join us to talk about unified limits (quotas for mixed baremetal and virt resources, hierarchical projects, among other things)
21:58:37 gnocchi is good at killing ceph afaik
21:58:40 oneswig: :-)
21:58:46 even if almost dead itself
21:58:48 oooh, I'd love to quota vgpu's, I think that's part of it
21:58:50 now, back to InfluxDB tuning
21:59:05 noggin143: ha, what a rich seam
21:59:11 jmlowe: we're looking to do some GPU quota too, another topic ?
21:59:18 yeah, really need a nvme ceph pool to keep gnocchi from killing things
21:59:27 jmlowe: indeed.
21:59:42 These are multi-region discussions, clearly!
21:59:42 okay - almost over time, so thank you all, great chat - I just wish we had another hour! :)
21:59:43 I found out the hard way
21:59:50 we shall continue next week
21:59:51 i thought that whole gnocchi on ceph thing looked weird when it first came up
22:00:04 yep, DDoS-a-a-S!
22:00:09 jmlowe: wasn't it obvious with those billions of tiny writes ...
22:00:10 no-one mentioned iops in the Ceph for HPC BoF at SC19...
22:00:18 Bye folks, I'll read next week's log. ;)
22:00:37 Thanks all, alas we must close
22:00:42 #endmeeting