21:00:08 <oneswig> #startmeeting scientific-sig
21:00:09 <openstack> Meeting started Tue Nov 26 21:00:08 2019 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:12 <openstack> The meeting name has been set to 'scientific_sig'
21:00:24 <oneswig> I remembered the runes this week...
21:00:36 <oneswig> #link agenda for today https://wiki.openstack.org/wiki/Scientific_SIG#IRC_Meeting_November_29th_2019
21:00:42 <janders> hey Stig!
21:00:52 <trandles> Hello all
21:00:57 <oneswig> Hi janders trandles
21:00:59 <slaweq> hi
21:01:09 <oneswig> I understand we'll be missing a few due to Thanksgiving
21:01:14 <oneswig> Hi slaweq, welcome!
21:01:17 <oneswig> Thanks for coming
21:01:40 <slaweq> thx for inviting me :)
21:02:00 <oneswig> np, I appreciate it.
21:02:26 <oneswig> Let's go straight to that - hopefully jmlowe will catch us up
21:03:17 <oneswig> #topic Helping supporting Linuxbridge
21:03:23 <slaweq> ok
21:03:37 <oneswig> Hi slaweq, so what's the context?
21:03:39 <slaweq> so, first of all sorry if my message after PTG was scary for anyone
21:04:15 <slaweq> basically in the neutron developers team we had a feeling that the linuxbridge agent is almost not used, as there was almost no development of this driver
21:04:31 <slaweq> and as we now want to include the ovn driver as one of the in-tree drivers
21:04:51 <slaweq> our idea was to maybe start thinking slowly about deprecating the linuxbridge agent
21:05:04 <slaweq> but apparently it is used by many deployments
21:05:12 <slaweq> so we will for sure not deprecate it
21:05:20 <slaweq> You don't need to worry about it
21:05:26 <oneswig> I think it is popular in this group because it's quite simple and faster
21:05:35 <oneswig> slaweq: thanks!
21:05:35 <rbudden> hello
21:05:49 <oneswig> Is it causing overhead to keep linuxbridge in tree?
21:05:51 <slaweq> as we have a clear signal that this driver is used by people who need a simple solution and don't want other, more advanced features
21:05:51 <oneswig> hi rbudden
21:06:18 <slaweq> but we have almost nobody in our devs team who would take care of the LB agent and mech driver
21:06:40 <oneswig> does it need much work?
21:06:57 <janders> apologies for a dumb question - we're only talking about the mech driver here, not the LB between the instance and the hypervisor where secgroups are applied right?
21:06:58 <rbudden> i was going to ask, what’s the overhead to continue support? what’s needed?
21:07:04 <slaweq> so if You are using it, it would be great if one of You could be a point of contact in case there are e.g. LB-related gate issues or things like that
21:07:16 <slaweq> janders: that's correct
21:07:33 <slaweq> oneswig: no, I don't think this would require a lot of work
21:07:50 <noggin143> Tim joining
21:07:54 <oneswig> slaweq: does that mean joining openstack-neutron etc?
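[Editor's note: for readers unfamiliar with the "simple" deployment style discussed above, the sketch below shows roughly how the linuxbridge mechanism driver and agent are selected in ML2. It is a minimal illustration only; the file paths follow the upstream install guides, while the interface name (eth1) and local_ip are assumptions, not details from the meeting.]

    # /etc/neutron/plugins/ml2/ml2_conf.ini (sketch)
    [ml2]
    type_drivers = flat,vlan,vxlan
    tenant_network_types = vxlan
    mechanism_drivers = linuxbridge

    # /etc/neutron/plugins/ml2/linuxbridge_agent.ini (sketch)
    [linux_bridge]
    # map the provider physical network to an assumed host NIC
    physical_interface_mappings = provider:eth1

    [vxlan]
    enable_vxlan = true
    local_ip = 10.0.0.11

    [securitygroup]
    enable_security_group = true
    firewall_driver = neutron.agent.linux.iptables_firewall.IptablesFirewallDriver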
21:08:00 <oneswig> Hi noggin143, evening
21:08:39 <slaweq> oneswig: yes, basically that's what is needed
21:08:48 <slaweq> and I would like to put someone on the list in https://docs.openstack.org/neutron/latest/contributor/policies/bugs.html
21:08:51 <b1airo> o/
21:08:56 <slaweq> as a point of contact for linuxbridge stuff
21:08:58 <oneswig> Hi b1airo
21:09:02 <oneswig> #chair b1airo
21:09:03 <openstack> Current chairs: b1airo oneswig
21:09:10 <b1airo> morning
21:09:19 <slaweq> so sometimes we can ping that person to triage some LB related bug or gate issue etc.
21:09:51 <oneswig> Is there a backlog of LB bugs somewhere that would be good examples?
21:09:52 <slaweq> then I would be sure that it's really "maintained"
21:09:54 <b1airo> oneswig: sorry, could you repeat your question to slaweq - i assume it was something about what is needed from the neutron team's perspective to avoid deprecation of linuxbridge ?
21:10:38 <oneswig> b1airo: that was it, only it has been decided not to deprecate. It would still be good to help out with its maintenance though.
21:10:40 <slaweq> oneswig: the list of bugs with the linuxbridge tag is here:
21:10:42 <slaweq> https://bugs.launchpad.net/neutron/?field.searchtext=&orderby=-importance&search=Search&field.status%3Alist=NEW&field.status%3Alist=CONFIRMED&field.status%3Alist=TRIAGED&field.status%3Alist=INPROGRESS&field.status%3Alist=FIXCOMMITTED&field.status%3Alist=INCOMPLETE_WITH_RESPONSE&field.status%3Alist=INCOMPLETE_WITHOUT_RESPONSE&assignee_option=any&field.assignee=&field.bug_reporter=&field.bug_commenter=&field.subscriber=&field.structural_subscriber=&field.tag=linuxbridge&field.tags_combinator=ANY&field.has_cve.used=&field.omit_dupes.used=&field.omit_dupes=on&field.affects_me.used=&field.has_patch.used=&field.has_branches.used=&field.has_branches=on&field.has_no_branches.used=&field.has_no_branches=on&field.has_blueprints.used=&field.has_blueprints=on&field.has_no_blueprints.used=&field.has_no_blueprints=on
21:11:00 <janders> what's the performance advantage of LB over ovs, roughly? are we talking 20% or 500%?
21:11:10 <slaweq> b1airo: yes, we now have a clear signal that this driver is widely used so we do not plan to deprecate it
21:11:28 <rbudden> good to hear!
21:11:57 <noggin143> it is good to have a simple driver, even if it is not so functional for those who liked nova-network
21:12:07 <slaweq> so basically that's all from my side
21:12:15 <oneswig> performance data I have
21:12:24 <oneswig> #link presentation from 2017 https://docs.google.com/presentation/d/1MSTquzyPaDUyqW2pAmYouP6qdYAi6G1n7IJqBzqPCr8/edit?usp=sharing
21:12:40 <slaweq> if one of You would like to be the linuxbridge contact person, please reach out to me on irc or email, or speak now if You want :)
21:12:43 <oneswig> Slide 30
21:12:51 <jmlowe> woohoo!
21:13:22 <oneswig> Thanks slaweq, appreciated.
21:13:32 <noggin143> +1
21:14:06 <b1airo> i suspect someone from Nectar might be happy to be a contact - LB is still widely used there as the default plugin, with more advanced networking provided by Midonet and more performant local networking done with passthrough
21:14:32 <oneswig> Midonet still going? Forgive my ignorance.
21:14:44 <b1airo> no sorrison online to ping at the moment though...
21:15:17 <oneswig> b1airo: seconded :-)
21:15:38 <b1airo> I think Midonet looked to be in trouble but then got bought by Ericsson, who are presumably using Midonet in their deployments
21:16:02 <oneswig> b1airo: good to hear it found a safe harbour of sorts.
21:16:10 <b1airo> so possibly an 11th hour reprieve
21:16:26 <b1airo> it's a really nice platform, despite java
21:18:21 <jmlowe> oof, documentation stops at Newton
21:18:42 <oneswig> Hi jmlowe
21:18:47 <jmlowe> Hi
21:19:06 <oneswig> Perhaps it was just, you know, completed with nothing more to add?
21:19:34 <jmlowe> Could be, always a concern though
21:20:30 <oneswig> Just kidding :-)
21:20:48 <oneswig> OK, shall we move on?
21:21:03 <oneswig> #topic Supercomputing round-up
21:21:42 <oneswig> 13300 people apparently - it was big and busy!
21:22:51 <oneswig> We met a few new people who may find their way to this SIG
21:23:20 <janders> great! where from?
21:23:24 <b1airo> fingers crossed - did anyone get the name of that guy asking about performance issues?
21:23:49 <oneswig> No, he didn't hang around afterwards unfortunately, afaict
21:23:52 <jmlowe> wow, I guess my dreams of going back to Austin are forever dashed with those kinds of attendance numbers
21:24:45 <noggin143> sounds worse than kubecon
21:24:46 <jmlowe> Don't worry b1airo he'll be back next year, he was there last year
21:24:55 <trandles> It was interesting to see where cloud and cloud-like tech is starting to show up in product lines. In many cases, cloud is creeping in quietly. It's only when you ask about specifics that vendors tell you things like k8s and some SDN is being used as part of their plumbing.
21:25:00 <b1airo> yeah, i remember :-)
21:26:00 <oneswig> noggin143: by my calculation, it was as big as kubecon + openstack shanghai combined!
21:26:12 <janders> what would be the top 3 most interesting k8s use cases?
21:26:50 <trandles> janders: most "interesting" or "scariest?"
21:27:04 <janders> let's look at one of each? :)
21:27:15 <oneswig> janders: the Cray Shasta control plane is hosted on K8S, for example...
21:27:30 <oneswig> file in whichever category you like!
21:27:30 <janders> does Bright Computing have anything to do with that?
21:27:32 <trandles> That covers the scary use case, oneswig ;)
21:28:16 <oneswig> janders: perhaps they use that to provision the controller nodes but it wasn't disclosed.
21:28:41 <trandles> AFAIK Bright doesn't have anything to do with it...but I'm not 100% sure about that
21:28:44 <jmlowe> Shasta is the scariest
21:28:56 <janders> off the record don't be surprised if you see similar arch by Bright
21:29:14 <janders> this came up in a discussion I had with them a while back after we had a really bad scaling problem with our BCM
21:29:53 <janders> how about the interesting bits?
21:30:00 <janders> (now that we have scary ticked off)
21:30:22 <oneswig> So I was interested in these very loosely coherent filesystems developed for Summit
21:30:23 <trandles> Interesting use case: someone had a poster talking about k8s + OpenMPI
21:30:33 <oneswig> trandles: not using ORTE again?
21:30:49 <janders> trandles: do you recall how they approached the comms challenge?
21:30:53 <noggin143> oneswig: any references ?
21:31:11 <oneswig> looking...
21:31:21 <trandles> oneswig: that I'm not sure about...the poster was vague and the author standing there was getting picked apart by someone who seemed to know all of the gory comms internals
21:31:26 <noggin143> oneswig: for loosely coherent filesystems, that is.
21:31:49 <trandles> I didn't stay to hear details, I just kept wandering off thinking "wow, look for more on that at sc20"
21:32:06 <janders> for sure
21:32:31 <janders> shameless plug: https://www.youtube.com/watch?v=sPfZGHWnKNM
21:33:08 <trandles> janders: thx for the link ;)
21:33:17 <janders> I can confirm there's a lot of interesting work on K8s-RDMA and there's more coming
21:33:37 <oneswig> noggin143: it was this session - two approaches presented https://sc19.supercomputing.org/presentation/?id=pap191&sess=sess167
21:33:51 <oneswig> There's a link to a paper on the research.
21:34:40 <janders> whoa... these numbers make io500 look like a sandpit :)
21:35:06 <oneswig> It was interesting because a lot of talk to this point has been on burst buffers, but this work was all about writing data to local NVMe and then draining it to GPFS afterwards. The coherency is loose to the point of non-existence but it may work if the application knows it's not there.
21:35:41 <b1airo> janders: re Bright, I spoke to them (including CTO - Martijn?), and got the impression they had not done any significant work on containerising the control plane with k8s or anything else - they had looked at it a couple of years ago but decided it was too fragile at the time... which is kinda laughable when in the next breath they tell you their upgrade strategy for Bright OpenStack is a complete reinstall and migration
21:35:45 <noggin143> janders: quite an expensive alternative to the sandpit :-)
21:36:16 <janders> indeed :)
21:36:30 <janders> b1airo: when did that conversation take place?
21:36:56 <janders> I had two myself, one similar to what you described and another one some time later where I had an impression of a 180deg turn
21:37:28 <b1airo> last week :-)
21:37:36 <janders> oh dear... another 180?
21:37:53 <janders> someone might be going in circles
21:38:24 <oneswig> On a related matter, I went to the OpenHPC BoF. There was a guy from ARM/Linaro who was interested in cloud-native deployments of OpenHPC but I think the steering committee are lukewarm at best on this.
21:38:25 <b1airo> they don't seem to have moved past saying "OpenStack is too hard to upgrade" as an excuse
21:38:57 <oneswig> b1airo: you're in that boat with Bright aren't you?
21:39:16 <noggin143> oneswig: +1 on the lukewarm for ARM in HPC/HTC here too
21:39:40 <janders> repo repoint; yum -y update; db-sync; service restart - I wonder what's so hard about that...
21:39:40 <rbudden> last i recall they used warewulf for provisioning bare metal
21:39:50 <janders> and with containers it should be even easier or so I hear
21:39:54 <rbudden> i forget if they still maintain an openstack/cloud release tool
21:39:54 <oneswig> noggin143: au contraire, on the ARM front there appears to be much interest around the new Fujitsu A64FX
21:39:56 <rbudden> they did at one point
21:40:26 <noggin143> oneswig: is the Fujitsu thing real ? I'd heard 202[2-9]
21:40:27 <b1airo> yep, it's really ugly. i haven't dived depth-first into the details yet, but they are basically saying that to get from 8.1 to 9.0 we should completely reinstall
21:40:40 <janders> +1
21:40:46 <oneswig> rbudden: All kinds of deployment covered - both warewulf and xcat :-)
21:40:48 <trandles> Looks like the 2020 KubeCon North America event once again overlaps SC perfectly. I heard twice from vendors at SC19 who said "I don't know, my kubernetes guy is at KubeCon this week." ...sigh...
21:41:37 <janders> trandles: thank you for the early heads-up... Boston this time I see
21:41:46 <janders> nice to see Open Infra back in Vancouver too
21:41:54 <oneswig> noggin143: I spoke to some people at Riken, they are expecting samples imminently, deployment through next year. It might have been lost in translation but that was the gist afaict
21:41:58 <janders> looks like I've got my base conference schedule set :)
21:42:03 <b1airo> trandles: better than "Hi, did you know Kubernetes is replacing traditional HPC..."
21:42:22 <noggin143> Kubecon had an interesting launch of the "Research SIG" which was kind of a k8s version of the Scientific WG
21:42:38 <oneswig> noggin143: were you there?
21:42:43 <trandles> b1airo: HA! I didn't see anyone from Trilio at SC...
21:42:48 <janders> noggin143: unfortunately I wasn't able to attend this one - would you be happy to tell us about it?
21:42:55 <janders> quite interested
21:43:10 <noggin143> oneswig: no, but Ricardo was (https://www.youtube.com/watch?v=g9FQxzK9E_M&list=PLj6h78yzYM2NDs-iu8WU5fMxINxHXlien&index=108&t=0s)
21:43:20 <b1airo> trandles: they were there actually, but just a couple of guys lurking - all the sales folks were at kubecon
21:44:08 <oneswig> I wonder how Kubecon compared to OpenStack Paris, ah those heady days
21:44:23 <janders> I wasn't in Paris but San Diego was massive
21:44:24 <janders> 10k ppl
21:44:25 <noggin143> kubecon had lots of Helm v3, Machine Learning, GPUs, binder etc. Parties were up to the Austin level
21:44:37 <jonmills_nasa> We are still trying to use Trilio...
21:44:55 <b1airo> interesting choice of words jonmills_nasa ...
21:45:18 <noggin143> We'll be going to the Kubecon Amsterdam one in 2020
21:46:02 <oneswig> We should probably pay a bit more lip-service to the agenda... any more on SC / Kubecon?
21:46:21 <janders> Kubecon had a significant ML footprint
21:46:24 <janders> quite interesting
21:46:44 <noggin143> the notebook area seems very active too
21:46:52 <oneswig> janders: does seem a good environment for rapid development in ML.
21:46:56 <janders> both from the Web Companies side (Ubers of the World)
21:47:06 <janders> and more traditional HPC (Erez gave a good talk on GPUs)
21:47:20 <oneswig> OK let's move on
21:47:30 <janders> I was happy to add RDMA support into the picture with my talk
21:47:31 <oneswig> #topic issues mixing baremetal and virt
21:47:49 <oneswig> janders: good to hear that.
21:47:50 <janders> it's good/interesting to hear the phrase "K8s as de-facto standard for ML workloads"
21:48:15 <janders> I feel this is driving K8s away from being just-a-webhosting-platform
21:48:27 <janders> and there's now buy-in for that - good for our group I suppose
21:49:00 <oneswig> quick one this - back at base the team hit and fixed some issues with nova-compute-ironic last week while we were off gallivanting with the conference crowd
21:49:06 <noggin143> janders: yup, tensorflow, keras, pytorch - lots of opportunities for data scientists
21:49:46 <oneswig> As I understand it, if you're running multiple instances of nova-compute-ironic using a hash ring to share the work, you can sometimes lose resources.
21:49:58 <jmlowe> I'm somewhat interested in mixing baremetal and virt, anybody doing this right now?
21:50:12 <noggin143> oneswig: do you have a pointer ? We're looking to start sharding for scale with Ironic
21:50:16 <rbudden> +1 is this production ready?
21:50:25 <oneswig> jmlowe: we have some clients doing this...
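[Editor's note: the hash-ring behaviour mentioned above arises when several nova-compute services manage one Ironic deployment. For context, below is a hedged sketch of one common way to shard that work using Ironic conductor groups; it is not the fix in the patches linked further on, and the group name and host names are illustrative assumptions.]

    # Tag a set of Ironic nodes with a conductor group (sketch):
    #   openstack baremetal node set <node-uuid> --conductor-group rack1

    # /etc/nova/nova.conf on the nova-compute-ironic service that should own that group
    [ironic]
    # only manage nodes whose conductor group matches this key
    partition_key = rack1
    # all nova-compute hosts sharing the same partition_key
    peer_list = compute-ironic-1,compute-ironic-2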
21:50:30 <raub> jmlowe: you are not alone
21:50:35 <janders> +1, I am very interested, too
21:50:39 <oneswig> #link brace of patches available here https://review.opendev.org/#/c/694802/
21:50:40 * trandles too
21:50:59 <oneswig> I think there were 4-5 distinct patches in the end.
21:51:03 <janders> oneswig: can you elaborate more on "losing resources"?
21:51:08 <janders> how does this manifest itself?
21:51:15 <janders> nodes never getting scheduled to?
21:51:29 <janders> what scale are we thinking?
21:51:47 <noggin143> oneswig: BTW, we're working with upstream in several ironic areas such as incremental inspection (good for firmware level checks) and benchmarking/burn-in as part of the standard cleaning process
21:52:14 <oneswig> janders: As I understand it, nodes would become lost from the resource tracker. But I only have the sketchiest details.
21:52:18 <jmlowe> I have let's say 400 ironic instances in mind, 20-30 sharing with virt
21:53:02 <janders> jmlowe: sounds like you're progressing with the nova-compute-as-a-workload architecture? :)
21:53:03 <oneswig> It only occurs when using multiple instances of nova-compute-ironic, afaik
21:53:08 <noggin143> jmlowe: we're running about 4000 at the moment - Arne's presentation from Shanghai at https://twiki.cern.ch/twiki/pub/CMgroup/CmPresentationList/From_hire_to_retire!_Server_Life-cycle_management_with_Ironic_at_CERN_-_OpenInfraSummit_SHANGHAI_06NOV2019.pptx
21:53:30 <jmlowe> janders: that's what I have in mind
21:53:42 <janders> jmlowe: great to hear! :)
21:53:59 <oneswig> janders: what do you do for tagged interfaces - only 1 network to the hypervisor, or something crafty with the IB?
21:54:14 <janders> 1 network per port
21:54:28 <janders> I had dual and quad port nodes so that was enough
21:54:42 <janders> but trunking vlans/pkeys is definitely something I want going forward
21:55:08 <oneswig> I thought it was something like that. A good use case for trunking Ironic ports (if that's ever a thing).
21:55:15 <janders> (if you're wondering why/how I had quad port IB in servers - the answer is Dell blades)
21:55:44 <oneswig> We've let the slug of time crawl away from us again.
21:55:59 <oneswig> 5 mins on Gnocchi?
21:56:11 <noggin143> oneswig: hah!
21:56:14 <janders> ok!
21:56:16 <oneswig> #topic Gnocchi
21:56:18 <b1airo> mmm soft fluffy potato
21:56:29 <janders> and I propose we re-add the virt+bm topic to next week's agenda
21:56:30 <oneswig> OK, wasn't expecting that
21:56:40 <oneswig> janders: happy to do so
21:56:50 <janders> great, thanks oneswig
21:57:02 <b1airo> i suggest shunting Gnocchi to next week
21:57:16 <janders> good idea, but maybe let's quickly draw context?
21:57:16 <jmlowe> when paired with the carbonara backend it was delicious
21:57:17 <jonmills_nasa> fair enough
21:57:18 <oneswig> b1airo: makes sense to me.
21:57:22 <rbudden> yep thanks oneswig
21:57:23 <b1airo> i could ask Sam or someone else from Nectar to come talk about that and LB + Mido...
21:57:35 <oneswig> b1airo: would be great, thanks.
21:57:55 <jmlowe> gnocchi isn't getting any more dead, so I second pushing it
21:58:16 <b1airo> :'-D
21:58:23 <oneswig> Next week I'm also hoping johnthetubaguy will join us to talk about unified limits (quotas for mixed baremetal and virt resources, hierarchical projects, among other things)
21:58:37 <janders> gnocchi is good at killing ceph afaik
21:58:40 <noggin143> oneswig: :-)
21:58:46 <janders> even if almost dead itself
21:58:48 <jmlowe> oooh, I'd love to quota vGPUs, I think that's part of it
21:58:50 <noggin143> now, back to InfluxDB tuning
21:59:05 <oneswig> noggin143: ha, what a rich seam
21:59:11 <noggin143> jmlowe: we're looking to do some GPU quota too, another topic ?
21:59:18 <jmlowe> yeah, really need an nvme ceph pool to keep gnocchi from killing things
21:59:27 <janders> jmlowe: indeed.
21:59:42 <oneswig> These are multi-region discussions, clearly!
21:59:42 <janders> okay - almost over time, so thank you all, great chat - I just wish we had another hour! :)
21:59:43 <jmlowe> I found out the hard way
21:59:50 <janders> we shall continue next week
21:59:51 <b1airo> i thought that whole gnocchi on ceph thing looked weird when it first came up
22:00:04 <janders> yep, DDoS-a-a-S!
22:00:09 <oneswig> jmlowe: wasn't it obvious with those billions of tiny writes ...
22:00:10 <b1airo> no-one mentioned iops in the Ceph for HPC BoF at SC19...
22:00:18 <trandles> Bye folks, I'll read next week's log. ;)
22:00:37 <oneswig> Thanks all, alas we must close
22:00:42 <oneswig> #endmeeting
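[Editor's note, appended after the log: on the gnocchi-on-Ceph point above, a rough sketch of isolating Gnocchi's many small writes on an NVMe-backed pool. The pool name, PG counts and CRUSH rule name are assumptions to adjust for the local cluster; this is context, not a recommendation from the meeting.]

    # Create a replicated CRUSH rule restricted to OSDs with the nvme device class,
    # then a dedicated pool for Gnocchi on it (sketch only)
    ceph osd crush rule create-replicated nvme-only default host nvme
    ceph osd pool create gnocchi 128 128 replicated nvme-only

    # /etc/gnocchi/gnocchi.conf (sketch)
    [storage]
    driver = ceph
    ceph_pool = gnocchi
    ceph_username = gnocchi
    ceph_conffile = /etc/ceph/ceph.conf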