#openstack-meeting log

21:10:06 <b1airo> #startmeeting scientific-sig
21:10:07 <openstack> Meeting started Tue Mar 17 21:10:06 2020 UTC and is due to finish in 60 minutes.  The chair is b1airo. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:10:08 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:10:10 <openstack> The meeting name has been set to 'scientific_sig'
21:10:14 <b1airo> #chair oneswig
21:10:15 <openstack> Current chairs: b1airo oneswig
21:10:19 <oneswig> g'day
21:10:23 <b1airo> howdy
21:10:28 <oneswig> what's new?
21:10:56 <b1airo> whole team wfh test today
21:11:15 <oneswig> Been doing it ourselves too.
21:11:20 <trandles_> Well, my VPN connection to work dropped, sorry about that
21:11:28 <trandles_> Too many working from home I guess
21:11:51 <b1airo> other than that, not much. trying to psych myself to dive into our old Bright OpenStack private cloud again this week if i can find some time
21:12:31 <b1airo> your VPN takes default gateway i take it trandles_ ?
21:12:40 <trandles_> yeah
21:12:54 <trandles_> Everything routes through the VPN when connected
21:13:09 <oneswig> tedious
21:13:09 <trandles_> It was surprisingly stable until just now
21:13:57 <oneswig> The full WFH impact hasn't hit as schools remain open here.
21:14:14 <trandles_> Schools closed here Monday for 3 weeks
21:15:12 <trandles_> So no definitive word on Vancouver?
21:15:14 <oneswig> Quick q: does anyone know how atomic ceph rbd snapshots are?  If a volume is active when the snapshot is made, how coherent is the snapshot?
21:16:27 <oneswig> trandles_: not seen anything myself but not paid much attention either.
21:20:11 <b1airo> oneswig: they are atomic from the perspective of the block device, i.e., all blocks are snapshotted at the same time - whether any filesystems atop that device are consistent at that point in time is a different question/issue though
21:21:18 <b1airo> i imagine there are some potential caveats there if you're cross-mounting a single rbd though
21:22:02 <oneswig> OK thanks b1airo, good to know
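[Editor's note] The distinction b1airo draws, a block-consistent snapshot versus a filesystem-consistent one, can be illustrated with a minimal sketch. It assumes the usual approach of quiescing the filesystem with `fsfreeze` around `rbd snap create`; nobody in the meeting described this procedure, and the function name, mount point, pool, image, and snapshot names are all hypothetical. The sketch only builds the command sequence and does not run anything.

```python
import shlex
from typing import List


def consistent_snapshot_plan(
    mountpoint: str, pool: str, image: str, snap: str
) -> List[List[str]]:
    """Return, in order, the commands for a filesystem-consistent RBD snapshot.

    Assumption (not from the meeting): the filesystem is frozen so that
    in-flight writes are flushed before the block-level snapshot is taken.
    fsfreeze must run where the filesystem is mounted (inside the guest for
    a VM volume); the rbd command runs on a Ceph client.
    """
    return [
        ["fsfreeze", "--freeze", mountpoint],                 # quiesce the filesystem
        ["rbd", "snap", "create", f"{pool}/{image}@{snap}"],  # atomic at block level
        ["fsfreeze", "--unfreeze", mountpoint],               # resume writes
    ]


if __name__ == "__main__":
    # Hypothetical names, printed as a dry run only.
    for argv in consistent_snapshot_plan("/mnt/data", "volumes", "vol1", "pre-upgrade"):
        print(shlex.join(argv))
```

Without the freeze step the snapshot is still taken atomically, but the filesystem on it may need journal replay or a check when mounted, much as after a power loss.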
21:23:42 <b1airo> any cephfs fun to report?
21:24:17 <b1airo> also, anyone used dc/os aka mesosphere?
21:24:36 <oneswig> Nothing recently on CephFS.
21:25:18 <oneswig> b1airo: Nick Jones (formerly of our team) joined D2IQ, formerly Mesosphere.  But I think he's working on Kubernetes.
21:25:19 <trandles_> b1airo, I hear rumors that someone in LANL HPC is doing some work with DC/OS but it's not me and I don't know too awful much
21:26:01 <trandles_> I had a summer student do an evaluation of it but it was in 2016 so I'm sure it bears little resemblance to recent DC/OS
21:26:41 <b1airo> cheers. in the rebranding process they seem to have moved "mesosphere" from company name to product line ("ksphere" being a newer line of business)
21:28:01 <b1airo> one of my teams is using it in dev/test and we're trying to figure out what makes most sense for prod, e.g., keep using dc/os, just do vanilla k8s, konvoy, something else entirely...
21:28:40 <b1airo> Magnum would be another possibility if we had Bright under control enough to deploy it
21:31:11 <oneswig> The recent experience we've had with Magnum is that the version of Kubernetes deployed is tied to the version of Magnum (which releases more slowly).  That can be a problem if, shameless neophyte that you are, you want to always run the latest version of k8s and hang the portability.
21:31:40 <b1airo> would really like to have one container/service orchestration platform that we can use for hosting our own and end-user's services
21:31:41 <oneswig> I think they might be looking to decouple that more, to everyone's benefit
21:32:14 <oneswig> We certainly like using Magnum for that, noting this limitation that you're often a version or two behind on k8s
21:32:18 <b1airo> yeah, sounds like a prerequisite oneswig
21:33:15 <b1airo> Cray's original Urika tools had this issue and were dead in the water due to it
21:35:17 <oneswig> I think their issues were far more severe than that.
21:35:29 <b1airo> lol
21:35:47 <b1airo> i was being kinder than usual
21:36:31 <oneswig> trandles_: how are the first months of OSA going?
21:36:54 <trandles_> It's been...rewarding?
21:36:56 <trandles_> lol
21:37:24 <trandles_> When it just works it's great. When something breaks and you can figure out WTF is going on it feels like you've accomplished something.  ;)
21:38:11 <trandles_> To be fair almost ALL of the issues I have had are either self-inflicted, a gap in documentation, or due to the gnocchi fiasco.
21:39:17 <trandles_> I think the benefits of OSA greatly outweigh the challenges I've encountered. I'm also super happy with using LXC for the service containers.
21:39:54 <oneswig> Interesting - and I don't think your issues are unique to OSA by any means.
21:41:10 <trandles_> LXC is certainly more lightweight than docker. It's rock-solid stable, survives reboots easily, etc. I had a filesystem fill on one of my two controller testbeds last week and debugging the mess it created was fairly easy with everything in LXC.
21:43:34 <oneswig> Does it still use bridge networking, or is it host networking?
21:43:49 <trandles_> bridge
21:45:06 <oneswig> For a while early on I remember we deployed Monasca in LXC containers and it didn't make the host feel rock solid.  That was a few years ago now though.
21:45:19 <trandles_> https://docs.openstack.org/openstack-ansible/train/reference/architecture/container-networking.html
21:45:56 <trandles_> That's the reference for what I deployed on the testbed. We're reviewing it for all of our needs now and seeing what we're going to modify for the larger production cloud.
21:46:28 <trandles_> When I say "we" I mean with one of our network architects who knows how everything we need to interface with outside of the cloud is put together and a security guy
21:48:36 <trandles_> We're also using the testbed to convince program that it's worth their money to buy storage appliances and good network gear to offload as much as we can.
21:48:38 <oneswig> Is it also Linuxbridge networking for tenant overlay networking?
21:48:47 <trandles_> It's linuxbridge now, yes
21:49:22 <oneswig> All the more surprising that there was a plan to deprecate that.
21:49:35 <trandles_> Networking is the one place where we'll probably end up writing a good bit of our own ansible to get what we actually need.
21:49:41 <trandles_> But that's not surprising
21:51:36 <oneswig> trandles_: will you document/present your networking changes, are they generic for a high-security environment or specific to your place?
21:53:01 <trandles_> I might be able to. An example is everywhere our production clusters interface with the backend network (license servers, all filesystems, etc.) we are going from the cluster fabric to IB (IPoIB) and running quagga/zebra for OSPF.
21:53:59 <trandles_> It's pretty specific to our setup and might be changing soon...but that's outside of my control and a large part of why I'm working closely with one of our network gurus to get this right before we actually deploy
21:55:58 <oneswig> makes sense, I think.
21:57:29 <trandles_> I'm not sure it makes sense any more. A lot of this was a product of wanting insane bandwidth and having a crazily parallel backbone to get it. Paths all over the place, incredibly difficult to debug. Now we can buy 100gig commodity hardware.
21:58:02 <trandles_> It has me kinda dragging my feet a little bit knowing that it's likely to be a lot simpler soon.
21:59:26 <oneswig> Will it though - surely the clients will just want more of that?
22:00:10 <oneswig> We are at the hour - final comments?
22:00:29 <trandles_> It's the same problem we've always had. Make the network faster and simpler and now we need a better filesystem. ;)
22:00:59 <oneswig> Right, move the bottleneck!
22:01:39 <oneswig> #endmeeting