21:00:21 <oneswig> #startmeeting scientific-sig
21:00:22 <openstack> Meeting started Tue Mar  2 21:00:21 2021 UTC and is due to finish in 60 minutes.  The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
21:00:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
21:00:26 <openstack> The meeting name has been set to 'scientific_sig'
21:00:33 <martial> Hello Stig
21:00:40 <oneswig> Hi martial
21:00:45 <oneswig> #chair martial
21:00:46 <openstack> Current chairs: martial oneswig
21:00:49 <oneswig> How's things?
21:01:17 <oneswig> P2302 is that the NIST federation work?
21:01:29 <martial> not bad, just a little crazy :)
21:02:29 <oneswig> Hopefully crazy in a good way
21:02:29 <martial> IEEE actually, the NIST work got published in NIST SP500-332
21:02:38 <martial> #link https://www.nist.gov/publications/nist-cloud-federation-reference-architecture
21:03:12 <janders> g'day oneswig martial o/
21:03:18 <janders> how is it going?
21:03:38 <oneswig> Hi janders, good thanks
21:03:55 <martial> doing well, thanks janders
21:04:35 <oneswig> Busy :-)
21:05:43 <martial> busy good? :)
21:05:51 <julianp> Hi all. Bookmarked that link martial. Time to learn more about cloud federation.
21:05:54 <oneswig> I'm only 2 years late on this but I saw a really neat talk on large-scale Ceph administration: https://www.youtube.com/watch?v=niFNZN5EKvE
21:06:44 <oneswig> Well worth a look, it presents a very nice way of visualising the spread of utilisation across nodes.
21:11:01 <oneswig> It came up after a group we work with were adding larger drives to an existing Ceph cluster which pushed it to hitting hard limits of PGs/OSD.  Sounds quite painful.
21:13:13 <oneswig> julianp: you were asking a while back about infrastructure.  I think we are getting much closer now to having guests on the system.
21:13:37 <julianp> Eeeexcellent. Thanks for thinking of me oneswig.
21:13:39 <martial> I wonder if we can ask Rion to have another Minio conversation
21:14:00 <oneswig> Rion having more fun with MinIO?
21:15:40 <oneswig> julianp: will be in touch soon I hope!
21:16:12 <martial> well we are heavy with Ceph on SSDs but if you remember we had a small video chat with Rion about why Minio was useful for deployments
21:16:25 <julianp> Much obliged oneswig.
21:17:09 <julianp> martial do you remember why Minio was considered useful?
21:18:14 <martial> ease of deployment seemed to be a core reason
21:18:34 <julianp> Gotcha.
21:21:16 <oneswig> martial: ever compared it to Portworx?
21:22:40 <martial> I have a very small minio equivalent setup for testing but never tried portworx
21:25:28 <martial> any new setup for you Stig?
21:26:04 <oneswig> I've been banging my head on a real puzzler for the last few days.  I have a set of hosts that take ~3s to run "time ssh centos@host hostname"
21:26:19 <oneswig> It's not DNS before you ask, pretty sure on that now :-)
21:27:22 <priteau> Something auth-related, PAM?
21:27:26 <janders> what OS is the ssh connection originating from?
21:27:34 <oneswig> I'm still uncertain on the root cause.  There are some smoking guns relating to SELinux blocking access to /etc/ld.so.cache that look suspicious.
21:27:37 <oneswig> CentOS 8.3
21:28:24 <oneswig> Hi priteau :-) auth is ssh keys - although there's plenty of pam modules involved in that login.
21:28:25 <janders> does 'setenforce 0' make any difference?
21:28:41 <janders> let me see what my NVMe cleaning lab is on
21:28:44 <oneswig> janders: disabling SELinux and rebooting the node apparently isn't helping...
21:29:11 <oneswig> It's bizarre because I have other nodes in the same environment for which the same test takes a more sensible 0.2s
21:29:26 <janders> oneswig disabling SEL both client and server side?
21:29:26 <martial> same hardware/kernel version?
21:29:37 <janders> oneswig melding servers would come in handy :)
21:29:38 <oneswig> Different hardware, same kernel
21:29:57 <oneswig> melding?
21:30:22 <oneswig> janders: making the client == the server, ie ssh localhost, has the same delay
21:30:48 <martial> ohhhh I have had this happen, it was a network device driver for me
21:31:11 <julianp> Does `ssh -vvvv` show you where it gets stuck?
21:31:26 <priteau> ssh localhost shouldn't be slowed down by a NIC issue though
21:31:35 <janders> oneswig does the problem seem to stick to either the piece of hardware in question being a client or a server?
21:31:41 <janders> priteau good point
21:32:09 <martial> pierre, agreeing with you but localhost also was slow
21:32:25 <oneswig> julianp: a bit.  The ssh debug output isn't timestamped, alas.  There was a message, I'll see if I can dig it out.
21:32:32 <julianp> I bet it was the butler in the library with a candlestick.
21:32:46 <martial> now I can not remember if there was something else related to it
21:34:09 <oneswig> I've been running strace on client and server to try to spot something, that's my current effort.
21:34:18 <oneswig> Interesting though!
21:34:36 <oneswig> brb
21:34:45 <priteau> Have you tried changing various other settings in sshd_config? GSSAPI maybe?
21:36:16 <martial> I was checking in our slack to see if we documented this one
21:36:39 <oneswig> I removed the GSSAPI auth method and that bought some time, a fraction of a second
21:36:48 <martial> no luck
21:38:08 <martial> silly question because that was part of our checklist: IPv6 disabled?
21:38:14 <oneswig> uninstalling cockpit also gained me about 0.1s.  Small things.
21:38:37 <oneswig> IPv6 I haven't tried - worth a shot!
21:39:42 <priteau> MTU? (although I've seen it cause hangs, not slowdowns)
21:40:33 <priteau> (and it would probably not affect ssh localhost)
21:41:05 <oneswig> martial: just tried it, no joy alas
21:41:14 <martial> files listed first in your nsswitch.conf? so it uses /etc/hosts ?
21:41:50 <janders> oneswig a bit brute-force, but maybe worthwhile copying /etc of a "good" and a "bad" machine and doing a recursive diff?
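The recursive diff suggested here could look roughly like the following. It is a minimal sketch, assuming the two /etc trees have already been copied to local directories (for example with `scp -r` or tar over ssh, run as root so that restricted files are included). The function name `diff_etc` and the directory arguments are hypothetical.

```shell
# diff_etc: list the files that differ between two copied /etc trees.
# Usage: diff_etc GOOD_DIR BAD_DIR
# "diff_etc" is an illustrative name, not an existing tool.
diff_etc() {
    # -r recurses into subdirectories; -q only reports which files differ
    # (or exist on one side only) instead of printing every changed line.
    diff -rq "$1" "$2"
}
```

`diff` exits non-zero when it finds differences, so in a script run with `set -e` the call is best guarded, e.g. `diff_etc good-etc bad-etc || true`. Dropping `-q` for an individual file of interest then shows the actual changes.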
21:42:11 <oneswig> priteau: I don't think it's MTU.
21:42:43 <martial> similar idea as above, you could try "UseDNS no" in your sshd_config
21:43:03 <janders> oneswig is console login normal (making sure it's ssh only)?
21:43:07 <oneswig> If I run "ssh-keygen" on a dodgy node, there's a long delay before it prompts me for the filename.  That might be connected
21:43:35 <oneswig> martial: UseDNS no is set - been bitten by that one before :-)
21:44:04 <martial> okay not DNS, not IPv6 (use -4 :) )
21:44:04 <julianp> As for the timestamps not being in the ssh output, you can add them using `ts` from `moreutils` (TIL): ssh -vvvv some-host hostname 2>&1 | ts
21:44:32 <oneswig> janders: I'd need to get onto the BMC and it's one of those HPE boxes where you have to buy a license to use the console after the node boots...
21:45:12 <oneswig> julianp: that is new to me, neat trick!
21:46:00 <priteau> The slow ssh-keygen is very odd, I don't think it should do any I/O
21:46:07 <priteau> Network I/O I mean
21:46:12 <priteau> Something related to OpenSSL then?
21:46:27 <janders> oneswig that licensing model is ridiculous!
21:46:53 <priteau> oneswig: what does "cat /proc/sys/kernel/random/entropy_avail" say?
21:47:14 <julianp> Oh that's a good idea priteau. We've run into that.
21:47:38 <oneswig> entropy_avail = 3443 - is that enough?
21:47:47 <priteau> Should be
21:48:04 <janders> oneswig the two machines I'm currently working on have ~3400 and ~3800
21:48:10 <janders> priteau nice trick!
21:48:13 <oneswig> Doesn't look like I can get moreutils on CentOS - perhaps it's in EPEL?
21:49:02 <julianp> I believe you can replicate the moreutils ts functionality using some awk.
21:49:28 <oneswig> julianp: now we are talking :-)
21:49:49 <julianp> xD
21:52:19 <julianp> ssh -vvvv -C condo hostname 2>&1 | awk '{ print strftime(),$0 }'
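The timestamping trick above can also be sketched as a small shell function, for hosts where neither `moreutils` nor an awk with `strftime()` is available. This is a minimal sketch, assuming GNU `date` (the `%N` field is a GNU extension). The function name `stamp_lines` is made up for illustration.

```shell
# stamp_lines: prefix every line read from stdin with a wall-clock timestamp.
# Assumes GNU date; %N (nanoseconds) is not available in every date(1).
# "stamp_lines" is a hypothetical name, not part of any package.
stamp_lines() {
    # The "|| [ -n "$line" ]" part keeps a final line that lacks a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%H:%M:%S.%N')" "$line"
    done
}

# Example use (not run here): merge ssh's debug output from stderr into
# stdout and stamp each line, so the gap before a slow step stands out.
#   ssh -vvv centos@host hostname 2>&1 | stamp_lines
```

Spawning `date` once per line costs a few milliseconds each time, which should be small next to a delay of around 3s, but the stamps are only approximate.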
21:52:28 <oneswig> I'm clearly going to have to follow up to the SIG on this if I ever figure it out...
21:52:49 <julianp> Yes please! My curiosity is piqued.
21:53:17 <oneswig> Anything can be curious, until you *have* to solve it
21:54:06 <oneswig> for AOB: couple of small points
21:54:15 <martial> nice trick julian
21:55:29 <oneswig> PTG (virtual) dates were announced - 19-23 April - https://www.openstack.org/ptg/
21:56:47 <oneswig> Next week (Wednesday 1100 UTC) I'm hoping to have a session on Jupyterhub and OpenStack.  I think there's a good deal to cover on how to provide user-friendly integrations
21:57:09 <julianp> Ooh. I'm interested in that.
21:57:35 <oneswig> julianp: hopefully not too early a start for you?
21:57:53 <janders> oneswig from my side, I completed the initial work on support for NVMe-native disk cleaning in Ironic
21:58:02 <janders> if you are interested, I can give a preso on that in a couple weeks
21:58:21 <oneswig> janders: That would be great, I'd love to see it.
21:58:46 <janders> initial = works best on all-NVMe nodes (it doesn't make much of a difference in hybrid HDD-NVMe configs or on SSDs)
21:58:58 <oneswig> Random but related question: Do you know, with software RAID in Ironic, can I label the RAID block devices?  They come up in an arbitrary order.
21:59:29 <janders> oneswig not sure. Might be worth asking on #openstack-ironic
21:59:31 <oneswig> janders: would be good to hear more.
21:59:41 <oneswig> thanks janders, will do
21:59:50 <oneswig> OK, nearly time - final comments?
22:00:17 <oneswig> #endmeeting