11:00:12 #startmeeting scientific-sig
11:00:13 Meeting started Wed Apr 8 11:00:12 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:00:16 The meeting name has been set to 'scientific_sig'
11:00:33 #topic greetings
11:00:37 g'day
11:00:41 hi janders
11:00:57 What's new with you?
11:01:01 hi
11:01:08 interesting cx6-eth breakages for AOB :)
11:01:10 Hi dh3
11:01:22 already looking forward to it...
11:01:49 Hello.
11:02:00 greetings verdurin
11:02:26 Let's get the show on the road
11:02:37 #topic Kolla user group
11:02:52 OK, only new item on this week's agenda
11:03:25 For Kolla users there's a new effort to get users and operators talking with developers
11:03:51 #link Kolla user group thread http://lists.openstack.org/pipermail/openstack-discuss/2020-April/013787.html
11:04:44 If you're using Kolla, hopefully it will be a worthwhile discussion
11:06:32 Yes, hoping to attend.
11:06:45 Is there something similar for TripleO - I know there are feedback sessions at the summits
11:07:07 I call it tripleno :P
11:07:17 (can be made to work though :) )
11:08:49 OK, good to know.
11:09:24 I've been to quite a few operator feedback sessions in different forms, hopefully this will be productive.
11:09:33 Move on?
11:09:41 #topic AOB
11:10:00 ok... I promised some cx6-eth stories
11:10:28 we used to have an issue getting VPI to work. All IB config *just worked* but eth-ib less so
11:10:44 a FW version came out some time back that was supposed to fix it, we started testing this a couple of weeks back
11:10:59 catch: it seems for VPI to work, eth needs to be on port1 and ib on port2
11:11:20 and we're wired the other way round and have limited DC access due to lockdown so it's tricky to swap it over
11:11:25 janders: ???
11:11:29 in any case - it does malfunction in an interesting way
11:11:35 eth link comes up no worries
11:11:43 till... you try to use it with LB or ovs
11:11:58 when it just drops any traffic with a MAC address not matching the physical port
11:12:16 our friends at Mellanox are looking at it but that's where things are at
11:12:28 I've seen a few issues with VPI but not this one :)
11:12:37 have you guys seen anything like that?
11:12:56 I have... kind of...
11:13:07 what card was that on?
11:13:11 also a cx6?
11:13:15 or something different?
11:13:21 ConnectX-5, probably different issue.
11:13:32 This is SR-IOV over a bonded LAG
11:14:14 sorry got dropped out
11:14:21 (not VPI's fault :) )
11:14:35 oneswig: what was the problem on cx5?
11:14:49 Aha, well it was related to VF-LAG and OVS
11:15:06 hi, just joined
11:15:09 I got into a situation where I can receive broadcast traffic but not unicast
11:15:13 Hi belmoreira
11:15:20 right!
11:15:31 Are you using OVS 2.12?
11:15:34 was that only specific to VFs, or would it impact traffic across the board?
11:15:45 (checking)
11:16:00 Saw it first in VFs. When I installed OVS 2.12 it affected both
11:16:54 Haven't investigated in sufficient detail yet but it might not be related
11:17:41 2.9.0 is the version
11:17:49 BTW the RDO build of OVS 2.12 is apparently quite old compared to recent code on that branch
11:19:25 right!
11:19:32 this is an OSP13/Queens based project
11:19:40 so it may be worth re-testing with the latest ovs
11:19:52 has upgrading ovs helped in your case?
11:19:54 OVS troubleshooting tools are a dark art all of their own
11:20:15 yeah we may need to sacrifice a packet or two to the ovs gods...
11:20:20 janders: not yet, 2.11->2.12 caused many problems on the first attempt
11:20:39 Need to go back and do it again, with better preparation
11:22:06 belmoreira: was talking with someone recently about external hardware inventories and Ironic. I was thinking about Arne's work on that. Did CERN settle on a hardware inventory management system?
11:22:33 not yet
11:24:18 belmoreira: so what's new at CERN?
11:24:58 :) related to ironic, we are now moving into conductor groups
11:25:37 this allows us to split the ironic infrastructure more or less like cells
11:25:56 how many nodes are you managing now?
11:27:35 ~5100
11:28:03 nice work :-)
11:29:01 How do you size the number of nodes managed by a conductor group?
11:30:29 We introduced the first conductor group with ~600 nodes
11:30:53 the metric that we use is the time that the resource tracker takes to run
11:31:35 clever!
11:32:05 Have you reduced how often it runs? I think I remember you changed it to run every few hours?
11:33:24 yes, but with the latest versions it impacts node state updates
11:33:49 with 600 nodes it takes around 15 minutes to run
11:34:13 On a related subject, someone in our team mentioned the software-raid deployment has improved flexibility now.
11:34:17 we are discussing having conductor groups with 500 nodes
11:36:11 I think the raid work is already merged upstream. For details Arne is the best person to contact
11:37:17 great, thanks.
11:40:21 On a different subject - dh3: has there been any further development on your work integrating Lustre with OpenStack clients?
11:42:03 oneswig: "a bit" - it is fighting for our attention with other projects, and we were hitting some weird Lustre errors which made upgrading the client version more urgent (we are settling on 2.12.4 now) but the users still want it
11:44:06 the general approach of building an image with Lustre already in it is working though, we have several groups using it as a PoC
11:44:16 (can't escape posix :) )
11:44:22 dh3: I saw you guys are now famous - some familiar faces here https://www.youtube.com/watch?v=WAWJxVYH9QM&feature=youtu.be
11:45:47 haha, I knew there was something in the works but I hadn't seen that. we are in contract (re)negotiation land now, different kind of fun
11:46:44 good luck with that! :-)
11:47:13 thanks :)
11:49:18 dh3: there has been some work recently with Cambridge Uni on improved Ansible automation for Lustre. Might be worth sharing notes with you on that.
11:50:21 oneswig: we'd be interested. We don't let our main site Ansible touch Lustre servers much at the moment, trying to keep them as a "black box" - only to install stats scripts and (soon) to control iptables
11:51:47 dh3: ok sounds good.
11:52:01 I didn't have more for today - anyone else?
11:54:36 OK y'all, good to see everyone, thanks for joining :-)
11:54:41 #endmeeting
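
For the cx6 VPI discussion above (Ethernet expected on port 1, InfiniBand on port 2), one quick sanity check is how the adapter ports are configured in firmware. The sketch below is an illustrative helper, not something used in the meeting: it shells out to Mellanox's mlxconfig tool, the device path is an example only (taken from `mst status` on a real host), and the LINK_TYPE_P* parameter names and output parsing should be verified against your firmware tools version.

    import subprocess

    # Example device path only; the real path comes from `mst status`
    # and differs per host and adapter.
    DEVICE = "/dev/mst/mt4123_pciconf0"

    def port_link_types(device=DEVICE):
        """Return the LINK_TYPE_P1/LINK_TYPE_P2 settings reported by mlxconfig."""
        out = subprocess.run(
            ["mlxconfig", "-d", device, "query"],
            capture_output=True, text=True, check=True,
        ).stdout
        # mlxconfig prints one "PARAMETER   VALUE" pair per line;
        # keep only the per-port link-type parameters.
        return {
            line.split()[0]: line.split()[-1]
            for line in out.splitlines()
            if line.strip().startswith("LINK_TYPE_P")
        }

    if __name__ == "__main__":
        # For the wiring described above, port 1 should report an
        # Ethernet link type and port 2 InfiniBand.
        print(port_link_types())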
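
To illustrate the conductor-group sizing discussion (~600 nodes taking roughly 15 minutes per resource tracker run, with ~500-node groups under consideration for a ~5100-node fleet), here is a rough back-of-the-envelope sketch. It assumes the resource tracker run time scales linearly with node count, which is a simplification rather than anything stated in the meeting.

    # Reference figures from the discussion: ~600 nodes, ~15 minutes per
    # resource tracker run. Linear scaling is an assumption.
    REFERENCE_NODES = 600
    REFERENCE_RUNTIME_MIN = 15.0

    PER_NODE_MIN = REFERENCE_RUNTIME_MIN / REFERENCE_NODES  # ~0.025 min/node

    def projected_runtime_min(nodes):
        """Estimated resource tracker run time (minutes) for one conductor group."""
        return nodes * PER_NODE_MIN

    if __name__ == "__main__":
        total_nodes = 5100
        for group_size in (500, 600, 1000):
            groups = -(-total_nodes // group_size)  # ceiling division
            print(f"{group_size:>4}-node groups: ~{projected_runtime_min(group_size):.1f} "
                  f"min per run, ~{groups} groups for {total_nodes} nodes")

On those assumptions, 500-node groups keep each run near 12-13 minutes and split the fleet into roughly 11 groups, which matches the direction discussed above.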
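
On the software RAID point, the upstream Ironic work referenced above is driven by a target RAID configuration set on the node. The snippet below is an illustrative sketch only: the layout, sizes, filename and CLI invocation in the comment are made up for the example, and the authoritative schema is in the Ironic documentation (or from Arne, as suggested above).

    import json

    # Made-up layout: a mirrored root volume plus the remaining space striped,
    # both built in software (mdadm) by the ironic-python-agent.
    target_raid_config = {
        "logical_disks": [
            {"size_gb": 100, "raid_level": "1", "controller": "software"},
            {"size_gb": "MAX", "raid_level": "0", "controller": "software"},
        ]
    }

    with open("raid-config.json", "w") as fp:
        json.dump(target_raid_config, fp, indent=2)

    # Typically applied to a node with something along the lines of:
    #   openstack baremetal node set <node> --target-raid-config raid-config.json
    # and then realised during cleaning, before deployment.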