11:03:55 <oneswig> #startmeeting scientific-sig
11:03:56 <openstack> Meeting started Wed Sep 23 11:03:55 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:03:57 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:03:59 <openstack> The meeting name has been set to 'scientific_sig'
11:04:11 <oneswig> apologies for lateness
11:10:39 <belmoreira> o/
11:11:04 <belmoreira> hi oneswig
11:11:27 <oneswig> Hi belmoreira, how's things?
11:11:46 <oneswig> priteau was mentioning the discussion in the large scale sig this morning
11:12:09 <belmoreira> good, and with you?
11:12:28 <oneswig> I have been negligent in my attention to the Scientific SIG :-(
11:12:52 <oneswig> Otherwise things are going well I'd say
11:12:53 <belmoreira> yes, we had the large scale sig meeting this morning and the discussion for the next goals and PTG
11:13:20 <belmoreira> good to hear that
11:14:56 <oneswig> What are the pain points you are having with scaling?
11:17:34 <belmoreira> well, I think we hit everything possible to hit during the last years... for now until we can physically add more nodes we should be good
11:20:10 <oneswig> Pierre mentioned you were describing limitations with cells because of not helping network scaling - is that new or something you've been fighting from the beginning?
11:20:57 <belmoreira> it's more related with neutron scalability, not nova or cells design
11:21:33 <belmoreira> last year we split the infrastructure in 2 regions. Currently we have 3 regions
11:26:39 <belmoreira> I think it will be interesting to share it
11:27:22 <oneswig> How does storage link between the 3 regions, do you have to share storage?
11:36:08 <janders> hi oneswig belmoreira
11:36:22 <janders> belmoreira do you have multiple cells per region, do I get that right?
11:36:36 <oneswig> Hi janders, good to see you
11:36:56 <janders> oneswig good to see you too. Apologies for joining late - team meeting clash.
11:36:59 <b1airo> g'evening
11:37:08 <janders> hey b1airo!
11:37:23 <b1airo> my late excuse is more about beer...
11:37:36 <janders> b1airo important! :)
11:38:10 <oneswig> Hi b1airo, very important. Back home?
11:38:15 <oneswig> #chair b1airo
11:38:16 <openstack> Current chairs: b1airo oneswig
11:38:21 <b1airo> oh totally janders , certainly higher priority than meetings anyways :-P
11:38:38 <janders> I look forward to times when we can combine both again
11:38:45 <janders> as it should be
11:38:46 <b1airo> no oneswig , still up north hanging out with the NIWA crew
11:39:14 <belmoreira> janders yes, we have multiple cells
11:39:34 <b1airo> how many now belmoreira ?
11:39:35 <belmoreira> in 3 regions we have more than 70 cells in total
11:39:46 <b1airo> *whistle*
11:40:02 <janders> belmoreira do any of your cells span regions, or is each cell contained within one region?
11:40:11 <belmoreira> each cell has a maximum of 250 nodes
11:40:35 <belmoreira> janders cells are per region
11:40:41 <janders> belmoreira which aspect of scalability do cells help with the most in your experience?
11:40:44 <b1airo> are you still following the same, err, "disposable" cell controllers model? :-)
11:41:24 <belmoreira> each cell has its own rabbit infrastructure
11:41:58 <b1airo> (i vaguely recall you are running your cell controllers within the prod cloud itself... 🐢)
11:41:59 <belmoreira> and it's a good failure domain, in case of issues things are contained
11:42:56 <belmoreira> b1airo :) yes, all our controller plane runs inside the cloud itself (inception)
11:43:07 <janders> belmoreira nice! :)
11:43:30 <b1airo> agree on that - spanning regions (a user facing construct) across cells (a backend scalability and failure domain concern) seems like a questionable idea
11:43:32 <janders> belmoreira does this architecture pose a challenge in case of a need of a full-system shutdown?
11:44:30 <belmoreira> you mean a shutdown in the data centre :)
11:44:35 <janders> belmoreira I really like it, just wonder what extra measures are needed to prevent losing the "starter motor"
11:44:51 <janders> belmoreira yeah
11:45:13 <belmoreira> yes, sure... if that happens we need to understand what needs to be available first
11:46:13 <b1airo> i guess maybe the api top-level needs to come up first, followed by compute "cell0" (i guess that must be a thing in this architecture?)
11:46:14 <belmoreira> but is not a big issue, because instance start doesn't need the control plane
11:46:37 <janders> belmoreira right!
11:47:03 <janders> belmoreira do you have dedicated compute nodes for infra services, so that they are separate from user workloads and easy to identify?
11:47:07 <belmoreira> b1airo yes, if we really need APIs from the beginning, but if a disaster happens APIs availability will be probably the last
11:47:46 <b1airo> ha, good point! so "cell0" is really just select instance startup directly on compute nodes?
11:48:25 <janders> do the infra instances have networking statically configured?
11:48:35 <janders> (cause I suppose DHCP services may not be available yet)
11:49:11 <b1airo> was coming to the networking question too :-)
11:49:11 <janders> or is the inception arch cell-specific, with neutron being independent of this?
11:49:16 <belmoreira> b1airo in a case of a disaster we will probably force instance start per compute node
11:49:31 <belmoreira> and worry to have the DBs up
11:51:31 <belmoreira> janders we use DHCP, but it's a separate infrastructure... yes, it needs to be up
11:51:50 <janders> belmoreira makes sense
11:52:00 <belmoreira> janders users and infra instances share the same infrastructure
11:52:14 <janders> belmoreira no noisy neighbour issues?
11:52:39 <belmoreira> only compute instances have their dedicated cells/regions
11:53:37 <belmoreira> janders yes, sometimes... we usually live migrate noisy neighbours to less busy compute nodes
11:54:16 <martial_> Late (still getting kids ready)
11:54:53 <oneswig> Hi martial_, morning
11:55:13 <janders> belmoreira it's awesome to hear about your architecture and your experiences with it, thanks for sharing!
11:55:40 <oneswig> janders: how's things with you?
11:55:56 <janders> oneswig good, thank you for asking! :)
11:56:08 <belmoreira> janders np
11:56:45 <janders> something I've been looking at most recently that might be useful for the SIG is potentially introducing NVMe-aware cleaning to Ironic
11:56:57 <janders> (think trim/discard/... functionality)
11:57:20 <oneswig> janders: I've seen key rotation used for SATA SSDs, does that also apply here?
11:57:44 <oneswig> The discard idea is good though!
11:58:07 <janders> yeah some of the "secure" deletion options leverage manipulating crypto keys
11:58:30 <janders> what's supported really varies but I hope we can find enough common ground
11:59:04 <janders> how are things at your end oneswig? What are you guys up to these days?
11:59:34 <oneswig> janders: too much to describe in our final minute :-(
11:59:50 <oneswig> I think finally we are back to having rather too much fun.
11:59:57 <janders> oneswig true! poor timing on my behalf. Next time!
12:00:16 <oneswig> Until next time. I promise to come better prepared.
12:00:30 <janders> have a good one all
12:00:36 <oneswig> time to close
12:00:38 <oneswig> #endmeeting