11:03:55 #startmeeting scientific-sig
11:03:56 Meeting started Wed Sep 23 11:03:55 2020 UTC and is due to finish in 60 minutes. The chair is oneswig. Information about MeetBot at http://wiki.debian.org/MeetBot.
11:03:57 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
11:03:59 The meeting name has been set to 'scientific_sig'
11:04:11 apologies for lateness
11:10:39 o/
11:11:04 hi oneswig
11:11:27 Hi belmoreira, how's things?
11:11:46 priteau was mentioning the discussion in the large scale sig this morning
11:12:09 good, and with you?
11:12:28 I have been negligent in my attention to the Scientific SIG :-(
11:12:52 Otherwise things are going well I'd say
11:12:53 yes, we had the large scale sig meeting this morning and the discussion for the next goals and PTG
11:13:20 good to hear that
11:14:56 What are the pain points you are having with scaling?
11:17:34 well, I think we hit everything possible to hit during the last years... for now until we can physically add more nodes we should be good
11:20:10 Pierre mentioned you were describing limitations with cells because of not helping network scaling - is that new or something you've been fighting from the beginning?
11:20:57 it's more related with neutron scalability, not nova or cells design
11:21:33 last year we split the infrastructure in 2 regions. Currently we have 3 regions
11:26:39 I think it will be interesting to share it
11:27:22 How does storage link between the 3 regions, do you have to share storage?
11:36:08 hi oneswig belmoreira
11:36:22 belmoreira do you have multiple cells per region, do I get that right?
11:36:36 Hi janders, good to see you
11:36:56 oneswig good to see you too. Apologies for joining late - team meeting clash.
11:36:59 g'evening
11:37:08 hey b1airo!
11:37:23 my late excuse is more about beer...
11:37:36 b1airo important! :)
11:38:10 Hi b1airo, very important. Back home?
11:38:15 #chair b1airo
11:38:16 Current chairs: b1airo oneswig
11:38:21 oh totally janders, certainly higher priority than meetings anyways :-P
11:38:38 I look forward to times when we can combine both again
11:38:45 as it should be
11:38:46 no oneswig, still up north hanging out with the NIWA crew
11:39:14 janders yes, we have multiple cells
11:39:34 how many now belmoreira?
11:39:35 in 3 regions we have more than 70 cells in total
11:39:46 *whistle*
11:40:02 belmoreira do any of your cells span regions, or is each cell contained within one region?
11:40:11 each cell has a maximum of 250 nodes
11:40:35 janders cells are per region
11:40:41 belmoreira which aspect of scalability do cells help with the most in your experience?
11:40:44 are you still following the same, err, "disposable" cell controllers model? :-)
11:41:24 each cell has its own rabbit infrastructure
11:41:58 (i vaguely recall you are running your cell controllers within the prod cloud itself... 🐢)
11:41:59 and it's a good failure domain, in case of issues things are contained
11:42:56 b1airo :) yes, all our controller plane runs inside the cloud itself (inception)
11:43:07 belmoreira nice! :)
11:43:30 agree on that - spanning regions (a user facing construct) across cells (a backend scalability and failure domain concern) seems like a questionable idea
11:43:32 belmoreira does this architecture pose a challenge in case of a need of a full-system shutdown?
11:44:30 you mean a shutdown in the data centre :)
11:44:35 belmoreira I really like it, just wonder what extra measures are needed to prevent losing the "starter motor"
11:44:51 belmoreira yeah
11:45:13 yes, sure... if that happens we need to understand what needs to be available first
11:46:13 i guess maybe the api top-level needs to come up first, followed by compute "cell0" (i guess that must be a thing in this architecture?)
11:46:14 but is not a big issue, because instance start doesn't need the control plane
11:46:37 belmoreira right!
11:47:03 belmoreira do you have dedicated compute nodes for infra services, so that they are separate from user workloads and easy to identify?
11:47:07 b1airo yes, if we really need APIs from the beginning, but if a disaster happens APIs availability will be probably the last
11:47:46 ha, good point! so "cell0" is really just select instance startup directly on compute nodes?
11:48:25 do the infra instances have networking statically configured?
11:48:35 (cause I suppose DHCP services may not be available yet)
11:49:11 was coming to the networking question too :-)
11:49:11 or is the inception arch cell-specific, with neutron being independent of this?
11:49:16 b1airo in a case of a disaster we will probably force instance start per compute node
11:49:31 and worry to have the DBs up
11:51:31 janders we use DHCP, but it's a separate infrastructure... yes, it needs to be up
11:51:50 belmoreira makes sense
11:52:00 janders users and infra instances share the same infrastructure
11:52:14 belmoreira no noisy neighbour issues?
11:52:39 only compute instances have their dedicated cells/regions
11:53:37 janders yes, sometimes... we usually live migrate noisy neighbours to less busy compute nodes
11:54:16 Late (still getting kids ready)
11:54:53 Hi martial_, morning
11:55:13 belmoreira it's awesome to hear about your architecture and your experiences with it, thanks for sharing!
11:55:40 janders: how's things with you?
11:55:56 oneswig good, thank you for asking! :)
11:56:08 janders np
11:56:45 something I've been looking at most recently that might be useful for the SIG is potentially introducing NVMe-aware cleaning to Ironic
11:56:57 (think trim/discard/... functionality)
11:57:20 janders: I've seen key rotation used for SATA SSDs, does that also apply here?
11:57:44 The discard idea is good though!
11:58:07 yeah some of the "secure" deletion options leverage manipulating crypto keys
11:58:30 what's supported really varies but I hope we can find enough common ground
11:59:04 how are things at your end oneswig? What are you guys up to these days?
11:59:34 janders: too much to describe in our final minute :-(
11:59:50 I think finally we are back to having rather too much fun.
11:59:57 oneswig true! poor timing on my behalf. Next time!
12:00:16 Until next time. I promise to come better prepared.
12:00:30 have a good one all
12:00:36 time to close
12:00:38 #endmeeting