09:00:09 <ttx> #startmeeting large_scale_sig
09:00:09 <openstack> Meeting started Wed Feb 26 09:00:09 2020 UTC and is due to finish in 60 minutes. The chair is ttx. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:00:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:00:12 <openstack> The meeting name has been set to 'large_scale_sig'
09:00:14 <ttx> Hi!
09:00:20 <ttx> #topic Rollcall
09:00:26 <amorin> LO
09:00:32 <ttx> amorin: how are things?
09:00:44 <oneswig> hello
09:00:51 <amorin> pretty well
09:01:01 <ttx> Agenda for today at:
09:01:06 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-meeting
09:02:10 <ttx> Let's wait a couple minutes for our friends in Asia
09:02:14 <masahito> o/
09:02:16 <amorin> ack
09:03:09 <ttx> jiaopengju1: around?
09:03:37 <ttx> Ok, let's get started
09:03:46 <ttx> #topic Progress on "Documenting large scale operations" goal
09:03:51 <ttx> #link https://etherpad.openstack.org/p/large-scale-sig-documentation
09:03:58 <ttx> amorin: any progress on documenting configuration defaults for large scale ?
09:04:09 <ttx> any blocker / help needed?
09:04:13 <amorin> yes
09:04:24 <amorin> I have been working on that in the past few days
09:04:37 <amorin> as you can see in the link you pasted
09:04:59 <amorin> I collected some params that need to be tweaked on nova / neutron
09:05:07 <amorin> some of them, it's just a beginning
09:05:18 <amorin> but I'd like to go further with that now
09:05:29 <amorin> so I have some questions for all of you :p
09:06:00 <ttx> re: oslo.privsep on Neutron, last time I looked it was not there yet
09:06:02 <amorin> first is, should we create a wiki page somewhere?
09:06:17 <amorin> ah yes, thanks
09:06:33 <ttx> It's mostly cinder/nova at this point
09:06:45 <ttx> (and os-brick)
09:06:51 <amorin> ok
09:07:03 <amorin> so neutron and others are still using rootwrap daemon
09:07:20 <amorin> that's the kind of info I want to share on the wiki page
09:07:28 <amorin> do you know how I can create that?
09:07:30 <ttx> and even there, most of the time it's a thin wrapper rather than a reimplementation of functions for higher security
09:07:35 <amorin> (I haven't checked at all)
09:07:56 <ttx> so from a performance perspective, equivalent to rootwrap-daemon
09:08:16 <amorin> ack
09:08:16 <ttx> only Nova made serious use of privsep
09:08:33 <ttx> resulting in stability, security and performance benefits
09:08:33 <amorin> that's what came out of the talk with matt
09:09:13 <amorin> I don't want to get sidetracked, but maybe the large scale group could also help with pushing privsep in other projects
09:09:30 <ttx> re:wiki, I think it would be great to create a page under https://wiki.openstack.org/wiki/Large_Scale_SIG
09:10:05 <ttx> amorin: yeah, could be a collaboration with the Security SIG... or some pop-up team to push for privsep in a specific direction
09:10:17 <ttx> I have a long-standing TODO to work on privsepping osbrick
09:10:24 <amorin> ok :p
09:10:29 <oneswig> ttx: where do the performance benefits come in with privsep?
09:10:31 <ttx> since it sounded like a fun project
09:11:34 <ttx> oneswig: mostly in using narrow python functions rather than calling on IPC... gain depends on function called though
09:11:58 <ttx> in some rare cases I could see it being actually slower, but overall there should be a small gain.
09:12:03 <ttx> But the real gain is security
09:12:23 <oneswig> OK, that makes sense thanks
09:12:57 <ttx> amorin: so +1 on creating a wiki page
09:13:10 <amorin> I will do then
09:13:16 <amorin> and register the link on https://wiki.openstack.org/wiki/Large_Scale_SIG
09:13:22 <ttx> ++
09:13:26 <amorin> that was the first point,
09:13:33 <amorin> then, after the wiki page is done
09:13:44 <amorin> I'd like to contribute to documentation, on nova for example
09:13:54 <amorin> with a small change on some parameters docs
09:14:02 <amorin> to show a link to this page
09:14:13 <amorin> what do you think?
09:14:29 <amorin> #action amorin do a wiki page for large scale documentation
09:15:02 <ttx> it sounds like a good idea. To be discussed with Nova team I guess
09:15:34 <amorin> yes, I will do a proposal that could be used to start talking
09:15:51 <ttx> maybe they would prefer making it a direct part of the doc, but at least at a first stage the content of that wiki page is likely to be a moving target
09:16:19 <ttx> so until it stabilizes it really makes sense to expose it through a small doc patch
09:16:27 <amorin> ok
09:16:53 <ttx> amorin: anything else on that topic?
09:17:02 <amorin> #action amorin start a patch on documentation for nova
09:17:24 <amorin> last is, if you have some params on your side that could be identified in that documentation
09:17:28 <amorin> feel free to add them :p
09:17:34 <amorin> and that's all from me!
09:17:51 <amorin> the scaling story here:
09:17:56 <amorin> #link https://etherpad.openstack.org/p/scaling-stories
09:18:01 <ttx> #idea pop up team between Large Scale SIG / Security SIG to push privsep in specific areas
09:18:04 <amorin> is an amazing source of info
09:18:14 <ttx> yes, I wish there were more, really
09:18:17 <amorin> I'd love to have more like that
09:18:18 <slaweq> amorin: ttx: we are moving with many things to e.g. pyroute2 library to use privsep
09:18:20 <amorin> +1
09:18:30 <amorin> slaweq, nice!
09:18:33 <slaweq> but it's still not finished and in some places we are still using rootwrap
09:18:35 <ttx> slaweq: great to see progress there !
09:19:11 <ttx> OK... Looking at the etherpad I don't think we had any new doc links added, so let's skip that
09:19:27 <ttx> #topic Progress on "Scaling within one cluster" goal
09:19:48 <ttx> Re: https://etherpad.openstack.org/p/scaling-stories , no recent story posted, so we should carry that over
09:20:01 <ttx> I was thinking of sending a periodic reminder
09:20:18 <amorin> +1
09:20:31 <ttx> but ideally we'd have scaling stories from this group, to make it look more like a living resource
09:20:44 <amorin> at OVH we can add some, I'll try to gather the info internally
09:20:55 <ttx> Also I was thinking of doing a live exercise on this at the Large Scale usage of infrastructure track at OpenDev
09:20:59 <ttx> (more on that later)
09:21:16 <ttx> amorin: can be anecdotes, does not have to be as comprehensive as the first one posted
09:21:21 <amorin> I am pretty sure CERN can also add some very interesting things
09:21:52 <ttx> Like "I remember when we did this and had to do this to get it to work"
09:22:17 <ttx> We got a couple more reviews on oslo.metrics blueprint at https://review.opendev.org/#/c/704733/
09:22:28 <amorin> yes
09:22:35 <ttx> masahito: Based on bnemec's comment, I'd say we should not overthink it... Maybe post a new patchset to take comments into account and run for approval ?
09:22:36 <masahito> Thanks for the comments.
09:23:07 <masahito> ah, yes. I plan to update this tomorrow.
09:23:11 <ttx> Nice!
09:23:32 <oneswig> sounds great
09:23:34 <amorin> great, for me it's quite ok, I think we must first focus on simplicity
09:24:01 <amorin> one of my concerns was about the fact that oslo metrics was sending the data somewhere
09:24:07 <amorin> are we sure we want to do that?
09:24:16 <ttx> yes and getting the code out there. If there was a fatal flaw in the idea, we'd have had more comments on that spec :)
09:24:57 <ttx> amorin: you mean central collection? what would the alternative be?
09:25:30 <masahito> And as Ben commented, it's better to publish our code for a first look. But I need to check which parts don't contain information sensitive to our company.
09:25:43 <amorin> IMHO, oslo.metrics should collect the metrics and leave them locally on a unix socket
09:25:45 <ttx> masahito: of course, take your time :)
09:26:08 <amorin> the central collection could be done by another tool
09:26:35 <amorin> but if your implementation is already doing it, then it's ok :p
09:26:37 <masahito> amorin: Good point. In our case, oslo.metrics sends data. But if that's not good for other companies, I don't want to push our case.
09:27:01 <amorin> in my company, we already have some kind of collectors on hosts,
09:27:07 <amorin> so I'd like to reuse those
09:27:14 <ttx> amorin: could that be one config option?
09:27:20 <amorin> of course
09:27:47 <amorin> I wasn't aware that an implementation was already done
09:28:01 <ttx> maybe we can propose that once the original code is posted
09:28:11 <amorin> makes sense
09:28:23 <ttx> amorin: it's what masahito presented in Shanghai, running at LINE
09:28:24 <masahito> amorin: Not much of an implementation. We just started.
09:28:36 <amorin> ah ok!
09:28:40 <amorin> perfect then ;p
09:29:18 <ttx> If there is one thing I learned about telemetry and metrics collection, it's that you need to be flexible to adapt to every case
09:29:32 <ttx> and that simple is better than complex for that reason
09:30:08 <ttx> ok, anything else to discuss re: "Scaling within one cluster" goal ?
09:30:21 <oneswig> I added something
09:30:23 <ttx> #action masahito to post new patchset of spec
09:30:32 <oneswig> on bare metal scaling. Is it time for that?
09:31:00 <ttx> yes, in a sec
09:31:08 <ttx> #topic Other topics
09:31:15 <ttx> oneswig: go for it
09:31:46 <oneswig> ah right
09:31:58 <oneswig> so I was thinking this might be another front for discussion
09:32:18 <oneswig> how big can bare metal clusters go, and how are they managed in extremis?
09:32:44 <oneswig> We have experience with working on the rate at which nodes can be deployed concurrently
09:33:07 <ttx> I feel like that's definitely aligned with the "scaling within one cluster" goal, for a specific definition of cluster
09:33:33 <oneswig> One limitation can be the processing rate of networking-generic-switch, if we aren't reverting to flat networking
09:33:46 <oneswig> ttx: think so too :-)
09:34:05 <ttx> oneswig: it could start as a scaling story
09:34:23 <oneswig> I don't have much to contribute on this yet but I'm happy to write and maintain a scaling story for this based on our experiences.
09:34:25 <ttx> I don't think we have enough data points to start making separate categories
09:34:34 <amorin> that could be nice
09:34:50 <amorin> you eventually found some tunings to be done for your story, no?
09:34:50 <oneswig> cool, sign me up.
09:35:02 <ttx> Once we have plenty of stories we'll have to think on how to best present them, but for now the big etherpad is enough
09:35:11 <oneswig> Well, the most recent work has been a combination of tunables and refactoring of n-g-s
09:35:24 <ttx> #action oneswig to contribute a scaling story on bare metal cluster scaling
09:35:39 <oneswig> nb, might take a couple of months
09:35:59 <ttx> oneswig: Oath / Verizon Media has a massive footprint of bare metal clusters, would be nice to hear from them
09:36:08 <ttx> which brings us to our next point
09:36:10 <oneswig> for sure!
09:36:19 <ttx> belmoreira and masahito volunteered to join the programming committee for the "Large-scale Usage of Open Source Infrastructure Software" track
09:36:23 <ttx> at OpenDev event in Vancouver, June 8-11
09:36:31 <ttx> It's our opportunity to shape the discussion so that it aligns with the group's objectives
09:36:37 <ttx> And recruit new members
09:36:39 <masahito> We also have tons of baremetal but not managed by OpenStack ;-<
09:36:53 <ttx> I'm already involved with another track, so Allison Price will be the OSF contact to organize that track.
09:37:05 <ttx> I'll follow closely still, and we can discuss it in this meeting
09:37:10 <amorin> unfortunately I won't be able to attend
09:37:22 <ttx> amorin: that's unfortunate
09:37:31 <ttx> A track has several subthemes...
09:37:36 <oneswig> I expect to be there and will participate provided it doesn't conflict with the scientific sig
09:37:49 <ttx> So we could have one around collecting scaling stories from the participants
09:37:54 <ttx> oneswig: it does not
09:37:58 <masahito> I plan to be there as ttx said
09:38:12 <ttx> Opendev tracks are in the mornings and PTG in the afternoons (+ last day)
09:38:33 <ttx> We could also have a specific theme around scaling bare metal clusters
09:38:57 <ttx> The format is pretty open, the idea is to prefer direct engagement over formal talks
09:39:13 <oneswig> ttx: sounds good to me.
09:39:47 <ttx> so as you are contacted and start discussing the contents, please push our SIG's agenda !
09:40:06 <oneswig> +1
09:40:11 <ttx> ideally we'd recruit new members, but at the very least we should collect information
09:41:01 <ttx> I'd really like a scaling-stories session where people can informally talk about scaling anecdotes, and then we'd take notes and follow up with them to submit something more detailed
09:41:34 <ttx> #topic Next meeting
09:41:40 <ttx> everyone still OK with a biweekly frequency?
09:41:50 <amorin> it's ok for me
09:41:50 <oneswig> yes
09:41:55 <masahito> yes
09:42:35 <ttx> #info next meeting: March 11
09:42:47 <ttx> Anything else to mention before we close ?
09:43:32 <oneswig> nothing here
09:43:36 <ttx> #idea have a scaling-stories session where people can informally talk about scaling anecdotes, and then we'd take notes and follow up with them to submit something more detailed
09:43:57 <ttx> ok I think I captured everything
09:44:01 <ttx> Thanks everyone
09:44:07 <ttx> #endmeeting
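
Note on the oslo.metrics discussion (09:24 to 09:28): amorin suggested that the library could leave metrics locally on a unix socket and let an existing host collector handle central collection. A minimal sketch of that idea follows, using only the Python standard library. It is not the oslo.metrics API or the LINE implementation, neither of which is shown in this log; the function names, the JSON payload layout, the metric name "rpc_call_total" and the socket path are all hypothetical.

```python
import json
import os
import socket
import tempfile
import time


def emit_metric(sock_path, name, value, labels=None):
    """Send one metric sample as a JSON datagram to a local Unix socket.

    Hypothetical helper: the payload layout is an assumption made for
    this sketch, not a format defined by oslo.metrics.
    """
    payload = json.dumps(
        {"name": name, "value": value, "labels": labels or {}, "ts": time.time()}
    ).encode("utf-8")
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as client:
        client.sendto(payload, sock_path)


def demo():
    """Stand in for a host-local collector and read back one sample."""
    sock_dir = tempfile.mkdtemp()
    sock_path = os.path.join(sock_dir, "metrics.sock")
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as collector:
            collector.bind(sock_path)      # the collector owns the socket file
            collector.settimeout(2.0)      # do not block forever if nothing arrives
            emit_metric(sock_path, "rpc_call_total", 1, {"service": "nova-compute"})
            data = collector.recv(4096)
    finally:
        if os.path.exists(sock_path):
            os.unlink(sock_path)
        os.rmdir(sock_dir)
    return json.loads(data.decode("utf-8"))


if __name__ == "__main__":
    print(demo()["name"])
```

In this shape the service only writes to a local socket, and whatever already runs on the host can forward the samples to a central system. That matches the "one config option" outcome discussed at 09:27: the socket path (or a remote endpoint instead) would be the configurable part.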