15:00:13 #startmeeting openstack-helm
15:00:13 Meeting started Tue Jun 12 15:00:13 2018 UTC and is due to finish in 60 minutes. The chair is mattmceuen. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:14 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:16 The meeting name has been set to 'openstack_helm'
15:00:19 #topic rollcall
15:00:36 o/
15:00:45 Agenda for today: https://etherpad.openstack.org/p/openstack-helm-meeting-2018-06-12
15:00:51 o/
15:01:04 o/
15:01:13 please go ahead and add anything you'd like to discuss. We have a pretty full agenda (including some things that got pushed from last time)
15:01:32 GM all
15:01:35 O/
15:01:57 Alright -- first
15:02:02 #topic Storyboard
15:02:16 As you're all probably aware, we switched from Launchpad to Storyboard on Friday!
15:02:39 Here is our project group: https://storyboard.openstack.org/#!/project_group/64
15:02:55 From there you can click on different projects for osh, osh-infra, osh-addons
15:03:28 Yay!!
15:03:39 We also have a Worklist for OSH 1.0, which is something we really needed: https://storyboard.openstack.org/#!/worklist/341
15:03:58 W00t
15:04:00 This list captures all* the outstanding items from the OSH 1.0 spec that have yet to be completed
15:04:14 \o/
15:04:17 * I haven't added the documentation updates needed -- that still needs to be added
15:04:25 So y'all are aware of the structure
15:04:34 There are a bunch of stories
15:04:40 stories have some number of tasks
15:04:46 the worklist pulls together the tasks
15:05:16 And if you start tagging your PS commit msg with "Task: XXXX" it will update task status automatically
15:05:22 Seems tasks are a one merge thing?
15:05:29 And if you tag your PS commit msg with "Story: XXXX" it will link back to the story from the PS
15:05:37 Yes - I think you are right portdirect
15:05:43 One task per PS
15:05:45 So we may need to create more tasks as we go for bigger items
15:05:54 Plus tasks for things other than PS when those come up
15:05:58 +++
15:06:19 There were a few tasks I put out there that I prefixed with "Spike: "
15:06:38 Which are explicit investigation tasks for creating "work" tasks, because I am lazy^H^H^H a great PTL
15:06:51 +1
15:07:11 I'll be honest - putting it all down in storyboard gave me a new appreciation for the volume of work standing between us and 1.0
15:07:16 There is no rocket science there
15:07:21 Just a lot of tractable things
15:07:41 So if anyone has been looking for something to sink their teeth into in the code (or soon, docs) -- WE NEED YOU!!!
15:07:44 :)
15:08:16 Please go grab a task here (assign it to yourself) and then we will help you succeed at it https://storyboard.openstack.org/#!/worklist/341
15:08:24 Cool - great job Matt
15:08:35 Thanks rwellum and thank you for your help getting us here too
15:08:49 That's all I got, anything else storyboard related guys?
15:09:07 o/
15:09:34 o/
15:09:36 #topic Discuss multi-node deployer feedback
15:09:40 GM folks!
15:10:20 rwellum - you've been going through some deployment activities using the multi-node guide
15:10:38 Regarding this, I spoke to a couple of people at the Summit - mainly from kolla-k8s, and we've all tried the multinode and all had failed - but in a good way - as in we learned but had more questions.
15:11:03 Main question was environment - the guide is heavily linked to the Gate scripts correct?
15:11:09 yup
15:11:19 simply as no one had had the time to write better docs
15:11:32 and so we knew we had something that worked
15:11:33 Yeah understood.
15:11:47 albeit in a highly opinionated way
15:11:59 what kinds of issues did you run into rwellum?
15:12:18 I'll reach out to them and I'll gather more concrete evidence and then ask for advice. I need to redeploy as well and I know Matt is doing it now too.
15:12:42 Mainly assumptions - missing packages for example.
15:12:50 this heat template, though horrible, may help: https://github.com/portdirect/openstack-helm-dev/blob/master/osh-cluster.yaml
15:13:07 i use it to deploy a multinode osh cluster 2-3 times a week
15:13:24 Will take a look for sure
15:13:43 I started going through the multi-node guide this weekend for the first time in a bit, and ran into a couple User Error issues, a couple minor doc updates I'll make, and an ingress thing that I will figure out next. For me though there was a fair amount of user error stemming from reusing a previously-used environment and not reading all the instructions 100% carefully :D
15:14:48 The "assumptions" part is the most likely and insidious thing
15:15:06 I think also there's a bit of ambiguousness around which is the master node, and which is a minion. At least that was me - as I tried to make the scripts work with my setup.
15:15:55 That is probably something that can be called out more explicitly - agree
15:16:02 I'll add that into my "minor updates" PS
15:16:13 Sounds good.
15:16:28 Would it help to have an etherpad to throw these things on as they come up?
15:16:53 Yeah for sure.
15:18:05 https://etherpad.openstack.org/p/openstack-helm-multinode-doc
15:18:20 Let's do it - I want to get the doc all trued up by next week
15:19:06 o/
15:19:12 Good morning.
15:19:12 hey roman_g
15:19:47 Anything else on the multinode guide today rwellum, or are we good to hash through the update process?
15:20:31 No that's fine thanks
15:20:55 Cool beans - you have the next one too
15:21:00 #topic Vitrage chart
15:21:17 How is Vitrage looking!
15:21:39 I have commitment from my team to start working on this June 20th. Ultimately I want to get team members on-board with osh through this effort.
15:21:53 stack
15:22:11 sry wrong window
15:22:46 Also I have commitment to run a POC for OSH - replacing TripleO - so quite a bit of OSH involvement hopefully.
15:22:51 Awesome rwellum - agree, and if there's anything we can do to help your team members get off the ground let us know
15:23:11 nice!
15:23:16 When we kick off I'll invite them to this meeting and do a round of intros if that's ok?
15:23:25 definitely.
15:23:31 cool then.
15:23:40 Hey sgrasley o/ no worries :)
15:24:00 That was it for me. Except I need to get multinode up and running to show them how easy it is to set up and use etc.
15:24:16 The proof is in the pudding
15:25:01 Let's get all issues into that etherpad stat so we can get things solved and/or clarified - I will do this as well
15:25:07 Or 'hardtack'
15:25:11 https://etherpad.openstack.org/p/openstack-helm-multinode-doc
15:25:16 +++
15:25:49 without logs/info it's very hard if not impossible to help
15:26:11 Ah - that's why we started to put the debug guide together right?
15:26:35 https://etherpad.openstack.org/p/openstack-helm-troubleshooting
15:26:43 Yeap they're kind of sister things
15:26:53 In the sense of -- some issues should result in doc updates
15:27:02 And some issues shouldn't, but people will run into them anyway
15:27:06 So our job is to feed back good info for you guys - so we don't waste your time with basic questions etc.
15:27:11 And we should get those into the troubleshooting guide
15:27:46 Agreed
15:28:00 just info would be good - theres a lot of discussion of difficulty - but no direction on what we should be looking at yet :)
15:28:09 As a rule of thumb, if someone runs into an issue it's either
15:28:09 1) They didn't read (guilty)
15:28:09 2) The doc wasn't clear enough
15:28:09 3) It's just something that happens, maybe with #1 or #2, that should be documented in the troubleshooting guide
15:28:24 So capturing all of them is a good start
15:28:35 4) they installed on a dirty env
15:28:43 so brought in a load of pre-existing issues
15:29:25 Alrighty moving on
15:29:27 alanmeadows
15:29:31 #topic Production vs Bare Bones values.yaml
15:29:39 We didn't get a chance to discuss last time
15:30:05 Paint us the picture here
15:31:06 sure
15:31:46 this was discussed a bit in the following PS: https://review.openstack.org/#/c/543012/
15:32:27 Historically, the ask has always been: let's try and have OSH run OOTB as it would in production, otherwise we sweep operator issues under the rug
15:32:48 challenging things such as, making ceph a first class citizen, enabling HA by default and so on
15:33:14 as the developer audience grows, and we have solved for many of these already, it makes sense to prune back this mandate a bit
15:34:20 values.yaml can instead be the bare bones necessary to stand an openstack/openstack-helm environment up, while we continue to test more multi-node/high availability options elsewhere
15:34:37 whether thats a "production.yaml" armada gate or whatever we decide
15:34:57 ++
15:35:07 Why? As time goes by and you solve many of the issues - surely it becomes easier to deploy as full production?
15:35:29 rwellum: good example of things like those raised in the above ps
15:35:45 @rwellum: the resources necessary for the deployment grow and grow if we continue to exercise full high availability as the `default`
15:35:59 limiting our potential developer pool
15:36:05 where 'production like' settings demand specific (and large) hardware configs
15:36:16 the benefit here is twofold
15:36:51 we slim down whats needed to deploy it OOTB, default values becomes easy to explain, and there is a path for deploying this in production that both has optimal settings and uses an optimal mechanism (armada)
15:37:09 (e.g. you shouldn't be `$ helm install` (x30) in production)
15:37:44 So by 'full production' you mean as portdirect is hinting - large scale?
15:37:55 Is it fair to say that we want to
15:37:55 1) enable the defaults to run on a wide variety of deployment types (e.g. laptop)
15:37:55 2) we want to enable as much prod-like / HA functionality as will run across that range by default
15:37:55 3) the remaining prod-specific overrides would be captured in an override yaml
15:38:04 rwellum: you're touching on the core issue
15:38:16 Production = opinion
15:38:25 the term is a little loose.. but how any sane person would generally want to deploy this thing into either a lab or production
15:38:27 alanmeadows: what about small-scale "production"? like dual-node?
15:38:36 and whenever you say that there is... opinion involved :)
15:39:14 I think what we want to avoid is the helm community spending lots of time keeping dozens of references `up to date` or having some of them languish
15:39:41 I think ideally we would maintain developer values and a `large scale` and `resilient` production manifest
15:39:55 anything in the middle would be an exercise to the user based on the production manifest
15:40:19 I think you don't want to debase your product by making it so simple to deploy that it's not recognizably a realistic OpenStack deployment.
I think Armada is amazing, but if you're telling people OSH is so complicated in production you need this tool - then I think it's not a great message.
15:41:00 We need to work on messaging it seems
15:41:26 it's like I showed my team OSH running on a VM, and they loved it but they immediately asked me about multi-node, real hardware, CEPH, SRIOV etc.
15:41:27 But what we are trying to say is the opposite
15:41:34 rwellum: its a fair point, but we've already received enough messaging at workshops to demonstrate there is need to educate people on at least how we envision this running and being *maintained* in production
15:41:53 questions such as "would we be running all of these bash commands in production" like confusion
15:42:01 Yep
15:42:13 I see..
15:42:15 why are we assuming that small scale, but HA-capable, setup would be too big for developers? I mean, one thing we could aim at is have one of the default deployments be HA-capable, small-scale (dev friendly), but at the same time also easily overridable (customizable) for large scales.
15:42:41 Yes well said piotrrr
15:42:56 I don't think we should strip prod-grade configuration out of the defaults until it becomes constrictive.
15:43:23 I.e. we want for the defaults to run on a laptop, basically.
A lot of HA functionality can run on a laptop, and that's exactly the dev environment I want
15:43:46 yup, +1
15:43:49 well keep in mind folks, to be HA/resilient/so on or not is a simple matter of overrides
15:43:54 nothing changes about the core product
15:44:20 and to be honest - we are pretty much at the base level that we are referring to
15:44:31 anywhere these are removed to streamline resource usage has to be gated somewhere though (again as the PS says)
15:44:42 this is more about adding functionality via over-rides and tooling
15:44:47 we cant remove them until something else is caring for them, so this conversation is really just starting a dialog about what that thing is
15:45:15 the last thing we want is to remove something like l3 ha being a default and then find out it doesnt work when someone turns it on
15:46:00 +++
15:46:07 Agree
15:46:16 and what is super ironic here - we turned that off in the gate anyway
15:46:26 so it was *never* tested - lol
15:46:28 Instead of thinking about it as "resilient overrides" vs "dumbed down defaults" I'd think of it as "generically resilient defaults" vs "any additional prod-only overrides"
15:47:09 sigh
15:47:14 lets take another perfect example that requires extra oomph to validate
15:47:22 multiple mariadb replicas...
15:47:30 something needs to validate this functions
15:47:39 but I certainly dont want to demand developers have mariadb(x3)
15:48:03 which means it shouldn't be OOTB
15:48:10 but it should be part of a production manifest we gate on
15:48:12 and even then - you may find some production deployments that dont want galera
15:48:26 +1
15:48:29 But mariadb 3x is nothing to do with 'production' is it?
15:48:37 exactly rwellum
15:48:38 It's more like a scale test
15:49:00 alan, fair point, but what is bad about having 3x mariadb in a dev environment? is it only about memory usage, or is it something else?
15:49:05 no - its opinion - we need to validate our chart works with 3x replication
15:49:22 but whether its required for production (i think it is) is up to the de
15:49:30 rwellum: I suppose different ways of looking at it, and perhaps this is where the opinion comes in -- mariadb (x3) is production for us and I think somewhat typical
15:49:43 (for galera users)
15:50:00 I see. Trying not to get too hung up on the word production here.
15:50:19 piotrrr that is an excellent point - and for lighterweight services than mariadb, I'd argue it may make sense to have x3 replicas on a single node IMO
15:50:23 piotrrr: I think the point is you take that example and multiply it by 30 other things like that, and you get a heavy deployment
15:50:35 or maybe your laptop is really big :)
15:51:03 Also I am unsure about mattmceuen point about most users want to run on a laptop - I think that's so limiting, and you're not trying to replace devstack right?
15:51:09 I suppose we could have a constrained gate that makes sure that the defaults work well with only e.g. 16GB RAM :)
15:51:22 mattmceuen: 8gb ram in gate
15:51:29 In other words the base package shouldn't be delimited by running on a laptop. imo.
15:51:59 My thought was more that it should be the settings that run "anywhere", and running on a laptop helps bring that into relief
15:52:06 as it doesn't get much more constrained than that
15:52:22 and it still is a very common use case, at least for folks doing OSH dev
15:52:29 Yeah but not sure I agree - don't pick the lowest common denominator to define your product.
15:52:40 there's not that many services in openstack which need to be configured in an extremely different way between non-HA vs HA deployments. e.g. nova-api doesn't care. you could have a OOTB lightweight HA-deployment with 3xmariadb and any number of nova-api-like-services (even 1).
15:53:37 so, services which require different config to run in HA - deploy them in HA by default.
Other services, those that easily scale in active-active manner, allow people to choose how many they want. Dev will choose "1". production will choose "3" or more.
15:54:39 I think whats great about OSH is values.yaml in each chart != product
15:54:39 TBF piotrrr alanmeadows is discussing a 'bare-bones' solution.
15:55:08 what we want is osh to work out of the box for everyone
15:55:31 and have a robust set of documentation, gating and examples for various deployments with it
15:55:41 no one in any kind of environment, whether dual-host HA or a lab, or production is going to run it without overriding it
15:55:50 +++
15:55:55 so we need to show them the way on how thats manageable
15:56:05 and for all others, who are OOTB, they must be developers
15:56:22 I think if we align on that, we may start seeing eye-to-eye
15:56:38 and a big part of 'production ready' is day2
15:56:39 developers or just kicking the tires, I should say
15:56:46 how do we manage this thing
15:56:52 osh makes openstack simpler
15:56:56 but its still openstack
15:57:11 no less than a million knobs to turn ;-)
15:57:12 and requires careful consideration of what you are deploying
15:57:47 Small, medium and large working out the box.
15:57:57 Examples on how to override anything
15:58:01 voila?
15:58:19 kinda rwellum
15:58:29 though we need to work out what small, med and large are
15:58:30 :)
15:58:33 what we were kicking around is a small works OOTB (remember there is only one box)
15:58:58 medium = you need to configure neutron..
15:59:11 large is cared for with an example and gated armada manifest, turning on all typical large knobs to show they work, while simultaneously serving as an example of how to stand it up with a single command, manage it day 2, and override anything
15:59:37 Speaking of boxes
15:59:40 timeboxes
15:59:47 out of time - thanks everyone :)
16:00:03 speaking of that...I've asked on the main channel a few times about how to configure the nova-compute chart to handle heterogeneous compute nodes, multiple sizes of hugepages, PCI devices, dedicated CPUs, etc. Can anyone point me at discussions/documentation on how this is supposed to be done?
16:00:06 Great discussion, we can keep it going in the OSH chat room but I think we're coalescing
16:00:10 #endmeeting
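
Post-meeting illustration (not part of the log): the "bare bones values.yaml plus production overrides" idea from the last topic can be sketched as a recursive merge of an override mapping over a defaults mapping, similar in spirit to how Helm layers values files. This is only a sketch under assumptions: the key names (`pod.replicas.server`, `network.l3_ha`) and the `deep_merge` helper are hypothetical and are not taken from any OSH chart.

```python
import copy


def deep_merge(defaults, overrides):
    """Return a new dict with `overrides` layered on top of `defaults`.

    Nested mappings are merged key by key; any other value (scalar or
    list) in `overrides` replaces the default outright.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical bare-bones defaults: one replica, no L3 HA (fits one box).
DEFAULTS = {
    "pod": {"replicas": {"server": 1}},
    "network": {"l3_ha": False},
}

# Hypothetical production overrides: only the knobs that differ.
PRODUCTION_OVERRIDES = {
    "pod": {"replicas": {"server": 3}},
    "network": {"l3_ha": True},
}

effective = deep_merge(DEFAULTS, PRODUCTION_OVERRIDES)
print(effective)
```

The point of the sketch is that the override file stays small: it lists only what a production deployment changes, and everything else falls through to the defaults, which are left unmodified.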