#openstack-meeting-5 log

15:00:13 <mattmceuen> #startmeeting openstack-helm
15:00:13 <openstack> Meeting started Tue Jun 12 15:00:13 2018 UTC and is due to finish in 60 minutes.  The chair is mattmceuen. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:14 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:16 <openstack> The meeting name has been set to 'openstack_helm'
15:00:19 <mattmceuen> #topic rollcall
15:00:36 <rwellum> o/
15:00:45 <mattmceuen> Agenda for today:  https://etherpad.openstack.org/p/openstack-helm-meeting-2018-06-12
15:00:51 <gmmaha> o/
15:01:04 <piotrrr> o/
15:01:13 <mattmceuen> please go ahead and add anything you'd like to discuss.  We have a pretty full agenda (including some things that got pushed from last time)
15:01:32 <mattmceuen> GM all
15:01:35 <portdirect> O/
15:01:57 <mattmceuen> Alright -- first
15:02:02 <mattmceuen> #topic Storyboard
15:02:16 <mattmceuen> As you're all probably aware, we switched from Launchpad to Storyboard on Friday!
15:02:39 <mattmceuen> Here is our project group: https://storyboard.openstack.org/#!/project_group/64
15:02:55 <mattmceuen> From there you can click on different projects for osh, osh-infra, osh-addons
15:03:28 <rwellum> Yay!!
15:03:39 <mattmceuen> We also have a Worklist for OSH 1.0, which is something we really needed: https://storyboard.openstack.org/#!/worklist/341
15:03:58 <portdirect> W00t
15:04:00 <mattmceuen> This list captures all* the outstanding items from the OSH 1.0 spec that have yet to be completed
15:04:14 <gmmaha> \o/
15:04:17 <mattmceuen> * I haven't added the documentation updates needed -- that still needs to be added
15:04:25 <mattmceuen> So y'all are aware of the structure
15:04:34 <mattmceuen> There are a bunch of stories
15:04:40 <mattmceuen> stories have some number of tasks
15:04:46 <mattmceuen> the worklist pulls together the tasks
15:05:16 <mattmceuen> And if you start tagging your PS commit msg with "Task: XXXX" it will update task status automatically
15:05:22 <portdirect> Seems tasks are a one merge thing?
15:05:29 <mattmceuen> And if you tag your PS commit msg with "Story: XXXX" it will link back to the story from the PS
15:05:37 <mattmceuen> Yes - I think you are right portdirect
15:05:43 <mattmceuen> One task per PS
15:05:45 <portdirect> So we may need to create more tasks as we go for bigger items
15:05:54 <mattmceuen> Plus tasks for things other than PS when those come up
15:05:58 <mattmceuen> +++
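The commit message tagging described above ("Task: XXXX" updates the task status, "Story: XXXX" links the PS back to the story) can be sketched as a small parser. This is only an illustration of the footer convention mentioned in the meeting, not how Storyboard or Gerrit implements it; the function name `storyboard_refs`, the sample commit message, and the numbers 1111 and 2222 are hypothetical.

```python
import re

# Matches footer lines such as "Story: 1111" or "Task: 2222".
# Assumption: one reference per line, numeric IDs only.
TRAILER = re.compile(r"^(Story|Task):[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def storyboard_refs(commit_message: str) -> dict:
    """Collect the Story and Task numbers found in a commit message."""
    refs = {"Story": [], "Task": []}
    for kind, number in TRAILER.findall(commit_message):
        refs[kind].append(int(number))
    return refs


# Hypothetical commit message; the IDs are placeholders, not real stories.
EXAMPLE = """Clarify node roles in the multinode guide

Story: 1111
Task: 2222
"""

print(storyboard_refs(EXAMPLE))
```

Since tasks appear to be a one-merge thing, a PS would normally carry a single "Task:" line, and bigger items would be split into several tasks.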
15:06:19 <mattmceuen> There were a few tasks I put out there that I prefixed with "Spike: "
15:06:38 <mattmceuen> Which are explicit investigation tasks for creating "work" tasks, because I am lazy^H^H^H a great PTL
15:06:51 <rwellum> +1
15:07:11 <mattmceuen> I'll be honest - putting it all down in storyboard gave me a new appreciation for the volume of work standing between us and 1.0
15:07:16 <mattmceuen> There is no rocket science there
15:07:21 <mattmceuen> Just a lot of tractable things
15:07:41 <mattmceuen> So if anyone has been looking for something to sink their teeth into in the code (or soon, docs) -- WE NEED YOU!!!
15:07:44 <mattmceuen> :)
15:08:16 <mattmceuen> Please go grab a task here (assign it to yourself)  and then we will help you succeed  at it https://storyboard.openstack.org/#!/worklist/341
15:08:24 <rwellum> Cool - great job Matt
15:08:35 <mattmceuen> Thanks rwellum and thank you for your help getting us here too
15:08:49 <mattmceuen> That's all I got, anything else storyboard related guys?
15:09:07 <srwilkers> o/
15:09:34 <alanmeadows> o/
15:09:36 <mattmceuen> #topic Discuss multi-node deployer feedback
15:09:40 <mattmceuen> GM folks!
15:10:20 <mattmceuen> rwellum -  you've been going through some deployment activities using the multi-node guide
15:10:38 <rwellum> Regarding this, I spoke to a couple of people at the Summit - mainly from kolla-k8s, and we've all tried the multinode and all had failed - but in a good way - as in we learned but had more questions.
15:11:03 <rwellum> Main question was environment - the guide is heavily linked to the Gate scripts correct?
15:11:09 <portdirect> yup
15:11:19 <portdirect> simply as no one had had the time to write better docs
15:11:32 <portdirect> and so we knew we had something that worked
15:11:33 <rwellum> Yeah understood.
15:11:47 <portdirect> albeit in a highly opinionated way
15:11:59 <mattmceuen> what kinds of issues did you run into rwellum?
15:12:18 <rwellum> I'll reach out to them and I'll gather more concrete evidence and then ask for advice. I need to redeploy as well and I know Matt is doing it now too.
15:12:42 <rwellum> Mainly assumptions - missing packages for example.
15:12:50 <portdirect> this heat template, though horrible may help: https://github.com/portdirect/openstack-helm-dev/blob/master/osh-cluster.yaml
15:13:07 <portdirect> i use it to deploy a multinode osh cluster 2-3 times a week
15:13:24 <rwellum> Will take a look for sure
15:13:43 <mattmceuen> I started going through the multi-node guide this weekend for the first time in a bit, and ran into a couple User Error issues, a couple minor doc updates I'll make, and an ingress thing that I will figure out next.  For me though there was a fair amount of user error stemming from reusing a previously-used environment and not reading all the instructions 100% carefully :D
15:14:48 <mattmceuen> The "assumptions" part is the most likely and insidious thing
15:15:06 <rwellum> I think also there's a bit of ambiguousness around which is the master node, and which is a minion. At least that was me - as I tried to make the scripts work with my setup.
15:15:55 <mattmceuen> That is probably something that can be called out more explicitly - agree
15:16:02 <mattmceuen> I'll add that into my "minor updates" PS
15:16:13 <rwellum> Sounds good.
15:16:28 <mattmceuen> Would it help to have an etherpad to throw these things on as they come up?
15:16:53 <rwellum> Yeah for sure.
15:18:05 <mattmceuen> https://etherpad.openstack.org/p/openstack-helm-multinode-doc
15:18:20 <mattmceuen> Let's do it - I want to get the doc all trued up by next week
15:19:06 <roman_g> o/
15:19:12 <roman_g> Good morning.
15:19:12 <mattmceuen> hey roman_g
15:19:47 <mattmceuen> Anything else on the multinode guide today rwellum, or are we good to hash through the update process?
15:20:31 <rwellum> No that's fine thanks
15:20:55 <mattmceuen> Cool beans - you have the next one too
15:21:00 <mattmceuen> #topic Vitrage chart
15:21:17 <mattmceuen> How is Vitrage looking!
15:21:39 <rwellum> I have commitment from my team to start working on this June 20th. Ultimately I want to get team members on-board with osh through this effort.
15:21:53 <sgrasley> stack
15:22:11 <sgrasley> sry wrong window
15:22:46 <rwellum> Also I have commitment to run a POC for OSH - replacing TripleO - so quite a bit of OSH involvement hopefully.
15:22:51 <mattmceuen> Awesome rwellum - agree, and if there's anything we can do to help your team members get off the ground let us know
15:23:11 <mattmceuen> nice!
15:23:16 <rwellum> When we kick off I'll invite them to this meeting and do a round of intro's if that's ok?
15:23:25 <mattmceuen> definitely.
15:23:31 <rwellum> cool then.
15:23:40 <mattmceuen> Hey sgrasley o/ no worries :)
15:24:00 <rwellum> That was it for me. Except I need to get multinode up and running to show them how easy it is to set up and use etc.
15:24:16 <mattmceuen> The proof is in the pudding
15:25:01 <mattmceuen> Let's get all issues into that etherpad stat so we can get things solved and/or clarified - I will do this as well
15:25:07 <rwellum> Or 'hardtack'
15:25:11 <mattmceuen> https://etherpad.openstack.org/p/openstack-helm-multinode-doc
15:25:16 <portdirect> +++
15:25:49 <portdirect> without logs/info it's very hard if not impossible to help
15:26:11 <rwellum> Ah - that's why we started to put the debug guide together right?
15:26:35 <rwellum> https://etherpad.openstack.org/p/openstack-helm-troubleshooting
15:26:43 <mattmceuen> Yeap they're kind of sister things
15:26:53 <mattmceuen> In the sense of -- some issues should result in doc updates
15:27:02 <mattmceuen> And some issues shouldn't, but people will run into them anyway
15:27:06 <rwellum> So our job is to feed back good info for you guys - so we don't waste your time with basic questions etc.
15:27:11 <mattmceuen> And we should get those into the troubleshooting guide
15:27:46 <rwellum> Agreed
15:28:00 <portdirect> just info would be good - there's a lot of discussion of difficulty - but no direction on what we should be looking at yet :)
15:28:09 <mattmceuen> As a rule of thumb, if someone runs into an issue it's either
15:28:09 <mattmceuen> 1) They didn't read (guilty)
15:28:09 <mattmceuen> 2) The doc wasn't clear enough
15:28:09 <mattmceuen> 3) It's just something that happens, maybe with #1 or #2, that should be documented in the troubleshooting guide
15:28:24 <mattmceuen> So capturing all of them is a good start
15:28:35 <portdirect> 4) they installed on a dirty env
15:28:43 <portdirect> so brought in a load of pre-existing issues
15:29:25 <mattmceuen> Alrighty moving on
15:29:27 <mattmceuen> alanmeadows
15:29:31 <mattmceuen> #topic Production vs Bare Bones values.yaml
15:29:39 <mattmceuen> We didn't get a chance to discuss last time
15:30:05 <mattmceuen> Paint us the picture here
15:31:06 <alanmeadows> sure
15:31:46 <alanmeadows> this was discussed a bit in the following PS: https://review.openstack.org/#/c/543012/
15:32:27 <alanmeadows> Historically, the ask has always been: let's try and have OSH run OOTB as it would in production, otherwise we sweep operator issues under the rug
15:32:48 <alanmeadows> challenging things such as, making ceph a first class citizen, enabling HA by default and so on
15:33:14 <alanmeadows> as the developer audience grows, and we have solved for many of these already, it makes sense to prune back this mandate a bit
15:34:20 <alanmeadows> values.yaml can instead be the bare bones necessary to stand an openstack/openstack-helm environment up, while we continue to test more multi-node/high availability options elsewhere
15:34:37 <alanmeadows> whether that's a "production.yaml" armada gate or whatever we decide
15:34:57 <portdirect> ++
15:35:07 <rwellum> Why? As time goes by and you solve many of the issues - surely it becomes easier to deploy as full production?
15:35:29 <portdirect> rwellum: good example is things like those raised in the above PS
15:35:45 <alanmeadows> @rwellum: the resources necessary for the deployment grow and grow if we continue to exercise full high availability as the `default`
15:35:59 <alanmeadows> limiting our potential developer pool
15:36:05 <portdirect> where 'production like' settings demand specific (and large) hardware configs
15:36:16 <alanmeadows> the benefit here is two fold
15:36:51 <alanmeadows> we slim down what's needed to deploy it OOTB, default values become easy to explain, and there is a path for deploying this in production that both has optimal settings and uses an optimal mechanism (armada)
15:37:09 <alanmeadows> (e.g. you shouldn't be `$ helm install` (x30) in production)
15:37:44 <rwellum> So by 'full production' you mean as portdirect is hinting - large scale?
15:37:55 <mattmceuen> Is it fair to say that we want to
15:37:55 <mattmceuen> 1) enable the defaults to run on a wide variety of deployment types (e.g. laptop)
15:37:55 <mattmceuen> 2) we want to enable as much prod-like / HA functionality as will run across that range by default
15:37:55 <mattmceuen> 3) the remaining prod-specific overrides would be captured in an override yaml
15:38:04 <portdirect> rwellum: you're touching on the core issue
15:38:16 <portdirect> Production = opinion
15:38:25 <alanmeadows> the term is a little loose.. but how any sane person would generally want to deploy this thing into either a lab or production
15:38:27 <cfriesen> alanmeadows: what about small-scale "production"?  like dual-node?
15:38:36 <alanmeadows> and whenever you say that there is... opinion involved :)
15:39:14 <alanmeadows> I think what we want to avoid is the helm community spending lots of time keeping dozens of references `up to date` or having some of them languish
15:39:41 <alanmeadows> I think ideally we would maintain developer values and a `large scale` and `resilient` production manifest
15:39:55 <alanmeadows> anything in the middle would be an exercise to the user based on the production manifest
15:40:19 <rwellum> I think you don't want to debase your product by making it so simple to deploy that it's not recognizably a realistic OpenStack deployment. I think Armada is amazing, but if you're telling people OSH is so complicated in production you need this tool - then I think it's not a great message.
15:41:00 <portdirect> We need to work on messaging it seems
15:41:26 <rwellum> it's like I showed my team OSH running on a VM, and they loved it but they immediately asked me about multi-node, real hardware, CEPH, SRIOV etc.
15:41:27 <portdirect> But what we are trying to say is the opposite
15:41:34 <alanmeadows> rwellum: its a fair point, but we've already received enough messaging at workshops to demonstrate there is need to educate people on at least how we envision this running and being *maintained* in production
15:41:53 <alanmeadows> questions such as "would we be running all of these bash commands in production" like confusion
15:42:01 <mattmceuen> Yep
15:42:13 <rwellum> I see..
15:42:15 <piotrrr> why are we assuming that a small scale, but HA-capable, setup would be too big for developers? I mean, one thing we could aim at is to have one of the default deployments be HA-capable, small-scale (dev friendly), but at the same time also easily overridable (customizable) for large scales.
15:42:41 <rwellum> Yes well said piotrrr
15:42:56 <mattmceuen> I don't think we should strip out prod-grade configuration out of the defaults until it becomes constrictive.
15:43:23 <mattmceuen> I.e. we want for the defaults to run on a laptop, basically.  A lot of HA functionality can run on a laptop, and that's exactly the dev environment I want
15:43:46 <piotrrr> yup, +1
15:43:49 <alanmeadows> well keep in mind folks, to be HA/resilient/so on or not is a simple matter of overrides
15:43:54 <alanmeadows> nothing changes about the core product
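The point that being HA or not is "a simple matter of overrides" can be sketched as follows. Helm merges user-supplied values over a chart's default values.yaml map by map, so an override file only needs to name the keys it changes. The snippet below imitates that behaviour in plain Python; it is not Helm's or Armada's code, and the key names (`pod.replicas.server`, `storage.size`) and the numbers are hypothetical placeholders rather than values taken from any OSH chart.

```python
import copy


def merge_values(defaults: dict, overrides: dict) -> dict:
    """Recursively merge overrides into defaults without mutating either.

    Nested dicts are merged key by key; any other value (scalar or list)
    in the overrides replaces the default outright.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_values(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical bare-bones defaults, as a chart's values.yaml might hold them.
DEFAULTS = {"pod": {"replicas": {"server": 1}}, "storage": {"size": "5Gi"}}

# Hypothetical production override: only the knob that differs is listed.
PRODUCTION = {"pod": {"replicas": {"server": 3}}}

effective = merge_values(DEFAULTS, PRODUCTION)
print(effective)
```

In this sketch the chart itself is identical in both cases; only the small override map differs, which is the sense in which nothing changes about the core product.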
15:44:20 <portdirect> and to be honest - we are pretty much at the base level that we are referring to
15:44:31 <alanmeadows> anywhere these are removed to stream line resource usage has to be gated somewhere though (again as the PS says)
15:44:42 <portdirect> this is more about adding functionality via over-rides and tooling
15:44:47 <alanmeadows> we can't remove them until something else is caring for them, so this conversation is really just starting a dialog about what that thing is
15:45:15 <alanmeadows> the last thing we want is to remove something like l3 ha being a default and then find out it doesn't work when someone turns it on
15:46:00 <portdirect> +++
15:46:07 <mattmceuen> Agree
15:46:16 <portdirect> and what is super ironic here - we turned that off in the gate anyway
15:46:26 <portdirect> so it was *never* tested  - lol
15:46:28 <mattmceuen> Instead of thinking about it as "resilient overrides" vs "dumbed down defaults"  I'd think of it as "generically resilient defaults" vs "any additional prod-only overrides"
15:47:09 <mattmceuen> sigh
15:47:14 <alanmeadows> let's take another perfect example that requires extra oomph to validate
15:47:22 <alanmeadows> multiple mariadb replicas...
15:47:30 <alanmeadows> something needs to validate this functions
15:47:39 <alanmeadows> but I certainly dont want to demand developers have mariadb(x3)
15:48:03 <alanmeadows> which means it shouldn't be OOTB
15:48:10 <alanmeadows> but it should be part of a production manifest we gate on
15:48:12 <portdirect> and even then - you may find some production deployments that dont want galera
15:48:26 <mattmceuen> +1
15:48:29 <rwellum> But mariadb 3x is nothing to do with 'production' is it?
15:48:37 <portdirect> exactly rwellum
15:48:38 <rwellum> It's more like a scale test
15:49:00 <piotrrr> alan, fair point, but what is bad about having 3x mariadb in a dev environment? is it only about memory usage, or is it something else?
15:49:05 <portdirect> no  - its opinion - we need to validate our chart works with 3x replication
15:49:22 <portdirect> but whether it's required for production (i think it is) is up to the de
15:49:30 <alanmeadows> rwellum: I suppose different ways of looking at it, and perhaps this is where the opinion comes in -- mariadb (x3) is production for us and I think somewhat typical
15:49:43 <alanmeadows> (for galera users)
15:50:00 <rwellum> I see. Trying not to get too hung up on the word production here.
15:50:19 <mattmceuen> piotrrr that is an excellent point - and for lighterweight services than mariadb, I'd argue it may make sense to have x3 replicas on a single node IMO
15:50:23 <alanmeadows> piotrrr: I think the point is you take that example and multiply it by 30 other things like that, and you get a heavy deployment
15:50:35 <alanmeadows> or maybe your laptop is really big :)
15:51:03 <rwellum> Also I am unsure about mattmceuen's point about most users wanting to run on a laptop - I think that's so limiting, and you're not trying to replace devstack right?
15:51:09 <mattmceuen> I suppose we could have a constrained gate that makes sure that the defaults work well with only e.g. 16GB RAM :)
15:51:22 <portdirect> mattmceuen: 8gb ram in gate
15:51:29 <rwellum> In other words the base package shouldn't be delimited by running on a laptop. imo.
15:51:59 <mattmceuen> My thought was more that it should be the settings that run "anywhere", and running on a laptop helps bring that into relief
15:52:06 <mattmceuen> as it doesn't get much more constrained than that
15:52:22 <mattmceuen> and it still is a very common use case, at least for folks doing OSH dev
15:52:29 <rwellum> Yeah but not sure I agree - don't pick the lowest common denominator to define your product.
15:52:40 <piotrrr> there's not that many services in openstack which need to be configured in an extremely different way between non-HA vs HA deployments. e.g. nova-api doesn't care. you could have an OOTB lightweight HA-deployment with 3xmariadb and any number of nova-api-like-services (even 1).
15:53:37 <piotrrr> so, services which require different config to run in HA- deploy them in HA by default. Other services, those that easily scale in active-active manner, allow people to choose how many they want. Dev will choose "1". production will choose "3" or more.
15:54:39 <alanmeadows> I think whats great about OSH is values.yaml in each chart != product
15:54:39 <rwellum> TBF piotrrr alanmeadows is discussing a 'bare-bones' solution.
15:55:08 <portdirect> what we want is osh to work out of the box for everyone
15:55:31 <portdirect> and have a robust set of documentation, gating and examples for various deployments with it
15:55:41 <alanmeadows> no one in any kind of environment, whether dual-host HA or a lab, or production is going to run it without overriding it
15:55:50 <portdirect> +++
15:55:55 <alanmeadows> so we need to show them the way on how thats manageable
15:56:05 <alanmeadows> and for all others, who are OOTB, they must be developers
15:56:22 <alanmeadows> I think if we align on that, we may start seeing eye-to-eye
15:56:38 <portdirect> and a big part of 'production ready' is day2
15:56:39 <alanmeadows> developers or just kicking the tires, I should say
15:56:46 <portdirect> how do we manage this thing
15:56:52 <portdirect> osh makes openstack simpler
15:56:56 <portdirect> but its still openstack
15:57:11 <alanmeadows> no less than a million knobs to turn ;-)
15:57:12 <portdirect> and requires careful consideration of what you are deploying
15:57:47 <rwellum> Small, medium and large working out the box.
15:57:57 <rwellum> Examples on how to override anything
15:58:01 <rwellum> voila?
15:58:19 <portdirect> kinda rwellum
15:58:29 <portdirect> though we need to work out what small, med and large are
15:58:30 <portdirect> :)
15:58:33 <alanmeadows> what we were kicking around is a small works OOTB (remember there is only one box)
15:58:58 <portdirect> medium = you need to configure neutron..
15:59:11 <alanmeadows> large is cared for with an example and gated armada manifest, turning on all typical large knobs to show they work, while simultaneously serving as an example of how to stand it up with a single command, manage it day 2, and override anything
15:59:37 <mattmceuen> Speaking of boxes
15:59:40 <mattmceuen> timeboxes
15:59:47 <mattmceuen> out of time - thanks everyone :)
16:00:03 <cfriesen> speaking of that...I've asked on the main channel a few times about how to configure the nova-compute chart to handle heterogeneous compute nodes, multiple sizes of hugepages, PCI devices, dedicated CPUs, etc.  Can anyone point me at discussions/documentation on how this is supposed to be done?
16:00:06 <mattmceuen> Great discussion, we can keep it going in the OSH chat room but I think we're coalescing
16:00:10 <mattmceuen> #endmeeting