15:01:50 <portdirect> #startmeeting openstack-helm
15:01:51 <openstack> Meeting started Tue Oct  2 15:01:50 2018 UTC and is due to finish in 60 minutes.  The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:52 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:54 <openstack> The meeting name has been set to 'openstack_helm'
15:02:03 <portdirect> lets give it a few mins for people to roll in
15:02:09 <portdirect> #topic rollcall
15:02:14 <portdirect> o/
15:02:17 <srwilkers> o/
15:02:27 <lamt> \o
15:03:07 <roman_g> o/
15:03:59 <portdirect> agenda for today: https://etherpad.openstack.org/p/openstack-helm-meeting-2018-10-02, will give until 5 past and then kick off
15:05:26 <portdirect> oh hai gagehugo
15:05:37 <gagehugo> o/
15:05:40 <portdirect> ok - lets get going
15:05:49 <srwilkers> a wild gagehugo appears
15:06:00 <portdirect> #topic Libvirt restarts
15:06:30 <portdirect> so once again, we seem to have lost the ability to restart libvirt pods without stopping vms
15:07:23 <portdirect> as far as i can make out, the pid reaper of k8s is now (since 1.9) clever enough to kill child processes of pods, even when running in host pid mode
15:07:45 <portdirect> and uses cgroups to target the pids to reap
15:08:38 <srwilkers> ahh
15:08:44 <portdirect> i think the solution to this is to get ourselves out of the k8s managed cgroups entirely
15:09:00 <portdirect> and so have proposed the following: https://review.openstack.org/#/c/607072/2/libvirt/templates/bin/_libvirt.sh.tpl
15:09:33 <alanmeadows> I thought we used to run libvirt in hostIPC--I'm not sure if I am misremembering or we stopped--this seems to re-enable that
15:09:57 <portdirect> we never did - though i remember the same
15:10:11 <alanmeadows> shouldn't hostIPC effectively *be* the flag we need to tell k8s to stop mucking with this? Are we seeing behavior we don't expect?
15:10:12 <portdirect> i re-enabled it as part of this - as it makes sense to have
15:10:49 <portdirect> alanmeadows: its not - see the cgroups/pid reaper comment above
15:10:52 <alanmeadows> To be sure, there *should* be a k8s flag to effectively disable the cgroup, reaping, and other "helpers" - I thought hostPID and hostIPC were it
15:11:30 <portdirect> they no longer are - looking at the kubelet source, there's no way to disable this
15:11:51 <alanmeadows> this feels like a k8s gap
15:11:57 <portdirect> and for everyone but us - i think what it does is an improvement
15:12:04 <portdirect> no disagreement there
15:12:11 <alanmeadows> sure, just feel like there needs to be a "don't get smart" button
15:12:15 <portdirect> rkt stage 1 fly would offer this
15:12:21 <alanmeadows> libvirt is just one of several use cases
15:12:47 <portdirect> so - i think this to me suggests two things
15:13:03 <portdirect> 1) we need a fix to this NOW, is the above the right way to do this?
15:13:34 <portdirect> 2) lets use the fix we end up with, and get a bug opened with k8s to support "dumb" containers - just like the good 'ol days
15:15:01 <portdirect> the thought behind what im doing above is that we essentially run libvirt as a transient unit on the host
15:15:11 <srwilkers> the approach above seems acceptable to me, unless im missing something
15:15:39 <portdirect> so for pretty much the whole world - we get normal operation
15:16:05 <portdirect> the one thing being that we dont specify a name for the transient unit - so systemd assigns one
15:16:15 <portdirect> this allows the pod to be restarted
15:16:37 <portdirect> or even the chart to be removed, and qemu processes will be left running
15:17:27 <portdirect> and then when the pod/chart comes back - libvirt will start up in a new scope, but manage the qemus left in the old one just fine
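(For context: a minimal sketch of the transient-unit approach being described here. It is illustrative only, not the content of the review above, and assumes the pod runs with hostPID/hostIPC and can reach the host's systemd via a mounted /run.)

    #!/bin/bash
    set -ex
    # Launch libvirtd inside its own transient systemd scope so the kubelet's
    # cgroup-driven pid reaper never targets the qemu child processes.
    if [ -d /sys/fs/cgroup/systemd ]; then
        # No --unit name is given, so systemd generates one; a restarted pod gets
        # a fresh scope while qemu processes started under the old one keep running.
        exec systemd-run --scope --slice=system -- libvirtd --listen
    else
        exec libvirtd --listen
    fi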
15:17:36 <portdirect> seem sane?
15:18:21 <alanmeadows> we validated it can not only see them but can touch them?
15:18:21 <srwilkers> i think so
15:18:49 <portdirect> alanmeadows: yes
15:19:08 <portdirect> though i do still need to check when using the cgroupfs driver for docker/k8s
15:19:32 <portdirect> that this still works, and that also leads nicely into the next point
15:19:46 <alanmeadows> what are the interactions between this and the recommendation to disable the hugetlb cgroup in the boot parameters
15:19:56 <alanmeadows> are both still required?
15:20:04 <portdirect> no - this removes that requirement
15:21:09 <portdirect> we super need to gate this - once we have fixed this issue - I really want to get a lightweight gate in that just confirms that the libvirt chart can be deployed, start a vm, and then be removed and deployed again, with 0 impact on the running vm
15:21:20 <alanmeadows> last question
15:21:22 <portdirect> the end of this would probably be initiating a reboot
15:21:36 <portdirect> I dont think openstack would be required for this gate
15:21:37 <srwilkers> portdirect: yeah, was going to see if we could include that in the gate rework you're going to chat about later
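(For context: a rough sketch of what such a lightweight gate script could look like. The helper script, pod labels, and domain XML path are assumptions for illustration, not existing gate code.)

    #!/bin/bash
    set -ex
    # Deploy the libvirt chart, boot a guest, remove and redeploy the chart,
    # and assert the guest's qemu process never went away.
    helm upgrade --install libvirt ./libvirt --namespace=openstack
    ./tools/deployment/common/wait-for-pods.sh openstack
    POD=$(kubectl -n openstack get pods -l application=libvirt \
          -o jsonpath='{.items[0].metadata.name}')
    kubectl -n openstack exec "${POD}" -- virsh create /tmp/test-vm.xml
    QEMU_PID=$(pgrep -f 'qemu.*test-vm')
    helm delete libvirt --purge
    helm upgrade --install libvirt ./libvirt --namespace=openstack
    ./tools/deployment/common/wait-for-pods.sh openstack
    test "$(pgrep -f 'qemu.*test-vm')" = "${QEMU_PID}"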
15:21:41 <alanmeadows> if cgroup_disable=hugetlb is still leveraged, this doesn't care and operates fine?
15:21:48 <portdirect> yes
15:22:30 <portdirect> its why on l35 i work out the cgroups to manually use/override dynamically: https://review.openstack.org/#/c/607072/2/libvirt/templates/bin/_libvirt.sh.tpl
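(For context: an illustration of the dynamic cgroup discovery idea. Only controllers the host actually exposes get referenced, so a controller disabled on the kernel command line, e.g. cgroup_disable=hugetlb, is simply skipped. The controller list and group name are examples, not necessarily what the review uses.)

    CGROUPS=""
    for CGROUP in cpu rdma hugetlb; do
      if [ -d "/sys/fs/cgroup/${CGROUP}" ]; then
        CGROUPS+="${CGROUP},"
      fi
    done
    # cgcreate comes from libcgroup-tools; the group name is illustrative.
    cgcreate -g "${CGROUPS%,}:/osh-libvirt"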
15:24:05 <portdirect> we ok here? to leave any further convo to review?
15:24:29 <srwilkers> yeah, works for me
15:25:27 <portdirect> ok
15:25:29 <portdirect> #topic Calico V3
15:25:39 <portdirect> so i dont think anticw is here
15:25:58 <portdirect> but theres been a load of work done on updating our now long in the tooth calico chart
15:26:03 <portdirect> adding v3 support
15:27:03 <portdirect> https://review.openstack.org/#/c/607065/
15:27:09 <portdirect> please review away
15:27:54 <portdirect> I'm super excited about this - as it offers a ray of hope for the future, that we can get out of the quagmire of iptables rules from the kube-proxy and move to ipvs
15:27:59 <portdirect> but baby steps...
15:29:22 <portdirect> hey anticw - we were just talking about you
15:29:24 <srwilkers> cool.  will review this proper later today
15:29:40 <portdirect> anything you'd like to point out re the calico v3 work?
15:30:17 <anticw> it works
15:31:02 <anticw> there are some cosmetic changes done to try to stay aligned with upstream
15:31:20 <anticw> not all of those are required, but having them means a later upgrade should be easier
15:32:58 <portdirect> sounds great anticw
15:33:04 <portdirect> thx for your work on this
15:34:18 <anticw> np, the other cleanups people brought up i've put on a list and we can decide which of those are needed
15:34:34 <anticw> as you pointed out some of them run counter to a uniform interface to other CNIs
15:35:29 <portdirect> sure - from what i have seen the core is good and solid, and the only real discussion may be around some of the config entrypoints
15:35:38 <portdirect> but i think we can hash that out in review
15:37:04 <srwilkers> works for me
15:38:09 <portdirect> ok
15:38:12 <portdirect> #topic MariaDB
15:38:54 <portdirect> so I've got a wip up here: https://review.openstack.org/#/c/604556/ that i hope radically improves our galera clustering ability
15:39:08 <portdirect> ive been testing it reasonably hard
15:39:54 <portdirect> the biggest gaps atm that i'm aware of are the need to handle locks on configmaps better so we get ACID-like use out of them
15:40:04 <portdirect> and also get xtrabackup working again
15:40:34 <portdirect> thankfully both of these are relatively simple, though the configmap mutex may require a bit of time
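(For context: one possible shape for the configmap mutex, sketched with kubectl; the WIP may implement it differently. It relies on "kubectl create" being atomic, so whichever replica creates the lock object first holds the lock and the rest retry until it is deleted.)

    acquire_lock() {
      until kubectl -n openstack create configmap mariadb-state-lock \
          --from-literal=holder="$(hostname)" 2>/dev/null; do
        sleep 2
      done
    }
    release_lock() {
      kubectl -n openstack delete configmap mariadb-state-lock --ignore-not-found
    }
    acquire_lock
    # ...read-modify-write the shared cluster-state configmap here...
    release_lock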
15:41:05 <portdirect> would be great to get people to run this through its paces, and report back any shortcomings
15:41:31 <srwilkers> even if it does, i think this is a step in a better direction.  i've been playing with some of the changes for a bit now, and im pretty happy with it thus far
15:42:41 <portdirect> ok - so the last thing from me this week:
15:42:42 <portdirect> #topic Gate intervention
15:43:13 <portdirect> evardjp is planning on doing an extensive overhaul of the gates, and bringing some much needed sanity to them
15:43:26 <portdirect> though hes away this week - boo!
15:43:50 <portdirect> that said, theres an urgent need to get our gates in a slightly better state than they are today
15:44:24 <portdirect> so after this meeting im planning on refactoring some of them to get us to a point where things can merge without one million retries
15:44:47 <srwilkers> that'd be great
15:45:17 <portdirect> the main method to do this will be to cut out duplicate tests - and also potentially add an extra gate, so we can split the load
15:45:38 <srwilkers> not sure if it matters now, but do we want to consider moving some of the checks to experimental checks (where it makes sense), until we can get the larger overhaul started/completed?
15:45:42 <portdirect> as most failures seem to be the nodepool vms just being pushed harder than they can take
15:46:17 <portdirect> srwilkers: if by the end of day i've not made significant progress - i think that may be the short term bandage we need
15:46:29 <srwilkers> portdirect: yeah. i was playing around with some of the osh-infra gates just to see how things performed when the logging and monitoring charts were split into separate jobs
15:47:59 <portdirect> while on the subject of gates:
15:48:04 <portdirect> #topic Armada gate
15:48:12 <portdirect> srwilkers: you're up
15:48:27 <srwilkers> i've got a few changes pending for the armada gate in openstack-helm
15:49:11 <srwilkers> the first adds the Elasticsearch admin password to the nagios chart definition, as the current nagios chart supports querying elasticsearch for logged events
15:49:56 <srwilkers> the second adds radosgw to the lma manifest, along with the required overrides to take advantage of the s3 support for elasticsearch
15:50:47 <srwilkers> the third is more reactive, as it seems the rabbitmq helm tests fail sporadically in the armada gate.  that change proposes disabling them for the time being
15:51:24 <srwilkers> and the fourth is the most important in my mind.  it's the introduction of an ocata armada gate.  and the question becomes:  do we sunset the newton armada gate?
15:51:29 <portdirect> for rabbitmq - we prob dont need to run as many as we do in the upstream gates
15:51:59 <srwilkers> portdirect: probably not.  i can update that patchset to instead reduce us down to one rabbit deployment
15:52:07 <portdirect> ++
15:52:47 <portdirect> we got consensus at the ptg to sunset newton totally
15:53:07 <portdirect> and move the default to ocata
15:53:43 <srwilkers> thats why im leaning towards sunsetting the newton armada gate with the ocata armada patchset, along with avoiding adding another 5 node check to our runs
15:54:19 <portdirect> sounds good - though I think the 1st step would be to make ocata images the defaults in charts
15:54:25 <lamt> are we sunsetting newton for just the armada job or all the jobs?
15:54:31 <lamt> I volunteer to do that
15:54:36 <srwilkers> lamt: nice :)
15:54:41 <portdirect> lamt: if you could that would be awesome
15:55:08 <lamt> will start - those newton images start to pain me anyway
15:55:11 <portdirect> please add a loci newton gate though
15:55:20 <lamt> will do
15:55:24 <srwilkers> unrelated portdirect:  we can take my last point wrt the values spec offline, so we have time for open discussion
15:55:37 <portdirect> ok - sounds good
15:55:37 <srwilkers> we can handle that in the #openstack-helm channel
15:55:51 <portdirect> #topic open discussion / review needed
15:57:24 <srwilkers> crickets :)
15:57:44 <portdirect> ok - lets wrap up then
15:57:54 <portdirect> #endmeeting