15:00:54 <portdirect> #startmeeting openstack-helm
15:00:55 <openstack> Meeting started Tue Apr 16 15:00:54 2019 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 <openstack> The meeting name has been set to 'openstack_helm'
15:01:07 <portdirect> let's give it a few mins for people to turn up
15:01:12 <portdirect> #topic rollcall
15:01:28 <portdirect> the agenda is here: https://etherpad.openstack.org/p/openstack-helm-meeting-2019-04-16
15:01:36 <itxaka> o/
15:01:42 <portdirect> please feel free to add to it, and we'll kick off at 5 past
15:01:53 <srwilkers> o/
15:02:02 <jsuchome> o/
15:03:34 <alanmeadows> o/
15:04:02 <gagehugo> o/
15:04:37 <megheisler> o/
15:05:14 <portdirect> ok - let's go
15:05:23 <portdirect> #topic Zuul testing doesn't recover job logs
15:05:32 <portdirect> itxaka the floor is yours
15:05:42 <itxaka> yay
15:05:58 <itxaka> so while testing tempest patches I found I was unable to query the logs
15:06:15 <itxaka> as the ttl for the job/pod had expired, those logs were lost forever and ever and ever (not really)
15:06:40 <itxaka> was wondering if there was anything already in place to recover that kind of logs, as they seem to be valuable
15:07:00 <portdirect> we've hit this sort of thing before, esp with pods that crash
15:07:08 <itxaka> but I'm reading srwilkers' response in there, so it seems like there is not and we would need to deploy something ourselves for that
15:07:08 <srwilkers> yep
15:07:18 <portdirect> srwilkers: you had any thoughts here, as i know you've noodled on it before?
15:07:32 <srwilkers> i think it requires two parts
15:08:00 <srwilkers> one: we really need to be able to override backoffLimits for the jobs in our charts
15:08:04 <prabhusesha_> This is Prabhu
15:08:19 <srwilkers> some jobs have this, others don't
15:09:14 <srwilkers> also, we need to revisit having our zuul post-run jobs include `kubectl logs -p foo -n bar` for cases where we've got pods that are in a crashloop state, or a state where they fail once or twice before being happy
15:09:30 <srwilkers> as insight into both of those types of situations is valuable
15:09:49 <itxaka> +1
15:10:05 <srwilkers> but without the ability to override the backoffLimits for jobs, that doesn't do us much good if kubernetes just deletes the pod and doesn't have a pod for us to query for logs
15:10:08 <srwilkers> previous or otherwise
15:10:22 <portdirect> the above would be great as a 1st step
15:11:02 <portdirect> i know we've also discussed 'lma-lite' here - would there be value in looking at that as well?
15:11:54 <itxaka> isn't there a "time to keep" for jobs that we could leverage as well, to leave those failed jobs around?
15:11:58 <srwilkers> i think that's sane down the road, but it would also require us to deploy the logging stack with every job
15:13:19 <itxaka> ttlSecondsAfterFinished is what I was thinking about, but that is only for jobs, not sure how it would affect other resources
15:14:04 <portdirect> would it perhaps make sense to also think about modifying the `wait-for-pods.sh` to detect a crash
15:14:12 <portdirect> and then instantly fail if it does?
15:14:45 <portdirect> as really we should have 0 crashes on deployment in the gates, and this would also allow us to capture the logs before they go away
15:14:49 <jsuchome> is it crashing right after deployment?
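A rough sketch of the two ideas floated above - capturing `--previous` logs and failing fast on a crashloop - assuming kubectl access to the gate cluster. The namespace handling and the status check are illustrative only, not the actual wait-for-pods.sh or zuul post-run change discussed here:

```bash
#!/bin/bash
# Illustrative sketch only - not the merged OSH scripts. Assumes kubectl can
# reach the gate cluster; the namespace is taken from the first argument.
set -u
NAMESPACE="${1:-openstack}"

# Find pods whose STATUS column reports CrashLoopBackOff
crashed=$(kubectl get pods -n "${NAMESPACE}" --no-headers \
  | awk '$3 == "CrashLoopBackOff" {print $1}')

if [ -n "${crashed}" ]; then
  for pod in ${crashed}; do
    echo "Pod ${pod} is crashlooping - capturing previous container logs"
    # --previous fails if there is no prior container instance, so tolerate it
    kubectl logs "${pod}" -n "${NAMESPACE}" --previous || true
  done
  # Fail fast rather than letting the deployment run to its timeout
  exit 1
fi
```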
15:15:00 <itxaka> wouldn't that cause false positives when something goes temporarily wrong (i.e. quay.io down for a few seconds)?
15:15:56 <portdirect> that would not cause a crashloop itxaka
15:16:42 <srwilkers> i'd say we could add that and see if anything falls out initially
15:16:45 <srwilkers> it seems like a sane idea
15:16:50 <itxaka> really? Because I've seen it locally, crashloopbackoff due to not being able to pull the image
15:17:04 <portdirect> that doesn't sound right itxaka
15:17:16 <portdirect> you should see 'errimagepull'
15:18:07 <srwilkers> errimagepull or imagepullbackoff are the only two i've seen related to pulling images -- if the pod's in a crashloop, it's indicative of an issue after the entrypoint's been executed
15:18:18 <srwilkers> as far as i'm aware, anyway
15:18:32 <itxaka> umm, I might be wrong there
15:18:43 <itxaka> unfortunately my env is in a pristine state so I cannot check now :)
15:18:58 <portdirect> let's get a ps up for both - and we can see what the fallout is
15:19:16 <portdirect> srwilkers: did you have one for `logs -p` already? I can't remember?
15:19:25 <srwilkers> yeah -- it's pretty dusty at this point
15:19:47 <srwilkers> https://review.openstack.org/#/c/603229/4
15:20:39 <portdirect> the only change required there is just checking to see if there is in fact a previous pod?
15:21:09 <portdirect> or to jerry-rig it, just allow failure of --previous?
15:21:09 <srwilkers> i think if there's not, it just returns nothing
15:21:18 <itxaka> nice!
15:21:28 <jsuchome> this looks good actually
15:21:39 <srwilkers> i lied
15:21:44 <srwilkers> it'll require an update
15:22:37 <portdirect> with that - let's see how things go
15:22:40 <portdirect> ok to move on?
15:22:51 <srwilkers> although with us setting ignore_errors: true there, it might not be an issue
15:22:53 <srwilkers> yep
15:23:11 <portdirect> ok
15:23:15 <portdirect> #topic Split 140-compute-kit.sh into nova+neutron
15:23:26 <itxaka> me again
15:23:45 <itxaka> we were just wondering about the reasons for having both nova+neutron together in the same script
15:23:56 <itxaka> and were wondering if there was a willingness to change it?
15:24:13 <portdirect> totally - the reason for them being together is just historical
15:24:18 <jsuchome> We actually found out that with this compute kit, wait-for-pods can time out quite often
15:24:24 <portdirect> as the two have circular deps
15:24:35 <srwilkers> well, even if you split them out you may run into timeout issues
15:24:37 <jsuchome> so the idea was not just to split the deployment script, but also have 2 calls of wait-for-pods
15:24:37 <portdirect> jsuchome: how many nodes are you attempting to deploy with these scripts?
15:24:45 <srwilkers> jsuchome: that wouldn't work
15:24:49 <srwilkers> as they have dependencies on each other
15:25:01 <jsuchome> ah, that's what we feared
15:25:03 <srwilkers> which is why we initially tied them to the same deployment script
15:25:17 <itxaka> would it be possible to make them less dependent on each other?
15:25:19 <portdirect> nova compute depends on neutron metadata...
15:25:30 <srwilkers> itxaka: not that i'm aware of
15:25:32 <jsuchome> not that many (3 actually), but we're running kubernetes on openstack already
15:25:37 <itxaka> oh well
15:25:49 <jsuchome> well, then just neutron comes first, right?
15:26:12 <portdirect> no - as the neutron metadata proxy depends on.... nova-metadata-api...
15:26:21 <portdirect> and the circle of joy continues :D
15:26:41 <srwilkers> its the circcccclllee, the circle of life
15:26:44 <jamesgu__> would separating nova compute out of the Nova chart help?
15:27:02 <portdirect> at the cost of huge complexity
15:27:26 <portdirect> why not just deploy, and then wait after?
15:27:51 <itxaka> yeah, seems like we just need to fix the probes to be faster at detecting readiness and increase our timeouts :)
15:28:12 <jsuchome> increase our timeouts, the universal solution for everything...
15:28:23 <jamesgu__> it works... :-)
15:28:33 <itxaka> the simplest fix to a problem is just to wait a bit more for it to solve itself
15:28:49 <itxaka> and RUN as far away as possible :P
15:28:55 <portdirect> why is it timing out for you? image pulls or something else?
15:29:24 <srwilkers> itxaka: you've got the right idea ;)
15:29:39 <jsuchome> we did not do detailed profiling, it just happened several times
15:29:54 <itxaka> I guess a bit of everything? image pulls, insufficient IO, maybe not enough workers, probes take a bit of time, etc...
15:30:03 <itxaka> it's usually pretty close to the timeout so...
15:30:10 <jsuchome> yeah, the database backend also
15:30:30 <itxaka> ok, seems good to me, not much from my side on this topic
15:30:30 <portdirect> nova-db-sync is certainly a monster on a clean deployment
15:30:50 <portdirect> ok to move on, and revisit this later if we need to?
15:30:55 <itxaka> +1
15:30:57 <jsuchome> +1
15:31:07 <srwilkers> yep
15:31:10 <portdirect> #topic: Differences in deployment scripts (component vs multinode)
15:31:21 <itxaka> omg, itxaka again? shut up already
15:31:26 <srwilkers> itxaka: XD
15:31:34 <itxaka> tbh I don't remember why I wrote this down
15:31:38 <itxaka> jsuchome, do you?
15:31:39 <portdirect> lol - you're bringing a good set of topics dude
15:31:51 <srwilkers> i mean, i think we're overdue for cleaning up our tools/deployment/foo
15:31:51 <itxaka> consolidate ?
15:31:52 <jsuchome> probably came out of that compute-kit discussion
15:32:01 <srwilkers> because to be frank, it's getting messy. and part of that's my fault
15:32:26 <portdirect> we probably could do with a re-think, we certainly need to split out the over-rides
15:32:41 <srwilkers> yep
15:32:48 <jsuchome> (overrides coming up next ... :-) )
15:32:56 <portdirect> but do the scripts that drive them need to be split, or could they just call the appropriate over-rides dependent on an env var?
15:33:11 <srwilkers> i think the latter is the more sane one there
15:33:52 <portdirect> so we'd have things like ./tools/deployment/over-rides/[single|multi]/<service>.yaml
15:34:04 <itxaka> that would be cool
15:34:08 <srwilkers> ++
15:34:10 <portdirect> called by ./tools/deployment/scripts/<service>.sh ?
15:34:34 <itxaka> will also make it much clearer what we are using where
15:34:49 <srwilkers> not to get too far off track here, but i'd like to see something similar for the armada deployment foo too
15:35:37 <portdirect> srwilkers: perhaps pegleg would help there? though let's keep this to the scripted deployment for now
15:36:43 <srwilkers> portdirect: sounds good
15:38:16 <itxaka> Can someone write the armada/pegleg stuff in the minutes? I have no idea what they do so I cannot really explain it properly in there
15:38:27 <itxaka> oh, I see portdirect is already on it, thanks!
15:38:48 <itxaka> maybe @jamesgu__ can help on that, gotta have some knowledge of those tools :)
15:39:34 <portdirect> ok - does someone want to give it a (and excuse the pun) bash at refactoring here?
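A minimal sketch of the per-service script + environment-selected override layout proposed above. The MODE variable, the over-rides/ tree, and the keystone example are assumptions for illustration, not an agreed structure:

```bash
#!/bin/bash
# Hypothetical sketch of the proposed split - paths and names are illustrative.
set -xe

MODE="${MODE:-single}"   # e.g. "single" or "multi", chosen by the gate job
SERVICE="keystone"

# One script per service, pulling the override file for the environment
helm upgrade --install ${SERVICE} ./${SERVICE} \
  --namespace=openstack \
  --values=./tools/deployment/over-rides/${MODE}/${SERVICE}.yaml

# Re-use the existing wait-for-pods.sh helper to let the release settle
./tools/deployment/common/wait-for-pods.sh openstack
```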
15:40:21 <srwilkers> if nobody else wants to, i can give it a go
15:40:33 <srwilkers> but it might be a good opportunity for someone else to get more familiar with our jobs
15:40:46 <itxaka> mhhh I would like to, but not sure if I'm the best at bash thingies, if only evrardjp was here...
15:40:52 <srwilkers> as that's where i've been living more often than not lately
15:41:06 <itxaka> I can have a look at it, maybe bother srwilkers with patches and guidance
15:41:16 <portdirect> srwilkers - let's work together on getting a poc up - and see if anyone wants to take it from there?
15:41:30 <srwilkers> sounds good to me
15:41:39 <srwilkers> either of those sounds good :)
15:41:44 <jsuchome> I could take a look at it as well, bash is nice ;-)
15:41:55 <portdirect> lol - everyone wants in
15:41:57 <itxaka> poc + me/jsuchome taking it from there sounds good to me
15:42:09 <itxaka> unless jsuchome wants to make the poc himself :P
15:42:11 <srwilkers> nice :)
15:42:30 <jsuchome> ah, no thanks, I'll leave it to the elders :-)
15:42:30 <portdirect> let's go for the poc and itxaka/jsuchome if that works?
15:42:44 <jsuchome> +1
15:42:47 <portdirect> jsuchome: just 'cause our minds are rotted, don't mean we are old ;P
15:42:59 <itxaka> +1
15:43:08 <portdirect> ok - let's move on
15:43:18 <portdirect> #topic Tungsten Fabric Review
15:43:25 <itxaka> that's not me, yay :P
15:43:32 <portdirect> prabhusesha, the floor is yours :)
15:43:39 <prabhusesha_> Hi
15:43:56 <prabhusesha_> I want somebody to pick up the review and provide us a set of comments
15:44:22 <prabhusesha_> I meant, a first set of comments
15:44:37 <portdirect> could you link to it?
15:44:42 <prabhusesha_> I'm continuing the work that madhukar was doing
15:44:55 <prabhusesha_> https://review.openstack.org/#/c/622573/2
15:45:10 <portdirect> prabhusesha_: it's fantastic to see this restarted
15:45:52 <prabhusesha_> I want to get this thing merged soon. I need all of your support
15:46:01 <prabhusesha_> I'm also kind of new to helm
15:46:14 <prabhusesha_> but I'm picking it up pretty fast
15:46:29 <portdirect> are any specific images required for this?
15:47:00 <prabhusesha_> you need the neutron, heat & nova plugin images
15:47:08 <prabhusesha_> from TF
15:47:27 <prabhusesha_> I'm drafting a blueprint
15:47:35 <portdirect> am i just missing them? as i don't see them here: https://review.openstack.org/#/c/622573/2/tools/overrides/backends/networking/tungstenfabric/nova-ocata.yaml
15:47:42 <portdirect> also how do you deploy TF itself?
15:48:25 <prabhusesha_> there are TF specific charts
15:48:46 <portdirect> are they public?
15:49:13 <prabhusesha_> currently it's private, work is happening on that front also
15:49:34 <portdirect> ok - i think that will largely block the efforts here
15:49:44 <portdirect> we can provide some high level feedback
15:49:54 <portdirect> but without public, and open images
15:49:57 <itxaka> would we need to build extra images with modifications to support this?
15:49:57 <portdirect> as well as charts
15:49:59 <prabhusesha_> that will be good
15:50:06 <itxaka> maybe that falls under the loci meeting instead :)
15:50:08 <portdirect> we can't do any more than that
15:50:21 <prabhusesha_> let me get my stuff in the right place
15:50:28 <portdirect> and i'd be uncomfortable merging, until that is resolved
15:50:35 <portdirect> sounds good prabhusesha_
15:50:38 <prabhusesha_> I agree
15:50:55 <portdirect> and as itxaka points out, it would be great to work with loci for image building
15:51:04 <prabhusesha_> high level comments will be helpful
15:51:33 <prabhusesha_> itxaka: I can get back to you on that
15:51:39 <itxaka> +1
15:51:55 <portdirect> ok - let's move on
15:52:01 <portdirect> #topic: Office-Hours
15:52:32 <portdirect> so last week we decided to kick-start the office hours effort again
15:52:56 <portdirect> initially these will be from 20:00-21:00 UTC on Wednesdays in the openstack-helm channel
15:53:39 <portdirect> the above time is simply because it's when we can make sure that there is core-reviewer attendance
15:53:50 <portdirect> but i know it's rubbish for folks in the EU
15:54:14 <portdirect> i hope we can change that as soon as we can get some coverage for that timezone :)
15:54:54 <portdirect> that's all i got here really. ok to move on?
15:55:02 <itxaka> +1
15:55:16 <portdirect> #topic: Add internal tenant id in conf
15:55:25 <LiangFang> hi
15:55:44 <LiangFang> portdirect: thanks for giving comments on my review
15:55:49 <portdirect> LiangFang: your ps looks great - I've got one comment i need to add in gerrit, but at that point lgtm
15:56:23 <portdirect> https://review.openstack.org/#/c/647493/
15:57:14 <LiangFang> thanks, one thing is that my environment is broken, so I have not verified it in my environment
15:57:29 <LiangFang> I don't know if CI is strong enough
15:57:45 <portdirect> looks ok in ci: http://logs.openstack.org/93/647493/7/check/openstack-helm-cinder/49e27c2/primary/pod-logs/openstack/cinder-create-internal-tenant-wkvrv/create-internal-tenant.txt.gz
15:58:02 <LiangFang> but the patchset before 7 is verified in my environment
15:58:15 <portdirect> though we should also add '--or-show' in the user/project management job, as otherwise it's just a single shot, and will fail on re-runs
15:58:50 <LiangFang> ok
15:59:11 <portdirect> ok - we're about to run out of time, ok to wrap up?
15:59:22 <portdirect> #topic reviews
15:59:41 <portdirect> as per always, there are some reviews that would really appreciate some attention
15:59:46 <portdirect> Reviews:
15:59:46 <portdirect> https://review.openstack.org/#/c/651491/ Add OpenSUSE Leap15 testing - adds directories with value overrides + one test job
15:59:46 <portdirect> Should the job be voting from the start?
15:59:46 <portdirect> We plan to add more jobs as followups, but they could already be included as part of this one, if seen as more appropriate
15:59:46 <portdirect> https://review.openstack.org/#/c/642067/ Allow more generic overrides for nova placement-api - same approach as with the patches already merged
15:59:47 <portdirect> https://review.openstack.org/#/c/644907/ Add an option to the health probe to test all pids - Fixes broken nova health probes for >= rocky
15:59:47 <portdirect> https://review.openstack.org/#/c/650933/ Add tempest suse image and zuul job
15:59:48 <portdirect> https://review.openstack.org/#/c/647493/ Add internal tenant id in conf (irc: LiangFang)
15:59:48 <portdirect> https://review.openstack.org/#/c/622573/2 Tungsten Fabric plugin changes
16:00:04 <LiangFang> 20:00-21:00 UTC seems to be the middle of the night in Asia :)
16:00:21 <portdirect> LiangFang: we need to improve there as well for sure
16:00:39 <LiangFang> ok, thanks
16:00:44 <portdirect> thanks everyone!
16:00:49 <portdirect> #endmeeting
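For reference on the '--or-show' suggestion from the "Add internal tenant id in conf" topic: these openstackclient flags return the existing resource instead of failing on re-runs. The project and user names below are placeholders, not the values used in the cinder chart's create-internal-tenant job:

```bash
#!/bin/bash
# Placeholder names - illustrative only, not the actual OSH job content.
set -ex

# --or-show makes the create call idempotent across job re-runs
PROJECT_ID=$(openstack project create --domain default \
  --description "internal tenant" \
  --or-show -f value -c id internal-tenant)

USER_ID=$(openstack user create --domain default \
  --project "${PROJECT_ID}" \
  --or-show -f value -c id internal-tenant-user)

echo "project=${PROJECT_ID} user=${USER_ID}"
```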