15:00:54 <portdirect> #startmeeting openstack-helm
15:00:55 <openstack> Meeting started Tue Apr 16 15:00:54 2019 UTC and is due to finish in 60 minutes. The chair is portdirect. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:56 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:59 <openstack> The meeting name has been set to 'openstack_helm'
15:01:07 <portdirect> let's give it a few mins for people to turn up
15:01:12 <portdirect> #topic rollcall
15:01:28 <portdirect> the agenda is here: https://etherpad.openstack.org/p/openstack-helm-meeting-2019-04-16
15:01:36 <itxaka> o/
15:01:42 <portdirect> please feel free to add to it, and we'll kick off at 5 past
15:01:53 <srwilkers> o/
15:02:02 <jsuchome> o/
15:03:34 <alanmeadows> o/
15:04:02 <gagehugo> o/
15:04:37 <megheisler> o/
15:05:14 <portdirect> ok - let's go
15:05:23 <portdirect> #topic Zuul testing doesn't recover job logs
15:05:32 <portdirect> itxaka the floor is yours
15:05:42 <itxaka> yay
15:05:58 <itxaka> so while testing tempest patches I found I was unable to query the logs
15:06:15 <itxaka> as the ttl for the job/pod had expired, those logs were lost forever and ever and ever (not really)
15:06:40 <itxaka> was wondering if there was anything already in place to recover that kind of logs, as they seem to be valuable
15:07:00 <portdirect> we've hit this sort of thing before, esp with pods that crash
15:07:08 <itxaka> but I'm reading srwilkers' response in there, so it seems like there is not and we would need to deploy something ourselves for that
15:07:08 <srwilkers> yep
15:07:18 <portdirect> srwilkers: you had any thoughts here, as i know you've noodled on it before?
15:07:32 <srwilkers> i think it requires two parts
15:08:00 <srwilkers> one: we really need to be able to override backoffLimits for the jobs in our charts
15:08:04 <prabhusesha_> This is Prabhu
15:08:19 <srwilkers> some jobs have this, others don't
15:09:14 <srwilkers> also, we need to revisit having our zuul post-run jobs include `kubectl logs -p foo -n bar` for cases where we've got pods that are in a crashloop state, or a state where they fail once or twice before being happy
15:09:30 <srwilkers> as insight into both of those types of situations is valuable
15:09:49 <itxaka> +1
15:10:05 <srwilkers> but without the ability to override the backoffLimits for jobs, that doesn't do us much good if kubernetes just deletes the pod and doesn't have a pod for us to query for logs
15:10:08 <srwilkers> previous or otherwise
15:10:22 <portdirect> the above would be great as a 1st step
15:11:02 <portdirect> i know we've also discussed 'lma-lite' here - would there be value in looking at that as well?
15:11:54 <itxaka> isn't there a "time to keep" for jobs that we could leverage as well, to leave those failed jobs around?
15:11:58 <srwilkers> i think that's sane down the road, but it would also require us to deploy the logging stack with every job
15:13:19 <itxaka> ttlSecondsAfterFinished is what I was thinking about, but that is only for jobs, not sure how it would affect other resources
15:14:04 <portdirect> would it perhaps make sense to also think about modifying the `wait-for-pods.sh` to detect a crash
15:14:12 <portdirect> and then instantly fail if it does?
15:14:45 <portdirect> as really we should have 0 crashes on deployment in the gates, and this would also allow us to capture the logs before they go away
15:14:49 <jsuchome> is it crashing right after deployment?
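A rough sketch of the two ideas floated above - capturing `--previous` logs and failing fast on a crashloop - assuming kubectl access to the gate cluster. The namespace handling and the status check are illustrative only, not the actual wait-for-pods.sh or zuul post-run change discussed here:

```bash
#!/bin/bash
# Illustrative sketch only - not the merged OSH scripts. Assumes kubectl can
# reach the gate cluster; the namespace is taken from the first argument.
set -u
NAMESPACE="${1:-openstack}"

# Find pods whose STATUS column reports CrashLoopBackOff
crashed=$(kubectl get pods -n "${NAMESPACE}" --no-headers \
  | awk '$3 == "CrashLoopBackOff" {print $1}')

if [ -n "${crashed}" ]; then
  for pod in ${crashed}; do
    echo "Pod ${pod} is crashlooping - capturing previous container logs"
    # --previous fails if there is no prior container instance, so tolerate it
    kubectl logs "${pod}" -n "${NAMESPACE}" --previous || true
  done
  # Fail fast rather than letting the deployment run to its timeout
  exit 1
fi
```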
15:15:00 <itxaka> wouldn't that cause false positives when something goes temporarily wrong (i.e. quay.io down for a few seconds)?
15:15:56 <portdirect> that would not cause a crashloop itxaka
15:16:42 <srwilkers> i'd say we could add that and see if anything falls out initially
15:16:45 <srwilkers> it seems like a sane idea
15:16:50 <itxaka> really? Because I've seen it locally, crashloopbackoff due to not being able to pull the image
15:17:04 <portdirect> that doesn't sound right itxaka
15:17:16 <portdirect> you should see 'errimagepull'
15:18:07 <srwilkers> errimagepull or imagepullbackoff are the only two i've seen related to pulling images -- if the pod's in a crashloop, it's indicative of an issue after the entrypoint's been executed
15:18:18 <srwilkers> as far as i'm aware, anyway
15:18:32 <itxaka> umm, I might be wrong there
15:18:43 <itxaka> unfortunately my env is in a pristine state so I cannot check now :)
15:18:58 <portdirect> let's get a ps up for both - and we can see what the fallout is
15:19:16 <portdirect> srwilkers: did you have one for `logs -p` already? I can't remember?
15:19:25 <srwilkers> yeah -- it's pretty dusty at this point
15:19:47 <srwilkers> https://review.openstack.org/#/c/603229/4
15:20:39 <portdirect> the only change required there is just checking to see if there is in fact a previous pod?
15:21:09 <portdirect> or to jerry-rig it, just allow failure of --previous?
15:21:09 <srwilkers> i think if there's not, it just returns nothing
15:21:18 <itxaka> nice!
15:21:28 <jsuchome> this looks good actually
15:21:39 <srwilkers> i lied
15:21:44 <srwilkers> it'll require an update
15:22:37 <portdirect> with that - let's see how things go
15:22:40 <portdirect> ok to move on?
15:22:51 <srwilkers> although with us setting ignore_errors: true there, it might not be an issue
15:22:53 <srwilkers> yep
15:23:11 <portdirect> ok
15:23:15 <portdirect> #topic Split 140-compute-kit.sh into nova+neutron
15:23:26 <itxaka> me again
15:23:45 <itxaka> we were just wondering about the reasons for having both nova+neutron together in the same script
15:23:56 <itxaka> and were wondering if there was a willingness to change it?
15:24:13 <portdirect> totally - the reason for them being together is just historical
15:24:18 <jsuchome> We actually found out that with this compute kit, wait-for-pods can time out quite often
15:24:24 <portdirect> as the two have circular deps
15:24:35 <srwilkers> well, even if you split them out you may run into timeout issues
15:24:37 <jsuchome> so the idea was not just to split the deployment script, but also have 2 calls of wait-for-pods
15:24:37 <portdirect> jsuchome: how many nodes are you attempting to deploy with these scripts?
15:24:45 <srwilkers> jsuchome: that wouldn't work
15:24:49 <srwilkers> as they have dependencies on each other
15:25:01 <jsuchome> ah, that's what we feared
15:25:03 <srwilkers> which is why we initially tied them to the same deployment script
15:25:17 <itxaka> would it be possible to make them less dependent on each other?
15:25:19 <portdirect> nova compute depends on neutron metadata...
15:25:30 <srwilkers> itxaka: not that i'm aware of
15:25:32 <jsuchome> not that many (3 actually), but we're running kubernetes on openstack already
15:25:37 <itxaka> oh well
15:25:49 <jsuchome> well, then just neutron comes first, right?
15:26:12 <portdirect> no - as the neutron metadata proxy depends on.... nova-metadata-api...
15:26:21 <portdirect> and the circle of joy continues :D
15:26:41 <srwilkers> its the circcccclllee, the circle of life
15:26:44 <jamesgu__> would separating nova compute out of the Nova chart help?
15:27:02 <portdirect> at the cost of huge complexity
15:27:26 <portdirect> why not just deploy, and then wait after?
15:27:51 <itxaka> yeah, seems like we just need to fix the probes to be faster at detecting readiness and increase our timeouts :)
15:28:12 <jsuchome> increase our timeouts, the universal solution for everything...
15:28:23 <jamesgu__> it works... :-)
15:28:33 <itxaka> the simplest fix to a problem is just to wait a bit more for it to solve itself
15:28:49 <itxaka> and RUN as far away as possible :P
15:28:55 <portdirect> why is it timing out for you? image pulls or something else?
15:29:24 <srwilkers> itxaka: you've got the right idea ;)
15:29:39 <jsuchome> we did not do detailed profiling, it just happened several times
15:29:54 <itxaka> I guess a bit of everything? image pulls, insufficient IO, maybe not enough workers, probes take a bit of time, etc...
15:30:03 <itxaka> it's usually pretty close to the timeout so...
15:30:10 <jsuchome> yeah, the database backend also
15:30:30 <itxaka> ok, seems good to me, not much from my side on this topic
15:30:30 <portdirect> nova-db-sync is certainly a monster on a clean deployment
15:30:50 <portdirect> ok to move on, and revisit this later if we need to?
15:30:55 <itxaka> +1
15:30:57 <jsuchome> +1
15:31:07 <srwilkers> yep
15:31:10 <portdirect> #topic: Differences in deployment scripts (component vs multinode)
15:31:21 <itxaka> omg, itxaka again? shut up already
15:31:26 <srwilkers> itxaka: XD
15:31:34 <itxaka> tbh I don't remember why I wrote this down
15:31:38 <itxaka> jsuchome, do you?
15:31:39 <portdirect> lol - you're bringing a good set of topics dude
15:31:51 <srwilkers> i mean, i think we're overdue for cleaning up our tools/deployment/foo
15:31:51 <itxaka> consolidate ?
15:31:52 <jsuchome> probably came out of that compute-kit discussion
15:32:01 <srwilkers> because to be frank, it's getting messy. and part of that's my fault
15:32:26 <portdirect> we probably could do with a re-think, we certainly need to split out the over-rides
15:32:41 <srwilkers> yep
15:32:48 <jsuchome> (overrides coming up next ... :-) )
15:32:56 <portdirect> but do the scripts that drive them need to be split, or could they just call the appropriate over-rides dependent on an env var?
15:33:11 <srwilkers> i think the latter is the more sane one there
15:33:52 <portdirect> so we'd have things like ./tools/deployment/over-rides/[single|multi]/<service>.yaml
15:34:04 <itxaka> that would be cool
15:34:08 <srwilkers> ++
15:34:10 <portdirect> called by ./tools/deployment/scripts/<service>.sh ?
15:34:34 <itxaka> will also make it much clearer what we are using where
15:34:49 <srwilkers> not to get too far off track here, but i'd like to see something similar for the armada deployment foo too
15:35:37 <portdirect> srwilkers: perhaps pegleg would help there? though let's keep this to the scripted deployment for now
15:36:43 <srwilkers> portdirect: sounds good
15:38:16 <itxaka> Can someone write the armada/pegleg stuff in the minutes? I have no idea what they do so I cannot really explain it properly in there
15:38:27 <itxaka> oh, I see portdirect is already on it, thanks!
15:38:48 <itxaka> maybe @jamesgu__ can help on that, gotta have some knowledge of those tools :)
15:39:34 <portdirect> ok - does someone want to give it a (and excuse the pun) bash at refactoring here?
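A minimal sketch of the per-service script + environment-selected override layout proposed above. The MODE variable, the over-rides/ tree, and the keystone example are assumptions for illustration, not an agreed structure:

```bash
#!/bin/bash
# Hypothetical sketch of the proposed split - paths and names are illustrative.
set -xe

MODE="${MODE:-single}"   # e.g. "single" or "multi", chosen by the gate job
SERVICE="keystone"

# One script per service, pulling the override file for the environment
helm upgrade --install ${SERVICE} ./${SERVICE} \
  --namespace=openstack \
  --values=./tools/deployment/over-rides/${MODE}/${SERVICE}.yaml

# Re-use the existing wait-for-pods.sh helper to let the release settle
./tools/deployment/common/wait-for-pods.sh openstack
```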
15:40:21 <srwilkers> if nobody else wants to, i can give it a go
15:40:33 <srwilkers> but it might be a good opportunity for someone else to get more familiar with our jobs
15:40:46 <itxaka> mhhh I would like to, but not sure if I'm the best at bash thingies, if only evrardjp was here...
15:40:52 <srwilkers> as that's where i've been living more often than not lately
15:41:06 <itxaka> I can have a look at it, maybe bother srwilkers with patches and guidance
15:41:16 <portdirect> srwilkers - let's work together on getting a poc up - and see if anyone wants to take it from there?
15:41:30 <srwilkers> sounds good to me
15:41:39 <srwilkers> either of those sounds good :)
15:41:44 <jsuchome> I could take a look at it as well, bash is nice ;-)
15:41:55 <portdirect> lol - everyone wants in
15:41:57 <itxaka> poc + me/jsuchome taking it from there sounds good to me
15:42:09 <itxaka> unless jsuchome wants to make the poc himself :P
15:42:11 <srwilkers> nice :)
15:42:30 <jsuchome> ah, no thanks, I'll leave it to the elders :-)
15:42:30 <portdirect> let's go for the poc and itxaka/jsuchome if that works?
15:42:44 <jsuchome> +1
15:42:47 <portdirect> jsuchome: just 'cause our minds are rotted, don't mean we are old ;P
15:42:59 <itxaka> +1
15:43:08 <portdirect> ok - let's move on
15:43:18 <portdirect> #topic Tungsten Fabric Review
15:43:25 <itxaka> that's not me, yay :P
15:43:32 <portdirect> prabhusesha, the floor is yours :)
15:43:39 <prabhusesha_> Hi
15:43:56 <prabhusesha_> I want somebody to pick up the review and provide us a set of comments
15:44:22 <prabhusesha_> I meant, a first set of comments
15:44:37 <portdirect> could you link to it?
15:44:42 <prabhusesha_> I'm continuing the work that madhukar was doing
15:44:55 <prabhusesha_> https://review.openstack.org/#/c/622573/2
15:45:10 <portdirect> prabhusesha_: it's fantastic to see this restarted
15:45:52 <prabhusesha_> I want to get this thing merged soon. I need all of your support
15:46:01 <prabhusesha_> I'm also kind of new to helm
15:46:14 <prabhusesha_> but I'm picking it up pretty fast
15:46:29 <portdirect> are any specific images required for this?
15:47:00 <prabhusesha_> you need the neutron, heat & nova plugin images
15:47:08 <prabhusesha_> from TF
15:47:27 <prabhusesha_> I'm drafting a blueprint
15:47:35 <portdirect> am i just missing them? as i don't see them here: https://review.openstack.org/#/c/622573/2/tools/overrides/backends/networking/tungstenfabric/nova-ocata.yaml
15:47:42 <portdirect> also how do you deploy TF itself?
15:48:25 <prabhusesha_> there are TF specific charts
15:48:46 <portdirect> are they public?
15:49:13 <prabhusesha_> currently it's private, work is happening on that front also
15:49:34 <portdirect> ok - i think that will largely block the efforts here
15:49:44 <portdirect> we can provide some high level feedback
15:49:54 <portdirect> but without public, and open images
15:49:57 <itxaka> would we need to build extra images with modifications to support this?
15:49:57 <portdirect> as well as charts
15:49:59 <prabhusesha_> that will be good
15:50:06 <itxaka> maybe that falls under the loci meeting instead :)
15:50:08 <portdirect> we can't do any more than that
15:50:21 <prabhusesha_> let me get my stuff in the right place
15:50:28 <portdirect> and i'd be uncomfortable merging, until that is resolved
15:50:35 <portdirect> sounds good prabhusesha_
15:50:38 <prabhusesha_> I agree
15:50:55 <portdirect> and as itxaka points out, it would be great to work with loci for image building
15:51:04 <prabhusesha_> high level comments will be helpful
15:51:33 <prabhusesha_> itxaka: I can get back to you on that
15:51:39 <itxaka> +1
15:51:55 <portdirect> ok - let's move on
15:52:01 <portdirect> #topic: Office-Hours
15:52:32 <portdirect> so last week we decided to kick-start the office hours effort again
15:52:56 <portdirect> initially these will be from 20:00-21:00 UTC on Wednesdays in the openstack-helm channel
15:53:39 <portdirect> the above time is simply because it's when we can make sure that there is core-reviewer attendance
15:53:50 <portdirect> but i know it's rubbish for folks in the EU
15:54:14 <portdirect> i hope we can change that as soon as we can get some coverage for that timezone :)
15:54:54 <portdirect> that's all i got here really. ok to move on?
15:55:02 <itxaka> +1
15:55:16 <portdirect> #topic: Add internal tenant id in conf
15:55:25 <LiangFang> hi
15:55:44 <LiangFang> portdirect: thanks for giving comments on my review
15:55:49 <portdirect> LiangFang: your ps looks great - I've got one comment i need to add in gerrit, but at that point lgtm
15:56:23 <portdirect> https://review.openstack.org/#/c/647493/
15:57:14 <LiangFang> thanks, one thing is that my environment is broken, so I have not verified it in my environment
15:57:29 <LiangFang> I don't know if CI is strong enough
15:57:45 <portdirect> looks ok in ci: http://logs.openstack.org/93/647493/7/check/openstack-helm-cinder/49e27c2/primary/pod-logs/openstack/cinder-create-internal-tenant-wkvrv/create-internal-tenant.txt.gz
15:58:02 <LiangFang> but the patchset before 7 is verified in my environment
15:58:15 <portdirect> though we should also add '--or-show' in the user/project management job, as otherwise it's just a single shot, and will fail on re-runs
15:58:50 <LiangFang> ok
15:59:11 <portdirect> ok - we're about to run out of time, ok to wrap up?
15:59:22 <portdirect> #topic reviews
15:59:41 <portdirect> as per always, there are some reviews that would really appreciate some attention
15:59:46 <portdirect> Reviews:
15:59:46 <portdirect> https://review.openstack.org/#/c/651491/ Add OpenSUSE Leap15 testing - adds directories with value overrides + one test job
15:59:46 <portdirect> Should the job be voting from the start?
15:59:46 <portdirect> We plan to add more jobs as followups, but they could already be included as part of this one, if seen as more appropriate
15:59:46 <portdirect> https://review.openstack.org/#/c/642067/ Allow more generic overrides for nova placement-api - same approach as with the patches already merged
15:59:47 <portdirect> https://review.openstack.org/#/c/644907/ Add an option to the health probe to test all pids - Fixes broken nova health probes for >= rocky
15:59:47 <portdirect> https://review.openstack.org/#/c/650933/ Add tempest suse image and zuul job
15:59:48 <portdirect> https://review.openstack.org/#/c/647493/ Add internal tenant id in conf (irc: LiangFang)
15:59:48 <portdirect> https://review.openstack.org/#/c/622573/2 Tungsten Fabric plugin changes
16:00:04 <LiangFang> 20:00-21:00 UTC seems to be the middle of the night in Asia :)
16:00:21 <portdirect> LiangFang: we need to improve there as well for sure
16:00:39 <LiangFang> ok, thanks
16:00:44 <portdirect> thanks everyone!
16:00:49 <portdirect> #endmeeting
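For reference on the '--or-show' suggestion from the "Add internal tenant id in conf" topic: these openstackclient flags return the existing resource instead of failing on re-runs. The project and user names below are placeholders, not the values used in the cinder chart's create-internal-tenant job:

```bash
#!/bin/bash
# Placeholder names - illustrative only, not the actual OSH job content.
set -ex

# --or-show makes the create call idempotent across job re-runs
PROJECT_ID=$(openstack project create --domain default \
  --description "internal tenant" \
  --or-show -f value -c id internal-tenant)

USER_ID=$(openstack user create --domain default \
  --project "${PROJECT_ID}" \
  --or-show -f value -c id internal-tenant-user)

echo "project=${PROJECT_ID} user=${USER_ID}"
```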