14:00:16 <ricolin> #startmeeting heat
14:00:17 <openstack> Meeting started Wed May 9 14:00:16 2018 UTC and is due to finish in 60 minutes. The chair is ricolin. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:20 <openstack> The meeting name has been set to 'heat'
14:00:22 <ricolin> hi TheJulia
14:00:30 <ricolin> #topic roll call
14:01:26 <ramishra> hi
14:01:33 <ricolin> o/
14:02:56 <ricolin> #topic adding items to agenda
14:03:01 <ricolin> #link https://wiki.openstack.org/wiki/Meetings/HeatAgenda#Agenda_.282018-05-02_1400_UTC.29
14:04:13 <kazsh> o/
14:04:21 <ricolin> kazsh, hi
14:04:32 <kazsh> sorry for being late
14:04:45 <ricolin> kazsh, no worries, you're just on time
14:04:54 <ricolin> #topic Gate failure
14:05:02 <ricolin> #link http://status.openstack.org/openstack-health/#/g/project/openstack~2Fheat
14:05:30 <ricolin> here are some exceptions I found
14:05:34 <ricolin> #link https://etherpad.openstack.org/p/heat-gate-failures
14:06:26 <ricolin> ramishra, do you think it's a wise choice to increase the image size for the software config test again?
14:06:57 <ricolin> right now we use image_ref for tests like test_server_signal_userdata_format_software_config
14:07:04 <ramishra> ricolin: my view is that we get these higher failure rates from time to time; unless we know the root cause (which may be related to infra issues etc.), how long will we keep skipping the tests?
14:07:49 <ramishra> ricolin: I'm not sure if that's the reason, i.e. whether increasing the image size would fix the issue, so I can't comment
14:08:26 <ricolin> ramishra, I don't have any root cause from infra though
14:08:31 <therve> ricolin: Where's the download from?
14:08:51 <ricolin> TheJulia, maybe you know better on the status of the gate infra? :)
14:09:28 <TheJulia> ricolin: in what terms?
14:09:31 <ricolin> from the local environment IIRC
14:09:48 <therve> We create an octavia image on every build? :/
14:09:53 <TheJulia> yeouch
14:10:20 <ramishra> TheJulia: not now I suppose; AFAIK the dib stuff is gone now
14:10:23 <TheJulia> so yeah, that might be problematic, because if you're building an image then you're susceptible to every upstream failure
14:10:35 <TheJulia> hmm
14:10:37 <ramishra> therve: ^^
14:10:54 <therve> ramishra: What did ricolin link then?
14:11:16 <TheJulia> I don't know your gates/testing well enough to recommend anything, but for ironic we split the build out and we save to tarballs.o.o for future builds of our deployment agent
14:11:39 <ramishra> therve: that's fixed, let me find the patch
14:11:53 <therve> Ah, ok
14:11:58 <ricolin> ramishra, cool!
14:12:27 <ramishra> https://review.openstack.org/#/c/559416/
14:12:31 <ricolin> TheJulia, that's a good idea
14:13:16 <therve> ramishra: That patch is 3 weeks old
14:14:50 <ramishra> therve: what about the failure? I don't think dib is used now
14:15:26 <therve> ramishra: http://logs.openstack.org/07/567207/1/check/heat-functional-convg-mysql-lbaasv2-non-apache/ff9c346/logs/devstacklog.txt.gz#_2018-05-09_12_58_07_766
14:15:29 <therve> From today
14:17:21 <ramishra> therve: ok, it seems they still add some dib elements to the ubuntu-minimal image
14:17:26 <ricolin> maybe I'm missing something, but we use master for octavia in devstack, so I'm not sure that's fixed
14:18:12 <therve> Overall I'm not sure what testing octavia brings us
14:18:20 <therve> If anything, it should be something in their gate
14:18:55 <ramishra> therve: the same way we used to test the lbaas stuff earlier ;) Though I'm fine if we want to not test those resources and skip the tests
14:18:56 <ricolin> is there a way we can avoid building that image every time?
14:18:57 <TheJulia> Even if there is value in you testing it, the build should be occurring on their end, not in your job, if you can help it at all
14:19:17 <therve> Ah
14:19:31 <therve> ramishra: https://tarballs.openstack.org/octavia/test-images/test-only-amphora-x64-haproxy-ubuntu-xenial.qcow2
14:19:31 <TheJulia> ricolin: they could have a post job that uploads to tarballs.o.o for you? :)
14:19:39 <TheJulia> Hey, there you go!
14:19:40 <therve> It's already there
14:20:03 <ricolin> therve, yay
14:20:05 <ramishra> TheJulia: we don't build it, it's there in their devstack plugin and we use it
14:20:12 <therve> Check out kuryr-kubernetes
14:20:14 <TheJulia> I believe tarballs.o.o gets mirrored into the various clouds where jobs execute, so downloading that should be safe CI-job-wise
14:20:23 <TheJulia> ramishra: ugh
14:20:35 <therve> TheJulia: I don't think so, but it's better than building it anyway
14:21:14 <therve> ricolin: OK, I'll do that fix at least
14:21:15 <ricolin> ramishra, I guess we can try to report this to the octavia team
14:21:31 <ricolin> therve, that's even better!! :)
14:21:58 <TheJulia> it would be good for their plugin consumers to be able to opt to download instead of build, fwiw
14:22:15 <ricolin> TheJulia, agree on that!
14:23:09 <ricolin> how about this one: http://logs.openstack.org/90/566190/1/gate/heat-functional-orig-mysql-lbaasv2/0fb2e98/job-output.txt.gz#_2018-05-04_12_43_27_744637
14:24:34 <openstackgerrit> Thomas Herve proposed openstack/heat master: Download octavia image in tests https://review.openstack.org/567238
14:25:23 <therve> ricolin: I mean, we're not going to get through the errors during the meeting
14:25:23 <ramishra> ricolin: That seems to be some issue where it can't communicate with the listener and the lb goes to ERROR/DEGRADED state
14:25:43 <TheJulia> looks like there are some nasty libvirt errors in the related nova log
14:26:21 <ricolin> therve, of course, this is the last one I'll bring up!
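(For reference: a minimal sketch of the "download instead of build" approach discussed above, assuming the prebuilt test amphora image from tarballs.openstack.org can simply be registered in Glance. The image name and the "amphora" tag are illustrative assumptions, not necessarily what https://review.openstack.org/567238 does.)

```shell
# Fetch the prebuilt test amphora image instead of building one with dib,
# avoiding exposure to upstream build failures on every job run.
IMAGE_URL=https://tarballs.openstack.org/octavia/test-images/test-only-amphora-x64-haproxy-ubuntu-xenial.qcow2
curl -fsSL -o /tmp/amphora.qcow2 "$IMAGE_URL"

# Register it in Glance; the image name and tag are placeholders for
# whatever the octavia devstack plugin expects to find.
openstack image create test-only-amphora-x64-haproxy-ubuntu-xenial \
    --disk-format qcow2 \
    --container-format bare \
    --tag amphora \
    --file /tmp/amphora.qcow2
```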
14:27:08 <ricolin> TheJulia, and that means we have nearly no way to fix it :/
14:27:31 <ramishra> ricolin: I don't know exactly what's going on there though; there are plenty of failures in the tests using fedora images too
14:28:16 <ricolin> ramishra, I've got no clue at all :/
14:28:26 <ramishra> I was suspecting it's some issue with mirroring for some clouds, as we use local mirrors, but I'm not sure
14:28:59 <ricolin> ramishra, might be, but I'm not sure about that either
14:29:09 <ricolin> Anyway, let's move on, we need time for other stuff
14:29:21 <ricolin> #topic remove jobs
14:29:52 <ramishra> therve: the failure rates are not bad enough to be alarming IMHO, we had similar periods earlier too :)
14:29:53 <ricolin> I would like to have a quick discussion on that too
14:30:15 <therve> ramishra: What? We have like 5 rechecks per merge
14:30:39 <ramishra> therve: yes, we had them earlier
14:30:48 <therve> ramishra: And it was horrible?
14:30:51 <therve> Not sure what your point is
14:31:23 <ricolin> I guess we can relieve some of the gate tension by removing some jobs? :)
14:31:38 <therve> ricolin: Yeah, let's make the non-apache job non-voting
14:31:42 <ricolin> we just removed one non-voting job #link https://review.openstack.org/#/c/564623/
14:31:52 <ramishra> therve: my point is that it's probably some infra issue and load on the test infra, though we can make some optimizations here and there
14:32:19 <ricolin> ramishra, agree on that
14:32:30 <therve> ramishra: Right, so we should do that and not just sit waiting, then?
14:32:39 <ricolin> therve, ramishra how about `grenade-heat`?
14:32:53 <ramishra> like it's easier to merge stuff during my morning time, but yeah, we should try to find the root cause
14:33:03 <ramishra> therve: I did not say that
14:34:35 <therve> ricolin: I don't know about grenade-heat
14:34:37 <ramishra> ricolin: Let's move on to other stuff and see if we can make the gate stable in the coming days
14:34:39 <therve> It doesn't seem to be failing
14:35:01 <therve> ramishra: Every time I type recheck a little part of my soul dies
14:35:47 <ramishra> therve: I understand; it's even worse when cores do blind rechecks :/
14:35:48 <ricolin> ramishra, sure, it's just that I think we can try to remove some jobs for good
14:36:31 <ricolin> I do like the idea of making non-apache non-voting
14:37:12 <ramishra> ricolin: we don't have a lot of jobs like other projects, but yes, we can make that one non-voting
14:37:13 <ricolin> if that's not making anyone uncomfortable
14:37:42 <ricolin> ramishra, cool
14:38:05 <ricolin> let's move on then
14:38:06 <ricolin> #topic StoryBoard Migrated
14:38:16 <ricolin> #link https://storyboard.openstack.org/#!/project_group/82
14:38:16 <ricolin> #link https://etherpad.openstack.org/p/Heat-StoryBoard-Migration-Info
14:39:01 <ricolin> just FYI, all our bugs and the comments on those bugs are now in StoryBoard
14:39:30 <ricolin> also hope this will make it easier to track
14:39:33 <ricolin> #link https://storyboard.openstack.org/#!/board/71
14:40:05 <ricolin> All cores should have owner rights for that board now
14:40:49 <ricolin> BTW the board loading is a bit slow; trying to figure out what we can improve on that
14:41:40 <ricolin> therve, I changed all the launchpad titles back now, thought you might like to know
14:41:47 <therve> ricolin: Thanks
14:42:39 <ricolin> I will keep watching irc and gerrit as long as I can to help anyone adopting the new platform
14:43:16 <ricolin> I've also already added the info to our Vancouver project update, onboarding, and user feedback sessions
14:43:38 <ricolin> That's all I got
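(For reference: the non-voting change agreed on in the "remove jobs" topic above might look roughly like the following in heat's .zuul.yaml. This is a sketch under the assumption that the job is defined in-repo; the project stanza layout is illustrative.)

```yaml
- project:
    check:
      jobs:
        # Keep running the non-apache functional job in check, but stop
        # letting its failures block merges; a non-voting job would also
        # be dropped from the gate pipeline.
        - heat-functional-convg-mysql-lbaasv2-non-apache:
            voting: false
```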
14:43:53 <ramishra> ricolin: how will milestone/release tracking/tagging etc. be done now? I need to read the documentation though
14:44:06 <ramishra> maybe that can be added to the etherpad too
14:44:35 <ricolin> ramishra, okay, I will try to get more information on that part
14:44:37 <ricolin> thx
14:46:03 <ricolin> ramishra, also, the reason they are not adding the StoryBoard url to the corresponding Launchpad bugs is a performance issue: it would take an extremely long time to do that
14:46:42 <ricolin> and since most of the information is there in StoryBoard, they decided not to add them
14:46:58 <ricolin> that's the information I got from the StoryBoard team
14:47:07 <ricolin> ramishra, thought you might like to know :)
14:48:35 <ricolin> I would like to discuss the Translate Properties failure issue, but since that also involves zaneb, let's skip it until next week
14:49:15 <ricolin> I have updated the fix for the octavia pool member resource
14:49:19 <ricolin> #link https://review.openstack.org/#/c/541558/
14:49:41 <ramishra> ricolin: it's not easy for someone new to find the corresponding story for a bug; I did not like the reasoning, though. However, we are where we are
14:49:46 <ricolin> but we still have to discuss how we can deal with the property issue for good
14:50:47 <ricolin> ramishra, I will try my best to make it easier for all heat developers
14:51:08 <ricolin> ramishra, maybe do more tagging
14:51:33 <ricolin> still trying to figure out where people suffer the most
14:52:16 <ricolin> ramishra, I think finding the corresponding bug is no.1 on my list
14:52:25 <ricolin> so that's a good point
14:52:48 <ricolin> Let's move on
14:53:14 <ricolin> #topic Bugs and BPs
14:53:32 <ricolin> Anything to raise for any bugs or bps?
14:53:59 <TheJulia> Is this the appropriate time to try and stir up discussion regarding a bug fix?
14:54:22 <ricolin> TheJulia, sure :)
14:55:56 <TheJulia> So I posted https://review.openstack.org/#/c/564104/ which seems to be a really long-standing issue, because we were hitting major issues downstream with baremetal instances going into ERROR state in nova. There has been some back and forth, but I'd like to either attempt to get consensus or further drive discussion.
14:56:10 <ricolin> #link https://review.openstack.org/#/c/564104/
14:57:03 <therve> I think ramishra said it was fine, so I trust him on that one :)
14:57:06 <TheJulia> tl;dr ports are being orphaned in neutron in certain cases when an instance being built goes into ERROR state while nova was told to build with a port
14:57:15 <ramishra> TheJulia: I'm ok with the fix in principle, though checking for resource FAILED status could be better; we can test it at the gate by marking a server resource unhealthy
14:57:59 <TheJulia> ramishra: is the server resource status ultimately tied to the instance status?
14:58:00 <ramishra> but it would introduce a bug, as we try to roll back FAILED resources
14:58:41 <TheJulia> Well, as far as I'm aware there is no rolling back an instance in ERROR state *shrugs*
14:59:01 <ramishra> TheJulia: it's a heat bug which should be fixed
15:00:03 <TheJulia> the fact that it tries to roll back instances in error state now?
15:00:40 <ramishra> TheJulia: yes, if a server is in error state and heat does not know about it (the resource is not in FAILED state), then we probably would not do anything
15:01:18 <TheJulia> so would it not make sense to check both instance and resource status?
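(Aside: ramishra's gate-testing suggestion above refers to heat's mark-unhealthy support, which forces a resource into a FAILED state without touching the real server. A hypothetical invocation, with placeholder stack and resource names:)

```shell
# Force the "my_server" resource of "mystack" into CHECK_FAILED so the
# FAILED-status code path can be exercised in a functional test.
openstack stack resource mark unhealthy mystack my_server
```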
15:01:44 * TheJulia is totally not the expert here
15:02:07 <ramishra> TheJulia: In principle we assume things are ok unless we know about it...
15:03:09 <TheJulia> so resource makes sense then, since it is trying to replace the failed instance.
15:03:27 <ramishra> For a FAILED resource, if an update is cancelled we do try to rollback, which we have to fix
15:04:25 <TheJulia> okay, so checking the resource shouldn't be a problem really... I think
15:04:38 <TheJulia> it's your call, I can change it to do whatever :)
15:05:20 <ramishra> I mean we have something like (observe reality) where we check the actual state of the resource in the respective services
15:05:30 <ramishra> but that's based on user request AFAIK
15:05:45 <ricolin> TheJulia, you can check both I think :)
15:06:08 <TheJulia> that is likely safer rollback-wise and I can just put a note in describing why
15:06:35 <TheJulia> works for me
15:06:41 <ricolin> TheJulia, cool
15:07:27 <ricolin> I'm going to close the meeting since we're already over time. Thanks all for joining!!
15:07:31 <ricolin> #endmeeting
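(For context on the fix discussed in the last topic: the check TheJulia and ramishra converged on, guarding rollback on both the heat resource status and the underlying nova server status, might look roughly like the sketch below. The function and attribute names are illustrative assumptions, not the actual code in https://review.openstack.org/#/c/564104/.)

```python
# A minimal sketch, not heat's actual implementation: rollback is unsafe
# when either heat already knows the resource failed, or nova reports the
# underlying server in ERROR state (rolling back such a server is what
# orphans its neutron ports).
def is_rollback_safe(resource, server):
    """`resource` is a heat resource with status/FAILED attributes;
    `server` is the nova server record. Both names are hypothetical."""
    resource_failed = resource.status == resource.FAILED
    server_errored = server.status == 'ERROR'
    return not (resource_failed or server_errored)
```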