14:00:16 <ricolin> #startmeeting heat
14:00:17 <openstack> Meeting started Wed May 9 14:00:16 2018 UTC and is due to finish in 60 minutes. The chair is ricolin. Information about MeetBot at http://wiki.debian.org/MeetBot.
14:00:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
14:00:20 <openstack> The meeting name has been set to 'heat'
14:00:22 <ricolin> hi TheJulia
14:00:30 <ricolin> #topic roll call
14:01:26 <ramishra> hi
14:01:33 <ricolin> o/
14:02:56 <ricolin> #topic adding items to agenda
14:03:01 <ricolin> #link https://wiki.openstack.org/wiki/Meetings/HeatAgenda#Agenda_.282018-05-02_1400_UTC.29
14:04:13 <kazsh> o/
14:04:21 <ricolin> kazsh, hi
14:04:32 <kazsh> sorry for being late
14:04:45 <ricolin> kazsh, no worries, you're just on time
14:04:54 <ricolin> #topic Gate failure
14:05:02 <ricolin> #link http://status.openstack.org/openstack-health/#/g/project/openstack~2Fheat
14:05:30 <ricolin> here are some exceptions I found
14:05:34 <ricolin> #link https://etherpad.openstack.org/p/heat-gate-failures
14:06:26 <ricolin> ramishra, do you think it's a wise choice to increase the image size for the software config test again?
14:06:57 <ricolin> right now we use image_ref for tests like test_server_signal_userdata_format_software_config
14:07:04 <ramishra> ricolin: my view is that we get these higher failure rates from time to time; unless we know the root cause (which may be related to infra issues etc.), how long will we keep skipping the tests?
14:07:49 <ramishra> ricolin: I'm not sure if that's the reason, i.e. whether increasing the image size would fix the issue, so I can't comment
14:08:26 <ricolin> ramishra, I don't have any root cause from infra though
14:08:31 <therve> ricolin: Where's the download from?
14:08:51 <ricolin> TheJulia, maybe you know better on the status of the gate infra? :)
14:09:28 <TheJulia> ricolin: in what terms?
14:09:31 <ricolin> from the local environment IIRC
14:09:48 <therve> We create an octavia image on every build? :/
14:09:53 <TheJulia> yeouch
14:10:20 <ramishra> TheJulia: not now I suppose; AFAIK the dib stuff is gone now
14:10:23 <TheJulia> so yeah, that might be problematic, because if you're building an image then you're susceptible to every upstream failure
14:10:35 <TheJulia> hmm
14:10:37 <ramishra> therve: ^^
14:10:54 <therve> ramishra: What did ricolin link then?
14:11:16 <TheJulia> I don't know your gates/testing well enough to recommend anything, but for ironic we split the build out and we save to tarballs.o.o for future builds of our deployment agent
14:11:39 <ramishra> therve: that's fixed, let me find the patch
14:11:53 <therve> Ah, ok
14:11:58 <ricolin> ramishra, cool!
14:12:27 <ramishra> https://review.openstack.org/#/c/559416/
14:12:31 <ricolin> TheJulia, that's a good idea
14:13:16 <therve> ramishra: That patch is 3 weeks old
14:14:50 <ramishra> therve: what about the failure? I don't think dib is used now
14:15:26 <therve> ramishra: http://logs.openstack.org/07/567207/1/check/heat-functional-convg-mysql-lbaasv2-non-apache/ff9c346/logs/devstacklog.txt.gz#_2018-05-09_12_58_07_766
14:15:29 <therve> From today
14:17:21 <ramishra> therve: ok, it seems they still add some dib elements to the ubuntu-minimal image
14:17:26 <ricolin> maybe I'm missing something, but we use master for octavia in devstack, so I'm not sure that's fixed
14:18:12 <therve> Overall I'm not sure what testing octavia brings us
14:18:20 <therve> If anything, it should be something in their gate
14:18:55 <ramishra> therve: the same way we used to test the lbaas stuff earlier ;) Though I'm fine if we want to not test those resources and skip the tests
14:18:56 <ricolin> is there a way we can avoid building that image every time?
14:18:57 <TheJulia> Even if there is value in you testing it, the build should be occurring on their end, not in your job, if you can help it at all
14:19:17 <therve> Ah
14:19:31 <therve> ramishra: https://tarballs.openstack.org/octavia/test-images/test-only-amphora-x64-haproxy-ubuntu-xenial.qcow2
14:19:31 <TheJulia> ricolin: they could have a post job that uploads to tarballs.o.o for you? :)
14:19:39 <TheJulia> Hey, there you go!
14:19:40 <therve> It's already there
14:20:03 <ricolin> therve, yay
14:20:05 <ramishra> TheJulia: we don't build it, it's there in their devstack plugin and we use it
14:20:12 <therve> Check out kuryr-kubernetes
14:20:14 <TheJulia> I believe tarballs.o.o gets mirrored into the various clouds where jobs execute, so downloading that should be safe CI-job-wise
14:20:23 <TheJulia> ramishra: ugh
14:20:35 <therve> TheJulia: I don't think so, but it's better than building it anyway
14:21:14 <therve> ricolin: OK, I'll do that fix at least
14:21:15 <ricolin> ramishra, I guess we can try to report this to the octavia team
14:21:31 <ricolin> therve, that's even better!! :)
14:21:58 <TheJulia> it would be good for their plugin consumers to be able to opt to download instead of build, fwiw
14:22:15 <ricolin> TheJulia, agree on that!
14:23:09 <ricolin> how about this one: http://logs.openstack.org/90/566190/1/gate/heat-functional-orig-mysql-lbaasv2/0fb2e98/job-output.txt.gz#_2018-05-04_12_43_27_744637
14:24:34 <openstackgerrit> Thomas Herve proposed openstack/heat master: Download octavia image in tests https://review.openstack.org/567238
14:25:23 <therve> ricolin: I mean, we're not going to get through the errors during the meeting
14:25:23 <ramishra> ricolin: That seems to be some issue where it can't communicate with the listener and the lb goes to ERROR/DEGRADED state
14:25:43 <TheJulia> looks like there are some nasty libvirt errors in the related nova log
14:26:21 <ricolin> therve, of course, this is the last one I'll bring up!
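(For reference: a minimal sketch of the "download instead of build" approach discussed above, assuming the prebuilt test amphora image from tarballs.openstack.org can simply be registered in Glance. The image name and the "amphora" tag are illustrative assumptions, not necessarily what https://review.openstack.org/567238 does.)

```shell
# Fetch the prebuilt test amphora image instead of building one with dib,
# avoiding exposure to upstream build failures on every job run.
IMAGE_URL=https://tarballs.openstack.org/octavia/test-images/test-only-amphora-x64-haproxy-ubuntu-xenial.qcow2
curl -fsSL -o /tmp/amphora.qcow2 "$IMAGE_URL"

# Register it in Glance; the image name and tag are placeholders for
# whatever the octavia devstack plugin expects to find.
openstack image create test-only-amphora-x64-haproxy-ubuntu-xenial \
    --disk-format qcow2 \
    --container-format bare \
    --tag amphora \
    --file /tmp/amphora.qcow2
```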
14:27:08 <ricolin> TheJulia, and that means we have nearly no way to fix it :/
14:27:31 <ramishra> ricolin: I don't know exactly what's going on there though; there are plenty of failures in the tests using fedora images too
14:28:16 <ricolin> ramishra, I've got no clue at all :/
14:28:26 <ramishra> I was suspecting it's some issue with mirroring for some clouds, as we use local mirrors, but I'm not sure
14:28:59 <ricolin> ramishra, might be, but I'm not sure about that either
14:29:09 <ricolin> Anyway, let's move on, we need time for other stuff
14:29:21 <ricolin> #topic remove jobs
14:29:52 <ramishra> therve: the failure rates are not bad enough to be alarming IMHO, we had similar periods earlier too :)
14:29:53 <ricolin> I would like to have a quick discussion on that too
14:30:15 <therve> ramishra: What? We have like 5 rechecks per merge
14:30:39 <ramishra> therve: yes, we had them earlier
14:30:48 <therve> ramishra: And it was horrible?
14:30:51 <therve> Not sure what your point is
14:31:23 <ricolin> I guess we can relieve some of the gate tension by removing some jobs? :)
14:31:38 <therve> ricolin: Yeah, let's make the non-apache job non-voting
14:31:42 <ricolin> we just removed one non-voting job #link https://review.openstack.org/#/c/564623/
14:31:52 <ramishra> therve: my point is that it's probably some infra issue and load on the test infra, though we can make some optimizations here and there
14:32:19 <ricolin> ramishra, agree on that
14:32:30 <therve> ramishra: Right, so we should do that and not just sit waiting, then?
14:32:39 <ricolin> therve, ramishra how about `grenade-heat`?
14:32:53 <ramishra> like it's easier to merge stuff during my morning time, but yeah, we should try to find the root cause
14:33:03 <ramishra> therve: I did not say that
14:34:35 <therve> ricolin: I don't know about grenade-heat
14:34:37 <ramishra> ricolin: Let's move on to other stuff and see if we can make the gate stable in the coming days
14:34:39 <therve> It doesn't seem to be failing
14:35:01 <therve> ramishra: Every time I type recheck a little part of my soul dies
14:35:47 <ramishra> therve: I understand; it's even worse when cores do blind rechecks :/
14:35:48 <ricolin> ramishra, sure, it's just that I think we can try to remove some jobs for good
14:36:31 <ricolin> I do like the idea of making non-apache non-voting
14:37:12 <ramishra> ricolin: we don't have a lot of jobs like other projects, but yes, we can make that one non-voting
14:37:13 <ricolin> if that's not making anyone uncomfortable
14:37:42 <ricolin> ramishra, cool
14:38:05 <ricolin> let's move on then
14:38:06 <ricolin> #topic StoryBoard Migrated
14:38:16 <ricolin> #link https://storyboard.openstack.org/#!/project_group/82
14:38:16 <ricolin> #link https://etherpad.openstack.org/p/Heat-StoryBoard-Migration-Info
14:39:01 <ricolin> just FYI, all our bugs and the comments on those bugs are now in StoryBoard
14:39:30 <ricolin> also hope this will make it easier to track
14:39:33 <ricolin> #link https://storyboard.openstack.org/#!/board/71
14:40:05 <ricolin> All cores should have owner rights for that board now
14:40:49 <ricolin> BTW the board loading is a bit slow; trying to figure out what we can improve on that
14:41:40 <ricolin> therve, I changed all the launchpad titles back now, thought you might like to know
14:41:47 <therve> ricolin: Thanks
14:42:39 <ricolin> I will keep watching irc and gerrit as long as I can to help anyone adopting the new platform
14:43:16 <ricolin> I've also already added the info to our Vancouver project update, onboarding, and user feedback sessions
14:43:38 <ricolin> That's all I got
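(For reference: the non-voting change agreed on in the "remove jobs" topic above might look roughly like the following in heat's .zuul.yaml. This is a sketch under the assumption that the job is defined in-repo; the project stanza layout is illustrative.)

```yaml
- project:
    check:
      jobs:
        # Keep running the non-apache functional job in check, but stop
        # letting its failures block merges; a non-voting job would also
        # be dropped from the gate pipeline.
        - heat-functional-convg-mysql-lbaasv2-non-apache:
            voting: false
```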
14:43:53 <ramishra> ricolin: how will milestone/release tracking/tagging etc. be done now? I need to read the documentation though
14:44:06 <ramishra> maybe that can be added to the etherpad too
14:44:35 <ricolin> ramishra, okay, I will try to get more information on that part
14:44:37 <ricolin> thx
14:46:03 <ricolin> ramishra, also, the reason they are not adding the StoryBoard url to the corresponding Launchpad bugs is a performance issue: it would take an extremely long time to do that
14:46:42 <ricolin> and since most of the information is there in StoryBoard, they decided not to add them
14:46:58 <ricolin> that's the information I got from the StoryBoard team
14:47:07 <ricolin> ramishra, thought you might like to know :)
14:48:35 <ricolin> I would like to discuss the Translate Properties failure issue, but since that also involves zaneb, let's skip it until next week
14:49:15 <ricolin> I have updated the fix for the octavia pool member resource
14:49:19 <ricolin> #link https://review.openstack.org/#/c/541558/
14:49:41 <ramishra> ricolin: it's not easy for someone new to find the corresponding story for a bug; I did not like the reasoning, though. However, we are where we are
14:49:46 <ricolin> but we still have to discuss how we can deal with the property issue for good
14:50:47 <ricolin> ramishra, I will try my best to make it easier for all heat developers
14:51:08 <ricolin> ramishra, maybe do more tagging
14:51:33 <ricolin> still trying to figure out where people suffer the most
14:52:16 <ricolin> ramishra, I think finding the corresponding bug is no.1 on my list
14:52:25 <ricolin> so that's a good point
14:52:48 <ricolin> Let's move on
14:53:14 <ricolin> #topic Bugs and BPs
14:53:32 <ricolin> Anything to raise for any bugs or bps?
14:53:59 <TheJulia> Is this the appropriate time to try and stir up discussion regarding a bug fix?
14:54:22 <ricolin> TheJulia, sure :)
14:55:56 <TheJulia> So I posted https://review.openstack.org/#/c/564104/ which seems to be a really long-standing issue, because we were hitting major issues downstream with baremetal instances going into ERROR state in nova. There has been some back and forth, but I'd like to either attempt to get consensus or further drive discussion.
14:56:10 <ricolin> #link https://review.openstack.org/#/c/564104/
14:57:03 <therve> I think ramishra said it was fine, so I trust him on that one :)
14:57:06 <TheJulia> tl;dr ports are being orphaned in neutron in certain cases when an instance being built goes into ERROR state while nova was told to build with a port
14:57:15 <ramishra> TheJulia: I'm ok with the fix in principle, though checking for resource FAILED status could be better; we can test it at the gate by marking a server resource unhealthy
14:57:59 <TheJulia> ramishra: is the server resource status ultimately tied to the instance status?
14:58:00 <ramishra> but it would introduce a bug, as we try to roll back FAILED resources
14:58:41 <TheJulia> Well, as far as I'm aware there is no rolling back an instance in ERROR state *shrugs*
14:59:01 <ramishra> TheJulia: it's a heat bug which should be fixed
15:00:03 <TheJulia> the fact that it tries to roll back instances in error state now?
15:00:40 <ramishra> TheJulia: yes, if a server is in error state and heat does not know about it (the resource is not in FAILED state), then we probably would not do anything
15:01:18 <TheJulia> so would it not make sense to check both instance and resource status?
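(Aside: ramishra's gate-testing suggestion above refers to heat's mark-unhealthy support, which forces a resource into a FAILED state without touching the real server. A hypothetical invocation, with placeholder stack and resource names:)

```shell
# Force the "my_server" resource of "mystack" into CHECK_FAILED so the
# FAILED-status code path can be exercised in a functional test.
openstack stack resource mark unhealthy mystack my_server
```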
15:01:44 * TheJulia is totally not the expert here
15:02:07 <ramishra> TheJulia: In principle we assume things are ok unless we know about it...
15:03:09 <TheJulia> so resource makes sense then, since it is trying to replace the failed instance.
15:03:27 <ramishra> For a FAILED resource, if an update is cancelled we do try to rollback, which we have to fix
15:04:25 <TheJulia> okay, so checking the resource shouldn't be a problem really... I think
15:04:38 <TheJulia> it's your call, I can change it to do whatever :)
15:05:20 <ramishra> I mean we have something like (observe reality) where we check the actual state of the resource in the respective services
15:05:30 <ramishra> but that's based on user request AFAIK
15:05:45 <ricolin> TheJulia, you can check both I think :)
15:06:08 <TheJulia> that is likely safer rollback-wise and I can just put a note in describing why
15:06:35 <TheJulia> works for me
15:06:41 <ricolin> TheJulia, cool
15:07:27 <ricolin> I'm going to close the meeting since we're already over time. Thanks all for joining!!
15:07:31 <ricolin> #endmeeting
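(For context on the fix discussed in the last topic: the check TheJulia and ramishra converged on, guarding rollback on both the heat resource status and the underlying nova server status, might look roughly like the sketch below. The function and attribute names are illustrative assumptions, not the actual code in https://review.openstack.org/#/c/564104/.)

```python
# A minimal sketch, not heat's actual implementation: rollback is unsafe
# when either heat already knows the resource failed, or nova reports the
# underlying server in ERROR state (rolling back such a server is what
# orphans its neutron ports).
def is_rollback_safe(resource, server):
    """`resource` is a heat resource with status/FAILED attributes;
    `server` is the nova server record. Both names are hypothetical."""
    resource_failed = resource.status == resource.FAILED
    server_errored = server.status == 'ERROR'
    return not (resource_failed or server_errored)
```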