16:01:16 <ihrachys> #startmeeting neutron_ci
16:01:17 <openstack> Meeting started Tue Feb 14 16:01:16 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:21 <ihrachys> hello everyone :)
16:01:22 <jlibosva> o/
16:01:24 <manjeets> hi
16:01:51 <kevinbenton> hi
16:01:54 <reedip_1> o/
16:02:02 <dasm> o/
16:02:17 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:02:32 <ihrachys> (or lack of it, I need to write up some stub topics)
16:02:43 <ihrachys> let's start with action items from the previous meeting
16:02:54 <ihrachys> #topic Action items from previous meeting
16:03:04 <ihrachys> "ihrachys to follow up with armax on periodic functional job not showing up in grafana"
16:03:19 <ihrachys> so indeed the periodic functional does not show up in http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:03:20 <ihrachys> BUT
16:03:38 <ihrachys> we can see runs in http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:04:11 <ihrachys> and armax suggested that it does not show up because of a glitch in grafana that makes it impossible to draw a trend while results are identical
16:04:30 <ihrachys> so we believe that till it fails, it won't be in the grafana
16:04:53 <jlibosva> it's also on openstack-health http://status.openstack.org/openstack-health/#/job/periodic-neutron-dsvm-functional-ubuntu-xenial
16:04:57 <ihrachys> that's actually interesting that it hasn't failed in 15 runs
16:05:10 <ihrachys> jlibosva: oh thanks for the link!
16:05:50 <ihrachys> we saw the failure rate in the check queue at 30% to 40% before, so even though today it's 10%, the chance of not hitting a single failure in 15 runs is quite low
16:06:10 <ihrachys> so either we are lucky, or there is some other thing in play (maybe we allocate different VMs for periodic?)
16:06:42 <ihrachys> so far, we are going to monitor, and maybe in a week we will have a failure; if not, we will look closer at job definitions.
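To put numbers on the "quite low" claim above, a minimal Python sketch; the 30-40% and 10% figures are the rough rates quoted in the discussion, not measured values for the periodic queue:

    # Probability of a job going N runs in a row without a failure at a
    # given per-run failure rate (rates taken from the discussion above).
    def clean_streak_probability(failure_rate, runs=15):
        return (1 - failure_rate) ** runs

    for rate in (0.30, 0.40, 0.10):
        print("failure rate %d%%: P(15 clean runs) ~ %.2f%%"
              % (rate * 100, clean_streak_probability(rate) * 100))
    # ~0.47% at 30%, ~0.05% at 40%, but ~20.6% at 10% -- 15 clean runs is
    # plausible only if the periodic job really sees today's lower rate.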
16:07:10 <ihrachys> ok next action was: "ihrachys to look at e-r bot for openstack-neutron channel"
16:07:22 <ihrachys> that's actually something to decide first if we even want it
16:07:52 <ihrachys> I noticed some other projects like nova or glance have an IRC bot that reports into their channels on captured classified failures
16:08:13 <ihrachys> I was thinking it may help give us a better understanding of what hits our gates
16:08:22 <kevinbenton> i vote +1 for that
16:08:24 <ihrachys> so far I proposed a patch that enables the bot at https://review.openstack.org/#/c/433735/
16:08:42 <jlibosva> +1 from me too. If it becomes annoying or not useful, we can disable it anytime
16:08:42 <ihrachys> I guess we can enable it and see how it goes; if it spams too much, we tweak or disable it later
16:08:58 <manjeets> +1
16:09:22 <ihrachys> ok, and I see mtreinish's comment that it's the wrong place to do it; I will update the right one after the meeting
16:09:32 <ihrachys> as long as we agree on the direction, which I believe we do :)
16:09:54 <dasm> ihrachys: how does it look in the channel? does the IRC bot show a URL? do you have an example of that?
16:10:45 <dasm> i'm ok with enabling it, just asking to know what'll be changed
16:11:36 <kevinbenton> every 2 seconds it will report if we have any failures :)
16:11:42 <dasm> kevinbenton: ++ :)
16:11:48 <ihrachys> like in http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-16.log.html#t2017-01-16T03:12:54
16:12:02 <ihrachys> though if it knows the error, it will report the bug and all
16:12:06 <dasm> ihrachys: ack
16:12:47 <ihrachys> ok, that's an example of bug recognized: http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-08.log.html#t2017-01-08T00:13:53
16:13:07 <ihrachys> I guess we won't know if it's useful until we try
16:13:13 <manjeets> i have a question: if the patch itself is the culprit for the job failure, will it still report?
16:13:32 <dasm> ihrachys: btw "recognized bugs". if i recall correctly, we have to manually categorize recognized bugs. is that true?
16:13:35 <manjeets> or is there a way to filter that out?
16:13:36 <ihrachys> manjeets: it probably reports on gate jobs only
16:13:46 <ihrachys> that would be my expectation
16:13:47 <dasm> i think electrocucaracha did something like that for a couple of bugs
16:13:48 <mtreinish> ihrachys: do you want to have it report uncategorized failures too? Your patch didn't have that
16:14:25 <ihrachys> mtreinish: yeah, could be; I haven't spent much time thinking on it yet, just sent a strawman to have it for discussion
16:14:49 <ihrachys> mtreinish: while we have you here, could you confirm it monitors gate queue only?
16:15:12 <mtreinish> the irc reporting?
16:15:15 <ihrachys> yea
16:15:26 <ihrachys> otherwise it would spam with irrelevant messages
16:15:52 <mtreinish> yeah it doesn't report to irc for check iirc
16:16:30 <mtreinish> but it's been a while since I looked at the code/config
16:16:36 <ihrachys> that makes sense
16:16:48 <ihrachys> mtreinish: is the bot useful for qa team?
16:17:05 <ihrachys> do you find it helps, or does it just spam with little actual benefit?
16:17:49 <mtreinish> so in the past it was quite useful, especially when we first started e-r
16:18:03 <mtreinish> but nowadays I don't think many people pay attention to it
16:18:16 <mtreinish> if you've got people who are willing to stay on top of it I think it'll be useful
16:18:52 <ihrachys> aye; we were looking for ways to direct attention to elastic-recheck tooling lately, and I am hopeful it will give proper signals
16:19:12 <ihrachys> ok next action item was: "manjeets to produce a gerrit dashboard for gate and functional failures"
16:19:18 <manjeets> https://github.com/manjeetbhatia/neutron_stuff/blob/master/create_gate_failure_dash.py
16:19:29 <manjeets> this can be used to create a fresh one at any time
16:19:39 <ihrachys> manjeets: do you have link to the resulting dashboard handy?
16:19:55 <ihrachys> use url shortener please
16:20:22 <manjeets> https://github.com/manjeetbhatia/neutron_stuff/blob/master/README.md
16:20:42 <manjeets> ihrachys, ok yes - there is only one patch in progress atm out of the 14 tickets
16:20:49 <manjeets> which it captured
16:21:02 <jlibosva> manjeets++
16:21:03 <ihrachys> is it because the dashboard doesn't capture some, or is that indeed all we have?
16:21:04 <jlibosva> good stuff
16:21:08 <kevinbenton> mtreinish: can we alter the bot to direct messages at whoever is currently talking in the channel? :)
16:21:14 <manjeets> I manually checked; some of them are abandoned
16:21:28 <manjeets> only one patch is open ihrachys
16:22:11 <mtreinish> kevinbenton: heh, it wouldn't be that hard. All the info is there to add that "feature" :)
16:22:51 <reedip_1> manjeets +1
16:22:58 <ihrachys> kevinbenton: haha. we should also make it customize the message depending on the rate of failure reports. sometimes it asks politely, sometimes 'everyone just shut up and fix the damned gate'
16:22:59 <dasm> kevinbenton: are you planning to silence irc channel? :)
16:23:50 <ihrachys> manjeets: thanks a lot for the work. do you plan to contribute the script to some official repo? like we did for milestone target dashboard.
16:24:27 <manjeets> ihrachys, some stuff is hardcoded; I'll send this script to neutron/tools once i fix that
16:24:49 <ihrachys> aye. we can then hook it into infra so that it shows up at http://status.openstack.org/reviews/
16:25:13 <ihrachys> #action manjeets to polish the dashboard script and propose it for neutron/tools/
16:25:28 <manjeets> yes we can do that
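For readers who have not opened the linked script, a minimal sketch of the underlying idea, assuming it collects gate-failure bug numbers (from Launchpad or a hand-maintained list) and turns them into a Gerrit custom-dashboard URL; the section titles, base query, and helper name here are illustrative, and the real logic lives in the script linked above:

    # Minimal sketch: turn a list of gate-failure bug numbers into a Gerrit
    # custom-dashboard URL with one section per bug.
    import urllib.parse

    def build_dashboard_url(bugs, base="https://review.openstack.org"):
        params = [
            ("title", "Neutron gate failure fixes"),
            ("foreach", "project:openstack/neutron status:open"),
        ]
        for bug in bugs:
            # Match changes whose commit message references the bug number.
            params.append(("Bug %s" % bug, 'message:"%s"' % bug))
        return base + "/#/dashboard/?" + urllib.parse.urlencode(params)

    print(build_dashboard_url(["1627106"]))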
16:25:46 <ihrachys> ok next item is "jlibosva to follow up on scenario failures"
16:25:54 <ihrachys> jlibosva: your stage. how does scenario job feel today?
16:26:06 <jlibosva> unfortunately no fixes were merged
16:26:12 <jlibosva> we have one for qos that's been failing a lot
16:26:18 * jlibosva looks for link
16:26:44 <jlibosva> https://review.openstack.org/#/c/430309/ - it's already approved but failing on gate
16:27:23 <jlibosva> another patch is to increase debuggability - https://review.openstack.org/#/c/427312/ - there is a sporadic failure where tempest is unable to ssh to the instance
16:28:02 <jlibosva> I suspect that's due to slow hypervisors; I compared the boot time of my local machine with the one in the gate, and the gate is ~14x slower
16:28:28 <jlibosva> given that it runs with ubuntu, which starts a lot of services, our tempest conf might just have an insufficient timeout
16:28:52 <jlibosva> but it's hard to tell without getting instance boot console output
16:29:04 <ihrachys> yeah, overall I noticed job timeouts in gate lately, but we will cover it a bit later
16:29:09 <jlibosva> I think those are two major issues, I haven't seen anything else
16:29:31 <jlibosva> ihrachys: yeah, that just made me think that maybe something in infra changed. But that's for a separate discussion
16:29:33 <ihrachys> jlibosva: generally speaking, what's the strategy for scenario job assuming we solve the remaining stability issues? do we have a plan to make it vote?
16:29:55 <jlibosva> ihrachys: no formal plan but I'd like to make it voting once it reaches some reasonable failure rate
16:30:23 <mlavalle> jlibosva: what's reasonable?
16:30:24 <jlibosva> we'll see after we get those two patches in and eventually increase the timeouts (which are already 2 times higher than with cirros, IIRC)
16:30:53 <jlibosva> mlavalle: optimistic guess - 10-15%
16:31:22 <clarkb> jlibosva: qemu is slow yes
16:31:27 <clarkb> thats normal and expected
16:31:33 <clarkb> (and why cirros is used most places)
16:32:02 <jlibosva> currently we have BUILD_TIMEOUT=392 for Ubuntu
16:32:35 <ihrachys> jlibosva: if we have multiple tests running in parallel, each starting instances on low-memory machines, then maybe reduce the number of test workers? that should give better per-test timing
16:33:34 <jlibosva> ihrachys: that's a good idea, given that the number of tests is very low. I'll try to send a patch for that
16:33:50 <ihrachys> #action jlibosva to try reducing parallelization for scenario tests
16:33:52 <jlibosva> at least to get some additional info whether that helps
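For reference, the knob being discussed is the test-runner worker count; a minimal local sketch of running the scenario tests with fewer workers follows. The test regex and worker count are illustrative, and in the gate this would be set through the job definition rather than a wrapper script like this:

    # Minimal sketch: run the neutron scenario tests with fewer parallel
    # workers to see whether instance boot timeouts go away on a slow machine.
    import subprocess
    import sys

    def run_scenario_tests(workers=2):
        cmd = [
            "tempest", "run",
            "--regex", "neutron.tests.tempest.scenario",
            "--concurrency", str(workers),
        ]
        return subprocess.call(cmd)

    if __name__ == "__main__":
        sys.exit(run_scenario_tests())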
16:34:30 <ihrachys> we gotta have a plan to make it vote, otherwise it will be another fullstack job that we break once in a while
16:35:02 <ihrachys> it also makes sense to plan what we do with the two flavors of the job we have - ovs and linuxbridge; but that's probably a discussion for another venue. ptg?
16:35:28 <ihrachys> ok, last action item was "ihrachys to read about how swappiness is supposed to work, and why it doesn't in gate"
16:36:36 <ihrachys> I haven't found much; the only thing I noticed is that we use =30 in devstack while the default is claimed to be 60 by several sources. my understanding is that we set it to =10 before just because some kernels were completely disabling it (=0)
16:36:57 <ihrachys> that being said, the previous attempt to raise it from =10 to =30 didn't help much; we still hit the oom-killer with swap free
16:37:29 <ihrachys> so another thing that came to mind is that a process (mysqld, qemu?) could lock its memory so it can't be swapped (there is a syscall for that)
16:37:46 <ihrachys> and afaik we lack info on those memory segments in the logs we currently collect
16:37:56 <ihrachys> so I am going to look at how we could dump that
16:38:10 <dasm> ihrachys: do you maybe have any links about swappiness to share?
16:38:24 <ihrachys> #action ihrachys to look at getting more info from kernel about ram-locked memory segments
16:38:47 <ihrachys> dasm: there are some you can find by mere googling. afaik there is no official doc, just blog posts and stackoverflow and such
16:38:53 <dasm> ihrachys: ack
16:39:18 <ihrachys> ok and we are done with action items, woohoo
16:39:35 <clarkb> ihrachys: yes mlock and friends. Also apparently kernel allocations aren't swappable
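As a concrete starting point for that action item, a minimal sketch of the kind of data collection meant here: read vm.swappiness and list processes holding locked (mlock'ed, hence unswappable) memory, based on the VmLck field in /proc; the output would be collected next to the other gate logs:

    # Minimal sketch: report vm.swappiness and any processes holding locked
    # (mlock'ed, hence unswappable) memory, based on /proc.
    import glob

    def read_first_line(path):
        with open(path) as f:
            return f.readline().strip()

    print("vm.swappiness =", read_first_line("/proc/sys/vm/swappiness"))

    for status in glob.glob("/proc/[0-9]*/status"):
        fields = {}
        try:
            with open(status) as f:
                for line in f:
                    key, _, value = line.partition(":")
                    fields[key] = value.strip()
        except IOError:  # process exited while we were reading
            continue
        locked = fields.get("VmLck", "0 kB")
        if locked.split()[0] != "0":
            print(fields.get("Name"), status, "VmLck =", locked)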
16:40:31 <ihrachys> now, let's discuss current gate issues as we know them
16:40:38 <ihrachys> #topic Gate issues
16:41:36 <ihrachys> #link https://goo.gl/8vigPl Open bugs
16:41:59 <ihrachys> the bug that was the most affecting in the past is at the top of the list
16:42:05 <ihrachys> #link https://bugs.launchpad.net/neutron/+bug/1627106 ovsdb native timeouts
16:42:05 <openstack> Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [Critical,In progress] - Assigned to Miguel Angel Ajo (mangelajo)
16:42:21 <ihrachys> otherwiseguy: kevinbenton: what's the latest status there? I see the patch was in gate but now bumped off?
16:42:28 <jlibosva> otherwiseguy has a related patch for it pushed recently
16:42:55 <jlibosva> https://review.openstack.org/#/c/429095/
16:43:13 <ihrachys> yeah that's the patch I meant
16:43:13 <otherwiseguy> ihrachys, It looked like another patch started using a function that I hadn't updated because we didn't use it.
16:43:51 <otherwiseguy> The new update modifies that function to get rid of the verify() call where we can. We can't remove it when calling "add" on a map column, but I don't think we do that.
16:44:09 <ihrachys> otherwiseguy: ok, other than that, do we have stability issues with that? I think that's what resulted in bumping off the gate in the past.
16:45:02 <otherwiseguy> ihrachys, I can only say "we'll see". Does "bumped off the gate" mean setting it WIP, or does it require something else?
16:45:33 <ihrachys> well I don't think it's necessarily WIP, but I assume armax wanted to collect more stats on its success.
16:45:47 <otherwiseguy> Because I set it WIP to add the change I just mentioned.
16:45:51 <ihrachys> afaik the bump was triggered by general gate instability, so we don't really know if it's because of this patch
16:46:21 <otherwiseguy> Not necessarily because of any stability issues.
16:46:22 <ihrachys> otherwiseguy: ok, I guess then we just wait for the new version, recheck a bunch of times again, and see if we can get it in
16:46:45 <ihrachys> otherwiseguy: thanks for working on it
16:47:03 <otherwiseguy> But if things magically got better after removing it from the gate, then that would scare me a bit.
16:47:50 <ihrachys> otherwiseguy: no, it's not that; it's just that armax was not sure it helps if we need to recheck so much
16:47:54 <ihrachys> to pass it in
16:48:18 <ihrachys> I believe it's just a matter of caution, we don't want to introduce another vector of instability
16:48:19 <otherwiseguy> We didn't need to recheck it so much, I was just rechecking a bunch to see if there were any timeout errors.
16:48:39 <otherwiseguy> There were failures occasionally, but none I could definitively match to the patch.
16:49:19 <ihrachys> ok, let's move on. another issue affecting our gates lately is tempest job timeouts due to slow machines.
16:49:35 <ihrachys> that one is a bit tricky, and doesn't really seem neutron specific
16:49:51 <ihrachys> I started discussion at http://lists.openstack.org/pipermail/openstack-dev/2017-February/111923.html
16:50:06 <ihrachys> but tl;dr is it seems sometimes machines we run tests on are very slow
16:50:11 <ihrachys> like 2-3 times slower than usual
16:50:19 <ihrachys> which makes zuul abruptly end runs in the middle
16:50:21 <ihrachys> after 2h
16:50:35 <ihrachys> so there is a bit of discussion inside the thread, you may want to have a look
16:50:47 <ihrachys> I don't think at this point we know the next steps to take on that one
16:50:58 <mlavalle> do you mean zuul abruptly ends the runs?
16:50:58 <ihrachys> it's also of concern that timeouts were not happening as often before
16:51:24 <ihrachys> mlavalle: well it's devstack itself I believe, it sets some timeout traps and kills tempest
16:52:02 <mlavalle> yeah, I see that quite often in patches I review. Just making sure we were talking about the same thing :-)
16:52:08 <dasm> are we still seeing timeouts? i've seen infra info about storage problems (which i believe should be fixed already).
16:52:14 <dasm> maybe both are unrelated
16:52:24 <clarkb> dasm: that should be unrelated
16:52:30 <dasm> clarkb: ack
16:52:39 <clarkb> dasm: storage problems affected centos7 package installs as we had a bad centos7 mirror index
16:52:42 <ihrachys> I was thinking that the timeouts started happening lately, and we touched swappiness lately too [also the number of rpc workers, though armax claims that could affect test times], so maybe we somehow pushed the failing jobs closer to the edge, enough to trigger timeouts on slower machines.
16:52:42 <clarkb> (but that should fail fast)
16:53:43 <dasm> ihrachys: are you suggesting reverting swappiness and verifying timeouts?
16:54:08 <dasm> s/verifying/checking
16:54:31 <ihrachys> I don't suggest touching anything just yet, I am just thinking aloud of what could make those timeouts triggered
16:54:50 <ihrachys> note it can just as well be external to openstack, but we can't help that, so it's better to focus on what we can do
16:55:20 <ihrachys> dasm: also note that it's not like all jobs time out; most of them pass successfully, and in a time that is a lot lower than 2h
16:55:29 <ihrachys> good runs are usually ~1h 10-20 mins
16:55:36 <ihrachys> sometimes even less than an hour
16:55:52 <ihrachys> so I suspect machines in different clouds are not uniform
16:56:13 <dasm> hmm.. maybe there's a correlation with the infrastructure? we should probably look at this
16:56:31 <dasm> i can try to verify if it's somehow related and if we're seeing timeouts just for specific clouds
16:56:33 <ihrachys> it may just be that for the slow machines we have, the run time got pushed a bit beyond the limit
16:56:49 <ihrachys> of note, there does not seem to be a correlation with the cloud used, or the project.
16:57:05 <dasm> hmm
16:57:51 <ihrachys> I am gonna dump cpu flags in devstack to see if there is a difference between what's available for the machines
16:57:59 <ihrachys> #action ihrachys to dump cpu flags in devstack gate
16:58:05 <ihrachys> apart from that, I am out of ideas on next steps; if someone has, please speak up in the email thread.
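A minimal sketch of the kind of dump meant by that action item; in devstack-gate this would most likely be a shell one-liner over /proc/cpuinfo, so this is the same idea expressed in Python:

    # Minimal sketch: dump the CPU model and feature flags so that slow and
    # fast gate nodes can be compared later from the collected logs.
    def cpu_summary(path="/proc/cpuinfo"):
        info = {}
        with open(path) as f:
            for line in f:
                key, _, value = line.partition(":")
                info.setdefault(key.strip(), value.strip())
        return info

    info = cpu_summary()
    print("model name:", info.get("model name"))
    print("flags:", info.get("flags"))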
16:58:37 <ihrachys> oh 2 mins left; and before we complete the meeting, I want to mention several patches we landed lately.
16:58:59 <ihrachys> one removes the requests dependency, which somewhat reduces memory usage for the l3 agent: https://review.openstack.org/#/c/432367/
16:59:38 <ihrachys> then we also enabled the dstat service in the functional gate: https://review.openstack.org/427358 so the next time we spot an ovsdb native timeout we can check the system load at that point.
16:59:56 <ihrachys> I don't expect it to give a definite answer, but who knows, at least there is chance
17:00:30 <ihrachys> there is also a patch up for review to fix cleanup for floating ips in some api tests: https://review.openstack.org/432713
17:00:39 <jlibosva> I have a patch for fullstack -
17:00:39 <ihrachys> ok we are at the top of the hour
17:00:45 <jlibosva> I wanted to raise here
17:00:49 <ihrachys> jlibosva: shoot quick
17:00:51 <jlibosva> https://review.openstack.org/#/c/433157/
17:00:57 <jlibosva> didn't report a bug as I was hitting it locally
17:01:08 <ihrachys> ack. let's follow up in the project channel
17:01:08 <jlibosva> and was too lazy to search logstash :) but I can do that if it's needed
17:01:10 <ihrachys> thanks everyone
17:01:12 <jlibosva> thanks!
17:01:12 <dasm> o/
17:01:13 <ihrachys> #endmeeting