16:01:16 <ihrachys> #startmeeting neutron_ci
16:01:17 <openstack> Meeting started Tue Feb 14 16:01:16 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:18 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 <openstack> The meeting name has been set to 'neutron_ci'
16:01:21 <ihrachys> hello everyone :)
16:01:22 <jlibosva> o/
16:01:24 <manjeets> hi
16:01:51 <kevinbenton> hi
16:01:54 <reedip_1> o/
16:02:02 <dasm> o/
16:02:17 <ihrachys> #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:02:32 <ihrachys> (or lack of it, I need to write up some stub topics)
16:02:43 <ihrachys> let's start with action items from the previous meeting
16:02:54 <ihrachys> #topic Action items from previous meeting
16:03:04 <ihrachys> "ihrachys to follow up with armax on periodic functional job not showing up in grafana"
16:03:19 <ihrachys> so indeed the periodic functional does not show up in http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:03:20 <ihrachys> BUT
16:03:38 <ihrachys> we can see runs in http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:04:11 <ihrachys> and armax suggested that it does not show up because of a glitch in grafana that makes it impossible to draw a trend while results are identical
16:04:30 <ihrachys> so we believe that till it fails, it won't be in the grafana
16:04:53 <jlibosva> it's also on openstack-health http://status.openstack.org/openstack-health/#/job/periodic-neutron-dsvm-functional-ubuntu-xenial
16:04:57 <ihrachys> that's actually interesting that it hasn't failed in 15 runs
16:05:10 <ihrachys> jlibosva: oh thanks for the link!
16:05:50 <ihrachys> we saw failure rate in check queue at 30% to 40% before, so even though today it's 10%, the chance of not hitting it once in 15 runs in the past is quite low
16:06:10 <ihrachys> so either we are lucky, or there is some other thing in play (maybe we allocate different VMs for periodic?)
16:06:42 <ihrachys> so far, we are going to monitor, and maybe in a week we will have a failure; if not, we will look closer at job definitions.
16:07:10 <ihrachys> ok next action was: "ihrachys to look at e-r bot for openstack-neutron channel"
16:07:22 <ihrachys> that's actually something to decide first if we even want it
16:07:52 <ihrachys> I noticed some other projects like nova or glance have a irc bot that reports into their channels on captured classified failures
16:08:13 <ihrachys> I was thinking, it may help to give us some better understanding of what hits our gates
16:08:22 <kevinbenton> i vote +1 for that
16:08:24 <ihrachys> so far I proposed a patch that enables the bot at https://review.openstack.org/#/c/433735/
16:08:42 <jlibosva> +1 from me too. If it will become annoying or un-useful, we can disable it anytime
16:08:42 <ihrachys> I guess we can enable it and see how it goes; if it spams too much, we tweak or disable it later
16:08:58 <manjeets> +1
16:09:22 <ihrachys> ok and I see mtreinish comment that it's wrong place to do it; I will update the right one after the meeting
16:09:32 <ihrachys> as long as we agree on the direction, which I believe we do :)
16:09:54 <dasm> ihrachys: how does it look on channel? irc bot shows url? do you have example of that?
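A quick back-of-the-envelope check of the "either we are lucky" point above: with the 30-40% per-run failure rate quoted for the check queue, 15 consecutive clean periodic runs would be very unlikely, whereas at the 10% seen today it is plausible. A minimal sketch in plain Python (the failure rates are the ones quoted in the discussion, not freshly measured):

```python
# Probability that N independent runs all pass, given a per-run failure rate.
def prob_all_pass(failure_rate, runs=15):
    return (1.0 - failure_rate) ** runs

for rate in (0.10, 0.30, 0.40):
    print("failure rate %d%% -> P(15 clean runs) = %.4f"
          % (rate * 100, prob_all_pass(rate)))
# ~0.21 at 10%, ~0.005 at 30%, ~0.0005 at 40% -- so 15 clean periodic runs
# would be surprising if that job really failed as often as the check-queue one.
```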
16:10:45 <dasm> i'm ok with enabling it, just asking to know what'll be changed
16:11:36 <kevinbenton> every 2 seconds it will report if we have any failures :)
16:11:42 <dasm> kevinbenton: ++ :)
16:11:48 <ihrachys> like in http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-16.log.html#t2017-01-16T03:12:54
16:12:02 <ihrachys> though if it knows the error, it will report the bug and all
16:12:06 <dasm> ihrachys: ack
16:12:47 <ihrachys> ok, that's an example of bug recognized: http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-08.log.html#t2017-01-08T00:13:53
16:13:07 <ihrachys> I guess we won't know if it's useful until we try
16:13:13 <manjeets> i have a question what if patch is culprit over the failure of job will it still report ?
16:13:32 <dasm> ihrachys: btw "recognized bugs". if i recall correctly, we have to manually categorize recognized bugs. is it true?
16:13:35 <manjeets> or there is a way to filter out
16:13:36 <ihrachys> manjeets: it probably reports on gate jobs only
16:13:46 <ihrachys> that would be my expectation
16:13:47 <dasm> i think, electrocucaracha did something like that for couple bugs
16:13:48 <mtreinish> ihrachys: do you want to have it report uncategorized failures too? Your patch didn't have that
16:14:25 <ihrachys> mtreinish: yeah, could be; I haven't spent much time thinking on it yet, just sent a strawman to have it for discussion
16:14:49 <ihrachys> mtreinish: while we have you here, could you confirm it monitors gate queue only?
16:15:12 <mtreinish> the irc reporting?
16:15:15 <ihrachys> yea
16:15:26 <ihrachys> otherwise it would spam with irrelevant messages
16:15:52 <mtreinish> yeah it doesn't report to irc for check iirc
16:16:30 <mtreinish> but it's been a while since I looked at the code/config
16:16:36 <ihrachys> that makes sense
16:16:48 <ihrachys> mtreinish: is the bot useful for qa team?
16:17:05 <ihrachys> do you find it helps, or just spams with low actual profit?
16:17:49 <mtreinish> so in the past it was quite useful, especially when we first started e-r
16:18:03 <mtreinish> but nowadays I don't think many people pay attention to it
16:18:16 <mtreinish> if you've got people who are willing to stay on top of it I think it'll be useful
16:18:52 <ihrachys> aye; we were looking for ways to direct attention to elastic-recheck tooling lately, and I am hopeful it will give proper signals
16:19:12 <ihrachys> ok next action item was: "manjeets to produce a gerrit dashboard for gate and functional failures"
16:19:18 <manjeets> https://github.com/manjeetbhatia/neutron_stuff/blob/master/create_gate_failure_dash.py
16:19:29 <manjeets> this can be used to create fresh one all the time
16:19:39 <ihrachys> manjeets: do you have link to the resulting dashboard handy?
16:19:55 <ihrachys> use url shortener please
16:20:22 <manjeets> https://github.com/manjeetbhatia/neutron_stuff/blob/master/README.md
16:20:42 <manjeets> ihrachys, ok yes there is only one patch from 14 tickets in progress atm
16:20:49 <manjeets> which it captured
16:21:02 <jlibosva> manjeets++
16:21:03 <ihrachys> is it because the dashboard doesn't capture some, or that's indeed all we have?
16:21:04 <jlibosva> good stuff
16:21:08 <kevinbenton> mtreinish: can we alter the bot to direct messages at whoever is currently talking in the channel? :)
16:21:14 <manjeets> I manually checked some of them are abandoned
16:21:28 <manjeets> only one patch is open ihrachys
16:22:11 <mtreinish> kevinbenton: heh, it wouldn't be that hard. All the info is there to add that "feature" :)
16:22:51 <reedip_1> manjeets +1
16:22:58 <ihrachys> kevinbenton: haha. we should also make it customize the message depending on the rate of failure reports. sometimes it asks politely, sometimes 'everyone just shut up and fix the damned gate'
16:22:59 <dasm> kevinbenton: are you planning to silence irc channel? :)
16:23:50 <ihrachys> manjeets: thanks a lot for the work. do you plan to contribute the script to some official repo? like we did for milestone target dashboard.
16:24:27 <manjeets> ihrachys, some stuff is hardcoded I'll send this script to neutron/tools once i fix those
16:24:49 <ihrachys> aye. we then can hook it into infra so that it shows up at http://status.openstack.org/reviews/
16:25:13 <ihrachys> #action manjeets to polish the dashboard script and propose it for neutron/tools/
16:25:28 <manjeets> yes we can do that
16:25:46 <ihrachys> ok next item is "jlibosva to follow up on scenario failures"
16:25:54 <ihrachys> jlibosva: your stage. how does scenario job feel today?
16:26:06 <jlibosva> unfortunately no fixes were merged
16:26:12 <jlibosva> we have one for qos that's been failing a lot
16:26:18 * jlibosva looks for link
16:26:44 <jlibosva> https://review.openstack.org/#/c/430309/ - it's already approved but failing on gate
16:27:23 <jlibosva> another patch is to increase debugability - https://review.openstack.org/#/c/427312/ - there is a sporadic failure where tempest is unable to ssh to the instance
16:28:02 <jlibosva> I suspect that that's due to slow hypervisors, I compared locally boottime of my machine and the one on gate and gate is ~14x slower
16:28:28 <jlibosva> given that it runs with ubuntu that starts a lot of services, our tempest conf might just have insufficient timeout
16:28:52 <jlibosva> but it's hard to tell without getting instance boot console output
16:29:04 <ihrachys> yeah, overall I noticed job timeouts in gate lately, but we will cover it a bit later
16:29:09 <jlibosva> I think those are two major issues, I haven't seen anything else
16:29:31 <jlibosva> ihrachys: yeah, that made me just think that maybe some infra changed. But that's for a separate discussion
16:29:33 <ihrachys> jlibosva: generally speaking, what's the strategy for scenario job assuming we solve the remaining stability issues? do we have a plan to make it vote?
16:29:55 <jlibosva> ihrachys: no formal plan but I'd like to make it voting once it reaches some reasonable failure rate
16:30:23 <mlavalle> jlibosva: what's reasonable?
16:30:24 <jlibosva> we'll see after we get those two patches in and eventually increase timeout (which are already 2 times higher than with cirros IIRC)
16:30:53 <jlibosva> mlavalle: optimistic guess - 10-15%
16:31:22 <clarkb> jlibosva: qemu is slow yes
16:31:27 <clarkb> thats normal and expected
16:31:33 <clarkb> (and why cirros is used most places)
16:32:02 <jlibosva> currently we have BUILD_TIMEOUT=392 for Ubuntu
16:32:35 <ihrachys> jlibosva: if we have mutiple tests running in parallel starting instances on low memory machines, then maybe reduce number of test workers? that should give better per-test timing?
16:33:34 <jlibosva> ihrachys: that's a good idea. Given that amount of tests is very low. I'll try to send a patch for that
16:33:50 <ihrachys> #action jlibosva to try reducing parallelization for scenario tests
16:33:52 <jlibosva> at least to get some additional info whether that helps
16:34:30 <ihrachys> we gotta have a plan to make it vote, otherwise it will be another fullstack job that we break once in a while
16:35:02 <ihrachys> it also makes sense to plan for what we do with two flavors of the job we have - ovs and linuxbridge; but that's probably a discussion for another venue. ptg?
16:35:28 <ihrachys> ok, last action item was "ihrachys to read about how swappiness is supposed to work, and why it doesn't in gate"
16:36:36 <ihrachys> I haven't found much; the only thing I noticed is that we use =30 in devstack while default is claimed to be 60 by several sources. my understanding is that we set it to =10 before just because some kernels were completely disabling it (=0)
16:36:57 <ihrachys> that being said, the previous attempt to raise from =10 to =30 didn't help too much, we still hit oom-killer with swap free
16:37:29 <ihrachys> so another thing that came to my mind is that a process (mysqld, qemu?) could block memory from swapping (there is a syscall for that)
16:37:46 <ihrachys> and afaik we lack info on those memory segments in current logs collected
16:37:56 <ihrachys> so I am going to look at how we could dump that
16:38:10 <dasm> ihrachys: do you maybe have any links about swappiness to share?
16:38:24 <ihrachys> #action ihrachys to look at getting more info from kernel about ram-locked memory segments
16:38:47 <ihrachys> dasm: there are some you can find by mere googling. afaik there is no official doc, just blog posts and stackoverflow and such
16:38:53 <dasm> ihrachys: ack
16:39:18 <ihrachys> ok and we are done with action items, woohoo
16:39:35 <clarkb> ihrachys: yes mlock and friends. Also apparently kernel allocations aren't swappable
16:40:31 <ihrachys> now, let's discuss current gate issues as we know them
16:40:38 <ihrachys> #topic Gate issues
16:41:36 <ihrachys> #link https://goo.gl/8vigPl Open bugs
16:41:59 <ihrachys> the bug that was the most affecting in the past is at the top of the list
16:42:05 <ihrachys> #link https://bugs.launchpad.net/neutron/+bug/1627106 ovsdb native timeouts
16:42:05 <openstack> Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [Critical,In progress] - Assigned to Miguel Angel Ajo (mangelajo)
16:42:21 <ihrachys> otherwiseguy: kevinbenton: what's the latest status there? I see the patch was in gate but now bumped off?
16:42:28 <jlibosva> otherwiseguy has a related patch for it pushed recently
16:42:55 <jlibosva> https://review.openstack.org/#/c/429095/
16:43:13 <ihrachys> yeah that's the patch I meant
16:43:13 <otherwiseguy> ihrachys, It looked like another patch started using a function that I hadn't updated because we didn't use it.
16:43:51 <otherwiseguy> New update modifies that function to get rid of the verify() call where we can. Can't remove it when calling "add" on a map column, but don't think we do that.
16:44:09 <ihrachys> otherwiseguy: ok, other than that, do we have stability issues with that? I think that's what resulted in bumping off the gate in the past.
16:45:02 <otherwiseguy> ihrachys, I can only say "we'll see". Does "Bumped off the gate mean setting it WIP" or does it require something else?
16:45:33 <ihrachys> well I don't think it's necessarily WIP, but I assume armax wanted to collect more stats on its success.
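Coming back to the swappiness discussion and the action item on RAM-locked memory segments above: a process that calls mlock()/mlockall() pins its pages so the kernel never swaps them out regardless of vm.swappiness, which would be consistent with the oom-killer firing while swap is still free. A minimal sketch of the kind of data that could be collected from a gate VM (an assumed approach for illustration, not the actual logging change):

```python
#!/usr/bin/env python
# Sketch only: list processes that hold memory-locked (mlock'ed) pages, which
# the kernel will not swap out no matter what vm.swappiness is set to.
import glob


def locked_kb(pid_dir):
    """Return the VmLck value (kB) from /proc/<pid>/status, or 0."""
    try:
        with open(pid_dir + '/status') as f:
            for line in f:
                if line.startswith('VmLck:'):
                    return int(line.split()[1])
    except (IOError, OSError):  # process may have exited while we scan
        pass
    return 0


def comm(pid_dir):
    try:
        with open(pid_dir + '/comm') as f:
            return f.read().strip()
    except (IOError, OSError):
        return '?'


for pid_dir in sorted(glob.glob('/proc/[0-9]*')):
    kb = locked_kb(pid_dir)
    if kb:
        print('%s (%s): %d kB locked' % (pid_dir, comm(pid_dir), kb))

# The "Mlocked:" and "Unevictable:" lines in /proc/meminfo give the
# system-wide view of the same thing.
```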
16:45:47 <otherwiseguy> Because I set it WIP to add the change I just mentioned.
16:45:51 <ihrachys> afaik the bump was triggered by general gate instability, so we don't really know if it's because of this patch
16:46:21 <otherwiseguy> Not necessarily because of any stability.
16:46:22 <ihrachys> otherwiseguy: ok, I guess then we just wait for the new version, recheck a bunch of times again, and see if we can get it in
16:46:24 <otherwiseguy> issues
16:46:45 <ihrachys> otherwiseguy: thanks for working on it
16:47:03 <otherwiseguy> But if things magically got better after removing it from the gate, then that would scare me a bit.
16:47:50 <ihrachys> otherwiseguy: no, it's not that, it's just armax was not sure if it helps if we need to recheck so much
16:47:54 <ihrachys> to pass it in
16:48:18 <ihrachys> I believe it's just a matter of caution, we don't want to introduce another vector of instability
16:48:19 <otherwiseguy> We didn't need to recheck it so much, I was just rechecking a bunch to see if there were any timeout errors.
16:48:39 <otherwiseguy> There were failures occasionally, but none I could definitely match to the patch.
16:49:19 <ihrachys> ok, let's move on. another issue that affects our gates lately is tempest job timeouts due to slow machine run.
16:49:35 <ihrachys> that one is a bit tricky, and doesn't really seem neutron specific
16:49:51 <ihrachys> I started discussion at http://lists.openstack.org/pipermail/openstack-dev/2017-February/111923.html
16:50:06 <ihrachys> but tl;dr is it seems sometimes machines we run tests on are very slow
16:50:11 <ihrachys> like 2-3 times slower than usual
16:50:19 <ihrachys> which makes zuul abrupt runs in the middle
16:50:21 <ihrachys> after 2h
16:50:35 <ihrachys> so there is a bit of discussion inside the thread, you may want to have a look
16:50:47 <ihrachys> I don't think at this point we know the next steps to take on that one
16:50:58 <mlavalle> do you mean zuul abruptly ends the runs?
16:50:58 <ihrachys> it's also of concern that timeouts were not happening as often before
16:51:24 <ihrachys> mlavalle: well it's devstack itself I believe, it sets some timeout traps and kill tempest
16:52:02 <mlavalle> yeah, I see that quite often in patchses I review. Just making sure we were talking about the same thing :-)
16:52:08 <dasm> are we still seeing timeouts? i've seen infra info about storage problems (which i believe should be fixed already).
16:52:14 <dasm> maybe both are unrelated
16:52:24 <clarkb> dasm: that should be unrelated
16:52:30 <dasm> clarkb: ack
16:52:39 <clarkb> dasm: storage problems affected centos7 package installs as we had a bad centos7 mirror index
16:52:42 <ihrachys> I was thinking that timeouts are happening lately, and we touched swappiness lately too [also number of rpc workers but armax claims it could affect test times], so I was thinking, maybe we somehow pushed the failing jobs to the edge some more so as to trigger timeouts on slower machines.
16:52:42 <clarkb> (but that should fail fast)
16:53:43 <dasm> ihrachys: are you suggesting reverting swappiness and verifying timeouts?
16:54:08 <dasm> s/verifying/checking
16:54:31 <ihrachys> I don't suggest touching anything just yet, I am just thinking aloud of what could make those timeouts triggered
16:54:50 <ihrachys> note it can be as well external to openstack, but we can't help that, so it's better to focus for what we can do
16:55:20 <ihrachys> dasm: also note that it's not like all jobs timeout; most of them pass successfully, and in time that is a lot lower than 2h
16:55:29 <ihrachys> good runs are usually ~1h 10-20 mins
16:55:36 <ihrachys> sometimes even less than an hour
16:55:52 <ihrachys> so I suspect machines in different clouds are not uniform
16:56:13 <dasm> hmm.. maybe correlation between infrastructure? we should probably look at this
16:56:31 <dasm> i can try to verify if it's somehow related and if we're seeing timeouts just for specific clouds
16:56:33 <ihrachys> it may be just that for those slow machines we have, we may have gotten the time pushed a bit beyond the limit
16:56:49 <ihrachys> of note, there does not seem to be correlation with cloud used, or project.
16:57:05 <dasm> hmm
16:57:51 <ihrachys> I am gonna dump cpu flags in devstack to see if there is a difference between what's available for the machines
16:57:59 <ihrachys> #action ihrachys to dump cpu flags in devstack gate
16:58:05 <ihrachys> apart from that, I am out of ideas on next steps; if someone has, please speak up in the email thread.
16:58:37 <ihrachys> oh 2 mins left; and before we complete the meeting, I want to mention several patches we landed lately.
16:58:59 <ihrachys> one is removing requests dependency that somewhat reduces memory usage for l3 agent: https://review.openstack.org/#/c/432367/
16:59:38 <ihrachys> then we also enabled dstat service in functional gate: https://review.openstack.org/427358 so now next time we spot a timeout in ovsdb native we can check the load of the system at that point.
16:59:56 <ihrachys> I don't expect it to give a definite answer, but who knows, at least there is chance
17:00:30 <ihrachys> there is also a patch up for review to fix cleanup for floating ips in some api tests: https://review.openstack.org/432713
17:00:39 <jlibosva> I have a patch for fullstack -
17:00:39 <ihrachys> ok we are at the top of the hour
17:00:45 <jlibosva> I wanted to raise here
17:00:49 <ihrachys> jlibosva: shoot quick
17:00:51 <jlibosva> https://review.openstack.org/#/c/433157/
17:00:57 <jlibosva> didn't report a bug as I was hitting it locally
17:01:08 <ihrachys> ack. let's follow up in the project channel
17:01:08 <jlibosva> and was lazy to search logstash :) but I can do that if it's eneded
17:01:10 <ihrachys> thanks everyone
17:01:12 <jlibosva> thanks!
17:01:12 <dasm> o/
17:01:13 <ihrachys> #endmeeting
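For the "dump cpu flags in devstack gate" action item above, the goal is simply to log enough about the test VM's CPU that slow gate nodes can be compared against fast ones. A minimal sketch of what such a dump could look like (an illustration of the idea, not the actual devstack change):

```python
#!/usr/bin/env python
# Sketch only: print basic CPU details of the test VM so slow gate nodes can
# be compared against fast ones.
from __future__ import print_function
import multiprocessing

WANTED = ('model name', 'cpu MHz', 'bogomips', 'flags')
seen = set()

with open('/proc/cpuinfo') as f:
    for line in f:
        key = line.split(':', 1)[0].strip()
        if key in WANTED and key not in seen:  # first logical CPU is enough
            seen.add(key)
            print(line.rstrip())

print('logical CPUs:', multiprocessing.cpu_count())
```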