16:01:16 #startmeeting neutron_ci
16:01:17 Meeting started Tue Feb 14 16:01:16 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:18 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:21 The meeting name has been set to 'neutron_ci'
16:01:21 hello everyone :)
16:01:22 o/
16:01:24 hi
16:01:51 hi
16:01:54 o/
16:02:02 o/
16:02:17 #link https://wiki.openstack.org/wiki/Meetings/NeutronCI Agenda
16:02:32 (or lack of it, I need to write up some stub topics)
16:02:43 let's start with action items from the previous meeting
16:02:54 #topic Action items from previous meeting
16:03:04 "ihrachys to follow up with armax on periodic functional job not showing up in grafana"
16:03:19 so indeed the periodic functional job does not show up in http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=7&fullscreen
16:03:20 BUT
16:03:38 we can see runs in http://logs.openstack.org/periodic/periodic-neutron-dsvm-functional-ubuntu-xenial/
16:04:11 and armax suggested that it does not show up because of a glitch in grafana that makes it impossible to draw a trend while all results are identical
16:04:30 so we believe that until it fails, it won't show up in grafana
16:04:53 it's also on openstack-health http://status.openstack.org/openstack-health/#/job/periodic-neutron-dsvm-functional-ubuntu-xenial
16:04:57 it's actually interesting that it hasn't failed in 15 runs
16:05:10 jlibosva: oh thanks for the link!
16:05:50 we saw a failure rate of 30% to 40% in the check queue before, so even though today it's 10%, the chance of not hitting a failure even once in 15 runs is quite low
16:06:10 so either we are lucky, or there is something else in play (maybe we allocate different VMs for periodic jobs?)
16:06:42 so far, we are going to monitor, and maybe in a week we will have a failure; if not, we will look closer at the job definitions.
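[Editorial aside: a rough illustration of why the 15-run pass streak looks suspicious. Treating runs as independent (our assumption) and plugging in the failure rates quoted above, 15 consecutive passes would be very unlikely at the historical 30-40% rate, but quite plausible at today's 10%. A minimal back-of-the-envelope check in Python:]

    # Probability of a periodic job passing 15 runs in a row, assuming each run
    # is independent and fails with probability p (rates quoted in the meeting).
    for p in (0.40, 0.30, 0.10):
        streak = (1 - p) ** 15
        print(f"failure rate {p:.0%}: P(15 straight passes) = {streak:.2%}")
    # failure rate 40%: P(15 straight passes) = 0.05%
    # failure rate 30%: P(15 straight passes) = 0.47%
    # failure rate 10%: P(15 straight passes) = 20.59%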
16:07:10 ok, next action was: "ihrachys to look at e-r bot for openstack-neutron channel"
16:07:22 that's actually something to decide first - do we even want it?
16:07:52 I noticed some other projects like nova or glance have an irc bot that reports classified failures it captures into their channels
16:08:13 I was thinking it may help give us a better understanding of what hits our gates
16:08:22 i vote +1 for that
16:08:24 so far I proposed a patch that enables the bot at https://review.openstack.org/#/c/433735/
16:08:42 +1 from me too. If it becomes annoying or not useful, we can disable it any time
16:08:42 I guess we can enable it and see how it goes; if it spams too much, we tweak or disable it later
16:08:58 +1
16:09:22 ok, and I see mtreinish's comment that it's the wrong place to do it; I will update the right one after the meeting
16:09:32 as long as we agree on the direction, which I believe we do :)
16:09:54 ihrachys: how does it look in the channel? does the irc bot show a url? do you have an example of that?
16:10:45 i'm ok with enabling it, just asking to know what'll be changed
16:11:36 every 2 seconds it will report if we have any failures :)
16:11:42 kevinbenton: ++ :)
16:11:48 like in http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-16.log.html#t2017-01-16T03:12:54
16:12:02 though if it knows the error, it will report the bug and all
16:12:06 ihrachys: ack
16:12:47 ok, here is an example of a recognized bug: http://eavesdrop.openstack.org/irclogs/%23openstack-qa/%23openstack-qa.2017-01-08.log.html#t2017-01-08T00:13:53
16:13:07 I guess we won't know if it's useful until we try
16:13:13 i have a question: what if the patch itself is the culprit for the job failure, will it still report?
16:13:32 ihrachys: btw "recognized bugs" - if i recall correctly, we have to manually categorize recognized bugs. is that true?
16:13:35 or is there a way to filter those out?
16:13:36 manjeets: it probably reports on gate jobs only
16:13:46 that would be my expectation
16:13:47 i think electrocucaracha did something like that for a couple of bugs
16:13:48 ihrachys: do you want to have it report uncategorized failures too? Your patch didn't have that
16:14:25 mtreinish: yeah, could be; I haven't spent much time thinking about it yet, just sent a strawman to have something for discussion
16:14:49 mtreinish: while we have you here, could you confirm it monitors the gate queue only?
16:15:12 the irc reporting?
16:15:15 yea
16:15:26 otherwise it would spam with irrelevant messages
16:15:52 yeah, it doesn't report to irc for check iirc
16:16:30 but it's been a while since I looked at the code/config
16:16:36 that makes sense
16:16:48 mtreinish: is the bot useful for the qa team?
16:17:05 do you find it helps, or does it just spam with little actual benefit?
16:17:49 so in the past it was quite useful, especially when we first started e-r
16:18:03 but nowadays I don't think many people pay attention to it
16:18:16 if you've got people who are willing to stay on top of it I think it'll be useful
16:18:52 aye; we were looking for ways to direct attention to elastic-recheck tooling lately, and I am hopeful it will give proper signals
16:19:12 ok, next action item was: "manjeets to produce a gerrit dashboard for gate and functional failures"
16:19:18 https://github.com/manjeetbhatia/neutron_stuff/blob/master/create_gate_failure_dash.py
16:19:29 this can be used to create a fresh one at any time
16:19:39 manjeets: do you have a link to the resulting dashboard handy?
16:19:55 use a url shortener please
16:20:22 https://github.com/manjeetbhatia/neutron_stuff/blob/master/README.md
16:20:42 ihrachys, ok yes, there is only one patch from the 14 tickets in progress atm
16:20:49 which it captured
16:21:02 manjeets++
16:21:03 is it because the dashboard doesn't capture some, or is that indeed all we have?
16:21:04 good stuff
16:21:08 mtreinish: can we alter the bot to direct messages at whoever is currently talking in the channel? :)
16:21:14 I manually checked; some of them are abandoned
16:21:28 only one patch is open ihrachys
16:22:11 kevinbenton: heh, it wouldn't be that hard. All the info is there to add that "feature" :)
16:22:51 manjeets +1
16:22:58 kevinbenton: haha. we should also make it customize the message depending on the rate of failure reports. sometimes it asks politely, sometimes 'everyone just shut up and fix the damned gate'
16:22:59 kevinbenton: are you planning to silence the irc channel? :)
16:23:50 manjeets: thanks a lot for the work. do you plan to contribute the script to some official repo? like we did for the milestone target dashboard.
16:24:27 ihrachys, some stuff is hardcoded; I'll send this script to neutron/tools once i fix those
16:24:49 aye. we then can hook it into infra so that it shows up at http://status.openstack.org/reviews/
16:25:13 #action manjeets to polish the dashboard script and propose it for neutron/tools/
16:25:28 yes we can do that
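[Editorial aside: for context on the kind of query a gate-failure dashboard script boils down to. This is a purely hypothetical sketch, not manjeets' actual create_gate_failure_dash.py; it assumes Gerrit's standard REST change-query endpoint on review.openstack.org and a hand-picked list of gate-failure bug numbers.]

    # Hypothetical sketch: list open neutron reviews mentioning known gate-failure
    # bugs - the raw data a gate-failure gerrit dashboard would be built from.
    import json
    import requests

    GERRIT = "https://review.openstack.org"
    GATE_BUGS = ["1627106"]  # example bug number taken from later in this meeting

    for bug in GATE_BUGS:
        query = f"status:open project:openstack/neutron message:{bug}"
        resp = requests.get(f"{GERRIT}/changes/", params={"q": query})
        # Gerrit prefixes JSON responses with ")]}'" to prevent XSSI; strip it.
        changes = json.loads(resp.text.split("\n", 1)[1])
        for change in changes:
            print(bug, change["_number"], change["subject"])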
16:25:46 ok, next item is "jlibosva to follow up on scenario failures"
16:25:54 jlibosva: your stage. how does the scenario job feel today?
16:26:06 unfortunately no fixes were merged
16:26:12 we have one for qos that's been failing a lot
16:26:18 * jlibosva looks for link
16:26:44 https://review.openstack.org/#/c/430309/ - it's already approved but failing in the gate
16:27:23 another patch is to increase debuggability - https://review.openstack.org/#/c/427312/ - there is a sporadic failure where tempest is unable to ssh to the instance
16:28:02 I suspect that's due to slow hypervisors; I compared boot times locally on my machine and on the gate, and the gate is ~14x slower
16:28:28 given that it runs with ubuntu, which starts a lot of services, our tempest conf might just have an insufficient timeout
16:28:52 but it's hard to tell without getting the instance boot console output
16:29:04 yeah, overall I noticed job timeouts in the gate lately, but we will cover that a bit later
16:29:09 I think those are the two major issues, I haven't seen anything else
16:29:31 ihrachys: yeah, that made me think that maybe something in infra changed. But that's for a separate discussion
16:29:33 jlibosva: generally speaking, what's the strategy for the scenario job assuming we solve the remaining stability issues? do we have a plan to make it vote?
16:29:55 ihrachys: no formal plan, but I'd like to make it voting once it reaches some reasonable failure rate
16:30:23 jlibosva: what's reasonable?
16:30:24 we'll see after we get those two patches in and eventually increase the timeouts (which are already 2x higher than with cirros IIRC)
16:30:53 mlavalle: optimistic guess - 10-15%
16:31:22 jlibosva: qemu is slow, yes
16:31:27 that's normal and expected
16:31:33 (and why cirros is used in most places)
16:32:02 currently we have BUILD_TIMEOUT=392 for Ubuntu
16:32:35 jlibosva: if we have multiple tests running in parallel starting instances on low-memory machines, then maybe reduce the number of test workers? that should give better per-test timing
16:33:34 ihrachys: that's a good idea, given that the number of tests is very low. I'll try to send a patch for that
16:33:50 #action jlibosva to try reducing parallelization for scenario tests
16:33:52 at least to get some additional info on whether that helps
16:34:30 we gotta have a plan to make it vote, otherwise it will be another fullstack job that we break once in a while
16:35:02 it also makes sense to plan for what we do with the two flavors of the job we have - ovs and linuxbridge; but that's probably a discussion for another venue. ptg?
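[Editorial aside: to make the ssh-timeout theory above concrete, the check in question boils down to polling the guest until a fixed deadline expires, so a hypervisor that boots guests many times slower than usual turns that deadline into a sporadic failure. A simplified, self-contained sketch of the pattern - not tempest's actual code; the 392s value is the BUILD_TIMEOUT quoted above, the rest are placeholders:]

    # Simplified sketch of "wait until the instance answers on the ssh port or
    # give up after a fixed deadline" - the mechanism behind the sporadic failures.
    import socket
    import time

    def wait_for_ssh(host, port=22, timeout=392.0, interval=5.0):
        """Return True once a TCP connect to host:port succeeds, False on timeout."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                with socket.create_connection((host, port), timeout=interval):
                    return True
            except OSError:
                time.sleep(interval)
        return False

    # On a hypervisor that is ~14x slower, a boot that normally takes 60s needs
    # ~840s, which blows through a 392s deadline even though nothing is "broken".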
16:35:28 ok, the last action item was "ihrachys to read about how swappiness is supposed to work, and why it doesn't in gate"
16:36:36 I haven't found much; the only thing I noticed is that we use =30 in devstack while the default is claimed to be 60 by several sources. my understanding is that we set it to =10 before just because some kernels were completely disabling it (=0)
16:36:57 that being said, the previous attempt to raise it from =10 to =30 didn't help much, we still hit the oom-killer with swap free
16:37:29 so another thing that came to my mind is that a process (mysqld, qemu?) could block memory from swapping (there is a syscall for that)
16:37:46 and afaik we lack info on those memory segments in the logs we currently collect
16:37:56 so I am going to look at how we could dump that
16:38:10 ihrachys: do you maybe have any links about swappiness to share?
16:38:24 #action ihrachys to look at getting more info from the kernel about ram-locked memory segments
16:38:47 dasm: there are some you can find by mere googling. afaik there is no official doc, just blog posts and stackoverflow and such
16:38:53 ihrachys: ack
16:39:18 ok, and we are done with action items, woohoo
16:39:35 ihrachys: yes, mlock and friends. Also, apparently kernel allocations aren't swappable
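[Editorial aside: one cheap way to get the missing data on locked (unswappable) memory, assuming the gate nodes are Linux with /proc available, is to dump the VmLck field from /proc/<pid>/status for every process. This is only a sketch of the idea, not the actual log-collection change being proposed:]

    # Sketch: report processes holding mlock()ed memory, which cannot be swapped
    # out and so never shows up as swap usage when the oom-killer fires.
    import glob

    for status_path in glob.glob("/proc/[0-9]*/status"):
        try:
            with open(status_path) as f:
                fields = dict(line.split(":", 1) for line in f if ":" in line)
        except OSError:
            continue  # process exited while we were iterating
        locked = fields.get("VmLck", "0 kB").strip()
        if not locked.startswith("0 "):
            pid = status_path.split("/")[2]
            print(f"{pid} {fields['Name'].strip()}: VmLck {locked}")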
16:40:31 now, let's discuss current gate issues as we know them
16:40:38 #topic Gate issues
16:41:36 #link https://goo.gl/8vigPl Open bugs
16:41:59 the bug that had the biggest impact in the past is at the top of the list
16:42:05 #link https://bugs.launchpad.net/neutron/+bug/1627106 ovsdb native timeouts
16:42:05 Launchpad bug 1627106 in neutron "TimeoutException while executing tests adding bridge using OVSDB native" [Critical,In progress] - Assigned to Miguel Angel Ajo (mangelajo)
16:42:21 otherwiseguy: kevinbenton: what's the latest status there? I see the patch was in the gate but got bumped off?
16:42:28 otherwiseguy has a related patch for it pushed recently
16:42:55 https://review.openstack.org/#/c/429095/
16:43:13 yeah, that's the patch I meant
16:43:13 ihrachys, it looked like another patch started using a function that I hadn't updated because we didn't use it.
16:43:51 the new update modifies that function to get rid of the verify() call where we can. Can't remove it when calling "add" on a map column, but I don't think we do that.
16:44:09 otherwiseguy: ok, other than that, do we have stability issues with it? I think that's what resulted in bumping it off the gate in the past.
16:45:02 ihrachys, I can only say "we'll see". Does "bumped off the gate" mean setting it WIP, or does it require something else?
16:45:33 well, I don't think it's necessarily WIP, but I assume armax wanted to collect more stats on its success.
16:45:47 Because I set it WIP to add the change I just mentioned.
16:45:51 afaik the bump was triggered by general gate instability, so we don't really know if it's because of this patch
16:46:21 Not necessarily because of any stability issues.
16:46:22 otherwiseguy: ok, I guess then we just wait for the new version, recheck a bunch of times again, and see if we can get it in
16:46:45 otherwiseguy: thanks for working on it
16:47:03 But if things magically got better after removing it from the gate, then that would scare me a bit.
16:47:50 otherwiseguy: no, it's not that, it's just that armax was not sure if it helps when we need to recheck it so much to get it in
16:48:18 I believe it's just a matter of caution, we don't want to introduce another vector of instability
16:48:19 We didn't need to recheck it so much, I was just rechecking a bunch to see if there were any timeout errors.
16:48:39 There were failures occasionally, but none I could definitely match to the patch.
16:49:19 ok, let's move on. another issue that has affected our gates lately is tempest job timeouts due to slow machines.
16:49:35 that one is a bit tricky, and doesn't really seem neutron specific
16:49:51 I started a discussion at http://lists.openstack.org/pipermail/openstack-dev/2017-February/111923.html
16:50:06 but the tl;dr is that sometimes the machines we run tests on are very slow
16:50:11 like 2-3 times slower than usual
16:50:19 which makes zuul abort runs in the middle
16:50:21 after 2h
16:50:35 so there is a bit of discussion inside the thread, you may want to have a look
16:50:47 I don't think at this point we know the next steps to take on that one
16:50:58 do you mean zuul abruptly ends the runs?
16:50:58 it's also of concern that timeouts were not happening as often before
16:51:24 mlavalle: well, it's devstack itself I believe, it sets some timeout traps and kills tempest
16:52:02 yeah, I see that quite often in patches I review. Just making sure we were talking about the same thing :-)
16:52:08 are we still seeing timeouts? i've seen infra info about storage problems (which i believe should be fixed already).
16:52:14 maybe both are unrelated
16:52:24 dasm: that should be unrelated
16:52:30 clarkb: ack
16:52:39 dasm: the storage problems affected centos7 package installs, as we had a bad centos7 mirror index
16:52:42 I was thinking that the timeouts started happening lately, and we touched swappiness lately too [also the number of rpc workers, but armax claims that could affect test times], so maybe we somehow pushed the failing jobs closer to the edge, enough to trigger timeouts on slower machines.
16:52:42 (but that should fail fast)
16:53:43 ihrachys: are you suggesting reverting swappiness and verifying timeouts?
16:54:08 s/verifying/checking
16:54:31 I don't suggest touching anything just yet, I am just thinking aloud about what could make those timeouts trigger
16:54:50 note it may as well be external to openstack, but we can't help that, so it's better to focus on what we can do
16:55:20 dasm: also note that it's not like all jobs time out; most of them pass successfully, and in a time that is a lot lower than 2h
16:55:29 good runs are usually ~1h 10-20 mins
16:55:36 sometimes even less than an hour
16:55:52 so I suspect machines in different clouds are not uniform
16:56:13 hmm.. maybe a correlation with infrastructure? we should probably look at this
16:56:31 i can try to verify if it's somehow related and if we're seeing timeouts just for specific clouds
16:56:33 it may just be that on the slow machines we have, the time got pushed a bit beyond the limit
16:56:49 of note, there does not seem to be a correlation with the cloud used, or the project.
16:57:05 hmm
16:57:51 I am gonna dump cpu flags in devstack to see if there is a difference between what's available on the machines
16:57:59 #action ihrachys to dump cpu flags in devstack gate
16:58:05 apart from that, I am out of ideas on next steps; if someone has some, please speak up in the email thread.
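[Editorial aside on the cpu-flags action item: in devstack this would presumably just be a grep over /proc/cpuinfo added to the collected logs. A rough Python equivalent of what would be captured and diffed between fast and slow nodes, assuming Linux /proc:]

    # Sketch: capture the CPU model and feature flags of the current node so that
    # fast and slow gate runs can be compared for hardware/virtualization differences.
    def cpu_summary(path="/proc/cpuinfo"):
        model, flags = None, set()
        with open(path) as f:
            for line in f:
                key, _, value = line.partition(":")
                if key.strip() == "model name" and model is None:
                    model = value.strip()
                elif key.strip() == "flags":
                    flags.update(value.split())
        return model, sorted(flags)

    model, flags = cpu_summary()
    print(model)
    print(" ".join(flags))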
16:58:37 oh, 2 mins left; before we complete the meeting, I want to mention several patches we landed lately.
16:58:59 one removes the requests dependency, which somewhat reduces memory usage for the l3 agent: https://review.openstack.org/#/c/432367/
16:59:38 then we also enabled the dstat service in the functional gate: https://review.openstack.org/427358 so next time we spot a timeout in ovsdb native we can check the load on the system at that point.
16:59:56 I don't expect it to give a definite answer, but who knows, at least there is a chance
17:00:30 there is also a patch up for review to fix cleanup of floating ips in some api tests: https://review.openstack.org/432713
17:00:39 I have a patch for fullstack -
17:00:39 ok, we are at the top of the hour
17:00:45 - that I wanted to raise here
17:00:49 jlibosva: shoot, quick
17:00:51 https://review.openstack.org/#/c/433157/
17:00:57 didn't report a bug as I was hitting it locally
17:01:08 ack. let's follow up in the project channel
17:01:08 and was too lazy to search logstash :) but I can do that if it's needed
17:01:10 thanks everyone
17:01:12 thanks!
17:01:12 o/
17:01:13 #endmeeting