16:01:40 <ihrachys> #startmeeting neutron_ci 16:01:41 <openstack> Meeting started Tue Mar 6 16:01:40 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot. 16:01:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. 16:01:45 <openstack> The meeting name has been set to 'neutron_ci' 16:01:48 <mlavalle> o/ 16:01:51 <slaweq> hi 16:02:15 <jlibosva> o/ 16:02:29 <ihrachys> hi to everyone who have escaped snowpenstack, or who was lucky enough to not go :p 16:02:46 <ihrachys> #topic Actions from prev meeting 16:03:05 <ihrachys> "ihrachys to update grafana periodic board with new names" 16:03:31 <ihrachys> merged https://review.openstack.org/547661 16:03:41 <ihrachys> "report test_get_service_by_service_and_host_name failure in periodic -pg- job" 16:03:51 <ihrachys> ouch that one was on me but I forgot to put my name on it and missed 16:04:18 <ihrachys> meh, it's already quite old and we will check status of the job later anyway and figure out what to do with it 16:04:24 <ihrachys> "mlavalle to look into linuxbridge ssh timeout failures" 16:04:37 <slaweq> I was looking on that one 16:04:43 <mlavalle> ihrachys: I think we gave that one to slaweq 16:05:06 <slaweq> so I was debugging it for a while and I couldn't reproduce it locally 16:05:31 <slaweq> so finally I pushed some DNM patches to be able to telnet to remote_pdb when test will fail 16:05:48 <slaweq> this didn't give me anything also 16:06:08 <slaweq> so I talked with infra team to add my ssh key to such node where this tests failed 16:06:25 <slaweq> and I checked there that all is working fine when I logged in 16:07:02 <ihrachys> and you check in European hours I assume. maybe it's genuine cloud load and slow instance boot? 16:07:04 <slaweq> so I tried one (stupid) simple thing: https://review.openstack.org/#/c/549324/ 16:07:21 <slaweq> ihrachys: no, tests on gate was failing many time 16:07:25 <slaweq> *many times 16:08:16 <slaweq> after all it looks for me that it is just problem with too long time for instance boot 16:09:10 <ihrachys> in linuxbridge job, it's a single node doing all things, tempest, controller, compute... 16:09:20 <ihrachys> in dvr one, compute is separate 16:09:21 <slaweq> when I proposed patch https://review.openstack.org/#/c/549324/ I run gate jobs twice on it and in both cases lb scenario tests catched my additional log 16:09:44 <slaweq> ihrachys: exactly - I then found that lb is single node and dvr is multinode config 16:09:59 <slaweq> I couldn't find anything else wrong with it :/ 16:10:03 <slaweq> ok, I'm done now :) 16:10:21 <ihrachys> slaweq, when it hits the message, does it recover quickly after that or still takes some time? 16:10:45 <slaweq> let me check as I don't remember exactly 16:11:09 <slaweq> http://logs.openstack.org/24/549324/2/check/neutron-tempest-plugin-scenario-linuxbridge/16197fa/logs/tempest.txt.gz#_2018-03-03_09_52_10_363 16:11:14 <slaweq> here is example of such log 16:12:10 <slaweq> it looks that it took about 1 minute more to connect 16:12:45 <ihrachys> yeah. and what about the other one? is it also aligned to 1 minute? 16:13:54 <ihrachys> no, seems like there it took 50 seconds 16:14:12 <ihrachys> so yeah, it looks just like slow instance spawning 16:14:14 <slaweq> it's not exactly same time for all 16:14:30 <slaweq> it looks like that for me 16:14:36 <ihrachys> afair we don't have nested virt enabled in gate 16:15:02 <ihrachys> afair it was a major issue for octavia scenarios where they needed to start amphora instance per balancer 16:15:47 <slaweq> yes, there was an issue but AFAIR problem was with nested virtualization was only in OVH cloud 16:16:34 <ihrachys> slaweq, it would probably be interesting to see how long it usually takes in dvr runs 16:16:49 <ihrachys> to compare. is it also close to the limit? 16:17:05 <slaweq> one thing which I also found is that it takes really long time to boot an instance in such failed case 16:17:07 <slaweq> e.g.: http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_200671 16:17:26 <slaweq> from instance logs it looks that it is ready after more than 400 seconds 16:17:37 <slaweq> and it was similar in other cases which I checked 16:17:59 <slaweq> ihrachys: I can check such logs in dvr scenario to compare it 16:19:34 <ihrachys> slaweq, do I interpret it correctly that it stayed in raising interfaces stage for 3min 42s / 8min 25s ? 16:19:54 <ihrachys> "Starting Raise network interfaces" 16:20:08 <ihrachys> here http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_168025 16:20:17 <slaweq> I don't know exactly 16:20:49 <ihrachys> well it looks like it. in this case, it may be something on our side that blocks the interface up. 16:20:54 <ihrachys> maybe dhcp not replying or smth 16:21:11 <ihrachys> though eventually it succeeds 16:21:13 <slaweq> but please note that first entry is 2min 52s/ 7 min 47s 16:21:55 <ihrachys> yeah but maybe that's when it starts to log 16:22:20 <ihrachys> my interpretation is that - 'the job took long time already, let's start log about it being still in progress' 16:22:27 <slaweq> but please check http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_162615 16:22:50 <slaweq> it took 166 seconds even here, before this "raise network interface" 16:23:15 <ihrachys> yeah but I mean, those 3min+ can be decisive for failure 16:23:20 <ihrachys> the limit is what, 5 minutes? 16:23:35 <slaweq> something like that, 16:23:43 <ihrachys> maybe it doesn't sit there for too long in dvr job 16:23:46 <slaweq> I will check also dhcp logs for it 16:23:49 <ihrachys> 'just' 166 seconds or so 16:24:04 <slaweq> please add me action for that for next week :) 16:24:39 <ihrachys> #action slaweq to check why it takes too long to raise interfaces in linuxbirdge scenarios job, and to compare with dvr 16:24:50 <slaweq> thx 16:25:26 <ihrachys> "ihrachys to update grafana boards to include master data only" 16:25:37 <ihrachys> so I actually looked into it, and I believe it's my misunderstanding 16:25:53 <ihrachys> the board explicitly pulls data from .master. queries 16:26:10 <ihrachys> so there is no issue with mixed data 16:26:37 <ihrachys> #topic Grafana 16:26:37 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate 16:27:19 <ihrachys> the prev time we decided to give functional and other newly stable jobs some time till next meeting and see what proves itself with time and maybe consider some for voting / gating 16:28:26 <jlibosva> functional is not voting? 16:28:27 <ihrachys> looking at functional / fullstack / api board, it's not clear we can deduce much from it 16:28:34 <ihrachys> jlibosva, I believe it's not gating 16:28:43 <ihrachys> I mean, it's in check queue but not gate queue 16:28:46 <jlibosva> is it some recent change? 16:28:58 <ihrachys> eh. now you make me sanity check :) 16:29:17 <jlibosva> oh, probably it's just not shown in grafana 16:29:18 <ihrachys> oh actually nevermind, it's voting / gating 16:29:43 <jlibosva> I thought for a second something happened and it was pulled back from gate Q :) 16:29:53 <jlibosva> sorry for disturbance 16:29:58 <ihrachys> :) yeah I am bad at tracking the real world 16:30:18 <ihrachys> so as for fullstack, I see spikes in the last week but those seem to repeat what -api- job does 16:30:46 <ihrachys> currently at ~25% failure rate 16:31:27 <ihrachys> we even have unit tests at 20% 16:31:36 <ihrachys> not sure what happened during the last week 16:32:11 <ihrachys> but anyway, we'll take a look at fullstack separately 16:32:26 <ihrachys> scenarios are doing bad too, 50% for both 16:32:53 <slaweq> for linuxbridge it's still same issue as we already talked today 16:33:19 <ihrachys> slaweq, yeah but now failure rate reflected back in dvr. it was rather stable the last time we gathered. 16:33:34 <slaweq> I see 16:33:57 <ihrachys> but yeah, not clear how much of the rate is because of gate breakage that brought all jobs to 100% failure rate two days ago 16:34:22 <ihrachys> maybe we should revisit it again later to get a better picture 16:34:38 <ihrachys> #action ihrachys to check grafana stats several days later when dust settles 16:34:47 <mlavalle> ++ 16:34:54 <ihrachys> one thing that is new on the board is it seems that rally job is totally busted 16:35:18 <slaweq> yes, but there is issue for that already reported I think 16:35:19 <jlibosva> it's gonna be fixed soon 16:35:28 <jlibosva> https://bugs.launchpad.net/rally/+bug/1753713 16:35:30 <openstack> Launchpad bug 1753713 in Rally "Rally job on neutron CI broken" [Critical,Fix released] - Assigned to Andrey Kurilin (andreykurilin) 16:35:34 <jlibosva> fix: https://review.openstack.org/#/c/549627/ 16:35:48 <ihrachys> fix merged 16:35:53 <ihrachys> so we should be back to normal 16:35:55 <jlibosva> oh, it merged :) 16:36:04 <ihrachys> they don't need to release right? 16:36:04 <jlibosva> 6 minutes ago, nice :) 16:36:07 <ihrachys> it's from git 16:36:11 <jlibosva> that I don't know 16:37:07 <ihrachys> yeah it's from master: http://logs.openstack.org/22/549822/4/check/neutron-rally-neutron/1ef49e4/logs/devstacklog.txt.gz#_2018-03-06_11_18_33_774 16:37:36 * jlibosva nods 16:37:55 <ihrachys> periodics are stable so that's good too 16:38:00 <ihrachys> ok, now to specific jobs 16:38:02 <ihrachys> #topic Fullstack 16:38:15 <ihrachys> slaweq, any patches in your pipeline for fullstack? 16:38:20 <slaweq> no 16:38:40 <slaweq> I wasn't doing anything with fullstack in last days 16:39:17 <ihrachys> ehm, I have a question 16:39:22 <ihrachys> I was looking at a random patch 16:39:28 <ihrachys> looking for fullstack results 16:39:28 <ihrachys> https://review.openstack.org/#/c/536747/ 16:39:30 <ihrachys> and I don't see any 16:39:43 <ihrachys> is it just me 16:39:59 <jlibosva> it's from january 16:40:04 <slaweq> I don't see fullstack there also 16:40:20 <ihrachys> ah right. and there's: 16:40:20 <ihrachys> neutron-fullstack finger://ze03.openstack.org/db42044c8be14621bd79843789dac0c8 : POST_FAILURE in 43m 40s (non-voting) 16:40:22 <jlibosva> today I saw fullstack results 16:40:23 <ihrachys> in comments 16:40:26 <ihrachys> ok nevermind 16:41:02 <ihrachys> ok here's another one that is fresh 16:41:03 <ihrachys> http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/testr_results.html.gz 16:41:41 <slaweq> looks like some problem with starting agents/processes 16:42:50 <ihrachys> yeah I checked test_ha_router and neutron-server + 4 agents (2 dhcp and 2 ovs) are up. shouldn't there be also l3? 16:43:16 <jlibosva> there should 16:43:19 <ihrachys> here is where it failed to start agnet: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router.txt.gz#_2018-03-06_08_20_19_235 16:43:27 <jlibosva> I checked logs now and Halting async process [neutron/tests/fullstack/cmd/l3_agent.py --log-dir /opt/stack/logs/dsvm-fullstack-logs/TestLinuxBridgeConnectivitySameNetwork.test_connectivity_VXLAN-and-l2pop_ --log-file neutron-l3-agent--2018-03-06--08-13-38-219990.log --config-file /tmp/tmpzpOvIu/tmpZEeIKB/neutron.conf --config-file /tmp/tmpzpOvIu/tmpZEeIKB/l3_agent.ini] in response to an error. 16:44:11 <ihrachys> right. but not many details 16:44:49 <jlibosva> I can have a look at it 16:44:55 <jlibosva> I'll try it locally 16:45:04 <ihrachys> yeah. maybe first step is to understand how to collect logs in those cases. 16:45:26 <ihrachys> maybe we can somehow get the output at least until it starts logging into console 16:45:34 <ihrachys> eh, sorry into the logfile 16:46:01 <slaweq> I'm not sure if agent logs something to console in such case 16:46:25 <ihrachys> and another weird thing is, it fails to start the agent, but still waits for nothing for minutes and minutes. it could fail immediately maybe. 16:46:34 <slaweq> even if You spawn it locally with same command if You give --log-file then it will not log to console IMO 16:46:36 <jlibosva> I might have a patch for it somewhere in abandoned patches, to read stderr and stdout in case process was halted in response to error 16:47:12 <slaweq> recently I was checking it with some DNM patches pushed to gerrit and rechecked :) 16:47:17 <jlibosva> it should still write output to async process buffers 16:47:24 <slaweq> but maybe we should have merged some better logging there 16:47:46 <slaweq> jlibosva: You're right 16:48:11 <ihrachys> #action jlibosva to look into agent startup failure and missing logs in: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/ 16:48:12 <slaweq> some time ago I was checking it to fix some problem with l3 agent in fullstack tests 16:49:26 <ihrachys> there are not that many fullstack failures actually 16:49:32 <ihrachys> (I am trying to find another one) 16:50:25 <slaweq> btw. there is currently only one test marked as unstable_test() in fullstack 16:50:38 <slaweq> related to security groups 16:50:53 <slaweq> so if there is no so many failures, it's good IMO :) 16:51:02 <ihrachys> yeah that's good 16:51:09 <ihrachys> I actually failed to find another fresh one 16:51:37 <ihrachys> so let's assume it's good. we will give it a week to prove it on chart since now it's probably distorted by gate instability 16:52:14 <ihrachys> #topic Scenarios 16:52:29 <ihrachys> we know what's happening in linuxbridge, so let's figure out what's the deal with dvr one 16:54:04 <haleyb> someone said dvr :) 16:54:16 <ihrachys> (while looking for a failure, I noticed scenarios not triggered in https://review.openstack.org/#/c/550055/) 16:54:31 <jlibosva> haleyb: you have higlighting set for dvr? :) 16:54:56 <haleyb> yup, i'm just looking out of corner of eye due to other meeting 16:55:18 <ihrachys> ok found one failure here: http://logs.openstack.org/07/548607/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/84cacc5/job-output.txt.gz#_2018-03-06_13_42_42_767535 16:55:24 <ihrachys> ssh connection timeout it is 16:55:44 <ihrachys> slaweq, see it took 1min 29s / 6min 22s 16:55:48 <ihrachys> to raise interfaces there 16:55:55 <ihrachys> still a lot 16:56:00 <slaweq> ihrachys: yes, I was just looking for that :) 16:56:49 <ihrachys> so maybe effectively bumping the ssh timeout for the job would fix that one too 16:58:00 <ihrachys> the instance is up for ~249.666088 until it claims full boot 16:58:06 <ihrachys> it's ~4 minutes 16:58:19 <ihrachys> the timeout of 5 minutes would be enough in this case, so maybe it's something else 16:59:31 <ihrachys> I guess we can revisit that one after we deal with linuxbridge. maybe the fix will be the same. 16:59:43 <mlavalle> fingers crossed 16:59:47 <ihrachys> ok we are out of time for today. thanks for joining. 16:59:52 <ihrachys> #endmeeting