16:01:40 <ihrachys> #startmeeting neutron_ci
16:01:41 <openstack> Meeting started Tue Mar  6 16:01:40 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:42 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:45 <openstack> The meeting name has been set to 'neutron_ci'
16:01:48 <mlavalle> o/
16:01:51 <slaweq> hi
16:02:15 <jlibosva> o/
16:02:29 <ihrachys> hi to everyone who escaped snowpenstack, or who was lucky enough not to go :p
16:02:46 <ihrachys> #topic Actions from prev meeting
16:03:05 <ihrachys> "ihrachys to update grafana periodic board with new names"
16:03:31 <ihrachys> merged https://review.openstack.org/547661
16:03:41 <ihrachys> "report test_get_service_by_service_and_host_name failure in periodic -pg- job"
16:03:51 <ihrachys> ouch, that one was on me but I forgot to put my name on it and missed it
16:04:18 <ihrachys> meh, it's already quite old and we will check the status of the job later anyway and figure out what to do with it
16:04:24 <ihrachys> "mlavalle to look into linuxbridge ssh timeout failures"
16:04:37 <slaweq> I was looking at that one
16:04:43 <mlavalle> ihrachys: I think we gave that one to slaweq
16:05:06 <slaweq> so I was debugging it for a while and I couldn't reproduce it locally
16:05:31 <slaweq> so finally I pushed some DNM patches to be able to telnet to remote_pdb when the test fails
16:05:48 <slaweq> that didn't give me anything either
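For context, a minimal sketch of the remote_pdb approach slaweq describes, assuming the remote_pdb package is available on the test node; the wrapper function, the SSH-check call, and the port are hypothetical, not the actual DNM patch:

```python
# Minimal sketch only -- not the actual DNM patch. Assumes the remote_pdb
# package is installed on the test node; host/port are arbitrary examples.
from remote_pdb import RemotePdb


def check_ssh_or_break(client):
    """Drop into a remotely reachable pdb session when the SSH check fails."""
    try:
        client.validate_authentication()  # tempest-style SSH check (assumed name)
    except Exception:
        # Pauses the test here; debug with: telnet <node-ip> 4444
        RemotePdb('0.0.0.0', 4444).set_trace()
        raise
```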
16:06:08 <slaweq> so I asked the infra team to add my ssh key to a node where these tests failed
16:06:25 <slaweq> and when I logged in there, everything was working fine
16:07:02 <ihrachys> and you checked in European hours, I assume. maybe it's genuine cloud load and slow instance boot?
16:07:04 <slaweq> so I tried one (stupid) simple thing: https://review.openstack.org/#/c/549324/
16:07:21 <slaweq> ihrachys: no, tests in the gate were failing many times
16:08:16 <slaweq> in the end it looks to me like it's just a problem with the instance taking too long to boot
16:09:10 <ihrachys> in the linuxbridge job, it's a single node doing everything: tempest, controller, compute...
16:09:20 <ihrachys> in the dvr one, the compute is separate
16:09:21 <slaweq> when I proposed patch https://review.openstack.org/#/c/549324/ I ran the gate jobs twice on it and in both cases the lb scenario tests hit my additional log
16:09:44 <slaweq> ihrachys: exactly - I then found that lb is a single-node and dvr is a multinode config
16:09:59 <slaweq> I couldn't find anything else wrong with it :/
16:10:03 <slaweq> ok, I'm done now :)
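As an illustration of the kind of timing instrumentation being discussed, here is a generic sketch of a poll-for-SSH loop that logs how long a guest took to become reachable; it is not the content of review 549324, and all names plus the 300-second value are assumptions:

```python
# Generic poll-for-SSH loop that records how long a guest took to become
# reachable. Illustrative only; not taken from review 549324.
import logging
import socket
import time

LOG = logging.getLogger(__name__)


def wait_for_ssh_port(host, port=22, timeout=300, interval=5):
    start = time.time()
    while time.time() - start < timeout:
        try:
            with socket.create_connection((host, port), timeout=interval):
                LOG.info("SSH port on %s reachable after %.0f seconds",
                         host, time.time() - start)
                return True
        except OSError:
            time.sleep(interval)
    LOG.warning("SSH port on %s still unreachable after %d seconds",
                host, timeout)
    return False
```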
16:10:21 <ihrachys> slaweq, when it hits the message, does it recover quickly after that or does it still take some time?
16:10:45 <slaweq> let me check as I don't remember exactly
16:11:09 <slaweq> http://logs.openstack.org/24/549324/2/check/neutron-tempest-plugin-scenario-linuxbridge/16197fa/logs/tempest.txt.gz#_2018-03-03_09_52_10_363
16:11:14 <slaweq> here is an example of such a log
16:12:10 <slaweq> it looks like it took about 1 minute more to connect
16:12:45 <ihrachys> yeah. and what about the other one? is it also aligned to 1 minute?
16:13:54 <ihrachys> no, seems like there it took 50 seconds
16:14:12 <ihrachys> so yeah, it looks just like slow instance spawning
16:14:14 <slaweq> it's not exactly the same time for all of them
16:14:30 <slaweq> that's how it looks to me
16:14:36 <ihrachys> afair we don't have nested virt enabled in gate
16:15:02 <ihrachys> afair it was a major issue for octavia scenarios where they needed to start an amphora instance per balancer
16:15:47 <slaweq> yes, there was an issue, but AFAIR the problem with nested virtualization was only in the OVH cloud
16:16:34 <ihrachys> slaweq, it would probably be interesting to see how long it usually takes in dvr runs
16:16:49 <ihrachys> to compare. is it also close to the limit?
16:17:05 <slaweq> one thing which I also found is that it takes a really long time to boot an instance in such failed cases
16:17:07 <slaweq> e.g.: http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_200671
16:17:26 <slaweq> from the instance logs it looks like it is ready after more than 400 seconds
16:17:37 <slaweq> and it was similar in the other cases which I checked
16:17:59 <slaweq> ihrachys: I can check such logs in the dvr scenario to compare
16:19:34 <ihrachys> slaweq, do I interpret it correctly that it stayed in the raising-interfaces stage for 3min 42s / 8min 25s?
16:19:54 <ihrachys> "Starting Raise network interfaces"
16:20:08 <ihrachys> here http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_168025
16:20:17 <slaweq> I don't know exactly
16:20:49 <ihrachys> well it looks like it. in this case, it may be something on our side that blocks the interface from coming up.
16:20:54 <ihrachys> maybe dhcp not replying or smth
16:21:11 <ihrachys> though eventually it succeeds
16:21:13 <slaweq> but please note that the first entry is 2min 52s / 7min 47s
16:21:55 <ihrachys> yeah but maybe that's when it starts to log
16:22:20 <ihrachys> my interpretation is: 'the job has already taken a long time, let's start logging that it's still in progress'
16:22:27 <slaweq> but please check http://logs.openstack.org/71/548071/5/check/neutron-tempest-plugin-scenario-linuxbridge/d792a4a/job-output.txt.gz#_2018-03-02_17_02_34_162615
16:22:50 <slaweq> it took 166 seconds even here, before this "raise network interface"
16:23:15 <ihrachys> yeah but I mean, those 3min+ can be decisive for failure
16:23:20 <ihrachys> the limit is what, 5 minutes?
16:23:35 <slaweq> something like that,
16:23:43 <ihrachys> maybe it doesn't sit there for too long in dvr job
16:23:46 <slaweq> I will also check the dhcp logs for it
16:23:49 <ihrachys> 'just' 166 seconds or so
16:24:04 <slaweq> please add an action for me on that for next week :)
16:24:39 <ihrachys> #action slaweq to check why it takes so long to raise interfaces in the linuxbridge scenarios job, and to compare with dvr
16:24:50 <slaweq> thx
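To make that comparison concrete, a throwaway parser along these lines could pull the "Raise network interfaces" wait out of the guest console output for both jobs; the regex and function name are made up here, and they assume the systemd "A start job is running for Raise network interfaces (3min 42s / ...)" lines quoted above:

```python
# Throwaway helper, names and regex are my own: find systemd's
# "A start job is running for Raise network interfaces (3min 42s / 8min 25s)"
# lines in a guest console log and report the longest wait seen, so the
# linuxbridge and dvr runs can be compared side by side.
import re

PATTERN = re.compile(
    r'start job is running for Raise network interfaces\s*'
    r'\((?:(\d+)min )?(\d+)s /')


def max_raise_interfaces_wait(console_text):
    longest = 0
    for minutes, seconds in PATTERN.findall(console_text):
        longest = max(longest, int(minutes or 0) * 60 + int(seconds))
    return longest  # seconds spent in the "raise interfaces" stage

# usage: run max_raise_interfaces_wait() over the lb and dvr console logs
```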
16:25:26 <ihrachys> "ihrachys to update grafana boards to include master data only"
16:25:37 <ihrachys> so I actually looked into it, and I believe it was my misunderstanding
16:25:53 <ihrachys> the board explicitly pulls data from .master. queries
16:26:10 <ihrachys> so there is no issue with mixed data
16:26:37 <ihrachys> #topic Grafana
16:26:37 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:27:19 <ihrachys> last time we decided to give functional and the other newly stable jobs some time until the next meeting, see what proves itself, and maybe consider some for voting / gating
16:28:26 <jlibosva> functional is not voting?
16:28:27 <ihrachys> looking at the functional / fullstack / api board, it's not clear we can deduce much from it
16:28:34 <ihrachys> jlibosva, I believe it's not gating
16:28:43 <ihrachys> I mean, it's in check queue but not gate queue
16:28:46 <jlibosva> is it some recent change?
16:28:58 <ihrachys> eh. now you make me sanity check :)
16:29:17 <jlibosva> oh, probably it's just not shown in grafana
16:29:18 <ihrachys> oh actually nevermind, it's voting / gating
16:29:43 <jlibosva> I thought for a second something happened and it was pulled back from the gate queue :)
16:29:53 <jlibosva> sorry for the disturbance
16:29:58 <ihrachys> :) yeah I am bad at tracking the real world
16:30:18 <ihrachys> so as for fullstack, I see spikes in the last week but those seem to mirror what the -api- job does
16:30:46 <ihrachys> currently at ~25% failure rate
16:31:27 <ihrachys> we even have unit tests at 20%
16:31:36 <ihrachys> not sure what happened during the last week
16:32:11 <ihrachys> but anyway, we'll take a look at fullstack separately
16:32:26 <ihrachys> scenarios are doing bad too, 50% for both
16:32:53 <slaweq> for linuxbridge it's still the same issue we already talked about today
16:33:19 <ihrachys> slaweq, yeah but now the failure rate is reflected in dvr too. it was rather stable the last time we gathered.
16:33:34 <slaweq> I see
16:33:57 <ihrachys> but yeah, it's not clear how much of the rate is because of the gate breakage that brought all jobs to a 100% failure rate two days ago
16:34:22 <ihrachys> maybe we should revisit it again later to get a better picture
16:34:38 <ihrachys> #action ihrachys to check grafana stats several days later when dust settles
16:34:47 <mlavalle> ++
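For reference, the percentages on that board come from graphite, and something like the sketch below can pull the same series directly; the metric paths are a guess at the zuul stats naming scheme and should be checked against the actual dashboard definition:

```python
# Sketch only: query graphite's /render API for a job failure rate. The
# metric paths are assumptions about the zuul stats naming scheme.
import requests

GRAPHITE = 'http://graphite.openstack.org/render'
BASE = ('stats_counts.zuul.tenant.openstack.pipeline.check.'
        'project.git_openstack_org.openstack_neutron.master.job.%s.%s')


def failure_rate(job, hours=24):
    target = 'asPercent(sumSeries(%s),sumSeries(%s))' % (
        BASE % (job, 'FAILURE'), BASE % (job, '{SUCCESS,FAILURE}'))
    resp = requests.get(GRAPHITE, params={'target': target,
                                          'from': '-%dhours' % hours,
                                          'format': 'json'})
    return resp.json()
```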
16:34:54 <ihrachys> one thing that is new on the board is that the rally job seems totally busted
16:35:18 <slaweq> yes, but there is an issue already reported for that I think
16:35:19 <jlibosva> it's gonna be fixed soon
16:35:28 <jlibosva> https://bugs.launchpad.net/rally/+bug/1753713
16:35:30 <openstack> Launchpad bug 1753713 in Rally "Rally job on neutron CI broken" [Critical,Fix released] - Assigned to Andrey Kurilin (andreykurilin)
16:35:34 <jlibosva> fix: https://review.openstack.org/#/c/549627/
16:35:48 <ihrachys> fix merged
16:35:53 <ihrachys> so we should be back to normal
16:35:55 <jlibosva> oh, it merged :)
16:36:04 <ihrachys> they don't need to release right?
16:36:04 <jlibosva> 6 minutes ago, nice :)
16:36:07 <ihrachys> it's from git
16:36:11 <jlibosva> that I don't know
16:37:07 <ihrachys> yeah it's from master: http://logs.openstack.org/22/549822/4/check/neutron-rally-neutron/1ef49e4/logs/devstacklog.txt.gz#_2018-03-06_11_18_33_774
16:37:36 * jlibosva nods
16:37:55 <ihrachys> periodics are stable so that's good too
16:38:00 <ihrachys> ok, now to specific jobs
16:38:02 <ihrachys> #topic Fullstack
16:38:15 <ihrachys> slaweq, any patches in your pipeline for fullstack?
16:38:20 <slaweq> no
16:38:40 <slaweq> I wasn't doing anything with fullstack in the last few days
16:39:17 <ihrachys> ehm, I have a question
16:39:22 <ihrachys> I was looking at a random patch
16:39:28 <ihrachys> looking for fullstack results
16:39:28 <ihrachys> https://review.openstack.org/#/c/536747/
16:39:30 <ihrachys> and I don't see any
16:39:43 <ihrachys> is it just me
16:39:59 <jlibosva> it's from January
16:40:04 <slaweq> I don't see fullstack there either
16:40:20 <ihrachys> ah right. and there's:
16:40:20 <ihrachys> neutron-fullstack finger://ze03.openstack.org/db42044c8be14621bd79843789dac0c8 : POST_FAILURE in 43m 40s (non-voting)
16:40:22 <jlibosva> today I saw fullstack results
16:40:23 <ihrachys> in comments
16:40:26 <ihrachys> ok nevermind
16:41:02 <ihrachys> ok here's another one that is fresh
16:41:03 <ihrachys> http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/testr_results.html.gz
16:41:41 <slaweq> looks like some problem with starting agents/processes
16:42:50 <ihrachys> yeah I checked test_ha_router and neutron-server + 4 agents (2 dhcp and 2 ovs) are up. shouldn't there also be an l3 one?
16:43:16 <jlibosva> there should
16:43:19 <ihrachys> here is where it failed to start the agent: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/dsvm-fullstack-logs/TestHAL3Agent.test_ha_router.txt.gz#_2018-03-06_08_20_19_235
16:43:27 <jlibosva> I checked logs now and Halting async process [neutron/tests/fullstack/cmd/l3_agent.py --log-dir /opt/stack/logs/dsvm-fullstack-logs/TestLinuxBridgeConnectivitySameNetwork.test_connectivity_VXLAN-and-l2pop_ --log-file neutron-l3-agent--2018-03-06--08-13-38-219990.log --config-file /tmp/tmpzpOvIu/tmpZEeIKB/neutron.conf --config-file /tmp/tmpzpOvIu/tmpZEeIKB/l3_agent.ini] in response to an error.
16:44:11 <ihrachys> right. but not many details
16:44:49 <jlibosva> I can have a look at it
16:44:55 <jlibosva> I'll try it locally
16:45:04 <ihrachys> yeah. maybe the first step is to understand how to collect logs in those cases.
16:45:26 <ihrachys> maybe we can somehow get the output at least until it starts logging into the logfile
16:46:01 <slaweq> I'm not sure the agent logs anything to the console in such a case
16:46:25 <ihrachys> and another weird thing is, it fails to start the agent but still waits for minutes and minutes for nothing. maybe it could fail immediately.
16:46:34 <slaweq> even if You spawn it locally with the same command, if You give --log-file then it will not log to the console IMO
16:46:36 <jlibosva> I might have a patch for it somewhere in my abandoned patches, to read stderr and stdout in case the process was halted in response to an error
16:47:12 <slaweq> recently I was checking it with some DNM patches pushed to gerrit and rechecked :)
16:47:17 <jlibosva> it should still write output to async process buffers
16:47:24 <slaweq> but maybe we should get some better logging merged there
16:47:46 <slaweq> jlibosva: You're right
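A rough sketch of what jlibosva is describing, assuming neutron's AsyncProcess keeps the child's output in the buffers exposed by iter_stdout()/iter_stderr(); the hook name is based on my reading of neutron.agent.linux.async_process and may not match the current code exactly:

```python
# Rough sketch: dump whatever is buffered for an async process before it is
# halted in response to an error, so fullstack logs show why the agent died.
# Based on my reading of neutron.agent.linux.async_process; names may differ.
from oslo_log import log as logging

from neutron.agent.linux import async_process

LOG = logging.getLogger(__name__)


class VerboseAsyncProcess(async_process.AsyncProcess):

    def _handle_process_error(self):
        for line in self.iter_stdout():
            LOG.error("async process stdout: %s", line)
        for line in self.iter_stderr():
            LOG.error("async process stderr: %s", line)
        super(VerboseAsyncProcess, self)._handle_process_error()
```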
16:48:11 <ihrachys> #action jlibosva to look into agent startup failure and missing logs in: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/
16:48:12 <slaweq> some time ago I was checking it to fix some problem with the l3 agent in fullstack tests
16:49:26 <ihrachys> there are not that many fullstack failures actually
16:49:32 <ihrachys> (I am trying to find another one)
16:50:25 <slaweq> btw. there is currently only one test marked as unstable_test() in fullstack
16:50:38 <slaweq> related to security groups
16:50:53 <slaweq> so if there are not that many failures, that's good IMO :)
16:51:02 <ihrachys> yeah that's good
16:51:09 <ihrachys> I actually failed to find another fresh one
16:51:37 <ihrachys> so let's assume it's good. we will give it a week to prove itself on the chart, since right now it's probably distorted by the gate instability
16:52:14 <ihrachys> #topic Scenarios
16:52:29 <ihrachys> we know what's happening in linuxbridge, so let's figure out what the deal is with the dvr one
16:54:04 <haleyb> someone said dvr :)
16:54:16 <ihrachys> (while looking for a failure, I noticed scenarios were not triggered in https://review.openstack.org/#/c/550055/)
16:54:31 <jlibosva> haleyb: you have highlighting set for dvr? :)
16:54:56 <haleyb> yup, i'm just looking out of the corner of my eye due to another meeting
16:55:18 <ihrachys> ok found one failure here: http://logs.openstack.org/07/548607/1/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/84cacc5/job-output.txt.gz#_2018-03-06_13_42_42_767535
16:55:24 <ihrachys> ssh connection timeout it is
16:55:44 <ihrachys> slaweq, see it took 1min 29s / 6min 22s
16:55:48 <ihrachys> to raise interfaces there
16:55:55 <ihrachys> still a lot
16:56:00 <slaweq> ihrachys: yes, I was just looking for that :)
16:56:49 <ihrachys> so maybe effectively bumping the ssh timeout for the job would fix that one too
16:58:00 <ihrachys> the instance is up for ~249.666088 seconds until it claims full boot
16:58:06 <ihrachys> it's ~4 minutes
16:58:19 <ihrachys> the timeout of 5 minutes would be enough in this case, so maybe it's something else
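Spelling out that arithmetic (the ~5 minute figure is presumably tempest's [validation] ssh_timeout, commonly 300 seconds, treated here as an assumption):

```python
# Numbers as quoted above; the 300 s limit is assumed to be the usual
# tempest [validation] ssh_timeout value.
boot_time = 249.666088   # seconds until the guest reports full boot
ssh_timeout = 5 * 60     # assumed SSH connection limit

print(boot_time < ssh_timeout)  # True -> the boot alone fits in the window,
                                # so something else eats the remaining time
```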
16:59:31 <ihrachys> I guess we can revisit that one after we deal with linuxbridge. maybe the fix will be the same.
16:59:43 <mlavalle> fingers crossed
16:59:47 <ihrachys> ok we are out of time for today. thanks for joining.
16:59:52 <ihrachys> #endmeeting