16:01:22 <ihrachys> #startmeeting neutron_ci
16:01:23 <openstack> Meeting started Tue Jan 23 16:01:22 2018 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:24 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:27 <openstack> The meeting name has been set to 'neutron_ci'
16:01:29 <mlavalle> o/
16:01:33 <slaweq> hi
16:02:08 <ihrachys> hello friends :)
16:02:09 <ihrachys> #topic Actions from prev meeting
16:02:16 <ihrachys> "mlavalle to follow up with stadium projects on taking over / merge patches to switch imports to new tempest repo"
16:02:18 <jlibosva> o/
16:02:25 <haleyb> o/
16:02:43 <mlavalle> VPNaaS is merged: https://review.openstack.org/#/c/521342/
16:03:01 <mlavalle> I'm sorry, that's midonet
16:03:02 <ihrachys> not really?
16:03:05 <ihrachys> ok
16:03:35 <mlavalle> VPNaaS is ready to go IMO: https://review.openstack.org/#/c/521341
16:03:43 <ihrachys> but I pushed vpnaas in
16:04:04 <mlavalle> great :-)
16:04:07 <ihrachys> mlavalle, what about https://review.openstack.org/#/c/521346/ ?
16:04:12 <ihrachys> that's dynamic-routing
16:04:27 <ihrachys> I think that's the last standing
16:04:39 <mlavalle> I am talking to chandankumar in the qa channel about it
16:05:36 <mlavalle> he has been really helpful
16:05:53 <mlavalle> our conversations are async. I haven't figured out his hours
16:06:03 <mlavalle> so we should get it in soon
16:06:15 <ihrachys> ok. after we land that one, we should be able to clean up tempest bits from tree
16:06:22 <ihrachys> I believe we don't have the patch proposed
16:06:43 <ihrachys> it may be a good moment to propose one, making it depend on dynamic-routing / vpnaas
16:06:59 <mlavalle> I can take care of that
16:07:26 <ihrachys> ok cool
16:07:40 <ihrachys> #action mlavalle to propose a patch removing last tempest bits from neutron tree
16:08:18 <ihrachys> "jlibosva to talk to otherwiseguy about making progress for functional ovsdb issues (timeouts, retries, stream parsing failures)"
16:08:25 <ihrachys> jlibosva, your floor
16:08:35 <jlibosva> well, I did but we haven't found anything specific
16:09:00 <jlibosva> one of the theories is that it's related to the weird data appearing in the ovsdbapp socket that's used for communication
16:09:19 <ihrachys> right. and we haven't landed a patch that would log more that could give us leads
16:09:23 <jlibosva> so maybe we will try to fix that first and see. it's not clear why the try_again is happening
16:09:28 <jlibosva> we did
16:09:32 <jlibosva> well, we didn't land it
16:09:39 <jlibosva> it's just on review
16:09:42 <ihrachys> right.
16:09:50 <jlibosva> https://review.openstack.org/#/c/525775/
16:10:11 <ihrachys> but is it ok that it's not in tree and hence little info is collected?
16:10:39 <jlibosva> It seems that it's reproduced very often so I'd give it a couple of rechecks
16:10:41 * otherwiseguy is looking at the functional ovsdb stuff still
16:11:14 <ihrachys> jlibosva, you mean, we can debug / collect info without landing it, just with rechecks?
16:11:20 <jlibosva> ihrachys: right
16:11:25 <ihrachys> ok
16:11:32 <otherwiseguy> the log message is... weird... it is labeled as coming from a test where ovsdb stuff shouldn't even be happening, I think.
16:11:44 <otherwiseguy> (the error parsing stream message)
16:11:59 <otherwiseguy> Like it has shown up on an ovs fw vsctl test.
16:12:25 <ihrachys> otherwiseguy, could be eventlet not doing its job properly. the other error message you see, could it be triggered by another tester thread?
16:12:31 <ihrachys> do they all share the same rootwrap daemon?
16:13:30 <otherwiseguy> ihrachys, it seems very likely to be eventlet-related anyway (just my gut feeling, though). Not sure about the rootwrap daemon, but the ovsdb native stuff doesn't use it.
16:13:55 <ihrachys> ah right, sorry I mixed things again. that's ovsdb socket.
16:14:08 <ihrachys> so the error is "RuntimeError: Port nonexistantport does not exist"
16:14:16 <otherwiseguy> And it might be somehow related to the weird monkey patching of the ovs logging stuff too...i just don't really know yet.
16:14:22 <ihrachys> I think nonexistantport is a unique name in the tree (maybe)
16:14:46 <ihrachys> so we could check which test it belongs to, and whether it ran on the same tester worker as the failing test or a different one
16:15:46 <slaweq> ihrachys: this name is used only in this test: https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/neutron/tests/functional/agent/test_ovs_lib.py#L79
16:15:56 <ihrachys> right
16:16:01 <otherwiseguy> The log message I'm seeing in weird places is the "Error parsing stream row/col" which doesn't have a lot of info to match up on.
16:18:50 <ihrachys> otherwiseguy, will you be able to match the error message source against the time of the retries / stream parsing issue and report in the bug?
16:19:15 <ihrachys> I don't think it will lead to anything, but if there is some chance...
16:20:10 <otherwiseguy> ihrachys, I will certainly try. :)
16:20:16 <ihrachys> otherwiseguy, you said there is little to match on. you mean the stream data is not included in the exception?
16:20:50 <otherwiseguy> Like for instance this message: 2018-01-22 06:47:14.095 [neutron.tests.functional.agent.linux.test_ovsdb_monitor.TestSimpleInterfaceMonitor.test_get_events_includes_ofport_vsctl_] 10724 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 4, byte 4: invalid keyword 'rors'
16:21:09 <ihrachys> right. but can't we at this point grab more data from the socket and dump it?
16:21:18 <ihrachys> or will it interact badly with other consumers?
16:21:27 <otherwiseguy> If it happens to hit on that one patch that I'm testing.
16:21:32 <ihrachys> the connection is probably per-thread, so it would be ok, no?
16:21:53 <otherwiseguy> Getting the info from the stream is kind of invasive.
16:22:05 <ihrachys> otherwiseguy, right. and that's what I was alluding to. maybe there is merit in merging it to give it a higher chance of hitting the issue with useful data dumped somewhere in the gate.
16:22:41 <ihrachys> otherwiseguy, well, it's invasive, but at this point it may be time to reset / reconnect anyway since you can't trust the data inside, so it could be one of the cleanup steps.
16:23:00 <ihrachys> if we were to reconnect on the error, would it help?
16:23:00 <otherwiseguy> If I don't come up with something in the next couple of days it may be worth it.
16:23:14 <ihrachys> ok
16:23:15 <otherwiseguy> ihrachys, that might help.
16:23:35 <ihrachys> #action otherwiseguy to continue digging ovsdb retry / stream parsing issue and try things
16:23:48 <ihrachys> I put action item so that we get back to it next time
16:24:03 <otherwiseguy> ihrachys, good plan. reminders always a good thing for me. ;)
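For context on the reconnect idea floated just above, here is a minimal sketch of what "reset / reconnect once the stream can't be trusted" could look like. The logger name is taken from the warning quoted earlier in the meeting; the handler itself and the way it hooks into the IDL are illustrative assumptions rather than existing neutron or ovsdbapp code, and it assumes the ovs Idl class exposes force_reconnect().

```python
import logging


class ReconnectOnParseError(logging.Handler):
    """Force an IDL reconnect when 'error parsing stream' is logged."""

    def __init__(self, idl):
        super().__init__(level=logging.WARNING)
        self._idl = idl  # assumed to be an ovs.db.idl.Idl instance

    def emit(self, record):
        if 'error parsing stream' in record.getMessage():
            # Once the stream is garbled, nothing read after that point can
            # be trusted, so drop the session and let the IDL re-sync.
            self._idl.force_reconnect()


def install_reconnect_hook(idl):
    # 'ovsdbapp.backend.ovs_idl.vlog' is the logger the quoted warning came
    # from; attaching the handler there lets the parse error trigger a reset.
    logging.getLogger('ovsdbapp.backend.ovs_idl.vlog').addHandler(
        ReconnectOnParseError(idl))
```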
16:24:07 <ihrachys> next was "jlibosva to check why functional job times out globally instead of triggering a local test case timeout for TRY_AGAIN ovsdb issue"
16:24:14 <ihrachys> that's somewhat related
16:24:51 <jlibosva> so I just briefly looked at the code and it seems to me we don't have an implicit per-test timeout. I need to test it and confirm though. I think you can flip this to an ongoing AI for me till the next meeting
16:25:14 <ihrachys> jlibosva, I thought ostestr applies some default
16:25:44 <jlibosva> I thought we had a Timeout class for that which uses the Timeout fixture
16:26:31 <ihrachys> there is this OS_TEST_TIMEOUT setting that you can pass into the env
16:27:12 <ihrachys> but it's for openstack/oslotest based classes
16:27:17 <ihrachys> do we use the library?
16:27:31 <jlibosva> ok, I need to dig there. I thought we have to do it by ourselves
16:27:45 <ihrachys> well our DietTestCase inherits from it
16:27:47 <slaweq> ihrachys: but it looks like OS_TEST_TIMEOUT is not set for functional tests: https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/tox.ini#L51
16:28:08 <ihrachys> though maybe it actually requires setting of the value: http://git.openstack.org/cgit/openstack/oslotest/tree/oslotest/base.py#n38
16:28:29 <slaweq> sorry, it is set in testenv:common
16:28:40 <ihrachys> slaweq, https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/tox.ini#L23
16:28:41 <ihrachys> right
16:28:47 <ihrachys> and it should have been inherited
16:28:55 <slaweq> yes, my mistake, sorry
16:29:15 <ihrachys> anyway, there's that thing for jlibosva to be aware of, and he will continue digging :)
16:29:21 <jlibosva> yep
16:29:37 <ihrachys> #action jlibosva to continue digging why functional tests loop in ovsdb retry and don't timeout
16:29:47 <ihrachys> next was "slaweq to report a bug about l2 ext manager not catching exceptions from extensions"
16:29:51 <jlibosva> and I confirm that if we inherit from neutron.tests.base we get the timeout
16:30:02 <slaweq> I reported it: https://bugs.launchpad.net/neutron/+bug/1743647
16:30:03 <openstack> Launchpad bug 1743647 in neutron "L2 extension manager don't handle drivers errors" [Low,Confirmed]
16:30:16 <ihrachys> jlibosva, oh so we don't. ok that seems like a good lead then.
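For reference, a rough sketch of how an oslotest-style base class applies OS_TEST_TIMEOUT through the fixtures library. It mirrors the oslotest/base.py link above in spirit only; the class name and details are illustrative assumptions, not a copy of the real code.

```python
import os

import fixtures
import testtools


class TimedTestCase(testtools.TestCase):
    def setUp(self):
        super(TimedTestCase, self).setUp()
        # An unset, zero, or malformed OS_TEST_TIMEOUT disables the timeout,
        # which is exactly the kind of gap that lets a looping test hang the
        # whole job instead of failing a single test case fast.
        try:
            timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
        except ValueError:
            timeout = 0
        if timeout > 0:
            # gentle=True makes the fixture raise TimeoutException inside the
            # test rather than killing the worker process outright.
            self.useFixture(fixtures.Timeout(timeout, gentle=True))
```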
16:30:37 <ihrachys> slaweq, I tagged it as low-hanging-fruit
16:30:57 <slaweq> ok, thx
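As a rough illustration of the fix direction bug 1743647 suggests: have the L2 agent extension manager catch and log per-extension failures instead of letting one broken driver abort the whole loop. The class and method names below only loosely mirror neutron's agent extension API and are assumptions for the sketch, not the actual implementation.

```python
import logging

LOG = logging.getLogger(__name__)


class L2ExtensionsManager(object):
    """Dispatches agent events to the loaded L2 agent extensions."""

    def __init__(self, extensions):
        self.extensions = extensions

    def handle_port(self, context, port):
        for ext in self.extensions:
            try:
                ext.handle_port(context, port)
            except Exception:
                # Keep going: a failure in one extension should not abort
                # the remaining extensions or crash the agent's RPC loop.
                LOG.exception("Extension %s failed to handle port %s",
                              getattr(ext, 'name', ext), port.get('port_id'))
```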
16:31:15 <ihrachys> "ihrachys to report bug for l2 agent restart fullstack failures and dig it"
16:31:19 <ihrachys> eh...
16:31:44 <ihrachys> you get the idea. I think I forgot to create a trello card for it and it slipped
16:31:53 <ihrachys> I will fix it now :)
16:32:30 <ihrachys> sorry for that
16:32:37 <ihrachys> next was "mlavalle to bring up floating ip failures in dvr scenario job to l3 team"
16:33:03 <mlavalle> we did talk about it. I will let haleyb give an update
16:33:54 <haleyb> i don't have any update, been busy with other items
16:34:19 <ihrachys> but at least the gist of discussion from any of you maybe?
16:35:00 <mlavalle> well, we have had difficulty following up on that one
16:35:27 <mlavalle> haleyb: is there any way I can help?
16:36:38 <haleyb> just finding the bug... getting a setup in that failing state would be best, then we could inspect rules, etc
16:37:11 <mlavalle> ok, let's talk about it in channel
16:37:34 <ihrachys> #action mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario
16:37:51 <haleyb> light fire under haleyb :)
16:38:23 <ihrachys> this issue is quite old (another of those hard nuts to crack, like the ovsdb timeout thingy), so if you have wild ideas they may still be worth trying at this point
16:38:40 <ihrachys> haleyb, you are on the hook too :p
16:38:43 <ihrachys> next was "mlavalle to transit remaining legacy- jobs to new format"
16:39:18 <mlavalle> the remaining one is legacy-tempest-dsvm-py35
16:39:54 <mlavalle> the qa team is reimplementing that job
16:40:14 <mlavalle> https://review.openstack.org/#/c/524153/
16:40:32 <ihrachys> mlavalle, oh so it's inherited from another repo?
16:40:50 <mlavalle> it comes from here: http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/zuul.d/zuul-legacy-project-templates.yaml#n153
16:41:12 <ihrachys> oh ok. then I guess we can leave it to qa
16:41:26 <mlavalle> that's why I didn't see it when moving / renaming the other ones
16:41:32 <ihrachys> we'll just need to make sure that once they replace the job we change grafana
16:42:00 <ihrachys> there are also 'periodic' jobs that are legacy. not sure whether it's the same case.
16:42:09 <ihrachys> but those seem specific to our project
16:42:15 <ihrachys> like legacy-periodic-neutron-py35-with-neutron-lib-master
16:42:21 <ihrachys> or legacy-periodic-tempest-dsvm-neutron-with-ryu-master
16:42:31 <mlavalle> I didn't have time to look at those
16:42:36 <mlavalle> I can follow up next
16:42:45 <ihrachys> I guess they are still defined in infra repos. are we expected to move those too? that would be a good question to answer.
16:43:01 <mlavalle> yes, I can start from there
16:43:06 <mlavalle> asking that question
16:43:07 <ihrachys> #action mlavalle to check with infra what's the plan for periodics / maybe migrate jobs
16:43:10 <ihrachys> great
16:43:38 <mlavalle> at least we keep unraveling the thread :-)
16:43:42 <ihrachys> those are all the AIs we had, and we have a whopping 17 mins for the rest
16:43:51 <ihrachys> #topic Grafana
16:43:55 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:44:56 <ihrachys> ok one thing I notice is fullstack is back to 100%
16:45:37 <ihrachys> also, the periodic -pg- (postgres) job has been busted for several days, so it's probably not a sporadic failure
16:46:15 <ihrachys> scenarios are 40% for dvr and 20% for linuxbridge
16:46:18 <ihrachys> like the previous week
16:46:43 <slaweq> ihrachys: this fullstack failure rate is kind of demotivating for me :(
16:46:52 <ihrachys> ovsfw job is at 80% though afaiu it was always bad
16:47:23 <ihrachys> slaweq, I feel you. but maybe we can spot something new there in a latest run. we will check in a sec.
16:47:24 <jlibosva> oh I thought it was 100%. I allocated some time since there is always a volume test failing, so I planned to look at it
16:47:44 <slaweq> ihrachys: sure, I'm just saying
16:48:05 <ihrachys> other things don't look new / revealing, so let's move on
16:48:07 <ihrachys> #topic Fullstack
16:48:09 <haleyb> i added a new dashboard for the neutron-tempest-plugin repo, http://grafana.openstack.org/dashboard/db/neutron-tempest-plugin-failure-rate
16:48:13 <haleyb> doh
16:48:18 <ihrachys> ah right, thanks haleyb
16:48:31 <ihrachys> I need to get into the habit of remembering it
16:48:54 <ihrachys> so about the fullstack failure rate
16:49:32 <ihrachys> http://logs.openstack.org/80/536380/2/check/neutron-fullstack/138850e/logs/testr_results.html.gz
16:50:04 <slaweq> yes, I found same issue in many tests during last few days
16:51:30 <slaweq> I can try to check it this week
16:51:46 <ihrachys> looks like l3agent failed to start? http://logs.openstack.org/80/536380/2/check/neutron-fullstack/138850e/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_mtu_update.txt.gz#_2018-01-23_10_46_42_690
16:52:04 <ihrachys> there is no l3agent log in service log dir
16:52:51 <ihrachys> slaweq, ok please do. maybe l3_agent.py is busted, like a broken import or something
16:52:56 <ihrachys> so it doesn't get to logging anything
16:53:06 <slaweq> yes, I will check it ASAP
16:53:11 <slaweq> (probably tomorrow)
16:53:12 <ihrachys> thanks
16:53:29 <ihrachys> #action slaweq to check why fullstack job is busted / l3agent fails to start
16:53:32 <slaweq> recently I've been working only on fullstack issues, so I have some "experience" :D
16:53:52 <ihrachys> you are in the flow, as they say. on a bug-fixing spree. :)
16:54:04 <ihrachys> your work is not for naught :)
16:54:14 <slaweq> thx :)
16:54:49 <ihrachys> speaking of fullstack, slaweq also has this patch that should help some failures: https://review.openstack.org/#/c/536367/
16:54:56 <ihrachys> I haven't gotten to it myself
16:55:10 <ihrachys> but since jlibosva and haleyb both voted +2 maybe they can push it in
16:55:33 <haleyb> pushed
16:55:41 <slaweq> haleyb: thx
16:55:54 <ihrachys> thanks
16:56:22 <ihrachys> another related thing that slaweq noticed during fullstack spree is this: https://review.openstack.org/#/c/536342/
16:56:34 <ihrachys> if nothing else, it's a performance optimization for how agents fetch SGs
16:57:15 <ihrachys> is there anything else for fullstack?
16:57:21 <ihrachys> I mean, that would require reviews
16:57:37 <mlavalle> yeap
16:57:42 <slaweq> I don't have anything else :)
16:57:44 <mlavalle> in my pile
16:57:58 <ihrachys> mlavalle, shoot
16:58:18 <mlavalle> I meant slaweq's patch is in my pile
16:58:22 <ihrachys> oh ok :)
16:58:31 <slaweq> mlavalle: thx
16:58:51 <ihrachys> ok I guess we don't have much time to discuss other stuff so we will wrap up. next time we will focus on scenarios first.
16:59:03 <mlavalle> ok
16:59:29 <ihrachys> thanks everyone, and especially slaweq for the Sisyphean work. that stone must roll over the hill one day.
16:59:42 <ihrachys> #endmeeting