16:01:22 #startmeeting neutron_ci
16:01:23 Meeting started Tue Jan 23 16:01:22 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:24 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:27 The meeting name has been set to 'neutron_ci'
16:01:29 o/
16:01:33 hi
16:02:08 hello friends :)
16:02:09 #topic Actions from prev meeting
16:02:16 "mlavalle to follow up with stadium projects on taking over / merge patches to switch imports to new tempest repo"
16:02:18 o/
16:02:25 o/
16:02:43 VPNaaS is merged: https://review.openstack.org/#/c/521342/
16:03:01 I'm sorry, that's midonet
16:03:02 not really?
16:03:05 ok
16:03:35 VPNaaS is ready to go IMO: https://review.openstack.org/#/c/521341
16:03:43 but i pushed vpnaas in
16:04:04 great :-)
16:04:07 mlavalle, what about https://review.openstack.org/#/c/521346/ ?
16:04:12 that's dynamic-routing
16:04:27 I think that's the last one standing
16:04:39 I am talking to chandamkumar in the qa channel about it
16:05:25 chandankumar that is
16:05:36 he has been really helpful
16:05:53 our conversations are async. I haven't figured out his hours
16:06:03 so we should get it in soon
16:06:15 ok. after we land that one, we should be able to clean up the tempest bits from the tree
16:06:22 I believe we don't have that patch proposed yet
16:06:43 it may be a good moment to propose one, making it depend on dynamic-routing / vpnaas
16:06:59 I can take care of that
16:07:26 ok cool
16:07:40 #action mlavalle to propose a patch removing last tempest bits from neutron tree
16:08:18 "jlibosva to talk to otherwiseguy about making progress for functional ovsdb issues (timeouts, retries, stream parsing failures)"
16:08:25 jlibosva, the floor is yours
16:08:35 well, I did, but we haven't found anything specific
16:09:00 one of the theories is that it's related to the weird data appearing in the ovsdbapp socket that's used for communication
16:09:19 right. and we haven't landed a patch that would log more, which could give us leads
16:09:23 so maybe we will try to fix that first and see. it's not clear why the try_again is happening
16:09:28 we did
16:09:32 well, we didn't land it
16:09:39 it's just in review
16:09:42 right.
16:09:50 https://review.openstack.org/#/c/525775/
16:10:11 but is it ok that it's not in tree and hence little info is collected?
16:10:39 It seems that it's reproduced very often, so I'd give it a couple of rechecks
16:10:41 * otherwiseguy is still looking at the functional ovsdb stuff
16:11:14 jlibosva, you mean we can debug / collect info without landing it, just with rechecks?
16:11:20 ihrachys: right
16:11:25 ok
16:11:32 the log message is... weird... it is labeled as coming from a test where ovsdb stuff shouldn't even be happening, I think.
16:11:44 (the error parsing stream message)
16:11:59 Like it has shown up on an ovs fw vsctl test.
16:12:25 otherwiseguy, could be eventlet not doing its job properly. the other error message you see, could it be triggered by another tester thread?
16:12:31 do they all share the same rootwrap daemon?
16:13:30 ihrachys, it seems very likely to be eventlet-related anyway (just my gut feeling, though). Not sure on the rw daemon, but the ovsdb native stuff doesn't use rwd.
16:13:55 ah right, sorry, I mixed things up again. that's the ovsdb socket.
16:14:08 so the error is "RuntimeError: Port nonexistantport does not exist"
16:14:16 And it might be somehow related to the weird monkey patching of the ovs logging stuff too... I just don't really know yet.
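
[Editor's note: a minimal, hypothetical sketch of the "weird data in the shared ovsdbapp socket" theory discussed above, using only the Python standard library (this is not ovsdbapp code). If two readers ever consume from the same connection, each sees only a fragment of a JSON message, and a parser trips over mid-word garbage much like the "invalid keyword 'rors'" warning quoted later in the meeting.]

    import json
    import socket
    import threading

    # A connected socket pair stands in for the ovsdb connection.
    server, client = socket.socketpair()
    chunks = {}

    def reader(name):
        # Each reader grabs whatever bytes happen to be available,
        # with no framing to keep whole messages together.
        chunks[name] = client.recv(16)

    threads = [threading.Thread(target=reader, args=(n,)) for n in ("a", "b")]
    for t in threads:
        t.start()
    server.sendall(b'{"error": null, "result": ["ok"]}')
    for t in threads:
        t.join()

    for name, chunk in sorted(chunks.items()):
        try:
            json.loads(chunk)
        except ValueError as exc:
            # Both fragments fail to parse, each with a confusing error.
            print(name, chunk, exc)
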
16:14:22 I think nonexistantport is a unique name (maybe) in the tree
16:14:46 so we could check which test it belongs to, and check whether it's the same tester worker or a different one, and when it executed compared to the one that fails
16:15:46 ihrachys: this name is used only in this test: https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/neutron/tests/functional/agent/test_ovs_lib.py#L79
16:15:56 right
16:16:01 The log message I'm seeing in weird places is the "Error parsing stream row/col" one, which doesn't have a lot of info to match up on.
16:18:50 otherwiseguy, will you be able to match the error message source against the time of the retries / stream parsing issue and report in the bug?
16:19:15 I don't think it will lead to anything, but if there is some chance...
16:20:10 ihrachys, I will certainly try. :)
16:20:16 otherwiseguy, you said there is little to match on. you mean the stream data is not included in the exception?
16:20:50 Like for instance this message: 2018-01-22 06:47:14.095 [neutron.tests.functional.agent.linux.test_ovsdb_monitor.TestSimpleInterfaceMonitor.test_get_events_includes_ofport_vsctl_] 10724 WARNING ovsdbapp.backend.ovs_idl.vlog [-] tcp:127.0.0.1:6640: error parsing stream: line 0, column 4, byte 4: invalid keyword 'rors'
16:21:09 right. but can't we at this point grab more data from the socket and dump it?
16:21:18 or will it interact badly with other consumers?
16:21:27 If it happens to hit on that one patch that I'm testing.
16:21:32 the connection is probably per thread, so it would be ok, no?
16:21:53 Getting the info from the stream is kind of invasive.
16:22:05 otherwiseguy, right. and that's what I was alluding to. maybe there is merit in merging it to give it a higher chance to hit with useful data dumped somewhere in the gate.
16:22:41 otherwiseguy, well, it's invasive, but at this point it may be time to reset / reconnect anyway since you can't trust the data inside, so it could be one of the cleanup steps.
16:23:00 if we were to reconnect on the error, would it help?
16:23:00 If I don't come up with something in the next couple of days it may be worth it.
16:23:14 ok
16:23:15 ihrachys, that might help.
16:23:35 #action otherwiseguy to continue digging ovsdb retry / stream parsing issue and try things
16:23:48 I put an action item there so that we get back to it next time
16:24:03 ihrachys, good plan. reminders are always a good thing for me. ;)
16:24:07 next was "jlibosva to check why functional job times out globally instead of triggering a local test case timeout for TRY_AGAIN ovsdb issue"
16:24:14 that's somewhat related
16:24:51 so I just briefly looked at the code and it seems to me we don't have an implicit per-test timeout. I need to test it though and confirm. I think you can flip this as an ongoing AI for me till the next meeting
16:25:14 jlibosva, I thought ostestr applies some default
16:25:44 I thought we have a Timeout class for that that uses the Timeout fixture
16:26:31 there is this OS_TEST_TIMEOUT setting that you can pass into the env
16:27:12 but it's for openstack/oslotest based classes
16:27:17 do we use the library?
16:27:31 ok, I need to dig there. I thought we had to do it ourselves
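
[Editor's note: for reference, this is roughly how an oslotest-style base class wires up the per-test timeout discussed here (see the oslotest/base.py link later in the log for the authoritative code); the sketch below is illustrative, not the actual neutron or oslotest source. The timeout only kicks in when OS_TEST_TIMEOUT is set to a positive integer in the test environment.]

    import os

    import fixtures
    import testtools

    class TimedTestCase(testtools.TestCase):
        """Illustrative base class showing the OS_TEST_TIMEOUT pattern."""

        def setUp(self):
            super(TimedTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # gentle=True raises a TimeoutException inside the test
                # instead of killing the whole worker process.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))
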
16:27:45 well, our DietTestCase inherits from it
16:27:47 ihrachys: but it looks like OS_TEST_TIMEOUT is not set in functional tests: https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/tox.ini#L51
16:28:08 though maybe it actually requires the value to be set: http://git.openstack.org/cgit/openstack/oslotest/tree/oslotest/base.py#n38
16:28:29 sorry, it is set in testenv:common
16:28:40 slaweq, https://github.com/openstack/neutron/blob/ff24fc07278797459b77434662c92667d690b0b4/tox.ini#L23
16:28:41 right
16:28:47 and it should have been inherited
16:28:55 yes, my mistake, sorry
16:29:15 anyway, there's that thing for jlibosva to be aware of, and he will continue digging :)
16:29:21 yep
16:29:37 #action jlibosva to continue digging why functional tests loop in ovsdb retry and don't timeout
16:29:47 next was "slaweq to report a bug about l2 ext manager not catching exceptions from extensions"
16:29:51 and I confirm that if we inherit from neutron.tests.base we get the timeout
16:30:02 I reported it: https://bugs.launchpad.net/neutron/+bug/1743647
16:30:03 Launchpad bug 1743647 in neutron "L2 extension manager don't handle drivers errors" [Low,Confirmed]
16:30:16 jlibosva, oh, so we don't. ok, that seems like a good lead then.
16:30:37 slaweq, I tagged it as low-hanging-fruit
16:30:57 ok, thx
16:31:15 "ihrachys to report bug for l2 agent restart fullstack failures and dig it"
16:31:19 eh...
16:31:44 you get the idea. I think I forgot to create a trello card for it and it slipped
16:31:53 I will fix it now :)
16:32:30 sorry for that
16:32:37 next was "mlavalle to bring up floating ip failures in dvr scenario job to l3 team"
16:33:03 we did talk about it. I will let haleyb give an update on it
16:33:54 i don't have any update, been busy with other items
16:34:19 but could one of you at least give the gist of the discussion?
16:35:00 well, we have had difficulty following up on that one
16:35:27 haleyb: is there any way I can help?
16:36:38 just finding the bug... getting a setup in that failing state would be best, then we could inspect rules, etc
16:37:11 ok, let's talk about it in the channel
16:37:34 #action mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario
16:37:51 light a fire under haleyb :)
16:38:23 this issue is quite old (another example of those hard nuts to crack is the ovsdb timeout thingy), so if you have mad ideas they may still be worth trying at this point
16:38:40 haleyb, you are on the hook too :p
16:38:43 next was "mlavalle to transit remaining legacy- jobs to new format"
16:39:18 the remaining one is legacy-tempest-dsvm-py35
16:39:54 the qa team is reimplementing that job
16:40:14 https://review.openstack.org/#/c/524153/
16:40:32 mlavalle, oh, so it's inherited from another repo?
16:40:50 it comes from here: http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/zuul.d/zuul-legacy-project-templates.yaml#n153
16:41:12 oh ok. then I guess we can leave it to qa
16:41:26 that's why I didn't see it when moving / renaming the other ones
16:41:32 we'll just need to make sure that once they replace the job we update grafana
16:42:00 there are also 'periodic' jobs that are legacy. not sure whether it's the same case or not.
16:42:09 but those seem specific to our project
16:42:15 like legacy-periodic-neutron-py35-with-neutron-lib-master
16:42:21 or legacy-periodic-tempest-dsvm-neutron-with-ryu-master
16:42:31 I didn't have time to look at those
16:42:36 I can follow up next
16:42:45 I guess they are still defined in infra repos. are we expected to move those too? that would be a good q to answer.
16:43:01 yes, I can start from there
16:43:06 asking that question
16:43:07 #action mlavalle to check with infra what's the plan for periodics / maybe migrate jobs
16:43:10 great
16:43:38 at least we keep unraveling the thread :-)
16:43:42 those are all the AIs we had, and we have a whopping 17 mins for the rest
16:43:51 #topic Grafana
16:43:55 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:44:56 ok, one thing I notice is fullstack is back to 100%
16:45:37 also, the periodic -pg- (postgres) job has been busted for several days, so it's probably not a sporadic failure
16:46:15 scenarios are 40% for dvr and 20% for linuxbridge
16:46:18 like the previous week
16:46:43 ihrachys: this fullstack failure rate is kind of demotivating for me :(
16:46:52 the ovsfw job is at 80%, though afaiu it was always bad
16:47:23 slaweq, I feel you. but maybe we can spot something new there in the latest run. we will check in a sec.
16:47:24 oh, I thought it was 100%. I allocated some time, as there is always a volume test failing, so I planned to look at it
16:47:44 ihrachys: sure, I'm just saying
16:48:05 other things don't look new / revealing, so let's move on
16:48:07 #topic Fullstack
16:48:09 i added a new dashboard for the neutron-tempest-plugin repo, http://grafana.openstack.org/dashboard/db/neutron-tempest-plugin-failure-rate
16:48:13 doh
16:48:18 ah right, thanks haleyb
16:48:31 I need to get into the habit of remembering it
16:48:54 so, about the fullstack failure rate
16:49:32 http://logs.openstack.org/80/536380/2/check/neutron-fullstack/138850e/logs/testr_results.html.gz
16:50:04 yes, I found the same issue in many tests during the last few days
16:51:30 I can try to check it this week
16:51:46 looks like the l3 agent failed to start? http://logs.openstack.org/80/536380/2/check/neutron-fullstack/138850e/logs/dsvm-fullstack-logs/TestLegacyL3Agent.test_mtu_update.txt.gz#_2018-01-23_10_46_42_690
16:52:04 there is no l3 agent log in the service log dir
16:52:51 slaweq, ok, please do. maybe l3_agent.py is busted, like import broken or smth
16:52:56 so it doesn't get to logging anything
16:53:06 yes, I will check it ASAP
16:53:11 (probably tomorrow)
16:53:12 thanks
16:53:29 #action slaweq to check why fullstack job is busted / l3agent fails to start
16:53:32 recently I've been working only on fullstack issues, so I have some "experience" :D
16:53:52 you are in the flow, as they say. on a bug fixing spree. :)
16:54:04 your work is not for naught :)
16:54:14 thx :)
16:54:49 speaking of fullstack, slaweq also has this patch that should help with some failures: https://review.openstack.org/#/c/536367/
16:54:56 I haven't gotten to it myself
16:55:10 but since jlibosva and haleyb both voted +2, maybe they can push it in
16:55:33 pushed
16:55:41 haleyb: thx
16:55:54 thanks
16:56:22 another related thing that slaweq noticed during the fullstack spree is this: https://review.openstack.org/#/c/536342/
16:56:34 if nothing else, it's a performance optimization for how agents fetch SGs
16:57:15 is there anything else for fullstack?
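
[Editor's note: a quick, hypothetical way to test the "maybe l3_agent.py is busted, like import broken" theory above: import the agent module directly and print the traceback if it fails before it ever gets a chance to set up logging. The module path used here is assumed from the neutron tree layout and is not taken from the meeting.]

    import importlib
    import traceback

    try:
        # If this raises, the agent would die before creating its log file,
        # which would explain the missing l3 agent log in the service dir.
        importlib.import_module('neutron.agent.l3_agent')
        print('l3 agent module imports cleanly')
    except Exception:
        traceback.print_exc()
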
16:57:21 I mean, that would require reviews
16:57:37 yeap
16:57:42 I don't have anything else :)
16:57:44 in my pile
16:57:58 mlavalle, shoot
16:58:18 I meant slawek's patch is in my pile
16:58:22 oh ok :)
16:58:31 mlavalle: thx
16:58:51 ok, I guess we don't have much time to discuss other stuff, so we will wrap up. next time we will focus on scenarios first.
16:59:03 ok
16:59:29 thanks everyone, and especially slaweq for doing Sisyphean work. that stone must roll over the hill one day.
16:59:42 #endmeeting