16:00:41 #startmeeting neutron_ci
16:00:42 Meeting started Tue Oct 17 16:00:41 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:43 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 The meeting name has been set to 'neutron_ci'
16:00:46 o/
16:00:47 o/
16:00:50 hi
16:01:05 hi
16:01:19 lots of hands, I love it
16:01:24 #topic Actions from prev meeting
16:01:27 "jlibosva to expand grafana window for periodics"
16:01:43 https://review.openstack.org/#/c/510934/
16:01:47 still in review
16:02:09 afaik grafana now doesn't show anything
16:02:18 after zuulv3
16:02:30 so maybe we'll need to rebase
16:02:35 riiight. I believe haleyb was going to have a look at producing a new one, but there were no data in graphite?
16:02:49 yeah, haleyb mentioned it yesterday
16:03:12 ihrachys: i don't see any, and have been busy with something else
16:03:16 I accidentally switched topic, sorry :)
16:03:19 damn customers
16:03:27 :))
16:03:30 lol
16:03:46 haleyb, should someone take it over?
16:04:43 ihrachys: i will ping someone in infra today after my meetings, the customer issue is not hot any more
16:04:51 unless someone else wants it :)
16:05:26 if you can find time for that, keep it. I am only concerned if you don't.
16:05:32 *if you can't
16:06:02 #action haleyb to follow up with infra on missing graphite data for zuulv3 jobs
16:06:41 jlibosva, I guess your patch will need to wait for that to resolve
16:06:51 that's what I wanted to point out at the very beginning
16:07:41 ok, next item was "ihrachys to report bug for trunk scenario failure"
16:08:31 there is this bug that I believe captures the trunk scenario failures: https://bugs.launchpad.net/neutron/+bug/1676966
16:08:32 Launchpad bug 1676966 in neutron "TrunkManagerTestCase.test_connectivity failed to spawn ping" [Medium,Confirmed]
16:08:45 there is also https://bugs.launchpad.net/neutron/+bug/1722644 but I believe it's different
16:08:46 Launchpad bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed]
16:09:06 the first one is for fullstack?
16:09:07 eh, sorry, nevermind, the first is in functional tests
16:09:14 ah, no, functional
16:09:18 yeah. I think we had another one.. wait.
16:09:41 ihrachys: the second seems legit :)
16:09:58 then somebody opened https://bugs.launchpad.net/neutron/+bug/1722967
16:09:59 Launchpad bug 1722967 in neutron "init_handler argument error in ovs agent" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:10:07 oh yeah, I mixed things up. the first is the legit one, and then there is that init_handler one
16:10:10 which solves the traces seen in the bug you reported
16:10:20 but then it still fails on an SSH issue afaik
16:10:21 btw I haven't seen the error from the init_handler one in logs
16:10:35 does it happen all the time?
16:10:37 yes
16:10:40 hmm
16:10:55 ihrachys: perhaps you need to check the other node?
16:11:01 oh, it should happen on both
16:11:26 it's there: http://logs.openstack.org/13/474213/7/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/752030f/logs/screen-q-agt.txt.gz?level=ERROR#_Oct_10_17_35_10_787871
16:11:36 that's from the link in 1722644
16:11:43 I will check again. but anyway, the connectivity failure is still a different one. we had this failure long before.
16:12:10 for 1722967, the fix is in the gate: https://review.openstack.org/#/c/511428/
16:12:23 so maybe we could just refresh the info on 1722644 with the new traceback :)
16:12:26 jlibosva, right. but I haven't seen it in other logs from other runs.
16:12:38 mlavalle: thanks for +W :)
16:12:51 thanks for fixing!
16:13:28 ihrachys: anyways, it's fixed by now :) but as far as I can tell it was 100% reproducible as there were wrong parameter passed during initialization of the ovs agent
16:13:42 was*
16:14:12 ok. but we still have a connectivity issue on top, correct?
16:14:21 I saw those, yes
16:14:29 ok
16:14:58 if someone has cycles to have a look at the scenario trunk failure, welcome. it's the last failure that keeps the job at 100% failure rate.
16:15:08 next item was "jlibosva to look how failing tests can not affect results"
16:15:26 I believe it's about exploring how to "disable" some fullstack tests
16:15:29 to make the job voting
16:16:06 I don't have any patch yet but I went through the testtools code and found they have a way to set "expected failure" on a result
16:16:29 but I think the easiest way would be to decorate test methods and, in case they raise an exception, log it and swallow it
16:16:55 as expected failure means that if the test passes, the final result will be negative
16:17:53 and btw I haven't found any docs on how to set the result object for the testtools test runner anyways :)
16:18:14 jlibosva, so you mean those test cases will always be successful?
16:18:58 right
16:19:10 or skipped
16:19:16 so we know if they failed or not
16:19:38 jlibosva, you mean the decorator will call self.skip?
16:19:43 correct
16:19:54 ok that probably makes sense
16:20:26 I guess the first step is producing the decorator, and then we can go one by one and disable them.
16:20:26 I can craft a patch to try it out
16:20:47 yeah, we could disable one test case along with the decorator to see that it works :)
16:20:52 #action jlibosva to post a patch for a decorator that skips test case results if they failed.
16:23:09 ok, sorry
16:23:26 #topic Grafana
16:24:10 since we can't use grafana to understand if there are gate issues, is anyone aware of critical gate failures that we may want to discuss?
16:24:26 just that i broke networking-bgpvpn
16:24:28 the review queue doesn't seem too red, so probably fine
16:24:33 haleyb, how so?
16:24:57 there was this bug reported: https://bugs.launchpad.net/neutron/+bug/1724253
16:24:58 Launchpad bug 1724253 in neutron "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [Undecided,New]
16:25:03 it sounds severe :)
16:25:04 the pyroute2 network namespace changes in neutron broke it, i'm working with tmorin to help get it fixed
16:25:18 haleyb: and the Heat gate, but it's not related :)
16:25:20 sounds like that's the bug I just pasted
16:25:34 https://review.openstack.org/#/c/503280 and https://review.openstack.org/#/c/500109/
16:25:43 jlibosva: yes, that's the bug
16:26:57 that's all i had on that
16:27:30 hm, ok. which part of the script do they need for privsep?
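
Picking up the fullstack skip-on-failure idea from the action items above, here is a minimal sketch of what such a decorator might look like, assuming the approach jlibosva described: wrap the test method, let genuine skips pass through, and convert any other exception into a logged skip so the test can no longer affect the job result. The decorator name and logger setup are illustrative assumptions, not the actual patch that was posted.

    import functools
    import logging
    import unittest

    LOG = logging.getLogger(__name__)


    def skip_if_failed(test_method):
        """Report a failing test as skipped so it cannot fail the job."""

        @functools.wraps(test_method)
        def wrapper(self, *args, **kwargs):
            try:
                return test_method(self, *args, **kwargs)
            except unittest.SkipTest:
                # A genuine skip is passed through untouched.
                raise
            except Exception:
                # Catches assertion failures too; log the traceback so the
                # failure is still visible in the job output.
                LOG.exception("Unstable test %s failed; reporting it as "
                              "skipped", test_method.__name__)
                self.skipTest("unstable test failed; see the log for details")

        return wrapper

It would then be applied per known-unstable test, for example a hypothetical @skip_if_failed above def test_connectivity(self), and removed once the underlying bug is fixed.
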
16:28:21 i'm not exactly sure, but tmorin mentioned the bgpvpn fullstack tests ran as root, and that's the issue
16:30:06 #topic Scenarios
16:30:29 in addition to trunk, we had router migration tests that failed, but I think we fixed those with Anil's patches
16:30:57 the latest failed runs look like: http://logs.openstack.org/56/509156/8/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/32b6e45/logs/testr_results.html.gz
16:31:17 test_floatingip scenarios fail rather often
16:31:29 haleyb, mlavalle is it something that's on the radar of the l3 team?
16:31:37 bug 1717302 is for that, swami was going to look at it
16:31:38 bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] https://launchpad.net/bugs/1717302
16:32:21 oh nice. I marked it with the gate-failure tag
16:32:29 and there is another patchset here: https://review.openstack.org/#/c/512179/
16:32:33 it's not exactly a gate failure since the job is non-voting, but it's easier to track
16:32:44 not completely sure it is strictly related to this conversation
16:33:49 probably not for scenarios, but worth a look
16:33:55 yeap
16:34:43 ok, seems like we at least track all of the failures.
16:34:55 #topic Fullstack
16:35:24 besides the idea of skipping some test cases, we also had this patch from jlibosva to isolate fullstack services in namespaces.
16:35:36 https://review.openstack.org/506722
16:35:42 it's in conflict/wip
16:35:52 yeah, I suck
16:36:03 no, you don't... lol
16:36:19 :]
16:36:31 I need to continue working on that, I see a big benefit in it
16:36:38 if you do, then what do I do?..
16:36:53 in other news, armax was looking at the trunk failure for fullstack: https://review.openstack.org/#/c/504186/
16:37:05 seems like armax is swamped and may need a helping hand
16:37:29 I can help with that if it would be fine
16:37:56 slaweq, I think it would be fine. I asked him in gerrit just in case.
16:38:13 if he doesn't reply in the next days or post new stuff, I think it's fair to take over.
16:38:17 ok, I will take a look at this
16:38:24 slaweq, thanks a lot!
16:38:37 slaweq++
16:38:39 #action slaweq to take over https://review.openstack.org/#/c/504186/ from armax
16:39:02 #topic Gate setup
16:39:15 one more thing about fullstack
16:39:21 #undo
16:39:22 Removing item from minutes: #topic Gate setup
16:39:29 slaweq, shoot
16:39:47 I was talking with jlibosva yesterday and I also want to try to run execute() with privsep instead of RWD
16:39:58 just FYI :)
16:40:08 I also have a question
16:40:09 we talked about that in one of the previous meetings
16:40:18 it's related to one bug
16:40:23 that I'm just searching for :)
16:40:42 thx jlibosva - I don't have it right now
16:40:46 https://bugs.launchpad.net/neutron/+bug/1721796
16:40:48 Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
16:40:53 and
16:40:54 https://bugs.launchpad.net/neutron/+bug/1654287
16:40:56 Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Critical,New]
16:40:56 slaweq, that's fine, though wasn't the idea of privsep to separate privileges (meaning, use different caps per external call)?
16:41:49 if that would only be used in the tests executor, does it matter?
16:42:28 I thought you meant making execute() somehow grab all caps via privsep and run the command with effective root
16:42:37 but as I looked quickly today, some other projects are doing something like that with privsep (e.g. os-brick)
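
For reference, the "do it like os-brick" approach mentioned here usually looks roughly like the sketch below: declare an oslo.privsep context and expose a privileged execute() through it, so fullstack helpers would not have to go through the rootwrap daemon. The context name, config section, and capability list are illustrative assumptions, not an agreed neutron design.

    from oslo_concurrency import processutils
    from oslo_privsep import capabilities
    from oslo_privsep import priv_context

    # A wide-open context; acceptable for a test-only executor, which was the
    # point raised above ("if that would only be used in the tests executor").
    fullstack_root = priv_context.PrivContext(
        __name__,
        cfg_section='fullstack_privileged',
        pypath=__name__ + '.fullstack_root',
        capabilities=[capabilities.CAP_SYS_ADMIN,
                      capabilities.CAP_NET_ADMIN,
                      capabilities.CAP_DAC_OVERRIDE],
    )


    @fullstack_root.entrypoint
    def execute(*cmd):
        """Run a command with elevated privileges via the privsep daemon."""
        stdout, _stderr = processutils.execute(*cmd)
        return stdout
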
16:43:04 ihrachys: I don't know yet how exactly it should be done
16:43:09 does it mean we will get rid of RWD completely?
16:43:38 I don't know for now
16:43:56 I was thinking we should use it just to avoid using RWD in wait_until_true
16:44:12 so in fullstack, whenever we're waiting for some resource to be ready
16:44:36 which is like a whitebox test thing that is not provided via the API
16:45:52 ok. though it sounds like an uphill battle
16:45:56 what happened to https://review.openstack.org/#/c/510161/ ?
16:46:04 wouldn't it tackle the issue for fullstack?
16:47:01 hard to say, there are good points about "what if the predicate gets stuck"
16:47:29 we have a per-test-case timeout
16:47:36 using eventlet? :)
16:47:47 but then, we are back to the original problem?
16:47:57 that some state is left in the buffer
16:48:25 on a related note, has anyone talked to the oslo folks?
16:48:27 I just had an idea
16:48:30 about the rootwrap behavior
16:49:04 It looks like nobody did
16:49:19 or maybe dalvarez
16:49:52 I wonder if it's possible to flush the state somehow before each call to RWD
16:50:14 I will follow up with oslo
16:50:15 IIRC he attempted to do that in RWD
16:50:45 #action ihrachys to talk to oslo folks about RWD garbage data left in the socket when a call is interrupted
16:50:46 ah, he probably didn't send a patch to gerrit
16:51:30 I will reach out to him too
16:51:50 anything else for fullstack?
16:52:08 let's move on
16:52:10 time
16:52:26 #topic Gate setup
16:52:40 I had a question on what the next steps are for the legacy- zuulv3 jobs
16:53:00 do we stick with those for the time being, or should we start moving them to the neutron repo / transforming them into ansible?
16:53:25 I'd rather start doing it now
16:53:32 at least one by one
16:53:47 do we have someone with spare cycles to start exploring that?
16:54:04 I was going to look at it myself
16:54:06 I guess once the first job is cleared, it should be easier to do the next
16:54:15 slowly
16:54:26 #action mlavalle to explore how to move legacy jobs to new style
16:54:32 since it doesn't seem we are under pressure right now
16:54:46 yeah, that's good. enough balls to juggle already
16:54:56 yeah, once one patch works we can split it up, since the rest should be similar
16:54:59 yeah, once we do one, the rest should be easy
16:55:20 another thing I had in mind is the dvr-ha job that never became voting
16:55:21 but remember, I'll do it slowly
16:55:42 haleyb, is it fair to say that we need fresh grafana data before we assess if the job is ready for primetime?
16:56:41 and also with the zuulv3 deal, I suspect we'll need to first move it to the new style, and only then flip the switch
16:56:45 ihrachys: yes, and i'd assume each job change would require a grafana change, and then some waiting to look at failure rates
16:57:19 let's do it independently of the job changes
16:57:45 mlavalle, not sure infra would love to give us a path forward without first moving to the new style
16:57:56 ahhhh, ok
16:57:56 but we'll see
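
As an aside on the wait_until_true / rootwrap-daemon interaction raised in the fullstack discussion above (bugs 1721796 and 1654287): the helper polls a predicate under an eventlet timeout, roughly as in the simplified sketch below (the real helper in neutron.common.utils wraps the timeout in friendlier error handling, so treat this as an assumption about its shape rather than its exact code). The comment marks where the stale-output problem comes in.

    import eventlet


    def wait_until_true(predicate, timeout=60, sleep=1):
        """Poll predicate() until it returns True or timeout seconds pass."""
        # The whole wait runs under an eventlet timeout; when it fires, the
        # predicate is interrupted wherever it happens to be. If the predicate
        # is mid-way through a command sent to the shared rootwrap daemon
        # client, the daemon's reply is never read, and the next command
        # issued through that client can get the previous command's output,
        # which is what bug 1654287 describes.
        with eventlet.Timeout(timeout):
            while not predicate():
                eventlet.sleep(sleep)
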
16:58:11 another thing to note is we are stuck on the tempest plugin split.
16:58:26 the last patch that was going to introduce new jobs to the tempest plugin repo was: https://review.openstack.org/#/c/507038/
16:58:33 and it needs zuulv3 refinement
16:59:03 the concern here is that the longer it takes to switch to the test classes from the plugin, the more we diverge from the base that is in-tree
16:59:08 since we continue landing new tests
16:59:26 gotta revisit it quickly and see if Chandan needs help
16:59:31 the sync later will be a PITA
16:59:39 we are sadly out of time
16:59:49 we also had a plan to make the multinode job the default and voting, which will have to wait i guess
16:59:50 but I will sync with Chandan
17:00:11 yeah, lots of initiatives stalled
17:00:33 that's why it's important to move forward with the zuulv3 migration, it's not just for the love of ansible
17:00:36 thanks everyone
17:00:38 #endmeeting