16:00:41 <ihrachys> #startmeeting neutron_ci
16:00:42 <openstack> Meeting started Tue Oct 17 16:00:41 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:43 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 <openstack> The meeting name has been set to 'neutron_ci'
16:00:46 <jlibosva> o/
16:00:47 <mlavalle> o/
16:00:50 <slaweq> hi
16:01:05 <haleyb> hi
16:01:19 <ihrachys> lots of hands, I love it
16:01:24 <ihrachys> #topic Actions from prev meeting
16:01:27 <ihrachys> "jlibosva to expand grafana window for periodics"
16:01:43 <ihrachys> https://review.openstack.org/#/c/510934/
16:01:47 <ihrachys> still in review
16:02:09 <jlibosva> afaik grafana now doesn't show anything
16:02:18 <jlibosva> after zuul3
16:02:30 <jlibosva> so maybe we'll need to rebase
16:02:35 <ihrachys> riiight. I believe haleyb was going to have a look at producing a new one, but there were no data in graphite?
16:02:49 <mlavalle> yeah, haleyb mentioned it yesterday
16:03:12 <haleyb> ihrachys: i don't see any, and have been busy with something else
16:03:16 <jlibosva> I accidentally switched topic, sorry :)
16:03:19 <haleyb> damn customers
16:03:27 <ihrachys> :))
16:03:30 <mlavalle> lol
16:03:46 <ihrachys> haleyb, should someone take it over?
16:04:43 <haleyb> ihrachys: i will ping someone in infra today after my meetings, customer issue not hot any more
16:04:51 <haleyb> unless someone else wants it :)
16:05:26 <ihrachys> if you can find time for that, keep it. I am only concerned if you don't.
16:05:32 <ihrachys> *if you can't
16:06:02 <ihrachys> #action haleyb to follow up with infra on missing graphite data for zuulv3 jobs
16:06:41 <ihrachys> jlibosva, I guess your patch will need to wait for that to resolve
16:06:51 <jlibosva> that's what I wanted to point out at the very beginning
16:07:41 <ihrachys> ok, next item was "ihrachys to report bug for trunk scenario failure"
16:08:31 <ihrachys> there is this bug that I believe captures the trunk scenario failures: https://bugs.launchpad.net/neutron/+bug/1676966
16:08:32 <openstack> Launchpad bug 1676966 in neutron "TrunkManagerTestCase.test_connectivity failed to spawn ping" [Medium,Confirmed]
16:08:45 <ihrachys> there is also https://bugs.launchpad.net/neutron/+bug/1722644 but I believe it's different
16:08:46 <openstack> Launchpad bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed]
16:09:06 <jlibosva> the first one is for fullstack?
16:09:07 <ihrachys> eh, sorry, nevermind, the first is in functional tests
16:09:14 <jlibosva> ah, no, functional
16:09:18 <ihrachys> yeah. I think we had another one.. wait.
16:09:41 <jlibosva> ihrachys: the second seems legit :)
16:09:58 <jlibosva> then somebody opened https://bugs.launchpad.net/neutron/+bug/1722967
16:09:59 <openstack> Launchpad bug 1722967 in neutron "init_handler argument error in ovs agent" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:10:07 <ihrachys> oh yeah, I mixed things. the first is the legit one, and then there is that init_handler one
16:10:10 <jlibosva> which solves the traces seen in the bug you reported
16:10:20 <jlibosva> but then it still fails on SSH issue afaik
16:10:21 <ihrachys> btw I haven't seen the error from the init_handler one in logs
16:10:35 <ihrachys> does it happen all the time?
16:10:37 <jlibosva> yes
16:10:40 <ihrachys> hmm
16:10:55 <jlibosva> ihrachys: perhaps you need to check the other node?
16:11:01 <jlibosva> oh, it should happen on both
16:11:26 <jlibosva> it's there: http://logs.openstack.org/13/474213/7/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/752030f/logs/screen-q-agt.txt.gz?level=ERROR#_Oct_10_17_35_10_787871
16:11:36 <jlibosva> that's from the link in 1722644
16:11:43 <ihrachys> I will check again. but anyway, the connectivity failure is still a different one. we had this failure long before.
16:12:10 <mlavalle> for 1722967, the fix is in the gate: https://review.openstack.org/#/c/511428/
16:12:23 <jlibosva> so maybe we could just refresh the info on 1722644 with the new traceback :)
16:12:26 <ihrachys> jlibosva, right. but I haven't seen it in other logs from other runs.
16:12:38 <jlibosva> mlavalle: thanks for +W :)
16:12:51 <mlavalle> thanks for fixing!
16:13:28 <jlibosva> ihrachys: anyways, it's fixed by now :) but as far as I can tell it was 100% reproducible as there was a wrong parameter passed during initialization of ovs agent
16:14:12 <ihrachys> ok. but we still have connectivity issue on top correct?
16:14:21 <jlibosva> I saw those, yes
16:14:29 <ihrachys> ok
16:14:58 <ihrachys> if someone has cycles to have a look at scenario trunk failure, welcome. it's the last failure that keeps the job at 100% failure rate.
16:15:08 <ihrachys> next item was "jlibosva to look how failing tests can not affect results"
16:15:26 <ihrachys> I believe it's about exploring how to "disable" some fullstack tests
16:15:29 <ihrachys> to make the job voting
16:16:06 <jlibosva> I don't have any patch yet but I went through testtools code and found they have a way to set "expected failure" on result
16:16:29 <jlibosva> but I think the easiest way would be to decorate test methods and in case they raise an exception, log it and swallow
16:16:55 <jlibosva> as expected failure means that if the test passes, the final result will be negative
16:17:53 <jlibosva> and btw I haven't found any docs on how to set result object for testtools test runner anyways :)
16:18:14 <ihrachys> jlibosva, so you mean, those test cases will always be successful?
16:18:58 <jlibosva> right
16:19:10 <jlibosva> or skipped
16:19:16 <jlibosva> so we know if they failed or not
16:19:38 <ihrachys> jlibosva, you mean the decorator will call self.skip?
16:19:43 <jlibosva> correct
16:19:54 <ihrachys> ok that probably makes sense
16:20:26 <ihrachys> I guess first step is producing the decorator, and then we can go one by one and disable them.
16:20:26 <jlibosva> I can craft a patch to try it out
16:20:47 <jlibosva> yeah, we could disable one test case along with the decorator to see that it works :)
16:20:52 <ihrachys> #action jlibosva to post a patch for decorator that skips test case results if they failed.
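A minimal sketch of the decorator idea discussed above: wrap the test method, catch the failure, and report the case as skipped so a known-unstable fullstack test cannot turn the job red. The decorator name and wording are illustrative assumptions, not the actual patch from the #action item; it assumes a unittest/testtools-based test case.

```python
import functools
import unittest


def skip_if_failed(f):
    """Sketch: turn a failure of a known-unstable test into a skip."""
    @functools.wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except unittest.SkipTest:
            # A test that skips itself should stay skipped, not be rewrapped.
            raise
        except Exception as exc:
            # Log and swallow the failure by reporting the case as skipped,
            # so the overall job result is not affected.
            self.skipTest("known unstable test failed: %s" % exc)
    return wrapper
```

Compared with testtools' expected-failure mechanism, skipping avoids the inverted semantics mentioned above, where a decorated test that unexpectedly passes would make the final result negative.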
16:23:09 <ihrachys> ok, sorry
16:23:26 <ihrachys> #topic Grafana
16:24:10 <ihrachys> since we can't use grafana to understand if there are gate issues, is anyone aware of critical gate failures that we may want to discuss?
16:24:26 <haleyb> just that i broke networking-bgpvpn
16:24:28 <ihrachys> the review queue doesn't seem too red, so probably fine
16:24:33 <ihrachys> haleyb, how so
16:24:57 <jlibosva> there was this bug reported: https://bugs.launchpad.net/neutron/+bug/1724253
16:24:58 <openstack> Launchpad bug 1724253 in neutron "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [Undecided,New]
16:25:03 <jlibosva> it sounds severe :)
16:25:04 <haleyb> the pyroute2 network namespace changes in neutron broke it, i'm working with tmorin to help get it fixed
16:25:18 <slaweq> haleyb: and Heat gate but it's not related :)
16:25:20 <jlibosva> sounds like that's the bug I just pasted
16:25:34 <haleyb> https://review.openstack.org/#/c/503280 and https://review.openstack.org/#/c/500109/
16:25:43 <haleyb> jlibosva: yes, that's the bug
16:26:57 <haleyb> that's all i had on that
16:27:30 <ihrachys> hm, ok. which part of the script do they need for privsep?
16:28:21 <haleyb> i'm not exactly sure, but tmorin mentioned the bgpvpn fullstack tests ran as root, and that's the issue
16:30:06 <ihrachys> #topic Scenarios
16:30:29 <ihrachys> in addition to trunk, we had router migration tests that failed, but I think we fixed those with Anil's patches
16:30:57 <ihrachys> the late failed runs look like: http://logs.openstack.org/56/509156/8/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/32b6e45/logs/testr_results.html.gz
16:31:17 <ihrachys> test_floatingip scenarios fail rather often
16:31:29 <ihrachys> haleyb, mlavalle is it something that's on radar of l3 team?
16:31:37 <haleyb> bug 1717302 is for that, swami was going to look at it
16:31:38 <openstack> bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] https://launchpad.net/bugs/1717302
16:32:21 <ihrachys> oh nice. I marked it with gate-failure tag
16:32:29 <mlavalle> and there is another patchset here: https://review.openstack.org/#/c/512179/
16:32:33 <ihrachys> it's not exactly gate failure since it's non-voting but easier to track
16:32:44 <mlavalle> not completely sure it is strictly related to this conversation
16:33:49 <ihrachys> probably not for scenarios, but worth a look
16:33:55 <mlavalle> yeap
16:34:43 <ihrachys> ok, seems like we at least track all of the failures.
16:34:55 <ihrachys> #topic Fullstack
16:35:24 <ihrachys> except the idea of skipping some test cases, we also had this patch from jlibosva to isolate fullstack services in namespaces.
16:35:36 <ihrachys> https://review.openstack.org/506722
16:35:42 <ihrachys> it's in conflict/wip
16:35:52 <jlibosva> yeah, I suck
16:36:03 <mlavalle> no, you don't... lol
16:36:19 <jlibosva> :]
16:36:31 <jlibosva> I need to continue working on that, I see a big benefit in it
16:36:38 <ihrachys> if you do, then what do I do?..
16:36:53 <ihrachys> in other news, armax was looking at the trunk failure for fullstack: https://review.openstack.org/#/c/504186/
16:37:05 <ihrachys> seems like armax is swamped and may need a helping hand
16:37:29 <slaweq> I can help with that if it would be fine
16:37:56 <ihrachys> slaweq, I think it would be fine. I asked him in gerrit just in case.
16:38:13 <ihrachys> if he doesn't reply in next days nor posts new stuff, I think it's fair to take over.
16:38:17 <slaweq> ok, I will take a look at this
16:38:24 <ihrachys> slaweq, thanks a lot!
16:38:37 <jlibosva> slaweq++
16:38:39 <ihrachys> #action slaweq to take over https://review.openstack.org/#/c/504186/ from armax
16:39:02 <ihrachys> #topic Gate setup
16:39:15 <slaweq> one more thing about fullstack
16:39:21 <ihrachys> #undo
16:39:22 <openstack> Removing item from minutes: #topic Gate setup
16:39:29 <ihrachys> slaweq, shoot
16:39:47 <slaweq> I was talking with jlibosva yesterday and I want also to try to run execute with privsep instead of RWD
16:39:58 <slaweq> just FYI :)
16:40:08 <mlavalle> I also have a question
16:40:09 <jlibosva> we talked about that in one of previous meetings
16:40:18 <jlibosva> it's related to one bug
16:40:23 <jlibosva> that I'm just searching for :)
16:40:42 <slaweq> thx jlibosva - I don't have it right now
16:40:46 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1721796
16:40:48 <openstack> Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
16:40:53 <jlibosva> and
16:40:54 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1654287
16:40:56 <openstack> Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Critical,New]
16:40:56 <ihrachys> slaweq, that's fine, though wasn't it the idea of privsep to separate privileges (meaning, use different caps per external call?)
16:41:49 <jlibosva> if that would be used in tests executor only, does it matter?
16:42:28 <ihrachys> I thought you meant making execute() somehow grab all caps via privsep and run the command with effective root
16:42:37 <slaweq> but as I looked quickly today some other projects are doing something like that with privsep (e.g. os-brick)
16:43:04 <slaweq> ihrachys: I don't know yet how exactly it should be done
16:43:09 <ihrachys> does it mean we will get rid of RWD completely?
16:43:38 <slaweq> I don't know for now
16:43:56 <jlibosva> I was thinking we should use it just to avoid using rwd in wait_until_true
16:44:12 <jlibosva> so in fullstack, whenever we're waiting for some resource to be ready
16:44:36 <jlibosva> which is like a whitebox test thing that is not provided via API
16:45:52 <ihrachys> ok. though it sounds like an uphill battle
16:45:56 <ihrachys> what happened to https://review.openstack.org/#/c/510161/ ?
16:46:04 <ihrachys> wouldn't it tackle the issue for fullstack?
16:47:01 <jlibosva> hard to say, there are good points about "what if the predicate gets stuck"
16:47:29 <ihrachys> we have per test case timeout
16:47:36 <jlibosva> using eventlet? :)
16:47:47 <ihrachys> but then, we are back to the original problem?
16:47:57 <ihrachys> that some state is in buffer
16:48:25 <ihrachys> on relevant note, anyone talked to oslo folks?
16:48:27 <jlibosva> I just had an idea
16:48:30 <ihrachys> about the rootwrap behavior
16:49:04 <jlibosva> It looks like nobody did
16:49:19 <jlibosva> or maybe dalvarez
16:49:52 <ihrachys> I wonder if it's possible to flush the state somehow before each call to RWD
16:50:14 <ihrachys> I will follow up with oslo
16:50:15 <jlibosva> IIRC he attempted to do that in rwd
16:50:45 <ihrachys> #action ihrachys to talk to oslo folks about RWD garbage data in the socket on call interrupted
16:50:46 <jlibosva> ah, he probably didn't send a patch to gerrit
16:51:30 <ihrachys> I will reach out to him too
16:51:50 <ihrachys> anything else for fullstack?
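For reference, a rough sketch of what running execute() through oslo.privsep instead of the rootwrap daemon could look like for the fullstack helpers discussed above. The context name, capability list, and config section are illustrative assumptions, not the agreed design:

```python
import subprocess

from oslo_privsep import capabilities as caps
from oslo_privsep import priv_context

# Hypothetical privsep context for fullstack test helpers; a real change
# would pick only the capabilities it actually needs.
fullstack_privileged = priv_context.PrivContext(
    __name__,
    cfg_section='privsep_fullstack',
    pypath=__name__ + '.fullstack_privileged',
    capabilities=[caps.CAP_SYS_ADMIN, caps.CAP_NET_ADMIN],
)


@fullstack_privileged.entrypoint
def execute(cmd):
    """Run a command inside the privsep daemon, bypassing the rootwrap
    daemon socket that bug 1654287 is about."""
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')
```

wait_until_true predicates in fullstack could then call such an execute() directly, keeping the rootwrap daemon out of the polling path.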
16:52:08 <mlavalle> let's move on
16:52:10 <mlavalle> time
16:52:26 <ihrachys> #topic Gate setup
16:52:40 <ihrachys> I had a question on what's next steps for legacy- zuulv3 jobs
16:53:00 <ihrachys> do we stick to those for the time being, or should we start moving them to neutron repo/transform into ansible
16:53:25 <mlavalle> I'd rather start doing it now
16:53:32 <mlavalle> at least one by one
16:53:47 <ihrachys> do we have someone with spare cycles to start exploring that?
16:54:04 <mlavalle> I was going to look at it myself
16:54:06 <ihrachys> I guess once first job is cleared, it should be easier to do next
16:54:15 <mlavalle> slowly
16:54:26 <ihrachys> #action mlavalle to explore how to move legacy jobs to new style
16:54:32 <mlavalle> since it doesn't seem we are under pressure right now
16:54:46 <ihrachys> yeah, that's good. enough balls to juggle already
16:54:56 <haleyb> yeah, once one patch works can split it up since should be similar
16:54:59 <mlavalle> yeah, once we do one, the rest should be easy
16:55:20 <ihrachys> another thing I had in mind is the dvr-ha job that never became voting
16:55:21 <mlavalle> but remember, I'll do it slowly
16:55:42 <ihrachys> haleyb, is it fair to say that we need fresh grafana data before we assess if the job is ready for primetime?
16:56:41 <ihrachys> and also with the zuulv3 deal, I suspect we'll need to first move it to new style, only then flip the switch
16:56:45 <haleyb> ihrachys: yes, and i'd assume each job change would require a grafana change, and then some waiting to look at failure rates
16:57:19 <mlavalle> let's do it independently of the job changes
16:57:45 <ihrachys> mlavalle, not sure infra would love to give us path forward without first moving to new style
16:57:56 <mlavalle> ahhhh, ok
16:57:56 <ihrachys> but we'll see
16:58:11 <ihrachys> another thing to note is we are stuck at tempest plugin split.
16:58:26 <ihrachys> the last patch that was going to introduce new jobs to the tempest plugin repo was: https://review.openstack.org/#/c/507038/
16:58:33 <ihrachys> and it needs zuulv3 refinement
16:59:03 <ihrachys> the concern here is that the longer it takes to switch to test classes from the plugin, the more we diverge from the base that is in-tree
16:59:08 <ihrachys> since we continue landing new tests
16:59:26 <ihrachys> gotta revisit it quick and see if Chandan needs help
16:59:31 <ihrachys> the sync later will be a PITA
16:59:39 <ihrachys> we are sadly out of time
16:59:49 <haleyb> we also had a plan to make the multinode job the default and voting, which will have to wait i guess
16:59:50 <ihrachys> but I will sync with Chandan
17:00:11 <ihrachys> yeah, lots of initiatives stalled
17:00:33 <ihrachys> that's why it's important to move forward with zuulv3 migration, it's not just for the love of ansible
17:00:36 <ihrachys> thanks everyone
17:00:38 <ihrachys> #endmeeting