16:00:41 <ihrachys> #startmeeting neutron_ci
16:00:42 <openstack> Meeting started Tue Oct 17 16:00:41 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:43 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:45 <openstack> The meeting name has been set to 'neutron_ci'
16:00:46 <jlibosva> o/
16:00:47 <mlavalle> o/
16:00:50 <slaweq> hi
16:01:05 <haleyb> hi
16:01:19 <ihrachys> lots of hands, I love it
16:01:24 <ihrachys> #topic Actions from prev meeting
16:01:27 <ihrachys> "jlibosva to expand grafana window for periodics"
16:01:43 <ihrachys> https://review.openstack.org/#/c/510934/
16:01:47 <ihrachys> still in review
16:02:09 <jlibosva> afaik grafana now doesn't show anything
16:02:18 <jlibosva> after zuul3
16:02:30 <jlibosva> so maybe we'll need to rebase
16:02:35 <ihrachys> riiight. I believe haleyb was going to have a look at producing a new one, but there were no data in graphite?
16:02:49 <mlavalle> yeah, haleyb mentioned it yesterday
16:03:12 <haleyb> ihrachys: i don't see any, and have been busy with something else
16:03:16 <jlibosva> I accidentally switched topic, sorry :)
16:03:19 <haleyb> damn customers
16:03:27 <ihrachys> :))
16:03:30 <mlavalle> lol
16:03:46 <ihrachys> haleyb, should someone take it over?
16:04:43 <haleyb> ihrachys: i will ping someone in infra today after my meetings, customer issue not hot any more
16:04:51 <haleyb> unless someone else wants it :)
16:05:26 <ihrachys> if you can find time for that, keep it. I am only concerned if you don't.
16:05:32 <ihrachys> *if you can't
16:06:02 <ihrachys> #action haleyb to follow up with infra on missing graphite data for zuulv3 jobs
16:06:41 <ihrachys> jlibosva, I guess your patch will need to wait for that to resolve
16:06:51 <jlibosva> that's what I wanted to point out at the very beginning
16:07:41 <ihrachys> ok, next item was "ihrachys to report bug for trunk scenario failure"
16:08:31 <ihrachys> there is this bug that I believe captures the trunk scenario failures: https://bugs.launchpad.net/neutron/+bug/1676966
16:08:32 <openstack> Launchpad bug 1676966 in neutron "TrunkManagerTestCase.test_connectivity failed to spawn ping" [Medium,Confirmed]
16:08:45 <ihrachys> there is also https://bugs.launchpad.net/neutron/+bug/1722644 but I believe it's different
16:08:46 <openstack> Launchpad bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed]
16:09:06 <jlibosva> the first one is for fullstack?
16:09:07 <ihrachys> eh, sorry, nevermind, the first is in functional tests
16:09:14 <jlibosva> ah, no, functional
16:09:18 <ihrachys> yeah. I think we had another one.. wait.
16:09:41 <jlibosva> ihrachys: the second seems legit :)
16:09:58 <jlibosva> then somebody opened https://bugs.launchpad.net/neutron/+bug/1722967
16:09:59 <openstack> Launchpad bug 1722967 in neutron "init_handler argument error in ovs agent" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:10:07 <ihrachys> oh yeah, I mixed things. the first is the legit one, and then there is that init_handler one
16:10:10 <jlibosva> which solves the traces seen in the bug you reported
16:10:20 <jlibosva> but then it still fails on SSH issue afaik
16:10:21 <ihrachys> btw I haven't seen the error from the init_handler one in logs
16:10:35 <ihrachys> does it happen all the time?
16:10:37 <jlibosva> yes
16:10:40 <ihrachys> hmm
16:10:55 <jlibosva> ihrachys: perhaps you need to check the other node?
16:11:01 <jlibosva> oh, it should happen on both
16:11:26 <jlibosva> it's there: http://logs.openstack.org/13/474213/7/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/752030f/logs/screen-q-agt.txt.gz?level=ERROR#_Oct_10_17_35_10_787871
16:11:36 <jlibosva> that's from the link in 1722644
16:11:43 <ihrachys> I will check again. but anyway, the connectivity failure is still a different one. we had this failure long before.
16:12:10 <mlavalle> for 1722967, the fix is in the gate: https://review.openstack.org/#/c/511428/
16:12:23 <jlibosva> so maybe we could just refresh the info on 1722644 with the new traceback :)
16:12:26 <ihrachys> jlibosva, right. but I haven't seen it in other logs from other runs.
16:12:38 <jlibosva> mlavalle: thanks for +W :)
16:12:51 <mlavalle> thanks for fixing!
16:13:28 <jlibosva> ihrachys: anyways, it's fixed by now :) but as far as I can tell it was 100% reproducible as there was a wrong parameter passed during initialization of ovs agent
16:14:12 <ihrachys> ok. but we still have connectivity issue on top correct?
16:14:21 <jlibosva> I saw those, yes
16:14:29 <ihrachys> ok
16:14:58 <ihrachys> if someone has cycles to have a look at scenario trunk failure, welcome. it's the last failure that keeps the job at 100% failure rate.
16:15:08 <ihrachys> next item was "jlibosva to look how failing tests can not affect results"
16:15:26 <ihrachys> I believe it's about exploring how to "disable" some fullstack tests
16:15:29 <ihrachys> to make the job voting
16:16:06 <jlibosva> I don't have any patch yet but I went through testtools code and found they have a way to set "expected failure" on result
16:16:29 <jlibosva> but I think the easiest way would be to decorate test methods and in case they raise an exception, log it and swallow
16:16:55 <jlibosva> as expected failure means that if the test passes, the final result will be negative
16:17:53 <jlibosva> and btw I haven't found any docs on how to set result object for testtools test runner anyways :)
16:18:14 <ihrachys> jlibosva, so you mean, those test cases will always be successful?
16:18:58 <jlibosva> right
16:19:10 <jlibosva> or skipped
16:19:16 <jlibosva> so we know if they failed or not
16:19:38 <ihrachys> jlibosva, you mean the decorator will call self.skip?
16:19:43 <jlibosva> correct
16:19:54 <ihrachys> ok that probably makes sense
16:20:26 <ihrachys> I guess first step is producing the decorator, and then we can go one by one and disable them.
16:20:26 <jlibosva> I can craft a patch to try it out
16:20:47 <jlibosva> yeah, we could disable one test case along with the decorator to see that it works :)
16:20:52 <ihrachys> #action jlibosva to post a patch for decorator that skips test case results if they failed.
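A minimal sketch of the decorator idea discussed above: wrap the test method, catch the failure, and report the case as skipped so a known-unstable fullstack test cannot turn the job red. The decorator name and wording are illustrative assumptions, not the actual patch from the #action item; it assumes a unittest/testtools-based test case.

```python
import functools
import unittest


def skip_if_failed(f):
    """Sketch: turn a failure of a known-unstable test into a skip."""
    @functools.wraps(f)
    def wrapper(self, *args, **kwargs):
        try:
            return f(self, *args, **kwargs)
        except unittest.SkipTest:
            # A test that skips itself should stay skipped, not be rewrapped.
            raise
        except Exception as exc:
            # Log and swallow the failure by reporting the case as skipped,
            # so the overall job result is not affected.
            self.skipTest("known unstable test failed: %s" % exc)
    return wrapper
```

Compared with testtools' expected-failure mechanism, skipping avoids the inverted semantics mentioned above, where a decorated test that unexpectedly passes would make the final result negative.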
16:23:09 <ihrachys> ok, sorry
16:23:26 <ihrachys> #topic Grafana
16:24:10 <ihrachys> since we can't use grafana to understand if there are gate issues, is anyone aware of critical gate failures that we may want to discuss?
16:24:26 <haleyb> just that i broke networking-bgpvpn
16:24:28 <ihrachys> the review queue doesn't seem too red, so probably fine
16:24:33 <ihrachys> haleyb, how so
16:24:57 <jlibosva> there was this bug reported: https://bugs.launchpad.net/neutron/+bug/1724253
16:24:58 <openstack> Launchpad bug 1724253 in neutron "error in netns privsep wrapper: ImportError: No module named agent.linux.ip_lib" [Undecided,New]
16:25:03 <jlibosva> it sounds severe :)
16:25:04 <haleyb> the pyroute2 network namespace changes in neutron broke it, i'm working with tmorin to help get it fixed
16:25:18 <slaweq> haleyb: and Heat gate but it's not related :)
16:25:20 <jlibosva> sounds like that's the bug I just pasted
16:25:34 <haleyb> https://review.openstack.org/#/c/503280 and https://review.openstack.org/#/c/500109/
16:25:43 <haleyb> jlibosva: yes, that's the bug
16:26:57 <haleyb> that's all i had on that
16:27:30 <ihrachys> hm, ok. which part of the script do they need for privsep?
16:28:21 <haleyb> i'm not exactly sure, but tmorin mentioned the bgpvpn fullstack tests ran as root, and that's the issue
16:30:06 <ihrachys> #topic Scenarios
16:30:29 <ihrachys> in addition to trunk, we had router migration tests that failed, but I think we fixed those with Anil's patches
16:30:57 <ihrachys> the late failed runs look like: http://logs.openstack.org/56/509156/8/check/legacy-tempest-dsvm-neutron-dvr-multinode-scenario/32b6e45/logs/testr_results.html.gz
16:31:17 <ihrachys> test_floatingip scenarios fail rather often
16:31:29 <ihrachys> haleyb, mlavalle is it something that's on radar of l3 team?
16:31:37 <haleyb> bug 1717302 is for that, swami was going to look at it
16:31:38 <openstack> bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] https://launchpad.net/bugs/1717302
16:32:21 <ihrachys> oh nice. I marked it with gate-failure tag
16:32:29 <mlavalle> and there is another patchset here: https://review.openstack.org/#/c/512179/
16:32:33 <ihrachys> it's not exactly gate failure since it's non-voting but easier to track
16:32:44 <mlavalle> not completely sure it is strictly related to this conversation
16:33:49 <ihrachys> probably not for scenarios, but worth a look
16:33:55 <mlavalle> yeap
16:34:43 <ihrachys> ok, seems like we at least track all of the failures.
16:34:55 <ihrachys> #topic Fullstack
16:35:24 <ihrachys> except the idea of skipping some test cases, we also had this patch from jlibosva to isolate fullstack services in namespaces.
16:35:36 <ihrachys> https://review.openstack.org/506722
16:35:42 <ihrachys> it's in conflict/wip
16:35:52 <jlibosva> yeah, I suck
16:36:03 <mlavalle> no, you don't... lol
16:36:19 <jlibosva> :]
16:36:31 <jlibosva> I need to continue working on that, I see a big benefit in it
16:36:38 <ihrachys> if you do, then what do I do?..
16:36:53 <ihrachys> in other news, armax was looking at the trunk failure for fullstack: https://review.openstack.org/#/c/504186/
16:37:05 <ihrachys> seems like armax is swamped and may need a helping hand
16:37:29 <slaweq> I can help with that if it would be fine
16:37:56 <ihrachys> slaweq, I think it would be fine. I asked him in gerrit just in case.
16:38:13 <ihrachys> if he doesn't reply in next days nor posts new stuff, I think it's fair to take over.
16:38:17 <slaweq> ok, I will take a look at this
16:38:24 <ihrachys> slaweq, thanks a lot!
16:38:37 <jlibosva> slaweq++
16:38:39 <ihrachys> #action slaweq to take over https://review.openstack.org/#/c/504186/ from armax
16:39:02 <ihrachys> #topic Gate setup
16:39:15 <slaweq> one more thing about fullstack
16:39:21 <ihrachys> #undo
16:39:22 <openstack> Removing item from minutes: #topic Gate setup
16:39:29 <ihrachys> slaweq, shoot
16:39:47 <slaweq> I was talking with jlibosva yesterday and I want also to try to run execute with privsep instead of RWD
16:39:58 <slaweq> just FYI :)
16:40:08 <mlavalle> I also have a question
16:40:09 <jlibosva> we talked about that in one of previous meetings
16:40:18 <jlibosva> it's related to one bug
16:40:23 <jlibosva> that I'm just searching for :)
16:40:42 <slaweq> thx jlibosva - I don't have it right now
16:40:46 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1721796
16:40:48 <openstack> Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
16:40:53 <jlibosva> and
16:40:54 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1654287
16:40:56 <openstack> Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Critical,New]
16:40:56 <ihrachys> slaweq, that's fine, though wasn't it the idea of privsep to separate privileges (meaning, use different caps per external call?)
16:41:49 <jlibosva> if that would be used in tests executor only, does it matter?
16:42:28 <ihrachys> I thought you meant making execute() somehow grab all caps via privsep and run the command with effective root
16:42:37 <slaweq> but as I looked quickly today some other projects are doing something like that with privsep (e.g. os-brick)
16:43:04 <slaweq> ihrachys: I don't know yet how exactly it should be done
16:43:09 <ihrachys> does it mean we will get rid of RWD completely?
16:43:38 <slaweq> I don't know for now
16:43:56 <jlibosva> I was thinking we should use it just to avoid using rwd in wait_until_true
16:44:12 <jlibosva> so in fullstack, whenever we're waiting for some resource to be ready
16:44:36 <jlibosva> which is like a whitebox test thing that is not provided via API
16:45:52 <ihrachys> ok. though it sounds like an uphill battle
16:45:56 <ihrachys> what happened to https://review.openstack.org/#/c/510161/ ?
16:46:04 <ihrachys> wouldn't it tackle the issue for fullstack?
16:47:01 <jlibosva> hard to say, there are good points about "what if the predicate gets stuck"
16:47:29 <ihrachys> we have per test case timeout
16:47:36 <jlibosva> using eventlet? :)
16:47:47 <ihrachys> but then, we are back to the original problem?
16:47:57 <ihrachys> that some state is in buffer
16:48:25 <ihrachys> on relevant note, anyone talked to oslo folks?
16:48:27 <jlibosva> I just had an idea
16:48:30 <ihrachys> about the rootwrap behavior
16:49:04 <jlibosva> It looks like nobody did
16:49:19 <jlibosva> or maybe dalvarez
16:49:52 <ihrachys> I wonder if it's possible to flush the state somehow before each call to RWD
16:50:14 <ihrachys> I will follow up with oslo
16:50:15 <jlibosva> IIRC he attempted to do that in rwd
16:50:45 <ihrachys> #action ihrachys to talk to oslo folks about RWD garbage data in the socket on call interrupted
16:50:46 <jlibosva> ah, he probably didn't send a patch to gerrit
16:51:30 <ihrachys> I will reach out to him too
16:51:50 <ihrachys> anything else for fullstack?
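For reference, a rough sketch of what running execute() through oslo.privsep instead of the rootwrap daemon could look like for the fullstack helpers discussed above. The context name, capability list, and config section are illustrative assumptions, not the agreed design:

```python
import subprocess

from oslo_privsep import capabilities as caps
from oslo_privsep import priv_context

# Hypothetical privsep context for fullstack test helpers; a real change
# would pick only the capabilities it actually needs.
fullstack_privileged = priv_context.PrivContext(
    __name__,
    cfg_section='privsep_fullstack',
    pypath=__name__ + '.fullstack_privileged',
    capabilities=[caps.CAP_SYS_ADMIN, caps.CAP_NET_ADMIN],
)


@fullstack_privileged.entrypoint
def execute(cmd):
    """Run a command inside the privsep daemon, bypassing the rootwrap
    daemon socket that bug 1654287 is about."""
    return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')
```

wait_until_true predicates in fullstack could then call such an execute() directly, keeping the rootwrap daemon out of the polling path.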
16:52:08 <mlavalle> let's move on
16:52:10 <mlavalle> time
16:52:26 <ihrachys> #topic Gate setup
16:52:40 <ihrachys> I had a question on what's next steps for legacy- zuulv3 jobs
16:53:00 <ihrachys> do we stick to those for the time being, or should we start moving them to neutron repo/transform into ansible
16:53:25 <mlavalle> I'd rather start doing it now
16:53:32 <mlavalle> at least one by one
16:53:47 <ihrachys> do we have someone with spare cycles to start exploring that?
16:54:04 <mlavalle> I was going to look at it myself
16:54:06 <ihrachys> I guess once first job is cleared, it should be easier to do next
16:54:15 <mlavalle> slowly
16:54:26 <ihrachys> #action mlavalle to explore how to move legacy jobs to new style
16:54:32 <mlavalle> since it doesn't seem we are under pressure right now
16:54:46 <ihrachys> yeah, that's good. enough balls to juggle already
16:54:56 <haleyb> yeah, once one patch works can split it up since should be similar
16:54:59 <mlavalle> yeah, once we do one, the rest should be easy
16:55:20 <ihrachys> another thing I had in mind is the dvr-ha job that never became voting
16:55:21 <mlavalle> but remember, I'll do it slowly
16:55:42 <ihrachys> haleyb, is it fair to say that we need fresh grafana data before we assess if the job is ready for primetime?
16:56:41 <ihrachys> and also with the zuulv3 deal, I suspect we'll need to first move it to new style, only then flip the switch
16:56:45 <haleyb> ihrachys: yes, and i'd assume each job change would require a grafana change, and then some waiting to look at failure rates
16:57:19 <mlavalle> let's do it independently of the job changes
16:57:45 <ihrachys> mlavalle, not sure infra would love to give us path forward without first moving to new style
16:57:56 <mlavalle> ahhhh, ok
16:57:56 <ihrachys> but we'll see
16:58:11 <ihrachys> another thing to note is we are stuck at tempest plugin split.
16:58:26 <ihrachys> the last patch that was going to introduce new jobs to the tempest plugin repo was: https://review.openstack.org/#/c/507038/
16:58:33 <ihrachys> and it needs zuulv3 refinement
16:59:03 <ihrachys> the concern here is that the longer it takes to switch to test classes from the plugin, the more we diverge from the base that is in-tree
16:59:08 <ihrachys> since we continue landing new tests
16:59:26 <ihrachys> gotta revisit it quick and see if Chandan needs help
16:59:31 <ihrachys> the sync later will be a PITA
16:59:39 <ihrachys> we are sadly out of time
16:59:49 <haleyb> we also had a plan to make the multinode job the default and voting, which will have to wait i guess
16:59:50 <ihrachys> but I will sync with Chandan
17:00:11 <ihrachys> yeah, lots of initiatives stalled
17:00:33 <ihrachys> that's why it's important to move forward with zuulv3 migration, it's not just for the love of ansible
17:00:36 <ihrachys> thanks everyone
17:00:38 <ihrachys> #endmeeting