16:00:25 <ihrachys> #startmeeting neutron_ci
16:00:26 <openstack> Meeting started Tue Oct 10 16:00:25 2017 UTC and is due to finish in 60 minutes.  The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:30 <openstack> The meeting name has been set to 'neutron_ci'
16:00:38 <ihrachys> o/
16:00:42 <slaweq> hello
16:00:59 <ihrachys> slaweq, nice to see you joining
16:00:59 <ihrachys> #topic Action items from prev week
16:01:12 <ihrachys> first was "haleyb to update grafana board with new job names"
16:01:13 <mlavalle> o/
16:01:33 <jlibosva> \o
16:01:48 <ihrachys> haleyb, where do we stand on grafana?
16:02:03 <ihrachys> I think since they reverted to v2.5, we were thinking about having two boards for both?
16:02:03 <haleyb> ihrachys: i have a test patch but need to get feedback, still don't know what the layout will be with zuulv3
16:02:27 <ihrachys> https://review.openstack.org/#/c/509291/ ?
16:03:13 <ihrachys> whose feedback do you seek? I presume infra?
16:03:20 <haleyb> right, but i don't see stats for those on the collection tree
16:03:33 * haleyb tries to think of that page
16:03:45 <ihrachys> clarkb, whom should we talk to about where to get stats for new zuul job failures?
16:03:48 <haleyb> http://graphite.openstack.org/
16:03:56 <ihrachys> seems like only old jobs are in graphite
16:04:11 <clarkb> ihrachys: jeblair is probably the best person to ask
16:04:25 <mlavalle> clarkb: could you give us some feedback on https://review.openstack.org/#/c/509291/?
16:04:40 <mlavalle> lol, clarkb you were listening
16:04:42 <ihrachys> thanks. haleyb could you follow up with jeblair on the matter? I think he hangs out in #openstack-infra
16:04:42 <clarkb> I know there were some initial firewall issues but I thought those got sorted out, but it's possible we aren't allowing the new job stats through yet
16:05:01 <clarkb> pabelanger is probably the other person to ask as I think he was working on the firewall bits
16:05:09 <haleyb> yes, i'll follow up since we'll need them starting tomorrow
16:05:21 <ihrachys> haleyb, what's tomorrow?
16:05:46 * ihrachys is out of touch lately
16:05:53 <haleyb> zuulv3 #2, so the neutron dashboard will stop reporting
16:06:01 <ihrachys> they switch back tomorrow?
16:06:01 <jeblair> https://review.openstack.org/510580 is needed for v3 job stats
16:06:21 <jeblair> i think folks have considered that low-priority, but if it's a big impact for you, i can escalate it
16:06:25 <ihrachys> jeblair, great! is it reasonable to expect it to land before the switch?
16:06:27 <mlavalle> ihrachys: yeap
16:06:41 <jeblair> ihrachys: if you need it, i think so, yes
16:07:03 <pabelanger> I can review here shortly
16:07:04 <ihrachys> jeblair, yeah, we would like to have stats when the switch occurs. the previous time we switched, we had problems determining where we were
16:07:24 <ihrachys> thanks folks, we appreciate all the work and quick response
16:07:29 <haleyb> jeblair: thanks, when that merges, will graphite then have the new job stats?
16:07:32 <jeblair> ihrachys: ok.  i'll push on it.  that updates the docs too, so you should be able to construct the new statsd keys you'll need to update the grafana boards
16:07:48 <jeblair> haleyb: after we restart zuul with it.  probably some time today.
16:08:17 <haleyb> great, thanks!
16:08:26 <ihrachys> haleyb, seems like you'll need to change queries a bit
16:08:45 <ihrachys> I guess we can do that in parallel to the infra patch being reviewed/merged
16:08:58 <haleyb> yes, and i can update the FAQ since others might want to know
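For reference, a rough sketch of the kind of query rewrite this implies; the job name is only an example and the v2-era key shape is approximate, while the exact v3 key layout is whatever the docs change in https://review.openstack.org/510580 documents once it merges:

```python
# Rough sketch only: example job name, approximate key shapes.  The real v3
# statsd layout comes from the zuul docs update in 510580.

# v2-era graphite target the neutron board uses today (roughly):
old_target = (
    "asPercent("
    "stats_counts.zuul.pipeline.check.job."
    "gate-neutron-dsvm-fullstack-ubuntu-xenial.FAILURE,"
    "sum(stats_counts.zuul.pipeline.check.job."
    "gate-neutron-dsvm-fullstack-ubuntu-xenial.{SUCCESS,FAILURE}))"
)

# Under v3 the keys are namespaced differently (per tenant/project/branch),
# so every target on the board needs its key path rebuilt against the
# documented scheme once 510580 lands; the asPercent/FAILURE structure of
# the queries themselves can stay the same.
```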
16:09:14 <ihrachys> cool
16:09:19 <ihrachys> this was the only AI from the prev meeting
16:09:37 <ihrachys> #topic zuulv3 preparations
16:09:52 <ihrachys> at the prev meeting, we were in fire drill mode because v3 was enabled and resulted in havoc
16:10:14 <ihrachys> we started this etherpad back then to understand what fails: https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:10:29 <ihrachys> and we had people assigned to investigate each repo/branch
16:10:50 <ihrachys> since then we switched back to 2.5 so it was not as pressing
16:11:03 <boden> well v3 is coming back online tomorrow
16:11:04 <boden> http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html
16:11:05 <ihrachys> and it seems like with the switch back, we lost the ability to trigger results for zuulv3
16:11:16 <ihrachys> boden, yeah, haleyb mentioned that ^
16:11:20 <boden> oh sorry
16:11:47 <ihrachys> clarkb, I think the plan was to be able to trigger zuul results no? they would just be non-voting
16:13:00 <boden> I don’t want to speak for others, but I was seeing Zuul results (non gating); as of late I only see Jenkins… maybe I’m missing something, so I recently submitted a test patch for lib
16:13:19 <clarkb> ihrachys: we did a soft rollback of v3 so v2.5 and v3 have been running together and both reporting on changes
16:13:30 <clarkb> ihrachys: the idea was you'd use the results during the last week or so to continue debugging
16:13:42 <ihrachys> clarkb, I don't think I saw any Zuul results since rollback
16:13:46 <clarkb> jenkins votes are from v2, zuul results are v3
16:13:52 <ihrachys> I was told it's long queue
16:13:59 <ihrachys> but I haven't seen it in a week anyway
16:13:59 <clarkb> ihrachys: it is, but it is processing them
16:14:12 <ihrachys> so projects didn't have a chance to fix anything really
16:14:18 <clarkb> you can query label:Verified,zuul or something to see where it has voted in gerrit
16:14:33 <clarkb> ihrachys: I'm not sure that is the case, many projects have been able to debug and fix things aiui including tripleo
16:14:55 <ihrachys> we sent a bunch of 'sentinel' patches for different branches and projects, and couldn't get anything
16:15:15 <ihrachys> examples: https://review.openstack.org/#/c/509251/ https://review.openstack.org/#/c/502564/
16:15:23 <boden> and also for neutron-lib https://review.openstack.org/#/c/493280/
16:15:25 <ihrachys> and boden seems to confirm that
16:15:48 <boden> yes… I did see Zuul results for a few days, but not as of the last few days
16:16:02 <ihrachys> I saw last results on the day of the rollback
16:16:06 <clarkb> https://review.openstack.org/#/q/project:openstack/neutron+label:Verified%252Czuul is an example query
16:16:12 <ihrachys> before we started pushing sentinels
16:17:11 <ihrachys> clarkb, ok I guess it was unrealistic to expect it to trigger on the patches we sent
16:17:15 <ihrachys> is it that slow?
16:17:33 <clarkb> ihrachys: I think it's ~30 hours behind right now (it has 20% of our node capacity)
16:17:49 <ihrachys> well we definitely sent those the previous Tue
16:18:04 <ihrachys> maybe some restart caught us in between and reset queues
16:18:18 <clarkb> as jeblair mentioned above there have also been restarts to address problems as fixes have come in, so things may have been caught by that as well
16:18:27 <clarkb> but it's definitely voting on some neutron changes based on that query at least
16:18:43 <ihrachys> ok gotcha
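For anyone scripting the check rather than clicking the query link, a minimal sketch against the Gerrit REST API using the same search string clarkb gave; the `requests` usage and field names are standard Gerrit REST, nothing neutron-specific:

```python
# Minimal sketch: list recent openstack/neutron changes where the "zuul"
# account (Zuul v3) has left a Verified vote, mirroring clarkb's query.
import json
import requests

GERRIT = "https://review.openstack.org"
QUERY = "project:openstack/neutron label:Verified,zuul"

resp = requests.get(
    GERRIT + "/changes/",
    params={"q": QUERY, "n": 25},  # limit to the 25 most recent matches
)
resp.raise_for_status()
# Gerrit prefixes JSON responses with ")]}'" to prevent XSSI; strip that line.
changes = json.loads(resp.text.split("\n", 1)[1])
for change in changes:
    print(change["_number"], change["subject"])
```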
16:19:05 <ihrachys> folks, let's revisit repos/branches assigned to each of us before tomorrow
16:19:15 <ihrachys> I mean as in the list at https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:20:09 <ihrachys> #topic Grafana
16:20:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:26 <ihrachys> I see periodic ryu job failed the last time it executed
16:21:57 <jlibosva> I was thinking whether we should use a longer window than 24 hours in grafana - as the job runs once per day, we end up with binary data in grafana
16:22:05 <jlibosva> if that's even possible
16:22:36 <jlibosva> like a week, so we would have 7 samples instead of one in periodic graph
16:22:52 <ihrachys> I think it makes sense.
16:23:27 <ihrachys> I see some boards use 12hours instead of 24hours
16:24:02 <ihrachys> dunno if we can go higher though
16:24:36 <ihrachys> but it's an argument to movingAverage function
16:25:23 <ihrachys> http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.movingAverage
16:25:47 <jlibosva> I thought it's asPercentage
16:25:51 <ihrachys> seems like it allows different scales, and if nothing else, we can have "number N of datapoints"
16:25:52 <jlibosva> asPercent*
16:26:11 <jlibosva> ah no, you're right :)
16:26:26 <ihrachys> jlibosva, wanna send the patch?
16:26:32 <jlibosva> yep
16:26:51 <ihrachys> #action jlibosva to expand grafana window for periodics
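A rough sketch of what the wider window looks like against the graphite render API; the metric path here is a placeholder guess for the periodic ryu job, and the point under discussion is that movingAverage accepts either a quoted time period ('24hours') or a plain datapoint count:

```python
# Rough sketch: pull a week of datapoints for a periodic job and smooth over
# the last 7 points instead of a 24-hour window.  The metric path is a
# placeholder; the real key has to be looked up on graphite.openstack.org.
import requests

GRAPHITE = "http://graphite.openstack.org/render"
METRIC = ("stats_counts.zuul.pipeline.periodic.job."
          "periodic-tempest-dsvm-neutron-with-ryu-master-ubuntu-xenial.FAILURE")

resp = requests.get(
    GRAPHITE,
    params={
        # 7 datapoints ~= a week of once-a-day runs; '24hours' gives the
        # current binary-looking graph for a daily job.
        "target": "movingAverage(%s, 7)" % METRIC,
        "from": "-7days",
        "format": "json",
    },
)
resp.raise_for_status()
for series in resp.json():
    print(series["target"], series["datapoints"][-3:])
```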
16:27:20 <ihrachys> I briefly checked the ryu failure, it does seem like a regular volume tempest issue: http://logs.openstack.org/periodic/periodic-tempest-dsvm-neutron-with-ryu-master-ubuntu-xenial/b36a348/console.html
16:28:13 <ihrachys> other than that, we have scenarios and fullstack that we'll discuss separately
16:28:22 <ihrachys> anything else related to grafana in general?
16:28:48 <ihrachys> ok
16:28:55 <ihrachys> #topic Scenario jobs
16:29:19 <ihrachys> anilvenkata was working on router migration failures lately
16:29:26 <ihrachys> a bunch of fixes landed
16:29:37 <ihrachys> we are landing the hopefully final test-only fix here: https://review.openstack.org/#/c/500384/
16:30:18 <jlibosva> big kudos to anilvenkata
16:30:31 <ihrachys> Anil also suggested initially that we may need to enable ARP responder when in DVR mode: https://review.openstack.org/#/c/510090/ because he experienced some lost ARPs in the gate, but it seems like it should work either way.
16:30:32 <haleyb> anilvenkata++
16:30:49 <ihrachys> we will revisit the state of the job in next days and see if this piece is complete
16:31:19 <ihrachys> there are still other failures in the job, as can be seen in: http://logs.openstack.org/90/510090/2/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/a2d5341/logs/testr_results.html.gz
16:32:20 <ihrachys> haleyb, re east-west test failures, it seems like some other case of FIP not configured/not available
16:32:51 <haleyb> ihrachys: yes, or an arp issue
16:32:52 <ihrachys> haleyb, I think it makes sense to bring it up at l3team meeting
16:33:02 <ihrachys> since I believe it fails consistently
16:33:18 <haleyb> i'll put it on our agenda
16:33:22 <ihrachys> we don't have a bug reported for that failure, so that could be a good start
16:33:27 <ihrachys> thanks!
16:33:57 <ihrachys> the second failure is related to trunk ports
16:34:26 <ihrachys> we don't seem to have a bug for that either
16:34:58 <ihrachys> armax, are you available to have a look at the trunk scenario failure?
16:37:06 <ihrachys> seems like Armando is not there
16:37:07 <mlavalle> he might not be around
16:37:11 <ihrachys> yeah
16:37:36 <ihrachys> I will report the bug, and if someone has cycles to triage it that would be great, for I don't have the cycles
16:37:48 <ihrachys> #action ihrachys to report bug for trunk scenario failure
16:38:48 <ihrachys> #topic Fullstack
16:39:12 <ihrachys> one of the failures there is also trunk related
16:39:16 <ihrachys> armax has this patch https://review.openstack.org/#/c/504186/
16:39:22 <ihrachys> but it has been sitting in W-1 for a while already
16:39:27 <ihrachys> almost a month
16:39:29 <jlibosva> I also started slowly filling https://etherpad.openstack.org/p/fullstack-issues
16:39:43 <jlibosva> the switch to rwd causes lots of issues
16:39:55 * jlibosva looks for an LP bug
16:40:08 <ihrachys> rwd?
16:40:11 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1654287
16:40:13 <openstack> Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Undecided,New]
16:40:16 <jlibosva> rootwrap daemon
16:40:40 <jlibosva> not rewind ;)
16:41:08 <ihrachys> jlibosva, aha. so that's what happens? I think we saw that with the issue that slaweq was dealing with where sysctl failed with netns error
16:41:11 <jlibosva> so I sent out a patch - https://review.openstack.org/#/c/510161/
16:41:24 <jlibosva> ihrachys: yeah, it's tricky :)
16:41:35 <jlibosva> but reviewers have good points there
16:41:51 <slaweq> ihrachys: yes, as I'm reading this bug now, it explains why this issue with namespaces is happening
16:42:13 * ihrachys is stunned John reviewed that
16:42:20 * ihrachys checks the calendar date
16:42:25 <jlibosva> we hit that issue in the past, so when I saw one command had netstat output, it rang a bell
16:42:25 <ihrachys> no, it's indeed 2017
16:42:38 <jlibosva> ihrachys: yeah, he was also on irc :D
16:43:30 <jlibosva> so my question also is, besides the fact that upstream nodes don't configure sudo, is there any other reason to use rootwrap?
16:43:50 <ihrachys> that was the only one I had
16:44:04 <ihrachys> but is it happening in the tester threads only?
16:44:10 <jlibosva> we could exercise rootwrap only the way production code uses it, while the test runner could still use sudo, presuming we configure it correctly in the job
16:44:22 <jlibosva> so far I have never seen it in production code
16:44:33 <jlibosva> as we use wait_until_true usually to wait for some resource to be ready in the tests
16:44:40 <jlibosva> in production we use it only in ipv6 prefix delegation
16:44:46 <jlibosva> it == wait_until_true
16:44:51 <ihrachys> but it's in neutron/common/utils.py so could be used outside
16:44:56 <jlibosva> and wait_until_true and the rootwrap daemon are not friends
16:45:40 <ihrachys> but now they are right?
16:45:44 <jlibosva> so maybe we should consider moving it to the tests dir and document that it's not nice to use it with rwd - or fix rwd
16:45:44 <ihrachys> with the patch
16:45:57 <jlibosva> it's not 100% reproducible
16:46:34 <ihrachys> isn't the fullstack issue one that hits us rather regularly?
16:46:50 <ihrachys> the one where we fail to create netns
16:47:14 <slaweq> it was/is a quite common cause of failures AFAIK
16:47:45 <jlibosva> one I saw was when allocating a port - http://logs.openstack.org/67/488567/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/eb8f9a3/testr_results.html.gz
16:49:19 <ihrachys> I will need to read through the comments to understand why it's not an ideal solution
16:49:28 <jlibosva> we can move on and discuss next steps on LP - https://bugs.launchpad.net/neutron/+bug/1721796
16:49:29 <openstack> Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
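To make the failure mode from that bug concrete, here is a simplified sketch (not neutron's actual code, just the shape of it): wait_until_true bounds its predicate with an eventlet timeout, and if that timeout fires while a rootwrap-daemon command is in flight, the daemon's reply is left unread and gets handed to whoever sends the next command.

```python
# Simplified sketch of the hazard from bug 1654287; not neutron's real code.
# wait_until_true() in neutron.common.utils looks roughly like this:
import eventlet


def wait_until_true(predicate, timeout=60, sleep=1):
    with eventlet.Timeout(timeout):
        while not predicate():
            eventlet.sleep(sleep)


# If the predicate shells out through the rootwrap daemon, e.g.
#
#     wait_until_true(lambda: "LISTEN" in execute(["netstat", "-nlp"],
#                                                 run_as_root=True))
#
# and the Timeout fires after the request was written to the daemon but
# before its reply was read, that reply stays queued on the daemon channel.
# The next unrelated execute() then reads netstat output instead of its own
# result - the kind of confusion behind the "sysctl failed with a netns
# error" fullstack failures.  Running test-runner helpers with plain sudo,
# or fixing oslo.rootwrap to match replies to requests, avoids the mismatch.
```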
16:49:36 <ihrachys> that being said, you think we can have it anyway? or we should work on smth else?
16:49:53 <jlibosva> Ideally we should fix oslo rootwrap
16:50:12 <ihrachys> I know what they will tell us :)
16:50:16 <jlibosva> I think dalvarez attempted to fix it in the past, I'm not sure it would be possible though
16:50:18 <ihrachys> 'use oslo.privsep'
16:51:03 <jlibosva> well, if we could have an execute() with privsep, that would solve the issues
16:51:25 <jlibosva> that's a good point
16:51:29 <jlibosva> to use privsep :)
16:51:36 <jlibosva> I haven't thought about it
16:51:57 <ihrachys> it's not about execute, it's about the executables that we trigger. you can't solve it with a single patch. and I am not even sure why we would, realistically.
16:51:58 <jlibosva> we have 8 minutes :)
16:52:05 <ihrachys> ok ok
16:52:31 <ihrachys> apart from that, anything interesting about fullstack? how's your work to switch to 'containers' for services going?
16:52:47 <jlibosva> I have "something" that is not done
16:53:03 <jlibosva> https://review.openstack.org/#/c/506722/
16:53:20 <jlibosva> fighting with rootwrap filters, that's the last thing I looked at
16:53:29 <jlibosva> I think I haven't touched it last week
16:53:40 <jlibosva> but I have one thing I wanted to discuss regarding fullstack
16:53:45 <ihrachys> shoot
16:53:59 <jlibosva> when I was going through failing tests, I noticed some of them are stable-ish
16:54:22 <ihrachys> as in 'always fail'?
16:54:32 <jlibosva> so I had an idea that we could divide tests into stable and under-work
16:54:38 <jlibosva> no, as they pass :)
16:54:56 <jlibosva> so we would run all tests, but collect results only from the picked ones
16:55:20 <jlibosva> then we would have a stable fullstack job and we could make it voting, to make sure the stable tests are not broken again
16:55:27 <ihrachys> I think most are stable now. it's the same bunch that fails, more or less. so maybe go with a blacklist instead?
16:55:53 <jlibosva> while we could work on stabilizing those 'under-work' which wouldn't affect the jenkins vote
16:55:58 <jlibosva> or result of testr
16:56:29 <jlibosva> right, we'd have something like a blacklist of tests that won't affect the exit code of the test runner
16:56:39 <jlibosva> so if they fail, they are skipped. if they pass, they pass
16:56:57 <ihrachys> I agree with having it. we could even disable them completely if we have a bug reported.
16:57:09 <ihrachys> then whoever works on the fix, removes it from the list
16:57:25 <jlibosva> well, disabling them would mean they don't run at all, and hence no data to debug with
16:57:44 <ihrachys> jlibosva, send a patch that removes it from the list - then you have a way to debug, no?
16:57:55 <ihrachys> or do you need data points?
16:57:56 <jlibosva> but then you'd need to recheck, recheck, recheck
16:58:05 <ihrachys> ok gotcha
16:58:18 <ihrachys> I think there was a decorator to mark a case to not affect result
16:58:25 <jlibosva> nice
16:58:35 <jlibosva> that's what I was thinking about :)
16:58:54 <ihrachys> xfail
16:58:55 <ihrachys> https://docs.pytest.org/en/latest/skipping.html
16:59:16 <ihrachys> so, do you want to start the list? I think we can find inspiration in how we did it back when we worked on py3
16:59:19 <slaweq> ihrachys: but then if it doesn't fail at some point, it will probably "impact" the result
16:59:22 <jlibosva> we don't use pytest though
16:59:58 <ihrachys> jlibosva, as long as we report a bug per disabled/ignored test, I am fine
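A minimal sketch of the decorator idea in plain unittest terms (which testtools builds on; pytest's xfail is the same concept), including the caveat slaweq raised about a marked test that starts passing; the test class and bug number below are hypothetical:

```python
# Minimal sketch, assuming plain unittest semantics (fullstack does not use
# pytest): known-broken tests get marked so their failure does not fail the
# run, with the tracking bug recorded next to the marker.
import unittest


class TestTrunkConnectivity(unittest.TestCase):  # hypothetical test class

    @unittest.expectedFailure  # e.g. "LP#NNNNNNN: trunk subport flows flaky"
    def test_subport_connectivity(self):
        self.fail("known failure until the trunk bug is fixed")

    def test_trunk_lifecycle(self):
        self.assertTrue(True)  # stable test, still affects the job result


# Caveat (slaweq's point): once a marked test starts passing it is reported
# as an "unexpected success", so the marker has to be removed when the
# tracked bug is actually fixed.
```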
17:00:12 <jlibosva> #action jlibosva to look at how failing tests can be made to not affect results
17:00:13 <jlibosva> ok
17:00:14 <ihrachys> we need to wrap up
17:00:16 <jlibosva> I made an AI for myself
17:00:24 <ihrachys> ok :)
17:00:26 <ihrachys> thanks folks
17:00:29 <mlavalle> o/
17:00:29 <ihrachys> #endmeeting