16:00:25 <ihrachys> #startmeeting neutron_ci
16:00:26 <openstack> Meeting started Tue Oct 10 16:00:25 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:30 <openstack> The meeting name has been set to 'neutron_ci'
16:00:38 <ihrachys> o/
16:00:42 <slaweq> hello
16:00:59 <ihrachys> slaweq, nice to see you joining
16:00:59 <ihrachys> #topic Action items from prev week
16:01:12 <ihrachys> first was "haleyb to update grafana board with new job names"
16:01:13 <mlavalle> o/
16:01:33 <jlibosva> \o
16:01:48 <ihrachys> haleyb, where do we stand on grafana?
16:02:03 <ihrachys> I think since they reverted to v2.5, we were thinking about having two boards for both?
16:02:03 <haleyb> ihrachys: i have a test patch but need to get feedback, still don't know what the layout will be with zuulv3
16:02:27 <ihrachys> https://review.openstack.org/#/c/509291/ ?
16:03:13 <ihrachys> whose feedback do you seek? I presume infra?
16:03:20 <haleyb> right, but i don't see stats for those on the collection tree
16:03:33 * haleyb tries to think of that page
16:03:45 <ihrachys> clarkb, whom should we talk to about where to get stats for new zuul job failures?
16:03:48 <haleyb> http://graphite.openstack.org/
16:03:56 <ihrachys> seems like only old jobs are in graphite
16:04:11 <clarkb> ihrachys: jeblair is probably the best person to ask
16:04:25 <mlavalle> clarkb: could you give us some feedback on https://review.openstack.org/#/c/509291/?
16:04:40 <mlavalle> lol, clarkb you were listening
16:04:42 <ihrachys> thanks. haleyb could you follow up with jeblair on the matter? I think he hangs out in #openstack-infra
16:04:42 <clarkb> I know there were some initial firewall issues but I thought those got sorted out, but it's possible we aren't allowing the new job stats through yet
16:05:01 <clarkb> pabelanger is probably the other person to ask as I think he was working on the firewall bits
16:05:09 <haleyb> yes, i'll follow up since we'll need them starting tomorrow
16:05:21 <ihrachys> haleyb, what's tomorrow?
16:05:46 * ihrachys is out of touch lately
16:05:53 <haleyb> zuulv3 #2, so the neutron dashboard will stop reporting
16:06:01 <ihrachys> they switch back tomorrow?
16:06:01 <jeblair> https://review.openstack.org/510580 is needed for v3 job stats
16:06:21 <jeblair> i think folks have considered that low-priority, but if it's a big impact for you, i can escalate it
16:06:25 <ihrachys> jeblair, great! is it reasonable to expect it to land before the switch?
16:06:27 <mlavalle> ihrachys: yeap
16:06:41 <jeblair> ihrachys: if you need it, i think so, yes
16:07:03 <pabelanger> I can review here shortly
16:07:04 <ihrachys> jeblair, yeah, we would like to have stats when the switch occurs. the previous time we switched, we had problems determining where we were
16:07:24 <ihrachys> thanks folks, we appreciate all the work and quick response
16:07:29 <haleyb> jeblair: thanks, when that merges should graphite then have the new job stats?
16:07:32 <jeblair> ihrachys: ok. i'll push on it. that updates the docs too, so you should be able to construct the new statsd keys you'll need to update the grafana boards
16:07:48 <jeblair> haleyb: after we restart zuul with it. probably some time today.
16:08:17 <haleyb> great, thanks!
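To make the query change concrete: the neutron failure-rate board builds its targets from zuulv2-style statsd keys, while the v3 keys (documented by the change jeblair mentions, 510580) also carry tenant, project and branch. A minimal Python sketch of the two layouts; the v3 template is an assumption pending that docs update, not the authoritative format:

    # Sketch only: builds graphite target strings for a grafana failure-rate panel.
    # The v2 template matches the style of keys the existing neutron board queries;
    # the v3 template is an assumed layout and may need adjusting once 510580 lands.
    V2_KEY = 'stats_counts.zuul.pipeline.{pipeline}.job.{job}.{result}'
    V3_KEY = ('stats_counts.zuul.tenant.{tenant}.pipeline.{pipeline}.project.'
              '{host}.{project}.{branch}.job.{job}.{result}')


    def failure_rate_target(key_tmpl, **keys):
        """Return a graphite expression: FAILUREs as a percentage of all results,
        smoothed with movingAverage in the style of the current board."""
        failure = key_tmpl.format(result='FAILURE', **keys)
        total = key_tmpl.format(result='{SUCCESS,FAILURE}', **keys)
        return ("movingAverage(asPercent(transformNull(%s), transformNull(sum(%s))),"
                " '24hours')" % (failure, total))


    # Example with a v2-era job name taken from the logs discussed later in this meeting:
    print(failure_rate_target(
        V2_KEY, pipeline='check',
        job='gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv'))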
16:08:26 <ihrachys> haleyb, seems like you'll need to change queries a bit
16:08:45 <ihrachys> I guess we can do that in parallel to the infra patch being reviewed/merged
16:08:58 <haleyb> yes, and i can update the FAQ since others might want to know
16:09:14 <ihrachys> cool
16:09:19 <ihrachys> this was the only AI from the prev meeting
16:09:37 <ihrachys> #topic zuulv3 preparations
16:09:52 <ihrachys> the prev meeting, we were in fire drill mode because v3 was enabled and resulted in havoc
16:10:14 <ihrachys> we started this etherpad back then to understand what fails: https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:10:29 <ihrachys> and we had people assigned to investigate each repo/branch
16:10:50 <ihrachys> since then we switched back to 2.5 so it was not as pressing
16:11:03 <boden> well v3 is coming back online tomorrow
16:11:04 <boden> http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html
16:11:05 <ihrachys> and it seems like with the switch back, we lost the ability to trigger results for zuulv3
16:11:16 <ihrachys> boden, yeah, haleyb mentioned that ^
16:11:20 <boden> oh sorry
16:11:47 <ihrachys> clarkb, I think the plan was to be able to trigger zuul results, no? they would just be non-voting
16:13:00 <boden> I don’t want to speak for others, but I was seeing Zuul results (non gating), but as of late I only see Jenkins… maybe I’m missing something, so I recently submitted a test patch for lib
16:13:19 <clarkb> ihrachys: we did a soft rollback of v3 so v2.5 and v3 have been running together and both reporting on changes
16:13:30 <clarkb> ihrachys: the idea was you'd use the results during the last week or so to continue debugging
16:13:42 <ihrachys> clarkb, I don't think I saw any Zuul results since the rollback
16:13:46 <clarkb> jenkins votes are from v2, zuul results are v3
16:13:52 <ihrachys> I was told it's a long queue
16:13:59 <ihrachys> but I haven't seen it in a week anyway
16:13:59 <clarkb> ihrachys: it is, but it is processing them
16:14:12 <ihrachys> so projects didn't have a chance to fix anything really
16:14:18 <clarkb> you can query label:Verified,zuul or something to see where it has voted in gerrit
16:14:33 <clarkb> ihrachys: I'm not sure that is the case, many projects have been able to debug and fix things aiui including tripleo
16:14:55 <ihrachys> we sent a bunch of 'sentinel' patches for different branches and projects, and couldn't get anything
16:15:15 <ihrachys> examples: https://review.openstack.org/#/c/509251/ https://review.openstack.org/#/c/502564/
16:15:23 <boden> and also for neutron-lib https://review.openstack.org/#/c/493280/
16:15:25 <ihrachys> and boden seems to confirm that
16:15:48 <boden> yes… I did see Zuul results for a few days, but not in the last few days
16:16:02 <ihrachys> I saw the last results on the day of the rollback
16:16:06 <clarkb> https://review.openstack.org/#/q/project:openstack/neutron+label:Verified%252Czuul is an example query
16:16:12 <ihrachys> before we started pushing sentinels
16:17:11 <ihrachys> clarkb, ok I guess it was unrealistic to expect it to trigger on the patches we sent
16:17:15 <ihrachys> is it that slow
16:17:16 <ihrachys> ?
16:17:33 <clarkb> ihrachys: I think it's ~30 hours behind right now (it has 20% of our node capacity)
16:17:49 <ihrachys> well we definitely sent those the previous Tue
16:18:04 <ihrachys> maybe some restart caught us in between and reset the queues
16:18:18 <clarkb> as jeblair mentioned above there have also been restarts to address problems as fixes have come in so things may have been caught by that as well
16:18:27 <clarkb> but it's definitely voting on some neutron changes based on that query at least
16:18:43 <ihrachys> ok gotcha
16:19:05 <ihrachys> folks, let's revisit the repos/branches assigned to each of us before tomorrow
16:19:15 <ihrachys> I mean as in the list at https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:20:09 <ihrachys> #topic Grafana
16:20:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:26 <ihrachys> I see the periodic ryu job failed the last time it executed
16:21:57 <jlibosva> I was thinking whether we should use a longer window in grafana than 24 hours, as it's run once per day and then we have binary data in grafana
16:22:05 <jlibosva> if that's even possible
16:22:36 <jlibosva> like a week, so we would have 7 samples instead of one in the periodic graph
16:22:52 <ihrachys> I think it makes sense.
16:23:27 <ihrachys> I see some boards use 12hours instead of 24hours
16:24:02 <ihrachys> dunno if we can go higher though
16:24:36 <ihrachys> but it's an argument to the movingAverage function
16:25:23 <ihrachys> http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.movingAverage
16:25:47 <jlibosva> I thought it's asPercentage
16:25:51 <ihrachys> seems like it allows different scales, and if nothing else, we can have "number N of datapoints"
16:25:52 <jlibosva> asPercent*
16:26:11 <jlibosva> ah no, you're right :)
16:26:26 <ihrachys> jlibosva, wanna send the patch?
16:26:32 <jlibosva> yep
16:26:51 <ihrachys> #action jlibosva to expand grafana window for periodics
16:27:20 <ihrachys> I briefly checked the ryu failure, it does seem like a regular volume tempest issue: http://logs.openstack.org/periodic/periodic-tempest-dsvm-neutron-with-ryu-master-ubuntu-xenial/b36a348/console.html
16:28:13 <ihrachys> other than that, we have scenarios and fullstack that we'll discuss separately
16:28:22 <ihrachys> anything else related to grafana in general?
16:28:48 <ihrachys> ok
16:28:55 <ihrachys> #topic Scenario jobs
16:29:19 <ihrachys> anilvenkata was working on router migration failures lately
16:29:26 <ihrachys> a bunch of fixes landed
16:29:37 <ihrachys> we are landing the hopefully final test-only fix here: https://review.openstack.org/#/c/500384/
16:30:18 <jlibosva> big kudos to anilvenkata
16:30:31 <ihrachys> Anil also suggested initially that we may need to enable ARP responder when in DVR mode: https://review.openstack.org/#/c/510090/ because he saw some lost ARPs in the gate, but it seems like it should work either way.
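Regarding the #action above to expand the grafana window for periodics: per the linked graphite docs, movingAverage takes the window either as a quoted time period or as a plain number of datapoints, so the patch amounts to something like the following sketch (SERIES is a placeholder, not a real graphite expression):

    # The two window forms movingAverage supports, per the graphite docs.
    # SERIES stands in for the periodic job's failure-rate expression.
    SERIES = 'asPercent(...)'                            # placeholder only

    current = "movingAverage(%s, '24hours')" % SERIES    # ~1 datapoint for a daily job
    by_time = "movingAverage(%s, '7days')" % SERIES      # wider time window (assuming graphite accepts this unit string)
    by_count = "movingAverage(%s, 7)" % SERIES           # last 7 datapoints, i.e. roughly a week of runs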
16:30:32 <haleyb> anilvenkata++
16:30:49 <ihrachys> we will revisit the state of the job in the next days and see if this piece is complete
16:31:19 <ihrachys> there are still other failures in the job, as can be seen in: http://logs.openstack.org/90/510090/2/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/a2d5341/logs/testr_results.html.gz
16:32:20 <ihrachys> haleyb, re the east-west test failures, it seems like another case of a FIP not configured/not available
16:32:51 <haleyb> ihrachys: yes, or an arp issue
16:32:52 <ihrachys> haleyb, I think it makes sense to bring it up at the l3 team meeting
16:33:02 <ihrachys> since I believe it fails consistently
16:33:18 <haleyb> i'll put it on our agenda
16:33:22 <ihrachys> we don't have a bug reported for that failure, so that could be a good start
16:33:27 <ihrachys> thanks!
16:33:57 <ihrachys> the second failure is related to trunk ports
16:34:26 <ihrachys> we don't seem to have a bug for that either
16:34:58 <ihrachys> armax, are you available to have a look at the trunk scenario failure?
16:37:06 <ihrachys> seems like Armando is not there
16:37:07 <mlavalle> he might not be around
16:37:11 <ihrachys> yeah
16:37:36 <ihrachys> I will report the bug, and if someone has cycles to triage it that would be great, for I don'
16:37:38 <ihrachys> *don't
16:37:48 <ihrachys> #action ihrachys to report bug for trunk scenario failure
16:38:48 <ihrachys> #topic Fullstack
16:39:12 <ihrachys> one of the failures there is also trunk related
16:39:16 <ihrachys> armax has this patch https://review.openstack.org/#/c/504186/
16:39:22 <ihrachys> but it has been sitting in W-1 for a while already
16:39:27 <ihrachys> almost a month
16:39:29 <jlibosva> I also started slowly filling https://etherpad.openstack.org/p/fullstack-issues
16:39:43 <jlibosva> the switch to rwd causes lots of issues
16:39:55 * jlibosva looks for an LP bug
16:40:08 <ihrachys> rwd?
16:40:11 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1654287
16:40:13 <openstack> Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Undecided,New]
16:40:16 <jlibosva> rootwrap daemon
16:40:40 <jlibosva> not rewind ;)
16:41:08 <ihrachys> jlibosva, aha. so that's what happens? I think we saw that with the issue that slaweq was dealing with where sysctl failed with a netns error
16:41:11 <jlibosva> so I sent out a patch - https://review.openstack.org/#/c/510161/
16:41:24 <jlibosva> ihrachys: yeah, it's tricky :)
16:41:35 <jlibosva> but reviewers have good points there
16:41:51 <slaweq> ihrachys: yes, as I'm reading this bug now, it explains why this issue with namespaces is happening
16:42:13 * ihrachys is stunned that John reviewed that
16:42:20 * ihrachys checks the calendar date
16:42:25 <jlibosva> we hit that issue in the past, so when I saw one command had netstat output, it rang a bell
16:42:25 <ihrachys> no, it's indeed 2017
16:42:38 <jlibosva> ihrachys: yeah, he was also on irc :D
16:43:30 <jlibosva> so my question also is, besides upstream nodes not configuring sudo, is there any other reason to use rootwrap?
16:43:50 <ihrachys> that was the only one I had
16:44:04 <ihrachys> but is it happening in the tester threads only?
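A toy illustration may help picture bug 1654287 (none of this is oslo.rootwrap code; the class and commands are made up): with a single shared reply channel, a caller that gives up early leaves its reply behind, and the next caller reads it — the same kind of mix-up discussed above, where one command appears to return another command's output.

    # Toy, self-contained illustration of the failure mode in bug 1654287 -- this is
    # NOT oslo.rootwrap code. One shared reply channel plus a caller that gives up
    # early is enough for the next caller to receive the previous command's output.
    import queue
    import threading
    import time


    class ToyDaemonClient:
        """All callers share one connection; replies are read in FIFO order."""

        def __init__(self):
            self._replies = queue.Queue()

        def execute(self, cmd, timeout=None):
            def run():                               # the "daemon" side
                time.sleep(0.2)                      # pretend the command is slow
                self._replies.put('output of %r' % cmd)
            threading.Thread(target=run).start()
            # If the caller gives up before the reply arrives, it stays queued...
            return self._replies.get(timeout=timeout)


    client = ToyDaemonClient()
    try:
        client.execute('netstat -an', timeout=0.05)  # times out, reply left unread
    except queue.Empty:
        pass
    time.sleep(0.3)
    # ...so the next, unrelated command picks up the stale netstat output:
    print(client.execute('sysctl -w net.ipv4.ip_forward=1'))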
16:44:10 <jlibosva> we can exercise rootwrap only the way production code uses it, while the test runner could still use sudo, presuming we configure it correctly in the job
16:44:22 <jlibosva> so far I have never seen it in production code
16:44:33 <jlibosva> as we usually use wait_until_true to wait for some resource to be ready in the tests
16:44:40 <jlibosva> in production we use it only in ipv6 prefix delegation
16:44:46 <jlibosva> it == wait_until_true
16:44:51 <ihrachys> but it's in neutron/common/utils.py so it could be used outside
16:44:56 <jlibosva> and wait_until_true and rootwrap daemon are not friends
16:45:40 <ihrachys> but now they are right?
16:45:44 <jlibosva> so maybe we should consider moving it to the tests dir, documenting that it's not nice to use it with rwd - or fixing rwd
16:45:44 <ihrachys> with the patch
16:45:57 <jlibosva> it's not 100% reproducible
16:46:34 <ihrachys> isn't the fullstack issue one that hits us rather regularly?
16:46:50 <ihrachys> the one where we fail to create a netns
16:47:14 <slaweq> it was/is quite a common reason of failures AFAIK
16:47:45 <jlibosva> one I saw was when allocating a port - http://logs.openstack.org/67/488567/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/eb8f9a3/testr_results.html.gz
16:49:19 <ihrachys> I will need to read through the comments to understand why it's not an ideal solution
16:49:28 <jlibosva> we can move on and discuss next steps on LP - https://bugs.launchpad.net/neutron/+bug/1721796
16:49:29 <openstack> Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
16:49:36 <ihrachys> that being said, you think we can have it anyway? or should we work on smth else?
16:49:53 <jlibosva> Ideally we should fix oslo rootwrap
16:50:12 <ihrachys> I know what they will tell us :)
16:50:16 <jlibosva> I think dalvarez attempted to fix it in the past, I'm not sure it would be possible though
16:50:18 <ihrachys> 'use oslo.privsep'
16:51:03 <jlibosva> well, if we could have an execute() with privsep, that would solve the issues
16:51:25 <jlibosva> that's a good point
16:51:29 <jlibosva> to use privsep :)
16:51:36 <jlibosva> I haven't thought about it
16:51:57 <ihrachys> it's not about execute, it's about executables that we trigger. you can't solve it with a single patch. and I am not even sure why we would, realistically.
16:51:58 <jlibosva> we have 8 minutes :)
16:52:05 <ihrachys> ok ok
16:52:31 <ihrachys> apart from that, anything interesting about fullstack? how's your work to switch to 'containers' for services going?
16:52:47 <jlibosva> I have "something" that is not done
16:53:03 <jlibosva> https://review.openstack.org/#/c/506722/
16:53:20 <jlibosva> fighting with rootwrap filters, that's the last thing I looked at
16:53:29 <jlibosva> I don't think I touched it last week
16:53:40 <jlibosva> but I have one thing I wanted to discuss regarding fullstack
16:53:45 <ihrachys> shoot
16:53:59 <jlibosva> when I was going through failing tests, I noticed some of them are stable-ish
16:54:22 <ihrachys> as in 'always fail'?
16:54:32 <jlibosva> so I had an idea that we could divide tests into stable and under-work
16:54:38 <jlibosva> no, as in they pass :)
16:54:56 <jlibosva> so we would run all tests, but collect results only from the picked ones
16:55:20 <jlibosva> then we would have a stable fullstack job and we could make it voting, to make sure the stable ones are not broken again
16:55:27 <ihrachys> I think most are stable now. it's the same bunch that fails, more or less. so maybe go with a blacklist instead?
16:55:53 <jlibosva> while we could work on stabilizing the 'under-work' ones, which wouldn't affect the jenkins vote
16:55:58 <jlibosva> or the result of testr
16:56:29 <jlibosva> right, we'd have something like a blacklist of tests that won't affect the exit code of the test runner
16:56:39 <jlibosva> so if they fail, they are skipped. if they pass, they pass
16:56:57 <ihrachys> I agree with having it. we could even disable them completely if we have a bug reported.
16:57:09 <ihrachys> then whoever works on the fix removes it from the list
16:57:25 <jlibosva> well, disabling them would mean they don't run at all, and hence we'd have nothing to debug with
16:57:44 <ihrachys> jlibosva, send a patch that removes it from the list - you have a way to debug, no?
16:57:55 <ihrachys> or you need data points?
16:57:56 <jlibosva> but then you'd need to recheck, recheck, recheck
16:58:05 <ihrachys> ok gotcha
16:58:18 <ihrachys> I think there was a decorator to mark a case so it doesn't affect the result
16:58:25 <jlibosva> nice
16:58:35 <jlibosva> that's what I was thinking about :)
16:58:54 <ihrachys> xfail
16:58:55 <ihrachys> https://docs.pytest.org/en/latest/skipping.html
16:59:16 <ihrachys> so, do you want to start the list? I think we can find inspiration in how we did it back when we worked on py3
16:59:19 <slaweq> ihrachys: but then if it doesn't fail once, it will probably "impact" the result
16:59:22 <jlibosva> we don't use pytest though
16:59:58 <ihrachys> jlibosva, as long as we report a bug per disabled/ignored test, I am fine
17:00:12 <jlibosva> #action jlibosva to look at how failing tests can be made to not affect results
17:00:13 <jlibosva> ok
17:00:14 <ihrachys> we need to wrap up
17:00:16 <jlibosva> I made an AI for myself
17:00:24 <ihrachys> ok :)
17:00:26 <ihrachys> thanks folks
17:00:29 <mlavalle> o/
17:00:29 <ihrachys> #endmeeting
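For the xfail-style idea at the end of the meeting (keep known-bad fullstack tests running while their failures don't affect the run), a minimal sketch using the stdlib unittest.expectedFailure decorator; the test names and bug reference are illustrative, and neutron's suites are unittest/testtools based rather than pytest, so this is only the closest built-in equivalent:

    # Minimal sketch of the "run but don't fail the job" idea using the stdlib
    # unittest.expectedFailure decorator. Names below are illustrative.
    import unittest


    class TestTrunkFullstack(unittest.TestCase):

        @unittest.expectedFailure   # known unstable, tracked in an LP bug
        def test_trunk_subport_lifecycle(self):
            # A failure here is reported as an "expected failure" and does not
            # fail the run; the test still executes, so logs stay available.
            self.fail('simulated known instability')

        def test_stable_case(self):
            self.assertTrue(True)


    if __name__ == '__main__':
        # Note slaweq's caveat: if an @expectedFailure test passes, it is counted
        # as an "unexpected success", which some runners treat as a failure.
        unittest.main()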