16:00:25 #startmeeting neutron_ci
16:00:26 Meeting started Tue Oct 10 16:00:25 2017 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:27 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:30 The meeting name has been set to 'neutron_ci'
16:00:38 o/
16:00:42 hello
16:00:59 slaweq, nice to see you joining
16:00:59 #topic Action items from prev week
16:01:12 first was "haleyb to update grafana board with new job names"
16:01:13 o/
16:01:33 \o
16:01:48 haleyb, where do we stand on grafana?
16:02:03 I think since they reverted to v2.5, we were thinking about having two boards for both?
16:02:03 ihrachys: i have a test patch but need to get feedback, still don't know what the layout will be with zuulv3
16:02:27 https://review.openstack.org/#/c/509291/ ?
16:03:13 whose feedback do you seek? I presume infra?
16:03:20 right, but i don't see stats for those on the collection tree
16:03:33 * haleyb tries to think of that page
16:03:45 clarkb, whom should I talk to about where to get stats for new zuul job failures?
16:03:48 http://graphite.openstack.org/
16:03:56 seems like only old jobs are in graphite
16:04:11 ihrachys: jeblair is probably the best person to ask
16:04:25 clarkb: could you give us some feedback on https://review.openstack.org/#/c/509291/?
16:04:40 lol, clarkb you were listening
16:04:42 thanks. haleyb could you follow up with jeblair on the matter? I think he hangs out in #openstack-infra
16:04:42 I know there were some initial firewall issues but I thought those got sorted out, but it's possible we aren't allowing the new job stats through yet
16:05:01 pabelanger is probably the other person to ask as I think he was working on the firewall bits
16:05:09 yes, i'll follow-up since we'll need them starting tomorrow
16:05:21 haleyb, what's tomorrow?
16:05:46 * ihrachys is out of touch lately
16:05:53 zuulv3 #2, so the neutron dashboard will stop reporting
16:06:01 they switch back tomorrow?
16:06:01 https://review.openstack.org/510580 is needed for v3 job stats
16:06:21 i think folks have considered that low-priority, but if it's a big impact for you, i can escalate it
16:06:25 jeblair, great! is it reasonable to expect it to land before the switch?
16:06:27 ihrachys: yeap
16:06:41 ihrachys: if you need it, i think so, yes
16:07:03 I can review here shortly
16:07:04 jeblair, yeah, we would like to have stats when the switch occurs. the previous time we switched, we had problems determining where we were
16:07:24 thanks folks, we appreciate all the work and quick response
16:07:29 jeblair: thanks, when that merges should graphite then have the new job stats?
16:07:32 ihrachys: ok. i'll push on it. that updates the docs too, so you should be able to construct the new statsd keys you'll need to update the grafana boards
16:07:48 haleyb: after we restart zuul with it. probably some time today.
16:08:17 great, thanks!
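For context on the graphite check discussed above: whether the new v3 job stats have started showing up can be verified with a quick script against graphite's standard /metrics/find endpoint. A minimal sketch, assuming Python 3; the key prefix used below is only a placeholder, not the confirmed v3 statsd layout (which comes from the infra change linked above).

```python
# Minimal sketch: list graphite keys under an assumed zuul stats prefix,
# to check whether the new v3 job stats are being emitted yet.  The real
# key layout depends on the infra change linked above, so the query
# string is only a placeholder.
import json
import urllib.request

GRAPHITE = "http://graphite.openstack.org"
QUERY = "stats.zuul.*"  # placeholder prefix, not the confirmed v3 layout

url = "%s/metrics/find?query=%s" % (GRAPHITE, QUERY)
with urllib.request.urlopen(url) as resp:
    nodes = json.load(resp)

for node in nodes:
    # Each entry describes one branch or leaf of the metrics tree.
    print(node.get("text"), "(leaf)" if node.get("leaf") else "")
```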
16:08:26 haleyb, seems like you'll need to change queries a bit
16:08:45 I guess we can do that in parallel to the infra patch being reviewed/merged
16:08:58 yes, and i can update the FAQ since others might want to know
16:09:14 cool
16:09:19 this was the only AI from the prev meeting
16:09:37 #topic zuulv3 preparations
16:09:52 at the prev meeting, we were in fire drill mode because v3 was enabled and resulted in havoc
16:10:14 we started this etherpad back then to understand what fails: https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:10:29 and we had people assigned to investigate each repo/branch
16:10:50 since then we switched back to 2.5 so it was not as pressing
16:11:03 well v3 is coming back online tomorrow
16:11:04 http://lists.openstack.org/pipermail/openstack-dev/2017-October/123337.html
16:11:05 and it seems like with the switch back, we lost the ability to trigger results for zuulv3
16:11:16 boden, yeah, haleyb mentioned that ^
16:11:20 oh sorry
16:11:47 clarkb, I think the plan was to be able to trigger zuul results no? they would just be non-voting
16:13:00 I don’t want to speak for others, but I was seeing Zuul results (non gating), but as of late I only see Jenkins… maybe I’m missing something, so I recently submitted a test patch for lib
16:13:19 ihrachys: we did a soft rollback of v3 so v2.5 and v3 have been running together and both reporting on changes
16:13:30 ihrachys: the idea was you'd use the results during the last week or so to continue debugging
16:13:42 clarkb, I don't think I saw any Zuul results since the rollback
16:13:46 jenkins votes are from v2, zuul results are v3
16:13:52 I was told it's a long queue
16:13:59 but I haven't seen it in a week anyway
16:13:59 ihrachys: it is, but it is processing them
16:14:12 so projects didn't have a chance to fix anything really
16:14:18 you can query label:Verified,zuul or something to see where it has voted in gerrit
16:14:33 ihrachys: I'm not sure that is the case, many projects have been able to debug and fix things aiui including tripleo
16:14:55 we sent a bunch of 'sentinel' patches for different branches and projects, and couldn't get anything
16:15:15 examples: https://review.openstack.org/#/c/509251/ https://review.openstack.org/#/c/502564/
16:15:23 and also for neutron-lib https://review.openstack.org/#/c/493280/
16:15:25 and boden seems to confirm that
16:15:48 yes… I did see Zuul results for a few days, but not as of the last few days
16:16:02 I saw the last results on the day of the rollback
16:16:06 https://review.openstack.org/#/q/project:openstack/neutron+label:Verified%252Czuul is an example query
16:16:12 before we started pushing sentinels
16:17:11 clarkb, ok I guess it was unrealistic to expect it to trigger on the patches we sent
16:17:15 is it that slow
16:17:16 ?
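For anyone who wants to script the check clarkb describes (whether zuul v3 has voted on a given project's patches yet), the same query can be run against the Gerrit REST API instead of the web UI. A minimal sketch mirroring the example gerrit URL above; stripping the ")]}'" prefix is standard Gerrit REST behaviour.

```python
# Sketch: list recent openstack/neutron changes that carry a Verified vote
# from the zuul (v3) account, mirroring the gerrit query shown above.
import json
import urllib.parse
import urllib.request

GERRIT = "https://review.openstack.org"
query = "project:openstack/neutron label:Verified,zuul"

url = "%s/changes/?q=%s&n=25" % (GERRIT, urllib.parse.quote(query))
with urllib.request.urlopen(url) as resp:
    body = resp.read().decode("utf-8")

# Gerrit prefixes JSON responses with a ")]}'" line to prevent XSSI.
changes = json.loads(body.split("\n", 1)[1])
for change in changes:
    print(change["_number"], change["subject"])
```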
16:17:33 ihrachys: I think it's ~30 hours behind right now (it has 20% of our node capacity right now)
16:17:49 well we definitely sent those the previous Tue
16:18:04 maybe some restart caught us in between and reset queues
16:18:18 as jeblair mentioned above there have also been restarts to address problems as fixes have come in so things may have been caught by that as well
16:18:27 but it's definitely voting on some neutron changes based on that query at least
16:18:43 ok gotcha
16:19:05 folks, let's revisit repos/branches assigned to each of us before tomorrow
16:19:15 I mean as in the list at https://etherpad.openstack.org/p/neutron-zuulv3-grievances
16:20:09 #topic Grafana
16:20:15 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:26 I see the periodic ryu job failed the last time it executed
16:21:57 I was wondering whether we should put a longer window than 24 hours into grafana, as the job runs once per day and then we have binary data in grafana
16:22:05 if that's even possible
16:22:36 like a week, so we would have 7 samples instead of one in the periodic graph
16:22:52 I think it makes sense.
16:23:27 I see some boards use 12 hours instead of 24 hours
16:24:02 dunno if we can go higher though
16:24:36 but it's an argument to the movingAverage function
16:25:23 http://graphite.readthedocs.io/en/latest/functions.html#graphite.render.functions.movingAverage
16:25:47 I thought it's asPercentage
16:25:51 seems like it allows different scales, and if nothing else, we can have "number N of datapoints"
16:25:52 asPercent*
16:26:11 ah no, you're right :)
16:26:26 jlibosva, wanna send the patch?
16:26:32 yep
16:26:51 #action jlibosva to expand grafana window for periodics
16:27:20 I briefly checked the ryu failure, it does seem like a regular volume tempest issue: http://logs.openstack.org/periodic/periodic-tempest-dsvm-neutron-with-ryu-master-ubuntu-xenial/b36a348/console.html
16:28:13 other than that, we have scenarios and fullstack that we'll discuss separately
16:28:22 anything else related to grafana in general?
16:28:48 ok
16:28:55 #topic Scenario jobs
16:29:19 anilvenkata was working on router migration failures lately
16:29:26 a bunch of fixes landed
16:29:37 we are landing the hopefully final test-only fix here: https://review.openstack.org/#/c/500384/
16:30:18 big kudos to anilvenkata
16:30:31 Anil also suggested initially that we may need to enable ARP responder when in DVR mode: https://review.openstack.org/#/c/510090/ because he experienced some ARPs lost in the gate, but it seems like it should work either way.
16:30:32 anilvenkata++
16:30:49 we will revisit the state of the job in the next days and see if this piece is complete
16:31:19 there are still other failures in the job, as can be seen in: http://logs.openstack.org/90/510090/2/check/gate-tempest-dsvm-neutron-dvr-multinode-scenario-ubuntu-xenial-nv/a2d5341/logs/testr_results.html.gz
16:32:20 haleyb, re east-west test failures, it seems like some other case of FIP not configured/not available
16:32:51 ihrachys: yes, or an arp issue
16:32:52 haleyb, I think it makes sense to bring it up at the l3 team meeting
16:33:02 since I believe it fails consistently
16:33:18 i'll put it on our agenda
16:33:22 we don't have a bug reported for that failure, so that could be a good start
16:33:27 thanks!
16:33:57 the second failure is related to trunk ports
16:34:26 we don't seem to have a bug for that either
16:34:58 armax, are you available to have a look at the trunk scenario failure?
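Going back to the periodic-job window discussed under the Grafana topic: once the board uses a longer window, the same data can be sanity-checked directly against the graphite render API. A hedged sketch; the metric paths and the movingAverage window below are illustrative assumptions, not the keys actually used on the neutron board.

```python
# Sketch: pull a week of failure-rate datapoints for a periodic job so the
# graph has ~7 samples instead of one.  Metric paths are placeholders; the
# real board derives them from the statsd keys documented by infra.
import json
import urllib.parse
import urllib.request

GRAPHITE = "http://graphite.openstack.org/render"
JOB = "periodic-tempest-dsvm-neutron-with-ryu-master-ubuntu-xenial"  # example job

failures = "stats_counts.zuul.pipeline.periodic.job.%s.FAILURE" % JOB
allruns = "stats_counts.zuul.pipeline.periodic.job.%s.{SUCCESS,FAILURE}" % JOB
target = "movingAverage(asPercent(%s, sumSeries(%s)), '24hours')" % (failures, allruns)

params = urllib.parse.urlencode(
    {"target": target, "from": "-7days", "format": "json"})
with urllib.request.urlopen("%s?%s" % (GRAPHITE, params)) as resp:
    series = json.load(resp)

for s in series:
    points = [p for p in s["datapoints"] if p[0] is not None]
    print(s["target"], "- %d non-null datapoints over the last week" % len(points))
```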
16:37:06 seems like Armando is not there
16:37:07 he might not be around
16:37:11 yeah
16:37:36 I will report the bug, and if someone has cycles to triage it that would be great, for I don't
16:37:48 #action ihrachys to report bug for trunk scenario failure
16:38:48 #topic Fullstack
16:39:12 one of the failures there is also trunk related
16:39:16 armax has this patch https://review.openstack.org/#/c/504186/
16:39:22 but it has been sitting in W-1 for a while already
16:39:27 almost a month
16:39:29 I also started slowly filling https://etherpad.openstack.org/p/fullstack-issues
16:39:43 the switch to rwd causes lots of issues
16:39:55 * jlibosva looks for an LP bug
16:40:08 rwd?
16:40:11 https://bugs.launchpad.net/neutron/+bug/1654287
16:40:13 Launchpad bug 1654287 in oslo.rootwrap "rootwrap daemon may return output of previous command" [Undecided,New]
16:40:16 rootwrap daemon
16:40:40 not rewind ;)
16:41:08 jlibosva, aha. so that's what happens? I think we saw that with the issue that slaweq was dealing with where sysctl failed with a netns error
16:41:11 so I sent out a patch - https://review.openstack.org/#/c/510161/
16:41:24 ihrachys: yeah, it's tricky :)
16:41:35 but reviewers have good points there
16:41:51 ihrachys: yes, as I'm reading this bug now, it explains to me why this issue with namespaces is happening
16:42:13 * ihrachys is stunned John reviewed that
16:42:20 * ihrachys checks the calendar date
16:42:25 we hit that issue in the past, so when I saw one command had netstat output, it rang a bell
16:42:25 no, it's indeed 2017
16:42:38 ihrachys: yeah, he was also on irc :D
16:43:30 so my question also is, besides upstream nodes not configuring sudo, is there any other reason to use rootwrap?
16:43:50 that was the only one I had
16:44:04 but is it happening in the tester threads only?
16:44:10 we can exercise rootwrap only the way production code uses it, while the test runner could still use sudo, presuming we configure it correctly in the job
16:44:22 so far I have never seen it in production code
16:44:33 as we usually use wait_until_true to wait for some resource to be ready in the tests
16:44:40 in production we use it only in ipv6 prefix delegation
16:44:46 it == wait_until_true
16:44:51 but it's in neutron/common/utils.py so could be used outside
16:44:56 and wait_until_true and the rootwrap daemon are not friends
16:45:40 but now they are right?
16:45:44 so maybe we should consider moving it to the tests dir, document it's not nice to use it with rwd - or fix rwd
16:45:44 with the patch
16:45:57 it's not 100% reproducible
16:46:34 isn't the fullstack issue one that hits us rather regularly?
16:46:50 the one where we fail to create netns
16:47:14 it was/is a quite common reason for failures AFAIK
16:47:45 one I saw was when allocating a port - http://logs.openstack.org/67/488567/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/eb8f9a3/testr_results.html.gz
16:49:19 I will need to read through the comments to understand why it's not an ideal solution
16:49:28 we can move on and discuss next steps on LP - https://bugs.launchpad.net/neutron/+bug/1721796
16:49:29 Launchpad bug 1721796 in neutron "wait_until_true is not rootwrap daemon friendly" [Medium,In progress] - Assigned to Jakub Libosvar (libosvar)
16:49:36 that being said, do you think we can have it anyway? or should we work on smth else?
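To make the wait_until_true / rootwrap-daemon interaction concrete: the helper in neutron/common/utils.py is roughly the polling loop below (a simplified sketch, not the exact code). Because eventlet.Timeout raises inside whatever call the predicate is blocked on at that moment, an interrupted rootwrap-daemon command can leave its reply unread, and the next command then receives the previous command's output - the behaviour described in bug 1654287.

```python
# Simplified sketch of the wait_until_true pattern discussed above; the
# real helper lives in neutron/common/utils.py and differs in details.
import eventlet


class WaitTimeout(Exception):
    """Raised when the predicate does not become true in time."""


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll predicate() until it returns True or the timeout expires.

    The timeout is enforced with eventlet.Timeout, which raises inside
    whatever call predicate() is blocked on at that moment.  If that call
    happens to be a rootwrap-daemon command, its reply may be left unread,
    desynchronizing the daemon channel for the next caller.
    """
    try:
        with eventlet.Timeout(timeout):
            while not predicate():
                eventlet.sleep(sleep)
    except eventlet.Timeout:
        raise WaitTimeout("Timed out after %d seconds" % timeout)
```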
16:49:53 Ideally we should fix oslo.rootwrap
16:50:12 I know what they will tell us :)
16:50:16 I think dalvarez attempted to fix it in the past, I'm not sure it would be possible though
16:50:18 'use oslo.privsep'
16:51:03 well, if we could have an execute() with privsep, that would solve the issues
16:51:25 that's a good point
16:51:29 to use privsep :)
16:51:36 I haven't thought about it
16:51:57 it's not about execute, it's about the executables that we trigger. you can't solve it with a single patch. and I am not even sure why we would, realistically.
16:51:58 we have 8 minutes :)
16:52:05 ok ok
16:52:31 apart from that, anything interesting about fullstack? how's your work to switch to 'containers' for services going?
16:52:47 I have "something" that is not done
16:53:03 https://review.openstack.org/#/c/506722/
16:53:20 fighting with rootwrap filters, that's the last thing I looked at
16:53:29 I don't think I touched it last week
16:53:40 but I have one thing I wanted to discuss regarding fullstack
16:53:45 shoot
16:53:59 when I was going through failing tests, I noticed some of them are stable-ish
16:54:22 as in 'always fail'?
16:54:32 so I had an idea that we could divide tests into stable and under-work
16:54:38 no, as in they pass :)
16:54:56 so we would run all tests, but collect results only from the picked ones
16:55:20 then we would have a stable fullstack and we could make it voting, to make sure those stable ones are not broken again
16:55:27 I think most are stable now. it's the same bunch that fails, more or less. so maybe go with a blacklist instead?
16:55:53 while we could work on stabilizing those 'under-work' ones, which wouldn't affect the jenkins vote
16:55:58 or the result of testr
16:56:29 right, we'd have something like a blacklist of tests that won't affect the exit code of the test runner
16:56:39 so if they fail, they are skipped. if they pass, they pass
16:56:57 I agree with having it. we could even disable them completely if we have a bug reported.
16:57:09 then whoever works on the fix removes it from the list
16:57:25 well, disabling them would mean they don't run and hence don't provide ways to debug
16:57:44 jlibosva, send a patch that removes it from the list - you have a way to debug, no?
16:57:55 or do you need data points?
16:57:56 but then you'd need to recheck, recheck, recheck
16:58:05 ok gotcha
16:58:18 I think there was a decorator to mark a case to not affect the result
16:58:25 nice
16:58:35 that's what I was thinking about :)
16:58:54 xfail
16:58:55 https://docs.pytest.org/en/latest/skipping.html
16:59:16 so, do you want to start the list? I think we can find inspiration in how we did it back when we worked on py3
16:59:19 ihrachys: but then if it doesn't fail once, it will probably "impact" the result
16:59:22 we don't use pytest though
16:59:58 jlibosva, as long as we report a bug per disabled/ignored test, I am fine
17:00:12 #action jlibosva to look at how failing tests can not affect results
17:00:13 ok
17:00:14 we need to wrap up
17:00:16 I made an AI for myself
17:00:24 ok :)
17:00:26 thanks folks
17:00:29 o/
17:00:29 #endmeeting
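As a postscript to the "known failures should not affect the result" idea: since fullstack does not use pytest's xfail, one possible shape of the decorator jlibosva mentions is sketched below. It is purely illustrative - the name, bug placeholder, and usage are not existing Neutron code - and it turns a failure of a blacklisted test into a skip referencing its launchpad bug, so the test still runs and produces logs but does not fail the job.

```python
# Hypothetical sketch of the "known failure" decorator discussed above;
# names are illustrative, not existing Neutron code.
import functools
import unittest


def expected_failure(bug):
    """Run the test, but record a failure as a skip pointing at the bug."""
    def decorator(test_method):
        @functools.wraps(test_method)
        def wrapper(self, *args, **kwargs):
            try:
                test_method(self, *args, **kwargs)
            except Exception:
                # The test still ran (so its logs are available for
                # debugging), but the failure does not break the job.
                raise unittest.SkipTest("Known failure, see bug %s" % bug)
        return wrapper
    return decorator


# Usage sketch (test and bug number are placeholders):
# class TestTrunkPlugging(base.BaseFullStackTestCase):
#     @expected_failure("<bug-number>")
#     def test_trunk_subport_lifecycle(self):
#         ...
```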