16:00:33 <ihrachys> #startmeeting neutron_ci
16:00:34 <openstack> Meeting started Tue Jan 9 16:00:33 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <openstack> The meeting name has been set to 'neutron_ci'
16:00:42 <mlavalle> o/
16:00:48 <slaweq> hi
16:00:49 <ihrachys> heya
16:00:52 <jlibosva> o/
16:01:06 <ihrachys> I think I happily missed the last time we were supposed to have a meeting. sorry for that. :)
16:01:17 <ihrachys> post-holiday recovery is hard!
16:01:29 <ihrachys> #topic Actions from prev meeting
16:01:41 <ihrachys> first is "mlavalle to send patch(es) removing duplicate jobs from neutron gate"
16:01:58 <mlavalle> I did
16:02:02 <ihrachys> I believe it's https://review.openstack.org/530500 and https://review.openstack.org/531496
16:02:19 <ihrachys> and of course we should land the last bit adding new jobs to stable: https://review.openstack.org/531070
16:02:38 <mlavalle> yeap
16:03:19 <ihrachys> mlavalle, speaking of your concerns in https://review.openstack.org/#/c/531496/
16:03:31 <mlavalle> ok
16:03:34 <ihrachys> mlavalle, I would think that we replace old jobs with in-tree ones for those other projects
16:03:41 <mlavalle> ok
16:03:50 <mlavalle> will work on that
16:03:54 <ihrachys> you are too agreeable today! come on! :)
16:04:14 <mlavalle> you speak with reason, so why not :-)
16:04:15 <ihrachys> mlavalle, I am not saying that's the right path; I just *assume*
16:04:22 <ihrachys> still makes sense to check with infra first
16:04:29 <mlavalle> yes I will do that
16:04:43 <mlavalle> but we don't need to rush on that patch
16:04:55 <mlavalle> as long as 530500 lands
16:04:58 <mlavalle> soon
16:04:59 <ihrachys> mlavalle, the project-config one will already remove duplicates from gate for us?
16:05:20 <mlavalle> 530500 will remove the duplication
16:05:40 <ihrachys> ok good. yeah, we can leave the second one for infra take.
16:05:45 <ihrachys> ok next item was "mlavalle to report back about result of drivers discussion on unified tempest plugin for all stadium projects"
16:06:00 <ihrachys> mlavalle, could you please give tl;dr of the discussion?
16:06:14 <mlavalle> we discussed it in the drivers meeting before the Holidays
16:06:44 <mlavalle> we will let the stadium projects join the unified tempest plugin or keep their tests in their repos
16:06:55 <mlavalle> we don't want to stretch them with more work
16:07:04 <mlavalle> sent a message to the ML last week
16:07:24 <mlavalle> and discussed it in the Neutron meeting last week
16:07:32 <mlavalle> was well received
16:07:36 <mlavalle> done
16:08:02 <ihrachys> ok cool. are there candidates who are willing to go through the exercise?
16:08:11 <mlavalle> I haven't heard
16:08:18 <mlavalle> back
16:08:24 <ihrachys> ok, fine by me!
16:08:34 <ihrachys> next is "frickler to post patch updating neutron grafana board to include designate scenario job"
16:08:40 <ihrachys> I believe this was merged https://review.openstack.org/#/c/529822/
16:08:56 <ihrachys> I just saw the job in grafana before the meeting, so it works
16:09:07 <ihrachys> next one is "jlibosva to close bug 1722644 and open a new one for trunk connectivity failures in dvr and linuxbridge scenario jobs"
16:09:08 <openstack> bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed] https://launchpad.net/bugs/1722644
16:09:23 <jlibosva> I did
16:09:27 <jlibosva> https://bugs.launchpad.net/neutron/+bug/1740885
16:09:28 <openstack> Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:09:39 <ihrachys> the other bug doesn't seem closed
16:09:41 <jlibosva> fix is here: https://review.openstack.org/#/c/531414/
16:10:25 <jlibosva> oops, I just wrote a comment I'm gonna close it and I actually didn't
16:10:27 <jlibosva> closed now
16:10:48 <ihrachys> jlibosva, right. I started looking at the patch yesterday, and was wondering if the fix makes it ignore some updates from the server. can we miss an update and leave a port in old state with the fix?
16:11:48 <jlibosva> I guess so, the update stores e.g. IP in case a new port is added to remote security group. So we should still receive an update but let the implementation of flow rules be deferred
16:12:50 <ihrachys> to clarify, you mean there is a potential race now?
16:13:22 <jlibosva> there has been a race and the 531414 fixes it
16:13:38 <jlibosva> it caches the new data and once the port initializes, it already has the correct data in cache
16:14:04 <ihrachys> oh I see, so there is cache involved so we should be good
16:14:39 <jlibosva> BUT
16:15:02 <jlibosva> I saw the ovsfw job failing, reporting the issue, so I still need to investigate why it fails
16:15:23 <jlibosva> I'm not able to reproduce it, I'll need to run tempest in the loop overnight and put a breakpoint once it fails so I'll be able to debug
16:15:33 <jlibosva> it's likely a different issue
16:15:55 <ihrachys> the ovsfw job was never too stable. but yeah, it's good to look into it.
16:16:07 <ihrachys> ok we have way forward here.
16:16:10 <ihrachys> next item was "jlibosva to disable trunk scenario connectivity tests"
16:16:39 <ihrachys> I believe it's https://review.openstack.org/530760
16:16:44 <ihrachys> and it's merged
16:16:47 <jlibosva> yeah
16:16:54 <jlibosva> so now next step should be to enable it back :)
16:17:25 <jlibosva> perhaps it should be part of 531414 patch
16:17:25 <ihrachys> sure
16:18:09 <ihrachys> yeah it makes sense to have it incorporated. we can recheck it a bunch before merging.
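(For context on the defer-and-cache approach jlibosva describes for bug 1740885, a minimal sketch follows. It is not the code in https://review.openstack.org/#/c/531414/, and all class and method names are illustrative: security group updates always refresh a local cache, while flow rules are rendered only for ports that are already initialized; ports initialized later pick up the latest data from that cache.)

    # Illustrative sketch only -- not the actual neutron OVS firewall driver.
    class FirewallSketch(object):
        def __init__(self):
            self.sg_rules = {}            # sg_id -> latest rules from the server
            self.ports_for_sg = {}        # sg_id -> set of port ids using it
            self.initialized_ports = set()

        def security_group_updated(self, sg_id, rules):
            # Never drop an update: cache it even if no port can use it yet.
            self.sg_rules[sg_id] = rules
            for port_id in self.ports_for_sg.get(sg_id, set()):
                if port_id in self.initialized_ports:
                    self._render_flows(port_id)
                # else: deferred; prepare_port_filter() will render the
                # flows from the cached rules once the port is ready.

        def prepare_port_filter(self, port):
            # Called when the port is initialized on the agent.
            for sg_id in port['security_groups']:
                self.ports_for_sg.setdefault(sg_id, set()).add(port['id'])
            self.initialized_ports.add(port['id'])
            self._render_flows(port['id'])

        def _render_flows(self, port_id):
            # Translate the cached rules for this port into OpenFlow entries
            # (omitted in this sketch).
            pass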
16:18:47 <ihrachys> ok next was "ihrachys to report sec group fullstack failure"
16:18:53 <ihrachys> I am ashamed but I forgot about it
16:19:08 * haleyb wanders in late
16:19:11 <ihrachys> I will follow up after the meeting
16:19:12 <mlavalle> the Holidays joy I guess
16:19:12 <ihrachys> #action ihrachys to report sec group fullstack failure
16:19:29 <ihrachys> next item was "slaweq to debug qos fullstack failure https://bugs.launchpad.net/neutron/+bug/1737892"
16:19:30 <openstack> Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:05 <slaweq> so I was debugging it for some time
16:20:15 <slaweq> and I have no idea why it is failing
16:20:52 <slaweq> I was even able to reproduce it locally once for every 20-30 runs but I don't know what is happening there
16:21:04 <slaweq> I will have to check it more still
16:21:23 <ihrachys> I see. "Also strange thing is that after test is finished, those records are deleted from ovs." <- don't we recycle ovsdb on each run?
16:21:24 <slaweq> but maybe for now I will propose patch to mark this test as unstable
16:21:50 <slaweq> ihrachys: maybe, but I'm not sure about that
16:21:52 <ihrachys> ah wait, we didn't have it isolated properly yet
16:22:29 <slaweq> no, it's same ovsdb for all tests IMO
16:22:53 <ihrachys> so if I read it right, ovsdb command is executed but structures are still in place?
16:22:59 <slaweq> yes
16:23:28 <jlibosva> could it be related to the ovsdb issue we have been hitting? that commands return TRY_AGAIN and never succeed?
16:23:29 <slaweq> and ovsdb commands are finished with success every time as I was checking it locally
16:23:32 <jlibosva> or they timeout?
16:23:36 <jlibosva> ok
16:23:46 <slaweq> no, there wasn't any retry or timeout on it
16:23:49 <jlibosva> it doesn't sound like it then
16:24:00 <slaweq> transactions were always finished fine
16:24:09 <ihrachys> could it be something recreates them after destroy?
16:24:46 <slaweq> ihrachys: I don't think so because I was also locally watching for them with watch command and it didn't flap or something like that
16:25:18 <slaweq> but I will check it once again maybe
16:26:11 <ihrachys> ok. I think disabling the test is a good idea while we are looking into it.
16:26:27 <slaweq> ihrachys: ok, so I will do such patch today
16:26:30 <ihrachys> and those were all action items we had
16:26:38 <ihrachys> #topic Tempest plugin
16:26:48 <ihrachys> I think we are mostly done with tempest plugin per se
16:27:04 <ihrachys> though there are some tempest bits in-tree that required stadium projects to switch to new repo first
16:27:25 <ihrachys> and those were not moving quick enough
16:27:47 <ihrachys> f.e. vpnaas: https://review.openstack.org/#/c/521341/
16:28:02 <ihrachys> I recollect that mlavalle was going to talk to their representatives about moving those patches forward
16:28:08 <ihrachys> mlavalle, any news on this front?
16:28:29 <mlavalle> well, now it is my turn to say that I forgot about that
16:28:53 <mlavalle> I will follow up this week
16:28:58 <mlavalle> sorry :-(
16:29:21 <ihrachys> #action mlavalle to follow up with stadium projects on switching to new tempest repo
16:29:27 <ihrachys> that's fine it happens :)
16:29:41 <ihrachys> the list was at https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move line 27+
16:29:49 <mlavalle> ok
16:29:56 <ihrachys> afaiu we should cover vpnaas dynamic-routing and midonet
16:30:07 <mlavalle> cool
16:30:55 <ihrachys> afair there were concerns about how to install the new repo, and devstack plugin was added to neutron-tempest-plugin, so stadium projects should consume it
16:31:09 <ihrachys> #topic Grafana
16:31:15 <ihrachys> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:25 <ihrachys> we actually have some good news there
16:31:44 <ihrachys> linuxbridge scenario job seems to be in decent shape after recent tests disabled and security group fix
16:31:55 <ihrachys> it's at ~10% right now
16:32:08 <ihrachys> which I would say is a regular failure rate one could expect from a tempest job
16:32:50 <mlavalle> ++
16:32:51 <ihrachys> so that's good. we should monitor and eventually make it voting if it will keep the level.
16:32:53 <slaweq> nice :)
16:33:34 <jlibosva> kudos to those who fixed it :)
16:33:36 <ihrachys> dvr flavor is not as great though also down from 90% level it stayed in for months
16:33:48 <ihrachys> currently at ~40% on my chart
16:34:09 <ihrachys> and fullstack is in same bad shape. so mixed news but definitely progress on scenario side.
16:34:20 <ihrachys> kudos to everyone who was and is pushing those forward
16:34:44 <jlibosva> I see functional is now ~10%, yesterday EMEA morning, it was up to 50%
16:35:15 <ihrachys> I saw a lot of weird errors in gate yesterday
16:35:22 <ihrachys> RETRY_TIMEOUTS and stuff
16:35:32 <ihrachys> could be that the gate was just unstable in general
16:35:49 <ihrachys> but we'll keep an eye
16:36:10 <ihrachys> f.e. I see a similar spike for unit tests around same time
16:36:19 <ihrachys> #topic Scenarios
16:36:35 <ihrachys> so since linuxbridge seems good now, let's have a look at a latest failure for dvr
16:37:35 <ihrachys> ok I took http://logs.openstack.org/98/513398/10/check/neutron-tempest-plugin-dvr-multinode-scenario/568c685/job-output.txt.gz
16:37:41 <ihrachys> but it seems like a timeout for the job
16:37:47 <ihrachys> took almost 3h and failed
16:38:39 <ihrachys> here is another one: http://logs.openstack.org/51/529551/4/check/neutron-tempest-plugin-dvr-multinode-scenario/ee927b8/job-output.txt.gz
16:38:48 <ihrachys> same story
16:39:05 <jlibosva> perhaps we should try now to increase the concurrency? let me check the load
16:39:11 <ihrachys> so for what I see, it either times out or it passes
16:39:24 <ihrachys> I suspect it may be related to meltdown
16:39:40 <ihrachys> we were looking yesterday in neutron channel at slowdown for rally scenarios
16:39:54 <haleyb> https://bugs.launchpad.net/neutron/+bug/1717302 still isn't fixed either
16:39:55 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:39:57 <ihrachys> x2-x3 slowdown for some scenarios
16:40:13 <ihrachys> and figured it depends on cloud and whether they are patched...
16:41:14 <jlibosva> haleyb: don't we skip the fip test cases?
16:41:31 <jlibosva> I mean that they are tagged as unstable, so skipped if they fail
16:41:33 <haleyb> oh yeah, didn't see a link in the bug
16:41:41 <ihrachys> jlibosva, we do skip them, yes
16:42:07 <ihrachys> jlibosva, you mentioned concurrency. currently we seem to run with =2
16:42:24 <ihrachys> would the suggestion be to run with eg. number of cpus?
16:42:35 <jlibosva> just remove the concurrency and let the runner decide
16:42:40 <ihrachys> I see most scenarios already taking like 5 minutes+ each in good run
16:42:52 <jlibosva> I remember I added the concurrency there because we thought the server gets choked on slow machines
16:43:04 <ihrachys> I suspect there may be hardcoded timeouts in some of scenarios we have, like waiting for a resource to come back
16:43:26 <slaweq> but wouldn't it be problem with e.g. memory consumption if You will have more threads?
16:43:31 <slaweq> (just asking)
16:43:46 <ihrachys> slaweq, there may be. especially where scenarios spin up instances.
16:44:11 <ihrachys> I guess it's easier to post a patch and recheck it to death
16:44:18 <ihrachys> and see how it fares
16:44:32 <ihrachys> I can have a look at it. need to report a bug for timeouts too.
16:44:53 <ihrachys> #action ihrachys to report bug for dvr scenario job timeouts and try concurrency increase
16:45:10 <ihrachys> it's interesting that linuxbridge doesn't seem to trigger it.
16:45:20 <ihrachys> is it because maybe more tests are executed for dvr?
16:46:18 <ihrachys> it's 36 tests in dvr and 28 in linuxbridge
16:47:03 <ihrachys> some of those are dvr migration tests so that's good
16:47:23 <ihrachys> I also noticed that NetworkMtuBaseTest is not executed for linuxbridge because apparently we assume gre type driver enabled and it's not supported by linuxbridge
16:47:42 <ihrachys> we can probably configure the plugin for vxlan and have it executed for linuxbridge too then
16:48:06 <jlibosva> makes sense, let's make linuxbridge suffer too :)
16:48:25 <ihrachys> actually, it's 31 vs 21 tests
16:48:32 <ihrachys> I read the output incorrectly before
16:48:47 <ihrachys> jlibosva, huh. well it shouldn't affect much and would add coverage
16:49:00 <jlibosva> I totally agree :)
16:49:03 <ihrachys> I will report a bug for that at least. it's not critical to fix it right away.
16:49:18 <ihrachys> #action ihrachys to report bug for mtu scenario not executed for linuxbridge job
16:50:03 <ihrachys> anyway, the diff in total time spent for tests is ~3k seconds
16:50:35 <ihrachys> which is like 50 minutes?
16:50:57 <ihrachys> that's kinda significant
16:51:10 <jlibosva> that could be it, average test is like 390 seconds? so if you add 10 more, with concurrency two, it adds almost 30 minutes
16:51:34 <ihrachys> yeah. and linuxbridge is already 2h:20m in good run
16:51:39 <jlibosva> with bad luck using some slow cloud, it could start hitting timeouts too. That's why I think it's good to make linuxbridge suffering too, to confirm our theory :)
16:51:50 <ihrachys> add 30-50 on top and you have timeout triggered
16:52:05 <ihrachys> ok anyway, we have way forward. let's switch topics
16:52:08 <ihrachys> #topic Fullstack
16:52:20 <jlibosva> I remember in tempest repo, we have the slow tag so maybe we will need to start using it in the future
16:52:37 <ihrachys> jlibosva, yeah but ideally we still want to have all those executed in some way in gate
16:52:50 <ihrachys> so the best you can do is then split the job into pieces
16:53:00 <ihrachys> ok fullstack. same exercise, looking at latest runs.
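(As an aside on the "tagged as unstable" and tempest "slow" tag mechanisms mentioned above, here is a rough sketch of how such markers can be applied to a flaky scenario test. The unstable_test helper is written out locally for illustration and is not necessarily the helper used in the neutron repos; the test and base class names are made up. Only the tempest.lib.decorators.attr usage is standard tempest.)

    # Sketch: mark a flaky scenario test as "slow" and "unstable".
    # The slow attribute keeps it out of the default (non-slow) selection,
    # and the unstable wrapper converts a failure into a skip so it cannot
    # break the gate while the underlying bug is investigated.
    import functools

    import testtools
    from tempest.lib import decorators


    def unstable_test(bug):
        """Skip (rather than fail) a test known to be flaky; simplified."""
        def wrap(f):
            @functools.wraps(f)
            def inner(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except Exception as exc:
                    self.skipTest("marked unstable (%s), failure: %s" % (bug, exc))
            return inner
        return wrap


    class TrunkConnectivitySketch(testtools.TestCase):
        # testtools.TestCase stands in for the real scenario base class.

        @decorators.attr(type='slow')
        @unstable_test("bug 1740885")
        def test_subport_connectivity(self):
            pass  # the actual scenario steps would go here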
16:53:11 <jlibosva> yeah, looking at times of particular tests, they all are slow
16:53:13 <ihrachys> example: http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:53:28 <ihrachys> this is I believe the failure I should have reported but failed to.
16:53:54 <ihrachys> jlibosva, afaiu we don't have accelerated virtualization in infra clouds. so starting an instance takes ages.
16:54:12 <ihrachys> let's see if other fullstack runs are same
16:54:28 <ihrachys> ok this one is different: http://logs.openstack.org/98/513398/10/check/neutron-fullstack/b62a726/logs/testr_results.html.gz
16:54:48 <ihrachys> our old friend "Commands [...] exceeded timeout 10 seconds"
16:55:13 <haleyb> not a very nice friend
16:55:18 <ihrachys> I recollect jlibosva reported https://bugs.launchpad.net/neutron/+bug/1741889 lately
16:55:19 <openstack> Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [Critical,New]
16:55:19 <mlavalle> nope
16:55:38 <jlibosva> I thought it's more sever as by the time I reported the functional job seemed busted
16:55:41 <jlibosva> so it's not that hot anymore
16:56:07 <jlibosva> severe*
16:56:25 <mlavalle> you mean 1741889?
16:56:29 <jlibosva> yeah
16:56:50 <mlavalle> should we lower the importance then?
16:56:58 <jlibosva> possibly
16:56:59 <ihrachys> jlibosva, do you suggest that may be intermittent and same and we may no longer experience it?
16:57:48 <jlibosva> well, I mean we talked about it before, you said UT had a peak at about that time, so the bug might not be the reason for functional failure rate peak
16:58:14 <jlibosva> although it's not a nice bug as it still keeps coming back in weird intervals, like twice a year :) and then it goes away
16:58:21 <ihrachys> ok I lowered it to High for now
16:58:27 <ihrachys> and added details about fullstack
16:58:36 <jlibosva> could be related to slow hw used
16:58:37 <jlibosva> dunno
16:58:38 <mlavalle> thanks
16:58:40 <jlibosva> just thinking out loud
16:58:49 <ihrachys> yeah I think it's ok to just monitor
16:59:01 <ihrachys> that other sec group failure seems more important to have a look.
16:59:14 <ihrachys> and we have some work to do for scenarios anyway till next week
16:59:21 <slaweq> I can check on this failure with SG
16:59:29 <slaweq> *check this failure
16:59:38 <ihrachys> slaweq, if you like, I would be grateful of course.
16:59:42 <slaweq> sure
16:59:50 <ihrachys> have some other pressing things on my plate so I don't insist :)
16:59:53 <ihrachys> cool!
17:00:09 <ihrachys> #action slaweq to take over sec group failure in fullstack (report bug / triage / fix)
17:00:14 <ihrachys> we are out of time
17:00:31 <slaweq> just to be sure, we are talking about issue like in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
17:00:33 <slaweq> right?
17:00:40 <ihrachys> slaweq, yes
17:00:41 <slaweq> ok
17:00:44 <slaweq> thx
17:01:07 <ihrachys> ok. thanks everyone for being so helpful and showing initiative. it's good it's just me slacking here, otherwise we wouldn't achieve all we did.
17:01:11 <ihrachys> #endmeeting