16:00:33 #startmeeting neutron_ci
16:00:34 Meeting started Tue Jan 9 16:00:33 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:35 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 The meeting name has been set to 'neutron_ci'
16:00:42 o/
16:00:48 hi
16:00:49 heya
16:00:52 o/
16:01:06 I think I happily missed the last time we were supposed to have a meeting. sorry for that. :)
16:01:17 post-holiday recovery is hard!
16:01:29 #topic Actions from prev meeting
16:01:41 first is "mlavalle to send patch(es) removing duplicate jobs from neutron gate"
16:01:58 I did
16:02:02 I believe it's https://review.openstack.org/530500 and https://review.openstack.org/531496
16:02:19 and of course we should land the last bit adding new jobs to stable: https://review.openstack.org/531070
16:02:38 yeap
16:03:19 mlavalle, speaking of your concerns in https://review.openstack.org/#/c/531496/
16:03:31 ok
16:03:34 mlavalle, I would think that we replace old jobs with in-tree ones for those other projects
16:03:41 ok
16:03:50 will work on that
16:03:54 you are too agreeable today! come on! :)
16:04:14 you speak with reason, so why not :-)
16:04:15 mlavalle, I am not saying that's the right path; I just *assume*
16:04:22 still makes sense to check with infra first
16:04:29 yes I will do that
16:04:43 but we don't need to rush on that patch
16:04:55 as long as 530500 lands
16:04:58 soon
16:04:59 mlavalle, the project-config one will already remove duplicates from gate for us?
16:05:20 530500 will remove the duplication
16:05:40 ok good. yeah, we can leave the second one for infra to take.
16:05:45 ok next item was "mlavalle to report back about result of drivers discussion on unified tempest plugin for all stadium projects"
16:06:00 mlavalle, could you please give a tl;dr of the discussion?
16:06:14 we discussed it in the drivers meeting before the Holidays
16:06:44 we will let the stadium projects join the unified tempest plugin or keep their tests in their repos
16:06:55 we don't want to stretch them with more work
16:07:04 sent a message to the ML last week
16:07:24 and discussed it in the Neutron meeting last week
16:07:32 was well received
16:07:36 done
16:08:02 ok cool. are there candidates who are willing to go through the exercise?
16:08:11 I haven't heard
16:08:18 back
16:08:24 ok, fine by me!
16:08:34 next is "frickler to post patch updating neutron grafana board to include designate scenario job"
16:08:40 I believe this was merged: https://review.openstack.org/#/c/529822/
16:08:56 I just saw the job in grafana before the meeting, so it works
16:09:07 next one is "jlibosva to close bug 1722644 and open a new one for trunk connectivity failures in dvr and linuxbridge scenario jobs"
16:09:08 bug 1722644 in neutron "TrunkTest fails for OVS/DVR scenario job" [High,Confirmed] https://launchpad.net/bugs/1722644
16:09:23 I did
16:09:27 https://bugs.launchpad.net/neutron/+bug/1740885
16:09:28 Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:09:39 the other bug doesn't seem closed
16:09:41 fix is here: https://review.openstack.org/#/c/531414/
16:10:25 oops, I just wrote a comment that I'm gonna close it and I actually didn't
16:10:27 closed now
16:10:48 jlibosva, right. I started looking at the patch yesterday, and was wondering if the fix makes it ignore some updates from the server. can we miss an update and leave a port in old state with the fix?
16:11:48 I guess so; the update stores e.g. an IP in case a new port is added to a remote security group. So we should still receive an update, but the implementation of flow rules is deferred
16:12:50 to clarify, you mean there is a potential race now?
16:13:22 there has been a race and 531414 fixes it
16:13:38 it caches the new data and once the port initializes, it already has the correct data in the cache
16:14:04 oh I see, so there is a cache involved so we should be good
16:14:39 BUT
16:15:02 I saw the ovsfw job failing, reporting the issue, so I still need to investigate why it fails
16:15:23 I'm not able to reproduce it; I'll need to run tempest in a loop overnight and put a breakpoint once it fails so I'll be able to debug
16:15:33 it's likely a different issue
16:15:55 the ovsfw job was never too stable. but yeah, it's good to look into it.
16:16:07 ok we have a way forward here.
16:16:10 next item was "jlibosva to disable trunk scenario connectivity tests"
16:16:39 I believe it's https://review.openstack.org/530760
16:16:44 and it's merged
16:16:47 yeah
16:16:54 so now the next step should be to enable it back :)
16:17:25 perhaps it should be part of the 531414 patch
16:17:25 sure
16:18:09 yeah it makes sense to have it incorporated. we can recheck it a bunch before merging.
16:18:47 ok next was "ihrachys to report sec group fullstack failure"
16:18:53 I am ashamed but I forgot about it
16:19:08 * haleyb wanders in late
16:19:11 I will follow up after the meeting
16:19:12 the Holidays joy I guess
16:19:12 #action ihrachys to report sec group fullstack failure
16:19:29 next item was "slaweq to debug qos fullstack failure https://bugs.launchpad.net/neutron/+bug/1737892"
16:19:30 Launchpad bug 1737892 in neutron "Fullstack test test_qos.TestBwLimitQoSOvs.test_bw_limit_qos_port_removed failing many times" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:05 so I was debugging it for some time
16:20:15 and I have no idea why it is failing
16:20:52 I was even able to reproduce it locally once every 20-30 runs but I don't know what is happening there
16:21:04 I will have to check it more still
16:21:23 I see. "Also strange thing is that after test is finished, those records are deleted from ovs." <- don't we recycle ovsdb on each run?
16:21:24 but maybe for now I will propose a patch to mark this test as unstable
16:21:50 ihrachys: maybe, but I'm not sure about that
16:21:52 ah wait, we didn't have it isolated properly yet
16:22:29 no, it's the same ovsdb for all tests IMO
16:22:53 so if I read it right, the ovsdb command is executed but the structures are still in place?
16:22:59 yes
16:23:28 could it be related to the ovsdb issue we have been hitting? that commands return TRY_AGAIN and never succeed?
16:23:29 and ovsdb commands finished with success every time as I was checking it locally
16:23:32 or they time out?
16:23:36 ok
16:23:46 no, there wasn't any retry or timeout on it
16:23:49 it doesn't sound like it then
16:24:00 transactions were always finished fine
16:24:09 could it be that something recreates them after destroy?
16:24:46 ihrachys: I don't think so because I was also locally watching for them with the watch command and it didn't flap or anything like that
16:25:18 but I will check it once again maybe
16:26:11 ok. I think disabling the test is a good idea while we are looking into it.
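Since the 531414 discussion above is easy to lose in the log, here is a rough, self-contained sketch of the caching idea described there: updates that arrive for a port the firewall has not initialized yet are stashed and applied once the port shows up, so nothing is lost and only the flow-rule installation is deferred. All class and method names below are made up for illustration; this is not the actual neutron OVS firewall code.

```python
# Illustrative sketch only; names are hypothetical, not the real neutron code.
class FirewallSketch(object):
    def __init__(self):
        self._initialized_ports = {}  # port_id -> port data with flows set up
        self._pending_updates = {}    # port_id -> latest update seen so far

    def handle_sg_update(self, port_id, update):
        """Called on an RPC notification about a security group change."""
        if port_id not in self._initialized_ports:
            # The port is not wired up yet: remember the newest data instead
            # of erroring out; flow rules are built from it later (deferred).
            self._pending_updates[port_id] = update
            return
        self._install_flows(port_id, update)

    def prepare_port(self, port_id, port):
        """Called when the agent finally initializes the port."""
        self._initialized_ports[port_id] = port
        update = self._pending_updates.pop(port_id, None)
        if update is not None:
            # Apply the cached data now, so the earlier update is not lost.
            self._install_flows(port_id, update)

    def _install_flows(self, port_id, update):
        # Placeholder for actual OpenFlow programming.
        print("installing flows for %s with %s" % (port_id, update))


# Tiny usage example: an update arriving before the port exists is deferred.
fw = FirewallSketch()
fw.handle_sg_update("port-1", {"allowed_ips": ["10.0.0.5"]})
fw.prepare_port("port-1", {"name": "port-1"})
```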
16:26:27 ihrachys: ok, so I will do such a patch today
16:26:30 and those were all the action items we had
16:26:38 #topic Tempest plugin
16:26:48 I think we are mostly done with the tempest plugin per se
16:27:04 though there are some tempest bits in-tree that required stadium projects to switch to the new repo first
16:27:25 and those were not moving quickly enough
16:27:47 f.e. vpnaas: https://review.openstack.org/#/c/521341/
16:28:02 I recollect that mlavalle was going to talk to their representatives about moving those patches forward
16:28:08 mlavalle, any news on this front?
16:28:29 well, now it is my turn to say that I forgot about that
16:28:53 I will follow up this week
16:28:58 sorry :-(
16:29:21 #action mlavalle to follow up with stadium projects on switching to new tempest repo
16:29:27 that's fine, it happens :)
16:29:41 the list was at https://etherpad.openstack.org/p/neutron-tempest-plugin-job-move line 27+
16:29:49 ok
16:29:56 afaiu we should cover vpnaas, dynamic-routing and midonet
16:30:07 cool
16:30:55 afair there were concerns about how to install the new repo, and a devstack plugin was added to neutron-tempest-plugin, so stadium projects should consume it
16:31:09 #topic Grafana
16:31:15 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:31:25 we actually have some good news there
16:31:44 the linuxbridge scenario job seems to be in decent shape after the recently disabled tests and the security group fix
16:31:55 it's at ~10% right now
16:32:08 which I would say is a regular failure rate one could expect from a tempest job
16:32:50 ++
16:32:51 so that's good. we should monitor and eventually make it voting if it keeps that level.
16:32:53 nice :)
16:33:34 kudos to those who fixed it :)
16:33:36 the dvr flavor is not as great, though it is also down from the 90% level it stayed at for months
16:33:48 currently at ~40% on my chart
16:34:09 and fullstack is in the same bad shape. so mixed news, but definitely progress on the scenario side.
16:34:20 kudos to everyone who was and is pushing those forward
16:34:44 I see functional is now ~10%; yesterday EMEA morning it was up to 50%
16:35:15 I saw a lot of weird errors in the gate yesterday
16:35:22 RETRY_TIMEOUTS and stuff
16:35:32 could be that the gate was just unstable in general
16:35:49 but we'll keep an eye on it
16:36:10 f.e. I see a similar spike for unit tests around the same time
16:36:19 #topic Scenarios
16:36:35 so since linuxbridge seems good now, let's have a look at the latest failures for dvr
16:37:35 ok I took http://logs.openstack.org/98/513398/10/check/neutron-tempest-plugin-dvr-multinode-scenario/568c685/job-output.txt.gz
16:37:41 but it seems like a timeout for the job
16:37:47 it took almost 3h and failed
16:38:39 here is another one: http://logs.openstack.org/51/529551/4/check/neutron-tempest-plugin-dvr-multinode-scenario/ee927b8/job-output.txt.gz
16:38:48 same story
16:39:05 perhaps we should try now to increase the concurrency? let me check the load
16:39:11 so from what I see, it either times out or it passes
16:39:24 I suspect it may be related to meltdown
16:39:40 we were looking yesterday in the neutron channel at the slowdown for rally scenarios
16:39:54 https://bugs.launchpad.net/neutron/+bug/1717302 still isn't fixed either
16:39:55 Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:39:57 x2-x3 slowdown for some scenarios
16:40:13 and figured it depends on the cloud and whether they are patched...
16:41:14 haleyb: don't we skip the fip test cases?
16:41:31 I mean that they are tagged as unstable, so skipped if they fail
16:41:33 oh yeah, didn't see a link in the bug
16:41:41 jlibosva, we do skip them, yes
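"Tagged as unstable, so skipped if they fail" has now come up twice (the QoS fullstack test earlier and the fip scenario tests here). For reference, a self-contained sketch of what such a decorator can look like; neutron carries a helper along these lines, but the exact name and location may differ from this approximation.

```python
import functools
import unittest


def unstable_test(reason):
    """Turn a failure into a skip so a known-flaky test stops breaking the gate."""
    def decorator(f):
        @functools.wraps(f)
        def wrapper(self, *args, **kwargs):
            try:
                return f(self, *args, **kwargs)
            except unittest.SkipTest:
                # An intentional skip is passed through untouched.
                raise
            except Exception:
                # Any failure is reported as a skip, with the tracking bug noted.
                raise unittest.SkipTest(
                    "%s is marked unstable: %s" % (f.__name__, reason))
        return wrapper
    return decorator


# Usage: until the underlying bug is fixed, failures show up as skips.
class TrunkConnectivitySketch(unittest.TestCase):
    @unstable_test("bug 1740885")
    def test_trunk_subport_connectivity(self):
        self.fail("flaky connectivity check")  # reported as a skip, not a failure
```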
16:42:07 jlibosva, you mentioned concurrency. currently we seem to run with =2
16:42:24 would the suggestion be to run with e.g. the number of cpus?
16:42:35 just remove the concurrency and let the runner decide
16:42:40 I see most scenarios already taking like 5 minutes+ each in a good run
16:42:52 I remember I added the concurrency there because we thought the server gets choked on slow machines
16:43:04 I suspect there may be hardcoded timeouts in some of the scenarios we have, like waiting for a resource to come back
16:43:26 but wouldn't it be a problem with e.g. memory consumption if you have more threads?
16:43:31 (just asking)
16:43:46 slaweq, there may be. especially where scenarios spin up instances.
16:44:11 I guess it's easier to post a patch and recheck it to death
16:44:18 and see how it fares
16:44:32 I can have a look at it. need to report a bug for the timeouts too.
16:44:53 #action ihrachys to report bug for dvr scenario job timeouts and try concurrency increase
16:45:10 it's interesting that linuxbridge doesn't seem to trigger it.
16:45:20 is it because maybe more tests are executed for dvr?
16:46:18 it's 36 tests in dvr and 28 in linuxbridge
16:47:03 some of those are dvr migration tests so that's good
16:47:23 I also noticed that NetworkMtuBaseTest is not executed for linuxbridge because apparently we assume the gre type driver is enabled and it's not supported by linuxbridge
16:47:42 we can probably configure the plugin for vxlan and have it executed for linuxbridge too then
16:48:06 makes sense, let's make linuxbridge suffer too :)
16:48:25 actually, it's 31 vs 21 tests
16:48:32 I read the output incorrectly before
16:48:47 jlibosva, huh. well it shouldn't affect much and it would add coverage
16:49:00 I totally agree :)
16:49:03 I will report a bug for that at least. it's not critical to fix it right away.
16:49:18 #action ihrachys to report bug for mtu scenario not executed for linuxbridge job
16:50:03 anyway, the diff in total time spent on tests is ~3k seconds
16:50:35 which is like 50 minutes?
16:50:57 that's kinda significant
16:51:10 that could be it; the average test is like 390 seconds? so if you add 10 more, with concurrency two, it adds almost 30 minutes
16:51:34 yeah. and linuxbridge is already at 2h:20m in a good run
16:51:39 with bad luck using some slow cloud, it could start hitting timeouts too. That's why I think it's good to make linuxbridge suffer too, to confirm our theory :)
16:51:50 add 30-50 on top and you have the timeout triggered
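A quick back-of-the-envelope check of the timing argument above, using the rough figures quoted in the discussion (not measured values):

```python
# "10 more tests at ~390 s each, concurrency 2" -> extra wall-clock time.
avg_test_seconds = 390   # rough average scenario test duration quoted above
extra_tests = 10         # approximate dvr vs. linuxbridge test count difference
concurrency = 2          # current setting for the scenario jobs

extra_minutes = extra_tests * avg_test_seconds / concurrency / 60.0
print("extra wall clock: ~%.0f minutes" % extra_minutes)
# ~32 minutes on top of a ~2h20m good run, which is uncomfortably close to the
# ~3h job timeout seen in the dvr runs linked earlier.
```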
16:52:05 ok anyway, we have a way forward. let's switch topics
16:52:08 #topic Fullstack
16:52:20 I remember in the tempest repo we have the slow tag, so maybe we will need to start using it in the future
16:52:37 jlibosva, yeah but ideally we still want to have all those executed in some way in the gate
16:52:50 so the best you can do is then split the job into pieces
16:53:00 ok fullstack. same exercise, looking at the latest runs.
16:53:11 yeah, looking at the times of particular tests, they all are slow
16:53:13 example: http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
16:53:28 this is I believe the failure I should have reported but failed to.
16:53:54 jlibosva, afaiu we don't have accelerated virtualization in infra clouds. so starting an instance takes ages.
16:54:12 let's see if other fullstack runs are the same
16:54:28 ok this one is different: http://logs.openstack.org/98/513398/10/check/neutron-fullstack/b62a726/logs/testr_results.html.gz
16:54:48 our old friend "Commands [...] exceeded timeout 10 seconds"
16:55:13 not a very nice friend
16:55:18 I recollect jlibosva reported https://bugs.launchpad.net/neutron/+bug/1741889 lately
16:55:19 Launchpad bug 1741889 in neutron "functional: DbAddCommand sometimes times out after 10 seconds" [Critical,New]
16:55:19 nope
16:55:38 I thought it was more sever as by the time I reported it the functional job seemed busted
16:55:41 so it's not that hot anymore
16:56:07 severe*
16:56:25 you mean 1741889?
16:56:29 yeah
16:56:50 should we lower the importance then?
16:56:58 possibly
16:56:59 jlibosva, do you suggest it may be intermittent and the same issue, and we may no longer experience it?
16:57:48 well, I mean we talked about it before; you said UT had a peak at about that time, so the bug might not be the reason for the functional failure rate peak
16:58:14 although it's not a nice bug as it still keeps coming back in weird intervals, like twice a year :) and then it goes away
16:58:21 ok I lowered it to High for now
16:58:27 and added details about fullstack
16:58:36 could be related to slow hw used
16:58:37 dunno
16:58:38 thanks
16:58:40 just thinking out loud
16:58:49 yeah I think it's ok to just monitor
16:59:01 that other sec group failure seems more important to have a look at.
16:59:14 and we have some work to do for scenarios anyway till next week
16:59:21 I can check on this failure with SG
16:59:29 *check this failure
16:59:38 slaweq, if you like, I would be grateful of course.
16:59:42 sure
16:59:50 have some other pressing things on my plate so I don't insist :)
16:59:53 cool!
17:00:09 #action slaweq to take over sec group failure in fullstack (report bug / triage / fix)
17:00:14 we are out of time
17:00:31 just to be sure, we are talking about the issue like in http://logs.openstack.org/43/529143/3/check/neutron-fullstack/d031a6b/logs/testr_results.html.gz
17:00:33 right?
17:00:40 slaweq, yes
17:00:41 ok
17:00:44 thx
17:01:07 ok. thanks everyone for being so helpful and showing initiative. it's good it's just me slacking here, otherwise we wouldn't achieve all we did.
17:01:11 #endmeeting
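A footnote on the "Commands [...] exceeded timeout 10 seconds" errors discussed above: the 10 seconds is a configurable agent-side knob (assumed here to be neutron's OVS.ovsdb_timeout option; check the tree before relying on the exact name). A minimal oslo.config sketch of bumping it for slow CI nodes, with the option registered locally so the snippet stays self-contained:

```python
from oslo_config import cfg

# Register a stand-in for the option; in the real tree the agent code
# registers it itself, so this block only exists to keep the sketch runnable.
cfg.CONF.register_opts(
    [cfg.IntOpt('ovsdb_timeout', default=10,
                help='Timeout in seconds for ovsdb commands')],
    group='OVS')

# Raise the timeout before exercising OVSDB-heavy code on slow hardware.
cfg.CONF.set_override('ovsdb_timeout', 30, group='OVS')
print(cfg.CONF.OVS.ovsdb_timeout)  # -> 30
```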