16:00:35 #startmeeting neutron_ci
16:00:36 Meeting started Tue Mar 27 16:00:35 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:37 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 hi
16:00:39 The meeting name has been set to 'neutron_ci'
16:01:04 o/
16:01:21 o/
16:01:23 hi
16:01:48 late o/
16:01:57 it's my first time as chair of this meeting so please be tolerant for me :)
16:02:01 let's go
16:02:13 #topic Actions from prev meeting
16:02:29 jlibosva to take a look on dvr trunk tests issue: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:03:02 I spend a fair amount of time digging into it but I haven't been able to root cause it. I can see some errors in ovs firewall but that's recovered
16:03:02 jlibosva: any updates?
16:03:07 sorry :)
16:03:29 I'll keep digging into it. I might need to send some additional debug patches
16:03:32 do we have a bug report for the issue?
16:03:43 no, not yet
16:04:25 it could be related to the old issue with sending update security group before firewall is initialized that we had in the past
16:05:12 jlibosva: but AFAIR it happens only in those tests related to trunk, right?
16:05:32 yes, only for trunk
16:05:46 and it blocks SSH to parent port
16:06:36 maybe You could add some additional logs with e.g. security groups or something like that and try to spot it once again in job to check what's there?
16:07:01 or freeze test node if it will happen, log into it and debug there
16:07:13 AFAIK it happens quite often so it should be doable IMO
16:07:46 ok, I'll try that. thanks for tips. about freezing test node, is it offical or do I need to inject ssh key?
16:08:06 it is official but You should ask on infra channel for that
16:08:14 and give them Your ssh key
16:08:25 I was trying it once with debugging linuxbridge jobs
16:08:53 I was also trying to do remote pdb and telnet to it when test fails - and that works also fine for me :)
16:09:35 slaweq, man you should document it all
16:09:43 I don't think a lot of people are aware of how to do it
16:09:56 I mean, that doc would be gold
16:10:05 ihrachys: ok, I will try to write it in docs
16:10:20 #action slaweq will write docs how to debug test jobs
16:10:45 thanks!
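
A rough sketch of the "remote pdb + telnet" trick mentioned above at 16:08:53; the remote-pdb package is an assumption (it is not named in the log), and the failing check, bind address and port are placeholders:

```python
# Minimal sketch of pausing a failing test and attaching over telnet.
# Assumes the remote-pdb package is available on the test node; the
# failing check, bind address and port are placeholders.
from remote_pdb import RemotePdb


def flaky_check():
    # stand-in for whatever assertion fails in the gate job
    raise AssertionError("SSH to instance failed")


try:
    flaky_check()
except AssertionError:
    # Pause here and wait for a debugger client; from a host that can
    # reach the test node run:  telnet <node-ip> 44444
    RemotePdb("0.0.0.0", 44444).set_trace()
    raise
```
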
16:10:56 ok, moving on
16:11:03 next was slaweq to check why dvr-scenario job is broken with netlink errors in l3 agent log
16:11:16 created bug report: https://bugs.launchpad.net/neutron/+bug/1757259
16:11:16 Launchpad bug 1757259 in neutron "Netlink error raised when trying to delete not existing IP address from device" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:11:26 and fix: https://review.openstack.org/#/c/554697/ (merged)
16:11:57 and last action from last week:
16:12:00 slaweq to revert https://review.openstack.org/#/c/541242/ because the bug is probably fixed by https://review.openstack.org/#/c/545820/4
16:12:03 revert done: https://review.openstack.org/#/c/554709/
16:12:32 it looks that this issue isn't fixed still (but it happens much less frequently)
16:12:53 so I will keep debugging it if I will spot this issue again
16:13:23 any questions/someting to add? or can we move on to next topic?
16:14:12 I don't have anything
16:14:19 ok, so next topic
16:14:24 #topic Grafana
16:14:30 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:14:38 btw. I have one question to You
16:14:55 do we have similar grafana dashboard but for stable branches?
16:15:09 I am only aware of this one which shows results from master branch
16:15:23 no we don't
16:15:43 it was in my todo list for ages and it never bubbled up high enough
16:16:03 ah, ok
16:16:10 should be quite easy though, copy the current one and replace master with whatever the string for stable query
16:16:29 I will add it to my todo list then (but not too high also) :)
16:17:07 have fun :)
16:17:11 ihrachys: thx
16:17:25 ok, so getting back to dashboard for master branch
16:17:43 we have 100% failures of neutron-tempest-plugin-dvr-multinode-scenario
16:18:01 since few days at least
16:18:10 it's 7 days at least
16:18:25 there was problem with FIP QoS test but this one is skipped now in this job
16:18:35 so I checked and found few example of failures
16:18:50 it looks that there are 2 main failures:
16:19:16 issue with neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromDVRHA, like: http://logs.openstack.org/97/552097/8/check/neutron-tempest-plugin-dvr-multinode-scenario/d242b10/logs/testr_results.html.gz
16:19:47 and second issue with neutron_tempest_plugin.scenario.test_trunk.TrunkTest, like: http://logs.openstack.org/20/556120/12/check/neutron-tempest-plugin-dvr-multinode-scenario/76d1a5b/logs/testr_results.html.gz
16:20:35 slaweq: the migration issue is interesting, i thought we had that working before
16:20:59 haleyb: for sure it wasn't failing 100% times :)
16:21:00 haleyb, yes and it was very stable
16:21:20 i will have to take a look
16:21:24 but in all those cases it looks that reason of failure is problem with connectivity
16:21:31 thx haleyb
16:21:53 haleyb, see in instance boot log, metadata is not reachable
16:21:54 #action haleyb to check router migrations issue
16:21:54 [ 456.685625] cloud-init[1021]: 2018-03-27 09:08:57,215 - DataSourceCloudStack.py[CRITICAL]: Giving up on waiting for the metadata from ['http://10.1.0.2/latest/meta-data/instance-id'] after 121 seconds
16:22:34 and it's because of "Connection refused"
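
For context, the cloud-init failure above is a timed-out HTTP GET against the metadata address shown in the console log; a quick way to mimic that probe from inside a guest could look like the sketch below (the URL comes from that log, the timeout is arbitrary):

```python
# Rough reproduction of the metadata probe cloud-init gives up on.
# The URL is the one from the console log above; the timeout is arbitrary.
import urllib.request

URL = "http://10.1.0.2/latest/meta-data/instance-id"
try:
    with urllib.request.urlopen(URL, timeout=10) as resp:
        print(resp.read().decode())
except OSError as exc:
    # a "Connection refused" here matches the failure seen in the gate
    print("metadata not reachable: %s" % exc)
```
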
16:23:18 slaweq, as for trunk, it's the same as jlibosva was looking at?
16:23:22 right
16:23:25 I was just about to write it
16:23:53 yes, it looks that it's the same
16:25:05 there is much more such examples in recent patches so it should be relatively easy to debug :)
16:25:23 (I don't know if it's possible to reproduce locally)
16:26:33 ok, from other frequently failing jobs we have also neutron-tempest-dvr-ha-multinode-full which failure rate is about 50%
16:26:45 yeah it's like that for quite a while
16:27:02 I found that in most cases failures are also related to broken connectivity, like e.g. * http://logs.openstack.org/52/555952/2/check/neutron-tempest-dvr-ha-multinode-full/9f94fa4/logs/testr_results.html.gz
16:27:39 ihrachys: yes, but I wanted to mention it at least :)
16:28:16 ahh, and one more thing, there are some periodic jobs failing 100% times
16:28:27 like e.g. Openstack-tox-py35-with-oslo-master
16:28:47 my first question to You is: where I can find results of such tests? :)
16:29:26 http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/
16:29:38 then dive into a job dir
16:29:41 and sort by time
16:29:44 and check the latest
16:29:51 which is http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/openstack-tox-py27-with-oslo-master/f65fec8/job-output.txt.gz
16:29:52 thx ihrachys
16:29:59 DuplicateOptError: duplicate option: polling_interval
16:30:09 I can take a look
16:30:22 probably a conflict in the same section between an oslo lib and our options
16:30:46 #action ihrachys to take a look at problem with openstack-tox-py35-with-oslo-master periodic job
16:30:47 thx
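
The DuplicateOptError above is what oslo.config raises when two different definitions of the same option are registered in the same group, which matches the suspected clash between an oslo lib's options and neutron's own; a minimal illustration (the group name and defaults are made up for the example):

```python
# Minimal illustration of "DuplicateOptError: duplicate option:
# polling_interval": two non-identical definitions of the same option
# registered in one group. The group name and defaults are made up.
from oslo_config import cfg

conf = cfg.ConfigOpts()
conf.register_opt(cfg.IntOpt("polling_interval", default=2), group="agent")
try:
    # a second, different definition of the same option in the same group
    conf.register_opt(cfg.IntOpt("polling_interval", default=5), group="agent")
except cfg.DuplicateOptError as exc:
    print(exc)  # duplicate option: polling_interval
```
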
16:31:16 except those jobs I think it is quite fine now
16:31:38 yeah. the unit test failure rate is a bit higher than I would expect but seems consistent with other jobs.
16:31:55 maybe our unit tests are just as good now that they catch majority of issues :)
16:32:13 maybe :)
16:32:19 ++
16:32:24 we can always hope
16:32:26 hopefully
16:32:42 ok, moving on to next topic
16:32:45 #topic Fullstack
16:33:08 fullstack is now at similar rate as functional and tempest-plugin-api tests so it's fine
16:33:23 we had one issue with it during last week
16:33:33 Bug https://bugs.launchpad.net/neutron/+bug/1757089
16:33:34 Launchpad bug 1757089 in neutron "Fullstack test_port_shut_down(Linux bridge agent) fails quite often" [High,Fix released] - Assigned to Slawek Kaplonski (slaweq)
16:33:49 but it is (I hope) already fixed by https://review.openstack.org/#/c/554940/
16:34:37 another problem is mentioned before bug with starting nc process: https://bugs.launchpad.net/neutron/+bug/1744402
16:34:38 Launchpad bug 1744402 in neutron "fullstack security groups test fails because ncat process don't starts" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:35:06 For now I added small patch https://review.openstack.org/#/c/556155/ and when it fails I want to check if it's because IP wasn't configured properly or there is some different problem with starting nc proces
16:36:00 anyone wants to add anything?
16:36:11 or should we move on?
16:36:40 nothing from me :)
16:36:55 I'm just amazed how many things you can do
16:37:01 thx :)
16:37:15 jlibosva, THIS
16:37:27 sometimes I feel embarrassed
16:37:31 :)
16:37:37 actually I have one more question here :)
16:38:15 as fullstack is voting since 2 weeks and I don't think there are some big problems with it, should we consieder to make it gating also?
16:38:53 we are before Rocky-1, so it's a good time to give it a try
16:38:59 +
16:39:10 if not now, then when?
16:39:19 great, I will send patch for it then
16:39:27 yay
16:39:35 #action slaweq to make fullstack job gating
16:39:46 ok, next topic
16:39:50 #topic Scenarios
16:40:03 we already talked about issues with dvr scenario job
16:40:15 so I only want to mention about linuxbridge job
16:40:25 we had two issues during last week:
16:40:44 both were in stable/queens branch only
16:41:20 first problem was due to we forgot to backport https://review.openstack.org/#/c/555263/ to queens so it wasn't configured and fip qos test was failing there
16:41:32 and second due to we forgot to backport https://review.openstack.org/#/c/554859/ and this job was failing a lot due to ssh timeouts
16:41:51 this was a problem as this job is voting also in stable/queens now
16:41:59 but I think it's fine now
16:42:20 and this job is on good level of failures again :)
16:42:48 very nice
16:43:08 so I wanted to ask about making this job gating also - what You think about it?
16:43:22 let's do one one week
16:43:29 and then we add the next
16:43:37 how about that?
16:43:39 mlavalle: sure, fine for me :)
16:44:06 that way we don't drive ourselves crazy in case something starts to fail
16:44:27 mlavalle: agree
16:44:38 so I will back with this question next week then :)
16:45:04 I know you will
16:45:09 LOL
16:45:15 relentless Hulk
16:45:21 :-)
16:45:36 ok, so that's all what I have on my list for today
16:45:44 topic #Open discussion
16:46:02 I wanted to ask You to review https://review.openstack.org/#/c/552846/
16:46:18 it's my first attempt to move some job definition to zuul v3 format
16:46:30 it looks that it works for this job
16:46:47 ahh nice
16:46:48 andreaf was reviewing it from zuul point of view and it is fine for him
16:46:55 will take a look
16:47:00 so I think it's ready to review by You also
16:47:03 thx mlavalle
16:47:06 if andreaf is fine it must be ok. :)
16:47:11 will do
16:47:45 he gave +1 on one of patchsets - I will ask him also to take a look again now
16:47:50 I want to migrate also job definitions. so we can partner doing it
16:48:28 mlavalle: what job definitions You want to migrate? from neutron repo?
16:48:44 yeah I was thinking of those
16:49:09 I can help You with it if You want
16:49:49 ah, and one more question about tempest jobs
16:50:12 sure, let's partner as I said
16:50:29 when I was checking grafana today I found that two jobs: Neutron-tempest-ovsfw and neutron-tempest-multinode-full are quite stable in last few weeks also
16:50:43 You can check it http://grafana.openstack.org/dashboard/db/neutron-failure-rate?from=1516976912028&to=1522153712028&panelId=8&fullscreen
16:51:12 so my question is: do You want to make them voting some day?
16:51:47 what's this multinode-full job about? how is it different from dvr-ha one? do we plan to replace it with dvr-ha eventually?
16:52:01 ihrachys: I don't know to be honest
16:52:12 I just found it on graphs and wanted to ask :)
16:52:51 it might end up being functionally a subset of the dvr-ha
16:52:56 ihrachys: I can compare those jobs for You if You want
16:52:57 I am actually surprised we don't have a multinode tempest full job that would vote :)
16:53:25 so let's clarify the differences and then we can make a decision
16:53:35 ok, fine for me
16:53:42 mlavalle, the thing is, dvr-ha would probably take some time to get to voting, so should we maybe enable the regular one (I believe it's legacy routers) while dvr-ha is being worked on, then revisit?
16:54:03 slaweq, + would be nice to understand the diff to judge
16:54:05 I agree that we should have a voting multinode
16:54:36 #action slaweq will check difference between neutron-tempest-multinode-full and neutron-tempest-dvr-ha-multinode-full
16:54:46 also, source of this job is important. it could be the regular is defined in tempest
16:55:19 ihrachys so I agree with you
16:56:10 great, thx for opinions
16:56:17 and what about ovsfw job? is it to early for this one to make it voting?
16:56:42 it was failing 100% times at the beginning of february
16:57:21 should be voting at some point. I don't see why not
16:57:27 but since few weeks it is better IMHO - spikes are same as for other jobs so it doesn't looks like some issues specific with this job exectly
16:57:43 yeah though again, maybe dvr-ha one should incorporate this firewall driver and effectively replace it
16:57:44 but I think it also have some additional failure, no?
16:58:16 jlibosva: I don't know about any issue specific to this one
16:59:14 ah, right. it copies other jobs nicely last 7 days
16:59:25 the way I see it, there are two major setup options for neutron in-tree stuff: first is old - legacy l3, iptables... another is new stuff - dvr/ha, ovsfw. to keep the number of jobs from blowing further up, I would recommend we try to keep those two major kinds of setups as targets and try to consolidate features around them
16:59:48 ok, I think we can check it for few more weeks and decide then as we are almost out of time now
16:59:55 that's good advice ihrachys. Thanks :-)
16:59:59 I think that was agreed on some PTG but never formally documented
17:00:13 ok, we are out of time now
17:00:17 yeah
17:00:17 #endmeeting