16:01:21 #startmeeting neutron_ci
16:01:22 Meeting started Tue May 8 16:01:21 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:23 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:25 The meeting name has been set to 'neutron_ci'
16:01:28 welcome to the CI meeting :)
16:01:28 o/
16:01:44 the last one of my morning :-)
16:01:48 * slaweq has a busy day of meetings :)
16:02:06 mlavalle: same for me but last of my afternoon
16:02:12 o/
16:02:18 hi ihar
16:02:22 oh you guys stay fir 3h? insane
16:02:24 haleyb will join?
16:02:26 *for
16:02:26 hi
16:02:36 ihar: yes
16:02:46 I just finished the QoS meeting and started this one
16:02:47 * haleyb takes an hour off for offline meetings :(
16:02:53 one by one :)
16:03:00 ok, let's start
16:03:07 #topic Actions from previous meetings
16:03:27 last week there was no meeting so let's check actions from 2 weeks ago
16:03:35 * slaweq will check failed SG fullstack test
16:03:46 I reported bug https://bugs.launchpad.net/neutron/+bug/1767829
16:03:47 Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:04:16 I tried to reproduce it with some DNM patch with additional logs but I couldn't
16:04:32 I suppose that it is some race condition in the conntrack manager module
16:04:58 slaweq: i hope not :( but i will also try, and do some manual testing, since it's blocking that other patch
16:04:59 haleyb got a similar error on one of his patches I think, and there it was reproducible 100% of the time
16:06:10 ok, so haleyb You will check that on Your patch, right?
16:06:42 slaweq: yes, and i had tweaked the patch with what looked like a fix, but it still failed, so i'll continue
16:06:59 ok, thx
16:07:26 #action haleyb will debug failing security groups fullstack test: https://bugs.launchpad.net/neutron/+bug/1767829
16:07:28 Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:07:38 next one is
16:07:39 * slaweq will report bug about failing trunk tests in dvr multinode scenario
16:07:48 Bug report is here: https://bugs.launchpad.net/neutron/+bug/1766701
16:07:49 Launchpad bug 1766701 in neutron "Trunk Tests are failing often in dvr-multinode scenario job" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:08:26 I see that mlavalle is assigned to it
16:08:34 did You find something maybe?
16:08:46 no, I haven't made progress on this one
16:09:13 I will work on it this week
16:09:18 thx
16:09:43 #action mlavalle will check why trunk tests are failing in dvr multinode scenario
16:09:56 thanks
16:10:01 ok, next one
16:10:04 * jlibosva will mark trunk scenario tests as unstable for now
16:10:15 he did: https://review.openstack.org/#/c/564026/
16:10:15 I think he did
16:10:15 patch 564026 - neutron-tempest-plugin - Mark trunk tests as unstable (MERGED)
16:10:34 so mlavalle, be aware that those tests will now not fail in new jobs
16:10:49 if You look for failures, You need to look for skipped tests :)
16:10:57 right
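
For context, "marking a test as unstable" wraps the test so that a failure is reported as a skip instead of failing the job; the merged patches most likely rely on tempest's decorators.unstable_test for this. A simplified, self-contained sketch of the mechanism (not the actual tempest or neutron-tempest-plugin code; the test name is made up, the bug number is the one discussed above):

    import functools
    import unittest


    def unstable_test(bug):
        """Report a failure of the decorated test as a skip, citing the bug."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(self, *args, **kwargs):
                try:
                    return func(self, *args, **kwargs)
                except unittest.SkipTest:
                    # real skips pass through untouched
                    raise
                except Exception:
                    raise unittest.SkipTest(
                        "%s failed but is marked unstable (bug %s), skipping"
                        % (func.__name__, bug))
            return wrapper
        return decorator


    class TrunkScenarioTest(unittest.TestCase):

        @unstable_test(bug="1766701")
        def test_trunk_lifecycle(self):
            # stands in for the real flaky scenario test
            self.assertTrue(False)


    if __name__ == "__main__":
        unittest.main()  # the failure above is reported as skipped, not failed

With the real decorator in place, a failing trunk test shows up as skipped in the job results, which is why skipped tests are what to look for when reviewing those runs.
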
16:11:10 next one was:
16:11:11 * slaweq to check rally timeouts and report a bug about that
16:11:21 I reported a bug: https://bugs.launchpad.net/neutron/+bug/1766703
16:11:22 Launchpad bug 1766703 in neutron "Rally tests job is reaching job timeout often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:11:39 and I did some initial checks of the logs
16:11:51 there are some comments in the bug report
16:12:21 basically it doesn't look like we are very close to the limit on good runs
16:14:15 and also in the neutron server logs I found that API calls on such bad runs are really slow, e.g. http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_23_23_632631
16:14:40 also there are a lot of errors in this log file: http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_14_59_003949
16:15:09 I don't think that this is the culprit of the slow responses because the same errors are in the logs of "good" runs
16:15:23 but are You aware of such errors?
16:15:34 ihar: especially You, as it looks to be related to the db :)
16:16:48 I think savepoint deadlocks are common, as you said already
16:16:59 I have no idea why tho
16:17:08 ok, just asking, thx for checking
16:17:28 one reason for the slow db could be the limit on connections in oslo.db
16:17:36 and/or wsgi
16:17:51 neutron-server may just queue requests
16:17:51 ihar: thx for the tips, I will try to investigate these slow responses more during the week
16:18:16 but I dunno, depends on whether we see slowdowns in the middle of handlers or it just takes a long time to see first messages of a request
16:19:09 the problem is that if it reaches this timeout then there are none of those fancy rally graphs and tables to check everything
16:19:27 so it's not easy to compare with "good" runs
16:19:45 #action slaweq will continue debugging slow rally tests issue
16:20:08 I think we can move on to the next one
16:20:11 * slaweq will report a bug and talk with Federico about issue with neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4
16:20:21 Bug report: https://bugs.launchpad.net/neutron/+bug/1766702
16:20:23 Launchpad bug 1766702 in neutron "Periodic job * neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4 fails" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:47 dalvarez found that there was also another problem caused by Federico's patch
16:21:00 I already proposed a patch to fix what dalvarez found
16:21:22 and I'm working on fixing the neutron-dynamic-routing tests
16:21:40 the problem here is that now tests can't create two subnets with the same cidr
16:22:18 and neutron-dynamic-routing in scenario tests creates a subnetpool and then a subnet "per test" but always uses the same cidrs
16:22:34 so the first test passed but in the second the subnet is not created as the cidr is already in use
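
A rough sketch of the kind of conflict being described, using python-neutronclient directly; the names, credentials and CIDRs are made up and this is not the actual neutron-dynamic-routing test code, just the same pattern of requesting the same prefix twice from one subnetpool:

    from keystoneauth1 import loading
    from keystoneauth1 import session
    from neutronclient.v2_0 import client as neutron_client

    # Placeholder credentials for a devstack-like cloud.
    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(auth_url='http://127.0.0.1/identity',
                                    username='admin', password='secret',
                                    project_name='admin',
                                    user_domain_id='default',
                                    project_domain_id='default')
    neutron = neutron_client.Client(session=session.Session(auth=auth))

    # Shared subnetpool, like the one the scenario tests set up.
    pool = neutron.create_subnetpool(
        {'subnetpool': {'name': 'bgp-test-pool',
                        'prefixes': ['10.10.0.0/16']}})
    pool_id = pool['subnetpool']['id']

    # "First test": allocating 10.10.1.0/24 from the pool works.
    net1 = neutron.create_network({'network': {'name': 'test-net-1'}})
    neutron.create_subnet({'subnet': {'network_id': net1['network']['id'],
                                      'subnetpool_id': pool_id,
                                      'ip_version': 4,
                                      'cidr': '10.10.1.0/24'}})

    # "Second test": requesting the very same CIDR from the same pool is
    # rejected because that prefix is already allocated, so the subnet is
    # never created and the test fails on the missing resource.
    net2 = neutron.create_network({'network': {'name': 'test-net-2'}})
    neutron.create_subnet({'subnet': {'network_id': net2['network']['id'],
                                      'subnetpool_id': pool_id,
                                      'ip_version': 4,
                                      'cidr': '10.10.1.0/24'}})
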
16:23:13 haleyb posted some patch to check it but it wasn't a fix for this issue
16:23:24 so I will continue this work also
16:23:48 #action slaweq to fix neutron-dynamic-routing scenario tests bug
16:24:00 ok, that's all from my list
16:24:07 do You have anything to add?
16:25:33 ok, so let's move on to the next topic
16:25:46 #topic Grafana
16:25:51 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:27:22 Many tests are at a high failure rate yesterday and today, but it might be related to some issue with devstack legacy jobs, see: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/latest.log.html#t2018-05-08T07:10:02
16:30:12 except for that, I think it's quite "normal" for most tests
16:30:38 do You see anything worrying?
16:32:49 ok, I guess we can move to the next topics
16:32:54 has the devstack issue been solved?
16:33:14 yes, frickler told me today it should have been solved last night
16:33:21 ok
16:33:34 so a recheck of Your patch should work fine :)
16:33:49 moving on
16:33:50 #topic Scenarios
16:34:36 I see that the dvr-multinode job is still at about 20 to 40%
16:34:57 even after marking trunk tests and dvr migration tests as unstable
16:35:09 let's check some example failures then :)
16:35:13 :(
16:36:12 http://logs.openstack.org/24/558724/14/check/neutron-tempest-plugin-dvr-multinode-scenario/37dde27/logs/testr_results.html.gz
16:36:55 slaweq: Kernel panic - not syncing:
16:37:04 hah, that's not neutron!
16:37:09 haleyb: :)
16:37:25 I just copied the failure result without changing the reason
16:37:34 I'm still looking for some other examples
16:38:02 http://logs.openstack.org/27/534227/9/check/neutron-tempest-plugin-dvr-multinode-scenario/cff3380/logs/testr_results.html.gz
16:38:11 but this should already be marked as unstable, right haleyb?
16:38:45 this one or the last one?
16:39:03 the last one
16:39:07 sorry
16:39:22 in the first one, if it was a kernel panic then it's "fine" for us :)
16:39:29 LOL
16:39:51 not our problem
16:40:11 i'm not sure the migration tests are marked unstable
16:40:12 mlavalle: I think we have enough of our own problems still ;)
16:40:18 we do
16:40:32 both of those failed on metadata
16:40:35 haleyb: I thought that You did it a few weeks ago
16:41:12 https://review.openstack.org/#/c/561322/
16:41:13 patch 561322 - neutron-tempest-plugin - Mark DVR/HA migration tests unstable (MERGED)
16:41:18 looking
16:42:31 yes, it should have covered them
16:42:51 so maybe it was some old patch then
16:43:08 but we know this issue already so I think we don't need to talk about it now
16:43:19 thx for checking, haleyb :)
16:43:33 in the meantime I found one more failed run:
16:43:33 http://logs.openstack.org/78/566178/5/check/neutron-tempest-plugin-dvr-multinode-scenario/92ef438/job-output.txt.gz#_2018-05-08_10_08_04_048287
16:43:43 and here the global job timeout was reached
16:46:06 looks like most tests took about 10 minutes
16:48:03 I think that if this issue keeps repeating we will have to investigate what is slowing down those tests
16:48:07 do You agree? :)
16:48:55 yes, agreed
16:49:08 ok, thx haleyb :)
16:49:22 I didn't find other issues for this job from the last few days
16:49:48 regarding the other scenario jobs
16:50:03 I want to remind You that for some time we have had 2 voting jobs:
16:50:09 neutron-tempest-plugin-scenario-linuxbridge
16:50:09 and
16:50:15 neutron-tempest-ovsfw
16:50:26 both are IMHO quite stable according to grafana
16:50:43 and I would like to ask if we can consider making them gating also
16:50:59 * mlavalle looking at grafana
16:51:02 what do You think?
16:51:46 yes, let's give it a try
16:52:19 thx mlavalle for the blessing :)
16:52:39 should I send the patch or do You want to do it?
16:54:14 please send it
16:54:20 ok, I will
16:54:41 #action slaweq to make 2 scenario jobs gating
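
Making the two jobs gating would, roughly, mean adding them to the gate pipeline in the project's zuul configuration next to their existing check entries; a sketch of the kind of stanza involved (not the actual patch, exact file and surrounding job entries omitted):

    - project:
        check:
          jobs:
            - neutron-tempest-plugin-scenario-linuxbridge
            - neutron-tempest-ovsfw
        gate:
          jobs:
            - neutron-tempest-plugin-scenario-linuxbridge
            - neutron-tempest-ovsfw

In the gate pipeline a job failure blocks the merge instead of just leaving a -1 in check, which is why the jobs needed to look stable in grafana first.
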
16:55:14 I don't have anything else regarding any of the job types for today
16:55:27 do You want to talk about something else?
16:55:41 if not we can finish a few minutes before time :)
16:55:46 I don't have anything else
16:56:52 ok, so enjoy Your free time then ;)
16:56:55 #endmeeting