16:01:21 <slaweq> #startmeeting neutron_ci
16:01:22 <openstack> Meeting started Tue May 8 16:01:21 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:23 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:25 <openstack> The meeting name has been set to 'neutron_ci'
16:01:28 <slaweq> welcome to the CI meeting :)
16:01:28 <mlavalle> o/
16:01:44 <mlavalle> the last one of my morning :-)
16:01:48 * slaweq has a busy day of meetings :)
16:02:06 <slaweq> mlavalle: same for me, but the last of my afternoon
16:02:12 <ihar> o/
16:02:18 <slaweq> hi ihar
16:02:22 <ihar> oh, you guys stay for 3h? insane
16:02:24 <slaweq> haleyb will join?
16:02:26 <haleyb> hi
16:02:36 <slaweq> ihar: yes
16:02:46 <slaweq> I just finished the QoS meeting and started this one
16:02:47 * haleyb takes an hour off for offline meetings :(
16:02:53 <slaweq> one by one :)
16:03:00 <slaweq> ok, let's start
16:03:07 <slaweq> #topic Actions from previous meetings
16:03:27 <slaweq> last week there was no meeting, so let's check the actions from 2 weeks ago
16:03:35 <slaweq> * slaweq will check failed SG fullstack test
16:03:46 <slaweq> I reported bug https://bugs.launchpad.net/neutron/+bug/1767829
16:03:47 <openstack> Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:04:16 <slaweq> I tried to reproduce it with a DNM patch with additional logs, but I couldn't
16:04:32 <slaweq> I suspect it is some race condition in the conntrack manager module
16:04:58 <haleyb> slaweq: i hope not :( but i will also try, and do some manual testing, since it's blocking that other patch
16:04:59 <slaweq> haleyb got a similar error on one of his patches, I think, and there it was reproducible 100% of the time
16:06:10 <slaweq> ok, so haleyb, You will check that on Your patch, right?
16:06:42 <haleyb> slaweq: yes, and i had tweaked the patch with what looked like a fix, but it still failed, so i'll continue
16:06:59 <slaweq> ok, thx
16:07:26 <slaweq> #action haleyb will debug failing security groups fullstack test: https://bugs.launchpad.net/neutron/+bug/1767829
16:07:28 <openstack> Launchpad bug 1767829 in neutron "Fullstack test_securitygroup.TestSecurityGroupsSameNetwork fails often after SG rule delete" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:07:38 <slaweq> next one is
16:07:39 <slaweq> * slaweq will report bug about failing trunk tests in dvr multinode scenario
16:07:48 <slaweq> Bug report is here: https://bugs.launchpad.net/neutron/+bug/1766701
16:07:49 <openstack> Launchpad bug 1766701 in neutron "Trunk Tests are failing often in dvr-multinode scenario job" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:08:26 <slaweq> I see that mlavalle is assigned to it
16:08:34 <slaweq> did You find anything maybe?
16:08:46 <mlavalle> no, I haven't made progress on this one
16:09:13 <mlavalle> I will work on it this week
16:09:18 <slaweq> thx
16:09:43 <slaweq> #action mlavalle will check why trunk tests are failing in dvr multinode scenario
16:09:56 <mlavalle> thanks
16:10:01 <slaweq> ok, next one
16:10:04 <slaweq> * jlibosva will mark trunk scenario tests as unstable for now
16:10:15 <slaweq> he did: https://review.openstack.org/#/c/564026/
16:10:15 <mlavalle> I think he did
16:10:15 <patchbot> patch 564026 - neutron-tempest-plugin - Mark trunk tests as unstable (MERGED)
16:10:34 <slaweq> so mlavalle, be aware that those tests will no longer fail in new job runs
16:10:49 <slaweq> if You look for failures, You need to look for skipped tests :)
16:10:57 <mlavalle> right
16:11:10 <slaweq> next one was:
16:11:11 <slaweq> * slaweq to check rally timeouts and report a bug about that
16:11:21 <slaweq> I reported a bug: https://bugs.launchpad.net/neutron/+bug/1766703
16:11:22 <openstack> Launchpad bug 1766703 in neutron "Rally tests job is reaching job timeout often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:11:39 <slaweq> and I did some initial checks of the logs
16:11:51 <slaweq> there are some comments in the bug report
16:12:21 <slaweq> basically it doesn't look like we are very close to the time limit on good runs
16:14:15 <slaweq> and also in the neutron server logs I found that API calls on such bad runs are really slow, e.g. http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_23_23_632631
16:14:40 <slaweq> also there are a lot of errors in this log file: http://logs.openstack.org/24/558724/11/check/neutron-rally-neutron/d891678/logs/screen-q-svc.txt.gz?level=INFO#_Apr_24_04_14_59_003949
16:15:09 <slaweq> I don't think this is the culprit of the slow responses because the same errors are in the logs of "good" runs
16:15:23 <slaweq> but are You aware of such errors?
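(An aside on the trunk tests mentioned above as "marked unstable", and on why their failures now show up as skipped tests: the usual mechanism is a decorator that catches a test failure and re-raises it as a skip pointing at the tracked bug. The sketch below is a minimal, self-contained illustration of that idea, not the actual tempest/neutron-tempest-plugin code; the class and test names are made up.)

```python
import functools
import unittest


def unstable_test(bug):
    """Convert a test failure into a skip that points at the tracking bug."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except unittest.SkipTest:
                raise  # a genuine skip stays a skip
            except Exception as exc:
                # The job stays green, but the skip message keeps a pointer
                # to the bug being tracked for this flaky test.
                raise unittest.SkipTest(
                    "%s is marked unstable (%s), failure was: %s"
                    % (func.__name__, bug, exc))
        return wrapper
    return decorator


class TrunkTestsExample(unittest.TestCase):
    @unstable_test(bug="https://bugs.launchpad.net/neutron/+bug/1766701")
    def test_trunk_subport_lifecycle(self):
        # Stand-in for a real scenario check that fails intermittently.
        self.fail("flaky backend behaviour")


if __name__ == "__main__":
    unittest.main()  # the failing test is reported as skipped, not failed
```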
16:15:34 <slaweq> ihar: especially You, as it looks to be related to the db :)
16:16:48 <ihar> I think savepoint deadlocks are common, as you said already
16:16:59 <ihar> I have no idea why tho
16:17:08 <slaweq> ok, just asking, thx for checking
16:17:28 <ihar> one reason for slow db could be the limit on connections in oslo.db
16:17:36 <ihar> and/or wsgi
16:17:51 <ihar> neutron-server may just queue requests
16:17:51 <slaweq> ihar: thx for the tips, I will try to investigate these slow responses more during the week
16:18:16 <ihar> but I dunno, depends on whether we see slowdowns in the middle of handlers or it just takes a long time to see the first messages of a request
16:19:09 <slaweq> the problem is that if it reaches this timeout then there are none of those fancy rally graphs and tables to check everything
16:19:27 <slaweq> so it's not easy to compare with "good" runs
16:19:45 <slaweq> #action slaweq will continue debugging slow rally tests issue
16:20:08 <slaweq> I think we can move on to the next one
16:20:11 <slaweq> * slaweq will report a bug and talk with Federico about issue with neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4
16:20:21 <slaweq> Bug report: https://bugs.launchpad.net/neutron/+bug/1766702
16:20:23 <openstack> Launchpad bug 1766702 in neutron "Periodic job * neutron-dynamic-routing-dsvm-tempest-with-ryu-master-scenario-ipv4 fails" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:20:47 <slaweq> dalvarez found that there was also another problem caused by Federico's patch
16:21:00 <slaweq> I already proposed a patch to fix what dalvarez found
16:21:22 <slaweq> and I'm working on fixing the neutron-dynamic-routing tests
16:21:40 <slaweq> the problem here is that tests now can't create two subnets with the same cidr
16:22:18 <slaweq> and the neutron-dynamic-routing scenario tests create a subnetpool and then a subnet "per test" but always use the same cidrs
16:22:34 <slaweq> so the first test passes, but in the second the subnet is not created as the cidr is already in use
16:23:13 <slaweq> haleyb posted a patch to check it but it wasn't a fix for this issue
16:23:24 <slaweq> so I will continue this work also
16:23:48 <slaweq> #action slaweq to fix neutron-dynamic-routing scenario tests bug
16:24:00 <slaweq> ok, that's all from my list
16:24:07 <slaweq> do You have anything to add?
16:25:33 <slaweq> ok, so let's move on to the next topic
16:25:46 <slaweq> #topic Grafana
16:25:51 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:27:22 <slaweq> Many tests were at a high failure rate yesterday and today, but it might be related to some issue with the devstack legacy jobs, see: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/latest.log.html#t2018-05-08T07:10:02
16:30:12 <slaweq> except for that, I think it's quite "normal" for most tests
16:30:38 <slaweq> do You see anything worrying?
16:32:49 <slaweq> ok, I guess we can move to the next topics
16:32:54 <mlavalle> has the devstack issue been solved?
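(An aside on the neutron-dynamic-routing CIDR problem described above, where every test requests the same cidr and the second subnet creation fails: one possible way to avoid the overlap, shown purely as an illustration and not the actual fix slaweq is working on, is to hand each test a distinct prefix carved from a base network. The base range and prefix sizes below are made-up values.)

```python
# Throwaway sketch: give each caller a fresh, non-overlapping CIDR instead of
# reusing the same one, so no two tests ask for an already-used subnet range.
import ipaddress

_BASE_NET = ipaddress.ip_network("10.10.0.0/16")   # assumed base range
_PREFIXES = _BASE_NET.subnets(new_prefix=28)       # generator of distinct /28s


def next_test_cidr():
    """Return a CIDR that no previous caller in this run has received."""
    return str(next(_PREFIXES))


if __name__ == "__main__":
    print(next_test_cidr())  # 10.10.0.0/28
    print(next_test_cidr())  # 10.10.0.16/28 - does not overlap the first one
```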
16:33:14 <slaweq> yes, frickler told me today it should have been solved last night
16:33:21 <mlavalle> ok
16:33:34 <slaweq> so a recheck of Your patch should work fine :)
16:33:49 <slaweq> moving on
16:33:50 <slaweq> #topic Scenarios
16:34:36 <slaweq> I see that the dvr-multinode job is still at about 20 to 40%
16:34:57 <slaweq> even after marking the trunk tests and dvr migration tests as unstable
16:35:09 <slaweq> let's check some example failures then :)
16:35:13 <haleyb> :(
16:36:12 <slaweq> http://logs.openstack.org/24/558724/14/check/neutron-tempest-plugin-dvr-multinode-scenario/37dde27/logs/testr_results.html.gz
16:36:55 <haleyb> slaweq: Kernel panic - not syncing:
16:37:04 <haleyb> hah, that's not neutron!
16:37:09 <slaweq> haleyb: :)
16:37:25 <slaweq> I just copied the failure result without checking the reason
16:37:34 <slaweq> I'm still looking for some other examples
16:38:02 <slaweq> http://logs.openstack.org/27/534227/9/check/neutron-tempest-plugin-dvr-multinode-scenario/cff3380/logs/testr_results.html.gz
16:38:11 <slaweq> but this should already be marked as unstable, right haleyb?
16:38:45 <haleyb> this one or the last one?
16:39:03 <slaweq> the last one
16:39:07 <slaweq> sorry
16:39:22 <slaweq> in the first one, if it was a kernel panic then it's "fine" for us :)
16:39:29 <mlavalle> LOL
16:39:51 <mlavalle> not our problem
16:40:11 <haleyb> i'm not sure the migration tests are marked unstable
16:40:12 <slaweq> mlavalle: I think we still have enough of our own problems ;)
16:40:18 <mlavalle> we do
16:40:32 <haleyb> both those failed metadata
16:40:35 <slaweq> haleyb: I thought You did it a few weeks ago
16:41:12 <haleyb> https://review.openstack.org/#/c/561322/
16:41:13 <patchbot> patch 561322 - neutron-tempest-plugin - Mark DVR/HA migration tests unstable (MERGED)
16:41:18 <haleyb> looking
16:42:31 <haleyb> yes, it should have covered them
16:42:51 <slaweq> so maybe it was some old patch then
16:43:08 <slaweq> but we already know this issue, so I think we don't need to talk about it now
16:43:19 <slaweq> thx for the check haleyb :)
16:43:33 <slaweq> in the meantime I found one more failed run:
16:43:33 <slaweq> http://logs.openstack.org/78/566178/5/check/neutron-tempest-plugin-dvr-multinode-scenario/92ef438/job-output.txt.gz#_2018-05-08_10_08_04_048287
16:43:43 <slaweq> and here the global job timeout was reached
16:46:06 <slaweq> it looks like most tests took about 10 minutes
16:48:03 <slaweq> I think that if such an issue keeps repeating we will have to investigate what is slowing down those tests
16:48:07 <slaweq> do You agree? :)
16:48:55 <haleyb> yes, agreed
16:49:08 <slaweq> ok, thx haleyb :)
16:49:22 <slaweq> I didn't find other issues for this job from the last few days
16:49:48 <slaweq> regarding the other scenario jobs
16:50:03 <slaweq> I want to remind You that for some time now we have had 2 voting jobs:
16:50:09 <slaweq> neutron-tempest-plugin-scenario-linuxbridge
16:50:09 <slaweq> and
16:50:15 <slaweq> neutron-tempest-ovsfw
16:50:26 <slaweq> both are IMHO quite stable according to grafana
16:50:43 <slaweq> and I would like to ask if we can consider making them gating as well
16:50:59 * mlavalle looking at grafana
16:51:02 <slaweq> what do You think?
16:51:46 <mlavalle> yes, let's give it a try
16:52:19 <slaweq> thx mlavalle for the blessing :)
16:52:39 <slaweq> should I send the patch or do You want to do it?
16:54:14 <mlavalle> please send it
16:54:20 <slaweq> ok, I will
16:54:41 <slaweq> #action slaweq to make 2 scenario jobs gating
16:55:14 <slaweq> I don't have anything else regarding any of the job types for today
16:55:27 <slaweq> do You want to talk about anything else?
16:55:41 <slaweq> if not, we can finish a few minutes before time :)
16:55:46 <mlavalle> I don't have anything else
16:56:52 <slaweq> ok, so enjoy Your free time then ;)
16:56:55 <slaweq> #endmeeting
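(For reference on the "make 2 scenario jobs gating" action item: in Zuul v3 this usually comes down to listing the jobs in the gate pipeline of the project's zuul configuration, roughly as sketched below. This is an illustrative stanza only, not the actual patch that was sent.)

```yaml
# Illustrative zuul project stanza - not the actual neutron change.
- project:
    check:
      jobs:
        - neutron-tempest-plugin-scenario-linuxbridge
        - neutron-tempest-ovsfw
    gate:
      jobs:
        - neutron-tempest-plugin-scenario-linuxbridge
        - neutron-tempest-ovsfw
```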