16:01:47 #startmeeting neutron_ci
16:01:49 Meeting started Tue Mar 13 16:01:47 2018 UTC and is due to finish in 60 minutes. The chair is ihrachys. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:50 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:52 The meeting name has been set to 'neutron_ci'
16:01:56 o/
16:02:13 o/
16:02:16 I will need to drop off today in the middle of the meeting, so I would like someone to take over the chair from there
16:02:28 volunteers welcome
16:02:37 #topic Actions from prev meeting
16:02:45 "slaweq to check why it takes too long to raise interfaces in the linuxbridge scenarios job, and to compare with dvr"
16:03:24 slaweq, around?
16:03:57 he was here 10 minutes ago
16:04:03 we finished the QoS meeting
16:04:05 I think we agreed last time it was because of multinode?
16:04:23 that dvr is multinode while linuxbridge is allinone
16:05:08 yeah, though I am not sure if we made any progress in terms of patches to bump the timeout
16:05:31 hi
16:05:33 iirc it was merged
16:05:34 sorry for being late
16:05:51 https://review.openstack.org/550832 ?
16:05:59 yep
16:06:08 ihrachys: the only thing I could do was increase the ssh timeout
16:06:21 ok great. I also noticed on grafana that the linuxbridge job is very stable now
16:06:30 so that probably actually helped
16:06:31 from the logs which I checked, it didn't look like there was any issue with neutron agents or anything like that
16:06:41 great work slaweq
16:06:50 thx
16:06:59 he got an ovation for this in the Neutron meeting
16:07:07 LOL
16:07:11 standing ovation
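For context on what the bumped ssh timeout (https://review.openstack.org/550832) governs: tempest keeps retrying the guest's SSH endpoint until a configured deadline, so slower-booting multinode guests simply need a larger value. Below is a minimal sketch of that kind of retry loop, using a plain TCP reachability check; the function name and default timeout are illustrative, not tempest's actual API:

```python
import socket
import time

def wait_for_ssh(host, port=22, timeout=180, interval=5):
    """Poll until the SSH port accepts connections or the deadline passes.

    A larger `timeout` gives slow-to-boot multinode guests more time,
    which is the effect of bumping tempest's ssh timeout option.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return  # TCP connect succeeded; treat the guest as reachable
        except OSError:
            time.sleep(interval)
    raise TimeoutError("no SSH on %s:%d after %ss" % (host, port, timeout))
```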
16:07:19 next was "ihrachys to check grafana stats several days later when dust settles"
16:07:21 indeed :)
16:07:27 that's about fullstack / dvr instability
16:07:47 dvr scenarios are still problematic, but fullstack seems to have been more stable than functional for a week or so
16:07:57 \o/
16:08:03 looks like if we keep functional voting we should definitely have fullstack too
16:08:10 we can check the charts later
16:08:20 "jlibosva to look into agent startup failure and missing logs in: http://logs.openstack.org/83/549283/1/check/neutron-fullstack/cbad08a/logs/"
16:08:45 I tried to reproduce the l3 agent failing to start locally, but no success
16:08:59 I posted a patch to log failed processes though, I think it merged
16:09:13 https://review.openstack.org/#/c/550566/
16:09:29 maybe it was some issue in some other package and was fixed in the meantime
16:10:04 nice, we will revisit it the next time it hits
16:10:26 these are all AIs we had
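The patch mentioned above (https://review.openstack.org/#/c/550566/) adds logging for processes that fail to start. As a rough illustration of the technique, not the actual fullstack fixture code, here is a minimal subprocess-based sketch; the function name and startup wait are assumptions:

```python
import subprocess
import time

def spawn_and_check(cmd, logger, startup_wait=5):
    """Start a child process; if it dies during startup, log its output.

    Sketch of the idea behind https://review.openstack.org/#/c/550566/:
    a silently dying agent process should leave its stdout/stderr in the
    job logs instead of vanishing without a trace.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            universal_newlines=True)
    time.sleep(startup_wait)
    if proc.poll() is None:
        return proc  # still running; startup looks fine
    # The process exited before the startup deadline: dump everything.
    stdout, stderr = proc.communicate()
    logger.error("%s exited early with rc=%s\nstdout:\n%s\nstderr:\n%s",
                 cmd, proc.returncode, stdout, stderr)
    raise RuntimeError("process failed to start: %r" % (cmd,))
```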
16:10:30 #topic Grafana
16:10:37 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:11:15 as I said before, fullstack looks very nice. there were some failure bumps during the week, but they match bumps in other jobs
16:11:25 fullstack is now at 6% or smth like that
16:11:36 in contrast, functional is almost at 20%
16:12:19 and as for tempest, linuxbridge scenarios are looking very good
16:12:36 like 2-3% now
16:12:52 in contrast to dvr scenarios that are still at 35%
16:13:04 I hope it will stay like that now
16:13:20 when I was checking during the weekend with the longer ssh timeout, it was passing every time
16:13:22 and the dvr-ha job is also at the same high failure rate
16:13:50 I think I saw some trunk-related test in dvr failing often
16:14:18 another candidate with reasonable behavior is -ovsfw-, though it will need more monitoring I believe
16:14:29 we can dive into each type
16:14:32 #topic Fullstack
16:14:50 considering that the job has been quite stable for a while, should we make it vote now?
16:14:59 we are at the start of the cycle, so that's good
16:15:28 sounds good to me
16:15:31 we can try
16:15:42 would that require also including fullstack in the gate queue?
16:16:04 well, we can experiment with partial enablement like we did with functional, of course
16:16:27 not that I am saying I would suggest doing it
16:17:10 mlavalle, thoughts?
16:17:43 I'd say go for it
16:17:59 slaweq, do you want the honor of posting the patch?
16:18:26 ihrachys: I would love to :)
16:18:28 thx
16:18:40 #action slaweq to enable voting for fullstack
16:18:52 mlavalle, jlibosva: so do we go with both queues or check only?
16:19:05 let's start with check
16:19:10 ++
16:19:16 ok
16:19:17 let's do it in small steps
16:19:31 we will revisit gate in like 2 weeks then
16:19:37 that's what my question above was about; I think there is a "rule" that voting jobs should be in the gate queue too, isn't there?
16:19:48 jlibosva, there is a tradition for sure
16:19:58 but we had functional not in gate but in check for some time
16:20:17 we can post and see if infra objects
16:20:22 I don't mind enabling it in both queues
16:20:56 there is also neutron-rally-neutron which is not in gate currently
16:21:01 and is voting in the check queue
16:21:42 I will skip discussion of fullstack failures for today
16:21:52 #topic Rally
16:22:07 do we know what the background is behind rally not gating?
16:22:20 is it because it breaks from time to time due to instabilities in rally itself?
16:22:53 I remember there were multiple cases when a patch in rally broke our gates
16:22:58 so maybe that
16:23:21 I am not too excited about enabling gating for it
16:23:27 I mean, the gate queue
16:23:32 I don't know the details, but it must be that
16:23:57 ok. I would focus on other jobs for now where we have better understanding and control.
16:24:03 #topic Scenarios
16:24:08 glance recently removed rally because it wasn't working at all for them
16:24:12 (just a datapoint)
16:24:26 good to know, clarkb
16:24:57 http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=8&fullscreen
16:25:32 considering the linuxbridge scenarios job is in good shape (failure spikes are reflected in other jobs, and it's at 5% right now), do we want to consider it for voting too?
16:25:55 yes
16:26:02 let's go for it
16:26:07 early in the cycle
16:26:18 objections?
16:26:31 also only for the check queue for now?
16:26:47 * ihrachys waves at rossella_s
16:27:14 I would imagine yes, check queue for now
16:27:16 ihrachys, hi!
16:27:29 ok, seems like no objections
16:27:34 hola rossella_s
16:27:48 #action slaweq to enable voting for linuxbridge scenarios
16:27:59 ok, sure
16:28:11 mlavalle, hola :)
16:28:18 hi rossella_s
16:28:27 slaweq, can you take over the meeting from here? I gotta bail out.
16:28:45 sure
16:28:47 #chair slaweq
16:28:48 Current chairs: ihrachys slaweq
16:28:55 do you have a link to the agenda somewhere?
16:29:02 there is no agenda :)
16:29:19 we can follow previous meetings
16:29:24 but basically look at dvr scenarios, that would be the main thing :)
16:29:27 ok, sure
16:29:36 ok, bye folks!
16:29:36 I will take care of it
16:29:41 bye ihrachys
16:29:54 o/
16:30:14 so, moving on with scenario jobs
16:30:29 dvr and dvr-ha are still at a high failure rate
16:31:10 about dvr: when I was testing the ssh timeout patch, I hit the same errors a few times, like in http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6186573/logs/testr_results.html.gz
16:32:00 or http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/136c7ef/logs/testr_results.html.gz
16:32:10 but every time it was a failure with trunk ports
16:32:16 is it just test_subport_connectivity or also lifecycle?
16:32:30 ah, the second link answers my question :) sorry
16:32:33 yeah, it's trunk ports
16:32:34 jlibosva: in the second link it was lifecycle
16:32:47 any ideas about that?
16:33:07 on a third one there were even both of them: http://logs.openstack.org/32/550832/2/check/neutron-tempest-plugin-dvr-multinode-scenario/6d09fd8/logs/testr_results.html.gz
16:33:52 I just found another example of the same tests failing
16:34:06 do you have time/resources to check that this week?
16:34:49 I don't have any dvr environment and don't have experience with dvr or trunk, so it might be hard for me
16:36:43 I know dvr but not trunk; do you just need a multi-node environment?
16:36:55 I know trunk and a bit of dvr :)
16:37:03 so we have a winner :)
16:37:30 I can have a look, and if I have issues with dvr, I'll poke haleyb
16:37:45 thx jlibosva
16:38:14 #action jlibosva to take a look at the dvr trunk tests issue
16:39:04 I'm trying to find some failed neutron-tempest-dvr-ha-multinode-full test now, as this one also fails around 50% of the time
16:40:17 I found one: http://logs.openstack.org/14/529814/5/check/neutron-tempest-dvr-ha-multinode-full/d8cfbdf/logs/testr_results.html.gz
16:40:30 but this doesn't look related to neutron
16:40:52 do you maybe have any other example?
16:41:24 that might be an old bug? https://bugs.launchpad.net/neutron/+bug/1717302
16:41:25 Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,Confirmed] - Assigned to Brian Haley (brian-haley)
16:44:06 haleyb: will you continue working on it?
16:44:19 do we have a logstash query for the above?
16:44:38 I don't know
16:45:20 no, we don't have one
16:45:56 slaweq: I have not reproduced it locally yet
16:46:50 maybe this could be used? http://bit.ly/2Fz6FPs
16:47:32 it looks to me like the issue is gone
16:48:01 jlibosva: are you sure that logs from this job are properly indexed in logstash?
16:48:08 no hits over the past 7 days
16:48:12 AFAIR there was some issue with fullstack, for example
16:48:36 it's a normal tempest job, and afaik all service logs are indexed
16:48:44 ok, it should be then
16:49:01 or is the error coming from neutron-keepalived-state-change?
16:49:06 ok, I will try to find more examples of failures for this job
16:49:20 maybe that one is not indexed, I don't know :)
16:49:37 and if I find something that happens often, I will just report bugs
16:49:41 what do you think?
16:49:53 sounds good
16:50:21 #action slaweq check reasons for failures of neutron-tempest-dvr-ha-multinode-full
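Since no logstash query exists for bug 1717302 yet, here is a sketch of how a failure signature can be counted against an Elasticsearch-backed logstash service. Both the endpoint URL and the query string are placeholders (the real query behind the shortened link above is not reproduced here), and the requests library is assumed to be installed:

```python
import requests  # third-party; pip install requests

# Placeholder endpoint: the real logstash.openstack.org API path may differ.
ES_SEARCH = "http://logstash.example.org/elasticsearch/_search"

# Illustrative signature only, not the actual query from the discussion.
QUERY = ('build_name:"neutron-tempest-dvr-ha-multinode-full" '
         'AND message:"Timed out waiting for"')

def count_hits(query, days=7):
    """Count indexed log lines matching a Lucene query string."""
    body = {
        "query": {
            "bool": {
                "must": {"query_string": {"query": query}},
                "filter": {"range": {"@timestamp": {"gte": "now-%dd" % days}}},
            }
        },
        "size": 0,  # we only need the hit count, not the documents
    }
    resp = requests.post(ES_SEARCH, json=body, timeout=30)
    resp.raise_for_status()
    return resp.json()["hits"]["total"]

print(count_hits(QUERY))
```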
16:51:09 so, other tempest jobs are below 10% failure rate, so it's fine IMO
16:52:02 periodic jobs look fine as well
16:52:31 do you have anything to add?
16:53:00 I don't
16:53:22 #topic Open discussion
16:53:50 I just want to say that I found today that our scenario jobs are still using the neutron-legacy devstack lib
16:54:00 so I want to switch them to lib/neutron
16:54:25 are you fine with that? or is there any reason why it's like that and I shouldn't change it?
16:55:32 not that I'm aware of
16:56:08 ok, so I will send a patch and check how it works then
16:56:42 #action slaweq switch scenario jobs to lib/neutron
16:57:25 ok, so if you don't have anything else, I think we are done for today
16:58:14 * jlibosva nods
16:58:27 I surely wasn't as good a chair as ihrachys, but I hope it went okay :)
16:58:36 you did really great :)
16:58:46 thank you
16:58:54 bye
16:58:57 #endmeeting