16:00:18 <slaweq> #startmeeting neutron_ci
16:00:19 <openstack> Meeting started Tue Aug 21 16:00:18 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:20 <slaweq> hi
16:00:22 <openstack> The meeting name has been set to 'neutron_ci'
16:00:22 <mlavalle> o/
16:00:56 * mlavalle has to leave in 40 minutes for an appointment
16:01:12 <slaweq> ok, mlavalle so we will try to do it fast :)
16:01:16 <slaweq> lets start then
16:01:29 <slaweq> #topic Actions from previous meetings
16:01:40 <slaweq> njohnston to tweak stable branches dashboards
16:02:06 <slaweq> I don't know if he did something about that but I don't think so
16:03:00 <slaweq> I pinged njohnston on neutron channel, maybe he will join
16:03:16 <mlavalle> ok
16:03:37 <njohnston> o/
16:03:40 <slaweq> hi njohnston
16:03:57 <slaweq> we are talking about actions from previous meeting
16:03:59 <mlavalle> he always shows up
16:04:00 <njohnston> sorry I'm late, was engrossed in code :-)
16:04:02 <slaweq> and we have:
16:04:04 <slaweq> njohnston to tweak stable branches dashboards
16:04:21 <slaweq> no problem :)
16:04:27 <njohnston> Gah, I forgot all about that. My apologies. I'll do it today.
16:04:47 <slaweq> sure, fine
16:04:53 <slaweq> I will add it for next week then
16:04:58 <slaweq> #action njohnston to tweak stable branches dashboards
16:05:08 <slaweq> next action was:
16:05:10 <slaweq> slaweq add data about slow API to etherpad for PTG
16:05:35 <slaweq> I added it to etherpad but I still need to add some data about what tests and what API calls are slowest
16:06:00 <slaweq> there are also patches to improve that in progress so I will also point to them there
16:06:15 <mlavalle> that's good, thanks
16:06:31 <slaweq> ok, next one was:
16:06:33 <slaweq> slaweq will add number of failures to grafana
16:06:42 <slaweq> I added such graphs to grafana
16:06:53 <slaweq> it shows summary of jobs from last 24h
16:07:18 <slaweq> lets use it for a while and tweak if that will be necessary
16:07:30 <mlavalle> yeah, we need to give it some time
16:08:05 <slaweq> exactly
16:08:17 <slaweq> I also did small reorganization of graphs there
16:08:35 <slaweq> and moved integration tests to one graph and scenario jobs to separate graph
16:09:06 <slaweq> I wanted to make same graphs for "check" and "gate" queue and I think that now it is like that
16:09:13 <mlavalle> yes, good that you are tweaking it
16:09:42 <slaweq> maybe would be better to move gate queue to separate dashboard even because there is a lot of graphs on it now
16:09:55 <slaweq> but lets see how it will work like that for few weeks
16:10:07 <njohnston> it definitely looks nicer +1
16:10:17 <mlavalle> I have always found the dashboard and some panels too loaded with data
16:10:23 <mlavalle> I'm not that smart
16:10:37 <slaweq> :)
16:10:40 <mlavalle> so I like that you are simplifying it
16:11:00 <slaweq> I'm trying but I'm not the best person for tweaking UX :P
16:11:35 <mlavalle> well, as long as we get something we all understand easily, don't worry about UX orthodoxies
16:11:51 <slaweq> ok, thx :)
16:11:54 <slaweq> I will remember that
16:12:03 <slaweq> ok, last action from last week was:
16:12:05 <slaweq> slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:12:14 <slaweq> Bug reported already: https://bugs.launchpad.net/neutron/+bug/1779328 I just updated it
16:12:14 <openstack> Launchpad bug 1779328 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails" [High,Confirmed]
16:12:32 <slaweq> I also did small patch to enable debug_iptables_rules in L2 agent config
16:12:40 <slaweq> it was merged today or yesterday
16:13:09 <slaweq> so if it will fail again, I hope that this debug option will help me to figure out what is wrong there
16:13:10 <mlavalle> so are you planning to work on it?
16:13:22 <slaweq> yes, I have an eye on that for now
16:13:36 <mlavalle> ok, I'll assign you to it and mark it as in progress
16:13:43 <slaweq> ok, thx
16:13:48 <slaweq> I forgot about that
16:14:07 <slaweq> it's always iptables driver which is failing, never openvswitch firewall driver
16:14:34 <slaweq> I suppose that it's some race condition again but have no idea what exactly
16:15:15 <slaweq> ok, lets talk about grafana now
16:15:17 <slaweq> #topic Grafana
16:15:27 <mlavalle> yaay, the new and improved
16:15:33 <slaweq> As I said, I reorganized it a bit recently
16:15:42 <slaweq> and I wanted to ask about one thing also
16:16:03 <slaweq> I found that in available metrics are also metrics like jobs TIMED_OUT and POST_FAILURE
16:16:17 <slaweq> should we include them in our graphs?
16:16:36 <slaweq> for now we have it calculated as (SUCCESS / (FAILURE+SUCCESS))
16:16:54 <mlavalle> well the danger there is that we overload the panels with data
16:16:55 <slaweq> so in graphs we don't see jobs which had TIMEOUT or POST_FAILURE
16:17:28 <njohnston> Can we use wildcards in the selection of metric names? If so we could use a mid-string wildcard to count all of the TIMED_OUT and POST_FAILURE for all the jobs and just count the number of them occurring as a sort of "healthcheck on zuul" graph
16:17:30 <mlavalle> now, timed out and port failures are infra issues, aren't they?
16:17:44 <mlavalle> post failures ^^^^
16:17:46 <slaweq> but I was thinking about something like ((FAILURE + TIME_OUT + POST_FAILURE) / (FAILURE + TIME_OUT + POST_FAILURE + SUCCESS))
16:18:03 <mlavalle> mhhhh
16:18:13 <mlavalle> failures are our problem
16:18:25 <slaweq> post failure is infra issue usually
16:18:31 <mlavalle> TIME_OUT + POST_FAILURE are infra's, or am I wrong?
16:18:36 <slaweq> but timeout is mostly our issue (slow api for example)
16:18:50 <mlavalle> ok, I buy that
16:19:01 <slaweq> and time_out is what we hit more often
16:19:04 <njohnston> we should still have the tests get killed by the internal timer and register as a normal FAILURE instead of a TIMED_OUT
16:19:33 <mlavalle> so if we can organize this in such a way that we can easily discriminate what we need to worry about and what we need to communicate to infra, I am all for it
16:19:35 <slaweq> njohnston: yes, but if job is taking long time it reaches some kind of "global" timeout and job is killed
16:20:21 <slaweq> so updating to something like ((FAILURE + TIME_OUT) / (FAILURE + TIME_OUT + SUCCESS)) [in %]
16:20:25 <slaweq> right?
16:21:09 <njohnston> ok
16:21:11 <mlavalle> that formula seems right to indicate "what we need to worry about"
16:21:14 <slaweq> we will then have percentage of failures and timeouts for all our jobs which isn't post_failure (so percentage of our issues)
16:21:39 <slaweq> ok, so I will update it in grafana
16:21:55 <slaweq> #action slaweq to update grafana dashboard to ((FAILURE + TIME_OUT) / (FAILURE + TIME_OUT + SUCCESS))
16:21:59 <mlavalle> njohnston: you agree with that formula?
16:23:36 <mlavalle> well, let's move on
16:23:38 <slaweq> I think we lost njohnston :/
16:23:51 <slaweq> ok
16:23:56 <njohnston> yes that is good
16:24:13 <mlavalle> ++
16:24:20 <slaweq> speaking about grafana there is no new "big" issues IMO
16:24:31 <slaweq> so let's talk about some specific jobs now
16:24:34 <slaweq> #topic functional
16:24:49 <slaweq> I want to start with functional because there is one urgent issue with it
16:25:02 <slaweq> in stable/queens it looks it's failing 100% times since few days
16:25:03 <mlavalle> ok
16:25:13 <slaweq> bug reported https://bugs.launchpad.net/neutron/+bug/1788185
16:25:13 <openstack> Launchpad bug 1788185 in neutron "[Stable/Queens] Functional tests neutron.tests.functional.agent.l3.test_ha_router failing 100% times " [Critical,Confirmed]
16:25:28 <slaweq> example of failure: http://logs.openstack.org/78/593078/1/check/neutron-functional/28fe681/logs/testr_results.html.gz
16:27:59 <slaweq> last patch merged to queens branch was https://review.openstack.org/#/c/584276/
16:28:15 <slaweq> which IMO can be potential culprit
16:28:21 <slaweq> but it has to be checked
16:29:34 <mlavalle> nobody working on it?
16:29:39 <mlavalle> the bug I mean
16:29:45 <mlavalle> it has no assignee
16:30:15 <slaweq> I reported it today, few hours ago
16:30:29 <slaweq> and I hadn't got time to work on it yet
16:30:42 <mlavalle> do you have bandwidth? I can help if you don't
16:31:14 <slaweq> would be great if You could check that
16:31:25 <mlavalle> ok, I'll take a stab at it
16:31:31 <slaweq> thx
16:31:39 <mlavalle> just assigned it to me
16:31:57 <mlavalle> I'll yell for help if I get myself in trouble
16:33:22 <slaweq> sure
16:33:49 <slaweq> ok, let's quickly move for other issues
16:34:19 <slaweq> I recently found few times issues in neutron.tests.functional.agent.l3.test_metadata_proxy.UnprivilegedUserGroupMetadataL3AgentTestCase
16:34:36 <slaweq> but this should be "fixed" by patch https://review.openstack.org/#/c/586341/4 :)
16:34:43 <slaweq> so please review it
16:35:05 <mlavalle> added it to my pile
16:35:16 <slaweq> thx
16:35:20 <mlavalle> will look at it when I come back after my appointment
16:35:28 <njohnston> I've been seeing issues with neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip but I haven't been able to track it to a source
16:36:03 <njohnston> It's happened about 40 times in the last week; I added an elastic-recheck change to look for it https://review.openstack.org/593722
16:36:08 <slaweq> njohnston: I know, it's even reported on https://bugs.launchpad.net/bugs/1788006 :)
16:36:08 <openstack> Launchpad bug 1788006 in neutron "neutron_tempest_plugin DNS integration tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Undecided,New]
16:36:47 <njohnston> yeah, I dropped that bug when I exceeded my timebox on tracking down the source of the issue
16:36:48 <mlavalle> mhhh the description sound like a nova issue
16:37:03 <mlavalle> not saying it is not our issue
16:37:10 <mlavalle> but seems odd
16:37:24 <slaweq> yep, it looks so at first glance
16:37:34 <slaweq> but we should check it
16:37:48 <slaweq> anyone has bandwidth for it?
16:38:02 <mlavalle> don't have much, but I'll try to take a look
16:38:03 <njohnston> I could ping the nova-neutron liaison
16:38:28 <njohnston> I believe that is sean-k-mooney
16:38:38 <mlavalle> I'll take a quick look before pinging sean-k-mooney
16:38:46 <njohnston> thanks much
16:38:51 <slaweq> thx mlavalle
16:38:55 <slaweq> and thx njohnston :)
16:39:13 <mlavalle> ok, I have to leave guys
16:39:15 <slaweq> #action mlavalle to check neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip issue
16:39:20 <slaweq> ok, thx mlavalle
16:39:24 <mlavalle> o/
16:39:34 <slaweq> it's basically all what I had for today from important things
16:40:13 <slaweq> so njohnston if You don't have anything to talk about I think we can finish earlier today
16:40:16 <slaweq> :)
16:40:50 <njohnston> nothing from me, it just seems like we have a lot of timeouts of late
16:41:40 <njohnston> anyhow, have a good evening slaweq, and thanks as always!
16:41:51 <slaweq> ok, thx njohnston
16:41:57 <slaweq> and have a nice day :)
16:42:04 <slaweq> #endmeeting
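
Note on the formula agreed under the Grafana topic: the ratio ((FAILURE + TIME_OUT) / (FAILURE + TIME_OUT + SUCCESS)) [in %] can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the actual dashboard definition. The function name and its parameters are hypothetical, and the counts are assumed to be per-job result totals over some time window. The log writes the timeout result both as TIME_OUT and TIMED_OUT; the sketch uses timed_out for it.

```python
def failure_rate_percent(success: int, failure: int, timed_out: int) -> float:
    """Share of job runs that ended in FAILURE or TIMED_OUT, in percent.

    POST_FAILURE runs are left out of both the numerator and the
    denominator, following the meeting's reasoning that they are
    usually infra issues rather than Neutron issues.
    """
    problems = failure + timed_out
    total = problems + success
    if total == 0:
        # No finished runs in the window: report 0 instead of dividing by zero.
        return 0.0
    return 100.0 * problems / total


if __name__ == "__main__":
    # Hypothetical counts for one job over the last 24h.
    print(failure_rate_percent(success=80, failure=15, timed_out=5))  # 20.0
```

Returning 0.0 for an empty window is a choice made for this sketch; a real graph might prefer to show a gap instead.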