16:00:18 #startmeeting neutron_ci
16:00:19 Meeting started Tue Aug 21 16:00:18 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:20 hi
16:00:22 The meeting name has been set to 'neutron_ci'
16:00:22 o/
16:00:56 * mlavalle has to leave in 40 minutes for an appointment
16:01:12 ok, mlavalle so we will try to do it fast :)
16:01:16 let's start then
16:01:29 #topic Actions from previous meetings
16:01:40 njohnston to tweak stable branches dashboards
16:02:06 I don't know if he did something about that but I don't think so
16:03:00 I pinged njohnston on the neutron channel, maybe he will join
16:03:16 ok
16:03:37 o/
16:03:40 hi njohnston
16:03:57 we are talking about actions from the previous meeting
16:03:59 he always shows up
16:04:00 sorry I'm late, was engrossed in code :-)
16:04:02 and we have:
16:04:04 njohnston to tweak stable branches dashboards
16:04:21 no problem :)
16:04:27 Gah, I forgot all about that. My apologies. I'll do it today.
16:04:47 sure, fine
16:04:53 I will add it for next week then
16:04:58 #action njohnston to tweak stable branches dashboards
16:05:08 next action was:
16:05:10 slaweq add data about slow API to etherpad for PTG
16:05:35 I added it to the etherpad but I still need to add some data about which tests and which API calls are slowest
16:06:00 there are also patches in progress to improve that, so I will also point to them there
16:06:15 that's good, thanks
16:06:31 ok, next one was:
16:06:33 slaweq will add number of failures to grafana
16:06:42 I added such graphs to grafana
16:06:53 it shows a summary of jobs from the last 24h
16:07:18 let's use it for a while and tweak it if that turns out to be necessary
16:07:30 yeah, we need to give it some time
16:08:05 exactly
16:08:17 I also did a small reorganization of the graphs there
16:08:35 and moved integration tests to one graph and scenario jobs to a separate graph
16:09:06 I wanted to make the same graphs for the "check" and "gate" queues, and I think it is like that now
16:09:13 yes, good that you are tweaking it
16:09:42 maybe it would even be better to move the gate queue to a separate dashboard, because there are a lot of graphs on it now
16:09:55 but let's see how it works like that for a few weeks
16:10:07 it definitely looks nicer +1
16:10:17 I have always found the dashboard and some panels too loaded with data
16:10:23 I'm not that smart
16:10:37 :)
16:10:40 so I like that you are simplifying it
16:11:00 I'm trying, but I'm not the best person for tweaking UX :P
16:11:35 well, as long as we get something we all understand easily, don't worry about UX orthodoxies
16:11:51 ok, thx :)
16:11:54 I will remember that
16:12:03 ok, the last action from last week was:
16:12:05 slaweq to report new bug about fullstack test_securitygroup(linuxbridge-iptables) issue and investigate this
16:12:14 Bug reported already: https://bugs.launchpad.net/neutron/+bug/1779328 I just updated it
16:12:14 Launchpad bug 1779328 in neutron "Fullstack tests neutron.tests.fullstack.test_securitygroup.TestSecurityGroupsSameNetwork fails" [High,Confirmed]
16:12:32 I also did a small patch to enable debug_iptables_rules in the L2 agent config
16:12:40 it was merged today or yesterday
16:13:09 so if it fails again, I hope that this debug option will help me figure out what is wrong there
16:13:10 so are you planning to work on it?
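For reference, the option mentioned above is a boolean in the agent configuration; a minimal sketch, assuming it goes under the [AGENT] section of the L2 agent config file (the exact file, e.g. openvswitch_agent.ini or linuxbridge_agent.ini, depends on the deployment and on the patch itself):

    [AGENT]
    # enable extra verification/logging of applied iptables rules, to help
    # debug failures like the fullstack security group race above
    # (comment is a paraphrase, not the upstream help text)
    debug_iptables_rules = True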
16:13:22 yes, I have an eye on that for now
16:13:36 ok, I'll assign you to it and mark it as in progress
16:13:43 ok, thx
16:13:48 I forgot about that
16:14:07 it's always the iptables driver that fails, never the openvswitch firewall driver
16:14:34 I suppose that it's some race condition again, but I have no idea what exactly
16:15:15 ok, let's talk about grafana now
16:15:17 #topic Grafana
16:15:27 yaay, the new and improved
16:15:33 As I said, I reorganized it a bit recently
16:15:42 and I wanted to ask about one thing also
16:16:03 I found that the available metrics also include job results like TIMED_OUT and POST_FAILURE
16:16:17 should we include them in our graphs?
16:16:36 for now we have it calculated as (SUCCESS / (FAILURE + SUCCESS))
16:16:54 well the danger there is that we overload the panels with data
16:16:55 so in the graphs we don't see jobs which ended in TIMED_OUT or POST_FAILURE
16:17:28 Can we use wildcards in the selection of metric names? If so we could use a mid-string wildcard to count all of the TIMED_OUT and POST_FAILURE results for all the jobs, and just count the number of them occurring as a sort of "healthcheck on zuul" graph
16:17:30 now, timed out and post failures are infra issues, aren't they?
16:17:46 but I was thinking about something like ((FAILURE + TIMED_OUT + POST_FAILURE) / (FAILURE + TIMED_OUT + POST_FAILURE + SUCCESS))
16:18:03 mhhhh
16:18:13 failures are our problem
16:18:25 a post failure is usually an infra issue
16:18:31 TIMED_OUT + POST_FAILURE are infra's, or am I wrong?
16:18:36 but a timeout is mostly our issue (slow API for example)
16:18:50 ok, I buy that
16:19:01 and timeouts are what we hit more often
16:19:04 we should still have the tests get killed by the internal timer and register as a normal FAILURE instead of a TIMED_OUT
16:19:33 so if we can organize this in such a way that we can easily discriminate between what we need to worry about and what we need to communicate to infra, I am all for it
16:19:35 njohnston: yes, but if a job takes too long it reaches some kind of "global" timeout and the job is killed
16:20:21 so updating to something like ((FAILURE + TIMED_OUT) / (FAILURE + TIMED_OUT + SUCCESS)) [in %]
16:20:25 right?
16:21:09 ok
16:21:11 that formula seems right to indicate "what we need to worry about"
16:21:14 we will then have the percentage of failures and timeouts for all our jobs, excluding post_failure (so the percentage of our own issues)
16:21:39 ok, so I will update it in grafana
16:21:55 #action slaweq to update grafana dashboard to ((FAILURE + TIMED_OUT) / (FAILURE + TIMED_OUT + SUCCESS))
16:21:59 njohnston: you agree with that formula?
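Spelled out, the agreed formula divides our failures plus timeouts by everything except POST_FAILURE, so infra-side breakage does not show up as a neutron regression. As a Graphite target in the dashboard it could look roughly like the sketch below (wrapped for readability); the metric path is hypothetical and would need to match the statsd namespace Zuul actually reports under:

    asPercent(
        sumSeries(stats_counts.zuul.pipeline.check.job.neutron-functional.{FAILURE,TIMED_OUT}),
        sumSeries(stats_counts.zuul.pipeline.check.job.neutron-functional.{FAILURE,TIMED_OUT,SUCCESS}))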
16:23:36 well, let's move on
16:23:38 I think we lost njohnston :/
16:23:51 ok
16:23:56 yes that is good
16:24:13 ++
16:24:20 speaking about grafana, there are no new "big" issues IMO
16:24:31 so let's talk about some specific jobs now
16:24:34 #topic functional
16:24:49 I want to start with functional because there is one urgent issue with it
16:25:02 in stable/queens it looks like it has been failing 100% of the time for a few days
16:25:03 ok
16:25:13 bug reported https://bugs.launchpad.net/neutron/+bug/1788185
16:25:13 Launchpad bug 1788185 in neutron "[Stable/Queens] Functional tests neutron.tests.functional.agent.l3.test_ha_router failing 100% times " [Critical,Confirmed]
16:25:28 example of failure: http://logs.openstack.org/78/593078/1/check/neutron-functional/28fe681/logs/testr_results.html.gz
16:27:59 the last patch merged to the queens branch was https://review.openstack.org/#/c/584276/
16:28:15 which IMO is a potential culprit
16:28:21 but it has to be checked
16:29:34 is nobody working on it?
16:29:39 the bug I mean
16:29:45 it has no assignee
16:30:15 I reported it today, a few hours ago
16:30:29 and I haven't had time to work on it yet
16:30:42 do you have bandwidth? I can help if you don't
16:31:14 it would be great if you could check that
16:31:25 ok, I'll take a stab at it
16:31:31 thx
16:31:39 just assigned it to myself
16:31:57 I'll yell for help if I get myself in trouble
16:33:22 sure
16:33:49 ok, let's quickly move on to other issues
16:34:19 I have recently hit failures a few times in neutron.tests.functional.agent.l3.test_metadata_proxy.UnprivilegedUserGroupMetadataL3AgentTestCase
16:34:36 but this should be "fixed" by patch https://review.openstack.org/#/c/586341/4 :)
16:34:43 so please review it
16:35:05 added it to my pile
16:35:16 thx
16:35:20 will look at it when I come back after my appointment
16:35:28 I've been seeing issues with neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip but I haven't been able to track it to a source
16:36:03 It's happened about 40 times in the last week; I added an elastic-recheck change to look for it https://review.openstack.org/593722
16:36:08 njohnston: I know, it's even reported at https://bugs.launchpad.net/bugs/1788006 :)
16:36:08 Launchpad bug 1788006 in neutron "neutron_tempest_plugin DNS integration tests fail with "Server [UUID] failed to reach ACTIVE status and task state "None" within the required time ([INTEGER] s). Current status: BUILD. Current task state: spawning."" [Undecided,New]
16:36:47 yeah, I dropped that bug when I exceeded my timebox on tracking down the source of the issue
16:36:48 mhhh the description sounds like a nova issue
16:37:03 not saying it is not our issue
16:37:10 but it seems odd
16:37:24 yep, it looks so at first glance
16:37:34 but we should check it
16:37:48 does anyone have bandwidth for it?
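Regarding the elastic-recheck change mentioned above: elastic-recheck signatures are small YAML files, one per bug, containing a Lucene-style query run against the indexed CI logs. A hypothetical sketch matching the error message in the bug title (the actual query in review 593722 may differ):

    # queries/1788006.yaml -- hypothetical sketch, not the content of the review
    query: >-
      message:"failed to reach ACTIVE status and task state"
      AND tags:"console"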
16:38:02 I don't have much, but I'll try to take a look
16:38:03 I could ping the nova-neutron liaison
16:38:28 I believe that is sean-k-mooney
16:38:38 I'll take a quick look before pinging sean-k-mooney
16:38:46 thanks much
16:38:51 thx mlavalle
16:38:55 and thx njohnston :)
16:39:13 ok, I have to leave guys
16:39:15 #action mlavalle to check neutron_tempest_plugin.scenario.test_dns_integration.DNSIntegrationTests.test_server_with_fip issue
16:39:20 ok, thx mlavalle
16:39:24 o/
16:39:34 that's basically all the important things I had for today
16:40:13 so njohnston, if you don't have anything to talk about, I think we can finish earlier today
16:40:16 :)
16:40:50 nothing from me, it just seems like we have a lot of timeouts of late
16:41:40 anyhow, have a good evening slaweq, and thanks as always!
16:41:51 ok, thx njohnston
16:41:57 and have a nice day :)
16:42:04 #endmeeting