15:00:57 <slaweq> #startmeeting neutron_ci
15:00:58 <openstack> Meeting started Wed Feb 19 15:00:57 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:59 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:01 <openstack> The meeting name has been set to 'neutron_ci'
15:01:18 <ralonsoh> hi
15:01:34 <njohnston> o/
15:02:24 <slaweq> ping bcafarel for CI meeting :)
15:02:35 <bcafarel> o/
15:02:44 <slaweq> ok, so I think we can start
15:02:45 <bcafarel> one day I will get used to this new chan :)
15:02:47 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:02:51 <slaweq> bcafarel: LOL
15:02:58 <slaweq> then we will change it again :P
15:03:14 <slaweq> #topic Actions from previous meetings
15:03:31 <slaweq> first one: ralonsoh to check issues with unauthorized ping and ncat commands in functional tests
15:03:44 <ralonsoh> a bit messy this week
15:03:54 <ralonsoh> I pushed a patch and then reverted it
15:04:08 <ralonsoh> I have also pushed a patch to mark the ncat tests unstable
15:04:40 <ralonsoh> I still don't know why rootwrap filters are sometimes valid and in other tests don't
15:04:49 <ralonsoh> *aren't
15:05:34 <ralonsoh> (that's all)
15:05:54 <slaweq> but it's working for some tests and not working for other tests in the same job?
15:05:59 <slaweq> or in different jobs?
15:06:09 <ralonsoh> same job
15:06:21 <ralonsoh> and this is the worst scenario
15:06:28 <ralonsoh> there is no reason for this
15:08:24 <slaweq> when neutron-functional was a legacy zuulv2 job, it ran tests with "sudo"
15:08:33 <slaweq> sudo -H -u stack tox -e dsvm-functional
15:09:04 <ralonsoh> but ncat is using the rootwrap command
15:09:09 <ralonsoh> and the filter list
15:09:21 <ralonsoh> (that's OK)
15:09:28 <slaweq> ok, maybe it's some oslo.rootwrap bug then?
15:10:05 <slaweq> maybe we should ask someone from the oslo team to take a look at such a failed run?
15:10:09 <slaweq> what do You think?
15:10:12 <ralonsoh> I think so. mjozefcz detected that, for a short time, the filters were not present
15:10:49 <ralonsoh> I'll ping oslo folks
15:10:55 <slaweq> ralonsoh: thx
15:11:09 <slaweq> #action ralonsoh to talk with oslo people about our functional tests rootwrap issue
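
[Aside, for context on the "filter list" discussed above: commands the functional tests run as root go through oslo.rootwrap, which only executes a command when it matches an entry in neutron's filter files (the rootwrap.d/*.filters files referenced by rootwrap.conf). A minimal sketch of what such entries look like; the concrete file names, paths and entries in neutron's tree may differ:

    [Filters]
    # CommandFilter: allow the named binary to be executed as the given user
    ncat: CommandFilter, ncat, root
    ping: CommandFilter, ping, root

The intermittent "unauthorized command" failures would be consistent with these definitions sometimes not being loaded at the moment a test runs, rather than with the entries themselves being wrong.]
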
15:12:07 <slaweq> ok, next one
15:12:09 <slaweq> slaweq to report issue with ssh timeout on dvr jobs and check logs there
15:12:20 <slaweq> I reported the bug here: https://bugs.launchpad.net/neutron/+bug/1863858
15:12:22 <openstack> Launchpad bug 1863858 in neutron "socket.timeout error in dvr CI jobs cause SSH issues" [Critical,Confirmed]
15:12:31 <slaweq> and I looked at it a bit today
15:12:43 <slaweq> From what I saw, it seems that it's failing simply due to an ssh timeout. So I guess there is some problem with the FIP configuration, but I don't know what exactly.
15:13:05 <slaweq> as this started happening only a few weeks ago, I checked what we merged recently
15:13:19 <slaweq> and I found https://review.opendev.org/#/c/606385/ which seems suspicious to me
15:13:51 <slaweq> but so far I don't have any strong evidence that this is the culprit of the issue
15:14:47 <slaweq> I proposed a revert of this patch https://review.opendev.org/#/c/708624/ just to recheck it a couple of times and see if it will be better
15:15:01 <slaweq> currently the neutron-tempest-dvr job is failing pretty often due to this bug
15:15:37 <slaweq> if we don't find anything to fix this issue in the next few days, maybe we should temporarily switch this job to be non-voting
15:15:40 <slaweq> what do You think?
15:15:57 <ralonsoh> but it's failing only one job
15:16:00 <ralonsoh> one test
15:16:04 <ralonsoh> test_resize_volume_backed_server_confirm
15:16:16 <slaweq> no, I saw other tests failing too
15:16:24 <ralonsoh> ok
15:16:43 <slaweq> like here: https://19574e4665a40f62095e-6b9500683e6a67d31c1bad572acf67ba.ssl.cf1.rackcdn.com/705982/6/check/neutron-tempest-dvr/8f3fbd0/testr_results.html
15:16:44 <ralonsoh> ok, we can mark it as unstable for now (non-voting)
15:16:55 <ralonsoh> yes, I saw this too
15:16:55 <slaweq> but that's true, it only happens in the dvr job
15:17:09 <bcafarel> https://review.opendev.org/#/c/606385/ seems "old" no? we have it since stein
15:17:59 <slaweq> ahh, yes
15:18:03 <slaweq> so it's not that patch for sure
15:18:18 <slaweq> it was on my list of recent patches because it was recently merged to the stein branch
15:18:30 <slaweq> so there was a comment in the original patch in gerrit
15:18:38 <slaweq> so, this isn't the culprit for sure :/
15:18:48 <slaweq> thx bcafarel for pointing this out
15:18:56 <bcafarel> np :)
15:19:33 <slaweq> I will try to reproduce this issue locally, maybe this week
15:19:49 <slaweq> #action slaweq to try to reproduce and debug neutron-tempest-dvr ssh issue
15:20:26 <slaweq> ok, so that's all from my side about this issue
15:20:59 <slaweq> next one
15:21:01 <slaweq> ralonsoh to check periodic neutron-ovn-tempest-ovs-master-fedora job's failures
15:21:07 <maciejjozefczyk> slaweq, thanks
15:21:22 <ralonsoh> slaweq, I didn't have time for this one
15:21:24 <ralonsoh> sorry
15:21:34 <ralonsoh> (it's still on my todo list)
15:21:37 <slaweq> ralonsoh: no problem :)
15:21:45 <slaweq> #action ralonsoh to check periodic neutron-ovn-tempest-ovs-master-fedora job's failures
15:21:53 <slaweq> maciejjozefczyk: for what? :)
15:22:52 <slaweq> ok, let's move on
15:22:55 <slaweq> next topic
15:22:58 <slaweq> #topic Stadium projects
15:23:03 <slaweq> standardize on zuul v3
15:23:15 <slaweq> there was only slow progress this week
15:23:29 <slaweq> we almost merged the patch for bagpipe, but some new bug blocked it in the gate
15:24:23 <slaweq> anything else regarding stadium projects' ci?
15:25:01 <njohnston> nope
15:25:06 <njohnston> like you said, slow progress
15:25:17 <slaweq> ok, so let's move on
15:25:20 <slaweq> #topic Grafana
15:25:26 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:27:00 <slaweq> I think we still have the same problems there:
15:27:06 <slaweq> 1. functional tests
15:27:12 <slaweq> 2. neutron-tempest-dvr
15:27:16 <slaweq> 3. grenade jobs
15:27:35 <slaweq> other than that, I think it looks good
15:27:46 <njohnston> agreed
15:27:53 <slaweq> e.g. neutron-tempest-plugin jobs (the voting ones) are at very low failure rates
15:28:11 <slaweq> tempest jobs are fine too
15:28:26 <slaweq> even fullstack jobs are pretty stable now
15:30:17 <slaweq> and that conclusion was confirmed when I was looking at some specific failures today
15:30:33 <slaweq> most of the time, it's those repeat problems
15:30:52 <slaweq> ok, but do You have anything else regarding grafana to add?
15:31:26 <ralonsoh> no
15:31:33 <njohnston> no
15:31:48 <bcafarel> all good
15:32:11 <slaweq> ok, so let's continue then
15:32:32 <slaweq> #topic fullstack/functional
15:32:56 <slaweq> as I said, I only found those issues with the ncat and ping commands: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e30/705237/7/check/neutron-functional/e301c8d/testr_results.html
15:33:05 <slaweq> and https://2a0154cb9a3e47bde3ed-4a9629bf7847ad9c8b03c9755148c549.ssl.cf1.rackcdn.com/705660/4/check/neutron-functional/2e5030b/testr_results.html
15:33:25 <slaweq> ralonsoh: but shouldn't this issue with the ping command be solved by https://review.opendev.org/#/c/707452/ ?
15:34:07 <ralonsoh> IMO, this could be a similar problem to the ncat rootwrap one
15:34:24 <slaweq> ahh, ok
15:34:26 <ralonsoh> but of course, this patch modifies the ping commands and filters to match
15:34:43 <ralonsoh> (regardless of a possible problem in oslo rootwrap)
15:34:45 <slaweq> so if we hopefully resolve the rootwrap issue with the ncat command, this one should be fine too
15:34:53 <ralonsoh> probably
15:35:02 <ralonsoh> but the patch, IMO, is legit
15:35:59 <slaweq> ralonsoh: do You have a list of tests which are using the ping/ncat commands and are failing due to that?
15:36:16 <slaweq> maybe we could mark them as unstable for now if it's not many tests?
15:36:19 <ralonsoh> no, I don't have this list
15:36:32 <ralonsoh> what I did was find any "ping" usage in the code
15:36:51 <ralonsoh> and I defined a uniform way of calling this command
15:38:03 <slaweq> ok, let's wait for help from the oslo folks with that for now - maybe they will find the root cause of the problem
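
[Aside, on the "mark them as unstable" idea mentioned above: neutron's test base is assumed here to provide an unstable_test decorator that turns a failure into a skip while a known bug is investigated. A minimal, hypothetical sketch (the class name, test name and reason string are made up for illustration):

    # Hypothetical example: skip a known-flaky test instead of failing the job,
    # assuming neutron.tests.base provides unstable_test() and BaseTestCase.
    from neutron.tests import base


    class NcatConnectivityTestCase(base.BaseTestCase):

        @base.unstable_test("rootwrap/ncat issue under investigation")
        def test_ncat_connectivity(self):
            ...  # the real check would spawn ncat through rootwrap here

Under that assumption, the skip message records the reason, so the flakiness stays visible in the logs without blocking the gate.]
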
15:38:41 <slaweq> anything else related to functional/fullstack tests?
15:38:53 <ralonsoh> yes
15:39:02 <ralonsoh> https://bugs.launchpad.net/neutron/+bug/1828205/comments/4
15:39:03 <openstack> Launchpad bug 1828205 in neutron ""network-segment-ranges" doesn't return the project_id" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:39:25 <ralonsoh> (sorry, this is a tempest one)
15:39:32 <slaweq> ralonsoh: no problem
15:39:51 <slaweq> if we don't have anything else related to functional/fullstack, we can move to the tempest topic now :)
15:40:02 <slaweq> #topic Tempest/Scenario
15:40:06 <slaweq> and please continue
15:40:09 <slaweq> :)
15:40:10 <ralonsoh> thanks
15:40:16 <ralonsoh> patch: https://review.opendev.org/#/c/707898/
15:40:26 <ralonsoh> please, take a look at c#4
15:40:36 <ralonsoh> not now, the comment is a bit long
15:40:40 <ralonsoh> (that's all)
15:41:26 <slaweq> ok, I will read it later today
15:42:21 <slaweq> and this patch of Yours, https://review.opendev.org/#/c/707898/, should hopefully resolve this mystery of the missing project_id field, right?
15:42:29 <ralonsoh> yes
15:42:59 <slaweq> great :)
15:43:03 <slaweq> at least one down :)
15:43:06 <slaweq> thx ralonsoh
15:43:07 <ralonsoh> this is due to the unfinished tenant_id->project_id migration
15:43:11 <ralonsoh> yw
15:43:42 <bcafarel> heh, it had been some time since the last issue about that migration
15:44:57 <slaweq> on other things, I pushed a patch to tempest to increase the timeout for the tempest-ipv6-only job: https://review.opendev.org/708635
15:45:14 <slaweq> I hope that the QA team will accept it :)
15:45:27 <slaweq> if not, I will reopen the patch in the neutron repo
15:45:58 <slaweq> and the last thing:
15:46:08 <slaweq> I found one new, "interesting" failure: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d53/708009/4/check/tempest-ipv6-only/d53c891/testr_results.html
15:46:21 <slaweq> did You see such issues before?
15:46:49 <ralonsoh> Details: {'type': 'RouterInUse', 'message': 'Router 86ed5e27-7dc9-4974-ac7c-4e2e1f0558f0 still has ports', 'detail': ''}
15:46:56 <ralonsoh> yes, a couple of times
15:47:04 <ralonsoh> the resource was not cleaned properly
15:47:06 <slaweq> I don't think it's related to the patch on which it was run
15:47:18 <ralonsoh> I don't think so
15:47:32 <slaweq> ralonsoh: so it seems that it's a tempest cleanup issue, right?
15:47:40 <ralonsoh> I think so
15:48:14 <slaweq> ok, so I will report it as a tempest bug
15:48:28 <slaweq> #action slaweq to report tempest bug with routers cleanup
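
[Aside, to illustrate the suspected cleanup problem behind the RouterInUse failure: tempest tests (like testtools/unittest) run addCleanup callbacks in LIFO order, so a router delete only succeeds if the port cleanups registered after it have already run; a "still has ports" error means a port was still attached when the router cleanup fired (because a cleanup was skipped, failed, or raced). A self-contained sketch of the ordering, with purely hypothetical in-memory helpers standing in for the Neutron API:

    # Hypothetical, in-memory illustration of LIFO cleanup ordering; the
    # create/attach/delete helpers below are stand-ins, not real clients.
    import unittest

    ROUTERS = {}
    ROUTER_PORTS = {}


    def create_router(name):
        ROUTERS[name] = {'name': name}
        ROUTER_PORTS[name] = set()
        return name


    def delete_router(router):
        if ROUTER_PORTS[router]:
            raise RuntimeError('RouterInUse: router still has ports')
        del ROUTERS[router]
        del ROUTER_PORTS[router]


    def attach_port(router, port):
        ROUTER_PORTS[router].add(port)


    def detach_port(router, port):
        ROUTER_PORTS[router].discard(port)


    class RouterCleanupOrderTest(unittest.TestCase):
        def test_cleanups_run_lifo(self):
            router = create_router('r1')
            self.addCleanup(delete_router, router)       # registered first, runs last
            attach_port(router, 'p1')
            self.addCleanup(detach_port, router, 'p1')   # registered last, runs first


    if __name__ == '__main__':
        unittest.main()

If the detach cleanup were skipped, or registered before the router's, delete_router would raise exactly this kind of "still has ports" error during teardown.]
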
15:48:45 <slaweq> and that's all from my side for today
15:48:58 <slaweq> anything else regarding scenario jobs?
15:49:33 <bcafarel> I saw a few timeouts on tempest plugin runs in stable branches today, but it may just be an infra/slow node issue
15:49:37 <bcafarel> rechecks in progress :)
15:49:49 <bcafarel> it was mostly the designate job
15:50:05 <slaweq> bcafarel: I saw some timeouts on the designate job on the master branch too
15:50:19 <slaweq> but as it was a timeout, I didn't check it further
15:50:32 <slaweq> let's hope it's just a slow nodes issue :)
15:51:19 <bcafarel> fingers crossed
15:51:22 <slaweq> :)
15:51:37 <slaweq> bcafarel: I think You wanted to raise some other topic at this meeting too, right?
15:51:45 <slaweq> #topic Open discussion
15:51:56 <slaweq> if so, the floor is Yours :)
15:52:30 <ralonsoh> nothing from me
15:52:56 <bcafarel> There was the https://bugs.launchpad.net/neutron/+bug/1863830 bug, but liuyulong updated it after my initial run - it was a dup of the ncat issue in the end
15:52:57 <openstack> Launchpad bug 1863213 in neutron "duplicate for #1863830 Spawning of DHCP processes fail: invalid netcat options" [Undecided,New] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:53:19 <slaweq> ahh, ok
15:53:28 <bcafarel> I just wasn't up to date on recent bugs :) (holidays have this effect)
15:53:40 <slaweq> lucky You :P
15:53:45 <bcafarel> and all good apart from that, glad to see stable branches back in working order
15:53:58 <slaweq> ok, so I think we can finish a bit earlier today
15:54:02 <slaweq> thx for attending
15:54:08 <bcafarel> +1 :)
15:54:12 <bcafarel> o/
15:54:12 <slaweq> #endmeeting