15:00:57 <slaweq> #startmeeting neutron_ci
15:00:58 <openstack> Meeting started Wed Feb 19 15:00:57 2020 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:59 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:01 <openstack> The meeting name has been set to 'neutron_ci'
15:01:18 <ralonsoh> hi
15:01:34 <njohnston> o/
15:02:24 <slaweq> ping bcafarel for CI meeting :)
15:02:35 <bcafarel> o/
15:02:44 <slaweq> ok, so I think we can start
15:02:45 <bcafarel> one day I will get used to this new chan :)
15:02:47 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:02:51 <slaweq> bcafarel: LOL
15:02:58 <slaweq> then we will change it again :P
15:03:14 <slaweq> #topic Actions from previous meetings
15:03:31 <slaweq> first one: ralonsoh to check issues with unauthorized ping and ncat commands in functional tests
15:03:44 <ralonsoh> a bit messy this week
15:03:54 <ralonsoh> I pushed a patch and then reverted it
15:04:08 <ralonsoh> I have also pushed a patch to mark ncat tests unstable
15:04:40 <ralonsoh> I still don't know why the rootwrap filters sometimes are valid and in other tests aren't
15:05:34 <ralonsoh> (that's all)
15:05:54 <slaweq> but it's working for some tests and not working for other tests in the same job?
15:05:59 <slaweq> or in different jobs?
15:06:09 <ralonsoh> same job
15:06:21 <ralonsoh> and this is the worst scenario
15:06:28 <ralonsoh> there is no reason for this
15:08:24 <slaweq> when neutron-functional was a legacy zuulv2 job, it ran tests with "sudo"
15:08:33 <slaweq> sudo -H -u stack tox -e dsvm-functional
15:09:04 <ralonsoh> but ncat is using the rootwrap command
15:09:09 <ralonsoh> and the filter list
15:09:21 <ralonsoh> (that's OK)
15:09:28 <slaweq> ok, maybe it's some oslo.rootwrap bug then?
15:10:05 <slaweq> maybe we should ask someone from the oslo team to take a look at such a failed run?
15:10:09 <slaweq> what do You think?
15:10:12 <ralonsoh> I think so. mjozefcz detected that, for a short time, the filters were not present
15:10:49 <ralonsoh> I'll ping oslo folks
15:10:55 <slaweq> ralonsoh: thx
15:11:09 <slaweq> #action ralonsoh to talk with oslo people about our functional tests rootwrap issue
15:12:07 <slaweq> ok, next one
15:12:09 <slaweq> slaweq to report issue with ssh timeout on dvr jobs and check logs there
15:12:20 <slaweq> I reported bug here: https://bugs.launchpad.net/neutron/+bug/1863858
15:12:22 <openstack> Launchpad bug 1863858 in neutron "socket.timeout error in dvr CI jobs cause SSH issues" [Critical,Confirmed]
15:12:31 <slaweq> and I looked at it a bit today
15:12:43 <slaweq> From what I saw, it seems that it's failing simply due to an ssh timeout. So I guess there is some problem with FIP configuration, but I don't know what exactly.
15:13:05 <slaweq> as this started happening only a few weeks ago, I checked what we merged recently
15:13:19 <slaweq> and I found https://review.opendev.org/#/c/606385/ which seemed suspicious to me
15:13:51 <slaweq> but so far I don't have any strong evidence that this is the culprit of the issue
15:14:47 <slaweq> I proposed a revert of this patch, https://review.opendev.org/#/c/708624/, just to recheck it a couple of times and see if it gets better
15:15:01 <slaweq> currently the neutron-tempest-dvr job is failing pretty often due to this bug
15:15:37 <slaweq> if we don't find a fix for this issue in the next few days, maybe we should temporarily switch this job to non-voting
15:15:40 <slaweq> what do You think?
15:15:57 <ralonsoh> but it's failing in only one job
15:16:00 <ralonsoh> one test
15:16:04 <ralonsoh> test_resize_volume_backed_server_confirm
15:16:16 <slaweq> no, I also saw other tests failing
15:16:24 <ralonsoh> ok
15:16:43 <slaweq> like here: https://19574e4665a40f62095e-6b9500683e6a67d31c1bad572acf67ba.ssl.cf1.rackcdn.com/705982/6/check/neutron-tempest-dvr/8f3fbd0/testr_results.html
15:16:44 <ralonsoh> ok, we can mark it as unstable for now (non-voting)
15:16:55 <ralonsoh> yes, I saw this too
15:16:55 <slaweq> but that's true, it only happens in the dvr job
15:17:09 <bcafarel> https://review.opendev.org/#/c/606385/ seems "old" no? we have it since stein
15:17:59 <slaweq> ahh, yes
15:18:03 <slaweq> so it's not that patch for sure
15:18:18 <slaweq> it was on my list of recent patches because it was recently merged to the stein branch
15:18:30 <slaweq> so there was a comment on the original patch in gerrit
15:18:38 <slaweq> so, this isn't the culprit for sure :/
15:18:48 <slaweq> thx bcafarel for pointing this out
15:18:56 <bcafarel> np :)
15:19:33 <slaweq> I will try to reproduce this issue locally maybe this week
15:19:49 <slaweq> #action slaweq to try to reproduce and debug neutron-tempest-dvr ssh issue
15:20:26 <slaweq> ok, so that's all from my side about this issue
15:20:59 <slaweq> next one
15:21:01 <slaweq> ralonsoh to check periodic neutron-ovn-tempest-ovs-master-fedora job's failures
15:21:07 <maciejjozefczyk> slaweq, thanks
15:21:22 <ralonsoh> slaweq, I didn't have time for this one
15:21:24 <ralonsoh> sorry
15:21:34 <ralonsoh> (is still in my todo list)
15:21:37 <slaweq> ralonsoh: no problem :)
15:21:45 <slaweq> #action ralonsoh to check periodic neutron-ovn-tempest-ovs-master-fedora job's failures
15:21:53 <slaweq> maciejjozefczyk: for what? :)
15:22:52 <slaweq> ok, lets move on
15:22:55 <slaweq> next topic
15:22:58 <slaweq> #topic Stadium projects
15:23:03 <slaweq> standardize on zuul v3
15:23:15 <slaweq> there was only slow progress this week
15:23:29 <slaweq> we almost merged the patch for bagpipe, but a new bug blocked it in the gate
15:24:23 <slaweq> anything else regarding stadium projects' ci?
15:25:01 <njohnston> nope
15:25:06 <njohnston> like you said, slow progress
15:25:17 <slaweq> ok, so lets move on
15:25:20 <slaweq> #topic Grafana
15:25:26 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
15:27:00 <slaweq> I think we have still the same problems there:
15:27:06 <slaweq> 1. functional tests
15:27:12 <slaweq> 2. neutron-tempest-dvr
15:27:16 <slaweq> 3. grenade jobs
15:27:35 <slaweq> other than that, I think it looks good
15:27:46 <njohnston> agreed
15:27:53 <slaweq> e.g. neutron-tempest-plugin jobs (the voting ones) are at very low failure rates
15:28:11 <slaweq> tempest jobs are fine too
15:28:26 <slaweq> even fullstack jobs are pretty stable now
15:30:17 <slaweq> and that conclusion was confirmed when I was looking at some specific failures today
15:30:33 <slaweq> most of the time it's those recurring problems
15:30:52 <slaweq> ok, but do You have anything else regarding grafana to add?
15:31:26 <ralonsoh> no
15:31:33 <njohnston> no
15:31:48 <bcafarel> all good
15:32:11 <slaweq> ok, so lets continue then
15:32:32 <slaweq> #topic fullstack/functional
15:32:56 <slaweq> as I said, I only found those issues with ncat and ping commands: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_e30/705237/7/check/neutron-functional/e301c8d/testr_results.html
15:33:05 <slaweq> and  https://2a0154cb9a3e47bde3ed-4a9629bf7847ad9c8b03c9755148c549.ssl.cf1.rackcdn.com/705660/4/check/neutron-functional/2e5030b/testr_results.html
15:33:25 <slaweq> ralonsoh: but shouldn't this issue with the ping command be solved by https://review.opendev.org/#/c/707452/ ?
15:34:07 <ralonsoh> IMO, this could be a similar problem to the ncat rootwrap one
15:34:24 <slaweq> ahh, ok
15:34:26 <ralonsoh> but of course, this patch modifies the ping commands and filters to match
15:34:43 <ralonsoh> (regardless of a possible problem in oslo rootwrap)
15:34:45 <slaweq> so if we hopefully resolve the rootwrap issue with the ncat command, this one should be fine too
15:34:53 <ralonsoh> probably
15:35:02 <ralonsoh> but the patch, IMO, is legit
15:35:59 <slaweq> ralonsoh: do You have a list of tests which use the ping/ncat commands and are failing due to that?
15:36:16 <slaweq> maybe we could mark them as unstable for now if there aren't many of them?
15:36:19 <ralonsoh> no, I don't have this list
15:36:32 <ralonsoh> what I did was find any "ping" usage in the code
15:36:51 <ralonsoh> and I defined a uniform way of calling this command
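To make the "uniform way of calling" idea above concrete, here is a hypothetical sketch of a single canonical ping command builder; the names and argument order are assumptions for illustration, not what the actual patch does.

```python
# Hypothetical sketch: build the ping command line the same way everywhere,
# so a single rootwrap filter pattern can match every caller.
def build_ping_cmd(address, ip_version=4, count=3, interval=1, timeout=1):
    """Return the ping command as a list, in one canonical argument order."""
    binary = 'ping' if ip_version == 4 else 'ping6'
    return [binary, '-W', str(timeout), '-i', str(interval),
            '-c', str(count), address]


# Example: build_ping_cmd('192.168.0.10') ->
#   ['ping', '-W', '1', '-i', '1', '-c', '3', '192.168.0.10']
```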
15:38:03 <slaweq> ok, lets wait for help from the oslo folks with that for now - maybe they will find the root cause of the problem
15:38:41 <slaweq> anything else related to functional/fullstack tests?
15:38:53 <ralonsoh> yes
15:39:02 <ralonsoh> https://bugs.launchpad.net/neutron/+bug/1828205/comments/4
15:39:03 <openstack> Launchpad bug 1828205 in neutron ""network-segment-ranges" doesn't return the project_id" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:39:25 <ralonsoh> (sorry, this is a tempest one)
15:39:32 <slaweq> ralonsoh: no problem
15:39:51 <slaweq> if we don't have anything else related to functional/fullstack we can move to the tempest topic now :)
15:40:02 <slaweq> #topic Tempest/Scenario
15:40:06 <slaweq> and please continue
15:40:09 <slaweq> :)
15:40:10 <ralonsoh> thanks
15:40:16 <ralonsoh> patch: https://review.opendev.org/#/c/707898/
15:40:26 <ralonsoh> please, take a look at c#4
15:40:36 <ralonsoh> not now, the comment is a bit long
15:40:40 <ralonsoh> (that's all)
15:41:26 <slaweq> ok, I will read it later today
15:42:21 <slaweq> and this patch of Yours, https://review.opendev.org/#/c/707898/, should hopefully resolve the mystery of the missing project_id field, right?
15:42:29 <ralonsoh> yes
15:42:59 <slaweq> great :)
15:43:03 <slaweq> at least one down :)
15:43:06 <slaweq> thx ralonsoh
15:43:07 <ralonsoh> this is due to the unfinished tenant_id->project_id migration
15:43:11 <ralonsoh> yw
15:43:42 <bcafarel> heh, it had been some time since the last issue with that migration
15:44:57 <slaweq> among other things, I pushed a patch to tempest to increase the timeout for the tempest-ipv6-only job: https://review.opendev.org/708635
15:45:14 <slaweq> I hope that the QA team will accept it :)
15:45:27 <slaweq> if not, I will reopen the patch in the neutron repo
15:45:58 <slaweq> and last thing:
15:46:08 <slaweq> I found one new, "interesting" failure: https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_d53/708009/4/check/tempest-ipv6-only/d53c891/testr_results.html
15:46:21 <slaweq> did You see such issues before?
15:46:49 <ralonsoh> Details: {'type': 'RouterInUse', 'message': 'Router 86ed5e27-7dc9-4974-ac7c-4e2e1f0558f0 still has ports', 'detail': ''}
15:46:56 <ralonsoh> yes, a couple of times
15:47:04 <ralonsoh> the resource was not cleaned properly
15:47:06 <slaweq> I don't think it's related to the patch on which it was run
15:47:18 <ralonsoh> I don't think so
15:47:32 <slaweq> ralonsoh: so it seems that it's a tempest cleanup issue, right?
15:47:40 <ralonsoh> I think so
15:48:14 <slaweq> ok, so I will report it as a tempest bug
15:48:28 <slaweq> #action slaweq to report tempest bug with routers cleanup
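For reference on the RouterInUse failure above: neutron refuses to delete a router that still has interface ports attached, so cleanup has to detach those interfaces first. A rough openstacksdk sketch of that ordering, assuming a hypothetical helper rather than tempest's actual cleanup code:

```python
# Rough sketch of the required teardown order: detach router interfaces
# first, then delete the router (otherwise neutron raises RouterInUse).
import openstack

conn = openstack.connect()  # credentials from clouds.yaml / OS_* env vars


def delete_router_safely(router_id):
    router = conn.network.get_router(router_id)
    # Router interface ports carry the router's id as device_id; detach each
    # of them before deleting the router itself.
    for port in conn.network.ports(device_id=router_id):
        if port.device_owner.startswith('network:router_interface'):
            conn.network.remove_interface_from_router(router, port_id=port.id)
    conn.network.delete_router(router)
```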
15:48:45 <slaweq> and that's all from my side for today
15:48:58 <slaweq> anything else regarding scenario jobs?
15:49:33 <bcafarel> I saw a few timeouts on tempest plugin runs in stable branches today, but it may just be an infra/slow node issue
15:49:37 <bcafarel> rechecks in progress :)
15:49:49 <bcafarel> it was mostly the designate job
15:50:05 <slaweq> bcafarel: I saw some timeouts on the designate job on the master branch too
15:50:19 <slaweq> but as it was a timeout, I didn't check it further
15:50:32 <slaweq> lets hope it's just a slow nodes issue :)
15:51:19 <bcafarel> fingers crossed
15:51:22 <slaweq> :)
15:51:37 <slaweq> bcafarel: I think You wanted to raise some other topic at this meeting too, right?
15:51:45 <slaweq> #topic Open discussion
15:51:56 <slaweq> if so, the floor is Yours :)
15:52:30 <ralonsoh> nothing from me
15:52:56 <bcafarel> There was the https://bugs.launchpad.net/neutron/+bug/1863830 bug, but liuyulong updated it after my initial run - it was a dup of the ncat issue in the end
15:52:57 <openstack> Launchpad bug 1863213 in neutron "duplicate for #1863830 Spawning of DHCP processes fail: invalid netcat options" [Undecided,New] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
15:53:19 <slaweq> ahh, ok
15:53:28 <bcafarel> I just was not up to date on recent bugs :) (holidays have this effect)
15:53:40 <slaweq> lucky You :P
15:53:45 <bcafarel> and all good apart from that, glad to see stable branches back in working order
15:53:58 <slaweq> ok, so I think we can finish a bit earlier today
15:54:02 <slaweq> thx for attending
15:54:08 <bcafarel> +1 :)
15:54:12 <bcafarel> o/
15:54:12 <slaweq> #endmeeting