16:00:07 <slaweq> #startmeeting neutron_ci
16:00:08 <openstack> Meeting started Tue Jul 9 16:00:07 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:10 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:12 <openstack> The meeting name has been set to 'neutron_ci'
16:00:16 <mlavalle> o/
16:00:17 <njohnston> o/
16:01:12 <slaweq> lets wait few more minutes for others
16:02:34 <ralonsoh> hi
16:02:53 <slaweq> ok, so lets start
16:02:54 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:08 <slaweq> please open now so it will be ready later
16:03:14 <slaweq> #topic Actions from previous meetings
16:03:17 * mlavalle triggered Grafana
16:03:22 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:03:32 <mlavalle> I did continue looking at that
16:03:47 <mlavalle> I am still finding the metadata proxy issue
16:04:13 <mlavalle> Left detailed comment here: https://bugs.launchpad.net/neutron/+bug/1830763/comments/3
16:04:14 <openstack> Launchpad bug 1830763 in neutron "Debug neutron-tempest-plugin-dvr-multinode-scenario failures" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:04:52 <mlavalle> The summary is that the instance is ready before haproxy-metadata-proxy is created
16:04:57 <mlavalle> in one of the nodes
16:05:07 <mlavalle> and therefore fails to get its metadata
16:05:29 <njohnston> just FYI, the failure rate for that job has been dropping the past few days and is down to 32%
16:05:30 <mlavalle> this analysis was done with http://logs.openstack.org/14/668914/1/check/neutron-tempest-plugin-dvr-multinode-scenario/f2ce738/
16:06:05 <mlavalle> this patch was created by ralonsoh late last week
16:06:11 <mlavalle> I think
16:06:43 <slaweq> mlavalle: so there is some error on neutron-server side which slows down creation of router on compute node, right?
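[editor's sketch] The race mlavalle describes — the guest boots before its haproxy-metadata-proxy exists on that node, so the metadata request fails — is the classic reason metadata clients poll with retries. The helper below is purely illustrative (not the tempest or cloud-init code); all names in it are invented:

```python
import time


def wait_for_metadata(fetch, retries=10, delay=0.5):
    """Poll a metadata fetch callable until it succeeds or retries run out.

    `fetch` is any callable that raises OSError on failure (e.g. connection
    refused while the metadata proxy has not been spawned yet).
    """
    for _attempt in range(retries):
        try:
            return fetch()
        except OSError:
            time.sleep(delay)
    raise TimeoutError("metadata service never became reachable")


# Stub that "comes up" after two failed attempts, mimicking the late proxy:
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("connection refused")
    return "instance-id: i-000001"

print(wait_for_metadata(fake_fetch, delay=0))  # prints "instance-id: i-000001"
```

In the failing job the guest gives up before the proxy appears, which is why reordering the setup steps (see the grafana discussion later) reduces but does not remove the race.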
16:07:01 <mlavalle> yes, that's my theory
16:08:09 <slaweq> do You know what is this "8831ed85-9ccf-48a2-92eb-ab39d3d30e89-ubuntu-bionic-rax-ord-0008" which is reported as duplicate in error message?
16:08:26 <mlavalle> no, I haven't gotten to that yet
16:08:31 <slaweq> is it agent_id-hostname pair?
16:08:33 <slaweq> or what?
16:08:37 <slaweq> ok
16:09:01 <slaweq> so You will continue this investigation, right?
16:09:06 <mlavalle> yes
16:09:10 <mlavalle> indeed
16:09:21 <mlavalle> I'll dig in this case as deeply as possible
16:09:23 <slaweq> thx mlavalle, great progress on this :)
16:09:29 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:09:42 <mlavalle> the good thing is that this patch is very recent, so the logs will be there
16:09:43 <slaweq> ok, next action was
16:09:50 <slaweq> slaweq to send patch to switch functional tests job in fwaas repo to py3
16:10:11 <slaweq> Patches: https://review.opendev.org/668917 and https://review.opendev.org/668918 are sent
16:10:16 <slaweq> also https://review.opendev.org/#/c/669757/ is needed
16:10:23 <slaweq> please take a look if You will have some time
16:10:29 * njohnston will take a look
16:10:34 <slaweq> njohnston: thx
16:10:43 <slaweq> next one:
16:10:45 <slaweq> ralonsoh to continue debugging TestNeutronServer: start function (bug/1833279) with new logs
16:11:29 <ralonsoh> I didn't see any other error in the CI related to this bug
16:11:56 <ralonsoh> http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:neutron-functional%20AND%20message:%5C%22Timed%20out%20waiting%20for%20file%5C%22
16:13:42 <slaweq> ok, so maybe we can simply close this bug for now
16:13:49 <slaweq> what do You think?
16:13:57 <slaweq> I also didn't see it during last week
16:13:59 <ralonsoh> I'll keep an eye on this during the next week
16:14:10 <slaweq> ok, thx ralonsoh
16:14:12 <ralonsoh> and if I don't see anything, I'll close it
16:14:14 <ralonsoh> np!
16:14:17 <slaweq> ++
16:14:30 <slaweq> ok, next one
16:14:31 <mlavalle> can I say something about fwaas
16:14:32 <slaweq> slaweq to mark test_ha_router_restart_agents_no_packet_lost as unstable again
16:14:35 <slaweq> Done: https://review.opendev.org/668914
16:14:45 <slaweq> mlavalle: sure, go on
16:15:12 <mlavalle> I pinged Sridar yesterday. He responded back that he is still trying to help
16:15:25 <mlavalle> so I will organize a meeting with him
16:15:35 <mlavalle> This coming Thursday
16:15:40 <mlavalle> I'll invite njohnston
16:15:44 <njohnston> yes please
16:15:47 <bcafarel> late o/ sorry the local Q&A session took some time
16:15:59 <mlavalle> that's it
16:16:23 <slaweq> mlavalle: can You invite me to this meeting as well?
16:16:30 <mlavalle> slaweq: yes
16:16:32 <slaweq> thx
16:17:15 <slaweq> ok, lets move on then
16:17:23 <slaweq> #topic Stadium projects
16:17:31 <slaweq> first Python 3 migration
16:17:37 <slaweq> etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:17:59 <slaweq> I moved finished projects to the bottom of the document as bcafarel suggested last week
16:18:18 <njohnston> I'm currently working on the bagpipe tempest jobs
16:18:23 <slaweq> we have only 6 projects on the "not finished" list
16:18:47 <slaweq> but for most of the things we have volunteers already
16:19:09 <slaweq> I think we should ask yamamoto about networking-midonet
16:19:26 <slaweq> or is there anyone else involved in this project who we can ping? mlavalle do You know?
16:19:40 <slaweq> and thx njohnston for taking care of bagpipe :)
16:19:56 <mlavalle> let's ask yamamoto
16:20:05 <slaweq> ok, I will send him an email this week
16:20:11 <slaweq> good for You?
16:20:14 <mlavalle> yes
16:20:33 <slaweq> #action slaweq to contact yamamoto about networking-midonet py3 status
16:20:59 <slaweq> any other updates on it?
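[editor's sketch] For readers unfamiliar with the py3 migration work discussed above: switching a stadium functional job to Python 3 is usually a small Zuul/devstack change. The snippet below is purely illustrative — job and variable names are assumptions, and the actual mechanism is whatever the linked reviews (668917, 668918, 669757) do:

```yaml
# Hypothetical sketch of a py3 job switch; see the real reviews for the
# exact job names used in the neutron-fwaas repo.
- job:
    name: neutron-fwaas-functional
    parent: neutron-functional
    vars:
      devstack_localrc:
        USE_PYTHON3: true
```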
16:22:13 <slaweq> ok, lets move on then
16:22:20 <slaweq> tempest-plugins migration
16:22:26 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:22:28 <slaweq> any updates?
16:22:54 <njohnston> I got back to the fwaas first change and made some progress with it
16:23:19 <slaweq> njohnston: yeah, I saw it today and I commented on one of Your patches already
16:23:24 <njohnston> it has 2 +2s but no +W https://review.opendev.org/#/c/643662/
16:23:42 <njohnston> I'll fix up that issue, it's in the second stage change
16:23:59 <slaweq> +W'ed now
16:24:12 <njohnston> thanks!
16:24:21 <slaweq> thanks for working on this
16:24:40 <slaweq> so we will still have vpnaas and neutron-dynamic-routing unfinished
16:25:13 <slaweq> and we should be good on this finally so gmann will be happy :)
16:25:48 <slaweq> mlavalle: tidwellr do You need any help with patches for vpnaas and neutron-dynamic-routing ? I can help with them if needed
16:26:34 <mlavalle> slaweq: if I am slowing down anybody, go ahead. but I would still like to give it a try
16:26:48 <slaweq> mlavalle: no, it's not urgent of course
16:26:52 <slaweq> I just wanted to ask :)
16:27:00 <mlavalle> thanks
16:27:34 <slaweq> ok, any other questions/updates about stadium projects?
16:28:20 <slaweq> ok, so lets move on then
16:28:26 <slaweq> #topic Grafana
16:28:31 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:29:11 <slaweq> I don't see anything urgent or "special" this week
16:29:23 <njohnston> So hey, how's about that neutron-tempest-plugin-dvr-multinode-scenario failure rate? Looking good!
16:29:53 <slaweq> exactly, it's better
16:29:59 <mlavalle> agree
16:30:29 <slaweq> I think that changing the order of connecting router interfaces and spawning the instance minimized the possibility of this race with metadata proxy
16:30:34 <slaweq> so it is failing less now
16:30:40 <slaweq> but issue is still there :/
16:31:19 <mlavalle> you mean after your patch?
16:31:24 <slaweq> still neutron-tempest-with-uwsgi is failing 100% but there is patch https://review.opendev.org/#/c/668311/ for that
16:31:29 <slaweq> mlavalle: yes
16:32:34 <slaweq> anything else related to grafana?
16:32:37 <mlavalle> nope
16:33:23 <slaweq> ok, lets move on then
16:33:25 <slaweq> #topic fullstack/functional
16:33:41 <slaweq> I found 3 different test failures in functional tests recently:
16:33:47 <slaweq> neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase - http://logs.openstack.org/35/521035/8/check/neutron-functional/65a4ecf/testr_results.html.gz
16:34:03 <slaweq> ^^ I think ralonsoh was looking into something similar some time ago, right?
16:34:19 <ralonsoh> yes, I almost have a patch for this
16:34:41 <slaweq> it's this one which removes locks on netns privileged functions?
16:34:42 <ralonsoh> actually what I'm doing is replacing the ip_monitor in keepalived
16:34:51 <ralonsoh> which is the source of some problems
16:35:15 <ralonsoh> ooooh sorry
16:35:17 <slaweq> but in this case error was "Cannot open network namespace "qrouter-599a3366-e0ea-4a35-9497-4263b8409b00": No such file or directory"
16:35:17 <ralonsoh> one sec
16:35:28 <ralonsoh> yes, one sec (I write without reading first)
16:35:42 <ralonsoh> https://review.opendev.org/#/c/668682/
16:36:20 <ralonsoh> one line, big discussion in the patch
16:36:52 <slaweq> personally I'm fine with merging Your patch
16:37:09 <ralonsoh> (me too)
16:37:19 <slaweq> the config removal proposed by liuyulong should be IMO discussed on neutron team meeting or drivers meeting maybe
16:39:00 <slaweq> mlavalle: njohnston: please add this patch to Your list of reviews and say what is Your opinion about it
16:39:13 <mlavalle> ok
16:39:17 <slaweq> thx a lot
16:39:22 <njohnston_> I think it's a good topic for the drivers meeting, I don't think we need to use the time of the entire neutron community to talk about it
16:39:26 <njohnston_> will do
16:40:05 <slaweq> njohnston_: I agree
16:40:25 <slaweq> ok, next one was:
16:40:27 <slaweq> neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase - http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/testr_results.html.gz
16:41:10 <slaweq> which I don't remember seeing in the past
16:41:13 <slaweq> did You see it before?
16:41:46 <ralonsoh> IMO, this is similar to other privsep errors
16:41:57 <ralonsoh> the privsep thread pool is limited
16:42:09 <ralonsoh> and sometimes many ops are executed at the same time
16:42:14 <slaweq> in logs there is almost nothing: http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.linux.test_linuxbridge_arp_protect.LinuxBridgeARPSpoofTestCase.test_arp_protection_update.txt.gz
16:42:33 <ralonsoh> eventually some operation will not get time to be executed
16:43:02 <slaweq> ralonsoh: so do You think that maybe limiting number of workers can improve this?
16:43:32 <ralonsoh> slaweq, how many workers do we have?
16:43:38 <ralonsoh> one per core/thread?
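[editor's sketch] Two knobs are in play in the discussion around here: the stestr test-runner concurrency (which defaults to one runner per CPU, matching the 8 workers observed) and the oslo.privsep daemon's thread pool. Both sketches below are illustrative — the tox env name is assumed, and the `thread_pool_size` option should be verified against the oslo.privsep release in use:

```ini
# tox.ini sketch -- cap stestr at 4 parallel test runners instead of
# the per-CPU default (env name is an assumption):
[testenv:dsvm-functional]
commands = stestr run --concurrency 4 {posargs}

# neutron.conf sketch -- enlarge the privsep daemon's thread pool
# (defaults to the CPU count; verify this option exists in the
# deployed oslo.privsep version):
[privsep]
thread_pool_size = 16
```

Equivalently, `--concurrency 4` can be appended ad hoc after `--` when invoking tox, since the env forwards posargs to stestr.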
16:44:25 <slaweq> ralonsoh: I don't know exactly
16:44:36 <ralonsoh> slaweq, that should be the limit
16:44:39 <slaweq> ralonsoh: from log in this failed job it looks that it's not specified
16:44:47 <slaweq> and it runs 8 workers IMO
16:44:51 <ralonsoh> slaweq, no, this is infra config
16:44:55 <slaweq> http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/job-output.txt.gz#_2019-07-08_18_54_37_453034
16:46:07 <slaweq> it simply runs "tox -edsvm-functional-python27" and nothing else
16:46:09 <slaweq> http://logs.openstack.org/70/650270/19/check/neutron-functional-python27/6c9c017/job-output.txt.gz#_2019-07-08_18_53_34_474097
16:46:25 <slaweq> so if we want to limit number of test workers we should specify it there IMO
16:46:26 <ralonsoh> slaweq, so by default one worker per thread
16:46:52 <ralonsoh> let me check how to increase, in FT, the privsep pool
16:47:27 <slaweq> ralonsoh: ok, thx a lot
16:47:36 <slaweq> #action ralonsoh to check how to increase, in FT, the privsep pool
16:47:57 <slaweq> last one which I found is
16:47:58 <slaweq> neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase - http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/testr_results.html.gz
16:48:50 <slaweq> and I see error like: http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase.test_keepalived_state_change_notification.txt.gz#_2019-07-05_10_15_10_728
16:48:55 <slaweq> in logs from this test
16:49:17 <slaweq> but I'm not sure if that is related or not
16:49:38 <slaweq> I can investigate this one during this week
16:49:54 <slaweq> #action slaweq to check neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase issue
16:50:05 <njohnston> did we merge a check to validate gateway is inside the subnet? Because this looks like it would fail such a check: ['ip', 'netns', 'exec', 'qrouter-eb89b020-c3b0-410f-82ec-639b197dc05b', 'ip', 'route', 'replace', 'to', '8.8.8.0/24', 'via', '19.4.4.4']
16:50:34 <njohnston> oh, sorry, I misread that it was a default gateway, that just looks like an additional route
16:50:50 <slaweq> but it's not gateway
16:50:57 <slaweq> only extra route IIRC: http://logs.openstack.org/69/668569/3/check/neutron-functional-python27/9d345e8/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase.test_keepalived_state_change_notification.txt.gz#_2019-07-05_10_15_10_662
16:51:03 <njohnston> yeah
16:51:28 <slaweq> so I will look deeper into it this week
16:51:50 <slaweq> and I will report a bug or send a patch, whatever will be needed :)
16:52:12 <slaweq> that's all from my side regarding functional (and fullstack) tests
16:52:22 <slaweq> do You have anything else on that topic?
16:53:23 <slaweq> ok, lets move forward then
16:53:25 <slaweq> #topic Tempest/Scenario
16:53:35 <slaweq> first of all, good news
16:53:37 <slaweq> gmann proposed new integrated-gate templates, one for networking is merged: https://review.opendev.org/#/c/668930/
16:53:43 <slaweq> and I proposed to use it in neutron: https://review.opendev.org/#/c/669815/
16:54:17 <slaweq> I will also later try to change our existing neutron-tempest jobs to inherit from these new jobs dedicated for networking
16:54:36 <slaweq> that should improve our gates as we will not run e.g. cinder or glance related tests in our gate
16:54:38 <slaweq> :)
16:54:58 <mlavalle> cool
16:55:31 <slaweq> and now lets get back to the reality :P
16:55:36 <slaweq> I opened new bug https://bugs.launchpad.net/neutron/+bug/1835914
16:55:36 <openstack> Launchpad bug 1835914 in neutron "Test test_show_network_segment_range failing" [Medium,Confirmed]
16:55:45 <slaweq> I found it at least 2 times during last week
16:56:05 <slaweq> and it is strange for me that network_segment_range object doesn't have project_id sometimes :/
16:56:24 <slaweq> if there is any volunteer who wants to take a look into this, feel free :)
16:56:32 <njohnston> I noticed weirdnesses like that with bulk port - sometimes the requests that get created in testing aren't complete
16:56:42 <mlavalle> I might give it a try later this week
16:56:52 <slaweq> njohnston: good to know
16:57:02 <haleyb> i'm always confused by the tenant_id/project_id transition, are we supposed to be using both still? shouldn't one have gone away?
16:57:22 <slaweq> haleyb: but in this case tenant_id wasn't there either
16:57:41 <njohnston> haleyb: either/or
16:58:02 <haleyb> oh, that's strange
16:58:08 <haleyb> both missing that is
16:58:08 <slaweq> haleyb: exactly
16:58:20 <njohnston> I'll try to take a look as well
16:58:42 <slaweq> I set it as medium priority as it doesn't happen often, but it happens from time to time so there is definitely something to check :)
16:58:51 <slaweq> thx njohnston and mlavalle for taking care of it
16:59:05 <slaweq> ok, we are running out of time now
16:59:17 <slaweq> thx for attending
16:59:21 <mlavalle> o/
16:59:23 <njohnston> o/
16:59:30 <slaweq> have a great week and see You online :)
16:59:32 <slaweq> o/
16:59:36 <slaweq> #endmeeting