16:00:35 <slaweq> #startmeeting neutron_ci
16:00:35 <openstack> Meeting started Tue Mar 26 16:00:35 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:36 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:38 <slaweq> hi
16:00:39 <openstack> The meeting name has been set to 'neutron_ci'
16:00:59 <mlavalle> o/
16:01:04 <bcafarel> o/
16:01:13 <haleyb> hi
16:02:22 <njohnston> here
16:02:30 <slaweq> ok, lets start then
16:02:32 <slaweq> #topic Actions from previous meetings
16:02:41 <slaweq> bcafarel to switch neutron-grenade multinode jobs to bionic nodes
16:02:59 <bcafarel> should be done (I am looking for the link)
16:03:10 <slaweq> I think it was actually done by my patch https://review.openstack.org/#/c/639361/
16:03:12 <bcafarel> in fact it was in the original review to switch to bionic
16:03:34 <bcafarel> ^ that is the one :)
16:03:54 <slaweq> yeah, I had many such "DNM" test patches and I somehow missed that in this one someone had changed the description and removed "DNM" from the title
16:03:56 <slaweq> :)
16:04:09 <bcafarel> :)
16:04:13 <slaweq> thx bcafarel for checking that
16:04:14 <bcafarel> just dropping the override was enough in the end
16:04:25 <slaweq> ok, next one
16:04:28 <slaweq> mlavalle to check https://bugs.launchpad.net/neutron/+bug/1820865
16:04:29 <openstack> Launchpad bug 1820865 in neutron "Fullstack tests are failing because of "OSError: [Errno 22] failed to open netns"" [Critical,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:04:41 <mlavalle> I did check on it
16:04:50 <mlavalle> created a patch
16:05:06 <mlavalle> but it seems ralonsoh proposed a better solution
16:05:33 <mlavalle> https://review.openstack.org/#/c/647721
16:05:51 <slaweq> I saw that You tried to downgrade pyroute2 to check if the issue isn't in the newest version of it, right mlavalle?
16:06:02 <mlavalle> right
16:06:33 <slaweq> ok, I hope that ralonsoh's solution will fix this problem once and for all :)
16:06:38 <mlavalle> I was taking network_namespace_exists as proof that the namespace existed
16:06:54 <mlavalle> but it seems that it can exist but might not be ready to be opened
16:08:01 <lajoskatona> Hi
16:08:01 <slaweq> yes, I was also thinking that if the namespace exists then it should always be ready :)
16:08:11 <slaweq> hi lajoskatona
16:08:18 <rubasov> late o/
16:08:25 <slaweq> hi rubasov
16:08:39 <slaweq> ok, lets review ralonsoh's patch and lets hope it will help
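For context on the bug above: the pattern described is that a namespace already shows up in the kernel's netns listing while actually opening it still raises OSError for a short window after creation. Below is a minimal sketch of that idea (it is not ralonsoh's actual patch in https://review.openstack.org/#/c/647721; the helper name and timeouts are illustrative), assuming pyroute2 is used for namespace handling as in neutron:

    import time

    from pyroute2 import NetNS, netns


    def wait_until_netns_usable(name, timeout=5, interval=0.2):
        # netns.listnetns() only proves the namespace file exists; opening it
        # can still fail with OSError for a short time after creation, which
        # is the "exists but not ready to be opened" case discussed here.
        deadline = time.time() + timeout
        while True:
            if name in netns.listnetns():
                try:
                    return NetNS(name)
                except OSError:
                    pass  # listed but not yet openable, keep retrying
            if time.time() > deadline:
                raise RuntimeError("namespace %s never became usable" % name)
            time.sleep(interval)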
16:08:44 <slaweq> next action
16:08:46 <slaweq> slaweq/mlavalle to check https://bugs.launchpad.net/neutron/+bug/1820870
16:08:48 <openstack> Launchpad bug 1820870 in neutron "Fullstack tests are failing because async_process is not started properly" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:09:05 <slaweq> I sent a DNM patch with extra logs added
16:09:16 <slaweq> it is for sure not an issue with rabbitmq
16:09:47 <slaweq> a lot of the errors related to rabbit were caused by the fact that the rabbitmq user/vhost was already deleted
16:10:08 <slaweq> what happened there is that sometimes AsyncProcess.is_active() was always returning False
16:10:14 <slaweq> and after 1 minute the test was failing
16:10:46 <slaweq> so the test worker cleaned everything up, removed the rabbitmq user/vhost, killed spawned processes and so on
16:11:08 <slaweq> but the process which was "not active" (and caused the failure) was in fact running fine
16:11:18 <slaweq> it wasn't killed as the test runner didn't know its pid
16:11:34 <slaweq> so it kept running and logging that it couldn't connect to rabbitmq
16:11:51 <slaweq> in one case such an agent's log grew to about 1.7 GB :)
16:12:12 <slaweq> all my findings are described in comments on the bug in launchpad
16:12:17 <slaweq> and I proposed patch https://review.openstack.org/#/c/647605/
16:12:23 <slaweq> please review it
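To make the failure mode above concrete, here is a minimal self-contained sketch (this is not the code from the patch linked above; wait_for_agent and its parameters are illustrative only):

    import time


    def wait_for_agent(process, timeout=60, sleep=1):
        # The fullstack test only trusts process.is_active(); if that keeps
        # returning False, the test fails after ``timeout`` and its cleanup
        # removes the per-test rabbitmq user/vhost and kills the pids it
        # knows about.  A process that is really running but whose pid was
        # never recorded survives the cleanup and keeps logging rabbitmq
        # connection errors, which is how a 1.7 GB agent log can happen.
        deadline = time.time() + timeout
        while not process.is_active():
            if time.time() > deadline:
                raise AssertionError(
                    "agent did not report active within %s seconds" % timeout)
            time.sleep(sleep)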
16:12:47 * mlavalle thinks we forgot the grafana url at the beginning of the meeting
16:13:00 <slaweq> mlavalle: right, sorry
16:13:05 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:08 <slaweq> please open it now :)
16:13:46 <njohnston> :-)
16:14:17 <slaweq> and that is basically all about this action :)
16:14:23 <slaweq> next one from last week was:
16:14:32 <slaweq> mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:14:52 <mlavalle> I didn't make much progress there
16:16:03 <slaweq> ok, that's fine :)
16:16:13 <slaweq> we all had more important things last week
16:16:20 <mlavalle> I think I said I was going to do this slowly
16:16:31 <slaweq> yes :)
16:16:37 <mlavalle> :-)
16:16:45 <slaweq> should I assign it to You for next week to keep it in mind?
16:16:58 <mlavalle> but it's a good idea to ask me every week
16:17:11 <slaweq> ok, I will :)
16:17:15 <mlavalle> that way the fear of facing the Hulk will push me to make progress
16:17:23 <slaweq> LOL
16:17:32 <slaweq> #action mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:17:49 <slaweq> and that was all actions from last week
16:17:59 <slaweq> anything You want to add/ask here?
16:18:54 <njohnston> nope
16:19:13 <slaweq> ok, so lets move on to the next topic
16:19:15 <slaweq> #topic Python 3
16:19:21 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:19:26 <slaweq> njohnston: any updates?
16:20:08 <njohnston> I started looking at the vpnaas jobs and tinkering with them
16:20:30 <njohnston> I am trying to learn more from my failures in defining zuul jobs
16:20:43 <njohnston> But zuul is a tough teacher and slaps my hand a lot
16:20:58 <njohnston> Progress is slow but the timeline is long so I think things are OK
16:21:03 <slaweq> if You need any help with zuul I can try to help with it :)
16:21:08 <njohnston> thanks!
16:21:20 <slaweq> just ping me if You will need anything
16:21:37 <njohnston> will do
16:21:53 <slaweq> thx
16:22:03 <slaweq> so next topic is
16:22:04 <slaweq> #topic Ubuntu Bionic in CI jobs
16:22:27 <slaweq> as the grenade jobs are switched to bionic I think we are done with all of the jobs here
16:22:44 <slaweq> most of them were already in zuulv3 syntax so they were switched some time ago
16:22:53 <slaweq> and the legacy jobs were switched recently
16:23:22 <bcafarel> so all done for the topic?
16:23:23 <slaweq> if there are no objections I think we can remove this topic from agenda for next meetings
16:23:37 <slaweq> bcafarel: yes, IMO all is done for us
16:24:08 <njohnston> +1 for being done with it, good job!
16:24:15 <slaweq> also the stadium projects are switched to bionic, as infra switched the legacy jobs to bionic too
16:24:31 <njohnston> is midonet still the lone holdout?
16:24:59 <mlavalle> yes, let's remove this topic from the agenda
16:25:27 <slaweq> njohnston: yes, for midonet there is patch https://review.openstack.org/#/c/639990/
16:25:34 <slaweq> to switch them too
16:25:39 <slaweq> but jobs are non-voting already
16:25:50 <njohnston> ok
16:25:56 <slaweq> and midonet team will work on that I hope
16:26:06 <slaweq> ok mlavalle, thx for confirmation :)
16:26:19 <slaweq> #topic tempest-plugins migration
16:26:24 <slaweq> any updates?
16:26:30 <slaweq> I didn't have time to work on that
16:26:33 <mlavalle> I will work on this this week
16:26:38 <slaweq> tmorin didn't reply to me
16:26:44 <njohnston> I hope to get back to the fwaas migration as soon as my urgent matters clear up
16:27:38 <slaweq> I will also try to do this migration for bgpvpn this or next week
16:27:48 <bcafarel> no progress here either (hope to send at least WIP patch by next meeting)
16:28:09 <slaweq> #topic Grafana
16:28:19 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:28:26 <slaweq> ^^ just as a reminder :)
16:29:59 <slaweq> do You see anything on grafana which You want to talk about?
16:30:15 <slaweq> I don't see anything new and very urgent there
16:30:29 <mlavalle> looks much better this week
16:30:40 <slaweq> and I think that we finally, more or less, got the gates working
16:30:51 <njohnston> \o/
16:31:05 <slaweq> there are still some issues but nothing new, at least I'm not aware of anything new from this week
16:31:38 <mlavalle> agree
16:32:40 <slaweq> so lets talk about some specific jobs now
16:32:42 <slaweq> #topic fullstack/functional
16:32:56 <slaweq> we have still those 2 issues with fullstack jobs:
16:33:02 <slaweq> https://bugs.launchpad.net/neutron/+bug/1820865 and
16:33:04 <openstack> Launchpad bug 1820865 in neutron "Fullstack tests are failing because of "OSError: [Errno 22] failed to open netns"" [Critical,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:33:07 <slaweq> https://bugs.launchpad.net/neutron/+bug/1820870
16:33:08 <openstack> Launchpad bug 1820870 in neutron "Fullstack tests are failing because async_process is not started properly" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:33:15 <slaweq> but patches for both are already proposed
16:33:39 <slaweq> when both are merged, we can maybe make it voting again next week if it looks better on grafana
16:34:12 <slaweq> anything else You want to talk about regarding functional or fullstack jobs?
16:34:38 <mlavalle> agree
16:34:55 <mlavalle> I was going to ask about making the job voting again
16:35:02 <slaweq> :)
16:35:07 <slaweq> I remember about that
16:35:28 <slaweq> but lets fix those two issues and then check how it goes for at least a few days
16:35:37 <slaweq> what do You think about such a plan?
16:36:09 <njohnston> +1
16:37:05 <slaweq> ok, next topic
16:37:07 <slaweq> #topic Tempest/Scenario
16:37:26 <slaweq> I want to raise here 2 issues
16:37:34 <slaweq> first is https://bugs.launchpad.net/neutron/+bug/1815585
16:37:35 <openstack> Launchpad bug 1815585 in neutron "Floating IP status failed to transition to DOWN in neutron-tempest-plugin-scenario-linuxbridge" [High,Confirmed]
16:37:43 <slaweq> I was debugging it a bit
16:38:15 <slaweq> and what is strange for me is the fact that the port is unbound and active - and the test is failing because of that active status
16:38:40 <slaweq> do You know if it should somehow be possible to have an unbound port active?
16:38:48 <slaweq> for me this looks odd
16:39:10 <slaweq> ahh, an example of such a failure is here: http://logs.openstack.org/93/631793/6/check/neutron-tempest-plugin-scenario-linuxbridge/1c3c083/testr_results.html.gz
16:39:20 <mlavalle> yes, it looks od
16:39:22 <mlavalle> odd
16:39:45 <slaweq> maybe this is somehow related to https://bugs.launchpad.net/neutron/+bug/1819446
16:39:45 <openstack> Launchpad bug 1819446 in neutron "After the vm's port name is modified, the port status changes from down to active. " [Low,Confirmed]
16:40:03 <slaweq> but in this bug they reported that changing the name of a port will switch it to active
16:40:17 <mlavalle> even stranger
16:40:49 <slaweq> in the test there are no such port updates before the place where the test fails, but maybe it can happen not only on a name update - I don't know
16:41:15 <slaweq> I still have it in my backlog but if someone has time, please feel free to take it :)
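A quick way to look for the oddity being discussed (an ACTIVE port with no binding host) is to scan ports with openstacksdk; a small sketch, where the cloud name 'devstack' is an assumption to be adjusted to your clouds.yaml:

    import openstack

    # 'devstack' is a placeholder cloud name from clouds.yaml.
    conn = openstack.connect(cloud='devstack')

    for port in conn.network.ports():
        # An unbound port (no binding host) normally should not be ACTIVE.
        if port.status == 'ACTIVE' and not port.binding_host_id:
            print('unbound but ACTIVE: %s (device_owner=%s)'
                  % (port.id, port.device_owner))

Running something like this against a failed job's cloud state (or adding an equivalent check in the test) would confirm whether the anomaly is limited to floating IP ports or more general.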
16:41:56 <slaweq> in the meantime I can mark this test as unstable
16:42:01 <slaweq> what do You think about it?
16:42:39 <slaweq> according to logstash it happened 9 times in the last 7 days
16:42:39 <mlavalle> yep
16:43:00 <slaweq> ok, thx mlavalle, I will do it today :)
16:43:18 <slaweq> #action slaweq to mark test_floatingip_port_details test as unstable
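For reference, "marking a test as unstable" means turning a known intermittent failure into a skip that points at the bug; tempest ships a decorator for this in tempest.lib.decorators. The helper below is a purely hypothetical stand-in to illustrate the idea, not the tempest implementation:

    import functools
    import unittest


    def mark_unstable(bug):
        """Skip the decorated test instead of failing it, referencing ``bug``."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except unittest.SkipTest:
                    raise
                except Exception:
                    raise unittest.SkipTest(
                        'Marked as unstable and skipped because of %s' % bug)
            return wrapper
        return decorator

Applied as @mark_unstable("bug 1815585") on test_floatingip_port_details, the intermittent failure would show up as a skip in job results until the root cause is fixed.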
16:43:54 <slaweq> and the second problem, which we have had for some time already, is the problem with intermittent ssh failures in various tempest tests
16:44:03 <slaweq> example: http://logs.openstack.org/46/638646/14/check/neutron-tempest-plugin-scenario-linuxbridge/1ff70f5/testr_results.html.gz
16:44:11 <slaweq> here it is the linuxbridge job
16:44:26 <slaweq> but it may happen in any tempest job in fact
16:44:41 <slaweq> I didn't see any "pattern" in those failures
16:45:02 <mlavalle> do we have logstash query for them?
16:45:15 <slaweq> the only thing which I think is quite common is the fact that those instances can connect to the metadata service but ssh to the FIP is not working
16:45:24 <slaweq> mlavalle: no, I don't have logstash query for that
16:45:36 <slaweq> and I didn't report a bug for this either
16:45:53 <slaweq> I will report it to be able to track progress there
16:45:57 <mlavalle> if we create a logstash query and attach it to a bug report, we can cooperate in trying to isolate the issue
16:46:05 <slaweq> and I will also try to prepare some query
16:46:14 <slaweq> mlavalle++ I will do
16:46:25 <mlavalle> slaweq: I'll help
16:46:42 <slaweq> #action slaweq to report a bug related to intermittent ssh failures in various tests
16:46:45 <slaweq> thx mlavalle
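As an illustration of the kind of logstash query meant here (the field names follow logstash.openstack.org conventions, but the exact message string below is a hypothetical example, not an agreed-upon query):

    message:"SSHTimeout" AND project:"openstack/neutron" AND build_status:"FAILURE"

A query like this, attached to the bug report, would let anyone track how often the intermittent ssh failures show up and in which jobs.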
16:47:09 <slaweq> and that's all from my side for today
16:47:40 <slaweq> anything else You want to talk about regarding scenario jobs or anything else related to CI?
16:47:47 <slaweq> #topic Open discussion
16:48:29 <bcafarel> quick comment on stable branches, recent backport candidates on the ocata branch are getting -1
16:48:37 <bcafarel> https://review.openstack.org/#/q/status:open+project:openstack/neutron+branch:stable/ocata
16:49:04 <bcafarel> I need to check if there's an easy fix, or if it's just getting old
16:49:17 <mlavalle> probably the latter
16:50:03 <bcafarel> pike branch needs some rechecks from time to time, but not as bad
16:50:12 <slaweq> bcafarel: looking at one random test failure: http://logs.openstack.org/51/646651/1/check/openstack-tox-cover/2a21a99/job-output.txt.gz#_2019-03-25_10_50_42_620773
16:50:21 <slaweq> first question is: why is it tested on bionic?
16:51:05 <slaweq> and it's not a job defined in our repo I think
16:51:27 <slaweq> so maybe we should ask e.g. gmann about it
16:51:36 <slaweq> or someone else from qa team
16:51:47 <bcafarel> oooh very good point, so that would mostly be fallout from the bionic switch?
16:52:03 <slaweq> I just checked this one job for now :)
16:52:09 <slaweq> I don't know about other failures
16:52:36 <gmann> ocata should not do bionic. i will check the job
16:52:49 <bcafarel> gmann++ thanks
16:52:50 <slaweq> but yeah, looking at 3 other patches, it also failed on the openstack-tox-cover job
16:53:01 <slaweq> hi gmann
16:53:06 <slaweq> thx for help with it :)
16:54:37 <slaweq> and thx bcafarel for taking care of those stable branches :)
16:55:12 <slaweq> ok, anything else for today?
16:55:24 <slaweq> if not I think we can finish a few minutes early :)
16:55:45 <bcafarel> sounds good :)
16:55:58 <mlavalle> o/
16:56:06 <slaweq> ok, thx for attending the meeting
16:56:06 <njohnston> +1 thanks!
16:56:09 <slaweq> #endmeeting