16:01:48 <slaweq> #startmeeting neutron_ci
16:01:49 <openstack> Meeting started Tue Nov 20 16:01:48 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:50 <slaweq> hello
16:01:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:54 <openstack> The meeting name has been set to 'neutron_ci'
16:01:54 <njohnston> o/
16:01:54 <mlavalle> o/
16:02:37 <slaweq> lets start the last of today's meetings then :)
16:02:38 <slaweq> #topic Actions from previous meetings
16:02:48 <slaweq> mlavalle/slaweq to continue debugging issue with not reachable FIP in scenario jobs
16:02:56 <mlavalle> I continued doing that
16:03:04 <mlavalle> since I came back from berlin
16:03:11 <mlavalle> and before leaving as well
16:03:32 <slaweq> I didn't have time to look at this one, but I found some issue related to https://bugs.launchpad.net/neutron/+bug/1717302
16:03:33 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:03:44 <slaweq> and maybe it can help with this one too
16:03:54 <mlavalle> I've been comparing a "good run" with a "bad run"
16:03:59 <slaweq> I sent some patch already: https://review.openstack.org/#/c/618750/
16:04:34 <slaweq> this patch of mine can help with dvr jobs only
16:05:05 <slaweq> I saw issues like that in quite a few test runs, e.g. in http://logs.openstack.org/24/618024/5/check/neutron-tempest-dvr-ha-multinode-full/d27f183/logs/
16:05:42 <mlavalle> is this going to address also https://bugs.launchpad.net/neutron/+bug/1795870?
16:05:43 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:50 <slaweq> if You had in l3-agent logs something like: http://logs.openstack.org/24/618024/5/check/neutron-tempest-dvr-ha-multinode-full/d27f183/logs/subnode-2/screen-q-l3.txt.gz?level=ERROR then it should help
16:07:01 <slaweq> if not, then it's probably a different issue
16:07:34 <mlavalle> well
16:07:50 <mlavalle> I was thinking of a different approach
16:08:14 <mlavalle> we recently merged https://review.openstack.org/#/c/609924/
16:09:02 <mlavalle> this fixes a situation where the fip is associated to a port before the port is bound
16:09:22 <mlavalle> as a consequence, the fip is created in the snat node, right?
16:09:32 <slaweq> yep
16:10:02 <mlavalle> now, when the port is bound, the patch fixes the migration of the fip / port to the corresponding node
16:10:10 <mlavalle> a compute presumably
16:10:16 <mlavalle> right?
16:10:24 <slaweq> yep
16:10:49 <mlavalle> now, let's look at the code of test_trunk_subport_lifecycle:
16:11:14 <mlavalle> https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_trunk.py#L67
16:11:34 <mlavalle> it creates the port and associates it to a fip
16:11:53 <mlavalle> so in a dvr env, that fip is going to the snat node
16:12:07 <slaweq> yes
16:12:14 <mlavalle> and then the server is created in L76
16:12:23 <mlavalle> so the migration starts
16:12:54 <mlavalle> it is possible that the bug https://bugs.launchpad.net/neutron/+bug/1795870
16:12:55 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:13:24 <mlavalle> is a consequence of the fip not being ready in the compute, because it is migrating
16:13:46 <mlavalle> it is a race that is not fixed by https://review.openstack.org/#/c/609924/
16:13:56 <mlavalle> right?
16:14:36 <slaweq> it can be something like that
16:14:49 <slaweq> maybe this FIP is configured in both snat and fip- namespaces then?
16:14:57 <mlavalle> now the trunk test bug happens 95% of the time with DVR
16:15:04 <slaweq> as in our gates all nodes are dvr_snat nodes
16:15:36 <mlavalle> so I intend to change the test script
16:15:52 <mlavalle> to create and associate the fip after the server
16:15:58 <mlavalle> and see what happens
16:16:02 <mlavalle> makes sense?
16:16:28 <slaweq> yes, totally :)
16:16:39 <mlavalle> so I propose the following:
16:16:56 <mlavalle> 1) let's merge your patch first and see the effect on the trunk bug
16:17:20 <mlavalle> 2) Then let's change the trunk test script and see the effect
16:17:39 <mlavalle> this way we learn which fix is having an effect or not
16:17:51 <slaweq> ok, I will address comments on my patch today (or tomorrow morning)
16:18:11 <mlavalle> and the other thing that I propose is let's track the trunk test failure independently of the bug you are fixing
16:18:31 <mlavalle> because in the case of the trunk stuff, it might be something else
16:18:43 <slaweq> one more question, if it's like You described it, does it still happen so often, or maybe it's now fixed by https://review.openstack.org/#/c/609924/11 ?
16:18:47 <mlavalle> I mean the problem might be in the trunk code for example
16:19:19 <mlavalle> it's still happening after merging that patch
16:19:34 <mlavalle> I saw it yesterday
16:19:44 <slaweq> sure, I just raised this patch here as I saw issues like that in many different tests (trunk also) - the only common thing was a not reachable FIP
16:20:17 <slaweq> ok, so lets do it as You described and we will see how it goes
16:20:22 <mlavalle> I am just trying to be disciplined and merge fixes orderly, to learn what fixes what
16:20:37 <mlavalle> makes sense?
16:20:48 <slaweq> #action mlavalle to continue tracking not reachable FIP in trunk tests
16:20:55 <slaweq> sure, makes sense for me :)
16:21:16 <slaweq> I think we can go on to the next one then, right?
16:21:30 <mlavalle> yeap
16:21:38 <mlavalle> thanks for listening :-)
16:21:57 <slaweq> ok, so the next one is:
16:21:59 <slaweq> njohnston rename existing neutron-functional job to neutron-functional-python27 and switch neutron-functional to be py3
16:22:26 <slaweq> njohnston: any updates?
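For reference, a rough sketch of the test reordering mlavalle proposes above for test_trunk_subport_lifecycle, not the actual neutron-tempest-plugin code: boot the server first so the port gets bound to a compute host, and only associate the floating IP afterwards, so the FIP never has to migrate away from the SNAT node. The helper names (create_port, create_server, wait_for_server_active, create_and_associate_floatingip) are illustrative placeholders for whatever the test base class provides.

    # Illustrative sketch only; helper names are hypothetical, not the
    # real neutron-tempest-plugin base class methods.
    def _boot_server_then_attach_fip(self, network):
        # create the parent port for the trunk on the test network
        port = self.create_port(network)

        # boot the server on that port; port binding happens here, so the
        # port is bound to a compute host before any FIP exists
        server = self.create_server(ports=[port])
        self.wait_for_server_active(server)

        # only now create and associate the floating IP; with the port
        # already bound, the FIP is configured directly on the compute
        # node instead of landing on the SNAT node and migrating later
        fip = self.create_and_associate_floatingip(port)
        return server, fip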
16:22:29 <njohnston> bcafarel already had that underway before we had talked about it: https://review.openstack.org/#/c/577383/
16:23:03 <njohnston> It is still showing the same issue with subunit.parser that I have been stumped by: http://logs.openstack.org/83/577383/12/check/neutron-functional/e02bd4f/job-output.txt.gz#_2018-11-13_14_10_04_212536
16:23:24 <slaweq> bcafarel: are You around? Do You need help with this one?
16:23:45 <njohnston> but if I or anyone else has a eureka moment and figures out what in the functional test harness is not py3 ready, then the change is ready to go
16:24:01 <bcafarel> o/
16:24:20 <njohnston> bcafarel and I have talked about it; when last we chatted he also had not had luck finding the py3 string handling incompatibility that is causing the error
16:24:20 <bcafarel> yeah basically what njohnston said
16:24:49 <slaweq> ok, I will try to take a look at it this week if I have a few minutes
16:24:55 <bcafarel> I will try to catch this issue again, but will not mind at all if someone solves it in the meantime :)
16:25:27 <slaweq> ok, let's move forward then
16:25:30 <slaweq> njohnston make py3 etherpad
16:25:41 <slaweq> I guess it's https://etherpad.openstack.org/p/neutron_ci_python3, right?
16:25:42 <njohnston> As mentioned in the neutron team meeting, that is up
16:25:44 <njohnston> yep
16:25:49 <njohnston> #link https://etherpad.openstack.org/p/neutron_ci_python3
16:25:49 <slaweq> thx njohnston :)
16:26:49 <slaweq> I will go through it and will start doing some patches with conversion to py3 as I have some experience with it already
16:27:06 <njohnston> I need to flesh out the experimental jobs a bit
16:27:14 <njohnston> but I figure it will be a while before we get tot hose anyway
16:27:20 <njohnston> * to those
16:28:29 <slaweq> I think that we should revisit which experimental jobs we still need :)
16:29:08 <njohnston> yes, many of the legacy ones may be able to be removed, like legacy-tempest-dsvm-neutron-dvr-multinode-full
16:29:20 <slaweq> yes, I will take a look at them too
16:29:35 <slaweq> #action slaweq to check which experimental jobs can be removed
16:29:38 <njohnston> since that is covered already by neutron-tempest-dvr-ha-multinode-full in check and gate queues
16:29:53 <slaweq> #action slaweq to start migrating neutron CI jobs to zuul v3 syntax
16:30:33 <njohnston> Much appreciated slaweq, my attempts at zuul v3 conversions have taught me humility
16:30:44 <mlavalle> LOL
16:30:51 <slaweq> :)
16:31:14 <slaweq> I spent some time on converting neutron-tempest-plugin jobs to it so I understand You :)
16:31:33 <slaweq> ok, lets move to the next one then
16:31:35 <slaweq> njohnston check if grenade is ready for py3
16:31:52 <njohnston> Looks like Grenade has py3 support and has a zuul job defined for py3 by mriedem https://github.com/openstack-dev/grenade/commit/7bae489f38f8f0c82c8eb284d1841ef68d8e9a43
16:32:53 <mriedem> \o/
16:32:59 <slaweq> so we should just switch to use this one instead of what we are using now, right?
16:33:13 <njohnston> yes
16:33:15 <mriedem> still need https://review.openstack.org/#/c/617662/
16:33:24 <mriedem> but maybe unrelated to what you care about
16:35:09 <njohnston> No, I think that is excellent
16:35:16 <njohnston> I think we can probably make use of that template
16:35:48 <njohnston> So we can either base our jobs off of that zuul template or just use it outright
16:36:49 <slaweq> I'm now comparing our definition of the neutron-grenade job with the grenade-py3 job
16:37:08 <slaweq> I see only one difference, in grenade-py3 there is no openstack/neutron in required projects
16:37:29 <slaweq> will we have to add it if we use this job, or is it not necessary?
16:37:30 <mriedem> i believe that comes from legacy-dsvm-base
16:37:42 <mriedem> http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/zuul.d/jobs.yaml#n915
16:37:56 <mriedem> but i'm never really sure how that all works
16:38:09 <slaweq> ahh, so we don't need to define it in our template
16:38:23 <mriedem> i don't think it's needed,
16:38:27 <mriedem> grenade-py3 is stein only,
16:38:35 <mriedem> and devstack has defaulted to neutron since i think newton
16:38:37 <slaweq> thx mriedem
16:39:01 <slaweq> so IMO we could switch to use this template in neutron's .zuul.yaml file
16:39:09 <slaweq> njohnston: will You do it then?
16:39:54 <njohnston> definitely
16:39:58 <slaweq> thx
16:40:21 <mriedem> correct, neutron is the default since ocata, but doesn't matter for this anyway https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate-wrap.sh#L201
16:40:25 <mriedem> *correction
16:40:26 <slaweq> #action njohnston to switch neutron to use integrated-gate-py35 with grenade-py3 job instead of our neutron-grenade job
16:40:45 <slaweq> thx mriedem, that sounds very good for us :)
16:41:15 <slaweq> ok, so lets move on to the next one then
16:41:17 <slaweq> slaweq to check Fullstack tests fails because process is not killed properly (bug 1798472)
16:41:18 <openstack> bug 1798472 in neutron "Fullstack tests fails because process is not killed properly" [High,In progress] https://launchpad.net/bugs/1798472 - Assigned to Slawek Kaplonski (slaweq)
16:41:32 <slaweq> I was checking that one
16:42:01 <slaweq> and what I found is that in some cases the openvswitch agent, or sometimes neutron-server, is not responding to SIGTERM at all
16:42:18 <slaweq> and then tests are failing in the cleanup phase as the process is not stopped properly and a timeout is raised
16:42:33 <njohnston> oh interesting
16:42:35 <mlavalle> didn't we approve a patch for that yesterday?
16:42:38 <slaweq> I did patch https://review.openstack.org/#/c/618024/ which should at least work around it in tests
16:42:42 <slaweq> mlavalle: yes :)
16:42:57 <slaweq> I just wanted to do an introduction for everyone ;)
16:43:13 <mlavalle> cool
16:43:31 <slaweq> for now it failed with some unrelated errors so I rechecked it
16:43:44 <slaweq> it should help with such issues in fullstack tests
16:43:57 <slaweq> ok, next one
16:43:59 <slaweq> mlavalle to check bug 1798475
16:44:00 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:44:17 <mlavalle> no time to check this one
16:44:50 <slaweq> do You plan to check that this week maybe?
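For reference, a minimal self-contained sketch of the kind of workaround discussed above for bug 1798472. This is not the actual content of https://review.openstack.org/#/c/618024/, only an illustration of a SIGTERM-then-SIGKILL fallback so a fullstack cleanup phase does not hang until its timeout when an agent or server process ignores SIGTERM.

    import os
    import signal
    import time


    def stop_process(pid, grace_period=5.0, poll_interval=0.5):
        """Send SIGTERM to pid and fall back to SIGKILL if it is ignored."""
        os.kill(pid, signal.SIGTERM)
        deadline = time.monotonic() + grace_period
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)  # signal 0 only probes whether pid still exists
            except ProcessLookupError:
                return  # process exited cleanly after SIGTERM
            time.sleep(poll_interval)
        # still alive after the grace period: force-kill it
        os.kill(pid, signal.SIGKILL)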
16:45:05 <mlavalle> no, I don't think I'll have time this week
16:45:15 <mlavalle> Thursday and Friday are holidays
16:45:32 <slaweq> ok, I will assign it to me but I will check it only if I have some time for it
16:45:45 <slaweq> #action slaweq to check bug 1798475
16:45:47 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:46:01 <slaweq> and the last one was:
16:46:03 <mlavalle> Great, thanks!
16:46:03 <slaweq> slaweq to check why db_migration functional tests don't have logs
16:46:08 <slaweq> yw mlavalle :)
16:46:24 <slaweq> for this last one I didn't have time
16:46:33 <slaweq> so I will assign it to me for next week too
16:46:42 <slaweq> #action slaweq to check why db_migration functional tests don't have logs
16:46:57 <slaweq> ok, those were all the actions from the previous week
16:47:06 <slaweq> do You want to add anything?
16:47:37 <mlavalle> not from me
16:47:55 <slaweq> ok, lets move on then
16:47:58 <slaweq> #topic Python 3
16:48:11 <slaweq> I think we discussed most things already :)
16:48:16 <njohnston> indeed :-)
16:48:28 <slaweq> I just wanted to mention that fullstack tests are now running on py3 only:
16:48:30 <slaweq> #topic Python 3
16:48:36 <slaweq> #undo
16:48:37 <openstack> Removing item from minutes: #topic Python 3
16:48:39 <slaweq> https://review.openstack.org/#/c/604749/
16:49:10 <slaweq> so we don't have neutron-fullstack-python36 anymore
16:49:13 <slaweq> :)
16:49:36 <slaweq> thx bcafarel and njohnston for that one
16:49:40 <njohnston> Did someone already submit a change to remove that from the grafana dashboard, or shall I do it?
16:50:03 <slaweq> no, I think there is no such patch yet
16:50:12 <slaweq> would be good if You could do it :)
16:50:25 <slaweq> thx for remembering about that
16:50:41 <slaweq> #action njohnston to remove neutron-fullstack-python36 from grafana dashboard
16:50:45 <njohnston> Will do, and I'll prep a second one to match up with the change in functional tests with a proper depends-on
16:50:55 <slaweq> thx njohnston
16:51:07 <slaweq> ok, so speaking about grafana
16:51:09 <slaweq> #topic Grafana
16:51:15 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:53:24 <slaweq> there were quite big failure rates on the charts during the weekend, but there were not many jobs running then, so I don't think we should focus on those
16:53:53 <slaweq> other than that, I think most things are similar to how they were two weeks ago
16:54:21 <slaweq> one thing which I want to raise is the failing neutron-tempest-postgres-full periodic job
16:54:25 <slaweq> we should check that one
16:54:52 <mlavalle> ok
16:56:23 <slaweq> looks like some nova issue, e.g.: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/a52bcf9/job-output.txt.gz#_2018-11-18_06_59_45_183264
16:56:31 <slaweq> and nova logs: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/a52bcf9/logs/screen-n-api.txt.gz?level=ERROR
16:56:43 <slaweq> mriedem: does it ring a bell for You ^^ ?
16:57:43 <mriedem> yes,
16:57:46 <mriedem> but should be fixed
16:58:15 <mriedem> oh nvm this is something else
16:58:31 <slaweq> last such issue is from today, http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/1de7427/logs/screen-n-api.txt.gz?level=ERROR
16:58:42 <mriedem> https://github.com/openstack/nova/commit/77881659251bdff52163ba1572e13a105eadaf7f
16:59:22 <mriedem> ok so the pg jobs are broken
16:59:25 <mriedem> has anyone reported a bug?
16:59:43 <slaweq> not me, I just noticed that in the periodic job
16:59:58 <slaweq> I will report a bug after this meeting
17:00:01 <mriedem> thanks
17:00:05 <slaweq> (which is just over) :)
17:00:11 <slaweq> ok, thx for attending
17:00:16 <slaweq> #endmeeting