16:01:48 <slaweq> #startmeeting neutron_ci
16:01:49 <openstack> Meeting started Tue Nov 20 16:01:48 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:01:50 <slaweq> hello
16:01:50 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:01:54 <openstack> The meeting name has been set to 'neutron_ci'
16:01:54 <njohnston> o/
16:01:54 <mlavalle> o/
16:02:37 <slaweq> lets start the last of today's meetings then :)
16:02:38 <slaweq> #topic Actions from previous meetings
16:02:48 <slaweq> mlavalle/slaweq to continue debugging issue with not reachable FIP in scenario jobs
16:02:56 <mlavalle> I continued doing that
16:03:04 <mlavalle> since I came back from berlin
16:03:11 <mlavalle> and before leaving as well
16:03:32 <slaweq> I didn't have time to look at this one, but I found some issue related to https://bugs.launchpad.net/neutron/+bug/1717302
16:03:33 <openstack> Launchpad bug 1717302 in neutron "Tempest floatingip scenario tests failing on DVR Multinode setup with HA" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:03:44 <slaweq> and maybe it can help with this one too
16:03:54 <mlavalle> I've been comparing a "good run" with a "bad run"
16:03:59 <slaweq> I sent some patch already: https://review.openstack.org/#/c/618750/
16:04:34 <slaweq> this patch of mine can help with dvr jobs only
16:05:05 <slaweq> I saw issues like that in quite a few test runs, e.g. in http://logs.openstack.org/24/618024/5/check/neutron-tempest-dvr-ha-multinode-full/d27f183/logs/
16:05:42 <mlavalle> is this going to address also https://bugs.launchpad.net/neutron/+bug/1795870?
16:05:43 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:06:50 <slaweq> if You had in l3-agent logs something like: http://logs.openstack.org/24/618024/5/check/neutron-tempest-dvr-ha-multinode-full/d27f183/logs/subnode-2/screen-q-l3.txt.gz?level=ERROR then it should help
16:07:01 <slaweq> if not, then it's probably a different issue
16:07:34 <mlavalle> well
16:07:50 <mlavalle> I was thinking of a different approach
16:08:14 <mlavalle> we recently merged https://review.openstack.org/#/c/609924/
16:09:02 <mlavalle> this fixes a situation where the fip is associated to a port before the port is bound
16:09:22 <mlavalle> as a consequence, the fip is created in the snat node, right?
16:09:32 <slaweq> yep
16:10:02 <mlavalle> now, when the port is bound, the patch fixes the migration of the fip / port to the corresponding node
16:10:10 <mlavalle> a compute presumably
16:10:16 <mlavalle> right?
16:10:24 <slaweq> yep
16:10:49 <mlavalle> now, let's look at the code of test_trunk_subport_lifecycle:
16:11:14 <mlavalle> https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_trunk.py#L67
16:11:34 <mlavalle> it creates the port and associates it to a fip
16:11:53 <mlavalle> so in a dvr env, that fip is going to the snat node
16:12:07 <slaweq> yes
16:12:14 <mlavalle> and then the server is created in L76
16:12:23 <mlavalle> so the migration starts
16:12:54 <mlavalle> it is possible that the bug https://bugs.launchpad.net/neutron/+bug/1795870
16:12:55 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:13:24 <mlavalle> is a consequence of the fip not being ready in the compute, because it is migrating
16:13:46 <mlavalle> it is a race that is not fixed by https://review.openstack.org/#/c/609924/
16:13:56 <mlavalle> right?
16:14:36 <slaweq> it can be something like that
16:14:49 <slaweq> maybe this FIP is configured in both snat and fip- namespaces then?
16:14:57 <mlavalle> now the trunk test bug happens 95% of the time with DVR
16:15:04 <slaweq> as in our gates all nodes are dvr_snat nodes
16:15:36 <mlavalle> so I intend to change the test script
16:15:52 <mlavalle> to create and associate the fip after the server
16:15:58 <mlavalle> and see what happens
16:16:02 <mlavalle> makes sense?
16:16:28 <slaweq> yes, totally :)
16:16:39 <mlavalle> so I propose the following:
16:16:56 <mlavalle> 1) let's merge your patch first and see the effect on the trunk bug
16:17:20 <mlavalle> 2) Then let's change the trunk test script and see the effect
16:17:39 <mlavalle> this way we learn which fix is having an effect or not
16:17:51 <slaweq> ok, I will address comments on my patch today (or tomorrow morning)
16:18:11 <mlavalle> and the other thing that I propose is let's track the trunk test failure independently of the bug you are fixing
16:18:31 <mlavalle> because in the case of the trunk stuff, it might be something else
16:18:43 <slaweq> one more question, if it's like You described it, does it still happen so often, or maybe it's now fixed by https://review.openstack.org/#/c/609924/11 ?
16:18:47 <mlavalle> I mean the problem might be in the trunk code for example
16:19:19 <mlavalle> it's still happening after merging that patch
16:19:34 <mlavalle> I saw it yesterday
16:19:44 <slaweq> sure, I just raised this patch here as I saw issues like that in many different tests (trunk also) - the only common thing was a not reachable FIP
16:20:17 <slaweq> ok, so lets do it as You described and we will see how it goes
16:20:22 <mlavalle> I am just trying to be disciplined and merge fixes orderly, to learn what fixes what
16:20:37 <mlavalle> makes sense?
16:20:48 <slaweq> #action mlavalle to continue tracking not reachable FIP in trunk tests
16:20:55 <slaweq> sure, makes sense for me :)
16:21:16 <slaweq> I think we can go on to the next one then, right?
16:21:30 <mlavalle> yeap
16:21:38 <mlavalle> thanks for listening :-)
16:21:57 <slaweq> ok, so the next one is:
16:21:59 <slaweq> njohnston rename existing neutron-functional job to neutron-functional-python27 and switch neutron-functional to be py3
16:22:26 <slaweq> njohnston: any updates?
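For reference, a rough sketch of the test reordering mlavalle proposes above for test_trunk_subport_lifecycle, not the actual neutron-tempest-plugin code: boot the server first so the port gets bound to a compute host, and only associate the floating IP afterwards, so the FIP never has to migrate away from the SNAT node. The helper names (create_port, create_server, wait_for_server_active, create_and_associate_floatingip) are illustrative placeholders for whatever the test base class provides.

    # Illustrative sketch only; helper names are hypothetical, not the
    # real neutron-tempest-plugin base class methods.
    def _boot_server_then_attach_fip(self, network):
        # create the parent port for the trunk on the test network
        port = self.create_port(network)

        # boot the server on that port; port binding happens here, so the
        # port is bound to a compute host before any FIP exists
        server = self.create_server(ports=[port])
        self.wait_for_server_active(server)

        # only now create and associate the floating IP; with the port
        # already bound, the FIP is configured directly on the compute
        # node instead of landing on the SNAT node and migrating later
        fip = self.create_and_associate_floatingip(port)
        return server, fip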
16:22:29 <njohnston> bcafarel already had that underway before we had talked about it: https://review.openstack.org/#/c/577383/
16:23:03 <njohnston> It is still showing the same issue with subunit.parser that I have been stumped by: http://logs.openstack.org/83/577383/12/check/neutron-functional/e02bd4f/job-output.txt.gz#_2018-11-13_14_10_04_212536
16:23:24 <slaweq> bcafarel: are You around? Do You need help with this one?
16:23:45 <njohnston> but if I or anyone else has a eureka moment and figures out what in the functional test harness is not py3 ready, then the change is ready to go
16:24:01 <bcafarel> o/
16:24:20 <njohnston> bcafarel and I have talked about it; when last we chatted he also had not had luck finding the py3 string handling incompatibility that is causing the error
16:24:20 <bcafarel> yeah basically what njohnston said
16:24:49 <slaweq> ok, I will try to take a look at it this week if I have a few minutes
16:24:55 <bcafarel> I will try to catch this issue again, but will not mind at all if someone solves it in the meantime :)
16:25:27 <slaweq> ok, let's move forward then
16:25:30 <slaweq> njohnston make py3 etherpad
16:25:41 <slaweq> I guess it's https://etherpad.openstack.org/p/neutron_ci_python3, right?
16:25:42 <njohnston> As mentioned in the neutron team meeting, that is up
16:25:44 <njohnston> yep
16:25:49 <njohnston> #link https://etherpad.openstack.org/p/neutron_ci_python3
16:25:49 <slaweq> thx njohnston :)
16:26:49 <slaweq> I will go through it and will start doing some patches with conversion to py3 as I have some experience with it already
16:27:06 <njohnston> I need to flesh out the experimental jobs a bit
16:27:14 <njohnston> but I figure it will be a while before we get tot hose anyway
16:27:20 <njohnston> * to those
16:28:29 <slaweq> I think that we should revisit which experimental jobs we still need :)
16:29:08 <njohnston> yes, many of the legacy ones may be able to be removed, like legacy-tempest-dsvm-neutron-dvr-multinode-full
16:29:20 <slaweq> yes, I will take a look at them too
16:29:35 <slaweq> #action slaweq to check which experimental jobs can be removed
16:29:38 <njohnston> since that is covered already by neutron-tempest-dvr-ha-multinode-full in check and gate queues
16:29:53 <slaweq> #action slaweq to start migrating neutron CI jobs to zuul v3 syntax
16:30:33 <njohnston> Much appreciated slaweq, my attempts at zuul v3 conversions have taught me humility
16:30:44 <mlavalle> LOL
16:30:51 <slaweq> :)
16:31:14 <slaweq> I spent some time on converting neutron-tempest-plugin jobs to it so I understand You :)
16:31:33 <slaweq> ok, lets move to the next one then
16:31:35 <slaweq> njohnston check if grenade is ready for py3
16:31:52 <njohnston> Looks like Grenade has py3 support and has a zuul job defined for py3 by mriedem https://github.com/openstack-dev/grenade/commit/7bae489f38f8f0c82c8eb284d1841ef68d8e9a43
16:32:53 <mriedem> \o/
16:32:59 <slaweq> so we should just switch to use this one instead of what we are using now, right?
16:33:13 <njohnston> yes
16:33:15 <mriedem> still need https://review.openstack.org/#/c/617662/
16:33:24 <mriedem> but maybe unrelated to what you care about
16:35:09 <njohnston> No, I think that is excellent
16:35:16 <njohnston> I think we can probably make use of that template
16:35:48 <njohnston> So we can either base our jobs off of that zuul template or just use it outright
16:36:49 <slaweq> I'm now comparing our definition of the neutron-grenade job with the grenade-py3 job
16:37:08 <slaweq> I see only one difference, in grenade-py3 there is no openstack/neutron in required projects
16:37:29 <slaweq> will we have to add it if we use this job, or is it not necessary?
16:37:30 <mriedem> i believe that comes from legacy-dsvm-base
16:37:42 <mriedem> http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/zuul.d/jobs.yaml#n915
16:37:56 <mriedem> but i'm never really sure how that all works
16:38:09 <slaweq> ahh, so we don't need to define it in our template
16:38:23 <mriedem> i don't think it's needed,
16:38:27 <mriedem> grenade-py3 is stein only,
16:38:35 <mriedem> and devstack has defaulted to neutron since i think newton
16:38:37 <slaweq> thx mriedem
16:39:01 <slaweq> so IMO we could switch to use this template in neutron's .zuul.yaml file
16:39:09 <slaweq> njohnston: will You do it then?
16:39:54 <njohnston> definitely
16:39:58 <slaweq> thx
16:40:21 <mriedem> correct, neutron is the default since ocata, but doesn't matter for this anyway https://github.com/openstack-infra/devstack-gate/blob/master/devstack-vm-gate-wrap.sh#L201
16:40:25 <mriedem> *correction
16:40:26 <slaweq> #action njohnston to switch neutron to use integrated-gate-py35 with grenade-py3 job instead of our neutron-grenade job
16:40:45 <slaweq> thx mriedem, that sounds very good for us :)
16:41:15 <slaweq> ok, so lets move on to the next one then
16:41:17 <slaweq> slaweq to check Fullstack tests fails because process is not killed properly (bug 1798472)
16:41:18 <openstack> bug 1798472 in neutron "Fullstack tests fails because process is not killed properly" [High,In progress] https://launchpad.net/bugs/1798472 - Assigned to Slawek Kaplonski (slaweq)
16:41:32 <slaweq> I was checking that one
16:42:01 <slaweq> and what I found is that in some cases the openvswitch agent, or sometimes neutron-server, is not responding to SIGTERM at all
16:42:18 <slaweq> and then tests are failing in the cleanup phase as the process is not stopped properly and a timeout is raised
16:42:33 <njohnston> oh interesting
16:42:35 <mlavalle> didn't we approve a patch for that yesterday?
16:42:38 <slaweq> I did patch https://review.openstack.org/#/c/618024/ which should at least work around it in tests
16:42:42 <slaweq> mlavalle: yes :)
16:42:57 <slaweq> I just wanted to do an introduction for everyone ;)
16:43:13 <mlavalle> cool
16:43:31 <slaweq> for now it failed with some unrelated errors so I rechecked it
16:43:44 <slaweq> it should help with such issues in fullstack tests
16:43:57 <slaweq> ok, next one
16:43:59 <slaweq> mlavalle to check bug 1798475
16:44:00 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:44:17 <mlavalle> no time to check this one
16:44:50 <slaweq> do You plan to check that this week maybe?
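For reference, a minimal self-contained sketch of the kind of workaround discussed above for bug 1798472. This is not the actual content of https://review.openstack.org/#/c/618024/, only an illustration of a SIGTERM-then-SIGKILL fallback so a fullstack cleanup phase does not hang until its timeout when an agent or server process ignores SIGTERM.

    import os
    import signal
    import time


    def stop_process(pid, grace_period=5.0, poll_interval=0.5):
        """Send SIGTERM to pid and fall back to SIGKILL if it is ignored."""
        os.kill(pid, signal.SIGTERM)
        deadline = time.monotonic() + grace_period
        while time.monotonic() < deadline:
            try:
                os.kill(pid, 0)  # signal 0 only probes whether pid still exists
            except ProcessLookupError:
                return  # process exited cleanly after SIGTERM
            time.sleep(poll_interval)
        # still alive after the grace period: force-kill it
        os.kill(pid, signal.SIGKILL)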
16:45:05 <mlavalle> no, I don't think I'll have time this week
16:45:15 <mlavalle> Thursday and Friday are holidays
16:45:32 <slaweq> ok, I will assign it to me but I will check it only if I have some time for it
16:45:45 <slaweq> #action slaweq to check bug 1798475
16:45:47 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:46:01 <slaweq> and the last one was:
16:46:03 <mlavalle> Great, thanks!
16:46:03 <slaweq> slaweq to check why db_migration functional tests don't have logs
16:46:08 <slaweq> yw mlavalle :)
16:46:24 <slaweq> for this last one I didn't have time
16:46:33 <slaweq> so I will assign it to me for next week too
16:46:42 <slaweq> #action slaweq to check why db_migration functional tests don't have logs
16:46:57 <slaweq> ok, those were all the actions from the previous week
16:47:06 <slaweq> do You want to add anything?
16:47:37 <mlavalle> not from me
16:47:55 <slaweq> ok, lets move on then
16:47:58 <slaweq> #topic Python 3
16:48:11 <slaweq> I think we discussed most things already :)
16:48:16 <njohnston> indeed :-)
16:48:28 <slaweq> I just wanted to mention that fullstack tests are now running on py3 only:
16:48:30 <slaweq> #topic Python 3
16:48:36 <slaweq> #undo
16:48:37 <openstack> Removing item from minutes: #topic Python 3
16:48:39 <slaweq> https://review.openstack.org/#/c/604749/
16:49:10 <slaweq> so we don't have neutron-fullstack-python36 anymore
16:49:13 <slaweq> :)
16:49:36 <slaweq> thx bcafarel and njohnston for that one
16:49:40 <njohnston> Did someone already submit a change to remove that from the grafana dashboard, or shall I do it?
16:50:03 <slaweq> no, I think there is no such patch yet
16:50:12 <slaweq> would be good if You could do it :)
16:50:25 <slaweq> thx for remembering about that
16:50:41 <slaweq> #action njohnston to remove neutron-fullstack-python36 from grafana dashboard
16:50:45 <njohnston> Will do, and I'll prep a second one to match up with the change in functional tests with a proper depends-on
16:50:55 <slaweq> thx njohnston
16:51:07 <slaweq> ok, so speaking about grafana
16:51:09 <slaweq> #topic Grafana
16:51:15 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:53:24 <slaweq> there were quite big failure rates on the charts during the weekend, but there were not many jobs running then, so I don't think we should focus on those
16:53:53 <slaweq> other than that, I think most things are similar to how they were two weeks ago
16:54:21 <slaweq> one thing which I want to raise is the failing neutron-tempest-postgres-full periodic job
16:54:25 <slaweq> we should check that one
16:54:52 <mlavalle> ok
16:56:23 <slaweq> looks like some nova issue, e.g.: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/a52bcf9/job-output.txt.gz#_2018-11-18_06_59_45_183264
16:56:31 <slaweq> and nova logs: http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/a52bcf9/logs/screen-n-api.txt.gz?level=ERROR
16:56:43 <slaweq> mriedem: does it ring a bell for You ^^ ?
16:57:43 <mriedem> yes,
16:57:46 <mriedem> but should be fixed
16:58:15 <mriedem> oh nvm this is something else
16:58:31 <slaweq> last such issue is from today, http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-tempest-postgres-full/1de7427/logs/screen-n-api.txt.gz?level=ERROR
16:58:42 <mriedem> https://github.com/openstack/nova/commit/77881659251bdff52163ba1572e13a105eadaf7f
16:59:22 <mriedem> ok so the pg jobs are broken
16:59:25 <mriedem> has anyone reported a bug?
16:59:43 <slaweq> not me, I just noticed that in the periodic job
16:59:58 <slaweq> I will report a bug after this meeting
17:00:01 <mriedem> thanks
17:00:05 <slaweq> (which is just over) :)
17:00:11 <slaweq> ok, thx for attending
17:00:16 <slaweq> #endmeeting