16:00:08 <slaweq> #startmeeting neutron_ci
16:00:09 <slaweq> hi
16:00:10 <openstack> Meeting started Tue Oct 16 16:00:08 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 <openstack> The meeting name has been set to 'neutron_ci'
16:00:23 <mlavalle> o/
16:01:19 <manjeets_> o/
16:01:58 <slaweq> njohnston: haleyb: are You around for CI meeting?
16:02:11 <haleyb> hi, missed reminder
16:02:30 <slaweq> ok, lets start then
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:45 <slaweq> njohnston look into fullstack compilation of ovs still needed on bionic
16:03:38 <slaweq> I think njohnston is not around now so lets move to the next one
16:03:47 <slaweq> slaweq will create scenario/tempest/fullstack/functional jobs running on Bionic in experimental queue
16:03:58 <slaweq> I today pushed a patch for that https://review.openstack.org/610997
16:04:12 <slaweq> it's in experimental queue now and I will check how it will go
16:04:51 <slaweq> I will update status of it next week, ok?
16:04:56 <mlavalle> cool
16:05:14 <slaweq> #action slaweq to continue checking how jobs will run on Bionic nodes
16:05:24 <slaweq> ok, next one:
16:05:26 <slaweq> mlavalle will check failing trunk scenario test
16:05:47 <mlavalle> I've been working on this bug all this morning
16:06:06 <mlavalle> we have 33 hits over the past 7 days
16:06:42 <mlavalle> Most of those (~27) are in neutron-tempest-plugin-dvr-multinode-scenario
16:06:57 <slaweq> at least this job is non voting :)
16:07:21 <mlavalle> and the failure occurs after creating the servers when trying to check connectivity over the fip
16:07:28 <mlavalle> to one of the servers
16:07:43 <mlavalle> the test hasn't tried to do anything with the trunk yet
16:07:50 <mlavalle> other than creating the trunk
16:08:14 <mlavalle> so at this point I suspect the fip wiring in dvr
16:08:18 <slaweq> did You check how long vm was booting? Maybe it's some timeout again?
16:08:38 <mlavalle> no, the vms finish booting correctly
16:08:44 <mlavalle> both of them
16:08:49 <slaweq> ok
16:09:06 <mlavalle> and then we go and try to ssh to one of them using the fip
16:09:26 <mlavalle> checking connectivity before starting to play with the trunk
16:09:47 <slaweq> I think it is quite common cause of failures in scenario tests - FIP is not reachable
16:09:56 <mlavalle> yeah
16:10:11 <mlavalle> at this point it seems to me to be a generic failure of the fip
16:10:24 <slaweq> now the question is: how we can try to debug this issue?
16:10:38 <mlavalle> but I am taking it as a good opportunity to debug the neutron-tempest-plugin-dvr-multinode-scenario job
16:11:06 <haleyb> mlavalle: so you can't ping the FIP?
16:11:21 <mlavalle> haleyb: we can't ssh to it
16:12:20 <mlavalle> at this point I am checking to see if this has to do with in what node the instance / port lands
16:12:28 <mlavalle> controller vs compute
16:12:35 <haleyb> it sounds like the problem we were seeing before the revert... i guess we can start tracing the packets w/tcpdump, etc
16:12:54 <mlavalle> haleyb: yes, I suspect that
16:13:11 <mlavalle> I need to work this a little bit longer, though
16:13:20 <haleyb> ack
16:13:21 <slaweq> mlavalle: yes, You can ask infra-root to set such job on hold and then if job will fail You will have ssh access to nodes
16:13:46 <mlavalle> slaweq: yeap, I might try that
16:13:47 <slaweq> or do something "ugly" which I was doing when debugging dvr-multinode grenade issue
16:13:55 <mlavalle> at this point I am gathering more evidence
16:14:04 <slaweq> I send DNM patch where I was adding my ssh key to node
16:14:28 <slaweq> and then You can e.g. add some sleep() in test code and login to nodes and debug
16:14:37 <mlavalle> yeap
16:14:44 <slaweq> without asking infra-root for holding job for You :)
16:15:05 * mordred also has timeout issues with floating ip functional tests for openstacksdk - but I haven't ruled out just needing to rework those tests
16:15:32 <slaweq> I wonder if this will still be not reachable after login to host or will all works fine then :)
16:15:40 <mlavalle> will keep you posted mordred
16:16:09 <mordred> mlavalle: thanks! also let me know if I can be of any help
16:16:25 <mlavalle> that's as far as I've gotten with this. I'll continue working it
16:16:56 <slaweq> thx mlavalle for working on this issue :)
16:17:12 <slaweq> #action mlavalle to continue debugging issue with not reachable FIP in scenario jobs
16:18:06 <slaweq> ok, next one then
16:18:16 <slaweq> slaweq will try to reproduce and triage https://bugs.launchpad.net/neutron/+bug/1687027
16:18:16 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:18:28 <mlavalle> we got a fix, didn't we?
16:18:30 <slaweq> Patch is in the gate already https://review.openstack.org/610003
16:18:37 <mlavalle> \o/
16:18:41 <slaweq> mlavalle: yes, it's kind of fix :)
16:18:49 <slaweq> but we should be good with it I hope
16:19:25 <slaweq> and the last one from last week was:
16:19:27 <slaweq> mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:19:40 <mlavalle> dang, I forgot that one
16:19:55 <slaweq> no problem :)
16:19:56 <mlavalle> I'll do it at the end of this meeting
16:20:01 <slaweq> #action mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:20:27 <slaweq> ok, that's all from last week on my list
16:20:31 <slaweq> #topic Grafana
16:20:38 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:12 <slaweq> first of all - we are much better with grenade-multinode-dvr job now :)
16:22:17 <slaweq> from other things I see that functional tests were on quite high failure level recently
16:22:33 <slaweq> but it's related to this issue with db migrations which should be fine with my patch
16:22:51 <mlavalle> yeap
16:23:15 <mlavalle> overall, the picture looks good to me
16:23:34 <slaweq> yes, most of jobs are below 20% of failures
16:24:05 <slaweq> so lets talk about few specific issues which I found last week and want to raise here
16:24:16 <slaweq> #topic fullstack/functional
16:24:38 <slaweq> about fullstack tests, I found few times issue with cleanup processes
16:24:41 <slaweq> e.g.:
16:24:50 <slaweq> http://logs.openstack.org/97/602497/5/check/neutron-fullstack/f110a1f/logs/testr_results.html.gz
16:24:59 <slaweq> http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/testr_results.html.gz
16:25:08 <slaweq> so various tests but similar failure
16:26:43 <slaweq> here it looks that some process wasn't exited properly: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_.txt.gz#_2018-10-16_02_43_49_755
16:26:47 <slaweq> at least for me
16:27:42 <slaweq> and in this example it looks that it is openvswitch-agent: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_/neutron-openvswitch-agent--2018-10-16--02-42-43-987526.txt.gz
16:28:31 <slaweq> I don't know if that is an issue with tests or with ovs-agent process really
16:30:00 <slaweq> Looking at logs of this ovs agent it looks that there is no log like "Agent caught SIGTERM, quitting daemon loop." at the end
16:30:11 <slaweq> it is in other agents' logs but not in this one
16:30:11 <mlavalle> so, are these cleanup issues causing tests to fail?
16:30:44 <slaweq> mlavalle: yes, it looks that test is waiting for process to be exited and then timeout exception is raised
16:30:59 <mlavalle> that makes sense
16:32:42 <slaweq> I was thinking that I will open a bug for that to not forget about this and maybe there will be some volunteer to work on it :)
16:32:55 <slaweq> if I will have some cycles, I will try to debug it more
16:32:56 <mlavalle> that's a good idea
16:33:08 <slaweq> ok
16:33:29 <mlavalle> open the bug and if nobody volunteers, please remind us in the next meeting
16:33:38 <slaweq> #action slaweq to report a bug about issue with process ending in fullstack tests
16:33:41 <mlavalle> if I'm done with the trunk one
16:33:49 <slaweq> ok, thx mlavalle
16:33:56 <mlavalle> and there is nothing more urgent, I'll take it
16:34:01 <slaweq> I will remind it for sure if will be necessary
16:34:42 <slaweq> ok, lets move on to scenario jobs now
16:35:11 <slaweq> #topic Tempest/Scenario
16:35:25 <slaweq> first of all we still have this bug: https://bugs.launchpad.net/neutron/+bug/1789434
16:35:25 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
16:35:33 <slaweq> and I know that manjeets is working on it
16:35:40 <slaweq> any updates manjeets?
16:37:15 <slaweq> ok, I guess no
16:37:18 <slaweq> so let's move on
16:37:57 <slaweq> in neutron-tempest-plugin-scenario-linuxbridge I spotted at least two times issue with failed to transition FIP to down
16:38:00 <slaweq> for example:
16:38:08 <slaweq> http://logs.openstack.org/67/597567/23/check/neutron-tempest-plugin-scenario-linuxbridge/dc21cb0/testr_results.html.gz
16:38:17 <slaweq> http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/testr_results.html.gz
16:39:00 <slaweq> haleyb: I think You were debugging similar issue some time ago, right?
16:39:25 <haleyb> slaweq: yes, and i thought the change merged
16:39:31 <slaweq> and You found that this BUILD state is coming from nova
16:39:45 <slaweq> I think Your patch was merged, right
16:39:57 <slaweq> but it looks that this problem still happens
16:39:59 <haleyb> it's actually the state on the associated compute port
16:40:23 <haleyb> i think it did merge as that looks like the new code
16:40:34 <mlavalle> it merged
16:41:01 <mlavalle> I remember we got stuck a little with naming a function
16:41:41 <slaweq> and what is in logs, is that this FIP was set to DOWN here: http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/controller/logs/screen-q-svc.txt.gz#_Oct_16_01_28_04_498279
16:41:54 <slaweq> and test failed around 2018-10-16 01:27:59,893
16:42:21 <slaweq> maybe node was overloaded and something took longer time simply?
16:42:38 <haleyb> the test still checks the port the FIP is attached to, not the FIP itself, so maybe it was not the best fix?
16:42:46 <haleyb> we wait 120 seconds
16:43:15 <slaweq> haleyb: can I assign it to You as an action for next week?
16:43:23 <slaweq> will You be able to take a look on it?
16:43:27 <haleyb> slaweq: sure
16:43:31 <slaweq> thx a lot
16:43:59 <slaweq> #action haleyb to check issue with failing FIP transition to down state
16:44:43 <slaweq> basically I didn't find other failures related to neutron in jobs which I was checking
16:45:11 <slaweq> there are some issues with volumes and some issues with host mapping to cells in multinode jobs but that's not neutron issues
16:45:28 <slaweq> I have one more short thing to mention about scenario jobs
16:45:41 <slaweq> I have patch to make such jobs run faster: https://review.openstack.org/#/c/609762/
16:46:09 <slaweq> basically I proposed to remove config option "is_image_advanced" and use "advanced_image_ref" instead
16:46:19 <slaweq> and use this advanced image only in tests which really needs it
16:46:28 <slaweq> we have 3 such tests currently
16:46:36 <slaweq> all other tests will use Cirros image
16:47:13 <slaweq> comparing jobs' running time:
16:47:35 <slaweq> neutron-tempest-plugin-dvr-multinode-scenario - 2h 11m 27s without my patch and 1h 22m 34s on my patch
16:48:00 <slaweq> neutron-tempest-plugin-scenario-linuxbridge - 1h 39m 48s without this patch and 1h 09m 23s with patch
16:48:11 <slaweq> so difference is quite big IMO
16:48:15 <mlavalle> yeah
16:48:27 <slaweq> please review this patch if You will have some time :)
16:48:37 <mlavalle> will do
16:48:54 <haleyb> me too
16:49:26 <slaweq> ok, that's all from my side for today
16:49:28 <slaweq> #topic Open discussion
16:49:34 <slaweq> anyone wants to talk about something?
16:49:35 <clarkb> according to e-r, http://status.openstack.org/elastic-recheck/data/integrated_gate.html, there are some neutron fullstack failures too (the functional job at the top of the list should have a fix out there now)
16:50:46 <manjeets> slaweq sorry I was on other internal meeting as i discussed with I need to get into l3 and l2 agents code to point the notification issue
16:51:20 <clarkb> more of an fyi than anything else since you were talking about neutron fails above
16:51:22 <slaweq> clarkb: we already talked about issue like is e.g. in http://logs.openstack.org/04/599604/10/gate/neutron-fullstack/3c364de/logs/testr_results.html.gz
16:51:37 <slaweq> and it looks that it's the reason of 3 of 4 issues pointed there
16:52:34 <slaweq> and those 2: http://logs.openstack.org/03/610003/2/gate/neutron-fullstack-python36/b7f131a/logs/testr_results.html.gz and http://logs.openstack.org/61/608361/2/gate/neutron-fullstack-python36/39f72ae/logs/testr_results.html.gz
16:52:46 <slaweq> looks at first glance that it's again the same culprit
16:53:09 <slaweq> test fails as it tried to kill agent and that was not finished with success
16:53:44 <slaweq> so as I said, I will open a bug report for that for now, and will try to debug it if I will have time for it
16:54:03 <clarkb> ok, you might also want to add e-r queries for known issues if they will persist for longer than a few days
16:54:22 <clarkb> helps others identify problems that have direct effect on our ability to merge code and hopefully motivates people to fix them too :)
16:54:25 <slaweq> clarkb: ok, I will add it, thx
16:54:39 <mlavalle> clarkb: thanks for bringing this up
16:54:57 <mlavalle> this is the perfect forum for CI issues
16:55:05 <mlavalle> and we meet every week at this time
16:55:10 <mlavalle> you are always welcome
16:55:59 <slaweq> #action slaweq to add e-r query for known fullstack issue (when bug will be reported)
16:56:40 <slaweq> anything else or can we finish few minutes before time? :)
16:56:48 <mlavalle> not from me
16:57:34 <slaweq> thx for attending
16:57:39 <slaweq> #endmeeting