16:00:08 <slaweq> #startmeeting neutron_ci
16:00:09 <slaweq> hi
16:00:10 <openstack> Meeting started Tue Oct 16 16:00:08 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:12 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 <openstack> The meeting name has been set to 'neutron_ci'
16:00:23 <mlavalle> o/
16:01:19 <manjeets_> o/
16:01:58 <slaweq> njohnston: haleyb: are You around for CI meeting?
16:02:11 <haleyb> hi, missed reminder
16:02:30 <slaweq> ok, lets start then
16:02:36 <slaweq> #topic Actions from previous meetings
16:02:45 <slaweq> njohnston look into fullstack compilation of ovs still needed on bionic
16:03:38 <slaweq> I think njohnston is not around now so lets move to the next one
16:03:47 <slaweq> slaweq will create scenario/tempest/fullstack/functional jobs running on Bionic in experimental queue
16:03:58 <slaweq> I today pushed a patch for that https://review.openstack.org/610997
16:04:12 <slaweq> it's in experimental queue now and I will check how it will go
16:04:51 <slaweq> I will update status of it next week, ok?
16:04:56 <mlavalle> cool
16:05:14 <slaweq> #action slaweq to continue checking how jobs will run on Bionic nodes
16:05:24 <slaweq> ok, next one:
16:05:26 <slaweq> mlavalle will check failing trunk scenario test
16:05:47 <mlavalle> I've been working on this bug all this morning
16:06:06 <mlavalle> we have 33 hits over the past 7 days
16:06:42 <mlavalle> Most of those (~27) are in neutron-tempest-plugin-dvr-multinode-scenario
16:06:57 <slaweq> at least this job is non voting :)
16:07:21 <mlavalle> and the failure occurs after creating the servers when trying to check connectivity over the fip
16:07:28 <mlavalle> to one of the servers
16:07:43 <mlavalle> the test hasn't tried to do anything with the trunk yet
16:07:50 <mlavalle> other than creating the trunk
16:08:14 <mlavalle> so at this point I suspect the fip wiring in dvr
16:08:18 <slaweq> did You check how long vm was booting? Maybe it's some timeout again?
16:08:38 <mlavalle> no, the vms finish booting correctly
16:08:44 <mlavalle> both of them
16:08:49 <slaweq> ok
16:09:06 <mlavalle> and then we go and try to ssh to one of them using the fip
16:09:26 <mlavalle> checking connectivity before starting to play with the trunk
16:09:47 <slaweq> I think it is quite common cause of failures in scenario tests - FIP is not reachable
16:09:56 <mlavalle> yeah
16:10:11 <mlavalle> at this point it seems to me to be a generic failure of the fip
16:10:24 <slaweq> now the question is: how we can try to debug this issue?
16:10:38 <mlavalle> but I am taking it as a good opportunity to debug the neutron-tempest-plugin-dvr-multinode-scenario job
16:11:06 <haleyb> mlavalle: so you can't ping the FIP?
16:11:21 <mlavalle> haleyb: we can't ssh to it
16:12:20 <mlavalle> at this point I am checking to see if this has to do with in what node the instance / port lands
16:12:28 <mlavalle> controller vs compute
16:12:35 <haleyb> it sounds like the problem we were seeing before the revert... i guess we can start tracing the packets w/tcpdump, etc
16:12:54 <mlavalle> haleyb: yes, I suspect that
16:13:11 <mlavalle> I need to work this a little bit longer, though
16:13:20 <haleyb> ack
16:13:21 <slaweq> mlavalle: yes, You can ask infra-root to set such job on hold and then if job will fail You will have ssh access to nodes
16:13:46 <mlavalle> slaweq: yeap, I might try that
16:13:47 <slaweq> or do something "ugly" which I was doing when debugging dvr-multinode grenade issue
16:13:55 <mlavalle> at this point I am gathering more evidence
16:14:04 <slaweq> I send DNM patch where I was adding my ssh key to node
16:14:28 <slaweq> and then You can e.g. add some sleep() in test code and login to nodes and debug
16:14:37 <mlavalle> yeap
16:14:44 <slaweq> without asking infra-root for holding job for You :)
16:15:05 * mordred also has timeout issues with floating ip functional tests for openstacksdk - but I haven't ruled out just needing to rework those tests
16:15:32 <slaweq> I wonder if this will still be not reachable after login to host or will all works fine then :)
16:15:40 <mlavalle> will keep you posted mordred
16:16:09 <mordred> mlavalle: thanks! also let me know if I can be of any help
16:16:25 <mlavalle> that's as far as I've gotten with this. I'll continue working it
16:16:56 <slaweq> thx mlavalle for working on this issue :)
16:17:12 <slaweq> #action mlavalle to continue debugging issue with not reachable FIP in scenario jobs
16:18:06 <slaweq> ok, next one then
16:18:16 <slaweq> slaweq will try to reproduce and triage https://bugs.launchpad.net/neutron/+bug/1687027
16:18:16 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:18:28 <mlavalle> we got a fix, didn't we?
16:18:30 <slaweq> Patch is in the gate already https://review.openstack.org/610003
16:18:37 <mlavalle> \o/
16:18:41 <slaweq> mlavalle: yes, it's kind of fix :)
16:18:49 <slaweq> but we should be good with it I hope
16:19:25 <slaweq> and the last one from last week was:
16:19:27 <slaweq> mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:19:40 <mlavalle> dang, I forgot that one
16:19:55 <slaweq> no problem :)
16:19:56 <mlavalle> I'll do it at the end of this meeting
16:20:01 <slaweq> #action mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:20:27 <slaweq> ok, that's all from last week on my list
16:20:31 <slaweq> #topic Grafana
16:20:38 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:12 <slaweq> first of all - we are much better with grenade-multinode-dvr job now :)
16:22:17 <slaweq> from other things I see that functional tests were on quite high failure level recently
16:22:33 <slaweq> but it's related to this issue with db migrations which should be fine with my patch
16:22:51 <mlavalle> yeap
16:23:15 <mlavalle> overall, the picture looks good to me
16:23:34 <slaweq> yes, most of jobs are below 20% of failures
16:24:05 <slaweq> so lets talk about few specific issues which I found last week and want to raise here
16:24:16 <slaweq> #topic fullstack/functional
16:24:38 <slaweq> about fullstack tests, I found few times issue with cleanup processes
16:24:41 <slaweq> e.g.:
16:24:50 <slaweq> http://logs.openstack.org/97/602497/5/check/neutron-fullstack/f110a1f/logs/testr_results.html.gz
16:24:59 <slaweq> http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/testr_results.html.gz
16:25:08 <slaweq> so various tests but similar failure
16:26:43 <slaweq> here it looks that some process wasn't exited properly: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_.txt.gz#_2018-10-16_02_43_49_755
16:26:47 <slaweq> at least for me
16:27:42 <slaweq> and in this example it looks that it is openvswitch-agent: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_/neutron-openvswitch-agent--2018-10-16--02-42-43-987526.txt.gz
16:28:31 <slaweq> I don't know if that is an issue with tests or with ovs-agent process really
16:30:00 <slaweq> Looking at logs of this ovs agent it looks that there is no log like "Agent caught SIGTERM, quitting daemon loop." at the end
16:30:11 <slaweq> it is in other agents' logs but not in this one
16:30:11 <mlavalle> so, are these cleanup issues causing tests to fail?
16:30:44 <slaweq> mlavalle: yes, it looks that test is waiting for process to be exited and then timeout exception is raised
16:30:59 <mlavalle> that makes sense
16:32:42 <slaweq> I was thinking that I will open a bug for that to not forget about this and maybe there will be some volunteer to work on it :)
16:32:55 <slaweq> if I will have some cycles, I will try to debug it more
16:32:56 <mlavalle> that's a good idea
16:33:08 <slaweq> ok
16:33:29 <mlavalle> open the bug and if nobody volunteers, please remind us in the next meeting
16:33:38 <slaweq> #action slaweq to report a bug about issue with process ending in fullstack tests
16:33:41 <mlavalle> if I'm done with the trunk one
16:33:49 <slaweq> ok, thx mlavalle
16:33:56 <mlavalle> and there is nothing more urgent, I'll take it
16:34:01 <slaweq> I will remind it for sure if will be necessary
16:34:42 <slaweq> ok, lets move on to scenario jobs now
16:35:11 <slaweq> #topic Tempest/Scenario
16:35:25 <slaweq> first of all we still have this bug: https://bugs.launchpad.net/neutron/+bug/1789434
16:35:25 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
16:35:33 <slaweq> and I know that manjeets is working on it
16:35:40 <slaweq> any updates manjeets?
16:37:15 <slaweq> ok, I guess no
16:37:18 <slaweq> so let's move on
16:37:57 <slaweq> in neutron-tempest-plugin-scenario-linuxbridge I spotted at least two times issue with failed to transition FIP to down
16:38:00 <slaweq> for example:
16:38:08 <slaweq> http://logs.openstack.org/67/597567/23/check/neutron-tempest-plugin-scenario-linuxbridge/dc21cb0/testr_results.html.gz
16:38:17 <slaweq> http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/testr_results.html.gz
16:39:00 <slaweq> haleyb: I think You were debugging similar issue some time ago, right?
16:39:25 <haleyb> slaweq: yes, and i thought the change merged
16:39:31 <slaweq> and You found that this BUILD state is coming from nova
16:39:45 <slaweq> I think Your patch was merged, right
16:39:57 <slaweq> but it looks that this problem still happens
16:39:59 <haleyb> it's actually the state on the associated compute port
16:40:23 <haleyb> i think it did merge as that looks like the new code
16:40:34 <mlavalle> it merged
16:41:01 <mlavalle> I remember we got stuck a little with naming a function
16:41:41 <slaweq> and what is in logs, is that this FIP was set to DOWN here: http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/controller/logs/screen-q-svc.txt.gz#_Oct_16_01_28_04_498279
16:41:54 <slaweq> and test failed around 2018-10-16 01:27:59,893
16:42:21 <slaweq> maybe node was overloaded and something took longer time simply?
16:42:38 <haleyb> the test still checks the port the FIP is attached to, not the FIP itself, so maybe it was not the best fix?
16:42:46 <haleyb> we wait 120 seconds
16:43:15 <slaweq> haleyb: can I assign it to You as an action for next week?
16:43:23 <slaweq> will You be able to take a look on it?
16:43:27 <haleyb> slaweq: sure
16:43:31 <slaweq> thx a lot
16:43:59 <slaweq> #action haleyb to check issue with failing FIP transition to down state
16:44:43 <slaweq> basically I didn't find other failures related to neutron in jobs which I was checking
16:45:11 <slaweq> there are some issues with volumes and some issues with host mapping to cells in multinode jobs but that's not neutron issues
16:45:28 <slaweq> I have one more short thing to mention about scenario jobs
16:45:41 <slaweq> I have patch to make such jobs run faster: https://review.openstack.org/#/c/609762/
16:46:09 <slaweq> basically I proposed to remove config option "is_image_advanced" and use "advanced_image_ref" instead
16:46:19 <slaweq> and use this advanced image only in tests which really needs it
16:46:28 <slaweq> we have 3 such tests currently
16:46:36 <slaweq> all other tests will use Cirros image
16:47:13 <slaweq> comparing jobs' running time:
16:47:35 <slaweq> neutron-tempest-plugin-dvr-multinode-scenario - 2h 11m 27s without my patch and 1h 22m 34s on my patch
16:48:00 <slaweq> neutron-tempest-plugin-scenario-linuxbridge - 1h 39m 48s without this patch and 1h 09m 23s with patch
16:48:11 <slaweq> so difference is quite big IMO
16:48:15 <mlavalle> yeah
16:48:27 <slaweq> please review this patch if You will have some time :)
16:48:37 <mlavalle> will do
16:48:54 <haleyb> me too
16:49:26 <slaweq> ok, that's all from my side for today
16:49:28 <slaweq> #topic Open discussion
16:49:34 <slaweq> anyone wants to talk about something?
16:49:35 <clarkb> according to e-r, http://status.openstack.org/elastic-recheck/data/integrated_gate.html, there are some neutron fullstack failures too (the functional job at the top of the list should have a fix out there now)
16:50:46 <manjeets> slaweq sorry I was on other internal meeting as i discussed with I need to get into l3 and l2 agents code to point the notification issue
16:51:20 <clarkb> more of an fyi than anything else since you were talking about neutron fails above
16:51:22 <slaweq> clarkb: we already talked about issue like is e.g. in http://logs.openstack.org/04/599604/10/gate/neutron-fullstack/3c364de/logs/testr_results.html.gz
16:51:37 <slaweq> and it looks that it's the reason of 3 of 4 issues pointed there
16:52:34 <slaweq> and those 2: http://logs.openstack.org/03/610003/2/gate/neutron-fullstack-python36/b7f131a/logs/testr_results.html.gz and http://logs.openstack.org/61/608361/2/gate/neutron-fullstack-python36/39f72ae/logs/testr_results.html.gz
16:52:46 <slaweq> looks at first glance that it's again the same culprit
16:53:09 <slaweq> test fails as it tried to kill agent and that was not finished with success
16:53:44 <slaweq> so as I said, I will open a bug report for that for now, and will try to debug it if I will have time for it
16:54:03 <clarkb> ok, you might also want to add e-r queries for known issues if they will persist for longer than a few days
16:54:22 <clarkb> helps others identify problems that have direct effect on our ability to merge code and hopefully motivates people to fix them too :)
16:54:25 <slaweq> clarkb: ok, I will add it, thx
16:54:39 <mlavalle> clarkb: thanks for bringing this up
16:54:57 <mlavalle> this is the perfect forum for CI issues
16:55:05 <mlavalle> and we meet every week at this time
16:55:10 <mlavalle> you are always welcome
16:55:59 <slaweq> #action slaweq to add e-r query for known fullstack issue (when bug will be reported)
16:56:40 <slaweq> anything else or can we finish few minutes before time? :)
16:56:48 <mlavalle> not from me
16:57:34 <slaweq> thx for attending
16:57:39 <slaweq> #endmeeting