16:00:08 #startmeeting neutron_ci
16:00:09 hi
16:00:10 Meeting started Tue Oct 16 16:00:08 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:12 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 The meeting name has been set to 'neutron_ci'
16:00:23 o/
16:01:19 o/
16:01:58 njohnston: haleyb: are You around for the CI meeting?
16:02:11 hi, missed the reminder
16:02:30 ok, let's start then
16:02:36 #topic Actions from previous meetings
16:02:45 njohnston look into fullstack compilation of ovs still needed on bionic
16:03:38 I think njohnston is not around now, so let's move to the next one
16:03:47 slaweq will create scenario/tempest/fullstack/functional jobs running on Bionic in experimental queue
16:03:58 I pushed a patch for that today: https://review.openstack.org/610997
16:04:12 it's in the experimental queue now and I will check how it goes
16:04:51 I will update its status next week, ok?
16:04:56 cool
16:05:14 #action slaweq to continue checking how jobs will run on Bionic nodes
16:05:24 ok, next one:
16:05:26 mlavalle will check failing trunk scenario test
16:05:47 I've been working on this bug all this morning
16:06:06 we have 33 hits over the past 7 days
16:06:42 Most of those (~27) are in neutron-tempest-plugin-dvr-multinode-scenario
16:06:57 at least this job is non-voting :)
16:07:21 and the failure occurs after creating the servers, when trying to check connectivity over the fip
16:07:28 to one of the servers
16:07:43 the test hasn't tried to do anything with the trunk yet
16:07:50 other than creating the trunk
16:08:14 so at this point I suspect the fip wiring in dvr
16:08:18 did You check how long the vm was booting? Maybe it's some timeout again?
16:08:38 no, the vms finish booting correctly
16:08:44 both of them
16:08:49 ok
16:09:06 and then we go and try to ssh to one of them using the fip
16:09:26 checking connectivity before starting to play with the trunk
16:09:47 I think it is quite a common cause of failures in scenario tests - FIP is not reachable
16:09:56 yeah
16:10:11 at this point it seems to me to be a generic failure of the fip
16:10:24 now the question is: how can we try to debug this issue?
16:10:38 but I am taking it as a good opportunity to debug the neutron-tempest-plugin-dvr-multinode-scenario job
16:11:06 mlavalle: so you can't ping the FIP?
16:11:21 haleyb: we can't ssh to it
16:12:20 at this point I am checking to see if this has to do with which node the instance / port lands on
16:12:28 controller vs compute
16:12:35 it sounds like the problem we were seeing before the revert... i guess we can start tracing the packets w/tcpdump, etc
16:12:54 haleyb: yes, I suspect that
16:13:11 I need to work this a little bit longer, though
16:13:20 ack
16:13:21 mlavalle: yes, You can ask infra-root to put such a job on hold and then, if the job fails, You will have ssh access to the nodes
16:13:46 slaweq: yeap, I might try that
16:13:47 or do something "ugly" which I was doing when debugging the dvr-multinode grenade issue
16:13:55 at this point I am gathering more evidence
16:14:04 I sent a DNM patch where I was adding my ssh key to the node
16:14:28 and then You can e.g. add some sleep() in the test code and log in to the nodes to debug
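(For reference, a minimal sketch of the "ugly" DNM trick described above: pause the failing check so the node, reachable with the ssh key added by the DNM patch, stays around for interactive debugging. The ssh_client argument and the one-hour hold are illustrative assumptions, not the actual neutron-tempest-plugin code.)

```python
import time


def check_connectivity_then_hold(ssh_client, hold_seconds=3600):
    """Run the failing ssh check; on failure, hold the node for debugging.

    ssh_client is assumed to be the test's remote client for the FIP
    (e.g. a tempest RemoteClient). hold_seconds is an arbitrary pause,
    long enough to log in over ssh and inspect router namespaces,
    OVS flows, iptables rules, etc. before the job tears everything down.
    """
    try:
        ssh_client.exec_command('hostname')
    except Exception:
        time.sleep(hold_seconds)  # keep the node and its broken state alive
        raise
```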
16:14:37 yeap
16:14:44 without asking infra-root to hold the job for You :)
16:15:05 * mordred also has timeout issues with floating ip functional tests for openstacksdk - but I haven't ruled out just needing to rework those tests
16:15:32 I wonder if this will still be not reachable after logging in to the host, or if it will all work fine then :)
16:15:40 will keep you posted mordred
16:16:09 mlavalle: thanks! also let me know if I can be of any help
16:16:25 that's as far as I've gotten with this. I'll continue working it
16:16:56 thx mlavalle for working on this issue :)
16:17:12 #action mlavalle to continue debugging issue with unreachable FIP in scenario jobs
16:18:06 ok, next one then
16:18:16 slaweq will try to reproduce and triage https://bugs.launchpad.net/neutron/+bug/1687027
16:18:16 Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:18:28 we got a fix, didn't we?
16:18:30 Patch is in the gate already: https://review.openstack.org/610003
16:18:37 \o/
16:18:41 mlavalle: yes, it's kind of a fix :)
16:18:49 but we should be good with it I hope
16:19:25 and the last one from last week was:
16:19:27 mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:19:40 dang, I forgot that one
16:19:55 no problem :)
16:19:56 I'll do it at the end of this meeting
16:20:01 #action mlavalle to send an email about moving tempest plugins from stadium to separate repo
16:20:27 ok, that's all from last week on my list
16:20:31 #topic Grafana
16:20:38 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:21:12 first of all - we are much better with the grenade-multinode-dvr job now :)
16:22:17 among other things, I see that functional tests were at quite a high failure level recently
16:22:33 but that's related to the issue with db migrations, which should be fine with my patch
16:22:51 yeap
16:23:15 overall, the picture looks good to me
16:23:34 yes, most jobs are below a 20% failure rate
16:24:05 so let's talk about a few specific issues which I found last week and want to raise here
16:24:16 #topic fullstack/functional
16:24:38 in fullstack tests, I found a few occurrences of an issue with cleanup of processes
16:24:41 e.g.:
16:24:50 http://logs.openstack.org/97/602497/5/check/neutron-fullstack/f110a1f/logs/testr_results.html.gz
16:24:59 http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/testr_results.html.gz
16:25:08 so various tests but a similar failure
16:26:43 here it looks like some process wasn't exited properly: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_.txt.gz#_2018-10-16_02_43_49_755
16:26:47 at least for me
16:27:42 and in this example it looks like it is the openvswitch-agent: http://logs.openstack.org/68/564668/7/check/neutron-fullstack-python36/c4223c2/logs/dsvm-fullstack-logs/TestOvsConnectivitySameNetwork.test_connectivity_GRE-l2pop-arp_responder,openflow-native_/neutron-openvswitch-agent--2018-10-16--02-42-43-987526.txt.gz
16:28:31 I don't know if that is an issue with the tests or with the ovs-agent process really
16:30:00 Looking at the logs of this ovs agent, it looks like there is no log such as "Agent caught SIGTERM, quitting daemon loop." at the end
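(As a rough illustration of the wait-and-timeout pattern under discussion - not the actual fullstack fixture code - the test side looks roughly like the sketch below: send SIGTERM to the agent process and poll until it exits, raising if it never does. An agent that never logs the SIGTERM shutdown message, as the ovs-agent log above suggests, would make this time out. The `process` argument is assumed to be a subprocess.Popen-like handle.)

```python
import signal
import time


def stop_agent(process, timeout=60):
    """Send SIGTERM and wait for the agent process to exit."""
    process.send_signal(signal.SIGTERM)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if process.poll() is not None:  # poll() returns the exit code once gone
            return
        time.sleep(1)
    # This is the kind of timeout the failing fullstack tests appear to hit
    # when the agent ignores SIGTERM and never quits its daemon loop.
    raise RuntimeError("agent pid %d did not exit within %ss after SIGTERM"
                       % (process.pid, timeout))
```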
16:30:11 it is in the other agents' logs but not in this one
16:30:11 so, are these cleanup issues causing tests to fail?
16:30:44 mlavalle: yes, it looks like the test is waiting for the process to exit and then a timeout exception is raised
16:30:59 that makes sense
16:32:42 I was thinking that I will open a bug for that so we don't forget about this, and maybe there will be some volunteer to work on it :)
16:32:55 if I have some cycles, I will try to debug it more
16:32:56 that's a good idea
16:33:08 ok
16:33:29 open the bug and if nobody volunteers, please remind us in the next meeting
16:33:38 #action slaweq to report a bug about the issue with processes not ending in fullstack tests
16:33:41 if I'm done with the trunk one
16:33:49 ok, thx mlavalle
16:33:56 and there is nothing more urgent, I'll take it
16:34:01 I will remind it for sure if necessary
16:34:42 ok, let's move on to scenario jobs now
16:35:11 #topic Tempest/Scenario
16:35:25 first of all we still have this bug: https://bugs.launchpad.net/neutron/+bug/1789434
16:35:25 Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
16:35:33 and I know that manjeets is working on it
16:35:40 any updates manjeets?
16:37:15 ok, I guess no
16:37:18 so let's move on
16:37:57 in neutron-tempest-plugin-scenario-linuxbridge I spotted, at least two times, an issue with a failure to transition a FIP to down
16:38:00 for example:
16:38:08 http://logs.openstack.org/67/597567/23/check/neutron-tempest-plugin-scenario-linuxbridge/dc21cb0/testr_results.html.gz
16:38:17 http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/testr_results.html.gz
16:39:00 haleyb: I think You were debugging a similar issue some time ago, right?
16:39:25 slaweq: yes, and i thought the change merged
16:39:31 and You found that this BUILD state is coming from nova
16:39:45 I think Your patch was merged, right
16:39:57 but it looks like this problem still happens
16:39:59 it's actually the state on the associated compute port
16:40:23 i think it did merge as that looks like the new code
16:40:34 it merged
16:41:01 I remember we got stuck a little with naming a function
16:41:41 and what is in the logs is that this FIP was set to DOWN here: http://logs.openstack.org/97/602497/5/gate/neutron-tempest-plugin-scenario-linuxbridge/52c2890/controller/logs/screen-q-svc.txt.gz#_Oct_16_01_28_04_498279
16:41:54 and the test failed around 2018-10-16 01:27:59,893
16:42:21 maybe the node was overloaded and something simply took longer?
16:42:38 the test still checks the port the FIP is attached to, not the FIP itself, so maybe it was not the best fix?
16:42:46 we wait 120 seconds
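(The check being discussed amounts to polling the port behind the floating IP, roughly as in the sketch below; the helper name and client calls are illustrative assumptions, not the exact neutron-tempest-plugin code. `ports_client` is assumed to be a tempest network ports client.)

```python
import time


def wait_for_port_down(ports_client, port_id, timeout=120, interval=5):
    """Poll the port behind the FIP until it reports DOWN, up to 120s."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        port = ports_client.show_port(port_id)['port']
        if port['status'] == 'DOWN':
            return port
        time.sleep(interval)
    raise AssertionError("Port %s did not transition to DOWN within %ss"
                         % (port_id, timeout))
```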
16:43:15 haleyb: can I assign it to You as an action for next week?
16:43:23 will You be able to take a look at it?
16:43:27 slaweq: sure
16:43:31 thx a lot
16:43:59 #action haleyb to check issue with failing FIP transition to down state
16:44:43 basically I didn't find other failures related to neutron in the jobs which I was checking
16:45:11 there are some issues with volumes and some issues with host mapping to cells in multinode jobs, but those are not neutron issues
16:45:28 I have one more short thing to mention about scenario jobs
16:45:41 I have a patch to make such jobs run faster: https://review.openstack.org/#/c/609762/
16:46:09 basically I proposed to remove the config option "is_image_advanced" and use "advanced_image_ref" instead
16:46:19 and use this advanced image only in the tests which really need it
16:46:28 we have 3 such tests currently
16:46:36 all other tests will use the Cirros image
16:47:13 comparing jobs' running times:
16:47:35 neutron-tempest-plugin-dvr-multinode-scenario - 2h 11m 27s without my patch and 1h 22m 34s with my patch
16:48:00 neutron-tempest-plugin-scenario-linuxbridge - 1h 39m 48s without this patch and 1h 09m 23s with the patch
16:48:11 so the difference is quite big IMO
16:48:15 yeah
16:48:27 please review this patch if You have some time :)
16:48:37 will do
16:48:54 me too
16:49:26 ok, that's all from my side for today
16:49:28 #topic Open discussion
16:49:34 does anyone want to talk about something?
16:49:35 according to e-r, http://status.openstack.org/elastic-recheck/data/integrated_gate.html, there are some neutron fullstack failures too (the functional job at the top of the list should have a fix out there now)
16:50:46 slaweq sorry, I was in another internal meeting; as discussed, I need to get into the l3 and l2 agents' code to pin down the notification issue
16:51:20 more of an fyi than anything else since you were talking about neutron fails above
16:51:22 clarkb: we already talked about an issue like the one in e.g. http://logs.openstack.org/04/599604/10/gate/neutron-fullstack/3c364de/logs/testr_results.html.gz
16:51:37 and it looks like it's the reason for 3 of the 4 issues pointed out there
16:52:34 and those 2: http://logs.openstack.org/03/610003/2/gate/neutron-fullstack-python36/b7f131a/logs/testr_results.html.gz and http://logs.openstack.org/61/608361/2/gate/neutron-fullstack-python36/39f72ae/logs/testr_results.html.gz
16:52:46 look at first glance like it's again the same culprit
16:53:09 the test fails because it tried to kill an agent and that did not finish successfully
16:53:44 so as I said, I will open a bug report for that for now, and will try to debug it if I have time for it
16:54:03 ok, you might also want to add e-r queries for known issues if they will persist for longer than a few days
16:54:22 helps others identify problems that have a direct effect on our ability to merge code, and hopefully motivates people to fix them too :)
16:54:25 clarkb: ok, I will add it, thx
16:54:39 clarkb: thanks for bringing this up
16:54:57 this is the perfect forum for CI issues
16:55:05 and we meet every week at this time
16:55:10 you are always welcome
16:55:59 #action slaweq to add e-r query for known fullstack issue (when the bug is reported)
16:56:40 anything else, or can we finish a few minutes early? :)
16:56:48 not from me
16:57:34 thx for attending
16:57:39 #endmeeting