16:00:11 #startmeeting neutron_ci
16:00:12 Meeting started Tue Jan 8 16:00:11 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:13 hi
16:00:16 The meeting name has been set to 'neutron_ci'
16:01:36 o/
16:01:41 hi hongbin
16:01:56 mlavalle: ping :)
16:02:02 o/
16:02:06 hi mlavalle
16:02:16 sorry, distracted looking at code
16:02:17 I think we can start now
16:02:23 thanks for pinging me
16:02:35 haleyb: bcafarel and njohnston will probably not be available
16:02:38 mlavalle: no problem :)
16:02:52 #topic Actions from previous meetings
16:02:55 they are not getting paid, right?
16:03:11 why? :)
16:03:19 they didn't show up today
16:03:24 just kidding
16:03:30 they are at an internal meetup this week
16:03:34 so they are busy
16:03:57 is that all week long?
16:04:09 tuesday to thursday
16:04:14 cool
16:04:25 so I know not to bother them
16:04:32 :)
16:04:46 ok, so let's go with the actions from last year then :)
16:04:50 mlavalle will continue debugging trunk tests failures in multinode dvr env
16:05:08 yes
16:05:31 I continued working on it. Yesterday I wrote an update in the bug: https://bugs.launchpad.net/neutron/+bug/1795870/comments/8
16:05:32 Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:06:03 yes, I saw this comment yesterday
16:06:14 Based on the evidence that I list in that note, it seems to me the dvr router is not being scheduled on one of the hosts
16:06:25 in this case the controller
16:06:48 the router in question doesn't show up at all in the l3 agent log on that host
16:06:52 so the vm which was on the controller wasn't reachable?
16:06:59 or the vm on the subnode wasn't reachable?
16:07:08 so I am looking at the router scheduling code
16:07:19 the one in the controller
16:07:25 ahh, ok
16:08:05 so that we can rule out network connectivity between nodes as the reason
16:08:28 yes, we can rule that out
16:08:41 ok
16:08:45 I know that because other tests run
16:09:32 yes, but it could be that e.g. tests where the vm was spawned on the subnode failed and tests where the vm was spawned on the controller were passing - then You couldn't rule that out
16:10:09 but in such a case, if the vm on the controller didn't work, it isn't an underlay network issue
16:10:11 no, because there are tests where 2 vms, one on the controller and one on the compute, are passing
16:10:40 yes, so we can definitely rule out issues with the underlay network :)
16:10:58 hang on
16:11:32 you wrote this test: https://review.openstack.org/#/c/598676/
16:11:43 and the vms run on both hosts, right?
16:12:09 this test passes
16:12:11 yes, they should always be on separate nodes
16:12:37 that test led me to https://review.openstack.org/#/c/597567
16:12:48 which merged in November
16:13:02 yes, it was a big patch
16:13:04 where we made significant changes to dvr scheduling
16:13:16 so I am looking at that code now
16:13:25 ok
16:13:27 that is why I was late to this meeting ;-)
16:13:31 :)
16:13:49 ok, so You are justified now :P
16:13:50 that's where I am right now
16:13:59 I'll continue pushing forward
16:14:06 ok, thx a lot
16:14:17 I might push a DNM patch to do some testing
16:14:30 sure, let me know if You need any help
16:14:46 #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:14:54 thx mlavalle for working on this
16:14:59 thanks
16:15:11 ok, let's move on to the next one
16:15:15 haleyb to report bugs about recent errors in L3 agent logs
16:15:30 can I make one additional comment
16:15:33 ?
16:15:34 sure
16:15:37 go on
16:16:01 several of the other failures in the dvr multinode job exhibit similar behavior...
16:16:18 the VM cannot reach the metadata service
16:16:42 so this might be the common cause of several of the failures trying to ssh to instances
16:17:31 yes, I also think that; I saw some other tests with similar issues, it's not always the same test
16:17:57 so this is related to the email you sent me in December
16:18:02 with ssh failures
16:18:26 * mlavalle trying to show diligence with homework assigned by El Comandante
16:18:38 mlavalle: but it's not always dvr jobs that have such failures
16:18:42 e.g. here: * haleyb to report bugs about recent errors in L3 agent logs
16:18:44 sorry
16:18:53 http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/testr_results.html.gz
16:19:01 it is the same issue
16:19:07 and it's a single node tempest job
16:19:09 without dvr
16:19:12 ok, I'll look at that then
16:19:19 it is useful data
16:20:33 I saw some errors in the L3 agent logs in this job: http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/controller/logs/screen-q-l3.txt.gz?level=ERROR
16:20:42 but I'm not sure if that is related to the issue or not
16:21:53 ok, can we move on then?
16:22:02 yes, I'll look at that
16:22:08 thx mlavalle
16:22:10 thanks for the pointers
16:22:39 You're welcome :)
16:22:44 ok, let's move on
16:22:53 haleyb to report bugs about recent errors in L3 agent logs
16:22:58 haleyb is not here
16:23:13 but he opened bug https://bugs.launchpad.net/neutron/+bug/1809134
16:23:13 Launchpad bug 1809134 in neutron "TypeError in QoS gateway_ip code in l3-agent logs" [High,In progress] - Assigned to Brian Haley (brian-haley)
16:23:24 and he is working on it, so we should be fine
16:24:16 I think we can move on to the next one then
16:24:19 slaweq to talk with bcafarel about SIGHUP issue in functional py3 tests
16:24:43 I talked with bcafarel, he checked it and found that this issue is now fixed by https://review.openstack.org/#/c/624006/
16:25:37 any questions, or do You want to talk about something else related to last week's actions?
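[For context on the failures discussed above where the VM cannot reach the metadata service: the sketch below is a minimal, hypothetical illustration of the kind of probe a guest effectively performs at boot, i.e. fetching instance data from the well-known 169.254.169.254 address served via the Neutron metadata proxy. The exact URL path and timeout are illustrative assumptions, not taken from the failing jobs.]

```python
# Minimal sketch of a metadata-reachability check run from inside a guest.
# The path below is the EC2-compatible metadata endpoint; the timeout is
# an arbitrary choice for illustration.
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"


def metadata_reachable(timeout=10):
    """Return True if the metadata service answers, False otherwise."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors, HTTP errors and timeouts alike.
        return False


if __name__ == "__main__":
    print("metadata reachable:", metadata_reachable())
```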
16:25:40 cool
16:26:33 or can we move on, as that was all the actions from the last meeting
16:27:31 ok, let's move on then
16:27:36 let's move on
16:27:36 #topic Python 3
16:27:47 Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:28:09 in this etherpad You can track the progress of switching our CI to python3
16:28:23 recently I did some patches related to it:
16:28:32 https://review.openstack.org/#/c/627806/
16:28:39 https://review.openstack.org/#/c/627053/
16:28:48 so please review them if You have some time
16:29:14 I'm also slowly working on the conversion of the neutron-functional job to py3 in https://review.openstack.org/#/c/577383/ but it isn't ready yet I think
16:29:22 so I can take an assignment from there
16:29:40 sure
16:29:43 ok
16:29:46 cool
16:30:08 in the patches which I did, I was doing the conversion to zuulv3 and python3 together
16:30:20 that's a good idea
16:30:26 but if You want to only switch a job to py3 for now, that's fine for me too
16:30:27 and good practice
16:31:09 for some grenade jobs I think njohnston was doing some patches
16:31:17 ok
16:31:37 but I didn't check those too much recently, so I am not sure what the current state of those jobs is
16:32:30 but I think that we have good progress on it and we can switch everything in this cycle, I hope :)
16:32:42 great!
16:33:39 btw. I'm not sure if You are aware, but there is also an almost merged patch https://review.openstack.org/#/c/622415/
16:33:47 to switch devstack to be py3 by default
16:34:20 I'm not sure, but then I think all devstack-based jobs should be running py3 by default
16:34:49 let's keep an eye on that
16:34:54 sure
16:35:24 ok, can we move on to the next topic?
16:36:01 I take that as a yes
16:36:03 #topic Grafana
16:36:10 http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:37:11 a couple of notes on grafana
16:37:33 all tempest and neutron-tempest-plugin jobs have quite high failure rates recently
16:38:04 and in most cases where I checked it was either the issue where the vm couldn't reach metadata or the issue where SSH to the instance wasn't possible
16:38:22 I know that mlavalle will continue working on the first problem
16:38:33 yeah
16:38:33 so I can try to debug the second one
16:38:42 ok
16:38:49 #action slaweq to debug problems with ssh to vm in tempest tests
16:38:54 yes, I think it is the right approach
16:39:08 if I find anything I will open a bug/bugs for it
16:39:57 there is also one more issue worth mentioning: recently we had functional tests failing 100% of the time, which was caused by https://bugs.launchpad.net/neutron/+bug/1810518
16:39:57 Launchpad bug 1810518 in oslo.privsep "neutron-functional tests failing with oslo.privsep 1.31" [Critical,Confirmed] - Assigned to Ben Nemec (bnemec)
16:40:32 for now it is worked around by lowering the oslo.privsep version in the requirements repo, but we need to fix this issue somehow
16:41:03 I talked with rubasov today and he has no idea how to fix this :/
16:41:16 ok
16:41:22 I also spent a few hours on it today
16:41:24 well, it was worth trying
16:41:45 I described my findings in a comment on launchpad, but I have no idea how to deal with it
16:42:16 I don't have any experience with the ctypes lib and calling C functions from python
16:42:30 and together with threading
16:43:02 I will ask bnemec today if he maybe found something
16:43:26 and we will see what to do with it next
16:44:16 anything else You want to add/ask here?
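[Since the oslo.privsep discussion above refers to ctypes, calling C functions from Python, and threading, the snippet below is a generic, self-contained illustration of that pattern only. It is not the oslo.privsep code and says nothing about the actual bug 1810518; the library call and thread count are arbitrary choices for demonstration.]

```python
# Generic illustration of "calling C functions from python together with
# threading": load libc via ctypes and invoke a trivial, thread-safe call
# from several Python threads.
import ctypes
import ctypes.util
import threading

# Locate and load the C library; use_errno lets us read errno after calls.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def worker(idx):
    # getpid() is used purely as a harmless stand-in for a real C call.
    pid = libc.getpid()
    print(f"thread {idx}: libc.getpid() -> {pid}")


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```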
16:44:26 if you get stuck, let me know
16:44:34 mlavalle: sure, thx
16:44:36 if nothing else, it is a good learning opportunity
16:44:45 yes, it is
16:44:49 ctypes, C, etc
16:44:56 I read a lot about it today :)
16:45:38 ok, let's move on
16:45:39 #topic fullstack/functional
16:45:48 regarding fullstack tests
16:45:59 we still have the open issue https://bugs.launchpad.net/neutron/+bug/1798475
16:45:59 Launchpad bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,In progress] - Assigned to LIU Yulong (dragon889)
16:46:17 liuyulong told me today that he hasn't seen it much recently. We will continue working on it
16:46:31 about functional tests
16:46:57 I have a small update regarding https://bugs.launchpad.net/neutron/+bug/1687027
16:46:58 Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:47:13 I was checking those failed db migrations
16:47:30 and it looks like this isn't any specific problem in the Neutron migration scripts
16:47:50 it's probably just an issue with IO performance on some cloud providers
16:48:05 ahhh
16:48:16 yeah, that makes it difficult to fix
16:48:21 I described my findings in a comment in https://bugs.launchpad.net/neutron/+bug/1687027/comments/41
16:48:21 Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:48:45 basically in such a timed-out run, every step in the migration takes a very long time
16:49:07 and dstat shows me that it was below 100 IOPS during such a test
16:49:33 while during a "good" test run there were more than 2000 IOPS when I was checking it
16:49:45 right
16:49:51 so we are hitting a limit
16:50:26 I think that sometimes we are just on some slow nodes and that is the reason
16:50:32 I talked with infra about it
16:50:49 they told me that they had some issues with one provider recently, but this should be fixed now
16:51:01 I want to send a patch which will unmark those tests as unstable
16:51:05 ahh, so there's hope
16:51:16 and recheck it for some time to see if that happens again
16:51:22 yeap
16:51:32 when I was looking at logstash a few days ago there weren't many such issues
16:51:44 but we will probably still have them from time to time
16:52:11 and that's all for functional/fullstack tests from me for today
16:52:20 do You want to add something?
16:53:20 ok, so I guess that we can go to the next topic now :)
16:53:25 #topic Tempest/Scenario
16:53:45 regarding tempest/scenario tests, we already discussed our 2 main issues there
16:53:54 so I just wanted to ask about one thing
16:53:56 yes
16:54:24 I recently sent the patch https://review.openstack.org/#/c/627970/ which adds a non-voting tempest job based on Fedora
16:54:49 I wanted to ask for reviews and for opinions from the community on whether You are fine with having such a job
16:55:13 it's running on py3 of course :)
16:55:16 I'm ok with it
16:55:33 great mlavalle, thx :)
16:55:34 have we seen opposition in other projects?
16:55:56 I don't know about any
16:56:04 ok
16:56:08 I'm fine with it
16:56:47 thx mlavalle
16:56:55 please add this patch to Your review list :)
16:57:03 I just did :-)
16:57:09 thx :)
16:57:19 ok, so that's all from my side for today
16:57:33 do You want to talk about anything else quickly?
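[Regarding "unmark those tests as unstable" in the test_walk_versions discussion above: marking a test unstable generally means wrapping it in a decorator that turns a failure into a skip while the underlying bug is investigated, and unmarking it means removing that decorator so failures count again. The sketch below only illustrates that pattern under assumptions; the decorator name, its behaviour, and the test body are hypothetical and not copied from Neutron's actual helper.]

```python
# Sketch of the "mark a flaky test as unstable" pattern: a decorator that
# converts a failure of a known-flaky test into a skip with a bug reference.
import functools
import unittest


def unstable_test(reason):
    """Hypothetical helper: skip the test instead of failing it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except Exception as exc:
                raise unittest.SkipTest(
                    f"Unstable test skipped after failure ({reason}): {exc}")
        return wrapper
    return decorator


class WalkVersionsTest(unittest.TestCase):

    @unstable_test("bug 1687027")
    def test_walk_versions(self):
        # Placeholder body standing in for the real migration walk.
        self.assertTrue(True)
```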
16:57:56 nope
16:58:10 ok, so thx for attending mlavalle and hongbin
16:58:13 thanks for the great facilitation, as always
16:58:15 see You next week
16:58:18 o/
16:58:23 o/
16:58:26 #endmeeting