16:00:11 <slaweq> #startmeeting neutron_ci
16:00:12 <openstack> Meeting started Tue Jan 8 16:00:11 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:13 <slaweq> hi
16:00:16 <openstack> The meeting name has been set to 'neutron_ci'
16:01:36 <hongbin> o/
16:01:41 <slaweq> hi hongbin
16:01:56 <slaweq> mlavalle: ping :)
16:02:02 <mlavalle> o/
16:02:06 <slaweq> hi mlavalle
16:02:16 <mlavalle> sorry, distracted looking at code
16:02:17 <slaweq> I think we can start now
16:02:23 <mlavalle> thanks for pinging me
16:02:35 <slaweq> haleyb: bcafarel and njohnston will not be available probably
16:02:38 <slaweq> mlavalle: no problem :)
16:02:52 <slaweq> #topic Actions from previous meetings
16:02:55 <mlavalle> they are not getting paid, right?
16:03:11 <slaweq> why? :)
16:03:19 <mlavalle> they didn't show up today
16:03:24 <mlavalle> just kidding
16:03:30 <slaweq> they are on internal meetup this week
16:03:34 <slaweq> so they are busy
16:03:57 <mlavalle> is that all week long?
16:04:09 <slaweq> tuesday to thursday
16:04:14 <mlavalle> cool
16:04:25 <mlavalle> so I know not to bother them
16:04:32 <slaweq> :)
16:04:46 <slaweq> ok, so lets go with actions from last year then :)
16:04:50 <slaweq> mlavalle will continue debugging trunk tests failures in multinode dvr env
16:05:08 <mlavalle> yes
16:05:31 <mlavalle> I continued working on it. Yesterday I wrote an update in the bug: https://bugs.launchpad.net/neutron/+bug/1795870/comments/8
16:05:32 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:06:03 <slaweq> yes, I saw this comment yesterday
16:06:14 <mlavalle> Based on the evidence that I list in that note, it seems to me the dvr router is not being scheduled in one of the hosts
16:06:25 <mlavalle> in this case the controller
16:06:48 <mlavalle> the router in question doesn't show up at all in the l3 agent log in that host
16:06:52 <slaweq> so vm which was on controller wasn't reachable?
16:06:59 <slaweq> or vm on subnode wasn't reachable?
16:07:08 <mlavalle> so I am looking at the router scheduling code
16:07:19 <mlavalle> the one in the controller
16:07:25 <slaweq> ahh, ok
16:08:05 <slaweq> so that we can rule out network connectivity between nodes as the reason
16:08:28 <mlavalle> yes, we can rule that out
16:08:41 <slaweq> ok
16:08:45 <mlavalle> I know that because other tests run
16:09:32 <slaweq> yes, but it could be that e.g. tests where vm was spawned on subnode failed and tests where vm was spawned on controller were passing - then You couldn't rule that out
16:10:09 <slaweq> but in such case if vm on controller didn't work, it isn't underlay network issue
16:10:11 <mlavalle> no, because there are tests where 2 vms, one in controller and one in compute, are passing
16:10:40 <slaweq> yes, so we can definitely rule out issues with underlay network :)
16:10:58 <mlavalle> hang on
16:11:32 <mlavalle> you wrote this test: https://review.openstack.org/#/c/598676/
16:11:43 <mlavalle> and the vms run in both hosts, right?
16:12:09 <mlavalle> this test passes
16:12:11 <slaweq> yes, they should be on separate nodes always
16:12:37 <mlavalle> that test led me to https://review.openstack.org/#/c/597567
16:12:48 <mlavalle> which merged in November
16:13:02 <slaweq> yes, it was big patch
16:13:04 <mlavalle> and where we made significant changes to dvr scheduling
16:13:16 <mlavalle> so I am looking at that code now
16:13:25 <slaweq> ok
16:13:27 <mlavalle> that is why I got late to this meeting ;-)
16:13:31 <slaweq> :)
16:13:49 <slaweq> ok, so You are justified now :P
16:13:50 <mlavalle> that's where I am right now
16:13:59 <mlavalle> I'll continue pushing forward
16:14:06 <slaweq> ok, thx a lot
16:14:17 <mlavalle> I might push a DNM patch to do some testing
16:14:30 <slaweq> sure, let me know if You will need any help
16:14:46 <slaweq> #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:14:54 <slaweq> thx mlavalle for working on this
16:14:59 <mlavalle> thanks
16:15:11 <slaweq> ok, lets move on to the next one
16:15:15 <slaweq> haleyb to report bugs about recent errors in L3 agent logs
16:15:30 <mlavalle> can I make one additional comment
16:15:33 <mlavalle> ?
16:15:34 <slaweq> sure
16:15:37 <slaweq> go on
16:16:01 <mlavalle> several of the other failures in the dvr multinode job exhibit similar behavior....
16:16:18 <mlavalle> the VM cannot reach the metadata service
16:16:42 <mlavalle> so this might be the common cause of several of the failures trying to ssh to instances
16:17:31 <slaweq> yes, I also think that, I saw some other tests with similar issues, it's not always the same test
16:17:57 <mlavalle> so this is related to the email you sent me in december
16:18:02 <mlavalle> with ssh failures
16:18:26 * mlavalle trying to show diligence with homework assigned by El Comandante
16:18:38 <slaweq> mlavalle: but it's not always dvr jobs with such failures
16:18:42 <slaweq> e.g. here: * haleyb to report bugs about recent errors in L3 agent logs
16:18:44 <slaweq> sorry
16:18:53 <slaweq> http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/testr_results.html.gz
16:19:01 <slaweq> it is the same issue
16:19:07 <slaweq> and it's single node tempest job
16:19:09 <slaweq> without dvr
16:19:12 <mlavalle> ok, I'll look at that then
16:19:19 <mlavalle> it is useful data
16:20:33 <slaweq> I saw in this job some errors in L3 agent logs: http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/controller/logs/screen-q-l3.txt.gz?level=ERROR
16:20:42 <slaweq> but I'm not sure if that is related to the issue or not
16:21:53 <slaweq> ok, can we move on then?
16:22:02 <mlavalle> yes, I'll look at that
16:22:08 <slaweq> thx mlavalle
16:22:10 <mlavalle> thanks for the pointers
16:22:39 <slaweq> You're welcome :)
16:22:44 <slaweq> ok, lets move on
16:22:53 <slaweq> haleyb to report bugs about recent errors in L3 agent logs
16:22:58 <slaweq> haleyb is not here
16:23:13 <slaweq> but he opened bug https://bugs.launchpad.net/neutron/+bug/1809134
16:23:13 <openstack> Launchpad bug 1809134 in neutron "TypeError in QoS gateway_ip code in l3-agent logs" [High,In progress] - Assigned to Brian Haley (brian-haley)
16:23:24 <slaweq> and he is working on it so we should be fine
16:24:16 <slaweq> I think we can move on to the next one then
16:24:19 <slaweq> slaweq to talk with bcafarel about SIGHUP issue in functional py3 tests
16:24:43 <slaweq> I talked with bcafarel, he checked that and he found that this issue is now fixed with https://review.openstack.org/#/c/624006/
16:25:37 <slaweq> any questions or do You want to talk about something else related to last week's actions?
16:25:40 <mlavalle> cool
16:26:33 <slaweq> or can we move on as that was all actions from last meeting
16:27:31 <slaweq> ok, lets move on then
16:27:36 <mlavalle> let's move on
16:27:36 <slaweq> #topic Python 3
16:27:47 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:28:09 <slaweq> in this etherpad You can track progress of switching our CI to python3
16:28:23 <slaweq> recently I did some patches related to it:
16:28:32 <slaweq> https://review.openstack.org/#/c/627806/
16:28:39 <slaweq> https://review.openstack.org/#/c/627053/
16:28:48 <slaweq> so please review if You will have some time
16:29:14 <slaweq> I'm also working slowly on conversion of neutron-functional job to py3 in https://review.openstack.org/#/c/577383/ but it isn't ready yet I think
16:29:22 <mlavalle> so I can take an assignment from there
16:29:40 <slaweq> sure
16:29:43 <mlavalle> ok
16:29:46 <mlavalle> cool
16:30:08 <slaweq> in patches which I did I was doing conversion to zuulv3 and python3 together
16:30:20 <mlavalle> that's good idea
16:30:26 <slaweq> but if You want to only switch job to py3 now, it's fine too for me
16:30:27 <mlavalle> and good practice
16:31:09 <slaweq> for some grenade jobs I think njohnston was doing some patches
16:31:17 <mlavalle> ok
16:31:37 <slaweq> but I didn't check those too much recently so I am not sure what is the current state of those jobs
16:32:30 <slaweq> but I think that we have good progress on it and we can switch everything in this cycle I hope :)
16:32:42 <mlavalle> great!
16:33:39 <slaweq> btw. I'm not sure if You are aware but there is also almost merged patch https://review.openstack.org/#/c/622415/
16:33:47 <slaweq> to switch devstack to be py3 by default
16:34:20 <slaweq> I'm not sure, but then I think all devstack based jobs should be running py3 by default
16:34:49 <mlavalle> let's keep an eye on that
16:34:54 <slaweq> sure
16:35:24 <slaweq> ok, can we move on to the next topic?
16:36:01 <slaweq> I take it as yes
16:36:03 <slaweq> #topic Grafana
16:36:10 <slaweq> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:37:11 <slaweq> couple notes on grafana
16:37:33 <slaweq> all tempest and neutron-tempest-plugin jobs are on quite high failure rates recently
16:38:04 <slaweq> and in most cases where I checked it it was this issue where vm couldn't reach metadata or issue where SSH to instance wasn't possible
16:38:22 <slaweq> I know that mlavalle will continue work on first problem
16:38:33 <mlavalle> yeah
16:38:33 <slaweq> so I can try to debug second one
16:38:42 <mlavalle> ok
16:38:49 <slaweq> #action slaweq to debug problems with ssh to vm in tempest tests
16:38:54 <mlavalle> yes, I think it is the right approach
16:39:08 <slaweq> if I will find anything I will open bug/bugs for it
16:39:57 <slaweq> there is also one more issue worth mentioning, recently we had Functional tests failing 100% of the time, that was caused by https://bugs.launchpad.net/neutron/+bug/1810518
16:39:57 <openstack> Launchpad bug 1810518 in oslo.privsep "neutron-functional tests failing with oslo.privsep 1.31" [Critical,Confirmed] - Assigned to Ben Nemec (bnemec)
16:40:32 <slaweq> for now it is worked around by lowering oslo.privsep version in requirements repo but we need to fix this issue somehow
16:41:03 <slaweq> I talked with rubasov today and he has no idea how to fix this :/
16:41:16 <mlavalle> ok
16:41:22 <slaweq> I also spent few hours on it today
16:41:24 <mlavalle> well, it was worth trying
16:41:45 <slaweq> I described my findings in comment on launchpad but I have no idea how to deal with it
16:42:16 <slaweq> I don't have any experience with ctypes lib and calling C functions from python
16:42:30 <slaweq> and together with threading
16:43:02 <slaweq> I will ask bnemec today if he found something maybe
16:43:26 <slaweq> and we will see what to do with it next
16:44:16 <slaweq> anything else You want to add/ask here?
16:44:26 <mlavalle> if you get stuck, let me know
16:44:34 <slaweq> mlavalle: sure, thx
16:44:36 <mlavalle> if nothing else, it is a good learning opportunity
16:44:45 <slaweq> yes, it is
16:44:49 <mlavalle> ctypes, C, etc
16:44:56 <slaweq> I read a lot about it today :)
16:45:38 <slaweq> ok, lets move on
16:45:39 <slaweq> #topic fullstack/functional
16:45:48 <slaweq> according to fullstack tests
16:45:59 <slaweq> we still have opened issue https://bugs.launchpad.net/neutron/+bug/1798475
16:45:59 <openstack> Launchpad bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,In progress] - Assigned to LIU Yulong (dragon889)
16:46:17 <slaweq> liuyulong told me today that he couldn't spot it recently too much. We will continue work on it
16:46:31 <slaweq> about functional tests
16:46:57 <slaweq> I have small update according to https://bugs.launchpad.net/neutron/+bug/1687027
16:46:58 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:47:13 <slaweq> I was checking those failed db migrations
16:47:30 <slaweq> and it looks that this isn't any specific problem in Neutron migration scripts
16:47:50 <slaweq> it's just issue with IO performance on some cloud providers probably
16:48:05 <mlavalle> ahhh
16:48:16 <mlavalle> yeah, that makes it difficult to fix
16:48:21 <slaweq> I described my findings in comment in https://bugs.launchpad.net/neutron/+bug/1687027/comments/41
16:48:21 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:48:45 <slaweq> basically in such timed out run, every step in migration takes very long time
16:49:07 <slaweq> and dstat shows me that it was below 100 IOPS during such test
16:49:33 <slaweq> while during "good" test run there was more than 2000 IOPS where I was checking it
16:49:45 <mlavalle> right
16:49:51 <mlavalle> so we are hitting a limit
16:50:26 <slaweq> I think that sometimes we are just on some slow nodes and that is the reason
16:50:32 <slaweq> I talked with infra about it
16:50:49 <slaweq> they told me that they had some issues with one provider recently but this should be fixed now
16:51:01 <slaweq> I want to send patch which will unmark those tests as unstable
16:51:05 <mlavalle> ahh, so there's hope
16:51:16 <slaweq> and try to recheck it for some time to see if that will happen again
16:51:22 <mlavalle> yeap
16:51:32 <slaweq> when I was looking at logstash few days ago there weren't too many such issues
16:51:44 <slaweq> but we will probably have them from time to time still
16:52:11 <slaweq> and that's all for functional/fullstack tests from me for today
16:52:20 <slaweq> do You want to add something?
16:53:20 <slaweq> ok, so I guess that we can go to the next topic now :)
16:53:25 <slaweq> #topic Tempest/Scenario
16:53:45 <slaweq> according to tempest/scenario tests, we already discussed our 2 main issues there
16:53:54 <slaweq> so I just wanted to ask about one thing
16:53:56 <mlavalle> yes
16:54:24 <slaweq> I recently sent patch https://review.openstack.org/#/c/627970/ which adds non-voting tempest job based on Fedora
16:54:49 <slaweq> I wanted to ask for review and for opinions from community if it's fine for You to have such job
16:55:13 <slaweq> it's running on py3 of course :)
16:55:16 <mlavalle> I'm ok with it
16:55:33 <slaweq> great mlavalle, thx :)
16:55:34 <mlavalle> have we seen opposition in other projects?
16:55:56 <slaweq> I don't know about any
16:56:04 <mlavalle> ok
16:56:08 <mlavalle> I'm fine with it
16:56:47 <slaweq> thx mlavalle
16:56:55 <slaweq> please add this patch to Your review list :)
16:57:03 <mlavalle> I just did :-)
16:57:09 <slaweq> thx :)
16:57:19 <slaweq> ok, so that's all from my side for today
16:57:33 <slaweq> do You want to talk about anything else quickly?
16:57:56 <mlavalle> nope
16:58:10 <slaweq> ok, so thx for attending mlavalle and hongbin
16:58:13 <mlavalle> thanks for the great facilitation, as always
16:58:15 <slaweq> see You next week
16:58:18 <mlavalle> o/
16:58:23 <slaweq> o/
16:58:26 <slaweq> #endmeeting