16:00:11 <slaweq> #startmeeting neutron_ci
16:00:12 <openstack> Meeting started Tue Jan 8 16:00:11 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:13 <slaweq> hi
16:00:16 <openstack> The meeting name has been set to 'neutron_ci'
16:01:36 <hongbin> o/
16:01:41 <slaweq> hi hongbin
16:01:56 <slaweq> mlavalle: ping :)
16:02:02 <mlavalle> o/
16:02:06 <slaweq> hi mlavalle
16:02:16 <mlavalle> sorry, distracted looking at code
16:02:17 <slaweq> I think we can start now
16:02:23 <mlavalle> thanks for pinging me
16:02:35 <slaweq> haleyb: bcafarel and njohnston will not be available probably
16:02:38 <slaweq> mlavalle: no problem :)
16:02:52 <slaweq> #topic Actions from previous meetings
16:02:55 <mlavalle> they are not getting paid, right?
16:03:11 <slaweq> why? :)
16:03:19 <mlavalle> they didn't show up today
16:03:24 <mlavalle> just kidding
16:03:30 <slaweq> they are on internal meetup this week
16:03:34 <slaweq> so they are busy
16:03:57 <mlavalle> is that all week long?
16:04:09 <slaweq> tuesday to thursday
16:04:14 <mlavalle> cool
16:04:25 <mlavalle> so I know not to bother them
16:04:32 <slaweq> :)
16:04:46 <slaweq> ok, so lets go with actions from last year then :)
16:04:50 <slaweq> mlavalle will continue debugging trunk tests failures in multinode dvr env
16:05:08 <mlavalle> yes
16:05:31 <mlavalle> I continued working on it. Yesterday I wrote an update in the bug: https://bugs.launchpad.net/neutron/+bug/1795870/comments/8
16:05:32 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,In progress] - Assigned to Miguel Lavalle (minsel)
16:06:03 <slaweq> yes, I saw this comment yesterday
16:06:14 <mlavalle> Based on the evidence that I list in that note, it seems to me the dvr router is not being scheduled in one of the hosts
16:06:25 <mlavalle> in this case the controller
16:06:48 <mlavalle> the router in question doesn't show up at all in the l3 agent log in that host
16:06:52 <slaweq> so vm which was on controller wasn't reachable?
16:06:59 <slaweq> or vm on subnode wasn't reachable?
16:07:08 <mlavalle> so I am looking at the router scheduling code
16:07:19 <mlavalle> the one in the controller
16:07:25 <slaweq> ahh, ok
16:08:05 <slaweq> so that we can rule out network connectivity between nodes as the reason
16:08:28 <mlavalle> yes, we can rule that out
16:08:41 <slaweq> ok
16:08:45 <mlavalle> I know that because other tests run
16:09:32 <slaweq> yes, but it could be that e.g. tests where vm was spawned on subnode failed and tests where vm was spawned on controller were passing - then You couldn't rule that out
16:10:09 <slaweq> but in such case if vm on controller didn't work, it isn't underlay network issue
16:10:11 <mlavalle> no, because there are tests where 2 vms, one in controller and one in compute, are passing
16:10:40 <slaweq> yes, so we can definitely rule out issues with underlay network :)
16:10:58 <mlavalle> hang on
16:11:32 <mlavalle> you wrote this test: https://review.openstack.org/#/c/598676/
16:11:43 <mlavalle> and the vms run in both hosts, right?
16:12:09 <mlavalle> this test passes
16:12:11 <slaweq> yes, they should be on separate nodes always
16:12:37 <mlavalle> that test led me to https://review.openstack.org/#/c/597567
16:12:48 <mlavalle> which merged in November
16:13:02 <slaweq> yes, it was big patch
16:13:04 <mlavalle> and where we made significant changes to dvr scheduling
16:13:16 <mlavalle> so I am looking at that code now
16:13:25 <slaweq> ok
16:13:27 <mlavalle> that is why I got late to this meeting ;-)
16:13:31 <slaweq> :)
16:13:49 <slaweq> ok, so You are justified now :P
16:13:50 <mlavalle> that's where I am right now
16:13:59 <mlavalle> I'll continue pushing forward
16:14:06 <slaweq> ok, thx a lot
16:14:17 <mlavalle> I might push a DNM patch to do some testing
16:14:30 <slaweq> sure, let me know if You will need any help
16:14:46 <slaweq> #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:14:54 <slaweq> thx mlavalle for working on this
16:14:59 <mlavalle> thanks
16:15:11 <slaweq> ok, lets move on to the next one
16:15:15 <slaweq> haleyb to report bugs about recent errors in L3 agent logs
16:15:30 <mlavalle> can I make one additional comment
16:15:33 <mlavalle> ?
16:15:34 <slaweq> sure
16:15:37 <slaweq> go on
16:16:01 <mlavalle> several of the other failures in the dvr multinode job exhibit similar behavior....
16:16:18 <mlavalle> the VM cannot reach the metadata service
16:16:42 <mlavalle> so this might be the common cause of several of the failures trying to ssh to instances
16:17:31 <slaweq> yes, I also think that, I saw some other tests with similar issues, it's not always the same test
16:17:57 <mlavalle> so this is related to the email you sent me in december
16:18:02 <mlavalle> with ssh failures
16:18:26 * mlavalle trying to show diligence with homework assigned by El Comandante
16:18:38 <slaweq> mlavalle: but it's not always dvr jobs with such failures
16:18:42 <slaweq> e.g. here: * haleyb to report bugs about recent errors in L3 agent logs
16:18:44 <slaweq> sorry
16:18:53 <slaweq> http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/testr_results.html.gz
16:19:01 <slaweq> it is the same issue
16:19:07 <slaweq> and it's single node tempest job
16:19:09 <slaweq> without dvr
16:19:12 <mlavalle> ok, I'll look at that then
16:19:19 <mlavalle> it is useful data
16:20:33 <slaweq> I saw in this job some errors in L3 agent logs: http://logs.openstack.org/09/626109/10/check/tempest-full/fab9ab0/controller/logs/screen-q-l3.txt.gz?level=ERROR
16:20:42 <slaweq> but I'm not sure if that is related to the issue or not
16:21:53 <slaweq> ok, can we move on then?
16:22:02 <mlavalle> yes, I'll look at that
16:22:08 <slaweq> thx mlavalle
16:22:10 <mlavalle> thanks for the pointers
16:22:39 <slaweq> You're welcome :)
16:22:44 <slaweq> ok, lets move on
16:22:53 <slaweq> haleyb to report bugs about recent errors in L3 agent logs
16:22:58 <slaweq> haleyb is not here
16:23:13 <slaweq> but he opened bug https://bugs.launchpad.net/neutron/+bug/1809134
16:23:13 <openstack> Launchpad bug 1809134 in neutron "TypeError in QoS gateway_ip code in l3-agent logs" [High,In progress] - Assigned to Brian Haley (brian-haley)
16:23:24 <slaweq> and he is working on it so we should be fine
16:24:16 <slaweq> I think we can move on to the next one then
16:24:19 <slaweq> slaweq to talk with bcafarel about SIGHUP issue in functional py3 tests
16:24:43 <slaweq> I talked with bcafarel, he checked that and he found that this issue is now fixed with https://review.openstack.org/#/c/624006/
16:25:37 <slaweq> any questions or do You want to talk about something else related to last week's actions?
16:25:40 <mlavalle> cool
16:26:33 <slaweq> or can we move on as that was all actions from last meeting
16:27:31 <slaweq> ok, lets move on then
16:27:36 <mlavalle> let's move on
16:27:36 <slaweq> #topic Python 3
16:27:47 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_ci_python3
16:28:09 <slaweq> in this etherpad You can track progress of switching our CI to python3
16:28:23 <slaweq> recently I did some patches related to it:
16:28:32 <slaweq> https://review.openstack.org/#/c/627806/
16:28:39 <slaweq> https://review.openstack.org/#/c/627053/
16:28:48 <slaweq> so please review if You will have some time
16:29:14 <slaweq> I'm also working slowly on conversion of neutron-functional job to py3 in https://review.openstack.org/#/c/577383/ but it isn't ready yet I think
16:29:22 <mlavalle> so I can take an assignment from there
16:29:40 <slaweq> sure
16:29:43 <mlavalle> ok
16:29:46 <mlavalle> cool
16:30:08 <slaweq> in patches which I did I was doing conversion to zuulv3 and python3 together
16:30:20 <mlavalle> that's good idea
16:30:26 <slaweq> but if You want to only switch job to py3 now, it's fine too for me
16:30:27 <mlavalle> and good practice
16:31:09 <slaweq> for some grenade jobs I think njohnston was doing some patches
16:31:17 <mlavalle> ok
16:31:37 <slaweq> but I didn't check those too much recently so I am not sure what is the current state of those jobs
16:32:30 <slaweq> but I think that we have good progress on it and we can switch everything in this cycle I hope :)
16:32:42 <mlavalle> great!
16:33:39 <slaweq> btw. I'm not sure if You are aware but there is also almost merged patch https://review.openstack.org/#/c/622415/
16:33:47 <slaweq> to switch devstack to be py3 by default
16:34:20 <slaweq> I'm not sure, but then I think all devstack based jobs should be running py3 by default
16:34:49 <mlavalle> let's keep an eye on that
16:34:54 <slaweq> sure
16:35:24 <slaweq> ok, can we move on to the next topic?
16:36:01 <slaweq> I take it as yes
16:36:03 <slaweq> #topic Grafana
16:36:10 <slaweq> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:37:11 <slaweq> couple notes on grafana
16:37:33 <slaweq> all tempest and neutron-tempest-plugin jobs are on quite high failure rates recently
16:38:04 <slaweq> and in most cases where I checked it it was this issue where vm couldn't reach metadata or issue where SSH to instance wasn't possible
16:38:22 <slaweq> I know that mlavalle will continue work on first problem
16:38:33 <mlavalle> yeah
16:38:33 <slaweq> so I can try to debug second one
16:38:42 <mlavalle> ok
16:38:49 <slaweq> #action slaweq to debug problems with ssh to vm in tempest tests
16:38:54 <mlavalle> yes, I think it is the right approach
16:39:08 <slaweq> if I will find anything I will open bug/bugs for it
16:39:57 <slaweq> there is also one more issue worth mentioning, recently we had Functional tests failing 100% of the time, that was caused by https://bugs.launchpad.net/neutron/+bug/1810518
16:39:57 <openstack> Launchpad bug 1810518 in oslo.privsep "neutron-functional tests failing with oslo.privsep 1.31" [Critical,Confirmed] - Assigned to Ben Nemec (bnemec)
16:40:32 <slaweq> for now it is worked around by lowering oslo.privsep version in requirements repo but we need to fix this issue somehow
16:41:03 <slaweq> I talked with rubasov today and he has no idea how to fix this :/
16:41:16 <mlavalle> ok
16:41:22 <slaweq> I also spent few hours on it today
16:41:24 <mlavalle> well, it was worth trying
16:41:45 <slaweq> I described my findings in comment on launchpad but I have no idea how to deal with it
16:42:16 <slaweq> I don't have any experience with ctypes lib and calling C functions from python
16:42:30 <slaweq> and together with threading
16:43:02 <slaweq> I will ask bnemec today if he found something maybe
16:43:26 <slaweq> and we will see what to do with it next
16:44:16 <slaweq> anything else You want to add/ask here?
16:44:26 <mlavalle> if you get stuck, let me know
16:44:34 <slaweq> mlavalle: sure, thx
16:44:36 <mlavalle> if nothing else, it is a good learning opportunity
16:44:45 <slaweq> yes, it is
16:44:49 <mlavalle> ctypes, C, etc
16:44:56 <slaweq> I read a lot about it today :)
16:45:38 <slaweq> ok, lets move on
16:45:39 <slaweq> #topic fullstack/functional
16:45:48 <slaweq> according to fullstack tests
16:45:59 <slaweq> we still have opened issue https://bugs.launchpad.net/neutron/+bug/1798475
16:45:59 <openstack> Launchpad bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,In progress] - Assigned to LIU Yulong (dragon889)
16:46:17 <slaweq> liuyulong told me today that he couldn't spot it recently too much. We will continue work on it
16:46:31 <slaweq> about functional tests
16:46:57 <slaweq> I have small update according to https://bugs.launchpad.net/neutron/+bug/1687027
16:46:58 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:47:13 <slaweq> I was checking those failed db migrations
16:47:30 <slaweq> and it looks that this isn't any specific problem in Neutron migration scripts
16:47:50 <slaweq> it's just issue with IO performance on some cloud providers probably
16:48:05 <mlavalle> ahhh
16:48:16 <mlavalle> yeah, that makes it difficult to fix
16:48:21 <slaweq> I described my findings in comment in https://bugs.launchpad.net/neutron/+bug/1687027/comments/41
16:48:21 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:48:45 <slaweq> basically in such timed out run, every step in migration takes very long time
16:49:07 <slaweq> and dstat shows me that it was below 100 IOPS during such test
16:49:33 <slaweq> while during "good" test run there was more than 2000 IOPS where I was checking it
16:49:45 <mlavalle> right
16:49:51 <mlavalle> so we are hitting a limit
16:50:26 <slaweq> I think that sometimes we are just on some slow nodes and that is the reason
16:50:32 <slaweq> I talked with infra about it
16:50:49 <slaweq> they told me that they had some issues with one provider recently but this should be fixed now
16:51:01 <slaweq> I want to send patch which will unmark those tests as unstable
16:51:05 <mlavalle> ahh, so there's hope
16:51:16 <slaweq> and try to recheck it for some time to see if that will happen again
16:51:22 <mlavalle> yeap
16:51:32 <slaweq> when I was looking at logstash few days ago there weren't too many such issues
16:51:44 <slaweq> but we will probably have them from time to time still
16:52:11 <slaweq> and that's all for functional/fullstack tests from me for today
16:52:20 <slaweq> do You want to add something?
16:53:20 <slaweq> ok, so I guess that we can go to the next topic now :)
16:53:25 <slaweq> #topic Tempest/Scenario
16:53:45 <slaweq> according to tempest/scenario tests, we already discussed our 2 main issues there
16:53:54 <slaweq> so I just wanted to ask about one thing
16:53:56 <mlavalle> yes
16:54:24 <slaweq> I recently sent patch https://review.openstack.org/#/c/627970/ which adds non-voting tempest job based on Fedora
16:54:49 <slaweq> I wanted to ask for review and for opinions from community if it's fine for You to have such job
16:55:13 <slaweq> it's running on py3 of course :)
16:55:16 <mlavalle> I'm ok with it
16:55:33 <slaweq> great mlavalle, thx :)
16:55:34 <mlavalle> have we seen opposition in other projects?
16:55:56 <slaweq> I don't know about any
16:56:04 <mlavalle> ok
16:56:08 <mlavalle> I'm fine with it
16:56:47 <slaweq> thx mlavalle
16:56:55 <slaweq> please add this patch to Your review list :)
16:57:03 <mlavalle> I just did :-)
16:57:09 <slaweq> thx :)
16:57:19 <slaweq> ok, so that's all from my side for today
16:57:33 <slaweq> do You want to talk about anything else quickly?
16:57:56 <mlavalle> nope
16:58:10 <slaweq> ok, so thx for attending mlavalle and hongbin
16:58:13 <mlavalle> thanks for the great facilitation, as always
16:58:15 <slaweq> see You next week
16:58:18 <mlavalle> o/
16:58:23 <slaweq> o/
16:58:26 <slaweq> #endmeeting