16:00:06 <slaweq> #startmeeting neutron_ci
16:00:08 <openstack> Meeting started Tue Jul 17 16:00:06 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:09 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <slaweq> hi again
16:00:11 <slaweq> :)
16:00:11 <openstack> The meeting name has been set to 'neutron_ci'
16:00:23 <njohnston> o/
16:00:37 <slaweq> mlavalle: haleyb: are You around?
16:00:41 <mlavalle> o/
16:01:07 <slaweq> #topic Actions from previous meetings
16:01:21 <slaweq> there is only one action from last meeting:
16:01:24 <slaweq> mlavalle to follow up with QA team to merge https://review.openstack.org/#/c/578765/
16:01:42 <mlavalle> I forgot. I will do it today
16:01:44 <haleyb> hi
16:01:49 <slaweq> yesterday I pushed new PS there
16:01:54 <mlavalle> had the patch open in my browser the entire time
16:01:56 <slaweq> and it's already +W
16:01:57 <mlavalle> damn
16:02:02 <slaweq> mlavalle: no problem
16:02:06 <slaweq> :)
16:02:23 <slaweq> I hope it will be merged soon, though I had to recheck it a few times already
16:02:25 <mlavalle> non-issue then
16:02:35 <slaweq> hi haleyb :)
16:03:01 <slaweq> so, quickly next topic
16:03:03 <slaweq> #topic Grafana
16:03:08 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:03:54 <njohnston> I wish there was a way to set the default for that to the last 7 days.  Is there any reason not to do that?
16:03:55 <slaweq> as I was checking it earlier today, it looks much better now for most of the jobs
16:04:15 <slaweq> njohnston: I don't know, but it would be good IMO :)
16:05:23 <slaweq> I think that Neutron gates are better when I am on holiday :D
16:05:44 <njohnston> #action njohnston to see if we can set the default time period on the grafana dashboard to now-7d
16:05:56 <slaweq> ^^ thx njohnston
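(For context on that action item: in raw Grafana dashboard JSON the default window is just a "time" block like the sketch below; whether the grafyaml definitions used to generate the OpenStack dashboards expose that key is exactly what njohnston will need to check.)
```json
{
  "time": {
    "from": "now-7d",
    "to": "now"
  }
}
```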
16:06:58 <slaweq> for example fullstack, which was in bad shape a few weeks ago, is now below 20% for the last week
16:07:41 <slaweq> do You want to add something or can we talk about some specific jobs?
16:07:53 <mlavalle> we still missed you!
16:08:01 <slaweq> why?
16:08:10 <mlavalle> because you weren't here
16:08:29 <slaweq> ahh thx mlavalle :)
16:08:40 <mlavalle> I'm saying even if the gate is in good shape, we still want you to be around
16:09:01 <slaweq> ok, now I will be all the time :P
16:09:12 <slaweq> so the gate will be back to bad shape probably :P
16:09:18 <mlavalle> LOL
16:09:34 <slaweq> ok, let's move on
16:09:35 <slaweq> #topic Unit tests
16:09:50 <slaweq> we still have an issue with timeouts in UT sometimes
16:09:59 <slaweq> it's described in https://bugs.launchpad.net/neutron/+bug/1779077
16:09:59 <openstack> Launchpad bug 1779077 in neutron "Unit test jobs (py35) fails with timeout often" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:10:28 <slaweq> But it’s not related only to python 3.5; I saw failing py27 jobs also, and sometimes lower-constraints as well
16:10:45 <slaweq> it happens less often than it did a few weeks ago IMO
16:10:52 <slaweq> but still it hits us
16:11:35 <slaweq> I checked today in logstash that it's (probably) not only Neutron's problem - some other projects also hit such timeouts in py27/py35 jobs last week
16:12:03 <slaweq> I was asking on the infra channel today but they don't know of any specific reason for it
16:12:42 <slaweq> and I also don't know whether the problems in other projects were maybe due to something else; I just checked that it happens from time to time for others too
16:13:13 <manjeets> slaweq, I have a question regarding a py35 issue I'm facing in networking-odl, but I'll ask later in the neutron channel after this meeting
16:13:14 <mlavalle> worth taking it into consideration
16:13:21 <slaweq> I want to compare the times of the longest tests in "good" and "bad" runs and maybe I will find something interesting
16:13:36 <slaweq> as it doesn't happen very often, I'm doing that in the meantime for now
16:13:48 <slaweq> and will update this bug if I find something
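(A rough sketch of the comparison slaweq describes, not an existing neutron tool: it assumes the per-test wall times from one "good" and one "bad" run have already been exported to two CSV files of "test_id,duration" rows; the file names are made up.)
```python
import csv

def load_durations(path):
    # each row: test_id,duration_in_seconds (no header assumed)
    with open(path) as f:
        return {row[0]: float(row[1]) for row in csv.reader(f)}

good = load_durations('good_run.csv')   # hypothetical export of a passing run
bad = load_durations('bad_run.csv')     # hypothetical export of a timed-out run

# rank tests by how much slower they got in the "bad" run
deltas = sorted(
    ((bad[t] - good[t], t) for t in bad.keys() & good.keys()),
    reverse=True,
)
for delta, test in deltas[:20]:
    print(f'{delta:+8.2f}s  {test}')
```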
16:15:45 <slaweq> that's all from me about UT issues
16:15:50 <slaweq> do You have something to add?
16:16:05 <mlavalle> nope
16:16:13 <slaweq> next topic then
16:16:24 <slaweq> #topic Functional
16:16:44 <slaweq> I think I saw again this week an issue like in http://logs.openstack.org/58/574058/10/check/neutron-functional/f38d685/logs/testr_results.html.gz
16:17:00 <slaweq> but I'm not sure exactly where and I couldn't find it today
16:17:09 <slaweq> did You see such issues recently?
16:17:40 <mlavalle> no, I didn't
16:18:13 <slaweq> so just please be aware, and if You see such issues again, please report a bug for it
16:18:22 <slaweq> as there isn't any open currently IIRC
16:18:43 <slaweq> we will then be able to track it there
16:18:47 <njohnston> On 7/15 we hit 100% neutron-functional failures in the gate, and then it just went away... I know I saw it at the time, and I don't know what was fixed but now the issue is not there
16:19:26 <njohnston> and it was the weekend and I did not have time to investigate at that moment
16:19:52 <slaweq> 100% failures during the weekend might not really be an issue
16:20:04 <slaweq> there aren't many patches then, especially in the gate queue
16:20:08 <njohnston> I saw a bunch of -2s from it
16:20:28 <slaweq> ahh, so maybe there really was something
16:20:42 <slaweq> maybe one of the patches fixed this issue somehow :)
16:21:02 <njohnston> That is my assumption
16:21:25 <slaweq> would be good
16:21:38 <slaweq> let's just check if it will happen again
16:23:32 <slaweq> I think we can now go to our "favorite" topic
16:23:34 <slaweq> #topic Scenarios
16:24:09 <slaweq> for the last few days we have had 2 jobs with a very high failure rate
16:24:18 <slaweq> fortunately both are non-voting :)
16:24:39 <slaweq> it's neutron-tempest-multinode-full and neutron-tempest-dvr-ha-multinode-full
16:25:11 <slaweq> today I checked failures from the last few days and I found a few issues which happen there
16:25:19 <slaweq> first neutron-tempest-multinode-full
16:25:50 <slaweq> there are one or two tests failing every time it fails: tempest.api.compute.admin.test_servers_on_multinodes.ServersOnMultiNodesTest.test_create_server_with_scheduler_hint_group_anti_affinity
16:26:00 <slaweq> and this issue is probably not related to neutron at all
16:26:16 <slaweq> I saw the same failures in tempest jobs too
16:26:34 <slaweq> example of such failures in neutron patches:
16:26:36 <slaweq> * http://logs.openstack.org/08/555608/34/check/neutron-tempest-multinode-full/1d3aafb/logs/testr_results.html.gz
16:26:44 <slaweq> * http://logs.openstack.org/15/583015/1/check/neutron-tempest-multinode-full/bd5eb5e/logs/testr_results.html.gz
16:26:46 <slaweq> * http://logs.openstack.org/21/567621/9/check/neutron-tempest-multinode-full/6bc503f/logs/testr_results.html.gz
16:26:48 <slaweq> * http://logs.openstack.org/59/582659/1/check/neutron-tempest-multinode-full/c503459/logs/testr_results.html.gz
16:28:39 <slaweq> so for this job I don't think there is a lot we can do
16:28:46 <slaweq> let's go to neutron-tempest-dvr-ha-multinode-full
16:28:54 <slaweq> for this job there are 2 main issues
16:28:56 <njohnston> do we need to open a bug with the nova crew?
16:29:05 <slaweq> njohnston: good question
16:29:17 <slaweq> I didn't check if there is already something opened
16:29:32 <slaweq> but we can definitely ask on the nova channel
16:29:46 <slaweq> I will ask after the meeting
16:29:52 <njohnston> The nova/neutron liaison is sean-k-mooney, he might be able to help
16:30:15 <slaweq> ok, thx for info
16:30:26 <mlavalle> he is closer to your timezone, slaweq
16:30:33 <slaweq> yes, I know
16:30:41 <slaweq> he is in Ireland IIRC
16:30:45 <mlavalle> yeap
16:30:56 <slaweq> so I will ask tomorrow morning
16:31:26 <slaweq> #action slaweq to talk about issue with test_create_server_with_scheduler_hint_group_anti_affinity with nova-neutron liaison
16:32:02 <slaweq> ok, so for neutron-tempest-dvr-ha-multinode-full there are 2 issues generally
16:32:14 <slaweq> the first is exactly the same as above
16:32:28 <slaweq> and the second issue is with the tempest.api.compute.volumes.test_attach_volume.AttachVolumeShelveTestJSON.test_{attach, detach}_volume_shelved_or_offload_server tests
16:32:42 <slaweq> examples of failures:
16:32:44 <slaweq> * http://logs.openstack.org/51/414251/74/check/neutron-tempest-dvr-ha-multinode-full/2b4a730/logs/testr_results.html.gz
16:32:46 <slaweq> * http://logs.openstack.org/08/555608/34/check/neutron-tempest-dvr-ha-multinode-full/8d12da6/logs/testr_results.html.gz
16:32:48 <slaweq> * http://logs.openstack.org/29/581029/2/check/neutron-tempest-dvr-ha-multinode-full/f6e69d5/logs/testr_results.html.gz
16:32:50 <slaweq> * http://logs.openstack.org/21/567621/9/check/neutron-tempest-dvr-ha-multinode-full/3d8fd83/logs/testr_results.html.gz
16:32:52 <slaweq> * http://logs.openstack.org/59/582659/1/check/neutron-tempest-dvr-ha-multinode-full/62e9d93/logs/testr_results.html.gz
16:33:05 <slaweq> here it might be related to Neutron, as the issue is with ssh to the instance via floating IP
16:33:19 <slaweq> but I didn't see this issue in any job other than this dvr one
16:34:45 <slaweq> in the l3 agent logs there are a few errors like: http://logs.openstack.org/51/414251/74/check/neutron-tempest-dvr-ha-multinode-full/2b4a730/logs/subnode-3/screen-q-l3.txt.gz?level=ERROR
16:34:57 <slaweq> on both subnode-2 and subnode-3
16:35:39 <slaweq> mlavalle: do You think that it might be related?
16:36:05 <mlavalle> slaweq: I was taking a look. don't know yet
16:36:47 <mlavalle> no tracebacks in the log
16:36:52 <mlavalle> just that error
16:37:58 <haleyb> right, don't know what it was doing
16:38:05 <slaweq> the strange thing is that it's only in this test
16:39:04 <mlavalle> the message comes from pyroute2: https://github.com/svinota/pyroute2/blob/master/pyroute2/netns/nslink.py#L199
16:39:13 <slaweq> so maybe it's some issue with configuring the FIP again after the server is shelved/unshelved
16:39:16 <mlavalle> it seems at least
16:40:01 <slaweq> maybe it would be easy to check manually if someone has a dvr environment
16:40:17 <mlavalle> it seems they have their own api to handle namespaces
16:40:27 <slaweq> just try to shelve/unshelve a server and check connectivity to it
16:40:40 <haleyb> the message is from the close() code, so maybe didn't cause a fatal error, i'm remembering seeing it before but can't place the context
16:40:56 <mlavalle> yeah, it's closing
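(For reference, a minimal sketch of the pyroute2 NetNS open/use/close pattern being discussed; it needs root, and the namespace name is made up. Per haleyb's point above, the ERROR lines in the l3 agent log are emitted around the close/teardown step rather than while the namespace is being used.)
```python
from pyroute2 import NetNS, netns

# 'ci-debug-ns' is a throwaway name; real ones look like qrouter-<router-id>
ns = NetNS('ci-debug-ns')   # opens (and creates) the namespace
try:
    # query the interfaces inside the namespace, similar to what the l3 agent does
    print([link.get_attr('IFLA_IFNAME') for link in ns.get_links()])
finally:
    ns.close()                    # nslink.py logs its error around this teardown
    netns.remove('ci-debug-ns')   # clean up the throwaway namespace
```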
16:43:40 <mlavalle> I can create a DVR environment easily
16:43:48 <mlavalle> if that is the route we want to take
16:44:15 <slaweq> it's my first idea to check, as I saw this happen very often in this dvr scenario but not in others
16:44:34 <slaweq> and it's always this test with shelve/unshelve instance
16:44:52 <slaweq> here is what test is doing: https://github.com/openstack/tempest/blob/master/tempest/api/compute/volumes/test_attach_volume.py#L224
16:45:10 <slaweq> it creates server and volume and then shelve server
16:45:18 <slaweq> and unshelve it
16:45:27 <slaweq> after unshelve it tries to connect to it
16:45:36 <slaweq> and this fails
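(A rough manual reproduction of the steps slaweq just listed, assuming a DVR devstack; the flavor/image/network/key names, server and volume names, and the floating IP below are all made up.)
```shell
openstack server create --flavor m1.tiny --image cirros --network private \
    --key-name mykey shelve-test
openstack volume create --size 1 shelve-vol
openstack floating ip create public
openstack server add floating ip shelve-test 172.24.4.10
ssh cirros@172.24.4.10 true          # connectivity works before shelving
openstack server shelve shelve-test
openstack server unshelve shelve-test
ssh cirros@172.24.4.10 true          # this is the step that fails in the CI runs
```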
16:46:19 <haleyb> i wonder if CONF.validation.run_validation is True, so that it checks ssh before shelving
16:46:56 <haleyb> i guess it must be
16:47:32 <slaweq> http://logs.openstack.org/51/414251/74/check/neutron-tempest-dvr-ha-multinode-full/2b4a730/logs/tempest_conf.txt.gz
16:47:36 <slaweq> it is set to True
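(For reference, that setting lives in tempest.conf under the [validation] group, so the relevant bit of the linked config is just:)
```ini
[validation]
run_validation = True
```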
16:50:39 <slaweq> ok, mlavalle, if You can deploy a dvr env quickly, can You test it and maybe report a bug for that issue?
16:51:00 <mlavalle> not necessarily today, but over the next few days
16:51:10 <slaweq> sure :)
16:51:11 <mlavalle> plain dvr is enough, right?
16:51:18 <slaweq> thx a lot
16:51:19 <mlavalle> I don't need ha, right?
16:51:28 <haleyb> and try an 'openstack server shelve/unshelve ...' i guess
16:51:43 <slaweq> mlavalle: good question
16:52:00 <slaweq> scenario is dvr-ha
16:52:10 <slaweq> but maybe dvr would be enough?
16:52:16 <slaweq> I don't know
16:52:21 <mlavalle> slaweq: mhhh, that may require a little more work
16:52:25 <mlavalle> I'll still try
16:52:33 <slaweq> haleyb: yes, I think that shelve/unshelve will be what is done in this test
16:52:52 <mlavalle> haleyb: ack
16:53:25 <haleyb> it might be enough to try with dvr, and see how it goes.  i might have a system to try if i can remember the IP
16:53:38 <mlavalle> LOL
16:53:40 <mlavalle> ok
16:53:51 <slaweq> thx haleyb
16:54:28 <slaweq> #action mlavalle/haleyb to check dvr (dvr-ha) env and shelve/unshelve server
16:54:34 <slaweq> good? :)
16:54:43 <mlavalle> yes
16:54:46 <slaweq> thx
16:54:50 <haleyb> sure, i did find it :)
16:55:09 <slaweq> ok, that's all from me for today :)
16:55:17 <slaweq> do You have anything else to add maybe?
16:55:30 <mlavalle> not from me
16:55:51 <mlavalle> just to say that I am glad I don't have to run 3 meetings in a row
16:55:58 <slaweq> LOL
16:56:11 <slaweq> yes, it's hard for sure
16:56:18 <slaweq> 2 are more than enough for me
16:56:25 <mlavalle> it's not the meetings
16:56:30 <mlavalle> it's the preparation
16:56:32 * njohnston is grateful to all of you
16:56:43 <slaweq> yes mlavalle, I agree
16:56:56 <slaweq> ok, thx guys for attending
16:56:59 <slaweq> see You next week
16:57:03 <slaweq> #endmeeting