16:00:19 <slaweq> #startmeeting neutron_ci
16:00:19 <openstack> Meeting started Tue May  7 16:00:19 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 <openstack> The meeting name has been set to 'neutron_ci'
16:00:37 <slaweq> hi
16:00:40 <mlavalle> o/
16:00:56 <haleyb> hi
16:01:44 <slaweq> let's wait a couple more minutes for njohnston_, ralonsoh and others, maybe they will join
16:03:13 <slaweq> ok, let's start
16:03:18 <slaweq> first thing:
16:03:20 <slaweq> #link http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:03:33 <slaweq> please open it now so it will be ready later :)
16:03:42 <slaweq> #topic Actions from previous meetings
16:03:46 <mlavalle> ok
16:03:58 <slaweq> first action from 2 weeks ago was
16:04:00 <slaweq> mlavalle to continue debugging the reasons for neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:13 <mlavalle> I didn't make progress on this one
16:04:27 <mlavalle> due to Summit / PTG
16:04:41 <slaweq> sure, I know :) Can I assign it to You for next week?
16:04:49 <mlavalle> yes please
16:04:55 <slaweq> #action mlavalle to continue debugging the reasons for neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:58 <slaweq> thx
16:05:06 <slaweq> mlavalle to recheck tcpdump patch and analyze output from ci jobs
16:05:13 <slaweq> that is next one ^^
16:05:18 <slaweq> any update?
16:05:22 <mlavalle> I spent time this morning looking at that
16:06:09 <mlavalle> I think my tcpdump command is too broad: http://logs.openstack.org/21/653021/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0a24d77/controller/logs/screen-q-l3.txt.gz#_Apr_23_00_47_57_853357
16:06:40 <mlavalle> getting weird output as you can see
16:06:40 <mlavalle> bad checksums and that kind of stuff
16:06:55 <mlavalle> so I am going to focus a little bit more
16:07:24 <mlavalle> I am going to trace qr and qg interfaces
16:07:24 <mlavalle> with tcp and port 22
16:07:37 <mlavalle> makes sense?
16:07:48 <slaweq> and please add the "-n" option to not resolve IPs to hostnames
16:07:58 <slaweq> IMHO it will be easier to look at
16:08:17 <mlavalle> yes, you are right
16:08:22 <mlavalle> I will probably also focus on single node jobs first
16:08:33 <haleyb> and maybe -l to not buffer
16:08:36 <slaweq> yes, and also "-e" to print mac addresses
16:09:04 <mlavalle> thanks for the recommendations. I will follow them
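Putting the suggestions above together, the capture being discussed would look roughly like the sketch below. This is only an illustration of the intended tcpdump invocation, not the actual debug patch; the namespace, interface names and output paths are made-up placeholders.

    import subprocess

    def start_capture(namespace, interface, output_file):
        # tcpdump inside the router namespace with the options discussed above:
        # -n (no name resolution), -e (print MAC addresses), -l (line-buffered),
        # filtering on SSH traffic only ("tcp and port 22").
        cmd = [
            "ip", "netns", "exec", namespace,
            "tcpdump", "-n", "-e", "-l", "-i", interface,
            "tcp", "and", "port", "22",
        ]
        out = open(output_file, "a")
        return subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT)

    # e.g. capture on the internal (qr-) and external (qg-) interfaces of a
    # hypothetical router namespace:
    # start_capture("qrouter-<uuid>", "qr-aaaaaaaa-aa", "/opt/stack/logs/qr-capture.txt")
    # start_capture("qrouter-<uuid>", "qg-bbbbbbbb-bb", "/opt/stack/logs/qg-capture.txt")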
16:09:12 <mlavalle> I have a stupid question
16:09:16 <mlavalle> can I ask it?
16:09:20 <slaweq> sure :)
16:09:27 <slaweq> there are no stupid questions ;)
16:10:11 <mlavalle> while looking at which job to focus on in kibana, I noticed this: http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_454517
16:10:42 <mlavalle> we are not using password ssh in any case, right?
16:11:01 <slaweq> no, an ssh key is always used I think
16:11:21 <mlavalle> yes, that's what I think also
16:11:32 <mlavalle> but it was worth asking the dumb question
16:12:09 <slaweq> this is exactly an example of the second "type" of error with SSH connectivity
16:12:12 <slaweq> look at http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_550727
16:12:29 <slaweq> instance-id was received properly from metadata server
16:12:47 <slaweq> but then, 2 lines below, it failed to get the public-key
16:13:02 <slaweq> it is exactly what I was testing last week
16:13:24 <mlavalle> do you have a feel for the ratio between type 1 and type 2 failures?
16:13:43 <slaweq> I don't know exactly but I would say 50:50
16:14:13 <mlavalle> I think the tcpdump testing I'm doing should help with type 1
16:14:34 <mlavalle> so I will focus on those
16:14:43 <mlavalle> in type 2, we know we have connectivity
16:14:56 <slaweq> yes
16:14:58 <mlavalle> because we fail authenticating
16:15:15 <slaweq> in this case the problem is a slow answer to metadata requests
16:17:48 <slaweq> http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/compute2/logs/screen-q-meta.txt.gz#_May_07_13_15_02_833919
16:18:00 <slaweq> here is this failed request in neutron metadata agent's logs
16:18:17 <slaweq> and http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/controller/logs/screen-n-api-meta.txt.gz#_May_07_13_15_04_543966 -- that's how it looks in nova
16:18:27 <slaweq> there is a 10 second gap in the logs there
16:18:34 <mlavalle> yeap
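For context, the "type 2" failure pattern above corresponds to the guest making EC2-style metadata requests like the ones sketched below. This is a minimal illustration only; the 10 second per-request timeout is an assumption about the guest image's init scripts, not something confirmed in the logs.

    import urllib.request

    BASE = "http://169.254.169.254/latest/meta-data"

    def fetch(path, timeout=10):
        # Each request goes guest -> neutron metadata proxy/agent -> nova
        # metadata API; a slow hop anywhere in that chain can exceed the
        # guest-side timeout.
        with urllib.request.urlopen("%s/%s" % (BASE, path), timeout=timeout) as resp:
            return resp.read().decode()

    # fetch("instance-id")                 # succeeded in the linked run
    # fetch("public-keys/0/openssh-key")   # this is the request that failed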
16:19:11 <slaweq> mlavalle: maybe You can try to talk with someone from the nova team to look into those issues
16:19:41 <mlavalle> ok
16:19:59 <slaweq> thx
16:20:07 <slaweq> can I add it as an action for You also?
16:20:11 <mlavalle> yes
16:20:31 <slaweq> #action mlavalle to talk with nova folks about slow responses for metadata requests
16:20:31 <mlavalle> please
16:20:33 <slaweq> thx
16:20:45 <slaweq> ok, next one was
16:20:47 <slaweq> njohnston move wsgi jobs to check queue nonvoting
16:21:05 <slaweq> I know it's done, we have wsgi jobs running in the check queue currently
16:21:31 <slaweq> and the tempest job is kinda broken now
16:21:48 <slaweq> so we will have to investigate it also
16:21:58 <slaweq> but that isn't very urgent for now
16:22:26 <slaweq> next one then
16:22:27 <slaweq> ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:22:38 <slaweq> I don't know if ralonsoh did anything with it
16:22:45 <ralonsoh> sorry
16:22:50 <ralonsoh> I didn't have time for it
16:22:58 <slaweq> sure, no problem :)
16:23:05 <slaweq> can I assign it to You for this week?
16:23:10 <ralonsoh> sure
16:23:15 <slaweq> #action ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:23:18 <slaweq> thx ralonsoh :)
16:23:27 <slaweq> and the last one was:
16:23:29 <slaweq> slaweq to cancel next week meeting
16:23:34 <slaweq> done - that was easy :P
16:23:49 <slaweq> ok, any questions/comments?
16:24:41 <slaweq> ok, I will take this silence as no :)
16:24:50 <slaweq> so let's move on then
16:24:51 <mlavalle> +1
16:24:52 <slaweq> #topic Stadium projects
16:25:02 <slaweq> first "Python 3 migration"
16:25:08 <slaweq> etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:25:34 <slaweq> I know that tidwellr started doing something with neutron-dynamic-routing repo
16:26:07 <slaweq> his patch https://review.opendev.org/#/c/657409/
16:27:37 <slaweq> I just checked that for neutron-lib we are actually good
16:28:50 <slaweq> I will try to go through those projects in the next weeks
16:29:07 <slaweq> does anyone want to add something on this topic?
16:29:40 <mlavalle> nope
16:29:57 <slaweq> ok, let's move on
16:30:03 <slaweq> tempest-plugins migration
16:30:09 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:30:28 <slaweq> and I have a question here
16:30:52 <slaweq> I was recently struggling with errors in the jobs run on rocky and queens for networking-bgpvpn
16:31:16 <slaweq> but at the airport yesterday I realized that we probably don't need to run those jobs for stable branches yet
16:31:45 <slaweq> as we will not remove tests from the stable branches of the stadium projects' repos, right?
16:32:23 <slaweq> so we should only have these jobs in the neutron-tempest-plugin repo for the master branch for now and add stable branch jobs starting from the Train release
16:32:28 <slaweq> is that correct?
16:32:50 <mlavalle> I think so
16:33:41 <slaweq> ok, so that will make at least my patch easier :)
16:33:58 <slaweq> I will remove the jobs for stable branches from it and it will be ready for review then
16:34:12 <slaweq> any other questions/updates?
16:35:22 <mlavalle> not from me
16:35:35 <slaweq> ok, so let's move on then
16:35:37 <slaweq> #topic Grafana
16:35:47 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate - just a reminder :)
16:36:54 <slaweq> there isn't anything very bad there - all looks pretty same as usual
16:37:04 <mlavalle> yeap
16:37:08 <mlavalle> I think so
16:37:12 <slaweq> do You see anything that You want to talk about?
16:38:04 <haleyb> is there a bug for the slow job failure? looks like a volume issue?
16:38:47 <slaweq> haleyb: issues with volumes (or volume backups) happen in various tempest jobs quite often
16:38:55 <slaweq> haleyb: do You have link to example?
16:39:06 <haleyb> http://logs.openstack.org/57/656357/1/check/tempest-slow-py3/900d859/testr_results.html.gz
16:39:15 <haleyb> second failure
16:39:47 <slaweq> yes, such errors happen from time to time
16:40:17 <slaweq> I'm not sure if exactly this one was reported to cinder but I have already reported some similar errors
16:40:38 <slaweq> and we also discussed it in the QA session at the PTG
16:40:39 <haleyb> just seemed like it picked up recently in the gate, at 20% now
16:40:55 <slaweq> I hope You all saw the recent email from gmann about it
16:43:13 <slaweq> haleyb: I'm not sure if those 20% are only because of this issue
16:43:24 <slaweq> often there are also problems with ssh to instances
16:44:04 <slaweq> ok, let's move on to the next topic
16:44:05 <haleyb> slaweq: yes, that was in the other test failure, it's just at 2x the other jobs for failures
16:44:37 <slaweq> haleyb: where is tempest-slow at 2x the other jobs' failures? I don't see it being so high
16:45:32 <haleyb> slaweq: argh, it's the number of jobs run, I was looking at the right side...
16:45:56 <slaweq> ahh :)
16:46:11 <slaweq> but it is kinda strange that this job was run so many times :)
16:46:33 <haleyb> right, they should all be the same
16:46:58 <slaweq> yes, I will check if this graph is properly defined in grafana
16:47:38 <slaweq> it is not
16:47:55 <slaweq> it counts jobs from check queue instead of gate queue
16:47:58 <slaweq> I will fix that
16:48:17 <slaweq> #action slaweq to fix number of tempest-slow-py3 jobs in grafana
16:48:26 <slaweq> thx haleyb for pointing this :)
16:48:36 <haleyb> :)
16:48:57 <slaweq> ok, let's move on then
16:48:59 <slaweq> #topic fullstack/functional
16:49:11 <slaweq> we still have quite high failure rates for those jobs :/
16:49:37 <slaweq> for functional tests we still quite often hit bug https://bugs.launchpad.net/neutron/+bug/1823038
16:49:38 <openstack> Launchpad bug 1823038 in neutron "Neutron-keepalived-state-change fails to check initial router state" [High,Confirmed]
16:49:46 <slaweq> like e.g. in http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/testr_results.html.gz
16:50:14 <slaweq> and I have a question for You about that
16:51:14 <slaweq> some time ago I did https://github.com/openstack/neutron/commit/8fec1ffc833eba9b3fc5f812bf881f44b4beba0c
16:51:28 <slaweq> to address this race condition between keepalived and neutron-keepalived-state-change
16:51:36 <slaweq> and it works fine for me locally
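Roughly, the initial status check being discussed amounts to inspecting the router's HA interface before relying on "ip monitor" events. The sketch below is a loose illustration of that idea only, not the code from the commit above; the namespace, interface name and VIP CIDR are made-up placeholders.

    import subprocess

    def initial_router_state(namespace, ha_interface, ha_vip_cidr):
        # If keepalived has already configured the HA VIP on this interface,
        # report the router as master; otherwise report backup.
        out = subprocess.check_output(
            ["ip", "netns", "exec", namespace,
             "ip", "addr", "show", "dev", ha_interface]).decode()
        return "master" if ha_vip_cidr in out else "backup"

    # initial_router_state("qrouter-<uuid>", "ha-cccccccc-cc", "169.254.0.1/24")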
16:52:41 <slaweq> but in the gate, for some (unknown to me) reason, this initial status check is failing with an error like http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/controller/logs/journal_log.txt.gz#_May_07_11_21_05
16:52:49 <slaweq> I have no idea why it is like that
16:53:17 <slaweq> maybe You can take a look into that and help me with it
16:53:57 <haleyb> slaweq: sorry, tuned out for a second, will look
16:53:58 <slaweq> I sent a DNM patch today https://review.opendev.org/#/c/657565/ to check if this binary is really in the .tox/dsvm-functional/bin directory
16:54:07 <slaweq> and it is there
16:54:37 <mlavalle> thanks haleyb.
16:55:54 <slaweq> thanks haleyb
16:55:55 <haleyb> slaweq: hmm, privsep-helper not found?  that's odd
16:55:56 <mlavalle> fwiw, I've been noticing in the devstack I run on my mac (1 controller / network, 1 compute, DVR) that my HA routers sometimes have 2 masters
16:56:29 <haleyb> slaweq: project-config fix @ https://review.opendev.org/657646 :)
16:56:38 <mlavalle> that happens after I restart the deployment
16:56:40 <slaweq> mlavalle: that is probably a different issue
16:57:24 <slaweq> when this race of mine happened, there were 2 standby routers instead of masters
16:57:28 <slaweq> haleyb: thx
16:58:17 <mlavalle> ok
16:58:30 <slaweq> haleyb: but in my DNM patch I simply tried to install oslo.privsep
16:58:32 <mlavalle> I'll try to debug it then
16:58:42 <slaweq> and it looks like the python2.7 job can find it now
16:58:54 <slaweq> but there is another error there still :/
16:58:59 <slaweq> I will have to look into it
16:59:24 <slaweq> and this is odd because it works fine for me locally, and I also don't think there are such errors in e.g. tempest jobs
16:59:39 <slaweq> so this issue is now strictly related to functional jobs IMO
16:59:52 <slaweq> ok, I think we are running out of time now
17:00:00 <slaweq> thx for attending and see You next week
17:00:08 <slaweq> #endmeeting