16:00:19 <slaweq> #startmeeting neutron_ci
16:00:19 <openstack> Meeting started Tue May 7 16:00:19 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:20 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 <openstack> The meeting name has been set to 'neutron_ci'
16:00:37 <slaweq> hi
16:00:40 <mlavalle> o/
16:00:56 <haleyb> hi
16:01:44 <slaweq> let's wait a couple more minutes for njohnston_ ralonsoh and others, maybe they will join
16:03:13 <slaweq> ok, let's start
16:03:18 <slaweq> first thing:
16:03:20 <slaweq> #link http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1
16:03:33 <slaweq> please open it now so it will be ready later :)
16:03:42 <slaweq> #topic Actions from previous meetings
16:03:46 <mlavalle> ok
16:03:58 <slaweq> first action from 2 weeks ago was
16:04:00 <slaweq> mlavalle to continue debugging reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:13 <mlavalle> I didn't make progress on this one
16:04:27 <mlavalle> due to Summit / PTG
16:04:41 <slaweq> sure, I know :) Can I assign it to You for next week?
16:04:49 <mlavalle> yes please
16:04:55 <slaweq> #action mlavalle to continue debugging reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:58 <slaweq> thx
16:05:06 <slaweq> mlavalle to recheck tcpdump patch and analyze output from ci jobs
16:05:13 <slaweq> that is the next one ^^
16:05:18 <slaweq> any update?
16:05:22 <mlavalle> I spent time this morning looking at that
16:06:09 <mlavalle> I think my tcpdump command is too broad: http://logs.openstack.org/21/653021/2/check/neutron-tempest-plugin-dvr-multinode-scenario/0a24d77/controller/logs/screen-q-l3.txt.gz#_Apr_23_00_47_57_853357
16:06:40 <mlavalle> getting weird output as you can see
16:06:40 <mlavalle> bad checksums and that kind of stuff
16:06:55 <mlavalle> so I am going to focus a little bit more
16:07:24 <mlavalle> I am going to trace the qr and qg interfaces
16:07:24 <mlavalle> with tcp and port 22
16:07:37 <mlavalle> makes sense?
16:07:48 <slaweq> and please add the "-n" option to not resolve IPs to hostnames
16:07:58 <slaweq> IMHO it will be easier to look at
16:08:17 <mlavalle> yes, you are right
16:08:22 <mlavalle> I will probably also focus on single node jobs first
16:08:33 <haleyb> and maybe -l to not buffer
16:08:36 <slaweq> yes, and also "-e" to print mac addresses
16:09:04 <mlavalle> thanks for the recommendations. I will follow them
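A minimal sketch of the kind of capture being discussed above, assuming it is run on the node hosting the router, inside that router's namespace; the router UUID and port ID are placeholders, not values taken from the job logs:

    # SSH traffic only, no name resolution (-n), line-buffered output (-l),
    # link-level headers so MAC addresses are visible (-e)
    sudo ip netns exec qrouter-<router-uuid> \
        tcpdump -n -l -e -i qr-<port-id> 'tcp and port 22'

The same filter on the qg- interface (in whichever namespace that port lives for the router flavour being tested) would cover the external leg of the router.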
16:09:12 <mlavalle> I have a stupid question
16:09:16 <mlavalle> can I ask it?
16:09:20 <slaweq> sure :)
16:09:27 <slaweq> there are no stupid questions ;)
16:10:11 <mlavalle> while looking at which job to focus on in kibana, I noticed this: http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_454517
16:10:42 <mlavalle> we are not using password ssh in any case, right?
16:11:01 <slaweq> no, an ssh key is always used I think
16:11:21 <mlavalle> yes, that's what I think also
16:11:32 <mlavalle> but it was worth asking the dumb question
16:12:09 <slaweq> this is exactly an example of the second "type" of errors with SSH connectivity
16:12:12 <slaweq> look at http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/job-output.txt#_2019-05-07_14_03_22_550727
16:12:29 <slaweq> the instance-id was received properly from the metadata server
16:12:47 <slaweq> but then 2 lines below, it failed to get the public-key
16:13:02 <slaweq> it is exactly what I was testing last week
16:13:24 <mlavalle> do you have a feel for the ratio between type 1 and type 2 failures?
16:13:43 <slaweq> I don't know exactly but I would say 50:50
16:14:13 <mlavalle> I think the tcpdump testing I'm doing should help with type 1
16:14:34 <mlavalle> so I will focus on those
16:14:43 <mlavalle> in type 2, we know we have connectivity
16:14:56 <slaweq> yes
16:14:58 <mlavalle> because we fail authenticating
16:15:15 <slaweq> in this case the problem is a slow answer to metadata requests
16:17:48 <slaweq> http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/compute2/logs/screen-q-meta.txt.gz#_May_07_13_15_02_833919
16:18:00 <slaweq> here is this failed request in the neutron metadata agent's logs
16:18:17 <slaweq> and http://logs.openstack.org/66/641866/10/check/neutron-tempest-dvr-ha-multinode-full/ec1105b/controller/logs/screen-n-api-meta.txt.gz#_May_07_13_15_04_543966 -- that's how it looks in nova
16:18:27 <slaweq> there is a 10 second gap in the logs there
16:18:34 <mlavalle> yeap
16:19:11 <slaweq> mlavalle: maybe You can try to talk with someone from the nova team to look into those issues
16:19:41 <mlavalle> ok
16:19:59 <slaweq> thx
16:20:07 <slaweq> can I add it as an action for You also?
16:20:11 <mlavalle> yes
16:20:31 <slaweq> #action mlavalle to talk with nova folks about slow responses for metadata requests
16:20:31 <mlavalle> please
16:20:33 <slaweq> thx
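For context on the "type 2" failures above: during boot the cirros guest asks the metadata service first for its instance-id and then for its SSH public key, and in these runs the first request succeeds while the second one fails. A rough, illustrative reproduction from inside the guest (EC2-style paths shown only as an example):

    # both requests go through the neutron metadata proxy to nova
    curl http://169.254.169.254/latest/meta-data/instance-id
    curl http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key

A slow answer from nova, like the 10 second gap noted above, can be enough for the guest to give up on fetching the key, which then shows up as the SSH authentication failure in the test.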
16:20:45 <slaweq> ok, the next one was
16:20:47 <slaweq> njohnston move wsgi jobs to check queue nonvoting
16:21:05 <slaweq> I know it's done, we have wsgi jobs running in the check queue currently
16:21:31 <slaweq> and the tempest job is kinda broken now
16:21:48 <slaweq> so we will have to investigate it also
16:21:58 <slaweq> but that isn't very urgent for now
16:22:26 <slaweq> next one then
16:22:27 <slaweq> ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:22:38 <slaweq> I don't know if ralonsoh did anything with it
16:22:45 <ralonsoh> sorry
16:22:50 <ralonsoh> I didn't have time for it
16:22:58 <slaweq> sure, no problem :)
16:23:05 <slaweq> can I assign it to You for this week?
16:23:10 <ralonsoh> sure
16:23:15 <slaweq> #action ralonsoh to debug issue with neutron_tempest_plugin.api.admin.test_network_segment_range test
16:23:18 <slaweq> thx ralonsoh :)
16:23:27 <slaweq> and the last one was:
16:23:29 <slaweq> slaweq to cancel next week's meeting
16:23:34 <slaweq> done - that was easy :P
16:23:49 <slaweq> ok, any questions/comments?
16:24:41 <slaweq> ok, I will take this silence as a no :)
16:24:50 <slaweq> so let's move on then
16:24:51 <mlavalle> +1
16:24:52 <slaweq> #topic Stadium projects
16:25:02 <slaweq> first "Python 3 migration"
16:25:08 <slaweq> etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:25:34 <slaweq> I know that tidwellr started doing something with the neutron-dynamic-routing repo
16:26:07 <slaweq> his patch https://review.opendev.org/#/c/657409/
16:27:37 <slaweq> I just checked that for neutron-lib we are actually good
16:28:50 <slaweq> I will try to go through those projects in the next weeks
16:29:07 <slaweq> does anyone want to add something on this topic?
16:29:40 <mlavalle> nope
16:29:57 <slaweq> ok, let's move on
16:30:03 <slaweq> tempest-plugins migration
16:30:09 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:30:28 <slaweq> and I have a question here
16:30:52 <slaweq> I was recently struggling with errors on the jobs run on rocky and queens repos for networking-bgpvpn
16:31:16 <slaweq> but at the airport yesterday I realized that we probably don't need to run those jobs for stable branches yet
16:31:45 <slaweq> as we will not remove tests from the stable branches of the stadium projects' repos, right?
16:32:23 <slaweq> so we should only have these jobs in the neutron-tempest-plugin repo for the master branch for now and add stable branch jobs starting from the Train release
16:32:28 <slaweq> is that correct?
16:32:50 <mlavalle> I think so
16:33:41 <slaweq> ok, so that will make at least my patch easier :)
16:33:58 <slaweq> I will remove the jobs for stable branches from it and it will be ready for review then
16:34:12 <slaweq> any other questions/updates?
16:35:22 <mlavalle> not from me
16:35:35 <slaweq> ok, so let's move on then
16:35:37 <slaweq> #topic Grafana
16:35:47 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate - just a reminder :)
16:36:54 <slaweq> there isn't anything very bad there - all looks pretty much the same as usual
16:37:04 <mlavalle> yeap
16:37:08 <mlavalle> I think so
16:37:12 <slaweq> do You see anything You want to talk about?
16:38:04 <haleyb> is there a bug for the slow job failure? looks like a volume issue?
16:38:47 <slaweq> haleyb: issues with volumes (or volume backups) happen in various tempest jobs quite often
16:38:55 <slaweq> haleyb: do You have a link to an example?
16:39:06 <haleyb> http://logs.openstack.org/57/656357/1/check/tempest-slow-py3/900d859/testr_results.html.gz
16:39:15 <haleyb> second failure
16:39:47 <slaweq> yes, such errors happen from time to time
16:40:17 <slaweq> I'm not sure if exactly this one was reported to cinder but I was already reporting some similar errors
16:40:38 <slaweq> and we also discussed it in the QA session at the PTG
16:40:39 <haleyb> just seemed like it picked up recently in the gate, at 20% now
16:40:55 <slaweq> I hope You all saw the recent email from gmann about it
16:43:13 <slaweq> haleyb: I'm not sure if those 20% are only because of this issue
16:43:24 <slaweq> often there are also problems with ssh to instances
16:44:04 <slaweq> ok, let's move on to the next topic
16:44:05 <haleyb> slaweq: yes, that was in the other test failure, it's just at 2x the other jobs for failures
16:44:37 <slaweq> haleyb: where is tempest-slow at 2x the other jobs' failures? I don't see it being so high
16:45:32 <haleyb> slaweq: argh, it's the number of jobs run, i was looking at the right side...
16:45:56 <slaweq> ahh :)
16:46:11 <slaweq> but it is kinda strange that this job was run so many times :)
16:46:33 <haleyb> right, they should all be the same
16:46:58 <slaweq> yes, I will check if this graph is properly defined in grafana
16:47:38 <slaweq> it is not
16:47:55 <slaweq> it counts jobs from the check queue instead of the gate queue
16:47:58 <slaweq> I will fix that
16:48:17 <slaweq> #action slaweq to fix number of tempest-slow-py3 jobs in grafana
16:48:26 <slaweq> thx haleyb for pointing this out :)
16:48:36 <haleyb> :)
16:48:57 <slaweq> ok, let's move on then
16:48:59 <slaweq> #topic fullstack/functional
16:49:11 <slaweq> we still have quite high failure rates for those jobs :/
16:49:37 <slaweq> for functional tests we still quite often hit bug https://bugs.launchpad.net/neutron/+bug/1823038
16:49:38 <openstack> Launchpad bug 1823038 in neutron "Neutron-keepalived-state-change fails to check initial router state" [High,Confirmed]
16:49:46 <slaweq> like e.g. in * http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/testr_results.html.gz
16:50:14 <slaweq> and I have a question for You about that
16:51:14 <slaweq> some time ago I did https://github.com/openstack/neutron/commit/8fec1ffc833eba9b3fc5f812bf881f44b4beba0c
16:51:28 <slaweq> to address this race condition between keepalived and neutron-keepalived-state-change
16:51:36 <slaweq> and it works fine for me locally
16:52:41 <slaweq> but in the gate, for some (unknown to me) reason, this initial check of the status is failing with an error like http://logs.openstack.org/64/656164/1/gate/neutron-functional/c59dd7c/controller/logs/journal_log.txt.gz#_May_07_11_21_05
16:52:49 <slaweq> I have no idea why it is like that
16:53:17 <slaweq> maybe You can take a look into that and help me with it
16:53:57 <haleyb> slaweq: sorry, tuned out for a second, will look
16:53:58 <slaweq> I sent a DNM patch today https://review.opendev.org/#/c/657565/ to check if this binary is really in the .tox/dsvm-functional/bin directory
16:54:07 <slaweq> and it is there
16:54:37 <mlavalle> thanks haleyb.
16:55:54 <slaweq> thanks haleyb
16:55:55 <haleyb> slaweq: hmm, privsep-helper not found? that's odd
16:55:56 <mlavalle> fwiw, I've been noticing in the devstack I run on my mac (1 controller / network, 1 compute, DVR) that my HA routers sometimes have 2 masters
16:56:29 <haleyb> slaweq: project-config fix @ https://review.opendev.org/657646 :)
16:56:38 <mlavalle> that happens after I restart the deployment
16:56:40 <slaweq> mlavalle: that is probably a different issue
16:57:24 <slaweq> when this race of mine happened there were 2 standby routers instead of masters
16:57:28 <slaweq> haleyb: thx
16:58:17 <mlavalle> ok
16:58:30 <slaweq> haleyb: but in my DNM patch I simply tried to install oslo.privsep
16:58:32 <mlavalle> I'll try to debug it then
16:58:42 <slaweq> and it looks like for the python2.7 job it can find it now
16:58:54 <slaweq> but there is still another error there :/
16:58:59 <slaweq> I will have to look into it
16:59:24 <slaweq> and this is odd because it works fine for me locally, and I also don't think there are such errors in e.g. tempest jobs
16:59:39 <slaweq> so this issue is now strictly related to functional jobs IMO
16:59:52 <slaweq> ok, I think we are running out of time now
17:00:00 <slaweq> thx for attending and see You next week
17:00:08 <slaweq> #endmeeting
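As a follow-up to the privsep-helper discussion above, a quick local sanity check mirroring what the DNM patch was verifying; this is only a sketch and assumes a neutron checkout where the dsvm-functional tox environment can be built:

    # build the functional venv without running any tests
    tox -e dsvm-functional --notest
    # privsep-helper is installed into the venv together with oslo.privsep
    ls -l .tox/dsvm-functional/bin/privsep-helper
    .tox/dsvm-functional/bin/pip show oslo.privsep

If the helper is present there but the gate job still reports "privsep-helper not found", the difference is likely in the job's environment rather than in the code, which is presumably what the project-config fix linked above addresses.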