16:00:04 <slaweq> #startmeeting neutron_ci
16:00:05 <openstack> Meeting started Tue Apr 9 16:00:04 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <slaweq> hi everyone
16:00:10 <openstack> The meeting name has been set to 'neutron_ci'
16:00:52 <ralonsoh> hi
16:00:55 <bcafarel> hey again
16:01:25 <mlavalle> o/
16:02:03 <slaweq> ok, lets start
16:02:12 <slaweq> first of all
16:02:14 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:19 <slaweq> Please open now :)
16:02:31 <mlavalle> done
16:02:32 <haleyb> hi
16:02:38 <slaweq> #topic Actions from previous meetings
16:02:54 <slaweq> mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:03:04 <slaweq> mlavalle: any progress?
16:03:13 <mlavalle> slaweq: didn't make much progress there :-(
16:03:24 <slaweq> ok, no problem :)
16:03:37 <slaweq> I will assign it to You for next week just to have "reminder"
16:03:42 <slaweq> fine for You?
16:03:50 <mlavalle> yes, thanks!
16:03:58 <slaweq> #action mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:06 <slaweq> next one
16:04:07 <slaweq> slaweq to mark test_floatingip_port_details test as unstable
16:04:13 <slaweq> Done: https://review.openstack.org/#/c/647816/
16:04:34 <slaweq> and the last one from previous meeting was:
16:04:36 <slaweq> slaweq to report a bug related to intermittent ssh failures in various tests
16:04:43 <slaweq> Bug reported https://bugs.launchpad.net/neutron/+bug/1821912
16:04:44 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,Confirmed]
16:04:59 <slaweq> that is basically all about actions from last week
16:05:06 <slaweq> anything You want to add?
16:05:36 <mlavalle> nope
16:05:51 <slaweq> ok
16:05:56 <slaweq> next topic then
16:06:01 <slaweq> #topic Python 3
16:06:09 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:06:19 <slaweq> I don't have anything new to add here
16:06:24 <slaweq> njohnston any updates?
16:06:43 <njohnston> Me neither; I have this after the tempest plugin migration work in my personal pile.
16:07:06 <slaweq> ahh right, You mentioned that earlier today :)
16:07:15 <slaweq> ok, so lets move to next topic then
16:07:21 <slaweq> #topic tempest-plugins migration
16:07:30 <slaweq> njohnston: any updates? :)
16:07:34 <bcafarel> :)
16:07:44 <slaweq> or anyone else have any updates on this?
16:08:03 <njohnston> So it was pointed out to me that there need to be zuul jobs defined both in the source repo and in neutron-tempest-plugin
16:08:03 <mlavalle> I started working on the vpnaas one
16:08:07 <mlavalle> https://review.openstack.org/#/c/649373/
16:08:19 <bcafarel> working on networking-sfc one, should probably push step 1 tomorrow (
16:08:41 <njohnston> so I am about to push a change to neutron-tempest-plugin that will run fwaas jobs when they are changed in the n-t-p repo.
16:08:47 <mlavalle> I'll push the next revision today or tomorrow
16:09:20 <njohnston> I wish there was a "relevant files" so I could say "only run if one of the affected files is in this path" but I think we only have the opposite, irrelevant-files.
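A minimal sketch of the path-based filter being discussed here, using Zuul's "files" job attribute (which njohnston points to a few lines below); the job name, parent, and regexes are illustrative assumptions rather than the actual neutron-tempest-plugin change:

```yaml
# Hypothetical entry in neutron-tempest-plugin's Zuul config. "files"
# restricts when the job runs: only changes touching the listed paths
# trigger it, so the fwaas job runs only when the fwaas tests change.
- job:
    name: neutron-tempest-plugin-fwaas          # illustrative job name
    parent: neutron-tempest-plugin-scenario     # illustrative parent job
    files:
      - ^neutron_tempest_plugin/fwaas/.*$
```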
16:09:32 <slaweq> njohnston: yes, I think that having zuul job definition together with tests would be nice to check if tests actually work
16:10:50 <slaweq> njohnston: yes, but maybe we can put such tests for stadium project in separate folder
16:10:59 <slaweq> like neutron_tempest_plugin/fwaas/
16:11:18 <slaweq> and then You can define that all other files are irrelevant for this job
16:11:26 <njohnston> yes they are definitely in a different folder. And actually it looks like my wish is granced: "files" will do the job: https://zuul-ci.org/docs/zuul/user/config.html#attr-job.files
16:11:40 <njohnston> *granted
16:11:53 <slaweq> njohnston: super :)
16:12:50 <njohnston> that's it for me
16:12:58 <slaweq> thx all for update
16:13:23 <slaweq> I didn't start work on bgpvpn project yet :/
16:13:31 <slaweq> but I will try to do it soon
16:13:35 <slaweq> I promise :)
16:14:30 <slaweq> ok, lets move on
16:14:32 <slaweq> #topic Grafana
16:14:57 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:15:02 <slaweq> (just a reminder)
16:16:18 <mlavalle> not too bad
16:16:25 <slaweq> yes
16:16:37 <mlavalle> fullstack somewhat misbehaving
16:16:44 <mlavalle> but not tooo bad
16:16:44 <bcafarel> fullstack seems a bit grumpy again recently?
16:17:34 <slaweq> in check queue?
16:17:40 <njohnston> I've definitely had some trouble from fullstack, functional, and iptables_hybrid. The fullstack and functional problems were "ovsdbapp: timeout after 10 seconds" style errors.
16:17:41 <mlavalle> yeah, bit grumpy is good description
16:18:33 <slaweq> njohnston: this issue is a very long standing one, it pops up from time to time and IMO it's related to slow node on which job is running
16:18:56 <njohnston> yes that was my recollection as well
16:19:19 <slaweq> otherwiseguy was looking into it a couple of times and he didn't find anything there
16:20:00 <slaweq> also I think that recently we were able to merge quite many patches without rechecking each of them like 10 times :)
16:20:06 <slaweq> so that is IMHO progress ;)
16:20:22 <njohnston> yes, down to only 7 rechecks - 30% improvement!
16:20:24 <njohnston> ;-)
16:20:28 <slaweq> LOL
16:20:47 <mlavalle> OOL
16:20:49 <mlavalle> LOL
16:21:12 <slaweq> ok, so lets talk about some specific issues now
16:21:19 <slaweq> (I prepared couple for today)
16:21:24 <slaweq> #topic fullstack/functional
16:21:48 <slaweq> first of all, we had issue with fullstack on stable branches recently: https://bugs.launchpad.net/neutron/+bug/1823155
16:21:49 <openstack> Launchpad bug 1823155 in neutron "neutron-fullstack fails on stable/rocky at ovs compilation" [Critical,Confirmed] - Assigned to Bernard Cafarelli (bcafarel)
16:21:53 <slaweq> thx bcafarel it's fixed
16:22:15 <slaweq> bcafarel: do You think we can close this bug? or there is still something to do with it?
16:23:07 <bcafarel> slaweq: I was wondering about backporting it further, as job is non-voting there, but it may be useful to see results
16:23:29 <bcafarel> though I admit I rarely check non-voting jobs results in stable branches
16:23:52 <slaweq> IMO we should backport it, even if it is non-voting, would be better to not have it failing 100% times :)
16:25:01 <bcafarel> slaweq: ack, I'll hit that cherry-pick button, and then we should be good with that bug
16:25:09 <slaweq> ok, thx bcafarel
16:25:12 <mlavalle> ++
16:25:43 <slaweq> from other things related to fullstack
16:25:59 <slaweq> I was today checking results of ci jobs on some recent patches
16:26:07 <slaweq> and I found failure like http://logs.openstack.org/85/644385/9/gate/neutron-fullstack/f7d4eb4/testr_results.html.gz
16:26:40 <ralonsoh> I'll take a look at this one
16:27:03 <slaweq> ralonsoh: in logs there is error like http://logs.openstack.org/85/644385/9/gate/neutron-fullstack/f7d4eb4/controller/logs/dsvm-fullstack-logs/TestMinBwQoSOvs.test_min_bw_qos_policy_rule_lifecycle_egress,openflow-native_/neutron-openvswitch-agent--2019-04-09--12-35-31-673671_log.txt.gz#_2019-04-09_12_35_46_735
16:27:23 <slaweq> isn't it something that was supposed to be already fixed?
16:27:37 <ralonsoh> this is something new
16:27:38 <mlavalle> in scenario tests, yes
16:28:01 <slaweq> yes, it's on "fresh" patch https://review.openstack.org/#/c/644385/
16:28:09 <mlavalle> but this failure seems different issue
16:28:13 <slaweq> and in gate queue, so for sure this was run on current master
16:28:30 <ralonsoh> we never had this problem before, but we need to deal with having several min QoS rules at the same time
16:28:38 <ralonsoh> during tests
16:29:06 <slaweq> ok, so ralonsoh can You report it as a bug in launchpad?
16:29:13 * bcafarel needs to leave o/
16:29:13 <ralonsoh> yes
16:29:18 <slaweq> thx a lot
16:29:28 <slaweq> I will assign this as an action for You, ok?
16:30:04 <slaweq> #action ralonsoh to report and triage new fullstack test_min_bw_qos_policy_rule_lifecycle failure
16:30:51 <mlavalle> yes, if he raises the hand, he gets the action item
16:31:02 <slaweq> mlavalle: :)
16:31:27 <slaweq> ok, lets move on
16:31:47 <slaweq> regarding functional tests I found that we again have some failures with db migrations :/
16:31:49 <slaweq> like http://logs.openstack.org/73/636473/25/check/neutron-functional-python27/0ef9362/testr_results.html.gz
16:33:17 <slaweq> and what is strange here, that should be IMO skipped with this decorator skip_if_timeout which I added some time ago
16:33:51 <njohnston> yes that is odd
16:34:38 <slaweq> maybe I know why
16:34:49 <slaweq> it is the same error but different exception is raised
16:35:04 <slaweq> in the decorator InterfaceError is caught: https://github.com/openstack/neutron/commit/c0fec676723649a0516cf3d4af0dccc0fe832095#diff-7ae004589c685d938508e48bea69c581R123
16:35:17 <slaweq> and in this failure test oslo_db.exception.DBConnectionError was raised
16:35:46 <slaweq> I will report this as a bug and check why there is such change in raised exceptions
16:36:29 <slaweq> #action slaweq to report and debug yet another db migration functional tests issue
16:37:17 <slaweq> and that is all related to functional/fullstack tests this week
16:37:26 <slaweq> anything else You want to add?
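For reference, a minimal sketch of how a skip_if_timeout-style decorator could be broadened to also catch the oslo_db DBConnectionError seen in this failure, in addition to SQLAlchemy's InterfaceError; this is an illustration of the idea being discussed, not the actual neutron code, and the message check and skip text are assumptions:

```python
# Sketch only: skip a DB migration functional test when the test database
# connection times out on a slow CI node, regardless of whether the driver
# surfaces sqlalchemy.exc.InterfaceError or oslo_db DBConnectionError.
import functools

from oslo_db import exception as os_db_exc
from sqlalchemy import exc as sqla_exc


def skip_if_timeout(test_method):
    @functools.wraps(test_method)
    def wrapper(self, *args, **kwargs):
        try:
            return test_method(self, *args, **kwargs)
        except (sqla_exc.InterfaceError,
                os_db_exc.DBConnectionError) as error:
            # Assumed heuristic: only skip when the error looks like a
            # connection timeout, otherwise let the test fail normally.
            if 'timeout' in str(error).lower():
                self.skipTest('DB connection timed out: %s' % error)
            raise
    return wrapper
```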
16:38:05 <mlavalle> nope
16:38:26 <slaweq> so, next one
16:38:28 <slaweq> #topic Tempest/Scenario
16:38:44 <slaweq> our main issue there is https://bugs.launchpad.net/neutron/+bug/1821912
16:38:45 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,Confirmed]
16:39:05 <mlavalle> yeap
16:39:17 <slaweq> I think we should think about some possible actions how to debug this issue
16:39:42 <slaweq> I described my "findings" in last comment there
16:40:15 <slaweq> basically from what I saw in logs which I was checking is that VM was able to reach metadata server
16:40:27 <slaweq> so (probably) it was able to reach router
16:40:35 <slaweq> but FIP wasn't working
16:41:47 <slaweq> so it could be some security groups issue maybe or something with configuration of NAT in router's namespace maybe
16:41:53 <slaweq> any ideas?
16:42:11 <mlavalle> I like the idea of DNM with tcpdump
16:43:12 <njohnston> yes I think that is pretty good... this is the exact case where skydive would be helpful
16:44:05 <slaweq> from other things, maybe (for single node jobs) we can do DNM patch to tempest that will simply dump iptables state, routing table state and other things from qrouter- namespace
16:44:49 <slaweq> I know that it's not what tempest should do, but in single node job, we are sure that namespace is on same host as tests are run, we know namespace name, so that also can be maybe useful
16:45:06 <slaweq> njohnston: I know that skydive would be useful here but we don't have it yet :/
16:46:49 <slaweq> so any volunteers to do that?
16:47:53 <slaweq> ok, I will try to send such DNM patches maybe this or next week
16:48:01 <mlavalle> I will help
16:48:18 <slaweq> #action slaweq prepare DNM patches for debug FIP ssh issue in tempest jobs
16:48:21 <slaweq> thx mlavalle
16:48:34 <mlavalle> I can try to help with the tcpdump patch
16:48:40 <slaweq> great
16:49:12 <mlavalle> it just took me some time to understand what you were talking about
16:49:14 <slaweq> so I will do patch to tempest to "dump router's namespace state"
16:49:16 <mlavalle> I thought skydive
16:49:25 <slaweq> ahh
16:49:43 <slaweq> skydive is a tool which ajo and dalvarez were presenting us in Denver
16:49:48 <mlavalle> I know
16:49:55 <slaweq> it is like "distributed tcpdump"
16:50:00 <slaweq> ok
16:50:05 <mlavalle> and I've seen patches from Assaf
16:51:06 <slaweq> but after Assaf sent those patches we discussed it and decided that skydive can be more helpful in debugging such issues like we have now instead of doing some new tests
16:51:27 <slaweq> so to sum up
16:51:28 <mlavalle> ah ok
16:51:40 <mlavalle> I'll see if I have time to look at skydive again
16:51:52 <slaweq> #action mlavalle will send DNM patch which will add tcpdump in routers' namespaces to debug ssh issue
16:52:17 <slaweq> #action slaweq will send DNM patch to tempest to dump router's namespace state when ssh will fail
16:52:29 <slaweq> correct ^^?
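A minimal sketch of the kind of dump such a DNM tempest patch could log on a single-node job when SSH to the floating IP fails; the helper name and exact command list are illustrative assumptions, not an existing tempest API:

```python
# Sketch only: collects qrouter- namespace state (addresses, routes, and
# the iptables NAT rules that implement the floating IP) so it can be
# inspected in the job logs after an SSH failure.
import subprocess


def dump_router_namespace_state(router_id):
    namespace = 'qrouter-%s' % router_id
    commands = [
        ['ip', 'addr'],
        ['ip', 'route'],
        ['iptables-save'],      # shows the DNAT/SNAT rules for the FIP
        ['conntrack', '-L'],    # assumes the conntrack tool is installed
    ]
    for command in commands:
        full_cmd = ['sudo', 'ip', 'netns', 'exec', namespace] + command
        try:
            output = subprocess.check_output(
                full_cmd, stderr=subprocess.STDOUT)
            print('$ %s\n%s' % (' '.join(full_cmd), output.decode()))
        except (subprocess.CalledProcessError, OSError) as error:
            print('$ %s failed: %s' % (' '.join(full_cmd), error))
```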
16:52:35 <mlavalle> yes
16:52:38 <slaweq> ok, thx
16:53:00 <slaweq> from other things related to tempest jobs I saw one failure like:
16:53:02 <slaweq> http://logs.openstack.org/47/650947/2/check/tempest-full/8688145/testr_results.html.gz
16:53:13 <slaweq> it is also "well known" problem
16:53:28 <slaweq> tempest is cleaning after test, sends DELETE request
16:53:46 <slaweq> and this DELETE is processed for veeeeery long time in neutron, here it was like 83 seconds
16:54:02 <slaweq> so tempest http client got timeout, and repeats request
16:54:18 <slaweq> but in the meantime neutron removed resource and tempest got 404 response for its second request
16:54:22 <slaweq> and test is failed
16:54:52 <slaweq> we can maybe talk with gmann and other qa folks if maybe tempest shouldn't fail when it got 404 when deleting something
16:55:03 <slaweq> but that would be only workaround of the problem
16:55:53 <gmann> slaweq: yeah we usually have ignore case for 404 when deleting the resource after tests
16:56:22 <slaweq> so gmann will it be reasonable to send patch to tempest to ignore such case and not fail?
16:56:27 <gmann> slaweq: if delete is part of test then we cannot ignore otherwise you are right we can consider the 404 as pass for tests
16:56:54 <gmann> slaweq: for cleanup things yes. but i did not check the tests you mentioned
16:57:12 <slaweq> gmann: ok, thx
16:57:24 <njohnston> We're just about out of time; do we want to talk about WSGI testing?
16:57:31 <gmann> push a patch and i can review that
16:57:36 <slaweq> so I will check exactly what was the place when this failed and will push patch
16:57:59 <slaweq> njohnston: sure, go on
16:58:17 <slaweq> gmann: thx :)
16:58:32 <slaweq> njohnston: or maybe we can add it to next week's agenda?
16:58:33 <njohnston> well, I was wondering if we should move the wsgi jobs from experimental to check (leaving them nonvoting)
16:58:36 <slaweq> and have more time for it?
16:58:59 <njohnston> just to gather data, and add them to the dashboard so we can see how their failure rates are
16:59:06 <slaweq> I think that we can move those jobs to check queue
16:59:09 <njohnston> and then we can talk more about it next week, but armed with facts
16:59:10 <slaweq> that would be fine
16:59:16 <mlavalle> ++
16:59:19 <slaweq> njohnston++
16:59:23 <njohnston> ok I'll do that
16:59:26 <slaweq> thx
16:59:29 <mlavalle> before we go
16:59:33 <njohnston> #action njohnston move wsgi jobs to check queue nonvoting
16:59:38 <mlavalle> congrats to njohnston
16:59:47 <mlavalle> https://www.wsj.com/articles/virginia-wins-ncaa-men-s-basketball-championship-11554781679?mod=article_inline
16:59:57 <slaweq> njohnston: congrats :)
16:59:58 <njohnston> thank you! \o/
17:00:04 <mlavalle> his alma mater is national basketball champion
17:00:13 <slaweq> ok, thx for attending the meeting
17:00:20 <slaweq> see You all next week
17:00:23 <slaweq> #endmeeting
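On the DELETE/404 discussion above: a minimal sketch of the cleanup pattern gmann describes, using tempest's call_and_ignore_notfound_exc helper; the mixin name, client attribute, and resource are illustrative, and this is not the specific patch slaweq plans to push:

```python
# Sketch only: when a slow DELETE makes tempest's HTTP client time out and
# retry, the retried request gets a 404 because the first one already
# removed the resource; wrapping the cleanup delete like this treats that
# 404 as success instead of a test failure.
from tempest.lib.common.utils import test_utils


class ExampleNetworkCleanup(object):

    def _cleanup_network(self, network_id):
        # NotFound raised by the (repeated) DELETE is swallowed by the
        # helper, so cleanup succeeds even if the resource is already gone.
        test_utils.call_and_ignore_notfound_exc(
            self.networks_client.delete_network, network_id)
```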