16:00:04 <slaweq> #startmeeting neutron_ci
16:00:05 <openstack> Meeting started Tue Apr  9 16:00:04 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:07 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:09 <slaweq> hi everyone
16:00:10 <openstack> The meeting name has been set to 'neutron_ci'
16:00:52 <ralonsoh> hi
16:00:55 <bcafarel> hey again
16:01:25 <mlavalle> o/
16:02:03 <slaweq> ok, let's start
16:02:12 <slaweq> first of all
16:02:14 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:19 <slaweq> Please open now :)
16:02:31 <mlavalle> done
16:02:32 <haleyb> hi
16:02:38 <slaweq> #topic Actions from previous meetings
16:02:54 <slaweq> mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:03:04 <slaweq> mlavalle: any progress?
16:03:13 <mlavalle> slaweq: didn't make much progress there :-(
16:03:24 <slaweq> ok, no problem :)
16:03:37 <slaweq> I will assign it to You for next week just to have a "reminder"
16:03:42 <slaweq> fine for You?
16:03:50 <mlavalle> yes, thanks!
16:03:58 <slaweq> #action mlavalle to debug reasons of neutron-tempest-plugin-dvr-multinode-scenario failures
16:04:06 <slaweq> next one
16:04:07 <slaweq> slaweq to mark test_floatingip_port_details test as unstable
16:04:13 <slaweq> Done: https://review.openstack.org/#/c/647816/
16:04:34 <slaweq> and the last one from previous meeting was:
16:04:36 <slaweq> slaweq to report a bug related to intermittent ssh failures in various tests
16:04:43 <slaweq> Bug reported https://bugs.launchpad.net/neutron/+bug/1821912
16:04:44 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,Confirmed]
16:04:59 <slaweq> that is basically all about actions from last week
16:05:06 <slaweq> anything You want to add?
16:05:36 <mlavalle> nope
16:05:51 <slaweq> ok
16:05:56 <slaweq> next topic then
16:06:01 <slaweq> #topic Python 3
16:06:09 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:06:19 <slaweq> I don't have anything new to add here
16:06:24 <slaweq> njohnston any updates?
16:06:43 <njohnston> Me neither; this is in my personal pile, after the tempest plugin migration work.
16:07:06 <slaweq> ahh right, You mentioned that earlier today :)
16:07:15 <slaweq> ok, so let's move to the next topic then
16:07:21 <slaweq> #topic tempest-plugins migration
16:07:30 <slaweq> njohnston: any updates? :)
16:07:34 <bcafarel> :)
16:07:44 <slaweq> or anyone else have any updates on this?
16:08:03 <njohnston> So it was pointed out to me that there need to be zuul jobs defined both in the source repo and in neutron-tempest-plugin
16:08:03 <mlavalle> I started working on the vpnaas one
16:08:07 <mlavalle> https://review.openstack.org/#/c/649373/
16:08:19 <bcafarel> working on the networking-sfc one, should probably push step 1 tomorrow
16:08:41 <njohnston> so I am about to push a change to neutron-tempest-plugin that will run fwaas jobs when they are changed in the n-t-p repo.
16:08:47 <mlavalle> I'll push the next revision today or tomorrow
16:09:20 <njohnston> I wish there was a "relevant-files" option so I could say "only run if one of the affected files is in this path", but I think we only have the opposite, irrelevant-files.
16:09:32 <slaweq> njohnston: yes, I think that having the zuul job definition together with the tests would be nice, to check that the tests actually work
16:10:50 <slaweq> njohnston: yes, but maybe we can put such tests for stadium projects in a separate folder
16:10:59 <slaweq> like neutron_tempest_plugin/fwaas/
16:11:18 <slaweq> and then You can define that all other files are irrelevant for this job
16:11:26 <njohnston> yes they are definitely in a different folder.  And actually it looks like my wish is granted: "files" will do the job: https://zuul-ci.org/docs/zuul/user/config.html#attr-job.files
16:11:53 <slaweq> njohnston: super :)
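[A minimal sketch of the kind of job definition discussed above, using the "files" attribute from the linked Zuul docs; the job and parent names here are hypothetical stand-ins, not the actual neutron-tempest-plugin config:]

    - job:
        name: neutron-tempest-plugin-fwaas
        parent: devstack-tempest
        # Run this job only when a change touches the fwaas test tree;
        # "files" is the positive counterpart of "irrelevant-files".
        files:
          - ^neutron_tempest_plugin/fwaas/.*$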
16:12:50 <njohnston> that's it for me
16:12:58 <slaweq> thx all for update
16:13:23 <slaweq> I didn't start working on the bgpvpn project yet :/
16:13:31 <slaweq> but I will try to do it soon
16:13:35 <slaweq> I promise :)
16:14:30 <slaweq> ok, let's move on
16:14:32 <slaweq> #topic Grafana
16:14:57 <slaweq> #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:15:02 <slaweq> (just a reminder)
16:16:18 <mlavalle> not too bad
16:16:25 <slaweq> yes
16:16:37 <mlavalle> fullstack somewhat misbehaving
16:16:44 <mlavalle> but not too bad
16:16:44 <bcafarel> fullstack seems a bit grumpy again recently?
16:17:34 <slaweq> in check queue?
16:17:40 <njohnston> I've definitely had some trouble with fullstack, functional, and iptables_hybrid.  The fullstack and functional problems were "ovsdbapp: timeout after 10 seconds" style errors.
16:17:41 <mlavalle> yeah, a bit grumpy is a good description
16:18:33 <slaweq> njohnston: this issue is a very long standing one, it pops up from time to time and IMO it's related to slow nodes the job runs on
16:18:56 <njohnston> yes that was my recollection as well
16:19:19 <slaweq> otherwiseguy was looking into it a couple of times and he didn't find anything there
16:20:00 <slaweq> also I think that recently we were able to merge quite a few patches without rechecking each of them like 10 times :)
16:20:06 <slaweq> so that is IMHO progress ;)
16:20:22 <njohnston> yes, down to only 7 rechecks - 30% improvement!
16:20:24 <njohnston> ;-)
16:20:28 <slaweq> LOL
16:20:49 <mlavalle> LOL
16:21:12 <slaweq> ok, so let's talk about some specific issues now
16:21:19 <slaweq> (I prepared couple for today)
16:21:24 <slaweq> #topic fullstack/functional
16:21:48 <slaweq> first of all, we had issue with fullstack on stable branches recently: https://bugs.launchpad.net/neutron/+bug/1823155
16:21:49 <openstack> Launchpad bug 1823155 in neutron "neutron-fullstack fails on stable/rocky at ovs compilation" [Critical,Confirmed] - Assigned to Bernard Cafarelli (bcafarel)
16:21:53 <slaweq> thx bcafarel it's fixed
16:22:15 <slaweq> bcafarel: do You think we can close this bug? or there is still something to do with it?
16:23:07 <bcafarel> slaweq: I was wondering about backporting it further, as the job is non-voting there, but it may be useful to see the results
16:23:29 <bcafarel> though I admit I rarely check non-voting jobs results in stable branches
16:23:52 <slaweq> IMO we should backport it, even if it is non-voting; it would be better to not have it failing 100% of the time :)
16:25:01 <bcafarel> slaweq: ack, I'll hit that cherry-pick button, and then we should be good with that bug
16:25:09 <slaweq> ok, thx bcafarel
16:25:12 <mlavalle> ++
16:25:43 <slaweq> from other things related to fullstack
16:25:59 <slaweq> I was checking the results of CI jobs on some recent patches today
16:26:07 <slaweq> and I found failure like http://logs.openstack.org/85/644385/9/gate/neutron-fullstack/f7d4eb4/testr_results.html.gz
16:26:40 <ralonsoh> I'll take a look at this one
16:27:03 <slaweq> ralonsoh: in the logs there is an error like http://logs.openstack.org/85/644385/9/gate/neutron-fullstack/f7d4eb4/controller/logs/dsvm-fullstack-logs/TestMinBwQoSOvs.test_min_bw_qos_policy_rule_lifecycle_egress,openflow-native_/neutron-openvswitch-agent--2019-04-09--12-35-31-673671_log.txt.gz#_2019-04-09_12_35_46_735
16:27:23 <slaweq> isn't that something that was supposed to be fixed already?
16:27:37 <ralonsoh> this is something new
16:27:38 <mlavalle> in scenario tests, yes
16:28:01 <slaweq> yes, it's on "fresh" patch https://review.openstack.org/#/c/644385/
16:28:09 <mlavalle> but this failure seems different issue
16:28:13 <slaweq> and in gate queue, so for sure this was run on current master
16:28:30 <ralonsoh> we never had this problem before, but we need to deal with having several min QoS rules at the same time
16:28:38 <ralonsoh> during tests
16:29:06 <slaweq> ok, so ralonsoh can You report it as a bug in launchpad?
16:29:13 * bcafarel needs to leave o/
16:29:13 <ralonsoh> yes
16:29:18 <slaweq> thx a lot
16:29:28 <slaweq> I will assign this as an action for You, ok?
16:30:04 <slaweq> #action ralonsoh to report and triage new fullstack test_min_bw_qos_policy_rule_lifecycle failure
16:30:51 <mlavalle> yes, if he raises his hand, he gets the action item
16:31:02 <slaweq> mlavalle: :)
16:31:27 <slaweq> ok, let's move on
16:31:47 <slaweq> regarding functional tests, I found that we again have some failures with db migrations :/
16:31:49 <slaweq> like http://logs.openstack.org/73/636473/25/check/neutron-functional-python27/0ef9362/testr_results.html.gz
16:33:17 <slaweq> and what is strange here is that this should IMO have been skipped by the skip_if_timeout decorator which I added some time ago
16:33:51 <njohnston> yes that is odd
16:34:38 <slaweq> maybe I know why
16:34:49 <slaweq> it is the same error but a different exception is raised
16:35:04 <slaweq> in the decorator only InterfaceError is caught: https://github.com/openstack/neutron/commit/c0fec676723649a0516cf3d4af0dccc0fe832095#diff-7ae004589c685d938508e48bea69c581R123
16:35:17 <slaweq> and in this failing test oslo_db.exception.DBConnectionError was raised
16:35:46 <slaweq> I will report this as a bug and check why the raised exception changed
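[A minimal sketch, not the exact neutron code, of what catching both exceptions in such a decorator could look like; the structure mirrors the skip_if_timeout commit linked above, and adding DBConnectionError is the hypothetical fix being discussed:]

    import functools

    from oslo_db import exception as db_exc
    from sqlalchemy import exc as sqla_exc


    def skip_if_timeout(reason):
        def decorator(f):
            @functools.wraps(f)
            def wrapper(self, *args, **kwargs):
                try:
                    return f(self, *args, **kwargs)
                except (sqla_exc.InterfaceError,
                        db_exc.DBConnectionError):
                    # Both exceptions can surface for the same underlying
                    # db connection timeout on a slow node; skip the
                    # migration test instead of failing it.
                    self.skipTest(reason)
            return wrapper
        return decorator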
16:36:29 <slaweq> #action slaweq to report and debug yet another db migration functional tests issue
16:37:17 <slaweq> and that is all related to functional/fullstack tests this week
16:37:26 <slaweq> anything else You want to add?
16:38:05 <mlavalle> nope
16:38:26 <slaweq> so, next one
16:38:28 <slaweq> #topic Tempest/Scenario
16:38:44 <slaweq> our main issue there is https://bugs.launchpad.net/neutron/+bug/1821912
16:38:45 <openstack> Launchpad bug 1821912 in neutron "intermittent ssh failures in various scenario tests" [High,Confirmed]
16:39:05 <mlavalle> yeap
16:39:17 <slaweq> I think we should think about some possible ways to debug this issue
16:39:42 <slaweq> I described my "findings" in last comment there
16:40:15 <slaweq> basically, from what I saw in the logs I checked, the VM was able to reach the metadata server
16:40:27 <slaweq> so (probably) it was able to reach the router
16:40:35 <slaweq> but FIP wasn't working
16:41:47 <slaweq> so it could be some security groups issue, or maybe something wrong with the NAT configuration in the router's namespace
16:41:53 <slaweq> any ideas?
16:42:11 <mlavalle> I like the idea of DNM with tcpdump
16:43:12 <njohnston> yes I think that is pretty good... this is the exact case where skydive would be helpful
16:44:05 <slaweq> from other things, maybe (for single node jobs) we can do a DNM patch to tempest that will simply dump the iptables state, routing table and other things from the qrouter- namespace
16:44:49 <slaweq> I know that it's not what tempest should do, but in a single node job we are sure that the namespace is on the same host as the tests run, and we know the namespace name, so that can also be useful
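[A rough sketch of what such a single-node DNM debug helper could look like; the function name is hypothetical, and it assumes the test node can run "ip netns exec" via sudo:]

    import subprocess


    def dump_router_namespace(router_id):
        """Dump routing and NAT state from a qrouter- namespace."""
        ns = 'qrouter-%s' % router_id
        # tcpdump could be started here as well, but being long-running
        # it would need Popen and backgrounding rather than check_output.
        for cmd in ('ip addr', 'ip route', 'ip rule', 'iptables-save'):
            out = subprocess.check_output(
                ['sudo', 'ip', 'netns', 'exec', ns] + cmd.split())
            print('--- %s in %s ---\n%s' % (cmd, ns, out.decode()))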
16:45:06 <slaweq> njohnston: I know that skydive would be useful here but we don't have it yet :/
16:46:49 <slaweq> so any volunteers to do that?
16:47:53 <slaweq> ok, I will try to send such DNM patches maybe this or next week
16:48:01 <mlavalle> I will help
16:48:18 <slaweq> #action slaweq prepare DNM patches for debug FIP ssh issue in tempest jobs
16:48:21 <slaweq> thx mlavalle
16:48:34 <mlavalle> I can try to help with the tcpdump patch
16:48:40 <slaweq> great
16:49:12 <mlavalle> it just took me some time to understand what you were talking about
16:49:14 <slaweq> so I will do patch to tempest to "dump router's namespace state"
16:49:16 <mlavalle> I thought skydive
16:49:25 <slaweq> ahh
16:49:43 <slaweq> skydive is a tool which ajo and dalvarez were presenting to us in Denver
16:49:48 <mlavalle> I know
16:49:55 <slaweq> it is like "distributed tcpdump"
16:50:00 <slaweq> ok
16:50:05 <mlavalle> and I've seen patches from Assaf
16:51:06 <slaweq> but after Assaf sent those patches we discussed it and decided that skydive could be more helpful for debugging issues like the one we have now than adding some new tests
16:51:27 <slaweq> so to sum up
16:51:28 <mlavalle> ah ok
16:51:40 <mlavalle> I'll see if I have time to look at skydive again
16:51:52 <slaweq> #action mlavalle will send DNM patch which will add tcpdump in routers' namespaces to debug ssh issue
16:52:17 <slaweq> #action slaweq will send DNM patch to tempest to dump router's namespace state when ssh will fail
16:52:29 <slaweq> correct ^^?
16:52:35 <mlavalle> yes
16:52:38 <slaweq> ok, thx
16:53:00 <slaweq> from other things related to tempest jobs I saw one failure like:
16:53:02 <slaweq> http://logs.openstack.org/47/650947/2/check/tempest-full/8688145/testr_results.html.gz
16:53:13 <slaweq> it is also "well known" problem
16:53:28 <slaweq> tempest is cleaning up after the test, and sends a DELETE request
16:53:46 <slaweq> and this DELETE is processed for a veeeeery long time in neutron, here it was like 83 seconds
16:54:02 <slaweq> so the tempest http client got a timeout, and repeated the request
16:54:18 <slaweq> but in the meantime neutron removed the resource, so tempest got a 404 response to its second request
16:54:22 <slaweq> and the test failed
16:54:52 <slaweq> we can maybe talk with gmann and the other QA folks about whether tempest could avoid failing when it gets a 404 while deleting something
16:55:03 <slaweq> but that would only be a workaround for the problem
16:55:53 <gmann> slaweq: yeah, we usually ignore a 404 when deleting the resource after tests
16:56:22 <slaweq> so gmann, would it be reasonable to send a patch to tempest to ignore such a case and not fail?
16:56:27 <gmann> slaweq: if the delete is part of the test then we cannot ignore it; otherwise you are right, we can consider the 404 a pass for the test
16:56:54 <gmann> slaweq: for cleanup things, yes, but I did not check the tests you mentioned
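[For cleanup paths, tempest's lib already ships a helper that swallows the 404 gmann describes; a minimal sketch of using it, with hypothetical client and resource names:]

    from tempest.lib.common.utils import test_utils


    def cleanup_network(networks_client, net_id):
        # If the first DELETE timed out client-side and was retried,
        # the retry gets a 404; the helper treats NotFound as "already
        # gone" instead of failing the test.
        test_utils.call_and_ignore_notfound_exc(
            networks_client.delete_network, net_id)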
16:57:12 <slaweq> gmann: ok, thx
16:57:24 <njohnston> We're just about out of time; do we want to talk about WSGI testing?
16:57:31 <gmann> push a patch and i can review that
16:57:36 <slaweq> so I will check exactly where this failed and will push a patch
16:57:59 <slaweq> njohnston: sure, go on
16:58:17 <slaweq> gmann: thx :)
16:58:32 <slaweq> njohnston: or maybe we can add it to next week's agenda?
16:58:33 <njohnston> well, I was wondering if we should move the wsgi jobs from experimental to check (leaving them nonvoting)
16:58:36 <slaweq> and have more time for it?
16:58:59 <njohnston> just to gather data, and add them to the dashboard so we can see what their failure rates are
16:59:06 <slaweq> I think that we can move those jobs to check queue
16:59:09 <njohnston> and then we can talk more about it next week, but armed with facts
16:59:10 <slaweq> that would be fine
16:59:16 <mlavalle> ++
16:59:19 <slaweq> njohnston++
16:59:23 <njohnston> ok I'll do that
16:59:26 <slaweq> thx
16:59:29 <mlavalle> before we go
16:59:33 <njohnston> #action njohnston move wsgi jobs to check queue nonvoting
16:59:38 <mlavalle> congrats to njohnston
16:59:47 <mlavalle> https://www.wsj.com/articles/virginia-wins-ncaa-men-s-basketball-championship-11554781679?mod=article_inline
16:59:57 <slaweq> njohnston: congrats :)
16:59:58 <njohnston> thank you! \o/
17:00:04 <mlavalle> his alma mater is national basketball champion
17:00:13 <slaweq> ok, thx for attending the meeting
17:00:20 <slaweq> see You all next week
17:00:23 <slaweq> #endmeeting