16:00:10 <slaweq> #startmeeting neutron_ci
16:00:12 <openstack> Meeting started Tue Jul 16 16:00:10 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 <openstack> The meeting name has been set to 'neutron_ci'
16:00:20 <njohnston_> o/
16:00:33 <slaweq> welcome to another meeting :)
16:00:55 <njohnston_> last meeting of the day for me! \o/
16:00:56 <ralonsoh> hi
16:01:05 <slaweq> njohnston_: yes, for me too :)
16:01:08 <haleyb> hi
16:01:23 <slaweq> ok, let's start, as mlavalle will not be here today
16:01:30 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:38 <slaweq> please open the link and we can move on
16:01:55 <slaweq> #topic Actions from previous meetings
16:02:04 <slaweq> first one:
16:02:06 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:02:16 <slaweq> he sent me an update about his findings:
16:02:25 <slaweq> "In regards to the test connectivity with 2 routers scenario, when the L3 agent requests the router from the server, we call https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L1835-L1852
16:02:27 <slaweq> my theory right now is that we are leaving one exception uncaught in that piece of code: duplicate entry. I will propose a WIP patch to prove the theory. That's the next step."
16:03:34 <slaweq> so he is making progress on this one, hopefully it will be fixed soon :)
16:03:51 <slaweq> I will add an action for him for next week so we don't forget about this
16:03:58 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
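To make the "duplicate entry" theory above more concrete: this is the kind of error oslo.db raises as DBDuplicateEntry when two workers insert the same row concurrently. The sketch below only illustrates the general shape of guard such a patch might add; the helpers are stand-ins, not the actual code mlavalle linked, and his WIP patch may look quite different.

    # Illustrative sketch only -- stand-in helpers, not the referenced
    # neutron/plugins/ml2/plugin.py code.
    from oslo_db import exception as db_exc


    def _create_row(context, router_id, host):
        # Stand-in for the real DB insert; a concurrent insert of the same
        # key is what makes oslo.db raise DBDuplicateEntry.
        raise db_exc.DBDuplicateEntry()


    def _get_row(context, router_id, host):
        # Stand-in for the real DB query.
        return {'router_id': router_id, 'host': host}


    def get_or_create_row(context, router_id, host):
        """Treat a concurrent duplicate insert as a benign race."""
        try:
            return _create_row(context, router_id, host)
        except db_exc.DBDuplicateEntry:
            # Another worker created the same row first; reuse it instead of
            # letting the exception escape and fail the L3 agent's request.
            return _get_row(context, router_id, host)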
16:04:22 <slaweq> next one:
16:04:27 <slaweq> slaweq to contact yamamoto about networking-midonet py3 status
16:04:44 <slaweq> I emailed yamamoto and he responded
16:04:59 <slaweq> I pasted the info from him into the related etherpad
16:05:37 <slaweq> basically there are 2 issues with midonet and py3 right now; they are aware of them but don't know when they can be fixed
16:05:47 <slaweq> next one
16:05:49 <slaweq> ralonsoh to check how to increase, in FT, the privsep pool
16:06:10 <ralonsoh> slaweq, the number of threads is == num of CPUs
16:06:27 <ralonsoh> so we should not increase that number
16:06:39 <ralonsoh> how to solve the timeouts? I still don't know
16:06:50 <slaweq> ralonsoh: can we maybe limit the number of test runners?
16:07:08 <ralonsoh> slaweq, but that will reduce the execution speed a lot
16:07:24 <ralonsoh> slaweq, I recommend doing this only as a last resort
16:07:30 <slaweq> ralonsoh: maybe it's worth checking how much longer it would take then
16:07:54 <ralonsoh> slaweq, we can do (num_cpu - 1)
16:08:14 <ralonsoh> just for functional tests? or fullstack too?
16:08:57 <slaweq> ralonsoh: in fullstack it's different I think: each test has got its own "environment" so probably also its own privsep-helpers pool
16:09:04 <ralonsoh> ok
16:09:18 <ralonsoh> I'll submit a patch to reduce the number of workers in FT
16:09:24 <slaweq> thx
16:09:42 <slaweq> #action ralonsoh to try a patch to reduce the number of workers in FT
16:10:07 <slaweq> and the last one:
16:10:09 <slaweq> slaweq to check neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase issue
16:10:13 <slaweq> Bug reported https://bugs.launchpad.net/neutron/+bug/1836565
16:10:15 <openstack> Launchpad bug 1836565 in neutron "Functional test test_keepalived_state_change_notification may fail do to race condition" [Medium,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:10:18 <slaweq> Patch proposed https://review.opendev.org/670815
16:10:37 <slaweq> I found that it is a race condition which may cause such a failure in rare cases
16:10:44 <slaweq> but fortunately the fix was quite easy :)
16:11:01 <slaweq> any questions/comments?
16:11:43 <ralonsoh> slaweq, I need to check the patch first, I'll comment on it
16:11:55 <slaweq> ralonsoh: sure, thx
16:12:15 <slaweq> ok, so let's move on
16:12:17 <slaweq> #topic Stadium projects
16:12:24 <slaweq> Python 3 migration
16:12:26 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:12:42 <slaweq> I already talked about the midonet plugin
16:12:54 <slaweq> but I also have a short update about neutron-fwaas and networking-ovn
16:12:55 <njohnston_> I have a bagpipe update that looks ready to go: https://review.openstack.org/641057
16:13:12 <slaweq> for neutron-fwaas all voting jobs are switched to py3 now
16:13:31 <slaweq> today I sent a patch to switch the multinode tempest job: https://review.opendev.org/671005
16:14:02 <slaweq> but I think I will have to hold it and propose it to the neutron-tempest-plugin repo when njohnston_ finishes his patches related to the neutron-fwaas tempest plugin
16:14:24 <slaweq> for networking-ovn I sent a patch today to switch the rally job to py3: https://review.opendev.org/671006
16:15:14 <slaweq> with this patch I think we can consider networking-ovn done, because the last job which isn't switched is based on CentOS 7 so probably can't be run on py3
16:15:21 <slaweq> that's all from me
16:15:56 <slaweq> njohnston_: thx for Your update about bagpipe
16:16:07 <slaweq> I will review this patch tomorrow morning
16:16:08 <njohnston_> slaweq: The fwaas change in the neutron-tempest-plugin repo is already merged, so you could go ahead and propose it there immediately
16:16:23 <slaweq> njohnston_: ahh, yes, thx
16:16:31 <slaweq> so I will do it that way
16:16:35 <njohnston_> +1
16:17:15 <slaweq> so we are "almost" done with this transition :)
16:17:39 <slaweq> any other questions/comments?
16:18:06 <igordc> not on this topic from me
16:18:15 <slaweq> ok, let's move on then
16:18:16 <slaweq> tempest-plugins migration
16:18:18 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:18:33 <slaweq> any updates?
16:18:57 <njohnston_> As I mentioned before, the first part of the fwaas move happened and I am tinkering with the second half now: https://review.opendev.org/643668
16:19:30 <njohnston_> If you propose your converted multinode tempest job in n-t-p then I will incorporate it into that change
16:19:52 <slaweq> njohnston_: ok, thx a lot :)
16:20:09 <slaweq> from my side, I will take a look tomorrow at tidwellr's patch for neutron-dynamic-routing to help with it
16:21:01 <slaweq> so that should be all related to the stadium projects
16:21:11 <slaweq> I think we can move on to the next topics
16:21:13 <slaweq> #topic Grafana
16:22:08 <njohnston_> midonet co-gate broken again
16:22:25 <slaweq> I don't know why there is no data for the gate queue for the last few days - didn't we merge anything recently?
16:23:19 <njohnston_> the number of jobs that have been run in the gate queue seems pretty low, so I wonder if many patches are making it through
16:23:57 <njohnston_> when lots of jobs have moderate fail numbers - even pep8 has an 11% failure rate right now, which is unusual - then it seems likely that something will fail in a given run.
16:24:17 <slaweq> njohnston_: about the midonet job, You're right but I don't think we have anyone who wants to look into it
16:24:26 <slaweq> I guess it will have to wait for yamamoto
16:24:34 <njohnston_> indeed
16:25:38 <slaweq> njohnston_: about the pep8 jobs, I think I saw quite a lot of WIP patches recently which may explain this failure rate
16:26:20 <slaweq> on the good side, I see that the functional/fullstack jobs are slightly better now (below 30%)
16:26:21 <njohnston_> ok, could be. I hope so!
16:27:00 <slaweq> and also our "highest rate" jobs, the one with uwsgi and the one with dvr, are better now
16:27:09 <slaweq> not good, but better at least :)
16:28:01 <slaweq> anything else You want to add or can we move on?
16:29:13 <slaweq> ok, let's move on
16:29:15 <slaweq> #topic fullstack/functional
16:29:28 <njohnston_> nothing else from me
16:29:36 <slaweq> I saw an issue with neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase at least twice this week:
16:29:41 <slaweq> http://logs.openstack.org/15/670815/1/check/neutron-functional-python27/028631e/testr_results.html.gz
16:29:46 <slaweq> http://logs.openstack.org/11/662111/27/check/neutron-functional/8404d24/testr_results.html.gz
16:30:00 <slaweq> but IIRC it's related to the privsep pool in tests, right ralonsoh?
16:30:13 <ralonsoh> let me check
16:30:40 <ralonsoh> yes, and I think this is related to the sync decorator in ip_lib
16:31:00 <ralonsoh> but, of course, until we have a new pyroute2 release, we need to keep this sync decorator
16:31:14 <ralonsoh> and in this order: first the sync and then the privsep
16:31:24 <ralonsoh> this sometimes kills the FTs
16:31:38 <ralonsoh> because you enter into the privsep and then wait
16:31:54 <ralonsoh> it should be the other way around, but that is not supported by py2
16:32:19 <ralonsoh> that's all
16:32:27 <slaweq> ok, thx ralonsoh for the confirmation
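To illustrate the decorator ordering ralonsoh describes (see bug 1833721, linked just below, "ip_lib synchronized decorator should wrap the privileged one"): with the privileged decorator on the outside, each caller enters the privsep daemon first and only then blocks on the lock, so every waiter ties up one thread of the daemon's fixed-size pool (equal to the number of CPUs, as noted earlier). The sketch below is illustrative only, with a made-up lock name and empty function bodies, not the real neutron ip_lib code.

    # Illustrative sketch only -- not the real ip_lib code.
    from oslo_concurrency import lockutils

    from neutron import privileged

    # Problematic order described in the meeting: the call is dispatched to
    # the privsep daemon first and then waits for the lock *inside* the
    # daemon, occupying one of its worker threads while it waits.
    @privileged.default.entrypoint
    @lockutils.synchronized('privileged-ip-lib')
    def add_ip_address_current(ip, device):
        pass  # the pyroute2 call would go here

    # Order the bug asks for: the lock is taken in the caller's process and
    # only the winner enters the privsep daemon.  Per ralonsoh, this ordering
    # cannot be used while py2 is still supported.
    @lockutils.synchronized('privileged-ip-lib')
    @privileged.default.entrypoint
    def add_ip_address_proposed(ip, device):
        pass  # the pyroute2 call would go here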
16:32:36 <slaweq> and do we have a bug reported for that?
16:32:40 <ralonsoh> yes
16:32:44 <ralonsoh> I'll find it
16:32:54 <ralonsoh> please don't stop for me
16:33:23 <ralonsoh> https://review.opendev.org/#/c/666853/
16:33:30 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1833721
16:33:32 <openstack> Launchpad bug 1833721 in neutron "ip_lib synchronized decorator should wrap the privileged one" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:33:38 <slaweq> ralonsoh: thx a lot
16:33:53 <slaweq> ok
16:34:03 <slaweq> for fullstack I found one new failure
16:34:08 <slaweq> http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/testr_results.html.gz
16:34:17 <slaweq> failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:34:25 <slaweq> did You see something like that before?
16:34:42 <ralonsoh> yes, but almost one year ago
16:34:49 <ralonsoh> and that was solved
16:34:58 <ralonsoh> I can take a look at this
16:35:42 <slaweq> I see this error in the logs: http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/controller/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_egress_.txt.gz#_2019-07-15_22_14_11_638
16:36:14 <slaweq> but I'm not sure if that is really related
16:36:49 <ralonsoh> this should not happen
16:37:11 <ralonsoh> I'll download the logs and grep to see if someone else is deleting this port first
16:37:25 <slaweq> ok, thx ralonsoh
16:38:02 <slaweq> can You also report a bug for it, so that we can track it there?
16:38:09 <njohnston_> +1
16:38:25 <ralonsoh> slaweq, perfect
16:38:52 <slaweq> #action ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:38:55 <slaweq> thx ralonsoh :)
16:39:22 <slaweq> anything else related to functional/fullstack tests?
16:40:17 <slaweq> ok, let's move on then
16:40:23 <slaweq> #topic Tempest/Scenario
16:40:38 <slaweq> first, a few short updates
16:40:39 <slaweq> Two patches which should improve our gate a bit:
16:40:41 <slaweq> https://review.opendev.org/#/c/670717/
16:40:43 <slaweq> https://review.opendev.org/#/c/670718/
16:40:55 <slaweq> please take a look at them if You have a minute :)
16:41:17 <njohnston_> added them to the pile
16:41:23 <slaweq> thx njohnston_
16:41:26 <slaweq> and a second thing
16:41:36 <slaweq> I sent patches to propose 2 new jobs:
16:41:41 <slaweq> https://review.opendev.org/#/c/670029/ - based on TripleO standalone
16:41:50 <igordc> slaweq, do you think these will decrease the odds of sshtimeout?
16:41:59 <slaweq> this is a non-voting job for now, please review it too
16:42:08 <slaweq> and the second patch: https://review.opendev.org/#/c/670738/
16:42:34 <slaweq> I realized recently that we don't have a simple openvswitch job which runs neutron_tempest_plugin tests
16:42:41 <slaweq> we only have a linuxbridge job like that
16:42:55 <slaweq> so I'm proposing to have a similar job for ovs too
16:43:04 <slaweq> please let me know what You think about it
16:43:20 <slaweq> (in the patch review of course :))
16:43:28 <njohnston_> thanks, these are great ideas
16:43:36 <slaweq> thx njohnston_ :)
16:43:50 <slaweq> igordc: regarding the SSH issue, I opened bug https://bugs.launchpad.net/neutron/+bug/1836642
16:43:52 <openstack> Launchpad bug 1836642 in OpenStack Compute (nova) "Metadata responses are very slow sometimes" [Undecided,Incomplete]
16:44:04 <slaweq> basically there are 2 main issues with ssh:
16:44:23 <slaweq> 1. covered by mlavalle's investigation, which we talked about at the beginning of the meeting
16:45:02 <slaweq> 2. this issue with slow metadata responses to the /public-keys/ request, which causes a failure configuring the ssh key on the instance and then an ssh failure
16:45:38 <slaweq> regarding 2), I reported it for both nova and neutron as I noticed that in most cases it is the nova-metadata-api which is responding very slowly
16:46:15 <slaweq> but sean-k-mooney found out today that (at least in one case) nova was waiting a very long time for a response from neutron to get security groups
16:46:26 <slaweq> (this workflow is really complicated IMHO)
16:46:42 <igordc> slaweq, is the error different from sshtimeout for 2. ?
16:47:09 <slaweq> so I will continue investigating this but if someone wants to help, please take a look at this bug, maybe You will find something interesting :)
16:47:24 <njohnston_> wow, so nova has to ask for security group info in order to satisfy a metadata request? that seems highly nonintuitive.
16:47:32 <slaweq> igordc: basically in both cases the final issue is an authentication failure when sshing to the vm
16:47:55 <slaweq> njohnston_: yes, I was also surprised today when sean told me that
16:48:12 <igordc> yep I will take some time and see if I can find out something.. some patches get sshtimeout almost every time
16:48:41 <slaweq> igordc: the difference between those 2 cases is that in 1) You will see about 20 attempts to get the instance-id from the metadata server
16:49:16 <slaweq> and in 2) You will see that the instance-id was returned to the vm properly and there was a "failed to get .../public-keys" message in the console log
16:49:59 <slaweq> and in this second case, in most of the cases which I checked, it was because the response to this request took more than 10 seconds
16:50:20 <slaweq> 10 seconds is the timeout set in the curl used by the ec2-metadata script in cirros
16:51:11 <slaweq> that's all from me about this issue
16:51:16 <slaweq> any questions/comments?
16:52:34 <igordc> wondering if certain patches with changes in certain places make this more likely
16:52:47 <slaweq> igordc: I don't think so
16:53:28 <slaweq> IMO, at least this second problem described by me is somehow related to the performance of the node on which the tests are run
16:53:32 <slaweq> and nothing else
16:53:43 <slaweq> but I don't have any proof of that
16:54:40 <slaweq> ok
16:54:42 <slaweq> let's move on
16:54:46 <slaweq> as for other things
16:55:05 <slaweq> the neutron-tempest-with-uwsgi job is now running better after my fix for tempest was merged
16:55:13 <slaweq> but it has a lot of timeouts
16:55:50 <slaweq> maybe it will be better when it is switched to run only neutron and nova related tests
16:56:03 <slaweq> but I think it's worth investigating why it's so slow so often
16:56:21 <slaweq> so I will open a bug for it
16:56:39 <slaweq> #action slaweq to open a bug about the slow neutron-tempest-with-uwsgi job
16:57:03 <slaweq> and that's all from me regarding tempest jobs
16:57:13 <slaweq> anything else You want to add here?
16:57:42 <njohnston_> not I
16:57:50 <ralonsoh> no
16:57:55 <haleyb> -1
16:58:06 <slaweq> ok, thx for attending the meeting
16:58:12 <njohnston_> thanks!
16:58:12 <slaweq> and have a nice week
16:58:17 <slaweq> #endmeeting
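For readers digging into the second SSH failure mode discussed above, the snippet below is a rough Python approximation of what the cirros ec2-metadata script does with curl: fetch the SSH public key from the metadata service with a hard 10-second cap. The requests-based client and the exact URL path are stand-ins for illustration; the point is that a response slower than 10 seconds leaves the key uninstalled, which later surfaces as an SSH authentication failure rather than a network error.

    # Rough approximation of the guest-side fetch; the real cirros image
    # does this with curl and the 10 second timeout mentioned above.
    import requests

    METADATA_URL = ('http://169.254.169.254/latest/meta-data/'
                    'public-keys/0/openssh-key')


    def fetch_ssh_key():
        try:
            resp = requests.get(METADATA_URL, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            # A response slower than 10 seconds ends up here: the key never
            # makes it into authorized_keys, so the later SSH attempt fails
            # with an authentication error.
            return None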