16:00:10 <slaweq> #startmeeting neutron_ci
16:00:12 <openstack> Meeting started Tue Jul 16 16:00:10 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 <openstack> The meeting name has been set to 'neutron_ci'
16:00:20 <njohnston_> o/
16:00:33 <slaweq> welcome to another meeting :)
16:00:55 <njohnston_> last meeting of the day for me! \o/
16:00:56 <ralonsoh> hi
16:01:05 <slaweq> njohnston_: yes, for me too :)
16:01:08 <haleyb> hi
16:01:23 <slaweq> ok, let's start, as mlavalle will not be here today
16:01:30 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:38 <slaweq> please open the link and we can move on
16:01:55 <slaweq> #topic Actions from previous meetings
16:02:04 <slaweq> first one:
16:02:06 <slaweq> mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:02:16 <slaweq> he sent me an update about his findings:
16:02:25 <slaweq> "In regards to the test connectivity with 2 routers scenario, when the L3 agent requests the router from the server, we call https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L1835-L1852
16:02:27 <slaweq> my theory right now is that we are leaving one exception uncaught in that piece of code: duplicate entry. I will propose a WIP patch to prove the theory. That's the next step."
16:03:34 <slaweq> so he is making progress on this one, hopefully it will be fixed soon :)
16:03:51 <slaweq> I will add an action for him for next week so we don't forget about this
16:03:58 <slaweq> #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
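To make the "duplicate entry" theory above more concrete: this is the kind of error oslo.db raises as DBDuplicateEntry when two workers insert the same row concurrently. The sketch below only illustrates the general shape of guard such a patch might add; the helpers are stand-ins, not the actual code mlavalle linked, and his WIP patch may look quite different.

    # Illustrative sketch only -- stand-in helpers, not the referenced
    # neutron/plugins/ml2/plugin.py code.
    from oslo_db import exception as db_exc


    def _create_row(context, router_id, host):
        # Stand-in for the real DB insert; a concurrent insert of the same
        # key is what makes oslo.db raise DBDuplicateEntry.
        raise db_exc.DBDuplicateEntry()


    def _get_row(context, router_id, host):
        # Stand-in for the real DB query.
        return {'router_id': router_id, 'host': host}


    def get_or_create_row(context, router_id, host):
        """Treat a concurrent duplicate insert as a benign race."""
        try:
            return _create_row(context, router_id, host)
        except db_exc.DBDuplicateEntry:
            # Another worker created the same row first; reuse it instead of
            # letting the exception escape and fail the L3 agent's request.
            return _get_row(context, router_id, host)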
16:04:22 <slaweq> next one:
16:04:27 <slaweq> slaweq to contact yamamoto about networking-midonet py3 status
16:04:44 <slaweq> I emailed yamamoto and he responded
16:04:59 <slaweq> I pasted the info from him into the related etherpad
16:05:37 <slaweq> basically there are 2 issues with midonet and py3 right now; they are aware of them but don't know when they can be fixed
16:05:47 <slaweq> next one
16:05:49 <slaweq> ralonsoh to check how to increase, in FT, the privsep pool
16:06:10 <ralonsoh> slaweq, the number of threads is == num of CPUs
16:06:27 <ralonsoh> so we should not increase that number
16:06:39 <ralonsoh> how to solve the timeouts? I still don't know
16:06:50 <slaweq> ralonsoh: can we maybe limit the number of test runners?
16:07:08 <ralonsoh> slaweq, but that will reduce the execution speed a lot
16:07:24 <ralonsoh> slaweq, I recommend doing this only as a last resort
16:07:30 <slaweq> ralonsoh: maybe it's worth checking how much longer it would take then
16:07:54 <ralonsoh> slaweq, we can do (num_cpu - 1)
16:08:14 <ralonsoh> just for functional tests? or fullstack too?
16:08:57 <slaweq> ralonsoh: in fullstack it's different I think: each test has got its own "environment" so probably also its own privsep-helpers pool
16:09:04 <ralonsoh> ok
16:09:18 <ralonsoh> I'll submit a patch to reduce the number of workers in FT
16:09:24 <slaweq> thx
16:09:42 <slaweq> #action ralonsoh to try a patch to reduce the number of workers in FT
16:10:07 <slaweq> and the last one:
16:10:09 <slaweq> slaweq to check neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase issue
16:10:13 <slaweq> Bug reported https://bugs.launchpad.net/neutron/+bug/1836565
16:10:15 <openstack> Launchpad bug 1836565 in neutron "Functional test test_keepalived_state_change_notification may fail do to race condition" [Medium,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:10:18 <slaweq> Patch proposed https://review.opendev.org/670815
16:10:37 <slaweq> I found that it is a race condition which may cause such a failure in rare cases
16:10:44 <slaweq> but fortunately the fix was quite easy :)
16:11:01 <slaweq> any questions/comments?
16:11:43 <ralonsoh> slaweq, I need to check the patch first, I'll comment on it
16:11:55 <slaweq> ralonsoh: sure, thx
16:12:15 <slaweq> ok, so let's move on
16:12:17 <slaweq> #topic Stadium projects
16:12:24 <slaweq> Python 3 migration
16:12:26 <slaweq> Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:12:42 <slaweq> I already talked about the midonet plugin
16:12:54 <slaweq> but I also have a short update about neutron-fwaas and networking-ovn
16:12:55 <njohnston_> I have a bagpipe update that looks ready to go: https://review.openstack.org/641057
16:13:12 <slaweq> for neutron-fwaas all voting jobs are switched to py3 now
16:13:31 <slaweq> today I sent a patch to switch the multinode tempest job: https://review.opendev.org/671005
16:14:02 <slaweq> but I think I will have to hold it and propose it to the neutron-tempest-plugin repo when njohnston_ finishes his patches related to the neutron-fwaas tempest plugin
16:14:24 <slaweq> for networking-ovn I sent a patch today to switch the rally job to py3: https://review.opendev.org/671006
16:15:14 <slaweq> with this patch I think we can consider networking-ovn done, because the last job which isn't switched is based on CentOS 7 so probably can't be run on py3
16:15:21 <slaweq> that's all from me
16:15:56 <slaweq> njohnston_: thx for Your update about bagpipe
16:16:07 <slaweq> I will review this patch tomorrow morning
16:16:08 <njohnston_> slaweq: The fwaas change in the neutron-tempest-plugin repo is already merged, so you could go ahead and propose it there immediately
16:16:23 <slaweq> njohnston_: ahh, yes, thx
16:16:31 <slaweq> so I will do it that way
16:16:35 <njohnston_> +1
16:17:15 <slaweq> so we are "almost" done with this transition :)
16:17:39 <slaweq> any other questions/comments?
16:18:06 <igordc> not on this topic from me
16:18:15 <slaweq> ok, let's move on then
16:18:16 <slaweq> tempest-plugins migration
16:18:18 <slaweq> Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:18:33 <slaweq> any updates?
16:18:57 <njohnston_> As I mentioned before, the first part of the fwaas move happened and I am tinkering with the second half now: https://review.opendev.org/643668
16:19:30 <njohnston_> If you propose your converted multinode tempest job in n-t-p then I will incorporate it into that change
16:19:52 <slaweq> njohnston_: ok, thx a lot :)
16:20:09 <slaweq> from my side, I will take a look tomorrow at tidwellr's patch for neutron-dynamic-routing to help with it
16:21:01 <slaweq> so that should be all related to the stadium projects
16:21:11 <slaweq> I think we can move on to the next topics
16:21:13 <slaweq> #topic Grafana
16:22:08 <njohnston_> midonet co-gate broken again
16:22:25 <slaweq> I don't know why there is no data for the gate queue for the last few days - didn't we merge anything recently?
16:23:19 <njohnston_> the number of jobs that have been run in the gate queue seems pretty low, so I wonder if many patches are making it through
16:23:57 <njohnston_> when lots of jobs have moderate fail numbers - even pep8 has an 11% failure rate right now, which is unusual - then it seems likely that something will fail in a given run.
16:24:17 <slaweq> njohnston_: about the midonet job, You're right but I don't think we have anyone who wants to look into it
16:24:26 <slaweq> I guess it will have to wait for yamamoto
16:24:34 <njohnston_> indeed
16:25:38 <slaweq> njohnston_: about the pep8 jobs, I think I saw quite a lot of WIP patches recently which may explain this failure rate
16:26:20 <slaweq> on the good side, I see that the functional/fullstack jobs are slightly better now (below 30%)
16:26:21 <njohnston_> ok, could be. I hope so!
16:27:00 <slaweq> and also our "highest rate" jobs, the one with uwsgi and the one with dvr, are better now
16:27:09 <slaweq> not good, but better at least :)
16:28:01 <slaweq> anything else You want to add or can we move on?
16:29:13 <slaweq> ok, let's move on
16:29:15 <slaweq> #topic fullstack/functional
16:29:28 <njohnston_> nothing else from me
16:29:36 <slaweq> I saw an issue with neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase at least twice this week:
16:29:41 <slaweq> http://logs.openstack.org/15/670815/1/check/neutron-functional-python27/028631e/testr_results.html.gz
16:29:46 <slaweq> http://logs.openstack.org/11/662111/27/check/neutron-functional/8404d24/testr_results.html.gz
16:30:00 <slaweq> but IIRC it's related to the privsep pool in tests, right ralonsoh?
16:30:13 <ralonsoh> let me check
16:30:40 <ralonsoh> yes, and I think this is related to the sync decorator in ip_lib
16:31:00 <ralonsoh> but, of course, until we have a new pyroute2 release, we need to keep this sync decorator
16:31:14 <ralonsoh> and in this order: first the sync and then the privsep
16:31:24 <ralonsoh> this sometimes kills the FTs
16:31:38 <ralonsoh> because you enter into the privsep and then wait
16:31:54 <ralonsoh> it should be the other way around, but that is not supported by py2
16:32:19 <ralonsoh> that's all
16:32:27 <slaweq> ok, thx ralonsoh for the confirmation
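To illustrate the decorator ordering ralonsoh describes (see bug 1833721, linked just below, "ip_lib synchronized decorator should wrap the privileged one"): with the privileged decorator on the outside, each caller enters the privsep daemon first and only then blocks on the lock, so every waiter ties up one thread of the daemon's fixed-size pool (equal to the number of CPUs, as noted earlier). The sketch below is illustrative only, with a made-up lock name and empty function bodies, not the real neutron ip_lib code.

    # Illustrative sketch only -- not the real ip_lib code.
    from oslo_concurrency import lockutils

    from neutron import privileged

    # Problematic order described in the meeting: the call is dispatched to
    # the privsep daemon first and then waits for the lock *inside* the
    # daemon, occupying one of its worker threads while it waits.
    @privileged.default.entrypoint
    @lockutils.synchronized('privileged-ip-lib')
    def add_ip_address_current(ip, device):
        pass  # the pyroute2 call would go here

    # Order the bug asks for: the lock is taken in the caller's process and
    # only the winner enters the privsep daemon.  Per ralonsoh, this ordering
    # cannot be used while py2 is still supported.
    @lockutils.synchronized('privileged-ip-lib')
    @privileged.default.entrypoint
    def add_ip_address_proposed(ip, device):
        pass  # the pyroute2 call would go here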
16:32:36 <slaweq> and do we have a bug reported for that?
16:32:40 <ralonsoh> yes
16:32:44 <ralonsoh> I'll find it
16:32:54 <ralonsoh> please don't stop for me
16:33:23 <ralonsoh> https://review.opendev.org/#/c/666853/
16:33:30 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1833721
16:33:32 <openstack> Launchpad bug 1833721 in neutron "ip_lib synchronized decorator should wrap the privileged one" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:33:38 <slaweq> ralonsoh: thx a lot
16:33:53 <slaweq> ok
16:34:03 <slaweq> for fullstack I found one new failure
16:34:08 <slaweq> http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/testr_results.html.gz
16:34:17 <slaweq> failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:34:25 <slaweq> did You see something like that before?
16:34:42 <ralonsoh> yes, but almost one year ago
16:34:49 <ralonsoh> and that was solved
16:34:58 <ralonsoh> I can take a look at this
16:35:42 <slaweq> I see this error in the logs: http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/controller/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_egress_.txt.gz#_2019-07-15_22_14_11_638
16:36:14 <slaweq> but I'm not sure if that is really related
16:36:49 <ralonsoh> this should not happen
16:37:11 <ralonsoh> I'll download the logs and grep to see if someone else is deleting this port first
16:37:25 <slaweq> ok, thx ralonsoh
16:38:02 <slaweq> can You also report a bug for it, so that we can track it there?
16:38:09 <njohnston_> +1
16:38:25 <ralonsoh> slaweq, perfect
16:38:52 <slaweq> #action ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:38:55 <slaweq> thx ralonsoh :)
16:39:22 <slaweq> anything else related to functional/fullstack tests?
16:40:17 <slaweq> ok, let's move on then
16:40:23 <slaweq> #topic Tempest/Scenario
16:40:38 <slaweq> first, a few short updates
16:40:39 <slaweq> Two patches which should improve our gate a bit:
16:40:41 <slaweq> https://review.opendev.org/#/c/670717/
16:40:43 <slaweq> https://review.opendev.org/#/c/670718/
16:40:55 <slaweq> please take a look at them if You have a minute :)
16:41:17 <njohnston_> added them to the pile
16:41:23 <slaweq> thx njohnston_
16:41:26 <slaweq> and a second thing
16:41:36 <slaweq> I sent patches to propose 2 new jobs:
16:41:41 <slaweq> https://review.opendev.org/#/c/670029/ - based on TripleO standalone
16:41:50 <igordc> slaweq, do you think these will decrease the odds of sshtimeout?
16:41:59 <slaweq> this is a non-voting job for now, please review it too
16:42:08 <slaweq> and the second patch: https://review.opendev.org/#/c/670738/
16:42:34 <slaweq> I realized recently that we don't have a simple openvswitch job which runs neutron_tempest_plugin tests
16:42:41 <slaweq> we only have a linuxbridge job like that
16:42:55 <slaweq> so I'm proposing to have a similar job for ovs too
16:43:04 <slaweq> please let me know what You think about it
16:43:20 <slaweq> (in the patch review of course :))
16:43:28 <njohnston_> thanks, these are great ideas
16:43:36 <slaweq> thx njohnston_ :)
16:43:50 <slaweq> igordc: regarding the SSH issue, I opened bug https://bugs.launchpad.net/neutron/+bug/1836642
16:43:52 <openstack> Launchpad bug 1836642 in OpenStack Compute (nova) "Metadata responses are very slow sometimes" [Undecided,Incomplete]
16:44:04 <slaweq> basically there are 2 main issues with ssh:
16:44:23 <slaweq> 1. covered by mlavalle's investigation, which we talked about at the beginning of the meeting
16:45:02 <slaweq> 2. this issue with slow metadata responses to the /public-keys/ request, which causes a failure configuring the ssh key on the instance and then an ssh failure
16:45:38 <slaweq> regarding 2), I reported it for both nova and neutron as I noticed that in most cases it is the nova-metadata-api which is responding very slowly
16:46:15 <slaweq> but sean-k-mooney found out today that (at least in one case) nova was waiting a very long time for a response from neutron to get security groups
16:46:26 <slaweq> (this workflow is really complicated IMHO)
16:46:42 <igordc> slaweq, is the error different from sshtimeout for 2. ?
16:47:09 <slaweq> so I will continue investigating this but if someone wants to help, please take a look at this bug, maybe You will find something interesting :)
16:47:24 <njohnston_> wow, so nova has to ask for security group info in order to satisfy a metadata request? that seems highly nonintuitive.
16:47:32 <slaweq> igordc: basically in both cases the final issue is an authentication failure when sshing to the vm
16:47:55 <slaweq> njohnston_: yes, I was also surprised today when sean told me that
16:48:12 <igordc> yep I will take some time and see if I can find out something.. some patches get sshtimeout almost every time
16:48:41 <slaweq> igordc: the difference between those 2 cases is that in 1) You will see about 20 attempts to get the instance-id from the metadata server
16:49:16 <slaweq> and in 2) You will see that the instance-id was returned to the vm properly and there was a "failed to get .../public-keys" message in the console log
16:49:59 <slaweq> and in this second case, in most of the cases which I checked, it was because the response to this request took more than 10 seconds
16:50:20 <slaweq> 10 seconds is the timeout set in the curl used by the ec2-metadata script in cirros
16:51:11 <slaweq> that's all from me about this issue
16:51:16 <slaweq> any questions/comments?
16:52:34 <igordc> wondering if certain patches with changes in certain places make this more likely
16:52:47 <slaweq> igordc: I don't think so
16:53:28 <slaweq> IMO, at least this second problem described by me is somehow related to the performance of the node on which the tests are run
16:53:32 <slaweq> and nothing else
16:53:43 <slaweq> but I don't have any proof of that
16:54:40 <slaweq> ok
16:54:42 <slaweq> let's move on
16:54:46 <slaweq> as for other things
16:55:05 <slaweq> the neutron-tempest-with-uwsgi job is now running better after my fix for tempest was merged
16:55:13 <slaweq> but it has a lot of timeouts
16:55:50 <slaweq> maybe it will be better when it is switched to run only neutron and nova related tests
16:56:03 <slaweq> but I think it's worth investigating why it's so slow so often
16:56:21 <slaweq> so I will open a bug for it
16:56:39 <slaweq> #action slaweq to open a bug about the slow neutron-tempest-with-uwsgi job
16:57:03 <slaweq> and that's all from me regarding tempest jobs
16:57:13 <slaweq> anything else You want to add here?
16:57:42 <njohnston_> not I
16:57:50 <ralonsoh> no
16:57:55 <haleyb> -1
16:58:06 <slaweq> ok, thx for attending the meeting
16:58:12 <njohnston_> thanks!
16:58:12 <slaweq> and have a nice week
16:58:17 <slaweq> #endmeeting
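For readers digging into the second SSH failure mode discussed above, the snippet below is a rough Python approximation of what the cirros ec2-metadata script does with curl: fetch the SSH public key from the metadata service with a hard 10-second cap. The requests-based client and the exact URL path are stand-ins for illustration; the point is that a response slower than 10 seconds leaves the key uninstalled, which later surfaces as an SSH authentication failure rather than a network error.

    # Rough approximation of the guest-side fetch; the real cirros image
    # does this with curl and the 10 second timeout mentioned above.
    import requests

    METADATA_URL = ('http://169.254.169.254/latest/meta-data/'
                    'public-keys/0/openssh-key')


    def fetch_ssh_key():
        try:
            resp = requests.get(METADATA_URL, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            # A response slower than 10 seconds ends up here: the key never
            # makes it into authorized_keys, so the later SSH attempt fails
            # with an authentication error.
            return None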