16:00:10 #startmeeting neutron_ci
16:00:12 Meeting started Tue Jul 16 16:00:10 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:13 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:15 The meeting name has been set to 'neutron_ci'
16:00:20 o/
16:00:33 welcome to another meeting :)
16:00:55 last meeting of the day for me! \o/
16:00:56 hi
16:01:05 njohnston_: yes, for me too :)
16:01:08 hi
16:01:23 ok, let's start as mlavalle will not be here today
16:01:30 Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:01:38 please open the link and we can move on
16:01:55 #topic Actions from previous meetings
16:02:04 first one:
16:02:06 mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:02:16 he sent me some update about his findings:
16:02:25 "In regards to the test connectivity with 2 routers scenario, when the L3 agent requests the router from the server, we call https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/plugin.py#L1835-L1852
16:02:27 my theory right now is that we are leaving one exception uncaught in that piece of code: duplicate entry. I will propose a WIP patch to prove the theory. That's the next step."
16:03:34 so he is making progress on this one, hopefully this will be fixed soon :)
16:03:51 I will add an action for him for next week so we don't forget about this
16:03:58 #action mlavalle to continue debugging neutron-tempest-plugin-dvr-multinode-scenario issues
16:04:22 next one:
16:04:27 slaweq to contact yamamoto about networking-midonet py3 status
16:04:44 I emailed yamamoto and he responded
16:04:59 I pasted the info from him into the related etherpad
16:05:37 basically there are 2 issues with midonet and py3 right now; they are aware of them but they don't know when they can be fixed
16:05:47 next one
16:05:49 ralonsoh to check how to increase, in FT, the privsep pool
16:06:10 slaweq, the number of threads is == num of CPUs
16:06:27 so we should not increase that number
16:06:39 how to solve the timeouts? I still don't know
16:06:50 ralonsoh: can we maybe limit the number of test runners?
16:07:08 slaweq, but that will reduce the speed of execution a lot
16:07:24 slaweq, I recommend doing this only as a last resort
16:07:30 ralonsoh: maybe it's worth checking how much longer it would take then
16:07:54 slaweq, we can do (num_cpu - 1)
16:08:14 just for functional tests? or fullstack too?
16:08:57 ralonsoh: in fullstack it's different I think - each test has got its own "environment" so probably also its own privsep-helpers pool
16:09:04 ok
16:09:18 I'll submit a patch to reduce the number of workers in FT
16:09:24 thx
16:09:42 #action ralonsoh to try a patch to reduce the number of workers in FT
16:10:07 and the last one:
16:10:09 slaweq to check neutron.tests.functional.agent.l3.test_ha_router.L3HATestCase issue
16:10:13 Bug reported https://bugs.launchpad.net/neutron/+bug/1836565
16:10:15 Launchpad bug 1836565 in neutron "Functional test test_keepalived_state_change_notification may fail do to race condition" [Medium,In progress] - Assigned to Slawek Kaplonski (slaweq)
16:10:18 Patch proposed https://review.opendev.org/670815
16:10:37 I found that it is a race condition which may cause such a failure in rare cases
16:10:44 but fortunately the fix was quite easy :)
16:11:01 any questions/comments?
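A minimal sketch of the duplicate-entry handling that mlavalle's theory (16:02:25-16:02:27) points at, assuming oslo.db's DBDuplicateEntry is the exception raised; the helper name and exact plugin calls below are hypothetical and the real WIP patch may look quite different:

    # Hedged sketch only -- the real WIP patch may differ.  oslo.db raises
    # DBDuplicateEntry when a UNIQUE constraint is violated, so the idea is
    # to treat that as a benign race instead of letting it escape uncaught.
    from oslo_db import exception as db_exc
    from oslo_log import log as logging

    LOG = logging.getLogger(__name__)

    def _update_router_port(plugin, context, port_id, port_dict):
        # Hypothetical helper name; stands in for the ml2 code linked above.
        try:
            return plugin.update_port(context, port_id, {'port': port_dict})
        except db_exc.DBDuplicateEntry:
            # Another worker already created the same row; reread it instead
            # of failing the whole router retrieval.
            LOG.debug("Duplicate entry while updating port %s, ignoring",
                      port_id)
            return plugin.get_port(context, port_id)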
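And for the action item about reducing the number of functional-test workers, a rough illustration of the (num_cpu - 1) idea from 16:07:54; ralonsoh's actual patch will most likely adjust the tox/stestr configuration instead, and this snippet simply assumes .stestr.conf already points at the functional tests:

    # Rough illustration only, not the actual patch.
    import multiprocessing
    import subprocess

    def run_functional_tests():
        # Leave one CPU free so the privsep helper threads are not starved.
        concurrency = max(multiprocessing.cpu_count() - 1, 1)
        subprocess.check_call(
            ['stestr', 'run', '--concurrency', str(concurrency)])

    if __name__ == '__main__':
        run_functional_tests()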
16:11:43 slaweq, I need to check the patch first, I'll comment on it
16:11:55 ralonsoh: sure, thx
16:12:15 ok, so let's move on
16:12:17 #topic Stadium projects
16:12:24 Python 3 migration
16:12:26 Stadium projects etherpad: https://etherpad.openstack.org/p/neutron_stadium_python3_status
16:12:42 I already talked about the midonet plugin
16:12:54 but I also have a short update about neutron-fwaas and networking-ovn
16:12:55 I have a bagpipe update that looks ready to go: https://review.openstack.org/641057
16:13:12 for neutron-fwaas all voting jobs are switched to py3 now
16:13:31 today I sent a patch to switch the multinode tempest job: https://review.opendev.org/671005
16:14:02 but I think I will have to hold off on it and propose it to the neutron-tempest-plugin repo when njohnston_ finishes his patches related to the neutron-fwaas tempest plugin
16:14:24 for networking-ovn I sent a patch today to switch the rally job to py3: https://review.opendev.org/671006
16:15:14 with this patch I think we can consider networking-ovn done, because the last job which isn't switched is based on CentOS 7 so it probably can't be run on py3
16:15:21 that's all from me
16:15:56 njohnston_: thx for Your update about bagpipe
16:16:07 I will review this patch tomorrow morning
16:16:08 slaweq: The fwaas change in the neutron-tempest-plugin repo is already merged, so you could go ahead and propose it there immediately
16:16:23 njohnston_: ahh, yes, thx
16:16:31 so I will do it that way
16:16:35 +1
16:17:15 so we are "almost" done with this transition :)
16:17:39 any other questions/comments?
16:18:06 not on this topic from me
16:18:15 ok, let's move on then
16:18:16 tempest-plugins migration
16:18:18 Etherpad: https://etherpad.openstack.org/p/neutron_stadium_move_to_tempest_plugin_repo
16:18:33 any updates?
16:18:57 As I mentioned before, the first part of the fwaas move happened and I am tinkering with the second half now: https://review.opendev.org/643668
16:19:30 If you propose your converted multinode tempest job in n-t-p then I will incorporate it into that change
16:19:52 njohnston_: ok, thx a lot :)
16:20:09 from my side, tomorrow I will take a look at tidwellr's patch for neutron-dynamic-routing to help with it
16:21:01 so that should be all related to the stadium projects
16:21:11 I think we can move on to the next topics
16:21:13 #topic Grafana
16:22:08 midonet cogate broken again
16:22:25 I don't know why there is no data for the gate queue for the last few days - didn't we merge anything recently?
16:23:19 the number of jobs that have been run in the gate queue seems pretty low, so I wonder if many patches are making it through
16:23:57 when lots of jobs have moderate fail numbers - even pep8 has an 11% failure rate right now, which is unusual - then it seems likely that something will fail in a given run.
16:24:17 njohnston_: about the midonet job, You're right but I don't think we have anyone who wants to look into it
16:24:26 I guess it will have to wait for yamamoto
16:24:34 indeed
16:25:38 njohnston_: about the pep8 job, I think I saw quite a lot of WIP patches recently which may explain this failure rate
16:26:20 on the good side, I see that the functional/fullstack jobs are slightly better now (below 30%)
16:26:21 ok, could be. I hope so!
16:27:00 and also our "highest rates" jobs, the one with uwsgi and the one with dvr, are better now
16:27:09 not good but better at least :)
16:28:01 anything else You want to add or can we move on?
16:29:13 ok, let's move on
16:29:15 #topic fullstack/functional
16:29:28 nothing else from me
16:29:36 at least twice this week I saw an issue with neutron.tests.functional.agent.linux.test_keepalived.KeepalivedManagerTestCase:
16:29:41 http://logs.openstack.org/15/670815/1/check/neutron-functional-python27/028631e/testr_results.html.gz
16:29:46 http://logs.openstack.org/11/662111/27/check/neutron-functional/8404d24/testr_results.html.gz
16:30:00 but IIRC it's related to the privsep pool in tests, right ralonsoh?
16:30:13 let me check
16:30:40 yes, and I think this is related to the sync decorator in ip_lib
16:31:00 but, of course, until we have a new pyroute2 release, we need to keep this sync decorator
16:31:14 and in this order: first the sync and then the privsep
16:31:24 this is sometimes killing the FTs
16:31:38 because you enter the privsep call and then wait
16:31:54 it should be the other way around, but that is not supported by py2
16:32:19 that's all
16:32:27 ok, thx ralonsoh for the confirmation
16:32:36 and do we have a bug reported for that?
16:32:40 yes
16:32:44 I'll find it
16:32:54 please don't stop for me
16:33:23 https://review.opendev.org/#/c/666853/
16:33:30 #link https://bugs.launchpad.net/neutron/+bug/1833721
16:33:32 Launchpad bug 1833721 in neutron "ip_lib synchronized decorator should wrap the privileged one" [Medium,In progress] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:33:38 ralonsoh: thx a lot
16:33:53 ok
16:34:03 for fullstack I found one new failure
16:34:08 http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/testr_results.html.gz
16:34:17 failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:34:25 did You see something like that before?
16:34:42 yes, but almost one year ago
16:34:49 and that was solved
16:34:58 I can take a look at this
16:35:42 I see such an error in the logs: http://logs.openstack.org/70/670570/1/check/neutron-fullstack/b563af4/controller/logs/dsvm-fullstack-logs/TestBwLimitQoSOvs.test_bw_limit_qos_port_removed_egress_.txt.gz#_2019-07-15_22_14_11_638
16:36:14 but I'm not sure if that is really related
16:36:49 this should not happen
16:37:11 I'll download the logs and grep to see if someone else is deleting this port before
16:37:25 ok, thx ralonsoh
16:38:02 can You also report a bug for it, so that we can track it there?
16:38:09 +1
16:38:25 slaweq, perfect
16:38:52 #action ralonsoh to report a bug and investigate failed test neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_bw_limit_qos_port_removed
16:38:55 thx ralonsoh :)
16:39:22 anything else related to functional/fullstack tests?
16:40:17 ok, let's move on then
16:40:23 #topic Tempest/Scenario
16:40:38 first, a few short pieces of information
16:40:39 Two patches which should improve our gate a bit:
16:40:41 https://review.opendev.org/#/c/670717/
16:40:43 https://review.opendev.org/#/c/670718/
16:40:55 please take a look at them if You have a minute :)
16:41:17 added them to the pile
16:41:23 thx njohnston_
16:41:26 and second thing
16:41:36 I sent patches to propose 2 new jobs:
16:41:41 https://review.opendev.org/#/c/670029/ - based on TripleO standalone
16:41:50 slaweq, do you think these will decrease the odds of sshtimeout?
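To make the ip_lib ordering problem discussed above (16:30:40-16:31:54, bug 1833721) concrete, a generic sketch of the two decorator orderings; the function names and the lock name are illustrative, not the actual ip_lib code:

    from oslo_concurrency import lockutils

    from neutron import privileged

    # Problematic ordering: the call enters the privsep daemon first and then
    # blocks on the lock, tying up one of the daemon's few worker threads.
    @privileged.default.entrypoint
    @lockutils.synchronized('privileged-ip-lib')
    def get_link_attributes_blocking(device, namespace):
        pass

    # Ordering the bug asks for: take the lock in the caller process and only
    # then cross into privsep, so daemon threads never wait on the lock.
    @lockutils.synchronized('privileged-ip-lib')
    @privileged.default.entrypoint
    def get_link_attributes_fixed(device, namespace):
        pass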
16:41:59 this is a non-voting job for now, please review it too
16:42:08 and the second patch: https://review.opendev.org/#/c/670738/
16:42:34 I realized recently that we don't have a simple openvswitch job which runs neutron_tempest_plugin tests
16:42:41 we only have a linuxbridge job like that
16:42:55 so I'm proposing to have a similar job for ovs too
16:43:04 please let me know what You think about it
16:43:20 (in the patch review of course :))
16:43:28 thanks, these are great ideas
16:43:36 thx njohnston_ :)
16:43:50 igordc: regarding the SSH issue, I opened bug https://bugs.launchpad.net/neutron/+bug/1836642
16:43:52 Launchpad bug 1836642 in OpenStack Compute (nova) "Metadata responses are very slow sometimes" [Undecided,Incomplete]
16:44:04 basically there are 2 main issues with ssh:
16:44:23 1. covered by mlavalle's investigation which we talked about at the beginning of the meeting
16:45:02 2. this issue with slow metadata responses for the /public-keys/ request, which causes failure to configure the ssh key on the instance and then failure with ssh
16:45:38 regarding 2) I reported it for both nova and neutron as I noticed that in most cases it is the nova-metadata-api which is responding very slowly
16:46:15 but sean-k-mooney found out today that (at least in one case) nova was waiting a very long time on a response from neutron to get security groups
16:46:26 (this workflow is really complicated IMHO)
16:46:42 slaweq, is the error different than sshtimeout for 2. ?
16:47:09 so I will continue investigating this but if someone wants to help, please take a look at this bug, maybe You will find something interesting :)
16:47:24 wow, so nova has to ask for security group info in order to satisfy a metadata request? that seems highly nonintuitive.
16:47:32 igordc: basically in both cases the final issue is an authentication failure during sshing to the vm
16:47:55 njohnston_: yes, I was also surprised today when sean told me that
16:48:12 yep I will take some time and see if I can find out something.. some patches get sshtimeout almost every time
16:48:41 igordc: the difference between those 2 cases is that in 1) You will see about 20 attempts to get the instance-id from the metadata server
16:49:16 and in 2) You will see that the instance-id was returned to the vm properly and there was a "failed to get .../public-keys" message in the console log
16:49:59 and in this second case, in most of the cases which I checked it was like that because the response for this request took more than 10 seconds
16:50:20 10 seconds is the timeout set in curl used by the ec2-metadata script in cirros
16:51:11 that's all from me about this issue
16:51:16 any questions/comments?
16:52:34 wondering if certain patches with changes in certain places make this more likely
16:52:47 igordc: I don't think so
16:53:28 IMO, at least this second problem I described is somehow related to the performance of the node on which the tests are run
16:53:32 and nothing else
16:53:43 but I don't have any proof of that
16:54:40 ok
16:54:42 let's move on
16:54:46 from other things
16:55:05 the neutron-tempest-with-uwsgi job is now running better after my fix for tempest was merged
16:55:13 but it has a lot of timeouts
16:55:50 maybe it will be better when it is switched to run only neutron and nova related tests
16:56:03 but I think it's worth investigating why it's so slow so often
16:56:21 so I will open a bug for it
16:56:39 #action slaweq to open bug about slow neutron-tempest-with-uwsgi job
16:57:03 and that's all from me regarding tempest jobs
16:57:13 anything else You want to add here?
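To illustrate the 10-second limit mentioned at 16:50:20: the cirros ec2-metadata script uses curl, but a rough Python equivalent of the request (the standard EC2 metadata path is assumed) looks like the sketch below; if the answer takes longer than 10 seconds the key is simply never installed and the later ssh check fails to authenticate:

    import time

    import requests

    # Standard EC2-style metadata path for the first injected SSH key.
    METADATA_URL = ('http://169.254.169.254/latest/meta-data/'
                    'public-keys/0/openssh-key')

    start = time.monotonic()
    try:
        resp = requests.get(METADATA_URL, timeout=10)
        resp.raise_for_status()
        print('got key in %.1fs' % (time.monotonic() - start))
    except requests.exceptions.RequestException as exc:
        # The gate failure mode: the response takes longer than 10s, the key
        # is never configured and the later ssh login cannot authenticate.
        print('metadata request failed after %.1fs: %s'
              % (time.monotonic() - start, exc))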
16:57:42 not I
16:57:50 no
16:57:55 -1
16:58:06 ok, thx for attending the meeting
16:58:12 thanks!
16:58:12 and have a nice week
16:58:17 #endmeeting