16:00:19 <slaweq> #startmeeting neutron_ci
16:00:20 <openstack> Meeting started Tue Dec 11 16:00:19 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 <slaweq> hi
16:00:25 <openstack> The meeting name has been set to 'neutron_ci'
16:00:29 <haleyb> hi
16:01:35 * haleyb is in another meeting if he gets unresponsive
16:01:45 <mlavalle> o/
16:02:29 <slaweq> let's wait a few more minutes for njohnston, hongbin and others
16:03:19 <njohnston> we may also want sean-k-mooney for this one but I don't think he is in channel
16:04:34 <slaweq> ok, let's start
16:04:42 <slaweq> #topic Actions from previous meetings
16:04:53 <slaweq> mlavalle to change trunk scenario test and see if that will help with FIP issues
16:05:05 <mlavalle> I pushed a patch yesterday
16:05:28 <mlavalle> https://review.openstack.org/#/c/624271/
16:05:35 <bcafarel> late hi o/
16:05:50 <mlavalle> I need to investigate what the results are
16:06:16 <slaweq> it looks like this didn't solve the problem: http://logs.openstack.org/71/624271/1/check/neutron-tempest-plugin-dvr-multinode-scenario/a125b91/testr_results.html.gz
16:06:30 <mlavalle> yeah
16:06:54 <mlavalle> I'll still take a closer look
16:07:16 <mlavalle> and will continue investigating the bug in general
16:07:32 <slaweq> maybe there is some issue with timeouts there - those tests are using the advanced image IIRC, so it may be that it tries ssh for too short a time
16:07:36 <slaweq> ?
16:08:01 <mlavalle> the lifecycle scenario doesn't use the advanced image
16:08:29 <slaweq> #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:08:41 <slaweq> ahh, right - only one of the tests is using the adv image
16:08:44 <mlavalle> you know what, on second thought we don't know if it worked
16:09:09 <mlavalle> because the test case that failed was the other one
16:09:20 <mlavalle> not the lifecycle one
16:09:24 <slaweq> in this example both tests failed
16:09:53 <mlavalle> ok
16:09:58 <mlavalle> I'll investigate
16:10:05 <slaweq> thx mlavalle
16:10:08 <slaweq> let's move on
16:10:11 <slaweq> njohnston will remove neutron-grenade from neutron ci queues and add comment why definition of job is still needed
16:11:18 <njohnston> So the feedback I got from the QA team is that they would rather we keep neutron-grenade, as they want to keep py2 grenade testing
16:11:55 <njohnston> they consider it part of the minimum level of testing needed until we officially stop supporting py2
16:12:32 <slaweq> ok
16:12:54 <slaweq> so we can consider this point from https://etherpad.openstack.org/p/neutron_ci_python3 as done, right?
16:13:02 <njohnston> yes
16:13:17 <njohnston> I was waiting for us to talk about it in the meeting before marking it
16:14:17 <slaweq> I just marked it as done in the etherpad then
16:14:20 <slaweq> thx njohnston
16:14:21 <njohnston> thanks
16:14:37 <slaweq> one more question
16:15:15 <slaweq> is it only the py2-based grenade job which QA still wants to have? or should we keep all grenade jobs with py2 too?
16:15:18 <slaweq> do You know?
16:15:57 <njohnston> They want grenade to cover both py2 and py3, so we should have both - the same way we have unit tests for both
16:16:57 <slaweq> so we should "duplicate" all our grenade jobs then, to have py2 and py3 variants for each
16:17:11 <slaweq> probably more rechecks but ok :)
16:17:31 <mlavalle> LOL
16:17:57 <njohnston> Sorry, I was not specific enough. I think they want at least one grenade job each for py3 and py2; I don't think we need a full matrix.
16:18:32 <njohnston> So we should have grenade-py3 and neutron-grenade... but for example neutron-grenade-multinode-dvr could be just on py3 and they would be fine
16:18:32 <slaweq> ok, so we already have neutron-grenade (py2) and grenade-py3 (py3) jobs
16:19:04 <slaweq> so we can just switch neutron-grenade-dvr-multinode and neutron-grenade-multinode to py3 now?
16:19:45 <njohnston> yes. I proposed a 'grenade-multinode-py3' job in the grenade repo https://review.openstack.org/#/c/622612/
16:20:02 <njohnston> I thought that we could use that perhaps, and then it becomes available for other projects
16:20:13 <slaweq> ok, now it's clear
16:20:17 <slaweq> thx njohnston for working on this
16:20:20 <njohnston> np
16:20:34 <slaweq> ok, let's move on then
16:20:37 <slaweq> slaweq to continue debugging bug 1798475
16:20:37 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:20:56 <slaweq> unfortunately I didn't have too much time to work on it last week
16:21:13 <slaweq> I lost a few days because of sick leave :/
16:21:27 <slaweq> I will try to check it this week
16:21:38 <slaweq> #action slaweq to continue debugging bug 1798475
16:21:39 <openstack> bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:21:47 <slaweq> next one
16:21:49 <slaweq> slaweq to continue fixing functional-py3 tests
16:21:50 <mlavalle> feeling better now?
16:21:55 <hongbin> o/
16:22:00 <slaweq> mlavalle: yes, thx. It's much better
16:22:05 <slaweq> hi hongbin
16:22:16 <hongbin> slaweq: sorry, a bit late today
16:22:32 <slaweq> so regarding the functional py3 tests, I was playing with it a bit during the weekend
16:23:02 <slaweq> I tried to disable all warnings in python and so on but it still didn't help
16:23:49 <slaweq> the issue is probably caused by capturing stderr, like e.g.: http://logs.openstack.org/83/577383/17/check/neutron-functional/2907d2b/job-output.txt.gz#_2018-12-10_11_06_04_272396
16:24:54 <slaweq> but:
16:25:01 <slaweq> 1. I don't know how to get rid of it
16:25:34 <slaweq> 2. I'm not sure it's a good idea to get rid of it anyway, because I'm not sure if it comes from the test which failed or from a test which actually passed
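A minimal sketch of the "disable all warnings in python" approach slaweq describes trying above; the function name and where it would be hooked in are assumptions for illustration, not the actual neutron change. It also would not catch output written directly to stderr by child processes or C libraries, which is consistent with the observation that it did not help.

```python
# Sketch only: silence Python warnings so they stop landing on captured
# stderr during a test run. Assumed helper, not the actual neutron patch.
import os
import warnings


def silence_warnings():
    # Drop everything routed through the warnings machinery in this process.
    warnings.simplefilter("ignore")
    # Ask any child processes started afterwards to do the same.
    os.environ["PYTHONWARNINGS"] = "ignore"


if __name__ == "__main__":
    silence_warnings()
    warnings.warn("this would normally show up on stderr")  # now suppressed
```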
16:26:17 <slaweq> if anyone has any idea how to fix this issue - feel free to take it :)
16:26:23 <njohnston> I am at a loss for what the best course forward is
16:26:46 <slaweq> if not, I will assign it to myself for next week and will try to continue work on it
16:27:26 <slaweq> #action slaweq to continue fixing functional-py3 tests
16:27:31 <slaweq> ok, let's move on
16:27:39 <slaweq> njohnston to research py3 conversion for neutron grenade multinode jobs
16:27:48 <njohnston> I think we covered that before
16:27:52 <slaweq> I think we already talked about it :)
16:27:56 <slaweq> yes, thx njohnston
16:28:08 <slaweq> so next one
16:28:10 <slaweq> slaweq to update etherpad with what is already converted to py3
16:28:27 <slaweq> I updated the etherpad https://etherpad.openstack.org/p/neutron_ci_python3 today
16:28:27 <bcafarel> on functional tests, maybe worth sending a ML post, maybe some other projects would have an idea there
16:28:42 <bcafarel> (strange that it's only us getting hit by this "log limit")
16:28:46 <slaweq> bcafarel: good idea, I will send an email today
16:30:35 <slaweq> basically we still need to convert most of the tempest jobs, grenade, rally and functional
16:30:48 <slaweq> for rally I proposed patch https://review.openstack.org/624358
16:30:55 <slaweq> let's wait for the CI results now
16:31:34 <slaweq> so the etherpad is updated, if someone wants to help, feel free to propose patches for jobs which are still waiting :)
16:32:07 <slaweq> ok, and the last action was:
16:32:09 <slaweq> hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:32:40 <hongbin> i have to postpone this one since i am still figuring out how to set up the environment for testing
16:33:11 <njohnston> sean-k-mooney: thanks for joining, we'll talk about CI issues in a moment
16:33:37 <sean-k-mooney> njohnston: no worries
16:33:48 <slaweq> hongbin: ok, ping me if You will need any help
16:33:56 <hongbin> slaweq: thanks, will do
16:34:00 <slaweq> I will assign it as an action for next week, ok?
16:34:07 <hongbin> sure
16:34:10 <slaweq> #action hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:34:18 <slaweq> ok, let's move on then
16:34:22 <slaweq> #topic Python 3
16:34:38 <slaweq> we already talked about the grenade jobs
16:35:21 <slaweq> I only wanted to mention this patch for the neutron-rally job: https://review.openstack.org/624358
16:35:47 <slaweq> and also today I sent patch https://review.openstack.org/624360 to remove the tempest-full job, as we have tempest-full-py3 already
16:36:02 <slaweq> so I think that we don't need both of them
16:36:46 <slaweq> anything else You want to talk about, njohnston, bcafarel?
16:37:17 <njohnston> nope, I think that covers it
16:37:30 <bcafarel> same here
16:37:31 <slaweq> ok, so let's move on then
16:37:37 <slaweq> #topic Grafana
16:37:42 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:38:14 <njohnston> looks like the failure rate on the neutron-tempest-iptables_hybrid job has gone from 8% at 0930UTC to 46% at 1620UTC
16:38:14 <njohnston> http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1&panelId=18&fullscreen&from=now%2Fd&to=now%2Fd
16:38:37 <njohnston> sean-k-mooney was looking into it and how it might be related to pyroute2
16:38:38 <slaweq> njohnston: yes, and I think that this is what sean-k-mooney has a culprit for, right?
16:38:59 <njohnston> https://bugs.launchpad.net/os-vif/+bug/1807949
16:39:00 <openstack> Launchpad bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] - Assigned to sean mooney (sean-k-mooney)
16:39:16 <sean-k-mooney> so is that breaking on all builds or just some?
16:39:47 <njohnston> just the neutron-tempest-iptables_hybrid it looks like
16:41:10 <slaweq> sean-k-mooney: is it this error: http://logs.openstack.org/60/624360/1/check/neutron-tempest-iptables_hybrid/a6a4a0a/logs/screen-n-cpu.txt.gz?level=ERROR#_Dec_11_15_10_54_319285 ?
16:41:37 <njohnston> I am wondering if we should make neutron-tempest-iptables_hybrid non-voting while we figure this out, or blacklist this version of os-vif....
16:42:23 <haleyb> we already had to blacklist 0.12.0...
16:42:29 <njohnston> based on the 24-hour-rolling-average nature of grafana lines I think a rise this rapid means we may have an effective 100% failure rate at the moment
16:42:38 <slaweq> sean-k-mooney: do You know why it may happen only in this job?
16:43:05 <slaweq> I don't see any such error e.g. in the tempest-full job logs (at least the one which I'm checking now)
16:43:54 <njohnston> I did not see that error in the neutron-tempest-linuxbridge jobs I spot-checked
16:44:01 <njohnston> (as another datapoint)
16:46:08 <slaweq> that is strange to me, the only thing which is "special" about neutron-tempest-iptables_hybrid is the iptables_hybrid firewall driver instead of the openvswitch driver
16:46:18 <slaweq> how could this trigger such an error?
16:49:16 <mlavalle> I think sean-k-mooney is not on-line anymore
16:49:21 <slaweq> ok, I think that we should check if it happens 100% of the time in this job; if so, we should, as njohnston said, mark this job as non-voting temporarily and then try to investigate it
16:49:30 <slaweq> do You agree?
16:49:34 <mlavalle> yes
16:49:39 <hongbin> +1
16:49:47 <njohnston> +1
16:50:11 <slaweq> ok, I will check grafana tomorrow morning and will send a patch to set this job as non-voting
16:50:57 <njohnston> should we send something to the ML asking people not to recheck if the failure is in iptables_hybrid?
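For the "[Errno 24] Too many open files" symptom above, one quick diagnostic is to compare a process's open descriptors with its RLIMIT_NOFILE soft limit. The sketch below is a generic Linux check, not part of the os-vif fix:

```python
# Sketch: report open file descriptors vs. the NOFILE limits for the
# current process. /proc/self/fd is Linux-specific.
import os
import resource


def fd_usage():
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    open_fds = len(os.listdir("/proc/self/fd"))
    return open_fds, soft, hard


if __name__ == "__main__":
    used, soft, hard = fd_usage()
    print("open fds: %d (soft limit: %d, hard limit: %d)" % (used, soft, hard))
```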
16:51:07 <slaweq> #action slaweq to switch neutron-tempest-iptables_hybrid job to non-voting if it will be failing a lot because of bug 1807949
16:51:08 <openstack> bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] https://launchpad.net/bugs/1807949 - Assigned to sean mooney (sean-k-mooney)
16:51:12 <sean-k-mooney> hi, sorry, got disconnected
16:51:34 <slaweq> njohnston: yes, I will send an email
16:51:43 <bcafarel> I think I just did :/ (though there was a rally timeout too)
16:52:52 <sean-k-mooney> I'll join the neutron channel after to discuss the pyroute2 issue
16:53:01 <slaweq> ok, so sean-k-mooney - we will mark our neutron-tempest-iptables_hybrid job as non-voting if it is failing 100% of the time because of this issue
16:53:20 <sean-k-mooney> ok
16:53:28 <slaweq> so we will have more time to investigate this :)
16:53:46 <sean-k-mooney> thanks :)
16:54:09 <slaweq> thx for helping with this :)
16:54:13 <slaweq> ok, let's move on
16:54:37 <slaweq> today I went through our list of issues in https://etherpad.openstack.org/p/neutron-ci-failures
16:54:55 <slaweq> and I wanted to find the 3 which happen most often
16:55:33 <slaweq> one of the problems which hits us the most is still the issue with db migrations in functional tests
16:55:45 <slaweq> which happens many times
16:55:59 <slaweq> and which is in my backlog
16:56:09 <slaweq> but maybe we should mark those tests as unstable for now?
16:56:13 <slaweq> what do You think?
16:56:44 <bcafarel> sounds reasonable, I did see this db migration issue a few times recently
16:56:59 <mlavalle> yeah, I'm ok with that
16:57:06 <njohnston> it is a persistent bugaboo, yes
16:57:06 <slaweq> ok, I will do that then
16:57:09 <mlavalle> we will continue trying to fix it, right?
16:57:18 <slaweq> mlavalle: of course
16:57:23 <mlavalle> ok
16:57:40 <slaweq> I even have a card for it in our trello, I just need some time
16:57:40 <mlavalle> yeah, if it is getting in the way, let's mark it unstable
16:57:59 <slaweq> #action slaweq to mark db migration tests as unstable for now
16:58:00 <mlavalle> thanks slaweq
16:58:12 <slaweq> other issues which I found were:
16:58:29 <slaweq> 1. issues with cinder volume backup timeouts - I will try to ping the cinder guys again about it
16:59:04 <slaweq> 2. various issues with FIP connectivity - it's not always the same test/job, the only common part is that ssh to the fip is not working
16:59:31 <slaweq> if someone wants to debug it more, I can send the list of jobs which failed because of that :)
16:59:40 <mlavalle> send it to me
16:59:46 <slaweq> mlavalle: ok, thx
17:00:08 <slaweq> we have to finish now
17:00:12 <slaweq> thx for attending guys
17:00:15 <slaweq> #endmeeting
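For the "mark db migration tests as unstable" action item above, the usual pattern is a skip-on-failure decorator. The snippet below is a self-contained sketch of that pattern only; the helper name and test names are illustrative, not the actual neutron implementation:

```python
# Sketch of an "unstable test" decorator: a known-flaky test that fails is
# recorded as skipped instead of failing the whole run. Names here are
# illustrative only.
import functools
import unittest


def unstable_test(reason):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(self, *args, **kwargs):
            try:
                return func(self, *args, **kwargs)
            except unittest.SkipTest:
                raise
            except Exception as exc:
                # Turn the failure into a skip, keeping the reason visible.
                self.skipTest("%s. Failure: %s" % (reason, exc))
        return wrapper
    return decorator


class ExampleMigrationTestCase(unittest.TestCase):

    @unstable_test("db migration functional tests fail intermittently")
    def test_walk_migrations(self):
        # Placeholder body standing in for the real migration walk.
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```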