16:00:19 #startmeeting neutron_ci
16:00:20 Meeting started Tue Dec 11 16:00:19 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:22 hi
16:00:25 The meeting name has been set to 'neutron_ci'
16:00:29 hi
16:01:35 * haleyb is in another meeting if he gets unresponsive
16:01:45 o/
16:02:29 let's wait a few more minutes for njohnston, hongbin and others
16:03:19 we may also want sean-k-mooney for this one but I don't think he is in the channel
16:04:34 ok, let's start
16:04:42 #topic Actions from previous meetings
16:04:53 mlavalle to change trunk scenario test and see if that will help with FIP issues
16:05:05 I pushed a patch yesterday
16:05:28 https://review.openstack.org/#/c/624271/
16:05:35 late hi o/
16:05:50 I need to investigate what the results are
16:06:16 looks like this didn't solve the problem: http://logs.openstack.org/71/624271/1/check/neutron-tempest-plugin-dvr-multinode-scenario/a125b91/testr_results.html.gz
16:06:30 yeah
16:06:54 I'll still take a closer look
16:07:16 and will continue investigating the bug in general
16:07:32 maybe there is some issue with timeouts there - those tests are using the advanced image IIRC so it may be that it's trying ssh for too short a time
16:07:36 ?
16:08:01 the lifecycle scenario doesn't use the advanced image
16:08:29 #action mlavalle will continue debugging trunk tests failures in multinode dvr env
16:08:41 ahh, right - only one of the tests is using the adv image
16:08:44 you know what, on second thought we don't know if it worked
16:09:09 because the test case that failed was the other one
16:09:20 not the lifecycle one
16:09:24 in this example both tests failed
16:09:53 ok
16:09:58 I'll investigate
16:10:05 thx mlavalle
16:10:08 let's move on
16:10:11 njohnston will remove neutron-grenade from neutron ci queues and add a comment why the definition of the job is still needed
16:11:18 So the feedback I got from the QA team is that they would rather we keep neutron-grenade, as they want to keep py2 grenade testing
16:11:55 they consider it part of the minimum level of testing needed until we officially stop supporting py2
16:12:32 ok
16:12:54 so we can consider this point from https://etherpad.openstack.org/p/neutron_ci_python3 as done, right?
16:13:02 yes
16:13:17 I was waiting for us to talk about it in the meeting before marking it
16:14:17 I just marked it as done in the etherpad then
16:14:20 thx njohnston
16:14:21 thanks
16:14:37 one more question
16:15:15 is it only the py2-based grenade job which QA still wants to have? or should we keep all grenade jobs with py2 too?
16:15:18 do You know?
16:15:57 They want grenade to cover both py2 and py3, so we should have both - the same way we have unit tests for both
16:16:57 so we should "duplicate" all our grenade jobs then to have py2 and py3 variants for each
16:17:11 probably more rechecks but ok :)
16:17:31 LOL
16:17:57 Sorry, I was not specific enough. I think they want at least one grenade job for py3 and py2 each. I don't think we need a full matrix.
16:18:32 So we should have grenade-py3 and neutron-grenade... but for example neutron-grenade-multinode-dvr could be just on py3 and they would be fine
16:18:32 ok, so we already have neutron-grenade (py2) and grenade-py3 (py3) jobs
16:19:04 so we can just switch neutron-grenade-dvr-multinode and neutron-grenade-multinode to py3 now?
16:19:45 yes.
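For reference, a minimal sketch of the kind of Zuul change being discussed here, assuming the multinode grenade job is (or becomes) a native devstack-based Zuul v3 job; the job name, parent and variable below are illustrative assumptions, not the actual change from the review mentioned next:

    # hypothetical sketch only - names and vars are assumptions,
    # not the real grenade/neutron change under review
    - job:
        name: neutron-grenade-multinode-py3   # assumed job name
        parent: neutron-grenade-multinode     # assumed existing py2 job
        vars:
          devstack_localrc:
            USE_PYTHON3: true                 # run devstack/grenade services under python3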
I proposed a 'grenade-multinode-py3' job in the grenade repo https://review.openstack.org/#/c/622612/
16:20:02 I thought that we could use that perhaps, and then it becomes available for other projects
16:20:13 ok, now it's clear
16:20:17 thx njohnston for working on this
16:20:20 np
16:20:34 ok, let's move on then
16:20:37 slaweq to continue debugging bug 1798475
16:20:37 bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:20:56 unfortunately I didn't have too much time to work on it last week
16:21:13 I lost a few days because of sick leave :/
16:21:27 I will try to check it this week
16:21:38 #action slaweq to continue debugging bug 1798475
16:21:39 bug 1798475 in neutron "Fullstack test test_ha_router_restart_agents_no_packet_lost failing" [High,Confirmed] https://launchpad.net/bugs/1798475
16:21:47 next one
16:21:49 slaweq to continue fixing functional-py3 tests
16:21:50 feeling better now?
16:21:55 o/
16:22:00 mlavalle: yes, thx. It's much better
16:22:05 hi hongbin
16:22:16 slaweq: sorry, a bit late today
16:22:32 so regarding functional py3 tests, I was playing with them a bit during the weekend
16:23:02 I tried to disable all warnings in python and so on but it still didn't help
16:23:49 the issue is probably caused by capturing stderr, e.g.: http://logs.openstack.org/83/577383/17/check/neutron-functional/2907d2b/job-output.txt.gz#_2018-12-10_11_06_04_272396
16:24:54 but:
16:25:01 1. I don't know how to get rid of it
16:25:34 2. I'm not sure if it's a good idea to get rid of it because I'm not sure if it comes from the test which failed or from a test which actually passed
16:26:17 if anyone has any idea how to fix this issue - feel free to take it :)
16:26:23 I am at a loss for what the best course forward is
16:26:46 if not I will assign it to myself for next week and will try to continue working on it
16:27:26 #action slaweq to continue fixing functional-py3 tests
16:27:31 ok, let's move on
16:27:39 njohnston to research py3 conversion for neutron grenade multinode jobs
16:27:48 I think we covered that before
16:27:52 I think we already talked about it :)
16:27:56 yes, thx njohnston
16:28:08 so next one
16:28:10 slaweq to update etherpad with what is already converted to py3
16:28:27 I updated the etherpad https://etherpad.openstack.org/p/neutron_ci_python3 today
16:28:27 on functional tests, maybe worth sending an ML post, maybe some other projects would have an idea there
16:28:42 (strange that it's only us getting hit by this "log limit")
16:28:46 bcafarel: good idea, I will send an email today
16:30:35 basically we still need to convert most of the tempest jobs, grenade, rally and functional
16:30:48 for rally I proposed patch https://review.openstack.org/624358
16:30:55 let's wait for the CI results now
16:31:34 so the etherpad is updated, if someone wants to help, feel free to propose patches for jobs which are still waiting :)
16:32:07 ok, and the last action was:
16:32:09 hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:32:40 I have to postpone this one since I am still figuring out how to set up the environment for testing
16:33:11 sean-k-mooney: thanks for joining, we'll talk about CI issues in a moment
16:33:37 njohnston: no worries
16:33:48 hongbin: ok, ping me if You need any help
16:33:56 slaweq: thanks, will do
16:34:00 I will assign it as an action for next week, ok?
16:34:07 sure
16:34:10 #action hongbin to report and check failing neutron.tests.fullstack.test_l3_agent.TestHAL3Agent.test_gateway_ip_changed test
16:34:18 ok, let's move on then
16:34:22 #topic Python 3
16:34:38 we already talked about the grenade jobs
16:35:21 I only wanted to mention this patch for the neutron-rally job: https://review.openstack.org/624358
16:35:47 and also today I sent patch https://review.openstack.org/624360 to remove the tempest-full job as we have tempest-full-py3 already
16:36:02 so I think that we don't need both of them
16:36:46 anything else You want to talk about njohnston, bcafarel?
16:37:17 nope, I think that covers it
16:37:30 same here
16:37:31 ok, so let's move on then
16:37:37 #topic Grafana
16:37:42 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:38:14 looks like the failure rate on the neutron-tempest-iptables_hybrid job has gone from 8% at 0930 UTC to 46% at 1620 UTC
16:38:14 http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?orgId=1&panelId=18&fullscreen&from=now%2Fd&to=now%2Fd
16:38:37 sean-k-mooney was looking into it and how it might be related to pyroute2
16:38:38 njohnston: yes, and I think this is what sean-k-mooney has a culprit for, right?
16:38:59 https://bugs.launchpad.net/os-vif/+bug/1807949
16:39:00 Launchpad bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] - Assigned to sean mooney (sean-k-mooney)
16:39:16 so is that breaking on all builds or just some?
16:39:47 just the neutron-tempest-iptables_hybrid job, it looks like
16:41:10 sean-k-mooney: is it this error: http://logs.openstack.org/60/624360/1/check/neutron-tempest-iptables_hybrid/a6a4a0a/logs/screen-n-cpu.txt.gz?level=ERROR#_Dec_11_15_10_54_319285 ?
16:41:37 I am wondering if we should make neutron-tempest-iptables_hybrid non-voting while we figure this out, or blacklist this version of os-vif....
16:42:23 we already had to blacklist 0.12.0...
16:42:29 based on the 24-hour-rolling-average nature of grafana lines I think a rise this rapid means we may have an effective 100% failure rate at the moment
16:42:38 sean-k-mooney: do You know why it may happen only in this job?
16:43:05 I don't see any such error e.g. in the tempest-full job logs (at least the one which I'm checking now)
16:43:54 I did not see that error in the neutron-tempest-linuxbridge jobs I spot-checked
16:44:01 (as another datapoint)
16:46:08 that is strange to me, the only thing which is "special" for neutron-tempest-iptables_hybrid is the iptables_hybrid firewall driver instead of the openvswitch driver
16:46:18 how could this trigger such an error?
16:49:16 I think sean-k-mooney is not online anymore
16:49:21 ok, I think that we should check if it happens 100% of the time in this job; if so, we should, as njohnston said, mark this job as non-voting temporarily and then try to investigate it
16:49:30 do You agree?
16:49:34 yes
16:49:39 +1
16:49:47 +1
16:50:11 ok, I will check grafana tomorrow morning and will send a patch to set this job as non-voting
16:50:57 should we send something to the ML asking people not to recheck if the failure is in iptables_hybrid?
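For reference, making a job non-voting in a project's Zuul layout is usually a small change along these lines (a hedged sketch, assuming the job is listed in neutron's .zuul.yaml check queue; the exact layout in the repo may differ):

    # hedged sketch - the exact project stanza in the neutron repo may differ
    - project:
        check:
          jobs:
            - neutron-tempest-iptables_hybrid:
                voting: false   # temporarily non-voting while bug 1807949 is investigated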
16:51:07 #action slaweq to switch the neutron-tempest-iptables_hybrid job to non-voting if it keeps failing a lot because of bug 1807949
16:51:08 bug 1807949 in os-vif "os_vif error: [Errno 24] Too many open files" [High,Triaged] https://launchpad.net/bugs/1807949 - Assigned to sean mooney (sean-k-mooney)
16:51:12 hi sorry got disconnected
16:51:34 njohnston: yes, I will send an email
16:51:43 I think I just did :/ (though there was a rally timeout too)
16:52:52 I'll join the neutron channel after to discuss the pyroute2 issue
16:53:01 ok, so sean-k-mooney - we will mark our neutron-tempest-iptables_hybrid job as non-voting if it is failing 100% of the time because of this issue
16:53:20 ok
16:53:28 so we will have more time to investigate this :)
16:53:46 thanks :)
16:54:09 thx for helping with this :)
16:54:13 ok, let's move on
16:54:37 today I went through our list of issues in https://etherpad.openstack.org/p/neutron-ci-failures
16:54:55 and I wanted to find the 3 which happen most often
16:55:33 one of the problems which hits us the most is still the issue with db migrations in functional tests
16:55:45 which happens many times
16:55:59 and which is in my backlog
16:56:09 but maybe we should mark those tests as unstable for now?
16:56:13 what do You think?
16:56:44 sounds reasonable, I did see this db migration issue a few times recently
16:56:59 yeah, I'm ok with that
16:57:06 it is a persistent bugaboo yes
16:57:06 ok, I will do that then
16:57:09 we will continue trying to fix it, right?
16:57:18 mlavalle: of course
16:57:23 ok
16:57:40 I even have a card for it in our trello, I just need some time
16:57:40 yeah, if it is getting in the way, let's mark it unstable
16:57:59 #action slaweq to mark db migration tests as unstable for now
16:58:00 thanks slaweq
16:58:12 the other issues which I found were:
16:58:29 1. issues with cinder volume backup timeouts - I will try to ping the cinder guys again about it
16:59:04 2. various issues with FIP connectivity - it's not always the same test/job, the only common part is that ssh to the fip is not working
16:59:31 if someone wants to debug it more, I can send a list of jobs which failed because of that :)
16:59:40 send it to me
16:59:46 mlavalle: ok, thx
17:00:08 we have to finish now
17:00:12 thx for attending guys
17:00:15 #endmeeting