16:00:29 #startmeeting neutron_ci
16:00:30 Meeting started Tue Oct 15 16:00:29 2019 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:31 hi
16:00:32 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:33 hi
16:00:34 The meeting name has been set to 'neutron_ci'
16:00:55 * slaweq is on last meeting today \o/
16:00:56 o/
16:02:15 ok, I think we can start as bcafarel will not be available today
16:02:23 Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:27 please open it now
16:02:36 #topic Actions from previous meetings
16:02:44 njohnston Update the neutron-tempest-plugin dashboard in grafana
16:03:20 so the change has been submitted
16:03:37 #link https://review.opendev.org/687686
16:04:09 waiting for another +2
16:04:24 (you have my +1)
16:04:37 thx njohnston
16:04:38 It is dependent on this change by slaweq https://review.opendev.org/#/c/685214/
16:05:00 ahh, yes, I wanted to ask for review of it today
16:05:03 ^ this one, IMO, is ready to be merged
16:05:04 njohnston: can You?
16:05:06 :)
16:05:16 slaweq: your wish is my command. +W
16:05:35 LOL
16:05:37 thx
16:05:50 ok, so let's move on then
16:05:55 ralonsoh to check root cause of ssh issue in https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_9c2/664646/6/check/tempest-ipv6-only/9c2f68f/testr_results.html.gz
16:06:15 slaweq, sorry but I didn't find the error
16:06:46 I still don't know why the IP is not given correctly to the VM
16:06:54 and I don't see any DHCP message
16:07:13 it is like this VM has no connectivity to the DHCP
16:07:28 but it's a single node job, right?
16:07:33 yes
16:07:51 so both ports (vm and dhcp) should be plugged into br-int
16:07:56 yes
16:08:30 I can give another try
16:08:33 there can be only two reasons IMO - either the port wasn't wired properly or the firewall rules were not configured on it properly
16:08:39 nono ralonsoh
16:08:41 but I spent one whole morning on this
16:08:51 if there is nothing in the logs then You can't do anything :)
16:08:58 don't waste Your time on it
16:09:50 well, usually that is an indication we need to improve our logging
16:10:07 clarkb: yes, I was just thinking the same
16:10:09 but ya, update logging to cover likely cases, then move on and wait for it to hit again (can also update elastic-recheck to track it for you)
16:10:18 clarkb, and that's usually the next action
16:11:23 ralonsoh: but do You think we can add some additional debug logs to our code to know something more about it?
16:11:46 slaweq, I'll check it and if possible, I'll submit a patch
16:11:57 tbh we should be able to see if ports were configured by the ovs agent
16:12:22 and in fact IMO it was configured, otherwise neutron would not report to nova that the port is UP
16:12:32 and nova would not unpause the vm, right?
16:12:47 slaweq, yes, and we see it in the ovs logs
16:13:08 so IMO it's more likely this was some issue with openflow rules in br-int
16:13:36 slaweq, this is something we maybe should log better in debug/testing mode
16:14:01 ralonsoh: is there any way to maybe log all OF rules from bridges (br-int) every e.g. 1 second?
16:14:22 slaweq, uffff be careful
16:14:23 something similar to e.g. dstat: https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_9c2/664646/6/check/tempest-ipv6-only/9c2f68f/controller/logs/screen-dstat.txt.gz
16:14:35 ralonsoh: LOL, I know but that would help us a lot
16:14:35 this could be too much for the logs
16:14:45 but yes, we can at least track the changes
16:15:03 ralonsoh: right, timestamp and what flow rule was added/removed
16:15:26 exactly, but I need to check if ovs-ofctl allows this kind of monitoring
16:15:29 do You think this would be possible?
16:15:38 I need to check it first
16:15:46 like ovs-vsctl monitor
16:15:56 ralonsoh: exactly
16:16:00 will You do this?
16:16:04 sure
16:16:08 thx a lot
16:16:39 #action ralonsoh to check if there is any possibility to do something like ovsdb-monitor for openflows
16:17:05 thx ralonsoh for taking care of it
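A minimal sketch of what the flow monitor discussed above could look like: "ovs-ofctl monitor <bridge> watch:" streams flow-table changes as they happen, which is much cheaper than dumping all flows every second. This assumes ovs-ofctl is installed and a br-int bridge exists; it is an illustration only, not the patch ralonsoh went on to investigate.

```python
import datetime
import subprocess

BRIDGE = "br-int"  # assumption: the single-node job's integration bridge

# "watch:" with no spec asks OVS to report every flow add/modify/delete
# on the bridge, one event per line on stdout.
proc = subprocess.Popen(
    ["ovs-ofctl", "monitor", BRIDGE, "watch:"],
    stdout=subprocess.PIPE,
    text=True,
)
try:
    for line in proc.stdout:
        # Timestamp each event so it can be correlated with test and
        # agent logs afterwards.
        print(datetime.datetime.utcnow().isoformat(), line.rstrip())
finally:
    proc.terminate()
```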
16:17:09 let's move on
16:17:18 next was
16:17:19 slaweq to prepare etherpad to track dropping py27 jobs from ci
16:17:26 The etherpad is there: https://etherpad.openstack.org/p/neutron_drop_python_2_status but it's not ready yet
16:17:39 I will prepare a list of jobs to remove in each repo
16:18:10 I know we already started it for some repos like networking-ovn and neutron
16:18:24 this should be much faster than the switch to python 3 :)
16:19:08 just one note on this
16:19:15 njohnston: sure
16:19:27 there are some projects out there that are being more aggressive in deprecating python 2
16:19:39 for example, by starting the process of removing six
16:20:13 that level of effort should be discussed and planned out, especially for projects that get imported by other projects like neutron/neutron-lib
16:20:38 so while it may seem tempting, I just wanted to note for posterity's sake that such moves should be held off, at this time
16:20:39 njohnston: personally I don't think we should do it now
16:20:57 1. because of backports to stable branches which still support py2
16:21:12 the particular project I saw with the six removal was a widely used oslo project, and so everything I said obviously goes double for oslo
16:21:52 is there any "common goal" to remove such pieces of code during ussuri also?
16:22:02 There's discussion underway: https://etherpad.openstack.org/p/drop-python2-support
16:22:07 Oh, you already saw that.
16:22:09 or does dropping py2 support just mean that we will not test it in u/s ci?
16:22:41 No wait, the etherpad above is neutron-specific. So that tc one is still relevant here.
16:22:58 bnemec: yes, thx for the link
16:23:05 I will read it later
16:23:19 thanks bnemec, that link was my next thing to say :-)
16:24:03 so to sum up: the stadium is complicated, deprecating py2 is complicated, so let's do our usual neutron thing and overcommunicate about it all :-)
16:24:16 lol
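For context, removing six mostly means mechanical conversions like the illustrative sketch below (hypothetical class and function names, not from a real neutron patch):

```python
import abc

import six  # the compatibility dependency that goes away entirely


# Before: python 2/3 compatible spellings via six.
class DriverBase(six.with_metaclass(abc.ABCMeta, object)):
    pass


def is_name(value):
    return isinstance(value, six.string_types)


# After: python 3 only equivalents, no six needed.
class DriverBase3(abc.ABC):
    pass


def is_name3(value):
    return isinstance(value, str)
```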
16:25:16 njohnston: what do You mean by "overcommunicate"? :)
16:28:46 hmm, I think we lost njohnston now
16:28:56 maybe we can continue with other topics then
16:28:58 #topic Stadium projects
16:29:07 I am here. Sorry, I thought you were saying that tongue-in-cheek
16:29:41 ahh, no
16:29:53 sorry, I really wanted to know what You meant exactly
16:30:07 I just meant that we'll talk about it in all our meetings and coordinate about it verbosely
16:30:16 isn't preparing some etherpad and syncing weekly about it enough?
16:30:20 ahh, ok
16:30:33 thx, so that's something which I want to do with this :)
16:31:33 * njohnston is done
16:31:36 regarding stadium projects, I think we can remove python3 migration from this agenda as we did for the neutron meeting
16:31:41 agreed
16:31:54 so we still have the tempest-plugins migration
16:32:01 and there is no news about it
16:32:26 anything else You want to discuss regarding stadium projects' ci?
16:33:06 ahh, I have one more small thing
16:33:10 please review https://review.opendev.org/#/c/685213/ - it's needed for Train and for stadium projects
16:34:07 if there is nothing else related to stadium, let's move on to the next topic
16:34:09 #topic Grafana
16:34:19 #link http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:35:24 it's a bit strange that we have so little data for the gate queue
16:35:54 did we really merge only a few patches in neutron last week?
16:36:01 not many merged last week
16:36:05 merges
16:36:50 hmm, it really could be like that
16:37:48 in the check queue our biggest problems IMO are still the functional and fullstack jobs
16:38:21 yep
16:38:30 I have a couple of examples of new failures there
16:39:05 one good thing: I think our (voting) scenario jobs have worked quite well recently
16:39:43 agreed
16:40:28 ok, so let's talk about those functional/fullstack failures
16:40:38 #topic fullstack/functional
16:40:45 first fullstack
16:41:05 ralonsoh: I saw 2 new failures of our old friend: neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_reschedule_network_on_new_agent
16:41:11 https://2f600b8d6843c7d64afe-bbbb707b755a08f42bfac9929d4d55b4.ssl.cf2.rackcdn.com/688439/3/check/neutron-fullstack/119e702/testr_results.html.gz
16:41:18 and
16:41:20 https://e5965e413edbbc117465-bfd96e490b07511790a1eb1aa4beb29d.ssl.cf2.rackcdn.com/665467/37/check/neutron-fullstack/4437eb3/testr_results.html.gz
16:41:40 IIRC You added some extra logs some time ago to debug this when it happens again, right?
16:42:10 slaweq, yes. I'll add this to my pile for tomorrow
16:42:23 ralonsoh: thx, I can help with this if You want
16:43:34 #action ralonsoh to investigate failed fullstack tests for dhcp agent rescheduling
16:43:51 ok, next one
16:43:52 neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs
16:43:57 https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_7c2/684457/1/check/neutron-fullstack/7c2c5d8/testr_results.html.gz
16:44:09 It seems like only one icmp packet was sent and the test failed: https://openstack.fortnebula.com:13808/v1/AUTH_e8fd161dc34c421a979a9e6421f823e9/zuul_opendev_logs_7c2/684457/1/check/neutron-fullstack/7c2c5d8/controller/logs/dsvm-fullstack-logs/TestDscpMarkingQoSOvs.test_dscp_marking_packets.txt.gz
16:44:21 so as it is "my" test, I will debug this
16:44:24 ok for You?
16:44:29 ok
16:44:32 +1
16:44:35 :)
16:44:48 #action slaweq to investigate failed neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs
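As a reminder of what this test exercises, DSCP marking can be eyeballed by hand roughly like the hypothetical sketch below (interface name, target address, and DSCP value are assumptions; tcpdump needs root; this is not the fullstack test's own code):

```python
import subprocess

DSCP = 26              # example mark a QoS DSCP rule might set
TOS = DSCP << 2        # DSCP is the top six bits of the IP ToS byte
IFACE = "eth0"         # assumption: interface carrying the marked traffic
TARGET = "192.0.2.10"  # assumption: address reached through the QoS'd port

# Capture up to 5 ICMP packets whose ToS byte carries the expected mark
# (assumes the ECN bits are zero, as they are for plain ping).
capture = subprocess.Popen(
    ["tcpdump", "-l", "-n", "-i", IFACE, "-c", "5",
     "icmp and ip[1] == %d" % TOS],
    stdout=subprocess.PIPE, text=True,
)
subprocess.run(["ping", "-c", "5", TARGET], check=False)
try:
    out, _ = capture.communicate(timeout=30)
except subprocess.TimeoutExpired:
    capture.kill()
    out = ""
print("marked packets seen" if out.strip() else "no marked packets seen")
```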
16:45:01 now functional tests
16:45:04 neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking(egress)
16:45:09 https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_692/688439/3/check/neutron-functional/69288a9/testr_results.html.gz
16:45:21 this one is interesting IMO
16:45:26 in the test logs: https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_692/688439/3/check/neutron-functional/69288a9/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking_egress_.txt.gz
16:45:38 there is a lot of "EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest send_event" log lines
16:45:59 at first glance it looks to me like something was looping there infinitely
16:46:09 that's quite unusual
16:46:38 njohnston: yep
16:47:25 does anyone want to investigate that?
16:47:34 sorry, not now
16:47:34 if not, I can assign it to myself
16:47:41 I have enough backlog
16:47:45 ralonsoh: sure, I know You are overloaded :)
16:47:57 ok, I will try to check that
16:48:26 #action slaweq to check strange "EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest send_event" log lines in neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking
16:48:42 ok, next one
16:48:51 this one might be interesting for njohnston :)
16:48:59 failed db migration tests again
16:49:01 https://storage.bhs1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_692/688439/3/check/neutron-functional/69288a9/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking_egress_.txt.gz
16:49:10 but this time it's not due to a timeout and a slow node
16:49:32 sorry, wrong link
16:50:01 argh, I don't have the correct link for this one now
16:50:03 sorry
16:50:17 slaweq: np
16:50:21 if I find it somewhere, I will ping You
16:50:28 slaweq: sure thing
16:50:46 slaweq: also, as a lead on the ofctl error from the last bug, check out https://bugzilla.redhat.com/show_bug.cgi?id=1382372
16:50:46 bugzilla.redhat.com bug 1382372 in openstack-neutron "Selinux is blocking ovs-vswitchd during functional tests" [Urgent,Closed: currentrelease] - Assigned to twilson
16:51:02 it's been seen before, as an selinux issue
16:51:44 njohnston: but u/s ci runs on ubuntu and I don't think selinux is enabled there
16:52:54 darn bugzilla, bamboozled again
16:53:17 lol
16:53:58 ok, let's move on to the last one:
16:54:00 neutron.tests.functional.agent.test_firewall.FirewallTestCase
16:54:02 https://08b11cea7395c153ac8e-9514c1b1570a8e9931b2a7d3207ef22f.ssl.cf2.rackcdn.com/684457/1/check/neutron-functional/06e9d0c/testr_results.html.gz
16:55:07 the flows were not applied?? maybe
16:55:23 (we need a flow monitor)
16:55:36 ralonsoh: yes
16:55:48 otherwise it may be hard to say what happened there
16:56:13 but in this case it's not only "br-int" to monitor
16:56:20 I know
16:56:21 each test has got its own bridge
16:56:36 every test case should be able to deploy its own monitor
16:56:42 in any bridge
16:56:53 I'll put extra effort into this
16:57:02 maybe we can do some simple decorator to run tests and, in case of failure, somehow dump the bridges used in the test?
16:57:19 this could be useful too
16:57:21 that would be very useful
16:57:23 *dump flows
16:57:40 does anyone want to do it maybe?
16:57:47 if not, I can
16:57:49 :)
16:57:56 I can check it
16:57:57 at least add it to my backlog
16:58:01 ok, thx ralonsoh
16:58:04 along with the flow monitor
16:58:27 #action ralonsoh to try to log flows at the end of a failed functional test
16:58:41 ralonsoh: ^^ it's just a reminder, it's not urgent for sure :)
16:58:52 I'll also see if I can clear enough time to take a look
16:58:59 thx njohnston
16:59:03 thanks!
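A rough sketch of the decorator idea discussed above, with a hypothetical bridge_names attribute standing in for however the test actually tracks the bridges it created:

```python
import functools
import subprocess


def dump_flows_on_failure(test_method):
    """Dump the OpenFlow rules of the test's bridges when the test fails.

    Hypothetical sketch: assumes the test instance records the names of
    the bridges it created in a ``bridge_names`` attribute.
    """
    @functools.wraps(test_method)
    def wrapper(self, *args, **kwargs):
        try:
            return test_method(self, *args, **kwargs)
        except Exception:
            # Best-effort dump; never mask the original test failure.
            for bridge in getattr(self, "bridge_names", []):
                result = subprocess.run(
                    ["ovs-ofctl", "dump-flows", bridge],
                    capture_output=True, text=True, check=False,
                )
                print("flows on %s at failure:\n%s" % (bridge, result.stdout))
            raise
    return wrapper
```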
16:59:20 ok, I think we have to finish now
16:59:25 one last thing
16:59:31 please add https://review.opendev.org/#/c/685705/ to Your review pile
16:59:42 thx for attending
16:59:51 and have a great day/evening :)
16:59:53 bye
16:59:57 bye!
16:59:57 #endmeeting