16:00:37 <slaweq> #startmeeting neutron_ci
16:00:39 <slaweq> hi
16:00:39 <openstack> Meeting started Tue Oct 29 16:00:37 2019 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:40 <ralonsoh> hi
16:00:41 <njohnston> o/
16:00:42 <openstack> The meeting name has been set to 'neutron_ci'
16:01:04 <bcafarel> o/
16:01:50 <slaweq> ok, we already have the usual participants so we can start
16:02:05 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:10 <slaweq> please open now
16:04:35 <slaweq> sorry, I had to take care of my daughter
16:04:40 <slaweq> but now I'm back
16:04:50 <slaweq> #topic Actions from previous meetings
16:05:02 <slaweq> first one
16:05:04 <slaweq> ralonsoh to check if there is any possibility to do something like ovsdb-monitor for openflows
16:05:19 <ralonsoh> one sec
16:05:26 <ralonsoh> #link https://review.opendev.org/#/c/689150/
16:05:39 <ralonsoh> this is a patch to implement an OF monitor
16:05:49 <ralonsoh> then, if approved, we can use it in UT/FT testing
16:06:18 <slaweq> and You want to use it only in the tests?
16:06:31 <ralonsoh> for now I don't see any other use
16:06:50 <ralonsoh> we raised this bug just for debugging tests
16:06:58 <ralonsoh> but of course, it could be used anywhere
16:07:38 <slaweq> yes, but I wonder if we couldn't simply run ovs-ofctl monitor as an external service and log its output to some file
16:07:51 <slaweq> similarly to how dstat works now in CI jobs
16:07:59 <slaweq> wouldn't that be enough?
16:08:37 <ralonsoh> well you need a bit of logic there to clean the output
16:08:43 <ralonsoh> but sure this can be done
16:09:07 <ralonsoh> the goal of this class is to be used anywhere you want to track the OF changes
16:09:42 <slaweq> ok, this monitor is run "per bridge" so it would be better to run it "per test" or "per test class" probably
16:09:53 <ralonsoh> yes, only per bridge
16:09:58 <ralonsoh> cmd = ['ovs-ofctl', 'monitor', bridge_name, 'watch:', '--monitor']
16:10:02 <slaweq> so Python is better for that, right :)
16:10:07 <ralonsoh> I think so
16:10:31 <slaweq> thx ralonsoh, I will add this patch to my review queue
16:10:46 <ralonsoh> thanks
16:11:49 <slaweq> thx for working on this, that may help us a lot with debugging some failed tests in the future
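(For context, a minimal sketch of the per-bridge monitor idea discussed above, built around the exact `ovs-ofctl` command ralonsoh quoted. The class name and log-file handling are illustrative assumptions, not the contents of the actual patch in https://review.opendev.org/#/c/689150/.)

```python
# Illustrative sketch only -- not the code from review 689150.
# Tails 'ovs-ofctl monitor <bridge> watch: --monitor' and appends every
# reported OpenFlow change to a log file, one monitor process per bridge.
import subprocess
import threading


class OFMonitor(object):

    def __init__(self, bridge_name, log_path):
        self._cmd = ['ovs-ofctl', 'monitor', bridge_name, 'watch:', '--monitor']
        self._log_path = log_path
        self._proc = None

    def start(self):
        self._proc = subprocess.Popen(
            self._cmd, stdout=subprocess.PIPE, universal_newlines=True)
        reader = threading.Thread(target=self._drain, daemon=True)
        reader.start()

    def _drain(self):
        # Copy monitor output to the log file as it arrives.
        with open(self._log_path, 'a') as log_file:
            for line in self._proc.stdout:
                log_file.write(line)
                log_file.flush()

    def stop(self):
        if self._proc:
            self._proc.terminate()
            self._proc.wait()
```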
16:12:07 <slaweq> ok, next action
16:12:11 <slaweq> ralonsoh to investigate failed fullstack tests for dhcp agent rescheduling
16:12:20 <ralonsoh> as commented in https://bugs.launchpad.net/neutron/+bug/1799555/comments/23
16:12:20 <openstack> Launchpad bug 1799555 in neutron "Fullstack test neutron.tests.fullstack.test_dhcp_agent.TestDhcpAgentHA.test_reschedule_network_on_new_agent timeout" [High,Fix committed] - Assigned to Rodolfo Alonso (rodolfo-alonso-hernandez)
16:12:33 <ralonsoh> more time was spent reviewing logs than implementing the patch
16:12:40 <ralonsoh> just a timeout issue
16:12:46 <ralonsoh> #link https://review.opendev.org/#/c/689550/
16:13:00 <ralonsoh> that's all
16:13:07 <slaweq> yeah, I saw Your comment and patch, which looks fine to me
16:13:11 <ralonsoh> cool
16:13:26 <slaweq> njohnston: bcafarel: please also review it if You have some time :)
16:13:56 <njohnston> will do
16:14:02 <slaweq> njohnston: thx
16:14:05 <slaweq> ok, next one
16:14:07 <slaweq> slaweq to investigate failed neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs
16:14:10 <bcafarel> sure thing, added to the list :)
16:14:23 <slaweq> unfortunately I didn't have time for this
16:14:41 <slaweq> I will try to do it this week (or next), maybe at the airport or on the airplane
16:14:46 <slaweq> #action slaweq to investigate failed neutron.tests.fullstack.test_qos.TestDscpMarkingQoSOvs
16:15:06 <slaweq> bcafarel: also thx :)
16:15:09 <slaweq> ok, next one
16:15:11 <slaweq> slaweq to check strange "EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest send_event" log lines in neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking
16:15:20 <slaweq> for this one I also didn't have time
16:15:26 <slaweq> sorry for that
16:15:30 <slaweq> #action slaweq to check strange "EVENT OVSNeutronAgentOSKenApp->ofctl_service GetDatapathRequest send_event" log lines in neutron.tests.functional.agent.l2.extensions.test_ovs_agent_qos_extension.TestOVSAgentQosExtension.test_port_creation_with_dscp_marking
16:16:35 <slaweq> next one
16:16:37 <slaweq> ralonsoh to try to log flows at the end of failed functional tests
16:16:39 <ralonsoh> yes
16:16:45 <ralonsoh> once the previous patch is approved
16:16:55 <ralonsoh> #link https://review.opendev.org/#/c/689150/10
16:17:04 <ralonsoh> then we can use it for the tests
16:17:06 <ralonsoh> one question
16:17:25 <ralonsoh> this can be used during the test, logging the OF changes
16:17:42 <ralonsoh> or just when the test fails, we can print the OF changes in the log
16:17:47 <ralonsoh> thoughts?
16:18:04 <slaweq> I think that logging during the test is better for debugging
16:18:07 <ralonsoh> perfect
16:18:16 <slaweq> as then You can potentially find out about some race conditions
16:18:30 <ralonsoh> easier, just make use of the class and when needed, log the OF changes
16:18:55 <ralonsoh> that's all
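(One way the class could be used "per test", as discussed above -- a sketch assuming the hypothetical OFMonitor from the previous note; fixtures is the library neutron's tests already rely on.)

```python
# Sketch: run the (hypothetical) OFMonitor for the whole test lifetime so
# the OF changes are already in the log if the test fails.
import fixtures


class OFMonitorFixture(fixtures.Fixture):

    def __init__(self, bridge_name, log_path):
        super().__init__()
        self.bridge_name = bridge_name
        self.log_path = log_path

    def _setUp(self):
        self.monitor = OFMonitor(self.bridge_name, self.log_path)
        self.monitor.start()
        self.addCleanup(self.monitor.stop)


# Usage in a test (bridge name and log path are illustrative):
#     self.useFixture(OFMonitorFixture('br-int', '/tmp/of-changes.log'))
```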
16:19:04 <slaweq> ok, thx ralonsoh
16:19:12 <slaweq> so You will take care of this, right?
16:19:15 <ralonsoh> sure
16:19:28 <slaweq> can You maybe open a launchpad bug for it? Just for tracking purposes
16:19:33 <ralonsoh> yes of course
16:19:36 <slaweq> thx
16:20:14 <slaweq> #action ralonsoh to open LP about adding OF monitor to functional tests
16:20:29 <slaweq> ok, I think we can move on to the next topic then
16:20:31 <slaweq> #topic Stadium projects
16:20:47 <slaweq> regarding the tempest-plugin migration, I don't have any updates
16:21:06 <slaweq> do You have anything related to stadium projects?
16:21:22 <njohnston> should we start tracking py2 support removal and zuul v3 job conversion goals in etherpads (either new or reused)?
16:21:43 <njohnston> Or should we wait until after Shanghai for those to be finalized?
16:22:12 <slaweq> I think we can start doing that, based on what was announced recently by gmann
16:22:37 <slaweq> but even if we want to start it "now", it will effectively be after the PTG :)
16:22:44 <njohnston> true, true :-)
16:23:07 <njohnston> ok so maybe let's start talking about it at the meeting after the PTG
16:23:07 <bcafarel> unless everything is completed by the time we fly back :)
16:23:10 <slaweq> njohnston: will You maybe prepare such an etherpad to track progress?
16:23:38 <slaweq> bcafarel: yeah :)
16:23:46 <njohnston> #action njohnston prepare etherpad to track stadium progress for zuul v3 job definition and py2 support drop
16:23:52 <slaweq> thx njohnston
16:24:54 <slaweq> ok, let's move on then
16:24:56 <slaweq> #topic Grafana
16:25:16 <slaweq> I have only one thing to say about grafana
16:25:26 <slaweq> infra had an issue with disk space for a few days
16:25:35 <slaweq> so we are missing data from the last few days
16:25:44 <slaweq> it started working yesterday evening
16:26:41 <slaweq> now our numbers there look fine, but in fact we don't have much recent data to check
16:26:43 <njohnston> yeah, tough to make judgements
16:27:06 <slaweq> exactly
16:27:22 <slaweq> I see one thing: we should remove the py27 jobs from it as we removed them from the queues
16:27:27 <slaweq> I will do it today
16:27:40 <slaweq> #action slaweq to send patch to remove py27 jobs from grafana
16:27:59 <njohnston> +1
16:28:27 <slaweq> anything else You want to add/ask about grafana?
16:29:00 <njohnston> nope
16:29:08 <ralonsoh> no
16:29:17 <bcafarel> let's just wait for it to fill again :)
16:29:27 <slaweq> ok, so let's move on then
16:29:33 <slaweq> #topic fullstack/functional
16:29:45 <slaweq> I found a few new issues for You :)
16:30:11 <slaweq> first one is something which we saw in the past already
16:30:13 <slaweq> Problem with "AttributeError: 'str' object has no attribute 'content_type'" error again, see:
16:30:18 <slaweq> https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_12f/690908/1/check/neutron-functional/12f8c37/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.linux.test_l3_tc_lib.TcLibTestCase.test_get_filter_id_for_ip.txt.gz
16:30:38 <ralonsoh> this one is very curious
16:30:50 <ralonsoh> because this is happening in the test case result return
16:31:08 <njohnston> weird
16:31:22 <ralonsoh> IMO, this is something related to testtools (but I can't confirm)
16:32:02 <ralonsoh> btw, I don't know if the test case failed or not (that would be a good indicator of what is in the content)
16:32:16 <slaweq> yeah, it is strange
16:32:21 <slaweq> but if You look at http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22AttributeError%3A%20'str'%20object%20has%20no%20attribute%20'content_type'%5C%22
16:32:33 <slaweq> it seems that this happens only on neutron functional/fullstack jobs :/
16:32:51 <ralonsoh> ok, so we have a pattern, pointing to us....
16:33:14 <slaweq> yep
16:33:26 <slaweq> I will open LP for this one
16:33:32 <slaweq> as it happens from time to time
16:33:40 <slaweq> maybe we will find a fix for it
16:34:02 <slaweq> or some volunteer to work on this then :)
16:34:17 <ralonsoh> I can take a look
16:34:29 <slaweq> #action slaweq to open LP related to "AttributeError: 'str' object has no attribute 'content_type'" error
16:34:45 <slaweq> ralonsoh: thx, I will ping You with a link to the LP when I open it
16:34:49 <ralonsoh> thanks
16:34:59 <slaweq> if You have time, it would be great if You could check it
16:35:01 <slaweq> thx a lot
16:36:11 <slaweq> next failure which I found is
16:36:13 <slaweq> https://0c68218832dc7cac70c7-9752c849fa19cb3b4ae0f2b2e19d3d65.ssl.cf2.rackcdn.com/691710/2/check/neutron-functional/15b9600/testr_results.html.gz
16:36:33 <slaweq> but in this case I don't see anything obvious in the test's log
16:36:35 <slaweq> https://0c68218832dc7cac70c7-9752c849fa19cb3b4ae0f2b2e19d3d65.ssl.cf2.rackcdn.com/691710/2/check/neutron-functional/15b9600/controller/logs/dsvm-functional-logs/neutron.tests.functional.agent.test_dhcp_agent.DHCPAgentOVSTestCase.test_good_address_allocation.txt.gz
16:37:15 <ralonsoh> slaweq, maybe
16:37:21 <ralonsoh> in assert_good_allocation_for_port
16:37:42 <ralonsoh> what we need to do is to retrieve the interface name and run the dhclient in the wait loop
16:37:52 <ralonsoh> not only to check the ip list
16:38:44 <slaweq> but when You run dhclient it will wait for a DHCP reply and retry a couple of times, no?
16:38:55 <slaweq> so it should already be like a "wait loop"
16:39:33 <ralonsoh> I can take a closer look at those logs tomorrow
16:41:04 <slaweq> thx ralonsoh, but please don't waste too much of Your time on it, I have only seen it once in the gate so far
16:41:11 <ralonsoh> ok
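(A rough sketch of the retry idea ralonsoh describes above -- re-triggering dhclient inside the wait loop instead of only polling the address list. wait_until_true and ip_lib are real neutron helpers; the function and its 'namespace'/'port_name' arguments are illustrative assumptions, not names from the actual test.)

```python
# Rough sketch of the idea above, not the actual test code.
from neutron.agent.linux import ip_lib
from neutron.common import utils


def wait_for_dhcp_address(namespace, port_name, timeout=60):
    device = ip_lib.IPDevice(port_name, namespace=namespace)

    def predicate():
        if device.addr.list(ip_version=4, scope='global'):
            return True
        # Kick dhclient again so a single lost DHCP reply does not
        # fail the whole test.
        ip_lib.IPWrapper(namespace=namespace).netns.execute(
            ['dhclient', '-1', port_name], check_exit_code=False)
        return False

    utils.wait_until_true(predicate, timeout=timeout)
```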
16:41:27 <slaweq> ok, let's move on
16:41:31 <slaweq> fullstack tests now
16:41:46 <slaweq> I found one new bug https://bugs.launchpad.net/neutron/+bug/1850292
16:41:46 <openstack> Launchpad bug 1850292 in neutron "Fullstack test can try to use IP address from outside the subnet's allocation pool" [Medium,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:41:55 <slaweq> I have an almost-working patch for it already
16:43:00 <slaweq> but it doesn't seem very urgent to me as it doesn't hit that often
16:43:25 <slaweq> that's all about functional/fullstack on my side
16:43:34 <slaweq> do You have anything else to add/ask?
16:43:59 <ralonsoh> no
16:44:13 <slaweq> ok, let's move on then
16:44:19 <slaweq> #topic Tempest/Scenario
16:44:20 <njohnston> not on this topic
16:44:35 <slaweq> njohnston: not on which topic?
16:44:56 <njohnston> slaweq: was responding to "do You have anything else to add/ask?" just slowly
16:45:02 <slaweq> ahh, ok
16:45:03 <slaweq> :)
16:45:19 <slaweq> ok, going back to scenario tests
16:45:28 <slaweq> I found that the multicast scenario test is failing often
16:45:41 <slaweq> so I reported bug https://bugs.launchpad.net/neutron/+bug/1850288
16:45:41 <openstack> Launchpad bug 1850288 in neutron "scenario test test_multicast_between_vms_on_same_network fails" [Critical,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:45:54 <slaweq> and I proposed marking it as unstable for now https://review.opendev.org/691855
16:46:14 <slaweq> but later I will want to investigate why it is failing from time to time
16:47:11 <slaweq> I also saw some various connectivity issues, like:
16:47:17 <slaweq> https://storage.gra1.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_3ae/682483/1/check/tempest-slow-py3/3ae72c6/testr_results.html.gz
16:47:24 <slaweq> and
16:47:26 <slaweq> https://2a0fd24cd939d3b06641-34d5a3d9e2a9ca67eb62c8365f7602e7.ssl.cf5.rackcdn.com/679813/3/check/tempest-slow-py3/2313102/testr_results.html.gz
16:48:38 <ralonsoh> in both examples, one VM obtains an IP and the other one does not
16:48:59 <slaweq> ralonsoh: in both cases it was IMO the same vm
16:49:07 <slaweq> but it was resized/cold migrated
16:49:13 <slaweq> so initially it got IP
16:49:15 <ralonsoh> hmmm correct
16:49:20 <slaweq> and after resize/migration it failed
16:49:32 <slaweq> that's how I see it
16:49:41 <slaweq> but I just checked it briefly for now
16:50:19 <slaweq> but I think that there is potentially some problem (race) there and we should check that
16:50:27 <slaweq> is there any volunteer maybe?
16:51:23 <ralonsoh> (distracted whistling)
16:51:34 <bcafarel> (looking intently at the ceiling)
16:51:53 <slaweq> ok, I will open LP for that and we will see
16:51:53 <njohnston> (wanders away aimlessly)
16:52:12 <slaweq> LOL
16:52:27 <slaweq> You guys rock in such moments :P
16:52:27 <njohnston> we're quite a bunch, aren't we
16:52:38 <ralonsoh> hehehehe
16:52:55 <slaweq> njohnston: yeah, indeed :P
16:53:08 <bcafarel> that is some team spirit at least :)
16:53:23 <slaweq> #action slaweq to report LP about connectivity issues after resize/migration
16:53:32 <slaweq> bcafarel: yeah
16:54:13 <slaweq> 👍
16:54:37 <slaweq> ok, now some good news
16:54:53 <slaweq> mlavalle recently finally found the root cause of the issue with router migrations and fixed it
16:54:57 <slaweq> patch is merged https://review.opendev.org/#/c/691498/
16:55:01 <njohnston> \o/
16:55:13 <ralonsoh> good one
16:55:15 <slaweq> so I hope we will have better numbers on the dvr multinode scenario job now
16:55:33 <slaweq> ralonsoh: yes, that was an interesting issue
16:55:43 <slaweq> in fact it was a bug in our code, not in the tests
16:56:20 <slaweq> ok, and the last one from me
16:56:29 <slaweq> related more to grenade jobs
16:56:47 <slaweq> I saw that some jobs are failing due to an issue in the nova scheduler
16:56:52 <slaweq> so I opened bug https://bugs.launchpad.net/nova/+bug/1850291
16:56:52 <openstack> Launchpad bug 1844929 in OpenStack Compute (nova) "duplicate for #1850291 grenade jobs failing due to "Timed out waiting for response from cell" in scheduler" [High,Confirmed]
16:57:00 * bcafarel waits for the https://review.opendev.org/#/c/691498/ backports to appear
16:57:11 <slaweq> but mriedem just marked it as a duplicate of another one, so it is already a known issue for the nova team
16:57:59 <slaweq> and that's all on my side for today
16:58:07 <ralonsoh> just in time
16:58:09 <njohnston> I have one thing.  Looking at the numbers for the *-uwsgi jobs, they closely mirror their non-uwsgi versions but often are a bit lower.  I'd like to propose we make them voting.  I'll propose a change and we can conduct the conversation in Gerrit.
16:58:14 <slaweq> next week I will cancel the CI meeting as we will be at the PTG
16:58:24 <ralonsoh> njohnston, are you sure?
16:58:41 <ralonsoh> ok, let's talk in gerrit
16:59:01 <slaweq> njohnston++
16:59:02 <njohnston> ralonsoh: I set up some custom graphite searches and it looked pretty good to me.  The tempest uwsgi job has been quite stable since the fix for it went in in mid-August
16:59:15 <ralonsoh> then perfect for me
16:59:16 <bcafarel> nice
16:59:28 <slaweq> njohnston: I remember that there were quite a lot of timeouts on this job some time ago
16:59:35 <slaweq> but I don't know exactly how it is now
16:59:41 <slaweq> lets talk about it on gerrit
16:59:49 <slaweq> ok, thx for attending
16:59:51 <njohnston> it's much better, as you can see in the current Grafana - the uwsgi lines are low and tight
16:59:54 <njohnston> thanks all
16:59:55 <slaweq> and have a great week
17:00:00 <njohnston> safe travels all
17:00:00 <slaweq> o/
17:00:01 <ralonsoh> bye!
17:00:02 <bcafarel> o/
17:00:03 <njohnston> o/
17:00:03 <slaweq> #endmeeting