16:00:31 <jlibosva> #startmeeting neutron_ci
16:00:32 <openstack> Meeting started Tue Jan 30 16:00:31 2018 UTC and is due to finish in 60 minutes.  The chair is jlibosva. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:36 <openstack> The meeting name has been set to 'neutron_ci'
16:00:40 <mlavalle> o/
16:00:45 <dalvarez> \o
16:00:46 <slaweq> hi
16:00:57 <jlibosva> hi all, Ihar texted me he won't be able to make it to this meeting on time, so I'll try to do my best :)
16:01:06 <jlibosva> #topic Actions from prev meeting
16:01:15 <jlibosva> "mlavalle to propose a patch removing last tempest bits from neutron tree"
16:01:36 <mlavalle> I did !
16:01:44 <mlavalle> #link https://review.openstack.org/#/c/536931/
16:02:07 <jlibosva> nice
16:02:11 <mlavalle> the tough part here is that there is a long list of dependencies in terms of patches
16:02:39 <jlibosva> is there anything we could do to speed up the dynamic-routing repo changes?
16:02:40 <mlavalle> that end up here https://review.openstack.org/#/c/528990/
16:02:58 <mlavalle> https://review.openstack.org/#/c/528992/
16:03:11 <jlibosva> frickler: hi ^^ we need your advice here :)
16:03:14 <mlavalle> if we get these two patches to move, we will get the entire chain in
16:03:14 <jlibosva> if you're around
16:03:49 <jlibosva> I guess he's not :)
16:03:59 <mlavalle> I'll continue bugging him :-)
16:04:11 <jlibosva> thanks :)
16:04:13 <jlibosva> next one was
16:04:17 <jlibosva> "otherwiseguy to continue digging ovsdb retry / stream parsing issue and try things"
16:04:27 <jlibosva> I think the "try things" part was done :)
16:04:42 <mlavalle> LOL
16:04:43 <jlibosva> https://review.openstack.org/#/c/537241/
16:04:55 <jlibosva> and we have a fix!! Great job otherwiseguy
16:05:14 <mlavalle> ++
16:05:20 <jlibosva> do we need a new release now for the change to take effect in neutron gates?
16:05:57 <mlavalle> we probably do
16:06:11 <jlibosva> ok, I can propose one
16:06:30 <jlibosva> #action jlibosva to request a new release for ovsdbapp
16:06:40 <slaweq> but will we have to bump the version in requirements also?
16:06:44 <slaweq> or is it not necessary?
16:06:55 <jlibosva> perhaps upper-constraints
16:07:11 <slaweq> so please remember about that :)
16:07:12 <jlibosva> but that's once we get the release in PyPI
16:07:17 <jlibosva> yeah, thanks for reminder :)
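For reference, the follow-up being discussed would be a change in the openstack/requirements repo once the release lands on PyPI. A rough sketch of what that bump looks like (the version number is hypothetical, standing in for whatever the new ovsdbapp release is tagged as):

```
# upper-constraints.txt in openstack/requirements (version hypothetical)
ovsdbapp===X.Y.Z

# and, only if neutron must guarantee the fix is present, the lower bound
# in neutron's requirements.txt:
ovsdbapp>=X.Y.Z
```

The upper-constraints entry is what CI jobs actually install, which is why the gate fix does not take effect until both the release and the constraints bump are in.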
16:07:24 <jlibosva> next one was
16:07:27 <jlibosva> "jlibosva to continue digging why functional tests loop in ovsdb retry and don't timeout"
16:07:41 <jlibosva> so I didn't, but I trust the ovsdbapp fix will address those
16:07:53 <jlibosva> next was
16:07:56 <jlibosva> "mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario"
16:08:15 <mlavalle> haleyb and I discussed about it
16:08:31 <mlavalle> he said that he now has time to work on it
16:08:57 <haleyb> yes, i just need to configure an environment, in progress
16:09:27 <jlibosva> thanks, should we flip the AI to re-visit next week?
16:10:02 <mlavalle> sure
16:10:19 <jlibosva> #action mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario
16:10:28 <jlibosva> next was
16:10:30 <jlibosva> "mlavalle to check with infra what's the plan for periodics / maybe migrate jobs"
16:10:50 <jlibosva> is it https://review.openstack.org/#/c/539006/ ?
16:10:53 <mlavalle> the answer is that yes, we also need to move the periodic jobs to our repo
16:11:12 <mlavalle> and yes, that is the patch ^^^^
16:11:22 <jlibosva> #link https://review.openstack.org/#/c/539006/
16:11:24 <jlibosva> thanks :)
16:11:33 <mlavalle> so as soon as I get it to pass zuul
16:11:40 <jlibosva> and the last but not least
16:11:43 <jlibosva> "slaweq to check why fullstack job is busted / l3agent fails to start"
16:11:47 <slaweq> https://review.openstack.org/#/c/537863/
16:11:55 <slaweq> patch is merged for this one
16:12:00 <jlibosva> nice :)
16:12:03 <jlibosva> good job
16:12:08 <slaweq> fullstack is not on 100% failure rate now at least
16:12:13 <slaweq> thx jlibosva
16:12:38 <jlibosva> let's check it then :)
16:12:41 <jlibosva> #topic Grafana
16:12:47 <jlibosva> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:36 <jlibosva> I see the functional failure rate is low, even without the new ovsdbapp
16:14:58 <jlibosva> and I see the tempest-neutron-dvr job in the gate Q jumped to 100% recently
16:15:44 <mlavalle> what was it last week?
16:15:44 <jlibosva> rather, I see a general trend of failure rates going up in all jobs. Is there any known infra issue going on?
16:16:03 <jlibosva> mlavalle: it was 0
16:16:09 <mlavalle> oh
16:16:16 <jlibosva> it jumped up today
16:16:18 <haleyb> there was an issue with images yesterday, i still see jobs failing for no reason
16:16:56 <jlibosva> haleyb: do you have an example so we can have a look?
16:17:50 <haleyb> jlibosva: not for gate, but i did see TIMED_OUT in a couple
16:17:56 <haleyb> in the check queue
16:17:57 <jlibosva> aha
16:18:36 <haleyb> wow, now this is really bad, https://review.openstack.org/#/c/537654/
16:18:48 <haleyb> that's not because of the change
16:19:09 <haleyb> tempest/grenade/rally all fail
16:19:58 <mlavalle> well, look at this one: https://review.openstack.org/#/c/520249/
16:20:27 <jlibosva> it looks like some package deps are broken: http://logs.openstack.org/54/537654/2/check/neutron-rally-neutron/afaa71d/logs/devstacklog.txt.gz#_2018-01-30_14_48_31_326
16:20:47 <haleyb> we are just the helpless victim
16:21:17 <jlibosva> so it looks like the jobs are really all busted
16:21:19 <jlibosva> or devstack is
16:21:32 <jlibosva> same reason: http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/3d6610c/logs/devstacklog.txt.gz#_2018-01-30_11_25_41_189
16:22:10 <mlavalle> clarkb: is the infra team aware of this^^^^?
16:22:58 <jlibosva> I don't see any email on the dev ML
16:23:41 <clarkb> yes there was an irc notice at 13:42 ish utc
16:23:57 <clarkb> different channels get the notice at different times due to rate limiting
16:24:01 <mlavalle> clarkb: thanks
16:24:16 <clarkb> that also gets logged to the infra status wiki page and some twitter account too iirc
16:24:34 <mlavalle> so assume it is being worked on
16:25:04 <jlibosva> clarkb: is this the wiki you're talking about? https://wiki.openstack.org/wiki/Infrastructure_Status
16:25:06 <clarkb> yes, I think it is largely corrected at this point but fungi and pabelanger will have to confirm
16:25:30 <clarkb> jlibosva: yes
16:25:51 <jlibosva> clarkb: thanks :) cool, I didn't know such a thing existed :)
16:26:00 * jlibosva bookmarks
16:26:44 <fungi> yeah, i've now shifted focus to investigating packet loss for the mirror in rax-dfw (looks like we started getting rate limited there late last week)
16:26:48 <mlavalle> clarkb: thanks for the update. much appreciated :-)
16:27:32 <jlibosva> let's talk about specific jobs then
16:27:51 <jlibosva> last week we dedicated some time to fullstack so let's do scenario this week
16:27:54 <jlibosva> #topic Scenarios
16:28:43 * jlibosva searches some failure example
16:29:21 <jlibosva> e.g. http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/b2cf720/logs/testr_results.html.gz
16:29:52 <jlibosva> hmm
16:29:55 <jlibosva> SSHException: Error reading SSH protocol banner
16:30:34 <jlibosva> the test attempted to connect to an instance, it connected to SSH, authenticated successfully but then failed with the above error
16:32:05 <Roamer`> jlibosva, I don't think the error means that it authenticated successfully; I think it means that it connected to the TCP port and then the SSH server did not even say "hi, I'm an SSH server with this version" - I think it means that it received nothing at all from the SSH server
16:32:47 <jlibosva> Roamer`: hmm, but the logs say "Authentication (publickey) successful!"
16:34:03 <Roamer`> hm, or maybe in this case "banner" means something else; for me it has always meant the initial server greeting, the very first thing sent in the connection, but ICBW for this particular SSH implementation, sorry
16:34:30 <slaweq> jlibosva: where do you see it? I see there something like "2018-01-28 00:48:53,571 10039 WARNING  [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to ubuntu@172.24.5.10 (timed out). Number attempts: 15. Retry after 16 seconds."
16:34:44 <slaweq> so connection wasn't established in this case, right?
16:35:02 <haleyb> jlibosva: some of those tests have "Unable to connect to port 22" too
16:35:44 <jlibosva> slaweq: I looked at the test_snat_external_ip
16:37:13 <slaweq> I was looking at test_two_vms_fips
16:37:32 <haleyb> jlibosva: one job had a console too, and the instance failed to bring up networking :-/
16:37:46 <Roamer`> jlibosva, looking at the paramiko source, it still seems to me that this particular error message will only appear at the very start of the connection
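Roamer`'s reading can be illustrated with a minimal stdlib sketch (local port and timeouts are hypothetical, chosen just for the demo): the SSH version banner is the very first thing a server sends after the TCP handshake, and a server that accepts the connection but stays silent is exactly the situation paramiko reports as "Error reading SSH protocol banner".

```python
import socket
import threading


def silent_ssh_server():
    """Accept a TCP connection but never send the SSH version banner."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]

    def run():
        conn, _ = srv.accept()
        # a real sshd would immediately send b"SSH-2.0-..." here;
        # staying silent is what triggers the banner error on the client
        threading.Event().wait(1.0)
        conn.close()
        srv.close()

    threading.Thread(target=run, daemon=True).start()
    return port


def read_banner(port, timeout=0.3):
    """Return the server's SSH banner line, or None if nothing arrived."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as c:
        c.settimeout(timeout)
        try:
            data = c.recv(255)
        except socket.timeout:
            return None
    return data.decode() if data else None
```

Against a real sshd `read_banner` would return a line like "SSH-2.0-OpenSSH_...", while against the silent server it returns None, which paramiko surfaces as the banner SSHException.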
16:37:46 <jlibosva> haleyb: which one?
16:38:02 <haleyb> test_connectivity_min_max_mtu
16:38:07 <slaweq> jlibosva: so in the test which you checked there is something like
16:38:16 <slaweq> ssh connection to ubuntu@172.24.5.19 successfully created
16:38:28 <slaweq> and then error: "Failed to establish authenticated ssh connection to ubuntu@10.1.0.6 (Error reading SSH protocol banner). Number attempts: 1. Retry after 2 seconds."
16:38:34 <slaweq> so it's 2 different IPs
16:38:53 <slaweq> I guess 172.24.5.19 is FIP and 10.1.0.6 is fixed one, right?
16:39:54 <haleyb> yes, that's a typical range with devstack
16:40:50 <jlibosva> we should be able to see that in the API replies
16:41:20 <jlibosva> but it looks like 10.1.0.7 is the fixed one, right?
16:42:52 <jlibosva> so the test creates two VMs, one with a FIP and one without
16:43:05 <jlibosva> on the same network
16:45:12 <slaweq> jlibosva: are you still talking about test_snat_external_ip?
16:45:18 <jlibosva> slaweq: yep
16:45:28 <jlibosva> anyways, perhaps we should dig into it offline
16:45:32 <slaweq> IMO it creates one vm
16:45:32 <slaweq> https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_floatingip.py#L179
16:45:46 <slaweq> sorry, 2
16:45:53 <slaweq> proxy and src_server
16:45:57 <jlibosva> right
16:46:15 <jlibosva> and I think it tries to connect to the src_server via proxy, while proxy has a FIP
16:46:26 <jlibosva> and then reach the external gateway from the src_server via the proxy
16:47:02 <slaweq> and it can connect to proxy via FIP but fails to connect via fixed IP to src_server
16:47:14 <slaweq> right?
16:47:19 <jlibosva> it seems like that's the case
16:47:39 <jlibosva> so perhaps the successful SSH connection is to the proxy and the SSH failure is from the src_server
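The topology being pieced together here is a two-hop SSH: the test reaches src_server's fixed IP by jumping through the proxy VM's floating IP. A rough sketch of the equivalent OpenSSH invocation (the IPs come from the logs above; the command shape is illustrative, since tempest drives this through paramiko rather than the ssh CLI):

```python
def proxy_jump_ssh(proxy_fip, target_fixed_ip, user="ubuntu"):
    """Build an ssh command reaching a fixed-IP guest via a FIP'd proxy VM."""
    # ProxyJump makes the first hop to the proxy's floating IP, then
    # tunnels the second SSH connection to the target's fixed IP; the
    # banner error in the logs would belong to that second, inner hop
    return ["ssh", "-o", f"ProxyJump={user}@{proxy_fip}",
            f"{user}@{target_fixed_ip}"]


cmd = proxy_jump_ssh("172.24.5.19", "10.1.0.6")
```

This matches the log pattern discussed above: the successful connection message names the FIP (first hop) and the banner failure names the fixed IP (second hop).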
16:48:47 <jlibosva> I'll report a bug on the failure
16:49:18 <jlibosva> #action jlibosva to report bug about scenario failure for test_snat_external_ip
16:49:30 <jlibosva> does anybody want to have a look at the other failures?
16:50:38 <jlibosva> mlavalle: haleyb any chance the failures are related to the fip failures you are working on?
16:50:50 <jlibosva> or those were specific jobs?
16:51:28 <mlavalle> I wouldn't rule out the possibility
16:52:04 <jlibosva> mlavalle: is there anything specific in the logs we should check for?
16:53:23 <mlavalle> I can take a look at one of the other test cases
16:53:30 <jlibosva> thanks :)
16:54:41 <jlibosva> haleyb: the issue with interfaces not coming up was in test_connectivity_min_max_mtu ?
16:54:53 <jlibosva> oh, I see more of those ...
16:55:14 <haleyb> jlibosva: that's what it looked like in the second instance, never got to see interface info
16:56:50 <haleyb> cloud-init never ran since it comes right after that
16:57:42 <jlibosva> I see some ovs firewall related errors in q-agt logs
16:58:26 <jlibosva> maybe we could schedule https://bugs.launchpad.net/neutron/+bug/1740885 for the rc?
16:58:27 <openstack> Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:59:10 <jlibosva> mlavalle: haleyb ^^ any thoughts?
16:59:20 <mlavalle> jlibosva: do you think you will get it in?
16:59:34 <slaweq> I can check it today if You want
16:59:43 <jlibosva> mlavalle: the patch had +2 already but was removed due to missing scheduled milestone
16:59:52 <haleyb> jlibosva: i will look
16:59:56 <jlibosva> thanks
17:00:01 <jlibosva> aaaaaand we're out of time
17:00:06 <jlibosva> thanks everyone for showing up
17:00:09 <jlibosva> #endmeeting