16:00:31 <jlibosva> #startmeeting neutron_ci
16:00:32 <openstack> Meeting started Tue Jan 30 16:00:31 2018 UTC and is due to finish in 60 minutes. The chair is jlibosva. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:33 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:36 <openstack> The meeting name has been set to 'neutron_ci'
16:00:40 <mlavalle> o/
16:00:45 <dalvarez> \o
16:00:46 <slaweq> hi
16:00:57 <jlibosva> hi all, Ihar texted me he won't be able to make it to this meeting on time, so I'll try to do my best :)
16:01:06 <jlibosva> #topic Actions from prev meeting
16:01:15 <jlibosva> "mlavalle to propose a patch removing last tempest bits from neutron tree"
16:01:36 <mlavalle> I did!
16:01:44 <mlavalle> #link https://review.openstack.org/#/c/536931/
16:02:07 <jlibosva> nice
16:02:11 <mlavalle> the tough part here is that there is a long list of dependencies in terms of patches
16:02:39 <jlibosva> is there anything we could do to speed up the dynamic-routing repo changes?
16:02:40 <mlavalle> that end up here https://review.openstack.org/#/c/528990/
16:02:58 <mlavalle> https://review.openstack.org/#/c/528992/
16:03:11 <jlibosva> frickler: hi ^^ we need your advice here :)
16:03:14 <mlavalle> if we get these two patches to move, we will get the entire chain in
16:03:14 <jlibosva> if you're around
16:03:49 <jlibosva> I guess he's not :)
16:03:59 <mlavalle> I'll continue bugging him :-)
16:04:11 <jlibosva> thanks :)
16:04:13 <jlibosva> next one was
16:04:17 <jlibosva> "otherwiseguy to continue digging ovsdb retry / stream parsing issue and try things"
16:04:27 <jlibosva> I think the "try things" part was done :)
16:04:42 <mlavalle> LOL
16:04:43 <jlibosva> https://review.openstack.org/#/c/537241/
16:04:55 <jlibosva> and we have a fix!! Great job otherwiseguy
16:05:14 <mlavalle> ++
16:05:20 <jlibosva> do we need a new release now for the change to take effect in neutron gates?
16:05:57 <mlavalle> we probably do
16:06:11 <jlibosva> ok, I can propose one
16:06:30 <jlibosva> #action jlibosva to request a new release for ovsdbapp
16:06:40 <slaweq> but will we have to bump the version in requirements also?
16:06:44 <slaweq> or is it not necessary?
16:06:55 <jlibosva> perhaps upper-constraints
16:07:11 <slaweq> so please remember about that :)
16:07:12 <jlibosva> but that's once we get the release in PyPI
16:07:17 <jlibosva> yeah, thanks for the reminder :)
16:07:24 <jlibosva> next one was
16:07:27 <jlibosva> "jlibosva to continue digging why functional tests loop in ovsdb retry and don't timeout"
16:07:41 <jlibosva> so I didn't, but I trust that ovsdbapp will fix those
16:07:53 <jlibosva> next was
16:07:56 <jlibosva> "mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario"
16:08:15 <mlavalle> haleyb and I discussed it
16:08:31 <mlavalle> he said that he now has time to work on it
16:08:57 <haleyb> yes, i just need to configure an environment, in progress
16:09:27 <jlibosva> thanks, should we flip the AI to re-visit next week?
16:10:02 <mlavalle> sure
16:10:19 <jlibosva> #action mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario
16:10:28 <jlibosva> next was
16:10:30 <jlibosva> "mlavalle to check with infra what's the plan for periodics / maybe migrate jobs"
16:10:50 <jlibosva> is it https://review.openstack.org/#/c/539006/ ?
16:10:53 <mlavalle> the answer is that yes, we also need to move the periodic jobs to our repo
16:11:12 <mlavalle> and yes, that is the patch ^^^^
16:11:22 <jlibosva> #link https://review.openstack.org/#/c/539006/
16:11:24 <jlibosva> thanks :)
16:11:33 <mlavalle> so as soon as I get it to pass zuul
16:11:40 <jlibosva> and last but not least
16:11:43 <jlibosva> "slaweq to check why fullstack job is busted / l3agent fails to start"
16:11:47 <slaweq> https://review.openstack.org/#/c/537863/
16:11:55 <slaweq> the patch for this one is merged
16:12:00 <jlibosva> nice :)
16:12:03 <jlibosva> good job
16:12:08 <slaweq> at least fullstack is not at a 100% failure rate now
16:12:13 <slaweq> thx jlibosva
16:12:38 <jlibosva> let's check it then :)
16:12:41 <jlibosva> #topic Grafana
16:12:47 <jlibosva> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:36 <jlibosva> I see the functional failure rate is low, even without the new ovsdbapp
16:14:58 <jlibosva> and I see the tempest-neutron-dvr job in the gate Q jumped to 100% recently
16:15:44 <mlavalle> what was it last week?
16:15:44 <jlibosva> I also see a general trend of failure rates going up in all jobs. Is there any known infra issue going on?
16:16:03 <jlibosva> mlavalle: it was 0
16:16:09 <mlavalle> oh
16:16:16 <jlibosva> it jumped up today
16:16:18 <haleyb> there was an issue with images yesterday, and i still see jobs failing for no reason
16:16:56 <jlibosva> haleyb: do you have an example so we can have a look?
16:17:50 <haleyb> jlibosva: not for gate, but i did see TIMED_OUT in a couple
16:17:56 <haleyb> in the check queue
16:17:57 <jlibosva> aha
16:18:36 <haleyb> wow, now this is really bad, https://review.openstack.org/#/c/537654/
16:18:48 <haleyb> that's not because of the change
16:19:09 <haleyb> tempest/grenade/rally all fail
16:19:58 <mlavalle> well, look at this one: https://review.openstack.org/#/c/520249/
16:20:27 <jlibosva> it looks like some package deps are broken: http://logs.openstack.org/54/537654/2/check/neutron-rally-neutron/afaa71d/logs/devstacklog.txt.gz#_2018-01-30_14_48_31_326
16:20:47 <haleyb> we are just the helpless victim
16:21:17 <jlibosva> so it looks like the jobs are really all busted
16:21:19 <jlibosva> or devstack is
16:21:32 <jlibosva> same reason: http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/3d6610c/logs/devstacklog.txt.gz#_2018-01-30_11_25_41_189
16:22:10 <mlavalle> clarkb: is the infra team aware of this^^^^?
16:22:58 <jlibosva> I don't see any email on the dev ML
16:23:41 <clarkb> yes, there was an irc notice at 13:42 ish utc
16:23:57 <clarkb> different channels get the notice at different times due to rate limiting
16:24:01 <mlavalle> clarkb: thanks
16:24:16 <clarkb> that also gets logged to the infra status wiki page and some twitter account too iirc
16:24:34 <mlavalle> so I assume it is being worked on
16:25:04 <jlibosva> clarkb: is this the wiki you're talking about? https://wiki.openstack.org/wiki/Infrastructure_Status
16:25:06 <clarkb> yes, I think it is largely corrected at this point but fungi and pabelanger will have to confirm
16:25:30 <clarkb> jlibosva: yes
16:25:51 <jlibosva> clarkb: thanks :) cool, I didn't know such a thing existed :)
16:26:00 * jlibosva bookmarks
16:26:44 <fungi> yeah, i've now shifted focus to investigating packet loss for the mirror in rax-dfw (looks like we started getting rate limited there late last week)
16:26:48 <mlavalle> clarkb: thanks for the update. much appreciated :-)
16:27:32 <jlibosva> let's talk about specific jobs then
16:27:51 <jlibosva> last week we dedicated some time to fullstack, so let's do scenarios this week
16:27:54 <jlibosva> #topic Scenarios
16:28:43 * jlibosva searches for a failure example
16:29:21 <jlibosva> e.g. http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/b2cf720/logs/testr_results.html.gz
16:29:52 <jlibosva> hmm
16:29:55 <jlibosva> SSHException: Error reading SSH protocol banner
16:30:34 <jlibosva> the test attempted to connect to an instance, it connected to SSH, authenticated successfully, but then failed with the above error
16:32:05 <Roamer`> jlibosva, I don't think the error means that it authenticated successfully; I think it means that it connected to the TCP port and then the SSH server did not even say "hi, I'm an SSH server with this version" - I think it means that it received nothing at all from the SSH server
16:32:47 <jlibosva> Roamer`: hmm, but the logs say "Authentication (publickey) successful!"
16:34:03 <Roamer`> hm, or maybe in this case "banner" means something else; for me it has always meant the initial server greeting, the very first thing sent in the connection, but ICBW for this particular SSH implementation, sorry
16:34:30 <slaweq> jlibosva: where do You see it? I see there something like "2018-01-28 00:48:53,571 10039 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to ubuntu@172.24.5.10 (timed out). Number attempts: 15. Retry after 16 seconds."
16:34:44 <slaweq> so the connection wasn't established in this case, right?
16:35:02 <haleyb> jlibosva: some of those tests have "Unable to connect to port 22" too
16:35:44 <jlibosva> slaweq: I looked at test_snat_external_ip
16:37:13 <slaweq> I was looking at test_two_vms_fips
16:37:32 <haleyb> jlibosva: one job had a console too, and the instance failed to bring up networking :-/
16:37:46 <Roamer`> jlibosva, looking at the paramiko source, it still seems to me that this particular error message will only appear at the very start of the connection
16:37:46 <jlibosva> haleyb: which one?
16:38:02 <haleyb> test_connectivity_min_max_mtu
16:38:07 <slaweq> jlibosva: so in the test which You checked there is something like
16:38:16 <slaweq> ssh connection to ubuntu@172.24.5.19 successfully created
16:38:28 <slaweq> and then the error: "Failed to establish authenticated ssh connection to ubuntu@10.1.0.6 (Error reading SSH protocol banner). Number attempts: 1. Retry after 2 seconds."
16:38:34 <slaweq> so it's 2 different IPs
16:38:53 <slaweq> I guess 172.24.5.19 is the FIP and 10.1.0.6 is the fixed one, right?
16:39:54 <haleyb> yes, that's a typical range with devstack
16:40:50 <jlibosva> we should be able to see that in the API replies
16:41:20 <jlibosva> but it looks like 10.1.0.7 is the fixed one, right?
16:42:52 <jlibosva> so the test creates two VMs, one with a FIP and one without
16:43:05 <jlibosva> on the same network
16:45:12 <slaweq> jlibosva: are You still talking about test_snat_external_ip?
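Roamer`'s reading matches what the log lines above show: the TCP connection opens, the server's version banner never arrives, and the client retries with a backoff ("Number attempts: 15. Retry after 16 seconds."). A minimal sketch of that retry loop, using illustrative names and numbers rather than tempest's actual client API:

```python
import socket
import time


class SSHBannerError(Exception):
    """Stand-in for the paramiko SSHException raised when the server
    accepts the TCP connection but never sends its version banner
    ("Error reading SSH protocol banner")."""


def connect_with_retries(connect, attempts=15, backoff=2):
    """Retry a flaky connect() callable with a fixed backoff, roughly
    the shape of the loop in tempest.lib.common.ssh; the real client
    wraps paramiko's SSHClient.connect() here."""
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except (SSHBannerError, socket.timeout):
            if attempt == attempts:
                raise  # give up after the last attempt, as in the log
            time.sleep(backoff)


# Simulate a server whose banner only arrives on the third attempt.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SSHBannerError("Error reading SSH protocol banner")
    return "SSH-2.0-OpenSSH_7.6"

banner = connect_with_retries(flaky_connect, attempts=5, backoff=0)
print(banner)  # succeeds on the third attempt
```

This also illustrates why the paired log lines above are not contradictory: a successful, authenticated connection to one address (the FIP) can coexist with a banner failure against another (the fixed IP), because each target gets its own retry loop.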
16:45:18 <jlibosva> slaweq: yep
16:45:28 <jlibosva> anyway, perhaps we should dig into it offline
16:45:32 <slaweq> IMO it creates one vm
16:45:32 <slaweq> https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_floatingip.py#L179
16:45:46 <slaweq> sorry, 2
16:45:53 <slaweq> proxy and src_server
16:45:57 <jlibosva> right
16:46:15 <jlibosva> and I think it tries to connect to src_server via the proxy, while the proxy has a FIP
16:46:26 <jlibosva> and then reach the external gateway from src_server via the proxy
16:47:02 <slaweq> and it can connect to the proxy via the FIP but fails to connect to src_server via the fixed IP
16:47:14 <slaweq> right?
16:47:19 <jlibosva> it seems like that's the case
16:47:39 <jlibosva> so perhaps the successful SSH connection is to the proxy and the SSH failure is from src_server
16:48:47 <jlibosva> I'll report a bug on the failure
16:49:18 <jlibosva> #action jlibosva to report bug about scenario failure for test_snat_external_ip
16:49:30 <jlibosva> does anybody want to have a look at the other failures?
16:50:38 <jlibosva> mlavalle: haleyb any chance the failures are related to the fip failures you are working on?
16:50:50 <jlibosva> or were those specific jobs?
16:51:28 <mlavalle> I wouldn't rule out the possibility
16:52:04 <jlibosva> mlavalle: is there anything specific in the logs we should check for?
16:53:23 <mlavalle> I can take a look at one of the other test cases
16:53:30 <jlibosva> thanks :)
16:54:41 <jlibosva> haleyb: was the issue with interfaces not coming up in test_connectivity_min_max_mtu?
16:54:53 <jlibosva> oh, I see more of those ...
16:55:14 <haleyb> jlibosva: that's what it looked like in the second instance, we never got to see the interface info
16:56:50 <haleyb> cloud-init never got run, since it's right after
16:57:42 <jlibosva> I see some ovs firewall related errors in the q-agt logs
16:58:26 <jlibosva> maybe we could schedule https://bugs.launchpad.net/neutron/+bug/1740885 for the rc?
16:58:27 <openstack> Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:59:10 <jlibosva> mlavalle: haleyb ^^ any thoughts?
16:59:20 <mlavalle> jlibosva: do you think you will get it in?
16:59:34 <slaweq> I can check it today if You want
16:59:43 <jlibosva> mlavalle: the patch already had a +2 but it was removed due to a missing scheduled milestone
16:59:52 <haleyb> jlibosva: i will look
16:59:56 <jlibosva> thanks
17:00:01 <jlibosva> aaaaaand we're out of time
17:00:06 <jlibosva> thanks everyone for showing up
17:00:09 <jlibosva> #endmeeting
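A note on the ovsdbapp release action item from earlier in the meeting: once the new version is tagged and published to PyPI, neutron gates only pick it up after a corresponding bump in openstack/requirements, which is what slaweq's reminder about upper-constraints refers to. A hypothetical entry would look like this (the version number is purely illustrative, not the actual release):

```
# openstack/requirements: upper-constraints.txt
# hypothetical bump once the new ovsdbapp release lands on PyPI;
# the version number below is illustrative only
ovsdbapp===0.10.0
```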