16:00:31 #startmeeting neutron_ci
16:00:32 Meeting started Tue Jan 30 16:00:31 2018 UTC and is due to finish in 60 minutes. The chair is jlibosva. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:33 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:36 The meeting name has been set to 'neutron_ci'
16:00:40 o/
16:00:45 \o
16:00:46 hi
16:00:57 hi all, Ihar texted me he won't be able to make it to this meeting on time, so I'll try to do my best :)
16:01:06 #topic Actions from prev meeting
16:01:15 "mlavalle to propose a patch removing last tempest bits from neutron tree"
16:01:36 I did !
16:01:44 #link https://review.openstack.org/#/c/536931/
16:02:07 nice
16:02:11 the tough part here is that there is a long list of dependencies in terms of patches
16:02:39 is there anything we could do to speed up the dynamic-routing repo changes?
16:02:40 that end up here https://review.openstack.org/#/c/528990/
16:02:58 https://review.openstack.org/#/c/528992/
16:03:11 frickler: hi ^^ we need your advice here :)
16:03:14 if we get these two patches to move, we will get the entire chain in
16:03:14 if you're around
16:03:49 I guess he's not :)
16:03:59 I'll continue bugging him
16:04:01 :-)
16:04:11 thanks :)
16:04:13 next one was
16:04:17 "otherwiseguy to continue digging ovsdb retry / stream parsing issue and try things"
16:04:27 I think the "try things" part was done :)
16:04:42 LOL
16:04:43 https://review.openstack.org/#/c/537241/
16:04:55 and we have a fix!! Great job otherwiseguy
16:05:14 ++
16:05:20 do we need a new release now for the change to take effect in neutron gates?
16:05:57 we probably do
16:06:11 ok, I can propose one
16:06:30 #action jlibosva to request a new release for ovsdbapp
16:06:40 but will we have to bump the version in requirements also?
16:06:44 or is that not necessary?
16:06:55 perhaps upper-constraints
16:07:11 so please remember that :)
16:07:12 but that's once we get the release on PyPI
16:07:17 yeah, thanks for the reminder :)
16:07:24 next one was
16:07:27 "jlibosva to continue digging why functional tests loop in ovsdb retry and don't timeout"
16:07:41 so I didn't, but I trust the new ovsdbapp will fix those
16:07:53 next was
16:07:56 "mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario"
16:08:15 haleyb and I discussed it
16:08:31 he said that he now has time to work on it
16:08:57 yes, i just need to configure an environment, in progress
16:09:27 thanks, should we flip the AI to re-visit next week?
16:10:02 sure
16:10:19 #action mlavalle and haleyb to follow up on how we can move forward floating ip failures in dvr scenario
16:10:28 next was
16:10:30 "mlavalle to check with infra what's the plan for periodics / maybe migrate jobs"
16:10:50 is it https://review.openstack.org/#/c/539006/ ?
16:10:53 the answer is that yes, we also need to move the periodic jobs to our repo
16:11:12 and yes, that is the patch ^^^^
16:11:22 #link https://review.openstack.org/#/c/539006/
16:11:24 thanks :)
16:11:33 so as soon as I get it to pass zuul
16:11:40 and last but not least
16:11:43 "slaweq to check why fullstack job is busted / l3agent fails to start"
16:11:47 https://review.openstack.org/#/c/537863/
16:11:55 patch is merged for this one
16:12:00 nice :)
16:12:03 good job
16:12:08 fullstack is not at a 100% failure rate now, at least
16:12:13 thx jlibosva
16:12:38 let's check it then :)
16:12:41 #topic Grafana
16:12:47 http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:13:36 I see the functional failure rate is low, even without the new ovsdbapp
16:14:58 and I see the tempest-neutron-dvr job in the gate Q jumped to 100% recently
16:15:44 what was it last week?
16:15:44 I also see a general trend of failure rates going up in all jobs. Is there any known infra issue going on?
16:16:03 mlavalle: it was 0
16:16:09 oh
16:16:16 it jumped up today
16:16:18 there was an issue with images yesterday, i still see jobs failing for no reason
16:16:56 haleyb: do you have an example so we can have a look?
16:17:50 jlibosva: not for gate, but i did see TIMED_OUT in a couple
16:17:56 in the check queue
16:17:57 aha
16:18:36 wow, now this is really bad, https://review.openstack.org/#/c/537654/
16:18:48 that's not because of the change
16:19:09 tempest/grenade/rally all fail
16:19:58 well, look at this one: https://review.openstack.org/#/c/520249/
16:20:27 it looks like some package deps are broken: http://logs.openstack.org/54/537654/2/check/neutron-rally-neutron/afaa71d/logs/devstacklog.txt.gz#_2018-01-30_14_48_31_326
16:20:47 we are just the helpless victim
16:21:17 so it looks like the jobs are really all busted
16:21:19 or devstack is
16:21:32 same reason: http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/3d6610c/logs/devstacklog.txt.gz#_2018-01-30_11_25_41_189
16:22:10 clarkb: is the infra team aware of this^^^^?
16:22:58 I don't see any email on the dev ML
16:23:41 yes, there was an irc notice at 13:42-ish utc
16:23:57 different channels get the notice at different times due to rate limiting
16:24:01 clarkb: thanks
16:24:16 that also gets logged to the infra status wiki page and some twitter account too iirc
16:24:34 so assume it is being worked on
16:25:04 clarkb: is this the wiki you're talking about? https://wiki.openstack.org/wiki/Infrastructure_Status
16:25:06 yes, I think it is largely corrected at this point but fungi and pabelanger will have to confirm
16:25:30 jlibosva: yes
16:25:43 * fungi ^
16:25:51 clarkb: thanks :) cool, I didn't know such a thing exists :)
16:26:00 * jlibosva bookmarks
16:26:44 yeah, i've now shifted focus to investigating packet loss for the mirror in rax-dfw (looks like we started getting rate limited there late last week)
16:26:48 clarkb: thanks for the update. much appreciated :-)
16:27:32 let's talk about specific jobs then
16:27:51 last week we dedicated some time to fullstack, so let's do scenarios this week
16:27:54 #topic Scenarios
16:28:43 * jlibosva searches for a failure example
16:29:21 e.g. http://logs.openstack.org/49/520249/4/check/neutron-tempest-plugin-dvr-multinode-scenario/b2cf720/logs/testr_results.html.gz
16:29:52 hmm
16:29:55 SSHException: Error reading SSH protocol banner
16:30:34 the test attempted to connect to an instance, it connected to SSH, authenticated successfully, but then failed with the above error
16:32:05 jlibosva, I don't think the error means that it authenticated successfully; I think it means that it connected to the TCP port and then the SSH server did not even say "hi, I'm an SSH server with this version" - I think it means that it received nothing at all from the SSH server
16:32:47 Roamer`: hmm, but the logs say "Authentication (publickey) successful!"
16:34:03 hm, or maybe in this case "banner" means something else; for me it has always meant the initial server greeting, the very first thing sent in the connection, but ICBW for this particular SSH implementation, sorry
16:34:30 jlibosva: where do you see it? I see there something like "2018-01-28 00:48:53,571 10039 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to ubuntu@172.24.5.10 (timed out). Number attempts: 15. Retry after 16 seconds."
16:34:44 so the connection wasn't established in this case, right?
16:35:02 jlibosva: some of those tests have "Unable to connect to port 22" too
16:35:44 slaweq: I looked at test_snat_external_ip
16:37:13 I was looking at test_two_vms_fips
16:37:32 jlibosva: one job had a console too, and the instance failed to bring up networking :-/
16:37:46 jlibosva, looking at the paramiko source, it still seems to me that this particular error message will only appear at the very start of the connection
16:37:46 haleyb: which one?
16:38:02 test_connectivity_min_max_mtu
16:38:07 jlibosva: so in the test which you checked there is something like
16:38:16 ssh connection to ubuntu@172.24.5.19 successfully created
16:38:28 and then the error: "Failed to establish authenticated ssh connection to ubuntu@10.1.0.6 (Error reading SSH protocol banner). Number attempts: 1. Retry after 2 seconds."
16:38:34 so it's 2 different IPs
16:38:53 I guess 172.24.5.19 is the FIP and 10.1.0.6 is the fixed one, right?
16:39:54 yes, that's a typical range with devstack
16:40:50 we should be able to see that in the API replies
16:41:20 but it looks like 10.1.0.7 is the fixed one, right?
16:42:52 so the test creates two VMs, one with a FIP and one without
16:43:05 on the same network
16:45:12 jlibosva: are you talking about test_snat_external_ip still?
16:45:18 slaweq: yep
16:45:28 anyway, perhaps we should dig into it offline
16:45:32 IMO it creates one vm
16:45:32 https://github.com/openstack/neutron-tempest-plugin/blob/master/neutron_tempest_plugin/scenario/test_floatingip.py#L179
16:45:46 sorry, 2
16:45:53 proxy and src_server
16:45:57 right
16:46:15 and I think it tries to connect to the src_server via the proxy, while the proxy has a FIP
16:46:26 and then reach the external gateway from the src_server via the proxy
16:47:02 and it can connect to the proxy via the FIP but fails to connect via the fixed IP to src_server
16:47:14 right?
16:47:19 it seems like that's the case
16:47:39 so perhaps the successful SSH connection is to the proxy and the SSH failure is from the src_server
16:48:47 I'll report a bug on the failure
16:49:18 #action jlibosva to report bug about scenario failure for test_snat_external_ip
16:49:30 does anybody want to have a look at the other failures?
16:50:38 mlavalle: haleyb any chance the failures are related to the fip failures you are working on?
16:50:50 or were those specific jobs?
16:51:28 I wouldn't rule out the possibility
16:52:04 mlavalle: is there anything specific in the logs we should check for?
16:53:23 I can take a look at one of the other test cases
16:53:30 thanks :)
16:54:41 haleyb: the issue with interfaces not coming up was in test_connectivity_min_max_mtu?
16:54:53 oh, I see more of those ...
16:55:14 jlibosva: that's what it looked like in the second instance, never got to see interface info
16:56:50 cloud-init never got run since it's right after
16:57:42 I see some ovs firewall related errors in the q-agt logs
16:58:26 maybe we could schedule https://bugs.launchpad.net/neutron/+bug/1740885 for the rc?
16:58:27 Launchpad bug 1740885 in neutron "Security group updates fail when port hasn't been initialized yet" [High,In progress] - Assigned to Jakub Libosvar (libosvar)
16:59:10 mlavalle: haleyb ^^ any thoughts?
16:59:20 jlibosva: do you think you will get it in?
16:59:34 I can check it today if you want
16:59:43 mlavalle: the patch already had a +2 but it was removed due to the missing scheduled milestone
16:59:52 jlibosva: i will look
16:59:56 thanks
17:00:01 aaaaaand we're out of time
17:00:06 thanks everyone for showing up
17:00:09 #endmeeting
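
For reference on the "Error reading SSH protocol banner" discussion above: as Roamer` says, paramiko raises that error before any authentication happens, when the server accepts the TCP connection but never sends its "SSH-2.0-..." greeting. Below is a minimal sketch (not from the meeting, and not tempest code) that reproduces the message against a stand-in server which stays silent after accepting the connection:

    import socket
    import threading

    import paramiko  # assumed installed; it is what tempest's ssh helper uses


    def silent_server(listener):
        # Accept the TCP connection but never send an SSH banner -- the
        # behavior Roamer` suspects the failing instances exhibit.
        conn, _ = listener.accept()
        threading.Event().wait(30)  # hold the connection open, say nothing
        conn.close()


    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    threading.Thread(target=silent_server, args=(listener,), daemon=True).start()

    sock = socket.create_connection(listener.getsockname())
    transport = paramiko.Transport(sock)
    transport.banner_timeout = 5  # default is 15 s; shorten for the demo
    try:
        transport.start_client()  # reads the server banner before any auth
    except paramiko.SSHException as exc:
        print(exc)  # -> Error reading SSH protocol banner

Since the error fires before the banner exchange, it cannot come from a connection that already logged "Authentication (publickey) successful!" - which is consistent with the conclusion reached above that the successful SSH connection was to the proxy VM (via its FIP) and the banner failure came from the hop to src_server's fixed IP.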