16:00:27 <slaweq> #startmeeting neutron_ci
16:00:28 <openstack> Meeting started Tue Jan 14 16:00:27 2020 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:29 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:32 <openstack> The meeting name has been set to 'neutron_ci'
16:00:33 <slaweq> welcome back :)
16:00:40 <ralonsoh> hi
16:01:23 <njohnston> o/
16:02:05 <slaweq> lets start, maybe others will join in the meantime
16:02:07 <slaweq> #topic Actions from previous meetings
16:02:18 <bcafarel> hi again
16:02:20 <slaweq> sorry, first thing:
16:02:22 <slaweq> Grafana dashboard: http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:02:24 <slaweq> Please open now :)
16:02:32 <slaweq> and now we can start with first topic
16:02:37 <slaweq> njohnston to check failing NetworkMigrationFromHA in multinode dvr job
16:03:21 <njohnston> So I have not been able to find any incidences of that. I tried last week, and logstash was not showing any. I tried for about an hour this morning, but logstash became unresponsive a few times so I haven't been able to see if any happened over the weekend
16:03:34 <njohnston> I should say, any recent
16:03:38 <njohnston> recent incidences
16:04:26 <njohnston> so I will keep searching, and once I find another failure I will poke at it.
16:04:52 <slaweq> njohnston: ok, if I will have something like that I will ping You
16:04:59 <njohnston> slaweq: thanks!
16:05:39 <slaweq> njohnston: I think it's this for example https://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b63/702249/3/check/neutron-tempest-plugin-dvr-multinode-scenario/b631ff4/testr_results.html
16:06:10 <njohnston> See, I know you'd be able to find one. :-) slaweq++
16:06:18 <slaweq> :)
16:06:22 <slaweq> njohnston: thx
16:06:58 <slaweq> but I'm not sure if this is the error which You are looking for actually
16:07:07 <slaweq> here it seems it's an ssh to instance failure
16:07:24 <njohnston> that is all I have ever seen for a failure on this test
16:07:26 <slaweq> and I'm not sure if this is a "router migration specific" problem or a general issue
16:08:21 <slaweq> I know that I saw a similar ssh issue on other tests too
16:08:35 <slaweq> but maybe it's for some reason failing more often in those migration tests
16:09:08 <njohnston> yes, that is the angle I am taking on it
16:09:25 <slaweq> ok, thx njohnston for looking into this issue
16:09:39 <slaweq> if You will need any help, just let me know :)
16:09:53 <njohnston> slaweq: Will do
16:10:44 <slaweq> thx
16:10:53 <slaweq> ok, lets move on
16:11:02 <slaweq> ralonsoh to report bug for timeout related to bridge creation
16:11:25 <ralonsoh> let me find the bug
16:11:39 <ralonsoh> #link https://bugs.launchpad.net/neutron/+bug/1858661
16:11:39 <openstack> Launchpad bug 1858661 in neutron "Error when creating a Linux interface" [High,Confirmed]
16:11:47 <ralonsoh> but I didn't work on this one
16:13:28 <slaweq> ralonsoh: thx, we can at least track it if someone will have time to work on this
16:13:32 <slaweq> thx a lot
16:13:37 <ralonsoh> yw
16:14:01 <slaweq> and last action from last week
16:14:03 <slaweq> ralonsoh to take a look how to use newer MariaDB in periodic job
16:14:10 <ralonsoh> one sec
16:14:18 <ralonsoh> #link https://review.opendev.org/#/c/702416/
16:14:36 <ralonsoh> what I'm doing is, in the zuul job, adding a pre-run task
16:14:50 <ralonsoh> adding, just for ubuntu bionic, the repo with mariadb 10.4
16:15:23 <njohnston> nice
16:16:39 <slaweq> You can send a DNM patch which is on top of this one to run this job in check queue
16:16:49 <slaweq> just to see if it will actually work as expected
16:16:50 <ralonsoh> hmmm, you are right
16:17:02 <slaweq> and thx for that fix :)
16:17:11 <ralonsoh> I'll ask you about this later
16:17:21 <ralonsoh> because I thought that was a periodic job
16:17:26 <ralonsoh> but in the neutron channel
16:17:50 <slaweq> ralonsoh: yes, this job is in periodic queue
16:17:54 <slaweq> and it should be like that
16:18:12 <slaweq> but You can send a DNM patch where You will add it to check queue
16:18:36 <slaweq> then zuul will run it on this patch so we can see results before we will merge Your fix
16:18:43 <slaweq> I can send this patch if You want
16:18:49 <ralonsoh> ahhh ok!
16:18:54 <ralonsoh> understood
16:18:56 <slaweq> :)
16:20:10 <slaweq> ok, next topic
16:20:12 <slaweq> #topic Stadium projects
16:20:26 <slaweq> I think we already have a good update about dropping py2 support
16:20:36 <slaweq> so we don't need to talk about it here probably
16:20:38 <slaweq> right?
16:20:54 <njohnston> +1
16:21:05 <ralonsoh> +1
16:21:12 <bcafarel> yep
16:21:12 <slaweq> but I have another question related to stadium projects
16:21:35 <slaweq> we have now in the neutron check queue midonet and tripleo jobs which are non-voting and have been failing 100% of the time for a very long time
16:21:49 <slaweq> both are failing because they try to run on python 2.7
16:21:59 <slaweq> my question is: what should we do with those jobs?
16:22:11 <slaweq> I would personally remove them for now from the check queue
16:22:23 <slaweq> as it's only a waste of infra resources to run them on each patch
16:22:33 <slaweq> but I want to also know Your opinion about it
16:22:35 <njohnston> I would remove midonet as they don't have anything but UTs now
16:22:51 <njohnston> For tripleo I would ask if there is an updated job for py3 we could switch to
16:23:00 <slaweq> njohnston: there is none currently
16:23:03 <bcafarel> +1 as these cannot be fixed short-term, we can always add them back when there is support
16:23:15 <slaweq> afaik tripleo pinned neutron to some version which supports py2 still
16:23:20 <slaweq> and they use it like that in their ci
16:23:25 <njohnston> ugh. never mind. nuke it.
16:23:31 <slaweq> also this job runs on Centos 7
16:23:35 <bcafarel> yeah for tripleo we need centos8 support (which is in progress but not there tomorrow)
16:23:42 <slaweq> so we should probably wait with this job for centos8
16:24:52 <slaweq> so what do You think about removing both jobs from the check queue with a comment that we can bring them back when it will work?
16:24:55 <ralonsoh> for now, we should not waste CI resources (139 jobs in the queue now!)
16:25:05 <ralonsoh> we can comment them
16:25:07 <njohnston> agreed 🤯
16:25:12 <slaweq> ok, I will do that
16:25:14 <bcafarel> good for me too :)
16:25:37 <slaweq> #action slaweq to remove networking-midonet and tripleo based jobs from Neutron check queue
16:26:02 <slaweq> ok, that's all from my side regarding stadium projects
16:26:12 <slaweq> anything else You have in this topic?
16:26:44 <njohnston> nope
16:27:15 <slaweq> ok, lets move on then
16:27:17 <slaweq> #topic Grafana
16:27:41 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:28:13 <slaweq> all graphs are going back to normal values today after yesterday's issue with devstack
16:28:19 <njohnston> well we can see the issues clearly enough
16:29:31 <njohnston> what was wrong with the neutron-tempest-plugin-scenario-* jobs in the check queue back on 1/8? They were all at 100% fail then as well.
16:30:26 <slaweq> njohnston: I think it was this issue with 'None object don't have open_session' or something like that
16:30:36 <njohnston> ah ok, I recall that fix
16:30:39 <slaweq> :)
16:31:41 <slaweq> other than that I don't see any real problems in grafana
16:32:06 <slaweq> but I checked today the average number of rechecks in the last week and according to my script it was more than 3 rechecks per patch
16:32:20 <slaweq> which isn't a good result still :/
16:32:48 <njohnston> let's see if that average is lower next week
16:33:19 <slaweq> yeah
16:33:25 <slaweq> I will try to monitor it every week
16:33:38 <ralonsoh> (can you share it?)
16:33:42 <ralonsoh> if it is possible
16:33:53 <ralonsoh> the script
16:34:20 <slaweq> ralonsoh: sure
16:34:22 <slaweq> https://github.com/slawqo/tools/blob/master/rechecks.py
16:34:29 <ralonsoh> cool!!
16:34:31 <slaweq> but it's nothing really "great"
16:34:40 <slaweq> just a simple script which parses some comments from gerrit :)
16:34:47 <njohnston> ralonsoh: It is linked to from the article that describes it: http://kaplonski.pl/blog/failed_builds_per_patch/
16:34:57 <slaweq> njohnston: yes, it is
16:35:05 <ralonsoh> that's right yes!
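[Editor's note: the script linked above counts Zuul "Build failed" comments on the final patchset of merged patches. A minimal sketch of that idea, assuming the standard Gerrit REST API on review.opendev.org; the query string, result limit, and matched comment text are illustrative assumptions, and this is not the actual rechecks.py.]

```python
#!/usr/bin/env python3
# Sketch only: count Zuul "Build failed" comments on the final patchset of
# recently merged openstack/neutron changes, via the Gerrit REST API.
import json

import requests

GERRIT = "https://review.opendev.org"
# Merged changes touched in the last week (time window is an assumption).
QUERY = "project:openstack/neutron status:merged -age:7d"


def gerrit_get(path):
    # Gerrit prepends ")]}'" to JSON responses to prevent XSSI; strip it.
    resp = requests.get(GERRIT + path, timeout=30)
    resp.raise_for_status()
    return json.loads(resp.text[4:])


def failed_builds_on_last_patchset(change):
    # With o=MESSAGES each change carries its review comments, including the
    # patchset number each comment applies to.
    messages = change.get("messages", [])
    patchsets = [m["_revision_number"] for m in messages if "_revision_number" in m]
    if not patchsets:
        return 0
    last_ps = max(patchsets)
    return sum(
        1
        for m in messages
        if m.get("_revision_number") == last_ps and "Build failed" in m["message"]
    )


def main():
    changes = gerrit_get(
        "/changes/?q=%s&o=MESSAGES&n=100" % requests.utils.quote(QUERY)
    )
    counts = [failed_builds_on_last_patchset(c) for c in changes]
    if counts:
        print("average failed builds per patch: %.2f" % (sum(counts) / len(counts)))


if __name__ == "__main__":
    main()
```

[The real script's matching and project list differ; see the repository and blog post linked above for the actual implementation.]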
16:35:13 <slaweq> I think it works fine and IMO this metric makes some sense
16:35:28 <slaweq> but if You have any opinion about it, please let me know
16:35:48 <slaweq> maybe my way of thinking here is wrong and this doesn't have any informational value at all
16:36:22 <njohnston> I think you have a good point that it measures the developer pain due to bad CI
16:36:54 <njohnston> after all if it is a code problem then the developer is probably not going to do a bare 'recheck', so they have to be nearly all CI issues that get rechecked
16:36:55 <slaweq> njohnston: thx
16:37:08 <bcafarel> ack especially as it focuses on "final" series of rechecks
16:37:21 <slaweq> yeah, that's why I checked only "build failed" comments from the last patchset
16:37:34 <slaweq> as this in most cases means that the code was already fine
16:37:41 <slaweq> and issues were not related to this patch
16:38:14 <slaweq> of course in some cases it may be different, e.g. when a patch introduces some race condition and tests are failing intermittently, but in general it shouldn't be the case
16:39:07 <bcafarel> yeah from personal experience there should be a good signal to noise ratio
16:39:23 <slaweq> thx bcafarel
16:39:38 <slaweq> ok, anything else You want to talk about regarding grafana?
16:39:43 <slaweq> or can we move on?
16:39:53 <bcafarel> all good here
16:40:03 <slaweq> so lets move on then
16:40:05 <slaweq> #topic fullstack/functional
16:40:22 <slaweq> I found one failure in fullstack tests:
16:40:24 <slaweq> neutron.tests.fullstack.test_qos.TestMinBwQoSOvs.test_min_bw_qos_port_removed
16:40:26 <slaweq> https://b89f49db332cb8f54892-19780c33aa00a3c0d825d79cd8c225b0.ssl.cf5.rackcdn.com/701571/2/check/neutron-fullstack/51a1d1b/testr_results.html
16:40:40 <slaweq> but it's probably an issue which should be fixed with https://review.opendev.org/#/c/687922/
16:40:45 <slaweq> is that correct ralonsoh?
16:41:02 <ralonsoh> let me check
16:41:40 <ralonsoh> well, we didn't have the driver cache for qos min rules
16:41:52 <ralonsoh> and this specific test is totally refactored
16:42:08 <ralonsoh> so yes, I think this is going to fix it
16:42:14 <ralonsoh> (I hope so!)
16:42:24 <slaweq> ok, thx for confirmation
16:42:53 <slaweq> and I don't have any other new failures for functional/fullstack jobs
16:43:11 <slaweq> so I think we can move on to scenario jobs now
16:43:16 <slaweq> are You ok with that?
16:43:18 <bcafarel> I may have a new one for stein branch (functional)
16:43:26 <slaweq> bcafarel: shoot
16:44:00 <bcafarel> https://review.opendev.org/#/c/701898/ and https://review.opendev.org/#/c/702364/ both failed today on functional neutron.tests.functional.test_server tests
16:44:24 <bcafarel> may just be a bad node, I did not have time to dig in further, for example https://deacf45b9e7640612342-216aa7667ced3686ee75e1188a89b185.ssl.cf2.rackcdn.com/702364/1/check/neutron-functional/a6cf355/testr_results.html
16:44:48 <bcafarel> but I don't recall recent changes/fixes in this part of code in master/train
16:45:41 <slaweq> I remember some failure like that in master branch in the past
16:45:55 <slaweq> maybe it's just some missing backport?
16:47:01 <ralonsoh> slaweq, you changed the start method
16:47:14 <ralonsoh> https://review.opendev.org/#/c/680001/
16:47:56 <ralonsoh> is this in stein?
16:48:05 <bcafarel> ahah so maybe oslo bump in stein?
16:48:15 <slaweq> ralonsoh: yes, and this patch in oslo was backported to stein: https://review.opendev.org/#/q/I86a34c22d41d87a9cce2d4ac6d95562d05823ecf
16:48:17 <slaweq> :)
16:48:20 <bcafarel> that one I see only in master/train
16:48:24 <slaweq> so this may be the same problem
16:48:48 <bcafarel> ralonsoh: slaweq thanks I will check and send a cherry-pick if that is the one
16:48:48 <ralonsoh> my job is done here!
16:49:02 <bcafarel> though I am tempted to bet it is :)
16:49:15 <slaweq> ralonsoh++
16:49:22 <slaweq> bcafarel++ thx for checking that
16:50:25 <slaweq> #action bcafarel to send cherry-pick of https://review.opendev.org/#/c/680001/ to stable/stein to fix functional tests failure
16:50:46 <slaweq> ok, anything else regarding functional/fullstack jobs?
16:50:59 <bcafarel> nothing else from me at least :)
16:51:42 <slaweq> ok, lets move on then
16:51:44 <slaweq> #topic Tempest/Scenario
16:52:06 <slaweq> I have one failure related to scenario jobs to mention
16:52:09 <slaweq> Problem with ssh: paramiko.ssh_exception.SSHException: No existing session
16:52:17 <slaweq> e.g. https://947c62482e8e55a27073-47560c94aca274da9e9228ef37db57ef.ssl.cf1.rackcdn.com/701853/2/check/neutron-tempest-plugin-scenario-linuxbridge/e077ad8/testr_results.html
16:52:32 <slaweq> it may be the same issue that njohnston saw for the router migration tests
16:53:15 <slaweq> last week I opened a bug for that also https://bugs.launchpad.net/neutron/+bug/1858642
16:53:15 <openstack> Launchpad bug 1858642 in neutron "paramiko.ssh_exception.NoValidConnectionsError error cause dvr scenario jobs failing" [High,Confirmed]
16:53:27 <slaweq> but now it seems that it's not only related to dvr jobs
16:54:10 <ralonsoh> so you think this is not a DVR/no DVR problem but a paramiko one
16:54:11 <ralonsoh> ?
16:54:17 <slaweq> njohnston: if You will find something with router migration, please maybe take a look at this also to check if that's not the same problem
16:54:23 <njohnston> will do
16:54:52 <slaweq> ralonsoh: IMO it's not a paramiko problem, we just get such an error from paramiko when there is no ssh connectivity
16:55:06 <ralonsoh> ok
16:55:23 <slaweq> ahh no
16:55:25 <slaweq> sorry
16:55:32 <slaweq> it's not the same error in this linuxbridge job
16:55:42 <slaweq> njohnston: so please don't check that one
16:55:46 <slaweq> it's something different
16:55:54 <njohnston> ok
16:56:04 <slaweq> here it may be some paramiko error or some issue in our tests code
16:56:14 <slaweq> sorry for mixing things
16:57:01 <slaweq> so regarding this issue with the linuxbridge job, I think that if we spot it more often, I will open a new bug for that
16:57:09 <slaweq> and we can then check it
16:57:31 <slaweq> at least for now I saw it only once so I don't think it will be easy/doable to check that
16:58:32 <slaweq> ok, we are almost out of time
16:58:42 <slaweq> anything else You have to talk about today quickly?
16:59:33 <slaweq> if not, lets end the meeting now, thx for attending o/
16:59:43 <slaweq> #endmeeting
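[Editor's note: for reference, the periodic-job change ralonsoh describes at 16:14-16:18 amounts to a zuul job with a pre-run playbook that adds the MariaDB 10.4 repository on Ubuntu Bionic. A rough sketch, assuming hypothetical job and file names and a placeholder mirror URL; the real change is https://review.opendev.org/#/c/702416/.]

```yaml
# zuul.d/jobs.yaml (job name and parent are illustrative, not from the real review)
- job:
    name: neutron-functional-with-mariadb-10-4
    parent: neutron-functional
    nodeset: ubuntu-bionic
    pre-run: playbooks/configure-mariadb-repo.yaml
---
# playbooks/configure-mariadb-repo.yaml
- hosts: all
  tasks:
    - name: Add the MariaDB 10.4 apt repository (Ubuntu Bionic only)
      become: yes
      apt_repository:
        # Placeholder mirror URL; the actual change points at a real MariaDB mirror.
        repo: "deb http://mirror.example.org/mariadb/repo/10.4/ubuntu bionic main"
        state: present
      when:
        - ansible_distribution == 'Ubuntu'
        - ansible_distribution_release == 'bionic'
```

[As discussed above, a DNM patch that temporarily adds such a job to the check queue is an easy way to see it run before the periodic-only change merges.]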