16:00:47 <slaweq> #startmeeting neutron_ci
16:00:48 <openstack> Meeting started Tue Oct 9 16:00:47 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:51 <openstack> The meeting name has been set to 'neutron_ci'
16:00:54 <slaweq> welcome (again) :)
16:00:55 <mlavalle> o/
16:01:02 <njohnston> o/
16:01:30 <slaweq> #topic Actions from previous meetings
16:01:43 <slaweq> njohnston Reduce to a single fullstack job named 'neutron-fullstack' that calls python3 in tox.ini and runs on bionic nodeset
16:02:39 <njohnston> so for that I hijacked bcafarel's change https://review.openstack.org/#/c/604749/
16:03:21 <njohnston> I am tinkering with it to figure out the issues with the neutron-fullstack job
16:04:00 <slaweq> what issue exactly?
16:05:09 <njohnston> http://logs.openstack.org/49/604749/3/check/neutron-fullstack/612006a/job-output.txt.gz#_2018-10-04_01_32_01_503103
16:06:08 <slaweq> but do we still need to compile ovs in those tests?
16:06:18 <njohnston> looks like it is trying to compile OVS
16:06:30 <njohnston> I don't think so - that is not a change I introduced
16:06:37 <slaweq> I know
16:06:56 <slaweq> IIRC we had something like that because of some missing patch(es) in the Ubuntu ovs package
16:07:46 <slaweq> https://github.com/openstack/neutron/blob/master/neutron/tests/contrib/gate_hook.sh#L82
16:07:49 <slaweq> it's here
16:07:59 <njohnston> ah!
16:08:32 <slaweq> maybe we should check if this commit is already included in the version provided by Bionic/Xenial and then get rid of this
16:08:43 <njohnston> agreed
16:08:46 <mlavalle> that's a good point
16:08:49 <njohnston> I can look into that
16:08:55 <slaweq> thx njohnston :)
16:09:12 <njohnston> #action njohnston look into fullstack compilation of ovs still needed on bionic
16:09:30 <njohnston> thanks for a great memory slaweq!
16:09:38 <slaweq> yw njohnston :)
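
The version check slaweq suggests at 16:08:32 could look roughly like the sketch below: only build OVS from source when the distro package predates the release that already carries the needed fix. This is a hedged illustration, not what gate_hook.sh currently does; the openvswitch-switch package name is real on Ubuntu, but the 2.8.0 threshold and the compile_ovs helper are assumptions/placeholders.

    #!/bin/bash
    # Sketch: skip the source build when the packaged OVS is already new enough.
    # The 2.8.0 threshold and compile_ovs are placeholders, not gate_hook.sh's real code.
    OVS_MIN_VERSION="2.8.0"
    installed=$(dpkg-query -W -f='${Version}' openvswitch-switch 2>/dev/null || echo 0)
    if dpkg --compare-versions "$installed" ge "$OVS_MIN_VERSION"; then
        echo "Packaged openvswitch-switch $installed is new enough, not compiling OVS"
    else
        echo "Packaged openvswitch-switch $installed is too old, compiling OVS from source"
        compile_ovs  # placeholder for the existing source-build path
    fi
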
16:09:50 <slaweq> ok, let's move to the next one
16:09:53 <slaweq> njohnston Ask the infra team what ubuntu release Stein will be supported on
16:10:04 <njohnston> the answer is: bionic
16:10:09 <mlavalle> very good email summary on that one
16:10:16 <slaweq> yes, thx njohnston
16:10:29 <njohnston> I didn't send out that email to all, so let me summarize it here.
16:11:04 <njohnston> I brought this up to the TC, noting the point that we should be moving on to bionic based on our governance documents
16:11:38 <njohnston> Given how zuulv3 has pushed the onus of CI job definition maintenance to the project teams, I asked if the TC would be announcing this change or otherwise managing it
16:12:04 <njohnston> This prompted a bit of discussion, the conclusion of which was that the TC doesn't have a ready answer - yet
16:12:33 <njohnston> #link http://eavesdrop.openstack.org/irclogs/%23openstack-tc/%23openstack-tc.2018-10-03.log.html#t2018-10-03T16:46:21
16:13:00 <slaweq> ok then, so what if e.g. we switch all our jobs to bionic but other projects are still running xenial at the end of Stein? is that possible?
16:13:49 <slaweq> should we just go ahead and try to switch all our jobs, or should we wait for some more "global" action plan?
16:13:58 <njohnston> The TC will need to mitigate that risk. They are still smarting from the last time this happened, when everyone was running xenial except Trove, which was back on trusty. So they know the danger better than we do.
16:14:22 <njohnston> I think we should go ahead and switch our jobs
16:14:27 <mlavalle> yes
16:14:32 <mlavalle> we do our bit
16:14:47 <mlavalle> and let the TC worry about the global picture
16:14:56 <mlavalle> we can warn them on the ML
16:15:04 <mlavalle> when we are about to do it
16:15:06 <slaweq> so maybe we can first try to clone our jobs to the experimental queue, use them with Stein there and see how it looks
16:15:20 <slaweq> what do You think about it?
16:15:29 <mlavalle> reasonable
16:15:35 <njohnston> note that we already have tests running on bionic - you can't test on py36 without bionic, so the openstack-tox-py36 job is already there
16:16:01 <slaweq> yes, but I'm thinking more about all the scenario/tempest/grenade jobs
16:16:07 <njohnston> but yes, that is a reasonable plan, especially with complicated things like scenario tests
16:16:19 <slaweq> and btw. what about grenade? should it run Rocky on Bionic and upgrade to Stein?
16:16:32 <slaweq> do You know maybe?
16:16:41 <njohnston> right, because we're not dealing with 100% pass vs 100% fail, we're dealing with fluctuations in fail trends a lot of the time
16:17:11 <njohnston> the QA team is working on grenade because of the python3 challenge, so I think we can wait for the fruits of their labor
16:17:54 <njohnston> I don't know the specific answer, but if I am not mistaken the QA PTL is on the TC
16:18:18 <njohnston> so it should not escape notice
16:18:19 <slaweq> yes, he is
16:18:28 <slaweq> gmann is QA PTL IIRC
16:18:33 <njohnston> yep
16:18:35 <mlavalle> yes
16:18:44 <slaweq> ok then
16:18:54 <slaweq> so I think we should now create copies of our scenario/tempest/fullstack/functional jobs running on Bionic in the experimental queue, right?
16:19:10 <njohnston> yes
16:19:15 <slaweq> and then we will see what issues we have, so we will be able to plan this work somehow
16:19:23 <njohnston> +1
16:19:38 <slaweq> I can do this first step then :)
16:20:03 <slaweq> ok?
16:20:07 <mlavalle> yes
16:20:14 <slaweq> thx mlavalle :)
16:20:18 <njohnston> I am grateful, thanks
16:20:20 <slaweq> #action slaweq will create scenario/tempest/fullstack/functional jobs running on Bionic in experimental queue
16:20:36 <slaweq> ok, let's move on then
16:20:38 <slaweq> njohnston to check and add missing jobs to grafana dashboard
16:21:17 <njohnston> https://review.openstack.org/#/c/607586/ merged
16:21:34 <slaweq> thx njohnston :)
16:21:48 <slaweq> ok, next one
16:21:49 <slaweq> * slaweq to prepare etherpad with list of jobs to move to Bionic
16:21:57 <slaweq> I started doing this today:
16:22:04 <slaweq> https://etherpad.openstack.org/p/neutron-ci-bionic
16:22:22 <slaweq> but as I work on those experimental jobs I will also update this etherpad
16:22:30 <njohnston> sounds good
16:23:31 <slaweq> next was:
16:23:33 <slaweq> * slaweq will report bug about failing trunk scenario test
16:23:39 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1795870
16:23:39 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:23:48 <slaweq> sorry mlavalle that I forgot to ping You about this :)
16:23:53 <mlavalle> np
16:24:01 <slaweq> and the last one:
16:24:01 <mlavalle> I'll work on it next
16:24:03 <slaweq> * mlavalle will check failing trunk scenario test
16:24:13 <slaweq> I guess it's not done, right? :P
16:24:16 <mlavalle> yes
16:24:30 <slaweq> #action mlavalle will check failing trunk scenario test
16:24:42 <slaweq> I will add it for next week then
16:24:45 <slaweq> thx mlavalle
16:24:57 <slaweq> ok, so that was all from last week
16:25:00 <slaweq> next topic
16:25:02 <slaweq> #topic Grafana
16:25:07 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:26:34 <slaweq> looking at grafana, it looks like we are generally in quite good shape
16:26:53 <slaweq> do You want to talk about something specific?
16:26:53 <njohnston> neutron-grenade-dvr-multinode falling fast in the check queue \o/
16:27:00 <mlavalle> there's been a slowdown of activity over the past few days
16:27:22 <slaweq> njohnston: I want to talk about this grenade job a bit later :)
16:27:58 <slaweq> mlavalle: yes, it looks like there wasn't much in the gate recently, most work was in the check queue
16:28:20 <mlavalle> maybe also a bit of the Columbus Day holiday in the US
16:28:34 <njohnston> +1
16:28:35 <slaweq> might be
16:29:18 <slaweq> but basically it looks quite good overall
16:29:34 <mlavalle> \o/
16:29:43 <slaweq> so let's talk about some specific job issues
16:29:51 <slaweq> (which are already well known)
16:29:53 <slaweq> #topic fullstack/functional
16:30:29 <slaweq> I don't know why, but I think we are again being hit more often by the issue from https://bugs.launchpad.net/neutron/+bug/1687027 in functional tests
16:30:29 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:30:40 <slaweq> I saw it a couple of times this week
16:30:48 <slaweq> e.g. http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-functional/e56b690/logs/testr_results.html.gz
16:31:00 <slaweq> it's a periodic job from today
16:32:38 <mlavalle> should I prioritize this one over the previous one?
16:32:38 <slaweq> did You see it recently too?
16:32:49 <mlavalle> I haven't seen it
16:32:54 <mlavalle> lately
16:33:30 <slaweq> mlavalle: I think that this issue with the trunk tests is more important
16:33:40 <slaweq> this one is IMO strictly a test-only issue
16:33:46 <mlavalle> yes, that's also my impression
16:33:53 <slaweq> and the trunk tests issue - we don't know, maybe it's a "real" bug
16:34:39 <slaweq> I will try to find a few minutes to maybe reproduce this one
16:34:56 <slaweq> if I find something, I will let You know at the next meeting :)
16:35:14 <mlavalle> ok
16:35:24 <slaweq> #action slaweq will try to reproduce and triage https://bugs.launchpad.net/neutron/+bug/1687027
16:35:24 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
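
One way to chase the intermittent test_walk_versions failure slaweq takes on at 16:35:24 is to run just the DB migration functional tests in a loop until they break. A minimal sketch, assuming neutron's dsvm-functional tox environment and that the tests live under neutron.tests.functional.db.test_migrations; both names are assumptions worth double-checking against the current tree:

    #!/bin/bash
    # Re-run the migration functional tests until one run fails, keeping each log.
    # The tox env name and test path are assumptions, see the note above.
    set -o pipefail
    for i in $(seq 1 30); do
        echo "=== attempt $i ==="
        if ! tox -e dsvm-functional -- neutron.tests.functional.db.test_migrations \
                2>&1 | tee "walk-versions-run-$i.log"; then
            echo "Failure reproduced on attempt $i, see walk-versions-run-$i.log"
            break
        fi
    done
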
16:35:41 <slaweq> speaking about scenario jobs,
16:35:47 <slaweq> #topic Tempest/Scenario
16:35:56 <slaweq> manjeets: any update on https://bugs.launchpad.net/neutron/+bug/1789434 ?
16:35:57 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
16:37:32 <slaweq> I guess he isn't here
16:37:41 <mlavalle> nope
16:38:09 <slaweq> ok, let's wait one more week and then maybe we will have to start working on this
16:38:50 <mlavalle> yes
16:39:22 <slaweq> so, let's move on to the next topic
16:39:25 <slaweq> #topic grenade
16:39:37 <slaweq> (my "favorite" one recently)
16:39:46 <mlavalle> LOL
16:40:09 <slaweq> I spent about 3 days debugging https://bugs.launchpad.net/neutron/+bug/1791989
16:40:09 <openstack> Launchpad bug 1791989 in neutron "grenade-dvr-multinode job fails" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:40:13 <slaweq> and I found nothing :/
16:40:21 <slaweq> well, almost nothing
16:41:16 <slaweq> I have a patch which prints a lot of information, like openflow rules, bridges in ovs, namespaces, network config in namespaces, iptables and so on
16:41:33 <slaweq> and I was comparing it between the moment when ping was not working and when it started working
16:41:37 <slaweq> and it was all the same
16:42:15 <slaweq> haleyb proposed a patch to add permanent neighbors in fip and qrouter namespaces: https://review.openstack.org/#/c/608259/
16:42:19 <slaweq> but this didn't help
16:42:49 <slaweq> so I started to ssh to those nodes while the job was running and looked at tcpdump when ping was not working
16:43:06 <slaweq> and I found that icmp requests are coming to the vm's tap device
16:43:20 <slaweq> and the vm sends an icmp reply, but this reply gets lost somewhere in br-int
16:44:00 <mlavalle> that sounds like flows not properly set up maybe
16:44:06 <slaweq> so today I added another thing to my dnm patch (https://review.openstack.org/#/c/602204/): flush the fdb entries from br-int when the ping doesn't work the first time and then check whether ping starts working after that
16:44:19 <slaweq> mlavalle: all flows were the same
16:44:32 <slaweq> openflow rules I mean
16:44:36 <mlavalle> ok
16:45:00 <slaweq> so, I am trying to spot this issue again in my dnm patch and see if this flush will help or not
16:45:22 <slaweq> but since yesterday evening it looks like this job is working fine again
16:45:36 <haleyb> and it eventually starts working, which is why we started looking at arp. i'll await the testing...
16:45:44 * haleyb is late for the party
16:46:03 <slaweq> I can't spot it in my DNM patch - ping now works on the first attempt every time
16:47:04 <slaweq> and looking at kibana: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%5C%22neutron-grenade-dvr-multinode%5C%22%20AND%20build_status:FAILURE%20AND%20message:%5C%22die%2067%20'%5BFail%5D%20Couldn'%5C%5C''t%20ping%20server%5C%22
16:47:04 <njohnston> down to only 37% failure - makes me wonder if something was fixed elsewhere and our issue was just a manifestation http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?panelId=22&fullscreen&orgId=1
16:47:40 <slaweq> it looks like it really hasn't happened since yesterday
16:48:19 <slaweq> njohnston: yes, that's exactly what I think now
16:48:32 <slaweq> but TBH I'm really puzzled by this one :/
16:49:00 <slaweq> so, any questions/comments about this?
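
For readers following the debugging described between 16:42:49 and 16:45:00, the checks slaweq mentions map onto a handful of standard OVS and iproute2 commands, roughly as below. A minimal sketch run on the compute node hosting the VM; <tap-id> and <router-id> are placeholders, not values from the failing job:

    # Is the ICMP request/reply actually visible at the VM's tap device?
    sudo tcpdump -ni tap<tap-id> icmp
    # Compare the integration bridge flows between the broken and the working state
    sudo ovs-ofctl dump-flows br-int
    # Inspect and, if needed, flush the learned MAC table on br-int, then re-test the ping
    sudo ovs-appctl fdb/show br-int
    sudo ovs-appctl fdb/flush br-int
    # ARP/neighbor entries cached in the DVR router namespace
    sudo ip netns exec qrouter-<router-id> ip neigh
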
16:49:01 <njohnston> would it be useful to ping infra and ask if they fixed anything networking-related?
16:49:28 <njohnston> or moved their capacity around - drained all jobs from a cloud, perhaps?
16:49:29 <slaweq> njohnston: yes, I can ask them
16:49:45 <mlavalle> interesting
16:49:47 <njohnston> that's all I can think of
16:51:01 <slaweq> I think I can only ask them whether they recently did/fixed something which might be related, and then wait to see how the results look in grafana over the next few days
16:51:46 <slaweq> I was also trying to reproduce it locally on a similar 2-node dvr env with rocky, but I couldn't - it all worked fine
16:53:28 <slaweq> ok, I think we can move on to the next topic then
16:53:34 <slaweq> #topic periodic
16:54:12 <slaweq> I just wanted to mention that last week we had the issue https://bugs.launchpad.net/neutron/+bug/1795878, which I found after the CI meeting so I didn't mention it then :)
16:54:12 <openstack> Launchpad bug 1795878 in neutron "Python 2.7 and 3.5 periodic jobs with-oslo-master fails" [High,Fix released] - Assigned to Bernard Cafarelli (bcafarel)
16:54:20 <slaweq> it's now fixed: https://review.openstack.org/607872
16:54:52 <mlavalle> cool
16:55:13 <slaweq> and that's all from me about periodic jobs
16:55:16 <slaweq> #topic Open discussion
16:55:26 <slaweq> I also want to mention one more thing
16:56:04 <slaweq> during the PTG I was at some QA team sessions and they asked us to move the tempest plugins from neutron stadium projects to separate repos or to the neutron_tempest_plugin repo
16:56:23 <slaweq> mlavalle: which option would be better in Your opinion?
16:56:55 <slaweq> the list of plugins is at https://docs.openstack.org/tempest/latest/plugin-registry.html
16:57:24 <mlavalle> I don't have a strong opinion
16:57:48 <slaweq> me neither
16:57:50 <mlavalle> we share a lot of tests with the Stadium, don't we?
16:57:59 <slaweq> I don't know if it's a lot
16:58:15 <mlavalle> since it is testing based on the API
16:58:17 <slaweq> maybe we should discuss it on the ML first then?
16:58:29 <mlavalle> yes, let's ask them
16:58:42 <slaweq> mlavalle: will You send an email or should I?
16:58:44 <njohnston> That makes sense, since the stadium projects may not have a uniform way they approach this
16:59:13 <mlavalle> I suspect many of them may not have the manpower to handle yet another repo
16:59:25 <mlavalle> let's see what they say
16:59:32 <mlavalle> yes, I'll send the message
16:59:38 <slaweq> ok, thx mlavalle
17:00:03 <slaweq> #action mlavalle to send an email about moving tempest plugins from stadium to separate repo
17:00:08 <slaweq> ok, we are out of time
17:00:12 <slaweq> thx for attending
17:00:15 <slaweq> #endmeeting