16:00:47 <slaweq> #startmeeting neutron_ci
16:00:48 <openstack> Meeting started Tue Oct 9 16:00:47 2018 UTC and is due to finish in 60 minutes. The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:49 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:51 <openstack> The meeting name has been set to 'neutron_ci'
16:00:54 <slaweq> welcome (again) :)
16:00:55 <mlavalle> o/
16:01:02 <njohnston> o/
16:01:30 <slaweq> #topic Actions from previous meetings
16:01:43 <slaweq> njohnston Reduce to a single fullstack job named 'neutron-fullstack' that calls python3 in tox.ini and runs on bionic nodeset
16:02:39 <njohnston> so for that I hijacked bcafarel's change https://review.openstack.org/#/c/604749/
16:03:21 <njohnston> I am tinkering with it to figure out the issues with the neutron-fullstack job
16:04:00 <slaweq> what issue exactly?
16:05:09 <njohnston> http://logs.openstack.org/49/604749/3/check/neutron-fullstack/612006a/job-output.txt.gz#_2018-10-04_01_32_01_503103
16:06:08 <slaweq> but do we still need to compile ovs in those tests?
16:06:18 <njohnston> looks like it is trying to compile OVS
16:06:30 <njohnston> I don't think so - that is not a change I introduced
16:06:37 <slaweq> I know
16:06:56 <slaweq> IIRC we had something like that because of some missing patch(es) in the Ubuntu ovs package
16:07:46 <slaweq> https://github.com/openstack/neutron/blob/master/neutron/tests/contrib/gate_hook.sh#L82
16:07:49 <slaweq> it's here
16:07:59 <njohnston> ah!
16:08:32 <slaweq> maybe we should check if this commit is already included in the version provided by Bionic/Xenial and then get rid of this
16:08:43 <njohnston> agreed
16:08:46 <mlavalle> that's a good point
16:08:49 <njohnston> I can look into that
16:08:55 <slaweq> thx njohnston :)
16:09:12 <njohnston> #action njohnston look into fullstack compilation of ovs still needed on bionic
16:09:30 <njohnston> thanks for a great memory slaweq!
16:09:38 <slaweq> yw njohnston :)
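
The version check slaweq suggests at 16:08:32 could look roughly like the sketch below: only build OVS from source when the distro package predates the release that already carries the needed fix. This is a hedged illustration, not what gate_hook.sh currently does; the openvswitch-switch package name is real on Ubuntu, but the 2.8.0 threshold and the compile_ovs helper are assumptions/placeholders.

    #!/bin/bash
    # Sketch: skip the source build when the packaged OVS is already new enough.
    # The 2.8.0 threshold and compile_ovs are placeholders, not gate_hook.sh's real code.
    OVS_MIN_VERSION="2.8.0"
    installed=$(dpkg-query -W -f='${Version}' openvswitch-switch 2>/dev/null || echo 0)
    if dpkg --compare-versions "$installed" ge "$OVS_MIN_VERSION"; then
        echo "Packaged openvswitch-switch $installed is new enough, not compiling OVS"
    else
        echo "Packaged openvswitch-switch $installed is too old, compiling OVS from source"
        compile_ovs  # placeholder for the existing source-build path
    fi
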
16:09:50 <slaweq> ok, let's move to the next one
16:09:53 <slaweq> njohnston Ask the infra team what ubuntu release Stein will be supported on
16:10:04 <njohnston> the answer is: bionic
16:10:09 <mlavalle> very good email summary on that one
16:10:16 <slaweq> yes, thx njohnston
16:10:29 <njohnston> I didn't send out that email to all, so let me summarize it here.
16:11:04 <njohnston> I brought this up to the TC, noting the point that we should be moving on to bionic based on our governance documents
16:11:38 <njohnston> Given how zuulv3 has pushed the onus of CI job definition maintenance to the project teams, I asked if the TC would be announcing this change or otherwise managing it
16:12:04 <njohnston> This prompted a bit of discussion, the conclusion of which was that the TC doesn't have a ready answer - yet
16:12:33 <njohnston> #link http://eavesdrop.openstack.org/irclogs/%23openstack-tc/%23openstack-tc.2018-10-03.log.html#t2018-10-03T16:46:21
16:13:00 <slaweq> ok then, so what if e.g. we switch all our jobs to bionic but other projects are still running xenial at the end of Stein? is that possible?
16:13:49 <slaweq> should we just go ahead and try to switch all our jobs, or should we wait for some more "global" action plan?
16:13:58 <njohnston> The TC will need to mitigate that risk. They are still smarting from the last time this happened, when everyone was running xenial except Trove, which was back on trusty. So they know the danger better than we do.
16:14:22 <njohnston> I think we should go ahead and switch our jobs
16:14:27 <mlavalle> yes
16:14:32 <mlavalle> we do our bit
16:14:47 <mlavalle> and let the TC worry about the global picture
16:14:56 <mlavalle> we can warn them on the ML
16:15:04 <mlavalle> when we are about to do it
16:15:06 <slaweq> so maybe we can first try to clone our jobs to the experimental queue, use them with Stein there and see how it looks
16:15:20 <slaweq> what do You think about it?
16:15:29 <mlavalle> reasonable
16:15:35 <njohnston> note that we already have tests running on bionic - you can't test on py36 without bionic, so the openstack-tox-py36 job is already there
16:16:01 <slaweq> yes, but I'm thinking more about all the scenario/tempest/grenade jobs
16:16:07 <njohnston> but yes, that is a reasonable plan, especially with complicated things like scenario tests
16:16:19 <slaweq> and btw. what about grenade? should it run Rocky on Bionic and upgrade to Stein?
16:16:32 <slaweq> do You know maybe?
16:16:41 <njohnston> right, because we're not dealing with 100% pass vs 100% fail, we're dealing with fluctuations in fail trends a lot of the time
16:17:11 <njohnston> the QA team is working on grenade because of the python3 challenge, so I think we can wait for the fruits of their labor
16:17:54 <njohnston> I don't know the specific answer, but if I am not mistaken the QA PTL is on the TC
16:18:18 <njohnston> so it should not escape notice
16:18:19 <slaweq> yes, he is
16:18:28 <slaweq> gmann is QA PTL IIRC
16:18:33 <njohnston> yep
16:18:35 <mlavalle> yes
16:18:44 <slaweq> ok then
16:18:54 <slaweq> so I think we should now create copies of our scenario/tempest/fullstack/functional jobs running on Bionic in the experimental queue, right?
16:19:10 <njohnston> yes
16:19:15 <slaweq> and then we will see what issues we have, so we will be able to plan this work somehow
16:19:23 <njohnston> +1
16:19:38 <slaweq> I can do this first step then :)
16:20:03 <slaweq> ok?
16:20:07 <mlavalle> yes
16:20:14 <slaweq> thx mlavalle :)
16:20:18 <njohnston> I am grateful, thanks
16:20:20 <slaweq> #action slaweq will create scenario/tempest/fullstack/functional jobs running on Bionic in experimental queue
16:20:36 <slaweq> ok, let's move on then
16:20:38 <slaweq> njohnston to check and add missing jobs to grafana dashboard
16:21:17 <njohnston> https://review.openstack.org/#/c/607586/ merged
16:21:34 <slaweq> thx njohnston :)
16:21:48 <slaweq> ok, next one
16:21:49 <slaweq> * slaweq to prepare etherpad with list of jobs to move to Bionic
16:21:57 <slaweq> I started doing this today:
16:22:04 <slaweq> https://etherpad.openstack.org/p/neutron-ci-bionic
16:22:22 <slaweq> but as I work on those experimental jobs I will also update this etherpad
16:22:30 <njohnston> sounds good
16:23:31 <slaweq> next was:
16:23:33 <slaweq> * slaweq will report bug about failing trunk scenario test
16:23:39 <slaweq> Done: https://bugs.launchpad.net/neutron/+bug/1795870
16:23:39 <openstack> Launchpad bug 1795870 in neutron "Trunk scenario test test_trunk_subport_lifecycle fails from time to time" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:23:48 <slaweq> sorry mlavalle that I forgot to ping You about this :)
16:23:53 <mlavalle> np
16:24:01 <slaweq> and the last one:
16:24:01 <mlavalle> I'll work on it next
16:24:03 <slaweq> * mlavalle will check failing trunk scenario test
16:24:13 <slaweq> I guess it's not done, right? :P
16:24:16 <mlavalle> yes
16:24:30 <slaweq> #action mlavalle will check failing trunk scenario test
16:24:42 <slaweq> I will add it for next week then
16:24:45 <slaweq> thx mlavalle
16:24:57 <slaweq> ok, so that was all from last week
16:25:00 <slaweq> next topic
16:25:02 <slaweq> #topic Grafana
16:25:07 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:26:34 <slaweq> looking at grafana, it looks like we are generally in quite good shape
16:26:53 <slaweq> do You want to talk about something specific?
16:26:53 <njohnston> neutron-grenade-dvr-multinode falling fast in the check queue \o/
16:27:00 <mlavalle> there's been a slowdown of activity over the past few days
16:27:22 <slaweq> njohnston: I want to talk about this grenade job a bit later :)
16:27:58 <slaweq> mlavalle: yes, it looks like there wasn't much in the gate recently, most work was in the check queue
16:28:20 <mlavalle> maybe also a bit of the Columbus Day holiday in the US
16:28:34 <njohnston> +1
16:28:35 <slaweq> might be
16:29:18 <slaweq> but basically it looks quite good overall
16:29:34 <mlavalle> \o/
16:29:43 <slaweq> so let's talk about some specific job issues
16:29:51 <slaweq> (which are already well known)
16:29:53 <slaweq> #topic fullstack/functional
16:30:29 <slaweq> I don't know why, but I think we are again being hit more often by the issue from https://bugs.launchpad.net/neutron/+bug/1687027 in functional tests
16:30:29 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:30:40 <slaweq> I saw it a couple of times this week
16:30:48 <slaweq> e.g. http://logs.openstack.org/periodic/git.openstack.org/openstack/neutron/master/neutron-functional/e56b690/logs/testr_results.html.gz
16:31:00 <slaweq> it's a periodic job from today
16:32:38 <mlavalle> should I prioritize this one over the previous one?
16:32:38 <slaweq> did You see it recently too?
16:32:49 <mlavalle> I haven't seen it
16:32:54 <mlavalle> lately
16:33:30 <slaweq> mlavalle: I think that this issue with the trunk tests is more important
16:33:40 <slaweq> this one is IMO strictly a test-only issue
16:33:46 <mlavalle> yes, that's also my impression
16:33:53 <slaweq> and the trunk tests issue - we don't know, maybe it's a "real" bug
16:34:39 <slaweq> I will try to find a few minutes to maybe reproduce this one
16:34:56 <slaweq> if I find something, I will let You know at the next meeting :)
16:35:14 <mlavalle> ok
16:35:24 <slaweq> #action slaweq will try to reproduce and triage https://bugs.launchpad.net/neutron/+bug/1687027
16:35:24 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
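
One way to chase the intermittent test_walk_versions failure slaweq takes on at 16:35:24 is to run just the DB migration functional tests in a loop until they break. A minimal sketch, assuming neutron's dsvm-functional tox environment and that the tests live under neutron.tests.functional.db.test_migrations; both names are assumptions worth double-checking against the current tree:

    #!/bin/bash
    # Re-run the migration functional tests until one run fails, keeping each log.
    # The tox env name and test path are assumptions, see the note above.
    set -o pipefail
    for i in $(seq 1 30); do
        echo "=== attempt $i ==="
        if ! tox -e dsvm-functional -- neutron.tests.functional.db.test_migrations \
                2>&1 | tee "walk-versions-run-$i.log"; then
            echo "Failure reproduced on attempt $i, see walk-versions-run-$i.log"
            break
        fi
    done
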
16:35:41 <slaweq> speaking about scenario jobs,
16:35:47 <slaweq> #topic Tempest/Scenario
16:35:56 <slaweq> manjeets: any update on https://bugs.launchpad.net/neutron/+bug/1789434 ?
16:35:57 <openstack> Launchpad bug 1789434 in neutron "neutron_tempest_plugin.scenario.test_migration.NetworkMigrationFromHA failing 100% times" [High,Confirmed] - Assigned to Manjeet Singh Bhatia (manjeet-s-bhatia)
16:37:32 <slaweq> I guess he isn't here
16:37:41 <mlavalle> nope
16:38:09 <slaweq> ok, let's wait one more week and then maybe we will have to start working on this
16:38:50 <mlavalle> yes
16:39:22 <slaweq> so, let's move on to the next topic
16:39:25 <slaweq> #topic grenade
16:39:37 <slaweq> (my "favorite" one recently)
16:39:46 <mlavalle> LOL
16:40:09 <slaweq> I spent about 3 days debugging https://bugs.launchpad.net/neutron/+bug/1791989
16:40:09 <openstack> Launchpad bug 1791989 in neutron "grenade-dvr-multinode job fails" [High,Confirmed] - Assigned to Slawek Kaplonski (slaweq)
16:40:13 <slaweq> and I found nothing :/
16:40:21 <slaweq> well, almost nothing
16:41:16 <slaweq> I have a patch which prints a lot of information, like openflow rules, bridges in ovs, namespaces, network config in namespaces, iptables and so on
16:41:33 <slaweq> and I was comparing it between the moment when ping was not working and when it started working
16:41:37 <slaweq> and it was all the same
16:42:15 <slaweq> haleyb proposed a patch to add permanent neighbors in fip and qrouter namespaces: https://review.openstack.org/#/c/608259/
16:42:19 <slaweq> but this didn't help
16:42:49 <slaweq> so I started to ssh to those nodes while the job was running and looked at tcpdump when ping was not working
16:43:06 <slaweq> and I found that icmp requests are coming to the vm's tap device
16:43:20 <slaweq> and the vm sends an icmp reply, but this reply gets lost somewhere in br-int
16:44:00 <mlavalle> that sounds like flows not properly set up maybe
16:44:06 <slaweq> so today I added another thing to my dnm patch (https://review.openstack.org/#/c/602204/): flush the fdb entries from br-int when the ping doesn't work the first time and then check whether ping starts working after that
16:44:19 <slaweq> mlavalle: all flows were the same
16:44:32 <slaweq> openflow rules I mean
16:44:36 <mlavalle> ok
16:45:00 <slaweq> so, I am trying to spot this issue again in my dnm patch and see if this flush will help or not
16:45:22 <slaweq> but since yesterday evening it looks like this job is working fine again
16:45:36 <haleyb> and it eventually starts working, which is why we started looking at arp. i'll await the testing...
16:45:44 * haleyb is late for the party
16:46:03 <slaweq> I can't spot it in my DNM patch - ping now works on the first attempt every time
16:47:04 <slaweq> and looking at kibana: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:%5C%22neutron-grenade-dvr-multinode%5C%22%20AND%20build_status:FAILURE%20AND%20message:%5C%22die%2067%20'%5BFail%5D%20Couldn'%5C%5C''t%20ping%20server%5C%22
16:47:04 <njohnston> down to only 37% failure - makes me wonder if something was fixed elsewhere and our issue was just a manifestation http://grafana.openstack.org/d/Hj5IHcSmz/neutron-failure-rate?panelId=22&fullscreen&orgId=1
16:47:40 <slaweq> it looks like it really hasn't happened since yesterday
16:48:19 <slaweq> njohnston: yes, that's exactly what I think now
16:48:32 <slaweq> but TBH I'm really puzzled by this one :/
16:49:00 <slaweq> so, any questions/comments about this?
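
For readers following the debugging described between 16:42:49 and 16:45:00, the checks slaweq mentions map onto a handful of standard OVS and iproute2 commands, roughly as below. A minimal sketch run on the compute node hosting the VM; <tap-id> and <router-id> are placeholders, not values from the failing job:

    # Is the ICMP request/reply actually visible at the VM's tap device?
    sudo tcpdump -ni tap<tap-id> icmp
    # Compare the integration bridge flows between the broken and the working state
    sudo ovs-ofctl dump-flows br-int
    # Inspect and, if needed, flush the learned MAC table on br-int, then re-test the ping
    sudo ovs-appctl fdb/show br-int
    sudo ovs-appctl fdb/flush br-int
    # ARP/neighbor entries cached in the DVR router namespace
    sudo ip netns exec qrouter-<router-id> ip neigh
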
16:49:01 <njohnston> would it be useful to ping infra and ask if they fixed anything networking-related?
16:49:28 <njohnston> or moved their capacity around - drained all jobs from a cloud, perhaps?
16:49:29 <slaweq> njohnston: yes, I can ask them
16:49:45 <mlavalle> interesting
16:49:47 <njohnston> that's all I can think of
16:51:01 <slaweq> I think I can only ask them whether they recently did/fixed something which might be related, and then wait to see how the results look in grafana over the next few days
16:51:46 <slaweq> I was also trying to reproduce it locally on a similar 2-node dvr env with rocky, but I couldn't - it all worked fine
16:53:28 <slaweq> ok, I think we can move on to the next topic then
16:53:34 <slaweq> #topic periodic
16:54:12 <slaweq> I just wanted to mention that last week we had the issue https://bugs.launchpad.net/neutron/+bug/1795878, which I found after the CI meeting so I didn't mention it then :)
16:54:12 <openstack> Launchpad bug 1795878 in neutron "Python 2.7 and 3.5 periodic jobs with-oslo-master fails" [High,Fix released] - Assigned to Bernard Cafarelli (bcafarel)
16:54:20 <slaweq> it's now fixed: https://review.openstack.org/607872
16:54:52 <mlavalle> cool
16:55:13 <slaweq> and that's all from me about periodic jobs
16:55:16 <slaweq> #topic Open discussion
16:55:26 <slaweq> I also want to mention one more thing
16:56:04 <slaweq> during the PTG I was at some QA team sessions and they asked us to move the tempest plugins from neutron stadium projects to separate repos or to the neutron_tempest_plugin repo
16:56:23 <slaweq> mlavalle: which option would be better in Your opinion?
16:56:55 <slaweq> the list of plugins is at https://docs.openstack.org/tempest/latest/plugin-registry.html
16:57:24 <mlavalle> I don't have a strong opinion
16:57:48 <slaweq> me neither
16:57:50 <mlavalle> we share a lot of tests with the Stadium, don't we?
16:57:59 <slaweq> I don't know if it's a lot
16:58:15 <mlavalle> since it is testing based on the API
16:58:17 <slaweq> maybe we should discuss it on the ML first then?
16:58:29 <mlavalle> yes, let's ask them
16:58:42 <slaweq> mlavalle: will You send an email or should I?
16:58:44 <njohnston> That makes sense, since the stadium projects may not have a uniform way they approach this
16:59:13 <mlavalle> I suspect many of them may not have the manpower to handle yet another repo
16:59:25 <mlavalle> let's see what they say
16:59:32 <mlavalle> yes, I'll send the message
16:59:38 <slaweq> ok, thx mlavalle
17:00:03 <slaweq> #action mlavalle to send an email about moving tempest plugins from stadium to separate repo
17:00:08 <slaweq> ok, we are out of time
17:00:12 <slaweq> thx for attending
17:00:15 <slaweq> #endmeeting