16:00:14 <slaweq> #startmeeting neutron_ci
16:00:15 <openstack> Meeting started Tue Sep 25 16:00:14 2018 UTC and is due to finish in 60 minutes.  The chair is slaweq. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:17 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
16:00:19 <slaweq> welcome again :)
16:00:20 <openstack> The meeting name has been set to 'neutron_ci'
16:00:52 <haleyb> hi
16:01:24 <mlavalle> o/
16:01:34 <mlavalle> made it
16:01:44 <slaweq> :)
16:01:48 <bcafarel> now I remember why I miss most ci meetings, usually have to leave shortly after they start
16:01:53 <bcafarel> still hi again
16:02:01 <slaweq> bcafarel: LOL
16:02:14 <slaweq> ok, lets go then
16:02:16 <slaweq> #topic Actions from previous meetings
16:02:26 <slaweq> * manjeets continue debugging why migration from HA routers fails 100% of times
16:02:33 <slaweq> manjeets: any progress on this one?
16:02:44 <slaweq> I proposed today to mark those tests as unstable for now: https://review.openstack.org/605057
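For context, a minimal sketch of how such tests are usually marked unstable in neutron-tempest-plugin, assuming tempest's unstable_test decorator; the class, test name and bug number below are placeholders, not necessarily what https://review.openstack.org/605057 actually changes:

    # Sketch only: unstable_test logs the failure and turns it into a skip,
    # so the job can keep passing while the root cause is being debugged.
    from tempest.lib import decorators

    from neutron_tempest_plugin.scenario import base


    class RouterMigrationTest(base.BaseTempestTestCase):

        @decorators.unstable_test(bug="XXXXXXX")  # placeholder bug number
        def test_from_ha_to_legacy(self):
            ...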
16:03:37 <slaweq> I think manjeets is not available now
16:04:08 <slaweq> mlavalle: please just take a look at this patch of mine - I think it would be good to make this job pass at least sometimes :)
16:04:43 <njohnston> o/
16:04:48 <mlavalle> done
16:04:48 <njohnston> o/
16:04:54 <slaweq> thx mlavalle
16:04:56 <slaweq> hi njohnston
16:05:03 <slaweq> ok, next one
16:05:04 <njohnston> sorry about the repeat there
16:05:05 <slaweq> * mlavale to check issue with failing test_attach_volume_shelved_or_offload_server test
16:05:22 <slaweq> no problem njohnston :)
16:05:35 <slaweq> no problem njohnston :)
16:05:36 <slaweq> LOL
16:05:40 <njohnston> :-)
16:06:11 <slaweq> ok, mlavalle any update about this shelved unshelved server test fail?
16:06:17 <mlavalle> slaweq: hang on
16:06:38 <slaweq> k
16:08:39 <mlavalle> slaweq: I can't find it. I think I left some notes there recently
16:09:02 <slaweq> You didn't find any issues like that recently, right?
16:10:06 <mlavalle> yes
16:10:15 <mlavalle> but do you have a pointer to the bug?
16:11:05 <slaweq> I don't have
16:11:10 <slaweq> but let me find it
16:13:01 <slaweq> I can't find it
16:13:09 <slaweq> was it reported as a bug?
16:13:15 <slaweq> maybe we forgot about that?
16:14:07 <mlavalle> yeah, that may be the problem
16:14:39 <mlavalle> in any case, I spent some time last week searching kibana for it
16:14:44 <mlavalle> and didn't find instances
16:14:54 <slaweq> so maybe we will be good with it :)
16:14:57 <mlavalle> I'll dig the query and get back to you
16:15:04 <slaweq> ok, thx
16:15:07 <mlavalle> I sent myself an email
16:15:18 <mlavalle> with the query that I need to dig
16:15:53 <slaweq> #action mlavalle will work on logstash query to find if issues with test_attach_volume_shelved_or_offload_server still happens
16:16:05 <slaweq> ok, next one then
16:16:07 <slaweq> * njohnston will continue work on switch fullstack-python35 to python36 job
16:16:35 <njohnston> So it is voting now but as bcafarel pointed out it is still using py35
16:16:39 <njohnston> I am looking in to it now
16:16:47 <slaweq> ok
16:16:57 <slaweq> what about removing old fullstack with py27?
16:17:05 <slaweq> I think we are ready for that now
16:17:18 <njohnston> agreed, I'll push a change for that
16:17:28 <slaweq> mlavalle: haleyb: are You ok with it?
16:17:51 <haleyb> yes, i'm fine with it
16:17:58 <mlavalle> me too
16:18:02 <slaweq> great
16:18:08 <slaweq> ok, thx njohnston for working on this
16:18:26 <slaweq> #action njohnston will debug why fullstack-py36 job is using py35
16:18:47 <slaweq> #action njohnston will send a patch to remove fullstack py27 job completely
16:19:16 <slaweq> ok, and the last one is:
16:19:17 <slaweq> * slaweq will continue debugging multinode-dvr-grenade issue
16:19:39 <slaweq> I was working on this last week but I didn't find anything
16:19:59 <slaweq> I found that this issue happens very often on master branch and also on stable/pike
16:20:13 <slaweq> but I didn't find it even once on queens or rocky branches
16:20:36 <slaweq> I suspected that it may be some package which was upgraded recently or something like that
16:20:37 <bcafarel> so the "middle" branches are not affected? weird
16:21:08 <slaweq> but all such packages like ovs, libvirt, qemu and so on are in completely different versions in stable/pike and the master branch
16:21:13 <slaweq> so I don't think it's that
16:22:06 <slaweq> I was using query like: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22'%5BFail%5D%20Couldn'%5C%5C''t%20ping%20server'%5C%22
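Decoded, that dashboard URL boils down to roughly the Kibana query:

    message:"'[Fail] Couldn't ping server'"

i.e. it matches console logs where the connectivity check against the test server failed.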
16:22:25 <bcafarel> yeah a master branch change that only got backported to versions used in pike sounds strange
16:23:05 <slaweq> from last week (now in logstash): 91 failures on master, 33 on stable/pike
16:23:08 * mlavalle found the logstash query: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_status:%5C%22FAILURE%5C%22%20AND%20project:%5C%22openstack%2Fneutron%5C%22%20AND%20message:%5C%22test_attach_volume_shelved_or_offload_server%5C%22%20AND%20tags:%5C%22console%5C%22&from=7d
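Decoded, the query mlavalle found filters on:

    build_status:"FAILURE" AND project:"openstack/neutron" AND message:"test_attach_volume_shelved_or_offload_server" AND tags:"console"

limited to the last 7 days (from=7d), i.e. failed openstack/neutron builds whose console log mentions that test name.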
16:25:25 <slaweq> mlavalle: so it looks that it happens still from time to time
16:26:17 <mlavalle> slaweq: yeah but my observation last week was that, when it shows up, many other tests also fail
16:26:31 <mlavalle> so that makes me suspicious of the changes
16:26:49 <mlavalle> and that's the case for the two occurrences at the top
16:26:50 <slaweq> ok, it can be that it fails together with other tests also
16:26:54 <mlavalle> in today's result
16:27:13 <mlavalle> so I'll dig today on this
16:27:19 <slaweq> ok
16:27:51 <haleyb> mlavalle: which changes are suspicious?
16:28:25 <haleyb> oh, maybe package changes
16:28:27 <mlavalle> one example is https://review.openstack.org/#/c/601336
16:28:46 <mlavalle> it's at the top of the search today
16:29:37 <haleyb> oh, that's just WIP though
16:30:07 <mlavalle> yeah, that's why I say, I discount those
16:30:21 <mlavalle> as not valid for this bug
16:30:31 <haleyb> right
16:30:36 <mlavalle> but I will try to find a valid failure and investigate
16:30:47 <haleyb> +1
16:31:01 <slaweq> thx mlavalle for working on this
16:31:32 <mlavalle> my point is that there are fewer failures than the kibana results show
16:32:13 <mlavalle> I mean real failures
16:32:19 <slaweq> that is possible :)
16:32:32 * slaweq hopes that it's not another real issue
16:32:40 <haleyb> it could be we can add more debug commands to one of slaweq's patches too, because when it failed it was "happy" once we logged in to look around - running some more things from the console perhaps?  we just don't know where to start
16:33:14 <haleyb> like looking at routes and arp table and flows...
16:33:24 <slaweq> haleyb: are You talking about grenade issue now?
16:33:52 <haleyb> did i miss a topic change ?
16:34:00 <mlavalle> yes
16:34:04 <mlavalle> but that's ok
16:34:06 <slaweq> I think so :)
16:34:09 <slaweq> LOL
16:34:16 <mlavalle> I'm done with the other one
16:34:27 <slaweq> so Your questions about "which change" were not related to what mlavalle was talking about?
16:34:34 <haleyb> doh, last one i saw was " slaweq will continue debugging multinode-dvr-grenade issue"
16:34:50 <mlavalle> it was my fault
16:35:00 <mlavalle> I interjected the discussion with my query
16:35:11 <mlavalle> so blame it all on me
16:35:23 <mlavalle> it's always the dumb PTL anyways
16:35:34 <haleyb> :)
16:35:42 <slaweq> :)
16:35:51 <slaweq> ok, so lets go back to grenade job now
16:36:13 <slaweq> yes, when I was checking that, after logging in to the node it was all fine
16:36:38 <slaweq> and still there is one important thing - all smoke tests are passing first
16:36:50 <slaweq> and then there is this instance created and it fails
16:37:14 <slaweq> btw. as it happens also on pike - we can assume that it's not the openvswitch firewall's fault :)
16:38:05 <slaweq> haleyb: if You want to execute some more commands to debug this, I did 2 small patches:
16:38:09 <slaweq> https://review.openstack.org/#/c/602156/
16:38:14 <slaweq> and
16:38:15 <slaweq> https://review.openstack.org/#/c/602204/
16:38:22 <haleyb> right, that's a good thing.  it still could be ovs, or some config on the job side of things
16:38:38 <slaweq> feel free to update it with any new debug information You want there :)
16:38:41 <haleyb> i'll take a look and probably add a few things
16:39:31 <slaweq> for now there is a "sleep 3h" added in https://review.openstack.org/#/c/602204/10/projects/60_nova/resources.sh
16:39:46 <slaweq> and in https://review.openstack.org/#/c/602156/ there is my ssh key added
16:40:10 <slaweq> so if it fails on https://review.openstack.org/#/c/602156/ I can ssh to the master node easily without asking the infra team about that :)
16:40:38 <slaweq> if You want to debug the job with some other things, You may have to remove that sleep :)
16:40:51 <haleyb> :)
16:41:35 * slaweq will buy a beer for someone who will solve this issue :)
16:41:51 * haleyb marks it critical :)
16:41:57 <mlavalle> LOL
16:42:05 <slaweq> :D
16:42:40 <slaweq> thx haleyb for help with this one
16:42:52 <slaweq> I think we can move on to next topic then
16:43:07 <slaweq> #topic Grafana
16:43:14 <slaweq> http://grafana.openstack.org/dashboard/db/neutron-failure-rate
16:43:56 <njohnston> gate functional looks unhappy
16:45:13 <slaweq> yes
16:45:15 <slaweq> a bit
16:45:54 <mlavalle> not sure, we don't know how many runs are in the graph
16:46:00 <mlavalle> at least I don't
16:46:14 <mlavalle> but worth keeping an eye on, definitely
16:46:30 <slaweq> I can't find specific examples now
16:46:49 <slaweq> but I'm almost sure that it's again this issue with db migration tests which hits us from time to time
16:48:32 <slaweq> ok, as mlavalle said, lets keep an eye on it for now
16:49:04 <mlavalle> slaweq: do you mean this bug https://bugs.launchpad.net/neutron/+bug/1687027?
16:49:04 <openstack> Launchpad bug 1687027 in neutron "test_walk_versions tests fail with "IndexError: tuple index out of range" after timeout" [High,Confirmed] - Assigned to Miguel Lavalle (minsel)
16:49:26 <slaweq> mlavalle: yes
16:49:55 <mlavalle> slaweq: we have a query there. I'll run it today
16:50:05 <slaweq> ok
16:50:07 <mlavalle> and report in the bug my findings
16:50:33 <mlavalle> let's see if we can correlate the spike in functional with the bug
16:50:55 <slaweq> ok
16:53:04 <slaweq> apart from that there are only known issues like the grenade dvr job failures and the migration from HA test failures in the multinode dvr scenario job
16:53:24 <slaweq> so basically I think we will be good if we fix those 2 issues :)
16:53:53 <mlavalle> the HA failures are what manjeets is working on, right?
16:54:14 <slaweq> right
16:54:28 <mlavalle> ack
16:55:01 <slaweq> ok, so that's all from me for today
16:55:10 <slaweq> do You want to talk about anything else?
16:55:15 <slaweq> #topic Open discussion
16:55:19 <njohnston> I pushed a change to delete the python2 fullstack job https://review.openstack.org/605126 but I was wondering if I needed to update the neutron-fullstack-with-uwsgi job that is nonvoting experimental
16:55:41 <njohnston> I also pushed an update to grafana for fullstack: https://review.openstack.org/605128
16:55:49 <slaweq> hmm, we already merged support for uwsgi, right?
16:56:01 <mlavalle> yes, last cycle
16:56:22 <slaweq> so maybe we should do this job at least non voting but in check queue?
16:56:32 <slaweq> and switch to py36 also
16:56:34 <njohnston> is uwsgi enabled by default now, or is it just a supported config?
16:56:42 <slaweq> what You think?
16:58:01 <njohnston> note that there are also neutron-functional-with-uwsgi and neutron-tempest-with-uwsgi jobs in experimental
16:58:35 * njohnston has no opinion on uwsgi jobs
16:58:38 <slaweq> IMO we should promote it to check queue as nonvoting for now
16:59:12 <mlavalle> we can do that
16:59:12 <haleyb> njohnston: NEUTRON_DEPLOY_MOD_WSGI: True in zuul.yaml - so i'm assuming it's not default, that's tweaking some other flag
16:59:27 <slaweq> ok, we are out of time now
16:59:36 <slaweq> thx for attending the meeting
16:59:39 <mlavalle> o/
16:59:40 <njohnston> mlavalle if you agree then I'll do that and also convert them to zuulv3 syntax and add them to grafana
16:59:40 <slaweq> and see You next week
16:59:50 <slaweq> njohnston++
16:59:50 <mlavalle> njohnston: ok
16:59:54 <slaweq> #endmeeting