#openstack-kolla log

15:00:39 <mgoddard> #startmeeting kolla
15:00:39 <openstack> Meeting started Wed Jul 10 15:00:39 2019 UTC and is due to finish in 60 minutes.  The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:42 <openstack> The meeting name has been set to 'kolla'
15:00:44 <mgoddard> #topic rollcall
15:00:47 <mgoddard> \o
15:00:56 <chason> o/
15:01:04 <Wasaac> o/
15:01:52 <kplant> o/
15:03:11 <mgoddard> I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:03:29 <mgoddard> mnasiadka: around?
15:04:07 <mgoddard> #topic agenda
15:04:10 <mgoddard> * Roll-call
15:04:12 <mgoddard> * Announcements
15:04:14 <mgoddard> ** Welcome yoctozepto, latest core reviewer
15:04:16 <mgoddard> ** Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:18 <mgoddard> ** Kayobe being added as a kolla deliverable https://review.opendev.org/669299
15:04:20 <mgoddard> ** From next week, meeting will cover kayobe
15:04:22 <mgoddard> * Review action items from last meeting
15:04:24 <mgoddard> * Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:04:26 <mgoddard> * Stein release status
15:04:28 <mgoddard> * Train release planning
15:04:30 <mgoddard> #topic announcements
15:04:33 <mgoddard> #info Welcome yoctozepto, latest core reviewer
15:04:39 <mgoddard> #info Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:53 <mgoddard> #info Kayobe being added as a kolla deliverable
15:05:00 <mgoddard> #link https://review.opendev.org/669299
15:05:08 <mnasiadka> o/ (sorry for being late)
15:05:08 <mgoddard> #info From next week, meeting will cover kayobe
15:05:14 <mgoddard> hi mnasiadka, np
15:05:24 <mgoddard> Any other announcements?
15:05:40 <mgoddard> I should probably make one more
15:05:44 <yoctozepto> o/
15:05:59 <yoctozepto> (sorry too)
15:06:28 <mgoddard> #info PTL (mgoddard) expecting a baby, due date 28th July. Will be away for ~3 weeks paternity leave
15:06:46 <mgoddard> hi yoctozepto
15:07:12 <mgoddard> I expect to be logging in from time to time to check in
15:07:22 <mgoddard> #topic Review action items from last meeting
15:07:24 <chason> mgoddard, congrats!
15:07:35 <mgoddard> :) thanks chason
15:07:41 <mgoddard> No action items last time
15:07:43 <yoctozepto> mgoddard: congrats! bless you all
15:07:58 <mgoddard> #topic Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:00 <mgoddard> :)
15:08:04 <yoctozepto> <goldyfruit> yoctozepto, I tried to install masakari client into centos7 via pip and got this error: https://paste.api-zulu.com/inajozofel.php
15:08:14 <yoctozepto> nope, I was on time the last time
15:08:20 <mgoddard> Let's check recent CI
15:08:25 <yoctozepto> sorry, wronggg pasteee
15:08:33 <yoctozepto> <mgoddard> I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:08:36 <mgoddard> #link https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:42 <yoctozepto> <yoctozepto> nope, I was on time the last time
15:09:13 <yoctozepto> (as for goldyfruit I believe the CI issue is somewhere else)
15:10:01 <mgoddard> yoctozepto: that's true, my apologies
15:10:28 <mgoddard> CI mostly green
15:10:42 <yoctozepto> yeah, we did a good job
15:10:50 <mgoddard> #topic Stein release status
15:10:56 <mgoddard> So close
15:11:11 <mgoddard> I even pushed a review to create a new RC
15:11:33 <mgoddard> But yoctozepto found some more mariadb failures, during deploy and upgrade
15:12:06 <yoctozepto> let's wait this week, I'll spend some time digging into mariadb issues - let's release anyways if we cannot fix them now, quality is already much much better
15:12:10 <mgoddard> FWIW, we also see these issues on other branches
15:12:26 <mgoddard> (well at least rocky)
15:12:40 <yoctozepto> mgoddard: yup, as if we are doing something wrong - though we are doing one thing wrong: docker stop
15:12:48 <yoctozepto> this class of errors should be gone
15:12:55 <yoctozepto> after really waiting for mysql shutdown
15:13:01 <yoctozepto> (partially amended by 60s timer)
15:13:12 <yoctozepto> though the other problem classes remain
15:13:25 <yoctozepto> mgoddard suggested it could be due to missing haproxy
15:13:51 <mgoddard> I saw the 'WSREP not ready' issue in rocky earlier
15:13:56 <yoctozepto> (is anyone else sans me and Mark taking part in this meeting? ;p )
15:14:28 <mgoddard> yoctozepto: generally it is me and one other person doing the talking, if I'm lucky :D
15:14:51 <yoctozepto> ;D
15:15:12 <mgoddard> I was trying to think if there is a way we could get haproxy involved without using an overlay
15:15:39 <mgoddard> could disable keepalived, and configure a 'VIP' on the primary
15:15:41 <yoctozepto> fwiw, the deadlocks seem to be unrelated to mariadb per se, most likely the way we run things or the way things run themselves
15:15:43 <mgoddard> (manually)
15:15:50 <mgoddard> or just use api_interface_address
15:15:59 <mgoddard> and separate frontend and backend port
15:16:10 <kplant> mgoddard: how is that handled when the primary goes down?
15:16:14 <kplant> if vrrp isn't used to move the vip
15:16:25 <mgoddard> kplant: doesn't really matter in CI
15:16:35 <yoctozepto> kplant: we don't take down primary for cluster testing
15:16:58 <kplant> oh ok. i'll sit back down :]
15:17:12 <yoctozepto> kplant: no no, please share
15:17:17 <mgoddard> remain standing kplant :)
15:17:28 <kplant> i just meant for the current topic, no worries
15:18:05 <mgoddard> does that plan sound crazy? it does depend on being able to configure a separate listening port for all services we test, not sure if that's possible
15:18:31 <mgoddard> oh, it's not supported in rocky. That was the blocker, remember now
15:18:44 <mgoddard> so wouldn't work for upgrade jobs
15:18:54 <mgoddard> maybe we do need an overlay then
15:19:06 <yoctozepto> has there been any testing around overlay?
15:19:18 <yoctozepto> believe egonzalez was trying out something?
15:19:22 <mgoddard> that doesn't sound like something we could do quickly though
15:19:36 <yoctozepto> nope, definitely
15:19:46 <yoctozepto> and there is no guarantee it fixes mariadb
15:19:50 <mgoddard> egonzalez was looking at multinode haproxy, I don't think he tried overlay yet
15:19:52 <mgoddard> nope
15:21:22 <mgoddard> ok, yoctozepto said he would spend some time looking at mariadb this week, I will try to help where I can.
15:21:59 <yoctozepto> yeah, but no promises, I'm no mariadb-guru
15:22:07 <mgoddard> defining N jobs to run in parallel is quite good for testing flakey things
15:22:15 <yoctozepto> wish I were
15:22:26 <yoctozepto> as in ceph case
15:22:39 <yoctozepto> higher chance of failing
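
A rough illustration of the point above about running N copies of a job to catch flaky failures. This is a minimal sketch, assuming each run fails independently with the same probability; the function names and the example figures are hypothetical and do not come from the Kolla CI.

```python
import math


def chance_of_seeing_failure(p_fail: float, n_jobs: int) -> float:
    """Probability that at least one of n_jobs independent runs fails."""
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if n_jobs < 0:
        raise ValueError("n_jobs must not be negative")
    return 1.0 - (1.0 - p_fail) ** n_jobs


def jobs_needed(p_fail: float, confidence: float) -> int:
    """Smallest number of runs that shows a failure with the given confidence."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p_fail and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


# Assumed example: a job that fails one run in ten.
print(chance_of_seeing_failure(0.1, 10))
print(jobs_needed(0.1, 0.95))
```

With an assumed 10% failure rate, ten parallel runs would hit the failure roughly 65% of the time, which is why a single green run says little about a flaky job.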
15:22:41 <mgoddard> tempting to just add that sleep back - perhaps we're just not waiting for the right thing
15:23:01 <yoctozepto> mgoddard: 60s sleep will help with one class of problems for sure
15:23:08 <mgoddard> right
15:23:16 <yoctozepto> but there are some which did not have the mariadb killed
15:23:26 <mgoddard> I remember priteau saying that order of shutdown matters, perhaps that is something to investigate
15:23:31 <yoctozepto> yet wsrep does not want to work later
15:23:44 <yoctozepto> as in; it works, works and does not
15:24:50 <priteau> mgoddard: well, my understanding is that order of shutdown helps to reliably keep the same node as the most advanced.
15:25:18 <priteau> But I haven't experimented with enough shutdowns to draw a conclusion from it.
15:25:48 <mgoddard> priteau: I find it hard to see how you could reliably determine the most advanced, especially if they are all the same at the point you check
15:26:30 <priteau> If you shut down all but one replica, wouldn't the remaining one be the most advanced?
15:27:54 <mgoddard> what if one you shut down was more advanced when you shut it down
15:28:31 <mgoddard> don't know enough about galera to know what it would do
15:29:19 <yoctozepto> mgoddard: so far galera has proven to work the reverse of what I would expect it to
15:29:31 <yoctozepto> HF instead of HA
15:29:33 <mgoddard> how do you mean?
15:29:48 <mgoddard> :)
15:29:53 <yoctozepto> I left a note in CI errors
15:29:56 <yoctozepto> copy-pasting:
15:29:57 <yoctozepto> ^ these should be fixable by waiting for mariadb to really shut down - else a slave might not want to recover because it is behind (seems broken to me but that's how life goes: https://stackoverflow.com/questions/54664565/unable-to-complete-sst-transfer-due-to-wsrep-sst-position-cant-be-set-in-past )
15:30:27 <yoctozepto> might be my English being bad
15:30:31 <yoctozepto> or my time understanding
15:30:47 <yoctozepto> but it does not make sense to me to receive such an error
15:31:04 <yoctozepto> isn't this what galera should do:
15:31:13 <yoctozepto> "I'm older, got newer, let's sync"
15:31:43 <yoctozepto> instead it is: "fsck, they are newer, bailing out lol"
15:32:01 <mgoddard> could be that they diverged?
15:32:14 <yoctozepto> but this one we probably fix by not forcibly crashing the containers
15:32:20 <mgoddard> i.e. both have newer than common root
15:32:27 <mgoddard> that should help
15:32:29 <yoctozepto> no idea, honestly
15:32:44 <yoctozepto> and nothing to trigger that either
15:33:04 <yoctozepto> we get that whenever we forcibly kill the process
15:33:11 <yoctozepto> not otherwise
15:33:36 <yoctozepto> and moreover it's from a slave afaik
15:33:49 <mgoddard> I did think at one point we should check docker logs for the message where it has a stop timeout and fail the job
15:34:06 <yoctozepto> maybe we have some bug in mariadb config causing this oddity
15:34:16 <yoctozepto> (thinking out loud)
15:34:33 <yoctozepto> mgoddard: but there was none
15:35:02 <yoctozepto> it kills them silently as far as CI logs go
15:35:10 <yoctozepto> or I missed something
15:35:31 <mgoddard> docker journal I think has something
15:36:23 <mgoddard> alternatively we stop using 'docker stop' and replace with 'docker kill' and a manual poll, then fail if it doesn't stop
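
The idea in the line above (send the signal ourselves, poll until the container exits, fail the job if it never does) might look roughly like this. It is a sketch under assumptions, not what kolla-ansible does: the helper names, the 60 second timeout and the 1 second poll interval are all hypothetical. It sends SIGTERM via `docker kill --signal TERM` so that the process still gets a chance at a clean shutdown and nothing escalates to SIGKILL behind our back.

```python
import subprocess
import time
from typing import Callable


def container_running(name: str) -> bool:
    """Ask the docker CLI whether the named container is still running."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Running}}", name],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() == "true"


def wait_until_stopped(
    is_running: Callable[[], bool],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_running until it returns False; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if not is_running():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def stop_or_fail(name: str, timeout: float = 60.0) -> None:
    """Send SIGTERM, then fail loudly instead of silently escalating to SIGKILL."""
    subprocess.run(["docker", "kill", "--signal", "TERM", name], check=True)
    if not wait_until_stopped(lambda: container_running(name), timeout=timeout):
        raise RuntimeError(f"{name} did not stop within {timeout:.0f}s of SIGTERM")
```

Raising instead of escalating turns a silent forced kill into a visible job failure, which is the behaviour being asked for here. The `sleep` and `clock` parameters exist only so the polling loop can be exercised without a real container.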
15:36:26 <yoctozepto> Jul 08 21:53:20 primary dockerd[10281]: time="2019-07-08T21:53:20.701738294Z" level=debug msg="Sending kill signal 15 to container 90a7c430eac1ee06f1f804346c43dd0c6f9d3da714383eb57547ba23e6c443f4"
15:36:34 <yoctozepto> don't remember what 15 stands for
15:36:46 <mgoddard> sigterm
15:36:49 <mgoddard> kill is 9
15:36:56 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers  https://review.opendev.org/668765
15:36:57 <yoctozepto> then odd
15:37:14 <mgoddard> no message 10 seconds later?
15:37:30 <yoctozepto> got it
15:37:31 <yoctozepto> Jul 08 22:14:31 primary dockerd[10281]: time="2019-07-08T22:14:31.250643659Z" level=debug msg="Sending kill signal 9 to container d8b893da786b16fa32a5ecfef30f12dcb5f2713765986d8c93a01ba420781ec2"
15:37:43 <yoctozepto> the log did not download fully yet, sorry
15:38:04 <yoctozepto> http://logs.openstack.org/30/669730/2/check/kolla-ansible-centos-source-upgrade-ceph/6f71867/primary/logs/system_logs/docker.txt.gz
15:38:07 <yoctozepto> ^ 4 of 9
15:38:20 <yoctozepto> i.e. 4 times kill -9
15:38:50 <mgoddard> grep 'failed to exit within' /var/log/docker.log
15:38:56 <mgoddard> or similar
15:39:12 <yoctozepto> 3 matches somehow
15:39:40 <yoctozepto> one was a real DELETE
15:39:47 <yoctozepto> Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381314870Z" level=debug msg="Calling DELETE /v1.39/containers/nova_consoleauth?force=True&link=False&v=False"
15:39:47 <yoctozepto> Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381409920Z" level=debug msg="Sending kill signal 9 to container 5b78a8437161a5cadff4a9421495e88c51ff388268bad27110ede84e2bef827f"
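
The manual log reading above could be automated along these lines. This is a hedged sketch: it assumes dockerd is logging at debug level (as in the pasted lines) and treats a signal 9 sent to a container that was earlier sent signal 15 as a stop that timed out, while a signal 9 with no earlier signal 15 is treated as a forced removal such as the DELETE call above. That interpretation and the function name are assumptions, not something the docker log states.

```python
import re
from typing import Iterable, List, Tuple

# Matches the dockerd debug message seen in the pasted log lines.
KILL_RE = re.compile(r"Sending kill signal (\d+) to container ([0-9a-f]+)")


def classify_sigkills(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split SIGKILLed containers into (escalated after SIGTERM, killed directly)."""
    sent_sigterm = set()
    escalated: List[str] = []
    direct: List[str] = []
    for line in lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        signal_number, container_id = int(match.group(1)), match.group(2)
        if signal_number == 15:
            sent_sigterm.add(container_id)
        elif signal_number == 9:
            if container_id in sent_sigterm:
                escalated.append(container_id)
            else:
                direct.append(container_id)
    return escalated, direct
```

A CI job could feed it the docker log and fail when the first list is not empty, for example `escalated, direct = classify_sigkills(open("docker.txt"))`, where the file name is a placeholder for wherever the job collects the log.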
15:40:00 <mgoddard> let's move on. we can continue with this after the meeting
15:40:07 <yoctozepto> yeah, exactly
15:40:21 <mgoddard> #topic Train release planning
15:40:48 <mgoddard> We seem so wrapped up in stein, I'm worried we're ignoring Train, so thought I'd add this topic
15:41:11 <mgoddard> We have lots of features on the whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:41:31 <mgoddard> Some big ones without owners
15:41:50 <mgoddard> CentOS8 / CentOS py3 depends on CentOS 8, out of our hands for now
15:41:54 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla-ansible master: WIP: Add HAcluster Ansible role  https://review.opendev.org/670104
15:42:06 <mgoddard> although some work was done using fedora to test py3 tripleo
15:42:12 <mgoddard> if someone was keen to get going
15:42:37 <mgoddard> Another one is the support matrix definition
15:43:02 <mgoddard> We need someone to go through all the services & features we support, and help define a level of 'support' for them
15:43:34 <mgoddard> I started listing images
15:43:37 <mgoddard> #link https://etherpad.openstack.org/p/kolla-train-image-evaluation
15:43:52 <mgoddard> some projects we 'support' are dead :)
15:44:02 <mgoddard> some were for k8s, we can drop those
15:44:06 <openstackgerrit> Michal Nasiadka proposed openstack/kolla-ansible master: ceph-nfs: Add rpcbind to Ubuntu host bootstrap  https://review.opendev.org/669315
15:44:24 <mgoddard> Anyone want to help with that?
15:44:32 <yoctozepto> mgoddard: image list != feature list
15:44:49 <mgoddard> yoctozepto: nope, that's just a start
15:44:53 <yoctozepto> ;-)
15:45:00 <mnasiadka> yoctozepto: not all images are OpenStack project related :)
15:45:02 <mgoddard> features probably more applicable to k-a
15:45:15 <yoctozepto> yeah, thought about k-a
15:45:27 <mnasiadka> so let's start with kolla - easier?
15:45:31 <mgoddard> yeah
15:45:47 <yoctozepto> yeah
15:45:57 <mgoddard> I found some low hanging fruit - images we can drop straight away
15:46:20 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers  https://review.opendev.org/668765
15:46:29 <yoctozepto> mgoddard: yeah, mariadb
15:46:31 <mnasiadka> mgoddard: a lot of kolla-kubernetes remnants I assume
15:46:36 <mgoddard> Another feature without an owner is health checks
15:46:44 <mgoddard> yoctozepto: yeah mariadb will be dropped in train
15:46:55 <mgoddard> :p
15:47:13 <mnasiadka> mgoddard: I can take this, had a preliminary look into how tripleo does that, shouldn't be a lot of work
15:47:31 <mgoddard> And always looking to improve test coverage, so if there's a service you use that is untested, please add a test (we can help with this)
15:47:47 <mgoddard> mnasiadka: that would be good
15:48:32 <mgoddard> they're the main ones - please see the full list if you are interested in picking up a feature, and we can help explain if you need context
15:49:02 <mgoddard> anything else for train planning?
15:49:36 <mgoddard> #topic open discussion
15:49:55 <gchenuet> Hi guys! Congrats on your work on Kolla and Kolla-ansible. Is there a release date planned for kolla/kolla-ansible 8.0.0.0?
15:50:13 <goldyfruit> plop Guillaume :_
15:50:15 <goldyfruit> o/
15:50:19 <goldyfruit> gchenuet,
15:50:24 <mgoddard> gchenuet: hi, we're hoping for next week
15:50:33 <mgoddard> but it depends on some ongoing testing of mariadb
15:50:46 <mgoddard> (see earlier discussion)
15:50:53 <mgoddard> We had a bit of a discussion on monday in the (last ever) kayobe meeting about how to integrate
15:51:04 <gchenuet> goldyfruit: \o
15:51:10 <mgoddard> #link http://eavesdrop.openstack.org/meetings/kayobe/2019/kayobe.2019-07-08-14.01.log.html
15:51:24 <gchenuet> Oh cool! Good news :)
15:51:33 <mgoddard> we thought we should combine the kayobe whiteboard into the kolla one
15:51:44 <mgoddard> #link https://etherpad.openstack.org/p/kayobe-whiteboard
15:51:47 <mgoddard> does that make sense?
15:51:58 <mgoddard> keeps things in one place
15:52:30 <mgoddard> we'll also be closing the #openstack-kayobe channel and moving in here, so expect more kayobe chatter
15:52:43 <mgoddard> the gerritbot has already been updated to push notifications here
15:53:19 <mgoddard> I'll introduce the kayobe team at next week's meeting
15:53:42 <mgoddard> but I'm sure you'll have seen them all around
15:54:27 <mnasiadka> The more people complaining about MariaDB, the merrier :)
15:54:39 <mgoddard> that's mostly what we do yeah
15:55:28 <mgoddard> Anything else to discuss?
15:57:32 <mgoddard> Ok, thanks everyone
15:57:36 <mgoddard> #endmeeting