15:00:39 <mgoddard> #startmeeting kolla
15:00:39 <openstack> Meeting started Wed Jul 10 15:00:39 2019 UTC and is due to finish in 60 minutes. The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:42 <openstack> The meeting name has been set to 'kolla'
15:00:44 <mgoddard> #topic rollcall
15:00:47 <mgoddard> \o
15:00:56 <chason> o/
15:01:04 <Wasaac> o/
15:01:52 <kplant> o'
15:03:11 <mgoddard> I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:03:29 <mgoddard> mnasiadka: around?
15:04:07 <mgoddard> #topic agenda
15:04:10 <mgoddard> * Roll-call
15:04:12 <mgoddard> * Announcements
15:04:14 <mgoddard> ** Welcome yoctozepto, latest core reviewer
15:04:16 <mgoddard> ** Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:18 <mgoddard> ** Kayobe being added as a kolla deliverable https://review.opendev.org/669299
15:04:20 <mgoddard> ** From next week, meeting will cover kayobe
15:04:22 <mgoddard> * Review action items from last meeting
15:04:24 <mgoddard> * Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:04:26 <mgoddard> * Stein release status
15:04:28 <mgoddard> * Train release planning
15:04:30 <mgoddard> #topic announcements
15:04:33 <mgoddard> #info Welcome yoctozepto, latest core reviewer
15:04:39 <mgoddard> #info Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:53 <mgoddard> #info Kayobe being added as a kolla deliverable
15:05:00 <mgoddard> #link https://review.opendev.org/669299
15:05:08 <mnasiadka> o/ (sorry for being late)
15:05:08 <mgoddard> #info From next week, meeting will cover kayobe
15:05:14 <mgoddard> hi mnasiadka, np
15:05:24 <mgoddard> Any other announcements?
15:05:40 <mgoddard> I should probably make one more
15:05:44 <yoctozepto> o/
15:05:59 <yoctozepto> (sorry too)
15:06:28 <mgoddard> #info PTL (mgoddard) expecting a baby, due date 28th July. Will be away for ~3 weeks paternity leave
15:06:46 <mgoddard> hi yoctozepto
15:07:12 <mgoddard> I expect to be logging in from time to time to check in
15:07:22 <mgoddard> #topic Review action items from last meeting
15:07:24 <chason> mgoddard, congrats!
15:07:35 <mgoddard> :) thanks chason
15:07:41 <mgoddard> No action items last time
15:07:43 <yoctozepto> mgoddard: congrats! bless you all
15:07:58 <mgoddard> #topic Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:00 <mgoddard> :)
15:08:04 <yoctozepto> <goldyfruit> yoctozepto, I tried to install masakari client into centos7 via pip and got this error: https://paste.api-zulu.com/inajozofel.php
15:08:14 <yoctozepto> nope, I was on time the last time
15:08:20 <mgoddard> Let's check recent CI
15:08:25 <yoctozepto> sorry, wronggg pasteee
15:08:33 <yoctozepto> <mgoddard> I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:08:36 <mgoddard> #link https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:42 <yoctozepto> <yoctozepto> nope, I was on time the last time
15:09:13 <yoctozepto> (as for goldyfruit I believe the CI issue is somewhere else)
15:10:01 <mgoddard> yoctozepto: that's true, my apologies
15:10:28 <mgoddard> CI mostly green
15:10:42 <yoctozepto> yeah, we did a good job
15:10:50 <mgoddard> #topic Stein release status
15:10:56 <mgoddard> So close
15:11:11 <mgoddard> I even pushed a review to create a new RC
15:11:33 <mgoddard> But yoctozepto found some more mariadb failures, during deploy and upgrade
15:12:06 <yoctozepto> let's wait this week, I'll spend some time digging into mariadb issues - let's release anyways if we cannot fix them now, quality is already much much better
15:12:10 <mgoddard> FWIW, we also see these issues on other branches
15:12:26 <mgoddard> (well at least rocky)
15:12:40 <yoctozepto> mgoddard: yup, as if we are doing something wrong - though we are doing one thing wrong: docker stop
15:12:48 <yoctozepto> this class of errors should be gone
15:12:55 <yoctozepto> after really waiting for mysql shutdown
15:13:01 <yoctozepto> (partially amended by 60s timer)
15:13:12 <yoctozepto> though the other problem classes remain
15:13:25 <yoctozepto> mgoddard suggested it could be due to missing haproxy
15:13:51 <mgoddard> I saw the 'WSREP not ready' issue in rocky earlier
15:13:56 <yoctozepto> (is anyone else sans me and Mark taking part in this meeting? ;p )
15:14:28 <mgoddard> yoctozepto: generally it is me and one other person doing the talking, if I'm lucky :D
15:14:51 <yoctozepto> ;D
15:15:12 <mgoddard> I was trying to think if there is a way we could get haproxy involved without using an overlay
15:15:39 <mgoddard> could disable keepalived, and configure a 'VIP' on the primary
15:15:41 <yoctozepto> fwiw, the deadlocks seem to be unrelated to mariadb per se, most likely the way we run things or the way things run themmselves
15:15:43 <mgoddard> (manually)
15:15:50 <mgoddard> or just use api_interface_address
15:15:59 <mgoddard> and separate frontend and backend port
15:16:10 <kplant> mgoddard: how is that handled when the primary goes down?
15:16:14 <kplant> if vrrp isn't used to move the vip
15:16:25 <mgoddard> kplant: doesn't really matter in CI
15:16:35 <yoctozepto> kplant: we don't take down primary for cluster testing
15:16:58 <kplant> oh ok. i'll sit back down :]
15:17:12 <yoctozepto> kplant: no no, please share
15:17:17 <mgoddard> remain standing kplant :)
15:17:28 <kplant> i just meant for the current topic, no worries
15:18:05 <mgoddard> does that plan sound crazy? it does depend on being able to configure separate a listening port for all services we test, not sure if that's possible
15:18:31 <mgoddard> oh, it's not supported in rocky. That was the blocker, remember now
15:18:44 <mgoddard> so wouldn't work for upgrade jobs
15:18:54 <mgoddard> maybe we do need an overlay then
15:19:06 <yoctozepto> has there been any testing around overlay?
15:19:18 <yoctozepto> believe egonzalez was trying out something?
15:19:22 <mgoddard> that doesn't sound like something we could do quickly though
15:19:36 <yoctozepto> nope, definitely
15:19:46 <yoctozepto> and there is no guarantee it fixes mariadb
15:19:50 <mgoddard> egonzalez was looking at multinode haproxy, I don't think he tried overlay yet
15:19:52 <mgoddard> nope
15:21:22 <mgoddard> ok, yoctozepto said he would spend some time looking at mariadb this week, I will try to help where I can.
15:21:59 <yoctozepto> yeah, but no promises, I'm no mariadb-guru
15:22:07 <mgoddard> defining N jobs to run in parallel is quite good for testing flakey things
15:22:15 <yoctozepto> wish I were
15:22:26 <yoctozepto> as in ceph case
15:22:39 <yoctozepto> higher chance of failing
15:22:41 <mgoddard> tempting to just add that sleep back - perhaps we're just not waiting for the right thing
15:23:01 <yoctozepto> mgoddard: 60 sleep will help with one class of problems for sure
15:23:08 <mgoddard> right
15:23:16 <yoctozepto> but there are some which did not have the mariadb killed
15:23:26 <mgoddard> I remember priteau saying that order of shutdown matters, perhaps that is something to investigate
15:23:31 <yoctozepto> yet wsrep does not want to work later
15:23:44 <yoctozepto> as in; it works, works and does not
15:24:50 <priteau> mgoddard: well, my understanding is that order of shutdown helps to reliably keep the same node as the most advanced.
15:25:18 <priteau> But I haven't experimented with enough shutdowns to draw a conclusion from it.
15:25:48 <mgoddard> priteau: I find it hard to see how you could reliably determine the most advanced, especially if they are all the same at the point you check
15:26:30 <priteau> If you shut down all but one replica, wouldn't the remaining one be the most advanced?
15:27:54 <mgoddard> what if one you shut down was more advanced when you shut it down
15:28:31 <mgoddard> don't know enough about galera to know what it would do
15:29:19 <yoctozepto> mgoddard: so far galera has proven to work reverse to what I would expect it too
15:29:31 <yoctozepto> HF instead of HA
15:29:33 <mgoddard> how do you mean?
15:29:48 <mgoddard> :)
15:29:53 <yoctozepto> I left a note in CI errors
15:29:56 <yoctozepto> copy-pasting:
15:29:57 <yoctozepto> ^ these should be fixable by waiting for mariadb to really shutdown - else a slave might not want to recover because it is behind (seems broken to me but that's how life goes: https://stackoverflow.com/questions/54664565/unable-to-complete-sst-transfer-due-to-wsrep-sst-position-cant-be-set-in-past )
15:30:27 <yoctozepto> might be my English being bad
15:30:31 <yoctozepto> or my time understanding
15:30:47 <yoctozepto> but it does not make sense to me receive such error
15:31:04 <yoctozepto> is not that what galera should do
15:31:13 <yoctozepto> "I'm older, got newer, let's sync"
15:31:43 <yoctozepto> instead it is: "fsck, they are newer, bailing out lol"
15:32:01 <mgoddard> could be that they diverged?
15:32:14 <yoctozepto> but this one we probably fix by not forcibly crashing the containers
15:32:20 <mgoddard> i.e. both have newer than common root
15:32:27 <mgoddard> that should help
15:32:29 <yoctozepto> no idea, honestly
15:32:44 <yoctozepto> and nothing to trigger that either
15:33:04 <yoctozepto> we get that whenever we forcibly kill the process
15:33:11 <yoctozepto> not otherwise
15:33:36 <yoctozepto> and moreover it's from a slave afaik
15:33:49 <mgoddard> I did think at one point we should check docker logs for the message where it has a stop timeout and fail the job
15:34:06 <yoctozepto> maybe we have some bug in mariadb config causing this oddity
15:34:16 <yoctozepto> (thinking loudly)
15:34:33 <yoctozepto> mgoddard: but there was none
15:35:02 <yoctozepto> it kills them silently as far as CI logs go
15:35:10 <yoctozepto> or I missed something
15:35:31 <mgoddard> docker journal I think has something
15:36:23 <mgoddard> alternatively we stop using 'docker stop' and replace with 'docker kill' and a manual poll, then fail if it doesn't stop
15:36:26 <yoctozepto> Jul 08 21:53:20 primary dockerd[10281]: time="2019-07-08T21:53:20.701738294Z" level=debug msg="Sending kill signal 15 to container 90a7c430eac1ee06f1f804346c43dd0c6f9d3da714383eb57547ba23e6c443f4"
15:36:34 <yoctozepto> don't remember what 15 stands for
15:36:46 <mgoddard> sigterm
15:36:49 <mgoddard> kill is 9
15:36:56 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers https://review.opendev.org/668765
15:36:57 <yoctozepto> then odd
15:37:14 <mgoddard> no message 10 seconds later?
15:37:30 <yoctozepto> got it
15:37:31 <yoctozepto> Jul 08 22:14:31 primary dockerd[10281]: time="2019-07-08T22:14:31.250643659Z" level=debug msg="Sending kill signal 9 to container d8b893da786b16fa32a5ecfef30f12dcb5f2713765986d8c93a01ba420781ec2"
15:37:43 <yoctozepto> the log did not download fully yet, sorry
15:38:04 <yoctozepto> http://logs.openstack.org/30/669730/2/check/kolla-ansible-centos-source-upgrade-ceph/6f71867/primary/logs/system_logs/docker.txt.gz
15:38:07 <yoctozepto> ^ 4 of 9
15:38:20 <yoctozepto> i.e. 4 times kill -9
15:38:50 <mgoddard> grep 'failed to exit within' /var/log/docker.log
15:38:56 <mgoddard> or simialr
15:39:12 <yoctozepto> 3 matches somehow
15:39:40 <yoctozepto> one was a real DELETE
15:39:47 <yoctozepto> Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381314870Z" level=debug msg="Calling DELETE /v1.39/containers/nova_consoleauth?force=True&link=False&v=False"
15:39:47 <yoctozepto> Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381409920Z" level=debug msg="Sending kill signal 9 to container 5b78a8437161a5cadff4a9421495e88c51ff388268bad27110ede84e2bef827f"
15:40:00 <mgoddard> let's move on. we can continue with this after the meeting
15:40:07 <yoctozepto> yeah, exactly
15:40:21 <mgoddard> #topic Train release planning
15:40:48 <mgoddard> We seem so wrapped up in stein, I'm worried we're ignoring Train, so thought I'd add this topic
15:41:11 <mgoddard> We have lots of features on the whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:41:31 <mgoddard> Some big ones without owners
15:41:50 <mgoddard> CentOS8 / CentOS py3 depends on CentOS 8, out of our hands for now
15:41:54 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla-ansible master: WIP: Add HAcluster Ansible role https://review.opendev.org/670104
15:42:06 <mgoddard> although some work was done using fedora to test py3 tripleo
15:42:12 <mgoddard> if someone was keen to get going
15:42:37 <mgoddard> Another one is the support matrix definition
15:43:02 <mgoddard> We need someone to go through all the services & features we support, and help define a level of 'support' for them
15:43:34 <mgoddard> I started listing images
15:43:37 <mgoddard> #link https://etherpad.openstack.org/p/kolla-train-image-evaluation
15:43:52 <mgoddard> some projects we 'support' are dead :)
15:44:02 <mgoddard> some were for k8s, we can drop those
15:44:06 <openstackgerrit> Michal Nasiadka proposed openstack/kolla-ansible master: ceph-nfs: Add rpcbind to Ubuntu host bootstrap https://review.opendev.org/669315
15:44:24 <mgoddard> Anyone want to help with that?
15:44:32 <yoctozepto> mgoddard: image list != feature list
15:44:49 <mgoddard> yoctozepto: nope, that's just a start
15:44:53 <yoctozepto> ;-)
15:45:00 <mnasiadka> yoctozepto: not all images are OpenStack project related :)
15:45:02 <mgoddard> features probably more applicable to k-a
15:45:15 <yoctozepto> yeah, thought about k-a
15:45:27 <mnasiadka> so let's start with kolla - easier?
15:45:31 <mgoddard> yeah
15:45:47 <yoctozepto> yeah
15:45:57 <mgoddard> I found some low hanging fruit - images we can drop straight away
15:46:20 <openstackgerrit> Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers https://review.opendev.org/668765
15:46:29 <yoctozepto> mgoddard: yeah, mariadb
15:46:31 <mnasiadka> mgoddard: a lot of kolla-kubernetes remnants I assume
15:46:36 <mgoddard> Another feature without an owner is health checks
15:46:44 <mgoddard> yoctozepto: yeah mariadb will be dropped in train
15:46:55 <mgoddard> :p
15:47:13 <mnasiadka> mgoddard: I can take this, had a preliminary look into how tripleo does that, shouldn't be a lot of work
15:47:31 <mgoddard> And always looking to improve test coverage, so if there's a service you use that is untested, please add a test (we can help with this)
15:47:47 <mgoddard> mnasiadka: that would be good
15:48:32 <mgoddard> they're the main ones - please see the full list if you are interested in picking up a feature, and we can help explain if you need context
15:49:02 <mgoddard> anything else for train planning?
15:49:36 <mgoddard> #topic open discussion
15:49:55 <gchenuet> Hi guys ! Congrats for your work on Kolla and Kolla-ansible. Is there a relase date planned for kolla/kolla-ansible 8.0.0.0 ?
15:50:13 <goldyfruit> plop Guillaume :_
15:50:15 <goldyfruit> o/
15:50:19 <goldyfruit> gchenuet,
15:50:24 <mgoddard> gchenuet: hi, we're hoping for next week
15:50:33 <mgoddard> but it depends on some ongoing testing of mariadb
15:50:46 <mgoddard> (see earlier discussion)
15:50:53 <mgoddard> We had a bit of a discussion on monday in the (last ever) kayobe meeting about how to integrate
15:51:04 <gchenuet> goldyfruit: \o
15:51:10 <mgoddard> #link http://eavesdrop.openstack.org/meetings/kayobe/2019/kayobe.2019-07-08-14.01.log.html
15:51:24 <gchenuet> Oh cool ! Good news :)
15:51:33 <mgoddard> we thought we should combine the kayobe whiteboard into the kolla one
15:51:44 <mgoddard> #link https://etherpad.openstack.org/p/kayobe-whiteboard
15:51:47 <mgoddard> does that make sense?
15:51:58 <mgoddard> keeps things in one place
15:52:30 <mgoddard> we'll also be closing the #openstack-kayobe channel and moving in here, so expect more kayobe chatter
15:52:43 <mgoddard> the gerritbot has already been updated to push notifications here
15:53:19 <mgoddard> I'll introduce the kayobe team next week's meeting
15:53:42 <mgoddard> but I'm sure you'll have seen them all around
15:54:27 <mnasiadka> More people complaining on MariaDB - the merrier :)
15:54:39 <mgoddard> that's mostly what we do yeah
15:55:28 <mgoddard> Anything else to discuss?
15:57:32 <mgoddard> Ok, thanks everyone
15:57:36 <mgoddard> #endmeeting
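
Note on the "check docker logs and fail the job" idea from the Stein release status topic: a minimal sketch, not something agreed in the meeting. It assumes the dockerd debug log keeps the 'Sending kill signal N to container ID' wording quoted above, and that a container which received signal 15 and later signal 9 hit the stop timeout. A forced DELETE is assumed to send signal 9 without a preceding 15, as in the nova_consoleauth lines, so it is not counted. The function name and the sample lines in the usage note are made up for illustration.

```python
import re

# Matches the dockerd debug message quoted in the meeting, e.g.
#   msg="Sending kill signal 9 to container <hex id>"
# Assumption: the wording is the same in the dockerd version under test.
KILL_RE = re.compile(r"Sending kill signal (\d+) to container ([0-9a-f]+)")

SIGTERM = 15
SIGKILL = 9


def escalated_containers(log_lines):
    """Return IDs of containers that were sent SIGTERM and later SIGKILL.

    Hypothetical helper: a SIGKILL with no earlier SIGTERM (for example a
    forced DELETE) is ignored, because it is not a stop timeout.
    """
    termed = set()
    escalated = []
    for line in log_lines:
        match = KILL_RE.search(line)
        if match is None:
            continue
        signal_number = int(match.group(1))
        container_id = match.group(2)
        if signal_number == SIGTERM:
            termed.add(container_id)
        elif (signal_number == SIGKILL
              and container_id in termed
              and container_id not in escalated):
            escalated.append(container_id)
    return escalated
```

A CI step could read the collected docker log, call escalated_containers() on its lines, and fail the job if the result is non-empty. With three made-up lines (signal 15 to aaa111, signal 9 to bbb222, signal 9 to aaa111) only aaa111 would be reported.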