15:00:39 #startmeeting kolla
15:00:39 Meeting started Wed Jul 10 15:00:39 2019 UTC and is due to finish in 60 minutes. The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:00:40 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:00:42 The meeting name has been set to 'kolla'
15:00:44 #topic rollcall
15:00:47 \o
15:00:56 o/
15:01:04 o/
15:01:52 o'
15:03:11 I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:03:29 mnasiadka: around?
15:04:07 #topic agenda
15:04:10 * Roll-call
15:04:12 * Announcements
15:04:14 ** Welcome yoctozepto, latest core reviewer
15:04:16 ** Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:18 ** Kayobe being added as a kolla deliverable https://review.opendev.org/669299
15:04:20 ** From next week, meeting will cover kayobe
15:04:22 * Review action items from last meeting
15:04:24 * Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:04:26 * Stein release status
15:04:28 * Train release planning
15:04:30 #topic announcements
15:04:33 #info Welcome yoctozepto, latest core reviewer
15:04:39 #info Removed zhubingbing and Duong Ha-Quang from core reviewer list
15:04:53 #info Kayobe being added as a kolla deliverable
15:05:00 #link https://review.opendev.org/669299
15:05:08 o/ (sorry for being late)
15:05:08 #info From next week, meeting will cover kayobe
15:05:14 hi mnasiadka, np
15:05:24 Any other announcements?
15:05:40 I should probably make one more
15:05:44 o/
15:05:59 (sorry too)
15:06:28 #info PTL (mgoddard) expecting a baby, due date 28th July. Will be away for ~3 weeks paternity leave
15:06:46 hi yoctozepto
15:07:12 I expect to be logging in from time to time to check in
15:07:22 #topic Review action items from last meeting
15:07:24 mgoddard, congrats!
15:07:35 :) thanks chason
15:07:41 No action items last time
15:07:43 mgoddard: congrats!
bless you all
15:07:58 #topic Kolla whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:00 :)
15:08:04 yoctozepto, I tried to install masakari client into centos7 via pip and got this error: https://paste.api-zulu.com/inajozofel.php
15:08:14 nope, I was on time the last time
15:08:20 Let's check recent CI
15:08:25 sorry, wronggg pasteee
15:08:33 I think this meeting conflicts with yoctozepto's commute or dinner arrangements - he's always active either side of it but not during :)
15:08:36 #link https://etherpad.openstack.org/p/KollaWhiteBoard
15:08:42 nope, I was on time the last time
15:09:13 (as for goldyfruit I believe the CI issue is somewhere else)
15:10:01 yoctozepto: that's true, my apologies
15:10:28 CI mostly green
15:10:42 yeah, we did a good job
15:10:50 #topic Stein release status
15:10:56 So close
15:11:11 I even pushed a review to create a new RC
15:11:33 But yoctozepto found some more mariadb failures, during deploy and upgrade
15:12:06 let's wait this week, I'll spend some time digging into mariadb issues - let's release anyways if we cannot fix them now, quality is already much much better
15:12:10 FWIW, we also see these issues on other branches
15:12:26 (well at least rocky)
15:12:40 mgoddard: yup, as if we are doing something wrong - though we are doing one thing wrong: docker stop
15:12:48 this class of errors should be gone
15:12:55 after really waiting for mysql shutdown
15:13:01 (partially amended by 60s timer)
15:13:12 though the other problem classes remain
15:13:25 mgoddard suggested it could be due to missing haproxy
15:13:51 I saw the 'WSREP not ready' issue in rocky earlier
15:13:56 (is anyone else sans me and Mark taking part in this meeting?
;p )
15:14:28 yoctozepto: generally it is me and one other person doing the talking, if I'm lucky :D
15:14:51 ;D
15:15:12 I was trying to think if there is a way we could get haproxy involved without using an overlay
15:15:39 could disable keepalived, and configure a 'VIP' on the primary
15:15:41 fwiw, the deadlocks seem to be unrelated to mariadb per se, most likely the way we run things or the way things run themselves
15:15:43 (manually)
15:15:50 or just use api_interface_address
15:15:59 and separate frontend and backend port
15:16:10 mgoddard: how is that handled when the primary goes down?
15:16:14 if vrrp isn't used to move the vip
15:16:25 kplant: doesn't really matter in CI
15:16:35 kplant: we don't take down primary for cluster testing
15:16:58 oh ok. i'll sit back down :]
15:17:12 kplant: no no, please share
15:17:17 remain standing kplant :)
15:17:28 i just meant for the current topic, no worries
15:18:05 does that plan sound crazy? it does depend on being able to configure a separate listening port for all services we test, not sure if that's possible
15:18:31 oh, it's not supported in rocky. That was the blocker, remember now
15:18:44 so wouldn't work for upgrade jobs
15:18:54 maybe we do need an overlay then
15:19:06 has there been any testing around overlay?
15:19:18 believe egonzalez was trying out something?
15:19:22 that doesn't sound like something we could do quickly though
15:19:36 nope, definitely
15:19:46 and there is no guarantee it fixes mariadb
15:19:50 egonzalez was looking at multinode haproxy, I don't think he tried overlay yet
15:19:52 nope
15:21:22 ok, yoctozepto said he would spend some time looking at mariadb this week, I will try to help where I can.
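The "VIP without keepalived" plan discussed above could look roughly like the following haproxy fragment - a sketch only, assuming an illustrative address in place of api_interface_address and an illustrative backend port (neither is a kolla-ansible default); the point is the separate frontend and backend ports so haproxy can bind on the same host as mariadb without VRRP moving the VIP:

```
# Hypothetical sketch of the plan discussed above: haproxy binds the
# frontend port on the primary's api_interface_address and proxies to
# mariadb backends listening on a different port. Addresses and ports
# are illustrative, not kolla-ansible defaults.
listen mariadb
    mode tcp
    bind 192.0.2.10:3306                  # frontend: what clients connect to
    balance first                          # prefer the first available backend
    timeout client 3600s
    timeout server 3600s
    server primary   192.0.2.10:4306 check inter 2000 rise 2 fall 5
    server secondary 192.0.2.11:4306 check inter 2000 rise 2 fall 5 backup
```

As noted in the discussion, this only works if every service under test can be configured with a separate listening port, which was the blocker on rocky.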
15:21:59 yeah, but no promises, I'm no mariadb-guru
15:22:07 defining N jobs to run in parallel is quite good for testing flakey things
15:22:15 wish I were
15:22:26 as in ceph case
15:22:39 higher chance of failing
15:22:41 tempting to just add that sleep back - perhaps we're just not waiting for the right thing
15:23:01 mgoddard: 60 sleep will help with one class of problems for sure
15:23:08 right
15:23:16 but there are some which did not have the mariadb killed
15:23:26 I remember priteau saying that order of shutdown matters, perhaps that is something to investigate
15:23:31 yet wsrep does not want to work later
15:23:44 as in; it works, works and does not
15:24:50 mgoddard: well, my understanding is that order of shutdown helps to reliably keep the same node as the most advanced.
15:25:18 But I haven't experimented with enough shutdowns to draw a conclusion from it.
15:25:48 priteau: I find it hard to see how you could reliably determine the most advanced, especially if they are all the same at the point you check
15:26:30 If you shut down all but one replica, wouldn't the remaining one be the most advanced?
15:27:54 what if one you shut down was more advanced when you shut it down
15:28:31 don't know enough about galera to know what it would do
15:29:19 mgoddard: so far galera has proven to work reverse to what I would expect it to
15:29:31 HF instead of HA
15:29:33 how do you mean?
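On the "most advanced node" question above: Galera records the last committed seqno in each node's grastate.dat, so comparing those files is one way to answer it after a full shutdown (a node that crashed shows seqno: -1 and needs `mysqld --wsrep-recover` to find its real position). A minimal sketch of that comparison, with hypothetical file paths:

```shell
#!/bin/sh
# Hedged sketch: given copies of each node's grastate.dat, print the file
# (node) holding the highest committed seqno. Paths are hypothetical; the
# real file normally lives at /var/lib/mysql/grastate.dat on each node.
highest_seqno_file() {
    best_file=""
    best_seqno=-1
    for f in "$@"; do
        # grastate.dat carries a line like "seqno:   1234"
        seqno=$(awk '/^seqno:/ {print $2}' "$f")
        seqno=${seqno:--1}                  # treat a missing line as unknown
        if [ "$seqno" -gt "$best_seqno" ]; then
            best_seqno=$seqno
            best_file=$f
        fi
    done
    echo "$best_file"
}
```

A seqno of -1 in this scheme never wins, which matches the intuition in the discussion: an unclean shutdown leaves the node's position unknown until recovered.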
15:29:48 :)
15:29:53 I left a note in CI errors
15:29:56 copy-pasting:
15:29:57 ^ these should be fixable by waiting for mariadb to really shut down - else a slave might not want to recover because it is behind (seems broken to me but that's how life goes: https://stackoverflow.com/questions/54664565/unable-to-complete-sst-transfer-due-to-wsrep-sst-position-cant-be-set-in-past )
15:30:27 might be my English being bad
15:30:31 or my time understanding
15:30:47 but it does not make sense to me to receive such an error
15:31:04 is not that what galera should do
15:31:13 "I'm older, got newer, let's sync"
15:31:43 instead it is: "fsck, they are newer, bailing out lol"
15:32:01 could be that they diverged?
15:32:14 but this one we probably fix by not forcibly crashing the containers
15:32:20 i.e. both have newer than common root
15:32:27 that should help
15:32:29 no idea, honestly
15:32:44 and nothing to trigger that either
15:33:04 we get that whenever we forcibly kill the process
15:33:11 not otherwise
15:33:36 and moreover it's from a slave afaik
15:33:49 I did think at one point we should check docker logs for the message where it has a stop timeout and fail the job
15:34:06 maybe we have some bug in mariadb config causing this oddity
15:34:16 (thinking loudly)
15:34:33 mgoddard: but there was none
15:35:02 it kills them silently as far as CI logs go
15:35:10 or I missed something
15:35:31 docker journal I think has something
15:36:23 alternatively we stop using 'docker stop' and replace with 'docker kill' and a manual poll, then fail if it doesn't stop
15:36:26 Jul 08 21:53:20 primary dockerd[10281]: time="2019-07-08T21:53:20.701738294Z" level=debug msg="Sending kill signal 15 to container 90a7c430eac1ee06f1f804346c43dd0c6f9d3da714383eb57547ba23e6c443f4"
15:36:34 don't remember what 15 stands for
15:36:46 sigterm
15:36:49 kill is 9
15:36:56 Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers https://review.opendev.org/668765
15:36:57 then odd
15:37:14
no message 10 seconds later?
15:37:30 got it
15:37:31 Jul 08 22:14:31 primary dockerd[10281]: time="2019-07-08T22:14:31.250643659Z" level=debug msg="Sending kill signal 9 to container d8b893da786b16fa32a5ecfef30f12dcb5f2713765986d8c93a01ba420781ec2"
15:37:43 the log did not download fully yet, sorry
15:38:04 http://logs.openstack.org/30/669730/2/check/kolla-ansible-centos-source-upgrade-ceph/6f71867/primary/logs/system_logs/docker.txt.gz
15:38:07 ^ 4 of 9
15:38:20 i.e. 4 times kill -9
15:38:50 grep 'failed to exit within' /var/log/docker.log
15:38:56 or similar
15:39:12 3 matches somehow
15:39:40 one was a real DELETE
15:39:47 Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381314870Z" level=debug msg="Calling DELETE /v1.39/containers/nova_consoleauth?force=True&link=False&v=False"
15:39:47 Jul 08 22:15:50 primary dockerd[10281]: time="2019-07-08T22:15:50.381409920Z" level=debug msg="Sending kill signal 9 to container 5b78a8437161a5cadff4a9421495e88c51ff388268bad27110ede84e2bef827f"
15:40:00 let's move on.
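The "docker kill and a manual poll" idea from the discussion could be sketched as below. To stay self-contained it works on a bare pid; for containers you would swap in `docker kill --signal TERM <name>` and check `docker inspect -f '{{.State.Running}}' <name>` instead. Signal 15 in the dockerd log is SIGTERM, which `docker stop` sends first; signal 9 is the SIGKILL it escalates to after its timeout:

```shell
#!/bin/sh
# Hedged sketch: send SIGTERM, then poll for up to $2 seconds for the
# process to exit. Returns 0 on a clean shutdown and 1 if the grace
# period runs out - the point where the discussion suggests failing the
# CI job, instead of docker silently escalating to SIGKILL (signal 9).
term_and_wait() {
    pid=$1
    timeout=$2
    kill -TERM "$pid" 2>/dev/null
    i=0
    while [ "$i" -lt "$timeout" ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # gone: clean shutdown
        sleep 1
        i=$((i + 1))
    done
    return 1   # still running after the grace period: fail the job here
}
```

The difference from plain `docker stop -t N` is only that the timeout becomes an explicit failure the CI can report, rather than a kill -9 buried in dockerd debug logs.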
we can continue with this after the meeting
15:40:07 yeah, exactly
15:40:21 #topic Train release planning
15:40:48 We seem so wrapped up in stein, I'm worried we're ignoring Train, so thought I'd add this topic
15:41:11 We have lots of features on the whiteboard https://etherpad.openstack.org/p/KollaWhiteBoard
15:41:31 Some big ones without owners
15:41:50 CentOS8 / CentOS py3 depends on CentOS 8, out of our hands for now
15:41:54 Gaëtan Trellu proposed openstack/kolla-ansible master: WIP: Add HAcluster Ansible role https://review.opendev.org/670104
15:42:06 although some work was done using fedora to test py3 tripleo
15:42:12 if someone was keen to get going
15:42:37 Another one is the support matrix definition
15:43:02 We need someone to go through all the services & features we support, and help define a level of 'support' for them
15:43:34 I started listing images
15:43:37 #link https://etherpad.openstack.org/p/kolla-train-image-evaluation
15:43:52 some projects we 'support' are dead :)
15:44:02 some were for k8s, we can drop those
15:44:06 Michal Nasiadka proposed openstack/kolla-ansible master: ceph-nfs: Add rpcbind to Ubuntu host bootstrap https://review.opendev.org/669315
15:44:24 Anyone want to help with that?
15:44:32 mgoddard: image list != feature list
15:44:49 yoctozepto: nope, that's just a start
15:44:53 ;-)
15:45:00 yoctozepto: not all images are OpenStack project related :)
15:45:02 features probably more applicable to k-a
15:45:15 yeah, thought about k-a
15:45:27 so let's start with kolla - easier?
15:45:31 yeah
15:45:47 yeah
15:45:57 I found some low hanging fruit - images we can drop straight away
15:46:20 Gaëtan Trellu proposed openstack/kolla master: Add HAcluster containers https://review.opendev.org/668765
15:46:29 mgoddard: yeah, mariadb
15:46:31 mgoddard: a lot of kolla-kubernetes remnants I assume
15:46:36 Another feature without an owner is health checks
15:46:44 yoctozepto: yeah mariadb will be dropped in train
15:46:55 :p
15:47:13 mgoddard: I can take this, had a preliminary look into how tripleo does that, shouldn't be a lot of work
15:47:31 And always looking to improve test coverage, so if there's a service you use that is untested, please add a test (we can help with this)
15:47:47 mnasiadka: that would be good
15:48:32 they're the main ones - please see the full list if you are interested in picking up a feature, and we can help explain if you need context
15:49:02 anything else for train planning?
15:49:36 #topic open discussion
15:49:55 Hi guys ! Congrats for your work on Kolla and Kolla-ansible. Is there a release date planned for kolla/kolla-ansible 8.0.0.0 ?
15:50:13 plop Guillaume :_
15:50:15 o/
15:50:19 gchenuet,
15:50:24 gchenuet: hi, we're hoping for next week
15:50:33 but it depends on some ongoing testing of mariadb
15:50:46 (see earlier discussion)
15:50:53 We had a bit of a discussion on monday in the (last ever) kayobe meeting about how to integrate
15:51:04 goldyfruit: \o
15:51:10 #link http://eavesdrop.openstack.org/meetings/kayobe/2019/kayobe.2019-07-08-14.01.log.html
15:51:24 Oh cool ! Good news :)
15:51:33 we thought we should combine the kayobe whiteboard into the kolla one
15:51:44 #link https://etherpad.openstack.org/p/kayobe-whiteboard
15:51:47 does that make sense?
15:51:58 keeps things in one place
15:52:30 we'll also be closing the #openstack-kayobe channel and moving in here, so expect more kayobe chatter
15:52:43 the gerritbot has already been updated to push notifications here
15:53:19 I'll introduce the kayobe team at next week's meeting
15:53:42 but I'm sure you'll have seen them all around
15:54:27 More people complaining on MariaDB - the merrier :)
15:54:39 that's mostly what we do yeah
15:55:28 Anything else to discuss?
15:57:32 Ok, thanks everyone
15:57:36 #endmeeting