09:04:04 <flwang1> #startmeeting magnum
09:04:05 <openstack> Meeting started Wed Sep 23 09:04:04 2020 UTC and is due to finish in 60 minutes.  The chair is flwang1. Information about MeetBot at http://wiki.debian.org/MeetBot.
09:04:06 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
09:04:08 <openstack> The meeting name has been set to 'magnum'
09:04:18 <flwang1> brtknr: meeting?
09:04:24 <brtknr> sure
09:04:28 <brtknr> o/
09:04:41 <flwang1> o/
09:04:56 <flwang1> Spyros said he would try to join, but it seems he is not online
09:05:20 <flwang1> as for hyperkube vs binary, i'm thinking we could keep supporting both
09:05:21 <jakeyip> o/
09:05:26 <flwang1> hi jakeyip
09:05:34 <brtknr> o/ jakeyip
09:05:49 <brtknr> i am happy to support both
09:06:24 <strigazi> flwang1: o/
09:06:38 <flwang1> strigazi: hey
09:06:50 <flwang1> #topic hyperkube vs binary
09:07:11 <flwang1> strigazi: i'm thinking we could support both, do you think it's a bad idea?
09:07:44 <strigazi> flwang1: no, i think we have to
09:07:59 <brtknr> o/ strigazi
09:08:06 <strigazi> flwang1: well it is bad, but we have to :)
09:08:11 <strigazi> brtknr: hello
09:08:29 <flwang1> when i say support both, i mean something like what we did before with the atomic system container and hyperkube
09:09:08 <flwang1> however, that will make the upgrade script a bit messy
09:09:16 <brtknr> do we allow the possibility of going from hyperkube -> binary?
09:09:34 <strigazi> brtknr: for running clusters we can't do it
09:09:59 <flwang1> strigazi: why can't we?
09:10:13 <brtknr> because the upgrade scripts are baked into the heat template?
09:10:49 <strigazi> brtknr: yes
09:11:08 <flwang1> strigazi: i see the point
09:11:31 <strigazi> flwang1: brtknr: what we can do is deploy a software_deployment from another template, but we would lose track of so much
09:11:51 <strigazi> the good part is that with NGs at least, new NGs in running clusters will get the binary
09:12:19 <flwang1> if that's the case, we will have to support hyperkube
09:12:40 <brtknr> i noticed it yesterday when i tried to bootstrap some new nodes with the virtio-scsi fix: https://review.opendev.org/#/c/753312/
09:12:44 <flwang1> and from an operations pov, the cloud provider can decide when to fully switch to binary
09:12:52 <brtknr> it didn't work with an existing cluster, only with a new cluster
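For reference: as noted above, new nodegroups are stacked from the current templates, so they can pick up changes that nodes in an existing cluster will not. A minimal sketch, with made-up cluster/nodegroup names and flavor:

    openstack coe nodegroup create k8s-cluster-1 new-workers \
        --node-count 2 \
        --flavor m1.large
    openstack coe nodegroup list k8s-cluster-1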
09:13:40 <strigazi> flwang1: sounds reasonable, at least for new clusters we should avoid hyperkube
09:13:43 <flwang1> ok, so can we make an agreement that we will have to support both?
09:14:17 <flwang1> like atomic system container and hyperkube
09:14:23 <brtknr> can we allow the possibility of choosing the hyperkube source? i don't mind defaulting to k8s.gcr.io
09:14:25 <flwang1> that's a bit funny for me
09:15:40 <flwang1> brtknr: i don't mind supporting that
09:15:42 <brtknr> it doesn't affect the functionality, just gives operators more flexibility
09:15:51 <strigazi> brtknr: flwang1: we can add a new label for rancher
09:15:52 <flwang1> brtknr: yep, i understand
09:16:10 <strigazi> for people with their own registry it is pointless
09:16:15 <flwang1> strigazi: yep, that's basically what brtknr proposed
09:16:30 <flwang1> strigazi: true
09:16:57 <strigazi> it is just that it doesn't work with the private registry override we have. like who wins?
09:17:15 <strigazi> brtknr: you will honestly rely on rancher?
09:18:17 <strigazi> brtknr: maybe do it like helm? repository and tag?
09:19:17 <brtknr> strigazi: do you see problems with using rancher?
09:20:24 <strigazi> brtknr: it is a competitor project and product, not sure about the license and how community driven it is. From our (magnum) side, we can make the option generic and not default to it
09:21:03 <strigazi> if someone opts in for it, they are on their own
09:21:11 <brtknr> strigazi: i understand, i will revert the change to default to it
09:21:35 <brtknr> what do you mean by > maybe do it like helm? repository and tag?
09:21:56 <strigazi> brtknr: instead of prefix, have the full thing as a label
09:22:13 <strigazi> repository: index.docker.io/rancher/foo
09:22:30 <strigazi> tag: vX.Y.Z-bar
09:22:52 <brtknr> and not limit to calling it hyperkube?
09:22:58 <strigazi> the tag we already have as kube_tag
09:23:02 <flwang1> i think we can reuse our current kube_tag, we just need a new label for the repo, right?
09:23:05 <openstackgerrit> Merged openstack/magnum-ui stable/victoria: Update .gitreview for stable/victoria  https://review.opendev.org/753179
09:23:06 <openstackgerrit> Merged openstack/magnum-ui stable/victoria: Update TOX_CONSTRAINTS_FILE for stable/victoria  https://review.opendev.org/753180
09:23:38 <strigazi> brtknr: yeah, since it's a new label, why limit it? the community has a high rate of changing things
09:23:50 <strigazi> flwang1: yes
09:24:13 <brtknr> okay okay okay sdasdasd
09:24:17 <strigazi> smth like hyperkube_image? hyperkube_repo or _repository
09:24:24 <brtknr> sdasdasds
09:24:26 <brtknr> sdsds
09:24:28 <brtknr> qweqwewqeqe
09:24:34 <brtknr> oops sorry
09:24:38 <strigazi> cat?
09:24:38 <jakeyip> lol
09:24:49 <openstackgerrit> Merged openstack/magnum-ui master: Update master for stable/victoria  https://review.opendev.org/753181
09:24:49 <openstackgerrit> Merged openstack/magnum-ui master: Add Python3 wallaby unit tests  https://review.opendev.org/753182
09:24:51 <brtknr> my keyboard was unresponsive
09:25:03 <strigazi> I wanted to be cat :)
09:25:22 <brtknr> lol
09:25:37 <brtknr> okay i'll change it to hyperkube repo
09:25:44 <strigazi> So we agree on it right?
09:26:37 <flwang1> strigazi: +1 are you going to polish the patch?
09:26:37 <strigazi> Sorry team, I have time for one more quick subject, I need to leave in 10'. A family thing, everything is ok.
09:26:45 <strigazi> flwang1: yes
09:27:01 <flwang1> strigazi: cool, i will take a look at the upgrade part
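For reference, a sketch of how the proposed label pair could look on a cluster template; hyperkube_image is only one of the candidate names floated above, and the image, network, flavor and tag values are made-up examples:

    openstack coe cluster template create k8s-binary-template \
        --coe kubernetes \
        --image fedora-coreos-32 \
        --external-network public \
        --flavor m1.medium --master-flavor m1.medium \
        --labels hyperkube_image=index.docker.io/rancher/hyperkube,kube_tag=v1.18.9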
09:27:11 <flwang1> strigazi: i have an interesting topic
09:27:16 <flwang1> the health monitoring of the heat-container-agent
09:27:30 <flwang1> i have seen several cases of a dead heat-container-agent in our production
09:27:36 <flwang1> which caused upgrades to fail
09:27:57 <strigazi> flwang1: I can change to brtknr's suggestion to use the server tarball.
09:28:15 <flwang1> who will build the tarball?
09:28:19 <flwang1> and where to host it?
09:28:20 <strigazi> flwang1: will the monitor help?
09:28:37 <strigazi> flwang1: already there, built by the community
09:28:41 <flwang1> podman does support a health-check command, but i have no idea what a good check command would be
09:28:44 <brtknr> and use kube-* binaries from the server tarball?
09:29:05 <flwang1> strigazi: if there is a tarball from community, then i'm happy to use it
09:29:16 <brtknr> what about kube-proxy? you mentioned you had issues getting kube-proxy working?
09:29:30 <strigazi> brtknr: only the kubelet binary; for the rest, the images. They are all included in the tarball. I will push, we can take it in gerrit
09:29:42 <flwang1> the tarball will include kubelet and the container images of the kube components, right?
09:29:43 <strigazi> brtknr: works now
09:29:50 <strigazi> flwang1: yes
09:30:04 <flwang1> then i will be very happy
09:30:11 <brtknr> cool!
09:30:19 <flwang1> because it can resolve the digest issue
09:30:22 <flwang1> properly
09:30:25 <strigazi> flwang1: exactly
09:30:34 <flwang1> very good
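For reference, a rough sketch of fetching the community-built server tarball strigazi describes; the version is an example and would normally come from kube_tag:

    KUBE_TAG=v1.18.9
    curl -LO "https://dl.k8s.io/${KUBE_TAG}/kubernetes-server-linux-amd64.tar.gz"
    tar -xzf kubernetes-server-linux-amd64.tar.gz
    ls kubernetes/server/bin/
    # kubelet and kubectl binaries, plus kube-apiserver.tar, kube-controller-manager.tar,
    # kube-scheduler.tar and kube-proxy.tar image archives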
09:30:39 <flwang1> move on?
09:30:44 <strigazi> yes
09:30:55 <flwang1> #topic health check of heat-container-agent
09:31:02 <strigazi> flwang1: you want smth to trigger the unit restart?
09:31:11 <flwang1> strigazi: exactly
09:31:18 <strigazi> sounds good to me
09:31:22 <brtknr> seems non-controversial
09:31:28 <brtknr> next topic? :P
09:31:29 <flwang1> because based on my experience, a restart just fixes it
09:31:55 <strigazi> "did you turn it off and on?"
09:32:10 <flwang1> turn what off and on?
09:32:25 <strigazi> yeap, the fix for everything
09:32:26 <brtknr> vm
09:32:41 <flwang1> no, i don't want to restart the vm
09:32:45 <flwang1> just the heat-container-agent
09:32:57 <flwang1> because the node is working ok
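A minimal sketch of a restart policy on the agent's systemd unit, assuming it runs as heat-container-agent.service on the node. Note this only fires when the unit actually exits; a hung-but-running agent would still need a health check to be marked failed:

    sudo mkdir -p /etc/systemd/system/heat-container-agent.service.d
    sudo tee /etc/systemd/system/heat-container-agent.service.d/10-restart.conf <<'EOF'
    [Service]
    Restart=on-failure
    RestartSec=30
    EOF
    sudo systemctl daemon-reload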
09:33:15 <brtknr> please propose a patch, i don't know how this health check would work
09:33:32 <strigazi> flwang1: was the process dead?
09:33:44 <strigazi> the container was still running?
09:33:53 <brtknr> which heat container agent tag are you using?
09:34:02 <flwang1> i only have a rough idea; podman does support a health-check command, but i don't have any idea how to check the health of heat-container-agent
09:34:14 <flwang1> the process was not dead
09:34:19 <strigazi> in theory we shouldn't
09:35:17 <flwang1> i can't remember the details because it was a customer ticket; the only thing i could see was that 'journalctl -u heat-container-agent' showed nothing
09:35:27 <strigazi> i think improvements in the container or the systemd unit are welcome, right?
09:35:29 <flwang1> train-stable-1
09:35:43 <flwang1> strigazi: i think so
09:35:57 <brtknr> +1
09:36:03 <flwang1> as i said above, i just want to open my mind by getting some input from you guys
09:36:33 <brtknr> anything in the HCA logs before it dies?
09:37:04 <flwang1> brtknr: how do i see the logs from before it got stuck?
09:37:08 <flwang1> i don't think i can
09:37:17 <flwang1> it's rotated i think
09:37:20 <brtknr> inside /var/log/heat-config/
09:37:43 <flwang1> that's normal i would say
09:37:56 <strigazi> flwang1: just check if os-collect-config is running
09:38:06 <flwang1> for some reason, it just died after being idle for a long time
09:38:09 <strigazi> ps aux | grep os-collect-config
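For illustration only, podman's built-in health-check options could wrap that same os-collect-config check; the exact check command, the simplified run flags and the image tag here are assumptions, not a tested setup:

    # real deployments pass many more flags (mounts, --net=host, --privileged);
    # this only illustrates the health-check options
    podman run -d --name heat-container-agent \
        --health-cmd 'pgrep -f os-collect-config || exit 1' \
        --health-interval 60s --health-retries 3 \
        docker.io/openstackmagnum/heat-container-agent:train-stable-1
    # run the check on demand; exits non-zero when unhealthy
    podman healthcheck run heat-container-agent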
09:38:20 <flwang1> i can't do it now
09:38:25 <brtknr> it is not dying during script execution?
09:38:34 <flwang1> because the ticket has been resolved and it's a customer env
09:38:41 <flwang1> brtknr: it's not
09:38:54 <strigazi> flwang1: in a running cluster try to kill -9 the os-collect-config process
09:38:58 <flwang1> it's an old cluster, something like 4-5 months
09:39:19 <strigazi> see if the container/systemd unit will log smth
09:39:28 <strigazi> we can start there
09:39:36 <flwang1> strigazi: good point, i will try
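The experiment strigazi suggests, spelled out as a sketch to run on a node of a throwaway cluster:

    pgrep -af os-collect-config            # confirm the process is there
    sudo pkill -9 -f os-collect-config     # simulate the silent death
    sleep 5
    systemctl status heat-container-agent --no-pager      # did the unit notice?
    journalctl -u heat-container-agent -n 50 --no-pager   # did systemd/podman log anything?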
09:39:51 <flwang1> i think i got something to start
09:40:08 <flwang1> brtknr: strigazi: anything else you want to discuss?
09:40:31 <brtknr> yes
09:40:39 <brtknr> can you guys take a look at this? https://review.opendev.org/#/c/743945/
09:41:01 <strigazi> flwang1: i'm good, I need to go guys, see you. brtknr: I will have a look
09:41:02 <brtknr> i am surprised you haven't hit this: deleting a cluster leaves dead trustee users behind
09:41:23 <brtknr> cya strigazi
09:41:30 <flwang1> strigazi: cya
09:41:37 <brtknr> strigazi: btw i will be off from next month until next year
09:41:50 <brtknr> on parental leave
09:41:59 <brtknr> so will be quiet for a while
09:42:02 <flwang1> brtknr: sorry, i will test it tomorrow
09:42:21 <brtknr> thanks flwang1
09:42:23 <flwang1> brtknr: so you won't do any upstream work in the next 4 months?
09:42:41 <strigazi> brtknr: oh, nice, family time :)
09:42:44 <brtknr> flwang1: i will try but will be looking after 2 babies full time :P
09:43:07 <strigazi> flwang1: brtknr: signing off, brtknr I will catch you before your leave :)
09:43:44 <flwang1> i see. no problem
09:44:38 <flwang1> it's good to have more time with family during the covid-19 period
09:46:27 <flwang1> anything else, team?
09:46:48 <jakeyip> I have a question
09:47:01 <jakeyip> has anybody run out of ulimits using coreos?
09:47:17 <jakeyip> https://github.com/coreos/fedora-coreos-tracker/issues/269#issuecomment-615202267
09:48:16 <flwang1> jakeyip:  https://review.opendev.org/#/c/749169/
09:48:25 <flwang1> is this you're looking for?
09:48:41 <jakeyip> lol
09:48:56 <flwang1> it's just merged today :)
09:49:09 <jakeyip> yeah I know, what a coincidence
09:49:23 <flwang1> jakeyip: putting some time into code review will save you more time :D
09:49:52 <jakeyip> good idea ;)
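For anyone checking whether a node is affected before picking up that change, a quick sketch; dockerd as the container runtime daemon is an assumption for the fedora coreos driver:

    sysctl vm.max_map_count                             # kernel map count limit
    ulimit -n                                           # open-files limit for the current shell
    grep 'open files' /proc/$(pgrep -o dockerd)/limits  # effective nofile limit of the docker daemon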
09:50:39 <flwang1> if there is no new topic, i'm going to close the meeting for today
09:50:50 <flwang1> thank you for joining, my friends
09:50:54 <jakeyip> do you plan on backporting? I can do it in our repo anyway so no big deal
09:51:09 <flwang1> jakeyip: we will do backport to Victoria
09:51:27 <jakeyip> ok
09:51:39 <openstackgerrit> Feilong Wang proposed openstack/magnum stable/victoria: Update default values for docker nofile and vm.max_map_count  https://review.opendev.org/753559
09:51:56 <openstackgerrit> Feilong Wang proposed openstack/magnum stable/victoria: Remove cloud-config from k8s worker node  https://review.opendev.org/753560
09:52:08 <flwang1> #endmeeting