15:01:18 <mgoddard> #startmeeting kolla
15:01:20 <openstack> Meeting started Wed Dec 16 15:01:18 2020 UTC and is due to finish in 60 minutes.  The chair is mgoddard. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:01:21 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
15:01:23 <openstack> The meeting name has been set to 'kolla'
15:01:39 <mgoddard> #topic rollcall
15:01:41 <mgoddard> \o
15:01:47 <yoctozepto> o/
15:01:48 <mnasiadka> o/
15:02:29 <rafaelweingartne> \o
15:04:22 <mgoddard> #topic agenda
15:05:02 <mgoddard> * Roll-call
15:05:04 <mgoddard> * Announcements
15:05:06 <mgoddard> * Review action items from the last meeting
15:05:08 <mgoddard> * CI status
15:05:10 <mgoddard> * Victoria release planning
15:05:12 <mgoddard> * Dockerhub pull rate limits https://etherpad.opendev.org/p/docker-pull-limits
15:05:14 <mgoddard> * CentOS 8.3 & stream https://lists.centos.org/pipermail/centos-devel/2020-December/075451.html
15:05:16 <mgoddard> * Cinder active/active https://bugs.launchpad.net/kolla-ansible/+bug/1904062
15:05:17 <openstack> Launchpad bug 1904062 in kolla-ansible wallaby "external ceph cinder volume config breaks volumes on ussuri upgrade" [High,In progress] - Assigned to Michal Nasiadka (mnasiadka)
15:05:18 <mgoddard> * Wallaby PTG actions
15:05:20 <mgoddard> #topic announcements
15:05:28 <mgoddard> I suppose we should cancel some meetings
15:05:57 <mgoddard> the next two?
15:05:57 <mnasiadka> for the next 3 weeks I guess, 6th Jan is a designated holiday in Poland
15:06:02 <mgoddard> oh ok
15:06:03 <yoctozepto> ++
15:06:22 <yoctozepto> I am running masakari's on 22nd
15:06:29 <yoctozepto> but yeah, 23rd with kolla might be a bit late
15:06:31 <mgoddard> #info The next 3 meetings are cancelled
15:06:59 <mgoddard> #action mgoddard send email to openstack-discuss about meeting cancellations
15:07:00 <wuchunyang> ok
15:07:00 <yoctozepto> 2020-01-13 is the next kolla meeting
15:07:17 <wuchunyang> happy new year
15:07:19 <yoctozepto> 2021-01-13 *
15:07:26 <yoctozepto> happy new year! :D
15:07:34 <mgoddard> Any others?
15:07:46 <yoctozepto> we got our CI back on track
15:07:55 <mgoddard> #topic CI status
15:08:11 * yoctozepto sends off fireworks
15:08:15 <mgoddard> hooray
15:08:42 <mgoddard> thank you to everyone involved in firefighting over the last few weeks
15:09:03 <mgoddard> we're not done yet - kayobe is still busted
15:09:23 <mgoddard> I had it passing yesterday, now there are two new failures
15:09:27 <yoctozepto> ah, yeah
15:09:29 <yoctozepto> :O
15:09:33 <yoctozepto> that went quick
15:09:42 <yoctozepto> what kind of failures?
15:09:44 <mgoddard> well, passing in review anyway
15:09:56 <mgoddard> bifrost changed a default
15:10:00 <yoctozepto> oh gosh
15:10:05 <mgoddard> and some weird ironic locking issue
15:10:19 <yoctozepto> :-(
15:11:01 * hrw at 2 meetings at a time
15:11:01 <mgoddard> anyway let's go over the CI status
15:11:20 <mgoddard> centos8-ceph-upgrade jobs seem to be retried 3 times only to fail in some weird way
15:11:23 <mgoddard> still seeing it?
15:12:09 <yoctozepto> yes
15:12:16 <yoctozepto> obviously not always
15:13:06 <yoctozepto> https://zuul.openstack.org/builds?job_name=kolla-ansible-centos8-source-upgrade-ceph-ansible&project=openstack%2Fkolla-ansible&branch=master
15:13:28 <yoctozepto> a SUCCESS is rare
15:14:02 <yoctozepto> fwiw, ubuntu ain't looking any better https://zuul.openstack.org/builds?job_name=kolla-ansible-ubuntu-source-upgrade-ceph-ansible&project=openstack%2Fkolla-ansible&branch=master
15:14:06 <mgoddard> wonderful
15:14:19 <yoctozepto> well, I suppose DISK_FULL is kinda success
15:14:35 <yoctozepto> I like your optimism
15:15:46 <kevko> guys, I've probably finished the refactor and completed the switch from haproxy to maxscale; everyone who wants to check out a working maxscale can log in here
15:15:47 <kevko> http://185.21.197.26:8990/#/dashboard/servers
15:16:01 <kevko> just ask for the password in a private message ..
15:16:12 <mgoddard> kevko: meeting time
15:16:15 <yoctozepto> yup, these recent ones still failed on "Ensuring config directories exist"
15:17:25 <mgoddard> let's move on from CI. At least we're in a better position than before, even if we have loose ends
15:17:35 <mgoddard> #topic Victoria release planning
15:17:40 <yoctozepto> ubuntu does not exhibit that particular behaviour
15:17:59 <yoctozepto> all right, Victoria not so victorious
15:18:14 <mgoddard> end of year release not looking too likely
15:18:25 <mgoddard> we're still basically blocked on the cinder issue
15:18:28 <mgoddard> mnasiadka: any progress?
15:18:29 <yoctozepto> yup, super sad this time
15:19:13 <mnasiadka> mgoddard: have a test env, testing migration from non-cluster to cluster, should have something to push this week
15:19:13 <mgoddard> perhaps we need to make a call about whether to release without the cinder fix
15:19:53 <mnasiadka> backend_host probably won't take too much time, it's not much different - we just have one cinder-volume service instead of 3 or more
15:20:21 <mnasiadka> mgoddard: well, we could release with an issue type reno, but it won't look very good :)
15:20:56 <mgoddard> no, but it's no worse than what we have already
15:21:07 <yoctozepto> ++
15:21:51 <mgoddard> that doesn't mean we should take the pressure off fixing the issue
15:21:59 <yoctozepto> ++
15:22:17 <mgoddard> vote: release victoria without cinder active/active fix?
15:23:12 <mnasiadka> if I don't push any updates to the change tomorrow - I'll make the releases change myself, deal? :)
15:23:31 <yoctozepto> mnasiadka: otherwise you'll make mgoddard do them? :P
15:23:35 <mnasiadka> unless we want to test the migration in CI
15:23:42 <mnasiadka> then it will take more time for sure
15:24:48 <yoctozepto> let's release with an issue reno
15:24:55 <yoctozepto> ussuri is already "broken"
15:24:59 <yoctozepto> mention that
15:25:11 <yoctozepto> it is not a *new issue*
15:25:34 <yoctozepto> and it can be worked around if you just control the process yourself
15:25:41 <yoctozepto> which might be the case with an external ceph ;-)
15:27:30 <yoctozepto> mgoddard, mnasiadka: decisions, decisions
15:27:34 <mgoddard> ugh
15:27:50 <mgoddard> let's give mnasiadka a day or two
15:27:59 <yoctozepto> this week indeed
15:28:16 <yoctozepto> I might think about extending CI testing
15:28:27 <mgoddard> we may not even have the right people around to get a release approved before 2021
15:28:37 <yoctozepto> that's true as well
15:28:43 <mgoddard> #topic Dockerhub pull rate limits https://etherpad.opendev.org/p/docker-pull-limits
15:28:50 <yoctozepto> oh noez
15:29:03 <yoctozepto> in nightmares it haunts me
15:29:16 <mgoddard> you have reached your dream limit
15:30:05 <yoctozepto> http 429 no sweet dreams for you
15:30:21 <yoctozepto> sooo
15:30:22 <mnasiadka> lol
15:30:31 <yoctozepto> are we pushing towards an internal registry?
15:30:40 <mgoddard> I've crossed through docker devil pact option 2
15:30:41 <yoctozepto> because I have no idea what to do about this
15:30:59 <mnasiadka> so, mostly k-a jobs suffer?
15:31:02 <mgoddard> which leaves switching to another devil, or using the infra registry
15:31:14 <yoctozepto> mnasiadka: k too
15:31:27 <mnasiadka> so, in kolla case we pull only one image from docker hub, right?
15:31:30 <yoctozepto> maybe let's try another devil?
15:31:43 <yoctozepto> mnasiadka: yeah 1 per distro version
15:31:53 <mnasiadka> and we fail with that, that's some crazy stuff
15:32:10 <yoctozepto> because we kill it with k-a jobs
15:32:30 <mnasiadka> ok, so option 1: another devil
15:32:33 <yoctozepto> or, well
15:32:43 <yoctozepto> we are not using registry mirror in kolla
15:32:52 <yoctozepto> when building those images
15:33:02 <yoctozepto> they seem to be coming directly from the dockerhub
15:33:07 <yoctozepto> maybe that is where our issue is
15:33:08 <mnasiadka> I don't think centos:8 ubuntu:something and debian:devilish_number changes a lot
15:33:21 <yoctozepto> they don't
15:33:25 <mgoddard> if we switch to another registry, I don't know if we get the caching or the mirror
15:33:31 <mnasiadka> so we could at least cross out those failures by using a cache
15:33:52 <yoctozepto> mgoddard, mnasiadka: could you confirm that in k we are not using any mirror at the moment
15:33:57 <mnasiadka> do we know how many pull credits we use per standard k-a CI job?
15:34:15 <yoctozepto> mnasiadka: 20-30
15:34:59 <yoctozepto> *(mirror meaning caching proxy)
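(Context for the above: the "registry mirror" being discussed is the Docker daemon's registry-mirrors option, which routes Docker Hub pulls through a caching proxy. A minimal sketch of the relevant /etc/docker/daemon.json, with a placeholder proxy URL standing in for whatever the CI provider actually exposes:

    {
      "registry-mirrors": ["http://mirror.example.opendev.org:8082"]
    }

Per the discussion, the kolla-ansible jobs already configure something like this, while the kolla image-build jobs apparently do not, so their base-image pulls go straight to Docker Hub.)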
15:35:57 <openstackgerrit> Michal Arbet proposed openstack/kolla-ansible master: Add maxscale support for database  https://review.opendev.org/c/openstack/kolla-ansible/+/767370
15:37:29 <mnasiadka> yoctozepto: 20-30 is not bad, we have 200 pulls per 6 hours, right?
15:37:31 <mnasiadka> or 100?
15:37:35 <mgoddard> 100
15:37:42 <mgoddard> it's not enough
15:37:47 <yoctozepto> not enuff
15:38:03 <mgoddard> ok, key question
15:38:21 <mnasiadka> so, if we disable the cache in k-a jobs, and add a check in pre to verify we have enough pulls left for the current host, is that something we could live with?
15:38:23 <mgoddard> if I pull from github's registry using a registry mirror, will it cache the image?
15:39:04 <mnasiadka> mgoddard: good question, I guess it should - but maybe we are both wrong in that assumption :) needs testing
15:39:19 <mgoddard> I remember reading it doesn't
15:40:47 <yoctozepto> you mean pull-through?
15:40:51 <mgoddard> yes
15:41:33 <yoctozepto> I also think it only caches the primary one
15:41:38 <yoctozepto> might be worth gooling
15:41:41 <yoctozepto> googling*
15:41:42 <yoctozepto> let me see,
15:42:14 <mnasiadka> the standard doesn't
15:42:21 <mnasiadka> but there is stuff like https://hub.docker.com/r/tiangolo/docker-registry-proxy
15:42:28 <mnasiadka> or probably something else we could use
15:42:59 <mgoddard> Gotcha
15:43:01 <mgoddard> It’s currently not possible to mirror another private registry. Only the central Hub can be mirrored.
15:44:11 <yoctozepto> https://github.com/docker/distribution/blob/master/ROADMAP.md#proxying-to-other-registries
15:44:53 <yoctozepto> mnasiadka: the thing is: infra has to set that up
15:45:14 <mnasiadka> yoctozepto: I know, that's why we need to think of something we can do now
15:45:17 <mgoddard> yeah
15:45:20 <yoctozepto> and they already have their hands full with the new great gerrit
15:45:54 <yoctozepto> I think we should add cache to k
15:46:01 <yoctozepto> as I see it is not there
15:46:07 <yoctozepto> I wanted you to cross-validate me
15:46:23 <mgoddard> what do you mean?
15:46:25 <mnasiadka> and judging by the amount of storage our images need - it might be complicated to find a registry local to nodepool providers :)
15:46:51 <yoctozepto> no hits of registry-mirrors in kolla job plays
15:47:12 <mgoddard> oh, I see
15:47:12 <yoctozepto> kolla-ansible yes
15:47:14 <yoctozepto> kolla not
15:47:18 <mgoddard> it would help for some build failures
15:47:28 <yoctozepto> mgoddard: now you know why I thought what I thought ;-)
15:48:03 <mgoddard> but while it might save the build jobs, the deploy jobs are likely to fail
15:48:24 <mnasiadka> yeah, at least we can fix kolla for now - it's an easy step
15:48:31 <yoctozepto> yes, I thought this
15:48:36 <yoctozepto> we can move to another registry
15:48:41 <yoctozepto> but we still need to fetch
15:48:44 <yoctozepto> distro images
15:48:49 <yoctozepto> so cache in kolla
15:48:51 <yoctozepto> and that is enough
15:48:54 <yoctozepto> for kolla
15:49:08 <yoctozepto> then move publishing to quay
15:49:13 <yoctozepto> or something
15:49:16 <mnasiadka> and why quay is better?
15:49:21 <mnasiadka> higher limits?
15:49:26 <yoctozepto> because it has only burst limits
15:49:40 <mgoddard> if we move to another registry we can no longer use the registry mirror, and that will increase external traffic significantly
15:49:42 <yoctozepto> if you do too many reqs per sec
15:49:45 <yoctozepto> but it has no mirror
15:49:48 <yoctozepto> so it's tricky
15:49:49 <yoctozepto> indeed
15:50:18 <mnasiadka> so, in reality the zuul pull through cache is having going over pull limits?
15:50:35 <mnasiadka> uh, english is complicated
15:50:46 <yoctozepto> yes, rephrase pretty please
15:50:53 <mgoddard> yes, all pulls come from the caches
15:51:01 <yoctozepto> in k-a jobs
15:51:14 <mnasiadka> and the caches have one ip, so it's not a problem for them to use the quota
15:51:27 <mnasiadka> are we sure that disabling the cache is not a better solution? :)
15:51:35 <yoctozepto> one ip addr per cloud more or less
15:51:50 <yoctozepto> we are pretty sure regarding the external traffic
15:52:00 <mnasiadka> well, not better in terms of traffic, but better in terms of pull quotas
15:53:02 <mgoddard> hard to say without trying it
15:53:14 <mgoddard> it depends on how nodepool is setup
15:53:34 <mgoddard> if all nodes are in a single project using one router, they'll all use the same IP anyway
15:53:42 <yoctozepto> indeed
15:54:13 <mnasiadka> well then it's even worse, if they all share one external IP via SNAT
15:54:18 <mgoddard> yes
15:54:52 <mnasiadka> so the only way forward is deploying a registry in each nodepool provider and trying to sync them (or checking for images older than X and building them)
15:55:24 <yoctozepto> or cache quay
15:55:33 <mnasiadka> or... check the usage of the docker pull quota and decide if we need to build the images, or just pull them (if we have the same IP as the pull-through cache)
15:55:45 <mnasiadka> (docker exposes the number of pulls left via an API)
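(For reference, the API mnasiadka mentions is Docker Hub's documented rate-limit check: fetch an anonymous token for the ratelimitpreview/test repository and issue a HEAD request against its manifest; the response carries ratelimit-limit and ratelimit-remaining headers for the calling IP. A rough sketch of such a pre-job check in Python, assuming the requests library is available:

    import requests

    # Anonymous token scoped to the repository Docker Hub documents for
    # checking rate limits.
    token = requests.get(
        "https://auth.docker.io/token",
        params={
            "service": "registry.docker.io",
            "scope": "repository:ratelimitpreview/test:pull",
        },
    ).json()["token"]

    # HEAD the manifest; per Docker's docs this reports the quota without
    # counting as a pull. Header values look like "100;w=21600".
    resp = requests.head(
        "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
        headers={"Authorization": "Bearer " + token},
    )
    print("limit:    ", resp.headers.get("ratelimit-limit"))
    print("remaining:", resp.headers.get("ratelimit-remaining"))

A job's "pre" phase could run something like this and decide whether pulling or building is the safer path, along the lines mnasiadka suggests above.)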
15:55:45 <yoctozepto> I wonder
15:55:54 <yoctozepto> if we are not better off rebuilding the images now each time :P
15:56:02 <mgoddard> it's a bit racy with multiple jobs though
15:56:22 <mnasiadka> well, we can build on first host, push into a registry and use that registry
15:56:30 <mnasiadka> and build only a subset that is required for CI
15:56:35 <yoctozepto> that is what we are doing
15:56:42 <yoctozepto> when we need to build them
15:56:46 <yoctozepto> we are smart (TM)
15:57:00 <yoctozepto> so what do you say
15:57:07 <yoctozepto> we just enable building by default
15:57:12 <yoctozepto> and see how bad that works for us
15:57:31 <mnasiadka> well, how much longer will it take? 10-20 minutes?
15:57:44 <mgoddard> why don't we try another registry
15:57:52 <mnasiadka> but then there's no cache?
15:58:02 <mgoddard> all other options seem to lack a cache
15:58:22 <yoctozepto> folks, 2 minutes
15:58:26 <mgoddard> or require pulling in data for building
15:58:32 <yoctozepto> enable building images each time and see how bad it goes?
15:58:39 <yoctozepto> for the time being
15:58:42 <yoctozepto> on master
15:58:53 <mgoddard> could do it on master
15:59:13 <mnasiadka> yeah, let's do it on master - and see how that goes
15:59:19 <mnasiadka> or even only for source
15:59:24 <yoctozepto> then basically kolla-ansible jobs work like they always do in kolla
15:59:39 <yoctozepto> mnasiadka: no, always, otherwise we might still trigger the duckery
15:59:41 <mgoddard> kolla-ansible pull | kolla-build
15:59:45 <mgoddard> ||
16:00:01 <mgoddard> doesn't work for pre-upgrade though
16:00:08 <mgoddard> also true of building
16:00:26 <yoctozepto> no, just build on master
16:00:31 <yoctozepto> stable have their own jobs
16:01:06 <yoctozepto> so we would still be pulling something extra
16:01:21 <yoctozepto> eh, dockerhub
16:01:28 <yoctozepto> was that really necessary
16:01:54 <yoctozepto> a'ight time is up
16:02:45 <mnasiadka> yup
16:03:05 <mgoddard> maybe we should just accept the Ts & Cs :D
16:03:41 <mnasiadka> basically - ideally, if infra would deploy something like Harbor in each mirror site and set up replication rules - it would be solved that way
16:03:51 <mnasiadka> we would just publish both to docker hub and local registry
16:04:12 <mnasiadka> wonder what is the traffic volume we create by publishing
16:04:25 <mgoddard> ok, let's finish
16:04:28 <mgoddard> thanks
16:04:30 <mgoddard> #endmeeting