19:01:09 <clarkb> #startmeeting infra
19:01:10 <openstack> Meeting started Tue Sep 22 19:01:09 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:13 <openstack> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000100.html Our Agenda
19:01:24 <clarkb> #topic Announcements
19:01:42 <clarkb> PTG and Summit are fast approaching. If you plan to participate now is a good time to register
19:01:57 <ianw> o/
19:02:17 <clarkb> schedules for all three should be up too so you can cross check your timezone and availability
19:02:41 <clarkb> unfortunately I don't currently have links ready but if you need help finding anything I'm sure I can either find the info or know who to talk to
19:02:43 <diablo_rojo> o/
19:03:20 <clarkb> Also the smoke is mostly gone now and tomorrow is the last day where my kids don't have school obligations for the next several months so I'm going to take the day off and go do something in the rain
19:03:43 <clarkb> #topic Actions from last meeting
19:03:55 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-15-19.01.txt minutes from last meeting
19:03:58 <clarkb> no recorded actions
19:04:20 <clarkb> #topic Priority Efforts
19:04:27 <clarkb> #topic Update Config Management
19:04:35 <clarkb> nb04.opendev.org has been cleaned up
19:04:56 <corvus> so all of zuul+nodepool are running from container images now?
19:05:03 <clarkb> corvus: yes
19:05:18 <corvus> is all of puppet-openstackci unused by us now?
19:05:49 <clarkb> I'm not sure if we use the elasticsearch and logstash stuff out of there or not
19:06:07 <clarkb> I think not so ya that may be completely unused by us now
19:06:17 <corvus> that's a major milestone :)
19:07:06 <clarkb> ya, and we're also using the upstream images for all services too
19:07:12 <clarkb> *all zuul and nodepool services
19:07:27 <clarkb> which is a good way to help ensure those function well for the real world
19:09:18 <clarkb> #link https://review.opendev.org/#/c/744821/ fetch sshfp dns records in launch node
19:09:43 <clarkb> #link https://review.opendev.org/#/c/750049/ wait for ipv6 RAs when launching nodes
19:10:03 <clarkb> those are two launch node changes that would be good to land as we continue to roll out services with new config management on new hosts
19:10:15 <clarkb> Any other config management related items to discuss?
19:11:34 <clarkb> #topic OpenDev
19:11:52 <clarkb> It has been noticed that our Gitea project descriptions are not updated when we change them
19:12:25 <clarkb> I believe that the current method to address this is to run the sync-gitea-projects.yaml playbook which will do the slow resync of everything
19:13:02 <clarkb> If that is the current method, do we need to lock ansible on bridge when doing it to prevent other things from conflicting? I think so
19:13:46 <clarkb> fungi also brought up that we should think about ways to do this automatically if we can manage it somehow
19:14:07 <corvus> i'm trying to remember why we don't already do that
19:14:08 <ianw> how slow is slow?
19:14:40 <clarkb> I want to say it took about 4 hours to run the sync playbook last time mordred ran it
19:14:59 <clarkb> but maybe that was before we rewrote it in python?
19:15:37 <ianw> ok, that counts as slow :)
19:16:21 <clarkb> thinking out loud here: maybe we lock ansible on bridge, run the sync playbook and time it, then based on that we can consider doing a full sync each time?
19:16:43 <clarkb> also the ci job for gitea does two passes of project creation iirc. Maybe we can look at that for rough timing data
19:17:47 <ianw> ++
19:17:48 <corvus> it's not clear to me that sync-gitea-projects.yaml will update the description
19:18:04 <clarkb> corvus: oh interesting, is this possibly a bug in our python too?
19:18:28 <corvus> clarkb: yeah (or missing feature); a quick scan of the python looks like it only touches description on create
19:18:40 <clarkb> neat
19:18:46 <corvus> settings and branches get updated by sync-gitea-projects.yaml, but not the description
19:18:52 <corvus> I expect it would be an easy fix
19:18:55 <clarkb> now I'm thinking we make a change that just updates descriptions and use the ci job to time it
19:19:01 <clarkb> since I'm pretty sure we do two passes in ci now
19:19:09 <corvus> (also, i'm only at 90% confidence on this)
19:19:19 <clarkb> corvus: we can also use the ci job to check your theory on that
19:19:50 <clarkb> I can take a look at that if no one else is able/interested. It may just be a day or two while I work through other things first
19:20:22 <clarkb> and if you are interested I'm happy for someone else to work on it too :)
19:21:16 <clarkb> The other gitea topic I added to the agenda is that gitea 1.12.4 has released. I've done the last few gitea upgrades. Curious if anyone else is interested in giving it a go.
19:21:32 <clarkb> With the minor releases it's mostly about double checking file diffs and editing them as necessary for our forked html content
19:22:12 <ianw> i can give it a go
19:22:22 <clarkb> Our CI job for that has decent coverage of the automation results, and if you really want to check the rendered web ui launching a gitea locally isn't too bad
19:22:25 <clarkb> ianw: thanks!
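[editor's note] A minimal sketch of the description-only fix discussed above. It is not the actual opendev sync code; the function name and payload shape here are illustrative. Gitea's "edit repository" endpoint is `PATCH /api/v1/repos/{owner}/{repo}`, and only the changed projects would need an API call, which is what would keep a description-only pass fast:

```python
# Illustrative sketch only, not the real opendev gitea sync script.
# Build the body for Gitea's edit-repository API call
# (PATCH /api/v1/repos/{owner}/{repo}) only when the description
# actually changed, so a sync pass can skip unchanged projects.

def description_patch(current, desired):
    """Return a PATCH payload if the description needs updating, else None."""
    # Gitea stores an empty string for "no description"; normalize None.
    current = current or ""
    desired = desired or ""
    if current == desired:
        return None
    return {"description": desired}

# A caller would then do something like (auth details omitted, hypothetical
# variable names):
#   requests.patch(f"{gitea_url}/api/v1/repos/{org}/{name}",
#                  json=payload,
#                  headers={"Authorization": f"token {token}"})
```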
19:23:05 <clarkb> On the gerrit upgrade side of things I've been distracted by a number of operational issues the last few weeks so unfortunately no new updates here
19:23:21 <clarkb> Any other OpenDev topics people would like to bring up?
19:25:12 <clarkb> #topic General Topics
19:25:24 <clarkb> #topic Splitting puppet-else into service specific infra-prod jobs
19:25:34 <clarkb> This is something that ianw reminded me we had planned to do
19:26:08 <clarkb> essentially we split the node definition(s) out of our manifests/site.pp and put them in manifests/service.pp then add new jobs to run puppet for that specific service instead of everything
19:26:33 <clarkb> the reason this is coming up is we've been having servers like the elasticsearch servers crash on rax, then ansible sits there waiting for them until the puppet-else job times out
19:26:53 <clarkb> this adds a lot of noise to our logs and it is hard to tell if things are working or not since they are all lumped into one big basket
19:27:24 <clarkb> I wonder if we should plan a sprint to make those changes and get through a number of them in a day or two
19:28:17 <ianw> yeah, something else i can put on my short-term todo list
19:28:27 <clarkb> (also if anyone knows how to convince ansible to timeout more quickly when ssh will never succeed that would be great too)
19:29:37 <clarkb> ianw: I'm happy to help too, though for me having a day or two dedicated to it would likely help, maybe ping me when you intend on working on it and I'll start later in the day and we can sift through them?
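[editor's note] On the ssh timeout question above, an untested sketch of the two usual ansible.cfg knobs; this is not the actual bridge configuration, and the values are illustrative:

```ini
[defaults]
; ansible's own connection timeout, in seconds (the default is 10)
timeout = 10

[ssh_connection]
; pass a hard connect timeout and fewer attempts down to the ssh client
; so an unreachable host fails fast instead of hanging the whole run
ssh_args = -o ConnectTimeout=10 -o ConnectionAttempts=2 -o ControlMaster=auto -o ControlPersist=60s
```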
19:30:06 <ianw> ok; hopefully it's all pretty mechanical
19:30:16 <ianw> always surprises though :)
19:30:23 <clarkb> ya I think the least mechanical part will be adding testinfra tests
19:30:41 <clarkb> I think that was part of mordred's original goal so that we can switch out puppet for ansible+docker and have the tests confirm everything is still happy
19:30:47 <clarkb> without needing to replace test frameworks
19:31:24 <clarkb> #topic Bup and Borg Backups
19:31:28 <clarkb> #link https://review.opendev.org/741366 ready to merge when we are.
19:31:37 <clarkb> kept this on the agenda as ianw mentioned I should
19:31:58 <clarkb> I don't think it's incredibly urgent as bup continues to work, but being better prepared for focal and beyond is a good thing too
19:32:47 <clarkb> ianw: anything else to add on that one?
19:33:12 <corvus> are we looking for another +2, or just waiting for a babysitter?
19:33:29 <clarkb> corvus: I think waiting for a babysitter (ianw mentioned he could do it, but there have been many many distractions since)
19:33:30 <corvus> well, there's a host bringup, so a bit more work than babysitting
19:33:37 <ianw> yeah, just waiting for me to have a chunk of time to do the new server, and babysit
19:33:42 <clarkb> mostly I think it gets deprioritized since bup is working
19:33:57 <ianw> although there's been some chat about alternative providers
19:33:58 * corvus touches wood
19:34:07 <ianw> do we want to put the bup server somewhere !rax?
19:34:21 <clarkb> ianw: the original goal with bup was to backup to >1 provider
19:34:26 <corvus> i think 2 servers in 2 providers would be great
19:34:30 <clarkb> corvus: ++
19:34:53 <ianw> any preference of the options we have?
19:35:02 <corvus> i'd vote for rax + 1.
19:35:21 <clarkb> ianw: mnaser has recently indicated he'd be happy to host more things. Backups likely make sense there due to the use of ceph too
19:35:39 <clarkb> (our backups will be replicated many times)
19:35:47 <mnaser> yes, i meant to reply to the email, we have a lot of capacity of storage in mtl btw
19:35:57 <corvus> i'll just say that at one point we *did* have rax +1, and the +1 exited the cloud business. i really really hope (and i certainly don't expect) that to happen again. but having been bitten once...
19:36:10 <ianw> ok, sounds like vexxhost mtl
19:36:21 <mnaser> please feel free to loop me in if you need quota bumps or anything like that
19:36:29 <clarkb> mnaser: thank you!
19:36:33 <corvus> rax+mtl sounds great :)
19:36:43 <ianw> mnaser: thanks, will do
19:38:02 <clarkb> #topic PTG Planning
19:38:18 <clarkb> As mentioned earlier now is a good time to register if you plan to attend the PTG
19:38:24 <clarkb> #link https://www.openstack.org/ptg/ Register for the PTG
19:38:35 <clarkb> #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:38:43 <clarkb> I've added a number of topics to this etherpad
19:39:00 <clarkb> we are just over a month away so now is a great time to think about what we should be talking about during our PTG times
19:39:15 <clarkb> Feel free to add input on the topics I've added or add your own
19:39:37 <clarkb> if a particular topic is very important to you it might be a good thing to indicate your time availability next to the topic so we can include you
19:40:06 <clarkb> Also, I plan to use meetpad again as that worked well for us last time
19:41:20 <clarkb> Any other PTG concerns or thoughts?
19:43:04 <clarkb> #topic Switch fedora-latest to fedora-32
19:43:07 <clarkb> #link https://review.opendev.org/#/c/752744/
19:43:14 <clarkb> I sent an email last week saying we'd make this change today
19:43:25 <clarkb> I intend on approving the change after the meeting unless there are last minute objections
19:43:43 <clarkb> I figure if anyone really really needs fedora-30 they can use fedora-30 directly as we work them off of it
19:43:54 <clarkb> hoping that in the near future we'll delete the fedora-30 image entirely
19:44:23 <clarkb> part of the motivation here is that the old fedoras seem to be bitrotting with respect to ansible. Ansible isn't able to reliably manage systemd services on f31 for example
19:44:36 <clarkb> Getting to the up to date fedora version seems important as a result
19:44:53 <clarkb> ianw: ^ any particular concerns from you on that topic? you probably do more fedora things than the rest of us
19:45:29 <ianw> no, i mean we shouldn't really be using fedora-!latest in jobs, we've always said it's a rolling thing
19:46:03 <clarkb> ya the number of cases where fedora-30 is used explicitly is very small
19:46:14 <clarkb> nodepool, ara, and dib
19:46:29 <clarkb> nodepool and ara are/have been updated and dib will just stop testing f30 builds I think
19:47:27 <clarkb> #topic Open Discussion
19:47:58 <clarkb> https://review.opendev.org/#/c/752908/ is a change I'm hoping to get review(s) on from someone with a fresh perspective
19:48:15 <clarkb> I've had some initial concerns but have largely come around to thinking merging it is probably the most pragmatic thing
19:48:27 <ianw> one thing was restarting zuul-web to pick up the new pf4 changes that were merged
19:48:32 <clarkb> hoping that someone else can take a look and double check on that
19:48:57 <clarkb> I'll probably approve it by the end of my work day if no one else looks as I don't want the tripleo testing to flounder longer
19:49:22 <clarkb> ianw: typically those are really straightforward, you docker-compose down and docker-compose up -d in /etc/zuul-web or whatever the dir is
19:49:25 <ianw> i'll check, yesterday there were unanswered questions
19:49:46 <ianw> clarkb: is there a reason we don't CD deploy that?
19:49:51 <clarkb> ianw: if you want to do the zuul-web restart I'm around for another 5 or so hours and will happily back you up if something goes wrong
19:50:05 <clarkb> ianw: I think because sometimes you need to restart zuul-web and the scheduler together
19:50:13 <clarkb> corvus: ^ is that overly cautious on our part?
19:50:25 <ianw> ahh, ok, yeah this is not an API change
19:50:37 <ianw> but i guess it could be, at some times
19:50:38 <clarkb> ya most of the time it's fine to just restart
19:50:42 <clarkb> occasionally it isn't :)
19:51:39 <ianw> i'll take a look then
19:52:42 <corvus> i think it would be fine to cd zuul-web
19:52:47 <corvus> but the mechanics are tricky
19:52:57 <corvus> zuul repo is in a different tenant, etc
19:53:15 <corvus> really want a url trigger or something for that, i'd think.
19:53:42 <clarkb> we check the docker-compose pull info in the gitea role to understand if we need to restart in a safe way (which is different than just down and up)
19:53:44 <ianw> hrm, i docker restarted it, but it looks the same
19:53:55 <clarkb> we might be able to do something similar for zuul-web and get the hour delayed CD
19:53:55 <ianw> which must mean what i thought would be new containers isn't
19:54:19 <corvus> ianw: i think it needs more than a restart for the container to be recreated with a new image
19:54:29 <clarkb> yes I think that is the case
19:54:35 <corvus> i think a docker-compose down/up ?
19:54:44 <clarkb> ya down then up -d is what I usually use
19:55:30 <ianw> ok, yeah, looks like that's in the bash history
19:55:43 <ianw> my usual source of best practice tips :)
19:56:23 <clarkb> Sounds like that may be it
19:56:25 <clarkb> Thank you everyone
19:56:42 <ianw> yay, that got it :)
19:56:48 <clarkb> we'll be back here next week, until then feel free to chat in #opendev or on service-discuss@lists.opendev.org
19:56:54 <corvus> clarkb: thanks :)
19:56:55 <clarkb> #endmeeting
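[editor's note] The restart pattern worked out in the log above, as a command sketch rather than a transcript of what was actually run. The compose directory (/etc/zuul-web) comes from the conversation; the rest is generic docker-compose usage. The key point is that `docker restart` reuses the existing container, so a newly pulled image is never picked up; recreating the container is what deploys it:

```shell
cd /etc/zuul-web          # directory mentioned in the meeting
docker-compose pull       # fetch the new image
docker-compose down       # stop and *remove* the old container
docker-compose up -d      # create a fresh container from the new image
```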