19:01:09 <clarkb> #startmeeting infra
19:01:10 <openstack> Meeting started Tue Sep 22 19:01:09 2020 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:11 <openstack> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:13 <openstack> The meeting name has been set to 'infra'
19:01:17 <clarkb> #link http://lists.opendev.org/pipermail/service-discuss/2020-September/000100.html Our Agenda
19:01:24 <clarkb> #topic Announcements
19:01:42 <clarkb> PTG and Summit are fast approaching. If you plan to participate now is a good time to register
19:01:57 <ianw> o/
19:02:17 <clarkb> schedules for all three should be up too so you can cross check your timezone and availability
19:02:41 <clarkb> unfortunately I don't currently have links ready but if you need help finding anything I'm sure I can either find the info or know who to talk to
19:02:43 <diablo_rojo> o/
19:03:20 <clarkb> Also the smoke is mostly gone now and tomorrow is the last day where my kids don't have school obligations for the next several months so I'm going to take the day off and go do something in the rain
19:03:43 <clarkb> #topic Actions from last meeting
19:03:55 <clarkb> #link http://eavesdrop.openstack.org/meetings/infra/2020/infra.2020-09-15-19.01.txt minutes from last meeting
19:03:58 <clarkb> no recorded actions
19:04:20 <clarkb> #topic Priority Efforts
19:04:27 <clarkb> #topic Update Config Management
19:04:35 <clarkb> nb04.opendev.org has been cleaned up
19:04:56 <corvus> so all of zuul+nodepool are running from container images now?
19:05:03 <clarkb> corvus: yes
19:05:18 <corvus> is all of puppet-openstackci unused by us now?
19:05:49 <clarkb> I'm not sure if we use the elasticsearch and logstash stuff out of there or not
19:06:07 <clarkb> I think not so ya that may be completely unused by us now
19:06:17 <corvus> that's a major milestone :)
19:07:06 <clarkb> ya, and we're also using the upstream images for all services too
19:07:12 <clarkb> *all zuul and nodepool services
19:07:27 <clarkb> which is a good way to help ensure those function well for the real world
19:09:18 <clarkb> #link https://review.opendev.org/#/c/744821/ fetch sshfp dns records in launch node
19:09:43 <clarkb> #link https://review.opendev.org/#/c/750049/ wait for ipv6 RAs when launching nodes
19:10:03 <clarkb> those are two launch node changes that would be good to land as we continue to roll out services with new config management on new hosts
19:10:15 <clarkb> Any other config management related items to discuss?
19:11:34 <clarkb> #topic OpenDev
19:11:52 <clarkb> It has been noticed that our Gitea project descriptions are not updated when we change them
19:12:25 <clarkb> I believe that the current method to address this is to run the sync-gitea-projects.yaml playbook which will do the slow resync of everything
19:13:02 <clarkb> If that is the current method, do we need to lock ansible on bridge when doing it to prevent other things from conflicting? I think so
19:13:46 <clarkb> fungi also brought up that we should think about ways to do this automatically if we can manage it somehow
19:14:07 <corvus> i'm trying to remember why we don't already do that
19:14:08 <ianw> how slow is slow?
19:14:40 <clarkb> I want to say it took about 4 hours to run the sync playbook last time mordred ran it
19:14:59 <clarkb> but maybe that was before we rewrote it in python?
19:15:37 <ianw> ok, that counts as slow :)
19:16:21 <clarkb> thinking out loud here: maybe we lock ansible on bridge, run the sync playbook and time it, then based on that we can consider doing a full sync each time?
19:16:43 <clarkb> also the ci job for gitea does two passes of project creation iirc. Maybe we can look at that for rough timing data
19:17:47 <ianw> ++
19:17:48 <corvus> it's not clear to me that sync-gitea-projects.yaml will update the description
19:18:04 <clarkb> corvus: oh interesting, is this possibly a bug in our python too?
19:18:28 <corvus> clarkb: yeah (or missing feature); a quick scan of the python looks like it only touches description on create
19:18:40 <clarkb> neat
19:18:46 <corvus> settings and branches get updated by sync-gitea-projects.yaml, but not the description
19:18:52 <corvus> I expect it would be an easy fix
19:18:55 <clarkb> now I'm thinking we make a change that just updates descriptions and use the ci job to time it
19:19:01 <clarkb> since I'm pretty sure we do two passes in ci now
19:19:09 <corvus> (also, i'm only at 90% confidence on this)
19:19:19 <clarkb> corvus: we can also use the ci job to check your theory on that
19:19:50 <clarkb> I can take a look at that if no one else is able/interested. It may just be a day or two while I work through other things first
19:20:22 <clarkb> and if you are interested I'm happy for someone else to work on it too :)
19:21:16 <clarkb> The other gitea topic I added to the agenda is that gitea 1.12.4 has released. I've done the last few gitea upgrades. Curious if anyone else is interested in giving it a go.
19:21:32 <clarkb> With the minor releases it's mostly about double checking file diffs and editing them as necessary for our forked html content
19:22:12 <ianw> i can give it a go
19:22:22 <clarkb> Our CI job for that has decent coverage of the automation results, and if you really want to check the rendered web ui launching a gitea locally isn't too bad
19:22:25 <clarkb> ianw: thanks!
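[editor's note] A minimal sketch of the description-only fix discussed above. It is not the actual opendev sync code; the function name and payload shape here are illustrative. Gitea's "edit repository" endpoint is `PATCH /api/v1/repos/{owner}/{repo}`, and only the changed projects would need an API call, which is what would keep a description-only pass fast:

```python
# Illustrative sketch only, not the real opendev gitea sync script.
# Build the body for Gitea's edit-repository API call
# (PATCH /api/v1/repos/{owner}/{repo}) only when the description
# actually changed, so a sync pass can skip unchanged projects.

def description_patch(current, desired):
    """Return a PATCH payload if the description needs updating, else None."""
    # Gitea stores an empty string for "no description"; normalize None.
    current = current or ""
    desired = desired or ""
    if current == desired:
        return None
    return {"description": desired}

# A caller would then do something like (auth details omitted, hypothetical
# variable names):
#   requests.patch(f"{gitea_url}/api/v1/repos/{org}/{name}",
#                  json=payload,
#                  headers={"Authorization": f"token {token}"})
```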
19:23:05 <clarkb> On the gerrit upgrade side of things I've been distracted by a number of operational issues the last few weeks so unfortunately no new updates here
19:23:21 <clarkb> Any other OpenDev topics people would like to bring up?
19:25:12 <clarkb> #topic General Topics
19:25:24 <clarkb> #topic Splitting puppet-else into service specific infra-prod jobs
19:25:34 <clarkb> This is something that ianw reminded me we had planned to do
19:26:08 <clarkb> essentially we split the node definition(s) out of our manifests/site.pp and put them in manifests/service.pp then add new jobs to run puppet for that specific service instead of everything
19:26:33 <clarkb> the reason this is coming up is we've been having servers like the elasticsearch servers crash on rax, then ansible sits there waiting for them until the puppet-else job times out
19:26:53 <clarkb> this adds a lot of noise to our logs and it is hard to tell if things are working or not since they are all lumped into one big basket
19:27:24 <clarkb> I wonder if we should plan a sprint to make those changes and get through a number of them in a day or two
19:28:17 <ianw> yeah, something else i can put on my short-term todo list
19:28:27 <clarkb> (also if anyone knows how to convince ansible to timeout more quickly when ssh will never succeed that would be great too)
19:29:37 <clarkb> ianw: I'm happy to help too, though for me having a day or two dedicated to it would likely help, maybe ping me when you intend on working on it and I'll start later in the day and we can sift through them?
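[editor's note] On the ssh timeout question above, an untested sketch of the two usual ansible.cfg knobs; this is not the actual bridge configuration, and the values are illustrative:

```ini
[defaults]
; ansible's own connection timeout, in seconds (the default is 10)
timeout = 10

[ssh_connection]
; pass a hard connect timeout and fewer attempts down to the ssh client
; so an unreachable host fails fast instead of hanging the whole run
ssh_args = -o ConnectTimeout=10 -o ConnectionAttempts=2 -o ControlMaster=auto -o ControlPersist=60s
```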
19:30:06 <ianw> ok; hopefully it's all pretty mechanical
19:30:16 <ianw> always surprises though :)
19:30:23 <clarkb> ya I think the least mechanical part will be adding testinfra tests
19:30:41 <clarkb> I think that was part of mordred's original goal so that we can switch out puppet for ansible+docker and have the tests confirm everything is still happy
19:30:47 <clarkb> without needing to replace test frameworks
19:31:24 <clarkb> #topic Bup and Borg Backups
19:31:28 <clarkb> #link https://review.opendev.org/741366 ready to merge when we are.
19:31:37 <clarkb> kept this on the agenda as ianw mentioned I should
19:31:58 <clarkb> I don't think it's incredibly urgent as bup continues to work, but being better prepared for focal and beyond is a good thing too
19:32:47 <clarkb> ianw: anything else to add on that one?
19:33:12 <corvus> are we looking for another +2, or just waiting for a babysitter?
19:33:29 <clarkb> corvus: I think waiting for a babysitter (ianw mentioned he could do it, but there have been many many distractions since)
19:33:30 <corvus> well, there's a host bringup, so a bit more work than babysitting
19:33:37 <ianw> yeah, just waiting for me to have a chunk of time to do the new server, and babysit
19:33:42 <clarkb> mostly I think it gets deprioritized since bup is working
19:33:57 <ianw> although there's been some chat about alternative providers
19:33:58 * corvus touches wood
19:34:07 <ianw> do we want to put the bup server somewhere !rax?
19:34:21 <clarkb> ianw: the original goal with bup was to backup to >1 provider
19:34:26 <corvus> i think 2 servers in 2 providers would be great
19:34:30 <clarkb> corvus: ++
19:34:53 <ianw> any preference of the options we have?
19:35:02 <corvus> i'd vote for rax + 1.
19:35:21 <clarkb> ianw: mnaser has recently indicated he'd be happy to host more things. Backups likely make sense there due to the use of ceph too
19:35:39 <clarkb> (our backups will be replicated many times)
19:35:47 <mnaser> yes, i meant to reply to the email, we have a lot of capacity of storage in mtl btw
19:35:57 <corvus> i'll just say that at one point we *did* have rax +1, and the +1 exited the cloud business. i really really hope (and i certainly don't expect) that to happen again. but having been bitten once...
19:36:10 <ianw> ok, sounds like vexxhost mtl
19:36:21 <mnaser> please feel free to loop me in if you need quota bumps or anything like that
19:36:29 <clarkb> mnaser: thank you!
19:36:33 <corvus> rax+mtl sounds great :)
19:36:43 <ianw> mnaser: thanks, will do
19:38:02 <clarkb> #topic PTG Planning
19:38:18 <clarkb> As mentioned earlier now is a good time to register if you plan to attend the PTG
19:38:24 <clarkb> #link https://www.openstack.org/ptg/ Register for the PTG
19:38:35 <clarkb> #link https://etherpad.opendev.org/opendev-ptg-planning-oct-2020 October PTG planning starts here
19:38:43 <clarkb> I've added a number of topics to this etherpad
19:39:00 <clarkb> we are just over a month away so now is a great time to think about what we should be talking about during our PTG times
19:39:15 <clarkb> Feel free to add input on the topics I've added or add your own
19:39:37 <clarkb> if a particular topic is very important to you it might be a good thing to indicate your time availability next to the topic so we can include you
19:40:06 <clarkb> Also, I plan to use meetpad again as that worked well for us last time
19:41:20 <clarkb> Any other PTG concerns or thoughts?
19:43:04 <clarkb> #topic Switch fedora-latest to fedora-32
19:43:07 <clarkb> #link https://review.opendev.org/#/c/752744/
19:43:14 <clarkb> I sent an email last week saying we'd make this change today
19:43:25 <clarkb> I intend on approving the change after the meeting unless there are last minute objections
19:43:43 <clarkb> I figure if anyone really really needs fedora-30 they can use fedora-30 directly as we work them off of it
19:43:54 <clarkb> hoping that in the near future we'll delete the fedora-30 image entirely
19:44:23 <clarkb> part of the motivation here is that the old fedoras seem to be bitrotting with respect to ansible. Ansible isn't able to reliably manage systemd services on f31 for example
19:44:36 <clarkb> Getting to the up to date fedora version seems important as a result
19:44:53 <clarkb> ianw: ^ any particular concerns from you on that topic? you probably do more fedora things than the rest of us
19:45:29 <ianw> no, i mean we shouldn't really be using fedora-!latest in jobs, we've always said it's a rolling thing
19:46:03 <clarkb> ya the number of cases where fedora-30 is used explicitly is very small
19:46:14 <clarkb> nodepool, ara, and dib
19:46:29 <clarkb> nodepool and ara are/have been updated and dib will just stop testing f30 builds I think
19:47:27 <clarkb> #topic Open Discussion
19:47:58 <clarkb> https://review.opendev.org/#/c/752908/ is a change I'm hoping to get review(s) on from someone with a fresh perspective
19:48:15 <clarkb> I've had some initial concerns but have largely come around to thinking merging it is probably the most pragmatic thing
19:48:27 <ianw> one thing was restarting zuul-web to pick up the new pf4 changes that were merged
19:48:32 <clarkb> hoping that someone else can take a look and double check on that
19:48:57 <clarkb> I'll probably approve it by the end of my work day if no one else looks as I don't want the tripleo testing to flounder longer
19:49:22 <clarkb> ianw: typically those are really straightforward, you docker-compose down and docker-compose up -d in /etc/zuul-web or whatever the dir is
19:49:25 <ianw> i'll check, yesterday there were unanswered questions
19:49:46 <ianw> clarkb: is there a reason we don't CD deploy that?
19:49:51 <clarkb> ianw: if you want to do the zuul-web restart I'm around for another 5 or so hours and will happily back you up if something goes wrong
19:50:05 <clarkb> ianw: I think because sometimes you need to restart zuul-web and the scheduler together
19:50:13 <clarkb> corvus: ^ is that overly cautious on our part?
19:50:25 <ianw> ahh, ok, yeah this is not an API change
19:50:37 <ianw> but i guess it could be, at some times
19:50:38 <clarkb> ya most of the time it's fine to just restart
19:50:42 <clarkb> occasionally it isn't :)
19:51:39 <ianw> i'll take a look then
19:52:42 <corvus> i think it would be fine to cd zuul-web
19:52:47 <corvus> but the mechanics are tricky
19:52:57 <corvus> zuul repo is in a different tenant, etc
19:53:15 <corvus> really want a url trigger or something for that, i'd think.
19:53:42 <clarkb> we check the docker-compose pull info in the gitea role to understand if we need to restart in a safe way (which is different than just down and up)
19:53:44 <ianw> hrm, i docker restarted it, but it looks the same
19:53:55 <clarkb> we might be able to do something similar for zuul-web and get the hour delayed CD
19:53:55 <ianw> which must mean what i thought would be new containers isn't
19:54:19 <corvus> ianw: i think it needs more than a restart for the container to be recreated with a new image
19:54:29 <clarkb> yes I think that is the case
19:54:35 <corvus> i think a docker-compose down/up ?
19:54:44 <clarkb> ya down then up -d is what I usually use
19:55:30 <ianw> ok, yeah, looks like that's in the bash history
19:55:43 <ianw> my usual source of best practice tips :)
19:56:23 <clarkb> Sounds like that may be it
19:56:25 <clarkb> Thank you everyone
19:56:42 <ianw> yay, that got it :)
19:56:48 <clarkb> we'll be back here next week, until then feel free to chat in #opendev or on service-discuss@lists.opendev.org
19:56:54 <corvus> clarkb: thanks :)
19:56:55 <clarkb> #endmeeting
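[editor's note] The restart pattern worked out in the log above, as a command sketch rather than a transcript of what was actually run. The compose directory (/etc/zuul-web) comes from the conversation; the rest is generic docker-compose usage. The key point is that `docker restart` reuses the existing container, so a newly pulled image is never picked up; recreating the container is what deploys it:

```shell
cd /etc/zuul-web          # directory mentioned in the meeting
docker-compose pull       # fetch the new image
docker-compose down       # stop and *remove* the old container
docker-compose up -d      # create a fresh container from the new image
```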