fungi | meeting in about 10 minutes | 18:50 |
---|---|---|
fungi | meeting time! | 19:00 |
fungi | i've volunteered to chair this week since clarkb is feeling under the weather | 19:00 |
fungi | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Dec 19 19:00:52 2023 UTC and is due to finish in 60 minutes. The chair is fungi. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
fungi | #link https://wiki.openstack.org/wiki/Meetings/InfraTeamMeeting#Agenda_for_next_meeting Our Agenda | 19:01 |
fungi | #topic Announcements | 19:01 |
fungi | #info The OpenDev weekly meeting is cancelled for the next two weeks owing to lack of availability for many participants; we're skipping December 26 and January 2, resuming as usual on January 9. | 19:02 |
fungi | i'm also skipping the empty boilerplate topics | 19:03 |
fungi | #topic Upgrading Bionic servers to Focal/Jammy (clarkb) | 19:03 |
fungi | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades | 19:03 |
tonyb | mirrors are done and need to be cleaned up | 19:04 |
fungi | there's a note here in the agenda about updating cnames and cleaning up old servers for mirror replacements | 19:04 |
fungi | yeah, that | 19:04 |
fungi | are there open changes for dns still? | 19:04 |
tonyb | I started doing this yesterday but wanted additional eyes as it's my first time | 19:04 |
fungi | or do we just need to delete servers/volumes? | 19:04 |
tonyb | no open changes ATM | 19:04 |
tonyb | that one | 19:04 |
fungi | what specifically do you want an extra pair of eyes on? happy to help | 19:04 |
tonyb | fungi: the server and volume deletes; I understand the process | 19:05 |
fungi | i'm around to help after the meeting if you want, or you can pick another better time | 19:06 |
tonyb | fungi: after the meeting is good for me | 19:06 |
fungi | sounds good, thanks! | 19:07 |
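For context, the post-swap cleanup being arranged here amounts to roughly the following; the record and resource names are placeholders, not the actual mirror servers or volumes:

```shell
# Confirm the CNAME already points at the replacement mirror before
# deleting anything (hypothetical FQDN).
dig +short CNAME mirror.example-region.provider.opendev.org

# Then remove the retired server and its volumes in the hosting cloud.
openstack server show old-mirror01.example-region.provider.opendev.org
openstack server delete old-mirror01.example-region.provider.opendev.org
openstack volume list --name old-mirror01-main
openstack volume delete old-mirror01-main
```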
tonyb | I've started looking at jvb and meetpad for upgrades | 19:07 |
fungi | that's a huge help | 19:07 |
tonyb | I'm thinking we'll bring up 3 new servers and then do a cname switch. | 19:07 |
fungi | that should be fine. there's not a lot of utilization on them at this time of year anyway | 19:08 |
fungi | #topic DIB bionic support (ianw) | 19:08 |
tonyb | I was considering a more complex process for growing the jvb pool but I think that way is unneeded | 19:08 |
fungi | i think this got covered last week. was there any followup we needed to do? | 19:08 |
fungi | seems like there was some work to fix the dib unit tests? | 19:09 |
fungi | i'm guessing this no longer needed to be on the agenda, just making sure | 19:09 |
fungi | #topic Python container updates (tonyb) | 19:10 |
fungi | zuul-operator seems to still need addressing | 19:11 |
tonyb | no updates this week | 19:11 |
fungi | no worries, just checking. thanks! | 19:11 |
fungi | Gitea 1.21.1 Upgrade (clarkb) | 19:11 |
fungi | er... | 19:11 |
fungi | #topic Gitea 1.21.1 Upgrade (clarkb) | 19:11 |
tonyb | Yup I intend to update the roles to enhance container logging and then we'll have a good platform to understand the problem | 19:11 |
fungi | we were planning to do the gitea upgrade at the beginning of the week, but with lingering concerns after the haproxy incident over the weekend we decided to postpone | 19:12 |
tonyb | I think we're safe to remove the 2 LBs from emergency right? | 19:12 |
fungi | #link https://review.opendev.org/903805 Downgrade haproxy image from latest to lts | 19:13 |
fungi | that hasn't been approved yet | 19:13 |
fungi | so not until it merges at least | 19:13 |
tonyb | Ah | 19:13 |
fungi | but upgrading gitea isn't necessarily blocked on the lb being updated | 19:14 |
fungi | different system, separate software | 19:14 |
tonyb | Fair point | 19:14 |
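Review 903805 is titled as a downgrade from the `latest` tag to `lts`; a change like that presumably amounts to a one-line edit in the load balancer's compose file followed by a pull and container recreation, roughly as sketched below (paths and exact image spelling are assumptions, not taken from the actual change):

```shell
# In the haproxy service's docker-compose.yaml, swap the mutable tag:
#   image: docker.io/library/haproxy:latest
# for the slower-moving one:
#   image: docker.io/library/haproxy:lts
# then refresh the running container:
docker-compose pull
docker-compose up -d
```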
fungi | anyway, with people also vacationing and/or ill this is probably still not a good time for a gitea upgrade. if the situation changes later in the week we can decide to do it then, i think | 19:15 |
tonyb | Okay | 19:15 |
fungi | #topic Updating Zuul's database server (clarkb) | 19:15 |
tonyb | I suspect there hasn't been much progress this week. | 19:16 |
fungi | i'm not sure where we ended up on this, there was research being done, but also an interest in temporarily dumping/importing on a replacement trove instance in the meantime | 19:16 |
fungi | we can revisit next year | 19:16 |
fungi | #topic Annual Report Season (clarkb) | 19:17 |
fungi | #link OpenDev's 2023 Annual Report Draft will live here: https://etherpad.opendev.org/p/2023-opendev-annual-report | 19:17 |
fungi | we need to get that to the foundation staff coordinator for the overall annual report by the end of the week, so we're about out of time for further edits if you wanted to check it over | 19:17 |
fungi | #topic EMS discontinuing legacy/consumer hosting plans (fungi) | 19:18 |
fungi | we received a notice last week that element matrix services (ems) who hosts our opendev.org matrix homeserver for us is changing their pricing and eliminating the low-end plan we had the foundation paying for | 19:19 |
fungi | the lowest "discounted" option they're offering us comes in at 10x what we've been paying, and has to be paid a year ahead in one lump sum | 19:20 |
fungi | (we were paying monthly before) | 19:20 |
tonyb | when? | 19:20 |
tonyb | does the plan need to be purchased | 19:20 |
fungi | we have until 2024-02-07 to upgrade to a business hosting plan or move elsewhere | 19:20 |
tonyb | phew | 19:20 |
fungi | so ~1.5 months to decide on and execute a course of action | 19:21 |
tonyb | not a lot of lead time but also some lead time | 19:21 |
corvus | is the foundation interested in upgrading? | 19:22 |
fungi | i've so far not heard anyone say they're keen to work on deploying a matrix homeserver in our infrastructure, and i looked at a few (4 i think?) other hosting options but they were either as expensive or problematic in various ways, and also we'd have to find time to export/import our configuration and switch dns resulting in some downtime | 19:22 |
fungi | i've talked to the people who hold the pursestrings on the foundation staff and it sounds like we could go ahead and buy a year of business service from ems since we do have several projects utilizing it at this point | 19:23 |
fungi | which would buy us more time to decide if we want to keep doing that or work on our own solution | 19:24 |
tonyb | A *very* quick look implies that hosting our own server wouldn't be too bad. the hardest part will be the export/import and downtime | 19:24 |
frickler | another option might be dropping the homeserver and moving the rooms to matrix.org? | 19:24 |
tonyb | I suspect that StarlingX will be the "most impacted" | 19:24 |
frickler | I tried running a homeserver privately some time ago but it was very opaque and not debuggable | 19:25 |
fungi | maybe, but with as many channels as they have they're still not super active on them (i lurk in all their channels and they average a few messages a day tops) | 19:25 |
corvus | fungi: does the business plan support more than one hostname? the foundation may be able to eke out some more value if they can use the same plan to host internal comms. | 19:26 |
fungi | looking at https://element.io/pricing it's not clear to me how that's covered exactly | 19:28 |
fungi | maybe? | 19:28 |
corvus | ok. just a thought :) | 19:28 |
frickler | also, is that "discounted" option a special price or does that match the public pricing? | 19:28 |
fungi | the "discounted" rate they offered us to switch is basically the normal business cloud option on that page, but with a reduced minimum user count of 20 instead of 50 | 19:29 |
fungi | anyway, mostly wanted to put this on the agenda so folks know it's coming and have some time to think about options | 19:30 |
fungi | we can discuss again in the next meeting which will be roughly a month before the deadline | 19:31 |
corvus | if the foundation is comfortable paying for it, i'd lean that direction | 19:31 |
fungi | yeah, i'm feeling similarly. i don't think any of us has a ton of free time for another project just now | 19:31 |
corvus | (i think there are good reasons to do so, including the value of the service provided compared to our time and materials cost of running it ourselves, and also supporting open source projects) | 19:32 |
fungi | agreed, and while it's 10x what we've been paying, there wasn't a lot of surprise at a us$1.2k/yr+tax price tag | 19:32 |
fungi | helps from a budget standpoint that it's due in the beginning of the year | 19:33 |
corvus | tbh i thought the original price was way too low for an org (i'm personally sad that it isn't an option for individuals any more though) | 19:33 |
fungi | yeah, we went with it mainly because they didn't have any open source community discounts, which we'd have otherwise opted for | 19:34 |
fungi | any other comments before we move to other topics? | 19:34 |
fungi | #topic Followup on 20231216 incident (frickler) | 19:35 |
fungi | you have the floor | 19:35 |
frickler | well I just collected some things that came to my mind on sunday | 19:35 |
frickler | first question: Do we want to pin external images like haproxy and only bump them after testing? (Not sure that would've helped for the current issue though) | 19:36 |
fungi | there's a similar question from corvus in 903805 about whether we want to make the switch from "latest" to "lts" permanent | 19:36 |
fungi | testing wouldn't have caught it though i don't think | 19:37 |
corvus | yeah, unlike gerrit/gitea where there's stuff to test, i don't think we're going to catch haproxy bugs in advance | 19:37 |
fungi | but maybe someone with a higher tolerance for the bleeding edge would have spotted it before latest became lts | 19:37 |
fungi | also it's not like we use recent/advanced features of haproxy | 19:38 |
corvus | for me, i think maybe permanently switching to tracking the lts tag is the right balance of auto-upgrade with hopefully low probability of breakage | 19:38 |
fungi | so i think the answer is "it depends, but we can be conservative on haproxy and similar components" | 19:38 |
frickler | are there other images we consume that could cause similar issues? | 19:38 |
frickler | and I'm fine with haproxy:lts as a middle ground for now | 19:39 |
fungi | i don't know off the top of my head, but if someone wants to `git grep :latest$` and do some digging, i'm happy to review a change | 19:39 |
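The audit suggested here is a one-liner against a system-config checkout; something like the following, with the second form catching tags that are not at end of line:

```shell
# Find image references pinned to the mutable "latest" tag.
git grep -n ':latest$'
# Broader net, in case a tag is followed by trailing content:
git grep -n 'image:.*:latest'
```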
frickler | ok, second thing: Use docker prune less aggressively for easier rollback? | 19:39 |
frickler | We do so for some services, like https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/gitea/tasks/main.yaml#L71-L76, might want to duplicate for all containers? Bump the hold time to 7d? | 19:39 |
corvus | (also, honestly i think the fact that haproxy is usually rock solid is why it took us so long to diagnose it. normally checking restart times would be near the top of the list of things to check) | 19:40 |
fungi | fwiw, when i switched gitea-lb's compose file from latest to lts and did a pull, nothing was downloaded, the image was still in the cache | 19:40 |
frickler | so "docker prune" doesn't clear the cache? | 19:41 |
tonyb | IIRC it did download on zuul-lb | 19:41 |
corvus | i sort of wonder what we're trying to achieve there? resiliency against upstream retroactively changing a tag? or shortened download times? or ability to guess what versions we were running by inspecting the cache? | 19:41 |
frickler | being able to have a fast revert of an image upgrade by just checking "docker images" locally | 19:42 |
fungi | i guess the concern is that we're tracking lts, upstream moves lts to a broken image, and we've pruned the image that lts used to point to so we have to redownload it when we change the tag? | 19:42 |
frickler | also I don't think we are having disk space issues that make fast pruning essential | 19:43 |
corvus | if it's resiliency against changes, i agree that 7d is probably a good idea. otherwise, 1-3 days is probably okay... if we haven't cared enough to look after 3 days, we can probably check logs or dockerhub, etc... | 19:43 |
fungi | but also the download times are generally on the order of seconds, not minutes | 19:43 |
fungi | it might buy us a little time but it's far from the most significant proportion of any related outage | 19:44 |
frickler | the 3d is only in effect for gitea, most other images are pruned immediately after upgrading | 19:44 |
corvus | (we're probably going to revert to a tag, which, since we can download in a few seconds, means the local cache isn't super important) | 19:44 |
fungi | i'm basically +/-0 on adjusting image cache times. i agree that we can afford the additional storage, but want to make sure it doesn't grow without bound | 19:45 |
frickler | ok, so leave it at that for now | 19:45 |
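If the retention window does get bumped later, the knob under discussion is docker's `until` filter on the prune call; a minimal sketch follows, with 168h matching the 7d suggestion (whether the roles invoke prune exactly this way is an assumption):

```shell
# Only prune images that have been unused for more than 7 days, so a
# recently replaced image stays available locally for a quick rollback.
docker image prune --all --force --filter "until=168h"
```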
frickler | next up: Add timestamps to zuul_reboot.log? | 19:45 |
frickler | https://opendev.org/opendev/system-config/src/branch/master/playbooks/service-bridge.yaml#L41-L55 Also this is running on Saturdays (weekday: 6), do we want to fix the comment or the dow? | 19:45 |
fungi | also having too many images cached makes it a pain to dig through when you're looking for a recent-ish one | 19:45 |
fungi | is zuul_reboot.log a file? on bridge? | 19:46 |
frickler | yes | 19:46 |
frickler | the code above shows how it is generated | 19:46 |
fungi | aha, /var/log/ansible/zuul_reboot.log | 19:47 |
corvus | adding timestamps sounds good to me; i like the current time so i'd say change the comment | 19:47 |
fungi | i have no objection to adding timestamps | 19:47 |
fungi | to, well, anything really | 19:47 |
frickler | ok, so I'll look into that | 19:47 |
fungi | more time-based context is preferable to less | 19:47 |
fungi | thanks! | 19:47 |
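One lightweight way to add the timestamps frickler is volunteering to look into is to filter the playbook output where it is redirected into the log; a sketch assuming `ts` from moreutils is available on bridge (the playbook path is hypothetical, the log path is the one mentioned above):

```shell
# Prefix each line of output with a timestamp before appending to the log.
ansible-playbook /path/to/zuul_reboot.yaml 2>&1 \
  | ts '%Y-%m-%d %H:%M:%S' >> /var/log/ansible/zuul_reboot.log
```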
frickler | final one: Do we want to document or implement a procedure for rolling back zuul upgrades? Or do we assume that issues can always be fixed in a forward going way? | 19:47 |
fungi | i think the challenge there is that "downgrading" may mean manually undoing database migrations | 19:48 |
frickler | like what would we have done if we hadn't found a fast fix for the timer issue? | 19:48 |
fungi | the details of which will differ from version to version | 19:48 |
fungi | frickler: if the solution hadn't been obvious i was going to propose a revert of the offending change and try to get that fast-tracked | 19:49 |
frickler | ok, what if no clear bad patch had been identied? | 19:49 |
frickler | identified | 19:49 |
frickler | anyway, we don't need to discuss this at length right now, more something to think about medium term | 19:50 |
fungi | for zuul we're in a special situation where several of us are maintainers, so we've generally been able to solve things like that quickly one way or another | 19:50 |
corvus | i agree with fungi, any downgrade procedure is dependent on the revisions in scope, so i don't think there's a generic process we can do | 19:50 |
fungi | it'll be an on-the-spot determination as to whether its less work to roll forward or try to unwind things | 19:51 |
frickler | ok, time's tight, so lets move to AFS? | 19:51 |
fungi | yep! | 19:51 |
fungi | #topic AFS quota issues (frickler) | 19:51 |
frickler | mirror.openeuler has reached its quota limit and the mirror job seems to have been failing for two weeks. I'm also a bit worried that they seem to have doubled their volume over the last 12 months | 19:52 |
frickler | ubuntu mirrors are also getting close, but we might have another couple of months time there | 19:52 |
frickler | mirror.centos-stream seems to have a steep increase in the last two months and might also run into quota limits soon | 19:52 |
frickler | project.zuul with the latest releases is getting close to its tight limit of 1GB (sic), I suggest to simply double that | 19:52 |
frickler | the last one is easy I think. for openeuler instead of bumping the quota someone may want to look into cleanup options first? | 19:52 |
frickler | the others are more of something to keep an eye on | 19:53 |
fungi | broken openeuler mirrors that nobody brought to our attention would indicate they're not being used, but yes it's possible we can filter out some things like we do for centos | 19:53 |
fungi | i'll try to figure out based on git blame who added the openeuler mirror and see if they can propose improvements before lunar new year | 19:53 |
frickler | well they are being used in devstack, but being out of date for some weeks does not yet break jobs | 19:53 |
corvus | feel free to action me on the zuul thing if no one else wants to do it | 19:54 |
fungi | i agree just bumping the zuul quota is fine | 19:54 |
fungi | #action fungi Reach out to someone about cleaning up OpenEuler mirroring | 19:54 |
fungi | #action corvus Increase project.zuul AFS volume quota | 19:55 |
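For reference, checking and doubling the project.zuul quota would look roughly like this; the AFS mount point is an assumption, and `fs` takes the limit in kilobytes (2097152 KB = 2 GB, double the current 1 GB):

```shell
# Inspect current usage versus quota.
fs listquota /afs/openstack.org/project/zuul
# Raise the quota on the read-write volume to 2 GB.
fs setquota /afs/openstack.org/project/zuul -max 2097152
# Push the change out to the read-only replicas.
vos release project.zuul
```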
fungi | let's move to the last topic | 19:55 |
fungi | #topic Broken wheel build issues (frickler) | 19:55 |
fungi | centos8 wheel builds are the only ones that are thoroughly broken currently? | 19:55 |
fungi | i'm pleasantly surprised if so | 19:56 |
frickler | fungi: https://review.opendev.org/c/openstack/devstack/+/900143 is the last patch on devstack that a quick search showed me for openeuler | 19:56 |
frickler | I think centos9 too? | 19:56 |
fungi | oh, >+8 | 19:56 |
fungi | >=8 | 19:56 |
fungi | got it | 19:56 |
frickler | though depends on what you mean by thoroughly (8 months vs. just 1) | 19:57 |
fungi | how much centos testing is going on these days now that tripleo has basically closed up shop? | 19:57 |
fungi | wondering how much time we're saving by not rebuilding some stuff from sdist in centos jobs | 19:57 |
frickler | not sure, I think some usage is still there for special reqs like FIPS | 19:58 |
tonyb | yup FIPS still needs it. | 19:58 |
frickler | at least people are still concerned enough about devstack global_venv being broken on centos | 19:58 |
fungi | for 9 or 8 too? | 19:58 |
frickler | both I think | 19:59 |
tonyb | I can work with ade_lee to verify what *exactly* is needed and fix or prune as appropriate | 19:59 |
fungi | we can quite easily stop running the wheel build jobs, if the resources for running those every day are a concern | 19:59 |
fungi | i guess we can discuss options in #opendev since we're past the end of the hour | 20:00 |
frickler | the question is then do we want to keep the outdated builds or purge them too? | 20:00 |
fungi | keeping them doesn't hurt anything, i don't think | 20:00 |
fungi | it's just an extra index url for pypi | 20:00 |
fungi | and either the desired wheel is there or it's not | 20:00 |
fungi | and if it's not, the job grabs the sdist from pypi and builds it | 20:00 |
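That fallback is plain pip behaviour once an extra index is configured; the consuming jobs effectively do something like the following (the mirror URL and package name are placeholders):

```shell
# If a prebuilt wheel is published on the extra index it gets used;
# otherwise pip falls back to the sdist on PyPI and builds locally.
pip install \
  --extra-index-url https://mirror.example.opendev.org/wheel/example-platform/ \
  some-package
```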
tonyb | and storage? | 20:00 |
frickler | it does mask build errors that can happen for people that do not have access to those wheels | 20:00 |
frickler | if the build was working 6 months ago but has broken since then | 20:01 |
frickler | but anyway, not urgent, we can also continue next year | 20:01 |
fungi | it's a good point, we considered that as a balance between using job resources continually building the same wheels over and over | 20:01 |
fungi | and projects forgetting to list the necessary requirements for building the wheels for things they depend on that lack them | 20:01 |
fungi | okay, let's continue in #opendev. thanks everyone! | 20:02 |
fungi | #endmeeting | 20:02 |
opendevmeet | Meeting ended Tue Dec 19 20:02:18 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.html | 20:02 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.txt | 20:02 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-12-19-19.00.log.html | 20:02 |
frickler | thx fungi | 20:02 |
tonyb | Thanks all | 20:02 |