clarkb | meeting time | 19:00 |
---|---|---|
clarkb | we'll get started in a minute | 19:00 |
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Jul 19 19:01:06 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000345.html Our Agenda | 19:01 |
clarkb | There were no announcements. So we'll dive right into our topic list | 19:01 |
clarkb | #topic Topics | 19:01 |
clarkb | #topic Improving OpenDev's CD throughput | 19:01 |
clarkb | Our new Zuul management seems to be working well (no issues related to the auto upgrades yet that I am aware of) | 19:02 |
clarkb | Probably worth picking the core of this topic back up again if anyone has time for it. | 19:02 |
ianw | o/ | 19:02 |
ianw | ++; perpetually on my todo stack but somehow new things keep getting pushed on it! | 19:03 |
clarkb | I know the feeling :) | 19:03 |
clarkb | I don't think this is a rush. We've been slowly chipping at it and making improvements along the way. | 19:03 |
clarkb | Just wanted to point out that the most recent of those improvements haven't appeared to create any problems. Please do let us know if you notice any issues with the auto upgrades | 19:04 |
clarkb | #topic Improving Grafana management | 19:05 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html | 19:05 |
clarkb | I've been meaning to followup on this thread but too much travel and vacation recently. | 19:05 |
clarkb | ianw: are we mostly looking for feedback from those that had concerns previously? | 19:05 |
clarkb | cc corvus ^ if you get a chance to look over that and provide your feedback to the thread that would be great. This way we can consolidate all the feedback asynchronously and don't have to rely on the meeting for it | 19:06 |
ianw | concerns, comments, whatever :) it's much more concrete as there's actual examples that work now :) | 19:06 |
clarkb | #topic Bastion Host Updates | 19:08 |
clarkb | I'm not sure there has been much movement on this yet. But I've also been distracted lately (hopefully back to normalish starting today) | 19:09 |
clarkb | ianw: are there any changes for the venv work yet or any other related updates? | 19:09 |
ianw | not yet, another todo item sorry | 19:09 |
clarkb | no problem. I bring it up because it is slightly related to the next topic | 19:10 |
clarkb | #topic Bionic Server Upgrades | 19:10 |
clarkb | We've got a number of Bionic servers out there that we need to start upgrading to either focal or jammy | 19:10 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done. | 19:10 |
clarkb | I've begun taking notes there and trying to classify and call out the difficulty and process for the various servers | 19:11 |
clarkb | if you want to add more notes about bridge there please feel free | 19:11 |
clarkb | Also if you notice I've missed any feel free to add them (this data set started with the ansible fact cache so should be fairly complete) | 19:11 |
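As an aside on how a list like this can be assembled, here is a minimal sketch that scans an Ansible fact cache for hosts still on Bionic. It assumes the jsonfile fact cache plugin (one JSON file per host); the cache path and exact fact key names are assumptions and may differ on bridge.

```python
"""List hosts still on Ubuntu Bionic by reading the Ansible fact cache."""
import json
from pathlib import Path

FACT_CACHE = Path("/var/cache/ansible/facts")  # assumed location of the jsonfile cache

for fact_file in sorted(FACT_CACHE.glob("*")):
    try:
        facts = json.loads(fact_file.read_text())
    except (OSError, json.JSONDecodeError):
        continue  # skip unreadable or partially written fact files
    # Key names assume facts are cached with the ansible_ prefix.
    if facts.get("ansible_distribution_release") == "bionic":
        print(fact_file.name, facts.get("ansible_distribution_version", "?"))
```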
clarkb | I did have a couple of things to call out. The first is that I'm not sure we have openafs packages for jammy and/or have tested jammy with its openafs packages | 19:12 |
clarkb | I think we have asserted we prefer to just run our own packages for now though | 19:12 |
clarkb | We can either update servers to focal and punt on the jammy openafs work or start figuring out jammy openafs first. I think either option is fine and probably depends on who starts where :) | 19:13 |
ianw | we do have jammy packages for that @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs | 19:13 |
clarkb | oh right for the wheel mirror stuff | 19:13 |
clarkb | any idea if we have exercised them outside of the wheel mirror context? | 19:13 |
ianw | in CI; but only R/O | 19:13 |
clarkb | I'll make note of that on the etherpad really quickly | 19:14 |
ianw | one thing that is another perpetual todo list item is to use kafs; it might be worth considering for jammy | 19:14 |
ianw | it has had a ton of work, more than i can keep up with | 19:15 |
clarkb | ianw: fwiw I tried using it on tumbleweed a while back and didn't have great luck with it. But that was many kernel releases ago and probably worth testing again | 19:15 |
fungi | worth noting, the current openafs packages don't build with very new kernels | 19:15 |
fungi | we'll have to keep an eye out if we add hwe kernels on jammy or something | 19:15 |
clarkb | noted | 19:15 |
ianw | fungi: yeah, although usually there's patches in their gerrit pretty quickly (one of the reasons we keep our ppa :) | 19:16 |
corvus | clarkb: ianw i have no further feedback on grafyaml, thanks | 19:16 |
corvus | (i think ianw incorporated the concerns and responses/mitigations i raised earlier in that message) | 19:17 |
clarkb | The other thing I noticed is that our nameservers are needing upgrades. I've got a discussion in the etherpad going on whether or not we should do replacements or do them in place. I think doing replacements would be best, but we have to update info with our registrars to do that. The opendev.org domain is managed by the foundation's registrar service and fungi and I should | 19:18 |
clarkb | be able to update their records. zuul-ci.org is managed by corvus. We can probably coordinate stuff if necessary without too much trouble, but it is an important step to not miss | 19:18 |
clarkb | corvus: thanks | 19:18 |
clarkb | One reason to do replacement ns servers is that we can have them running on focal or jammy and test that nsd and bind continue to work as expected before handing production load to them | 19:19 |
fungi | we'll want to double-check that any packet filtering or axfr/also-notify we do in bind is updated for new ip addresses | 19:19 |
ianw | i feel like history may show we did those in place last time? | 19:19 |
clarkb | ianw: I think they started life as bionic and haven't been updated since? | 19:19 |
ianw | (i'm certain we did for openafs) | 19:19 |
clarkb | But I could be wrong | 19:19 |
fungi | i'd have to double-check, but typically those are configured by ip address due to chicken and egg problems | 19:19 |
ianw | that could also be true :) | 19:19 |
clarkb | fungi: yes I think that is all managed by ansible and we should be able to have adns2 and ns3 and ns4 completely up and running before we change anything if we like | 19:20 |
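A sketch of the kind of pre-cutover check being described: compare the SOA serials served by the existing primary and the candidate replacements for each zone. It requires the dnspython package, and the server IPs are placeholders since the replacement hosts don't exist yet.

```python
"""Compare SOA serials across current and candidate nameservers."""
import dns.message
import dns.query

ZONES = ["opendev.org", "zuul-ci.org"]
SERVERS = {
    "adns1 (current)": "203.0.113.1",  # placeholder IP
    "ns3 (candidate)": "203.0.113.3",  # placeholder IP
    "ns4 (candidate)": "203.0.113.4",  # placeholder IP
}

for zone in ZONES:
    query = dns.message.make_query(zone, "SOA")
    for label, ip in SERVERS.items():
        response = dns.query.udp(query, ip, timeout=5)
        # A nameserver that is serving the zone answers with its SOA record.
        serial = response.answer[0][0].serial if response.answer else None
        print(f"{zone:12} {label:18} serial={serial}")
```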
corvus | the only registrar coord we should need to do is opendev.org itself | 19:20 |
corvus | zuul-ci.org references nsX.opendev.org by name. but nsX.opendev.org has glue records in .org | 19:21 |
fungi | adns1 and ns1 have a last modified date of 2018 on their /etc/hosts files, so i suspect an in-place upgrade previously | 19:21 |
clarkb | fungi: I don't think they existed prior to that | 19:21 |
corvus | (if we elect to make ns3/ns4, then zuul-ci.org would need to update its ns records, but that's just a regular zone update) | 19:21 |
fungi | oh, yes they're running ubuntu 18.04 so an october 2018 creation time would imply that | 19:22 |
clarkb | corvus: ah right | 19:22 |
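For reference, a hedged sketch (dnspython again) of how to inspect what the .org servers currently delegate for opendev.org, including the glue addresses a registrar update would have to change:

```python
"""Show the .org delegation and glue records for opendev.org."""
import dns.message
import dns.query
import dns.resolver

# Find one authoritative server for the parent zone (.org).
tld_ns = str(dns.resolver.resolve("org.", "NS")[0])
tld_ip = str(dns.resolver.resolve(tld_ns, "A")[0])

# Ask it about opendev.org; the referral carries the delegation NS records
# (authority section) and the glue A/AAAA records (additional section).
response = dns.query.udp(dns.message.make_query("opendev.org", "NS"), tld_ip, timeout=5)
print("Delegation:")
for rrset in response.authority:
    print(" ", rrset)
print("Glue:")
for rrset in response.additional:
    print(" ", rrset)
```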
fungi | anyway, it won't be hard, but i do think that replacing the nameservers is likely to need some more thorough planning that we'll be able to do in today's meeting | 19:23 |
clarkb | fungi: yup I mostly wanted to call it out and have people leave feedback on the etherpad | 19:23 |
fungi | wfm | 19:23 |
clarkb | I definitely think we'll be doing jammy upgrades on other services before the nameservers as well | 19:23 |
clarkb | Also if anyone would really like to upgrade any specific service on that etherpad doc feel free to indicate so there | 19:24 |
clarkb | I'm hoping I can dig into jammy node replacements for something in the next week or so | 19:24 |
clarkb | maybe zp01 since it should be straightforward to swap out | 19:24 |
clarkb | #topic Zuul and Ansible v5 and Glibc deadlock | 19:25 |
clarkb | Yesterday all of our executors were updated to run the new zuul executor image with bookworm's glibc | 19:26 |
clarkb | Since then https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0 seems to show no more post failures | 19:26 |
ianw | sorry about the first attempt, i didn't realise the restart playbook didn't pull images. i guess that is by design? | 19:26 |
clarkb | Considering we had post failures with half the executors on old glibc I strongly suspect this is corrected. | 19:27 |
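For anyone who wants to double check from a script rather than the dashboard, a small sketch against the Zuul builds API; the query parameters mirror the dashboard URL linked above and are assumed to be accepted by the API.

```python
"""List recent POST_FAILURE results in the opendev-prod-hourly pipeline."""
import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {"pipeline": "opendev-prod-hourly", "result": "POST_FAILURE", "limit": 50}
for build in requests.get(API, params=params, timeout=30).json():
    print(build["end_time"], build["job_name"], build["result"])
```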
clarkb | ianw: ya there is a separate pull playbook because sometimes you just need to restart without updating the image | 19:27 |
ianw | i did think of using ansible's prompt thing to write an interactive playbook, which would say "do you want to restart <scheduler|merger|executor> <y/n>" and then "do you want to pull new images <y|n>" | 19:28 |
ianw | now i've made the mistake, i guess i won't forget next time :) | 19:28 |
clarkb | It also looks like upstream ansible has closed their issue and says the fix is to upgrade glibc | 19:29 |
ianw | clarkb: yes, almost certainly; i managed to eventually catch a hung process and it traced back into the exact code discussed in those issues | 19:29 |
clarkb | good news is that is what we have done. Bad news is I guess we're running backported glibc on our executors until bookworm releases | 19:29 |
corvus | so... 10...20 years? | 19:29 |
ianw | (for posterity, discussions @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-17.log.html#t2022-07-17T23:30:09) | 19:29 |
clarkb | corvus: ha, zigo made it sound like this winter. So maybe next spring :) | 19:30 |
corvus | i wonder if a glibc fix could make it into stable? | 19:30 |
clarkb | (NA seasons) | 19:30 |
clarkb | corvus: not a bad idea since it affects real software like ansible. Though like rhel I wonder if debian won't prioritize it due to their packaged version of ansible being fine (however other software may struggle with this bug and we just don't know about it yet) | 19:31 |
ianw | it seemed a quite small fix and also seems serious enough, it's very annoying to have things deadlock like that | 19:31 |
clarkb | is the process for that similar to ubuntu? Open a bug with a link to the issue and ask for a stable backport with justification? | 19:32 |
fungi | bookworm release is expected in 2023 yeah | 19:32 |
fungi | the freeze starts late this year | 19:33 |
fungi | hard freeze is next march | 19:33 |
ianw | clarkb: that's where i'd start; i can take a closer look at the patch and send something up | 19:34 |
fungi | actually i misspoke, the transitions deadline is late this year, soft freeze is early next year according to the schedule | 19:34 |
clarkb | ianw: thanks | 19:34 |
fungi | mmm, in fact even the transition deadline is actually early next year (january) | 19:35 |
fungi | https://release.debian.org/ | 19:35 |
clarkb | Seems like we've got a good next step and we can likely run that glibc until 2023 if necessary. It is limited to the executors which should limit impact | 19:36 |
clarkb | #topic Other Zuul job POST_FAILURES | 19:37 |
clarkb | yesterday it occurred to me that the POST_FAILURES that OSA was seeing could be due to the same issue. I asked jrosser if they had updated their job ansible version to v5 early (because they had these errors prior to us updating our default), but unfortunately they had not | 19:37 |
clarkb | that means their issue is very likely independent of our issue that we believe is now fixed. | 19:38 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs | 19:38 |
clarkb | for that reason I think we should try and get this change landed to add the extra swift location debugging info to our base jobs. Please cross check against the base-test job and leave a review | 19:38 |
clarkb | ianw did manage to strace a live one last week and found that it appeared to be our end waiting on read()s returning from the remote swift server | 19:39 |
corvus | which provider? | 19:39 |
clarkb | the remote was ovh. I suspect we may need to gather some evidence that it is consistently ovh using the above change then take that to ovh and see if they know why it is breaking | 19:40 |
ianw | one thing we noticed was this one included an ARA report, which is a lot of little files | 19:41 |
ianw | at first i thought that might be in-common with our hang issues (they generate ara), but as discussed turned out to be completely separate | 19:42 |
ianw | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-13.log.html#t2022-07-13T23:19:34 | 19:42 |
ianw | is the discussion of what i found on the stuck process | 19:42 |
clarkb | I suspect that the remote service is having some sort of problem. If we can collect stronger evidence of that (through consistent failures against a single backend) and take it with timestamps to the provider, they should be able to debug | 19:43 |
clarkb | which makes that base job change important. But ya we can take it from there if/when we get that evidence | 19:44 |
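A sketch of the sort of evidence gathering being described: tally the log storage endpoints of recent POST_FAILURE builds from the Zuul API. Builds whose upload failed outright may have no log_url at all (which is exactly what the base-jobs debugging change is meant to cover), so those are grouped separately.

```python
"""Tally which log storage hosts recent POST_FAILURE builds uploaded to."""
from collections import Counter
from urllib.parse import urlparse

import requests

API = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {"result": "POST_FAILURE", "limit": 200}
hosts = Counter()
for build in requests.get(API, params=params, timeout=30).json():
    log_url = build.get("log_url")
    hosts[urlparse(log_url).netloc if log_url else "(no log_url recorded)"] += 1

for host, count in hosts.most_common():
    print(f"{count:4d}  {host}")
```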
clarkb | #topic Gerrit 3.5 Larger Caches | 19:45 |
clarkb | Last week our gerrit installation ran out of disk space due to the growth of its caches. Turns out that Gerrit 3.5 added some new caches that were expected to grow quite large but upstream didn't properly document this change in their release notes | 19:45 |
clarkb | Upstream has said they should update those release notes now. They also noted part of the problem is a deficiency in h2 where it spends only 200ms compacting its backing file for the database before giving up which is insufficient to make a very large file smaller | 19:46 |
clarkb | That said they did think that the backing files would essentially stop growing after a certain size (a size big enough to fit all the data we use in a day) | 19:47 |
clarkb | To accommodate that we've increased the size of our volume and filesystem on the gerrit installation (thank you ianw and fungi for getting that done) | 19:47 |
clarkb | We are now in a period of monitoring the consumption to check the assumptions from upstream about growth eventually plateauing. | 19:48 |
clarkb | #link http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70356&rra_id=all Cacti graphs of review02 disk consumption | 19:48 |
clarkb | I did test using held nodes (which I should delete now I guess) that you can safely delete these two large cache files while gerrit is offline and it will recreate them when needed after starting again | 19:49 |
clarkb | If we find ourselves in a gerrit is crashing situation I think ^ is a good short term fix. | 19:49 |
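To make that short term fix less error prone, a minimal sketch that just reports the largest H2 cache files under the Gerrit site before anyone deletes anything. The site path is a placeholder, and the assumption that the biggest files are the new 3.5 diff caches should be verified against the actual file names.

```python
"""Report the largest H2 cache files under the Gerrit site directory."""
from pathlib import Path

CACHE_DIR = Path("/home/gerrit2/review_site/cache")  # placeholder path

files = sorted(CACHE_DIR.glob("*.h2.db"), key=lambda p: p.stat().st_size, reverse=True)
for path in files:
    print(f"{path.stat().st_size / 1024**3:8.2f} GiB  {path.name}")
# Only remove cache files while Gerrit is stopped; it recreates them on startup.
```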
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/849886 is our fallback if the cache disk usage doesn't stabilize. | 19:49 |
ianw | ++ | 19:49 |
corvus | does that graph mean that we're approximately at nominal size for those files now since it's leveled out? | 19:49 |
clarkb | And this change which is marked WIP is the long term fix if we end up having problems again. It basically says we won't cache this info on disk (it will still maintain memory caches for the data) | 19:50 |
clarkb | corvus: yes I think we're seeing that we are close to leveling out. The bigger of the two files hasn't grown much at all since the original problem was discovered. The smaller of the two has grown by about 25% | 19:50 |
clarkb | time will tell though. It is also possible that we're just quieter than usual since we had the problem and if development picks up again we'll see growth | 19:51 |
clarkb | I think it is still a bit early to draw any major conclusions, but it is looking good at least | 19:51 |
clarkb | For future upgrades we might want to start capturing any new cache files too. I bet our upgrade job could actually be modified to check which caches are present pre upgrade and diff that against the list post upgrade | 19:52 |
fungi | the cache size does strike me as being unnecessarily large though. searching through that much disk is unlikely to result in a performance benefit | 19:52 |
clarkb | I can look at updating our upgrade job to try and do that | 19:52 |
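A rough sketch of that pre/post comparison, assuming the upgrade job keeps a copy of the cache directory from before the upgrade and passes both directories as arguments:

```python
"""Diff the set of H2 cache files present before and after a Gerrit upgrade."""
import sys
from pathlib import Path


def cache_snapshot(cache_dir: Path) -> set:
    """Return the set of H2 backing file names in a cache directory."""
    return {path.name for path in cache_dir.glob("*.h2.db")}


before = cache_snapshot(Path(sys.argv[1]))  # copy captured pre-upgrade
after = cache_snapshot(Path(sys.argv[2]))   # cache directory post-upgrade
print("caches added by upgrade:  ", sorted(after - before))
print("caches removed by upgrade:", sorted(before - after))
```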
clarkb | fungi: it may actually help because computing diffs is not cheap | 19:53 |
fungi | oh, is it git diffs/interdiffs? | 19:53 |
clarkb | fungi: while doing a large disk lookup may not be quick either, diffing is an expensive operation on top of the file io to perform the diff | 19:53 |
fungi | then yeah, i could see how that could need a lot of space | 19:53 |
clarkb | fungi: yes it is caching results of various diff operations | 19:53 |
ianw | but can anything ever leave the cache, if h2 can't clean it? | 19:54 |
clarkb | ianw: yes items leave the cache aiui. It is just that the backing file isn't trimmed to match | 19:54 |
clarkb | so that very large file may only be 1% used | 19:54 |
corvus | then why do we think it'll level out at a daily max? won't people (bots actually) be continually diffing new things and adding to the cache and therefore backing file? | 19:55 |
clarkb | because it isn't append only aiui | 19:55 |
clarkb | it can reuse the allocated blocks that are free in h2 (but not on disk) to cache new things | 19:56 |
corvus | backing file space gets reclaimed but not deleted? | 19:56 |
clarkb | from h2's perspective the database may have plenty of room and not need to grow the disk. But the disk once expanded doesn't really shrink | 19:56 |
ianw | i guess it needs to, basically, defrag itself? | 19:56 |
clarkb | ianw: ya and spend more than 200ms doing so apparently | 19:57 |
corvus | sound like it -- and defrag is too computationally expensive | 19:57 |
fungi | moreso the larger the set | 19:57 |
ianw | well i've spent plenty of time staring at defrag moving blocks around in the dos days :) never did me any harm ... kids (databases?) these days ... | 19:57 |
clarkb | its like a python list. It will expand and expand to add more records. But once you del stuff out of it you don't free up that room | 19:58 |
clarkb | but if you need the list to be bigger again you don't need to allocate more memory you just reuse what was already there | 19:58 |
corvus | at least that's a consistent set of behaviors i can understand, even if it's .... suboptimal. | 19:58 |
ianw | it seems though that part of upgrade steps is probably to delete the caches though? | 19:58 |
ianw | a super-defrag | 19:59 |
clarkb | ianw: upstream was talking about doing that automatically for us | 19:59 |
clarkb | so ya might make sense to just do it | 19:59 |
ianw | ++ one to keep in mind for the checklist | 19:59 |
clarkb | We are almost at time and I had a couple of small changes I wanted to call out before we run out of time | 19:59 |
clarkb | #topic Open Discussion | 19:59 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/850268 Add a new mailing list for computing force network to lists.opendev.org | 20:00 |
clarkb | fungi: ^ I think you wanted to make sure others were comfortable with adding this list under the opendev.org domain? | 20:00 |
clarkb | (we have a small number of similar working group like lists already so no objection from me) | 20:00 |
fungi | yeah, even just a heads up that it's proposed | 20:00 |
fungi | a second review would be appreciated | 20:01 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/850252 Fix our new project CI checks to handle gitea/git trying to auth when repo doesn't exist | 20:01 |
clarkb | When you try to git clone a repo that doesn't exist in gitea you don't get a 404 and instead get prompted to provide credentials as the assumption is you need to authenticate to see a repo which isn't public | 20:01 |
clarkb | This change works around that by doing an http fetch of the gitea ui page instead of a git clone to determine if the repo exists already | 20:02 |
clarkb | (note this isn't really gitea specific, github will do the same thing) | 20:02 |
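A minimal sketch of that workaround; the project names below are just examples, and the real check lives in the project-config change linked above.

```python
"""Check whether a repo exists by probing the gitea web UI instead of cloning."""
import requests


def repo_exists(base_url: str, project: str) -> bool:
    """True if gitea serves a UI page for the project, False on a 404."""
    response = requests.get(f"{base_url}/{project}", timeout=30)
    if response.status_code == 404:
        return False
    response.raise_for_status()  # surface anything unexpected (5xx, etc.)
    return True


print(repo_exists("https://opendev.org", "opendev/system-config"))   # expect True
print(repo_exists("https://opendev.org", "opendev/does-not-exist"))  # expect False
```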
clarkb | And we are at time. Thank you everyone | 20:02 |
clarkb | Feel free to continue discussion on the mailing list or in #opendev. I'll end this meeting so we can return to $meal and other tasks :) | 20:03 |
clarkb | #endmeeting | 20:03 |
opendevmeet | Meeting ended Tue Jul 19 20:03:12 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:03 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.html | 20:03 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.txt | 20:03 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.log.html | 20:03 |
*** Guest5201 is now known as diablo_rojo_phone | 20:57 |