Tuesday, 2022-07-19

19:00 <clarkb> meeting time
19:00 <clarkb> we'll get started in a minute
19:00 <fungi> ahoy!
19:01 <clarkb> #startmeeting infra
19:01 <opendevmeet> Meeting started Tue Jul 19 19:01:06 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01 <opendevmeet> The meeting name has been set to 'infra'
19:01 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000345.html Our Agenda
19:01 <clarkb> There were no announcements, so we'll dive right into our topic list
19:01 <clarkb> #topic Topics
19:01 <clarkb> #topic Improving OpenDev's CD throughput
19:02 <clarkb> Our new Zuul management seems to be working well (no issues related to the auto upgrades yet that I am aware of)
19:02 <clarkb> Probably worth picking the core of this topic back up again if anyone has time for it.
19:02 <ianw> o/
19:03 <ianw> ++; perpetually on my todo stack but somehow new things keep getting pushed on it!
19:03 <clarkb> I know the feeling :)
19:03 <clarkb> I don't think this is a rush. We've been slowly chipping away at it and making improvements along the way.
19:04 <clarkb> Just wanted to point out that the most recent of those improvements haven't appeared to create any problems. Please do let us know if you notice any issues with the auto upgrades
19:05 <clarkb> #topic Improving Grafana management
19:05 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-July/000342.html
19:05 <clarkb> I've been meaning to follow up on this thread but too much travel and vacation recently.
19:05 <clarkb> ianw: are we mostly looking for feedback from those that had concerns previously?
19:06 <clarkb> cc corvus ^ if you get a chance to look over that and provide your feedback to the thread that would be great. This way we can consolidate all the feedback asynchronously and don't have to rely on the meeting for it
19:06 <ianw> concerns, comments, whatever :)  it's much more concrete as there are actual examples that work now :)
19:08 <clarkb> #topic Bastion Host Updates
19:09 <clarkb> I'm not sure there has been much movement on this yet. But I've also been distracted lately (hopefully back to normalish starting today)
19:09 <clarkb> ianw: are there any changes for the venv work yet or any other related updates?
19:09 <ianw> not yet, another todo item sorry
19:10 <clarkb> no problem. I bring it up because it is slightly related to the next topic
19:10 <clarkb> #topic Bionic Server Upgrades
19:10 <clarkb> We've got a number of Bionic servers out there that we need to start upgrading to either focal or jammy
19:10 <clarkb> #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.
19:11 <clarkb> I've begun taking notes there and trying to classify and call out the difficulty and process for the various servers
19:11 <clarkb> if you want to add more notes about bridge there please feel free
19:11 <clarkb> Also if you notice I've missed any feel free to add them (this data set started with the ansible fact cache so should be fairly complete)
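A rough sketch of how a list like that can be pulled out of the fact cache on the bastion, assuming the jsonfile cache plugin with one JSON file per host (the path and fact key are assumptions, not confirmed in the meeting):

```
# print each cached host with its Ubuntu codename, grouped by release
# (cache path /var/cache/ansible/facts and the fact key are assumptions)
for f in /var/cache/ansible/facts/*; do
  printf '%-40s %s\n' "$(basename "$f")" \
    "$(jq -r '.ansible_distribution_release // "unknown"' "$f")"
done | sort -k2
```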
19:12 <clarkb> I did have a couple of things to call out. The first is that I'm not sure we have openafs packages for jammy and/or have tested jammy with its openafs packages
19:12 <clarkb> I think we have asserted we prefer to just run our own packages for now though
19:13 <clarkb> We can either update servers to focal and punt on the jammy openafs work or start figuring out jammy openafs first. I think either option is fine and probably depends on who starts where :)
19:13 <ianw> we do have jammy packages for that @ https://launchpad.net/~openstack-ci-core/+archive/ubuntu/openafs
19:13 <clarkb> oh right for the wheel mirror stuff
19:13 <clarkb> any idea if we have exercised them outside of the wheel mirror context?
19:13 <ianw> in CI; but only R/O
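For reference, exercising the PPA builds on a jammy host would look roughly like this; the package list is the usual openafs client set and should be double-checked against what the PPA actually publishes:

```
# add the openstack-ci-core openafs PPA and install the client
# (exact package names are an assumption; verify against the PPA contents)
sudo add-apt-repository -y ppa:openstack-ci-core/openafs
sudo apt-get update
sudo apt-get install -y openafs-client openafs-krb5 openafs-modules-dkms
# quick read-only smoke test once the client is up
ls /afs/openstack.org/ && fs checkservers
```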
19:14 <clarkb> I'll make note of that on the etherpad really quickly
19:14 <ianw> one thing that is another perpetual todo list item is to use kafs; it might be worth considering for jammy
19:15 <ianw> it has had a ton of work, more than i can keep up with
19:15 <clarkb> ianw: fwiw I tried using it on tumbleweed a while back and didn't have great luck with it. But that was many kernel releases ago and probably worth testing again
19:15 <fungi> worth noting, the current openafs packages don't build with very new kernels
19:15 <fungi> we'll have to keep an eye out if we add hwe kernels on jammy or something
19:15 <clarkb> noted
19:16 <ianw> fungi: yeah, although usually there are patches in their gerrit pretty quickly (one of the reasons we keep our ppa :)
19:16 <corvus> clarkb: ianw i have no further feedback on grafyaml, thanks
19:17 <corvus> (i think ianw incorporated the concerns and responses/mitigations i raised earlier in that message)
19:18 <clarkb> The other thing I noticed is that our nameservers are needing upgrades. I've got a discussion in the etherpad going on whether or not we should do replacements or do them in place. I think doing replacements would be best, but we have to update info with our registrars to do that. The opendev.org domain is managed by the foundation's registrar service and fungi and I should
19:18 <clarkb> be able to update their records. zuul-ci.org is managed by corvus. We can probably coordinate stuff if necessary without too much trouble, but it is an important step to not miss
19:18 <clarkb> corvus: thanks
19:19 <clarkb> One reason to do replacement ns servers is that we can have them running on focal or jammy and test that nsd and bind continue to work as expected before handing production load to them
19:19 <fungi> we'll want to double-check that any packet filtering or axfr/also-notify we do in bind is updated for new ip addresses
19:19 <ianw> i feel like history may show we did those in place last time?
19:19 <clarkb> ianw: I think they started life as bionic and haven't been updated since?
19:19 <ianw> (i'm certain we did for openafs)
19:19 <clarkb> But I could be wrong
19:19 <fungi> i'd have to double-check, but typically those are configured by ip address due to chicken and egg problems
19:19 <ianw> that could also be true :)
19:20 <clarkb> fungi: yes I think that is all managed by ansible and we should be able to have adns2 and ns3 and ns4 completely up and running before we change anything if we like
19:20 <corvus> the only registrar coordination we should need to do is opendev.org itself
19:21 <corvus> zuul-ci.org references nsX.opendev.org by name.  but nsX.opendev.org has glue records in .org
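One way to sanity-check the delegation and glue before and after any registrar change (a sketch; the exact records to expect depend on which nsX hosts end up in service):

```
# walk the delegation from the root so the NS set and glue handed out by the .org servers are visible
dig +trace +nodnssec NS opendev.org
# zuul-ci.org only needs its in-zone NS records updated, since it points at nsX.opendev.org by name
dig +short NS zuul-ci.org
```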
19:21 <fungi> adns1 and ns1 have a last modified date of 2018 on their /etc/hosts files, so i suspect an in-place upgrade previously
19:21 <clarkb> fungi: I don't think they existed prior to that
19:21 <corvus> (if we elect to make ns3/ns4, then zuul-ci.org would need to update its ns records, but that's just a regular zone update)
19:22 <fungi> oh, yes they're running ubuntu 18.04 so an october 2018 creation time would imply that
19:22 <clarkb> corvus: ah right
19:23 <fungi> anyway, it won't be hard, but i do think that replacing the nameservers is likely to need some more thorough planning than we'll be able to do in today's meeting
19:23 <clarkb> fungi: yup I mostly wanted to call it out and have people leave feedback on the etherpad
19:23 <fungi> wfm
19:23 <clarkb> I definitely think we'll be doing jammy upgrades on other services before the nameservers as well
19:24 <clarkb> Also if anyone would really like to upgrade any specific service on that etherpad doc feel free to indicate so there
19:24 <clarkb> I'm hoping I can dig into jammy node replacements for something in the next week or so
19:24 <clarkb> maybe zp01 since it should be straightforward to swap out
19:25 <clarkb> #topic Zuul and Ansible v5 and Glibc deadlock
19:26 <clarkb> Yesterday all of our executors were updated to run the new zuul executor image with bookworm's glibc
19:26 <clarkb> Since then https://zuul.opendev.org/t/openstack/builds?pipeline=opendev-prod-hourly&skip=0 seems to show no more post failures
19:26 <ianw> sorry about the first attempt, i didn't realise the restart playbook didn't pull images.  i guess that is by design?
19:27 <clarkb> Considering we had post failures with half the executors on old glibc I strongly suspect this is corrected.
19:27 <clarkb> ianw: ya there is a separate pull playbook because sometimes you just need to restart without updating the image
19:28 <ianw> i did think of using ansible's prompt thing to write an interactive playbook, which would say "do you want to restart <scheduler|merger|executor> <y/n>" and then "do you want to pull new images <y|n>"
19:28 <ianw> now i've made the mistake, i guess i won't forget next time :)
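For the record, the ordering that tripped things up, as a sketch; the playbook names and paths here are assumptions, not the actual file names in system-config:

```
# pull new images first, then do the rolling restart; the restart playbook on its own
# will happily restart onto whatever image is already on disk
ansible-playbook -v playbooks/zuul_pull.yaml      # name assumed
ansible-playbook -v playbooks/zuul_restart.yaml   # name assumed
```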
19:29 <clarkb> It also looks like upstream ansible has closed their issue and says the fix is to upgrade glibc
19:29 <ianw> clarkb: yes, almost certainly; i managed to eventually catch a hung process and it traced back into the exact code discussed in those issues
19:29 <clarkb> good news is that is what we have done. Bad news is I guess we're running backported glibc on our executors until bookworm releases
19:29 <corvus> so... 10...20 years?
19:29 <ianw> (for posterity, discussions @ https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-17.log.html#t2022-07-17T23:30:09)
19:30 <clarkb> corvus: ha, zigo made it sound like this winter. So maybe next spring :)
19:30 <corvus> i wonder if a glibc fix could make it into stable?
19:30 <clarkb> (NA seasons)
19:31 <clarkb> corvus: not a bad idea since it affects real software like ansible. Though like rhel I wonder if debian won't prioritize it due to their packaged version of ansible being fine (however, other software we just don't know about yet may struggle with this bug)
19:31 <ianw> it seemed like quite a small fix and also seems serious enough; it's very annoying to have things deadlock like that
19:32 <clarkb> is the process for that similar to ubuntu? Open a bug with a link to the issue and ask for a stable backport with justification?
19:32 <fungi> bookworm release is expected in 2023 yeah
19:33 <fungi> the freeze starts late this year
19:33 <fungi> hard freeze is next march
19:34 <ianw> clarkb: that's where i'd start; i can take a closer look at the patch and send something up
19:34 <fungi> actually i misspoke, the transitions deadline is late this year, soft freeze is early next year according to the schedule
19:34 <clarkb> ianw: thanks
19:35 <fungi> mmm, in fact even the transition deadline is actually early next year (january)
19:35 <fungi> https://release.debian.org/
19:36 <clarkb> Seems like we've got a good next step and we can likely run that glibc until 2023 if necessary. It is limited to the executors which should limit impact
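Roughly how a bookworm glibc ends up on a bullseye-based image; our actual executor image build may do this differently, so treat it as a sketch under those assumptions:

```
# add bookworm at a low pin priority so only explicitly requested packages come from it,
# then pull just libc6 from bookworm
echo 'deb http://deb.debian.org/debian bookworm main' > /etc/apt/sources.list.d/bookworm.list
cat > /etc/apt/preferences.d/glibc <<'EOF'
Package: *
Pin: release n=bookworm
Pin-Priority: 100
EOF
apt-get update
apt-get install -y -t bookworm libc6
```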
19:37 <clarkb> #topic Other Zuul job POST_FAILURES
19:37 <clarkb> yesterday it occurred to me that the POST_FAILURES that OSA was seeing could be due to the same issue. I asked jrosser if they had updated their job ansible version to v5 early (because they had these errors prior to us updating our default), but unfortunately they had not
19:38 <clarkb> that means their issue is very likely independent of our issue that we believe is now fixed.
19:38 <clarkb> #link https://review.opendev.org/c/opendev/base-jobs/+/848027 Add remote log store location debugging info to base jobs
19:38 <clarkb> for that reason I think we should try and get this change landed to add the extra swift location debugging info to our base jobs. Please cross check against the base-test job and leave a review
19:39 <clarkb> ianw did manage to strace a live one last week and found that it appeared to be our end waiting on read()s returning from the remote swift server
19:39 <corvus> which provider?
19:40 <clarkb> the remote was ovh. I suspect we may need to gather some evidence that it is consistently ovh using the above change then take that to ovh and see if they know why it is breaking
19:41 <ianw> one thing we noticed was this one included an ARA report, which is a lot of little files
19:42 <ianw> at first i thought that might be in common with our hang issues (they generate ara), but as discussed turned out to be completely separate
19:42 <ianw> https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2022-07-13.log.html#t2022-07-13T23:19:34
19:42 <ianw> is the discussion of what i found on the stuck process
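The general shape of catching one of these in the act looks something like the following; the process-matching pattern is an assumption about how the upload task appears on the executor:

```
# find a log upload that has been running far too long and see what it is blocked on
pid=$(pgrep -of upload-logs-swift)              # cmdline pattern is an assumption
sudo strace -f -tt -e trace=read,poll,select -p "$pid"
# map the file descriptor it is stuck reading back to the remote swift endpoint
sudo ls -l /proc/"$pid"/fd
sudo ss -tnp | grep ",pid=$pid,"
```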
19:43 <clarkb> I suspect that the remote service is having some sort of problem. If we can collect stronger evidence of that (through consistent failures against a single backend) and take that evidence with timestamps to the provider, they should be able to debug it
19:44 <clarkb> which makes that base job change important. But ya we can take it from there if/when we get that evidence
19:45 <clarkb> #topic Gerrit 3.5 Larger Caches
19:45 <clarkb> Last week our gerrit installation ran out of disk space due to the growth of its caches. Turns out that Gerrit 3.5 added some new caches that were expected to grow quite large but upstream didn't properly document this change in their release notes
19:46 <clarkb> Upstream has said they should update those release notes now. They also noted part of the problem is a deficiency in h2 where it spends only 200ms compacting its backing file for the database before giving up, which is insufficient to make a very large file smaller
19:47 <clarkb> That said they did think that the backing files would essentially stop growing after a certain size (a size big enough to fit all the data we use in a day)
19:47 <clarkb> To accommodate that we've increased the size of our volume and filesystem on the gerrit installation (thank you ianw and fungi for getting that done)
19:48 <clarkb> We are now in a period of monitoring the consumption to check the assumptions from upstream about growth eventually plateauing.
19:48 <clarkb> #link http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=70356&rra_id=all Cacti graphs of review02 disk consumption
19:49 <clarkb> I did test using held nodes (which I should delete now I guess) that you can safely delete these two large cache files while gerrit is offline and it will recreate them when needed after starting again
19:49 <clarkb> If we find ourselves in a gerrit is crashing situation I think ^ is a good short term fix.
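Approximately what that short term fix looks like on the server; the cache file names and paths are what I'd expect from Gerrit 3.5's new file-diff caches and our container layout, so verify on a held node first:

```
# with gerrit stopped, drop the two big h2-backed diff caches; gerrit recreates them on demand
docker-compose -f /etc/gerrit-compose/docker-compose.yaml down    # path assumed
rm -v /home/gerrit2/review_site/cache/*file_diff*.db              # names and path assumed
docker-compose -f /etc/gerrit-compose/docker-compose.yaml up -d
```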
19:49 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/849886 is our fallback if the cache disk usage doesn't stabilize.
19:49 <ianw> ++
19:49 <corvus> does that graph mean that we're approximately at nominal size for those files now since it's leveled out?
19:50 <clarkb> And this change which is marked WIP is the long term fix if we end up having problems again. It basically says we won't cache this info on disk (it will still maintain memory caches for the data)
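Since gerrit.config uses git-config syntax, the WIP change amounts to something like the following; the cache names are assumptions for the two new diff caches, and aiui a non-positive diskLimit disables the on-disk copy while leaving the in-memory cache alone:

```
# disable disk persistence for the new diff caches, keeping only the in-memory caches
git config -f /home/gerrit2/review_site/etc/gerrit.config cache.gerrit_file_diff.diskLimit 0
git config -f /home/gerrit2/review_site/etc/gerrit.config cache.git_file_diff.diskLimit 0
```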
19:50 <clarkb> corvus: yes I think we're seeing that we are close to leveling out. The bigger of the two files hasn't grown much at all since the original problem was discovered. The smaller of the two has grown by about 25%
19:51 <clarkb> time will tell though. It is also possible that we're just quieter than usual since we had the problem and if development picks up again we'll see growth
19:51 <clarkb> I think it is still a bit early to draw any major conclusions, but it is looking good at least
19:52 <clarkb> For future upgrades we might want to start capturing any new cache files too. I bet our upgrade job could actually be modified to check which caches are present pre upgrade and diff that against the list post upgrade
19:52 <fungi> the cache size does strike me as being unnecessarily large though. searching through that much disk is unlikely to result in a performance benefit
19:52 <clarkb> I can look at updating our upgrade job to try and do that
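A minimal sketch of what that upgrade job check could do (the cache directory path is assumed):

```
# record the cache inventory before the upgrade, then compare afterwards so any
# newly introduced caches show up in the job output
ls -1 /home/gerrit2/review_site/cache/ | sort > /tmp/caches-before
# ... perform the upgrade and start gerrit ...
ls -1 /home/gerrit2/review_site/cache/ | sort > /tmp/caches-after
diff -u /tmp/caches-before /tmp/caches-after || echo "cache set changed across upgrade"
```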
19:53 <clarkb> fungi: it may actually help because computing diffs is not cheap
19:53 <fungi> oh, is it git diffs/interdiffs?
19:53 <clarkb> fungi: while doing a large disk lookup may not be quick either, diffing is an expensive operation on top of the file io to perform the diff
19:53 <fungi> then yeah, i could see how that could need a lot of space
19:53 <clarkb> fungi: yes it is caching results of various diff operations
19:54 <ianw> but can anything ever leave the cache, if h2 can't clean it?
19:54 <clarkb> ianw: yes items leave the cache aiui. It is just that the backing file isn't trimmed to match
19:54 <clarkb> so that very large file may only be 1% used
19:55 <corvus> then why do we think it'll level out at a daily max?  won't people (bots actually) be continually diffing new things and adding to the cache and therefore backing file?
19:55 <clarkb> because it isn't append only aiui
19:56 <clarkb> it can reuse the allocated blocks that are free in h2 (but not released on disk) to cache new things
19:56 <corvus> backing file space gets reclaimed but not deleted?
19:56 <clarkb> from h2's perspective the database may have plenty of room and not need to grow the disk. But the disk once expanded doesn't really shrink
19:56 <ianw> i guess it needs to, basically, defrag itself?
19:57 <clarkb> ianw: ya and spend more than 200ms doing so apparently
19:57 <corvus> sounds like it -- and defrag is too computationally expensive
19:57 <fungi> more so the larger the set
19:57 <ianw> well i've spent plenty of time staring at defrag moving blocks around in the dos days :)  never did me any harm ... kids (databases?) these days ...
19:58 <clarkb> it's like a python list. It will expand and expand to add more records. But once you del stuff out of it you don't free up that room
19:58 <clarkb> but if you need the list to be bigger again you don't need to allocate more memory, you just reuse what was already there
19:58 <corvus> at least that's a consistent set of behaviors i can understand, even if it's .... suboptimal.
19:58 <ianw> it seems like part of the upgrade steps should probably be to delete the caches though?
19:59 <ianw> a super-defrag
19:59 <clarkb> ianw: upstream was talking about doing that automatically for us
19:59 <clarkb> so ya might make sense to just do it
19:59 <ianw> ++ one to keep in mind for the checklist
19:59 <clarkb> We are almost at time and I had a couple of small changes I wanted to call out before we wrap up
19:59 <clarkb> #topic Open Discussion
20:00 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/850268 Add a new mailing list for computing force network to lists.opendev.org
20:00 <clarkb> fungi: ^ I think you wanted to make sure others were comfortable with adding this list under the opendev.org domain?
20:00 <clarkb> (we have a small number of similar working-group-style lists already so no objection from me)
20:00 <fungi> yeah, even just a heads up that it's proposed
20:01 <fungi> a second review would be appreciated
20:01 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/850252 Fix our new project CI checks to handle gitea/git trying to auth when repo doesn't exist
20:01 <clarkb> When you try to git clone a repo that doesn't exist in gitea you don't get a 404 and instead get prompted to provide credentials, as the assumption is you need to authenticate to see a repo which isn't public
20:02 <clarkb> This change works around that by doing an http fetch of the gitea ui page instead of a git clone to determine if the repo exists already
20:02 <clarkb> (note this isn't really gitea specific, github will do the same thing)
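The workaround boils down to something like this (the repo URL is a placeholder):

```
# a plain git clone of a missing repo prompts for credentials instead of returning 404,
# so probe the gitea web ui instead; -f makes curl fail on 4xx/5xx responses
if curl -sfL -o /dev/null "https://opendev.org/example/new-repo"; then
  echo "repo already exists"
else
  echo "repo does not exist yet"
fi
```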
20:02 <clarkb> And we are at time. Thank you everyone
20:03 <clarkb> Feel free to continue discussion on the mailing list or in #opendev. I'll end this meeting so we can return to $meal and other tasks :)
20:03 <clarkb> #endmeeting
20:03 <opendevmeet> Meeting ended Tue Jul 19 20:03:12 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:03 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.html
20:03 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.txt
20:03 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-07-19-19.01.log.html
20:57 *** Guest5201 is now known as diablo_rojo_phone
