19:01:22 <clarkb> #startmeeting infra
19:01:22 <opendevmeet> Meeting started Tue Jun 21 19:01:22 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:22 <opendevmeet> The meeting name has been set to 'infra'
19:01:33 <ianw> o/
19:02:33 <frickler> \o
19:02:53 <clarkb> #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000340.html Our Agenda
19:02:58 <clarkb> #topic Announcements
19:03:01 <clarkb> I had no announcements
19:03:25 <clarkb> And there were no actions from last meeting
19:03:31 <clarkb> That means we can dive right in
19:03:35 <clarkb> #topic Topics
19:03:39 <clarkb> #topic Improving CD throughput
19:03:45 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/846195 Running Zuul cluster upgrade playbook automatically.
19:04:04 <clarkb> This change is ready for review now. It sets up a cron that fires on the weekend to upgrade and reboot our entire zuul cluster
19:04:37 <clarkb> fungi: you ran the playbook by hand last week and it continued to function. I think we're as ready as we will be to do this, but let me know if you disagree. Also call out any problems with my implementation if you see them
19:04:56 <fungi> yeah, it completed without issue
19:05:32 <fungi> took over 24 hours to complete, but weekly seems like a reasonable cadence for that
19:06:04 <clarkb> and it should go quicker when zuul is calmer over the weekend
19:06:13 <fungi> yes
19:06:48 <clarkb> anyway we can follow up further in review. Just wanted to call out that the change exists and has had its -W removed
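[A cron like this is typically deployed as an Ansible task in system-config. The following is only a minimal sketch; the schedule, playbook path, and log location are illustrative guesses, not the actual values in change 846195.]

```yaml
# Sketch: install a weekend cron that fires the Zuul upgrade-and-reboot
# playbook. All names and paths here are illustrative.
- name: Install weekly zuul upgrade-and-reboot cron
  cron:
    name: zuul-upgrade-and-reboot
    user: root
    # Early Saturday UTC, when the Zuul cluster is calmer
    weekday: '6'
    hour: '4'
    minute: '0'
    # flock guards against a run that takes longer than a week
    # overlapping the next one
    job: >-
      flock -n /var/run/zuul-upgrade.lock
      ansible-playbook
      /home/zuul/src/opendev.org/opendev/system-config/playbooks/zuul_reboot.yaml
      >> /var/log/ansible/zuul_reboot.log 2>&1
```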
19:06:52 <clarkb> Anything else on this topic?
19:08:14 <clarkb> #topic Glean rpm platform static ipv6 support
19:08:30 <clarkb> This is sort of a ninja addition to the agenda, but I realized I never followed up on this whole thing due to travel
19:08:42 <clarkb> ianw: I see all the glean changes ended up merging. I guess we did the releases and things are happy on ovh now?
19:09:02 <clarkb> Is there anything else that needs to be done or can we consider this completed?
19:09:20 <ianw> i think that's complete, i haven't heard anything else on it
19:09:30 <clarkb> great. Thank you for taking care of that
19:09:36 <ianw> certainly if somebody wants something to do, there could be quite a bit of refactoring done in glean
19:09:55 <ianw> but, it works, so if it ain't broke ... :)
19:10:28 <clarkb> #topic Container Maintenance
19:10:44 <clarkb> I wanted to call out that we upgraded our first mariadb installation from 10.4 to 10.6 during the gerrit 3.5 upgrade process
19:11:04 <clarkb> As far as I can tell that went well. We should probably start thinking about upgrading the DBs for other services too
19:11:21 <clarkb> #link https://etherpad.opendev.org/p/gerrit-upgrade-3.5 captures mariadb upgrade process
19:12:11 <clarkb> This isn't urgent and does require root access as performed on gerrit. However, there is an env var we can set to have mariadb do it automatically for us if non-roots want to help us be brave and push those changes :)
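[The env var in question is presumably MARIADB_AUTO_UPGRADE from the official mariadb container image; a minimal docker-compose sketch, where the service and volume layout is illustrative rather than copied from system-config:]

```yaml
# Sketch: bump a mariadb container from 10.4 to 10.6 and let the image
# upgrade the existing datadir itself. MARIADB_AUTO_UPGRADE=1 makes the
# official image run mysql_upgrade at startup if the datadir is from an
# older version.
services:
  mariadb:
    image: mariadb:10.6
    environment:
      MARIADB_AUTO_UPGRADE: '1'
    volumes:
      - /var/lib/mysql:/var/lib/mysql
```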
19:13:35 <clarkb> #topic Gerrit 3.5 upgrade
19:13:48 <clarkb> This happened. Thank you to everyone (ianw in particular) who helped get us here
19:14:01 <clarkb> The upgrade itself went very smoothly. We have noticed a couple of issues since then.
19:14:11 <clarkb> #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16018 Now fixed
19:14:12 <fungi> one of which is already addressed
19:14:27 <clarkb> yup that one I just linked is fixed upstream and the fix is deployed in opendev
19:14:44 <clarkb> The other issue which frickler noticed is that it seems any change marked WorkInProgress also shows up as merge conflicted in change listings
19:15:06 <clarkb> I am hoping to have time to look into that probably tomorrow as I suspect that is a bug on the gerrit side where they equate WIP and merge conflict for some reason
19:15:14 <clarkb> If you notice other issues please do call them out
19:15:18 <fungi> has anyone had an opportunity to confirm whether that also happens on 3.6?
19:15:27 <frickler> actually some other kolla people noticed first
19:15:41 <clarkb> I have not. The next thing to discuss is removing 3.4 images and adding 3.6 stuff which will make it easier for us to test that
19:16:07 <ianw> ++ having a 3.6 replication in s-c would probably be a good help
19:16:08 <frickler> one other thing that seems new is the highlighting on hovering with the pointer on a word, which I think is very annoying
19:16:29 <ianw> i feel like it was doing that before; or maybe i'm just thinking of the upstream gerrit
19:16:31 <clarkb> yes that is new, and yes that is intentional and I haven't figured out if I like it or not yet
19:16:41 <clarkb> ianw: upstream gerrit did it before but 3.5 brought it to us
19:16:52 <clarkb> I wonder if we can put that behind a config and turn it off
19:16:55 <frickler> is that configurable somehow?
19:17:04 <clarkb> frickler: I looked for user config for it yesterday and couldn't find it
19:17:18 <clarkb> I think user config would be ideal, but server wide would be acceptable too. I'll have to look at it more
19:17:22 <fungi> i guess that's another behavior gertty is shielding me from
19:17:34 <fungi> since i don't understand what's being described
19:17:45 <clarkb> fungi: ya if you mouse over words in the diff view gerrit now highlights all occurrences of that word in the diff
19:17:51 <clarkb> I personally prefer explicit use of ^F
19:17:55 <fungi> uh... huh
19:17:59 <frickler> in screaming yellow
19:18:03 <clarkb> not sure why they added that, but it's definitely something that seems intentional
19:18:36 <clarkb> #link https://review.opendev.org/q/topic:gerrit-3.4-cleanups
19:18:52 <clarkb> I've pushed up changes to begin some of the cleanup here. The first two are actually followups to running 3.5
19:18:56 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/847034
19:19:01 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/847057
19:19:12 <fungi> is ^F another gerrit override? one of the reasons i don't use the webui is that it also implements its own keybindings, which seems to somehow override my browser's keybinds, so for someone who prefers to do keyboard-driven browser navigation it's nigh unworkable
19:19:21 <clarkb> if we can get those reviewed then the rest of the stack can be checked and landed when we are happy that we are unlikely to revert
19:19:41 <clarkb> fungi: old gerrit captured ^F but modern polygerrit doesn't; it just uses the browser search function now
19:19:57 <clarkb> fungi: there are still other shortcut keys though
19:20:12 <fungi> oh, well that's an improvement at least, though my keyboard navigation plugin relies on / to search which is probably still overridden
19:20:24 <clarkb> I'm thinking maybe next week we remove 3.4 and hopefully by then we've also got the 3.6 jobs working and can land 3.6 quickly after 3.4 is removed
19:20:37 <fungi> if there were an option to disable all the gerrit webui keybindings, i might consider trying it out again
19:20:44 <clarkb> fungi: / is not overridden anymore
19:21:01 <clarkb> but [ and ] still move forward and backward through a review
19:21:20 <clarkb> oh wait, / is, but it grabs the normal search bar now, not in-page text search
19:21:32 <clarkb> sorry about that, they just changed what search meant in the context of / I guess
19:22:07 <fungi> yeah. ultimately i see that as the failing of the browser for not giving me an option to just ignore javascript keypress capture
19:23:10 <clarkb> The other thing we'll want to monitor is general memory usage by 3.5 since other users have had memory trouble. I suspect we're fine simply because we don't run extra plugins and metric gathering
19:23:24 <clarkb> But I'm remembering now that I meant to take a heap usage measurement then compare it daily or something
19:23:57 <fungi> also the earlier memory capture attempts in zuul didn't show any difference, though obviously there's no production load there
19:24:39 <clarkb> gerrit show-caches will give us this info. I'll run that after the meeting and check it every day at roughly 1900 UTC for the next few days
19:25:06 <clarkb> Anything else gerrit upgrade related to call out before we move on?
19:25:29 <fungi> just the surprisingly few complaints we've received
19:25:38 <fungi> it's like hardly anyone even noticed we upgraded
19:25:53 <fungi> success
19:25:56 <clarkb> ++
19:27:01 <clarkb> #topic Enable Nodepool Launcher Webapp
19:27:28 <frickler> actually it turns out that it is enabled
19:27:59 <frickler> just not present in the config; omitting it doesn't work in my local setup, but somehow works in the opendev deployment
19:28:18 <frickler> so nothing to do there
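[For reference, spelling the settings out in nodepool.yaml would look something like this. This is a sketch based on the documented defaults, which would explain why the opendev deployment works without the section present:]

```yaml
# Sketch: the launcher webapp section of nodepool.yaml, made explicit.
# These values mirror the documented defaults.
webapp:
  port: 8005
  listen_address: '0.0.0.0'
```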
19:28:41 <clarkb> I think ianw was also talking about fixing up the grafana dashboard stuff?
19:28:45 <frickler> the grafana page has two issues: the "failed" status isn't shown in red
19:28:54 <clarkb> to add missing images?
19:28:56 <frickler> and some of the timing graphs are missing
19:28:58 <clarkb> oh right the color
19:29:10 <ianw> yes i did start looking at this ...
19:29:16 <frickler> but I couldn't find a solution for any of this
19:29:39 <fungi> what's the launcher webapp path?
19:29:39 <ianw> the reason it doesn't work is because grafana has redone the way it deals with thresholds
19:30:22 <ianw> and it seems the way that grafyaml writes thresholds is in the old format, that doesn't quite get upgraded properly to the new format, so it doesn't set "red" when the value is "1"
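[To make the mismatch ianw describes concrete, here is a rough sketch of the two threshold representations. Both snippets are simplified illustrations, not taken from an actual opendev dashboard or from grafyaml's real output:]

```yaml
# Old-style thresholds, roughly what grafyaml writes: a flat threshold
# string plus a parallel color list on the panel itself.
old_panel:
  thresholds: '1'
  colors: ['green', 'red']

# New-style thresholds, roughly what current grafana expects: per-field
# threshold steps under fieldConfig, where the base step has value null.
new_panel:
  fieldConfig:
    defaults:
      thresholds:
        mode: absolute
        steps:
          - color: green
            value: null
          - color: red
            value: 1
```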
19:31:06 <frickler> for the webapp, we should replace the apache default page with a list of the possible links. and maybe add ssl. http://nl01.opendev.org/
19:31:25 <clarkb> ianw: ok so a grafyaml update is required.
19:31:38 <frickler> http://nl01.opendev.org/dib-image-list and http://nl01.opendev.org/image-list are the useful things
19:31:40 <clarkb> is that fork you found still active? I wonder if we're better off just adopting that?
19:31:42 <ianw> so i started trying to reverse engineer the way to do new thresholds, and I find it fairly frustrating and frankly a bit pointless to be trying to rewrite grafyaml to a non-existent API
19:31:43 <fungi> thanks!
19:32:19 <clarkb> ianw: non-existent because they change it arbitrarily?
19:32:30 <fungi> and/or don't document it?
19:32:39 <ianw> both
19:33:02 <fungi> shades of jjb
19:33:28 <ianw> they do tend to write backwards compat things so that old dashboards load
19:34:00 <ianw> then also, as mentioned, i found https://github.com/deliveryhero/grafyaml who have done a bunch of reverse engineering work for new panel types, etc
19:34:21 <ianw> but have also run reformatters, and not chosen to interact with the upstream development at all
19:34:27 <ianw> which was also depressing
19:35:04 <fungi> it's apache 2.0 though, so we could just fork their changes back in
19:35:16 <fungi> oh, except for the reformatting
19:35:17 <frickler> sounds like a typical German startup, I must say
19:35:49 <ianw> then last time we discussed just using the raw json (which imo really isn't that hard to read) there was disagreement, which was also not fun
19:36:01 <ianw> so frankly i left the whole exercise a bit depressed over everything :)
19:36:47 <clarkb> ya maybe we need to revive that discussion and see if we can find a compromise. Like maybe we can store the raw json as yaml to make it reviewable and then have a really lightweight tool that just converts yaml to json directly for grafana consumption
19:36:52 <fungi> i'm tempted to open a github issue against that project saying that their updates aren't importable upstream due to the reformatting
19:36:56 <clarkb> rather than doing a different representation entirely in yaml
19:37:23 <clarkb> I assume ruamel can do that in a pretty straightforward manner for us
19:37:56 <fungi> pyyaml could presumably do it too
19:38:15 <ianw> i don't know if we're having a different conversation than we had before, but i think if you look at the json output exported by grafana it's quite readable. they are not doing anything crazy in there
19:38:15 <fungi> unless you're worried about preserving ordering or inline comments
19:38:40 <ianw> it's just they arbitrarily choose to refactor things
19:38:48 <clarkb> ianw: I think it was corvus who had the major objection and didn't feel json was human readable at all.
19:38:59 <clarkb> (regardless of the actual json)
19:39:46 <clarkb> maybe this deserves a mailing list thread
19:39:53 <clarkb> as I don't know that corvus is watching the meeting today
19:40:02 <ianw> yes, fair enough
19:40:10 <clarkb> and we can be explicit about what the issues are that way and try to find a workaround/compromise/etc
19:40:15 <fungi> could probably stand to be a ml thread either way
19:40:21 <clarkb> ++
19:40:42 <ianw> i mean, delivery hero or whatever have obviously seen some usefulness in it too
19:41:28 <ianw> but imo it's just a losing game if upstream don't want to give you an api to work to
19:41:39 <clarkb> ya you'll always be fighting problems like this
19:41:58 <corvus> oh hi
19:42:55 <clarkb> corvus: I think where we've ended up is we need to do something re grafyaml and grafana dashboard/graph management as the current tool is not working in ways that are annoying. But we should start up a mailing list thread to discuss it further to make sure we capture all the angles (including this random fork on github)
19:43:02 <frickler> do we know if that lack of documentation is intentional on the grafana side?
19:43:08 <fungi> fwiw, i don't see any obvious indication that the deliveryhero devs tried to upstream patches for grafyaml, just skimming the reviews there
19:43:16 <clarkb> ianw: is that something you would like to start or would it be helpful if I try to give it a go
19:43:31 <ianw> i can send a mail, sure
19:43:36 <clarkb> thanks
19:43:43 <ianw> i don't think we need to keep having the same conversation in irc, for sure
19:44:11 <clarkb> Alright anything else on this before we move on?
19:45:12 <clarkb> #topic Custom url shortener
19:45:21 <frickler> that's an easy one: still on my todo list
19:45:32 <clarkb> ok just wanted to make sure I had not missed a change
19:45:57 <frickler> nope
19:46:34 <clarkb> #topic Removing Projects from Zuul
19:46:47 <clarkb> This was not on the emailed agenda because it occurred to me just this morning
19:47:17 <clarkb> The changes I pushed up to windmill and x/vmware-nsx to remove their use of pipeline-level queue definitions in the gerrit config have not been reviewed and most of them fail CI
19:47:32 <clarkb> one idea to address this is to simply remove projects like that from our zuul config.
19:47:46 <clarkb> Separately I do also notice that it seems like literally no one has addressed this problem in openstack at all
19:48:08 <clarkb> but I think for this topic I'm mostly concerned about what are very likely dead projects that we should just decouple from zuul until they become active again
19:48:17 <clarkb> Are there any objections to that or concerns with doing that?
19:48:46 <frickler> no, but an additional idea: can we also handle some of the long-standing zuul config errors like that?
19:48:52 <fungi> i'm fine with that. for openstack projects, i'm happy to present the tc with the list of projects we're removing, and suggest that they can be re-added when their authors are ready to address any problems
19:49:03 <fungi> same for config errors
19:49:04 <ianw> would you commit something to the projects saying "btw this zuul config is not being processed"?
19:49:20 <ianw> i just wonder if people do try to commit something, and it goes into gerrit and nothing happens
19:49:21 <clarkb> ok a lot going on. I'm going to start with ianw's since that was one of the concerns I had too
19:49:32 <corvus> i understood the suggestion as remove them from the tenant config; and i think that sounds good
19:49:34 <clarkb> Which was how do we make people aware of this change if they are already not really paying attention
19:49:36 <fungi> though that can lead to a cascade effect, since many of those errors are due to projects which have been renamed or retired still listed as required-projects in old branches of other projects' configs
19:49:59 <corvus> (removing from the tenant config means no commits to projects necessary)
19:50:14 <clarkb> corvus: correct, but it also means if someone pushes code to that repo now they'll just be silently ignored
19:50:41 <frickler> we could add a job in project-config that just outputs some comment?
19:50:48 <corvus> yes, and presumably ask someone what is up and end up at service-discuss or #opendev
19:51:05 <clarkb> frickler: the problem with that is you need the project in the tenant config to run the job against the repo
19:51:15 <clarkb> I think where I've ended up on that is what corvus describes
19:51:26 <clarkb> basically it isn't ideal but they should know where to go asking questions
19:52:01 <frickler> can't we just ignore the in-project config like we do for some github projects?
19:52:04 <corvus> (you could theoretically exclude all config objects from those projects and then run a job from a config repo, but that sounds like a lot of work for people who aren't around)
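[For reference, the two options being weighed would look roughly like this in the zuul tenant config. This is a simplified sketch of main.yaml; the tenant and connection layout is abbreviated:]

```yaml
# Sketch of a zuul tenant config entry.
- tenant:
    name: openstack
    source:
      gerrit:
        untrusted-projects:
          # Option 1: remove the project from this list entirely. Zuul
          # then ignores the repo and will not report on its changes.
          #
          # Option 2: keep the project but ignore its in-repo zuul
          # config, as is already done for some github projects:
          - x/vmware-nsx:
              include: []
```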
19:52:21 <fungi> just so i can go back to the openstack tc with a clear message, it's that leaving project zuul configs in a broken/error state indefinitely is not okay, even if it's "just" on some old stable branches ~nobody cares about, and we will be taking those projects out of the tenant config even if their master branch configs are still working
19:52:45 <corvus> my argument would be that opendev's level of service for projects should not exceed the attention given by their developers to them.
19:52:45 <clarkb> fungi: sort of. This is specifically re http://lists.openstack.org/pipermail/openstack-discuss/2022-May/028603.html
19:53:24 <frickler> corvus: fair enough, I support that
19:53:28 <clarkb> fungi: I do think though that we're approaching what you describe, which is that broken project configs, regardless of the reason, create problems for the projects in question and others. If they aren't going to do basic care and feeding then we'll remove them from the CI system to avoid confusion
19:53:32 <fungi> well, i was extending it to frickler's suggestion that we "also handle some of the long-standing zuul config errors like that"
19:53:35 <clarkb> corvus: ++
19:53:53 <corvus> (my comments are mostly in the context of abandoned projects)
19:53:57 <clarkb> the risk with doing widespread removal is that it will chain-react down all the dependencies
19:54:32 <fungi> yes, my point was that already a vast number of the config errors are due to retired/renamed projects no longer appearing in the tenant config
19:54:51 <fungi> so i would expect the error count to grow if we remove them
19:55:07 <clarkb> anyway to start I'm just suggesting x/vmware-nsx and windmill be removed since they both appear dead and are not part of openstack. Then separately we need to push openstack harder to actually fix this stuff
19:55:14 <fungi> also the errors are branch-specific, but tenant removal is project-level
19:55:25 <clarkb> and if pushing openstack harder doesn't result in fixing these things we should consider removing from zuul at that time
19:55:49 <clarkb> but I don't think we're quite to that point yet for openstack. But we should probably warn them that that is ultimately our failsafe on the zuul side
19:56:09 <fungi> yeah, i'll give a gently firm reminder
19:57:36 <corvus> maybe separately ask openstack to appoint some janitors for those projects (midonet, etc)?
19:58:21 <fungi> definitely
19:58:40 <frickler> I also noticed that the zuul tenant has collected a set of config errors btw.
19:59:04 <corvus> yes, they are unresolvable until opendev finishes being extracted from openstack
19:59:36 <clarkb> Alright we are just about at time
19:59:39 <clarkb> #topic Open Discussion
19:59:41 <clarkb> Anything else?
19:59:41 <corvus> possibly some may remain even then
20:00:01 <fungi> i'm hoping to refresh our meetpad configs to current upstream examples/defaults
20:00:16 <fungi> not quite sure how best to minimize our differential going forward
20:00:41 <fungi> we've tried a few different things, but those files are quite large and our edits represent a small proportion of them
20:01:05 <fungi> open to ideas in #opendev if anyone has some
20:01:25 <clarkb> fungi: probably upstreaming support for flags we need is the best way
20:01:27 <clarkb> but also we are at time
20:01:30 <clarkb> thank you everyone
20:01:32 <clarkb> #endmeeting