19:01:22 #startmeeting infra
19:01:22 Meeting started Tue Jun 21 19:01:22 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:22 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:22 The meeting name has been set to 'infra'
19:01:33 o/
19:02:33 \o
19:02:53 #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000340.html Our Agenda
19:02:58 #topic Announcements
19:03:01 I had no announcements
19:03:25 And there were no actions from last meeting
19:03:31 That means we can dive right in
19:03:35 #topic Topics
19:03:39 #topic Improving CD throughput
19:03:45 #link https://review.opendev.org/c/opendev/system-config/+/846195 Running Zuul cluster upgrade playbook automatically.
19:04:04 This change is ready for review now. It sets up a cron that fires on the weekend to upgrade and reboot our entire zuul cluster
19:04:37 fungi: ran the playbook by hand last week and it continued to function. I think we're as ready as we will be to do this, but let me know if you disagree. Also call out any problems with my implementation if you see them
19:04:56 yeah, it completed without issue
19:05:32 took over 24 hours to complete, but weekly seems like a reasonable cadence for that
19:06:04 and it should go quicker when zuul is calmer over the weekend
19:06:13 yes
19:06:48 anyway we can follow up further in review. Just wanted to call out the change exists and has had its -W removed
19:06:52 Anything else on this topic?
19:08:14 #topic Glean rpm platform static ipv6 support
19:08:30 This is sort of a ninja addition to the agenda, but I realized i never followed up on this whole thing due to travel
19:08:42 ianw: I see all the glean changes ended up merging. I guess we did the releases and things are happy on ovh now?
19:09:02 Is there anything else that needs to be done or can we consider this completed?
19:09:20 i think that's complete, i haven't heard anything else on it
19:09:30 great. Thank you for taking care of that
19:09:36 certainly if somebody wants something to do, there could be quite a bit of refactoring done in glean
19:09:55 but, it works, so if it ain't broke ... :)
19:10:28 #topic Container Maintenance
19:10:44 I wanted to call out that we upgraded our first mariadb installation from 10.4 to 10.6 during the gerrit 3.5 upgrade process
19:11:04 As far as I can tell that went well. We should probably start thinking about upgrading the DBs for other services too
19:11:21 #link https://etherpad.opendev.org/p/gerrit-upgrade-3.5 captures mariadb upgrade process
19:12:11 This isn't urgent and does require rot access as performed on gerrit. However, there is an env var we can set to have mariadb do it automatically for us if non roots want to help us be brave and push those changes :)
19:12:15 s/rot/root/
19:13:10 root13
19:13:35 #topic Gerrit 3.5 upgrade
19:13:48 This happened. Thank you to everyone (ianw in particular) who helped get us here
19:14:01 The upgrade itself went very smoothly. We have noticed a couple of issues since then.
19:14:11 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16018 Now fixed
19:14:12 one of which is already addressed
19:14:27 yup that one I just linked is fixed upstream and the fix is deployed in opendev
19:14:44 The other issue which frickler noticed is that it seems any change marked WorkInProgress also shows up as merge conflicting in a change listing
19:15:06 I am hoping to have time to look into that probably tomorrow as I suspect that is a bug on the gerrit side where they equate WIP and merge conflict for some reason
19:15:14 If you notice other issues please do call them out
19:15:18 has anyone had an opportunity to confirm whether that also happens on 3.6?
19:15:27 actually some other kolla people noticed first
19:15:41 I have not.
The next thing to discuss is removing 3.4 images and adding 3.6 stuff which will make it easier for us to test that
19:16:07 ++ having a 3.6 replication in s-c would probably be a good help
19:16:08 one other thing that seems new is the highlighting on hovering with the pointer on a word, which I think is very annoying
19:16:29 i feel like it was doing that before; or maybe i'm just thinking of the upstream gerrit
19:16:31 yes that is new, and yes that is intentional and I haven't figured out if I like it or not yet
19:16:41 ianw: upstream gerrit did it before but 3.5 brought it to us
19:16:52 I wonder if we can put that behind a config and turn it off
19:16:55 is that configurable somehow?
19:17:04 frickler: I looked for user config for it yesterday and couldn't find it
19:17:18 I think user config would be ideal, but server wide would be acceptable too. I'll have to look at it more
19:17:22 i guess that's another behavior gertty is shielding me from
19:17:34 since i don't get what's being described
19:17:44 s/get/understand/
19:17:45 fungi: ya if you mouse over words in the diff view gerrit now highlights all occurrences of that word in the diff
19:17:51 I personally prefer explicit use of ^F
19:17:55 uh... huh
19:17:59 in screaming yellow
19:18:03 not sure why they added that, but it's definitely something that seems intentional
19:18:36 #link https://review.opendev.org/q/topic:gerrit-3.4-cleanups
19:18:52 I've pushed up changes to begin some of the cleanup here. The first two are actually followups to running 3.5
19:18:56 #link https://review.opendev.org/c/opendev/system-config/+/847034
19:19:01 #link https://review.opendev.org/c/openstack/project-config/+/847057
19:19:12 is ^F another gerrit override?
one of the reasons i don't use the webui is that it also implements its own keybindings, which seem to somehow override my browser's keybinds, so for someone who prefers to do keyboard-driven browser navigation it's nigh unworkable
19:19:21 if we can get those reviewed then the rest of the stack can be checked and landed when we are happy that we are unlikely to revert
19:19:41 fungi: old gerrit captured ^F but modern polygerrit doesn't; it just uses the browser search function now
19:19:57 fungi: there are still other shortcut keys though
19:20:12 oh, well that's an improvement at least, though my keyboard navigation plugin relies on / to search which is probably still overridden
19:20:24 I'm thinking maybe next week we remove 3.4 and hopefully by then we've also got the 3.6 jobs working and can land 3.6 quickly after 3.4 is removed
19:20:37 if there were an option to disable all the gerrit webui keybindings, i might consider trying it out again
19:20:44 fungi: / is not overridden anymore
19:21:01 but [ and ] still move forward and backward through a review
19:21:20 oh wait / is but it grabs the normal search bar now not in page text search
19:21:32 sorry about that they just changed what search meant in the context of / I guess
19:22:07 yeah. ultimately i see that as the failing of the browser for not giving me an option to just ignore javascript keypress capture
19:23:10 The other thing we'll want to monitor is general memory usage by 3.5 since other users have had memory trouble.
I suspect we're fine simply because we don't run extra plugins and metric gathering
19:23:24 But I'm remembering now that I meant to take a heap usage measurement then compare it daily or something
19:23:57 also the earlier memory capture attempts in zuul didn't show any difference, though obviously there's no production load there
19:24:39 gerrit show-caches will give us this info. I'll run that after the meeting and check it every day at roughly 1900 utc for the next few days
19:25:06 Anything else gerrit upgrade related to call out before we move on?
19:25:29 just the surprisingly few complaints we've received
19:25:38 it's like hardly anyone even noticed we upgraded
19:25:53 success
19:25:56 ++
19:27:01 #topic Enable Nodepool Launcher Webapp
19:27:28 actually it turns out that it is enabled
19:27:59 just not present in the config, which doesn't work in my local setup, but somehow works in the opendev deployment
19:28:18 so nothing to do there
19:28:41 I think ianw was also talking about fixing up the grafana dashboard stuff?
19:28:45 the grafana page has two issues: the "failed" status isn't shown in red
19:28:54 to add missing images?
19:28:56 and some of the timing graphs are missing
19:28:58 oh right the color
19:29:10 yes i did start looking at this ...
19:29:16 but I couldn't find a solution for any of this
19:29:39 what's the launcher webapp path?
19:29:39 the reason it doesn't work is because grafana has redone the way it deals with thresholds
19:30:22 and it seems the way that grafyaml writes thresholds is in the old format, that doesn't quite get upgraded properly to the new format, so it doesn't set "red" when the value is "1"
19:31:06 for the webapp, we should replace the apache default page with a list of the possible links. and maybe add ssl. http://nl01.opendev.org/
19:31:25 ianw: ok so a grafyaml update is required.
19:31:38 http://nl01.opendev.org/dib-image-list and http://nl01.opendev.org/image-list are the useful things
19:31:39 what's the launcher webapp path?
19:31:40 is that fork you found still active? I wonder if we're better off just adopting that?
19:31:42 so i started trying to reverse engineer the way to do new thresholds, and I find it fairly frustrating and frankly a bit pointless to be trying to rewrite grafyaml to a non-existent API
19:31:43 thanks!
19:32:19 ianw: non existent because they change it arbitrarily?
19:32:30 and/or don't document it?
19:32:39 both
19:33:02 shades of jjb
19:33:28 they do tend to write backwards compat things so that old dashboards load
19:34:00 then also, as mentioned, i found https://github.com/deliveryhero/grafyaml who have done a bunch of reverse engineering work for new panel types, etc
19:34:21 but have also run reformatters, and not chosen to interact with the upstream development at all
19:34:27 which was also depressing
19:35:04 it's apache 2.0 though, so we could just fork their changes back in
19:35:16 oh, except for the reformatting
19:35:17 sound like typical german startup I must say
19:35:49 then last time we discussed just using the raw json (which imo really isn't that hard to read) there was disagreement, which was also not fun
19:36:01 so frankly i left the whole exercise a bit depressed over everything :)
19:36:47 ya maybe we need to revive that discussion and see if we can find a compromise.
Like maybe we can store the raw json as yaml to make it reviewable and then have a really light weight tool that just converts yaml to json directly for grafana consumption
19:36:52 i'm tempted to open a github issue against that project saying that their updates aren't importable upstream due to the reformatting
19:36:56 rather than doing a different representation entirely in yaml
19:37:23 I assume ruamel can do that in a pretty straightforward manner for us
19:37:56 pyyaml could presumably do it too
19:38:15 i don't know if we're having any different conversation than we had before, but i think if you look at the json output exported by grafana it's quite readable. they are not doing anything crazy in there
19:38:15 unless you're worried about preserving ordering or inline comments
19:38:40 it's just they arbitrarily choose to refactor things
19:38:48 ianw: I think it was corvus who had the major objection and didn't feel json was human readable at all.
19:38:59 (regardless of the actual json)
19:39:46 maybe this deserves a mailing list thread
19:39:53 as I don't know that corvus is watching the meeting today
19:40:02 yes, fair enough
19:40:10 and we can be explicit about what the issues are that way and try to find a workaround/compromise/etc
19:40:15 could probably stand to be a ml thread either way
19:40:21 ++
19:40:42 i mean, delivery hero or whatever have obviously seen some usefulness in it too
19:41:28 but imo it's just a losing game if upstream don't want to give you an api to work to
19:41:39 ya you'll always be fighting problems like this
19:41:58 oh hi
19:42:55 corvus: I think where we've ended up is we need to do something re grafyaml and grafana dashboard/graph management as the current tool is not working in ways that are annoying.
But we should start up a mailing list thread to discuss it further to make sure we capture all the angles (including this random fork on github)
19:43:02 do we know if that lack of documentation is intentional on the grafana side?
19:43:08 fwiw, i don't see any obvious indication that the deliveryhero devs tried to upstream patches for grafyaml, just skimming the reviews there
19:43:16 ianw: is that something you would like to start or would it be helpful if I try to give it a go
19:43:31 i can send a mail, sure
19:43:36 thanks
19:43:43 i don't think we need to keep having the same conversation in irc, for sure
19:44:11 Alright anything else on this before we move on?
19:45:12 #topic Custom url shortener
19:45:21 that's an easy one: still on my todo list
19:45:32 ok just wanted to make sure I had not missed a change
19:45:57 nope
19:46:34 #topic Removing Projects from Zuul
19:46:47 This was not on the emailed agenda because it occurred to me just this morning
19:47:17 The changes I pushed up to windmill and x/vmware-nsx to remove their use of pipeline level queue definitions in the gerrit config have not been reviewed and most of them fail CI
19:47:32 one idea to address this is to simply remove projects like that from our zuul config.
19:47:46 Separately I do also notice that it seems like literally no one has addressed this problem in openstack at all
19:48:08 but I think for this topic I'm mostly concerned about what are very likely dead projects that we should just decouple from zuul until they become active again
19:48:17 Are there any objections to that or concerns with doing that?
19:48:46 no, but an additional idea, can we also handle some of the long standing zuul config errors like that?
19:48:52 i'm fine with that.
for openstack projects, i'm happy to present the tc with the list of projects we're removing, and suggest that they can be re-added when their authors are ready to address any problems
19:49:03 same for config errors
19:49:04 would you commit something to the projects saying "btw this zuul config is not being processed"?
19:49:20 i just wonder if people do try to commit something, and it goes into gerrit and nothing happens
19:49:21 ok a lot going on. I'm going to start with ianw's since that was one of the concerns I had too
19:49:32 i understood the suggestion as remove them from the tenant config; and i think that sounds good
19:49:34 Which was how do we make people aware of this change if they are already not really paying attention
19:49:36 though that can lead to a cascade effect, since many of those errors are due to projects which have been renamed or retired still listed as required-projects in old branches of other projects' configs
19:49:59 (removing from the tenant config means no commits to projects necessary)
19:50:14 corvus: correct, but it also means if someone pushes code to that repo now they'll just be silently ignored
19:50:41 we could add a job in project-config that just outputs some comment?
19:50:48 yes, and presumably ask someone what is up and end up at service-discuss or #opendev
19:51:05 frickler: the problem with that is you need the project in the tenant config to run the job against the repo
19:51:15 I think where I've ended up on that is what corvus describes
19:51:26 basically it isn't ideal but they should know where to go asking questions
19:52:01 can't we just ignore the in-project config like we do for some github projects?
19:52:04 (you could theoretically exclude all config objects from those projects and then run a job from a config repo, but that sounds like a lot of work for people who aren't around)
19:52:21 just so i can go back to the openstack tc with a clear message, it's that leaving project zuul configs in a broken/error state indefinitely is not okay, even if it's "just" on some old stable branches ~nobody cares about, and we will be taking those projects out of the tenant config even if their master branch configs are still working
19:52:45 my argument would be that opendev's level of service for projects should not exceed the attention given by their developers to them.
19:52:48 fungi: sort of. This is specifically re http://lists.openstack.org/pipermail/openstack-discuss/2022-May/028603.html
19:53:24 corvus: fair enough, I support that
19:53:28 fungi: I do think though that we're approaching what you describe which is that broken project configs regardless of the reason create problems for the projects in question and others. If they aren't going to do basic care and feeding then we'll remove from the CI system to avoid confusion
19:53:32 well, i was extending it to frickler's suggestion that we "also handle some of the long standing zuul config errors like that"
19:53:35 corvus: ++
19:53:53 (my comments are mostly in the context of abandoned projects)
19:53:57 the risk with doing wide spread removal is that it will chain reaction down all the dependencies
19:54:32 yes, my point was that already a vast number of the config errors are due to retired/renamed projects no longer appearing in the tenant config
19:54:51 so i would expect the error count to grow if we remove them
19:55:07 anyway to start I'm just suggesting x/vmware-nsx and windmill be removed since they both appear dead and are not part of openstack.
Then separately we need to push openstack harder to actually fix this stuff
19:55:14 also the errors are branch-specific, but tenant removal is project-level
19:55:25 and if pushing openstack harder doesn't result in fixing these things we should consider removing from zuul at that time
19:55:49 but I don't think we're quite to that point yet for openstack. But we should probably warn them that is ultimately our failsafe on the zuul side
19:56:09 yeah, i'll give a gently firm reminder
19:57:36 maybe separately ask openstack to appoint some janitors for those projects (midonet, etc)?
19:58:21 definitely
19:58:40 I also noticed that the zuul tenant has collected a set of config errors btw.
19:59:04 yes, they are unresolvable until opendev finishes being extracted from openstack
19:59:36 Alright we are just about at time
19:59:39 #topic Open Discussion
19:59:41 Anything else?
19:59:41 possibly some may remain even then
20:00:01 i'm hoping to refresh our meetpad configs to current upstream examples/defaults
20:00:16 not quite sure how best to minimize our differential going forward
20:00:41 we've tried a few different things, but those files are quite large and our edits represent a small proportion of them
20:01:05 open to ideas in #opendev if anyone has some
20:01:25 fungi: probably upstreaming support for flags we need is the best way
20:01:27 but also we are at time
20:01:30 thank you everyone
20:01:32 #endmeeting
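
Note on the cascade concern raised at 19:49:36 and 19:53:57: a rough way to picture it is a walk over the required-projects graph. The sketch below is only an illustration under stated assumptions. It assumes that any project whose config lists a removed project under required-projects hits a config error and is then removed as well. The `example/...` project names and the dependency data are hypothetical (nothing in the meeting says which projects depend on x/vmware-nsx or windmill), and `cascade_removals` is not part of Zuul.

```python
from collections import deque


def cascade_removals(required_projects, removed):
    """Return every project that ends up removed, in discovery order.

    required_projects maps a project name to the set of projects its
    config lists under required-projects (hypothetical data here).
    Assumption: a project that references a removed project gets a
    config error and is removed from the tenant config as well.
    """
    # Invert the mapping so we can ask "who depends on this project?"
    dependents = {}
    for project, requires in required_projects.items():
        for req in requires:
            dependents.setdefault(req, set()).add(project)

    seen = set(removed)
    order = list(removed)
    queue = deque(removed)
    while queue:
        gone = queue.popleft()
        # sorted() keeps the result deterministic
        for dep in sorted(dependents.get(gone, ())):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order


# Hypothetical dependency data, for illustration only.
REQUIRED = {
    "example/plugin-a": {"x/vmware-nsx"},
    "example/plugin-b": {"example/plugin-a"},
    "example/unrelated": set(),
}

print(cascade_removals(REQUIRED, ["x/vmware-nsx", "windmill"]))
```

With this made-up data, removing two projects takes two more with it, while `example/unrelated` is untouched. Real errors are branch-specific while tenant removal is project-level (as noted at 19:55:14), so a per-branch mapping would be needed for anything beyond a rough estimate.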