19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Jun 28 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:21 #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000341.html Our Agenda
19:01:29 #topic Announcements
19:01:48 o/
19:01:54 Next week Monday is a big holiday for a few of us. I would expect it to be quiet-ish early next week.
19:02:11 Additionally I very likely won't be able to make the meeting two weeks from today
19:02:46 More than happy to skip that week or have someone else run the meeting (it's July 12, 2022)
19:03:19 i can do 12th july if there's interest
19:03:41 figured I'd let people know early, then we can organize with plenty of time
19:03:51 Any other announcements?
19:05:06 #topic Topics
19:05:13 #topic Improving CD throughput
19:05:32 There was a bug in the flock path for the zuul auto upgrade playbook which unfortunately caused last weekend's upgrade and reboots to fail
19:05:46 That issue has since been fixed so the next pass should run
19:05:54 This is the downside to only trying to run it once a week.
19:06:29 But we can always manually run it if necessary at an earlier date. I'm also hoping that I'll be feeling much better next weekend and can pay attention to it as it runs. (I missed the last one because I wasn't feeling well)
19:07:31 Slow progress, but that still counts :)
19:08:17 Anything else on this topic?
19:09:12 #topic Gerrit 3.5 upgrade
19:09:19 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 WorkInProgress always treated as merge conflict
19:09:28 I did some investigating of this problem that frickler called out.
19:10:09 I thought I would dig into that more today and try to write a patch, but what I've realized since is that there isn't a great solution here since WIP changes are not mergeable. But Gerrit overloads mergeable to indicate there is a conflict (which isn't necessarily true in the WIP case)
19:10:48 so now I'm thinking I'll wait a bit and see if any upstream devs have some hints for how we might address this. Maybe it is ok to drop merge conflict in the case of all wips. Or maybe we need a better distinction between the three states and use something other than a binary value
19:11:09 If the latter option, then that may require them to write a change as I think it requires a new index version
19:11:38 But I do think I understand this enough to say it is largely a non-issue. It looks weird in the UI, but it doesn't indicate a bug in merge conflict checking or the index itself
19:12:02 Which means I think it is fairly low priority and we can focus effort elsewhere
19:12:26 (clarkb becoming dangerously close to a java developer again :)
19:12:57 The other item I wanted to bring up here is whether or not we think we are ready to drop 3.4 images and add 3.6 as well as testing
19:12:59 https://review.opendev.org/q/topic:gerrit-3.4-cleanups
19:13:37 If so there are three changes ^ there that need review. The first one drops 3.4, the second adds 3.6, and the last one adds 3.5 -> 3.6 upgrade testing. That last one is a bit complicated, as there are steps we have to take on 3.5 before upgrading to 3.6 and the old test system for that wasn't able to do that
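As an aside on the WIP mergeability behaviour described above, here is a minimal sketch (assuming Gerrit's standard REST endpoints; the change number is a placeholder) of how the two flags involved can be inspected:

```python
# Minimal sketch, not from the meeting: inspect a change's WIP flag and its
# "mergeable" value via Gerrit's REST API. The change number is a placeholder.
import json

import requests

GERRIT = "https://review.opendev.org"
CHANGE = "123456"  # placeholder change number


def gerrit_get(path):
    """GET a Gerrit REST path and strip the )]}' prefix Gerrit prepends."""
    resp = requests.get(f"{GERRIT}{path}", timeout=30)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


change = gerrit_get(f"/changes/{CHANGE}")
mergeable = gerrit_get(f"/changes/{CHANGE}/revisions/current/mergeable")

print("work_in_progress:", change.get("work_in_progress", False))
print("mergeable:", mergeable.get("mergeable"))
# A WIP change reports mergeable=false even when it would apply cleanly, which
# is why the UI shows it as a merge conflict (the behaviour described above).
```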
19:14:19 Considering it has been a week, and of the two discovered issues one has already been fixed and the other is basically just a UI thing, I'm comfortable saying it is unlikely we'll revert at this point
19:14:25 memory usage has also looked good
19:14:47 ++ I think so, I can't imagine we'd go back at this point, but we can always revert
19:15:04 ya our 3.4 images will stay on docker hub for a bit and we can revert without reinstating all the machinery to build new ones
19:15:42 on the merge conflict front, maybe just changing the displayed label to be more accurate would suffice?
19:15:46 looks like ianw has already reviewed those changes. Maybe fungi and/or frickler can take a second look. Particularly the changes that add 3.6, just to make sure we don't miss anything.
19:16:27 fungi: that is possible, but merge conflict relies on mergeable: false even though it can also mean wip. So it becomes tricky to not break the merge conflict reporting on non-wip changes
19:16:45 But ya maybe we just remove the merge conflict tag entirely on wip things in the UI
19:16:52 that is relatively straightforward
19:17:33 maybe upstream will have a good idea and we can fix it some way I haven't considered
19:17:55 Anything else on this subject? I think we're just about at a place where we can drop it off the schedule (once 3.4 images are removed)
19:18:00 s/schedule/agenda/
19:18:03 well, s/merge conflict/unmergeable/ would be more accurate to display
19:18:14 since it's not always a git merge conflict causing it to be marked as such
19:18:38 in particular the "needs rebase" msg is wrong
19:18:59 fungi: but that is only true for wip changes aiui
19:19:09 right
19:19:32 but ya maybe clarifying that in the case of wip changes is a way to go "unmergeable due to the wip state"
19:19:45 well, also changes with outdated parents get marked as being in merge conflict even if they're technically not (though in those cases, rebases are warranted)
19:20:19 oh, that is news to me, but after reading the code it is not unexpected. Making note of that on the bug I filed would be good
19:21:00 also possible i've imagined that, i'll have to double-check
19:21:07 k
19:22:11 We have a few more topics to get through. Any other gerrit upgrade items before we move on?
19:23:20 #topic Improving grafana management tooling
19:23:52 This topic was largely going to talk about the new grafyaml dashboard screenshotting jobs, but those have since merged.
19:24:06 I guess maybe we should catch up on the current state of things and where we think we might be headed?
19:24:46 Pulling info from the last meeting: what ianw has discovered is that grafyaml uses old APIs which can't properly express things like threshold levels for colors in graphs. This means success and failure graphs both show green in some cases
19:25:13 i'm still working on it all
19:25:31 but in doing so i did find one issue
19:25:49 #link https://review.opendev.org/c/opendev/system-config/+/847876
19:26:03 this is my fault, missing parts of the config when we converted to ansible
19:26:47 in short, we're not setting xFilesFactor to 0 for .wsp files created since the update. this was something corvus fixed many years ago that got reverted
19:27:39 as noted, i'll have to manually correct the files on-disk after we fix the configs
19:27:42 noted. I've got that on my todo list for after the meeting and lunch
19:27:48 reviewing the change I mean
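For the on-disk correction mentioned above, something along these lines could work; the whisper storage path and the availability of whisper.setXFilesFactor are assumptions, so treat it as a sketch rather than the actual plan:

```python
# Rough sketch of the on-disk correction mentioned above: walk the whisper
# storage tree and reset xFilesFactor to 0 on any .wsp file that still has a
# non-zero value. Assumptions: the storage path below, and a whisper release
# new enough to provide setXFilesFactor. Try it on a copy first.
import os

import whisper

WSP_ROOT = "/opt/graphite/storage/whisper"  # assumed storage location

for dirpath, _dirnames, filenames in os.walk(WSP_ROOT):
    for name in filenames:
        if not name.endswith(".wsp"):
            continue
        path = os.path.join(dirpath, name)
        current = whisper.info(path)["xFilesFactor"]
        if current != 0.0:
            whisper.setXFilesFactor(path, 0.0)
            print(f"{path}: xFilesFactor {current} -> 0.0")
```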
19:28:17 i noticed this because i was not getting sensible results in the screenshots of the graphs we now create when we update graphs
19:29:28 ianw: jrosser_ also noted that the screenshots may be catching a spinning loading wheel in some cases. Is this related?
19:29:43 the info is in #opendev if you want to dig into that more
19:30:07 ahh, ok, each screenshot waits 5 seconds, but that may not be long enough
19:30:31 it may depend on the size of the dashboard. I think the OSA dashboards have a lot of content based on the change diff
19:30:51 it's quite difficult to tell if the page is actually loaded
19:31:07 Once we've got this fairly stable do we have an idea of what sorts of things we might be looking at to address the grafyaml deficiencies?
19:32:11 or maybe too early to tell since bootstrapping testing has been the focus
19:32:25 i wonder if there's a way to identify when the page has completed loading
19:32:36 my proposal would be editing directly in grafana and committing the dashboards it exports, using the screenshots as a better way to review changes than trying to be human parsers
19:32:46 however, we are not quite at the point where I have a working CI example of that
19:33:10 so i'd like to get that POC 100%, and then we can move it to a concrete discussion
19:33:18 got it. Works for me
19:33:54 will the screenshots show the actual metrics used?
19:34:25 by that, i mean the metrics names, formulas applied, etc?
19:35:03 I think grafana can be convinced to show that info, but it may be equivalent to what is in the json (aka just the json backing)
19:35:26 (so that a reviewer can see that someone is adding a panel that, say, takes a certain metric and divides by 10 and not 100)
19:35:50 okay, so someone reviewing the change for accuracy would need to read the json?
19:37:12 I'm looking at the prod dashboard, and to see that info currently it does seem like you have to load the json version (it shows the actual data and stats separately, but not how they were formulated)
19:38:01 yes, you would want to take a look at the json for say metric functions
19:38:28 "the json" looks something like https://review.opendev.org/c/openstack/project-config/+/833213/1/grafana/infra-prod-deployment.json
19:39:35 the comment about reviewers not needing to be human parsers made me think that may no longer be the case, but i guess reviews still require reading the source (which will be json instead of yaml)
19:40:46 or maybe there's some other way to output that information
19:41:44 one idea I had was to use a simpler translation tool between the json and yaml to help humans, but not try to encode logic as much as grafyaml does today, as that seems to be part of what trips us up.
19:42:07 But I think we can continue to improve the testing. Users have already said how helpful it is while using grafyaml so no harm in improving things this way
19:42:21 and we can further discuss the future of managing the dashboards once we've learned more about our options
19:42:41 We've got 18 minutes left in the meeting hour and a few more topics. Anything urgent on this subject before we continue on?
19:43:15 nope, thanks
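Relating to the screenshot discussion above, one possible way to detect that the page has finished loading is an explicit Selenium wait instead of a fixed sleep; a minimal sketch, in which the dashboard URL and the CSS selector are assumptions:

```python
# Illustrative only: replace the fixed five second sleep with an explicit wait
# for Grafana to render at least one panel before taking the screenshot. The
# dashboard URL and the CSS selector are guesses that would need to be checked
# against the Grafana version actually deployed.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.FirefoxOptions()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://grafana.opendev.org/d/example")  # placeholder dashboard
    WebDriverWait(driver, 60).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.panel-container"))
    )
    driver.save_screenshot("dashboard.png")
finally:
    driver.quit()
```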
19:43:22 #topic URL Shortener Service
19:43:27 frickler: Any updates on this?
19:43:28 still no progress here, sorry
19:43:38 no worries
19:43:47 #topic Zuul job POST_FAILUREs
19:44:05 Starting sometime last week, openstack ansible and tripleo both noticed a higher rate of POST_FAILURE jobs
19:44:57 fungi did a fair bit of digging last week and I've tried to help out more recently. It isn't easy to debug because these post failures appear related to log uploads, which means we get no log URL and no log links
19:45:24 We strongly suspect that this is related to the executor -> swift upload process, with the playbook timing out during that period.
19:45:49 We suspect that it is also related to either the total number of log files, their size, or some combo of the two, since only OSA and tripleo seem to be affected and they log quite a bit compared to other users/jobs
19:46:16 the time required to upload to swift endpoints does seem to account for the majority of the playbook's time, and can vary significantly
19:46:58 We've helped them both identify places they might be able to trim their log content down. The categories largely boiled down to: no more ARA, reduce deep nesting of logs since nesting requires an index.html for each dir level, remove logs that are identical on every run (think /etc contents that are fixed and never change), and drop things like journald binary files.
19:47:10 Doing this cleanup does appear to have helped but not completely removed the issue
19:47:14 well, if hypothetically, under some circumstances it takes 4x time to upload, it may simply be that only those jobs are long enough that 4x time is noticeable?
19:47:25 yes
19:47:27 (but the real issue is surely that under some circumstances it takes 4x time, right?)
19:47:29 corvus: yup, I think that is what we are suspecting
19:47:44 in the OSA case we've seen some jobs take ~2 minutes to upload logs, ~9 minutes, and ~22 minutes
19:47:52 so initial steps are good, and help reduce the pain, but underlying problem remains
19:48:01 also it's really only impacting tripleo and openstack-ansible changes, so seems to be something specific to their jobs (again, probably the volume of logs they collect)
19:48:21 the problem is we have very little insight into this due to how the issues occur. We lose a lot of info. Even on the executor log side the timeout happens and we don't get info about where we were uploading to
19:48:33 unfortunately a lot of the troubleshooting is hampered by blind spots due to ansible not outputting things when it gets aborted mid-task
19:48:37 we could add logging of that to the ansible role, but we set no_log: true there, which I think may break any explicit logging too
19:48:47 so it's hard to even identify which swift endpoint is involved in one of the post_failure results
19:49:05 I think we've managed the bleeding, but now we're looking for ideas on how we might log this better going forward.
19:49:37 One idea that fungi had that was great was to do two passes of uploads. The first can upload the console log and console json and maybe the inventory content. Then a second pass can upload the job specific data
19:49:58 yeah -- when i started hitting this, it turned out to be the massive syslog that was due to a kernel bug that only hit in one cloud-provider flooding with backtrace messages. luckily in that case, some of the uploads managed to work, so we could see the massive file, but it could be something obscure and unrelated to the job like this
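Because these failures leave no log links behind, one lightweight way to keep an eye on how often they occur is Zuul's public builds API; a rough sketch follows (not an existing tool, and the project names are only examples):

```python
# Quick sketch, not an existing tool: use Zuul's public builds API to see how
# many recent POST_FAILURE results each affected project has. The project
# names here are only examples.
import requests

ZUUL_BUILDS = "https://zuul.opendev.org/api/tenant/openstack/builds"
PROJECTS = ["openstack/openstack-ansible", "openstack/tripleo-heat-templates"]

for project in PROJECTS:
    builds = requests.get(
        ZUUL_BUILDS,
        params={"project": project, "result": "POST_FAILURE", "limit": 50},
        timeout=30,
    ).json()
    print(f"{project}: {len(builds)} recent POST_FAILURE builds")
    for build in builds[:5]:
        print(f"  {build['end_time']} {build['job_name']} ({build['duration']}s)")
```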
19:50:06 that would help us not have post_failures, but it wouldn't help us have logs, and it wouldn't help us know that we have problems uploading logs.
19:50:10 The problem with this is we generate a zuul manifest with all of the log files and record that for the zuul dashboard, so we'd essentially need to upload those base logs twice to make that work
19:50:22 iow, it could sweep this under the rug but not actually make things better
19:50:30 i think it's the opposite?
19:50:41 corvus: I don't think it would stop the post failures. The second upload pass would still cause that to happen
19:50:47 it wouldn't stop the post_failure results, but we'd have console logs and could inspect things in the dashboard
19:51:00 oh i see
19:51:02 it would allow us to, in theory, know where we're slow to upload to
19:51:09 and some other info.
19:51:22 basically try to avoid leaving users with a build result that says "oops, i have nothing to show you, but trust me this broke"
19:51:28 But making that shift work in the way zuul's logging system currently works is not trivial
19:51:38 that sounds good. the other thing is that all the info is in the executor logs. so if you want to write a script to parse it out, that could be an option.
19:51:45 Mostly calling this out here so people are aware of the struggles and also to brainstorm how we can log better
19:51:59 corvus: the info is only there if we don't time out the task though
19:52:04 i suggest that because even if you improve the log upload situation, it still doesn't really point at the problem
19:52:06 except a lot of the info we want for debugging this isn't in the executor logs, at least not that i can find
19:52:09 corvus: when we time out the task we kill the task before it can record anything in the executor logs
19:52:13 though we can improve that, yes
19:52:30 at least that was my impression of the issue here
19:52:37 no i mean all the info is in the executor log. /var/log/zuul/executor-debug.log
19:52:45 basically do some sort of a dry-run task before the task which might time out
19:52:56 corvus: yes that file doesn't get the info because the task is forcefully killed before it records the info
19:53:07 all of the info we have is in the executor log, but in these cases the info isn't much
19:53:30 ya, another approach may be to do the random selection of the target, record it in the log, then start the upload, similar to the change jrosser wrote
19:53:34 then we'd at least have that information
19:53:34 you're talking about the swift endpoint?
19:53:41 corvus: yes that is the major piece of info
19:53:42 that's one piece of data, yes
19:53:46 potentially also the files copied
19:53:52 it gets logged by the task when it ends
19:54:04 except in these cases, because it isn't allowed to end
19:54:10 instead it gets killed by the timeout
19:54:29 the more I think about it the more I think a change like jrosser's could be a good thing here. Basically make the random selection, record the target, run the upload. Then we record some of the useful info before the forceful kill
19:54:46 so we get a log that says the task timed out, and no other information that task would normally have logged (for example, the endpoint url)
19:55:13 we can explicitly log those other things by doing it before we run that task though
19:55:30 https://review.opendev.org/c/opendev/base-jobs/+/847780 that change
19:55:51 yeah that's a good change
19:55:59 that change was initially made so we can do two upload passes, but maybe we start with it just to record the info and do one upload
19:56:01 another idea was to temporarily increase the timeout for post-logs and then try to analyze builds which took longer in that phase than the normal timeout
19:56:02 then you'll have everything in the debug log :)
19:56:25 yup. Ok, that's a good next step, and we can take debugging from there as it may provide important info
19:56:34 we are almost out of time but do have one more agenda item I'd like to get to
19:56:45 in particular, you can look for log upload times by job, and see "normal" and "abnormal" times
19:56:53 the risk of temporarily increasing the timeout, of course, is that jobs may end up merging changes that make the situation worse in the interim
19:57:18 #topic Bastion host
19:57:29 ianw put this on the agenda to discuss two items re bridge.
19:57:37 yeah, i wouldn't change any timeouts; i'd do the change to get all the data in the logs, then analyze the logs. that's going to be a lot better than doing a bunch of spot checks anyway.
19:57:45 The first is whether or not we should put ansible and openstacksdk in a venv rather than a global install
19:57:55 this came out of
19:57:58 #link https://review.opendev.org/c/opendev/system-config/+/847700
19:58:11 which fixes the -devel job, which uses these from upstream
19:58:38 i started to take a different approach, moving everything into a non-privileged virtualenv, but then wondered if there was actually any appetite for such a change
19:58:54 do we want to push on that? or do we not care that much
19:59:12 I think putting pip installs into a venv is a good idea simply because not doing that continues to break in fun ways over time
19:59:46 The major downsides are that things are no longer automatically in $PATH, but we can add them explicitly, and when python upgrades you get really weird errors running stuff out of venvs
20:00:04 yeah they basically need to be regenerated
20:00:14 i am 100% in favor of avoiding `sudo pip install` in favor of either distro packages or venvs, yes
20:00:51 also python upgrades on a stable distro shouldn't need regenerating unless we do an in-place upgrade of the distro to a new major version
20:00:51 ianw: and config management makes that easy if we just rm or move the broken venv aside and let config management rerun (there is a chicken and egg here for ansible specifically though, but I think that is ok if we shift to venvs more and more)
20:01:03 fungi: yes that is the next question :)
20:01:19 The next question is re upgrading bridge and whether or not we should do it in place or with a new host
20:01:35 and to be clear, an in-place upgrade of the distro is fine with me, we just need to remember to regenerate any venvs which were built for the old python versions
20:01:58 I personally really like using new hosts when we can get away with it as it helps us start from a clean slate and delete old cruft automatically. But bridge's IP address might be important? In the past we explicitly allowed root ssh from its IP on hosts iirc
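Going back to the venv item above, a minimal sketch of what the isolated install could look like; the venv location is hypothetical and not an agreed path:

```python
# Minimal sketch of the venv approach discussed above: build an isolated
# virtualenv for ansible and openstacksdk instead of `sudo pip install` into
# the system Python. The install location is an assumption, not an agreed path.
import pathlib
import subprocess
import sys

VENV = pathlib.Path("/usr/local/ansible-venv")  # hypothetical location

if not (VENV / "bin" / "python").exists():
    subprocess.run([sys.executable, "-m", "venv", str(VENV)], check=True)

subprocess.run(
    [str(VENV / "bin" / "pip"), "install", "--upgrade", "ansible", "openstacksdk"],
    check=True,
)
# The tools then live at /usr/local/ansible-venv/bin/ and must be referenced
# explicitly or added to $PATH; after an in-place interpreter upgrade the venv
# needs to be rebuilt (rm it and let config management recreate it).
```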
20:02:21 this is a longer term thing. one thing about starting a fresh host is that we will probably find all those bits we haven't quite codified yet
20:02:27 I'm not sure if we still do that or not. If we do, then doing an in-place upgrade is probably fine. But I have a small preference for a new host if we can get away with it
20:02:27 clarkb: i think that should be automatic for a replacement host
20:02:46 corvus: ya, for all hosts that run ansible during the time frame we have both in the var/list/group
20:03:00 mostly concerned that a host might get missed for one reason or another and get stranded, but we can always manually update that too
20:03:19 ok, so i'm thinking maybe we push on the virtualenv stuff for the tools in use on bridge first, and probably end up with the current bridge as a franken-host with things installed everywhere, every which way
20:03:26 Anyway, no objection from me to shifting ansible and openstacksdk (and more and more of our other tools) into venvs.
20:03:38 same here
20:03:40 however, we can then look at upgrade/replacement, and we should start fresh with more compartmentalized tools
20:03:43 And a preference for a new host to do the OS upgrade
20:05:00 i also prefer a new host, all things being equal, but understand it's a pragmatic choice in some cases to do in-place
20:05:23 We are now a few minutes over time. No open discussion today, but feel free to bring discussion up in #opendev or on the mailing list. Last call for anything else on bastion work
20:05:44 thanks clarkb!
20:07:05 thanks!
20:07:47 Sounds like that is it. Thank you everyone!
20:07:49 #endmeeting