19:01:10 #startmeeting infra
19:01:10 Meeting started Tue Jun 28 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:01:10 Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:01:10 The meeting name has been set to 'infra'
19:01:21 #link https://lists.opendev.org/pipermail/service-discuss/2022-June/000341.html Our Agenda
19:01:29 #topic Announcements
19:01:48 o/
19:01:54 Next week Monday is a big holiday for a few of us. I would expect it to be quiet-ish early next week.
19:02:11 Additionally I very likely won't be able to make the meeting two weeks from today
19:02:46 More than happy to skip that week or have someone else run the meeting (it's July 12, 2022)
19:03:19 i can do 12th july if there's interest
19:03:41 figured I'd let people know early, then we can organize with plenty of time
19:03:51 Any other announcements?
19:05:06 #topic Topics
19:05:13 #topic Improving CD throughput
19:05:32 There was a bug in the flock path for the zuul auto upgrade playbook which unfortunately caused last weekend's upgrade and reboots to fail
19:05:46 That issue has since been fixed so the next pass should run
19:05:54 This is the downside to only trying to run it once a week.
19:06:29 But we can always manually run it if necessary at an earlier date. I'm also hoping that I'll be feeling much better next weekend and can pay attention to it as it runs. (I missed the last one because I wasn't feeling well)
19:07:31 Slow progress, but that still counts :)
19:08:17 Anything else on this topic?
19:09:12 #topic Gerrit 3.5 upgrade
19:09:19 #link https://bugs.chromium.org/p/gerrit/issues/detail?id=16041 WorkInProgress always treated as merge conflict
19:09:28 I did some investigating of this problem that frickler called out.
19:10:09 I thought I would dig into that more today and try to write a patch, but what I've realized since is that there isn't a great solution here since WIP changes are not mergeable. But Gerrit overloads mergeable to indicate there is a conflict (which isn't necessarily true in the WIP case)
19:10:48 so now I'm thinking I'll wait a bit and see if any upstream devs have some hints for how we might address this. Maybe it is ok to drop merge conflict in the case of all wips. Or maybe we need a better distinction between the three states and use something other than a binary value
19:11:09 If the latter option, then that may require them to write a change as I think it requires a new index version
19:11:38 But I do think I understand this enough to say it is largely a non-issue. It looks weird in the UI, but it doesn't indicate a bug in merge conflict checking or the index itself
19:12:02 Which means I think it is fairly low priority and we can focus effort elsewhere
19:12:26 (clarkb becoming dangerously close to a java developer again :)
19:12:57 The other item I wanted to bring up here is whether or not we think we are ready to drop 3.4 images and add 3.6 as well as testing
19:12:59 https://review.opendev.org/q/topic:gerrit-3.4-cleanups
19:13:37 If so there are three changes ^ there that need review. The first one drops 3.4, the second adds 3.6, and the last one adds 3.5 -> 3.6 upgrade testing. That last one is a bit complicated, as there are steps we have to take on 3.5 before upgrading to 3.6 and the old test system for that wasn't able to do that
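As an aside on the WIP mergeability behaviour described above, here is a minimal sketch (assuming Gerrit's standard REST endpoints; the change number is a placeholder) of how the two flags involved can be inspected:

```python
# Minimal sketch, not from the meeting: inspect a change's WIP flag and its
# "mergeable" value via Gerrit's REST API. The change number is a placeholder.
import json

import requests

GERRIT = "https://review.opendev.org"
CHANGE = "123456"  # placeholder change number


def gerrit_get(path):
    """GET a Gerrit REST path and strip the )]}' prefix Gerrit prepends."""
    resp = requests.get(f"{GERRIT}{path}", timeout=30)
    resp.raise_for_status()
    return json.loads(resp.text.split("\n", 1)[1])


change = gerrit_get(f"/changes/{CHANGE}")
mergeable = gerrit_get(f"/changes/{CHANGE}/revisions/current/mergeable")

print("work_in_progress:", change.get("work_in_progress", False))
print("mergeable:", mergeable.get("mergeable"))
# A WIP change reports mergeable=false even when it would apply cleanly, which
# is why the UI shows it as a merge conflict (the behaviour described above).
```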
19:14:19 Considering it has been a week, and of the two discovered issues one has already been fixed and the other is basically just a UI thing, I'm comfortable saying it is unlikely we'll revert at this point
19:14:25 memory usage has also looked good
19:14:47 ++ I think so, I can't imagine we'd go back at this point, but we can always revert
19:15:04 ya our 3.4 images will stay on docker hub for a bit and we can revert without reinstating all the machinery to build new ones
19:15:42 on the merge conflict front, maybe just changing the displayed label to be more accurate would suffice?
19:15:46 looks like ianw has already reviewed those changes. Maybe fungi and/or frickler can take a second look. Particularly the changes that add 3.6, just to make sure we don't miss anything.
19:16:27 fungi: that is possible, but merge conflict relies on mergeable: false even though it can also mean wip. So it becomes tricky to not break the merge conflict reporting on non-wip changes
19:16:45 But ya maybe we just remove the merge conflict tag entirely on wip things in the UI
19:16:52 that is relatively straightforward
19:17:33 maybe upstream will have a good idea and we can fix it some way I haven't considered
19:17:55 Anything else on this subject? I think we're just about at a place where we can drop it off the schedule (once 3.4 images are removed)
19:18:00 s/schedule/agenda/
19:18:03 well, s/merge conflict/unmergeable/ would be more accurate to display
19:18:14 since it's not always a git merge conflict causing it to be marked as such
19:18:38 in particular the "needs rebase" msg is wrong
19:18:59 fungi: but that is only true for wip changes aiui
19:19:09 right
19:19:32 but ya maybe clarifying that in the case of wip changes is a way to go "unmergeable due to the wip state"
19:19:45 well, also changes with outdated parents get marked as being in merge conflict even if they're technically not (though in those cases, rebases are warranted)
19:20:19 oh, that is news to me, but after reading the code it is not unexpected. Making note of that on the bug I filed would be good
19:21:00 also possible i've imagined that, i'll have to double-check
19:21:07 k
19:22:11 We have a few more topics to get through. Any other gerrit upgrade items before we move on?
19:23:20 #topic Improving grafana management tooling
19:23:52 This topic was largely going to talk about the new grafyaml dashboard screenshotting jobs, but those have since merged.
19:24:06 I guess maybe we should catch up on the current state of things and where we think we might be headed?
19:24:46 Pulling info from the last meeting: what ianw has discovered is that grafyaml uses old APIs which can't properly express things like threshold levels for colors in graphs. This means success and failure graphs both show green in some cases
19:25:13 i'm still working on it all
19:25:31 but in doing so i did find one issue
19:25:49 #link https://review.opendev.org/c/opendev/system-config/+/847876
19:26:03 this is my fault, missing parts of the config when we converted to ansible
19:26:47 in short, we're not setting xFilesFactor to 0 for .wsp files created since the update. this was something corvus fixed many years ago that got reverted
19:27:39 as noted, i'll have to manually correct the files on-disk after we fix the configs
19:27:42 noted. I've got that on my todo list for after the meeting and lunch
19:27:48 reviewing the change I mean
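For the on-disk correction mentioned above, something along these lines could work; the whisper storage path and the availability of whisper.setXFilesFactor are assumptions, so treat it as a sketch rather than the actual plan:

```python
# Rough sketch of the on-disk correction mentioned above: walk the whisper
# storage tree and reset xFilesFactor to 0 on any .wsp file that still has a
# non-zero value. Assumptions: the storage path below, and a whisper release
# new enough to provide setXFilesFactor. Try it on a copy first.
import os

import whisper

WSP_ROOT = "/opt/graphite/storage/whisper"  # assumed storage location

for dirpath, _dirnames, filenames in os.walk(WSP_ROOT):
    for name in filenames:
        if not name.endswith(".wsp"):
            continue
        path = os.path.join(dirpath, name)
        current = whisper.info(path)["xFilesFactor"]
        if current != 0.0:
            whisper.setXFilesFactor(path, 0.0)
            print(f"{path}: xFilesFactor {current} -> 0.0")
```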
19:28:17 i noticed this because i was not getting sensible results in the screenshots of the graphs we now create when we update graphs
19:29:28 ianw: jrosser_ also noted that the screenshots may be catching a spinning loading wheel in some cases. Is this related?
19:29:43 the info is in #opendev if you want to dig into that more
19:30:07 ahh, ok, each screenshot waits 5 seconds, but that may not be long enough
19:30:31 it may depend on the size of the dashboard. I think the OSA dashboards have a lot of content based on the change diff
19:30:51 it's quite difficult to tell if the page is actually loaded
19:31:07 Once we've got this fairly stable do we have an idea of what sorts of things we might be looking at to address the grafyaml deficiencies?
19:32:11 or maybe too early to tell since bootstrapping testing has been the focus
19:32:25 i wonder if there's a way to identify when the page has completed loading
19:32:36 my proposal would be editing directly in grafana and committing the dashboards it exports, using the screenshots as a better way to review changes than trying to be human parsers
19:32:46 however, we are not quite at the point where I have a working CI example of that
19:33:10 so i'd like to get that POC 100%, and then we can move it to a concrete discussion
19:33:18 got it. Works for me
19:33:54 will the screenshots show the actual metrics used?
19:34:25 by that, i mean the metrics names, formulas applied, etc?
19:35:03 I think grafana can be convinced to show that info, but it may be equivalent to what is in the json (aka just the json backing)
19:35:26 (so that a reviewer can see that someone is adding a panel that, say, takes a certain metric and divides by 10 and not 100)
19:35:50 okay, so someone reviewing the change for accuracy would need to read the json?
19:37:12 I'm looking at the prod dashboard, and to see that info currently it does seem like you have to load the json version (it shows the actual data and stats separately, but not how they were formulated)
19:38:01 yes, you would want to take a look at the json for say metric functions
19:38:28 "the json" looks something like https://review.opendev.org/c/openstack/project-config/+/833213/1/grafana/infra-prod-deployment.json
19:39:35 the comment about reviewers not needing to be human parsers made me think that may no longer be the case, but i guess reviews still require reading the source (which will be json instead of yaml)
19:40:46 or maybe there's some other way to output that information
19:41:44 one idea I had was to use a simpler translation tool between the json and yaml to help humans, but not try to encode logic as much as grafyaml does today, as that seems to be part of what trips us up.
19:42:07 But I think we can continue to improve the testing. Users have already said how helpful it is while using grafyaml so no harm in improving things this way
19:42:21 and we can further discuss the future of managing the dashboards once we've learned more about our options
19:42:41 We've got 18 minutes left in the meeting hour and a few more topics. Anything urgent on this subject before we continue on?
19:43:15 nope, thanks
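Relating to the screenshot discussion above, one possible way to detect that the page has finished loading is an explicit Selenium wait instead of a fixed sleep; a minimal sketch, in which the dashboard URL and the CSS selector are assumptions:

```python
# Illustrative only: replace the fixed five second sleep with an explicit wait
# for Grafana to render at least one panel before taking the screenshot. The
# dashboard URL and the CSS selector are guesses that would need to be checked
# against the Grafana version actually deployed.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

options = webdriver.FirefoxOptions()
options.add_argument("-headless")
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://grafana.opendev.org/d/example")  # placeholder dashboard
    WebDriverWait(driver, 60).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "div.panel-container"))
    )
    driver.save_screenshot("dashboard.png")
finally:
    driver.quit()
```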
19:43:22 #topic URL Shortener Service
19:43:27 frickler: Any updates on this?
19:43:28 still no progress here, sorry
19:43:38 no worries
19:43:47 #topic Zuul job POST_FAILUREs
19:44:05 Starting sometime last week, openstack ansible and tripleo both noticed a higher rate of POST_FAILURE jobs
19:44:57 fungi did a fair bit of digging last week and I've tried to help out more recently. It isn't easy to debug because these post failures appear related to log uploads, which means we get no log URL and no log links
19:45:24 We strongly suspect that this is related to the executor -> swift upload process, with the playbook timing out during that period.
19:45:49 We suspect that it is also related to either the total number of log files, their size, or some combo of the two, since only OSA and tripleo seem to be affected and they log quite a bit compared to other users/jobs
19:46:16 the time required to upload to swift endpoints does seem to account for the majority of the playbook's time, and can vary significantly
19:46:58 We've helped them both identify places they might be able to trim their log content down. The categories largely boiled down to: no more ARA, reduce deep nesting of logs since nesting requires an index.html for each dir level, remove logs that are identical on every run (think /etc contents that are fixed and never change), and drop things like journald binary files.
19:47:10 Doing this cleanup does appear to have helped but not completely removed the issue
19:47:14 well, if hypothetically, under some circumstances it takes 4x time to upload, it may simply be that only those jobs are long enough that 4x time is noticeable?
19:47:25 yes
19:47:27 (but the real issue is surely that under some circumstances it takes 4x time, right?)
19:47:29 corvus: yup, I think that is what we are suspecting
19:47:44 in the OSA case we've seen some jobs take ~2 minutes to upload logs, ~9 minutes, and ~22 minutes
19:47:52 so initial steps are good, and help reduce the pain, but underlying problem remains
19:48:01 also it's really only impacting tripleo and openstack-ansible changes, so seems to be something specific to their jobs (again, probably the volume of logs they collect)
19:48:21 the problem is we have very little insight into this due to how the issues occur. We lose a lot of info. Even on the executor log side the timeout happens and we don't get info about where we were uploading to
19:48:33 unfortunately a lot of the troubleshooting is hampered by blind spots due to ansible not outputting things when it gets aborted mid-task
19:48:37 we could add logging of that to the ansible role, but we set no_log: true there, which I think may break any explicit logging too
19:48:47 so it's hard to even identify which swift endpoint is involved in one of the post_failure results
19:49:05 I think we've managed the bleeding, but now we're looking for ideas on how we might log this better going forward.
19:49:37 One idea that fungi had that was great was to do two passes of uploads. The first can upload the console log and console json and maybe the inventory content. Then a second pass can upload the job specific data
19:49:58 yeah -- when i started hitting this, it turned out to be the massive syslog that was due to a kernel bug that only hit in one cloud-provider flooding with backtrace messages. luckily in that case, some of the uploads managed to work, so we could see the massive file, but it could be something obscure and unrelated to the job like this
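Because these failures leave no log links behind, one lightweight way to keep an eye on how often they occur is Zuul's public builds API; a rough sketch follows (not an existing tool, and the project names are only examples):

```python
# Quick sketch, not an existing tool: use Zuul's public builds API to see how
# many recent POST_FAILURE results each affected project has. The project
# names here are only examples.
import requests

ZUUL_BUILDS = "https://zuul.opendev.org/api/tenant/openstack/builds"
PROJECTS = ["openstack/openstack-ansible", "openstack/tripleo-heat-templates"]

for project in PROJECTS:
    builds = requests.get(
        ZUUL_BUILDS,
        params={"project": project, "result": "POST_FAILURE", "limit": 50},
        timeout=30,
    ).json()
    print(f"{project}: {len(builds)} recent POST_FAILURE builds")
    for build in builds[:5]:
        print(f"  {build['end_time']} {build['job_name']} ({build['duration']}s)")
```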
19:50:06 that would help us not have post_failures, but it wouldn't help us have logs, and it wouldn't help us know that we have problems uploading logs.
19:50:10 The problem with this is we generate a zuul manifest with all of the log files and record that for the zuul dashboard, so we'd essentially need to upload those base logs twice to make that work
19:50:22 iow, it could sweep this under the rug but not actually make things better
19:50:30 i think it's the opposite?
19:50:41 corvus: I don't think it would stop the post failures. The second upload pass would still cause that to happen
19:50:47 it wouldn't stop the post_failure results, but we'd have console logs and could inspect things in the dashboard
19:51:00 oh i see
19:51:02 it would allow us to, in theory, know where we're slow to upload to
19:51:09 and some other info.
19:51:22 basically try to avoid leaving users with a build result that says "oops, i have nothing to show you, but trust me this broke"
19:51:28 But making that shift work in the way zuul's logging system currently works is not trivial
19:51:38 that sounds good. the other thing is that all the info is in the executor logs. so if you want to write a script to parse it out, that could be an option.
19:51:45 Mostly calling this out here so people are aware of the struggles and also to brainstorm how we can log better
19:51:59 corvus: the info is only there if we don't time out the task though
19:52:04 i suggest that because even if you improve the log upload situation, it still doesn't really point at the problem
19:52:06 except a lot of the info we want for debugging this isn't in the executor logs, at least not that i can find
19:52:09 corvus: when we time out the task we kill the task before it can record anything in the executor logs
19:52:13 though we can improve that, yes
19:52:30 at least that was my impression of the issue here
19:52:37 no i mean all the info is in the executor log. /var/log/zuul/executor-debug.log
19:52:45 basically do some sort of a dry-run task before the task which might time out
19:52:56 corvus: yes that file doesn't get the info because the task is forcefully killed before it records the info
19:53:07 all of the info we have is in the executor log, but in these cases the info isn't much
19:53:30 ya, another approach may be to do the random selection of the target, record it in the log, then start the upload, similar to the change jrosser wrote
19:53:34 then we'd at least have that information
19:53:34 you're talking about the swift endpoint?
19:53:41 corvus: yes that is the major piece of info
19:53:42 that's one piece of data, yes
19:53:46 potentially also the files copied
19:53:52 it gets logged by the task when it ends
19:54:04 except in these cases, because it isn't allowed to end
19:54:10 instead it gets killed by the timeout
19:54:29 the more I think about it the more I think a change like jrosser's could be a good thing here. Basically make the random selection, record the target, run the upload. Then we record some of the useful info before the forceful kill
19:54:46 so we get a log that says the task timed out, and no other information that task would normally have logged (for example, the endpoint url)
19:55:13 we can explicitly log those other things by doing it before we run that task though
19:55:30 https://review.opendev.org/c/opendev/base-jobs/+/847780 that change
19:55:51 yeah that's a good change
19:55:59 that change was initially made so we can do two upload passes, but maybe we start with it just to record the info and do one upload
19:56:01 another idea was to temporarily increase the timeout for post-logs and then try to analyze builds which took longer in that phase than the normal timeout
19:56:02 then you'll have everything in the debug log :)
19:56:25 yup. Ok, that's a good next step, and we can take debugging from there as it may provide important info
19:56:34 we are almost out of time but do have one more agenda item I'd like to get to
19:56:45 in particular, you can look for log upload times by job, and see "normal" and "abnormal" times
19:56:53 the risk of temporarily increasing the timeout, of course, is that jobs may end up merging changes that make the situation worse in the interim
19:57:18 #topic Bastion host
19:57:29 ianw put this on the agenda to discuss two items re bridge.
19:57:37 yeah, i wouldn't change any timeouts; i'd do the change to get all the data in the logs, then analyze the logs. that's going to be a lot better than doing a bunch of spot checks anyway.
19:57:45 The first is whether or not we should put ansible and openstacksdk in a venv rather than a global install
19:57:55 this came out of
19:57:58 #link https://review.opendev.org/c/opendev/system-config/+/847700
19:58:11 which fixes the -devel job, which uses these from upstream
19:58:38 i started to take a different approach, moving everything into a non-privileged virtualenv, but then wondered if there was actually any appetite for such a change
19:58:54 do we want to push on that? or do we not care that much
19:59:12 I think putting pip installs into a venv is a good idea simply because not doing that continues to break in fun ways over time
19:59:46 The major downsides are that things are no longer automatically in $PATH, but we can add them explicitly, and when python upgrades you get really weird errors running stuff out of venvs
20:00:04 yeah they basically need to be regenerated
20:00:14 i am 100% in favor of avoiding `sudo pip install` in favor of either distro packages or venvs, yes
20:00:51 also python upgrades on a stable distro shouldn't need regenerating unless we do an in-place upgrade of the distro to a new major version
20:00:51 ianw: and config management makes that easy if we just rm or move the broken venv aside and let config management rerun (there is a chicken and egg here for ansible specifically though, but I think that is ok if we shift to venvs more and more)
20:01:03 fungi: yes that is the next question :)
20:01:19 The next question is re upgrading bridge and whether or not we should do it in place or with a new host
20:01:35 and to be clear, an in-place upgrade of the distro is fine with me, we just need to remember to regenerate any venvs which were built for the old python versions
20:01:58 I personally really like using new hosts when we can get away with it as it helps us start from a clean slate and delete old cruft automatically. But bridge's IP address might be important? In the past we explicitly allowed root ssh from its IP on hosts iirc
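Going back to the venv item above, a minimal sketch of what the isolated install could look like; the venv location is hypothetical and not an agreed path:

```python
# Minimal sketch of the venv approach discussed above: build an isolated
# virtualenv for ansible and openstacksdk instead of `sudo pip install` into
# the system Python. The install location is an assumption, not an agreed path.
import pathlib
import subprocess
import sys

VENV = pathlib.Path("/usr/local/ansible-venv")  # hypothetical location

if not (VENV / "bin" / "python").exists():
    subprocess.run([sys.executable, "-m", "venv", str(VENV)], check=True)

subprocess.run(
    [str(VENV / "bin" / "pip"), "install", "--upgrade", "ansible", "openstacksdk"],
    check=True,
)
# The tools then live at /usr/local/ansible-venv/bin/ and must be referenced
# explicitly or added to $PATH; after an in-place interpreter upgrade the venv
# needs to be rebuilt (rm it and let config management recreate it).
```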
20:02:21 this is a longer term thing. one thing about starting a fresh host is that we will probably find all those bits we haven't quite codified yet
20:02:27 I'm not sure if we still do that or not. If we do, then doing an in-place upgrade is probably fine. But I have a small preference for a new host if we can get away with it
20:02:27 clarkb: i think that should be automatic for a replacement host
20:02:46 corvus: ya, for all hosts that run ansible during the time frame we have both in the var/list/group
20:03:00 mostly concerned that a host might get missed for one reason or another and get stranded, but we can always manually update that too
20:03:19 ok, so i'm thinking maybe we push on the virtualenv stuff for the tools in use on bridge first, and probably end up with the current bridge as a franken-host with things installed everywhere, every which way
20:03:26 Anyway, no objection from me to shifting ansible and openstacksdk (and more and more of our other tools) into venvs.
20:03:38 same here
20:03:40 however, we can then look at upgrade/replacement, and we should start fresh with more compartmentalized tools
20:03:43 And a preference for a new host to do the OS upgrade
20:05:00 i also prefer a new host, all things being equal, but understand it's a pragmatic choice in some cases to do in-place
20:05:23 We are now a few minutes over time. No open discussion today, but feel free to bring discussion up in #opendev or on the mailing list. Last call for anything else on bastion work
20:05:44 thanks clarkb!
20:07:05 thanks!
20:07:47 Sounds like that is it. Thank you everyone!
20:07:49 #endmeeting