ianw | https://zuul.opendev.org/t/openstack/build/3ce64b0d3e3845609c14fcd26be34db4/console | 00:00 |
ianw | it's coming from | 00:00 |
ianw | pip install -c /home/zuul/src/opendev.org/openstack/horizon/upper-constraints.txt -r requirements.txt -r test-requirements.txt | 00:00 |
fungi | can't hurt to add it, i'm just not expecting it to solve the retry failures i was seeing where ensure-sphinx was breaking | 00:01 |
ianw | but that's a bit of a misnomer, because aiui the script has already sourced the venv's activate script by that point, so that pip isn't the system pip | 00:01 |
ianw | my reading of it was that pip was from that venv ... but could be wrong! | 00:02 |
fungi | a lot of the failures i was looking at happen in pre-run, before that script ever comes into the picture | 00:02 |
fungi | e.g. the builds for nova | 00:03 |
fungi | failures to install pillow into the sphinx venv | 00:03 |
ianw | hrm, ok, i'm looking @ https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&result=FAILURE | 00:04 |
fungi | you need to broaden it to include RETRY_LIMIT as well | 00:04 |
fungi | but yeah, one thing at a time. as you say, we probably have multiple places this is breaking | 00:04 |
ianw | ok, i see it now. i'll start making some notes | 00:06 |
fungi | it could still be the same underlying issue. maybe the ensure-sphinx role needs to upgrade pip | 00:06 |
fungi | though a bigger problem in my mind is that we're running this on bionic but applying the master branch upper-constraints.txt, which no longer takes older python into account; we may not be building those wheels for bionic, and upstream may no longer be publishing cp36 wheels to pypi either, even though they still have a requires_python which allows 3.6 | 00:10 |
ianw | this is true | 00:11 |
fungi | so as a result, pip is going to grab sdists of some things and the projects don't have the necessary build deps in their bindep.txt | 00:11 |
fungi | but yeah, let's try the easy things first and then it's simpler to reason about solutions for what's still breaking after that | 00:12 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 00:15 |
ianw | fungi: ^ i feel like that might restore the status-quo, maybe? | 00:16 |
fungi | i expect installing with python 2.7 was hitting some legacy paths through constraints files which masked a bunch of problems, so i'm not getting my hopes up | 00:24 |
ianw | error: invalid command 'bdist_wheel' | 00:27 |
ianw | i guess maybe that venv needs wheel too... | 00:27 |
fungi | yes, venv doesn't have wheel by default | 00:28 |
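In practice the fix boils down to giving the docs venv a current pip and wheel before the constrained install runs. A minimal sketch of what the ensure-sphinx change amounts to, with an illustrative venv path rather than the role's real variables:

```
# create the docs venv and bring its packaging tooling up to date
python3 -m venv /tmp/sphinx-venv
/tmp/sphinx-venv/bin/pip install --upgrade pip wheel

# with a current pip, abi3 wheels such as Pillow's are accepted; an old
# pip falls back to the sdist, which then needs libjpeg headers to build
/tmp/sphinx-venv/bin/pip install -c upper-constraints.txt Pillow
```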
ianw | then again, it also says | 00:28 |
ianw | The headers or library files could not be found for jpeg, | 00:28 |
ianw | a required dependency when compiling Pillow from source. | 00:28 |
fungi | but also that only comes into play if it's trying to install things from sdist because it can't find a wheel | 00:28 |
ianw | ... so what is the problem :/ | 00:28 |
fungi | and it's a warning, there's a legacy build codepath which doesn't involve building and installing a wheel | 00:29 |
ianw | so basically we had a wheel for pillow and now don't is the theory | 00:29 |
fungi | a cp27 wheel probably, yes | 00:30 |
fungi | now it wants cp36 for bionic | 00:30 |
ianw | Collecting Pillow===8.4.0 (from -c /home/zuul/src/opendev.org/openstack/requirements/upper-constraints.txt (line 97)) | 00:31 |
clarkb | I think it may have had to do with abi3 wheels | 00:31 |
clarkb | old pip doesn't understand those as valid for any python version iirc | 00:31 |
ianw | http://mirror.iad.rax.opendev.org/wheel/ubuntu-18.04-x86_64/pillow/ | 00:32 |
clarkb | then once you update pip it recognizes it can install those specially annotated wheels | 00:32 |
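One way to check the abi3 theory is to ask pip which wheel tags it will accept; a hedged sketch using the venv's pip on the bionic node (path illustrative):

```
# list the tags this interpreter/pip combination considers compatible
/tmp/sphinx-venv/bin/pip debug --verbose | grep -E 'cp36|abi3'

# an old pip won't list e.g. cp36-abi3-manylinux2014_x86_64, so
# Pillow===8.4.0 resolves to the sdist instead of the published wheel
```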
opendevreview | Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration https://review.opendev.org/c/openstack/project-config/+/828435 | 00:32 |
ianw | ahhh ... then the pip upgrade *might* help :) are we back where we started?! :) | 00:32 |
NeilHanlon | 😂 | 00:32 |
ianw | i think we're where we've always been, in a huge tangled mess of dependencies that somehow sometimes works | 00:33 |
NeilHanlon | i.e., python | 00:33 |
fungi | ianw: i think we'll be at the point that pip will think it's possible to install a newer version of pillow than is available as a wheel, but maybe abi3 works for cp36 | 00:34 |
ianw | yeah, it seems like we need to keep iterating | 00:34 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 00:44 |
ianw | now with less typos | 00:44 |
clarkb | ianw: any idea why the base-test log-inventory stuff was out of sync? | 00:45 |
ianw | i didn't go through the history; i assumed something was tested that didn't make it in | 00:46 |
clarkb | gotcha | 00:47 |
ianw | looks like the bits i removed came in via I6c93fd03aadb5e4d15ac7da98887dd7ca4998319 | 00:47 |
ianw | https://review.opendev.org/c/opendev/base-jobs/+/798139 | 00:48 |
*** dviroel|ruck|afk is now known as dviroel|ruck | 00:48 | |
ianw | then it looks like https://review.opendev.org/c/zuul/zuul-jobs/+/798087 didn't make it in? | 00:48 |
clarkb | that might've stalled out due to the zuul fixes that went in mid last year? | 00:54 |
clarkb | things were more aggressively split out and filtered | 00:54 |
*** rlandy|ruck|bbl is now known as rlandy|ruck | 00:57 | |
*** dviroel|ruck is now known as dviroel|ruck|out | 00:57 | |
*** dviroel|ruck|out is now known as dviroel|out | 00:57 | |
opendevreview | Merged zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 01:22 |
opendevreview | Merged opendev/base-jobs master: base-test: sync with base/pre.yaml https://review.opendev.org/c/opendev/base-jobs/+/828439 | 01:23 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels" https://review.opendev.org/c/openstack/diskimage-builder/+/827381 | 03:33 |
ianw | ok, it looks like https://zuul.opendev.org/t/openstack/build/0c8aa08db4c844a7bdb70dfe222597ea (upstream-translation-update for nova) passed after making ensure-sphinx update pip in the venv | 03:58 |
ianw | this still leaves problems with the host type, zanata in general, etc. but that's for tomorrow :) | 03:58 |
*** rlandy|ruck is now known as rlandy|out | 04:06 | |
opendevreview | Merged openstack/diskimage-builder master: Cleanup more CentOS 8 bits https://review.opendev.org/c/openstack/diskimage-builder/+/827210 | 04:39 |
opendevreview | Merged openstack/diskimage-builder master: Remove contrib/setup-gate-mirrors.sh https://review.opendev.org/c/openstack/diskimage-builder/+/827211 | 05:02 |
opendevreview | Merged openstack/diskimage-builder master: General improvements to the ubuntu-minimal docs https://review.opendev.org/c/openstack/diskimage-builder/+/806308 | 05:19 |
opendevreview | Merged openstack/diskimage-builder master: Remove extra if/then/else construct in pip element https://review.opendev.org/c/openstack/diskimage-builder/+/822224 | 05:19 |
opendevreview | Merged openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels" https://review.opendev.org/c/openstack/diskimage-builder/+/827381 | 07:50 |
opendevreview | Merged opendev/base-jobs master: base-test: fail centos-8 if pointing to centos-8-stream image type https://review.opendev.org/c/opendev/base-jobs/+/828440 | 08:03 |
*** amoralej|off is now known as amoralej | 08:11 | |
*** jpena|off is now known as jpena | 08:31 | |
*** sshnaidm|afk is now known as sshnaidm | 08:54 | |
*** ysandeep|out is now known as ysandeep | 09:01 | |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd https://review.opendev.org/c/openstack/diskimage-builder/+/828266 | 09:16 |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd https://review.opendev.org/c/openstack/diskimage-builder/+/828266 | 09:18 |
*** mnasiadka_ is now known as mnasiadka | 09:18 | |
opendevreview | Merged openstack/diskimage-builder master: Don't run functional tests on doc changes https://review.opendev.org/c/openstack/diskimage-builder/+/825891 | 09:21 |
opendevreview | Merged openstack/diskimage-builder master: fedora-container: pull in glibc-langpack-en https://review.opendev.org/c/openstack/diskimage-builder/+/827772 | 09:35 |
sshnaidm | cores, please review the perms patch when you have time: https://review.opendev.org/c/openstack/project-config/+/828371 | 10:41 |
sshnaidm | fungi, ^^ | 10:41 |
*** rlandy|out is now known as rlandy|ruck | 11:06 | |
*** dviroel|out is now known as dviroel|ruck | 11:10 | |
mnasiadka | Good afternoon | 12:37 |
mnasiadka | Since https://opendev.org/openstack/diskimage-builder/commit/398e07e6f2bb5a2f763a22a8e4801108c242ffe2 landed - is there a slight chance that it would be possible to add Rocky Linux 8 to the possible nodesets in Zuul? Kolla projects would be happy to run their CI on something that is not so unpredictable as CentOS Stream (and there's user interest in adding Rocky Linux support - which we'd like to have properly tested). | 12:39 |
*** ysandeep is now known as ysandeep|break | 12:41 | |
*** artom__ is now known as artom | 13:03 | |
*** amoralej is now known as amoralej|lunch | 13:07 | |
*** ysandeep|break is now known as ysandeep | 13:13 | |
frickler | mnasiadka: seems https://review.opendev.org/c/openstack/project-config/+/828435 is the next step | 13:18 |
fungi | mnasiadka: it's in progress, i believe we need a dib release and then a version bump in nodepool | 13:18 |
mnasiadka | oh, great - Neil followed up on that | 13:18 |
mnasiadka | wasn't aware :) | 13:18 |
fungi | maybe the dib release already happened while i was asleep | 13:19 |
frickler | I don't think that that change actually tests builds, so that release+bump may still be needed | 13:20 |
*** amoralej|lunch is now known as amoralej | 13:59 | |
*** pojadhav is now known as pojadhav|brb | 14:06 | |
*** akahat|rover is now known as akahat|PTO | 14:11 | |
*** pojadhav|brb is now known as pojadhav | 14:24 | |
*** pojadhav is now known as pojadhav|dinner | 15:00 | |
*** ysandeep is now known as ysandeep|out | 15:43 | |
*** pojadhav|dinner is now known as pojadhav | 16:14 | |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM change to test and hold gitea 1.16.1 https://review.opendev.org/c/opendev/system-config/+/828586 | 16:35 |
clarkb | fungi: ^ how do I hold for a specific change again? is it --ref refs/changes/xy/abcxy ? | 16:35 |
clarkb | looks like I need the ps in there too | 16:36 |
fungi | yes, you need the revision | 16:39 |
fungi | refs/changes/xy/abcxy/z | 16:39 |
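For reference, an autohold scoped to a single revision looks roughly like the following; the flags and values are illustrative, not the exact command that was run:

```
# hold the next failing build of this job for this exact patchset
zuul autohold --tenant openstack \
  --project opendev.org/opendev/system-config \
  --job system-config-run-gitea \
  --ref refs/changes/86/828586/1 \
  --reason "clarkb: inspect gitea 1.16.1" \
  --count 1
```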
*** ralonsoh_ is now known as ralonsoh | 16:40 | |
clarkb | thanks. I'm hoping to get gitea 1.16.1 held today so we can look it over and double check it against the changelog then maybe upgrade soon | 16:40 |
clarkb | I set that hold up and cleaned up my old gerrit tag pushing hold | 16:42 |
clarkb | ianw: looks like you may have gerrit 3.4 holds that are no longer required since we upgraded. But I'll wait for you to confirm before doing any cleanup | 16:44 |
*** ykarel is now known as ykarel|away | 16:48 | |
*** marios is now known as marios|out | 16:53 | |
clarkb | I suspect https://review.opendev.org/c/openstack/diskimage-builder/+/826976 is the change that ianw is hoping to get sorted for the dib release based on what was said yesterday | 16:55 |
clarkb | I'm having a really hard time parsing what that change aims to do | 16:58 |
clarkb | I guess we want to set up grub without installing a bootloader. But aren't those two things intertwined? | 17:00 |
*** jpena is now known as jpena|off | 17:27 | |
corvus | the mergers look like they may be stuck | 17:46 |
corvus | infra-root: i think we may be looking at the gerrit slowness again | 17:47 |
corvus | the mergers are not stuck, they're just getting really slow performance from gerrit on their git ops. they have a 300 job backlog | 17:48 |
corvus | clarkb: did we have a next step for debugging that? | 17:48 |
clarkb | corvus: luca asked for show-queue -w output from when it was happening | 17:50 |
clarkb | corvus: and maybe we should grab another thread dump since I think the last one captured the very tail end of it | 17:50 |
corvus | clarkb: do you want to "sudo" and do that? | 17:51 |
clarkb | ya I'll work on running jstack to capture a current thread dump | 17:51 |
clarkb | then around the same time try to show-queue -w | 17:52 |
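Capturing both at roughly the same moment looks something like the sketch below; the container name, the assumption that the Gerrit JVM is PID 1, and the admin account name are guesses, not confirmed details of the deployment:

```
# thread dump from the Gerrit JVM, run on review02
sudo docker exec gerrit-compose_gerrit_1 jstack 1 \
  > gerrit_thread_dump.$(date +%Y%m%d)

# queue state, including waiting tasks, via the ssh admin interface
ssh -p 29418 admin.user@review.opendev.org gerrit show-queue -w \
  > gerrit_queues.$(date +%Y%m%d)
```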
corvus | ah, i've re-learned about my admin account, so i can do show-queue -w now too :) | 17:53 |
corvus | oh wow the output is a lot different, you can see the jobs waiting | 17:54 |
clarkb | corvus: yup that was what I was trying to explain to luca. I guess it's a good thing we can actually capture it now. I've done both a jstack and show-queue which captures data that needs filtering before we can upload it | 17:54 |
clarkb | if others want to keep poking and debugging I can start working on cleaning up these files | 17:55 |
clarkb | corvus: one thing I notice is that we've got another old task at the top of the list. | 17:58 |
clarkb | last time we had the same thing | 17:58 |
clarkb | I killed the task last time and maybe that helped make things get better? | 17:58 |
clarkb | I think we didn't see immediate relief but the built up backlog may explain that | 17:58 |
corvus | clarkb: yeah let's try that | 17:59 |
clarkb | corvus: do you want to kill it or should I? | 17:59 |
corvus | i'm watching a zm log and should notice if progress picks up. you kill it. | 17:59 |
clarkb | ok one sec while I look up the command again | 17:59 |
corvus | fwiw, the backlog started at 16:39 | 18:00 |
clarkb | done | 18:00 |
corvus | there might have been a slight improvement, from ~30s per repo to ~20 | 18:03 |
corvus | i don't think it's enough to keep up with incoming workload though | 18:04 |
corvus | the queue is decreasing | 18:06 |
clarkb | pretty sharply too | 18:06 |
clarkb | so maybe the issue is that old task | 18:06 |
fungi | no sudden jump in the merger queue this time, it did see a bit of a rise and then climbed a decelerating curve as of ~16:40z | 18:07 |
clarkb | and hopefully correlation between thread dump and show-queue prior to removing the old task can help upstream diagnose it. There may also be zuul settings to time out connections for ssh? | 18:07 |
fungi | the executors were very busy for a while before that too | 18:07 |
clarkb | we might be able to workaround this if so by setting a reasonable connection timeout to like an hour? | 18:07 |
clarkb | I'm going to keep working on sanitizing these files. But maybe someone can look at gerrit config options around that | 18:07 |
fungi | a largeish stack of nova changes were enqueued into check around that time | 18:08 |
fungi | yeah, looks like there was a stack rebase and push for those right at 16:40 | 18:09 |
fungi | so it's possible gerrit was already slow, and this was the bump which pushed things over the edge | 18:10 |
corvus | so do we have 4 slots for servicing this? | 18:10 |
clarkb | corvus: we should have 100, which is part of what I asked about on the bug I filed | 18:11 |
clarkb | the thread dump also shows 100 threads exist | 18:11 |
fungi | that's how it's acting (and how it was acting last time as well) but yeah it's not what we think is configured | 18:11 |
clarkb | It seems like there is some other limitation (thread contention, locks? I don't know what) | 18:11 |
corvus | then show-queue has all but 4 git-upload-pack jobs waiting | 18:11 |
clarkb | yup I think that is why luca wanted the show queue output. | 18:12 |
corvus | k | 18:12 |
clarkb | review02:~clarkb/gerrit_queues.20220209.sanitized should be sanitized. But please double check it particularly the query tasks as I'm not sure if we need to scrub out the change identifiers too (I don't think so since all our changes are public) | 18:12 |
clarkb | now to work on the thread dump | 18:12 |
corvus | clarkb: sanitized lgtm | 18:14 |
clarkb | corvus: still sanitizing the thread dump but it looks like some of the waiting tasks are waiting on a lock | 18:15 |
clarkb | I'm hopeful this will end up allowing this to be understood and fixed given what I'm seeing. This may take some time though as I'm trying to synchronize the sanitized usernames between the two files | 18:15 |
clarkb | that way they can be directly correlated | 18:15 |
*** amoralej is now known as amoralej|off | 18:17 | |
corvus | queue @ 200 now | 18:18 |
corvus | there's now a 2m old task at the top, and things are slowing down again | 18:31 |
corvus | so it does seem like we're right on the edge of holding performance | 18:31 |
corvus | it finished; so we don't need to kill it or anything, just may be informative. | 18:32 |
clarkb | it does seem that after I killed the very old task the queue dropped quickly | 18:34 |
clarkb | makes me wonder if longer running tasks create a lot of contention somehow | 18:35 |
corvus | it leveled off while the 2m old task was there and has resumed falling | 18:35 |
corvus | well, if we can only service 4 of them at a time, then our capacity drops by 25% | 18:35 |
clarkb | but also none of this explains why 95% of our interactive ssh worker threads are doing nothing. Unless the same lock causes contention with thread assignment | 18:35 |
clarkb | ya that | 18:35 |
corvus | you can now see the little shelf on the merger queue graph from that 2m job | 18:36 |
corvus | (was probably 3+ minutes by the end, which more closely corresponds with the shelf length) | 18:37 |
corvus | <100 | 18:48 |
corvus | still seeing about 20s per repo on the merger | 18:49 |
clarkb | ok I've gone through review02:~clarkb/gerrit_thread_dump.20220209.sanitized and cleaned up what I could find. The diff against gerrit_thread_dump.20220209 will show you what I changed | 18:50 |
clarkb | there were no http entries that needed cleanup this time from what I could see | 18:50 |
clarkb | corvus: I think the two big questions are "why are we slowing down in general" and "why are we not using the many free ssh worker threads that could be used to spread out the load" | 18:51 |
corvus | yeah, though if the slowdown is entirely just waiting for threads, could be only one question. | 18:51 |
clarkb | indeed | 18:52 |
clarkb | sshd.idleTimeout and sshd.waitTimeout may be useful here depending on whether or not longer running requests are a problem | 18:53 |
clarkb | that might impact zuul listening on ssh event streaming though | 18:53 |
clarkb | anyway if ya'll can take a look at those two files on review02 and give them a critical eye I can update the issue with a bit more info | 18:54 |
clarkb | hrm we set idletimeout to an hour already | 18:55 |
clarkb | and waittimeout defaults to 30s so maybe not | 18:55 |
corvus | wonder why that job was there for so long then | 18:55 |
clarkb | oh! I've just now noticed that luca wanted -q added to show-queue | 18:57 |
clarkb | unfortunately too late to add that now, but if you run it, it gives more detailed information on the internal queues too | 18:58 |
clarkb | notably all those tasks are apparently in the batchworker queue not the stream events queue | 18:58 |
clarkb | I think we are using 2 batch threads which is the default on a multicore system based on that | 18:58 |
clarkb | and zuul et al. are being scheduled to that despite my earlier group membership checking | 18:58 |
clarkb | maybe zuul is a member of service users and I missed it before? | 18:59 |
corvus | oh interesting. the split between batch and interactive seems arbitrary too. | 18:59 |
corvus | there are some 3pci in batch and some in interactive. | 19:00 |
fungi | i think it's based on group membership | 19:00 |
clarkb | ya I think we may have to address that via groups | 19:00 |
clarkb | reading the config docs if we set batch threads to 0 then the interactive and non interactive users share a threadpool | 19:00 |
clarkb | that might be the most straightforward thing for us to do though maybe not the most correct thing | 19:00 |
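For reference, the knobs being discussed all live in the sshd section of gerrit.config; a hedged sketch of the equivalent edit, with an illustrative file path (the real change goes through the system-config template):

```
# share one worker pool between interactive and batch (service) users
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.batchThreads 0

# settings already in place per the discussion above
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.threads 100
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.idleTimeout 1h
```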
corvus | we need 3 pools :/ | 19:01 |
clarkb | corvus: humans, zuul, else? | 19:01 |
corvus | ya | 19:01 |
fungi | or arbitrary preemptable pools | 19:01 |
fungi | but yes, more than 2 | 19:01 |
fungi | our users, our ci systems, our users ci systems | 19:01 |
clarkb | any objections to me pushing a change to set batchThreads to 0 and share for now? or would we prefer to look into cleaning things up and splitting the pools | 19:02 |
corvus | clarkb: fwiw your file looks good, but it seems like we have things to try before we necessarily go back to gerrit folks | 19:02 |
clarkb | ++ | 19:02 |
fungi | maybe an alternative would be to put third-party ci accounts back into the normal user threadpool, and dedicate the batch pool to zuul? | 19:03 |
fungi | but i'm fine with trying the giant shared pool first | 19:03 |
clarkb | fungi: Service Users also impacts the attention set stuff unfortunately | 19:03 |
clarkb | overloading those two sets makes this really awkward for us | 19:03 |
fungi | that does seem like something they ought to consider splitting | 19:03 |
corvus | why do each of the queues have 2 worker threads? | 19:05 |
corvus | oh wait, batch has 2 interactive has 100? | 19:05 |
clarkb | ya that | 19:05 |
clarkb | corvus: we've long set sshd.threads to 100 (since like 2.8 maybe? its old) but then recently with attention sets and changes along those lines gerrit recognizes service users and split those out into a separate pool | 19:06 |
clarkb | corvus: I thought the default was that threads were always shared though but maybe that changed in 3.4? | 19:06 |
corvus | i think the batch pool has been around for a while | 19:06 |
clarkb | corvus: it has been, but I was fairly certain the default was to share threads not to only use 2. I think that may be the change | 19:06 |
clarkb | I'm trying to find the 3.3 docs to confirm | 19:07 |
corvus | what was the batch setting for if not to segregate threads? | 19:07 |
corvus | though it's really not that important | 19:07 |
fungi | looks like the mergers have fully caught up again now | 19:07 |
corvus | the important thing is which group of users we want to have sidelined when someone holds an ssh operation open for 2 weeks | 19:08 |
clarkb | corvus: ya ideally the idletimeout would address that and then we can have enough headroom on thread count that we largely avoid it. Or add additional pools | 19:08 |
clarkb | corvus: fwiw 3.3 docs say 2 for batchThreads is the default so maybe we just never noticed until recently | 19:08 |
corvus | do you have the unsanitized queue dump? | 19:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Set Gerrit sshd.batchThreads to 0 https://review.opendev.org/c/opendev/system-config/+/828605 | 19:10 |
clarkb | corvus: I do | 19:10 |
clarkb | corvus: one sec I'll put it on review02 | 19:10 |
corvus | i forgot which user was the one running the task from feb 4 that we killed; would like to confirm they're in the batch worker group | 19:10 |
clarkb | corvus: it's on review02 now without the sanitized suffix | 19:11 |
clarkb | I think it was userA though | 19:11 |
clarkb | "gerrit ls-members --recursive 'Service Users'" <- that doesn't show me zuul so it must be finding zuul in that group via some other method? | 19:13 |
corvus | priority = batch group Non-Interactive Users | 19:15 |
clarkb | Non-Interactive Users got renamed to Service users. I think the text may not have updated in that move because the uuid for the group stayed the same and that is what gerrit uses | 19:16 |
clarkb | so ya it would be membership of that group or another priority = batch entry for additional groups | 19:17 |
clarkb | I half wonder if ls-members doesn't recurse properly and we should "sudo" and check via the web ui | 19:17 |
clarkb | https://osm.etsi.org/gerrit/Documentation/rest-api-groups.html#list-subgroups should be able to do it too | 19:18 |
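A hedged way to double-check the membership from the command line, assuming an account with HTTP credentials and read access to the group (Gerrit prefixes JSON responses with a )]}' line, hence the tail):

```
# direct subgroups of Service Users
curl -su myuser:http-password \
  'https://review.opendev.org/a/groups/Service%20Users/groups/' | tail -n +2

# members including those inherited from subgroups
curl -su myuser:http-password \
  'https://review.opendev.org/a/groups/Service%20Users/members/?recursive' \
  | tail -n +2
```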
clarkb | heh it's fun how you get different gerrit installations back from google when googling this stuff | 19:19 |
corvus | yeah, i think it's worth exploring in the web ui; i have to run for a bit | 19:19 |
fungi | the rest api has a recurse option for group member listing | 19:19 |
fungi | clarkb: oh, so does ls-members... "--recursive : to resolve included groups recursively (default: false)" | 19:20 |
clarkb | fungi: ya but if I do --recursive it returns the same results and from everything we can tell somehow zuul is in that group | 19:21 |
fungi | oh | 19:21 |
clarkb | basically I'm not trusting it :) | 19:21 |
clarkb | because otherwise how is zuul ending up in the batch queue | 19:21 |
fungi | ahh, now i see in scrollback you already discussed --recursive | 19:21 |
clarkb | I need to take a quick break myself and find something to drink. But I can settle back in and look via the web ui | 19:21 |
fungi | mgagne: we've been contacted from someone in sales at inap saying they're turning off the iweb cloud in mtl01. i guess that part didn't transfer to leaseweb? | 19:29 |
mgagne_ | fungi: I didn't know it was that far into the process and that sales was going to do the communication themselves. | 19:29 |
fungi | ahh, well it was someone offering to put us in touch with sales reps about pricing out access to their vmware cloud | 19:30 |
fungi | i just wanted to make sure it's actually going away and not moving to leaseweb before i replied further | 19:31 |
mgagne_ | I don't think you should have received this email, not in that format. But it's true that we are planning on sunsetting the openstack platform in mtl01. The sequence/data is yet tbd. | 19:31 |
fungi | seems like they may have reached out to contact addresses they had on file for the accounts in that environment, since it went to our infra-root alias inbox | 19:32 |
clarkb | fungi: corvus: "Continuous Integration Tools" and "Third-Party CI" are both group members of Service Users. The ssh ls-members is bugged I guess. Zuul is a member of Continuous Integration Tools | 19:33 |
mgagne_ | I wonder how they got that specific email tbh. | 19:33 |
clarkb | fungi: mgagne_: we cc'd you on the email thread but maybe not to a currently valid email? | 19:33 |
clarkb | fungi: corvus: now that we know that I think we can consider this provisionally solved and work to adjust the batch users thread pool size. I expect that setting it to 0 and sharing is probably the least bad option for us currently due to the overloaded use of service users with attention sets | 19:34 |
clarkb | it may be possible to do some followup where we remove batch priority from service users and assign it to Continuous Integration Tools and then have third party ci remain non participatory with attention sets and go into the interactive pool or similar. But that would probably need more testing and planning. Setting batchThreads to 0 should be fairly safe | 19:35 |
fungi | mgagne_: anyway, thanks for the details. it sounds like i can have enough info to be able to reply. as far as timeline they said we have 90 days to migrate to vmware before they're turning the environment off | 19:35 |
clarkb | I've deescalated my privs in the web ui now | 19:36 |
mgagne_ | I'm currently in a meeting, I'll get back to you in ~60m at most. | 19:36 |
fungi | mgagne_: no worries, take your time. i appreciate the help. i'll wait to reply in that case | 19:37 |
clarkb | ++ thank you | 19:39 |
clarkb | infra-root https://104.130.74.7:3081/opendev/system-config has been held for gitea checking. Though I think I'll defer a bit on that until we can close out the gerrit issue | 19:44 |
*** timburke__ is now known as timburke | 20:54 | |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 21:02 |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 21:03 |
corvus | i would like to do a rolling restart of zuul now. | 21:03 |
corvus | i'm going to run https://review.opendev.org/828176 and then do the scheduler/web part manually at the end | 21:04 |
corvus | and by "now" i mean in about 5 minutes after i confirm the image promotion | 21:04 |
fungi | sounds good to me. thanks! | 21:05 |
clarkb | zuul is happy with https://review.opendev.org/c/opendev/system-config/+/828605 if we want to go ahead and land that and plan for a gerrit restart later today | 21:12 |
corvus | lgtm | 21:13 |
fungi | and in it goes | 21:14 |
sshnaidm | cores, please merge a patch about perms to delete branches: https://review.opendev.org/c/openstack/project-config/+/828371 | 21:15 |
corvus | pull finished, restarting now | 21:17 |
fungi | i'm getting started on dinner but somewhat around and can pivot to help if something goes sour | 21:19 |
corvus | it does not appear that the mergers exit appropriately on 'zuul-merger stop' | 21:25 |
corvus | i hard-stopped them after gracefully stopping them. i think that will stop them without errors. they're probably just hung on a thread that doesn't exit. | 21:26 |
corvus | ze01-06 are gracefully stopping now. | 21:27 |
fungi | interesting in relation to clarkb's graceful change | 21:32 |
clarkb | corvus: oh fun | 21:35 |
clarkb | corvus: we can probably do a stop against one of the mergers then ask it for a thread dump to see what it is held up on | 21:48 |
clarkb | or I suppose just running it locally may reproduce | 21:48 |
clarkb | looking at the gerrit code for ls-members I don't see anything that might recurse in the actual implementation | 21:49 |
mgagne_ | fungi: sounds like they are planning on 90 days. We had internal discussions about it but no timeline. I guess we have one now. Hopefully they didn't confuse mtl01, which used to be at INAP, with our other OpenStack platform at iWeb. | 21:57 |
corvus | clarkb: yeah, can repro locally. i'm working on a fix. | 22:00 |
fungi | mgagne_: they said that the iweb.com domain went to inap in the sale, and that the identity.api.cloud.iweb.com endpoint we're communicating with is what's being shut down. do we need to switch hostnames there? | 22:01 |
fungi | we're definitely using the mtl01 region, but maybe we need to adjust the api url? | 22:02 |
mgagne_ | ok, I wonder what was in that email, it's a bit confusing. | 22:03 |
fungi | there wasn't much in the email, which is why i started asking them questions | 22:03 |
mgagne_ | For INAP customers, they need to move the URLs used for the OpenStack API to inap.com. Although it was an INAP product, it was using iweb DNS. Now they have to move to inap DNS because well, they don't own iweb.com | 22:04 |
fungi | i said we were using https://identity.api.cloud.iweb.com and they replied "That domain transferred to INAP in the sale. That is exactly what we are shutting down." | 22:04 |
mgagne_ | * face palm * | 22:04 |
fungi | yeah, maybe language barrier? it's possible jennifer curry at inap didn't completely understand what i said we're using | 22:05 |
mgagne_ | I think there is confusion and the request got lost in translation or across department. | 22:05 |
fungi | that wouldn't surprise me at all. this is a complicated field ;) | 22:07 |
mgagne_ | For INAP customers: they need to move to cloud.inap.com. cloud.iweb.com will be phased out. It's the same product, different DNS. For mtl01, it's gonna be phased out, there is no replacement. So updating DNS won't help. Now the official timeline for mtl01 phase out wasn't known to our team. But we had discussion about how/when to make it happen. | 22:07 |
mgagne_ | We (I) didn't communicate yet to you because we didn't have an official timeline/answer about it. But now you know. it won't happen overnight but it's gonna happen at some point. | 22:08 |
fungi | oh, no worries, i was just reaching out trying to understand, sounds like there are a lot of people not talking to one another. this actually started more than a week ago (we received the first communication on february 1 and responded a couple of days later) | 22:11 |
fungi | i guess inap chose a timeline and didn't pass that information along right away | 22:12 |
mgagne_ | I can't officially talk for my new employer but there is no plan to work with OpenStack in the near future. But they also didn't want it to be perceived as officially closing the door forever, whatever that would mean. | 22:13 |
fungi | i definitely don't see it that way either. the help we've had has been great and much appreciated | 22:14 |
mgagne_ | =) | 22:14 |
fungi | i mainly just need to know whether we should turn off our use of that environment right away or wait until the dns record disappears | 22:14 |
fungi | nodepool will handle it fine either way | 22:15 |
opendevreview | Merged opendev/system-config master: Set Gerrit sshd.batchThreads to 0 https://review.opendev.org/c/opendev/system-config/+/828605 | 22:15 |
mgagne_ | DNS will be there for a couple months. | 22:15 |
mgagne_ | I'll keep the boat floating until they officially ask us to shut it down. | 22:15 |
fungi | sounds like we can keep our configuration in place in the meantime then. thanks for all the clarity! | 22:16 |
mgagne_ | We have a lot more other things to take care of before mtl01. | 22:16 |
mgagne_ | np, sorry that it happened that way, I didn't know their plan about communication. | 22:17 |
mgagne_ | #canadians | 22:17 |
fungi | i would say "don't apologize" but... canadians | 22:17 |
mgagne_ | :D | 22:19 |
*** dviroel|ruck is now known as dviroel|out | 22:33 | |
opendevreview | Merged openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 22:38 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing https://review.opendev.org/c/opendev/base-jobs/+/828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828615 | 22:39 |
ianw | ^ "msg": "The conditional check '{{ item }} == 'centos-8'' failed. The error was: error while evaluating conditional ({{ item }} == 'centos-8'): 'centos' is undefined | 22:47 |
ianw | ... and that is why we test base job additions :) | 22:47 |
corvus | i love how excited we all are when the system tells us we're wrong :) | 22:49 |
fungi | i've come to accept zuul's judgement, that i am nearly always wrong | 22:50 |
ianw | heh, i've never thought of it like that, but so true! clarkb ^ might be one for your talk. we actually get excited when we thought something would work and it fails, because it means we just avoided a big production mess :) | 22:51 |
corvus | ianw: the grafana change is deployed already. thanks! (for the review and deployment speedup) | 22:52 |
fungi | earlier today i tried to make a trivial zuul docs change. i attempted to test it locally, but `tox -e docs` wanted more than the 3.5gb i had available, so i punted it up to review and zuul let me know that my assumption about sphinx treating implicit labels the same for :ref: directives as it does for normal link targets is wrong | 22:53 |
fungi | i was quite sure it was fine, but happy to have been proved wrong | 22:54 |
fungi | humility as a service | 22:55 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: base: fail centos-8 if pointing to centos-8-stream image type https://review.opendev.org/c/opendev/base-jobs/+/828437 | 23:00 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: base-test: fix typos in centos-8 detection https://review.opendev.org/c/opendev/base-jobs/+/828616 | 23:00 |
ianw | fungi / ianchoi[m] : i'm guessing from https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&skip=0 the translation jobs are roughly back in shape | 23:11 |
fungi | oh, awesome! thanks ianw, less complicated than i had feared | 23:11 |
ianw | the only failure seems to be possibly just a network blip -> https://zuul.opendev.org/t/openstack/build/397db8ca6c204fca8620d7c0a470959b/console | 23:12 |
fungi | only so many times we can defibrillate zanata though | 23:13 |
ianw | fungi: i definitely agree with your analysis though, it's a ticking time-bomb of fair complexity | 23:13 |
fungi | "the patient miraculously survived" <furtive glance at other soap opera actors> | 23:14 |
corvus | one of the executors has finally finished stopping :) | 23:22 |
fungi | that's reassuring. the others will fall like dominoes | 23:22 |
corvus | [in low gravity] | 23:22 |
*** rlandy|ruck is now known as rlandy|out | 23:23 | |
fungi | better than witnessing their infinite fall into the event horizon of a black hole | 23:24 |
clarkb | ianw: ++ | 23:26 |
clarkb | I got new hardware today and am flipping back and forth between it and the old one so I can get real work done too | 23:27 |
clarkb | I should probably just put it down for a bit though | 23:27 |
clarkb | turns out relatively high res displays in small form factor cause a bunch of random things to be weird | 23:27 |
fungi | or force yourself onto the new hardware and fix up whatever's missing as you go | 23:28 |
clarkb | fungi: I find that I have a really hard time doing that :) I need xmonad and firefox set up just so and so on | 23:28 |
opendevreview | Merged opendev/base-jobs master: base-test: fix typos in centos-8 detection https://review.opendev.org/c/opendev/base-jobs/+/828616 | 23:29 |
fungi | after i found a scalable terminal, i was all set | 23:29 |
clarkb | oh ya that's the other thing, fonts and getting the terminal set up so it doesn't get in xmonad's way with a bunch of menu bars | 23:29 |
clarkb | I could probably automate some of this but xfce (and I think other desktops) have so much config in a registry-like db these days | 23:30 |
clarkb | looks like the batchThreads gerrit change is ready | 23:30 |
clarkb | s/ready/in place on review02/ | 23:30 |
clarkb | is now a bad time to restart for that? gate queues seem pretty quiet and I don't see release jobs | 23:31 |
fungi | executors are still stopping, but that's probably not going to make it a worse time for a gerrit restart | 23:32 |
corvus | no objection from me | 23:35 |
clarkb | ok I'll finish this zuul stopping fixup change review then restart gerrit | 23:35 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Replace kpartx with qemu-nbd in extract-image https://review.opendev.org/c/openstack/diskimage-builder/+/828617 | 23:38 |
fungi | interesting problem with booting a snapshot of the ethercalc server... we put the software for it in the ephemeral disk mounted on /opt, which isn't included when making a server image | 23:38 |
fungi | rsync to the rescue | 23:39 |
clarkb | ok I'm going to prep for a gerrit restart. Shouldn't need a new image. Will just be a docker-compose down then up -d | 23:39 |
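The restart itself is just a compose bounce; a rough sketch, where the compose directory and service name are assumptions about the deployment:

```
cd /etc/gerrit-compose
sudo docker-compose down
sudo docker-compose up -d
# then tail the logs to confirm it comes back up cleanly
sudo docker-compose logs -f gerrit
```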
clarkb | zuul queues still look good. I'm proceeding | 23:40 |
clarkb | oh I just realized that the batchThreads change will conflict with fungi's normalization change of the gerrit config | 23:40 |
clarkb | fungi: we should land your normalization change and land the change to force case sensitive users soon | 23:40 |
clarkb | but back to restarting gerrit | 23:40 |
fungi | which normalization change? i've clearly not told myself what it is i'm working on lately | 23:41 |
clarkb | fungi: you added a bunch of tabs for consistency iirc | 23:42 |
clarkb | the web ui seems to be up. One thing I noticed is that changes loaded immediately after the restart did not have diff or file info | 23:43 |
clarkb | wait 30 seconds and refresh and it shows up | 23:43 |
fungi | oh, i thought the tabs merged already | 23:43 |
clarkb | there don't appear to be tabs in the diff for my change? https://review.opendev.org/c/opendev/system-config/+/828605/1/playbooks/roles/gerrit/templates/gerrit.config.j2 or maybe we need to add more tabs? | 23:45 |
clarkb | Zuul shows up in the interactive queue worker list now doing a show-queue -w -q | 23:45 |
clarkb | fungi: ya looks like it wasn't a complete edit. Just partial. That explains my confusion | 23:47 |
clarkb | #status log Restarted Gerrit to pick up sshd.batchThreads = 0 config update | 23:47 |
opendevstatus | clarkb: finished logging | 23:47 |
clarkb | I notice that apple's web crawler is tripping over the changes that are in a sad state that our reindexing complains about too | 23:49 |
clarkb | I don't think there is much we can do about that | 23:49 |
corvus | regarding the zuul restart, i will likely allow the executor restart to continue tonight and then do the scheduler+web first thing tomorrow | 23:50 |
corvus | (unless someone feels adventurous overnight; but i think running half-upgraded for a while is fine) | 23:51 |
fungi | i'll consider it a valuable experiment | 23:52 |