ianw | https://zuul.opendev.org/t/openstack/build/3ce64b0d3e3845609c14fcd26be34db4/console | 00:00 |
ianw | it's coming from | 00:00 |
ianw | pip install -c /home/zuul/src/opendev.org/openstack/horizon/upper-constraints.txt -r requirements.txt -r test-requirements.txt | 00:00 |
fungi | can't hurt to add it, i'm just not expecting it to solve the retry failures i was seeing where ensure-sphinx was breaking | 00:01 |
ianw | but that's a bit of a misnomer, because aiui the script has already sourced the venv's activate script by that point, so that pip isn't the system pip | 00:01 |
ianw | my reading of it was that pip was from that venv ... but could be wrong! | 00:02 |
fungi | a lot of the failures i was looking at happen in pre-run, before that script ever comes into the picture | 00:02 |
fungi | e.g. the builds for nova | 00:03 |
fungi | failures to install pillow into the sphinx venv | 00:03 |
ianw | hrm, ok, i'm looking @ https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&result=FAILURE | 00:04 |
fungi | you need to broaden it to include RETRY_LIMIT as well | 00:04 |
fungi | but yeah, one thing at a time. as you say, we probably have multiple places this is breaking | 00:04 |
ianw | ok, i see it now. i'll start making some notes | 00:06 |
fungi | it could still be the same underlying issue. maybe the ensure-sphinx role needs to upgrade pip | 00:06 |
fungi | though a bigger problem in my mind is that we're running this on bionic but applying the master branch upper-constraints.txt, which no longer takes older python into account; we may not be building those wheels for bionic, and upstream may no longer be publishing cp36 wheels to pypi either, even though they still have a requires_python which allows 3.6 | 00:10 |
ianw | this is true | 00:11 |
fungi | so as a result, pip is going to grab sdists of some things and the projects don't have the necessary build deps in their bindep.txt | 00:11 |
fungi | but yeah, let's try the easy things first and then it's simpler to reason about solutions for what's still breaking after that | 00:12 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 00:15 |
ianw | fungi: ^ i feel like that might restore the status-quo, maybe? | 00:16 |
fungi | i expect installing with python 2.7 was hitting some legacy paths through constraints files which masked a bunch of problems, so i'm not getting my hopes up | 00:24 |
ianw | error: invalid command 'bdist_wheel' | 00:27 |
ianw | i guess maybe that venv needs wheel too... | 00:27 |
fungi | yes, venv doesn't have wheel by default | 00:28 |
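In practice the fix boils down to giving the docs venv a current pip and wheel before the constrained install runs. A minimal sketch of what the ensure-sphinx change amounts to, with an illustrative venv path rather than the role's real variables:

```
# create the docs venv and bring its packaging tooling up to date
python3 -m venv /tmp/sphinx-venv
/tmp/sphinx-venv/bin/pip install --upgrade pip wheel

# with a current pip, abi3 wheels such as Pillow's are accepted; an old
# pip falls back to the sdist, which then needs libjpeg headers to build
/tmp/sphinx-venv/bin/pip install -c upper-constraints.txt Pillow
```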
ianw | then again, it also says | 00:28 |
ianw | The headers or library files could not be found for jpeg, | 00:28 |
ianw | a required dependency when compiling Pillow from source. | 00:28 |
fungi | but also that only comes into play if it's trying to install things from sdist because it can't find a wheel | 00:28 |
ianw | ... so what is the problem :/ | 00:28 |
fungi | and it's a warning, there's a legacy build codepath which doesn't involve building and installing a wheel | 00:29 |
ianw | so basically we had a wheel for pillow and now don't is the theory | 00:29 |
fungi | a cp27 wheel probably, yes | 00:30 |
fungi | now it wants cp36 for bionic | 00:30 |
ianw | Collecting Pillow===8.4.0 (from -c /home/zuul/src/opendev.org/openstack/requirements/upper-constraints.txt (line 97)) | 00:31 |
clarkb | I think it may have had to do with abi3 wheels | 00:31 |
clarkb | old pip doesn't understand those as valid for any python version iirc | 00:31 |
ianw | http://mirror.iad.rax.opendev.org/wheel/ubuntu-18.04-x86_64/pillow/ | 00:32 |
clarkb | then once you update pip it recognizes it can install those specially annotated wheels | 00:32 |
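One way to check the abi3 theory is to ask pip which wheel tags it will accept; a hedged sketch using the venv's pip on the bionic node (path illustrative):

```
# list the tags this interpreter/pip combination considers compatible
/tmp/sphinx-venv/bin/pip debug --verbose | grep -E 'cp36|abi3'

# an old pip won't list e.g. cp36-abi3-manylinux2014_x86_64, so
# Pillow===8.4.0 resolves to the sdist instead of the published wheel
```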
opendevreview | Neil Hanlon proposed openstack/project-config master: Add rockylinux-8 to nodepool configuration https://review.opendev.org/c/openstack/project-config/+/828435 | 00:32 |
ianw | ahhh ... then the pip upgrade *might* help :) are we back where we started?! :) | 00:32 |
NeilHanlon | 😂 | 00:32 |
ianw | i think we're where we've always been, in a huge tangled mess of dependencies that somehow sometimes works | 00:33 |
NeilHanlon | i.e., python | 00:33 |
fungi | ianw: i think we'll be at the point that pip will think it's possible to install a newer version of pillow than is available as a wheel, but maybe abi3 works for cp36 | 00:34 |
ianw | yeah, it seems like we need to keep iterating | 00:34 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 00:44 |
ianw | now with less typos | 00:44 |
clarkb | ianw: any idea why the base-test log-inventory stuff was out of sync? | 00:45 |
ianw | i didn't go through the history; i assumed something was tested that didn't make it in | 00:46 |
clarkb | gotcha | 00:47 |
ianw | looks like the bits i removed came in via I6c93fd03aadb5e4d15ac7da98887dd7ca4998319 | 00:47 |
ianw | https://review.opendev.org/c/opendev/base-jobs/+/798139 | 00:48 |
*** dviroel|ruck|afk is now known as dviroel|ruck | 00:48 | |
ianw | then it looks like https://review.opendev.org/c/zuul/zuul-jobs/+/798087 didn't make it in? | 00:48 |
clarkb | that might've stalled out due to the zuul fixes that went in mid last year? | 00:54 |
clarkb | things were more aggressively split out and filtered | 00:54 |
*** rlandy|ruck|bbl is now known as rlandy|ruck | 00:57 | |
*** dviroel|ruck is now known as dviroel|ruck|out | 00:57 | |
*** dviroel|ruck|out is now known as dviroel|out | 00:57 | |
opendevreview | Merged zuul/zuul-jobs master: ensure-sphinx: upgrade pip https://review.opendev.org/c/zuul/zuul-jobs/+/828441 | 01:22 |
opendevreview | Merged opendev/base-jobs master: base-test: sync with base/pre.yaml https://review.opendev.org/c/opendev/base-jobs/+/828439 | 01:23 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels" https://review.opendev.org/c/openstack/diskimage-builder/+/827381 | 03:33 |
ianw | ok, it looks like https://zuul.opendev.org/t/openstack/build/0c8aa08db4c844a7bdb70dfe222597ea (upstream-translation-update for nova) passed after making ensure-sphinx update pip in the venv | 03:58 |
ianw | this still leaves problems with the host type, zanata in general, etc. but that's for tomorrow :) | 03:58 |
*** rlandy|ruck is now known as rlandy|out | 04:06 | |
opendevreview | Merged openstack/diskimage-builder master: Cleanup more CentOS 8 bits https://review.opendev.org/c/openstack/diskimage-builder/+/827210 | 04:39 |
opendevreview | Merged openstack/diskimage-builder master: Remove contrib/setup-gate-mirrors.sh https://review.opendev.org/c/openstack/diskimage-builder/+/827211 | 05:02 |
opendevreview | Merged openstack/diskimage-builder master: General improvements to the ubuntu-minimal docs https://review.opendev.org/c/openstack/diskimage-builder/+/806308 | 05:19 |
opendevreview | Merged openstack/diskimage-builder master: Remove extra if/then/else construct in pip element https://review.opendev.org/c/openstack/diskimage-builder/+/822224 | 05:19 |
opendevreview | Merged openstack/diskimage-builder master: Revert "Use rpm -e instead of dnf for cleaning old kernels" https://review.opendev.org/c/openstack/diskimage-builder/+/827381 | 07:50 |
opendevreview | Merged opendev/base-jobs master: base-test: fail centos-8 if pointing to centos-8-stream image type https://review.opendev.org/c/opendev/base-jobs/+/828440 | 08:03 |
*** amoralej|off is now known as amoralej | 08:11 | |
*** jpena|off is now known as jpena | 08:31 | |
*** sshnaidm|afk is now known as sshnaidm | 08:54 | |
*** ysandeep|out is now known as ysandeep | 09:01 | |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd https://review.opendev.org/c/openstack/diskimage-builder/+/828266 | 09:16 |
opendevreview | Riccardo Pittau proposed openstack/diskimage-builder master: Fallback to persistent netifs names with systemd https://review.opendev.org/c/openstack/diskimage-builder/+/828266 | 09:18 |
*** mnasiadka_ is now known as mnasiadka | 09:18 | |
opendevreview | Merged openstack/diskimage-builder master: Don't run functional tests on doc changes https://review.opendev.org/c/openstack/diskimage-builder/+/825891 | 09:21 |
opendevreview | Merged openstack/diskimage-builder master: fedora-container: pull in glibc-langpack-en https://review.opendev.org/c/openstack/diskimage-builder/+/827772 | 09:35 |
sshnaidm | cores, please review the perms patch when you have time: https://review.opendev.org/c/openstack/project-config/+/828371 | 10:41 |
sshnaidm | fungi, ^^ | 10:41 |
*** rlandy|out is now known as rlandy|ruck | 11:06 | |
*** dviroel|out is now known as dviroel|ruck | 11:10 | |
mnasiadka | Good afternoon | 12:37 |
mnasiadka | Since https://opendev.org/openstack/diskimage-builder/commit/398e07e6f2bb5a2f763a22a8e4801108c242ffe2 landed - is there a slight chance that it would be possible to add Rocky Linux 8 to the possible nodesets in Zuul? Kolla projects would be happy to run their CI on something that is not so unpredictable as CentOS Stream (and there's user interest in adding Rocky Linux support - which we'd like to have properly tested). | 12:39 |
*** ysandeep is now known as ysandeep|break | 12:41 | |
*** artom__ is now known as artom | 13:03 | |
*** amoralej is now known as amoralej|lunch | 13:07 | |
*** ysandeep|break is now known as ysandeep | 13:13 | |
frickler | mnasiadka: seems https://review.opendev.org/c/openstack/project-config/+/828435 is the next step | 13:18 |
fungi | mnasiadka: it's in progress, i believe we need a dib release and then a version bump in nodepool | 13:18 |
mnasiadka | oh, great - Neil followed up on that | 13:18 |
mnasiadka | wasn't aware :) | 13:18 |
fungi | maybe the dib release already happened while i was asleep | 13:19 |
frickler | I don't think that that change actually tests builds, so that release+bump may still be needed | 13:20 |
*** amoralej|lunch is now known as amoralej | 13:59 | |
*** pojadhav is now known as pojadhav|brb | 14:06 | |
*** akahat|rover is now known as akahat|PTO | 14:11 | |
*** pojadhav|brb is now known as pojadhav | 14:24 | |
*** pojadhav is now known as pojadhav|dinner | 15:00 | |
*** ysandeep is now known as ysandeep|out | 15:43 | |
*** pojadhav|dinner is now known as pojadhav | 16:14 | |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM change to test and hold gitea 1.16.1 https://review.opendev.org/c/opendev/system-config/+/828586 | 16:35 |
clarkb | fungi: ^ how do I hold for a specific change again? is it --ref refs/changes/xy/abcxy ? | 16:35 |
clarkb | looks like I need the ps in there too | 16:36 |
fungi | yes, you need the revision | 16:39 |
fungi | refs/changes/xy/abcxy/z | 16:39 |
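For reference, an autohold scoped to a single revision looks roughly like the following; the flags and values are illustrative, not the exact command that was run:

```
# hold the next failing build of this job for this exact patchset
zuul autohold --tenant openstack \
  --project opendev.org/opendev/system-config \
  --job system-config-run-gitea \
  --ref refs/changes/86/828586/1 \
  --reason "clarkb: inspect gitea 1.16.1" \
  --count 1
```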
*** ralonsoh_ is now known as ralonsoh | 16:40 | |
clarkb | thanks. I'm hoping to get gitea 1.16.1 held today so we can look it over and double check it against the changelog then maybe upgrade soon | 16:40 |
clarkb | I set that hold up and cleaned up my old gerrit tag pushing hold | 16:42 |
clarkb | ianw: looks like you may have gerrit 3.4 holds that are no longer required since we upgraded. But I'll wait for you to confirm before doing any cleanup | 16:44 |
*** ykarel is now known as ykarel|away | 16:48 | |
*** marios is now known as marios|out | 16:53 | |
clarkb | I suspect https://review.opendev.org/c/openstack/diskimage-builder/+/826976 is the change that ianw is hoping to get sorted for the dib release based on what was said yesterday | 16:55 |
clarkb | I'm having a really hard time parsing what that change aims to do | 16:58 |
clarkb | I guess we want to set up grub without installing a bootloader. But aren't those two things intertwined? | 17:00 |
*** jpena is now known as jpena|off | 17:27 | |
corvus | the mergers look like they may be stuck | 17:46 |
corvus | infra-root: i think we may be looking at the gerrit slowness again | 17:47 |
corvus | the mergers are not stuck, they're just getting really slow performance from gerrit on their git ops. they have a 300 job backlog | 17:48 |
corvus | clarkb: did we have a next step for debugging that? | 17:48 |
clarkb | corvus: luca asked for show-queue -w output from when it was happening | 17:50 |
clarkb | corvus: and maybe we should grab another thread dump since I think the last one captured the very tail end of it | 17:50 |
corvus | clarkb: do you want to "sudo" and do that? | 17:51 |
clarkb | ya I'll work on running jstack to capture a current thread dump | 17:51 |
clarkb | then around the same time try to show-queue -w | 17:52 |
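Capturing both at roughly the same moment looks something like the sketch below; the container name, the assumption that the Gerrit JVM is PID 1, and the admin account name are guesses, not confirmed details of the deployment:

```
# thread dump from the Gerrit JVM, run on review02
sudo docker exec gerrit-compose_gerrit_1 jstack 1 \
  > gerrit_thread_dump.$(date +%Y%m%d)

# queue state, including waiting tasks, via the ssh admin interface
ssh -p 29418 admin.user@review.opendev.org gerrit show-queue -w \
  > gerrit_queues.$(date +%Y%m%d)
```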
corvus | ah, i've re-learned about my admin account, so i can do show-queue -w now too :) | 17:53 |
corvus | oh wow the output is a lot different, you can see the jobs waiting | 17:54 |
clarkb | corvus: yup that was what I was trying to explain to luca. I guess it's a good thing we can actually capture it now. I've done both a jstack and show-queue which captures data that needs filtering before we can upload it | 17:54 |
clarkb | if others want to keep poking and debugging I can start working on cleaning up these files | 17:55 |
clarkb | corvus: one thing I notice is that we've got another old task at the top of the list. | 17:58 |
clarkb | last time we had the same thing | 17:58 |
clarkb | I killed the task last time and maybe that helped make things get better? | 17:58 |
clarkb | I think we didn't see immediate relief but the built up backlog may explain that | 17:58 |
corvus | clarkb: yeah let's try that | 17:59 |
clarkb | corvus: do you want to kill it or should I? | 17:59 |
corvus | i'm watching a zm log and should notice if progress picks up. you kill it. | 17:59 |
clarkb | ok one sec while I look up the command again | 17:59 |
corvus | fwiw, the backlog started at 16:39 | 18:00 |
clarkb | done | 18:00 |
corvus | there might have been a slight improvement, from ~30s per repo to ~20 | 18:03 |
corvus | i don't think it's enough to keep up with incoming workload though | 18:04 |
corvus | the queue is decreasing | 18:06 |
clarkb | pretty sharply too | 18:06 |
clarkb | so maybe the issue is that old task | 18:06 |
fungi | no sudden jump in the merger queue this time, it did see a bit of a rise and then climbed a decelerating curve as of ~16:40z | 18:07 |
clarkb | and hopefully correlation between thread dump and show-queue prior to removing the old task can help upstream diagnose it. There may also be zuul settings to time out connections for ssh? | 18:07 |
fungi | the executors were very busy for a while before that too | 18:07 |
clarkb | we might be able to workaround this if so by setting a reasonable connection timeout to like an hour? | 18:07 |
clarkb | I'm going to keep working on sanitizing these files. But maybe someone can look at gerrit config options around that | 18:07 |
fungi | a largeish stack of nova changes were enqueued into check around that time | 18:08 |
fungi | yeah, looks like there was a stack rebase and push for those right at 16:40 | 18:09 |
fungi | so it's possible gerrit was already slow, and this was the bump which pushed things over the edge | 18:10 |
corvus | so do we have 4 slots for servicing this? | 18:10 |
clarkb | corvus: we should have 100, which is part of what I asked about on the bug I filed | 18:11 |
clarkb | the thread dump also shows 100 threads exist | 18:11 |
fungi | that's how it's acting (and how it was acting last time as well) but yeah it's not what we think is configured | 18:11 |
clarkb | It seems like there is some other limitation (thread contention, locks? I don't know what) | 18:11 |
corvus | then show-queue has all but 4 git-upload-pack jobs waiting | 18:11 |
clarkb | yup I think that is why luca wanted the show queue output. | 18:12 |
corvus | k | 18:12 |
clarkb | review02:~clarkb/gerrit_queues.20220209.sanitized should be sanitized. But please double check it particularly the query tasks as I'm not sure if we need to scrub out the change identifiers too (I don't think so since all our changes are public) | 18:12 |
clarkb | now to work on the thread dump | 18:12 |
corvus | clarkb: sanitized lgtm | 18:14 |
clarkb | corvus: still sanitizing the thread dump but it looks like some of the waiting tasks are waiting on a lock | 18:15 |
clarkb | I'm hopeful this will end up allowing this to be understood and fixed given what I'm seeing. This may take some time though as I'm trying to synchronize the sanitized usernames between the two files | 18:15 |
clarkb | that way they can be directly correlated | 18:15 |
*** amoralej is now known as amoralej|off | 18:17 | |
corvus | queue @ 200 now | 18:18 |
corvus | there's now a 2m old task at the top, and things are slowing down again | 18:31 |
corvus | so it does seem like we're right on the edge of holding performance | 18:31 |
corvus | it finished; so we don't need to kill it or anything, just may be informative. | 18:32 |
clarkb | it does seem that after I killed the very old task the queue dropped quickly | 18:34 |
clarkb | makes me wonder if longer running tasks create a lot of contention somehow | 18:35 |
corvus | it leveled off while the 2m old task was there and has resumed falling | 18:35 |
corvus | well, if we can only service 4 of them at a time, then our capacity drops by 25% | 18:35 |
clarkb | but also none of this explains why 95% of our interactive ssh worker threads are doing nothing. Unless the same lock causes contention with thread assignment | 18:35 |
clarkb | ya that | 18:35 |
corvus | you can now see the little shelf on the merger queue graph from that 2m job | 18:36 |
corvus | (was probably 3+ minutes by the end, which more closely corresponds with the shelf length) | 18:37 |
corvus | <100 | 18:48 |
corvus | still seeing about 20s per repo on the merger | 18:49 |
clarkb | ok I've gone through review02:~clarkb/gerrit_thread_dump.20220209.sanitized and cleaned up what I could find. The diff against gerrit_thread_dump.20220209 will show you what I changed | 18:50 |
clarkb | there were no http entries that needed cleanup this time from what I could see | 18:50 |
clarkb | corvus: I think the two big questions are "why are we slowing down in general" and "why are we not using the many free ssh worker threads that could be used to spread out the load" | 18:51 |
corvus | yeah, though if the slowdown is entirely just waiting for threads, could be only one question. | 18:51 |
clarkb | indeed | 18:52 |
clarkb | sshd.idleTimeout and sshd.waitTimeout may be useful here depending on whether or not longer running requests are a problem | 18:53 |
clarkb | that might impact zuul listening on ssh event streaming though | 18:53 |
clarkb | anyway if ya'll can take a look at those two files on review02 and give them a critical eye I can update the issue with a bit more info | 18:54 |
clarkb | hrm we set idletimeout to an hour already | 18:55 |
clarkb | and waittimeout defaults to 30s so maybe not | 18:55 |
corvus | wonder why that job was there for so long then | 18:55 |
clarkb | oh! I've just now noticed that luca wanted -q added to show-queue | 18:57 |
clarkb | unfortunately too late to add that now, but if you run it, it gives more detailed information on the internal queues too | 18:58 |
clarkb | notably all those tasks are apparently in the batchworker queue not the stream events queue | 18:58 |
clarkb | I think we are using 2 batch threads which is the default on a multicore system based on that | 18:58 |
clarkb | and zuul et al. are being scheduled to that despite my earlier group membership checking | 18:58 |
clarkb | maybe zuul is a member of service users and I missed it before? | 18:59 |
corvus | oh interesting. the split between batch and interactive seems arbitrary too. | 18:59 |
corvus | there are some 3pci in batch and some in interactive. | 19:00 |
fungi | i think it's based on group membership | 19:00 |
clarkb | ya I think we may have to address that via groups | 19:00 |
clarkb | reading the config docs if we set batch threads to 0 then the interactive and non interactive users share a threadpool | 19:00 |
clarkb | that might be the most straightforward thing for us to do though maybe not the most correct thing | 19:00 |
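For reference, the knobs being discussed all live in the sshd section of gerrit.config; a hedged sketch of the equivalent edit, with an illustrative file path (the real change goes through the system-config template):

```
# share one worker pool between interactive and batch (service) users
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.batchThreads 0

# settings already in place per the discussion above
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.threads 100
git config -f /home/gerrit2/review_site/etc/gerrit.config sshd.idleTimeout 1h
```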
corvus | we need 3 pools :/ | 19:01 |
clarkb | corvus: humans, zuul, else? | 19:01 |
corvus | ya | 19:01 |
fungi | or arbitrary preemptable pools | 19:01 |
fungi | but yes, more than 2 | 19:01 |
fungi | our users, our ci systems, our users ci systems | 19:01 |
clarkb | any objections to me pushing a change to set batchThreads to 0 and share for now? or would we prefer to look into cleaning things up and splitting the pools | 19:02 |
corvus | clarkb: fwiw your file looks good, but it seems like we have things to try before we necessarily go back to gerrit folks | 19:02 |
clarkb | ++ | 19:02 |
fungi | maybe an alternative would be to put third-party ci accounts back into the normal user threadpool, and dedicate the batch pool to zuul? | 19:03 |
fungi | but i'm fine with trying the giant shared pool first | 19:03 |
clarkb | fungi: Service Users also impacts the attention set stuff unfortunately | 19:03 |
clarkb | overloading those two sets makes this really awkward for us | 19:03 |
fungi | that does seem like something they ought to consider splitting | 19:03 |
corvus | why do each of the queues have 2 worker threads? | 19:05 |
corvus | oh wait, batch has 2 interactive has 100? | 19:05 |
clarkb | ya that | 19:05 |
clarkb | corvus: we've long set sshd.threads to 100 (since like 2.8 maybe? its old) but then recently with attention sets and changes along those lines gerrit recognizes service users and split those out into a separate pool | 19:06 |
clarkb | corvus: I thought the default was that threads were always shared though but maybe that changed in 3.4? | 19:06 |
corvus | i think the batch pool has been around for a while | 19:06 |
clarkb | corvus: it has been, but I was fairly certain the default was to share threads not to only use 2. I think that may be the change | 19:06 |
clarkb | I'm trying to find the 3.3 docs to confirm | 19:07 |
corvus | what was the batch setting for if not to segregate threads? | 19:07 |
corvus | though it's really not that important | 19:07 |
fungi | looks like the mergers have fully caught up again now | 19:07 |
corvus | the important thing is which group of users we want to have sidelined when someone holds an ssh operation open for 2 weeks | 19:08 |
clarkb | corvus: ya ideally the idletimeout would address that and then we can have enough headroom on thread count that we largely avoid it. Or add additional pools | 19:08 |
clarkb | corvus: fwiw 3.3 docs say 2 for batchThreads is the default so maybe we just never noticed until recently | 19:08 |
corvus | do you have the unsanitized queue dump? | 19:09 |
opendevreview | Clark Boylan proposed opendev/system-config master: Set Gerrit sshd.batchThreads to 0 https://review.opendev.org/c/opendev/system-config/+/828605 | 19:10 |
clarkb | corvus: I do | 19:10 |
clarkb | corvus: one sec I'll put it on review02 | 19:10 |
corvus | i forgot which user was the one running the task from feb 4 that we killed; would like to confirm they're in the batch worker group | 19:10 |
clarkb | corvus: it's on review02 now without the sanitized suffix | 19:11 |
clarkb | I think it was userA though | 19:11 |
clarkb | "gerrit ls-members --recursive 'Service Users'" <- that doesn't show me zuul so it must be finding zuul in that group via some other method? | 19:13 |
corvus | priority = batch group Non-Interactive Users | 19:15 |
clarkb | Non-Interactive Users got renamed to Service users. I think the text may not have updated in that move because the uuid for the group stayed the same and that is what gerrit uses | 19:16 |
clarkb | so ya it would be membership of that group or another priority = batch entry for additional groups | 19:17 |
clarkb | I half wonder if ls-members doesn't recurse properly and we should "sudo" and check via the web ui | 19:17 |
clarkb | https://osm.etsi.org/gerrit/Documentation/rest-api-groups.html#list-subgroups should be able to do it too | 19:18 |
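A hedged way to double-check the membership from the command line, assuming an account with HTTP credentials and read access to the group (Gerrit prefixes JSON responses with a )]}' line, hence the tail):

```
# direct subgroups of Service Users
curl -su myuser:http-password \
  'https://review.opendev.org/a/groups/Service%20Users/groups/' | tail -n +2

# members including those inherited from subgroups
curl -su myuser:http-password \
  'https://review.opendev.org/a/groups/Service%20Users/members/?recursive' \
  | tail -n +2
```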
clarkb | heh it's fun how you get different gerrit installations back from google when googling this stuff | 19:19 |
corvus | yeah, i think it's worth exploring in the web ui; i have to run for a bit | 19:19 |
fungi | the rest api has a recurse option for group member listing | 19:19 |
fungi | clarkb: oh, so does ls-members... "--recursive : to resolve included groups recursively (default: false)" | 19:20 |
clarkb | fungi: ya but if I do --recursive it returns the same results and from everything we can tell somehow zuul is in that group | 19:21 |
fungi | oh | 19:21 |
clarkb | basically I'm not trusting it :) | 19:21 |
clarkb | because otherwise how is zuul ending up in the batch queue | 19:21 |
fungi | ahh, now i see in scrollback you already discussed --recursive | 19:21 |
clarkb | I need to take a quick break myself and find something to drink. But I can settle back in and look via the web ui | 19:21 |
fungi | mgagne: we've been contacted from someone in sales at inap saying they're turning off the iweb cloud in mtl01. i guess that part didn't transfer to leaseweb? | 19:29 |
mgagne_ | fungi: I didn't know it was that far into the process and that sales was going to do the communication themselves. | 19:29 |
fungi | ahh, well it was someone offering to put us in touch with sales reps about pricing out access to their vmware cloud | 19:30 |
fungi | i just wanted to make sure it's actually going away and not moving to leaseweb before i replied further | 19:31 |
mgagne_ | I don't think you should have received this email, not in that format. But it's true that we are planning on sunsetting the openstack platform in mtl01. The sequence/data is yet tbd. | 19:31 |
fungi | seems like they may have reached out to contact addresses they had on file for the accounts in that environment, since it went to our infra-root alias inbox | 19:32 |
clarkb | fungi: corvus: "Continuous Integration Tools" and "Third-Party CI" are both group members of Service Users. The ssh ls-members is bugged I guess. Zuul is a member of Continuous Integration Tools | 19:33 |
mgagne_ | I wonder how they got that specific email tbh. | 19:33 |
clarkb | fungi: mgagne_: we cc'd you on the email thread but maybe not to a currently valid email? | 19:33 |
clarkb | fungi: corvus: now that we know that I think we can consider this provisionally solved and work to adjust the batch users thread pool size. I expect that setting it to 0 and sharing is probably the least bad option for us currently due to the overloaded use of service users with attention sets | 19:34 |
clarkb | it may be possible to do some followup where we remove batch priority from service users and assign it to Continuous Integration Tools and then have third party ci remain non participatory with attention sets and go into the interactive pool or similar. But that would probably need more testing and planning. Setting batchThreads to 0 should be fairly safe | 19:35 |
fungi | mgagne_: anyway, thanks for the details. it sounds like i can have enough info to be able to reply. as far as timeline they said we have 90 days to migrate to vmware before they're turning the environment off | 19:35 |
clarkb | I've deescalated my privs in the web ui now | 19:36 |
mgagne_ | I'm currently in a meeting, I'll get back to you in ~60m at most. | 19:36 |
fungi | mgagne_: no worries, take your time. i appreciate the help. i'll wait to reply in that case | 19:37 |
clarkb | ++ thank you | 19:39 |
clarkb | infra-root https://104.130.74.7:3081/opendev/system-config has been held for gitea checking. Though I think I'll defer a bit on that until we can close out the gerrit issue | 19:44 |
*** timburke__ is now known as timburke | 20:54 | |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 21:02 |
opendevreview | James E. Blair proposed openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 21:03 |
corvus | i would like to do a rolling restart of zuul now. | 21:03 |
corvus | i'm going to run https://review.opendev.org/828176 and then do the scheduler/web part manually at the end | 21:04 |
corvus | and by "now" i mean in about 5 minutes after i confirm the image promotion | 21:04 |
fungi | sounds good to me. thanks! | 21:05 |
clarkb | zuul is happy with https://review.opendev.org/c/opendev/system-config/+/828605 if we want to go ahead and land that and plan for a gerrit restart later today | 21:12 |
corvus | lgtm | 21:13 |
fungi | and in it goes | 21:14 |
sshnaidm | cores, please merge a patch about perms to delete branches: https://review.opendev.org/c/openstack/project-config/+/828371 | 21:15 |
corvus | pull finished, restarting now | 21:17 |
fungi | i'm getting started on dinner but somewhat around and can pivot to help if something goes sour | 21:19 |
corvus | it does not appear that the mergers exit appropriately on 'zuul-merger stop' | 21:25 |
corvus | i hard-stopped them after gracefully stopping them. i think that will stop them without errors. they're probably just hung on a thread that doesn't exit. | 21:26 |
corvus | ze01-06 are gracefully stopping now. | 21:27 |
fungi | interesting in relation to clarkb's graceful change | 21:32 |
clarkb | corvus: oh fun | 21:35 |
clarkb | corvus: we can probably do a stop against one of the mergers then ask it for a thread dump to see what it is held up on | 21:48 |
clarkb | or I suppose just running it locally may reproduce | 21:48 |
clarkb | looking at the gerrit code for ls-members I don't see anything that might recurse in the actual implementation | 21:49 |
mgagne_ | fungi: sounds like they are planning on 90 days. We had internal discussions about it but no timeline. I guess we have one now. Hopefully they didn't confuse mtl01, which used to be at INAP, with our other OpenStack platform at iWeb. | 21:57 |
corvus | clarkb: yeah, can repro locally. i'm working on a fix. | 22:00 |
fungi | mgagne_: they said that the iweb.com domain went to inap in the sale, and that the identity.api.cloud.iweb.com endpoint we're communicating with is what's being shut down. do we need to switch hostnames there? | 22:01 |
fungi | we're definitely using the mtl01 region, but maybe we need to adjust the api url? | 22:02 |
mgagne_ | ok, I wonder what was in that email, it's a bit confusing. | 22:03 |
fungi | there wasn't much in the email, which is why i started asking them questions | 22:03 |
mgagne_ | For INAP customers, they need to move the URLs used for the OpenStack API to inap.com. Although it was an INAP product, it was using iweb DNS. Now they have to move to inap DNS because well, they don't own iweb.com | 22:04 |
fungi | i said we were using https://identity.api.cloud.iweb.com and they replied "That domain transferred to INAP in the sale. That is exactly what we are shutting down." | 22:04 |
mgagne_ | * face palm * | 22:04 |
fungi | yeah, maybe language barrier? it's possible jennifer curry at inap didn't completely understand what i said we're using | 22:05 |
mgagne_ | I think there is confusion and the request got lost in translation or across department. | 22:05 |
fungi | that wouldn't surprise me at all. this is a complicated field ;) | 22:07 |
mgagne_ | For INAP customers: they need to move to cloud.inap.com. cloud.iweb.com will be phased out. It's the same product, different DNS. For mtl01, it's gonna be phased out, there is no replacement. So updating DNS won't help. Now the official timeline for mtl01 phase out wasn't known to our team. But we had discussion about how/when to make it happen. | 22:07 |
mgagne_ | We (I) didn't communicate yet to you because we didn't have an official timeline/answer about it. But now you know. it won't happen overnight but it's gonna happen at some point. | 22:08 |
fungi | oh, no worries, i was just reaching out trying to understand, sounds like there are a lot of people not talking to one another. this actually started more than a week ago (we received the first communication on february 1 and responded a couple of days later) | 22:11 |
fungi | i guess inap chose a timeline and didn't pass that information along right away | 22:12 |
mgagne_ | I can't officially talk for my new employer but there is no plan to work with OpenStack in the near future. But they also didn't want it to be perceived as officially closing the door forever, whatever that would mean. | 22:13 |
fungi | i definitely don't see it that way either. the help we've had has been great and much appreciated | 22:14 |
mgagne_ | =) | 22:14 |
fungi | i mainly just need to know whether we should turn off our use of that environment right away or wait until the dns record disappears | 22:14 |
fungi | nodepool will handle it fine either way | 22:15 |
opendevreview | Merged opendev/system-config master: Set Gerrit sshd.batchThreads to 0 https://review.opendev.org/c/opendev/system-config/+/828605 | 22:15 |
mgagne_ | DNS will be there for a couple months. | 22:15 |
mgagne_ | I'll keep the boat floating until they officially ask us to shut it down. | 22:15 |
fungi | sounds like we can keep our configuration in place in the meantime then. thanks for all the clarity! | 22:16 |
mgagne_ | We have a lot more other things to take care of before mtl01. | 22:16 |
mgagne_ | np, sorry that it happened that way, I didn't know their plan about communication. | 22:17 |
mgagne_ | #canadians | 22:17 |
fungi | i would say "don't apologize" but... canadians | 22:17 |
mgagne_ | :D | 22:19 |
*** dviroel|ruck is now known as dviroel|out | 22:33 | |
opendevreview | Merged openstack/project-config master: Add zuul-web stats to zuul-status page https://review.opendev.org/c/openstack/project-config/+/828609 | 22:38 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: [dnm] testing https://review.opendev.org/c/opendev/base-jobs/+/828440 https://review.opendev.org/c/zuul/zuul-jobs/+/828615 | 22:39 |
ianw | ^ "msg": "The conditional check '{{ item }} == 'centos-8'' failed. The error was: error while evaluating conditional ({{ item }} == 'centos-8'): 'centos' is undefined | 22:47 |
ianw | ... and that is why we test base job additions :) | 22:47 |
corvus | i love how excited we all are when the system tells us we're wrong :) | 22:49 |
fungi | i've come to accept zuul's judgement, that i am nearly always wrong | 22:50 |
ianw | heh, i've never thought of it like that, but so true! clarkb ^ might be one for your talk. we actually get excited when we thought something would work and it fails, because it means we just avoided a big production mess :) | 22:51 |
corvus | ianw: the grafana change is deployed already. thanks! (for the review and deployment speedup) | 22:52 |
fungi | earlier today i tried to make a trivial zuul docs change. i attempted to test it locally, but `tox -e docs` wanted more than the 3.5gb i had available, so i punted it up to review and zuul let me know that my assumption about sphinx treating implicit labels the same for :ref: directives as it does for normal link targets is wrong | 22:53 |
fungi | i was quite sure it was fine, but happy to have been proved wrong | 22:54 |
fungi | humility as a service | 22:55 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: base: fail centos-8 if pointing to centos-8-stream image type https://review.opendev.org/c/opendev/base-jobs/+/828437 | 23:00 |
opendevreview | Ian Wienand proposed opendev/base-jobs master: base-test: fix typos in centos-8 detection https://review.opendev.org/c/opendev/base-jobs/+/828616 | 23:00 |
ianw | fungi / ianchoi[m] : i'm guessing from https://zuul.opendev.org/t/openstack/builds?job_name=propose-translation-update&job_name=upstream-translation-update&skip=0 the translation jobs are roughly back in shape | 23:11 |
fungi | oh, awesome! thanks ianw, less complicated than i had feared | 23:11 |
ianw | the only failure seems to be possibly just a network blip -> https://zuul.opendev.org/t/openstack/build/397db8ca6c204fca8620d7c0a470959b/console | 23:12 |
fungi | only so many times we can defibrillate zanata though | 23:13 |
ianw | fungi: i definitely agree with your analysis though, it's a ticking time-bomb of fair complexity | 23:13 |
fungi | "the patient miraculously survived" <furtive glance at other soap opera actors> | 23:14 |
corvus | one of the executors has finally finished stopping :) | 23:22 |
fungi | that's reassuring. the others will fall like dominoes | 23:22 |
corvus | [in low gravity] | 23:22 |
*** rlandy|ruck is now known as rlandy|out | 23:23 | |
fungi | better than witnessing their infinite fall into the event horizon of a black hole | 23:24 |
clarkb | ianw: ++ | 23:26 |
clarkb | I got new hardware today and am flipping back and forth between it and the old one so I can get real work done too | 23:27 |
clarkb | I should probably just put it down for a bit though | 23:27 |
clarkb | turns out relatively high res displays in small form factor cause a bunch of random things to be weird | 23:27 |
fungi | or force yourself onto the new hardware and fix up whatever's missing as you go | 23:28 |
clarkb | fungi: I find that I have a really hard time doing that :) I need xmonad and firefox set up just so and so on | 23:28 |
opendevreview | Merged opendev/base-jobs master: base-test: fix typos in centos-8 detection https://review.opendev.org/c/opendev/base-jobs/+/828616 | 23:29 |
fungi | after i found a scalable terminal, i was all set | 23:29 |
clarkb | oh ya that's the other thing, fonts and getting the terminal set up so it doesn't get in xmonad's way with a bunch of menu bars | 23:29 |
clarkb | I could probably automate some of this but xfce (and I think other desktops) have so much config in a registry-like db these days | 23:30 |
clarkb | looks like the batchThreads gerrit change is ready | 23:30 |
clarkb | s/ready/in place on review02/ | 23:30 |
clarkb | is now a bad time to restart for that? gate queues seem pretty quiet and I don't see release jobs | 23:31 |
fungi | executors are still stopping, but that's probably not going to make it a worse time for a gerrit restart | 23:32 |
corvus | no objection from me | 23:35 |
clarkb | ok I'll finish this zuul stopping fixup change review then restart gerrit | 23:35 |
opendevreview | Steve Baker proposed openstack/diskimage-builder master: Replace kpartx with qemu-nbd in extract-image https://review.opendev.org/c/openstack/diskimage-builder/+/828617 | 23:38 |
fungi | interesting problem with booting a snapshot of the ethercalc server... we put the software for it in the ephemeral disk mounted on /opt, which isn't included when making a server image | 23:38 |
fungi | rsync to the rescue | 23:39 |
clarkb | ok I'm going to prep for a gerrit restart. Shouldn't need a new image. Will just be a docker-compose down then up -d | 23:39 |
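The restart itself is just a compose bounce; a rough sketch, where the compose directory and service name are assumptions about the deployment:

```
cd /etc/gerrit-compose
sudo docker-compose down
sudo docker-compose up -d
# then tail the logs to confirm it comes back up cleanly
sudo docker-compose logs -f gerrit
```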
clarkb | zuul queues still look good. I'm proceeding | 23:40 |
clarkb | oh I just realized that the batchThreads change will conflict with fungi's normalization change of the gerrit config | 23:40 |
clarkb | fungi: we should land your normalization change and land the change to force case sensitive users soon | 23:40 |
clarkb | but back to restarting gerrit | 23:40 |
fungi | which normalization change? i've clearly not told myself what it is i'm working on lately | 23:41 |
clarkb | fungi: you added a bunch of tabs for consistency iirc | 23:42 |
clarkb | the web ui seems to be up. One thing I noticed is that changes loaded immediately after the restart did not have diff or file info | 23:43 |
clarkb | wait 30 seconds and refresh and it shows up | 23:43 |
fungi | oh, i thought the tabs merged already | 23:43 |
clarkb | there don't appear to be tabs in the diff for my change? https://review.opendev.org/c/opendev/system-config/+/828605/1/playbooks/roles/gerrit/templates/gerrit.config.j2 or maybe we need to add more tabs? | 23:45 |
clarkb | Zuul shows up in the interactive queue worker list now doing a show-queue -w -q | 23:45 |
clarkb | fungi: ya looks like it wasn't a complete edit. Just partial. That explains my confusion | 23:47 |
clarkb | #status log Restarted Gerrit to pick up sshd.batchThreads = 0 config update | 23:47 |
opendevstatus | clarkb: finished logging | 23:47 |
clarkb | I notice that apple's web crawler is tripping over the changes that are in a sad state that our reindexing complains about too | 23:49 |
clarkb | I don't think there is much we can do about that | 23:49 |
corvus | regarding the zuul restart, i will likely allow the executor restart to continue tonight and then do the scheduler+web first thing tomorrow | 23:50 |
corvus | (unless someone feels adventurous overnight; but i think running half-upgraded for a while is fine) | 23:51 |
fungi | i'll consider it a valuable experiment | 23:52 |