opendevreview | Ian Wienand proposed openstack/diskimage-builder master: tests: remove debootstrap install https://review.opendev.org/c/openstack/diskimage-builder/+/815571 | 00:02 |
---|---|---|
ianw | frickler: https://gerrit-review.googlesource.com/c/gerrit/+/321535 just merged to remove the /#/ from dashboard urls in the docs, so i guess that is the ultimate solution for the problem you found :) | 00:05 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: dracut-regenerate: drop Python 2 packages https://review.opendev.org/c/openstack/diskimage-builder/+/815409 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: fedora-container: regenerate initramfs for F34 https://review.opendev.org/c/openstack/diskimage-builder/+/815385 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 00:44 |
*** pojadhav|out is now known as pojadhav|ruck | 02:57 | |
*** ysandeep|out is now known as ysandeep | 04:16 | |
*** ykarel|away is now known as ykarel | 05:04 | |
*** ysandeep is now known as ysandeep|brb | 05:52 | |
*** ysandeep|brb is now known as ysandeep | 06:41 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 07:18 |
opendevreview | daniel.pawlik proposed openstack/project-config master: Setup zuul jobs for openstack/ci-log-processing project https://review.opendev.org/c/openstack/project-config/+/815024 | 07:37 |
zigo | clarkb: FYI, simplejson was uploaded to unstable today with python3-setuptools as build-depends. Likely, it will be in Ubuntu 22.04. | 07:54 |
ysandeep | #opendev: https://zuul.opendev.org/t/openstack/status#815472 in the gate queue has been waiting for a node for a long time and looks stuck. Could someone please take a look? | 08:00 |
ysandeep | One other job in check also looks stuck: https://zuul.opendev.org/t/openstack/status#813930 | 08:04 |
ysandeep | Should we abandon/restore? | 08:05 |
*** ysandeep is now known as ysandeep|lunch | 08:08 | |
*** dpawlik3 is now known as dpawlik | 08:34 | |
opendevreview | Merged opendev/system-config master: Update artifact signing key management process https://review.opendev.org/c/opendev/system-config/+/815547 | 08:48 |
*** ykarel is now known as ykarel|lunch | 09:05 | |
fzzf[m] | Hi. I cloned ci-sandbox over http, then ran git-review -s. After entering my username, this error occurred: https://paste.opendev.org/show/810236/ . Thanks in advance. I plan to use ci-sandbox to test an external CI installed by SF, but I'm new to using Gerrit. What steps should I follow? Any help would be appreciated. | 09:11 |
fzzf[m] | I have configured SF with gerrit_connections and added SSH keys on Gerrit. | 09:12 |
*** ysandeep|lunch is now known as ysandeep | 09:19 | |
ysandeep | #opendev we abandoned/restored 815472 to clear the gate, but https://zuul.opendev.org/t/openstack/status#813930 is still stuck if you want to investigate. | 09:25 |
*** ykarel|lunch is now known as ykarel | 10:19 | |
*** cloudnull5 is now known as cloudnull | 10:38 | |
*** marios is now known as marios|afk | 10:47 | |
zigo | What's James Blair's nick on IRC? | 10:50 |
odyssey4me | zigo corvus | 10:59 |
zigo | Thanks. | 10:59 |
zigo | corvus: Sphinx 4.0 removed the PyModulelevel class; do you think that here: https://opendev.org/jjb/jenkins-job-builder/src/branch/master/jenkins_jobs/sphinx/yaml.py it's fine to replace it with PyMethod instead when running with Sphinx 4.2? I'm asking because "git blame" tells me you're the author of that code ... :) | 11:01 |
zigo | https://review.opendev.org/c/jjb/jenkins-job-builder/+/815624 | 11:03 |
*** ysandeep is now known as ysandeep|afk | 11:13 | |
*** dviroel|rover|afk is now known as dviroel|rover | 11:15 | |
*** marios|afk is now known as marios | 11:21 | |
fungi | zigo: we haven't used jjb in years, but its readme suggests they still hang out in #openstack-jjb (now on oftc), and have a mailing list here: https://groups.google.com/g/jenkins-job-builder | 11:38 |
fungi | basically, maintenance of the tool was handed over to some of its remaining users | 11:40 |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 11:42 |
fungi | ysandeep|afk: i see one stuck in check, not gate. i have an early appointment to get to, but i can take a look once i get back if it's still there | 11:45 |
zigo | fungi: I know, yeah, but I'm still asking James in case he knows! :) | 11:50 |
zigo | fungi: FYI, Sphinx 4.x has caused *many* build failures in my packages; it's a real pain. | 11:51 |
zigo | Upstream seems to be quite careless about backward compat, claiming they need to move forward and therefore cannot care... | 11:51 |
zigo | Which would be kind of fine, if at least it were correctly documented (which isn't the case). | 11:52 |
zigo | I've just opened a bug about it asking them to do a better job. | 11:52 |
zigo | (kindly asking) | 11:52 |
*** ysandeep|afk is now known as ysandeep | 11:54 | |
ysandeep | fungi: thanks! | 11:54 |
*** ykarel_ is now known as ykarel | 12:01 | |
fungi | zigo: makes sense, and yeah it might unfortunately be one of those situations where sphinx 3 and 4 need separate packages for a while :( | 12:01 |
fungi | zigo: looking at some other sphinx 4 conversions, i see PyFunction being used as a replacement | 12:05 |
fungi | which, given the name of the subclass, seems like it would probably be a better fit | 12:05 |
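For reference, a minimal compatibility sketch of the swap being discussed; this is not the actual jenkins-job-builder patch, and the YAMLFunction name is invented purely for illustration:

```python
# Hedged sketch: pick whichever base class the installed Sphinx provides.
# PyModulelevel was removed in Sphinx 4.0; PyFunction is the replacement
# suggested above. The YAMLFunction directive name is made up here.
try:
    from sphinx.domains.python import PyFunction as _PyBase  # Sphinx >= 2.1
except ImportError:
    from sphinx.domains.python import PyModulelevel as _PyBase  # older Sphinx


class YAMLFunction(_PyBase):
    """Render a YAML macro entry using function-style signature handling."""
```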
fungi | okay, so for the stuck 813930,1 in the openstack tenant's check pipeline, it looks like there are two builds which have neither completed nor timed out. i'll try to see if there's a lost completion event for those or something | 12:57 |
fungi | actually i don't see evidence it's stuck | 13:00 |
fungi | i think those two builds may have just taken ~14 hours to get nodes assigned | 13:01 |
fungi | 2021-10-27 11:43:47,609 INFO zuul.Pipeline.openstack.check: [e: dc7497cbf976446bbbc46d849296dd8a] Completed node request <NodeRequest 299-0015899688 ['centos-8-stream']> for job tripleo-ci-centos-8-standalone of item <QueueItem 0654f7b652724616adb4e8ea143b95c4 for <Change 0x7fe5c2b8a040 openstack/python-tripleoclient 813930,1> in check> with nodes ['0027115320'] | 13:01 |
*** jpena|off is now known as jpena | 13:04 | |
fungi | grafana says we've not been under any node pressure though | 13:04 |
ysandeep | new tripleo patches were getting nodes even though 813930 was waiting for nodes. | 13:06 |
fungi | yeah, i agree that's a bit odd | 13:07 |
fungi | 2021-10-26 22:46:43,028 INFO zuul.nodepool: [e: dc7497cbf976446bbbc46d849296dd8a] Submitted node request <NodeRequest 299-0015899688 ['centos-8-stream']> | 13:10 |
fungi | it finally reached priority 0 at 2021-10-27 01:48:27,106 (three hours later) | 13:11 |
fungi | but took a further 10 hours to get fulfilled | 13:12 |
fungi | that's as much as the scheduler knows. i'll see if the launchers can give me a clearer picture of what was going on from their end | 13:13 |
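Roughly the kind of log arithmetic being done here; the scheduler log path and the timestamp slicing below are assumptions rather than anything taken from the quoted session:

```python
# Hedged sketch: estimate how long a node request waited by diffing the
# "Submitted" and "Completed" scheduler log lines quoted above. The log
# path and the 23-character timestamp prefix are assumptions.
from datetime import datetime

REQUEST_ID = "299-0015899688"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

stamps = []
with open("/var/log/zuul/debug.log") as log:
    for line in log:
        if REQUEST_ID not in line:
            continue
        if "Submitted node request" in line or "Completed node request" in line:
            stamps.append(datetime.strptime(line[:23], TS_FORMAT))

if len(stamps) >= 2:
    print(f"request {REQUEST_ID} waited {stamps[-1] - stamps[0]}")
```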
fungi | looks like the request was originally picked up by nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 | 13:18 |
ysandeep | ack thanks! | 13:19 |
fungi | 2021-10-26 22:46:50,079 DEBUG nodepool.PoolWorker.airship-kna1-airship: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Locking request | 13:20 |
fungi | oh this is interesting, almost immediately it did... | 13:21 |
fungi | 2021-10-26 22:46:50,104 INFO nodepool.driver.NodeRequestHandler[nl02.opendev.org-PoolWorker.airship-kna1-airship-0ed54e659f504a039fc8669bf56599bc]: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Declining node request because node type(s) [centos-8-stream] not available | 13:21 |
fungi | yet all the other poolworkers kept yielding to it after that anyway | 13:21 |
fungi | then for some reason i can't see, at 2021-10-27 10:39:46,552 nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 accepts the node request | 13:26 |
fungi | note the uuid for the poolworkers differs from the one which had originally rejected it almost 12 hours earlier | 13:27 |
fungi | clarkb: corvus: once you're around, does that ^ make any sense to you? | 13:29 |
Clark[m] | fungi: they are two different nodepool pools with two different launchers | 13:30 |
Clark[m] | If I had to guess, the job was the child of the paused docker image build job, and since nodepool requires all of those jobs to run in the same cloud, it had to wait for free resources | 13:31 |
fungi | oh! indeed there's a paused tripleo-ci-centos-8-content-provider involved there | 13:32 |
Clark[m] | TripleO runs a number of those jobs and they are often multinode. It's possible it was just waiting for one job to finish after another to find room | 13:32 |
fungi | so basically if tripleo-ci-centos-8-content-provider lands on a low-capacity/high-launch-failure provider, it can take ages for the child jobs to get node assignments | 13:32 |
fungi | apparently 12 hours or more | 13:33 |
Clark[m] | Yes in part because each of the child jobs is like 3 nodes and 3 hours | 13:33 |
frickler | so "paused" means the nodes are still in use, right? | 13:34 |
Clark[m] | frickler: yup, a paused job is still running on its nodes. In this case to serve docker images | 13:34 |
frickler | then the situation is likely worsened when there are 10 jobs in gate all doing this. maybe decreasing the depth of the tripleo queue could help | 13:35 |
fungi | looking at https://grafana.opendev.org/d/QQzTp6EGz/nodepool-airship-citycloud we're operating well below quota there, and probably have a number of leaked deleting nodes | 13:36 |
fungi | i wonder if a flexible/proportional nodeset maximum per provider would help. like if we could avoid trying to fulfill three-node jobs on a provider with only 25 nodes capacity, forcing those to wait for higher-capacity providers | 13:38 |
fungi | probably would also make sense for the maximum to be compared against the paused parent and its largest nodeset child added together | 13:38 |
fungi | the math there gets hairy quickly though | 13:38 |
fungi | but basically these tripleo jobs are pathologically engineered to monopolize a 25-node provider rather easily if they land on it | 13:39 |
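To make the idea concrete, a toy illustration of the proportional check fungi is describing; this is not nodepool code, and the 25% fraction is an arbitrary assumption:

```python
# Toy illustration of the "proportional nodeset maximum" idea above, not
# actual nodepool logic. The fraction is an arbitrary assumption.
def pool_should_accept(nodes_requested: int, max_servers: int,
                       fraction: float = 0.25) -> bool:
    """Decline requests whose nodeset would eat too much of a small pool."""
    return nodes_requested <= max(1, int(max_servers * fraction))


# A 3-node tripleo job against a 25-node provider fits under the cap...
print(pool_should_accept(3, 25))   # True
# ...but against a 10-node pool it would be left for bigger providers.
print(pool_should_accept(3, 10))   # False
```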
yuriys1 | I also noticed tripleo jobs seem to run hot and long. Is there a way to link/track the uuid of a node on a provider to a job/build? | 13:42 |
fungi | as for the steady deleting count, they're not leaked, all have status times <5min | 13:44 |
yuriys1 | Clark[m]: fungi: do you guys think we can meet sometime today; short meeting ~15m tops | 13:45 |
fungi | yuriys1: you might be able to query for the node id or ip address on http://logstash.openstack.org/ | 13:45 |
fungi | yuriys1: i should have time for a quick discussion, my schedule is mostly open for the day starting in a couple hours | 13:47 |
yuriys1 | yeah ive been playing with that a lot fungi, i had two thoughts | 13:47 |
fungi | nevermind what i said about the deleting nodes in citycloud not being stuck, it seems the delete worker resets the status time on those after each attempt | 13:50 |
yuriys1 | If this page (i think) https://opendev.org/zuul/zuul/src/branch/master/web/src/pages/Builds.jsx could also show a provider field, you'd have a pretty good filter system to see a typical runtime of a job linked to a provider | 13:50 |
yuriys1 | i've been using this to check the average time of successful jobs to get a bird's eye view | 13:50 |
yuriys1 | https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-8-containers-multinode-victoria | 13:50 |
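The check yuriys1 describes can be approximated against Zuul's public builds API; a small sketch, treating the field names as a best guess against the current API rather than a stable contract:

```python
# Hedged sketch of the "average time of successful jobs" check described
# above, using Zuul's public builds endpoint.
import statistics

import requests

URL = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {
    "job_name": "tripleo-ci-centos-8-containers-multinode-victoria",
    "result": "SUCCESS",
    "limit": 50,
}
builds = requests.get(URL, params=params, timeout=30).json()
durations = [b["duration"] for b in builds if b.get("duration")]
if durations:
    print(f"{len(durations)} successful builds, "
          f"mean {statistics.mean(durations) / 60:.1f} minutes")
else:
    print("no successful builds found")
```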
Clark[m] | I should have time between about 11:30am and 1:30pm eastern. Otherwise I've got a bunch of errands I'm trying to get done today | 13:50 |
yuriys1 | something like that | 13:50 |
yuriys1 | nice my schedule is pretty open today so 11:30am+ EST works | 13:51 |
fungi | okay, so the node deletion exceptions aren't terribly helpful. we raise this in waitForNodeCleanup... nodepool.exceptions.ServerDeleteException: server 46e376c8-dc34-4e35-9d60-cb9340744387 deletion | 13:54 |
fungi | i'll see if i can get more from the api | 13:55 |
fungi | aha, there are a bunch of nodes in "error" state in our citycloud tenant | 13:56 |
fungi | server show lists a fault of "NeutronClientException" | 13:57 |
*** kopecmartin is now known as kopecmartin|pto | 14:00 | |
fungi | 5 nodes like that, all with a NeutronClientException fault timestamped between 2021-10-26T14:39:59Z and 2021-10-26T14:43:57Z so presumably they had some neutron api issue around then and nova couldn't cope | 14:00 |
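For anyone following along, roughly how such error-state servers can be enumerated with openstacksdk; the cloud name is a placeholder, and this is only an approximation of the checks fungi ran:

```python
# Hedged sketch: list error-state servers and their faults via openstacksdk.
# The cloud name "citycloud" is a placeholder for the real clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="citycloud")
for server in conn.compute.servers(status="ERROR"):
    # fault is only populated on failed servers, so guard against None
    fault = (server.fault or {}).get("message", "unknown")
    print(server.id, server.name, fault)
```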
fungi | 4 are standard nodes and 1 is an "expanded" node, so in aggregate they're in excess of 20% of the quota there | 14:01 |
fungi | oh, in fact we have max-servers set to 16 there, so more like a third of our capped capacity | 14:03 |
fungi | er, sorry, two pools which together total 26 | 14:04 |
fungi | 40% of the standard pool and 6% of the specialty pool | 14:05 |
fungi | no wonder tripleo jobs were struggling to get node requests fulfilled there, we had a capacity of 6 standard nodes in that pool when you subtract the 4 stuck error nodes | 14:05 |
fungi | so with one node (?) taken by the paused registry server job, it could only run a single 3-node job at a time. if 3 or more registry jobs somehow got assigned there at once, then we'd be in a deadlock unable to run any of their child jobs | 14:07 |
fungi | this 10-node pool for standard labels in citycloud is starting to seem more and more like a liability the closer i look | 14:08 |
Clark[m] | When we created it, the idea was to have a place to regularly run jobs to ensure it was generally working, and to avoid the once-a-week job suddenly not working due to a sad mirror or similar being overlooked for a long period of time | 14:11 |
fungi | for single-node jobs and the occasional two-node, max-servers 10 seems fine. when tripleo is running a bunch of three-node jobs with parent jobs which need to be colocated together, not so much. worse when you're down 40% of the capacity there because of provider-side errors | 14:13 |
*** ysandeep is now known as ysandeep|out | 14:17 | |
Clark[m] | Yup, making the cloud more reliable would go a long way here. And probably smarter scheduling from nodepool and zuul could help as you mentioned previously | 14:24 |
Clark[m] | Unfortunately I think that cloud is the only one booting fedora right now too | 14:26 |
Clark[m] | Though inmotion may be capable too now that it is back | 14:27 |
* fungi checks | 14:35 | |
fungi | right now the only fedora-34 node is there but it's in a deleting state, i'll need to check the logs to see if it was reachable/used | 14:36 |
fungi | yep, it ran this: https://zuul.opendev.org/t/openstack/build/d46680424cc540569479bdc78cc1bdc4 | 14:40 |
fungi | those tobiko jobs are hella broken though, so i don't think the provider had anything to do with the build timeout there | 14:40 |
yuriys1 | I took a look as well, is tobiko an alternative to tempest? weird for it to be so responsive up until that timeout | 14:52 |
yuriys1 | Also based on this: https://zuul.opendev.org/t/openstack/builds?job_name=devstack-tobiko-fedora .... they hit some timeout... a lot | 14:52 |
yuriys1 | not a success in sight | 14:52 |
fungi | right, as i said, their testing seems to be mostly broken in recent weeks | 14:58 |
*** ykarel is now known as ykarel|away | 15:00 | |
*** pojadhav|ruck is now known as pojadhav|out | 15:23 | |
clarkb | yuriys1: fungi: now for the next ~ 2 hours is good for me if you still want to chat | 15:41 |
yuriys1 | yep! | 15:41 |
fungi | i'm free | 15:43 |
yuriys1 | https://meetpad.opendev.org/imhmeet :) | 15:44 |
hashar | fungi: clarkb: thank you for the gear 0.16.0 release :] | 15:56 |
*** marios is now known as marios|out | 15:56 | |
fungi | hashar: you're welcome, sorry we took so long, we didn't want to destabilize zuul in the midst of its work to move off gearman | 16:01 |
hashar | fungi: well understood don't worry. I am happy you ended up being able to safely cut a new one ;) | 16:07 |
fungi | clarkb: yuriys1: that i915 on-battery lockup just hit me, sorry | 16:26 |
fungi | hit me like 3x in rapid succession so i wonder if it's related to battery level/voltage dropouts or something. seems to be stable now that i've plugged my charger in | 16:38 |
*** jpena is now known as jpena|off | 17:04 | |
opendevreview | Merged openstack/project-config master: Replace old Xena cycle signing key with Yoga https://review.opendev.org/c/openstack/project-config/+/815548 | 17:33 |
ianw | fedora 35 seems like its release is imminent. we've never used a pre-release fedora before, but in this case i wonder if that's a better idea than pushing further on f34 | 21:01 |
clarkb | ianw: they certainly seem to have decided that releasing f35 is more important than fixing all the f34 users they broke :/ | 21:02 |
ianw | i couldn't (easily) get a dib image with the changes yesterday due to the upstream mirror issues, which i think have abated | 21:02 |
fungi | i'm in favor of falling forward | 21:02 |
fungi | you risk injury either way, but at least one gets you farther | 21:02 |
ianw | i'm not sure if just setting DIB_DISTRO=35 will work with a pre-release ... only one way to find out i guess | 21:02 |
clarkb | ianw: that's the best thing about having a great CI system | 21:03 |
clarkb | we can have a computer answer those questions for us if we just ask nicely | 21:03 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 21:07 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 21:22 |
ianw | clarkb: not sure if you were following all the mina sshd key things, but it looks like 2.7.1 should be fixed soon. just noticed luca asking about it targeting gerrit v3.5 | 21:24 |
ianw | s/fixed/released with fixes/ | 21:24 |
Clark[m] | Yup I just got the email and am on the Mina bug subscribe list as well as gerrit's | 21:25 |
ianw | cool; well it will be a good carrot for a 3.5 upgrade at the right time | 21:25 |
Clark[m] | I have a change up to do Mina 2.7.0 against Gerrit 3.3 in system config but maybe I should see about doing it against master and pushing it upstream | 21:25 |
Clark[m] | Enotime | 21:26 |
Clark[m] | Looks like they updated 3.5 to 2.6.0. I really thought I checked that and it was still 2.4.0 a week ago. Maybe they recently changed that | 21:34 |
corvus | i'd like to launch zuul01.opendev.org as a 16GB vm | 21:53 |
corvus | that's half the size of zuul02. it looks like we're using a good 5-6GB of ram on zuul02 right now. we could probably do an 8GB but i don't want to push it too close yet. | 21:53 |
corvus | i've looked over the current ansible, and i don't think any changes are needed in order to launch it. just adding it to inventory afterwards. | 21:54 |
corvus | if there are no objections, i'll do that shortly. i'd like to have it on hand for some experimentation once we finish merging the current stack. | 21:55 |
clarkb | corvus: that sounds about right. The inventory addition is what adds it to firewall rules and then starts services | 21:55 |
corvus | hrm, i might make a change to add a zuulschedulerstart:false hostvar for the new host | 21:58 |
corvus | oh wait that shouldn't be necessary, i think that's the default | 21:58 |
clarkb | ya it could be. I can never remember and typically just double check before adding to inventory | 21:58 |
corvus | so yeah, it shouldn't actually start the service... but our zuul_restart playbooks might hit it, i'll check that. | 21:58 |
opendevreview | James E. Blair proposed opendev/system-config master: Limit zuul stop/start playbooks to zuul02 https://review.opendev.org/c/opendev/system-config/+/815759 | 22:01 |
corvus | okay, i think that's the only thing we need to do to make the system safer for zuul01 | 22:01 |
clarkb | corvus: and infra-root https://review.opendev.org/c/opendev/system-config/+/791832 is a change I've had up forever that might be good to land before launching a new server | 22:03 |
clarkb | or at least hand patch into your local copy of the launch script | 22:03 |
clarkb | I don't think it is critical, but the fs tools complain about our current alignment | 22:03 |
corvus | +2 | 22:04 |
*** dviroel|rover is now known as dviroel|rover|afk | 22:08 | |
fungi | lookin' | 22:08 |
ianw | i thought we really only used the swapfile method | 22:11 |
fungi | depends on where we boot, i thought | 22:12 |
Clark[m] | Yes, rackspace boots use the ephemeral drive to carve out a proper partition. Elsewhere we don't get extra devices to repartition, so we do the swapfile | 22:18 |
opendevreview | Merged opendev/system-config master: Better swap alignment https://review.opendev.org/c/opendev/system-config/+/791832 | 22:20 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing f34 with bullseye podman https://review.opendev.org/c/openstack/diskimage-builder/+/815763 | 22:49 |
opendevreview | Merged opendev/system-config master: Limit zuul stop/start playbooks to zuul02 https://review.opendev.org/c/opendev/system-config/+/815759 | 22:51 |