Wednesday, 2021-10-27

opendevreviewIan Wienand proposed openstack/diskimage-builder master: tests: remove debootstrap install
ianwfrickler: just merged to remove the /#/ from dashboard urls in the docs, so i guess that is the ultimate solution for the problem you found :)00:05
opendevreviewIan Wienand proposed openstack/diskimage-builder master: dracut-regenerate: drop Python 2 packages
opendevreviewIan Wienand proposed openstack/diskimage-builder master: fedora-container: regenerate initramfs for F34
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34
*** pojadhav|out is now known as pojadhav|ruck02:57
*** ysandeep|out is now known as ysandeep04:16
*** ykarel|away is now known as ykarel05:04
*** ysandeep is now known as ysandeep|brb05:52
*** ysandeep|brb is now known as ysandeep06:41
opendevreviewAlfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB
opendevreviewdaniel.pawlik proposed openstack/project-config master: Setup zuul jobs for openstack/ci-log-processing project
zigoclarkb: FYI, simplejson was uploaded to unstable today with python3-setuptools as build-depends. Likely, it will be in Ubuntu 22.04.07:54
ysandeep#opendev: in gate queue is awaiting for node since long, looks like its stuck.. Could someone please take a look.08:00
ysandeepOne other jobs in check as well looks stuck: 08:04
ysandeepShould we abandon/ restore?08:05
*** ysandeep is now known as ysandeep|lunch08:08
*** dpawlik3 is now known as dpawlik08:34
opendevreviewMerged opendev/system-config master: Update artifact signing key management process
*** ykarel is now known as ykarel|lunch09:05
fzzf[m]Hi. I clone http:ci-sandbox. then git-review -s. After input username. this occur .  thanks in advance. I plan to use ci-sandbox to test a external CI that installed by SF . but I'm new in use gerrit. What steps should I follow, Any help would be appreciated.09:11
fzzf[m]I have configure SF with gerrit_connections. and Add ssh keys on gerrit09:12
*** ysandeep|lunch is now known as ysandeep09:19
ysandeep#opendev we abandon/restored 815472 to clear gate but is still stuck if you want to investigate.09:25
*** ykarel|lunch is now known as ykarel10:19
*** cloudnull5 is now known as cloudnull10:38
*** marios is now known as marios|afk10:47
zigoWhat's James Blair nick on IRC ?10:50
odyssey4mezigo corvus10:59
zigocorvus: Sphinx 4.0 removed the PyModulelevel class, do you think that here: it's fine to replace this by PyMethod instead if running with Sphinx 4.2 ? I'm asking because "git blame" tells me you're the author of that code ... :)11:01
*** ysandeep is now known as ysandeep|afk11:13
*** dviroel|rover|afk is now known as dviroel|rover11:15
*** marios|afk is now known as marios11:21
fungizigo: we haven't used jjb in years, but its readme suggests they still hang out in #openstack-jjb (now on oftc), and have a mailing list here:
fungibasically, maintenance of the tool was handed over to some of its remaining users11:40
opendevreviewAlfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB
fungiysandeep|afk: i see one stuck in check, not gate. i have an early appointment to get to, but i can take a look once i get back if it's still there11:45
zigofungi: I know yeah, but still asking James if he knows ! :)11:50
zigofungi: FYI, Sphinx 4.x has caused *many* failed to build in my packages, it's a real pain.11:51
zigoUpstream seems to be quite careless about backward compat, claiming they need to move forward and therefore cannot care...11:51
zigoWhich is kind of fine, if at least it was correctly documented (which isn't the case).11:52
zigoI've just opened a bug about it asking them to do a better job.11:52
zigo(kindly asking)11:52
*** ysandeep|afk is now known as ysandeep11:54
ysandeepfungi: thanks! 11:54
*** ykarel_ is now known as ykarel12:01
fungizigo: makes sense, and yeah it might unfortunately be one of those situations where sphinx 3 and 4 need separate packages for a while :(12:01
fungizigo: looking at some other sphinx 4 conversions, i see PyFunction being used as a replacement12:05
fungiwhich, given the name of the subclass, seems like it would probably be a better fit12:05
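A minimal sketch of the compatibility approach discussed above: pick whichever base class the installed Sphinx still provides. The helper name first_available is hypothetical, and whether PyFunction is a drop-in replacement for the jjb code in question is an assumption the log does not confirm.

```python
import importlib


def first_available(module_name, candidates):
    """Return the first attribute named in `candidates` that the module provides."""
    module = importlib.import_module(module_name)
    for name in candidates:
        if hasattr(module, name):
            return getattr(module, name)
    raise AttributeError(f"{module_name} provides none of {candidates}")


# Hypothetical usage (requires Sphinx to be installed): prefer PyFunction,
# and fall back to PyModulelevel only on releases that predate it.
# base_directive = first_available(
#     "sphinx.domains.python", ["PyFunction", "PyModulelevel"]
# )
```

Looking the class up by name keeps a single code path for old and new Sphinx releases, at the cost of deferring a missing-class error to import time of the extension.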
fungiokay, so for the stuck 813930,1 in the openstack tenant's check pipeline, it looks like there are two builds which have neither completed nor timed out. i'll try to see if there's a lost completion event for those or something12:57
fungiactually i don't see evidence it's stuck13:00
fungii think those two builds may have just taken ~14 hours to get nodes assigned13:01
fungi2021-10-27 11:43:47,609 INFO zuul.Pipeline.openstack.check: [e: dc7497cbf976446bbbc46d849296dd8a] Completed node request <NodeRequest 299-0015899688 ['centos-8-stream']> for job tripleo-ci-centos-8-standalone of item <QueueItem 0654f7b652724616adb4e8ea143b95c4 for <Change 0x7fe5c2b8a040 openstack/python-tripleoclient 813930,1> in check> with nodes ['0027115320']13:01
*** jpena|off is now known as jpena13:04
fungigrafana says we've not been under any node pressure though13:04
ysandeepnew tripleo patches were getting nodes even though 813930 was waiting for nodes.13:06
fungiyeah, i agree that's a bit odd13:07
fungi2021-10-26 22:46:43,028 INFO zuul.nodepool: [e: dc7497cbf976446bbbc46d849296dd8a] Submitted node request <NodeRequest 299-0015899688 ['centos-8-stream']>13:10
fungiit finally reached priority 0 at 2021-10-27 01:48:27,106 (three hours later)13:11
fungibut took a further 10 hours to get fulfilled13:12
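The waits described above can be recomputed from the three timestamps quoted in the log lines. This is only a sketch: the event labels are made up for illustration, and the timestamp format is assumed to be the one shown in the pasted scheduler output.

```python
from datetime import datetime, timedelta

# Timestamps copied from the log lines quoted in this discussion.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
EVENTS = {
    "submitted": "2021-10-26 22:46:43,028",
    "priority_0": "2021-10-27 01:48:27,106",
    "completed": "2021-10-27 11:43:47,609",
}


def parse(stamp):
    """Parse a 'YYYY-MM-DD HH:MM:SS,mmm' log timestamp."""
    return datetime.strptime(stamp, LOG_FORMAT)


def waits(events):
    """Return the elapsed time between each pair of consecutive events."""
    names = list(events)
    result = {}
    for earlier, later in zip(names, names[1:]):
        result[f"{earlier} -> {later}"] = parse(events[later]) - parse(events[earlier])
    return result


def total_wait(events):
    """Sum the individual waits into the overall time to fulfilment."""
    return sum(waits(events).values(), timedelta())


if __name__ == "__main__":
    for label, delta in waits(EVENTS).items():
        print(label, delta)
    print("total", total_wait(EVENTS))
```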
fungithat's as much as the scheduler knows. i'll see if the launchers can give me a clearer picture of what was going on from their end13:13
fungilooks like the request was originally picked up by
ysandeepack thanks!13:19
fungi2021-10-26 22:46:50,079 DEBUG nodepool.PoolWorker.airship-kna1-airship: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Locking request13:20
fungioh this is interesting, almost immediately it did...13:21
fungi2021-10-26 22:46:50,104 INFO nodepool.driver.NodeRequestHandler[]: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Declining node request because node type(s) [centos-8-stream] not available13:21
fungiyet all the other poolworkers kept yielding to it after that anyway13:21
fungithen for some reason i can't see, at 2021-10-27 10:39:46,552 accepts the node request13:26
funginote the uuid for the poolworkers differs from the one which had originally rejected it almost 12 hours earlier13:27
fungiclarkb: corvus: once you're around, does that ^ make any sense to you?13:29
Clark[m]fungi: they are two different nodepool pools with two different launchers13:30
Clark[m]If I had to guess the job was the child of the paused docker image build job and since nodepool requires all of those jobs run in the same cloud it had to wait for free resources13:31
fungioh! indeed there's a paused tripleo-ci-centos-8-content-provider involved there13:32
Clark[m]TripleO runs a number of those jobs and they are often multinode. It's possible it was just waiting for one job to finish after another to find room13:32
fungiso basically if tripleo-ci-centos-8-content-provider lands on a low-capacity/high-launch-failure provider, it can take ages for the child jobs to get node assignments13:32
fungiapparently 12 hours or more13:33
Clark[m]Yes in part because each of the child jobs is like 3 nodes and 3 hours13:33
fricklerso "paused" means the nodes are still in use, right?13:34
Clark[m]frickler: yup a paused job is still running on its nodes. In this case to serve docker images13:34
fricklerthen the situation is likely worsened when there are 10 jobs in gate all doing this. maybe decreasing the depth of the tripleo queue could help13:35
fungilooking at we're operating well below quota there, and probably have a number of leaked deleting nodes13:36
fungii wonder if a flexible/proportional nodeset maximum per provider would help. like if we could avoid trying to fulfill three-node jobs on a provider with only 25 nodes capacity, forcing those to wait for higher-capacity providers13:38
fungiprobably would also make sense for the maximum to be compared against the paused parent and its largest nodeset child added together13:38
fungithe math there gets hairy quickly though13:38
fungibut basically these tripleo jobs are pathologically engineered to monopolize a 25-node provider rather easily if they land on it13:39
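The proportional-maximum idea floated above could look roughly like the following. This is not an existing nodepool or zuul feature; the function names, the percent knob, and its default of 10 are all hypothetical, chosen so that a 25-node provider would turn away a paused parent plus a three-node child.

```python
def max_nodeset(capacity, percent):
    """Largest combined nodeset a provider may take, as a share of its capacity."""
    return capacity * percent // 100


def provider_may_accept(capacity, parent_nodes, child_nodesets, percent=10):
    """Compare the paused parent plus its largest child against the provider cap."""
    largest_child = max(child_nodesets, default=0)
    return parent_nodes + largest_child <= max_nodeset(capacity, percent)


if __name__ == "__main__":
    # One-node paused parent with three-node children, as in the tripleo case.
    print(provider_may_accept(25, 1, [3, 3]))
    print(provider_may_accept(100, 1, [3, 3]))
```

Integer arithmetic is used for the cap so the result does not depend on floating-point rounding. A real implementation would also need to account for nodes already in use, which is where the math gets hairy.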
yuriys1I also noticed tripleo jobs seem to run hot and long. is there a way to link/track uuid of node on a provider, to a job/build13:42
fungias for the steady deleting count, they're not leaked, all have status times <5min13:44
yuriys1Clark[m]: fungi: do you guys think we can meet sometime today; short meeting ~15m tops13:45
fungiyuriys1: you might be able to query for the node id or ip address on
fungiyuriys1: i should have time for a quick discussion, my schedule is mostly open for the day starting in a couple hours13:47
yuriys1yeah ive been playing with that a lot fungi, i had two thoughts13:47
funginevermind what i said about the deleting nodes in citycloud not being stuck, it seems the delete worker resets the status time on those after each attempt13:50
yuriys1If this page (i think) ; could also show provider field, you'd have a pretty good filter system to see a typical runtime of a job linked to a provider13:50
yuriys1ive been using this to check average time of success jobs to get a birds eye view13:50
Clark[m]I should have time between about 11:30am and 1:30pm eastern. Otherwise I've got a bunch of errands I'm trying to get done today13:50
yuriys1something like that13:50
yuriys1nice my schedule is pretty open today so 11:30am+ EST works13:51
fungiokay, so the node deletion exceptions aren't terribly helpful. we raise this in waitForNodeCleanup... nodepool.exceptions.ServerDeleteException: server 46e376c8-dc34-4e35-9d60-cb9340744387 deletion13:54
fungii'll see if i can get more from the api13:55
fungiaha, there are a bunch of nodes in "error" state in our citycloud tenant13:56
fungiserver show lists a fault of "NeutronClientException"13:57
*** kopecmartin is now known as kopecmartin|pto14:00
fungi5 nodes like that, all with a NeutronClientException fault timestamped between 2021-10-26T14:39:59Z and 2021-10-26T14:43:57Z so presumably they had some neutron api issue around then and nova couldn't cope14:00
fungi4 are standard nodes and 1 is an "expanded" node, so in aggregate they're in excess of 20% of the quota there14:01
fungioh, in fact we have max-servers set to 16 there, so more like a third of our capped capacity14:03
fungier, sorry, two pools which together total 2614:04
fungi40% of the standard pool and 6% of the specialty pool14:05
fungino wonder tripleo jobs were struggling to get node requests fulfilled there, we had a capacity of 6 standard nodes in that pool when you subtract the 4 stuck error nodes14:05
fungiso with one node (?) taken by the paused registry server job, it could only run a single 3-node job at a time. if 3 or more registry jobs somehow got assigned there at once, then we'd be in a deadlock unable to run any of their child jobs14:07
fungithis 10-node pool for standard labels in citycloud is starting to seem more and more like a liability the closer i look14:08
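The capacity arithmetic above can be written out as a small sketch. The pool sizes (10 standard, 16 specialty) and stuck-node counts (4 and 1) come from the discussion; the one-node parent and three-node child sizes are the assumptions stated there, and the function names are hypothetical.

```python
def stuck_fraction(stuck, max_servers):
    """Share of a pool's max-servers consumed by nodes stuck in error."""
    return stuck / max_servers


def child_job_slots(max_servers, stuck, paused_parents, parent_nodes=1, child_nodes=3):
    """How many child jobs fit at once after stuck nodes and paused parents."""
    free = max_servers - stuck - paused_parents * parent_nodes
    return max(free, 0) // child_nodes


if __name__ == "__main__":
    print(f"standard pool stuck: {stuck_fraction(4, 10):.0%}")
    print(f"specialty pool stuck: {stuck_fraction(1, 16):.0%}")
    for parents in range(1, 5):
        print(parents, "paused parent(s):", child_job_slots(10, 4, parents), "child slot(s)")
```

With these assumed sizes and nothing else running, the pool reaches zero child slots at four paused parents; any other jobs holding nodes there would bring that point sooner.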
Clark[m]When we created it the idea was to have a place to regularly run jobs to ensure it was generally working and avoid the once a week job suddenly not working due to a sad mirror or similar being overlooked for long period of time14:11
fungifor single-node jobs and the occasional two-node, max-servers 10 seems fine. when tripleo is running a bunch of three-node jobs with parent jobs which need to be colocated together, not so much. worse when you're down 40% of the capacity there because of provider-side errors14:13
*** ysandeep is now known as ysandeep|out14:17
Clark[m]Yup, making the cloud more reliable would go a long way here. And probably smarter scheduling from nodepool and zuul could help as you mentioned previously14:24
Clark[m]Unfortunately I think that cloud is the only one booting fedora right now too14:26
Clark[m]Though inmotion may be capable too now that it is back14:27
* fungi checks14:35
fungiright now the only fedora-34 node is there but it's in a deleting state, i'll need to check the logs to see if it was reachable/used14:36
fungiyep, it ran this:
fungithose tobiko jobs are hella broken though, so i don't think the provider had anything to do with the build timeout there14:40
yuriys1I took a look as well , is tobiko an alternative to tempest? weird for it to be so responsive up until that timeout14:52
yuriys1Also based on this: .... they hit some timeout... a lot14:52
yuriys1not a success in sight14:52
fungiright, as i said, their testing seems to be mostly broken in recent weeks14:58
*** ykarel is now known as ykarel|away15:00
*** pojadhav|ruck is now known as pojadhav|out15:23
clarkbyuriys1: fungi: now for the next ~ 2 hours is good for me if you still want to chat15:41
fungii'm free15:43
yuriys1 :)15:44
hasharfungi: clarkb: thank you for the gear 0.16.0 release :]15:56
*** marios is now known as marios|out15:56
fungihashar: you're welcome, sorry we took so long, we didn't want to destabilize zuul in the midst of its work to move off gearman16:01
hasharfungi: well understood don't worry. I am happy you ended up being able to safely cut a new one ;)16:07
fungiclarkb: yuriys1: that i915 on-battery lockup just hit me, sorry16:26
fungihit me like 3x in rapid succession so i wonder if it's related to battery level/voltage dropouts or something. seems to be stable now that i've plugged my charger in16:38
*** jpena is now known as jpena|off17:04
opendevreviewMerged openstack/project-config master: Replace old Xena cycle signing key with Yoga
ianwfedora 35 seems like its release is imminent.  we've never used a pre-release fedora before but in this case, i wonder if that's a better idea than pushing further on f3421:01
clarkbianw: they certainly seem to have decided that releasing f35 is more important than fixing all the f34 users they broke :/21:02
ianwi couldn't (easily) get a dib image with the changes yesterday due to the upstream mirror issues, which i think have abated21:02
fungii'm in favor of falling forward21:02
fungiyou risk injury either way, but at least one gets you farther21:02
ianwi'm not sure if just setting DIB_DISTRO=35 will work with a pre-release ... only one way to find out i guess21:02
clarkbianw: thats the best thing about having a great CI system21:03
clarkbwe can have a computer answer those questions for us if we just ask nicely21:03
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35
ianwclarkb: not sure if you were following all the mina sshd key things, but looks like 2.7.1 should be fixed soon.  just noticed luca asking about it targeting gerrit v3.521:24
ianws/fixed/released with fixes/21:24
Clark[m]Yup I just got the email and am on the Mina bug subscribe list as well as gerrit's21:25
ianwcool; well it will be a good carrot for a 3.5 upgrade at the right time21:25
Clark[m]I have a change up to do Mina 2.7.0 against Gerrit 3.3 in system config but maybe I should see about doing it against master and pushing it upstream21:25
Clark[m]Looks like they updated 3.5 to 2.6.0 I really thought I checked that and it was still 2.4.0 a week ago. Maybe they recently changed that21:34
corvusi'd like to launch as a 16GB vm21:53
corvusthat's half the size of zuul02.  it looks like we're using a good 5-6GB of ram on zuul02 right now.  we could probably do an 8GB but i don't want to push it too close yet.21:53
corvusi've looked over the current ansible, and i don't think any changes are needed in order to launch it.  just adding it to inventory afterwards.21:54
corvusif there are no objections, i'll do that shortly.  i'd like to have it on hand for some experimentation once we finish merging the current stack.21:55
clarkbcorvus: that sounds about right. The inventory addition is what adds it to firewall rules and then starts services21:55
corvushrm, i might make a change to add a zuulschedulerstart:false hostvar for the new host21:58
corvusoh wait that shouldn't be necessary, i think that's the default21:58
clarkbya it could be. I can never remember and typically just double check before adding to inventory21:58
corvusso yeah, it shouldn't actually start the service... but our zuul_restart playbooks might hit it, i'll check that.21:58
opendevreviewJames E. Blair proposed opendev/system-config master: Limit zuul stop/start playbooks to zuul02
corvusokay, i think that's the only thing we need to do to make the system safer for zuul0122:01
clarkbcorvus: and infra-root is a change I've had up forever that might be good to land before launching a new server22:03
clarkbor at least hand patch into your local copy of the launch script22:03
clarkbI don't think it is critical, but the fs tools complain about our current alignment22:03
*** dviroel|rover is now known as dviroel|rover|afk22:08
ianwi thought we really only used the swapfile method22:11
fungidepends on where we boot, i thought22:12
Clark[m]Yes rackspace boots use the ephemeral drive to carve out a proper partition. Elsewhere we don't get extra devices to repartition and do the swapfile22:18
opendevreviewMerged opendev/system-config master: Better swap alignment
opendevreviewIan Wienand proposed openstack/diskimage-builder master: [dnm] testing f34 with bullseye podman
opendevreviewMerged opendev/system-config master: Limit zuul stop/start playbooks to zuul02
