opendevreview | Ian Wienand proposed openstack/diskimage-builder master: tests: remove debootstrap install https://review.opendev.org/c/openstack/diskimage-builder/+/815571 | 00:02 |
---|---|---|
ianw | frickler: https://gerrit-review.googlesource.com/c/gerrit/+/321535 just merged to remove the /#/ from dashboard urls in the docs, so i guess that is the ultimate solution for the problem you found :) | 00:05 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: dracut-regenerate: drop Python 2 packages https://review.opendev.org/c/openstack/diskimage-builder/+/815409 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: fedora-container: regenerate initramfs for F34 https://review.opendev.org/c/openstack/diskimage-builder/+/815385 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 00:13 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] debug fail for f34 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 00:44 |
*** pojadhav|out is now known as pojadhav|ruck | 02:57 | |
*** ysandeep|out is now known as ysandeep | 04:16 | |
*** ykarel|away is now known as ykarel | 05:04 | |
*** ysandeep is now known as ysandeep|brb | 05:52 | |
*** ysandeep|brb is now known as ysandeep | 06:41 | |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 07:18 |
opendevreview | daniel.pawlik proposed openstack/project-config master: Setup zuul jobs for openstack/ci-log-processing project https://review.opendev.org/c/openstack/project-config/+/815024 | 07:37 |
zigo | clarkb: FYI, simplejson was uploaded to unstable today with python3-setuptools as build-depends. Likely, it will be in Ubuntu 22.04. | 07:54 |
ysandeep | #opendev: https://zuul.opendev.org/t/openstack/status#815472 in the gate queue has been waiting for a node for a long time and looks stuck. Could someone please take a look? | 08:00 |
ysandeep | One other job in check also looks stuck: https://zuul.opendev.org/t/openstack/status#813930 | 08:04 |
ysandeep | Should we abandon/restore? | 08:05 |
*** ysandeep is now known as ysandeep|lunch | 08:08 | |
*** dpawlik3 is now known as dpawlik | 08:34 | |
opendevreview | Merged opendev/system-config master: Update artifact signing key management process https://review.opendev.org/c/opendev/system-config/+/815547 | 08:48 |
*** ykarel is now known as ykarel|lunch | 09:05 | |
fzzf[m] | Hi. I cloned ci-sandbox over http, then ran git-review -s. After entering my username, this error occurred: https://paste.opendev.org/show/810236/ . Thanks in advance. I plan to use ci-sandbox to test an external CI installed by SF, but I'm new to using Gerrit. What steps should I follow? Any help would be appreciated. | 09:11 |
fzzf[m] | I have configured SF with gerrit_connections and added SSH keys on Gerrit. | 09:12 |
*** ysandeep|lunch is now known as ysandeep | 09:19 | |
ysandeep | #opendev we abandoned/restored 815472 to clear the gate, but https://zuul.opendev.org/t/openstack/status#813930 is still stuck if you want to investigate. | 09:25 |
*** ykarel|lunch is now known as ykarel | 10:19 | |
*** cloudnull5 is now known as cloudnull | 10:38 | |
*** marios is now known as marios|afk | 10:47 | |
zigo | What's James Blair's nick on IRC? | 10:50 |
odyssey4me | zigo corvus | 10:59 |
zigo | Thanks. | 10:59 |
zigo | corvus: Sphinx 4.0 removed the PyModulelevel class; do you think that here: https://opendev.org/jjb/jenkins-job-builder/src/branch/master/jenkins_jobs/sphinx/yaml.py it's fine to replace it with PyMethod instead when running with Sphinx 4.2? I'm asking because "git blame" tells me you're the author of that code ... :) | 11:01 |
zigo | https://review.opendev.org/c/jjb/jenkins-job-builder/+/815624 | 11:03 |
*** ysandeep is now known as ysandeep|afk | 11:13 | |
*** dviroel|rover|afk is now known as dviroel|rover | 11:15 | |
*** marios|afk is now known as marios | 11:21 | |
fungi | zigo: we haven't used jjb in years, but its readme suggests they still hang out in #openstack-jjb (now on oftc), and have a mailing list here: https://groups.google.com/g/jenkins-job-builder | 11:38 |
fungi | basically, maintenance of the tool was handed over to some of its remaining users | 11:40 |
opendevreview | Alfredo Moralejo proposed openstack/diskimage-builder master: Add support for CentOS Stream 9 in DIB https://review.opendev.org/c/openstack/diskimage-builder/+/811392 | 11:42 |
fungi | ysandeep|afk: i see one stuck in check, not gate. i have an early appointment to get to, but i can take a look once i get back if it's still there | 11:45 |
zigo | fungi: I know, yeah, but I'm still asking James in case he knows! :) | 11:50 |
zigo | fungi: FYI, Sphinx 4.x has caused *many* build failures in my packages; it's a real pain. | 11:51 |
zigo | Upstream seems to be quite careless about backward compat, claiming they need to move forward and therefore cannot care... | 11:51 |
zigo | Which would be kind of fine, if at least it were correctly documented (which isn't the case). | 11:52 |
zigo | I've just opened a bug about it asking them to do a better job. | 11:52 |
zigo | (kindly asking) | 11:52 |
*** ysandeep|afk is now known as ysandeep | 11:54 | |
ysandeep | fungi: thanks! | 11:54 |
*** ykarel_ is now known as ykarel | 12:01 | |
fungi | zigo: makes sense, and yeah it might unfortunately be one of those situations where sphinx 3 and 4 need separate packages for a while :( | 12:01 |
fungi | zigo: looking at some other sphinx 4 conversions, i see PyFunction being used as a replacement | 12:05 |
fungi | which, given the name of the subclass, seems like it would probably be a better fit | 12:05 |
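For reference, a minimal compatibility sketch of the swap being discussed; this is not the actual jenkins-job-builder patch, and the YAMLFunction name is invented purely for illustration:

```python
# Hedged sketch: pick whichever base class the installed Sphinx provides.
# PyModulelevel was removed in Sphinx 4.0; PyFunction is the replacement
# suggested above. The YAMLFunction directive name is made up here.
try:
    from sphinx.domains.python import PyFunction as _PyBase  # Sphinx >= 2.1
except ImportError:
    from sphinx.domains.python import PyModulelevel as _PyBase  # older Sphinx


class YAMLFunction(_PyBase):
    """Render a YAML macro entry using function-style signature handling."""
```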
fungi | okay, so for the stuck 813930,1 in the openstack tenant's check pipeline, it looks like there are two builds which have neither completed nor timed out. i'll try to see if there's a lost completion event for those or something | 12:57 |
fungi | actually i don't see evidence it's stuck | 13:00 |
fungi | i think those two builds may have just taken ~14 hours to get nodes assigned | 13:01 |
fungi | 2021-10-27 11:43:47,609 INFO zuul.Pipeline.openstack.check: [e: dc7497cbf976446bbbc46d849296dd8a] Completed node request <NodeRequest 299-0015899688 ['centos-8-stream']> for job tripleo-ci-centos-8-standalone of item <QueueItem 0654f7b652724616adb4e8ea143b95c4 for <Change 0x7fe5c2b8a040 openstack/python-tripleoclient 813930,1> in check> with nodes ['0027115320'] | 13:01 |
*** jpena|off is now known as jpena | 13:04 | |
fungi | grafana says we've not been under any node pressure though | 13:04 |
ysandeep | new tripleo patches were getting nodes even though 813930 was waiting for nodes. | 13:06 |
fungi | yeah, i agree that's a bit odd | 13:07 |
fungi | 2021-10-26 22:46:43,028 INFO zuul.nodepool: [e: dc7497cbf976446bbbc46d849296dd8a] Submitted node request <NodeRequest 299-0015899688 ['centos-8-stream']> | 13:10 |
fungi | it finally reached priority 0 at 2021-10-27 01:48:27,106 (three hours later) | 13:11 |
fungi | but took a further 10 hours to get fulfilled | 13:12 |
fungi | that's as much as the scheduler knows. i'll see if the launchers can give me a clearer picture of what was going on from their end | 13:13 |
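Roughly the kind of log arithmetic being done here; the scheduler log path and the timestamp slicing below are assumptions rather than anything taken from the quoted session:

```python
# Hedged sketch: estimate how long a node request waited by diffing the
# "Submitted" and "Completed" scheduler log lines quoted above. The log
# path and the 23-character timestamp prefix are assumptions.
from datetime import datetime

REQUEST_ID = "299-0015899688"
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

stamps = []
with open("/var/log/zuul/debug.log") as log:
    for line in log:
        if REQUEST_ID not in line:
            continue
        if "Submitted node request" in line or "Completed node request" in line:
            stamps.append(datetime.strptime(line[:23], TS_FORMAT))

if len(stamps) >= 2:
    print(f"request {REQUEST_ID} waited {stamps[-1] - stamps[0]}")
```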
fungi | looks like the request was originally picked up by nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 | 13:18 |
ysandeep | ack thanks! | 13:19 |
fungi | 2021-10-26 22:46:50,079 DEBUG nodepool.PoolWorker.airship-kna1-airship: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Locking request | 13:20 |
fungi | oh this is interesting, almost immediately it did... | 13:21 |
fungi | 2021-10-26 22:46:50,104 INFO nodepool.driver.NodeRequestHandler[nl02.opendev.org-PoolWorker.airship-kna1-airship-0ed54e659f504a039fc8669bf56599bc]: [e: dc7497cbf976446bbbc46d849296dd8a] [node_request: 299-0015899688] Declining node request because node type(s) [centos-8-stream] not available | 13:21 |
fungi | yet all the other poolworkers kept yielding to it after that anyway | 13:21 |
fungi | then for some reason i can't see, at 2021-10-27 10:39:46,552 nl02.opendev.org-PoolWorker.airship-kna1-main-43d321e2735c45e5a17fc8d90e8ac674 accepts the node request | 13:26 |
fungi | note the uuid for the poolworkers differs from the one which had originally rejected it almost 12 hours earlier | 13:27 |
fungi | clarkb: corvus: once you're around, does that ^ make any sense to you? | 13:29 |
Clark[m] | fungi: they are two different nodepool pools with two different launchers | 13:30 |
Clark[m] | If I had to guess, the job was the child of the paused docker image build job, and since nodepool requires all of those jobs to run in the same cloud, it had to wait for free resources | 13:31 |
fungi | oh! indeed there's a paused tripleo-ci-centos-8-content-provider involved there | 13:32 |
Clark[m] | TripleO runs a number of those jobs and they are often multinode. It's possible it was just waiting for one job to finish after another to find room | 13:32 |
fungi | so basically if tripleo-ci-centos-8-content-provider lands on a low-capacity/high-launch-failure provider, it can take ages for the child jobs to get node assignments | 13:32 |
fungi | apparently 12 hours or more | 13:33 |
Clark[m] | Yes in part because each of the child jobs is like 3 nodes and 3 hours | 13:33 |
frickler | so "paused" means the nodes are still in use, right? | 13:34 |
Clark[m] | frickler: yup, a paused job is still running on its nodes. In this case to serve docker images | 13:34 |
frickler | then the situation is likely worsened when there are 10 jobs in gate all doing this. maybe decreasing the depth of the tripleo queue could help | 13:35 |
fungi | looking at https://grafana.opendev.org/d/QQzTp6EGz/nodepool-airship-citycloud we're operating well below quota there, and probably have a number of leaked deleting nodes | 13:36 |
fungi | i wonder if a flexible/proportional nodeset maximum per provider would help. like if we could avoid trying to fulfill three-node jobs on a provider with only 25 nodes capacity, forcing those to wait for higher-capacity providers | 13:38 |
fungi | probably would also make sense for the maximum to be compared against the paused parent and its largest nodeset child added together | 13:38 |
fungi | the math there gets hairy quickly though | 13:38 |
fungi | but basically these tripleo jobs are pathologically engineered to monopolize a 25-node provider rather easily if they land on it | 13:39 |
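To make the idea concrete, a toy illustration of the proportional check fungi is describing; this is not nodepool code, and the 25% fraction is an arbitrary assumption:

```python
# Toy illustration of the "proportional nodeset maximum" idea above, not
# actual nodepool logic. The fraction is an arbitrary assumption.
def pool_should_accept(nodes_requested: int, max_servers: int,
                       fraction: float = 0.25) -> bool:
    """Decline requests whose nodeset would eat too much of a small pool."""
    return nodes_requested <= max(1, int(max_servers * fraction))


# A 3-node tripleo job against a 25-node provider fits under the cap...
print(pool_should_accept(3, 25))   # True
# ...but against a 10-node pool it would be left for bigger providers.
print(pool_should_accept(3, 10))   # False
```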
yuriys1 | I also noticed tripleo jobs seem to run hot and long. Is there a way to link/track the uuid of a node on a provider to a job/build? | 13:42 |
fungi | as for the steady deleting count, they're not leaked, all have status times <5min | 13:44 |
yuriys1 | Clark[m]: fungi: do you guys think we can meet sometime today; short meeting ~15m tops | 13:45 |
fungi | yuriys1: you might be able to query for the node id or ip address on http://logstash.openstack.org/ | 13:45 |
fungi | yuriys1: i should have time for a quick discussion, my schedule is mostly open for the day starting in a couple hours | 13:47 |
yuriys1 | yeah ive been playing with that a lot fungi, i had two thoughts | 13:47 |
fungi | nevermind what i said about the deleting nodes in citycloud not being stuck, it seems the delete worker resets the status time on those after each attempt | 13:50 |
yuriys1 | If this page (i think) https://opendev.org/zuul/zuul/src/branch/master/web/src/pages/Builds.jsx could also show a provider field, you'd have a pretty good filter system to see a typical runtime of a job linked to a provider | 13:50 |
yuriys1 | i've been using this to check the average time of successful jobs to get a bird's eye view | 13:50 |
yuriys1 | https://zuul.opendev.org/t/openstack/builds?job_name=tripleo-ci-centos-8-containers-multinode-victoria | 13:50 |
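The check yuriys1 describes can be approximated against Zuul's public builds API; a small sketch, treating the field names as a best guess against the current API rather than a stable contract:

```python
# Hedged sketch of the "average time of successful jobs" check described
# above, using Zuul's public builds endpoint.
import statistics

import requests

URL = "https://zuul.opendev.org/api/tenant/openstack/builds"
params = {
    "job_name": "tripleo-ci-centos-8-containers-multinode-victoria",
    "result": "SUCCESS",
    "limit": 50,
}
builds = requests.get(URL, params=params, timeout=30).json()
durations = [b["duration"] for b in builds if b.get("duration")]
if durations:
    print(f"{len(durations)} successful builds, "
          f"mean {statistics.mean(durations) / 60:.1f} minutes")
else:
    print("no successful builds found")
```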
Clark[m] | I should have time between about 11:30am and 1:30pm eastern. Otherwise I've got a bunch of errands I'm trying to get done today | 13:50 |
yuriys1 | something like that | 13:50 |
yuriys1 | nice my schedule is pretty open today so 11:30am+ EST works | 13:51 |
fungi | okay, so the node deletion exceptions aren't terribly helpful. we raise this in waitForNodeCleanup... nodepool.exceptions.ServerDeleteException: server 46e376c8-dc34-4e35-9d60-cb9340744387 deletion | 13:54 |
fungi | i'll see if i can get more from the api | 13:55 |
fungi | aha, there are a bunch of nodes in "error" state in our citycloud tenant | 13:56 |
fungi | server show lists a fault of "NeutronClientException" | 13:57 |
*** kopecmartin is now known as kopecmartin|pto | 14:00 | |
fungi | 5 nodes like that, all with a NeutronClientException fault timestamped between 2021-10-26T14:39:59Z and 2021-10-26T14:43:57Z so presumably they had some neutron api issue around then and nova couldn't cope | 14:00 |
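For anyone following along, roughly how such error-state servers can be enumerated with openstacksdk; the cloud name is a placeholder, and this is only an approximation of the checks fungi ran:

```python
# Hedged sketch: list error-state servers and their faults via openstacksdk.
# The cloud name "citycloud" is a placeholder for the real clouds.yaml entry.
import openstack

conn = openstack.connect(cloud="citycloud")
for server in conn.compute.servers(status="ERROR"):
    # fault is only populated on failed servers, so guard against None
    fault = (server.fault or {}).get("message", "unknown")
    print(server.id, server.name, fault)
```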
fungi | 4 are standard nodes and 1 is an "expanded" node, so in aggregate they're in excess of 20% of the quota there | 14:01 |
fungi | oh, in fact we have max-servers set to 16 there, so more like a third of our capped capacity | 14:03 |
fungi | er, sorry, two pools which together total 26 | 14:04 |
fungi | 40% of the standard pool and 6% of the specialty pool | 14:05 |
fungi | no wonder tripleo jobs were struggling to get node requests fulfilled there, we had a capacity of 6 standard nodes in that pool when you subtract the 4 stuck error nodes | 14:05 |
fungi | so with one node (?) taken by the paused registry server job, it could only run a single 3-node job at a time. if 3 or more registry jobs somehow got assigned there at once, then we'd be in a deadlock unable to run any of their child jobs | 14:07 |
fungi | this 10-node pool for standard labels in citycloud is starting to seem more and more like a liability the closer i look | 14:08 |
Clark[m] | When we created it, the idea was to have a place to regularly run jobs to ensure it was generally working, and to avoid the once-a-week job suddenly not working due to a sad mirror or similar being overlooked for a long period of time | 14:11 |
fungi | for single-node jobs and the occasional two-node, max-servers 10 seems fine. when tripleo is running a bunch of three-node jobs with parent jobs which need to be colocated together, not so much. worse when you're down 40% of the capacity there because of provider-side errors | 14:13 |
*** ysandeep is now known as ysandeep|out | 14:17 | |
Clark[m] | Yup, making the cloud more reliable would go a long way here. And probably smarter scheduling from nodepool and zuul could help as you mentioned previously | 14:24 |
Clark[m] | Unfortunately I think that cloud is the only one booting fedora right now too | 14:26 |
Clark[m] | Though inmotion may be capable too now that it is back | 14:27 |
* fungi checks | 14:35 | |
fungi | right now the only fedora-34 node is there but it's in a deleting state, i'll need to check the logs to see if it was reachable/used | 14:36 |
fungi | yep, it ran this: https://zuul.opendev.org/t/openstack/build/d46680424cc540569479bdc78cc1bdc4 | 14:40 |
fungi | those tobiko jobs are hella broken though, so i don't think the provider had anything to do with the build timeout there | 14:40 |
yuriys1 | I took a look as well, is tobiko an alternative to tempest? weird for it to be so responsive up until that timeout | 14:52 |
yuriys1 | Also based on this: https://zuul.opendev.org/t/openstack/builds?job_name=devstack-tobiko-fedora .... they hit some timeout... a lot | 14:52 |
yuriys1 | not a success in sight | 14:52 |
fungi | right, as i said, their testing seems to be mostly broken in recent weeks | 14:58 |
*** ykarel is now known as ykarel|away | 15:00 | |
*** pojadhav|ruck is now known as pojadhav|out | 15:23 | |
clarkb | yuriys1: fungi: now for the next ~ 2 hours is good for me if you still want to chat | 15:41 |
yuriys1 | yep! | 15:41 |
fungi | i'm free | 15:43 |
yuriys1 | https://meetpad.opendev.org/imhmeet :) | 15:44 |
hashar | fungi: clarkb: thank you for the gear 0.16.0 release :] | 15:56 |
*** marios is now known as marios|out | 15:56 | |
fungi | hashar: you're welcome, sorry we took so long, we didn't want to destabilize zuul in the midst of its work to move off gearman | 16:01 |
hashar | fungi: well understood don't worry. I am happy you ended up being able to safely cut a new one ;) | 16:07 |
fungi | clarkb: yuriys1: that i915 on-battery lockup just hit me, sorry | 16:26 |
fungi | hit me like 3x in rapid succession so i wonder if it's related to battery level/voltage dropouts or something. seems to be stable now that i've plugged my charger in | 16:38 |
*** jpena is now known as jpena|off | 17:04 | |
opendevreview | Merged openstack/project-config master: Replace old Xena cycle signing key with Yoga https://review.opendev.org/c/openstack/project-config/+/815548 | 17:33 |
ianw | fedora 35 seems like its release is imminent. we've never used a pre-release fedora before, but in this case i wonder if that's a better idea than pushing further on f34 | 21:01 |
clarkb | ianw: they certainly seem to have decided that releasing f35 is more important than fixing all the f34 users they broke :/ | 21:02 |
ianw | i couldn't (easily) get a dib image with the changes yesterday due to the upstream mirror issues, which i think have abated | 21:02 |
fungi | i'm in favor of falling forward | 21:02 |
fungi | you risk injury either way, but at least one gets you farther | 21:02 |
ianw | i'm not sure if just setting DIB_DISTRO=35 will work with a pre-release ... only one way to find out i guess | 21:02 |
clarkb | ianw: that's the best thing about having a great CI system | 21:03 |
clarkb | we can have a computer answer those questions for us if we just ask nicely | 21:03 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 21:07 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] trying fedora 35 https://review.opendev.org/c/openstack/diskimage-builder/+/815574 | 21:22 |
ianw | clarkb: not sure if you were following all the mina sshd key things, but it looks like 2.7.1 should be fixed soon. just noticed luca asking about it targeting gerrit v3.5 | 21:24 |
ianw | s/fixed/released with fixes/ | 21:24 |
Clark[m] | Yup I just got the email and am on the Mina bug subscribe list as well as gerrit's | 21:25 |
ianw | cool; well it will be a good carrot for a 3.5 upgrade at the right time | 21:25 |
Clark[m] | I have a change up to do Mina 2.7.0 against Gerrit 3.3 in system config but maybe I should see about doing it against master and pushing it upstream | 21:25 |
Clark[m] | Enotime | 21:26 |
Clark[m] | Looks like they updated 3.5 to 2.6.0. I really thought I checked that and it was still 2.4.0 a week ago. Maybe they recently changed that | 21:34 |
corvus | i'd like to launch zuul01.opendev.org as a 16GB vm | 21:53 |
corvus | that's half the size of zuul02. it looks like we're using a good 5-6GB of ram on zuul02 right now. we could probably do an 8GB but i don't want to push it too close yet. | 21:53 |
corvus | i've looked over the current ansible, and i don't think any changes are needed in order to launch it. just adding it to inventory afterwards. | 21:54 |
corvus | if there are no objections, i'll do that shortly. i'd like to have it on hand for some experimentation once we finish merging the current stack. | 21:55 |
clarkb | corvus: that sounds about right. The inventory addition is what adds it to firewall rules and then starts services | 21:55 |
corvus | hrm, i might make a change to add a zuulschedulerstart:false hostvar for the new host | 21:58 |
corvus | oh wait that shouldn't be necessary, i think that's the default | 21:58 |
clarkb | ya it could be. I can never remember and typically just double check before adding to inventory | 21:58 |
corvus | so yeah, it shouldn't actually start the service... but our zuul_restart playbooks might hit it, i'll check that. | 21:58 |
opendevreview | James E. Blair proposed opendev/system-config master: Limit zuul stop/start playbooks to zuul02 https://review.opendev.org/c/opendev/system-config/+/815759 | 22:01 |
corvus | okay, i think that's the only thing we need to do to make the system safer for zuul01 | 22:01 |
clarkb | corvus: and infra-root https://review.opendev.org/c/opendev/system-config/+/791832 is a change I've had up forever that might be good to land before launching a new server | 22:03 |
clarkb | or at least hand patch into your local copy of the launch script | 22:03 |
clarkb | I don't think it is critical, but the fs tools complain about our current alignment | 22:03 |
corvus | +2 | 22:04 |
*** dviroel|rover is now known as dviroel|rover|afk | 22:08 | |
fungi | lookin' | 22:08 |
ianw | i thought we really only used the swapfile method | 22:11 |
fungi | depends on where we boot, i thought | 22:12 |
Clark[m] | Yes, rackspace boots use the ephemeral drive to carve out a proper partition. Elsewhere we don't get extra devices to repartition, so we do the swapfile | 22:18 |
opendevreview | Merged opendev/system-config master: Better swap alignment https://review.opendev.org/c/opendev/system-config/+/791832 | 22:20 |
opendevreview | Ian Wienand proposed openstack/diskimage-builder master: [dnm] testing f34 with bullseye podman https://review.opendev.org/c/openstack/diskimage-builder/+/815763 | 22:49 |
opendevreview | Merged opendev/system-config master: Limit zuul stop/start playbooks to zuul02 https://review.opendev.org/c/opendev/system-config/+/815759 | 22:51 |