Wednesday, 2024-10-02

01:03 *** dhill is now known as Guest5154
07:17 <mnasiadka> fungi: I still have a feeling something is not in its best shape - I'm seeing a lot of "Timeout exception waiting for the logger. Please check connectivity to [104.239.142.11:19885]" errors, and the gerrit/zuul dashboard is not as responsive as usual - but maybe that's the release load ;-)
09:58 <fungi> mnasiadka: release hasn't started yet, we're about 2 minutes away
09:59 <fungi> mnasiadka: 104.239.142.11 isn't one of our servers... where did you see that error?
10:00 <fungi> are you talking about job failures or something?
10:01 <mnasiadka> fungi: https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505
10:01 <mnasiadka> basically I have jobs that do time out - I don't know if it's possible for the logger to be the culprit
10:10 <fungi> mnasiadka: that message is being reported from the zuul_stream callback, which tries to capture ansible output and ship it over the network: https://opendev.org/zuul/zuul/src/branch/master/zuul/ansible/base/callback/zuul_stream.py#L134-L170
10:10 <fungi> you'll see it when there are long-running ansible tasks that produce no new output for a while
10:14 <fungi> in your example, it's the kolla-build task which is in progress when the job reaches its configured timeout, but since that task's output is all redirected to other files you don't see it streamed in the main job output log. instead you'll need to look in the separate kolla build logfiles to find out what took so long
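
For context, a minimal sketch (not the actual zuul code) of what a callback like zuul_stream does: it connects to a small log-streaming daemon on the worker node and relays task output into the job log, emitting the "waiting for the logger" warning when no connection arrives in time. The retry interval and timeout below are illustrative, not zuul's real values:

    import socket
    import time

    LOG_STREAM_PORT = 19885  # port the node-side console streamer listens on

    def open_log_stream(host, port=LOG_STREAM_PORT, timeout=60):
        # Keep retrying until the streamer accepts a connection or we give
        # up; the real callback logs a warning like the one quoted above
        # when a loop of this kind times out.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                return socket.create_connection((host, port), timeout=10)
            except OSError:
                time.sleep(0.5)
        print("[Zuul] Timeout exception waiting for the logger. "
              "Please check connectivity to [%s:%s]" % (host, port))
        return None
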
10:14 <mnasiadka> ah right, forgot about that
12:27 <fungi> infra-root: seems like it's time to get ovh to extend our service voucher again... "Dear Customer, Either there is no payment method registered on your account, or there is an issue with the default payment method that has been registered..."
12:32 *** dhill is now known as Guest5211
12:34 <Clark[m]> Is amorin around? They've typically been helpful in getting that sorted out, iirc
12:34 <amorin> hey
12:35 <amorin> oh, voucher issue
12:35 <amorin> what is your project-id?
12:37 <amorin> dcaab5e32b234d56b626f72581e3644c I believe
12:38 <amorin> ok, I can see the voucher, will ask for a new one
12:38 <fungi> i can check, hold on - i need to re-open the inbox for that address
12:39 <fungi> amorin: correct, "PublicCloudProject: dcaab5e32b234d56b626f72581e3644c (openstackjenkins)"
12:39 <fungi> that's what was in the invoice e-mail we received, anyway
12:40 <fungi> amorin: keep in mind we have two tenants/projects in ovh, but that's the only one we got a "pending payment" message about
12:41 <amorin> yes, I saw your second one
12:41 <amorin> with only two instances there, you still have credit on the second one
12:41 <fungi> thanks for checking!
12:42 <amorin> the second one's voucher will expire at the beginning of january, so I will try to get a new voucher for that one as well, to avoid issues on january 5
12:42 <Clark[m]> Thank you!
12:42 <fungi> we'll keep a close eye on notifications too
12:43 <amorin> it's good if you keep an eye on it, because nobody does here :)
14:14 <mnasiadka> fungi: actually this task is logging output, but we're not getting any output due to those logger errors
14:15 <mnasiadka> fungi: here's an example: https://zuul.opendev.org/t/openstack/build/ede006ddc95d41579977142ff77c7c69/log/job-output.txt#573
14:20 <Clark[m]> That example doesn't log the indications that the logger was unavailable like the other one did. Are we sure this is the same thing?
14:24 <fungi> you'll get the "waiting on logger" messages any time there's a delay with the callback opening the file that the task stdout/stderr is written to, but if the task starts writing output and then there's a long pause you won't see that message
14:28 <corvus> the initial link from mnasiadka, https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505, looks like it never had any streaming output even in early tasks; that seems like a problem with the worker node or the connection from the executor to it.
14:30 <fungi> agreed, lack of connectivity from the executor to the node sounds possible (that's what the exception implies after all)
14:30 <fungi> could be an iptables rule on the node, or maybe we have a security group that's regressed in one of our providers
14:30 <fungi> and has started blocking the log streaming port somehow
14:31 <corvus> that one was rockylinux-9 on rax-dfw
14:32 <corvus> 2024-10-02 06:07:47,062 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer
14:33 <corvus> 2024-10-02 07:34:37,301 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer
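
A quick way to test the connectivity theory (an iptables rule or a regressed security group blocking the streaming port) is a direct probe of port 19885 from the executor side. A minimal sketch; the address below is a hypothetical stand-in for a worker node:

    import socket

    def can_reach_streamer(host, port=19885):
        # Attempt a plain TCP connection to the node's log streaming port;
        # ECONNRESET (Errno 104, as in the tracebacks above) or a timeout
        # here would point at a firewall/security-group problem.
        try:
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError as exc:
            print(f"{host}:{port} unreachable: {exc}")
            return False

    can_reach_streamer("203.0.113.10")  # hypothetical worker node address
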
14:45 <corvus> ttx, elodilles, fungi: was the new zuul status overview page (the zoomed out thing with the little squares) useful during the release process?  any feedback on that (pro or con)?
14:45 <corvus> (whenever you have a moment, no worries if you're busy :)
14:46 <clarkb> one thing I noticed is that it was easy for me to find and expand only the release approval pipeline
14:47 <fungi> corvus: with pipelines collapsed i found it more helpful than in the past, with pipelines expanded slightly less convenient than before because of the additional scrolling around and because the dynamic repositioning was a bit more noticeable. overall a net positive though for me
14:47 <corvus> clarkb: did you end up also opening the release-approval pipeline detail page (https://zuul.opendev.org/t/openstack/status/pipeline/release-approval) or did you just stay on the overview screen with the others minimized?
14:48 <clarkb> corvus: I just stayed on the overview page with others minimized as that got me the info I needed
14:48 <fungi> i stuck with the overview screen, but using the drill-down pages instead of expanding pipelines would probably have helped me with having them jump around, in retrospect
14:50 <corvus> ack.  fwiw, no one loves the jumping around; we're still working on ideas to reduce/eliminate that.  for now, yeah, the pipeline detail page will avoid that, or if you turn on "show all pipelines" there is less jumping around since all the empty pipelines stay on the page, but there's still some as they expand and contract...
14:51 <ttx> I could find the info I needed, and it's pretty intuitive. Agree the jumping around is the most annoying UX issue at this point
14:51 <ttx> I should try the drill-down page
14:52 <corvus> thanks for the feedback!
14:52 <elodilles> yepp, it was easier to find the jobs and see the status of the queues and jobs
14:52 <ttx> I did not notice there was a way to get to a pipeline-specific summary page, but now that I know there is one I could find it
14:53 <TheJulia> So, a thought regarding the new zuul ui: I'm seeing a lot of line wrapping due to longer job names, and it seems like a ton of screen "real estate" is in the padding and the width of the progress bar. It would be totally cool if we could somehow hint/suggest/save some sort of preference width-wise for the progress bar so the job name wrapping doesn't cause problems. I think the base challenge is that we would have to cut the names by like 40% to avoid the wrapping, which begins to lose meaning as well.
14:53 <fungi> i think there's work in progress to condense it and get back closer to the density of the previous interface
14:54 <corvus> TheJulia: good point; yeah we do want to reduce the padding which should allow us to increase the information density
14:55 <amorin> fungi clarkb it's done
14:55 <amorin> (the voucher)
14:55 <clarkb> amorin: thank you for the quick response
14:55 <clarkb> and thank you for the test nodes!
14:56 <corvus> i know from other zuul installations that there are even much (much!) longer names out there than in openstack -- so i know we're always going to have to deal with wrapping.  but we do want to minimize it.
15:00 <corvus> ttx: (and others) on the overview page, the pipeline name is a link to the pipeline detail page; so that's how you can get there quickly
15:02 <corvus> that actually gives us something we didn't have in the old page: you can do this: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=openstack%2Fneutron
15:02 <corvus> which, if you look at that now, is an entire page filled with nothing but neutron check queue items
15:03 <corvus> or the glob version, to get all neutron-related projects: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=*neutron*
15:04 <corvus> (that last thing is probably not helpful for release, but can be helpful for projects in day-to-day work)
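
The same per-project filtering can also be scripted against Zuul's REST status endpoint rather than the web UI. A minimal sketch: the /api/tenant/<tenant>/status endpoint is real, but the exact JSON layout varies between Zuul versions, so the key names below are assumptions:

    import fnmatch
    import json
    from urllib.request import urlopen

    def pipeline_items(tenant, pipeline, project_glob):
        # Fetch the whole tenant status blob and filter client-side.
        url = f"https://zuul.opendev.org/api/tenant/{tenant}/status"
        status = json.load(urlopen(url))
        for pl in status["pipelines"]:
            if pl["name"] != pipeline:
                continue
            for queue in pl.get("change_queues", []):
                for head in queue.get("heads", []):
                    for item in head:
                        if fnmatch.fnmatch(item.get("project", ""), project_glob):
                            yield item

    for item in pipeline_items("openstack", "check", "*neutron*"):
        print(item.get("project"), item.get("id"))
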
15:31 <ttx> yeah, once I knew it was there somewhere I could easily find it.
15:47 <opendevreview> Merged openstack/project-config master: Revert "Temporarily remove release docs semaphores"  https://review.opendev.org/c/openstack/project-config/+/930710
15:55 <clarkb> cinder is debugging an issue where a job fails because they try to use a minimum of 2 cpus but there is only one available
15:55 <clarkb> https://zuul.opendev.org/t/openstack/build/66e86bad2ef1422ea1783704590a84c3/log/zuul-info/host-info.ubuntu-noble.yaml#540 seems to confirm that ansible thinks there is one vcpu too
15:55 <clarkb> this is in rax-ord
15:58 <clarkb> I guess we should check if that happens for all noble instances in rax-ord or rax-* as maybe it is a kernel issue with the platform?
16:19 <corvus> also, could "just" be an ansible issue (i.e., maybe the vm does see multiple vcpus but ansible isn't reading it right)
16:20 <clarkb> that's possible, though I think python is also seeing one cpu, which leads to the test failure?
16:20 <clarkb> could be that ansible and python share a common failure mode though, since ansible is written primarily in python
16:35 <opendevreview> Clark Boylan proposed openstack/project-config master: Set bindep profiles for openstack release doc publication  https://review.opendev.org/c/openstack/project-config/+/931204
16:40 <opendevreview> Jeremy Stanley proposed opendev/system-config master: Update Mailman containers to latest versions  https://review.opendev.org/c/opendev/system-config/+/930236
17:09 <fungi> clarkb: not at all urgent, but i think i've answered or addressed your comments on 930236 whenever you have a chance to look again
17:10 <fungi> also updated it for the new mailman core release from yesterday
17:10 <fungi> hopefully tests still pass
17:25 <opendevreview> Merged openstack/project-config master: Set bindep profiles for openstack release doc publication  https://review.opendev.org/c/openstack/project-config/+/931204
19:03 <TheJulia> corvus: also, the space between each entry representing a change under test appears to be a good place to recover some space for improving density.
19:32 <clarkb> I've spot-checked two noble nodes in rax dfw and iad and they both have 8 vcpus
19:32 <clarkb> there are no running instances in ord so I haven't been able to check there yet
19:32 <clarkb> now to figure out how python is determining the number of cpus and see if it is different than what /proc/cpuinfo reports
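
A minimal sketch of the places that number can come from, since they can legitimately disagree (for example under a CPU affinity mask):

    import os

    # What multiprocessing.cpu_count() and most test runners consult:
    print("os.cpu_count():", os.cpu_count())

    # CPUs this process is actually allowed to run on (the affinity mask);
    # this can be lower than the machine total:
    print("sched_getaffinity:", len(os.sched_getaffinity(0)))

    # What ansible's fact gathering ultimately derives its count from:
    with open("/proc/cpuinfo") as f:
        print("processor entries:",
              sum(line.startswith("processor") for line in f))
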
19:38 <corvus> TheJulia: agreed and noted
19:46 <clarkb> cross-checking: zuul py312 jobs on noble run on rax-iad do report 8 vcpus
19:47 <corvus> via that same ansible output?  or via the nox stestr concurrency thing?
19:48 <clarkb> corvus: via the same ansible output (the host info facts stuff we collect)
19:48 <clarkb> https://zuul.opendev.org/t/zuul/build/2e0a992ed6244ca1bcc3174e024f8486/log/zuul-info/host-info.ubuntu-noble.yaml#561
19:48 <corvus> that seems mysterious then
19:49 <clarkb> ya, I half wonder if there is a broken flavor in ord or something specific about the version of xen there or something
19:49 <clarkb> but still trying to collect more info
19:49 <corvus> i wonder if there's any possibility nodepool picked the wrong flavor
19:49 <clarkb> maybe? let me see if any of the other flavor details from the facts file look obviously wrong too
19:50 <corvus> i think we're using min-ram for that, right?  so maybe there's an extra flavor in ord that matches, or maybe there's some flaw on that launcher (like it cached a bad flavor list)
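
To illustrate corvus's concern: min-ram matching selects a flavor by RAM floor, not by vcpu count, so a stray flavor with enough RAM but one vcpu could satisfy the label. A minimal sketch of that selection logic (illustrative, not nodepool's actual code; the flavor list is made up):

    def pick_flavor(flavors, min_ram):
        # Choose the smallest flavor whose RAM meets the label's min-ram;
        # note that nothing here looks at vcpus.
        candidates = [f for f in flavors if f["ram"] >= min_ram]
        return min(candidates, key=lambda f: f["ram"], default=None)

    flavors = [  # hypothetical (possibly stale/cached) provider listing
        {"name": "8gb-standard", "ram": 8192, "vcpus": 8},
        {"name": "odd-8gb", "ram": 8192, "vcpus": 1},
    ]
    print(pick_flavor(flavors, 8000))  # a tie on ram: either could win
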
19:50 <clarkb> oooh, the ansible_product_version differs, so maybe it's a different xen setup
19:50 <corvus> oh interesting
19:51 <clarkb> the block device sizes and memory allocation both look like what I would expect: ~40GB / and ~8GB RAM
19:52 * clarkb looks for opensearch details to try and query some stuff
19:52 <corvus> clarkb: could write a job that fails if cpu count==1, then run it 10 times with an autohold
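
Such a job's check step could be as small as this (a hypothetical script, not an existing job): exit nonzero whenever only one CPU is visible, so the failure trips the autohold and the node is kept for inspection.

    import os
    import sys

    cpus = os.cpu_count() or 0
    print(f"visible cpus: {cpus}")
    sys.exit(0 if cpus >= 2 else 1)  # failing here triggers the autohold
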
19:55 <clarkb> ya, I'll probably fall back to that if I can't figure anything out via opensearch
19:55 <clarkb> side note: where is this documented? I've brute-forced some api access but ideally I'd get the dashboard...
19:56 <clarkb> https://opendev.org/openstack/neutron/src/branch/master/doc/source/contributor/policies/gate-failure-triage.rst here
20:01 <clarkb> making some progress
20:02 <clarkb> https://c6855c6fbe54327bc980-a152e227f766163de848f7eda36ee170.ssl.cf1.rackcdn.com/931005/2/check/openstack-tox-py312/c903185/job-output.txt is a successful run
20:02 <clarkb> actually let me rewind
20:03 <clarkb> over the last day-ish there are two failures, both in rax-ord, and they both have the same product version. If I look for successes of the same job and project in rax-ord there are some with the newer product version
20:04 <corvus> looking at that same job that's guaranteed to fail with 1 vcpu?
20:04 <clarkb> yes
20:04 <corvus> link to the successes?
20:04 <clarkb> just above I linked to one (sorry, it's the raw log because opensearch isn't linking to the nice one)
20:04 <corvus> oh, that one above, right?
20:04 <corvus> ya sorry
20:05 <clarkb> so I'm beginning to wonder if this is very hypervisor-specific, with the kernel that noble runs or maybe with python312?
20:05 <corvus> nice log: https://zuul.opendev.org/t/openstack/build/c903185a1e9f43a8a509788fbbbca04d
20:06 <corvus> clarkb: host id is in the inventory
20:06 <corvus> (if we want to try to establish/refute a hypervisor pattern)
20:06 <clarkb> oh cool, let me try and collect successful host ids vs unsuccessful ones
20:07 <clarkb> looks like 3 successes in the same time frame. I'm going to expand that to the last week instead of the last day and collect more data
20:07 <clarkb> 7 successes over the last week
20:07 <corvus> the success does say 8 vcpus on xen 4.7
20:10 <clarkb> putting some notes in there because there is a lot of data to keep track of. Could probably script it but I don't want to deal with the opensearch api right now if I don't have to
20:10 <clarkb> https://etherpad.opendev.org/p/Kk-QduOuLQV6grR444bw er, there
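
If it ever does need scripting, the comparison could look roughly like this: an OpenSearch terms aggregation bucketing the job's builds by host id. A sketch only; the endpoint URL, index pattern, and field names below are assumptions about the opensearch setup, not verified values:

    import json
    from urllib.request import Request, urlopen

    URL = "https://opensearch.example.org/logstash-*/_search"  # hypothetical
    query = {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"build_name": "openstack-tox-py312"}},
            {"term": {"node_provider": "rax-ord"}},
            {"range": {"@timestamp": {"gte": "now-7d"}}},
        ]}},
        # Bucket the matching builds by hypervisor host id:
        "aggs": {"by_hostid": {"terms": {"field": "hostid.keyword"}}},
    }
    req = Request(URL, data=json.dumps(query).encode(),
                  headers={"Content-Type": "application/json"})
    result = json.load(urlopen(req))
    print(result["aggregations"]["by_hostid"]["buckets"])
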
20:11 <clarkb> still just the 2 failures
20:11 <opendevreview> Julia Kreger proposed openstack/diskimage-builder master: trivial: provide a little more context for disk size failures  https://review.opendev.org/c/openstack/diskimage-builder/+/931224
20:11 <corvus> i'll look up the nice urls for the failures
20:11 <corvus> er i mean the successes
20:11 <clarkb> thanks
20:14 <corvus> clarkb: the 2 successes have the same raw url
20:14 <corvus> er, failures
20:14 <corvus> ("strike that! reverse it!")
20:15 <clarkb> oh, let me double-check that - maybe the string I searched on occurs twice and it's really just the one failure
20:16 <clarkb> corvus: ya, I think this is actually one failure with two hits in opensearch on my search string
20:16 <clarkb> so 1/8 failed, 7/8 succeeded
20:16 <clarkb> probably going to be hard to draw conclusions based on that
20:20 <clarkb> the failed node has a different host_id and product version from all the others
20:20 <clarkb> still more of a hunch than a smoking gun, but I suspect this is related
20:23 <clarkb> I think cardoe is on vacation but afterwards this might be a fun one to debug in conjunction with someone on the inside
20:42 <cardoe> clarkb: remind me next week and I can help. Or at least get someone on to help.
20:43 <clarkb> cardoe: thanks
20:43 <cardoe> I didn't bring any devices capable of reaching out to work. Just open source stuff.
20:47 <clarkb> and you should be enjoying your vacation
20:56 <TheJulia> clarkb++
22:25 <opendevreview> Jay Faulkner proposed openstack/project-config master: Send notifications for unmaintained ps to -ironic  https://review.opendev.org/c/openstack/project-config/+/931240
22:28 <JayF> open source work on openstack is work when you're paid to be a cloud engineer working on openstack :)
