*** dhill is now known as Guest5154 | 01:03 | |
mnasiadka | fungi: I still have a feeling something is not in its best shape - I'm seeing a lot of "Timeout exception waiting for the logger. Please check connectivity to [104.239.142.11:19885]" errors, and the gerrit/zuul dashboard is not as responsive as usual - but maybe that's the release load ;-) | 07:17 |
fungi | mnasiadka: release hasn't started yet, we're about 2 minutes away | 09:58 |
fungi | mnasiadka: 104.239.142.11 isn't one of our servers... where did you see that error? | 09:59 |
fungi | are you talking about job failures or something? | 10:00 |
mnasiadka | fungi: https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505 | 10:01 |
mnasiadka | basically I have jobs that do time out - I don't know if it's possible for the logger to be the culprit | 10:01 |
fungi | mnasiadka: that message is being reported from the zuul_stream callback which tries to capture ansible output and ship it over the network: https://opendev.org/zuul/zuul/src/branch/master/zuul/ansible/base/callback/zuul_stream.py#L134-L170 | 10:10 |
fungi | you'll see it when there are long-running ansible tasks that produce no new output for a while | 10:10 |
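(The "Timeout exception waiting for the logger" message above comes from the zuul_stream callback failing to reach the log-streaming daemon on the node. A minimal sketch of that connect-and-retry pattern, illustrative only: Zuul's real callback is the zuul_stream.py linked above, and the port is taken from the error message while the timeout and retry values here are made up.)

```python
import socket
import time

LOG_STREAM_PORT = 19885  # port from the error message above


def wait_for_logger(host, port=LOG_STREAM_PORT, timeout=10.0, retries=30):
    """Illustrative sketch: keep trying to reach the node's log streamer.

    Shows why a dead streaming daemon or a blocked port surfaces as
    'Timeout exception waiting for the logger' in the job output.
    """
    for attempt in range(retries):
        try:
            # Succeeds only if the daemon is up and the port is reachable.
            return socket.create_connection((host, port), timeout=timeout)
        except (socket.timeout, OSError):
            time.sleep(1)  # daemon not started yet, or connectivity blocked
    raise TimeoutError(
        f"Timeout exception waiting for the logger. "
        f"Please check connectivity to [{host}:{port}]")
```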
fungi | in your example, it's the kolla-build task which was in progress when the job reached its configured timeout, but since that task's output is all redirected to other files you don't see it streamed in the main job output log. instead you'll need to look in the separate kolla build logfiles to find out what took so long | 10:14
mnasiadka | ah right, forgot about that | 10:14 |
fungi | infra-root: seems like it's time to get ovh to extend our service voucher again... "Dear Customer, Either there is no payment method registered on your account, or there is an issue with the default payment method that has been registered..." | 12:27 |
*** dhill is now known as Guest5211 | 12:32 | |
Clark[m] | Is amorin around? Had typically been helpful in getting that sorted out iirc | 12:34 |
amorin | hey | 12:34 |
amorin | oh, voucher issue | 12:35 |
amorin | what is your project-id ? | 12:35 |
amorin | dcaab5e32b234d56b626f72581e3644c I believe | 12:37 |
amorin | ok, I can see the voucher, will ask for a new one | 12:38 |
fungi | i can check hold on. need to re-open the inbox for that address | 12:38 |
fungi | amorin: correct "PublicCloudProject: dcaab5e32b234d56b626f72581e3644c (openstackjenkins)" | 12:39 |
fungi | that's what was in the invoice e-mail we received anyway | 12:39 |
fungi | amorin: keep in mind we have two tenants/projects in ovh, but that's the only one we got a "pending payment" message about | 12:40 |
amorin | yes, I saw your second one | 12:41
amorin | with only two instances, you still have money on the second | 12:41 |
fungi | thanks for checking! | 12:41 |
amorin | it will expire at the beginning of january for the second one, so I will try to get a new voucher for this one also, to avoid issues on january 5 | 12:42
Clark[m] | Thank you! | 12:42 |
fungi | we'll keep a close eye on notifications too | 12:42 |
amorin | that's good if you keep an eye because nobody does here :) | 12:43 |
mnasiadka | fungi: actually this task is logging output, but we're not getting any output due to those logger errors | 14:14 |
mnasiadka | fungi: here's an example: https://zuul.opendev.org/t/openstack/build/ede006ddc95d41579977142ff77c7c69/log/job-output.txt#573 | 14:15 |
Clark[m] | That example doesn't log the indications that the logger was unavailable like the other one did. Are we sure this is the same thing? | 14:20 |
fungi | you'll get the "waiting on logger" messages any time there's a delay with the callback opening the file that the task stdout/stderr is written to, but if the task starts writing output and then there's a long pause you won't see that message | 14:24 |
corvus | the initial link from mnasiadka https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505 looks like it never had any streaming output even in early tasks; that seems like a problem with the worker node or the connection from the executor to it. | 14:28 |
fungi | agreed, lack of connectivity from the executor to the node sounds possible (that's what the exception implies after all) | 14:30 |
fungi | could be an iptables rule on the node or maybe we have a security group that's regressed in one of our providers | 14:30 |
fungi | and has started blocking the log streaming port somehow | 14:30 |
corvus | that one was rockylinux-9 on rax-dfw | 14:31
corvus | 2024-10-02 06:07:47,062 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer | 14:32 |
corvus | 2024-10-02 07:34:37,301 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer | 14:33 |
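(A quick way to test the iptables/security-group hypothesis above is to probe the streaming port from the executor side and see how the connection fails. A hedged sketch; the node address is a placeholder, and the mapping of exceptions to causes is the usual TCP behavior, not anything Zuul-specific:)

```python
import socket


def check_stream_port(host, port=19885, timeout=5.0):
    """Probe the console streaming port. A reset/refusal means something
    is actively rejecting (daemon down, REJECT rule); a timeout suggests
    packets are silently dropped (DROP rule or security group)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionResetError:
        return "reset by peer"      # matches the Errno 104 above
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered/timeout"   # e.g. firewall drop
    except OSError as exc:
        return f"error: {exc}"      # no route to host, etc.


# example (hypothetical node address):
# print(check_stream_port("203.0.113.10"))
```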
corvus | ttx, elodilles, fungi was the new zuul status overview page (the zoomed out thing with the little squares) useful during the release process? any feedback on that (pro or con)? | 14:45 |
corvus | (whenever you have a moment, no worries if you're busy :) | 14:45 |
clarkb | one thing I noticed is that it was easy for me to find and expand only the release approval pipeline | 14:46 |
fungi | corvus: with pipelines collapsed i found it more helpful than in the past, with pipelines expanded slightly less convenient than before because of the additional scrolling around and because the dynamic repositioning was a bit more noticeable. overall a net positive though for me | 14:47 |
corvus | clarkb: did you end up also opening the release-approval pipeline detail page (https://zuul.opendev.org/t/openstack/status/pipeline/release-approval) or did you just stay on the overview screen with the others minimized? | 14:47 |
clarkb | corvus: I just stayed on the overview page with others minimized as that got me the info I needed | 14:48 |
fungi | i stuck with the overview screen, but using the drill-down pages instead of expanding pipelines would probably have helped me with having them jump around, in retrospect | 14:48 |
corvus | ack. fwiw, no one loves the jumping around; we're still working on ideas to reduce/eliminate that. for now, yeah, the pipeline detail page will avoid that, or if you turn on "show all pipelines" there is less jumping around since all the empty pipelines stay on the page, but there's still some as they expand and contract... | 14:50 |
ttx | I could find the info I needed, and it's pretty intuitive. Agree the jumping around is the most annoying UX issue at this point | 14:51 |
ttx | I should try the drill-down page | 14:51 |
corvus | thanks for the feedback! | 14:52 |
elodilles | yepp, it was easier to find the jobs and see the status of the queues and jobs | 14:52 |
ttx | I did not notice there was a way to get to a pipeline-specific summary page, but now that I know there is one I could find it | 14:52 |
TheJulia | So, thought regarding the new zuul ui: I'm seeing a lot of line wrapping due to longer job names and it seems like a ton of screen "real estate" is in the padding and the width of the progress bar. It would be totally cool if we could somehow hint/suggest/save some sort of preference width-wise for the progress bar so the job name wrapping doesn't cause problems. I think the base challenge would be we would have to cut the names by | 14:53
TheJulia | like 40% which begins to lose meaning as well to avoid the wrapping. | 14:53
fungi | i think there's work in progress to condense it and get back closer to the density of the previous interface | 14:53 |
corvus | TheJulia: good point; yeah we do want to reduce the padding which should allow us to increase the information density | 14:54 |
amorin | fungi clarkb it's done | 14:55 |
amorin | (the voucher) | 14:55 |
clarkb | amorin: thank you for the quick response | 14:55 |
clarkb | and thank you for the test nodes! | 14:55 |
corvus | i know from other zuul installations that there are even much (much!) longer names out there than in openstack -- so i know we're always going to have to deal with wrapping. but we do want to minimize it. | 14:56 |
corvus | ttx: (and others) on the overview page, the pipeline name is a link to the pipeline detail page; so that's how you can get there quickly | 15:00 |
corvus | that actually gives us something we didn't have in the old page: you can do this: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=openstack%2Fneutron | 15:02 |
corvus | which, if you look at that now, is an entire page filled with nothing but neutron check queue items | 15:02 |
corvus | or the glob version, to get all neutron-related projects: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=*neutron* | 15:03 |
corvus | (that last thing is probably not helpful for release, but can be helpful for projects in day-to-day work) | 15:04 |
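(The same per-project filtering is available programmatically: the dashboard pages above are rendered from the tenant status endpoint. A hedged sketch of the REST analogue of the ?project= filter; the endpoint path matches what the dashboard fetches, but the JSON field names here are inferred from the public status format and may differ across Zuul versions:)

```python
import json
import urllib.request

TENANT_STATUS = "https://zuul.opendev.org/api/tenant/openstack/status"


def items_for_project(project="openstack/neutron", pipeline="check"):
    """Yield queue items for one project in one pipeline, the
    programmatic analogue of ?project=openstack%2Fneutron above."""
    with urllib.request.urlopen(TENANT_STATUS) as resp:
        status = json.load(resp)
    for pl in status.get("pipelines", []):
        if pl.get("name") != pipeline:
            continue
        for queue in pl.get("change_queues", []):
            for head in queue.get("heads", []):   # lists of queue items
                for item in head:
                    if project in str(item.get("project", "")):
                        yield item


# for item in items_for_project():
#     print(item.get("id"))
```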
ttx | yeah once I knew it was there somewhere I could easily find it. | 15:31 |
opendevreview | Merged openstack/project-config master: Revert "Temporarily remove release docs semaphores" https://review.opendev.org/c/openstack/project-config/+/930710 | 15:47 |
clarkb | cinder is debugging an issue where a job fails because they try to use a minimum of 2 cpus but there is only one available | 15:55 |
clarkb | https://zuul.opendev.org/t/openstack/build/66e86bad2ef1422ea1783704590a84c3/log/zuul-info/host-info.ubuntu-noble.yaml#540 seems to confirm that ansible thinks there is one vcpu too | 15:55 |
clarkb | this is in rax-ord | 15:55 |
clarkb | I guess we should check if that happens for all noble instances in rax-ord or rax-* as maybe it is a kernel issue with the platform? | 15:58 |
corvus | also, could "just" be an ansible issue (ie, maybe the vm does see multiple vcpus but ansible isn't reading it right) | 16:19 |
clarkb | that's possible though I think python is also seeing one cpu which leads to the test failure? | 16:20
clarkb | could be that ansible and python share a common failure mode though since ansible is written primarily in python | 16:20
opendevreview | Clark Boylan proposed openstack/project-config master: Set bindep profiles for openstack release doc publication https://review.opendev.org/c/openstack/project-config/+/931204 | 16:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Update Mailman containers to latest versions https://review.opendev.org/c/opendev/system-config/+/930236 | 16:40 |
fungi | clarkb: not at all urgent, but i think i've answered or addressed your comments on 930236 whenever you have a chance to look again | 17:09 |
fungi | also updated it for the new mailman core release from yesterday | 17:10 |
fungi | hopefully tests still pass | 17:10 |
opendevreview | Merged openstack/project-config master: Set bindep profiles for openstack release doc publication https://review.opendev.org/c/openstack/project-config/+/931204 | 17:25 |
TheJulia | corvus: also, the space between each entry representing a testing change set appears to be a good place to recover some space for improving density. | 19:03 |
clarkb | I've spot checked two noble nodes in rax dfw and iad and they both have 8 vcpus | 19:32 |
clarkb | there are no running instances in ord so I haven't been able to check there yet | 19:32 |
clarkb | now to figure out how python is determining the number of cpus and see if it is different from what /proc/cpuinfo reports | 19:32
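(For reference, the usual Python sources for the CPU count, and how they can disagree with each other and with /proc/cpuinfo. This is just standard library behavior on Linux, not a claim about what the cinder job itself calls:)

```python
import multiprocessing
import os

# The usual suspects for "how many CPUs":
print("os.cpu_count():             ", os.cpu_count())
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())

# Affinity-aware count (Linux only) -- can be lower than cpu_count()
# if the process runs under a restricted mask (taskset, cgroups, etc.):
print("sched_getaffinity:          ", len(os.sched_getaffinity(0)))

# What the kernel itself reports:
with open("/proc/cpuinfo") as f:
    print("/proc/cpuinfo processors:   ",
          sum(line.startswith("processor") for line in f))
```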
corvus | TheJulia: agreed and noted | 19:38 |
clarkb | cross checking: zuul py312 jobs on noble run on rax-iad do report 8 vcpus | 19:46
corvus | via that same ansible output? or via the nox stestr concurrency thing? | 19:47 |
clarkb | corvus: via the same ansible output (the host info facts stuff we collect) | 19:48 |
clarkb | https://zuul.opendev.org/t/zuul/build/2e0a992ed6244ca1bcc3174e024f8486/log/zuul-info/host-info.ubuntu-noble.yaml#561 | 19:48 |
corvus | that seems mysterious then | 19:48 |
clarkb | ya I half wonder if there is a broken flavor in ord or something specific about the version of xen there or something | 19:49
clarkb | but still trying to collect more info | 19:49 |
corvus | i wonder if there's any possibility nodepool picked the wrong flavor | 19:49 |
clarkb | maybe? let me see if any of the other flavor details from the facts file look obviously wrong too | 19:49 |
corvus | i think we're using min-ram for that, right? so maybe there's an extra flavor in ord that matches, or maybe there's some flaw on that launcher (like it cached a bad flavor list) | 19:50 |
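(One way to check the extra-flavor hypothesis is to list every flavor in the region that a min-ram style match could select. A hedged openstacksdk sketch; the cloud and region names are placeholders for whatever is in the local clouds.yaml, and the 8192 MB threshold is just the ~8GB figure mentioned below:)

```python
import openstack

# Hypothetical clouds.yaml entry and region name.
conn = openstack.connect(cloud="rax", region_name="ORD")

# Every flavor that satisfies a min-ram >= 8GB match, smallest first --
# the order a "smallest adequate flavor" heuristic would consider them.
candidates = [f for f in conn.compute.flavors() if f.ram >= 8192]
for flavor in sorted(candidates, key=lambda f: f.ram):
    print(f"{flavor.name}: ram={flavor.ram}MB vcpus={flavor.vcpus}")
```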
clarkb | oooh the ansible_product_version differs so maybe it is a different xen setup | 19:50
corvus | oh interesting | 19:50 |
clarkb | the block device sizes and memory allocation both look like I would expect ~40GB / and ~8GB RAM | 19:51 |
* clarkb looks for opensearch details to try and query some stuff | 19:52 | |
corvus | clarkb: could write a job that fails if cpu count==1 then run it 10 times with an autohold | 19:52 |
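(The reproducer corvus describes is tiny: a job task that exits non-zero whenever only one CPU is visible, so a waiting autohold keeps the node for inspection. A minimal sketch of what that task could run:)

```python
import os
import sys

# Fail the build (tripping an autohold) whenever only 1 CPU is seen.
cpus = os.cpu_count()
print(f"os.cpu_count() reports {cpus}")
if cpus == 1:
    sys.exit("only 1 CPU visible; failing so the node gets held")
```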
clarkb | ya I'll probably fallback to that if I can't figure anything out via opensearch | 19:55 |
clarkb | side note where is this documented. I've brute forced some api access but ideally I'd get the dashboard... | 19:55 |
clarkb | https://opendev.org/openstack/neutron/src/branch/master/doc/source/contributor/policies/gate-failure-triage.rst here | 19:56 |
clarkb | making some progress | 20:01 |
clarkb | https://c6855c6fbe54327bc980-a152e227f766163de848f7eda36ee170.ssl.cf1.rackcdn.com/931005/2/check/openstack-tox-py312/c903185/job-output.txt is a successful run | 20:02
clarkb | actually let me rewind | 20:02 |
clarkb | over the last day ish there are two failures both in rax-ord and they both have the same product version. If I look for successes of the same job and project in rax-ord there are some with the newer product version | 20:03
corvus | looking at that same job that's guaranteed to fail with 1 vcpu? | 20:04 |
clarkb | yes | 20:04 |
corvus | link to the successes? | 20:04 |
clarkb | just above I linked to one (sorry it's the raw log because opensearch isn't linking to the nice one) | 20:04
corvus | oh that one above right? | 20:04 |
corvus | ya sorry | 20:04 |
clarkb | so I'm beginning to wonder if this is very hypervisor specific with the kernel that noble runs or maybe with python312? | 20:05 |
corvus | nice log: https://zuul.opendev.org/t/openstack/build/c903185a1e9f43a8a509788fbbbca04d | 20:05 |
corvus | clarkb: host id is in the inventory | 20:06 |
corvus | (if we want to try to establish/refute a hypervisior pattern) | 20:06 |
clarkb | oh cool let me try and collect successful host ids vs unsuccessful ones | 20:06 |
clarkb | looks like 3 successes in the same time frame. I'm going to expand that to say the last week instead of last day and collect more data | 20:07 |
clarkb | 7 successes over the last week | 20:07 |
corvus | the success does say 8vcpus on xen 4.7 | 20:07 |
clarkb | putting some notes in there because there is a lot of data to keep track of. Could probably script it but I don't want to deal with the opensearch api right now if I don't have to | 20:10
clarkb | https://etherpad.opendev.org/p/Kk-QduOuLQV6grR444bw er there | 20:10 |
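(Scripting this instead of clicking through the dashboard is a small amount of standard OpenSearch query DSL. A hedged sketch; the endpoint and index pattern are placeholders for the service linked from the triage doc above, and the match phrase and field names are whatever the actual failure string and log schema use:)

```python
import json
import urllib.request

# Placeholder endpoint/index -- substitute the real OpenSearch service.
URL = "https://opensearch.example.org/logstash-*/_search"

query = {
    "size": 50,
    "query": {"bool": {"must": [
        # the failure string being tracked; hypothetical here
        {"match_phrase": {"message": "minimum of 2 cpus"}},
        {"range": {"@timestamp": {"gte": "now-7d"}}},
    ]}},
}

req = urllib.request.Request(
    URL, data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]
for hit in hits:
    src = hit["_source"]
    print(src.get("build_uuid"), src.get("message"))
```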
clarkb | still just the 2 failures | 20:11
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: trivial: provide a little more context for disk size failures https://review.opendev.org/c/openstack/diskimage-builder/+/931224 | 20:11 |
corvus | i'll look up the nice urls for the failures | 20:11 |
corvus | er i mean the successes | 20:11 |
clarkb | thanks | 20:11 |
corvus | clarkb: the 2 successes have the same raw url | 20:14 |
corvus | er failures | 20:14 |
corvus | ("strike that! reverse it!") | 20:14 |
clarkb | oh let me double check that maybe the string occurs twice that I searched on and its really just the one failure | 20:15 |
clarkb | corvus: ya I think this is actually one failure with two hits in opensearch on my search string | 20:16
clarkb | so 1/8 failed 7/8 succeeded | 20:16 |
clarkb | probably going to be hard to draw conclusions based on that | 20:16 |
clarkb | the failed node is a different host_id and product version to all the others | 20:20 |
clarkb | still more of a hunch than a smoking gun but I suspect this is related | 20:20 |
clarkb | I think cardoe is on vacation but afterwards this might be a fun one to debug in conjunction with someone on the inside | 20:23 |
cardoe | clarkb: remind me next week and I can help. Or at least get someone on to help. | 20:42 |
clarkb | cardoe: thanks | 20:43 |
cardoe | I didn’t bring any devices capable of reaching out to work. Just open source stuff. | 20:43 |
clarkb | and you should be enjoying your vacation | 20:47 |
TheJulia | clarkb++ | 20:56 |
opendevreview | Jay Faulkner proposed openstack/project-config master: Send notifications for unmaintained ps to -ironic https://review.opendev.org/c/openstack/project-config/+/931240 | 22:25 |
JayF | open source work on openstack is work when you're paid to be a cloud engineer working on openstack :) | 22:28 |