*** dhill is now known as Guest5154 | 01:03 | |
mnasiadka | fungi: I still have a feeling something is not in its best shape - I'm seeing a lot of "Timeout exception waiting for the logger. Please check connectivity to [104.239.142.11:19885]" errors, and the gerrit/zuul dashboard is not as responsive as usual - but maybe that's the release load ;-) | 07:17 |
fungi | mnasiadka: release hasn't started yet, we're about 2 minutes away | 09:58 |
fungi | mnasiadka: 104.239.142.11 isn't one of our servers... where did you see that error? | 09:59 |
fungi | are you talking about job failures or something? | 10:00 |
mnasiadka | fungi: https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505 | 10:01 |
mnasiadka | basically I have jobs that do time out - I don't know if it's possible for the logger to be the culprit | 10:01 |
fungi | mnasiadka: that message is being reported from the zuul_stream callback which tries to capture ansible output and ship it over the network: https://opendev.org/zuul/zuul/src/branch/master/zuul/ansible/base/callback/zuul_stream.py#L134-L170 | 10:10 |
fungi | you'll see it when there are long-running ansible tasks that produce no new output for a while | 10:10 |
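(The "Timeout exception waiting for the logger" message above comes from the zuul_stream callback failing to reach the log-streaming daemon on the node. A minimal sketch of that connect-and-retry pattern, illustrative only: Zuul's real callback is the zuul_stream.py linked above, and the port is taken from the error message while the timeout and retry values here are made up.)

```python
import socket
import time

LOG_STREAM_PORT = 19885  # port from the error message above


def wait_for_logger(host, port=LOG_STREAM_PORT, timeout=10.0, retries=30):
    """Illustrative sketch: keep trying to reach the node's log streamer.

    Shows why a dead streaming daemon or a blocked port surfaces as
    'Timeout exception waiting for the logger' in the job output.
    """
    for attempt in range(retries):
        try:
            # Succeeds only if the daemon is up and the port is reachable.
            return socket.create_connection((host, port), timeout=timeout)
        except (socket.timeout, OSError):
            time.sleep(1)  # daemon not started yet, or connectivity blocked
    raise TimeoutError(
        f"Timeout exception waiting for the logger. "
        f"Please check connectivity to [{host}:{port}]")
```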
fungi | in your example, it's the kolla-build task which was in progress when the job reached its configured timeout, but since that task's output is all redirected to other files you don't see it streamed in the main job output log. instead you'll need to look in the separate kolla build logfiles to find out what took so long | 10:14
mnasiadka | ah right, forgot about that | 10:14 |
fungi | infra-root: seems like it's time to get ovh to extend our service voucher again... "Dear Customer, Either there is no payment method registered on your account, or there is an issue with the default payment method that has been registered..." | 12:27 |
*** dhill is now known as Guest5211 | 12:32 | |
Clark[m] | Is amorin around? Had typically been helpful in getting that sorted out iirc | 12:34 |
amorin | hey | 12:34 |
amorin | oh, voucher issue | 12:35 |
amorin | what is your project-id ? | 12:35 |
amorin | dcaab5e32b234d56b626f72581e3644c I believe | 12:37 |
amorin | ok, I can see the voucher, will ask for a new one | 12:38 |
fungi | i can check hold on. need to re-open the inbox for that address | 12:38 |
fungi | amorin: correct "PublicCloudProject: dcaab5e32b234d56b626f72581e3644c (openstackjenkins)" | 12:39 |
fungi | that's what was in the invoice e-mail we received anyway | 12:39 |
fungi | amorin: keep in mind we have two tenants/projects in ovh, but that's the only one we got a "pending payment" message about | 12:40 |
amorin | yes, I saw your second one | 12:41
amorin | with only two instances, you still have money on the second | 12:41 |
fungi | thanks for checking! | 12:41 |
amorin | it will expire at the beginning of january for the second one, so I will try to get a new voucher for this one also, to avoid issues on january 5 | 12:42
Clark[m] | Thank you! | 12:42 |
fungi | we'll keep a close eye on notifications too | 12:42 |
amorin | that's good if you keep an eye because nobody does here :) | 12:43 |
mnasiadka | fungi: actually this task is logging output, but we're not getting any output due to those logger errors | 14:14 |
mnasiadka | fungi: here's an example: https://zuul.opendev.org/t/openstack/build/ede006ddc95d41579977142ff77c7c69/log/job-output.txt#573 | 14:15 |
Clark[m] | That example doesn't log the indications that the logger was unavailable like the other one did. Are we sure this is the same thing? | 14:20 |
fungi | you'll get the "waiting on logger" messages any time there's a delay with the callback opening the file that the task stdout/stderr is written to, but if the task starts writing output and then there's a long pause you won't see that message | 14:24 |
corvus | the initial link from mnasiadka https://zuul.opendev.org/t/openstack/build/bed8add582624e28bf5970a6bb85d0ac/log/job-output.txt#1505 looks like it never had any streaming output even in early tasks; that seems like a problem with the worker node or the connection from the executor to it. | 14:28 |
fungi | agreed, lack of connectivity from the executor to the node sounds possible (that's what the exception implies after all) | 14:30 |
fungi | could be an iptables rule on the node or maybe we have a security group that's regressed in one of our providers | 14:30 |
fungi | and has started blocking the log streaming port somehow | 14:30 |
corvus | that one was rockylinux-9 on rax-dfw | 14:31
corvus | 2024-10-02 06:07:47,062 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer | 14:32 |
corvus | 2024-10-02 07:34:37,301 ERROR zuul.log_streamer: Streaming failure for build UUID bed8add582624e28bf5970a6bb85d0ac: [Errno 104] Connection reset by peer | 14:33 |
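(A quick way to test the iptables/security-group hypothesis above is to probe the streaming port from the executor side and see how the connection fails. A hedged sketch; the node address is a placeholder, and the mapping of exceptions to causes is the usual TCP behavior, not anything Zuul-specific:)

```python
import socket


def check_stream_port(host, port=19885, timeout=5.0):
    """Probe the console streaming port. A reset/refusal means something
    is actively rejecting (daemon down, REJECT rule); a timeout suggests
    packets are silently dropped (DROP rule or security group)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionResetError:
        return "reset by peer"      # matches the Errno 104 above
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "filtered/timeout"   # e.g. firewall drop
    except OSError as exc:
        return f"error: {exc}"      # no route to host, etc.


# example (hypothetical node address):
# print(check_stream_port("203.0.113.10"))
```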
corvus | ttx, elodilles, fungi was the new zuul status overview page (the zoomed out thing with the little squares) useful during the release process? any feedback on that (pro or con)? | 14:45 |
corvus | (whenever you have a moment, no worries if you're busy :) | 14:45 |
clarkb | one thing I noticed is that it was easy for me to find and expand only the release approval pipeline | 14:46 |
fungi | corvus: with pipelines collapsed i found it more helpful than in the past, with pipelines expanded slightly less convenient than before because of the additional scrolling around and because the dynamic repositioning was a bit more noticeable. overall a net positive though for me | 14:47 |
corvus | clarkb: did you end up also opening the release-approval pipeline detail page (https://zuul.opendev.org/t/openstack/status/pipeline/release-approval) or did you just stay on the overview screen with the others minimized? | 14:47 |
clarkb | corvus: I just stayed on the overview page with others minimized as that got me the info I needed | 14:48 |
fungi | i stuck with the overview screen, but using the drill-down pages instead of expanding pipelines would probably have helped me with having them jump around, in retrospect | 14:48 |
corvus | ack. fwiw, no one loves the jumping around; we're still working on ideas to reduce/eliminate that. for now, yeah, the pipeline detail page will avoid that, or if you turn on "show all pipelines" there is less jumping around since all the empty pipelines stay on the page, but there's still some as they expand and contract... | 14:50 |
ttx | I could find the info I needed, and it's pretty intuitive. Agree the jumping around is the most annoying UX issue at this point | 14:51 |
ttx | I should try the drill-down page | 14:51 |
corvus | thanks for the feedback! | 14:52 |
elodilles | yepp, it was easier to find the jobs and see the status of the queues and jobs | 14:52 |
ttx | I did not notice there was a way to get to a pipeline-specific summary page, but now that I know there is one I could find it | 14:52 |
TheJulia | So, thought regarding the new zuul ui: I'm seeing a lot of line wrapping due to longer job names and it seems like a ton of screen "real estate" is in the padding and the width of the progress bar. It would be totally cool if we could somehow hint/suggest/save some sort of preference width-wise for the progress bar so the job name wrapping doesn't cause problems. I think the base challenge would be we would have to cut the names by | 14:53
TheJulia | like 40% which begins to lose meaning as well to avoid the wrapping. | 14:53
fungi | i think there's work in progress to condense it and get back closer to the density of the previous interface | 14:53 |
corvus | TheJulia: good point; yeah we do want to reduce the padding which should allow us to increase the information density | 14:54 |
amorin | fungi clarkb it's done | 14:55 |
amorin | (the voucher) | 14:55 |
clarkb | amorin: thank you for the quick response | 14:55 |
clarkb | and thank you for the test nodes! | 14:55 |
corvus | i know from other zuul installations that there are even much (much!) longer names out there than in openstack -- so i know we're always going to have to deal with wrapping. but we do want to minimize it. | 14:56 |
corvus | ttx: (and others) on the overview page, the pipeline name is a link to the pipeline detail page; so that's how you can get there quickly | 15:00 |
corvus | that actually gives us something we didn't have in the old page: you can do this: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=openstack%2Fneutron | 15:02 |
corvus | which, if you look at that now, is an entire page filled with nothing but neutron check queue items | 15:02 |
corvus | or the glob version, to get all neutron-related projects: https://zuul.opendev.org/t/openstack/status/pipeline/check?project=*neutron* | 15:03 |
corvus | (that last thing is probably not helpful for release, but can be helpful for projects in day-to-day work) | 15:04 |
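(The same per-project filtering is available programmatically: the dashboard pages above are rendered from the tenant status endpoint. A hedged sketch of the REST analogue of the ?project= filter; the endpoint path matches what the dashboard fetches, but the JSON field names here are inferred from the public status format and may differ across Zuul versions:)

```python
import json
import urllib.request

TENANT_STATUS = "https://zuul.opendev.org/api/tenant/openstack/status"


def items_for_project(project="openstack/neutron", pipeline="check"):
    """Yield queue items for one project in one pipeline, the
    programmatic analogue of ?project=openstack%2Fneutron above."""
    with urllib.request.urlopen(TENANT_STATUS) as resp:
        status = json.load(resp)
    for pl in status.get("pipelines", []):
        if pl.get("name") != pipeline:
            continue
        for queue in pl.get("change_queues", []):
            for head in queue.get("heads", []):   # lists of queue items
                for item in head:
                    if project in str(item.get("project", "")):
                        yield item


# for item in items_for_project():
#     print(item.get("id"))
```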
ttx | yeah once I knew it was there somewhere I could easily find it. | 15:31 |
opendevreview | Merged openstack/project-config master: Revert "Temporarily remove release docs semaphores" https://review.opendev.org/c/openstack/project-config/+/930710 | 15:47 |
clarkb | cinder is debugging an issue where a job fails because they try to use a minimum of 2 cpus but there is only one available | 15:55 |
clarkb | https://zuul.opendev.org/t/openstack/build/66e86bad2ef1422ea1783704590a84c3/log/zuul-info/host-info.ubuntu-noble.yaml#540 seems to confirm that ansible thinks there is one vcpu too | 15:55 |
clarkb | this is in rax-ord | 15:55 |
clarkb | I guess we should check if that happens for all noble instances in rax-ord or rax-* as maybe it is a kernel issue with the platform? | 15:58 |
corvus | also, could "just" be an ansible issue (ie, maybe the vm does see multiple vcpus but ansible isn't reading it right) | 16:19 |
clarkb | that's possible though I think python is also seeing one cpu which leads to the test failure? | 16:20
clarkb | could be that ansible and python share a common failure mode though since ansible is written primarily in python | 16:20
opendevreview | Clark Boylan proposed openstack/project-config master: Set bindep profiles for openstack release doc publication https://review.opendev.org/c/openstack/project-config/+/931204 | 16:35 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Update Mailman containers to latest versions https://review.opendev.org/c/opendev/system-config/+/930236 | 16:40 |
fungi | clarkb: not at all urgent, but i think i've answered or addressed your comments on 930236 whenever you have a chance to look again | 17:09 |
fungi | also updated it for the new mailman core release from yesterday | 17:10 |
fungi | hopefully tests still pass | 17:10 |
opendevreview | Merged openstack/project-config master: Set bindep profiles for openstack release doc publication https://review.opendev.org/c/openstack/project-config/+/931204 | 17:25 |
TheJulia | corvus: also, the space between each entry representing a testing change set appears to be a good place to recover some space for improving density. | 19:03 |
clarkb | I've spot checked two noble nodes in rax dfw and iad and they both have 8 vcpus | 19:32 |
clarkb | there are no running instances in ord so I haven't been able to check there yet | 19:32 |
clarkb | now to figure out how python is determining the number of cpus and see if it is different from what /proc/cpuinfo reports | 19:32
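(For reference, the usual Python sources for the CPU count, and how they can disagree with each other and with /proc/cpuinfo. This is just standard library behavior on Linux, not a claim about what the cinder job itself calls:)

```python
import multiprocessing
import os

# The usual suspects for "how many CPUs":
print("os.cpu_count():             ", os.cpu_count())
print("multiprocessing.cpu_count():", multiprocessing.cpu_count())

# Affinity-aware count (Linux only) -- can be lower than cpu_count()
# if the process runs under a restricted mask (taskset, cgroups, etc.):
print("sched_getaffinity:          ", len(os.sched_getaffinity(0)))

# What the kernel itself reports:
with open("/proc/cpuinfo") as f:
    print("/proc/cpuinfo processors:   ",
          sum(line.startswith("processor") for line in f))
```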
corvus | TheJulia: agreed and noted | 19:38 |
clarkb | cross checking: zuul py312 jobs on noble run on rax-iad do report 8 vcpus | 19:46
corvus | via that same ansible output? or via the nox stestr concurrency thing? | 19:47 |
clarkb | corvus: via the same ansible output (the host info facts stuff we collect) | 19:48 |
clarkb | https://zuul.opendev.org/t/zuul/build/2e0a992ed6244ca1bcc3174e024f8486/log/zuul-info/host-info.ubuntu-noble.yaml#561 | 19:48 |
corvus | that seems mysterious then | 19:48 |
clarkb | ya I half wonder if there is a broken flavor in ord or something specific about the version of xen there or something | 19:49
clarkb | but still trying to collect more info | 19:49 |
corvus | i wonder if there's any possibility nodepool picked the wrong flavor | 19:49 |
clarkb | maybe? let me see if any of the other flavor details from the facts file look obviously wrong too | 19:49 |
corvus | i think we're using min-ram for that, right? so maybe there's an extra flavor in ord that matches, or maybe there's some flaw on that launcher (like it cached a bad flavor list) | 19:50 |
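(One way to check the extra-flavor hypothesis is to list every flavor in the region that a min-ram style match could select. A hedged openstacksdk sketch; the cloud and region names are placeholders for whatever is in the local clouds.yaml, and the 8192 MB threshold is just the ~8GB figure mentioned below:)

```python
import openstack

# Hypothetical clouds.yaml entry and region name.
conn = openstack.connect(cloud="rax", region_name="ORD")

# Every flavor that satisfies a min-ram >= 8GB match, smallest first --
# the order a "smallest adequate flavor" heuristic would consider them.
candidates = [f for f in conn.compute.flavors() if f.ram >= 8192]
for flavor in sorted(candidates, key=lambda f: f.ram):
    print(f"{flavor.name}: ram={flavor.ram}MB vcpus={flavor.vcpus}")
```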
clarkb | oooh the ansible_product_version differs so maybe it is a different xen setup | 19:50
corvus | oh interesting | 19:50 |
clarkb | the block device sizes and memory allocation both look like I would expect ~40GB / and ~8GB RAM | 19:51 |
* clarkb looks for opensearch details to try and query some stuff | 19:52 | |
corvus | clarkb: could write a job that fails if cpu count==1 then run it 10 times with an autohold | 19:52 |
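(The reproducer corvus describes is tiny: a job task that exits non-zero whenever only one CPU is visible, so a waiting autohold keeps the node for inspection. A minimal sketch of what that task could run:)

```python
import os
import sys

# Fail the build (tripping an autohold) whenever only 1 CPU is seen.
cpus = os.cpu_count()
print(f"os.cpu_count() reports {cpus}")
if cpus == 1:
    sys.exit("only 1 CPU visible; failing so the node gets held")
```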
clarkb | ya I'll probably fallback to that if I can't figure anything out via opensearch | 19:55 |
clarkb | side note where is this documented. I've brute forced some api access but ideally I'd get the dashboard... | 19:55 |
clarkb | https://opendev.org/openstack/neutron/src/branch/master/doc/source/contributor/policies/gate-failure-triage.rst here | 19:56 |
clarkb | making some progress | 20:01 |
clarkb | https://c6855c6fbe54327bc980-a152e227f766163de848f7eda36ee170.ssl.cf1.rackcdn.com/931005/2/check/openstack-tox-py312/c903185/job-output.txt is a successful run | 20:02
clarkb | actually let me rewind | 20:02 |
clarkb | over the last day ish there are two failures both in rax-ord and they both have the same product version. If I look for successes of the same job and project in rax-ord there are some with the newer product version | 20:03
corvus | looking at that same job that's guaranteed to fail with 1 vcpu? | 20:04 |
clarkb | yes | 20:04 |
corvus | link to the successes? | 20:04 |
clarkb | just above I linked to one (sorry it's the raw log because opensearch isn't linking to the nice one) | 20:04
corvus | oh that one above right? | 20:04 |
corvus | ya sorry | 20:04 |
clarkb | so I'm beginning to wonder if this is very hypervisor specific with the kernel that noble runs or maybe with python312? | 20:05 |
corvus | nice log: https://zuul.opendev.org/t/openstack/build/c903185a1e9f43a8a509788fbbbca04d | 20:05 |
corvus | clarkb: host id is in the inventory | 20:06 |
corvus | (if we want to try to establish/refute a hypervisior pattern) | 20:06 |
clarkb | oh cool let me try and collect successful host ids vs unsuccessful ones | 20:06 |
clarkb | looks like 3 successes in the same time frame. I'm going to expand that to say the last week instead of last day and collect more data | 20:07 |
clarkb | 7 successes over the last week | 20:07 |
corvus | the success does say 8vcpus on xen 4.7 | 20:07 |
clarkb | putting some notes in there because there is a lot of data to keep track of. Could probably script it but I don't want to deal with the opensearch api right now if I don't have to | 20:10
clarkb | https://etherpad.opendev.org/p/Kk-QduOuLQV6grR444bw er there | 20:10 |
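(Scripting this instead of clicking through the dashboard is a small amount of standard OpenSearch query DSL. A hedged sketch; the endpoint and index pattern are placeholders for the service linked from the triage doc above, and the match phrase and field names are whatever the actual failure string and log schema use:)

```python
import json
import urllib.request

# Placeholder endpoint/index -- substitute the real OpenSearch service.
URL = "https://opensearch.example.org/logstash-*/_search"

query = {
    "size": 50,
    "query": {"bool": {"must": [
        # the failure string being tracked; hypothetical here
        {"match_phrase": {"message": "minimum of 2 cpus"}},
        {"range": {"@timestamp": {"gte": "now-7d"}}},
    ]}},
}

req = urllib.request.Request(
    URL, data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]
for hit in hits:
    src = hit["_source"]
    print(src.get("build_uuid"), src.get("message"))
```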
clarkb | still just the 2 failures | 20:11
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: trivial: provide a little more context for disk size failures https://review.opendev.org/c/openstack/diskimage-builder/+/931224 | 20:11 |
corvus | i'll look up the nice urls for the failures | 20:11 |
corvus | er i mean the successes | 20:11 |
clarkb | thanks | 20:11 |
corvus | clarkb: the 2 successes have the same raw url | 20:14 |
corvus | er failures | 20:14 |
corvus | ("strike that! reverse it!") | 20:14 |
clarkb | oh let me double check that maybe the string occurs twice that I searched on and its really just the one failure | 20:15 |
clarkb | corvus: ya I think this is actually one failure with two hits in opensearch on my search string | 20:16
clarkb | so 1/8 failed 7/8 succeeded | 20:16 |
clarkb | probably going to be hard to draw conclusions based on that | 20:16 |
clarkb | the failed node is a different host_id and product version to all the others | 20:20 |
clarkb | still more of a hunch than a smoking gun but I suspect this is related | 20:20 |
clarkb | I think cardoe is on vacation but afterwards this might be a fun one to debug in conjunction with someone on the inside | 20:23 |
cardoe | clarkb: remind me next week and I can help. Or at least get someone on to help. | 20:42 |
clarkb | cardoe: thanks | 20:43 |
cardoe | I didn’t bring any devices capable of reaching out to work. Just open source stuff. | 20:43 |
clarkb | and you should be enjoying your vacation | 20:47 |
TheJulia | clarkb++ | 20:56 |
opendevreview | Jay Faulkner proposed openstack/project-config master: Send notifications for unmaintained ps to -ironic https://review.opendev.org/c/openstack/project-config/+/931240 | 22:25 |
JayF | open source work on openstack is work when you're paid to be a cloud engineer working on openstack :) | 22:28 |