Thursday, 2024-09-12

jrossercardoe: there is no choice currently but to use a 3rd party plugin for CLI with websso. It is a shame that the client cannot do this natively.06:10
opendevreviewJames E. Blair proposed openstack/project-config master: Add opendev/zuul-jobs to opendev channel config  https://review.opendev.org/c/openstack/project-config/+/92911814:28
corvuswe might want to speedy merge that ^14:28
clarkbcorvus: done14:49
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Allow dib_elements key to be a nested list  https://review.opendev.org/c/zuul/zuul-jobs/+/92912314:54
opendevreviewMerged openstack/project-config master: Add opendev/zuul-jobs to opendev channel config  https://review.opendev.org/c/openstack/project-config/+/92911814:56
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config  https://review.opendev.org/c/opendev/zuul-jobs/+/92913915:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config  https://review.opendev.org/c/opendev/zuul-jobs/+/92914015:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config  https://review.opendev.org/c/opendev/zuul-jobs/+/92913915:41
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config  https://review.opendev.org/c/opendev/zuul-jobs/+/92914015:41
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:41
mordredcorvus: left a comment on https://review.opendev.org/c/opendev/zuul-jobs/+/929141 - but COOL!!!15:58
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:59
corvusmordred:  good point... what's the variable for that again?  :)15:59
mordredcorvus: yes!16:00
mordredsomething_underscore_mirror_something_role_jinja?16:00
corvusthere's probably an "openstack_" in there somewhere16:01
clarkbI think if you codesearch mirror_fqdn you'll find the things constructed from that value16:01
clarkbdib also has an element that promotes that info into its test jobs16:01
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/openstack-ci-mirrors16:02
clarkbsorry I didn't pull up a specific answer. I've got to get to an appointment now16:02
corvusthere's zuul_site_mirror_fqdn from the site vars file16:04
corvusyeah, looks like that's the actual var, and that's the default for mirror_fqdn in many roles16:05
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Fix build-diskimage playbook paths  https://review.opendev.org/c/zuul/zuul-jobs/+/92914716:09
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:10
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:16
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:32
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:41
corvusdib is actually doing work now ^16:43
corvushttps://zuul.opendev.org/t/opendev/build/1b111d9e6234460db0c83c3625ad4c9b/log/diskimage-debian-bullseye.log#133716:52
clarkbcorvus: that comes from openstack/project-config/nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list17:20
clarkbI suspect we can just make that a python3 script17:20
clarkbin fact it already has six.moves code handling so it may just work under python317:21
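A quick way to sanity-check that hunch before pushing a conversion would be something along these lines; the element path is the one mentioned just above, and the py_compile pass only proves the syntax parses under python3, not that the six.moves imports resolve at runtime:

    # Syntax-check the element script under python3; a clean exit means it at
    # least parses, though runtime behaviour still needs testing.
    python3 -m py_compile nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list

    # Optionally confirm six is importable on the build host, since the script
    # still imports six.moves (whether six is installed there is an assumption).
    python3 -c "import six" || echo "six not available; drop the six.moves usage"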
clarkbI'm writing a cleanup of related things in those elements; will get it pushed shortly17:30
opendevreviewClark Boylan proposed openstack/project-config master: Convert python2 dib element scripts to python3  https://review.opendev.org/c/openstack/project-config/+/92916617:37
clarkbcorvus: ^ can you Depends-On that?17:37
clarkbnot sure if the tenant situation makes that sad17:37
corvuswe don't have os/pc in the opendev tenant17:53
corvusoh but that's the nodepool elements17:53
corvusi mean the opendev nodepool elements17:53
corvusthey're being copied over in 929140 so we just need to copy that17:54
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914117:56
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Convert python2 dib element scripts to python3  https://review.opendev.org/c/opendev/zuul-jobs/+/92916817:56
corvusclarkb: ^ like that17:57
opendevreviewClark Boylan proposed openstack/project-config master: Add rockylinux nodes to openmetal provider  https://review.opendev.org/c/openstack/project-config/+/92916917:58
opendevreviewClark Boylan proposed openstack/project-config master: Add nested virt labels to raxflex and openmetal providers  https://review.opendev.org/c/openstack/project-config/+/92917017:58
clarkbcorvus: oh I didn't realize we were copying the elements over. Probably worth noting that once we're transitioning we'll have to keep them in sync17:59
clarkbinfra-root ^ 929170 is the thing I mentioned we could do as a next step for raxflex; I then noticed openmetal could use the same treatment as well as some cleanup18:00
corvusclarkb: yeah; i think that'll be pretty easy, but if we decide we don't like that, we could add the repo to the tenant and the job's required-projects.18:01
corvusclarkb: fungi i suspect our tcp connection issues in rax flex are beginning to manifest in zuul-nox-remote tests18:09
corvusexample failure: https://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed0418:10
corvusthe test that failed there is a multi-node functional test where one node acts like a zuul executor and another acts like a nodepool worker node, and we verify that the zuul console streaming works.  that entails making a connection from one node to the other on the zuul streaming tcp port18:11
corvusi don't have a smoking gun, but i suspect that a long delay in connecting might manifest like that test failure (where we just never saw the streaming output, even though the command did actually run)18:12
corvushttps://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed04/log/job-output.txt#602718:16
corvusoh yeah there's the smoking gun18:16
corvuswe allow 5 seconds to connect18:17
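For reference, the kind of check a held node makes easy; a minimal sketch, assuming the zuul_console daemon is on its usual port 19885 and with NODE_IP as a placeholder for whatever address the test actually uses:

    # Time how long a bare TCP connect to the console streamer takes; the test
    # described above gives up after 5 seconds, so anything near that is suspect.
    NODE_IP=203.0.113.10   # placeholder; substitute the node's interface or floating IP
    time timeout 5 bash -c "cat < /dev/null > /dev/tcp/${NODE_IP}/19885" \
      && echo "connected" || echo "no connection within 5s"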
clarkbneat18:19
clarkbcorvus: I wonder if we should hold a node and then try and run some "benchmarking" tools to figure out if the problem is at a network layer or maybe cpu level?18:20
corvusmm good idea18:20
clarkbit's possible that tcp isn't getting syn ack syn'd fast enough because our cpu is busy or assigned to some other VM (cpu steal). Or maybe the packets are never flowing in the first place18:20
corvusthat test/job is really reliable, so i'll just put a hold on it; it'll probably be what we need18:20
clarkb++18:21
corvushttps://zuul.opendev.org/t/zuul/autohold/000000008718:22
corvuswe'll have two nodes to tcpdump on too, so that's nice18:24
corvushttps://zuul.opendev.org/t/opendev/build/5da585bd9c8549b09013393ed54de136 dib wants yaml18:26
clarkbcorvus: pyyaml is listed in dib's requirements18:27
corvusdoes that need to be a system package? ie apt-get install python3-yaml?18:27
clarkboh we're using /usr/bin/env python3 and that must do a venv escape?18:27
clarkbya I think it needs to be at the system level if we're using /usr/bin/env python318:28
corvusmaybe?  just an initial guess18:28
clarkbor we need to get that shebang invocation to use the venv that dib is in which should have pyyaml installed18:28
corvuswonder which is more "correct" from a dib element development perspective?18:28
corvusshould custom elements expect to work in dib's environment or the system env?18:28
clarkbmy initial hunch/reaction is the system env18:29
clarkbsince we're using a bunch of other system tools like bash, debootstrap, vhd-util, qemu-img etc etc18:29
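One way to see which interpreter such an element actually gets, and whether yaml is importable there, is to compare what /usr/bin/env resolves to against the venv dib runs from; a rough sketch (the venv path is a guess):

    # What does the element's shebang resolve to on this PATH?
    env python3 -c "import sys; print(sys.executable)"

    # Can that interpreter import yaml?  If not, python3-yaml (or pip-installed
    # PyYAML) is needed wherever the element's script runs.
    env python3 -c "import yaml; print(yaml.__version__)" \
      || echo "yaml missing from the interpreter the element uses"

    # Compare against the venv dib itself was installed into (path is assumed).
    /opt/dib-venv/bin/python3 -c "import yaml; print('yaml available in dib venv')"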
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914118:33
corvusthe opendev tenant has zero configuration errors or warnings (yay!)18:35
corvushttps://zuul.opendev.org/t/zuul/autohold/0000000087 triggered on a rax-flex node as expected18:41
corvusoh wait that is not a multinode job18:42
corvusthat's one node serving two roles18:42
corvuswhich means this might be a fip issue18:42
corvushttps://paste.opendev.org/show/bs3X3PjNa9pAMTqx74ie/18:51
corvuswe have 3 zuul console daemons running (as expected).  the "real" one is on 19885, and it answers when i connect to either the fip or the local addr.  the 2 daemons started by the test only answer on the local ip.18:52
corvusis this a security groups issue?18:53
fungii'm not really around today (or tomorrow) but in theory the mirror instance could exhibit the same problems? though if it's just networking issues impacting a subset of hypervisor hosts that could be harder to track down (maybe correlate to nova host-id?)20:09
Clark[m]corvus: the security groups should be set wide open by the cloud launcher tooling but it is possible that didn't happen as expected.20:26
Clark[m]It could be local iptables rules too20:26
corvusyeah i was thinking the same thing about iptables20:27
corvusof the two, i think iptables is the only one where we have that port listed20:28
corvusi'm wondering if our iptables rules on worker nodes have outlived their usefulness20:29
Clark[m]When projects open things up improperly we do occasionally get emails from providers complaining20:30
Clark[m]OVH and the old inet clouds in particular would complain about services that could be used in reflection attacks. Out of the box I think our nodes are fine; it's the job workload we're trying to isolate from the world20:31
corvuswe have 6 job-specific iptables rules built into our default ruleset20:32
corvus(ie, we don't expect nova and neutron to adjust those rules, we do it for them in the base image)20:32
corvusthat's what prompted me to think that perhaps we're taking the wrong approach20:33
corvusbut in this case, we have a cloud that uniquely routes traffic from localhost to the shade/nodepool "interface ip" through the input filter, because the host's interface ip isn't on the host.20:35
Clark[m]Oh ha20:35
fungiis it the same in openmetal too? we use fips there as well right?20:36
corvusi don't think we can add a general iptables rule for that.  so i think we either need to change iptables in the zuul job, or see if we can use the internal ip (i don't know if other parts of that job rely on the interface ip or not; probably not, but i'm not 100% sure)20:37
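If the "change iptables in the zuul job" route were taken, it would presumably look something like opening the test's console-streamer ports in a setup step; a hedged sketch, not what the job currently does, with the port as a placeholder since the real ports are whatever the test picks:

    # Allow inbound connections to the console daemons the test starts, so that
    # traffic hairpinned through the floating IP isn't dropped by the INPUT chain.
    TEST_CONSOLE_PORT=19887   # placeholder for the port(s) the test daemons bind
    sudo iptables -I INPUT -p tcp --dport "${TEST_CONSOLE_PORT}" -j ACCEPT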
corvusfungi: good q20:37
corvusour openmetal mirror knows its ip address20:38
corvus^public20:38
fungioh, we might have an actual global address pool there instead of fips, right20:38
corvusyeah, i think this may be a cloud configuration we've not seen since the hpcloud days?20:39
fungisomething similar came up with an openstack job, and i think we decided that the "internal" address zuul reports for the node will always be bound to an interface regardless of whether there's a fip for the public address20:39
corvusnote that there may be difficulty running openafs servers in this environment20:40
corvus(there might be a way to do it now, but with extra configuration)20:41
Clark[m]corvus: swift switched to the internal IP and it worked for them20:44
Clark[m]fungi: no FIPs in openmetal20:44
fungiokay, yeah https://zuul.opendev.org/t/openstack/build/baafa2316de049b7b30d823cf5f34f00/log/zuul-info/inventory.yaml#17-23 shows an example20:44
Clark[m]Last fip cloud was the citynetwork stuff iirc20:45
fungiso when there's only one interface and it has a globally routable ip address, private_ipv4 and public_ipv4 end up being equivalent20:45
Clark[m]Yes20:45
fungiand coincidentally, it's openmetal in that example, and does have a globally routable address bound to the interface20:46
corvusClark: yeah, i mean i agree that could be a solution, i'm just saying i haven't paged in everything about this job to know what's communicating with what yet.  if any part of it involves communicating from the executor to the node, then internal won't work.  i suspect that's not the case, and it will work.  it just needs testing with this particular job20:46
corvusyes, this doesn't work because we're routing from the host to the host via an ip which is not on the host and not in a subnet the host thinks is local, so it has to go out to the gateway and come back20:47
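That routing behaviour is easy to confirm from the node itself; a small sketch using placeholder addresses (203.0.113.10 stands in for the node's floating IP, 10.0.0.23 for its fixed IP):

    # The node's own floating IP is not bound locally, so the kernel routes it
    # via the default gateway rather than the loopback path:
    ip route get 203.0.113.10
    #   expected output along the lines of: 203.0.113.10 via 10.0.0.1 dev eth0 ...

    # Whereas the fixed IP on the interface stays local:
    ip route get 10.0.0.23
    #   expected output along the lines of: local 10.0.0.23 dev lo ...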
corvusi wonder what happened on citycloud, since i suspect some version of this job was running there20:47
fungiand yes, openafs servers are a good point if we end up expanding/relocating control plane services into flex. maybe we can ask them whether they can make publicly routable assignments too, similar to what openmetal is doing20:47
corvuswould be nice to have a choice20:48
fungithough that's less likely to be useful for the job nodes since we'd be sitting on a large v4 allocation equivalent to our quota there20:49
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/929189 Remove ZUUL_REMOTE_IP from nox-remote job [NEW]        20:49
fungialso i wonder if some of the job issue could be sidestepped once we have working v6 there20:50
fungiwhich is supposedly coming to flex soon, just no eta20:50
corvusthat's my first attempt (to use localhost instead) ^20:51
corvusnope20:53
* fungi disappears back into the ether20:53
clarkbya I suspect ipv6 might alleviate some of this. Fundamentally floating ips serve two purposes, and only having public ips via floating ips addresses one of them: scarcity of ipv4 addresses, where you want to allocate a block and then have people use them sparingly compared to all of their backend instances21:00
clarkb(the other reason is that you might want to have a fixed ip for a service over time)21:00
clarkbbut also they only make sense once you exceed a certain scale of ip usage. The reason openmetal doesn't use floating IPs is that all the extra router devices involved consume extra IPs21:01
clarkbso in that situation where we've got like a /26 or whatever it is, it's more economical to have a single public network everything is attached to in order to save a few IPs21:01
corvuson the other topic -- it looks like we're at the point where we run out of space because we're trying to clone all the repos (expected).  so we need to turn the git cache on the host (/opt/cache/git i think) into a dib-compatible repo cache.21:05
clarkbI suspect that may be as easy as doing a local (hard linked if possible) git clone from the /opt/cache/git space into /opt/dib_cache (or whatever the path is in the job) space21:06
corvusor we could just rename since we don't need them anymore21:07
corvusit looks like they have a weird naming structure21:07
corvus/opt/dib_cache/source-repositories/zuul_website_media_9b81b80caa18a094de47403a415032bb0ec52bbc21:08
clarkbI think it's just the last portion of the project name (after the last /) then the sha? maybe there is some punctuation normalization happening too21:08
corvusis it the source-repositories role that makes that name?  or whatever it is that feeds the list to source repos?21:08
corvus            CACHE_NAME=$(echo "${REPOTYPE}_${REPOLOCATION}" | sha1sum | awk '{ print $1 }' )21:09
corvuslooks like source-repos21:09
corvusso at a high level, we "just" need to re-do that same translation for everything in /opt/cache/git and construct a series of mv commands21:11
clarkbanother option may be to update source-repositories to take a cache location to fetch from first, then update from upstream21:11
clarkboff the top of my head I'm not sure if that would be particularly complicated21:12
corvus(a really easy version of that would be to tell it the cache is upstream, but i bet that would end up with different sha1sums)21:12
corvusclarkb: if we did that, we might run a bunch of git commands which might be slower than mv's?21:13
clarkbcorvus: yes in the case where the filesystems differ that is likely going to be true21:13
clarkb(I think git defaults to hardlinking on the same fs)21:13
corvuslet's assume same filesystem21:13
clarkbin that case I would think it would still be pretty fast. Maybe not quite as fast but within reasonable bounds21:14
clarkbsince a mv is one inode update but hardlinking is still many21:14
corvusstill a git clone is going to do more work than a mv; it's going to process a lot of data, copy some things, hardlink others, and perform a checkout21:14
clarkbyes the mv idea is likely the most efficient21:14
corvus(but risks getting out of sync with dib)21:14
corvushttps://paste.opendev.org/show/bPPFNCOHcXhKUSFGSatg/21:27
corvusthere's the incantation21:27
clarkbthat seems straightforward enough that doing the mv is probably fine21:29
clarkband if it were to change I would expect a sha1 to sha256sum switch, which is a straightforward update too21:29
corvusi'll see if i can turn that into a script to do mv's21:29
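A rough sketch of what such a script might look like, reusing the source-repositories naming scheme quoted above; the cache paths, the "git" repo type, the URL reconstruction, and the punctuation normalization are all assumptions, and the change pushed later in this log is the authoritative version:

    #!/bin/bash
    # Rename repos from the on-host git cache into the layout the dib
    # source-repositories element appears to use: <name>_<sha1 of "TYPE_LOCATION">.
    CACHE_SRC=/opt/cache/git                          # assumed on-host cache path
    CACHE_DST=/opt/dib_cache/source-repositories      # assumed dib cache path

    mkdir -p "${CACHE_DST}"
    # Depth 3 assumes a host/namespace/project layout under the cache; gerrit
    # projects can nest deeper, so a real version needs to handle that too.
    find "${CACHE_SRC}" -mindepth 3 -maxdepth 3 -type d | while read -r repo; do
        # Reconstruct the upstream URL from the cached path (assumption: https).
        location="https://${repo#${CACHE_SRC}/}"
        # dib seems to normalize punctuation in the repo name (e.g. zuul_website_media);
        # tr here is a guess at that normalization.
        name=$(basename "${repo}" | tr '-.' '__')
        sha=$(echo "git_${location}" | sha1sum | awk '{ print $1 }')
        mv "${repo}" "${CACHE_DST}/${name}_${sha}"
    done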
corvusregarding the remote test and flex -- the change to use the private ip passed on ovh -- so it should at least not be a regression21:31
corvusi think we should merge that now and let the gate backlog run it through its paces21:31
clarkbworks for me let me review it really quickly21:31
corvus++thx21:31
clarkbhttps://review.opendev.org/c/zuul/zuul/+/929189 I think it was that one and I've approved it21:32
corvusyep, thanks!21:32
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:04
corvusokay that includes a stab at an "undo the cache"22:06
clarkbmake the cache more cashier22:06
corvus"reverse the polarity"22:06
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:11
clarkbI'm going to clean up the held etherpad 2.2.4 test nodes including the one with the db from prod in it. We're on 2.2.4 and if we need to do additional testing of the meetpad fix(es) we can do a new hold22:28
clarkbthe etherpad maintainer seems to be active in the middle of the night relative to my timezone so hopefully we get some pointers overnight22:30
clarkbinfra-root in my follow-up email with rax about the flex cloud I mentioned that we could add this cloud to the nested virt flavors if we (opendev) feel ready for that. I've proposed that change in this stack https://review.opendev.org/c/openstack/project-config/+/929169 (it actually happens in the child, as this parent change fixes an inconsistency with the openmetal cloud first)22:34
clarkbIf we think that is a bad idea feel free to make note of that either here or in the change. I suspect that johnsom in particular would be willing to try it out as that gives us more quota for those special jobs22:35
johnsom+122:36
clarkbthe other thing we could do is use swift in that cloud for job logs. I wonder if we could put weights on the random assignment of swift service for job log storage so that we don't have to go straight to 1/6 of the logs writing to this region22:39
clarkbbut maybe that's a good test of their swift installation and worth doing anyway22:39
corvusclarkb: a different kind of test would be to start uploading dib images into that swift22:42
corvus(that's an upcoming step for the dib job)22:42
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:43
clarkbcorvus: oh interesting idea and that would probably be easier to control from a ramp up perspective22:44
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:51
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:55
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:59
corvusthat appears to be using the cache \o/23:06
clarkbnice23:07
corvusthe cache update still isn't fast, but it's probably no slower than what we do now.  probably worth double checking23:07
clarkbI think it is one of the slowest parts of the build23:08
corvustook 2m20s for reversing the cache23:09
corvus2m of that was the find command23:10
corvusif we made that more efficient (maybe using a glob, or a python program that understood the directory structure) so we're not recursing too deep, we could probably remove most of that time.23:10
corvus(also maybe starting the mvs while the find is still running)23:11
clarkbyou can limit the depth with find too23:11
corvusyeah looks like -maxdepth 4 would do it23:12
corvusthat's assuming we never go deeper, which we can in gerrit...23:13
corvusalso, looks like we will need /opt/dib_tmp -- /tmp is insufficient :)23:13
clarkboh yup if /tmp is in memory we'll definitely need a disk location23:13
corvusi figured; i just also figured i'd see where it breaks :)23:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914123:16
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914123:25
corvus20s for the cache reverse now23:34
corvuslooks like we might need to think a bit about the disk space available on our various clouds23:46
corvusit looks like rax-ord just has a single 40G / based on https://zuul.opendev.org/t/opendev/build/62ea40645cf4441a9b2961d440eb139c/log/zuul-info/zuul-info.ubuntu-noble.txt23:48
corvushttps://etherpad.opendev.org/p/opendev-cloud-disks23:55
corvusi think there's an unused ephemeral volume, right?  maybe we should mount that at dib_tmp for these builds?23:58
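If the ephemeral-volume route is taken, the setup on a rax node would presumably be along these lines; purely a sketch, with the device name as a placeholder since it varies by provider and flavor:

    # Format and mount the (assumed unused) ephemeral volume for dib's scratch space.
    EPHEMERAL_DEV=/dev/xvde      # placeholder; check lsblk for the real device
    sudo mkfs.ext4 "${EPHEMERAL_DEV}"
    sudo mkdir -p /opt/dib_tmp
    sudo mount "${EPHEMERAL_DEV}" /opt/dib_tmp
    # How dib gets pointed at /opt/dib_tmp (an environment variable or a setting
    # in the build job) is a separate change.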
