Thursday, 2024-09-12

jrossercardoe: there is no choice currently but to use a 3rd party plugin for CLI with websso. It is a shame that the client cannot do this natively.06:10
opendevreviewJames E. Blair proposed openstack/project-config master: Add opendev/zuul-jobs to opendev channel config  https://review.opendev.org/c/openstack/project-config/+/92911814:28
corvuswe might want to speedy merge that ^14:28
clarkbcorvus: done14:49
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Allow dib_elements key to be a nested list  https://review.opendev.org/c/zuul/zuul-jobs/+/92912314:54
opendevreviewMerged openstack/project-config master: Add opendev/zuul-jobs to opendev channel config  https://review.opendev.org/c/openstack/project-config/+/92911814:56
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config  https://review.opendev.org/c/opendev/zuul-jobs/+/92913915:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config  https://review.opendev.org/c/opendev/zuul-jobs/+/92914015:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config  https://review.opendev.org/c/opendev/zuul-jobs/+/92913915:41
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config  https://review.opendev.org/c/opendev/zuul-jobs/+/92914015:41
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:41
mordredcorvus: left a comment on https://review.opendev.org/c/opendev/zuul-jobs/+/929141 - but COOL!!!15:58
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914115:59
corvusmordred:  good point... what's the variable for that again?  :)15:59
mordredcorvus: yes!16:00
mordredsomething_underscore_mirror_something_role_jinja?16:00
corvusthere's probably an "openstack_" in there somewhere16:01
clarkbI think if you codesearch mirror_fqdn you'll find the things constructed from that value16:01
clarkbdib also has an element that promotes that info into its test jobs16:01
clarkbhttps://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/openstack-ci-mirrors16:02
clarkbsorry I didn't pull up a specific answer. I've got to get to an appointment now16:02
corvusthere's zuul_site_mirror_fqdn from the site vars file16:04
corvusyeah, looks like that's the actual var, and that's the default for mirror_fqdn in many roles16:05
opendevreviewJames E. Blair proposed zuul/zuul-jobs master: Fix build-diskimage playbook paths  https://review.opendev.org/c/zuul/zuul-jobs/+/92914716:09
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:10
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:16
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:32
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914116:41
corvusdib is actually doing work now ^16:43
corvushttps://zuul.opendev.org/t/opendev/build/1b111d9e6234460db0c83c3625ad4c9b/log/diskimage-debian-bullseye.log#133716:52
clarkbcorvus: that comes from openstack/project-config/nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list17:20
clarkbI suspect we can just make that a python3 script17:20
clarkbin fact it already has six.moves code handling so it may just work under python317:21
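A quick way to sanity-check that hunch before pushing a conversion would be something along these lines; the element path is the one mentioned just above, and the py_compile pass only proves the syntax parses under python3, not that the six.moves imports resolve at runtime:

    # Syntax-check the element script under python3; a clean exit means it at
    # least parses, though runtime behaviour still needs testing.
    python3 -m py_compile nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list

    # Optionally confirm six is importable on the build host, since the script
    # still imports six.moves (whether six is installed there is an assumption).
    python3 -c "import six" || echo "six not available; drop the six.moves usage"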
clarkbI'm writing a cleanup of related things in those elements; will get it pushed shortly17:30
opendevreviewClark Boylan proposed openstack/project-config master: Convert python2 dib element scripts to python3  https://review.opendev.org/c/openstack/project-config/+/92916617:37
clarkbcorvus: ^ can you Depends-On that?17:37
clarkbnot sure if the tenant situation makes that sad17:37
corvuswe don't have os/pc in the opendev tenant17:53
corvusoh but that's the nodepool elements17:53
corvusi mean the opendev nodepool elements17:53
corvusthey're being copied over in 929140 so we just need to copy that17:54
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914117:56
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Convert python2 dib element scripts to python3  https://review.opendev.org/c/opendev/zuul-jobs/+/92916817:56
corvusclarkb: ^ like that17:57
opendevreviewClark Boylan proposed openstack/project-config master: Add rockylinux nodes to openmetal provider  https://review.opendev.org/c/openstack/project-config/+/92916917:58
opendevreviewClark Boylan proposed openstack/project-config master: Add nested virt labels to raxflex and openmetal providers  https://review.opendev.org/c/openstack/project-config/+/92917017:58
clarkbcorvus: oh I didn't realize we were copying the elements over. Probably worth noting that once we're transitioning we'll have to keep them in sync17:59
clarkbinfra-root ^ 929170 is the thing I mentioned we could do as a next step for raxflex; I then noticed openmetal could use the same treatment as well as some cleanup18:00
corvusclarkb: yeah; i think that'll be pretty easy, but if we decide we don't like that, we could add the repo to the tenant and the job's required-projects.18:01
corvusclarkb: fungi i suspect our tcp connection issues in rax flex are beginning to manifest in zuul-nox-remote tests18:09
corvusexample failure: https://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed0418:10
corvusthe test that failed there is a multi-node functional test where one node acts like a zuul executor and another acts like a nodepool worker node, and we verify that the zuul console streaming works.  that entails making a connection from one node to the other on the zuul streaming tcp port18:11
corvusi don't have a smoking gun, but i suspect that a long delay in connecting might manifest like that test failure (where we just never saw the streaming output, even though the command did actually run)18:12
corvushttps://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed04/log/job-output.txt#602718:16
corvusoh yeah there's the smoking gun18:16
corvuswe allow 5 seconds to connect18:17
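For reference, the kind of check a held node makes easy; a minimal sketch, assuming the zuul_console daemon is on its usual port 19885 and with NODE_IP as a placeholder for whatever address the test actually uses:

    # Time how long a bare TCP connect to the console streamer takes; the test
    # described above gives up after 5 seconds, so anything near that is suspect.
    NODE_IP=203.0.113.10   # placeholder; substitute the node's interface or floating IP
    time timeout 5 bash -c "cat < /dev/null > /dev/tcp/${NODE_IP}/19885" \
      && echo "connected" || echo "no connection within 5s"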
clarkbneat18:19
clarkbcorvus: I wonder if we should hold a node and then try and run some "benchmarking" tools to figure out if the problem is at a network layer or maybe cpu level?18:20
corvusmm good idea18:20
clarkbit's possible that tcp isn't getting syn ack syn'd fast enough because our cpu is busy or assigned to some other VM (cpu steal). Or maybe the packets are never flowing in the first place18:20
corvusthat test/job is really reliable, so i'll just put a hold on it; it'll probably be what we need18:20
clarkb++18:21
corvushttps://zuul.opendev.org/t/zuul/autohold/000000008718:22
corvuswe'll have two nodes to tcpdump on too, so that's nice18:24
corvushttps://zuul.opendev.org/t/opendev/build/5da585bd9c8549b09013393ed54de136 dib wants yaml18:26
clarkbcorvus: pyyaml is listed in dib's requirements18:27
corvusdoes that need to be a system package? ie apt-get install python3-yaml?18:27
clarkboh we're using /usr/bin/env python3 and that must do a venv escape?18:27
clarkbya I think it needs to be at the system level if we're using /usr/bin/env python318:28
corvusmaybe?  just an initial guess18:28
clarkbor we need to get that shebang invocation to use the venv that dib is in which should have pyyaml installed18:28
corvuswonder which is more "correct" from a dib element development perspective?18:28
corvusshould custom elements expect to work in dib's environment or the system env?18:28
clarkbmy initial hunch/reaction is the system env18:29
clarkbsince we're using a bunch of other system tools like bash, debootstrap, vhd-util, qemu-img etc etc18:29
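One way to see which interpreter such an element actually gets, and whether yaml is importable there, is to compare what /usr/bin/env resolves to against the venv dib runs from; a rough sketch (the venv path is a guess):

    # What does the element's shebang resolve to on this PATH?
    env python3 -c "import sys; print(sys.executable)"

    # Can that interpreter import yaml?  If not, python3-yaml (or pip-installed
    # PyYAML) is needed wherever the element's script runs.
    env python3 -c "import yaml; print(yaml.__version__)" \
      || echo "yaml missing from the interpreter the element uses"

    # Compare against the venv dib itself was installed into (path is assumed).
    /opt/dib-venv/bin/python3 -c "import yaml; print('yaml available in dib venv')"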
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914118:33
corvusthe opendev tenant has zero configuration errors or warnings (yay!)18:35
corvushttps://zuul.opendev.org/t/zuul/autohold/0000000087 triggered on a rax-flex node as expected18:41
corvusoh wait that is not a multinode job18:42
corvusthat's one node serving two roles18:42
corvuswhich means this might be a fip issue18:42
corvushttps://paste.opendev.org/show/bs3X3PjNa9pAMTqx74ie/18:51
corvuswe have 3 zuul console daemons running (as expected).  the "real" one is on 19885, and it answers when i connect to either the fip or the local addr.  the 2 daemons started by the test only answer on the local ip.18:52
corvusis this a security groups issue?18:53
fungii'm not really around today (or tomorrow) but in theory the mirror instance could exhibit the same problems? though if it's just networking issues impacting a subset of hypervisor hosts that could be harder to track down (maybe correlate to nova host-id?)20:09
Clark[m]corvus: the security groups should be set wide open by the cloud launcher tooling but it is possible that didn't happen as expected.20:26
Clark[m]It could be local iptables rules too20:26
corvusyeah i was thinking the same thing about iptables20:27
corvusof the two, i think iptables is the only one where we have that port listed20:28
corvusi'm wondering if our iptables rules on worker nodes have outlived their usefulness20:29
Clark[m]When projects open things up improperly we do occasionally get emails from providers complaining20:30
Clark[m]OVH and the old inet clouds in particular would complain about services that could be used in reflection attacks. Out of the box I think our nodes are fine; it's the job workload we're trying to isolate from the world20:31
corvuswe have 6 job-specific iptables rules built into our default ruleset20:32
corvus(ie, we don't expect nova and neutron to adjust those rules, we do it for them in the base image)20:32
corvusthat's what prompted me to think that perhaps we're taking the wrong approach20:33
corvusbut in this case, we have a cloud that uniquely routes traffic from localhost to the shade/nodepool "interface ip" through the input filter, because the host's interface ip isn't on the host.20:35
Clark[m]Oh ha20:35
fungiis it the same in openmetal too? we use fips there as well right?20:36
corvusi don't think we can add a general iptables rule for that.  so i think we either need to change iptables in the zuul job, or see if we can use the internal ip (i don't know if other parts of that job rely on the interface ip or not; probably not, but i'm not 100% sure)20:37
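If the "change iptables in the zuul job" route were taken, it would presumably look something like opening the test's console-streamer ports in a setup step; a hedged sketch, not what the job currently does, with the port as a placeholder since the real ports are whatever the test picks:

    # Allow inbound connections to the console daemons the test starts, so that
    # traffic hairpinned through the floating IP isn't dropped by the INPUT chain.
    TEST_CONSOLE_PORT=19887   # placeholder for the port(s) the test daemons bind
    sudo iptables -I INPUT -p tcp --dport "${TEST_CONSOLE_PORT}" -j ACCEPT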
corvusfungi: good q20:37
corvusour openmetal mirror knows its ip address20:38
corvus^public20:38
fungioh, we might have an actual global address pool there instead of fips, right20:38
corvusyeah, i think this may be a cloud configuration we've not seen since the hpcloud days?20:39
fungisomething similar came up with an openstack job, and i think we decided that the "internal" address zuul reports for the node will always be bound to an interface regardless of whether there's a fip for the public address20:39
corvusnote that there may be difficulty running openafs servers in this environment20:40
corvus(there might be a way to do it now, but with extra configuration)20:41
Clark[m]corvus: swift switched to the internal IP and it worked for them20:44
Clark[m]fungi: no FIPs in openmetal20:44
fungiokay, yeah https://zuul.opendev.org/t/openstack/build/baafa2316de049b7b30d823cf5f34f00/log/zuul-info/inventory.yaml#17-23 shows an example20:44
Clark[m]Last fip cloud was the citynetwork stuff iirc20:45
fungiso when there's only one interface and it has a globally routable ip address, private_ipv4 and public_ipv4 end up being equivalent20:45
Clark[m]Yes20:45
fungiand coincidentally, it's openmetal in that example, and does have a globally routable address bound to the interface20:46
corvusClark: yeah, i mean i agree that could be a solution, i'm just saying i haven't paged in everything about this job to know what's communicating with what yet.  if any part of it involves communicating from the executor to the node, then internal won't work.  i suspect that's not the case, and it will work.  it just needs testing with this particular job20:46
corvusyes, this doesn't work because we're routing from the host to the host via an ip which is not on the host and not in a subnet the host thinks is local, so it has to go out to the gateway and come back20:47
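That routing behaviour is easy to confirm from the node itself; a small sketch using placeholder addresses (203.0.113.10 stands in for the node's floating IP, 10.0.0.23 for its fixed IP):

    # The node's own floating IP is not bound locally, so the kernel routes it
    # via the default gateway rather than the loopback path:
    ip route get 203.0.113.10
    #   expected output along the lines of: 203.0.113.10 via 10.0.0.1 dev eth0 ...

    # Whereas the fixed IP on the interface stays local:
    ip route get 10.0.0.23
    #   expected output along the lines of: local 10.0.0.23 dev lo ...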
corvusi wonder what happened on citycloud, since i suspect some version of this job was running there20:47
fungiand yes, openafs servers are a good point if we end up expanding/relocating control plane services into flex. maybe we can ask them whether they can make publicly routable assignments too, similar to what openmetal is doing20:47
corvuswould be nice to have a choice20:48
fungithough that's less likely to be useful for the job nodes since we'd be sitting on a large v4 allocation equivalent to our quota there20:49
corvusremote:   https://review.opendev.org/c/zuul/zuul/+/929189 Remove ZUUL_REMOTE_IP from nox-remote job [NEW]        20:49
fungialso i wonder if some of the job issue could be sidestepped once we have working v6 there20:50
fungiwhich is supposedly coming to flex soon, just no eta20:50
corvusthat's my first attempt (to use localhost instead) ^20:51
corvusnope20:53
* fungi disappears back into the ether20:53
clarkbya I suspect ipv6 might alleviate some of this. Fundamentally floating ips serve two purposes, and only having public ips via floating ips addresses one of them: scarcity of ipv4 addresses, where you want to allocate a block and then have people use them sparingly compared to all of their backend instances21:00
clarkb(the other reason is that you might want to have a fixed ip for a service over time)21:00
clarkbbut also they only make sense once you exceed a certain scale of ip usage. The reason openmetal doesn't use floating IPs is that all the extra router devices involved consume extra IPs21:01
clarkbso in that situation where we've got like a /26 or whatever it is, it's more economical to have a single public network everything is attached to in order to save a few IPs21:01
corvuson the other topic -- it looks like we're at the point where we run out of space because we're trying to clone all the repos (expected).  so we need to turn the git cache on the host (/opt/cache/git i think) into a dib-compatible repo cache.21:05
clarkbI suspect that may be as easy as doing a local (hard linked if possible) git clone from the /opt/cache/git space into /opt/dib_cache (or whatever the path is in the job) space21:06
corvusor we could just rename since we don't need them anymore21:07
corvusit looks like they have a weird naming structure21:07
corvus/opt/dib_cache/source-repositories/zuul_website_media_9b81b80caa18a094de47403a415032bb0ec52bbc21:08
clarkbI think it's just the last portion of the project name (after the last /) then the sha? maybe there is some punctuation normalization happening too21:08
corvusis it the source-repositories role that makes that name?  or whatever it is that feeds the list to source repos?21:08
corvus            CACHE_NAME=$(echo "${REPOTYPE}_${REPOLOCATION}" | sha1sum | awk '{ print $1 }' )21:09
corvuslooks like source-repos21:09
corvusso at a high level, we "just" need to re-do that same translation for everything in /opt/cache/git and construct a series of mv commands21:11
clarkbanother option may be to update source-repositories to take a cache location to fetch from first, then update from upstream21:11
clarkboff the top of my head I'm not sure if that would be particularly complicated21:12
corvus(a really easy version of that would be to tell it the cache is upstream, but i bet that would end up with different sha1sums)21:12
corvusclarkb: if we did that, we might run a bunch of git commands which might be slower than mv's?21:13
clarkbcorvus: yes in the case where the filesystems differ that is likely going to be true21:13
clarkb(I think git defaults to hardlinking on the same fs)21:13
corvuslet's assume same filesystem21:13
clarkbin that case I would think it would still be pretty fast. Maybe not quite as fast but within reasonable bounds21:14
clarkbsince a mv is one inode update but hardlinking is still many21:14
corvusstill a git clone is going to do more work than a mv; it's going to process a lot of data, copy some things, hardlink others, and perform a checkout21:14
clarkbyes the mv idea is likely the most efficient21:14
corvus(but risks getting out of sync with dib)21:14
corvushttps://paste.opendev.org/show/bPPFNCOHcXhKUSFGSatg/21:27
corvusthere's the incantation21:27
clarkbthat seems straightforward enough that doing the mv is probably fine21:29
clarkband if it were to change I would expect a sha1 to sha256sum switch, which is a straightforward update too21:29
corvusi'll see if i can turn that into a script to do mv's21:29
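A rough sketch of what such a script might look like, reusing the source-repositories naming scheme quoted above; the cache paths, the "git" repo type, the URL reconstruction, and the punctuation normalization are all assumptions, and the change pushed later in this log is the authoritative version:

    #!/bin/bash
    # Rename repos from the on-host git cache into the layout the dib
    # source-repositories element appears to use: <name>_<sha1 of "TYPE_LOCATION">.
    CACHE_SRC=/opt/cache/git                          # assumed on-host cache path
    CACHE_DST=/opt/dib_cache/source-repositories      # assumed dib cache path

    mkdir -p "${CACHE_DST}"
    # Depth 3 assumes a host/namespace/project layout under the cache; gerrit
    # projects can nest deeper, so a real version needs to handle that too.
    find "${CACHE_SRC}" -mindepth 3 -maxdepth 3 -type d | while read -r repo; do
        # Reconstruct the upstream URL from the cached path (assumption: https).
        location="https://${repo#${CACHE_SRC}/}"
        # dib seems to normalize punctuation in the repo name (e.g. zuul_website_media);
        # tr here is a guess at that normalization.
        name=$(basename "${repo}" | tr '-.' '__')
        sha=$(echo "git_${location}" | sha1sum | awk '{ print $1 }')
        mv "${repo}" "${CACHE_DST}/${name}_${sha}"
    done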
corvusregarding the remote test and flex -- the change to use the private ip passed on ovh -- so it should at least not be a regression21:31
corvusi think we should merge that now and let the gate backlog run it through its paces21:31
clarkbworks for me let me review it really quickly21:31
corvus++thx21:31
clarkbhttps://review.opendev.org/c/zuul/zuul/+/929189 I think it was that one and I've approved it21:32
corvusyep, thanks!21:32
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:04
corvusokay that includes a stab at an "undo the cache"22:06
clarkbmake the cache more cashier22:06
corvus"reverse the polarity"22:06
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:11
clarkbI'm going to clean up the held etherpad 2.2.4 test nodes including the one with the db from prod in it. We're on 2.2.4 and if we need to do additional testing of the meetpad fix(es) we can do a new hold22:28
clarkbthe etherpad maintainer seems to be active in the middle of the night relative to my timezone so hopefully we get some pointers overnight22:30
clarkbinfra-root in my follow-up email with rax about the flex cloud I mentioned that we could add this cloud to the nested virt flavors if we (opendev) feel ready for that. I've proposed that change in this stack https://review.opendev.org/c/openstack/project-config/+/929169 (it actually happens in the child, as this parent change fixes an inconsistency with the openmetal cloud first)22:34
clarkbIf we think that is a bad idea feel free to make note of that either here or in the change. I suspect that johnsom in particular would be willing to try it out as that gives us more quota for those special jobs22:35
johnsom+122:36
clarkbthe other thing we could do is use swift in that cloud for job logs. I wonder if we could put weights on the random assignment of swift service for job log storage so that we don't have to go straight to 1/6 of the logs writing to this region22:39
clarkbbut maybe that's a good test of their swift installation and worth doing anyway22:39
corvusclarkb: a different kind of test would be to start uploading dib images into that swift22:42
corvus(that's an upcoming step for the dib job)22:42
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:43
clarkbcorvus: oh interesting idea and that would probably be easier to control from a ramp up perspective22:44
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:51
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:55
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914122:59
corvusthat appears to be using the cache \o/23:06
clarkbnice23:07
corvusthe cache update still isn't fast, but it's probably no slower than what we do now.  probably worth double checking23:07
clarkbI think it is one of the slowest parts of the build23:08
corvustook 2m20s for reversing the cache23:09
corvus2m of that was the find command23:10
corvusif we made that more efficient (maybe using a glob, or a python program that understood the directory structure) so we're not recursing too deep, we could probably remove most of that time.23:10
corvus(also maybe starting the mvs while the find is still running)23:11
clarkbyou can limit the depth with find too23:11
corvusyeah looks like -maxdepth 4 would do it23:12
corvusthat's assuming we never go deeper, which we can in gerrit...23:13
corvusalso, looks like we will need /opt/dib_tmp -- /tmp is insufficient :)23:13
clarkboh yup if /tmp is in memory we'll definitely need a disk location23:13
corvusi figured; i just also figured i'd see where it breaks :)23:14
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914123:16
opendevreviewJames E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image  https://review.opendev.org/c/opendev/zuul-jobs/+/92914123:25
corvus20s for the cache reverse now23:34
corvuslooks like we might need to think a bit about the disk space available on our various clouds23:46
corvusit looks like rax-ord just has a single 40G / based on https://zuul.opendev.org/t/opendev/build/62ea40645cf4441a9b2961d440eb139c/log/zuul-info/zuul-info.ubuntu-noble.txt23:48
corvushttps://etherpad.opendev.org/p/opendev-cloud-disks23:55
corvusi think there's an unused ephemeral volume, right?  maybe we should mount that at dib_tmp for these builds?23:58
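If the ephemeral-volume route is taken, the setup on a rax node would presumably be along these lines; purely a sketch, with the device name as a placeholder since it varies by provider and flavor:

    # Format and mount the (assumed unused) ephemeral volume for dib's scratch space.
    EPHEMERAL_DEV=/dev/xvde      # placeholder; check lsblk for the real device
    sudo mkfs.ext4 "${EPHEMERAL_DEV}"
    sudo mkdir -p /opt/dib_tmp
    sudo mount "${EPHEMERAL_DEV}" /opt/dib_tmp
    # How dib gets pointed at /opt/dib_tmp (an environment variable or a setting
    # in the build job) is a separate change.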
