jrosser | cardoe: there is no choice currently but to use a 3rd party plugin for CLI with websso. It is a shame that the client cannot do this natively. | 06:10 |
opendevreview | James E. Blair proposed openstack/project-config master: Add opendev/zuul-jobs to opendev channel config https://review.opendev.org/c/openstack/project-config/+/929118 | 14:28 |
corvus | we might want to speedy merge that ^ | 14:28 |
clarkb | corvus: done | 14:49 |
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Allow dib_elements key to be a nested list https://review.opendev.org/c/zuul/zuul-jobs/+/929123 | 14:54 |
opendevreview | Merged openstack/project-config master: Add opendev/zuul-jobs to opendev channel config https://review.opendev.org/c/openstack/project-config/+/929118 | 14:56 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config https://review.opendev.org/c/opendev/zuul-jobs/+/929139 | 15:14 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config https://review.opendev.org/c/opendev/zuul-jobs/+/929140 | 15:14 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 15:14 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add initial Zuul project config https://review.opendev.org/c/opendev/zuul-jobs/+/929139 | 15:41 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Copy DIB elements from project-config https://review.opendev.org/c/opendev/zuul-jobs/+/929140 | 15:41 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 15:41 |
mordred | corvus: left a comment on https://review.opendev.org/c/opendev/zuul-jobs/+/929141 - but COOL!!! | 15:58 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 15:59 |
corvus | mordred: good point... what's the variable for that again? :) | 15:59 |
mordred | corvus: yes! | 16:00 |
mordred | something_underscore_mirror_something_role_jinja? | 16:00 |
corvus | there's probably an "openstack_" in there somewhere | 16:01 |
clarkb | I think if you codesearch mirror_fqdn you'll find the things constructed from that value | 16:01 |
clarkb | dib also has an element that promotes that info into its test jobs | 16:01 |
clarkb | https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/elements/openstack-ci-mirrors | 16:02 |
clarkb | sorry I didn't pull up a specific answer. I've got to get to an appointment now | 16:02 |
corvus | there's zuul_site_mirror_fqdn from the site vars file | 16:04 |
corvus | yeah, looks like that's the actual var, and that's the default for mirror_fqdn in many roles | 16:05 |
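(A hedged way to act on clarkb's codesearch suggestion is to grep a local checkout of zuul-jobs for the variable; the roles/ layout below is the usual zuul-jobs structure and is assumed, not confirmed above.)

```bash
# Sketch only: find where mirror_fqdn is consumed and how it defaults from the site var
git clone https://opendev.org/zuul/zuul-jobs
grep -rn 'mirror_fqdn' zuul-jobs/roles/
```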
opendevreview | James E. Blair proposed zuul/zuul-jobs master: Fix build-diskimage playbook paths https://review.opendev.org/c/zuul/zuul-jobs/+/929147 | 16:09 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 16:10 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 16:16 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 16:32 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 16:41 |
corvus | dib is actually doing work now ^ | 16:43 |
corvus | https://zuul.opendev.org/t/opendev/build/1b111d9e6234460db0c83c3625ad4c9b/log/diskimage-debian-bullseye.log#1337 | 16:52 |
clarkb | corvus: that comes from openstack/project-config/nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list | 17:20 |
clarkb | I suspect we can just make that a python3 script | 17:20 |
clarkb | in fact it already has six.moves compat handling so it may just work under python3 | 17:21 |
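(A minimal sketch of the interpreter swap being discussed, assuming the script currently starts with `#!/usr/bin/env python`; the actual 929166 change may do more than this.)

```bash
# Hypothetical one-liner: point the element's shebang at python3 in place
sed -i '1s|env python$|env python3|' \
  nodepool/elements/openstack-repos/extra-data.d/50-create-repos-list
```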
clarkb | I'm writing a cleanup of related things in those elements; will get it pushed shortly | 17:30 |
opendevreview | Clark Boylan proposed openstack/project-config master: Convert python2 dib element scripts to python3 https://review.opendev.org/c/openstack/project-config/+/929166 | 17:37 |
clarkb | corvus: ^ can you depends on that? | 17:37 |
clarkb | not sure if the tenant situation makes that sad | 17:37 |
corvus | we don't have os/pc in the opendev tenant | 17:53 |
corvus | oh but that's the nodepool elements | 17:53 |
corvus | i mean the opendev nodepool elements | 17:53 |
corvus | they're being copied over in 929140 so we just need to copy that | 17:54 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 17:56 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Convert python2 dib element scripts to python3 https://review.opendev.org/c/opendev/zuul-jobs/+/929168 | 17:56 |
corvus | clarkb: ^ like that | 17:57 |
opendevreview | Clark Boylan proposed openstack/project-config master: Add rockylinux nodes to openmetal provider https://review.opendev.org/c/openstack/project-config/+/929169 | 17:58 |
opendevreview | Clark Boylan proposed openstack/project-config master: Add nested virt labels to raxflex and openmetal providers https://review.opendev.org/c/openstack/project-config/+/929170 | 17:58 |
clarkb | corvus: oh I didn't realize we were copying the elements over. Probably worth noting that once we're transitioning we have to keep them in sync | 17:59 |
clarkb | infra-root ^ 929170 is the thing I mentioned we could do as a next step for raxflex and then noticed openmetal could use the same treatment as well as some cleanup | 18:00 |
corvus | clarkb: yeah; i think that'll be pretty easy, but if we decide we don't like that, we could add the repo to the tenant and the job's required-projects. | 18:01 |
corvus | clarkb: fungi i suspect our tcp connection issues in rax flex are beginning to manifest in zuul-nox-remote tests | 18:09 |
corvus | example failure: https://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed04 | 18:10 |
corvus | the test that failed there is a multi-node functional test where one node acts like a zuul executor and another acts like a nodepool worker node, and we verify that the zuul console streaming works. that entails making a connection from one node to the other on the zuul streaming tcp port | 18:11 |
corvus | i don't have a smoking gun, but i suspect that a long delay in connecting might manifest like that test failure (where we just never saw the streaming output, even though the command did actually run) | 18:12 |
corvus | https://zuul.opendev.org/t/zuul/build/3cb96e4f68654401812c6a01fdabed04/log/job-output.txt#6027 | 18:16 |
corvus | oh yeah there's the smoking gun | 18:16 |
corvus | we allow 5 seconds to connect | 18:17 |
clarkb | neat | 18:19 |
clarkb | corvus: I wonder if we should hold a node and then try and run some "benchmarking" tools to figure out if the problem is at a network layer or maybe cpu level? | 18:20 |
corvus | mm good idea | 18:20 |
clarkb | it's possible that tcp isn't getting syn/ack'd fast enough because our cpu is busy or assigned to some other VM (cpu steal). Or maybe the packets are never flowing in the first place | 18:20 |
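(A rough sketch of the cpu-side check on a held node, assuming standard tooling is present; the "st" column is hypervisor steal time.)

```bash
# Sketch: sample cpu stats for 10 seconds while reproducing; a high "st" column would
# support the cpu-steal theory, while an idle cpu points back at the network layer
vmstat 1 10
```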
corvus | that test/job is really reliable, so i'll just put a hold on it; it'll probably be what we need | 18:20 |
clarkb | ++ | 18:21 |
corvus | https://zuul.opendev.org/t/zuul/autohold/0000000087 | 18:22 |
corvus | we'll have two nodes to tcpdump on too, so that's nice | 18:24 |
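(A hedged capture sketch for those held nodes; 19885 is the zuul console streaming port mentioned further down, and which port the test daemons actually bind is not confirmed here.)

```bash
# Sketch: watch whether the SYN ever arrives / gets answered on the streaming port
sudo tcpdump -nn -i any 'tcp port 19885'
```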
corvus | https://zuul.opendev.org/t/opendev/build/5da585bd9c8549b09013393ed54de136 dib wants yaml | 18:26 |
clarkb | corvus: pyyaml is listed in dib's requirements | 18:27 |
corvus | does that need to be a system package? ie apt-get install python3-yaml? | 18:27 |
clarkb | oh we're using /usr/bin/env python3 and that must do a venv escape? | 18:27 |
clarkb | ya I think it needs to be at the system level if we're using /usr/bin/env python3 | 18:28 |
corvus | maybe? just an initial guess | 18:28 |
clarkb | or we need to get that shebang invocation to use the venv that dib is in which should have pyyaml installed | 18:28 |
corvus | wonder which is more "correct" from a dib element development perspective? | 18:28 |
corvus | should custom elements expect to work in dib's environment or the system env? | 18:28 |
clarkb | my initial hunch/reaction is the system env | 18:29 |
clarkb | since we're using a bunch of other system tools like bash, debootstrap, vhd-util, qemu-img etc etc | 18:29 |
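(A minimal sketch of the "system env" option, assuming a Debian/Ubuntu build host: make pyyaml importable by elements that run under `#!/usr/bin/env python3` by installing the distro package.)

```bash
# Sketch only: satisfy the element's yaml import at the system level
sudo apt-get install -y python3-yaml
```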
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 18:33 |
corvus | the opendev tenant has zero configuration errors or warnings (yay!) | 18:35 |
corvus | https://zuul.opendev.org/t/zuul/autohold/0000000087 triggered on a rax-flex node as expected | 18:41 |
corvus | oh wait that is not a multinode job | 18:42 |
corvus | that's one node serving two roles | 18:42 |
corvus | which means this might be a fip issue | 18:42 |
corvus | https://paste.opendev.org/show/bs3X3PjNa9pAMTqx74ie/ | 18:51 |
corvus | we have 3 zuul console daemons running (as expected). the "real" one is on 19885, and it answers when i connect to either the fip or the local addr. the 2 daemons started by the test only answer on the local ip. | 18:52 |
corvus | is this a security groups issue? | 18:53 |
fungi | i'm not really around today (or tomorrow) but in theory the mirror instance could exhibit the same problems? though if it's just networking issues impacting a subset of hypervisor hosts that could be harder to track down (maybe correlate to nova host-id?) | 20:09 |
Clark[m] | corvus: the security groups should be set wide open by the cloud launcher tooling but it is possible that didn't happen as expected. | 20:26 |
Clark[m] | It could be local iptables rules too | 20:26 |
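(Quick hedged checks for both suspects; the security group name "default" and the exact ruleset layout are assumptions.)

```bash
# Sketch: inspect the node's local firewall and the cloud-side rules
sudo iptables -L INPUT -n -v --line-numbers
openstack security group rule list default
```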
corvus | yeah i was thinking the same thing about iptables | 20:27 |
corvus | of the two, i think that's the only one where we have that port listed | 20:28 |
corvus | i'm wondering if our iptables rules on worker nodes have outlived their usefulness | 20:29 |
Clark[m] | When projects open things up improperly we do occasionally get emails from providers complaining | 20:30 |
Clark[m] | OVH and the old inet clouds in particular would complain about services that could be used in reflection attacks. Out of the box I think our nodes are fine; it's the job workload we're trying to isolate from the world | 20:31 |
corvus | we have 6 job-specific iptables rules built into our default ruleset | 20:32 |
corvus | (ie, we don't expect nova and neutron to adjust those rules, we do it for them in the base image) | 20:32 |
corvus | that's what prompted me to think that perhaps we're taking the wrong approach | 20:33 |
corvus | but in this case, we have a cloud that uniquely routes traffic from localhost to the shade/nodepool "interface ip" through the input filter, because the host's interface ip isn't on the host. | 20:35 |
Clark[m] | Oh ha | 20:35 |
fungi | is it the same in openmetal too? we use fips there as well right? | 20:36 |
corvus | i don't think we can add a general iptables rule for that. so i think we either need to change iptables in the zuul job, or see if we can use the internal ip (i don't know if other parts of that job rely on the interface ip or not; probably not, but i'm not 100% sure) | 20:37 |
corvus | fungi: good q | 20:37 |
corvus | our openmetal mirror knows its ip address | 20:38 |
corvus | ^public | 20:38 |
fungi | oh, we might have an actual global address pool there instead of fips, right | 20:38 |
corvus | yeah, i think this may be a cloud configuration we've not seen since the hpcloud days? | 20:39 |
fungi | something similar came up with an openstack job, and i think we decided that "internal" address zuul reports for the node will always be bound to an interface regardless of whether there's a fip for the public address | 20:39 |
corvus | note that there may be difficulty running openafs servers in this environment | 20:40 |
corvus | (there might be a way to do it now, but with extra configuration) | 20:41 |
Clark[m] | corvus swift switched to the internal IP and it worked for them | 20:44 |
Clark[m] | fungi no FIPs in openmetal | 20:44 |
fungi | okay, yeah https://zuul.opendev.org/t/openstack/build/baafa2316de049b7b30d823cf5f34f00/log/zuul-info/inventory.yaml#17-23 shows an example | 20:44 |
Clark[m] | Last fip cloud was the citynetwork stuff iirc | 20:45 |
fungi | so when there's only one interface and it has a globally routable ip address, private_ipv4 and public_ipv4 end up being equivalent | 20:45 |
Clark[m] | Yes | 20:45 |
fungi | and coincidentally, it's openmetal in that example, and does have a globally routable address bound to the interface | 20:46 |
corvus | Clark: yeah, i mean i agree that could be a solution, i'm just saying i haven't paged in everything about this job to know what's communicating with what yet. if any part of it involves communicating from the executor to the node, then internal won't work. i suspect that's not the case, and it will work. it just needs testing with this particular job | 20:46 |
corvus | yes, this doesn't work because we're routing from the host to the host via an ip which is not on the host and not in a subnet the host thinks is local, so it has to go out to the gateway and come back | 20:47 |
corvus | i wonder what happened on citycloud, since i suspect some version of this job was running there | 20:47 |
fungi | and yes, openafs servers are a good point if we end up expanding/relocating control plane services into flex. maybe we can ask them whether they can make publicly routable assignments too, similar to what openmetal is doing | 20:47 |
corvus | would be nice to have a choice | 20:48 |
fungi | though that's less likely to be useful for the job nodes since we'd be sitting on a large v4 allocation equivalent to our quota there | 20:49 |
corvus | remote: https://review.opendev.org/c/zuul/zuul/+/929189 Remove ZUUL_REMOTE_IP from nox-remote job [NEW] | 20:49 |
fungi | also i wonder if some of the job issue could be sidestepped once we have working v6 there | 20:50 |
fungi | which is supposedly coming to flex soon, just no eta | 20:50 |
corvus | that's my first attempt (to use localhost instead) ^ | 20:51 |
corvus | nope | 20:53 |
* fungi disappears back into the ether | 20:53 |
clarkb | ya I suspect ipv6 might alleviate some of this. Fundamentally floating ips serve two purposes, and only having public ips via floating ips addresses the first one: lack of ipv4 addresses, so you allocate a block and have people use them sparingly compared to all of their backend instances | 21:00 |
clarkb | (the other reason is that you might want to have a fixed ip for a service over time) | 21:00 |
clarkb | but also they only make sense once you exceed a certain scale of ip usage. The reason openmetal doesn't use floating IPs is that all the extra router devices involved consume extra IPs | 21:01 |
clarkb | so in that situation where we've got like a /26 or whatever it is, it's more economical to have a single public network everything is attached to in order to save a few IPs | 21:01 |
corvus | on the other topic -- it looks like we're at the point where we run out of space because we're trying to clone all the repos (expected). so we need to turn the git cache on the host (/opt/cache/git i think) into a dib-compatible repo cache. | 21:05 |
clarkb | I suspect that may be as easy as doing a local (hard linked if possible) git clone from the /opt/cache/git space into /opt/dib_cache (or whatever the path is in the job) space | 21:06 |
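(A sketch of that clone idea, assuming the /opt/cache/git layout mentioned above; the destination cache name, discussed a bit further down, is left as a placeholder here.)

```bash
# Sketch: a local clone hardlinks objects when source and destination share a filesystem
git clone --local /opt/cache/git/opendev.org/zuul/zuul \
    /opt/dib_cache/source-repositories/zuul_<cache-name-sha1>
```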
corvus | or we could just rename since we don't need them anymore | 21:07 |
corvus | it looks like they have a weird naming structure | 21:07 |
corvus | /opt/dib_cache/source-repositories/zuul_website_media_9b81b80caa18a094de47403a415032bb0ec52bbc | 21:08 |
clarkb | I think it's just the last portion of the project name (after the last /) then the sha? maybe there is some punctuation normalization happening too | 21:08 |
corvus | is it the source-repositories role that makes that name? or whatever it is that feeds the list to source repos? | 21:08 |
corvus | CACHE_NAME=$(echo "${REPOTYPE}_${REPOLOCATION}" | sha1sum | awk '{ print $1 }' ) | 21:09 |
corvus | looks like source-repos | 21:09 |
corvus | so at a high level, we "just" need to re-do that same translation for everything in /opt/cache/git and construct a series of mv commands | 21:11 |
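(A hedged sketch of that translation, assuming REPOTYPE is "git", REPOLOCATION is the https://opendev.org/... URL, and the name normalization just maps dots and dashes to underscores; none of that is confirmed above.)

```bash
# Sketch: recompute source-repositories' cache name for each cached repo and mv it into place
for repo in /opt/cache/git/opendev.org/*/*; do
    location="https://${repo#/opt/cache/git/}"
    sha=$(echo "git_${location}" | sha1sum | awk '{ print $1 }')
    name="$(basename "$repo" | tr '.-' '__')_${sha}"
    mv "$repo" "/opt/dib_cache/source-repositories/${name}"
done
```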
clarkb | another option may be to update source-repositories to take a cache location to fetch from first, then update from upstream | 21:11 |
clarkb | off the top of my head I'm not sure if that would be particularly complicated | 21:12 |
corvus | (a really easy version of that would be to tell it the cache is upstream, but i bet that would end up with different sha1sums) | 21:12 |
corvus | clarkb: if we did that, we might run a bunch of git commands which might be slower than mv's? | 21:13 |
clarkb | corvus: yes in the case where the filesystems differ that is likely going to be true | 21:13 |
clarkb | (I think git defaults to hardlinking on the same fs) | 21:13 |
corvus | let's assume same filesystem | 21:13 |
clarkb | in that case I would think it would still be pretty fast. Maybe not quite as fast but within reasonable bounds | 21:14 |
clarkb | since a mv is one inode update but hardlinking is still many | 21:14 |
corvus | still a git clone is going to do more work than a mv; it's going to process a lot of data, copy some things, hardlink others, and perform a checkout | 21:14 |
clarkb | yes the mv idea is likely the most efficient | 21:14 |
corvus | (but risks getting out of sync with dib) | 21:14 |
corvus | https://paste.opendev.org/show/bPPFNCOHcXhKUSFGSatg/ | 21:27 |
corvus | there's the incantation | 21:27 |
clarkb | that seems straightforward enough that doing the mv is probably fine | 21:29 |
clarkb | and if it were to change I would expect a sha1 to sha256sum which is a straightforward update too | 21:29 |
corvus | i'll see if i can turn that into a script to do mv's | 21:29 |
corvus | regarding the remote test and flex -- the change to use the private ip passed on ovh -- so it should at least not be a regression | 21:31 |
corvus | i think we should merge that now and let the gate backlog run it through its paces | 21:31 |
clarkb | works for me let me review it really quickly | 21:31 |
corvus | ++thx | 21:31 |
clarkb | https://review.opendev.org/c/zuul/zuul/+/929189 I think it was that one and I've approved it | 21:32 |
corvus | yep, thanks! | 21:32 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:04 |
corvus | okay that includes a stab at an "undo the cache" | 22:06 |
clarkb | make the cache more cashier | 22:06 |
corvus | "reverse the polarity" | 22:06 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:11 |
clarkb | I'm going to clean up the held etherpad 2.2.4 test nodes including the one with the db from prod in it. We're on 2.2.4 and if we need to do additional testing of the meetpad fix(es) we can do a new hold | 22:28 |
clarkb | the etherpad maintainer seems to be active in the middle of the night relative to my timezone so hopefully we get some pointers overnight | 22:30 |
clarkb | infra-root in my followup email with rax about the flex cloud I mentioned that we could add this cloud to the nested virt flavors if we (opendev) feel ready for that. I've proposed that change in this stack https://review.opendev.org/c/openstack/project-config/+/929169 (it actually happens in the child as this parent change fixes an inconsistency with the openmetal cloud first) | 22:34 |
clarkb | If we think that is a bad idea feel free to make note of that either here or in the change. I suspect that johnsom in particular would be willing to try it out as that gives us more quota for those special jobs | 22:35 |
johnsom | +1 | 22:36 |
clarkb | the other thing we could do is use swift in that cloud for job logs. I wonder if we could put weights on the random assignment of swift service for job log storage so that we don't have to go straight to 1/6 of the logs writing to this region | 22:39 |
clarkb | but maybe that's a good test of their swift installation and worth doing anyway | 22:39 |
corvus | clarkb: a different kind of test would be to start uploading dib images into that swift | 22:42 |
corvus | (that's an upcoming step for the dib job) | 22:42 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:43 |
clarkb | corvus: oh interesting idea and that would probably be easier to control from a ramp up perspective | 22:44 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:51 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:55 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 22:59 |
corvus | that appears to be using the cache \o/ | 23:06 |
clarkb | nice | 23:07 |
corvus | the cache update still isn't fast, but it's probably no slower than what we do now. probably worth double checking | 23:07 |
clarkb | I think it is one of the slowest parts of the build | 23:08 |
corvus | took 2m20s for reversing the cache | 23:09 |
corvus | 2m of that was the find command | 23:10 |
corvus | if we made that more efficient (maybe using a glob, or a python program that understood the directory structure) so we're not recursing too deep, we could probably remove most of that time. | 23:10 |
corvus | (also maybe starting the mvs while the find is still running) | 23:11 |
clarkb | you can limit the depth with find too | 23:11 |
corvus | yeah looks like -maxdepth 4 would do it | 23:12 |
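(A sketch of the bounded scan, assuming the cache is rooted at /opt/cache/git with repos laid out host/namespace/project, hence depth 4 to each clone's .git.)

```bash
# Sketch: stop find from descending into every repo's object store
find /opt/cache/git -maxdepth 4 -type d -name .git -printf '%h\n'
```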
corvus | that's assuming we never go deeper, which we can in gerrit... | 23:13 |
corvus | also, looks like we will need /opt/dib_tmp -- /tmp is insufficient :) | 23:13 |
clarkb | oh yup if /tmp is in memory we'll definitely need a disk location | 23:13 |
corvus | i figured; i just also figured i'd see where it breaks :) | 23:14 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 23:16 |
opendevreview | James E. Blair proposed opendev/zuul-jobs master: Add debian-bullseye image https://review.opendev.org/c/opendev/zuul-jobs/+/929141 | 23:25 |
corvus | 20s for the cache reverse now | 23:34 |
corvus | looks like we might need to think a bit about the disk space available on our various clouds | 23:46 |
corvus | it looks like rax-ord just has a single 40G / based on https://zuul.opendev.org/t/opendev/build/62ea40645cf4441a9b2961d440eb139c/log/zuul-info/zuul-info.ubuntu-noble.txt | 23:48 |
corvus | https://etherpad.opendev.org/p/opendev-cloud-disks | 23:55 |
corvus | i think there's an unused ephemeral volume, right? maybe we should mount that at dib_tmp for these builds? | 23:58 |
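(A rough sketch of that ephemeral-disk idea; the device name varies by provider and flavor, so /dev/xvde is only illustrative, and the disk likely needs a filesystem first.)

```bash
# Sketch: give dib a disk-backed temp dir instead of a memory-backed /tmp
sudo mkfs.ext4 /dev/xvde
sudo mkdir -p /opt/dib_tmp
sudo mount /dev/xvde /opt/dib_tmp
```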