clarkb | though the easiest thing might be to bounce through a proxy or vpn :( | 00:00 |
---|---|---|
tonyb | --- mirror01.ca-ymq-1.vexxhost.opendev.org ping statistics --- | 00:00 |
tonyb | 30 packets transmitted, 29 received, 3.33333% packet loss, time 29148ms | 00:00 |
tonyb | rtt min/avg/max/mdev = 2003.815/2693.335/3720.201/371.795 ms, pipe 4 | 00:00 |
ianw | yeah i get a bit of packet loss to mirror.iad.rax.opendev.org too | 00:01 |
tonyb | Also the reverse DNS is interesting | 00:01 |
tonyb | tony@thor:~$ host mirror.ca-ymq-1.vexxhost.opendev.org | 00:01 |
tonyb | mirror.ca-ymq-1.vexxhost.opendev.org is an alias for mirror01.ca-ymq-1.vexxhost.opendev.org. | 00:01 |
tonyb | mirror01.ca-ymq-1.vexxhost.opendev.org has address 199.204.45.149 | 00:01 |
tonyb | mirror01.ca-ymq-1.vexxhost.opendev.org has IPv6 address 2604:e100:1:0:f816:3eff:fe0c:e2c0 | 00:01 |
tonyb | tony@thor:~$ host 199.204.45.149 | 00:01 |
tonyb | 149.45.204.199.in-addr.arpa domain name pointer abla-4.albalisaude2.com.br. | 00:01 |
ianw | 7.14286% packet loss | 00:02 |
ianw | Perhaps the USA has implemented a 10% tariff on incoming packets crossing the Gulf of America? | 00:03 |
tonyb | LOL | 00:03 |
clarkb | re reverse DNS that might be something mnaser would want to look at but I don't think we're able to poke those bits ourselves | 00:04 |
clarkb | either tariffs or someone put an anchor in the wrong spot | 00:04 |
ianw | ... or the right spot depending on who's at the other end of the anchor :) | 00:05 |
tonyb | Yeah, I figured the reverse DNS was beyond our control. | 00:05 |
clarkb | I'm trying to ping stuff in australia and everything I try is either cloudflare or akamai so far | 00:11 |
tonyb | try magni.bakeyoutnoodle.com | 00:11 |
clarkb | that doesn't resolve for me | 00:12 |
tonyb | or ozlabs.org | 00:12 |
clarkb | oh, "yout" should be "your" I think | 00:12 |
tonyb | Yeah | 00:12 |
tonyb | It's my home server | 00:13 |
tonyb | ozlabs.org would be better as it's in a DC $somewhere in AU | 00:13 |
clarkb | no packet loss from my west coast isp to either of those | 00:13 |
clarkb | so it's probably not all of NA but some subset | 00:14 |
clarkb | pdx.edu is close to me physically and route wise I think if you want to ping something here from there and confirm | 00:14 |
tonyb | I suspect pdx.edu is filtering as I got 100% packet loss | 00:15 |
tonyb | OSU? | 00:16 |
clarkb | try cat.pdx.edu | 00:16 |
clarkb | I get some packet loss to curtin.edu.au which is in amazon in sydney it looks like, so ya it's probably specific routes that are struggling across the pacific | 00:16 |
tonyb | cat.pdx.edu is better | 00:17 |
tonyb | 0% packet loss | 00:17 |
tonyb | We can only assume that someone's pager is going off and it will be fixed ASAP | 00:18 |
clarkb | ianw: I'm guessing the network connectivity might make it tough but did you see the questions about why you would prefer to pull the inventory out of system-config on the executor rather than copy system-config over at that point if we have to copy it at some point anyway? | 00:27 |
ianw | it's a good question and challenging my assumptions about what the bootstrapping bridge steps are doing | 00:27 |
clarkb | considering that the repo sync step doesn't involve running ansible on the bridge itself I feel like it's ok either way. The run ansible on bridge step seems like a natural boundary | 00:29 |
clarkb | but we wanted to make sure we weren't missing something important there | 00:29 |
ianw | https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 | 00:32 |
ianw | i don't think this works | 00:32 |
ianw | [WARNING]: Could not match supplied host pattern, ignoring: prod_bastion | 00:32 |
ianw | Using /etc/ansible/ansible.cfg as config file | 00:32 |
ianw | PLAY [prod_bastion[0]] ********************************************************* | 00:32 |
ianw | skipping: no hosts matched | 00:32 |
clarkb | BRIDGE_INVENTORY doesn't have the host in it? Is that something that maybe works in CI but not prod or vice versa? | 00:35 |
clarkb | BRIDGE_INVENTORY: '{{ "-i/home/zuul/bastion-inventory.ini" if root_rsa_key is defined else "" }}' | 00:35 |
clarkb | ya I wonder if the case you're seeing is the else "" there | 00:35 |
ianw | right, it's confusing for sure. the idea was that we could have multiple bastion hosts when we need to rotate | 00:36 |
ianw | "prod_bastion" is a group of one host, the currently active bastion host. the "bastion" group can have all the bastion hosts in in we want | 00:36 |
clarkb | the comment earlier in the file says that root_rsa_key value should be set in prod secrets and in testing its set with an ephemeral key | 00:37 |
ianw | that way we can bootstrap a new bridge by adding it, and the old bridge will actually go to the new bridge and deploy almost everything onto it | 00:37 |
clarkb | so the intent seems to be that we would always have a root_rsa_key so I'm not sure why we have the else block there | 00:37 |
clarkb | oh I think I get it | 00:38 |
clarkb | root_rsa_key should only be set in the zuul ansible in CI jobs. If we aren't a ci job we still run ansible but it uses the built in inventory on the bridge | 00:39 |
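For reference, a hedged sketch of how the two branches of that BRIDGE_INVENTORY conditional play out; the inventory file contents below are an assumption for illustration, not the real generated file.

```shell
# CI case: root_rsa_key is defined, so plays get an extra
# "-i /home/zuul/bastion-inventory.ini". A plausible (assumed) shape for that
# file, with the ephemeral test bridge in both groups so prod_bastion[0] matches:
cat > /home/zuul/bastion-inventory.ini <<'EOF'
[bastion]
bridge99.opendev.org ansible_user=root

[prod_bastion]
bridge99.opendev.org
EOF

# Production case: the "else" branch adds nothing, so ansible falls back to the
# bridge's built-in inventory -- which, per groups.yaml as discussed below, has
# no prod_bastion group, hence the "no hosts matched" warning.
```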
ianw | i think https://opendev.org/opendev/system-config/src/branch/master/playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml#L1 should run on localhost | 00:39 |
clarkb | I think that might fix it. Because system-config/inventory/service/groups.yaml doesn't have a prod_bastion group. That group is dynamically created within zuul jobs, it looks like | 00:40 |
clarkb | so ya we never match on the production side | 00:40 |
clarkb | however, this should only come into play when bootstrapping a new bastion right? so it's probably less urgent than the known_hosts thing but another good catch on something to fix there | 00:40 |
ianw | right, what got me thinking is that https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are running that all from the checkout of system-config | 00:41 |
clarkb | oh I see ya so things need to be updated there too | 00:41 |
clarkb | otherwise we'd run with the N-1 version of playbooks/zuul/run-production-bootstrap-bridge-add-rootkey.yaml | 00:42 |
clarkb | or in the case of bootstrapping an entirely new server nothing would be present | 00:42 |
ianw | so i'm just trying to think through the ordering. i think we probably agree that if we want to run in parallel, we have to have a step to setup the executor so that it can log into the bastion host (done for every job) and a single step to update the source on the bastion host | 00:43 |
clarkb | yes that makes sense to me there has to be at least one source update that happens before things consume the playbooks from system-config | 00:43 |
clarkb | I need to help with dinner momentarily but I'll try to catch up afterwards. Feel free to drop any further thoughts in email or here if gerrit is still painfully behind the packet loss barrier | 00:49 |
clarkb | and thank you for taking a look at this. It's helpful since you had so much of it paged in in the past | 00:50 |
ianw | yeah sorry i'm trying to untangle it in my head again :) | 00:51 |
ianw | the difficult bit is remembering where the line between production and testing lies | 00:52 |
ianw | i think that at https://opendev.org/opendev/system-config/src/branch/master/playbooks/bootstrap-bridge.yaml#L69 we are trying to install the root key from ansible on the bridge, because we want to access the root key from hiera | 01:07 |
ianw | ... on the production system. in the testing case, the root key is in a file on disk. and we want to use ansible's key deployment steps, rather than write our own script | 01:09 |
ianw | the problem is that it's not that much of a "bootstrap" step if it relies on an up-to-date system-config being in place | 01:09 |
ianw | it is in the *testing* case which is the VAST majority of what runs, because the bridge is a node setup by zuul. but still ... | 01:10 |
Clark[m] | Ya but we also don't need to boil the ocean if we can make progress overall. The chicken and egg does feel a bit painful though | 01:46 |
Clark[m] | But if there is minimal manual intervention that can still be a ein | 01:47 |
Clark[m] | *win | 01:47 |
ianw | yep i think the insight here is that there is no separate bootstrap and "update source" step ... it's the same thing. for testing, it's a no-op | 01:56 |
ianw | clarkb: I think i'm ok with it, but I do think we should probably clear up those jobs in the LE section that aren't actually LE based | 03:25 |
ianw | as I said in my too long comment -- can we get it so that the bootstrap job pauses in the same way the buildset-registry job does? and then locks out another bootstrap job from running? | 03:26 |
ianw | i think the dependencies are in place such that you could clone the source just once in the bootstrap job | 03:30 |
corvus | i think the deploy pipeline will only run one buildset at a time; so we shouldn't need to lock out another bootstrap? (but we still have collisions with periodic). having said that, having the bootstrap job run the full length of the buildset and used to lock out periodic jobs makes sense to me | 03:33 |
ianw | yeah it's the periodic collisions that i think we need to lock out | 03:33 |
corvus | okay, then yeah that sounds like an elegant solution, and basically no-cost since it's a zero-node job | 03:34 |
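As a reference for the pause idea, a hedged sketch of what the bootstrap job's run playbook could end with, following the buildset-registry pattern; the playbook path is invented here, but the zuul_return pause flag is standard Zuul behaviour for holding a job open until its children finish.

```shell
# Sketch: write a run playbook for the zero-node bootstrap job that pauses,
# locking out overlapping periodic/deploy bootstraps for the buildset's lifetime.
cat > playbooks/zuul/run-production-bootstrap-bridge-pause.yaml <<'EOF'
- hosts: localhost
  tasks:
    - name: Pause until all child jobs in the buildset complete
      zuul_return:
        data:
          zuul:
            pause: true
EOF
```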
ianw | i guess it's not a solution so much as an enhancement (or a RFE from me, haha) ... but it's getting us to being able to run prod jobs in parallel | 03:34 |
ianw | i had it in my head that we'd have a separate src update job that everything else would depend on (that was what was reverted in https://review.opendev.org/c/opendev/system-config/+/820250) | 03:35 |
corvus | i think doing that while we understand it makes a lot of sense :) | 03:36 |
ianw | the "bootstrap-bridge" stuff post-dates that, from when we did our most recent replacement of the actual bridge node. that was how we got to the point that we run bridge99.opendev.org in testing -- as a deliberate thing to avoid too much hard-coding of bridge01 as much as possible | 03:38 |
ianw | but i think the work done there, to parent everything to the bootstrap bridge job, is a good basis for the next step | 03:39 |
Clark[m] | ianw: corvus: catching up, it isn't clear to me if you think we need the pause block behavior as a prereq to fix the immediate issue or if that is a solution for the run things in parallel as a followup | 04:12 |
ianw | i think probably we can pull the source in the bootstrap job as you suggest. pausing and moving towards parallel running is an enhancement | 04:13 |
Clark[m] | Ok cool. Then the only real update for the current change is to organize the jobs a bit better in project.yaml to make the dependencies clearer. | 04:13 |
ianw | yes i think we should do that, having identified that | 04:14 |
Clark[m] | Then we can followup with trying to parallelize stuff | 04:14 |
Clark[m] | Thanks again for looking. Enjoy your weekend! | 04:15 |
ianw | something is still funky talking to review from here | 04:35 |
opendevreview | Ian Wienand proposed opendev/system-config master: DNM: Bootstrap as top-level job https://review.opendev.org/c/opendev/system-config/+/942439 | 12:57 |
ianw | clarkb: ^ that's the vague idea ... i'm not sure it can be tested outside of production because in testing, each job is its own little world, there's no global bootstrap job ... | 12:58 |
clarkb | thanks I'll work on updating my change to address the organizational concerns then I can rebase ^ on top of it too | 15:45 |
clarkb | https://docs.docker.com/docker-hub/usage/ says that the 10 pulls per hour per ip limit on docker hub goes into effect March 1. There is discussion here: https://news.ycombinator.com/item?id=43125089 some people have even picked up on the issue with upstream blocking PRs to add mirrors for non docker hub registries (of course others have said this isn't an important feature...) | 15:46 |
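For anyone wanting to watch how close a given IP is to the limit, Docker documents a token-plus-HEAD-request check; a rough sketch (ratelimitpreview/test is the probe image Docker provides for exactly this purpose, and the HEAD request does not count against the quota):

```shell
# Fetch an anonymous pull token, then read the ratelimit headers from a HEAD request.
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull" | jq -r .token)
curl -sI -H "Authorization: Bearer $TOKEN" \
  "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
  | grep -i '^ratelimit'
```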
clarkb | that discussion points out that https://gallery.ecr.aws already mirrors some official docker images so we may be able to reduce the amount of mirroring we do ourselves if we like | 15:48 |
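If we do lean on the ECR Public mirror, official-image pulls would look roughly like this (image and tag chosen only as an example):

```shell
# Docker official images are republished under the docker/library namespace on
# ECR Public, so switching is mostly a registry prefix change.
docker pull public.ecr.aws/docker/library/python:3.12-slim
```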
clarkb | what is weird is that we definitely saw a dramatic increase in docker hub rate limit errors late last year. I half wonder if we got A/B tested on the new limits | 15:49 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib https://review.opendev.org/c/openstack/diskimage-builder/+/942466 | 15:49 |
clarkb | but we should probably expect things to get even worse in ~1week | 15:49 |
clarkb | a good reminder that there are changes up in zuul-jobs to fetch mirrored images from quay rather than from docker hub if anyone has a moment to review them | 15:50 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: [doc] Document expected environment for dib https://review.opendev.org/c/openstack/diskimage-builder/+/942466 | 15:58 |
mnaser | if there's anything i can help with this... `Powered by Gitea Version: v1.23.3 Page: 22261ms Template: 53ms` | 16:16 |
mnaser | I really want to try and use opendev.org to browse code but it's incredibly frustrating :-( | 16:16 |
clarkb | mnaser: we asked for feedback when we last debugged it and didn't get any... | 16:17 |
mnaser | sorry, it must have gotten lost in my buffer | 16:17 |
clarkb | the working theory is that the gitea caches may be becoming large enough that there is an impact on go gc. I restarted the gitea instance you were speaking to at the time (gitea13) as that would reset the internal memory state in hopes it would clear out gc issues | 16:17 |
mnaser | oh its using sticky sessions? | 16:18 |
clarkb | we knew it would be temporary but it was hard to say one way or another if it actually helped at the time without more feedback | 16:18 |
clarkb | yes git protocols don't let us load balance based on load | 16:18 |
clarkb | they are too stateful across requests/connections | 16:18 |
clarkb | if all gitea was doing was web ui stuff then it wouldn't matter | 16:18 |
mnaser | oh even for serving http/https? | 16:19 |
clarkb | yes | 16:19 |
mnaser | TIL! | 16:19 |
mnaser | sorry, i've just gotten a new pc and didn't have irccloud installed so i haven't been fully synced in yet | 16:19 |
mnaser | it has started happening again for sure for me, is there a way i can share what backend i'm talking to | 16:19 |
clarkb | I think it has to do with where the refs are located as they can be loose or in packfiles and if you talk to server A and it says grab this packfile then hit server B and the ref is loose your git fetch regardless of transport protocol can fail | 16:19 |
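A minimal sketch (assumed for illustration, not the literal opendev config; backend names and the 3081 port follow the gitea13 URL mentioned later) of the source-IP stickiness being described: haproxy in TCP mode hashes the client address, so one client keeps landing on the same gitea backend for both git and web traffic.

```shell
# Sketch of a source-hashed haproxy frontend/backend for the gitea farm.
cat > gitea-lb-sketch.cfg <<'EOF'
listen balance_git_https
    bind :::443 v4v6
    mode tcp
    balance source
    server gitea13.opendev.org gitea13.opendev.org:3081 check
    server gitea14.opendev.org gitea14.opendev.org:3081 check
EOF
```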
clarkb | if you inspect the ssl cert one of the names is the backend name | 16:20 |
clarkb | that is probably the easiest method | 16:20 |
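Concretely, the certificate check clarkb describes can be done from a client like this; the backend serving your sticky session appears among the names in the cert's SAN list:

```shell
# Pull the served certificate and show its Subject Alternative Names.
openssl s_client -connect opendev.org:443 -servername opendev.org </dev/null 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
```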
mnaser | now that makes me curious how github and gitlab do it :) | 16:20 |
mnaser | one sec | 16:20 |
mnaser | gitea13.opendev.org indeed | 16:20 |
clarkb | I think github uses a single git repo for all forks of the repo | 16:21 |
mordred | mnaser: I believe they have shared/coordinated storage clusters, as opposed to the shared-nothing setup we use | 16:21 |
clarkb | via shared fielsystem or object storage or something | 16:21 |
mnaser | yes they do, that's why you can do this weird link that points to a commit that doesn't exist in that repo | 16:21 |
clarkb | yup which has led to many major security issues. Tradeoffs | 16:22 |
clarkb | anyway give me a minute and I'll restart services on gitea13 to reset the memory state and hopefully we can get more feedback on whether or not this is helpful | 16:23 |
clarkb | (side note I think the persistent requests from all the various ai crawler bots is what causes the cache to go crazy) | 16:23 |
clarkb | but this is all hypothesis at the moment | 16:23 |
mnaser | they can get their juice from github.com mirrors, maybe we just block them? | 16:23 |
mnaser | im sure they already scraped that before us :) | 16:23 |
clarkb | they reproduce like rabbits | 16:25 |
clarkb | and not everything we host is replicated to github. Its a tricky thing to balance | 16:25 |
clarkb | services have restarted. I would expect things to be slowish as valid things we care about cache and then speed up with a tail off as all the things get cached over time (this was all discussed last time too) | 16:26 |
clarkb | the ideal is that the folks running the crawlers would realize that running a git clone against each repo and then crawling the commits locally would be far more efficient for us and them | 16:27 |
clarkb | if anyone knows anyone at anthropic, openai, amazon, meta, google, etc I'm happy to give them this insight for a good deal | 16:27 |
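The clone-once approach being suggested, spelled out: one clone per repo, then walk the history locally instead of loading a web page per commit.

```shell
# Clone once, then read every commit from local disk with zero further load
# on the gitea backends.
git clone https://opendev.org/opendev/system-config
cd system-config
git log --stat
```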
clarkb | mnaser: gitea13 seems to be performing reasonably well if I hit it directly via https://gitea13.opendev.org:3081/zuul/zuul for example | 16:28 |
clarkb | any noticeable difference for your usage patterns? | 16:28 |
clarkb | you might want to double check that you're still hitting gitea13 too as I'm not sure how haproxy rebalancing works with the service flapping. I believe the algorithm is deterministic against your IP and the total number of backends so I would expect you to be back on the same server | 16:31 |
mnaser | clarkb: i think it actually feels better now from a few seconds ago | 16:34 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update to Gitea 1.23.4 https://review.opendev.org/c/opendev/system-config/+/942477 | 16:35 |
clarkb | I don't think ^ will help based on the changelog but seems like something we should be doing anyway | 16:36 |
clarkb | mnaser: ok, it might be helpful to know when it "feels" slow again so that we can gauge how long it takes to degrade | 16:36 |
clarkb | but ya the gitea caches are implemented as an internal go map. There is no maximum size on the map (though they do treat older data (based on timestamps) as invalid and will refresh it). I think what happens is things can just grow and grow with the crawlers fetching pages for every single commit on the service which results in very large maps. That apparently can create problems for | 16:39 |
clarkb | the go gc system which I suspect is creating halt-the-world problems like you'd expect from java programs 10 years ago. Unfortunately I don't think go binaries come with tools like the jvm provides to inspect this easily. You either run the entire process with gc tracing on or you grab the data from within your program (I'd be happy to hear there are better tools I don't know about | 16:39 |
clarkb | that I can look at) | 16:39 |
mnaser | i will report that for sure | 16:40 |
clarkb | and maybe turning on gc tracing is the next step in debugging. Just need to be careful we don't accidentally fill disks and make git sad | 16:42 |
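For reference, a hedged sketch of the gc tracing option being weighed: GODEBUG=gctrace=1 is the standard Go runtime switch and prints one summary line per collection to stderr. Exactly where it would be injected for our gitea containers is an assumption (an environment entry in the compose file), and the output volume is why the disk-filling caution applies.

```shell
# Standalone gitea with gc tracing enabled; output goes to stderr.
GODEBUG=gctrace=1 ./gitea web
# Containerized (assumed wiring): add GODEBUG=gctrace=1 to the gitea service's
# environment in docker-compose.yaml and follow the container logs.
```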
opendevreview | Clark Boylan proposed opendev/system-config master: Reparent the bootstrap-bridge job onto a job that sets up git repos https://review.opendev.org/c/opendev/system-config/+/942307 | 16:55 |
clarkb | ok that is the promised small update to the fix for known_hosts generation | 16:56 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM: Bootstrap as top-level job https://review.opendev.org/c/opendev/system-config/+/942439 | 17:01 |
clarkb | and rebased ianw's DNM change to show what this might look like as a followon | 17:01 |
clarkb | I think nb04 has filled its disk again | 17:13 |
clarkb | confirmed. I'll start a screen and begin deletion of content in the tmp dir | 17:14 |
opendevreview | Merged openstack/diskimage-builder master: [doc] Document expected environment for dib https://review.opendev.org/c/openstack/diskimage-builder/+/942466 | 18:00 |
clarkb | https://review.opendev.org/c/zuul/zuul-jobs/+/942127 https://review.opendev.org/c/zuul/zuul-jobs/+/941992/ https://review.opendev.org/c/zuul/zuul-jobs/+/941970 and https://review.opendev.org/c/zuul/zuul-jobs/+/942008 are straightforward zuul-jobs updates that might be good for a friday afternoon if anyone has time | 21:39 |
clarkb | this will reduce our reliance on docker hub for zuul jobs | 21:39 |
corvus | you have at least one usually 2 +2s on all those, but i didn't +w them | 22:21 |
clarkb | thanks. I guess only the most recent one was devoid of all review until now. I can approve them (or a subset; I'll double check each one for comfort levels on a friday afternoon before doing so) | 22:29 |
clarkb | ah looks like fungi got them | 22:29 |
clarkb | I can help debug if anything goes wrong (and they should all be relatively easy reverts if it comes to that) | 22:29 |
fungi | yep | 22:29 |
fungi | i'm semi-around still too | 22:30 |
clarkb | then monday we should probably do the gitea upgrade and maybe the grafana upgrade | 22:30 |
fungi | that'll be great | 22:30 |
clarkb | the nb04 disk cleanup is still in progress | 22:31 |
clarkb | io there is not very quick unfortunately | 22:31 |
opendevreview | Merged zuul/zuul-jobs master: Use mirrored buildkit:buildx-stable-1 image https://review.opendev.org/c/zuul/zuul-jobs/+/941992 | 22:37 |
opendevreview | Merged zuul/zuul-jobs master: Use mirrored qemu-user-static image https://review.opendev.org/c/zuul/zuul-jobs/+/942127 | 22:37 |
opendevreview | Merged zuul/zuul-jobs master: Use registry:2 image mirrored to quay.io https://review.opendev.org/c/zuul/zuul-jobs/+/941970 | 22:37 |
opendevreview | Merged zuul/zuul-jobs master: Replace debian:testing with quay.io/opendevmirror/httpd:alpine https://review.opendev.org/c/zuul/zuul-jobs/+/942008 | 22:42 |
clarkb | I do wonder if we'll need to force ipv4 access to docker hub after march 1, but we can cross that bridge when we get there. I think for most opendev and zuul stuff we've managed to shift everything but the images opendev produces for itself to quay at this point | 23:08 |
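If forcing IPv4 ever becomes necessary, one blunt approach is pinning the registry endpoints' A records in /etc/hosts; a hedged sketch (fragile, since Docker Hub's addresses rotate, and the blob CDN host may need the same treatment):

```shell
# Resolve the current A records and pin them so lookups never return AAAA.
for h in registry-1.docker.io auth.docker.io; do
  ip=$(dig +short A "$h" | head -n1)
  [ -n "$ip" ] && echo "$ip $h" | sudo tee -a /etc/hosts
done
```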
clarkb | that leaves a number of images still, but it is rare that we need many images all on one host with one ip at the same time | 23:08 |
clarkb | I think jitsi meet is the main exception to both of those statements. We pull from upstream and they are on docker still and there are like 5 images or something | 23:09 |
clarkb | but most things will pull at most 2 images | 23:09 |
clarkb | I think we're probably in a see-what-happens mode while still moving what we can over time to reduce the overall demand we create | 23:10 |