Nick | Message | Time |
---|---|---|
ianw | 2ebdabe1-799f-4bb0-9ed6-758d9ee34bbc \| test-server \| ACTIVE \| auto_allocated_network=10.100.0.147 \| centos-8-stream-arm64-1674085448 \| opendev-no-ephemeral | 00:14 |
ianw | ok ... so the new linaro cloud can boot a raw image uploaded, with no ephemeral storage. that's a start | 00:14 |
fungi | progress! | 00:17 |
opendevreview | Ian Wienand proposed opendev/system-config master: nodepool config: set linaro cloud to use raw images https://review.opendev.org/c/opendev/system-config/+/871010 | 00:38 |
opendevreview | Merged openstack/project-config master: nb04: use linaro region mirror https://review.opendev.org/c/openstack/project-config/+/871006 | 00:39 |
*** dasm is now known as dasm|off | 01:10 | |
opendevreview | Merged opendev/system-config master: Update git in gitea images https://review.opendev.org/c/opendev/system-config/+/871009 | 02:21 |
opendevreview | Ian Wienand proposed opendev/system-config master: openafs: use consistent name for cache size https://review.opendev.org/c/opendev/system-config/+/871014 | 02:31 |
ianw | ^ i've also manually fixed the linaro mirror to start openafs correctly; but that will do it permanently | 02:33 |
clarkb | inventory/service/host_vars/mirror01.regionone.linaro.opendev.org.yaml is where the smaller size is set if anyone is wondering | 02:36 |
ianw | huh, that went into a recursive error | 02:57 |
ianw | i guess you can not call a role with openafs_client_cache_size: "{{ openafs_client_cache_size | default(10000000) }}" # 10GiB | 03:33 |
opendevreview | Ian Wienand proposed opendev/system-config master: linaro mirror: fix afs cache size https://review.opendev.org/c/opendev/system-config/+/871014 | 03:37 |
ianw | it's confusing but i don't have motivation to do anything more fancy now | 03:37 |
fungi | makes sense | 03:45 |
opendevreview | Ian Wienand proposed opendev/system-config master: hound: use updated git packages https://review.opendev.org/c/opendev/system-config/+/871016 | 03:46 |
opendevreview | Merged opendev/system-config master: nodepool config: set linaro cloud to use raw images https://review.opendev.org/c/opendev/system-config/+/871010 | 04:21 |
opendevreview | Merged opendev/system-config master: linaro mirror: fix afs cache size https://review.opendev.org/c/opendev/system-config/+/871014 | 04:25 |
ianw | i've recreated the "opendev" flavor on the new linaro cloud to not have ephemeral storage. however nodepool still isn't sending work there as it doesn't have the right images, yet. nb04 is building them (after nb03 went missing) | 04:37 |
opendevreview | Merged opendev/system-config master: hound: use updated git packages https://review.opendev.org/c/opendev/system-config/+/871016 | 05:09 |
*** soniya is now known as soniya29|rover | 05:41 | |
*** ysandeep is now known as ysandeep|afk | 06:13 | |
*** ysandeep|afk is now known as ysandeep | 07:35 | |
*** jpena|off is now known as jpena | 08:05 | |
*** soniya29|rover is now known as soniya29|rover|brb | 09:44 | |
*** ysandeep is now known as ysandeep|afk | 11:00 | |
*** dviroel|afk is now known as dviroel | 11:13 | |
*** rlandy|out is now known as rlandy | 11:14 | |
*** soniya29|rover|brb is now known as soniya29|rover | 11:25 | |
*** ysandeep|afk is now known as ysandeep | 12:48 | |
*** dasm|off is now known as dasm | 13:08 | |
*** ysandeep is now known as ysandeep|dinner | 15:18 | |
*** dasm is now known as Guest1846 | 15:29 | |
*** Guest1846 is now known as dasm | 15:30 | |
*** ysandeep|dinner is now known as ysandeep | 15:32 | |
Tengu | folks, I have a really, really weird behavior with the ansible-galaxy proxy: locally, it works. The vhost has the exact same configuration, though I don't have TLS enabled. But the mirror on opendev infra seems to have a difference making it unreliable. | 15:32 |
Tengu | for instance, using the proxy, it's impossible to install "community.general" collection, while it does work through my local config. | 15:33 |
Tengu | and I really don't know why this is failing. Especially since installing (so far) any other collection is working fine. | 15:33 |
fungi | have you tested more than one mirror server? | 15:35 |
fungi | what is the error you receive from it? | 15:35 |
Tengu | fungi: I don't know the URI for other proxies, so I tested only via https://mirror.iad3.inmotion.opendev.org:4448 | 15:35 |
fungi | can you replicate it consistently, or is it intermittent? | 15:36 |
Tengu | consistent on that one. here's the command: ansible ~/.ansible/galaxy_cache/api.json ; ansible-galaxy collection install -vvvvvvv -s https://mirror.iad3.inmotion.opendev.org:4448 -p ./ansible community.general | 15:36 |
Tengu | err... it's missing the beginning. | 15:37 |
Tengu | here: rm -rf ansible ~/.ansible/galaxy_cache/api.json ; ansible-galaxy collection install -vvvvvvv -s https://mirror.iad3.inmotion.opendev.org:4448 -p ./ansible community.general | 15:37 |
Tengu | the first part is to clear all local cache. the "-p ansible" ensures we're using a local directory, in order to not pollute the system. | 15:37 |
fungi | https://mirror.dfw.rax.opendev.org:4448/ https://mirror.bhs1.ovh.opendev.org:4448/ https://mirror.sjc1.vexxhost.opendev.org:4448/ | 15:38 |
fungi | those are a few in more providers | 15:38 |
Tengu | let's see. | 15:38 |
Tengu | same on https://mirror.sjc1.vexxhost.opendev.org:4448/ | 15:38 |
fungi | what is the error you receive from it? | 15:39 |
Tengu | it's an ansible CLI error - and it doesn't really provide data. I tried to compare things with the actual ansible-galaxy server, but didn't find anything. lemme paste the stack. | 15:39 |
fungi | to paste.opendev.org please ;) | 15:40 |
Tengu | https://paste.openstack.org/show/biRnXTrehuHU0b0GNRZ5/ | 15:40 |
fungi | thanks | 15:40 |
Tengu | (no, I won't paste 60+ lines on IRC ;)) | 15:40 |
fungi | much appreciated | 15:40 |
Tengu | ;) | 15:40 |
Tengu | and if we point to another collection, say "ansible.utils", it just works fine. | 15:41 |
fungi | https://mirror.sjc1.vexxhost.opendev.org:4448/api/v2/collections/community/general/versions/ seems to paginate when i hit it with a browser | 15:41 |
Tengu | it's expected. | 15:42 |
Tengu | for instance, ansible-galaxy CLI does it on its own: Calling Galaxy at https://mirror.dfw.rax.opendev.org:4448/api/v2/collections/community/general/versions/?page_size=100 | 15:42 |
Tengu | and then, on the next line: Calling Galaxy at https://mirror01.dfw.rax.opendev.org:4448/api/v2/collections/community/general/versions/?page=2&page_size=100 | 15:42 |
Tengu | basically, ansible-galaxy wants to get the full index | 15:42 |
fungi | so, there's one problem we observed with mod_substitute when proxying pypi | 15:43 |
Tengu | wait.... you may have a thing | 15:43 |
Tengu | ansible.utils, a working collection, doesn't paginate | 15:43 |
fungi | if a json response is all in one line, it may exceed the maximum line length supported by the mod. this can be adjusted with a setting | 15:43 |
Tengu | hmmm nope, it seems to be fine with the substitute | 15:43 |
Tengu | ansible.posix doesn't paginate either | 15:44 |
Tengu | hmmmmm. | 15:44 |
Tengu | fungi: are there settings in httpd set outside of playbooks/roles/mirror/templates/mirror.vhost.j2 ? | 15:46 |
fungi | looking at the pypi proxy we set SubstituteMaxLineLength 20m because of https://github.com/pypi/warehouse/issues/11919 | 15:46 |
Tengu | asking since the same vhost config is working locally - maybe there's something global I don't have, creating the issue. | 15:46 |
Tengu | fungi: I copied the setting in the galaxy vhost | 15:46 |
fungi | looks like the limit in mod_substitute is 1m characters to a line | 15:47 |
Tengu | fungi: and... well, it would fail on my local env, but it's working fine. | 15:47 |
fungi | just trying to rule that out real quick | 15:47 |
Tengu | lemme try to paste my httpd.conf from my container somewhere. | 15:47 |
Tengu | it's pretty ugly, 1-file, but... | 15:47 |
fungi | total response size is 12584 bytes, so definitely not that | 15:48 |
fungi | an order of magnitude lower than would be needed to hit that problem | 15:49 |
Tengu | fungi: I also ruled out some internal cache issue within ansible code - I suspected that the "localhost:8080" being far shorter than the "mirror01......" used in the CI job, it may be truncated or something - but apparently it's not the case. | 15:50 |
Tengu | the trace is really weird. | 15:50 |
Gue___________________________ | Greetings #opendev. Quick question: is it possible to delete an etherpad that I created on your site or to delete the content including its history? We are hoping to use the pad for a brainstorm but would prefer if the convo did not live on forever. Thank you in advance for considering. | 15:50 |
fungi | Gue___________________________: that's not a supported use for our etherpad server. i think etherpad.org may have a public server which expires pads after a while, you might check there | 15:52 |
Tengu | aha. yeah. ok. fungi I think you get a thing with the paginate actually. | 15:52 |
fungi | Gue___________________________: our etherpad is intended for public collaboration, and we make every attempt to preserve the history there for posterity | 15:53 |
Tengu | the ansible cache is far, far different. | 15:53 |
Tengu | yessssssss | 15:53 |
Tengu | jm1: I found a workaround! | 15:53 |
Tengu | and it won't hit too hard: add "--no-cache" to the ansible-galaxy command | 15:53 |
fungi | Tengu: so local caching impacts it? | 15:54 |
Tengu | jm1: that will tell ansible to NOT touch its ~/.ansible/galaxy_cache/api.json | 15:54 |
Tengu | fungi: in a weird way - maybe due to something sent by the proxy, still. | 15:54 |
Tengu | grumpf... isn't there some CLI one can easily use to paste a long file ?! | 15:54 |
Tengu | instead of copy-pasting blocks after blocks.. | 15:55 |
fungi | i would probably resort to hacking some debug logging into resolvelib to get more detail about the dict that it's trying to access | 15:55 |
fungi | Tengu: there's the pastebinit tool | 15:55 |
Tengu | http://paste.scsys.co.uk/2216 here | 15:55 |
Tengu | nopaste < httpd.conf | 15:55 |
Tengu | fungi: I checked the file on-disk | 15:56 |
Tengu | its content is indeed "slightly" different when there's a paginate. | 15:56 |
fungi | for future reference, pastebinit can paste to paste.opendev.org as well | 15:56 |
* Tengu takes note | 15:56 | |
Tengu | ah, via -b paste.opendev.org I guess. | 15:57 |
Tengu | ok. | 15:57 |
Gue___________________________ | @fungi Thank you, understood. | 16:01 |
*** dviroel is now known as dviroel|lunch | 16:01 | |
fungi | Tengu: yeah, i have an "opaste" alias to that in my shell, for convenience | 16:11 |
Tengu | fungi: it failed to get the generated link. bah. I usually don't have to paste 100+ lines. | 16:12 |
Tengu | anyway. I have a workaround, but it would still be nice to understand why it fails on the "prod", while dev env is fine :/. | 16:12 |
fungi | i would probably resort to hacking some debug logging into resolvelib to get more detail about the dict that it's trying to access | 16:13 |
Tengu | fungi: so I checked the JSON (yeah, the local cache is plain JSON), and it seems to miss things when it comes to that specific collection. | 16:28 |
Tengu | it's... weird. | 16:28 |
fungi | what part of the json is missing? | 16:29 |
*** ysandeep is now known as ysandeep|out | 16:38 | |
*** dviroel|lunch is now known as dviroel | 16:57 | |
*** marios is now known as marios|out | 16:59 | |
Tengu | fungi: (sorry, was on some other discussion) the whole part matching the key shown in the trace | 17:01 |
Tengu | so basically, it's as if it's flushing all of the data related to the versions | 17:01 |
Tengu | i.e. loads page one, injects data in the file, and that entry is dropped at some point when it comes to load the second page and tries to update it. | 17:02 |
Tengu | and since it's supposed to be there, it crashes instead of re-creating (which is probably better). | 17:02 |
Tengu | but this happens if and only if we're using the opendev proxies. My local httpd, with the configuration I pasted earlier, doesn't crash ansible-galaxy. | 17:03 |
Tengu | this is why I'm wondering if there are some other configurations in httpd, set outside of that mirror thingy. | 17:03 |
Tengu | fungi: I'm running my local proxy like this: podman run --rm --security-opt label=disable -v ./httpd.conf:/usr/local/apache2/conf/httpd.conf:ro -v ./cache:/var/cache/apache2/proxy:rw -v ./logs:/usr/local/apache2/logs:rw -p 8080:8080 httpd:2.4 | 17:04 |
Tengu | and then pointing ansible-galaxy -s http://localhost:8080 | 17:04 |
fungi | Tengu: looking at one of the mirror servers, we have https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/files/apache-connection-tuning added | 17:07 |
fungi | but aside from that, just the default configuration from ubuntu bionic (18.04 lts) | 17:08 |
Tengu | hmmmmmm not sure it would really be a thing | 17:08 |
Tengu | yeah. shouldn't do | 17:08 |
Tengu | weird... | 17:08 |
Tengu | anyway... getting late here. I'm a bit puzzled by that behavior, but I don't know what I can do. Yeah, adding some debug logging, of course - maybe getting the file copied before update, that may help. | 17:09 |
Tengu | if I have some time... though I doubt it. | 17:09 |
fungi | i confirm that i don't find community.general in the json we get back from the proxy, but it's not in the json from galaxy.ansible.com either | 17:13 |
fungi | i guess community.general is a key in the local cache | 17:17 |
fungi | not in the json response | 17:17 |
fungi | i suppose that's the result of the earlier warning line | 17:17 |
fungi | Tengu: i suppose one difference might be that mirror.dfw.rax.opendev.org is a cname to mirror01.dfw.rax.opendev.org and while we're calling the former it's the latter we find being substituted in the response | 17:19 |
fungi | do you see the same error if you use https://mirror01.dfw.rax.opendev.org:4448/ instead of the hostname without the 01 in it? | 17:19 |
clarkb | note we also use the internal rax ip address in CI for the rax mirrors specifically (gets better throughput) | 17:21 |
clarkb | but I wouldn't expect that to matter too much as they are both CNAMEs so should be able to reproduce using the public name | 17:21 |
fungi | though it's worth noting that we'll end up using the non-internal interface for subsequent calls recursed from the initial request since mod_substitute is writing the server name in there. maybe we need to substitute the hostname from the request instead | 17:22 |
clarkb | I half expected that it already did that? I guess not | 17:23 |
fungi | it would get extra broken if we started doing sni with different hostname-specific vhosts later | 17:23 |
fungi | if you curl https://mirror.dfw.rax.opendev.org:4448/api/v2/collections/community/general/ you'll see the json says mirror01 instead of mirror | 17:24 |
clarkb | does it do that with pypi? | 17:25 |
clarkb | (it also substitutes iirc) | 17:26 |
fungi | also in https://paste.opendev.org/show/biRnXTrehuHU0b0GNRZ5/ you can see the initial requests are to mirror.dfw.rax but then a subsequent request goes to mirror01.dfw.rax | 17:26 |
fungi | clarkb: we don't embed the hostname in pypi responses, we use relative hrefs | 17:27 |
fungi | rewriting to /pypifiles | 17:27 |
fungi | er, not relative, but local | 17:27 |
clarkb | aha | 17:27 |
*** jpena is now known as jpena|off | 17:30 | |
fungi | i guess technically we could try that with the ansible galaxy substitutions too | 17:35 |
fungi | basically just drop the scheme, servername and port from https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mirror/templates/mirror.vhost.j2#L584 | 17:37 |
fungi | Substitute "s|https://galaxy.ansible.com/|/|ni" | 17:37 |
fungi | also would allow to get rid of the scheme lookup conditionals | 17:38 |
fungi | oh, though the comment immediately above there states "ansible-galaxy CLI needs a fully qualified URI" | 17:38 |
fungi | so maybe that was already attempted | 17:38 |
fungi | maybe we can do it like | 17:49 |
fungi | Substitute "s|https://galaxy.ansible.com/|%{REQUEST_SCHEME}://%{HTTP_HOST}:%{SERVER_PORT}/|ni" | 17:55 |
fungi | oh, i see it's a feature of apache 2.5.1: https://httpd.apache.org/docs/trunk/en/mod/mod_substitute.html (see the bit on expr= syntax) | 18:02 |
fungi | some of our older mirror servers are still on bionic, so not new enough for that | 18:03 |
fungi | oh, even focal's isn't | 18:04 |
fungi | yeah, nevermind. ubuntu lunar is even still on apache 2.4 | 18:05 |
fungi | so yeah, the options are dwindling | 18:09 |
fungi | i guess we could make the preferred mirror hostname a separate ansible var and jinja that into the Substitute directive | 18:10 |
fungi | assuming this ends up being the problem | 18:11 |
frickler | wow, that's some monster job logs that make my firefox choke when I open the corresponding zuul page https://184e5731741af40c59ec-11b479ab8ac0999ee2009c93a602f83a.ssl.cf1.rackcdn.com/870988/1/check/cross-nova-functional/1cb7591/ | 18:43 |
clarkb | frickler: definitely worth encouraging the nova team to address that. It will make jobs run faster too | 18:45 |
clarkb | I usually open large logs with vim and it mostly handles it | 18:45 |
frickler | the problem is that the build page loads the huge job-output.json and then seems to break it | 18:49 |
frickler | https://zuul.opendev.org/t/openstack/buildset/e8968c8bf1be4caa86d4d6ef0fb23cd9 is the buildset page, the failed job is the one in question | 18:49 |
clarkb | ya I'm not sure what zuul can do about that. I guess show an error? | 18:49 |
frickler | maybe it should truncate oversized logs even before uploading them | 18:51 |
clarkb | the problem with that is you lose the information necessary to address the problem in many cases | 18:52 |
frickler | then maybe just rename oversized job-output.json so it doesn't get autoloaded. it will still be available for manual inspection | 18:54 |
clarkb | lots of "WARNING [oslo_messaging.rpc.client] Using RPCClient manually to instantiate client. Please use get_rpc_client to obtain an RPC client instance." | 18:54 |
clarkb | 2617487 log lines in the job-output.txt. 2101438 are that line above | 18:55 |
clarkb | melwitt: ^ fyi | 18:55 |
clarkb | who hacks on oslo these days? stephenfin? Maybe that warning should be emitted once per process? | 18:56 |
frickler | not sure if or how that could be related to the eventlet bump though. doesn't seem to happen for other patches | 18:56 |
melwitt | I see stephenfin around oslo occasionally | 18:57 |
clarkb | frickler: I don't think it is. I suspect this is a change in oslo_messaging | 18:57 |
melwitt | I'll look at nova, should be "easy" to fix I would think | 18:57 |
clarkb | I think the january 5 release of oslo.messaging version 14.1.0 added it. Commit 4ead7cb2dcf376032f7bf9532a375256db6d3784 was the change and appears to be after 14.0.0 | 19:01 |
clarkb | tobias-urdin: ^ fyi | 19:02 |
melwitt | clarkb: looks like it was fixed two days agoi https://github.com/openstack/nova/commit/c59db128a00477f6163d71ea1454da4286dad708 | 19:26 |
melwitt | *ago | 19:26 |
clarkb | hrm that log is from about 26 hours ago | 19:27 |
clarkb | oh maybe the change landed more recently than two days ago. The commit time would be earlier | 19:27 |
clarkb | yup it merged 22 hours ago or so. That explains it | 19:28 |
melwitt | ah, yeah | 19:28 |
clarkb | thank you for looking into it | 19:29 |
melwitt | np, thanks for the heads up about it | 19:30 |
Tengu | fungi: oh! using the host you pointed (mirror01.dfw.rax.opendev.org) , it seems to work now! | 19:47 |
fungi | Tengu: thanks for testing, that at least narrows down the cause. now to figure out what to do about it | 19:47 |
Tengu | clarkb: I'd expect it to actually play a role, because the ansible-galaxy cache is using the servername (as passed in -s <servername>) | 19:47 |
fungi | that also makes a lot more sense as to why skipping the local cache works around the issue | 19:48 |
Tengu | yup | 19:48 |
Tengu | so in the CI case, skipping local cache is OK, since it's a one-shot. | 19:48 |
fungi | would still be nice to figure out how to do that substitution so that you don't have to use that workaround | 19:49 |
Tengu | and the local cache is a dict, built as {'host': {'module_name': {'path1': ..., 'path2': ...}}} | 19:49 |
Tengu | or something like that | 19:49 |
Tengu | fungi: iirc httpd itself should know its actual name? | 19:50 |
fungi | well, that's the tricky part. there's not necessarily any single name, that vhost supports multiple names | 19:50 |
Tengu | fungi: yeah, so, confirmation: cleaning local cache, running with ansible-galaxy collection install -vvvvvv -s https://mirror.dfw.rax.opendev.org:4448 -p /tmp/foo_test__ community.general it fails; cleaning cache, re-running with the mirror01, it works. | 19:51 |
fungi | however, we can probably specify a preferred name we want to use in the rewrites. for example doing mirror-int for the ones where we want the nodes to use the mirror's internal/private interface for performance reasons | 19:51 |
Tengu | fungi: hmmm.... so there's a mismatch between the ansible variable (don't remember its name) and the actual vhost in apache config? | 19:51 |
Tengu | jm1: we found the root cause apparently :) | 19:52 |
Tengu | jm1: well, actually, fungi pointed the missing piece :) | 19:52 |
fungi | not a mismatch. mirror and mirror01 are both valid names for the server and the vhost will serve (currently the same) content for either | 19:52 |
Tengu | fungi: ~> the ansible_var we get in the zuul job then? | 19:53 |
fungi | the challenge is deciding which name we want nodes using in their requests, which may not be the primary hostname for the server | 19:53 |
fungi | but setting that statically in the vhost configuration, because as your inline comment points out, mod_substitute on apache 2.4.x doesn't support expressions | 19:53 |
jm1 | Tengu, fungi 🥳 | 19:54 |
jm1 | I just wanted to give up on this :D | 19:54 |
Tengu | fungi: hmmm..... so mod_substitute doesn't support the httpd internal variables? | 19:54 |
fungi | not until apache 2.5.1 (currently under development) | 19:54 |
Tengu | dang | 19:54 |
fungi | however, we can probably add an ansible var in our deployment inventory for each mirror host to contain the name we plan to tell clients to access it as | 19:55 |
Tengu | that would be great :) | 19:55 |
fungi | and then jinja splat that into the substitute rule | 19:56 |
fungi | other root sysadmins with a better grasp of ansible can tell me if i'm smoking something with that idea | 19:56 |
Tengu | we can do whatever you want actually :). host_vars are here for that. | 19:57 |
clarkb | fungi: the main issue with that is you'd break public access/testing of the name in rax (since we'd have to use the internal name since that is what the jobs get). Elsewhere it is fine and we can probably use elsewhere as testing proxies | 20:00 |
fungi | clarkb: yes, i think more generally it's impossible to proxy ansible-galaxy completely with apache if you want to serve it from multiple arbitrary hostnames for the same server | 20:01 |
fungi | so it's already broken in that way | 20:01 |
clarkb | ya | 20:02 |
fungi | just trying to think of a solution which breaks it in favor of the hostnames we want test nodes using rather than in favor of some other hostname | 20:02 |
fungi | where the latter is what we have at the moment | 20:02 |
fungi | another way would be to give the rackspace mirrors two vhosts and use sni to route requests to the correct one for internal vs external interface hostnames | 20:04 |
Tengu | or maybe pass a secondary zuul_site_mirror_fqdn var such as zuul_site_mirror_fqdn_fixed (or the like) that will then match the actual name of the host (i.e. mirror01.dfw.rax.opendev.org) ? | 20:05 |
Tengu | that way, actual jobs will have to get the config, but it doesn't really change anything on the mirror config itself? | 20:05 |
fungi | Tengu: the problem (and the reason for the mention of rackspace) is that in rackspace our mirror servers are dual-homed and we'd prefer nodes to connect to their non-public interfaces | 20:06 |
Tengu | hmm ok. | 20:07 |
fungi | so we really do want nodes to use urls like https://mirror-int..dfw.rax.opendev.org/... which isn't reachable from outside | 20:07 |
Tengu | sounds legit | 20:07 |
fungi | mainly because we get improved efficiency and stability for connections across their private internal network | 20:08 |
fungi | so making the apache configuration on that server know the hostname we're telling nodes to connect to would give us something to bake into the substitute rule | 20:08 |
clarkb | ya that might be the best approach but more effort than simply substituting the internal name always | 20:09 |
Tengu | fungi: so for instance http://mirror.iad3.inmotion.opendev.org:8085 would actually be another name ? | 20:09 |
fungi | and yeah, an apache host_var containing that name would be one way to go about it | 20:09 |
fungi | Tengu: we tell clients to connect to mirror.iad3.inmotion.opendev.org which is currently a cname to mirror02.iad3.inmotion.opendev.org | 20:10 |
fungi | the server knows itself as mirror02 but considers mirror to be an available alias | 20:10 |
Tengu | ok, so this means the substitute talks about mirror02, while galaxy knows "mirror".... and crashes. | 20:11 |
Tengu | ok. | 20:11 |
fungi | right | 20:11 |
Tengu | and it fails once we get to the second page for #reason. | 20:11 |
Tengu | because single paging is fine. | 20:11 |
fungi | so if we tweak the substitute rule to use mirror.iad3.inmotion.opendev.org like the client requests do, then it should work | 20:12 |
Tengu | go figure.... they probably messed big time at some point, but that's really a corner case. | 20:12 |
Tengu | fungi: so if we have a way to inject that name "mirror.iad3..." in the ansible generating the config, we're good. | 20:12 |
fungi | the reason the pagination seems to break it is that the pages include "previous" and "next" fields which use fully qualified urls | 20:12 |
Tengu | does it? | 20:13 |
fungi | and the client is probably following the "next" url from the first page, which then takes it to mirror02 instead of mirror | 20:13 |
Tengu | oh.... dang. | 20:13 |
Tengu | yeah | 20:13 |
Tengu | that's exactly that | 20:13 |
Tengu | that's the trick | 20:13 |
Tengu | you got it, fungi ! | 20:14 |
fungi | anyway, the solution seems fairly straightforward, we probably just need to get consensus among the sysadmins as to the best way to encode the "preferred" request name for the mirror sites (like do we leverage some mechanism to generate them on the fly in group_vars or something similar which doesn't need us to list them individually) | 20:16 |
Tengu | fungi: *maybe* as a first thing we may just make a simple mapping in the jinja, using {% set %} and some if/elif/else things... | 20:16 |
tobias-urdin | clarkb: yeah, I'm working on getting everything moved over to the new API there, started with Nova gonna continue with Neutron but got blocked needing this https://review.opendev.org/c/openstack/oslo.messaging/+/869899 and been struggling some with getting CI green with new tox etc | 20:17 |
tobias-urdin | I will continue look into it for sure | 20:17 |
Tengu | fungi: though... if I understand correctly, mirror.foo.bar is a cname to, at least, mirror01.foo.bar, but may also be a cname to mirror02.foo.bar - would that mean both are up, and both are answering, meaning galaxy may end on 01 first, then re-request mirror.foo.bar and end on 02? | 20:18 |
fungi | Tengu: we add and remove mirror servers frequently is the reason for the cnames | 20:19 |
fungi | you can't have a cname resolve to multiple names though, you're probably thinking of round-robin address records | 20:20 |
Tengu | fungi: err yeah, round-robin address record indeed. | 20:20 |
Tengu | and yeah, cname can't match multiple names. indeed. so it's more a "service" address that may be attached to any of the used server. | 20:21 |
Tengu | fungi: maybe... why is the proxy able/configured to answer to multiple names? | 20:21 |
fungi | anyway, there's enough churn that setting vars is probably cleaner than doing some sort of name mapping | 20:21 |
Tengu | :) pretty sure you'll figure something out | 20:22 |
Tengu | lemme know if there's a need for testing or just pushing ideas. | 20:22 |
Tengu | though.... not today - it's getting late here. | 20:22 |
Tengu | but at least, the root cause is known | 20:22 |
fungi | Tengu: mainly because we haven't told the proxies not to, and in some cases (like the multi-homed mirrors in rackspace) it's useful to be able to test them over the internet on externally reachable interfaces | 20:22 |
Tengu | fungi: makes sense. | 20:22 |
Tengu | a pity mod_substitute doesn't support vars yet ;_;. that would solve everything | 20:23 |
fungi | we could make the apache vhosts name-specific rather than wildcarded, as i mentioned earlier, it would just mean a lot of duplication in the configs or more apache templating | 20:23 |
jrosser | is the idea to generally proxy/cache any ansible collection, or are there a subset of them we're more interested in? | 20:24 |
fungi | so there are several ways we could go about it, mainly just trying to work out the least intrusive | 20:24 |
Tengu | jrosser: no actual idea. I thought it would be 2 or 3, but apparently that's already wrong. | 20:24 |
fungi | jrosser: to cache general ansible-galaxy access. if you're going to want to test with unreleased or un-merged commits from specific collections, still better to use required-projects in zuul | 20:25 |
jrosser | well, in projects i'm involved in we rewrite the collection URLs on the fly to use any that happen to be cached on the CI node | 20:25 |
clarkb | tobias-urdin: my main suggestion would be to look into using the python warnings library and emit the warning once | 20:25 |
Tengu | jrosser: yeah - and if not cached on the ci node? adding more and more and more isn't good either, since they end up being moved around during the node bootstrap | 20:26 |
Tengu | using the caching-proxy is more flexible imho. and... we're not that far from a working setup, once we get over that hostname "mismatch". And there's a "clean" workaround (passing --no-cache to ansible-galaxy CLI) | 20:27 |
Tengu | also, that issue seems to be affecting only ansible >2.9 - because the galaxy cache was implemented in later release (2.11 I think) | 20:28 |
clarkb | they ultimately provide different functionality some of which is useful at different times. In particular you should use the Zuul case if you are doing testing against unreleased collections and if you want depends-on support | 20:29 |
Tengu | yep. | 20:29 |
Tengu | anyway.... getting really late, I'll check back tomorrow :) | 20:32 |
fungi | however, if you're just doing some ansible testing and need something from galaxy, it's nice not to need to wait for someone to add another github repo to the zuul tenant config | 20:32 |
fungi | have a good night Tengu! | 20:32 |
Tengu | thanks fungi for the pointers :). | 20:32 |
jrosser | the most widespread problem i have seen with galaxy in ci jobs is the API returning 5xx and just bailing out, so server side problem at their end | 20:33 |
jrosser | as a result our jobs get the collections from github with git rather than galaxy with the API wherever possible | 20:34 |
jrosser | though having said that, occurrences of that kind of error have been very infrequent lately, something must have been fixed/improved in the galaxy server | 20:35 |
fungi | also, in theory, proxying and caching requests for galaxy local to our test nodes should reduce the load we impose on their servers by running test jobs, while improving latency, packet loss, and bandwidth availability/speed for the requests if hitting a warmed cache | 20:41 |
tobias-urdin | clarkb: i guess the problem is that it gets logged every time for example an API worker is spawned which means it will be a new interpreter (at least for mod_wsgi) every time for the lifetime of that worker at least | 20:57 |
tobias-urdin | did it shrink down a bit when nova was fixed? or is there some other ones causing potential issues | 20:58 |
*** dviroel is now known as dviroel|out | 21:05 | |
clarkb | tobias-urdin: https://5ee1e6c6ba7962bf8d90-9f271c6f9270f1e424d49ce4325dabf5.ssl.cf2.rackcdn.com/871001/1/gate/cross-nova-functional/f996c0a/ it shrunk down significantly. It's more that if you are going to warn a user or operator about something, repeating the warning in a tight loop is not helpful as it fills disks/logs and irritates them. That is why the python warnings library | 21:15 |
clarkb | allows you to emit such warnings once and move on | 21:15 |
tobias-urdin | clarkb: yeah i agree, just wondering if using python warning once would actually solve all such issues but hm yea probably some of them at least | 21:26 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fix Gerrit 3.6 image build https://review.opendev.org/c/opendev/system-config/+/870118 | 21:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Build Gerrit on top of our python-base images https://review.opendev.org/c/opendev/system-config/+/870874 | 21:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Switch Gerrit to Java 17 https://review.opendev.org/c/opendev/system-config/+/870877 | 21:28 |
clarkb | this is neat I've got github notifications for a gitea release that doesn't show up in github yet | 21:32 |
clarkb | tobias-urdin: I would expect it to cut down quite a bit since wsgi should have some process reuse right? | 21:33 |
tobias-urdin | clarkb: yeah think so, i've proposed patches to all projects now at least, not sure if i should change oslo.m also | 21:34 |
*** arxcruz|ruck is now known as arxcruz | 21:50 | |
opendevreview | Ian Wienand proposed openstack/project-config master: nodepool: drop linaro-us https://review.opendev.org/c/openstack/project-config/+/871196 | 21:55 |
clarkb | ianw: ^ re that did the new flavor get things going on the new cloud? | 21:55 |
clarkb | ianw: also left a thought on that new change | 21:57 |
clarkb | hrm that tag is still not there I wonder if that implies they immediately deleted it | 22:02 |
ianw | not really. for some reason, it hasn't chosen to upload all the image types, i'm not sure why. there's nothing in the nb04 logs that i can see, it just doesn't seem to try uploading | 22:02 |
clarkb | has it built them? if so thats weird | 22:03 |
ianw | kevinz gave me access to the cloud, but i'm a little worried it's out of disk | 22:03 |
ianw | /dev/nvme1n1p2 196G 186G 792M 100% / | 22:04 |
clarkb | that is one downside to raw images | 22:04 |
clarkb | they are a lot bigger | 22:04 |
clarkb | I wonder if we shouldn't consider trimming what we support on arm64 way back. Like Jammy and Rocky 9 | 22:05 |
clarkb | that's 4 images (2 * 2); at about 20GB each raw we'd be under that limit | 22:05 |
ianw | Local Volumes space usage: | 22:07 |
ianw | glance 1 122.9GB | 22:07 |
ianw | it does seem to me that is probably where the images are being stored. i'm still just trying to understand the layout and kolla deployment | 22:08 |
ianw | t] Failed to upload image data due to HTTP error: webob.exc.HTTPRequestEntity | 22:12 |
ianw | TooLarge: Image storage media is full: There is not enough disk space on the image storage media. | 22:12 |
ianw | yeah, glance is not happy | 22:12 |
ianw | ok, kevinz did explain this, but i see now ... there's 2 1tb disks on this | 22:16 |
ianw | nvme0n1 259:0 0 894.3G 0 disk | 22:16 |
ianw | nvme1n1 259:1 0 894.3G 0 disk | 22:16 |
ianw | nvme1n1 is the boot disk -- it has a 1gb efi partition, and 200gb / and then the rest is in lvm for cinder volumes | 22:17 |
ianw | nvme0n1 is 100% in the cinder lvm | 22:17 |
ianw | i think we probably want to make glance use cinder | 22:21 |
opendevreview | Merged opendev/system-config master: Fix Gerrit 3.6 image build https://review.opendev.org/c/opendev/system-config/+/870118 | 22:21 |
ianw | since that is where the space is | 22:21 |
clarkb | if that is possible that seems like a good idea | 22:23 |
ianw | it doesn't say cinder -> https://docs.openstack.org/kolla-ansible/latest/reference/shared-services/glance-guide.html#glance-backends | 22:25 |
ianw | but https://opendev.org/openstack/kolla-ansible/commit/fa49b2692de1b38bfdf47e1468296770d5dfff89 suggests maybe otherwise | 22:27 |
*** dasm is now known as dasm|off | 23:13 | |
*** rlandy is now known as rlandy|out | 23:29 |
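
Editor's note on the 21:15 exchange: clarkb's point about the Python warnings library emitting a warning once can be sketched roughly as below. This is only an illustration, not code from any of the projects discussed; the function names `noisy` and `count_emitted` and the warning messages are hypothetical, and only the standard library `warnings` module is used.

```python
import warnings


def noisy(message):
    # Hypothetical stand-in for any code path that warns on every call.
    warnings.warn(message, DeprecationWarning, stacklevel=2)


def count_emitted(action, message, calls=1000):
    """Call noisy() repeatedly under the given filter action and return
    how many warnings were actually emitted (captured, not printed)."""
    with warnings.catch_warnings(record=True) as caught:
        # Install a single catch-all filter with the requested action.
        warnings.simplefilter(action)
        for _ in range(calls):
            noisy(message)
    return len(caught)


if __name__ == "__main__":
    # "always" emits on every call; "once" emits only the first time.
    print("always:", count_emitted("always", "demo: emitted every call"))
    print("once:", count_emitted("once", "demo: emitted one time"))
```

With the "always" action every call in the loop is reported, while with "once" only the first is. As tobias-urdin notes at 20:57, this bookkeeping lives inside a single interpreter, so each newly spawned worker process would still log the warning one time.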