opendevreview | OpenStack Proposal Bot proposed openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/892726 | 03:06 |
opendevreview | Merged openstack/project-config master: Normalize projects.yaml https://review.opendev.org/c/openstack/project-config/+/892726 | 11:43 |
*** TheJulia is now known as needs_brains_and_sleep | 13:04 | |
*** needs_brains_and_sleep is now known as TheJulia | 13:23 | |
frickler | infra-root: kolla just saw four jobs failing in gate in parallel https://zuul.opendev.org/t/openstack/buildset/9a2af787cb62474c88536484142d607d , three of them ran on rax-ord. I'm kind of EODing, maybe you can have a look? mnasiadka might still be around a bit to answer possible kolla related questions | 14:02 |
fungi | looking | 14:12 |
fungi | during `kolla-ansible -i /etc/kolla/inventory -vvv bootstrap-servers` the ssh connection to the node was prematurely closed, in all 4 cases, looks like? | 14:13 |
fungi | since one of those four happened in ovh, i think we can rule out a provider-specific problem impacting the nodes themselves | 14:16 |
fungi | though it's possible something provider side is impacting the executors | 14:16 |
Clark[m] | Was it zuul's connection or Kolla ansible's connection? The latter is comms all within the same cloud | 14:19 |
fungi | with all four builds, the connection error occurred within 1-2 minutes of starting to run the setup_gate.sh script | 14:22 |
fungi | primary | ok: 19 changed: 15 unreachable: 0 failed: 1 skipped: 14 rescued: 0 ignored: 0 | 14:23 |
fungi | i think that's from the nested ansible? | 14:23 |
fungi | it's possible that stderr line about the shared connection being closed is not related to the cause. ultimately the task failed because the setup script exited 2 | 14:30 |
fungi | okay, here we go: https://zuul.opendev.org/t/openstack/build/72774530166546c1a96284f3312e0e36/log/primary/logs/ansible/bootstrap-servers#749 | 14:31 |
fungi | https://zuul.opendev.org/t/openstack/build/f7e6f2677ed9440db9f8a3c1b04f1868/log/primary/logs/ansible/bootstrap-servers#699 | 14:32 |
fungi | failing in different spots in that script | 14:33 |
fungi | "Failed to download metadata for repo 'docker': Yum repo downloading error: Downloading error(s): repodata/8e89c445039a4ff75bb98ab62bee6b6ae7c4c8ae853a61cab75de5e30c39d0bf-primary.xml.gz - Cannot download, all mirrors were already tried without success; repodata/abe464de7c144654302f1b3b46042d88f1d6550b46527f15a2cef794091f2b3c-filelists.xml.gz - Cannot download, all mirrors were already tried | 14:33 |
fungi | without success" | 14:33 |
fungi | "E:Failed to fetch https://download.docker.com/linux/debian/dists/bookworm/stable/binary-amd64/Packages.bz2 File has unexpected size (11933 != 12572). Mirror sync in progress?" | 14:36 |
frickler | 14:37 | |
fungi | seems like different mirror problems (for different distros, in different parts of the world), but maybe there is some relationship | 14:37 |
fungi | afs fileservers and database servers don't have anything new in dmesg for the past two months, so they don't seem to think they're in distress at least | 14:43 |
fungi | oh! i should have paid closer attention. those are direct download errors for things hosted on download.docker.com, nothing to do with our mirrors at all i don't think? | 14:44 |
fungi | i should have looked closer | 14:45 |
fungi | so my best guess is that docker.com is blowing up their package repositories | 14:45 |
fungi | mnasiadka: ^ let me know if that explanation doesn't match your observations | 14:45 |
frickler | oops, looks like that ssh session wasn't as dead as it looked, sorry | 15:09 |
frickler | fungi: thanks for debugging, seems I was misled by the non-fatal warnings about our mirror host in the same block | 15:10 |
fungi | frickler: some of the confusion is probably due to the fact that zuul is now separating display of stdout and stderr streams, which made a normal connection close jump out as if it were the cause | 15:11 |
fungi | simply because it was the only thing that task sent to the stderr stream | 15:12 |
fungi | it misled me at first too | 15:12 |
frickler | I thought that that feature wasn't enabled yet? | 15:13 |
frickler | kolla explicitly redirects logs for those tasks I think | 15:13 |
fungi | right, i think the connection closed was coming from the task run by the executor, looking at how it's split up. but maybe i'm misinterpreting that | 15:15 |
Clark[m] | Stdout and stderr shouldn't be split yet. And when the first change happens it will be opt in | 15:20 |
fungi | that "Shared connection to 23.253.56.132 closed." et cetera is clearly called out as being from stderr though | 15:27 |
fungi | https://zuul.opendev.org/t/openstack/build/72774530166546c1a96284f3312e0e36/console | 15:27 |
fungi | has separate boxes for stderr and stdout | 15:27 |
fungi | they're separate in the ansible json | 15:28 |
fungi | and so also in the task summary | 15:28 |
Clark[m] | It's not a command or shell task. I think it is a https://docs.ansible.com/ansible/latest/collections/ansible/builtin/script_module.html task | 15:33 |
Clark[m] | Only command and shell have the combined outputs | 15:33 |
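For readers following along, a rough illustration of the distinction Clark[m] draws above. The task names and script path below are hypothetical, not taken from the kolla jobs; the point is that a script task reports stdout and stderr as separate result fields, while command/shell output is the one case shown combined, which is why a lone "Shared connection ... closed." line can appear under stderr by itself.

```yaml
# Hypothetical tasks for illustration only -- not the actual kolla playbook.
# The script module returns stdout and stderr as separate result fields,
# so routine ssh teardown noise shows up in stderr on its own.
- name: Run the gate setup via the script module
  ansible.builtin.script: tools/setup_gate.sh

# Output from command/shell tasks is the case presented as one combined stream.
- name: Run the gate setup via the shell module
  ansible.builtin.shell: ./tools/setup_gate.sh
```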
fungi | oh, i see | 15:35 |
clarkb | fungi: if you're still ready for mailman3 I think we can send it in | 15:41 |
fungi | yeah, i'm braced and ready for impact | 15:42 |
clarkb | should I +A or do you want to do it? | 15:43 |
fungi | go for it | 15:44 |
frickler | the "Shared connection ... closed." is a red herring, you can also see it for that task when it (and the whole job) is passing https://zuul.opendev.org/t/openstack/build/89e86e76b0df4af1a19511f98a4fb323/console#2/1/28/primary | 15:44 |
clarkb | done | 15:44 |
fungi | thanks! | 15:47 |
fungi | frickler: yeah, i realized that was the case after i went hunting for the bootstrap-servers log and saw the errors in it | 15:48 |
fungi | the upload image job has been sitting queued for a while even though we've got tons of available capacity. i wonder if we're running into significant boot failures again | 16:08 |
clarkb | it is running now | 16:13 |
clarkb | ~5 minutes isn't abnormal for a node boot in some clouds. However I think it ended up being longer than that | 16:13 |
fungi | yeah, it was close to 30 minutes wait for the node after the registry job paused | 16:23 |
fungi | it's wrapping up now | 16:26 |
opendevreview | Merged opendev/system-config master: Upgrade to latest Mailman 3 releases https://review.opendev.org/c/opendev/system-config/+/869210 | 16:27 |
clarkb | looks like it is promoting the image but not triggering the lists3 job | 16:29 |
clarkb | ya that job doesn't trigger on mailman3 docker updates. Only on the playbook side | 16:30 |
clarkb | fungi: you could add the docker paths to the files list for the infra-prod-service-lists3 job and add a rebuild trigger comment to one or several of the images to have it run through again | 16:30 |
clarkb | or we could manually trigger the infra-prod-service-lists3 playbook on bridge instead | 16:30 |
fungi | sure, on it | 16:30 |
clarkb | probably better long term to have docker image builds trigger the job though | 16:31 |
clarkb | thinking back to when we built this out I think it wasn't clear if upgrades needed intervention like gerrit or not so we left that out | 16:34 |
clarkb | the current indication is that this should typically be automated so making it happen automatically makes sense to me. But if you think that isn't the case we can leave it as is and trigger the playbook manually | 16:34 |
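A minimal sketch of the approach being discussed, ahead of the actual change below; the job name and role path come from the conversation, but the exact matcher patterns and surrounding Zuul configuration are illustrative only.

```yaml
# Illustrative only -- the real infra-prod-service-lists3 definition lives
# in system-config's Zuul configuration and its file matchers may differ.
- job:
    name: infra-prod-service-lists3
    files:
      - playbooks/roles/mailman3/.*
      - docker/mailman/.*
```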
opendevreview | Jeremy Stanley proposed opendev/system-config master: Trigger mm3 deployment when containers change https://review.opendev.org/c/opendev/system-config/+/892807 | 16:36 |
fungi | clarkb: like that ^? | 16:36 |
clarkb | yup looks similar to etherpad for example | 16:37 |
clarkb | I've approved it | 16:37 |
fungi | thanks | 16:37 |
fungi | once again system-config-build-image-mailman is taking a surprisingly long time to get a single ubuntu-jammy node assigned | 16:48 |
fungi | there it finally goes | 16:48 |
fungi | that time was about 10 minutes after opendev-buildset-registry paused | 16:49 |
fungi | not as bad at least | 16:49 |
fungi | just surprising when we have so much available capacity at the moment | 16:49 |
fungi | this time it got a node the instant the registry paused | 17:09 |
fungi | error node and time to ready spikes on https://grafana.opendev.org/d/6c807ed8fd/nodepool suggest we may have some intermittent issues | 17:13 |
fungi | i think the errors are predominately in ovh regions: https://grafana.opendev.org/d/2b4dba9e25/nodepool%3a-ovh | 17:14 |
clarkb | should merge soon | 17:22 |
fungi | yup | 17:24 |
opendevreview | Merged opendev/system-config master: Trigger mm3 deployment when containers change https://review.opendev.org/c/opendev/system-config/+/892807 | 17:24 |
clarkb | I just cleaned up my etherpad and gitea autoholds in zuul | 17:25 |
fungi | infra-prod-service-lists3 is waiting in deploy this time, so looks like it worked | 17:25 |
fungi | i'll clean up the mm3 held node once we're sure the prod upgrade is good, just in case we need it for a comparison or something | 17:26 |
fungi | deployment is now in progress | 17:26 |
fungi | containers are restarting | 17:27 |
clarkb | ya I left those other two around for a bit for the same reason | 17:28 |
fungi | https://lists.opendev.org/ is up and still seems to be working | 17:28 |
fungi | "Postorius Version 1.3.8" on https://lists.opendev.org/mailman3/lists/ | 17:29 |
clarkb | and hyperkitty 1.3.7 on archive pages | 17:29 |
fungi | "HyperKitty version 1.3.7" on https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/ yep | 17:29 |
clarkb | I can read emails in the archive. vhosting still seems good. | 17:30 |
clarkb | Main thing we're missing is an email going through | 17:30 |
fungi | docker/mailman/web/requirements.txt contains postorius==1.3.8 and hyperkitty==1.3.7 | 17:31 |
fungi | i'm planning to send something to service-discuss next | 17:31 |
fungi | i've received my copy | 17:35 |
fungi | https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/EUE5GZNFTH22QAG5D2BMF3R56IEAXE4R/ | 17:36 |
clarkb | I got it too | 17:36 |
fungi | i think we're set | 17:36 |
clarkb | ++ | 17:36 |
fungi | i'm going to head out to a late lunch pretty soon, but will check back in once i get back | 17:36 |
clarkb | enjoy. I'm going to try and sneak a bike ride in before it gets super hot | 17:36 |
clarkb | we had thunderstorms overnight so temperatures never really dropped | 17:36 |
clarkb | was warm and humid. | 17:37 |
fungi | hot and muggy. sounds like here | 17:37 |
fungi | i mostly just want to get outside to escape the paint fumes | 17:37 |
fungi | now that they're done for the day | 17:37 |
clarkb | if you breathe deeply it is its own form of escape | 17:37 |
fungi | touché | 17:37 |
fungi | seems the universe didn't implode while i was out at the bar. good | 19:46 |
Clark[m] | I didn't expect fireworks. I'm eating lunch but wanted to follow up on whether or not we can unfork that config file now. Then maybe approve a bookworm container update or two | 20:01 |
fungi | oh, yep sounds good | 20:08 |
fungi | Clark[m]: which config file specifically were you thinking we can un-fork? | 20:13 |
Clark[m] | fungi: https://review.opendev.org/c/opendev/system-config/+/869210/8/docker/mailman/web/mailman-web/settings.py that one maybe | 20:22 |
fungi | i'll double check it against upstream | 20:23 |
Clark[m] | https://opendev.org/opendev/system-config/src/branch/master/playbooks/roles/mailman3/files/web-settings.py is the forked version | 20:23 |
Clark[m] | the one in the change is/should be in sync with upstream. Then in our role we bind mount over it | 20:23 |
fungi | we do force SITE_ID = 0 in it intentionally | 20:25 |
clarkb | ya so there are a couple of extra things that we would need to address upstream first | 20:27 |
clarkb | in that case we can't unfork and that's fine. This has worked well enough so far | 20:27 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/892702 is a low risk bookworm update | 20:28 |
fungi | we can reset ours to what's in maxking/docker-mailman except for the SITE_ID override, i think | 20:28 |
fungi | yes, i agree. the rest seem like basically no-op changes | 20:31 |
fungi | at least in our case | 20:31 |
clarkb | I think the gethostbyname("mailman-web") hardcoded in the list of hosts is still a problem too | 20:35 |
clarkb | we use host networking so that name doesn't end up in magical dns for us | 20:36 |
fungi | ah, so we do still need the custom 127.0.0.1 entry? | 20:39 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py https://review.opendev.org/c/opendev/system-config/+/892817 | 20:42 |
clarkb | fungi: I thought 127.0.0.1 came from upstream and I put the differences under my comment. We may not need it since localhost is there | 20:42 |
fungi | the 127.0.0.1 had a comment explicitly claiming to be an opendev edit | 20:43 |
clarkb | ah it did | 20:43 |
clarkb | fungi: note the mailman-web stuff needs to be commented out so I don't think^ will work | 20:43 |
fungi | i just tried to clarify the comment a bit | 20:43 |
clarkb | it will fail on mailman web startup trying to resolve that name | 20:43 |
fungi | ah, i'll comment it out again then | 20:44 |
clarkb | both the lines | 20:44 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py https://review.opendev.org/c/opendev/system-config/+/892817 | 20:45 |
fungi | yep | 20:45 |
fungi | didn't know if it would just get ignored when lookups failed | 20:46 |
clarkb | fungi: oh you need to edit our DJANGO_ALLOWED_HOSTS value too | 20:49 |
clarkb | I didn't notice upstream changed the separator | 20:49 |
fungi | the new upstream separator won't work for us? | 20:50 |
clarkb | fungi: our values are separated by : not , so the split won't split things in a meaningful way for us | 20:50 |
clarkb | currently we reuse the exim mm_domains variable and exim wants : iirc | 20:50 |
clarkb | but we can define a new value or convert it in ansible before writing it out into the config for docker-compose | 20:50 |
fungi | oh, so we can't just change to , | 20:50 |
clarkb | correct | 20:51 |
clarkb | we need something slightly smarter. But still doable | 20:51 |
fungi | we still need the conditional there too? | 20:51 |
fungi | and use the ansible var instead of the envvar? | 20:52 |
clarkb | fungi: we don't need the condition. That was there to make the change more likely to be upstreamable. But they condensed it down in a safe way for us (even if they didn't it would work because we always set the value) | 20:52 |
clarkb | I wouldn't use the ansible var. I would keep using the envvar to stay in sync with upstream there | 20:53 |
clarkb | what we need to change is where we set the env var value | 20:53 |
clarkb | which is set in playbooks/roles/mailman3/templates/docker-compose.yaml.j2 | 20:53 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py https://review.opendev.org/c/opendev/system-config/+/892817 | 20:53 |
clarkb | and we can do something like mm_domains | split(:) | join (,) | 20:53 |
clarkb | not valid ansible | 20:53 |
fungi | mmm, okay. but basically the only thing we could un-fork was to update the TIME_ZONE assignment? | 20:54 |
clarkb | we can unfork the DJANGO_ALLOWED_HOSTS code too. We just have to change how we set the DJANGO_ALLOWED_HOSTS value in docker-compose.yaml | 20:54 |
fungi | seems like we're not really un-forking the config, though it was a good exercise to confirm basically all the differences there were needed | 20:54 |
clarkb | basically upstream splits on , we split on : so we can change the input to the split and unfork that way | 20:55 |
fungi | ah, i guess the docker-compose is a jinja2 template so we can manipulate values there | 20:57 |
clarkb | yup exactly. I think doing that is worthwhile if we can pretty easily use jinja filters to change the separator value | 20:57 |
clarkb | that way it's less divergence from upstream in the settings file | 20:57 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py https://review.opendev.org/c/opendev/system-config/+/892817 | 21:02 |
fungi | so like that? | 21:02 |
clarkb | in DJANGO_ALLOWED_HOSTS={{ mm_domains.split(':') | join(,) }} I don't think mm_domains.split() is valid. You need to use | filter() syntax? | 21:08 |
clarkb | but yes from a pseudo code perspective | 21:08 |
clarkb | huh google says I'm wrong | 21:12 |
clarkb | I guess it isn't clear to me which things are functions and which things are filters then | 21:12 |
clarkb | fungi: you might need quotes around the , ? otherwise I guess that may work | 21:13 |
fungi | oh, i thought i did, sorry. will fix | 21:14 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: mailman3: re-sync custom web/settings.py https://review.opendev.org/c/opendev/system-config/+/892817 | 21:15 |
fungi | i stole the foo.split() invocation from other templates we have in system-config, fwiw | 21:15 |
clarkb | ya so some string methods exist as directly invocable? | 21:16 |
clarkb | jinja is weird | 21:16 |
clarkb | I think that will work. CI should confirm | 21:16 |
fungi | playbooks/roles/base/exim/templates/exim4.conf.j2 roles/set-hostname/templates/hosts.j2 roles/set-hostname/templates/mailname.j2 | 21:17 |
fungi | were the examples i found of split() methods in templates | 21:17 |
fungi | cargo cult ftw | 21:18 |
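For reference, a minimal sketch of the docker-compose template line under discussion; the service and environment structure here is illustrative, and only the DJANGO_ALLOWED_HOSTS expression reflects what was worked out above (converting the colon-separated exim mm_domains value into the comma-separated list that upstream's settings.py splits on).

```yaml
# Sketch based on playbooks/roles/mailman3/templates/docker-compose.yaml.j2;
# only the DJANGO_ALLOWED_HOSTS line reflects the discussion above.
services:
  mailman-web:
    environment:
      - DJANGO_ALLOWED_HOSTS={{ mm_domains.split(':') | join(',') }}
```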
opendevreview | Merged opendev/system-config master: Update zookeeper-statsd image to bookworm https://review.opendev.org/c/opendev/system-config/+/892702 | 21:21 |
clarkb | the mm3 change passed testing. I was hoping we recorded the docker-compose.yaml file but it seems we don't | 22:02 |
clarkb | it's probably fine | 22:02 |
clarkb | looks like zookeeper-statsd won't update until our daily run later. Not a big deal, it's low impact if anything goes wrong (we lose zk stats until we fix it) | 22:03 |