| -@gerrit:opendev.org- Roja Eswaran proposed: [openstack/diskimage-builder] 987717: Replace debootstrap with mmdebstrap https://review.opendev.org/c/openstack/diskimage-builder/+/987717 | 15:35 | |
| -@gerrit:opendev.org- Monty Taylor https://matrix.to/#/@mordred:inaugust.com proposed: [opendev/system-config] 991140: Start mirroring the rust container image https://review.opendev.org/c/opendev/system-config/+/991140 | 15:39 | |
| @mordred:waterwanders.com | looks like we're using unauthenticated docker for mirroring. presumably if that ^^ and its parent (add resolute) aren't centrally interesting there's nothing stopping me from using the same job in an inaugust repo or drizzle namespace and doing my own parallel mirror, right? (although those seem reasonably applicable enough to be in the central mirror) Also - I have a rabbitmq plugin in drizzle that it feels like testing against the rabbitmq image would make sense. I'm a little surprised no-one in openstack is deploying rabbit in a container... or at least not leveraging upstream rabbit images and wanting a mirror image | 16:01 |
|---|---|---|
| @jim:acmegating.com | i think our intention was to have a fairly low bar for mirroring, but not at zero. both of those seem to be well above that so +2 from me on both. | 16:07 |
| @clarkb:matrix.org | I think the primary thing was that we wouldn't mirror the super specific image that only one group would need. But generic things like language runtimes and official builds of tools like databases are fine as they can and are used by many | 16:10 |
| @mordred:waterwanders.com | ++ cool! \o/ | 16:17 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 16:20 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 16:21 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 16:21 | |
| @mordred:waterwanders.com | I feel like I should know this - but if I have a job in opendev/base-jobs and I want to mention a role from zuul/zuul-jobs in its documentation, do we have a construct for that? | 16:23 |
| @mordred:waterwanders.com | (actually, thinking about it, I'm not sure I really need to do that - but now I'm curious) | 16:24 |
| @jim:acmegating.com | mordred: someone just added intersphinx support to zuul-sphinx, so it should be possible to use that for that now, but nothing has been done to enable that in any of the opendev repos. i would happily review patches that did that. | 16:25 |
| short of that, i think just a normal hyperlink. | ||
| @jim:acmegating.com | (i think they did add testing of that and examples to the zuul-sphinx repo itself, so i think that could serve as a source of copypasta) | 16:26 |
| @mordred:waterwanders.com | neat. I may nerd-snipe myself on that one. it feels like a thing that _should_ work | 16:26 |
| @jim:acmegating.com | agree; a little awkward for that not to be enabled in zuul-jobs and friends now that it exists | 16:27 |
| @mordred:waterwanders.com | Clark: mind +A on the parent of the rust one: https://review.opendev.org/c/opendev/system-config/+/990727 | 16:32 |
| @clarkb:matrix.org | oh yup I missed there was a parent change | 16:35 |
| @mordred:waterwanders.com | yeehaw | 16:36 |
| -@gerrit:opendev.org- Zuul merged on behalf of Monty Taylor https://matrix.to/#/@mordred:inaugust.com: | 16:43 | |
| - [opendev/system-config] 990727: Start mirroring ubuntu resolute container images https://review.opendev.org/c/opendev/system-config/+/990727 | ||
| - [opendev/system-config] 991140: Start mirroring the rust container image https://review.opendev.org/c/opendev/system-config/+/991140 | ||
| -@gerrit:opendev.org- Zuul merged on behalf of ayyappa: [openstack/project-config] 990651: Add repo app-ejbca for starlingx https://review.opendev.org/c/openstack/project-config/+/990651 | 16:44 | |
| @clarkb:matrix.org | mnasiadka: any idea why https://review.opendev.org/c/opendev/system-config/+/988310/ says it depends on a change with invalid configuration? | 16:52 |
| @clarkb:matrix.org | the two parent changes have +1s from Zuul. Maybe we need to recheck them to see what is wrong? | 16:52 |
| @jim:acmegating.com | Clark: those comments lack a vote; i suspect they are from different tenants | 16:54 |
| @clarkb:matrix.org | oh!!! | 16:55 |
| @mnasiadka:matrix.org | Clark: I'd be happy to understand that too :) | 16:55 |
| @clarkb:matrix.org | system-config is probably in opendev and openstack tenants or something | 16:55 |
| @jim:acmegating.com | i'm not sure about that, just something to check | 16:55 |
| @clarkb:matrix.org | hrm it isn't enqueued into openstack's check queue like I would expect it to though | 16:56 |
| @jim:acmegating.com | nope i'm wrong | 16:56 |
| @jim:acmegating.com | https://zuul.opendev.org/t/openstack/buildset/c63e0ee66722450e8f17db75cbfcd4af | 16:56 |
| @jim:acmegating.com | that really is from the openstack tenant | 16:56 |
| @jim:acmegating.com | Clark: i'd proceed with your recheck testing then | 16:57 |
| @clarkb:matrix.org | https://review.opendev.org/c/opendev/system-config/+/991140 just merged and is also in system-config so the repo itself isn't completely broken | 16:57 |
| @clarkb:matrix.org | corvus: rechecking the parents you mean? | 16:57 |
| @jim:acmegating.com | ya | 16:57 |
| @clarkb:matrix.org | I wonder if this is fallout from the mixed nodeset change. It modified system-config run stuff | 16:58 |
| @clarkb:matrix.org | and changed some of the yaml replacement. I bet that is the issue. mnasiadka I think we have to rebase the stack and update the system-config-run-prometheus job to use the new yaml string replacement anchor name | 16:58 |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 16:59 | |
| @clarkb:matrix.org | git can rebase it without file level conflict, but we have a semantic level conflict due to the change in the anchor names for the bridge node in the nodeset | 16:59 |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980994: Deploy node_exporter across all managed hosts https://review.opendev.org/c/opendev/system-config/+/980994 | 16:59 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 16:59 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980840: Add Prometheus monitoring service https://review.opendev.org/c/opendev/system-config/+/980840 | 17:00 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed on behalf of Mohammed Naser: [opendev/system-config] 980994: Deploy node_exporter across all managed hosts https://review.opendev.org/c/opendev/system-config/+/980994 | 17:00 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 17:00 | |
| @mnasiadka:matrix.org | That should probably do it. | 17:00 |
| @fungicide:matrix.org | aha, sorry for the churn there! | 17:03 |
| @mnasiadka:matrix.org | No problems, world around is moving on constantly :) | 17:08 |
| @mnasiadka:matrix.org | How are we with Resolute arm64 nodes? I would need that for Kolla support - happy to help there if I can | 17:08 |
| @clarkb:matrix.org | mnasiadka: I don't think anyone has started on that. One of the big concerns there is mirror space. The ports mirror is larger than the x86 mirror I think and we're a bit tight on space after adding x86 | 17:09 |
| @clarkb:matrix.org | I think the next steps for that are to propose the jobs for arm64 resolute nodes then start figuring out a plan for mirroring the package content. Maybe we can clean up other mirror content to make more room or maybe we need more afs disk storage? etc | 17:10 |
| @fungicide:matrix.org | https://grafana.opendev.org/d/9871b26303/afs for a high-level view of the usage situation | 17:12 |
| @fungicide:matrix.org | 1.17tb in mirror.ubuntu (with resolute), 1.03tb in mirror.ubuntu-ports (without resolute) | 17:13 |
| @fungicide:matrix.org | we're about 600gb away from running out of space in /vicepa on afs01.dfw | 17:14 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 991155: Update grafana to 13.0.2 https://review.opendev.org/c/opendev/system-config/+/991155 | 17:17 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zuul-providers] 991156: Add Resolute arm64 nodes and images https://review.opendev.org/c/opendev/zuul-providers/+/991156 | 17:17 | |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/zuul-providers] 991156: Add Resolute arm64 nodes and images https://review.opendev.org/c/opendev/zuul-providers/+/991156 | 17:19 | |
| @mnasiadka:matrix.org | fungi: I'd reckon operating system packages mirrors are not getting smaller, but also the arm64 provider is not the fastest one - so maybe not mirroring ports is not that big of a problem right now | 17:25 |
| @fungicide:matrix.org | agreed, jobs would just need some override to accommodate that | 17:26 |
| @mnasiadka:matrix.org | Clark: I would assume we want to disable https://docs.greptime.com/reference/telemetry/ ? | 17:48 |
| @clarkb:matrix.org | Yes please | 17:50 |
| -@gerrit:opendev.org- Michal Nasiadka proposed: [opendev/system-config] 988310: Add GrepTimeDB long term storage for Prometheus https://review.opendev.org/c/opendev/system-config/+/988310 | 17:55 | |
| @mnasiadka:matrix.org | Done | 17:59 |
| -@gerrit:opendev.org- Zuul merged on behalf of Dmitriy Rabotyagov: [openstack/project-config] 989144: Change ACLs for Venus to retired https://review.opendev.org/c/openstack/project-config/+/989144 | 18:34 | |
| @fungicide:matrix.org | just to confirm Clark's suspicion from the meeting, it does look like i need to manually add an account on the new backup server for wiki | 19:14 |
| @fungicide:matrix.org | `Remote: Permission denied (publickey).` | 19:14 |
| @fungicide:matrix.org | i see we have a `borg-wiki-update-test` user on the old server, i can manually create an equivalent and set the same `~/.ssh/authorized_keys` entry unless there's a more proper way to go about it | 19:16 |
| @clarkb:matrix.org | fungi: I would have to cross check against ansible, but that sounds correct from memory | 19:19 |
| @clarkb:matrix.org | and then the cron entries are offset ~evenly through the day from each other | 19:20 |
| @fungicide:matrix.org | i was just going to edit the existing 02 backup to use 03 instead, since we have another backup to the other region anyway | 19:25 |
| @clarkb:matrix.org | ack that works for me | 19:28 |
| @clarkb:matrix.org | once etherpad is done I'm going to upgrade my local network gear so I will be temporarily disconnected from the Internet at that point. But I'll wait for etherpad things to complete first | 19:34 |
| @fungicide:matrix.org | oh good reminder, i rolled back and pinned firmware on 2 of my 3 meshed waps because they kept dropping and reconnecting which made my home lan unusable. time to see if things have improved | 19:36 |
| -@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/system-config] 990531: Update etherpad from 2.7.3 to 3.2.0 https://review.opendev.org/c/opendev/system-config/+/990531 | 20:05 | |
| @clarkb:matrix.org | that should deploy in a few minutes after the hourly runs | 20:07 |
| @fungicide:matrix.org | yeah, any moment | 20:09 |
| @fungicide:matrix.org | though infra-prod-service-zuul is running slower than usual | 20:10 |
| @fungicide:matrix.org | looks like pulling zuul container images is taking a while on at least some servers | 20:13 |
| @fungicide:matrix.org | i wonder if this is going to end up with timeouts and hung docker pulls on one or more servers again | 20:14 |
| @clarkb:matrix.org | I guess we're going to find out soon enough | 20:14 |
| @fungicide:matrix.org | ze03 and ze08 have both been pulling images for the past 10 minutes | 20:16 |
| @fungicide:matrix.org | though currently no executors have still-running `docker-compose pull` processes from prior hourly ansible runs | 20:17 |
| @clarkb:matrix.org | It probably has to do with when zuul merges changes and pushes new images more than anything else | 20:18 |
| @fungicide:matrix.org | yes, but 9 of the 11 executors wrapped things up within a matter of seconfs | 20:18 |
| @fungicide:matrix.org | seconds | 20:18 |
| @clarkb:matrix.org | So 989249 has just landed and published after the last hourly run | 20:18 |
| @fungicide:matrix.org | ze05 was the last one to complete and took 3m50s to pull all the images | 20:19 |
| @fungicide:matrix.org | er, 2m50s i mean | 20:19 |
| @fungicide:matrix.org | so ze03 and ze08 are taking 5x (so far) as long as ze05 which was the slowest one that has finished | 20:20 |
| @jim:acmegating.com | in the past, i think we determined nothing was happening; like the tcp conns got dropped with no fin. i'm guessing docker doesn't enable keepalives? | 20:21 |
| @fungicide:matrix.org | that was my best guess based on limited data gathered from strace | 20:22 |
| @fungicide:matrix.org | and then the processes hung around indefinitely, until the next reboot | 20:22 |
| @jim:acmegating.com | (periodic bad-idea reminder: if we get tired of this, we could run our own registry; zuul-registry is horizontally scalable with a swift backend) | 20:22 |
| @jim:acmegating.com | maybe we can put the pulls in a timeout/retry ansible block thingy | 20:23 |
| @fungicide:matrix.org | it doesn't have a substantial impact on us, just sometimes slows things down when it holds the deploy semaphore until the job hits its timeout | 20:23 |
| @jim:acmegating.com | well, it does completely break zuul deployment every few weeks | 20:23 |
| @fungicide:matrix.org | which, honestly, is the only reason i even noticed it happening last time | 20:23 |
| @clarkb:matrix.org | Do we know what the fail/timeout mode is? | 20:24 |
| @clarkb:matrix.org | Do we have to intervene or just be patient etc. Not sure if we know | 20:24 |
| @fungicide:matrix.org | i think the job ends at the 30min mark | 20:24 |
| @clarkb:matrix.org | Ah the job itself times out then | 20:24 |
| @jim:acmegating.com | Clark: if i understand the question right: docker and docker-compose never times out, it just hangs forever | 20:24 |
| @fungicide:matrix.org | well, the playbook, but yes | 20:24 |
| @jim:acmegating.com | so if we want to improve it, we'd need to run "timeout X docker-compose pull" and put that in an ansible retry block | 20:25 |
| @fungicide:matrix.org | right, subsequent `docker-compose pull` calls in later hourly jobs are unaffected and complete successfully, but the old processes persist until the server is rebooted | 20:25 |
| @fungicide:matrix.org | the other odd thing is, i've only seen this affect the executors, not any other zuul servers | 20:26 |
| @jim:acmegating.com | sounds like a 5-10m timeout/retry would be an improvement | 20:26 |
| @clarkb:matrix.org | ++ | 20:26 |
| @clarkb:matrix.org | fungi: the executor image is much larger than the others | 20:26 |
| @fungicide:matrix.org | it looks like 3 minutes is a good upper-bound on a slower run that completes successfully | 20:26 |
| @clarkb:matrix.org | Like all the other images we use are smaller than it | 20:26 |
| @fungicide:matrix.org | so 5 is probably good | 20:26 |
| @fungicide:matrix.org | got it, so maybe image size plays a role, greater chance of cosmic rays bombarding the nic or something | 20:27 |
| @fungicide:matrix.org | but yeah, if `docker-compose pull` is still running after 5 minutes, kill it and try again, and if it doesn't work 3 times in a row then fail the task/job | 20:29 |
| @fungicide:matrix.org | worst case it will probably still fail half as fast as waiting for the configured job/phase playbook timeout | 20:30 |
| @fungicide:matrix.org | er, half as long | 20:30 |
| @clarkb:matrix.org | Yup encoding that retry logic should be straightforward with ansible | 20:30 |
| @clarkb:matrix.org | May need to wrap it with the timeout command? Not sure if ansible task timeouts do the right thing with retries | 20:30 |
| @clarkb:matrix.org | Looks like it did timeout at ~30 minutes. Etherpad deployment is proceeding | 20:35 |
| @fungicide:matrix.org | fwiw, the pulls are still running on ze03 and ze08 even though the job has now timed out, so seems the same as the last time i looked into it | 20:36 |
| @fungicide:matrix.org | my browser just disconnected from and reconnected to a pad i was in the middle of editing | 20:39 |
| @clarkb:matrix.org | `Up 17 seconds (healthy)` so it should be back up now. Time to check it works | 20:39 |
| @fungicide:matrix.org | retained my session too | 20:39 |
| @clarkb:matrix.org | https://etherpad.opendev.org/p/gerrit-upgrade-3.13 loads for me | 20:39 |
| @fungicide:matrix.org | all lgtm so far | 20:39 |
| @clarkb:matrix.org | I'll keep an eye on memory consumption. It took about 15 minutes to OOM on 2.7.3 with session cleanup enabled | 20:39 |
| @clarkb:matrix.org | if it makes it to say 30 minutes that is when I'll go ahead and upgrade my network gear | 20:40 |
| @fungicide:matrix.org | i'm going to peel some garlic for ผัดซีอิ๊ว (phat si-io) | 20:42 |
| -@gerrit:opendev.org- Clark Boylan proposed: [opendev/system-config] 991221: Retry zuul executor container image pulls https://review.opendev.org/c/opendev/system-config/+/991221 | 20:49 | |
| @clarkb:matrix.org | something like ^ that for the discussed timeout and retries idea | 20:49 |
| @mordred:waterwanders.com | I know we have all the control in the world over every aspect of gerrit - so just noting that the text: "Outdated Votes: Code-Review+1 (copy condition: "changekind:TRIVIAL_REBASE OR is:MIN")" mostly causes the words "TRIVIAL_REBASE" to stand out, which keeps making me confused when I know something _wasn't_ a trivial rebase :) | 20:57 |
| @clarkb:matrix.org | we've made it to 25 minutes and the gerrit 3.13 etherpad still loads for me | 21:04 |
| @clarkb:matrix.org | and now 30 minutes. Time for network stuff. Hopefully i'm back in 5-10 minutes :) | 21:10 |
| @mordred:waterwanders.com | now I'm hungry | 21:10 |
| @jim:acmegating.com | i'm sure he's making enough for everybody | 21:12 |
| @mordred:waterwanders.com | oh good | 21:12 |
| @fungicide:matrix.org | always | 21:19 |
| @clarkb:matrix.org | I think I'm back and my network is functional? | 21:23 |
| @fungicide:matrix.org | your bits are reaching me | 21:28 |
| @mordred:waterwanders.com | and also me | 21:28 |
| @mordred:waterwanders.com | I see bits from various people, in fact | 21:28 |
| @clarkb:matrix.org | excellent one less thing to worry about now | 21:28 |
| @clarkb:matrix.org | I think https://review.opendev.org/c/opendev/system-config/+/988993 and https://review.opendev.org/c/opendev/system-config/+/991221 are the two next things on my todo list (they both update zuul executor config management). Then tomorrow morning fungi and I are meeting with the i18n team to try and help them weblate more | 21:30 |
| @fungicide:matrix.org | and anyone else who wants to pitch in, i don't think there's a limit | 21:36 |
| @fungicide:matrix.org | https://lists.openstack.org/archives/list/openstack-discuss@lists.openstack.org/thread/2ZLT3VQL377LRB64SQLKTP33I5XEK7BF/ i18n SIG Virtual Sprint (June 3rd, 14:00 UTC) | 21:39 |
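
Editor's sketch of the timeout-and-retry idea discussed around 20:25-20:30 for `docker-compose pull`: kill the pull if it is still running after 5 minutes, try again, and fail after 3 attempts. This is only an illustration in Python and is not taken from change 991221, which may do this differently (the channel talked about an ansible retry wrapped around the `timeout` command). The function name `run_with_retries` and its parameters are hypothetical; the 300 second and 3 attempt defaults are the figures floated in the conversation.

```python
import subprocess
import sys


def run_with_retries(cmd, timeout_s=300, attempts=3):
    """Run cmd up to `attempts` times, killing it after timeout_s seconds.

    Returns True as soon as one attempt exits with status 0, otherwise False.
    The name and defaults are illustrative assumptions, not an existing API.
    """
    for attempt in range(1, attempts + 1):
        try:
            # subprocess.run kills the child and waits for it when the
            # timeout expires, then raises TimeoutExpired.
            result = subprocess.run(cmd, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            print(f"attempt {attempt}: timed out after {timeout_s}s",
                  file=sys.stderr)
            continue
        if result.returncode == 0:
            return True
        print(f"attempt {attempt}: exit status {result.returncode}",
              file=sys.stderr)
    return False


# Hypothetical invocation (not executed here):
#   run_with_retries(["docker-compose", "pull"], timeout_s=300, attempts=3)
```

With these values the worst case is 3 x 5 = 15 minutes, about half the 30 minute playbook timeout mentioned in the log. One caveat: only the direct child process is killed on timeout, so any helper processes it spawned might need separate handling.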