clarkb | ianw: re console files we need to do similar with static iirc | 00:02 |
---|---|---|
clarkb | (sorry bike rides are good for pulling on threads in your head and I just got back from a bike ride) | 00:03 |
ianw | oh yep, i've cleared out there too and am watching | 00:04 |
ianw | i'm going to give it a few hours before i debug too much as iirc the periodic jobs don't update system-config between runs | 00:05 |
clarkb | ya I think waiting is fine. It's not like we haven't had this issue for a while anyway :) | 00:05 |
ianw | clarkb: does hosts: bastion[0] work as a host-matcher for a playbook off the top of your head? i was thinking the bastion group would never have more than one entry -- it's basically just a pointer | 00:08 |
clarkb | ianw: I want to say yes. It was noonedeadpunk maybe that was doing similar with osa stuff and zuul? | 00:09 |
ianw | https://docs.ansible.com/ansible/latest/user_guide/intro_patterns.html#common-patterns doesn't list it, but it also says "common patterns" not "here is an exhaustive list of possible patterns" ... | 00:10 |
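For illustration, a minimal sketch of the pattern being discussed (a hypothetical playbook, not from the change): Ansible does accept numeric subscripts on group names, so a "bastion" group with exactly one member can be addressed as bastion[0].

```yaml
# Hypothetical playbook: "bastion" is assumed to be an inventory group
# with a single member; the [0] subscript selects its first entry.
- hosts: bastion[0]
  tasks:
    - name: Show which host the pattern matched
      ansible.builtin.debug:
        msg: "Matched {{ inventory_hostname }}"
```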
clarkb | and ya I think addressing it that way is fine we should just be consistent. If we can't be consistent maybe add a verbose comment about it to the groups file | 00:10 |
ianw | yeah i did call it out in https://review.opendev.org/c/opendev/system-config/+/858476/6/inventory/service/groups.yaml but we can try the direct addressing too | 00:11 |
ianw | you're also probably correct that using a different group to address the bastion host in the job setup, versus in the nested ansible, might be clearer | 00:12 |
clarkb | ya I had to think about that one for a minute before I realized what was going on there | 00:12 |
ianw | i couldn't think of any other way than configuring the group in the job definition. i toyed with the idea of a "fake" add_host that somehow runs once and makes a fake host you can dereference to find the bastion host name ... but i felt like that was getting even more confusing | 00:14 |
clarkb | ya I think what you've got is the right way to do it. Just a matter of distinguishing "this is the group management for the top-level job ansible" from "this is our production group management" | 00:15 |
fungi | it certainly is mind-bending in ways that i hope future zuul versions will be able to simplify | 00:19 |
*** dviroel|biab is now known as dviroel|out | 00:20 | |
ianw | fungi: are you ok with https://review.opendev.org/c/opendev/system-config/+/856593 which moves the ansible on bridge into a venv? it's one that will require close monitoring but i'm happy to do that | 00:21 |
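As a rough illustration of the venv approach in 856593 (the path and task details below are assumptions, not the actual change):

```yaml
# Hypothetical sketch: install Ansible into a dedicated virtualenv on the
# bastion instead of a system-wide pip3 install, isolating it from OS
# packages and other pip-installed tools.
- name: Create a virtualenv for Ansible
  ansible.builtin.command: python3 -m venv /opt/ansible-venv
  args:
    creates: /opt/ansible-venv

- name: Install Ansible into the venv
  ansible.builtin.pip:
    name: ansible
    virtualenv: /opt/ansible-venv
```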
ianw | per the merges above, i've taken the liberty of merging the more trivial cleanup changes | 00:22 |
fungi | oh, i thought i had already reviewed that one... looking | 00:23 |
fungi | ianw: yeah, lgtm, thanks! | 00:26 |
fungi | merge at your convenience | 00:26 |
ianw | thanks; will do. after that is in the clear, the stack will be focused on jammy upgrade and making it easier to swap the bastion host | 00:27 |
ianw | running every single system-config-run job on one change does expose that change to a lot of failure possibilities :/ | 00:34 |
fungi | you need popcorn | 00:36 |
opendevreview | Merged opendev/system-config master: run-selenium: Use latest tag on firefox image https://review.opendev.org/c/opendev/system-config/+/857803 | 01:58 |
opendevreview | Merged opendev/system-config master: afs-release: better info when can not get lockfile https://review.opendev.org/c/opendev/system-config/+/858009 | 01:58 |
opendevreview | Ian Wienand proposed opendev/system-config master: bootstrap-bridge: drop pip3 role, add venv https://review.opendev.org/c/opendev/system-config/+/856593 | 02:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run jobs with a jammy bridge.openstack.org https://review.opendev.org/c/opendev/system-config/+/857799 | 02:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: testinfra: Update selenium calls https://review.opendev.org/c/opendev/system-config/+/858003 | 02:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: Abstract name of bastion host for testing path https://review.opendev.org/c/opendev/system-config/+/858476 | 02:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run a base test against "old" bridge https://review.opendev.org/c/opendev/system-config/+/860802 | 02:29 |
opendevreview | Ian Wienand proposed opendev/system-config master: Convert production playbooks to bastion host group https://review.opendev.org/c/opendev/system-config/+/858486 | 02:29 |
ianw | we still see a few console files coming up on bridge -- they all appear to be related to the gpg encrypting of logs; hence suggesting the post playbook i guess | 02:30 |
opendevreview | Ian Wienand proposed opendev/system-config master: Correct zuul_console_disabled flag https://review.opendev.org/c/opendev/system-config/+/860913 | 02:47 |
*** ysandeep|out is now known as ysandeep | 02:48 | |
opendevreview | Ian Wienand proposed opendev/base-jobs master: Fix zuul_console_disabled typo https://review.opendev.org/c/opendev/base-jobs/+/860914 | 02:53 |
*** ysandeep is now known as ysandeep|afk | 03:32 | |
opendevreview | Merged opendev/system-config master: Correct zuul_console_disabled flag https://review.opendev.org/c/opendev/system-config/+/860913 | 03:53 |
opendevreview | Merged opendev/base-jobs master: Fix zuul_console_disabled typo https://review.opendev.org/c/opendev/base-jobs/+/860914 | 04:10 |
opendevreview | Ian Wienand proposed opendev/system-config master: bootstrap-bridge: drop pip3 role, add venv https://review.opendev.org/c/opendev/system-config/+/856593 | 04:11 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run jobs with a jammy bridge.openstack.org https://review.opendev.org/c/opendev/system-config/+/857799 | 04:11 |
opendevreview | Ian Wienand proposed opendev/system-config master: testinfra: Update selenium calls https://review.opendev.org/c/opendev/system-config/+/858003 | 04:11 |
opendevreview | Ian Wienand proposed opendev/system-config master: Abstract name of bastion host for testing path https://review.opendev.org/c/opendev/system-config/+/858476 | 04:11 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run a base test against "old" bridge https://review.opendev.org/c/opendev/system-config/+/860802 | 04:11 |
*** dasm is now known as dasm|off | 04:20 | |
opendevreview | Ian Wienand proposed opendev/system-config master: Abstract name of bastion host for testing path https://review.opendev.org/c/opendev/system-config/+/858476 | 05:54 |
opendevreview | Ian Wienand proposed opendev/system-config master: Run a base test against "old" bridge https://review.opendev.org/c/opendev/system-config/+/860802 | 05:54 |
*** ysandeep|afk is now known as ysandeep | 06:01 | |
*** jpena|off is now known as jpena | 07:17 | |
*** frenzyfriday is now known as frenzyfriday|sick | 07:50 | |
*** kopecmartin|sick is now known as kopecmartin | 08:08 | |
*** ysandeep is now known as ysandeep|lunch | 08:19 | |
opendevreview | Grzegorz Grasza proposed opendev/irc-meetings master: Update Barbican meeting chair and time https://review.opendev.org/c/opendev/irc-meetings/+/860929 | 08:47 |
*** ysandeep|lunch is now known as ysandeep | 09:46 | |
*** dviroel|out is now known as dviroel | 11:06 | |
*** ysandeep is now known as ysandeep|afk | 11:42 | |
*** pojadhav is now known as pojadhav|afk | 11:45 | |
opendevreview | Rafal Lewandowski proposed openstack/diskimage-builder master: Added cloud-init growpart element https://review.opendev.org/c/openstack/diskimage-builder/+/855856 | 11:49 |
*** ysandeep|afk is now known as ysandeep | 12:16 | |
*** dasm|off is now known as dasm | 12:41 | |
*** pojadhav|afk is now known as pojadhav | 12:45 | |
opendevreview | Rafal Lewandowski proposed openstack/diskimage-builder master: Added cloud-init growpart element https://review.opendev.org/c/openstack/diskimage-builder/+/855856 | 13:48 |
opendevreview | Amy Marrich proposed opendev/irc-meetings master: Change meeting day and time for Diversity and inclusion https://review.opendev.org/c/opendev/irc-meetings/+/860955 | 14:08 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM force mm3 failure to hold the node https://review.opendev.org/c/opendev/system-config/+/855292 | 14:18 |
opendevreview | Merged opendev/irc-meetings master: Change meeting day and time for Diversity and inclusion https://review.opendev.org/c/opendev/irc-meetings/+/860955 | 14:19 |
frickler | clarkb: when you tested devstack with the new zuul ansible, you only tested master, not old stable branches, right? | 14:20 |
frickler | wondering whether https://bugs.launchpad.net/neutron/+bug/1992379 could be related, saw similar things in older devstack stable branches | 14:20 |
fungi | note that the ansible default change merged on thursday at 16:31z, so if the failures started with the friday periodic runs, that would be a fairly tight correlation | 14:23 |
frickler | that seems to match pretty well https://zuul.opendev.org/t/openstack/builds?job_name=neutron-functional&project=openstack%2Fneutron&branch=stable%2Ftrain&skip=0 | 14:26 |
mlavalle | fungi: yes, it matches, based on what I've seen | 14:27 |
fungi | short-term workaround is to set the ansible version in those jobs to 5, but that will only work until zuul drops ansible 5 support, which is coming very soon | 14:28 |
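That workaround would look roughly like this in a Zuul job definition (a sketch; the job name is borrowed from the builds link above purely as an example):

```yaml
# Sketch: pin the job to run under Zuul's Ansible 5 until the underlying
# issue is fixed; this stops working once Zuul removes Ansible 5 support.
- job:
    name: neutron-functional
    ansible-version: 5
```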
frickler | maybe opendev can delay that switch for some time? stable branches are already struggling without further changes imposed upon them | 14:34 |
fungi | will they struggle less if we delay it a week? | 14:35 |
fungi | a lot of those stable branches need to be eol already | 14:36 |
frickler | not really, I was thinking more in terms of months to either get them fixed or eold | 14:36 |
fungi | the zuul community would like to drop ansible 5 support and release 8.0 next week: https://etherpad.opendev.org/p/qF7VE9HzqPVzsCyZLWxb | 14:37 |
fungi | normally zuul tags releases after the code has been running successfully in opendev's deployment | 14:37 |
*** ysandeep is now known as ysandeep|dinner | 14:37 | |
fungi | delaying removal of ansible 5 in opendev either means not tying zuul releases to opendev successfully running the code first, or delaying zuul's release plans by however long we want to delay the ansible 5 removal | 14:38 |
frickler | so we need to weigh interest of the zuul community vs. interest of the OpenStack community | 14:39 |
fungi | yes. i could see suggesting we push the ansible 5 removal to the week after the ptg, for logistical reasons | 14:40 |
frickler | except possibly there is an easy fix for this | 14:40 |
frickler | easy and not requiring every repo to be touched, ie. likely localized to devstack | 14:41 |
frickler | is the ansible 5 removal blocking anything else on the zuul side? | 14:42 |
frickler | fyi verifying in https://review.opendev.org/c/openstack/devstack/+/860797 now | 14:45 |
fungi | not needing to support eol ansible versions, but i'm not sure what else beyond that | 14:46 |
fungi | we basically got way behind because of the modules fork maintenance stuff, which once resolved allowed us to start catching back up to ansible's support schedule | 14:46 |
clarkb | huh my messages from matrix seem to have not made it to irc? that is unfortunate | 14:49 |
clarkb | this is likely the same issue that zuul-jobs ran into with ansible's handling of shebangs picking the wrong python version | 14:50 |
clarkb | we can just drop the shebang and that should fix it if this is the same issue | 14:50 |
fungi | note that the ironic team ran into some problem related to the ansible default change as well. i didn't dig into it but JayF can probably say if theirs was that issue as well | 14:50 |
clarkb | the point I was trying to make (that apparently matrix didn't send through for me) is that we did try to accommodate openstack here and intentionally waited for the release to complete before doing this, knowing it had the potential to be destructive | 14:51 |
JayF | Our failures were the ansible openstack module complaining about openstacksdk versions | 14:51 |
clarkb | at the same time if we never take the chance that some things would break here we would never update the ansible version | 14:51 |
clarkb | and it seems a little silly to prioritize ancient openstack stable branches that don't get required care and feeding | 14:52 |
JayF | Oh, I've totally seen this failure in some places too. It didn't occur to me this could be ansible version :| | 14:52 |
fungi | oh, i see the metalsmith situation looks related to openstacksdk constraints vs ansible collection for openstack? https://review.opendev.org/c/openstack/metalsmith/+/860943/3/metalsmith_ansible/ansible_plugins/modules/metalsmith_instances.py | 14:52 |
JayF | Yes | 14:52 |
clarkb | you should decouple those things | 14:53 |
clarkb | the ansible running your job shouldn't leak into the python install you are testing | 14:53 |
JayF | https://review.opendev.org/c/openstack/metalsmith/+/860943 being our "improved" workaround | 14:53 |
clarkb | oh wait, this is a devstack-gate failure? | 14:53 |
clarkb | I really don't think we should hold anything up for a devstack-gate problem | 14:54 |
JayF | clarkb: that's basically what I said as soon as we connected the dots as to how it was broken; I'd be surprised if anyone has the time to refactor that job though | 14:54 |
fungi | yes, devstack-gate is on borrowed time already. there will come a time in the not too distant future where we need to decide between eol'ing some em branches in openstack sooner or dropping integration tests on them so we can retire d-g for good | 14:55 |
clarkb | and ya the issue is precisely the one zuul-jobs hit. Its a one line fix. I'll push it up as soon as I load ssh keys | 14:55 |
clarkb | But I don't think we should hold anything up for devstack-gate | 14:56 |
clarkb | it's one thing to try and work with devstack problems, but devstack-gate shouldn't be used | 14:56 |
clarkb | remote: https://review.opendev.org/c/openstack/devstack-gate/+/860961 Remove shebang from ansible module | 14:58 |
opendevreview | Rafal Lewandowski proposed openstack/diskimage-builder master: Added cloud-init growpart element https://review.opendev.org/c/openstack/diskimage-builder/+/855856 | 14:58 |
frickler | oh, I didn't realise that these were all still ds-gate jobs. that's a good argument for retirement indeed | 15:06 |
*** ysandeep|dinner is now known as ysandeep | 15:06 | |
fungi | which would also explain why the canary devstack tests with ansible 6 didn't turn up things like that | 15:07 |
*** dviroel is now known as dviroel|lunch | 15:11 | |
clarkb | fungi: d-g actually does the exact thing that the ansible change breaks :/ it uses the module as both a script and a module | 15:12 |
clarkb | I pushed a followup patchset, but I think that will break too | 15:12 |
fungi | indeed | 15:12 |
clarkb | it's a bit of a gg ansible situation. But also it's d-g, which no one should be using... | 15:12 |
fungi | change of topic... one catch to holding a job node for a service built on a speculative container... docker-compose up can't find the images in the registry and errors | 15:18 |
fungi | i guess there's a way to tell it to use the cached images it has? | 15:18 |
fungi | oh, or maybe it never actually had them at all. docker image list only lists the mariadb image | 15:19 |
fungi | 860157 built images on 2022-10-07, could they have expired out of the buildset registry at this point? | 15:21 |
clarkb | fungi: I thought that cleanup isn't working | 15:21 |
clarkb | so no I don't think so | 15:21 |
fungi | i guess i should see if the dnm change failed for the intended reason or for other reasons | 15:21 |
clarkb | in the past I think the issue has been in setting up the job requires and dependencies | 15:22 |
clarkb | I would double check that | 15:22 |
clarkb | (what job was it?) | 15:22 |
fungi | ERROR: for mailman-core pull access denied for opendevorg/mailman-core, repository does not exist or may require 'docker login': denied: requested access to the resource is denied | 15:22 |
fungi | https://zuul.opendev.org/t/openstack/build/2b0da96429d942e39b862e41414ab4fa | 15:22 |
fungi | so no, the held node never had the speculatively built images | 15:22 |
fungi | missing provides/requires maybe? | 15:23 |
clarkb | ya I was actually concerned about this. I think that when things aren't live in zuul with the speculative state it loses that info somehow? Basically we may need to land the image additions to make this work reliably. I'm pretty sure I added all the requires and dependencies, but worth double checking that too | 15:24 |
clarkb | the issue (my hunch anyway) is that we don't know to pull in the speculative build from the intermediate registry to the buildset registry when those changes don't go up together | 15:25 |
clarkb | and since we haven't landed the change either there is no image in docker hub to pull as the fallback | 15:25 |
clarkb | fungi: I think if you recheck both changes together that it will correct it | 15:25 |
fungi | yeah, makes sense. thanks | 15:27 |
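For context, the provides/requires wiring being discussed looks roughly like this (job and artifact names are patterned on the changes above, but treat them as assumptions):

```yaml
# Sketch: the image-build job advertises an artifact; the test job
# declares that it consumes one. When the providing build runs (merged or
# live in the same dependency chain), Zuul pulls the speculative image
# from the intermediate registry into the buildset registry; otherwise
# the job falls back to whatever is published on Docker Hub.
- job:
    name: system-config-build-image-mailman
    provides: mailman-container-image

- job:
    name: system-config-run-lists3
    requires: mailman-container-image
```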
clarkb | fungi: frickler: this affects normal devstack jobs on train and older because devstack train uses the test-matrix role in playbooks/pre.yaml. Ussuri dropped it | 15:29 |
clarkb | (I think that is still a bug in "don't use d-g", but does complicate cleanup a bit if we want to take it further) | 15:30 |
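A sketch of the stale coupling described above (the exact play in devstack's stable/train tree may differ):

```yaml
# Hypothetical excerpt of devstack's playbooks/pre.yaml on stable/train:
# the test-matrix role still comes from devstack-gate, so new-Ansible
# breakage in d-g leaks into otherwise ordinary devstack jobs.
- hosts: all
  roles:
    - test-matrix
```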
fungi | could be a reason to fast-track eol for stable/train | 15:31 |
fungi | or just drop integration testing on it | 15:31 |
clarkb | ya | 15:31 |
fungi | pretty sure mass eol of pqrs are already underway | 15:32 |
fungi | elodilles: ^ just a heads up that devstack jobs on stable/train and older are broken by newer ansible and will either need fixing or dropping devstack/grenade jobs | 15:32 |
fungi | i wonder if it also affects stable/ussuri grenade jobs | 15:33 |
frickler | with EM these things fall into each project's responsibility and my impression is they tend to just get ignored | 15:33 |
clarkb | I've got https://review.opendev.org/c/openstack/devstack/+/860963 pushed testing https://review.opendev.org/c/openstack/devstack-gate/+/860961 | 15:33 |
frickler | added that as PTG topic for the release team, but maybe also a TC topic | 15:33 |
clarkb | I think the latest patchset should be functional | 15:34 |
* frickler takes a break, bbl | 15:35 | |
clarkb | fungi: if the rechecking at the same time corrects it we should ask corvus what we are doing wrong with my change there | 15:37 |
mlavalle | clarkb: will https://review.opendev.org/c/openstack/devstack-gate/+/860961 hlp me with https://bugs.launchpad.net/neutron/+bug/1992379? | 15:39 |
fungi | clarkb: i see that the dnm change does have its system-config-run-lists3 build in "waiting" while the parent change's system-config-build-image-mailman build is underway | 15:39 |
clarkb | mlavalle: I think so but https://review.opendev.org/860963 should tell us. As mentioned above though, really nothing should be using devstack-gate. It's a bug that devstack train is relying on devstack-gate | 15:39 |
clarkb | so I think there is a bigger question of how to unwind that (as fungi suggests maybe we need to stop running those old tests entirely) | 15:40 |
fungi | one way to stop running them is to go ahead and eol stable/train | 15:40 |
mlavalle | clarkb: thanks. will keep an eye on 860963 | 15:46 |
fungi | clarkb: yep, as suspected, once the image build completed on the parent, the mm3 test build started | 15:52 |
clarkb | fungi: ya I seem to remember running into this once before (maybe adding new gerrit images?) I'm not quite sure if this is expected or not | 15:53 |
fungi | it does seem side-effect-ey | 15:53 |
fungi | if we didn't omit the system-config-build-image-mailman job on the child change, it would supply its own image to the tests | 15:54 |
clarkb | ya, and it is omitted due to not matching the file matchers for that job | 15:55 |
clarkb | maybe we've overoptimized | 15:55 |
clarkb | and by we I guess I mean me as the author of that change :) | 15:55 |
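The over-optimization in question is a file matcher of this general shape (the path is an assumption for illustration):

```yaml
# Sketch: with a "files" matcher the image-build job is skipped for
# changes that don't touch the image sources, so a child change can carry
# a "requires" that nothing in its own buildset provides.
- job:
    name: system-config-build-image-mailman
    files:
      - docker/mailman/.*
```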
fungi | agreed, seems like it could be one of the many arguments about why file matchers are problematic | 15:56 |
fungi | anyway, this can be worked around easily enough now that i get what's going on, and i think zuul is working as intended here | 15:57 |
fungi | for me, the surprising side effect is the parent change buildset supplying images to the child change buildset even though they're running in an independent pipeline | 15:58 |
clarkb | ya that's what the job requires and provides do | 15:59 |
clarkb | fungi: are you holding the node now to do another test pass? Have we updated the mm2 prod strings that are too long? | 15:59 |
fungi | right, i get that it's an intentional feature, but the fact that you might get a speculatively built image or you might not depending on timing does feel kinda magic | 16:00 |
clarkb | ya that part is a bit weird to me as well | 16:00 |
fungi | clarkb: yes, i cleaned up the overly long fields in the prod lists and i have an autohold reset to catch this latest build of the dnm change | 16:00 |
*** marios is now known as marios|out | 16:00 | |
clarkb | awesome | 16:01 |
fungi | i was mid-rsync of production data to a node i'd held an hour or so ago; i had to mount the ephemeral drive (rackspace node) on /var/lib/mailman to make enough space for the migration tests, and while stopping/starting the containers to move that homedir i noticed the images for them were missing | 16:02 |
fungi | but as soon as i have a new held node i'll get the ssh keys added on it and start pushing another copy of the data over | 16:02 |
fungi | probably about 10 minutes out if zuul's estimate is to be believed | 16:03 |
*** ysandeep is now known as ysandeep|out | 16:10 | |
fungi | was enough time for me to scarf down cold leftovers for lunch | 16:16 |
fungi | new held node is 198.72.124.74 and it's not in rackspace so i don't need to fiddle with filesystems | 16:17 |
fungi | though still no images :/ | 16:18 |
fungi | same error https://zuul.opendev.org/t/openstack/build/0940a25021fa4d9c891186457d4042e8 | 16:18 |
fungi | i guess it doesn't try to pull the one from the parent change anyway, it only delays building | 16:19 |
fungi | i'll amend the dnm change to touch something to trigger an image build | 16:19 |
clarkb | hrm if it delayed then it knew about the artifacts and should've pulled those into the buildset registry from intermediate I would've thought | 16:20 |
clarkb | maybe the plumbing just doesn't work before the image actually exists? | 16:20 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: DNM force mm3 failure to hold the node https://review.opendev.org/c/opendev/system-config/+/855292 | 16:21 |
fungi | swizzled | 16:21 |
fungi | now to reset the trap | 16:21 |
clarkb | the fact that my devstack test change of d-g hasn't all failed yet implies to me that the fix is working if we want to go ahead and approve the d-g change | 16:24 |
clarkb | worst case it won't make anything worse :) | 16:24 |
fungi | looks like it's running system-config-build-image-mailman this time, so should supply its own image | 16:24 |
*** dviroel|lunch is now known as dviroel | 16:24 | |
fungi | i've reviewed two devstack-gate changes in the span of a week. i really don't want to resubscribe my gertty to that repo | 16:39 |
clarkb | as a heads up the matrix oftc bridge isn't passing messages through to irc right now | 17:14 |
fungi | ouch | 17:14 |
clarkb | messages written here in IRC make it to matrix | 17:16 |
clarkb | but not the other way around | 17:16 |
clarkb | I've asked for help in #irc:matrix.org | 17:19 |
fungi | okay, 149.202.168.204 seems to be a working held mm3 server. now to copy data onto it | 17:19 |
*** jpena is now known as jpena|off | 17:25 | |
clarkb | https://github.com/matrix-org/matrix-appservice-irc/issues/1624 https://github.com/matrix-org/matrix-appservice-irc/issues/1590 https://github.com/matrix-org/matrix-appservice-irc/issues/1575 have been shared with me. Unfortunately, not a whole lot in those other than the problem exists | 17:25 |
fungi | production data is replicating to the held server now | 17:30 |
clarkb | note those issues are all for messages working from matrix to irc but not irc to matrix. The inverse of what we observe. I've asked if we need to file a new issue | 17:31 |
fungi | i've been notified of two new element releases today in my browser client, not sure if that's related but it's a lot more than i normally expect | 17:40 |
fungi | if nothing else, it could explain why the matrix admins aren't particularly responsive at the moment | 17:44 |
rotensepro | hello | 18:57 |
rotensepro | anyone here please | 18:57 |
clarkb | rotensepro: yes, a few of us are here. Usually best to just ask your question if you have one | 18:58 |
rotensepro | I'm an Outreachy intern....currently studying the docs & guides sent to me by my mentor | 18:59 |
rotensepro | so happy to be here | 18:59 |
rotensepro | My project mentor's usernames are: fpantano, ashrodri | 19:03 |
fungi | welcome! | 19:05 |
fungi | we're in the middle of our weekly meeting (in the #opendev-meeting channel), so slightly slower to respond in here as a result | 19:05 |
JayF | Welcome o/ | 19:05 |
*** dviroel is now known as dviroel|biab | 19:19 | |
mlavalle | clarkb: yeap, it worked: https://review.opendev.org/c/openstack/networking-ovn/+/860610 | 19:26 |
mlavalle | thanks! | 19:26 |
JayF | mlavalle: I hope you're doing well o/ | 19:35 |
corvus | clarkb: you're saying an earlier patchset of 855292 ended up without images? | 19:40 |
fungi | corvus: https://zuul.opendev.org/t/openstack/build/0940a25021fa4d9c891186457d4042e8 | 19:40 |
fungi | that was the last example | 19:41 |
clarkb | corvus: yes, that is correct | 19:41 |
clarkb | normally we'd probably fall back to what is on docker hub but since the change adding the images to docker hub hasn't merged yet it doesn't do that | 19:41 |
corvus | it looks like that is based on PS7 of the image build change | 19:43 |
corvus | (just making sure i have all the configs lined up) | 19:44 |
corvus | clarkb: fungi it appears zuul itself dtrt; here's the artifact info: https://zuul.opendev.org/t/openstack/build/0940a25021fa4d9c891186457d4042e8/log/zuul-info/inventory.yaml#124 | 19:47 |
clarkb | I guess that implies somewhere in the docker image pulling we aren't looking to the intermediate registry properly? | 19:48 |
corvus | yeah, does that job inherit from one of the image-using base jobs? | 19:48 |
clarkb | oh! that's it I think. system-config-run and system-config-run-containers are distinct | 19:49 |
clarkb | and we need system-config-run-containers in this situation | 19:49 |
corvus | yep, that looks like that should do it. | 19:50 |
corvus | clarkb: maybe it's worth putting a "fail" task in the system-config-run playbook if there are container artifacts? | 19:51 |
clarkb | that's a good idea | 19:51 |
corvus | (basically, that should detect the "you set up provides/requires but without the intermediate registry" case) | 19:51 |
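A minimal sketch of the guard corvus suggests, assuming zuul-jobs-style artifacts whose metadata.type is "container_image" and a buildset_registry variable set by the registry plumbing (both shapes are assumptions):

```yaml
# Fail fast when the buildset provides container images but the job lacks
# the buildset-registry wiring needed to consume them.
- name: Abort if container artifacts exist without registry wiring
  ansible.builtin.fail:
    msg: >-
      Speculative container images were provided for this buildset, but
      this job has no buildset registry configured, so they would be
      silently ignored.
  when:
    - buildset_registry is not defined
    - zuul.artifacts | default([])
      | selectattr('metadata.type', 'defined')
      | selectattr('metadata.type', 'equalto', 'container_image')
      | list | length > 0
```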
fungi | ianw: yeah, not combining them into one change for sure, but if we want them to take effect at different times then do we want some specific amount of time separating them in order to avoid confusion? | 20:04 |
fungi | like three days? a week? | 20:04 |
ianw | yeah for mine, i think something like >= 2 days (just for tz's and the world to spin around everyone's "day" :) gives enough granularity to detangle things | 20:06 |
frickler | forgot to mention that rocky9 image builds are still paused. I tested a couple of times to unpause, but no change to the failure | 20:19 |
frickler | also are we planning to have a normal meeting next week? I'd rather not since my day starts with a QA session at 7UTC | 20:20 |
frickler | and with that I'm off for today | 20:21 |
fungi | mm3 import test is underway now on 149.202.168.204 | 20:34 |
fungi | prior full imports took around 2.5-3 hours to complete | 20:34 |
ianw | hrm, https://review.opendev.org/c/opendev/system-config/+/856593 failed on github timeouts getting a client file for bridge (2x) and a letsencrypt failure | 20:43 |
ianw | i don't know what the realistic chance of running ~52 jobs (26-ish jobs * 2 for check and gate) is for getting this in ... | 20:45 |
mlavalle | JayF: yeah, doing great! o/ | 20:51 |
JayF | mlavalle: just good to see you around; you ever need anything lmk :D | 20:54 |
clarkb | frickler: there was a change NeilHanlon pushed up to stop using some of the fancier mirror stuff to try and get better errors out of the logs. But I think that may require a dib release | 20:54 |
mlavalle | JayF: will do. Thanks | 20:55 |
clarkb | fungi: I'm going to update the mm3 image change to swap out the parent system-config-run job and remove the WIP from the commit message. I don't think this will impact you at all but heads up | 20:55 |
opendevreview | Clark Boylan proposed opendev/system-config master: Fork the maxking/docker-mailman images https://review.opendev.org/c/opendev/system-config/+/860157 | 20:57 |
opendevreview | Clark Boylan proposed opendev/system-config master: DNM force mm3 failure to hold the node https://review.opendev.org/c/opendev/system-config/+/855292 | 20:57 |
clarkb | infra-root https://etherpad.opendev.org/p/3BivmG_PowWf77iKffJU something like that for the ansible 5 and jammy email? | 21:08 |
ianw | is the rocky failure related to https://review.opendev.org/c/openstack/diskimage-builder/+/860428 ? it wasn't clear to me if that was to be merged or was for testing | 21:09 |
clarkb | ianw: yes, I think the idea was if we land that change and run it that we might get better error messages from dnf | 21:09 |
clarkb | currently dnf fails and then doesn't tell you what actually failed because it's using the abstract mirror list | 21:10 |
clarkb | which makes it really hard to debug getting short http responses | 21:10 |
clarkb | it basically just says it got fewer bytes than expected for each of the mirrors it hits but doesn't tell you which mirrors | 21:10 |
clarkb | I tried to manually reproduce by getting a mirror listing from the mirrorlist url on nb01, then fetched the files from that server and they came back fine. The other suspicion I had was that maybe we had cached something that was broken but I couldn't find that in the cache if we had done so | 21:11 |
clarkb | https://bugzilla.redhat.com/show_bug.cgi?id=1262845 is the dnf behavior bug that has been open for 8 years | 21:12 |
ianw | clarkb: ok, one other quick option would be just to hand-edit that into the container and run it, see what happens | 21:15 |
clarkb | well I tried to reproduce with the conatiner outside of dib and could not | 21:15 |
clarkb | do you mean try and jump on a dib build and edit it really quickly before it fails? | 21:16 |
clarkb | (this happens right at the beginning of the build os it might be tough to do that) | 21:16 |
ianw | oh i just mean start a shell in the nodepool container and edit the element file, then trigger a build | 21:17 |
clarkb | oh | 21:17 |
clarkb | that container | 21:17 |
ianw | i've done that a couple of times for things that i just couldn't reproduce outside the actual nb | 21:17 |
clarkb | ya let me give that a shot right now | 21:17 |
ianw | obviously it doesn't survive etc. but handy for quick testing | 21:18 |
ianw | i'm also happy to merge it, but it's a long path from there to fresh container builds deployed | 21:18 |
corvus | clarkb: i made some changes to your etherpad; i'd like to avoid linking to the scratch etherpad where we batted around ideas for the zuul release sequence, and definitely want to avoid calling it a "published release schedule" :) | 21:19 |
ianw | oh, that was me that added that | 21:19 |
clarkb | corvus: ack | 21:20 |
corvus | even now, that's still just an estimate, assuming that the change to actually remove ansible 5 gets written soon | 21:20 |
clarkb | If I manage to dig out from under everything else i may write that change if for no other reason than to keep my email as accurate as possible | 21:21 |
clarkb | but I still owe you a review on that static driver nodepool change | 21:21 |
clarkb | ianw: any idea what editor you use? I can't find one in these images | 21:21 |
clarkb | I guess I could heredoc a file replacement | 21:22 |
ianw | you probably have to apt-get install a vim | 21:22 |
clarkb | ok nb01 has been updated (but not 02) and I have unpaused rockylinux-9 | 21:25 |
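For reference, pausing and unpausing an image build is a one-line toggle in the nodepool builder config (a trimmed sketch; real entries carry many more fields):

```yaml
# Sketch of a builder diskimage entry: "pause: true" stops new builds of
# the image, and setting it false (or removing it) lets builds resume.
diskimages:
  - name: rockylinux-9
    pause: false
```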
fungi | clarkb: on the announcement, it's probably not necessary to call out the openstack release. readers who are concerned about impact to openstack release timing already know what point in the cycle it's at and the other projects on opendev likely don't care about openstack's release schedule anyway | 21:31 |
fungi | i'll separately post to openstack-discuss to draw their attention to the announcement anyway | 21:32 |
clarkb | fungi: like that? | 21:34 |
fungi | clarkb: yeah, looks great! | 21:34 |
fungi | short and to the point | 21:34 |
clarkb | I made one small additional edit related to that | 21:35 |
clarkb | I will send that out in a few minutes just to be sure there are no new suggestions | 21:35 |
fungi | yeah, still lgtm | 21:35 |
clarkb | ianw: frickler NeilHanlon the manually patched rocky 9 build seems to have gotten past where it was previously stuck | 21:37 |
clarkb | which is great, but also not so great because now we don't have extra debugging info | 21:37 |
clarkb | someone that understands mirrorlist vs not better than me should weigh in on whether we can get away with this change | 21:38 |
clarkb | I suspect it will cause all our jobs to use the single base repo instead of mirrors? that is probably not ideal | 21:38 |
clarkb | NeilHanlon: ^ the fact that it worked with that change makes me suspect there is a broken mirror in the mirror list though | 21:41 |
*** rotensepro_ is now known as rotensepro | 21:43 | |
fungi | mailman import test has just reached openstack-stable-maint, this is the long one | 22:08 |
ianw | clarkb: yeah, it sounds like the nb is getting hashed to some mirror that is unhappy ... | 22:22 |
ianw | i guess it may not even show up when just wgetting a file, it might be something to do with actually grabbing multiple files or other things dnf does | 22:23 |
clarkb | ya what I did by hand outside of the nodepool container was fetch the mirror list then take the first result of that and try to fetch the file from there | 22:25 |
clarkb | and that worked, but maybe something about user agents changes the listing? Definitely weird to me that dnf can't log who/what it talked to in a mirrorlist setup to enable debugging though | 22:26 |
clarkb | seems like thats a basic thing to have | 22:26 |
NeilHanlon | yeah, agreed clarkb. i will get this handled tonight. just been busy | 22:27 |
clarkb | NeilHanlon: well I don't think it's a rush. I'm just not super well clued into all this stuff so looking for guidance | 22:30 |
ianw | any objection if i force merge 856593? i would rather be debugging it in production than sending it around endless rechecks for external issues | 22:47 |
clarkb | ianw: looks like letsencrypt failed consistently. We're happy it wasn't due to this change? | 22:49 |
clarkb | (the other jobs that failed seemed to succeed at other times) | 22:49 |
clarkb | I think I'm ok with it if we are happy that the LE failure is unrelated | 22:49 |
ianw | yeah, it has all succeeded -- https://bbd364adb9ecffa2cba8-64c11a180f7c97233fe37e0ff9660661.ssl.cf1.rackcdn.com/856593/17/check/system-config-run-letsencrypt/8ea3d67/letsencrypt01.opendev.org/acme.sh/acme.sh.log | 22:51 |
ianw | "I believe this error basically says “the Let’s Encrypt database was overloaded”." | 22:52 |
ianw | oh in fact they have an incident listed | 22:52 |
ianw | October 11, 2022 18:45 UTC | 22:53 |
ianw | [Identified] A hardware failure has disrupted operation of the Let's Encrypt Staging Environment. We are working on restoring service. | 22:53 |
clarkb | would be good if fungi et al could weigh in too before doing that though | 23:02 |
fungi | checking... | 23:07 |
fungi | ianw: the le job still failed on the most recent build, from what i can see, but once you're confident they've got it resolved i'm still fine with that | 23:09 |
ianw | well it's just that we can't merge because we use the staging env in our tests | 23:10 |
clarkb | right all jobs needing le certs will fail until that is fixed | 23:12 |
clarkb | I suppose if all the jobs failed for that reason we might be able to just wait? Depends on how urgent you feel this is, I suppose | 23:12 |
ianw | i would like to give myself a bit of a runway to debug it in production ... | 23:13 |
fungi | oh! i missed that you said "force merge" | 23:14 |
fungi | i guess if you're also ready to force merge a revert if stuff stops working once it lands, i'm okay with that (or temporarily disable the failing job?) | 23:14 |
ianw | yeah, what i want to watch is the production jobs, and hopefully catch things before the periodic jobs start | 23:15 |
clarkb | note a revert won't actually undo things | 23:15 |
clarkb | but ya, force merging a fix would be the necessary step if LE staging continues to be broken | 23:16 |
clarkb | ok going to send that ansible and jammy announcement now | 23:20 |
fungi | thanks! | 23:20 |
ianw | it should run the deploy pipeline, right? | 23:22 |
clarkb | ianw: it should as that is triggered by changes merging | 23:23 |
clarkb | one risk there is if we force merge something that expects the gate to build images that get promoted in deploy. But I don't think this is the case here | 23:23 |
ianw | yeah, no images in this case, just the venv deployment to bridge | 23:24 |
clarkb | email sent to service-announce | 23:28 |
ianw | i was hoping to move on with this today :/ i'll give it a little while longer and see if things clear up | 23:41 |
clarkb | ianw: it == LE? | 23:42 |
clarkb | on the matrix bridging trouble i created a random room on oftc and tried to join it via matrix bridge and I don't even seem to be in the channel | 23:44 |
clarkb | which I guess makes sense as I don't seem to be in here anymore from matrix at all | 23:44 |
clarkb | we all seem to have ping timed out at around 23:33 UTC yesterday | 23:45 |
clarkb | about 24 hours ago | 23:45 |
clarkb | I'm working on filing an issue with their github repo since that seems to be where reports should go | 23:45 |
clarkb | hrm now I notice that some matrix bridged users made it back | 23:47 |
corvus | there was bridge trouble? | 23:50 |
ianw | clarkb: it == the venv deployment on bridge | 23:50 |
corvus | fwiw, i have seen chatting here throughout my day to day | 23:50 |
clarkb | ianw: right I guess I'm wondering if you are waiting on LE instead of force merging or if something else is happening that I've missed | 23:52 |
clarkb | corvus: yes a number of us can see irc messages in matrix but cannot send them to irc from matrix | 23:52 |
clarkb | corvus: when I asked the irc bridge operators on matrix about it I got pointed at a number of issues with the inverse problem which wasn't super helpful | 23:53 |
ianw | clarkb: the LE outage is the current problem ... but bigger picture too, it does seem running *all* the system-config-run jobs twice without any failures is quite a roll of the dice | 23:54 |
clarkb | I tried leaving #openstack-release and rejoining but that doesn't seem to have caused me to actually rejoin | 23:57 |
clarkb | if I issue !listrooms in the management room I get back you are connected, but not joined to any channels | 23:59 |
clarkb | so it seems to know something is up | 23:59 |