* corvus < https://matrix.org/_matrix/media/r0/download/matrix.org/QJXNXwCbiAfkqpfrwWRKgcUw/message.txt > | 00:00 | |
corvus | tristanC: i verified that the ssh private key is the same as what's currently being used with irc | 00:01 |
corvus | tristanC: could it be something about key types and the ssh library being used? | 00:01 |
corvus | i also verified that the key is present and readable in the container | 00:03 |
corvus | tristanC: oh interesting, i can reproduce that inside the container with plain ssh, but not on the host. | 00:05 |
corvus | tristanC: did you by any chance build this on a fedora container base? | 00:05 |
corvus | yes, appears to be f34 | 00:05 |
corvus | tristanC: i've heard something about f34 and ssh. i don't know the details, but i'm sure others in this channel do. | 00:06 |
corvus | tristanC: anyway, i bet you can reproduce the problem locally with an rsa key and the container image | 00:07 |
ianw | yeah that will need a PubkeyAcceptedKeyTypes ssh-rsa | 00:17 |
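For context, the client-side workaround ianw names looks roughly like this (a sketch only; the host and port are the usual Gerrit SSH endpoint and are assumptions here, not taken from the bot's config):

```
# ~/.ssh/config -- re-allow RSA/SHA-1 public key signatures for this host only,
# on OpenSSH clients (e.g. Fedora 34) whose defaults have disabled them
Host review.opendev.org
    Port 29418
    PubkeyAcceptedKeyTypes +ssh-rsa
```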
tristanC | corvus: https://review.opendev.org/c/opendev/system-config/+/803125/3 use another container image which i think should work better | 00:40 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring https://review.opendev.org/c/opendev/system-config/+/803125 | 01:11 |
tristanC | corvus: i've updated the image tag to a new version which uses libssh2 directly, the image is now only 20.9MB | 01:13 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring https://review.opendev.org/c/opendev/system-config/+/803125 | 01:17 |
Clark[m] | Really seems like fedora should update their default fallback hash type in the client | 01:31 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add script to import gerritbot configuration https://review.opendev.org/c/opendev/system-config/+/802983 | 01:42 |
Clark[m] | https://bugzilla.redhat.com/show_bug.cgi?id=1940201 is the issue basically | 01:43 |
Clark[m] | Except in our case it's not a really old server, just one that doesn't indicate support for sha2 explicitly | 01:43 |
Clark[m] | However in that case it is succeeding because old server | 01:44 |
Clark[m] | We want it to fail and force sha2 | 01:44 |
Clark[m] | https://github.com/openssh/openssh-portable/blob/master/sshconnect2.c#L1191 I think the code that needs modifying is somewhere near there | 02:00 |
*** ykarel|away is now known as ykarel | 04:45 | |
*** marios is now known as marios|ruck | 05:50 | |
*** jpena|off is now known as jpena | 06:54 | |
*** sshnaidm|afk is now known as sshnaidm | 07:27 | |
*** rpittau|afk is now known as rpittau | 07:30 | |
opendevreview | xinliang proposed openstack/diskimage-builder master: Introduce openEuler distro https://review.opendev.org/c/openstack/diskimage-builder/+/784363 | 08:35 |
*** ykarel is now known as ykarel|lunch | 08:46 | |
*** diablo_rojo__ is now known as diablo_rojo | 09:02 | |
*** ykarel|lunch is now known as ykarel | 10:03 | |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot https://review.opendev.org/c/opendev/system-config/+/803190 | 10:04 |
*** dviroel|out is now known as dviroel | 11:18 | |
*** jpena is now known as jpena|lunch | 11:20 | |
*** jpena|lunch is now known as jpena | 11:23 | |
*** jpena is now known as jpena|lunch | 11:30 | |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure https://review.opendev.org/c/opendev/system-config/+/803328 | 11:42 |
tristanC | clarkb: with ^ the bot no longer uses openssh, which seems to fix the current issue | 11:44 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring https://review.opendev.org/c/opendev/system-config/+/803125 | 11:45 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring https://review.opendev.org/c/opendev/system-config/+/803125 | 11:47 |
*** jpena|lunch is now known as jpena | 12:33 | |
opendevreview | Kendall Nelson proposed opendev/system-config master: Setting Up Ansible For ptgbot https://review.opendev.org/c/opendev/system-config/+/803190 | 12:33 |
fungi | python v3.10.0rc1 has been tagged | 13:49 |
*** Guest3154 is now known as notcorvus | 14:39 | |
*** notcorvus is now known as corvus | 14:40 | |
*** corvus is now known as Guest3197 | 14:40 | |
*** Guest3197 is now known as corvus | 14:42 | |
*** corvus is now known as notcorvus | 14:42 | |
*** notcorvus is now known as corvus | 14:42 | |
*** corvus is now known as Guest3200 | 14:43 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Update our python-base and -builder images https://review.opendev.org/c/opendev/system-config/+/803357 | 14:50 |
clarkb | it occurred to me that we're looking at gitea, gerrit, and zuul deployment updates and it seemed like a good idea to update our base images as a result | 14:51 |
clarkb | fungi's note about new python also had me thinking about it a bit | 14:51 |
*** ykarel is now known as ykarel|away | 14:52 | |
*** ksambor is now known as ksambor|away | 14:55 | |
*** jpena is now known as jpena|off | 15:06 | |
Guest3200 | there are 2 matrix fixes at https://review.opendev.org/803294 and https://review.opendev.org/803328 that would be nice to merge soon | 15:15 |
*** Guest3200 is now known as corvus | 15:17 | |
*** corvus is now known as notcorvus | 15:17 | |
*** notcorvus is now known as corvus | 15:17 | |
fungi | rackspace says afsdb03 was rebooted on sunday around 16:45 utc due to a host issue | 15:41 |
fungi | seems to be up and running at least | 15:42 |
fungi | also i've discovered that lists.o.o has gotten its ipv6 address listed in the spamhaus xbl, i'm waiting for the delisting request to be reviewed | 15:47 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use the gitea api in the gitea renaming playbook https://review.opendev.org/c/opendev/system-config/+/803366 | 15:50 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update gitea project creation to only use the REST API https://review.opendev.org/c/opendev/system-config/+/803367 | 15:50 |
fungi | ahh, there it goes | 15:50 |
fungi | #status log Requested Spamhaus XBL delisting for the IPv6 address of the lists.openstack.org server | 15:51 |
opendevstatus | fungi: finished logging | 15:51 |
clarkb | I'm trying to separate good hygiene (above) from the gitea 1.15.0 stuff | 15:51 |
clarkb | and I split things up because the second change is quite a bit more involved than the first but we really only need the first for the v1.15.0 upgrade | 15:52 |
clarkb | I'll rebase once it is clear things are working | 15:52 |
clarkb | tristanC: corvus: I guess libssh2 wants both a private key file and a pubkey file unlike openssh? | 15:53 |
clarkb | oh ha next file comments explain that | 15:53 |
fungi | clarkb: spot-checking the next proposed batch of 73 gerrit account cleanups i don't see anything concerning. comfortable with proceeding when convenient | 15:54 |
clarkb | fungi: excellent. Probably run the scripts tomorrow | 15:54 |
clarkb | corvus: I've approved the gerritbot change but have left the eavesdrop change in case fungi wants to review it. I think that change is more than small enough to go ahead and approve it if fungi doesn't have time though | 15:56 |
clarkb | infra-root https://review.opendev.org/c/opendev/system-config/+/803357 is a change to update our base python images. If you think that is a good idea I'd like to land that before I land any changes like https://review.opendev.org/c/opendev/system-config/+/803265 as it will ensure we're deploying up to date images as we go through and update images for other reasons | 15:58 |
tristanC | clarkb: corvus: thanks for the quick review! | 15:58 |
corvus | tristanC: you're welcome and thanks! | 15:59 |
fungi | i'll try to look at the eavesdrop change in a bit, after i get some lunch down | 16:00 |
tristanC | clarkb: about the bot monitoring, would it be ok to expose the prometheus port? and if so, what's the proper place to define the desired port number? | 16:01 |
clarkb | tristanC: it's fine, but we don't have a prometheus server yet so I'm not sure it will do much good. Writing a spec for that is on the todo list | 16:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Rebuild gerrit images to include fixes https://review.opendev.org/c/opendev/system-config/+/803372 | 16:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: Use the mariadb jdbc connector in gerrit https://review.opendev.org/c/opendev/system-config/+/803373 | 16:10 |
opendevreview | Clark Boylan proposed opendev/system-config master: Remove the mysql connector from our gerrit images https://review.opendev.org/c/opendev/system-config/+/803374 | 16:10 |
clarkb | ianw: ^ fyi thats a followup on our upstream bug fixes. They are present in the stable branches we build now. | 16:13 |
*** rpittau is now known as rpittau|afk | 16:13 | |
clarkb | stable-3.4 doesn't include them yet. There is a merge change in flight there that I want to land then I can propose a merge change that includes our two fixes | 16:14 |
clarkb | I think it is ok for us to proceed now that stable-3.2 and stable-3.3 include them though | 16:14 |
corvus | tristanC: you can probably grep 'iptables' in system config to find out how to open the port | 16:16 |
clarkb | fungi: https://review.opendev.org/c/opendev/system-config/+/803357 is another good one then I'd like to go ahead and land https://review.opendev.org/c/opendev/system-config/+/803265 and https://review.opendev.org/c/opendev/system-config/+/803266 | 16:17 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix health check and expose prometheus monitoring https://review.opendev.org/c/opendev/system-config/+/803125 | 16:18 |
clarkb | the internet says the air outside is breathable again. I'm going to go get some exercise now before the meeting. Back in a bit | 16:20 |
JayF | are there known issues with the mirrors? DIB jobs are 404'ing on a mirror Status code: 404 for http://mirror.mtl01.inap.opendev.org/fedora/updates/32/Everything/x86_64/repodata/repomd.xml | 16:23 |
clarkb | JayF: I want to say ianw was cleaning up fedora 32 | 16:25 |
clarkb | it is an eol release now | 16:26 |
JayF | That's incredibly unfortunate, it's hard-breaking the DIB gate | 16:26 |
clarkb | JayF: are there changes in DIB to remove fedora 32 testing? | 16:26 |
JayF | I haven't checked; I don't work on DIB usually just trying to get a change thru | 16:27 |
clarkb | I suspect that ianw would be on board with removing those tests given he also cleaned up fedora 32 mirroring | 16:27 |
JayF | there are zero CI-related changes in DIB pipeline | 16:27 |
clarkb | ah, it's just the func test that needs to be bumped forward. I expect this was unexpected fallout | 16:28 |
opendevreview | Merged opendev/system-config master: Update the gerritbot-matrix image to fix the ssh signature failure https://review.opendev.org/c/opendev/system-config/+/803328 | 16:28 |
JayF | I'll try to find a fix for it, just a little frustrated because the breakage of the mirror coincided almost perfectly with the mirror being broken | 16:29 |
gibi | hi! I don't know if it is a known issue but it seems that we don't have the bot resolving launchpad bugs in #openstack-nova | 16:30 |
JayF | ***coincided almost perfectly with getting +W | 16:30 |
JayF | lol | 16:30 |
*** marios|ruck is now known as marios|out | 16:30 | |
*** ssbarnea[m] is now known as zbr | 16:31 | |
fungi | gibi: that was a plugin we had in our old supybot-based meetbot implementation. it's not present in the new limnoria-based meetbot | 16:31 |
opendevreview | Jay Faulkner proposed openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34 https://review.opendev.org/c/openstack/diskimage-builder/+/799341 | 16:37 |
JayF | going to see if that is the fix it needs; that's an old patch that needed a rebase | 16:37 |
clarkb | I know there are/were problems with fedora 34 + debuntu rpm. That may work depending on how in depth those func tests get. I think if that fails you can set it to 33 | 16:41 |
JayF | ack; thank you for the pointer | 16:41 |
clarkb | ok really popping out now. Back in a bit | 16:41 |
opendevreview | Merged opendev/system-config master: matrix-eavesdrop: fix initial room path creation https://review.opendev.org/c/opendev/system-config/+/803294 | 17:23 |
opendevreview | Merged opendev/system-config master: Update our python-base and -builder images https://review.opendev.org/c/opendev/system-config/+/803357 | 17:23 |
nst__ | i have installed openstack using kolla ansible and try to testing freezer project, after finish installation and try to backup there is error like this, "CRITICAL freezer.main [-] Critical Error: can only concatenate str (not "bytes") to str" anyone have experience like this.. thank you | 17:24 |
*** dviroel is now known as dviroel|brb | 17:34 | |
fungi | nst__: you might have more luck asking in the #openstack-kolla channel or on the openstack-discuss@lists.openstack.org mailing list (i don't see any dedicated irc channel for the freezer project) | 17:37 |
fungi | #opendev is where we discuss the services which comprise the opendev collaboratory (so gerrit, zuul, etherpad/meetpad, mailman, et cetera) | 17:37 |
opendevreview | Merged opendev/system-config master: Use nodejs 14 https://review.opendev.org/c/opendev/system-config/+/803265 | 17:58 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca does that indicate the merger failed to fetch and update the repo? | 18:17 |
corvus | clarkb: i think so. i think that's worth a search of the uuid in the executor logs | 18:18 |
JayF | ianw: cgoncalves: clarkb: https://review.opendev.org/c/openstack/diskimage-builder/+/799341 appears to fix the DIB gate | 18:19 |
clarkb | corvus: ze07 appears to be the executor | 18:20 |
clarkb | corvus: 2021-08-03 18:03:24,008 ERROR zuul.ExecutorServer: [e: 6411f294b08945e1ba71d755335dbea1] [build: 70e992a5ba9c49bd8026b1bfa2a24aca] Got exception while updating repo gerrit/opendev/system-config | 18:20 |
clarkb | grepping by the event id doesn't seem to show more of an error unfortunately | 18:21 |
fungi | clarkb: the change to improve repo rename functional testing failed oddly on run-review-3.2 with a rather opaque-looking selenium error | 18:22 |
fungi | not sure if that's prone to nondeterminism but it's the first time i've noticed that particular error anyway | 18:23 |
clarkb | fungi: ya I see you rechecked it. fwiw I'm not sure 803265 will use the new image from 803357. But that is ok the v1.15 gitea update will get the new image. More important that we're generally updating things | 18:24 |
corvus | clarkb: stderr: 'ssh: Could not resolve hostname review.opendev.org: Temporary failure in name resolution | 18:25 |
corvus | clarkb: (from traceback without log line prefix) | 18:25 |
clarkb | corvus: neat! I think we still have a short ttl for that name. Let me push a change to bump it back out to an hour | 18:25 |
clarkb | while that won't totally fix things it should improve our ability to cache dns records | 18:25 |
clarkb | unfortunately I think zuul uses the python 3.8 images so we may need to reenqueue that? | 18:27 |
clarkb | or land another change to cause things to rebuild | 18:27 |
opendevreview | Clark Boylan proposed opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour https://review.opendev.org/c/opendev/zone-opendev.org/+/803390 | 18:28 |
clarkb | corvus: ^ there is the ttl bump | 18:28 |
clarkb | fungi: corvus: https://review.opendev.org/c/opendev/system-config/+/803366/1 should also be mergable. Looks like the child of that one passed testing, but it likely deserves in depth review | 18:29 |
opendevreview | Clark Boylan proposed opendev/system-config master: WIP Prepare gitea 1.15.0 upgrade https://review.opendev.org/c/opendev/system-config/+/803231 | 18:30 |
clarkb | And there is the v1.15.0 change rebased on top of the other cleanups | 18:30 |
*** dviroel|brb is now known as dviroel | 18:42 | |
opendevreview | Merged opendev/zone-opendev.org master: Increase review.opendev.org's CNAME TTL to one hour https://review.opendev.org/c/opendev/zone-opendev.org/+/803390 | 18:44 |
clarkb | infra-root thinking about the 3.8 promote failure a bit more I'm worried that the builder and base image pairing being out of sync like that may be a problem for our wheel builds? | 18:46 |
clarkb | I guess it depends if debian buster has updated any packages since our last image build with incompatible library updates? | 18:47 |
clarkb | all that to say I think we should try to correct this problem if we can. Do you think reenqueuing the buildset is a good idea or should we land another change to rebuild the images and go through it all again? | 18:48 |
fungi | so the promote job failed and the image wasn't tagged? | 18:49 |
clarkb | fungi: correct, but python-builder:3.8 was tagged. This means we'll be building wheels on newer debian buster than where we install and execute those wheels | 18:50 |
fungi | is it python-base:3.8 we're missing then? | 18:50 |
clarkb | yes that is the one that had its promote job fail | 18:51 |
clarkb | I think zuul and hound are the two images that use python3.8 primarily. Most things are still 3.7 | 18:51 |
clarkb | Zuul has functional testing and would catch these issues with the images if it becomes a problem I think | 18:51 |
clarkb | which means we're probably fine until we manage to promote that image one way or another | 18:52 |
fungi | i don't think reenqueuing that change into promote should present a problem | 18:52 |
clarkb | fungi: its deploy not promote | 18:52 |
fungi | oh, i misunderstood. thought you said python-base:3.8 wasn't tagged on dockerhub | 18:52 |
clarkb | fungi: yes that is correct, but the jobs run in the deploy pipeline | 18:53 |
fungi | i had no idea we were doing that in deploy instead of promote. interesting | 18:53 |
clarkb | which is where I think it gets confusing around whether or not we can reenqueue the buildset as it will not only rerun all the promote jobs but also run infra-prod-service-refstack and infra-prod-service-gitea | 18:53 |
clarkb | fungi: we do that so that the images update before we do the pull as we can't sequence across pipelines aiui | 18:54 |
fungi | i guess with it being a change-merged trigger, we already expect that changes could trigger out of sequence there, so at least don't have to worry that reenqueuing an older change would roll back the latest branch state in the context of the deployment | 18:55 |
fungi | so probably still safe to do | 18:56 |
clarkb | fungi: well that is what I'm not totally sure of. I think in this specific case the refstack and gitea jobs won't do anything unexpected though and in this specific case we should probably be ok | 18:56 |
clarkb | ya I think we're good to reenqueue | 18:59 |
tristanC | corvus: clarkb: did 803328 restart the compose, or should we add a force-recreate when the compose file changes? | 19:29 |
clarkb | tristanC: iirc the role does a docker-compose pull && docker-compose up -d and since the image changed that should cause it to restart | 19:30 |
clarkb | basically up -d will restart if there is a new image present | 19:30 |
tristanC | clarkb: alright, there it posted its first notification to the test channel! | 19:32 |
clarkb | fungi: only the hourly deploys are currently queued up if we want to enqueue the deploy jobs for https://review.opendev.org/c/opendev/system-config/+/803357 | 19:43 |
clarkb | corvus: ^ re that do you know of any reason it wouldn't be safe to rerun the docker image promote jobs? | 19:51 |
corvus | clarkb: it could "rewind" production? do you have a build/buildset link? | 19:53 |
corvus | we can see what jobs will run | 19:53 |
clarkb | corvus: the buildset for the name resolution failure I linked to previously | 19:53 |
clarkb | corvus: https://zuul.opendev.org/t/openstack/build/70e992a5ba9c49bd8026b1bfa2a24aca | 19:53 |
clarkb | corvus: as described above I'm slightly worried that having a newer python-builder image than our python-base image could result in errors building wheels if there are binary differences in a way that matter | 19:54 |
corvus | clarkb: so it could rewind refstack and gitea -- has anything changed with those since then? | 19:54 |
clarkb | However, only zuul and hound would be affected so it's not super critical (as zuul testing should catch that if this is the case for zuul) | 19:54 |
clarkb | corvus: gitea got https://review.opendev.org/c/opendev/system-config/+/803265 applied after that change with the failure | 19:55 |
clarkb | corvus: when you say rewind do you mean it could rewind the docker images or just system-config on bridge or both? | 19:55 |
clarkb | the alternative is to land another base image update change | 19:55 |
clarkb | I'm happy to propose that if we want to approve that instead | 19:55 |
corvus | system-config and whatever the ansible there does. i don't think that will affect the gitea server | 19:55 |
corvus | i think it would have to be an ansible change, not a docker image change | 19:55 |
clarkb | corvus: I agree I don't think it will affect the gitea server | 19:56 |
corvus | oh wait | 19:56 |
fungi | do we actually use the change in the deployment tasks though, or the branch state? | 19:56 |
corvus | will that retag :latest? | 19:56 |
clarkb | corvus: I wonder if it will | 19:56 |
clarkb | corvus: you know I think it may | 19:57 |
clarkb | because it just looks at the current change's image and says you are latest now | 19:57 |
corvus | if it did not delete the old tags, then it probably will. if the old tags are deleted, then it will try and fail. | 19:57 |
clarkb | fungi: we use the branch state for system-config but project-config should be latest iirc | 19:57 |
corvus | based on our earlier experience, that's a crap shoot. we probably can't even tell because of the dockerhub ui/api difference. | 19:57 |
clarkb | corvus: ya I think we should do a new change to bump the image again | 19:57 |
corvus | clarkb: i think that's safest and most predictable :) | 19:58 |
clarkb | I'll push that momentarily | 19:58 |
fungi | that's fine by me. it's low-effort, maybe even lower (just going to take a little longer to go through) | 19:58 |
fungi | i'll be around for at least a few more hours anyway | 19:58 |
corvus | tristanC, clarkb, fungi: and eavesdrop is working too! https://meetings.opendev.org/irclogs/%23test/latest.log.html | 20:01 |
fungi | oh nice | 20:01 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update the python docker images again https://review.opendev.org/c/opendev/system-config/+/803394 | 20:01 |
clarkb | corvus: fungi ianw ^ ok that should get us sorted | 20:01 |
clarkb | the ttl update is in place too so dns should be much more reliable now | 20:02 |
fungi | i need to disappear for a bit to make dinner but should probably be back by the time that's done | 20:03 |
clarkb | ya I need to get lunch myself | 20:03 |
clarkb | ianw: https://review.opendev.org/c/opendev/system-config/+/803372 is a stack that will probably interest you as it attempts to incorporate our upstream gerrit fixes | 20:03 |
clarkb | ianw: but I don't think we should approve anything out of that stack until 803394 promotes properly. That way we get shiny python:3.7 images to work with | 20:04 |
clarkb | and now lunch | 20:04 |
opendevreview | Merged opendev/system-config master: Improve repo rename functional testing https://review.opendev.org/c/opendev/system-config/+/803266 | 20:13 |
fungi | yay, it merged on retry. that selenium error was... weird | 20:14 |
opendevreview | Merged openstack/diskimage-builder master: Fedora: bump DIB_RELEASE to 34 https://review.opendev.org/c/openstack/diskimage-builder/+/799341 | 20:23 |
ianw | fungi: yeah, i think it can still be a bit racy, and because of the shadow-dom walking the queries are really weird | 20:31 |
opendevreview | Tristan Cacqueray proposed opendev/system-config master: Add gerritbot-matrix identity lookup configuration https://review.opendev.org/c/opendev/system-config/+/803396 | 20:35 |
ianw | clarkb: thanks, responded; it's pulling in a different variables file that should hopefully stop pip.conf adding wheel mirrors | 20:35 |
ianw | that's the theory, anyway | 20:35 |
clarkb | ianw: I'm not sure that will work since defaults define wheel_mirror | 20:38 |
clarkb | ianw: also we want the pypi mirror to be set just not the wheel mirror | 20:38 |
ianw | i was modelling on fedora, but it's possible that's also wrong | 20:40 |
clarkb | ya I suspect that we may be writing an incomplete pip.conf where it is extra-index-url = and then nothing | 20:41 |
clarkb | or if it is something it is https:///centos-8/wheels/ or whatever the path is | 20:41 |
clarkb | since we don't guard against the vars being undefined or empty | 20:41 |
ianw | hrm, i guess https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/configure-mirrors/defaults/main.yaml applies | 20:42 |
ianw | i think we should rework this to more clearly delineate between platforms we build wheels for and not | 20:43 |
ianw | a role to generate the wheel slug would also be useful, as i could re-use that in the publishing jobs | 20:43 |
clarkb | ianw: in the centos-8-stream make a wheel mirror change, what ends up doing the afs vos release? | 20:47 |
clarkb | oh hrm these are just check jobs to see if we can build the wheels not publish them | 20:48 |
clarkb | then when that is working we add in publication I guess | 20:48 |
ianw | yeah, the publish jobs are in project-config | 20:48 |
clarkb | +2 | 20:48 |
ianw | that's where i need to sort it out writing to the correct afs locations | 20:49 |
clarkb | fungi: ianw heads up I have approved https://review.opendev.org/c/opendev/system-config/+/803263 | 20:49 |
clarkb | the newlist command fix | 20:49 |
*** dviroel is now known as dviroel|out | 20:54 | |
tristanC | corvus: for gerritbot matrix identity lookup, we would need to run this procedure, would you mind giving it a try and let me know if these instructions work: https://review.opendev.org/c/opendev/system-config/+/803396/1/playbooks/roles/matrix-gerritbot/README.rst | 21:00 |
clarkb | 803394 is waiting for one more node in check. I wonder if we've got some sort of failure or slow boot somewhere /me tries to be patient | 21:06 |
opendevreview | Merged opendev/system-config master: Skip notification prompt for newlist command https://review.opendev.org/c/opendev/system-config/+/803263 | 21:19 |
clarkb | system-config-build-image-gitea-init had its node request fulfilled over an hour ago according to the logs on nl04 | 21:40 |
clarkb | but the job hasn't started. Is this the bad locks in the nodepool launcher situation? should I restart nodepool-launcher on nl04? | 21:40 |
clarkb | fungi: ^ | 21:40 |
clarkb | I think you've debugged a few of these before | 21:40 |
corvus | tristanC: done and added to private hostvars. worked fine | 21:41 |
clarkb | https://paste.opendev.org/show/bMM7UPkSF7g1bqrXTYYX/ is what that looks like on the scheduler and the nodepool launcher | 21:41 |
clarkb | I'm going to restart nl04's launcher in a couple of minutes if I don't hear objections. I suspect this is that issue | 21:42 |
clarkb | restarting nl04's launcher does not seem to have changed the state of that job | 21:48 |
clarkb | corvus: ^ fyi this seems like some sort of zk related issue in job scheduling | 21:48 |
clarkb | comparing the logs for the request associated with the stuck job to logs for other requests it seems that the nodepool side believes its work to be done | 21:49 |
corvus | clarkb: if the scheduler never logged an update to that node request then it's probably not due to the recent zk work | 21:50 |
corvus | i think that's more or less the same as before | 21:50 |
clarkb | 2021-08-03 20:28:17,876 INFO zuul.nodepool: [e: 7e9b6a5b1e2f41989b016e4b4699346a] Request 299-0014954444 no longer exists, resubmitting | 21:51 |
clarkb | ok that is curious | 21:51 |
corvus | oh yeah there's more logs than the paste | 21:51 |
corvus | that does point toward recent changes | 21:51 |
clarkb | ya sorry I realized i could grep by the request id in the scheduler | 21:52 |
clarkb | nodepool does not see that request id show up i nany of its logs after the first pass of request handling | 21:52 |
clarkb | basically the resubmitting appears to not have happened | 21:52 |
clarkb | I suspect 802362,1 may be in a similar state? | 21:53 |
corvus | it'll get a new request id | 21:53 |
clarkb | ah ok let me dig more then | 21:54 |
corvus | there's a good chance that the new request may be ignored by zuul since it also acked the original | 21:54 |
corvus | still reading code | 21:54 |
clarkb | 299-0014954458 is the updated request id. I'm looking for it now | 21:54 |
clarkb | that one did the same thing on nl04. It was fulfilled and unlocked so nl04 removed its request handler for it | 21:55 |
clarkb | the scheduler only logs this Updating node request <NodeRequest 299-0014954458 <NodeSet ubuntu-focal [<Node 0025752729 ('ubuntu-focal',):ubuntu-focal>]>> for that request id | 21:57 |
clarkb | it doesn't say anything about the request going away and resubmitting like the first one | 21:57 |
corvus | hrm, the updated request in zk looks correct | 22:04 |
corvus | ```{"state": "fulfilled", "state_time": 1628022502.423527, "declined_by": [], "node_types": ["ubuntu-focal"], "nodes": ["0025752729", "0025752729"], "reuse": true, "requestor": "zuul02.opendev.org", "requestor_data": {"build_set_uuid": "b397fc88821b4b398469a227dfde24f7", "tenant_name": "openstack", "pipeline_name": "check", "job_name": "system-config-build-image-gitea-init"}, "provider": "ovh-gra1", "relative_priority": 0, "event_id": | 22:04 |
corvus | "7e9b6a5b1e2f41989b016e4b4699346a"}``` | 22:04 |
corvus | though i wonder why the node id shows up twice | 22:05 |
corvus | oh, i wonder if that's due to the re-request | 22:05 |
clarkb | is https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L464 sufficient to tell the scheduler where the new request is? | 22:06 |
corvus | clarkb: i don't think the scheduler should use that any more -- it should rely on the requestor_data in the request itself now | 22:06 |
clarkb | gotcha | 22:07 |
corvus | in fact, that may be nearly dead code at this point | 22:08 |
clarkb | did _updateNodeRequest() which is the watcher for the request possibly return false so we stopped watching it but the scheduler didn't/doesn't know about it somehow? | 22:08 |
fungi | clarkb: yes, we've been seeing this off and on since at least april | 22:08 |
clarkb | fungi: this one seems different fwiw | 22:08 |
clarkb | fungi: I tried the restart and no luck. corvus is digging in above | 22:09 |
fungi | restarting the launcher frees the node request locks it may be holding and allows other launchers to try to satisfy them | 22:09 |
fungi | more recently i've noticed that after some time zuul reissues a new node request id, and i've seen the same launcher keep grabbing and sitting on it | 22:10 |
fungi | okay, caught up reading now | 22:10 |
fungi | it seems like the stuck node requests we used to see have more recently been replaced by the re-request behavior where the same launcher which was sitting on the old request also grabs and sits on the new one | 22:11 |
fungi | the other day i traced a log where that happened three times in sequence for the same build | 22:11 |
clarkb | corvus: reading the code around where I just linked and _updateNodeRequest I notice we are reusing the request object and setting some flags on it rather than creating a new one. Is it possible some status like request.cancelled is set and when the node request comes back we essentially ignore it | 22:11 |
corvus | 2021-08-03 20:28:05,916 INFO kazoo.client: Zookeeper connection lost | 22:12 |
corvus | so it was correct to resubmit the request; the error here is that the resubmitted request did not work correctly | 22:12 |
clarkb | corvus: ya that is what I'm leaning towards. However I don't think it was cancelled unless the zk connection does that implicitly without logging it somehow | 22:13 |
clarkb | neither the original request id or the new request id logs that it was cancelled | 22:13 |
corvus | clarkb: we shouldn't get to the point of resubmitting a request if it's canceled | 22:13 |
corvus | clarkb: requests are ephemeral, so zk will automatically delete it | 22:13 |
clarkb | corvus: I'm just going through the watcher callback and trying to rule branches out. Request .* is unknown doesn't show up in the log file at all so we didn't hit that branch. | 22:19 |
clarkb | 'Resubmitting lost node request' doesn't show up for these requests so we don't hit that branch either | 22:19 |
corvus | clarkb: since we saw one "Updating node request" from the scheduler, but nothing after that, it does sort of seem like the callback may have returned false | 22:19 |
clarkb | and Node request 123-foo FULFILLED doesn't show up so we don't hit that branch | 22:20 |
clarkb | leaving the cancelled branch but I don't think we hit that either | 22:20 |
clarkb | corvus: its like the watcher went away entirely | 22:20 |
clarkb | I don't think the request was cancelled because when we cancel a request that produces other logging which doesn't show up though we don't explicitly log for that in updateNodeRequest() | 22:21 |
corvus | clarkb: a tricky thing to look out for: request.uid != request.id -- so the "Request %s is unknown" log may not show up in our greps. however, "is unknown" doesn't show up at all, so we know that didn't hit | 22:22 |
clarkb | corvus: yup I ended up grepping for `Request .* is unknown` for that reason | 22:23 |
corvus | clarkb: fwiw, i think the instigator of this incident is the event size issue that we merged fixes for yesterday | 22:26 |
corvus | that doesn't make this okay -- we still have a problem. but as an aside, the underlying zk disconnect issue should already be addressed. | 22:26 |
clarkb | thats good | 22:27 |
clarkb | corvus: does it make sense to abandon and restore this change for now? | 22:27 |
clarkb | and have zuul retry from scratch? | 22:27 |
clarkb | or do we maybe want to restart on the changes from yesterday? | 22:27 |
clarkb | (and queue restore will reenqueue) | 22:28 |
corvus | clarkb: yes, but if you can hold just a bit longer that'd be great | 22:28 |
JayF | If I have a job running, with a bunch of `nodepool` jobs that aren't starting, would this be a similar symptom of this issue | 22:28 |
clarkb | corvus: yup not in a rush. I already waited a bunch :) | 22:28 |
JayF | this one is pretty fresh if that makes a difference to you :) | 22:28 |
clarkb | JayF: if its fresh then it can just be normal delay | 22:28 |
JayF | ack, just thought I'd offer up an example if there was a need for one | 22:29 |
clarkb | JayF: if this is for your dib change I think it is waiting on docker images to build and that is slow because arm64 images are built via qemu or something | 22:29 |
clarkb | I don't think that change is in a bad spot yet | 22:29 |
JayF | I'm just judging based off how long the other DIB change I landed this afternoon took to run them | 22:29 |
corvus | clarkb: i agree the only way to exit the callback returning False with no log lines, is if the req was canceled (or maybe an exception would stop the watcher?) | 22:29 |
JayF | which is highly subjective anyway :) | 22:29 |
clarkb | JayF: specifically nodepool-build-image-siblings needs to succeed then those waiting jobs will all grab the built image from there | 22:29 |
JayF | oh wow, that's cool, I didn't realize you could seed jobs that way | 22:29 |
JayF | I might go look at that config instead of doomscrolling zuul status :D | 22:30 |
corvus | JayF: you can also mouseover "waiting" on the status page | 22:30 |
corvus | zuul will tell you what it's waiting for | 22:30 |
JayF | :-O | 22:30 |
clarkb | corvus: my suspicion is that something related to the carried over state is leaking | 22:30 |
JayF | how have I been using openstack/zuul for this long and never realized that! Thanks! | 22:30 |
clarkb | corvus: since we reuse the request object and modify it when resubmitting rather than creating a new object | 22:31 |
corvus | JayF: it might be because we added it 2 months ago..... but then again... it might not be. ;) | 22:31 |
clarkb | corvus: and somehow that is causing updateNodeRequest() to take an unexpected path | 22:31 |
corvus | clarkb: but the only possible path that could be is canceled, right? | 22:32 |
corvus | and that's not possible without a "Canceling node request" line | 22:32 |
clarkb | corvus: yes and yes | 22:32 |
corvus | (because that's the only place canceled=True is set) | 22:32 |
corvus | clarkb: i think i should fire up the repl and see if i can find the request object(s) | 22:33 |
clarkb | corvus: ok let me know if I can help more | 22:33 |
clarkb | I'll hold off on changing anything about the change itself until you give the all clear | 22:33 |
corvus | thanks | 22:33 |
clarkb | JayF: you'll notice those jobs have flipped from waiting to queued. Now they are queued up and looking for nodes to run on | 22:36 |
corvus | the replacement request is in memory, the original is not | 22:36 |
clarkb | corvus: aren't they the same object? | 22:37 |
corvus | good point :) | 22:37 |
corvus | req.canceled = False | 22:37 |
clarkb | https://opendev.org/zuul/zuul/src/branch/master/zuul/nodepool.py#L462-L463 because of that | 22:37 |
corvus | the uid is '47debb1d7af84d87918de66a7d3c9649' | 22:37 |
clarkb | we don't seem to log that uid at all | 22:38 |
corvus | req.state = 'requested' | 22:38 |
corvus | nope, only on that one "is unknown" line. | 22:38 |
clarkb | if state is requested and cancelled is false then we should've fallen through and returned true | 22:39 |
clarkb | because we know it wasn't deleted (lack of logs for deleted) | 22:39 |
clarkb | is it possible we've somehow missed a watcher event or the watcher is no longer watching? | 22:40 |
clarkb | oh wait | 22:40 |
clarkb | 2021-08-03 20:28:22,423 DEBUG nodepool.driver.NodeRequestHandler[nl04.opendev.org-PoolWorker.ovh-gra1-main-8e061a04c2b24871b6e337ad89b66404]: [e: 7e9b6a5b1e2f41989b016e4b4699346a] [node_request: 299-0014954458] Fulfilled node request | 22:40 |
clarkb | that should've set req.state to fulfilled not requested | 22:40 |
corvus | clarkb: right, the zk node reflects that, the object in zuul's scheduler memory says requested | 22:41 |
clarkb | ah | 22:41 |
corvus | so that tells us that the watcher didn't fire for the fulfilled event for the re-request (the logs already told us that) | 22:41 |
clarkb | yup | 22:41 |
clarkb | it did fire at some point because we got the single log line but it didn't fire for the fulfilled state | 22:41 |
clarkb | and we're running out of possibilities for a return False there | 22:42 |
corvus | it's possible the first callback is just the kazoo datawatch firing on creation, and that we never actually got a zk watch notification | 22:43 |
clarkb | good point | 22:44 |
clarkb | I forgot it does that | 22:44 |
clarkb | in fact it would have to be right? since we got the log line and it does call on creation of the watch? | 22:44 |
corvus | i think so | 22:45 |
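A minimal sketch of the kazoo DataWatch behaviour being discussed (the path mirrors the request id above; the connection string is a placeholder): the callback fires once as soon as the watch is registered, fires again on later changes, and stops watching if it returns False.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')  # placeholder connection string
zk.start()

@zk.DataWatch('/nodepool/requests/299-0014954458')
def on_request_update(data, stat, event):
    # event is None for the initial call made at registration time
    if data is None:
        return False  # the (ephemeral) znode is gone; stop watching
    print('request updated:', data)
    return True
```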
corvus | interesting seeing a node request in 'requested' state with nodes in its nodeset | 22:48 |
corvus | (because the nodeset was updated by the callback from the previous request) | 22:49 |
clarkb | oh that is a bug because we don't clear out the nodes from the request and then reuse it? | 22:49 |
clarkb | ya | 22:49 |
clarkb | and then when it went back to nl04 it gave the same node back again | 22:49 |
clarkb | JayF: your change is running builds for all jobs now. It should end up being fine | 22:51 |
corvus | the zk data that it has (from the original request) only has one node id (which is correct). the current data in zk has 2. | 22:51 |
corvus | i suspect that's a second (minor?) bug | 22:51 |
corvus | that's probably something we need to clear out but don't. | 22:52 |
corvus | fwiw, i think if we can drop the nodepool.requests dict, we can use fresh noderequest objects | 22:52 |
JayF | clarkb: I see that, thank you :D | 22:53 |
clarkb | corvus: thinking out loud a bit more. Is it possible that because we lost the zk connection there was some associated watcher cleanup on the scheduler side, but since we submitted a new request we got the initial call for a watch against a dead watch again? | 23:04 |
clarkb | I think that wouldn't be the case because from the zk perspective the node request did change | 23:04 |
clarkb | since the node request id in zk is the path entry | 23:05 |
corvus | clarkb: i've been thinking along similar lines, but yeah, from zk/kazoo, they're just operating on paths, so it looks like 2 different paths, nodes, watchers | 23:05 |
corvus | (it's just that the watchers closures happen to have the same first argument) | 23:05 |
clarkb | yup | 23:05 |
corvus | now that does mean that a zombie watch callback for the first path would affect the shared request object, but we don't have a log entry indicating that | 23:06 |
clarkb | and it is an ephemeral node so in theory zk cleaned that up before we reconnected | 23:07 |
clarkb | right? | 23:07 |
corvus | yep | 23:07 |
clarkb | it would be weird if ephemeral nodes could make it across connections in that way | 23:07 |
corvus | maybe we should ask zk what watches it has | 23:07 |
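One way to do that from Python with kazoo (a sketch; it assumes the ZooKeeper four-letter-word commands are whitelisted on the server and a plain client port is reachable). 'wchp' lists watches grouped by znode path; 'wchc' groups them by session.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts='localhost:2181')  # placeholder; point at a cluster member
zk.start()
print(zk.command(b'wchp'))  # dump watches by path
zk.stop()
```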
corvus | that's going to be a bit longer than it used to be | 23:08 |
corvus | nl01, nl02, and two sessions from nl04 hold watches on that path | 23:11 |
clarkb | but none from the scheduler | 23:12 |
clarkb | which would explain the behavior we see but not explain how or why it went wrong | 23:12 |
corvus | appears to be the case | 23:12 |
corvus | yep | 23:12 |
clarkb | nl01 and nl02 both yielded the request iirc | 23:13 |
opendevreview | Merged openstack/diskimage-builder master: Permit specification of extra bootstrap packages https://review.opendev.org/c/openstack/diskimage-builder/+/802592 | 23:13 |
clarkb | but nl03 didn't | 23:13 |
corvus | 2021-08-03 20:28:45,369 ERROR kazoo.recipe.watchers: list index out of range | 23:17 |
corvus | 2021-08-03 20:28:45,374 ERROR kazoo.handlers.threading: Exception in worker queue thread | 23:17 |
corvus | File "/usr/local/lib/python3.8/site-packages/zuul/zk/nodepool.py", line 503, in updateNodeRequest | 23:18 |
corvus | request_nodes[i].id = nodeid | 23:18 |
corvus | i think it's actually the mismatch in node lengths | 23:18 |
clarkb | oh neat | 23:18 |
clarkb | I guess the obvious weird thing was the thing to look at | 23:18 |
corvus | it looks like an exception there will cause it to stop watching | 23:18 |
corvus | clarkb: so i think your hunch of "we're not clearing something out" was right -- and the thing we're not clearing out is the node list | 23:20 |
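A rough, self-contained reconstruction of the failure (not Zuul's actual classes): the reused request object still carries the single node recorded for the original request, while the resubmitted request's znode data lists the node id twice, so updating the nodes by index runs off the end of the in-memory list.

```python
class Node:
    def __init__(self, node_id=None):
        self.id = node_id

request_nodes = [Node('0025752729')]                   # left over from the original request
znode_data = {'nodes': ['0025752729', '0025752729']}   # as seen in the zk dump above

for i, node_id in enumerate(znode_data['nodes']):
    request_nodes[i].id = node_id   # raises IndexError when i == 1, which kills the watch
```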
clarkb | that method gets attached to the callback we were looking at previously | 23:20 |
clarkb | what isn't clear to me is why the callback raising would cause the watch to go away? I thought you had to return False for that | 23:20 |
clarkb | in any case that should be an easy fix. We need to more aggressively clean up the previous state? | 23:22 |
corvus | clarkb: yeah, me neither. i think we'll need to check the kazoo watch callback to fully understand that (i think it's outside the scope of the datawatch) | 23:22 |
corvus | clarkb: yes -- though i think that avoiding re-use of that object will solve this problem, and we can do that if we drop the internal dict | 23:23 |
corvus | if that's not a trivial change, then yeah, let's just clear out the node list for now | 23:23 |
corvus | i will finish up the work i have going on the merger then do that | 23:23 |
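A minimal sketch of the interim fix being discussed (illustrative names, not Zuul's actual code): reset the per-attempt state carried on the reused request object before resubmitting it, so the watch callback never sees a stale node list.

```python
def resubmit_node_request(zk_nodepool, request):
    # hypothetical helper: clear state from the lost request before re-submitting
    request.id = None
    request.state = 'requested'
    request.nodes = []        # drop nodes carried over from the previous attempt
    request.canceled = False
    zk_nodepool.submitNodeRequest(request)  # assumed submit method name
```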
clarkb | corvus: ok, am I good to abandon and restore? | 23:24 |
corvus | clarkb: one sec | 23:24 |
clarkb | looking at the code it appears we use that dict to count things for statsd | 23:24 |
clarkb | but otherwise it's just internal accounting that we maybe don't need anymore | 23:25 |
corvus | oh yep, it's the current request length | 23:25 |
clarkb | corvus: we could use a simple counter for that though? | 23:26 |
clarkb | increment and decrement instead of adding and removing from a dict | 23:26 |
clarkb | and that integer won't drag state along with it and cause unexpected things to happen | 23:26 |
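A small sketch of that idea (illustrative only, and the gauge name is an assumption rather than Zuul's actual metric): keep a plain counter of outstanding requests for statsd instead of retaining the request objects in a dict.

```python
class RequestCounter:
    def __init__(self, statsd):
        self.statsd = statsd
        self.outstanding = 0

    def submitted(self):
        self.outstanding += 1
        self.statsd.gauge('zuul.nodepool.current_requests', self.outstanding)

    def completed(self):
        self.outstanding -= 1
        self.statsd.gauge('zuul.nodepool.current_requests', self.outstanding)
```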
clarkb | Anyway I'll standby on abandon restore until you give the go ahead | 23:27 |
corvus | clarkb: okay, i cleaned up the repl and shut it off. feel free to proceed | 23:27 |
clarkb | proceeding thanks | 23:28 |
clarkb | there are a couple of other changes that exhibit similar behavior that I will do this to as well | 23:28 |
clarkb | oh ha I cannot abandon and restore in some repos because we changed how admin works | 23:30 |
fungi | so you don't think restarting the launcher holding those node requests will work this time? it seemed to address the instances i saw last week, some of which had multiple reissued requests | 23:32 |
corvus | in this case, the launchers are not holding the node request. it's done with it. | 23:33 |
clarkb | fungi: I did restart the launcher and it did not help | 23:33 |
corvus | nothing will cause zuul to see an update to this particular node request | 23:33 |
clarkb | doing that and not changing the situation is what led to me doing the deeper dive debug above | 23:33 |
fungi | ahh, okay | 23:33 |
fungi | and zuul dequeue won't do it either? | 23:34 |
clarkb | fungi: that will probably work. It's easier to click buttons in the gerrit web ui though :P We can do dequeue enqueue for the starlingx/ha change I guess | 23:35 |
clarkb | but my ssh keys are telling me I shouldn't :P | 23:35 |
timburke_ | huh. anything change with paste.opendev.org lately? i just made https://paste.opendev.org/show/807865/, but it doesn't show the content -- yet when i go to https://paste.opendev.org/raw/807865/, it's all there :-/ | 23:47 |
Clark[m] | timburke_ the server was upgraded and the db server changed as a result | 23:59 |
Clark[m] | I wonder if this is a weird interaction resulting from that | 23:59 |
Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!