ianw | tonyb: https://paste.opendev.org/show/bJeCVfdS9SUfEmRdWpun/ is the failure for 902008 ... looks like cacti failed trying to build the certcheck domains for mirror03.gra1.ovh.opendev.org | 05:12 |
ianw | it's worked in the next run | 05:15 |
ianw | oh i think i see it | 05:19 |
ianw | TASK [letsencrypt-request-certs : Run acme.sh driver for certificate issue] **** | 05:19 |
ianw | fatal: [mirror01.iad.rax.opendev.org]: FAILED! => { | 05:19 |
ianw | "changed": false, | 05:19 |
ianw | "rc": -13 | 05:19 |
ianw | } | 05:19 |
ianw | it looks like that failed. that aborted the letsencrypt run on mirror01, so it never got to the next step, where it populates the hostvar with the certificates to put into the certcheck list | 05:20 |
ianw | so it was looping around making that list -- mirror03.gra1.ovh.opendev.org was the one *before* this ... it asked for hostvars[mirror01.iad...]['letsencrypt_certcheck_domains'] which didn't exist because letsencrypt-request-certs bailed out before it got set | 05:21 |
ianw | why acme.sh failed -13 i don't know :) | 05:23 |
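For reference, a minimal Ansible sketch of the aggregation pattern ianw describes above. This is not the actual system-config task: the group name is hypothetical, and only the letsencrypt_certcheck_domains hostvar name comes from the log.

```yaml
# Sketch only: build the certcheck list by reading a hostvar that each host
# sets once letsencrypt-request-certs succeeds. If that role bails on one
# host (as it did on mirror01.iad), the hostvars lookup for that host is
# undefined and the loop iteration for the *next* host's aggregation fails.
- name: Aggregate certcheck domains from all letsencrypt hosts
  set_fact:
    certcheck_domains: >-
      {{ certcheck_domains | default([])
         + hostvars[item]['letsencrypt_certcheck_domains'] }}
  loop: "{{ groups['letsencrypt'] | default([]) }}"  # hypothetical group name
  # A guard like this would skip hosts where the earlier role bailed out:
  # when: hostvars[item]['letsencrypt_certcheck_domains'] is defined
```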
tonyb | ianw: thanks. that makes sense. the fact that the error looks like mirror03 rather than mirror01.iad is a little confusing. | 06:24 |
tonyb | I'll grab acme.sh tomorrow and see what exits with -13 | 06:25 |
tonyb | .... where I guess tomorrow is actually later today | 06:25 |
ykarel | frickler, fungi i have reported https://bugs.launchpad.net/neutron/+bug/2045383 to track that xena job issue | 07:50 |
opendevreview | Steve Brasier proposed openstack/diskimage-builder master: allow passing arbitrary build options in containerfile element https://review.opendev.org/c/openstack/diskimage-builder/+/902366 | 10:24 |
*** ykarel is now known as ykarel|away | 13:51 | |
fungi | ykarel|away: it should be fixed when zuul restarts tomorrow, since https://review.opendev.org/902316 has already merged | 14:51 |
clarkb | Looks like my message about that ansible thing didn't make it to irc | 16:08 |
clarkb | tonyb: ianw: https://review.opendev.org/c/opendev/system-config/+/898475 is related to the certcheck domains list building issue | 16:08 |
clarkb | I thought it was an ansible bug that may have been fixed by ansible 8 because it seemed to go away after we upgraded, but I guess not | 16:09 |
clarkb | we can resurrect that change if we think the extra debug info would still be useful | 16:09 |
clarkb | tonyb: frickler: thoughts on proceeding with https://review.opendev.org/c/opendev/system-config/+/902315/ and child to rotate the replication key? | 16:09 |
clarkb | I'm pretty sure I can be around for a gerrit restart at some point today if we want to try and land both of them (we should ensure the gitea one lands and applies first though just to avoid any errors there) | 16:11 |
fungi | same | 16:11 |
clarkb | the new/replacement/fixed gerrit 3.9.x release should be out sometime today | 16:35 |
clarkb | I'm hoping it will be based on rc6 this time around and we already ran tests against that rc and they were good | 16:35 |
clarkb | fungi: half wondering if we should proceed at least with the gitea update since the risk for that one is lower? (though technically if I got the key cleanup code wrong it could remove the key we are relying on, I suppose. This seems unlikely because it didn't remove the key yet) | 16:41 |
tonyb | clarkb: reviewing 902315 now | 16:47 |
clarkb | thanks! | 16:48 |
tonyb | clarkb: I'm happy with both of those. I agree we should land the gitea one first; pause/check and then land the gerrit one | 17:03 |
tonyb | I'm going to be in and out a bit today | 17:03 |
tonyb | clarkb: I think restoring https://review.opendev.org/c/opendev/system-config/+/898475 is a good idea. | 17:06 |
clarkb | tonyb: thanks I approved the gitea change and restored the ansible debugging change | 17:23 |
tonyb | Awesome | 17:26 |
tonyb | doing a little more digging. the -13 that ianw logged is coming from ansible NOT acme.sh. the shell module had an error | 17:27 |
tonyb | MODULE FAILURE | 17:27 |
tonyb | See stdout/stderr for the exact error | 17:27 |
tonyb | looking at logs on mirror01.iad, indeed LE never ran there | 17:28 |
tonyb | so STDOUT would go to the LE log that I've been looking at. do we lose STDERR? | 17:30 |
clarkb | tonyb: this is why I suspected an ansible bug | 17:39 |
clarkb | basically nothing ansible is doing there makes any sense. They don't log the iteration value, they don't log any outputs, it's just a -13 error | 17:40 |
clarkb | my hunch is that we aren't loading the host vars for the other hosts properly due to some weird logic bug | 17:40 |
clarkb | tonyb: you might want to look at the periodic run from overnight to see if it managed to succeed | 17:41 |
clarkb | part of the issue is we don't know what host is actually causing the error because it doesn't log failed loop iterations only successful ones (this definitely seems like an ansible bug in my opinion) | 17:41 |
tonyb | I think in this case, specifically for the building of the certcheck list, I'm pretty confident that we do know, because earlier mirror01.iad failed with -13 (which means that it didn't run driver.sh on that node, and therefore also didn't create the hostvar), which caused the ultimate failure | 17:44 |
clarkb | oh I see so this is a variant of the case I ran into I think. Unless I had simply missed that there was an earlier failure | 17:45 |
tonyb | in run-production-playbook we do 'ansible-playbook ..... > log' | 17:45 |
clarkb | in the situation I ran into I thought the only error was in the iteration itself not previously but it is entirely possible I simply missed that | 17:45 |
clarkb | still ansible should log the iteration item | 17:45 |
tonyb | where would stderr from that command go? | 17:45 |
tonyb | Yes I agree totally. I think the change you restored is a good thing regardless | 17:46 |
clarkb | it should be in the job log | 17:46 |
clarkb | https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/log/job-output.txt#139 | 17:47 |
tonyb | Okay I was hoping to see more. | 17:47 |
clarkb | there is no stderr there or at https://zuul.opendev.org/t/openstack/build/0b1a2cad2c0842b6874b599efb955fcc/console#2/1/2/bridge01.opendev.org | 17:47 |
clarkb | we may need more -v's to get that info but probably also want to redirect stderr to stdout if we do that since that may expose lots of inner workings we don't want out in the clear | 17:48 |
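A minimal sketch of the redirection being suggested, assuming the invocation in run-production-playbook follows the 'ansible-playbook ..... > log' shape tonyb quoted; the playbook and log paths here are hypothetical.

```shell
# hypothetical paths; the point is only that 2>&1 keeps stderr in the same
# (non-public) log file instead of losing it entirely
ansible-playbook -v playbooks/service-letsencrypt.yaml \
  > /var/log/ansible/service-letsencrypt.yaml.log 2>&1
```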
clarkb | tonyb: and on the host do you have an acme.sh log? I guess the -13 implies that ansible broke and didn't run the shell at all though | 17:49 |
clarkb | since acme.sh should return 0 1 or 2 iirc | 17:49 |
tonyb | Hmm okay. | 17:50 |
tonyb | Correct acme.sh didn't run at that time. | 17:50 |
tonyb | it has since and worked just fine | 17:50 |
tonyb | https://github.com/ansible/ansible/issues/57323#issuecomment-498732320 is about the file module but seems to say that the shell module got a SIGPIPE before running on mirror01.iad | 17:51 |
clarkb | https://github.com/ansible/ansible/issues/81777 | 17:54 |
clarkb | I'm betting it is this issue | 17:56 |
tonyb | Yup | 17:56 |
tonyb | So I think this is something to watch for, but there isn't anything we can do right now. | 17:58 |
clarkb | agreed. I think the one thing we might be able to do is increase the 60s control persist timeout to extend the amount of time things can be idle? That doesn't seem like a great workaround though a minute is an awfully long time | 17:59 |
corvus | do i understand correctly that this bug would be triggered if the time between tasks on a given host is 60 seconds (ie, the control persist timeout). and that this does not happen when tasks are running? | 17:59 |
clarkb | corvus: that was my reading of it. basically if you run a task and then try to run a task exactly 60 seconds later the new task execution races with connection persistence in ssh | 18:00 |
clarkb | my hunch is that statistically the delta between task executions is most likely in the 0-5 second range with a long long tail that extends out beyond there | 18:01 |
clarkb | if we set the persistence timeout to 120s instead of 60s we're probably several orders of magnitude less likely to hit this race? | 18:02 |
corvus | yeah. agreed. so the question is whether it's worth it to bump that to, say, 120 for the nested ansible running on bridge. | 18:02 |
corvus | heh, we picked the same number. :) | 18:02 |
corvus | yeah, i sort of lean towards that since it's been characterized in our environment that 60s intervals are apparently common enough to be occasionally noticed. | 18:03 |
clarkb | I'd be fine with that. I don't know that I'd go much beyond say 5 minutes because we could end up with a lot of useless ssh processes that way. but 120s is probably safe | 18:03 |
corvus | (and to be clear, i think this is an okay band-aid for nested ansible on bridge but my feelings for the similar setting in zuul are very different) | 18:04 |
tonyb | I agree. | 18:04 |
tonyb | Yeah very different for zuul itself | 18:05 |
clarkb | I think 60s might be ansible's default? I don't see where we set that explicitly today | 18:08 |
fungi | i'm going to go grab lunch real quick, but available to help with gitea/gerrit in about an hour | 18:08 |
corvus | i don't see it being set either. but zuul sets it in ansible.cfg with ssh_args = -o ControlMaster=auto -o ControlPersist=60s | 18:11 |
clarkb | https://docs.ansible.com/ansible/latest/network/getting_started/network_connection_options.html seems to be the more ansible way? | 18:23 |
clarkb | unless that is just a regular timeout unrelated to ssh persistence | 18:24 |
clarkb | ya I think that is unrelated to control persistence, just a happy google collision | 18:25 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s https://review.opendev.org/c/opendev/system-config/+/902436 | 18:29 |
clarkb | I believe that should be self testing | 18:29 |
clarkb | and it is easy to revert (by hand if we have to) if it goes wrong | 18:29 |
corvus | clarkb: want to put the ansible bug in a code comment? that could be helpful for unwinding it | 18:31 |
clarkb | sure one sec | 18:31 |
opendevreview | Clark Boylan proposed opendev/system-config master: Increase bridge's ansible ssh controlpersist timeout to 120s https://review.opendev.org/c/opendev/system-config/+/902436 | 18:32 |
opendevreview | Merged opendev/system-config master: Rotate the new Gitea replication key into Gitea config https://review.opendev.org/c/opendev/system-config/+/902315 | 18:55 |
clarkb | I've got a tail -F going on /var/log/ansible/service-gitea.log so I should see when that adds the second key | 18:56 |
clarkb | I'm pretty sure the key was properly added to gitea09 based on the logs | 19:17 |
clarkb | we no_log a bunch of stuff so I'm going by "ok" and "skipping" output | 19:17 |
clarkb | assuming the job succeeds I think we can proceed with approving the gerrit side change | 19:18 |
clarkb | gitea14 also looks good. not sure we need to check all of them; if those two are fine the chance that something went wrong on another and the job still passed is low | 19:25 |
clarkb | I'll let someone else approve the gerrit change if they agree and think we are ready. I'm going to figure out breakfast/lunch things nowish | 19:25 |
fungi | i'm looking now | 19:36 |
fungi | back for the remainder of the day, so we can take this as slowly as we like | 19:36 |
tonyb | clarkb: 902436 looks good. holding off +W until the gitea/Gerrit key rotation is done | 19:42 |
*** dviroel__ is now known as dviroel | 19:46 | |
fungi | deploy succeeded about 30 minutes ago (19:23:31) | 19:52 |
fungi | for 902315 i mean | 19:53 |
clarkb | and looks like the gerrit update has been approved | 19:54 |
fungi | i went ahead and approved 902169, so once it deploys successfully we'll arrange a quick gerrit restart and make sure replication is still working | 19:54 |
tonyb | I was thinking of 902169 and the associated Gerrit restart | 19:55 |
fungi | also probably worth triggering a full re-replication once we're sure it is | 19:55 |
clarkb | ++ also I think we can temporarily move id_rsa aside when we restart gerrit to ensure the new key is being used. I think that is the easiest way. If it doesn't get used we'll need to move it back and restart gerrit again but that is better now than later I suspect | 19:55 |
fungi | since we keep saying we want to do that and then i keep forgetting to actually do it | 19:55 |
clarkb | as I'm not sure there is any logging on either side that would be clearly definitive about which key is used | 19:56 |
fungi | a very good point | 19:56 |
fungi | at least if gerrit no longer has access to the old key, we'll know | 19:56 |
tonyb | yeah. we could ssh -vv but that isn't guaranteed to be the same | 19:57 |
clarkb | the issue is gerrit uses mina ssh client | 19:58 |
clarkb | and its logging is in the depths of java logging somewhere. We probably can figure out a way to get debug logs out of it but I'm not confident even that would identify the specific key | 19:58 |
fungi | i blame java | 19:58 |
fungi | could probably even blame log4j | 19:59 |
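A hedged sketch of the "move id_rsa aside" check clarkb describes; the paths are hypothetical and the restart mechanism is whatever is normally used on review02. The idea is just that if replication still works while the old key is unavailable, the new key must be the one in use.

```shell
# hypothetical paths -- adapt to where the gerrit user's old replication
# key actually lives
mv /path/to/gerrit/.ssh/id_rsa /path/to/gerrit/.ssh/id_rsa.disabled
# restart gerrit the usual way, then watch replication; if it breaks,
# restore the key and restart again
mv /path/to/gerrit/.ssh/id_rsa.disabled /path/to/gerrit/.ssh/id_rsa
```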
fungi | looks like system-config-run-review-3.8 encountered some testinfra failures | 20:55 |
fungi | the thread plugin-manager-preloader exceptions in error_log are expected, right? | 20:56 |
Clark[m] | Yes | 20:56 |
fungi | oh, looks like maybe the sshd on the test node started closing connections | 20:57 |
fungi | "Connection closed by remote host\r\nConnection closed by 104.130.74.118 port 22" | 20:57 |
fungi | https://c495624553d21df6092e-b1e6fa2ff0876057a2da9deb6ccb589b.ssl.cf5.rackcdn.com/902169/3/gate/system-config-run-review-3.8/63d072e/bridge99.opendev.org/test-results.html?sort=result | 20:57 |
fungi | nothing related jumping out at me in syslog though | 20:59 |
Clark[m] | Could it be the arp fight we've seen? | 21:00 |
fungi | possibly, though it's just connection closed during key exchange rather than a hostkey mismatch | 21:03 |
Clark[m] | Maybe recheck and see if it persists? The ssh config I wrote out should only affect the gerrit user not zuul and testinfra | 21:06 |
Clark[m] | And even then it's scoped to gitea* which aren't in play for testinfra on review02 | 21:07 |
tonyb | yeah seems unlikely the change is responsible | 21:07 |
fungi | yeah, i agree it's more likely the rogue vm problem | 21:09 |
fungi | once zuul reports on the change in a sec i'll recheck it | 21:13 |
fungi | looks like the last job is wrapping up now | 21:13 |
fungi | and rechecked now | 21:20 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gerrit 3.9 image builds https://review.opendev.org/c/opendev/system-config/+/901468 | 22:02 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add gerrit 3.8 to 3.9 upgrade testing https://review.opendev.org/c/opendev/system-config/+/901469 | 22:02 |
clarkb | infra-root Gerrit 3.9.1 has been released to address the problems with the pulled 3.9.0 release. I expect the above changes will pass testing and they should be mergeable as is now | 22:03 |
clarkb | note that since this doesn't affect production (only our testing of future upgrades) we don't need any special monitoring either | 22:03 |
clarkb | looks like we are an hour away from the gerrit ssh config update landing | 22:55 |
clarkb | I'll still be around and can do a restart if others are | 22:55 |
tonyb | I won't be around at that point (not that I'm really around now) | 22:57 |
fungi | i'll be around for a while still, sure | 23:19 |
clarkb | fungi: it's even later where you are :) maybe we should just do it first thing next week | 23:28 |
clarkb | up to you :) | 23:29 |
clarkb | the gerrit 3.9.1 changes did pass so that's good | 23:34 |