mordred | corvus: gotcha. so there's just a weird error and it's not version tied. yay | 01:15 |
---|---|---|
openstackgerrit | Akihiro Motoki proposed openstack/project-config master: Define stable cores for horizon plugins in neutron stadium https://review.opendev.org/722682 | 01:15 |
openstackgerrit | Akihiro Motoki proposed openstack/project-config master: Define stable cores for horizon plugins in neutron stadium https://review.opendev.org/722682 | 01:24 |
*** DSpider has joined #opendev | 07:04 | |
frickler | infra-root: there is this failure from the service-nodepool playbook http://paste.openstack.org/show/792722/ which I assume is caused by https://review.opendev.org/721098 | 07:41 |
frickler | I'm going to try and remove the old dir and hope that that'll make the rsync work, but maybe we should avoid having symlinks in git dirs? or is there some better solution? | 07:42 |
frickler | I also don't like that the rsynced files are owned by 2031:2031, which is a userid that doesn't exist on the targets, do we have a plan to create the zuul user there? otherwise we'd likely rather make things owned by root IMO | 07:48 |
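For context, a hedged sketch of the rsync behavior under discussion, not the actual playbook task (paths and the target host are placeholders): `--force` lets rsync replace a non-empty directory on the destination when the source now has a non-directory, such as a symlink, at that path.

```shell
# hedged illustration only; paths/host are placeholders, not the real playbook
# --force: allow deleting a non-empty destination directory when the source
# has a non-directory (e.g. a symlink) at the same path
rsync -av --force project-config/ nb04.opendev.org:/etc/project-config/
```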
frickler | infra-root: we also seem to have never given the "all-clear" that was mentioned in the last status notice, from scrollback I'd assume we could do that now? | 07:54 |
*** roman_g has joined #opendev | 07:54 | |
frickler | next pulse seems to have run that playbook successfully | 08:13 |
frickler | now I see that I wouldn't have needed this, because the focal image has already been built successfully on nb04 | 08:20 |
frickler | well, "successfully", because it is giving me node-failures on https://review.opendev.org/704831 . will try to debug later, got to do some plumbing first | 08:21 |
*** roman_g has quit IRC | 08:36 | |
AJaeger | frickler: let's give the all-clear... | 08:43 |
AJaeger | #status notice Zuul is happy testing changes again, changes with MERGER_FAILURE can be rechecked. | 08:44 |
openstackstatus | AJaeger: sending notice | 08:44 |
-openstackstatus- NOTICE: Zuul is happy testing changes again, changes with MERGER_FAILURE can be rechecked. | 08:44 | |
openstackstatus | AJaeger: finished sending notice | 08:48 |
*** tosky has joined #opendev | 08:57 | |
*** elod has quit IRC | 09:26 | |
*** elod has joined #opendev | 09:26 | |
*** tbarron_ is now known as tbarron | 11:43 | |
frickler | humm, can't get nodes when we don't launch'em. patch upcoming | 11:53 |
openstackgerrit | Jens Harbott (frickler) proposed openstack/project-config master: Launch focal nodes https://review.opendev.org/723213 | 11:59 |
frickler | infra-root: ^^ that'd be the whole thing, or would we want to do some more testing or a slow start first? I have a devstack patch waiting that went fine on a local instance running the stock Ubuntu cloud image | 12:00 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: hlint: add haskell source code suggestions job https://review.opendev.org/722309 | 12:05 |
tristanC | clarkb: i haven't noticed a difference between git versions with regard to .git dir presence. It seems like the rule is that a valid gitdir requires the `.git/refs` directory to exist. | 12:06 |
tristanC | clarkb: the difference was how dangling refs were handled, and old git was not removing directories efficiently, thus there is a custom function in zuul that does extra cleanup, and this function was apparently too aggressive. | 12:10 |
tristanC | clarkb: the question is how to reproduce that behavior, because it only happened when the workdir didn't have any `heads` or `tags` | 12:12 |
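If it helps with reproduction, here's a rough sketch (not the zuul code) of getting a clone into a state where `.git/refs` still exists but holds no loose heads or tags, which is roughly the situation described above; the repo URL is just an example.

```shell
# rough sketch, assuming any small public repo
git clone https://opendev.org/opendev/sandbox.git repo && cd repo
git pack-refs --all            # loose refs are moved into .git/packed-refs
find .git/refs -type f         # likely prints nothing: heads/ and tags/ are empty
git rev-parse --git-dir        # still a valid git dir as long as .git/refs exists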
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Stop translation stable branches on some projects https://review.opendev.org/723217 | 12:29 |
openstackgerrit | Andreas Jaeger proposed openstack/project-config master: Stop translation stable branches on some projects https://review.opendev.org/723217 | 12:52 |
yoctozepto | https://bugs.launchpad.net/devstack/+bug/1875075 | 13:29 |
openstack | Launchpad bug 1875075 in devstack "devstack setup fails: Failed to connect to opendev.org port 443: Connection refused" [Undecided,Opinion] | 13:29 |
yoctozepto | seems gitea is not doing good today :-( | 13:30 |
fungi | which would be odd since connection refused would be coming from the load balancer, not from a gitea host | 13:43 |
fungi | right now i'm able to reach 443/tcp on the load balancer's ipv4 and ipv6 addresses | 13:45 |
fungi | is it intermittent? | 13:45 |
fungi | trending on elastic-recheck? | 13:45 |
mordred | fungi, frickler: I have a fix lined up for the nodepool playbook issue - but we were fighting fires yesterday so I didn't want to do it | 13:46 |
mordred | I'll go ahead this morning | 13:46 |
yoctozepto | fungi: intermittent, as per bug report | 13:46 |
yoctozepto | not blaming backend, "gitea" = "service thereof", as seen by the great public | 13:47 |
yoctozepto | us, humble bread eaters | 13:47 |
fungi | tcp rst (connection refused) would have to be coming from the haproxy server directly, since it's configured to act as a tcp socket proxy, so it wouldn't route a connection refused from the gitea backends | 13:55 |
fungi | though it could be that at times none of the 8 backends are reachable and the pool is empty | 13:56 |
fungi | it's possible clearing all the git repo caches on our zuul mergers and executors (20 in all) has inflicted a partial ddos against our git servers as builds cause them to clone one repository or another in clusters | 14:02 |
corvus | fungi: zuul doesn't know about gitea | 14:02 |
mordred | fungi, frickler: puppet issue with set-hostname fixed | 14:03 |
fungi | corvus: oh so those are cloning from gerrit anyway | 14:04 |
fungi | yeah, seemed like a stretch regardless | 14:04 |
corvus | ya | 14:04 |
fungi | and the cacti graphs so far aren't showing anything out of the ordinary for the haproxy or gitea servers | 14:04 |
mordred | frickler: (your solution was the right one - it was an unfortunate issue of replacing a dir with a symlink) - also - I agree, 2031 is ugly, how about we update that to push them onto the remote hosts as root | 14:06 |
yoctozepto | I started getting it today, the reporter probably as well | 14:07 |
yoctozepto | it's very rare, but happens on clone/pull | 14:08 |
yoctozepto | it never happened from browser but it's probably because I only visited it after it failed on git :-) | 14:08 |
yoctozepto | I thought it was my connection, but this report made me share the problem with you | 14:09 |
fungi | i'm being told i have to get off the computer and do some chores, but i can try cloning from all the backends later just to see if i can get it to reproduce any errors | 14:10 |
fungi | this is a snapshot of the current pool state: http://paste.openstack.org/show/792729/ | 14:10 |
mordred | morning corvus ! interesting error from zuul smart-reconfigure: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_ae8/723107/3/check/system-config-run-zuul/ae8d99a/bridge.openstack.org/ara-report/result/ba99d303-7226-4f15-a341-c10935a45c80/ | 14:19 |
mordred | I wonder - docker-compose exec by default allocates a tty - since we're running this in ansible, do we need to do docker-compose exec -T to disable allocating a tty? | 14:21 |
corvus | mordred: i dunno -- maybe try it out with "ansible" on bridge? | 14:21 |
mordred | corvus: safe to run a smart-reconfigure now right? | 14:21 |
corvus | mordred: yeah, i think it should be safe any time | 14:21 |
mordred | kk | 14:22 |
mordred | should we add -f to smart-reconfigure to run it in the foreground and get output? or will that matter for this? | 14:22 |
corvus | mordred: it's an async command either way, so won't matter | 14:22 |
mordred | corvus: yup - the tty error was from docker-compose - -T fixed it | 14:24 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 14:25 |
mordred | corvus: also - yay for testing actually catching that :) | 14:25 |
corvus | \o/ | 14:26 |
mordred | corvus: -T is consistent with how we're running docker-compose exec elsewhere in ansible fwiw | 14:26 |
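For reference, a minimal sketch of the fix being described; the compose service name ("scheduler") is an assumption here, not necessarily what the playbook actually uses.

```shell
# -T disables pseudo-tty allocation, which is what fails when docker-compose
# exec is invoked from ansible rather than from an interactive shell
docker-compose exec -T scheduler zuul-scheduler smart-reconfigure
```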
openstackgerrit | Merged openstack/project-config master: Launch focal nodes https://review.opendev.org/723213 | 14:26 |
mordred | corvus: oh, I also fixed your comment from https://review.opendev.org/#/c/723105/ | 14:27 |
mordred | corvus: re: stop playbook - are systemctl stop and docker-compose down not blocking? I would have thought that each of them would be responsible for not returning until the thing was stopped :( | 14:29 |
corvus | mordred: i imagine docker-compose down is, but systemctl stop is not, which is why the zuul_restart playbook waits for the service to stop | 14:31 |
mordred | corvus: nod | 14:31 |
mordred | corvus: maybe instead of adding those new playbooks we should add tags to zuul_stop, zuul_start and zuul_restart. or I guess we could actually just use those with limit | 14:32 |
mordred | now that I think about it - those with limit actually sounds like the most flexible thing | 14:33 |
mordred | hrm. except for scheduler and web. | 14:35 |
*** sgw has joined #opendev | 14:38 | |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker https://review.opendev.org/723048 | 14:49 |
mordred | corvus: maybe something like that ^^ is better | 14:50 |
mordred | corvus: oh - you're going to _love_ the latest failure on building multi-arch images | 15:00 |
mordred | corvus: it looks like it's ignoring /etc/hosts and doing a dns lookup directly: https://zuul.opendev.org/t/zuul/build/a487b41ba7334baca1af0b67ea21f04f/console#2/5/21/builder | 15:01 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 15:01 |
mordred | I'm gonna grab /etc/hosts just to double check - but the "add buildset registry to /etc/hosts" task ran, so I'm pretty confident it's there | 15:02 |
mordred | OH! maybe since there is a "builder" container, the push is actually happening inside of that container - yup: https://github.com/docker/buildx/issues/218 | 15:05 |
corvus | mordred: what do you think that last comment means? | 15:07 |
corvus | did they run a dnsmasq service and somehow configure the buildkit container to use that as its resolver? | 15:08 |
mordred | yeah - I think that's likely what they did | 15:10 |
mordred | corvus: also - note the thing above where the person adds registry mirror config when creating the builder container | 15:10 |
mordred | I think we probably need to do that so that the builders mirror config is right | 15:11 |
corvus | yeah :/ | 15:11 |
corvus | so basically, we get to start over from scratch for buildx | 15:11 |
mordred | but lacking support for /etc/hosts ... | 15:11 |
mordred | yeah | 15:11 |
mordred | corvus: I wonder ... the builder container is a running container ... we could exec in to it and edit its /etc/hosts | 15:12 |
corvus | mordred: between the create and build steps? | 15:13 |
corvus | i wonder if there's a way to bind-mount the files in? though if there were, i would have expected that to come up on that issue | 15:13 |
mordred | corvus: yeah - I just exec'd into a sh in my test mybuilder container | 15:14 |
mordred | and /etc/hosts and resolv.conf are both there as expected | 15:15 |
mordred | corvus: https://github.com/moby/buildkit/blob/master/docs/buildkitd.toml.md is the file that one can pass with the --config option to create | 15:22 |
mordred | corvus: and I have confirmed that passing a file to that on the create line will cause the config to show up in /etc/buildkit in the worker container | 15:24 |
mordred | so - I think we need to make sure we can create a buildkitd.toml file with the registry mirror info ... that might not be buildx-specific - it looks like buildkit understands multi-repo like buildah does, so it's possible we could achieve multi-repo with docker if we enable buildkit (waving hands) | 15:25 |
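Something along these lines is presumably what's needed: a buildkitd.toml carrying the registry mirror, handed to the builder at creation time. The mirror hostname and builder name below are made up for illustration.

```shell
# write a buildkitd config with a registry mirror (hostname is an example)
cat > buildkitd.toml <<'EOF'
[registry."docker.io"]
  mirrors = ["mirror.example.org:8082"]
EOF

# pass it to the builder when it is created
docker buildx create --name mybuilder --config buildkitd.toml --use
```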
corvus | mordred: so we might be able to use the containers-style registry config for that | 15:25 |
mordred | yeah | 15:25 |
corvus | that'd be a good reason to do this for everything | 15:25 |
mordred | yeah | 15:25 |
mordred | although we can use buildkit even without buildx - I don't know if buildkitd.toml works with normal docker when buildkit is enabled - but I'd imagine so? | 15:26 |
mordred | then we still need to edit /etc/hosts -but we might even be able to just do a docker copy ... let me try that real quick | 15:26 |
mordred | weirdly the /etc/hosts in the container has "172.17.0.3 239c1ae2f24b" ... so maybe that's for the container referencing itself by id | 15:27 |
mordred | so we might need to edit the file anyway | 15:27 |
corvus | mordred: i wonder: do we need to edit /etc/hosts if we can define the registry mirrors? | 15:28 |
mordred | corvus: I don't know | 15:28 |
corvus | is it possible we can specify those by ip address? (and, of course, we'd need to see if ipv6 works) | 15:28 |
mordred | corvus: however - we can do docker cp {container id}:/etc/hosts hosts | 15:29 |
mordred | then edit hosts | 15:29 |
mordred | then docker cp the file back | 15:29 |
mordred | so we could re-use our ansible lineinfile we already have | 15:29 |
mordred | corvus: I'm going to try just starting with the /etc/hosts editing like that and see if it at least pushes to the buildset registry properly | 15:32 |
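A quick sketch of the docker cp round-trip described above; the builder container name and the buildset-registry address are hypothetical.

```shell
# copy the builder container's /etc/hosts out, append the registry entry,
# and copy it back (container name and address are examples only)
docker cp buildx_buildkit_mybuilder0:/etc/hosts hosts
echo "203.0.113.10 buildset-registry.example.org" >> hosts
docker cp hosts buildx_buildkit_mybuilder0:/etc/hosts
```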
clarkb | is there anyone looking at fixing zuul.openstack.org's ssl cert? | 15:32 |
mordred | clarkb: what needs to be fixed? it should be part of the subject altname for zuul.opendev.org | 15:32 |
clarkb | mordred: it doesn't appear to be. My phone says it's a cert for opendev.org and complains | 15:32 |
mordred | clarkb: https://opendev.org/opendev/system-config/raw/branch/master/playbooks/host_vars/zuul01.openstack.org | 15:33 |
mordred | ah - you know what - I bet we added it after we got the initial cert | 15:33 |
mordred | clarkb: so we might need to remove the letsencrypt files so that we re-request for both | 15:33 |
corvus | what caused it to change just now? | 15:33 |
clarkb | also re gitea connection resets: I think that can happen when we restart containers | 15:33 |
mordred | I forget which files ianw figured out it should be | 15:33 |
clarkb | fungi: yoctozepto basically when mariadb or our gitea images update | 15:34 |
corvus | cause it looks like we're right in the middle of cert validity | 15:34 |
clarkb | mordred: corvus they were two separate certs before | 15:34 |
corvus | before what? | 15:34 |
clarkb | corvus: guessing ansible zuul combines them into one LE cert | 15:34 |
clarkb | corvus: before the ansibling | 15:35 |
mordred | yeah - it should be all served from the le opendev cert now | 15:35 |
clarkb | corvus: so on thursday with puppet it was two separate certs | 15:35 |
corvus | oh, this is a new cert, which we requested in the middle of march, in preparation for switching to ansible, and we started using it yesterday? | 15:35 |
corvus | or friday | 15:35 |
mordred | yeah | 15:35 |
corvus | ok | 15:35 |
corvus | i agree, sounds like asking le for a new cert is the way to go | 15:36 |
corvus | not before: 3/9/2020, 1:22:50 PM (Pacific Daylight Time) | 15:36 |
clarkb | mordred: I think something in a .dir or .file in root's homedir | 15:36 |
mordred | ianw figured out which files need to be blanked out to force that | 15:36 |
mordred | yeah | 15:36 |
mordred | I think it got documented? | 15:36 |
clarkb | that it uses to know when to refresh the cert | 15:36 |
corvus | f0b77485ec (Monty Taylor 2020-04-05 09:25:28 -0500 5) - zuul.openstack.org | 15:36 |
corvus | mordred: i think those timestamps confirm your theory | 15:37 |
mordred | corvus: ++ | 15:37 |
mordred | Refreshing keys section in letsencrypt docs | 15:37 |
corvus | https://docs.openstack.org/infra/system-config/letsencrypt.html#refreshing-keys | 15:37 |
mordred | yup | 15:37 |
mordred | clarkb: for the gitea thing - should we update the gitea playbook to remove a gitea backend from the lb when we're ansibling it? or I guess since we're doing tcp lb that wouldn't be any better would it? | 15:39 |
clarkb | mordred: we'd need to remove it from haproxy, wait for connections to stop, then do docker then add it back | 15:40 |
clarkb | mordred: maybe give it a 120 second timeout on the wait for connections to drop | 15:40 |
mordred | clarkb: seems like a thing we'd only really want to do if the image had updated - and we don't really have a great way to track whether compose is going to restart it or not atm | 15:41 |
mordred | I mean - it would always be safe - but it would make the playbook take a long time to run | 15:41 |
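If we did go that route, the shape of it would presumably be something like the below, driven via the haproxy admin socket; the socket path and backend/server names are guesses, and the 120-second cap is clarkb's suggestion above.

```shell
# take one backend out of rotation, let its sessions drain, update gitea,
# then re-enable it; names/paths here are assumptions, not the real config
echo "disable server balance_git_https/gitea01.opendev.org" | socat stdio /var/haproxy/run/stats
sleep 120   # crude stand-in for "wait for connections to drop"
# ... update/restart the gitea containers on gitea01 here ...
echo "enable server balance_git_https/gitea01.opendev.org" | socat stdio /var/haproxy/run/stats
```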
mordred | clarkb, corvus : are any of us doing that mv on zuul01? I can if y'all aren't already | 15:42 |
fungi | clarkb: in this case it sounded like more than a momentary blip. have we been restarting the gitea and/or mariadb containers a bunch in the last 24 hours? | 15:43 |
mordred | nope | 15:43 |
mordred | we haven't pushed a new gitea recently | 15:43 |
clarkb | mariadb updates semi often | 15:43 |
mordred | I have renamed the conf files for zuul01 | 15:44 |
mordred | next ansible should fix it | 15:44 |
clarkb | mordred: I'm still on a phone | 15:44 |
clarkb | mordred: we actually do know when compose is going to stop and start stuff | 15:44 |
fungi | it sounded more like when someone is running devstack locally (such that it clones a bunch of openstack repos fresh), some small percentage of those encounter a connection refused for opendev.org (reported independently by yoctozepto and also the lp bug against devstack) | 15:44 |
clarkb | that got added in the safe shutdown ordering | 15:44 |
fungi | and supposedly just in the past day | 15:45 |
fungi | entirely possible this is a problem with some backbone peering point shared by both reporters, i suppose | 15:46 |
mordred | fungi: is it worth putting in a clone retry in devstack for when people are running it locally? I mean - network failures happen | 15:46 |
fungi | (wherein something is sending a tcp/rst on behalf of opendev.org in response to the client's tcp/syn i suppose, otherwise it would show up as a connection reset by peer or a timeout or whatever) | 15:47 |
fungi | mordred: possible that's something the devstack maintainers want to consider | 15:47 |
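Not devstack's actual code, but the kind of retry wrapper being suggested would be roughly this; the repo URL is just an example.

```shell
# retry a clone a few times before giving up on a transient network failure
for attempt in 1 2 3; do
    git clone https://opendev.org/openstack/bifrost.git && break
    echo "clone failed (attempt $attempt), retrying..." >&2
    sleep 10
done
```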
clarkb | it could also be OOMs again | 15:47 |
mordred | clarkb: good point | 15:47 |
fungi | anyway, i'm not immediately seeing any issues, no oom either | 15:50 |
fungi | most recent oom was gitea08 on 2020-04-07 | 15:50 |
fungi | and shortest vm uptime is 66 days | 15:50 |
clarkb | then I'm out of ideas :) | 15:51 |
fungi | gitea web processes all last restarted sometime yesterday utc | 15:52 |
fungi | according to ps | 15:52 |
clarkb | fungi: probably due to mariadb update | 15:53 |
fungi | mysqld on all of them was also last restarted sometime yesterday, yes | 15:53 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 15:53 |
mordred | corvus: ^^ there's a stab at fixing /etc/hosts | 15:53 |
fungi | so doesn't appear to be virtual machines rebooting, containers restarting, out of memory condition... cacti graphs look typical for everything, haproxy pool seems normal at the moment | 15:54 |
fungi | only other thing i can think to try is load-testing them in an attempt to reproduce the issue and then try to match up connections | 15:54 |
openstackgerrit | Monty Taylor proposed zuul/zuul-jobs master: Support multi-arch image builds with docker buildx https://review.opendev.org/722339 | 15:57 |
mordred | corvus: please enjoy the difference between the last 2 patchsets | 15:57 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Run smart-reconfigure instead of HUP https://review.opendev.org/723107 | 16:01 |
openstackgerrit | Monty Taylor proposed opendev/system-config master: Rework zuul start/stop/restart playbooks for docker https://review.opendev.org/723048 | 16:18 |
*** roman_g has joined #opendev | 16:23 | |
frickler | cool, devstack on focal runs fine for a bit, then it crashes mysql8, can reproduce locally https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_1b1/704831/5/check/devstack-platform-focal/1b12f0b/controller/logs/mysql/error_log.txt | 16:25 |
frickler | but our image seems to work fine \o/ | 16:25 |
corvus | mordred: wow :) | 16:28 |
mordred | frickler: neat | 16:31 |
*** roman_g has quit IRC | 16:44 | |
clarkb | it's almost like it crashes just by starting | 16:44 |
clarkb | mordred: how often do we run the LE playbook? is it part of hourly? | 16:46 |
clarkb | (just wondering when I should check the cert for zuul again. Maybe after yard work?) | 16:46 |
clarkb | fungi: fwiw there was an OOM on lists yesterday but not one today | 16:47 |
clarkb | fungi: our large qrunner process for the openstack list is the largest process by memory use during that period, but then it shrinks to 1/10th its original size (maybe smaller) and listinfo processes (which are run by the web ui) end up taking over, however they are relatively small in size | 16:54 |
clarkb | fungi: that makes me think it is not a qrunner after all, because you'd expect it to grow, or to see the large process change to a different qrunner (the one that is growing), if that were the case | 16:54 |
clarkb | fungi: during the period before and after the OOM we are getting crawled by a SEMrush bot | 17:00 |
clarkb | that's basically the only web traffic (and there isn't a ton of it, but maybe mailman doesn't like it?) | 17:00 |
clarkb | anyway back to my weekend now. I think the good news is that smtp traffic doesn't seem to be a factor or we'd expect the qrunners to grow | 17:00 |
fungi | i've seen apache eating large amounts of memory, but that may just be file buffers | 17:03 |
fungi | also could be shared memory | 17:03 |
fungi | hard to really tell looking at the oom reports | 17:04 |
*** dpawlik has joined #opendev | 18:45 | |
fdegir | we've been having clone issues with opendev since yesterday and i thought it was a connection issue on our side, but seeing https://bugs.launchpad.net/devstack/+bug/1875075 made me think it is not | 18:58 |
openstack | Launchpad bug 1875075 in devstack "devstack setup fails: Failed to connect to opendev.org port 443: Connection refused" [Undecided,Opinion] | 18:58 |
fdegir | it is still happening randomly - it works for some repos and doesn't work for others | 19:02 |
AJaeger | fdegir: for which repos does it fail for you? | 19:10 |
AJaeger | fdegir: I think it's one of our git hosts | 19:12 |
fdegir | AJaeger: here are the ones we are frequently having issues with cloning | 19:14 |
AJaeger | infra-root, I tried cloning manually from the gitea hosts, and the response time really varied. Normally: 1 to 2 seconds, gitea05 now 17s, retry only 1.7 s | 19:14 |
fdegir | https://opendev.org/openstack/diskimage-builder | 19:14 |
fdegir | https://opendev.org/openstack/sushy | 19:15 |
fdegir | https://opendev.org/x/ironic-staging-drivers | 19:15 |
fdegir | https://opendev.org/openstack/python-ironic-inspector-client.git | 19:16 |
AJaeger | fdegir: I'm trying diskimage-builder now on all gitea hosts... | 19:16 |
AJaeger | fdegir: is that reproducible? Meaning, does a second clone work? | 19:16 |
fdegir | https://opendev.org/openstack/bifrost.git | 19:16 |
fdegir | AJaeger: it is not always the same repos across the runs | 19:16 |
AJaeger | fdegir: thanks | 19:17 |
fdegir | AJaeger: it works for bifrost and fails on dib, and the next time it fails on bifrost and works for dib | 19:17 |
AJaeger | I was just able to clone dib from all 8 hosts directly... | 19:17 |
AJaeger | so, could be a network or load-balancer problem | 19:17 |
fdegir | AJaeger: if you want the list to try, you can use the repos cloned by bifrost as an example, as we see the failures while bifrost clones the repos | 19:17 |
fdegir | https://opendev.org/openstack/bifrost/src/branch/master/playbooks/roles/bifrost-prep-for-install/defaults/main.yml | 19:18 |
AJaeger | fdegir: I hope one of the admins can look at it later, I was just trying the obvious things | 19:18 |
fdegir | AJaeger: yep, thanks | 19:18 |
fdegir | the repos listed in bifrost defaults could be a good set of repos to try, as we haven't been able to install bifrost since yesterday | 19:19 |
AJaeger | fdegir: is that failing on your system - or in Zuul? | 19:21 |
openstackgerrit | Andreas Jaeger proposed opendev/system-config master: Remove git0*.openstack.org https://review.opendev.org/723251 | 19:24 |
AJaeger | fdegir: locally I had no problems cloning dib, hope somebody has a better idea on how to debug | 19:26 |
fdegir | AJaeger: yes, our Jenkins jobs are failing | 19:29 |
AJaeger | fdegir: Ok. Sorry, I can't help further - and don't know when an admin will be around, you might need to wait a bit longer. | 19:34 |
fungi | fdegir: AJaeger: yeah, i looked into it earlier, no sign of any systems in distress. can you confirm the error you get is always "connection refused"? | 19:34 |
AJaeger | fungi: no problems on my side, just twice a response that took 10 times longer... | 19:35 |
fungi | opendev.org is a haproxy load balancer in socket proxying mode, not routing, so it would be odd for anything besides the load balancer to be what's emitting the tcp/rst which would indicate connection refused | 19:35 |
fungi | i also checked and none of those systems (frontend or backends) rebooted or had containers downed/upped today | 19:36 |
fungi | and the cacti graphs for all of them don't show any particularly anomalous behavior | 19:36 |
clarkb | and gitea was upgraded thursday but this sounds more recent? | 19:38 |
fungi | i'd be interested to see where all the clients getting the connection refused behavior are coming from (seems like all the reports might be from europe so far? the servers are all in california i think) | 19:38 |
clarkb | fungi: ya servers are all in california | 19:38 |
clarkb | and ya some source IPs would help us debug in logs | 19:38 |
fungi | wondering if it might not be our systems, but something along the path emitting tcp/rst on behalf of the server addresses | 19:39 |
fdegir | clone operations time out | 19:39 |
fdegir | fatal: unable to access 'https://opendev.org/openstack/bifrost.git/': Failed to connect to opendev.org port 443: Connection timed out | 19:40 |
fdegir | and yes, im in europe | 19:41 |
clarkb | oh so it fails to tcp at all | 19:41 |
clarkb | ya might try forcing ipv4 then ipv6 to see if you get different behavior? | 19:41 |
clarkb | we have had ipv6 routing oddities in that cloud before... | 19:42 |
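For anyone wanting to test that from the reporting side, git can pin the address family directly; the repo is just an example.

```shell
git clone -4 https://opendev.org/openstack/bifrost.git   # force IPv4
git clone -6 https://opendev.org/openstack/bifrost.git   # force IPv6
```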
fungi | though i did test both v4 and v6 from home earlier and was able to connect over both | 19:54 |
fungi | but yeah, connection timed out is not connection refused, so we have additional behaviors to investigate | 19:54 |
fungi | and then there's AJaeger saying he's getting very slow transfer intermittently | 19:55 |
*** dpawlik has quit IRC | 19:56 | |
fungi | so that's three different network misbehavior patterns which usually indicate different sorts of problems, but could also all be related to some general network connectivity problem (e.g., overloaded backbone peering point getting preferred in bgp) | 19:57 |
fungi | after i'm finished making dinner i'll see if i can reproduce any of these issues from systems connected through different carriers | 19:59 |
*** dpawlik has joined #opendev | 20:09 | |
AJaeger | I just updated all my repos (full checkout of non-retired opendev repos) - no problems at all ... | 20:10 |
fungi | also come to think of it, we're still doing source-based hash for load distribution, so all connections from the same client should get persisted to the same backend unless it gets taken out of rotation due to an outage | 20:12 |
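That is, roughly this shape of config: TCP-mode proxying with source-hash persistence. Server names, addresses, and ports below are illustrative, not the real opendev configuration.

```shell
# write a hedged illustration of the balancing mode described above to a
# scratch file; names, addresses and ports are examples only
cat >> haproxy.cfg.example <<'EOF'
listen balance_git_https
    bind :::443 v4v6
    mode tcp
    balance source
    server gitea01.opendev.org 203.0.113.11:3081 check
    server gitea02.opendev.org 203.0.113.12:3081 check
EOF
```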
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: stack-test: add haskell tool stack test https://review.opendev.org/723263 | 20:19 |
*** dpawlik has quit IRC | 20:31 | |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: stack-test: add haskell tool stack test https://review.opendev.org/723263 | 20:50 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: stack-test: add haskell tool stack test https://review.opendev.org/723263 | 20:59 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: stack-test: add haskell tool stack test https://review.opendev.org/723263 | 21:14 |
openstackgerrit | Tristan Cacqueray proposed zuul/zuul-jobs master: stack-test: add haskell tool stack test https://review.opendev.org/723263 | 21:21 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: Add sibling container builds to experimental queue https://review.opendev.org/723281 | 22:53 |
openstackgerrit | Ian Wienand proposed openstack/diskimage-builder master: use stage3 instead of stage4 for gentoo builds https://review.opendev.org/717177 | 22:53 |
ianw | https://zuul.openstack.org/ is throwing me a security error | 22:56 |
fungi | ianw: i think we're pending a cert re-replacement on that redirect site | 23:00 |
ianw | fungi: pending as in just wait, or pending as in someone has to do something? | 23:02 |
fungi | as in it sounded like mordred thought ansible was going to overwrite it with the proper cert again | 23:02 |
fungi | note that zuul.opendev.org is the canonical url at this point | 23:02 |
ianw | right, but going through status.openstack.org gets you the security error though | 23:03 |
openstackgerrit | Matthew Thode proposed openstack/diskimage-builder master: use stage3 instead of stage4 for gentoo builds https://review.opendev.org/717177 | 23:04 |
ianw | ok, it looks like that was in scrollback about 6 hours ago, i'll investigate what's going on with the cert | 23:06 |
*** gouthamr has quit IRC | 23:11 | |
*** gouthamr has joined #opendev | 23:13 | |
ianw | f0b77485ec (Monty Taylor 2020-04-05 09:25:28 -0500 5) - zuul.openstack.org | 23:16 |
ianw | it really seems like zuul isn't matching in service-letsencrypt.yaml.log logs ... odd | 23:18 |
ianw | right, zuul is in the emergency file ... so that explains that ... although it's in there with no comment | 23:26 |
clarkb | ianw I think it's in there due to the reload of tenant config needing a fix | 23:26 |
clarkb | the current setup kills gearman when that happens | 23:27 |
ianw | ok ... i'm also not seeing host _acme-challenge.zuul.openstack.org | 23:28 |
ianw | that was probably why it hasn't been working since the initial add, although now it's not refreshing due to the emergency host situation | 23:33 |
ianw | #status log added _acme-challenge.zuul.openstack.org CNAME to acme.opendev.org | 23:33 |
openstackstatus | ianw: finished logging | 23:33 |
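A quick way to confirm the record landed; the expected answer is taken from the status log above.

```shell
dig +short _acme-challenge.zuul.openstack.org CNAME
# expect: acme.opendev.org.
```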
ianw | the idea of taking zuul/gearman/? down when i'm not really sure what's going on isn't terribly appealing at 9am on a monday :) | 23:34 |
openstackgerrit | Ian Wienand proposed opendev/system-config master: status.openstack.org: send zuul link to opendev zuul https://review.opendev.org/723282 | 23:43 |