fungi | python 3.10.1 yesterday, 3.11.0a3 today | 00:00 |
fungi | seems like i'm always compiling a new python | 00:00 |
clarkb | Related to new releases this is a fun one. Gerrit's 3.5.0.1 release broke a bunch of plugins because they pulled out elasticsearch support (since it's no longer open source) and the elasticsearch support was pulling in a dep that a number of plugins relied on, which isn't there anymore | 00:01 |
fungi | oh, yeah, transitive deps silently satisfying direct deps is a major risk. it's bitten openstack projects before as well | 00:02 |
corvus | ianw: interested in reviewing https://review.opendev.org/820954 ? it's the other half of a change you +2d | 00:05 |
corvus | re keycloak | 00:05 |
ianw | lgtm | 00:06 |
ianw | sorry i think i meant to +2 that when i looked at the other bit | 00:06 |
corvus | \o/ thx | 00:13 |
opendevreview | Ade Lee proposed zuul/zuul-jobs master: DNM enable_fips role for zuul jobs https://review.opendev.org/c/zuul/zuul-jobs/+/807031 | 00:29 |
opendevreview | Merged opendev/system-config master: Add keycloak auth config to Zuul https://review.opendev.org/c/opendev/system-config/+/820954 | 00:51 |
fungi | yay! https://zuul.opendev.org/t/openstack/build/20fccb043c35459194b1094b28586055/log/lists.openstack.org/exim4/mainlog#47 | 01:13 |
fungi | mailman tried to notify me, exim got the notification and attempted delivery, then got its outbound smtp socket reset | 01:14 |
clarkb | successful failure | 01:14 |
clarkb | the best kind of failure | 01:14 |
fungi | i'll integrate the firewall fix, though the question remains whether we should start the mailman services in testinfra | 01:15 |
clarkb | if it isn't necessary to test this properly I don't know that we need to. Though not starting them probably covered up that python path issue | 01:15 |
fungi | yes | 01:16 |
clarkb | it shouldn't hurt to start them if we've blocked smtp outbound. People can send mail in if they really like and it won't go anywhere | 01:16 |
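The outbound block being referred to is, in rough shape, an egress rule like the one below. This is only an illustrative sketch; the actual change (820900) applies the equivalent through the deployment's firewall configuration for test nodes rather than an ad hoc command.

    # rough sketch of rejecting outbound SMTP from a test node; not the literal rule 820900 adds
    iptables -A OUTPUT -p tcp --dport 25 -j REJECT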
fungi | well, sort of covered it up, actually the initscript does an exit 0 when python isn't found so systemd wouldn't have known the difference | 01:16 |
clarkb | ah | 01:17 |
clarkb | another successful failure :) | 01:17 |
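The init-script behaviour fungi describes is the common Debian guard pattern, roughly the following (a sketch of the pattern, not a verbatim copy of the mailman script):

    # if the interpreter is missing, exit 0, so systemd still reports success
    PYTHON=/usr/bin/python
    [ -x "$PYTHON" ] || exit 0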
opendevreview | Jeremy Stanley proposed opendev/system-config master: Block outbound SMTP connections from test jobs https://review.opendev.org/c/opendev/system-config/+/820900 | 02:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Copy Exim logs in system-config-run jobs https://review.opendev.org/c/opendev/system-config/+/820899 | 02:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Collect mailman logs in deployment testing https://review.opendev.org/c/opendev/system-config/+/821112 | 02:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Make sure /usr/bin/python is present for mailman https://review.opendev.org/c/opendev/system-config/+/821095 | 02:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 02:05 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 02:05 |
*** rlandy|ruck|bbl is now known as rlandy|ruck | 02:19 | |
*** rlandy|ruck is now known as rlandy|out | 02:23 | |
ianw | i'm finding it quite hard to get the zuul-client docker image to generate a secret | 02:43 |
ianw | --infile doesn't help | 02:44 |
ianw | so far i haven't figured out how to pipe input into it either | 02:47 |
ianw | ok, running with "-i", but not "-t", makes "cat file | docker run ... zuul-client encrypt ..." work | 02:49 |
*** bhagyashris_ is now known as bhagyashris | 03:02 | |
Clark[m] | ianw fwiw I think there is a python script in the tools dir of zuul to do it as well | 03:07 |
Clark[m] | You don't need auth for it as it grabs a pubkey to do the encryption | 03:07 |
ianw | yeah, that is now giving a deprecation warning | 03:07 |
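For reference, a sketch of the invocation ianw landed on: run the container with -i but without -t so the piped plaintext reaches the encrypt command. The image name, tenant, and project values here are illustrative assumptions, not taken from the log.

    # assumes the zuul/zuul-client image and example tenant/project values
    cat secret-plaintext | docker run --rm -i zuul/zuul-client \
        --zuul-url https://zuul.opendev.org encrypt \
        --tenant openstack --project opendev/system-config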
*** pojadhav|out is now known as pojadhav|rover | 03:18 | |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 03:50 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 03:50 |
fungi | okay, i think topic:mailman-lists is ready to go, finally | 05:00 |
opendevreview | Ian Wienand proposed opendev/system-config master: infra-prod: write a secret to the bastion host https://review.opendev.org/c/opendev/system-config/+/821155 | 05:25 |
*** marios is now known as marios|ruck | 06:12 | |
*** gibi_ is now known as gibi | 07:52 | |
*** ysandeep is now known as ysandeep|lunch | 08:08 | |
*** ysandeep|lunch is now known as ysandeep | 08:38 | |
opendevreview | Merged openstack/project-config master: Add NVidia vGPU plugin charm to OpenStack charms https://review.opendev.org/c/openstack/project-config/+/819818 | 09:02 |
*** pojadhav|rover is now known as pojadhav|lunch | 09:07 | |
*** pojadhav|lunch is now known as pojadhav|rover | 10:03 | |
*** ysandeep is now known as ysandeep|afk | 10:21 | |
*** redrobot6 is now known as redrobot | 10:23 | |
*** jpena|off is now known as jpena | 10:35 | |
*** ysandeep|afk is now known as ysandeep | 10:56 | |
*** rlandy|out is now known as rlandy|ruck | 11:10 | |
*** pojadhav|rover is now known as pojadhav|rover|brb | 11:42 | |
*** pojadhav|rover|brb is now known as pojadhav|rover | 11:51 | |
*** pojadhav|rover is now known as pojadhav|rover|brb | 12:02 | |
*** pojadhav|rover|brb is now known as pojadhav|rover | 12:22 | |
*** ykarel is now known as ykarel|away | 13:21 | |
*** pojadhav|rover is now known as pojadhav|rover|brb | 14:18 | |
*** pojadhav|rover|brb is now known as pojadhav|rover | 15:04 | |
slittle1_ | having intermittent issues with 'git review -s' | 15:37 |
slittle1_ | trying to run a script that sets up the gerrit remote on all starlingx repos | 15:38 |
slittle1_ | seems like every second or third try hangs | 15:39 |
slittle1_ | I'm working around it with a 'timeout' and a retry | 15:39 |
slittle1_ | cat .gitreview | 15:41 |
slittle1_ | [gerrit] | 15:41 |
slittle1_ | host=review.opendev.org | 15:41 |
slittle1_ | port=29418 | 15:41 |
slittle1_ | project=starlingx/distcloud-client.git | 15:41 |
slittle1_ | defaultbranch=master | 15:41 |
slittle1_ | as an example | 15:41 |
corvus | i'm going to restart zuul-web with the new auth config; expect a several-minute outage (of web only; schedulers will continue) | 15:42 |
*** ysandeep is now known as ysandeep|out | 15:45 | |
fungi | slittle1_: going over ipv4 or ipv6? sounds like there could be some intermittent network problems... are you seeing the same behavior from multiple locations? | 15:47 |
slittle1_ | ipv4 | 15:55 |
slittle1_ | single location | 15:56 |
slittle1_ | don't have the means to test from multiple locations at the moment | 15:56 |
slittle1_ | Problem running 'git remote update gerrit' | 15:57 |
slittle1_ | Fetching gerrit | 15:57 |
slittle1_ | ssh_exchange_identification: read: Connection reset by peer | 15:57 |
slittle1_ | fatal: Could not read from remote repository. | 15:57 |
slittle1_ | Please make sure you have the correct access rights | 15:57 |
slittle1_ | and the repository exists. | 15:57 |
slittle1_ | error: Could not fetch gerrit | 15:57 |
fungi | slittle1_: i'll see if i can reproduce from other places on the internet | 15:59 |
fungi | running `git remote update gerrit` in starlingx/distcloud-client in a loop isn't producing errors from my house but i'll try from some virtual machines in various cloud providers as well | 16:01 |
Clark[m] | We limit connections per account. If this is happening concurrently or quickly enough that tcp hasn't closed completely, that may be the cause | 16:01 |
Clark[m] | We also limit by IP, so if you go through NAT you can hit a similar problem | 16:02 |
slittle1_ | ok, so I should try adding a delay between requests? What delay do you recommend ? | 16:02 |
Clark[m] | Well I'm suggesting this could be related but I don't know enough about your situation to be confident it is the cause. | 16:03 |
slittle1_ | How is the connect limit enforced? how many connects over what time period ? | 16:04 |
Clark[m] | There are two methods. The first is by iptables limiting to 100 connections per source IP. The other is Gerrit limiting to 96 per Gerrit account iirc | 16:06 |
Clark[m] | If it were me I'd git review -s on demand and not try to do them in bulk | 16:06 |
opendevreview | Merged openstack/project-config master: Allow Zuul API access from keycloak server https://review.opendev.org/c/openstack/project-config/+/820956 | 16:08 |
slittle1_ | The 'git review -s' requests are serial, not parallel. | 16:09 |
fungi | yeah, unlikely to be either of the concurrent connection count limits in that case | 16:09 |
fungi | (the limit of 96 concurrent ssh connections per account is enforced by the gerrit service, the limit of 100 concurrent ssh connections per source ip address is enforced by iptables/conntrack on the server, for future reference) | 16:10 |
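For illustration, a per-source-IP cap like the one fungi describes is typically expressed with an iptables connlimit rule of roughly this shape; the production rule may differ in chain, mask, and reject behaviour.

    # sketch only: cap concurrent connections to the Gerrit SSH port per source address
    iptables -A INPUT -p tcp --syn --dport 29418 \
        -m connlimit --connlimit-above 100 --connlimit-mask 32 -j REJECT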
Clark[m] | The Gerrit ssh log may have hints. But I'm finishing a school run | 16:11 |
Clark[m] | Similarly trying to reproduce with only ssh client on the client side with -vvv may be helpful | 16:12 |
fungi | i can't reproduce the same problem running `git remote update gerrit` in a tight loop from various places on the internet so far | 16:13 |
slittle1_ | any other anti spam/DOS measure I might be getting caught in? | 16:15 |
slittle1_ | I'd estimate ~200 of those requests over 2-3 minutes | 16:16 |
clarkb | slittle1_: that may be enough that tcp isn't fully closing | 16:16 |
clarkb | and you're hitting the tcp limit | 16:16 |
fungi | i doubt it's the conntrack overflow, since it's set to send icmp-port-unreachable not tcp reset (git's claiming to see the latter) | 16:18 |
*** pojadhav is now known as pojadhav|rover | 16:19 | |
clarkb | slittle1_: do you know approximately what time the last error occurred? I can look at the gerrit sshd log | 16:21 |
slittle1_ | within the last 5 min | 16:23 |
clarkb | ok the sshd log doesn't seem to show any errors in that timeframe, implying it is probably something before gerrit is involved | 16:25 |
clarkb | perhaps a firewall on your end or some sort of asymmetric route causing routers/firewalls to get angry | 16:26 |
slittle1_ | I'll try again now | 16:27 |
clarkb | I've approved https://review.opendev.org/c/opendev/system-config/+/818606 as I indicated I would yesterday (this is the lodgeit user update) | 16:28 |
clarkb | if it has a sad I can manually revert on the host, then push a revert if the fix isn't straightforward | 16:28 |
noonedeadpunk | was that discussion about connection issues to opendev infrastructure?:) | 16:29 |
clarkb | noonedeadpunk: specifically to review.opendev.org over port 29418 with ipv4, yes | 16:29 |
noonedeadpunk | well just for me right now git clone https://opendev.org/openstack/requirements /tmp/req ends with `GnuTLS recv error (-9): Error decoding the received TLS packet.` | 16:30 |
slittle1_ | got a bit further | 16:31 |
clarkb | noonedeadpunk: that is a different system hosted in another part of the world. I doubt they are related, but I suppose it is possible | 16:31 |
slittle1_ | ssh://slittle1@review.opendev.org:29418/starlingx/portieris-armada-app.git did not work. Description: ssh_exchange_identification: read: Connection reset by peer | 16:31 |
slittle1_ | fatal: Could not read from remote repository. | 16:31 |
slittle1_ | Please make sure you have the correct access rights | 16:31 |
slittle1_ | and the repository exists. | 16:31 |
slittle1_ | Could not connect to gerrit. | 16:31 |
slittle1_ | Enter your gerrit username: | 16:31 |
noonedeadpunk | curl actually works, but you know - it's quite a different proto being used | 16:31 |
noonedeadpunk | clarkb: do we actually have some rate limiting there? | 16:31 |
noonedeadpunk | As I was cloning quite a lot of repos at a time.... | 16:32 |
clarkb | noonedeadpunk: we have "if you overload the system you'll break it and cause a fail over to another backend" rate limiting :) | 16:32 |
clarkb | noonedeadpunk: were you running OSA updates in a datacenter? we know that causes it to happen and had to ask osa to not ddos us | 16:32 |
noonedeadpunk | mmm, I see ) | 16:32 |
clarkb | unfortunately git clones are not cheap and need significant amounts of memory. Eventually we run out. | 16:33 |
clarkb | slittle1_: looks like the same error but in a different part of the process? | 16:33 |
noonedeadpunk | While I'm aware about osa issue and we got exact reason why it's happening, and I really do some osa related stuff, it's not related :) | 16:33 |
clarkb | slittle1_: the specific repo there gives me something new to look at in the logs | 16:33 |
noonedeadpunk | I was retrieving HEAD SHAs for openstack services so that shouldn't cause too much load | 16:34 |
clarkb | noonedeadpunk: it's actually the same | 16:35 |
clarkb | git has to load all the data into memory for most operations aiui | 16:35 |
clarkb | the resulting IO can differ but the IO and cpu impact to initiate operations doesn't differ by much | 16:35 |
slittle1_ | clarkb: it's just iterating through our starlingx git repos. It got a bit further this time. | 16:36 |
noonedeadpunk | um, so the issue when osa was ddosing was when it did quite the same but from each compute in the deployment | 16:36 |
noonedeadpunk | ah | 16:36 |
clarkb | er the memory, io and cpu to initiate don't differ much. The delta is the io afterwards | 16:36 |
noonedeadpunk | I see | 16:36 |
clarkb | slittle1_: right this is why I suggested doing it on demand earlier. Fwiw I don't see that request at all here | 16:36 |
noonedeadpunk | but well... we need to update versions and do releases... I'm not sure I know another way to grab the top of stable/xena for example and make it persistent over time | 16:37 |
noonedeadpunk | we can do this slower though.... | 16:37 |
clarkb | noonedeadpunk: well it should be fine if you do them sequentially | 16:37 |
noonedeadpunk | yep, I did one by one | 16:38 |
slittle1_ | ultimately the goal is to create a branch on each repo, and to modify the defaultbranch of the .gitreview files in each repo | 16:38 |
noonedeadpunk | and then the process just got stuck and it's been like 15 minutes already that I can't clone :( | 16:38 |
clarkb | noonedeadpunk: well are you cloning or checking the HEAD? | 16:38 |
noonedeadpunk | so was wondering if there's some automated thing like fail2ban or dunno | 16:38 |
clarkb | because I mentioned cloning and you said you weren't doing that. And no there is no fail2ban, but we have to load balance by source IP (because git), and if you overload your backend this can happen | 16:39 |
clarkb | unfortunately I'm also trying to debug a separate connectivity issue to a separate service in another datacenter in a different country so juggling isn't easy | 16:39 |
noonedeadpunk | clarkb: what exactly the script was doing - `git ls-remote <repo> stable/xena` | 16:41 |
noonedeadpunk | ok, sorry, grab that | 16:41 |
noonedeadpunk | this can wait | 16:41 |
clarkb | noonedeadpunk: ok are you cloning then? | 16:41 |
clarkb | none of the backends indicate memory or system load pressure so likely not that | 16:41 |
noonedeadpunk | as for me - the git connection just hangs whatever I do | 16:42 |
noonedeadpunk | oh, well, no | 16:42 |
noonedeadpunk | git ls-remote just worked | 16:42 |
noonedeadpunk | clone not | 16:43 |
clarkb | noonedeadpunk: can you see which backend you are talking to by inspecting the ssl cert (we put the name of the backend in there too) | 16:44 |
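One way to check that from the client side, since the served certificate names the backend:

    # prints the subject of the certificate actually served, which identifies the gitea backend
    echo | openssl s_client -connect opendev.org:443 -servername opendev.org 2>/dev/null \
        | openssl x509 -noout -subject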
noonedeadpunk | I will probably just try to reboot.... | 16:44 |
clarkb | slittle1_: best I can tell based on lack of info in the logs on our end this is likely to be happening somewhere between you and us. Are you able to try ssh -vvv -p 29418 slittle1@review.opendev.org gerrit ls-projects and see if you can reproduce. Then maybe that gives us a bit more info | 16:45 |
noonedeadpunk | CN=gitea01.opendev.org | 16:45 |
noonedeadpunk | but things just went back to normal | 16:46 |
noonedeadpunk | so I guess I had some stuck connection that wasn't closed properly... | 16:46 |
noonedeadpunk | as I saw like 15% packet loss close to loadbalancer | 16:47 |
clarkb | is it possible that vexxhost is having a widespread ipv4 routing problem? | 16:47 |
clarkb | (thats just a long shot given what slittle1 observes in another datacenter but both are in vexxhost) | 16:47 |
clarkb | to confirm gitea01 seems healthy. The gitea processes have been running for a couple days. Current free memory is good and there are no recent OOMKiller events | 16:48 |
clarkb | slittle1_: are your connections running in parallel? I see 13 connections from your source currently 11 of which are established | 16:50 |
clarkb | that is still only 13% of our limit though so shouldn't be in danger of that. Mostly just curious | 16:51 |
*** pojadhav|rover is now known as pojadhav|out | 16:51 | |
noonedeadpunk | well, connection to vexxhost never was reliable for me at least because of zayo being in the middle.... But packet loss was somewhere on the core router... | 16:51 |
clarkb | noonedeadpunk: also if it wasn't clear running an ls-remote sequentially the way you are doing is the correct method I think. I would expect that to work | 16:52 |
clarkb | noonedeadpunk: doing 200 at the same time might not :) | 16:52 |
noonedeadpunk | it always worked at least before | 16:53 |
noonedeadpunk | and that was exactly problem with osa upgrades | 16:53 |
clarkb | slittle1_: now down to 6. So ya I don't think we're hitting that 100 limit unless it happens very quickly and everything backs off | 16:54 |
noonedeadpunk | we were too tolerant of failovers if things are broken on the deployer side (or they execute the upgrade in the wrong order) | 16:54 |
fungi | noonedeadpunk: ooh, so the cause of osa upgrades overwhelming us was finally identified? that's great news | 16:55 |
noonedeadpunk | but to get this fixed ppl would need to pull in fresh code... | 16:56 |
noonedeadpunk | or follow docs while upgrading | 16:56 |
jrosser | I think that of the people who were causing this we reached out to them all and no-one was able to help reproduce it | 16:56 |
noonedeadpunk | both are kind of unlikely in short term | 16:56 |
jrosser | I would be in favour of adding an assert: to the code to make it just fail when this happens | 16:57 |
jrosser | though it technically is a valid configuration to use no local caching at all | 16:57 |
*** jpena is now known as jpena|off | 16:57 | |
opendevreview | Merged opendev/system-config master: Switch lodgeit to run under a dedicated user https://review.opendev.org/c/opendev/system-config/+/818606 | 16:58 |
jrosser | anyway - what noonedeadpunk is doing is trying to run a script to retrieve the SHA of stable/xena for all the OSA repos, nothing to do with a deployment | 16:59 |
jrosser | it's needed for our release process | 16:59 |
clarkb | yup, and from what I see things are fine on our side. noonedeadpunk indicated packetloss though | 16:59 |
*** marios|ruck is now known as marios|out | 16:59 | |
clarkb | I'm beginning to suspect there may be some "the Internet is having a fit near vexxhost right now" issues | 16:59 |
slittle1_ | I suspect the extra connections relate to my use of 'timeout' to kill hung sessions. | 17:00 |
clarkb | but those are always difficult to debug if you aren't on a client end with the problem and the server side doesn't see the issue because packets don't reach it | 17:00 |
slittle1_ | The hung sessions are probably cases where ssh key exchange failed and it's prompting for user/pass | 17:01 |
slittle1_ | the script doesn't know how to respond to that, and I don't see it as the prompt is routed to /dev/null. It's in a sub function, and the only thing I want coming out of stdout is the string I'm expecting to parse. I'll route stdout to stderr if I need another run | 17:04 |
slittle1_ | Ha, it finally passed | 17:05 |
slittle1_ | I'm afraid the 'git reviews' will hit the same issue | 17:06 |
clarkb | slittle1_: it probably will | 17:06 |
clarkb | noonedeadpunk: fwiw I was just able to clone requirements at least 10 times (it ran in a while loop and I didn't count the exact number) via opendev.org to gitea01 (I am balanced to the same backend) over ipv4 successfully. | 17:09 |
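A throwaway reproduction loop of the sort described might look like this (the clone target and iteration count are arbitrary choices, not taken from the log):

    i=0; while [ $i -lt 10 ]; do
        git clone https://opendev.org/openstack/requirements /tmp/req-test || break
        rm -rf /tmp/req-test
        i=$((i + 1))
    done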
clarkb | fungi: ^ any better ideas on debugging slittle1_'s problem if the connections don't seem to show up in our logs. I suspect something external to us | 17:10 |
clarkb | I guess run an mtr from slittle1_'s IP to review.opendev.org and see if there is packet loss. But it might be port specific etc | 17:10 |
fungi | git review may prompt for account details if an ssh connection attempt fails at the wrong places in the script | 17:12 |
fungi | it's likely just another manifestation of a connection issue | 17:12 |
clarkb | fungi: ya I'm wondering if it is general internet unhappiness, maybe an asymmetric route? Or slittle1_'s local firewall limiting connections to a single endpoint or a firewall cluster not allowing port 29418 out on a specific node etc | 17:13 |
clarkb | something like that would explain why we never see the issue in our logs | 17:13 |
noonedeadpunk | clarkb: it works nicely now as well | 17:13 |
fungi | if it can be minimally reproduced with specific ssh commands, then we may be able to narrow it down with added verbosity to something like problematic route distribution, a pmtud blackhole, et cetera | 17:14 |
clarkb | infra-root: http://lists.openstack.org/pipermail/openstack-discuss/2021-December/026250.html that is probably a meeting we should try and attend. I'll mark it on my todo list but calling it out if others want to attend | 17:14 |
fungi | depending on at what point the connection breaks | 17:14 |
clarkb | fungi: slittle1_: ya so something like ssh -vvv -p 29418 slittle1@review.opendev.org gerrit ls-projects | 17:14 |
clarkb | and see if you can make that fail | 17:14 |
fungi | we've even seen examples of environments doing specific qos/dscp marking on ssh connections, causing them to get treated differently (in bad ways) from other tcp sessions, or particular firewalls with ssh-specific connection tracking features introducing nuanced inconsistencies | 17:16 |
clarkb | paste updated and I've been able to make this test paste just now https://paste.opendev.org/show/bE7I0dBfkoDBsGSDZYNT/ I think that is happy | 17:18 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 17:20 |
clarkb | fungi: can you check my comments on https://review.opendev.org/c/opendev/system-config/+/820900 ? I +2'd as nothing there seemed critical but didn't want to approve in case it was worth updating | 17:22 |
fungi | thanks, replied to them | 17:26 |
clarkb | fungi: I think I have a slight preference to aggregate by chain since each chain's rule behaviors are specific to that chain | 17:28 |
clarkb | maybe in a followup? | 17:28 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 17:29 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 17:30 |
slittle1_ | ran 'ssh -vvv -p 29418 slittle1@review.opendev.org gerrit ls-projects' ten times in rapid succession. No issues | 17:31 |
clarkb | fungi: also left a thought on https://review.opendev.org/c/opendev/system-config/+/821144 to make the test a bit more robust | 17:32 |
fungi | thanks | 17:33 |
slittle1_ | ran it in a tighter loop. failed on the 19th iteration.... | 17:35 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 17:35 |
slittle1_ | debug1: Connecting to review.opendev.org [199.204.45.33] port 29418. | 17:36 |
slittle1_ | debug1: Connection established. | 17:36 |
slittle1_ | debug1: identity file /folk/slittle1/.ssh/openstack type 1 | 17:36 |
slittle1_ | debug1: key_load_public: No such file or directory | 17:36 |
slittle1_ | debug1: identity file /folk/slittle1/.ssh/openstack-cert type -1 | 17:36 |
slittle1_ | debug1: Enabling compatibility mode for protocol 2.0 | 17:36 |
slittle1_ | debug1: Local version string SSH-2.0-OpenSSH_7.4 | 17:36 |
slittle1_ | ssh_exchange_identification: read: Connection reset by peer | 17:36 |
clarkb | ok that indicates it is being killed very early in the protocol establishment. It gets far enough to create the tcp connection but then almost as soon as it starts to negotiate ssh on top of that a peer resets it (which can be a router or firewall in between) | 17:37 |
clarkb | our firewall rules don't do resets | 17:37 |
clarkb | slittle1_: did you just do that in a while loop? I'll run similar locally if so just to see if I can reproduce from here | 17:41 |
slittle1_ | yes | 17:42 |
slittle1_ | i=0; while [ $i -le 100 ]; do echo $i; i=$((i + 1)); ssh -vvv -p 29418 slittle1@review.opendev.org gerrit ls-projects; if [ $? -ne 0 ]; then break; fi; done | 17:42 |
clarkb | ok I just did similar with 30 iterations and had no problems. | 17:43 |
clarkb | and reran again just to be double sure. Definitely seems like something to do with your network connectivity. Whether local or upstream of you | 17:45 |
*** weechat1 is now known as amorin | 17:54 | |
*** weechat1 is now known as amorin | 18:00 | |
clarkb | if you want to debug further the next step is probably a tcpdump to catch the reset and see where it originates from? fungi might have better ideas. That will liekly produce a large amount of data though | 18:10 |
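If anyone does go the tcpdump route, narrowing the capture to resets on the Gerrit SSH port keeps the data volume manageable, something like:

    # capture only RST segments on the gerrit ssh port; ideally run on both client and server sides
    tcpdump -ni any 'tcp port 29418 and (tcp[tcpflags] & tcp-rst) != 0'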
fungi | that *might* help narrow it down, but these days most middleboxes "spoof" tc resets on behalf of the remote address | 18:13 |
fungi | er, tcp resets | 18:13 |
fungi | so all tcpdump will probably show you is that the server sent a tcp/rst packet, and a corresponding tcpdump on the server will show no such packet emitted | 18:13 |
fungi | but it is unlikely to help in narrowing down which system between the client and server actually originated the reset | 18:14 |
fungi | i would say, the majority of the time i've seen those symptoms, it's either because of an overloaded state tracking/address translation table on a router selectively closing connections to keep under its limit, or a cascade effect failure due to running out of bridge table space on an ethernet switch somewhere | 18:16 |
fungi | the intermittency can be further stretched by flow distribution across parallel devices, where one device is struggling but only a random sample of flows are sent through it | 18:18 |
clarkb | yup tl;dr Internet | 18:19 |
fungi | getting your isp to talk to vexxhost and/or their backbone providers might help get eyes on a problem, but usually the network providers are actually aware and are sitting on degraded states awaiting a maintenance window to replace/service something | 18:20 |
fungi | i'm just glad to no longer be one of the people making those decisions ;) | 18:21 |
fungi | possibly of interest to some here, a summary of the recent pypi user feedback survey: https://pyfound.blogspot.com/2021/12/pypi-user-feedback-summary.html | 18:24 |
fungi | surveys | 18:24 |
fungi | decisions include adding paid organization accounts on pypi (free for community projects), and further requirements gathering on package namespacing | 18:27 |
clarkb | fungi: for the lists ansible stuff. Did you want to push up a followup to do the chain move or just update the existing change? I'm thinking we should probably land the iptables update change first before anything else just to be sure it doesn't impact prod (it shouldn't as only the test all group gets rules) | 18:27 |
clarkb | And then we should be able to land the set of lists specific changes in one block pretty safely | 18:28 |
fungi | yeah, i'll revise the iptables change, i'd rather not merge too many different updates to our firewall handling, as each is a separate opportunity for breakage | 18:31 |
clarkb | ++ | 18:32 |
fungi | clarkb: for the debugging, would you prefer to record the ip(6)tables-save output some other way? | 18:38 |
fungi | i stuck the print statement where i did mainly so that it would be logged in close proximity to the assertion failures, but no idea if you had a chance to check whether that seemed too verbose to you | 18:38 |
opendevreview | Sorin Sbârnea proposed zuul/zuul-jobs master: Add tox-py310 job https://review.opendev.org/c/zuul/zuul-jobs/+/821247 | 18:39 |
clarkb | fungi: let me go look at the test logs | 18:39 |
clarkb | fungi: oh huh it looks like pytest captures stdout and doesn't show it unless you fail? In that case I think it is fine as is | 18:42 |
clarkb | I was worried a bunch of tests would be dumping iptables rules to the console log and making that noisy but that doesn't seem to be the case. And if that check fails you want to see the rules | 18:42 |
fungi | yeah, the output format itself also isn't awesome, it's a one-line list representation of all the lines output by the save command, but it was sufficient for me to finally find the normalized for for the rule i was trying to match in my test addition | 18:45 |
fungi | er, normalized form for | 18:46 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Block outbound SMTP connections from test jobs https://review.opendev.org/c/opendev/system-config/+/820900 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Copy Exim logs in system-config-run jobs https://review.opendev.org/c/opendev/system-config/+/820899 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Collect mailman logs in deployment testing https://review.opendev.org/c/opendev/system-config/+/821112 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Make sure /usr/bin/python is present for mailman https://review.opendev.org/c/opendev/system-config/+/821095 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Restart mailman services when testing https://review.opendev.org/c/opendev/system-config/+/821144 | 18:47 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Use newlist's automate option https://review.opendev.org/c/opendev/system-config/+/820397 | 18:47 |
*** sshnaidm is now known as sshnaidm|afk | 19:05 | |
clarkb | that stack lgtm now. Thanks | 19:13 |
fungi | much obliged | 19:17 |
clarkb | fungi: do you have time for https://review.opendev.org/c/opendev/gerritbot/+/818494 and parent? | 19:22 |
clarkb | I should do an audit of the buster images that need bullseye updates and we can start doing them all | 19:22 |
clarkb | I'll work on putting together this todo list as well as one for the user stuff this afternoon. Then we can work through it and know when we are done | 19:25 |
fungi | reviewed both of those, and thanks | 19:27 |
fungi | interesting run timeout on the mailman log collection change, i wonder if i've added too much to the job: https://zuul.opendev.org/t/openstack/build/d5aab74b18f348f0939f62c6bb116bb6 | 19:30 |
clarkb | or maybe the node was really slow creating lists? | 19:40 |
fungi | maybe | 19:42 |
fungi | that change is earlier in the stack than the one which alters the newlist command invocation | 19:43 |
clarkb | https://etherpad.opendev.org/p/opendev-container-maintenance starting to put the information together there | 19:56 |
clarkb | Need to take a break for lunch, but I'll try to get that etherpad as complete as possible. Then we can start pushing changes in a more organized manner to get through this. Previously it was pretty ad hoc (we've made decent progress though) | 20:22 |
slittle1_ | oops ... I think we missed something in the config of one of our starlingx repos | 21:08 |
slittle1_ | remote: error: branch refs/tags/vr/stx.6.0: | 21:08 |
slittle1_ | remote: You need 'Create Signed Tag' rights to push a signed tag. | 21:08 |
slittle1_ | remote: User: slittle1 | 21:08 |
slittle1_ | remote: Contact an administrator to fix the permissions | 21:08 |
slittle1_ | remote: Processing changes: refs: 1, done | 21:08 |
slittle1_ | To ssh://review.opendev.org:29418/starlingx/metrics-server-armada-app.git | 21:08 |
slittle1_ | ! [remote rejected] vr/stx.6.0 -> vr/stx.6.0 (prohibited by Gerrit: not permitted: create signed tag) | 21:08 |
slittle1_ | error: failed to push some refs to 'ssh://review.opendev.org:29418/starlingx/metrics-server-armada-app.git' | 21:08 |
clarkb | slittle1_: you'll need to push a change to update your acls allowing you to push the signed tags | 21:09 |
clarkb | if the acl is already there then you'll need to be added to the appropriate group | 21:09 |
clarkb | slittle1_: https://opendev.org/openstack/project-config/src/branch/master/gerrit/acls/starlingx/metrics-server-armada-app.config#L11 | 21:10 |
clarkb | https://review.opendev.org/admin/groups/3086a3152fc635addcd00cd4823a1be0352fac1f,members | 21:11 |
slittle1_ | yah, should have included 'starlingx-release' | 21:16 |
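For reference, the relevant stanza in those ACL files looks roughly like this (exact group names vary per repository):

    [access "refs/tags/*"]
        createSignedTag = group starlingx-release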
opendevreview | Scott Little proposed openstack/project-config master: give starlingx-release branch and tag powers in metrics-server-armada-app https://review.opendev.org/c/openstack/project-config/+/821321 | 21:25 |
slittle1_ | https://review.opendev.org/c/openstack/project-config/+/821321 | 21:25 |
clarkb | slittle1_: I'm not sure if you can double up the groups on one line like that | 21:27 |
clarkb | but also should you replace the core group with the release group anyway? | 21:27 |
slittle1_ | yes, that would be mor consisten with our norm | 21:27 |
opendevreview | Scott Little proposed openstack/project-config master: give starlingx-release branch and tag powers in metrics-server-armada-app https://review.opendev.org/c/openstack/project-config/+/821321 | 21:29 |
slittle1_ | gotta get me a new keyboard | 21:30 |
clarkb | ianw: ok left comments on https://review.opendev.org/c/opendev/system-config/+/821155 tl;dr I think it does what it describes and that it is safe and unintrusive but also think we should have a discussion as a group about further plans before we get too far ahead. Happy to dedicate the majority of our next meeting to that if it would be helpful (or use email or do an ad hoc meeting | 22:28 |
clarkb | etc) | 22:28 |
ianw | thank you! yes i agree on discussion | 22:32 |
ianw | as far as i would want to go is having zuul write things in plain text on the bastion. i could write a spec to that, if we like, or just an email | 22:33 |
clarkb | I think either way works. I might have a slight preference for a spec as it helps outline everything in the code where we do that sort of thing | 22:35 |
ianw | i wouldn't mind applying 821155 (not now, when it's quiet and i'm watching) and reverting after a successful run, just to confirm it works as intended | 22:37 |
ianw | i think it does, but i've thought a lot of things about this changeset that haven't quite been true :) | 22:37 |
clarkb | heh ya. I think it's a good way to test the waters as the scope is quite small and we can clean up after it easily when done | 22:38 |
clarkb | ok I think that etherpad is fairly complete and I've sorted the lists by done, not applicable for one reason or another, and needs work | 22:40 |
clarkb | I'm going to start pushing more changes up to bump to bullseye next | 22:41 |
fungi | we seem to have very few builds in progress for the openstack tenant at the moment, most builds seem to be queued | 22:43 |
fungi | thinking this may be all the branch creation events for starlingx repos, we saw something similar when the release team merged a change to add branches to all of the openstackansible repos earlier in the week | 22:45 |
clarkb | exciting | 22:45 |
clarkb | there are a lot of events | 22:45 |
clarkb | I guess we watch that and see if they move? | 22:45 |
fungi | corvus suggested that the scheduler should be collapsing all the reconfigure events for those together, i think? | 22:45 |
fungi | we ended up getting out of the similar pileup from osa by doing a full scheduler restart and zk clear | 22:46 |
clarkb | ya might be worth double checking zuul isn't doing something wrong here too | 22:46 |
fungi | the event queues should burn down on their own, but i don't know how rapidly. https://grafana.opendev.org/d/5Imot6EMk/zuul-status says some events are taking 15-30 minutes to process | 22:49 |
clarkb | ya I think the restart made things go faster because zuul would check all branches at the startup time and somehow that makes it go quicker? | 22:50 |
clarkb | but I hesitate to proceed with a restart because 1) zuul should be able to handle this and 2) I thought we thought zuul would handle this? Probably a good idea to see if corvus has opinions | 22:51 |
fungi | well, it would only read them all once, rather than one for every new branch creation in one of the repos, i guess? | 22:51 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update the accessbot image to bullseye https://review.opendev.org/c/opendev/system-config/+/821328 | 22:52 |
fungi | kevinz: if you're around yet (i'm sure it's still early) we seem to have 19 server instances stuck in a "deleting" state (one i looked at is saying the task_state is deleting but the vm_state is building, with a creation date of 2021-11-19, i expect the others are similar but haven't confirmed) | 22:52 |
fungi | as a result we're not booting any new instances there until they're cleaned up | 22:52 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update the hound image to bullseye https://review.opendev.org/c/opendev/system-config/+/821329 | 22:55 |
clarkb | the queue sizes appear to be getting smaller | 23:00 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update limboria ircbot to bullseye https://review.opendev.org/c/opendev/system-config/+/821330 | 23:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Install Limnoria from upstream https://review.opendev.org/c/opendev/system-config/+/821331 | 23:08 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update matrix-eavesdrop image to bullseye https://review.opendev.org/c/opendev/system-config/+/821332 | 23:11 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update refstack image to bullseye https://review.opendev.org/c/opendev/system-config/+/821335 | 23:26 |
ianw | clarkb: for 821331 did they make it to the master branch yet? | 23:27 |
*** rlandy|ruck is now known as rlandy|out | 23:30 | |
clarkb | ianw: they appear to have. I cloned and git log showed them in history | 23:30 |
clarkb | ianw: but you should definitely double check | 23:30 |
clarkb | I sort of figured we could get the changes up and then testing will tell us where bullseye is different and stuff will break | 23:31 |
clarkb | but better to get this out there as a list of things we can take action on than a secret todo list :) | 23:31 |
clarkb | uwsgi-base is going to be the complicated one that needs thinking since it is a base image with other consumers. We want to do what we did with python-base and python-builder so I'll have to look at it a bit more closely once the others are moving along | 23:34 |
*** artom__ is now known as artom | 23:39 | |
opendevreview | Clark Boylan proposed opendev/system-config master: Properly build bullseye uwsgi-base docker images https://review.opendev.org/c/opendev/system-config/+/821339 | 23:47 |
clarkb | ok the uwsgi situation is a bit fun. I tried to cover it all in the commit message for ^. Lodgeit isn't actually done and will need an image rebuild once ^ lands | 23:48 |
opendevreview | Clark Boylan proposed opendev/lodgeit master: Rebuild the lodgeit docker image https://review.opendev.org/c/opendev/lodgeit/+/821340 | 23:50 |
clarkb | ok I think that is a fairly complete list of changes needed to bump our images up a debian release. Note I don't think we should approve them all at once and instead take a little time to make sure debian userland updates don't cause unexpected changes | 23:50 |
clarkb | but the vast majority of them should be fine as they don't rely on the userland for much | 23:50 |
fungi | management events list for the openstack tenant is down to 4 now | 23:51 |
clarkb | I guess tomorrow I'll look for any failures and maybe we can land a subset. Then we can also start looking at the uid updates. Hopefully that etherpad lays out the todos around this pretty clearly. I added a few others as well for mariadb and zookeeper that i noticed | 23:53 |
ianw | clarkb: thanks, will double check. i'll review the other bits this afternoon | 23:56 |