JayF | similarly failed | 00:00 |
---|---|---|
ianw | i can propose a change with enableChannelIdTracking=false. i think we should file another upstream bug report though, as i can't find a recent one | 00:01 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases https://review.opendev.org/c/opendev/system-config/+/869210 | 00:02 |
fungi | if it helps, i was able to push that ^ easily and quickly | 00:03 |
JayF | perhaps it's related to the size of the repo? | 00:03 |
JayF | I'm trying to push literally a three character logging fix for neutron lol | 00:03 |
fungi | openssh-client 1:9.1p1-2 from debian/sid | 00:03 |
ianw | JayF's push looks like | 00:04 |
ianw | git-upload-pack./openstack/neutron.git 4ms 13044ms '4ms 16ms 0ms 111ms 0ms 11950ms 12077ms -1 3658 5467 3462123' 0 - 751ms 740ms 152267816 | 00:04 |
ianw | i have no idea what all those numbers mean | 00:04 |
fungi | i was pushing a change for the same repo where ianw saw his error | 00:04 |
ianw | fungi: yeah, i re-pushed and didn't see the error -- but also that was a different error to jayf | 00:04 |
fungi | ahh, okay | 00:05 |
JayF | mtr to review.opendev.org makes it look like cogent:comcast links ipv6 are doing bad | 00:05 |
JayF | I wonder if this is internet shenanigans to a degree to impact this | 00:05 |
ianw | it does seem to suggest that slow (or maybe packet-dropping-ish) links and big repos trigger it | 00:05 |
* JayF puts a v4 address in his hosts file and tries a thing | 00:05 | |
ianw | can you try ipv4? | 00:05 |
clarkb | I don't think the wikimedia workaround is still a valid option (possibly due to the switch to MINA) | 00:06 |
JayF | looks like the issues persist over v4, but so do the packet drops upstream of me in MTR | 00:06 |
ianw | although in the overall list of people seeing this, there's plenty of ipv4 addresses | 00:06 |
ianw | clarkb: i thought that change was for mina -- not nio2 or whatever it used to be? | 00:07 |
* JayF stepping away, but will happily retry if ping'd | 00:07 | |
clarkb | ianw: hrm it may have been but looking at current valid sshd options for gerrit that flag is not in the list | 00:08 |
ianw | i think it's https://gerrit-review.googlesource.com/c/gerrit/+/238384 | 00:08 |
ianw | "For more safety, protect this experimental feature behind undocumented | 00:08 |
ianw | configuration option, but enable this option per default. | 00:08 |
ianw | " | 00:08 |
clarkb | heh I'm not sure I like that approach. Secret config options | 00:09 |
clarkb | but undocumented so we don't know what they do | 00:09 |
ianw | https://issues.apache.org/jira/browse/SSHD-942 seems to be upstream discussion | 00:11 |
clarkb | fwiw I can git remote update in neutron and nova without issue | 00:12 |
JayF | `git fetch gerrit` also worked for me immediately before I started having this issue | 00:12 |
ianw | "I thought some more about this and what I have suggested might "mask" out protocol problems that are not related to latency or out-of-order messages - e.g. a client that does not adhere to the protocol 100% and is indeed sending messages when they are not expected." | 00:13 |
ianw | i think we can rule out the client not adhering to the protocol | 00:13 |
ianw | so that really seems to leave out-of-order messages | 00:13 |
JayF | which matches my observation of packet loss in the internet route to gerrit | 00:14 |
clarkb | I don't cross cogent. I'm going through wholesail and beanfield | 00:14 |
JayF | ooooooh | 00:14 |
JayF | I'm going to tether. | 00:14 |
clarkb | thought I'd try due to adjacency to JayF but our ISPs seem to share completely different sets of peering | 00:14 |
clarkb | s/share/use/ | 00:14 |
ianw | i guess the question is -- is this an error that gerrit is quitting, or would it be failing anyway and gerrit is just telling us | 00:14 |
JayF | I have comcast and am currently on some kind of wait list for centurylink :( | 00:14 |
clarkb | ianw: with MINA its hard to say unless you pull out your ssh protocol rfc and start reading :( this is what ultimately turned me into a java dev | 00:15 |
clarkb | fair warning | 00:15 |
JayF | so, it worked | 00:16 |
JayF | from my tether | 00:16 |
ianw | ... interesting ... | 00:17 |
fungi | my v4 and v6 routes seem to go from my broadband provider, through beanfield, and into vexxhost. no other backbones | 00:17 |
ianw | JayF: if you have time, i wonder if we might catch a tcpdump of a failure to your IP? it might show us out-of-order things? | 00:18 |
JayF | so my hypothesis: gerrit is currently extremely intolerant of packet loss / out of order packets | 00:18 |
JayF | ianw: I can tell you with nearly 100% certainty from client behavior: I went from "some retransmits" to "zero retransmits" once I tethered | 00:18 |
fungi | the tcp/ip stack on the server should normally reassemble streams into the correct packet order | 00:18 |
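_(For reference, a hedged sketch of the kind of server-side capture ianw suggests; the interface name and client address are placeholders, not values from this discussion.)_

```shell
# Hedged sketch: capture Gerrit ssh traffic to one client so retransmits and
# out-of-order segments would be visible in the resulting pcap.
sudo tcpdump -i eth0 -w gerrit-ssh.pcap 'tcp port 29418 and host 203.0.113.10'
```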
JayF | I actually need to head out for now though, sorry you all, good luck o/ | 00:19 |
ianw | "This is old issue, but we are getting reports, that the problem shows up even with the custom UnknownChannelReferenceHandler. And even could disappear if using default DefaultUnknownChannelReferenceHandler instance according to this gerrit configuration option: | 00:19 |
ianw | sshd.enableChannelIdTracking = false." | 00:19 |
ianw | JayF: no worries, thanks :) | 00:20 |
fungi | but if there's massive packet loss too, just about anything will get cranky | 00:20 |
ianw | that's hard to parse, but it seems to be saying that "we're seeing this, and users are not seeing it if they turn off enableChannelIdTracking" | 00:20 |
clarkb | ianw: right, but what does that lose us and why does gerrit think it is very unsafe so much so to not document it | 00:21 |
clarkb | I'm good with trying it if we understand the impacts and the risks are reasonable | 00:21 |
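_(A hedged sketch of how the option quoted above could be set; the option name comes from the upstream quote, while the config path is the usual $GERRIT_SITE layout and is an assumption here. Gerrit would need a restart for sshd settings to take effect.)_

```shell
# Hedged sketch: disable MINA channel-id tracking via the undocumented option.
git config -f "$GERRIT_SITE/etc/gerrit.config" sshd.enableChannelIdTracking false
```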
clarkb | I need to switch gears to getting a meeting agenda out before dinner. Holler if this ends up needing more eyeballs | 00:23 |
ianw | yeah i'm just trying to parse all these changelogs etc | 00:28 |
clarkb | ok I've finally managed to update the wiki agenda contents. I'll give it a few if anyone wants to add anything | 00:55 |
*** dasm|rover is now known as dasm|out | 00:55 | |
clarkb | ianw: do we want the ssh thing on there? | 00:55 |
ianw | clarkb: umm, i'm writing up a bug report, we can add it to see what, if anything, gets said about it | 00:56 |
clarkb | sounds good | 00:56 |
fungi | 869091 is back to failing system-config-run-gitea even though 872801 merged a few hours ago | 01:00 |
clarkb | it passed on the apparmor fix change | 01:00 |
clarkb | the run-gitea job I mean (that doesn't get us your logo stuff though) | 01:00 |
fungi | yeah, i'll get logs pulled up here in a few and see why | 01:04 |
fungi | oh, nevermind, i was looking at old results i think | 01:06 |
fungi | it passed when i rechecked after the image fix landed | 01:06 |
fungi | clarkb: is this more what you were expecting? http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_0b6/869091/11/check/system-config-run-gitea/0b667a3/bridge99.opendev.org/screenshots/gitea-main.png | 01:08 |
clarkb | yup I think that flows a lot better | 01:08 |
clarkb | let me rereview that change | 01:08 |
clarkb | then once ianw has a bug report link I can add that to the agenda and get that sent | 01:09 |
fungi | once we get the donor logos addition landed, we're in a good place to talk about which other logos we might want to add to the list | 01:09 |
clarkb | I +2'd it | 01:10 |
fungi | there's a good case for adding osuosl since they host part of our control plane (nb04) | 01:11 |
clarkb | ++ | 01:11 |
ianw | ok, https://github.com/apache/mina-sshd/issues/319 | 01:21 |
ianw | happy for feedback if it makes sense ... | 01:22 |
ianw | after pulling that apart, i am unsure if turning the option off will make a difference | 01:22 |
ianw | it seems like it will just think the channel isn't open and fail. but i guess the only way would be to test | 01:23 |
clarkb | ok agenda sent | 01:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: ensure-skopeo: fixup some typos https://review.opendev.org/c/zuul/zuul-jobs/+/872733 | 03:04 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/872806 | 03:05 |
opendevreview | Merged zuul/zuul-jobs master: ansible-lint: uncap https://review.opendev.org/c/zuul/zuul-jobs/+/872495 | 03:29 |
opendevreview | Merged zuul/zuul-jobs master: build-docker-image: fix change prefix https://review.opendev.org/c/zuul/zuul-jobs/+/872258 | 03:39 |
opendevreview | Merged zuul/zuul-jobs master: container-roles-jobs: Update tests to jammy nodes https://review.opendev.org/c/zuul/zuul-jobs/+/872375 | 03:39 |
ianw | Feb 7 03:46:24 graphite02 systemd[1]: systemd-fsckd.service: Succeeded. | 03:45 |
ianw | Feb 7 03:39:23 graphite02 systemd-timesyncd[462]: Initial synchronization to time server [2620:2d:4000:1::40]:123 (ntp.ubuntu.com). | 03:45 |
ianw | time went a fair way backwards on graphite02 when i restarted it | 03:45 |
ianw | (i noticed this because the container was saying it was created "less than a second ago" for ... more than a second) | 03:46 |
ianw | i so far haven't noticed this on any other servers i've restarted ... it does suggest though that ntp wasn't working | 03:48 |
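_(A hedged aside: standard systemd tooling that would show whether timesyncd considers the clock synchronized and what it logged around the reboot; nothing here is specific to graphite02.)_

```shell
# Hedged sketch: check clock sync state and recent timesyncd activity.
timedatectl status
journalctl -u systemd-timesyncd -n 20 --no-pager
```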
opendevreview | Vishal Manchanda proposed openstack/project-config master: Retire xstatic-font-awesome: remove project from infra https://review.opendev.org/c/openstack/project-config/+/872835 | 03:58 |
*** yadnesh|away is now known as yadnesh | 04:07 | |
ianw | keycloak also wasn't happy on restart. i tried a turn-it-off-and-on-again approach and it seems ok now | 04:07 |
ianw | it's also happened on paste | 04:14 |
ianw | Feb 7 04:20:26 paste01 systemd[1]: systemd-fsckd.service: Succeeded. | 04:14 |
ianw | Feb 7 04:12:40 paste01 systemd-timesyncd[413]: Initial synchronization to time server [2620:2d:4000:1::41]:123 (ntp.ubuntu.com). | 04:14 |
ianw | so the gerrit promote failed | 05:01 |
ianw | https://review.opendev.org/c/opendev/system-config/+/872802 | 05:01 |
ianw | all failed in "promote-docker-image: Delete all change tags older than the cutoff" | 05:03 |
opendevreview | Merged zuul/zuul-jobs master: ensure-skopeo: fixup some typos https://review.opendev.org/c/zuul/zuul-jobs/+/872733 | 05:19 |
fungi | ianw: the tag delete happens after the new tag is done, right? if so, that's benign at least | 05:25 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 05:48 |
ianw | fungi: yep, that's right. i think something like ^ will help with this | 05:49 |
ianw | i'll restart gerrit with that under new docker | 05:50 |
ianw | (that is done) | 05:52 |
*** ysandeep is now known as ysandeep|ruck | 06:00 | |
ianw | https://etherpad.opendev.org/p/docker-23-prod is about half done. the rest are probably more suited to an ansible loop which i can look at later | 06:14 |
ianw | fungi: ^ maybe you could look at the mailman ones? | 06:15 |
opendevreview | Merged openstack/project-config master: nodepool: infra-package-needs; cleanup tox installs https://review.opendev.org/c/openstack/project-config/+/872478 | 06:26 |
*** jpena|off is now known as jpena | 08:24 | |
*** gthiemon1e is now known as gthiemonge | 10:38 | |
*** ysandeep__ is now known as ysandeep|break | 11:36 | |
*** soniya29 is now known as soniya29|afk | 11:39 | |
*** yadnesh is now known as yadnesh|away | 12:22 | |
*** ysandeep|break is now known as ysandeep|ruck | 12:22 | |
fungi | ianw: i have a held mailman 3 server i can test on first too | 12:45 |
fungi | the mailman 2 servers aren't dockerized anyway | 12:45 |
ysandeep|ruck | openstack-tox-py27 on many projects failing with ERROR: py27: InterpreterNotFound: python2.7 | 13:44 |
ysandeep|ruck | https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&skip=0 | 13:44 |
*** dasm|out is now known as dasm|rover | 13:53 | |
fungi | ysandeep|ruck: when did it start? was it coincident with tox v4 releasing at the end of 2022, or when we switched our default nodeset to ubuntu-jammy earlier last year? | 14:06 |
fungi | or was it more recent? | 14:06 |
fungi | i assume this is for older openstack stable branches, since openstack hasn't supported python 2.7 for a number of releases now | 14:06 |
ysandeep|ruck | fungi, started yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=0 | 14:06 |
ysandeep|ruck | yes hitting on stable branches | 14:07 |
fungi | interesting, i should be able to take a closer look in a few minutes | 14:07 |
slittle1_ | StarlingX release scripts always have issues pushing new branches or tags across all our gits. It starts out ok, but eventually we start seeing 'connection reset by peer'. It feels like we are triggering some sort of spam/bot protection. | 14:15 |
slittle1_ | or .... | 14:16 |
slittle1_ | ssh_exchange_identification: read: Connection reset by peer | 14:16 |
slittle1_ | fatal: Could not read from remote repository. | 14:16 |
slittle1_ | just now | 14:16 |
fungi | slittle1_: we observed this from some locations starting late yesterday. it appears cogent (the internet backbone provider) is having some issues, and connections traversing their network are suffering | 14:53 |
fungi | my connection doesn't get routed through cogent and as a result things are quite snappy | 14:54 |
fungi | it's possible vexxhost (where those servers are housed) could do some massaging of bgp announcements to make cogent seem less attractive | 14:54 |
Clark[m] | But we do also have connection limits by IP and user account | 14:54 |
fungi | oh, yes that's also a potential culprit | 14:55 |
Clark[m] | fungi: ysandeep|ruck I think we may have dropped python2 from our images expecting bindep to cover that. | 14:55 |
fungi | i think it's 64 simultaneous connections from the same gerrit account or 100 concurrent sockets from the same ip address | 14:56 |
fungi | if you exceed either of those your authentication gets denied or your connection gets refused, respectively | 14:56 |
fungi | though the cogent issue could also be compounding that if sockets are getting left hanging | 14:56 |
fungi | slittle1_: by always have issues i assume you mean since longer than just the past 24 hours | 14:57 |
fungi | so Clark[m]'s point about connection limits seems more likely in this case | 14:57 |
*** dasm|rover is now known as dasm|afk | 15:15 | |
opendevreview | Merged opendev/system-config master: Feature our cloud donors on opendev.org https://review.opendev.org/c/opendev/system-config/+/869091 | 15:15 |
jrosser | fwiw some connectivity issue to opendev.org is recurring again now, i see the same very low throughput as i had a few weeks ago | 15:18 |
jrosser | and just the same going to another host with different peering but geographically similar, it's all good | 15:18 |
fungi | jrosser: yeah, JayF mentioned seeing similar at 23:37 utc | 15:23 |
fungi | presumably traversing cogent in your traceroute? | 15:23 |
jrosser | fungi: looks like zayo again for me as the closest thing to opendev.org | 15:26 |
ysandeep|ruck | Clark[m], >> I think we may have dropped python2 from our images expecting bindep to cover that. -- is it possible to confirm that theory? | 15:27 |
Clark[m] | Yes look at the change log for openstack/project-config to see if the proposed changes landed | 15:28 |
fungi | jrosser: it may be an asymmetric route too (packets to vexxhost via zayo, replies coming back across cogent). i'd have to traceroute to your address from the load balancer to confirm though | 15:28 |
fungi | or from the gerrit server i guess, though any server there likely follows the same routes | 15:28 |
ysandeep|ruck | Clark[m], i see https://github.com/openstack/project-config/commit/95ff98b54b1095dae6591ac273777ff79834887d | 15:28 |
fungi | might have been "nodepool: infra-package-needs; cleanup tox installs" https://review.opendev.org/c/openstack/project-config/+/872478 | 15:29 |
fungi | that merged 06:26:24 utc today | 15:29 |
fungi | would have taken effect the next time images were built | 15:30 |
ysandeep|ruck | but we are seeing issue since yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=0 | 15:31 |
Clark[m] | https://review.opendev.org/c/openstack/project-config/+/872476/2 this one | 15:31 |
fungi | oh, yep. we stopped installing python-dev | 15:32 |
fungi | and python-xml, either of which would have pulled in python2.7 | 15:32 |
ysandeep|ruck | dasm|afk, ^^ I will be out soon, please follow the chatter | 15:32 |
fungi | that merged 00:16:31 yesterday | 15:33 |
fungi | so images built yesterday would have no longer had python2.7 present on them by default | 15:33 |
clarkb | and confirmed that bindep.txt lacks python in it https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/bindep.txt | 15:35 |
clarkb | the glance and neutron bindep.txt files are similar | 15:36 |
clarkb | nova and swift are fine | 15:36 |
clarkb | I think we have ~3 options: A) revert or re-add the python2-dev installs B) ask projects to fix their broken bindep.txt files (something we've asked for for years at this point) or C) update -py27 jobs to explicitly install python2 for you | 15:37 |
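_(For option B, a hedged sketch of the kind of bindep.txt entries that would pull the interpreter back onto the test nodes; package names and profile selectors are assumptions and vary by branch and distro.)_

```shell
# Hedged sketch: append python2 packages to a project's bindep.txt.
cat >> bindep.txt <<'EOF'
python2.7 [platform:dpkg test]
python2.7-dev [platform:dpkg test]
EOF
```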
slittle1_ | fungi: By always I mean every 3-6 months when we want to create a branch or tag on all the StarlingX gits. | 15:39 |
clarkb | in that case there is a good chance you are hitting the IP address limits or user account limits if you run things in parallel | 15:43 |
fungi | slittle1_: how are you creating them and how many? are you doing it in parallel? with a single account? | 15:43 |
fungi | if they're all done from a single account and it's authenticating more than 64 operations at the same time, that could be it | 15:44 |
clarkb | and even if it never fully gets to the IP limit tcp stacks will often keep connections open for a bit once they are "done" | 15:45 |
clarkb | so you might only do 60 pushes and be fine from an auth perspective but if you've got a tcp stack holding the connections open for a few extra minutes then eventually you can trip over the tcp connection limit | 15:45 |
slittle1_ | We use 'repo forall' to iterate through our gits serially, perhaps 60 gits in all. For each there would be a 'git fetch --all', a 'git review -s', a 'git push ${review_remote} ${tag}', a 'ssh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag}' and a 'git review --yes --topic=${branch/\//.}' | 15:46 |
clarkb | repo isn't serial though right? | 15:47 |
clarkb | (I'm not super familiar with repo, but my understanding is that it tries to setup large systems like android in parallel. Something google can get away with but not our gerrit) | 15:47 |
slittle1_ | 'repo forall' is serial by default | 15:47 |
fungi | a quick count of starlingx/ namespace projects returns 57 at the moment, so the number that have branches being created is probably lower | 15:48 |
slittle1_ | We branch nearly all of them at the same time | 15:48 |
fungi | in that case it's more likely the ip connection limit (i'm pretty sure the connection refused response would point to that, as the account connection limit would be more likely to report a different error message) | 15:49 |
fungi | it may be that the connections are getting created too quickly in succession and not enough time is being allowed for the server's tcp/ip stack to close them down | 15:49 |
clarkb | fungi: yup that is my suspicion | 15:50 |
*** dasm|afk is now known as dasm|rover | 15:50 | |
clarkb | each of those commands will create new tcp connections and they may idle for a few minutes | 15:50 |
fungi | are those commands being issued by someone behind a nat shared by other clients that may also be connecting? if so that could also make it more likely | 15:50 |
clarkb | per repo its at least 5 connections (I think more) | 15:50 |
slittle1_ | script runtime would probably be 5 min if it could get through without errors | 15:50 |
clarkb | after 20 repos you'll be in ip blocking territory if things haven't cleaned up quickly enough | 15:51 |
ysandeep|ruck | Clark[m]: thanks, as the issue is affecting multiple repos and older stable branches, I will let infra decide which option is better atm before adding it in bindep.txt (for tripleo repos) | 15:51 |
slittle1_ | not sure about the 'nat' question. I'd have to chase down our own IT guys for that one | 15:51 |
clarkb | if your local IP isn't a publicly routable IP then most likely this is the case | 15:52 |
jrosser | is all of that over ssh? perhaps some ControlPersist could help? | 15:52 |
fungi | if you know what ip address it's coming from i may be able to find some evidence in syslog/journald (i can't remember of those conntrack limit rules log) | 15:52 |
fungi | s/of/if/ | 15:52 |
slittle1_ | there is also a retry loop, 45 sec delay, max 5 retries, when one of the above commands fail. | 15:52 |
slittle1_ | it's not publicly routable | 15:53 |
*** ysandeep|ruck is now known as ysandeep|out | 15:54 | |
fungi | slittle1_: i mean the public address it's getting rewritten to on exiting your network | 15:54 |
fungi | anyway, this is the rule we're suspecting is being tripped: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/review.yaml#L4 | 15:55 |
fungi | in theory it would stop blocking as soon as the tracked connections falls below 100 | 15:55 |
clarkb | jrosser's suggestion may be one to try since that will reduce total ssh connections. I'm not sure how to set that up with git though. Might need to use GIT_SSH? | 15:56 |
clarkb | or maybe you can edit your ssh config to manipulate that | 15:56 |
slittle1_ | probably 28.224.252.2 | 15:57 |
slittle1_ | 128.224.252.2 | 15:57 |
jrosser | google took me here https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/ | 15:57 |
fungi | or try throttling down the loop and see if that helps give earlier connections time to clean up | 15:57 |
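_(A hedged sketch of that throttling suggestion: the same per-repo commands slittle1_ described, with a pause between repos so earlier sockets can close. Variable names mirror the ones quoted above; the sleep value is a guess.)_

```shell
# Hedged sketch: serialize the per-repo operations and throttle between repos.
repo forall -c '
  git fetch --all
  git review -s
  git push "${review_remote}" "${tag}"
  ssh -p "${port}" "${host}" gerrit create-branch "${path}" "${branch}" "${tag}"
  git review --yes --topic="${branch/\//.}"
  sleep 15
'
```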
jrosser | if git is using the underlying ssh config (which i think it does) then you should be able to use any of the usual things in your ~/.ssh/config | 15:58 |
clarkb | ya it should | 15:58 |
fungi | we probably need to add -j LOG to that rule if we want to be able to find evidence of these events after the fact | 15:59 |
fungi | though i have no idea if there may be ssh brute-forcing attempts constantly hitting that which could fill the logs quickly | 16:00 |
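_(A hedged sketch of the kind of connlimit rule linked above, plus the LOG variant fungi mentions; the real rule lives in the system-config group_vars, and the chain, port, and limit here are assumptions.)_

```shell
# Hedged sketch: log and then reject new Gerrit ssh connections from a source
# address that already has more than 100 tracked connections.
iptables -A INPUT -p tcp --syn --dport 29418 -m connlimit --connlimit-above 100 \
    -j LOG --log-prefix "gerrit-connlimit: "
iptables -A INPUT -p tcp --syn --dport 29418 -m connlimit --connlimit-above 100 \
    -j REJECT --reject-with tcp-reset
```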
slittle1_ | So I try .... | 16:04 |
slittle1_ | Host *.opendev.org | 16:04 |
slittle1_ | ControlMaster auto | 16:04 |
slittle1_ | ControlPersist 600 | 16:04 |
slittle1_ | Sound ok ? | 16:04 |
fungi | netfilter docs mention /proc/net/nf_conntrack which doesn't seem to exist, i wonder if this kernel version exposes that somewhere else | 16:05 |
fungi | found /proc/sys/net/netfilter/nf_conntrack_* | 16:07 |
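_(Once the conntrack package is installed, a hedged example of the kind of check fungi is after; the address is the one from the discussion above.)_

```shell
# Hedged sketch: count conntrack entries originating from one source address.
sudo conntrack -L --src 128.224.252.2 2>/dev/null | wc -l
```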
clarkb | slittle1_: yes that seems correct, I'd probably not use a * though and keep in mind that this will be shared with all ssh things to the identified server (shouldn't be an issue but calling it out) | 16:37 |
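_(A hedged sketch of the narrower config clarkb suggests; note that ControlMaster generally does nothing without a ControlPath, so one is added here as an assumption worth checking.)_

```shell
# Hedged sketch: scope connection sharing to the Gerrit host and give the
# master a control socket path so multiplexing actually happens.
cat >> ~/.ssh/config <<'EOF'
Host review.opendev.org
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h:%p
    ControlPersist 600
EOF
```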
fungi | clarkb: where would be the best place to add the conntrack package (provides tools for inspecting the kernel's connection tracking tables since i can't find them in /proc) on the gerrit server? not the docker image obviously, it's for the underlying server. should we just install it on all our servers? there's the iptables role but it's hard-coded to only install one package and it's also multidistro which i doubt we need any longer | 16:37 |
clarkb | fungi: do we know if the jeepyb fix from yesterday was sufficient to make the daily job run? | 16:37 |
clarkb | fungi: the ansible gerrit role is where I'd do it if you want it specific to gerrit. Otherwise in our base package list for all servers | 16:38 |
fungi | oh, right, i was meaning to rerun manage-projects today | 16:39 |
clarkb | I'm not sure if we auto pull the gerrit image or not which is what would be needed to have the daily run pick it up | 16:39 |
clarkb | you don't need to restart gerrit on the new image to use the new image for manage-projects at least | 16:40 |
fungi | manage-projects for starlingx/public-keys seems to have worked this time | 16:41 |
fungi | it pushed to refs/meta/config successfully according to the log | 16:42 |
fungi | though failed after that: | 16:42 |
fungi | jeepyb.utils - INFO - Executing command: git --git-dir=/opt/lib/jeepyb/starlingx/public-keys/.git --work-tree=/opt/lib/jeepyb/starlingx/public-keys checkout master | 16:43 |
fungi | error: pathspec 'master' did not match any file(s) known to git | 16:43 |
clarkb | https://review.opendev.org/admin/repos/starlingx/public-keys,branches seems there in gerrit at least | 16:44 |
clarkb | is debian's git init creating main by default now maybe? | 16:45 |
fungi | when i `git init` on my up to date debian/sid workstation it says "hint: Using 'master' as the name for the initial branch. This default branch name is subject to change." | 16:48 |
clarkb | yup I did a git init in the gerrit image and got the same result | 16:49 |
clarkb | same == master is default | 16:49 |
clarkb | fungi: was there an earlier error? the checkout is in a finally block | 16:49 |
fungi | i'll see if i can find earlier errors but that was right after successfully pushing the refs/meta/config update. the log is at /var/log/manage-projects_2023-02-07T16\:39\:44+00\:00.log | 16:50 |
clarkb | fungi: the other thing to consider is we may not be git initing anymore because the repo exists in gerrit so we'd be cloning instead? | 16:51 |
clarkb | but still master exist in the gerrit repo as HEAD | 16:51 |
fungi | yeah, you can see the git clone | 16:52 |
clarkb | ya that is the problem I think. If you git clone it HEAD is master but there is nothing there | 16:52 |
clarkb | yup I've reproduced by cloning and trying to checkout master | 16:52 |
fungi | so the earlier failures left it in a dirty state | 16:52 |
clarkb | I think this is a bug in jeepyb where if we don't handle it with the initial git init locally to push the .gitreview file to the master branch we're stuck | 16:52 |
fungi | and rerunning won't recover from that | 16:52 |
clarkb | yup | 16:52 |
clarkb | a workaround would be to push the .gitreview file manually. Or maybe even just propose a change for it? | 16:53 |
clarkb | I'm not sure how to address this in jeepyb safely. maybe fallback to git init if we cannot checkout the default branch? | 16:54 |
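_(A hedged sketch of the manual workaround clarkb mentions: commit a .gitreview in the empty clone and push it straight to master. Pushing directly to refs/heads/master needs bootstrapper-level permissions, and `<user>` is a placeholder.)_

```shell
# Hedged sketch: bootstrap the empty project with a .gitreview on master.
git clone "ssh://<user>@review.opendev.org:29418/starlingx/public-keys"
cd public-keys
cat > .gitreview <<'EOF'
[gerrit]
host=review.opendev.org
port=29418
project=starlingx/public-keys.git
EOF
git add .gitreview
git commit -m "Add .gitreview"
git push origin HEAD:master
```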
clarkb | I need to pop out before my next round of meetings to eat breakfast | 16:56 |
fungi | wow, i can't push --force with my admin account unless it agrees to the icla | 17:06 |
fungi | i think that may be new? | 17:06 |
fungi | probably a missing permission in project bootstrappers | 17:06 |
fungi | i wonder how i agree to the icla with it since it has no webui access | 17:06 |
fungi | To ssh://review.opendev.org:29418/starlingx/public-keys | 17:07 |
fungi | * [new branch] master -> master | 17:07 |
fungi | i very briefly added my normal account to get around that | 17:08 |
fungi | i wanted to do it via a review, but git-review refused to use the master remote citing it did not exist (i guess because it was empty) | 17:08 |
fungi | https://opendev.org/starlingx/public-keys | 17:10 |
fungi | manage-projects runs clean now | 17:10 |
clarkb | fungi: you need to add yourself to bootsrappers but then I think you can push? | 17:11 |
fungi | right, you need to add an account that has agreed to the icla to bootstrappers | 17:11 |
clarkb | oh I thought any account in bootstrappers could do it. I guess maybe our creation account has signed it? | 17:12 |
fungi | we probably need to explicitly put our admin accounts in the system-cla group, or it's possible new gerrit has an additional perm we need to add to bypass cla enforcement | 17:12 |
clarkb | oh system-cla makes sense | 17:12 |
fungi | slittle1_: i've added your gerrit account as the initial member of the starlingx-public-keys-core group now. sorry about the delay | 17:14 |
fungi | group name seems to be 'System CLA' for reference | 17:15 |
fungi | right now the members are jenkins, openstack-project-creator, proposal-bot, release | 17:16 |
fungi | so that explains why manage-projects is normally able to do it | 17:16 |
*** jpena is now known as jpena|off | 17:17 | |
fungi | in future, i'll just add my fungi.admin account to that group and see if it solves the cla rejection | 17:17 |
opendevreview | You-Sheng Yang proposed opendev/git-review master: Allow specifying remote branch at listing changes https://review.opendev.org/c/opendev/git-review/+/872988 | 17:40 |
opendevreview | Jeremy Stanley proposed opendev/system-config master: Better diag for Gerrit server connection limit https://review.opendev.org/c/opendev/system-config/+/872989 | 17:55 |
noonedeadpunk | Hey there. We've got an issue with https://github.com/openstack/diskimage-builder/commit/2c4d230d7a09ad8940538338dfdc8bc3212fbc20 and rocky9 | 17:56 |
noonedeadpunk | I do see that cache-url should be included by cache-devstack and it seems present https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L186 | 17:57 |
noonedeadpunk | But couple of days ago we started seeing retry_limits as curl-minimal seems to be pre-installed there | 17:57 |
opendevreview | Clark Boylan proposed opendev/system-config master: Update our base python images https://review.opendev.org/c/opendev/system-config/+/873012 | 17:59 |
fungi | noonedeadpunk: my first guess is this is related to the same base image cleanup changes which also stopped preinstalling python2.7 | 18:01 |
clarkb | there was a different change to stop installing curl because of rocky | 18:01 |
noonedeadpunk | aha | 18:01 |
clarkb | apparently rocky installs one of the curl packages and centos the other. And they conflict and installing one won't automatically replace the other | 18:01 |
clarkb | ianw's solution to this was to stop preinstalling anything and let the distro defaults decide | 18:02 |
noonedeadpunk | Yes, I like this solution | 18:02 |
clarkb | if your problems are with building images using dib then update to latest dib and I think you'll be good? | 18:02 |
noonedeadpunk | um. my problem is in CI images | 18:04 |
clarkb | ok you may need to describe the issue in more detail. Is curl-minimal (the rocky default iirc) not working for your jobs? | 18:04 |
noonedeadpunk | Aha, ok, I misunderstood the commit then | 18:04 |
clarkb | basically the dib change is saying don't explicitly try to install curl because the two rhel like distros we build conflict in their curl choice and the dib package management doesn't have a way to express "uninstall this first" I think | 18:05 |
noonedeadpunk | ok, yes, you're right. I was thinking some pre-install of curl still happens with that, but it's a wrong assumption. Thanks for clarifying that! | 18:06 |
noonedeadpunk | uh, why distros like to complicate things that much.... | 18:07 |
clarkb | heres fedora's justification https://fedoraproject.org/wiki/Changes/CurlMinimal_as_Default | 18:08 |
noonedeadpunk | 8mb on the diskspace... | 18:10 |
opendevreview | Elod Illes proposed openstack/project-config master: Remove add_master_python3_jobs.sh https://review.opendev.org/c/openstack/project-config/+/873016 | 18:12 |
clarkb | noonedeadpunk: what protocol do you use? | 18:26 |
clarkb | that doc implies the common ones should be in minimal | 18:26 |
fungi | looks like system-config-promote-image-assets and system-config-promote-image-gitea failed deploy for 869091 and so infra-prod-service-gitea was skipped | 18:30 |
noonedeadpunk | clarkb: shouldn't be anything outside of the minimal | 18:31 |
noonedeadpunk | I just misunderstood what that patch did | 18:31 |
noonedeadpunk | I assumed that there will be no curl as a result of it vs default provided one from minimal | 18:32 |
clarkb | I see | 18:32 |
fungi | okay, both failures were for the same thing ianw observed earlier: "Task Delete all change tags older than the cutoff failed running on host localhost" | 18:33 |
fungi | "the output has been hidden due to the fact that 'no_log: true' was specified for this result" but it's a uri call | 18:33 |
fungi | so maybe dockerhub changed their api or something | 18:33 |
clarkb | fungi: there was a change to zuul-jobs that I wrote to fix a linter rule to the calculation of the cutoff | 18:34 |
clarkb | fungi: its possible that is broken (and the linter was wrong?) | 18:34 |
clarkb | but that seems unlikely since all I did was remove the {{ }}s from a when: entry | 18:34 |
clarkb | and we successfully published gerrit images yesterday which was after the linter fix | 18:35 |
clarkb | (unless that job failed too and we didn't notice?) | 18:35 |
ianw | clarkb: i think https://review.opendev.org/c/zuul/zuul-jobs/+/872842 will help, but i need to look at the linter failure... | 19:08 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 19:10 |
opendevreview | Ade Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role https://review.opendev.org/c/zuul/zuul-jobs/+/866881 | 19:29 |
opendevreview | Merged opendev/system-config master: Update our base python images https://review.opendev.org/c/opendev/system-config/+/873012 | 19:52 |
ianw | https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 adds a pre playbook to install python/python-dev, which i think should get the 27 jobs back to where they were | 20:01 |
ianw | mea culpa on that one ... i wasn't even vaguely thinking about python2.7 | 20:02 |
fungi | you're living the dream! i'll be thrilled when i can stop thinking about python 2.7 | 20:02 |
fungi | envious | 20:02 |
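_(A hedged sketch of what that pre playbook boils down to on a debian-family node; the real change is in the linked review, and package availability depends on the node's distro release.)_

```shell
# Hedged sketch: put the python2.7 interpreter and headers back on the node.
sudo apt-get update
sudo apt-get install -y python2.7 python2.7-dev
```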
clarkb | apparently I'm not logged into gerrit so will have to +2 after lunch. I'm starving | 20:02 |
fungi | i'm testing the docker upgrade on the held mm3 node at 213.32.76.66 | 20:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 20:31 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: double-quote regexes https://review.opendev.org/c/zuul/zuul-jobs/+/873028 | 20:31 |
fungi | currently has docker-ce 5:20.10.23~3-0~ubuntu-jammy | 20:31 |
fungi | The following packages will be upgraded: docker-ce docker-ce-cli libssl3 openssl | 20:32 |
fungi | guess this was still missing the openssl security fix | 20:32 |
fungi | now running docker-ce 5:23.0.0-1~ubuntu.22.04~jammy | 20:33 |
fungi | the containers take a minute to get back to a working state | 20:33 |
fungi | still not up though. i wonder if i should have downed the containers before upgrading | 20:34 |
fungi | error: exec: "apparmor_parser": executable file not found in $PATH | 20:37 |
fungi | d'oh! | 20:37 |
fungi | working slightly better after installing apparmor | 20:37 |
fungi | ianw: your procedure on the pad starts with "stop docker" does that mean stopping dockerd or downing the containers or what? | 20:38 |
fungi | unfortunately, after restarting the containers on the held node, i'm getting a server error trying to browse the sites | 20:44 |
fungi | i'll need to dig the cause out of logs, i expect | 20:44 |
fungi | docker container logs indicate everything started up fine and apache logs say i'm getting a 200/ok response, but the page content is "Server error: An error occurred while processing your request." | 20:49 |
fungi | so i guess that's coming from the backend | 20:49 |
clarkb | fungi: I suspect downing the containers is sufficient. Then let the package upgrade restart the daemon | 20:50 |
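_(A hedged sketch of that ordering, folding in the apparmor package fungi found missing earlier; the compose directory is a placeholder.)_

```shell
# Hedged sketch: stop the service containers, upgrade docker (letting the
# package restart dockerd), then bring the services back up.
cd /path/to/compose-dir
docker-compose down
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli apparmor
docker-compose up -d
```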
fungi | https://paste.opendev.org/show/bBrbtHgLEhvJkwWELKB5/ | 20:54 |
fungi | this may be something still not quite right with the django site setting | 20:54 |
fungi | and maybe the container restart in the job isn't sufficient to exercise it | 20:55 |
fungi | i'll have to continue investigating after dinner or in the morning | 20:55 |
clarkb | ya that seems unlikely to be related to docker itself if the software is executing under docker | 20:55 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 20:57 |
clarkb | I guess https://review.opendev.org/c/opendev/system-config/+/873012 would've been hit by the broken promotions too? | 20:58 |
fungi | likely | 21:00 |
clarkb | ianw: fungi how new of an apt do you need for https://review.opendev.org/c/opendev/system-config/+/872808/1/playbooks/roles/install-docker/templates/sources.list.j2 to work? | 21:00 |
fungi | oh, right that may not work on xenial | 21:01 |
ianw | clarkb: i think it was 1.4, which is quite old | 21:01 |
ianw | do we have xenial docker? | 21:04 |
clarkb | I don't think so | 21:04 |
clarkb | so ya it may be fine. Also I think gerritbot may be out to lunch. I updated https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 to fix an issue and no notification here | 21:05 |
clarkb | or maybe its a different event for ps created by the web tool? | 21:05 |
opendevreview | Merged zuul/zuul-jobs master: promote-docker-image: double-quote regexes https://review.opendev.org/c/zuul/zuul-jobs/+/873028 | 21:08 |
slittle1_ | just got around to trying the modified ssh settings as suggested on this forum. No improvement | 21:10 |
slittle1_ | Running: git review -s | 21:10 |
slittle1_ | Problem running 'git remote update gerrit' | 21:10 |
slittle1_ | Fetching gerrit | 21:10 |
slittle1_ | kex_exchange_identification: read: Connection reset by peer | 21:10 |
slittle1_ | fatal: Could not read from remote repository. | 21:10 |
clarkb | thats a different error though right? | 21:11 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 21:11 |
clarkb | or maybe the earlier error was truncated to leave off the kex info | 21:11 |
clarkb | slittle1_: if the control persistence is working properly ps should show you an ssh process ControlMaster in its command line. Something like `ps -elf | grep ControlMaster` should find it | 21:14 |
slittle1_ | ssh_exchange_identification vs kex_exchange_identification, yes slightly different | 21:15 |
slittle1_ | ps -elf | grep ControlMaster ... nothing found | 21:16 |
clarkb | another thing to check is the number of connections you have open to gerrit `ss --tcp --numeric | grep 199.204.45.33` assuming ipv4 | 21:16 |
clarkb | ok I think it may not be using the ControlMaster then | 21:17 |
clarkb | what ControlMaster does is create a process that keeps a single ssh connection open, then subsequent ssh connections talk through that connection. It's also possible that MINA doesn't deal with multiple logical connections over one tcp connection properly. | 21:17 |
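_(A hedged one-liner to confirm whether a control master is actually active for the Gerrit destination; `<user>` is a placeholder, and this only reports usefully if a ControlPath is configured.)_

```shell
# Hedged sketch: ask ssh whether a master connection exists for this destination.
ssh -O check -p 29418 "<user>@review.opendev.org"
```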
slittle1_ | ss --tcp --numeric | grep 199.204.45.33 ... no output | 21:20 |
ianw | slittle1_ : are you coming from an ip starting with 128.224 ? | 21:20 |
slittle1_ | yes, probably 128.224.252.2 | 21:21 |
ianw | that's interesting | 21:21 |
ianw | $ ssh -p 29418 ianw.admin@localhost "gerrit show-connections -w" | grep 128.224.252.2 | wc -l | 21:21 |
ianw | 18 | 21:21 |
ianw | gerrit shows 18 open connections | 21:21 |
fungi | note that https://review.opendev.org/872989 is aimed at making that easier to debug too | 21:21 |
ianw | but, they have a blank user | 21:22 |
clarkb | fungi: that will log to syslog right? | 21:23 |
ianw | oh, i have deja vu on the blank user thing | 21:23 |
slittle1_ | hmmm, I have | 21:23 |
fungi | clarkb: yes | 21:23 |
slittle1_ | Host * | 21:23 |
slittle1_ | ServerAliveInterval 60 | 21:23 |
ianw | i think we see it when the session is open but not logged in | 21:23 |
slittle1_ | in .ssh/config | 21:23 |
fungi | netstat -nt reports 20 sockets from that address at the moment too | 21:23 |
clarkb | ianw: ya you need successful login for gerrit to attach the user | 21:23 |
slittle1_ | is that .ssh/config entry a concern ? | 21:24 |
clarkb | slittle1_: shouldn't be. The ServerAliveInterval creates extra packets on existing connections | 21:25 |
clarkb | but it shouldn't make new connections | 21:25 |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability https://review.opendev.org/c/zuul/zuul-jobs/+/872842 | 21:26 |
slittle1_ | clarkb: yes I believe so. Added it ages ago to keep ssh sessions alive a bit longer. | 21:26 |
clarkb | slittle1_: one side effect it could have though is keeping unneeded connections open for longer potentially | 21:27 |
clarkb | usually thats on the server side though as it wants to make sure it has received all of the bits from the client before shutting down completely | 21:28 |
slittle1_ | I wanted interactive ssh sessions to stay open when I took a call or was otherwise distracted for a bit | 21:28 |
clarkb | hrm I wonder if that means you've also got network gear that is shutting down connections between you and the remote | 21:29 |
clarkb | typically an ssh connection will stay open | 21:29 |
clarkb | but that could also impact the server side connection count if it thinks you've got a bunch of connections open that some firewall has killed without telling the server | 21:29 |
slittle1_ | I'll add a domain rather than applying to all connections | 21:29 |
slittle1_ | ok, that shouldn't impact opendev connections any longer | 21:31 |
slittle1_ | Gotta get kids, I'll test again tomorrow | 21:32 |
slittle1_ | Thanks for the help | 21:32 |
opendevreview | Jay Faulkner proposed opendev/infra-manual master: Add documentation on removing human user from pypi https://review.opendev.org/c/opendev/infra-manual/+/873033 | 21:35 |
clarkb | I have requested access to the gerrit community meeting agenda doc | 22:02 |
clarkb | fungi: for gitea I wonder if the easiest thing is to just land a noop change to the dockerfile? | 22:16 |
fungi | i recall netscreen firewalls by default silently dropped "idle" tcp sessions after 120 seconds, so on my machines at the office where they used one of those i set my ssh client configuration for a 60 second protocol level keepalive | 22:16 |
clarkb | similar to what I did for the python base images? | 22:16 |
clarkb | (and maybe push a similar change again for the python base images?) | 22:16 |
clarkb | though iwth the base images we don't need to coordinate the deploy pipeline so maybe we can just reenqueue | 22:16 |
fungi | though ssh keepalive also depends on the server side having support implemented. no idea if mina-sshd does | 22:16 |
fungi | clarkb: the donor logos deployment isn't urgent. i'm fine waiting for whatever we end up needing to merge to the gitea server images down the road to trigger a new deployment | 22:17 |
*** dasm|rover is now known as dasm|off | 22:40 | |
opendevreview | Ian Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path https://review.opendev.org/c/zuul/zuul-jobs/+/872806 | 22:42 |
dasm|off | thanks fungi for the email update and "stable branches" py27 solution! | 22:44 |
fungi | dasm|off: ianw is to thank for pushing the change there | 22:46 |
fungi | i'm just trying to keep people apprised of what's going on | 22:46 |
ianw | well i also broke it ... so :) | 23:14 |