Tuesday, 2023-02-07

JayFsimilarly failed00:00
ianwi can propose a change with enableChannelIdTracking=false.  i think we should file another upstream bug report though, as i can't find a recent one00:01
opendevreviewJeremy Stanley proposed opendev/system-config master: Upgrade to latest Mailman 3 releases  https://review.opendev.org/c/opendev/system-config/+/86921000:02
fungiif it helps, i was able to push that ^ easily and quickly00:03
JayFperhaps it's related to the size of the repo?00:03
JayFI'm trying to push literally a three character logging fix for neutron lol00:03
fungiopenssh-client 1:9.1p1-2 from debian/sid00:03
ianwJayF's push looks like00:04
ianwgit-upload-pack./openstack/neutron.git 4ms 13044ms '4ms 16ms 0ms 111ms 0ms 11950ms 12077ms -1 3658 5467 3462123' 0 - 751ms 740ms 15226781600:04
ianwi have no idea what all those numbers mean00:04
fungii was pushing a change for the same repo where ianw saw his error00:04
ianwfungi: yeah, i re-pushed and didn't see the error -- but also that was a different error to jayf00:04
fungiahh, okay00:05
JayFmtr to review.opendev.org makes it look like cogent:comcast links ipv6 are doing bad00:05
JayFI wonder if this is internet shenanigans to a degree to impact this00:05
ianwit does seem to suggest that slow (or maybe packet-dropping-ish) links and big repos trigger it00:05
* JayF puts a v4 address in his hosts file and tries a thing00:05
ianwcan you try ipv4?00:05
clarkbI don't think the wikimedia workaround is still a valid option (possibly due to the switch to MINA)00:06
JayFlooks like the issues persist over v4, but so do the packet drops upstream of me in MTR00:06
ianwalthough in the overall list of people seeing this, there's plenty of ipv4 addresses00:06
ianwclarkb: i thought that change was for mina -- not nio2 or whatever it used to be?00:07
* JayF stepping away, but will happily retry if ping'd00:07
clarkbianw: hrm it may have been but looking at current valid sshd options for gerrit that flag is not in the list00:08
ianwi think it's https://gerrit-review.googlesource.com/c/gerrit/+/23838400:08
ianw"For more safety, protect this experimental feature behind undocumented00:08
ianwconfiguration option, but enable this option per default.00:08
clarkbheh I'm not sure I like that approach. Secret config options00:09
clarkbbut undocumented so we don't know what they do00:09
ianwhttps://issues.apache.org/jira/browse/SSHD-942 seems to be upstream discussion00:11
clarkbfwiw I can git remote update in neutron and nova without issue00:12
JayF`git fetch gerrit` also worked for me immediately before I started having this issue00:12
ianw"I thought some more about this and what I have suggested might "mask" out protocol problems that are not related to latency or out-of-order messages - e.g. a client that does not adhere to the protocol 100% and is indeed sending messages when they are not expected."00:13
ianwi think we can rule out the client not adhering to the protocol00:13
ianwso that really seems to leave out-of-order messages00:13
JayFwhich matches my observation of packet loss in the internet route to gerrit00:14
clarkbI don't cross cogent. I'm going through wholesail and beanfield00:14
JayFI'm going to tether.00:14
clarkbthought I'd try due to adjacency to JayF but our ISPs seem to share completely different sets of peering00:14
ianwi guess the question is -- is this an error that gerrit is quitting, or would it be failing anyway and gerrit is just telling us00:14
JayFI have comcast and am currently on some kind of wait list for centurylink :( 00:14
clarkbianw: with MINA it's hard to say unless you pull out your ssh protocol rfc and start reading :( this is what ultimately turned me into a java dev00:15
clarkbfair warning00:15
JayFso, it worked00:16
JayFfrom my tether00:16
ianw... interesting ...00:17
fungimy v4 and v6 routes seem to go from my broadband provider, through beanfield, and into vexxhost. no other backbones00:17
ianwJayF: if you have time, i wonder if we might catch a tcpdump of a failure to your IP?  it might show us out-of-order things?00:18
JayFso my hypothesis: gerrit is currently extremely intolerant of packet loss / out of order packets00:18
JayFianw: I can tell you with nearly 100% certainty from client behavior: I went from "some retransmits" to "zero retransmits" once I tethered00:18
fungithe tcp/ip stack on the server should normally reassemble streams into the correct packet order00:18
JayFI actually need to head out for now though, sorry you all, good luck o/00:19
ianw"This is old issue, but we are getting reports, that the problem shows up even with the custom UnknownChannelReferenceHandler. And even could disappear if using default DefaultUnknownChannelReferenceHandler instance according to this gerrit configuration option:00:19
ianwsshd.enableChannelIdTracking = false."00:19
ianwJayF: no worries, thanks :)00:20
fungibut if there's massive packet loss too, just about anything will get cranky00:20
ianwthat's hard to parse, but it seems to be saying that "we're seeing this, and users are not seeing it if they turn off enableChannelIdTracking"00:20
clarkbianw: right, but what does that lose us and why does gerrit think it is very unsafe so much so to not document it00:21
clarkbI'm good with trying it if we understand the impacts and the risks are reasonable00:21
clarkbI need to switch gears to getting a meeting agenda out before dinner. Holler if this ends up needing more eyeballs00:23
ianwyeah i'm just trying to parse all these changelogs etc00:28
clarkbok I've finally managed to update the wiki agenda contents. I'll give it a few if anyone wants to add anything00:55
*** dasm|rover is now known as dasm|out00:55
clarkbianw: do we want the ssh thing on there?00:55
ianwclarkb: umm, i'm writing up a bug report, we can add it to see what, if anything, gets said about it00:56
clarkbsounds good00:56
fungi869091 is back to failing system-config-run-gitea even though 872801 merged a few hours ago01:00
clarkbit passed on the apparmor fix change01:00
clarkbthe run-gitea job I mean (that doesn't get us your logo stuff though)01:00
fungiyeah, i'll get logs pulled up here in a few and see why01:04
fungioh, nevermind, i was looking at old results i think01:06
fungiit passed when i rechecked after the image fix landed01:06
fungiclarkb: is this more what you were expecting? http://storage.gra.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_0b6/869091/11/check/system-config-run-gitea/0b667a3/bridge99.opendev.org/screenshots/gitea-main.png01:08
clarkbyup I think that flows a lot better01:08
clarkblet me rereview that change01:08
clarkbthen once ianw has a bug report link I can add that to the agenda and get that sent01:09
fungionce we get the donor logos addition landed, we're in a good place to talk about which other logos we might want to add to the list01:09
clarkbI +2'd it01:10
fungithere's a good case for adding osuosl since they host part of our control plane (nb04)01:11
ianwok, https://github.com/apache/mina-sshd/issues/31901:21
ianwhappy for feedback if it makes sense ...01:22
ianwafter pulling that apart, i am unsure if turning the option off will make a difference01:22
ianwit seems like it will just think the channel isn't open and fail.  but i guess the only way would be to test01:23
clarkbok agenda sent01:31
opendevreviewIan Wienand proposed zuul/zuul-jobs master: ensure-skopeo: fixup some typos  https://review.opendev.org/c/zuul/zuul-jobs/+/87273303:04
opendevreviewIan Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/87280603:05
opendevreviewMerged zuul/zuul-jobs master: ansible-lint: uncap  https://review.opendev.org/c/zuul/zuul-jobs/+/87249503:29
opendevreviewMerged zuul/zuul-jobs master: build-docker-image: fix change prefix  https://review.opendev.org/c/zuul/zuul-jobs/+/87225803:39
opendevreviewMerged zuul/zuul-jobs master: container-roles-jobs: Update tests to jammy nodes  https://review.opendev.org/c/zuul/zuul-jobs/+/87237503:39
ianwFeb  7 03:46:24 graphite02 systemd[1]: systemd-fsckd.service: Succeeded.03:45
ianwFeb  7 03:39:23 graphite02 systemd-timesyncd[462]: Initial synchronization to time server [2620:2d:4000:1::40]:123 (ntp.ubuntu.com).03:45
ianwtime went a fair way backwards on graphite02 when i restarted it03:45
ianw(i noticed this because the container was saying it was created "less than a second ago" for ... more than a second)03:46
ianwi so far haven't noticed this on any other servers i've restarted ... it does suggest though that ntp wasn't working 03:48
opendevreviewVishal Manchanda proposed openstack/project-config master: Retire xstatic-font-awesome: remove project from infra  https://review.opendev.org/c/openstack/project-config/+/87283503:58
*** yadnesh|away is now known as yadnesh04:07
ianwkeycloak also wasn't happy on restart.  i tried a turn-it-off-and-on-again approach and it seems ok now04:07
ianwit's also happened on paste04:14
ianwFeb  7 04:20:26 paste01 systemd[1]: systemd-fsckd.service: Succeeded.04:14
ianwFeb  7 04:12:40 paste01 systemd-timesyncd[413]: Initial synchronization to time server [2620:2d:4000:1::41]:123 (ntp.ubuntu.com).04:14
ianwso the gerrit promote failed05:01
ianwall failed in "promote-docker-image: Delete all change tags older than the cutoff"05:03
opendevreviewMerged zuul/zuul-jobs master: ensure-skopeo: fixup some typos  https://review.opendev.org/c/zuul/zuul-jobs/+/87273305:19
fungiianw: the tag delete happens after the new tag is done, right? if so, that's benign at least05:25
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284205:48
ianwfungi: yep, that's right.  i think something like ^ will help with this05:49
ianwi'll restart gerrit with that under new docker05:50
ianw(that is done)05:52
*** ysandeep is now known as ysandeep|ruck06:00
ianwhttps://etherpad.opendev.org/p/docker-23-prod is about half done.  the rest are probably more suited to an ansible loop which i can look at later06:14
ianwfungi: ^ maybe you could look at the mailman ones? 06:15
opendevreviewMerged openstack/project-config master: nodepool: infra-package-needs; cleanup tox installs  https://review.opendev.org/c/openstack/project-config/+/87247806:26
*** jpena|off is now known as jpena08:24
*** gthiemon1e is now known as gthiemonge10:38
*** ysandeep__ is now known as ysandeep|break11:36
*** soniya29 is now known as soniya29|afk11:39
*** yadnesh is now known as yadnesh|away12:22
*** ysandeep|break is now known as ysandeep|ruck12:22
fungiianw: i have a held mailman 3 server i can test on first too12:45
fungithe mailman 2 servers aren't dockerized anyway12:45
ysandeep|ruckopenstack-tox-py27 on many projects failing with ERROR:  py27: InterpreterNotFound: python2.713:44
*** dasm|out is now known as dasm|rover13:53
fungiysandeep|ruck: when did it start? was it coincident with tox v4 releasing at the end of 2022, or when we switched our default nodeset to ubuntu-jammy earlier last year?14:06
fungior was it more recent?14:06
fungii assume this is for older openstack stable branches, since openstack hasn't supported python 2.7 for a number of releases now14:06
ysandeep|ruckfungi, started yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=0 14:06
ysandeep|ruckyes hitting on stable branches14:07
fungiinteresting, i should be able to take a closer look in a few minutes14:07
slittle1_StarlingX release scripts always have issues pushing new branches or tags across all our gits.  It starts out ok, but eventually we start seeing 'connection reset by peer'.   It feels like we are triggering some sort of spam/bot protection.14:15
slittle1_or ....14:16
slittle1_ssh_exchange_identification: read: Connection reset by peer14:16
slittle1_fatal: Could not read from remote repository.14:16
slittle1_just now14:16
fungislittle1_: we observed this from some locations starting late yesterday. it appears cogent (the internet backbone provider) is having some issues, and connections traversing their network are suffering14:53
fungimy connection doesn't get routed through cogent and as a result things are quite snappy14:54
fungiit's possible vexxhost (where those servers are housed) could do some massaging of bgp announcements to make cogent seem less attractive14:54
Clark[m]But we do also have connection limits by IP and user account14:54
fungioh, yes that's also a potential culprit14:55
Clark[m]fungi: ysandeep|ruck I think we may have dropped python2 from our images expecting bindep to cover that.14:55
fungii think it's 64 simultaneous connections from the same gerrit account or 100 concurrent sockets from the same ip address14:56
fungiif you exceed either of those your authentication gets denied or your connection gets refused, respectively14:56
fungithough the cogent issue could also be compounding that if sockets are getting left hanging14:56
fungislittle1_: by always have issues i assume you mean since longer than just the past 24 hours14:57
fungiso Clark[m]'s point about connection limits seems more likely in this case14:57
*** dasm|rover is now known as dasm|afk15:15
opendevreviewMerged opendev/system-config master: Feature our cloud donors on opendev.org  https://review.opendev.org/c/opendev/system-config/+/86909115:15
jrosserfwiw some connectivity issue to opendev.org is recurring again now, i see the same very low throughput as i had a few weeks ago15:18
jrosserand just the same going to another host with different peering but geographically similar, it's all good15:18
fungijrosser: yeah, JayF mentioned seeing similar at 23:37 utc15:23
fungipresumably traversing cogent in your traceroute?15:23
jrosserfungi: looks like zayo again for me as the closest thing to opendev.org15:26
ysandeep|ruckClark[m], >> I think we may have dropped python2 from our images expecting bindep to cover that. -- is it possible to confirm that theory?15:27
Clark[m]Yes look at the change log for openstack/project-config to see if the proposed changes landed15:28
fungijrosser: it may be an asymmetric route too (packets to vexxhost via zayo, replies coming back across cogent). i'd have to traceroute to your address from the load balancer to confirm though15:28
fungior from the gerrit server i guess, though any server there likely follows the same routes15:28
ysandeep|ruckClark[m], i see https://github.com/openstack/project-config/commit/95ff98b54b1095dae6591ac273777ff79834887d15:28
fungimight have been "nodepool: infra-package-needs; cleanup tox installs" https://review.opendev.org/c/openstack/project-config/+/87247815:29
fungithat merged 06:26:24 utc today15:29
fungiwould have taken effect the next time images were built15:30
ysandeep|ruckbut we are seeing issue since yesterday: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-py27&project=openstack%2Ftripleo-heat-templates&skip=015:31
Clark[m]https://review.opendev.org/c/openstack/project-config/+/872476/2 this one15:31
fungioh, yep. we stopped installing python-dev15:32
fungiand python-xml, either of which would have pulled in python2.715:32
ysandeep|ruckdasm|afk, ^^ I will be out soon, please follow the chatter 15:32
fungithat merged 00:16:31 yesterday15:33
fungiso images built yesterday would have no longer had python2.7 present on them by default15:33
clarkband confirmed that bindep.txt lacks python in it https://opendev.org/openstack/tripleo-heat-templates/src/branch/stable/train/bindep.txt15:35
clarkbthe glance and neutron bindep.txt files are similar15:36
clarkbnova and swift are fine15:36
clarkbI think we have ~3 options: A) revert/or readd python2-dev installs B) ask projects to fix their broken bindep.txts (something we've asked for for years at this point) or C) update -py27 jobs to explicitly install python2 for you15:37
slittle1_fungi:  By always I mean every 3-6 months when we want to create a branch or tag on all the StarlingX gits.15:39
clarkbin that case there is a good chance you are hitting the IP address limits or user account limits if you run things in parallel15:43
fungislittle1_: how are you creating them and how many? are you doing it in parallel? with a single account?15:43
fungiif they're all done from a single account and it's authenticating more than 64 operations at the same time, that could be it15:44
clarkband even if it never fully gets to the IP limit tcp stacks will often keep connections open for a bit once they are "done"15:45
clarkbso you might only do 60 pushes and be fine from an auth perspective but if you've got a tcp stack holding the connections open for a few extra minutes then eventually you can trip over the tcp connection limit15:45
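The effect described above, where connections that are already "done" still count against the limit for a while, can be illustrated with a small simulation. This is only a sketch under assumed numbers: the workload (300 connections opened one per second) and the linger times are illustrative guesses rather than measurements from the Gerrit server, and the function name `peak_tracked` is hypothetical.

```python
def peak_tracked(open_times, linger):
    """Peak number of connections tracked at once, if each connection
    opened at time t stays tracked until t + linger (seconds)."""
    events = []
    for t in open_times:
        events.append((t, 1))            # connection opens
        events.append((t + linger, -1))  # connection finally drops out
    # At the same instant, process closes (-1) before opens (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


# Assumed workload: 60 repos x 5 connections, one new connection per second.
opens = [float(i) for i in range(300)]
print(peak_tracked(opens, linger=120.0))
print(peak_tracked(opens, linger=60.0))
```

With these assumed numbers a two-minute linger peaks at 120 tracked connections, above the limit of 100 mentioned in the discussion, while a one-minute linger peaks at 60.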
slittle1_We use 'repo forall' to iterate through our gits serially.  perhaps 60 gits in all, For each there would be a 'git fetch --all', a 'git review -s', a 'git push ${review_remote} ${tag}', a 'ssh -p ${port} ${host} gerrit create-branch ${path} ${branch} ${tag}' and a 'git review --yes --topic=${branch/\//.}'15:46
clarkbrepo isn't serial though right?15:47
clarkb(I'm not super familiar with repo, but my understanding is that it tries to setup large systems like android in parallel. Something google can get away with but not our gerrit)15:47
slittle1_'repo forall' is serial by default15:47
fungia quick count of starlingx/ namespace projects returns 57 at the moment, so the number that have branches being created is probably lower15:48
slittle1_We branch nearly all of them at the same time15:48
fungiin that case it's more likely the ip connection limit (i'm pretty sure the connection refused response would point to that, as the account connection limit would be more likely to report a different error message)15:49
fungiit may be that the connections are getting created too quickly in succession and not enough time is being allowed for the server's tcp/ip stack to close them down15:49
clarkbfungi: yup that is my suspicion15:50
*** dasm|afk is now known as dasm|rover15:50
clarkbeach of those commands will create new tcp connections and they may idle for a few minutes15:50
fungiare those commands being issued by someone behind a nat shared by other clients that may also be connecting? if so that could also make it more likely15:50
clarkbper repo its at least 5 connections (I think more)15:50
slittle1_script runtime would probably be 5 min if it could get through without errors15:50
clarkbafter 20 repos you'll be in ip blocking territory if things haven't cleaned up quickly enough15:51
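The "after 20 repos" estimate follows from the figures quoted in the discussion. A minimal sketch of the arithmetic, assuming roughly 5 connections per repo, a limit of 100 tracked connections per IP, and the pessimistic case that none of the earlier connections has been cleaned up yet (the function name is hypothetical):

```python
import math

PER_IP_LIMIT = 100   # concurrent connections per IP, as quoted in the log
CONNS_PER_REPO = 5   # "at least 5 connections" per repo, per the log


def repos_until_limit(conns_per_repo=CONNS_PER_REPO, limit=PER_IP_LIMIT):
    """Number of repos processed by the time the tracked connection
    count reaches the limit, assuming (pessimistically) that none of
    the earlier connections has been torn down yet."""
    return math.ceil(limit / conns_per_repo)


print(repos_until_limit())
```

If each repo actually needs more than 5 connections, the limit is reached sooner, which is consistent with the "(I think more)" caveat.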
ysandeep|ruckClark[m]: thanks, as the issue is affecting multiple repos and older stable branches, I will let infra decide which option is better atm before adding it to bindep.txt (for the tripleo repo)15:51
slittle1_not sure about the 'nat' question.   I'd have to chase down our own IT guys for that one15:51
clarkbif your local IP isn't a publicly routable IP then most likely this is the case15:52
jrosseris all of that over ssh? perhaps some ControlPersist could help?15:52
fungiif you know what ip address it's coming from i may be able to find some evidence in syslog/journald (i can't remember if those conntrack limit rules log)15:52
slittle1_there is also a retry loop, 45 sec delay, max 5 retries, when one of the above commands fail.15:52
slittle1_it's not publicly routable15:53
*** ysandeep|ruck is now known as ysandeep|out15:54
fungislittle1_: i mean the public address it's getting rewritten to on exiting your network15:54
fungianyway, this is the rule we're suspecting is being tripped: https://opendev.org/opendev/system-config/src/branch/master/inventory/service/group_vars/review.yaml#L415:55
fungiin theory it would stop blocking as soon as the tracked connections falls below 10015:55
clarkbjrosser's suggestion may be one to try since that will reduce total ssh connections. I'm not sure how to set that up with git though. Might need to use GIT_SSH?15:56
clarkbor maybe you can edit your ssh config to manipulate that15:56
jrossergoogle took me here https://docs.rackspace.com/blog/speeding-up-ssh-session-creation/15:57
fungior try throttling down the loop and see if that helps give earlier connections time to clean up15:57
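Throttling the loop could look roughly like the sketch below. It is an illustration only, not the StarlingX release script: the retry count and 45-second delay mirror the numbers mentioned earlier in the log, the 10-second pause between commands is an arbitrary assumption, and both function names are hypothetical.

```python
import subprocess
import time


def run_with_retry(cmd, retries=5, delay=45.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying on a non-zero exit status.

    Returns True on success, False once all attempts have failed.
    """
    for attempt in range(1 + retries):
        if run(cmd).returncode == 0:
            return True
        if attempt < retries:
            sleep(delay)  # wait before the next attempt
    return False


def run_throttled(commands, pause=10.0, sleep=time.sleep, **retry_kwargs):
    """Run commands one at a time, pausing after each so that earlier
    connections have a chance to close before the next one opens."""
    results = []
    for cmd in commands:
        results.append(run_with_retry(cmd, sleep=sleep, **retry_kwargs))
        sleep(pause)
    return results
```

The `run` and `sleep` parameters exist only so the behaviour can be exercised without real network calls; in normal use the defaults apply, for example `run_throttled([["git", "fetch", "--all"]])`.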
jrosserif git is using the underlying ssh config (which i think it does) then you should be able to use any of the usual things in your ~/.ssh/config15:58
clarkbya it should15:58
fungiwe probably need to add -j LOG to that rule if we want to be able to find evidence of these events after the fact15:59
fungithough i have no idea if there may be ssh brute-forcing attempts constantly hitting that which could fill the logs quickly16:00
slittle1_So I try ....16:04
slittle1_Host *.opendev.org16:04
slittle1_   ControlMaster auto16:04
slittle1_   ControlPersist 60016:04
slittle1_Sound ok ?16:04
funginetfilter docs mention /proc/net/nf_conntrack which doesn't seem to exist, i wonder if this kernel version exposes that somewhere else16:05
fungifound /proc/sys/net/netfilter/nf_conntrack_*16:07
clarkbslittle1_: yes that seems correct, I'd probably not use a * though and keep in mind that this will be shared with all ssh things to the identified server (shouldn't be an issue but calling it out)16:37
fungiclarkb: where would be the best place to add the conntrack package (provides tools for inspecting the kernel's connection tracking tables since i can't find them in /proc) on the gerrit server? not the docker image obviously it's for the underlying server. should we just install it on all our servers? there's the iptables role but it's hard-coded to only install one package and it's also16:37
fungimultidistro which i doubt we need any longer16:37
clarkbfungi: do we know if the jeepyb fix from yesterday was sufficient to make the daily job run?16:37
clarkbfungi: the ansible gerrit role is where I'd do it if you want it specific to gerrit. Otherwise in our base package list for all servers16:38
fungioh, right, i was meaning to rerun manage-projects today16:39
clarkbI'm not sure if we auto pull the gerrit image or not which is what would be needed to have the daily run pick it up16:39
clarkbyou don't need to restart gerrit on the new image to use the new image for manage-projects at least16:40
fungimanage-projects for starlingx/public-keys seems to have worked this time16:41
fungiit pushed to refs/meta/config successfully according to the log16:42
fungithough failed after that:16:42
fungijeepyb.utils - INFO - Executing command: git --git-dir=/opt/lib/jeepyb/starlingx/public-keys/.git --work-tree=/opt/lib/jeepyb/starlingx/public-keys checkout master16:43
fungierror: pathspec 'master' did not match any file(s) known to git16:43
clarkbhttps://review.opendev.org/admin/repos/starlingx/public-keys,branches seems there in gerrit at least16:44
clarkbis debian's git init creating main by default now maybe?16:45
fungiwhen i `git init` on my up to date debian/sid workstation it says "hint: Using 'master' as the name for the initial branch. This default branch name is subject to change."16:48
clarkbyup I did a git init in the gerrit image and got the same result16:49
clarkbsame == master is default16:49
clarkbfungi: was there an earlier error? the checkout is in a finally block16:49
fungii'll see if i can find earlier errors but that was right after successfully pushing the refs/meta/config update. the log is at /var/log/manage-projects_2023-02-07T16\:39\:44+00\:00.log16:50
clarkbfungi: the other thing to consider is we may not be git initing anymore because the repo exists in gerrit so we'd be cloning instead?16:51
clarkbbut still master exist in the gerrit repo as HEAD16:51
fungiyeah, you can see the git clone16:52
clarkbya that is the problem I think. If you git clone it HEAD is master but there is nothing there16:52
clarkbyup I've reproduced by cloning and trying to checkout master16:52
fungiso the earlier failures left it in a dirty state16:52
clarkbI think this is a bug in jeepyb where if we don't handle it with the initial git init locally to push the .gitreview file to the master branch we're stuck16:52
fungiand rerunning won't recover from that16:52
clarkba workaround would be to push the .gitreview file manually. Or maybe even just propose a change for it?16:53
clarkbI'm not sure how to address this in jeepyb safely. maybe fallback to git init if we cannot checkout the default branch?16:54
clarkbI need to pop out before my next round of meetings to eat breakfast16:56
fungiwow, i can't push --force with my admin account unless it agrees to the icla17:06
fungii think that may be new?17:06
fungiprobably a missing permission in project bootstrappers17:06
fungii wonder how i agree to the icla with it since it has no webui access17:06
fungiTo ssh://review.opendev.org:29418/starlingx/public-keys17:07
fungi* [new branch]      master -> master17:07
fungii very briefly added my normal account to get around that17:08
fungii wanted to do it via a review, but git-review refused to use the master remote citing it did not exist (i guess because it was empty)17:08
fungimanage-projects runs clean now17:10
clarkbfungi: you need to add yourself to bootstrappers but then I think you can push?17:11
fungiright, you need to add an account that has agreed to the icla to bootstrappers17:11
clarkboh I thought any account in bootstrappers could do it. I guess maybe our creation account has signed it?17:12
fungiwe probably need to explicitly put our admin accounts in the system-cla group, or it's possible new gerrit has an additional perm we need to add to bypass cla enforcement17:12
clarkboh system-cla makes sense17:12
fungislittle1_: i've added your gerrit account as the initial member of the starlingx-public-keys-core group now. sorry about the delay17:14
fungigroup name seems to be 'System CLA' for reference17:15
fungiright now the members are jenkins, openstack-project-creator, proposal-bot, release17:16
fungiso that explains why manage-projects is normally able to do it17:16
*** jpena is now known as jpena|off17:17
fungiin future, i'll just add my fungi.admin account to that group and see if it solves the cla rejection17:17
opendevreviewYou-Sheng Yang proposed opendev/git-review master: Allow specifying remote branch at listing changes  https://review.opendev.org/c/opendev/git-review/+/87298817:40
opendevreviewJeremy Stanley proposed opendev/system-config master: Better diag for Gerrit server connection limit  https://review.opendev.org/c/opendev/system-config/+/87298917:55
noonedeadpunkHey there. We've got an issue with https://github.com/openstack/diskimage-builder/commit/2c4d230d7a09ad8940538338dfdc8bc3212fbc20 and rocky917:56
noonedeadpunkI do see that cache-url should be included by cache-devstack and it seems present https://opendev.org/openstack/project-config/src/branch/master/nodepool/nodepool.yaml#L18617:57
noonedeadpunkBut couple of days ago we started seeing retry_limits as curl-minimal seems to be pre-installed there17:57
opendevreviewClark Boylan proposed opendev/system-config master: Update our base python images  https://review.opendev.org/c/opendev/system-config/+/87301217:59
funginoonedeadpunk: my first guess is this is related to the same base image cleanup changes which also stopped preinstalling python2.718:01
clarkbthere was a different change to stop installing curl because of rocky18:01
clarkbapparently rocky installs one of the curl packages and centos the other. And they conflict and installing one won't automatically replace the other18:01
clarkbianw's solution to this was to stop preinstalling anything and let the distro defaults decide18:02
noonedeadpunkYes, I like this solution18:02
clarkbif your problems are with building images using dib then update to latest dib and I think you'll be good?18:02
noonedeadpunkum. my problem is in CI images18:04
clarkbok you may need to describe the issue in more detail. Is curl-minimal (the rocky default iirc) not working for your jobs?18:04
noonedeadpunkAha, ok, I misunderstood the commit then18:04
clarkbbasically the dib change is saying don't explicitly try to install curl because the two rhel like distros we build conflict in their curl choice and the dib package management doesn't have a way to express "uninstall this first" I think18:05
noonedeadpunkok, yes, you're right. I was thinking some pre-install of curl still happens with that, but it's a wrong assumption. Thanks for clarifying that!18:06
noonedeadpunkuh, why distros like to complicate things that much....18:07
clarkbheres fedora's justification https://fedoraproject.org/wiki/Changes/CurlMinimal_as_Default18:08
noonedeadpunk8mb on the diskspace...18:10
opendevreviewElod Illes proposed openstack/project-config master: Remove add_master_python3_jobs.sh  https://review.opendev.org/c/openstack/project-config/+/87301618:12
clarkbnoonedeadpunk: what protocol do you use?18:26
clarkbthat doc implies the common ones should be in minimal18:26
fungilooks like system-config-promote-image-assets and system-config-promote-image-gitea failed deploy for 869091 and so infra-prod-service-gitea was skipped18:30
noonedeadpunkclarkb: shouldn't be anything outside of the minimal18:31
noonedeadpunkI just misunderstood what that patch did18:31
noonedeadpunkI assumed that there will be no curl as a result of it vs default provided one from minimal18:32
clarkbI see18:32
fungiokay, both failures were for the same thing ianw observed earlier: "Task Delete all change tags older than the cutoff failed running on host localhost"18:33
fungi"the output has been hidden due to the fact that 'no_log: true' was specified for this result" but it's a uri call18:33
fungiso maybe dockerhub changed their api or something18:33
clarkbfungi: there was a change to zuul-jobs that I wrote to fix a linter rule to the calculation of the cutoff18:34
clarkbfungi: its possible that is broken (and the linter was wrong?) 18:34
clarkbbut that seems unlikely since all I did was remove the {{ }}s from a when: entry18:34
clarkband we successfully published gerrit images yesterday which was after the linter fix18:35
clarkb(unless that job failed too and we didn't notice?)18:35
ianwclarkb: i think https://review.opendev.org/c/zuul/zuul-jobs/+/872842 will help, but i need to look at the linter failure...19:08
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284219:10
opendevreviewAde Lee proposed zuul/zuul-jobs master: Add ubuntu to enable-fips role  https://review.opendev.org/c/zuul/zuul-jobs/+/86688119:29
opendevreviewMerged opendev/system-config master: Update our base python images  https://review.opendev.org/c/opendev/system-config/+/87301219:52
ianwhttps://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 adds a pre playbook to install python/python-dev, which i think should get the 27 jobs back to where they were20:01
ianwmea culpa on that one ... i wasn't even vaguely thinking about python2.720:02
fungiyou're living the dream! i'll be thrilled when i can stop thinking about python 2.720:02
clarkbapparently I'm not logged into gerrit so will have to +2 after lunch. I'm starving20:02
fungii'm testing the docker upgrade on the held mm3 node at
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284220:31
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: double-quote regexes  https://review.opendev.org/c/zuul/zuul-jobs/+/87302820:31
fungicurrently has docker-ce 5:20.10.23~3-0~ubuntu-jammy20:31
fungiThe following packages will be upgraded: docker-ce docker-ce-cli libssl3 openssl20:32
fungiguess this was still missing the openssl security fix20:32
funginow running docker-ce 5:23.0.0-1~ubuntu.22.04~jammy20:33
fungithe containers take a minute to get back to a working state20:33
fungistill not up though. i wonder if i should have downed the containers before upgrading20:34
fungierror: exec: "apparmor_parser": executable file not found in $PATH20:37
fungiworking slightly better after installing apparmor20:37
fungiianw: your procedure on the pad starts with "stop docker" does that mean stopping dockerd or downing the containers or what?20:38
fungiunfortunately, after restarting the containers on the held node, i'm getting a server error trying to browse the sites20:44
fungii'll need to dig the cause out of logs, i expect20:44
fungidocker container logs indicate everything started up fine and apache logs say i'm getting a 200/ok response, but the page content is "Server error: An error occurred while processing your request."20:49
fungiso i guess that's coming from the backend20:49
clarkbfungi: I suspect downing the containers is sufficient. Then let the package upgrade restart the daemon20:50
fungithis may be something still not quite right with the django site setting20:54
fungiand maybe the container restart in the job isn't sufficient to exercise it20:55
fungii'll have to continue investigating after dinner or in the morning20:55
clarkbya that seems unlikely to be related to docker itself if the software is executing under docker20:55
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284220:57
clarkbI guess https://review.opendev.org/c/opendev/system-config/+/873012 would've been hit by the broken promotions too?20:58
clarkbianw: fungi  how new of an apt do you need for https://review.opendev.org/c/opendev/system-config/+/872808/1/playbooks/roles/install-docker/templates/sources.list.j2 to work?21:00
fungioh, right that may not work on xenial21:01
ianwclarkb: i think it was 1.4, which is quite old21:01
ianwdo we have xenial docker?21:04
clarkbI don't think so21:04
clarkbso ya it may be fine. Also I think gerritbot may be out to lunch. I updated https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/873020 to fix an issue and no notification here21:05
clarkbor maybe it's a different event for ps created by the web tool?21:05
opendevreviewMerged zuul/zuul-jobs master: promote-docker-image: double-quote regexes  https://review.opendev.org/c/zuul/zuul-jobs/+/87302821:08
slittle1_just got around to trying the modified ssh settings as suggested on this forum.  No improvement21:10
slittle1_Running: git review -s21:10
slittle1_Problem running 'git remote update gerrit'21:10
slittle1_Fetching gerrit21:10
slittle1_kex_exchange_identification: read: Connection reset by peer21:10
slittle1_fatal: Could not read from remote repository.21:10
clarkbthats a different error though right?21:11
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284221:11
clarkbor maybe the earlier error was truncated to leave off the kex info21:11
clarkbslittle1_: if the control persistence is working properly ps should show you an ssh process with ControlMaster in its command line. Something like `ps -elf | grep ControlMaster` should find it21:14
slittle1_ ssh_exchange_identification vs kex_exchange_identification,  yes slightly different21:15
slittle1_ps -elf | grep ControlMaster ... nothing found21:16
clarkbanother thing to check is the number of connections you have open to gerrit `ss --tcp --numeric | grep` assuming ipv421:16
clarkbok I think it may not be using the ControlMaster then21:17
clarkbwhat ControlMaster does is create a process that keeps a single ssh connection open; subsequent ssh connections then talk through that connection. It's also possible that MINA doesn't deal with the multiple logical connections over one tcp connection properly.21:17
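The ControlMaster behaviour described above can be sketched as a pair of command lines. This is a minimal sketch, not the configuration anyone in the channel actually used: the helper names, the control socket path, and the 10 minute persistence are assumptions for illustration. `ControlMaster`, `ControlPath`, `ControlPersist` and `ssh -O check` are standard OpenSSH client options, and 29418 is the Gerrit ssh port seen elsewhere in this log.

```python
# Sketch: build ssh command lines that share one TCP connection via a
# control master. Function names and default values are hypothetical.

def ssh_multiplex_args(host, port=29418,
                       control_path="~/.ssh/cm-%r@%h:%p", persist="10m"):
    """Return an ssh argv that reuses (or creates) a shared master connection."""
    return [
        "ssh",
        "-o", "ControlMaster=auto",          # become master if none exists yet
        "-o", f"ControlPath={control_path}",  # socket later sessions attach to
        "-o", f"ControlPersist={persist}",    # keep the master alive when idle
        "-p", str(port),
        host,
    ]


def ssh_master_check_args(host, port=29418,
                          control_path="~/.ssh/cm-%r@%h:%p"):
    """Return an ssh argv that asks whether a master for this host is running."""
    return [
        "ssh", "-O", "check",
        "-o", f"ControlPath={control_path}",
        "-p", str(port),
        host,
    ]


if __name__ == "__main__":
    print(" ".join(ssh_multiplex_args("review.opendev.org")))
    print(" ".join(ssh_master_check_args("review.opendev.org")))
```

The second command is an alternative to grepping `ps` output: it queries the control socket directly, so it only reports a master when multiplexing is really in effect.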
slittle1_ss --tcp --numeric | grep ... no output21:20
ianwslittle1_ : are you coming from an ip starting with 128.224 ?21:20
slittle1_yes, probably
ianwthat's interesting21:21
ianw$ ssh -p 29418 ianw.admin@localhost "gerrit show-connections -w"  | grep | wc -l21:21
ianwgerrit shows 18 open connections21:21
funginote that https://review.opendev.org/872989 is aimed at making that easier to debug too21:21
ianwbut, they have a blank user21:22
clarkbfungi: that will log to syslog right?21:23
ianwoh, i have de-ja-vu on the blank user thing21:23
slittle1_hmmm, I have21:23
fungiclarkb: yes21:23
slittle1_Host *21:23
slittle1_   ServerAliveInterval 6021:23
ianwi think we see it when the session is open but not logged in21:23
slittle1_in .ssh/config21:23
funginetstat -nt reports 20 sockets from that address at the moment too21:23
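Per-address socket counts like the ones read off `ss` and `netstat -nt` above can be tallied with a few lines of Python. This is a rough sketch under stated assumptions: the function name is hypothetical, the sample text is invented using documentation-only addresses, and the parser assumes the usual `ss --tcp --numeric` column order (State, Recv-Q, Send-Q, local, peer).

```python
from collections import Counter

# Hypothetical sample of `ss --tcp --numeric` output as seen on a server.
# The addresses come from documentation ranges, not from real clients.
SAMPLE_SS_OUTPUT = """\
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       203.0.113.5:29418    192.0.2.10:51234
ESTAB  0       0       203.0.113.5:29418    192.0.2.10:51240
ESTAB  0       0       203.0.113.5:29418    192.0.2.10:51302
ESTAB  0       0       203.0.113.5:29418    198.51.100.7:40022
ESTAB  0       0       203.0.113.5:443      192.0.2.10:51400
"""


def count_peers(ss_output, local_port):
    """Count connections per peer address that terminate on local_port."""
    counts = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows have a numeric Recv-Q in column 2; this skips the header.
        if len(fields) < 5 or not fields[1].isdigit():
            continue
        # rpartition splits on the last colon, so bracketed IPv6 also works.
        _, _, port = fields[3].rpartition(":")
        if port != str(local_port):
            continue
        peer_addr, _, _ = fields[4].rpartition(":")
        counts[peer_addr] += 1
    return counts


if __name__ == "__main__":
    for peer, n in count_peers(SAMPLE_SS_OUTPUT, 29418).most_common():
        print(peer, n)
```

Grouping by peer address rather than grepping for one address makes it easier to spot which client is holding an unusual number of sockets open.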
clarkbianw: ya you need successful login for gerrit to attach the user21:23
slittle1_is that .ssh/config entry a concern ?21:24
clarkbslittle1_: shouldn't be. The ServerAliveInterval creates extra packets on existing connections21:25
clarkbbut it shouldn't make new connections21:25
opendevreviewIan Wienand proposed zuul/zuul-jobs master: promote-docker-image: improve failure debugability  https://review.opendev.org/c/zuul/zuul-jobs/+/87284221:26
slittle1_clarkb: yes I believe so.  Added it ages ago to keep ssh sessions alive a bit longer.   21:26
clarkbslittle1_: one side effect it could have though is keeping unneeded connections open for longer potentially21:27
clarkbusually thats on the server side though as it wants to make sure it has received all of the bits from the client before shutting down completely21:28
slittle1_I wanted interactive ssh sessions to stay open when I took a call or was otherwise distracted for a bit21:28
clarkbhrm I wonder if that means you've also got network gear that is shutting down connections between you and the remote21:29
clarkbtypically an ssh connection will stay open21:29
clarkbbut that could also impact the server side connection count if it thinks you've got a bunch of connections open that some firewall has killed without telling the server21:29
slittle1_I'll add a domain rather than applying to all connections21:29
slittle1_ok, that shouldn't impact opendev connections any longer21:31
slittle1_Gotta get kids,  I'll test again tomorrow21:32
slittle1_Thanks for the help21:32
opendevreviewJay Faulkner proposed opendev/infra-manual master: Add documentation on removing human user from pypi  https://review.opendev.org/c/opendev/infra-manual/+/87303321:35
clarkbI have requested access to the gerrit community meeting agenda doc22:02
clarkbfungi: for gitea I wonder if the easiest thing is to just land a noop change to the dockerfile?22:16
fungii recall netscreen firewalls by default silently dropped "idle" tcp sessions after 120 seconds, so on my machines at the office where they used one of those i set my ssh client configuration for a 60 second protocol level keepalive22:16
clarkbsimilar to what I did for the python base images?22:16
clarkb(and maybe push a similar change again for the python base images?)22:16
clarkbthough with the base images we don't need to coordinate the deploy pipeline so maybe we can just reenqueue22:16
fungithough ssh keepalive also depends on the server side having support implemented. no idea if mina-sshd does22:16
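The keepalive reasoning above (a firewall that drops idle sessions after 120 seconds, countered by a 60 second client keepalive) can be written down as a small generator for a host-scoped ssh config stanza. This is a sketch only: the function name and the host pattern are hypothetical, and halving the idle timeout is an assumption that simply mirrors the 120/60 figures mentioned in the log. `Host` and `ServerAliveInterval` are standard OpenSSH client options.

```python
def keepalive_stanza(host_pattern, idle_timeout_s):
    """Render an ssh_config stanza whose keepalive beats a firewall idle timeout.

    The interval is half the idle timeout (an assumed rule of thumb), so at
    least one keepalive is sent before the firewall considers the session idle.
    """
    if idle_timeout_s < 2:
        raise ValueError("idle timeout too short for a whole-second keepalive")
    interval = idle_timeout_s // 2
    return f"Host {host_pattern}\n    ServerAliveInterval {interval}\n"


if __name__ == "__main__":
    # "*.example.com" is a placeholder pattern, not a host from this log.
    print(keepalive_stanza("*.example.com", 120))
```

Scoping the stanza to a host pattern instead of `Host *` keeps the keepalive away from servers that do not need it.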
fungiclarkb: the donor logos deployment isn't urgent. i'm fine waiting for whatever we end up needing to merge to the gitea server images down the road to trigger a new deployment22:17
*** dasm|rover is now known as dasm|off22:40
opendevreviewIan Wienand proposed zuul/zuul-jobs master: build-docker-image: further cleanup buildx path  https://review.opendev.org/c/zuul/zuul-jobs/+/87280622:42
dasm|offthanks fungi for the email update and "stable branches" py27 solution!22:44
fungidasm|off: ianw is to thank for pushing the change there22:46
fungii'm just trying to keep people apprised of what's going on22:46
ianwwell i also broke it ... so :)23:14
