Monday, 2024-09-09

*** noonedeadpunk_ is now known as noonedeadpunk  [07:29]
<opendevreview> Merged openstack/diskimage-builder master: Fix hashbang of non-executed bash libs  https://review.opendev.org/c/openstack/diskimage-builder/+/928312  [10:03]
*** ykarel_ is now known as ykarel  [14:02]
<fungi> infra-root: NeilHanlon: i'd like to tag gerritlib 1dd61d944b0b4bd3370a5d47ee8a992f29481adb (current master branch state) as 0.11.0, a semver minor rev since it drops support for python versions older than what 0.10.0 worked on. these are the changes which have merged since 0.10.0: https://paste.opendev.org/show/bSHypXUnEQ5p7HtSvx1S/  [14:24]
<fungi> oh, it also adds a feature (event filtering), so another reason it should be a minor increment  [14:28]
<NeilHanlon> +1  [14:45]
<NeilHanlon> fwiw I already released dc75475 as a prerelease of 0.11.0 in Fedora rawhide  [14:45]
<NeilHanlon> https://bodhi.fedoraproject.org/updates/FEDORA-2024-159628f773  [14:46]
<fungi> cool. from a distro packaging perspective, the one missing commit should be immaterial anyway, i think its only impact will be to people installing from pypi  [14:47]
<fungi> or pip/uv installing locally from source  [14:47]
<fungi> i just didn't want to release without it, because pushing the tag will trigger an upload to pypi and then users who try to install from an older unsupported python version could end up with a broken 0.11.0 rather than auto-selecting the older 0.10.0 for their environment  [14:49]
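A minimal sketch of the packaging metadata being described here; the ">=3.8" floor is an assumption, not necessarily the value gerritlib actually sets:

    # setup.cfg (sketch)
    [options]
    python_requires = >=3.8

With this in the release metadata, pip on an older interpreter skips the new version and resolves to the newest release whose python_requires it still satisfies.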
<NeilHanlon> ++  [14:49]
<fungi> anyway, i'll give infra-root folks a few hours to notice/object and then plan to push the tag i've created for that later today  [14:51]
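A rough sketch of the tagging step being planned; the signed tag and the remote name "gerrit" are assumptions about the workflow rather than details quoted from the log:

    # tag the reviewed master commit and push it; tag publication jobs upload to pypi
    git tag -s 0.11.0 -m "gerritlib 0.11.0" 1dd61d944b0b4bd3370a5d47ee8a992f29481adb
    git push gerrit 0.11.0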
<clarkb> no objection from me  [15:04]
<clarkb> there is a newer etherpad release from over the weekend. Thoughts on whether or not we should update to that version instead? additionally I think we should ensure docker hub has a v2.1.1 tag for the current image or a rebuild of that image to make rollback easier. We can do that manually or we could update the image build jobs to do it for us and rebuild the current version to stick that in docker hub and fetch it via docker-compose. Then we can rebase the upgrade change on that to also tag versions  [15:05]
<clarkb> thoughts/opinions on that?  [15:05]
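A hypothetical docker-compose.yaml fragment illustrating the rollback idea: with a version tag like v2.1.1 kept on docker hub, rolling back is just flipping the image line and re-pulling (the image name and file layout are assumptions):

    services:
      etherpad:
        # pin to a known-good version tag rather than :latest
        image: docker.io/opendevorg/etherpad:v2.1.1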
<clarkb> looking at rax flex we seem to have hit the limit this morning as well. There was a burst of boot errors during that time frame too. Maybe we're not quite calculating available quota/max-servers properly and hitting quota errors when at the limit?  [15:14]
<clarkb> otherwise this seems to be happy. We should probably write up an email for cloudnull et al and defer to them on whether or not we should make additional changes  [15:14]
<clarkb> https://github.com/ether/etherpad-lite/blob/v2.2.4/CHANGELOG.md the current upgrade change is for 2.2.2, so the 2.2.3 and 2.2.4 releases are relevant if we want to update things to latest instead  [15:26]
<opendevreview> Clark Boylan proposed opendev/system-config master: Tag etherpad images with version  https://review.opendev.org/c/opendev/system-config/+/928656  [15:33]
<opendevreview> Clark Boylan proposed opendev/system-config master: Update etherpad to 2.2.4  https://review.opendev.org/c/opendev/system-config/+/926078  [15:47]
<clarkb> I went ahead and updated things to do the upgrade to latest. We can revert to older patchsets if necessary  [15:48]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM force etherpad failure to hold node  https://review.opendev.org/c/opendev/system-config/+/840972  [15:50]
<clarkb> fungi: I put in another autohold request for ^ but I'm thinking we can also upgrade the node that you migrated the held db into, as an additional check. Pretty sure that held node has not been cleaned up yet  [15:51]
<clarkb> that should be faster than porting the db into a new node all over again  [15:52]
<clarkb> following up on raxflex errors I see three failure modes in recent logs: A) Cannot scan ssh key B) Server in an error state C) over/at ram quota so can't allocate more  [15:57]
<clarkb> C) is likely a race condition in nodepool's quota accounting and probably not a big deal. A) may just be slowness to boot? We could tune timeouts if the current ones are short, and B) is something that the cloud may want to look at  [15:58]
<clarkb> B)'s error message: Error in creating the server. Compute service reports fault: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 36b3f959-f079-455d-b330-078372253802. Last exception: Binding failed for port 4bc1537e-cfcf-4f95-9f8e-3b92e5f9783a, please check neutron logs for more information.  [15:59]
<clarkb> the ssh port scan appears to have succeeded in scanning several keys before eventually hitting an ssh connection timeout. I wonder if we need to start a new tcp connection for each of those or if we can reuse one and make that more reliable  [16:00]
<timburke> so i've noticed a lot of `Cannot assign requested address` recently on swift's rolling-upgrade jobs, and every time i drill down a bit, it's always on raxflex nodes -- is there any way i could disable that provider for that specific job?  [16:01]
<timburke> as an example: https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7  [16:01]
<timburke> of the last 30 runs of that job, the 8 failures have all been on raxflex; the other 22 were a mix of rax, ovh, openmetal  [16:02]
<clarkb> timburke: there isn't a way to disable providers for specific jobs  [16:04]
<clarkb> timburke: do you know what address you are trying to bind the socket to there?  [16:04]
<clarkb> timburke: is it this: https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7/log/container-server.conf#5-6 ?  [16:05]
<clarkb> if so the issue is you're binding to the floating ip address, which isn't actually on the server and is instead NAT'd  [16:05]
<timburke> (side note: is that a thing that we could get on the build history page, or at least the build result page? needing to click through "View log" -> "zuul-info" -> "inventory.yaml" then grep for cloud: for 30 jobs wasn't fun, but also wasn't *so* many that i felt the need to automate it)  [16:06]
<clarkb> https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7/log/zuul-info/inventory.yaml#46 you'll need to bind to this address or 127.0.0.1 or 0.0.0.0 or their ipv6 equivalents  [16:06]
<clarkb> we haven't had a floating ip cloud in a while, which is why this hasn't been an issue  [16:07]
<timburke> ah... ok -- i think i can work with that. so bind on 0.0.0.0 around https://github.com/openstack/swift/blob/master/tools/playbooks/multinode_setup/make_rings.yaml#L87 but leave the nodepool ip over in https://github.com/openstack/swift/blob/master/tools/playbooks/multinode_setup/templates/make_multinode_rings.j2#L30 where we need to share the info between nodes?  [16:12]
<clarkb> timburke: ya that looks right. Basically listen on all addresses in the server bind config so it will listen where you need it, but when you configure things to talk to one another you can tell them to talk through the floating ip if that is simpler. Though you should be able to use the internal network address there too and avoid NAT  [16:13]
<clarkb> I believe the private and public nodepool ip vars are set to the same value if there is no private ip  [16:14]
<clarkb> so it should be safe to use the private ip everywhere too  [16:14]
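A sketch of the bind change being suggested, assuming a standard swift server config file: listen on all local addresses, and keep a reachable per-node address (the nodepool private/public ip rather than the floating ip) in the ring and inter-node settings:

    # container-server.conf (sketch)
    [DEFAULT]
    # listen on every local address instead of the floating ip,
    # which is NAT'd and never actually configured on the instance
    bind_ip = 0.0.0.0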
<timburke> cool, thanks clarkb! GTK  [16:14]
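For the earlier side note about mapping builds to providers, a rough shell sketch against the zuul builds API; the job name is a placeholder, and builds without a log_url are skipped:

    job=swift-multinode-rolling-upgrade   # hypothetical job name
    curl -s "https://zuul.opendev.org/api/tenant/openstack/builds?job_name=${job}&limit=30" \
      | jq -r '.[] | select(.log_url != null) | .uuid + " " + .log_url' \
      | while read -r uuid log_url; do
          # pull each build's inventory and report which cloud it ran in
          echo "${uuid} $(curl -s "${log_url}zuul-info/inventory.yaml" | grep -m1 'cloud:')"
        done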
<clarkb> fungi: https://zuul.opendev.org/t/openstack/build/d93fdc1fb78b4fdcb5c303a08de0aab2/artifacts is the updated 2.2.4 etherpad image build if we want to pull that onto the test node with the prod-like db  [16:20]
<fungi> clarkb: for the "A) Cannot scan ssh key" case, is it the same ip address i remarked on over the weekend?  [16:20]
<clarkb> maybe do that after we confirm basic functionality on the newly held node  [16:20]
<clarkb> so that we don't burn the prod-like test node until we're confident it is likely to work  [16:21]
<clarkb> fungi: it was 65.17.193.24  [16:21]
<clarkb> which is a different address  [16:22]
<fungi> "Error scanning keys: Timeout connecting to 63.131.145.251 on port 22" (5 occurrences between 23:28 and 23:33 on 2024-09-07)  [16:22]
<fungi> actually, looks like quite a variety of ip addresses there where the ssh keyscan failed: https://paste.opendev.org/show/bhBEFZ6segiTMrw2n2Tu/  [16:38]
<clarkb> considering that it seems to happen after other successful keys are scanned I wonder if there is some issue with the NAT  [16:39]
<clarkb> like every 100th connection fails  [16:39]
<fungi> we could increase the timeout, i guess, and see if that helps?  [16:39]
<fungi> in rax classic we set boot-timeout: 120 which i think is what covers that  [16:41]
<clarkb> does that cover the ssh connection timeout for doing the key scan? it seems like that is what times out  [16:41]
<fungi> i think so? there's also launch-timeout which i think is how long nodepool waits for nova to report the instance active  [16:41]
<clarkb> looking at the logs for one server it seems to time out after about a minute  [16:42]
<clarkb> https://paste.opendev.org/show/bqVBIZaX9F2gPYzV79ut/  [16:43]
<fungi> https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].boot-timeout  [16:43]
<fungi> "how long to try connecting to the image via SSH"  [16:44]
<clarkb> looks like that timeout message does come from nodepool's own timeout processing and not paramiko  [16:45]
<clarkb> ya it's weird though that we scan other keys very quickly first and then it is slow. But ya we can try bumping that up to see if it stabilizes with a bit more time  [16:45]
<clarkb> if not then we may have a network l3/2/1 problem  [16:45]
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Increase the boot timeout for Rackspace Flex nodes  https://review.opendev.org/c/openstack/project-config/+/928668  [16:45]
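For reference, the knob in question as it might appear in the launcher's provider config; the provider name and values here are illustrative rather than the exact contents of 928668:

    providers:
      - name: raxflex-sjc3        # assumed provider name
        boot-timeout: 120         # once ACTIVE, how long to keep retrying the ssh keyscan
        launch-timeout: 600       # how long to wait for nova to report the server ACTIVE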
<clarkb> 104.130.172.150 is the held 2.2.4 etherpad node. It seems to generally work in the clarkb-test pad  [16:59]
<fungi> i'll go ahead and self-approve 928668 so we don't unnecessarily delay the experiment  [17:00]
<clarkb> If that node looks good to you then ya I think the next step is updating the held node with the copy of the db, then reconfirming things work with the prod db, then we can proceed with normal code review  [17:02]
<corvus> clarkb: fungi i agree that looks like a very unhealthy series of log lines.  key scanning should be near-instantaneous for all key types, so that looks a lot like either a network issue, or (less likely) very slow instance or very slow (overcommitted) host  [17:09]
<corvus> i wonder if it's worth grabbing a flex node and running paired tcpdumps while doing a nodepool-style keyscan  [17:13]
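A sketch of that experiment, assuming shell access on both ends and a placeholder target address; ssh-keyscan only approximates nodepool's one-connection-per-key-type behavior:

    # capture ssh traffic on both hosts while repeatedly scanning
    sudo tcpdump -i any -w keyscan.pcap "host 192.0.2.10 and tcp port 22" &
    for i in $(seq 1 100); do
      for t in rsa ecdsa ed25519; do
        timeout 30 ssh-keyscan -t "$t" 192.0.2.10 >/dev/null || echo "attempt $i type $t failed"
      done
    done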
<opendevreview> Merged openstack/project-config master: Increase the boot timeout for Rackspace Flex nodes  https://review.opendev.org/c/openstack/project-config/+/928668  [17:13]
<fungi> without knowing the details of the network architecture/integration for rackspace flex, it's hard to guess but could be that they're relying on a routing protocol to announce which host is the destination for which fip and there's some lag in updates replicating (maybe switches slow at updating bridge tables? routers holding onto stale arp entries rather than replacing them?)  [17:15]
<corvus> increasing the timeout is probably just papering over the issue.  that timeout was mostly to handle the case of the host actually booting slowly (like, the time from start-of-boot to ssh-is-listening).  once ssh is listening, we don't expect, under normal circumstances, for the actual keyscan to take long.  and we know from those log lines that ssh was listening as soon as we got the first keyscan reply  [17:16]
<corvus> fungi: re networking -- i agree, but i would also expect that routing/arp/whatever lag to show up before the first keyscan reply, but not after that  [17:17]
<fungi> oh, does the keyscan only happen after an initial ssh attempt?  [17:18]
<corvus> it's a bunch of connections because ssh will only give us the fingerprint for one at a time  [17:18]
<corvus> let me annotate the log entries... one sec  [17:18]
<fungi> but yeah, if they have redundant network paths then different connections could end up going over flows through different devices so could be a case where it's intermittent until all devices get on the same page about where an address resides  [17:19]
<fungi> anyway, definitely worth bringing to cloudnull's and cardoe's attention  [17:23]
<corvus> fungi: https://paste.opendev.org/show/bhyUryen2qUIE75mKXPI/  [17:24]
<clarkb> corvus: ya that was my concern, it is odd behavior to time out after several fast scans  [17:25]
<corvus> each one of those is closed before starting the next one (so they aren't open at the same time)  [17:25]
<clarkb> like something in the network is having a sad periodically  [17:25]
<fungi> corvus: agreed, that does seem to indicate that it's not just taking too long for the server instance to become generally reachable  [17:27]
<fungi> also for those following along, these are connections from rackspace (classic) dfw to rackspace flex sjc3  [17:28]
<clarkb> my hunch is that the NAT system is failing every Nth connection attempt  [17:30]
<clarkb> which is classic NAT-tables-are-full behavior, though this is 1:1 so in theory it should be far less susceptible to that  [17:31]
<fungi> also worth noting, these connections don't seem to go through dedicated circuits, a traceroute from nl01 to mirror.sjc3.raxflex transits datapipe's network for a hop or two  [17:33]
<fungi> clarkb: fips shouldn't be using layer 4 translation though?  [17:33]
<fungi> nat tables are typically only for overload/shared nat addresses  [17:34]
<fungi> while fips are 1:1  [17:34]
<clarkb> fungi: I think it depends on the implementation? There are so many in neutron that I wouldn't be surprised if there were oddities  [17:35]
<fungi> yeah, i'll grant that sometimes implementations do weird things  [17:35]
<clarkb> but yes in theory it should just rewrite the dest/source ip and forward on  [17:35]
<fungi> i've copied the latest etherpad database backup to the new held node, but i'll need a few minutes to mount the ephemeral disk at /var/etherpad/db since the rootfs is too small for me to import into  [17:36]
<clarkb> fungi: oh I was suggesting we update the old 2.2.2 held node instead  [17:37]
<clarkb> since that would be faster?  [17:37]
<clarkb> basically instead of copying the db over to the new node and applying it, which took hours, we just upgrade the existing node with the db set to prod-ish to 2.2.4  [17:37]
<clarkb> but either way works  [17:37]
<fungi> ah, yeah we could do that. how do i upgrade it to the new version? just edit the version in the compose file?  [17:37]
<fungi> (and pull/down/up -d)  [17:38]
<clarkb> fungi: yes, you modify the image specification in the compose file to point at insecure-ci-registry.opendev.org:5000/opendevorg/etherpad:d93fdc1fb78b4fdcb5c303a08de0aab2_v2.2.4  [17:38]
<clarkb> from https://zuul.opendev.org/t/openstack/build/d93fdc1fb78b4fdcb5c303a08de0aab2/artifacts  [17:39]
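The manual steps on the held node would look roughly like this; the compose file location is an assumption:

    cd /etc/etherpad-docker   # assumed path of the compose file on the held node
    # edit docker-compose.yaml so the etherpad image line reads:
    #   image: insecure-ci-registry.opendev.org:5000/opendevorg/etherpad:d93fdc1fb78b4fdcb5c303a08de0aab2_v2.2.4
    docker-compose pull
    docker-compose down
    docker-compose up -d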
<fungi> looks like 104.130.172.149 is the old hold  [17:39]
<fungi> it's upgraded to the 2.2.4 test image now  [17:41]
<fungi> and restarted  [17:41]
<clarkb> cool /me throws that ip into /etc/hosts  [17:44]
<fungi> oh, that may be the wrong hold  [17:44]
<fungi> that was the 2.2.2 hold but maybe there's a 2.2.3 hold as well, sorry  [17:44]
<clarkb> ya that doesn't have the prod db in it. I think there were two holds for 2.2.2 for some reason I don't recall  [17:45]
<fungi> aha, that's what it is  [17:45]
<clarkb> one was the one I manually fixed with redirects maybe and the other was the one that had everything automated?  [17:45]
<fungi> 104.130.140.169 was the one i was supposed to do  [17:45]
<fungi> working on updating the version on that one now  [17:47]
<clarkb> 169 has the shorter uptime so pretty sure that would be the correct one either way  [17:47]
<fungi> also rootfs has filled up  [17:48]
<clarkb> can probably trim journald contents to free some space quickly  [17:48]
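A quick example of that trim; the size and age targets are arbitrary:

    sudo journalctl --vacuum-size=200M   # cap the persistent journal on disk
    sudo journalctl --vacuum-time=7d     # or drop entries older than a week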
<fungi> i cleaned up the etherpad database backups  [17:49]
<clarkb> ++  [17:49]
<fungi> i'm going to reboot the server just to make sure it's in good shape  [17:49]
<fungi> different heading levels for https://etherpad.opendev.org/p/mm3migration still look correct on the held node  [17:52]
<fungi> similar for code/monospace  [17:52]
<clarkb> seem to be good in https://etherpad.opendev.org/p/gerrit-upgrade-3.9 too  [17:53]
<clarkb> so ya I think 2.2.4 seems to be at least as functional as 2.2.2 was  [17:53]
<clarkb> maybe start by approving the first change in the stack and ensuring the :latest and :v2.1.1 tags are updated in docker hub and deployed to prod successfully, then we can do the 2.2.4 update, optionally backing up the db in some coordinated fashion  [17:54]
<fungi> after we upgrade, we should plan a meetpad test just to make sure things are still working with integration too  [17:54]
<clarkb> ++  [17:54]
<clarkb> I'm going to pop out for a bike ride here now that the smoke has blown away and before it gets too hot today  [17:55]
<clarkb> I'll be back in a bit and thank you for the help pushing this along  [17:55]
<fungi> yw!  [17:55]
<fungi> the nodepool.yaml on nl01 updated at 17:24 utc, so i guess we should see if error rates before/after that are similar or drastically different, though following corvus's analysis i don't have high hopes it will actually help  [18:59]
<cardoe> fungi: let’s poke jamesdenton (who isn’t in this channel) since the network is his baby. But cloudnull too.  [19:17]
<fungi> infra-root: i'm looking at the rackspace swift cleanup i mentioned last week (or the week before?). i think all our current zuul build log containers are named like "log_NNN" where NNN is a 3-digit hexadecimal number, the index.html files at the root of each one have a last modified date of 2019-09-05 which i think is when we switched to the current 1024-way sharding scheme  [19:51]
<fungi> er, rather, "zuul_opendev_logs_NNN"  [19:52]
<fungi> there are some older ones containing zuul build logs with names like just "logs_NN" where NN is a two-digit hex number, and also some even older with names like "logs_periodic", "logs_periodic-stable", and one simply named "infra-files"  [19:53]
<fungi> i propose to delete the swift containers named: infra-files, logs_NN, logs_periodic, logs_periodic-stable  [19:56]
<fungi> infra-files has files created circa 2014, logs_NN have top-level index.html files from 2019-08-15 which i think coincides with our brief 256-way sharding, while logs_periodic* have files from 2019-08-16  [20:00]
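A sketch of how that cleanup could be run with python-openstackclient; the --recursive flag (delete the objects along with the container) should be double-checked against the installed client version:

    # sanity-check contents first
    openstack container list
    openstack object list logs_periodic | head
    # then remove the obsolete containers and their objects
    for c in infra-files logs_periodic logs_periodic-stable; do
      openstack container delete --recursive "$c"
    done
    # the two-hex-digit logs_NN containers can be matched by pattern
    for c in $(openstack container list -f value -c Name | grep -E '^logs_[0-9a-f]{2}$'); do
      openstack container delete --recursive "$c"
    done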
<fungi> clarkb: is this something worth putting on the meeting agenda for tomorrow? or probably not in need of discussion?  [20:01]
<clarkb> fungi: I think it's mostly safe except for maybe infra-files. Is that what we use for image uploads?  [20:08]
<clarkb> but ya the logs containers should all be safe since as you mention we're in the three digit naming scheme these days  [20:08]
<fungi> the image uploads have their own containers ("images" and "image_segments")  [20:09]
<clarkb> gotcha, there is also a container or containers for the docker container registry  [20:09]
<clarkb> not sure what that one is called  [20:09]
<clarkb> side note: we could stand to possibly pick a time to delete that one and force things to regenerate since the pruning doesn't work iirc  [20:10]
<fungi> is that in the zuul project or the control plane project though?  [20:10]
<clarkb> good question  [20:10]
<fungi> i was only looking at the zuul project  [20:10]
<clarkb> the container is called intermediate_registry  [20:11]
<clarkb> oh but we do set a file expiration on them so maybe pruning more aggressively isn't critical  [20:12]
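The expiration mentioned here is presumably swift's object expiry (X-Delete-After/X-Delete-At headers); a hedged example of an upload that expires on its own, with an arbitrary 30-day value:

    swift upload intermediate_registry some/blob --header "X-Delete-After: 2592000"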
<clarkb> fungi: any idea what is creating infra-files? I hesitate to delete its contents if we don't know the origin  [20:12]
<fungi> clarkb: as i said earlier, it contains ci job logs, some dating back to 2014  [20:14]
<Clark[m]> Oh sorry I read that as files created circa 2024.  [20:16]
<Clark[m]> I see now how my brain did a character replacement it shouldn't have  [20:16]
<fungi> the top-level directories in the infra-files container are all just two hexadecimal digits, then under those are a mix of gerrit change numbers and git commit ids  [20:17]
<fungi> i think this was our zuul build logs container from before we switched to sharding across multiple containers  [20:18]
<fungi> i don't see any content in there that isn't job logs  [20:20]
<clarkb> makes sense (and sorry for the nick swaps I had to quickly eat lunch)  [20:23]
<fungi> you should feel free to eat lunch at a comfortable pace, we don't need you choking on your food  [20:23]
<fungi> none of this is urgent  [20:24]
<clarkb> years of competing with my siblings for food has led to crazy abilities to eat food too quickly  [20:24]
<fungi> yeah, i only had one brother, but i know what you mean  [20:24]
<fungi> s/had/have/  [20:25]
<fungi> but we don't often compete for food these days  [20:25]
<clarkb> ok I've made some edits to the meeting agenda. fungi do you want me to add an item about the swift containers? or should we just proceed with cleaning those up based on your investigation? I think I'm comfortable with that  [20:41]
<fungi> i can proceed, doesn't seem to need further discussion no  [20:41]
<clarkb> and ya git log -p in system-config provides further clues to infra-files origins  [20:42]
<fungi> oh indeed  [20:42]
<clarkb> 4c0f432ca5d435ca464fa03381545e90d0619837  [20:42]
<clarkb> seems like that confirms it was a container used for log uploads and that we stopped using it for that  [20:42]
<fungi> https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&viewPanel=12&from=now-12h&to=now does seem to show the errors ending abruptly at the time nodepool.yaml updated with the longer boot-timeout, but also the volume at that time was way lower than it had been and we saw some lengthy periods with no errors prior as well. we'll probably need to compare tomorrow's numbers to today's to see if there seems to be any actually significant change  [21:51]
<clarkb> ++ seems to be worse when busier  [21:54]
<fungi> but also yes this is something we ought to bring up with the flex folks  [21:56]
<clarkb> I'm going to get the meeting agenda sent out in the next 45 minutes or so. I hope that is late enough for tonyb to chime in if there is anything to add before then  [22:51]
<clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970 passes testing now after a recheck with updated centos 9 stream arm64 images  [22:58]
<clarkb> I hesitate to say send it due to potential for unexpected job behavior changes but the risk of that should be low  [22:58]
