Monday, 2024-09-09

*** noonedeadpunk_ is now known as noonedeadpunk  [07:29]
<opendevreview> Merged openstack/diskimage-builder master: Fix hashbang of non-executed bash libs  https://review.opendev.org/c/openstack/diskimage-builder/+/928312  [10:03]
*** ykarel_ is now known as ykarel  [14:02]
<fungi> infra-root: NeilHanlon: i'd like to tag gerritlib 1dd61d944b0b4bd3370a5d47ee8a992f29481adb (current master branch state) as 0.11.0, a semver minor rev since it drops support for python versions older than what 0.10.0 worked on. these are the changes which have merged since 0.10.0: https://paste.opendev.org/show/bSHypXUnEQ5p7HtSvx1S/  [14:24]
<fungi> oh, it also adds a feature (event filtering), so another reason it should be a minor increment  [14:28]
<NeilHanlon> +1  [14:45]
<NeilHanlon> fwiw I already released dc75475 as a prerelease of 0.11.0 in Fedora rawhide  [14:45]
<NeilHanlon> https://bodhi.fedoraproject.org/updates/FEDORA-2024-159628f773  [14:46]
<fungi> cool. from a distro packaging perspective, the one missing commit should be immaterial anyway, i think its only impact will be to people installing from pypi  [14:47]
<fungi> or pip/uv installing locally from source  [14:47]
<fungi> i just didn't want to release without it, because pushing the tag will trigger an upload to pypi and then users who try to install from an older unsupported python version could end up with a broken 0.11.0 rather than auto-selecting the older 0.10.0 for their environment  [14:49]
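A minimal sketch of the packaging metadata being described here; the ">=3.8" floor is an assumption, not necessarily the value gerritlib actually sets:

    # setup.cfg (sketch)
    [options]
    python_requires = >=3.8

With this in the release metadata, pip on an older interpreter skips the new version and resolves to the newest release whose python_requires it still satisfies.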
<NeilHanlon> ++  [14:49]
<fungi> anyway, i'll give infra-root folks a few hours to notice/object and then plan to push the tag i've created for that later today  [14:51]
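A rough sketch of the tagging step being planned; the signed tag and the remote name "gerrit" are assumptions about the workflow rather than details quoted from the log:

    # tag the reviewed master commit and push it; tag publication jobs upload to pypi
    git tag -s 0.11.0 -m "gerritlib 0.11.0" 1dd61d944b0b4bd3370a5d47ee8a992f29481adb
    git push gerrit 0.11.0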
<clarkb> no objection from me  [15:04]
<clarkb> there is a newer etherpad release from over the weekend. Thoughts on whether or not we should update to that version instead? additionally I think we should ensure docker hub has a v2.1.1 tag for the current image or a rebuild of that image to make rollback easier. We can do that manually or we could update the image build jobs to do it for us and rebuild the current version to stick that in docker hub and fetch it via docker-compose. Then we can rebase the upgrade change on that to also tag versions  [15:05]
<clarkb> thoughts/opinions on that?  [15:05]
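A hypothetical docker-compose.yaml fragment illustrating the rollback idea: with a version tag like v2.1.1 kept on docker hub, rolling back is just flipping the image line and re-pulling (the image name and file layout are assumptions):

    services:
      etherpad:
        # pin to a known-good version tag rather than :latest
        image: docker.io/opendevorg/etherpad:v2.1.1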
<clarkb> looking at rax flex we seem to have hit the limit this morning as well. There was a burst of boot errors during that time frame too. Maybe we're not quite calculating available quota/max-servers properly and hitting quota errors when at the limit?  [15:14]
<clarkb> otherwise this seems to be happy. We should probably write up an email for cloudnull et al and defer to them on whether or not we should make additional changes  [15:14]
<clarkb> https://github.com/ether/etherpad-lite/blob/v2.2.4/CHANGELOG.md the current upgrade change is for 2.2.2, so the 2.2.3 and 2.2.4 releases are relevant if we want to update things to latest instead  [15:26]
<opendevreview> Clark Boylan proposed opendev/system-config master: Tag etherpad images with version  https://review.opendev.org/c/opendev/system-config/+/928656  [15:33]
<opendevreview> Clark Boylan proposed opendev/system-config master: Update etherpad to 2.2.4  https://review.opendev.org/c/opendev/system-config/+/926078  [15:47]
<clarkb> I went ahead and updated things to do the upgrade to latest. We can revert to older patchsets if necessary  [15:48]
<opendevreview> Clark Boylan proposed opendev/system-config master: DNM force etherpad failure to hold node  https://review.opendev.org/c/opendev/system-config/+/840972  [15:50]
<clarkb> fungi: I put in another autohold request for ^ but I'm thinking we can also upgrade the node that you migrated the held db into, as an additional check. Pretty sure that held node has not been cleaned up yet  [15:51]
<clarkb> that should be faster than porting the db into a new node all over again  [15:52]
<clarkb> following up on raxflex errors I see three failure modes in recent logs: A) Cannot scan ssh key B) Server in an error state C) over/at ram quota so can't allocate more  [15:57]
<clarkb> C) is likely a race condition in nodepool's quota accounting and probably not a big deal. A) may just be slowness to boot? We could tune timeouts if the current ones are short, and B) is something that the cloud may want to look at  [15:58]
<clarkb> B)'s error message: Error in creating the server. Compute service reports fault: Exceeded maximum number of retries. Exceeded max scheduling attempts 3 for instance 36b3f959-f079-455d-b330-078372253802. Last exception: Binding failed for port 4bc1537e-cfcf-4f95-9f8e-3b92e5f9783a, please check neutron logs for more information.  [15:59]
<clarkb> the ssh port scan appears to have succeeded in scanning several keys before eventually hitting an ssh connection timeout. I wonder if we need to start a new tcp connection for each of those or if we can reuse one and make that more reliable  [16:00]
<timburke> so i've noticed a lot of `Cannot assign requested address` recently on swift's rolling-upgrade jobs, and every time i drill down a bit, it's always on raxflex nodes -- is there any way i could disable that provider for that specific job?  [16:01]
<timburke> as an example: https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7  [16:01]
<timburke> of the last 30 runs of that job, the 8 failures have all been on raxflex; the other 22 were a mix of rax, ovh, openmetal  [16:02]
<clarkb> timburke: there isn't a way to disable providers for specific jobs  [16:04]
<clarkb> timburke: do you know what address you are trying to bind the socket to there?  [16:04]
<clarkb> timburke: is it this: https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7/log/container-server.conf#5-6 ?  [16:05]
<clarkb> if so the issue is you're binding to the floating ip address, which isn't actually on the server and is instead NAT'd  [16:05]
<timburke> (side note: is that a thing that we could get on the build history page, or at least the build result page? needing to click through "View log" -> "zuul-info" -> "inventory.yaml" then grep for cloud: for 30 jobs wasn't fun, but also wasn't *so* many that i felt the need to automate it)  [16:06]
<clarkb> https://zuul.opendev.org/t/openstack/build/e324b0e08eaf4d30af98e4c26f93d0b7/log/zuul-info/inventory.yaml#46 you'll need to bind to this address or 127.0.0.1 or 0.0.0.0 or their ipv6 equivalents  [16:06]
<clarkb> we haven't had a floating ip cloud in a while, which is why this hasn't been an issue  [16:07]
<timburke> ah... ok -- i think i can work with that. so bind on 0.0.0.0 around https://github.com/openstack/swift/blob/master/tools/playbooks/multinode_setup/make_rings.yaml#L87 but leave the nodepool ip over in https://github.com/openstack/swift/blob/master/tools/playbooks/multinode_setup/templates/make_multinode_rings.j2#L30 where we need to share the info between nodes?  [16:12]
<clarkb> timburke: ya that looks right. Basically listen on all addresses in the server bind config so it will listen where you need it, but when you configure things to talk to one another you can tell them to talk through the floating ip if that is simpler. Though you should be able to use the internal network address there too and avoid NAT  [16:13]
<clarkb> I believe the private and public nodepool ip vars are set to the same value if there is no private ip  [16:14]
<clarkb> so it should be safe to use the private ip everywhere too  [16:14]
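A sketch of the bind change being suggested, assuming a standard swift server config file: listen on all local addresses, and keep a reachable per-node address (the nodepool private/public ip rather than the floating ip) in the ring and inter-node settings:

    # container-server.conf (sketch)
    [DEFAULT]
    # listen on every local address instead of the floating ip,
    # which is NAT'd and never actually configured on the instance
    bind_ip = 0.0.0.0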
<timburke> cool, thanks clarkb! GTK  [16:14]
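For the earlier side note about mapping builds to providers, a rough shell sketch against the zuul builds API; the job name is a placeholder, and builds without a log_url are skipped:

    job=swift-multinode-rolling-upgrade   # hypothetical job name
    curl -s "https://zuul.opendev.org/api/tenant/openstack/builds?job_name=${job}&limit=30" \
      | jq -r '.[] | select(.log_url != null) | .uuid + " " + .log_url' \
      | while read -r uuid log_url; do
          # pull each build's inventory and report which cloud it ran in
          echo "${uuid} $(curl -s "${log_url}zuul-info/inventory.yaml" | grep -m1 'cloud:')"
        done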
<clarkb> fungi: https://zuul.opendev.org/t/openstack/build/d93fdc1fb78b4fdcb5c303a08de0aab2/artifacts is the updated 2.2.4 etherpad image build if we want to pull that onto the test node with the prod-like db  [16:20]
<fungi> clarkb: for the "A) Cannot scan ssh key" case, is it the same ip address i remarked on over the weekend?  [16:20]
<clarkb> maybe do that after we confirm basic functionality on the newly held node  [16:20]
<clarkb> so that we don't burn the prod-like test node until we're confident it is likely to work  [16:21]
<clarkb> fungi: it was 65.17.193.24  [16:21]
<clarkb> which is a different address  [16:22]
<fungi> "Error scanning keys: Timeout connecting to 63.131.145.251 on port 22" (5 occurrences between 23:28 and 23:33 on 2024-09-07)  [16:22]
<fungi> actually, looks like quite a variety of ip addresses there where the ssh keyscan failed: https://paste.opendev.org/show/bhBEFZ6segiTMrw2n2Tu/  [16:38]
<clarkb> considering that it seems to happen after other successful keys are scanned I wonder if there is some issue with the NAT  [16:39]
<clarkb> like every 100th connection fails  [16:39]
<fungi> we could increase the timeout, i guess, and see if that helps?  [16:39]
<fungi> in rax classic we set boot-timeout: 120 which i think is what covers that  [16:41]
<clarkb> does that cover the ssh connection timeout for doing the key scan? it seems like that is what times out  [16:41]
<fungi> i think so? there's also launch-timeout which i think is how long nodepool waits for nova to report the instance active  [16:41]
<clarkb> looking at the logs for one server it seems to time out after about a minute  [16:42]
<clarkb> https://paste.opendev.org/show/bqVBIZaX9F2gPYzV79ut/  [16:43]
<fungi> https://zuul-ci.org/docs/nodepool/latest/openstack.html#attr-providers.[openstack].boot-timeout  [16:43]
<fungi> "how long to try connecting to the image via SSH"  [16:44]
<clarkb> looks like that timeout message does come from nodepool's own timeout processing and not paramiko  [16:45]
<clarkb> ya it's weird though that we scan other keys very quickly first and then it is slow. But ya we can try bumping that up to see if it stabilizes with a bit more time  [16:45]
<clarkb> if not then we may have a network l3/2/1 problem  [16:45]
<opendevreview> Jeremy Stanley proposed openstack/project-config master: Increase the boot timeout for Rackspace Flex nodes  https://review.opendev.org/c/openstack/project-config/+/928668  [16:45]
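For reference, the knob in question as it might appear in the launcher's provider config; the provider name and values here are illustrative rather than the exact contents of 928668:

    providers:
      - name: raxflex-sjc3        # assumed provider name
        boot-timeout: 120         # once ACTIVE, how long to keep retrying the ssh keyscan
        launch-timeout: 600       # how long to wait for nova to report the server ACTIVE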
<clarkb> 104.130.172.150 is the held 2.2.4 etherpad node. It seems to generally work in the clarkb-test pad  [16:59]
<fungi> i'll go ahead and self-approve 928668 so we don't unnecessarily delay the experiment  [17:00]
<clarkb> If that node looks good to you then ya I think the next step is updating the held node with the copy of the db, then reconfirming things work with the prod db, then we can proceed with normal code review  [17:02]
<corvus> clarkb: fungi i agree that looks like a very unhealthy series of log lines.  key scanning should be near-instantaneous for all key types, so that looks a lot like either a network issue, or (less likely) very slow instance or very slow (overcommitted) host  [17:09]
<corvus> i wonder if it's worth grabbing a flex node and running paired tcpdumps while doing a nodepool-style keyscan  [17:13]
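A sketch of that experiment, assuming shell access on both ends and a placeholder target address; ssh-keyscan only approximates nodepool's one-connection-per-key-type behavior:

    # capture ssh traffic on both hosts while repeatedly scanning
    sudo tcpdump -i any -w keyscan.pcap "host 192.0.2.10 and tcp port 22" &
    for i in $(seq 1 100); do
      for t in rsa ecdsa ed25519; do
        timeout 30 ssh-keyscan -t "$t" 192.0.2.10 >/dev/null || echo "attempt $i type $t failed"
      done
    done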
<opendevreview> Merged openstack/project-config master: Increase the boot timeout for Rackspace Flex nodes  https://review.opendev.org/c/openstack/project-config/+/928668  [17:13]
<fungi> without knowing the details of the network architecture/integration for rackspace flex, it's hard to guess but could be that they're relying on a routing protocol to announce which host is the destination for which fip and there's some lag in updates replicating (maybe switches slow at updating bridge tables? routers holding onto stale arp entries rather than replacing them?)  [17:15]
<corvus> increasing the timeout is probably just papering over the issue.  that timeout was mostly to handle the case of the host actually booting slowly (like, the time from start-of-boot to ssh-is-listening).  once ssh is listening, we don't expect, under normal circumstances, for the actual keyscan to take long.  and we know from those log lines that ssh was listening as soon as we got the first keyscan reply  [17:16]
<corvus> fungi: re networking -- i agree, but i would also expect that routing/arp/whatever lag to show up before the first keyscan reply, but not after that  [17:17]
<fungi> oh, does the keyscan only happen after an initial ssh attempt?  [17:18]
<corvus> it's a bunch of connections because ssh will only give us the fingerprint for one at a time  [17:18]
<corvus> let me annotate the log entries... one sec  [17:18]
<fungi> but yeah, if they have redundant network paths then different connections could end up going over flows through different devices so could be a case where it's intermittent until all devices get on the same page about where an address resides  [17:19]
<fungi> anyway, definitely worth bringing to cloudnull's and cardoe's attention  [17:23]
<corvus> fungi: https://paste.opendev.org/show/bhyUryen2qUIE75mKXPI/  [17:24]
<clarkb> corvus: ya that was my concern, it is odd behavior to time out after several fast scans  [17:25]
<corvus> each one of those is closed before starting the next one (so they aren't open at the same time)  [17:25]
<clarkb> like something in the network is having a sad periodically  [17:25]
<fungi> corvus: agreed, that does seem to indicate that it's not just taking too long for the server instance to become generally reachable  [17:27]
<fungi> also for those following along, these are connections from rackspace (classic) dfw to rackspace flex sjc3  [17:28]
<clarkb> my hunch is that the NAT system is failing every Nth connection attempt  [17:30]
<clarkb> which is classic NAT-tables-are-full behavior, though this is 1:1 so in theory it should be far less susceptible to that  [17:31]
<fungi> also worth noting, these connections don't seem to go through dedicated circuits, a traceroute from nl01 to mirror.sjc3.raxflex transits datapipe's network for a hop or two  [17:33]
<fungi> clarkb: fips shouldn't be using layer 4 translation though?  [17:33]
<fungi> nat tables are typically only for overload/shared nat addresses  [17:34]
<fungi> while fips are 1:1  [17:34]
<clarkb> fungi: I think it depends on the implementation? There are so many in neutron that I wouldn't be surprised if there were oddities  [17:35]
<fungi> yeah, i'll grant that sometimes implementations do weird things  [17:35]
<clarkb> but yes in theory it should just rewrite the dest/source ip and forward on  [17:35]
<fungi> i've copied the latest etherpad database backup to the new held node, but i'll need a few minutes to mount the ephemeral disk at /var/etherpad/db since the rootfs is too small for me to import into  [17:36]
<clarkb> fungi: oh I was suggesting we update the old 2.2.2 held node instead  [17:37]
<clarkb> since that would be faster?  [17:37]
<clarkb> basically instead of copying the db over to the new node and applying it, which took hours, we just upgrade the existing node with the db set to prod-ish to 2.2.4  [17:37]
<clarkb> but either way works  [17:37]
<fungi> ah, yeah we could do that. how do i upgrade it to the new version? just edit the version in the compose file?  [17:37]
<fungi> (and pull/down/up -d)  [17:38]
<clarkb> fungi: yes, you modify the image specification in the compose file to point at insecure-ci-registry.opendev.org:5000/opendevorg/etherpad:d93fdc1fb78b4fdcb5c303a08de0aab2_v2.2.4  [17:38]
<clarkb> from https://zuul.opendev.org/t/openstack/build/d93fdc1fb78b4fdcb5c303a08de0aab2/artifacts  [17:39]
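The manual steps on the held node would look roughly like this; the compose file location is an assumption:

    cd /etc/etherpad-docker   # assumed path of the compose file on the held node
    # edit docker-compose.yaml so the etherpad image line reads:
    #   image: insecure-ci-registry.opendev.org:5000/opendevorg/etherpad:d93fdc1fb78b4fdcb5c303a08de0aab2_v2.2.4
    docker-compose pull
    docker-compose down
    docker-compose up -d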
<fungi> looks like 104.130.172.149 is the old hold  [17:39]
<fungi> it's upgraded to the 2.2.4 test image now  [17:41]
<fungi> and restarted  [17:41]
<clarkb> cool /me throws that ip into /etc/hosts  [17:44]
<fungi> oh, that may be the wrong hold  [17:44]
<fungi> that was the 2.2.2 hold but maybe there's a 2.2.3 hold as well, sorry  [17:44]
<clarkb> ya that doesn't have the prod db in it. I think there were two holds for 2.2.2 for some reason I don't recall  [17:45]
<fungi> aha, that's what it is  [17:45]
<clarkb> one was the one I manually fixed with redirects maybe and the other was the one that had everything automated?  [17:45]
<fungi> 104.130.140.169 was the one i was supposed to do  [17:45]
<fungi> working on updating the version on that one now  [17:47]
<clarkb> 169 has the shorter uptime so pretty sure that would be the correct one either way  [17:47]
<fungi> also rootfs has filled up  [17:48]
<clarkb> can probably trim journald contents to free some space quickly  [17:48]
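A quick example of that trim; the size and age targets are arbitrary:

    sudo journalctl --vacuum-size=200M   # cap the persistent journal on disk
    sudo journalctl --vacuum-time=7d     # or drop entries older than a week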
<fungi> i cleaned up the etherpad database backups  [17:49]
<clarkb> ++  [17:49]
<fungi> i'm going to reboot the server just to make sure it's in good shape  [17:49]
<fungi> different heading levels for https://etherpad.opendev.org/p/mm3migration still look correct on the held node  [17:52]
<fungi> similar for code/monospace  [17:52]
<clarkb> seem to be good in https://etherpad.opendev.org/p/gerrit-upgrade-3.9 too  [17:53]
<clarkb> so ya I think 2.2.4 seems to be at least as functional as 2.2.2 was  [17:53]
<clarkb> maybe start by approving the first change in the stack and ensuring the :latest and :v2.1.1 tags are updated in docker hub and deployed to prod successfully, then we can do the 2.2.4 update, optionally backing up the db in some coordinated fashion  [17:54]
<fungi> after we upgrade, we should plan a meetpad test just to make sure things are still working with integration too  [17:54]
<clarkb> ++  [17:54]
<clarkb> I'm going to pop out for a bike ride here now that the smoke has blown away and before it gets too hot today  [17:55]
<clarkb> I'll be back in a bit and thank you for the help pushing this along  [17:55]
<fungi> yw!  [17:55]
<fungi> the nodepool.yaml on nl01 updated at 17:24 utc, so i guess we should see if error rates before/after that are similar or drastically different, though following corvus's analysis i don't have high hopes it will actually help  [18:59]
<cardoe> fungi: let’s poke jamesdenton (who isn’t in this channel) since the network is his baby. But cloudnull too.  [19:17]
<fungi> infra-root: i'm looking at the rackspace swift cleanup i mentioned last week (or the week before?). i think all our current zuul build log containers are named like "log_NNN" where NNN is a 3-digit hexadecimal number, the index.html files at the root of each one have a last modified date of 2019-09-05 which i think is when we switched to the current 1024-way sharding scheme  [19:51]
<fungi> er, rather, "zuul_opendev_logs_NNN"  [19:52]
<fungi> there are some older ones containing zuul build logs with names like just "logs_NN" where NN is a two-digit hex number, and also some even older with names like "logs_periodic", "logs_periodic-stable", and one simply named "infra-files"  [19:53]
<fungi> i propose to delete the swift containers named: infra-files, logs_NN, logs_periodic, logs_periodic-stable  [19:56]
<fungi> infra-files has files created circa 2014, logs_NN have top-level index.html files from 2019-08-15 which i think coincides with our brief 256-way sharding, while logs_periodic* have files from 2019-08-16  [20:00]
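A sketch of how that cleanup could be run with python-openstackclient; the --recursive flag (delete the objects along with the container) should be double-checked against the installed client version:

    # sanity-check contents first
    openstack container list
    openstack object list logs_periodic | head
    # then remove the obsolete containers and their objects
    for c in infra-files logs_periodic logs_periodic-stable; do
      openstack container delete --recursive "$c"
    done
    # the two-hex-digit logs_NN containers can be matched by pattern
    for c in $(openstack container list -f value -c Name | grep -E '^logs_[0-9a-f]{2}$'); do
      openstack container delete --recursive "$c"
    done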
<fungi> clarkb: is this something worth putting on the meeting agenda for tomorrow? or probably not in need of discussion?  [20:01]
<clarkb> fungi: I think it's mostly safe except for maybe infra-files. Is that what we use for image uploads?  [20:08]
<clarkb> but ya the logs containers should all be safe since as you mention we're in the three digit naming scheme these days  [20:08]
<fungi> the image uploads have their own containers ("images" and "image_segments")  [20:09]
<clarkb> gotcha, there is also a container or containers for the docker container registry  [20:09]
<clarkb> not sure what that one is called  [20:09]
<clarkb> side note: we could stand to possibly pick a time to delete that one and force things to regenerate since the pruning doesn't work iirc  [20:10]
<fungi> is that in the zuul project or the control plane project though?  [20:10]
<clarkb> good question  [20:10]
<fungi> i was only looking at the zuul project  [20:10]
<clarkb> the container is called intermediate_registry  [20:11]
<clarkb> oh but we do set a file expiration on them so maybe pruning more aggressively isn't critical  [20:12]
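The expiration mentioned here is presumably swift's object expiry (X-Delete-After/X-Delete-At headers); a hedged example of an upload that expires on its own, with an arbitrary 30-day value:

    swift upload intermediate_registry some/blob --header "X-Delete-After: 2592000"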
<clarkb> fungi: any idea what is creating infra-files? I hesitate to delete its contents if we don't know the origin  [20:12]
<fungi> clarkb: as i said earlier, it contains ci job logs, some dating back to 2014  [20:14]
<Clark[m]> Oh sorry I read that as files created circa 2024.  [20:16]
<Clark[m]> I see now how my brain did a character replacement it shouldn't have  [20:16]
<fungi> the top-level directories in the infra-files container are all just two hexadecimal digits, then under those are a mix of gerrit change numbers and git commit ids  [20:17]
<fungi> i think this was our zuul build logs container from before we switched to sharding across multiple containers  [20:18]
<fungi> i don't see any content in there that isn't job logs  [20:20]
<clarkb> makes sense (and sorry for the nick swaps I had to quickly eat lunch)  [20:23]
<fungi> you should feel free to eat lunch at a comfortable pace, we don't need you choking on your food  [20:23]
<fungi> none of this is urgent  [20:24]
<clarkb> years of competing with my siblings for food has led to crazy abilities to eat food too quickly  [20:24]
<fungi> yeah, i only had one brother, but i know what you mean  [20:24]
<fungi> s/had/have/  [20:25]
<fungi> but we don't often compete for food these days  [20:25]
<clarkb> ok I've made some edits to the meeting agenda. fungi do you want me to add an item about the swift containers? or should we just proceed with cleaning those up based on your investigation? I think I'm comfortable with that  [20:41]
<fungi> i can proceed, doesn't seem to need further discussion no  [20:41]
<clarkb> and ya git log -p in system-config provides further clues to infra-files origins  [20:42]
<fungi> oh indeed  [20:42]
<clarkb> 4c0f432ca5d435ca464fa03381545e90d0619837  [20:42]
<clarkb> seems like that confirms it was a container used for log uploads and that we stopped using it for that  [20:42]
<fungi> https://grafana.opendev.org/d/6d29645669/nodepool3a-rackspace-flex?orgId=1&viewPanel=12&from=now-12h&to=now does seem to show the errors ending abruptly at the time nodepool.yaml updated with the longer boot-timeout, but also the volume at that time was way lower than it had been and we saw some lengthy periods with no errors prior as well. we'll probably need to compare tomorrow's numbers to today's to see if there seems to be any actually significant change  [21:51]
<clarkb> ++ seems to be worse when busier  [21:54]
<fungi> but also yes this is something we ought to bring up with the flex folks  [21:56]
<clarkb> I'm going to get the meeting agenda sent out in the next 45 minutes or so. I hope that is late enough for tonyb to chime in if there is anything to add before then  [22:51]
<clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/926970 passes testing now after a recheck with updated centos 9 stream arm64 images  [22:58]
<clarkb> I hesitate to say send it due to potential for unexpected job behavior changes but the risk of that should be low  [22:58]
