Tuesday, 2024-09-10

05:48 *** ralonsoh_ is now known as ralonsoh
12:36 <fungi> jamespage: corvus is the gertty author (redirecting your question from #openstack-infra)
12:36 <jamespage> fungi: ta
13:05 <fungi> #status log Released gerritlib 0.11.0
13:05 <fungi> NeilHanlon: ^
13:05 <opendevstatus> fungi: finished logging
13:15 <opendevreview> Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
13:16 <opendevreview> Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
13:22 <NeilHanlon> fungi++
14:04 <fungi> infra-root: i'm getting started deleting the ancient swift containers we used for zuul build logs circa 2014 up until we switched to the 1024-way sharding 5 years ago, so basically only touching containers we haven't used for more than 5 years now
14:08 <fungi> or i was going to, not sure what leads `openstack container delete ...` to respond with "Conflict (HTTP 409)"
14:09 <fungi> definitely don't want to clicky-delete 259 containers through the rackspace webui
14:15 <leonardo-serrano> Hi, setuptools has received an update recently and it requires some deps to be installed manually to work. I was wondering if the topic has been discussed already
14:17 <fungi> leonardo-serrano: can you provide more details? how recently? what deps? what is "manually" in this context?
14:18 <fungi> and most importantly, how does it relate to the opendev collaboratory?
14:19 <frickler> fungi: add "--recursive"?
14:19 <fungi> frickler: oh, that seems to be working. thanks!
14:20 <leonardo-serrano> We're hitting this error: https://github.com/pypa/setuptools/issues/4478#issuecomment-2235160778  This comment has some more details
14:20 <fungi> leonardo-serrano: who is "we"?
14:21 <leonardo-serrano> fungi: I am currently responsible for maintaining zuul on the starlingx project
14:22 <fungi> leonardo-serrano: thanks for the clarification, so starlingx zuul jobs are failing with setuptools 71 and later due to a change in how it treats vendored dependencies? that's the amount of detail i was trying to get
14:22 <leonardo-serrano> fungi: About the relation to opendev collab, I found a task for installing setuptools. I was wondering if there is a WIP change in there or if I should fix it in the starlingx project directly
14:23 <fungi> leonardo-serrano: well, let's start with an example failure. can you link to a build result page for such a failure?
14:23 <leonardo-serrano> Perhaps it affects other projects? This is the task I mentioned: https://opendev.org/zuul/zuul-jobs/blame/commit/839de7f8996838162ae0de6a9f6ba28f968381bc/roles/ensure-pip/tasks/workarounds.yaml
14:24 <leonardo-serrano> fungi: Sure. Here's an example: https://zuul.opendev.org/t/openstack/build/6bd6d91c041b460fb6005d68c8f109cf/console
14:24 <fungi> thanks, looking
14:24 <frickler> leonardo-serrano: almost completely unrelated, but did you see https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 ? in particular cleaning up the nodeset usage would be nice
14:26 <leonardo-serrano> frickler: I wasn't aware. Thanks for pointing it out. So the action is to move jobs away from "centos-7" and "opensuse-15" nodes, correct? Is there more?
14:27 <leonardo-serrano> I'm new to zuul, so I appreciate the advice
14:30 <frickler> leonardo-serrano: yes, those nodesets are no longer available and the jobs that try to use them will be undefined
14:30 <fungi> leonardo-serrano: looking at related issue https://github.com/pypa/setuptools/issues/4483 i think we need to identify what's causing an older version of the "packaging" library to get installed in your jobs. i'll see if i can spot it
14:41 <fungi> though the job did appear to install latest packaging in the venv, i think this is impacting isolated sdist builds which don't occur within the scope of the venv
14:43 <leonardo-serrano> fungi: I just found it. In the tox file there is a pip install command pulling in packaging===20.9
14:44 <leonardo-serrano> https://opendev.org/starlingx/config/src/branch/master/controllerconfig/controllerconfig/tox.ini#L26
14:44 <fungi> oh wow, that doesn't show up in the job log at all
14:45 <fungi> good find!
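
For reference, a quick way to see which setuptools and packaging an environment actually resolves is a tiny diagnostic along these lines; this is only a sketch (not part of the starlingx job), but it illustrates the mechanism discussed above: setuptools >= 71 reuses an already-installed packaging rather than its vendored copy, so an old pin like packaging===20.9 can break builds even when a venv elsewhere has a newer one.

    # Diagnostic sketch (hypothetical, not from the job above): print which
    # setuptools and packaging a build environment imports and where from.
    import packaging
    import setuptools

    print("setuptools:", setuptools.__version__, "from", setuptools.__file__)
    print("packaging: ", packaging.__version__, "from", packaging.__file__)
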
14:45 <opendevreview> Julia Kreger proposed openstack/diskimage-builder master: docs: add two contextual warnings to the replace-partition element  https://review.opendev.org/c/openstack/diskimage-builder/+/928819
14:46 <leonardo-serrano> fungi: Thank you very much for the assistance!
14:46 <fungi> leonardo-serrano: my pleasure. let me know if cleaning that up doesn't seem to fix it and maybe we can find out what else has gone wrong there
15:01 <clarkb> fungi: leonardo-serrano I think you don't get the verbose tox run because it's in the setup run without tests step? Maybe that role should be updated to run tox verbosely there too?
15:01 <jamespage> corvus: hey - my query from #openstack-infra was gertty related - the last release was some time ago and it's caught in the sqlalchemy 2.0 transition in Ubuntu (and Debian).  I've used a snapshot for the time being to unblock - would a new release be possible?
15:10 <clarkb> leonardo-serrano: fungi: this issue is similar to the one that starlingx ran into with old pyzmq having a dependency that doesn't work with newer setuptools/pip/pypa things. It is probably worth mentioning that if you are going to use constraints you either need to freeze absolutely everything or you need to regularly update constraints to keep up with the world moving around you
15:21 <corvus> jamespage: i think so, i'll put that on my list; thanks for the reminder :)
15:21 <jamespage> corvus: great thanks!
15:22 <clarkb> fungi: corvus the number of rax flex errors is much lower this morning than yesterday. Could still be load related, but maybe the network issues resolve if given long enough rather than timing out within 60 seconds?
15:22 <clarkb> you can see the difference if you set the graph timeline to 2 days
15:23 <corvus> might be interesting to look at the keyscan logs and see if the behavior is still there (but finishing after 60+ seconds)
15:24 <clarkb> ++
15:30 <fungi> so... `openstack container delete --recursive <container_name>` runs for about 8-9 minutes but eventually comes back with "Not Found (HTTP 404)"
15:31 <fungi> not sure if it's hitting some sort of internal timeout or what. maybe --verbose will help pinpoint the issue
15:31 <clarkb> fungi: any idea if the object count fell? I wonder if there is some race in the recursive cleanup and we eventually try to delete something that is gone
15:31 <clarkb> or maybe the container itself is bugged and has lost track of an object or two
15:31 <corvus> what were the old and new boot timeout values?
15:31 <clarkb> corvus: old was 60 seconds (default), new is 120 seconds
15:32 <fungi> good idea, i should compare the object count before/after
15:35 <corvus> https://paste.opendev.org/show/bVMwgRnV7KY2v301fPUz/
15:35 <corvus> it looks like it takes ~30s to boot (sample size of 1 -- consider that a highly suspect assumption)
15:36 <corvus> so any of those that are >~30s would have timed out before
15:36 <corvus> it's batchy
15:36 <clarkb> corvus: 0025152447 reports multiple "scanned keys" log lines. Is each of those for a different key? and in that case this node would've failed before?
15:36 <clarkb> oh, that's one request with multiple nodes
15:37 <clarkb> node 0038452371 for example is the actual issue that has been resolved
15:37 <corvus> yep, 3 node nodeset
15:43 <corvus> looking at the logs, i haven't found any individual nodes that fit the pattern from yesterday
15:44 <corvus> even the long ones look like they finished scanning quickly (the extra time may be nodepool time, not cloud time)
15:44 <clarkb> oh interesting, so maybe whatever the issue is was corrected at the cloud level, or we haven't driven enough demand to trigger it
15:45 <corvus> yeah maybe.  example: https://paste.openstack.org/show/bAz7WEtIeXSCbLKoBFHi/
15:46 <corvus> note the "complete" line second to last.  that says the time between that and the next log line is nodepool overhead
15:46 <corvus> but we did not see "complete" lines on the problem nodes yesterday
15:46 <clarkb> what is the delay between complete and scanned in the last two log lines?
15:47 <clarkb> you're right though that that is a different state transition, but I wonder if maybe the time is lost due to a similar mechanism in both cases
15:47 <corvus> (i think that time is a hand-off from the nodescan state machine to the node create state machine, and the latter runs less frequently; but i'm not 100% sure)
15:48 <corvus> the nodescan state machines are event-driven and should be very fast
15:49 <corvus> (the launch state machines are polled)
15:51 <clarkb> as long as the state machine doesn't see its own delay as triggering the timeout, that should be fine
15:57 <corvus> yeah, i'm pretty sure from the logs yesterday the timeout was prior to that point, because we didn't see the "complete" line
16:01 <fungi> clarkb: good call, the object_count does drop, but only by the tiniest amount. i don't think this is going to be viable, deletion will take years of continuous running to complete at this rate
16:02 <clarkb> I guess there isn't an "I really mean to delete this container and all of its contents" api call?
16:02 <fungi> well, that's what this is i think
16:02 <clarkb> this is still explicitly deleting every object individually I think
16:03 <fungi> `openstack container delete --recursive` removes the file objects, because the container refuses to delete while it still has contents
16:03 <clarkb> rather than marking the container for cleanup and then letting swift's backend pruning (for expiration etc) clean up as it goes
16:03 <clarkb> right
16:03 <clarkb> I was just hoping swift has a bulk deletion api since it does do bulk deletes in the background already
16:03 <fungi> if it does, i can't find that plumbed through osc
16:04 <clarkb> timburke may know if there is a better option for us here
16:05 <fungi> yeah, rough math says this is deleting approximately 5 files per second
16:05 <fungi> even just a single container has hundreds of thousands, and we have hundreds of such containers to delete
16:07 <clarkb> another approach would be to try and do more delete requests in parallel, but that may require writing our own client delete tool
16:08 <clarkb> I'm guessing osc is doing it serially
16:09 <fungi> oh, if this is round-trip object deletions one by one and not a recursive delete option internal to swift, yeah that would explain the painful slowness
16:10 <mordred> I'm pretty sure it's the round-trip object deletions
16:10 <clarkb> yes, it's doing a listing then doing deletes one by one with 200ms request rtt
16:11 <mordred> I feel like I remember a parallel api in shade/sdk somewhere - but I might just be thinking about parallel uploads of image segments
16:12 <corvus> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/delete_swift_container.py
16:12 <corvus> i wrote a script that does the same thing :/
16:12 <corvus> but -- you could probably pretty easily throw a threadpoolexecutor onto that and speed it up
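
A minimal sketch of that idea, assuming openstacksdk with a clouds.yaml entry named "rax" and a made-up container name (neither comes from the channel, and this is not the zuul-jobs module); it also assumes the shared sdk connection copes with concurrent requests from threads:

    # Sketch: parallelize per-object deletes with a thread pool instead of the
    # serial one-at-a-time deletes osc does. Cloud and container names are
    # placeholders; error handling and retries are omitted.
    import concurrent.futures

    import openstack

    cloud = openstack.connect(cloud="rax")      # assumed clouds.yaml entry
    container = "zuul_logs_2014_example"        # hypothetical container name

    names = [obj.name for obj in cloud.object_store.objects(container)]

    def delete_one(name):
        # assumes the connection tolerates concurrent use across threads
        cloud.object_store.delete_object(name, container=container,
                                         ignore_missing=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(delete_one, names))

    # only once the container is empty can the container itself be deleted
    cloud.object_store.delete_container(container)
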
16:15 <fungi> oh neat
16:18 <fungi> so anyway, i suspect https://docs.openstack.org/swift/latest/api/bulk-delete.html is what we would actually want
16:18 <fungi> but what are the chances rackspace's ancient swift deployment supports that?
16:19 <clarkb> fungi: the example there has an example xml response so maybe :)
16:19 <corvus> worth looking into; i don't think i've ever tried
16:19 <fungi> though also, "up to 10,000 objects or containers"
16:20 <fungi> so we're still looking at hundreds of requests per container for hundreds of containers
16:22 <clarkb> fungi: I wonder if bulk container delete requests on a container would work without clearing all the objects first
16:26 <corvus> yeah, that has a "delete a whole container" option, i'd try that first :)
16:27 <corvus> at least, i read that the same way that clarkb is optimistically reading it
16:31 <fungi> where do you see that?
16:31 <fungi> it says "To indicate a container, specify the container name as: CONTAINER_NAME. Make sure that the container is empty. If it contains objects, Object Storage cannot delete the container."
16:33 <clarkb> that's too bad
16:34 <clarkb> so ya, the most efficient process is probably to do a listing, then multiple 10k object deletion requests until empty, then delete the container?
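
If the endpoint does support the bulk middleware, that flow might look roughly like the sketch below; the storage URL, token, and container name are placeholders, listing pagination is glossed over, and the 10,000-objects-per-request limit is taken from the docs linked above:

    # Sketch of Swift bulk-delete per the linked docs: POST newline-separated,
    # URL-encoded "/container/object" paths (up to 10,000 per request) to the
    # account URL with ?bulk-delete, then delete the now-empty container.
    # storage_url, token and container are placeholders; a real tool would also
    # page through the listing with the marker parameter and check each response.
    from urllib.parse import quote

    import requests

    storage_url = "https://storage.example.com/v1/ACCOUNT_ID"  # placeholder
    token = "AUTH_TOKEN"                                       # placeholder
    container = "zuul_logs_2014_example"                       # placeholder
    auth = {"X-Auth-Token": token}

    listing = requests.get(f"{storage_url}/{container}?format=json",
                           headers=auth).json()
    names = [obj["name"] for obj in listing]

    for i in range(0, len(names), 10000):
        body = "\n".join(f"/{container}/{quote(name)}"
                         for name in names[i:i + 10000])
        requests.post(f"{storage_url}?bulk-delete",
                      headers={**auth, "Content-Type": "text/plain",
                               "Accept": "application/json"},
                      data=body.encode("utf-8"))

    # once empty, the container itself can be removed
    requests.delete(f"{storage_url}/{container}", headers=auth)
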
16:38 <corvus> bummer
16:43 <fungi> or ask support to remove the containers we aren't using. they may also have admin access to do it in ways that wouldn't put so much strain on the api
16:49 <fungi> also, for the record, this was prompted by cardoe's https://launchpad.net/bugs/2078229 so he might have ideas about whether support has access to help us clean it up faster
16:51 <clarkb> I believe those keys get generated at job runtime and only exist for the duration of the job...
16:51 <clarkb> I can understand people being cautious about things like this but it seems overkill to worry too much
16:51 <clarkb> side note: tripleo logging everything is something I tried many many times to address
16:52 <clarkb> ya our image builds shouldn't produce ssh host keys and instead glean generates them uniquely on boot
16:52 <clarkb> then we delete the node when the jobs complete
16:53 <fungi> yeah, i mean, it's clearly not a security risk. but it did lead me to discover that we have a bunch of old content in swift containers that we abandoned, probably because we started trying to delete them and then realized it was taking forever and forgot to follow up
16:53 <fungi> so my concern is from the perspective of being a conscientious steward of the resources rackspace is generously donating to our community
16:54 <fungi> and not wasting disk space on logs that haven't been relevant for 5-10 years
17:04 <fungi> for the record, the rackspace dashboard isn't super helpful here either. trying to delete a container with contents is blocked, and shows this tooltip: "This container must be empty before you can delete it. For bulk deletes, we recommend using a free third-party cloud storage browser such as Cyberduck or Cloudberry Explorer."
17:46 <frickler> maybe look at https://rclone.org/commands/rclone_purge/
18:04 <cardoe> fungi: clarkb: I kinda figured it was ephemeral data and didn't really matter. But some internet researchers keep pinging our security people. So I got bugged to "handle it"
18:07 <fungi> cardoe: i don't suppose you happen to know whether support has admin options for cleaning up whole swift containers that's faster than what a regular user can do through the api?
18:09 <cardoe> Yeah they should
18:09 <fungi> thanks, i'll just put a list in a ticket in that case
18:35 <cardoe> Annoyingly I'm working on Ironic stuff for us. (My consumers are internal only) And I used Triple-O as an example in one of my talks. Saying that my work is the undercloud and the things consuming me are the overcloud. And so I'm associated with the name "undercloud" and that's what that path had in it.
18:36 <fungi> heh
20:07 <clarkb> fungi: I posted a comment about removing old content and adding new content to the euler mirror
20:07 <clarkb> also I suspect this is the same person asking about gitea apis... makes me wonder what is going on
20:43 <fungi> that's an odd coincidence, yeah
21:04 <clarkb> fungi: should we proceed with https://review.opendev.org/c/opendev/system-config/+/928656/1 ?
21:10 <clarkb> working on a rax flex update email here: https://etherpad.opendev.org/p/l-in0EBS7220tsrAMMem
21:23 <fungi> yeah, approved since nobody else seemed to be reviewing that stack
21:24 <fungi> itym "rax folx"
21:24 <fungi> ;)
21:36 <clarkb> ok, how's that look? Did I miss any of the minor issues we've hit? maybe other ideas on next steps?
21:40 <clarkb> cardoe: happy to cc you too if you'd like. I can use your email address in gerrit or, if you'd prefer, feel free to PM me a rackspace address
21:48 <clarkb> side note: looking at the last 2 days of grafana graphs it says we're using 10.6 nodes (that number is provided by sumSeries which I think is effectively an integral?) anyway that aligns with the idea that we continue to use ~1/3 of our capacity
21:49 <fungi> clarkb: message lgtm, i added some minor notes if you want to include those
21:49 <clarkb> thanks
21:49 <clarkb> I'll wait a bit to see if cardoe wants to be cc'd and then send it out
21:50 <fungi> oh, may also be worth noting that we typically use 8vcpu flavors elsewhere, but 4vcpu was sufficient in flex?
21:50 <clarkb> sure
21:50 <fungi> not sure where (or if) you'd work that in
21:51 <clarkb> something like that maybe
21:51 <fungi> yep, superb
21:56 <opendevreview> Merged opendev/system-config master: Tag etherpad images with version  https://review.opendev.org/c/opendev/system-config/+/928656
21:58 <clarkb> that image is promoting now
21:58 <clarkb> we should check the tag shows up in docker hub as expected and that etherpad deploys the new version for us
21:59 <clarkb> well, a new build of the old version
21:59 <clarkb> https://hub.docker.com/r/opendevorg/etherpad/tags both v2.1.1 and latest were updated just now
21:59 <fungi> yay!
22:01 <clarkb> and docker ps -a shows the container was restarted on the updated version and reports it is healthy
22:01 <clarkb> I can still reach my email draft pad so ya this looks happy
22:42 <clarkb> I'm going to go ahead and send that email. Can loop in cardoe later if necessary
