Tuesday, 2024-09-10

05:48 *** ralonsoh_ is now known as ralonsoh
12:36 <fungi> jamespage: corvus is the gertty author (redirecting your question from #openstack-infra)
12:36 <jamespage> fungi: ta
13:05 <fungi> #status log Released gerritlib 0.11.0
13:05 <fungi> NeilHanlon: ^
13:05 <opendevstatus> fungi: finished logging
13:15 <opendevreview> Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
13:16 <opendevreview> Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role  https://review.opendev.org/c/zuul/zuul-jobs/+/925754
13:22 <NeilHanlon> fungi++
14:04 <fungi> infra-root: i'm getting started deleting the ancient swift containers we used for zuul build logs circa 2014 up until we switched to the 1024-way sharding 5 years ago, so basically only touching containers we haven't used for more than 5 years now
14:08 <fungi> or i was going to, not sure what leads `openstack container delete ...` to respond with "Conflict (HTTP 409)"
14:09 <fungi> definitely don't want to clicky-delete 259 containers through the rackspace webui
14:15 <leonardo-serrano> Hi, setuptools has received an update recently and it requires some deps to be installed manually to work. I was wondering if the topic has been discussed already
14:17 <fungi> leonardo-serrano: can you provide more details? how recently? what deps? what is "manually" in this context?
14:18 <fungi> and most importantly, how does it relate to the opendev collaboratory?
14:19 <frickler> fungi: add "--recursive"?
14:19 <fungi> frickler: oh, that seems to be working. thanks!
14:20 <leonardo-serrano> We're hitting this error: https://github.com/pypa/setuptools/issues/4478#issuecomment-2235160778  This comment has some more details
14:20 <fungi> leonardo-serrano: who is "we"?
14:21 <leonardo-serrano> fungi: I am currently responsible for maintaining zuul on the starlingx project
14:22 <fungi> leonardo-serrano: thanks for the clarification, so starlingx zuul jobs are failing with setuptools 71 and later due to a change in how it treats vendored dependencies? that's the amount of detail i was trying to get
14:22 <leonardo-serrano> fungi: About the relation to opendev collab, I found a task for installing setuptools. I was wondering if there is a WIP change in there or if I should fix it in the starlingx project directly
14:23 <fungi> leonardo-serrano: well, let's start with an example failure. can you link to a build result page for such a failure?
14:23 <leonardo-serrano> Perhaps it affects other projects? This is the task I mentioned: https://opendev.org/zuul/zuul-jobs/blame/commit/839de7f8996838162ae0de6a9f6ba28f968381bc/roles/ensure-pip/tasks/workarounds.yaml
14:24 <leonardo-serrano> fungi: Sure. Here's an example: https://zuul.opendev.org/t/openstack/build/6bd6d91c041b460fb6005d68c8f109cf/console
14:24 <fungi> thanks, looking
14:24 <frickler> leonardo-serrano: almost completely unrelated, but did you see https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 ? in particular cleaning up the nodeset usage would be nice
14:26 <leonardo-serrano> frickler: I wasn't aware. Thanks for pointing it out. So the action is to move jobs away from "centos-7" and "opensuse-15" nodes, correct? Is there more?
14:27 <leonardo-serrano> I'm new to zuul, so I appreciate the advice
14:30 <frickler> leonardo-serrano: yes, those nodesets are no longer available and the jobs that try to use them will be undefined
14:30 <fungi> leonardo-serrano: looking at related issue https://github.com/pypa/setuptools/issues/4483 i think we need to identify what's causing an older version of the "packaging" library to get installed in your jobs. i'll see if i can spot it
14:41 <fungi> though the job did appear to install latest packaging in the venv, i think this is impacting isolated sdist builds which don't occur within the scope of the venv
14:43 <leonardo-serrano> fungi: I just found it. In the tox file there is a pip install command pulling in packaging===20.9
14:44 <leonardo-serrano> https://opendev.org/starlingx/config/src/branch/master/controllerconfig/controllerconfig/tox.ini#L26
14:44 <fungi> oh wow, that doesn't show up in the job log at all
14:45 <fungi> good find!
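
For reference, a quick way to see which setuptools and packaging an environment actually resolves is a tiny diagnostic along these lines; this is only a sketch (not part of the starlingx job), but it illustrates the mechanism discussed above: setuptools >= 71 reuses an already-installed packaging rather than its vendored copy, so an old pin like packaging===20.9 can break builds even when a venv elsewhere has a newer one.

    # Diagnostic sketch (hypothetical, not from the job above): print which
    # setuptools and packaging a build environment imports and where from.
    import packaging
    import setuptools

    print("setuptools:", setuptools.__version__, "from", setuptools.__file__)
    print("packaging: ", packaging.__version__, "from", packaging.__file__)
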
14:45 <opendevreview> Julia Kreger proposed openstack/diskimage-builder master: docs: add two contextual warnings to the replace-partition element  https://review.opendev.org/c/openstack/diskimage-builder/+/928819
14:46 <leonardo-serrano> fungi: Thank you very much for the assistance!
14:46 <fungi> leonardo-serrano: my pleasure. let me know if cleaning that up doesn't seem to fix it and maybe we can find out what else has gone wrong there
15:01 <clarkb> fungi: leonardo-serrano I think you don't get the verbose tox run because it's in the setup run without tests step? Maybe that role should be updated to run tox verbosely there too?
15:01 <jamespage> corvus: hey - my query from #openstack-infra was gertty related - the last release was some time ago and it's caught in the sqlalchemy 2.0 transition in Ubuntu (and Debian).  I've used a snapshot for the time being to unblock - would a new release be possible?
15:10 <clarkb> leonardo-serrano: fungi: this issue is similar to the one that starlingx ran into with old pyzmq having a dependency that doesn't work with newer setuptools/pip/pypa things. It is probably worth mentioning that if you are going to use constraints you either need to freeze absolutely everything or you need to regularly update constraints to keep up with the world moving around you
15:21 <corvus> jamespage: i think so, i'll put that on my list; thanks for the reminder :)
15:21 <jamespage> corvus: great thanks!
15:22 <clarkb> fungi: corvus the number of rax flex errors is much lower this morning than yesterday. Could still be load related, but maybe the network issues resolve if given long enough rather than timing out within 60 seconds?
15:22 <clarkb> you can see the difference if you set the graph timeline to 2 days
15:23 <corvus> might be interesting to look at the keyscan logs and see if the behavior is still there (but finishing after 60+ seconds)
15:24 <clarkb> ++
15:30 <fungi> so... `openstack container delete --recursive <container_name>` runs for about 8-9 minutes but eventually comes back with "Not Found (HTTP 404)"
15:31 <fungi> not sure if it's hitting some sort of internal timeout or what. maybe --verbose will help pinpoint the issue
15:31 <clarkb> fungi: any idea if the object count fell? I wonder if there is some race in the recursive cleanup and we eventually try to delete something that is gone
15:31 <clarkb> or maybe the container itself is bugged and has lost track of an object or two
15:31 <corvus> what were the old and new boot timeout values?
15:31 <clarkb> corvus: old was 60 seconds (default), new is 120 seconds
15:32 <fungi> good idea, i should compare the object count before/after
15:35 <corvus> https://paste.opendev.org/show/bVMwgRnV7KY2v301fPUz/
15:35 <corvus> it looks like it takes ~30s to boot (sample size of 1 -- consider that a highly suspect assumption)
15:36 <corvus> so any of those that are >~30s would have timed out before
15:36 <corvus> it's batchy
15:36 <clarkb> corvus: 0025152447 reports multiple "scanned keys" log lines. Is each of those for a different key? and in that case this node would've failed before?
15:36 <clarkb> oh, that's one request with multiple nodes
15:37 <clarkb> node 0038452371 for example is the actual issue that has been resolved
15:37 <corvus> yep, 3 node nodeset
15:43 <corvus> looking at the logs, i haven't found any individual nodes that fit the pattern from yesterday
15:44 <corvus> even the long ones look like they finished scanning quickly (the extra time may be nodepool time, not cloud time)
15:44 <clarkb> oh interesting, so maybe whatever the issue is was corrected at the cloud level, or we haven't driven enough demand to trigger it
15:45 <corvus> yeah maybe.  example: https://paste.openstack.org/show/bAz7WEtIeXSCbLKoBFHi/
15:46 <corvus> note the "complete" line second to last.  that says the time between that and the next log line is nodepool overhead
15:46 <corvus> but we did not see "complete" lines on the problem nodes yesterday
15:46 <clarkb> what is the delay between complete and scanned in the last two log lines?
15:47 <clarkb> you're right though that that is a different state transition, but I wonder if maybe the time is lost due to a similar mechanism in both cases
15:47 <corvus> (i think that time is a hand-off from the nodescan state machine to the node create state machine, and the latter runs less frequently; but i'm not 100% sure)
15:48 <corvus> the nodescan state machines are event-driven and should be very fast
15:49 <corvus> (the launch state machines are polled)
15:51 <clarkb> as long as the state machine doesn't see its own delay as triggering the timeout, that should be fine
15:57 <corvus> yeah, i'm pretty sure from the logs yesterday the timeout was prior to that point, because we didn't see the "complete" line
16:01 <fungi> clarkb: good call, the object_count does drop, but only by the tiniest amount. i don't think this is going to be viable, deletion will take years of continuous running to complete at this rate
16:02 <clarkb> I guess there isn't an "I really mean to delete this container and all of its contents" api call?
16:02 <fungi> well, that's what this is i think
16:02 <clarkb> this is still explicitly deleting every object individually I think
16:03 <fungi> `openstack container delete --recursive` removes the file objects, because the container refuses to delete while it still has contents
16:03 <clarkb> rather than marking the container for cleanup and then letting swift's backend pruning (for expiration etc) clean up as it goes
16:03 <clarkb> right
16:03 <clarkb> I was just hoping swift has a bulk deletion api since it does do bulk deletes in the background already
16:03 <fungi> if it does, i can't find that plumbed through osc
16:04 <clarkb> timburke may know if there is a better option for us here
16:05 <fungi> yeah, rough math says this is deleting approximately 5 files per second
16:05 <fungi> even just a single container has hundreds of thousands, and we have hundreds of such containers to delete
16:07 <clarkb> another approach would be to try and do more delete requests in parallel, but that may require writing our own client delete tool
16:08 <clarkb> I'm guessing osc is doing it serially
16:09 <fungi> oh, if this is round-trip object deletions one by one and not a recursive delete option internal to swift, yeah that would explain the painful slowness
16:10 <mordred> I'm pretty sure it's the round-trip object deletions
16:10 <clarkb> yes, it's doing a listing then doing deletes one by one with 200ms request rtt
16:11 <mordred> I feel like I remember a parallel api in shade/sdk somewhere - but I might just be thinking about parallel uploads of image segments
16:12 <corvus> https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/delete_swift_container.py
16:12 <corvus> i wrote a script that does the same thing :/
16:12 <corvus> but -- you could probably pretty easily throw a threadpoolexecutor onto that and speed it up
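
A minimal sketch of that idea, assuming openstacksdk with a clouds.yaml entry named "rax" and a made-up container name (neither comes from the channel, and this is not the zuul-jobs module); it also assumes the shared sdk connection copes with concurrent requests from threads:

    # Sketch: parallelize per-object deletes with a thread pool instead of the
    # serial one-at-a-time deletes osc does. Cloud and container names are
    # placeholders; error handling and retries are omitted.
    import concurrent.futures

    import openstack

    cloud = openstack.connect(cloud="rax")      # assumed clouds.yaml entry
    container = "zuul_logs_2014_example"        # hypothetical container name

    names = [obj.name for obj in cloud.object_store.objects(container)]

    def delete_one(name):
        # assumes the connection tolerates concurrent use across threads
        cloud.object_store.delete_object(name, container=container,
                                         ignore_missing=True)

    with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
        list(pool.map(delete_one, names))

    # only once the container is empty can the container itself be deleted
    cloud.object_store.delete_container(container)
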
16:15 <fungi> oh neat
16:18 <fungi> so anyway, i suspect https://docs.openstack.org/swift/latest/api/bulk-delete.html is what we would actually want
16:18 <fungi> but what are the chances rackspace's ancient swift deployment supports that?
16:19 <clarkb> fungi: the example there has an example xml response so maybe :)
16:19 <corvus> worth looking into; i don't think i've ever tried
16:19 <fungi> though also, "up to 10,000 objects or containers"
16:20 <fungi> so we're still looking at hundreds of requests per container for hundreds of containers
16:22 <clarkb> fungi: I wonder if bulk container delete requests on a container would work without clearing all the objects first
16:26 <corvus> yeah, that has a "delete a whole container" option, i'd try that first :)
16:27 <corvus> at least, i read that the same way that clarkb is optimistically reading it
16:31 <fungi> where do you see that?
16:31 <fungi> it says "To indicate a container, specify the container name as: CONTAINER_NAME. Make sure that the container is empty. If it contains objects, Object Storage cannot delete the container."
16:33 <clarkb> that's too bad
16:34 <clarkb> so ya, the most efficient process is probably to do a listing, then multiple 10k object deletion requests until empty, then delete the container?
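
If the endpoint does support the bulk middleware, that flow might look roughly like the sketch below; the storage URL, token, and container name are placeholders, listing pagination is glossed over, and the 10,000-objects-per-request limit is taken from the docs linked above:

    # Sketch of Swift bulk-delete per the linked docs: POST newline-separated,
    # URL-encoded "/container/object" paths (up to 10,000 per request) to the
    # account URL with ?bulk-delete, then delete the now-empty container.
    # storage_url, token and container are placeholders; a real tool would also
    # page through the listing with the marker parameter and check each response.
    from urllib.parse import quote

    import requests

    storage_url = "https://storage.example.com/v1/ACCOUNT_ID"  # placeholder
    token = "AUTH_TOKEN"                                       # placeholder
    container = "zuul_logs_2014_example"                       # placeholder
    auth = {"X-Auth-Token": token}

    listing = requests.get(f"{storage_url}/{container}?format=json",
                           headers=auth).json()
    names = [obj["name"] for obj in listing]

    for i in range(0, len(names), 10000):
        body = "\n".join(f"/{container}/{quote(name)}"
                         for name in names[i:i + 10000])
        requests.post(f"{storage_url}?bulk-delete",
                      headers={**auth, "Content-Type": "text/plain",
                               "Accept": "application/json"},
                      data=body.encode("utf-8"))

    # once empty, the container itself can be removed
    requests.delete(f"{storage_url}/{container}", headers=auth)
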
16:38 <corvus> bummer
16:43 <fungi> or ask support to remove the containers we aren't using. they may also have admin access to do it in ways that wouldn't put so much strain on the api
16:49 <fungi> also, for the record, this was prompted by cardoe's https://launchpad.net/bugs/2078229 so he might have ideas about whether support has access to help us clean it up faster
16:51 <clarkb> I believe those keys get generated at job runtime and only exist for the duration of the job...
16:51 <clarkb> I can understand people being cautious about things like this but it seems overkill to worry too much
16:51 <clarkb> side note: tripleo logging everything is something I tried many many times to address
16:52 <clarkb> ya our image builds shouldn't produce ssh host keys and instead glean generates them uniquely on boot
16:52 <clarkb> then we delete the node when the jobs complete
16:53 <fungi> yeah, i mean, it's clearly not a security risk. but it did lead me to discover that we have a bunch of old content in swift containers that we abandoned, probably because we started trying to delete them and then realized it was taking forever and forgot to follow up
16:53 <fungi> so my concern is from the perspective of being a conscientious steward of the resources rackspace is generously donating to our community
16:54 <fungi> and not wasting disk space on logs that haven't been relevant for 5-10 years
17:04 <fungi> for the record, the rackspace dashboard isn't super helpful here either. trying to delete a container with contents is blocked, and shows this tooltip: "This container must be empty before you can delete it. For bulk deletes, we recommend using a free third-party cloud storage browser such as Cyberduck or Cloudberry Explorer."
17:46 <frickler> maybe look at https://rclone.org/commands/rclone_purge/
18:04 <cardoe> fungi: clarkb: I kinda figured it was ephemeral data and didn't really matter. But some internet researchers keep pinging our security people. So I got bugged to "handle it"
18:07 <fungi> cardoe: i don't suppose you happen to know whether support has admin options for cleaning up whole swift containers that's faster than what a regular user can do through the api?
18:09 <cardoe> Yeah they should
18:09 <fungi> thanks, i'll just put a list in a ticket in that case
18:35 <cardoe> Annoyingly I'm working on Ironic stuff for us. (My consumers are internal only) And I used Triple-O as an example in one of my talks. Saying that my work is the undercloud and the things consuming me are the overcloud. And so I'm associated with the name "undercloud" and that's what that path had in it.
18:36 <fungi> heh
20:07 <clarkb> fungi: I posted a comment about removing old content and adding new content to the euler mirror
20:07 <clarkb> also I suspect this is the same person asking about gitea apis... makes me wonder what is going on
20:43 <fungi> that's an odd coincidence, yeah
21:04 <clarkb> fungi: should we proceed with https://review.opendev.org/c/opendev/system-config/+/928656/1 ?
21:10 <clarkb> working on a rax flex update email here: https://etherpad.opendev.org/p/l-in0EBS7220tsrAMMem
21:23 <fungi> yeah, approved since nobody else seemed to be reviewing that stack
21:24 <fungi> itym "rax folx"
21:24 <fungi> ;)
21:36 <clarkb> ok, how's that look? Did I miss any of the minor issues we've hit? maybe other ideas on next steps?
21:40 <clarkb> cardoe: happy to cc you too if you'd like. I can use your email address in gerrit or, if you'd prefer, feel free to PM me a rackspace address
21:48 <clarkb> side note: looking at the last 2 days of grafana graphs it says we're using 10.6 nodes (that number is provided by sumSeries which I think is effectively an integral?) anyway that aligns with the idea that we continue to use ~1/3 of our capacity
21:49 <fungi> clarkb: message lgtm, i added some minor notes if you want to include those
21:49 <clarkb> thanks
21:49 <clarkb> I'll wait a bit to see if cardoe wants to be cc'd and then send it out
21:50 <fungi> oh, may also be worth noting that we typically use 8vcpu flavors elsewhere, but 4vcpu was sufficient in flex?
21:50 <clarkb> sure
21:50 <fungi> not sure where (or if) you'd work that in
21:51 <clarkb> something like that maybe
21:51 <fungi> yep, superb
21:56 <opendevreview> Merged opendev/system-config master: Tag etherpad images with version  https://review.opendev.org/c/opendev/system-config/+/928656
21:58 <clarkb> that image is promoting now
21:58 <clarkb> we should check the tag shows up in docker hub as expected and that etherpad deploys the new version for us
21:59 <clarkb> well, a new build of the old version
21:59 <clarkb> https://hub.docker.com/r/opendevorg/etherpad/tags both v2.1.1 and latest were updated just now
21:59 <fungi> yay!
22:01 <clarkb> and docker ps -a shows the container was restarted on the updated version and reports it is healthy
22:01 <clarkb> I can still reach my email draft pad so ya this looks happy
22:42 <clarkb> I'm going to go ahead and send that email. Can loop in cardoe later if necessary
