*** ralonsoh_ is now known as ralonsoh | 05:48 | |
fungi | jamespage: corvus is the gertty author (redirecting your question from #openstack-infra) | 12:36 |
jamespage | fungi: ta | 12:36 |
fungi | #status log Released gerritlib 0.11.0 | 13:05 |
fungi | NeilHanlon: ^ | 13:05 |
opendevstatus | fungi: finished logging | 13:05 |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role https://review.opendev.org/c/zuul/zuul-jobs/+/925754 | 13:15 |
opendevreview | Dong Zhang proposed zuul/zuul-jobs master: Print instance type in emit-job-header role https://review.opendev.org/c/zuul/zuul-jobs/+/925754 | 13:16 |
NeilHanlon | fungi++ | 13:22 |
fungi | infra-root: i'm getting started deleting the ancient swift containers we used for zuul build logs circa 2014 up until we switched to the 1024-way sharding 5 years ago, so basically only touching containers we haven't used for more than 5 years now | 14:04 |
fungi | or i was going to, not sure what leads `openstack container delete ...` to respond with "Conflict (HTTP 409)" | 14:08 |
fungi | definitely don't want to clicky-delete 259 containers through the rackspace webui | 14:09 |
leonardo-serrano | Hi, setuptools has received an update recently and it requires some deps to be installed manually to work. I was wondering if the topic has been discussed already | 14:15 |
fungi | leonardo-serrano: can you provide more details? how recently? what deps? what is "manually" in this context? | 14:17 |
fungi | and most importantly, how does it relate to the opendev collaboratory? | 14:18 |
frickler | fungi: add "--recursive"? | 14:19 |
fungi | frickler: oh, that seems to be working. thanks! | 14:19 |
leonardo-serrano | We're hitting this error: https://github.com/pypa/setuptools/issues/4478#issuecomment-2235160778 This comment has some more details | 14:20 |
fungi | leonardo-serrano: who is "we"? | 14:20 |
leonardo-serrano | fungi: I am currently responsible for maintaining zuul on the starlingx project | 14:21 |
fungi | leonardo-serrano: thanks for the clarification, so starlingx zuul jobs are failing with setuptools 71 and later due to a change in how it treats vendored dependencies? that's the amount of detail i was trying to get | 14:22 |
leonardo-serrano | fungi: About the relation to opendev collab, I found a task for installing setuptools. I was wondering if there is a WIP change in there or if I should fix it in the starlingx project directly | 14:22 |
fungi | leonardo-serrano: well, let's start with an example failure. can you link to a build result page for such a failure? | 14:23 |
leonardo-serrano | Perhaps it affects other projects? This is the task I mentioned: https://opendev.org/zuul/zuul-jobs/blame/commit/839de7f8996838162ae0de6a9f6ba28f968381bc/roles/ensure-pip/tasks/workarounds.yaml | 14:23 |
leonardo-serrano | fungi: Sure. Here's an example: https://zuul.opendev.org/t/openstack/build/6bd6d91c041b460fb6005d68c8f109cf/console | 14:24 |
fungi | thanks, looking | 14:24 |
frickler | leonardo-serrano: almost completely unrelated, but did you see https://zuul.opendev.org/t/openstack/config-errors?project=starlingx%2Fzuul-jobs&skip=0 ? in particular cleaning up the nodeset usage would be nice | 14:24 |
leonardo-serrano | frickler: I wasn't aware. Thanks for pointing it out. So the action is to move jobs away from "centos-7" and "opensuse-15" nodes, correct? Is there more? | 14:26 |
leonardo-serrano | I'm new to zuul, so I appreciate the advice | 14:27 |
frickler | leonardo-serrano: yes, those nodesets are no longer available and the jobs that try to use them will be undefined | 14:30 |
fungi | leonardo-serrano: looking at related issue https://github.com/pypa/setuptools/issues/4483 i think we need to identify what's causing an older version of the "packaging" library to get installed in your jobs. i'll see if i can spot it | 14:30 |
fungi | though the job did appear to install latest packaging in the venv, i think this is impacting isolated sdist builds which don't occur within the scope of the venv | 14:41 |
leonardo-serrano | fungi: I just found it. In the tox file there is a pip install command pulling in packaging===20.9 | 14:43 |
leonardo-serrano | https://opendev.org/starlingx/config/src/branch/master/controllerconfig/controllerconfig/tox.ini#L26 | 14:44 |
fungi | oh wow, that doesn't show up in the job log at all | 14:44 |
fungi | good find! | 14:45 |
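A tiny illustration of the clash being diagnosed here (the versions below are illustrative, not taken from the job log): setuptools 71 started preferring an already-installed `packaging` over its vendored copy, so a stale pin like `packaging===20.9` in tox.ini breaks its sdist builds.

```python
# sketch: why a packaging===20.9 pin collides with setuptools>=71
from packaging.specifiers import SpecifierSet
from packaging.version import Version

pinned = Version("20.9")        # what the tox.ini pin installs
needed = SpecifierSet(">=24")   # roughly what modern setuptools expects (assumption)

print(pinned in needed)         # False -> the sdist build failure seen above
```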
opendevreview | Julia Kreger proposed openstack/diskimage-builder master: docs: add two contextual warnings to the replace-partition element https://review.opendev.org/c/openstack/diskimage-builder/+/928819 | 14:45 |
leonardo-serrano | fungi: Thank you very much for the assistance! | 14:46 |
fungi | leonardo-serrano: my pleasure. let me know if cleaning that up doesn't seem to fix it and maybe we can find out what else has gone wrong there | 14:46 |
clarkb | fungi: leonardo-serrano I think you don't get the verbose tox run because it's in the setup run without tests step? Maybe that role should be updated to run tox verbosely there too? | 15:01 |
jamespage | corvus: hey - my query from #openstack-infra was gertty-related - the last release was some time ago and it's caught in the sqlalchemy 2.0 transition in Ubuntu (and Debian). I've used a snapshot for the time being to unblock - would a new release be possible? | 15:01 |
clarkb | leonardo-serrano: fungi: this issue is similar to the one that starlingx ran into with old pyzmq having a dependency that doesn't work with newer setuptools/pip/pypa things. It is probably worth mentioning that if you are going to use constraints you either need to freeze absolutely everything or you need to regularly update constraints to keep up with the world moving around you | 15:10 |
corvus | jamespage: i think so, i'll put that on my list; thanks for the reminder :) | 15:21 |
jamespage | corvus: great thanks! | 15:21 |
clarkb | fungi: corvus the number of rax flex errors is much lower this morning than yesterday. Could still be load related, but maybe the network issues resolve if given long enough rather than timing out within 60 seconds? | 15:22 |
clarkb | you can see the difference if you set the graph timeline to 2 days | 15:22 |
corvus | might be interesting to look at the keyscan logs and see if the behavior is still there (but finishing after 60+ seconds) | 15:23 |
clarkb | ++ | 15:24 |
fungi | so... `openstack container delete --recursive <container_name>` runs for about 8-9 minutes but eventually comes back with "Not Found (HTTP 404)" | 15:30 |
fungi | not sure if it's hitting some sort of internal timeout or what. maybe --verbose will help pinpoint the issue | 15:31 |
clarkb | fungi: any idea if the object count fell? I wonder if there is some race in the recursive cleanup and we eventually try to delete something that is gone | 15:31 |
clarkb | or maybe the container itself is bugged and has lost track of an object or two | 15:31 |
corvus | what were the old and new boot timeout values? | 15:31 |
clarkb | corvus: old was 60 seconds (default) new is 120 seconds | 15:31 |
fungi | good idea, i should compare the object count before/after | 15:32 |
corvus | https://paste.opendev.org/show/bVMwgRnV7KY2v301fPUz/ | 15:35 |
corvus | it looks like it takes ~30s to boot (sample size of 1 -- consider that a highly suspect assumption) | 15:35 |
corvus | so any of those that are >~30s would have timed out before | 15:36 |
corvus | it's batchy | 15:36 |
clarkb | corvus: 0025152447 reports multiple scanned keys log lines. Is each of those for a different key? and in that case this node would've failed before? | 15:36 |
clarkb | oh thats one request with multiple nodes | 15:36 |
clarkb | node 0038452371 for example is the actual issue that has been resolved | 15:37 |
corvus | yep 3 node nodeset | 15:37 |
corvus | looking at the logs, i haven't found any individual nodes that fit the pattern from yesterday | 15:43 |
corvus | even the long ones look like they finished scanning quickly (the extra time may be nodepool time, not cloud time) | 15:44 |
clarkb | oh interesting so maybe whatever the issue is was corrected at the cloud level or we haven't driven enough demand to trigger it | 15:44 |
corvus | yeah maybe. example: https://paste.openstack.org/show/bAz7WEtIeXSCbLKoBFHi/ | 15:45 |
corvus | note the "complete" line second to last. that says the time between that and the next log line is nodepool overhead | 15:46 |
corvus | but we did not see "complete" lines on the problem nodes yesterday | 15:46 |
clarkb | what is the delay between complete and scanned in the last two log lines? | 15:46 |
clarkb | you're right though that that is a different state transition but I wonder if maybe the time is lost due to a similar mechanism in both cases | 15:47 |
corvus | (i think that time is a hand-off from the nodescan state machine to the node create state machine, and the latter runs less frequently; but i'm not 100% sure) | 15:47 |
corvus | the nodescan state machines are event-driven and should be very fast | 15:48 |
corvus | (the launch state machines are polled) | 15:49 |
clarkb | as long as the state machine doesn't see its own delay as triggering the timeout that should be fine | 15:51 |
corvus | yeah, i'm pretty sure from the logs yesterday the timeout was prior to that point because we didn't see the "complete" line | 15:57 |
fungi | clarkb: good call, the object_count does drop, but only by the tiniest amount. i don't think this is going to be viable, deletion will take years of continuous running to complete at this rate | 16:01 |
clarkb | I guess there isn't a "I really mean to delete this container and all of its contents" api call? | 16:02 |
fungi | well, that's what this is i think | 16:02 |
clarkb | this is still explicitly deleting every object individually I think | 16:02 |
fungi | `openstack container delete --recursive` removes the file objects, because the container refuses to delete while it still has contents | 16:03 |
clarkb | rather than marking the container for cleanup then letting swifts backend pruning (for expiration etc) clean up as it goes | 16:03 |
clarkb | right | 16:03 |
clarkb | I was just hoping swift has a bulk deletion api since it does do bulk deletes in the background already | 16:03 |
fungi | if it does, i can't find that plumbed through osc | 16:03 |
clarkb | timburke may know if there is a better option for us here | 16:04 |
fungi | yeah, rough math says this is deleting approximately 5 files per second | 16:05 |
fungi | even just a single container has hundreds of thousands, and we have hundreds of such containers to delete | 16:05 |
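A back-of-the-envelope check of that estimate (the per-container object count is an assumed round number; only the 259-container figure and the ~5 deletions/second rate come from the log):

```python
containers = 259          # old log containers counted earlier in the log
objects_each = 500_000    # "hundreds of thousands" per container -- assumption
rate = 5                  # observed deletions per second

seconds = containers * objects_each / rate
print(seconds / 86400, "days")   # ~300 days of continuous deleting with these numbers
```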
clarkb | another approach would be to try and do more delete requests in parallel, but that may require writing our own client delete tool | 16:07 |
clarkb | I'm guessing osc is doing it serially | 16:08 |
fungi | oh, if this is round-trip object deletions one by one and not a recursive delete option internal within swift, yeah that would explain the painful slowness | 16:09 |
mordred | I'm pretty sure it's the round-trip object deletions | 16:10 |
clarkb | yes it's doing a listing then doing deletes one by one with 200ms request rtt | 16:10 |
mordred | I feel like I remember a parallel api in shade/sdk somewhere- but I might just be thinking about parallel uploads of image segments | 16:11 |
corvus | https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/upload-logs-base/library/delete_swift_container.py | 16:12 |
corvus | i wrote a script that does the same thing :/ | 16:12 |
corvus | but -- you could probably pretty easily throw a threadpoolexecutor onto that and speed it up | 16:12 |
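A sketch of that suggestion: wrap the per-object deletes that the linked zuul-jobs module does serially in a thread pool. The openstacksdk cloud name, container name, and worker count below are placeholders, not tested values.

```python
# parallelize swift object deletions with a thread pool (minimal sketch)
import concurrent.futures

import openstack

conn = openstack.connect(cloud="rax")    # placeholder cloud name
container = "old_zuul_logs"              # placeholder container name

def delete(obj):
    conn.object_store.delete_object(obj, container=container, ignore_missing=True)

objects = list(conn.object_store.objects(container))
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(delete, objects))      # drain the pool, surfacing any exceptions

# the container can only be deleted once it is empty
conn.object_store.delete_container(container, ignore_missing=True)
```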
fungi | oh neat | 16:15 |
fungi | so anyway, i suspect https://docs.openstack.org/swift/latest/api/bulk-delete.html is what we would actually want | 16:18 |
fungi | but the chances rackspace's ancient swift deployment supports that? | 16:18 |
clarkb | fungi: the example there has an example xml response so maybe :) | 16:19 |
corvus | worth looking into; i don't think i've ever tried | 16:19 |
fungi | though also, "up to 10,000 objects or containers" | 16:19 |
fungi | so we're still looking at hundreds of requests per container for hundreds of containers | 16:20 |
clarkb | fungi: I wonder if a bulk delete request on a container would work without clearing all the objects first | 16:22 |
corvus | yeah that has a "delete a whole container" option, i'd try that first :) | 16:26 |
corvus | at least, i read that the same way that clarkb is optimistically reading it | 16:27 |
fungi | where do you see that? | 16:31 |
fungi | it says "To indicate a container, specify the container name as: CONTAINER_NAME. Make sure that the container is empty. If it contains objects, Object Storage cannot delete the container." | 16:31 |
clarkb | that's too bad | 16:33 |
clarkb | so ya the most efficient process is probably do a listing then multiple 10k object deletion requests until empty then delete the container? | 16:34 |
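A hedged sketch of that process using swift's bulk-delete middleware (a POST to the account URL with `?bulk-delete` and a newline-separated, URL-encoded list of `/container/object` paths, at most 10,000 per request). Whether rackspace's deployment enables the middleware is exactly the open question above, and the storage URL, token, and listing pagination below are simplified placeholders.

```python
# batch-delete objects via swift's bulk-delete middleware, then drop the container
from urllib.parse import quote

import requests

storage_url = "https://storage.example.com/v1/ACCOUNT"  # placeholder account URL
token = "TOKEN"                                         # placeholder auth token
container = "old_zuul_logs"                             # placeholder container

headers = {"X-Auth-Token": token}

def bulk_delete(paths):
    # the middleware expects a text/plain body of URL-encoded object paths
    return requests.post(
        storage_url + "?bulk-delete",
        headers={**headers, "Content-Type": "text/plain", "Accept": "application/json"},
        data="\n".join(paths),
    )

# plain-text container listing (a real implementation needs marker-based paging,
# since a single GET returns at most 10,000 names)
names = requests.get(f"{storage_url}/{container}", headers=headers).text.splitlines()

for i in range(0, len(names), 10000):
    bulk_delete([f"/{container}/{quote(name)}" for name in names[i:i + 10000]])

# once empty, the container itself can finally be deleted
requests.delete(f"{storage_url}/{container}", headers=headers)
```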
corvus | bummer | 16:38 |
fungi | or ask support to remove the containers we aren't using. they may also have admin access to do it in ways that wouldn't put so much strain on the api | 16:43 |
fungi | also, for the record, this was prompted by cardoe's https://launchpad.net/bugs/2078229 so he might have ideas about whether support has access to help us clean it up faster | 16:49 |
clarkb | I believe those keys get generated at job runtime and only exist for the duration of the job... | 16:51 |
clarkb | I can understand people being cautious about things like this but it seems overkill to worry too much | 16:51 |
clarkb | side note: tripleo logging everything is something I tried many many times to address | 16:51 |
clarkb | ya our image builds shouldn't produce ssh host keys and instead glean generates them uniquely on boot | 16:52 |
clarkb | then we delete the node when the jobs complete | 16:52 |
fungi | yeah, i mean, it's clearly not a security risk. but it did lead me to discover that we have a bunch of old content in swift containers that we abandoned, probably because we started trying to delete them and then realized it was taking forever and forgot to follow up | 16:53 |
fungi | so my concern is from the perspective of being a conscientious steward of the resources rackspace is generously donating to our community | 16:53 |
fungi | and not wasting disk space on logs that haven't been relevant for 5-10 years | 16:54 |
fungi | for the record, the rackspace dashboard isn't super helpful here either. trying to delete a container with contents is blocked, and shows this tooltip: "This container must be empty before you can delete it. For bulk deletes, we recommend using a free third-party cloud storage browser such as Cyberduck or Cloudberry Explorer." | 17:04 |
frickler | maybe look at https://rclone.org/commands/rclone_purge/ | 17:46 |
cardoe | fungi: clarkb: I kinda figured it was ephemeral data and didn't really matter. But some internet researchers keep pinging our security people. So I got bugged to "handle it" | 18:04 |
fungi | cardoe: i don't suppose you happen to know whether support has admin options for cleaning up whole swift containers that's faster than what a regular user can do through the api? | 18:07 |
cardoe | Yeah they should | 18:09 |
fungi | thanks, i'll just put a list in a ticket in that case | 18:09 |
cardoe | Annoyingly I'm working on Ironic stuff for us. (My consumers are internal only) And I used Triple-O as an example in one of my talks. Saying that my work is the undercloud and the things consuming me are the overcloud. And so I'm associated with the name "undercloud" and that's what that path had in it. | 18:35 |
fungi | heh | 18:36 |
clarkb | fungi: I posted a comment about removing old content and adding new content to the euler mirror | 20:07 |
clarkb | also I suspect this is the same person asking about gitea apis... makes me wonder what is going on | 20:07 |
fungi | that's an odd coincidence, yeah | 20:43 |
clarkb | fungi: should we proceed with https://review.opendev.org/c/opendev/system-config/+/928656/1 ? | 21:04 |
clarkb | working on a rax flex update email here: https://etherpad.opendev.org/p/l-in0EBS7220tsrAMMem | 21:10 |
fungi | yeah, approved since nobody else seemed to be reviewing that stack | 21:23 |
fungi | itym "rax folx" | 21:24 |
fungi | ;) | 21:24 |
clarkb | ok how's that look. Did I miss any of the minor issues we've hit? maybe other ideas on next steps? | 21:36 |
clarkb | cardoe: happy to cc you too if you'd like. I can use your email address in gerrit or if you'd prefer feel free to PM me a rackspace address | 21:40 |
clarkb | side note: looking at the last 2 days of grafana graphs it says we're using 10.6 nodes (that number is provided by sumSeries which I think is effectively an integral?) anyway that aligns with the idea that we continue to use ~1/3 of our capacity | 21:48 |
fungi | clarkb: message lgtm, i added some minor notes if you want to include those | 21:49 |
clarkb | thanks | 21:49 |
clarkb | I'll wait a bit to see if cardoe wants to be cc'd and then send it out | 21:49 |
fungi | oh, may also be worth noting that we typically use 8vcpu flavors elsewhere, but 4vcpu was sufficient in flex? | 21:50 |
clarkb | sure | 21:50 |
fungi | not sure where (or if) you'd work that in | 21:50 |
clarkb | something like that maybe | 21:51 |
fungi | yep, superb | 21:51 |
opendevreview | Merged opendev/system-config master: Tag etherpad images with version https://review.opendev.org/c/opendev/system-config/+/928656 | 21:56 |
clarkb | that image is promoting now | 21:58 |
clarkb | we should check the tag shows up in docker hub as expected and that etherpad deploys the new version for us | 21:58 |
clarkb | well new build of the old version | 21:59 |
clarkb | https://hub.docker.com/r/opendevorg/etherpad/tags both v2.1.1 and latest were updated just now | 21:59 |
fungi | yay! | 21:59 |
clarkb | and docker ps -a shows the container was restarted on the updated version and reports it is healthy | 22:01 |
clarkb | I can still reach my email draft pad so ya this look happy | 22:01 |
clarkb | I'm going to go ahead and send that email. Can loop in cardoe later if necessary | 22:42 |