opendevreview | Dmitriy Rabotyagov proposed openstack/diskimage-builder master: Fix baseurl for Fedora versions before 36 https://review.opendev.org/c/openstack/diskimage-builder/+/890650 | 12:38 |
*** amoralej is now known as amoralej|lunch | 13:30 | |
*** amoralej|lunch is now known as amoralej | 13:54 | |
fungi | looks like we've got image upload (or build) issues. last successful uploads of ubuntu-jammy anywhere were early utc on wednesday | 14:07 |
fungi | yeah, that was the last time we successfully built it | 14:08 |
fungi | df says /opt is at 100% of 2tb used on nb01, nb02. nb03 is unreachable (no longer exists?), and nb04 (our arm64 builder) is the only one currently able to build new images | 14:10 |
fungi | i'll work on cleaning up nb01 and nb02 now | 14:10 |
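(Aside: a minimal sketch of the kind of disk check described above, run directly on a builder such as nb01/nb02; only the /opt paths named in this log are assumed.)

```shell
# Rough check of the 2TB build volume on a builder (nb01/nb02):
df -h /opt           # how full the volume is overall
sudo du -xsh /opt/*  # which top-level directory is consuming the space
```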
frickler | seems all x86 builds have been failing since that date, all possibly interesting build logs have been rotated away since then, so we'll have to watch new logs after the cleanup | 14:23 |
fungi | probably all hitting enospc | 14:24 |
fungi | cleanup is slow. so far only freed about 2% of /opt on each server | 14:25 |
frickler | yes, I think that may have been taking a couple of hours on earlier occasions | 14:25 |
fungi | earlier occasions were probably also before we bumped the volume to 2tb | 14:26 |
fungi | i'll have to take a look at the graphs, but this may be the first time we've filled them since increasing the available space | 14:26 |
frickler | ah, right, that's very likely the case | 14:27 |
frickler | http://cacti.openstack.org/cacti/graph.php?action=view&local_graph_id=68032&rra_id=all | 14:28 |
frickler | first incident in almost a year | 14:28 |
fungi | so we have a leak of something in there, albeit slow | 14:28 |
frickler | but it also went very fast, 2-3 days from roughly 50% to 100% | 14:29 |
frickler | seems there's also some slow leak happening, but that wouldn't have filled the disk for another couple of years | 14:30 |
fungi | so something likely went sideways around the end of july, i guess we'll have a better idea what that might be once we're building images again | 14:30 |
frickler | that might coincide with the F36 repo archival, but yes, we need new logs to be sure | 14:31 |
fungi | yeah, maybe f36 image builds started failing in a loop leaving trash behind | 14:32 |
noonedeadpunk | actually, f36 builds are failing for sure, though I've got a patch for it: https://review.opendev.org/c/openstack/diskimage-builder/+/890650 | 14:37 |
frickler | hmm, looking at dib-image-list and image-list, the hypothesis of failing image uploads is more likely to me. on some providers images are 9 days old, others only 5-6. so if nodepool had to store more than 2 iterations of each image, that may also cause an increase in disk usage | 14:38 |
fungi | yeah, i noticed the uploads in inmotion-iad3 were a couple of days older than in rax-dfw | 14:41 |
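(A sketch of the comparison frickler describes, using the nodepool CLI commands named in this log; run wherever the nodepool client and its config are available.)

```shell
# Ages of the most recent local builds per image
nodepool dib-image-list

# Ages of the uploads per provider; if some providers lag several days
# behind others, uploads rather than builds are the likely problem
nodepool image-list
```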
frickler | for inmotion I see a lot of delete image failures, but also these for rax-iad: | 14:43 |
frickler | 2023-08-01 04:07:36,069 ERROR nodepool.builder.UploadWorker.6: Failed to upload build 0000011493 of image rockylinux-8 to provider rax-iad | 14:43 |
frickler | so what's happening in iad? where is that, anyway? ;) | 14:44 |
fungi | iad is the airport code for dulles international in virginia (suburb of washington dc). it's notable as the location of the mae east telecommunications switching hub that traditionally handled a majority of trans-atlantic lines | 14:46 |
frickler | nodepool-builder.log.2023-07-30_12:2023-07-30 18:18:19,636 ERROR nodepool.builder.UploadWorker.4: Failed to upload build 0000000039 of image debian-bookworm to provider rax-iad | 14:47 |
frickler | this is where it started | 14:47 |
frickler | hundreds of similar failures in the days after that | 14:48 |
fungi | i wonder if we've reached our glance quota there? | 14:48 |
fungi | or if rax-dfw where the builders reside is having communications issues | 14:48 |
fungi | okay, clearing out /opt/dib_tmp only freed up about 200-300gb on each builder | 15:10 |
fungi | which implies we have a ton of data that's not leftover tempfiles from dib | 15:11 |
fungi | on nb01, /opt/nodepool_dib has 1.6tb of content, 1.7tb on nb02 | 15:13 |
fungi | looks like that's where we store the built images | 15:13 |
fungi | so i guess that's all spoken for (each format for a couple of versions of each distro/version) | 15:15 |
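(To separate expected retention from leftovers, a rough breakdown like the following could be used; the paths are the ones named above.)

```shell
# Leftover dib temp files vs. finished images awaiting upload/cleanup
sudo du -sh /opt/dib_tmp /opt/nodepool_dib

# Finished image files with sizes, oldest first, to spot anything kept
# beyond the usual couple of builds per distro in each format
ls -lhrt /opt/nodepool_dib
```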
frickler | can we create a ticket with rax for them to check the upload issues with iad? | 15:16 |
fungi | are they still happening? | 15:18 |
frickler | I think you stopped the containers now? last failure on nb01 was: 2023-08-07 13:52:29,041 ERROR nodepool.builder.UploadWorker.7: Failed to upload build 0000067622 of image debian-buster to provider rax-iad | 15:19 |
fungi | #status log Cleared out /opt/dib_tmp on nb01 and nb02, restoring their ability to build new amd64 images for the first time since early on 2023-08-02 | 15:19 |
opendevstatus | fungi: finished logging | 15:19 |
fungi | frickler: yeah, i've started them again now that i'm done checking | 15:20 |
fungi | i'll see if there's anything useful in the traceback from that one | 15:20 |
frickler | we could try to manually delete all or some images on rax-iad. if those are failing anyway like noonedeadpunk said, that shouldn't matter much and would avoid future failures, even if it would make that region temporarily unusable | 15:21 |
fungi | openstack.exceptions.ResourceTimeout: Timeout waiting for Task:27aebd60-0c80-4461-af96-a6dfb4bda21c to transition to success | 15:21 |
fungi | that's what the sdk is reporting, i think | 15:21 |
frickler | yes, I saw that, is that the glance task or a nodepool task? | 15:22 |
fungi | it's coming from openstacksdk so i take that to mean glance task (the backend "tasks api" glance has for things like image conversions) | 15:24 |
frickler | both "openstack image task list" and "openstack image list" seem to take at least a long time for rax-iad from bridge, not sure what the actual timeout will be, but no response in a couple of minutes | 15:30 |
frickler | in the ci tenant both work fine, so it must have something to do with the jenkins tenant being overloaded somehow | 15:32 |
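(A sketch of that per-tenant comparison; the --os-cloud names below are placeholders for whatever entries exist in bridge's clouds.yaml, not the real ones.)

```shell
# Same region, two tenants: compare how long glance takes to answer.
# Cloud names are illustrative placeholders.
time openstack --os-cloud ci-cloud --os-region-name IAD image list >/dev/null
time openstack --os-cloud jenkins-cloud --os-region-name IAD image list >/dev/null

# The glance tasks API (used for image import/conversion) in the slow tenant
openstack --os-cloud jenkins-cloud --os-region-name IAD image task list
```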
* frickler needs to step away for a bit, will be back later | 15:33 | |
fungi | it's possible that we're bogging it down with old leaked images or something, i'll see if there's anything we should clean up | 16:23 |
ajaiswal | any idea how to change company affiliation | 17:11 |
JayF | ajaiswal: in stackalytics? | 17:13 |
JayF | if that's what you're trying, a PR that looks like this-ish: https://review.opendev.org/q/project:x%2Fstackalytics%20status:open | 17:14 |
JayF | but they don't get merged often at all, literally months and months of delay if not more | 17:14 |
JayF | stackalytics is not an official opendev/openinfra thing | 17:14 |
ajaiswal | Thanks @Jayf | 17:19 |
fungi | ajaiswal: if you want to change your affiliation with the openinfra foundation and project activity tracked in bitergia rather than in stackalytics, you do that in your openinfra profile: click the log in button in the top-right corner of https://openinfra.dev/ and then after you're done logging in click the profile button in the top-right corner, then scroll down to the bottom where it says | 17:27 |
fungi | "affiliations" and add/remove/edit them as desired, clicking the update button when you're done | 17:27 |
fungi | you need to make sure that your preferred email address in your gerrit account is one of the (primary, second or third) addresses in your openinfra profile, since that's how they get associated with one another | 17:28 |
frickler | oh, nice, I didn't know about that either, seems my account didn't have any affiliations by default | 18:23 |
frickler | fungi: image list works for DFW and ORD, but it seems that we have a high number of leaked images in those two regions, too, like about 10 images per distro/version instead of the expected 2, old things like f29 and upward | 18:56 |
frickler | I cannot list image tasks in those two regions, either, so it seems like something is messed up in that regard | 18:56 |
fungi | i hadn't found a chance yet to look at the dashboard, but am checking it now | 19:29 |
fungi | sorting the images in rax-iad by creation date, there are tons going back as far as 2019 | 19:31 |
fungi | i'm going to bulk delete any that aren't from this month, as a start | 19:31 |
fungi | in good news, we've already built and uploaded new centos-8-stream and centos-9-stream images since the builder cleanup earlier | 19:39 |
fungi | unfortunately, `openstack image list --private` takes something like 10 minutes to return an empty response from rax-iad, presumably an internal timeout of some kind there | 19:48 |
fungi | i won't be surprised if my image deletion attempts through their web dashboard meet with a similar fate | 19:48 |
fungi | trying to delete 680 images from prior to july 27, except for the two gentoo images we have that are approximately a year old since it hasn't built successfully more recently than that | 19:55 |
fungi | i picked july 27 as the cut-off because we have at least some images mentioned in nodepool image-list in rax-iad from 10 days ago | 19:56 |
fungi | but nothing listed older than that aside from the two gentoo images | 19:57 |
fungi | given our average image sizes, this is ~4.5tb of images i'm deleting | 19:58 |
frickler | wow, that's quite a bit | 20:02 |
fungi | ugh, the dashboard reports 23 of 680 images deleted, 657 errors | 20:02 |
fungi | the errors are all "Service unavailable. Please check https://status.rackspace.com" | 20:03 |
fungi | i think i may have caused a problem trying to delete too many images at once | 20:04 |
fungi | at least it gives me a "retry failed actions" button, so i'll give that a shot after i see some of these pending deletions switch to deleted | 20:04 |
JayF | I will attest that, at least when I worked there, the dashboard was unaware of and did not consider rate limiting at all | 20:05 |
fungi | huh, actually my `openstack image list` that took forever was for our control plane tenant, not our nodepool tenant, so i'm trying that again. maybe it will return actual results | 20:21 |
fungi | indeed, it returns... 1210 entries | 20:21 |
fungi | so worth noting, ~50% of the images there are from the past 10 days. i have a feeling the glance task timeouts are causing us to leak images because the upload eventually completes but nodepool doesn't know it needs to delete anything there | 20:22 |
fungi | i'll do a couple of `openstack image list --private` listings about an hour apart, then delete any uuids which are present in both lists but not present in `nodepool image-list` | 20:25 |
fungi | i'll try to delete slowly | 20:25 |
fungi | that should prevent us from deleting any images nodepool actually knows about | 20:26 |
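(A minimal sketch of that approach, assuming list1.uuids and list2.uuids hold just the ID column from the two `openstack image list --private` runs; the filenames and the delete pacing are illustrative.)

```shell
# uuids present in both provider-side listings taken ~1 hour apart
comm -12 <(sort list1.uuids) <(sort list2.uuids) > stable.uuids

# uuids nodepool still tracks for this provider (pull out anything
# uuid-shaped from the rax-iad rows rather than guessing a column)
nodepool image-list | grep rax-iad \
  | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | sort -u > known.uuids

# leaked = present in both listings but unknown to nodepool;
# delete slowly to avoid hammering the API
comm -23 stable.uuids known.uuids | while read -r uuid; do
  openstack image delete "$uuid"
  sleep 30
done
```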
fungi | looking into the *-goaccess-report job failures, i think the problem arose when we replaced static01 with the newer static02 in april. the job is failing on the changed host key | 21:21 |
fungi | https://opendev.org/opendev/system-config/src/branch/master/playbooks/periodic/goaccess.yaml#L16 | 21:22 |
fungi | yeah, ip addresses there are wrong too | 21:23 |
fungi | i'll push up a change | 21:24 |
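(For reference, the usual way to regenerate that kind of pinned entry; static02.opendev.org is assumed here based on the replacement server named above, and the exact hostname to pin is whatever the playbook uses.)

```shell
# Current host key for the replacement server, for the known_hosts
# entry in playbooks/periodic/goaccess.yaml
ssh-keyscan -t ed25519 static02.opendev.org

# Current addresses, to replace the stale IP entries
host static02.opendev.org
```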
opendevreview | Jeremy Stanley proposed opendev/system-config master: Correct static known_hosts entry for goaccess jobs https://review.opendev.org/c/opendev/system-config/+/890698 | 21:29 |
fungi | infra-root: ^ | 21:29 |
ianw | fungi: https://review.opendev.org/c/opendev/system-config/+/562510/1/tools/rax-cleanup-image-uploads.py might help, but iirc that was also working around a shade bug with leaked object bits that is no longer an issue | 22:10 |
fungi | noted, thanks! | 22:21 |
opendevreview | Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 22:42 |
fungi | johnsom: ^ it probably also is in merge conflict with the current branch state | 22:45 |
fungi | at least i think that's the reason for the latest error comment | 22:45 |
johnsom | Yep, looks like it | 22:45 |
fungi | and we need gtema to +1 that and the cyborg one | 22:46 |
opendevreview | Michael Johnson proposed openstack/project-config master: Allow designate-core as osc/sdk service-core https://review.opendev.org/c/openstack/project-config/+/890365 | 22:48 |
opendevreview | Merged openstack/project-config master: Fix app-intel-ethernet-operator reviewers group https://review.opendev.org/c/openstack/project-config/+/890569 | 23:46 |