tonyb | I'm working on a small script to use as the basis of verifying that an RPM based distro mirror is "correct", any objections to me installing python3.20-venv on one of the existing region-cloud mirrors ? | 00:05 |
---|---|---|
fungi | no objection here, except that i want details on the time machine you used to get your hands on a packaged python 3.20 module | 00:37 |
fungi | if you meant 3.10, also no objection and you can keep the time machine plans safely hidden for a while longer | 00:38 |
tonyb | It's Australia .... we live in the future ... I've said too much | 00:38 |
fungi | okay, that makes sense | 00:38 |
fungi | you've harnessed the relativistic time dilation inherent to time zone maths | 00:39 |
tonyb | :) | 00:40 |
*** benj_7 is now known as benj_ | 01:53 | |
*** ykarel_ is now known as ykarel | 12:07 | |
frickler | just another data point for possible gerrit slowness, creating https://review.opendev.org/c/openstack/python-openstackclient/+/924927 via the UI took about 15s | 12:22 |
fungi | interesting. i wonder if it's slow from some parts of the internet and fast for others because of packet loss/latency on certain routes impacting all the rest api round trips between the js webclient and the server | 13:42 |
fungi | how's your connectivity to the server more generally? | 13:42 |
fungi | also sort of distracted with storm-related power outages this morning, so still catching up | 13:45 |
fungi | good reminder that i really need a better ups for my workstation | 13:47 |
frickler | most other actions are without noticeable delay, so unless the cherry-pick is doing a huge bunch of API calls in the background, I'd assume the delay to be independent of the network latency | 14:13 |
frickler | now you can create spam in gmail via some "react" button, great | 14:40 |
JayF | FWIW Microsoft started it with that particular piece of fun. There's some email header you can add outgoing to opt out of having your messages reacted to. | 14:42 |
frickler | JayF: do you happen to have a pointer to that? not sure though whether we could or would want to teach mailman to do that, though, but I'm curious anyway | 14:44 |
frickler | delete one of the duplicate "though" ad lib ;) | 14:45 |
JayF | https://neilzone.co.uk/2024/07/attempting-to-stop-microsoft-users-sending-reactions-to-email-from-me-by-adding-a-postfix-header/ | 14:46 |
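The approach in the linked post boils down to prepending a header on outgoing mail. A minimal sketch for Postfix, assuming the `X-MS-Reactions: disallow` header is still what the clients honor and using the From: header as a convenient match trigger (the exact rule in the post may differ):

```shell
# Prepend "X-MS-Reactions: disallow" to all mail Postfix sends out.
cat > /etc/postfix/smtp_header_checks <<'EOF'
/^From:/ PREPEND X-MS-Reactions: disallow
EOF

postconf -e 'smtp_header_checks = regexp:/etc/postfix/smtp_header_checks'
postfix reload
```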
frickler | nice. maybe we should just set "x-spam-allowed: false" and expect people to adhere to that | 14:49 |
JayF | I make sure to set the do not track bit in my browser, so those nasty unethical ad networks and malware know to stay away. | 14:53 |
JayF | lol | 14:53 |
JayF | BTW; I have a stack of "Please don't steal this car" bumper stickers if you need one | 14:53 |
frickler | brilliant idea, could save a lot on insurance that way I'm sure | 14:57 |
*** benj_1 is now known as benj_ | 15:02 | |
fungi | yeah, for me (in mutt) all it displays is a url to the "emoji" they reacted with | 15:08 |
fungi | we can at least add a pattern match to specific mailing lists' receipt rules to automatically moderate or reject such posts | 15:09 |
*** dviroel is now known as dviroel|afk | 15:14 | |
clarkb | frickler: I think its less about api calls and more about doing git things | 15:43 |
clarkb | fwiw on my bike ride this morning I wondered if we're more sensitive to slowness via the web ui than via git review just because of the medium we're interacting through | 15:44 |
clarkb | and maybe this isn't abnormal, it's just more noticeable | 15:44 |
clarkb | but I almost never use the web ui for stuff like this so I don't have a good feeling for a baseline | 15:44 |
clarkb | fwiw gmail greys out the react button on google group threads | 15:46 |
clarkb | so adding that header to our lists would probably work for gmail at least | 15:46 |
frickler | well if not using the UI, the cherry-pick would get done by git locally and the ensuing git-review wouldn't differ from any other ps submission, so I'm not surprised that there should be no difference in that case | 15:58 |
clarkb | right exactly | 16:00 |
clarkb | and you'd use a likely underloaded personally dedicated machine to perform all the git operations and io compared to the server which is often fairly loaded and doing many things at once | 16:00 |
clarkb | but also I think people have reported pushes can be slow. If I had to guess the underlying cause is io related | 16:01 |
frickler | that's surely possible and likely not something we could do much about. also likely to be so bursty that we couldn't detect anything in cacti | 16:02 |
clarkb | ya we might be able to write some ebpf code that could log/trap when we go over certain thresholds or something | 16:03 |
clarkb | but as you say we probably can't do much about that as it's relying on ceph under the hood iirc and we're at the mercy of the network and disks involved for that | 16:04 |
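For the eBPF idea, something like this bpftrace one-liner could log block I/O completions that cross a threshold (a sketch only; the 100ms cutoff and output format are arbitrary, and bpftrace would need to be installed on the server):

```shell
# Record issue time per (dev, sector) and print completions slower than 100ms.
sudo bpftrace -e '
tracepoint:block:block_rq_issue { @start[args->dev, args->sector] = nsecs; }
tracepoint:block:block_rq_complete /@start[args->dev, args->sector]/ {
  $ms = (nsecs - @start[args->dev, args->sector]) / 1000000;
  if ($ms > 100) { printf("dev=%d sector=%d latency_ms=%d\n", args->dev, args->sector, $ms); }
  delete(@start[args->dev, args->sector]);
}'
```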
frickler | and sadly we haven't been able to get good performance related feedback from any of our cloud providers, even the ones that sometimes lurk around here :-( | 16:06 |
frickler | and I don't think moving gerrit to openmetal would sound like a reasonable option. or would it? | 16:07 |
clarkb | I'm not sure that is a good option. In particular we are so resource constrained there that the only way to do upgrades of the underlying cloud is to delete the existing cloud and start over. This works well enough for our CI test nodes but less well for long lived services with persistent data needs | 16:08 |
clarkb | also it too uses ceph (but at least we have more tuning and insight there) | 16:08 |
frickler | I think we could switch to using local storage if we wanted, but yes the upgrade situation certainly is a good argument | 16:10 |
*** iurygregory__ is now known as iurygregory | 16:35 | |
frickler | clarkb: there were some new-looking failures from the gc cron in gitea09 just now | 17:27 |
clarkb | looks like the same thing at gitea12? | 17:29 |
clarkb | ya this server has an uptime of only 2 days | 17:29 |
clarkb | so likely the exact same issue :/ | 17:29 |
clarkb | fungi: ^ fyi since you worked through that previously | 17:30 |
clarkb | any chance you are willing to correct the issue on 09 the same way? | 17:30 |
clarkb | side note: anyone know anyone at ceph/ibm/red hat that might be willing to debug these issues? | 17:30 |
clarkb | because git having data loss/corruption is almost certainly going to be a problem at the data persistence layer and not git itself | 17:31 |
clarkb | gitea14 also has an uptime of 2 days. The others have uptimes of 10 days | 17:32 |
frickler | side note: trying to log into the server, I noticed that IPv6 to vexxhost seems broken from here once again :( | 17:32 |
clarkb | so doesn't seem to be a 100% failure. Probably requires us to be attempting writes while the shutdown occurs and git believing the data has been persisted when it hasn't been | 17:33 |
frickler | not sure if that might also play a role in the apparent gerrit UI slowness | 17:33 |
clarkb | those are different locations but could be | 17:33 |
clarkb | gerrit vs gitea I mean (different locations in the same provider but I think they may have very different routes) | 17:33 |
clarkb | gitea14 is complaining about the same repo | 17:34 |
frickler | clarkb: couldn't the write thing just be normal ext4 behavior in the case of unexpected shutdown? | 17:34 |
frickler | maybe check whether a fsck happened during the boot, I think that should show up in the journal? | 17:34 |
clarkb | frickler: maybe? I thought git was extra careful about persisting things | 17:36 |
frickler | seems gitea09 was actually rebooted twice, it looks like? or might be just short log retention | 17:36 |
clarkb | but ya I guess if the filesystem is happy then git probably can't be any happier | 17:36 |
clarkb | looking at /var/log/dmesg I don't see anything indicating a fsck occurred | 17:38 |
frickler | I only found this in the journal, vda15 is /boot/efi https://paste.opendev.org/show/bjM0rrpKGjPNfuw9wys7/ | 17:38 |
frickler | also "last" says only one reboot | 17:39 |
clarkb | "Applications which want to be sure that their files have been committed to the media can use the fsync() or fdatasync() system calls; indeed, that's exactly what those system calls are for." from https://lwn.net/Articles/322823/ I think the last time I looked at this git was very careful about doing these things | 17:40 |
frickler | but there seems to have been a downtime of >25 mins, at least according to the gap in the log | 17:40 |
frickler | so either there was some longer hypervisor downtime, or it might have stopped being able to write to the disk early and gotten rebooted later | 17:41 |
frickler | the latter might also be able to explain inconsistencies | 17:41 |
clarkb | ya thats a good point. It could be there were disk io issues that led to crash and reboot | 17:42 |
clarkb | rather than the other way around | 17:42 |
frickler | if mnaser would read this, they could maybe check if the affected servers are even on the same host | 17:42 |
frickler | or is there a way to check from within the instance? I think openstack tries to avoid making that public | 17:43 |
clarkb | looking at email it appears that gitea09 and gitea14 both have sad cinder repos and need to be recovered similar to the way fungi did so for gitea12. Both gitea09 and gitea14 have been up for ~2 days | 17:43 |
clarkb | frickler: I think a uuid may be available in metadata somewhere. Let me see if I can find it | 17:43 |
clarkb | it's just a uuid but that is enough to determine if the hosts are the same | 17:43 |
clarkb | oh we don't have config drive here so have to use the web metadata; this will take me a bit longer | 17:44 |
frickler | looking at ping latency, ping from 09 to 14 seems to be faster than to 11 or 12, so that might at least be consistent with that hypothesis | 17:45 |
clarkb | I'm not finding that info in the meta_data.json. vendor data is empty and network data doesn't have it either | 17:46 |
clarkb | so ya maybe this info is not available | 17:46 |
clarkb | `curl http://169.254.169.254/openstack/latest/meta_data.json` fwiw | 17:46 |
frickler | the timestamps from "sudo journalctl --list-boot" also agree to within 2 seconds | 17:48 |
clarkb | re ext4 I think that generally yes losing data (but not journal metadata) is expected in a defaultish setup which we appear to be using. However applications have the ability to force the kernel to flush things out when necessary and my recollection is that git is extremely careful about doing so. This would imply to me that the problem lives under our filesystem in the ceph block | 17:54 |
clarkb | layer, but I don't actually have hard evidence of that just inferences | 17:54 |
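One way to test that assumption about git's fsync behavior directly would be to strace a throwaway repo (a sketch; the repo path is a placeholder):

```shell
# See which fsync-family syscalls git issues while writing a commit.
strace -f -e trace=fsync,fdatasync,sync_file_range \
    git -C /tmp/fsync-probe-repo commit --allow-empty -m 'fsync probe' 2>&1 \
    | grep -E 'fsync|fdatasync|sync_file_range'
```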
clarkb | as far as fixing goes I can do that after lunch. I'm actually going to pop out in a little bit for that otherwise I'd start working on this now but we had previous plans for lunch today I need to keep. I'm also happy for someone else to go ahead and do it or have fungi walk me through it etc. I just can't really dive into that for a couple of hours | 17:55 |
*** iurygregory__ is now known as iurygregory | 17:57 | |
fungi | clarkb: yeah, i can take care of it now, stepped out during a break in the rain to run some quick errands but i'm back for the rest of the day | 18:16 |
clarkb | thanks! | 18:16 |
fungi | frickler: the outage for gitea12 was similarly lengthy. i expect something like a hung hypervisor host and then it had to be rebooted before the instances came back up | 18:17 |
fungi | fwiw, looks like gitea09 and 14 have uptimes of ~2.5 days (booted about a minute apart) | 18:29 |
fungi | gitea10-13 have uptimes of a little over 10 days, again all within 2 minutes of one another | 18:29 |
fungi | so something is definitely causing ungraceful reboots in vexxhost sjc1 | 18:29 |
fungi | looking at gc errors mailed by cron, gitea09 and gitea12 both complained today about empty cinder.git/objects/fa/bf8eb32672de75f86c6644ea69c43e465eb35c | 18:49 |
fungi | gitea09 also complained about it on tuesday (but not 14, and neither of them yesterday) | 18:49 |
fungi | no complaints from any others since sunday when gitea12 was unhappy about the one i fixed | 18:51 |
fungi | i've taken 09 and 14 out of service in the http and https haproxy pools temporarily | 18:53 |
fungi | currently working on transplanting the good cinder bare repo from gitea13 to those servers | 18:58 |
opendevreview | Jan Gutter proposed zuul/zuul-jobs master: Update ensure-kubernetes with podman support https://review.opendev.org/c/zuul/zuul-jobs/+/924970 | 19:07 |
fungi | my transfer rate downloading this 655 megabyte file from gitea13 is averaging around 4 megabits per second. maybe i should have set up some temporary authentication to transfer it directly between the servers instead | 19:22 |
clarkb | fungi: I'm mostly back at this point if there is anything I can do to be useful | 20:20 |
fungi | copies to servers just completed | 20:24 |
fungi | untarring them into place now | 20:26 |
clarkb | fungi: do we shutdown the gitea services and remove the old content first or just go over the top? | 20:27 |
fungi | i've not been shutting down gitea services, just deleting the bare repo out from under them and immediately untarring the donor copy | 20:28 |
clarkb | ack | 20:28 |
fungi | because who knows what differences there may be due to independent git gc between the servers | 20:28 |
clarkb | ya going over the top of the repo seems like it probably won't work in all cases | 20:29 |
fungi | rerunning the command from the gc cron on both servers to see if they complain about anything else now | 20:30 |
fungi | if they come out clean, i'll re-replicate the repo to them from gerrit next | 20:30 |
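For reference, the transplant fungi describes is roughly the following (a sketch with an assumed repository root; the real gitea data path and ownership on these servers may differ):

```shell
# On the donor (gitea13): archive the known-good bare repo.
REPO_ROOT=/var/gitea/data/git/repositories/openstack   # assumed path
tar -C "$REPO_ROOT" -czf /tmp/cinder.git.tar.gz cinder.git

# Copy the tarball to each broken backend, then swap the repo out from
# under gitea and re-check with the same gc the cron job runs.
rm -rf "$REPO_ROOT/cinder.git"
tar -C "$REPO_ROOT" -xzf /tmp/cinder.git.tar.gz
git -C "$REPO_ROOT/cinder.git" gc
# Finally, trigger re-replication of the repo from gerrit.
```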
clarkb | ++ | 20:31 |
clarkb | https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/ | 20:31 |
clarkb | I'm with read the docs on this one. The bots are behaving poorly and impacting the people they rely on. | 20:33 |
fungi | JayF: ^ seems like you had friends in that space, though they've probably already seen the post too | 20:34 |
JayF | There was also a recent post in that channel, with my buddy from anthropic, about ifixit complaining they got 1M hits in a day | 20:35 |
JayF | he doesn't directly control the crawlers or anything, but he feeds in the feedback thru whatever mechanism they have | 20:36 |
fungi | git gc clean now on both servers, replication from gerrit in progress | 20:50 |
fungi | and done | 20:50 |
JayF | fungi: https://www.404media.co/anthropic-ai-scraper-hits-ifixits-website-a-million-times-in-a-day/ is that ifixit article | 20:50 |
fungi | i'll enable them in haproxy again | 20:50 |
fungi | #status log Repaired data corruption for a repository on the gitea09 and gitea14 backends, root cause seems to be from an unexpected hypervisor host outage | 20:52 |
opendevstatus | fungi: finished logging | 20:52 |
fungi | thanks JayF! | 20:52 |
clarkb | ethics was a required course for me to get my degree. Kinda feels like everyone has collectively decided to leave that behind in the race for AI dominance | 20:57 |
clarkb | fungi: thank you for taking care of the git repo issue. | 20:58 |
clarkb | As far as next steps go on that we might be able to boot another node in that region and use nova shutdown apis while running git operations in a loop to try and reproduce | 22:58 |
clarkb | I suspect that nova shutdown apis might be too graceful though | 20:58 |
clarkb | (things will flush before going down vs say a truly hard shutdown from the hypervisor side) | 20:59 |
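A sketch of that reproduction idea (instance name and paths are placeholders; as noted, `openstack server stop` probably requests a graceful guest shutdown, so a harder power-off from the provider side may be needed to really mimic the failure):

```shell
# On the throwaway test instance: keep git busy writing objects.
git init /opt/testrepo
while true; do
    head -c 1M /dev/urandom > /opt/testrepo/blob
    git -C /opt/testrepo add blob
    git -C /opt/testrepo commit -qm "write $(date +%s)"
done

# Meanwhile, from a workstation with credentials for that region:
openstack server stop test-git-crash
openstack server start test-git-crash
# ...then re-run git fsck / git gc on the instance and look for corruption.
```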
tonyb | I've been working on creating noble images (Basically this: https://paste.opendev.org/show/b8Q8E7t1OahyCgxITOYX/) Now when it comes to uploading them .... Do I need to upload the generated images to each region in each cloud or just once per cloud ? | 22:36 |
tonyb | or is it more like upload it once and see if they appear each region? | 22:37 |
clarkb | tonyb: I believe that each region has its own images | 22:40 |
clarkb | so they have to be uploaded separately per image. | 22:40 |
clarkb | tonyb: looking at your paste I think your vhd-util conversions are only half done | 22:42 |
tonyb | clarkb: Thanks. | 22:42 |
tonyb | clarkb: Oh! What'd I miss? | 22:42 |
clarkb | there should be two vhd-util convert commands one from the raw source to an intermediate format then a second to the final format | 22:42 |
clarkb | tonyb: I think (but I'm not sure) that the 0 and 1 in the first command become a 1 and a 2 in the second. I think that nodepool builder logs would confirm this | 22:43 |
clarkb | if you look at logs for one of the image builds specifically not the builder service (they are split up) | 22:43 |
tonyb | Oh that may be a formatting issue: https://paste.opendev.org/raw/bbPlASvfVf1shcRihnqz/ Is what I did based on https://opendev.org/openstack/diskimage-builder/src/branch/master/diskimage_builder/lib/img-functions#L151 | 22:47 |
clarkb | now to figure out what `qemu-img convert -p -S 512` does. I've always just used the defaults I think | 22:47 |
fungi | yes, every region's glance store is separate, at least for all the providers we're using | 22:47 |
tonyb | -p == show a progress meter | 22:47 |
clarkb | tonyb: oh yup I had to scroll to the right sorry. That does look correct now | 22:47 |
tonyb | -S 512 means: consider 512 consecutive 0's a "sparse block" and write the image as such | 22:48 |
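Putting the paste together, the conversion pipeline is roughly the following (filenames are placeholders; the two-step vhd-util convert mirrors the dib img-functions code linked above and needs the patched vhd-util that dib expects):

```shell
# Convert to raw, writing runs of 512 zero bytes as sparse blocks,
# then step the raw image through the intermediate VHD type to the final one.
qemu-img convert -p -S 512 -O raw noble-server-cloudimg-amd64.img noble.raw
vhd-util convert -s 0 -t 1 -i noble.raw -o noble.vhd-intermediate
vhd-util convert -s 1 -t 2 -i noble.vhd-intermediate -o noble.vhd
```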
clarkb | tonyb: ah ok. Does that help keep the size down? I wonder if that is something we could add to dib/nodepool | 22:48 |
clarkb | tonyb: one other thing to keep in mind is that image upload to rax is weird | 22:49 |
clarkb | the images have to go to swift first and then get imported from there to glance using the task api | 22:49 |
tonyb | I'd have to double check, I admit it's something from muscle memory from $way back | 22:49 |
clarkb | openstacksdk's high level methods for image upload should do that stuff for you I think (and previously shade would) but you may need to use a small script rather than openstack cli tooling | 22:50 |
tonyb | clarkb: Can you share your history for when you did it for OpenMetal recently? | 22:50 |
clarkb | tonyb: for OpenMetal I was lazy and just did it through horizon | 22:52 |
clarkb | unfortunately that means there isn't any command history :/ | 22:53 |
tonyb | Okay | 22:53 |
clarkb | tonyb: I think it would be something like `openstack image create --disk-format raw --file ./path-to-local-raw-file.raw --private image-name-here` | 22:55 |
tonyb | Ah okay that's what I have | 22:55 |
clarkb | --private might be too restrictive. I think that means only this tenant can use it. And since the image comes from upstream maybe we don't care if others use it. The upside to making it private though is we can delete it later without errors if people have bfv using it or whatever so I'd use private | 22:56 |
tonyb | I think I went with --shared, so if needed we can share it with the jenkins/zuul project on a given cloud | 22:57 |
tonyb | but I can switch to --private | 22:57 |
clarkb | tonyb: let me pull up the glance docs. its probably fine | 22:57 |
clarkb | ya shared seems fine since its an explicit action to add other tenants | 22:58 |
tonyb | Yup. | 22:59 |
clarkb | and if we only do that for the tenants we care about then we avoid the bfv problem without having control over that | 22:59 |
clarkb | https://docs.openstack.org/api-ref/image/v2/ ^F shared if anyone else is curious | 22:59 |
tonyb | Also https://docs.openstack.org/python-openstackclient/latest/cli/command-objects/image-v2.html#cmdoption-openstack-image-create-shared | 22:59 |
tonyb | https://paste.opendev.org/raw/b8QSrgmqdHkbHY0S6801/ ? | 23:10 |
tonyb | First block should be all the raw images, second block should be the vhd versions | 23:11 |
clarkb | tonyb: as mentioned previously I don't think the rax images using openstack client will work. Also there is already a noble image in openmetal so we don't need to upload there (we can, it doesn't hurt much other than consuming more disk space). The arm servers will need their own arm images too (not amd64) | 23:12 |
clarkb | and i haven't checked if ovh and vexxhost have images or not | 23:12 |
tonyb | Thanks, I was thinking that it would be helpful to have exactly the same version of image in $all clouds to reduce variability | 23:14 |
tonyb | I see what you're saying about RAX now, sorry I missed that | 23:15 |
tonyb | Only OpenMetal has Noble images | 23:15 |
clarkb | fwiw I don't think it hurts to try and upload to rax. But I'm like 98% certain the client never implemented the two phase swift then glance task upload process that rax uses because they are the only cloud that does it and it isn't really standard | 23:15 |
clarkb | and instead we will have to use the sdk directly to get the cloud.upload_image() magic or whatever it is to do that. | 23:15 |
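A sketch of that small-script approach via openstacksdk's cloud layer, which is what should take care of the swift upload plus glance task import on rax (the cloud name, image name, and filename here are all assumptions):

```shell
python3 - <<'EOF'
import openstack

# "rax-dfw" is an assumed clouds.yaml entry; repeat per region as needed.
cloud = openstack.connect(cloud='rax-dfw')
image = cloud.create_image(
    'ubuntu-noble',                             # hypothetical image name
    filename='noble-server-cloudimg-amd64.vhd',
    disk_format='vhd',
    container_format='bare',
    wait=True,
)
print(image.id)
EOF
```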
tonyb | Got it | 23:16 |
JayF | clarkb: might be worth reaching out to James D from rackspace; they have any resources at all pointed at openstack now; making the client actually work for their cloud I'd imagine would be a very good use of them | 23:16 |
tonyb | JayF: In the past they've focused development on the 'rax' | 23:17 |
tonyb | commandline tool rather than openstack | 23:17 |
JayF | Yeah, but they've been making a big deal about being more openstack-y as of late. Never hurts to try and get someone to point in the right direction :) | 23:17 |
tonyb | but maybe in this new world | 23:18 |
JayF | that's what I'm suggesting? hoping? | 23:18 |
clarkb | and yes I think the glance image task api was a really unfortunate moment in history for openstack users. Maybe one day we'll be completely beyond it | 23:18 |
tonyb | yup. I'm hoping too | 23:18 |
clarkb | re talking to rax we might also encourage them to provide noble images (since others may want it) | 23:28 |
tonyb | clarkb: Good point | 23:29 |